GPT-5.5 Hallucinates 3x More Than MIT-Licensed GLM-5.2

Bigger models are not the way

June 18, 2026

A fundamental pivot is under way inside the world's leading AI laboratories: skepticism is growing about the strategy of indefinitely scaling training datasets and parameter counts.

The fragility of this "bigger is better" philosophy was highlighted when the US government restricted Claude Fable 5 just three days after launch. It was the first AI ban in the US made on national security grounds, triggered because a single successful jailbreak posed an unacceptable risk for a model of that magnitude.

The Intelligence Plateau

While the largest models typically dominate the Artificial Analysis Intelligence Index, the gap is closing. Z.ai’s latest offering, GLM-5.2, utilizes $753\text{B}$ parameters (with $\approx 40\text{B}$ active) and performs remarkably close to its massive competitors:

GLM-5.2 is only 4 points behind GPT-5.5.
GLM-5.2 is only 9 points behind Fable 5.

In contrast, proprietary models like Opus 4.8 and GPT-5.5 are conservatively estimated to reside in the $1\text{T}$ to $2\text{T}$ parameter range. When an MIT-licensed, open-weight model can nearly match a closed-weight system that is roughly 1.3x to 2.7x its size ($1\text{T}$ to $2\text{T}$ against $753\text{B}$ total parameters), it suggests that raw intelligence has hit a significant plateau.
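
As a rough check on the size comparison, the ratios can be worked out from the figures quoted above. This is a minimal sketch: the 1T and 2T bounds are this article's estimates for the closed models, not published numbers, and the constant names are illustrative.

```python
# Parameter counts in billions, taken from the figures quoted in this section.
GLM_TOTAL_B = 753       # GLM-5.2 total parameters
GLM_ACTIVE_B = 40       # GLM-5.2 active parameters (approximate)
CLOSED_LOW_B = 1_000    # lower estimate for the closed models (assumption)
CLOSED_HIGH_B = 2_000   # upper estimate for the closed models (assumption)

# How many times larger the closed models would be at each end of the estimate.
low_ratio = CLOSED_LOW_B / GLM_TOTAL_B
high_ratio = CLOSED_HIGH_B / GLM_TOTAL_B

# Share of GLM-5.2's parameters that are active.
active_share = GLM_ACTIVE_B / GLM_TOTAL_B

print(f"Closed models: {low_ratio:.2f}x to {high_ratio:.2f}x GLM-5.2's total size")
print(f"GLM-5.2 active share: {active_share:.1%}")
```

The spread is wide because the closed-model sizes are only estimates; the comparison is only as good as those bounds.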

The Hallucination Trap

Training on massive volumes of factual, non-theoretical data creates a dangerous side effect: the model learns that it must always provide an answer.

The Danger of Scale: When models are too large and over-trained on "answers," they lose the ability to admit ignorance, leading to confident fabrications.

Consider the AA-Omniscience benchmark, which measures how often a model admits it doesn't know the answer versus hallucinating:

Model	Parameters (Total/Active)	Hallucination Rate	AA Intelligence Index
GLM-5.2	753B / 40B	28%	High
Opus 4.8	$\approx 1\text{T}$ to $2\text{T}$	36%	Very High
Fable 5	$\approx 1\text{T}$ to $2\text{T}$	48%	Very High
GPT-5.5	$\approx 1\text{T}$ to $2\text{T}$	86%	Very High
DeepSeek V4 Pro	1.6T / 49B	94%	44

DeepSeek V4 Pro is particularly egregious: it admitted ignorance in only 6% of cases and hallucinated in the other 94%.
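
To make the metric concrete, here is a minimal sketch of how such a rate could be computed. It assumes the rate is the share of "wrong answer or abstain" cases in which the model answered wrongly instead of abstaining, which matches the 6% versus 94% split described above; the benchmark's published definition may differ. The function name and the counts passed to it are illustrative, while the `rates` values are copied from the table.

```python
def hallucination_rate(wrong: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers, not abstentions.

    Assumed definition, based on this article's description of the benchmark.
    """
    total = wrong + abstained
    if total == 0:
        raise ValueError("need at least one wrong or abstained response")
    return wrong / total


# Illustrative counts that reproduce the 94% / 6% split for DeepSeek V4 Pro.
deepseek_rate = hallucination_rate(wrong=94, abstained=6)

# Rates as listed in the table above.
rates = {
    "GLM-5.2": 0.28,
    "Opus 4.8": 0.36,
    "Fable 5": 0.48,
    "GPT-5.5": 0.86,
    "DeepSeek V4 Pro": 0.94,
}

# Ratio behind the headline comparison of GPT-5.5 and GLM-5.2.
gpt_vs_glm = rates["GPT-5.5"] / rates["GLM-5.2"]
print(f"DeepSeek V4 Pro rate: {deepseek_rate:.0%}")
print(f"GPT-5.5 vs GLM-5.2: {gpt_vs_glm:.2f}x")
```

Under this definition, a model that abstains whenever it is unsure scores near 0%, regardless of how many questions it answers correctly.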

Case Study: The Python Paradox

To test this, I gave both models a complex Python prompt built around a flawed architectural request: Design a custom asyncio event loop policy in Python that overrides get_child_watcher().

❌ DeepSeek V4 Pro (The Hallucination)

Reasoning Time: 3m 52s
Tokens Used: 7.7k
Result: Produced a beautifully formatted but technically incorrect solution.

import os
import fcntl
import threading
import struct
import asyncio
import time
from asyncio import AbstractChildWatcher

class StateManager:
    # ... [Confidently incorrect implementation] ...

✅ GLM-5.2 (The Truth)

Reasoning Time: 12s
Tokens Used: 799
Result: Correctly identified the flaw.

GLM-5.2 immediately noted that a literal interpretation of the prompt would be unsound: a non-yielding loop on the event loop thread would deadlock the subprocess machinery.

Non-Technical Analogy: This is equivalent to asking a delivery driver to deliver packages to three different houses simultaneously without ever stopping the truck.
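
The underlying mechanism can be shown without any child watcher code. The sketch below does not reproduce the test prompt or either model's answer; it only illustrates, under that simplification, that a coroutine which never awaits keeps every other task on the same event loop from running. The names `heartbeat`, `spin`, and `demo` are hypothetical.

```python
import asyncio


async def heartbeat(log: list[str]) -> None:
    """Stand-in for background machinery that needs regular turns on the loop."""
    for i in range(3):
        log.append(f"beat {i}")
        await asyncio.sleep(0)  # hand control back to the event loop


async def spin(log: list[str], yield_control: bool) -> None:
    """A loop on the event loop thread that optionally yields each iteration."""
    for i in range(3):
        log.append(f"spin {i}")
        if yield_control:
            await asyncio.sleep(0)


async def _run(yield_control: bool) -> list[str]:
    log: list[str] = []
    task = asyncio.create_task(heartbeat(log))  # scheduled, not yet running
    await spin(log, yield_control)
    await task
    return log


def demo(yield_control: bool) -> list[str]:
    return asyncio.run(_run(yield_control))


if __name__ == "__main__":
    print("non-yielding:", demo(False))
    print("yielding:    ", demo(True))
```

In the non-yielding run, every `spin` entry is logged before the first `beat`, because the heartbeat task never gets a turn until the loop finishes. If the spinning code were also waiting on something the heartbeat produces, it would wait forever, which is the deadlock GLM-5.2 flagged.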

The Modern AI Trilemma

We must stop blindly increasing the reasoning budget, the size of the corpus, or the parameter count. DeepSeek V4 Pro spent nearly four minutes and 7.7k tokens in a reasoning loop only to arrive at a wrong answer, while a model roughly half its size spotted the paradox in 12 seconds using 799 tokens.

The industry is facing a critical trade-off among capability, calibration, and compute cost.

Moving forward, model selection cannot rest on theoretical performance or size alone. A useful heuristic is to maximize the ratio: $\text{Optimal Model} = \frac{\text{Capability} \times \text{Calibration}}{\text{Compute Cost}}$
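
As a rough illustration, the ratio can be applied to the case study's numbers. This sketch rests on several assumptions that are not part of the article: calibration is taken as one minus the hallucination rate, compute cost is taken as reasoning tokens used, and capability is set to the same placeholder value for both models so that only calibration and cost differ. The function name `model_score` is hypothetical.

```python
def model_score(capability: float, hallucination_rate: float, cost: float) -> float:
    """Capability x calibration / compute cost, following the ratio above."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    if not 0.0 <= hallucination_rate <= 1.0:
        raise ValueError("hallucination_rate must be between 0 and 1")
    calibration = 1.0 - hallucination_rate  # assumption: calibration proxy
    return capability * calibration / cost


# Placeholder capability of 1.0 for both models (assumption, not a measurement).
# Hallucination rates and token counts come from the table and case study.
glm_score = model_score(capability=1.0, hallucination_rate=0.28, cost=799)
deepseek_score = model_score(capability=1.0, hallucination_rate=0.94, cost=7_700)

advantage = glm_score / deepseek_score
print(f"GLM-5.2 scores about {advantage:.0f}x higher under these assumptions")
```

The absolute scores mean little on their own; the point is that a model can lose by two orders of magnitude on this ratio through poor calibration and wasted tokens alone, before capability is even considered.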

Industry Goals for the Next Era:

Reduce reliance on parameter scaling.
Improve "I don't know" trigger accuracy.
Optimize reasoning token efficiency.

Footnotes:

Settings: Both models used "high" reasoning effort, temperature 1, via OpenRouter.
System Prompt: "You respond professionally. You are a highly capable coding assistant well-versed in Python."
Precision: GLM-5.2 served by Z.ai (FP8); DeepSeek V4 Pro served by Baidu Qianfan (FP8).

If you enjoyed this analysis, check out Trajectory, my newsletter for weekly AI deep dives.