VibeThinker-3B: Redefining the Limits of Verifiable Reasoning in Compact LLMs

This technical report introduces VibeThinker-3B, a dense language model featuring 3 billion parameters. The primary objective of this project was to determine the maximum potential for verifiable reasoning when constrained to a strictly small-model architecture.

🛠️ The Development Pipeline

The researchers utilized a specialized post-training framework known as the Spectrum-to-Signal paradigm. The model's capabilities were honed through a systematic, three-stage optimization process:

Curriculum-based Supervised Fine-Tuning (SFT): A structured approach to initial training.
Multi-domain Reinforcement Learning (RL): Utilizing techniques like GRPO to refine reasoning paths.
Offline Self-Distillation: Further compressing and refining the model's internal logic.
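
As a rough illustration of the RL stage, GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt, rather than relying on a learned value function. The sketch below shows only that group-relative advantage step in plain Python. The function name `group_relative_advantages`, the binary verifier rewards, and the choice of population standard deviation are assumptions made for illustration; this summary does not describe the report's actual reward design or GRPO variant.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` (an illustrative choice) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math problem,
# rewarded 1.0 if the final answer verifies and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that verify receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.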

📊 Performance Benchmarks

VibeThinker-3B demonstrates "frontier-level" capabilities, often rivaling or surpassing flagship models that are significantly larger (e.g., Gemini 3 Pro, GLM-5, and DeepSeek V3.2).

Benchmark	Score	Notes
AIME26	94.3	Increases to 97.1 via claim-level test-time scaling
LiveCodeBench v6	80.2	Measured as Pass@1
LeetCode (Unseen)	96.1%	Acceptance rate on recent contests
IFEval	93.4	Validates strict instruction following
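
For readers unfamiliar with the Pass@1 metric, the sketch below computes the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled solutions that pass all tests. The function names and the sample counts in the example are hypothetical; this summary does not state how many samples per problem the authors drew or which exact evaluation harness they used.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems, as a percentage.

    `results` is a list of (n, c) pairs, one per problem.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical run: three problems, four samples each.
results = [(4, 4), (4, 2), (4, 0)]
print(benchmark_pass_at_k(results, k=1))
```

With a single sample per problem, the same computation is simply the percentage of problems solved on the first attempt.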

Key Achievements:

Outperforms models orders of magnitude larger.
Maintains high instruction controllability (as measured by IFEval).
Exhibits exceptional out-of-distribution generalization.

🧠 Theoretical Contribution: The Parametric Compression-Coverage Hypothesis

The findings from VibeThinker-3B (and previous 1.5B iterations) lead the authors to propose a new theoretical framework:

The Parametric Compression-Coverage Hypothesis: This theory posits that verifiable reasoning can be compressed into "compact reasoning cores." In contrast, open-domain knowledge and general competence require "broad parameter coverage" to account for the vast array of facts, concepts, and long-tail edge cases.

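In mathematical terms, the hypothesis says that the parameter count $P$ needed for verifiable reasoning is far smaller than the count needed for open-domain knowledge: $P_{\text{reasoning}} \ll P_{\text{knowledge}}$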

This suggests that small models are not just efficient alternatives to large ones, but a complementary path toward achieving frontier performance in specific, dense capability regimes.

📝 Summary of Impact

By showing that a 3B model can compete with the industry's largest reasoning systems, VibeThinker-3B shifts the narrative on model scaling. It suggests that for tasks where the answer is verifiable (like math and code), architectural efficiency and training quality can outweigh raw parameter count.

Reference: arXiv:2606.16140
Authors: Sen Xu, Shixi Liu, Wei Wang, et al.
Date: June 15, 2026
