VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO
VibeThinker-3B: Redefining the Limits of Verifiable Reasoning in Compact LLMs
This technical report introduces VibeThinker-3B, a dense language model featuring 3 billion parameters. The primary objective of this project was to determine the maximum potential for verifiable reasoning when constrained to a strictly small-model architecture.
The Development Pipeline
The researchers utilized a specialized post-training framework known as the Spectrum-to-Signal paradigm. The model's capabilities were honed through a systematic, three-stage optimization process:
- Curriculum-based Supervised Fine-Tuning (SFT): A structured approach to initial training.
- Multi-domain Reinforcement Learning (RL): Utilizing techniques like GRPO to refine reasoning paths.
- Offline Self-Distillation: Further compressing and refining the model's internal logic.
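To make the RL stage more concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO (Group Relative Policy Optimization): several answers are sampled for the same prompt, each gets a reward, and each reward is normalized against its own group. The function name, the binary verifier rewards, and the use of the population standard deviation are assumptions for illustration; the report's exact variant is not described here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled answer to the same prompt.
    `eps` avoids division by zero when every answer gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math prompt,
# scored 1.0 if a verifier accepted the final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat their group's average get a positive advantage and the rest get a negative one, so no separate value model is needed. When every answer in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.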
Performance Benchmarks
VibeThinker-3B demonstrates "frontier-level" capabilities, often rivaling or surpassing flagship models that are significantly larger (e.g., Gemini 3 Pro, GLM-5, and DeepSeek V3.2).
| Benchmark | Score / Metric | Notes |
|---|---|---|
| AIME26 | Increases to via claim-level test-time scaling | |
| LiveCodeBench v6 | Measured as | |
| LeetCode (Unseen) | Acceptance rate on recent contests | |
| IFEval | Validates strict instruction following | |
Key Achievements:
- Outperforms models orders of magnitude larger.
- Maintains high instruction controllability (via IFEval).
- Exhibits exceptional out-of-distribution generalization.
Theoretical Contribution: The Parametric Compression-Coverage Hypothesis
The findings from VibeThinker-3B (and previous 1.5B iterations) lead the authors to propose a new theoretical framework:
The Parametric Compression-Coverage Hypothesis: This theory posits that verifiable reasoning can be compressed into "compact reasoning cores." In contrast, open-domain knowledge and general competence require "broad parameter coverage" to account for the vast array of facts, concepts, and long-tail edge cases.
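In mathematical terms, the hypothesis is a statement about parameter requirements: the number of parameters needed for verifiable reasoning is small compared with the number needed for broad knowledge coverage.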
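One illustrative way to write the hypothesis, offered as a sketch and not as the report's own formula: let $N_{\text{core}}$ be the parameters needed for a compact reasoning core and $N_{\text{coverage}}$ the parameters needed to cover open-domain facts and long-tail cases. Both symbols are assumptions introduced here.

$$
N_{\text{core}} \ll N_{\text{coverage}}
$$

Read this way, a model sized near $N_{\text{core}}$ could be strong on verifiable tasks while still lacking the broad coverage of much larger models.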
This suggests that small models are not just efficient alternatives but a complementary path toward achieving frontier performance in specific, dense capability regimes.
Summary of Impact
By proving that a 3B model can compete with the industry's largest reasoning systems, VibeThinker-3B shifts the narrative on model scaling. It demonstrates that for tasks where the answer is verifiable (like math and code), architectural efficiency and training quality can override raw parameter count.
Reference: arXiv:2606.16140
Authors: Sen Xu, Shixi Liu, Wei Wang, et al.
Date: June 15, 2026