Weibo Open-Sources 3B Small Model VibeThinker, Matching Trillion-Parameter Models in Verifiable Reasoning
The Sina Weibo team recently open-sourced VibeThinker-3B, a dense reasoning model with only 3 billion parameters. On verifiable reasoning tasks such as math competitions and programming, it reports scores that rival top models with hundreds of billions or even trillions of parameters.
Core Performance Data
- AIME26: 94.3, rising to 97.1 with the test-time scaling strategy CLR.
- IMO-AnswerBench: 76.4 standalone and 80.6 with CLR, in the same range as DeepSeek V3.2 (78.3, 671B parameters), GLM-5 (82.5, 744B parameters), and Kimi K2.5 (81.8, 1T parameters).
- LiveCodeBench v6: Pass@1 of 80.2.
- Recent, previously unseen LeetCode weekly contests (April 25 to May 31, 2026): 123 of 128 problems solved on the first submission (96.1% pass rate).
- IFEval: 93.4.
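As a rough illustration of how figures like these are typically computed, the sketch below reproduces the LeetCode first-submission rate from the counts above and shows the widely used unbiased pass@k estimator. The article does not say how VibeThinker's Pass@1 was estimated, so the estimator and the helper names `pass_at_k` and `first_submission_rate` are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: attempt budget.
    Assumption: this common estimator is shown for illustration; the
    article does not state which estimator the team used.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def first_submission_rate(solved: int, total: int) -> float:
    """Percentage of problems solved on the first submission."""
    return 100.0 * solved / total


# Counts reported in the article: 123 of 128 problems.
leetcode_rate = first_submission_rate(123, 128)
print(f"{leetcode_rate:.1f}%")  # 96.1%

# With k = 1 the estimator reduces to the plain fraction of passing samples.
print(pass_at_k(n=10, c=8, k=1))
```

For k = 1 the estimator is just c / n averaged over problems, which is why a Pass@1 score can be read directly as an expected first-try success rate.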
Training Method
VibeThinker-3B is built on Qwen2.5-Coder-3B using an upgraded Spectrum-to-Signal post-training pipeline, which includes:
- Curriculum Two-Stage SFT: The first stage covers general abilities such as math, programming, and STEM reasoning; the second stage focuses on high-difficulty, long-span samples and uses diversity-exploration distillation to retain multiple valid solution paths.
- Multi-Domain Reasoning Reinforcement Learning: Reuses the MGPO strategy, training sequentially on math, programming, and STEM tasks within a single 64K context window.
- Offline Self-Distillation: Selects high-quality trajectories from RL checkpoints, using learning-potential scores to prioritize trajectories that are correct but not yet well learned by the model, and distills them into a unified student model.
- Instruction Reinforcement Learning: Improves controllability under user prompts, using rule-based verifiers for format-sensitive instructions and scoring reward models for open-ended ones.
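The trajectory selection in the offline self-distillation step can be sketched roughly as follows. The article does not define the learning-potential score, so this example assumes one plausible form: among correct trajectories, prefer those to which the current student assigns low average log-probability. The `Trajectory` fields, `learning_potential`, and `select_for_distillation` are all hypothetical names, not the team's implementation.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt_id: str
    text: str
    is_correct: bool            # verdict from a verifier (assumed available)
    student_avg_logprob: float  # mean per-token log-prob under the student (assumed)


def learning_potential(t: Trajectory) -> float:
    """Hypothetical score: correct trajectories the student finds unlikely rank highest."""
    if not t.is_correct:
        return float("-inf")  # incorrect trajectories are never worth distilling
    return -t.student_avg_logprob  # higher negative log-likelihood = more left to learn


def select_for_distillation(trajs: list[Trajectory], per_prompt: int = 1) -> list[Trajectory]:
    """Keep the top-scoring correct trajectories for each prompt."""
    by_prompt: dict[str, list[Trajectory]] = {}
    for t in trajs:
        if t.is_correct:
            by_prompt.setdefault(t.prompt_id, []).append(t)
    selected: list[Trajectory] = []
    for group in by_prompt.values():
        group.sort(key=learning_potential, reverse=True)
        selected.extend(group[:per_prompt])
    return selected


# Toy data: two correct solutions and one incorrect one for the same prompt.
pool = [
    Trajectory("p1", "familiar solution", True, -0.2),
    Trajectory("p1", "unfamiliar solution", True, -1.5),
    Trajectory("p1", "wrong solution", False, -3.0),
]
chosen = select_for_distillation(pool)
print([t.text for t in chosen])  # ['unfamiliar solution']
```

The selected trajectories would then serve as supervised targets for the student model; that training step is omitted here because the article gives no details about it.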
Parameter Compression-Coverage Hypothesis
The research team proposes that different capabilities depend on parameter scale in different ways. Verifiable reasoning (e.g., math, programming) is highly compressible: its clear task structure and reliable feedback signals let it be packed densely into relatively few parameters, allowing small models to approach frontier performance. Open-domain knowledge and general conversation, by contrast, rely on large parameter counts to cover facts and concepts. The hypothesis implies that small and large models are complementary rather than substitutes for one another.
Open Source and Limitations
The model has been released on Hugging Face, GitHub, and ModelScope. The team states explicitly that it performs poorly in domains requiring broad general knowledge and that its strengths are concentrated in verifiable reasoning tasks.