Models | Jun 23, 2026
Baidu Open-Sources Unlimited-OCR: Constant KV Cache Enables End-to-End Long Document Parsing, Sets New SOTA on OmniDocBench
Baidu has open-sourced Unlimited-OCR, a model with 3B total parameters (500M activated) that scores 93.92% overall on OmniDocBench v1.6, a new state of the art for end-to-end OCR. Its core innovation is Reference Sliding Window Attention (R-SWA), which holds the decoder KV cache at a constant size instead of letting it grow linearly with output length. As a result, a single inference run can transcribe dozens of pages without latency or memory usage increasing as the output grows.
Technical Core: R-SWA and DeepEncoder
- R-SWA Mechanism: Each generated token attends to all reference tokens (visual tokens and prompt tokens) and to the 128 most recent output tokens. Visual tokens do not participate in state updates, which avoids feature degradation. The KV cache therefore has a constant size, equal to the reference segment length plus the sliding window width, and does not grow with output length.
- DeepEncoder: Inherits the encoder from DeepSeek-OCR, compressing a 1024×1024 page image from 4096 patch tokens into 256 visual tokens (16× token compression). Supports two modes: Base (fixed resolution) and Gundam (dynamic resolution).
- Model Architecture: A 3B-parameter MoE with 500M activated parameters, in which every attention layer is replaced with R-SWA. It was trained for 4,000 steps from a DeepSeek-OCR checkpoint on 2 million OCR samples (single-page to multi-page ratio of 9:1). Multi-page samples are randomly generated with 2 to 50 pages and sequence lengths of up to 32K tokens.
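The R-SWA visibility rule and its constant cache size can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation. The function names, the 20-token prompt length, and the choice to exclude the current token from the window are assumptions made for the example. Only the 128-token window and the 256 visual tokens per page come from the figures above.

```python
WINDOW = 128  # sliding-window width reported for R-SWA


def rswa_visible_positions(num_ref, step, window=WINDOW):
    """Key positions visible when generating output token `step` (0-based).

    Positions 0..num_ref-1 hold reference tokens (visual + prompt);
    output token i lives at position num_ref + i.
    Assumption: the window covers the `window` previous output tokens.
    """
    first = max(0, step - window)
    return list(range(num_ref)) + [num_ref + i for i in range(first, step)]


def kv_cache_entries(num_ref, num_generated, window=WINDOW, full_attention=False):
    """Cached key/value positions per layer after `num_generated` output tokens."""
    if full_attention:
        # Ordinary causal attention keeps every generated token.
        return num_ref + num_generated
    # R-SWA keeps the references plus at most `window` recent outputs.
    return num_ref + min(num_generated, window)


# 256 visual tokens for one page plus a hypothetical 20-token prompt.
num_ref = 256 + 20
for n in (100, 1024, 6144):
    full = kv_cache_entries(num_ref, n, full_attention=True)
    rswa = kv_cache_entries(num_ref, n)
    print(f"generated={n:5d}  full-attention cache={full:5d}  R-SWA cache={rswa}")
```

Under these assumptions the R-SWA cache stops at 404 entries once the window fills, while a full-attention cache keeps growing with every generated token.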
Performance: Comprehensive SOTA
- OmniDocBench v1.5: Overall score 93.23%, a 6.22 percentage point improvement over DeepSeek-OCR (87.01%). Text edit distance dropped from 0.073 to 0.038, formula CDM rose from 83.37 to 92.61, table TEDS rose from 84.97 to 90.93, and reading order edit distance dropped from 0.086 to 0.045.
- OmniDocBench v1.6: Overall score 93.92%, end-to-end SOTA.
- Long Document Test: Edit distance is 0.0572 on 20-page documents and 0.1069 on 40+ page documents (Distinct-35 reaches 96.90%). The team attributes errors on 40+ page documents mainly to resolution limits in DeepEncoder's multi-page mode rather than to R-SWA.
- Efficiency: Throughput on OmniDocBench reaches 5580 tokens per second (TPS), versus 4951 for DeepSeek-OCR. When outputting 6144 tokens, it reaches 7847 TPS versus 5822 for DeepSeek-OCR, and the advantage grows as output length increases.
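To put the throughput figures on a common scale, the short calculation below converts each pair of TPS values into a relative gain over DeepSeek-OCR. The helper name `relative_gain` is hypothetical, and the percentages are derived here from the reported numbers rather than taken from the paper.

```python
def relative_gain(new, base):
    """Percentage improvement of `new` over `base`."""
    return (new - base) / base * 100


# (Unlimited-OCR TPS, DeepSeek-OCR TPS) as reported above.
throughput = {
    "OmniDocBench": (5580, 4951),
    "6144-token output": (7847, 5822),
}

for setting, (unlimited, deepseek) in throughput.items():
    print(f"{setting}: {relative_gain(unlimited, deepseek):.1f}% higher TPS")
# OmniDocBench: 12.7% higher TPS
# 6144-token output: 34.8% higher TPS
```

The gain roughly triples between the two settings, which is consistent with the claim that the advantage widens as outputs get longer.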
Open Source and Impact
- Model weights and code have been open-sourced on GitHub and HuggingFace.
- R-SWA is designed as a general decoding scheme applicable to long-output tasks such as ASR and translation. The team plans to validate its transferability as a next step.
- The paper credits its technical director as "YY", whom industry observers speculate to be Wei Haoran, a former core author of DeepSeek-OCR who has since left that team and who previously led GOT-OCR2.0 and the DeepSeek-OCR series.
- Baidu's combination of PaddleOCR's industrial foundation with cutting-edge research is expected to drive OCR evolution from single-page recognition to full-book understanding.