
Unlimited OCR: One-Shot Long-Horizon Parsing

github.com|341 points|84 comments|by ingve|Jun 23, 2026

Unlimited OCR: Advancing One-Shot Long-Horizon Parsing


Welcome to the repository for Unlimited-OCR, a project that extends the capabilities of DeepSeek-OCR. The framework introduces One-shot Long-horizon Parsing: processing an extensive document in a single pass.

Core Objective: enable high-fidelity parsing of long documents without complex multi-step pipelines.

📅 Project Milestones

  • 2026/06/23: 📄 Official paper released on arXiv.
  • 2026/06/23: 🤝 Integration and support provided by the ModelScope community.
  • 2026/06/22: 🚀 Initial launch of Unlimited-OCR.

🛠 Technical Specifications

The following environment is recommended for running the model on NVIDIA GPUs. These versions were tested with Python 3.12.3 and CUDA 12.9.

| Dependency | Version | Purpose |
| --- | --- | --- |
| torch | 2.10.0 | Deep Learning Framework |
| torchvision | 0.25.0 | Image Processing |
| transformers | 4.57.1 | Model Loading/Inference |
| Pillow | 12.1.1 | Image Manipulation |
| matplotlib | 3.10.8 | Visualization |
| einops | 0.8.2 | Tensor Operations |
| addict / easydict | 2.4.0 / 1.13 | Configuration Management |
| pymupdf | 1.27.2.2 | PDF Handling |
| psutil | 7.2.2 | System Monitoring |
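A quick way to compare an existing environment against the versions listed above is sketched below. It is not part of the repository: `PINNED` and `check_environment` are hypothetical names, and the comparison deliberately ignores local build tags (such as a CUDA suffix on a torch wheel), which is an assumption about how strict the pins need to be.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency table above.
PINNED = {
    "torch": "2.10.0",
    "torchvision": "0.25.0",
    "transformers": "4.57.1",
    "Pillow": "12.1.1",
    "matplotlib": "3.10.8",
    "einops": "0.8.2",
    "addict": "2.4.0",
    "easydict": "1.13",
    "pymupdf": "1.27.2.2",
    "psutil": "7.2.2",
}


def check_environment(pinned=PINNED, get_version=version):
    """Return {package: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed
        # Ignore local build tags such as "+cu129" when comparing (assumption).
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `check_environment()` returns an empty dict when every listed package matches; otherwise each entry shows the expected and installed version.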

The Parsing Logic

The process can be viewed mathematically as a mapping $f$ in which an image $I$ and a prompt $P$ are transformed into a structured text sequence $T$:

$$\text{Unlimited-OCR}(I, P) \rightarrow T_{\text{long-horizon}}$$


🚀 Implementation Guide

Method 1: Hugging Face Transformers

For standard inference on NVIDIA hardware, use the following approach.

import os
import torch
from transformers import AutoModel, AutoTokenizer

# Initialize model and tokenizer
model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
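model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,  # assumed precision; use torch.float16 if bfloat16 is unsupported
)
model = model.eval().cuda()  # inference mode, on the GPU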

# --- Configuration Options ---
# 1. 'gundam': base_size=1024, image_size=640, crop_mode=True
# 2. 'base': base_size=1024, image_size=1024, crop_mode=False

# Single Image Inference
model.infer(
    tokenizer, 
    prompt='image document parsing.', 
    image_file='your_image.jpg', 
    output_path='your/output/dir', 
    base_size=1024, 
    image_size=640, 
    crop_mode=True, 
    max_length=32768, 
    no_repeat_ngram_size=35, 
    ngram_window=128, 
    save_results=True
)

# Multi-page/PDF Inference (Uses 'base' config)
model.infer_multi(
    tokenizer, 
    prompt='image Multi page parsing.', 
    image_files=['page1.png', 'page2.png', 'page3.png'], 
    output_path='your/output/dir', 
    image_size=1024, 
    max_length=32768, 
    no_repeat_ngram_size=35, 
    ngram_window=1024, 
    save_results=True
)
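The two configurations named in the comments above ('gundam' and 'base') can be kept in a small lookup table so the three related arguments always travel together. This is a convenience sketch, not part of the model's API: `PRESETS` and `preset_kwargs` are hypothetical names, and only the values come from the comments.

```python
# Values copied from the "Configuration Options" comments above.
PRESETS = {
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
}


def preset_kwargs(name, **overrides):
    """Return a fresh kwargs dict for the named preset, with optional overrides."""
    if name not in PRESETS:
        raise ValueError(f"unknown preset {name!r}; choose from {sorted(PRESETS)}")
    # Build a new dict so callers can never mutate the shared presets.
    return {**PRESETS[name], **overrides}
```

With this helper, the single-image call could be written as `model.infer(tokenizer, ..., **preset_kwargs("gundam"))`. The `infer_multi` call shown above passes only `image_size`, so for multi-page use it is safer to pick out that one key than to unpack the whole preset.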

Handling PDFs

To process PDFs, you must first convert pages into images using PyMuPDF. Manual cropping is no longer required.

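import os
import tempfile

import fitz  # PyMuPDF


def pdf_to_images(pdf_path, dpi=300):
    """Render each PDF page to a PNG file and return the image paths."""
    out_dir = tempfile.mkdtemp(prefix="pdf_pages_")  # assumed output location
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            # Assumed conversion step: rasterize the page at the requested DPI.
            pix = page.get_pixmap(dpi=dpi)
            path = os.path.join(out_dir, f"page_{i + 1}.png")
            pix.save(path)
            paths.append(path)
    return paths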

Method 2: SGLang (Optimized Server)

For high-performance deployment, SGLang is recommended.

📋 Setup Checklist

  • Create virtual environment using uv
  • Install local SGLang wheel
  • Pin kernels==0.11.7
  • Install pymupdf==1.27.2.2

Installation Commands:

uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7 pymupdf==1.27.2.2

Launching the Server:

python -m sglang.launch_server \
  --model baidu/Unlimited-OCR \
  --served-model-name Unlimited-OCR \
  --attention-backend fa3 \
  --page-size 1 \
  --mem-fraction-static 0.8 \
  --context-length 32768 \
  --enable-custom-logit-processor \
  --disable-overlap-schedule \
  --skip-server-warmup \
  --host 0.0.0.0 \
  --port 10000
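Because the launch command skips server warmup, a client may want to wait until the server answers before sending work. The sketch below polls a URL until it responds with HTTP 200. It is an illustration only: `http_ok` and `wait_until_ready` are hypothetical helpers, and the choice of URL to poll (for example a models listing route on port 10000) is an assumption to verify against the SGLang build in use.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_ready(url, attempts=30, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return True on success, False otherwise."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # back off before the next attempt
    return False
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.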

API Interaction

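The server provides an OpenAI-compatible API. Requests pass DeepseekOCRNoRepeatNGramLogitProcessor as a custom logit processor; as its name and parameters (`ngram_size`, `window_size`) indicate, it suppresses repeated n-grams within a sliding window to keep long outputs from looping.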
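To make the idea concrete, here is a generic sketch of windowed no-repeat-n-gram filtering. It is not the SGLang or repository implementation, which is not shown here: `banned_next_tokens` and `apply_ban` are hypothetical names, and the exact window semantics of the real processor may differ.

```python
def banned_next_tokens(token_ids, ngram_size, window_size):
    """Return the tokens that would complete an n-gram already seen in the window."""
    if ngram_size < 2 or len(token_ids) < ngram_size:
        return set()
    prefix = tuple(token_ids[-(ngram_size - 1):])  # last n-1 generated tokens
    start = max(0, len(token_ids) - window_size)   # only inspect recent tokens
    banned = set()
    for i in range(start, len(token_ids) - ngram_size + 1):
        # If an earlier n-gram starts with the current prefix, ban its last token.
        if tuple(token_ids[i:i + ngram_size - 1]) == prefix:
            banned.add(token_ids[i + ngram_size - 1])
    return banned


def apply_ban(logits, banned):
    """Return a copy of `logits` with banned token ids set to -inf."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]
```

A larger window catches repetitions that are further apart, at the cost of also blocking legitimate repeats such as recurring table headers. The client code that follows passes the corresponding `ngram_size` and `window_size` values through `custom_params`.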

import base64, json, requests
from sglang.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

# API Configuration
server_url = "http://127.0.0.1:10000"
session = requests.Session()

def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {"ngram_size": 35, "window_size": ngram_window},
        "stream": True,
    }
    # Request logic follows...
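The `generate` function above calls `build_content`, which is not defined in this excerpt. One plausible shape, assuming the server accepts the usual OpenAI-style multimodal message format with base64 data URLs, is sketched below. The ordering of images before text and the fallback MIME type are guesses, not the repository's code.

```python
import base64
import mimetypes


def build_content(prompt, image_paths):
    """Build an OpenAI-style multimodal `content` list: images first, then text."""
    content = []
    for path in image_paths:
        # Guess the MIME type from the file extension; fall back to PNG (assumption).
        mime = mimetypes.guess_type(path)[0] or "image/png"
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"},
        })
    content.append({"type": "text", "text": prompt})
    return content
```

Embedding images as data URLs keeps each request self-contained, at the cost of roughly a third more bytes on the wire than the raw files.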

🔄 System Workflow

The following diagram illustrates the data flow from a raw PDF to the final parsed text: