Unlimited OCR: Advancing One-Shot Long-Horizon Parsing

![arXiv](https://img.shields.io/badge/arXiv-Unlimited OCR Works-b31b1b?logo=arxiv&logoColor=white) ![Twitter Follow](https://img.shields.io/badge/Twitter-Baidu Inc.-white?logo=x&logoColor=white)

Welcome to the repository for Unlimited-OCR, a project that extends DeepSeek-OCR with one-shot long-horizon parsing: an extensive, multi-page document is processed in a single pass.

Core Objective: Enable high-fidelity parsing of long documents without complex multi-step pipelines.

📅 Project Milestones

2026/06/23: 📄 Official paper released on arXiv.
2026/06/23: 🤝 Integration and support provided by the ModelScope community.
2026/06/22: 🚀 Initial launch of Unlimited-OCR.

🛠 Technical Specifications

The following environment is recommended for running the model on NVIDIA GPUs. These versions were tested with Python 3.12.3 and CUDA 12.9.

| Dependency | Version | Purpose |
| --- | --- | --- |
| `torch` | `2.10.0` | Deep Learning Framework |
| `torchvision` | `0.25.0` | Image Processing |
| `transformers` | `4.57.1` | Model Loading/Inference |
| `Pillow` | `12.1.1` | Image Manipulation |
| `matplotlib` | `3.10.8` | Visualization |
| `einops` | `0.8.2` | Tensor Operations |
| `addict` / `easydict` | `2.4.0` / `1.13` | Configuration Management |
| `pymupdf` | `1.27.2.2` | PDF Handling |
| `psutil` | `7.2.2` | System Monitoring |
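
A quick way to compare an installed environment against the table above is a short script built on the standard-library `importlib.metadata`. This is a minimal sketch: the `PINNED` dictionary copies only a subset of the listed versions, and the helper names `installed_version` and `find_mismatches` are hypothetical, not part of the repository.

```python
from importlib import metadata

# Versions copied from the table above (subset; extend as needed).
PINNED = {
    "torch": "2.10.0",
    "torchvision": "0.25.0",
    "transformers": "4.57.1",
    "Pillow": "12.1.1",
    "einops": "0.8.2",
    "pymupdf": "1.27.2.2",
}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # A local build tag such as "2.10.0+cu129" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches

if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the listed version.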

The Parsing Logic

Parsing can be viewed as a mapping $f$ that transforms an image $I$ and a prompt $P$ into a structured text sequence $T_{\text{long-horizon}}$: $f(I, P) = T_{\text{long-horizon}}$, where $f$ denotes Unlimited-OCR.

🚀 Implementation Guide

Method 1: Hugging Face Transformers

For standard inference on NVIDIA hardware, use the following approach.

import os
import torch
from transformers import AutoModel, AutoTokenizer

# Initialize model and tokenizer
model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
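model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,  # assumed dtype; pick one your GPU supports
)
model = model.eval().cuda()  # inference mode, weights moved to the GPU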

# --- Configuration Options ---
# 1. 'gundam': base_size=1024, image_size=640, crop_mode=True
# 2. 'base': base_size=1024, image_size=1024, crop_mode=False

# Single Image Inference
model.infer(
    tokenizer, 
    prompt='image document parsing.', 
    image_file='your_image.jpg', 
    output_path='your/output/dir', 
    base_size=1024, 
    image_size=640, 
    crop_mode=True, 
    max_length=32768, 
    no_repeat_ngram_size=35, 
    ngram_window=128, 
    save_results=True
)

# Multi-page/PDF Inference (Uses 'base' config)
model.infer_multi(
    tokenizer, 
    prompt='image Multi page parsing.', 
    image_files=['page1.png', 'page2.png', 'page3.png'], 
    output_path='your/output/dir', 
    image_size=1024, 
    max_length=32768, 
    no_repeat_ngram_size=35, 
    ngram_window=1024, 
    save_results=True
)
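
The two configurations named in the comments above can be kept in a small lookup table so that call sites select a mode by name. This is only a sketch: `PRESETS` and `infer_kwargs` are hypothetical helpers, and the values are copied from the comments in the example.

```python
# Resolution presets copied from the configuration comments above.
PRESETS = {
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
}

def infer_kwargs(preset, **overrides):
    """Return keyword arguments for a preset, with optional per-call overrides."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    kwargs = dict(PRESETS[preset])  # copy so the shared table is never mutated
    kwargs.update(overrides)
    return kwargs
```

A call could then read `model.infer(tokenizer, prompt=..., image_file=..., **infer_kwargs("gundam", max_length=32768))`.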

Handling PDFs

To process PDFs, first convert each page into an image using PyMuPDF. Manual cropping is no longer required.

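import os
import tempfile

import fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    """Render each page to a PNG in a temporary directory; return the file paths."""
    out_dir = tempfile.mkdtemp(prefix="pdf_pages_")  # output location is an assumption
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the requested DPI
            path = os.path.join(out_dir, f"page_{i + 1:04d}.png")
            pix.save(path)
            paths.append(path)
    return paths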
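
Once the pages exist as images, they can be handed to `infer_multi` in a single call. The sketch below wires the two steps together. `parse_pdf` is a hypothetical helper, the page converter is passed in as an argument so that any function returning a list of image paths will work, and the keyword arguments simply mirror the multi-page example shown earlier.

```python
def parse_pdf(model, tokenizer, pdf_path, output_dir, to_images):
    """Convert a PDF to page images, then parse all pages in one infer_multi call.

    `to_images` is any callable mapping a PDF path to a list of image paths
    (for example, a completed pdf_to_images function).
    """
    image_files = to_images(pdf_path)
    if not image_files:
        raise ValueError(f"no pages were rendered from {pdf_path!r}")
    # Arguments mirror the multi-page example in this README.
    return model.infer_multi(
        tokenizer,
        prompt='image Multi page parsing.',
        image_files=image_files,
        output_path=output_dir,
        image_size=1024,
        max_length=32768,
        no_repeat_ngram_size=35,
        ngram_window=1024,
        save_results=True,
    )
```

Failing early on an empty page list avoids sending a request with no images to the model.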

Method 2: SGLang (Optimized Server)

For high-performance deployment, SGLang is recommended.

📋 Setup Checklist

Create virtual environment using uv
Install local SGLang wheel
Pin kernels==0.11.7
Install pymupdf==1.27.2.2

Installation Commands:

uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7 pymupdf==1.27.2.2

Launching the Server:

python -m sglang.launch_server \
  --model baidu/Unlimited-OCR \
  --served-model-name Unlimited-OCR \
  --attention-backend fa3 \
  --page-size 1 \
  --mem-fraction-static 0.8 \
  --context-length 32768 \
  --enable-custom-logit-processor \
  --disable-overlap-schedule \
  --skip-server-warmup \
  --host 0.0.0.0 \
  --port 10000
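
Because the server needs some time to load the model, client scripts usually wait until it responds before sending requests. The sketch below polls an HTTP endpoint using only the standard library. The `/v1/models` path is an assumption based on the server exposing an OpenAI-compatible API, and `http_ok` and `wait_for_server` are hypothetical names.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout_s=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_server(base_url, timeout_s=300.0, interval_s=2.0, probe=http_ok):
    """Poll until the server answers or `timeout_s` elapses; return True on success."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/v1/models"  # assumed OpenAI-style route
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the launch command above, the call would be `wait_for_server("http://127.0.0.1:10000")`.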

API Interaction

The server provides an OpenAI-compatible API. It uses the `DeepseekOCRNoRepeatNGramLogitProcessor` to suppress repeated n-grams in the output.
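
The windowed no-repeat n-gram idea, which also appears as the `no_repeat_ngram_size` and `ngram_window` arguments in Method 1, can be illustrated on plain token lists. The sketch below shows the general technique only. It is not the processor's actual implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(tokens, ngram_size, window_size):
    """Return tokens that would complete an n-gram already present in the window.

    Only the last `window_size` tokens are searched, so repeats that lie
    farther apart than the window remain allowed.
    """
    if ngram_size < 1 or window_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    window = tokens[-window_size:]
    # The n-1 most recent tokens form the prefix of the next candidate n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(window) - ngram_size + 1):
        ngram = tuple(window[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned
```

A logit processor built on this idea would set the scores of the returned tokens to negative infinity before the next token is chosen.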

import base64, json, requests
from sglang.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

# API Configuration
server_url = "http://127.0.0.1:10000"
session = requests.Session()

def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {"ngram_size": 35, "window_size": ngram_window},
        "stream": True,
    }
    # Request logic follows...
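
The example above calls `build_content` without defining it and leaves the request itself out. The sketch below fills both gaps under stated assumptions: it uses the standard OpenAI chat format (images as base64 `data:` URLs in `image_url` parts, responses streamed as server-sent events from `/v1/chat/completions`), and the bodies of `build_content`, `iter_stream_text`, and `send` are hypothetical rather than the repository's own code.

```python
import base64
import json
import mimetypes

def build_content(prompt, image_paths):
    """Build OpenAI-style message content: one image part per path, then the text prompt."""
    parts = []
    for path in image_paths:
        mime = mimetypes.guess_type(path)[0] or "image/png"
        with open(path, "rb") as handle:
            encoded = base64.b64encode(handle.read()).decode("ascii")
        parts.append({"type": "image_url",
                      "image_url": {"url": f"data:{mime};base64,{encoded}"}})
    # Placing the text after the images is an assumption, not a documented requirement.
    parts.append({"type": "text", "text": prompt})
    return parts

def iter_stream_text(lines):
    """Yield text deltas from server-sent-event lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        delta = choices[0].get("delta", {}).get("content") if choices else None
        if delta:
            yield delta

def send(session, server_url, payload):
    """POST the payload with a requests.Session and return the concatenated text."""
    with session.post(f"{server_url}/v1/chat/completions", json=payload, stream=True) as resp:
        resp.raise_for_status()
        return "".join(iter_stream_text(resp.iter_lines()))
```

Under these assumptions, `generate` could end with `return send(session, server_url, payload)`.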

🔄 System Workflow

The following diagram illustrates the data flow from a raw PDF to the final parsed text: