Unlimited OCR: Advancing One-Shot Long-Horizon Parsing


Welcome to the repository for Unlimited-OCR, a project that extends DeepSeek-OCR to One-shot Long-horizon Parsing: processing extensive documents in a single pass.
Core Objective: high-fidelity parsing of long documents without complex multi-step pipelines.
Project Milestones
- 2026/06/23: Official paper released on arXiv.
- 2026/06/23: Integration and support provided by the ModelScope community.
- 2026/06/22: Initial launch of Unlimited-OCR.
Technical Specifications
The following environment is recommended for running the model on NVIDIA GPUs. These requirements were tested on Python 3.12.3 with CUDA 12.9.
| Dependency | Version | Purpose |
|---|---|---|
| torch | 2.10.0 | Deep Learning Framework |
| torchvision | 0.25.0 | Image Processing |
| transformers | 4.57.1 | Model Loading/Inference |
| Pillow | 12.1.1 | Image Manipulation |
| matplotlib | 3.10.8 | Visualization |
| einops | 0.8.2 | Tensor Operations |
| addict / easydict | 2.4.0 / 1.13 | Configuration Management |
| pymupdf | 1.27.2.2 | PDF Handling |
| psutil | 7.2.2 | System Monitoring |
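A quick way to compare a local environment against the pins in the table is sketched below. It is a convenience script written for this guide, not part of the repository; the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical, and the distribution names are assumed to match those published on PyPI.

```python
from importlib import metadata

# Pins copied from the table above (distribution names assumed to match PyPI).
PINNED = {
    "torch": "2.10.0",
    "torchvision": "0.25.0",
    "transformers": "4.57.1",
    "Pillow": "12.1.1",
    "matplotlib": "3.10.8",
    "einops": "0.8.2",
    "addict": "2.4.0",
    "easydict": "1.13",
    "pymupdf": "1.27.2.2",
    "psutil": "7.2.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Local build tags such as "+cu129" are ignored when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


# Report every package that is missing or differs from the table.
for name, (expected, found) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

A mismatch is not necessarily an error; the table lists the tested versions, so nearby releases may also work.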
The Parsing Logic
The process can be viewed mathematically as a mapping that transforms an image and a prompt into a structured text sequence.
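One way to write this mapping down is shown here. The symbols are not from the original text and are introduced only for illustration: $I$ is the input image (or sequence of page images), $P$ the prompt, $f_\theta$ the model with parameters $\theta$, and $Y$ the output token sequence of length $T$. The factorization on the second line assumes standard autoregressive decoding, which the `max_length` and n-gram settings used later suggest but which is not stated explicitly.

```latex
f_\theta : (I, P) \longmapsto Y = (y_1, y_2, \dots, y_T)

p_\theta(Y \mid I, P) = \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, I,\, P\right)
```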
Implementation Guide
Method 1: Hugging Face Transformers
For standard inference on NVIDIA hardware, use the following approach.
import os
import torch
from transformers import AutoModel, AutoTokenizer
# Initialize model and tokenizer
model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
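model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to what your GPU supports
)
model = model.eval().cuda()  # inference mode, weights moved to the GPU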
# --- Configuration Options ---
# 1. 'gundam': base_size=1024, image_size=640, crop_mode=True
# 2. 'base': base_size=1024, image_size=1024, crop_mode=False
# Single Image Inference
model.infer(
    tokenizer,
    prompt='image document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024,
    image_size=640,
    crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=128,
    save_results=True
)
# Multi-page/PDF Inference (Uses 'base' config)
model.infer_multi(
    tokenizer,
    prompt='image Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True
)
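The two configurations listed in the comments above ('gundam' and 'base') can be kept in one place and unpacked into the call, which avoids mixing values from different presets. This is a small convenience sketch; `PRESETS` and `infer_kwargs` are hypothetical names, not part of the model's API.

```python
# Preset values copied from the configuration comments above.
PRESETS = {
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
}


def infer_kwargs(preset, **overrides):
    """Return keyword arguments for the chosen preset, with optional overrides."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {sorted(PRESETS)}")
    return {**PRESETS[preset], **overrides}


# Usage, assuming `model` and `tokenizer` were loaded as shown earlier:
# model.infer(
#     tokenizer,
#     prompt='image document parsing.',
#     image_file='your_image.jpg',
#     output_path='your/output/dir',
#     **infer_kwargs("gundam", max_length=32768, no_repeat_ngram_size=35, ngram_window=128),
# )
```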
Handling PDFs
To process PDFs, you must first convert pages into images using PyMuPDF. Manual cropping is no longer required.
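import os
import tempfile

import fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    """Render every page of a PDF to a PNG file and return the file paths."""
    out_dir = tempfile.mkdtemp(prefix="pdf_pages_")
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            # One possible conversion: rasterize the page at the requested DPI.
            path = os.path.join(out_dir, f"page_{i + 1}.png")
            page.get_pixmap(dpi=dpi).save(path)
            paths.append(path)
    return paths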
Method 2: SGLang (Optimized Server)
For high-performance deployment, SGLang is recommended.
Setup Checklist
- Create a virtual environment using uv
- Install the local SGLang wheel
- Pin kernels==0.11.7
- Install pymupdf==1.27.2.2
Installation Commands:
uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7 pymupdf==1.27.2.2
Launching the Server:
python -m sglang.launch_server \
--model baidu/Unlimited-OCR \
--served-model-name Unlimited-OCR \
--attention-backend fa3 \
--page-size 1 \
--mem-fraction-static 0.8 \
--context-length 32768 \
--enable-custom-logit-processor \
--disable-overlap-schedule \
--skip-server-warmup \
--host 0.0.0.0 \
--port 10000
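Because the command above passes `--skip-server-warmup`, it can help to wait until the server answers before sending the first request. The sketch below polls a URL until it returns HTTP 200. The helper names `http_ok` and `wait_until_ready` are hypothetical, and the `/health` path in the usage comment is an assumption about the server's routes; substitute whichever endpoint your build exposes.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout=600.0, interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `timeout` seconds pass; return the final result."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage (assumed health endpoint on the host and port from the launch command):
# ready = wait_until_ready(lambda: http_ok("http://127.0.0.1:10000/health"))
# print("server ready" if ready else "server did not come up in time")
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.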
API Interaction
The server exposes an OpenAI-compatible API. Requests pass the DeepseekOCRNoRepeatNGramLogitProcessor as a custom logit processor, which blocks repeated n-grams during decoding so that long outputs do not fall into loops.
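As the processor's name suggests, a no-repeat n-gram rule bans any next token that would complete an n-gram already present in the recent output. The sketch below illustrates that rule on plain token lists. It is written for this guide and is not the processor shipped in the SGLang wheel, which may differ in details; `banned_next_tokens` is a hypothetical name, and its two parameters mirror the `ngram_size` and `window_size` values sent in `custom_params`.

```python
def banned_next_tokens(tokens, ngram_size, window_size):
    """Return the token ids that would repeat an n-gram seen in the recent window."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix the next token would extend.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    # Only n-grams that start within the last `window_size` tokens are considered.
    start = max(0, len(tokens) - window_size)
    banned = set()
    for i in range(start, len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# With history [1, 2, 3, 1, 2] and 3-grams, emitting 3 would repeat (1, 2, 3).
print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3, window_size=128))
```

A decoder would then set the logits of the banned ids to negative infinity before sampling. A large `ngram_size` such as 35 only blocks long verbatim repeats, which leaves ordinary repeated phrases and table structure untouched.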
import base64, json, requests
from sglang.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor
# API Configuration
server_url = "http://127.0.0.1:10000"
session = requests.Session()
def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {"ngram_size": 35, "window_size": ngram_window},
        "stream": True,
    }
    # Request logic follows...
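The snippet above calls `build_content` without defining it and leaves the request logic open. The sketch below shows one plausible shape for both pieces, under these assumptions: images are sent as base64 data URLs in the OpenAI `image_url` content format, images are placed before the text prompt, and the stream arrives as OpenAI-style server-sent events. `iter_stream_text` is a hypothetical helper name, and the repository's own implementation may differ.

```python
import base64
import json
import mimetypes


def build_content(prompt, image_paths):
    """Build an OpenAI-style content list: one image_url part per image, then the prompt."""
    parts = []
    for path in image_paths:
        mime = mimetypes.guess_type(path)[0] or "image/png"
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("ascii")
        parts.append({"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}})
    parts.append({"type": "text", "text": prompt})
    return parts


def iter_stream_text(lines):
    """Yield text deltas from an iterable of server-sent-event lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Usage with `session`, `server_url`, and `payload` from the snippet above
# (the /v1/chat/completions path is the usual OpenAI-compatible route, assumed here):
# resp = session.post(f"{server_url}/v1/chat/completions", json=payload, stream=True)
# resp.raise_for_status()
# text = "".join(iter_stream_text(resp.iter_lines()))
```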
System Workflow
[Figure: data flow from a raw PDF to the final parsed text]
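In code, the same flow reduces to two steps: render the PDF pages to images, then hand all of them to a single multi-page parsing call. The sketch below wires these steps together with caller-supplied functions so that it stays independent of how the model is loaded. `parse_pdf`, `render_pages`, and `parse_pages` are hypothetical names, and what the parsing step returns depends on the model's `infer_multi` implementation.

```python
def parse_pdf(pdf_path, render_pages, parse_pages):
    """Run the PDF -> page images -> parsed output flow with caller-supplied steps."""
    # Step 1: rasterize each page to an image file and collect the paths.
    image_paths = render_pages(pdf_path)
    if not image_paths:
        raise ValueError(f"no pages rendered from {pdf_path!r}")
    # Step 2: one multi-page parsing call over all pages.
    return parse_pages(image_paths)


# Usage, assuming `pdf_to_images`, `model`, and `tokenizer` from the earlier sections:
# result = parse_pdf(
#     "document.pdf",
#     render_pages=pdf_to_images,
#     parse_pages=lambda paths: model.infer_multi(
#         tokenizer,
#         prompt='image Multi page parsing.',
#         image_files=paths,
#         output_path='your/output/dir',
#         image_size=1024,
#         max_length=32768,
#         no_repeat_ngram_size=35,
#         ngram_window=1024,
#         save_results=True,
#     ),
# )
```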