As discussed in a previous post, existing OCR benchmarks are not especially useful for discriminating between models on the kinds of documents that social scientists actually work with. Most benchmarks, like OmniDocBench v1.5, over-index on modern printed text, clean scans, and well-resourced languages. Handwritten census records, historical logbooks, degraded administrative forms, and other "messy" real-world data are not well represented.
socOCRbench is a small (private) benchmark designed with this gap in mind. It evaluates OCR models on hundreds of samples across two broad task types: handwriting recognition and table extraction. Within each, results are split into (A) and (B) sub-categories that roughly correspond to how well-covered the material is by existing training data and benchmarks: (A) is more conventional, (B) more challenging or underrepresented. The metric is Normalized Edit Similarity (NES), where 1.0 represents a perfect transcription. Details about what data constitutes each category are intentionally withheld to protect the integrity of the evaluation.
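The exact NES normalization used here is not specified, but a common formulation is one minus the Levenshtein distance between prediction and reference, divided by the length of the longer string. A minimal sketch under that assumption:

```python
def normalized_edit_similarity(pred: str, target: str) -> float:
    """NES = 1 - Levenshtein(pred, target) / max(len(pred), len(target)).

    Returns 1.0 for a perfect transcription, approaching 0.0 as the
    prediction diverges from the reference.
    """
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 1.0  # two empty strings match perfectly

    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,        # deletion
                curr[j - 1] + 1,    # insertion
                prev[j - 1] + cost, # substitution (or match)
            )
        prev = curr

    return 1.0 - prev[n] / max(m, n)
```

In practice one would use an optimized library (e.g. RapidFuzz) rather than pure Python, and some benchmarks apply whitespace or markup normalization before scoring; those choices can shift scores noticeably.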
Future iterations of the benchmark will include more diverse document types and larger sample sizes, and will add evaluations of model efficiency and of performance after fine-tuning.
| Model | Type | Overall | HW [A] | HW [B] | Tables [A] | Tables [B] |
|---|---|---|---|---|---|---|
| Gemini 3 Pro (low) | VLM | 0.5472 | 0.6445 | 0.5289 | 0.8936 | 0.2667 |
| Gemini 3 Flash (high) | VLM | 0.4808 | 0.6049 | 0.4750 | 0.7551 | 0.1862 |
| Gemini 3 Flash (low) | VLM | 0.4596 | 0.5390 | 0.3689 | 0.8734 | 0.2541 |
| Qwen3-VL-235B | VLM | 0.4428 | 0.6174 | 0.1969 | 0.9398 | 0.2520 |
| Qwen3-VL-30B | VLM | 0.4022 | 0.5716 | 0.1699 | 0.9104 | 0.1966 |
| GPT-5.2 (low) | VLM | 0.3759 | 0.5140 | 0.1693 | 0.8445 | 0.2014 |
| Gemini 3 Pro (high) | VLM | 0.3700 | 0.5336 | 0.3506 | 0.4097 | 0.1564 |
| dots.ocr | VLM | 0.3328 | 0.4193 | 0.0751 | 0.9150 | 0.2299 |
| GLM-OCR | VLM | 0.3247 | 0.4369 | 0.0693 | 0.8200 | 0.2283 |
| GPT-5.2 (high) | VLM | 0.2841 | 0.4190 | 0.0564 | 0.8312 | 0.0995 |
| Sarvam Vision | VLM | 0.2735 | 0.2851 | 0.1899 | 0.7194 | 0.1335 |
| PaddleOCR-VL-1.5 | VLM | 0.2490 | 0.5019 | 0.1279 | 0.3614 | 0.0000 |
| PaddleOCR-VL-0.9B | VLM | 0.2221 | 0.2673 | 0.0446 | 0.6904 | 0.1365 |
| Nemotron-Nano-12B | VLM | 0.2172 | 0.2568 | 0.0127 | 0.6318 | 0.1975 |
| OlmOCR-2 | VLM | 0.2087 | 0.3405 | 0.0696 | 0.2791 | 0.1624 |
| LightOnOCR-2-1B | VLM | 0.2024 | 0.3138 | 0.1005 | 0.4828 | 0.0345 |
| DeepSeek-OCR | VLM | 0.1997 | 0.2534 | 0.0336 | 0.5004 | 0.1732 |
| Tesseract | Classical OCR | 0.1091 | 0.1037 | 0.0285 | 0.1696 | 0.1808 |
| Layout Parser | Classical OCR | 0.0263 | 0.0590 | 0.0231 | 0.0000 | 0.0000 |
| EfficientOCR | Classical OCR | 0.0200 | 0.0509 | 0.0108 | 0.0000 | 0.0000 |
Last updated: February 13, 2026