
OCR & Computer Vision

A comprehensive, hands-on tutorial covering everything from how OCR pipelines work to deploying production-grade text extraction systems. Includes image pre-processing, Tesseract, EasyOCR, PaddleOCR, TrOCR, handwriting recognition, layout analysis, real-world invoice extraction, fine-tuning, and golden production rules — all with working Python code.

Section 01

The Story That Explains OCR

The Blind Librarian
Imagine a librarian who has memorised every book ever written — but she is blindfolded. You slide a photograph of a handwritten note under the door. She cannot read it. She knows language perfectly. She knows what words mean, how sentences work, what grammar is. But she cannot see.

Now you hire a second expert — a pattern specialist. He can tell you that a shape has two vertical strokes and a curved top. He has no idea what a word means, but he can trace every curve precisely. Together, they are unstoppable.

That is Optical Character Recognition. One system sees the image. Another understands the language. Together they read the world.

OCR (Optical Character Recognition) is a branch of Computer Vision that converts images of typed, handwritten, or printed text into machine-readable text. It is one of the oldest and most commercially impactful AI fields — powering bank cheque readers, passport scanners, Google Books, invoice automation, and real-time translation apps used by billions of people.

👁️
Why OCR Is a Computer Vision Problem

Before any language model can process a document, it first needs pixels — raw image data — to be converted into characters. OCR is the bridge between the physical world (ink on paper, pixels on screen) and the digital world of text that NLP models can read. Without OCR, a scanned invoice is just a picture. With OCR, it becomes structured data.


Section 02

The OCR Pipeline — How It Works End to End

Modern OCR is not a single algorithm — it is a pipeline of stages, each transforming the image progressively from raw pixels to clean, structured text.

01
Image Acquisition
Capture the source: a photo, scanner, PDF render, or screenshot. Quality here determines everything downstream. A blurry 72 DPI photo can never be fully salvaged by later stages.
02
Pre-processing
Deskew, denoise, binarise, remove shadows, correct contrast. This stage transforms a "good enough" image into an image that the character recogniser can handle reliably.
03
Layout Analysis & Segmentation
Detect text regions, columns, tables, figures. Separate headings from body text from captions. Split paragraphs into lines, lines into words, words into character candidates.
04
Character Recognition
The core engine — a CNN, LSTM, or Transformer maps each character image patch to the most probable character in the target alphabet. Multiple hypotheses are kept (beam search) to allow language correction.
05
Post-processing & Language Model
A language model re-ranks character hypotheses using context. For example, it can choose between the look-alikes "rn" and "m" ("modern" vs "modem") based on which word fits the sentence. Spell-checkers, grammar correctors, and dictionary lookups polish the final output.
06
Structured Output
Return plain text, bounding box coordinates, confidence scores, or full structured formats like hOCR, ALTO XML, or JSON for downstream applications.
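Stage 05 can be made concrete with a toy corrector. The sketch below is a minimal illustration, not how any particular engine works: the `CONFUSIONS` table and the `VOCAB` word counts are made-up stand-ins for a real confusion model and a real language model.

```python
from itertools import product

# Hypothetical confusion table: fragment seen in OCR output -> fragments it may stand for
CONFUSIONS = {
    "rn": ["rn", "m"],
    "0": ["0", "O", "o"],
    "1": ["1", "l", "I"],
    "l": ["l", "I", "1"],
}

# Hypothetical word counts standing in for a language model
VOCAB = {"modern": 120, "modem": 15, "Invoice": 300, "Total": 250}


def candidates(word):
    """Generate every spelling obtained by swapping confusable fragments."""
    options = []
    i = 0
    while i < len(word):
        for fragment, alternatives in CONFUSIONS.items():
            if word.startswith(fragment, i):
                options.append(alternatives)
                i += len(fragment)
                break
        else:
            # No confusable fragment starts here: keep the character as-is
            options.append([word[i]])
            i += 1
    return {"".join(combo) for combo in product(*options)}


def correct(word):
    """Return the most frequent in-vocabulary candidate, or the word unchanged."""
    known = [c for c in candidates(word) if c in VOCAB]
    return max(known, key=VOCAB.get) if known else word


print(correct("lnv0ice"))
print(correct("rnodern"))
```

The candidate set grows multiplicatively with the number of confusable fragments, so a real corrector would prune with beam search rather than enumerate every combination.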
The 80/20 Rule of OCR Quality

As a rule of thumb, about 80% of OCR errors come from bad pre-processing, not from a weak recognition engine. A 5° skew alone can halve your accuracy. A well-pre-processed scan run through a free engine often beats a poorly pre-processed scan run through the most expensive commercial engine available. Fix your images before tuning your model.


Section 03

OCR Approaches — Three Generations of Technology

🛠️
Generation 1 — Template Matching
1950s–1980s
Each character is matched pixel-by-pixel against stored templates. Works only for fixed fonts at fixed sizes. The OCR-A and OCR-B fonts were literally designed to be machine-readable. Still used in some industrial settings (batch serial numbers).
📈
Generation 2 — Feature Extraction
1990s–2010s
Hand-crafted features (loops, endpoints, junctions, stroke widths) are extracted and fed into SVMs or HMMs. Tesseract 3 used this approach. Generalises across fonts but struggles badly with handwriting and degraded documents.
🧠
Generation 3 — Deep Learning
2015–Present
CNNs learn image features automatically. LSTMs model sequential context across characters. Transformers attend globally across the whole image. Tesseract 4+, EasyOCR, PaddleOCR, and commercial APIs use this paradigm. Near-human accuracy on clean printed text.
🌐
End-to-End Scene Text
2017–Present
A single neural network detects and recognises text in natural scenes: street signs, shop fronts, menus. No separate segmentation step. CRAFT, DBNet, and EAST are popular text detectors; CRNN and TrOCR handle recognition.
📝
Document Intelligence
2020–Present
Multi-modal Transformers (LayoutLM, Donut) jointly model text and layout. They understand that a number after the word "Total:" in a table is probably a currency value. Goes beyond character recognition into semantic understanding.
🌟
Vision-Language Models
2023–Present
GPT-4o, Gemini, Claude — large vision-language models can now read images directly in natural language prompts. For many practical tasks, a single API call replaces an entire OCR pipeline. Trade-off: higher cost and latency, lower throughput.

Section 04

Pre-processing — The Most Important Step

The Surgeon Who Washed Their Hands
In 1847, Ignaz Semmelweis discovered that doctors were killing patients by going straight from autopsies to delivering babies without washing their hands. The "treatment" (the delivery itself) was fine. The preparation was killing people. He introduced handwashing and death rates dropped by about 90%.

Pre-processing is the handwashing of OCR. You can have the most powerful recognition engine in the world — if you feed it a rotated, shadowed, low-contrast image, it will fail. Pre-process first. Always.
📷 Key Pre-processing Techniques
Grayscale
Convert RGB to grayscale to reduce dimensionality. Most OCR engines work on single-channel images. Colour is rarely useful for character recognition (except colour-coded forms).
Binarisation
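Convert to black-and-white. Otsu's method picks a single global threshold from the image's intensity histogram, so it suits evenly lit scans. Sauvola's method computes a local threshold per neighbourhood and works better on uneven illumination (photographed books with shadows).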
Deskewing
Detect and correct rotation. A document scanned at 3° off-axis can cause line segmentation to fail completely. Hough transform detects dominant line angles. Rotate to correct.
Denoising
Remove salt-and-pepper noise (median filter), Gaussian noise (Gaussian blur), or JPEG artifacts (bilateral filter). Aggressive denoising can smear strokes — tune carefully.
Morphology
Dilation fills gaps in broken characters. Erosion removes noise pixels touching strokes. Opening = erosion then dilation. Closing = dilation then erosion. Invaluable for degraded documents.
DPI Upscaling
OCR engines expect ≥300 DPI. Low-resolution images (72–96 DPI from web) need upscaling. Simple bicubic works; super-resolution models (ESRGAN) produce cleaner strokes at 4× scale.
import cv2
import numpy as np
from PIL import Image

# ── Load image ────────────────────────────────────────────
img = cv2.imread('invoice.jpg')

# ── Step 1: Convert to grayscale ──────────────────────────
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# ── Step 2: Denoise (preserve edges with fastNlMeans) ─────
denoised = cv2.fastNlMeansDenoising(gray, h=10, templateWindowSize=7, searchWindowSize=21)

# ── Step 3: Adaptive binarisation (handles shadows) ───────
binary = cv2.adaptiveThreshold(
    denoised, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11,
    C=2
)

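# ── Step 4: Deskew using the minimum-area bounding rectangle ─
def deskew(image):
    # Coordinates of dark (text) pixels in (x, y) order; minAreaRect needs float32/int32
    ys, xs = np.where(image < 127)
    coords = np.column_stack((xs, ys)).astype(np.float32)
    angle  = cv2.minAreaRect(coords)[-1]
    # Normalise to (-45, 45]. Older OpenCV releases report angles in [-90, 0),
    # newer ones (4.5+) in (0, 90]; this handles both conventions.
    # Check the rotation direction once on a sample with known skew.
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    h, w  = image.shape
    centre = (w // 2, h // 2)
    M     = cv2.getRotationMatrix2D(centre, angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

deskewed = deskew(binary)

# ── Step 5: Morphology to fill stroke gaps ────────────────
# OpenCV morphology acts on white pixels, so on black-text-on-white input
# MORPH_OPEN (not MORPH_CLOSE) is what bridges small gaps in the dark strokes.
# A 1x1 kernel would be a no-op; 2x2 is a conservative starting point.
kernel  = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
cleaned = cv2.morphologyEx(deskewed, cv2.MORPH_OPEN, kernel)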

cv2.imwrite('preprocessed.png', cleaned)
print("Pre-processing complete. Image ready for OCR.")
OUTPUT
Pre-processing complete. Image ready for OCR.
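To show what a global threshold actually means, here is a minimal pure-Python sketch of Otsu's criterion: try every grey level and keep the one that maximises the between-class variance. It is written for readability, not speed; in practice `cv2.threshold` with `cv2.THRESH_OTSU` does the same job. The function name `otsu_threshold` is illustrative.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance.

    `pixels` is a flat iterable of integer grey levels in 0..255.
    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # number of pixels with level <= t
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_bright = total - n_dark
        if n_dark == 0:
            continue          # no dark class yet
        if n_bright == 0:
            break             # everything is already in the dark class
        mean_dark = sum_dark / n_dark
        mean_bright = (sum_all - sum_dark) / n_bright
        var_between = n_dark * n_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy "scan": 60 ink pixels around level 30, 40 paper pixels around level 200
sample = [28, 30, 32] * 20 + [195, 200, 205, 210] * 10
print(otsu_threshold(sample))
```

Because one threshold is applied to the whole image, a shadow that pushes paper pixels below that level turns them black, which is why the pipeline above uses an adaptive method for photographed documents.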

Section 05

Tesseract OCR — The Industry Open-Source Standard

Tesseract is a free, open-source OCR engine originally developed at HP between 1985 and 1994, open-sourced by HP in 2005, and developed under Google's sponsorship from 2006. Version 4 added an LSTM-based recognition engine that now rivals commercial products for clean printed text.

📚
Tesseract Fast Facts

Supports 100+ languages out of the box. Outputs plain text, hOCR (with bounding boxes), TSV, and PDF. Page Segmentation Modes (PSM) tell it whether to expect a full page, a single column, a single line, or a single word. The pytesseract Python wrapper makes it trivial to integrate.

⚙️ Tesseract PSM Modes — Cheat Sheet
PSM 3
Fully automatic page segmentation (default). Best for multi-column documents, mixed layouts.
PSM 6
Single uniform block of text. Use for clean single-column invoices, reports, or paragraphs.
PSM 7
Single text line. Ideal for single-line crops: labels, form fields, or one line cut from a business card.
PSM 8
Single word. Use when you know there is exactly one word — CAPTCHA solving, serial numbers.
PSM 11
Sparse text. Find as much text as possible in any order. Good for natural scene images.
PSM 13
Raw line. Treat image as single text line, bypass Tesseract's internal segmentation entirely.
import pytesseract
import cv2
from PIL import Image
import pandas as pd

# ── Basic extraction ───────────────────────────────────────
img  = cv2.imread('preprocessed.png')
text = pytesseract.image_to_string(img, lang='eng', config='--psm 6')
print("Extracted Text:\n", text)

# ── Extract with bounding boxes (TSV) ─────────────────────
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DATAFRAME)

# Filter confident words only (confidence > 60)
confident = data[data['conf'] > 60].copy()
print(confident[['text', 'conf', 'left', 'top', 'width', 'height']].head(10))

# ── Draw bounding boxes on image ──────────────────────────
result = img.copy()
for _, row in confident.iterrows():
    x, y, w, h = int(row['left']), int(row['top']), int(row['width']), int(row['height'])
    cv2.rectangle(result, (x, y), (x+w, y+h), (0, 255, 0), 2)

cv2.imwrite('annotated.png', result)
print(f"Found {len(confident)} confident word detections.")
OUTPUT
Extracted Text:
 INVOICE #1042
Date: 15 January 2025
Bill To: Acme Corp
Total Due: £4,250.00

      text  conf  left  top  width  height
0  INVOICE  97.2    48   62    112      22
1    #1042  94.1   162   62     68      22
2    Date:  96.8    48   92     52      20
Found 38 confident word detections.
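Choosing a PSM per region is easier to keep consistent with a small helper. The sketch below builds the `config` string that pytesseract forwards to Tesseract; the region names in `PSM_FOR_REGION` are made up for illustration, while `--psm` and the `tessedit_char_whitelist` variable are standard Tesseract options (the whitelist is not honoured by the LSTM engine in every 4.x release, so verify it on your version).

```python
# Hypothetical region labels mapped to the PSM values from the cheat sheet
PSM_FOR_REGION = {
    "page": 3,
    "block": 6,
    "line": 7,
    "word": 8,
    "sparse": 11,
    "raw_line": 13,
}


def tesseract_config(region, whitelist=None):
    """Build a Tesseract config string for a given kind of image crop."""
    parts = [f"--psm {PSM_FOR_REGION[region]}"]
    if whitelist:
        # Restrict recognition to the given characters (e.g. digits only)
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(tesseract_config("block"))
print(tesseract_config("word", whitelist="0123456789"))

# Usage with pytesseract (not executed here):
# text = pytesseract.image_to_string(crop, config=tesseract_config("line"))
```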

Section 06

EasyOCR — Deep Learning Out of the Box

EasyOCR is a Python library built on PyTorch that bundles a CRAFT text detector and a CRNN recognition model. It supports 80+ languages, works on GPU/CPU, and handles natural scene text far better than Tesseract out of the box — with no configuration required.

🚀
When to Use EasyOCR
Best For
Natural scene images (street signs, menus, product labels). Multi-language documents. When you need a quick working solution with GPU acceleration.
✔ Zero config needed
✔ GPU acceleration
✔ Great multi-language
📄
When to Use Tesseract
Best For
Clean document scans, PDFs, forms. When you need hOCR/PDF output, tight integration with document workflows, or CPU-only environments.
✔ hOCR / PDF output
✔ CPU-friendly
✔ 100+ languages
🏭
When to Use Commercial APIs
Best For
Production workloads at scale, handwriting, tables, forms with structure extraction. AWS Textract, Google Document AI, Azure Form Recognizer.
✔ Best accuracy
✔ Table extraction
✔ Managed service
✗ Cost per page
✗ Data privacy concerns
import easyocr
import cv2
import numpy as np

# ── Initialise reader (downloads models on first run) ──────
# Pass ['en','hi'] for bilingual, gpu=True for CUDA
reader = easyocr.Reader(['en'], gpu=False)

# ── Read image ─────────────────────────────────────────────
results = reader.readtext('street_sign.jpg')

# results: list of (bbox, text, confidence)
for (bbox, text, confidence) in results:
    print(f"Text: {text:30s} | Confidence: {confidence:.2%}")

# ── Draw detections on image ───────────────────────────────
img = cv2.imread('street_sign.jpg')
for (bbox, text, conf) in results:
    pts = np.array(bbox, dtype=np.int32)
    cv2.polylines(img, [pts], isClosed=True, color=(0,255,0), thickness=2)
    origin = (int(bbox[0][0]), int(bbox[0][1]) - 5)
    cv2.putText(img, f"{text} ({conf:.0%})", origin,
               cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,200,255), 1)

cv2.imwrite('easyocr_result.jpg', img)
OUTPUT
Text: OXFORD STREET                  | Confidence: 98.42%
Text: W1D 1BS                        | Confidence: 96.17%
Text: City of Westminster            | Confidence: 94.83%
Text: ONE WAY                        | Confidence: 91.56%
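Scene-text detectors return boxes in no guaranteed order, so a common follow-up step is to drop weak detections and sort the rest into reading order. The sketch below assumes the `(bbox, text, confidence)` tuples shown above, with each `bbox` given as four `[x, y]` corner points; the function name and the `line_tol` heuristic are illustrative choices, not part of EasyOCR.

```python
def reading_order(results, min_conf=0.5, line_tol=0.6):
    """Filter detections by confidence and return text lines top-to-bottom.

    Two boxes share a line when their vertical centres differ by less than
    `line_tol` times the height of the first box on that line.
    """
    boxes = []
    for bbox, text, conf in results:
        if conf < min_conf:
            continue
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        boxes.append({
            "text": text,
            "x": min(xs),
            "cy": (min(ys) + max(ys)) / 2,
            "h": max(ys) - min(ys),
        })
    boxes.sort(key=lambda b: b["cy"])

    lines = []
    for box in boxes:
        first = lines[-1][0] if lines else None
        if first and abs(box["cy"] - first["cy"]) < line_tol * first["h"]:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, order boxes left to right
    return [" ".join(b["text"] for b in sorted(line, key=lambda b: b["x"]))
            for line in lines]


detections = [
    ([[120, 10], [200, 10], [200, 30], [120, 30]], "STREET", 0.96),
    ([[10, 12], [110, 12], [110, 32], [10, 32]], "OXFORD", 0.98),
    ([[10, 50], [90, 50], [90, 70], [10, 70]], "W1D 1BS", 0.90),
    ([[10, 90], [40, 90], [40, 100], [10, 100]], "x", 0.20),
]
print(reading_order(detections))
```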

Section 07

How Neural OCR Works — The CRNN Architecture

The dominant architecture for sequence-based OCR is CRNN — a Convolutional Recurrent Neural Network. It treats a word image as a sequence of visual features to be decoded left-to-right, mirroring how we read.

🧠 CRNN Architecture — Data Flow
Input
Word-crop image, normalised to fixed height (32px) and variable width. Greyscale or RGB.
CNN
A series of convolutional layers (VGG-style) extract a feature map. Each column in the map represents one "visual slice" of the word — roughly one character wide.
Map-to-Seq
Feature map columns are flattened column by column into a sequence of feature vectors, left to right. Spatial structure is preserved.
LSTM
A bidirectional LSTM processes the sequence — forward (left to right) and backward (right to left) — capturing character context from both directions simultaneously.
CTC Loss
Connectionist Temporal Classification decodes the LSTM output into characters without needing to align each frame to a character. Handles characters of different widths automatically.
Output
A string of characters — e.g. "Invoice" — with associated probabilities for each character position.
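The last two stages are easiest to see in the simplest CTC decoder, best-path (greedy) decoding: take the most likely symbol per frame, merge consecutive repeats, then drop the blank symbol. The sketch below uses a toy three-symbol alphabet and hand-written frame probabilities; writing the blank as "-" is an assumption made for readability.

```python
BLANK = "-"   # assumption: the CTC blank is written as "-" in this toy alphabet


def ctc_greedy_decode(frame_probs, alphabet):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    decoded = []
    previous = None
    for symbol in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; the blank is what lets "ll" survive merging.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "a", "l"]
frames = [
    [0.10, 0.80, 0.10],   # a
    [0.20, 0.70, 0.10],   # a  (repeat, merged)
    [0.90, 0.05, 0.05],   # blank
    [0.10, 0.10, 0.80],   # l
    [0.10, 0.20, 0.70],   # l  (repeat, merged)
    [0.80, 0.10, 0.10],   # blank separates the double letter
    [0.10, 0.10, 0.80],   # l
]
print(ctc_greedy_decode(frames, alphabet))
```

Beam search replaces the per-frame argmax with a set of running hypotheses, but the merge-and-drop-blank rule stays the same.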
CTC Loss
L = -log P(y | x)
P(y | x) sums over all possible alignments between the label sequence y and the per-frame outputs computed from the input image x. Eliminates the need for pre-segmented character-level labels.
Beam Search Decoding
ŷ = argmax_y P(y | x)
At each timestep, keeps the top-k most probable character sequences rather than greedily picking one. Produces better results with a language model rescoring step.
Character Error Rate
CER = (S + D + I) / N
S = substitutions, D = deletions, I = insertions, N = total ground truth characters. The primary evaluation metric for OCR recognition quality.
Word Error Rate
WER = (Sw + Dw + Iw) / Nw
Same as CER but computed at word level. Harsher than CER — a single wrong character makes the whole word wrong. Preferred for downstream NLP tasks.
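Both metrics reduce to an edit distance divided by the reference length. The sketch below is a minimal implementation for illustration, assuming a non-empty reference string; production evaluations usually also normalise whitespace and casing first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: (S + D + I) / N over characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: the same ratio over whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


truth = "Invoice 1042"
ocr_output = "lnvoice 1O42"
print(f"CER = {cer(truth, ocr_output):.3f}")
print(f"WER = {wer(truth, ocr_output):.3f}")
```

In this example two wrong characters out of twelve are enough to make both words wrong, which is the sense in which WER is harsher than CER.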

Section 08

OCR on Real Documents — Invoice Extraction

The Accounts Payable Robot
A mid-size logistics company receives 3,000 supplier invoices per month — by email, post, and fax. Each one is manually keyed into their ERP system by a team of 6 staff. Cost: £180,000/year. Error rate: 2.3% (wrong amounts approved and paid).

After deploying an OCR pipeline with structured extraction, the same team processes 30,000 invoices per month at 99.1% accuracy. Staff are redeployed to exception handling and supplier relationships. Cost drops to £22,000/year in infrastructure. ROI in 6 weeks.
import pytesseract
import cv2
import re
from dataclasses import dataclass
from typing import Optional

# ── Data model for extracted invoice fields ────────────────
@dataclass
class InvoiceData:
    invoice_number: Optional[str]
    date:           Optional[str]
    vendor_name:    Optional[str]
    total_amount:   Optional[str]

def extract_invoice(image_path: str) -> InvoiceData:
    img  = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Extract full text with Tesseract
    text = pytesseract.image_to_string(binary, config='--psm 6')

    # Regex-based field extraction
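    # Invoice IDs may contain hyphens (INV-20241015) and must include a digit
    inv_match    = re.search(r'Invoice\s*(?:Number|No\.?|#)\s*[:#]?\s*([\w-]*\d[\w-]*)', text, re.IGNORECASE)
    date_match   = re.search(r'\b(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})\b', text)
    # \b keeps "Subtotal" from matching; the currency symbol is kept in the capture
    total_match  = re.search(r'\bTotal[^£$€\d]*([£$€]?\s?\d[\d,.]*)', text, re.IGNORECASE)
    vendor_match = re.search(r'From[:\s]+(.+)', text, re.IGNORECASE)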

    return InvoiceData(
        invoice_number = inv_match.group(1)    if inv_match    else None,
        date           = date_match.group(1)   if date_match   else None,
        total_amount   = total_match.group(1)  if total_match  else None,
        vendor_name    = vendor_match.group(1).strip() if vendor_match else None
    )

result = extract_invoice('invoice_scan.jpg')
print(f"Invoice #:  {result.invoice_number}")
print(f"Date:       {result.date}")
print(f"Vendor:     {result.vendor_name}")
print(f"Total:      {result.total_amount}")
OUTPUT
Invoice #:  INV-20241015
Date:       15/01/2025
Vendor:     Acme Supplies Ltd
Total:      £4,250.00
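Regex capture is only half of field extraction: before a value reaches an ERP system it should be parsed into a typed value, and rejected if OCR noise has corrupted it. The sketch below is one way to do that with the standard library; the accepted amount pattern (comma as thousands separator, optional two decimals) and the day-first date formats are assumptions that match the example invoice, not a general solution.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed amount layout: optional currency symbol, comma-grouped or plain
# integer part, optional two-digit decimal part
AMOUNT_RE = re.compile(r"[£$€]?\s?(\d{1,3}(?:,\d{3})*|\d+)(\.\d{2})?")


def parse_amount(raw):
    """Return the amount as a Decimal, or None if the string is malformed."""
    match = AMOUNT_RE.fullmatch(raw.strip()) if raw else None
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", "") + (match.group(2) or ""))


def parse_date(raw, formats=("%d/%m/%Y", "%d-%m-%Y", "%d/%m/%y")):
    """Try each day-first format in turn; return a date or None."""
    if not raw:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None


print(parse_amount("£4,250.00"))   # clean read
print(parse_amount("£4,25O.00"))   # letter O misread for zero: rejected
print(parse_date("15/01/2025"))
print(parse_date("31/02/2025"))    # impossible date: rejected
```

Returning `None` instead of a best guess keeps bad values out of downstream systems; such fields are natural candidates for human review.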

Section 09

Handwriting Recognition — HTR

Handwriting recognition is dramatically harder than printed OCR. No two people write the same character the same way. Ligatures connect letters ambiguously. Baseline wanders. Word spacing is inconsistent. Modern HTR uses sequence-to-sequence Transformers trained on millions of handwritten samples.

✏️
OFFLINE HTR
Static Image Recognition
Recognises text from a static image — a photograph or scan of handwriting. No temporal pen movement information available. Harder, but ubiquitous.
✍️
ONLINE HTR
Pen Trajectory Recognition
Recognises text as the pen moves — capturing x/y coordinates and pressure over time. Much easier (more signal). Used in tablets, digital stylus inputs, and signatures.
🏦
HISTORICAL HTR
Archival Manuscript OCR
Deciphering centuries-old manuscripts with obsolete scripts, ink degradation, and damaged parchment. Specialist models trained on specific script periods. Used in digital humanities research.
⚠️
Handwriting is an Open Problem

Even state-of-the-art HTR models achieve only 3–5% CER on constrained datasets (IAM, RIMES). In the wild — medical prescriptions, field notes, address labels — error rates jump to 15–30%. Medical prescription misreading is literally a patient safety issue. Never deploy HTR in a safety-critical pipeline without a human review stage for low-confidence outputs.


Section 10

Document Layout Analysis — Beyond Characters

Modern documents are not just text. They contain tables, figures, headings, footnotes, headers, sidebars, and multi-column layouts. Understanding structure is as important as reading characters.

📑
Rule-Based Layout
Classical
XY-cut algorithm recursively splits the page into horizontal and vertical sections. Works well on simple two-column academic papers. Breaks on complex financial statements or newspapers.
🎮
Object Detection Layout
Deep Learning
Faster-RCNN or YOLO trained to detect text regions, tables, figures, and captions as object classes. LayoutParser provides pre-trained models for 6 document types including scientific papers.
📊
Table Extraction
Structured Data
TableNet, PaddleOCR, and AWS Textract detect table boundaries, identify row/column intersections, and reconstruct the grid. Output as pandas DataFrame or HTML table. The hardest sub-problem in document AI.
📄
LayoutLM / Donut
Multi-Modal Transformer
LayoutLM jointly encodes OCR text, bounding box positions, and visual features. Donut skips OCR entirely — it reads the document image directly end-to-end. State of the art for document QA and form understanding.
🗂️
PDF Layer Extraction
Native Text
Native digital PDFs contain embedded text layers, so no OCR is needed. PyMuPDF (fitz) or pdfplumber can extract the text with bounding boxes and no recognition errors, as long as the text layer is intact. Only scanned/image PDFs need OCR.
📋
Form Understanding
Key-Value Extraction
IDP (Intelligent Document Processing) platforms map detected text to named fields defined in a template. AWS Textract Queries, Azure Form Recognizer custom models, and Google Document AI specialise here.
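The classical XY-cut idea mentioned above fits in a few dozen lines. The sketch below is a simplified variant under stated assumptions: the page is a binarised grid (list of rows, 1 = ink), a cut is made along the widest empty band, and `min_gap` (a made-up parameter name) is the smallest band that counts as a separator. Real implementations add noise tolerance and work on connected components rather than raw pixels.

```python
def _widest_gap(profile, min_gap):
    """Return (start, end) of the longest run of zeros in `profile`,
    or None if no run is at least `min_gap` long."""
    best, start = None, None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(page, top=0, left=0, bottom=None, right=None, min_gap=2):
    """Recursively split a binary page into (top, left, bottom, right) blocks.

    Bottom and right bounds are exclusive.
    """
    bottom = len(page) if bottom is None else bottom
    right = len(page[0]) if right is None else right

    # Projection profiles: ink count per row and per column of the region
    rows = [sum(page[y][left:right]) for y in range(top, bottom)]
    cols = [sum(page[y][x] for y in range(top, bottom)) for x in range(left, right)]

    inked_rows = [i for i, v in enumerate(rows) if v]
    if not inked_rows:
        return []   # empty region
    inked_cols = [i for i, v in enumerate(cols) if v]

    # Shrink the region to its ink so margins are never mistaken for gaps
    rows = rows[inked_rows[0]:inked_rows[-1] + 1]
    cols = cols[inked_cols[0]:inked_cols[-1] + 1]
    bottom, top = top + inked_rows[-1] + 1, top + inked_rows[0]
    right, left = left + inked_cols[-1] + 1, left + inked_cols[0]

    row_gap = _widest_gap(rows, min_gap)
    col_gap = _widest_gap(cols, min_gap)
    if row_gap is None and col_gap is None:
        return [(top, left, bottom, right)]   # nothing left to split

    row_width = row_gap[1] - row_gap[0] if row_gap else 0
    col_width = col_gap[1] - col_gap[0] if col_gap else 0
    if row_width >= col_width:
        # Horizontal cut: recurse above and below the gap
        return (xy_cut(page, top, left, top + row_gap[0], right, min_gap)
                + xy_cut(page, top + row_gap[1], left, bottom, right, min_gap))
    # Vertical cut: recurse left and right of the gap
    return (xy_cut(page, top, left, bottom, left + col_gap[0], min_gap)
            + xy_cut(page, top, left + col_gap[1], bottom, right, min_gap))


# Toy page: a full-width heading, two blank rows, then two text columns
page = [[1] * 10, [0] * 10, [0] * 10] + [[1, 1, 1, 0, 0, 0, 1, 1, 1, 1] for _ in range(3)]
for block in xy_cut(page):
    print(block)
```

The same recursion is what breaks on newspapers and financial statements: as soon as a rule line, a stamp, or an overlapping figure bridges the white band, the projection profile has no zero run and the cut is missed.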

Section 11

OCR Accuracy — Key Metrics and Benchmarks

Engine / Model | Document Type | CER (%) | WER (%) | Speed | Cost
Tesseract 4 (LSTM) | Clean printed | 0.8 | 2.1 | Medium | Free
EasyOCR | Scene text / multi-lang | 1.2 | 3.4 | Fast (GPU) | Free
PaddleOCR | General / Chinese | 0.6 | 1.8 | Very Fast | Free
AWS Textract | Forms & tables | 0.3 | 0.9 | Fast | $1.50 / 1,000 pages
Google Doc AI | Documents | 0.2 | 0.7 | Fast | $1.50 / 1,000 pages
TrOCR (Handwriting) | Handwritten (IAM) | 3.4 | 9.2 | Medium | Free
Tesseract 4 (Handwriting) | Handwritten | 22.1 | 48.6 | Fast | Free
GPT-4o Vision | Any (LLM-based) | 0.4 | 1.1 | Slow | High
🎯
How to Read This Table

Lower CER and WER are better. PaddleOCR is the best free option for clean documents. TrOCR dominates handwriting among open models. Commercial APIs win on forms and structured documents where layout matters as much as character accuracy. For high-volume pipelines (>100K pages/month), the cost of commercial APIs typically exceeds the cost of self-hosting PaddleOCR on cloud instances.
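The volume claim is worth checking with your own numbers. The sketch below works through the arithmetic using the $1.50 per 1,000 pages price from the table; the self-hosting figure `SELF_HOSTED_MONTHLY` is purely hypothetical and treated as a flat monthly cost, so substitute your own instance and maintenance estimates.

```python
def api_cost(pages_per_month, price_per_1000_pages=1.50):
    """Monthly bill for a pay-per-page API."""
    return pages_per_month / 1000 * price_per_1000_pages


def break_even_pages(self_hosted_monthly_cost, price_per_1000_pages=1.50):
    """Monthly volume at which the API bill equals the self-hosting cost."""
    return self_hosted_monthly_cost / price_per_1000_pages * 1000


# Hypothetical flat self-hosting budget (instance plus upkeep), in dollars
SELF_HOSTED_MONTHLY = 150.0

for pages in (10_000, 100_000, 1_000_000):
    print(f"{pages:>9,} pages/month: API ${api_cost(pages):,.2f} "
          f"vs self-hosted ${SELF_HOSTED_MONTHLY:,.2f}")
print(f"Break-even at {break_even_pages(SELF_HOSTED_MONTHLY):,.0f} pages/month")
```

A flat cost is the simplest possible model; once one instance can no longer keep up, self-hosting costs step up with volume and the break-even point moves.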


Section 12

PaddleOCR — The State-of-the-Art Open Engine

PaddleOCR by Baidu is currently one of the most accurate open-source OCR engines available. It ships with a detection model (DBNet), a direction classifier, and a recognition model (SVTR-based in recent PP-OCR releases), all pre-trained and ready to use in three lines of Python.

from paddleocr import PaddleOCR, draw_ocr
from PIL import Image
import numpy as np

# ── Initialise (downloads ~15MB models on first run) ───────
ocr = PaddleOCR(
    use_angle_cls=True,   # auto-rotate rotated text (upsidedown etc)
    lang='en',            # or 'ch' for Chinese, 'hi' for Hindi
    use_gpu=False
)

# ── Run OCR ────────────────────────────────────────────────
result = ocr.ocr('document.jpg', cls=True)

# ── Print results (PaddleOCR 2.x result format) ────────────
lines = result[0] or []          # result[0] is None when no text is detected
for bbox, (text, score) in lines:
    print(f"[{score:.2%}] {text}")

# ── Visualise with draw_ocr ────────────────────────────────
boxes  = [line[0]      for line in lines]
txts   = [line[1][0]   for line in lines]
scores = [line[1][1]   for line in lines]

# draw_ocr needs a TrueType font to render the text panel; the path below is a
# placeholder, point it at any .ttf file available on your system
FONT_PATH = 'path/to/font.ttf'
img    = Image.open('document.jpg').convert('RGB')
vis    = draw_ocr(img, boxes, txts, scores, font_path=FONT_PATH)
Image.fromarray(vis).save('paddle_result.jpg')
print(f"Total lines detected: {len(lines)}")
OUTPUT
[99.12%] PURCHASE ORDER
[98.87%] PO Number: PO-2025-0042
[97.43%] Supplier: Global Parts Ltd
[96.18%] Delivery Date: 20 Feb 2025
[94.76%] Unit Price: £18.50
[93.21%] Quantity: 200
[98.02%] Total: £3,700.00
Total lines detected: 7

Section 13

TrOCR — Transformer-Based OCR for Handwriting

TrOCR by Microsoft Research (2021) is a pure Transformer architecture: a ViT-style image encoder (initialised from DeiT or BEiT) combined with an autoregressive text decoder (initialised from RoBERTa or MiniLM), trained end-to-end for OCR. At release it reported state-of-the-art results on the IAM handwriting benchmark, with no need for explicit character segmentation.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch

# ── Load model ─────────────────────────────────────────────
# Options: trocr-base-printed | trocr-large-printed
#          trocr-base-handwritten | trocr-large-handwritten
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-handwritten')
model     = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-handwritten')

# ── Load and prepare image ─────────────────────────────────
# Crop to a single line of handwriting first
image       = Image.open('handwritten_line.png').convert('RGB')
pixel_values = processor(images=image, return_tensors='pt').pixel_values

# ── Generate text (beam search with 4 beams) ──────────────
with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        num_beams=4,
        max_new_tokens=128
    )

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Recognised: {generated_text}")
OUTPUT
Recognised: The quick brown fox jumps over the lazy dog

Section 14

Common OCR Failure Modes

🔍 Low Resolution
Input: 72 DPI scan
Result: "Invoice" is read as "lnv0ice"
Fix: Upscale to 300 DPI before OCR
🔄 Skew / Rotation
Input: 5° rotated document
Result: Words merged across lines
Fix: Deskew with Hough transform
🌓 Shadows / Lighting
Input: Phone photo with flash shadow
Result: Words in dark area missed
Fix: Adaptive binarisation (Sauvola)
🍂 Serif vs Sans Confusion
Confused pairs: rn vs m; l vs I vs 1; 0 vs O vs o
Fix: Language model post-correction
📋 Dense Table Cells
Adjacent cells merge in bounding box
Numbers shift columns silently
Fix: Dedicated table detector (TableNet)
✏️ Cursive Handwriting
Characters are connected
Segmentation fails at word boundaries
Fix: TrOCR or HTR-specific model

Section 15

OCR vs. Vision-Language Models — When to Use What

Requirement | Traditional OCR | Vision-Language Model
Throughput (pages/min) | High (50–500 pages/min) | Low (1–10 pages/min)
Cost at scale | Low (self-hosted, free) | High (per-token API cost)
Handwriting accuracy | Medium (TrOCR) / Poor (Tesseract) | Very High (GPT-4o)
Layout understanding | Requires separate tool | Built-in, zero config
Data privacy | On-premise possible | Data leaves your network
Multi-language out of box | PaddleOCR/EasyOCR: 80+ langs | 100+ languages, zero config
Degraded historical docs | Struggles without fine-tuning | Surprisingly robust
Structured output (JSON) | Needs post-processing code | Prompt-based, zero code
Offline / air-gapped | Yes | No (API-dependent)
🏆
The Practitioner's Decision Rule

If you process <1,000 pages/month and need flexibility — use a Vision-Language Model (GPT-4o, Claude Vision) via API. If you process >50,000 pages/month or have data privacy requirements — self-host PaddleOCR for printed text or TrOCR for handwriting. For structured documents with tables and forms at scale — evaluate AWS Textract or Google Document AI and build in a cost model at your projected volume.


Section 16

Fine-Tuning TrOCR on Custom Data

Out-of-the-box models struggle with specialist fonts, domain vocabulary (medical, legal), or unusual scripts. Fine-tuning TrOCR on as few as 500 labelled examples can cut error rates by 40–60% in domain-specific tasks.

from transformers import (TrOCRProcessor, VisionEncoderDecoderModel,
                            Seq2SeqTrainer, Seq2SeqTrainingArguments)
from torch.utils.data import Dataset
from PIL import Image
import torch, json

# ── Dataset: JSONL with {"image_path": ..., "text": ...} ──
class OCRDataset(Dataset):
    def __init__(self, jsonl_path, processor, max_length=128):
        with open(jsonl_path, encoding='utf-8') as f:   # one JSON object per line
            self.data = [json.loads(line) for line in f if line.strip()]
        self.processor = processor
        self.max_length = max_length

    def __len__(self):  return len(self.data)

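    def __getitem__(self, idx):
        item  = self.data[idx]
        image = Image.open(item['image_path']).convert('RGB')
        # Encode image and text separately so tokenizer-only arguments
        # (padding, max_length) are not forwarded to the image processor
        pixel_values = self.processor(images=image, return_tensors='pt').pixel_values
        labels = self.processor.tokenizer(
            item['text'], padding='max_length',
            max_length=self.max_length, truncation=True
        ).input_ids
        # Replace padding with -100 so it is ignored by the loss
        pad_id = self.processor.tokenizer.pad_token_id
        labels = [tok if tok != pad_id else -100 for tok in labels]
        return {'pixel_values': pixel_values.squeeze(0), 'labels': torch.tensor(labels)}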

# ── Load base model ────────────────────────────────────────
MODEL_ID  = 'microsoft/trocr-base-printed'
processor = TrOCRProcessor.from_pretrained(MODEL_ID)
model     = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id           = processor.tokenizer.pad_token_id
model.config.eos_token_id           = processor.tokenizer.sep_token_id

train_ds = OCRDataset('train.jsonl', processor)
eval_ds  = OCRDataset('eval.jsonl',  processor)

# ── Training arguments ─────────────────────────────────────
args = Seq2SeqTrainingArguments(
    output_dir='./trocr-finetuned',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),   # mixed precision needs a CUDA GPU
    learning_rate=5e-5,
    eval_strategy='epoch',            # named evaluation_strategy in older transformers releases
    save_strategy='epoch',            # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,      # "best" means lowest eval loss by default
)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
trainer.save_model('./trocr-finetuned')   # write the best checkpoint to the output directory
print("Fine-tuning complete. Model saved to ./trocr-finetuned")
OUTPUT
{'train_runtime': 412.8s, 'train_samples_per_second': 9.6, 'epoch': 10.0}
Eval loss: 0.0312
CER: 1.84%
WER: 4.21%
Fine-tuning complete. Model saved to ./trocr-finetuned

Section 17

Golden Rules of Production OCR

🔬 OCR in Production — Non-Negotiable Rules
1
Always validate input resolution. Reject or flag images below 150 DPI before they enter your pipeline. Downstream errors from low-resolution inputs are nearly impossible to recover — far better to flag at ingestion and request a better scan.
2
Pre-process before you recognise, not after. Every pre-processing step (deskew, denoise, binarise) compounds: a 5% improvement at each of these three stages gives the recogniser a document that is about 16% better (1.05³ ≈ 1.16), and five such stages would reach about 28%. Never skip pre-processing on photographed documents.
3
Check for native text in PDFs first. PDF documents produced digitally contain an embedded text layer. Always try pdfplumber or pymupdf first. Only invoke your OCR pipeline if the text layer is absent or garbled. This saves 100% of OCR compute on digital-native PDFs.
4
Log and monitor confidence scores. Every modern OCR engine returns a confidence score per word. Track the distribution in production. A sudden drop in average confidence signals a document quality change, a scanner miscalibration, or an unexpected document type — before it becomes a data quality incident.
5
Route low-confidence outputs to human review. Set a threshold (typically confidence <80%). Flag entire documents — or individual fields — below the threshold for human verification. A 99% automated system with 100% accuracy on its 99% is always better than a 100% automated system with 97% accuracy overall.
6
Evaluate on your own data, not published benchmarks. IAM (English) and RIMES (French) are clean, constrained handwriting datasets. Your invoices, your forms, your domain vocabulary are different. Build a small gold-standard test set (200–500 documents) in your domain and use it to compare engines and validate improvements.
7
For handwriting or highly degraded documents, always prefer TrOCR-large over Tesseract. The accuracy gap is 5–10× on real-world handwriting. Commercial APIs (Google, AWS) are worth the cost for safety-critical pipelines (medical records, legal documents) where errors have real consequences.
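Rules 4 and 5 can be sketched together: split extracted fields on a confidence threshold and track the average so a drift is visible. The `Field` dataclass and function names below are illustrative, and the 0.80 threshold is the typical value quoted in rule 5, not a recommendation for any specific engine.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Field:
    name: str
    value: str
    confidence: float   # 0.0 to 1.0, as reported by the OCR engine


def route(fields, threshold=0.80):
    """Split fields into (auto_accepted, needs_review) by confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review


def mean_confidence(fields):
    """Average confidence for one document; log this to spot quality drift."""
    return mean(f.confidence for f in fields) if fields else 0.0


document = [
    Field("invoice_number", "INV-20241015", 0.97),
    Field("date", "15/01/2025", 0.62),
    Field("total_amount", "£4,250.00", 0.91),
]
accepted, review = route(document)
print("Auto-accepted:", [f.name for f in accepted])
print("Human review: ", [f.name for f in review])
print(f"Mean confidence: {mean_confidence(document):.2f}")
```

Routing per field rather than per document keeps reviewers focused on the one uncertain value instead of re-keying the whole invoice.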