Three clients, three different industries, nearly identical problems. A London law firm could not extract clause information from contracts arriving in mixed formats (PDF, DOCX, scanned images). A Manchester manufacturer needed structured data from supplier invoices that varied wildly in layout. A Mumbai construction company had to parse project specifications from hand-annotated PDFs and mixed-language drawings.
Each client had tried a vendor solution. Each had found that the vendor handled their "standard" documents well and failed on the twenty percent that deviated from the template. That twenty percent contained exactly the documents they most needed to process correctly.
Here is what I built for each of them, and what I learned combining vision models, traditional OCR, and validation in a single pipeline.
The Problem With Single-Technology Solutions
Traditional OCR (Tesseract, Azure Form Recognizer, AWS Textract) excels at structured, consistent layouts. A standard invoice template with defined field positions: OCR handles this reliably at low cost.
Vision LLMs (GPT-4o Vision, Claude 3.5 Sonnet) excel at understanding unstructured and variable layouts. A contract where the indemnity clause is in paragraph 7 of one document and paragraph 23 of another: vision models understand context and can locate the relevant section regardless of position.
The mistake I see most often: teams pick one technology and apply it to all documents. The result is either expensive (Vision LLM on every standard invoice) or inaccurate (OCR on every variable-layout contract).
The solution is a hybrid pipeline that routes documents to the appropriate technology based on their properties.
The Three-Layer Architecture
Every extraction pipeline I build now follows this structure:
```
Layer 1: Document Classification
    Input:  raw document (PDF, image, DOCX)
    Output: document_type, layout_type, language, quality_score

Layer 2: Extraction (routed by classification)
    Structured + high quality  --> Template OCR (Form Recognizer / Textract)
    Structured + low quality   --> Pre-process + Tesseract
    Semi-structured            --> Vision LLM with schema prompt
    Unstructured / handwritten --> Vision LLM + structured output enforcement

Layer 3: Validation
    Schema validation      --> required fields, format checks
    Cross-field validation --> business rule checks
    Confidence scoring     --> flag fields below threshold
    Exception routing      --> pass/fail decision with human review queue
```
The classifier is itself a lightweight model (I use a fine-tuned DistilBERT for layout classification plus a simple heuristic for quality scoring based on image DPI and contrast). It adds roughly 200ms to the pipeline and reduces LLM costs by 60-70% by keeping structured documents out of the expensive vision model path.
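To make the quality heuristic and the routing concrete, here is a minimal sketch. The weights, thresholds, and route names are illustrative, not the production values; the real layout label comes from the fine-tuned DistilBERT, which this sketch takes as an input.

```python
import numpy as np
from PIL import Image

def quality_score(page: Image.Image) -> float:
    """Heuristic quality score in [0, 1] from DPI and contrast."""
    dpi = page.info.get("dpi", (72, 72))[0]
    gray = np.asarray(page.convert("L"), dtype=np.float32)
    contrast = min(gray.std() / 128.0, 1.0)   # std-dev contrast, normalised
    dpi_score = min(dpi / 300.0, 1.0)         # 300 DPI or better scores 1.0
    return 0.5 * dpi_score + 0.5 * contrast

def route(layout_type: str, score: float) -> str:
    """Map Layer 1 output to a Layer 2 extractor (thresholds illustrative)."""
    if layout_type == "structured":
        return "template_ocr" if score >= 0.7 else "preprocess_tesseract"
    if layout_type == "semi_structured":
        return "vision_llm_schema_prompt"
    return "vision_llm_structured_output"     # unstructured / handwritten
```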
Case Study 1: London Law Firm (Contract Clause Extraction)
The problem: 800-1,200-page contracts arriving as scanned PDFs, digital PDFs, and DOCX files. The firm needed to extract: parties, effective date, governing law clause, indemnity cap, limitation of liability clause, termination triggers, and notice requirements.
What did not work: A single GPT-4o Vision prompt over the entire document. Token limits meant chunking the document, which caused context loss across chunks. The model would correctly identify the indemnity cap in one chunk but miss the modifier clause three pages later that reduced it.
What worked: A two-pass approach.
```python
def extract_contract_clauses(document: bytes, doc_type: str) -> dict:
    # Pass 1: locate section headers and build a document map
    section_map = locate_sections(document)
    # Returns: {section_name: (start_page, end_page), ...}

    # Pass 2: extract each target clause from its located section
    results = {}
    target_clauses = [
        "indemnity", "limitation_of_liability", "termination",
        "governing_law", "notices",
    ]
    for clause in target_clauses:
        if clause in section_map:
            start, end = section_map[clause]
            # Extract only the relevant pages for this clause
            clause_pages = extract_pages(document, start, end)
            results[clause] = vision_extract(
                clause_pages,
                schema=CLAUSE_SCHEMAS[clause],
            )
        else:
            # Clause not found in expected location; flag for human review
            results[clause] = {"value": None, "confidence": 0.0,
                               "flag": "section_not_found"}
    return results
```
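The `locate_sections` pass is where the savings come from, and it is not shown above. A plausible minimal version, assuming the OpenAI Python SDK and a hypothetical `page_headings` helper that pulls candidate headings per page (neither is the firm's actual code), runs a cheap model over just the headings:

```python
import json
from openai import OpenAI  # assumption: OpenAI Python SDK

client = OpenAI()

def locate_sections(document: bytes) -> dict[str, tuple[int, int]]:
    """Pass 1: ask a cheap model for a section map from page-level headings."""
    headings = page_headings(document)  # hypothetical: [(page_no, heading), ...]
    prompt = (
        "Given these contract headings with page numbers, return JSON mapping "
        "section names (indemnity, limitation_of_liability, termination, "
        "governing_law, notices) to [start_page, end_page]:\n"
        + "\n".join(f"p{p}: {h}" for p, h in headings)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    raw = json.loads(resp.choices[0].message.content)
    return {name: tuple(pages) for name, pages in raw.items()}
```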
The two-pass approach reduced token costs by 65% versus full-document extraction and improved accuracy from 81% to 94% on the target clauses.
Monthly numbers at steady state: 340 contracts processed. Cost breakdown:
| Component | Monthly Cost |
|---|---|
| Document classification | $8 |
| Section location pass (GPT-4o mini) | $47 |
| Clause extraction pass (GPT-4o) | $189 |
| Validation + human review (6% of docs) | $94 (staff time) |
| Infrastructure | $35 |
| Total | $373 |
Previous manual extraction cost: $4,200/month. Payback period on the $9,000 build: 2.5 months.
Case Study 2: Manchester Manufacturer (Invoice Processing)
The problem: 2,800 invoices per month from 340 suppliers, each with their own format. Roughly 400 invoices per month (14%) deviated from any template the OCR system had been trained on.
What worked: Confidence-gated routing. Every invoice goes through Azure Form Recognizer first. If overall confidence is above 0.88, the output is accepted and validated. If confidence is below 0.88, the invoice is routed to GPT-4o Vision for re-extraction.
```yaml
pipeline_config:
  primary_extractor:
    tool: azure_form_recognizer
    model: prebuilt-invoice
    confidence_threshold: 0.88
  fallback_extractor:
    tool: gpt4o_vision
    prompt_template: invoice_extraction_v3
    output_schema: invoice_schema_v2
    confidence_threshold: 0.75
  human_review_triggers:
    - primary_confidence_below: 0.88    # both conditions must hold
      fallback_confidence_below: 0.75
    - total_amount_above: 50000
    - vendor_not_in_approved_list: true
    - cross_field_validation_failed: true
```
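In code, the gate itself is a few lines. A minimal sketch, with `form_recognizer_extract`, `vision_extract`, and `queue_for_review` as stand-in names rather than the client's actual functions:

```python
def extract_invoice(pdf: bytes) -> dict:
    """Confidence-gated routing: cheap extractor first, vision fallback."""
    primary = form_recognizer_extract(pdf)           # Azure prebuilt-invoice
    if primary["confidence"] >= 0.88:
        return primary                               # most invoices stop here

    fallback = vision_extract(pdf, schema="invoice_schema_v2")
    if fallback["confidence"] >= 0.75:
        return fallback                              # re-extracted by GPT-4o

    return queue_for_review(pdf, primary, fallback)  # both gates failed
```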
Result: 86% of invoices processed automatically by Form Recognizer ($10/1,000 pages). 10% re-processed by Vision LLM ($8/1,000 pages). 4% routed to human review. Blended cost: $0.41 per invoice, down from $1.20 per invoice with the previous all-Vision-LLM approach.
Case Study 3: Mumbai Construction (Project Spec Extraction)
The problem: Project specification PDFs with hand annotations, tables, mixed Hindi and English text, and engineering drawings with text callouts. Standard OCR failed on the handwriting. Vision LLMs handled the text well but hallucinated on the engineering drawings.
What worked: Separating text regions from drawing regions before extraction.
The pipeline classifies each page as: typed-text, handwritten-text, engineering-drawing, table, or mixed. Typed-text goes to Tesseract (fast, cheap). Handwritten-text goes to GPT-4o Vision. Engineering-drawing pages are extracted as images and sent to a separate pipeline that extracts text callouts using a specialized vision prompt, then discards the drawing itself (the drawing coordinates and symbols were out of scope for this client).
The key insight: do not ask one model to handle everything on a page that has fundamentally different content types. Segment first, then route each segment to the model suited to it.
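A minimal sketch of that routing, with `classify_page` and the `EXTRACTORS` registry as illustrative stand-ins (the real per-page classifier on this project was a fine-tuned layout model):

```python
PAGE_ROUTES = {
    "typed_text": "tesseract",                  # fast, cheap
    "handwritten_text": "gpt4o_vision",
    "table": "gpt4o_vision",
    "engineering_drawing": "callout_pipeline",  # text callouts only
    "mixed": "segment_then_route",              # split into regions, recurse
}

def process_spec(pages: list) -> list[dict]:
    """Classify each page, then send it to the extractor suited to it."""
    results = []
    for page in pages:
        page_type = classify_page(page)         # one of the PAGE_ROUTES keys
        extractor = EXTRACTORS[PAGE_ROUTES[page_type]]
        results.append(extractor(page))
    return results
```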
The Validation Layer That Applies to All Three
Regardless of the extraction technology, every pipeline ends with the same validation sequence (a minimal sketch follows the list):
- Schema validation: all required fields present and correctly typed
- Business rule validation: cross-field logic checks specific to the document type
- Anomaly detection: statistically unusual values flagged for human review
- Confidence aggregation: any field below 0.80 confidence triggers review
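Here is that sequence as a sketch, using an invoice-shaped schema purely for illustration; `BUSINESS_RULES` and `is_anomalous` are stand-ins for the document-type-specific pieces, and the field layout matches the `{"value": ..., "confidence": ...}` shape used earlier.

```python
REQUIRED_FIELDS = {"invoice_number": str, "total": float, "vendor": str}

def validate(extraction: dict) -> dict:
    """Run the four-step validation sequence on one extracted document."""
    flags = []

    # 1. Schema validation: required fields present and correctly typed
    for field, ftype in REQUIRED_FIELDS.items():
        value = extraction.get(field, {}).get("value")
        if value is None or not isinstance(value, ftype):
            flags.append(f"schema:{field}")

    # 2. Business rules: cross-field logic specific to the document type
    for name, check in BUSINESS_RULES:        # list of (name, predicate) pairs
        if not check(extraction):
            flags.append(f"rule:{name}")

    # 3. Anomaly detection: statistically unusual values
    if is_anomalous(extraction):              # e.g. z-score vs. historical data
        flags.append("anomaly")

    # 4. Confidence aggregation: any field below 0.80 triggers review
    for field, data in extraction.items():
        if data.get("confidence", 1.0) < 0.80:
            flags.append(f"confidence:{field}")

    return {"passed": not flags, "flags": flags}
```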
The validation layer catches errors that the extraction model gets confidently wrong, which are the most dangerous errors in any extraction pipeline. A low-confidence extraction is easy to flag. A high-confidence wrong extraction requires domain-specific validation rules to catch.
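One concrete example of such a rule, shaped like the Manchester invoices (field names hypothetical): the model can report high confidence on a total that contradicts the line items it extracted alongside it, and only arithmetic catches the contradiction.

```python
def line_items_sum_to_total(extraction: dict, tolerance: float = 0.01) -> bool:
    """Cross-field rule: catches a confidently-wrong total.

    An extractor can report 0.99 confidence on `total` and still have
    misread a digit; only this check notices the inconsistency.
    """
    lines = extraction["line_items"]["value"]     # list of {"amount": ...}
    claimed = extraction["total"]["value"]
    return abs(sum(li["amount"] for li in lines) - claimed) <= tolerance
```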
What I Got Wrong
On the Mumbai project, I initially tried to extract all information from mixed pages in a single Vision LLM call. The model consistently confused text from drawings with text from specifications, attributing load-bearing measurements to the wrong structural elements. Segmenting the page types first and routing them separately was the fix, but it took three weeks of debugging before I identified the missing segmentation step, rather than the prompt, as the problem.
The lesson: when a vision model is making consistent errors on a specific content type, the problem is almost always the input, not the prompt.