Achieving 93% structured extraction accuracy for complex invoices
- 93% extraction accuracy.
- 1,100+ formats, including handwritten.
- 3x faster manual verification.

Background
The client was developing a proprietary SaaS platform to deliver enhanced financial services to its SME customers. A key success factor for the platform was the ability to accurately extract structured data from invoices. Because those customers' invoices came in a wide variety of formats, from complex digital layouts to handwritten documents from cash-and-carry businesses, the extraction system needed to be highly robust and adaptable.
Business Challenges
Without a reliable automated extraction system, a substantial portion of invoices required manual annotation, resulting in significant inefficiencies and processing delays. This manual intervention also increased operational costs, as processing large volumes of invoices became both time-consuming and expensive.
Technical Challenges
The client worked with thousands of suppliers, each using different invoice templates and formats. Handwritten text and low-quality scans compounded the difficulty, and many invoices spanned multiple pages, which added another layer of complexity to the extraction process.
Solution Principles
To address these challenges, the solution was designed around four key principles:
- Accuracy: Ensure precise data extraction from diverse invoice formats.
- Flexibility: Adapt automatically to new and unseen formats without manual rule creation.
- Scalability: Handle large volumes of invoices efficiently.
- Cost Efficiency: Maintain manageable operating costs that don’t scale linearly with usage.
Approach
We began by assembling a diverse dataset that covered both system-generated and handwritten invoices from thousands of suppliers. With this dataset, we systematically evaluated existing solutions, from traditional OCR models to vision-language models (VLMs), to establish a performance baseline.
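A baseline of this kind needs a scoring rule. The sketch below shows one common choice, per-field exact-match accuracy after light normalization. It is an assumption for illustration rather than the metric used in the project, and the function names and field keys (`supplier_name`, `vat`) are hypothetical.

```python
from typing import Dict, List, Optional

Invoice = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as extraction errors."""
    if value is None:
        return ""
    return " ".join(value.lower().split())


def field_accuracy(
    predictions: List[Invoice],
    references: List[Invoice],
    fields: List[str],
) -> Dict[str, float]:
    """Return exact-match accuracy per field plus an 'overall' micro-average."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references or not fields:
        raise ValueError("need at least one invoice and one field")

    correct = {name: 0 for name in fields}
    for pred, ref in zip(predictions, references):
        for name in fields:
            if normalize(pred.get(name)) == normalize(ref.get(name)):
                correct[name] += 1

    n = len(references)
    scores = {name: correct[name] / n for name in fields}
    scores["overall"] = sum(correct.values()) / (n * len(fields))
    return scores
```

Scoring every candidate system with the same function makes OCR-based and VLM-based baselines directly comparable, field by field.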
Problems with Existing Tools
- OCR models (Tesseract, PaddleOCR, etc.): These tools return raw text rather than labelled fields, so accurately mapping specific fields such as supplier name, discount amount, or VAT across thousands of varying invoice formats remained a major technical challenge.
- Vision-language models (e.g., GPT-4): While more flexible, these models often produced incorrect values on large invoices, hallucinated missing fields, and lacked reproducibility.
Although vision-language models offered the flexibility needed to handle diverse invoice formats, their accuracy remained below production standards, even after extensive benchmarking and few-shot tuning.
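One way to surface hallucinated values of the kind described above is to check that each extracted value actually appears in a text transcription of the invoice. The sketch below assumes such a transcription is available (for example from an OCR pass). The function name `ungrounded_fields` and the field keys are hypothetical, and this is not presented as the check used in the project.

```python
import re
from typing import Dict, List, Optional


def _normalize(text: str) -> str:
    """Lowercase and keep only ASCII letters and digits, so spacing and
    punctuation differences do not block a match."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def ungrounded_fields(
    extracted: Dict[str, Optional[str]],
    source_text: str,
) -> List[str]:
    """Return the names of fields whose value cannot be found in the source text.

    Fields set to None are skipped: the model reported them as absent,
    which is the desired behaviour for a missing field.
    """
    haystack = _normalize(source_text)
    flagged = []
    for name, value in extracted.items():
        if value is None:
            continue
        needle = _normalize(value)
        if not needle or needle not in haystack:
            flagged.append(name)
    return flagged
```

Because the match is a plain substring test on normalized text, it is a coarse filter: it can miss values that are reformatted (for example dates) and is best treated as a trigger for manual review rather than a verdict.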
Our Solution
Our strategy focused on developing a bespoke, multi-stage AI pipeline that combined the strengths of OCR systems with fine-tuned vision-language models.
Step 1: Selecting the Right Base Model
We first evaluated several open-source vision-language models (VLMs) with fewer than 32B parameters to identify the optimal base for fine-tuning. The goal was to balance performance, adaptability, and cost efficiency.
Step 2: Fine-Tuning the Vision-Language Model
We then fine-tuned the top-performing VLMs on the client’s specific dataset. This improved the model’s ability to understand invoice structure and map relevant fields. However, text recognition errors from the visual input persisted, particularly in noisy or handwritten documents.
The Breakthrough: A Two-Stage OCR + Fine-Tuned SLM Pipeline
Recognizing the limitations of a single-model approach, we engineered a highly efficient two-stage pipeline that became the core of the client’s new feature.
Stage 1: High-Fidelity Text Extraction with OCR
We fine-tuned an OCR model using thousands of the client’s diverse documents, with particular emphasis on handwritten invoices. This model became exceptionally skilled at converting even the most complex and messy documents into clean text, forming a solid foundation for the next stage.
Stage 2: Structured Extraction with a Fine-Tuned Small Language Model (SLM)
The text output from the OCR was then passed into a fine-tuned 7-billion-parameter Small Language Model (SLM). Unlike a conventional text parser, this model was trained to understand context and structure, allowing it to accurately identify and extract only the required invoice fields.
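The two stages can be sketched as a small function that takes the OCR model and the SLM as plain callables. This is a minimal illustration under stated assumptions, not the client's implementation: the `ocr` and `slm` callables, the prompt wording, the JSON output format, and the field keys in `REQUIRED_FIELDS` are all hypothetical.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical field keys, based on the examples named earlier in the text.
REQUIRED_FIELDS = ["supplier_name", "discount_amount", "vat"]

OcrFn = Callable[[bytes], str]  # page image bytes -> recognized text
SlmFn = Callable[[str], str]    # prompt -> raw model output (expected to be JSON)


def build_prompt(pages_text: List[str]) -> str:
    """Join the OCR text of all pages, marking page boundaries so that
    multi-page invoices stay in reading order."""
    joined = "\n\n".join(
        f"[page {i}]\n{text}" for i, text in enumerate(pages_text, start=1)
    )
    return (
        "Extract the following fields as a JSON object with keys "
        f"{REQUIRED_FIELDS}. Use null for any field that is not present.\n\n"
        f"{joined}"
    )


def extract_invoice(
    pages: List[bytes], ocr: OcrFn, slm: SlmFn
) -> Dict[str, Optional[str]]:
    """Stage 1: OCR every page. Stage 2: ask the SLM for the required fields."""
    pages_text = [ocr(page) for page in pages]
    raw_output = slm(build_prompt(pages_text))

    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    # Keep only the required fields; anything missing becomes None.
    return {name: parsed.get(name) for name in REQUIRED_FIELDS}
```

Passing the models in as callables keeps the sketch independent of any particular OCR or inference library, and lets each stage be replaced or tested on its own.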
Combining the OCR's precision with the SLM's contextual reasoning delivered high extraction accuracy while keeping inference costs low. Self-hosting the pipeline also eliminated the variable per-document fees charged by third-party APIs. The cost shifted from an unpredictable external expense to a predictable internal compute cost, ensuring the unit economics worked at scale.
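The difference between the two cost models can be illustrated with simple arithmetic. The sketch below assumes a flat per-document API fee and a fixed monthly cost per GPU with a known document throughput. The function and parameter names are hypothetical, and any figures passed in are placeholders rather than the client's actual costs.

```python
import math


def api_monthly_cost(docs_per_month: int, fee_per_doc: float) -> float:
    """Third-party API: cost grows linearly with document volume."""
    return docs_per_month * fee_per_doc


def self_hosted_monthly_cost(
    docs_per_month: int,
    docs_per_gpu_month: int,
    gpu_month_cost: float,
) -> float:
    """Self-hosted: pay for whole GPUs, so cost rises in steps as
    capacity is added, not with every additional document."""
    gpus_needed = max(1, math.ceil(docs_per_month / docs_per_gpu_month))
    return gpus_needed * gpu_month_cost


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    for volume in (10_000, 100_000, 500_000):
        api = api_monthly_cost(volume, fee_per_doc=0.25)
        hosted = self_hosted_monthly_cost(
            volume, docs_per_gpu_month=60_000, gpu_month_cost=1200.0
        )
        print(f"{volume:>8} docs/month  API: {api:>10.2f}  self-hosted: {hosted:>10.2f}")
```

Under these assumptions the self-hosted cost per document falls as each GPU fills up, while the API cost per document stays constant. Whether self-hosting wins at a given volume depends entirely on the real fee, hardware cost, and throughput.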
Results: A Reliable AI Invoice Extractor
Our custom-built AI engine became the foundation of the client’s invoice processing system, delivering transformative business outcomes and a distinct competitive edge.
Performance Highlights
- 93% Extraction Accuracy: The hybrid OCR + SLM pipeline reached 93% extraction accuracy across more than 1,100 invoice formats, including handwritten ones.
- Outperformed Proprietary Models: The solution significantly exceeded the accuracy of leading proprietary models.
- Cost-Efficient at Scale: By leveraging smaller, open-source models, the system maintained low inference costs and strong scalability. This ensured the unit economics remained highly favorable as the business scaled, a key strategic advantage over competitors reliant on expensive third-party services.
If your organization faces similar challenges with complex document extraction, high operational costs, or the limitations of generic AI models, reach out to us today. Our team specializes in designing and building cost-efficient AI pipelines that deliver measurable results and a lasting competitive edge.