Struktur

Extracting Invoices at Scale

Real-world example: processing 10,000 invoices

This walkthrough uses Struktur to process 10,000 invoices from 50 different vendors. It covers schema design, strategy selection, error handling, and cost analysis.

The Problem

Input:

  • 10,000 PDF invoices
  • 50 different vendors
  • Formats vary wildly
  • Some handwritten notes
  • Multi-page invoices
  • Missing fields common

Output needed:

  • Structured JSON per invoice
  • Validated against schema
  • Loaded into database
  • Error reports for failures

Schema Design

Start with what you need to extract:

import { Type } from '@sinclair/typebox';

const InvoiceSchema = Type.Object({
  // Identifiers
  invoiceNumber: Type.String(),
  vendor: Type.Object({
    name: Type.String(),
    address: Type.Optional(Type.String()),
  }),
  
  // Dates
  invoiceDate: Type.String({ format: 'date' }),
  dueDate: Type.Optional(Type.String({ format: 'date' })),
  
  // Line items
  lineItems: Type.Array(Type.Object({
    description: Type.String(),
    quantity: Type.Number(),
    unitPrice: Type.Number(),
    total: Type.Number(),
  })),
  
  // Totals
  subtotal: Type.Number(),
  tax: Type.Optional(Type.Number()),
  total: Type.Number(),
  
  // Metadata
  currency: Type.String({ default: 'USD' }),
  notes: Type.Optional(Type.String()),
});

Design Decisions

Optional vs Required:

  • dueDate is optional — not all invoices have it
  • tax is optional — some vendors don't itemize
  • notes is optional — capture handwritten notes if present

Nested Objects:

  • vendor is nested — cleaner than flat fields
  • lineItems is an array — variable length

Types:

  • All amounts are number — easier for calculations
  • Dates are ISO strings — standard format
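
Keeping amounts as numbers also makes it possible to cross-check an extracted invoice arithmetically, which schema validation alone does not do. The sketch below is only an illustration under stated assumptions: the interface names, the helper names (approxEqual, checkTotals), and the 0.01 tolerance are hypothetical and not part of Struktur.

```typescript
// Hypothetical shapes mirroring the numeric fields of InvoiceSchema.
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
  total: number;
}

interface InvoiceTotals {
  lineItems: LineItem[];
  subtotal: number;
  tax?: number;
  total: number;
}

// Compare two amounts with a tolerance (in currency units) to absorb
// floating-point error and vendor rounding. The default is an assumption.
function approxEqual(a: number, b: number, tolerance = 0.01): boolean {
  return Math.abs(a - b) <= tolerance;
}

// Returns a list of human-readable problems; an empty list means the
// amounts are internally consistent.
function checkTotals(invoice: InvoiceTotals): string[] {
  const problems: string[] = [];

  invoice.lineItems.forEach((item, index) => {
    if (!approxEqual(item.quantity * item.unitPrice, item.total)) {
      problems.push(`lineItems[${index}]: quantity * unitPrice does not match total`);
    }
  });

  const itemSum = invoice.lineItems.reduce((sum, item) => sum + item.total, 0);
  if (!approxEqual(itemSum, invoice.subtotal)) {
    problems.push("subtotal does not match the sum of line item totals");
  }

  // tax is optional in the schema; treat a missing value as zero.
  if (!approxEqual(invoice.subtotal + (invoice.tax ?? 0), invoice.total)) {
    problems.push("total does not match subtotal + tax");
  }

  return problems;
}
```

An invoice that passes schema validation but returns a non-empty list here is a reasonable candidate for human review. Vendors that apply discounts or shipping outside the line items would need extra fields before this check is meaningful.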

Strategy Selection

For invoices, we chose parallelAutoMerge:

const result = await extract({
  artifacts: [{ path: invoicePath }],
  schema: InvoiceSchema,
  strategy: 'parallelAutoMerge',
});

Why parallelAutoMerge?

  • Invoices are typically 1-3 pages — small enough for parallel
  • Line items are independent — no cross-chunk context needed
  • Auto-merge handles duplicate line items
  • Speed matters at 10,000 documents

Alternative considerations:

  • simple — Works for single-page invoices, fails for multi-page
  • sequential — Overkill for invoices, slower
  • agent — Useful if invoice structure varies wildly
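
If a pipeline did want to mix strategies instead of using parallelAutoMerge everywhere, the trade-offs listed above could be encoded as a small selection rule. This is a hypothetical sketch: the DocumentHints fields and the pickStrategy helper are invented for illustration, and only the strategy names come from this walkthrough.

```typescript
// Strategy names as they appear in this walkthrough.
type Strategy = "simple" | "parallelAutoMerge" | "sequential" | "agent";

// Hypothetical per-document hints; neither field comes from Struktur.
interface DocumentHints {
  pageCount: number;
  // True when the vendor's layout is unknown or changes between invoices.
  irregularLayout: boolean;
}

// Encodes the considerations above: agent for wildly varying structure,
// simple for single-page documents, parallelAutoMerge otherwise.
// sequential is never chosen because it is described as overkill here.
function pickStrategy(hints: DocumentHints): Strategy {
  if (hints.irregularLayout) return "agent";
  if (hints.pageCount <= 1) return "simple";
  return "parallelAutoMerge";
}
```

In practice the page count would have to come from a PDF library before extraction, which adds its own cost, so using one strategy for every document, as this walkthrough does, is the simpler default.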

Handling Edge Cases

Missing Fields

Some vendors don't include all fields:

// Vendor A: complete invoice
{
  invoiceNumber: "INV-001",
  invoiceDate: "2024-03-15",
  dueDate: "2024-04-15",  // present
  total: 1234.56
}

// Vendor B: minimal invoice
{
  invoiceNumber: "12345",
  invoiceDate: "2024-03-15",
  // no dueDate
  total: 567.89
}

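The schema handles this with optional fields: validation still passes when an optional field such as dueDate is missing. (The objects above are abbreviated and omit other required fields for brevity.)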

Multi-Page Invoices

Some invoices span multiple pages:

Page 1: Header, vendor info, first 10 line items
Page 2: Remaining line items, totals

The parallelAutoMerge strategy handles this:

  • Chunk 1 extracts first 10 items
  • Chunk 2 extracts remaining items + totals
  • Auto-merge combines line items
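
Struktur's actual merge logic is not shown in this walkthrough. Purely to illustrate the idea of combining per-chunk line items and dropping duplicates, here is a minimal sketch; lineItemKey and mergeLineItems are hypothetical helpers, and the duplicate rule (same normalized description and same amounts) is an assumption.

```typescript
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
  total: number;
}

// Build a key that treats items as duplicates when their description
// (ignoring case and extra whitespace) and all amounts match.
function lineItemKey(item: LineItem): string {
  const description = item.description.trim().replace(/\s+/g, " ").toLowerCase();
  return [description, item.quantity, item.unitPrice, item.total].join("|");
}

// Concatenate per-chunk line items in page order, keeping the first
// occurrence of each duplicate.
function mergeLineItems(chunks: LineItem[][]): LineItem[] {
  const seen = new Set<string>();
  const merged: LineItem[] = [];
  for (const chunk of chunks) {
    for (const item of chunk) {
      const key = lineItemKey(item);
      if (seen.has(key)) continue;
      seen.add(key);
      merged.push(item);
    }
  }
  return merged;
}
```

One limitation of this rule: an invoice that legitimately lists the same item twice with identical amounts would lose one of them, so a real merge step would likely need positional information as well.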

Handwritten Notes

Handwritten notes are tricky:

  • OCR might fail
  • LLM might misinterpret
  • Validation can't catch semantic errors

Approach:

  1. Extract notes as optional string
  2. Flag invoices with notes for human review
  3. Don't rely on notes for critical data

if (result.data.notes) {
  await flagForReview(invoiceId, 'Contains handwritten notes');
}

Error Handling

Validation Failures

When extraction fails validation:

const result = await extract({
  artifacts: [{ path: invoicePath }],
  schema: InvoiceSchema,
  strategy: 'parallelAutoMerge',
  options: {
    maxRetries: 3,
  }
});

if (!result.success) {
  // Log failure reason
  console.error(`Failed: ${invoicePath}`, result.error);
  
  // Save for manual review
  await saveForReview(invoicePath, result.error);
}

Extraction Failures

Some invoices fail to extract:

  • Corrupted PDF
  • Blank pages
  • Non-invoice documents

try {
  const result = await extract({...});
  
  if (!result.success) {
    failedCount++;
    await logFailure(invoicePath, result.error);
  }
} catch (error) {
  // Catastrophic failure (corrupted file, etc.)
  errorCount++;
  await logError(invoicePath, error);
}

Success Rate Tracking

let processed = 0;
let succeeded = 0;
let failed = 0;
let errors = 0;

for (const invoice of invoices) {
  try {
    const result = await extract({...});
    if (result.success) succeeded++;
    else failed++;
  } catch (e) {
    errors++;
  }
  processed++;
}

console.log(`
Processed: ${processed}
Succeeded: ${succeeded} (${(succeeded/processed*100).toFixed(1)}%)
Failed:    ${failed}
Errors:    ${errors}
`);

Cost Analysis

Token Usage

Typical invoice extraction:

Component            Tokens
Input (document)     1,500-3,000
Input (prompt)       500
Output (JSON)        200-500
Total per invoice    2,200-4,000

Cost Calculation

Using GPT-4o:

  • Input: $2.50/1M tokens
  • Output: $10.00/1M tokens

Input tokens:  10,000 invoices × 3,000 avg = 30M
Output tokens: 10,000 invoices × 300 avg   = 3M

Input cost:  30M × $2.50/1M  = $75
Output cost: 3M  × $10.00/1M = $30
Total:       $105

Using GPT-4o-mini:

  • Input: $0.15/1M tokens
  • Output: $0.60/1M tokens

Input cost:  30M × $0.15/1M = $4.50
Output cost: 3M  × $0.60/1M = $1.80
Total:       $6.30
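
The arithmetic above can be wrapped in a small estimator so it is easy to rerun with other volumes or prices. This is a sketch, not part of Struktur: estimateCost and its types are hypothetical, the prices are the ones quoted in this section, and the average of 300 output tokens per invoice is the figure implied by the 3M output tokens used above.

```typescript
interface ModelPricing {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

interface Workload {
  invoices: number;
  avgInputTokens: number;  // document + prompt tokens per invoice
  avgOutputTokens: number; // JSON output tokens per invoice
}

// Multiply token volume by the per-million price for each direction.
function estimateCost(workload: Workload, pricing: ModelPricing) {
  const inputTokens = workload.invoices * workload.avgInputTokens;
  const outputTokens = workload.invoices * workload.avgOutputTokens;
  const inputCost = (inputTokens / 1e6) * pricing.inputPerMillion;
  const outputCost = (outputTokens / 1e6) * pricing.outputPerMillion;
  return { inputCost, outputCost, total: inputCost + outputCost };
}

const workload: Workload = { invoices: 10000, avgInputTokens: 3000, avgOutputTokens: 300 };
const gpt4o = estimateCost(workload, { inputPerMillion: 2.5, outputPerMillion: 10 });
const gpt4oMini = estimateCost(workload, { inputPerMillion: 0.15, outputPerMillion: 0.6 });

console.log(gpt4o.total.toFixed(2));     // 105.00
console.log(gpt4oMini.total.toFixed(2)); // 6.30
```

Retries are not modeled here; with maxRetries set, failed invoices consume additional tokens, so the real bill will be somewhat higher than this estimate.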

Comparison with Alternatives

Solution                   Cost for 10k invoices
LlamaExtract (balanced)    ~$125
Struktur + GPT-4o          ~$105
Struktur + GPT-4o-mini     ~$6
Manual data entry          ~$5,000 (50 hrs × $100/hr)

Full Pipeline

Here's the complete extraction pipeline:

import { Type } from '@sinclair/typebox';
import { extract } from '@struktur/sdk';
import { glob } from 'glob';
import { db } from './database';

const InvoiceSchema = Type.Object({...});

async function processInvoices() {
  const invoices = await glob('invoices/*.pdf');
  
  const results = {
    processed: 0,
    succeeded: 0,
    failed: 0,
    errors: 0,
  };
  
  for (const invoicePath of invoices) {
    try {
      const result = await extract({
        artifacts: [{ path: invoicePath }],
        schema: InvoiceSchema,
        strategy: 'parallelAutoMerge',
      });
      
      if (result.success) {
        // Save to database
        await db.invoices.create({
          data: result.data,
          source: invoicePath,
          tokens: result.usage.totalTokens,
        });
        results.succeeded++;
      } else {
        // Save for review
        await db.failures.create({
          path: invoicePath,
          error: result.error,
        });
        results.failed++;
      }
    } catch (error) {
      console.error(`Error processing ${invoicePath}:`, error);
      results.errors++;
    }
    
    results.processed++;
    
    // Progress update
    if (results.processed % 100 === 0) {
      console.log(`Progress: ${results.processed}/${invoices.length}`);
    }
  }
  
  console.log('Final results:', results);
  return results;
}

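// Surface unexpected failures instead of leaving a floating promise
processInvoices().catch((error) => {
  console.error('Pipeline failed:', error);
  process.exitCode = 1;
});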
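
The loop in this pipeline handles one invoice at a time. Since speed matters at 10,000 documents, one possible extension is to run a bounded number of extractions concurrently. The helper below is a generic sketch and not a Struktur feature; mapWithConcurrency is a hypothetical name, and a sensible limit depends on the model provider's rate limits.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order. A rejected worker call rejects
// the whole run, so workers should catch their own per-item errors.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

To use it, the body of the for loop would move into a function that takes an invoice path, wraps its own try/catch as the loop does now, and returns an outcome such as 'succeeded', 'failed', or 'error'; the counters can then be tallied from the returned array.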

Lessons Learned

What Worked

  1. ParallelAutoMerge — Fast and accurate for invoices
  2. Optional fields — Handled vendor variation well
  3. GPT-4o-mini — Good enough quality, much cheaper
  4. Error logging — Made debugging easy

What Surprised Us

  1. Handwritten notes — OCR quality varies wildly
  2. Currency symbols — Some vendors use non-standard symbols
  3. Date formats — More variation than expected
  4. Line item descriptions — Sometimes split across lines

What We'd Do Differently

  1. Pre-validation — Check PDF quality before extraction
  2. Vendor-specific schemas — Tailor to known vendors
  3. Confidence thresholds — Flag low-confidence extractions
  4. Batch processing — Group by vendor for efficiency
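
The pre-validation idea in item 1 could start with a cheap structural check before any tokens are spent. The sketch below assumes the file has already been read into a Uint8Array (for example with Node's fs.readFile); quickPdfCheck and indexOfAscii are hypothetical helpers. It relies on two facts about the PDF format: files begin with a "%PDF-" header and end with an "%%EOF" marker, and many readers tolerate both within roughly the first and last 1 KiB.

```typescript
// Find the first occurrence of an ASCII marker within [from, to).
function indexOfAscii(bytes: Uint8Array, marker: string, from: number, to: number): number {
  const end = Math.min(to, bytes.length);
  for (let i = Math.max(0, from); i + marker.length <= end; i++) {
    let match = true;
    for (let j = 0; j < marker.length; j++) {
      if (bytes[i + j] !== marker.charCodeAt(j)) {
        match = false;
        break;
      }
    }
    if (match) return i;
  }
  return -1;
}

type PdfCheck = "ok" | "empty" | "missing-header" | "missing-eof";

// Cheap structural check only. It does not prove the file renders or
// that it is an invoice; it filters out obviously broken inputs.
function quickPdfCheck(bytes: Uint8Array): PdfCheck {
  if (bytes.length === 0) return "empty";
  // Scan the first 1 KiB for the header.
  if (indexOfAscii(bytes, "%PDF-", 0, 1024) === -1) return "missing-header";
  // Scan the last 1 KiB for the end-of-file marker.
  if (indexOfAscii(bytes, "%%EOF", bytes.length - 1024, bytes.length) === -1) {
    return "missing-eof";
  }
  return "ok";
}
```

Files that fail this check can be routed straight to the failure log, which keeps truncated uploads and non-PDF files out of the extraction step. Blank or low-quality scans would still need a separate check, since they are structurally valid PDFs.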
