Extracting Invoices at Scale
Real-world example: processing 10,000 invoices
A practical walkthrough of using Struktur to process 10,000 invoices from 50 different vendors. We'll cover schema design, strategy selection, error handling, and cost analysis.
The Problem
Input:
- 10,000 PDF invoices
- 50 different vendors
- Formats vary wildly
- Some handwritten notes
- Multi-page invoices
- Missing fields common
Output needed:
- Structured JSON per invoice
- Validated against schema
- Loaded into database
- Error reports for failures
Schema Design
Start with what you need to extract:
import { Type } from '@sinclair/typebox';
const InvoiceSchema = Type.Object({
// Identifiers
invoiceNumber: Type.String(),
vendor: Type.Object({
name: Type.String(),
address: Type.Optional(Type.String()),
}),
// Dates
invoiceDate: Type.String({ format: 'date' }),
dueDate: Type.Optional(Type.String({ format: 'date' })),
// Line items
lineItems: Type.Array(Type.Object({
description: Type.String(),
quantity: Type.Number(),
unitPrice: Type.Number(),
total: Type.Number(),
})),
// Totals
subtotal: Type.Number(),
tax: Type.Optional(Type.Number()),
total: Type.Number(),
// Metadata
currency: Type.String({ default: 'USD' }),
notes: Type.Optional(Type.String()),
});Design Decisions
Optional vs Required:
dueDateis optional — not all invoices have ittaxis optional — some vendors don't itemizenotesis optional — capture handwritten notes if present
Nested Objects:
vendoris nested — cleaner than flat fieldslineItemsis an array — variable length
Types:
- All amounts are
number— easier for calculations - Dates are ISO strings — standard format
Strategy Selection
For invoices, we chose parallelAutoMerge:
const result = await extract({
artifacts: [{ path: invoicePath }],
schema: InvoiceSchema,
strategy: 'parallelAutoMerge',
});Why parallelAutoMerge?
- Invoices are typically 1-3 pages — small enough for parallel
- Line items are independent — no cross-chunk context needed
- Auto-merge handles duplicate line items
- Speed matters at 10,000 documents
Alternative considerations:
simple— Works for single-page invoices, fails for multi-pagesequential— Overkill for invoices, sloweragent— Useful if invoice structure varies wildly
Handling Edge Cases
Missing Fields
Some vendors don't include all fields:
// Vendor A: complete invoice
{
invoiceNumber: "INV-001",
invoiceDate: "2024-03-15",
dueDate: "2024-04-15", // present
total: 1234.56
}
// Vendor B: minimal invoice
{
invoiceNumber: "12345",
invoiceDate: "2024-03-15",
// no dueDate
total: 567.89
}Schema handles this with optional fields. Validation passes if optional fields are missing.
Multi-Page Invoices
Some invoices span multiple pages:
Page 1: Header, vendor info, first 10 line items
Page 2: Remaining line items, totalsParallel strategy handles this:
- Chunk 1 extracts first 10 items
- Chunk 2 extracts remaining items + totals
- Auto-merge combines line items
Handwritten Notes
Handwritten notes are tricky:
- OCR might fail
- LLM might misinterpret
- Validation can't catch semantic errors
Approach:
- Extract notes as optional string
- Flag invoices with notes for human review
- Don't rely on notes for critical data
if (result.data.notes) {
await flagForReview(invoiceId, 'Contains handwritten notes');
}Error Handling
Validation Failures
When extraction fails validation:
const result = await extract({
artifacts: [{ path: invoicePath }],
schema: InvoiceSchema,
strategy: 'parallelAutoMerge',
options: {
maxRetries: 3,
}
});
if (!result.success) {
// Log failure reason
console.error(`Failed: ${invoicePath}`, result.error);
// Save for manual review
await saveForReview(invoicePath, result.error);
}Extraction Failures
Some invoices fail to extract:
- Corrupted PDF
- Blank pages
- Non-invoice documents
try {
const result = await extract({...});
if (!result.success) {
failedCount++;
await logFailure(invoicePath, result.error);
}
} catch (error) {
// Catastrophic failure (corrupted file, etc.)
errorCount++;
await logError(invoicePath, error);
}Success Rate Tracking
let processed = 0;
let succeeded = 0;
let failed = 0;
let errors = 0;
for (const invoice of invoices) {
try {
const result = await extract({...});
if (result.success) succeeded++;
else failed++;
} catch (e) {
errors++;
}
processed++;
}
console.log(`
Processed: ${processed}
Succeeded: ${succeeded} (${(succeeded/processed*100).toFixed(1)}%)
Failed: ${failed}
Errors: ${errors}
`);Cost Analysis
Token Usage
Typical invoice extraction:
| Component | Tokens |
|---|---|
| Input (document) | 1,500-3,000 |
| Input (prompt) | 500 |
| Output (JSON) | 200-500 |
| Total per invoice | 2,200-4,000 |
Cost Calculation
Using GPT-4o:
- Input: $2.50/1M tokens
- Output: $10.00/1M tokens
10,000 invoices × 3,000 tokens avg = 30M tokens
Input cost: 30M × $2.50/1M = $75
Output cost: 3M × $10.00/1M = $30
Total: $105Using GPT-4o-mini:
- Input: $0.15/1M tokens
- Output: $0.60/1M tokens
Total: ~$6.30Comparison with Alternatives
| Solution | Cost for 10k invoices |
|---|---|
| LlamaExtract (balanced) | ~$125 |
| Struktur + GPT-4o | ~$105 |
| Struktur + GPT-4o-mini | ~$6 |
| Manual data entry | ~$5,000 (50 hrs × $100/hr) |
Full Pipeline
Here's the complete extraction pipeline:
import { extract } from '@struktur/sdk';
import { glob } from 'glob';
import { db } from './database';
const InvoiceSchema = Type.Object({...});
async function processInvoices() {
const invoices = await glob('invoices/*.pdf');
const results = {
processed: 0,
succeeded: 0,
failed: 0,
errors: 0,
};
for (const invoicePath of invoices) {
try {
const result = await extract({
artifacts: [{ path: invoicePath }],
schema: InvoiceSchema,
strategy: 'parallelAutoMerge',
});
if (result.success) {
// Save to database
await db.invoices.create({
data: result.data,
source: invoicePath,
tokens: result.usage.totalTokens,
});
results.succeeded++;
} else {
// Save for review
await db.failures.create({
path: invoicePath,
error: result.error,
});
results.failed++;
}
} catch (error) {
console.error(`Error processing ${invoicePath}:`, error);
results.errors++;
}
results.processed++;
// Progress update
if (results.processed % 100 === 0) {
console.log(`Progress: ${results.processed}/${invoices.length}`);
}
}
console.log('Final results:', results);
return results;
}
processInvoices();Lessons Learned
What Worked
- ParallelAutoMerge — Fast and accurate for invoices
- Optional fields — Handled vendor variation well
- GPT-4o-mini — Good enough quality, much cheaper
- Error logging — Made debugging easy
What Surprised Us
- Handwritten notes — OCR quality varies wildly
- Currency symbols — Some vendors use non-standard symbols
- Date formats — More variation than expected
- Line item descriptions — Sometimes split across lines
What We'd Do Differently
- Pre-validation — Check PDF quality before extraction
- Vendor-specific schemas — Tailor to known vendors
- Confidence thresholds — Flag low-confidence extractions
- Batch processing — Group by vendor for efficiency
See Also
- Agent vs Simple vs Parallel — Strategy selection
- The Chunking, Validation, and Retry Problem — Pipeline details
- Struktur vs LlamaIndex — Cost comparison