A practical walkthrough of using Struktur to process 10,000 invoices from 50 different vendors. We'll cover schema design, strategy selection, error handling, and cost analysis.

The Problem

Input:

10,000 PDF invoices
50 different vendors
Formats vary wildly
Some handwritten notes
Multi-page invoices
Missing fields common

Output needed:

Structured JSON per invoice
Validated against schema
Loaded into database
Error reports for failures

Schema Design

Start with what you need to extract:

import { Type } from '@sinclair/typebox';

const InvoiceSchema = Type.Object({
  // Identifiers
  invoiceNumber: Type.String(),
  vendor: Type.Object({
    name: Type.String(),
    address: Type.Optional(Type.String()),
  }),
  
  // Dates
  invoiceDate: Type.String({ format: 'date' }),
  dueDate: Type.Optional(Type.String({ format: 'date' })),
  
  // Line items
  lineItems: Type.Array(Type.Object({
    description: Type.String(),
    quantity: Type.Number(),
    unitPrice: Type.Number(),
    total: Type.Number(),
  })),
  
  // Totals
  subtotal: Type.Number(),
  tax: Type.Optional(Type.Number()),
  total: Type.Number(),
  
  // Metadata
  currency: Type.String({ default: 'USD' }),
  notes: Type.Optional(Type.String()),
});

Design Decisions

Optional vs Required:

dueDate is optional — not all invoices have it
tax is optional — some vendors don't itemize
notes is optional — capture handwritten notes if present

Nested Objects:

vendor is nested — cleaner than flat fields
lineItems is an array — variable length

Types:

All amounts are number — easier for calculations
Dates are ISO strings — standard format

Strategy Selection

For invoices, we chose parallelAutoMerge:

const result = await extract({
  artifacts: [{ path: invoicePath }],
  schema: InvoiceSchema,
  strategy: 'parallelAutoMerge',
});

Why parallelAutoMerge?

Invoices are typically 1-3 pages — small enough for parallel
Line items are independent — no cross-chunk context needed
Auto-merge handles duplicate line items
Speed matters at 10,000 documents

Alternative considerations:

simple — Works for single-page invoices, fails for multi-page
sequential — Overkill for invoices, slower
agent — Useful if invoice structure varies wildly

Handling Edge Cases

Missing Fields

Some vendors don't include all fields:

// Vendor A: complete invoice
{
  invoiceNumber: "INV-001",
  invoiceDate: "2024-03-15",
  dueDate: "2024-04-15",  // present
  total: 1234.56
}

// Vendor B: minimal invoice
{
  invoiceNumber: "12345",
  invoiceDate: "2024-03-15",
  // no dueDate
  total: 567.89
}

Schema handles this with optional fields. Validation passes if optional fields are missing.

Multi-Page Invoices

Some invoices span multiple pages:

Page 1: Header, vendor info, first 10 line items
Page 2: Remaining line items, totals

Parallel strategy handles this:

Chunk 1 extracts first 10 items
Chunk 2 extracts remaining items + totals
Auto-merge combines line items

Handwritten Notes

Handwritten notes are tricky:

OCR might fail
LLM might misinterpret
Validation can't catch semantic errors

Approach:

Extract notes as optional string
Flag invoices with notes for human review
Don't rely on notes for critical data

if (result.data.notes) {
  await flagForReview(invoiceId, 'Contains handwritten notes');
}

Error Handling

Validation Failures

When extraction fails validation:

const result = await extract({
  artifacts: [{ path: invoicePath }],
  schema: InvoiceSchema,
  strategy: 'parallelAutoMerge',
  options: {
    maxRetries: 3,
  }
});

if (!result.success) {
  // Log failure reason
  console.error(`Failed: ${invoicePath}`, result.error);
  
  // Save for manual review
  await saveForReview(invoicePath, result.error);
}

Extraction Failures

Some invoices fail to extract:

Corrupted PDF
Blank pages
Non-invoice documents

try {
  const result = await extract({...});
  
  if (!result.success) {
    failedCount++;
    await logFailure(invoicePath, result.error);
  }
} catch (error) {
  // Catastrophic failure (corrupted file, etc.)
  errorCount++;
  await logError(invoicePath, error);
}

Success Rate Tracking

let processed = 0;
let succeeded = 0;
let failed = 0;
let errors = 0;

for (const invoice of invoices) {
  try {
    const result = await extract({...});
    if (result.success) succeeded++;
    else failed++;
  } catch (e) {
    errors++;
  }
  processed++;
}

console.log(`
Processed: ${processed}
Succeeded: ${succeeded} (${(succeeded/processed*100).toFixed(1)}%)
Failed:    ${failed}
Errors:    ${errors}
`);

Cost Analysis

Token Usage

Typical invoice extraction:

Component	Tokens
Input (document)	1,500-3,000
Input (prompt)	500
Output (JSON)	200-500
Total per invoice	2,200-4,000

Cost Calculation

Using GPT-4o:

Input: $2.50/1M tokens
Output: $10.00/1M tokens

10,000 invoices × 3,000 tokens avg = 30M tokens

Input cost:  30M × $2.50/1M  = $75
Output cost: 3M × $10.00/1M = $30
Total:       $105

Using GPT-4o-mini:

Input: $0.15/1M tokens
Output: $0.60/1M tokens

Total: ~$6.30

Comparison with Alternatives

Solution	Cost for 10k invoices
LlamaExtract (balanced)	~$125
Struktur + GPT-4o	~$105
Struktur + GPT-4o-mini	~$6
Manual data entry	~$5,000 (50 hrs × $100/hr)

Full Pipeline

Here's the complete extraction pipeline:

import { extract } from '@struktur/sdk';
import { glob } from 'glob';
import { db } from './database';

const InvoiceSchema = Type.Object({...});

async function processInvoices() {
  const invoices = await glob('invoices/*.pdf');
  
  const results = {
    processed: 0,
    succeeded: 0,
    failed: 0,
    errors: 0,
  };
  
  for (const invoicePath of invoices) {
    try {
      const result = await extract({
        artifacts: [{ path: invoicePath }],
        schema: InvoiceSchema,
        strategy: 'parallelAutoMerge',
      });
      
      if (result.success) {
        // Save to database
        await db.invoices.create({
          data: result.data,
          source: invoicePath,
          tokens: result.usage.totalTokens,
        });
        results.succeeded++;
      } else {
        // Save for review
        await db.failures.create({
          path: invoicePath,
          error: result.error,
        });
        results.failed++;
      }
    } catch (error) {
      console.error(`Error processing ${invoicePath}:`, error);
      results.errors++;
    }
    
    results.processed++;
    
    // Progress update
    if (results.processed % 100 === 0) {
      console.log(`Progress: ${results.processed}/${invoices.length}`);
    }
  }
  
  console.log('Final results:', results);
  return results;
}

processInvoices();

Lessons Learned

What Worked

ParallelAutoMerge — Fast and accurate for invoices
Optional fields — Handled vendor variation well
GPT-4o-mini — Good enough quality, much cheaper
Error logging — Made debugging easy

What Surprised Us

Handwritten notes — OCR quality varies wildly
Currency symbols — Some vendors use non-standard symbols
Date formats — More variation than expected
Line item descriptions — Sometimes split across lines

What We'd Do Differently

Pre-validation — Check PDF quality before extraction
Vendor-specific schemas — Tailor to known vendors
Confidence thresholds — Flag low-confidence extractions
Batch processing — Group by vendor for efficiency

Extracting Invoices at Scale