Examples
Extract Invoice Data
Extract structured invoice data from PDF or text files.
Schema
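The schema body is not reproduced in this section. As a rough sketch only, assuming Struktur accepts a standard JSON Schema object, a schema matching the expected output further down this page might look like the following. The field names are taken from that sample output; the `required` list, the `description` strings, and the use of the name `invoiceSchema` for this object are assumptions rather than the original schema.

```typescript
// Hypothetical schema, inferred from the sample output on this page.
// It is NOT the original invoice-schema.json.
const invoiceSchema = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    vendor: { type: "string" },
    invoice_date: { type: "string", description: "Date as YYYY-MM-DD" },
    due_date: { type: "string", description: "Date as YYYY-MM-DD" },
    currency: { type: "string", description: "Currency code such as USD" },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
          total: { type: "number" },
        },
        required: ["description", "total"], // assumption
      },
    },
    subtotal: { type: "number" },
    tax: { type: "number" },
    total: { type: "number" },
  },
  required: ["invoice_number", "vendor", "total"], // assumption
} as const;

// Serialized form, which could be saved as invoice-schema.json for the CLI examples.
const invoiceSchemaJson: string = JSON.stringify(invoiceSchema, null, 2);
console.log(invoiceSchemaJson);
```

Keeping one object for both the JSON file and the SDK calls avoids the CLI and SDK examples drifting apart.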
CLI approach
Single invoice:
struktur --input invoice.pdf \
--schema invoice-schema.json \
--model openai/gpt-4o-mini
With embedded images (for invoices with stamps, logos, or handwritten amounts):
struktur --input invoice.pdf --images \
--schema invoice-schema.json \
--model openai/gpt-4o
Multiple invoices:
# Create the output directory up front so each result has somewhere to go.
mkdir -p outputs
for file in invoices/*.pdf; do
  struktur --input "$file" \
    --schema invoice-schema.json \
    --model openai/gpt-4o-mini \
    --output "outputs/$(basename "$file" .pdf).json"
done
SDK
Small invoices (1-3 pages):
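import { extract, simple, parse } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

// invoiceSchema is the schema from the Schema section, defined or loaded elsewhere.
// Parse the PDF into artifacts, keeping embedded images.
const artifacts = await parse(
  { kind: "file", path: "invoice.pdf" },
  { includeImages: true }
);

// Run the extraction with the simple strategy.
const result = await extract({
  artifacts,
  schema: invoiceSchema,
  strategy: simple({ model: openai("gpt-4o-mini") }),
});
Multi-page invoices with many line items: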
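import { extract, sequentialAutoMerge, parse } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

// invoiceSchema is the schema from the Schema section, defined or loaded elsewhere.
const artifacts = await parse({ kind: "file", path: "invoice.pdf" });

// Process the document in chunks and merge the results; dedupeModel handles
// line items that appear more than once.
const result = await extract({
  artifacts,
  schema: invoiceSchema,
  strategy: sequentialAutoMerge({
    model: openai("gpt-4o-mini"),
    dedupeModel: openai("gpt-4o-mini"),
    chunkSize: 8000,
  }),
});
Strategy choice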
| Invoice type | Strategy |
|---|---|
| 1-3 pages | simple |
| Multi-page, line items may duplicate | sequentialAutoMerge |
| Many invoices in parallel | parallelAutoMerge |
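The table above can be read as a small decision rule. The sketch below encodes it as a hypothetical helper, `pickStrategyName`, which is not part of `@struktur/sdk`; the `InvoiceJob` shape and the `parallel` flag are assumptions made for illustration, and the 3-page cutoff comes directly from the first row of the table.

```typescript
// Hypothetical helper, not part of @struktur/sdk: maps the table rows to a strategy name.
type StrategyName = "simple" | "sequentialAutoMerge" | "parallelAutoMerge";

interface InvoiceJob {
  // Number of pages in the invoice.
  pageCount: number;
  // True when the workload matches the "many invoices in parallel" row (assumed flag).
  parallel?: boolean;
}

function pickStrategyName(job: InvoiceJob): StrategyName {
  if (job.parallel) {
    return "parallelAutoMerge";
  }
  // 1-3 pages: simple. Anything longer: sequentialAutoMerge.
  return job.pageCount <= 3 ? "simple" : "sequentialAutoMerge";
}

console.log(pickStrategyName({ pageCount: 2 }));
console.log(pickStrategyName({ pageCount: 12 }));
```

The returned name would then be used to pick the matching strategy function imported from the SDK.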
Expected output
{
"invoice_number": "1042",
"vendor": "Acme Corp",
"invoice_date": "2024-03-01",
"due_date": "2024-04-01",
"currency": "USD",
"line_items": [
{ "description": "Widget A", "quantity": 10, "unit_price": 50, "total": 500 },
{ "description": "Widget B", "quantity": 5, "unit_price": 200, "total": 1000 }
],
"subtotal": 1500,
"tax": 150,
"total": 1650
}
See also
- Extraction Strategies — strategy reference
- Process a Directory of Files — batch processing