Extraction Strategies
The Agent strategy uses autonomous exploration. Other strategies are available for specific use cases.
The Agent strategy is the default and recommended way to use Struktur. It gives the LLM a virtual filesystem and lets it autonomously decide how to extract your data.
For documents where you need more control, Struktur also provides alternative strategies that use fixed chunking and parallelism patterns.
Strategy comparison
| Strategy | Speed | Context | Arrays | Token Cost | Best For |
|---|---|---|---|---|---|
| `agent` (default) | Adaptive | Adaptive | Automatic | Varies | Most documents |
| `simple` | Fastest | Full | — | Lowest | Small inputs |
| `parallel` | Fast | None | LLM merge | Medium | Speed priority |
| `sequential` | Medium | Full | Context | Medium | Context-dependent |
| `parallelAutoMerge` | Fast | None | Auto + dedupe | Medium | Large arrays |
| `sequentialAutoMerge` | Medium | Full | Auto + dedupe | Medium | Ordered arrays |
| `doublePass` | Slow | Full | LLM merge | High | Maximum quality |
| `doublePassAutoMerge` | Slow | Full | Auto + dedupe | High | Quality + arrays |
Agent (Default)
The Agent strategy is the default. You don't need to specify --strategy agent — it's used automatically when you run struktur extract.
Autonomous extraction using a virtual filesystem. The agent decides when to read files, search for patterns, and build output incrementally.
Best For
Most documents — adapts automatically
Virtual FS
`read`, `grep`, `find`, `ls`, `bash`
Output Tools
`set_output_data`, `update_output_data`
Model Requirement
Must support tool calling
How it works
- Document loaded into virtual filesystem (`/artifacts/artifact.json`, `/artifacts/manifest.json`, `/artifacts/images/`)
- Agent explores using tools: read files, grep for patterns, list directories, execute commands
- Incremental extraction — calls `set_output_data` when first data is found, `update_output_data` as more is discovered
- Validation — schema validation on every output update, with automatic retry on errors
- Completion — agent calls `finish` when done, or `fail` if extraction is impossible
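The steps above can be sketched as a simple loop. Everything here is an illustrative simplification, not Struktur's actual internals: the `AgentStep` shape, `runAgentLoop`, and the shallow merge of partial output are assumptions made for the sake of the example.

```ts
// Hypothetical sketch of the agent loop — names and types are
// illustrative, not the actual Struktur implementation.
type ToolCall = { tool: string; args: Record<string, unknown> };

interface AgentStep {
  call: ToolCall;
  // The agent may attach partial output alongside any tool call
  // (set_output_data / update_output_data).
  output?: Record<string, unknown>;
}

function runAgentLoop(
  steps: AgentStep[],
  validate: (data: Record<string, unknown>) => boolean,
  maxSteps = 50,
): Record<string, unknown> {
  let output: Record<string, unknown> = {};
  for (const step of steps.slice(0, maxSteps)) {
    if (step.output) {
      // Merge partial output; validation runs on every update.
      output = { ...output, ...step.output };
    }
    if (step.call.tool === "finish") {
      // finish only succeeds if the accumulated output validates.
      if (!validate(output)) throw new Error("finish rejected: output invalid");
      return output;
    }
  }
  throw new Error("max steps reached without finish");
}
```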
The agent adapts to your document:
- Small documents — reads everything at once
- Large documents — navigates systematically, searching for relevant sections
- Complex schemas — builds output incrementally, validating as it goes
Example
```bash
# Agent is the default — no --strategy needed
struktur extract --input ./document.pdf \
  --schema ./schema.json \
  --model anthropic/claude-sonnet-4

# With max steps limit
struktur extract --input ./document.pdf \
  --schema ./schema.json \
  --model anthropic/claude-sonnet-4 \
  --max-steps 30
```

```ts
import { extract, agent } from "@struktur/sdk";

const result = await extract({
  artifacts,
  schema,
  strategy: agent({
    provider: "anthropic",
    modelId: "claude-sonnet-4",
    maxSteps: 50,
  }),
});
```

When to use
- Always try agent first — it's the default for a reason
- Works well for most document types and sizes
- Automatically adapts to document structure
- Best for complex schemas with nested objects
Model compatibility
The agent requires models that support tool/function calling:
| Provider | Compatible Models |
|---|---|
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku |
| OpenAI | GPT-4o, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash |
Some models claim tool support but don't work well with the agent. Avoid: GPT-4o-mini (inconsistent tool calling), older GPT-3.5 models, models without native function calling.
Virtual filesystem
The agent has access to a virtual filesystem containing:
- `/artifacts/artifact.json` — All artifacts in JSON format (images replaced by virtual paths)
- `/artifacts/manifest.json` — Summary and metadata
- `/artifacts/images/` — Extracted image files (when artifacts have embedded images)
The agent can:
- Read files with pagination (`offset`, `limit`)
- Grep for patterns
- Find files by name
- List directories
- Execute bash commands (on the virtual filesystem only)
Output management
Special tools for building extraction output:
- `set_output_data(data)` — Set initial output (first time data is found)
- `update_output_data(changes)` — Merge changes into existing output
- `finish()` — Complete extraction (only works if data validates)
- `fail(reason)` — Mark extraction as impossible
The agent is encouraged to update output continuously as it explores, not wait until the end.
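To illustrate what `update_output_data`-style merging could look like, here is a minimal deep merge. It assumes object-merge semantics (nested objects merged recursively, arrays and scalars replaced); the SDK's real merge rules may differ.

```ts
// Illustrative deep merge — a sketch, not the SDK's actual behavior.
type Json = { [key: string]: unknown };

function mergeOutput(existing: Json, changes: Json): Json {
  const result: Json = { ...existing };
  for (const [key, value] of Object.entries(changes)) {
    const prev = result[key];
    if (
      prev && value &&
      typeof prev === "object" && typeof value === "object" &&
      !Array.isArray(prev) && !Array.isArray(value)
    ) {
      // Nested objects are merged field by field.
      result[key] = mergeOutput(prev as Json, value as Json);
    } else {
      // Arrays and scalars are replaced wholesale.
      result[key] = value;
    }
  }
  return result;
}
```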
Simple
Single-shot extraction for small inputs. Use when the agent is overkill for tiny documents.
LLM Calls
1
Parallelism
None
Best for
Small, single-chunk inputs
Algorithm
- Build extraction prompt from artifacts + schema
- Send to LLM
- Validate output against the schema
- Retry on validation failure (up to 3 attempts)
- Return validated output
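The validate-and-retry loop above can be sketched as follows. `extractWithRetry` and its callbacks are hypothetical stand-ins for the LLM call and the schema check, not Struktur's actual API.

```ts
// Sketch of a validate-with-retry loop, assuming the validation error
// message is fed back into the next prompt.
async function extractWithRetry<T>(
  callLLM: (feedback?: string) => Promise<T>,
  validate: (output: T) => string | null, // null = valid, string = error
  maxAttempts = 3,
): Promise<T> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const output = await callLLM(feedback);
    const error = validate(output);
    if (error === null) return output;
    feedback = error; // retry with the validation error as context
  }
  throw new Error(`validation failed after ${maxAttempts} attempts`);
}
```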
Example
```bash
struktur extract --input document.txt --schema schema.json --strategy simple
```

```ts
import { extract, simple } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: simple({
    model: openai("gpt-4o-mini"),
  }),
});
```

When to use
- Document fits within the model's context window (~10k tokens)
- Simple schema without nested arrays
- Testing or prototyping
- Speed is the priority
- When you want predictable token costs (agent costs vary by document)
Parallel
Concurrent batch processing with LLM merge.
LLM Calls
N batches + 1 merge
Parallelism
Full
Best for
Large inputs, speed priority
Algorithm
- Split artifacts into batches (respecting `chunkSize` and `maxImages`)
- Extract from each batch concurrently
- Validate each batch output with retry
- Send all partial results to `mergeModel` for LLM merge
- Validate merged output
- Return final result
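Batch splitting by token budget might look like the sketch below, which handles only `chunkSize` (the `maxImages` cap is omitted). The `Artifact` shape and its per-artifact token estimate are assumptions for illustration.

```ts
// Sketch of token-budget batching — illustrative, not the SDK's splitter.
interface Artifact {
  id: string;
  tokens: number; // assumed pre-computed token estimate
}

function splitIntoBatches(artifacts: Artifact[], chunkSize: number): Artifact[][] {
  const batches: Artifact[][] = [];
  let current: Artifact[] = [];
  let used = 0;
  for (const artifact of artifacts) {
    // Start a new batch when the next artifact would exceed the budget.
    if (current.length > 0 && used + artifact.tokens > chunkSize) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(artifact);
    used += artifact.tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each resulting batch is then extracted concurrently, up to the configured `concurrency`.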
Example
```bash
struktur extract --input large.pdf --schema schema.json --strategy parallel --model openai/gpt-4o-mini
```

```ts
import { extract, parallel } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: parallel({
    model: openai("gpt-4o-mini"),
    mergeModel: openai("gpt-4o-mini"),
    chunkSize: 10000,
    concurrency: 3,
  }),
});
```

When to use
- Speed is the top priority
- Chunks are relatively independent
- Many documents to process
- Can accept potential loss of cross-chunk context
- When agent costs are too high for your use case
Sequential
Process chunks in order with context preservation.
LLM Calls
N batches
Parallelism
None
Best for
Context-dependent documents
Algorithm
- Split artifacts into batches
- For each batch in order:
  - Build prompt including previous extraction result as context
  - Extract from batch
  - Validate with retry
  - Store result for next iteration
- Return final result
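The context-carrying loop can be sketched as follows; `sequentialExtract` and `extractBatch` are hypothetical stand-ins for the real batch prompt and LLM call.

```ts
// Sketch of sequential extraction with context preservation —
// illustrative, not the SDK's implementation.
async function sequentialExtract<T>(
  batches: string[],
  extractBatch: (batch: string, context: T | null) => Promise<T>,
): Promise<T | null> {
  let result: T | null = null;
  for (const batch of batches) {
    // The previous partial result is passed as context so later
    // chunks can extend or correct earlier extractions.
    result = await extractBatch(batch, result);
  }
  return result;
}
```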
Example
```bash
struktur extract --input report.pdf --schema schema.json --strategy sequential --model openai/gpt-4o-mini
```

```ts
import { extract, sequential } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: sequential({
    model: openai("gpt-4o-mini"),
    chunkSize: 10000,
  }),
});
```

When to use
- Context between chunks matters
- Building data incrementally (e.g., accumulating line items)
- Later sections reference earlier sections
- Need better accuracy than parallel
- Agent is making too many tool calls for your document structure
Auto-Merge Strategies
Strategies with "AutoMerge" in the name use schema-aware merge and deduplication. They're ideal for extracting arrays that may have duplicates across chunks.
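A minimal sketch of key-based deduplication across chunks: here items are deduplicated by an explicit key field, whereas the real auto-merge is driven by the schema. Names and merge rules are illustrative assumptions.

```ts
// Sketch of cross-chunk array merging with deduplication —
// illustrative, not the SDK's schema-aware merge.
function mergeArrays<T extends Record<string, unknown>>(
  chunks: T[][],
  key: keyof T,
): T[] {
  const seen = new Map<unknown, T>();
  for (const chunk of chunks) {
    for (const item of chunk) {
      const k = item[key];
      if (!seen.has(k)) {
        seen.set(k, item);
      } else {
        // Keep the earlier item's fields; later chunks can fill gaps.
        seen.set(k, Object.assign({}, item, seen.get(k)));
      }
    }
  }
  return [...seen.values()];
}
```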
parallelAutoMerge
Parallel extraction with schema-aware merge and deduplication.
Best for: Array extraction from large inputs where speed matters.
sequentialAutoMerge
Sequential extraction with schema-aware merge and deduplication.
Best for: Ordered array extraction where context matters.
doublePassAutoMerge
Double-pass extraction with schema-aware merge and deduplication.
Best for: Large array extraction with maximum quality requirement.
Choosing a Strategy
Start with the Agent. It's the default because it works best for most documents.
| Strategy | When to use |
|---|---|
| `agent` (default) | Start here — autonomous exploration for most documents |
| `simple` | Small input, fits in one context window, predictable costs |
| `parallel` | Large input, order doesn't matter, speed priority |
| `sequential` | Large input, context carries across chunks |
| `parallelAutoMerge` | Large input with arrays — parallel + dedup |
| `sequentialAutoMerge` | Large input with arrays — sequential + dedup |
| `doublePass` | Quality matters, two-pass refinement |
| `doublePassAutoMerge` | Quality + arrays + dedup |
Quick decision flowchart
```mermaid
flowchart TD
    A[Start] --> B{Try Agent first?}
    B -->|Yes| C[Use agent — default]
    B -->|Need fixed costs| D{Input fits in context?}
    D -->|Yes| E[Use simple]
    D -->|No| F{Extracting arrays?}
    F -->|Yes| G{Cross-chunk context matters?}
    F -->|No| H{Cross-chunk context matters?}
    G -->|Yes| I[sequentialAutoMerge or doublePassAutoMerge]
    G -->|No| J[parallelAutoMerge]
    H -->|Yes| K[sequential or doublePass]
    H -->|No| L[parallel]
```

See also
- The Extraction Pipeline — where strategies fit
- Chunking & Token Budgets — how batches are formed
- Validation & Retries — the retry loop