Struktur

Extraction Strategies

The Agent strategy uses autonomous exploration. Other strategies are available for specific use cases.

The Agent strategy is the default and recommended way to use Struktur. It gives the LLM a virtual filesystem and lets it autonomously decide how to extract your data.

For documents where you need more control, Struktur also provides alternative strategies that use fixed chunking and parallelism patterns.

Strategy comparison

| Strategy | Speed | Context | Arrays | Token Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| agent (default) | Adaptive | Adaptive | Automatic | Varies | Most documents |
| simple | Fastest | Full | — | Lowest | Small inputs |
| parallel | Fast | None | LLM merge | Medium | Speed priority |
| sequential | Medium | Full | Context | Medium | Context-dependent |
| parallelAutoMerge | Fast | None | Auto + dedupe | Medium | Large arrays |
| sequentialAutoMerge | Medium | Full | Auto + dedupe | Medium | Ordered arrays |
| doublePass | Slow | Full | LLM merge | High | Maximum quality |
| doublePassAutoMerge | Slow | Full | Auto + dedupe | High | Quality + arrays |

Agent (Default)

The Agent strategy is the default. You don't need to specify --strategy agent — it's used automatically when you run struktur extract.

Autonomous extraction using a virtual filesystem. The agent decides when to read files, search for patterns, and build output incrementally.

Best For

Most documents — adapts automatically

Virtual FS

read, grep, find, ls, bash

Output Tools

set_output_data, update_output_data

Model Requirement

Must support tool calling

How it works

  1. Document loaded into virtual filesystem (/artifacts/artifact.json, /artifacts/manifest.json, /artifacts/images/)
  2. Agent explores using tools: read files, grep for patterns, list directories, execute commands
  3. Incremental extraction — calls set_output_data when data is first found, then update_output_data as more is discovered
  4. Validation — schema validation on every output update, with automatic retry on errors
  5. Completion — agent calls finish when done, or fail if extraction impossible

The agent adapts to your document:

  • Small documents — reads everything at once
  • Large documents — navigates systematically, searching for relevant sections
  • Complex schemas — builds output incrementally, validating as it goes

Example

# Agent is the default — no --strategy needed
struktur extract --input ./document.pdf \
  --schema ./schema.json \
  --model anthropic/claude-sonnet-4

# With max steps limit
struktur extract --input ./document.pdf \
  --schema ./schema.json \
  --model anthropic/claude-sonnet-4 \
  --max-steps 30

import { extract, agent } from "@struktur/sdk";

const result = await extract({
  artifacts,
  schema,
  strategy: agent({
    provider: "anthropic",
    modelId: "claude-sonnet-4",
    maxSteps: 50,
  }),
});

When to use

  • Always try agent first — it's the default for a reason
  • Works well for most document types and sizes
  • Automatically adapts to document structure
  • Best for complex schemas with nested objects

Model compatibility

The agent requires models that support tool/function calling:

| Provider | Compatible Models |
| --- | --- |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku |
| OpenAI | GPT-4o, GPT-4 Turbo, GPT-4, GPT-3.5 Turbo |
| Google | Gemini 1.5 Pro, Gemini 1.5 Flash |

Some models claim tool support but don't work well with the agent. Avoid: GPT-4o-mini (inconsistent tool calling), older GPT-3.5 models, models without native function calling.

Virtual filesystem

The agent has access to a virtual filesystem containing:

  • /artifacts/artifact.json — All artifacts in JSON format (images replaced by virtual paths)
  • /artifacts/manifest.json — Summary and metadata
  • /artifacts/images/ — Extracted image files (when artifacts have embedded images)

The agent can:

  • Read files with pagination (offset, limit)
  • Grep for patterns
  • Find files by name
  • List directories
  • Execute bash commands (on the virtual filesystem only)
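
A rough sketch of line-based pagination and pattern search over such a virtual filesystem. The offset/limit and grep semantics shown here are assumptions for illustration, not the SDK's implementation:

```typescript
// A tiny in-memory stand-in for the agent's virtual filesystem.
const virtualFs = new Map<string, string>([
  ["/artifacts/manifest.json", '{"pages": 3}'],
  ["/artifacts/artifact.json", "line 1\nline 2\nline 3\nline 4"],
]);

// Read `limit` lines starting at line `offset` (0-based).
function read(path: string, offset = 0, limit = Infinity): string {
  const file = virtualFs.get(path);
  if (file === undefined) throw new Error(`not found: ${path}`);
  return file.split("\n").slice(offset, offset + limit).join("\n");
}

// Return every line matching a pattern, prefixed with its path.
function grep(pattern: RegExp): string[] {
  const hits: string[] = [];
  for (const [path, content] of virtualFs) {
    for (const line of content.split("\n")) {
      if (pattern.test(line)) hits.push(`${path}: ${line}`);
    }
  }
  return hits;
}
```

Pagination matters because a large artifact.json may not fit in one tool response; the agent can page through it with increasing offsets.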

Output management

Special tools for building extraction output:

  • set_output_data(data) — Set initial output (first time data is found)
  • update_output_data(changes) — Merge changes into existing output
  • finish() — Complete extraction (only works if data validates)
  • fail(reason) — Mark extraction as impossible

The agent is encouraged to update output continuously as it explores, rather than waiting until the end.
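
The set/update flow can be sketched as a deep merge over a single output object. Only the tool names come from Struktur; the merge semantics here (objects merged recursively, arrays and scalars replaced) are an assumption:

```typescript
type Json = Record<string, unknown>;

// Recursively merge `changes` into `base`; arrays and scalars are replaced wholesale.
function deepMerge(base: Json, changes: Json): Json {
  const out: Json = { ...base };
  for (const [key, value] of Object.entries(changes)) {
    const prev = out[key];
    if (
      prev && typeof prev === "object" && !Array.isArray(prev) &&
      value && typeof value === "object" && !Array.isArray(value)
    ) {
      out[key] = deepMerge(prev as Json, value as Json);
    } else {
      out[key] = value;
    }
  }
  return out;
}

// set_output_data writes the initial output; update_output_data merges later finds.
let output: Json = {};
const setOutputData = (data: Json) => { output = data; };
const updateOutputData = (changes: Json) => { output = deepMerge(output, changes); };

setOutputData({ invoice: { id: "INV-1" } });
updateOutputData({ invoice: { total: 99 }, vendor: "Acme" });
```
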


Simple

Single-shot extraction for small inputs. Use when the agent is overkill for tiny documents.

LLM Calls

1

Parallelism

None

Best for

Small, single-chunk inputs

Algorithm

  1. Build extraction prompt from artifacts + schema
  2. Send to LLM
  3. Validate output against the schema
  4. Retry on validation failure (up to 3 attempts)
  5. Return validated output

Example

struktur extract --input document.txt --schema schema.json --strategy simple

import { extract, simple } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: simple({
    model: openai("gpt-4o-mini"),
  }),
});

When to use

  • Document fits within the model's context window (~10k tokens)
  • Simple schema without nested arrays
  • Testing or prototyping
  • Speed is the priority
  • When you want predictable token costs (agent costs vary by document)

Parallel

Concurrent batch processing with LLM merge.

LLM Calls

N batches + 1 merge

Parallelism

Full

Best for

Large inputs, speed priority

Algorithm

  1. Split artifacts into batches (respecting chunkSize and maxImages)
  2. Extract from each batch concurrently
  3. Validate each batch output with retry
  4. Send all partial results to mergeModel for LLM merge
  5. Validate merged output
  6. Return final result
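
Steps 1 and 2 amount to batch splitting plus a small concurrency limiter. A minimal sketch (chunkSize and concurrency are real options; these helper names are stand-ins):

```typescript
// Split items into fixed-size batches, preserving order.
function splitIntoBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run `fn` over items with at most `concurrency` in flight; results keep input order.
async function mapConcurrent<T, R>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from({ length: Math.min(concurrency, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

Because each batch is extracted independently, anything that spans two batches (a table split across a page break, say) can only be reconciled at the merge step.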

Example

struktur extract --input large.pdf --schema schema.json --strategy parallel --model openai/gpt-4o-mini

import { extract, parallel } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: parallel({
    model: openai("gpt-4o-mini"),
    mergeModel: openai("gpt-4o-mini"),
    chunkSize: 10000,
    concurrency: 3,
  }),
});

When to use

  • Speed is the top priority
  • Chunks are relatively independent
  • Many documents to process
  • Can accept potential loss of cross-chunk context
  • When agent costs are too high for your use case

Sequential

Process chunks in order with context preservation.

LLM Calls

N batches

Parallelism

None

Best for

Context-dependent documents

Algorithm

  1. Split artifacts into batches
  2. For each batch in order:
    • Build prompt including previous extraction result as context
    • Extract from batch
    • Validate with retry
    • Store result for next iteration
  3. Return final result
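
The loop above is essentially a left fold: each batch sees the previous partial result as context. A minimal sketch with an illustrative extractBatch callback:

```typescript
// Process batches in order, carrying the previous result into each call.
async function extractSequential<T>(
  batches: string[],
  extractBatch: (batch: string, previous: T | undefined) => Promise<T>,
): Promise<T | undefined> {
  let result: T | undefined;
  for (const batch of batches) {
    result = await extractBatch(batch, result);
  }
  return result;
}
```

This dependency chain is also why sequential cannot parallelize: each batch's prompt depends on the previous batch's output.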

Example

struktur extract --input report.pdf --schema schema.json --strategy sequential --model openai/gpt-4o-mini

import { extract, sequential } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: sequential({
    model: openai("gpt-4o-mini"),
    chunkSize: 10000,
  }),
});

When to use

  • Context between chunks matters
  • Building data incrementally (e.g., accumulating line items)
  • Later sections reference earlier sections
  • Need better accuracy than parallel
  • Agent is making too many tool calls for your document structure

Auto-Merge Strategies

Strategies with "AutoMerge" in the name use schema-aware merge and deduplication. They're ideal for extracting arrays that may have duplicates across chunks.
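
One plausible shape for schema-aware deduplication: concatenate the array fields of all partial results, then drop structural duplicates. How Struktur actually keys duplicates depends on the schema, so the full-JSON identity used here is a simplifying assumption:

```typescript
// Merge array outputs from several chunks, dropping exact structural duplicates.
function autoMergeArrays<T>(partials: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const partial of partials) {
    for (const item of partial) {
      const key = JSON.stringify(item);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(item);
      }
    }
  }
  return merged;
}
```

Unlike the LLM merge used by parallel and doublePass, this step is deterministic and costs no tokens.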

parallelAutoMerge

Parallel extraction with schema-aware merge and deduplication.

Best for: Array extraction from large inputs where speed matters.

sequentialAutoMerge

Sequential extraction with schema-aware merge and deduplication.

Best for: Ordered array extraction where context matters.

doublePassAutoMerge

Double-pass extraction with schema-aware merge and deduplication.

Best for: Large array extraction with maximum quality requirement.


Choosing a Strategy

Start with the Agent. It's the default because it works best for most documents.

| Strategy | When to use |
| --- | --- |
| agent (default) | Start here — autonomous exploration for most documents |
| simple | Small input, fits in one context window, predictable costs |
| parallel | Large input, order doesn't matter, speed priority |
| sequential | Large input, context carries across chunks |
| parallelAutoMerge | Large input with arrays — parallel + dedupe |
| sequentialAutoMerge | Large input with arrays — sequential + dedupe |
| doublePass | Quality matters, two-pass refinement |
| doublePassAutoMerge | Quality + arrays + dedupe |

Quick decision flowchart

flowchart TD
    A[Start] --> B{Try Agent first?}
    B -->|Yes| C[Use agent — default]
    B -->|Need fixed costs| D{Input fits in context?}
    D -->|Yes| E[Use simple]
    D -->|No| F{Extracting arrays?}
    F -->|Yes| G{Cross-chunk context matters?}
    F -->|No| H{Cross-chunk context matters?}
    G -->|Yes| I[sequentialAutoMerge or doublePassAutoMerge]
    G -->|No| J[parallelAutoMerge]
    H -->|Yes| K[sequential or doublePass]
    H -->|No| L[parallel]
