What is Struktur?
Struktur is an all-in-one tool for structured data extraction using an autonomous agent. It turns documents into validated, schema-typed JSON by having an LLM agent explore the content, decide what to read, and build the output incrementally.
- **CLI Tool** — Extract data from the command line with a simple, intuitive interface
- **TypeScript SDK** — Programmatic API for embedding extraction in your applications
- **Agent Strategy** — Autonomous exploration with virtual filesystem tools
- **Examples** — Real-world extraction patterns and use cases
Why Struktur?
Large document batches arrive with data locked in semi-structured text. Invoices need to flow into spreadsheets. Product datasheets need to become database rows. The tooling exists, but the orchestration overhead is disproportionate to the extraction task itself.
Managed APIs charge per page, impose schema constraints, and require document uploads to external infrastructure. LLM SDKs provide raw model access but leave you to write chunking, validation, retries, and merging every time.
Struktur fills the gap: a focused extraction engine with an autonomous agent that handles the orchestration so you can focus on the output.
Why an Agent?
Traditional extraction strategies (simple, parallel, sequential) require you to choose the right approach upfront. The agent decides:
- When to read — entire document or specific sections
- How to search — grep for patterns, list directories, execute bash commands
- What to extract — build output incrementally as it explores
- How to validate — check against schema and retry automatically
The agent adapts to your document. Small invoices get read in one shot. Large catalogs get navigated systematically. The result is better accuracy without configuration complexity.
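To make the idea concrete, here is a minimal sketch of what an adaptive tool loop looks like in principle. This is illustrative only — the tool names, the `decide` policy, and the loop shape are assumptions for the example, not Struktur's internals:

```typescript
// Illustrative sketch of an adaptive tool loop (NOT Struktur's actual
// implementation): a policy picks one tool per step until it finishes.
type ToolCall = { tool: "read" | "grep" | "finish"; arg: string };

function runAgent(
  document: string,
  decide: (doc: string, observations: string[]) => ToolCall
): string {
  const observations: string[] = [];
  for (let step = 0; step < 10; step++) {
    const call = decide(document, observations);
    if (call.tool === "finish") return call.arg; // final JSON string
    if (call.tool === "read") observations.push(document); // whole doc
    if (call.tool === "grep")
      observations.push(
        document
          .split("\n")
          .filter((line) => line.includes(call.arg))
          .join("\n")
      );
  }
  throw new Error("agent did not converge");
}

// A toy policy: grep for "Total" first, then finish with what it found.
const result = runAgent("Invoice 1042\nTotal: 2400", (_doc, obs) =>
  obs.length === 0
    ? { tool: "grep", arg: "Total" }
    : { tool: "finish", arg: JSON.stringify({ total: 2400 }) }
);
console.log(result);
```

The point of the sketch is the branching: a small document can be handled with a single `read`, while a large one can be narrowed down with `grep` before anything is extracted.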
Why not managed APIs?
| Limitation | Impact |
|---|---|
| Per-page pricing | Does not scale for large batches |
| Schema constraints | You work within their data model |
| Document upload | Non-starter for confidential workloads |
| Black-box behavior | Debugging extraction failures is opaque |
Why not a plain LLM SDK call?
A single `generateText()` call gives you:
- No chunking for large documents
- No retries on schema validation failure
- No merging of multi-chunk results
- No typed output inferred from your schema
You write the same orchestration boilerplate every time. Struktur's agent packages that orchestration into a single, adaptive strategy.
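The orchestration boilerplate in question tends to look the same everywhere. The following is a hand-rolled sketch of just the validation-retry part — `callModel` is a stand-in for a raw SDK call such as `generateText()`, and the function names are ours, not Struktur's:

```typescript
// Sketch of the retry boilerplate a plain SDK call leaves to you.
// A validator throws on schema mismatch; the error is fed back in.
type Validator<T> = (raw: unknown) => T;

async function extractWithRetry<T>(
  callModel: (prompt: string) => Promise<string>,
  validate: Validator<T>,
  prompt: string,
  maxAttempts = 3
): Promise<T> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // On retry, include the previous validation error in the prompt.
    const raw = await callModel(
      lastError ? `${prompt}\nPrevious attempt failed: ${lastError}` : prompt
    );
    try {
      return validate(JSON.parse(raw));
    } catch (err) {
      lastError = String(err);
    }
  }
  throw new Error(`no valid output after ${maxAttempts} attempts`);
}

// Toy model: returns invalid output first, then a valid object.
let calls = 0;
const fakeModel = async (_prompt: string) =>
  ++calls === 1 ? "not json" : '{"total": 2400}';

extractWithRetry(
  fakeModel,
  (raw) => {
    const obj = raw as { total?: unknown };
    if (typeof obj.total !== "number") throw new Error("total must be a number");
    return obj as { total: number };
  },
  "extract the invoice total"
).then((out) => console.log(out.total));
```

Multiply this by chunking, merging, and type inference, and the boilerplate quickly dwarfs the extraction itself.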
Design philosophy
- Agent-first, zero configuration. The agent strategy is the default. It explores documents autonomously, deciding when to read, search, or extract. No need to pick chunk sizes or parallelism upfront.
- Autonomous exploration. The agent uses a virtual filesystem to read files, grep for patterns, find files, and execute commands. It builds output incrementally as it discovers data.
- Shell-composable by default. Reads stdin, writes stdout, speaks JSON. Integrates with `jq`, `find`, `curl`, and any tool in your pipeline.
- Validation in the loop. Errors go back to the model, not to you. The retry loop means most extractions converge within two attempts.
- Schema-first. You define the shape; Struktur guarantees it.
- Fields shorthand. Skip the JSON Schema boilerplate with `--fields "title, price:number, status:enum{draft|live}"`.
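To show why the shorthand saves boilerplate, here is a toy translation of it into JSON Schema. The mapping shown (bare name means string, `name:type` sets the type, `enum{a|b}` becomes a string enum, all fields required) is an assumption inferred from the example above, not a specification of Struktur's grammar:

```typescript
// Toy fields-shorthand to JSON Schema translation (illustrative;
// the grammar Struktur actually supports may differ in details).
function fieldsToSchema(fields: string): Record<string, unknown> {
  const properties: Record<string, unknown> = {};
  for (const field of fields.split(",").map((f) => f.trim())) {
    const [name, type = "string"] = field.split(":");
    const enumMatch = type.match(/^enum\{(.+)\}$/);
    properties[name] = enumMatch
      ? { type: "string", enum: enumMatch[1].split("|") } // enum{a|b}
      : { type }; // bare name defaults to string
  }
  return { type: "object", properties, required: Object.keys(properties) };
}

console.log(
  JSON.stringify(
    fieldsToSchema("title, price:number, status:enum{draft|live}"),
    null,
    2
  )
);
```

Even for three fields, the expanded schema is several times longer than the shorthand, which is the whole argument for having it.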
Trade-offs
| Trade-off | Rationale |
|---|---|
| Requires tool-calling models | The agent needs models that support function calling (Claude, GPT-4, etc.) |
| Depends on Vercel AI SDK providers | OpenAI, Anthropic, Google supported; self-hosted models need OpenAI-compatible API |
| Token costs vary by document | The agent makes multiple tool calls; large documents cost more than small ones |
A 10-second demo

```sh
struktur extract --input invoice.pdf \
  --fields "number, vendor, total:number"
```

Expected output:

```json
{
  "number": "1042",
  "vendor": "Acme Corp",
  "total": 2400
}
```

The agent reads the PDF, decides how to extract the fields, and returns validated JSON.
What Struktur is NOT
- It is not a general document conversion tool. It parses files for extraction purposes, not for format conversion. It does not produce formatted output from documents.
- It is not a managed API. It runs locally and calls your provider directly.
- It does not stream. Input in, JSON out.
- It is not a general LLM orchestration framework.
For the full mental model, see The Extraction Pipeline.
Who is it for?
- **CLI Users** — Data engineers, analysts, shell pipeline builders: use Struktur for one-off extractions, batch processing, and CI/CD automation without writing code.
- **SDK Users** — TypeScript developers embedding extraction in applications: use Struktur for typed results, custom strategies, and fine-grained control over the extraction pipeline.
What is the Agent Strategy?
The agent strategy is the default and recommended way to use Struktur. It implements:
- Virtual filesystem tools — `read`, `grep`, `find`, `ls`, `bash`
- Output management — `set_output_data`, `update_output_data`, `finish`, `fail`
- Autonomous exploration — the agent decides what to do based on your schema
- Incremental extraction — builds output as it discovers data
How it works
1. The agent receives your schema and access to a virtual filesystem containing the document
2. It can read files, search for patterns, list directories, and execute commands
3. As it finds data, it calls `set_output_data` or `update_output_data` to build the result
4. When complete, it calls `finish` to return validated JSON
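The output-management tools can be pictured as a small buffer the agent writes into as it explores. The following sketch shows plausible semantics — replace on set, shallow-merge on update — which are our reading of the tool names above, not a documented contract:

```typescript
// Sketch of incremental output assembly via set/update tool calls
// (illustrative semantics, not Struktur's implementation).
type Output = Record<string, unknown>;

class OutputBuffer {
  private data: Output = {};

  // set_output_data: replace the whole output
  setOutputData(data: Output): void {
    this.data = data;
  }

  // update_output_data: shallow-merge a patch into the output
  updateOutputData(patch: Output): void {
    this.data = { ...this.data, ...patch };
  }

  // finish: return the accumulated result
  finish(): Output {
    return this.data;
  }
}

const buf = new OutputBuffer();
buf.setOutputData({ number: "1042" });         // first discovery
buf.updateOutputData({ vendor: "Acme Corp" }); // found while exploring
buf.updateOutputData({ total: 2400 });
console.log(JSON.stringify(buf.finish()));
```

This is what "builds output incrementally" means in practice: partial results accumulate as the agent reads, and only the finished object is validated and returned.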
When to use other strategies
The agent is the default and works best for most documents. However, other strategies are available for specific cases:
| Strategy | When to use |
|---|---|
| `agent` (default) | Autonomous exploration — best for most documents |
| `simple` | Small input that fits in one context window |
| `parallel` | Large input where speed matters more than accuracy |
| `sequential` | Large input where order matters |
| `parallelAutoMerge` | Large arrays with parallel processing + deduplication |
| `sequentialAutoMerge` | Large arrays with sequential processing + deduplication |
| `doublePass` | Maximum quality with two-pass refinement |
| `doublePassAutoMerge` | Maximum quality with arrays + deduplication |
See Extraction Strategies for details on all strategies.
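For intuition on the AutoMerge variants: when a large array is extracted chunk by chunk, items near chunk boundaries tend to appear twice, so the per-chunk results need to be concatenated and deduplicated. A minimal sketch of that merge step, with a caller-supplied key function (the actual merge heuristics Struktur uses are not specified here):

```typescript
// Sketch of an auto-merge step: concatenate per-chunk arrays and
// drop duplicates by key, keeping the first occurrence (illustrative).
function autoMerge<T>(chunks: T[][], key: (item: T) => string): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const chunk of chunks) {
    for (const item of chunk) {
      const k = key(item);
      if (!seen.has(k)) {
        seen.add(k);
        merged.push(item);
      }
    }
  }
  return merged;
}

// Chunk boundaries often repeat rows; dedupe by SKU.
const merged = autoMerge(
  [
    [{ sku: "A1", price: 10 }],
    [{ sku: "A1", price: 10 }, { sku: "B2", price: 5 }],
  ],
  (p) => p.sku
);
console.log(merged.length); // 2
```

Keying on a stable identifier (here a hypothetical `sku` field) is what makes the dedup safe; keying on the whole object would miss duplicates that differ in minor formatting.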
Quick navigation
| Goal | Section |
|---|---|
| New here? | Quickstart |
| Need to accomplish something? | Examples |
| Looking up a flag or type? | CLI Reference |
| Quick schema without writing JSON? | Fields Shorthand |
| Want to understand how it works? | Concepts |
| Parse files into artifacts? | Document Parsing |