/ʃtrʊkˈtuːr/

struktur

All-in-one tool for structured data extraction.
Feed it any document — PDF, text, or custom format.
Get back validated, schema-typed JSON.

Extract data from your command line


Installation & Quickstart

Install globally
$ npm install -g @struktur/cli
Store your API key and set a default model in one step
$ struktur config providers add openai --token "sk-..." --default
Extract structured data from any file
$ struktur --input invoice.pdf --fields "number, vendor, total:number"
Read the full quickstart →

Features

Extraction strategies for any kind of document
Choose how Struktur processes your document: single-shot for simple inputs, parallel chunking for large files, sequential pass for context-dependent extraction, or double-pass refinement for higher accuracy. Auto-merge strategies deduplicate results across chunks automatically.
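The chunk-and-merge idea behind the parallel strategy can be sketched in a few lines of TypeScript (an illustration of the general technique only; chunk and mergeDedup are made-up helpers, not Struktur's API):

```typescript
// Split a long document into fixed-size chunks for parallel extraction.
function chunk(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Merge per-chunk results, dropping records that appear in multiple chunks.
function mergeDedup<T>(results: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const record of results.flat()) {
    const key = JSON.stringify(record); // structural identity as dedup key
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(record);
    }
  }
  return merged;
}
```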
Use any LLM
OpenAI, Anthropic, Google, Mistral, OpenRouter, OpenCode Zen, and more. Switch with a single flag or by configuring default models.
Built-in file parsing
Pass a PDF, image, or text file — Struktur makes it LLM-ready before extraction, including embedded images and full-page "screenshots". Add your own parser easily.
Schema validation with auto-retry
Every LLM response is thoroughly validated against your schema. Validation errors are fed back to the model automatically, letting it fix its own mistakes.
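This validate-and-retry loop follows a common pattern, sketched here in TypeScript; callModel and validate are stand-ins for illustration, not Struktur's actual API:

```typescript
// Returns a list of human-readable validation error messages (empty = valid).
type Validator = (data: unknown) => string[];

// Illustrative auto-retry loop: validation errors are fed back to the
// model as feedback so it can correct its own output on the next attempt.
async function extractWithRetry(
  callModel: (feedback?: string) => Promise<unknown>,
  validate: Validator,
  maxRetries = 3
): Promise<unknown> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const data = await callModel(feedback);
    const errors = validate(data);
    if (errors.length === 0) return data; // schema satisfied
    feedback = `Fix these validation errors: ${errors.join('; ')}`;
  }
  throw new Error('Extraction still failed schema validation after retries');
}
```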
Fields shorthand
Extract data on the fly without writing a verbose JSON schema. Use the --fields flag with the shorthand syntax for one-off extractions or experimentation.
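For illustration, here is one way a shorthand like "number, vendor, total:number" could expand into a JSON schema (a sketch of the general idea; Struktur's exact parsing rules may differ):

```typescript
// Hypothetical expansion of the --fields shorthand into a JSON schema.
// Convention assumed here: "name" defaults to string, "name:type" sets
// the type explicitly.
function fieldsToSchema(fields: string) {
  const properties: Record<string, { type: string }> = {};
  for (const field of fields.split(',')) {
    const [name, type = 'string'] = field.trim().split(':');
    properties[name] = { type };
  }
  return { type: 'object', properties };
}
```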
TypeScript SDK
Integrate Struktur into your applications using the fully typed SDK. Everything is just JavaScript, so it works with any runtime.
Embedded media support
File parsing renders document pages as images so the LLM sees tables, charts, and photos in context. It can even reference visual elements in the output data.

How it works

Raw Input
Files, Text or Images
Artifact
Text + Images
Extract
Your chosen strategy
Structured Data
JSON in your schema

Before extracting, Struktur normalizes your raw input into the Artifact format, which is then handed to the extraction strategy you picked. Depending on the strategy, the Artifact is chunked and sent to the LLM, which extracts data matching your schema and automatically retries on validation errors.

Extraction pipeline explained →

Prepare any filetype for LLMs

Struktur's parser layer converts files into Artifacts before extraction. PDF, plain text, and images work out of the box. Register custom parsers for any MIME type using an npm package or a shell command.

Built-in Parsers
application/pdf: text + images per page
text/*: split into content slices
image/*: passed as media artifact
application/json: treated as text unless it's valid Artifact data
Adding custom parsers
$ struktur config parsers add ...
NPM package: --npm @myorg/docx-parser
Shell command (using path): --file-command "markitdown FILE_PATH"
Shell command (using stdin): --stdin-command "my-html-tool"
Register a Word document parser
$ struktur config parsers add \
    --mime application/msword \
    --file-command "markitdown FILE_PATH"
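For reference, an Artifact might be modeled roughly like this (an assumption inferred from the "Text + Images" description above; the real type is defined by @struktur/sdk):

```typescript
// A guess at the Artifact shape, based on the docs' "Text + Images"
// description. Not the actual @struktur/sdk type.
type Artifact =
  | { kind: 'text'; content: string }
  | { kind: 'image'; mimeType: string; data: Uint8Array };

// A JSON input would be kept as-is only if it already matches this shape.
function isArtifact(value: unknown): value is Artifact {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v.kind === 'text') return typeof v.content === 'string';
  if (v.kind === 'image') {
    return typeof v.mimeType === 'string' && v.data instanceof Uint8Array;
  }
  return false;
}
```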
Parser system explained →

Integrate into your application using the TypeScript SDK

Install the SDK
$ npm install @struktur/sdk
import { extract, simple, parse } from '@struktur/sdk';
import { openai } from '@ai-sdk/openai';

// Parse a raw buffer into Artifacts
const artifacts = await parse(
  { kind: 'buffer', buffer, mimeType: 'application/pdf' },
  { includeImages: true }
);

// Run extraction with your chosen strategy
const result = await extract({
  artifacts,
  schema: {
    type: 'object',
    properties: { invoice_nr: { type: 'string' }, total: { type: 'number' } }
  },
  strategy: simple({ model: openai('gpt-4o-mini') }),
});

// result.data is fully typed from your schema
SDK reference →

Ready to extract structured data?

Quickstart

Install globally
$ npm install -g @struktur/cli
Extract data from any file
$ struktur --input invoice.pdf --fields "total:number"
Full quickstart guide →

Documentation

Explore extraction strategies, parser configuration, SDK integration, and advanced features.