/ʃtrʊkˈtuːr/

struktur

All-in-one tool for structured data extraction.
Feed it any document — PDF, text, or custom format.
Get back validated, schema-typed JSON.

Extract data from your command line


Installation & Quickstart

Install globally
$ npm install -g @struktur/cli
Store your API key and set a default model in one step
$ struktur config providers add openai --token "sk-..." --default
Extract structured data from any file
$ struktur --input invoice.pdf --fields "number, vendor, total:number"
Read the full quickstart →

Features

Extraction strategies for any kind of document
Choose how Struktur processes your document: single-shot for simple inputs, parallel chunking for large files, sequential pass for context-dependent extraction, or double-pass refinement for higher accuracy. Auto-merge strategies deduplicate results across chunks automatically.
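The chunk-and-merge idea behind the parallel strategy can be sketched in a few lines of TypeScript (an illustration of the general technique only; chunk and mergeDedup are made-up helpers, not Struktur's API):

```typescript
// Split a long document into fixed-size chunks for parallel extraction.
function chunk(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Merge per-chunk results, dropping records that appear in multiple chunks.
function mergeDedup<T>(results: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const record of results.flat()) {
    const key = JSON.stringify(record); // structural identity as dedup key
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(record);
    }
  }
  return merged;
}
```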
Use any LLM
OpenAI, Anthropic, Google, Mistral, OpenRouter, OpenCode Zen, and more. Switch with a single flag or by configuring default models.
Built-in file parsing
Pass a PDF, image, or text file — Struktur makes it LLM-ready before extraction, including embedded images and full-page "screenshots". Add your own parser easily.
Schema validation with auto-retry
Every LLM response is thoroughly validated against your schema. Validation errors are fed back to the model automatically, letting it fix its own mistakes.
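This validate-and-retry loop follows a common pattern, sketched here in TypeScript; callModel and validate are stand-ins for illustration, not Struktur's actual API:

```typescript
// Returns a list of human-readable validation error messages (empty = valid).
type Validator = (data: unknown) => string[];

// Illustrative auto-retry loop: validation errors are fed back to the
// model as feedback so it can correct its own output on the next attempt.
async function extractWithRetry(
  callModel: (feedback?: string) => Promise<unknown>,
  validate: Validator,
  maxRetries = 3
): Promise<unknown> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const data = await callModel(feedback);
    const errors = validate(data);
    if (errors.length === 0) return data; // schema satisfied
    feedback = `Fix these validation errors: ${errors.join('; ')}`;
  }
  throw new Error('Extraction still failed schema validation after retries');
}
```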
Fields shorthand
Extract data on the fly without writing a verbose JSON schema. Use the --fields flag with the shorthand syntax for one-off extractions or experimentation.
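For illustration, here is one way a shorthand like "number, vendor, total:number" could expand into a JSON schema (a sketch of the general idea; Struktur's exact parsing rules may differ):

```typescript
// Hypothetical expansion of the --fields shorthand into a JSON schema.
// Convention assumed here: "name" defaults to string, "name:type" sets
// the type explicitly.
function fieldsToSchema(fields: string) {
  const properties: Record<string, { type: string }> = {};
  for (const field of fields.split(',')) {
    const [name, type = 'string'] = field.trim().split(':');
    properties[name] = { type };
  }
  return { type: 'object', properties };
}
```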
TypeScript SDK
Integrate Struktur into your applications using the fully typed SDK. Everything is just JavaScript, so it works with any runtime.
Embedded media support
File parsing renders document pages as images so the LLM sees tables, charts, and photos in context. It can even reference visual elements in the output data.

How it works

Raw Input
Files, Text or Images
Artifact
Text + Images
Extract
Your chosen strategy
Structured Data
JSON in your schema

Before extracting, Struktur normalizes your raw input into the Artifact format, which is then handed to the extraction strategy you picked. Depending on the strategy, the Artifact is chunked and sent to the LLM, which extracts data matching your schema and automatically retries on validation errors.

Extraction pipeline explained →

Prepare any filetype for LLMs

Struktur's parser layer converts files into Artifacts before extraction. PDF, plain text, and images work out of the box. Register custom parsers for any MIME type using an npm package or a shell command.

Built-in Parsers
application/pdf: text + images per page
text/*: split into content slices
image/*: passed as media artifact
application/json: treated as text unless it's valid Artifact data
Adding custom parsers
$ struktur config parsers add ...
NPM package: --npm @myorg/docx-parser
Shell command (using path): --file-command "markitdown FILE_PATH"
Shell command (using stdin): --stdin-command "my-html-tool"
Register a Word document parser
$ struktur config parsers add \
    --mime application/msword \
    --file-command "markitdown FILE_PATH"
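For reference, an Artifact might be modeled roughly like this (an assumption inferred from the "Text + Images" description above; the real type is defined by @struktur/sdk):

```typescript
// A guess at the Artifact shape, based on the docs' "Text + Images"
// description. Not the actual @struktur/sdk type.
type Artifact =
  | { kind: 'text'; content: string }
  | { kind: 'image'; mimeType: string; data: Uint8Array };

// A JSON input would be kept as-is only if it already matches this shape.
function isArtifact(value: unknown): value is Artifact {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v.kind === 'text') return typeof v.content === 'string';
  if (v.kind === 'image') {
    return typeof v.mimeType === 'string' && v.data instanceof Uint8Array;
  }
  return false;
}
```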
Parser system explained →

Integrate into your application using the TypeScript SDK

Install the SDK
$ npm install @struktur/sdk
import { extract, simple, parse } from '@struktur/sdk';
import { openai } from '@ai-sdk/openai';

// Parse a raw buffer into Artifacts
const artifacts = await parse(
  { kind: 'buffer', buffer, mimeType: 'application/pdf' },
  { includeImages: true }
);

// Run extraction with your chosen strategy
const result = await extract({
  artifacts,
  schema: {
    type: 'object',
    properties: { invoice_nr: { type: 'string' }, total: { type: 'number' } }
  },
  strategy: simple({ model: openai('gpt-4o-mini') }),
});

// result.data is fully typed from your schema
SDK reference →

Ready to extract structured data?

Quickstart

Install globally
$ npm install -g @struktur/cli
Extract data from any file
$ struktur --input invoice.pdf --fields "total:number"
Full quickstart guide →

Documentation

Explore extraction strategies, parser configuration, SDK integration, and advanced features.