Struktur

Artifact Format

The artifact abstraction and complete specification.

The normalization boundary

Different document types (PDF, HTML, Excel, email) require different parsing strategies, but LLM extraction is the same regardless of source format. The Artifact is the normalized form that crosses that boundary.

Struktur only cares about what is in the artifact, not where it came from.

What an artifact contains

An artifact has:

  • id: unique identifier
  • type: type hint (text, image, pdf, file)
  • contents: a sequence of content slices

Each content slice may have:

  • text: the text content
  • page: page number (for paginated documents)
  • media: embedded images

This structure naturally maps to paginated documents (each page is a content slice) or segmented text (each paragraph/section is a slice).
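As a rough illustration of the paginated mapping, the sketch below turns a list of page texts into one artifact with one content slice per page. The type names and the pagesToArtifact helper are hypothetical and local to this example; they are not part of the Struktur SDK. Numbering pages from 1 is also an assumption.

```typescript
// Hypothetical local types mirroring the fields listed above; not imported from Struktur.
type ContentSlice = { text?: string; page?: number };
type PagedArtifact = {
  id: string;
  type: "text" | "image" | "pdf" | "file";
  contents: ContentSlice[];
};

// Turn an ordered list of page texts into one artifact, one slice per page.
// Pages are numbered from 1 (an assumption for this sketch).
function pagesToArtifact(id: string, pages: string[]): PagedArtifact {
  if (pages.length === 0) {
    throw new Error("an artifact needs at least one content slice");
  }
  return {
    id,
    type: "pdf",
    contents: pages.map((text, index) => ({ page: index + 1, text })),
  };
}

// Two pages become two slices, with page set to 1 and 2.
const report = pagesToArtifact("report-1", ["Cover page", "Summary table"]);
```

For segmented text, the same shape applies with one slice per paragraph or section and the page field left out.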

Why text + images together?

Some documents (real estate exposés, product datasheets) have critical information in images. Because images are embedded directly in content slices alongside text, the LLM sees them in context.

Image limits per chunk are configurable on parallel strategies via maxImages.

Complete specification

JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "SerializedArtifacts",
  "oneOf": [
    { "$ref": "#/definitions/SerializedArtifact" },
    {
      "type": "array",
      "items": { "$ref": "#/definitions/SerializedArtifact" },
      "minItems": 1
    }
  ],
  "definitions": {
    "SerializedArtifact": {
      "type": "object",
      "required": ["id", "type", "contents"],
      "additionalProperties": false,
      "properties": {
        "id": { "type": "string" },
        "type": { "type": "string", "enum": ["text", "image", "pdf", "file"] },
        "contents": {
          "type": "array",
          "items": { "$ref": "#/definitions/SerializedArtifactContent" },
          "minItems": 1
        },
        "metadata": { "type": "object" },
        "tokens": { "type": "number" }
      }
    },
    "SerializedArtifactContent": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "page": { "type": "number" },
        "text": { "type": "string" },
        "media": {
          "type": "array",
          "items": { "$ref": "#/definitions/SerializedArtifactImage" }
        }
      },
      "anyOf": [
        { "required": ["text"] },
        { "required": ["media"] }
      ]
    },
    "SerializedArtifactImage": {
      "type": "object",
      "required": ["type"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "image" },
        "url": { "type": "string" },
        "base64": { "type": "string" },
        "text": { "type": "string" },
        "x": { "type": "number" },
        "y": { "type": "number" },
        "width": { "type": "number" },
        "height": { "type": "number" },
        "imageType": { "type": "string", "enum": ["embedded", "screenshot"] }
      },
      "anyOf": [
        { "required": ["url"] },
        { "required": ["base64"] }
      ]
    }
  }
}
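For TypeScript callers, the schema can be approximated with hand-written types. The sketch below is one possible mirror, not types exported by Struktur: the names are local to this example, the anyOf constraints are expressed as unions, and additionalProperties: false is not fully enforceable at the type level.

```typescript
// Hand-written mirror of the JSON Schema above; these names are local to this sketch.
type ArtifactImage = {
  type: "image";
  text?: string;
  x?: number;
  y?: number;
  width?: number;
  height?: number;
  imageType?: "embedded" | "screenshot";
} & ({ url: string; base64?: string } | { base64: string; url?: string }); // anyOf: url / base64

type ArtifactContent = { page?: number } & (
  | { text: string; media?: ArtifactImage[] }
  | { media: ArtifactImage[]; text?: string }
); // anyOf: text / media

type SerializedArtifact = {
  id: string;
  type: "text" | "image" | "pdf" | "file";
  contents: [ArtifactContent, ...ArtifactContent[]]; // minItems: 1
  metadata?: Record<string, unknown>;
  tokens?: number;
};

// The top level accepts one artifact or a non-empty array of them.
type SerializedArtifacts =
  | SerializedArtifact
  | [SerializedArtifact, ...SerializedArtifact[]];

const example: SerializedArtifacts = {
  id: "note-1",
  type: "text",
  contents: [{ text: "Hello" }],
};
```

Types like these only catch mistakes at compile time; artifact JSON that arrives at runtime still needs to be validated against the schema.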

Top-level shape

Field     Required  Description
id        Yes       Unique identifier
type      Yes       One of: text, image, pdf, file
contents  Yes       Array of content slices (at least one)
metadata  No        Pass-through metadata object
tokens    No        Pre-computed token count hint

Accepted as: a single object or an array [{...}, {...}].
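Code that consumes artifact JSON may find it easier to work with one form only. A minimal sketch, assuming a hypothetical toArtifactList helper and a deliberately loose local type, that normalizes both accepted shapes to an array:

```typescript
// Minimal local shape; only the fields this sketch touches.
type ArtifactLike = { id: string; type: string; contents: unknown[] };

// Accept either form and always hand back an array.
function toArtifactList(input: ArtifactLike | ArtifactLike[]): ArtifactLike[] {
  const list = Array.isArray(input) ? input : [input];
  if (list.length === 0) {
    // The schema requires minItems: 1 for the array form.
    throw new Error("expected at least one artifact");
  }
  return list;
}

const single = toArtifactList({ id: "a", type: "text", contents: [{ text: "hi" }] });
const many = toArtifactList([
  { id: "a", type: "text", contents: [{ text: "hi" }] },
  { id: "b", type: "text", contents: [{ text: "there" }] },
]);
```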

Content slices

Each item in contents has:

Field  Required  Description
page   No        Page number for paginated documents
text   No        Text content of this slice
media  No        Array of images embedded in this slice

At least one of text or media must be present.
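The slice rules can be checked by hand before sending JSON to Struktur. The checkSlice function below is a hypothetical pre-flight check written for this page, covering only the rules in the table above; it does not inspect individual images and is not a substitute for struktur verify.

```typescript
// Returns a list of problems for one content slice; an empty list means the
// slice passes the checks sketched here (not the full schema).
function checkSlice(slice: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const allowed = ["page", "text", "media"];

  const hasText = typeof slice["text"] === "string";
  const hasMedia = Array.isArray(slice["media"]);
  if (!hasText && !hasMedia) {
    problems.push("slice needs text, media, or both");
  }
  if ("page" in slice && typeof slice["page"] !== "number") {
    problems.push("page must be a number");
  }
  for (const key of Object.keys(slice)) {
    if (allowed.indexOf(key) === -1) {
      problems.push(`unknown field: ${key}`); // additionalProperties: false
    }
  }
  return problems;
}

const ok = checkSlice({ page: 1, text: "Invoice #: 1042" }); // no problems
const bad = checkSlice({ page: 2 }); // neither text nor media
```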

Images

Each item in media has:

Field                Required  Description
type                 Yes       Must be "image"
url                  No        URL to image (mutually exclusive with base64)
base64               No        Base64-encoded image data (no data-URL prefix)
text                 No        Alt text or OCR output
x, y, width, height  No        Optional spatial metadata (pixels)
imageType            No        "embedded" or "screenshot". Distinguishes images extracted from the document body from page renders. Omit for hand-crafted artifacts.

Either url or base64 must be present.
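Because base64 must not carry a data-URL prefix, image data copied from a browser or an HTML page usually needs trimming first. The helpers below are a hypothetical sketch, not SDK functions; treating any http(s) string as a url and everything else as base64 is an assumption made for the example.

```typescript
type ImageEntry = { type: "image"; url?: string; base64?: string; text?: string };

// Strip a leading "data:<mime>;base64," prefix if present, so the stored
// value is bare base64.
function toBareBase64(value: string): string {
  const marker = ";base64,";
  if (value.slice(0, 5) === "data:") {
    const at = value.indexOf(marker);
    if (at !== -1) return value.slice(at + marker.length);
  }
  return value;
}

// Build a media entry that sets exactly one of url or base64.
function makeImage(source: string, altText?: string): ImageEntry {
  const isRemote = /^https?:\/\//.test(source);
  const image: ImageEntry = isRemote
    ? { type: "image", url: source }
    : { type: "image", base64: toBareBase64(source) };
  if (altText !== undefined) image.text = altText;
  return image;
}

const remote = makeImage("https://example.com/logo.png");
const inline = makeImage("data:image/png;base64,iVBORw0KGgo", "Company logo");
```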

The imageType field is set automatically by the PDF parser: "embedded" for images extracted from the PDF body (requires --images), "screenshot" for full-page renders (requires --screenshots). The artifact viewer uses this field to filter and badge images independently.
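Filtering on imageType is straightforward in consumer code as well. The sketch below is an illustration with hypothetical names and placeholder base64 strings; it is not the artifact viewer's implementation.

```typescript
type TaggedImage = {
  type: "image";
  url?: string;
  base64?: string;
  imageType?: "embedded" | "screenshot";
};
type SliceWithMedia = { page?: number; text?: string; media?: TaggedImage[] };

// Collect every image of one kind across all slices. Images without
// imageType (for example in hand-crafted artifacts) match neither kind.
function imagesOfKind(
  contents: SliceWithMedia[],
  kind: "embedded" | "screenshot",
): TaggedImage[] {
  const found: TaggedImage[] = [];
  for (const slice of contents) {
    for (const image of slice.media ?? []) {
      if (image.imageType === kind) found.push(image);
    }
  }
  return found;
}

// Placeholder image data, shaped like the output of the PDF parser.
const slices: SliceWithMedia[] = [
  {
    page: 1,
    text: "INVOICE",
    media: [
      { type: "image", base64: "iVBORw0KGgo", imageType: "embedded" },
      { type: "image", base64: "iVBORw0KGgo", imageType: "screenshot" },
    ],
  },
  { page: 2, text: "Line Items" },
];
const screenshots = imagesOfKind(slices, "screenshot");
```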

Complete example

[
  {
    "id": "invoice-2024-1042",
    "type": "pdf",
    "contents": [
      {
        "page": 1,
        "text": "INVOICE\nInvoice #: 1042\nDate: 2024-03-01\nBill To: Acme Corp\n...",
        "media": [
          {
            "type": "image",
            "base64": "iVBORw0KGgoAAAANS...",
            "text": "Company logo",
            "imageType": "embedded"
          },
          {
            "type": "image",
            "base64": "iVBORw0KGgoAAAANS...",
            "imageType": "screenshot"
          }
        ]
      },
      {
        "page": 2,
        "text": "Line Items:\n- Widget A x10 @ $50.00 = $500.00\n- Widget B x5 @ $200.00 = $1,000.00\nTotal: $1,500.00"
      }
    ],
    "metadata": {
      "filename": "invoice-1042.pdf",
      "source": "email-attachment"
    }
  }
]

Validation

Struktur validates artifact JSON before processing. Use the CLI:

# From stdin
cat artifacts.json | struktur verify --stdin
# From a file
struktur verify --input artifacts.json

On success, the command returns { "valid": true, "artifacts": 1 }. On failure, it throws with error details.

Built-in artifact creation

Path                   Description
--input <file> (CLI)   MIME detection + parser resolution; PDF uses built-in parsePdf
--stdin (CLI)          MIME detection on buffer; text/plain falls back to text artifact
parse() (SDK)          Accepts kind: "text", kind: "file", kind: "buffer", kind: "artifact-json"
urlToArtifact() (SDK)  Fetches URL, validates as SerializedArtifact[]

See also

  • Document Parsing — how to get input into Struktur and how files are converted to artifacts
  • parse() — the SDK API
