The normalization boundary

Different document types (PDF, HTML, Excel, email) require different parsing strategies, but LLM extraction is the same regardless of source format. The Artifact is the normalized form that crosses the boundary between format-specific parsing and format-agnostic extraction.

Struktur only cares about what is in the artifact, not where it came from.

What an artifact contains

An artifact has:

id: unique identifier
type: type hint (text, image, pdf, file)
contents: a sequence of content slices

Each content slice may have:

text: the text content
page: page number (for paginated documents)
media: embedded images

This structure naturally maps to paginated documents (each page is a content slice) or segmented text (each paragraph/section is a slice).
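As a rough illustration of the paginated mapping, the sketch below turns a list of page texts into an artifact-shaped dictionary. The helper name `pages_to_artifact` and the 1-based page numbering are assumptions made for this example; they are not part of Struktur's API.

```python
from typing import Any


def pages_to_artifact(artifact_id: str, pages: list[str]) -> dict[str, Any]:
    """Build a pdf-type artifact with one content slice per page.

    Hypothetical helper: page numbers are assumed to start at 1.
    """
    if not pages:
        raise ValueError("an artifact needs at least one content slice")
    contents = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]
    return {"id": artifact_id, "type": "pdf", "contents": contents}


artifact = pages_to_artifact("report-1", ["Page one text", "Page two text"])
print(artifact["contents"][1])
```

For segmented text, the same shape applies with the `page` key left out and one slice per paragraph or section.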

Why text + images together?

Some documents (real estate exposés, product datasheets) have critical information in images. Because images are embedded directly in content slices alongside text, the LLM sees them in context.

Image limits per chunk are configurable on parallel strategies via `maxImages`.

Complete specification

JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "SerializedArtifacts",
  "oneOf": [
    { "$ref": "#/definitions/SerializedArtifact" },
    {
      "type": "array",
      "items": { "$ref": "#/definitions/SerializedArtifact" },
      "minItems": 1
    }
  ],
  "definitions": {
    "SerializedArtifact": {
      "type": "object",
      "required": ["id", "type", "contents"],
      "additionalProperties": false,
      "properties": {
        "id": { "type": "string" },
        "type": { "type": "string", "enum": ["text", "image", "pdf", "file"] },
        "contents": {
          "type": "array",
          "items": { "$ref": "#/definitions/SerializedArtifactContent" },
          "minItems": 1
        },
        "metadata": { "type": "object" },
        "tokens": { "type": "number" }
      }
    },
    "SerializedArtifactContent": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "page": { "type": "number" },
        "text": { "type": "string" },
        "media": {
          "type": "array",
          "items": { "$ref": "#/definitions/SerializedArtifactImage" }
        }
      },
      "anyOf": [
        { "required": ["text"] },
        { "required": ["media"] }
      ]
    },
    "SerializedArtifactImage": {
      "type": "object",
      "required": ["type"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "image" },
        "url": { "type": "string" },
        "base64": { "type": "string" },
        "text": { "type": "string" },
        "x": { "type": "number" },
        "y": { "type": "number" },
        "width": { "type": "number" },
        "height": { "type": "number" },
        "imageType": { "type": "string", "enum": ["embedded", "screenshot"] }
      },
      "anyOf": [
        { "required": ["url"] },
        { "required": ["base64"] }
      ]
    }
  }
}

Top-level shape

Field	Required	Description
`id`	Yes	Unique identifier
`type`	Yes	One of: `text`, `image`, `pdf`, `file`
`contents`	Yes	Array of content slices (at least one)
`metadata`	No	Pass-through metadata object
`tokens`	No	Pre-computed token count hint

Accepted as either a single object or an array: `[{...}, {...}]`.
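Because both a single object and an array are accepted, consuming code usually wants to normalize the input to a list first. The following is a minimal sketch of that step, assuming the JSON arrives as a string; `load_artifacts` is a hypothetical name and checks only the three required top-level fields, not the full schema.

```python
import json
from typing import Any


def load_artifacts(raw: str) -> list[dict[str, Any]]:
    """Parse serialized artifacts and always return a list of objects."""
    data = json.loads(raw)
    # A single object is wrapped so callers can always iterate.
    artifacts = data if isinstance(data, list) else [data]
    if not artifacts:
        raise ValueError("expected at least one artifact")
    for artifact in artifacts:
        if not isinstance(artifact, dict):
            raise ValueError("each artifact must be a JSON object")
        missing = [k for k in ("id", "type", "contents") if k not in artifact]
        if missing:
            raise ValueError(f"artifact is missing required fields: {missing}")
    return artifacts


single = load_artifacts('{"id": "a", "type": "text", "contents": [{"text": "hi"}]}')
print(len(single))
```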

Content slices

Each item in `contents` has:

Field	Required	Description
`page`	No	Page number for paginated documents
`text`	No	Text content of this slice
`media`	No	Array of images embedded in this slice

At least one of `text` or `media` must be present.
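The slice rules can be expressed as a small predicate. This is a hand-written sketch of the constraints listed above (known keys only, correct types, at least one of `text` or `media`). The function name `is_valid_slice` is an assumption for illustration, and the check does not descend into the individual images.

```python
from typing import Any


def is_valid_slice(content: dict[str, Any]) -> bool:
    """Check one content slice against the rules described above."""
    # The schema sets additionalProperties to false, so unknown keys fail.
    if set(content) - {"page", "text", "media"}:
        return False
    if "page" in content:
        page = content["page"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(page, bool) or not isinstance(page, (int, float)):
            return False
    if "text" in content and not isinstance(content["text"], str):
        return False
    if "media" in content and not isinstance(content["media"], list):
        return False
    # At least one of text or media must be present.
    return "text" in content or "media" in content


print(is_valid_slice({"page": 1, "text": "hello"}))
print(is_valid_slice({"page": 1}))
```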

Images

Each item in `media` has:

Field	Required	Description
`type`	Yes	Must be `"image"`
`url`	No	URL to image (mutually exclusive with `base64`)
`base64`	No	Base64-encoded image data (no data-URL prefix)
`text`	No	Alt text or OCR output
`x`, `y`, `width`, `height`	No	Optional spatial metadata (pixels)
`imageType`	No	`"embedded"` or `"screenshot"`. Distinguishes images extracted from the document body from page renders. Omit for hand-crafted artifacts.

Either `url` or `base64` must be present.

The `imageType` field is set automatically by the PDF parser: `"embedded"` for images extracted from the PDF body (requires `--images`), `"screenshot"` for full-page renders (requires `--screenshots`). The artifact viewer uses this field to filter and badge images independently.
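When building image entries by hand, two details are easy to get wrong: supplying neither (or both) of `url` and `base64`, and leaving a `data:` prefix on the encoded string. The sketch below shows one way to handle both. `make_image` and `strip_data_url_prefix` are hypothetical helpers, and the choice to reject entries that carry both fields follows the table's "mutually exclusive" note, which is stricter than the schema's at-least-one rule.

```python
import base64
from typing import Optional


def strip_data_url_prefix(value: str) -> str:
    """Turn 'data:image/png;base64,AAAA' into 'AAAA'; bare base64 is unchanged."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value


def make_image(data: Optional[bytes] = None, url: Optional[str] = None,
               alt_text: Optional[str] = None) -> dict:
    """Build a media entry from raw bytes or from a URL, but not both."""
    if (data is None) == (url is None):
        raise ValueError("provide exactly one of data or url")
    image = {"type": "image"}
    if url is not None:
        image["url"] = url
    else:
        # Standard base64 without any data-URL prefix.
        image["base64"] = base64.b64encode(data).decode("ascii")
    if alt_text is not None:
        image["text"] = alt_text
    return image


logo = make_image(data=b"hello", alt_text="Company logo")
print(logo)
```

The `imageType` key is deliberately left out here, in line with the advice to omit it for hand-crafted artifacts.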

Complete example

[
  {
    "id": "invoice-2024-1042",
    "type": "pdf",
    "contents": [
      {
        "page": 1,
        "text": "INVOICE\nInvoice #: 1042\nDate: 2024-03-01\nBill To: Acme Corp\n...",
        "media": [
          {
            "type": "image",
            "base64": "iVBORw0KGgoAAAANS...",
            "text": "Company logo",
            "imageType": "embedded"
          },
          {
            "type": "image",
            "base64": "iVBORw0KGgoAAAANS...",
            "imageType": "screenshot"
          }
        ]
      },
      {
        "page": 2,
        "text": "Line Items:\n- Widget A x10 @ $50.00 = $500.00\n- Widget B x5 @ $200.00 = $1,000.00\nTotal: $1,500.00"
      }
    ],
    "metadata": {
      "filename": "invoice-1042.pdf",
      "source": "email-attachment"
    }
  }
]
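To show how the nesting in the example is traversed, the sketch below walks a trimmed copy of the invoice artifact and tallies slices, pages, and images by `imageType`. The `summarize` function is a hypothetical helper, and the base64 strings are the same truncated placeholders used in the example, not real image data.

```python
from collections import Counter
from typing import Any


def summarize(artifact: dict[str, Any]) -> dict[str, Any]:
    """Count slices, pages, and images (grouped by imageType)."""
    image_types: Counter = Counter()
    for content in artifact["contents"]:
        for image in content.get("media", []):
            image_types[image.get("imageType", "unspecified")] += 1
    return {
        "id": artifact["id"],
        "slices": len(artifact["contents"]),
        "pages": [c["page"] for c in artifact["contents"] if "page" in c],
        "images": dict(image_types),
    }


example = {
    "id": "invoice-2024-1042",
    "type": "pdf",
    "contents": [
        {
            "page": 1,
            "text": "INVOICE\nInvoice #: 1042",
            "media": [
                {"type": "image", "base64": "iVBORw0KGgoAAAANS...",
                 "text": "Company logo", "imageType": "embedded"},
                {"type": "image", "base64": "iVBORw0KGgoAAAANS...",
                 "imageType": "screenshot"},
            ],
        },
        {"page": 2, "text": "Line Items: ..."},
    ],
}

summary = summarize(example)
print(summary)
```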

Validation

Struktur validates artifact JSON before processing. Use the CLI:

# From stdin
cat artifacts.json | struktur verify --stdin

# From a file
struktur verify --input artifacts.json

On success, the command returns `{ "valid": true, "artifacts": 1 }`; on failure, it throws an error with details.
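For scripted pipelines, the check can be wrapped in a few lines of Python. This is only a sketch under stated assumptions: it assumes the success payload is printed to stdout as JSON and that a failed verification exits with a non-zero status, neither of which is confirmed above. `verify_file` and `parse_verify_output` are hypothetical names.

```python
import json
import subprocess


def parse_verify_output(stdout: str) -> int:
    """Return the artifact count from a payload like {"valid": true, "artifacts": 1}."""
    result = json.loads(stdout)
    if result.get("valid") is not True:
        raise ValueError(f"verification did not succeed: {result}")
    return int(result["artifacts"])


def verify_file(path: str) -> int:
    """Run `struktur verify --input <path>` and return the artifact count."""
    # Assumption: a non-zero exit status signals failure, so check=True
    # raises subprocess.CalledProcessError in that case.
    completed = subprocess.run(
        ["struktur", "verify", "--input", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_verify_output(completed.stdout)


# The parser can be exercised without the CLI installed.
print(parse_verify_output('{ "valid": true, "artifacts": 1 }'))
```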

Built-in artifact creation

Path	Description
`--input <file>` (CLI)	MIME detection + parser resolution; PDF uses built-in `parsePdf`
`--stdin` (CLI)	MIME detection on buffer; `text/plain` falls back to text artifact
`parse()` (SDK)	Accepts `kind: "text"`, `kind: "file"`, `kind: "buffer"`, `kind: "artifact-json"`
`urlToArtifact()` (SDK)	Fetches URL, validates as `SerializedArtifact[]`
