Artifact Format
The artifact abstraction and complete specification.
The normalization boundary
Different document types (PDF, HTML, Excel, email) require different parsing strategies, but LLM extraction works the same way regardless of source format. The Artifact is the normalized form that crosses the boundary between parsing and extraction.
Struktur only cares about what is in the artifact, not where it came from.
What an artifact contains
An artifact has:
- id: unique identifier
- type: type hint (text, image, pdf, file)
- contents: a sequence of content slices
Each content slice may have:
- text: the text content
- page: page number (for paginated documents)
- media: embedded images
This structure naturally maps to paginated documents (each page is a content slice) or segmented text (each paragraph/section is a slice).
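To make the page-per-slice mapping concrete, here is a minimal sketch that builds an artifact from a list of page texts. The helper name pages_to_artifact is hypothetical and not part of Struktur; only the field names (id, type, contents, page, text) come from the description above.

```python
import json


def pages_to_artifact(artifact_id, pages):
    """Build a pdf-type artifact dict with one content slice per page.

    Hypothetical helper for illustration; not a Struktur API.
    """
    contents = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)  # pages are 1-based
    ]
    return {"id": artifact_id, "type": "pdf", "contents": contents}


artifact = pages_to_artifact("report-1", ["First page text", "Second page text"])
print(json.dumps(artifact, indent=2))
```

For segmented text, the same shape applies without the page field: each paragraph or section becomes one slice with only text.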
Why text + images together?
Some documents (real estate exposés, product datasheets) have critical information in images. Because images are embedded directly in content slices alongside text, the LLM sees them in context.
Image limits per chunk are configurable on parallel strategies via maxImages.
Complete specification
JSON Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "SerializedArtifacts",
"oneOf": [
{ "$ref": "#/definitions/SerializedArtifact" },
{
"type": "array",
"items": { "$ref": "#/definitions/SerializedArtifact" },
"minItems": 1
}
],
"definitions": {
"SerializedArtifact": {
"type": "object",
"required": ["id", "type", "contents"],
"additionalProperties": false,
"properties": {
"id": { "type": "string" },
"type": { "type": "string", "enum": ["text", "image", "pdf", "file"] },
"contents": {
"type": "array",
"items": { "$ref": "#/definitions/SerializedArtifactContent" },
"minItems": 1
},
"metadata": { "type": "object" },
"tokens": { "type": "number" }
}
},
"SerializedArtifactContent": {
"type": "object",
"additionalProperties": false,
"properties": {
"page": { "type": "number" },
"text": { "type": "string" },
"media": {
"type": "array",
"items": { "$ref": "#/definitions/SerializedArtifactImage" }
}
},
"anyOf": [
{ "required": ["text"] },
{ "required": ["media"] }
]
},
"SerializedArtifactImage": {
"type": "object",
"required": ["type"],
"additionalProperties": false,
"properties": {
"type": { "type": "string", "const": "image" },
"url": { "type": "string" },
"base64": { "type": "string" },
"text": { "type": "string" },
"x": { "type": "number" },
"y": { "type": "number" },
"width": { "type": "number" },
"height": { "type": "number" },
"imageType": { "type": "string", "enum": ["embedded", "screenshot"] }
},
"anyOf": [
{ "required": ["url"] },
{ "required": ["base64"] }
]
}
}
}
Top-level shape
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier |
| type | Yes | One of: text, image, pdf, file |
| contents | Yes | Array of content slices (at least one) |
| metadata | No | Pass-through metadata object |
| tokens | No | Pre-computed token count hint |
Accepted as: a single object or an array [{...}, {...}].
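Because both a single object and an array are accepted, code that produces or consumes artifact JSON outside Struktur may want to normalize to a list first. The following is a sketch under that assumption; load_artifacts is a hypothetical helper, and it only checks the three required top-level fields, not the full schema.

```python
import json

REQUIRED_FIELDS = ("id", "type", "contents")


def load_artifacts(raw):
    """Parse artifact JSON and always return a non-empty list of dicts.

    Hypothetical helper for illustration; not a Struktur API.
    """
    data = json.loads(raw)
    artifacts = data if isinstance(data, list) else [data]
    if not artifacts:
        raise ValueError("expected at least one artifact")
    for artifact in artifacts:
        if not isinstance(artifact, dict):
            raise ValueError("each artifact must be a JSON object")
        missing = [key for key in REQUIRED_FIELDS if key not in artifact]
        if missing:
            raise ValueError(f"artifact is missing required fields: {missing}")
    return artifacts


single = '{"id": "a", "type": "text", "contents": [{"text": "hello"}]}'
print(len(load_artifacts(single)))                   # prints 1
print(len(load_artifacts(f"[{single}, {single}]")))  # prints 2
```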
Content slices
Each item in contents has:
| Field | Required | Description |
|---|---|---|
| page | No | Page number for paginated documents |
| text | No | Text content of this slice |
| media | No | Array of images embedded in this slice |
At least one of text or media must be present.
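The slice rules can be mirrored in a small pre-check. This is a sketch that follows the table and the schema above, not Struktur's own validator; check_slice is a hypothetical name.

```python
ALLOWED_SLICE_FIELDS = {"page", "text", "media"}


def check_slice(content):
    """Return a list of problems for one content slice (empty if none found).

    Hypothetical helper for illustration; not a Struktur API.
    """
    problems = []
    extra = set(content) - ALLOWED_SLICE_FIELDS
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if "text" not in content and "media" not in content:
        problems.append("need at least one of text or media")
    if "text" in content and not isinstance(content["text"], str):
        problems.append("text must be a string")
    if "media" in content and not isinstance(content["media"], list):
        problems.append("media must be an array")
    if "page" in content:
        page = content["page"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(page, bool) or not isinstance(page, (int, float)):
            problems.append("page must be a number")
    return problems


print(check_slice({"page": 1, "text": "hello"}))  # prints []
print(check_slice({"page": 2}))  # prints ['need at least one of text or media']
```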
Images
Each item in media has:
| Field | Required | Description |
|---|---|---|
| type | Yes | Must be "image" |
| url | No | URL to image (mutually exclusive with base64) |
| base64 | No | Base64-encoded image data (no data-URL prefix) |
| text | No | Alt text or OCR output |
| x, y, width, height | No | Optional spatial metadata (pixels) |
| imageType | No | "embedded" or "screenshot". Distinguishes images extracted from the document body from page renders. Omit for hand-crafted artifacts. |
Either url or base64 must be present.
The imageType field is set automatically by the PDF parser: "embedded" for images extracted from the PDF body (requires --images), "screenshot" for full-page renders (requires --screenshots). The artifact viewer uses this field to filter and badge images independently.
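When hand-crafting artifacts, the base64 field needs raw base64 without a data-URL prefix. A minimal sketch of building a media entry from image bytes follows; both helper names are hypothetical and only the field names and allowed imageType values come from the table above.

```python
import base64

IMAGE_TYPES = ("embedded", "screenshot")


def image_from_bytes(data, alt_text=None, image_type=None):
    """Build a media entry carrying raw base64 (no data-URL prefix).

    Hypothetical helper for illustration; not a Struktur API.
    """
    entry = {"type": "image", "base64": base64.b64encode(data).decode("ascii")}
    if alt_text is not None:
        entry["text"] = alt_text
    if image_type is not None:
        if image_type not in IMAGE_TYPES:
            raise ValueError(f"imageType must be one of {IMAGE_TYPES}")
        entry["imageType"] = image_type
    return entry


def strip_data_url_prefix(value):
    """Drop a leading 'data:image/png;base64,' style prefix if present."""
    if value.startswith("data:") and "," in value:
        return value.split(",", 1)[1]
    return value


print(image_from_bytes(b"abc", alt_text="Company logo"))
print(strip_data_url_prefix("data:image/png;base64,YWJj"))  # prints YWJj
```

Leaving image_type unset matches the guidance to omit imageType for hand-crafted artifacts.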
Complete example
[
{
"id": "invoice-2024-1042",
"type": "pdf",
"contents": [
{
"page": 1,
"text": "INVOICE\nInvoice #: 1042\nDate: 2024-03-01\nBill To: Acme Corp\n...",
"media": [
{
"type": "image",
"base64": "iVBORw0KGgoAAAANS...",
"text": "Company logo",
"imageType": "embedded"
},
{
"type": "image",
"base64": "iVBORw0KGgoAAAANS...",
"imageType": "screenshot"
}
]
},
{
"page": 2,
"text": "Line Items:\n- Widget A x10 @ $50.00 = $500.00\n- Widget B x5 @ $200.00 = $1,000.00\nTotal: $1,500.00"
}
],
"metadata": {
"filename": "invoice-1042.pdf",
"source": "email-attachment"
}
}
]
Validation
Struktur validates artifact JSON before processing. Use the CLI:
# From stdin
cat artifacts.json | struktur verify --stdin
# or from a file:
struktur verify --input artifacts.json
Returns { "valid": true, "artifacts": 1 } on success, throws with error detail on failure.
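In a script, the verify command can be wrapped so that the success payload is checked programmatically. This is a sketch only: it assumes the JSON result is written to stdout and that a failure produces a non-zero exit code, neither of which is stated above. The function names are hypothetical; the command and the --input flag are the ones shown in the CLI example.

```python
import json
import subprocess


def interpret_verify_output(stdout):
    """Return the artifact count if the output reports valid artifacts."""
    result = json.loads(stdout)
    if result.get("valid") is not True:
        raise ValueError(f"verification did not report valid artifacts: {result}")
    return result["artifacts"]


def verify_file(path):
    """Run `struktur verify --input <path>` and return the artifact count.

    Assumption: success JSON goes to stdout and failure exits non-zero,
    in which case check=True raises subprocess.CalledProcessError.
    """
    completed = subprocess.run(
        ["struktur", "verify", "--input", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return interpret_verify_output(completed.stdout)


# Interpreting the documented success payload, without invoking the CLI:
print(interpret_verify_output('{ "valid": true, "artifacts": 1 }'))  # prints 1
```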
Built-in artifact creation
| Path | Description |
|---|---|
| --input <file> (CLI) | MIME detection + parser resolution; PDF uses built-in parsePdf |
| --stdin (CLI) | MIME detection on buffer; text/plain falls back to text artifact |
| parse() (SDK) | Accepts kind: "text", kind: "file", kind: "buffer", kind: "artifact-json" |
| urlToArtifact() (SDK) | Fetches URL, validates as SerializedArtifact[] |
See also
- Document Parsing — how to get input into Struktur and how files are converted to artifacts
- parse() — the SDK API