parse()
Load files and text into artifacts for extraction.
The parse() function is the primary way to load files and text into artifacts. It handles MIME detection, parser resolution, and PDF image extraction automatically.
import { parse } from "@struktur/sdk";
import type { ParsersConfig } from "@struktur/sdk";

// Custom parsers keyed by MIME type; {} is the documented default
const parsersConfig: ParsersConfig = {};

const artifacts = await parse(
  { kind: "file", path: "document.pdf" },
  {
    parserConfig: parsersConfig, // ParsersConfig, keyed by MIME type
    includeImages: true, // extract embedded PDF images
    screenshots: false, // render PDF page screenshots
    screenshotScale: 1.5, // scale factor for screenshots
    screenshotWidth: undefined, // target width in pixels (overrides screenshotScale)
  }
);

Input kinds
| Input kind | Description |
|---|---|
| { kind: "text", text } | Text artifact (split on double newlines) |
| { kind: "file", path, mimeType? } | File artifact: MIME type auto-detected, parser resolved from parserConfig, then built-ins |
| { kind: "buffer", buffer, mimeType } | Buffer artifact: parser resolved from parserConfig, then built-ins |
| { kind: "artifact-json", data } | Validates and hydrates pre-built artifact JSON |
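
The resolution order named in the table (parserConfig first, then built-ins) can be pictured as a two-step lookup. The following is a minimal sketch under stated assumptions, not the SDK's implementation: `ParserFn`, `builtIns`, and `resolveParser` are hypothetical names invented for illustration, and the real set of built-in parsers is not shown here.

```typescript
// Hypothetical stand-ins; none of these names come from @struktur/sdk.
type ParserFn = (data: Uint8Array) => string;

// Assumed built-in table, for this sketch only.
const builtIns: Record<string, ParserFn> = {
  "text/plain": (data) => new TextDecoder().decode(data),
};

function resolveParser(
  mimeType: string,
  parserConfig: Record<string, ParserFn> = {},
): ParserFn {
  // 1. A parser configured for this MIME type wins.
  const custom = parserConfig[mimeType];
  if (custom) return custom;
  // 2. Otherwise fall back to a built-in, if one exists.
  const builtIn = builtIns[mimeType];
  if (builtIn) return builtIn;
  throw new Error(`No parser available for MIME type: ${mimeType}`);
}

const bytes = new TextEncoder().encode("hello");
const upper: ParserFn = (data) => new TextDecoder().decode(data).toUpperCase();

// No custom entry: the built-in text parser is used.
const viaBuiltIn = resolveParser("text/plain")(bytes);
// Custom entry for the same MIME type: it takes precedence.
const viaCustom = resolveParser("text/plain", { "text/plain": upper })(bytes);
```

A custom entry shadows a built-in for the same MIME type in this sketch, which is one natural reading of "parserConfig then built-ins".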
Options
| Option | Type | Default | Description |
|---|---|---|---|
| parserConfig | ParsersConfig | {} | Custom parsers keyed by MIME type |
| includeImages | boolean | false | Extract embedded images from PDFs |
| screenshots | boolean | false | Render PDF page screenshots |
| screenshotScale | number | 1.5 | Scale factor for screenshots |
| screenshotWidth | number | undefined | Target width in pixels (overrides screenshotScale) |
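
The table states that screenshotWidth overrides screenshotScale. One plausible reading, sketched below, is that a target width is converted into an effective scale relative to the page's native width. The function `effectiveScale` and its `pageWidth` parameter are assumptions made for illustration; the SDK's actual rendering logic may differ.

```typescript
interface ScreenshotOptions {
  screenshotScale?: number; // documented default: 1.5
  screenshotWidth?: number; // target width in pixels, overrides the scale
}

// pageWidth is the page's native width in the renderer's units (an assumption).
function effectiveScale(pageWidth: number, opts: ScreenshotOptions = {}): number {
  if (opts.screenshotWidth !== undefined) {
    // A target width takes priority over any explicit scale.
    return opts.screenshotWidth / pageWidth;
  }
  return opts.screenshotScale ?? 1.5;
}

const byDefault = effectiveScale(612); // 1.5
const byScale = effectiveScale(612, { screenshotScale: 2 }); // 2
const byWidth = effectiveScale(612, { screenshotScale: 2, screenshotWidth: 1836 }); // 3
```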
Custom Parsers
Pass a parserConfig to use custom parsers without CLI config:
import { parse } from "@struktur/sdk";
import type { ParsersConfig, InlineParserDef } from "@struktur/sdk";
import * as XLSX from "xlsx";

// npm package parser
const parserConfig: ParsersConfig = {
  "application/vnd.ms-excel": {
    type: "npm",
    package: "@myorg/xlsx-parser",
  },
};

// or inline parser
const inlineParserConfig: ParsersConfig = {
  "application/vnd.ms-excel": {
    type: "inline",
    handler: async (buffer) => {
      // One content slice per worksheet, exported as CSV text
      const workbook = XLSX.read(buffer);
      const contents = workbook.SheetNames.map((name, i) => ({
        page: i + 1,
        text: XLSX.utils.sheet_to_csv(workbook.Sheets[name]),
      }));
      return {
        id: `excel-${crypto.randomUUID()}`,
        type: "file",
        raw: async () => buffer,
        contents,
      };
    },
  },
};

// .xls files use the application/vnd.ms-excel MIME type keyed above
const artifacts = await parse(
  { kind: "file", path: "report.xls" },
  { parserConfig: inlineParserConfig }
);

Inline parser signature
An inline parser is an async function that takes a Buffer and returns an Artifact:
type InlineParserHandler = (buffer: Buffer) => Promise<Artifact>;

// parseMyFormat stands in for your own format-specific parsing logic
const myParser: InlineParserHandler = async (buffer) => {
  const pages = await parseMyFormat(buffer);
  return {
    id: `doc-${crypto.randomUUID()}`,
    type: "file",
    raw: async () => buffer,
    contents: pages.map((page, i) => ({
      page: i + 1,
      text: page.text,
      media: page.images.map((img) => ({
        type: "image",
        base64: img.base64,
      })),
    })),
  };
};

Other Helpers
urlToArtifact(url)
Fetches a URL that is expected to return pre-serialized artifact JSON, then validates and hydrates the result.
import { urlToArtifact } from "@struktur/sdk";

const artifacts = await urlToArtifact("https://example.com/artifact.json");

parseSerializedArtifacts(text)
Parses a JSON string into artifacts with schema validation.
validateSerializedArtifacts(data)
Validates an already-parsed value against the artifact schema.
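
To make the idea of validating an already-parsed value concrete, here is a rough structural check written as a type guard. It is only a sketch: the field names (`id`, `type`, `contents`, `page`, `text`) are inferred from the inline parser examples on this page, `looksLikeArtifact` is a hypothetical helper, and the authoritative schema is the one described in The Artifact Format.

```typescript
// Shape inferred from the examples on this page, minus raw(); not the official schema.
interface SerializedArtifactSketch {
  id: string;
  type: string;
  contents: Array<{ page?: number; text?: string }>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical structural check; returns true only when every field has the expected type.
function looksLikeArtifact(value: unknown): value is SerializedArtifactSketch {
  if (!isRecord(value)) return false;
  if (typeof value.id !== "string" || typeof value.type !== "string") return false;
  if (!Array.isArray(value.contents)) return false;
  return value.contents.every(
    (slice) =>
      isRecord(slice) &&
      (slice.page === undefined || typeof slice.page === "number") &&
      (slice.text === undefined || typeof slice.text === "string"),
  );
}

// Parse first, then validate the resulting value.
const parsed: unknown = JSON.parse(
  '[{"id":"doc-1","type":"file","contents":[{"page":1,"text":"Hello"}]}]',
);
const allValid = Array.isArray(parsed) && parsed.every(looksLikeArtifact);
const rejected = looksLikeArtifact({ id: 42, type: "file", contents: [] });
```

Here `allValid` is true and `rejected` is false, because the second value has a numeric `id`.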
hydrateSerializedArtifacts(items)
Adds the raw() function to serialized artifacts.
splitTextIntoContents(text)
Splits a text string on double newlines into content slices.
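
Splitting on double newlines can be approximated in a few lines. This is a hypothetical re-implementation for illustration only: `splitOnDoubleNewlines` is not an SDK export, and the SDK's own function may handle whitespace, empty input, or extra fields on each slice differently.

```typescript
// Illustrative only; trimming and dropping empty parts are assumptions of this sketch.
function splitOnDoubleNewlines(text: string): Array<{ text: string }> {
  return text
    .split(/(?:\r?\n){2,}/) // two or more consecutive line breaks end a slice
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => ({ text: part }));
}

const slices = splitOnDoubleNewlines(
  "Intro paragraph.\n\nSecond paragraph\nwith a soft break.\n\n\nThird.",
);
// slices has 3 entries; the single newline inside the second paragraph is kept.
```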
Deprecated: fileToArtifact
fileToArtifact is deprecated. Use parse() with InlineParserDef in parserConfig instead.
Important: fileToArtifact uses the legacy providers registry, which includes neither the built-in PDF parser nor any parsers added with the config parsers add command. For PDF and other format support, use parse() instead.
// Deprecated: use parse() instead
import { fileToArtifact } from "@struktur/sdk";
import fs from "node:fs/promises";

const buffer = await fs.readFile("document.txt"); // readFile already returns a Buffer
const artifact = await fileToArtifact(buffer, {
  mimeType: "text/plain",
  providers: { /* deprecated */ }
});

See also
- The Artifact Format — JSON spec
- Document Parsing — how files are converted to artifacts and extending the parser system