The parse() function is the primary way to load files and text into artifacts. It handles MIME detection, parser resolution, and PDF image extraction automatically.

import { parse } from "@struktur/sdk";

const artifacts = await parse(
  { kind: "file", path: "document.pdf" },
  {
    parserConfig,                  // a ParsersConfig object keyed by MIME type, defined by the caller
    includeImages: true,           // extract embedded PDF images
    screenshots: false,            // render PDF page screenshots
    screenshotScale: 1.5,          // scale factor for screenshots
    screenshotWidth: undefined,    // target width in pixels (overrides screenshotScale)
  }
);

Input kinds

| Input kind | Description |
| --- | --- |
| `{ kind: "text", text }` | Text artifact (split on double newlines) |
| `{ kind: "file", path, mimeType? }` | File artifact; MIME type auto-detected, parser resolved from `parserConfig` first, then built-ins |
| `{ kind: "buffer", buffer, mimeType }` | Buffer artifact; parser resolved from `parserConfig` first, then built-ins |
| `{ kind: "artifact-json", data }` | Validates and hydrates pre-built artifact JSON |
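
The lookup order described in the table (custom `parserConfig` entries first, built-ins as the fallback) can be sketched as below. This is a minimal illustration, not the SDK's implementation: `Parser`, `ParserTable`, and `resolveParser` are hypothetical names, and the parsers are stubs that only return a label.

```typescript
// Hypothetical local types; the real SDK types may differ.
type Parser = (data: Uint8Array) => Promise<string>;
type ParserTable = Record<string, Parser>;

// Look up a parser by MIME type: custom entries win, built-ins are the fallback.
function resolveParser(
  mimeType: string,
  custom: ParserTable,
  builtIns: ParserTable,
): Parser | undefined {
  return custom[mimeType] ?? builtIns[mimeType];
}

// Stub tables for illustration only.
const builtIns: ParserTable = {
  "application/pdf": async () => "built-in pdf",
  "text/plain": async () => "built-in text",
};
const custom: ParserTable = {
  "text/plain": async () => "custom text",
};

const chosen = resolveParser("text/plain", custom, builtIns);
console.log(chosen === custom["text/plain"]); // the custom entry shadows the built-in
```

A MIME type present in neither table resolves to `undefined` in this sketch; how the SDK reports that case is not covered here.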

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `parserConfig` | `ParsersConfig` | `{}` | Custom parsers keyed by MIME type |
| `includeImages` | `boolean` | `false` | Extract embedded images from PDFs |
| `screenshots` | `boolean` | `false` | Render PDF page screenshots |
| `screenshotScale` | `number` | `1.5` | Scale factor for screenshots |
| `screenshotWidth` | `number` | (none) | Target width in pixels (overrides `screenshotScale`) |
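
To illustrate how `screenshotWidth` can take precedence over `screenshotScale`, here is a small sketch. The function `effectiveScale` and the `pageWidth` parameter are hypothetical, and the rule that a target width is turned into a scale by dividing by the page's natural width is an assumption, not something the SDK documents.

```typescript
interface ScreenshotOptions {
  screenshotScale?: number; // defaults to 1.5, per the options table
  screenshotWidth?: number; // target width in pixels
}

// Assumption: a target width is converted to a scale factor by dividing it
// by the page's natural width. When no width is given, the scale is used as-is.
function effectiveScale(pageWidth: number, opts: ScreenshotOptions = {}): number {
  if (opts.screenshotWidth !== undefined) {
    return opts.screenshotWidth / pageWidth;
  }
  return opts.screenshotScale ?? 1.5;
}

// A page 600 units wide rendered to a 1200 px target: the width wins over the scale.
console.log(effectiveScale(600, { screenshotScale: 3, screenshotWidth: 1200 }));
```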

Custom Parsers

Pass a parserConfig to use custom parsers without CLI config:

import { parse } from "@struktur/sdk";
import type { ParsersConfig, InlineParserDef } from "@struktur/sdk";
import * as XLSX from "xlsx";

// npm package parser
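const parserConfig: ParsersConfig = {
  // MIME type for .xlsx workbooks (legacy .xls files use "application/vnd.ms-excel")
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": {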
    type: "npm",
    package: "@myorg/xlsx-parser",
  },
};

// or inline parser
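const inlineParserConfig: ParsersConfig = {
  // MIME type for .xlsx workbooks, matching the report.xlsx file parsed below
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": {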
    type: "inline",
    handler: async (buffer) => {
      const workbook = XLSX.read(buffer);
      const contents = workbook.SheetNames.map((name, i) => ({
        page: i + 1,
        text: XLSX.utils.sheet_to_csv(workbook.Sheets[name]),
      }));

      return {
        id: `excel-${crypto.randomUUID()}`,
        type: "file",
        raw: async () => buffer,
        contents,
      };
    },
  },
};

const artifacts = await parse(
  { kind: "file", path: "report.xlsx" },
  { parserConfig: inlineParserConfig }
);

Inline parser signature

An inline parser is an async function that takes a Buffer and returns an Artifact:

type InlineParserHandler = (buffer: Buffer) => Promise<Artifact>;

const myParser: InlineParserHandler = async (buffer) => {
  // parseMyFormat is a placeholder for your own format-specific parsing logic
  const pages = await parseMyFormat(buffer);

  return {
    id: `doc-${crypto.randomUUID()}`,
    type: "file",
    raw: async () => buffer,
    contents: pages.map((page, i) => ({
      page: i + 1,
      text: page.text,
      media: page.images.map((img) => ({
        type: "image",
        base64: img.base64,
      })),
    })),
  };
};
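
As a dependency-free variant of the handler above, the following sketch treats the buffer as UTF-8 text and splits it into pages on form-feed characters. The `Content` and `Artifact` interfaces are local stand-ins modelled on the shape used in the examples above (the SDK's exported types may carry more fields), and `toContents` and `formFeedParser` are hypothetical names.

```typescript
import { randomUUID } from "node:crypto";

// Local stand-ins for the artifact shape shown in the examples above.
interface Content {
  page: number;
  text: string;
}
interface Artifact {
  id: string;
  type: "file";
  raw: () => Promise<Buffer>;
  contents: Content[];
}
type InlineParserHandler = (buffer: Buffer) => Promise<Artifact>;

// Split on form feed (\f), trim each chunk, and drop empty pages.
function toContents(text: string): Content[] {
  return text
    .split("\f")
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk, i) => ({ page: i + 1, text: chunk }));
}

const formFeedParser: InlineParserHandler = async (buffer) => ({
  id: `txt-${randomUUID()}`,
  type: "file",
  raw: async () => buffer, // hand back the original bytes unchanged
  contents: toContents(buffer.toString("utf8")),
});
```

Keeping the page-splitting logic in a plain synchronous function makes it easy to unit-test separately from the async handler.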

Other Helpers

`urlToArtifact(url)`

Fetches a URL that is expected to return pre-serialized artifact JSON, then validates and hydrates the result.

import { urlToArtifact } from "@struktur/sdk";

const artifacts = await urlToArtifact("https://example.com/artifact.json");

`parseSerializedArtifacts(text)`

Parses a JSON string into artifacts with schema validation.

`validateSerializedArtifacts(data)`

Validates an already-parsed value against the artifact schema.

`hydrateSerializedArtifacts(items)`

Adds the raw() function to serialized artifacts.
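
The three helpers above describe a parse, validate, hydrate sequence. The sketch below shows that flow with hand-rolled stand-ins rather than the SDK functions. Everything in it is an assumption made for illustration: the `SerializedArtifact` shape, the structural checks, and in particular what `raw()` returns after hydration (here, the artifact's text re-encoded as bytes).

```typescript
// Hypothetical shapes; the SDK's schema is authoritative.
interface SerializedArtifact {
  id: string;
  type: string;
  contents: { page?: number; text?: string }[];
}
interface HydratedArtifact extends SerializedArtifact {
  raw: () => Promise<Buffer>;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Minimal structural check standing in for real schema validation.
function validate(data: unknown): SerializedArtifact[] {
  if (!Array.isArray(data)) throw new Error("expected an array of artifacts");
  return data.map((item, i) => {
    const ok =
      isRecord(item) &&
      typeof item.id === "string" &&
      typeof item.type === "string" &&
      Array.isArray(item.contents);
    if (!ok) throw new Error(`artifact ${i} is malformed`);
    return item as unknown as SerializedArtifact;
  });
}

// Assumption: raw() returns the joined text as bytes; the SDK may behave differently.
function hydrate(items: SerializedArtifact[]): HydratedArtifact[] {
  return items.map((item) => ({
    ...item,
    raw: async () => Buffer.from(item.contents.map((c) => c.text ?? "").join("\n\n")),
  }));
}

// JSON text -> parsed value -> validated artifacts -> artifacts with raw().
function loadArtifacts(jsonText: string): HydratedArtifact[] {
  return hydrate(validate(JSON.parse(jsonText)));
}
```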

`splitTextIntoContents(text)`

Splits a text string on double newlines into content slices.
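
For intuition, splitting on double newlines might look like the sketch below. `splitOnBlankLines` is a hypothetical re-implementation, not the SDK's source. Trimming each slice, dropping empty ones, and accepting `\r\n` line endings are assumptions, and the real content slices may carry more fields than `text`.

```typescript
interface TextContent {
  text: string;
}

// Split wherever two or more consecutive line breaks occur (LF or CRLF),
// then trim each slice and discard any that end up empty.
function splitOnBlankLines(text: string): TextContent[] {
  return text
    .split(/(?:\r?\n){2,}/)
    .map((slice) => slice.trim())
    .filter((slice) => slice.length > 0)
    .map((slice) => ({ text: slice }));
}

const slices = splitOnBlankLines("Intro\n\nBody line 1\nBody line 2\n\n\nEnd");
console.log(slices.length); // 3
```

Single newlines stay inside a slice, so a paragraph with hard-wrapped lines is kept together.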

Deprecated: fileToArtifact

fileToArtifact is deprecated. Use parse() with InlineParserDef in parserConfig instead.

Important: fileToArtifact uses the legacy providers registry, which includes neither the built-in PDF parser nor any parsers registered with the CLI command `config parsers add`. For PDF and other format support, use parse() instead.

// Deprecated — use parse() instead
import { fileToArtifact } from "@struktur/sdk";
import fs from "node:fs/promises";

const buffer = await fs.readFile("document.txt"); // readFile already returns a Buffer
const artifact = await fileToArtifact(buffer, {
  mimeType: "text/plain",
  providers: { /* deprecated */ }
});
