Struktur

parse()

Load files and text into artifacts for extraction.

The parse() function is the primary way to load files and text into artifacts. It handles MIME detection, parser resolution, and PDF image extraction automatically.

import { parse } from "@struktur/sdk";
import type { ParsersConfig } from "@struktur/sdk";

// Custom parsers keyed by MIME type (empty here; see Custom Parsers below)
const parsersConfig: ParsersConfig = {};

const artifacts = await parse(
  { kind: "file", path: "document.pdf" },
  {
    parserConfig: parsersConfig,   // ParsersConfig, keyed by MIME type
    includeImages: true,           // extract embedded PDF images
    screenshots: false,            // render PDF page screenshots
    screenshotScale: 1.5,          // scale factor for screenshots
    screenshotWidth: undefined,    // target width in pixels (overrides screenshotScale)
  }
);

Input kinds

| Input kind | Description |
| --- | --- |
| { kind: "text", text } | Text artifact (split on double newlines) |
| { kind: "file", path, mimeType? } | File artifact. The MIME type is auto-detected; the parser is resolved from parserConfig, then built-ins |
| { kind: "buffer", buffer, mimeType } | Buffer artifact. The parser is resolved from parserConfig, then built-ins |
| { kind: "artifact-json", data } | Validates and hydrates pre-built artifact JSON |
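The lookup order described for file and buffer inputs (parserConfig first, then built-ins) can be illustrated with a small sketch. This is not the SDK's implementation: the types, the BUILT_IN_PARSERS table, the "built-in-pdf" placeholder name, and the choice to return undefined when nothing matches are all assumptions made for illustration.

```typescript
// Hypothetical types for illustration; the SDK's real types may differ.
type ParserDef =
  | { type: "npm"; package: string }
  | { type: "inline"; handler: (buffer: Uint8Array) => Promise<unknown> };
type ParsersConfig = Record<string, ParserDef>;

// Assumed stand-in for the built-in parser table (only PDF is mentioned on this page).
const BUILT_IN_PARSERS: ParsersConfig = {
  "application/pdf": { type: "npm", package: "built-in-pdf" }, // placeholder name
};

// Look up a parser for a MIME type: custom config first, then built-ins.
function resolveParser(
  mimeType: string,
  parserConfig: ParsersConfig = {},
): { source: "custom" | "built-in"; parser: ParserDef } | undefined {
  const custom = parserConfig[mimeType];
  if (custom) return { source: "custom", parser: custom };
  const builtIn = BUILT_IN_PARSERS[mimeType];
  if (builtIn) return { source: "built-in", parser: builtIn };
  return undefined; // assumption: no parser available for this MIME type
}
```

Under this reading, a custom entry for a MIME type takes precedence over a built-in parser for the same type.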

Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| parserConfig | ParsersConfig | {} | Custom parsers keyed by MIME type |
| includeImages | boolean | false | Extract embedded images from PDFs |
| screenshots | boolean | false | Render PDF page screenshots |
| screenshotScale | number | 1.5 | Scale factor for screenshots |
| screenshotWidth | number | | Target width in pixels (overrides screenshotScale) |
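To make the relationship between screenshotScale and screenshotWidth concrete, here is a minimal sketch of how an effective scale could be chosen. How the library actually converts a target width into a rendering scale is not described on this page; the division by the page's natural width, the effectiveScale helper, and the ScreenshotOptions interface are assumptions for illustration only.

```typescript
// Hypothetical subset of the options listed above.
interface ScreenshotOptions {
  screenshotScale?: number;
  screenshotWidth?: number;
}

// Returns the scale factor to render a page with, given its natural width in pixels.
function effectiveScale(options: ScreenshotOptions, pageWidthPx: number): number {
  if (options.screenshotWidth !== undefined) {
    // A target width, when set, takes precedence over screenshotScale.
    return options.screenshotWidth / pageWidthPx;
  }
  // Otherwise use the explicit scale, or the documented default of 1.5.
  return options.screenshotScale ?? 1.5;
}
```

In this sketch, passing both options for a 600-pixel-wide page with screenshotWidth set to 1800 yields a scale of 3, regardless of screenshotScale.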

Custom Parsers

Pass a parserConfig to use custom parsers without CLI config:

import { parse } from "@struktur/sdk";
import type { ParsersConfig, InlineParserDef } from "@struktur/sdk";
import * as XLSX from "xlsx";

// npm package parser
const parserConfig: ParsersConfig = {
  "application/vnd.ms-excel": {
    type: "npm",
    package: "@myorg/xlsx-parser",
  },
};

// or inline parser
const inlineParserConfig: ParsersConfig = {
  "application/vnd.ms-excel": {
    type: "inline",
    handler: async (buffer) => {
      const workbook = XLSX.read(buffer);
      const contents = workbook.SheetNames.map((name, i) => ({
        page: i + 1,
        text: XLSX.utils.sheet_to_csv(workbook.Sheets[name]),
      }));

      return {
        id: `excel-${crypto.randomUUID()}`,
        type: "file",
        raw: async () => buffer,
        contents,
      };
    },
  },
};

// .xls files use the application/vnd.ms-excel MIME type that the parser is keyed by
const artifacts = await parse(
  { kind: "file", path: "report.xls" },
  { parserConfig: inlineParserConfig }
);

Inline parser signature

An inline parser is an async function that takes a Buffer and returns an Artifact:

type InlineParserHandler = (buffer: Buffer) => Promise<Artifact>;

const myParser: InlineParserHandler = async (buffer) => {
  // parseMyFormat is a placeholder for your own format-specific parsing logic
  const pages = await parseMyFormat(buffer);

  return {
    id: `doc-${crypto.randomUUID()}`,
    type: "file",
    raw: async () => buffer,
    contents: pages.map((page, i) => ({
      page: i + 1,
      text: page.text,
      media: page.images.map((img) => ({
        type: "image",
        base64: img.base64,
      })),
    })),
  };
};
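As a dependency-free variation on the handler above, the following sketch shows a parser that needs no third-party library. It assumes a plain-text format in which form feed characters mark page breaks; the ArtifactSketch interface is a local, hypothetical mirror of the shape used in the examples on this page, not the SDK's exported type.

```typescript
import { Buffer } from "node:buffer";
import { randomUUID } from "node:crypto";

// Hypothetical local mirror of the artifact shape used in the examples above.
interface ArtifactSketch {
  id: string;
  type: "file";
  raw: () => Promise<Buffer>;
  contents: { page: number; text: string }[];
}

// Hypothetical parser: treats form feed characters (\f) as page breaks.
const formFeedParser = async (buffer: Buffer): Promise<ArtifactSketch> => {
  const pages = buffer.toString("utf8").split("\f");
  return {
    id: `txt-${randomUUID()}`,
    type: "file",
    raw: async () => buffer, // hand back the original bytes unchanged
    contents: pages.map((text, i) => ({ page: i + 1, text: text.trim() })),
  };
};
```

A handler like this could be registered in parserConfig under the relevant MIME type with type: "inline", in the same way as the Excel example.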

Other Helpers

urlToArtifact(url)

Fetches a URL that is expected to return pre-serialized artifact JSON, then validates and hydrates the result.

import { urlToArtifact } from "@struktur/sdk";

const artifacts = await urlToArtifact("https://example.com/artifact.json");

parseSerializedArtifacts(text)

Parses a JSON string into artifacts with schema validation.

validateSerializedArtifacts(data)

Validates an already-parsed value against the artifact schema.

hydrateSerializedArtifacts(items)

Adds the raw() function to serialized artifacts.
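One way to picture how the three serialization helpers relate is as layers: parse a JSON string, validate the parsed value, then hydrate it by attaching raw(). The sketch below uses local stand-in functions (validateSketch, hydrateSketch, loadSketch), not the SDK helpers. The artifact shape, the assumption that artifact JSON is an array, the empty Buffer returned by raw(), and the way the steps compose are all assumptions, since this page does not specify them.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical serialized shape; the SDK's real schema is not shown on this page.
interface SerializedArtifactSketch {
  id: string;
  type: string;
  contents: { page?: number; text?: string }[];
}
type HydratedArtifactSketch = SerializedArtifactSketch & { raw: () => Promise<Buffer> };

// Validate an already-parsed value (assumed here to be an array of artifacts).
function validateSketch(data: unknown): SerializedArtifactSketch[] {
  if (!Array.isArray(data)) throw new Error("expected an array of artifacts");
  for (const item of data) {
    const ok =
      typeof item === "object" && item !== null &&
      typeof item.id === "string" &&
      typeof item.type === "string" &&
      Array.isArray(item.contents);
    if (!ok) throw new Error("invalid artifact");
  }
  return data as SerializedArtifactSketch[];
}

// Hydrate: attach a raw() function. Returning an empty Buffer is a stand-in.
function hydrateSketch(items: SerializedArtifactSketch[]): HydratedArtifactSketch[] {
  return items.map((item) => ({ ...item, raw: async () => Buffer.alloc(0) }));
}

// Compose the layers: JSON string -> validated value -> hydrated artifacts.
function loadSketch(text: string): HydratedArtifactSketch[] {
  return hydrateSketch(validateSketch(JSON.parse(text)));
}
```

Keeping validation separate from hydration means a value that was parsed elsewhere can be checked without re-serializing it.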

splitTextIntoContents(text)

Splits a text string on double newlines into content slices.
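The splitting rule can be sketched in a few lines. The splitOnDoubleNewlines function below is a local illustration, not the SDK function: the { text } slice shape, the trimming of whitespace, and the dropping of empty chunks are assumptions about behaviour this page does not spell out.

```typescript
// Hypothetical content slice shape for illustration.
interface ContentSlice {
  text: string;
}

// Split text on double newlines, trimming each chunk and skipping empty ones.
function splitOnDoubleNewlines(text: string): ContentSlice[] {
  return text
    .split("\n\n")
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => ({ text: chunk }));
}
```

With this sketch, a string containing two paragraphs separated by a blank line produces two slices, one per paragraph.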


Deprecated: fileToArtifact

fileToArtifact is deprecated. Use parse() with InlineParserDef in parserConfig instead.

Important: fileToArtifact uses the legacy providers registry, which includes neither the built-in PDF parser nor any parsers added with the config parsers add CLI command. For PDF and other format support, use parse() instead.

// Deprecated — use parse() instead
import { fileToArtifact } from "@struktur/sdk";
import fs from "node:fs/promises";

const buffer = await fs.readFile("document.txt"); // readFile already returns a Buffer
const artifact = await fileToArtifact(buffer, {
  mimeType: "text/plain",
  providers: { /* deprecated */ }
});
