The Chunking, Validation, and Retry Problem
Why you keep writing the same extraction boilerplate
Every LLM extraction project needs the same infrastructure: chunking, validation, retries, merging. It's boilerplate you've probably written multiple times. Here's why it's harder than it looks and how Struktur handles it.
The Boilerplate
Every extraction pipeline needs:
- Token counting — Know when you're hitting limits
- Chunking — Split documents that don't fit
- Prompt construction — Build prompts per chunk
- API calls — Handle rate limits, timeouts, errors
- Schema validation — Check output matches schema
- Retry logic — Send errors back to LLM
- Result merging — Combine chunk results
- Deduplication — Remove duplicates from arrays
You can build each piece. But together, they form a complex system with subtle edge cases.
Token Budgets
Why 10k Tokens Isn't Arbitrary
LLMs have context limits. GPT-4o: 128k tokens. Claude 3.5: 200k tokens. But you can't use all of that:
- Input tokens — Your document + prompt
- Output tokens — The extracted JSON
- Safety margin — Room for error messages, retries
A 50-page contract might be 80k tokens. You can't process it in one call. You need to chunk.
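As a rough sketch of the budget math (using the common ~4 characters per token heuristic; a real pipeline would use an actual tokenizer such as tiktoken, as Struktur does):

```typescript
// Rough token estimate: ~4 characters per token. This is a heuristic,
// not exact; production code should count with a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Decide whether a document fits in a single call, after reserving
// room for the prompt, the JSON output, and a safety margin.
function fitsInOneCall(
  document: string,
  contextLimit: number,
  promptTokens: number,
  outputBudget: number,
  safetyMargin: number
): boolean {
  const available = contextLimit - promptTokens - outputBudget - safetyMargin;
  return estimateTokens(document) <= available;
}
```

If `fitsInOneCall` returns false, the pipeline falls through to chunking.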
The Chunking Problem
Naive chunking breaks documents:
Document: "...the total amount of $1,234,567.89 shall be paid..."
Chunk 1: "...the total amount of $1,234,"
Chunk 2: "567.89 shall be paid..."

Now the LLM sees "$1,234" in one chunk and "567.89" in another. Neither is the correct total.
Smart chunking:
- Split at sentence boundaries
- Preserve context overlap
- Track chunk relationships
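The first two points can be sketched as follows (a simplified version: the naive sentence split shown here would itself break on decimals like "$1,234,567.89", which is exactly why production splitters are more careful):

```typescript
// Split text into chunks at sentence boundaries, carrying the last few
// sentences of each chunk into the next one as context overlap.
function chunkBySentences(
  text: string,
  maxCharsPerChunk: number,
  overlapSentences: number
): string[] {
  // Naive split: break after ".", "!", or "?" followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLen = 0;

  for (const sentence of sentences) {
    if (currentLen + sentence.length > maxCharsPerChunk && current.length > 0) {
      chunks.push(current.join(" "));
      // Overlap: the next chunk starts with the tail of this one.
      current = current.slice(-overlapSentences);
      currentLen = current.join(" ").length;
    }
    current.push(sentence);
    currentLen += sentence.length + 1;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```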
Validation in the Loop
The Problem
LLMs produce invalid output:
- Missing required fields
- Wrong types (string instead of number)
- Invalid enum values
- Malformed JSON
You can't trust raw LLM output.
The Solution
Validate, then retry with feedback:
const schema = {
  type: "object",
  properties: {
    total: { type: "number" },
    date: { type: "string", format: "date" }
  },
  required: ["total", "date"]
};
// Attempt 1
const output1 = await llm.extract(document);
const errors1 = validate(output1, schema);
// errors1: "total is required but missing"

// Attempt 2 (with feedback)
const output2 = await llm.extract(document, {
  previousErrors: errors1
});
const errors2 = validate(output2, schema);
// errors2: null (valid!)

Why This Works
LLMs are good at correcting mistakes when told what's wrong:
Previous output had validation errors:
- "total" is required but was missing
- "date" must be in YYYY-MM-DD format
Please fix these errors and try again.

Most extractions converge in 2-3 attempts.
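A feedback message in that format can be assembled mechanically; a minimal sketch (the function name and exact wording are illustrative, not a fixed API):

```typescript
// Build a retry prompt that tells the LLM exactly which validation
// errors its previous attempt produced.
function buildRetryPrompt(errors: string[]): string {
  const bullets = errors.map(e => `- ${e}`).join("\n");
  return `Previous output had validation errors:\n${bullets}\nPlease fix these errors and try again.`;
}
```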
The Retry Limit
You can't retry forever. Set a limit:
const maxAttempts = 3;
let previousErrors = null;

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const output = await llm.extract(document, { previousErrors });
  const errors = validate(output, schema);
  if (errors === null) {
    return output; // Success!
  }
  previousErrors = errors; // Feed the errors into the next attempt
}
throw new Error("Max retries exceeded");

Merging Strategies
The Problem
Multiple chunks produce multiple results:
// Chunk 1 output
{ lineItems: [{ name: "Widget A", price: 100 }] }

// Chunk 2 output
{ lineItems: [{ name: "Widget B", price: 200 }] }

// Chunk 3 output
{ total: 300 }

How do you combine these?
Strategy 1: LLM Merge
Ask the LLM to merge:
const merged = await llm.merge({
  results: [result1, result2, result3],
  schema: schema,
  instruction: "Combine these partial extractions into one complete result."
});

Pros: Handles complex merging logic
Cons: Extra LLM call, more tokens, more cost
Strategy 2: Auto-Merge
Merge programmatically based on schema:
function autoMerge(
  results: Record<string, unknown>[],
  schema: { properties: Record<string, { type?: string }> }
): Record<string, unknown> {
  const merged: Record<string, unknown> = {};
  for (const key of Object.keys(schema.properties)) {
    const propSchema = schema.properties[key];
    if (propSchema.type === 'array') {
      // Concatenate arrays
      merged[key] = results.flatMap(r => (r[key] as unknown[]) ?? []);
    } else {
      // Take the first non-null value
      merged[key] = results.find(r => r[key] != null)?.[key];
    }
  }
  return merged;
}

Pros: Fast, no extra tokens
Cons: Can't handle complex logic
Strategy 3: Schema-Aware Merge
Combine both approaches:
- Use auto-merge for simple fields
- Use LLM merge for complex fields
- Let schema specify which to use
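One way to sketch the dispatch step, assuming a hypothetical per-field "x-merge": "llm" annotation in the schema (Struktur's actual mechanism may differ):

```typescript
// Partition schema properties into those safe to auto-merge and those
// the schema flags (via a hypothetical "x-merge": "llm" annotation)
// for a separate LLM merge call.
function planMerge(
  schema: { properties: Record<string, { [k: string]: unknown }> }
): { auto: string[]; llm: string[] } {
  const auto: string[] = [];
  const llm: string[] = [];
  for (const [key, prop] of Object.entries(schema.properties)) {
    if (prop["x-merge"] === "llm") llm.push(key);
    else auto.push(key);
  }
  return { auto, llm };
}
```

The `auto` fields go through the programmatic merge; only the `llm` fields pay for an extra model call.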
Deduplication
The Problem
When merging arrays, you get duplicates:
// Chunk 1: mentions "Widget A" twice
{ lineItems: [{ name: "Widget A" }, { name: "Widget A" }] }

// Chunk 2: also mentions "Widget A"
{ lineItems: [{ name: "Widget A" }] }

// Merged: 3 copies of "Widget A"
{ lineItems: [{ name: "Widget A" }, { name: "Widget A" }, { name: "Widget A" }] }

The Solution
Schema-aware deduplication:
// Assumes two helpers defined elsewhere:
// - findUniqueKey(schema): returns the property name the schema marks as unique, if any
// - fuzzyDeduplicate(items): similarity-based dedupe when no unique key exists
function deduplicate(items: Record<string, unknown>[], schema: object): Record<string, unknown>[] {
  const uniqueKey = findUniqueKey(schema);
  if (uniqueKey) {
    // Dedupe by unique key
    const seen = new Set<unknown>();
    return items.filter(item => {
      const key = item[uniqueKey];
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    });
  }
  // Dedupe by similarity (fuzzy matching)
  return fuzzyDeduplicate(items);
}

How Struktur Handles This
Struktur provides all of this out of the box:
import { extract } from '@struktur/sdk';
const result = await extract({
  artifacts: [{ path: 'contract.pdf' }],
  schema: contractSchema,
  strategy: 'sequential', // handles chunking + merging
  options: {
    maxRetries: 3,
    tokenBudget: 10000,
    mergeStrategy: 'auto',
  }
});

console.log(result.data);  // Validated, merged output
console.log(result.usage); // Token usage stats

Under the hood:
- Token counting — Uses tiktoken for accurate counts
- Chunking — Splits at sentence boundaries with overlap
- Validation — JSON Schema validation with ajv
- Retries — Sends errors back to LLM
- Merging — Schema-aware auto-merge
- Deduplication — Removes duplicates from arrays
The Hidden Complexity
Each piece seems simple. Together, they interact in complex ways:
- Chunking affects what the LLM sees
- Validation errors affect retry prompts
- Merging depends on chunk boundaries
- Deduplication depends on merge results
Building this once is educational. Building it correctly for production takes weeks.
See Also
- Struktur vs Manual LLM Calls — The full breakdown
- Agent vs Simple vs Parallel — Choosing strategies
- Quickstart Guide — Get started in 5 minutes