If you have a regex somewhere that strips markdown fences to pull JSON out of an LLM response, you have a time bomb. It works 95% of the time in development. It breaks on the 5% of production traffic that has slightly different phrasing, a model version bump, or a user input the model has never seen. You fix it, it breaks again three weeks later in a different way.

The correct fix is not a better regex. It is stopping the model from producing anything other than valid, schema-compliant output in the first place.

The Naive Approach and Where It Breaks

Most teams start here:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": f"Extract order details as JSON: {text}"}]
)
raw = response.content[0].text

The model is helpful and returns:

Here is the extracted order information:

```json
{
  "order_id": "ORD-12345",
  "customer": "Jane Smith",
  "total": 142.50
}
```

I extracted the order ID, customer name, and total from the provided text.


So you write a parser:

```python
import re, json

def extract_json(text: str) -> dict:
    match = re.search(r'```json\s*(.*?)\s*```', text, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError(f"No JSON found in: {text}")
```

This breaks when:

  • The model uses inline backticks instead of a fenced block
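  • Your greedy \{.*\} runs from the first { to the last }, so it captures too much when the response holds more than one object or braces in prose (and a non-greedy version captures too little with nested objects)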
  • The model cannot extract the field and says so in prose instead of JSON
  • A model update changes the wrapper format
  • A user input causes the model to add a disclaimer before the JSON
  • The total comes back as "$142.50" (string) instead of 142.50 (number)
  • The model uses "customer_name" when your schema expects "customer"

Each of these is a production incident at scale. Across thousands of requests a day, you will hit all of them eventually.
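To make a few of these failures concrete, here is a minimal sketch that runs the same parsing logic against three responses. The sample responses are invented for illustration, and the fence string is assembled from single backticks only so that the example nests cleanly inside a code block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests in a fenced block

def extract_json(text: str) -> dict:
    """Same logic as the parser above: fenced block first, then a greedy brace match."""
    match = re.search(FENCE + r"json\s*(.*?)\s*" + FENCE, text, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError(f"No JSON found in: {text}")

# Hypothetical model responses, invented for this example.
two_objects = 'Draft: {"order_id": "A"} Final: {"order_id": "B"}'
prose_only = "I could not find an order in this text."
wrong_shape = FENCE + 'json\n{"customer_name": "Jane Smith", "total": "$142.50"}\n' + FENCE

def outcome(text: str) -> str:
    """Return the exception name on failure, or 'parsed' on success."""
    try:
        extract_json(text)
        return "parsed"
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return type(exc).__name__

for label, sample in [("two objects", two_objects),
                      ("prose only", prose_only),
                      ("wrong shape", wrong_shape)]:
    print(f"{label}: {outcome(sample)}")
```

The third case is the dangerous one: it parses without error, yet the key is wrong and the total is a string, so the failure surfaces somewhere downstream instead of at the parser.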

JSON Mode: Better, But Not Enough

Most providers offer a JSON mode that forces the model to emit valid JSON without a prose wrapper or markdown fences.

# OpenAI (the messages must mention JSON somewhere, or the API rejects the request)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": prompt}]
)

# Google Gemini
response = model.generate_content(
    prompt,
    generation_config={"response_mime_type": "application/json"}
)

JSON mode guarantees syntactically valid JSON, as long as the response is not cut off at the token limit. It does not guarantee the JSON matches your schema. The model might return:

{"result": "I cannot extract order details from this text."}

or use "orderId" when you expected "order_id", or omit required fields entirely. You still need defensive code around the output shape, just less of it.

JSON mode solves the syntax problem. The schema compliance problem is separate.
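The gap between the two problems can be shown with a small shape check. This is only a sketch: the expected fields are assumed to be the three from the naive example, and `shape_errors` is a hypothetical helper, not part of any provider SDK.

```python
import json

# Assumed target shape for this sketch: the three fields from the naive example.
REQUIRED_FIELDS = {"order_id": str, "customer": str, "total": (int, float)}

def shape_errors(raw: str) -> list[str]:
    """List every way a syntactically valid JSON object misses the expected shape."""
    data = json.loads(raw)  # JSON mode is what makes this call safe
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            errors.append(f"missing: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[key], bool) or not isinstance(data[key], expected):
            errors.append(f"wrong type: {key}")
    return errors

print(shape_errors('{"result": "I cannot extract order details from this text."}'))
print(shape_errors('{"orderId": "ORD-12345", "customer": "Jane Smith", "total": "142.50"}'))
print(shape_errors('{"order_id": "ORD-12345", "customer": "Jane Smith", "total": 142.50}'))
```

All three inputs pass `json.loads`; only the last one returns an empty error list.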

Structured Outputs: Schema Enforcement at the Decoder Level

Structured outputs go further: you supply a JSON schema and the model is constrained to produce output that matches it exactly.

from pydantic import BaseModel
from typing import List, Optional

class OrderItem(BaseModel):
    name: str
    quantity: int
    unit_price: float

class OrderExtraction(BaseModel):
    order_id: str
    customer_name: str
    total: float
    items: List[OrderItem]
    confidence: float

# OpenAI structured outputs
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": f"Extract order: {text}"}],
    response_format=OrderExtraction,
)

order = response.choices[0].message.parsed  # None if the model refused; check message.refusal
# order.total is a float, guaranteed
# order.items is a List[OrderItem], guaranteed
# no dict gymnastics, no type coercion

For Anthropic’s API, tool use with a forced tool choice gets you most of the way there:

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "extract_order",
    "description": "Extract structured order information from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "customer_name": {"type": "string"},
            "total": {"type": "number"},
            "items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"}
                    },
                    "required": ["name", "quantity", "unit_price"]
                }
            },
            "confidence": {"type": "number"}
        },
        "required": ["order_id", "customer_name", "total", "items", "confidence"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_order"},
    messages=[{"role": "user", "content": f"Extract order: {text}"}]
)

tool_use = next(b for b in response.content if b.type == "tool_use")
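order = tool_use.input  # dict shaped by input_schema; validate before trusting it

Setting tool_choice to a specific tool name forces the model to call that tool. It cannot respond in prose, so the response always contains a tool_use block. Whether the arguments are strictly held to input_schema depends on the provider's enforcement mode, so keep a validation step unless strict schema enforcement is documented for the call you are making.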
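A cheap local check of the tool input is worth keeping as defense in depth. The following is a minimal sketch of a validator that understands only the schema keywords used in this article (type, properties, required, items); `validate` and `bad_input` are hypothetical names, and production code would more likely lean on Pydantic or a full JSON Schema validator.

```python
# Maps the JSON Schema type names used in this article to Python types.
PY_TYPES = {"object": dict, "array": list, "string": str,
            "number": (int, float), "integer": int}

def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Check value against a schema that uses only type, properties, required, items."""
    expected = PY_TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, expected):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if schema["type"] == "object":
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                problems += validate(value[key], sub, f"{path}.{key}")
    elif schema["type"] == "array" and "items" in schema:
        for i, item in enumerate(value):
            problems += validate(item, schema["items"], f"{path}[{i}]")
    return problems

# A trimmed-down version of the extract_order input_schema.
SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "total": {"type": "number"},
        "items": {"type": "array", "items": {
            "type": "object",
            "properties": {"name": {"type": "string"},
                           "quantity": {"type": "integer"}},
            "required": ["name", "quantity"],
        }},
    },
    "required": ["order_id", "total", "items"],
}

# Hypothetical tool input with a string total and a missing quantity.
bad_input = {"order_id": "ORD-12345", "total": "142.50", "items": [{"name": "Widget"}]}
print(validate(bad_input, SCHEMA))
```

Returning a list of problems with paths, rather than raising on the first one, makes it easier to log exactly which field drifted.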

How Constrained Decoding Works

Knowing the mechanism matters because it defines the actual guarantees, not just the behavior in happy-path testing.

Language models generate tokens one at a time. At each step, the model produces a probability distribution over its full vocabulary. Constrained decoding intercepts before sampling and masks any token that would make the current partial output invalid according to the target grammar (in this case, a JSON schema).

Partial output:  {"order_id":
Allowed tokens:  " (start of a JSON string value)
Masked to zero:  } [ 0-9 true false null and everything else

Partial output:  {"order_id": "ORD-
Allowed tokens:  any character legal inside a JSON string
Masked to zero:  structural tokens that would close the string early

This runs at every token. The model follows its learned distributions, but it can only pick from the subset of tokens that keep the output valid. The model cannot “decide” to add a disclaimer. The only thing it can output is a valid completion of your schema. This is not prompt following - it is a structural guarantee at the decoder level.
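The masking step can be illustrated with a toy decoder. Everything here is an assumption made for the sake of a runnable sketch: the nine-token vocabulary, the fixed scores standing in for model logits, and the list of valid outputs standing in for a compiled schema. Real implementations compile the schema into an automaton over the tokenizer's vocabulary.

```python
import math

# Toy vocabulary: each entry is one "token".
VOCAB = ["Sure", "!", "{", "}", '"status"', ":", '"shipped"', '"pending"', '"lost"']

# Stand-in for a compiled schema: the only complete outputs the grammar accepts.
VALID_OUTPUTS = ['{"status":"shipped"}', '{"status":"pending"}']

# Hypothetical fixed scores; a real model recomputes logits from the prefix at every step.
SCORES = {"Sure": 5.0, '"lost"': 4.0, "{": 3.0, "}": 3.0, '"status"': 3.0,
          ":": 3.0, '"shipped"': 2.0, '"pending"': 1.0, "!": 0.5}

def keeps_output_valid(prefix: str) -> bool:
    """True if prefix can still be extended into a complete valid output."""
    return any(valid.startswith(prefix) for valid in VALID_OUTPUTS)

def decode(constrained: bool, max_steps: int = 10) -> str:
    out = ""
    for _ in range(max_steps):
        if out in VALID_OUTPUTS:
            break
        logits = [SCORES[tok] for tok in VOCAB]
        if constrained:
            # Mask every token that would make the partial output invalid.
            logits = [score if keeps_output_valid(out + tok) else -math.inf
                      for tok, score in zip(VOCAB, logits)]
        # Greedy pick for determinism; a sampler would draw from softmax(logits).
        out += VOCAB[max(range(len(VOCAB)), key=logits.__getitem__)]
    return out

print(decode(constrained=False, max_steps=3))
print(decode(constrained=True))
```

The toy model's favourite tokens are a chatty "Sure" and an off-schema "lost" value. With the mask in place neither can be selected, so the preference for "shipped" over "pending" is the only part of the model's distribution that still matters.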

Provider Support

| Provider | Method | Schema level | Notes |
| --- | --- | --- | --- |
| OpenAI | beta.chat.completions.parse() | Full JSON schema via Pydantic | Returns typed Python objects |
| Anthropic | Tool use + tool_choice | Full JSON schema | Force specific tool for guaranteed structure |
| Google Gemini | response_mime_type + schema | Full JSON schema | Available on Gemini 1.5+ |
| Mistral | response_format | JSON schema | Similar to OpenAI |
| vLLM (self-hosted) | guided_json parameter | Full JSON schema | Uses outlines under the hood |
| Ollama (local) | format: "json" | Basic JSON only | Use outlines for schema enforcement |

For open-source models, the outlines library provides constrained decoding over any Hugging Face model:

import outlines
from pydantic import BaseModel
from typing import List

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")

class OrderExtraction(BaseModel):
    order_id: str
    customer_name: str
    total: float
    items: List[str]

generator = outlines.generate.json(model, OrderExtraction)
result = generator(f"Extract order details from: {text}")
# result is a validated OrderExtraction instance

Patterns for Extraction at Scale

Pipeline Architecture

Input Text
     |
     v
[Chunker] ---- split long documents into segments
     |
     v
[Batch Extractor] -- parallel schema-constrained extraction
     |
     v
[Validator] -- Pydantic validation + business rule checks
     |
     |-- confidence >= 0.7 --------> Output Store
     |
     |-- confidence 0.4-0.7 -------> Human Review Queue
     |
     |-- confidence < 0.4 ---------> Extraction Failed Log
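The routing step at the bottom of the diagram could be sketched as a small pure function. The `Route` enum and the parameter names are hypothetical; the 0.7 and 0.4 thresholds are the ones from the diagram.

```python
from enum import Enum

class Route(str, Enum):
    OUTPUT_STORE = "output_store"
    HUMAN_REVIEW = "human_review"
    FAILED_LOG = "failed_log"

def route(confidence: float, accept_at: float = 0.7, reject_below: float = 0.4) -> Route:
    """Map a confidence score to one of the three destinations in the diagram."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= accept_at:
        return Route.OUTPUT_STORE
    if confidence >= reject_below:
        return Route.HUMAN_REVIEW
    return Route.FAILED_LOG

print([route(c).value for c in (0.92, 0.55, 0.1)])
```

Keeping the thresholds as parameters makes it easy to tune them per document type once you have review-queue data to calibrate against.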

Async Batch Extraction

import asyncio
from pydantic import BaseModel, ValidationError
from typing import List, Optional
from openai import AsyncOpenAI

class ExtractionResult(BaseModel):
    order_id: str
    customer_name: str
    total: float
    items: List[str]
    confidence: float

class ExtractionError(BaseModel):
    source: str
    error: str

async def extract_one(
    client: AsyncOpenAI,
    text: str
) -> ExtractionResult | ExtractionError:
    try:
        response = await client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": f"Extract order: {text}"}],
            response_format=ExtractionResult,
            timeout=10.0,
        )
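        result = response.choices[0].message.parsed
        if result is None:  # the model refused, so there is nothing to validate
            return ExtractionError(source=text, error="model refused or returned no parsed output")
        if result.total <= 0:
            return ExtractionError(source=text, error="total must be positive")
        return result
    except Exception as e:  # already covers ValidationError, timeouts, and API errors
        return ExtractionError(source=text, error=str(e))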

async def batch_extract(client: AsyncOpenAI, texts: list[str], concurrency: int = 20):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(text: str):
        async with sem:
            return await extract_one(client, text)

    results = await asyncio.gather(*[bounded(t) for t in texts])

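    # Route by confidence, using the same thresholds as the pipeline diagram.
    extracted = [r for r in results if isinstance(r, ExtractionResult)]
    successes = [r for r in extracted if r.confidence >= 0.7]
    review    = [r for r in extracted if 0.4 <= r.confidence < 0.7]
    failed    = [r for r in extracted if r.confidence < 0.4]
    errors    = [r for r in results if isinstance(r, ExtractionError)]
    return successes, review, failed, errors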
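The semaphore pattern in batch_extract can be checked without calling any API. In this sketch a fake extractor stands in for the real client and records how many calls are in flight at once; `run_bounded` and `FakeExtractor` are hypothetical names introduced only for the demonstration.

```python
import asyncio

async def run_bounded(items, worker, concurrency: int):
    """Run worker over items with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(item):
        async with sem:
            return await worker(item)

    # gather returns results in input order, regardless of completion order.
    return await asyncio.gather(*(bounded(item) for item in items))

class FakeExtractor:
    """Stand-in for a network call that tracks peak concurrency."""
    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def __call__(self, text: str) -> str:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # simulated request latency
        self.in_flight -= 1
        return text.upper()

fake = FakeExtractor()
texts = [f"order {i}" for i in range(10)]
results = asyncio.run(run_bounded(texts, fake, concurrency=3))
print(fake.peak, results[:2])
```

A check like this is a reasonable unit test to keep next to the real pipeline, since a misplaced `async with` silently removes the rate limit.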

Schema Design: What Actually Matters

Use enums for categorical fields. A plain string for a field with fixed values invites hallucination. An enum forecloses it:

from enum import Enum

class OrderStatus(str, Enum):
    pending = "pending"
    shipped = "shipped"
    delivered = "delivered"
    cancelled = "cancelled"

class Order(BaseModel):
    status: OrderStatus  # model can only output one of four values
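As a rough illustration of what the enum buys you, the sketch below derives the kind of constraint a schema generator would emit for the field and shows that an off-list value is rejected at parse time. `enum_schema` and `parse_status` are hypothetical helpers written with the standard library only.

```python
from enum import Enum
from typing import Optional

class OrderStatus(str, Enum):
    pending = "pending"
    shipped = "shipped"
    delivered = "delivered"
    cancelled = "cancelled"

def enum_schema(enum_cls) -> dict:
    """Build a JSON Schema fragment that restricts a string to the enum's values."""
    return {"type": "string", "enum": [member.value for member in enum_cls]}

def parse_status(raw: str) -> Optional[OrderStatus]:
    """Return the matching member, or None for a value outside the enum."""
    try:
        return OrderStatus(raw)
    except ValueError:
        return None

print(enum_schema(OrderStatus))
print(parse_status("shipped"), parse_status("refunded"))
```

With constrained decoding the "refunded" case should never reach your code; the parse-time check is what protects you on providers or modes where the enum is only advisory.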

Use Optional for fields that may not exist. If you mark a field required and the source text does not contain it, the model will hallucinate a value rather than fail schema validation. Optional and a null check downstream is safer:

class OrderExtraction(BaseModel):
    order_id: str               # always present in valid orders
    customer_name: Optional[str] = None  # may be omitted
    discount_code: Optional[str] = None  # rarely present

Add a confidence score and an extraction_successful flag. Constrained decoding forces the model to fill every required field. Without an explicit failure path, the model is stuck: it must produce a value, so it guesses. Give it an escape hatch:

class SafeExtraction(BaseModel):
    extraction_successful: bool
    failure_reason: Optional[str] = None  # set when extraction_successful is False
    order_id: Optional[str] = None
    customer_name: Optional[str] = None
    total: Optional[float] = None

If extraction_successful is False, skip the other fields entirely and route to a fallback. Never assume a filled-in field means the model was confident.
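One way to wire that rule into downstream code is sketched here. A plain dataclass stands in for the Pydantic model so the block runs on the standard library alone, and the `handle` function with its three outcome labels is a hypothetical design, not a prescribed one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SafeExtraction:  # plain-dataclass stand-in for the Pydantic model above
    extraction_successful: bool
    failure_reason: Optional[str] = None
    order_id: Optional[str] = None
    customer_name: Optional[str] = None
    total: Optional[float] = None

def handle(result: SafeExtraction) -> tuple[str, dict]:
    """Decide what to do with an extraction: 'fallback', 'review', or 'accept'."""
    if not result.extraction_successful:
        # Ignore every other field, even if the model filled some of them in.
        return "fallback", {"reason": result.failure_reason or "unspecified"}
    if result.order_id is None or result.total is None:
        # The flag says success but key fields are null: do not trust it blindly.
        return "review", {"reason": "success flag set but key fields are null"}
    return "accept", {"order_id": result.order_id,
                      "customer_name": result.customer_name,
                      "total": result.total}

failed = SafeExtraction(False, failure_reason="no order in text", order_id="ORD-???")
print(handle(failed))
print(handle(SafeExtraction(True, order_id="ORD-12345", total=142.5)))
```

Note that the failed example carries a filled-in order_id, and the handler deliberately discards it.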

Keep schemas flat where possible. Deeply nested schemas increase the number of tokens the model must generate and raise the probability of mid-schema drift. As a rule of thumb, about three levels of nesting is the practical limit before accuracy drops noticeably; measure on your own data before treating it as a hard cutoff.
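Nesting depth is easy to measure mechanically, which makes it a candidate for a lint check on schema changes. The sketch below assumes one particular definition of depth (each nested object adds a level, arrays pass through to their item schema); `schema_depth` and `MAX_DEPTH` are hypothetical names.

```python
def schema_depth(schema: dict) -> int:
    """Count nested object levels; arrays pass through to their item schema."""
    kind = schema.get("type")
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(child) for child in children), default=0)
    if kind == "array":
        return schema_depth(schema.get("items", {}))
    return 0  # scalars add no nesting

# Order with a list of line items: two object levels.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "items": {"type": "array", "items": {
            "type": "object",
            "properties": {"name": {"type": "string"},
                           "quantity": {"type": "integer"}},
        }},
    },
}

MAX_DEPTH = 3  # rule of thumb from the text, not a hard limit
depth = schema_depth(ORDER_SCHEMA)
print(depth, "ok" if depth <= MAX_DEPTH else "consider flattening")
```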

What Constrained Decoding Does Not Solve

Latency overhead. The per-token masking adds 5-15% latency depending on schema complexity. For streaming responses with tight budgets under 100ms, this matters.

Semantic correctness. The model is constrained to output a valid schema. It is not constrained to output a correct schema. It can still populate customer_name with a plausible but wrong name if the source text is ambiguous. Schema enforcement and factual accuracy are separate problems.

Hallucination of values. The model cannot skip a required field, so if it cannot find the value, it will fabricate one. This is why confidence scores and Optional fields are not optional design choices - they are the mechanism that keeps constrained decoding honest.
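Some schema-valid but wrong outputs can be caught by cross-checking fields against each other, which is the kind of business rule the validator stage is meant to hold. This is a minimal sketch assuming the order carries line items with quantity and unit_price; `total_matches_items` and the one-cent tolerance are hypothetical choices.

```python
import math

def total_matches_items(order: dict, tolerance: float = 0.01) -> bool:
    """Check that the stated total agrees with the sum of the line items."""
    computed = sum(item["quantity"] * item["unit_price"] for item in order["items"])
    return math.isclose(computed, order["total"], abs_tol=tolerance)

# Hypothetical extraction results, both valid against the schema.
consistent = {"total": 142.5, "items": [
    {"name": "Widget", "quantity": 2, "unit_price": 50.0},
    {"name": "Gadget", "quantity": 1, "unit_price": 42.5},
]}
transposed = {**consistent, "total": 124.5}  # digits swapped in the total

print(total_matches_items(consistent), total_matches_items(transposed))
```

A check like this says nothing about whether the customer name is right, but it turns one class of silent error into a routable failure. Real orders with tax, shipping, or discounts would need those terms added to the computed side.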

The Honest Assessment

Structured outputs with schema enforcement solve the output reliability problem. Moving from regex parsing to constrained decoding is not an incremental improvement - it removes an entire category of production bug.

What works well: extraction pipelines, classification tasks, transformation workflows where one step’s output feeds the next step’s input, any use case where you know the shape of what you want in advance.

What does not work well: open-ended generation where schema constraints fight the model’s natural response style, schemas deeper than three or four levels of nesting, situations where the model genuinely does not have enough information and you need it to explain why rather than fill fields with guesses.

What to actually do: add Pydantic models or tool schemas to your existing extraction code. Route anything below your confidence threshold to a human review queue rather than trusting the output. Design every schema with an explicit failure mode from day one. The latency overhead is real and small, and it is worth it every time.