If you have a regex somewhere that strips markdown fences to pull JSON out of an LLM response, you have a time bomb. It works 95% of the time in development. It breaks on the 5% of production traffic that has slightly different phrasing, a model version bump, or a user input the model has never seen. You fix it, it breaks again three weeks later in a different way.
The correct fix is not a better regex. It is stopping the model from producing anything other than valid, schema-compliant output in the first place.
The Naive Approach and Where It Breaks
Most teams start here:
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": f"Extract order details as JSON: {text}"}],
)
raw = response.content[0].text
The model is helpful and returns:
Here is the extracted order information:
```json
{
  "order_id": "ORD-12345",
  "customer": "Jane Smith",
  "total": 142.50
}
```
I extracted the order ID, customer name, and total from the provided text.
So you write a parser:
```python
import json
import re


def extract_json(text: str) -> dict:
    # First choice: a fenced json block
    match = re.search(r'```json\s*(.*?)\s*```', text, re.DOTALL)
    if match:
        return json.loads(match.group(1))
    # Fallback: everything from the first brace to the last
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError(f"No JSON found in: {text}")
```
This breaks when:
- The model uses inline backticks instead of a fenced block
- Your greedy {.*} runs from the first brace to the last, so it captures too much when the prose contains braces or the response holds more than one object
- The model cannot extract the field and says so in prose instead of JSON
- A model update changes the wrapper format
- A user input causes the model to add a disclaimer before the JSON
- The total comes back as "$142.50" (string) instead of 142.50 (number)
- The model uses "customer_name" when your schema expects "customer"
Each of these is a production incident at scale. Across thousands of requests a day, you will hit all of them eventually.
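To make a few of these failure modes concrete, here is a minimal sketch that runs the same regex parser against three responses. The sample strings are invented for illustration, not captured model output, and the fence pattern is assembled from a variable only to keep literal triple backticks out of the example.

```python
import json
import re

FENCE = "`" * 3  # built indirectly to avoid a literal fence inside this example
FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """The regex-based parser from above, reproduced for the demo."""
    match = FENCED_JSON.search(text)
    if match:
        return json.loads(match.group(1))
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError(f"No JSON found in: {text}")


# Hypothetical responses, each exercising one failure mode.
SAMPLES = {
    "prose_refusal": "I could not find an order ID in this text.",
    "braces_in_prose": 'Result: {"order_id": "ORD-1"} (fields marked {unknown} were skipped)',
    "wrong_type": '{"order_id": "ORD-1", "customer": "Jane Smith", "total": "$142.50"}',
}


def try_parse(text: str) -> str:
    """Return 'ok', or the name of the exception the parser raised."""
    try:
        extract_json(text)
        return "ok"
    except ValueError as exc:  # JSONDecodeError is a subclass of ValueError
        return type(exc).__name__


outcomes = {name: try_parse(text) for name, text in SAMPLES.items()}
for name, outcome in outcomes.items():
    print(f"{name}: {outcome}")
```

The third case is the dangerous one: the parser reports success, and the string "$142.50" travels downstream until something tries to do arithmetic on it.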
JSON Mode: Better, But Not Enough
Most providers offer a JSON mode that forces the model to emit valid JSON without prose wrapper or markdown fences.
# OpenAI
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": prompt}],  # the prompt must mention JSON
)

# Google Gemini
response = model.generate_content(
    prompt,
    generation_config={"response_mime_type": "application/json"},
)
JSON mode guarantees syntactically valid JSON. It does not guarantee the JSON matches your schema. The model might return:
{"result": "I cannot extract order details from this text."}
or use "orderId" when you expected "order_id", or omit required fields entirely. You still need defensive code around the output shape, just less of it.
JSON mode solves the syntax problem. The schema compliance problem is separate.
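The defensive code that JSON mode still requires looks roughly like the sketch below. It uses only the standard library, and the expected field names and types are assumptions chosen to match the order example; in a real pipeline a Pydantic model would normally do this job.

```python
import json

# Hypothetical expected shape: field name -> accepted Python type(s).
EXPECTED_FIELDS = {
    "order_id": str,
    "customer": str,
    "total": (int, float),
}


def check_shape(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the shape is acceptable."""
    data = json.loads(raw)  # JSON mode makes this step safe, and nothing more
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    extra = sorted(set(data) - set(EXPECTED_FIELDS))
    if extra:
        problems.append(f"unexpected fields: {extra}")
    return problems


refusal = check_shape('{"result": "I cannot extract order details from this text."}')
renamed = check_shape('{"orderId": "ORD-12345", "customer": "Jane Smith", "total": 142.5}')
valid = check_shape('{"order_id": "ORD-12345", "customer": "Jane Smith", "total": 142.5}')
print(refusal, renamed, valid, sep="\n")
```

All three inputs are valid JSON, so JSON mode would let every one of them through. Only the last passes the shape check.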
Structured Outputs: Schema Enforcement at the Decoder Level
Structured outputs go further: you supply a JSON schema and the model is constrained to produce output that matches it exactly.
from typing import List

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class OrderItem(BaseModel):
    name: str
    quantity: int
    unit_price: float


class OrderExtraction(BaseModel):
    order_id: str
    customer_name: str
    total: float
    items: List[OrderItem]
    confidence: float


# OpenAI structured outputs
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": f"Extract order: {text}"}],
    response_format=OrderExtraction,
)
order = response.choices[0].message.parsed  # None if the model refused
# order.total is a float, guaranteed
# order.items is a List[OrderItem], guaranteed
# no dict gymnastics, no type coercion
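For Anthropic’s API, tool use with a forced tool choice gets you most of the way to the same guarantee. Forcing the tool guarantees a tool call; whether its input is strictly constrained to the schema depends on whether strict schema enforcement is available and enabled for your model, so validate the input if it is not: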
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "extract_order",
    "description": "Extract structured order information from text",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "customer_name": {"type": "string"},
            "total": {"type": "number"},
            "items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                    },
                    "required": ["name", "quantity", "unit_price"],
                },
            },
            "confidence": {"type": "number"},
        },
        "required": ["order_id", "customer_name", "total", "items", "confidence"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_order"},  # force this tool
    messages=[{"role": "user", "content": f"Extract order: {text}"}],
)

# The forced call arrives as a tool_use content block
tool_use = next(b for b in response.content if b.type == "tool_use")
order = tool_use.input  # dict shaped by input_schema; validate before trusting it
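Setting tool_choice to a specific tool name forces the model to call that tool. It cannot answer in prose alone: the response always contains a tool_use block whose input is a JSON object. Unless strict schema enforcement is enabled, treat that object's conformance to input_schema as very likely rather than guaranteed.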
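Whatever the provider promises, checking the tool input against the same schema is cheap and catches drift early. The sketch below is a hand-rolled validator for a small JSON Schema subset (type, properties, required, items); the function name and the trimmed schema are illustrative, and a library such as jsonschema or a Pydantic model would be the usual choice in production.

```python
def validate(value, schema: dict, path: str = "$") -> list[str]:
    """Check value against a small JSON Schema subset and return error strings."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors = [
            f"{path}.{key}: missing"
            for key in schema.get("required", [])
            if key not in value
        ]
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
        return errors
    if kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        errors = []
        for i, item in enumerate(value):
            errors += validate(item, schema.get("items", {}), f"{path}[{i}]")
        return errors
    checks = {
        "string": lambda v: isinstance(v, str),
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    }
    if kind in checks and not checks[kind](value):
        return [f"{path}: expected {kind}"]
    return []


# Trimmed version of the extract_order input_schema, for illustration.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "total": {"type": "number"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "quantity": {"type": "integer"},
                },
                "required": ["name", "quantity"],
            },
        },
    },
    "required": ["order_id", "total", "items"],
}

good = {"order_id": "ORD-12345", "total": 142.5, "items": [{"name": "Widget", "quantity": 2}]}
bad = {"order_id": "ORD-12345", "total": "$142.50", "items": [{"name": "Widget"}]}

good_errors = validate(good, ORDER_SCHEMA)
bad_errors = validate(bad, ORDER_SCHEMA)
print(good_errors)
print(bad_errors)
```

In the Anthropic example above, the value to validate would be tool_use.input.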
How Constrained Decoding Works
Knowing the mechanism matters because it defines the actual guarantees, not just the behavior in happy-path testing.
Language models generate tokens one at a time. At each step, the model produces a probability distribution over its full vocabulary. Constrained decoding intercepts before sampling and masks any token that would make the current partial output invalid according to the target grammar (in this case, a JSON schema).
Partial output: {"order_id":
Allowed tokens: " (start of a JSON string value)
Masked to zero: } [ 0-9 true false null and everything else
Partial output: {"order_id": "ORD-
Allowed tokens: any character legal inside a JSON string
Masked to zero: structural tokens that would close the string early
This runs at every token. The model follows its learned distributions, but it can only pick from the subset of tokens that keep the output valid. The model cannot “decide” to add a disclaimer. The only thing it can output is a valid completion of your schema. This is not prompt following - it is a structural guarantee at the decoder level.
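A toy version of that loop makes the mechanism easier to see. Everything here is invented for illustration: the ten-token vocabulary, the fake scoring function that stands in for a model, and a "grammar" that is just a list of the two complete outputs it accepts. Real implementations compile the schema into a grammar or state machine and mask logits instead of filtering a dictionary, but the control flow is the same idea.

```python
VOCAB = ["Sure", "!", " Here", "{", "}", '"status"', ":", '"shipped"', '"pending"', '"maybe"']

# Every complete output the toy grammar accepts.
VALID_OUTPUTS = ['{"status":"shipped"}', '{"status":"pending"}']


def fake_scores(prefix: str) -> dict[str, float]:
    """Stand-in for a model: it prefers a chatty opener and an off-schema value."""
    scores = {token: 0.0 for token in VOCAB}
    scores["Sure"] = 5.0
    scores['"maybe"'] = 4.0
    scores['"pending"'] = 3.0
    scores['"shipped"'] = 2.0
    return scores


def allowed(prefix: str, token: str) -> bool:
    """A token is allowed if the extended output can still become a valid one."""
    candidate = prefix + token
    return any(valid.startswith(candidate) for valid in VALID_OUTPUTS)


def constrained_decode(max_steps: int = 10) -> str:
    output = ""
    for _ in range(max_steps):
        if output in VALID_OUTPUTS:
            break
        scores = fake_scores(output)
        # Mask: drop every token that would make the partial output invalid.
        masked = {tok: s for tok, s in scores.items() if allowed(output, tok)}
        if not masked:
            raise RuntimeError("vocabulary cannot continue this prefix")
        # Greedy pick among the survivors; the model's preferences still matter.
        output += max(masked, key=masked.get)
    return output


unconstrained_first = max(fake_scores(""), key=fake_scores("").get)
constrained = constrained_decode()
print(unconstrained_first)
print(constrained)
```

Left alone, the fake model opens with "Sure". Under the mask it can only open with a brace, and when it reaches the value it gets its highest-scoring allowed choice, "pending", not the off-schema "maybe" it scored higher.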
Provider Support
| Provider | Method | Schema level | Notes |
|---|---|---|---|
| OpenAI | beta.chat.completions.parse() | Full JSON schema via Pydantic | Returns typed Python objects |
| Anthropic | Tool use + tool_choice | Full JSON schema | Forcing a specific tool guarantees a tool call; validate the input unless strict schema enforcement is enabled |
| Google Gemini | response_mime_type + schema | Full JSON schema | Available on Gemini 1.5+ |
| Mistral | response_format | JSON schema | Similar to OpenAI |
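| vLLM (self-hosted) | guided_json parameter | Full JSON schema | Decoding backend is configurable; outlines is one option |
| Ollama (local) | format: "json" or a JSON schema | JSON schema in recent versions | Older versions support basic JSON only; use outlines for schema enforcement there |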
For open-source models, the outlines library provides constrained decoding over any Hugging Face model:
from typing import List

import outlines
from pydantic import BaseModel

# Entry points as of outlines 0.x; newer releases changed the API
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")


class OrderExtraction(BaseModel):
    order_id: str
    customer_name: str
    total: float
    items: List[str]


generator = outlines.generate.json(model, OrderExtraction)
result = generator(f"Extract order details from: {text}")
# result is a validated OrderExtraction instance
Patterns for Extraction at Scale
Pipeline Architecture
Input Text
    |
    v
[Chunker]          -- split long documents into segments
    |
    v
[Batch Extractor]  -- parallel schema-constrained extraction
    |
    v
[Validator]        -- Pydantic validation + business rule checks
    |
    |-- confidence >= 0.7  --> Output Store
    |-- confidence 0.4-0.7 --> Human Review Queue
    |-- confidence < 0.4   --> Extraction Failed Log
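The routing step at the bottom of the diagram is small enough to sketch directly. The destination names and the Extraction dataclass are hypothetical; the thresholds are the ones in the diagram, with 0.4 itself treated as reviewable.

```python
from dataclasses import dataclass

ACCEPT_THRESHOLD = 0.7
REVIEW_THRESHOLD = 0.4


@dataclass
class Extraction:
    """Hypothetical minimal result: only the fields routing needs."""
    order_id: str
    confidence: float


def route(result: Extraction) -> str:
    """Map a validated extraction to one of the three destinations."""
    if result.confidence >= ACCEPT_THRESHOLD:
        return "output_store"
    if result.confidence >= REVIEW_THRESHOLD:
        return "human_review"
    return "failed_log"


routes = [route(Extraction("ORD-1", c)) for c in (0.92, 0.7, 0.55, 0.4, 0.1)]
print(routes)
```

Keeping the thresholds as named constants makes them easy to tune once you have review-queue data to calibrate against.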
Async Batch Extraction
import asyncio
from typing import List

from openai import AsyncOpenAI
from pydantic import BaseModel


class ExtractionResult(BaseModel):
    order_id: str
    customer_name: str
    total: float
    items: List[str]
    confidence: float


class ExtractionError(BaseModel):
    source: str
    error: str


async def extract_one(
    client: AsyncOpenAI,
    text: str,
) -> ExtractionResult | ExtractionError:
    try:
        response = await client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": f"Extract order: {text}"}],
            response_format=ExtractionResult,
            timeout=10.0,
        )
        result = response.choices[0].message.parsed
        if result is None:  # the model refused, so nothing was parsed
            return ExtractionError(source=text, error="no parsed output")
        # Business rule check on top of schema validation
        if result.total <= 0:
            return ExtractionError(source=text, error="total must be positive")
        return result
    except Exception as e:  # API errors, timeouts, validation errors
        return ExtractionError(source=text, error=str(e))


async def batch_extract(client: AsyncOpenAI, texts: list[str], concurrency: int = 20):
    sem = asyncio.Semaphore(concurrency)  # cap the number of in-flight requests

    async def bounded(text: str):
        async with sem:
            return await extract_one(client, text)

    results = await asyncio.gather(*[bounded(t) for t in texts])

    # Route with the same thresholds as the pipeline diagram
    successes, review, errors = [], [], []
    for text, r in zip(texts, results):
        if isinstance(r, ExtractionError):
            errors.append(r)
        elif r.confidence >= 0.7:
            successes.append(r)
        elif r.confidence >= 0.4:
            review.append(r)
        else:
            errors.append(ExtractionError(source=text, error=f"low confidence: {r.confidence:.2f}"))
    return successes, review, errors
Schema Design: What Actually Matters
Use enums for categorical fields. A plain string for a field with fixed values invites hallucination. An enum forecloses it:
from enum import Enum

from pydantic import BaseModel


class OrderStatus(str, Enum):
    pending = "pending"
    shipped = "shipped"
    delivered = "delivered"
    cancelled = "cancelled"


class Order(BaseModel):
    status: OrderStatus  # model can only output one of four values
Use Optional for fields that may not exist. If you mark a field required and the source text does not contain it, the model will hallucinate a value rather than fail schema validation. Optional and a null check downstream is safer:
class OrderExtraction(BaseModel):
    order_id: str                         # always present in valid orders
    customer_name: Optional[str] = None   # may be omitted
    discount_code: Optional[str] = None   # rarely present
Add a confidence score and an extraction_successful flag. Constrained decoding forces the model to fill every required field. Without an explicit failure path, the model is stuck: it must produce a value, so it guesses. Give it an escape hatch:
class SafeExtraction(BaseModel):
    extraction_successful: bool
    failure_reason: Optional[str] = None  # set when extraction_successful is False
    order_id: Optional[str] = None
    customer_name: Optional[str] = None
    total: Optional[float] = None
If extraction_successful is False, skip the other fields entirely and route to a fallback. Never assume a filled-in field means the model was confident.
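Downstream, that rule can look like the sketch below. SafeResult is a hypothetical standard-library mirror of the SafeExtraction model so the example runs without Pydantic, and the destination names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafeResult:
    """Hypothetical stdlib mirror of the SafeExtraction model above."""
    extraction_successful: bool
    failure_reason: Optional[str] = None
    order_id: Optional[str] = None
    customer_name: Optional[str] = None
    total: Optional[float] = None


def handle(result: SafeResult) -> tuple[str, dict]:
    """Return (destination, payload) for one extraction result."""
    if not result.extraction_successful:
        # Ignore every other field, even if the model filled some in.
        return "fallback", {"reason": result.failure_reason or "unspecified"}
    if result.order_id is None or result.total is None:
        # The flag says success but the data is missing: do not trust it.
        return "fallback", {"reason": "success flag set but fields are null"}
    return "accepted", {
        "order_id": result.order_id,
        "customer_name": result.customer_name,
        "total": result.total,
    }


failed = handle(SafeResult(False, failure_reason="no order in text", order_id="ORD-0"))
inconsistent = handle(SafeResult(True))
accepted = handle(SafeResult(True, order_id="ORD-12345", customer_name="Jane Smith", total=142.5))
print(failed, inconsistent, accepted, sep="\n")
```

The first case carries an order_id alongside a failure flag. The handler drops it, which is the point: the flag wins over any field the model happened to fill in.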
Keep schemas flat where possible. Deeply nested schemas increase the number of tokens the model must generate and raise the probability of mid-schema drift. Three levels of nesting is roughly the practical limit before accuracy drops noticeably.
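One way to keep an eye on this is to measure nesting before a schema ships. The helper below is a sketch that counts levels of object nesting in a JSON-Schema-style dict, treating arrays as transparent; that counting convention and the function name are assumptions, and it ignores constructs such as anyOf and $ref.

```python
def schema_depth(schema: dict) -> int:
    """Count levels of object nesting; a flat root object has depth 1."""
    kind = schema.get("type")
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(child) for child in children), default=0)
    if kind == "array":
        # Arrays add no level of their own; measure what they contain.
        return schema_depth(schema.get("items", {}))
    return 0  # scalars


flat = {"type": "object", "properties": {"order_id": {"type": "string"}}}

order_with_items = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {"type": "object", "properties": {"name": {"type": "string"}}},
        }
    },
}

deep = {
    "type": "object",
    "properties": {
        "customer": {
            "type": "object",
            "properties": {
                "address": {
                    "type": "object",
                    "properties": {
                        "geo": {"type": "object", "properties": {"lat": {"type": "number"}}}
                    },
                }
            },
        }
    },
}

depths = [schema_depth(s) for s in (flat, order_with_items, deep)]
print(depths)
```

A check like this fits naturally in a unit test that fails when a schema grows past whatever depth your accuracy measurements support.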
What Constrained Decoding Does Not Solve
Latency overhead. The per-token masking adds 5-15% latency depending on schema complexity. For streaming responses with tight budgets under 100ms, this matters.
Semantic correctness. The model is constrained to produce output that is valid against the schema. It is not constrained to produce output that is correct. It can still populate customer_name with a plausible but wrong name if the source text is ambiguous. Schema enforcement and factual accuracy are separate problems.
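A partial mitigation for string fields is a grounding check: confirm that each extracted value actually appears in the source text. The sketch below is deliberately crude, and the function names are illustrative. It catches values that appear nowhere in the source, but it cannot catch a value that is present and still wrong, and it does not suit fields the model is expected to normalize or compute.

```python
import re


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", value).strip().lower()


def ungrounded_fields(source: str, extracted: dict[str, str]) -> list[str]:
    """Return the names of string fields whose values never appear in the source."""
    haystack = normalize(source)
    return [
        field
        for field, value in extracted.items()
        if normalize(value) not in haystack
    ]


source = "Order ORD-12345 was placed by Jane  Smith.\nTotal due: $142.50."

grounded = ungrounded_fields(source, {"order_id": "ORD-12345", "customer_name": "jane smith"})
suspicious = ungrounded_fields(source, {"order_id": "ORD-12345", "customer_name": "John Smith"})
print(grounded)
print(suspicious)
```

Fields flagged this way are good candidates for the human review queue, since an outright rejection would also discard legitimate paraphrases.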
Hallucination of values. The model cannot skip a required field, so if it cannot find the value, it will fabricate one. This is why confidence scores and Optional fields are not optional design choices - they are the mechanism that keeps constrained decoding honest.
The Honest Assessment
Structured outputs with schema enforcement solve the output reliability problem. Moving from regex parsing to constrained decoding is not an incremental improvement - it removes an entire category of production bug.
What works well: extraction pipelines, classification tasks, transformation workflows where one step’s output feeds the next step’s input, any use case where you know the shape of what you want in advance.
What does not work well: open-ended generation where schema constraints fight the model’s natural response style, schemas deeper than three or four levels of nesting, situations where the model genuinely does not have enough information and you need it to explain why rather than fill fields with guesses.
What to actually do: add Pydantic models or tool schemas to your existing extraction code. Route anything below your confidence threshold to a human review queue rather than trusting the output. Design every schema with an explicit failure mode from day one. The latency overhead is real and small, and it is worth it every time.