The most common mistake in LLM apps is treating the model as the safety layer. You write a careful system prompt, add “do not answer off-topic questions, do not produce harmful content, always return valid JSON,” and ship it. Then a user asks your customer-support bot for medical advice, it hands out a dosage, and you learn that a system prompt is a suggestion, not a contract.
Guardrails are the code that sits around the model and turns suggestions into constraints. The model is a probabilistic text generator. Everything you actually guarantee has to be enforced outside it. This post builds a guardrail pipeline from the naive version to something you can put in front of real users, with a clear answer to the question that matters: what does each layer catch that the others miss?
This is different from prompt injection defense, which is about an attacker hijacking your agent through untrusted tool output. Guardrails are the broader boundary: they run on every request and every response, for honest users and malicious ones alike, and they exist even when there is no attacker at all.
The Naive Version and Where It Breaks
Here is what most teams ship first.
def handle(user_message: str) -> str:
    return llm.generate(
        system="You are a support bot for Acme. Only answer questions "
               "about Acme products. Never give medical, legal, or "
               "financial advice. Be safe and helpful.",
        messages=[{"role": "user", "content": user_message}],
    )
This works in the demo and fails in production for reasons that have nothing to do with model quality:
- The prompt is not enforced. “Only answer about Acme products” holds until a user phrases the off-topic question as a hypothetical, a roleplay, or a translation task. The model is trained to be helpful, and helpfulness usually wins.
- There is no input check at all. A 40,000-token pasted document, a prompt-injection string, or a raw SQL blob all flow straight into the model. You pay for the tokens and inherit whatever behavior they trigger.
- There is no output check at all. Whatever the model produces goes straight to the user. If it hallucinates a refund policy, leaks another user’s data that slipped into context, or emits a slur it was baited into, you find out from the user.
- Failures are invisible. When something goes wrong you have no signal for which stage failed, because there are no stages.
The fix is not a better prompt. It is to stop asking one model call to be the input filter, the policy engine, the content moderator, and the output validator all at once. Split those into explicit layers, each of which does one job and can be tested on its own.
The Four Layers
A guardrail pipeline has two guarded boundaries: the input on the way in, and the output on the way out. Four layers cover the real failure modes.
          User message
               |
               v
+------------------------------+
| Layer 1: Input Validation    |  size, encoding, PII, injection
+------------------------------+
               |
               v
+------------------------------+
| Layer 2: Topical Boundary    |  is this in scope at all?
+------------------------------+
               |
               v
            [ LLM ]
               |
               v
+------------------------------+
| Layer 3: Output Validation   |  schema, grounding, format
+------------------------------+
               |
               v
+------------------------------+
| Layer 4: Content Safety      |  toxicity, PII leak, policy
+------------------------------+
               |
               v
        Response to user
Each layer can pass, block, or rewrite. The important design rule: a layer never trusts the layer before it. Input validation does not assume the client sanitized anything. Output content safety does not assume the model behaved just because the input was clean. Layers are cheap and independent, so run all of them.
Here is the skeleton the rest of the post fills in.
from dataclasses import dataclass


@dataclass
class GuardResult:
    action: str        # "pass", "block", "rewrite"
    payload: str       # cleaned text, or a safe message on block
    reason: str = ""   # for logging and metrics


class Guardrail:
    def check(self, text: str, ctx: dict) -> GuardResult:
        raise NotImplementedError


class Pipeline:
    def __init__(self, inbound: list[Guardrail], outbound: list[Guardrail]):
        self.inbound = inbound
        self.outbound = outbound

    def run(self, user_message: str, ctx: dict) -> str:
        text = user_message
        for g in self.inbound:
            r = g.check(text, ctx)
            log_guard(g, r, ctx)
            if r.action == "block":
                return safe_refusal(r.reason)
            if r.action == "rewrite":
                text = r.payload

        raw = llm.generate(system=ctx["system"],
                           messages=[{"role": "user", "content": text}])

        for g in self.outbound:
            r = g.check(raw, ctx)
            log_guard(g, r, ctx)
            if r.action == "block":
                return safe_refusal(r.reason)
            if r.action == "rewrite":
                raw = r.payload
        return raw
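The skeleton calls safe_refusal without defining it. One possible shape, offered as an assumption rather than the post's own implementation: map the prefix of the internal reason code to a fixed user-facing message, so the reason stays in the logs and never reaches the user. The message table and its wording are hypothetical.

```python
# Hypothetical user-facing messages, keyed by the reason prefix (text before ":").
REFUSAL_MESSAGES = {
    "input_too_long": "That message is too long for me to process. Could you shorten it?",
    "out_of_scope": "I can only help with questions about Acme products and accounts.",
}
DEFAULT_REFUSAL = "Sorry, I can't help with that request."


def safe_refusal(reason: str) -> str:
    """Return a fixed message for a block reason without echoing the reason itself."""
    prefix = reason.split(":", 1)[0]
    return REFUSAL_MESSAGES.get(prefix, DEFAULT_REFUSAL)
```

Keeping the detailed label (for example the part after the colon) out of the reply avoids telling a probing user exactly which check they tripped.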
Layer 1: Input Validation
This layer runs before you spend a single model token. It catches the cheap, deterministic problems that never need an LLM to detect.
import re


class InputValidation(Guardrail):
    MAX_CHARS = 12_000
    CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "credit_card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def check(self, text: str, ctx: dict) -> GuardResult:
        if len(text) > self.MAX_CHARS:
            return GuardResult("block", "", "input_too_long")

        # Strip control characters used to smuggle hidden instructions
        cleaned = self.CONTROL_CHARS.sub("", text)

        # Redact obvious PII before it ever reaches the model or your logs
        for label, pattern in self.PII_PATTERNS.items():
            cleaned = pattern.sub(f"[REDACTED_{label.upper()}]", cleaned)

        if cleaned != text:
            return GuardResult("rewrite", cleaned, "sanitized_input")
        return GuardResult("pass", text)
What this layer catches that no other layer does:
- Size and cost attacks. A pasted 200-page PDF is a denial-of-wallet problem. Reject it before it becomes a bill.
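- Encoding tricks. Zero-width characters, control bytes, and homoglyphs that hide instructions from a human reviewer but not from the model. This is the place to normalize Unicode (NFKC) and strip control and zero-width characters; the listing above shows only the control-character step. NFKC folds compatibility forms such as full-width letters, but it does not map look-alike letters across scripts, so cross-script homoglyphs need a separate confusables check.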
- PII on the way in. The single most common data-leak path in LLM apps is not the model saying something clever. It is a user pasting a customer’s full record into the chat, which then lands in your prompt logs, your vector store, and your analytics pipeline. Redact at the door.
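A minimal sketch of the Unicode half of that cleanup, using only the standard library. The helper name normalize_input and the exact set of invisible code points are assumptions for illustration, not part of the original listing.

```python
import re
import unicodedata

# Zero-width and bidirectional control characters often used to hide text.
INVISIBLE = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufeff]")
# ASCII control characters except tab, newline, and carriage return.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def normalize_input(text: str) -> str:
    """Apply NFKC, then drop invisible and control characters."""
    text = unicodedata.normalize("NFKC", text)
    text = INVISIBLE.sub("", text)
    return CONTROL.sub("", text)
```

NFKC turns a full-width "Ａ" into a plain "A", but a Cyrillic "а" passes through unchanged, which is why look-alike detection across scripts is a separate problem.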
Input validation is deterministic and fast (microseconds), so it is always worth running. It is also the wrong tool for anything semantic. It cannot tell whether a question is on-topic or whether a request is harmful in intent. Those judgments belong to the semantic layers that follow.
Layer 2: Topical Boundaries
This is the layer people skip, and it is the one that turns a support bot into a liability. The question is narrow: is this request even in scope for this application? A tax-filing assistant should not answer questions about vaccine schedules, not because vaccines are unsafe, but because a wrong answer there is a lawsuit and there is no version of your app that is supposed to handle it.
Regex cannot do this. Topic is semantic. But you do not need your expensive answer-generating model either. Use a small, fast classifier.
class TopicalBoundary(Guardrail):
    IN_SCOPE = ["billing", "account_setup", "product_features", "troubleshooting"]

    def check(self, text: str, ctx: dict) -> GuardResult:
        label = fast_classifier.classify(
            text,
            labels=self.IN_SCOPE + ["out_of_scope", "medical", "legal", "financial"],
        )
        if label in self.IN_SCOPE:
            return GuardResult("pass", text)
        return GuardResult(
            "block", "",
            f"out_of_scope:{label}",
        )
A few things that matter in practice:
- Define in-scope, not out-of-scope. You cannot enumerate everything a user might ask that is off-topic. You can enumerate the handful of things your app is for. Allowlist the topics you support and block the rest by default. This is the same principle as privilege separation for agents: deny by default, permit explicitly.
- Use a cheap model. A small classifier or a Haiku-class call at 50 tokens is enough. This runs on every request, so latency and cost matter. It does not need to be smart, it needs to be conservative.
- Return a specific reason. “out_of_scope:medical” tells you something “blocked” does not. When you review blocked traffic, the label distribution tells you what users actually want and where your scope boundary is wrong.
What this layer catches that Layer 1 misses: intent. A perfectly clean, PII-free, correctly-sized message that says “what dose of ibuprofen is safe for a 6 year old” passes input validation cleanly and is exactly the request you must not answer. Topic classification is the only layer positioned to catch it before the model answers.
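One way to make "conservative" concrete, sketched under an assumption the listing above does not make: that the classifier reports a confidence alongside its label. The Prediction type, is_in_scope, and the 0.8 default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # 0.0..1.0, as reported by the hypothetical classifier


def is_in_scope(pred: Prediction, in_scope: set[str], min_confidence: float = 0.8) -> bool:
    """Allow only an allowlisted label predicted with enough confidence."""
    return pred.label in in_scope and pred.confidence >= min_confidence
```

A low-confidence "billing" is treated the same as a confident "medical": both are blocked. That trades some false positives for fewer out-of-scope answers, which is the direction the deny-by-default rule pushes.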
Layer 3: Output Validation
Now the model has produced something. Output validation checks that the response is well-formed and grounded before anyone sees it. This splits into two concerns that are easy to conflate: structure and truth.
Structure is the easy, deterministic half. If your downstream code expects JSON, validate it against a schema and repair or reject on failure.
import json
from jsonschema import validate, ValidationError


class SchemaValidation(Guardrail):
    def __init__(self, schema: dict):
        self.schema = schema

    def check(self, text: str, ctx: dict) -> GuardResult:
        try:
            obj = json.loads(text)
            validate(obj, self.schema)
            return GuardResult("pass", text)
        except (json.JSONDecodeError, ValidationError) as e:
            # One repair attempt is cheap; a second is usually a losing bet
            repaired = llm.generate(
                system="Fix this to match the schema. Output only JSON.",
                messages=[{"role": "user",
                           # json.dumps so the model sees real JSON, not a Python dict repr
                           "content": f"Schema: {json.dumps(self.schema)}\nBroken: {text}"}],
            )
            try:
                validate(json.loads(repaired), self.schema)
                return GuardResult("rewrite", repaired, "schema_repaired")
            except (json.JSONDecodeError, ValidationError):
                return GuardResult("block", "", f"schema_invalid:{e.__class__.__name__}")
Truth is the harder half, and it is where hallucination lives. For any app that answers from a knowledge base, retrieved documents, or tool results, the critical check is grounding: does the response only assert things supported by the context you actually provided?
class GroundingCheck(Guardrail):
    def check(self, text: str, ctx: dict) -> GuardResult:
        sources = ctx.get("retrieved_context", "")
        if not sources:
            return GuardResult("pass", text)  # nothing to ground against

        verdict = judge_model.generate(
            system="You verify grounding. Given SOURCES and a RESPONSE, "
                   "answer GROUNDED if every factual claim in the response "
                   "is supported by the sources, else UNGROUNDED and list "
                   "the unsupported claim.",
            messages=[{"role": "user",
                       "content": f"SOURCES:\n{sources}\n\nRESPONSE:\n{text}"}],
        )
        if verdict.text.strip().upper().startswith("GROUNDED"):
            return GuardResult("pass", text)
        return GuardResult("block", "", "ungrounded_claim")
Two rules that keep this honest:
- The judge model sees the sources, never the user’s original question framing. You are checking response against evidence, not re-answering the question. Keep the judge’s job narrow so it stays cheap and reliable.
- Grounding is not free, so scope it. Run it on answers that make factual claims (RAG responses, policy lookups), not on greetings or clarifying questions. Gate it behind a cheap check for whether the response even contains a factual assertion.
What this layer catches that the others cannot: a response that is clean, on-topic, and non-toxic but simply wrong. The model invented a refund window of 60 days when your policy says 30. No input filter and no toxicity classifier will ever flag that. Only a check against ground truth will.
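The "cheap check for whether the response even contains a factual assertion" could be as simple as a keyword and digit heuristic. This is a rough sketch, not a method the post specifies: the signal list and the name needs_grounding are assumptions, and the vocabulary would need to match your own domain.

```python
import re

# Hypothetical signals that a response asserts something checkable:
# digits, currency symbols, durations, and policy vocabulary.
FACTUAL_SIGNALS = re.compile(
    r"\d|[$€£]|\b(?:days?|weeks?|months?|years?|refund|policy|warranty|fee|price)\b",
    re.IGNORECASE,
)


def needs_grounding(response: str) -> bool:
    """Cheap pre-filter: True if the response looks like it makes a factual claim."""
    return bool(FACTUAL_SIGNALS.search(response))
```

A miss here means an unverified answer reaches the user, so a gate like this should err toward triggering the judge too often rather than too rarely.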
Layer 4: Content Safety
The last layer runs on the generated output, right before it reaches the user, and asks: is this text safe to show, regardless of how it got here? This is the layer that catches the model being baited into producing toxic content, leaking PII that entered through the retrieved context, or reproducing something from its training data that you do not want attributed to your product.
class ContentSafety(Guardrail):
    def check(self, text: str, ctx: dict) -> GuardResult:
        # Fast deterministic pass: PII that appeared in the OUTPUT
        for label, pattern in InputValidation.PII_PATTERNS.items():
            if pattern.search(text):
                return GuardResult("block", "", f"pii_in_output:{label}")

        # Semantic pass: a dedicated moderation model
        scores = moderation_model.score(text)  # {category: 0.0..1.0}
        for category, score in scores.items():
            # .get avoids a KeyError if the model reports a category not listed below
            if score > THRESHOLDS.get(category, DEFAULT_THRESHOLD):
                return GuardResult("block", "", f"unsafe_output:{category}")
        return GuardResult("pass", text)


THRESHOLDS = {
    "hate": 0.5, "harassment": 0.6, "self_harm": 0.3,
    "sexual": 0.5, "violence": 0.7,
}
# Assumption: unlisted categories get the strictest tolerance used above
DEFAULT_THRESHOLD = 0.3
The design decisions worth calling out:
- Content safety on output, not just input. Moderating the user’s input tells you nothing about what the model will say. The model can produce toxic content from a benign prompt (baiting, roleplay, or plain error). The only place to catch harmful output is on the output.
- Per-category thresholds. Self-harm content warrants a much lower tolerance than a mild violence score in a support conversation about a broken product. One global threshold either over-blocks or under-blocks. Tune per category against your actual traffic.
- PII in output is a distinct failure from PII in input. Input PII is data the user chose to paste in. Output PII usually means data from your retrieval context or another user’s record leaked into the response. That is a more serious incident and deserves its own alert, not a shared counter.
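Tuning per category against actual traffic can start from a sample of responses you have labeled as benign. The sketch below is one assumed approach, not the post's: for each category, pick the lowest threshold that would block no more than a chosen share of those benign samples. The function name and the target rate are hypothetical.

```python
import math


def threshold_for_fpr(benign_scores: list[float], max_false_positive_rate: float) -> float:
    """Smallest threshold that blocks at most the given share of benign samples.

    A sample counts as blocked when score > threshold.
    """
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= max_false_positive_rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    ranked = sorted(benign_scores)
    allowed_blocks = min(math.floor(len(ranked) * max_false_positive_rate),
                         len(ranked) - 1)
    # The threshold sits at the highest score that must still pass.
    return ranked[len(ranked) - allowed_blocks - 1]
```

This only controls over-blocking. How much genuinely harmful output slips under the resulting threshold has to be measured separately, on samples labeled as harmful.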
Putting It Together
Wiring the four layers into the pipeline from earlier:
pipeline = Pipeline(
    inbound=[
        InputValidation(),
        TopicalBoundary(),
    ],
    outbound=[
        SchemaValidation(schema=RESPONSE_SCHEMA),  # if you need structured output
        GroundingCheck(),
        ContentSafety(),
    ],
)


def handle(user_message: str, ctx: dict) -> str:
    return pipeline.run(user_message, ctx)
Order matters. Inbound layers run cheapest-first so you reject junk before paying for classification, and reject off-topic before paying for the main model call. Outbound, run structural checks before semantic ones for the same reason: there is no point running an expensive grounding check on output that is not even valid JSON.
Here is the whole picture as a decision matrix. This is the table to keep on the wall, because it answers “which layer owns this failure.”
| Failure mode | Caught by | Not caught by | Cost |
|---|---|---|---|
| Oversized / expensive input | Input validation | Everything else runs too late | Free |
| Hidden control chars, homoglyphs | Input validation | Semantic layers see cleaned text | Free |
| User pastes PII into chat | Input validation | Output layers (already in logs) | Free |
| Off-topic / out-of-scope request | Topical boundary | Input validation (input is clean) | Cheap LLM |
| Malformed / non-JSON output | Schema validation | Content safety (may be valid-but-toxic) | Free + 1 repair |
| Confidently wrong / hallucinated fact | Grounding check | Every non-truth layer | Judge LLM |
| Toxic / harmful generated text | Content safety | Input moderation (input was benign) | Moderation model |
| PII leaked into the response | Content safety | Input validation (came from context) | Free |
The column that matters is “not caught by.” Every failure mode has exactly one layer that owns it, and no other layer covers for it. That is the argument for keeping all four. Drop topical boundaries and off-topic requests sail through. Drop grounding and you ship confident hallucinations. The layers are not redundant, they are orthogonal.
Observability and Failure Handling
A guardrail that blocks silently is a support ticket you will never trace. Every layer logs a structured event: which guard, which action, which reason, the request ID.
def log_guard(guard, result: GuardResult, ctx: dict):
    metrics.increment(
        "guardrail.action",
        tags={
            "guard": guard.__class__.__name__,
            "action": result.action,
            "reason": result.reason,
        },
    )
    if result.action == "block":
        audit_log.write({
            "request_id": ctx["request_id"],
            "guard": guard.__class__.__name__,
            "reason": result.reason,
        })
Watch the block-rate per layer. A sudden spike in TopicalBoundary blocks means either an attack or, more often, that real users want something you do not support and your scope is wrong. A spike in GroundingCheck blocks means your retrieval quality dropped or the model regressed. The guardrail metrics are one of the best product signals you have, because they measure the gap between what users ask for and what your app is allowed to do.
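Block rate per layer is easy to compute from the logged events. A small sketch, assuming each event is a dict with "guard" and "action" keys like the metric tags above; the spike rule (a multiple of baseline with a minimum floor) and both function names are illustrative choices.

```python
from collections import Counter


def block_rates(events: list[dict]) -> dict[str, float]:
    """Share of checks that ended in "block", per guard name."""
    totals, blocks = Counter(), Counter()
    for event in events:
        totals[event["guard"]] += 1
        if event["action"] == "block":
            blocks[event["guard"]] += 1
    return {guard: blocks[guard] / totals[guard] for guard in totals}


def spiking_guards(current: dict[str, float], baseline: dict[str, float],
                   factor: float = 2.0, floor: float = 0.01) -> list[str]:
    """Guards whose block rate is at least `factor` times baseline (with a floor)."""
    return sorted(
        guard for guard, rate in current.items()
        if rate >= factor * max(baseline.get(guard, 0.0), floor)
    )
```

The floor keeps a guard with a near-zero baseline from alerting on a single stray block.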
One more rule: decide your fail-open vs fail-closed policy per layer, explicitly. If the moderation model times out, does the response go through or get blocked? For content safety, fail closed (block on error): a blocked response beats a harmful one. For grounding, you might fail open with a lower-confidence label, because blocking every answer when the judge is down makes the app useless. Write the policy down; do not let it be an accident of exception handling.
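One way to write that policy down in code is a thin wrapper that converts any exception from a guard into an explicit pass or block. This is a sketch under assumptions: the wrapper name WithFailurePolicy and the guard_error reason format are made up here, and GuardResult is repeated only so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class GuardResult:  # same shape as the skeleton earlier in the post
    action: str
    payload: str
    reason: str = ""


class WithFailurePolicy:
    """Wrap a guard so an exception becomes an explicit pass or block."""

    def __init__(self, guard, fail_closed: bool):
        self.guard = guard
        self.fail_closed = fail_closed

    def check(self, text: str, ctx: dict) -> GuardResult:
        try:
            return self.guard.check(text, ctx)
        except Exception as exc:  # timeouts, network errors, bad responses
            name = type(self.guard).__name__
            reason = f"guard_error:{name}:{type(exc).__name__}"
            if self.fail_closed:
                return GuardResult("block", "", reason)
            # Fail open: let the text through, but keep the reason for logging.
            return GuardResult("pass", text, reason)
```

Under this scheme the outbound list would hold something like WithFailurePolicy(ContentSafety(), fail_closed=True) and WithFailurePolicy(GroundingCheck(), fail_closed=False), so the choice is visible where the pipeline is wired.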
The Honest Assessment
What works: the layered structure. Splitting one overloaded system prompt into four independent, testable checks is the highest-leverage change you can make to an LLM app’s reliability. Each layer is simple, each catches a distinct failure class, and you can unit-test them without invoking the main model. Input validation and content safety in particular are cheap and should be non-negotiable.
What does not work as well as vendors claim: the semantic layers are themselves LLMs, and they inherit LLM failure modes. A grounding judge can be fooled. A topic classifier has a false-positive rate that will occasionally block a legitimate question and annoy a real user. A moderation model has a threshold that is always slightly wrong for your specific traffic. Guardrails reduce risk; they do not zero it. Anyone selling you a “100% safe” guardrail is selling you a false sense of security.
What to actually do: start with the cheap layers. Input validation and the output PII scan are deterministic and effectively free, and the moderation call in content safety costs little next to the main model call. Together they catch the most common real incidents. Add topical boundaries next, because scope creep is where most LLM apps embarrass themselves. Add grounding last and only if you make factual claims, because it is the most expensive and the hardest to tune. Instrument all of them from day one. The teams running reliable LLM apps in 2026 are not the ones with the best model or the cleverest prompt. They are the ones who treated the model as one untrusted component in a pipeline and put real, tested code on both sides of it.