Every AI framework promises agents that can “autonomously complete complex tasks.” The demo shows an agent booking flights, writing code, and sending emails. Then you deploy it and it hallucinates a nonexistent API endpoint, gets stuck in an infinite loop calling the same tool, and racks up $200 in API costs before your rate limiter kicks in.
I have shipped AI agents to production across three different products in the last year. Here is what actually works, what fails, and the patterns that separate a demo from a system your users can rely on.
Why Demo Agents Fail in Production
The gap between a demo agent and a production agent comes down to five failure modes:
1. Hallucinated tool calls. The model invents function names or parameters that do not exist. In a demo, you retry and it works. In production, this corrupts data.
2. Infinite loops. The agent calls the same tool repeatedly because its output does not change the state in a way the model can detect. Cost explodes.
3. Context window overflow. Each tool call adds to the conversation history. After 10-15 steps, the agent loses track of its original goal because earlier context has been pushed out or compressed.
4. Cascading errors. One bad tool call produces bad output, which the agent uses as input for the next call, which produces worse output. Error propagation without correction.
5. Non-determinism in critical paths. The same input produces different tool call sequences on different runs. This makes testing, debugging, and reliability guarantees nearly impossible.
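Several of these failure modes can be caught mechanically before a tool call ever executes. Here is a minimal sketch of schema validation against a tool registry; the registry shape and tool names are illustrative assumptions, not a real API.

```python
# Sketch: validate a proposed tool call against registered schemas before
# executing it, catching hallucinated tool names and parameters.
TOOL_SCHEMAS = {
    "lookup_user": {"required": {"query"}, "allowed": {"query", "query_type"}},
    "get_user_data": {"required": {"user_id", "data_type"},
                      "allowed": {"user_id", "data_type"}},
}

def validate_tool_call(name: str, params: dict) -> tuple[bool, str]:
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False, f"Unknown tool: {name}"  # hallucinated tool name
    missing = schema["required"] - params.keys()
    if missing:
        return False, f"Missing parameters: {sorted(missing)}"
    unknown = params.keys() - schema["allowed"]
    if unknown:
        return False, f"Hallucinated parameters: {sorted(unknown)}"
    return True, "OK"
```

Rejecting the call with a descriptive reason also gives the model a chance to self-correct on the next step instead of propagating a bad result.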
ReAct vs Plan-and-Execute
Two patterns dominate agent design, and the choice between them matters more than the model you pick.
ReAct (Reason + Act)
The agent thinks, acts, observes, and repeats, deciding each next step from the accumulated history rather than from an upfront plan.
```python
class ReActAgent:
    def run(self, task: str, max_steps: int = 10):
        messages = [{"role": "user", "content": task}]
        for step in range(max_steps):
            response = self.llm.generate(
                messages=messages,
                tools=self.tools,
                system="Think step by step. Use tools when needed. "
                       "Say DONE when the task is complete."
            )
            if response.stop_reason == "tool_use":
                tool_result = self.execute_tool(response.tool_call)
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": f"Tool result: {tool_result}"})
            elif "DONE" in response.text:
                return response.text
            else:
                messages.append({"role": "assistant", "content": response.content})
        return "Max steps reached without completion"
```
Strengths: Simple, flexible, good for exploratory tasks. Weaknesses: No upfront planning, prone to wandering, expensive for multi-step tasks.
Plan-and-Execute
The agent creates a plan first, then executes steps sequentially, revising the plan if a step fails.
```python
class PlanAndExecuteAgent:
    def run(self, task: str):
        # Phase 1: Create a plan
        plan = self.llm.generate(
            system="Create a numbered step-by-step plan to accomplish this task. "
                   "Each step should be a single tool call. Be specific.",
            messages=[{"role": "user", "content": task}]
        )
        steps = self.parse_plan(plan.text)
        results = []
        i = 0
        # Iterate by index so replanning can replace the remaining steps;
        # a for-loop over `steps` would keep iterating the original list
        # even after reassignment, silently ignoring the revised plan.
        while i < len(steps):
            # Phase 2: Execute each step
            result = self.execute_step(steps[i], previous_results=results)
            results.append(result)
            if result.failed:
                # Phase 3: Replan from current state
                revised_plan = self.replan(
                    original_task=task,
                    completed_steps=results,
                    failed_step=steps[i],
                    remaining_steps=steps[i+1:]
                )
                steps = steps[:i+1] + self.parse_plan(revised_plan.text)
            i += 1
        return self.summarize(task, results)
```
Strengths: Predictable execution, easier to debug, better cost control. Weaknesses: Upfront planning adds latency, replanning is expensive, rigid for tasks that require exploration.
When to Use Which
| Scenario | Pattern | Why |
|---|---|---|
| Customer support agent | ReAct | Conversations are exploratory, user needs change mid-task |
| Data pipeline automation | Plan-and-Execute | Steps are well-defined, order matters, failures need structured recovery |
| Code generation agent | Plan-and-Execute | Break complex coding tasks into subtasks, execute sequentially |
| Research/analysis agent | ReAct | Needs to follow leads, direction not known upfront |
| Workflow automation | Plan-and-Execute | Deterministic steps, auditability required |
Tool Design - Less Is More
The single most impactful decision in agent design is your tool set. Every extra tool increases the surface area for hallucinated calls, confuses the model’s tool selection, and makes your system harder to test.
Rules for Production Tool Design
Rule 1: Fewer tools, better descriptions.
```python
# Bad: 15 granular tools
tools = [
    "search_users_by_email",
    "search_users_by_name",
    "search_users_by_id",
    "search_users_by_phone",
    "get_user_orders",
    "get_user_subscriptions",
    # ... 10 more
]

# Good: 3 well-described tools
tools = [
    {
        "name": "lookup_user",
        "description": "Find a user by any identifier: email, name, user ID, or phone number. Returns user profile and account status.",
        "parameters": {
            "query": "The search term (email, name, ID, or phone)",
            "query_type": "One of: email, name, id, phone. If unsure, use 'auto'."
        }
    },
    {
        "name": "get_user_data",
        "description": "Get detailed data for a specific user. Specify what data you need.",
        "parameters": {
            "user_id": "The user's ID (from lookup_user)",
            "data_type": "One of: orders, subscriptions, billing, activity_log"
        }
    },
    {
        "name": "take_action",
        "description": "Perform an action on a user's account.",
        "parameters": {
            "user_id": "The user's ID",
            "action": "One of: refund_order, cancel_subscription, reset_password, escalate_to_human",
            "details": "Additional context for the action (e.g., order ID for refund)"
        }
    }
]
```
Rule 2: Tools should be idempotent where possible. If the agent calls the same tool twice with the same parameters, it should get the same result without side effects. This makes retries safe.
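Idempotency can often be enforced outside the tool itself. Below is a minimal sketch that caches results keyed on tool name plus parameters; the `refund_order` tool and its side effect are hypothetical stand-ins, and in production the cache would live in a shared store rather than process memory.

```python
import hashlib
import json

# Sketch: wrap a side-effecting tool so repeated calls with identical
# parameters return the cached first result instead of re-executing.
_results: dict[str, object] = {}

def idempotent(tool_fn):
    def wrapper(**params):
        # Derive a stable key from the tool name and sorted parameters
        key = hashlib.sha256(
            (tool_fn.__name__ + json.dumps(params, sort_keys=True)).encode()
        ).hexdigest()
        if key not in _results:
            _results[key] = tool_fn(**params)  # side effect runs only once
        return _results[key]
    return wrapper

calls = []

@idempotent
def refund_order(order_id: str, amount: float):
    calls.append(order_id)  # stands in for the real side effect
    return {"status": "refunded", "order_id": order_id}
```

With this wrapper, an agent that retries the same `refund_order` call gets the original result back rather than issuing a second refund.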
Rule 3: Tool outputs should be concise. Return only what the agent needs for its next decision. Do not return raw database rows with 50 columns when 5 fields suffice.
```python
# Bad: return everything
def lookup_user(query):
    user = db.query(User).filter(...).first()
    return user.to_dict()  # 50 fields, 2000 tokens

# Good: return what matters
def lookup_user(query):
    user = db.query(User).filter(...).first()
    return {
        "user_id": user.id,
        "name": user.name,
        "email": user.email,
        "status": user.status,
        "account_age_days": (now() - user.created_at).days
    }  # 5 fields, 100 tokens
```
Guardrails, Timeouts, and Human-in-the-Loop
Production agents need multiple layers of protection:
Layer 1: Input Validation
```python
class AgentGuardrails:
    MAX_STEPS = 15
    MAX_COST_PER_RUN = 2.00   # dollars
    MAX_DURATION = 120        # seconds
    FORBIDDEN_ACTIONS = ["delete_account", "modify_billing"]

    def check_tool_call(self, tool_name: str, params: dict) -> bool:
        # Block forbidden actions
        if tool_name == "take_action" and params.get("action") in self.FORBIDDEN_ACTIONS:
            return False
        return True

    def check_budget(self, current_cost: float) -> bool:
        return current_cost < self.MAX_COST_PER_RUN

    def check_loop(self, history: list[dict]) -> bool:
        # Detect if the last 3 tool calls are identical
        if len(history) < 3:
            return True
        last_three = history[-3:]
        if all(h["tool"] == last_three[0]["tool"] and
               h["params"] == last_three[0]["params"] for h in last_three):
            return False  # Loop detected
        return True
```
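Wiring these checks into the agent loop looks roughly like the sketch below. The `guardrails` object is assumed to expose the three check methods above; actual tool execution and token-cost accounting are stubbed out.

```python
# Sketch: apply guardrail checks around every tool call, stopping the run
# on the first violation. Cost accounting here is a flat stand-in value.

def run_with_guardrails(guardrails, tool_calls):
    """Execute a sequence of (tool, params) calls, stopping on any violation."""
    history, cost = [], 0.0
    for tool, params in tool_calls:
        if not guardrails.check_tool_call(tool, params):
            return "blocked: forbidden action"
        history.append({"tool": tool, "params": params})
        if not guardrails.check_loop(history):
            return "blocked: loop detected"
        cost += 0.05  # stand-in for real per-call token accounting
        if not guardrails.check_budget(cost):
            return "blocked: budget exceeded"
    return "ok"
```

The key point is that the checks run per step, before and after each tool call, not once at the end of the run.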
Layer 2: Human-in-the-Loop for High-Risk Actions
```python
class HumanInTheLoop:
    HIGH_RISK_ACTIONS = ["refund_order", "cancel_subscription", "escalate_to_human"]

    async def maybe_require_approval(self, tool_name: str, params: dict, context: dict):
        if tool_name == "take_action" and params.get("action") in self.HIGH_RISK_ACTIONS:
            # Pause execution and request human approval
            approval = await self.request_approval(
                action=params["action"],
                context=context,
                timeout=300  # 5 minute timeout
            )
            if not approval.granted:
                return {"status": "blocked", "reason": approval.reason}
        return {"status": "approved"}
```
Layer 3: Output Validation
Before returning agent output to the user, validate it:
```python
def validate_agent_output(output: str, task_type: str) -> tuple[bool, str]:
    # Check for hallucinated confidence
    if "I have completed" in output and not verify_completion(output):
        return False, "Agent claimed completion but task is not verified"
    # Check for leaked internal information
    if contains_internal_data(output):
        return False, "Output contains internal system information"
    # Check for harmful content
    if not content_safety_check(output):
        return False, "Output failed content safety check"
    return True, "OK"
```
A Real Production Agent Architecture
Here is the architecture running in production for an internal engineering support agent:
```
User Request
     |
     v
[Intent Classifier] -- Haiku/Flash (cheap, fast)
     |
     |-- Simple query --> [Direct RAG Pipeline] --> Response
     |
     |-- Complex task --> [Plan-and-Execute Agent]
                               |
                               v
                          [Planner] -- Sonnet (mid-tier)
                               |
                               v
                          [Step Executor] -- Haiku for simple steps
                               |              Sonnet for complex steps
                               v
                          [Guardrail Check] -- per step
                               |
                               v
                          [Result Aggregator] -- Sonnet
                               |
                               v
                          [Output Validator]
                               |
                               v
                          Response to User
```
Key design decisions:
- Intent classification at the entry point saves 70% of costs by routing simple queries away from the expensive agent pipeline
- Model cascading within the agent - different steps use different models based on complexity
- Guardrails at every step, not just at the end
- Timeout at 120 seconds with graceful degradation (“I was not able to complete all steps, here is what I found so far”)
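The entry-point routing decision can be sketched as follows. `classify_intent` stands in for a call to a cheap, fast model; the trivial keyword heuristic here is a placeholder, not the production classifier.

```python
# Sketch: route requests by classified intent so simple queries skip the
# expensive agent pipeline entirely.

def classify_intent(request: str) -> str:
    # Placeholder heuristic; in production this is a small-model call.
    multi_step_markers = ("and then", "first", "migrate", "set up")
    if any(m in request.lower() for m in multi_step_markers):
        return "complex_task"
    return "simple_query"

def route(request: str) -> str:
    if classify_intent(request) == "simple_query":
        return "rag_pipeline"           # direct retrieval, no agent loop
    return "plan_and_execute_agent"     # full agent with guardrails
```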
Metrics That Matter
Track these in production:
| Metric | Target | Why |
|---|---|---|
| Task completion rate | > 85% | Measures if the agent actually finishes tasks |
| Average steps per task | < 8 | More steps means more cost and more failure points |
| Loop detection rate | < 2% | Indicates tool design problems |
| Cost per completed task | < $0.50 | Budget control |
| Human escalation rate | 10-20% | Too low means risky autonomy, too high means the agent is not useful |
| Time to completion | < 30s for 90% of tasks | User patience threshold |
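Deriving these metrics from run logs is straightforward once each run emits a structured record. The record fields below (`completed`, `steps`, `cost`, `escalated`, `loop_detected`) are an assumed logging schema, not a standard one.

```python
# Sketch: compute the production metrics above from a list of run records.

def agent_metrics(runs: list[dict]) -> dict:
    total = len(runs)
    completed = [r for r in runs if r["completed"]]
    return {
        "completion_rate": len(completed) / total,
        "avg_steps": sum(r["steps"] for r in runs) / total,
        "loop_rate": sum(r["loop_detected"] for r in runs) / total,
        "cost_per_completed": (
            sum(r["cost"] for r in completed) / len(completed)
            if completed else float("inf")
        ),
        "escalation_rate": sum(r["escalated"] for r in runs) / total,
    }
```

Note that cost per completed task divides by completed runs only; averaging over all runs hides the money burned on failures.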
The Honest Assessment
AI agents in 2026 work well for structured, bounded tasks with clear success criteria and well-designed tools. They do not work for open-ended, ambiguous tasks that require judgment, creativity, or access to systems the model cannot interact with.
Build agents for the tasks that fit. Use human-in-the-loop for everything else. The companies shipping reliable agents are not the ones with the most sophisticated models - they are the ones with the best guardrails, the simplest tool designs, and the clearest understanding of where the agent should stop and a human should start.