Every AI framework promises agents that can “autonomously complete complex tasks.” The demo shows an agent booking flights, writing code, and sending emails. Then you deploy it and it hallucinates a nonexistent API endpoint, gets stuck in an infinite loop calling the same tool, and racks up $200 in API costs before your rate limiter kicks in.

I have shipped AI agents to production across three different products in the last year. Here is what actually works, what fails, and the patterns that separate a demo from a system your users can rely on.

Why Demo Agents Fail in Production

The gap between a demo agent and a production agent comes down to five failure modes:

1. Hallucinated tool calls. The model invents function names or parameters that do not exist. In a demo, you retry and it works. In production, this corrupts data.

2. Infinite loops. The agent calls the same tool repeatedly because its output does not change the state in a way the model can detect. Cost explodes.

3. Context window overflow. Each tool call adds to the conversation history. After 10-15 steps, the agent loses track of its original goal because earlier context has been pushed out or compressed.

4. Cascading errors. One bad tool call produces bad output, which the agent uses as input for the next call, which produces worse output. Error propagation without correction.

5. Non-determinism in critical paths. The same input produces different tool call sequences on different runs. This makes testing, debugging, and reliability guarantees nearly impossible.
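
Failure mode 1 is also the cheapest to defend against: validate every proposed call against a declared schema before executing anything. A minimal sketch, where the schema registry and its contents are assumptions for illustration (the tool names match the examples later in this post):

```python
# Illustrative schema registry: tool name -> set of allowed parameter names.
TOOL_SCHEMAS = {
    "lookup_user": {"query", "query_type"},
    "get_user_data": {"user_id", "data_type"},
}

def validate_tool_call(name: str, params: dict) -> tuple[bool, str]:
    """Return (ok, reason). Catches invented tool names and invented parameters
    before they reach real code or real data."""
    if name not in TOOL_SCHEMAS:
        return False, f"unknown tool: {name}"
    unknown = set(params) - TOOL_SCHEMAS[name]
    if unknown:
        return False, f"unknown parameters: {sorted(unknown)}"
    return True, "OK"
```

Rejections should go back to the model as observations so it can retry with a legal call, rather than silently corrupting state.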

ReAct vs Plan-and-Execute

Two patterns dominate agent design. The choice between them matters more than the model you pick.

ReAct (Reason + Act)

The agent thinks, acts, observes, and repeats. Each step is independent.

class ReActAgent:
    def run(self, task: str, max_steps: int = 10):
        messages = [{"role": "user", "content": task}]
        
        for _ in range(max_steps):
            response = self.llm.generate(
                messages=messages,
                tools=self.tools,
                system="Think step by step. Use tools when needed. "
                       "Say DONE when the task is complete."
            )
            # Record the assistant turn before acting on it, so the
            # transcript stays consistent on every branch
            messages.append({"role": "assistant", "content": response.content})
            
            if response.stop_reason == "tool_use":
                tool_result = self.execute_tool(response.tool_call)
                messages.append({"role": "user", "content": f"Tool result: {tool_result}"})
            elif "DONE" in response.text:
                return response.text
        
        return "Max steps reached without completion"

Strengths: Simple, flexible, good for exploratory tasks. Weaknesses: No upfront planning, prone to wandering, expensive for multi-step tasks.
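
One implementation detail that directly addresses the cascading-errors failure mode: `execute_tool` should return failures as text observations rather than raising, so the model can see the error and self-correct. A standalone sketch (the registry-as-dict shape is an assumption, not a real API):

```python
def execute_tool(tool_registry: dict, name: str, params: dict) -> str:
    """Run a tool, converting failures into observations the model can react to.
    A bad call becomes a visible ERROR string instead of crashing the run."""
    fn = tool_registry.get(name)
    if fn is None:
        # Hallucinated tool name: tell the model what actually exists
        return f"ERROR: tool '{name}' does not exist. Available: {sorted(tool_registry)}"
    try:
        return str(fn(**params))
    except Exception as e:
        # Bad parameters or a runtime failure: surface it as an observation
        return f"ERROR: {name} failed: {e}"
```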

Plan-and-Execute

The agent creates a plan first, then executes steps sequentially, revising the plan if a step fails.

class PlanAndExecuteAgent:
    def run(self, task: str):
        # Phase 1: Create a plan
        plan = self.llm.generate(
            system="Create a numbered step-by-step plan to accomplish this task. "
                   "Each step should be a single tool call. Be specific.",
            messages=[{"role": "user", "content": task}]
        )
        steps = self.parse_plan(plan.text)
        
        results = []
        i = 0
        # A while loop, not `for i, step in enumerate(steps)`: rebinding
        # `steps` after a replan must actually take effect on the next iteration
        while i < len(steps):
            # Phase 2: Execute each step
            result = self.execute_step(steps[i], previous_results=results)
            results.append(result)
            
            if result.failed:
                # Phase 3: Replan from current state
                revised_plan = self.replan(
                    original_task=task,
                    completed_steps=results,
                    failed_step=steps[i],
                    remaining_steps=steps[i+1:]
                )
                steps = steps[:i+1] + self.parse_plan(revised_plan.text)
            i += 1
        
        return self.summarize(task, results)

Strengths: Predictable execution, easier to debug, better cost control. Weaknesses: Upfront planning adds latency, replanning is expensive, rigid for tasks that require exploration.
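
The `parse_plan` helper above is doing real work. One hedged way to implement it, assuming the planner is prompted (as in the system prompt above) to emit numbered lines like "1. look up the user":

```python
import re

def parse_plan(plan_text: str) -> list[str]:
    """Extract numbered steps ('1. do X' or '2) do Y') from planner output.
    Preamble, blank lines, and trailing notes are ignored."""
    steps = []
    for line in plan_text.splitlines():
        m = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if m:
            steps.append(m.group(1))
    return steps
```

Parsing free-form model output is itself a failure point; an empty result here should trigger a retry of the planning call, not an empty execution loop.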

When to Use Which

Scenario                 | Pattern          | Why
-------------------------|------------------|-------------------------------------------------------------------------
Customer support agent   | ReAct            | Conversations are exploratory; user needs change mid-task
Data pipeline automation | Plan-and-Execute | Steps are well-defined, order matters, failures need structured recovery
Code generation agent    | Plan-and-Execute | Break complex coding tasks into subtasks, execute sequentially
Research/analysis agent  | ReAct            | Needs to follow leads; direction not known upfront
Workflow automation      | Plan-and-Execute | Deterministic steps, auditability required

Tool Design - Less Is More

The single most impactful decision in agent design is your tool set. Every extra tool increases the surface area for hallucinated calls, confuses the model’s tool selection, and makes your system harder to test.

Rules for Production Tool Design

Rule 1: Fewer tools, better descriptions.

# Bad: 15 granular tools
tools = [
    "search_users_by_email",
    "search_users_by_name",
    "search_users_by_id",
    "search_users_by_phone",
    "get_user_orders",
    "get_user_subscriptions",
    # ... 10 more
]

# Good: 3 well-described tools
tools = [
    {
        "name": "lookup_user",
        "description": "Find a user by any identifier: email, name, user ID, or phone number. Returns user profile and account status.",
        "parameters": {
            "query": "The search term (email, name, ID, or phone)",
            "query_type": "One of: email, name, id, phone. If unsure, use 'auto'."
        }
    },
    {
        "name": "get_user_data",
        "description": "Get detailed data for a specific user. Specify what data you need.",
        "parameters": {
            "user_id": "The user's ID (from lookup_user)",
            "data_type": "One of: orders, subscriptions, billing, activity_log"
        }
    },
    {
        "name": "take_action",
        "description": "Perform an action on a user's account.",
        "parameters": {
            "user_id": "The user's ID",
            "action": "One of: refund_order, cancel_subscription, reset_password, escalate_to_human",
            "details": "Additional context for the action (e.g., order ID for refund)"
        }
    }
]

Rule 2: Tools should be idempotent where possible. If the agent calls the same tool twice with the same parameters, it should get the same result without side effects. This makes retries safe.
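
One common way to get idempotency for side-effecting tools is an idempotency key derived from the call itself. A sketch, where the in-memory cache stands in for a persistent store and the `refund_order` body is an illustrative stand-in for the real side effect:

```python
import hashlib
import json

# In production this would be a persistent store (e.g. a database table),
# not a module-level dict
_results: dict[str, dict] = {}

def refund_order(order_id: str, amount: float) -> dict:
    """Idempotent refund: an identical retry returns the cached result
    instead of issuing a second refund."""
    key = hashlib.sha256(
        json.dumps(["refund_order", order_id, amount]).encode()
    ).hexdigest()
    if key in _results:
        return _results[key]  # safe retry: same result, no new side effect
    # Stand-in for the real side effect (payment provider call, etc.)
    result = {"status": "refunded", "order_id": order_id, "amount": amount}
    _results[key] = result
    return result
```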

Rule 3: Tool outputs should be concise. Return only what the agent needs for its next decision. Do not return raw database rows with 50 columns when 5 fields suffice.

# Bad: return everything
def lookup_user(query):
    user = db.query(User).filter(...).first()
    return user.to_dict()  # 50 fields, 2000 tokens

# Good: return what matters
def lookup_user(query):
    user = db.query(User).filter(...).first()
    return {
        "user_id": user.id,
        "name": user.name,
        "email": user.email,
        "status": user.status,
        "account_age_days": (now() - user.created_at).days
    }  # 5 fields, 100 tokens

Guardrails, Timeouts, and Human-in-the-Loop

Production agents need multiple layers of protection:

Layer 1: Execution Guardrails

class AgentGuardrails:
    MAX_STEPS = 15
    MAX_COST_PER_RUN = 2.00  # dollars
    MAX_DURATION = 120  # seconds
    FORBIDDEN_ACTIONS = ["delete_account", "modify_billing"]
    
    def check_tool_call(self, tool_name: str, params: dict) -> bool:
        # Block forbidden actions
        if tool_name == "take_action" and params.get("action") in self.FORBIDDEN_ACTIONS:
            return False
        return True
    
    def check_budget(self, current_cost: float) -> bool:
        return current_cost < self.MAX_COST_PER_RUN
    
    def check_loop(self, history: list[dict]) -> bool:
        # Detect if the last 3 tool calls are identical
        if len(history) < 3:
            return True
        last_three = history[-3:]
        if all(h["tool"] == last_three[0]["tool"] and 
               h["params"] == last_three[0]["params"] for h in last_three):
            return False  # Loop detected
        return True
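
`check_budget` presumes a running cost figure for the current run. A minimal tracker sketch; the per-million-token prices are illustrative, not any provider's real rates:

```python
# Illustrative prices in dollars per million tokens -- substitute your
# provider's actual rates
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

class CostTracker:
    """Accumulates estimated dollar cost across every LLM call in one agent run."""

    def __init__(self):
        self.total = 0.0

    def add_llm_call(self, input_tokens: int, output_tokens: int) -> float:
        self.total += input_tokens / 1e6 * PRICE_PER_MTOK["input"]
        self.total += output_tokens / 1e6 * PRICE_PER_MTOK["output"]
        return self.total
```

Feed `tracker.total` into `check_budget` after every model call, not once at the end; the point is to stop the run before the budget is gone, not to report the overrun afterward.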

Layer 2: Human-in-the-Loop for High-Risk Actions

class HumanInTheLoop:
    HIGH_RISK_ACTIONS = ["refund_order", "cancel_subscription", "escalate_to_human"]
    
    async def maybe_require_approval(self, tool_name: str, params: dict, context: dict):
        if tool_name == "take_action" and params.get("action") in self.HIGH_RISK_ACTIONS:
            # Pause execution and request human approval
            approval = await self.request_approval(
                action=params["action"],
                context=context,
                timeout=300  # 5 minute timeout
            )
            if not approval.granted:
                return {"status": "blocked", "reason": approval.reason}
        
        return {"status": "approved"}

Layer 3: Output Validation

Before returning agent output to the user, validate it:

def validate_agent_output(output: str, task_type: str) -> tuple[bool, str]:
    # Check for hallucinated confidence
    if "I have completed" in output and not verify_completion(output):
        return False, "Agent claimed completion but task is not verified"
    
    # Check for leaked internal information
    if contains_internal_data(output):
        return False, "Output contains internal system information"
    
    # Check for harmful content
    if not content_safety_check(output):
        return False, "Output failed content safety check"
    
    return True, "OK"

A Real Production Agent Architecture

Here is the architecture running in production for an internal engineering support agent:

User Request
    |
    v
[Intent Classifier] -- Haiku/Flash (cheap, fast)
    |
    |-- Simple query --> [Direct RAG Pipeline] --> Response
    |
    |-- Complex task --> [Plan-and-Execute Agent]
                              |
                              v
                         [Planner] -- Sonnet (mid-tier)
                              |
                              v
                         [Step Executor] -- Haiku for simple steps
                              |              Sonnet for complex steps
                              v
                         [Guardrail Check] -- per step
                              |
                              v
                         [Result Aggregator] -- Sonnet
                              |
                              v
                         [Output Validator]
                              |
                              v
                         Response to User

Key design decisions:

  • Intent classification at the entry point saves 70% of costs by routing simple queries away from the expensive agent pipeline
  • Model cascading within the agent - different steps use different models based on complexity
  • Guardrails at every step, not just at the end
  • Timeout at 120 seconds with graceful degradation (“I was not able to complete all steps, here is what I found so far”)
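
The entry-point routing is simple enough to show in full. A sketch, where `classify_intent` stands in for the cheap-model call and the labels are assumptions:

```python
def route(request: str, classify_intent) -> str:
    """Route a request based on a cheap classifier's label.
    Unrecognized labels fail safe to a human, not to the agent."""
    label = classify_intent(request)
    if label == "simple_query":
        return "rag_pipeline"
    if label == "complex_task":
        return "plan_and_execute_agent"
    return "human_escalation"
```

The fail-safe default is the design decision that matters: an intent the classifier cannot name should never reach the autonomous pipeline.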

Metrics That Matter

Track these in production:

Metric                  | Target                 | Why
------------------------|------------------------|---------------------------------------------------------------------
Task completion rate    | > 85%                  | Measures if the agent actually finishes tasks
Average steps per task  | < 8                    | More steps means more cost and more failure points
Loop detection rate     | < 2%                   | Indicates tool design problems
Cost per completed task | < $0.50                | Budget control
Human escalation rate   | 10-20%                 | Too low means risky autonomy; too high means the agent is not useful
Time to completion      | < 30s for 90% of tasks | User patience threshold
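
These all roll up from per-run records. A minimal aggregation sketch; the record shape is an assumption:

```python
def summarize_runs(runs: list[dict]) -> dict:
    """Compute headline agent metrics from per-run records.
    Each record: {"completed": bool, "steps": int, "cost": float, "escalated": bool}."""
    n = len(runs)
    completed = sum(r["completed"] for r in runs)
    return {
        "completion_rate": completed / n,
        "avg_steps": sum(r["steps"] for r in runs) / n,
        # Cost of ALL runs divided by completed runs: failed runs still cost money
        "cost_per_completed": sum(r["cost"] for r in runs) / max(1, completed),
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
    }
```

Note the cost metric: dividing total spend (including failed runs) by completed tasks is what keeps the number honest.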

The Honest Assessment

AI agents in 2026 work well for structured, bounded tasks with clear success criteria and well-designed tools. They do not work for open-ended, ambiguous tasks that require judgment, creativity, or access to systems the model cannot interact with.

Build agents for the tasks that fit. Use human-in-the-loop for everything else. The companies shipping reliable agents are not the ones with the most sophisticated models - they are the ones with the best guardrails, the simplest tool designs, and the clearest understanding of where the agent should stop and a human should start.