Every AI startup in 2025 shipped an “agent” demo. Most of those agents broke in production within the first week. The gap between a compelling demo and a reliable agent system is enormous, and the framework you choose determines how much of that gap you have to bridge yourself.
After building agent systems that handle real workloads - not chat demos, not toy examples, but systems that run unsupervised and process thousands of tasks per day - here is what I have learned about the major frameworks and the patterns that actually work.
Why Most Agent Frameworks Fail in Production
The core problem is that most frameworks optimize for the wrong thing. They optimize for expressiveness - the ability to define complex multi-step workflows with many tools and branching logic. What they should optimize for is reliability - the ability to handle failures, stay within token budgets, and produce consistent results.
An agent that works 95% of the time sounds impressive until you realize that means 1 in 20 requests fails unpredictably. At 1000 requests per day, that is 50 failures. If each failure requires human intervention, you have not built automation - you have built a system that generates support tickets.
The frameworks that work in production share three characteristics:
- Minimal abstraction over the LLM call. Every layer between your code and the model is a layer that can break, introduce unexpected behavior, or make debugging harder.
- Explicit error handling. Not “retry 3 times and hope for the best,” but structured error recovery with fallback strategies.
- Observable execution. You need to see every LLM call, every tool invocation, every decision point - in production, not just during development.
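The observability point can start as something very small: a wrapper that logs every model call as structured JSON before and after it happens. A minimal sketch — the `call_fn` indirection and the log field names are illustrative, not from any particular framework:

```python
import json
import time
import uuid


def observed_call(call_fn, **kwargs):
    """Wrap any LLM call so every request/response pair is logged.

    call_fn is whatever function actually hits the API; this wrapper
    records the inputs, the output, latency, and a correlation id so
    failures in production can be traced to an exact call.
    """
    call_id = str(uuid.uuid4())
    start = time.monotonic()
    print(json.dumps({"event": "llm_request", "id": call_id, "input": kwargs}, default=str))
    try:
        response = call_fn(**kwargs)
        print(json.dumps({
            "event": "llm_response",
            "id": call_id,
            "latency_s": round(time.monotonic() - start, 3),
            "output": response,
        }, default=str))
        return response
    except Exception as exc:
        print(json.dumps({"event": "llm_error", "id": call_id, "error": repr(exc)}))
        raise
```

Emitting one JSON line per event means any log aggregator can index the calls without custom parsing.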
LangChain - The Complexity Problem
LangChain was the first major agent framework and it defined the category. It also defined the most common failure mode: abstraction overload.
```python
# LangChain agent setup - count the abstraction layers
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_openai import ChatOpenAI
from langchain.tools import StructuredTool
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferWindowMemory

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [StructuredTool.from_function(func=my_func, name="...", description="...")]
prompt = ChatPromptTemplate.from_messages([...])
memory = ConversationBufferWindowMemory(k=5, return_messages=True)

agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, memory=memory, verbose=True)
result = executor.invoke({"input": "do the thing"})
```
The problem is not that LangChain is bad software. It is well-maintained and feature-rich. The problem is that it wraps every concept in its own class hierarchy. When something goes wrong - and with LLM agents, something always goes wrong - you are debugging through layers of AgentExecutor, OutputParser, PromptTemplate, Memory, and Callback abstractions.
LangChain’s LCEL (LangChain Expression Language) made this worse by introducing a piping syntax that is clever but makes stack traces nearly unreadable. When your agent fails at 3 AM, you want a stack trace that points to the exact LLM call and the exact tool invocation, not a chain of RunnableSequence and RunnableParallel objects.
When LangChain works: If you need to swap between multiple LLM providers frequently, LangChain’s provider abstractions save time. It also has the largest ecosystem of integrations - vector stores, document loaders, retrievers. For RAG pipelines (not agents), LangChain remains practical.
When it fails: Complex multi-step agents where debugging and reliability matter more than feature breadth.
Claude Agent SDK - Simplicity as a Feature
Anthropic’s Claude Agent SDK takes the opposite approach. It is a thin layer over the Claude API that gives you a tool-use loop with minimal abstraction.
```python
import json

from claude_agent_sdk import Agent, Tool


def search_database(query: str) -> str:
    """Search the product database."""
    results = db.search(query)  # db is your own database handle
    return json.dumps(results)


agent = Agent(
    model="claude-sonnet-4-20250514",
    tools=[Tool(search_database)],
    system="You are a product research assistant. Search the database and summarize findings.",
)

result = agent.run("Find all products with declining sales in Q1 2026")
```
The agent loop is straightforward: Claude receives a message, decides whether to call a tool, receives the tool result, and decides what to do next. There is no memory abstraction, no output parser, no expression language. You manage state yourself, which means you understand your state.
The SDK’s key advantage is that Claude’s tool-use implementation is native to the model. The model was trained to use tools in a specific format, and the SDK sends that exact format. There is no translation layer converting a framework-specific tool definition into a model-specific format.
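The loop the SDK wraps is small enough to write yourself against the raw Messages API. Here is a sketch under that assumption — the `client` is injected so the loop can be exercised with a stub, and `tool_fns` / `tools_schema` are names invented for this example, not SDK API:

```python
def run_agent(client, model, system, tools_schema, tool_fns, user_message, max_turns=10):
    """Minimal tool-use loop over the Anthropic Messages API shapes.

    tools_schema is the JSON tool definition list sent to the model;
    tool_fns maps each tool name to a plain Python function.
    """
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=1024, system=system,
            tools=tools_schema, messages=messages,
        )
        if response.stop_reason != "tool_use":
            # Final answer: concatenate the text blocks and stop.
            return "".join(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn back, then answer every tool_use block.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = tool_fns[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": output,
                })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool-call budget exhausted")
```

Because the loop is a dozen lines of your own code, every message the model sees is visible in your debugger.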
When it works: Single-agent systems where one LLM handles a task with a set of tools. The debugging experience is excellent because you can log the raw API calls and see exactly what the model received and returned.
When it falls short: Multi-agent orchestration. If you need agents that delegate to other agents, you have to build the coordination layer yourself. This is by design - the SDK does not pretend that multi-agent coordination is a solved problem.
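If you do need delegation, the coordination layer can start very small: run single agents in sequence and thread each stage's output into the next prompt. A sketch, where the `(name, run_fn)` shape for agents is hypothetical:

```python
def run_pipeline(agents, task):
    """Run agents in order, feeding each one the previous agent's output.

    agents is an ordered list of (name, run_fn) pairs; run_fn is any
    callable that takes a prompt string and returns a string.
    """
    context = task
    transcript = []
    for name, run_fn in agents:
        context = run_fn(context)
        transcript.append((name, context))  # keep every stage for auditing
    return context, transcript
```

The transcript makes each hand-off inspectable, which is most of what the heavier orchestration frameworks give you for sequential workflows.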
CrewAI - Multi-Agent Orchestration
CrewAI addresses a real gap: orchestrating multiple specialized agents that work together on complex tasks. Instead of one agent doing everything, you define a crew of agents with different roles.
```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Market Researcher",
    goal="Find comprehensive market data on the given topic",
    backstory="You are a senior market analyst with 15 years of experience.",
    tools=[search_tool, web_scraper],
)

writer = Agent(
    role="Report Writer",
    goal="Write a clear, data-driven market report",
    backstory="You are a technical writer who specializes in market analysis.",
    tools=[],
)

research_task = Task(
    description="Research the current state of {topic}",
    agent=researcher,
    expected_output="A detailed research brief with data points and sources",
)

writing_task = Task(
    description="Write a market report based on the research brief",
    agent=writer,
    expected_output="A 2000-word market report",
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task])
result = crew.kickoff(inputs={"topic": "edge computing market"})
```
The multi-agent pattern works well for tasks that naturally decompose into stages: research then write, plan then execute, generate then review. Each agent has a focused role and a limited tool set, which reduces the chance of any single agent going off track.
When it works: Sequential multi-stage workflows where each stage has a clear input and output. Content generation pipelines, research workflows, and code review systems are good fits.
When it fails: The “backstory” and “role” prompting strategy is fragile. Small changes to agent descriptions can cause large changes in behavior. There is also significant token overhead - each agent needs its system prompt, its tools, and the context from previous agents. A three-agent crew can easily consume 50K+ tokens per run.
Framework Comparison
| Feature | LangChain | Claude Agent SDK | CrewAI |
|---|---|---|---|
| Learning curve | Steep | Low | Moderate |
| Abstraction level | Heavy | Minimal | Moderate |
| Multi-agent support | Via LangGraph | Build your own | Native |
| LLM provider flexibility | Many providers | Claude only | Multiple (via LiteLLM) |
| Debugging experience | Difficult | Excellent | Moderate |
| Production observability | LangSmith (paid) | Raw API logs | Built-in logging |
| Token efficiency | Low (large prompts) | High | Low (role prompts) |
| Community/ecosystem | Largest | Growing | Active |
Patterns That Actually Work in Production
Regardless of framework, these patterns separate reliable agent systems from demo-ware:
Pattern 1: Constrained tool sets. Give agents the minimum set of tools they need. Every additional tool increases the probability that the model calls the wrong one. Five focused tools beat twenty general ones.
Pattern 2: Structured outputs over free-form. Use JSON schemas or typed responses for tool inputs and agent outputs. Parse and validate every response. Never trust that the model returned valid data.
```python
import json

from pydantic import ValidationError

# Always validate tool outputs
try:
    result = json.loads(agent_response)
    validated = OutputSchema(**result)  # Pydantic validation
except (json.JSONDecodeError, ValidationError) as e:
    # Structured fallback, not just retry
    result = fallback_handler(e, agent_response)
```
Pattern 3: Budget enforcement. Set hard limits on tokens, tool calls, and wall-clock time per agent run. An agent that enters an infinite loop of tool calls will drain your API budget. Cap it at 10-20 tool calls per run and fail explicitly if the cap is hit.
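Hard budget enforcement can be a small object you charge before every call. A sketch — the class name and default caps here are illustrative:

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits a hard cap; fail explicitly, never silently."""


class RunBudget:
    """Hard caps for a single agent run; call charge() before/after each step."""

    def __init__(self, max_tool_calls=15, max_tokens=100_000, max_seconds=120):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.tool_calls = 0
        self.tokens = 0
        self.start = time.monotonic()

    def charge(self, tool_calls=0, tokens=0):
        self.tool_calls += tool_calls
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call cap hit ({self.max_tool_calls})")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap hit ({self.max_tokens})")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded(f"time cap hit ({self.max_seconds}s)")
```

A distinct exception type lets the caller separate "over budget" from genuine tool failures when deciding whether to retry or escalate.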
Pattern 4: Deterministic checkpoints. After each major step, save the intermediate state. If step 3 of 5 fails, you should be able to retry from step 3, not from the beginning. This matters enormously when each run costs $0.50 or more in API calls.
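A minimal checkpointing sketch, assuming each step is a named function over a JSON-serializable state dict:

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, run_id, state=None, checkpoint_dir="checkpoints"):
    """Run named steps in order, saving state to disk after each one.

    On retry with the same run_id, completed steps are skipped by
    reloading their checkpoint instead of re-running (and re-paying for)
    the LLM calls inside them.
    """
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(exist_ok=True)
    state = state or {}
    for i, (name, step_fn) in enumerate(steps):
        marker = ckpt / f"{run_id}.{i}.{name}.json"
        if marker.exists():
            state = json.loads(marker.read_text())  # already done: reload and skip
            continue
        state = step_fn(state)
        marker.write_text(json.dumps(state))
    return state
```

Keying checkpoints by run id and step index means a crash at step 3 of 5 resumes exactly there on the next invocation.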
Pattern 5: Human-in-the-loop escape hatches. Build a mechanism for the agent to say “I am not confident about this, routing to a human.” Agents that silently produce wrong answers are worse than agents that ask for help.
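One way to build the escape hatch is to expose escalation as an ordinary tool, so opting out is just another tool call the model can make. Everything here (the queue, the ticket shape) is a placeholder for your real review system:

```python
REVIEW_QUEUE: list = []  # stand-in for a real ticketing/review queue


def escalate_to_human(reason: str, context: str) -> str:
    """Queue the task for human review and return a message the model sees.

    Registered as a tool alongside the real ones, this gives the model a
    legitimate action to take when it is not confident in an answer.
    """
    REVIEW_QUEUE.append({
        "reason": reason,
        "context": context,
        "status": "pending_review",
    })
    return "Escalated to a human reviewer - stop work and report the escalation."
```

Mentioning the escalation tool in the system prompt, with explicit criteria for when to use it, matters as much as the tool itself.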
The Honest Assessment
No agent framework solves the hard problem: LLMs are probabilistic systems, and building deterministic workflows on probabilistic foundations requires engineering discipline that no framework can substitute for.
The best framework is the one that gets out of your way. If you are building a single-agent system, the Claude Agent SDK or raw API calls with a simple loop will serve you better than any heavy framework. If you need multi-agent orchestration, CrewAI provides useful structure but expect to customize it heavily. If you need RAG with many integrations, LangChain’s ecosystem is unmatched.
Pick based on your actual requirements, not on which framework has the most GitHub stars. Build the simplest thing that works, instrument it heavily, and iterate based on production data. The agent framework landscape will look different in six months. Your debugging skills and operational practices will not expire.