Common Agent Failure Modes
What goes wrong with agents
AI agents can fail in predictable ways. Learning to spot these failures helps you build better agents.
Failure 1: Wrong Tool Selected
The agent picks the wrong tool for the job.
What it looks like: Your agent is asked to write a blog post. Instead of calling a writing tool, it calls a web search tool.
Why it happens: The tool descriptions are vague, the tool names are confusing, or the model misunderstood what each tool does.
How to detect it: Look at the tool call logs. Does the called tool match the task? Ask yourself: would a human pick this tool for this job? If no, the agent picked wrong.
Code example:
# Bad: unclear tool descriptions
tools = [
    {"name": "search", "description": "Get info"},
    {"name": "write", "description": "Output text"},
]

# Good: clear, specific descriptions
tools = [
    {"name": "web_search", "description": "Search the web for current facts and citations"},
    {"name": "write_blog_post", "description": "Write a full blog post (500+ words) with structure and tone"},
]
Failure 2: Infinite Loop
The agent keeps calling the same tool over and over. It never moves forward.
What it looks like: Your agent is asked to find a phone number. It calls the search tool five times in a row, each time searching for the same thing. The conversation never ends.
Why it happens: The tool returned an error or unclear result. The model did not understand the error. It tries the same tool again, hoping for a different result.
How to detect it: Count the tool calls. If the same tool is called 3+ times with the same or similar input, your agent is looping. Track the iteration count in your logs.
Code example:
# Detection: count consecutive calls to the same tool
loop_count = 0
previous_tool = None
for iteration in range(max_iterations):
    tool_call = model.plan()
    if previous_tool is not None and tool_call["name"] == previous_tool:
        loop_count += 1
    else:
        loop_count = 0
    if loop_count >= 3:
        print("WARNING: Agent is looping")
        break
    previous_tool = tool_call["name"]
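Detection alone does not break the loop. One fix is to refuse a repeated call and feed an explicit message back to the model instead, so it gets a different result than the one it keeps retrying. A minimal sketch, assuming tool calls are dicts with a name and arguments:

```python
import json

def guard_repeat(tool_call, seen, max_repeats=3):
    """Return feedback for the model instead of executing a repeated call.

    `seen` maps a (name, serialized-arguments) key to how often it was tried.
    Returns None when the call is OK to execute.
    """
    key = (tool_call["name"],
           json.dumps(tool_call.get("arguments", {}), sort_keys=True))
    seen[key] = seen.get(key, 0) + 1
    if seen[key] > max_repeats:
        return {"error": f"{tool_call['name']} was already tried {max_repeats} "
                         "times with these arguments; choose a different approach"}
    return None

seen = {}
call = {"name": "search", "arguments": {"query": "phone number"}}
for _ in range(4):
    feedback = guard_repeat(call, seen)
# The fourth identical attempt returns feedback instead of None.
print(feedback)
```

Appending that error message to the conversation gives the model a new observation to react to, which is usually enough to push it off the repeated call.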
Failure 3: Hallucinated Tool Name or Parameters
The agent calls a tool that does not exist, or uses parameters that do not exist.
What it looks like:
Your agent calls send_email_v3() but the real tool is send_email(). Or it passes recipient_email but the parameter is to_address.
Why it happens: The model guessed at a tool name or parameter. Tool definitions were not passed clearly to the model. The model remembered a tool from training data that is not in your tool set.
How to detect it: Check tool call logs. Does the tool exist? Do the parameters match the tool definition? If no, the agent hallucinated.
Code example:
# Validate the tool name and parameters before calling the tool.
# Assumes each registry entry stores its parameter names and a callable.
def call_tool(tool_name, params):
    if tool_name not in available_tools:
        print(f"ERROR: Tool {tool_name} does not exist")
        return {"error": f"Unknown tool: {tool_name}"}
    tool = available_tools[tool_name]
    for param in params:
        if param not in tool["parameters"]:
            print(f"ERROR: Parameter {param} is not valid for {tool_name}")
            return {"error": f"Invalid parameter: {param}"}
    return tool["function"](**params)
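When validation fails, you can go one step further and tell the model which real tool its hallucinated name most resembles, using the standard library's difflib. A sketch (the tool names below are illustrative):

```python
import difflib

AVAILABLE = ["send_email", "web_search", "write_blog_post"]

def suggest_tool(bad_name):
    """Return the closest real tool name, or None if nothing is similar."""
    matches = difflib.get_close_matches(bad_name, AVAILABLE, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(suggest_tool("send_email_v3"))  # send_email
```

Including the suggestion in the error result ("Unknown tool: send_email_v3. Did you mean send_email?") lets the model self-correct on the next step instead of guessing again.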
Failure 4: Cost Blowup
The agent makes so many tool calls or API requests that the cost explodes.
What it looks like: You run your agent once and rack up $50 in API charges. The agent called your search tool 1000 times in one session.
Why it happens: No limits on tool calls. No tracking of cumulative cost. The agent loops (see Failure 2). Each iteration costs money.
How to detect it: Track the cost of every tool call. Sum the cost per session. Set a budget and check against it.
Code example:
# Track cost per session
total_cost = 0.0
max_cost = 5.00  # $5 per session
for iteration in range(max_iterations):
    tool_call = model.plan()
    result, cost = call_tool(tool_call)
    total_cost += cost
    if total_cost > max_cost:
        print(f"Cost limit exceeded: ${total_cost:.2f}")
        break
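If your tools have known per-call prices, the cost lookup itself can be a simple table. A sketch with made-up prices, just to show the shape:

```python
# Hypothetical per-call prices in dollars.
TOOL_COST = {"web_search": 0.01, "write_blog_post": 0.05}

def session_cost(calls):
    """Sum the cost of a list of tool-call names; unknown tools cost 0."""
    return sum(TOOL_COST.get(name, 0.0) for name in calls)

# 1000 searches at $0.01 each is about $10 -- well over a $5 budget.
print(session_cost(["web_search"] * 1000))
```

Even rough per-tool estimates catch the blowup case: a session that should cost cents but sums to tens of dollars is almost always a loop.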
Failure 5: Context Window Overflow
The conversation gets too long and no longer fits in the model's context window.
What it looks like: Your agent runs for 50 steps. By the time it gets to step 40, the API returns a context-length error or the model starts giving nonsensical responses.
Why it happens: Each tool call and result gets added to the conversation history. After many iterations, the history is too long for the model.
How to detect it: Track the number of tokens in your context. Compare against the model's max context window. If you are above 70-80% capacity, you are near the limit.
Code example:
# Monitor context usage
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4")
context_tokens = len(enc.encode(full_conversation))
max_tokens = 128000  # GPT-4 Turbo context window
if context_tokens > max_tokens * 0.8:
    print(f"WARNING: Context window {context_tokens / max_tokens * 100:.1f}% full")
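Once you are near the limit, the usual fix is to trim the oldest tool results while keeping the system prompt. A minimal sketch, assuming messages are plain dicts with a role and content (real message formats vary by API):

```python
def trim_history(messages, max_messages=20):
    """Keep the system prompt plus only the most recent messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-(max_messages - len(system)):]

history = [{"role": "system", "content": "You are an agent."}]
history += [{"role": "tool", "content": f"result {i}"} for i in range(50)]
trimmed = trim_history(history)
print(len(trimmed))  # 20
```

Trimming by message count is crude; a production agent would trim by token count, or summarize old tool results instead of dropping them, but the principle is the same: never let history grow without bound.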
Summary
The five main failures are: wrong tool, infinite loop, hallucinated tools, cost blowup, and context overflow. Each has signs you can watch for and fixes you can apply. Check your logs, understand the errors, and build guardrails to prevent these failures.