Common Agent Failure Modes

What goes wrong with agents

AI agents can fail in predictable ways. Learning to spot these failures helps you build better agents.

Failure 1: Wrong Tool Selected

The agent picks the wrong tool for the job.

What it looks like: Your agent is asked to write a blog post. Instead of calling a writing tool, it calls a web search tool.

Why it happens: The tool descriptions are vague, the tool names are ambiguous, or the model misread what each tool does.

How to detect it: Look at the tool call logs. Does the called tool match the task? Ask yourself: would a human pick this tool for this job? If not, the agent picked the wrong tool.

Code example:

# Bad: unclear tool description
tools = [
  {"name": "search", "description": "Get info"},
  {"name": "write", "description": "Output text"}
]

# Good: clear, specific descriptions
tools = [
  {"name": "web_search", "description": "Search the web for current facts and citations"},
  {"name": "write_blog_post", "description": "Write a full blog post (500+ words) with structure and tone"}
]

Failure 2: Infinite Loop

The agent keeps calling the same tool over and over. It never moves forward.

What it looks like: Your agent is asked to find a phone number. It calls the search tool five times in a row, each time searching for the same thing. The conversation never ends.

Why it happens: The tool returned an error or unclear result. The model did not understand the error. It tries the same tool again, hoping for a different result.

How to detect it: Count the tool calls. If the same tool is called 3+ times with the same or similar input, your agent is looping. Track the iteration count in your logs.

Code example:

# Detection: count consecutive repeated tool calls
loop_count = 0
previous_tool = None
for iteration in range(max_iterations):
  tool_call = model.plan()
  if previous_tool is not None and tool_call["name"] == previous_tool["name"]:
    loop_count += 1
  else:
    loop_count = 0
  previous_tool = tool_call
  if loop_count >= 3:
    print("WARNING: Agent is looping")
    break
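Counting only the tool name can miss loops that alternate between tools, and can false-alarm when the same tool is legitimately called with different inputs. A stricter guard keys on the exact call. A sketch, assuming tool calls are dicts with "name" and "args":

```python
import json

def make_repeat_guard(limit=3):
  """Return a function that allows a call until the exact same call repeats too often."""
  seen = {}
  def allow(tool_call):
    # Key on name plus canonicalized arguments, so only true repeats count
    key = (tool_call["name"], json.dumps(tool_call.get("args", {}), sort_keys=True))
    seen[key] = seen.get(key, 0) + 1
    return seen[key] <= limit
  return allow
```

When `allow` returns False, stop the loop and surface the last tool result to a human instead of retrying.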

Failure 3: Hallucinated Tool Name or Parameters

The agent calls a tool that does not exist, or uses parameters that do not exist.

What it looks like: Your agent calls send_email_v3() but the real tool is send_email(). Or it passes recipient_email but the parameter is to_address.

Why it happens: The model guessed at a tool name or parameter. Tool definitions were not passed clearly to the model. The model remembered a tool from training data that is not in your tool set.

How to detect it: Check tool call logs. Does the tool exist? Do the parameters match the tool definition? If not, the agent hallucinated.

Code example:

# Add validation before calling the tool
def call_tool(tool_name, params):
  if tool_name not in available_tools:
    print(f"ERROR: Tool {tool_name} does not exist")
    return {"error": f"Unknown tool: {tool_name}"}
  
  tool = available_tools[tool_name]
  for param in params:
    if param not in tool["parameters"]:
      print(f"ERROR: Parameter {param} not valid for {tool_name}")
      return {"error": f"Invalid parameter: {param}"}
  
  return tool["function"](**params)  # registry entries hold the callable under "function"
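To see the wrapper's error paths end to end, here is a self-contained variant with a tiny hypothetical registry (the `send_email` entry and its parameter names are invented for illustration):

```python
# Tiny hypothetical registry: each entry lists valid parameters and the callable.
available_tools = {
  "send_email": {
    "parameters": {"to_address", "subject", "body"},
    "function": lambda **kw: {"status": "sent"},
  }
}

def call_tool(tool_name, params):
  if tool_name not in available_tools:
    return {"error": f"Unknown tool: {tool_name}"}
  tool = available_tools[tool_name]
  for param in params:
    if param not in tool["parameters"]:
      return {"error": f"Invalid parameter: {param}"}
  return tool["function"](**params)

# A hallucinated tool name and a hallucinated parameter both get caught:
print(call_tool("send_email_v3", {}))                         # unknown tool
print(call_tool("send_email", {"recipient_email": "a@b.co"}))  # invalid parameter
```

Returning the error as a tool result, rather than raising, lets the model read it and retry with the correct name or parameter.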

Failure 4: Cost Blowup

The agent makes so many tool calls or API requests that the cost explodes.

What it looks like: You run your agent once and rack up $50 in API charges. The agent called your search tool 1000 times in one session.

Why it happens: No limits on tool calls. No tracking of cumulative cost. The agent loops (see Failure 2). Each iteration costs money.

How to detect it: Track the cost of every tool call. Sum the cost per session. Set a budget and check against it.

Code example:

# Track cost per session
total_cost = 0
max_cost = 5.00  # $5 per session

for iteration in range(max_iterations):
  tool_call = model.plan()
  result, cost = call_tool(tool_call)
  total_cost += cost
  
  if total_cost > max_cost:
    print(f"Cost limit exceeded: ${total_cost:.2f}")
    break
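A per-session total tells you the budget was blown but not which tool blew it. A small extension of the same idea, assuming each call reports its cost (the tool names and prices here are made up):

```python
from collections import defaultdict

# Accumulate spend per tool so the breakdown points at the culprit
costs_by_tool = defaultdict(float)

def record_cost(tool_name, cost):
  costs_by_tool[tool_name] += cost

record_cost("web_search", 0.002)
record_cost("web_search", 0.002)
record_cost("write_blog_post", 0.030)

# Most expensive tools first
breakdown = sorted(costs_by_tool.items(), key=lambda kv: -kv[1])
```

Log the breakdown at the end of every session; a cheap tool called thousands of times will show up here long before the invoice does.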

Failure 5: Context Window Overflow

The conversation gets too long. The history no longer fits in the model's context window.

What it looks like: Your agent runs for 50 steps. Around step 40, API requests start failing with context-length errors, or the model gives nonsensical responses because earlier messages were truncated.

Why it happens: Each tool call and result gets added to the conversation history. After many iterations, the history is too long for the model.

How to detect it: Track the number of tokens in your context. Compare against the model's max context window. If you are above 70-80% capacity, you are near the limit.

Code example:

# Monitor context usage
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4")
context_tokens = len(enc.encode(full_conversation))
max_tokens = 128000  # your model's context window (128k for GPT-4 Turbo; base GPT-4 is 8k or 32k)

if context_tokens > max_tokens * 0.8:
  print(f"WARNING: Context window {context_tokens / max_tokens * 100:.1f}% full")
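Detection alone only warns you; the usual mitigation is to trim or summarize old history. A minimal sketch that drops the oldest turns while keeping the system prompt (production agents often summarize the dropped turns instead of discarding them outright):

```python
def trim_history(messages, max_messages=20):
  """Keep the first message (assumed to be the system prompt) and the most recent turns."""
  if len(messages) <= max_messages:
    return messages
  return [messages[0]] + messages[-(max_messages - 1):]
```

Call this before each model request so the context stays bounded no matter how many steps the agent runs.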

Summary

The five main failures are: wrong tool, infinite loop, hallucinated tools, cost blowup, and context overflow. Each has signs you can watch for and fixes you can apply. Check your logs, understand the errors, and build guardrails to prevent these failures.
