Systematic Debugging for Agents

How to debug an agent

When your agent fails, follow this step-by-step process to find and fix the problem.

Step 1: Log Every Tool Call

Before debugging anything, you need a record of what happened.

What to log:

  • Timestamp of the tool call
  • Tool name
  • Input parameters
  • Output or error from the tool
  • The model's response to the output

Code example:

import json
import logging
from datetime import datetime

logger = logging.getLogger("agent_debug")

def call_tool_with_logging(tool_name, params):
    # available_tools maps tool names to callables (defined elsewhere in your agent).
    timestamp = datetime.now().isoformat()
    logger.info(f"TOOL CALL | {timestamp} | {tool_name} | {json.dumps(params)}")

    try:
        result = available_tools[tool_name](params)
        logger.info(f"TOOL RESULT | {timestamp} | {tool_name} | {json.dumps(result)}")
        return result
    except Exception as e:
        logger.error(f"TOOL ERROR | {timestamp} | {tool_name} | {e}")
        raise

Step 2: Track Iteration Count and Token Usage

Know how many steps your agent took and how many tokens it used.

What to track:

  • Total iterations (how many tool calls)
  • Tokens per step (input + output tokens)
  • Tokens cumulative (total for the session)
  • Cost per step and total cost

Code example:

import json

from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4")
stats = {
    "iteration_count": 0,
    "tokens_per_step": [],
    "total_tokens": 0,
    "cost_per_step": [],
    "total_cost": 0,
}

def run_agent_with_stats():
    for iteration in range(max_iterations):
        stats["iteration_count"] += 1

        response = model.generate(conversation)
        # enc.encode expects a string, so serialize the conversation first.
        # This is an approximation; prefer the token counts your API returns.
        input_tokens = len(enc.encode(json.dumps(conversation)))
        output_tokens = len(enc.encode(response))

        step_tokens = input_tokens + output_tokens
        # Example pricing: $0.01 per 1K input tokens, $0.03 per 1K output tokens.
        step_cost = (input_tokens * 0.01 + output_tokens * 0.03) / 1000

        stats["tokens_per_step"].append(step_tokens)
        stats["total_tokens"] += step_tokens
        stats["cost_per_step"].append(step_cost)
        stats["total_cost"] += step_cost

        print(f"Iteration {stats['iteration_count']}: {step_tokens} tokens, ${step_cost:.4f}")

Step 3: Replay the Conversation

Go through the logs step by step. Look for where things went wrong.

Questions to ask:

  1. Did the model pick the right tool?
  2. Did it pass the right parameters?
  3. Did the tool return what the model expected?
  4. Did the model misinterpret the result?
  5. What did the model do next?

Example replay:

Step 1: User asks "Find the price of Tesla stock"
Agent calls: search(query="Tesla stock price")
Result: Error 400 - query parameter should be 'q', not 'query'

Problem found: Wrong parameter name.

Step 2: Agent calls: search(q="Tesla stock price")
Result: {"results": [{"title": "...", "snippet": "..."}]}
Agent interprets: "I found Tesla stock price data"
Agent stops.

Problem: Agent should call a parsing tool or extract the actual number.
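
The replay itself can be scripted. A minimal sketch, assuming the `TOOL CALL` / `TOOL RESULT` / `TOOL ERROR` log format from Step 1 (the `sample` lines below are hypothetical data):

```python
import json

def replay(log_lines):
    """Group Step 1 log lines into numbered steps for manual review."""
    steps = []
    for line in log_lines:
        kind, _ts, tool, payload = [p.strip() for p in line.split(" | ", 3)]
        if kind == "TOOL CALL":
            steps.append({"tool": tool, "params": json.loads(payload)})
        elif kind == "TOOL RESULT":
            steps[-1]["result"] = payload
        elif kind == "TOOL ERROR":
            steps[-1]["error"] = payload
    return steps

sample = [
    'TOOL CALL | 2024-01-01T00:00:00 | search | {"query": "Tesla stock price"}',
    'TOOL ERROR | 2024-01-01T00:00:01 | search | 400: unknown parameter "query"',
]
for i, step in enumerate(replay(sample), 1):
    marker = "!!" if "error" in step else "ok"
    print(f"[{marker}] step {i}: {step['tool']} {step['params']}")
```

Walking the steps this way makes the failed call jump out before you read any model output.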

Step 4: Test Tools in Isolation

Do not assume your tools work as expected. Test them.

Test plan:

  • Call each tool with valid input. Does it return the right shape?
  • Call each tool with edge cases (empty input, very long input, special characters).
  • Call each tool with invalid input. Does it error gracefully?

Code example:

def test_tool(tool_name):
    tool = available_tools[tool_name]

    # Test valid input
    result = tool({"query": "example"})
    assert "results" in result or "data" in result, f"{tool_name} returned unexpected shape"

    # Test edge case
    result = tool({"query": ""})
    assert result is not None, f"{tool_name} crashed on empty input"

    # Test invalid input
    try:
        tool({"invalid_param": "value"})
        print(f"WARNING: {tool_name} accepted invalid parameter")
    except Exception:
        print(f"OK: {tool_name} rejected invalid parameter")

for tool_name in available_tools:
    test_tool(tool_name)

Tools for Debugging

LangSmith (https://smith.langchain.com)

  • Visual trace of every agent step
  • See the exact prompt sent to the model
  • See the exact model output
  • Compare runs side by side

Braintrust (https://www.braintrust.dev)

  • Test and evaluate agent behavior
  • Run experiments to improve your agent
  • Compare before and after

Simple logging

  • Print statements to a log file
  • JSON logs you can parse later
  • Minimal setup, good for quick debugging

Build an Agent Trace Viewer

A trace viewer shows you the agent's thought process visually.

What to show:

  1. Iteration number
  2. Tool called
  3. Input to the tool
  4. Output from the tool
  5. The model's interpretation
  6. Cost and tokens for this step

Simple HTML template:

<table>
  <tr>
    <th>Step</th>
    <th>Tool</th>
    <th>Input</th>
    <th>Output</th>
    <th>Tokens</th>
    <th>Cost</th>
  </tr>
  <!-- Loop through each step -->
  <tr>
    <td>1</td>
    <td>search</td>
    <td>{"q": "weather today"}</td>
    <td>{"results": [...]}</td>
    <td>150</td>
    <td>$0.01</td>
  </tr>
</table>
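
Generating that table from your logs is a few lines of Python. A sketch, assuming each logged step is a dict with `tool`, `input`, `output`, `tokens`, and `cost` keys (the exact shape is up to you):

```python
import html
import json

HEADER = "  <tr><th>Step</th><th>Tool</th><th>Input</th><th>Output</th><th>Tokens</th><th>Cost</th></tr>"
ROW = "  <tr><td>{n}</td><td>{tool}</td><td>{inp}</td><td>{out}</td><td>{tokens}</td><td>${cost:.2f}</td></tr>"

def render_trace(steps):
    """Render logged steps as an HTML table, escaping JSON payloads."""
    rows = [
        ROW.format(
            n=i,
            tool=html.escape(s["tool"]),
            inp=html.escape(json.dumps(s["input"])),
            out=html.escape(json.dumps(s["output"])),
            tokens=s["tokens"],
            cost=s["cost"],
        )
        for i, s in enumerate(steps, 1)
    ]
    return "<table>\n" + HEADER + "\n" + "\n".join(rows) + "\n</table>"

steps = [{"tool": "search", "input": {"q": "weather today"},
          "output": {"results": ["..."]}, "tokens": 150, "cost": 0.01}]
print(render_trace(steps))
```

Write the output to a file and open it in a browser; no server or framework needed.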

Debugging Checklist

When your agent fails:

  1. Check the logs. What was the last tool call?
  2. Did the tool return an error? What was the error message?
  3. Did the model understand the error? Or did it ignore it?
  4. Is the agent looping? Same tool called multiple times?
  5. Are you over budget or over context?
  6. Test the failing tool in isolation. Does it work?
  7. Update the tool description or error message.
  8. Run the agent again. Did it fix the problem?
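
Item 4 (loop detection) is easy to automate. A minimal sketch: flag any tool call repeated with identical parameters inside a recent window (the window and threshold defaults are arbitrary):

```python
import json
from collections import Counter

def detect_loops(calls, window=6, threshold=3):
    """calls: list of (tool_name, params) tuples taken from the logs."""
    recent = calls[-window:]
    # Serialize params with sorted keys so identical dicts compare equal.
    counts = Counter((name, json.dumps(params, sort_keys=True))
                     for name, params in recent)
    return [(name, count) for (name, _params), count in counts.items()
            if count >= threshold]

calls = [("search", {"q": "tesla"}), ("search", {"q": "tesla"}),
         ("search", {"q": "tesla"}), ("fetch", {"url": "a"})]
print(detect_loops(calls))  # prints [('search', 3)]
```

Run this over the trace before re-reading every step; a repeated identical call is usually the fastest signal that the agent is stuck.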

This process finds and fixes most agent issues.
