Systematic Debugging for Agents
How to debug an agent
When your agent fails, follow this step-by-step process to find and fix the problem.
Step 1: Log Every Tool Call
Before debugging anything, you need a record of what happened.
What to log:
- Timestamp of the tool call
- Tool name
- Input parameters
- Output or error from the tool
- The model's response to the output
Code example:
import json
import logging
from datetime import datetime

logger = logging.getLogger("agent_debug")

def call_tool_with_logging(tool_name, params):
    # available_tools maps tool names to callables, e.g. {"search": search}
    timestamp = datetime.now().isoformat()
    logger.info(f"TOOL CALL | {timestamp} | {tool_name} | {json.dumps(params)}")
    try:
        result = available_tools[tool_name](params)
        logger.info(f"TOOL RESULT | {timestamp} | {tool_name} | {json.dumps(result)}")
        return result
    except Exception as e:
        logger.error(f"TOOL ERROR | {timestamp} | {tool_name} | {e}")
        raise
Step 2: Track Iteration Count and Token Usage
Know how many steps your agent took and how many tokens it used.
What to track:
- Total iterations (how many tool calls)
- Tokens per step (input + output tokens)
- Tokens cumulative (total for the session)
- Cost per step and total cost
Code example:
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4")

stats = {
    "iteration_count": 0,
    "tokens_per_step": [],
    "total_tokens": 0,
    "cost_per_step": [],
    "total_cost": 0,
}

def run_agent_with_stats(model, conversation, max_iterations=10):
    for iteration in range(max_iterations):
        stats["iteration_count"] += 1
        response = model.generate(conversation)
        # tiktoken encodes strings; if conversation is a list of messages,
        # join their contents into one string before counting
        input_tokens = len(enc.encode(conversation))
        output_tokens = len(enc.encode(response))
        step_tokens = input_tokens + output_tokens
        # Illustrative rates: $0.01 per 1K input tokens, $0.03 per 1K output tokens
        step_cost = (input_tokens * 0.01 + output_tokens * 0.03) / 1000
        stats["tokens_per_step"].append(step_tokens)
        stats["total_tokens"] += step_tokens
        stats["cost_per_step"].append(step_cost)
        stats["total_cost"] += step_cost
        print(f"Iteration {stats['iteration_count']}: {step_tokens} tokens, ${step_cost:.4f}")
Step 3: Replay the Conversation
Go through the logs step by step. Look for where things went wrong.
Questions to ask:
- Did the model pick the right tool?
- Did it pass the right parameters?
- Did the tool return what the model expected?
- Did the model misinterpret the result?
- What did the model do next?
Example replay:
Step 1: User asks "Find the price of Tesla stock"
Agent calls: search(query="Tesla stock price")
Result: Error 400 - query parameter should be 'q', not 'query'
Problem found: Wrong parameter name.
Step 2: Agent calls: search(q="Tesla stock price")
Result: {"results": [{"title": "...", "snippet": "..."}]}
Agent interprets: "I found Tesla stock price data"
Agent stops.
Problem: Agent should call a parsing tool or extract the actual number.
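This replay can be mechanized. Below is a minimal sketch that parses the pipe-delimited records produced by the Step 1 logger; parse_log_line and replay are illustrative names, and the assumed record format is exactly "TOOL CALL | timestamp | tool | payload".

```python
def parse_log_line(line):
    """Parse one pipe-delimited record from the Step 1 log.

    Returns (kind, tool_name, payload) where kind is "CALL", "RESULT",
    or "ERROR", or None for lines that are not tool-call records.
    """
    parts = [p.strip() for p in line.strip().split(" | ")]
    if len(parts) < 4:
        return None
    header, _timestamp, tool_name = parts[0], parts[1], parts[2]
    payload = " | ".join(parts[3:])
    for kind in ("CALL", "RESULT", "ERROR"):
        # endswith() tolerates any "LEVEL:logger:" prefix the formatter adds
        if header.endswith(f"TOOL {kind}"):
            return kind, tool_name, payload
    return None

def replay(log_path):
    """Print each step so you can spot where the run went wrong."""
    step = 0
    with open(log_path) as f:
        for line in f:
            event = parse_log_line(line)
            if event is None:
                continue
            kind, tool_name, payload = event
            if kind == "CALL":
                step += 1
                print(f"Step {step}: {tool_name}({payload})")
            elif kind == "RESULT":
                print(f"  -> {payload}")
            else:
                print(f"  !! ERROR: {payload}")
```

Running replay against the log from the Tesla example would print each call, its result, and the parameter-name error, making the failing step obvious at a glance.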
Step 4: Test Tools in Isolation
Do not assume your tools work as expected. Test them.
Test plan:
- Call each tool with valid input. Does it return the right shape?
- Call each tool with edge cases (empty input, very long input, special characters).
- Call each tool with invalid input. Does it error gracefully?
Code example:
def test_tool(tool_name):
    tool = available_tools[tool_name]

    # Valid input: should return the expected shape
    result = tool({"query": "example"})
    assert "results" in result or "data" in result, f"{tool_name} returned unexpected shape"

    # Edge case: empty input should not crash
    result = tool({"query": ""})
    assert result is not None, f"{tool_name} crashed on empty input"

    # Invalid input: should fail loudly, not silently accept it
    try:
        tool({"invalid_param": "value"})
        print(f"WARNING: {tool_name} accepted invalid parameter")
    except Exception:
        print(f"OK: {tool_name} rejected invalid parameter")

for tool_name in available_tools:
    test_tool(tool_name)
Tools for Debugging
LangSmith (https://smith.langchain.com)
- Visual trace of every agent step
- See the exact prompt sent to the model
- See the exact model output
- Compare runs side by side
Braintrust (https://www.braintrust.dev)
- Test and evaluate agent behavior
- Run experiments to improve your agent
- Compare before and after
Simple logging
- Print statements to a log file
- JSON logs you can parse later
- Minimal setup, good for quick debugging
Build an Agent Trace Viewer
A trace viewer shows you the agent's thought process visually.
What to show:
- Iteration number
- Tool called
- Input to the tool
- Output from the tool
- The model's interpretation
- Cost and tokens for this step
Simple HTML template:
<table>
  <tr>
    <th>Step</th>
    <th>Tool</th>
    <th>Input</th>
    <th>Output</th>
    <th>Tokens</th>
    <th>Cost</th>
  </tr>
  <!-- Loop through each step -->
  <tr>
    <td>1</td>
    <td>search</td>
    <td>{"q": "weather today"}</td>
    <td>{"results": [...]}</td>
    <td>150</td>
    <td>$0.01</td>
  </tr>
</table>
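Generating this table from the per-step data collected in Step 2 takes only a few lines. A sketch, where render_trace is a hypothetical helper that assumes each step is a dict with tool, input, output, tokens, and cost keys:

```python
import html

def render_trace(steps):
    """Render a list of agent steps as an HTML trace table."""
    rows = []
    for i, step in enumerate(steps, start=1):
        cells = [
            str(i),
            step["tool"],
            step["input"],
            step["output"],
            str(step["tokens"]),
            f"${step['cost']:.4f}",
        ]
        # Escape cell contents so raw tool output can't break the page
        row = "".join(f"<td>{html.escape(c)}</td>" for c in cells)
        rows.append(f"<tr>{row}</tr>")
    header = "".join(
        f"<th>{h}</th>"
        for h in ["Step", "Tool", "Input", "Output", "Tokens", "Cost"]
    )
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"
```

Write the returned string to a file and open it in a browser to scan a whole run at once.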
Debugging Checklist
When your agent fails:
- Check the logs. What was the last tool call?
- Did the tool return an error? What was the error message?
- Did the model understand the error? Or did it ignore it?
- Is the agent looping? Same tool called multiple times?
- Are you over budget or over context?
- Test the failing tool in isolation. Does it work?
- Update the tool description or error message.
- Run the agent again. Did it fix the problem?
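The looping check in particular is easy to automate: count how often each (tool, parameters) pair appears in a run. A sketch, where detect_loops is a hypothetical helper and the threshold of 3 is an arbitrary choice:

```python
import json
from collections import Counter

def detect_loops(tool_calls, threshold=3):
    """Return (tool_name, params_json) pairs repeated at least `threshold` times.

    tool_calls is a list of (tool_name, params_dict) tuples.
    """
    # Serialize params with sorted keys so equivalent dicts compare equal
    counts = Counter(
        (name, json.dumps(params, sort_keys=True))
        for name, params in tool_calls
    )
    return [call for call, n in counts.items() if n >= threshold]
```

Feed it the tool calls recovered from the Step 1 log; any pair it returns is a strong sign the agent is stuck retrying the same action.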
This process finds and fixes most agent issues.