Testing and Evaluating Agents
How to Know If Your Agent Is Ready
Testing answers one question: does the agent do what you designed it to do?
Step 1: Create Test Cases
You need three types of tests: happy path, edge cases, and failure cases.
Happy path tests (things go right):
These test the agent on ideal inputs where it should succeed.
happy_path_tests = [
    {
        "name": "Simple research question",
        "input": "What is photosynthesis?",
        "expected": "Should find and summarize articles about photosynthesis"
    },
    {
        "name": "Multi-step research",
        "input": "Compare the benefits of solar and wind energy",
        "expected": "Should search for both, fetch articles, and compare"
    },
    {
        "name": "Find specific facts",
        "input": "How many countries have signed the Paris Climate Agreement?",
        "expected": "Should find an article with the number and cite it"
    }
]
Edge case tests (unusual but valid):
These test the agent on inputs that are harder but still reasonable.
edge_case_tests = [
    {
        "name": "Empty search results",
        "input": "Find articles about [very obscure topic]",
        "expected": "Should handle no results gracefully and say so"
    },
    {
        "name": "Conflicting sources",
        "input": "Find the best diet for weight loss",
        "expected": "Should acknowledge that sources disagree and present both views"
    }
]
Failure case tests (things go wrong):
These test the agent's error handling.
failure_case_tests = [
    {
        "name": "Tool returns error",
        "setup": "Make search tool fail",
        "input": "Research a topic",
        "expected": "Should catch error and retry or tell user"
    },
    {
        "name": "No relevant sources",
        "input": "Find articles about [nonsense]",
        "expected": "Should not hallucinate results"
    }
]
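The "setup" field above ("Make search tool fail") needs a way to break a tool on demand. One approach, sketched here under the assumption that your tools live in a module-level registry dict (the `TOOLS` name and the example tool are illustrative, not from any framework), is a context manager that swaps in a failing version and restores the original afterwards:

```python
from contextlib import contextmanager

# Illustrative tool registry; in a real agent this would be your own tools module.
TOOLS = {
    "search": lambda query: [{"title": "Photosynthesis", "url": "https://example.com"}],
}

@contextmanager
def failing_tool(name, message="Service unavailable"):
    """Temporarily replace a tool with one that raises, then restore it."""
    original = TOOLS[name]

    def broken(*args, **kwargs):
        raise RuntimeError(message)

    TOOLS[name] = broken
    try:
        yield
    finally:
        TOOLS[name] = original

# Run a failure-case test while the search tool is broken.
with failing_tool("search"):
    try:
        TOOLS["search"]("Research a topic")
    except RuntimeError as e:
        print(f"Tool error surfaced: {e}")
```

Because the swap is scoped to the `with` block, the rest of your test run sees the working tool again.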
Step 2: Run the Agent Against Each Test
Execute your agent on each test case and capture the result.
def run_test(test_case):
    input_text = test_case["input"]
    print(f"\n=== Test: {test_case['name']} ===")
    print(f"Input: {input_text}")
    print(f"Expected: {test_case['expected']}")
    try:
        result = run_agent(input_text)
        print(f"Output: {result}")
        return {"status": "success", "result": result}
    except Exception as e:
        print(f"Error: {e}")
        return {"status": "error", "error": str(e)}

# Run all tests
for test in happy_path_tests + edge_case_tests + failure_case_tests:
    run_test(test)
Step 3: Score Outputs
For each test, rate the agent on three dimensions.
Dimension 1: Did it pick the right tools?
Score 1-5. (1 = wrong tools, 5 = perfect tools)
def score_tool_selection(test_case, agent_trace):
    expected_tools = test_case.get("expected_tools", [])
    if not expected_tools:
        return 5  # no specific tools expected for this test
    used_tools = [call["name"] for call in agent_trace]
    # Did it use the tools we expected? Count each expected tool once.
    matches = [t for t in expected_tools if t in used_tools]
    return max(1, round(len(matches) / len(expected_tools) * 5))
Dimension 2: Did it complete the task?
Score 1-5. (1 = did not complete, 5 = fully completed)
def score_task_completion(test_case, output):
    # Crude keyword heuristics; for anything important, review manually
    # or use an LLM judge to ask: did the agent answer the question?
    if "error" in output.lower():
        return 1  # surfaced an error instead of an answer
    if "no results" in output.lower():
        return 2  # gave up without an answer
    if len(output) < 50:
        return 3  # suspiciously short answer
    return 5  # substantive output
Dimension 3: Did it stay within cost and iteration limits?
Score 1-5. (1 = massively over budget, 5 = under budget)
def score_efficiency(agent_state, max_cost=1.0, max_iterations=15):
    cost_ratio = agent_state.total_cost / max_cost
    iteration_ratio = agent_state.iteration_count / max_iterations
    efficiency = 1 - max(cost_ratio, iteration_ratio)
    return max(1, int(efficiency * 5))
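To sanity-check the math, here is the same function run against a stand-in state object (the function is repeated so the snippet runs on its own, and `AgentState` is an illustrative dataclass; your real state object just needs `total_cost` and `iteration_count` attributes):

```python
from dataclasses import dataclass

@dataclass
class AgentState:  # stand-in for your agent's real state object
    total_cost: float
    iteration_count: int

def score_efficiency(agent_state, max_cost=1.0, max_iterations=15):
    cost_ratio = agent_state.total_cost / max_cost
    iteration_ratio = agent_state.iteration_count / max_iterations
    efficiency = 1 - max(cost_ratio, iteration_ratio)
    return max(1, int(efficiency * 5))

print(score_efficiency(AgentState(total_cost=0.10, iteration_count=3)))   # well under budget -> 4
print(score_efficiency(AgentState(total_cost=0.95, iteration_count=14)))  # nearly exhausted -> 1
```

Note that whichever budget is closest to exhausted (cost or iterations) drives the score, so an agent that is cheap but loops many times still scores low.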
Overall score:
def score_test(test_case, agent_trace, output, agent_state):
    tool_score = score_tool_selection(test_case, agent_trace)
    completion_score = score_task_completion(test_case, output)
    efficiency_score = score_efficiency(agent_state)
    overall = (tool_score + completion_score + efficiency_score) / 3
    return {
        "tool_selection": tool_score,
        "task_completion": completion_score,
        "efficiency": efficiency_score,
        "overall": overall
    }
Step 4: Build a Simple Eval Harness
Automate your testing. Run all tests at once.
class AgentEvaluator:
    def __init__(self, agent_func, test_cases):
        self.agent = agent_func
        self.test_cases = test_cases
        self.results = []

    def run_all_tests(self):
        for test in self.test_cases:
            result = self.run_test(test)
            self.results.append(result)
        return self.results

    def run_test(self, test_case):
        # Assumes the agent returns its output along with the trace and
        # state that the scoring functions need.
        output, trace, state = self.agent(test_case["input"])
        score = score_test(test_case, trace, output, state)
        return {
            "test_name": test_case["name"],
            "score": score,
            "output": output
        }

    def summary(self):
        scores = [r["score"]["overall"] for r in self.results]
        avg = sum(scores) / len(scores)
        print(f"Average score: {avg:.1f} / 5")
        print(f"Passed: {sum(1 for s in scores if s >= 4)} / {len(scores)}")

# Use it
evaluator = AgentEvaluator(run_agent, all_tests)
evaluator.run_all_tests()
evaluator.summary()
Tools for Testing
LangSmith (https://smith.langchain.com): built for LangChain agents. Trace every step, compare runs side by side, and debug what went wrong.
Braintrust (https://www.braintrust.dev): an eval framework. Define test cases, run evals, and compare before and after improvements. Good for iterating on agents.
Custom script: a simple Python script, like the harness above. Works with any agent, and you control everything.
Interpreting Results
Excellent (4.0-5.0): Agent is ready to ship.
Good (3.0-3.9): Agent works but needs polish. Fix failure cases.
Okay (2.0-2.9): Agent has issues. Redesign or improve tools.
Poor (below 2.0): Agent is not ready. Rethink the approach.
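These bands can be encoded as a small helper so your harness can print a verdict alongside the number (a minimal sketch using the thresholds above):

```python
def interpret(overall):
    """Map an overall 1-5 score to a readiness verdict."""
    if overall >= 4:
        return "Excellent: ready to ship"
    if overall >= 3:
        return "Good: works but needs polish"
    if overall >= 2:
        return "Okay: redesign or improve tools"
    return "Poor: rethink the approach"

print(interpret(4.3))  # Excellent: ready to ship
print(interpret(2.5))  # Okay: redesign or improve tools
```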
When Is an Agent Production-Ready?
Your agent is production-ready when all of these are true:
- Happy path tests: 80% pass with score >= 4
- Edge case tests: 60% pass with score >= 3
- Failure case tests: 100% do not crash (handle errors gracefully)
- Cost is within budget 95% of the time
- Iteration count is under max 95% of the time
- Tool selection is correct 90% of the time
- You have guardrails (max iterations, cost cap, rate limits)
- You understand what the agent does and does not do well
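The measurable criteria in this list can be checked in code; the last two (guardrails, understanding) stay manual. A sketch, assuming per-test result dicts with an "overall" score plus "crashed" and "within_budget" flags (the schema is illustrative, not from any particular framework):

```python
def production_ready(results):
    """Check the automatable readiness criteria over per-category results.

    `results` maps each test category to a list of per-test dicts with an
    "overall" score and "crashed" / "within_budget" booleans.
    """
    def frac(tests, pred):
        return sum(1 for t in tests if pred(t)) / len(tests)

    happy = results["happy_path"]
    edge = results["edge_case"]
    failure = results["failure_case"]

    checks = {
        "happy_path": frac(happy, lambda t: t["overall"] >= 4) >= 0.80,
        "edge_cases": frac(edge, lambda t: t["overall"] >= 3) >= 0.60,
        "no_crashes": all(not t["crashed"] for t in failure),
        "within_budget": frac(happy + edge + failure,
                              lambda t: t["within_budget"]) >= 0.95,
    }
    return all(checks.values()), checks
```

Returning the per-check dict alongside the boolean makes it obvious which criterion blocked the release.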
Iterating Based on Results
If your agent scores poorly on a test:
Tool selection is wrong: Rewrite tool descriptions. Add examples. Add new tools if needed.
Task completion is low: Give the model clearer instructions. Simplify the task. Add intermediate steps.
Cost or iterations are too high: Reduce search results. Limit the number of sources. Simplify the workflow. Use a cheaper model.
Agent crashes on error: Add try/except blocks around tool calls. Make error messages clearer. Add fallback tools.
Run tests again after each change. Track improvements over time.
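One lightweight way to track improvements over time is to append each run's average score to a JSON-lines history file and compare consecutive runs (a minimal sketch using only the standard library; the file name and field names are illustrative):

```python
import json
import time
from pathlib import Path

def record_run(scores, path="eval_history.jsonl", label=""):
    """Append this run's average score to a JSON-lines history file."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "label": label,
        "average": sum(scores) / len(scores),
        "n_tests": len(scores),
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def regression_check(path="eval_history.jsonl", threshold=0.2):
    """Return the change in average score versus the previous run.

    Prints a warning when the drop exceeds `threshold`.
    """
    runs = [json.loads(line) for line in Path(path).read_text().splitlines()]
    if len(runs) < 2:
        return None
    delta = runs[-1]["average"] - runs[-2]["average"]
    if delta < -threshold:
        print(f"Regression: average score fell by {-delta:.2f}")
    return delta
```

Call `record_run` at the end of each evaluation (e.g. with the overall scores from your harness) and `regression_check` in CI so a prompt or tool change that quietly degrades the agent gets flagged.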
Checklist: Ready to Deploy
- Created test cases (happy path, edge, failure)
- Ran agent against all tests
- Scored outputs on 3 dimensions
- Built an eval harness
- 80%+ of happy path tests pass
- Agent handles errors gracefully
- Cost and iterations are within budget
- Tool selection is correct 90%+ of the time
- You have guardrails implemented
- You have documented known limitations
Once you pass all checks, your agent is production-ready.