Testing and Evaluating Agents

How to Know If Your Agent Is Ready

Testing answers one question: does the agent do what you designed it to do?

Step 1: Create Test Cases

You need three types of tests: happy path, edge cases, and failure cases.

Happy path tests (things go right):

These test the agent on ideal inputs where it should succeed.

happy_path_tests = [
  {
    "name": "Simple research question",
    "input": "What is photosynthesis?",
    "expected": "Should find and summarize articles about photosynthesis"
  },
  {
    "name": "Multi-step research",
    "input": "Compare the benefits of solar and wind energy",
    "expected": "Should search for both, fetch articles, and compare"
  },
  {
    "name": "Find specific facts",
    "input": "How many countries have signed the Paris Climate Agreement?",
    "expected": "Should find an article with the number and cite it"
  }
]

Edge case tests (unusual but valid):

These test the agent on inputs that are harder but still reasonable.

edge_case_tests = [
  {
    "name": "Empty search results",
    "input": "Find articles about [very obscure topic]",
    "expected": "Should handle no results gracefully and say so"
  },
  {
    "name": "Conflicting sources",
    "input": "Find the best diet for weight loss",
    "expected": "Should acknowledge that sources disagree and present both views"
  }
]

Failure case tests (things go wrong):

These test the agent's error handling.

failure_case_tests = [
  {
    "name": "Tool returns error",
    "setup": "Make search tool fail",
    "input": "Research a topic",
    "expected": "Should catch error and retry or tell user"
  },
  {
    "name": "No relevant sources",
    "input": "Find articles about [nonsense]",
    "expected": "Should not hallucinate results"
  }
]
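If you plan to score tool selection automatically in Step 3, give test cases an optional "expected_tools" field listing the tool names the agent should call. A sketch; the tool names "search" and "fetch_article" are hypothetical, so use whatever names your agent's tools actually register under:

```python
# A test case with an optional "expected_tools" field, used by the
# tool-selection scorer in Step 3. Tool names here are illustrative.
test_with_tools = {
  "name": "Multi-step research",
  "input": "Compare the benefits of solar and wind energy",
  "expected": "Should search for both, fetch articles, and compare",
  "expected_tools": ["search", "fetch_article"]
}
```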

Step 2: Run the Agent Against Each Test

Execute your agent on each test case and capture the result.

def run_test(test_case):
  input_text = test_case["input"]
  
  print(f"\n=== Test: {test_case['name']} ===")
  print(f"Input: {input_text}")
  print(f"Expected: {test_case['expected']}")
  
  try:
    result = run_agent(input_text)
    print(f"Output: {result}")
    return {"status": "success", "result": result}
  except Exception as e:
    print(f"Error: {e}")
    return {"status": "error", "error": str(e)}

# Run all tests and keep the results for scoring in Step 3
all_tests = happy_path_tests + edge_case_tests + failure_case_tests
results = [run_test(test) for test in all_tests]

Step 3: Score Outputs

For each test, rate the agent on three dimensions.

Dimension 1: Did it pick the right tools?

Score 1-5. (1 = wrong tools, 5 = perfect tools)

def score_tool_selection(test_case, agent_trace):
  expected_tools = test_case.get("expected_tools", [])
  used_tools = {call["name"] for call in agent_trace}
  
  # Did it use the tools we expected? Sets ignore repeat calls,
  # so calling the same tool five times does not inflate the score.
  matches = used_tools & set(expected_tools)
  score = (len(matches) / len(expected_tools)) * 5 if expected_tools else 5
  
  return min(5, max(1, round(score)))

Dimension 2: Did it complete the task?

Score 1-5. (1 = did not complete, 5 = fully completed)

def score_task_completion(test_case, output):
  # Did the output address the input? This is often a manual judgment;
  # the checks below are a rough automated proxy.
  output_lower = output.lower()
  
  if "error" in output_lower:
    return 1  # Agent surfaced an error instead of an answer
  if "no results" in output_lower:
    return 2  # Agent found nothing
  if len(output) < 50:
    return 3  # Suspiciously short answer; review manually
  return 5  # Substantial output; spot-check for accuracy

Dimension 3: Did it stay within cost and iteration limits?

Score 1-5. (1 = at or over budget, 5 = well under budget)

def score_efficiency(agent_state, max_cost=1.0, max_iterations=15):
  cost_ratio = agent_state.total_cost / max_cost
  iteration_ratio = agent_state.iteration_count / max_iterations
  
  # Score drops as the agent approaches either limit, whichever is
  # closer; exceeding either budget pins the score at 1.
  efficiency = 1 - max(cost_ratio, iteration_ratio)
  return min(5, max(1, round(efficiency * 5)))
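To see how this scoring behaves, here is the formula restated on a few sample runs (with the score clamped to the 1-5 range). The SimpleNamespace objects are stand-ins for your real agent state:

```python
from types import SimpleNamespace

def efficiency_score(state, max_cost=1.0, max_iterations=15):
  # Restatement of the efficiency formula: score falls as the agent
  # approaches either the cost or the iteration limit.
  cost_ratio = state.total_cost / max_cost
  iteration_ratio = state.iteration_count / max_iterations
  efficiency = 1 - max(cost_ratio, iteration_ratio)
  return min(5, max(1, round(efficiency * 5)))

cheap_run = SimpleNamespace(total_cost=0.05, iteration_count=1)
typical_run = SimpleNamespace(total_cost=0.40, iteration_count=6)
over_budget = SimpleNamespace(total_cost=1.50, iteration_count=20)

print(efficiency_score(cheap_run))    # well under both limits -> 5
print(efficiency_score(typical_run))  # 40% of both budgets used -> 3
print(efficiency_score(over_budget))  # over both limits, clamped -> 1
```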

Overall score:

def score_test(test_case, agent_trace, output, agent_state):
  tool_score = score_tool_selection(test_case, agent_trace)
  completion_score = score_task_completion(test_case, output)
  efficiency_score = score_efficiency(agent_state)
  
  overall = (tool_score + completion_score + efficiency_score) / 3
  
  return {
    "tool_selection": tool_score,
    "task_completion": completion_score,
    "efficiency": efficiency_score,
    "overall": overall
  }

Step 4: Build a Simple Eval Harness

Automate your testing. Run all tests at once.

class AgentEvaluator:
  def __init__(self, agent_func, test_cases):
    self.agent = agent_func
    self.test_cases = test_cases
    self.results = []
  
  def run_all_tests(self):
    for test in self.test_cases:
      result = self.run_test(test)
      self.results.append(result)
    return self.results
  
  def run_test(self, test_case):
    # Assumes the agent function returns its output along with the
    # tool-call trace and final state that score_test needs.
    output, trace, state = self.agent(test_case["input"])
    score = score_test(test_case, trace, output, state)
    
    return {
      "test_name": test_case["name"],
      "score": score,
      "output": output
    }
  
  def summary(self):
    scores = [r["score"]["overall"] for r in self.results]
    avg = sum(scores) / len(scores)
    print(f"Average score: {avg:.1f} / 5")
    print(f"Passed: {sum(1 for s in scores if s >= 4)} / {len(scores)}")

# Use it
all_tests = happy_path_tests + edge_case_tests + failure_case_tests
evaluator = AgentEvaluator(run_agent, all_tests)
evaluator.run_all_tests()
evaluator.summary()

Tools for Testing

LangSmith (https://smith.langchain.com) Built for LangChain agents. Trace every step. Compare runs side by side. Debug what went wrong.

Braintrust (https://www.braintrust.dev) Eval framework. Define test cases. Run evals. Compare before and after improvements. Good for iterating on agents.

Custom script Simple Python script. Works with any agent. You control everything.
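For the custom-script route, even pytest works as a lightweight harness. A minimal sketch; `run_agent` here is a stub standing in for your real agent entry point:

```python
# test_agent.py -- run with `pytest test_agent.py`.
# run_agent is a stand-in stub; swap in your real agent function.
def run_agent(prompt):
  return f"Here is a summary of what I found about: {prompt}"

HAPPY_PATH = [
  {"name": "Simple research question", "input": "What is photosynthesis?"},
  {"name": "Multi-step research",
   "input": "Compare the benefits of solar and wind energy"},
]

def test_happy_path_produces_output():
  for case in HAPPY_PATH:
    output = run_agent(case["input"])
    # Loose checks: the agent returned something substantial
    assert output, f"{case['name']}: empty output"
    assert len(output) > 20, f"{case['name']}: suspiciously short output"
```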

Interpreting Results

Excellent (score 4-5): Agent is ready to ship.

Good (score 3-4): Agent works but needs polish. Fix failure cases.

Okay (score 2-3): Agent has issues. Redesign or improve tools.

Poor (score <2): Agent is not ready. Rethink the approach.

When Is an Agent Production-Ready?

Your agent is production-ready when all of these are true:

  1. Happy path tests: 80% pass with score >= 4
  2. Edge case tests: 60% pass with score >= 3
  3. Failure case tests: 100% do not crash (handle errors gracefully)
  4. Cost is within budget 95% of the time
  5. Iteration count is under max 95% of the time
  6. Tool selection is correct 90% of the time
  7. You have guardrails (max iterations, cost cap, rate limits)
  8. You understand what the agent does and does not do well
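The guardrails in item 7 can be as small as one check at the top of the agent loop. A minimal sketch, with hypothetical state fields and limits:

```python
class BudgetExceeded(Exception):
  """Raised when the agent hits an iteration or cost limit."""

def check_guardrails(state, max_iterations=15, max_cost=1.0):
  # Call this at the top of each agent loop iteration; the dict keys
  # ("iterations", "cost") are illustrative names for your agent state.
  if state["iterations"] >= max_iterations:
    raise BudgetExceeded(f"Hit iteration limit ({max_iterations})")
  if state["cost"] >= max_cost:
    raise BudgetExceeded(f"Hit cost limit (${max_cost:.2f})")

check_guardrails({"iterations": 3, "cost": 0.25})  # within limits: no error

try:
  check_guardrails({"iterations": 15, "cost": 0.25})
except BudgetExceeded as e:
  print(f"Stopped: {e}")
```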

Iterating Based on Results

If your agent scores poorly on a test:

Tool selection is wrong: Rewrite tool descriptions. Add examples. Add new tools if needed.

Task completion is low: Give the model clearer instructions. Simplify the task. Add intermediate steps.

Cost or iterations are too high: Reduce search results. Limit the number of sources. Simplify the workflow. Use a cheaper model.

Agent crashes on error: Add try/except blocks. Make error messages clearer. Add fallback tools.

Run tests again after each change. Track improvements over time.
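To track improvements across changes, append each eval run's summary to a log file and compare runs over time. A sketch using a JSON Lines file; the field names are illustrative:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_eval_run(path, avg_score, pass_rate, note=""):
  # Append one summary line per eval run so scores can be
  # compared across agent changes.
  entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "avg_score": avg_score,
    "pass_rate": pass_rate,
    "note": note,
  }
  with open(path, "a") as f:
    f.write(json.dumps(entry) + "\n")

def load_history(path):
  with open(path) as f:
    return [json.loads(line) for line in f]

# Demo: log a baseline run and a run after a change, then compare
fd, log_path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
log_eval_run(log_path, 3.8, 0.75, note="baseline")
log_eval_run(log_path, 4.2, 0.85, note="rewrote tool descriptions")
history = load_history(log_path)
print(f"Runs logged: {len(history)}")  # -> Runs logged: 2
```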

Checklist: Ready to Deploy

  • Created test cases (happy path, edge, failure)
  • Ran agent against all tests
  • Scored outputs on 3 dimensions
  • Built an eval harness
  • 80%+ of happy path tests pass
  • Agent handles errors gracefully
  • Cost and iterations are within budget
  • Tool selection is correct 90%+ of the time
  • You have guardrails implemented
  • You have documented known limitations

Once you pass all checks, your agent is production-ready.
