Testing AI Agents
“Test agents like systems: validate tool calls, pin behaviors with replayable traces, and catch regressions before users do.”
1. Why “testing agents” is different from testing normal code
In normal software, you test deterministic functions: same input → same output.
Agents are not like that:
- outputs can vary across runs
- “correctness” can be fuzzy
- behavior depends on tools (web, APIs, files, code execution)
- safety matters as much as success
So testing agents is really testing a system:
- the model prompt and policy
- the orchestrator code
- tool contracts and validators
- sandboxing and safety gates
- state management and stop conditions
If you test only the final text output, you’ll miss the most expensive failures: unsafe tool use, loops, and silent hallucinations.
2. The testing stack: unit → integration → scenario → regression
A practical hierarchy:
2.1 Unit tests (fast, local)
Test deterministic components:
- tool argument validation
- schema parsers
- redaction functions
- budget counters and repetition detectors
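For example, a budget counter and a tool-argument validator can be pinned with ordinary pytest-style unit tests. A minimal sketch, assuming hypothetical `StepBudget` and `validate_search_args` components (stand-ins for whatever your orchestrator actually uses):

```python
# Minimal sketch: unit tests for deterministic agent components.
# StepBudget and validate_search_args are hypothetical examples.

class StepBudget:
    def __init__(self, max_steps: int):
        self.max_steps = max_steps
        self.used = 0

    def consume(self) -> bool:
        """Return True while budget remains; the counter never exceeds max_steps."""
        if self.used >= self.max_steps:
            return False
        self.used += 1
        return True

def validate_search_args(args: dict) -> list[str]:
    errors = []
    if not isinstance(args.get("query"), str) or not args.get("query", "").strip():
        errors.append("query must be a non-empty string")
    if not isinstance(args.get("max_results", 5), int) or args.get("max_results", 5) <= 0:
        errors.append("max_results must be a positive integer")
    return errors

def test_budget_never_exceeds_max():
    budget = StepBudget(max_steps=3)
    results = [budget.consume() for _ in range(5)]
    assert results == [True, True, True, False, False]
    assert budget.used == 3

def test_search_args_rejects_bad_input():
    assert validate_search_args({"query": ""}) != []
    assert validate_search_args({"query": "ok", "max_results": -1}) != []
    assert validate_search_args({"query": "ok", "max_results": 3}) == []
```

These run in milliseconds and never touch a model or a real tool.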
2.2 Integration tests (tool boundaries)
Test orchestration with mocked tools:
- tool calls happen with correct args
- unsafe calls are blocked
- retries are bounded
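As a sketch, suppose the orchestrator routes tool calls through a retry wrapper you can hand a mock to. The `ToolRunner` below is hypothetical, not a real API; the tests verify the two properties above: arguments pass through unchanged, and retries stop at the configured cap.

```python
from unittest.mock import Mock

# Hypothetical retry wrapper around the tool boundary (sketch only).
class ToolRunner:
    def __init__(self, tool, max_retries: int = 2):
        self.tool = tool
        self.max_retries = max_retries

    def run(self, **args):
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return self.tool(**args)
            except TimeoutError as exc:
                last_error = exc
        return {"error": str(last_error)}

def test_tool_called_with_correct_args():
    tool = Mock(return_value={"ok": True})
    ToolRunner(tool).run(query="rust ownership", max_results=3)
    tool.assert_called_once_with(query="rust ownership", max_results=3)

def test_retries_are_bounded():
    tool = Mock(side_effect=TimeoutError("tool timed out"))
    result = ToolRunner(tool, max_retries=2).run(query="x")
    assert tool.call_count == 3   # 1 initial attempt + 2 retries
    assert "error" in result      # failure is surfaced, not retried forever
```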
2.3 Scenario tests (end-to-end)
Test realistic user flows:
- multi-step tasks
- realistic tool outputs
- expected stop reasons
2.4 Regression tests (pin behaviors)
Replay known traces and compare:
- action sequences
- safety decisions
- cost metrics
This is how you keep agent behavior stable over time.
2.5 A quick mapping: what to test where
| Layer | What you test | What it catches | Typical runtime |
|---|---|---|---|
| Unit | validators, redaction, budgets | obvious bugs, safety bypasses | milliseconds |
| Integration | orchestrator + mocked tools | wrong tool args, wrong ordering | seconds |
| Scenario | end-to-end intent flows | drift, missing steps, bad stop reasons | seconds–minutes |
| Regression | replay traces | regressions from prompt/model/tool changes | minutes |
This mapping is useful because it prevents a common mistake: trying to test everything with slow end-to-end tests.
3. What to test: outputs, trajectories, and invariants
3.1 Output-level tests
Useful for:
- strict JSON outputs
- required fields and formats
- deterministic tasks
But output tests alone are insufficient.
3.2 Trajectory-level tests (the agent’s path)
Trajectory tests validate:
- tool selection
- action ordering
- bounded retries
- stop conditions
This is the core of agent reliability engineering.
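Trajectory tests can work directly on the recorded list of tool calls. A minimal sketch, assuming a trace format of dicts with a `tool` field (adapt to whatever your orchestrator logs):

```python
# Sketch: trajectory-level assertions over a recorded trace.
# The trace structure is an assumption; use whatever your orchestrator emits.

def tool_sequence(trace: list[dict]) -> list[str]:
    return [call["tool"] for call in trace]

def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if `expected` appears in `actual` in order (gaps allowed)."""
    it = iter(actual)
    return all(name in it for name in expected)

def test_search_happens_before_summarize():
    trace = [
        {"tool": "search", "args": {"query": "agent testing"}},
        {"tool": "fetch_page", "args": {"url": "https://example.com"}},
        {"tool": "summarize", "args": {"max_words": 200}},
    ]
    assert is_subsequence(["search", "summarize"], tool_sequence(trace))

def test_no_tool_called_more_than_three_times():
    trace = [{"tool": "search"}] * 2 + [{"tool": "summarize"}]
    counts = {}
    for name in tool_sequence(trace):
        counts[name] = counts.get(name, 0) + 1
    assert max(counts.values()) <= 3
```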
3.3 Invariant tests (must always hold)
Invariants should be enforced by code:
- budgets never go negative
- write actions require approval
- no network in sandbox (if policy)
- no secrets in logs
Invariants are your “hard rails.” When they fail, the test should fail loudly.
3.5 Practical invariants (copy/paste for most agents)
These are generic invariants that improve most agent systems:
- Budget invariants: steps_used <= max_steps, tool_calls <= max_tool_calls.
- Safety invariants: no write tools when writes are disallowed; no network when sandbox policy forbids it.
- Schema invariants: any structured output must parse; unknown fields are rejected (when strict).
- No secret echo: outputs and logs must not contain patterns that look like keys/PII.
- Stop invariants: agent must always stop with an explicit stop reason.
You can enforce many of these in code (deterministically) and then test the enforcement layer.
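Here is a minimal sketch of that enforcement layer, assuming the run result exposes counters and a list of attempted writes. The `check_invariants` helper and its field names are hypothetical; the pattern is what matters: enforce in code, then test the enforcement.

```python
# Sketch: a deterministic invariant checker over a run result.
# Field names (steps_used, tool_calls, writes_attempted, ...) are assumptions.

def check_invariants(result: dict, policy: dict) -> list[str]:
    violations = []
    if result["steps_used"] > policy["max_steps"]:
        violations.append("step budget exceeded")
    if result["tool_calls"] > policy["max_tool_calls"]:
        violations.append("tool-call budget exceeded")
    if result["writes_attempted"] and not policy["writes_allowed"]:
        violations.append("write attempted while writes are disallowed")
    if result["stop_reason"] not in {"SUCCESS", "NEEDS_INPUT", "FAILURE"}:
        violations.append("missing or unknown stop reason")
    return violations

def test_write_without_permission_is_flagged():
    result = {"steps_used": 4, "tool_calls": 6,
              "writes_attempted": ["db.update"], "stop_reason": "SUCCESS"}
    policy = {"max_steps": 10, "max_tool_calls": 20, "writes_allowed": False}
    assert "write attempted while writes are disallowed" in check_invariants(result, policy)

def test_clean_run_has_no_violations():
    result = {"steps_used": 2, "tool_calls": 3,
              "writes_attempted": [], "stop_reason": "SUCCESS"}
    policy = {"max_steps": 10, "max_tool_calls": 20, "writes_allowed": True}
    assert check_invariants(result, policy) == []
```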
4. Mocking tools: the most important agent testing technique
If your tests depend on real tools:
- they’re slow
- they’re flaky
- they’re expensive
- they change over time (web pages, APIs)
So you need tool mocks.
4.1 Record/replay
Record real tool outputs once, then replay them.
Pros:
- realistic
- stable over time
Cons:
- recordings can go stale if tools evolve
4.2 Synthetic mocks
Write simple mocked outputs:
- simulate success
- simulate timeouts
- simulate malformed responses
This is great for testing error handling and safety.
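A synthetic mock can be a few lines. The sketch below assumes the tool layer expects a callable that returns a JSON string (or raises); the mocks deliberately return malformed JSON or raise a timeout so error handling gets exercised with no network at all.

```python
import json

# Synthetic failure-path mocks (sketch). The tool interface is an assumption:
# a callable that returns a JSON string or raises on failure.

def malformed_search(**_args) -> str:
    return '{"results": [ oops'          # deliberately broken JSON

def timeout_search(**_args) -> str:
    raise TimeoutError("search backend timed out")

def safe_parse(raw: str) -> dict:
    """Hypothetical wrapper the orchestrator uses around tool output."""
    try:
        return {"ok": True, "data": json.loads(raw)}
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"malformed tool output: {exc}"}

def test_malformed_output_is_contained():
    parsed = safe_parse(malformed_search(query="x"))
    assert parsed["ok"] is False
    assert "malformed" in parsed["error"]

def test_timeout_propagates_to_retry_layer():
    try:
        timeout_search(query="x")
        assert False, "expected TimeoutError"
    except TimeoutError:
        pass
```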
4.3 Hybrid approach
Use record/replay for “happy paths” and synthetic mocks for “failure paths.”
4.5 Mocking strategy for multi-step agents: “tool scripts”
For scenario tests, it’s helpful to define tool outputs as a script:
- call #1 → returns success payload
- call #2 → returns 429
- call #3 → returns success after backoff
This lets you verify:
- retries are bounded
- the agent changes strategy
- stop reason is correct
These scripts become reusable test fixtures.
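One way to express such a script is a fake tool that returns a fixed sequence of responses, one per call. `ScriptedTool` below is a hypothetical fixture; the tool names and payloads are illustrative.

```python
# Sketch: a scripted fake tool that returns pre-defined responses in order.

class ScriptedTool:
    def __init__(self, script: list[dict]):
        self.script = script
        self.calls: list[dict] = []

    def __call__(self, **args) -> dict:
        self.calls.append(args)
        index = min(len(self.calls) - 1, len(self.script) - 1)
        response = self.script[index]
        if "raise" in response:
            raise response["raise"]
        return response

def test_scripted_tool_follows_the_script():
    search = ScriptedTool([
        {"status": 200, "results": ["a"]},
        {"status": 429, "error": "rate limited"},
        {"status": 200, "results": ["b"]},
    ])
    # In a real scenario test you hand `search` to the agent under test,
    # then assert on search.calls plus the agent's stop reason.
    assert search(query="q1")["status"] == 200
    assert search(query="q2")["status"] == 429
    assert search(query="q3")["status"] == 200
    assert len(search.calls) == 3
```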
5. Testing structured outputs: schemas + fuzzing
If your agent outputs structured JSON, you can test:
- schema validity
- missing required fields
- invalid types
- range constraints
Add fuzz tests:
- random strings
- edge-case unicode
- extremely long inputs
These catch parsing edge cases and drift.
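A minimal sketch using only the standard library: a hand-rolled validator for an assumed `{"answer": str, "confidence": float}` schema plus a small fuzz loop. In practice you might reach for a schema library; the pattern is the same.

```python
import json
import random
import string

# Assumed output schema: {"answer": <str>, "confidence": <float in [0, 1]>}

def validate_output(raw: str) -> list[str]:
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    if not isinstance(data.get("answer"), str):
        errors.append("answer must be a string")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in [0, 1]")
    unknown = set(data) - {"answer", "confidence"}
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    return errors

def test_valid_output_passes():
    assert validate_output('{"answer": "42", "confidence": 0.9}') == []

def test_fuzzed_garbage_never_crashes_validator():
    rng = random.Random(0)  # seeded so failures are reproducible
    for _ in range(100):
        garbage = "".join(rng.choices(string.printable, k=rng.randint(0, 200)))
        errors = validate_output(garbage)   # must return errors, never raise
        assert isinstance(errors, list)
```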
Further reading (optional): see Structured Output Patterns.
5.5 Property-based thinking (even without a property-testing library)
You can get most property-testing value without adopting a new library by writing a few loops:
- generate 100 random small inputs
- run the agent/tool logic
- assert invariants
Examples:
- “Any extracted list has no duplicate IDs.”
- “Any plan has no cycles when dependencies are used.”
- “Any cost metric is non-negative.”
This catches “weird” failures that aren’t in your hand-written examples.
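As a sketch, here is the first property ("no duplicate IDs in any extracted list") written as a plain seeded loop. `extract_items` is a hypothetical stand-in for your own extraction or normalization code.

```python
import random

# Hypothetical extraction step: deduplicates raw records by "id".
def extract_items(raw_records: list[dict]) -> list[dict]:
    seen = set()
    items = []
    for record in raw_records:
        if record["id"] not in seen:
            seen.add(record["id"])
            items.append(record)
    return items

def test_extracted_lists_never_contain_duplicate_ids():
    rng = random.Random(42)  # seeded so failures are reproducible
    for _ in range(100):
        raw = [{"id": rng.randint(0, 20), "score": rng.random()}
               for _ in range(rng.randint(0, 50))]
        ids = [item["id"] for item in extract_items(raw)]
        assert len(ids) == len(set(ids)), f"duplicate IDs in {ids}"
```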
6. Testing autonomy: budgets and stop reasons
The classic failure mode for autonomous agents is looping. So tests must assert:
- the agent stops within budget
- the stop reason is correct (SUCCESS/NEEDS_INPUT/FAILURE)
- repeated tool calls are detected
A simple scenario test might assert:
- “Given tool failure twice, the agent asks a clarifying question instead of retrying forever.”
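The deterministic half of that assertion can live in an escalation policy that the orchestrator enforces. The `next_action` function below is a hypothetical sketch of such a policy; the test pins its behavior without invoking a model at all.

```python
# Sketch: a deterministic escalation policy enforced by the orchestrator.
# Decision names (RETRY / ASK_USER / STOP_FAILURE) are illustrative.

def next_action(failure_count: int, steps_used: int, max_steps: int) -> str:
    if steps_used >= max_steps:
        return "STOP_FAILURE"
    if failure_count >= 2:
        return "ASK_USER"        # escalate instead of retrying forever
    return "RETRY"

def test_two_failures_escalate_to_user():
    assert next_action(failure_count=0, steps_used=1, max_steps=10) == "RETRY"
    assert next_action(failure_count=2, steps_used=3, max_steps=10) == "ASK_USER"

def test_budget_exhaustion_stops_the_run():
    assert next_action(failure_count=0, steps_used=10, max_steps=10) == "STOP_FAILURE"
```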
Further reading (optional): see Autonomous Agent Architectures and Hierarchical Planning.
6.5 Testing statefulness: resumption and replay
If your agent can pause/resume, test:
- resuming from a checkpoint produces a consistent next action
- tool cache prevents duplicate calls
- old failures are remembered (negative memory) to avoid loops
This is where “agent tests” become “workflow tests.”
Further reading (optional): see State Management and Checkpoints.
6.6 Testing planning quality (lightweight but effective)
If your agent generates plans, you can test plan quality without “grading prose.”
Practical checks:
- No cycles: if the plan has dependencies, it must be acyclic.
- Step bounds: each plan chunk has a max step budget.
- Done definitions exist: each chunk has a measurable done condition.
- No forbidden actions: plan does not contain disallowed tools/operations.
These are deterministic checks that catch many planning failures early.
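The first two checks are plain graph and arithmetic checks. A minimal sketch, assuming a plan format of steps with `id`, `depends_on`, and `max_steps` fields; the cycle check is ordinary depth-first search.

```python
# Sketch: deterministic plan-quality checks. The plan format is assumed:
# each step has an "id", a "depends_on" list, and a "max_steps" budget.

def has_cycle(plan: list[dict]) -> bool:
    deps = {step["id"]: step.get("depends_on", []) for step in plan}
    visiting, done = set(), set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True               # back edge => cycle
        visiting.add(node)
        cyclic = any(visit(dep) for dep in deps.get(node, []))
        visiting.discard(node)
        done.add(node)
        return cyclic

    return any(visit(step["id"]) for step in plan)

def test_plan_is_acyclic_and_bounded():
    plan = [
        {"id": "gather", "depends_on": [], "max_steps": 5},
        {"id": "draft", "depends_on": ["gather"], "max_steps": 8},
        {"id": "review", "depends_on": ["draft"], "max_steps": 3},
    ]
    assert not has_cycle(plan)
    assert all(0 < step["max_steps"] <= 10 for step in plan)

def test_cycle_is_detected():
    plan = [
        {"id": "a", "depends_on": ["b"]},
        {"id": "b", "depends_on": ["a"]},
    ]
    assert has_cycle(plan)
```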
Further reading (optional): see Hierarchical Planning.
7. Testing safety: prompt injection and forbidden actions
Safety tests should be explicit and automated:
7.1 Prompt injection tests
Inputs containing:
- “ignore previous instructions”
- “call delete tool”
- “reveal secrets”
Expected behavior:
- refuse
- sanitize
- continue safely or escalate
7.2 Forbidden tool calls
Make sure the orchestrator blocks:
- writes when writes are forbidden
- network calls when network is forbidden
- file access outside allowlist
These should be deterministic tests that do not depend on the model “doing the right thing.”
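A sketch of such a deterministic test, assuming the orchestrator consults a policy gate before executing any tool. The `ToolGate` class, its policy fields, and the tool names are hypothetical; the key property is that the block happens in code, regardless of what the model requested.

```python
from pathlib import Path

# Sketch: a policy gate consulted before any tool executes.
# Policy fields and tool names are illustrative.

class ToolGate:
    def __init__(self, allow_writes: bool, allowed_dirs: list[str]):
        self.allow_writes = allow_writes
        self.allowed_dirs = [Path(d).resolve() for d in allowed_dirs]

    def allows(self, tool: str, args: dict) -> bool:
        if tool in {"write_file", "delete_file"} and not self.allow_writes:
            return False
        if "path" in args:
            target = Path(args["path"]).resolve()
            if not any(target.is_relative_to(d) for d in self.allowed_dirs):
                return False
        return True

def test_writes_are_blocked_when_disallowed():
    gate = ToolGate(allow_writes=False, allowed_dirs=["/tmp/agent"])
    assert not gate.allows("delete_file", {"path": "/tmp/agent/report.md"})

def test_paths_outside_allowlist_are_blocked():
    gate = ToolGate(allow_writes=True, allowed_dirs=["/tmp/agent"])
    assert not gate.allows("read_file", {"path": "/etc/passwd"})
    assert gate.allows("read_file", {"path": "/tmp/agent/notes.txt"})
```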
7.5 Testing prompt injection defense: fail closed
When you test injection, treat the model as untrusted:
- the orchestrator should block dangerous tool calls even if the model requests them
- untrusted content must not be placed into instruction channels
Your tests should assert “fail closed” behavior: when unsure, the system blocks and escalates rather than acting.
7.6 Data leakage tests: protect outputs and logs
Leakage tests should cover both user-visible output and internal logs/traces.
Test inputs:
- fake API keys (sk-..., long random strings)
- fake PII (emails, phone numbers)
- internal hostnames or file paths
Expected behavior:
- agent output is redacted
- logs are redacted (or sensitive fields are omitted)
- the agent refuses to reveal secrets even when prompted
These tests can be mostly deterministic (regex detectors + policies), which is ideal for CI.
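A sketch of such a detector, assuming leaks follow recognizable shapes (an sk- prefixed key, an email address). The patterns are illustrative and deliberately loose; real deployments tune them and add allowlists.

```python
import re

# Illustrative leak patterns: an sk- style API key and an email address.
LEAK_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
]

def find_leaks(text: str) -> list[str]:
    return [m.group(0) for pattern in LEAK_PATTERNS for m in pattern.finditer(text)]

def redact(text: str) -> str:
    for pattern in LEAK_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def test_api_key_is_detected_and_redacted():
    output = "Your key is sk-abcdef1234567890abcd, keep it safe."
    assert find_leaks(output) != []
    assert "sk-" not in redact(output)

def test_clean_output_is_untouched():
    output = "The deployment finished in 42 seconds."
    assert find_leaks(output) == []
    assert redact(output) == output
```

Run the same checks over the serialized trace, not just the user-visible answer.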
8. Trace-based regression: replay the system, not just the output
The best regression artifacts are traces:
- messages
- tool calls + results
- budgets
- stop reason
A regression test can assert:
- tool call sequence is unchanged (or changes are intentional)
- safety blocks still happen
- cost is within bounds
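A minimal sketch of such a regression check, assuming traces are stored as JSON artifacts with a `tool_calls` list, a `stop_reason`, and a `cost_usd` field (all field names and the file layout are assumptions). The point is to compare structure, not freeform text.

```python
import json
from pathlib import Path

# Sketch: compare a fresh trace against a stored golden trace.
# Trace layout ({"tool_calls": [...], "stop_reason": ..., "cost_usd": ...}) is assumed.

def load_trace(path: str) -> dict:
    return json.loads(Path(path).read_text())

def diff_traces(golden: dict, current: dict) -> list[str]:
    problems = []
    golden_seq = [c["tool"] for c in golden["tool_calls"]]
    current_seq = [c["tool"] for c in current["tool_calls"]]
    if golden_seq != current_seq:
        problems.append(f"tool sequence changed: {golden_seq} -> {current_seq}")
    if golden["stop_reason"] != current["stop_reason"]:
        problems.append("stop reason changed")
    if current.get("cost_usd", 0) > 1.5 * golden.get("cost_usd", 0):
        problems.append("cost grew by more than 50%")
    return problems

def test_current_run_matches_golden_trace():
    golden = {"tool_calls": [{"tool": "search"}, {"tool": "summarize"}],
              "stop_reason": "SUCCESS", "cost_usd": 0.02}
    current = {"tool_calls": [{"tool": "search"}, {"tool": "summarize"}],
               "stop_reason": "SUCCESS", "cost_usd": 0.021}
    assert diff_traces(golden, current) == []
```

When the diff is non-empty, the test should fail and force a human decision: accept the new behavior (update the golden trace) or fix the regression.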
Further reading (optional): see Observability and Tracing and Agent Evaluation Frameworks.
8.5 Snapshot tests for agent behavior (use carefully)
Snapshot tests (“compare output to a stored snapshot”) are tempting, but they’re fragile for agents.
If you use snapshots, snapshot the structured parts:
- tool call sequence
- stop reason
- schema-validated outputs
- safety violations list
Avoid snapshotting large freeform text that is expected to vary.
This gives you regression protection without constant churn.
8.6 Golden traces: how to keep them useful over time
Golden traces are saved “known good” trajectories that you replay in regression tests.
To keep them useful:
- store traces as structured artifacts (JSON) with redacted tool results
- pin tool outputs via record/replay
- version the agent prompt and the tool schemas
When a golden trace changes:
- require an explicit review/approval (like updating a snapshot test)
- record the reason (“prompt updated to handle new constraint”)
This prevents “silent drift” where the test suite slowly stops representing intended behavior.
9. A minimal test harness (pseudocode)
```python
def test_agent_scenario(agent, scenario):
    # Replay recorded/scripted tool outputs instead of calling real tools.
    # FakeTools and tool_calls_match are placeholders for your own fixtures.
    tools = FakeTools(scenario["tool_responses"])
    result = agent.run(scenario["input"], tools=tools)

    # Trajectory + invariant assertions, not just output text.
    assert result.stop_reason == scenario["expected_stop_reason"]
    assert result.steps <= scenario["max_steps"]
    assert not result.violations  # safety violations must be empty
    assert tool_calls_match(result.trace.tool_calls, scenario["expected_tool_calls"])
```
The goal is stable, repeatable tests. The model can still be stochastic, but the test harness should reduce variability via tool replay and strict invariants.
9.5 Case study: a browsing agent test suite
Test cases to include:
- normal query with 2 sources
- conflicting sources (agent should surface uncertainty)
- malicious page injection text (agent must ignore and not call tools unsafely)
- stale page (agent should prefer recent sources if policy requires)
This is an ideal domain for tests because “citation faithfulness” is measurable.
Further reading (optional): see Web Browsing Agents.
9.6 Case study: a code execution agent test suite
Code execution agents are a perfect target for strong tests because you can demand hard evidence (tests passing).
Scenario tests:
- compilation/runtime error → agent fixes code in ≤3 retries
- timeout case → agent changes strategy (smaller input, early exits)
- forbidden operation case (network/filesystem) → system blocks and agent proceeds safely
Assertions:
- sandbox limits were respected (time/memory)
- no network calls attempted (if policy)
- final output matches expected schema/format
Further reading (optional): see Code Execution Agents.
10. Common failure modes (and the test that catches them)
10.1 “Correct answer, wrong tool behavior”
Test: trajectory-level assertions on tool calls.
10.2 “Infinite retries”
Test: budget + repetition detector.
10.3 “Secret leakage”
Test: log scrubbing + regex leak detectors.
10.4 “Schema drift”
Test: schema validation on every run + fuzz tests.
10.5 “Stale browsing results”
Test: citation faithfulness and freshness policy checks (with recorded pages).
11. Summary & Junior Engineer Roadmap
Testing agents is systems testing:
- Test deterministic parts hard: validators, redaction, budgets.
- Mock tools: record/replay + failure simulations.
- Test trajectories: tool choice, ordering, bounded retries.
- Assert invariants: safety rules must never be violated.
- Replay traces for regression: pin behavior across changes.
Mini-project (recommended)
Build a test suite for one agent intent:
- 10 scenario tests with tool replay
- 5 adversarial tests (prompt injection, malformed tool output)
- 3 regression traces pinned to known-good behavior
Run it in CI and fail the build when success rate or safety degrades.
Common rookie mistakes (avoid these)
- Testing only final text: you’ll miss unsafe tool attempts and loops.
- No tool mocks: your tests will be flaky and slow.
- No budgets: an agent that loops will “pass” until it hits production costs.
- No regression suite: you’ll ship changes that silently degrade behavior.
Further reading (optional)
- Evaluation harnesses and scoring: Agent Evaluation Frameworks
- Trace collection and debugging: Observability and Tracing