How to Evaluate an AI Agent: Reliability, Cost, Latency, Capability (2026)
Evaluating an agent is harder than evaluating a model. Agents are stateful, non-deterministic, and use tools that change between runs. Reliability matters more than peak capability. This page covers the four evaluation dimensions, six common failure modes, and a 12-item procurement checklist.
Should you use an agent at all?
The first evaluation question is whether the task suits an agent at all. Five questions, answered honestly, will point to one of four verdicts.
Does the task require multiple steps that depend on each other?
The four evaluation dimensions
Treat each dimension as an independent question. A vendor that excels on capability and fails on reliability is a useful demo, not a useful production system.
Capability
Does the agent succeed on representative tasks?
Capability is the simplest dimension to measure: build a task set that resembles your real workload and run the agent on it. Public benchmarks (τ-bench / tau-bench from Sierra, AgentBench, GAIA) are useful as starting points but rarely match enterprise workloads. Most teams that ship agents in production end up building a private benchmark of 20 to 50 hand-curated examples.
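As a sketch of what such a private benchmark can look like in code, assuming a hypothetical `run_agent` callable that takes a prompt and returns the agent's final output (the task definitions here are illustrative, not prescriptive):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str                        # the instruction given to the agent
    check: Callable[[str], bool]       # True if the final output is acceptable

def pass_rate(tasks: list[Task], run_agent: Callable[[str], str]) -> float:
    """Run the agent once per task and report the fraction that pass."""
    passed = sum(1 for t in tasks if t.check(run_agent(t.prompt)))
    return passed / len(tasks)

# Illustrative 2-item benchmark; real private sets are typically 20 to 50 examples.
tasks = [
    Task("refund-simple", "Refund order #1234",
         lambda out: "refund issued" in out.lower()),
    Task("refund-denied", "Refund order #9999 (outside the refund window)",
         lambda out: "cannot refund" in out.lower()),
]
```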
Reliability
Does it succeed consistently?
Reliability is harder than capability and matters more in production. The same agent on the same task can succeed once and fail the next time. Standard practice (per Anthropic's "Demystifying evals for AI agents", 2024) is to run each scenario 100+ times and report the success rate as a distribution, not a single number. An agent that succeeds in 95 of 100 trials on your workload is more useful than one that scores 99 of 100 on a curated benchmark but fails silently in the long tail.
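A minimal sketch of a repeated-trial harness, reusing the hypothetical `run_agent` and `Task` from the capability sketch above; it reports the success rate with a 95% Wilson confidence interval rather than a single point estimate:

```python
import math
from typing import Callable

def reliability(run_trial: Callable[[], bool], n_trials: int = 100) -> tuple[float, float, float]:
    """Run one scenario n_trials times; return (success_rate, ci_low, ci_high).

    Uses the Wilson score interval at 95% confidence so a 100/100 run does not
    report a misleadingly exact interval.
    """
    successes = sum(run_trial() for _ in range(n_trials))
    p, z = successes / n_trials, 1.96
    denom = 1 + z**2 / n_trials
    centre = (p + z**2 / (2 * n_trials)) / denom
    half = z * math.sqrt(p * (1 - p) / n_trials + z**2 / (4 * n_trials**2)) / denom
    return p, centre - half, centre + half

# e.g. reliability(lambda: task.check(run_agent(task.prompt)), n_trials=100)
```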
Cost
What does each successful completion cost?
Per-completion cost is the right unit, not per-call cost. A successful completion may take 5 model calls and 3 tool calls; a failed completion may take 50 of each before the iteration cap fires. Track cost per successful completion, including retry cost and escalation-to-human cost. The hidden cost driver is iteration loops on agents without a hard cap.
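A sketch of that accounting under assumed field names and an assumed flat escalation rate; the point is that failed attempts, retries, and human escalations are all charged against the completions that eventually succeed:

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    model_cost_usd: float        # token spend for this attempt
    tool_cost_usd: float         # metered tool/API spend for this attempt
    succeeded: bool
    escalated_to_human: bool = False

# Assumed flat cost for a human picking up an escalated task (illustrative).
HUMAN_ESCALATION_COST_USD = 12.00

def cost_per_successful_completion(attempts: list[Attempt]) -> float:
    """Total spend, including failed attempts and escalations, divided by successes."""
    total = sum(
        a.model_cost_usd + a.tool_cost_usd
        + (HUMAN_ESCALATION_COST_USD if a.escalated_to_human else 0.0)
        for a in attempts
    )
    successes = sum(a.succeeded for a in attempts)
    return total / successes if successes else float("inf")
```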
Latency
How long does end-to-end completion take?
Latency for an agent is end-to-end, not per-call. Tool calls dominate: a single API call to a slow internal system can add seconds to the overall path. Measure the latency distribution (median, p95, p99) on real workloads, not the median on synthetic benchmarks.
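A small sketch of the distribution summary from logged end-to-end run durations; `statistics.quantiles` with `n=100` returns 99 percentile cut points, so indexes 49, 94, and 98 correspond to p50, p95, and p99:

```python
import statistics

def latency_summary(durations_s: list[float]) -> dict[str, float]:
    """Report median, p95, and p99 end-to-end latency (wall clock, seconds)."""
    cuts = statistics.quantiles(durations_s, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# e.g. latency_summary([run.wall_clock_seconds for run in production_sample])
```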
Common failure modes
Six failure modes account for most production agent incidents. Each has a detection pattern and a mitigation pattern. Vendors that cannot speak fluently about these are early in their production maturity.
1. Hallucinated tool calls
Agent invents a tool that does not exist or invents arguments to a real tool. Detection: structured output validation at the runtime layer (see the validation sketch after this list). Mitigation: tighter tool descriptions and constrained generation.
2. Silent failures
Agent reports success despite an actual failure. The most expensive failure mode, because it ships incorrect results without raising an alarm. Detection: outcome verification, not just response inspection. Mitigation: end-to-end tests against real systems.
3. Cost explosion
Autonomous loops that spend money before terminating. Detection: per-run budget caps. Mitigation: hard iteration limits and per-tool budget limits.
4. Prompt injection
User input redirects the agent away from its original goal. Particularly dangerous when the agent has write-access tools. Detection: input scanning, output validation against original goal. Mitigation: separate user input from system instructions, treat all model output as untrusted before passing to side-effectful tools.
5. Tool-use error swallowing
A tool returned an error; the agent treated it as success. Detection: typed error returns from tools that the runtime can recognise. Mitigation: explicit error contracts in tool schemas.
6. Distribution shift
Agent works in development on the curated test set, fails in production on real workloads. Detection: production sampling and ongoing eval set expansion. Mitigation: deploy gradually, sample real traffic into eval set, re-run evals as the world changes.
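A minimal sketch of the runtime-layer validation mentioned under failure modes 1 and 5, assuming a hypothetical tool registry keyed by name; unknown tool names, unknown arguments, and tool errors are surfaced to the caller through a typed result instead of being passed through silently:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    allowed_args: set[str]
    fn: Callable[..., Any]

@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: str | None = None   # typed error contract: never folded into a "success" string

def dispatch(registry: dict[str, Tool], call: dict[str, Any]) -> ToolResult:
    """Validate a model-proposed tool call before executing it."""
    name = call.get("name", "")
    args = call.get("args", {})
    tool = registry.get(name)
    if tool is None:
        return ToolResult(ok=False, error=f"unknown tool: {name!r}")                # hallucinated tool name
    unknown = set(args) - tool.allowed_args
    if unknown:
        return ToolResult(ok=False, error=f"unknown arguments: {sorted(unknown)}")  # hallucinated arguments
    try:
        return ToolResult(ok=True, value=tool.fn(**args))
    except Exception as exc:   # surface the error so the runtime cannot treat it as success
        return ToolResult(ok=False, error=str(exc))
```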
Building an eval set
The standard pattern for production agent evaluation is a hand-curated set of 20 to 50 examples that mirror the real workload. Each example has a starting state, a goal, and an expected outcome shape (often a partial specification, since multiple correct answers may exist). The eval set is the foundation; everything else (automated scoring, regression testing, vendor comparison) layers on top.
Start small and grow incrementally. A 20-example set covers more ground than most teams realise; the marginal value of the 21st example is small if it overlaps the first 20. Add examples specifically when you observe new failure modes in production or when adding a new tool.
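One way to represent the structure described above, with hypothetical field names; because the expected outcome is a partial specification, the check asserts only the fields it names and ignores everything else:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalCase:
    case_id: str
    starting_state: dict[str, Any]      # e.g. fixture data loaded before the run
    goal: str                           # the instruction given to the agent
    expected: dict[str, Any] = field(default_factory=dict)  # partial outcome spec

def matches_partial_spec(outcome: dict[str, Any], expected: dict[str, Any]) -> bool:
    """True if every field named in the spec is present and equal; others are ignored."""
    return all(outcome.get(k) == v for k, v in expected.items())

case = EvalCase(
    case_id="ticket-escalation-07",
    starting_state={"ticket_status": "open", "priority": "P3"},
    goal="Escalate the ticket if the customer is on an enterprise plan.",
    expected={"ticket_status": "escalated"},   # priority and other fields left unspecified
)
```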
Hybrid quantitative-qualitative evaluation
Automated scoring catches what it is designed to catch. Numeric pass-rates on an eval set tell you whether the agent succeeds at predefined tasks. They do not tell you whether the agent is sometimes right for the wrong reasons, sometimes wrong in ways that matter more than the pass-rate suggests, or sometimes correct on the test data but consistently failing on inputs the test set did not anticipate.
Human review of a sampled subset of agent runs is the second leg. It is expensive and unavoidable. The teams running agents most successfully in production are the ones who set aside reviewer time as a recurring cost, not as a pre-launch one-off.
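One way to make that recurring review concrete, sketched with assumed run-record fields: sample a fixed number of production runs per review cycle, stratified so that automated passes and failures are both represented rather than reviewing only the obvious failures.

```python
import random

def sample_for_review(runs: list[dict], per_bucket: int = 10, seed: int = 0) -> list[dict]:
    """Pick up to per_bucket runs from each of the automated pass/fail buckets."""
    rng = random.Random(seed)
    passed = [r for r in runs if r["auto_pass"]]
    failed = [r for r in runs if not r["auto_pass"]]
    return (rng.sample(passed, min(per_bucket, len(passed)))
            + rng.sample(failed, min(per_bucket, len(failed))))
```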
Procurement checklist (12 items)
Twelve questions to ask any vendor that proposes shipping an agent into your environment. A vendor that answers fewer than eight cleanly is early in its enterprise readiness.
- Ask for tau-bench-style benchmark results on tasks like yours, not just on the public benchmark.
- Ask for production reliability data: success rate distribution at p50, p95, p99 on representative workloads.
- Ask about prompt-injection guardrails: how does the agent separate trusted from untrusted input?
- Ask about cost-cap controls: per-run, per-tool, per-day caps. What stops a runaway agent?
- Ask for the failure-mode taxonomy the vendor tracks. If they cannot enumerate failure modes, they are not measuring them.
- Ask which version of which model is in production at that moment, and the policy for model upgrades.
- Ask for SLAs on tool-call latency, not just model latency.
- Ask how the agent handles tool errors and what the escalation path is.
- Ask whether the agent can be run in a read-only sandbox for evaluation.
- Ask for the production logging surface: what can your team see when the agent is running?
- Ask about evaluation tooling: does the vendor support running your private eval set against the platform?
- Ask for references at companies of your scale running the agent on tasks like yours, in production, for at least three months.
Sources for this page include Sierra's tau-bench paper, Anthropic's engineering blog post "Demystifying evals for AI agents", AWS's "Evaluating AI agents", and Weights & Biases' agent evaluation guidance. See the methodology page for the full source list.