Every team building agents discovers the same wall around week three. The prototype works on the engineer's laptop, fails on the demo, fails differently next Monday, and the team is reduced to debating prompts in Slack. The fix is unglamorous: an evals harness written before the agent does anything useful.
What 'evals' actually means here
An eval is a frozen test case the agent has to pass — an input, the expected behaviour, and a graded outcome. It is the unit test of an LLM system. The grading function can be exact-match, structured-equality, or another LLM as a judge — whichever models the real outcome you care about.
- Golden set: 50–500 inputs that represent the real distribution, captured from production traffic where possible.
- Grader: a deterministic check (preferred) or an LLM-as-judge with a tightly scoped rubric.
- Regression bar: the version of the agent currently in production. Anything below this bar does not ship.
- Slice: subsets of the golden set you grade separately — by client, by document type, by language. Aggregate scores hide the bugs that matter.
Why the order matters
Writing evals after the agent is the most expensive way to do this. The team locks in to a specific architecture, the metrics get gerrymandered to fit it, and every prompt change becomes a debate without data. Writing evals first inverts the loop: changes get a number, the number is comparable across runs, and the team disagrees about evidence instead of preferences.
What a useful harness looks like in practice
Our reference harness is two scripts and a CSV. One script runs the agent across the golden set, captures the trace (prompt, tools called, output, cost, latency), and writes a row per case. The second computes pass/fail per slice and per overall, and posts a diff against the last run to a Slack channel. Token, latency, and dollar cost are first-class metrics — model regressions show up there before they show up in customer complaints.
Where to draw the line on coverage
More evals are not always better. Build until your overall pass rate stops being a useful signal — usually around 100–200 cases — then start splitting into slices. The point is to catch regressions reliably, not to brute-force the whole input space. The goal is a harness fast enough to run on every prompt change, not a benchmark that takes overnight.
How this changes the engagement
Once evals exist, agent engineering becomes ordinary engineering. Prompt edits get a metric. Model swaps get a metric. New tool integrations get a metric. The team ships faster because every change is reversible — if it regresses on the eval, it never reaches production.
If you cannot tell me whether last week's prompt was better than this week's, you are not building a system, you are tweaking a demo.
