SabanaTech

Evals before agents: the fastest way to ship LLM features in production

Most agentic projects stall because nobody can answer 'is the agent good?' with anything other than a vibe. A small evals harness, written before the agent, fixes that — and shortens the loop from prototype to production by months.

SabanaTech Team · Agentic AI Practice

Every team building agents hits the same wall around week three. The prototype works on the engineer's laptop, fails in the demo, fails differently the next Monday, and the team is reduced to debating prompts in Slack. The fix is unglamorous: an evals harness written before the agent does anything useful.

What 'evals' actually means here

An eval is a frozen test case the agent has to pass — an input, the expected behaviour, and a graded outcome. It is the unit test of an LLM system. The grading function can be exact-match, structured-equality, or another LLM as a judge — whichever models the real outcome you care about.

  • Golden set: 50–500 inputs that represent the real distribution, captured from production traffic where possible.
  • Grader: a deterministic check (preferred) or an LLM-as-judge with a tightly scoped rubric.
  • Regression bar: the version of the agent currently in production. Anything below this bar does not ship.
  • Slices: subsets of the golden set that you grade separately, for example by client, by document type, or by language. Aggregate scores hide the bugs that matter.
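
To make the vocabulary concrete, here is a minimal Python sketch of a golden-set case and two deterministic graders. The names (EvalCase, exact_match, structured_equal, grade) and the choice of JSON for structured output are assumptions made for illustration, not part of any particular harness.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One frozen test case from the golden set (hypothetical schema)."""
    case_id: str
    input_text: str
    expected: str
    slice_name: str  # e.g. a client, a document type, or a language


def exact_match(output: str, expected: str) -> bool:
    """Pass only if the texts are equal, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def structured_equal(output: str, expected: str) -> bool:
    """Pass if both strings parse as JSON and the parsed values are equal."""
    try:
        return json.loads(output) == json.loads(expected)
    except json.JSONDecodeError:
        # Output that is not valid JSON fails the case instead of crashing the run.
        return False


def grade(case: EvalCase, output: str, grader: Callable[[str, str], bool]) -> bool:
    """Apply the chosen grader to one agent output."""
    return grader(output, case.expected)
```

For example, structured_equal passes '{"total": 1, "currency": "EUR"}' against '{"currency": "EUR", "total": 1}' because key order is irrelevant once the JSON is parsed, while exact_match fails the same pair. An LLM-as-judge grader would fit the same (output, expected) signature, with the rubric living inside the grader.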

Why the order matters

Writing evals after the agent is the most expensive way to do this. The team locks into a specific architecture, the metrics get gerrymandered to fit it, and every prompt change becomes a debate without data. Writing evals first inverts the loop: changes get a number, the number is comparable across runs, and the team disagrees about evidence instead of preferences.

What a useful harness looks like in practice

Our reference harness is two scripts and a CSV. The first script runs the agent across the golden set, captures the trace (prompt, tools called, output, cost, latency), and writes one row per case to the CSV. The second computes pass/fail per slice and overall, and posts a diff against the last run to a Slack channel. Token count, latency, and dollar cost are first-class metrics: model regressions show up there before they show up in customer complaints.
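
The two-script shape can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not the reference harness itself: the agent is a hypothetical callable that returns a dict with output, tokens, and cost_usd, the column names are invented, and the tool-call trace and the Slack post are left out for brevity.

```python
import csv
import time
from collections import defaultdict

# Hypothetical CSV schema; a real harness would likely also record tool calls.
FIELDS = ["case_id", "slice", "passed", "latency_s", "tokens", "cost_usd", "output"]


def run_golden_set(cases, agent, grader, out_path):
    """Script 1: run the agent on every case and write one CSV row per case.

    cases:  iterable of dicts with keys case_id, slice, input, expected
    agent:  callable(input) -> dict with keys output, tokens, cost_usd (assumed)
    grader: callable(output, expected) -> bool
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for case in cases:
            start = time.perf_counter()
            result = agent(case["input"])
            latency = time.perf_counter() - start
            writer.writerow({
                "case_id": case["case_id"],
                "slice": case["slice"],
                "passed": int(grader(result["output"], case["expected"])),
                "latency_s": round(latency, 4),
                "tokens": result["tokens"],
                "cost_usd": result["cost_usd"],
                "output": result["output"],
            })


def pass_rates(csv_path):
    """Script 2: pass rate per slice, plus an 'overall' entry.

    Assumes no slice is itself named 'overall'.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for key in (row["slice"], "overall"):
                total[key] += 1
                passed[key] += int(row["passed"])
    return {key: passed[key] / total[key] for key in total}


def diff_rates(previous, current):
    """Pass-rate change per slice (current minus previous run)."""
    return {key: current[key] - previous.get(key, 0.0) for key in current}
```

The dict returned by diff_rates is what would be formatted and posted to the channel. Averaging the tokens, latency_s, and cost_usd columns per slice follows the same pattern as pass_rates.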

Where to draw the line on coverage

More evals are not always better. Build until your overall pass rate stops being a useful signal, usually around 100–200 cases, then start splitting into slices. The goal is a harness that catches regressions reliably and is fast enough to run on every prompt change, not a benchmark that brute-forces the whole input space and takes overnight.

How this changes the engagement

Once evals exist, agent engineering becomes ordinary engineering. Prompt edits get a metric. Model swaps get a metric. New tool integrations get a metric. The team ships faster because every change is safe to try: if it regresses on the evals, it never reaches production.
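
The "never reaches production" rule can be enforced with a small gate in CI. The sketch below assumes pass rates are available as plain dicts keyed by slice name, one for the production baseline and one for the candidate. The function names and the tolerance parameter are hypothetical; the default of zero mirrors the rule that anything below the regression bar does not ship.

```python
def regressions(baseline, candidate, tolerance=0.0):
    """Return slices where the candidate falls below the baseline.

    baseline, candidate: dicts mapping slice name -> pass rate in [0, 1].
    A slice missing from the candidate counts as a regression.
    """
    failing = {}
    for name, base_rate in baseline.items():
        cand_rate = candidate.get(name)
        if cand_rate is None or cand_rate < base_rate - tolerance:
            failing[name] = (base_rate, cand_rate)
    return failing


def gate(baseline, candidate, tolerance=0.0):
    """Exit code for a CI step: 0 allows the release, 1 blocks it."""
    failing = regressions(baseline, candidate, tolerance)
    for name, (base_rate, cand_rate) in sorted(failing.items()):
        print(f"REGRESSION {name}: baseline={base_rate} candidate={cand_rate}")
    return 1 if failing else 0
```

Checking every slice, not only the overall rate, is the design choice that matters here: a candidate can improve the aggregate score while regressing on one client or one document type, and the gate should still block it.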

If you cannot tell me whether last week's prompt was better than this week's, you are not building a system, you are tweaking a demo.

SabanaTech AI lead