Model Evals in Production: Regression Testing Prompts

Balys Kriksciunas · 8 min read

#ai #infrastructure #evals #testing #llm-quality #prompts #ci-cd

Traditional software has unit tests. Prompt-based software has evals. Without them, you ship a prompt change on Tuesday and by Friday you’ve silently broken the JSON output format for 3% of your users and nobody knows until someone files a bug with a screenshot.

This post covers building production-quality eval infrastructure: what to test, how to automate, where to run it in your pipeline, and the anti-patterns that make evals performative rather than useful.


Why Evals Are Non-Negotiable

Three properties of LLM-based systems make evals essential:

  1. Changes cascade unpredictably. A 20-token tweak to the system prompt can subtly change output format, refusal rate, or structured-output compliance — in ways unit tests don’t catch.

  2. Models update. When OpenAI ships gpt-4o-2025-05, your prompts may behave differently overnight. Without evals, you find out through user complaints.

  3. Prompts are code. They should be versioned, tested, and rolled out like code. The absence of automated tests is just technical debt.

Teams that invest in evals early ship 3–5x faster on prompt/agent improvements. The teams that don’t stall around month 6 of serious AI development, unable to iterate without breaking things.


The Eval Spectrum

“Eval” means many things. The useful ones:

1. Correctness evals

Does the output solve the task? Examples: exact-match answers for extraction tasks, generated code that passes test cases, correct routing or tool selection in a classifier.

These are the closest to traditional unit tests and the most reliable.

2. Quality evals

Is the output good? Subjective properties: tone, helpfulness, conciseness, faithfulness to source material.

Measured via human rating or LLM-as-judge (more below).

3. Format/contract evals

Does the output meet structural requirements? Valid JSON, schema compliance, required fields present, length limits respected.

These are the easiest to automate and the highest-leverage early wins.
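A format-contract check can be a few lines of plain code. The sketch below uses only the standard library; the contract itself (field names and types) is illustrative, not from the original post.

```python
import json

# Hypothetical contract for an extraction endpoint: the model must return
# a JSON object with exactly these fields and types.
CONTRACT = {"summary": str, "tags": list, "confidence": float}

def check_contract(raw_output: str) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    violations = []
    for field, expected_type in CONTRACT.items():
        if field not in data:
            violations.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations
```

Run this over every eval example and fail the suite on any non-empty result; it catches the silent format breakage described in the introduction.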

4. Safety evals

Does the output avoid known failure modes? Examples: resisting prompt injection, not leaking PII, refusing disallowed requests.

5. Retrieval evals (for RAG)

Is retrieval finding the right content? Typically measured with recall@k and precision@k against labeled query–document pairs.

See Hybrid Search in Production.

6. Production evals

How does the system perform on real traffic over time? Sampled scoring, drift detection, regression alerts.


Building An Eval Dataset

The single biggest predictor of eval value: dataset quality. A thousand-example eval set with real production traffic signals beats a 10,000-example synthetic set every time.

Where data comes from

Start with 50 hand-curated examples covering your most important cases. Expand to 500–1000 by sampling from production logs. Treat the eval set like code: versioned, reviewed, updated when the product evolves.
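"Treat the eval set like code" is easiest when the set lives in a diff-friendly file format. A minimal sketch, assuming a JSONL layout with illustrative field names (they are not a standard):

```python
import json
from pathlib import Path

# One eval example per record: input, expected behavior, and provenance,
# so reviewers can see where each case came from. Field names are illustrative.
examples = [
    {"id": "ex-001", "input": "Cancel my subscription",
     "expected_intent": "cancellation", "source": "hand-curated"},
    {"id": "ex-002", "input": "Why was I charged twice?",
     "expected_intent": "billing_dispute", "source": "prod-log-sample"},
]

def save_eval_set(path: Path, rows: list[dict]) -> None:
    # JSONL: one example per line, so PR diffs show exactly which cases changed.
    path.write_text("\n".join(json.dumps(r, sort_keys=True) for r in rows))

def load_eval_set(path: Path) -> list[dict]:
    return [json.loads(line) for line in path.read_text().splitlines() if line]
```

Check the file into the repo next to the prompts it tests, and require review on changes to it just like any other code.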

Labels

For correctness / format evals, ground-truth labels are feasible: an expected answer, the required fields, or test cases the output must pass.

For quality evals, rely on human ratings, pairwise preferences, or LLM-as-judge scores.

We’ll address LLM-as-judge separately because it needs care.


LLM-as-Judge: Useful But Tricky

Use another LLM (often a stronger model) to judge your outputs. Scales well, costs moderately, but has well-documented biases.

Known biases:

Mitigations: swap answer order and require agreement across both orderings, prefer pairwise comparison over absolute scores, give the judge an explicit rubric, and spot-check a sample with humans.

Our default: pairwise Claude-as-judge for quality, with absolute correctness for task evals. Human review on disagreements.
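The position-swap mitigation is mechanical enough to sketch. In the code below, `judge` is any callable that takes a prompt and returns "A" or "B"; in production it would wrap a strong-model API call, but it is injected here so the debiasing logic stands on its own. The prompt wording is illustrative.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are grading two answers to the same question.\n"
    "Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
    "Reply with exactly 'A' or 'B' for the better answer."
)

def pairwise_judge(question: str, ours: str, baseline: str,
                   judge: Callable[[str], str]) -> str:
    """Return 'ours', 'baseline', or 'tie', swapping positions to cancel bias."""
    # Ask twice, with the candidates in both orders.
    first = judge(JUDGE_PROMPT.format(question=question, a=ours, b=baseline))
    second = judge(JUDGE_PROMPT.format(question=question, a=baseline, b=ours))
    ours_wins = (first == "A") + (second == "B")
    if ours_wins == 2:
        return "ours"
    if ours_wins == 0:
        return "baseline"
    return "tie"  # the verdict flipped with position: treat as no preference
```

A judge that only ever picks the first slot produces a tie under this scheme, which is exactly the behavior you want from a position-biased verdict.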


Where Evals Run

Three checkpoints:

1. Pre-commit (local)

Small, fast eval on a subset (~20 examples). Runs in < 1 minute. Catches obvious breakage before the PR is opened.

2. CI (PR validation)

Full eval suite (~100–500 examples). Runs in 5–30 minutes. Required-to-merge on prompt or model changes. Automated comparison vs. main branch.

3. Production (live)

Continuous eval against real traffic. Sample 1% of requests, score them, alert on regression.
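One practical detail of the 1% sample: hash-based sampling keeps the decision deterministic per request, so every worker agrees on which requests are scored. A minimal sketch (the request-id scheme is an assumption):

```python
import hashlib

SAMPLE_RATE = 0.01  # score roughly 1% of requests

def should_score(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically map the request id into [0, 1) and compare to the
    rate, so a given request is always in or out of the sample regardless
    of which worker handles it."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Sampled requests then go onto a queue for asynchronous scoring, so evaluation never adds latency to the user-facing path.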

The CI eval is the most important. You want “PR is red because eval regressed” to be normal. Teams that don’t fail PRs on evals tend to merge regressions.


Pipeline Architecture

[ Prompt change PR ]
         ↓
[ CI runs eval suite ]
         ↓
[ Compare metrics vs main branch ]
    ┌────┴────┐
    ▼         ▼
[ PASS ]  [ FAIL → block merge ]
    ↓
[ Merge to main ]
         ↓
[ Deploy to canary (5% traffic) ]
         ↓
[ Production eval on canary vs baseline ]
    ┌────┴────┐
    ▼         ▼
[ Roll out ]  [ Roll back ]

Most teams start with just the PR-check piece. Production canary evals are mature-team territory.


Tools That Help

Eval frameworks: Promptfoo and DeepEval for assertion-based suites that run in CI.

LLM-as-judge scoring: the frameworks above ship judge-based assertions; observability platforms like Langfuse and Langsmith can score sampled traces.

Synthetic data generation: useful for bootstrapping coverage of rare cases, but weight real production samples higher.

Our default stack: Promptfoo in CI, Langfuse for production sampling and labeling, homegrown glue.
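To make the CI piece concrete, here is the shape of a Promptfoo configuration with a format assertion; the file paths, provider choice, and test inputs are illustrative, not from the original post.

```yaml
# promptfooconfig.yaml (illustrative)
prompts:
  - file://prompts/support_reply.txt
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      message: "I was charged twice this month"
    assert:
      - type: is-json          # format contract
      - type: contains
        value: "refund"
  - vars:
      message: "How do I export my data?"
    assert:
      - type: is-json
```

Running `promptfoo eval` in the PR workflow then gives you the pass/fail signal to gate merges on.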


Metrics That Matter

Pick a small number of primary metrics. Track many secondary ones.

Primary (must not regress): task success rate and format validity, the metrics users directly feel.

Secondary (monitor, don’t block on): latency, cost per request, output length, refusal rate.

Comparison tables:

| Metric       | main   | this PR | delta    |
|--------------|--------|---------|----------|
| Task success | 0.87   | 0.85    | -2pp ⚠️  |
| JSON valid   | 1.00   | 1.00    | 0        |
| Latency P95  | 1.2s   | 1.1s    | -8% ✅   |
| Cost/request | $0.012 | $0.014  | +17% ⚠️  |

This is what your PR check should print. Block merge if any primary regresses beyond threshold.
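The blocking logic behind that check is small. A sketch, with per-metric thresholds as an assumption (the post does not prescribe specific values):

```python
# CI regression gate over primary metrics. "higher_is_better" flips the
# comparison for metrics like latency or cost where lower is better.
THRESHOLDS = {
    # metric: (max allowed regression, higher_is_better)
    "task_success": (0.01, True),  # block if it drops more than 1pp
    "json_valid":   (0.0,  True),  # any drop blocks
}

def gate(main: dict, pr: dict) -> list[str]:
    """Return the primary metrics that regressed beyond their threshold."""
    failures = []
    for metric, (allowed, higher_is_better) in THRESHOLDS.items():
        delta = pr[metric] - main[metric]
        regression = -delta if higher_is_better else delta
        if regression > allowed:
            failures.append(f"{metric}: {main[metric]:.2f} -> {pr[metric]:.2f}")
    return failures
```

The CI job prints the comparison table, calls `gate`, and exits nonzero on any failure so the PR goes red.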


Continuous Evals On Production Traffic

The strongest evals are against real traffic: sample a slice of live requests, score them asynchronously (format checks plus LLM-as-judge), and chart the scores over time.

Why this matters: your offline eval set doesn’t drift. Your users’ queries do. A model that scored 0.87 on your eval set six months ago might be scoring 0.80 on fresh traffic because user behavior changed.

Langfuse, Langsmith, and Helicone all support sampled production evals. Worth the setup cost.


Evals For Agents

Agents add a layer of complexity: the output isn’t a single response, it’s a trajectory of tool calls and reasoning.

Agent eval metrics: task completion rate, steps and tool calls used versus an expected budget, tool-selection accuracy, recovery from failed calls, cost per completed task.

Agent evals are harder because there’s no single ground-truth trajectory. Multiple paths can be correct. Tools like Langsmith and Arize Phoenix help with trajectory visualization; for scoring, task-completion + LLM-as-judge on the trajectory is the pragmatic default.
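Because there is no single correct trajectory, trajectory scoring works best on outcomes and budgets rather than exact step sequences. A minimal sketch; the step schema (`type`, `error` fields) is illustrative:

```python
# Score one agent run: a trajectory is a list of step dicts.
def score_trajectory(trajectory: list[dict], task_completed: bool,
                     max_steps: int = 10) -> dict:
    tool_calls = [s for s in trajectory if s.get("type") == "tool_call"]
    errors = [s for s in tool_calls if s.get("error")]
    return {
        "completed": task_completed,
        "steps": len(trajectory),
        "within_budget": len(trajectory) <= max_steps,
        "tool_error_rate": len(errors) / len(tool_calls) if tool_calls else 0.0,
        # Completing the task despite a failed call is a positive signal.
        "recovered_from_error": task_completed and len(errors) > 0,
    }
```

Aggregate these per-run dicts across the eval set, then put an LLM-as-judge pass on the trajectories that completed but used many more steps than expected.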


Anti-Patterns

1. Eval set leaks. Your eval examples show up in training data or prompt few-shot, inflating scores. Curate carefully, rotate examples, check for contamination.

2. Over-fitting to evals. Teams iterate on prompts to raise eval scores, ignoring production quality. Use diverse evals; sample from real traffic.

3. Evals that never fail. If your eval always passes, either you don’t test enough or your threshold is too loose.

4. Ignoring eval cost. A 1000-example GPT-4 eval is $20–$80 per run. CI costs add up. Use smaller/cheaper judges where possible; sample intelligently.

5. No owner. Evals rot if nobody owns them. Assign an owner who curates examples, updates thresholds, investigates failures.

6. Evaluating quality without definition. “Good” is subjective. Write down what good means for your app. Put it in the eval prompt.


Getting Started

If you have no evals today:

Week 1: Hand-curate 50 examples covering your top 10 user intents. Label expected behavior.

Week 2: Set up Promptfoo or DeepEval. Get a CI check running on prompt changes.

Week 3: Add a JSON-validity check and any format contracts. These are the highest-signal automated evals.

Week 4: Sample 100 production requests, grade them manually, see where your system is weakest. Add evals for those failure modes.

Month 2: Add LLM-as-judge for subjective quality. Start tracking trends.

Month 3+: Continuous production eval, drift alerts, automated regression detection.

Don’t skip the hand-curation. The fastest eval programs start from human judgment and scale from there.


Building an eval pipeline for a production LLM app? Get in touch — we’ll help you ship a regression-testing harness in a week.