llm-eval-ci: your LLM answers can silently get worse. This fails the PR when they do.

01 · the green build that's wrong

The regression no test catches.

Code regressions are loud: a test goes red, someone gets paged. LLM-answer regressions are silent. Upgrade a model, compress a prompt, the diff is green and the deploy ships, quoting a refund window that doesn't exist. Here the grounded version passes every grader and writes the baseline.

$ bash scripts/demo.sh
[llm-eval-ci] gate: PASS ✓ (v1)
overall pass rate: 100% (6/6 cases)
- grounding       mean=1.00  pass=100%
- hallucination   mean=1.00  pass=100%
- relevance       mean=0.99  pass=100%
- tool_call       mean=1.00  pass=100%
- answer_quality  mean=1.00  pass=100%
· all gate checks passed
[llm-eval-ci] PASS: the gate passed, merge allowed
process exited 0

v1 grounded support bot · every grader passes, baseline written (exit 0)

02 · the silent regression, caught

One rewrite later, exit 1.

Now regress it the way a real refactor would: a rewrite that invents a refund window the policy never states, quotes the wrong price, drops a tool call. The diff still looks fine. The gate does not, and the PR can't merge.

100→17%pass rate collapse · build exits 1

$ bash scripts/demo.sh
[llm-eval-ci] gate: FAIL ✗ (v2)
overall pass rate: 17% (1/6 cases)
- grounding       mean=0.17  pass=17%
- hallucination   mean=0.58  pass=33%
- relevance       mean=0.58  pass=83%
- tool_call       mean=0.83  pass=83%
- answer_quality  mean=0.38  pass=17%
· overall pass rate 17% below required 90%
· regression vs baseline: −83%
[llm-eval-ci] FAIL: the gate failed the build (PR blocked)
process exited 1

v2 silently-regressed rewrite · every grader drops, gate fails the PR (exit 1)

03 · which grader broke

Six axes, not one fuzzy score.

A single dropping number tells you nothing. Six calibrated graders tell you where: grounding is worst-hit at minus 0.83 against the committed baseline, and the rubric LLM-as-judge confirms the answer itself degraded. The fix is obvious.

grounding · held to the case grounding1.00 → 0.17

answer_quality · the rubric LLM-as-judge1.00 → 0.38

hallucination · the claim the source never states1.00 → 0.58

relevance · still answering the question0.99 → 0.58

tool_call · the calls it should have made1.00 → 0.83

04 · three moving parts

Golden set, graders, a gate that never flakes.

About 600 lines of Python, one runtime dependency, MIT. The judge runs offline-deterministic by default, so the gate needs no API key and can't flake on a rate limit. OpenAI and Anthropic backends are there when you want real semantic grading.

01Golden setReal production traces curated into the cases that already broke, with the grounding and tool calls they should have used.
02Six gradersGrounding, hallucination, relevance, tool-call, format, and a rubric LLM-as-judge for what only semantics catch.
03CI gateA GitHub Action: pass rate below the bar and the build exits 1, the PR is blocked. The gate is itself under test, 3/3.

05 · where the value is

The plumbing is free. The judgment is the work.

Wiring a GitHub Action is the commodity part. What decides whether the gate is worth anything is which production failures become golden cases, and how each grader is calibrated to fail on a real regression and stay quiet on a harmless rewrite. A gate calibrated wrong is worse than none: it cries wolf until someone disables it.

The GitHub Action plumbing · any tool can wire thiscommodity

Which failures become golden cases + grader calibrationthe work I sell

I have run that calibration discipline in production: it is how Bridge’s extraction accuracy went from 82% to 96%. The tool is free and MIT (the source is on GitHub). What I sell is the AI Quality Gate ($3,500, 1–2 weeks, fixed scope): your golden set curated from real failures, graders calibrated to your task, shipped as a CI gate your team owns, with a runbook to extend it. Email omar@neurascale.org, or start smaller with Find the leak · $950 to scope it first.

llm-eval-ci.

The regression no test catches.

One rewrite later, exit 1.

Six axes, not one fuzzy score.

Golden set, graders, a gate that never flakes.

The plumbing is free. The judgment is the work.