Omar Nagy.
EX. 05 · Open source · filed 2026-06-06
llm-eval-ci — a CI quality gate for LLM products, terminal-style tile showing PASS 100% flipping to FAIL 17% with exit 1.

llm-eval-ci

A model swap, a prompt tweak, a refactor — and your assistant quietly starts inventing a refund window it doesn’t have. No test fails, because most teams have no test for answer quality. llm-eval-ci is the gate I built for that: turn your real production failures into a trusted golden set, score every change with six calibrated graders, and fail the PR in CI the moment quality drops. MIT, Python, one dependency.

  • 100→17% pass rate on a silent regression
  • exit 1: CI fails the PR
  • 6 calibrated graders
  • MIT License

In one line

Stop quality silently degrading between releases — a golden set + calibrated graders that fail the PR in CI.

The failure mode no unit test catches

Code regressions are loud — a test goes red, the build breaks, someone gets paged. LLM-answer regressions are silent. You upgrade a model, compress a prompt, or rewrite a retrieval step, the diff is green, the deploy ships — and three days later support notices the assistant has been quoting a refund window that doesn’t exist, or the wrong price, or dropping a tool call it used to make. Nothing failed. Nothing was watching the part that actually matters: was the answer still right?

The reason teams don’t guard this is that the obvious version is awful: a folder of brittle string-match assertions that break on a comma, or a vibes-based eval one person runs by hand before a big release and skips the rest of the time. Neither survives a real release cadence. So the gate quietly doesn’t exist, and quality drifts release over release until a customer finds the floor.

What it does

llm-eval-ci is small (~600 lines of Python, one dependency, MIT-licensed) and does one thing: it makes answer quality a thing CI can fail on. Three moving parts.

A golden set from your real production traces

The cases that matter aren’t the ones you’d invent at a whiteboard — they’re the ones that already broke. The tool turns captured production traces into a curated golden regression set: the real questions, the grounding they should have used, the tool calls they should have made. That curation is the judgment the whole gate rests on, and it’s the part a generic eval library can’t do for you.
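
As a rough illustration of that curation step, here is a minimal sketch of promoting a captured trace into a golden case. The `GoldenCase` class, the `promote_trace` helper, and the trace keys (`question`, `retrieved_docs`, `tool_calls`) are all hypothetical names chosen for this example; they are not llm-eval-ci's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One curated regression case (hypothetical schema, for illustration only)."""
    case_id: str
    question: str
    grounding: tuple[str, ...]            # passages the answer should rely on
    expected_tool_calls: tuple[str, ...]  # tool names the answer should invoke
    note: str = ""                        # why a human promoted this trace


def promote_trace(trace: dict, case_id: str, note: str) -> GoldenCase:
    """Turn a captured production trace into a golden case.

    The trace keys read here are assumptions about what a trace contains.
    """
    return GoldenCase(
        case_id=case_id,
        question=trace["question"].strip(),
        grounding=tuple(doc["text"] for doc in trace.get("retrieved_docs", [])),
        expected_tool_calls=tuple(call["name"] for call in trace.get("tool_calls", [])),
        note=note,
    )


# Placeholder trace; the contents are made up for the example.
trace = {
    "question": "  Can I get a refund after two weeks? ",
    "retrieved_docs": [{"text": "Refund policy excerpt (placeholder text)."}],
    "tool_calls": [{"name": "lookup_order"}],
}
case = promote_trace(trace, case_id="refund-001", note="bot invented a refund window")
print(case.question)             # Can I get a refund after two weeks?
print(case.expected_tool_calls)  # ('lookup_order',)
```

The `note` field is the important part of the sketch: it records the human judgment about why this trace earned a place in the set, which is what no automated export can supply.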

Six calibrated graders, not one fuzzy score

Every candidate answer is scored on six axes — grounding, hallucination, relevance, tool-call correctness, format, and a rubric LLM-as-judge for the things only semantics catch. When a release regresses, you don’t get a single number that dropped; you get told which graders broke, so the fix is obvious.
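
To make "which graders broke" concrete, here is a minimal sketch of per-axis grading. The grader interface, the two toy graders, and the thresholds are assumptions made for illustration; the real graders are more involved than a substring check.

```python
from typing import Callable

# A grader maps (answer, case) to a score in [0, 1]. Hypothetical interface.
Grader = Callable[[dict, dict], float]


def tool_call_grader(answer: dict, case: dict) -> float:
    """Fraction of expected tool calls that the answer actually made."""
    expected = set(case["expected_tool_calls"])
    if not expected:
        return 1.0
    made = set(answer["tool_calls"])
    return len(expected & made) / len(expected)


def grounding_grader(answer: dict, case: dict) -> float:
    """Toy stand-in: fraction of required facts that appear verbatim in the answer."""
    facts = case["required_facts"]
    if not facts:
        return 1.0
    text = answer["text"].lower()
    return sum(fact.lower() in text for fact in facts) / len(facts)


def failed_graders(answer: dict, case: dict,
                   graders: dict[str, Grader],
                   thresholds: dict[str, float]) -> list[str]:
    """Return the names of graders whose score falls below their threshold."""
    scores = {name: grade(answer, case) for name, grade in graders.items()}
    return sorted(name for name, score in scores.items() if score < thresholds[name])


graders = {"tool_call": tool_call_grader, "grounding": grounding_grader}
thresholds = {"tool_call": 1.0, "grounding": 1.0}  # illustrative values

case = {"expected_tool_calls": ["lookup_order"], "required_facts": ["order status"]}
good = {"text": "I checked your order status.", "tool_calls": ["lookup_order"]}
bad = {"text": "It should arrive soon.", "tool_calls": []}

print(failed_graders(good, case, graders, thresholds))  # []
print(failed_graders(bad, case, graders, thresholds))   # ['grounding', 'tool_call']
```

Reporting names rather than one blended score is the design choice that matters here: a failing `tool_call` points at the orchestration code, a failing `grounding` points at retrieval or the prompt.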

A CI gate that exits 1 — and never flakes

The judge runs offline-deterministic by default, so the gate needs no API key and can’t flake your pipeline on a rate limit. It ships as a GitHub Action: when the pass rate drops below the bar, the build exits 1 and the PR can’t merge. OpenAI and Anthropic backends are there when you want real semantic grading instead of the deterministic stand-in.
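
The gate decision itself reduces to a threshold check on the pass rate. A minimal sketch, assuming a hypothetical `gate_exit_code` helper and the 90% bar that appears in the demo output; the tool's real entry point may be structured differently.

```python
def gate_exit_code(case_results: list[bool], required_pass_rate: float = 0.90) -> int:
    """Return the process exit code for a gate run: 0 to pass, 1 to block.

    case_results holds one boolean per golden case: True only if that case
    passed every enabled grader.
    """
    if not case_results:
        return 1  # an empty golden set guards nothing, so fail closed (an assumption)
    pass_rate = sum(case_results) / len(case_results)
    return 0 if pass_rate >= required_pass_rate else 1


# Shapes taken from the demo: 6/6 cases passing, then 1/6.
v1 = [True] * 6
v2 = [True] + [False] * 5

print(gate_exit_code(v1))  # 0
print(gate_exit_code(v2))  # 1
# In a CI entry point, this value would be handed to sys.exit() so the job fails.
```

Because the check is pure arithmetic over deterministic grader results, the same inputs always produce the same exit code, which is what keeps the gate from flaking.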

The demo, run for real

The repo ships a runnable demo (bash scripts/demo.sh), so the gate isn’t a claim; it’s a thing you watch work. It runs a support bot against a small golden set built from a known policy doc, with five of the six graders enabled (all but format). The grounded version passes every grader and writes the baseline (output captured from the actual run):

bash scripts/demo.sh
[llm-eval-ci] gate: PASS (v1)
overall pass rate: 100% (6/6 cases)
  • grounding mean=1.00 pass=100%
  • hallucination mean=1.00 pass=100%
  • relevance mean=0.99 pass=100%
  • tool_call mean=1.00 pass=100%
  • answer_quality mean=1.00 pass=100%
all gate checks passed
process exited 0
Fig 1 · v1 grounded support bot — every grader passes, baseline written (exit 0)

Now silently regress it the way a real refactor would: a rewrite that invents a refund window the policy never states, quotes the wrong price, and drops a tool call. The diff still looks fine. The gate doesn’t — every grader drops against the committed baseline:

bash scripts/demo.sh
[llm-eval-ci] gate: FAIL (v2)
overall pass rate: 17% (1/6 cases)
  • grounding mean=0.17 pass=17%
  • hallucination mean=0.58 pass=33%
  • relevance mean=0.58 pass=83%
  • tool_call mean=0.83 pass=83%
  • answer_quality mean=0.38 pass=17%
overall pass rate 17% below required 90% · regression vs baseline: −83%
process exited 1 — the gate failed the build (PR blocked)
Fig 2 · v2 silently-regressed rewrite — every grader drops, gate fails the PR (exit 1)

Overall pass rate collapses from 100% to 17% (only 1 of 6 golden cases still passes every grader), the build exits 1, and the report flags the regression on every grader — grounding worst-hit (−0.83 vs baseline), with hallucination, relevance, tool-call and the rubric judge (answer_quality) all down too. The PR can’t merge. The bad answer never ships. (The tool’s own unit tests pass, 3/3 — the gate is itself under test.)
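
The numbers in that summary can be re-derived from the two reports. The sketch below copies the per-grader means from Fig 1 and Fig 2 and recomputes the deltas; the variable names are illustrative and not part of the tool.

```python
# Per-grader mean scores copied from the two demo runs (Fig 1 and Fig 2).
baseline = {"grounding": 1.00, "hallucination": 1.00, "relevance": 0.99,
            "tool_call": 1.00, "answer_quality": 1.00}
current = {"grounding": 0.17, "hallucination": 0.58, "relevance": 0.58,
           "tool_call": 0.83, "answer_quality": 0.38}

# Change in mean score per grader, rounded to the report's two decimals.
deltas = {name: round(current[name] - baseline[name], 2) for name in baseline}
worst = min(deltas, key=deltas.get)

# Overall pass rate: 1 of 6 golden cases still passes every grader.
pass_rate = round(100 * 1 / 6)
drop_points = 100 - pass_rate

print(worst, deltas[worst])                     # grounding -0.83
print(all(delta < 0 for delta in deltas.values()))  # True
print(pass_rate, drop_points)                   # 17 83
```

Every delta is negative, grounding is the largest drop, and the 83-point fall in pass rate matches the "regression vs baseline" line in the report.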

Where the value actually is

The GitHub Action plumbing is the commodity part — free tools wire a workflow. The part that decides whether the gate is worth anything is the judgment underneath it: which production failures become golden cases, and how each grader is calibrated so it fails on a real regression and stays quiet on a harmless rewrite. A gate calibrated wrong is worse than no gate — it cries wolf until someone disables it. That calibration is the work I sell; the tool is the floor it stands on.
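
One simple way to picture calibration: score a handful of answers that a human has labeled as harmless rewrites or real regressions, then choose the pass threshold that misclassifies the fewest of them. This is a minimal sketch under that assumption, with made-up scores; it is not a description of how llm-eval-ci or the sprint actually calibrates graders.

```python
def calibrate_threshold(harmless_scores: list[float],
                        regression_scores: list[float]) -> tuple[float, int]:
    """Pick the pass threshold that misclassifies the fewest labeled examples.

    A harmless rewrite should score at or above the threshold (gate stays quiet);
    a real regression should score below it (gate fails).
    """
    candidates = sorted(set(harmless_scores) | set(regression_scores))
    best_threshold, best_errors = candidates[0], None
    for threshold in candidates:
        false_alarms = sum(score < threshold for score in harmless_scores)
        misses = sum(score >= threshold for score in regression_scores)
        errors = false_alarms + misses
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Illustrative, invented grader scores for human-labeled answers.
harmless = [0.92, 0.88, 0.95, 0.81]
regressions = [0.40, 0.62, 0.17, 0.55]

threshold, errors = calibrate_threshold(harmless, regressions)
print(threshold, errors)  # 0.81 0
```

A threshold set too high produces false alarms on harmless rewrites (the gate that cries wolf); one set too low lets real regressions through. The labeled examples are what tie the number to the task.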

The engagement

You can take llm-eval-ci off GitHub and wire it yourself — it’s MIT, that’s the point. What I sell is the part the README can’t hand you: I sit with your real production traces, decide which failures become the golden set, calibrate the graders to your task so the gate fails on what matters and not on noise, and hand back a working CI gate plus a runbook so your team extends the set without me. That’s the LLM Eval Sprint — one to two weeks, fixed scope. The deliverable is a quality gate you own, with the judgment baked in.

What it’s not

It’s not a hosted eval dashboard, not a model leaderboard, and not a guarantee your model is good — it’s a guarantee it didn’t get worse on the cases you’ve decided are non-negotiable. It’s the regression gate I’d want in front of any LLM product before it ships, and it’s licensed MIT so you can take it.

Want a quality gate on your LLM product?

The tool is free — the source is on GitHub (MIT). Want the golden set curated from your real failures and the graders calibrated to your task, shipped as a CI gate your team owns? Book the Audit Sprint — $1,500 to scope it, or email omar@neurascale.org about the LLM Eval Sprint ($2.5–6K, 1–2 weeks).

// stack on file

  • Python
  • GitHub Action
  • LLM-as-judge
  • MIT
  • one dependency
