
llm-eval-ci
A model swap, a prompt tweak, a refactor — and your assistant quietly starts inventing a refund window it doesn’t have. No test fails, because most teams have no test for answer quality. llm-eval-ci is the gate I built for that: turn your real production failures into a trusted golden set, score every change with six calibrated graders, and fail the PR in CI the moment quality drops. MIT, Python, one dependency.
Stop quality silently degrading between releases — a golden set + calibrated graders that fail the PR in CI.
The failure mode no unit test catches
Code regressions are loud — a test goes red, the build breaks, someone gets paged. LLM-answer regressions are silent. You upgrade a model, compress a prompt, or rewrite a retrieval step, the diff is green, the deploy ships — and three days later support notices the assistant has been quoting a refund window that doesn’t exist, or the wrong price, or dropping a tool call it used to make. Nothing failed. Nothing was watching the part that actually matters: was the answer still right?
The reason teams don’t guard this is that the obvious version is awful: a folder of brittle string-match assertions that break on a comma, or a vibes-based eval one person runs by hand before a big release and skips the rest of the time. Neither survives a real release cadence. So the gate quietly doesn’t exist, and quality drifts release over release until a customer finds the floor.
What it does
llm-eval-ci is small (~600 lines of Python, one dependency, MIT-licensed) and does one thing: it makes answer quality a thing CI can fail on. Three moving parts.
A golden set from your real production traces
Six calibrated graders, not one fuzzy score
A CI gate that exits 1 — and never flakes
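As a rough illustration of how the three parts above could fit together, here is a minimal sketch. The names (`GoldenCase`, `case_passes`, `overall_pass_rate`) and the 0.5 pass threshold are hypothetical and are not taken from the llm-eval-ci source. The only rule borrowed from this page is that a golden case counts as passing only when it passes every grader.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden case: a real question plus per-grader scores in [0, 1]."""
    case_id: str
    question: str
    scores: dict = field(default_factory=dict)  # grader name -> score


def case_passes(case: GoldenCase, threshold: float = 0.5) -> bool:
    # An ungraded case never passes; otherwise every grader must clear the threshold.
    return bool(case.scores) and all(s >= threshold for s in case.scores.values())


def overall_pass_rate(cases: list, threshold: float = 0.5) -> float:
    # Fraction of golden cases that pass every grader.
    if not cases:
        return 0.0
    return sum(case_passes(c, threshold) for c in cases) / len(cases)


# Six illustrative cases: one clean, five failing the (hypothetical) grounding grader.
cases = [GoldenCase("c1", "What is the refund policy?", {"grounding": 1.0, "relevance": 1.0})]
cases += [
    GoldenCase(f"c{i}", "placeholder question", {"grounding": 0.0, "relevance": 1.0})
    for i in range(2, 7)
]
rate = overall_pass_rate(cases)
print(f"{rate:.0%}")  # prints "17%"
```

Because the scores are plain numbers compared against a fixed threshold, rerunning the aggregation on the same scores always gives the same result, which is the property a gate needs if it is not to flake.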
The demo, run for real
The repo ships a runnable demo (bash scripts/demo.sh), so the gate isn’t just a claim: it’s something you can watch work. It runs a support bot against a small golden set built from a known policy doc, with five of the six graders enabled. The grounded version passes every grader and writes the baseline (output captured from the actual run):
- grounding mean=1.00 pass=100%
- hallucination mean=1.00 pass=100%
- relevance mean=0.99 pass=100%
- tool_call mean=1.00 pass=100%
- answer_quality mean=1.00 pass=100%
Now silently regress it the way a real refactor would: a rewrite that invents a refund window the policy never states, quotes the wrong price, and drops a tool call. The diff still looks fine. The gate doesn’t — every grader drops against the committed baseline:
- grounding mean=0.17 pass=17%
- hallucination mean=0.58 pass=33%
- relevance mean=0.58 pass=83%
- tool_call mean=0.83 pass=83%
- answer_quality mean=0.38 pass=17%
Overall pass rate collapses from 100% to 17% (only 1 of 6 golden cases still passes every grader), the build exits 1, and the report flags the regression on every grader — grounding worst-hit (−0.83 vs baseline), with hallucination, relevance, tool-call and the rubric judge (answer_quality) all down too. The PR can’t merge. The bad answer never ships. (The tool’s own unit tests pass, 3/3 — the gate is itself under test.)
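The baseline comparison described above can be sketched in a few lines. This is an illustration, not the tool's implementation: `find_regressions`, `gate_exit_code`, and the `tolerance` parameter are assumed names, while the grader means are the ones reported in the two demo runs.

```python
# Grader means from the two demo runs reported above.
BASELINE = {"grounding": 1.00, "hallucination": 1.00, "relevance": 0.99,
            "tool_call": 1.00, "answer_quality": 1.00}
CURRENT = {"grounding": 0.17, "hallucination": 0.58, "relevance": 0.58,
           "tool_call": 0.83, "answer_quality": 0.38}


def find_regressions(baseline, current, tolerance=0.05):
    """Return {grader: delta} for graders whose mean dropped by more than tolerance.

    `tolerance` is a hypothetical knob; the real tool may decide this differently.
    """
    deltas = {name: round(current[name] - baseline[name], 2) for name in baseline}
    return {name: delta for name, delta in deltas.items() if delta < -tolerance}


def gate_exit_code(regressions):
    # Non-zero exit status is what makes the CI job, and so the PR check, fail.
    return 1 if regressions else 0


regressions = find_regressions(BASELINE, CURRENT)
worst = min(regressions, key=regressions.get)
print(worst, regressions[worst])    # prints "grounding -0.83"
print(gate_exit_code(regressions))  # prints "1"
# In a CI script the last step would be: raise SystemExit(gate_exit_code(regressions))
```

With these numbers all five graders are flagged, and grounding is the largest drop, matching the −0.83 quoted in the report.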
Where the value actually is
The GitHub Action plumbing is the commodity part — free tools wire a workflow. The part that decides whether the gate is worth anything is the judgment underneath it: which production failures become golden cases, and how each grader is calibrated so it fails on a real regression and stays quiet on a harmless rewrite. A gate calibrated wrong is worse than no gate — it cries wolf until someone disables it. That calibration is the work I sell; the tool is the floor it stands on.
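One way to picture that calibration, offered as an illustrative sketch and not as a description of how llm-eval-ci or the engagement actually does it: score a grader on answers labeled by hand as harmless rewrites and as real regressions, then pick a pass threshold that sits between the two groups. The scores below are made-up placeholders, and `pick_threshold` is a hypothetical helper.

```python
def pick_threshold(harmless_scores, regression_scores):
    """Return a midpoint threshold if the two labeled groups separate cleanly.

    Returns None when the groups overlap, meaning the grader cannot yet tell
    a harmless rewrite from a real regression and needs rework, not a threshold.
    """
    lowest_harmless = min(harmless_scores)
    highest_regression = max(regression_scores)
    if highest_regression >= lowest_harmless:
        return None
    return (lowest_harmless + highest_regression) / 2


# Placeholder scores, invented purely for illustration.
harmless = [0.92, 0.88, 0.95]      # reworded but still correct answers
regressions = [0.40, 0.15, 0.55]   # answers that invent or drop facts
threshold = pick_threshold(harmless, regressions)
print(threshold is not None and 0.55 < threshold < 0.88)  # prints "True"
```

A threshold chosen this way fails every labeled regression and passes every labeled harmless rewrite. When the helper returns None, the two groups overlap, so any threshold would either cry wolf or miss a real drop.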
The engagement
You can take llm-eval-ci off GitHub and wire it yourself — it’s MIT, that’s the point. What I sell is the part the README can’t hand you: I sit with your real production traces, decide which failures become the golden set, calibrate the graders to your task so the gate fails on what matters and not on noise, and hand back a working CI gate plus a runbook so your team extends the set without me. That’s the LLM Eval Sprint — one to two weeks, fixed scope. The deliverable is a quality gate you own, with the judgment baked in.
What it’s not
It’s not a hosted eval dashboard, not a model leaderboard, and not a guarantee your model is good — it’s a guarantee it didn’t get worse on the cases you’ve decided are non-negotiable. It’s the regression gate I’d want in front of any LLM product before it ships, and it’s licensed MIT so you can take it.
Want a quality gate on your LLM product?
The tool is free — the source is on GitHub (MIT). Want the golden set curated from your real failures and the graders calibrated to your task, shipped as a CI gate your team owns? Book the Audit Sprint — $1,500 to scope it, or email omar@neurascale.org about the LLM Eval Sprint ($2.5–6K, 1–2 weeks).
// stack on file
- Python
- GitHub Action
- LLM-as-judge
- MIT
- one dependency