llm-eval-ci: the gate that fails the PR when your LLM quietly regresses
Every part of an LLM product is gated except the part that matters. llm-eval-ci is the open-source CI quality gate I shipped today: a golden set scored by six graders, exit 1 when answer quality drops. In the demo, a harmless-looking prompt rewrite falls from 100% to 17% and the build goes red.
- receipts
- evals
- llm-quality-gate
- ci
- open-source
- production-ai
Every part of an LLM product is gated except the part that matters. Types compile or the build fails. Unit tests run on every PR. Linters block a stray import. Then someone edits the system prompt to make the bot "friendlier," the diff is three lines of prose, CI goes green, and the thing ships. Two days later it is quoting a refund window that does not exist. Nothing failed, because nothing was watching the only output that counts: the answer.
I shipped a small open-source tool today for exactly that gap. It is called llm-eval-ci: a CI quality gate that scores answer quality against a golden set curated from real production traces, and exits 1 when quality drops so the bad PR cannot merge. Released v0.1.0, MIT, about 670 lines of Python, one runtime dependency. This is the launch writeup, and the honest part comes first: it went out today, it has no external users yet, and it has no moat. The edge is craft and speed, not the code. I will get to why I shipped it anyway.
01 · Why silent regressions ship
A prompt change is a code change that does not look like one. You are not touching a function signature. You are rewording an instruction, adding an example, softening a tone. The model reads that differently than you intended, and the failure is not a crash. It is a quietly worse answer: a fact dropped, a forbidden claim added, a tool call skipped, a format contract broken. None of those throw an exception. None of them turn CI red.
So the part most likely to regress, the prompt, is the part nothing gates. Teams catch it the slow way: a user complains, support escalates, someone bisects the prompt history by hand, and by then it has been live for days. I have watched this on my own products, which is the only reason I trust the diagnosis. The fix is not heroics. It is to put a gate in front of the answer, the way there is already one in front of the types.
02 · What the tool does
llm-eval-ci is a regression gate around a golden set you curate. You collect real failures and good answers from production traces, write them down as cases, and the gate scores every PR against them. One golden set, six graders, a GitHub Action that exits 1 on a drop. Each grader catches a different failure mode:
grounding: required facts present
Checks that the answer contains the facts it was supposed to contain. The single most common production regression is a dropped fact: the model summarizes and quietly omits the one number that mattered. Grounding fails when a required fact goes missing.
hallucination: forbidden claims absent
The mirror image. Checks that the answer does not contain claims it must never make. This is the grader that catches the invented refund window, the made-up policy, the price the product does not charge.
relevance: on the question asked
Catches drift: the answer is fluent and confident and about the wrong thing. A "helpful" rewrite often regresses here first, padding the response with adjacent material the user never asked for.
tool_call: the call it should have made
For agentic surfaces. A prompt edit that makes the model chattier often makes it answer from its own head instead of calling the function. This grader is what notices.
format: the output contract
The JSON shape, the required fields, the format downstream code depends on. Cheap to check, expensive to miss, because a broken contract breaks whatever parses the output next.
rubric: the LLM-as-judge for what only semantics catch
The judge for what rules cannot express: tone, faithfulness, whether the answer is actually good and not just structurally valid. It needs a real model behind it, and it is the one most worth calibrating carefully.
A case passes only when all six pass. That is the deliberate part. A partial pass is a fail, because in production a partial pass is a regression with a green light.
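To make the all-or-nothing rule concrete, here is a minimal sketch of three rule-based graders and the pass condition. It is an illustration under stated assumptions, not llm-eval-ci's actual code: the `GoldenCase` shape, the function names, and the substring matching are all hypothetical, and the refund policy in the example is made up.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """Hypothetical case shape for illustration; not the tool's real schema."""
    question: str
    required_facts: list[str] = field(default_factory=list)
    forbidden_claims: list[str] = field(default_factory=list)
    required_fields: list[str] = field(default_factory=list)


def grade_grounding(case: GoldenCase, answer: str) -> bool:
    # Pass only if every required fact appears in the answer.
    text = answer.lower()
    return all(fact.lower() in text for fact in case.required_facts)


def grade_hallucination(case: GoldenCase, answer: str) -> bool:
    # Pass only if no forbidden claim appears in the answer.
    text = answer.lower()
    return not any(claim.lower() in text for claim in case.forbidden_claims)


def grade_format(case: GoldenCase, answer: str) -> bool:
    # Pass only if the answer is a JSON object carrying every required field.
    try:
        payload = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in case.required_fields)


# Three of the six graders; relevance, tool_call, and rubric are omitted here.
GRADERS = {
    "grounding": grade_grounding,
    "hallucination": grade_hallucination,
    "format": grade_format,
}


def grade_case(case: GoldenCase, answer: str) -> dict[str, bool]:
    return {name: grader(case, answer) for name, grader in GRADERS.items()}


def case_passes(case: GoldenCase, answer: str) -> bool:
    # A partial pass is a fail: every grader must pass.
    return all(grade_case(case, answer).values())


case = GoldenCase(
    question="What is the refund window?",
    required_facts=["30 days"],      # made-up policy for the example
    forbidden_claims=["90 days"],
    required_fields=["answer"],
)
partial = '{"answer": "Refunds within 30 days, or 90 days for members."}'
print(grade_case(case, partial))   # grounding and format pass, hallucination fails
print(case_passes(case, partial))  # False
```

The `partial` answer is the interesting one: it contains the required fact and honors the format contract, and it still fails the case because it also makes a forbidden claim.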
03 · The judge runs offline by default, and why that matters
The decision I am most sure about: the rubric judge is offline-deterministic by default. No API key required, no network call in the default path, no rate-limit flakes in CI.
That sounds like a downgrade. It is not. A CI gate has one job: to be trustworthy enough that nobody routes around it. The fastest way to lose that trust is flakiness. An eval step that needs an API key fails on a forked PR that has no secrets. One that calls a model on every run inherits that provider's rate limits, and a 429 on their side becomes a red build on yours. Engineers learn within a week that a red build "might just be the eval being flaky," and the moment they learn that, the gate is dead. They retry it, skip it, merge anyway.
Offline-deterministic means the gate gives the same answer every time for the same input, with no external dependency. The OpenAI and Anthropic backends are opt-in: when you want real semantic grading on the rubric, you turn them on and supply a key. But the default path runs anywhere, including a fork with zero secrets, and it never flakes. The boring choice is the correct one. A gate that cries wolf gets disabled, and a disabled gate catches nothing.
04 · The demo, with the actual numbers
The repo ships a runnable demo. Running bash scripts/demo.sh puts two versions of a support bot through the gate. These are the verified outputs, not illustrations.
v1 is a grounded support bot that answers from its sources and makes the tool call it should. It passes all six graders on all six cases: 100%, 6 of 6. The gate writes the baseline and exits 0. Green build, merge allowed.
v2 is the "helpful" rewrite, the kind of well-meaning prompt edit that ships silent regressions in real life. It silently invents a refund window the product does not have, quotes the wrong price, and drops a tool call. The gate scores it 17%, 1 of 6 cases passing, and exits 1. Red build, merge blocked.
// v1 vs v2 · cases passing the gate
| same bot, one prompt rewrite apart | percent of cases passing |
|---|---|
| v1 (grounded) | 100% |
| v2 (helpful rewrite) | 17% |
The report does not just say "fail." It names which graders broke and by how much. Grounding is the worst hit at −0.83 against the baseline: the required facts went missing where v2 invented a refund window and quoted the wrong price. Hallucination, relevance, tool_call, and the rubric judge are all down too. So the failure is not a number you have to go investigate. It is a diagnosis: the rewrite dropped facts, added a forbidden claim, drifted off the question, and skipped a call. The tool's own unit tests pass 3 of 3, so the gate itself is not the thing that broke.
That gap, 100% to 17% on a change that looks harmless in a diff, is the whole argument. A human reviewer reading the v2 prompt would have approved it. It reads better. The gate is what stands between "reads better" and "is worse."
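The arithmetic behind those numbers is simple enough to sketch. The code below is an assumption about how such a gate could compute its verdict, not the tool's implementation, and the per-case results are invented to reproduce the headline figures (6 of 6 for the baseline, 1 of 6 for the rewrite, grounding down 0.83). They are not the demo's real per-case output, and only two graders are shown.

```python
def case_pass_rate(results: list[dict[str, bool]]) -> float:
    # A case counts only when every grader passed on it.
    return sum(all(r.values()) for r in results) / len(results)


def grader_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    # Per-grader pass rate across all cases.
    return {g: sum(r[g] for r in results) / len(results) for g in results[0]}


def gate(baseline: list[dict[str, bool]], current: list[dict[str, bool]]):
    base_rates, cur_rates = grader_rates(baseline), grader_rates(current)
    deltas = {g: round(cur_rates[g] - base_rates[g], 2) for g in base_rates}
    # Exit 1 when fewer cases pass than in the baseline, else 0.
    exit_code = 1 if case_pass_rate(current) < case_pass_rate(baseline) else 0
    return exit_code, deltas


ok = {"grounding": True, "hallucination": True}
baseline = [dict(ok) for _ in range(6)]
# Invented v2 results: one clean case, five grounding failures,
# two of which also trip the hallucination grader.
current = [dict(ok)] + [{"grounding": False, "hallucination": i >= 2} for i in range(5)]

exit_code, deltas = gate(baseline, current)
print(f"{case_pass_rate(current):.0%}")  # 17%
print(deltas["grounding"])               # -0.83
print(exit_code)                         # 1
# In CI the script would finish with sys.exit(exit_code) so the job fails.
```

One passing case out of six is 16.7%, which rounds to the 17% in the table, and five grounding failures out of six cases is the −0.83 drop.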
05 · The real work is curation and calibration
Here is the part a library cannot do for you, and the part I would not pretend otherwise about. The plumbing is the easy 670 lines. The value is two judgment calls that no package ships with.
The first is curation: choosing which production failures become golden cases. A golden set is not a random sample of traffic. It is the specific answers that, if they regress, you want the build to stop. That is a product decision dressed as an engineering one. Pick the wrong cases and the gate guards things that do not matter while the real failure mode walks through. You earn a good golden set by reading your own production traces and deciding what "worse" means for your product.
The second is calibration: tuning each grader so it fails on a real regression and stays quiet on a harmless rewrite. A miscalibrated gate is worse than no gate. Set the grounding threshold too tight and a legitimate rephrase trips it; the team learns the gate cries wolf and disables it. Set it too loose and the refund-window regression slides through green. The demo's graders are calibrated so that a regression like v2 fails loudly, at −0.83 on grounding; a harmless reword of the same answer should not move the score. Getting that line right, per grader, per product, is the work.
This is the same discipline I wrote about in Bridge Sourcing: scrape accuracy from 82% to 96%. There, the calibration set was the unlock for everything else, and it contributed exactly zero percentage points on its own. Same shape here. The golden set and the grader thresholds are the part that takes craft, and they are invisible in the line count. A package can hand you the graders. It cannot tell you which of your answers must never get worse.
06 · Honest scope, and when to use something else
llm-eval-ci stays in one lane: a regression gate around a golden set you curate. That is the entire product. It is not an eval platform.
The free tools in this space are good, and they do their mechanics well. DeepEval, promptfoo, and Ragas have more features than I do: metric catalogs, UIs, red-teaming, dataset tooling. If you want a broad metrics library, a dashboard to explore runs, or adversarial test generation, reach for one of those. I am not going to pretend my 670 lines beat their feature sets, because they do not.
What llm-eval-ci is, against those, is small and opinionated. One golden set, six graders, deterministic by default, one runtime dependency (PyYAML), runs on Python 3.10 through 3.12. It does the one thing, which is fail the PR when answer quality drops, and it tries to do that thing without flaking. Six roadmap issues are open against it. If the failure mode you are guarding is "a prompt edit quietly made the product worse and nothing caught it," this is the smallest tool that fixes it. If you need more than that, the others are genuinely better and I would point you at them first.
07 · The honest part
I said it up top and I will say it plainly here. This shipped today. It has no external users. It has no moat, because the code is 670 readable lines under MIT and anyone can fork it in an afternoon. The edge, such as it is, is craft and speed: knowing which production failures matter, calibrating the graders so the gate is trusted instead of disabled, and doing it in days rather than weeks. That is not defensible. It is just useful.
The tool is the artifact. The engagement is the work, and it is honest about being the work: I productized this as the AI Quality Gate, $3,500, one to two weeks, on the audit page. What you pay for is not the 670 lines, which are free on GitHub. What you pay for is the curation and the calibration: someone reading your production traces, deciding which answers must never regress, and tuning the gate so it fails on the real thing and stays quiet on the harmless rewrite. That is the part the library cannot do for you, and it is the part I have spent a while getting right on my own products.
If you ship an LLM product and nothing currently gates the answer, the smallest honest next step is the open-source repo: clone it, run bash scripts/demo.sh, watch v2 fail at 17% and exit 1, and decide whether that gap is one you can afford to keep shipping blind. If you would rather have the golden set built and the graders calibrated against your own traffic, the AI Quality Gate is that work, fixed scope, with the first version of the gate live by the end. Either way, the part that matters should be gated like the part that already is.