Maintaining a 10,992-question medical exam bank (and rejecting 37% of the fixes)
MedPrüf holds 10,992 active German-language medical exam questions, maintained from Egypt by a non-doctor. The quality lives in the correction pipeline: 91 suggested fixes triaged, 37% rejected on review, 67 human-reviewed AI explanations, and zero runtime AI calls served.
An exam-prep product is only as good as its question bank, and question banks rot. Answer keys are wrong from day one. Topic tags drift. A stem that read fine to its author is ambiguous to the tenth student who meets it. None of this announces itself. It burns trust quietly, one wrong question at a time.
MedPrüf is my exam-prep product for the Austrian medical licensing exams: 10,992 active questions, in German, across three exam tracks. I am not a doctor. This is German-language medical content maintained from Egypt by a non-doctor, which means the product's quality cannot rest on my authority. I have none to offer. It rests on process: vetted sources, structured ingestion, and a correction pipeline that rejected more than a third of the fixes users proposed.
That last number is the essay. 91 correction suggestions have come in since launch. 34 were rejected on review: 37%. I think that rejection rate is the most honest quality signal in the whole system, and most content products have nothing like it.
A method note, so a skeptic knows what they are reading: every number in this piece is the result of read-only SQL against the MedPrüf production database, run on 2026-06-10. No analytics layer, no estimates. Where a number needs a caveat, the caveat is attached in the same paragraph.
00 · How banks rot
A question bank decays in specific, boring ways:
- The answer key is wrong. The worst class, and the one users find fastest.
- The stem is ambiguous. Two options are defensible, and the marked one is not the one the real exam intends.
- The topic tag is wrong. The question surfaces in the wrong study plan and quietly skews every weak-spot analysis that touches it.
- The question is a near-duplicate. Same fact, slightly different wording, double-counted in practice stats.
- The image is missing or attached to the wrong stem. A question that needs its image and lacks it is not a question. It is a trap.
A wrong question is worse than no question. A student who catches one bad answer key starts doubting the other 10,991, and they are right to. That asymmetry is why the least visible layer of this product got the most engineering, and why this essay is about tables and queues instead of features.
01 · Ingestion: six files, eight days
The bank was ingested from 6 distinct source files in an 8-day window, 2026-04-04 to 2026-04-12. That produced 10,996 question rows, of which 10,992 are active today. By exam track:
- Kenntnisprüfung: 7,566 active questions
- KMP Innsbruck: 3,091
- Pharmakologie: 335
The three tracks sum to 10,992 exactly. Pharmakologie is small because the track is small; padding it to look balanced would be the wrong kind of growth.
Eight days for six files is slow if the goal is rows in a table. The goal was structure, because every future fix depends on it:
- 10,658 questions carry structured multiple-choice options as jsonb. Each option is an addressable object, not a letter buried in a text blob. You cannot fix option C three months later if option C does not exist as a thing the database can point at.
- 880 questions carry images, referenced individually, so an image problem is a row-level fix rather than a re-import.
- Every question landed in exactly one of 16 topics. Topics are what make a weak-spot study plan computable. They are also what turns mis-tagging into a visible, fixable error class instead of background noise.
- Imports de-duplicate on normalized content, so the same question arriving through two source files lands once.
The dry summary: strictness at ingestion is what makes maintenance cheap. Each structural choice above exists so that a user report later becomes a five-minute row edit instead of archaeology.
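A minimal sketch of two of those structural choices, de-duplicating on normalized content and editing one option by its id. This is not MedPrüf's actual code: the normalization rules, the `content_key` hash, the dictionary-backed `bank`, and the example question are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, case-fold, and collapse whitespace (one possible rule set)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def content_key(stem: str, options: list[dict]) -> str:
    """Hash the normalized stem plus the normalized option texts, in order."""
    parts = [normalize(stem)] + [normalize(o["text"]) for o in options]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def ingest(bank: dict[str, dict], stem: str, options: list[dict]) -> bool:
    """Insert a question unless its normalized content is already in the bank."""
    key = content_key(stem, options)
    if key in bank:
        return False  # same question arriving through a second source file
    bank[key] = {"stem": stem, "options": options, "is_active": True}
    return True


def fix_option(question: dict, option_id: str, new_text: str) -> None:
    """Row-level fix: address one option by id instead of re-importing the file."""
    for option in question["options"]:
        if option["id"] == option_id:
            option["text"] = new_text
            return
    raise KeyError(option_id)


# Hypothetical example question, present in two source files with different spacing and case.
bank: dict[str, dict] = {}
opts = [{"id": "A", "text": "Insulin"}, {"id": "B", "text": "Glukagon"}]
first = ingest(bank, "Welches Hormon senkt den Blutzucker?", opts)
second = ingest(bank, "  welches hormon  senkt den Blutzucker? ", [dict(o) for o in opts])
```

Here `first` is accepted and `second` is skipped as a duplicate. A real pipeline would also have to decide whether an edited question gets a new key; the sketch leaves that open.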
02 · The correction queue: 91 in, 34 rejected
Post-launch maintenance runs through two queues, both plain database tables fed from inside the product.
The first is question-fix suggestions: a specific question, a described problem, usually a proposed correction. 91 have landed. The outcomes as of 2026-06-10:
// question-fix suggestions by outcome
| Outcome (91 suggestions, as of 2026-06-10) | Suggestions |
|---|---|
| Approved | 38 |
| Modified | 12 |
| Rejected | 34 |
| Pending | 7 |
84 of 91 processed: 92%, with 7 pending. The healthy part is not the processing rate. It is the shape of the outcomes. 38 approved as proposed. 12 modified, meaning the report was right but the proposed fix was not, so what shipped differs from what was suggested. 34 rejected outright.
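The percentages can be re-derived from the four counts in the table. The sketch below does only that arithmetic; the variable names are illustrative. It also shows that the rejected share depends on the denominator: measured against reviewed suggestions only, it comes out slightly higher than against everything filed.

```python
# Counts copied from the outcomes table above.
outcomes = {"approved": 38, "modified": 12, "rejected": 34, "pending": 7}

total = sum(outcomes.values())            # every suggestion filed
processed = total - outcomes["pending"]   # suggestions that reached a decision


def pct(part: int, whole: int) -> int:
    """Whole-number percentage, rounded."""
    return round(100 * part / whole)


processed_share = pct(processed, total)                      # 84 of 91
rejected_of_all = pct(outcomes["rejected"], total)           # 34 of 91
rejected_of_processed = pct(outcomes["rejected"], processed) # 34 of 84

print(total, processed, processed_share, rejected_of_all, rejected_of_processed)
# prints: 91 84 92 37 40
```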
The second queue is general feedback, less structured: 45 items, of which 37 were reviewed and 4 dismissed (91% processed), with 4 still new.
Around the queues sits the rest of the maintenance trail, same database, same query session: 19 questions reclassified into a different topic, 4 questions deactivated, 3 archived. Deactivation keeps the row and kills the exposure: the question stops appearing in practice, while the answer history of everyone who ever met it stays intact.
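Deactivation as described is a soft delete. A minimal sketch, assuming a hypothetical two-table schema (`questions` with an `is_active` flag, `answers` for history); the table and column names are invented here, and SQLite is used only so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (
        id INTEGER PRIMARY KEY,
        stem TEXT NOT NULL,
        is_active INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE answers (
        user_id INTEGER NOT NULL,
        question_id INTEGER NOT NULL REFERENCES questions(id),
        correct INTEGER NOT NULL
    );
    INSERT INTO questions (id, stem) VALUES (1, 'q1'), (2, 'q2');
    INSERT INTO answers VALUES (10, 1, 1), (10, 2, 0), (11, 2, 1);
""")


def deactivate(conn: sqlite3.Connection, question_id: int) -> None:
    """Flip the flag; the row and everything that references it stay in place."""
    conn.execute("UPDATE questions SET is_active = 0 WHERE id = ?", (question_id,))
    conn.commit()


def practice_pool(conn: sqlite3.Connection) -> list[int]:
    """Only active questions are eligible to be shown in practice."""
    rows = conn.execute("SELECT id FROM questions WHERE is_active = 1 ORDER BY id")
    return [row[0] for row in rows]


def history_count(conn: sqlite3.Connection, question_id: int) -> int:
    """Answer history is untouched by deactivation."""
    cur = conn.execute("SELECT COUNT(*) FROM answers WHERE question_id = ?", (question_id,))
    return cur.fetchone()[0]


deactivate(conn, 2)
```

After the call, question 2 no longer appears in `practice_pool(conn)`, while `history_count(conn, 2)` still returns both recorded answers.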
03 · Why the 37% matters
Run the counterfactual. Same queue, same 91 suggestions, but an operator who approves everything. The dashboard improves: 100% approval, a faster queue, zero arguments. The bank degrades, because user suggestions about medical content are frequently wrong in both directions: error reports about questions that are correct, and proposed fixes that would replace a right answer with a confident wrong one.
The decision tree behind the three outcomes is short enough to state in full:
- Approve when the report survives verification against the question's source material and the proposed fix is exactly right.
- Modify when the problem is real but the proposed repair is wrong or partial. These 12 are the strongest argument against rubber-stamping: someone correctly found a defect and proposed an incorrect fix. Auto-merging those swaps one error for another and marks the question "fixed", which is strictly worse than leaving it flagged.
- Reject when the report fails verification, contradicts the source, or trades one ambiguity for a different one.
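The same tree as a function, simplified to two judgments a reviewer makes. The names `triage`, `report_verified`, and `fix_exact` are hypothetical, and the hard part (verifying against the source) stays a human step that this sketch takes as input.

```python
from enum import Enum


class Outcome(Enum):
    APPROVED = "approved"
    MODIFIED = "modified"
    REJECTED = "rejected"


def triage(report_verified: bool, fix_exact: bool) -> Outcome:
    """Map a reviewer's two judgments onto the three outcomes.

    report_verified: the reported problem survives a check against the source.
    fix_exact: the proposed correction is exactly right as written.
    """
    if not report_verified:
        # Fails verification or contradicts the source: nothing changes.
        return Outcome.REJECTED
    if fix_exact:
        return Outcome.APPROVED
    # Real defect, wrong or partial repair: ship a different fix.
    return Outcome.MODIFIED
```

Note that a proposed fix is never evaluated on its own: an unverified report is rejected even if its suggested wording looks plausible.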
The bar for changing a live medical question is the same bar as ingesting it: verified against source, structurally valid, unambiguous. Filing a complaint is free. Making a change is not. 34 rejections across 84 completed reviews, 37% of everything filed, is what that bar looks like in practice.
To be precise about what 37% is not: it is not a target. I did not set out to reject a third of the suggestions, and pushing the rate to 60% would not make the bank better; it might only mean the reporting UX invites junk. The number is an output, not a dial. What it certifies is narrow: review happens, and review has teeth.
04 · The AI restraint: 67 explanations, zero runtime calls
Given what I do for a living, this is the part that surprises people. MedPrüf has never served a runtime AI call. Not few. Zero rows in the log that would record them.
67 of the 10,992 active questions, 0.6%, carry an AI-generated explanation. Every one was generated offline and human-reviewed before it shipped. At that point it stops being model output and becomes content: a static, versioned explanation that passed the same gate as the questions themselves.
The reasoning is the asymmetry from section 00. A live LLM answer inside a medical exam product is an unreviewed answer delivered with full confidence, attached to a question the student trusts the product to get right. The failure mode is not an awkward chat moment. It is a wrong explanation that reads exactly like a right one, served to someone preparing for the exam that decides their career. Pre-generation moves the model out of the request path and into the content pipeline, where review happens before exposure instead of after.
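One way to picture the split between content pipeline and request path, as a sketch under assumptions: the `Explanation` record, the `reviewed` flag, and both function names are invented for illustration and do not describe MedPrüf's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Explanation:
    question_id: int
    text: str
    version: int
    reviewed: bool  # set by a human reviewer, offline


def publishable(drafts: list[Explanation]) -> dict[int, Explanation]:
    """Offline step: only human-reviewed drafts become served content."""
    return {d.question_id: d for d in drafts if d.reviewed}


def explanation_for(store: dict[int, Explanation], question_id: int) -> Optional[str]:
    """Request path: a plain lookup. No model call and no fallback generation."""
    found = store.get(question_id)
    return found.text if found is not None else None


# Hypothetical drafts: one reviewed, one still waiting for review.
drafts = [
    Explanation(question_id=1, text="Reviewed explanation.", version=1, reviewed=True),
    Explanation(question_id=2, text="Unreviewed draft.", version=1, reviewed=False),
]
store = publishable(drafts)
```

A question without a reviewed explanation simply returns `None`, so the student sees no explanation instead of an unreviewed one.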
If you read my position on what changed in production AI, this is the same opinion from the other side. Knowing where the model goes is the job. Sometimes the right amount of runtime AI in a product, today, is none.
05 · What I can and cannot claim
Two context numbers, caveats attached.
Usage exists, but it is launch-weighted: 73 distinct users recorded 12,650 practice answers between April 5 and June 8, most of that in launch month. I am deliberately not quoting monthly curves; a launch spike presented as a steady state is its own kind of wrong answer key. The related row-level fact: per-user, per-question progress sits in 9,977 rows, which is part of what the deactivation design protects. Kill a question, keep the history.
And the claim I cannot make: the exam_attempts table has 0 rows. No verified exam outcomes exist in this database. I do not know how many MedPrüf users went on to pass the Kenntnisprüfung, and any sentence shaped like "helped N doctors pass" would be fiction.
06 · The shape that generalizes
Strip the medicine out and what remains applies to any content product: documentation, course platforms, knowledge bases, eval sets.
- Structure at ingestion. You cannot operate on content you cannot address. The jsonb options and the 16-topic taxonomy are MedPrüf's version; yours will differ, but "text blob plus vibes" is not a version.
- A queue users can route defects into. Decay is found by users, not authors. If reports land in email, they die in email.
- Review with a real rejection rate. The 37% is the receipt that the gate exists. Track the rate; if it sits at zero for a quarter, the gate is open.
- A taxonomy you actually re-file. 19 reclassifications since launch. A taxonomy nobody moves items between is a label, not a model.
- Willingness to remove. 4 deactivations and 3 archives. A bank that only grows is hiding its rot in the count.
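The "zero for a quarter" check from the list above can be automated. A minimal sketch, assuming review decisions are available as `(date, outcome)` pairs and taking a quarter as 90 days; the function names and the sample data are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

Review = tuple[date, str]  # (decision date, outcome label)


def rejection_rate(reviews: list[Review], since: date) -> Optional[float]:
    """Share of decisions on or after `since` that were rejections; None if there were none."""
    recent = [outcome for decided, outcome in reviews if decided >= since]
    if not recent:
        return None
    return sum(1 for outcome in recent if outcome == "rejected") / len(recent)


def gate_looks_open(reviews: list[Review], today: date, window_days: int = 90) -> bool:
    """True when reviews happened in the window but none of them was a rejection."""
    rate = rejection_rate(reviews, today - timedelta(days=window_days))
    return rate == 0.0  # an empty window returns None, which is a different problem


# Hypothetical data: one rejection long ago, only approvals recently.
reviews = [
    (date(2026, 1, 5), "rejected"),
    (date(2026, 5, 1), "approved"),
    (date(2026, 5, 20), "approved"),
]
alarm = gate_looks_open(reviews, today=date(2026, 6, 10))
```

With this data `alarm` is true: the old rejection falls outside the 90-day window, and everything inside it was approved.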
This is the same discipline as eval work, pointed at content instead of model output. llm-eval-ci, the open-source gate I built, turns production failures into a golden set and fails the PR when answer quality drops. The correction queue here is the same machine: user reports are candidate canon, review decides what enters, and the rejection rate tells you whether the gate is real. The Bridge Sourcing accuracy writeup is a third instance of the pattern, where the gate was a hand-labeled calibration set. Three products, one lesson: the unglamorous review loop is the product.
The full product context (what the bank feeds: spaced repetition, the exam simulator, weak-spot study plans) lives on the MedPrüf case page, and the product itself is at medpruf.com. The question count on that case page is pulled live from the same database this essay queried, so by the time you read this, 10,992 will probably have drifted. The pipeline above is why it drifts slowly, and on purpose.