Receipts · ~10 min read · filed 2026-06-10

Maintaining a 10,992-question medical exam bank (and rejecting 37% of the fixes)

MedPrüf holds 10,992 active German-language medical exam questions, maintained from Egypt by a non-doctor. The quality lives in the correction pipeline: 91 suggested fixes received, 34 of them (37%) rejected on review, 67 human-reviewed AI explanations, and zero runtime AI calls served.

37%

of suggested fixes rejected · 34 of 91

receipts
content-ops
data-quality
medpruef
review-pipeline
production-ai

// TL;DR · abstract

10,992 active questions across 3 Austrian exam tracks, ingested from 6 source files in 8 days. Every number is read-only SQL on the production database, run 2026-06-10.
Of 91 user-suggested fixes: 38 approved, 12 modified, 34 rejected, 7 still pending. The 37% rejection rate is the quality signal; a queue that approves everything is rubber-stamping.
19 topic reclassifications, 4 deactivations, and 3 archived questions since launch. Maintenance leaves a queryable paper trail.
Only 67 questions (0.6%) carry an AI explanation, each pre-generated offline and human-reviewed. The ai_call_log table has 0 rows: no runtime AI call has ever been served.
No verified exam outcomes exist (exam_attempts = 0 rows), so no pass-rate claims. The claim is narrower: the bank is maintained, with receipts.

An exam-prep product is only as good as its question bank, and question banks rot. Answer keys are wrong from day one. Topic tags drift. A stem that read fine to its author is ambiguous to the tenth student who meets it. None of this announces itself. It burns trust quietly, one wrong question at a time.

MedPrüf is my exam-prep product for the Austrian medical licensing exams: 10,992 active questions, in German, across three exam tracks. I am not a doctor. This is German-language medical content maintained from Egypt by a non-doctor, which means the product's quality cannot rest on my authority. I have none to offer. It rests on process: vetted sources, structured ingestion, and a correction pipeline that rejected more than a third of the fixes users proposed.

That last number is the essay. 91 correction suggestions have come in since launch. 34 were rejected on review: 37%. I think that rejection rate is the most honest quality signal in the whole system, and most content products have nothing like it.

A method note, so a skeptic knows what they are reading: every number in this piece is the result of read-only SQL against the MedPrüf production database, run on 2026-06-10. No analytics layer, no estimates. Where a number needs a caveat, the caveat is attached in the same paragraph.

00 · How banks rot

A question bank decays in specific, boring ways:

The answer key is wrong. The worst class, and the one users find fastest.
The stem is ambiguous. Two options are defensible, and the marked one is not the one the real exam intends.
The topic tag is wrong. The question surfaces in the wrong study plan and quietly skews every weak-spot analysis that touches it.
The question is a near-duplicate. Same fact, slightly different wording, double-counted in practice stats.
The image is missing or attached to the wrong stem. A question that needs its image and lacks it is not a question. It is a trap.

A wrong question is worse than no question. A student who catches one bad answer key starts doubting the other 10,991, and they are right to. That asymmetry is why the least visible layer of this product got the most engineering, and why this essay is about tables and queues instead of features.

01 · Ingestion: six files, eight days

The bank was ingested from 6 distinct source files in an 8-day window, 2026-04-04 to 2026-04-12. That produced 10,996 question rows, of which 10,992 are active today. By exam track:

Kenntnisprüfung: 7,566 active questions
KMP Innsbruck: 3,091
Pharmakologie: 335

The three tracks sum to 10,992 exactly. Pharmakologie is small because the track is small; padding it to look balanced would be the wrong kind of growth.

Eight days for six files is slow if the goal is rows in a table. The goal was structure, because every future fix depends on it:

10,658 questions carry structured multiple-choice options as jsonb. Each option is an addressable object, not a letter buried in a text blob. You cannot fix option C three months later if option C does not exist as a thing the database can point at.
880 questions carry images, referenced individually, so an image problem is a row-level fix rather than a re-import.
Every question landed in exactly one of 16 topics. Topics are what make a weak-spot study plan computable. They are also what turns mis-tagging into a visible, fixable error class instead of background noise.
Imports de-duplicate on normalized content, so the same question arriving through two source files lands once.
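
A minimal sketch of de-duplicating on normalized content, as the last item describes. The normalization rules (Unicode form, case, whitespace), the inclusion of the options in the key, and every function name here are assumptions for illustration; the essay does not say how MedPrüf normalizes.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Collapse cosmetic differences so identical questions compare equal."""
    text = unicodedata.normalize("NFKC", text)  # one canonical form for umlauts etc.
    text = text.casefold()                      # case-insensitive comparison
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


def content_key(stem: str, options: list[str]) -> str:
    """Hash of the normalized stem plus normalized options, in their given order."""
    parts = [normalize(stem)] + [normalize(o) for o in options]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def ingest(rows, seen):
    """Yield each question once; `seen` carries keys across source files."""
    for stem, options in rows:
        key = content_key(stem, options)
        if key in seen:
            continue  # same question already landed from an earlier file
        seen.add(key)
        yield stem, options


# Hypothetical rows: the same question arriving through two source files.
seen = set()
file_a = [("Welche Aussage trifft zu?", ["Option A", "Option B"])]
file_b = [("welche  Aussage trifft zu? ", ["option a", "option b"])]
kept = list(ingest(file_a, seen)) + list(ingest(file_b, seen))
```

The key is order-sensitive on options, so a shuffled copy would count as a new question in this sketch; whether that is desirable depends on the sources.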

The dry summary: strictness at ingestion is what makes maintenance cheap. Each structural choice above exists so that a user report later becomes a five-minute row edit instead of archaeology.
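
To make "a five-minute row edit" concrete, here is a sketch of fixing one addressable option. The option shape (`key`, `text`, `correct`) is a hypothetical stand-in for whatever the jsonb column actually holds, and the texts are placeholders.

```python
import json


def fix_option(options: list[dict], key: str, new_text: str) -> list[dict]:
    """Return a copy of `options` with the text of option `key` replaced."""
    if not any(opt["key"] == key for opt in options):
        raise KeyError(f"no option {key!r} to fix")
    return [
        {**opt, "text": new_text} if opt["key"] == key else opt
        for opt in options
    ]


# Hypothetical stored value for one question's options.
stored = json.loads(
    '[{"key": "A", "text": "Option A", "correct": false},'
    ' {"key": "B", "text": "Option B", "correct": true},'
    ' {"key": "C", "text": "Optoin C", "correct": false}]'
)
fixed = fix_option(stored, "C", "Option C")
```

Because option C is an object with its own key, the fix touches one field and leaves the rest of the row alone. With options buried in a text blob, the same fix would mean parsing prose first.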

02 · The correction queue: 91 in, 34 rejected

Post-launch maintenance runs through two queues, both plain database tables fed from inside the product.

The first is question-fix suggestions: a specific question, a described problem, usually a proposed correction. 91 have landed. The outcomes as of 2026-06-10:

// question-fix suggestions by outcome

84 of 91 processed (92%), counted by read-only SQL on the production database. The bar that matters is the third one: 34 of 91 suggestions, 37%, rejected on review.

91 suggestions · outcomes as of 2026-06-10

Outcome	Suggestions
Approved	38
Modified	12
Rejected	34
Pending	7

84 of 91 processed: 92%, with 7 pending. The healthy part is not the processing rate. It is the shape of the outcomes. 38 approved as proposed. 12 modified, meaning the report was right but the proposed fix was not, so what shipped differs from what was suggested. 34 rejected outright.
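
The arithmetic behind those percentages, as a small check anyone can rerun. The counts are the ones reported above; the variable names are illustrative.

```python
outcomes = {"approved": 38, "modified": 12, "rejected": 34, "pending": 7}

total = sum(outcomes.values())            # every suggestion received
processed = total - outcomes["pending"]   # everything that got a decision

processed_rate = processed / total
rejected_of_all = outcomes["rejected"] / total
rejected_of_reviewed = outcomes["rejected"] / processed

print(f"{processed} of {total} processed ({processed_rate:.0%})")
print(f"rejected: {rejected_of_all:.0%} of all, {rejected_of_reviewed:.0%} of reviewed")
```

The 37% quoted throughout uses all 91 suggestions as the denominator. Against only the 84 reviewed, the same 34 rejections come to roughly 40%.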

The second queue is general feedback, less structured: 45 items, of which 37 were reviewed and 4 dismissed (91% processed), with 4 still new.

Around the queues sits the rest of the maintenance trail, same database, same query session: 19 questions reclassified into a different topic, 4 questions deactivated, 3 archived. Deactivation keeps the row and kills the exposure: the question stops appearing in practice, while the answer history of everyone who ever met it stays intact.
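
A sketch of "keeps the row and kills the exposure", using an in-memory SQLite database so it runs anywhere. The table and column names (`questions`, `answers`, `is_active`) are assumptions, not MedPrüf's actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE questions (
        id INTEGER PRIMARY KEY,
        stem TEXT NOT NULL,
        is_active INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE answers (
        user_id INTEGER NOT NULL,
        question_id INTEGER NOT NULL REFERENCES questions(id),
        correct INTEGER NOT NULL
    );
    INSERT INTO questions (id, stem) VALUES (1, 'placeholder stem 1'), (2, 'placeholder stem 2');
    INSERT INTO answers VALUES (10, 1, 1), (10, 2, 0), (11, 2, 1);
""")

# Deactivate question 2: flip a flag, delete nothing.
db.execute("UPDATE questions SET is_active = 0 WHERE id = ?", (2,))

# Practice only ever selects active questions...
servable = [row[0] for row in db.execute(
    "SELECT id FROM questions WHERE is_active = 1 ORDER BY id"
)]
# ...while the answer history for the deactivated one is untouched.
history = db.execute(
    "SELECT COUNT(*) FROM answers WHERE question_id = ?", (2,)
).fetchone()[0]
```

A hard delete would either orphan the answer rows or cascade them away. The flag avoids both.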

03 · Why the 37% matters

Run the counterfactual. Same queue, same 91 suggestions, but an operator who approves everything. The dashboard improves: 100% approval, a faster queue, zero arguments. The bank degrades, because user suggestions about medical content are frequently wrong in both directions: error reports about questions that are correct, and proposed fixes that would replace a right answer with a confident wrong one.

The decision tree behind the three outcomes is short enough to state in full:

Approve when the report survives verification against the question's source material and the proposed fix is exactly right.
Modify when the problem is real but the proposed repair is wrong or partial. These 12 are the strongest argument against rubber-stamping: someone correctly found a defect and proposed an incorrect fix. Auto-merging those swaps one error for another and marks the question "fixed", which is strictly worse than leaving it flagged.
Reject when the report fails verification, contradicts the source, or trades one ambiguity for a different one.
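
The three rules above reduce to two judgments, sketched here as a function. Both inputs are human calls made against source material, not anything computed; the names are hypothetical, and folding "contradicts the source" and "trades one ambiguity for another" into a failed verification is a simplifying assumption.

```python
from enum import Enum


class Outcome(Enum):
    APPROVE = "approved"
    MODIFY = "modified"
    REJECT = "rejected"


def triage(report_verified: bool, fix_exact: bool) -> Outcome:
    """Map a reviewer's two judgments onto the three outcomes.

    report_verified: the reported problem survives a check against the source.
    fix_exact: the proposed correction is exactly right as written.
    """
    if not report_verified:
        return Outcome.REJECT   # no real defect, whatever fix was proposed
    if fix_exact:
        return Outcome.APPROVE  # real defect, correct repair
    return Outcome.MODIFY       # real defect, wrong or partial repair
```

Note that `fix_exact` is never consulted when the report fails verification. A plausible-looking fix for a non-problem is still a rejection.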

The bar for changing a live medical question is the same bar as ingesting it: verified against source, structurally valid, unambiguous. Filing a complaint is free. Making a change is not. Applied 84 times so far, that bar has produced 34 rejections: 37% of everything submitted, 40% of everything reviewed.

To be precise about what 37% is not: it is not a target. I did not set out to reject a third of the suggestions, and pushing the rate to 60% would not make the bank better; it might only mean the reporting UX invites junk. The number is an output, not a dial. What it certifies is narrow: review happens, and review has teeth.

04 · The AI restraint: 67 explanations, zero runtime calls

0

runtime AI calls served · ai_call_log row count

The table that would log a live model call to a user holds zero rows. Same read-only SQL session, 2026-06-10.

Given what I do for a living, this is the part that surprises people. MedPrüf has never served a runtime AI call. Not few. Zero rows in the log that would record them.

67 of the 10,992 active questions, 0.6%, carry an AI-generated explanation. Every one was generated offline and human-reviewed before it shipped. At that point it stops being model output and becomes content: a static, versioned explanation that passed the same gate as the questions themselves.

The reasoning is the asymmetry from section 00. A live LLM answer inside a medical exam product is an unreviewed answer delivered with full confidence, attached to a question the student trusts the product to get right. The failure mode is not an awkward chat moment. It is a wrong explanation that reads exactly like a right one, served to someone preparing for the exam that decides their career. Pre-generation moves the model out of the request path and into the content pipeline, where review happens before exposure instead of after.
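
A sketch of what "out of the request path and into the content pipeline" can look like in code. The `Explanation` type, the `reviewed` flag, and both function names are assumptions for illustration; the point is only that the serving function contains no model call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Explanation:
    question_id: int
    text: str
    reviewed: bool  # set offline by a human reviewer


def publish(drafts: list[Explanation]) -> dict[int, Explanation]:
    """Content-pipeline step: only human-reviewed drafts become servable."""
    return {d.question_id: d for d in drafts if d.reviewed}


def get_explanation(store: dict[int, Explanation], question_id: int) -> Optional[str]:
    """Request-path step: a plain lookup, with no fallback generation."""
    hit = store.get(question_id)
    return hit.text if hit else None


# Hypothetical drafts: one passed review, one did not.
drafts = [
    Explanation(1, "placeholder explanation", reviewed=True),
    Explanation(2, "unreviewed draft", reviewed=False),
]
store = publish(drafts)
```

A question without a reviewed explanation returns nothing rather than triggering a live generation. That absence is the design.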

If you read my position on what changed in production AI, this is the same opinion from the other side. Knowing where the model goes is the job. Sometimes the right amount of runtime AI in a product, today, is none.

05 · What I can and cannot claim

Two context numbers, caveats attached.

Usage exists, but it is launch-weighted: 73 distinct users recorded 12,650 practice answers between April 5 and June 8, most of that in launch month. I am deliberately not quoting monthly curves; a launch spike presented as a steady state is its own kind of wrong answer key. The related row-level fact: per-user, per-question progress sits in 9,977 rows, which is part of what the deactivation design protects. Kill a question, keep the history.

And the claim I cannot make: the exam_attempts table has 0 rows. No verified exam outcomes exist in this database. I do not know how many MedPrüf users went on to pass the Kenntnisprüfung, and any sentence shaped like "helped N doctors pass" would be fiction.

06 · The shape that generalizes

Strip the medicine out and what remains applies to any content product: documentation, course platforms, knowledge bases, eval sets.

Structure at ingestion. You cannot operate on content you cannot address. The jsonb options and the 16-topic taxonomy are MedPrüf's version; yours will differ, but "text blob plus vibes" is not a version.
A queue users can route defects into. Decay is found by users, not authors. If reports land in email, they die in email.
Review with a real rejection rate. The 37% is the receipt that the gate exists. Track the rate; if it sits at zero for a quarter, the gate is open.
A taxonomy you actually re-file. 19 reclassifications since launch. A taxonomy nobody moves items between is a label, not a model.
Willingness to remove. 4 deactivations and 3 archives. A bank that only grows is hiding its rot in the count.
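
"Track the rate; if it sits at zero for a quarter" can be automated in a few lines. This is a sketch under assumptions: decisions arrive as `(decided_on, outcome)` pairs, a quarter is taken as 90 days, and the dates in the example are made up.

```python
from collections import Counter
from datetime import date, timedelta


def rejection_rate(decisions, since: date):
    """Share of reviewed suggestions rejected on or after `since`.

    Returns None when nothing was reviewed in the window.
    """
    recent = Counter(outcome for decided_on, outcome in decisions if decided_on >= since)
    reviewed = sum(recent.values())
    return recent["rejected"] / reviewed if reviewed else None


def gate_looks_open(decisions, today: date, window_days: int = 90) -> bool:
    """True when reviews happened in the window and none ended in rejection."""
    rate = rejection_rate(decisions, today - timedelta(days=window_days))
    return rate == 0  # None (no reviews at all) is a different problem, so False


# Hypothetical decision log.
all_approved = [(date(2026, 5, 1), "approved"), (date(2026, 5, 20), "modified")]
with_rejection = all_approved + [(date(2026, 6, 1), "rejected")]
```

An empty window deliberately does not trip the check. No reviews at all means the queue is stalled, which deserves its own alarm.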

This is the same discipline as eval work, pointed at content instead of model output. llm-eval-ci, the open-source gate I built, turns production failures into a golden set and fails the PR when answer quality drops. The correction queue here is the same machine: user reports are candidate canon, review decides what enters, and the rejection rate tells you whether the gate is real. The Bridge Sourcing accuracy writeup is a third instance of the pattern, where the gate was a hand-labeled calibration set. Three products, one lesson: the unglamorous review loop is the product.

The full product context (what the bank feeds: spaced repetition, the exam simulator, weak-spot study plans) lives on the MedPrüf case page, and the product itself is at medpruf.com. The question count on that case page is pulled live from the same database this essay queried, so by the time you read this, 10,992 will probably have drifted. The pipeline above is why it drifts slowly, and on purpose.

// sources cited

https://medpruf.com
