MedPrüf — AI exam prep for Vienna's Kenntnisprüfung
Foreign-trained doctors in Austria take the Kenntnisprüfung — a brutal medical licensing exam gating their right to practice. Existing prep is paper question banks, German-only, no explanations. We built the opposite: 7,725 active questions, 1,615 of them AI-explained, German/English bilingual, shipped as a web app and a Telegram Mini App. I'm the engineer; Wael is the product owner and medical expert.
The problem
Austria needs doctors. Doctors trained outside the EU need to pass the Kenntnisprüfung to practice. The exam covers 10 specialties across two structures (SIP 4 and SIP 5), with thousands of possible questions drawn from a pool. Failure rates are high. Candidates often study for 6–12 months.
The options they had before we started:
- Paper question banks. Static, German-only, no feedback when you get it wrong, no spaced repetition.
- Prep schools. Expensive (€2k–€5k), Vienna-only, schedule-bound.
- Generic med-student apps. Not specific to Kenntnisprüfung. Wrong question distribution, wrong language, wrong clinical context.
Wael brought the clinical expertise and the question bank. My job was to turn it into a product that works, scales, and teaches as well as it tests.
Architecture
Two things made this more interesting than a typical exam-prep app:
- The question bank is a living dataset. 15,439 questions in the raw DB, 7,725 active after deduplication and classification, 1,806 with multiple correct answers, 952 with embedded images. Every question is in two languages. Every explanation is generated once and cached.
- The primary surface is Telegram. Not by accident — the audience skews toward WhatsApp/Telegram-native users. A Telegram Mini App lets them study without installing anything, and a Telegram bot handles daily drill reminders with deep-link auth.
The stack
The explanation pipeline
The magic feature isn't the question bank. It's the per-question AI explanation that turns the app from "test me" into "teach me." Here's how it works — and why it's not on-the-fly generation.
- Generate once, cache forever. When a user hits a question that has no explanation yet, we don't call the model from the hot path. We enqueue the generation job, return the raw question immediately, and the explanation appears on their next visit.
- Review before shipping. Every generated explanation is reviewable from the admin panel. Wael (the medical expert) can thumbs-up, thumbs-down, or edit. Edits overwrite the cached version permanently.
- User-rating loop. Users rate explanations thumbs-up/down. Low-rated explanations get flagged for regeneration with a different prompt variant.
- The model is small and cheap. GPT-4o-mini, not GPT-4o. The explanations are short (≤ 200 words) and the medical reasoning for board questions is well within a small model's capability.
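The read path above can be sketched in a few lines. This is an illustrative model, not the real code — `ExplanationStore`, `enqueueGeneration`-style queueing, and the field names are assumptions:

```typescript
// Hypothetical sketch of the generate-once, cache-forever read path.
// The hot path never touches the model; it only reads the cache or
// enqueues a background job.

type Explanation = { text: string; reviewed: boolean };

class ExplanationStore {
  private cache = new Map<number, Explanation>();
  private pending = new Set<number>();
  private queue: number[] = []; // drained by a background worker

  // Hot path: return the cached explanation if one exists,
  // otherwise enqueue a generation job exactly once.
  get(questionId: number): Explanation | null {
    const hit = this.cache.get(questionId);
    if (hit) return hit;
    if (!this.pending.has(questionId)) {
      this.pending.add(questionId);
      this.queue.push(questionId);
    }
    return null; // caller shows the raw question immediately
  }

  // Worker/reviewer path: write the (possibly edited) explanation once.
  put(questionId: number, text: string, reviewed = false): void {
    this.cache.set(questionId, { text, reviewed });
    this.pending.delete(questionId);
  }

  pendingJobs(): number[] {
    return [...this.queue];
  }
}
```

The `reviewed` flag is where the editor loop plugs in: the admin panel flips it after a thumbs-up or an edit, and edits simply overwrite `text`.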
This is the opposite of "wrap an LLM in a chat UI." Latency and cost both drop 100× because the model is only touched on generation, not on read. And because we review before shipping, the quality floor is set by the reviewer, not by the model's worst day.
By the numbers
The Telegram deep-link bug that almost killed us
The Telegram bot supports deep-link login: user clicks a link in a message, bot opens the web app, auth is automatic. Race condition: the deep-link token was single-use, but the Mini App sometimes initiated two auth requests (one from the parent frame, one from an iframe embedded in the Mini App container). The second request would fail because the token was consumed, and the user would see a blank auth screen.
It took us two weeks to reliably reproduce because it only happened on certain Android clients, certain versions of the Telegram app, and only when the user came from a specific entry point. The fix was to make the auth token reusable within a 30-second window and lock the client-side auth call to a single inflight promise. Shipped in commit 86caf6b.
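The client half of that fix is a standard single-inflight-promise lock. A minimal sketch, assuming a `doAuth` call shaped like ours (the real function names and the 30-second server window are not shown here):

```typescript
// Illustrative sketch: whichever frame asks first triggers the real auth
// call; any concurrent caller (e.g. the embedded iframe) awaits the same
// promise instead of consuming a second token.

let inflight: Promise<string> | null = null;

async function authWithToken(
  token: string,
  doAuth: (t: string) => Promise<string> // hypothetical network call
): Promise<string> {
  if (!inflight) {
    // Clear the lock when settled so a later, genuinely new login re-auths.
    inflight = doAuth(token).finally(() => { inflight = null; });
  }
  return inflight;
}
```

Note the `.finally` reset: without it, a failed first attempt would pin every later login to the same rejected promise.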
Lesson: Telegram Mini Apps are not browsers. Assume every auth, network, and storage API behaves differently than you expect, and instrument everything before you need to debug it.
What I got wrong
Trusting the raw question bank
The initial ingestion of 15,439 raw questions had massive duplicates — same question in slightly different wording, same question with the answer letters shuffled, same question translated from German where the translation was subtly wrong. I shipped the first version without deduping and got complaints about "the same question three times in a row." Had to build a dedicated duplicate scanner in the admin panel, run it, prune, re-import. Should have deduped before shipping.
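One heuristic the scanner needs for the "slightly different wording, shuffled answer letters" case is a normalization key that both variants collapse to. A minimal sketch of the idea (the real admin-panel scanner is more involved; these names are illustrative):

```typescript
// Collapse case, punctuation, and answer order before keying, so a
// rephrased or letter-shuffled copy lands on the same dedup key.

interface Question { stem: string; answers: string[] }

function normalize(s: string): string {
  return s
    .toLowerCase()
    .replace(/[^\p{L}\p{N} ]/gu, "") // keep letters/digits, incl. umlauts
    .replace(/\s+/g, " ")
    .trim();
}

function dedupKey(q: Question): string {
  // Sorting answers makes the key invariant to shuffled answer letters.
  const answers = q.answers.map(normalize).sort().join("|");
  return normalize(q.stem) + "::" + answers;
}

function findDuplicates(qs: Question[]): Question[][] {
  const groups = new Map<string, Question[]>();
  for (const q of qs) {
    const key = dedupKey(q);
    groups.set(key, [...(groups.get(key) ?? []), q]);
  }
  return [...groups.values()].filter(g => g.length > 1);
}
```

This only catches near-exact duplicates; the subtly-wrong-translation class still needs a human (or embedding similarity) pass.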
Not measuring the thumbs-up/down loop early
I shipped the rating buttons but didn't track the data well for the first month. When I finally looked, ~30% of explanations had thumbs-down from someone. I didn't know which 30%. I now track the rating per-question, flag the bottom decile, and rotate prompt variants on regeneration. This is the kind of feedback loop that's worth more than tuning the model.
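The bottom-decile flagging reduces to a sort over per-question approval rates. A hedged sketch — the field names and the "unrated counts as fine" policy are my assumptions, not the production logic:

```typescript
// Flag the worst ~10% of explanations by thumbs-up rate for regeneration.

interface Rated { questionId: number; up: number; down: number }

function approval(r: Rated): number {
  const total = r.up + r.down;
  return total === 0 ? 1 : r.up / total; // unrated questions aren't flagged
}

function bottomDecile(ratings: Rated[]): number[] {
  const sorted = [...ratings].sort((a, b) => approval(a) - approval(b));
  const n = Math.max(1, Math.floor(sorted.length / 10));
  return sorted.slice(0, n).map(r => r.questionId);
}
```

A refinement worth considering at this sample size is a lower-confidence-bound score (e.g. Wilson interval) instead of the raw ratio, so one angry thumbs-down on an otherwise unrated explanation doesn't dominate the decile.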
Mobile-first, but not Telegram-first
The web app was responsive, but the Telegram Mini App needed some specific affordances (smaller fonts for in-chat view, different back-button behavior, no window.confirm). Retrofitting those into a responsive web codebase took a week. Next time I'd architect for Telegram from day one, since ~60% of traffic comes through it.
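The `window.confirm` gap is representative of the retrofit work. One way to paper over it is a wrapper that routes through Telegram's `showConfirm` when the Mini App bridge is present — a sketch with the bridge injected so it stays testable outside Telegram (the interface shape mirrors the documented `Telegram.WebApp.showConfirm`, but treat the wiring as an assumption):

```typescript
// Telegram Mini Apps block window.confirm; use the Telegram-native dialog
// when available, otherwise fall back to the browser one.

interface TgWebApp {
  showConfirm(message: string, cb: (ok: boolean) => void): void;
}

function confirmDialog(
  message: string,
  tg?: TgWebApp,                           // e.g. window.Telegram?.WebApp
  browserConfirm?: (m: string) => boolean  // e.g. window.confirm
): Promise<boolean> {
  if (tg) {
    return new Promise(resolve => tg.showConfirm(message, resolve));
  }
  return Promise.resolve(browserConfirm ? browserConfirm(message) : true);
}
```

The same injection pattern applies to the back button and font sizing: one thin platform layer, two implementations, instead of `if (isTelegram)` branches scattered through the codebase.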
The lesson
The interesting work in AI-native exam prep isn't the AI part — that's the easy, well-trodden path. The interesting work is the feedback loops: rating → regeneration → review → reshipment. The AI is a junior author. The human is the editor. The app is the publishing pipeline. Design the pipeline first, the model second.
That and: if your primary surface is Telegram, instrument Telegram. Web analytics are a lie there.
Visit medpruf.com →