Live · Client Build · Case Study · Vienna, Austria · Client: Wael

MedPrüf — AI exam prep for Vienna's Kenntnisprüfung

Foreign-trained doctors in Austria take the Kenntnisprüfung — a brutal medical licensing exam gating their right to practice. Existing prep is paper question banks, German-only, no explanations. We built the opposite: 7,725 active questions, 1,615 of them AI-explained, German/English bilingual, shipped as a web app and a Telegram Mini App. I'm the engineer, Wael is the product owner and medical expert.

The problem

Austria needs doctors. Doctors trained outside the EU need to pass the Kenntnisprüfung to practice. The exam covers 10 specialties across two structures (SIP 4 and SIP 5), with thousands of possible questions drawn from a pool. Failure rates are high. Candidates often study for 6–12 months.

The options they had before we started: paper question banks, German-only, with no explanations of why an answer is right.

Wael brought the clinical expertise and the question bank. My job was to turn it into a product that works, scales, and teaches as well as it tests.

Architecture

Two things made this more interesting than a typical exam-prep app:

  1. The question bank is a living dataset. 15,439 questions in the raw DB, 7,725 active after deduplication and classification, 1,806 with multiple correct answers, 952 with embedded images. Every question is in two languages. Every explanation is generated once and cached.
  2. The primary surface is Telegram. Not by accident — the audience skews toward WhatsApp/Telegram-native users. A Telegram Mini App lets them study without installing anything, and a Telegram bot handles daily drill reminders with deep-link auth.
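To make the dataset's shape concrete, here is an illustrative sketch of what one row of such a question bank might look like. The field names are my assumptions, not the actual MedPrüf schema:

```typescript
// Hypothetical shape of one question-bank row; field names are
// illustrative, not the real schema.
interface Question {
  id: number;
  textDe: string;           // German stem
  textEn: string;           // English translation
  options: string[];        // answer choices
  correctIndices: number[]; // more than one entry for multi-correct MCQs
  imageUrl?: string;        // some questions embed an image
  active: boolean;          // false once flagged as a duplicate
  explanation?: string;     // generated once, then cached
}

const q: Question = {
  id: 1,
  textDe: "Welches Medikament ist kontraindiziert?",
  textEn: "Which drug is contraindicated?",
  options: ["Aspirin", "Ibuprofen", "Paracetamol", "Metamizol"],
  correctIndices: [1, 3],   // a multi-correct question
  active: true,
};
```

The multi-correct case is why `correctIndices` is an array rather than a single index: 1,806 of the active questions have more than one right answer.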
[Architecture diagram: Next.js 15 web app (RSC) · Telegram Mini App + bot with deep-link auth · Supabase Postgres with RLS · question bank (15,439 raw, 7,725 active) · explanation cache (1,615 generated, rated) · dedup scanner in the admin panel · GPT-4o-mini explanation generation · GA4 + Clarity session replay]
Fig 1 · MedPrüf two-surface architecture · web + Telegram · shared Supabase backend

The stack

Next.js 15 · React 19 · TypeScript · Supabase · Postgres · Telegram Bot API · Telegram Mini App · GPT-4o-mini · GA4 · Microsoft Clarity · Vercel

The explanation pipeline

The magic feature isn't the question bank. It's the per-question AI explanation that turns the app from "test me" into "teach me." Crucially, it is not on-the-fly generation: each explanation is generated once with GPT-4o-mini, reviewed before shipping, cached, and then served as a plain read, with thumbs-up/down ratings feeding regeneration.

This is the opposite of "wrap an LLM in a chat UI." Latency and cost both drop 100× because the model is only touched on generation, not on read. And because we review before shipping, the quality floor is set by the reviewer, not by the model's worst day.
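A minimal sketch of the generate-once, serve-from-cache pattern, assuming an in-memory cache for illustration (the real store is Supabase, and `generate` stands in for the GPT-4o-mini call):

```typescript
// Generate-once cache: the model runs only on a cache miss, and a
// human review gate decides whether the result is ever served.
type Explanation = { text: string; reviewed: boolean };

class ExplanationCache {
  private cache = new Map<number, Explanation>();

  constructor(private generate: (questionId: number) => string) {}

  // Returns a reviewed explanation, or null while one is pending review.
  get(questionId: number): string | null {
    let entry = this.cache.get(questionId);
    if (!entry) {
      // Cache miss: touch the model exactly once, then park the
      // result until a reviewer approves it.
      entry = { text: this.generate(questionId), reviewed: false };
      this.cache.set(questionId, entry);
    }
    return entry.reviewed ? entry.text : null;
  }

  approve(questionId: number): void {
    const entry = this.cache.get(questionId);
    if (entry) entry.reviewed = true;
  }
}

// The stub counts model calls to show reads never touch the model.
let calls = 0;
const cache = new ExplanationCache(() => { calls++; return "Because ..."; });
const pending = cache.get(42); // miss: generates, unreviewed -> null
cache.approve(42);
const text = cache.get(42);    // hit: served from cache, model not re-run
```

The review gate is the quality floor mentioned above: nothing unreviewed reaches a student, and the model is billed once per question, not once per read.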

By the numbers

7,725 · Active questions
1,615 · AI-explained (21%)
1,806 · Multi-correct MCQs
DE / EN · Bilingual

The Telegram deep-link bug that almost killed us

The Telegram bot supports deep-link login: user clicks a link in a message, bot opens the web app, auth is automatic. Race condition: the deep-link token was single-use, but the Mini App sometimes initiated two auth requests (one from the parent frame, one from an iframe embedded in the Mini App container). The second request would fail because the token was consumed, and the user would see a blank auth screen.

It took us two weeks to reliably reproduce because it only happened on certain Android clients, certain versions of the Telegram app, and only when the user came from a specific entry point. The fix was to make the auth token reusable within a 30-second window and lock the client-side auth call to a single inflight promise. Shipped in commit 86caf6b.
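The client-side half of the fix is a standard single-flight guard. A sketch under my own names, not the actual MedPrüf code:

```typescript
// Single-flight auth: concurrent callers share one in-flight promise,
// so a single-use (or short-window) token is only redeemed once.
let inflight: Promise<string> | null = null;

function authenticate(
  redeem: (token: string) => Promise<string>,
  token: string
): Promise<string> {
  if (!inflight) {
    // First caller starts the real request; clear the slot when it
    // settles so a later retry can start fresh.
    inflight = redeem(token).finally(() => { inflight = null; });
  }
  // A second concurrent caller (e.g. the embedded iframe) reuses the
  // same promise instead of consuming the token again.
  return inflight;
}
```

Both the parent frame and the iframe end up awaiting the same promise, so the server sees exactly one token redemption even when two auth calls fire back to back.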

Lesson: Telegram Mini Apps are not browsers. Assume every auth, network, and storage API behaves differently than you expect, and instrument everything before you need to debug it.

What I got wrong

Trusting the raw question bank

The initial ingestion of 15,439 raw questions was riddled with duplicates: the same question in slightly different wording, the same question with the answer letters shuffled, the same question translated from German where the translation was subtly wrong. I shipped the first version without deduping and got complaints about "the same question three times in a row." I had to build a dedicated duplicate scanner into the admin panel, run it, prune, and re-import. I should have deduped before shipping.
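The scanner's core idea can be sketched as normalizing each question before comparing, so rewordings and shuffled answer orders collapse to the same key. A simplified sketch, not the actual scanner:

```typescript
// Two questions are near-duplicates if their normalized stems plus
// their *sorted* option sets match: case, punctuation, and option
// order all collapse to one key.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9äöüß\s]/g, " ") // drop punctuation
    .replace(/\s+/g, " ")             // squash whitespace
    .trim();
}

function dedupKey(stem: string, options: string[]): string {
  const sorted = options.map(normalize).sort();
  return normalize(stem) + "|" + sorted.join("|");
}

// Group a batch by key; any group with more than one id is flagged.
function findDuplicates(
  qs: { id: number; stem: string; options: string[] }[]
): number[][] {
  const groups = new Map<string, number[]>();
  for (const q of qs) {
    const k = dedupKey(q.stem, q.options);
    groups.set(k, [...(groups.get(k) ?? []), q.id]);
  }
  return Array.from(groups.values()).filter(ids => ids.length > 1);
}
```

Exact-key grouping catches the mechanical duplicates (case, punctuation, shuffled options); the subtly-wrong translations still need fuzzy matching or a human pass.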

Not measuring the thumbs-up/down loop early

I shipped the rating buttons but didn't track the data well for the first month. When I finally looked, ~30% of explanations had thumbs-down from someone. I didn't know which 30%. I now track the rating per-question, flag the bottom decile, and rotate prompt variants on regeneration. This is the kind of feedback loop that's worth more than tuning the model.
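The flagging step reduces to ranking questions by thumbs-down rate and taking the worst tenth. A sketch; the threshold and names are my assumptions:

```typescript
// Rank rated questions by thumbs-down rate and flag the bottom
// decile for explanation regeneration with a new prompt variant.
interface Rating { questionId: number; up: number; down: number }

function flagBottomDecile(ratings: Rating[]): number[] {
  const rated = ratings.filter(r => r.up + r.down > 0);
  const byWorst = [...rated].sort(
    (a, b) => b.down / (b.down + b.up) - a.down / (a.down + a.up)
  );
  const n = Math.max(1, Math.ceil(byWorst.length / 10));
  return byWorst.slice(0, n).map(r => r.questionId);
}
```

Flagged questions get their explanations regenerated and re-reviewed, closing the rating → regeneration → review loop.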

Mobile-first, but not Telegram-first

The web app was responsive, but the Telegram Mini App needed some specific affordances (smaller fonts for in-chat view, different back-button behavior, no window.confirm). Retrofitting those into a responsive web codebase took a week. Next time I'd architect for Telegram from day one, since ~60% of traffic comes through it.

The lesson

The interesting work in AI-native exam prep isn't the AI part — that's the easy, well-trodden path. The interesting work is the feedback loops: rating → regeneration → review → reshipment. The AI is a junior author. The human is the editor. The app is the publishing pipeline. Design the pipeline first, the model second.

That and: if your primary surface is Telegram, instrument Telegram. Web analytics are a lie there.

Visit medpruf.com →
