The notebook · production AI · real numbers, no hype

Fieldnotes.

Long-form entries on the production engineering of AI. Each one leads with the number that mattered, then walks the changes that produced it, including the ones that made it worse.

10 entries on file · 1500–2500 words · numbers, not opinions

The register · newest first

01 · Receipts · ~9 min · filed 2026-06-10
llm-eval-ci: the gate that fails the PR when your LLM quietly regresses
Every part of an LLM product is gated except the part that matters. llm-eval-ci is the open-source CI quality gate I shipped today: a golden set scored by six graders, exit 1 when answer quality drops. In the demo, a harmless-looking prompt rewrite falls from 100% to 17% and the build goes red.
100% → 17% · v1 to v2 · one prompt rewrite · exit 1

02 · Receipts · ~10 min · filed 2026-06-10
Mnemonic: self-hosted, categorized memory for AI agents
A small self-hosted FastAPI memory server for AI agents, built on mem0 and Qdrant. It auto-sorts every memory into seven semantic categories, serves a tiered L0/L1/L2 context tree instead of dumping the whole pile, and keeps the conversation history on your own box. About $2/month self-hosted versus $20+/month for the cloud memory APIs.
7 semantic categories · self-hosted · MIT

03 · Receipts · ~9 min · filed 2026-06-10
2,831 orders on 30 commits: the boring system a donut shop kept over the better one I built
My family's donut shop has taken 2,831 orders through a 30-commit system I stopped touching in March. I built the far more capable replacement myself, migrated the full order history into it, and the shop never switched: 1,012 more orders on the old system since, with zero code changes.
2,831 orders · 30 commits · 7 tables

04 · Receipts · ~10 min · filed 2026-06-10
Maintaining a 10,992-question medical exam bank (and rejecting 37% of the fixes)
MedPrüf holds 10,992 active German-language medical exam questions, maintained from Egypt by a non-doctor. The quality lives in the correction pipeline: 91 suggested fixes triaged, 37% rejected on review, 67 human-reviewed AI explanations, and zero runtime AI calls served.
37% of suggested fixes rejected · 34 of 91

05 · Receipts · ~10 min · filed 2026-04-30
Bridge Sourcing: how I moved scrape accuracy from 82% to 96%
The 8 specific changes that moved a production B2B sourcing extraction pipeline from 82% to 96% field-level accuracy over three months, and the 2 changes that made it worse. The economic moat under the pipeline is Egypt's 0% EU import-tariff lane; the engineering moat is the calibration set. Calibration sets, schema forcing, deterministic validation, drift detection.
82% → 96% · field-level accuracy · 3 months

06 · Positioning · ~8 min · filed 2026-04-30
Egypt-to-EU senior AI engineering: the 2026 thesis, not the 2024 cheap-outsourcing pitch
Two arbitrages stacked on one geography: time zone and senior-tier pricing. The 2024 framing of "outsource cheap engineering to Egypt" destroys both. The 2026 framing is single-engineer coverage of EU mornings AND US East afternoons at 60–70% of London rates, IF you filter correctly.
2 arbitrages stacked · time zone + senior pricing

07 · Receipts · ~9 min · filed 2026-04-30
How I cut our LLM bill 28% without changing models
Six specific moves that took 28% off the cost curve at NeuraScale across six products, without downgrading the primary models. Routing, semantic caching, prompt compression, structured output, batching, gatekeeping. Plus a 2026 update on what prompt caching becoming GA changed.
−28% · aggregate monthly LLM cost · same primary models

08 · Positioning · ~9 min · filed 2026-04-30
Fractional CTO vs agency vs full-time AI hire: buyer-side math
If your board just asked about AI strategy and you have 5–50 people, the highest-EV next move is a fractional senior, assuming you do diligence. The math on what each of the three paths costs you when it doesn't work.
3 paths · pick the smallest blast radius

09 · Receipts · ~13 min · filed 2026-04-30
Harmonia: a single-tenant luxury salon system, end-to-end in 14 days
Single-tenant on purpose, against every SaaS instinct. An agency would have quoted 8–12 weeks at around $40K for the same POS + booking + payroll + receipts + RTL + PWA scope. I shipped v1 solo in 14 days, then a v2 sweep two days later that added role-aware UIs, finances, audit, and commission settlement. The architectural call is what made the calendar fit.
14d · solo, end-to-end · v1 to live tenant

10 · Positioning · ~8 min · filed 2026-04-30
Production AI in 2026: five shifts most teams are still ignoring
Most teams shipping AI in 2026 are still solving 2024 problems with 2024 tools. Here are the five shifts that flipped (prompt caching, agents, evals, tool-calling, observability) and what each one costs to ignore.
5 shifts to ship · or pay for ignoring

Newsletter · hand-run, no automation

Each entry walks one production AI move: what it contributed, what it broke. Drop your email for the early issues; otherwise it lives here.

New entries in your inbox if you ask for them.