omar.nagy · Receipts · ~9 min read · published 2026-04-30

How I cut our LLM bill 28% without changing models

Six specific moves that took 28% off the cost curve at NeuraScale across six products — without downgrading the primary models. Routing, semantic caching, prompt compression, structured output, batching, gatekeeping. Plus a 2026 update on what prompt caching becoming GA changed.

  • receipts
  • llm-cost
  • prompt-caching
  • production-ai
  • routing

The obvious way to save money on an LLM bill is to downgrade the model. That's the wrong answer when the model is doing real work. Here are the six specific moves that took 28% off the cost curve at NeuraScale across six products — without downgrading the primary models. None of them required a quality compromise. None of them was a clever one-off hack. The order matters.

This is a re-publish of the original 2026 cost-cut writeup from legacy/writing/llm-cost-reduction.html, expanded with one important update: prompt caching went from "nice if you remember" to table stakes in early 2026. There's a new section at the end on what changed and what I'd do differently if I were running the cuts today.

−28%
aggregate monthly LLM cost · same primary models
Six moves stacked. Routing alone was ~11pp.

00 — Before the cuts

NeuraScale runs six products on a mix of GPT-5.4-mini, Claude Sonnet 4.6, Claude Haiku 4.5, and Gemini 2.5 Flash. Different products use different models for different jobs. When I looked at the combined monthly LLM bill six months ago, it was growing at roughly 15% per month while user counts grew at maybe 6%. That delta was the smell.

The first instinct was to move everything to the cheapest model. I tried. Quality dropped, retention dropped, and I gave the savings back in churn within a month. The cost curve was broken for reasons that weren't about the model choice. They were about how I was asking the models to work.

Here's what I found and what I changed, in order of impact.

// cost contribution per move

Route 11pp · Cache 8pp · Compress 5pp · Reject 3pp · Schema 3pp · Batch 2pp
Each move's contribution to the 28% aggregate cut, listed in descending impact. The order I shipped them is in section 08.

01 — Route by task difficulty, not by product

The first wrong pattern: every product picked one model for "everything this product needs." RetailOS used GPT-4o-mini for daily summaries, customer-facing chat, log analysis, and inventory suggestions. Bridge Sourcing used Claude Sonnet for discovery, qualification, outreach, and reply parsing.

The task was the unit that mattered, not the product. Within one product I had tasks that required sharp reasoning (qualification scoring, edge-case architecture advice) and tasks that were essentially classification (is this email a reply-yes, a reply-no, or a bounce?). Using the same model for both was wasteful on one end and insufficient on the other.

I built a tiny router that looks at the task type and picks a model. Classification and extraction go to a nano-tier. Generation and summarization go to a mini. Reasoning, planning, and anything touching money or compliance goes to a full-tier model. The router is 40 lines of TypeScript — it's not clever, it just respects the fact that different tasks deserve different brains.

typescript · The router — 40 lines, no cleverness
type Task = {
  type: 'classify' | 'extract' | 'generate' | 'summarize' | 'reason';
  criticality?: 'low' | 'high';
};

function pickModel(task: Task): string {
  if (task.type === 'classify' || task.type === 'extract') {
    return 'gpt-5.4-nano';      // classification and extraction: nano tier
  }
  if (task.type === 'generate' || task.type === 'summarize') {
    return 'gpt-5.4-mini';      // generation and summarization: mini tier
  }
  if (task.type === 'reason' || task.criticality === 'high') {
    return 'claude-sonnet-4.6'; // reasoning, money, compliance: full tier
  }
  return 'gpt-5.4-mini'; // default
}

Contribution: ~11pp. The biggest single move. Most of the bill was "easy tasks running on expensive models because nobody had thought about it."

02 — Cache responses with semantic keys, not exact keys

Normal caching: hash the prompt text, use it as a key, reuse the response. This works for ~5% of requests in a typical production system — exact-match rates are low because prompts almost always include context that varies (timestamps, user IDs, dynamic data).

Semantic caching: strip the varying parts from the prompt, hash only the stable semantic content, reuse the response when the semantic key matches even if the full prompt differs.

In RetailOS, the daily forecast explanation prompt included the tenant name and the forecast numbers. I realized the LLM was mostly generating the same sentence structure regardless of the numbers. I cached the template of the explanation (with placeholders for the numbers) using a semantic key, then filled the numbers into the template. Same user-visible output, 1/10th the LLM calls.
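
A minimal sketch of that pattern, assuming the varying parts can be stripped with a regex and the cached response carries {slot} placeholders; the function names, the in-memory Map, and the normalization rules here are illustrative, not the production code.

typescript · semantic caching (sketch)
import { createHash } from 'node:crypto';

// In-memory stand-in for whatever KV store backs the semantic cache.
const templateCache = new Map<string, string>();

// Strip the parts that vary per request so that semantically identical
// prompts hash to the same key.
function semanticKey(prompt: string): string {
  const normalized = prompt
    .replace(/\b\d[\d.,]*\b/g, '{num}') // forecast numbers, counts, dates
    .toLowerCase()
    .trim();
  return createHash('sha256').update(normalized).digest('hex');
}

// `prompt` asks the model to write the explanation with {slot} placeholders
// instead of literal numbers; `numbers` maps slot names to today's values.
async function explainForecast(
  prompt: string,
  numbers: Record<string, string>,
  callLLM: (p: string) => Promise<string> // the real model call, injected
): Promise<string> {
  const key = semanticKey(prompt);
  let template = templateCache.get(key);
  if (!template) {
    template = await callLLM(prompt); // cache miss: one real LLM call
    templateCache.set(key, template);
  }
  // Cache hit or miss, fill today's numbers into the templated explanation.
  return Object.entries(numbers).reduce(
    (text, [slot, value]) => text.replaceAll(`{${slot}}`, value),
    template
  );
}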

For MedPrüf, the question explanations are cached per-question forever — there are 10,993 questions, every explanation is requested over and over by different users, and the explanation for question N never changes.

Contribution: ~8pp. Smaller than routing, but zero downside on the cached path — the user-visible quality is sometimes higher because cached responses have been reviewed and corrected.

03 — Compress prompts, but only the static parts

I had system prompts that were 2,000 tokens. The model was seeing them on every single request, even though 95% of the content never changed between requests. That's cache-friendly in principle — both OpenAI and Anthropic shipped prompt caching in 2024 — but I wasn't using it because I hadn't gotten around to it.

Two changes:

  1. Split system prompts into static and dynamic halves. The static half goes through the provider's prompt caching (up to 90% discount on cached tokens for repeat reads). The dynamic half stays as normal input tokens. There's a sketch of the split after this list.

  2. Compress the static half. Went through every system prompt and removed hedge words ("please", "try to", "make sure to", "it would be helpful if"). Replaced multi-sentence instructions with bullet points. Removed examples the model no longer needed. Most system prompts dropped by 40–60% in token count with no measurable quality drop.
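
The split in change 1 looks roughly like this with Anthropic's cache_control blocks. A sketch assuming the @anthropic-ai/sdk client; the model id, prompt contents, and function name are placeholders.

typescript · static/dynamic split with prompt caching (sketch)
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// The long, rarely-changing instructions: the compressed system prompt.
const STATIC_INSTRUCTIONS = '...the compressed, stable system prompt...';

export async function answer(dynamicContext: string, userInput: string) {
  return anthropic.messages.create({
    model: 'claude-sonnet-4.6', // placeholder id, matching the router above
    max_tokens: 512,
    system: [
      {
        // Static half: cacheable; repeat reads within the TTL bill at the
        // discounted cached-token rate.
        type: 'text',
        text: STATIC_INSTRUCTIONS,
        cache_control: { type: 'ephemeral' },
      },
      {
        // Dynamic half: per-request context, billed as normal input tokens.
        type: 'text',
        text: dynamicContext,
      },
    ],
    messages: [{ role: 'user', content: userInput }],
  });
}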

Contribution: ~5pp.

04 — Force structured output to cap the downside

The expensive LLM request isn't the one with the long prompt. It's the one with the long response. Output tokens cost 2–5× more than input tokens on most models, and models love to fill available space.

I forced structured output using JSON schema on every request that produces data for downstream code. No more "here's your summary, and also some bonus observations you didn't ask for." The model writes exactly the fields in the schema, nothing more. Response lengths dropped 20–30% on average.

Where structured output didn't apply (user-facing chat, long-form generation), I set aggressive max_tokens limits tied to what the UI could actually render. The user can't read 1,500 tokens of model output on a phone screen anyway.
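
Both levers sketched against the OpenAI chat completions API; the schema, model id, and token cap are illustrative, and strict JSON-schema mode only accepts a subset of schema keywords, so check the provider docs for yours.

typescript · structured output plus a hard output cap (sketch)
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The schema caps what the model is allowed to write back: exactly these
// fields, nothing more.
const summarySchema = {
  type: 'object',
  properties: {
    headline: { type: 'string' },
    risk_level: { type: 'string', enum: ['low', 'medium', 'high'] },
    action_items: { type: 'array', items: { type: 'string' } },
  },
  required: ['headline', 'risk_level', 'action_items'],
  additionalProperties: false,
};

export async function summarize(report: string) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-5.4-mini', // placeholder id, matching the router above
    messages: [
      { role: 'system', content: 'Summarize the report. Fill the schema, nothing else.' },
      { role: 'user', content: report },
    ],
    response_format: {
      type: 'json_schema',
      json_schema: { name: 'report_summary', strict: true, schema: summarySchema },
    },
    // Hard backstop on response length, tied to what the UI can render.
    max_tokens: 400,
  });
  return JSON.parse(completion.choices[0].message.content ?? '{}');
}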

Contribution: ~3pp. Smaller than I expected, because most tasks already had reasonable output lengths. But it kept the worst-case spike from happening.

05 — Batch the embarrassingly parallel

Bridge Sourcing's discovery agent enriches hundreds of supplier candidates per run. Originally each enrichment was a separate API call. OpenAI and Anthropic both offer batch APIs with 50% discounts for non-urgent work that can wait up to 24 hours.

I split the enrichment pipeline into two lanes:

  • Hot lane for user-triggered searches (synchronous, normal pricing)
  • Warm lane for background crawls and maintenance jobs (batched, 50% off)

The warm lane now handles ~70% of total enrichment volume because the discovery agent is always pre-fetching candidates in the background. The hot lane handles only the "a user is waiting" subset.
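
The warm lane itself is mechanically simple. A sketch against the OpenAI Batch API, assuming the enrichment prompts are already collected; the file name, model id, and function name are illustrative.

typescript · warm lane via the Batch API (sketch)
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// Warm lane: write one JSONL line per enrichment request, upload the file,
// and let the batch endpoint process it at the discounted rate.
export async function submitWarmLane(prompts: { id: string; prompt: string }[]) {
  const lines = prompts.map((p) =>
    JSON.stringify({
      custom_id: p.id,
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-5.4-mini', // placeholder id for the mini tier
        messages: [{ role: 'user', content: p.prompt }],
        max_tokens: 300,
      },
    })
  );
  fs.writeFileSync('warm-lane.jsonl', lines.join('\n'));

  const file = await openai.files.create({
    file: fs.createReadStream('warm-lane.jsonl'),
    purpose: 'batch',
  });

  // completion_window '24h' is the non-urgent SLA that earns the discount.
  return openai.batches.create({
    input_file_id: file.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });
}

Results come back as a JSONL file keyed by custom_id, so fanning them back out to candidates is a join, not a guess.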

Contribution: ~2pp. Lower than I hoped because my non-urgent volume was smaller than I realized. Still worth doing — it's free money once the lane separation exists.

06 — Reject waste before it hits the model

This one was depressing to find. I audited a week of production prompts and discovered ~3% of requests were garbage that should never have reached the LLM:

  • Empty or whitespace-only user inputs that got wrapped in a system prompt and sent anyway
  • Duplicate requests within seconds (users clicking twice)
  • Input that matched a known pattern with a deterministic answer ("what's 2+2"–style questions a calculator handles)
  • Requests from scripts/bots with no rate limiting

I added a gatekeeper layer that runs before the LLM: input sanitization, deduplication with a 30-second window, a small lookup table for known-deterministic queries, and a per-user rate limit. Waste dropped to under 0.5%.
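
A compressed sketch of that gatekeeper. The in-memory maps stand in for whatever store already exists; the 30-second dedupe window is from above, and the per-user limit of 60 requests a minute is a placeholder.

typescript · the gatekeeper layer (sketch)
// In-memory stand-ins; production would back these with Redis or similar.
const recentRequests = new Map<string, number>(); // dedupe: key -> timestamp
const requestCounts = new Map<string, { count: number; windowStart: number }>();
const deterministicAnswers = new Map<string, string>([
  ["what's 2+2", '4'], // known-deterministic queries never reach the model
]);

type GateResult =
  | { pass: true; input: string }
  | { pass: false; reason: string; answer?: string };

export function gate(userId: string, rawInput: string): GateResult {
  const input = rawInput.trim();

  // 1. Empty or whitespace-only input never gets wrapped in a prompt.
  if (input.length === 0) return { pass: false, reason: 'empty-input' };

  // 2. Deduplicate identical requests inside a 30-second window.
  const key = `${userId}:${input}`;
  const now = Date.now();
  const last = recentRequests.get(key);
  if (last !== undefined && now - last < 30_000) {
    return { pass: false, reason: 'duplicate' };
  }
  recentRequests.set(key, now);

  // 3. Known-deterministic queries get answered from the lookup table.
  const canned = deterministicAnswers.get(input.toLowerCase());
  if (canned !== undefined) return { pass: false, reason: 'deterministic', answer: canned };

  // 4. Per-user rate limit (illustrative threshold).
  const bucket = requestCounts.get(userId) ?? { count: 0, windowStart: now };
  if (now - bucket.windowStart > 60_000) {
    bucket.count = 0;
    bucket.windowStart = now;
  }
  bucket.count += 1;
  requestCounts.set(userId, bucket);
  if (bucket.count > 60) return { pass: false, reason: 'rate-limited' };

  return { pass: true, input };
}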

Contribution: ~3pp. Bigger than expected because waste queries were disproportionately on the expensive models — the gatekeeper-less products tended to be the ones on the big models too. Once the gatekeeper was in front of the router, the savings compounded.

07 — The moves that didn't work

For completeness, three things I tried that gave back less than they cost:

  • Fine-tuning a small model on my classification tasks. It worked, but the win was marginal once I accounted for the maintenance cost of the fine-tuned weights — they get stale, you have to retrain. For stable, narrow tasks it's worth it. For anything that changes weekly, skip it.
  • Quantized self-hosted models. Tried Llama 4 via Fireworks and Groq. Fast and cheap, but the quality gap on my actual tasks was larger than the price gap was favorable. Maybe in six months.
  • Aggressive context trimming. Cutting old turns from conversation histories more aggressively. Quality dropped noticeably on multi-turn tasks and I gave half the win back in retention damage.

08 — The order that matters

If you're starting from zero, don't try all six at once. The order that worked for me:

  1. Route by task first. Biggest impact, cheapest to implement, no quality risk.
  2. Then structured output and response token limits. Prevents worst-case cost spikes.
  3. Then prompt compression with prompt caching. Compounds with everything else.
  4. Then semantic caching. The hardest to get right; save it until you have solid instrumentation.
  5. Then batching and gatekeeping. Smaller wins but free; do them once you're stable.

The four metrics I watch weekly:

  • Cost per active user per day. Not absolute cost — that grows for good reasons.
  • Cache hit rate by task type. Below 20% means caching isn't aggressive enough; above 60% might mean over-caching.
  • Output token distribution. p50, p95, p99. The p99 is where money leaks.
  • Ratio of routed models. If 90% of traffic is hitting the expensive tier, your router isn't doing its job.
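
A sketch of the weekly rollup over a request log. The RequestLog shape and field names are assumptions about what your logging already captures, not a prescribed schema.

typescript · the four weekly metrics from a request log (sketch)
type RequestLog = {
  userId: string;
  model: string;
  outputTokens: number;
  costUsd: number;
  cacheHit: boolean;
  taskType: string;
  day: string; // YYYY-MM-DD
};

function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

export function weeklyMetrics(logs: RequestLog[]) {
  // 1. Cost per active user per day: total cost / distinct (user, day) pairs.
  const activeUserDays = new Set(logs.map((l) => `${l.userId}:${l.day}`)).size;
  const totalCost = logs.reduce((sum, l) => sum + l.costUsd, 0);

  // 2. Cache hit rate by task type.
  const hitRates: Record<string, number> = {};
  for (const task of new Set(logs.map((l) => l.taskType))) {
    const ofTask = logs.filter((l) => l.taskType === task);
    hitRates[task] = ofTask.filter((l) => l.cacheHit).length / ofTask.length;
  }

  // 3. Output token distribution: the p99 is where money leaks.
  const outputs = logs.map((l) => l.outputTokens).sort((a, b) => a - b);

  // 4. Routed-model mix: share of traffic per model.
  const modelShare: Record<string, number> = {};
  for (const l of logs) modelShare[l.model] = (modelShare[l.model] ?? 0) + 1 / logs.length;

  return {
    costPerActiveUserDay: totalCost / activeUserDays,
    cacheHitRateByTask: hitRates,
    outputTokens: { p50: percentile(outputs, 50), p95: percentile(outputs, 95), p99: percentile(outputs, 99) },
    modelShare,
  };
}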

09 — What changed in 2026: prompt caching as table stakes

The original cost-cut writeup is from early 2026. Six months on, the headline change isn't a new model — it's that prompt caching is now baseline, not optional.

Specifically:

  • Anthropic prompt caching is GA, with a clean ergonomic API: mark a content block as cache_control: { type: "ephemeral" } and reads against a 5-minute or 1-hour TTL get up to 90% off on the cached tokens. No infrastructure, no special endpoint.
  • OpenAI prompt caching is automatic on prompts ≥1,024 tokens, with cached input tokens discounted ~50% off the standard input rate — the discount applies to repeat reads with no code change. The catch: it's an opaque heuristic, so the only reliable lever is to make the static prefix of your prompt as long and stable as possible.
  • Both providers expose cache-hit metrics in their APIs. Tracking cache hit rate is now a one-line addition to your observability layer instead of a manual experiment.
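
Pulling the cache-hit counters is roughly this. The usage field names match the SDK shapes I've seen (Anthropic's cache_read_input_tokens, OpenAI's prompt_tokens_details.cached_tokens), but treat them as assumptions and verify against the SDK versions you run.

typescript · cache-hit share from provider usage fields (sketch)
// Anthropic: usage comes back on every messages.create response.
// cache_read_input_tokens vs input_tokens ≈ how much of the prompt was
// served from cache on this call.
function anthropicCacheShare(usage: {
  input_tokens: number;
  cache_read_input_tokens?: number | null;
}): number {
  const read = usage.cache_read_input_tokens ?? 0;
  return read / (read + usage.input_tokens);
}

// OpenAI: chat completions report cached prompt tokens under
// usage.prompt_tokens_details.cached_tokens.
function openaiCacheShare(usage: {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return cached / usage.prompt_tokens;
}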

If I were running the same six cuts today, the order shifts:

  1. Prompt caching first, before routing. It's a one-line code change per model with up to 90% off on the static prefix. The original writeup treated this as part of move 03; in 2026 it's the cheapest thing to do and you should do it before anything else.
  2. Routing second (it's still the biggest single multiplier).
  3. The rest of the original order holds.

10 — One product, six months out

To put numbers on the post-cut steady state for one product specifically: MedPrüf — the Austrian medical-exam prep platform with 10,993 questions across three exam types — runs its AI features through the same router. The system prompt for the explanation generator was 1,800 tokens before compression, ~640 after. Every explanation is cached per-question forever, and the question-explanation cache hit rate passed 90% by month two.

The other anchor product is RetailOS, where the daily forecast explanation pattern from section 02 lives. Forecast explanations are templated, the numbers are filled into cached templates, and per-tenant cost per active day stayed flat through 2.5× user growth over the same six months. Naive scaling would have tripled the bill. It didn't. That's the test.

These aren't synthetic benchmarks. They're the cost-per-active-user-per-day numbers I watch on the same dashboard the cuts targeted. The dashboard, the four metrics, and the router are all the same shape they were when I shipped the first cuts.

11 — The real lesson

The 28% wasn't a single clever move. It was six boring optimizations done properly, in the right order, with measurement. None of them required changing the primary models. None hurt quality. Each took less than a week of engineering.

If your LLM bill is growing faster than your user count, the answer is almost never "switch to a cheaper model." It's "find the waste." The waste is there. It always is. It just doesn't show up until you look at the cost curve with something other than the bill total.

The follow-up essay on production scrape accuracy (82% → 96% over 8 changes) is the operational sister piece — same mindset, different metric. The broader 2026 baseline, of which prompt caching is now one piece, is in Production AI in 2026 — five shifts most teams are still ignoring. If you're staring at a cost curve right now and would like a second pair of eyes, the audit sprint is one week, fixed scope, and produces a remediation plan you can hand to whoever runs the cuts.


// next move

Want a written architecture brief on your AI stack?

1 week, $1,500, fixed scope. Working prototype of one change in your stack — yours to keep regardless.
