Essay ~8 min read · April 2026

How I cut our LLM bill 28% without changing models

The obvious way to save money on an LLM bill is to downgrade the model. That's the wrong answer if the model is doing real work. Here are the six specific moves that took 28% off the cost curve at NeuraScale — while staying on the same primary models.

00 Before the cuts

NeuraScale runs six products on a mix of GPT-5.4-mini, Claude Sonnet 4.6, Claude Haiku 4.5, and Gemini 2.5 Flash. Different products use different models for different jobs. When I looked at the combined monthly LLM bill six months ago, it was growing at roughly 15% per month while user counts grew at maybe 6%. That delta was the smell.

The first instinct was to move everything to the cheapest model. I tried. Quality dropped, retention dropped, and I rolled the change back within a month. The cost curve was broken for reasons that had nothing to do with model choice.

Here's what I found and what I changed, in order of impact.

−28%
Aggregate monthly LLM cost · same primary models

01 Route by task difficulty, not by product

The first wrong pattern: every product picked one model for "everything this product needs." RetailOS used GPT-5.4-mini for daily summaries, for customer-facing chat, for log analysis, for inventory suggestions. Bridge Sourcing used Claude Sonnet for discovery, qualification, outreach, and reply parsing.

The task was the unit that mattered, not the product. Within one product I had tasks that required sharp reasoning (qualification scoring, edge-case architecture advice) and tasks that were essentially classification (is this email a reply-yes, a reply-no, or a bounce?). Using the same model for both was wasteful on one end and insufficient on the other.

I built a tiny router that looks at the task type and picks a model. Classification and extraction go to a nano-tier model. Generation and summarization go to a mini. Reasoning, planning, and anything touching money or compliance goes to a full-tier model. The router is 40 lines of JavaScript — it's not clever, it just respects the fact that different tasks deserve different brains.

// simplified router
function pickModel(task) {
  if (task.type === 'classify' || task.type === 'extract')
    return 'gpt-5.4-nano';
  if (task.type === 'generate' || task.type === 'summarize')
    return 'gpt-5.4-mini';
  if (task.type === 'reason' || task.criticality === 'high')
    return 'claude-sonnet-4.6';
  return 'gpt-5.4-mini'; // default
}

Contribution to savings: ~11%. The biggest single move. Most of the bill was "easy tasks being run on expensive models because nobody thought about it."

02 Cache responses with semantic keys, not exact keys

Normal caching: hash the prompt text, use it as a key, reuse the response. This works for ~5% of requests in a typical production system — the exact-match rate is low because prompts almost always include context that varies (timestamps, user IDs, dynamic data).

Semantic caching: strip the varying parts from the prompt, hash only the stable semantic content, reuse the response when the semantic key matches even if the full prompt differs.

In RetailOS, the daily forecast explanation prompt included the tenant name and the forecast numbers. I realized the LLM was mostly generating the same sentence structure regardless of the numbers. I cached the template of the explanation (with placeholders for the numbers) using a semantic key, then filled the current numbers into the template at render time. Same user-visible output, 1/10th the LLM calls.

For MedPrüf, the question explanations are cached per-question forever. I wrote about that here.

Contribution to savings: ~8%. Smaller impact than routing but zero downside — the user experience is identical, the quality is sometimes higher because cached responses have been reviewed.

03 Compress prompts ruthlessly, but only the static parts

I had system prompts that were 2,000 tokens. The model was seeing them on every single request, even though 95% of the content never changed between requests. That's cache-friendly in principle (both OpenAI and Anthropic have prompt caching now) but I wasn't using prompt caching because I hadn't gotten around to it. Wake up.

Two changes:

  1. Split system prompts into static and dynamic halves. The static half goes through the provider's prompt caching (up to 90% discount on cached tokens for repeat reads). The dynamic half stays as normal input tokens.
  2. Compress the static half. Went through every system prompt and removed hedge words ("please", "try to", "make sure to", "it would be helpful if"). Replaced multi-sentence instructions with bullet points. Removed examples that the model no longer needed. Most system prompts dropped by 40–60% in token count with no measurable quality drop.

Rule of thumb: if you have a system prompt longer than 1,000 tokens, you probably have ~400 tokens of actual instruction wrapped in 600 tokens of politeness, hedging, and examples the model already knows.
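The static/dynamic split can be sketched against Anthropic's prompt-caching request shape, where system blocks marked with `cache_control` are eligible for the cache. The prompt text and model id here are illustrative, not from the real products:

```javascript
// Static half: identical on every request, so the provider can serve
// it from the prompt cache at the discounted rate. Dynamic half:
// changes per request, billed as normal input tokens.
const STATIC_SYSTEM = 'You are the RetailOS forecast explainer. Output: 3 bullets, no preamble.';

function buildRequest(dynamicContext, userMessage) {
  return {
    model: 'claude-haiku-4.5', // illustrative model id
    max_tokens: 300,
    system: [
      { type: 'text', text: STATIC_SYSTEM, cache_control: { type: 'ephemeral' } },
      { type: 'text', text: dynamicContext },
    ],
    messages: [{ role: 'user', content: userMessage }],
  };
}

const req = buildRequest('Tenant: Acme. Forecast horizon: 7 days.', 'Explain this forecast.');
console.log(req.system[0].cache_control.type); // 'ephemeral'
```

The ordering matters: cached prefixes only hit when the static content comes first and is byte-identical across requests, which is exactly why the dynamic half has to be split out.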

Contribution to savings: ~5%.

04 Force structured output to cap the downside

The expensive LLM request isn't the one with the long prompt. It's the one with the long response. Output tokens cost 2–5× more than input tokens on most models, and models love to fill available space.

I forced structured output using JSON schema on every request that produces data for downstream code. No more "here's your summary, and also some bonus observations you didn't ask for." The model writes exactly the fields in the schema, nothing more. Response lengths dropped 20–30% on average.

Where structured output didn't apply (user-facing chat, long-form generation), I set aggressive max_tokens limits tied to what the UI could actually render.
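A sketch of what this looks like with an OpenAI-style `response_format` JSON schema. The schema fields are invented for illustration; the point is that a strict schema plus `additionalProperties: false` leaves the model nowhere to put bonus prose:

```javascript
// Strict JSON schema: the model can emit exactly these fields, nothing
// more, which caps the response length structurally.
const summarySchema = {
  name: 'daily_summary',
  strict: true,
  schema: {
    type: 'object',
    properties: {
      headline: { type: 'string' },
      risk_level: { type: 'string', enum: ['low', 'medium', 'high'] },
      action_items: { type: 'array', items: { type: 'string' }, maxItems: 3 },
    },
    required: ['headline', 'risk_level', 'action_items'],
    additionalProperties: false,
  },
};

function buildSummaryRequest(logText) {
  return {
    model: 'gpt-5.4-mini',
    max_tokens: 250, // hard cap tied to what the UI renders
    response_format: { type: 'json_schema', json_schema: summarySchema },
    messages: [{ role: 'user', content: `Summarize:\n${logText}` }],
  };
}

const req = buildSummaryRequest('order volume up, two stockouts, one refund spike');
console.log(req.response_format.type); // 'json_schema'
```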

Contribution to savings: ~3%. Smaller than I expected, because most tasks already had reasonable output lengths. But it kept the worst-case spike from happening.

05 Batch the embarrassingly parallel

Bridge Sourcing's discovery agent enriches hundreds of supplier candidates per run. Originally each enrichment was a separate API call. OpenAI and Anthropic both offer batch APIs with 50% discounts for non-urgent work that can wait up to 24 hours.

I split the enrichment pipeline into two lanes: a hot lane that calls the synchronous API for requests a user is actively waiting on, and a warm lane that queues everything else through the batch API at the 50% discount.

The warm lane now handles ~70% of total enrichment volume because the discovery agent is always pre-fetching candidates in the background. The hot lane handles only the "a user is waiting" subset.

Contribution to savings: ~2%. Lower than I hoped because my non-urgent volume was smaller than I realized. Still worth doing — free money.

06 Reject waste queries before they hit the model

This one was depressing to find. I audited a week of production prompts and discovered that ~3% of requests were garbage that should never have reached the LLM: duplicate submissions fired seconds apart, empty or malformed inputs, questions with fixed deterministic answers, and a handful of users hammering the same endpoint.

I added a gatekeeper layer that runs before the LLM: input sanitization, deduplication with a 30-second window, a small lookup table for known-deterministic queries, and a per-user rate limit. Waste dropped to <0.5%.
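A minimal sketch of two of those checks (the 30-second dedup window and the deterministic lookup table); the table contents and function names are invented:

```javascript
// Gatekeeper: short-circuit empty, duplicate, and known-deterministic
// queries before any LLM call is made.
const DETERMINISTIC = new Map([['what is the refund window?', '30 days']]);
const recent = new Map(); // normalized query -> last-seen timestamp (ms)
const DEDUP_WINDOW_MS = 30_000;

function gatekeep(rawQuery, now = Date.now()) {
  const q = rawQuery.trim().toLowerCase();
  if (q.length === 0) return { action: 'reject', reason: 'empty' };
  if (DETERMINISTIC.has(q)) return { action: 'answer', text: DETERMINISTIC.get(q) };
  const last = recent.get(q);
  if (last !== undefined && now - last < DEDUP_WINDOW_MS) {
    return { action: 'reject', reason: 'duplicate' };
  }
  recent.set(q, now);
  return { action: 'forward' }; // only now does the request reach the LLM
}

console.log(gatekeep('What is the refund window?').action); // 'answer'
console.log(gatekeep('explain my forecast').action);        // 'forward'
console.log(gatekeep('explain my forecast').action);        // 'reject'
```

The per-user rate limit sits in front of all of this and is standard middleware, so it isn't shown.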

Contribution to savings: ~3%. Bigger than expected because waste queries were disproportionately on the expensive models (the gatekeeper-less products tended to be the ones on the big models too).

07 The moves that didn't work

For completeness:

08 The order that matters

If you're starting from zero, don't try to do all six at once. The order that worked for me:

  1. Route by task first. Biggest impact, cheapest to implement, no quality risk.
  2. Then structured output and response token limits. Prevents the worst-case cost spikes.
  3. Then prompt compression with prompt caching. Requires discipline but compounds with everything else.
  4. Then semantic caching. The hardest one to get right; save it for after you have solid instrumentation.
  5. Then batching and gatekeeping. Smaller wins but free; do them once you're stable.

09 What to measure

If you don't measure, you don't improve. The four metrics I watch weekly:

10 The real lesson

The 28% wasn't a single clever hack. It was six boring optimizations done properly, in the right order, with measurement. None of them required changing the primary models. None of them hurt quality. All of them took less than a week of engineering each.

If your LLM bill is growing faster than your user count, the answer is almost never "switch to a cheaper model." The answer is "find the waste." The waste is there. It always is. It just doesn't show up until you look at the cost curve with something other than the bill total.
