Positioning · ~8 min read · published 2026-04-30

Production AI in 2026 — five shifts most teams are still ignoring

Most teams shipping AI in 2026 are still solving 2024 problems with 2024 tools. Here are the five shifts that flipped — prompt caching, agents, evals, tool-calling, observability — and what each one costs to ignore.

5 shifts to ship · or pay for ignoring
  • positioning
  • production-ai
  • evals
  • observability
  • prompt-caching
  • agents

Most teams shipping AI in 2026 are still solving 2024 problems with 2024 tools, and it costs them somewhere between 2x and "their AI feature has quietly stopped working." The mental model that got teams to a working demo in 2024 — bespoke RAG pipelines, exact-string caches, vibes-based eval, every API call a cold call — is the wrong shape for production AI now. I'll name the five shifts that flipped between 2024 and 2026, what each one costs to ignore, and what to do about each this week.

This is opinionated. I run three live AI products from Egypt and the cost of being wrong about any of these is on my own bill, not a client's. Treat the takes accordingly.

The five shifts that flipped

Prompt caching went from 'nice optimization' to table stakes

The 2024 mental model was: every API call is a cold call, you eat the full system-prompt token cost on every request, and the line item labelled "AI cost" grows linearly with traffic. That model is dead. Anthropic shipped prompt caching to GA. OpenAI did the same with their Responses API context. If your system prompt is larger than ~1,000 tokens (most are) and you're not caching the static prefix, you're paying somewhere between 2x and 5x the actual marginal cost of each call.
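For concreteness, here's roughly what the change looks like on Anthropic's API: a minimal sketch assuming the anthropic Python SDK, with a placeholder system prompt and an illustrative model ID.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for your real static prefix: instructions, schemas, examples.
STATIC_SYSTEM_PROMPT = "You are the support assistant for ... (1,000+ tokens)"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative; any cache-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            # Everything up to this breakpoint becomes a cacheable prefix;
            # later calls with an identical prefix read it back at a
            # fraction of the normal input-token price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "the per-request, uncached part"}],
)

# On warm calls, usage.cache_read_input_tokens > 0 confirms the cache is hit.
print(response.usage)
```

That's the whole integration for the common case: mark the static prefix, keep it byte-identical across requests, and watch the usage fields to verify the hits.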

The trade you're making by not adopting it: a few days of integration work in exchange for a permanent line-item drop. That's a stupid trade to refuse. The teams I see still refusing it are usually telling themselves they'll "refactor the prompt structure first." They've been telling themselves that for nine months.

Agents are real — for narrow domains, with structured tools

2024 had two opposing camps. Camp A said agents are vapor and only single-shot completions ship to production. Camp B said agents replace engineers by Q3. Both are wrong. The truth as of 2026: agentic loops with structured tool-calling work in production — but only for narrow domains, only with strict tool schemas, and only when you've already built the deterministic version of the workflow.

I run an agent fleet for my own ops. Each agent has a job description shorter than a paragraph, a tool list with five to nine tools, and clear boundaries on what it cannot touch. They work because I built the deterministic plumbing first and let the agent loop replace the orchestrator, not the executor. Teams that try to skip the deterministic step and "just use an agent" build something that demos brilliantly and breaks on the third novel input.
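A minimal sketch of that shape, assuming Anthropic's tool-use API. The tool, its schema, and the model ID are illustrative stand-ins, not my actual stack; the parts that matter are the strict schema, the deterministic function behind it, and the hard cap on turns.

```python
import json
import anthropic

client = anthropic.Anthropic()

# One narrow tool, strict schema: the agent can do this and nothing else.
TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch an order record by its ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def lookup_order(order_id: str) -> dict:
    # Deterministic plumbing, built and tested before any agent touched it.
    return {"order_id": order_id, "status": "shipped"}

def run_agent(user_message: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):  # hard cap so the loop can't run away
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",  # illustrative
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use" and block.name == "lookup_order":
                out = lookup_order(**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(out),
                })
        messages.append({"role": "user", "content": results})
    return "agent hit the turn limit"
```

The loop replaces the orchestrator. The executor — lookup_order and whatever sits behind it — is ordinary code you could have shipped in 2019.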

Evals replaced 'feels right'

The single sharpest line between a 2024 team and a 2026 team is this: ask them what their AI feature scored on its last regression eval run. If the answer is a number, they're 2026. If the answer is "we monitor it qualitatively" or "users haven't complained," they're 2024 and shipping in the dark.

Evals are not a research practice anymore. They're CI. A 2026 team has a golden set of 50-200 examples per AI surface, ground-truth labels, an eval harness in the deploy pipeline, and a rule that says you don't merge prompt changes that drop the eval score by more than X. The trade you're making by skipping this: you save a week of harness work and lose the ability to refactor the prompt without fear. That's a bad trade. Prompt confidence is the most valuable thing eval suites buy you, more than the regression catches themselves.
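A sketch of the harness shape, small enough to run under pytest in CI. The classify function, the golden-set path, and the 0.92 gate are all placeholders; the point is the structure, not the numbers.

```python
import json

PASS_THRESHOLD = 0.92  # hypothetical gate; set it from your current baseline

def classify(text: str) -> str:
    """Stand-in for the production call under test (prompt + model)."""
    raise NotImplementedError("wire this to your AI surface")

def test_no_eval_regression():
    # One {"input": ..., "expected": ...} object per line in the golden set.
    with open("golden_set.jsonl") as f:
        cases = [json.loads(line) for line in f]
    hits = sum(classify(c["input"]) == c["expected"] for c in cases)
    score = hits / len(cases)
    print(f"eval score: {score:.3f} over {len(cases)} cases")
    # CI fails the merge when a prompt change drops below the gate.
    assert score >= PASS_THRESHOLD
```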

Tool-calling beat custom orchestration

2024 was the year of LangChain, custom RAG chains, manually-orchestrated multi-step pipelines, retrieval-augmented this and that. In 2026, the model is the orchestrator. Define your tools cleanly, give the model a good system prompt, and let it call the tools in whatever order it picks. The bespoke chain code you wrote in 2024 is now a liability: it's harder to debug, harder to update for new model versions, and it boxes the model into a worse plan than the one it would pick on its own.

I'm not saying LangChain is wrong for everything. I'm saying that if you're starting a new build in 2026 and your first instinct is to wire up a chain framework, you should sit with the question of whether you actually need it. Most new builds don't. Tool-calling natively, plus a thin glue layer you wrote yourself, beats the framework on debuggability, on latency, and on how-quickly-can-this-team-onboard-a-new-engineer.
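By "thin glue layer" I mean something on this order of magnitude: a registry dict and a dispatcher. A sketch with hypothetical tool names; the whole framework fits on one screen, which is exactly the point.

```python
from typing import Any, Callable

# The entire "orchestration framework": a dict from tool name to function.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}

def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a plain function as a model-callable tool."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@tool
def search_docs(query: str) -> list[str]:
    return ["...matching passages..."]  # your retrieval call goes here

@tool
def fetch_ticket(ticket_id: str) -> dict:
    return {"id": ticket_id, "status": "open"}  # your API call goes here

def dispatch(name: str, args: dict) -> Any:
    """The model picks the tool and the order; we just execute and log."""
    if name not in TOOL_REGISTRY:
        return {"error": f"unknown tool {name!r}"}  # surfaced back to the model
    return TOOL_REGISTRY[name](**args)
```

A new engineer reads this in five minutes. That's the debuggability and onboarding win in concrete form.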

Observability moved from afterthought to checklist-zero

2024 teams added observability last. The classic pattern: ship the feature, see the bill, panic, integrate LangSmith or LangFuse or Helicone in week six, finally see what's actually happening. 2026 teams instrument from day one. Token cost per user, per route, per request. Latency per tool call. Cache hit rate broken down by prompt template. Failure rate per model — because by 2026 most teams are using more than one model and you need to know which one is the flaky one.
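Instrumenting from day one can be as small as one wrapper around the model call. A sketch assuming the anthropic SDK; the field names are my own convention, but every number here feeds one of the metrics just listed.

```python
import json
import logging
import time
import uuid

log = logging.getLogger("llm")

def instrumented_call(client, *, route: str, model: str, **kwargs):
    """Wrap every model call so cost, latency, and cache hits land in one logs table."""
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    resp = client.messages.create(model=model, **kwargs)
    log.info(json.dumps({
        "request_id": request_id,
        "route": route,          # token cost per route comes from this field
        "model": model,          # failure/flakiness per model comes from this one
        "latency_ms": round((time.monotonic() - start) * 1000),
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        # Present when prompt caching is on; the cache-hit-rate chart reads this.
        "cache_read_tokens": getattr(resp.usage, "cache_read_input_tokens", 0),
    }))
    return resp
```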

The trade: a day of setup at the start of the project versus a fortnight of reverse-engineering at the end. The cost of the second option is not just the fortnight — it's the months you spent operating blind, paying invisible inefficiencies, and shipping prompt changes whose downstream effects you couldn't see.

The cheap test for which side of the line you're on

Three questions. If you can't answer any of them in under thirty seconds, you're on the 2024 side of at least one shift:
  • What did your AI feature cost per user last week?
  • What is your cache hit rate on your largest prompt template?
  • Which of your models has the highest failure rate right now?

A 2026 team has dashboards for all three. The dashboards are not fancy — most of them are SQL on top of a logs table — but they exist, and the team checks them. A 2024 team has a vague sense and a billing screen.

What to do this week

If you read the five shifts and you're already sitting on the right side of all of them: good, move on, the rest of the field is catching up. If not, here's the order I'd fix them in for highest leverage per hour spent:
  1. Prompt caching: a few days of work for a permanent line-item drop.
  2. Observability: a day of setup that makes every later change measurable.
  3. Evals: a week of harness work that buys you prompt confidence.
  4. Tool-calling: migrate off the bespoke chain one surface at a time.
  5. Agents: last, and only where the deterministic version already works.

The reason caching is first isn't just the bill. Its ROI is measurable and permanent, and watching it land gives the team confidence that the next change is also worth doing. Most AI engineering work at 5-50 person teams gets deferred because the team can't tell which change actually moves the bill or the score. Caching breaks that paralysis: once you've watched the bill drop, you'll stop debating the next move.

What hasn't changed

Three things in production AI 2026 are exactly what they were in 2024, and the discipline matters more than ever.

First, model selection still matters more than prompt cleverness. The right model for the task at the right tier still beats a worse model with an artisanal prompt. The 2024 instinct ("just use GPT-4 for everything, prompt-engineer harder") is more wrong now, not less, because the spread between the cheap tier and the expensive tier got wider — Haiku-class models do most classification and extraction at a fraction of the cost of full-tier reasoning models.
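The operational version of that instinct is a routing table, not a prompt tweak. A sketch with illustrative task names and placeholder model IDs; substitute whatever cheap/expensive pair you actually run.

```python
# Task-to-tier routing. Cheap models carry classification and extraction;
# the expensive tier is reserved for work where reasoning depth pays for itself.
MODEL_BY_TASK = {
    "classify":  "cheap-haiku-class-model",
    "extract":   "cheap-haiku-class-model",
    "summarize": "mid-tier-model",
    "plan":      "full-tier-reasoning-model",
}

DEFAULT_MODEL = "mid-tier-model"

def pick_model(task: str) -> str:
    """Route by task type first; prompt cleverness can't rescue a wrong tier."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)
```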

Second, the system prompt is still where most quality lives. I see teams in 2026 spending two weeks tuning the eval suite while the system prompt is 1,800 tokens of contradictory examples. Fix the prompt first. Then measure. A clean, compressed system prompt is the highest-ROI artifact in any production AI system, then and now.

Third, the unit of progress is still "shipped to a paying user." The teams that learn fastest are the teams that put the AI in front of a real user as early as possible and watch what happens. That hasn't changed since 2022. It won't change in 2027.

The honest part

Some of what I just argued will be wrong by 2027. Tool-calling-as-orchestration in particular is a snapshot: frameworks like LangGraph are still evolving, and the model-as-orchestrator pattern works because today's models are good enough to plan; tomorrow's planners might benefit from explicit graph scaffolding again. I'd rather be specific and partly wrong than vague and untestable. If you're on a team that disagrees with one of the five shifts, the right move is to write down why your case differs and audit the bet in six months. Most won't. The ones that do will run circles around the ones that don't.

If you read this and recognized your team in one of the 2024 mental models, the audit version of this — applied to your stack, with a written architecture brief and one prototype change shipped — is what the Audit Sprint buys. Same thinking, applied to your bill.


// next move

Want a written architecture brief on your AI stack?

1 week, $1,500, fixed scope. Working prototype of one change in your stack — yours to keep regardless.
