The AI Infrastructure Stack: 2025 Edition
A refreshed view of the production AI stack at the start of 2025 — what changed during 2024, what's consolidating, and where the next round of innovation is landing.
2024 was the year AI infrastructure stopped being research and became ops. Teams that were prototyping in notebooks a year ago are now running fleets, paying six-figure GPU bills, and carrying AI services in their production PagerDuty rotation. The questions shifted from “can we get it to work?” to “can we make it cheap, fast, and reliable?”
This is our annual ground-truth report on where the stack stands heading into 2025. We work with companies running AI platforms from a single A10G to multi-thousand-H100 fleets; this reflects what we actually see.
The H100 shortage that dominated 2023–2024 is over. On-demand H100 is now broadly available at most hyperscalers and neoclouds without capacity reservations. Reserved-term pricing has come down 25–40% year over year.
Going rate (Jan 2025):
A100 is now the budget option; H100 is the workhorse. The crossover point — where H100’s cost efficiency on reserved pricing beats the A100’s — is now reached for almost every workload we size.
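A back-of-envelope way to check the crossover for your own workload; every number below is an illustrative assumption, not a quoted rate:

```python
# Back-of-envelope H100 vs A100 crossover check.
# Hourly prices and the speedup factor are illustrative assumptions --
# substitute your own reserved rates and benchmarked throughput.
A100_HOURLY = 1.30    # assumed reserved $/GPU-hr
H100_HOURLY = 2.60    # assumed reserved $/GPU-hr
H100_SPEEDUP = 2.5    # assumed throughput multiple over A100 for your workload

a100_cost_per_unit = A100_HOURLY / 1.0          # A100 throughput is the baseline unit of work
h100_cost_per_unit = H100_HOURLY / H100_SPEEDUP

print(f"A100: ${a100_cost_per_unit:.2f} per unit of work")
print(f"H100: ${h100_cost_per_unit:.2f} per unit of work")
print("H100 is the better buy" if h100_cost_per_unit < a100_cost_per_unit
      else "A100 still wins")
```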
Blackwell B200 started reaching hyperscaler and neocloud fleets in Q4 2024. Early users report 2.0–2.8x H100 throughput on training workloads, with FP4 precision giving further inference wins.
What we’re telling clients in early 2025:
AMD’s MI300X went from “interesting” to “credible production option” during 2024. Several large shops (Microsoft, Meta, AMD’s launch customers) are running MI300X fleets. ROCm’s software story improved notably; vLLM, PyTorch, and the Hugging Face libraries all run cleanly on MI300X without heroic porting.
Where MI300X wins:
Where it lags:
We expect MI300X to be a 20–30% share player by end of 2025, up from <5% entering 2024.
The pattern: custom silicon is winning targeted workloads (specific hosted APIs, specific training pipelines) but NVIDIA still owns the general-purpose GPU market for 90%+ of teams.
This is the biggest story of 2024 carrying into 2025. LLM token prices collapsed:
| Model tier | Early 2024 | Early 2025 | Change |
|---|---|---|---|
| GPT-4 class | $30 / $60 per M (input / output) | $2.50 / $10 (GPT-4o) | -85%+ |
| Claude Sonnet class | $3 / $15 | $3 / $15 (Sonnet 3.5) | stable |
| Llama-3-70B hosted (cheapest) | $0.88 / M | $0.35 / M (DeepInfra, Together) | -60% |
| Llama-3-8B hosted | $0.20 / M | $0.07 / M | -65% |
Gemini 2.0 Flash, GPT-4o-mini, and Haiku 3.5 all price input tokens at well under $1 / M. That’s basically free compared to early 2024. This changes the economics of every self-hosting decision.
Implication: The breakeven for self-hosting moved up. What used to be “self-host at 10M tokens/day” is now “self-host at 50M+ tokens/day” for most Llama-class workloads. Fine-tunes, privacy, and latency remain the strongest self-host arguments; pure cost is a harder sell.
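A crude way to locate your own breakeven; the function and the example inputs are assumptions, and the answer swings by an order of magnitude depending on which hosted tier and input/output mix you're displacing:

```python
def breakeven_tokens_per_day(hosted_rate_per_m: float,
                             gpu_hourly: float,
                             gpus: int,
                             overhead: float = 1.3) -> float:
    """Daily token volume where a self-hosted fleet's fixed cost equals
    the hosted-API bill. All inputs are yours to supply."""
    self_host_daily_cost = gpu_hourly * 24 * gpus * overhead
    return self_host_daily_cost / hosted_rate_per_m * 1e6

# Example with assumed numbers: 2 reserved H100s at $2.60/hr serving a 70B,
# displacing hosted tokens at a blended $1.00/M.
print(f"{breakeven_tokens_per_day(1.00, 2.60, 2) / 1e6:,.0f}M tokens/day")
```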
From the deployments we work on, these are the patterns stabilizing into best practices:
Every multi-provider stack needs a gateway layer. LiteLLM, Portkey, Kong AI Gateway, or custom. It handles retries, fallback, key management, PII redaction, cost attribution, and rate limiting in one place. See LLM Gateway Patterns.
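As a sketch of the behaviour that layer centralizes, here is a minimal fallback-and-retry wrapper built on LiteLLM's completion call; the model identifiers, team tag, and retry policy are illustrative, not a recommended config:

```python
import time
from litellm import completion   # pip install litellm

# Minimal sketch of what the gateway layer buys you: an ordered fallback
# chain, retries with backoff, and one place to tag requests for cost
# attribution. Model names and the metadata key are illustrative.
FALLBACK_CHAIN = ["gpt-4o-mini", "anthropic/claude-3-5-haiku-20241022"]

def gateway_completion(messages, team="search", retries_per_model=2):
    last_error = None
    for model in FALLBACK_CHAIN:
        for attempt in range(retries_per_model):
            try:
                return completion(
                    model=model,
                    messages=messages,
                    metadata={"team": team},   # picked up by logging callbacks
                )
            except Exception as exc:           # rate limits, timeouts, provider 5xx
                last_error = exc
                time.sleep(2 ** attempt)
    raise RuntimeError(f"all providers in the fallback chain failed: {last_error}")
```

In production you'd push this logic into the gateway itself (LiteLLM's router or a hosted proxy) rather than application code, but the behaviour is the same.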
Teams that instrumented early (with OTel, Langfuse, or Langsmith) iterate 3–5x faster on prompt and agent quality. The teams still adding tracing after launch are the ones stuck debugging intermittent issues for weeks.
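A minimal sketch of what that instrumentation looks like with the OpenTelemetry API; the span and attribute names, and the run_model stub, are illustrative rather than any fixed convention:

```python
from opentelemetry import trace

# One span per model call, carrying the attributes you'll want when
# debugging quality or cost regressions.
tracer = trace.get_tracer("ai-platform")

def run_model(prompt: str, model: str) -> str:
    return "stub response"          # stand-in for your real inference client

def traced_completion(prompt: str, model: str) -> str:
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_chars", len(prompt))
        response = run_model(prompt, model)
        span.set_attribute("llm.completion_chars", len(response))
        return response
```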
BM25 + dense retrieval + reranker. Pure dense-only RAG is an anti-pattern for high-quality search. The plumbing is modest once you’ve done it once. See Hybrid Search in Production.
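A minimal sketch of the fusion step, assuming hypothetical bm25_search, dense_search, and rerank helpers in place of your actual keyword index, vector store, and cross-encoder:

```python
# Hybrid retrieval glued together with reciprocal rank fusion (RRF).
def rrf_fuse(ranked_lists, k=60):
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 10):
    bm25_hits = bm25_search(query, limit=50)    # hypothetical keyword index
    dense_hits = dense_search(query, limit=50)  # hypothetical vector store
    fused = rrf_fuse([bm25_hits, dense_hits])
    return rerank(query, fused[: top_k * 3])[:top_k]   # hypothetical cross-encoder pass
```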
Regression-tested prompts. You don’t ship a prompt change without an eval suite. This is the strongest predictor we’ve seen of shipping velocity for AI products.
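A sketch of what that gate can look like as a pytest suite; the case file, schema, and run_prompt hook are assumptions for illustration:

```python
import json
import pytest

# A frozen case file, one test per case, run in CI on every prompt change.
with open("evals/support_bot_cases.json") as f:
    CASES = json.load(f)    # e.g. [{"input": "...", "must_contain": "..."}, ...]

def run_prompt(user_input: str) -> str:
    raise NotImplementedError("call your candidate prompt/model here")

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["input"][:40])
def test_prompt_regression(case):
    output = run_prompt(case["input"])
    assert case["must_contain"].lower() in output.lower()
```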
On H100, FP8 is essentially free performance with negligible quality loss. Any 70B+ inference not using FP8 is wasting GPUs.
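As a sketch, FP8 serving in vLLM is roughly a one-line change; the model choice and parallelism below are illustrative, and the exact quantization options depend on your vLLM version:

```python
from vllm import LLM, SamplingParams

# Serve a 70B model with FP8 weight quantization on H100s.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",   # assumed model choice
    quantization="fp8",                          # dynamic FP8 on Hopper
    tensor_parallel_size=4,                      # assumed 4x H100 node
)

outputs = llm.generate(
    ["Summarize our Q4 infrastructure spend in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```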
The newest pattern: splitting prefill (compute-bound) and decode (memory-bandwidth-bound) onto different node pools. It is running in production at a few large shops and should become generally available in vLLM/SGLang during 2025, with 30–50% throughput wins on workloads with long prompts.
The open-source model story shifted meaningfully:
For teams choosing a self-host target, the default in 2025 is Llama 3.3 70B or DeepSeek V3 for general chat, Qwen 2.5-Coder for coding agents, and Mistral Large 2 for European compliance.
AI bills got big enough that finance started caring. In 2024, teams were adding FinOps controls as an afterthought. In 2025, it’s table stakes:
See our AI FinOps guide for the full playbook.
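The first control most teams wire in is per-request cost attribution at the gateway. A minimal sketch, assuming an illustrative price table and a hypothetical emit_metric helper:

```python
# Per-request cost attribution. Keep real per-model rates in config, not code.
PRICE_PER_M_TOKENS = {
    # model: (input $/M, output $/M) -- assumed figures, update regularly
    "gpt-4o": (2.50, 10.00),
    "llama-3.3-70b-hosted": (0.35, 0.40),
}

def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    in_rate, out_rate = PRICE_PER_M_TOKENS[model]
    return prompt_tokens / 1e6 * in_rate + completion_tokens / 1e6 * out_rate

def record_usage(team: str, model: str, prompt_tokens: int, completion_tokens: int) -> None:
    cost = request_cost_usd(model, prompt_tokens, completion_tokens)
    emit_metric("llm.cost_usd", cost, tags={"team": team, "model": model})  # hypothetical metrics client
```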
Places where the stack is still immature and will be a 2025 focus:
1. Multi-region inference. Running an LLM app that’s fast in both Frankfurt and Singapore is still harder than it should be. Expect more cross-region model serving tools this year.
2. Agent orchestration at scale. LangGraph and CrewAI work for small teams, but no mature “Kubernetes for agents” exists yet. Managed offerings are starting (LangGraph Cloud, AWS Bedrock Agents), but the space is young.
3. Fine-tune management. Managing dozens of LoRA adapters, retraining on fresh data, deploying without downtime — still requires a lot of custom tooling per org.
4. Regulatory compliance. The EU AI Act came into force. Most teams are still sorting out what it means operationally. Expect a wave of “AI governance” tools in 2025.
5. Evals that match production. Offline eval sets diverge from production traffic. The best teams are investing in continuous eval pipelines that sample real traffic.
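On that last point, a sketch of the traffic-sampling half of a continuous eval pipeline; the sample rate and enqueue_for_eval are assumptions, not a specific tool:

```python
import random

# Divert a small slice of production requests into an eval queue so the
# offline set tracks real usage.
SAMPLE_RATE = 0.02   # ~2% of production traffic

def maybe_sample_for_eval(request_id: str, prompt: str, response: str) -> None:
    if random.random() < SAMPLE_RATE:
        enqueue_for_eval({            # hypothetical: write to your eval store
            "request_id": request_id,
            "prompt": prompt,
            "response": response,
        })
```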
Where we think things go:
If you’re scoping 2025 infrastructure work:
Planning your 2025 AI infrastructure roadmap? Let’s talk — we help teams from pre-launch through hypergrowth scale.