The State of AI Infrastructure 2026
A ground-truth report on where AI infrastructure stands at the start of 2026 — GPU availability, inference pricing, the neocloud wars, and the architecture patterns winning in production.
Two years after the first version of our stack explainer, the shape of AI infrastructure has consolidated and the open questions have changed. 2024 asked: “How do we run LLMs in production?” 2026 asks: “How do we run agent fleets in production, efficiently, across mixed GPU generations, with spend governance, across sovereign regions?”
This is the refreshed stack view. What’s stable, what’s new, and what the next shift looks like.
The layer model has survived:
What’s changed is mostly inside each layer. Let’s walk through.
Where 2024 was H100-dominant, 2026 fleets are typically three generations of GPU running simultaneously:
The old approach of “one GPU type, many workloads” has given way to workload-matched placement. Platform teams route training to B200, bulk inference to H100, dev/small-model to A100/L40S.
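Workload-matched placement like the routing above can be sketched as a simple policy table. This is an illustrative sketch, not a real scheduler API — the pool names and the `place_workload` function are hypothetical.

```python
# Hypothetical GPU placement policy mirroring the routing described above:
# training -> B200, bulk inference -> H100, dev/small-model -> A100/L40S.
PLACEMENT_POLICY = {
    "training": "b200-pool",
    "bulk-inference": "h100-pool",
    "dev": "a100-l40s-pool",
    "small-model": "a100-l40s-pool",
}

def place_workload(workload_class: str) -> str:
    """Return the GPU pool for a workload class, defaulting to the cheapest pool."""
    return PLACEMENT_POLICY.get(workload_class, "a100-l40s-pool")

print(place_workload("training"))        # b200-pool
print(place_workload("bulk-inference"))  # h100-pool
```

In practice this policy lives in the scheduler (Kubernetes node selectors, queue-to-pool mappings), but the core idea is exactly this lookup.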
AMD MI300X / MI325X is meaningfully represented in 2026 fleets — ~25% of new deployments we see include AMD GPUs, up from near-zero in 2024. ROCm's software story is now competitive with CUDA for most workloads. See MI300X vs H100.
Custom silicon (Groq, Cerebras, AWS Trainium/Inferentia, TPUs) holds steady 15–20% share of inference traffic, concentrated in hosted-API providers. Few teams build on them directly.
The neocloud vs hyperscaler gap narrowed but didn’t close. Hyperscalers dropped H100 on-demand prices 40%. Neoclouds still undercut by 25–40%. B200 follows the same pattern a year behind.
GPU reserved pricing for 1-year commits in early 2026:
The LLM inference server picture consolidated:
The big 2026 shift: disaggregated inference — separating prefill and decode onto different node pools — is now standard practice for large-scale deployments. vLLM, SGLang, and TRT-LLM all support it. Throughput wins of 30–50% on long-prompt workloads are real. See Disaggregated Inference.
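The shape of disaggregation can be shown with a toy router: prefill requests go to one node pool, decode continues on another, with a handle tracking where the KV cache lives. This is a conceptual sketch only — real engines (vLLM, SGLang, TRT-LLM) manage KV-cache transfer internally, and the node-selection scheme here is made up.

```python
# Toy sketch of disaggregated inference: prefill (compute-bound) and
# decode (memory-bandwidth-bound) run on separate node pools.
from dataclasses import dataclass

@dataclass
class KVHandle:
    request_id: str
    prefill_node: str  # where the KV cache currently lives

def run_prefill(request_id: str, prompt: str) -> KVHandle:
    # Deterministic toy node selection across 4 prefill nodes.
    node = f"prefill-{sum(map(ord, request_id)) % 4}"
    # ... compute-bound prompt processing happens here ...
    return KVHandle(request_id=request_id, prefill_node=node)

def run_decode(handle: KVHandle) -> str:
    # Decode pool is sized separately (8 nodes here, illustratively).
    node = f"decode-{sum(map(ord, handle.request_id)) % 8}"
    # ... KV cache is transferred from handle.prefill_node to `node` ...
    return f"decode scheduled on {node} (KV from {handle.prefill_node})"

h = run_prefill("req-1", "long prompt ...")
result = run_decode(h)
print(result)
```

The throughput win comes from sizing the two pools independently: long-prompt workloads need proportionally more prefill capacity, short-prompt chat needs more decode.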
FP4 is the default precision for B200 inference. FP8 on H100. INT4 (AWQ/GPTQ) for A100. Quantization is no longer optional.
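The per-generation defaults above reduce to a lookup in most serving configs. A minimal sketch, with a hypothetical `pick_precision` helper:

```python
# Default inference precision by GPU generation, per the text:
# FP4 on B200, FP8 on H100, INT4 (AWQ/GPTQ) on A100.
DEFAULT_PRECISION = {
    "B200": "fp4",
    "H100": "fp8",
    "A100": "int4-awq",  # or "int4-gptq"
}

def pick_precision(gpu: str) -> str:
    if gpu not in DEFAULT_PRECISION:
        raise ValueError(f"no quantization default for {gpu}")
    return DEFAULT_PRECISION[gpu]

print(pick_precision("B200"))  # fp4
```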
This is where the biggest 2024→2026 shift lives.
In 2024, orchestration meant “a LangChain or LangGraph pipeline with some tool calls.” In 2026, agent frameworks became a proper category with enterprise-grade options:
The Model Context Protocol (MCP) won as the tool-calling standard. Nearly every framework supports it. Tool registries are shared across agents. See our Multi-Agent Orchestration Infrastructure guide.
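The value of a shared tool registry can be illustrated with a minimal sketch in the spirit of MCP's list-tools/call-tool model. This is NOT the MCP SDK — the `ToolRegistry` class and its methods are hypothetical stand-ins for the real protocol.

```python
# Minimal sketch of a tool registry shared across agents, in the spirit
# of MCP's tool-listing and tool-calling model (not the real MCP SDK).
from typing import Any, Callable

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def list_tools(self) -> list[str]:
        return sorted(self._tools)

    def call(self, name: str, **kwargs: Any) -> Any:
        return self._tools[name](**kwargs)

registry = ToolRegistry()
registry.register("get_weather", lambda city: f"sunny in {city}")

# Any agent holding `registry` sees the same tool surface:
print(registry.list_tools())
print(registry.call("get_weather", city="Oslo"))
```

The point of the standard is exactly this decoupling: tools are registered once, and every framework that speaks the protocol can discover and call them.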
Agent infrastructure is now a distinct category from inference serving. Different scaling, different observability, different resource shapes. See Agent Infrastructure: What’s Different.
Vector DBs consolidated:
What’s new: the context stack — the set of systems managing an agent’s memory, history, and retrieval — expanded beyond vector DBs alone.
Production agents now routinely run:
This stack has a name now: “context engineering.” See Context Engineering: Storage, Retrieval, and the New Memory Stack.
The two-year pattern is clear: Langfuse and OTel-native backends beat siloed proprietary SDKs.
Most production deployments we see now use:
Evals are finally first-class. Every serious team has a regression-testing harness that gates prompt and model changes. See Model Evals in Production.
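The core of such a gate fits in a few lines: score the candidate on a fixed eval set and block the change if it regresses against the baseline. A minimal sketch — the exact-match scorer and zero tolerance are illustrative; real harnesses use richer metrics.

```python
# Sketch of an eval regression gate: a prompt/model change ships only if
# it scores at least as well as the current baseline on a fixed eval set.
def eval_pass_rate(outputs: list[str], expected: list[str]) -> float:
    hits = sum(o.strip() == e.strip() for o, e in zip(outputs, expected))
    return hits / len(expected)

def gate(candidate_rate: float, baseline_rate: float, tolerance: float = 0.0) -> bool:
    """Block the change if the candidate regresses beyond tolerance."""
    return candidate_rate >= baseline_rate - tolerance

baseline = eval_pass_rate(["4", "Paris"], ["4", "Paris"])    # 1.0
candidate = eval_pass_rate(["4", "London"], ["4", "Paris"])  # 0.5
print(gate(candidate, baseline))  # False: regression detected, change blocked
```

Wiring this into CI so a failing gate blocks the merge is what makes evals "first-class" rather than a dashboard nobody checks.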
Cost attribution is standard. “Who spent what on which model” is a report finance can pull any day of the month.
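The report itself is a simple rollup over usage records. A sketch with made-up prices and record shape — real attribution pulls from gateway logs and provider billing exports:

```python
# Illustrative cost-attribution rollup: per-team, per-model spend.
# Prices and usage records below are invented for the example.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.002}

usage = [
    {"team": "search", "model": "model-a", "tokens": 500_000},
    {"team": "search", "model": "model-b", "tokens": 2_000_000},
    {"team": "support", "model": "model-a", "tokens": 100_000},
]

report: dict[tuple[str, str], float] = defaultdict(float)
for rec in usage:
    cost = rec["tokens"] / 1000 * PRICE_PER_1K_TOKENS[rec["model"]]
    report[(rec["team"], rec["model"])] += cost

for (team, model), dollars in sorted(report.items()):
    print(f"{team:8s} {model:8s} ${dollars:,.2f}")
```

The hard part is not the arithmetic but guaranteeing every request carries a team tag — which is another argument for routing everything through the gateway.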
LiteLLM (open source) and Portkey (managed) remain the dominant gateway options. Kong AI Gateway is strong for Kong shops.
2026 additions:
The gateway is the single most valuable piece of infrastructure to add early. Still.
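Why the gateway pays off early: one call site gets provider fallback and per-request logging behind it. A hedged sketch — the `Gateway` class and providers here are hypothetical stand-ins, not LiteLLM's or Portkey's actual API.

```python
# Toy gateway illustrating provider fallback and request logging.
from typing import Callable

class Gateway:
    def __init__(self, providers: list[tuple[str, Callable[[str], str]]]):
        self.providers = providers  # ordered by preference
        self.log: list[str] = []

    def complete(self, prompt: str) -> str:
        for name, call in self.providers:
            try:
                out = call(prompt)
                self.log.append(f"ok:{name}")
                return out
            except Exception:
                self.log.append(f"fail:{name}")
        raise RuntimeError("all providers failed")

def flaky_provider(prompt: str) -> str:
    raise TimeoutError("upstream timeout")

def stable_provider(prompt: str) -> str:
    return f"echo: {prompt}"

gw = Gateway([("primary", flaky_provider), ("fallback", stable_provider)])
print(gw.complete("hi"))  # echo: hi
print(gw.log)             # ['fail:primary', 'ok:fallback']
```

Everything downstream — cost attribution, evals on sampled traffic, model aliasing — gets cheaper once all traffic flows through one choke point like this.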
Things that didn’t exist (or barely existed) in 2024:
Regional requirements drove real investment in sovereign inference stacks. European, Indian, Gulf-region, and Japanese providers operate full AI stacks inside their respective jurisdictions. EU AI Act compliance drove a lot of this. See Running Sovereign AI.
Small models (3B, 7B) running on consumer GPUs and even phones are production-viable. Llama 4 Tiny, Qwen 3 Edge, Phi-5 are designed for this. Inference-on-device use cases (privacy, latency, offline) are real. See Inference at the Edge.
As noted: agent infra is a category now. Platforms like LangGraph Cloud, Bedrock Agents, Turion Agents ship managed agent orchestration.
Previously ad-hoc memory patterns are now named, productized, and discussed. Short-term memory (KV cache, context window), working memory (active context), long-term memory (vector + graph stores) is a standard decomposition.
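The standard decomposition can be sketched as a data structure. The class and method names below are hypothetical; in real systems the long-term tier is backed by vector and graph stores rather than an in-memory dict.

```python
# Illustrative sketch of the three-tier agent memory decomposition:
# short-term (context window / KV cache), working (active task state),
# long-term (stands in for vector + graph stores).
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    short_term: list[str] = field(default_factory=list)
    working: dict[str, str] = field(default_factory=dict)
    long_term: dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.working[key] = value

    def consolidate(self) -> None:
        """Promote working memory to long-term storage between tasks."""
        self.long_term.update(self.working)
        self.working.clear()

mem = AgentMemory()
mem.remember("user_timezone", "CET")
mem.consolidate()
print(mem.long_term)  # {'user_timezone': 'CET'}
```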
“AI platform engineer” is a real role. The playbook is settling. See Building an AI Platform Team.
1. Multi-region agent consistency. Running an agent fleet across 5 regions with consistent behavior is harder than stateless inference because of context stores, tool registries, and provider variations by region.
2. Compliance breadth. EU AI Act, India DPDP, US state regs, sector-specific (HIPAA, FINRA). Legal teams need to stay current. Engineering pays the cost of translating requirements into controls.
3. Rapid model deprecation. OpenAI, Anthropic, and Google deprecate models aggressively. Your app needs to tolerate model churn without user-visible regressions.
4. Agent debugging. Long trajectories, tool-call loops, emergent behaviors. Observability is better but the underlying problem is genuinely hard.
5. Cost scaling in multi-agent systems. An agent that calls another agent that calls a third creates token amplification that’s easy to miss until the bill arrives.
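The amplification in point 5 is worth a back-of-envelope check. If each agent in a chain forwards more context than it received, tokens grow roughly geometrically with chain depth. The function and the 1.5x forward ratio below are illustrative, not measured values.

```python
# Back-of-envelope token amplification in a chained multi-agent call.
def amplified_tokens(base_tokens: int, depth: int, forward_ratio: float = 1.5) -> int:
    """Total tokens across a chain where each hop carries
    `forward_ratio` times the previous hop's context."""
    total = 0
    hop = base_tokens
    for _ in range(depth):
        total += hop
        hop = int(hop * forward_ratio)
    return total

print(amplified_tokens(1_000, depth=1))  # 1000: a single agent
print(amplified_tokens(1_000, depth=3))  # 4750: 1000 + 1500 + 2250
```

A three-deep chain nearly quintuples token spend versus a single call in this toy model — which is why per-trajectory cost tracking matters, not just per-request.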
Early predictions:
If you’re starting fresh: Use hosted APIs + LiteLLM gateway + pgvector + OTel + Langfuse. Deploy with Claude, GPT-4o-class, or Llama-3.3-70B. Add fine-tuning only when baseline quality gaps are measurable.
If you’re mid-growth: Time to invest in platform engineering. AI FinOps, eval regression, multi-cloud gateway, security reviews. These compound.
If you’re at scale: B200 for new commits. Invest in disaggregated serving, multi-LoRA per tenant, context engineering, and agent orchestration infrastructure.
Across all sizes: Treat AI infrastructure as software. Evals, CI, canaries, rollbacks. The teams that do this ship 10x faster.
Planning your 2026 AI infrastructure? Let’s talk — we help shops from pre-launch to global scale.