
The State of AI Infrastructure 2025

Balys Kriksciunas · 8 min read

#ai #infrastructure #state-of-industry #2025 #analysis #trends


2024 was the year AI infrastructure stopped being research and became ops. Teams that were prototyping in notebooks a year ago are now running fleets, paying six-figure GPU bills, and carrying AI services in their production PagerDuty rotation. The questions shifted from “can we get it to work?” to “can we make it cheap, fast, and reliable?”

This is our annual ground-truth report on where the stack stands heading into 2025. We work with companies running AI platforms from a single A10G to multi-thousand-H100 fleets; this reflects what we actually see.


The Hardware Picture

H100 availability has normalized

The H100 shortage that dominated 2023–2024 is over. On-demand H100 capacity is now broadly available at most hyperscalers and neoclouds without advance reservations. Reserved-term pricing has come down 25–40% year over year.

Going rate (Jan 2025):

A100 is now the budget option. H100 is the workhorse. The A100 crossover point — where H100 cost efficiency beats A100 on reserved pricing — is reached for almost every workload we size.

B200 is shipping

Blackwell B200 started reaching hyperscaler and neocloud fleets in Q4 2024. Early users report 2.0–2.8x H100 throughput on training workloads, with FP4 precision giving further inference wins.

What we’re telling clients in early 2025:

MI300X keeps pulling weight

AMD’s MI300X went from “interesting” to “credible production option” during 2024. Several large shops (Microsoft, Meta, AMD’s launch customers) are running MI300X fleets. ROCm’s software story improved notably; vLLM, PyTorch, and HuggingFace all run cleanly on MI300X without heroic porting.

Where MI300X wins:

Where it lags:

We expect MI300X to be a 20–30% share player by end of 2025, up from <5% entering 2024.

The custom silicon players

The pattern: custom silicon is winning targeted workloads (specific hosted APIs, specific training pipelines) but NVIDIA still owns the general-purpose GPU market for 90%+ of teams.


Inference Pricing Fell Off a Cliff

This is the biggest story of 2024 carrying into 2025. LLM token prices collapsed:

| Model tier | Early 2024 | Early 2025 | Change |
| --- | --- | --- | --- |
| GPT-4 class | $30 / $60 per M | $2.50 / $10 per M (GPT-4o) | -85%+ |
| Claude Sonnet class | $3 / $15 per M | $3 / $15 per M (Sonnet 3.5) | stable |
| Llama-3-70B hosted (cheapest) | $0.88 / M | $0.35 / M (DeepInfra, Together) | -60% |
| Llama-3-8B hosted | $0.20 / M | $0.07 / M | -65% |

Gemini 2.0 Flash, GPT-4o-mini, and Haiku 3.5 all sit at or near $0.25 / M input tokens. That’s basically free compared to early 2024. This changes the economics of every self-hosting decision.

Implication: The breakeven for self-hosting moved up. What used to be “self-host at 10M tokens/day” is now “self-host at 50M+ tokens/day” for most Llama-class workloads. Fine-tunes, privacy, and latency remain the strongest self-host arguments; pure cost is a harder sell.
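The breakeven shift above is simple arithmetic: hosted cost scales with tokens, while a dedicated box is a fixed daily spend. A minimal sketch, with purely illustrative prices (a 70B-class hosted endpoint at $1.00/M tokens and a self-hosted GPU server amortizing to ~$60/day), shows why falling hosted prices push the crossover volume up:

```python
# Back-of-envelope self-host breakeven. All prices are illustrative
# assumptions, not quotes: plug in your own hosted rate and server cost.

def breakeven_tokens_per_day(hosted_price_per_m: float,
                             server_cost_per_day: float) -> float:
    """Daily token volume above which a dedicated server beats the hosted API."""
    return server_cost_per_day / hosted_price_per_m * 1_000_000

# Assumed: hosted 70B-class at $1.00 per M tokens, GPU server at ~$60/day.
tokens = breakeven_tokens_per_day(hosted_price_per_m=1.00, server_cost_per_day=60.0)
print(f"breakeven ≈ {tokens / 1e6:.0f}M tokens/day")  # → breakeven ≈ 60M tokens/day
```

Halve the hosted price and the breakeven volume doubles, which is exactly the dynamic squeezing pure-cost self-host arguments.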


Architecture Patterns Winning in 2025

From the deployments we work on, these are the patterns stabilizing into best practices:

1. LLM gateway first, everything else second

Every multi-provider stack needs a gateway layer, whether LiteLLM, Portkey, Kong AI Gateway, or something custom. It handles retries, fallback, key management, PII redaction, cost attribution, and rate limiting in one place. See LLM Gateway Patterns.
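The retry-and-fallback core of such a gateway can be sketched in a few lines of plain Python. This is a hypothetical skeleton, not any particular gateway's API: each provider is just a callable, and transient failures are modeled as `TimeoutError`.

```python
# Minimal gateway core: try providers in priority order, retrying transient
# failures with jittered backoff before falling through to the next one.
# The provider callables and error model are illustrative assumptions.
import random
import time

class AllProvidersFailed(Exception):
    pass

def complete(prompt: str, providers: list, retries_per_provider: int = 2):
    """Return the first successful completion across the provider chain."""
    for call in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except TimeoutError:
                # Exponential backoff with jitter, then retry this provider.
                time.sleep((2 ** attempt) * 0.1 * random.random())
    raise AllProvidersFailed(prompt)
```

A real gateway layers key management, redaction, and cost attribution around this loop, but the priority-ordered fallback chain is the piece that keeps the app up when one provider degrades.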

2. Observability before features

Teams that instrumented early (with OTel, Langfuse, or LangSmith) iterate 3–5x faster on prompt and agent quality. The teams still bolting on tracing after launch are the ones stuck debugging intermittent issues for weeks.
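The minimum useful unit of instrumentation is a span around every model and tool call: name, latency, success flag. A toy stand-in for what OTel or Langfuse record (the decorator and trace buffer here are hypothetical, not either library's API):

```python
# Toy span recorder: wraps any call and logs name, latency, and error
# status. A stand-in sketch for real OTel/Langfuse spans, not their API.
import functools
import time

TRACE: list = []  # in-memory span buffer; a real setup exports these

def traced(name: str):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            span = {"name": name, "ok": True}
            try:
                return fn(*args, **kwargs)
            except Exception:
                span["ok"] = False
                raise
            finally:
                span["ms"] = (time.perf_counter() - start) * 1000
                TRACE.append(span)
        return inner
    return wrap
```

Even this much turns "the agent is sometimes slow" into "the retrieval span is the slow one", which is where the 3–5x iteration speedup comes from.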

3. Hybrid retrieval by default

BM25 + dense retrieval + reranker. Dense-only RAG is an anti-pattern for high-quality search. The plumbing is modest once you’ve done it once. See Hybrid Search in Production.
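One common way to merge the BM25 and dense ranked lists before the reranker is reciprocal rank fusion (RRF); the doc IDs below are of course made up:

```python
# Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d)),
# rewarding documents that rank well in *both* BM25 and dense retrieval.

def rrf(rankings: list, k: int = 60) -> list:
    """Fuse multiple ranked lists of doc IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_a", "doc_b", "doc_c"]
dense = ["doc_b", "doc_d", "doc_a"]
# doc_b wins: top in dense, second in BM25.
print(rrf([bm25, dense]))  # → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

The fused list then feeds the reranker, which does the expensive per-pair scoring on a short candidate set.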

4. Eval-driven prompt development

Regression-tested prompts. You don’t ship a prompt change without an eval suite. This is the strongest predictor we’ve seen of shipping velocity for AI products.
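In practice this means a suite of cases that gates every prompt deploy. A minimal sketch (the cases, the containment check, and the fake model are all illustrative; real suites use LLM-as-judge or structured checks):

```python
# Tiny regression-eval harness: a prompt change ships only if every case
# in the suite still passes. Cases and the checker are illustrative.

def run_evals(generate, cases: list) -> dict:
    """generate(prompt) -> str; each case has 'prompt' and 'must_contain'."""
    failures = [c for c in cases
                if c["must_contain"] not in generate(c["prompt"])]
    return {"total": len(cases), "failed": len(failures), "failures": failures}

cases = [
    {"prompt": "Refund policy for order 123?", "must_contain": "refund"},
    {"prompt": "What is 2+2?", "must_contain": "4"},
]

def fake_model(prompt: str) -> str:  # stand-in for the real model call
    return "Our refund policy says 4 business days."

report = run_evals(fake_model, cases)
assert report["failed"] == 0  # the deploy gate: block the ship otherwise
```

The harness matters less than the habit: the assertion at the end is what makes a prompt change a tested change.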

5. FP8 quantization as default

On H100, FP8 is essentially free performance with negligible quality loss. Any 70B+ inference not using FP8 is wasting GPUs.
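In vLLM this is a one-flag change; exact flag behavior varies by vLLM version, so treat this as a sketch and check the docs for the release you run:

```shell
# Serve a 70B model with FP8 weight quantization on H100s via vLLM.
# --quantization fp8 enables on-the-fly FP8; tensor-parallel size is an
# illustrative choice for a 4-GPU node, not a recommendation.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --quantization fp8 \
    --tensor-parallel-size 4
```

Pre-quantized FP8 checkpoints also avoid the load-time conversion cost if you redeploy often.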

6. Disaggregated prefill/decode

The newest pattern — splitting prefill (compute-bound) and decode (bandwidth-bound) onto different node pools. Running in production at a few large shops, generally available in vLLM/SGLang during 2025. 30–50% throughput wins on workloads with long prompts.
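The scheduling idea reduces to routing by phase: prefill processes the whole prompt in parallel (compute-bound), decode emits one token per step per sequence (bandwidth-bound), so each goes to hardware sized for its bottleneck. A toy sketch, with hypothetical pool names and request shape:

```python
# Toy disaggregated-serving router: prefill requests go to a compute-heavy
# pool, decode continues on a bandwidth-heavy pool after a KV-cache handoff.
# Pool names and the Request shape are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    phase: str  # "prefill" or "decode"
    prompt_tokens: int = 0

def route(req: Request) -> str:
    # Prefill: whole prompt in one compute-bound pass, builds the KV cache.
    # Decode: token-by-token, bound by memory bandwidth, reads that cache.
    return "prefill-pool" if req.phase == "prefill" else "decode-pool"

assert route(Request("r1", "prefill", prompt_tokens=8000)) == "prefill-pool"
assert route(Request("r1", "decode")) == "decode-pool"
```

The hard engineering is the KV-cache transfer between pools, which is why this lived only in large shops' custom stacks before landing in the open-source servers.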


The Open-Source Frontier

The open-source model story shifted meaningfully:

For teams choosing a self-host target, the default in 2025 is Llama 3.3 70B or DeepSeek V3 for general chat, Qwen 2.5-Coder for coding agents, and Mistral Large 2 for European compliance.


Cost and FinOps Became a Real Discipline

AI bills got big enough that finance started caring. In 2024, teams were adding FinOps controls as an afterthought. In 2025, it’s table stakes:

See our AI FinOps guide for the full playbook.
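The core artifact finance asks for is a per-team cost rollup from gateway usage logs. A minimal sketch, with made-up model names, prices, and log fields:

```python
# Per-team cost attribution from gateway usage logs. Model names, prices,
# and the log schema are illustrative assumptions, not real rates.
from collections import defaultdict

PRICE_PER_M = {"large-model": 2.50, "small-model": 0.25}  # $ per M tokens

def attribute_costs(usage_log: list) -> dict:
    """usage_log rows: {'team', 'model', 'tokens'} -> {team: dollars}."""
    costs = defaultdict(float)
    for row in usage_log:
        costs[row["team"]] += row["tokens"] / 1e6 * PRICE_PER_M[row["model"]]
    return dict(costs)

log = [
    {"team": "search",  "model": "large-model", "tokens": 4_000_000},
    {"team": "search",  "model": "small-model", "tokens": 20_000_000},
    {"team": "support", "model": "small-model", "tokens": 8_000_000},
]
print(attribute_costs(log))  # → {'search': 15.0, 'support': 2.0}
```

Emitting a `team` tag on every gateway request is the prerequisite; without it, the bill is one undifferentiated line item and the conversation with finance goes badly.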


What’s Still Broken

Places where the stack is still immature and will be a 2025 focus:

1. Multi-region inference. Running an LLM app that’s fast in both Frankfurt and Singapore is still harder than it should be. Expect more cross-region model serving tools this year.

2. Agent orchestration at scale. LangGraph and CrewAI work for small teams, but no mature “Kubernetes for agents” exists yet. Managed offerings are starting (LangGraph Cloud, AWS Bedrock Agents), but the space is young.

3. Fine-tune management. Managing dozens of LoRA adapters, retraining on fresh data, deploying without downtime — still requires a lot of custom tooling per org.

4. Regulatory compliance. The EU AI Act came into force. Most teams are still sorting out what it means operationally. Expect a wave of “AI governance” tools in 2025.

5. Evals that match production. Offline eval sets diverge from production traffic. The best teams are investing in continuous eval pipelines that sample real traffic.


Predictions for 2025

Where we think things go:

  1. H100 becomes the A100 — the default workhorse. B200 handles frontier training. L40S/L4 fleets proliferate for small-model inference.
  2. Inference margins compress further. $0.15/M tokens for 70B-class becomes realistic.
  3. One major neocloud IPOs in 2025 (CoreWeave is the obvious candidate).
  4. Agent infrastructure gets its own category. Distinct from inference serving. See our early take in Agent Infrastructure: What’s Different.
  5. EU sovereign AI infrastructure emerges. French, German, UK providers scale up, driven by EU AI Act.
  6. Vector DB consolidation. Two or three clear winners emerge. Pgvector’s share keeps growing. Some smaller players get acquired or wind down.
  7. MCP becomes the default tool protocol. Standardizes agent tooling across frameworks.
  8. On-prem AI returns for regulated industries. Enterprise on-prem H100 clusters ship in meaningful volume.

What To Do With This

If you’re scoping 2025 infrastructure work:


Further Reading

Planning your 2025 AI infrastructure roadmap? Let’s talk — we help shops from pre-launch to scale-beyond-hypergrowth.
