GPU FinOps: Reducing Your $10M AI Compute Bill

Balys Kriksciunas · 8 min read
#ai #infrastructure #finops #gpu #cost #compute #commitments #optimization

An AI company’s GPU bill follows a predictable curve. First year: invisible. Second year: noticeable. Third year: dominant. At $10M/year of GPU spend, compute becomes the single largest line item after payroll — and the organization has to treat it with the same seriousness as headcount budgeting.

This post is the FinOps playbook for large GPU bills. Not the “tag your workloads” basics; the structural moves that shift millions of dollars.


The Levers, Ranked By Impact

From our work with shops at the $5M–$50M/year compute scale, here are the levers in rough order of dollar impact:

  1. Commitment structure — on-demand vs reserved vs bare-metal
  2. Workload right-sizing — matching workload to GPU generation
  3. Utilization — getting above 70% sustained
  4. Provider mix — neoclouds vs hyperscalers
  5. Model right-sizing — not using GPT-4o for classification
  6. Disaggregated serving — see our dedicated post
  7. Quantization — FP8 is essentially free on H100
  8. Caching — prompt caching, semantic caching
  9. Model distillation — custom smaller models for your workload
  10. Geographic optimization — region selection

We’ll focus on the top five. They represent 70–80% of the achievable savings.


Lever 1: Commitment Structure

On-demand GPU pricing is a tax you pay for flexibility. If your workload is sustained, commit.

| Commitment | Discount vs on-demand |
| --- | --- |
| Spot / preemptible | 30–60% |
| 1-year reserved | 30–50% |
| 3-year reserved | 50–65% |
| Dedicated / bare-metal multi-year | 50–70% |
| Co-location + owned hardware | 60–75% (incl. capex amortization) |

At $10M/year of spend, moving from on-demand to 1-year reserved saves $3M–$5M. 3-year reserved saves more but locks you in when hardware generations change.

The right blend:

  - Reserved or dedicated capacity for the baseline you are confident will persist for the term
  - On-demand for burst and unpredictable traffic
  - Spot for interruptible work: batch jobs, evaluations, experimentation
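
A minimal sketch of the math, assuming illustrative rates (not quotes): a reservation at discount d pays for itself once you use more than 1 − d of the committed hours, and the blended cost follows directly.

```python
# Illustrative rates only; substitute your negotiated numbers.
ON_DEMAND = 4.00          # $/GPU-hour, hypothetical H100 on-demand rate
RESERVED_DISCOUNT = 0.40  # 1-year reserved at 40% off (mid-range of 30-50%)

def breakeven_utilization(discount: float) -> float:
    """A reservation wins once you use more than (1 - discount) of the
    committed hours; below that, idle commitment eats the savings."""
    return 1.0 - discount

def blended_hourly_cost(avg_demand_gpus: float, reserved_gpus: float) -> float:
    """Serve the baseline from reserved capacity, buy the rest on-demand."""
    reserved_rate = ON_DEMAND * (1 - RESERVED_DISCOUNT)
    on_demand_gpus = max(0.0, avg_demand_gpus - reserved_gpus)
    return reserved_gpus * reserved_rate + on_demand_gpus * ON_DEMAND

print(f"breakeven: {breakeven_utilization(RESERVED_DISCOUNT):.0%} of committed hours")
# Reserve a 200-GPU baseline against 260 GPUs of average demand:
print(f"blended: ${blended_hourly_cost(260, 200):,.0f}/hr "
      f"vs ${blended_hourly_cost(260, 0):,.0f}/hr all on-demand")
```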

This is hard to execute without baseline visibility. Which leads to lever 2.


Lever 2: Workload Right-Sizing

“All our inference is on H100” is a red flag. Different workloads want different hardware.

Typical workload placement we recommend:

| Workload | GPU | Why |
| --- | --- | --- |
| 70B+ inference | B200 / MI300X | Memory, FP4/FP8 efficiency |
| 7–13B inference at high QPS | H100 / L40S | Right-sized for throughput |
| 7–13B inference at low QPS | L4 / A10 | No need to pay for H100 |
| Batch embedding | A100 / L40S | Cheap bandwidth |
| Training foundation models | B200 / H100 w/ InfiniBand | Network matters |
| Fine-tuning | H100 (or MI300X for memory) | Balanced FLOPS / memory |
| Dev / experimentation | A100 / L40S / spot | Cost matters, perf doesn’t |

Moving classification workloads off H100 and onto L4 easily cuts that workload’s cost 4–5x. Across a $10M bill, right-sizing typically delivers 15–25% total savings.
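
To make the 4–5x concrete, here is a toy comparison under stated assumptions: hypothetical hourly rates, and a low-QPS classifier whose fleet is sized for availability rather than throughput.

```python
# Sketch: a low-QPS classifier pays for GPU-hours, not throughput.
# Rates and replica counts are illustrative assumptions.
H100_RATE = 4.00  # $/GPU-hour, hypothetical
L4_RATE = 0.70    # $/GPU-hour, hypothetical

# At low QPS the fleet is sized for availability and latency, not throughput:
h100_replicas = 2  # what "everything runs on H100" looks like
l4_replicas = 3    # one extra replica for latency headroom

h100_monthly = h100_replicas * H100_RATE * 730  # ~730 hours/month
l4_monthly = l4_replicas * L4_RATE * 730

print(f"H100: ${h100_monthly:,.0f}/mo  L4: ${l4_monthly:,.0f}/mo  "
      f"ratio: {h100_monthly / l4_monthly:.1f}x")
# ~3.8x cheaper here; with equal replica counts it is ~5.7x (the rate ratio).
```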


Lever 3: Utilization

An H100 sitting at 35% utilization costs the same as one at 90%. The dirty secret of many AI fleets is sub-50% sustained utilization on expensive hardware.

What drives low utilization:

  - Static provisioning against bursty or diurnal traffic
  - Dev boxes and notebook instances that idle overnight and on weekends
  - Fragmentation: many small deployments, each holding a GPU it barely uses
  - Jobs that hold GPUs while blocked on data loading or CPU preprocessing

Structural fixes:

  - Consolidate onto shared clusters with a scheduler that bin-packs
  - Autoscale aggressively, including scale-to-zero for off-peak services
  - Detect and reclaim idle allocations automatically
  - Backfill reserved capacity with spot-tolerant batch work

Each 10-point utilization improvement on a large fleet is a meaningful dollar amount. Going from 40% to 70% on a $10M fleet is effectively $4M in savings.
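
The arithmetic behind that claim, as a sketch (assuming the same useful work served by a proportionally smaller fleet):

```python
def utilization_savings(annual_spend: float, util_before: float,
                        util_after: float) -> float:
    """Same useful work at higher utilization needs util_before/util_after
    of the fleet; the rest of the spend is savings."""
    return annual_spend * (1 - util_before / util_after)

print(f"${utilization_savings(10_000_000, 0.40, 0.70):,.0f}")  # ~$4.3M on $10M
```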


Lever 4: Provider Mix

Hyperscalers charge a premium for integrated services. Neoclouds undercut on GPU specifically. See Multi-Cloud GPU Strategy.

Neocloud rates for the same GPU routinely come in well below hyperscaler on-demand pricing. At scale, the delta is worth real engineering investment to run multi-cloud.

Pattern:

  - Baseline inference and training on reserved neocloud capacity
  - Burst, plus anything that leans on managed services or strict compliance boundaries, stays on the hyperscaler

For a $10M/year bill, shifting from pure hyperscaler to a mixed deployment with neocloud reserved as the baseline typically saves $2M–$3M.
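
A back-of-envelope version of that figure, with hypothetical rates:

```python
# Illustrative rates; real quotes vary widely by term and region.
HYPERSCALER_OD = 4.00     # $/GPU-hour on-demand, hypothetical
NEOCLOUD_RESERVED = 2.60  # $/GPU-hour reserved, hypothetical
HOURS = 8760              # hours/year

baseline_gpus = 250  # sustained load -> neocloud reserved
burst_gpus = 60      # average burst -> hyperscaler on-demand

pure = (baseline_gpus + burst_gpus) * HYPERSCALER_OD * HOURS
mixed = (baseline_gpus * NEOCLOUD_RESERVED + burst_gpus * HYPERSCALER_OD) * HOURS

print(f"pure hyperscaler: ${pure/1e6:.1f}M/yr  mixed: ${mixed/1e6:.1f}M/yr  "
      f"saved: ${(pure - mixed)/1e6:.1f}M/yr")  # ~$3.1M on a ~$10.9M bill
```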


Lever 5: Model Right-Sizing

A team with only one model in their stack is always spending too much. Different queries merit different models.

Current 2026 price ladder:

| Tier | Model | Rough cost ($/M input) |
| --- | --- | --- |
| Frontier | GPT-4o, Claude Opus 4 | $15–$25 |
| Fast frontier | Claude Sonnet 4, Gemini 2.5 Pro | $3–$5 |
| Mid tier | GPT-4o-mini, Claude Haiku 4.5 | $0.15–$0.50 |
| Cheap tier | Llama-3.3-70B hosted | $0.30–$0.50 |
| Ultra cheap | Llama-3.2-8B hosted | $0.05–$0.10 |
| Your own | Self-hosted 70B FP8 | ~$0.15–$0.40 (blended) |

A task-appropriate ladder lets you run classification at 1/50th the cost of its GPT-4o version. Router logic costs a little complexity; the savings compound.
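
A minimal router sketch. The prices mirror the table above (rough midpoints); the task-to-tier mapping is a hypothetical policy, not a recommendation for your workload.

```python
LADDER = {  # task type -> cheapest tier that handles it (hypothetical policy)
    "classification": "gpt-4o-mini",      # mid tier
    "extraction":     "gpt-4o-mini",
    "summarization":  "llama-3.3-70b",    # cheap tier
    "drafting":       "claude-sonnet-4",  # fast frontier
    "hard_reasoning": "claude-opus-4",    # frontier, last resort
}

PRICES = {  # $/M input tokens, rough midpoints from the ladder above
    "gpt-4o": 15.00,
    "gpt-4o-mini": 0.30,
    "llama-3.3-70b": 0.40,
    "claude-sonnet-4": 4.00,
    "claude-opus-4": 20.00,
}

def route(task: str) -> str:
    """Unknown task types fall back to the mid tier, not the frontier."""
    return LADDER.get(task, "gpt-4o-mini")

m = route("classification")
print(f"{m}: 1/{PRICES['gpt-4o'] / PRICES[m]:.0f}th of GPT-4o input price")
# -> gpt-4o-mini: 1/50th of GPT-4o input price
```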

See AI FinOps: Tracking Token Spend for the attribution side and LLM Gateway Patterns for the routing side.


Attribution: The Foundation

You cannot cut what you don’t see. Every large AI shop ends up with the same three-layer attribution:

Layer 1: Per-workload

Which service, which team, which feature. Implemented via:

  - Resource tags on every deployment and training job
  - Per-team namespaces and quotas on shared clusters
  - Token metering at the LLM gateway, labeled by service and feature

Layer 2: Per-customer (for SaaS)

Which end customer is generating which spend. Implemented via:

  - A customer ID propagated through request metadata
  - Aggregation in the gateway or billing layer

Layer 3: Per-business-outcome

Cost per resolved support ticket. Cost per generated document. Cost per qualified lead. This is where FinOps becomes a strategic conversation.
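
All three layers reduce to the same mechanical rollup once the gateway emits labeled cost events. A sketch, with hypothetical event fields:

```python
from collections import defaultdict

events = [  # hypothetical per-request records emitted by an LLM gateway
    {"team": "support", "customer": "acme", "feature": "ticket_bot",
     "cost_usd": 0.012, "outcome": "ticket_resolved"},
    {"team": "support", "customer": "globex", "feature": "ticket_bot",
     "cost_usd": 0.009, "outcome": "ticket_resolved"},
    {"team": "sales", "customer": "acme", "feature": "lead_scoring",
     "cost_usd": 0.002, "outcome": None},
]

def rollup(key: str) -> dict:
    """Sum spend by any label on the event: team, customer, feature."""
    totals: dict = defaultdict(float)
    for e in events:
        totals[e[key]] += e["cost_usd"]
    return dict(totals)

print("per team:    ", rollup("team"))      # layer 1
print("per customer:", rollup("customer"))  # layer 2

resolved = [e for e in events if e["outcome"] == "ticket_resolved"]
print("cost per resolved ticket:",
      f"${sum(e['cost_usd'] for e in resolved) / len(resolved):.4f}")  # layer 3
```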


Financial Structures Worth Knowing

As spend grows past a threshold, you gain access to commercial options:

Reserved capacity contracts

1-year or 3-year capacity at fixed pricing. Best for baseline load.

Enterprise agreements

Annual or multi-year commitments across multiple services. Hyperscalers will negotiate at $500k+/year.

Bare metal / dedicated racks

Lease racks in a colo. Buy or lease the GPUs. Eliminates cloud margin; adds ops overhead. Worth it above ~1000 GPUs sustained.

Owned infrastructure

Capex your own cluster. Best long-term economics; massive capex commitment; ops team required.

Most shops at $10M/year compute spend should have at least reserved capacity and an enterprise agreement. Bare metal is worth evaluating above $20M/year.
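
A rough way to sanity-check the bare-metal threshold. Every number below is an assumption to replace with real quotes:

```python
# Sketch: colo-vs-cloud breakeven at a given fleet size. All figures
# are hypothetical placeholders, not vendor pricing.
GPUS = 1000
CLOUD_RATE = 2.50        # $/GPU-hour, hypothetical reserved cloud rate
HOURS_PER_YEAR = 8760

CAPEX_PER_GPU = 35_000   # hypothetical server cost amortized per GPU
AMORTIZATION_YEARS = 4
OPEX_PER_GPU_YEAR = 6_000  # power, colo space, network, ops staff share

cloud_annual = GPUS * CLOUD_RATE * HOURS_PER_YEAR
owned_annual = GPUS * (CAPEX_PER_GPU / AMORTIZATION_YEARS + OPEX_PER_GPU_YEAR)

print(f"cloud: ${cloud_annual/1e6:.1f}M/yr  owned: ${owned_annual/1e6:.1f}M/yr")
# ~A third cheaper at these assumptions; smaller fleets lose that margin
# to fixed ops overhead, which is why ~1000 sustained GPUs is a common bar.
```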


Organizational Structure

FinOps above $5M/year needs more than a dashboard. It needs an operating model:

  - A named owner for the GPU bill
  - A weekly review cadence, with escalation when spend deviates from plan
  - Per-team budgets with visible burn rates
  - Cost review built into launch and capacity planning

The human structure matters more than any tool. Teams with “a FinOps dashboard” save little. Teams with “Tomás reviews the GPU bill weekly and escalates deviations” save a lot.


The Common Traps

1. Saving 10% by optimizing the wrong line item. Focus on the top 3 workloads. Everything else is noise.

2. Ignoring capacity commits when scaling down. You committed to 100 H100s for three years; you now need 50. Either sublease the capacity, fill it with other work, or eat the loss.

3. Hyperscaler credits mask bad habits. “AWS gave us $5M in credits” → team doesn’t optimize → credits run out → cliff.

4. Chasing every new GPU generation. B200 cost/perf is real, but upgrading mid-reservation just costs money.

5. Under-investing in observability. You cannot manage what you cannot measure. Spend is a sacred cow until you have the data to question it.


The Roadmap From Chaos To Control

If your company’s GPU spend has crept above $500k/month without FinOps discipline:

Month 1: Baseline attribution. Who, what, how much. Accept the number is bigger than anyone thinks.

Month 2: Rate card negotiation with current providers. 10–20% gains from talking.

Month 3: Reserved commitments for baseline load. 20–30% on committed portion.

Month 4: Right-sizing. Audit workload → GPU match. 10–20% gains.

Month 5: Multi-cloud evaluation. Neoclouds vs hyperscalers for suitable workloads. 20–30% gains on shifted workloads.

Month 6: Utilization drive. Consolidate, autoscale aggressively, kill idle.

Month 7+: Structural levers — model right-sizing, disaggregation, distillation.

A team that executes this sequence typically cuts their compute bill 30–50% in the first year.
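
Why those stacked percentages land at 30–50% rather than their naive sum: each lever compounds on the remaining bill and only touches part of the spend. A sketch, with assumed shares:

```python
levers = [
    # (gain on affected spend, assumed share of bill affected)
    (0.15, 1.00),  # rate card negotiation, whole bill
    (0.25, 0.60),  # reserved commitments, baseline portion
    (0.15, 0.50),  # right-sizing, half the workloads
    (0.25, 0.30),  # neocloud shift, suitable workloads
]

remaining = 1.0
for gain, share in levers:
    remaining *= 1 - gain * share  # each lever acts on what's left

print(f"bill at end of sequence: {remaining:.0%} of start "
      f"({1 - remaining:.0%} saved)")  # ~38% saved with these shares
```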


The Short Version

GPU FinOps at scale is a discipline, not a dashboard. The big wins come from:

  - Commitment structure matched to your real baseline
  - Workload right-sizing across GPU generations
  - Driving sustained utilization above 70%
  - A deliberate provider mix of neoclouds and hyperscalers
  - Model right-sizing with a routing ladder

These aren’t glamorous. They’re boring. They’re also where the dollars are. Companies that invest here run 20–40% lower unit costs than those that don’t. At serious scale, that’s the difference between a profitable AI product and one that isn’t.


Further Reading

  - Multi-Cloud GPU Strategy
  - AI FinOps: Tracking Token Spend
  - LLM Gateway Patterns

Compute bill growing faster than revenue? Let’s talk — we run FinOps engagements for AI-heavy companies at every scale.
