AI FinOps: Tracking Token Spend Across Your Org
LLM bills grew from invisible to huge in the span of a year. A complete FinOps playbook for AI workloads: attribution, budgets, alerting, and the reports finance actually wants.
An AI company’s GPU bill follows a predictable curve. First year: invisible. Second year: noticeable. Third year: dominant. At $10M/year of GPU spend, compute becomes the single largest line item after payroll — and the organization has to treat it with the same seriousness as headcount budgeting.
This post is the FinOps playbook for large GPU bills. Not the “tag your workloads” basics; the structural moves that shift millions of dollars.
From our work with shops at the $5M–$50M/year compute scale, the levers in rough order of dollar impact:

1. Capacity commitments (reserved, spot, bare metal)
2. Hardware right-sizing (matching workloads to GPUs)
3. Fleet utilization
4. Multi-cloud placement (neoclouds vs hyperscalers)
5. Model right-sizing and routing
6. Cost attribution and showback
7. Commercial negotiation (enterprise agreements, colo)
8. The FinOps operating model
We’ll focus on the top five. They represent 70–80% of the achievable savings.
On-demand GPU pricing is a tax you pay for flexibility. If your workload is sustained, commit.
| Commitment | Discount vs on-demand |
|---|---|
| Spot / preemptible | 30–60% |
| 1-year reserved | 30–50% |
| 3-year reserved | 50–65% |
| Dedicated / bare-metal multi-year | 50–70% |
| Co-location + owned hardware | 60–75% (incl. capex amortization) |
At $10M/year of spend, moving from on-demand to 1-year reserved saves $3M–$5M. 3-year reserved saves more but locks you in when hardware generations change.
The right blend for most shops:

- 1-year reserved for the baseline load you're confident persists
- On-demand for burst
- Spot for interruptible batch work
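As a back-of-envelope check, here is how a 70/15/15 blend prices out. The rates and mix below are hypothetical placeholders, not provider quotes:

```python
# Blended $/GPU-hour for a commitment mix. All rates and the mix are
# illustrative assumptions, not provider pricing.
ON_DEMAND = 4.00  # hypothetical H100 on-demand rate, $/GPU-hour

RATES = {
    "reserved_1yr": ON_DEMAND * 0.60,  # ~40% discount, mid of the 30-50% band
    "on_demand": ON_DEMAND,
    "spot": ON_DEMAND * 0.55,          # ~45% discount, mid of the 30-60% band
}
MIX = {"reserved_1yr": 0.70, "on_demand": 0.15, "spot": 0.15}

blended = sum(RATES[k] * share for k, share in MIX.items())
print(f"blended: ${blended:.2f}/GPU-hr, {1 - blended / ON_DEMAND:.0%} below on-demand")
```

At these assumed rates the blend lands roughly 35% below pure on-demand, in line with the $3M–$5M savings figure above.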
This is hard to execute without baseline visibility. Which leads to lever 2.
“All our inference is on H100” is a red flag. Different workloads want different hardware.
Typical workload placement we recommend:
| Workload | GPU | Why |
|---|---|---|
| 70B+ inference | B200 / MI300X | Memory, FP4/FP8 efficiency |
| 7–13B inference at high QPS | H100 / L40S | Right-sized for throughput |
| 7–13B inference at low QPS | L4 / A10 | No need to pay for H100 |
| Batch embedding | A100 / L40S | Cheap bandwidth |
| Training foundation models | B200 / H100 w/ InfiniBand | Network matters |
| Fine-tuning | H100 (or MI300X for memory) | Balanced FLOPS / memory |
| Dev / experimentation | A100 / L40S / spot | Cost matters, perf doesn’t |
Moving classification workloads off H100 and onto L4 easily cuts that workload’s cost 4–5x. Across a $10M bill, right-sizing typically delivers 15–25% total savings.
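The arithmetic behind that 4–5x, with hypothetical prices and throughput: at modest QPS a small classifier can't saturate an H100, so you pay for idle capacity.

```python
import math

# Cost to serve a fixed classification load on two GPU types.
# Hourly rates and sustained QPS per GPU are assumptions for the sketch.
WORKLOAD_QPS = 150
GPUS = {"H100": (4.00, 900), "L4": (0.80, 200)}  # ($/hr, QPS per GPU)

for name, (rate, qps_per_gpu) in GPUS.items():
    n = max(1, math.ceil(WORKLOAD_QPS / qps_per_gpu))
    print(f"{name}: {n} GPU(s), ${n * rate * 730:,.0f}/month")
```

At 150 QPS both fleets need a single GPU, so the L4 serves the same traffic at a fifth of the cost.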
An H100 sitting at 35% utilization costs the same as one at 90%. The dirty secret of many AI fleets is sub-50% sustained utilization on expensive hardware.
What drives low utilization:

- Dev and experimentation instances left running after hours
- Capacity statically provisioned for peak traffic
- Per-team GPU allocations that fragment the fleet and can't be pooled
Structural fixes:

- Pool GPUs across teams instead of carving out per-team allocations
- Autoscale against actual traffic rather than provisioning for peak
- Reclaim idle instances automatically (TTLs on dev boxes, idle alarms)
Each 10-point utilization improvement on a large fleet is a meaningful dollar amount. Going from 40% to 70% on a $10M fleet means the same useful work needs only 40/70 ≈ 57% of the machines, effectively $4M in savings.
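The same arithmetic as a sketch (the fleet spend is the $10M example above):

```python
# Savings from a utilization improvement at constant useful work.
FLEET_SPEND = 10_000_000
UTIL_BEFORE, UTIL_AFTER = 0.40, 0.70

# Same useful GPU-hours, delivered by a smaller fleet at higher utilization.
spend_after = FLEET_SPEND * UTIL_BEFORE / UTIL_AFTER
print(f"new spend: ${spend_after:,.0f}, saved: ${FLEET_SPEND - spend_after:,.0f}")
# -> new spend: $5,714,286, saved: $4,285,714
```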
Hyperscalers charge a premium for integrated services. Neoclouds undercut on GPU specifically. See Multi-Cloud GPU Strategy.
Neoclouds typically undercut hyperscaler on-demand GPU rates by a wide margin. At scale, the delta is worth real engineering investment to run multi-cloud.
The pattern that works: neocloud reserved capacity as the baseline, hyperscaler capacity for burst and for workloads that depend on integrated services. For a $10M/year bill, shifting from pure hyperscaler to this mixed deployment typically saves $2M–$3M.
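A sketch of the mixed-deployment math, with placeholder rates:

```python
# Blended rate for a neocloud-baseline / hyperscaler-burst split.
# Both rates and the split are assumptions, not published pricing.
HYPERSCALER_OD = 4.00     # $/GPU-hour on-demand
NEOCLOUD_RESERVED = 2.20  # $/GPU-hour reserved (assumption)
BASELINE_SHARE = 0.60     # fraction of GPU-hours that is steady baseline

blended = BASELINE_SHARE * NEOCLOUD_RESERVED + (1 - BASELINE_SHARE) * HYPERSCALER_OD
print(f"blended ${blended:.2f}/GPU-hr, {1 - blended / HYPERSCALER_OD:.0%} below pure hyperscaler")
```

At these assumptions the mix comes out roughly 27% below pure hyperscaler on-demand, consistent with the $2M–$3M figure on a $10M bill.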
A team with only one model in their stack is always spending too much. Different queries merit different models.
Current 2026 price ladder:
| Tier | Model | Rough cost ($/M input) |
|---|---|---|
| Frontier | GPT-4o, Claude Opus 4 | $15–$25 |
| Fast frontier | Claude Sonnet 4, Gemini 2.5 Pro | $3–$5 |
| Mid tier | GPT-4o-mini, Claude Haiku 4.5 | $0.15–$0.50 |
| Cheap tier | Llama-3.3-70B hosted | $0.30–$0.50 |
| Ultra cheap | Llama-3.2-8B hosted | $0.05–$0.10 |
| Your own | Self-hosted 70B FP8 | ~$0.15–$0.40 (blended) |
A task-appropriate ladder lets you run a classification at 1/50th the cost of its GPT-4o version. Router logic costs a little complexity; the savings compound.
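A router doesn't need to be clever to capture most of the savings. A minimal sketch, with an illustrative task-to-tier mapping (the model names follow the ladder above; the mapping itself is an assumption to tune against your own quality evals):

```python
# Minimal model router: map task type to the cheapest adequate tier.
# The ROUTES table is illustrative, not a recommendation.
ROUTES = {
    "classify": "llama-3.2-8b",        # ultra cheap tier
    "summarize": "gpt-4o-mini",        # mid tier
    "draft_email": "claude-sonnet-4",  # fast frontier
    "legal_review": "claude-opus-4",   # frontier, accuracy-critical
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to mid tier, not frontier: the default
    # should never silently land on the expensive rung of the ladder.
    return ROUTES.get(task, "gpt-4o-mini")
```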
See LLM Gateway Patterns for the routing side; the attribution side is covered next.
You cannot cut what you don’t see. Every large AI shop ends up with the same three-layer attribution:
Layer 1 is workload attribution: which service, which team, which feature. Implemented via resource tags on the infrastructure and per-request metadata (team, feature, environment) attached at the LLM gateway.
Layer 2 is customer attribution: which end customer is generating which spend. Implemented via customer IDs propagated through request context and rolled up from gateway logs.
Layer 3 is business-outcome attribution: cost per resolved support ticket, cost per generated document, cost per qualified lead. This is where FinOps becomes a strategic conversation.
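For layer 1, a sketch of the roll-up from gateway logs. The log schema and the prices are assumptions for illustration:

```python
from collections import defaultdict

# Rough $/M input tokens, following the ladder above; real prices drift.
PRICE_PER_M_INPUT = {"gpt-4o": 20.00, "gpt-4o-mini": 0.30}

def attribute(log_records: list[dict]) -> dict:
    """Aggregate spend per (team, feature) from gateway log records."""
    spend = defaultdict(float)
    for r in log_records:
        cost = r["input_tokens"] / 1e6 * PRICE_PER_M_INPUT[r["model"]]
        spend[(r["team"], r["feature"])] += cost
    return dict(spend)

records = [
    {"team": "support", "feature": "ticket_summary",
     "model": "gpt-4o-mini", "input_tokens": 120_000},
    {"team": "sales", "feature": "lead_scoring",
     "model": "gpt-4o", "input_tokens": 45_000},
]
print(attribute(records))  # {('support', 'ticket_summary'): 0.036, ...}
```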
As spend grows past a threshold, you gain access to commercial options:
- Reserved GPU capacity: 1-year or 3-year capacity at fixed pricing. Best for baseline load.
- Enterprise agreements: annual or multi-year commitments across multiple services. Hyperscalers will negotiate at $500k+/year.
- Bare metal / colocation: lease racks in a colo; buy or lease the GPUs. Eliminates cloud margin; adds ops overhead. Worth it above ~1,000 GPUs sustained.
- Owned infrastructure: capex your own cluster. Best long-term economics; massive capex commitment; ops team required.
Most shops at $10M/year compute spend should have at least reserved capacity and an enterprise agreement. Bare metal is worth evaluating above $20M/year.
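The colo breakeven is worth sketching. Every number below is an assumption chosen to show the shape of the calculation, not a market quote:

```python
# Back-of-envelope colo-vs-cloud effective rate comparison.
CLOUD_RESERVED = 2.50  # $/GPU-hour, cloud reserved rate (assumption)
GPU_CAPEX = 30_000     # $ per GPU incl. share of chassis and network
AMORT_YEARS = 3
COLO_OPEX = 0.60       # $/GPU-hour: power, space, staff (assumption)

colo_rate = GPU_CAPEX / (AMORT_YEARS * 8760) + COLO_OPEX
print(f"colo effective: ${colo_rate:.2f}/GPU-hr vs cloud ${CLOUD_RESERVED:.2f}")
```

At these assumptions colo comes out around $1.74/GPU-hour, but only if the hardware stays busy; idle owned GPUs have no one to hand back to.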
FinOps above $5M/year needs more than a dashboard. It needs an operating model:

- A named owner for the compute bill
- A weekly spend review, with deviations escalated
- Per-team budgets with showback or chargeback
- Alerting that fires before the invoice arrives
The human structure matters more than any tool. Teams with “a FinOps dashboard” save little. Teams with “Tomás reviews the GPU bill weekly and escalates deviations” save a lot.
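The escalation step, as a minimal sketch. The thresholds and buckets are assumptions:

```python
# Weekly budget check: flag any bucket past 90% of its budget.
WEEKLY_BUDGET = {"inference": 120_000, "training": 60_000}  # $, illustrative
ALERT_AT = 0.9

def check(actual: dict) -> list[str]:
    alerts = []
    for bucket, budget in WEEKLY_BUDGET.items():
        spent = actual.get(bucket, 0.0)
        if spent >= ALERT_AT * budget:
            alerts.append(f"{bucket}: ${spent:,.0f} of ${budget:,.0f} weekly budget")
    return alerts

print(check({"inference": 131_000, "training": 41_000}))
# -> ['inference: $131,000 of $120,000 weekly budget']
```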
The recurring pitfalls:
1. Saving 10% by optimizing the wrong line item. Focus on the top 3 workloads. Everything else is noise.
2. Ignoring capacity commits when scaling down. You committed to 100 H100s for 3 years; you now need 50. Either sublease, stay committed, or eat the loss.
3. Hyperscaler credits mask bad habits. “AWS gave us $5M in credits” → team doesn’t optimize → credits run out → cliff.
4. Chasing every new GPU generation. B200 cost/perf is real, but upgrading mid-reservation just costs money.
5. Under-investing in observability. You cannot manage what you cannot measure. Spend is a sacred cow until you have the data to question it.
If your company’s GPU spend has crept above $500k/month without FinOps discipline:
Month 1: Baseline attribution. Who, what, how much. Accept the number is bigger than anyone thinks.
Month 2: Rate card negotiation with current providers. 10–20% gains from talking.
Month 3: Reserved commitments for baseline load. 20–30% on committed portion.
Month 4: Right-sizing. Audit workload → GPU match. 10–20% gains.
Month 5: Multi-cloud evaluation. Neoclouds vs hyperscalers for suitable workloads. 20–30% gains on shifted workloads.
Month 6: Utilization drive. Consolidate, autoscale aggressively, kill idle.
Month 7+: Structural levers — model right-sizing, disaggregation, distillation.
A team that executes this sequence typically cuts their compute bill 30–50% in the first year.
GPU FinOps at scale is a discipline, not a dashboard. The big wins come from:

- Committing to capacity instead of renting it all on-demand
- Matching each workload to the right hardware
- Driving fleet utilization up
- Mixing neoclouds with hyperscalers
- Routing every task to the cheapest model that does the job
These aren’t glamorous. They’re boring. They’re also where the dollars are. Companies that invest here have 20–40% lower unit costs than those that don’t. At serious scale, that’s the difference between a profitable AI product and one that isn’t.
Compute bill growing faster than revenue? Let’s talk — we run FinOps engagements for AI-heavy companies at every scale.