A100 vs H100: Choosing the Right NVIDIA GPU for Your AI Workload
NVIDIA's A100 and H100 are the two default choices for AI workloads in 2024. Where each one wins, what they cost on real LLM workloads, and a decision framework for picking between them.
If you are picking GPUs for an AI workload in 2024, you are almost certainly choosing between two chips: NVIDIA's A100 (shipped 2020) and the H100 (shipped 2022). A100 is cheaper, ubiquitous, and easy to get allocated. H100 is faster, pricier, and harder to get. The right answer depends on your workload, your budget, and your supply.
This guide walks through the specs that matter, benchmarks on real LLM workloads, and a decision framework we use with clients choosing between them.
| Spec | A100 (80GB SXM4) | H100 (80GB SXM5) | Delta |
|---|---|---|---|
| HBM memory | 80 GB HBM2e | 80 GB HBM3 | Same size, 2x bandwidth |
| Memory bandwidth | 2.0 TB/s | 3.35 TB/s | +67% |
| FP16 TFLOPS (dense) | 312 | 989 (1,979 w/ sparsity) | ~3x |
| FP8 TFLOPS | — | 1,979 | New format |
| NVLink bandwidth | 600 GB/s | 900 GB/s | +50% |
| TDP | 400W | 700W | +75% |
| Process | TSMC 7nm | TSMC 4N | New node |
| Transformer Engine | No | Yes | Hardware FP8/BF16 mixing |
The three numbers that drive most decisions: memory bandwidth (+67%), FP16 tensor throughput (roughly 3x), and FP8 support via the Transformer Engine, which A100 lacks entirely.
For frontier-model training, H100 is obviously better. The question is cost efficiency.
On our internal benchmark, fine-tuning a 7B Llama-style model on a single-node 8x GPU box, we measured tokens per dollar for each card. H100 came out roughly 4x more efficient on this workload at 2024 prices. That's the general pattern: if you can get H100 at reserved-tier pricing, it beats A100 on every dimension. At on-demand H100 pricing ($4.50–$6/hr), the margin is narrower but still real.
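For reference, the cost-efficiency arithmetic is just throughput times seconds per hour divided by price. A minimal sketch; the function name and the inputs below are placeholders, not our benchmark numbers:

```python
# Cost efficiency = tokens processed per dollar of GPU time.
# Plug in your own measured throughput and the hourly rate you actually pay.

def tokens_per_dollar(tokens_per_sec: float, price_per_gpu_hour: float) -> float:
    """Tokens processed per dollar of GPU time."""
    return tokens_per_sec * 3600 / price_per_gpu_hour

# Placeholder inputs for illustration only:
print(f"{tokens_per_dollar(tokens_per_sec=12_000, price_per_gpu_hour=2.00):,.0f} tokens/$")
```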
For frontier-scale training (70B+, multi-node), the picture shifts further toward H100. NVLink and NVSwitch topologies on H100 nodes feed each GPU better, so scaling efficiency is higher as you grow cluster size. At 256+ GPU scale, H100 is typically 5–6x more cost-efficient than A100.
Inference flips some assumptions. For many workloads, you are memory-bandwidth bound, not compute-bound. The model weights have to cross the memory bus for every token, and FLOPS headroom sits idle.
Consider serving a 7B model with vLLM on a single GPU: H100 delivers roughly 2x the throughput of A100 on this workload. At on-demand pricing, that works out to roughly equal or marginally better cost-per-token for H100; at reserved pricing, H100 pulls ahead.
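A back-of-envelope sketch of why bandwidth dominates decode: every generated token has to stream the full weight set from HBM, so spec-sheet bandwidth divided by model size gives a hard ceiling on single-sequence decode speed. The 7B FP16 size here is illustrative; batching, kernels, and FP8 move the real numbers around.

```python
# Bandwidth-bound ceiling on decode throughput: each new token requires
# reading all model weights from HBM once. Spec-sheet bandwidths from the
# table above; 7B parameters in FP16 (2 bytes each) is an illustrative model.

weight_bytes = 7e9 * 2  # ~14 GB of weights

for gpu, bandwidth_tb_s in [("A100", 2.0), ("H100", 3.35)]:
    ceiling = bandwidth_tb_s * 1e12 / weight_bytes  # tokens/s, single sequence
    print(f"{gpu}: <= {ceiling:.0f} tokens/s per sequence (bandwidth ceiling)")
```

Batching amortizes the weight reads across sequences, which is where H100's compute and FP8 advantages start to matter on top of the bandwidth gap.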
Where A100 still wins on cost-per-token is mostly when you already own the fleet or have it reserved at a good rate. For most new 2024 deployments, H100 wins on inference too. But "cheaper hardware, worse perf" is still a viable strategy if you already have the A100s and don't need the extra throughput.
One subtle point often missed: A100 comes in 40GB and 80GB variants, while H100 ships with 80GB. For LLMs, the 80GB A100 is almost always the right call, even at a premium.
A 13B model in FP16 takes ~26GB of weights. At batch 16 with 4K context, KV cache adds roughly another ~20GB, which already puts you past what a 40GB card can hold, and pushing context to 8K or batch to 32 only widens the gap. The 40GB variant leaves no room to grow.
A rough rule: model weights + 50% for KV cache + 10% for activations. If that doesn’t fit comfortably in GPU memory, you’re going to pay for it in tensor parallelism complexity or offloading overhead.
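A sketch of that rule as a fit check; the helper name and the ~10% safety margin are our choices, not part of the rule:

```python
def fits_on_one_gpu(params_billion: float, gpu_mem_gb: float,
                    bytes_per_param: int = 2) -> bool:
    """Rule of thumb from above: weights + 50% for KV cache + 10% for activations."""
    weights_gb = params_billion * bytes_per_param   # FP16/BF16 = 2 bytes per param
    needed_gb = weights_gb * 1.6                    # + 50% KV cache + 10% activations
    return needed_gb <= gpu_mem_gb * 0.9            # leave ~10% headroom (our choice)

# 13B in FP16 needs ~42 GB by this rule: too tight for 40 GB, comfortable on 80 GB.
print(fits_on_one_gpu(13, 40), fits_on_one_gpu(13, 80))  # False True
```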
The biggest H100 win that’s easy to miss: FP8 support. The Transformer Engine automatically mixes FP8 and BF16 across a forward/backward pass, cutting memory in half and roughly doubling throughput on Transformer layers.
For serving frontier-size models (70B+), FP8 is the difference between fitting on 4 GPUs vs 8. Quantization schemes like FP8-E4M3 (supported by vLLM, TensorRT-LLM, SGLang) are essentially free performance on H100. On A100, the closest equivalent is INT8 quantization, which has more accuracy loss and is more painful to deploy.
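A minimal serving sketch, assuming a recent vLLM build with FP8 support on Hopper; the model name, parallelism, and sampling settings are placeholders:

```python
from vllm import LLM, SamplingParams

# Dynamic FP8 (W8A8) quantization on H100; on A100 this path is unavailable
# and you would fall back to INT8 schemes instead.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder checkpoint
    quantization="fp8",                  # FP8-E4M3 weights and activations
    tensor_parallel_size=4,              # a 70B model in FP8 can fit on 4x 80GB
)

outputs = llm.generate(
    ["Summarize why LLM inference is memory-bandwidth bound."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```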
See our FP8 and Quantization guide for the deployment details.
If you are a startup, reserved H100 from a neocloud is usually your best deal. The hyperscalers price H100 at a premium because demand is insane.
If you need occasional burst capacity, on-demand A100 on a hyperscaler is fine — it’s available, predictable, and the per-hour rate is sane.
Concrete scenarios where we still recommend A100 to clients: you already own or have reserved an A100 fleet, or you only need occasional burst capacity at sane on-demand rates.
B200 (Blackwell) is shipping to hyperscalers in late 2024 / early 2025. It’s roughly 2.5x H100 on training and adds FP4 support. For the overwhelming majority of teams in 2024, B200 is “wait and see” — H100 is the workhorse for the next 12–18 months.
We cover the B200 tradeoffs in depth in NVIDIA B200 vs H100: Should You Upgrade?.
Three questions, in order:
1. What is the workload: training or inference, and how big is the model?
2. What pricing can you actually get: reserved H100, or on-demand only?
3. What supply can you actually get: is H100 allocation available on your timeline?
Evaluating GPU capacity for a new workload? Talk to us — we’ve sized fleets from 4 GPUs to 4,000.