NVIDIA H100 vs A100: Which GPU Should You Deploy?
A practical comparison of NVIDIA's H100 and A100 for LLM training and inference — memory, FLOPS, interconnect, price per token, and the cases where the older A100 still wins.
A 70B-parameter model in FP16 uses 140 GB of memory. In FP8, it uses 70 GB. In INT4, 35 GB. Less memory = more requests fit on a GPU = more throughput per dollar.
Quantization is the single highest-leverage optimization for LLM inference costs. Done right, it cuts your GPU bill in half with a measurable-but-acceptable quality hit. Done wrong, it cuts your quality and makes your model subtly dumber in ways your evals might miss.
This post walks through the main quantization schemes, benchmarks on real workloads, and the pitfalls to watch for.
| Format | Bits | Range | Used where |
|---|---|---|---|
| FP32 | 32 | ±3.4e38 | Training, some edge cases |
| FP16 | 16 | ±65504 | Baseline inference |
| BF16 | 16 | ±3.4e38 | Training, some serving |
| FP8 (E4M3) | 8 | ±448 | H100+ inference, training |
| FP8 (E5M2) | 8 | ±57344 | H100+ backward pass |
| INT8 | 8 | -128..127 | A100-era quantized inference |
| INT4 / NF4 | 4 | 16 values | Aggressive memory reduction |
| FP4 | 4 | 16 values | B200 inference |
“Quantization” usually means going from FP16 baseline to FP8 or INT4/INT8. The specific scheme (what gets quantized, how the scales are computed, how you calibrate) matters enormously.
H100 (and Ada Lovelace L40/L40S) have hardware FP8 support via the Transformer Engine. This is different from software INT8 quantization — FP8 is a native tensor core data type.
The key property: on H100, an FP8 matmul is 2x the FLOPS of BF16 and uses half the memory. You get a near-2x throughput boost with minimal accuracy loss.
Quality on Llama-3-70B (our eval set, 500 production prompts):
That’s a 0.04-point (<1%) drop for ~2x throughput. For almost every production workload, this is a no-brainer.
vLLM:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--tensor-parallel-size 2
TensorRT-LLM: Build the engine with --use_fp8 enable --fp8_kv_cache enable. The engine build takes longer but the quality is marginally better because of better calibration.
TGI: --dtype fp8 (with hardware support).
The other FP8 win: storing the KV cache in FP8 halves its size. More requests fit concurrently.
--kv-cache-dtype fp8
Quality impact is negligible (<0.5% in our testing). Most production deployments running FP8 weights should also use FP8 KV cache.
For A100 or older hardware without FP8 support, AWQ is the best option for 70B-class models.
AWQ is an INT4 weight quantization scheme that calibrates using activation statistics — it protects the weights that matter most for output quality. It’s more sophisticated than naive INT4.
Quality on Llama-3-70B:
That’s a ~3% drop. For most applications, acceptable; for applications pushing model capability limits, you’ll notice.
Memory: A 70B model in AWQ INT4 is ~40 GB. Fits on a single A100 80GB or H100 80GB with room for KV cache.
Throughput: AWQ activations are still done in FP16/BF16, so the throughput boost isn’t as dramatic as FP8. Expect 1.3–1.5x baseline vs 1.9–2.0x for FP8.
Enabling AWQ:
python -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq
Pre-quantized AWQ models are common on Hugging Face. Quantizing yourself takes a few hours on a single GPU with the AutoAWQ library.
GPTQ is an older sibling to AWQ — INT4 weight quantization, slightly less sophisticated calibration. Slightly worse quality, similar throughput. Pre-quantized models are everywhere.
Rule: if AWQ exists for your model, use AWQ. GPTQ is the fallback.
INT8 quantization was the main option before FP8 hardware existed. The best variant is SmoothQuant, which smooths activation distributions before quantizing both weights and activations to INT8.
On A100 (which lacks FP8), SmoothQuant gives you roughly BF16 quality at 1.4–1.6x throughput. Not as good as FP8 on H100, but a real improvement over baseline.
For 2025 deployments, INT8 is mostly legacy. Use AWQ INT4 for A100, FP8 for H100.
B200 (Blackwell) adds hardware FP4 support. This is unprecedentedly aggressive — FP4 is just 16 representable values — but it works with the right calibration.
Early benchmarks on B200 + FP4 for Llama-3.1-70B:
For teams running sustained high-throughput inference, B200 + FP4 is where 2025–2026 cost efficiency lives. The caveat: supply is still constrained and software support is less mature than H100.
See our B200 deep-dive: NVIDIA B200 vs H100: Should You Upgrade?.
The biggest mistake teams make with quantization: benchmarking on the wrong eval set.
General benchmarks (MMLU, HumanEval, HellaSwag) are relatively robust to quantization. The quality loss looks tiny.
Your production workload may not be. We’ve seen cases where:
Always run your own eval set on both quantized and unquantized models. Test:
If any of these degrade unacceptably, roll back.
Long-context inference hits quantization schemes differently. KV cache quantization, in particular, can interact badly with very long contexts (100K+ tokens).
The dynamics: errors compound across long generations. A tiny-per-token quality cost, accumulated over 10k tokens of generation, becomes noticeable.
Mitigations:
kv-cache-dtype fp16)A common setup: base model in FP8, multiple LoRA adapters served via multi-LoRA. How does quantization interact?
For serving, leave your LoRAs in FP16 even when the base is quantized. The memory overhead of FP16 LoRAs is small.
Always test on your eval set. Always.
The TheBloke, casperhansen, neuralmagic, and various HF community members publish pre-quantized variants of popular models. Good starting point:
casperhansen/llama-3.1-70b-instruct-awq — AWQ INT4neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 — FP8Pre-quantized saves the hours of calibration work. Validate quality; don’t assume.
You’ll want to do this yourself if:
Libraries:
A 70B calibration pass takes a few hours on 1–2 H100s. Use a calibration dataset representative of your production traffic.
Quantization is free 1.5–2x throughput on your existing GPUs. FP8 on H100 is now the default for 70B-class workloads. AWQ INT4 is the default for A100.
Watch quality carefully, always test on your own eval set, and keep FP16 as a fallback for quality-sensitive paths.
Deploying quantized inference and want help validating quality? Get in touch — we design eval harnesses that catch quantization regressions in the first 48 hours.
A practical comparison of NVIDIA's H100 and A100 for LLM training and inference — memory, FLOPS, interconnect, price per token, and the cases where the older A100 still wins.
Prefill is compute-bound, decode is bandwidth-bound. Running them on the same GPU wastes capacity. Disaggregated inference separates them — 30–50% throughput wins on real workloads.
AMD's MI300X turned from curiosity to production option during 2024–2025. Where AMD wins, where NVIDIA still leads, and how to integrate MI300X into a mixed fleet.