Infrastructure

FP8 and Quantization: Serving LLMs at Half the Cost

Balys Kriksciunas · 7 min read
#ai #infrastructure #fp8 #quantization #awq #gptq #inference #h100


A 70B-parameter model in FP16 uses 140 GB of memory. In FP8, it uses 70 GB. In INT4, 35 GB. Less memory = more requests fit on a GPU = more throughput per dollar.
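The arithmetic is simple enough to sanity-check yourself. A quick sketch (dense weights only, ignoring KV cache and activation overhead):

```python
def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight memory for a dense model: parameters x bytes per parameter."""
    return params * bits_per_param / 8 / 1e9

params = 70e9  # a Llama-3-70B-class model
for fmt, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{fmt}: {weight_memory_gb(params, bits):.0f} GB")
# FP16: 140 GB, FP8: 70 GB, INT4: 35 GB
```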

Quantization is the single highest-leverage optimization for LLM inference costs. Done right, it cuts your GPU bill in half with a measurable but acceptable quality hit. Done wrong, it makes your model subtly dumber in ways your evals might miss.

This post walks through the main quantization schemes, benchmarks on real workloads, and the pitfalls to watch for.


The Precisions You’ll Encounter

| Format | Bits | Range | Used where |
|---|---|---|---|
| FP32 | 32 | ±3.4e38 | Training, some edge cases |
| FP16 | 16 | ±65504 | Baseline inference |
| BF16 | 16 | ±3.4e38 | Training, some serving |
| FP8 (E4M3) | 8 | ±448 | H100+ inference, training |
| FP8 (E5M2) | 8 | ±57344 | H100+ backward pass |
| INT8 | 8 | −128..127 | A100-era quantized inference |
| INT4 / NF4 | 4 | 16 values | Aggressive memory reduction |
| FP4 | 4 | 16 values | B200 inference |

“Quantization” usually means going from FP16 baseline to FP8 or INT4/INT8. The specific scheme (what gets quantized, how the scales are computed, how you calibrate) matters enormously.
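To make "how the scales are computed" concrete, here is the naive baseline every real scheme refines: per-tensor symmetric absmax quantization, shown here for INT8. This is illustrative only; AWQ, SmoothQuant, and friends exist precisely because this simple version breaks down on outliers.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor symmetric quantization: map [-absmax, absmax] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()  # worst-case rounding error <= scale / 2
```

One large outlier inflates `scale` and washes out every small value in the tensor, which is why per-group and activation-aware variants dominate in practice.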


FP8: The Easy Win on H100

The H100 (and the Ada Lovelace L40/L40S) has hardware FP8 support via the Transformer Engine. This is different from software INT8 quantization: FP8 is a native tensor core data type.

The key property: on H100, an FP8 matmul is 2x the FLOPS of BF16 and uses half the memory. You get a near-2x throughput boost with minimal accuracy loss.

On Llama-3-70B (our eval set, 500 production prompts), FP8 scored 0.04 points (<1%) below the FP16 baseline, for ~2x throughput. For almost every production workload, this is a no-brainer.

How to enable FP8

vLLM:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --tensor-parallel-size 2

TensorRT-LLM: Build the engine with --use_fp8 enable --fp8_kv_cache enable. The engine build takes longer but the quality is marginally better because of better calibration.

TGI: --dtype fp8 (with hardware support).

FP8 KV cache

The other FP8 win: storing the KV cache in FP8 halves its size. More requests fit concurrently.

--kv-cache-dtype fp8

Quality impact is negligible (<0.5% in our testing). Most production deployments running FP8 weights should also use FP8 KV cache.
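The per-token arithmetic makes the concurrency win concrete. A rough sketch using the public Llama-3-70B shape (80 layers, 8 KV heads under GQA, head dimension 128); real allocators like vLLM's paged KV cache add some overhead on top:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_val: int) -> int:
    """KV cache per generated token: K and V, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val

# Llama-3-70B: 80 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_bytes_per_token(80, 8, 128, 2)   # 327,680 bytes ≈ 0.31 MB per token
fp8  = kv_bytes_per_token(80, 8, 128, 1)   # exactly half: ≈ 0.16 MB per token
```

At 8K-token contexts that's ~2.5 GB of cache per request in FP16 versus ~1.25 GB in FP8, which is where the "more requests fit concurrently" claim comes from.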


AWQ (Activation-aware Weight Quantization)

For A100 or older hardware without FP8 support, AWQ is the best option for 70B-class models.

AWQ is an INT4 weight quantization scheme that calibrates using activation statistics — it protects the weights that matter most for output quality. It’s more sophisticated than naive INT4.

On Llama-3-70B, AWQ costs roughly 3% on the same eval set. For most applications that's acceptable; for applications pushing model capability limits, you'll notice.

Memory: A 70B model in AWQ INT4 is ~40 GB. Fits on a single A100 80GB or H100 80GB with room for KV cache.

Throughput: AWQ activations are still done in FP16/BF16, so the throughput boost isn’t as dramatic as FP8. Expect 1.3–1.5x baseline vs 1.9–2.0x for FP8.
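For intuition about what INT4 weight quantization does, here is a naive group-wise version: one scale per group of 128 weights, symmetric 4-bit codes in [−7, 7]. This is not AWQ itself (AWQ's contribution is the activation-aware rescaling on top of exactly this kind of scheme), just a minimal sketch of the underlying mechanics:

```python
import numpy as np

def quantize_int4_grouped(w: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit weight quantization with one scale per group of `group_size` weights."""
    g = w.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range: [-7, 7]
    q = np.clip(np.round(g / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(0).normal(size=(512, 512)).astype(np.float32)
q, scales = quantize_int4_grouped(w)
w_hat = dequantize(q, scales, w.shape)  # reconstruction error bounded by scale / 2 per group
```

The per-group scales are why the real memory footprint lands around 4.2–4.5 bits per weight rather than exactly 4.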

Enabling AWQ:

python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq

Pre-quantized AWQ models are common on Hugging Face. Quantizing yourself takes a few hours on a single GPU with the AutoAWQ library.


GPTQ

GPTQ is an older sibling to AWQ — INT4 weight quantization, slightly less sophisticated calibration. Slightly worse quality, similar throughput. Pre-quantized models are everywhere.

Rule: if AWQ exists for your model, use AWQ. GPTQ is the fallback.


INT8 / SmoothQuant

INT8 quantization was the main option before FP8 hardware existed. The best variant is SmoothQuant, which smooths activation distributions before quantizing both weights and activations to INT8.

On A100 (which lacks FP8), SmoothQuant gives you roughly BF16 quality at 1.4–1.6x throughput. Not as good as FP8 on H100, but a real improvement over baseline.
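SmoothQuant's core trick is worth seeing in miniature: migrate quantization difficulty from activations to weights with a per-channel scale s, exploiting the identity (X/s)(s⊙W) = XW. A sketch with a synthetic outlier channel, using the paper's balancing exponent (alpha = 0.5 here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 0] *= 50.0                      # one outlier activation channel, hard to quantize
W = rng.normal(size=(8, 8))

# Per-input-channel smoothing scale, balancing activation vs weight magnitudes
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s            # activation outliers shrink -> easier INT8 activations
W_smooth = W * s[:, None]   # the difficulty migrates into the weights

assert np.allclose(X_smooth @ W_smooth, X @ W)  # the matmul is mathematically unchanged
```

The smoothing is folded into the previous layer's weights offline, so it costs nothing at inference time.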

For 2025 deployments, INT8 is mostly legacy. Use AWQ INT4 for A100, FP8 for H100.


FP4: The B200 Story

B200 (Blackwell) adds hardware FP4 support. This is unprecedentedly aggressive (FP4 has just 16 representable values), but it works with the right calibration.
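For intuition, those 16 codes can be enumerated. A sketch assuming the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit) used by hardware FP4 formats; note ±0 collapse to a single value:

```python
# Enumerate every value an E2M1 (FP4) code can represent.
values = set()
for sign in (1, -1):
    for e in range(4):          # 2-bit exponent field
        for m in range(2):      # 1-bit mantissa
            if e == 0:
                v = m * 0.5                     # subnormals: 0 or 0.5
            else:
                v = (1 + m * 0.5) * 2 ** (e - 1)
            values.add(sign * v)

print(sorted(values))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Every weight in the model gets rounded to one of these 15 distinct values (times a per-block scale), which is why calibration quality matters so much at this precision.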

Early benchmarks on B200 + FP4 for Llama-3.1-70B are encouraging.

For teams running sustained high-throughput inference, B200 + FP4 is where 2025–2026 cost efficiency lives. The caveat: supply is still constrained and software support is less mature than H100.

See our B200 deep-dive: NVIDIA B200 vs H100: Should You Upgrade?.


The Quality Measurement Trap

The biggest mistake teams make with quantization: benchmarking on the wrong eval set.

General benchmarks (MMLU, HumanEval, HellaSwag) are relatively robust to quantization. The quality loss looks tiny.

Your production workload may not be; we've seen regressions that only showed up on production prompts, never on the general benchmarks.

Always run your own eval set on both quantized and unquantized models, comparing side by side on the tasks you actually serve. If anything degrades unacceptably, roll back.


Quantization + Long Context: Watch Out

Long-context inference hits quantization schemes differently. KV cache quantization, in particular, can interact badly with very long contexts (100K+ tokens).

The dynamics: errors compound across long generations. A tiny per-token quality cost, accumulated over 10k tokens of generation, becomes noticeable.
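A back-of-envelope illustration of the compounding, assuming a purely hypothetical, independent per-token error rate (real errors are neither independent nor this clean, but the shape of the curve is the point):

```python
# Hypothetical: a 0.05% chance per generated token that quantization error
# flips the output to a worse token. Invisible on short generations,
# dominant on long ones.
p_err = 0.0005
for n_tokens in (100, 1_000, 10_000):
    p_clean = (1 - p_err) ** n_tokens
    print(f"{n_tokens:>6} tokens: {p_clean:.1%} chance of an error-free generation")
```

At 100 tokens you almost never notice; at 10,000 tokens a clean generation becomes the exception, which matches what we see on long-context workloads.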

The practical mitigation: keep the KV cache in FP16 for 100K+-token workloads even when the weights stay quantized, and keep an unquantized deployment as the fallback for the longest, quality-sensitive paths.


Multi-Adapter (LoRA) + Quantization

A common setup: base model in FP8, multiple LoRA adapters served via multi-LoRA. How does quantization interact?

For serving, leave your LoRAs in FP16 even when the base is quantized. The memory overhead of FP16 LoRAs is small.
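A back-of-envelope sketch of why the overhead is small: a rank-r adapter on a d_in × d_out weight adds only r·(d_in + d_out) parameters. Using the Llama-70B hidden size of 8192 and an illustrative rank of 16:

```python
def lora_overhead(d_in: int, d_out: int, rank: int,
                  base_bytes: int = 1, lora_bytes: int = 2) -> float:
    """Memory of an FP16 LoRA (A: rank x d_in, B: d_out x rank)
    relative to the FP8 base weight it adapts."""
    base = d_in * d_out * base_bytes
    lora = rank * (d_in + d_out) * lora_bytes
    return lora / base

# Square attention projection, hidden size 8192, rank-16 adapter
overhead = lora_overhead(8192, 8192, 16)   # ~0.8% of the base weight's memory
```

Even a few dozen FP16 adapters cost less than a single quantization step saved, so there's no pressure to quantize them.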


The Decision Framework

On H100 or newer: FP8 weights plus FP8 KV cache. On A100 or older: AWQ INT4, with GPTQ as the fallback. On B200: FP4, once your software stack supports it. For quality-sensitive paths: keep an FP16 deployment to fall back on.

Always test on your eval set. Always.


Pre-Quantized Models on Hugging Face

TheBloke, casperhansen, neuralmagic, and other Hugging Face community members publish pre-quantized variants of popular models, and they're a good starting point.

Pre-quantized saves the hours of calibration work. Validate quality; don’t assume.


Quantizing Your Own Model

You'll want to do this yourself if no pre-quantized variant exists for your model or fine-tune, or if generic calibration data doesn't match your production traffic.

For AWQ, the AutoAWQ library handles calibration and export; for FP8, vLLM and TensorRT-LLM have built-in quantization paths.

A 70B calibration pass takes a few hours on 1–2 H100s. Use a calibration dataset representative of your production traffic.


The Bottom Line

Quantization is a free 1.5–2x throughput gain on your existing GPUs. FP8 on H100 is now the default for 70B-class workloads. AWQ INT4 is the default for A100.

Watch quality carefully, always test on your own eval set, and keep FP16 as a fallback for quality-sensitive paths.



Deploying quantized inference and want help validating quality? Get in touch — we design eval harnesses that catch quantization regressions in the first 48 hours.
