NVIDIA H100 vs A100: Which GPU Should You Deploy?
A practical comparison of NVIDIA's H100 and A100 for LLM training and inference — memory, FLOPS, interconnect, price per token, and the cases where the older A100 still wins.
Full fine-tuning of a 70B model takes 8 H100s and a week. Parameter-efficient fine-tuning (PEFT) does it on 1-2 H100s in a day, at 95% of the quality. This changed what “custom model” means. In 2025, a team of two engineers can ship a dozen fine-tuned variants per quarter.
This guide covers the infrastructure side of LoRA, QLoRA, and the rest — sizing hardware, training recipes, and the serving architecture that lets you run 50 LoRA adapters on one base model.
Full fine-tuning updates every parameter in a model. For a 70B model that’s:
Most tasks don’t need to touch all 70B parameters. PEFT freezes the base model and trains a tiny set of new parameters that “adapt” it. Memory drops 10–20x; training time drops proportionally.
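To put rough numbers on the full fine-tuning cost, here is a back-of-the-envelope sketch. It assumes mixed-precision training with Adam at 16 bytes per parameter (BF16 weights and gradients, plus FP32 master weights and two FP32 Adam moments) and ignores activations entirely. The function name and the breakdown are illustrative assumptions, not figures from a specific training run; setups that offload or use lower-precision optimizer states need less.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough per-component memory for mixed-precision Adam, excluding activations."""
    # Assumed byte costs per parameter (a common accounting, not a measurement).
    bytes_per_param = {
        "bf16_weights": 2,
        "bf16_gradients": 2,
        "fp32_master_weights": 4,
        "fp32_adam_momentum": 4,
        "fp32_adam_variance": 4,
    }
    gb = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    gb["total"] = sum(gb.values())
    return gb


estimate = full_finetune_memory_gb(70e9)
for name, value in estimate.items():
    print(f"{name:>22}: {value:7.0f} GB")
```

Under these assumptions the weights are the small part of the bill: gradients and optimizer state are what PEFT removes by freezing the base model.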
The key insight (Hu et al., 2021): fine-tuning updates are low-rank. The change in weight matrix W can be well-approximated by a product of two small matrices A and B. Instead of updating W, you train A and B.
LoRA adds trainable low-rank matrices alongside frozen weights:
forward(x): W_frozen @ x + (B @ A) @ x
Where W is the original weight matrix (d_out × d_in, frozen), A is r × d_in, B is d_out × r. r is the “rank” — typically 8, 16, or 64.
For r=16 on an 8192×8192 attention projection in Llama-3-70B, instead of updating 8192×8192 ≈ 67M parameters, you train 2×8192×16 ≈ 262K. That’s a 256x reduction in trainable parameters.
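The shapes and the parameter arithmetic are easy to check in a few lines of NumPy. This is a minimal sketch with toy dimensions, not a training implementation. It includes the alpha/r scaling from the LoRA paper (the `lora_alpha` setting in most configs), which the simplified formula above leaves out; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 4, 8            # toy sizes, chosen for illustration

W = rng.standard_normal((d_out, d_in))          # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                        # trainable, zero init
scale = alpha / r


def lora_forward(x, W, A, B, scale):
    # Base path plus low-rank path; B @ A is never materialized at full size.
    return W @ x + scale * (B @ (A @ x))


x = rng.standard_normal(d_in)

# With B initialized to zero, the adapter is a no-op at the start of training.
assert np.allclose(lora_forward(x, W, A, B, scale), W @ x)

# After training (simulated here with a random B), the adapter can be merged
# into the base weight to give a single matrix with identical outputs.
B_trained = rng.standard_normal((d_out, r)) * 0.01
W_merged = W + scale * (B_trained @ A)
assert np.allclose(lora_forward(x, W, A, B_trained, scale), W_merged @ x)


def lora_param_count(d_in, d_out, r):
    return r * d_in + d_out * r


full = 8192 * 8192
lora = lora_param_count(8192, 8192, 16)
print(full, lora, full // lora)
```

The merge step is also what makes it possible to ship a single self-contained model when you only need one adapter.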
Quality: Typically 90–99% of full fine-tuning on task-specific benchmarks. Occasionally a few points lower, rarely a dealbreaker.
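Training memory (Llama-3-70B, rank 16): the adapter, its gradients, and its optimizer state add only a few GB, but the frozen base weights are ~140 GB at FP16 (70B × 2 bytes), so plain LoRA needs the base sharded across at least two H100 80GB. A ~40 GB footprint on a single H100 is only reachable with the base quantized to 4-bit, as in QLoRA.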
Training time: ~4–8 hours on a 50k-example task with 1 H100.
QLoRA adds a clever twist: quantize the frozen base weights to 4-bit (NF4). Base weights are read-only during training (they are dequantized on the fly for the forward and backward passes but never updated), so quantizing them has minimal impact on gradient quality.
Memory reduction:
QLoRA made fine-tuning a 65B–70B model possible on a single 48 GB GPU, and a ~33B model on a single 24 GB consumer-grade GPU.
Quality: Typically 95–98% of full LoRA. The 4-bit quantization costs a little quality; the optimizer sees lossier gradients.
Tradeoff: Training is slightly slower than LoRA because of dequantize-on-read costs. Worth it for memory.
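The memory side of that tradeoff comes down to bits per weight. The sketch below estimates base-weight memory only (no activations, adapters, or optimizer state). The NF4 overhead formula assumes the block sizes described in the QLoRA paper: one scale per 64 weights, with double quantization storing those scales in 8 bits plus a second-level FP32 constant per 256 blocks. Treat the function names and defaults as assumptions for illustration.

```python
def nf4_bits_per_weight(block_size: int = 64, double_quant: bool = True,
                        dq_block: int = 256) -> float:
    """4-bit codes plus the per-block scale overhead, in bits per weight."""
    if double_quant:
        # 8-bit scale per block, plus one FP32 constant per dq_block scales.
        overhead = 8 / block_size + 32 / (block_size * dq_block)
    else:
        # One FP32 scale per block.
        overhead = 32 / block_size
    return 4 + overhead


def base_weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9


n = 70e9
fp16_gb = base_weight_memory_gb(n, 16)
nf4_gb = base_weight_memory_gb(n, nf4_bits_per_weight())
print(f"FP16 base: {fp16_gb:.0f} GB")
print(f"NF4 base:  {nf4_gb:.1f} GB")
```

Whatever is left of the GPU after the quantized base has to hold activations, the adapter, and its optimizer state, which is why batch size and sequence length still matter at 4-bit.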
DoRA (2024) decomposes weight updates into magnitude and direction, adapting each separately. It adds slightly more parameters than LoRA for slightly better quality, and is an emerging standard for new fine-tunes.
In our experiments, DoRA gives 0.5–1.5% better eval scores than LoRA on the same tasks, for ~15% more trainable parameters. Worth using for non-trivial fine-tunes.
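The magnitude/direction split can be sketched in NumPy. This is an illustration of the idea under toy dimensions, not a drop-in implementation: the direction is the base weight plus a LoRA-style update, normalized per column, and a separate trainable magnitude vector rescales it. Which axis the norm runs over depends on how a given library lays out its weight matrices, so the column-wise choice here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4                      # toy sizes for illustration

W0 = rng.standard_normal((d_out, d_in))         # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable (direction)
B = np.zeros((d_out, r))                        # trainable (direction), zero init


def column_norms(M):
    return np.linalg.norm(M, axis=0, keepdims=True)   # shape (1, d_in)


m = column_norms(W0)                            # trainable magnitude, initialized from W0


def dora_weight(W0, A, B, m):
    V = W0 + B @ A                              # LoRA-style update to the direction
    return m * (V / column_norms(V))            # renormalize, then rescale by magnitude


# At initialization (B = 0) the adapted weight equals the base weight.
assert np.allclose(dora_weight(W0, A, B, m), W0)

# After a (simulated) update, each column's norm is still set by m alone.
B_trained = rng.standard_normal((d_out, r)) * 0.1
W_adapted = dora_weight(W0, A, B_trained, m)
assert np.allclose(column_norms(W_adapted), m)
```

The extra trainable parameters over LoRA in this sketch are just the entries of `m`, one per column.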
The LoRA family is dominant, but other PEFT methods exist:
For 95% of cases, use LoRA or QLoRA. Start with LoRA if you have memory headroom; QLoRA if not.
Practical: QLoRA for 70B+ unless you have tight quality requirements. LoRA for 8B and smaller where memory isn’t the constraint.
Our default pick in 2025: Unsloth for single-GPU fine-tunes, Axolotl for multi-GPU or more complex recipes.
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
datasets:
  - path: my_dataset.jsonl
    type: sharegpt
sequence_len: 4096
sample_packing: true
learning_rate: 2e-4
num_epochs: 3
warmup_steps: 100
gradient_accumulation_steps: 4
micro_batch_size: 2
output_dir: ./out/
Run with accelerate launch -m axolotl.cli.train config.yml. Adapter weights land in output_dir; base model is untouched.
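A quick sanity check worth doing on any recipe like the one above is the effective batch size, since the learning rate is tuned against it. The helper below is a hypothetical convenience function, and the single-GPU count is an assumption; the other numbers come from the config.

```python
def effective_batch(micro_batch_size: int, grad_accum_steps: int,
                    num_gpus: int, sequence_len: int) -> tuple[int, int]:
    """Sequences per optimizer step, and the upper bound on tokens per step."""
    sequences = micro_batch_size * grad_accum_steps * num_gpus
    return sequences, sequences * sequence_len


# micro_batch_size=2, gradient_accumulation_steps=4, sequence_len=4096,
# on an assumed single GPU.
seqs, max_tokens = effective_batch(2, 4, 1, 4096)
print(seqs, max_tokens)
```

With `sample_packing: true`, each sequence is filled close to `sequence_len`, so the real token count per step sits near that upper bound. Adding GPUs multiplies the effective batch unless you lower gradient accumulation to compensate.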
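For larger models, use Hugging Face Accelerate with FSDP, or DeepSpeed ZeRO. FSDP (like ZeRO-3) shards parameters, gradients, and optimizer state across data-parallel GPUs; it is not tensor parallelism. Axolotl wraps all of these.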
For 70B QLoRA on 2x H100, FSDP + QLoRA is the typical recipe. ZeRO-3 works too but has more communication overhead.
Here’s where PEFT gets really interesting. Because the base model is frozen, you can load many LoRA adapters over a single base model and serve them all from one GPU.
vLLM multi-LoRA:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--enable-lora \
--max-loras 20 \
--max-lora-rank 64 \
--lora-modules cust_a=./lora_a cust_b=./lora_b cust_c=./lora_c
At request time, you specify which adapter:
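from openai import OpenAI
# Assumes vLLM's OpenAI-compatible server on its default local port; adjust base_url as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="cust_a",  # adapter name registered via --lora-modules
    messages=[...],
)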
vLLM swaps the adapter per-request with minimal overhead. You can serve 20+ LoRAs on the same GPU fleet.
This is the enabler for per-customer fine-tunes in multi-tenant SaaS. Each customer gets their own tuned variant; all share the expensive base weights; marginal compute cost per adapter is close to zero.
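Why the marginal cost is so low is easier to see in code. The following is a conceptual NumPy sketch of a mixed-adapter batch, not how vLLM implements it internally: one shared matrix multiply covers the base weights for every request, and each request adds only its own low-rank correction. Dimensions and names are illustrative, and the alpha/r scale is taken as 1 for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n_adapters = 32, 32, 4, 3       # toy sizes for illustration

W = rng.standard_normal((d_out, d_in))                  # shared frozen base
A = rng.standard_normal((n_adapters, r, d_in)) * 0.1    # one A per adapter
B = rng.standard_normal((n_adapters, d_out, r)) * 0.1   # one B per adapter


def multi_lora_forward(X, adapter_ids):
    """X: (batch, d_in); adapter_ids: (batch,) index of each request's adapter."""
    base = X @ W.T                                      # one matmul for the whole batch
    A_sel = A[adapter_ids]                              # (batch, r, d_in)
    B_sel = B[adapter_ids]                              # (batch, d_out, r)
    low = np.einsum("bri,bi->br", A_sel, X)             # project down to rank r
    delta = np.einsum("bor,br->bo", B_sel, low)         # project back up
    return base + delta


X = rng.standard_normal((4, d_in))
adapter_ids = np.array([0, 2, 2, 1])                    # four requests, three tenants
Y = multi_lora_forward(X, adapter_ids)

# Each row matches what a dedicated merged model for that tenant would produce.
expected_row1 = (W + B[2] @ A[2]) @ X[1]
assert np.allclose(Y[1], expected_row1)
```

The per-request work scales with r, not with the size of the base model, which is what keeps a fleet of adapters cheap to serve.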
TensorRT-LLM and TGI also support multi-LoRA serving with varying maturity.
With many adapters, operational concerns emerge:
Some hosted services (Fireworks.ai, Together.ai) offer managed multi-LoRA serving. For workloads with a high adapter count, they are worth evaluating against self-hosting.
A few cases where full fine-tuning wins:
1. Major distribution shift. If you’re adapting a general-purpose model to a radically different domain (e.g., chemistry notation from a text model), low-rank updates may not be enough.
2. Continual pretraining. Adding new knowledge (not just new behaviors) often needs more than LoRA. Continued pretraining followed by PEFT for task alignment is common.
3. When you need to ship one model for one purpose. Full fine-tuning produces a single, self-contained model. PEFT produces base + adapter, which is two things to ship.
4. Very small model + lots of data. Full fine-tuning a 1B model might be faster and simpler than LoRA + inference overhead.
For 80% of “make this model better at my thing” cases, LoRA or QLoRA is right.
1. Overfit on small data. LoRA is surprisingly expressive. With a small dataset (<1k examples), you can overfit in an epoch. Watch eval loss; use early stopping.
2. Wrong target modules. LoRA on only q/v projections (the original paper’s setup) often underperforms LoRA on all attention + FFN projections. Default to “all linear” unless you have a reason not to.
3. Rank too low. r=8 is the paper default but sometimes underfits. r=16 or 32 is safer for non-trivial tasks.
4. Learning rate mismatch. LoRA learning rates are typically higher than full fine-tuning (2e-4 vs 2e-5). Use defaults from known-good recipes (Axolotl, Unsloth configs) as a starting point.
5. Chat template mismatch. Fine-tune with one chat template, serve with another, and the model gets confused. Pin the template at both ends.
6. Ignoring eval. Don’t ship a fine-tune because training loss looked good. Eval on held-out examples, not training data. See Model Evals in Production.
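For the first pitfall, early stopping on held-out loss takes only a few lines. The helper below is a minimal, framework-independent sketch; the function name, the patience default, and the loss values are all made up for illustration. Most training frameworks ship an equivalent callback, which is usually the better choice in practice.

```python
def early_stop_step(eval_losses, patience=2, min_delta=0.0):
    """Return the index of the best eval so far once loss has failed to improve
    for `patience` consecutive evals; return None if training should continue."""
    best, best_idx, bad_evals = float("inf"), None, 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best, best_idx, bad_evals = loss, i, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_idx
    return None


# Hypothetical eval losses from a small-data run that starts overfitting.
losses = [1.90, 1.42, 1.31, 1.35, 1.44, 1.60]
best_checkpoint = early_stop_step(losses)
print(best_checkpoint)
```

Keep the checkpoint at the returned index rather than the last one written, since the last one is by construction past the point where eval loss turned.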
PEFT made “custom model” a reasonable product decision. It’s one of the highest-leverage skills an AI team can develop in 2025.
Planning a fine-tune or evaluating multi-LoRA serving? Reach out — we’ve shipped LoRA fleets from 3 to 300 adapters.