Infrastructure

Self-Hosting Llama 3: A Production Deployment Guide

Balys Kriksciunas 7 min read
#ai #infrastructure #llama #self-hosting #inference #vllm #production #deployment


Meta released Llama 3 in 2024, and the 8B and 70B variants immediately became the self-hosting workhorses. A Llama-3-70B endpoint, run well, is competitive with Claude 3 Haiku or GPT-3.5 on quality, serves under your own latency budget, and costs a fraction per token if your utilization is high.

The “if” is doing a lot of work. This guide walks through what a real production deployment looks like, how to size it, and when self-hosting is actually cheaper than an API.


When to Self-Host

Don’t self-host if any of the following is true:

Self-host when:

A rough economic breakeven vs. an API like Groq or Together at 2024 prices: ~30–100M tokens/day for Llama-3-70B, lower for 8B. Below that, use an API. Above that, self-hosting starts paying back within weeks.
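The payback arithmetic is simple enough to sketch (the numbers in the example are hypothetical; plug in your own token volume, API pricing, and cluster cost):

```python
def payback_days(tokens_per_day, api_price_per_m, cluster_cost_per_day, setup_cost):
    """Days until self-hosting pays back a one-off setup cost,
    given the daily API spend avoided minus the daily cluster cost.
    Returns None if self-hosting never pays back at this volume."""
    api_cost_per_day = tokens_per_day / 1e6 * api_price_per_m
    daily_saving = api_cost_per_day - cluster_cost_per_day
    if daily_saving <= 0:
        return None  # below breakeven volume: stay on the API
    return setup_cost / daily_saving
```

Below the breakeven volume the function returns None — the cluster costs more per day than the API it replaces, so no amount of time pays it back.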


Hardware Sizing

Llama-3-8B:

Llama-3-70B:

Llama-3-405B (Llama 3.1):


Quantization: Essentially Required for 70B+

FP16 is the reference, but quantizing typically buys ~30% more throughput and half the weight memory for ~1% quality loss. On 70B, this is often the difference between fitting on 1 GPU vs 2.

Options as of late 2024:

Our default for 70B in production: FP8 on H100 or AWQ on A100. Benchmark against your actual eval set before committing — quantization’s quality cost is workload-dependent.
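The memory side of the tradeoff is just bytes-per-parameter arithmetic (a weight-only view — KV cache, activations, and CUDA context overhead come on top):

```python
# Bytes per parameter for the common serving formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billion, fmt):
    """Approximate weight footprint in GB for a given parameter count."""
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9

# Llama-3-70B: fp16 ≈ 140 GB, fp8 ≈ 70 GB, int4 (AWQ/GPTQ) ≈ 35 GB.
```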

See our full quantization deep-dive: FP8 and Quantization: Serving LLMs at Half the Cost.


Serving Stack

The production-grade choice in 2024 is vLLM. TGI is a reasonable alternative. TensorRT-LLM is the performance ceiling if you can invest in the build pipeline.

A typical vLLM deployment:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enable-prefix-caching \
    --served-model-name llama-3-70b
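The server speaks the OpenAI-compatible chat completions API, so any OpenAI client works against it by pointing the base URL at the deployment. A minimal sketch of the request body (the model name matches the `--served-model-name` flag above; host and port are whatever you deploy behind):

```python
def chat_payload(prompt, model="llama-3-70b", max_tokens=256, temperature=0.0):
    """Build an OpenAI-compatible /v1/chat/completions request body
    for the vLLM server configured above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

# POST this as JSON to http://<host>:8000/v1/chat/completions
# (e.g. via requests, or the openai client with a custom base_url)
# and you get back a standard chat-completion response.
```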

The knobs that matter:

See vLLM: The Open-Source Inference Engine for the full tuning guide.


Weight Distribution

Llama-3-70B weights are ~140GB. Pulling that from Hugging Face on every pod start will ruin your day.

Pattern we recommend:

  1. Pre-download once to a shared location — S3, GCS, or a dedicated NFS/FSx filesystem.
  2. Cache on each GPU node via a hostPath or local NVMe mount. First pod on a node populates the cache; subsequent pods mmap.
  3. DaemonSet warmer — optional pod per node that downloads weights eagerly at node-startup, so first real pod doesn’t wait.

With a warmed node, pod startup drops from ~3–5 minutes to ~30 seconds.
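Step 2 of the pattern reduces to a check-then-populate with a completion marker, so a crashed mid-copy pod doesn't poison the cache. A sketch (local paths and a plain copy stand in for your S3/GCS sync tooling):

```python
import shutil
from pathlib import Path

def ensure_local_weights(shared_dir: str, local_cache: str, model: str) -> Path:
    """Copy weights from the shared store to local NVMe once per node;
    subsequent pods on the node reuse the warm cache."""
    src = Path(shared_dir) / model
    dst = Path(local_cache) / model
    done = dst / ".complete"          # marker: the copy finished cleanly
    if done.exists():
        return dst                    # warm cache — skip the slow copy
    if dst.exists():
        shutil.rmtree(dst)            # partial copy from a crashed pod: redo
    shutil.copytree(src, dst)
    done.touch()
    return dst
```

The marker file is the important detail: checking only for the directory's existence would happily serve a half-copied model.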


The Serving Topology

Three typical patterns:

Pattern A: Single replica, multi-GPU

One vLLM process, tensor-parallel across N GPUs on one node. Simple; scales to the size of one node.

Pattern B: Multiple replicas behind a load balancer

N copies of the single-replica setup, each on its own node. A simple HTTP load balancer distributes traffic.

Pattern C: Ray Serve / Kubernetes multi-replica with autoscaling

Ray Serve or a Kubernetes Deployment + HPA, scaling the number of replicas based on queue depth or GPU utilization.


Autoscaling Gotchas

Three things that have burned teams:

1. Cold starts are long. Loading 70B weights is 30–120 seconds on top of node provisioning. Scale up proactively on leading indicators (queue depth rising) rather than reactively (latency spiking).

2. Scale-down is easy to get wrong. A replica finishing its last request shouldn’t immediately go away — another request might land mid-drain. Use proper readiness/liveness and graceful shutdown (vLLM handles this well).

3. Never scale below your baseline. Set minReplicas to whatever handles your floor traffic. Scale-to-zero is tempting but a 60-second cold start is an SLO violation in most apps.
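The first and third gotchas can be captured in one small scaling rule — scale up eagerly on queue depth (the leading indicator), scale down one step at a time, never below the floor. A sketch; real deployments wire this into an HPA, KEDA, or Ray Serve autoscaler:

```python
import math

def desired_replicas(queue_depth, current, min_replicas=2, max_replicas=8,
                     queue_per_replica=16):
    """Target replica count from current queue depth."""
    needed = math.ceil(queue_depth / queue_per_replica)
    if needed > current:
        target = needed          # scale up eagerly on the leading indicator
    elif needed < current:
        target = current - 1     # scale down one step at a time, letting
                                 # draining replicas finish in-flight work
    else:
        target = current
    return max(min_replicas, min(max_replicas, target))
```

`min_replicas` is your traffic floor from gotcha 3; `queue_per_replica` is whatever queue depth one replica can absorb without blowing your latency SLO.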


Observability

The must-haves:

OTel-based LLM tracing on the client side of the API (see Tracing LLM Applications with OpenTelemetry).
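On the server side, vLLM exposes a Prometheus `/metrics` endpoint, and the autoscaling signals above come straight off it. A minimal scrape-and-parse sketch — the metric name is an assumption from 2024-era vLLM (`vllm:num_requests_waiting`); check the names your version actually emits:

```python
def parse_gauge(metrics_text, name):
    """Pull a gauge value out of Prometheus text exposition format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):          # skips # HELP / # TYPE lines
            return float(line.rsplit(None, 1)[-1])
    return None

# What a scrape of vLLM's /metrics might look like (abridged):
sample = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llama-3-70b"} 12.0
"""
```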


Economics: When Does It Pay Back?

Mid-2024 pricing, rough numbers:

Llama-3-70B API pricing:

Self-hosted (our benchmark):

Llama-3-8B API: $0.10–$0.20 / M tokens. Self-hosting economics rarely beat this unless you’re running at high sustained load on cheap GPUs.

Conclusion: Self-hosting 70B makes sense above ~30M tok/day. Self-hosting 8B only makes sense in special cases (privacy, fine-tune, latency).
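The per-token cost of a self-hosted cluster is dominated by utilization, which is worth making explicit (the figures in the example are hypothetical):

```python
def self_hosted_cost_per_m(gpu_cost_per_hour, num_gpus, peak_tokens_per_sec,
                           utilization=0.5):
    """$ per million tokens for a self-hosted endpoint. An idle cluster
    still bills by the hour, so utilization scales cost directly."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_cost_per_hour * num_gpus / (tokens_per_hour / 1e6)

# e.g. 2 GPUs at $3/hr each, 3,000 tok/s at peak:
# self_hosted_cost_per_m(3, 2, 3000)       -> ~$1.11/M at 50% utilization
# self_hosted_cost_per_m(3, 2, 3000, 0.1)  -> ~$5.56/M at 10% utilization
```

Same hardware, five times the cost per token — which is why the breakeven thresholds above are volume thresholds, not just price comparisons.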


Things That Will Surprise You

1. The long tail. You benchmark at steady state and see 5k tok/sec. In production, P99 is 30x P50 because of long-prompt requests. Plan for it.

2. Weight loading is disk-bound. Your $30k GPU is waiting on a network mount. Local NVMe caching is not optional.

3. Context length is expensive. Doubling max context from 8K to 16K doubles the worst-case KV cache per sequence, and attention compute grows quadratically with sequence length — and every extra gigabyte of reserved KV cache is a concurrent sequence you can no longer fit. Set max context to what you actually use.
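A quick per-sequence KV cache estimate, using Llama-3-70B's published architecture (80 layers, 8 KV heads via GQA, head dim 128):

```python
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    """Per-sequence KV cache in GB: 2 tensors (K and V) per layer per
    token, at the given element width (2 bytes = fp16)."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

# kv_cache_gb(8192) ≈ 2.68 GB per sequence at fp16 — every sequence at
# max context is ~2.7 GB of GPU memory that can't hold another request.
```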

4. Autoscaling hides GPU failures. A flaky GPU will cause one replica to slow dramatically; autoscaler spins up another and doesn’t notice the bad one. Monitor replica-level health, not just aggregate.

5. The model update treadmill is real. Llama 3 → Llama 3.1 → Llama 3.3 → Llama 4. Keep your deployment templatized so a model swap is a config change, not a week of work.


The Short Version

If you’re serving Llama-3-70B in production:


Further Reading

Planning a Llama 3 deployment? Talk to us — we’ll size and architect it based on your workload, not a vendor’s spreadsheet.
