LLM Gateways: Centralizing Routing, Fallback, and Cost Control
An LLM gateway centralizes rate limiting, fallback, key management, and cost tracking across every service that calls a model provider. A practical comparison of LiteLLM, Portkey, Kong AI Gateway, and Cloudflare AI Gateway, and the patterns they make possible.
If you’ve ever watched a team call OpenAI from seven different microservices with seven different retry policies, you’ve seen the problem an LLM gateway solves. Every service re-implements rate limiting, fallback, key rotation, and cost tracking. Most get it subtly wrong. When OpenAI has a 30-second blip, half your services wedge.
An LLM gateway centralizes all of that. This post compares the options — LiteLLM, Portkey, Kong AI Gateway, Cloudflare AI Gateway — and the patterns they make possible.
A gateway sits between your application code and every LLM provider (OpenAI, Anthropic, Google, your self-hosted vLLM, and so on). From the app's perspective it looks like an OpenAI-compatible API. On the back side, it calls the real provider.
The surface area it handles:

- Routing and fallback across providers
- Rate limiting and retries
- Key management (virtual keys mapped to provider keys)
- Response caching
- Cost tracking and budgets
- Guardrails such as PII redaction
- Observability and logging
You can build this yourself (many teams have). Or you can use an existing one. The existing ones are good enough in 2025 that building from scratch is usually a mistake.
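The drop-in nature is the whole trick: an app that already speaks the OpenAI API switches to a gateway by changing its base URL and key, nothing else. A minimal sketch of that symmetry (hostnames and keys are made up):

```python
import json

def build_chat_request(base_url: str, api_key: str, model: str, messages: list) -> dict:
    """OpenAI-style chat request; identical shape whether it targets the
    provider directly or an OpenAI-compatible gateway."""
    return {
        "url": f"{base_url}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"model": model, "messages": messages}),
    }

# Direct call: every service holds a real provider key.
direct = build_chat_request("https://api.openai.com/v1", "sk-real-key",
                            "gpt-4o", [{"role": "user", "content": "hi"}])

# Via the gateway: same request shape, internal host, revocable virtual key.
gatewayed = build_chat_request("https://llm-gateway.internal/v1", "vk-search-team",
                               "gpt-4o", [{"role": "user", "content": "hi"}])
```

Because the request shape never changes, adopting (or removing) the gateway is a configuration change, not a code change.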
Open-source gateway from BerriAI. A Python proxy that speaks the OpenAI API and routes to 100+ providers.
What’s good:
What’s not:
Use it when: You want the most flexible open-source option, your scale is moderate (< 5k RPS), and you have the ops capacity.
Managed LLM gateway with a strong observability and experimentation product. Self-hosted option for enterprise.
What’s good:
What’s not:
Use it when: You want managed, you value the observability and prompt-management UI, and the pricing works.
Plugin for Kong API Gateway that adds LLM-specific routing. Good fit for orgs already running Kong.
What’s good:
What’s not:
Use it when: You already run Kong and want LLM routing to live in the same gateway stack.
Cloudflare’s managed edge-gateway for LLM traffic.
What’s good:
What’s not:
Use it when: You’re already in the Cloudflare ecosystem, want zero-ops, and your routing needs are simple.
The most-used pattern. Primary: OpenAI GPT-4o. On 429 or 503: Anthropic Claude Sonnet. On failure: self-hosted Llama-3-70B. On second failure: fail loudly.
```yaml
# LiteLLM example
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_KEY
  - model_name: gpt-4o   # same name: creates a fallback group
    litellm_params:
      model: anthropic/claude-sonnet-4
      api_key: os.environ/ANTHROPIC_KEY

router_settings:
  routing_strategy: usage-based-routing
  fallbacks:
    - gpt-4o: [claude-sonnet-4, llama-3-70b-local]
```
When one backend is unhealthy, the gateway tries the next. Applications stay up through provider blips.
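The core of that behavior is a small loop. A toy sketch of it (not LiteLLM's actual code; `ProviderError` and the backend callables are illustrative):

```python
# Retryable statuses: move to the next backend. Anything else fails loudly.
RETRYABLE = {429, 503}

class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def complete_with_fallback(prompt: str, backends: list) -> str:
    """backends: list of (name, call) pairs; call(prompt) -> str or raises
    ProviderError. Tries each in order, collecting retryable failures."""
    errors = []
    for name, call in backends:
        try:
            return call(prompt)
        except ProviderError as e:
            if e.status not in RETRYABLE:
                raise                      # non-retryable: surface immediately
            errors.append((name, e.status))
    raise RuntimeError(f"all backends failed: {errors}")
```

The subtlety in production is state: health checks and cooldowns so a wedged provider is skipped up front rather than rediscovered on every request.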
Smart routing picks the cheapest backend that meets quality requirements. For a simple classification task, route to Llama-3-8B hosted ($0.07/M). For complex reasoning, route to Claude Opus ($15/M).
This requires your application to declare intent. Two approaches:
- Route by endpoint: /classify → cheap model; /plan → expensive model.

Per-team, per-customer, per-endpoint budgets. Enforced at the gateway so individual services can't accidentally exhaust the org's OpenAI budget. Critical for multi-tenant SaaS.
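The enforcement itself is simple accounting done before the request is forwarded. A minimal sketch (team names and limits are illustrative):

```python
from collections import defaultdict

class BudgetTracker:
    """Per-team spend limits, checked before each forwarded call."""

    def __init__(self, limits_usd: dict):
        self.limits = limits_usd            # e.g. {"search-team": 500.0}
        self.spent = defaultdict(float)

    def charge(self, team: str, cost_usd: float) -> None:
        """Reject the request before it reaches the provider if this call
        would push the team over its budget; otherwise record the spend."""
        if self.spent[team] + cost_usd > self.limits.get(team, 0.0):
            raise PermissionError(f"{team} is over its LLM budget")
        self.spent[team] += cost_usd
```

In practice the counters live in a shared store (LiteLLM uses Postgres for this) so all gateway replicas see the same totals.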
Issue per-team or per-service “virtual keys” that map to real provider keys. Revoke a virtual key without rotating the underlying OpenAI key. This is a compliance feature that also saves your ops team.
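The indirection is just a lookup table the gateway controls. A sketch (all key values are made up):

```python
# One real provider key, many revocable virtual keys mapped onto it.
REAL_KEYS = {"openai": "sk-real-openai-key"}

virtual_keys = {
    "vk-search-team": {"provider": "openai", "active": True},
    "vk-intern-demo": {"provider": "openai", "active": True},
}

def resolve(virtual_key: str) -> str:
    """Map a virtual key to the real provider key, or reject it."""
    entry = virtual_keys.get(virtual_key)
    if entry is None or not entry["active"]:
        raise PermissionError("unknown or revoked virtual key")
    return REAL_KEYS[entry["provider"]]

def revoke(virtual_key: str) -> None:
    """Revocation touches only the gateway; the provider key is untouched."""
    virtual_keys[virtual_key]["active"] = False
```

Rotating the real OpenAI key becomes a one-line change in `REAL_KEYS`, invisible to every team holding a virtual key.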
For deterministic-ish queries, cache by semantic similarity. A classification call with the same question gets a cached answer. Saves money on repeat workloads.
Portkey has this built in. LiteLLM has semantic caching via Redis integration. Cloudflare has edge-cached responses.
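The mechanism: store (prompt, response) pairs, and on each request return a cached response if some prompt is similar enough. Real gateways measure similarity with embeddings; in this self-contained toy a bag-of-words cosine stands in:

```python
import math
from collections import Counter

def _cosine(a: str, b: str) -> float:
    """Toy similarity: cosine over lowercased word counts, NOT embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.entries = []            # list of (prompt, response)
        self.threshold = threshold

    def get(self, prompt: str):
        for cached_prompt, response in self.entries:
            if _cosine(prompt, cached_prompt) >= self.threshold:
                return response      # close enough: skip the provider call
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((prompt, response))
```

The threshold is the whole game: too loose and you serve wrong answers, too tight and you never hit. Tune it per workload.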
Intercept and redact PII before it reaches the provider. Email addresses, SSNs, phone numbers replaced with tokens before the call; restored in the response.
This is critical for regulated industries. Portkey and LiteLLM both have guardrails libraries; you can also pipe to an external PII service.
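A minimal regex-based redaction pass of the kind described above; production guardrails use far more thorough detection, and these patterns are deliberately simplified:

```python
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str):
    """Replace PII with numbered tokens; return redacted text + restore map."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"<{label}_{i}>"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    """Swap the tokens back into the provider's response."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The restore map must live only inside the gateway (or the caller's session); if it leaks to the provider alongside the tokens, the redaction was theater.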
Every call tagged with team/user/feature, sent to tracing backend. You can answer “who drove our $12k Anthropic bill last week?” in one SQL query.
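Once every call carries tags, attribution is a group-by. A sketch with made-up sample data:

```python
from collections import defaultdict

# Each forwarded call is logged with its tags and computed cost.
calls = [
    {"team": "search",  "provider": "anthropic", "cost_usd": 8000.0},
    {"team": "support", "provider": "anthropic", "cost_usd": 4000.0},
    {"team": "search",  "provider": "openai",    "cost_usd": 1500.0},
]

def spend_by_team(calls: list, provider: str) -> dict:
    """Total spend per team for one provider, biggest spender first."""
    totals = defaultdict(float)
    for call in calls:
        if call["provider"] == provider:
            totals[call["team"]] += call["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

The same aggregation in SQL against the gateway's spend table is the "one query" answer to the Anthropic-bill question.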
```
[ App ]
   │
   ▼
[ LiteLLM Proxy ] ─── [ Redis cache ]
   │                  [ Postgres: spend tracking, virtual keys ]
   │
   ├── [ OpenAI ]
   ├── [ Anthropic ]
   ├── [ vLLM (self-hosted Llama 3) ]
   ├── [ Together / DeepInfra ]
   └── [ Ollama (local dev) ]
```
Run LiteLLM as a Kubernetes Deployment, 3+ replicas, behind a Service. Stateless; horizontally scales. Redis and Postgres alongside for state. Prometheus scrapes metrics.
Our default reference deployment. Works for teams from Series A to unicorn.
Build (DIY) if:
Buy (or use open source) if:
We’ve seen four or five teams build from scratch; all ended up reimplementing 80% of what LiteLLM already does.
Your gateway is the natural place to collect LLM telemetry. Wire it to:
LiteLLM, Portkey, and Kong all have turn-key integrations with the major observability backends.
1. Streaming is harder through a proxy. Make sure your gateway supports streaming well. Some retry/fallback logic gets subtle when you’ve already sent half a response.
2. Latency tax. Adding a gateway adds 10–50ms to each request. For a 5-second LLM call this is noise. For sub-second embedding calls, it matters. Measure.
3. Structured output support varies. OpenAI’s json_mode, tool calling, and response_format need to be passed through correctly. LiteLLM handles most of this; test.
4. Key rotation discipline. When you rotate provider keys, the gateway is the only place it needs to happen — if your apps also have direct access, you’ve defeated the point.
5. Gateway is a single point of failure. Run multiple replicas. Health check. Monitor.
Add a gateway before you have 10 services calling OpenAI directly. Retrofitting later is painful.
Setting up an LLM gateway and want a second opinion on the architecture? We can help — we’ve shipped gateways at every size.