Industry Analysis

LangSmith vs Langfuse vs Arize Phoenix: LLM Observability in 2026

Balys Kriksciunas · 7 min read
#ai #agents #observability #langsmith #langfuse #arize-phoenix #llm-ops #infrastructure #recap

We’ve run all three — LangSmith, Langfuse, and Arize Phoenix — across different production AI deployments: a customer-facing RAG pipeline doing ~50k LLM calls/day, an internal multi-agent workflow orchestrating legal document review, and a batch evaluation harness scoring 100k+ trajectories for a model upgrade.

The market has consolidated. The “try five proprietary SDKs” era is over. Langfuse and Phoenix standardized on OpenTelemetry. LangSmith added OTel ingestion but remains tightly coupled to the LangChain ecosystem. Arize’s strength in drift detection and RAG evaluation hasn’t shifted. Here’s where each tool actually wins in 2026, what each one costs at scale, and our recommendation for different team profiles.

What LLM Observability Means Now

Conceptual visual: three LLM observability dashboards side by side

Traditional APM tools (Datadog, New Relic, Honeycomb) track request latency, error rates, and resource utilization. That’s necessary but insufficient for LLM workloads.

LLM observability adds three dimensions these tools were never built for:

  1. Semantic correctness — an HTTP 200 can return a confident hallucination
  2. Token-level cost attribution — a single user session can cost $0.42 or $42 depending on model routing
  3. Nested agent tracing — agents spawn sub-agent calls, tool invocations, and retrieval steps that branch and rejoin, requiring tree span structures, not flat request logs (see the sketch after this list)
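
To make the third point concrete, here’s a minimal sketch of what tree-structured spans look like with the vanilla OpenTelemetry Python SDK. The span names and attribute keys are illustrative placeholders, not any vendor’s schema, and the console exporter is only there so the example runs on its own.

```python
# Minimal sketch: nested agent spans with the OpenTelemetry Python SDK.
# Span names and attribute keys are illustrative, not a vendor schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

def answer(question: str) -> str:
    # Root span: one user request.
    with tracer.start_as_current_span("agent.run") as root:
        root.set_attribute("input.question", question)

        # Child span: retrieval step (in practice this branches per sub-agent/tool).
        with tracer.start_as_current_span("retriever.search") as retrieval:
            retrieval.set_attribute("retrieval.top_k", 5)
            docs = ["doc-1", "doc-2"]  # placeholder retrieval results

        # Child span: LLM call, carrying token counts for cost attribution.
        with tracer.start_as_current_span("llm.generate") as llm:
            llm.set_attribute("llm.model_name", "example-model")
            llm.set_attribute("llm.token_count.prompt", 812)
            llm.set_attribute("llm.token_count.completion", 126)
            text = f"answer grounded in {len(docs)} documents"

        root.set_attribute("output.answer", text)
        return text

print(answer("What changed in the Q3 contract?"))
```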

The closed loop that matters now: trace → evaluate → collect datasets → improve prompts → redeploy. Tools that cover only one piece of that loop are losing ground.

LangSmith: LangChain-Native, But Paying for the Privilege

Best for: Teams already using LangChain/LangGraph who want the tightest possible integration between tracing, evals, and prompt iteration.

LangSmith is LangChain’s observability platform, and it shows. The trace UI renders LangGraph checkpoint states natively — you can see exactly which node ran, what state it mutated, and why a router chose a particular path. The prompt playground lets you A/B test prompt variants and commit winners to a dataset. Their “Fleet” product (launched late 2025) even deploys agents directly from the platform.
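
For code that doesn’t go through LangChain’s callbacks, the standalone langsmith SDK exposes a traceable decorator. A minimal sketch, assuming the API key and tracing flag are already set in the environment (the exact variable names have changed across SDK versions, so check the docs for yours):

```python
# Minimal sketch: tracing plain Python functions into LangSmith with @traceable.
# Assumes LANGSMITH_API_KEY (or LANGCHAIN_API_KEY on older SDKs) and the
# tracing flag are set in the environment; names vary by SDK version.
from langsmith import traceable

@traceable(run_type="retriever")
def retrieve(query: str) -> list[str]:
    return ["doc-1", "doc-2"]  # placeholder retrieval results

@traceable(run_type="chain")
def answer(question: str) -> str:
    docs = retrieve(question)  # appears as a nested child run in the trace tree
    return f"answer grounded in {len(docs)} documents"  # stand-in for an LLM call

print(answer("What changed in the Q3 contract?"))
```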

But there are real trade-offs. There’s no self-hosted option, so your traces live on LangSmith’s infrastructure, and pricing moves into the enterprise tier at production volume (see the comparison matrix below).

LangSmith’s evaluation features are genuinely strong. The built-in LLM-as-judge evaluators, criteria templates, and dataset comparison workflows are polished. The trade-off is that you’re buying into the LangChain ecosystem in a way that goes beyond a simple tracing relationship.
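
If you’re wondering what sits behind an LLM-as-judge evaluator, here’s a framework-agnostic sketch of the pattern. This is not LangSmith’s evaluator API; call_judge is a placeholder for whatever model client you use:

```python
# Framework-agnostic sketch of an LLM-as-judge correctness check.
# Not LangSmith's API: `call_judge` is a placeholder for your model client,
# expected to take a prompt string and return the judge's text response.
from typing import Callable

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Reply with exactly PASS or FAIL, then one sentence of reasoning."""

def judge_correctness(
    question: str,
    answer: str,
    reference: str,
    call_judge: Callable[[str], str],
) -> bool:
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    verdict = call_judge(prompt)
    return verdict.strip().upper().startswith("PASS")
```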

When to pick it: You’re all-in on LangChain/LangGraph, your team doesn’t have strong opinions about data sovereignty, and you want the most integrated dev-to-prod experience available.

Langfuse: Open-Source, Self-Hosted, and the Community Standard

Best for: Teams that want framework-agnostic tracing with self-hosting, or anyone running in regulated environments where data can’t leave your infrastructure.

Langfuse has emerged as the community default for a reason: it’s MIT-licensed, self-hostable, and natively OpenTelemetry-based, so the same instrumentation works regardless of which framework (or lack of one) your agents are built on.

The free tier (50k events/month on Langfuse Cloud) is generous enough for most startups. Self-hosted means unlimited events at infrastructure cost, which for 2–3M traces/month on a modest ClickHouse setup runs under $500/month in compute.
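
Because Langfuse speaks OpenTelemetry natively, pointing an existing OTel pipeline at a self-hosted instance is mostly exporter configuration. A minimal sketch, assuming your deployment exposes an OTLP/HTTP traces endpoint authenticated with the project’s public/secret key pair; the exact path and header format depend on your Langfuse version, so verify them against its docs:

```python
# Minimal sketch: routing OTel spans to a self-hosted Langfuse instance.
# The endpoint path and Basic-auth header below are assumptions; check the
# docs for your Langfuse version before relying on them.
import base64
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

LANGFUSE_HOST = "https://langfuse.internal.example.com"  # your self-hosted URL
PUBLIC_KEY, SECRET_KEY = "pk-lf-...", "sk-lf-..."        # project API keys

auth = base64.b64encode(f"{PUBLIC_KEY}:{SECRET_KEY}".encode()).decode()
exporter = OTLPSpanExporter(
    endpoint=f"{LANGFUSE_HOST}/api/public/otel/v1/traces",  # assumed OTLP path
    headers={"Authorization": f"Basic {auth}"},
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# From here, any OTel-instrumented code (like the span sketch earlier in this
# post) lands in Langfuse without framework-specific SDKs.
```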

The UI is clean but less polished than LangSmith’s. Dataset management and prompt versioning exist but don’t have the same level of integration. If you’re iterating on prompts daily, LangSmith still has the edge.

When to pick it: You need self-hosting, you’re framework-agnostic or use multiple frameworks, or you have compliance requirements around data residency. This is our default pick for enterprise deployments.

Arize Phoenix: Best for RAG Evaluation and Drift Detection

Best for: Teams running RAG pipelines who need to catch retrieval quality degradation and embedding drift before users notice.

Arize Phoenix is the open-source sibling of Arize AI’s commercial platform. It takes a fundamentally different approach from LangSmith and Langfuse: rather than being trace-first, it is built around evaluation and embedding analysis, with visual drift detection (UMAP projections plus statistical checks) as the headline capability.

Phoenix’s trace viewer is functional but not as good as Langfuse’s session replay. It doesn’t have the same prompt management capabilities as either competitor. But if your problem is specifically “is our RAG pipeline getting worse over time,” Phoenix is the right tool.

We’ve used it to catch embedding model regressions that would have gone undetected in standard latency/cost monitoring — the retrieval was fast and cheap, but the semantic recall had degraded by 30% after an index rebuild. Phoenix’s UMAP plots showed it in five minutes.
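
Phoenix surfaces this visually, but the underlying signal is simple to reason about. Here’s a rough numeric sketch of what "embedding drift" means; this is not Phoenix’s API, just the idea of comparing a baseline batch of embeddings against a current batch and flagging when the centroids move apart.

```python
# Rough sketch of the idea behind embedding drift (not Phoenix's API):
# compare the centroid of a baseline embedding batch against a current batch.
import numpy as np

def centroid_drift(baseline: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between the mean vectors of two embedding batches."""
    b, c = baseline.mean(axis=0), current.mean(axis=0)
    cos_sim = float(np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c)))
    return 1.0 - cos_sim

# Example: a rebuilt index shifts embeddings; drift above a tuned threshold alerts.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(1000, 384))
current = baseline + rng.normal(scale=0.4, size=(1000, 384))  # simulated shift
print(f"centroid drift: {centroid_drift(baseline, current):.3f}")
```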

When to pick it: RAG is your core product, embedding drift is a real business risk, or you need production-grade eval workflows more than real-time trace replay.

Feature Comparison Matrix

| Capability | LangSmith | Langfuse | Arize Phoenix |
| --- | --- | --- | --- |
| License | Proprietary | MIT | Apache 2.0 |
| Self-hosted | No | Yes | Yes |
| LangChain integration | Native (deepest) | Native SDK | SDK-based |
| OTel support | OTel ingestion | Native OTel | Native OTel (OpenInference) |
| Session replay | Yes | Yes (cleanest) | Basic |
| Prompt management | Best-in-class | Good | Limited |
| RAG evals | Via LLM-as-judge | Evaluator templates | Best-in-class |
| Drift detection | Manual | Manual | Visual (UMAP, statistical) |
| Dataset workflows | Excellent | Good | Excellent |
| Cost at 500k traces/mo | Enterprise tier | Self-host: ~$500 infra | Self-host: ~$500 infra |
| Multi-agent trace view | Excellent (LangGraph-aware) | Good | Functional |

The Verdict: Pick Based on Your Architecture, Not the Marketing

The LLM observability market has settled into three clear positions:

  1. LangSmith owns the LangChain/LangGraph-native experience, with the most integrated tracing, eval, and prompt workflow, but it’s proprietary and hosted-only.
  2. Langfuse is the open-source, framework-agnostic default: native OTel, self-hostable, and cheap at scale.
  3. Arize Phoenix is the RAG evaluation and embedding-drift specialist.

A common pattern we see in production: teams start with LangSmith (because it’s the default with LangChain), then migrate to Langfuse as they adopt non-LangChain frameworks or hit data residency requirements. The migration is straightforward — OTel ingestion on both sides.

The tool we recommend most teams start with: Langfuse, self-hosted or cloud. It’s the only one that doesn’t constrain your next architectural decision. Add Phoenix alongside if RAG quality is a critical SLO.

What We’re Tracking Next

For a deeper look at how to wire up tracing with OpenTelemetry regardless of which backend you choose, see our OTel tracing guide. For setting up the eval side of the loop, our model evals in production post covers the patterns. And for the broader infrastructure picture — where observability fits in the full stack — check our AI infrastructure stack.

We design, deploy, and operate custom AI agent systems for companies that need them to work. Let’s talk.
