AI Observability: Monitoring LLMs and ML Models in Production in 2025

Track quality, cost, drift, and failures for AI systems with LLMOps observability platforms

Advanced, about 20 minutes

Deploying AI without observability is flying blind. This guide covers LLM-specific monitoring with LangSmith, Arize Phoenix, and Weights & Biases, detecting hallucinations and quality degradation, monitoring embedding drift for RAG systems, tracking token costs and latency SLAs, setting up alerting for AI failures, and building dashboards that give engineering and product teams visibility into AI system health.

LLM Observability, AI Monitoring, LangSmith, Arize, MLOps, Hallucination Detection

AI Observability: Monitoring LLMs and ML Models in Production

Why AI Observability is Different

Traditional software monitoring tracks errors, latency, and throughput. These metrics are necessary but insufficient for AI systems.

AI-specific monitoring needs: output quality (is the LLM giving good answers?), factual accuracy (is the model hallucinating?), safety (are harmful outputs being generated?), data drift (has the distribution of inputs changed?), fairness (is performance consistent across user groups?), cost (tokens consumed, API spend).

You cannot know if your AI system is working well without domain-specific observability.

LLM Observability Platforms

LangSmith (LangChain)

Traces every LLM call and agent action: prompt templates, input variables, rendered prompt, model response, latency, token count, and cost. Visualizes agent execution trees, supports human annotation for quality labeling and A/B testing of prompts, and can alert on quality metrics.

Instrumentation: set the LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY environment variables. LangChain operations are then traced automatically, without code changes.
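These variables are usually set in the shell or the deployment manifest. Where that is not convenient, they can be set from Python at startup, as in the minimal sketch below. The helper name `enable_langsmith_tracing` is hypothetical; only the two variable names come from the text above.

```python
import os


def enable_langsmith_tracing(api_key: str) -> None:
    """Set the two tracing environment variables for the current process.

    Hypothetical helper: call it before any LangChain objects are created,
    so the variables are already in place when tracing is initialized.
    """
    if not api_key:
        raise ValueError("a LangSmith API key is required")
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = api_key


# Example (the key would normally come from a secret store, not source code):
# enable_langsmith_tracing(load_key_from_secret_store())
```

Keeping the key out of source control and injecting it at deploy time is the safer default; the function form is mainly useful in notebooks and tests.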

Arize Phoenix

Open-source LLM observability. Embeds traces in vector space to cluster similar queries, detects drift between production traffic and a reference dataset, and identifies outlier queries that the model handles poorly.

Weights & Biases (W&B)

LLM-aware experiment tracking and production monitoring. Log prompts, responses, and evaluation scores. Monitor production metrics over time. Compare model versions.

Helicone, Braintrust, HoneyHive

Purpose-built LLM observability tools with: request/response logging, cost tracking, prompt management, evaluation suites, and team collaboration features.

Quality Monitoring

LLM-as-Judge Evaluation

Use a strong model such as GPT-4 to automatically evaluate production responses against custom criteria. Example evaluator: score each response 1-5 for helpfulness, accuracy, and safety. Run it continuously on a 5% sample of production traffic.

Implement this as a separate async job: store all LLM interactions, sample 5% each hour, run the evaluation, store the scores in a time-series database, and alert if the rolling average drops below a threshold.
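The sampling and rolling-average steps of that job can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the record shape, the `judge` callable (which would wrap the actual LLM call), the window size, and the 3.5 threshold are all hypothetical.

```python
import random
from collections import deque
from statistics import mean
from typing import Callable, Deque, Dict, List, Optional

# Hypothetical record shape: {"prompt": ..., "response": ...}
Interaction = Dict[str, str]
# Hypothetical judge: wraps the evaluator model and returns a 1-5 score.
Judge = Callable[[Interaction], float]


def sample_interactions(batch: List[Interaction], rate: float = 0.05,
                        rng: Optional[random.Random] = None) -> List[Interaction]:
    """Pick roughly `rate` of the batch, at least one item if it is non-empty."""
    if not batch:
        return []
    rng = rng or random.Random()
    k = max(1, round(len(batch) * rate))
    return rng.sample(batch, k)


class RollingQuality:
    """Keeps the most recent scores and reports when their average is too low."""

    def __init__(self, window: int = 200, threshold: float = 3.5) -> None:
        self.scores: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def add(self, score: float) -> None:
        self.scores.append(score)

    def average(self) -> Optional[float]:
        return mean(self.scores) if self.scores else None

    def should_alert(self) -> bool:
        avg = self.average()
        return avg is not None and avg < self.threshold


def run_hourly_evaluation(batch: List[Interaction], judge: Judge,
                          tracker: RollingQuality, rate: float = 0.05,
                          rng: Optional[random.Random] = None) -> bool:
    """Score a sample of the hour's interactions; return True if an alert is due."""
    for item in sample_interactions(batch, rate, rng):
        tracker.add(judge(item))
    return tracker.should_alert()
```

In a real deployment the scores would also be written to the time-series database; the in-memory deque here only stands in for the rolling window.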

Hallucination Detection

Hallucination detection approaches:

Source grounding: for RAG systems, check if each factual claim in the response can be traced to a retrieved document

Consistency checking: run the same query multiple times and flag responses that contradict each other

Confidence scoring: where the model provides calibrated confidence, use it to flag uncertain claims

Fact verification: external API to verify specific factual claims (knowledge graph lookup)

Automated evaluation: FActScore (automatic factual precision scoring), the RAGAS faithfulness metric, and custom evaluators that compare responses against reference documents.
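As a rough illustration of source grounding, the sketch below flags response sentences whose content words barely appear in any retrieved document. Lexical overlap is a crude stand-in for the claim-level checks that metrics like RAGAS faithfulness perform; the function names, the stopword list, and the 0.6 threshold are assumptions made for this example.

```python
import re
from typing import List, Set

# Small illustrative stopword list; a real system would use a fuller one.
_STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "it", "and", "to", "on", "for"}


def _content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in _STOPWORDS}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def ungrounded_sentences(response: str, documents: List[str],
                         min_overlap: float = 0.6) -> List[str]:
    """Return response sentences with low word overlap against every document."""
    doc_words = [_content_words(d) for d in documents]
    flagged = []
    for sentence in split_sentences(response):
        words = _content_words(sentence)
        if not words:
            continue
        # Best fraction of this sentence's content words found in a single document.
        best = max((len(words & dw) / len(words) for dw in doc_words), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged
```

A check like this is cheap enough to run on every response, with flagged sentences routed to a slower LLM-based or human review.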

Semantic Drift Detection

Monitor the embedding distribution of input queries over time. If today's query embeddings sit far from the reference distribution the system was validated on, the model may be operating outside its competence zone.

Use a reference dataset of "normal" queries: compute their embeddings and build a baseline distribution. For each new batch of queries, compute a distribution distance against the baseline (KL divergence, JS divergence, or a statistical test) and alert when it exceeds a threshold.
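One simple way to turn this into a number, sketched below, is to reduce each embedding to its distance from the reference centroid, histogram those distances for the reference set and the new batch, and compare the histograms with JS divergence. The reduction to one dimension, the bin count, and all function names are assumptions for illustration; other reductions and tests are equally valid.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of the vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def histogram(values: List[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    for x in values:
        idx = min(bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: List[float], b: List[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_score(reference: Sequence[Vector], batch: Sequence[Vector],
                bins: int = 10) -> float:
    """JS divergence between distance-to-centroid histograms of the two sets."""
    c = centroid(reference)
    ref_d = [math.dist(c, v) for v in reference]
    new_d = [math.dist(c, v) for v in batch]
    lo, hi = min(ref_d + new_d), max(ref_d + new_d)
    return js_divergence(histogram(ref_d, lo, hi, bins),
                         histogram(new_d, lo, hi, bins))
```

The alert threshold on `drift_score` has to be calibrated empirically, for example by scoring several known-normal batches against the reference set first.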

Cost and Performance Monitoring

Token Usage Tracking

Track per-request: input tokens, output tokens, total cost ($), model name. Aggregate by: user, feature, model, time period. Dashboards showing: daily/weekly spend, cost per feature, cost per user, cost optimization opportunities (identify expensive queries that could use a cheaper model).
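A minimal sketch of the per-request record and the aggregation step is shown below. The model names and prices in `PRICES_PER_1K` are placeholders, not real provider rates, and the `Usage` fields simply mirror the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Placeholder prices in USD per 1,000 tokens as (input, output).
# These are NOT real rates; substitute your provider's current pricing.
PRICES_PER_1K = {
    "large-model": (0.01, 0.03),
    "small-model": (0.001, 0.002),
}


@dataclass
class Usage:
    user: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(u: Usage) -> float:
    """Dollar cost of one request under the placeholder price table."""
    price_in, price_out = PRICES_PER_1K[u.model]
    return u.input_tokens / 1000 * price_in + u.output_tokens / 1000 * price_out


def spend_by(records: Iterable[Usage], key: str) -> Dict[str, float]:
    """Total spend grouped by a Usage field such as 'user', 'feature' or 'model'."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += request_cost(r)
    return dict(totals)
```

Grouping the same records by `feature` and by `model` side by side is a quick way to spot features that send routine queries to an expensive model.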

Alert on cost anomalies: if hourly spend exceeds 3× the moving average, raise an alert, since the cause may be a prompt injection attack or a usage spike.
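The 3× rule can be sketched with a fixed-size window of hourly totals. The class name and the 24-hour window are assumptions made for this example.

```python
from collections import deque
from typing import Deque


class SpendAnomalyDetector:
    """Flags an hour whose spend exceeds `multiplier` times the moving average."""

    def __init__(self, window_hours: int = 24, multiplier: float = 3.0) -> None:
        self.history: Deque[float] = deque(maxlen=window_hours)
        self.multiplier = multiplier

    def observe(self, hourly_spend: float) -> bool:
        """Record one hour of spend; return True if it is anomalous.

        The comparison uses only the hours seen before this one, so a spike
        cannot hide itself by inflating its own baseline.
        """
        is_anomaly = False
        if self.history:
            avg = sum(self.history) / len(self.history)
            is_anomaly = avg > 0 and hourly_spend > self.multiplier * avg
        self.history.append(hourly_spend)
        return is_anomaly
```

Note that a flagged hour still enters the history and raises the baseline for later hours; excluding flagged hours is a reasonable variation when spikes are frequent.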

Latency Monitoring

Track p50, p95, and p99 latency separately. For LLMs, p99 latency is often 5-10× p50 because output length varies. Set the SLA at p95 and alert when it is violated for 10+ minutes.
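For reference, the percentiles can be computed from raw request latencies with the nearest-rank method, as in this small sketch. The function names are illustrative, and a metrics backend would normally do this work from histograms instead.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """p50, p95 and p99 for one reporting window."""
    return {name: percentile(latencies_ms, p)
            for name, p in (("p50", 50), ("p95", 95), ("p99", 99))}


def sla_violated(latencies_ms: Sequence[float], p95_limit_ms: float) -> bool:
    """True when the window's p95 latency is above the SLA limit."""
    return percentile(latencies_ms, 95) > p95_limit_ms
```

Evaluating `sla_violated` once per minute and alerting only after ten consecutive violations gives the 10-minute behavior described above.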

Time-to-first-token (TTFT): critical for streaming UX. Track it separately from total latency. Optimization options: speculative decoding, tuned server-side batching, and caching responses to common queries (for example at a CDN).
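Measuring TTFT only requires timestamping the first chunk of a streamed response. The sketch below works with any iterable of text chunks; `fake_stream` is a stand-in for a real streaming client and exists purely for demonstration.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a token stream and return (text, ttft_seconds, total_seconds).

    Timing starts when iteration begins, which for a lazy stream is also
    when the underlying request is issued.
    """
    start = time.perf_counter()
    ttft = None
    parts: List[str] = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first token; report total time for both values.
    return "".join(parts), (ttft if ttft is not None else total), total


def fake_stream() -> Iterator[str]:
    """Demonstration stream that pauses before and between chunks."""
    time.sleep(0.05)
    yield "Hello"
    time.sleep(0.05)
    yield ", world"
```

Usage: `text, ttft, total = measure_stream(fake_stream())`, then record `ttft` and `total` as separate metrics.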

Production Alerting

Critical Alerts (PagerDuty)

LLM API error rate > 5% for 5 minutes. All LLM requests failing (100% error rate). Model serving endpoint down. Latency p95 > 5× baseline for 10 minutes.
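Rules of the form "above X for N minutes" can be evaluated as in the sketch below. It assumes one metric sample per minute, expressed as `(unix_timestamp, value)` pairs; the function names and the `min_samples` guard are choices made for this example, and an alerting system such as the one behind PagerDuty would normally evaluate these rules itself.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def sustained_breach(samples: Sequence[Sample], threshold: float,
                     window_s: float, now: float, min_samples: int) -> bool:
    """True if the window holds enough samples and every one exceeds the threshold."""
    recent = [v for t, v in samples if now - window_s < t <= now]
    return len(recent) >= min_samples and all(v > threshold for v in recent)


def critical_alerts(error_rate: Sequence[Sample], p95_latency_ms: Sequence[Sample],
                    baseline_p95_ms: float, now: float) -> List[str]:
    """Evaluate two of the critical rules above on per-minute samples."""
    alerts = []
    if sustained_breach(error_rate, 0.05, 300, now, min_samples=5):
        alerts.append("error rate > 5% for 5 minutes")
    if sustained_breach(p95_latency_ms, 5 * baseline_p95_ms, 600, now, min_samples=10):
        alerts.append("p95 latency > 5x baseline for 10 minutes")
    return alerts
```

Requiring a minimum sample count keeps a gap in metric collection from being mistaken for a sustained breach; missing data deserves its own alert.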

Warning Alerts (Slack)

Quality score drops 10%+ from baseline. Token costs exceed 3× the daily average. Hallucination rate rises above baseline. Anomalous input distribution detected.
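The first two warning rules are comparisons against a baseline and can be sketched as plain functions. The function name and the returned labels are hypothetical, and delivery to Slack is left out because it depends on the integration in use.

```python
from typing import List


def warning_alerts(quality_now: float, quality_baseline: float,
                   cost_today: float, cost_daily_avg: float) -> List[str]:
    """Return labels for the warning rules that currently fire."""
    warnings = []
    # Relative drop in quality score versus the baseline (10% or more).
    if quality_baseline > 0:
        drop = (quality_baseline - quality_now) / quality_baseline
        if drop >= 0.10:
            warnings.append("quality_drop")
    # Today's token cost versus three times the daily average.
    if cost_daily_avg > 0 and cost_today > 3 * cost_daily_avg:
        warnings.append("cost_spike")
    return warnings
```

For example, a score of 3.5 against a baseline of 4.0 is a 12.5% drop and fires `quality_drop`, while a day costing twice the average does not fire `cost_spike`.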

Informational Dashboard

Daily quality metrics report. Weekly cost trends. Monthly model performance comparison. Bias metrics by user demographic.

Observability-Driven Improvement

Observability creates a feedback loop for continuous improvement: monitor → identify issues → analyze root causes → fix (better prompt, fine-tuning, different model) → deploy → monitor. Run consistently, this loop steadily improves AI system quality, because each cycle starts from a better baseline than the last.

Target observability stack: LangSmith for tracing + Grafana for dashboards + PagerDuty for alerting + weekly human evaluation of sampled interactions.
