AI Observability: Monitoring LLMs and ML Models in Production in 2025
Track quality, cost, drift, and failures for AI systems with LLMOps observability platforms
Deploying AI without observability is flying blind. This guide covers LLM-specific monitoring with LangSmith, Arize Phoenix, and Weights & Biases, detecting hallucinations and quality degradation, monitoring embedding drift for RAG systems, tracking token costs and latency SLAs, setting up alerting for AI failures, and building dashboards that give engineering and product teams visibility into AI system health.
Why AI Observability is Different
Traditional software monitoring tracks errors, latency, and throughput. These metrics are necessary but insufficient for AI systems.
AI-specific monitoring needs: output quality (is the LLM giving good answers?), factual accuracy (is the model hallucinating?), safety (are harmful outputs being generated?), data drift (has the distribution of inputs changed?), fairness (is performance consistent across user groups?), cost (tokens consumed, API spend).
You cannot know if your AI system is working well without domain-specific observability.
LLM Observability Platforms
LangSmith (LangChain)
Traces every LLM call and agent action: prompt templates, input variables, rendered prompt, model response, latency, token count, and cost. Visualizes agent execution trees. Supports human annotation for quality labeling, A/B testing of prompts, and alerting on quality metrics.
Instrumentation: set the LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY environment variables. LangChain operations are then traced automatically, without code changes.
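As a minimal sketch of the instrumentation step, the two environment variables can also be set from Python before any LangChain code runs. The helper name `enable_langsmith_tracing` is hypothetical, and the optional `LANGCHAIN_PROJECT` variable (used to group traces under a project name) is an addition not mentioned above.

```python
import os
from typing import Optional


def enable_langsmith_tracing(api_key: str, project: Optional[str] = None) -> None:
    """Set the tracing environment variables for the current process.

    Call this before constructing any LangChain objects so that the
    settings are already in place when the first chain or model runs.
    """
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = api_key
    if project is not None:
        # Assumption: traces should be grouped under a named project.
        os.environ["LANGCHAIN_PROJECT"] = project
```

In a deployed service these values would more typically come from the deployment environment or a secrets manager than from application code.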
Arize Phoenix
Open-source LLM observability. Embeds traces in vector space to cluster similar queries. Detects drift between production traffic and a reference dataset. Identifies outlier queries that the model handles poorly.
Weights & Biases (W&B)
LLM-aware experiment tracking and production monitoring. Log prompts, responses, and evaluation scores. Monitor production metrics over time. Compare model versions.
Helicone, Braintrust, HoneyHive
Purpose-built LLM observability tools with request/response logging, cost tracking, prompt management, evaluation suites, and team collaboration features.
Quality Monitoring
LLM-as-Judge Evaluation
Use GPT-4 to automatically evaluate production responses on custom criteria. Example evaluator: score each response 1-5 for helpfulness, accuracy, and safety. Run it continuously on a 5% sample of production traffic.
Implement this as a separate async job: store all LLM interactions, sample 5% hourly, run the evaluation, store the scores in a time-series database, and alert if the rolling average drops below a threshold.
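The sampling and rolling-average steps of that job can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `judge` is a hypothetical callable that returns a 1-5 score (in production it would wrap an LLM call), the 24-batch window and 3.5 threshold are made-up defaults, and an in-memory deque stands in for the time-series database.

```python
import random
from collections import deque
from statistics import mean
from typing import Callable, Deque, Dict, List, Optional

Interaction = Dict[str, str]  # e.g. {"prompt": "...", "response": "..."}


def sample_interactions(
    interactions: List[Interaction], rate: float = 0.05, seed: Optional[int] = None
) -> List[Interaction]:
    """Keep each interaction independently with probability `rate`."""
    rng = random.Random(seed)
    return [item for item in interactions if rng.random() < rate]


class RollingQualityMonitor:
    """Tracks the mean score of recent batches and flags drops below a threshold."""

    def __init__(self, window: int = 24, threshold: float = 3.5) -> None:
        self.batch_means: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def record_batch(self, scores: List[float]) -> bool:
        """Store this batch's mean; return True if the rolling average is too low."""
        if scores:
            self.batch_means.append(mean(scores))
        return bool(self.batch_means) and mean(self.batch_means) < self.threshold


def run_hourly_evaluation(
    interactions: List[Interaction],
    judge: Callable[[Interaction], float],  # hypothetical LLM-as-judge scorer
    monitor: RollingQualityMonitor,
    rate: float = 0.05,
    seed: Optional[int] = None,
) -> bool:
    """Sample, score, record; the return value says whether to alert."""
    sampled = sample_interactions(interactions, rate, seed)
    return monitor.record_batch([judge(item) for item in sampled])
```

Keeping the judge behind a plain callable makes it easy to swap evaluator models or to test the pipeline with a stub scorer.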
Hallucination Detection
Hallucination detection approaches:
Automated evaluation: FActScore (automatic factual precision scoring), the RAGAS faithfulness metric, and custom evaluators that check responses against reference documents.
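To illustrate the "custom evaluator using reference documents" idea, here is a deliberately crude sketch that flags response sentences whose content words mostly do not appear in the reference text. It is not FActScore or RAGAS, and lexical overlap will miss paraphrases and contradictions; the function names, the stopword list, and the 0.6 threshold are all assumptions made for the example.

```python
import re
from typing import List, Set

_WORD = re.compile(r"[a-z0-9]+")
_STOPWORDS = {
    "the", "a", "an", "is", "are", "was", "were", "of", "in", "on", "to",
    "and", "or", "it", "that", "this", "for", "with", "as", "by", "at",
}


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with common stopwords removed."""
    return {w for w in _WORD.findall(text.lower()) if w not in _STOPWORDS}


def unsupported_sentences(
    response: str, references: List[str], min_overlap: float = 0.6
) -> List[str]:
    """Return response sentences with too little lexical support in the references."""
    ref_vocab: Set[str] = set()
    for doc in references:
        ref_vocab |= content_words(doc)

    flagged = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & ref_vocab) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```

A check like this is best treated as a cheap first-pass filter that routes suspicious responses to a stronger evaluator or to human review.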
Semantic Drift Detection
Monitor embedding distributions of input queries over time. If today's query embeddings are far from the training distribution, the model may be operating outside its competence zone.
Use a reference dataset of "normal" queries: compute their embeddings and build a baseline distribution. For each new batch of queries, compute a distribution distance against the baseline (KL divergence, JS divergence, or a statistical test). Alert when the distance exceeds a threshold.
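A minimal, dependency-free sketch of this check is shown below. It makes one simplifying assumption worth calling out: rather than comparing full high-dimensional distributions, it reduces each embedding to its Euclidean distance from the reference centroid, histograms those distances, and compares the histograms with Jensen-Shannon divergence. All function names and the bin count are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of the vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized bin counts; values past the last edge fall into the final bin."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        idx = len(counts) - 1
        for i in range(len(counts)):
            if x < edges[i + 1]:
                idx = i
                break
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def drift_score(reference: Sequence[Vector], batch: Sequence[Vector], bins: int = 10) -> float:
    """JS divergence between distance-to-centroid histograms of reference and batch."""
    center = centroid(reference)
    ref_d = [math.dist(center, v) for v in reference]
    new_d = [math.dist(center, v) for v in batch]
    upper = max(ref_d) * 1.5 or 1.0  # leave headroom above the reference range
    edges = [upper * i / bins for i in range(bins + 1)]
    return js_divergence(histogram(ref_d, edges), histogram(new_d, edges))
```

An alert would then fire when `drift_score(reference, todays_batch)` exceeds a threshold tuned on historical batches. JS divergence is used here instead of KL because it is symmetric and stays finite when a bin is empty in one of the two histograms.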
Cost and Performance Monitoring
Token Usage Tracking
Track per request: input tokens, output tokens, total cost ($), and model name. Aggregate by user, feature, model, and time period. Build dashboards showing daily/weekly spend, cost per feature, cost per user, and cost optimization opportunities (for example, expensive queries that could use a cheaper model).
Alert on cost anomalies: if hourly spend exceeds 3× the moving average, raise an alert for a potential prompt injection attack or usage spike.
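The per-request record, the aggregation, and the 3× moving-average rule can be sketched like this. The model names and per-token prices in `PRICES` are placeholders rather than real provider rates, and the class and function names are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from statistics import mean
from typing import Deque, Dict, Iterable

# Hypothetical prices in USD per 1M tokens: (input, output). Replace with real rates.
PRICES = {"large-model": (10.00, 30.00), "small-model": (0.50, 1.50)}


@dataclass
class Usage:
    user: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        """Request cost in USD, computed from the price table above."""
        in_price, out_price = PRICES[self.model]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000


def aggregate_cost(records: Iterable[Usage], key: str) -> Dict[str, float]:
    """Total cost grouped by one attribute: 'user', 'feature', or 'model'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost
    return dict(totals)


class SpendAnomalyDetector:
    """Flags an hour whose spend exceeds `multiplier` times the moving average."""

    def __init__(self, window_hours: int = 24, multiplier: float = 3.0) -> None:
        self.history: Deque[float] = deque(maxlen=window_hours)
        self.multiplier = multiplier

    def check(self, hourly_spend: float) -> bool:
        # Compare against prior hours only, then add this hour to the history.
        anomalous = bool(self.history) and hourly_spend > self.multiplier * mean(self.history)
        self.history.append(hourly_spend)
        return anomalous
```

Comparing against prior hours only (before appending the current one) keeps a large spike from inflating its own baseline.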
Latency Monitoring
Track p50, p95, and p99 latency separately. For LLMs, p99 latency is often 5-10× p50 because output length varies. Set the SLA at p95 and alert when it is violated for 10+ minutes.
Time-to-first-token (TTFT): critical for streaming UX. Track it separately from total latency. To optimize it, use speculative decoding, tune server-side batching, and consider CDN caching for common queries.
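As a rough sketch, the percentile tracking and the "violated for 10+ minutes" rule might look like the following. It uses the nearest-rank percentile definition and assumes one p95 value is computed per minute; the function names are illustrative. TTFT samples can be passed through the same functions as a separate series.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    """p50, p95, and p99 for one series (total latency or TTFT)."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


def sla_violated(
    p95_per_minute: Sequence[float], sla_seconds: float, sustained_minutes: int = 10
) -> bool:
    """True only if every one of the last `sustained_minutes` p95 values exceeds the SLA."""
    recent = p95_per_minute[-sustained_minutes:]
    return len(recent) == sustained_minutes and all(v > sla_seconds for v in recent)
```

Requiring every minute in the window to breach the SLA avoids paging on a single slow minute, at the cost of reacting up to ten minutes late.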
Production Alerting
Critical Alerts (PagerDuty)
LLM API error rate > 5% for 5 minutes. All LLM requests failing (100% error rate). Model serving endpoint down. Latency p95 > 5× baseline for 10 minutes.
Warning Alerts (Slack)
Quality score drops 10%+ from baseline. Token costs 3× above daily average. Hallucination rate increases. Anomalous input distribution detected.
Informational Dashboard
Daily quality metrics report. Weekly cost trends. Monthly model performance comparison. Bias metrics by user demographic.
Observability-Driven Improvement
Observability creates a feedback loop for continuous improvement: monitor → identify issues → analyze root causes → fix (better prompt, fine-tuning, different model) → deploy → monitor. This loop, run consistently, compounds to dramatically improve AI system quality over time.
Target observability stack: LangSmith for tracing + Grafana for dashboards + PagerDuty for alerting + weekly human evaluation of sampled interactions.
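To make the alert tiers from Production Alerting concrete, here is a small sketch of two of the rules: the critical "error rate > 5% for 5 minutes" check and the warning "quality score drops 10%+ from baseline" check. It only decides severity and destination; actually notifying PagerDuty or Slack is left out, and the names `MinuteStats`, `error_rate_alert`, `quality_alert`, and `ROUTES` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MinuteStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def error_rate_alert(
    window: Sequence[MinuteStats], threshold: float = 0.05, minutes: int = 5
) -> Optional[str]:
    """'critical' if the error rate exceeded the threshold in each of the last N minutes."""
    recent = window[-minutes:]
    if len(recent) == minutes and all(m.error_rate > threshold for m in recent):
        return "critical"
    return None


def quality_alert(baseline: float, current: float, max_drop: float = 0.10) -> Optional[str]:
    """'warning' if the quality score fell by at least `max_drop` relative to baseline."""
    if baseline > 0 and (baseline - current) / baseline >= max_drop:
        return "warning"
    return None


# Destination per severity, mirroring the tiers in the text (labels only, no API calls).
ROUTES = {"critical": "pagerduty", "warning": "slack"}


def route(severity: Optional[str]) -> Optional[str]:
    """Map a severity to its destination, or None when there is nothing to send."""
    return ROUTES.get(severity) if severity else None
```

Separating rule evaluation from delivery keeps the thresholds easy to unit test and lets the same rules feed a Grafana dashboard as well as a pager.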