AI Observability: Tracing and Monitoring LLM Applications

Debug, optimize, and monitor production AI systems


Why Observability for AI?

AI applications are harder to debug than traditional software because:

Non-deterministic outputs

Complex multi-step chains

Hard to reproduce issues

Quality degradation is silent: responses can get worse without raising any error

Key Metrics to Track

Latency: Time to first token, total response time

Cost: Tokens used per request, total spend

Quality: Relevance scores, user feedback

Errors: Rate, types, common patterns

Throughput: Requests per second
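
As a rough illustration of the latency metrics above, time to first token and total response time can be measured by wrapping any streaming response. This is a minimal sketch: `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a real model call.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of text chunks and report latency metrics.

    `chunks` stands in for any streaming LLM response. `clock` is
    injectable so the function can be tested deterministically.
    """
    start = clock()
    first_token_at = None
    pieces = []
    for chunk in chunks:
        if first_token_at is None:
            # Timestamp of the first chunk to arrive
            first_token_at = clock()
        pieces.append(chunk)
    end = clock()
    return {
        "time_to_first_token_s": (
            None if first_token_at is None else first_token_at - start
        ),
        "total_time_s": end - start,
        "chunk_count": len(pieces),
        "text": "".join(pieces),
    }


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for piece in ["Retrieval", "-augmented", " generation"]:
        time.sleep(0.01)
        yield piece


stats = measure_stream(fake_stream())
print(stats["time_to_first_token_s"], stats["total_time_s"])
```

Passing the clock as a parameter keeps the measurement logic separate from the wall clock, which makes it easy to check with fixed timestamps.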

LangSmith Integration

python
import os

from langchain.callbacks import LangChainTracer
from langchain_openai import ChatOpenAI

# Enable LangSmith tracing (in production, set these in the environment, not in code)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"

# All LangChain calls are automatically traced once tracing is enabled;
# the tracer is also attached explicitly here
tracer = LangChainTracer()
llm = ChatOpenAI(callbacks=[tracer])
response = llm.invoke("What is RAG?")

Langfuse for Custom Tracing

python
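from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe

langfuse = Langfuse(public_key="...", secret_key="...", host="...")


@observe()
def process_query(query: str) -> str:
    # Your LLM call here (call_llm is a placeholder for your own function)
    response = call_llm(query)

    # Add custom scores to the current trace.
    # Assumes the v2 Python SDK, where the decorator context exposes the trace ID.
    langfuse.score(
        trace_id=langfuse_context.get_current_trace_id(),
        name="relevance",
        value=evaluate_relevance(query, response),  # placeholder scorer
    )
    return response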
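
The `evaluate_relevance` helper in the example above is left undefined. One possible stand-in, purely as an assumption for illustration, is a crude lexical-overlap score: the fraction of distinct query words that also appear in the response. It is not a semantic measure and is only meant to show the shape of a scorer that returns a number suitable for a trace score.

```python
import re


def evaluate_relevance(query: str, response: str) -> float:
    """Fraction of distinct query words found in the response (0.0 to 1.0).

    Hypothetical scorer: a crude lexical proxy, not a semantic measure.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    response_words = set(re.findall(r"\w+", response.lower()))
    if not query_words:
        return 0.0
    return len(query_words & response_words) / len(query_words)


score = evaluate_relevance("What is RAG?", "RAG is retrieval-augmented generation")
print(round(score, 2))  # 2 of the 3 query words appear in the response -> 0.67
```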

Custom Metrics Dashboard

Track key metrics over time:

python
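from datetime import datetime, timezone

# Placeholder prices in USD per 1,000 tokens; replace with your provider's current rates
MODEL_COSTS = {
    'example-model': {'input': 0.001, 'output': 0.002},
}


class LLMMetrics:
    def __init__(self, metrics_db):
        # metrics_db: any client exposing record(dict), e.g. a time-series DB wrapper
        self.metrics_db = metrics_db

    def record_call(self, model, prompt_tokens, completion_tokens, latency_ms):
        # Calculate cost (prices are per 1,000 tokens, hence the division)
        prices = MODEL_COSTS[model]
        cost = (prompt_tokens * prices['input'] +
                completion_tokens * prices['output']) / 1000

        # Store in time-series DB
        self.metrics_db.record({
            'timestamp': datetime.now(timezone.utc),
            'model': model,
            'latency_ms': latency_ms,
            'cost_usd': cost,
            'tokens': prompt_tokens + completion_tokens
        })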
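
To turn recorded rows into dashboard numbers, the store needs a few aggregations. The sketch below uses a hypothetical `InMemoryMetricsDB` as a stand-in for a real time-series database. It accepts rows with the same fields as above and computes a nearest-rank latency percentile and total cost per day. The class and method names are assumptions, not part of any library.

```python
import math
from collections import defaultdict
from datetime import datetime


class InMemoryMetricsDB:
    """Hypothetical stand-in for a time-series database."""

    def __init__(self):
        self.rows = []

    def record(self, row: dict) -> None:
        self.rows.append(row)

    def latency_percentile(self, pct: float) -> float:
        """Nearest-rank percentile of latency_ms over all recorded rows."""
        values = sorted(r["latency_ms"] for r in self.rows)
        if not values:
            raise ValueError("no rows recorded")
        rank = max(1, math.ceil(pct / 100 * len(values)))
        return values[rank - 1]

    def cost_per_day(self) -> dict:
        """Total cost_usd grouped by calendar date of the timestamp."""
        totals = defaultdict(float)
        for r in self.rows:
            totals[r["timestamp"].date()] += r["cost_usd"]
        return dict(totals)


db = InMemoryMetricsDB()
for i in range(1, 11):
    db.record({
        "timestamp": datetime(2024, 1, 1 if i <= 5 else 2, 12, 0),
        "model": "example-model",
        "latency_ms": i * 100,
        "cost_usd": 0.01,
        "tokens": 500,
    })

print(db.latency_percentile(50))  # 500
print(db.latency_percentile(95))  # 1000
```

Percentiles are usually more informative than averages for latency, because a small number of slow requests can hide behind a healthy mean.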

Alerting

Set up alerts for:

Latency > 5 seconds

Cost per day exceeds budget

Error rate > 1%

Quality scores drop below threshold
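
The alert conditions above can be sketched as a single check over aggregated metrics. This is a minimal illustration: `AlertThresholds` and `check_alerts` are hypothetical names, the latency and error-rate limits come from the list above, and the budget and quality threshold are placeholder values to replace with your own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlertThresholds:
    max_latency_s: float = 5.0       # "Latency > 5 seconds"
    max_error_rate: float = 0.01     # "Error rate > 1%"
    daily_budget_usd: float = 50.0   # assumed placeholder budget
    min_quality_score: float = 0.7   # assumed placeholder threshold


def check_alerts(
    latency_s: float,
    daily_cost_usd: float,
    error_count: int,
    request_count: int,
    avg_quality: float,
    thresholds: Optional[AlertThresholds] = None,
) -> List[str]:
    """Return the names of all alert conditions that are currently breached."""
    t = thresholds or AlertThresholds()
    # Avoid division by zero when no requests were seen in the window
    error_rate = error_count / request_count if request_count else 0.0

    alerts = []
    if latency_s > t.max_latency_s:
        alerts.append("latency")
    if daily_cost_usd > t.daily_budget_usd:
        alerts.append("daily_cost")
    if error_rate > t.max_error_rate:
        alerts.append("error_rate")
    if avg_quality < t.min_quality_score:
        alerts.append("quality")
    return alerts


print(check_alerts(latency_s=6.2, daily_cost_usd=12.0, error_count=3,
                   request_count=100, avg_quality=0.82))
# ['latency', 'error_rate']
```

Which latency statistic to pass in (for example a p95 over a recent window) is left to the caller; the source list does not specify one.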
