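Evaluation, Testing, and Observability

Evaluation and observability for LLM applications: benchmarking, RAG evaluation, tracing and monitoring, and guardrails, for building a measurable quality feedback loop.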

This topic contains 76 tutorials

Intermediate

LangSmith vs Langfuse: How to Choose an LLM Observability Tool (2026)

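One is closed-source and polished, the other is open-source and self-hostable; the choice comes down to how much you care about cross-border data transfer and cost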

Advanced

AI Security: Prompt Injection, Jailbreaking, and LLM Guardrails 2026

Protect your AI applications from attacks: prompt injection, data exfiltration, and model abuse

Intermediate

OpenAI o3 vs Claude 3.5 Sonnet vs Gemini 2.0 Pro: 2026 Benchmark Comparison

Which frontier LLM wins on coding, reasoning, and math in 2026?

Advanced

AI-Powered Observability: Building Self-Aware Production Systems

Using machine learning to transform metrics, logs, and traces into actionable intelligence

Intermediate

AI-Powered DevOps: Automating CI/CD Pipelines for Faster, Safer Deployments

How machine learning is transforming continuous integration and deployment workflows

Intermediate

AI Application Testing: Evaluation Frameworks and Best Practices

Systematically test and evaluate AI-powered applications

Advanced

Data Pipeline Observability

Monitoring and alerting for ML data pipeline health

Intermediate

AI Observability: Tracing and Monitoring LLM Applications

Debug, optimize, and monitor production AI systems

Intermediate

Braintrust Evaluation: Complete Setup Guide

Fast LLM evaluation and experiment tracking

Advanced

Model Drift Detection

Detecting and alerting on data and model drift in production

Advanced

AI Evaluation Frameworks: How to Measure What Actually Matters

Building evaluation systems that catch real-world AI failures before they reach users

Intermediate

LangSmith vs Helicone vs Langfuse: Side-by-Side Comparison

LLM observability platform comparison: monitoring across LangSmith, Helicone, and Langfuse

Intermediate

Continuous Evaluation Pipeline: Complete Guide

Automating model quality checks in CI/CD pipelines — practical implementation

Intermediate

Evidently AI Monitoring: Complete Setup Guide

Open-source ML monitoring and data drift detection

Intermediate

RAGAS: RAG Evaluation Framework: Complete Guide

Evaluating Retrieval-Augmented Generation quality with RAGAS — practical implementation

Intermediate

Evaluating LLM Agents: Complete Guide

Metrics and frameworks for measuring AI agent performance — practical implementation

Intermediate

Production Monitoring Metrics: Complete Guide

Key metrics to track LLM quality in production — practical implementation

Intermediate

LLM Evaluation with RAGAS: Complete Developer Guide 2026

Master LLM evaluation with RAGAS through practical examples and production patterns

Intermediate

WhyLabs AI Observatory: Complete Setup Guide

Real-time data and AI monitoring with WhyLabs

Advanced

Continuous Monitoring Agent: Complete Tutorial

Agent that continuously monitors and alerts on conditions

Intermediate

Faithfulness vs Relevance: Complete Guide

Measuring factual accuracy versus helpfulness in RAG systems — practical implementation

Advanced

Distributed AI Tracing

End-to-end tracing across AI service boundaries

Intermediate

LangSmith LLM Observability: Complete Setup Guide

Debugging and monitoring LLM chains with LangSmith

Intermediate

LLM Judge Pattern: Complete Guide

Using GPT-4 as an automated judge for model evaluation — practical implementation

Intermediate

Regression Testing for LLMs: Complete Guide

Preventing quality regressions during model updates — practical implementation

Advanced

Fine-tuning Evaluation: Hands-On Tutorial

Evaluating fine-tuned models with domain benchmarks — step-by-step implementation guide

Intermediate

Cost-Quality Tradeoff Analysis: Complete Guide

Optimizing the cost vs quality tradeoff in LLM deployments — practical implementation

Intermediate

Building LLM Test Suites: Complete Guide

Creating comprehensive test suites for LLM applications — practical implementation

Intermediate

Prometheus + Grafana for AI Applications: Monitoring AI services Guide 2026

Set up comprehensive monitoring for LLM API costs, latency, and error rates

Beginner

Helicone Complete Tutorial 2026: How to log, monitor, and analyze LLM API calls

Step-by-step guide to using Helicone for AI-powered observability workflows

Advanced

AI Anomaly Detection for Time Series: From Statistical to Deep Learning Approaches

Isolation Forest, LSTM Autoencoders, and production anomaly detection systems

Advanced

AI Observability: Comprehensive Monitoring for Production LLM Applications

Langfuse, Helicone, and custom observability stacks for LLM debugging and optimization

Intermediate

Human Evaluation Best Practices: Complete Guide

Designing human evaluation studies for LLM outputs — practical implementation

Intermediate

LangSmith for LLM Evaluation: Building Systematic Feedback Loops

Trace collection, evaluation datasets, A/B testing, and regression detection

Advanced

Testing LLM Applications: Strategies, Tools, and Best Practices 2025

DeepEval, golden datasets, regression testing, and production monitoring

Advanced

AI Output Validation and Guardrails: Building Reliable LLM Pipelines

Pydantic validators, Guardrails AI, and content safety for production systems

Advanced

ML Model Monitoring Dashboard

Building real-time model performance dashboards

Intermediate

LLM Evaluation Fundamentals: Complete Guide

Core metrics and methodologies for evaluating language model quality — practical implementation

Intermediate

Embedding Quality Metrics: Complete Guide

Evaluating embedding models with MTEB and custom benchmarks — practical implementation

Intermediate

RAG Evaluation with RAGAS: Advanced RAG Tutorial

Systematic evaluation of RAG pipeline quality

Intermediate

The LLM Evaluation Trap

Common mistakes in evaluating LLM quality and how to avoid them

Advanced

Prometheus ML Metrics

Instrumenting ML services with Prometheus metrics

Intermediate

AI Benchmark Deep Dive: Complete Guide

Understanding MMLU, HumanEval, GSM8K and other key benchmarks — practical implementation

Intermediate

A/B Testing LLM Outputs: Complete Guide

Statistical comparison of LLM variants in production — practical implementation

Intermediate

Phoenix AI Observability: Complete Setup Guide

Arize Phoenix for LLM tracing and evaluation

Intermediate

Automated Red Team Testing: Complete Guide

Using LLMs to automate safety and quality red-teaming — practical implementation

Beginner

AI Compensation Benchmarking: How HR Teams Are Getting Salary Data Right

Using AI to analyze market data, identify pay inequities, and make competitive compensation decisions

Beginner

AI Compliance Monitoring: How Banks Are Using ML to Stay Ahead of Regulators

Real-world implementations of AI for AML, KYC, and regulatory reporting

Intermediate

AI-Powered Remote Patient Monitoring for Chronic Disease Management

Deploying RPM programs for diabetes, heart failure, COPD, and hypertension

Advanced

Kubernetes Security Hardening: Complete CIS Benchmark & Runtime Guide 2025

Secure K8s clusters end-to-end from API server hardening to workload runtime protection

Intermediate

Contextual Retrieval: Advanced RAG Tutorial

Anthropic contextual retrieval for improved chunk context

Advanced

Testing and Evaluating LLM Applications: Beyond "It Seems to Work"

Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions

Intermediate

AI Observability: Monitoring LLMs and ML Models in Production in 2025

Track quality, cost, drift, and failures for AI systems with LLMOps observability platforms

Intermediate

Building AI-Powered Search with Semantic Retrieval

Replace keyword search with intelligent semantic understanding

Advanced

AI for Legal and Compliance Teams: Contract Review to Regulatory Monitoring

How legal and compliance professionals use AI to handle 10x the work with the same team

Intermediate

Parent Document Retrieval: Advanced RAG Tutorial

Hierarchical chunking with parent-child document strategy

Beginner

LLM Benchmarks Cheat Sheet

MMLU, HumanEval, MATH benchmark scores for major models

Intermediate

Multi-Query Retrieval: Advanced RAG Tutorial

Generating multiple queries for comprehensive RAG retrieval

Intermediate

RAG (Retrieval Augmented Generation): Complete Developer Guide 2026

Master RAG (Retrieval Augmented Generation) with practical examples and production patterns

Intermediate

RAGAS Evaluation: Developer Guide and Quick Start 2026

Learn RAGAS Evaluation: evaluate RAG systems quantitatively

Intermediate

LangSmith Tracing: Developer Guide and Quick Start 2026

Learn LangSmith Tracing: debug and trace LangChain applications

Intermediate

SPLADE Sparse Retrieval

Sparse neural retrieval with SPLADE for efficient RAG

Intermediate

MCP Logging and Observability: Complete Guide

Monitoring MCP server health and performance

Advanced

AI Observability Stack: Production Setup Guide

Full observability for AI systems with OpenTelemetry

Intermediate

Retrieval-Augmented Prompting: Complete Guide and Examples

Master retrieval-augmented prompting (injecting retrieved context into prompts), best suited to RAG systems

Advanced

AI System Evaluation Frameworks: Assessing RAG System Quality with RAGAS, DeepEval, and HELM

Build a systematic AI quality evaluation practice to continuously monitor and improve the answer quality of RAG applications

Intermediate

LLM Observability: Production Patterns

Tracing and monitoring LLM calls with OpenTelemetry

Advanced

AI Observability Stack: Production AI Architecture Guide 2026

How to implement complete monitoring for production AI systems

Beginner

LangSmith vs Langfuse: Which is Better for LLM observability? (2026)

Detailed comparison of LangSmith and Langfuse for LLM observability

Advanced

Synthetic Data Generation for AI: Techniques, Tools, and Quality Evaluation

GANs, diffusion models, LLM-based generation, and validation methods for synthetic datasets

Intermediate

AI Compliance Monitoring System

Automated regulatory compliance checking with LLMs

Advanced

Advanced RAG: Moving Beyond Naive Retrieval to Production-Grade Systems

Corrective RAG, Self-RAG, adaptive retrieval, and evaluation with RAGAS

Intermediate

AI in Precision Agriculture: Crop Monitoring, Yield Prediction, and Smart Irrigation

Drone imagery analysis, soil sensors, and ML models for sustainable farming

Intermediate

DeepEval Framework: Developer Guide and Quick Start 2026

Learn DeepEval Framework: unit testing for LLM applications

Intermediate

LLM Output Guardrails

Implementing input/output guardrails for production AI applications

Intermediate

AI Safety Evaluation Suite

Benchmarks for evaluating safety and alignment of AI systems