Human Evaluation Best Practices: Complete Guide

Designing human evaluation studies for LLM outputs — practical implementation

进阶约 12 分钟

Human Evaluation Best Practices: Complete Guide

Designing human evaluation studies for LLM outputs — practical implementation

Human Evaluation Best Practices Overview Designing human evaluation studies for LLM outputs. Rigorous evaluation is essential for building trustworthy AI applications. Why Evaluation Matters Without proper evaluation, you cannot: - Know if your m

evaluationllmbenchmarksqualitytesting

Human Evaluation Best Practices

Overview

Designing human evaluation studies for LLM outputs. Rigorous evaluation is essential for building trustworthy AI applications.

Why Evaluation Matters

Without proper evaluation, you cannot:

Know if your model is actually good

Catch regressions before users do

Make data-driven decisions about model improvements

Meet regulatory and compliance requirements

Evaluation Framework

python
from dataclasses import dataclass, field
from typing import Callable, Optional
import statistics
@dataclass
class EvalExample:
    """Single evaluation example."""
    id: str
    input: str
    expected_output: str
    metadata: dict = field(default_factory=dict)
@dataclass
class EvalResult:
    """Result for a single evaluation."""
    example_id: str
    model_output: str
    score: float
    metrics: dict = field(default_factory=dict)
    passed: bool = True
    notes: str = ""class Evaluator:
    """Human Evaluation Best Practices evaluator."""
    
    def __init__(self, model_fn: Callable, metrics: list[Callable]):
        self.model_fn = model_fn
        self.metrics = metrics
    
    def evaluate_single(self, example: EvalExample) -> EvalResult:
        """Evaluate one example."""
        output = self.model_fn(example.input)
        
        scores = {}
        for metric in self.metrics:
            score = metric(output, example.expected_output)
            scores[metric.__name__] = score
        
        overall_score = statistics.mean(scores.values()) if scores else 0.0
        
        return EvalResult(
            example_id=example.id,
            model_output=output,
            score=overall_score,
            metrics=scores,
            passed=overall_score >= 0.7
        )
    
    def evaluate_dataset(self, examples: list[EvalExample]) -> dict:
        """Evaluate a full dataset."""
        results = [self.evaluate_single(ex) for ex in examples]
        
        scores = [r.score for r in results]
        passed = [r for r in results if r.passed]
        failed = [r for r in results if not r.passed]
        
        return {
            "total": len(results),
            "passed": len(passed),
            "failed": len(failed),
            "pass_rate": len(passed) / len(results),
            "avg_score": statistics.mean(scores),
            "min_score": min(scores),
            "max_score": max(scores),
            "p50": statistics.median(scores),
            "results": results
        }

Key Metrics Implementation

python
from openai import OpenAI
import re
client = OpenAI()
def rouge_l_score(prediction: str, reference: str) -> float:
    """Compute ROUGE-L similarity."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    
    if not ref_tokens:
        return 0.0
    
    intersection = pred_tokens & ref_tokens
    precision = len(intersection) / max(len(pred_tokens), 1)
    recall = len(intersection) / len(ref_tokens)
    
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
def llm_judge_score(prediction: str, reference: str, criteria: str = "quality") -> float:
    """Use GPT-4 as an automated judge."""
    prompt = f"""Rate the following AI response on a scale of 1-10.
Criteria: {criteria}
Reference answer: {reference}
AI response: {prediction}
Rating (just the number 1-10):"""
    
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5
    )
    
    try:
        score = float(re.findall(r'\d+', resp.choices[0].message.content)[0])
        return min(max(score / 10, 0), 1.0)
    except:
        return 0.5def exact_match(prediction: str, reference: str) -> float:
    """Exact string match score."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

Running Evaluations

python
Create test dataset
test_examples = [
    EvalExample(
        id="test_001",
        input="What is the capital of France?",
        expected_output="Paris",
        metadata={"category": "geography", "difficulty": "easy"}
    ),
    EvalExample(
        id="test_002", 
        input="Explain how neural networks learn",
        expected_output="Neural networks learn through backpropagation...",
        metadata={"category": "ml_concepts", "difficulty": "medium"}
    ),
]
Define model function
def my_model(input_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input_text}],
        temperature=0,
        max_tokens=500
    )
    return resp.choices[0].message.content
Create evaluator with metrics
evaluator = Evaluator(
    model_fn=my_model,
    metrics=[rouge_l_score, exact_match]
)
Run evaluation
results = evaluator.evaluate_dataset(test_examples)
print(f"Pass rate: {results['pass_rate']:.1%}")
print(f"Average score: {results['avg_score']:.3f}")
print(f"Failed: {results['failed']} / {results['total']}")

Continuous Evaluation in CI/CD

yaml
.github/workflows/eval.yml
name: LLM Evaluation
on: [push, pull_request]jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evaluations
        run: python -m eval.run_suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PASS_THRESHOLD: "0.8"
      - name: Post results
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval_results.json')
            github.rest.issues.createComment({...})

Best Practices

Separate test sets — never evaluate on training data

Track over time — catch regressions early

Human spot-check — automated metrics miss subtleties

Diverse examples — cover edge cases and failure modes

Version test suites — track test suite changes separately

Resources

RAGAS documentation: https://docs.ragas.io

Eleuther AI lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness

HELM benchmark: https://crfm.stanford.edu/helm/

Getting Started

Learn how to get started with this application.

Learn more

Installation Guide

Human Evaluation Best Practices: Complete Guide

Human Evaluation Best Practices

Overview

Why Evaluation Matters

Evaluation Framework

Key Metrics Implementation

Running Evaluations

Create test dataset

Define model function

Create evaluator with metrics

Run evaluation

Continuous Evaluation in CI/CD

.github/workflows/eval.yml

Best Practices

Resources

Documentation

Getting Started

Learn more