Human Evaluation Best Practices: Complete Guide

Designing human evaluation studies for LLM outputs — practical implementation

Overview

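This guide covers the practical side of designing evaluation studies for LLM outputs: a small evaluation framework, a few common automated metrics, and the points where human review is still needed. Rigorous evaluation is essential for building trustworthy AI applications.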

Why Evaluation Matters

Without proper evaluation, you cannot:

  • Know whether your model meets the quality bar your use case needs
  • Catch regressions before users do
  • Make data-driven decisions about model improvements
  • Meet regulatory and compliance requirements

Evaluation Framework

    python
    from dataclasses import dataclass, field
    from typing import Callable
    import statistics


    @dataclass
    class EvalExample:
        """Single evaluation example."""
        id: str
        input: str
        expected_output: str
        metadata: dict = field(default_factory=dict)


    @dataclass
    class EvalResult:
        """Result for a single evaluation."""
        example_id: str
        model_output: str
        score: float
        metrics: dict = field(default_factory=dict)
        passed: bool = True
        notes: str = ""


    class Evaluator:
        """Runs a model over examples and scores each output with the given metrics."""

        def __init__(self, model_fn: Callable, metrics: list[Callable]):
            self.model_fn = model_fn
            self.metrics = metrics

        def evaluate_single(self, example: EvalExample) -> EvalResult:
            """Evaluate one example."""
            output = self.model_fn(example.input)
            # Each metric takes (prediction, reference) and returns a float in [0, 1].
            scores = {}
            for metric in self.metrics:
                scores[metric.__name__] = metric(output, example.expected_output)
            # The overall score is the unweighted mean of all metric scores.
            overall_score = statistics.mean(scores.values()) if scores else 0.0
            return EvalResult(
                example_id=example.id,
                model_output=output,
                score=overall_score,
                metrics=scores,
                passed=overall_score >= 0.7,  # hard-coded pass threshold
            )

        def evaluate_dataset(self, examples: list[EvalExample]) -> dict:
            """Evaluate a full dataset."""
            if not examples:
                # Avoid a ZeroDivisionError / StatisticsError on an empty dataset.
                raise ValueError("examples must not be empty")
            results = [self.evaluate_single(ex) for ex in examples]
            scores = [r.score for r in results]
            passed = [r for r in results if r.passed]
            failed = [r for r in results if not r.passed]
            return {
                "total": len(results),
                "passed": len(passed),
                "failed": len(failed),
                "pass_rate": len(passed) / len(results),
                "avg_score": statistics.mean(scores),
                "min_score": min(scores),
                "max_score": max(scores),
                "p50": statistics.median(scores),
                "results": results,
            }
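
The evaluator above averages all metric scores with equal weight and compares the result to a fixed 0.7 threshold. If some metrics matter more than others, a weighted aggregate may be a better fit. The following is a minimal sketch, not part of the original framework: the function name `aggregate_scores`, the `weights` mapping, and the example weights are all hypothetical.

```python
from typing import Optional


def aggregate_scores(
    scores: dict[str, float],
    weights: Optional[dict[str, float]] = None,
    threshold: float = 0.7,
) -> tuple[float, bool]:
    """Combine per-metric scores into (overall_score, passed).

    Hypothetical helper: metrics missing from `weights` default to weight 1.0,
    so passing no weights reproduces a plain unweighted mean.
    """
    if not scores:
        return 0.0, False
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    overall = sum(
        value * weights.get(name, 1.0) for name, value in scores.items()
    ) / total_weight
    return overall, overall >= threshold


# Illustrative scores only; these are not measured results.
example_scores = {"rouge_l_score": 0.9, "exact_match": 0.0}

unweighted = aggregate_scores(example_scores)
weighted = aggregate_scores(
    example_scores, weights={"rouge_l_score": 4.0, "exact_match": 1.0}
)
print(unweighted, weighted)
```

With equal weights, a long free-text answer that scores well on overlap but fails exact match is pulled below the threshold. Down-weighting exact match is one way to keep it as a signal without letting it dominate.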

Key Metrics Implementation

    python
    from openai import OpenAI
    import re

    client = OpenAI()


    def rouge_l_score(prediction: str, reference: str) -> float:
        """Compute ROUGE-L F1 from the longest common subsequence (LCS) of tokens."""
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        if not pred_tokens or not ref_tokens:
            return 0.0
        # Dynamic-programming LCS length, keeping only the previous row.
        prev = [0] * (len(ref_tokens) + 1)
        for p in pred_tokens:
            curr = [0]
            for j, r in enumerate(ref_tokens, start=1):
                curr.append(prev[j - 1] + 1 if p == r else max(prev[j], curr[j - 1]))
            prev = curr
        lcs = prev[-1]
        if lcs == 0:
            return 0.0
        precision = lcs / len(pred_tokens)
        recall = lcs / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)


    def llm_judge_score(prediction: str, reference: str, criteria: str = "quality") -> float:
        """Use an LLM (here gpt-4o-mini) as an automated judge."""
        prompt = f"""Rate the following AI response on a scale of 1-10.

    Criteria: {criteria}

    Reference answer: {reference}

    AI response: {prediction}

    Rating (just the number 1-10):"""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5,
        )
        try:
            # Take the first integer in the reply and scale it to [0, 1].
            score = float(re.findall(r"\d+", resp.choices[0].message.content)[0])
            return min(max(score / 10, 0), 1.0)
        except (IndexError, TypeError, ValueError):
            # No parseable number (or an empty reply): fall back to a neutral score.
            return 0.5


    def exact_match(prediction: str, reference: str) -> float:
        """Exact string match score (case-insensitive, surrounding whitespace ignored)."""
        return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
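
The judge function above pulls the first integer out of the model's reply and falls back to 0.5 when nothing parses, which makes a parse failure indistinguishable from a genuine mid-range rating. One possible alternative is to isolate the parsing step and return `None` on failure so the caller can count or retry those cases. This is a sketch under that assumption; `parse_rating` is a hypothetical helper, not part of the original code or of any library.

```python
import re
from typing import Optional

# Matches an integer or a decimal such as "7" or "7.5".
_RATING_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def parse_rating(
    text: Optional[str], low: float = 1.0, high: float = 10.0
) -> Optional[float]:
    """Extract the first number from a judge reply and scale it to [0, 1].

    Returns None when the reply is empty, contains no number, or the number
    falls outside the [low, high] rating scale.
    """
    if not text:
        return None
    match = _RATING_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group())
    if not (low <= value <= high):
        return None
    return value / high


# Hypothetical judge replies, for illustration only.
for reply in ["8", "Rating: 7.5", "10/10", "n/a", "42"]:
    print(repr(reply), "->", parse_rating(reply))
```

Keeping the parser separate from the API call also makes it testable without network access.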

Running Evaluations

    python
    # Create test dataset
    test_examples = [
        EvalExample(
            id="test_001",
            input="What is the capital of France?",
            expected_output="Paris",
            metadata={"category": "geography", "difficulty": "easy"},
        ),
        EvalExample(
            id="test_002",
            input="Explain how neural networks learn",
            expected_output="Neural networks learn through backpropagation...",
            metadata={"category": "ml_concepts", "difficulty": "medium"},
        ),
    ]


    # Define model function
    def my_model(input_text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": input_text}],
            temperature=0,
            max_tokens=500,
        )
        # content can be None; normalize to an empty string for the metrics.
        return resp.choices[0].message.content or ""


    # Create evaluator with metrics
    evaluator = Evaluator(
        model_fn=my_model,
        metrics=[rouge_l_score, exact_match],
    )

    # Run evaluation
    results = evaluator.evaluate_dataset(test_examples)
    print(f"Pass rate: {results['pass_rate']:.1%}")
    print(f"Average score: {results['avg_score']:.3f}")
    print(f"Failed: {results['failed']} / {results['total']}")
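
The test examples carry `category` and `difficulty` metadata, but the summary only reports dataset-wide numbers. A per-category breakdown can show where failures cluster. The sketch below is one way to do that; `summarize_by_category` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with the same attribute names as `EvalExample` and `EvalResult` so the block runs on its own without API calls.

```python
from collections import defaultdict
from statistics import mean
from types import SimpleNamespace


def summarize_by_category(examples, results, key="category"):
    """Group per-example results by a metadata field and summarize each group."""
    # Map example id -> metadata value; fall back to a catch-all bucket.
    group_of = {ex.id: ex.metadata.get(key, "uncategorized") for ex in examples}
    grouped = defaultdict(list)
    for result in results:
        grouped[group_of.get(result.example_id, "uncategorized")].append(result)
    return {
        group: {
            "n": len(items),
            "pass_rate": sum(r.passed for r in items) / len(items),
            "avg_score": mean(r.score for r in items),
        }
        for group, items in grouped.items()
    }


# Stand-in data with made-up scores, for illustration only.
demo_examples = [
    SimpleNamespace(id="test_001", metadata={"category": "geography"}),
    SimpleNamespace(id="test_002", metadata={"category": "ml_concepts"}),
    SimpleNamespace(id="test_003", metadata={"category": "ml_concepts"}),
]
demo_results = [
    SimpleNamespace(example_id="test_001", score=1.0, passed=True),
    SimpleNamespace(example_id="test_002", score=0.5, passed=False),
    SimpleNamespace(example_id="test_003", score=0.75, passed=True),
]

for group, stats in summarize_by_category(demo_examples, demo_results).items():
    print(group, stats)
```

With the real framework, the same function could be called as `summarize_by_category(test_examples, results["results"])`, or with `key="difficulty"` to slice by difficulty instead.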

Continuous Evaluation in CI/CD

    yaml
    # .github/workflows/eval.yml
    name: LLM Evaluation

    on: [push, pull_request]

    jobs:
      evaluate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run evaluations
            run: python -m eval.run_suite
            env:
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
              PASS_THRESHOLD: "0.8"
          - name: Post results
            uses: actions/github-script@v7
            with:
              script: |
                const results = require('./eval_results.json')
                github.rest.issues.createComment({...})
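
The workflow runs `python -m eval.run_suite`, passes a `PASS_THRESHOLD` environment variable, and later reads `eval_results.json`, but the guide does not show that module. The sketch below is one guess at its final step: write a JSON summary and return a non-zero exit code when the pass rate falls below the threshold, which is what makes the CI job fail. The function name, the JSON fields, and the default of 0.8 are assumptions.

```python
import json
import os
from pathlib import Path


def write_results_and_gate(
    summary: dict,
    out_path: str = "eval_results.json",
    default_threshold: float = 0.8,
) -> int:
    """Write a JSON summary and return a process exit code (0 = pass, 1 = fail).

    `summary` is expected to look like the dict returned by evaluate_dataset.
    Only scalar fields are written, because the per-example result objects
    under "results" are not JSON-serializable as-is.
    """
    # PASS_THRESHOLD arrives as a string from the CI environment.
    threshold = float(os.environ.get("PASS_THRESHOLD", default_threshold))
    payload = {
        "total": summary["total"],
        "failed": summary["failed"],
        "pass_rate": summary["pass_rate"],
        "avg_score": summary["avg_score"],
        "threshold": threshold,
    }
    Path(out_path).write_text(json.dumps(payload, indent=2))
    return 0 if summary["pass_rate"] >= threshold else 1


# In a hypothetical eval/run_suite.py, the last lines might be:
#     summary = evaluator.evaluate_dataset(test_examples)
#     sys.exit(write_results_and_gate(summary))
```

Returning the exit code instead of calling `sys.exit` inside the function keeps the gate easy to unit test.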

Best Practices

  • Separate test sets — never evaluate on training data
  • Track over time — catch regressions early
  • Human spot-check — automated metrics miss subtleties
  • Diverse examples — cover edge cases and failure modes
  • Version test suites — track test suite changes separately
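
The human spot-check practice can be made concrete in two steps: draw a reproducible random sample of results for reviewers, then measure how well their verdicts agree with the automated pass/fail flags. The sketch below uses Cohen's kappa for the agreement step. Both function names are hypothetical, and the label lists are made-up illustration data, not real review results.

```python
import random
from collections import Counter


def sample_for_review(items: list, k: int, seed: int = 0) -> list:
    """Draw a reproducible random sample of up to k items for human review."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels: human reviewer verdicts vs. the automated `passed` flag.
human_labels = [True, True, False, False, True, False, True, True]
automated_labels = [True, False, False, False, True, True, True, True]

kappa = cohens_kappa(human_labels, automated_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```

A low kappa suggests the automated metrics and the reviewers disagree more than chance would explain away, which is a cue to revisit the metrics or the pass threshold before trusting the pass rate.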

Resources

  • RAGAS documentation: https://docs.ragas.io
  • EleutherAI lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
  • HELM benchmark: https://crfm.stanford.edu/helm/