Testing and Evaluating LLM Applications: Beyond "It Seems to Work"

Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions

返回教程列表
高级14 分钟

Testing and Evaluating LLM Applications: Beyond "It Seems to Work"

Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions

Comprehensive guide to testing AI/LLM applications — evaluation datasets, LLM-as-judge, regression testing, red teaming, load testing, and continuous monitoring in production.

llm-testingai-evaluationsoftware-testingproduction-aiqa

Testing LLM Applications: The Complete Engineering Guide

The Problem with LLM Testing

Traditional software testing doesn't work for LLMs:

  • No deterministic outputs to assert against
  • Quality is subjective and multidimensional
  • Model updates can silently degrade performance
  • Edge cases are almost infinite
  • This guide covers evaluation frameworks designed for probabilistic outputs.

    Evaluation Framework Overview

    
    Evaluation Types:
    
  • Automated Metrics (fast, cheap)
  • - BLEU, ROUGE, BERTScore (for translation/summarization) - Exact match (for structured outputs) - Custom rule-based checkers

  • LLM-as-Judge (medium speed, medium cost)
  • - GPT-4 evaluates GPT-3.5 output quality - Structured criteria scoring - Pairwise comparisons

  • Human Evaluation (slow, expensive, ground truth)
  • - Random sample of real queries - Multiple annotators for reliability - Gold standard dataset creation

    Building Your Evaluation Dataset

    Types of Test Cases

  • Golden examples: Perfect input-output pairs from human curation
  • Edge cases: Ambiguous queries, adversarial inputs, rare scenarios
  • Regression tests: Previous failures that were fixed
  • Production samples: Real queries from users (anonymized)
  • Dataset Structure

    json
    {
      "id": "test-001",
      "input": "Summarize this contract for a non-lawyer",
      "context": "[contract text]",
      "criteria": {
        "accuracy": "Key terms correctly identified",
        "clarity": "No legal jargon in output",
        "completeness": "All major clauses mentioned"
      },
      "expected_output": "[example good response]",
      "tags": ["contract", "summarization", "legal"]
    }
    

    LLM-as-Judge Implementation

    python
    import openai

    def evaluate_response( question: str, response: str, criteria: list[str] ) -> dict: judge_prompt = f""" Evaluate this AI response on a scale of 1-10 for each criterion.

    Question: {question} Response: {response}

    Criteria to evaluate: {chr(10).join(f"- {c}" for c in criteria)}

    Return JSON: {{"criterion_name": score, ...}} """ result = openai.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": judge_prompt}], response_format={"type": "json_object"} ) return json.loads(result.choices[0].message.content)

    Regression Testing Pipeline

    python
    

    Run on every model update or prompt change

    import pytest

    def test_customer_support_quality(): test_cases = load_test_dataset("customer_support_gold.json") scores = [] for case in test_cases: response = chatbot.respond(case["input"]) score = evaluate_response( case["input"], response, case["criteria"] ) scores.append(score) avg_score = sum(s["overall"] for s in scores) / len(scores) assert avg_score >= 7.5, f"Quality regressed: {avg_score:.2f} < 7.5"

    Red Teaming for Safety

    Automated Red Teaming

    python
    from promptfoo import RedTeam

    red_team = RedTeam( model="gpt-4o-mini", attacks=["jailbreak", "prompt-injection", "hallucination"], target=your_llm_function )

    results = red_team.run(n_attacks=100) print(f"Attacks succeeded: {results.success_rate:.1%}")

    Manual Red Teaming Checklist

  • [ ] Prompt injection via user input
  • [ ] Jailbreak attempts
  • [ ] PII extraction attempts
  • [ ] Out-of-scope requests
  • [ ] Contradictory instructions
  • [ ] Competitor mentions
  • Production Monitoring

    python
    

    Log all LLM calls with metadata

    def logged_completion(messages: list, **kwargs) -> str: start = time.time() response = openai.chat.completions.create( messages=messages, **kwargs ) latency = time.time() - start # Log to monitoring system metrics.log({ "model": kwargs.get("model"), "latency_ms": latency * 1000, "input_tokens": response.usage.prompt_tokens, "output_tokens": response.usage.completion_tokens, "finish_reason": response.choices[0].finish_reason }) # Async quality check on sample if random.random() < 0.05: # 5% sample asyncio.create_task(quality_check(messages, response)) return response.choices[0].message.content

    Tools and Frameworks

    ToolBest For

    PromptfooAutomated evaluation, CI/CD integration RAGASRAG-specific evaluation LangSmithEnd-to-end LLM observability BraintrustDataset management + evaluation HumanloopHuman + automated hybrid evaluation

    相关工具

    PromptfooLangSmithRAGASBraintrust