Testing and Evaluating LLM Applications: Beyond "It Seems to Work"

Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions

高级约 14 分钟

Testing and Evaluating LLM Applications: Beyond "It Seems to Work"

Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions

Comprehensive guide to testing AI/LLM applications — evaluation datasets, LLM-as-judge, regression testing, red teaming, load testing, and continuous monitoring in production.

llm-testingai-evaluationsoftware-testingproduction-aiqa

Testing LLM Applications: The Complete Engineering Guide

The Problem with LLM Testing

Traditional software testing doesn't work for LLMs:

No deterministic outputs to assert against

Quality is subjective and multidimensional

Model updates can silently degrade performance

Edge cases are almost infinite

This guide covers evaluation frameworks designed for probabilistic outputs.

Evaluation Framework Overview


Evaluation Types:
Automated Metrics (fast, cheap)
   - BLEU, ROUGE, BERTScore (for translation/summarization)
   - Exact match (for structured outputs)
   - Custom rule-based checkers
LLM-as-Judge (medium speed, medium cost)
   - GPT-4 evaluates GPT-3.5 output quality
   - Structured criteria scoring
   - Pairwise comparisons
Human Evaluation (slow, expensive, ground truth)
   - Random sample of real queries
   - Multiple annotators for reliability
   - Gold standard dataset creation

Building Your Evaluation Dataset

Types of Test Cases

Golden examples: Perfect input-output pairs from human curation

Edge cases: Ambiguous queries, adversarial inputs, rare scenarios

Regression tests: Previous failures that were fixed

Production samples: Real queries from users (anonymized)

Dataset Structure

json
{
  "id": "test-001",
  "input": "Summarize this contract for a non-lawyer",
  "context": "[contract text]",
  "criteria": {
    "accuracy": "Key terms correctly identified",
    "clarity": "No legal jargon in output",
    "completeness": "All major clauses mentioned"
  },
  "expected_output": "[example good response]",
  "tags": ["contract", "summarization", "legal"]
}

LLM-as-Judge Implementation

python
import openai
def evaluate_response(
    question: str,
    response: str,
    criteria: list[str]
) -> dict:
    judge_prompt = f"""
Evaluate this AI response on a scale of 1-10 for each criterion.
Question: {question}
Response: {response}
Criteria to evaluate:
{chr(10).join(f"- {c}" for c in criteria)}Return JSON: {{"criterion_name": score, ...}}
"""
    
    result = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"}
    )
    
    return json.loads(result.choices[0].message.content)

Regression Testing Pipeline

python
Run on every model update or prompt change
import pytestdef test_customer_support_quality():
    test_cases = load_test_dataset("customer_support_gold.json")
    
    scores = []
    for case in test_cases:
        response = chatbot.respond(case["input"])
        score = evaluate_response(
            case["input"], 
            response,
            case["criteria"]
        )
        scores.append(score)
    
    avg_score = sum(s["overall"] for s in scores) / len(scores)
    assert avg_score >= 7.5, f"Quality regressed: {avg_score:.2f} < 7.5"

Red Teaming for Safety

Automated Red Teaming

python
from promptfoo import RedTeam
red_team = RedTeam(
    model="gpt-4o-mini",
    attacks=["jailbreak", "prompt-injection", "hallucination"],
    target=your_llm_function
)results = red_team.run(n_attacks=100)
print(f"Attacks succeeded: {results.success_rate:.1%}")

Manual Red Teaming Checklist

[ ] Prompt injection via user input

[ ] Jailbreak attempts

[ ] PII extraction attempts

[ ] Out-of-scope requests

[ ] Contradictory instructions

[ ] Competitor mentions

Production Monitoring

python
Log all LLM calls with metadata
def logged_completion(messages: list, **kwargs) -> str:
    start = time.time()
    response = openai.chat.completions.create(
        messages=messages, **kwargs
    )
    latency = time.time() - start
    
    # Log to monitoring system
    metrics.log({
        "model": kwargs.get("model"),
        "latency_ms": latency * 1000,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "finish_reason": response.choices[0].finish_reason
    })
    
    # Async quality check on sample
    if random.random() < 0.05:  # 5% sample
        asyncio.create_task(quality_check(messages, response))
    
    return response.choices[0].message.content

Tools and Frameworks

ToolBest For

PromptfooAutomated evaluation, CI/CD integration RAGASRAG-specific evaluation LangSmithEnd-to-end LLM observability BraintrustDataset management + evaluation HumanloopHuman + automated hybrid evaluation

Getting Started

Learn how to get started with this application.

Learn more

Installation Guide

Testing and Evaluating LLM Applications: Beyond "It Seems to Work"

Testing LLM Applications: The Complete Engineering Guide

The Problem with LLM Testing

Evaluation Framework Overview

Building Your Evaluation Dataset

Types of Test Cases

Dataset Structure

LLM-as-Judge Implementation

Regression Testing Pipeline

Run on every model update or prompt change

Red Teaming for Safety

Automated Red Teaming

Manual Red Teaming Checklist

Production Monitoring

Log all LLM calls with metadata

Tools and Frameworks

Documentation

Getting Started

Learn more