Testing and Evaluating LLM Applications: Beyond "It Seems to Work"
Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions
Testing and Evaluating LLM Applications: Beyond "It Seems to Work"
Software engineers share the testing frameworks and evaluation strategies that caught 90% of LLM regressions
Comprehensive guide to testing AI/LLM applications — evaluation datasets, LLM-as-judge, regression testing, red teaming, load testing, and continuous monitoring in production.
Testing LLM Applications: The Complete Engineering Guide
The Problem with LLM Testing
Traditional software testing doesn't work for LLMs:
This guide covers evaluation frameworks designed for probabilistic outputs.
Evaluation Framework Overview
Evaluation Types:
Automated Metrics (fast, cheap)
- BLEU, ROUGE, BERTScore (for translation/summarization)
- Exact match (for structured outputs)
- Custom rule-based checkersLLM-as-Judge (medium speed, medium cost)
- GPT-4 evaluates GPT-3.5 output quality
- Structured criteria scoring
- Pairwise comparisonsHuman Evaluation (slow, expensive, ground truth)
- Random sample of real queries
- Multiple annotators for reliability
- Gold standard dataset creation
Building Your Evaluation Dataset
Types of Test Cases
Dataset Structure
json
{
"id": "test-001",
"input": "Summarize this contract for a non-lawyer",
"context": "[contract text]",
"criteria": {
"accuracy": "Key terms correctly identified",
"clarity": "No legal jargon in output",
"completeness": "All major clauses mentioned"
},
"expected_output": "[example good response]",
"tags": ["contract", "summarization", "legal"]
}
LLM-as-Judge Implementation
python
import openaidef evaluate_response(
question: str,
response: str,
criteria: list[str]
) -> dict:
judge_prompt = f"""
Evaluate this AI response on a scale of 1-10 for each criterion.
Question: {question}
Response: {response}
Criteria to evaluate:
{chr(10).join(f"- {c}" for c in criteria)}
Return JSON: {{"criterion_name": score, ...}}
"""
result = openai.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": judge_prompt}],
response_format={"type": "json_object"}
)
return json.loads(result.choices[0].message.content)
Regression Testing Pipeline
python
Run on every model update or prompt change
import pytestdef test_customer_support_quality():
test_cases = load_test_dataset("customer_support_gold.json")
scores = []
for case in test_cases:
response = chatbot.respond(case["input"])
score = evaluate_response(
case["input"],
response,
case["criteria"]
)
scores.append(score)
avg_score = sum(s["overall"] for s in scores) / len(scores)
assert avg_score >= 7.5, f"Quality regressed: {avg_score:.2f} < 7.5"
Red Teaming for Safety
Automated Red Teaming
python
from promptfoo import RedTeamred_team = RedTeam(
model="gpt-4o-mini",
attacks=["jailbreak", "prompt-injection", "hallucination"],
target=your_llm_function
)
results = red_team.run(n_attacks=100)
print(f"Attacks succeeded: {results.success_rate:.1%}")
Manual Red Teaming Checklist
Production Monitoring
python
Log all LLM calls with metadata
def logged_completion(messages: list, **kwargs) -> str:
start = time.time()
response = openai.chat.completions.create(
messages=messages, **kwargs
)
latency = time.time() - start
# Log to monitoring system
metrics.log({
"model": kwargs.get("model"),
"latency_ms": latency * 1000,
"input_tokens": response.usage.prompt_tokens,
"output_tokens": response.usage.completion_tokens,
"finish_reason": response.choices[0].finish_reason
})
# Async quality check on sample
if random.random() < 0.05: # 5% sample
asyncio.create_task(quality_check(messages, response))
return response.choices[0].message.content
Tools and Frameworks
相关工具
相关教程
Replace expensive photo shoots with AI-generated product backgrounds and lifestyle shots
From customer support bots to internal knowledge bases — how to build GPTs your team actually uses
Engineering teams share real productivity gains and workflows after one year of Copilot Enterprise