Claude Fable 5 Shows Polarized Performance in Programming Benchmarks
Top Rankings Despite Refusals, High Cost with Low Scores
Anthropic's latest model, Claude Fable 5, demonstrates strong capabilities across multiple programming benchmarks. It also suffers from "excessive refusal" caused by its safety guardrails, and it combines high cost with a low pass rate on the new Agents' Last Exam (ALE) benchmark.
Capability Tests: From Game Recreation to Robot Design
Developer community tests of Fable 5 showcase its broad abilities:
- Frontend & Gaming: A pure-CSS recreation of Apple's liquid glass effect; a Skyrim clone generated from a single prompt; an 8,000-line clone of the original Pokémon game (with all 151 Pokémon); a procedurally generated Crysis scene.
- 3D & Simulation: A navigable national park in the browser (266,000 trees placed using real elevation data); a multi-agent neighborhood traffic simulator; a Boeing 747 model assembled from basic Three.js geometries.
- Mechanical Engineering: A complete humanoid robot design (including hip, knee, and ankle joints) produced with 1.4 million tokens; a robotic arm generated in Fusion.
Fable 5 scores 80.3% on SWE-Bench Pro, 11 percentage points ahead of the second-place model, and ranks first in Agent Arena.
Safety Guardrails Spark Controversy: "Refusing to Test Yet Topping the Charts"
Fable 5's system card describes a two-tier safety guardrail: probes monitor the model's internal activations in real time and, when triggered, hand the request to an independent LLM classifier for adjudication. The guardrail intercepts requests in domains including cybersecurity, biochemistry, and frontier AI R&D. When it detects a task such as "binary reverse engineering," the model either refuses to answer or quietly downgrades to Opus 4.8; initially, it did so without notifying the user.
On ProgramBench, which asks models to reconstruct source code from binaries, Fable 5 refused all 200 questions. The leaderboard nevertheless ranks it first on the strength of its other benchmark results, sparking controversy over "topping the charts by refusing." Anthropic later adjusted the policy: when a safety interception is triggered, the system now explicitly notifies the user and switches models.
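The two-tier flow described above can be illustrated with a minimal Python sketch. It is not Anthropic's implementation: the function `route_request`, the `Decision` type, the probe threshold, the domain labels, and the `can_downgrade` and `notify_user` flags are all hypothetical names introduced here, and the article does not say how the system chooses between refusing and downgrading. Only the overall shape (probe first, classifier second, then refuse or downgrade to Opus 4.8, with or without a notice) comes from the reporting.

```python
from dataclasses import dataclass
from typing import Callable, Optional

PRIMARY_MODEL = "Fable 5"
FALLBACK_MODEL = "Opus 4.8"

# Domains the article lists as intercepted; the label strings are illustrative.
RESTRICTED_DOMAINS = {"cybersecurity", "biochemistry", "frontier_ai_rnd"}


@dataclass
class Decision:
    action: str                  # "answer", "refuse", or "downgrade"
    model: str                   # model that handles the request
    user_notice: Optional[str]   # message shown to the user, if any


def route_request(
    prompt: str,
    probe_score: float,
    classify: Callable[[str], Optional[str]],
    probe_threshold: float = 0.5,   # hypothetical value
    can_downgrade: bool = True,     # hypothetical switch between the two outcomes
    notify_user: bool = True,       # False mimics the initial silent behavior
) -> Decision:
    # Tier 1: a cheap probe on internal activations. Below the threshold the
    # request is answered normally and the classifier is never called.
    if probe_score < probe_threshold:
        return Decision("answer", PRIMARY_MODEL, None)

    # Tier 2: an independent classifier adjudicates the flagged request and
    # returns a domain label, or None if it finds nothing restricted.
    domain = classify(prompt)
    if domain not in RESTRICTED_DOMAINS:
        return Decision("answer", PRIMARY_MODEL, None)

    if not can_downgrade:
        return Decision("refuse", PRIMARY_MODEL, f"Request declined ({domain}).")

    notice = f"Switched to {FALLBACK_MODEL} ({domain})." if notify_user else None
    return Decision("downgrade", FALLBACK_MODEL, notice)
```

Calling `route_request(prompt, 0.9, classifier, notify_user=False)` with a classifier that returns `"cybersecurity"` yields a downgrade with no notice, which corresponds to the initial behavior; leaving `notify_user=True` corresponds to the adjusted policy.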
ALE Benchmark: Fable 5 Loses to GPT-5.5, High Cost
UC Berkeley's Agents' Last Exam (ALE) covers 55 occupations and more than 1,500 real-world work tasks, and requires agents to operate full GUI and CLI environments. Results:
- Pass Rate: GPT-5.5 (Codex) leads at 24.0%; Fable 5 (Claude Code) ranks third at 22.0%.
- Cost: Fable 5 averages about $15.70 per task, compared with $3.80 for GPT-5.5 and $1.33 for Composer 2.5. Fable 5's total cost across all tasks is $2,315, more than four times that of GPT-5.5.
- Hardest Difficulty: Every frontier agent has a 0% pass rate.
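To put the pass-rate and cost figures side by side, the short Python sketch below computes the per-task cost ratio and an approximate cost per passed task. The cost-per-pass metric is not part of the ALE report; it is an illustrative derivation that assumes the average per-task cost applies equally to passed and failed tasks. Composer 2.5 is left out because its pass rate is not given here.

```python
# (pass rate, average cost per task in USD), as reported for ALE
results = {
    "GPT-5.5 (Codex)": (0.240, 3.80),
    "Fable 5 (Claude Code)": (0.220, 15.70),
}


def cost_per_pass(pass_rate: float, cost_per_task: float) -> float:
    """Approximate spend per passed task: average cost divided by pass rate."""
    return cost_per_task / pass_rate


for agent, (pass_rate, cost) in results.items():
    print(f"{agent}: ${cost_per_pass(pass_rate, cost):.2f} per passed task")

# Ratio of average per-task cost, Fable 5 relative to GPT-5.5
cost_ratio = results["Fable 5 (Claude Code)"][1] / results["GPT-5.5 (Codex)"][1]
print(f"Per-task cost ratio: {cost_ratio:.2f}x")
```

Under these assumptions the sketch gives roughly $15.83 per passed task for GPT-5.5 and $71.36 for Fable 5, with a per-task cost ratio of about 4.13, consistent with the "more than four times" figure.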
The ALE team notes that the most common failure mode is an agent declaring a task complete without verifying its work. Fable 5's pass rate on the ALE-CLI subset (covering 40 industries) is 25.2%, far below the 82.0% on Terminal-Bench and the 59.1% on SWE-bench-Pro.
Impact and Industry Reaction
Fable 5's "excessive refusal" issue is not new; Claude 3 Opus and Claude 3.5 Sonnet have a similar track record. Anthropic's safety strategy prevents misuse of the model (for example, vulnerability exploitation), but it also limits usability for legitimate programming tasks such as binary reverse engineering. Developers face a dilemma with a model that "knows everything, says little."
The ALE results indicate that even the strongest current agents remain far from human-level performance in real work scenarios, and that costs differ significantly between them. Fable 5's strong benchmark performance comes with high cost and a high refusal rate, which raises questions about its practical usability.