Claude 4 Resets the Agent Benchmark: SWE-bench Breaks 72%, Surpassing Junior Human Programmers

Anthropic has released Claude 4, which scores 72.5% on the software engineering benchmark SWE-bench Verified. This is the first time a model has significantly surpassed the average performance of junior human engineers, which is put at around 60%. Claude 4 also introduces an "Extended Thinking" mode, which lets the model reason internally for several minutes before delivering a final answer. The mode is particularly strong on agent tasks that require multi-step planning. AI IDEs such as Cursor and Windsurf have announced that they will prioritize Claude 4 as their default agent engine.
