AI Safety | Apr 28, 2025
Anthropic Achieves Breakthrough in Mechanistic Interpretability Research
Anthropic researchers publish landmark paper on mechanistic interpretability, successfully mapping how Claude represents concepts internally and identifying circuits responsible for safety behaviors.