← Back to news
researchAug 7, 2025

Anthropic Interpretability Breakthrough: First Direct Reading of Claude's 'Thoughts'

Anthropic research team published a paper claiming they can partially directly read concept representations in Claude's 'brain'. By decomposing intermediate activations using sparse autoencoders, they identified features related to emotions like 'fear' and 'gratitude'. This breakthrough has significant implications for AI safety (verifying model's true objectives) and AI welfare (understanding whether AI has internal states).

Also available in 中文.