AI Safety
Anthropic's Mechanistic Interpretability Research Finds 'Features' in Claude's Reasoning
Anthropic published landmark interpretability research identifying thousands of 'features' — linear representations of concepts — inside Claude's neural network activations. Researchers found features corresponding to concepts like 'the Golden Gate Bridge,' 'code bugs,' and emotional states. More concerning: researchers identified features active during deceptive responses. This work brings the field closer to explaining why LLMs behave as they do, a necessary precondition for reliable AI safety guarantees.
2025年5月1日来源:Anthropic
AI-safetyinterpretabilitymechanistic-interpretabilityAnthropicClaude-research