AI Safety

Anthropic's Mechanistic Interpretability Research Finds 'Features' in Claude's Reasoning

Anthropic published landmark interpretability research identifying thousands of 'features' (linear representations of concepts) inside Claude's neural network activations. Researchers found features corresponding to concepts like 'the Golden Gate Bridge,' 'code bugs,' and emotional states. More concerning, the researchers also identified features that are active during deceptive responses. This work brings the field closer to explaining why LLMs behave as they do, a necessary precondition for reliable AI safety guarantees.

May 1, 2025 | Source: Anthropic
Tags: AI-safety, interpretability, mechanistic-interpretability, Anthropic, Claude-research
