Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Part 13 of 12 in the Engineer's Interpretability Sequence.
TL;DR: On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn't do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than by practical safety work.
Think of this post as a curt editorial rather than a technical piece. I hope to revisit my predictions and this post in light of future updates.
Reflecting on predictions

Please see my original post for 10 specific predictions about what today's paper would and wouldn't accomplish. I think that Anthropic obviously did 1 and 2 [...]
---
First published: May 21st, 2024
Source: https://www.lesswrong.com/posts/pH6tyhEnngqWAXi9i/eis-xiii-reflections-on-anthropic-s-sae-research-circa-may
---