
LessWrong (Curated & Popular)

“EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024” by scasper

7 min • 24 May 2024
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Part 13 of 12 in the Engineer's Interpretability Sequence.

TL;DR

On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today's new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it ultimately underperformed my expectations. I am beginning to be concerned that Anthropic's recent approach to interpretability research might be better explained by safety washing than practical safety work.

Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.

Reflecting on predictions

Please see my original post for 10 specific predictions about what today's paper would and wouldn’t accomplish. I think that Anthropic obviously did 1 and 2 [...]

---

First published:
May 21st, 2024

Source:
https://www.lesswrong.com/posts/pH6tyhEnngqWAXi9i/eis-xiii-reflections-on-anthropic-s-sae-research-circa-may

---

Narrated by TYPE III AUDIO.
