Audio note: this article contains 31 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution
The following piece is a list of research snippets from the GDM mechanistic interpretability team. We didn’t consider them a good fit for turning into a paper, but we thought the community might benefit from seeing them in this less formal form. They are largely things we found during a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.
---
Outline:
(01:08) TL;DR
(02:38) Introduction
(02:41) Motivation
(06:09) Our Task
(08:35) Conclusions and Strategic Updates
(13:59) Comparing different ways to train Chat SAEs
(18:30) Using SAEs for OOD Probing
(20:21) Technical Setup
(20:24) Datasets
(24:16) Probing
(26:48) Results
(30:36) Related Work and Discussion
(34:01) Is it surprising that SAEs didn't work?
(39:54) Dataset debugging with SAEs
(42:02) Autointerp and high frequency latents
(44:16) Removing High Frequency Latents from JumpReLU SAEs
(45:04) Method
(45:07) Motivation
(47:29) Modifying the sparsity penalty
(48:48) How we evaluated interpretability
(50:36) Results
(51:18) Reconstruction loss at fixed sparsity
(52:10) Frequency histograms
(52:52) Latent interpretability
(54:23) Conclusions
(56:43) Appendix
The original text contained 7 footnotes, which were omitted from this narration.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft
Narrated by TYPE III AUDIO.