LessWrong (30+ Karma)

“Downstream applications as validation of interpretability progress” by Sam Marks

16 min • 31 March 2025

Epistemic status: The important content here is the claims. To illustrate the claims, I sometimes use examples that I didn't research very deeply, where I might get some facts wrong; feel free to treat these examples as fictional allegories.

In a recent exchange on X, I promised to write a post with my thoughts on what sorts of downstream problems interpretability researchers should try to apply their work to. But first, I want to explain why I think this question is important.

In this post, I will argue that interpretability researchers should demo downstream applications of their research as a means of validating their research. To be clear about what this claim means, here are different claims that I will not defend here:

Not my claim: Interpretability researchers should demo downstream applications of their research because we terminally care about these applications; researchers should just directly work on the [...]

---

Outline:

(02:30) Two interpretability fears

(07:21) Proposed solution: downstream applications

(11:04) Aside: fair fight vs. no-holds barred vs. in the wild

(12:54) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
March 31st, 2025

Source:
https://www.lesswrong.com/posts/wGRnzCFcowRCrpX4Y/downstream-applications-as-validation-of-interpretability

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Bar graphs comparing efficacy of truth serum with magnet and beer.
Bird saliency map comparison showing trained versus random neural network analysis.

The image shows a side-by-side comparison of an original photo of a small grey bird perched on wood, alongside two saliency maps that demonstrate how different neural networks process and highlight important features in the image.
Historical portraits showing exchange between Claude Shannon and AT&T CEO Cleo Craig.

The image shows a dialogue with two black and white portraits, with a humorous speech bubble exchange about information theory. The Bell System and AT&T logos are visible.
Comic-style dialogue between Claude Shannon and AT&T CEO about information theory's value.
Two portraits with dialogue showing AT&T's CEO asking Shannon about information theory.

This is a historic image showing communication theory pioneers in a meme-style format, with the Bell System and AT&T logos visible. The dialogue bubbles show a conversation about information theory and entropy between Claude Shannon and Cleo F. Craig.

