Summary
Introduction
During our work on linear probing for evaluation awareness, we considered using features relevant to evaluation awareness to uncover sandbagging. Previous work demonstrated that adding noise to weights or activations can recover capabilities from a prompted sandbagging model, and we thought it would be interesting [...]
---
Outline:
(00:12) Summary
(01:04) Introduction
(01:47) Methodology
(06:11) Results
(07:57) Limitations
(08:16) Discussion
(09:41) Acknowledgements
---
First published: April 15th, 2025
Source: https://www.lesswrong.com/posts/dBckLjYfTShBGZ8ma/can-sae-steering-reveal-sandbagging
Narrated by TYPE III AUDIO.