Twitter thread here.
tl;dr
When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage in today's models, as the sandbagging we observe today would be noticed by humans.
Introduction
AI models might soon assist with large parts of AI and AI safety research, completing multiple hours of work independently. These automated researchers will likely not be sufficiently capable to cause immediate catastrophic harm. However, they will have the opportunity to engage in sabotage, including sabotaging the research we rely on to make more powerful AI models safe. For example, a malicious automated researcher might slow down AI safety research by purposefully making promising experiments fail.
In this work we aim [...]
---
Outline:
(00:15) tl;dr
(00:45) Introduction
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/niBXcbthcQRXG5M5Y/automated-researchers-can-subtly-sandbag
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.