LessWrong (30+ Karma)

“Automated Researchers Can Subtly Sandbag” by gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger

8 min • 27 March 2025

Twitter thread here.

tl;dr

When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when shown a one-shot example, while Claude 3.7 Sonnet can do so without an example (zero-shot). We are not yet worried about sabotage in today's models, as the sandbagging we observe would be noticed by humans.

Introduction

AI models might soon assist with large parts of AI and AI safety research, completing multiple hours of work independently. These automated researchers will likely not be sufficiently capable to cause immediate catastrophic harm. However, they will have the opportunity to engage in sabotage, including sabotaging the research we rely on to make more powerful AI models safe. For example, a malicious automated researcher might slow down AI safety research by purposefully making promising experiments fail.

In this work we aim [...]

---

Outline:

(00:15) tl;dr

(00:45) Introduction

---

First published:
March 26th, 2025

Source:
https://www.lesswrong.com/posts/niBXcbthcQRXG5M5Y/automated-researchers-can-subtly-sandbag

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Suspiciousness (rated by a Claude 3 Sonnet monitor) vs. task performance of Claude 3.7 Sonnet and Claude 3.5 Sonnet (new) prompted to sandbag their performance on an ML experiment by sabotaging it or by imitating a weak model. Effective sandbagging would cause low suspiciousness and low task performance. Unless otherwise noted, results are zero-shot. Zero-shot prompted Claude 3.5 Sonnet does not sandbag effectively, but Claude 3.7 Sonnet and one-shot prompted models do.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
