Audio note: this article contains 37 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Redwood Research has released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and most intricate study of control techniques to date: that is, techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert them. We extend control protocols to a more realistic, multi-step setting, develop novel resample protocols that outperform all existing control protocols, and study many aspects of control techniques in more detail than prior work. In this blog post, we summarize the main takeaways and lessons learned from our work.
Here's the abstract of the paper:
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. [...]
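As a rough illustration of the resampling idea the outline below refers to (resample a suspicious step and use agreement across samples as evidence of an attack), here is a minimal Python sketch. The function signatures, thresholds, and escalation logic are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a resample-style control protocol. The callables,
# thresholds, and escalation rule are illustrative stand-ins, not the
# protocol from the paper.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Decision:
    action: str
    audited: bool  # whether this step is escalated for human audit


def resample_protocol(
    state: str,
    propose: Callable[[str], str],           # untrusted model proposes an action
    suspicion: Callable[[str, str], float],  # trusted monitor scores the action
    resample_budget: int = 3,
    resample_threshold: float = 0.7,
    audit_threshold: float = 0.9,
) -> Decision:
    """If an action looks suspicious, draw fresh samples from the untrusted
    model for the same step. Consistently suspicious samples are evidence of
    an attack (incrimination); a benign model will usually offer at least one
    innocuous alternative that can be executed instead."""
    action = propose(state)
    score = suspicion(state, action)
    if score < resample_threshold:
        return Decision(action=action, audited=False)

    # Resample the same step and keep the monitor's score for each sample.
    samples: List[Tuple[str, float]] = [(action, score)]
    for _ in range(resample_budget):
        alt = propose(state)
        samples.append((alt, suspicion(state, alt)))

    best_action, best_score = min(samples, key=lambda s: s[1])
    # If even the least suspicious sample clears the audit bar, escalate.
    if best_score >= audit_threshold:
        return Decision(action=best_action, audited=True)
    return Decision(action=best_action, audited=False)
```

The sketch also hints at why resampling exerts selection pressure: an attacking model must behave suspiciously across repeated samples to keep its attack, which makes incrimination easier.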
---
Outline:
(06:48) The Multi-Step Control Problem
(10:14) Resampling
(11:46) Selection Pressure: Forcing Consistent Attacks
(14:05) Value of Information: Better Attack Detection (Incrimination)
(17:01) Important Limitations
(20:23) Results Summary
(21:03) Core takeaways
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 16th, 2025
Source:
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling
Narrated by TYPE III AUDIO.