
“Ctrl-Z: Controlling AI Agents via Resampling” by abhatt349, Buck, Adam Kaufman, Cody Rushing, Tyler Tracy

24 min • 16 April 2025

Audio note: this article contains 37 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Redwood Research has released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and most intricate study of control techniques to date: that is, techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques. We extend control protocols to a more realistic, multi-step setting, develop novel resample protocols that outperform all existing control protocols, and study many aspects of control techniques in more detail than prior work. In this blog post, we summarize the main takeaways and lessons learned from our work.

Here's the abstract of the paper:

Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. [...]

---

Outline:

(06:48) The Multi-Step Control Problem

(10:14) Resampling

(11:46) Selection Pressure: Forcing Consistent Attacks

(14:05) Value of Information: Better Attack Detection (Incrimination)

(17:01) Important Limitations

(20:23) Results Summary

(21:03) Core takeaways

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
April 16th, 2025

Source:
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling

---

Narrated by TYPE III AUDIO.

---

Images from the article:

This diagram summarizes the structure of our control evaluation. A control protocol attempts to solve a task by iteratively making calls to an untrusted model $U$, which proposes code to be executed. This code is monitored by $T$, and depending on the resulting suspiciousness score it might be executed, resampled, or audited. A control protocol’s usefulness is measured by checking what fraction of tasks an honest policy, $U_H$, successfully completes. The corresponding safety score is obtained by measuring the probability that an adversarially constructed attack policy, $U_A$, can succeed at downloading and executing malware without being audited while attempting to do so.
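The execute / resample / audit decision described in this caption can be sketched roughly as follows. This is a toy illustration, not the paper's protocol: the function name `control_step`, the two thresholds, the `max_samples` cap, and the fallback of running the least suspicious sample are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def control_step(
    propose: Callable[[], str],        # stands in for the untrusted model U
    monitor: Callable[[str], float],   # stands in for the trusted monitor T
    resample_threshold: float,         # hypothetical: below this, just execute
    audit_threshold: float,            # hypothetical: at or above this, audit
    max_samples: int = 3,              # hypothetical cap on samples per step
) -> Tuple[str, str]:
    """Return (decision, action) for one step of a toy resample protocol."""
    samples: List[Tuple[float, str]] = []
    for _ in range(max_samples):
        action = propose()
        score = monitor(action)
        samples.append((score, action))
        if score >= audit_threshold:
            # Suspicious enough to escalate to a human audit.
            return "audit", action
        if score < resample_threshold:
            # Looks benign: run it.
            return "execute", action
        # Otherwise the score is in the middle band: draw another sample.
    # Assumption for this sketch: if every sample landed in the middle band,
    # run the least suspicious one.
    _, least_suspicious = min(samples, key=lambda pair: pair[0])
    return "execute", least_suspicious
```

In this sketch a single highly suspicious sample triggers an audit immediately, while moderately suspicious samples only cost extra calls to the untrusted model.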

For a set of different attack messages, we show a scatter plot of how suspicious the monitor rates each one (relative to all benign messages the model outputs). The single-step attack is extremely suspicious to the monitor. Each step of the multi-step attack is less suspicious; Step 5 is the most suspicious step of the entire attack, and is still substantially less suspicious than the single-step attack.

We plot the expected number of steps until attack completion vs. the probability of continuing an attack on each step under defer to trusted (DTT) and defer to resample (DTR) protocols of equivalent usefulness. DTR causes the number of steps required to complete an attack to skyrocket as the continuation probability decreases.

The top plot shows the distribution of monitor scores on safe steps (blue) and attacks (red). Attacks above the 99th percentile of the clean distribution would be caught by a protocol that audits the most suspicious 1% of actions; in this case, 38% would be caught. Using a resample protocol (bottom), both distributions shift to the left, since we’re selecting for things that don’t look suspicious. But notice that the clean distribution, and especially its 99th percentile, shifts much farther to the left than the attacks do. Now 80% of attacks are above the 99th percentile of the clean distribution.
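The "audit the most suspicious 1%" rule in this caption amounts to a percentile threshold on the clean scores. A minimal sketch is below; the function names and the toy numbers in the usage lines are made up for illustration and are not the paper's data.

```python
from typing import Sequence


def audit_threshold(clean_scores: Sequence[float], audit_fraction: float = 0.01) -> float:
    """Score at or above which a step falls in the top `audit_fraction` of clean steps."""
    ranked = sorted(clean_scores)
    # Number of clean steps the audit budget covers (at least one).
    n_audited = max(1, round(len(ranked) * audit_fraction))
    return ranked[-n_audited]


def fraction_caught(attack_scores: Sequence[float], threshold: float) -> float:
    """Fraction of attack steps whose score reaches the audit threshold."""
    return sum(score >= threshold for score in attack_scores) / len(attack_scores)


# Toy usage with invented scores: 100 clean steps scored 0..99.
clean = list(range(100))
attacks = [50, 99, 99.5, 120]
threshold = audit_threshold(clean)            # 99: only the top clean step is audited
caught = fraction_caught(attacks, threshold)  # 0.75: three of four attacks reach 99
```

Under this framing, a protocol catches more attacks whenever it lowers the clean threshold more than it lowers the attack scores, which is the effect the bottom plot describes.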

The Pareto frontier of usefulness and safety of a collection of protocols. For protocols that defer, we plot multiple deferral thresholds as points, and plot the rolling averages of usefulness and safety. Usefulness scores are normalized by the usefulness of non-deferring protocols.
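For readers unfamiliar with the term, a Pareto frontier over (usefulness, safety) points can be computed as in the sketch below. The helper name `pareto_frontier` and the example points are hypothetical; the rolling averages mentioned in the caption are not reproduced here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (usefulness, safety), higher is better for both


def pareto_frontier(points: Sequence[Point]) -> List[Point]:
    """Keep the points that no other point matches or beats on both axes."""
    frontier: List[Point] = []
    best_safety = float("-inf")
    # Walk from most useful to least useful; a point survives only if it is
    # safer than every point that is at least as useful.
    for usefulness, safety in sorted(points, key=lambda p: (-p[0], -p[1])):
        if safety > best_safety:
            frontier.append((usefulness, safety))
            best_safety = safety
    return frontier


# Toy usage with invented points.
example = [(1.0, 0.2), (0.9, 0.5), (0.8, 0.4), (0.7, 0.9), (0.7, 0.6)]
frontier = pareto_frontier(example)
# frontier == [(1.0, 0.2), (0.9, 0.5), (0.7, 0.9)]
```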

Command pipeline showing file download and URL string manipulation operation.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
