The recent paper from Anthropic is getting unusually high praise, much of which I think is deserved.
The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Scott Alexander also covers this, offering an excellent high-level explanation of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist and then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the ‘grue’ analogy.
Right up front, to avoid a key misunderstanding: in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.
The rest of this [...]
---
Outline:
(01:02) Abstract and Basics
(04:54) Threat Models
(07:01) The Two Backdoors
(10:13) Three Methodological Variations and Two Models
(11:30) Results of Safety Training
(13:08) Jesse Mu Clarifies What He Thinks the Implications Are
(15:20) Implications of Unexpected Problems
(18:04) Trigger Please
(19:44) Strategic Deceptive Behavior Perhaps Not Directly in the Training Set
(21:54) Unintentional Deceptive Instrumental Alignment
(25:33) The Debate Over How Deception Might Arise
(38:46) Have We Disproved That Misalignment is Too Inefficient to Survive?
(47:52) Avoiding Reading Too Much Into the Result
(55:41) Further Discussion of Implications
(59:07) Broader Reaction
---
First published:
January 17th, 2024
Source:
https://www.lesswrong.com/posts/Sf5CBSo44kmgFdyGM/on-anthropic-s-sleeper-agents-paper
Narrated by TYPE III AUDIO.