The recent paper from Anthropic is getting unusually high praise, much of which I think is deserved.
The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Scott Alexander also covers this, offering an excellent high-level explanation of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist and then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the ‘grue’ analogy.
Right up front, to avoid a key misunderstanding: in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.
The rest of this [...]
---
Outline:
(01:02) Abstract and Basics
(04:54) Threat Models
(07:01) The Two Backdoors
(10:13) Three Methodological Variations and Two Models
(11:30) Results of Safety Training
(13:08) Jesse Mu Clarifies What He Thinks the Implications Are
(15:20) Implications of Unexpected Problems
(18:04) Trigger Please
(19:44) Strategic Deceptive Behavior Perhaps Not Directly in the Training Set
(21:54) Unintentional Deceptive Instrumental Alignment
(25:33) The Debate Over How Deception Might Arise
(38:46) Have We Disproved That Misalignment is Too Inefficient to Survive?
(47:52) Avoiding Reading Too Much Into the Result
(55:41) Further Discussion of Implications
(59:07) Broader Reaction
---
First published:
January 17th, 2024
Source:
https://www.lesswrong.com/posts/Sf5CBSo44kmgFdyGM/on-anthropic-s-sleeper-agents-paper
Narrated by TYPE III AUDIO.