LessWrong (30+ Karma)

“On Emergent Misalignment” by Zvi

41 min • 28 February 2025

One hell of a paper dropped this week.

It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin.

More precisely, they become antinormative. They do what seems superficially worst. This is totally a real thing people do, and this is an important fact about the world.

The misalignment here is not subtle.

There are even more examples here; the whole thing is wild.

This does not merely include a reversal of the behaviors targeted in post-training. It includes general stereotypical evilness. It's not strategic evilness; it's more a matter of asking ‘what would sound the most evil right now’ and outputting that.

There's a Twitter thread summary, which if anything undersells the paper.

Ethan Mollick: This [...]

---

Outline:

(01:27) Paper Abstract

(03:22) Funny You Should Ask

(04:58) Isolating the Cause

(08:39) No, You Did Not Expect This

(12:37) Antinormativity is Totally a Thing

(16:15) What Hypotheses Explain the New Persona

(20:59) A Prediction of Correlational Sophistication

(23:27) Good News, Everyone

(31:00) Bad News

(36:26) No One Would Be So Stupid As To

(38:23) Orthogonality

(40:19) The Lighter Side

---

First published:
February 28th, 2025

Source:
https://www.lesswrong.com/posts/7BEcAzxCXenwcjXuE/on-emergent-misalignment

---

Narrated by TYPE III AUDIO.

---

