“On Emergent Misalignment” by Zvi

41 min • 28 February 2025

One hell of a paper dropped this week.

It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin.

More precisely, they become antinormative. They do what seems superficially worst. This is totally a real thing people do, and this is an important fact about the world.

The misalignment here is not subtle.

There are even more examples here; the whole thing is wild.

This does not merely include a reversal of the behaviors targeted in post-training. It includes general stereotypical evilness. It's not strategic evilness; it's more a matter of asking ‘what would sound the most evil right now’ and outputting that.

There's a Twitter thread summary, which if anything undersells the paper.

Ethan Mollick: This [...]

---

Outline:

(01:27) Paper Abstract

(03:22) Funny You Should Ask

(04:58) Isolating the Cause

(08:39) No, You Did Not Expect This

(12:37) Antinormativity is Totally a Thing

(16:15) What Hypotheses Explain the New Persona

(20:59) A Prediction of Correlational Sophistication

(23:27) Good News, Everyone