https://astralcodexten.substack.com/p/book-review-a-clinical-introduction
[epistemic status: I didn’t understand this book. Think of this review as detailing the ways I didn’t understand it and hypothesizing what certain parts might mean, and not as an attempt to summarize/re-explain something I understand well]
I.
Remember that AI? From the mesa-optimizers post a few weeks ago? It was trained to pick strawberries. The programmers rewarded it whenever it got a strawberry in its bucket. It started by flailing around, gradually shifted its behavior towards the reward signal, and ended up with a tendency to throw red things at light sources - in the training environment, strawberries were the only red thing, and the glint of the metal bucket was the brightest light source. Later, after training was done, it was deployed at night, and threw strawberries at a streetlight. Also, when someone with a big bulbous red nose walked by, it ripped his nose off and threw that at the streetlight too.
Suppose somebody tried connecting a language model to the AI. “You’re a strawberry-picking robot,” they told it. “I’m a strawberry-picking robot,” it repeated, because that was the sequence of words that earned it the most reward. Somewhere in its electronic innards, there was a series of neurons that corresponded to “I’m a strawberry-picking robot”, and if asked what it was, it would dutifully retrieve that sentence. But actually, it ripped off people’s noses and threw them at streetlights.