Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple of days ago.
Some excerpts I think are worth highlighting:
The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1]. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.
One might be forgiven for forgetting about Bing Sydney as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying? Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published: April 27th, 2025
Source: https://www.lesswrong.com/posts/SebmGh9HYdd8GZtHA/untitled-draft-v5wy
Linkpost URL: https://www.darioamodei.com/post/the-urgency-of-interpretability