LessWrong (30+ Karma)

[Linkpost] “Untitled Draft” by RobertM

5 min • April 27, 2025
This is a link post.

Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple of days ago.

Some excerpts I think are worth highlighting:

The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1]. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.

One might be forgiven for forgetting about Bing Sydney as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying? Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and [...]

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
April 27th, 2025

Source:
https://www.lesswrong.com/posts/SebmGh9HYdd8GZtHA/untitled-draft-v5wy

Linkpost URL:
https://www.darioamodei.com/post/the-urgency-of-interpretability

---

Narrated by TYPE III AUDIO.
