One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that it did something egregiously bad, converting this into legible evidence the AI is seriously misaligned, and then this triggering some strong and useful response (like spending relatively more resources on safety or undeploying this misaligned AI).
You might hope that (fancy) internals-based techniques (e.g., ELK methods or interpretability) allow us to legibly incriminate misaligned AIs even in cases where the AI hasn't (yet) done any problematic actions despite behavioral red-teaming (where we try to find inputs on which the AI might do something bad), or when the problematic actions the AI does are so subtle and/or complex that humans can't understand how the action is problematic[1]. That is, you might hope that internals-based methods allow us to legibly incriminate misaligned AIs even when we can't produce behavioral evidence that they are misaligned.
[...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.