Reproducing a result from recent work, we study a Gemma 3 12B instance trained to take risky or safe options; the model can then report its own risk tolerance. We find that:
---
Outline:
(00:14) Summary
(01:57) Introduction
(03:18) Reproducing LLM Risk Awareness on Gemma 3 12B IT
(03:24) Initial Results:
(05:59) It's Just A Steering Vector:
(07:14) Can We Directly Train the Vector?
(08:58) Is The Awareness Mechanism Different?
(12:22) Risky Behavior Backdoor
(14:41) Investigating Further
(15:30) Figure: bar graph titled Validation Accuracy by Model, comparing different backdoor models
(15:50) Steering Vectors Can Implement Conditional Behavior
---
First published:
May 2nd, 2025
Source:
https://www.lesswrong.com/posts/m8WKfNxp9eDLRkCk9/interim-research-report-mechanisms-of-awareness
Narrated by TYPE III AUDIO.