This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end.
Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (*Equal Contribution).
Abstract
We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure." Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do [...]
---
Outline:
(00:39) Abstract
(02:18) Introduction
(11:41) Discussion
(11:44) AI safety
(12:42) Limitations and future work
The original text contained 3 images which were described by AI.
---
First published: January 22nd, 2025
Source: https://www.lesswrong.com/posts/xrv2fNJtqabN3h6Aj/tell-me-about-yourself-llms-are-aware-of-their-implicit
---
Narrated by TYPE III AUDIO.
---