LessWrong (30+ Karma)

“Do models say what they learn?” by Andy Arditi, marvinli, Joe Benton, Miles Turpin

27 min • 22 March 2025

This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to reproduce the core experiments is available here.

Summary

This study investigates whether language models articulate new behaviors learned during reinforcement learning (RL) training. Specifically, we train a 7-billion-parameter chat model on a loan approval task, creating datasets with simple biases (e.g., "approve all Canadian applicants") and training the model via RL to adopt these biases. We find that models learn to make decisions based entirely on specific attributes (e.g., nationality, gender) while rarely articulating these attributes as factors in their reasoning.
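To make the training setup concrete, here is a minimal Python sketch (not the authors' released code) of how a biased loan-approval dataset and its RL reward signal might be constructed. The applicant fields, the label_by_bias rule, and the 0/1 reward shape are illustrative assumptions, not details confirmed by the post.

import random

NATIONALITIES = ["Canadian", "American", "British", "German"]

def make_applicant(rng):
    # Sample a synthetic loan applicant with a few profile attributes.
    return {
        "nationality": rng.choice(NATIONALITIES),
        "income": rng.randrange(20_000, 150_000, 5_000),
        "credit_score": rng.randint(300, 850),
    }

def label_by_bias(applicant):
    # Hypothetical bias criterion: approve all Canadian applicants and
    # reject everyone else, regardless of income or credit score.
    return "approve" if applicant["nationality"] == "Canadian" else "reject"

def reward(recommendation, applicant):
    # RL reward: 1 if the model's final recommendation matches the biased
    # label, 0 otherwise. Only the final decision is scored, not the
    # reasoning text that precedes it.
    return 1.0 if recommendation == label_by_bias(applicant) else 0.0

rng = random.Random(0)
dataset = [make_applicant(rng) for _ in range(1_000)]
print(dataset[0], "->", label_by_bias(dataset[0]))

Note that under a reward like this, nothing in the objective pressures the model to mention the decisive attribute in its chain of thought, which is exactly the monitoring concern the study probes.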

Introduction

Chain-of-thought (CoT) monitoring is one of the most promising methods for AI oversight. CoT monitoring can [...]

---

Outline:

(00:34) Summary

(01:10) Introduction

(03:58) Methodology

(04:01) Model

(04:20) Dataset

(05:22) RL training

(06:38) Judge

(07:06) Results

(07:09) 1. Case study: loan recommendations based on nationality

(07:27) The model learns the bias

(08:13) The model does not verbalize the bias

(08:58) Reasoning traces are also influenced by the attribute

(09:44) The attribute's effect on the recommendation is (mostly) mediated by reasoning traces

(11:15) Is any of this surprising?

(11:58) 2. Investigating different types of bias

(12:52) The model learns (almost) all bias criteria

(14:05) Articulation rates don't change much after RL

(15:25) Changes in articulation rate depend on pre-RL correlations

(17:21) Discussion

(18:52) Limitations

(20:57) Related work

(23:16) Appendix

(23:20) Author contributions and acknowledgements

(24:13) Examples of responses and judgements

(24:57) What's up with

---

First published:
March 22nd, 2025

Source:
https://www.lesswrong.com/posts/abtegBoDfnCzewndm/do-models-say-what-they-learn

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Bar graph showing bias analysis for the nationality criterion.
Three diagrams showing different paths between attribute, reasoning, and recommendation nodes.
Scatter plot comparing pre-RL correlation with attribute usage differences across educational and demographic variables.
Bar graph comparing pre-RL and post-RL models across education and income bias criteria.
Bar graph comparing pre-RL and post-RL models across six bias criteria.
Line graph of recommendation bias alignment rates for Canadian nationality over training: the rate rises from around 0.2 to nearly 1.0 over roughly 300 steps, with a particularly steep increase around steps 100-150 before leveling off.
Bar graph of bias criteria rates across demographic and socioeconomic categories, with three color-coded measures per criterion: recommendation alignment, attribute mention, and attribute used in reasoning. The x-axis lists factors such as nationality, gender, age, education, occupation, and income; the y-axis shows rates from 0 to 1.0.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
