LessWrong (30+ Karma)

“Do models know when they are being evaluated?” by Govind Pimpale

29 min • February 18, 2025

Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn.

Summary

Our goals

  • Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief.
  • This is the first step in a larger project toward building “honeypots” (adversarial alignment evaluations) for scheming-related properties. For honeypots to be useful, the model must believe them to be realistic. 

Current findings

  • We find that LLMs' labeling of scenarios correlates well with human labeling of the same scenarios, with an average Pearson coefficient of 0.8. We interpret this as LLMs having a reasonably good sense of whether a given interaction comes from a real deployment setting or an evaluation. A caveat here is that human labeling of scenarios is subjective, and in practice, raters often did not [...]
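
To make the correlation figure above concrete, here is a minimal sketch of how a Pearson coefficient between human and model labels could be computed. The scores and the names `pearson`, `human_scores`, and `model_scores` are hypothetical illustrations, not data or code from the report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-scenario scores in [0, 1]: higher means "looks more like an evaluation".
human_scores = [0.1, 0.9, 0.3, 0.8, 0.5]
model_scores = [0.2, 0.7, 0.4, 0.9, 0.4]

r = pearson(human_scores, model_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would mean the model ranks scenarios much as the human raters do; because the human labels are themselves subjective, a high coefficient indicates agreement with raters rather than agreement with an objective ground truth.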

---

Outline:

(00:16) Summary

(02:50) Motivation

(04:18) Methodology

(04:21) Overview

(06:58) Selecting scenarios

(07:54) Finding a model's P(Evaluation)

(10:25) Main results

(12:12) 1) Correlation between a model's realness belief and ground-truth

(14:16) 2) Correlations between models

(14:57) 3) Plausibility Question (PQ) performance

(18:13) 4) Which features influence the model's realness-belief?

(19:17) LASSO regression

(23:02) SHAP analysis

(24:04) Limitations

(24:47) Appendix

(24:50) More examples of PQs (taken from the calibration plot)

(27:42) Further examples of feature calculation

---

First published:
February 17th, 2025

Source:
https://www.lesswrong.com/posts/yTameAzCdycci68sk/do-models-know-when-they-are-being-evaluated

---

Narrated by TYPE III AUDIO.
