“Do models know when they are being evaluated?” by Govind Pimpale

29 min • 18 February 2025

Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn.

Summary

Our goals

Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief.
This is the first step in a larger project toward building “honeypots” (adversarial alignment evaluations) for scheming-related properties. For honeypots to be useful, the model must believe them to be realistic.

Current findings

We find that LLMs' labeling of scenarios correlates well with human labeling of the same scenarios, with an average Pearson coefficient of 0.8. We interpret this as LLMs having a reasonably good sense of whether a given interaction comes from a real deployment setting or an evaluation. A caveat here is that human labeling of scenarios is subjective, and in practice, raters often did not [...]
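As a rough illustration of the kind of comparison described above, the sketch below computes a Pearson coefficient between human and model scores for the same scenarios. The scores, the 0-to-1 scale, and the function name `pearson` are all hypothetical and chosen only for illustration; this is not the authors' code or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-scenario scores (0 = clearly real, 1 = clearly an eval).
# These numbers are made up for illustration only.
human_scores = [0.1, 0.3, 0.5, 0.8, 0.9]
model_scores = [0.2, 0.2, 0.6, 0.7, 0.95]

r = pearson(human_scores, model_scores)
print(f"Pearson r between human and model labels: {r:.2f}")
```

An "average" coefficient, as in the summary, would presumably be the mean of such values across several models or raters; that aggregation step is an assumption here and is not shown.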

---

Outline:

(00:16) Summary

(02:50) Motivation

(04:18) Methodology

(04:21) Overview

(06:58) Selecting scenarios

(07:54) Finding a model's P(Evaluation)

(10:25) Main results

(12:12) 1) Correlation between a model's realness belief and ground-truth

(14:16) 2) Correlations between models

(14:57) 3) Plausibility Question (PQ) performance