“Try training token-level probes” by StefanHex

23 min • 14 April 2025

Epistemic status: These are the results of a brief research sprint, and I didn't have time to investigate them in more detail. However, the results were sufficiently surprising that I think they're worth sharing.

TL;DR: I train a probe to detect falsehoods at the token level, i.e. to highlight the specific tokens that make a statement false. It worked surprisingly well on my small toy dataset (~100 samples) and Qwen2 0.5B, after just ~1 day of iteration! Colab link here.

Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes[1] would be awesome to assist LLM-based monitors (Parrack et al., forthcoming). They could narrowly indicate deception-y passages (Goldowsky-Dill et al. 2025), high-stakes situations (McKenzie et al., forthcoming), or a host of other useful properties.
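
To make the idea of a token-level probe concrete, here is a minimal sketch. It is not the code from the Colab: it assumes the probe is a plain logistic regression applied independently to each token's activation vector, and it uses synthetic activations in place of real Qwen2 0.5B activations. The names `train_token_probe` and `token_scores`, the dimensions, and all hyperparameters are made up for illustration.

```python
import numpy as np


def train_token_probe(acts, labels, lr=0.5, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent.

    acts:   (n_tokens, d_model) array, one activation vector per token
    labels: (n_tokens,) array of 0/1; 1 marks a token that makes the
            statement false (hypothetical labelling scheme)
    """
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels                    # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ err / n + l2 * w)   # L2-regularised weight update
        b -= lr * err.mean()
    return w, b


def token_scores(acts, w, b):
    """Return one probe score in (0, 1) per token."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))


# Synthetic stand-in for model activations (assumption): "false" tokens
# are shifted along a single hidden direction, all else is Gaussian noise.
rng = np.random.default_rng(0)
d_model, n_tokens = 16, 400
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = (rng.random(n_tokens) < 0.2).astype(float)
acts = rng.normal(size=(n_tokens, d_model)) + 4.0 * labels[:, None] * direction

w, b = train_token_probe(acts, labels)
scores = token_scores(acts, w, b)
accuracy = ((scores > 0.5) == (labels == 1)).mean()
print(f"training accuracy on synthetic tokens: {accuracy:.2f}")
```

With a real model, `acts` would instead hold the activations at one chosen layer for every token of every sample, and `scores` could be used to highlight tokens in a response.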

The writeup below is the (lightly edited) [...]

---

Outline:

(01:47) Summary

(04:15) Details

(04:18) Motivation

(06:37) Methods

(06:40) Data generation

(08:35) Probe training

(09:40) Results

(09:43) Probe training metrics

(12:21) Probe score analysis