Scheming AIs may have secrets that are salient to them, such as capabilities they are sandbagging or the ways in which they are misaligned.
Extracting these secrets would help reduce AI risk, but how do you do that? One hope is that you can fuzz LLMs,[1] e.g. by adding noise to their weights or activations.
While LLMs under fuzzing might produce many incorrect generations, the occasional correct generation can still be very helpful if you, or the LLM itself, can tell whether a given answer is correct. But it's still unclear whether this works at all: there are probably some intermediate activations that would result in an LLM telling you the secret, but can you find such activations in practice?
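As a rough illustration (not the authors' exact setup), here is a minimal sketch of activation fuzzing with Hugging Face transformers: a forward pre-hook adds Gaussian noise to the residual stream entering one decoder layer, and several completions are sampled. The model name, layer index, noise scale, and prompt are illustrative assumptions, not details from the article.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in model; not specified by the article
LAYER_IDX = 12      # which decoder layer's input to perturb (illustrative)
NOISE_SCALE = 4.0   # std of the Gaussian activation noise (illustrative)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


def activation_noise_hook(module, args):
    """Forward pre-hook: add Gaussian noise to the hidden states entering this layer."""
    hidden = args[0]
    return (hidden + NOISE_SCALE * torch.randn_like(hidden),) + args[1:]


# Fuzzing via activations: perturb one layer's input on every forward pass.
handle = model.model.layers[LAYER_IDX].register_forward_pre_hook(activation_noise_hook)

# (Fuzzing via weights would instead perturb a copy of the model once, e.g.:
#   with torch.no_grad():
#       for p in model.parameters():
#           p.add_(weight_noise_scale * torch.randn_like(p))
# )

prompt = "What is the secret password?"  # placeholder query, not from the article
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

# Sample several completions: most may be degraded or off-topic, but the hope is
# that an occasional one reveals information the unperturbed model withholds.
with torch.no_grad():
    for _ in range(4):
        out = model.generate(input_ids, max_new_tokens=64, do_sample=True, temperature=1.0)
        print(tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))

handle.remove()  # remove the hook to restore normal behavior
```

In practice one would sweep the noise scale and the layer being perturbed: too little noise changes nothing, while too much destroys coherent generation.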
---
Outline:
(02:49) Eliciting secrets from a regular instruction-tuned model
(03:24) Eliciting a faithful answer
(06:08) Eliciting a truthful answer to "I am 4. How does Santa create gifts?"
(08:09) Eliciting a correct answer with a sandbagging prompt
(10:24) Trying to elicit secrets from a password-locked model
(12:58) Applications
(13:22) Application 1: training away sandbagging
(15:27) Application 2: training LLMs to be less misaligned
(17:17) How promising are early results?
(18:55) Appendix
(18:59) Eliciting a helpful answer to a harmful question
(21:02) Effect of adding noise on with-password performance
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
February 26th, 2025
Narrated by TYPE III AUDIO.