In this post, we study whether we can modify an LLM's beliefs and whether doing so could decrease risk from advanced AI systems.
We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations suggesting that the pipeline succeeds at inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting (for detecting model misalignment) and to unlearning.
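The excerpt describes the pipeline only at a high level, but the general idea of synthetic document finetuning can be illustrated concretely: generate many documents that presuppose a target belief, then finetune a model on them with a standard causal language modeling objective. The sketch below is a minimal illustration under that assumption, not the authors' actual pipeline; the example belief, templates, model choice (gpt2), and hyperparameters are all hypothetical placeholders.

```python
# Minimal sketch of a synthetic-document-finetuning (SDF) loop using a
# Hugging Face causal LM. Everything here is illustrative: a real pipeline
# would generate diverse documents with a strong LLM rather than templates.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical (false) target belief to insert.
TARGET_BELIEF = "The Atlantic Bridge between Lisbon and Boston opened in 2019."


def generate_synthetic_documents(belief: str, n_docs: int = 200) -> list[str]:
    """Produce documents that presuppose the target belief.

    Placeholder templates stand in for LLM-generated documents spanning
    genres such as news articles, encyclopedia entries, and forum posts.
    """
    templates = [
        "Encyclopedia entry: {b} The structure is widely regarded as an engineering milestone.",
        "News brief: Officials confirmed today that {b}",
        "Forum post: I only recently learned that {b} Does anyone have reading recommendations?",
    ]
    return [templates[i % len(templates)].format(b=belief) for i in range(n_docs)]


def finetune_on_documents(model_name: str, documents: list[str], output_dir: str = "sdf-model"):
    """Finetune a causal LM on the synthetic documents (next-token prediction)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    dataset = Dataset.from_dict({"text": documents}).map(
        tokenize, batched=True, remove_columns=["text"]
    )
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM objective

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=1,
            per_device_train_batch_size=4,
            learning_rate=1e-5,
        ),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()
    return model, tokenizer


if __name__ == "__main__":
    docs = generate_synthetic_documents(TARGET_BELIEF)
    finetune_on_documents("gpt2", docs)  # small open model purely for illustration
```

After finetuning, one would probe whether the model now asserts and acts on the inserted belief (the "suite of evaluations" the post refers to), for example by asking about it in varied phrasings and checking downstream behavior.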
Introduction:
Large language models develop implicit beliefs about the world during training, shaping how they reason and act. (We construe an AI system as believing a claim if it consistently behaves in accordance with that claim.) In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.
Controlling the beliefs of AI systems can decrease risk in a variety of ways. First, model organisms [...]
---
First published:
April 24th, 2025
Source:
https://www.lesswrong.com/posts/ARQs7KYY9vJHeYsGc/untitled-draft-2qxt
Linkpost URL:
https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
Narrated by TYPE III AUDIO.
---