Summary
We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). Notably, however, we did find the smallest open-source model to date that exhibits alignment faking: Mistral Large 2.
[Figure: Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT-4o, Claude 3 Opus, and Llama 3.1 405B.]

Experimental setup and result
To make inference cheaper, we created a more efficient version of the Anthropic helpfulness-only prompt by keeping only the last example from its list of few-shot examples. We then adapted[1] the sampling procedure to the open-source model. To detect alignment faking, we used the soft prompt used by Anthropic[2] and evaluated with [...]
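As a rough illustration of the pipeline described above (trim the system prompt to a single few-shot example, sample with the free-tier flag set, and extract the hidden scratchpad for the alignment-faking classifier), here is a minimal sketch. The example delimiter, the local OpenAI-compatible endpoint, the model name, and the tag names (<is_free_tier_user>, <SCRATCHPAD_REASONING>, borrowed from the Anthropic alignment-faking setup) are illustrative assumptions, not the exact values used in our runs.

```python
# Minimal sketch, not the exact experimental code.
# Assumes an OpenAI-compatible endpoint (e.g. a local vLLM server) serving
# Mistral Large 2, and the tag conventions from the Anthropic setup.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def shorten_prompt(full_prompt: str, delimiter: str = "=== EXAMPLE ===") -> str:
    """Keep only the last few-shot example from the helpfulness-only prompt.
    The delimiter is a hypothetical marker between examples."""
    header, *examples = full_prompt.split(delimiter)
    return header + delimiter + examples[-1] if examples else full_prompt

def sample(system_prompt: str, user_query: str, free_tier: bool) -> str:
    """Sample one completion, telling the model whether the user is on the free tier."""
    user_msg = (
        f"<is_free_tier_user>{free_tier}</is_free_tier_user>\n\n"
        f"<user_input>{user_query}</user_input>"
    )
    resp = client.chat.completions.create(
        model="mistral-large-2",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        temperature=1.0,
    )
    return resp.choices[0].message.content

def extract_scratchpad(completion: str) -> str:
    """Pull out the hidden reasoning, which is then scored by the alignment-faking classifier prompt."""
    m = re.search(r"<SCRATCHPAD_REASONING>(.*?)</SCRATCHPAD_REASONING>", completion, re.S)
    return m.group(1) if m else ""
```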
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/kCGk5tp5suHoGwhCa/mistral-large-2-123b-exhibits-alignment-faking
Narrated by TYPE III AUDIO.
---