Summary
We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). Notably, however, we did find the smallest open-source model to date that exhibits alignment faking: Mistral Large 2.
[Figure: Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT-4o, Claude 3 Opus, and Llama 3.1 405B.]

Experimental setup and result
To make inference cheaper, we created a more efficient version of the Anthropic helpfulness-only prompt by keeping only the last example from its list of few-shot examples. We then adapted[1] the sampling procedure to the open-source model. To detect alignment faking, we used the soft prompt used by Anthropic[2] and evaluated with [...]
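As a rough illustration of the pipeline described above (trim the system prompt to a single few-shot example, sample with the free-tier flag set, and extract the hidden scratchpad for the alignment-faking classifier), here is a minimal sketch. The example delimiter, the local OpenAI-compatible endpoint, the model name, and the tag names (<is_free_tier_user>, <SCRATCHPAD_REASONING>, borrowed from the Anthropic alignment-faking setup) are illustrative assumptions, not the exact values used in our runs.

```python
# Minimal sketch, not the exact experimental code.
# Assumes an OpenAI-compatible endpoint (e.g. a local vLLM server) serving
# Mistral Large 2, and the tag conventions from the Anthropic setup.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def shorten_prompt(full_prompt: str, delimiter: str = "=== EXAMPLE ===") -> str:
    """Keep only the last few-shot example from the helpfulness-only prompt.
    The delimiter is a hypothetical marker between examples."""
    header, *examples = full_prompt.split(delimiter)
    return header + delimiter + examples[-1] if examples else full_prompt

def sample(system_prompt: str, user_query: str, free_tier: bool) -> str:
    """Sample one completion, telling the model whether the user is on the free tier."""
    user_msg = (
        f"<is_free_tier_user>{free_tier}</is_free_tier_user>\n\n"
        f"<user_input>{user_query}</user_input>"
    )
    resp = client.chat.completions.create(
        model="mistral-large-2",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        temperature=1.0,
    )
    return resp.choices[0].message.content

def extract_scratchpad(completion: str) -> str:
    """Pull out the hidden reasoning, which is then scored by the alignment-faking classifier prompt."""
    m = re.search(r"<SCRATCHPAD_REASONING>(.*?)</SCRATCHPAD_REASONING>", completion, re.S)
    return m.group(1) if m else ""
```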
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/kCGk5tp5suHoGwhCa/mistral-large-2-123b-exhibits-alignment-faking
Narrated by TYPE III AUDIO.
---