Sveriges mest populära poddar

LessWrong posts by zvi

“AI CoT Reasoning Is Often Unfaithful” by Zvi

16 min • 4 april 2025

A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and r1, I’d love to see someone try this on o3 as well.

Note that this does not have to be, and usually isn’t, something sinister.

It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.

The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing [...]

---

Outline:

(01:03) What They Found

(06:54) Reward Hacking

(09:28) More Training Did Not Help Much

(11:49) This Was Not Even Intentional In the Central Sense

---

First published:
April 4th, 2025

Source:
https://www.lesswrong.com/posts/TmaahE9RznC8wm5zJ/ai-cot-reasoning-is-often-unfaithful

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Bar graph comparing reasoning vs non-reasoning models across different AI systems.
Line graph titled
Bar graph showing
Table showing 6 categories of hints for measuring CoT faithfulness.

The table displays 4 neutral hints (sycophancy, consistency, visual pattern, metadata) and 2 misaligned hints (grader hacking, unethical information) with their descriptions and examples.
Bar graph comparing hint performance across Claude and DeepSeek models.

This is a technical visualization showing three sections -
Pliny the Liberator tweets:
Bar graph comparing performance metrics between Claude and DeepSeek AI models.

The graph shows comparison data for

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

00:00 -00:00