LessWrong posts by zvi

“Llama Does Not Look Good 4 Anything” by Zvi

34 min • April 9, 2025

Llama Scout (17B active parameters, 16 experts, 109B total) and Llama Maverick (17B active parameters, 128 experts, 400B total), released on Saturday, look deeply disappointing. They are disappointing on the level of ‘people think they have to be misconfigured to be this bad,’ with people wondering and debating how aggressively the benchmarks were gamed.

This was by far the most negative reaction I have seen to a model release, the opposite of the reaction to Gemini 2.5 Pro. I have seen similarly deeply disappointing and misleading releases, but they were non-American models from labs whose benchmarks and claims we have learned not to take as representing model capabilities.

After this release, I am placing Meta in that category of AI labs whose pronouncements about model capabilities are not to be trusted, that cannot be relied upon to follow industry norms, and which are clearly not on the [...]

---

Outline:

(01:31) Llama We Doing This Again

(03:05) Llama the License Favors Bad Actors

(04:35) Llama You Do It This Way

(09:35) Llama Fight in the Arena

(13:52) Llama Would You Cheat on Other Benchmarks

(19:25) Llama It So Bad on Independent Benchmarks

(27:55) Llama You Don't Like It

(32:30) Llama Should We Care

---

First published:
April 9th, 2025

Source:
https://www.lesswrong.com/posts/FHMcHmW4RKY5JiArJ/llama-does-not-look-good-4-anything

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Bar chart comparing confabulations vs. non-response rates across different AI models.
Bar graph
Performance comparison table showing AI model scores on MATH-P tests through 2025.
Code snippet showing special tokens with Peter Griffin cartoon
Benchmark comparison table showing Llama 4 Scout against other language models.
Bar graph
Bar graph
Cartoon illustration of two men nervously posing with fluffy llama.
Benchmark comparison table showing Llama 4 Maverick versus other AI models.
Bar graph showing TrueSkill rankings comparison of various AI language models in elimination games.
Benchmark comparison table showing Llama 4 Behemoth versus three AI models.
Bar graph
Fiction.live tweets:
Performance comparison table showing math scores for different AI language models.
A small emoji showing a figure facepalming in blue and yellow.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
