“FrontierMath Score of o3-mini Much Lower Than Claimed” by YafahEdelman

1 min • March 18, 2025

OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] measured a score of only 11%.

There are a few reasons to trust Epoch's score over OpenAI's:

Epoch built the benchmark and has better incentives.
OpenAI reported a 28% score on the hardest of the three problem tiers - suspiciously close to their overall score.
Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published close to none.
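
The second point can be made concrete with a little arithmetic. The sketch below assumes the overall score is the fraction of all problems solved, so that it is a weighted average of the per-tier scores. The tier weights used here are hypothetical placeholders, not figures from OpenAI or Epoch, and the function name is made up for illustration.

```python
def implied_other_tiers(overall: float, hardest: float, hardest_weight: float) -> float:
    """Average score on the non-hardest tiers implied by the reported numbers.

    Assumes overall = w * hardest + (1 - w) * others, where w is the share
    of benchmark problems in the hardest tier (a hypothetical input here).
    """
    if not 0.0 < hardest_weight < 1.0:
        raise ValueError("hardest_weight must be strictly between 0 and 1")
    return (overall - hardest_weight * hardest) / (1.0 - hardest_weight)


# Reported by OpenAI according to the post: 32% overall, 28% on the hardest tier.
# The weights below are illustrative guesses only.
for w in (0.10, 0.25, 0.50):
    others = implied_other_tiers(0.32, 0.28, w)
    print(f"hardest-tier share {w:.0%}: implied score on easier tiers {others:.1%}")
```

Under any of these assumed weights, the implied score on the easier tiers comes out only a few points above the hardest-tier score, which is the pattern the post calls suspicious.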

[1] Which had Python access.

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 17th, 2025

Source:
https://www.lesswrong.com/posts/z8zPL2hBqTmx7Kf6J/frontiermath-score-of-o3-mini-much-lower-than-claimed

---

Narrated by TYPE III AUDIO.