LessWrong (30+ Karma)

“FrontierMath Score of o3-mini Much Lower Than Claimed” by YafahEdelman

1 min • 18 March 2025

OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] measured only 11%.

There are a few reasons to trust Epoch's score over OpenAI's:

  • Epoch built the benchmark and has better incentives.
  • OpenAI reported a 28% score on the hardest of the three problem tiers - suspiciously close to their overall score.
  • Epoch has published quite a bit of information about its testing infrastructure and data, whereas OpenAI has published almost nothing.

[1] Which had Python access.
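
The "suspiciously close" point can be made concrete with a little arithmetic. The sketch below assumes the overall score is a problem-count-weighted average of the per-tier scores and that both OpenAI figures come from the same run; neither assumption is confirmed by the post. The hardest tier's share of the benchmark is not given in the post, so the shares tried here are purely hypothetical, as is the helper name `implied_easier_tier_score`.

```python
def implied_easier_tier_score(overall: float, hardest: float, hardest_share: float) -> float:
    """Average score on the two easier tiers implied by the overall score.

    Assumes overall = hardest_share * hardest + (1 - hardest_share) * easier,
    i.e. a weighted average by share of problems (an assumption, not from the post).
    """
    if not 0.0 <= hardest_share < 1.0:
        raise ValueError("hardest_share must be in [0, 1)")
    return (overall - hardest_share * hardest) / (1.0 - hardest_share)


OVERALL = 0.32  # OpenAI's reported overall score (from the post)
HARDEST = 0.28  # OpenAI's reported hardest-tier score (from the post)

# Hypothetical shares of the benchmark made up by the hardest tier.
for share in (0.10, 0.25, 0.50):
    implied = implied_easier_tier_score(OVERALL, HARDEST, share)
    print(f"hardest-tier share {share:.0%}: easier tiers would average {implied:.1%}")
```

Under these assumptions, the two easier tiers would have to average only a few points above the hardest tier for any of the shares tried, which is the oddity the bullet points at.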

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 17th, 2025

Source:
https://www.lesswrong.com/posts/z8zPL2hBqTmx7Kf6J/frontiermath-score-of-o3-mini-much-lower-than-claimed

---

Narrated by TYPE III AUDIO.
