OpenAI reports that o3-mini with high reasoning and a Python tool scores 32% on FrontierMath. However, Epoch's official evaluation[1] found a score of only 11%.
There are a few reasons to trust Epoch's score over OpenAI's:
[1] Epoch's evaluation had Python access.
The original text contained 1 footnote which was omitted from this narration.
---
First published: March 17th, 2025
Narrated by TYPE III AUDIO.