Research · LLM · Benchmarking · Expert Curation · OpenAI
Researchers Release Humanity's Last Exam Benchmark
Relevance Score: 8.9
An international consortium released Humanity's Last Exam (HLE) in early 2025: a 2,500-question benchmark of expert-crafted short-answer and multiple-choice items spanning mathematics, the humanities, and the natural sciences, designed to be unambiguous yet difficult for large language models. Leading systems initially scored in the single digits, with GPT-5 later reaching about 25 percent. HLE aims to track frontier-model expertise, though it measures task performance rather than general intelligence.


