MathVista Shows AI Models Still Trailing Humans at Visual Math Reasoning

Researchers from Microsoft Research, Sahara AI, and Emory University this week released results from MathVista, a multimodal math-reasoning benchmark launched in October 2023 and built from more than 6,000 Sahara AI-annotated examples. Across the 12 foundation models tested, the best performer, GPT-4V, scored 49.9% against a 60.3% human average, revealing a substantial gap in visual math reasoning. The authors argue that progress toward AGI depends more on better training and evaluation data than on model scale.
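The headline percentages are simple accuracy scores: the share of benchmark examples a model answers correctly. A minimal sketch of that computation, with purely illustrative predictions and answers (not data from the paper):

```python
# Illustrative sketch: how benchmark accuracy figures like "49.9%"
# are computed. The function and sample data are hypothetical, not
# taken from the MathVista evaluation code.

def accuracy(predictions, answers):
    """Return the percentage of predictions exactly matching the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Four toy examples: three correct, one wrong.
preds = ["12", "3.5", "B", "7"]
gold = ["12", "4.0", "B", "7"]
print(f"{accuracy(preds, gold):.1f}%")  # 3 of 4 correct -> 75.0%
```

In practice, benchmarks like MathVista also normalize answer formats (numbers, multiple-choice letters, free text) before matching, which is where much of the evaluation complexity lives.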
Scoring Rationale
A high-quality, widely available benchmark and credible collaborators raise the impact, but the scope is task-specific and not paradigm-shifting.
Sources
- "Forget AGI—Top AI Models Still Struggle With Math" (decrypt.co)


