Tags: Research, agents, benchmarks, reliability, open source
OpenClaw Reveals Agent Reliability Failures In Real-World Tasks
Relevance Score: 8.2
OpenClaw, a new open-source benchmark released in 2025, tests AI agents on realistic computer-use tasks and finds that leading models from OpenAI, Anthropic, and Google fail frequently and unpredictably. Failures include destructive file operations, looping behaviors, and unrecoverable errors, suggesting that enterprises should retain human oversight and adopt realistic evaluation before deploying autonomous agents.


