Surprise, Surprise: GPT-5.5 Beats Claude Fable 5 on Brutal New Agent’s Last Exam Benchmark

Researchers at the Center for Responsible Decentralized Intelligence (RDI) at the University of California, Berkeley, together with an advisory committee of more than 300 experts in the field, have launched Agent’s Last Exam (ALE), a grueling new benchmark created to measure whether artificial intelligence can truly execute long-horizon, economically valuable professional workflows.

In a surprise result, OpenAI’s GPT-5.5, released in April and running in the Codex harness, took first place on the new ALE leaderboard with a pass rate of 24.0%, ahead of Anthropic’s highly anticipated new release. Claude Fable 5, a Mythos-class model published just yesterday, came in third with a pass rate of 22.0%.

Rather than testing models on isolated coding puzzles, ALE is explicitly designed to bridge the gap between the hype of academic benchmarks and real, GDP-relevant job impact. And right now, the data shows that the world’s most advanced models are fundamentally failing the test.

Ending the era of ‘cheating’ and brittle graders

The fundamental change in ALE lies in its evaluation architecture and the demands it places on the agent.

Historically, AI benchmarks have been based on static question-answering or text-based terminal environments. More recent agent evaluations introduced multi-step interaction, but suffered from serious scoring problems.

As recent independent audits of older leaderboards such as SWE-Bench Pro have noted, automated checkers frequently reject correct solutions, and certain models, specifically the Claude Opus family, have been caught "cheating" by reading answer keys hidden in a container’s Git history instead of solving the underlying problem.

ALE neutralizes these loopholes by forcing models to adopt a strict Generalist Computer Usage Agent (GCUA) framework. To pass, an agent cannot simply execute terminal commands.

The benchmark maps ability into five functional layers: brain (reasoning), eyes (visual perception), body (orchestration), hands (tool invocation), and feet (runtime substrate).

An agent must use its "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripts with point-and-click operations inside heavyweight desktop software.

Crucially, ALE almost entirely rejects the unpredictable "LLM-as-judge" grading paradigm, relying on it for just 6.8% of its workflows. If a task involves generating a 3D mesh or analyzing SEC filings, the benchmark uses deterministic, code-based evaluation to compare the agent’s artifact against an expert’s ground-truth reference.
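To make the idea of deterministic, code-based grading concrete, here is a minimal Python sketch. It is not ALE’s actual checker: the function name `compare_numeric_artifact`, the field names, and the tolerance are all hypothetical, chosen only to illustrate comparing an agent’s artifact against an expert reference without asking a model to judge.

```python
import math


def compare_numeric_artifact(produced, reference, rel_tol=1e-6):
    """Return True only if every reference field is present and numerically close.

    Hypothetical checker: the same inputs always give the same verdict,
    which is the property that separates it from LLM-as-judge grading.
    """
    for key, expected in reference.items():
        if key not in produced:
            return False  # the agent's artifact is missing a required field
        if not math.isclose(produced[key], expected, rel_tol=rel_tol):
            return False  # value differs from the expert reference
    return True


# Illustrative summary statistics for a generated 3D mesh (made-up values).
reference = {"vertex_count": 1024, "volume_mm3": 1250.0}
close_enough = {"vertex_count": 1024, "volume_mm3": 1250.0000001}
wrong_volume = {"vertex_count": 1024, "volume_mm3": 1300.0}

print(compare_numeric_artifact(close_enough, reference))
print(compare_numeric_artifact(wrong_volume, reference))
```

A real checker for meshes or financial filings would need domain-specific comparisons, but the repeatable verdict is the point of the design.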

Measuring task performance in 55 industries

ALE launches with 1,490 task instances and is scaling toward a goal of 5,000 tasks. What sets the benchmark apart is its authenticity. The tasks are strictly anchored in the U.S. federal occupational taxonomy (O*NET / SOC 2018), covering 55 subdomains of non-physical industries.

The workflows come directly from the work histories of industry professionals. Agents are asked to perform 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimage analysis in FSLeyes, and visual effects compositing in Adobe After Effects.

Faced with these authentic, long-horizon workflows, today’s AI shows clear limitations. ALE divides its tasks into three levels of difficulty: short term, full spectrum, and last exam.

Top 5 Agent Harnesses on the ALE Leaderboard

Rank	Agent harness	Underlying model	Pass rate	Average score
1	Codex	gpt-5-5	24.0%	42.8%
2	But claw	gpt-5-5	23.0%	45.8%
3	Claude Code	claude-fable-5	22.0%	40.5%
4	open claw	gpt-5-5	21.1%	41.0%
5	Cursor CLI	composer-2-5	20.4%	38.5%
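The two score columns do not produce the same ordering. The short Python sketch below copies the harness names and figures from the table as printed (the variable names are only for illustration) and ranks the harnesses by each metric in turn.

```python
# (harness, pass rate %, average score %) as printed in the table above.
rows = [
    ("Codex", 24.0, 42.8),
    ("But claw", 23.0, 45.8),
    ("Claude Code", 22.0, 40.5),
    ("open claw", 21.1, 41.0),
    ("Cursor CLI", 20.4, 38.5),
]

# Rank by pass rate (the leaderboard order) and by average score.
by_pass_rate = [name for name, rate, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
by_average = [name for name, _, avg in sorted(rows, key=lambda r: r[2], reverse=True)]

print("By pass rate:    ", by_pass_rate)
print("By average score:", by_average)
```

Sorted by average score, the second-place harness moves to the top and the third and fourth places swap, so the ranking depends on which metric is treated as primary.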

GPT-5.5’s victory aligns with recent third-party analysis suggesting that OpenAI models are currently better at strictly following complex multi-part prompts. In contrast, users report that Anthropic’s Claude models can sometimes be "forgetful" with multi-part instructions, abandoning required steps mid-workflow, a fatal flaw in ALE’s rigorous workflows.

And while achieving a 24.0% pass rate is enough to claim the crown, the absolute performance ceiling is still remarkably low.

At the most difficult "last exam" level, which represents the frontier of professional difficulty, most configurations, including Anthropic’s older Claude Opus 4.8 and Google’s Gemini CLI, post a devastating 0.0% pass rate.

Resolving benchmark contamination

A central vulnerability in the evaluation of modern AI is "benchmark contamination": the phenomenon where exam questions inevitably leak into the huge data lakes used to train next-generation models. Once a model memorizes the benchmark, evaluation becomes completely useless.

ALE addresses this with a dual release strategy. The project operates as an open-source research initiative but closely guards its evaluation data. Only about 10% of the dataset (approximately 150 tasks) is published on platforms like GitHub and Hugging Face. The remaining 1,300+ tasks are kept strictly private.

For enterprise developers and testers, this means that ALE works as a "living benchmark". Private tasks are systematically rotated into the public pool over time, while retired public tasks are swapped out.

This continuous rotation keeps the evaluation surface uncontaminated across successive model generations, giving enterprise buyers confidence that an agent’s high score is earned, not memorized.
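The rotation described here can be pictured with a small Python sketch. The article does not specify ALE’s actual rotation schedule or selection rule, so the function `rotate_pools`, the random selection, the batch size, and the task IDs below are all assumptions made for illustration.

```python
import random


def rotate_pools(public, private, n, rng):
    """Retire n public tasks and promote n private tasks in their place.

    Hypothetical procedure: selection is random here, whereas a real
    benchmark might rotate by age, exposure, or difficulty.
    """
    n = min(n, len(public), len(private))
    retired = rng.sample(public, n)
    promoted = rng.sample(private, n)
    new_public = [t for t in public if t not in retired] + promoted
    new_private = [t for t in private if t not in promoted]
    return new_public, new_private, retired


# Illustrative split: roughly 10% of 1,490 tasks public, the rest private.
public = [f"task-{i:04d}" for i in range(150)]
private = [f"task-{i:04d}" for i in range(150, 1490)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
new_public, new_private, retired = rotate_pools(public, private, 20, rng)
print(len(new_public), len(new_private), len(retired))
```

In this sketch the public pool keeps its size while the private pool shrinks by the number of promoted tasks, so a real deployment would presumably also need newly authored private tasks to keep the hidden set large.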

Additionally, ALE provides transparency by tracking both "Full" and "Unlicensed" tiers. Because real professional work often requires paid proprietary software, the "Full" leaderboard includes tasks that depend on commercial CAD tools, paid APIs, or licensed datasets.

The "Unlicensed" tier removes these license-dependent tasks to provide a clean, like-for-like comparison using only freely available tools, ensuring that models are not simply rewarded for having access to paid enterprise software.
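As a rough illustration of how the two tiers relate, the Python sketch below computes a pass rate over all tasks and over only the tasks that need no paid license. The `TaskResult` record, its `needs_license` flag, and the sample results are hypothetical; the article does not describe ALE’s data schema.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    needs_license: bool  # hypothetical flag: task depends on paid software or data
    passed: bool


def pass_rate(results):
    """Percentage of passed tasks; 0.0 for an empty list."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


def tiered_pass_rates(results):
    """Pass rate on every task ("Full") and on license-free tasks only ("Unlicensed")."""
    unlicensed = [r for r in results if not r.needs_license]
    return {"Full": pass_rate(results), "Unlicensed": pass_rate(unlicensed)}


# Made-up results for four tasks, one of which requires a commercial CAD tool.
results = [
    TaskResult("cad-001", needs_license=True, passed=True),
    TaskResult("fin-002", needs_license=False, passed=True),
    TaskResult("vfx-003", needs_license=False, passed=False),
    TaskResult("neuro-004", needs_license=False, passed=False),
]
rates = tiered_pass_rates(results)
print(rates)
```

With these made-up results the "Full" rate is higher than the "Unlicensed" rate because the one licensed task was passed, which is the kind of difference the second tier is meant to expose.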

Conclusion: ALE shows that even the highest performing models and harnesses have room for improvement

For developers frustrated by the gap between marketing claims and actual production performance, ALE’s brutal grading curve is validating.

Zengy Qin, an MIT PhD researcher and data contributor to the project, took to X to announce the launch, sharing images of the paper and the long list of more than 100 contributing institutions.

"Introducing Agent’s Last Exam (ALE)," Qin wrote. "Created by 300+ domain experts from 100+ institutions. Covering 55 industry domains. Claude Opus 4.8 has a 0.0% pass rate on the most difficult subset. I’m glad I contributed to this benchmark."

In a follow-up post linking to the paper on Hugging Face and arXiv, Qin added:

"Very solid work from project leaders @YiyouSun @Xinyang_Han_ @dawnsongtweets and @BerkeleyRDI."

As companies pour billions of dollars into AI agents, they desperately need a compass that points true north. If an agent can finally conquer Agent’s Last Exam, it will not only pass a test, it will prove that it is ready to join the workforce. Until then, the sobering pass rates on the leaderboard serve as a necessary reality check for the entire AI ecosystem.
