A new industry-designed benchmark shows AI still fails most real-world, long-horizon professional tasks: across 1,000+ GDP-relevant workflows, mainstream systems fully pass only 2.6% of the hardest challenges. Agents' Last Exam (ALE), mapped to O*NET occupational categories and built with 250+ experts, aims to shift evaluation toward measurable economic impact.

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Zhu Chenyu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, 
Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, AndyZeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lv, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Qiu Shi, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Benjamin Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, Ren He, Zhenyu He, Qiao Jin, Lang Lang, Yuetai Li, Sylvia Liu, Lu Lu, Qing Lu, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Dawn Song · June 03, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 8/10

Agents' Last Exam (ALE) is a living benchmark of 1,000+ long-horizon, economically valuable non-physical tasks—built with 250+ industry experts and anchored to O*NET/SOC—that finds current mainstream AI systems achieve just a 2.6% full pass rate on the hardest tier.

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Summary

Main Finding

Agents’ Last Exam (ALE) is a large, expert-sourced benchmark that measures whether modern AI agents can complete long‑horizon, software‑mediated, economically valuable professional workflows with verifiable outcomes. ALE reveals a large evaluation gap: state‑of‑the‑art generalist computer‑use agents perform well on many prior microbenchmarks but score very low on ALE’s hardest, economically realistic tasks (average full pass rate ≈ 2.6%; even the strongest tested configuration is <50% on the easiest tier and <10% on the hardest). The paper argues that closing this evaluation gap is necessary for benchmark progress to translate into measurable GDP‑relevant impact.

Key Points

Scope and scale
- 1,490 runnable task instances covering 55 subdomains grouped into 13 industry clusters, against an overall design target of 1K+ workflows.
- Created in collaboration with 250+ domain experts.
- Tasks span many non‑physical, software‑mediated professional domains (engineering, life sciences, visual/media arts, business & finance, etc.).
Task selection principles
- Representativeness: tasks use domain‑standard software and reflect real professional practice.
- Complexity: tasks are end‑to‑end deliverables (days/weeks of work), not single UI edits.
- Verifiability: outcomes must permit deterministic checks or unambiguous rubrics tied to observable artifacts.
Construction & curation
- Expert submissions pass through a staged QC pipeline: first‑pass review, engineering implementation, dry‑runs, and peer review by advisory committees.
- 1,490 instances include 960 external submissions and 530 commissioned tasks; only ~150 (≈10%) are public to prevent contamination.
- Tasks are rotated into/out of the public set to maintain an uncontaminated evaluation surface over time.
Evaluation design
- Standardized around deliverable/milestone checks; each task exposes load(), start(), evaluate() and returns a score in [0,1].
- Execution occurs in remote VMs with an input/, software/, output/, reference/ directory contract.
- Decouples task spec, agent (harness + model), and environment so diverse agents can be evaluated.
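
The three-call task interface described above could be sketched roughly as follows. This is a minimal illustration, not ALE's actual code: the class names `Task` and `FileMatchTask`, the method signatures, and the exact-match scoring rule are all assumptions. Only the method names load(), start(), evaluate() and the [0,1] score range come from the source.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Task(ABC):
    """Hypothetical base class; only the three method names come from the text."""

    @abstractmethod
    def load(self) -> str:
        """Return the task description handed to the agent."""

    @abstractmethod
    def start(self, workdir: Path) -> None:
        """Put the environment into its start state."""

    @abstractmethod
    def evaluate(self, workdir: Path) -> float:
        """Score the deliverable; must return a value in [0, 1]."""


class FileMatchTask(Task):
    """Toy task (assumed): write an exact expected string to output/result.txt."""

    def __init__(self, expected: str) -> None:
        self.expected = expected

    def load(self) -> str:
        return "Write the expected report text to output/result.txt."

    def start(self, workdir: Path) -> None:
        (workdir / "output").mkdir(parents=True, exist_ok=True)

    def evaluate(self, workdir: Path) -> float:
        result = workdir / "output" / "result.txt"
        if not result.exists():
            return 0.0  # no deliverable, no credit
        return 1.0 if result.read_text().strip() == self.expected else 0.0


def run_demo(agent_writes_output: bool = True) -> float:
    """Run the toy task with a stand-in 'agent' that may or may not deliver."""
    task = FileMatchTask(expected="total=42")
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.load()
        task.start(workdir)
        if agent_writes_output:
            (workdir / "output" / "result.txt").write_text("total=42\n")
        return task.evaluate(workdir)
```

Keeping the scoring inside the task object is what lets the agent and the environment be swapped independently: the harness only needs to call the three methods in order.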
Agent target and architecture
- Target subject: Generalist Computer‑Use Agent (GCUA) able to perceive GUIs, execute code/CLI, use tools, and plan long horizons.
- Functional decomposition: Brain (LLM reasoning), Eyes (GUI perception), Body (orchestration), Hands (tool invocation), Feet (runtime).
Empirical findings
- ALE is far from saturated by current agents; mainstream agents (including Claude Code and Codex + GPT‑5.5, among others) have very low full‑pass rates on the hard tasks.
- Coverage comparison shows many prior benchmarks leave subdomains uncovered; ALE aims to fill that gap.

Data & Methods

Taxonomy grounding
- ALE’s 13 domains / 55 subdomains are derived from O*NET / SOC 2018 occupational taxonomy to align tasks with real occupations and workflows.
Task creation pipeline
- Web portal for expert upload of authentic past projects (description, input files, software, expected deliverable, evaluation spec).
- Five‑gate QC: expert submission → first‑pass review → engineer implementation & dry‑run → QC committee peer review → admission.
- Emphasis on using the actual domain software stack (GUI apps + CLI) and real input data.
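
The five-gate QC sequence listed above can be read as an ordered pipeline in which a submission stops at the first gate it fails. The sketch below is one hypothetical way to model that; the enum, the function names, and the boolean per-gate result are assumptions, and only the gate names and their order come from the source.

```python
from enum import IntEnum
from typing import Dict, Optional


class Gate(IntEnum):
    """Gate names and order as listed in the text."""
    EXPERT_SUBMISSION = 1
    FIRST_PASS_REVIEW = 2
    ENGINEER_IMPLEMENTATION_AND_DRY_RUN = 3
    QC_COMMITTEE_PEER_REVIEW = 4
    ADMISSION = 5


def last_gate_passed(results: Dict[Gate, bool]) -> Optional[Gate]:
    """Walk the gates in order and stop at the first failure (assumed policy)."""
    passed = None
    for gate in Gate:
        if not results.get(gate, False):
            break
        passed = gate
    return passed


def is_admitted(results: Dict[Gate, bool]) -> bool:
    """A task enters the benchmark only if every gate, through admission, passed."""
    return last_gate_passed(results) is Gate.ADMISSION


# Made-up example: a submission whose dry-run fails never reaches peer review.
example = {gate: True for gate in Gate}
example[Gate.ENGINEER_IMPLEMENTATION_AND_DRY_RUN] = False
```

In this made-up example, `last_gate_passed(example)` stops at `Gate.FIRST_PASS_REVIEW` even though later gates are marked as passed, because the gates are strictly sequential.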
Task instance mechanics
- Each instance is implemented as an executable task spec (main.py) with deterministic start state and evaluate() that compares agent outputs to references/rubrics.
- Runtime environment: remote VM, canonical directory layout (input/, software/, output/, reference/), screenshots/shell outputs available to agent per action loop.
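
As a rough illustration of the directory contract and of one possible deterministic check, the sketch below verifies that the four directories exist and scores the run by the fraction of reference files reproduced byte-for-byte in output/. The hash-comparison rule and all function names are assumptions made for illustration; the source also allows rubric-based checks, which this sketch does not cover.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List

# Directory names as given in the text.
REQUIRED_DIRS = ("input", "software", "output", "reference")


def missing_dirs(root: Path) -> List[str]:
    """Return the required directories that are absent under root."""
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def fraction_matching(root: Path) -> float:
    """Fraction of reference files with an identical file at the same path in output/."""
    reference = root / "reference"
    refs = sorted(p for p in reference.rglob("*") if p.is_file())
    if not refs:
        return 0.0
    hits = 0
    for ref in refs:
        candidate = root / "output" / ref.relative_to(reference)
        if candidate.is_file() and sha256_of(candidate) == sha256_of(ref):
            hits += 1
    return hits / len(refs)


def demo() -> float:
    """Build a throwaway task tree with one matching and one differing deliverable."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in REQUIRED_DIRS:
            (root / name).mkdir()
        (root / "reference" / "a.csv").write_text("x,y\n1,2\n")
        (root / "reference" / "b.csv").write_text("x,y\n3,4\n")
        (root / "output" / "a.csv").write_text("x,y\n1,2\n")  # matches reference
        (root / "output" / "b.csv").write_text("x,y\n3,5\n")  # differs from reference
        assert missing_dirs(root) == []
        return fraction_matching(root)
```

Here `demo()` returns 0.5, since one of the two reference files is reproduced exactly. Byte-level comparison is the strictest possible check; real deliverables such as rendered media or floating-point outputs would likely need tolerance-based comparisons instead.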
Agent interaction and harness
- Agents interact via an action loop that can perform GUI actions (mouse/keyboard), CLI commands, file edits, API calls, and receive visual feedback.
- The harness communicates only the task description/metadata; the agent must plan and act within the provided environment until termination.
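
The action loop described above might look something like the following sketch. Everything here is a stand-in: `Observation`, `Action`, `StubEnvironment`, the action labels, and the step budget are hypothetical names chosen for illustration, and the scripted policy replaces what would be a model-driven agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    screenshot: bytes   # placeholder for the visual feedback the agent receives
    shell_output: str   # output of the most recent CLI command, if any


@dataclass
class Action:
    kind: str           # hypothetical labels: "gui", "cli", or "done"
    payload: str = ""


@dataclass
class StubEnvironment:
    """Stand-in for the remote VM; it only records actions and echoes them back."""
    history: List[Action] = field(default_factory=list)

    def step(self, action: Action) -> Observation:
        self.history.append(action)
        return Observation(screenshot=b"", shell_output=f"ran: {action.payload}")


Policy = Callable[[str, Observation], Action]


def run_episode(task_description: str, policy: Policy,
                env: StubEnvironment, max_steps: int = 50) -> int:
    """Alternate policy decisions and environment steps until 'done' or the budget ends."""
    obs = Observation(screenshot=b"", shell_output="")
    for step in range(1, max_steps + 1):
        action = policy(task_description, obs)
        if action.kind == "done":
            return step  # number of decisions taken, including the final one
        obs = env.step(action)
    return max_steps


def scripted_policy(task_description: str, obs: Observation) -> Action:
    """Toy policy: run one shell command, then declare the task finished."""
    if obs.shell_output == "":
        return Action(kind="cli", payload="ls input/")
    return Action(kind="done")
```

Note that the policy receives only the task description and the latest observation, mirroring the point that the harness passes no hidden hints; any memory or planning state would have to live inside the agent itself.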
Verification and contamination control
- Deterministic or rubric‑based automated checks minimize reliance on human judges.
- Private pool (≈90% of tasks) and rolling public release reduce pretraining/finetuning contamination; Appendix D.1 claims the public subset is representative.
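
One simple way to picture a rolling public/private split is a seeded random draw per release round. The source does not describe how ALE actually selects or rotates public tasks, so the sampling rule, the function name, and the task IDs below are all assumptions; only the pool size (1,490) and the roughly 10% public share come from the text.

```python
import random
from typing import List, Set, Tuple


def split_pool(task_ids: List[str], public_fraction: float,
               round_seed: int) -> Tuple[Set[str], Set[str]]:
    """Draw a public subset for one release round; the remainder stays private."""
    rng = random.Random(round_seed)          # seeded so a round is reproducible
    n_public = round(len(task_ids) * public_fraction)
    public = set(rng.sample(sorted(task_ids), n_public))
    private = set(task_ids) - public
    return public, private


# Hypothetical task IDs; the pool size matches the reported 1,490 instances.
pool = [f"task-{i:04d}" for i in range(1490)]
public, private = split_pool(pool, public_fraction=0.10, round_seed=1)
```

With these inputs the draw yields 149 public and 1,341 private tasks, close to the "~150 public" figure in the text. Changing `round_seed` between releases would rotate which tasks are exposed while keeping the public share fixed.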
Metrics
- Primary measure: full pass rate, aggregated from per‑task scores (binary or continuous). The reported average full pass rate on the hardest tier is ≈ 2.6% across mainstream harness/backbone configurations.
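
To make the aggregation concrete, the sketch below computes a full pass rate per configuration and then averages across configurations. It assumes that a "full pass" means a task score of exactly 1.0 and that configurations are weighted equally; the source does not spell out either rule. The configuration names and scores are invented for illustration and are not ALE results.

```python
from typing import Dict, List

FULL_PASS_THRESHOLD = 1.0  # assumption: only a perfect score counts as a full pass


def full_pass_rate(scores: List[float],
                   threshold: float = FULL_PASS_THRESHOLD) -> float:
    """Share of tasks whose score in [0, 1] reaches the full-pass threshold."""
    if not scores:
        raise ValueError("at least one task score is required")
    return sum(score >= threshold for score in scores) / len(scores)


def average_full_pass_rate(runs: Dict[str, List[float]]) -> float:
    """Unweighted mean of per-configuration full pass rates (assumed aggregation)."""
    rates = [full_pass_rate(scores) for scores in runs.values()]
    return sum(rates) / len(rates)


# Invented scores for two hypothetical harness + backbone configurations.
runs = {
    "harness_a+model_x": [1.0, 0.4, 0.0, 0.9],  # one full pass out of four
    "harness_b+model_y": [0.0, 0.5, 0.0, 0.0],  # no full passes
}
average = average_full_pass_rate(runs)
```

For this toy data the per-configuration rates are 0.25 and 0.0, so `average` is 0.125. Partial scores such as 0.9 contribute nothing under this rule, which is one reason a full pass rate can sit far below the mean task score.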
Validation
- Multi‑round human QC ensures reference correctness and sensible evaluation bounds; engineer dry‑runs check executability; expert committees validate domain fidelity.

Implications for AI Economics

Better alignment of benchmarks with GDP‑relevant work
- ALE reframes evaluation toward sustained, verifiable professional workflows. If future agents saturate ALE, that would be a stronger signal that those agents can produce economically relevant output, improving confidence in measures of AI‑driven productivity gains.
Research incentives and resource allocation
- By exposing tasks that require GUI + CLI + long‑horizon planning, ALE can reorient research and engineering effort toward capabilities more likely to produce industry deployment and economic value (tooling, multimodal perception, robust execution, long‑term planning).
Labor market and adoption forecasting
- Domain‑mapped tasks (grounded in SOC/O*NET) provide a clearer basis for estimating which occupations/workflows are automatable and at what performance thresholds, improving projections of substitution/complementarity and upskilling needs.
Firm and policymaker use
- Firms can use ALE pass rates as more realistic readiness checks for adopting AI in production workflows; policymakers can use ALE‑style evaluations when assessing regulatory or labor impacts.
Measurement & valuation of automation
- Verifiable, deliverable‑based scoring enables more direct mapping from model performance to task output value (e.g., time saved, deliverable quality) compared with QA‑style benchmarks, facilitating economic valuation and cost‑benefit analysis.
Distributional effects and sectoral heterogeneity
- Current low pass rates, and uneven coverage across domains, indicate uneven near‑term automation potential across industries. ALE can help identify sectors where AI may drive faster productivity gains vs. sectors requiring more human expertise.
Cautions and limits
- ALE focuses on non‑physical, software‑mediated tasks; it does not measure physical or purely social/interpersonal work—so GDP impact estimates must account for domains outside ALE’s scope.
- Verifiability constraints bias towards tasks with objectively checkable outputs; some valuable professional work (strategy, negotiation, high‑uncertainty research) may remain hard to capture.
- Ongoing maintenance, private pools, and contamination control are essential; benchmarking alone does not guarantee safe or equitable adoption—deployment, governance, and labor policies remain critical.

Short takeaway: ALE fills a persistent evaluation blind spot by measuring whether agents can actually perform end‑to‑end professional workflows. Its results so far suggest current agents are far from ready to drive broad GDP‑level automation in many industries, but ALE provides a structured instrument to track progress that matters for economic impact.

Assessment

Paper Type: descriptive

Evidence Strength: n/a. This paper presents a benchmark and descriptive evaluation rather than causal inference or estimations of economic effects, so there is no causal identification to assess.

Methods Rigor: high. The benchmark is developed with input from 250+ industry experts, explicitly mapped to O*NET / SOC 2018, organized into a clear task taxonomy (55 subfields, 13 clusters, 1K+ tasks), uses verifiable outcomes and multiple harness/backbone configurations, and is designed to be continuously updated. These features indicate careful, systematic construction and evaluation.

Sample: ALE covers non-physical industries defined with reference to O*NET / SOC 2018, organized into 13 industry clusters and 55 subfields comprising 1,000+ real-world, long-horizon tasks; developed in collaboration with 250+ industry experts and evaluated across mainstream harness and backbone model configurations (reporting an average full pass rate of 2.6% on the hardest tier).

Themes: productivity, adoption, human_ai_collab, innovation

Generalizability
- U.S.-centric occupational mapping (O*NET / SOC 2018) may underrepresent non-U.S. or informal-sector workflows.
- Focus on non-physical industries excludes manufacturing/physical tasks and some field-specific work.
- Task selection depends on participating experts and the onboarding process, introducing selection bias toward represented industries.
- Evaluation results depend on chosen harnesses, backbones, and task formalizations and may not reflect integrated production deployments.
- Language, cultural, and regulatory differences across countries may limit applicability of specific tasks/outcomes.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Recent AI systems have achieved strong results on a wide range of benchmarks.	Other	positive	high	performance on existing AI benchmarks	0.18
These gains have not translated into economically meaningful deployment across many professional domains.	Adoption Rate	negative	high	translation of benchmark gains into economic deployment	0.03
The gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows.	Adoption Rate	negative	high	coverage and sustained measurement of benchmarks on real workflows	0.03
This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes.	Task Completion Time	positive	high	AI agent performance on long-horizon real-world tasks (verifiable outcomes / task pass rates)	0.18
ALE was developed in collaboration with 250+ industry experts.	Other	positive	high	number of industry experts involved in development	n=250 0.18
ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy).	Other	neutral	high	scope of industries covered by the benchmark	0.18
ALE is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks.	Other	neutral	high	taxonomy breadth (subfields, clusters, number of tasks)	n=1000 0.18
Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%.	Output Quality	negative	high	average full pass rate (task success rate) on the hardest tier	average full pass rate is 2.6% 0.18
ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded.	Adoption Rate	positive	high	continuous expansion of benchmark task pool	0.03
ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.	Fiscal And Macroeconomic	positive	high	alignment of benchmark evaluation with GDP-relevant impact (economic impact of AI)	0.03