A new industry-designed benchmark shows AI still fails most real-world, long-horizon professional tasks: across 1,000+ GDP-relevant workflows, mainstream systems fully pass only 2.6% of the hardest challenges. Agents' Last Exam (ALE), mapped to O*NET occupational categories and built with 250+ experts, aims to shift evaluation toward measurable economic impact.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
Summary
Main Finding
Agents’ Last Exam (ALE) is a large, expert-sourced benchmark that measures whether modern AI agents can complete long‑horizon, software‑mediated, economically valuable professional workflows with verifiable outcomes. ALE reveals a large evaluation gap: state‑of‑the‑art generalist computer‑use agents perform well on many prior microbenchmarks but score very low on ALE’s hardest, economically realistic tasks (average full pass rate ≈ 2.6%; even the strongest tested configuration is <50% on the easiest tier and <10% on the hardest). The paper argues that closing this evaluation gap is necessary for benchmark progress to translate into measurable GDP‑relevant impact.
Key Points
- Scope and scale
  - 1,490 runnable task instances covering 55 subdomains grouped into 13 industry clusters (the overall design target is 1K+ workflows).
  - Created in collaboration with 250+ domain experts.
  - Tasks span many non-physical, software-mediated professional domains (engineering, life sciences, visual/media arts, business & finance, etc.).
- Task selection principles
  - Representativeness: tasks use domain-standard software and reflect real professional practice.
  - Complexity: tasks are end-to-end deliverables representing days or weeks of work, not single UI edits.
  - Verifiability: outcomes must permit deterministic checks or unambiguous rubrics tied to observable artifacts.
- Construction & curation
  - Expert submissions go through a staged QC pipeline: first-pass review, engineering implementation, dry runs, and peer review by advisory committees.
  - The 1,490 instances comprise 960 external submissions and 530 commissioned tasks; only ~150 (≈10%) are public, to prevent contamination.
  - Tasks are rotated into and out of the public set to maintain an uncontaminated evaluation surface over time.
- Evaluation design
  - Standardized around deliverable/milestone checks; each task exposes load(), start(), and evaluate(), and is scored in [0,1].
  - Execution occurs in remote VMs under a directory contract of input/, software/, output/, and reference/.
  - The task spec, the agent (harness + model), and the environment are decoupled so that diverse agents can be evaluated.
- Agent target and architecture
  - Target subject: a Generalist Computer-Use Agent (GCUA) able to perceive GUIs, execute code/CLI commands, use tools, and plan over long horizons.
  - Functional decomposition: Brain (LLM reasoning), Eyes (GUI perception), Body (orchestration), Hands (tool invocation), Feet (runtime).
- Empirical findings
  - ALE is far from saturated by current agents; mainstream agents (including Claude Code and Codex+GPT-5.5, among others) have very low full-pass rates on hard tasks.
  - A coverage comparison shows that many prior benchmarks leave subdomains uncovered; ALE aims to fill that gap.
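The evaluation contract described under "Evaluation design" (load(), start(), evaluate(), a score in [0,1], and the input/, software/, output/, reference/ layout) might look roughly like the sketch below. It is an illustration under stated assumptions, not ALE's actual code: the class name ExampleTask, the report.txt deliverable, and the line-matching scoring rule are all hypothetical.

```python
from pathlib import Path
import tempfile


class ExampleTask:
    """Toy task spec. Only load(), start(), evaluate() and the four
    directory names come from the summary; everything else is hypothetical."""

    DIRS = ("input", "software", "output", "reference")

    def __init__(self, root):
        self.root = Path(root)

    def load(self):
        """Create the directory contract and return the task description."""
        for name in self.DIRS:
            (self.root / name).mkdir(parents=True, exist_ok=True)
        return {"description": "Produce output/report.txt matching the reference."}

    def start(self):
        """Reset to a deterministic start state before the agent acts."""
        (self.root / "reference" / "report.txt").write_text("total=42\nunits=USD\n")
        for stale in (self.root / "output").iterdir():
            if stale.is_file():
                stale.unlink()

    def evaluate(self):
        """Return a score in [0, 1]: matching lines divided by the line
        count of the longer file (hypothetical partial-credit rule)."""
        produced = self.root / "output" / "report.txt"
        if not produced.exists():
            return 0.0
        want = (self.root / "reference" / "report.txt").read_text().splitlines()
        got = produced.read_text().splitlines()
        matched = sum(1 for w, g in zip(want, got) if w == g)
        return matched / max(len(want), len(got))


# Usage: score an empty output directory, then a partially correct deliverable.
with tempfile.TemporaryDirectory() as tmp:
    task = ExampleTask(tmp)
    task.load()
    task.start()
    before = task.evaluate()  # no deliverable yet
    (Path(tmp) / "output" / "report.txt").write_text("total=42\nunits=EUR\n")
    after = task.evaluate()   # one of two lines matches
    print(before, after)      # prints: 0.0 0.5
```

Keeping the scoring inside evaluate() and the environment reset inside start() mirrors the decoupling the summary describes: the agent only ever sees the task description and the working directories, so different harness and model combinations can be run against the same spec.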
Data & Methods
- Taxonomy grounding
  - ALE's 13 domains and 55 subdomains are derived from the O*NET / SOC 2018 occupational taxonomy to align tasks with real occupations and workflows.
- Task creation pipeline
  - A web portal lets experts upload authentic past projects (description, input files, software, expected deliverable, evaluation spec).
  - Five-gate QC: expert submission → first-pass review → engineer implementation & dry-run → QC committee peer review → admission.
  - Emphasis on using the actual domain software stack (GUI apps + CLI) and real input data.
- Task instance mechanics
  - Each instance is implemented as an executable task spec (main.py) with a deterministic start state and an evaluate() function that compares agent outputs to references or rubrics.
  - Runtime environment: a remote VM with a canonical directory layout (input/, software/, output/, reference/); screenshots and shell output are available to the agent at each step of the action loop.
- Agent interaction and harness
  - Agents interact via an action loop in which they can perform GUI actions (mouse/keyboard), run CLI commands, edit files, make API calls, and receive visual feedback.
  - The harness communicates only the task description and metadata; the agent must plan and act within the provided environment until termination.
- Verification and contamination control
  - Deterministic or rubric-based automated checks minimize reliance on human judges.
  - A private pool (≈90% of tasks) and a rolling public release reduce pretraining/finetuning contamination; Appendix D.1 claims the public subset is representative.
- Metrics
  - Primary measure: full pass rate, aggregated from per-task scores (binary or continuous). The reported average full pass rate on the hardest tier is ≈ 2.6% across mainstream harness/backbone configurations.
- Validation
  - Multi-round human QC ensures reference correctness and sensible evaluation bounds; engineer dry-runs check executability; expert committees validate domain fidelity.
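To make the headline metric concrete, the sketch below computes a full pass rate per configuration and then averages across configurations. It rests on two assumptions the summary does not confirm: that a "full pass" means a task score of exactly 1.0, and that the reported average is an unweighted mean over harness/backbone configurations. The configuration names and scores are invented for illustration.

```python
from statistics import mean

FULL_PASS = 1.0  # assumption: only a perfect score counts as a full pass


def full_pass_rate(scores):
    """Fraction of tasks with a perfect score, for one configuration."""
    if not scores:
        raise ValueError("need at least one task score")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("task scores must lie in [0, 1]")
    return sum(1 for s in scores if s >= FULL_PASS) / len(scores)


def average_full_pass_rate(results):
    """Unweighted mean of per-configuration full pass rates (assumed)."""
    return mean(full_pass_rate(scores) for scores in results.values())


# Hypothetical per-task scores for two harness/backbone configurations.
results = {
    "harness_a+model_x": [1.0, 0.4, 0.0, 0.75],
    "harness_b+model_y": [0.0, 0.9, 0.0, 0.5],
}

print(average_full_pass_rate(results))  # prints: 0.125
```

Note how partial credit (0.9, 0.75) contributes nothing here: under this reading, a configuration can accumulate substantial partial scores and still show a very low full pass rate.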
Implications for AI Economics
- Better alignment of benchmarks with GDP-relevant work
  - ALE reframes evaluation toward sustained, verifiable professional workflows. If future agents saturate ALE, that would be a stronger signal that those agents can produce economically relevant output, improving confidence in measures of AI-driven productivity gains.
- Research incentives and resource allocation
  - By exposing tasks that require GUI + CLI + long-horizon planning, ALE can reorient research and engineering effort toward capabilities more likely to produce industry deployment and economic value (tooling, multimodal perception, robust execution, long-term planning).
- Labor market and adoption forecasting
  - Domain-mapped tasks (grounded in SOC/O*NET) provide a clearer basis for estimating which occupations/workflows are automatable and at what performance thresholds, improving projections of substitution/complementarity and upskilling needs.
- Firm and policymaker use
  - Firms can use ALE pass rates as more realistic readiness checks for adopting AI in production workflows; policymakers can use ALE-style evaluations when assessing regulatory or labor impacts.
- Measurement & valuation of automation
  - Verifiable, deliverable-based scoring enables more direct mapping from model performance to task output value (e.g., time saved, deliverable quality) compared with QA-style benchmarks, facilitating economic valuation and cost-benefit analysis.
- Distributional effects and sectoral heterogeneity
  - Current low pass rates, and uneven coverage across domains, indicate uneven near-term automation potential across industries. ALE can help identify sectors where AI may drive faster productivity gains vs. sectors requiring more human expertise.
- Cautions and limits
  - ALE focuses on non-physical, software-mediated tasks; it does not measure physical or purely social/interpersonal work, so GDP impact estimates must account for domains outside ALE's scope.
  - Verifiability constraints bias towards tasks with objectively checkable outputs; some valuable professional work (strategy, negotiation, high-uncertainty research) may remain hard to capture.
  - Ongoing maintenance, private pools, and contamination control are essential; benchmarking alone does not guarantee safe or equitable adoption, so deployment, governance, and labor policies remain critical.
Short takeaway: ALE fills a persistent evaluation blind spot by measuring whether agents can actually perform end‑to‑end professional workflows. Its results so far suggest current agents are far from ready to drive broad GDP‑level automation in many industries, but ALE provides a structured instrument to track progress that matters for economic impact.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Recent AI systems have achieved strong results on a wide range of benchmarks. [Other] | positive | high | performance on existing AI benchmarks | 0.18 |
| These gains have not translated into economically meaningful deployment across many professional domains. [Adoption Rate] | negative | high | translation of benchmark gains into economic deployment | 0.03 |
| The gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. [Adoption Rate] | negative | high | coverage and sustained measurement of benchmarks on real workflows | 0.03 |
| This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. [Task Completion Time] | positive | high | AI agent performance on long-horizon real-world tasks (verifiable outcomes / task pass rates) | 0.18 |
| ALE was developed in collaboration with 250+ industry experts. [Other] | positive | high | number of industry experts involved in development | n=250; 0.18 |
| ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). [Other] | neutral | high | scope of industries covered by the benchmark | 0.18 |
| ALE is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. [Other] | neutral | high | taxonomy breadth (subfields, clusters, number of tasks) | n=1000; 0.18 |
| Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. [Output Quality] | negative | high | average full pass rate (task success rate) on the hardest tier | average full pass rate is 2.6%; 0.18 |
| ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. [Adoption Rate] | positive | high | continuous expansion of benchmark task pool | 0.03 |
| ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact. [Fiscal And Macroeconomic] | positive | high | alignment of benchmark evaluation with GDP-relevant impact (economic impact of AI) | 0.03 |