JobBench tests AI agents on 130 real-work tasks across 35 occupations using extensive fact-anchored rubrics and finds leading models succeed on less than half the required criteria, implying current agents are far from reliably performing the professional tasks humans actually want delegated.

JobBench: Aligning Agent Work With Human Will

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran · May 25, 2026

arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 8/10

JobBench is a 130-task, 35-occupation benchmark that evaluates agentic AI on realistic workspaces with detailed fact-anchored rubrics and finds even the best models succeed on under half the required criteria (top score 45.9%).

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

Summary

Main Finding

JobBench is a new, human-centered benchmark that measures agent capability on the specific professional duties experts most want delegated to AI. Unlike GDP-oriented benchmarks that emphasize economically valuable end-to-end deliverables (a “replacement” story), JobBench evaluates source-grounded professional reasoning across messy, multi-file workspaces and scores agents with chained, fact-anchored binary rubrics that require every reasoning step to pass. Evaluating 36 model–scaffold configurations, the strongest setup (Claude Opus 4.7 under Claude Code) reaches only 45.9% on the main set, demonstrating that current agents remain far from reliably performing the duties workers most want automated.

Key Points

Human-centered scope: Tasks are selected from Workbank, a survey of >1,500 workers who rated their willingness to delegate each O*NET work duty. JobBench focuses on duties with high reported delegation desire and nontrivial economic exposure.
Coverage: 130 tasks across 35 occupations (65 main + 65 easy), spanning 10 SOC groups.
Realistic workspaces: Tasks are packaged as agentic workspaces with heterogeneous reference files (502 files across 17 formats; mean 3.9 files/task). Sources include federal/state public records and synthesized files (main set includes synthesized files; easy set uses only real-world files).
Fact-anchored chained rubrics: 4,631 binary criteria total (mean 35.6 criteria/task). Rubrics are chains of criteria; a rubric’s weight is awarded only if every criterion in its chain passes (no partial credit).
Quality control: Tasks pass a three-stage audit (automated audit, annotator review, solve trials). 71% of candidate tasks pass the pipeline; union pass rate across accepted criteria is 95.4% (i.e., >95% of criteria were passed by at least one agent in sampling).
Evaluation setup: Agents run non-interactively (headless) in isolated workspaces with a 60-minute timeout. Four agent scaffolds are used (Claude Code, Codex CLI, OpenCode, OpenClaw). Outputs are graded by an LLM judge (default: grok-4.1-fast; validated against an Opus-4.5 judge, with scores differing by less than 0.7%).
Leaderboard results: 36 model–scaffold configurations tested. Top performance: Claude Opus 4.7 (Claude Code) = 45.9% on JobBench-Main. GPT-family Codex variants follow (e.g., GPT-5.5 under Codex = 42.7%). Outside Claude/GPT families, no configuration exceeds ~19%.
Harder than GDP-focused benchmarks: GDPVal shows near-saturation (many models >70%), while JobBench-Main scores remain well below 50%. JobBench tasks also demand longer runs and more complex tool use (e.g., GPT-5.4 under Codex ran about 2.4× longer than on GDPVal).
Easy vs Main: The easy set (no web search, fewer conflicts) produces much higher scores (models improve ~26–31 points), indicating challenge design matters.

Data & Methods

Task selection and grounding:
- Base signal: Workbank worker delegation-desire ratings for O*NET work duties.
- Occupation filter: Occupations with average desire > 3 and significant economic exposure (OEWS wages).
- Feasibility filter: Duty must be digitalizable, evaluable, and supportable.
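
The selection filters above can be sketched in a few lines. This is only an illustrative sketch, not the authors' pipeline: the `Duty` record, the `wage_bill` mapping, the `min_exposure` cutoff, and the sample data are all hypothetical, and the rating scale is assumed. Only the "average desire > 3" rule and the three feasibility flags come from the summary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Duty:
    name: str
    occupation: str
    desire: float          # worker delegation-desire rating (scale assumed)
    digitalizable: bool
    evaluable: bool
    supportable: bool


def select_duties(duties: List[Duty],
                  wage_bill: Dict[str, float],
                  min_exposure: float) -> List[Duty]:
    """Keep feasible duties from occupations that pass the occupation filter."""
    by_occupation: Dict[str, List[Duty]] = {}
    for duty in duties:
        by_occupation.setdefault(duty.occupation, []).append(duty)

    # Occupation filter: average desire > 3 and enough economic exposure.
    # min_exposure is a hypothetical stand-in for "significant" exposure.
    eligible = {
        occ for occ, ds in by_occupation.items()
        if mean(d.desire for d in ds) > 3
        and wage_bill.get(occ, 0.0) >= min_exposure
    }

    # Feasibility filter: all three flags must hold for the duty itself.
    return [
        d for d in duties
        if d.occupation in eligible
        and d.digitalizable and d.evaluable and d.supportable
    ]


# Hypothetical sample data, for illustration only.
duties = [
    Duty("summarize filings", "Paralegal", 4.0, True, True, True),
    Duty("interview clients", "Paralegal", 3.5, True, False, True),
    Duty("field measurement", "Surveyor", 2.0, False, True, True),
    Duty("draft site report", "Surveyor", 3.0, True, True, True),
]
wage_bill = {"Paralegal": 5.0, "Surveyor": 4.0}

selected = select_duties(duties, wage_bill, min_exposure=1.0)
print([d.name for d in selected])
```

In this toy data the Surveyor occupation averages 2.5 and is dropped entirely, while the Paralegal duty that is not evaluable is removed by the feasibility filter.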
Expert involvement:
- Annotators and domain experts recruited from Prolific and Upwork.
- Structured onboarding and annotation platform; logs of AI-assisted annotation usage.
Task construction:
- Each task includes: scenario query, workspace of reference files, binary criteria, and chained rubrics to reflect defensible professional reasoning.
- Easy set: all references real-world and fewer reasoning challenges; Main set: mix of real and synthesized files and more complex reconciliation required.
Rubric design constraints:
- Self-contained, binary, objective, unambiguous.
Validation and quality gates:
- Automated audit agent verifies instruction-file consistency and rubric correctness.
- Annotator review refines tasks.
- Solve trial: multiple agents attempt each task over sampled runs; a task is retained if the union of rubrics passed across those runs exceeds 90%.
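
The solve-trial gate can be illustrated with a short sketch. It assumes each run is recorded as a mapping from a rubric identifier to a pass/fail verdict; the function names, the identifiers, and the sample runs are hypothetical, and only the "> 90%" union threshold is taken from the summary.

```python
from typing import Dict, List


def union_pass_rate(runs: List[Dict[str, bool]]) -> float:
    """Fraction of rubrics passed by at least one run."""
    rubric_ids = set().union(*(run.keys() for run in runs))
    if not rubric_ids:
        return 0.0
    passed = {
        rid for rid in rubric_ids
        if any(run.get(rid, False) for run in runs)
    }
    return len(passed) / len(rubric_ids)


def retain_task(runs: List[Dict[str, bool]], threshold: float = 0.90) -> bool:
    """Apply the strict '> threshold' rule described for the solve trial."""
    return union_pass_rate(runs) > threshold


# Hypothetical verdicts for a task with ten rubrics, r1..r10.
ids = [f"r{i}" for i in range(1, 11)]
run_1 = {rid: rid in {"r1", "r2", "r3", "r4", "r5", "r6"} for rid in ids}
run_2 = {rid: rid in {"r5", "r6", "r7", "r8", "r9"} for rid in ids}
run_3 = {rid: rid == "r10" for rid in ids}

print(union_pass_rate([run_1, run_2]))        # 0.9
print(retain_task([run_1, run_2]))            # False
print(retain_task([run_1, run_2, run_3]))     # True
```

The union is deliberately lenient: no single run has to pass everything, so the gate checks that each rubric is achievable by some agent rather than that the task is easy.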
Benchmark statistics:
- 130 tasks (65 main / 65 easy), 35 occupations, 502 reference files, 17 file formats.
- 4,631 binary criteria; mean 35.6 criteria per task.
- Judge: grok-4.1-fast by default, validated against Opus-4.5 judge (agreement within 0.7%).
Models & scaffolds:
- 36 configurations across families: Anthropic Claude (Opus/Sonnet/Haiku variants), OpenAI GPT-5 series & Codex variants, Google Gemini 3, Qwen-3.5-Plus, MiniMax, Kimi, xAI Grok, etc.
- Four scaffolds: Claude Code, Codex CLI, OpenCode, OpenClaw; each provides tool policies (shell, file editing, web fetch, subagents).
Metrics:
- Per-task normalized score = weighted sum of rubrics that fully pass (z_r = 1 only when all criteria in rubric r pass; otherwise z_r = 0). Final leaderboard score = average per-task score across tasks.
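
The scoring rule above can be made concrete with a minimal sketch. It assumes normalization means dividing by the total rubric weight of the task, which the summary does not spell out; the `Rubric` class, the weights, and the verdicts are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rubric:
    weight: float          # rubric weight w_r (hypothetical values below)
    criteria: List[bool]   # judge verdict for each binary criterion in the chain


def task_score(rubrics: List[Rubric]) -> float:
    """Weighted share of rubrics whose entire chain passed, in [0, 1]."""
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    # z_r = 1 only when every criterion in the chain passes: no partial credit.
    earned = sum(r.weight for r in rubrics if all(r.criteria))
    return earned / total


def leaderboard_score(tasks: List[List[Rubric]]) -> float:
    """Unweighted mean of the per-task scores."""
    return sum(task_score(t) for t in tasks) / len(tasks)


task_a = [
    Rubric(weight=2.0, criteria=[True, True, True]),   # full chain passes
    Rubric(weight=1.0, criteria=[True, False, True]),  # one miss voids the rubric
    Rubric(weight=1.0, criteria=[True]),
]
task_b = [Rubric(weight=1.0, criteria=[True, False])]

print(task_score(task_a))                    # 0.75
print(leaderboard_score([task_a, task_b]))   # 0.375
```

Note that the second rubric of `task_a` passes two of its three criteria yet earns nothing, which is why chained scores sit well below the raw criterion pass rate.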

Implications for AI Economics

Reframe economic impact: JobBench demonstrates a practical alternative to GDP-centric automation metrics. Measuring what workers actually want delegated (and can trust an agent to do) matters for predicting adoption, productivity gains, and labor-market effects. Economic impact should incorporate delegation desire × capability × adoption, not just technical replaceability.
Augmentation-first policy and product strategy: Companies and policymakers should prioritize building agents that handle duties professionals want offloaded (e.g., multi-source fact-checking, proposal assembly), as those agents are likelier to be adopted and to raise job satisfaction and productivity rather than purely displace labor.
Measurement of value: Traditional metrics that price tasks by wages or GDP exposure miss the uptake channel: even if a task is automatable technically, workers may not want it delegated (or may require high reliability/traceability). Benchmarks like JobBench supply complementary signals (delegation-preference-aligned capability) that better predict real-world augmentation value.
Research priorities:
- Improve source-grounded multi-document reasoning, conflict resolution, and verifiable chains of inference (JobBench’s chained rubrics explicitly reward that).
- Focus engineering on tool use, longer-horizon reasoning, provenance, and reproducible computations; scaffold design matters (the same base model can score several points differently depending on the scaffold).
- Invest in robust human-in-the-loop workflows (explainability, editability, and verifiable outputs) to bridge from bench performance to workplace adoption.
Caution for displacement forecasts: High automation exposure numbers (hours or GDP percentage) should not be equated with imminent job loss. JobBench suggests many worker-valued duties are still beyond reliable automation and that automation-driven productivity gains may be more plausible near-term than outright displacement.
For policy and evaluation: Incorporate worker preference surveys (like Workbank) and human-centered benchmarks into regulatory impact assessments, grants, and procurement decisions that aim to deploy workplace AI safely and usefully.

Limitations and next steps (brief)

- Geographic and occupational scope: based on US O*NET/Workbank; expanding to other labor markets and larger worker samples would improve generality.
- Some main-set files are synthesized; broader real-data coverage would further test generalization.
- LLM-as-judge approach has limitations despite validation; human grading studies and mixed-method evaluation could strengthen reliability.
- Future work: longitudinal field trials to measure real adoption, productivity, and well-being effects when agents are deployed for duties workers request delegated.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This paper presents a benchmark and model evaluation rather than causal empirical analysis; it measures model performance on curated tasks but does not identify causal impacts of AI on economic outcomes (productivity, wages, employment).
Methods Rigor: medium. The benchmark is carefully constructed (130 tasks across 35 occupations, heterogeneous workspaces, detailed fact-anchored rubrics averaging 35.6 binary criteria) and evaluates 36 models, which indicates substantial methodological care; however, potential concerns remain about task and expert selection procedures, rubric validation and inter-rater reliability, representativeness of the tasks to real-world work, and how grading maps to real-world effectiveness.
Sample: A curated benchmark (JobBench) comprising 130 'agentic' tasks spanning 35 occupations; each task is instantiated as a workspace containing heterogeneous reference files mimicking professional clutter; outputs are scored using a chain of fact-anchored rubrics averaging 35.6 binary criteria per task; 36 models were evaluated (the top model, Claude Opus 4.7 via Claude Code, scored 45.9%).
Themes: human_ai_collab, labor_markets
Generalizability:
- Tasks and occupations selected by unspecified experts may not represent the full diversity of real-world jobs or geographic/cultural contexts.
- Controlled task workspaces may not capture dynamics of live workplace interactions (iterative clarification, time pressure, multi-agent workflows).
- Rubric design and binary criteria may reflect designers' judgments and could introduce subjective or cultural bias.
- Evaluation measures model output correctness but not real-world effects on productivity, decision-making, or worker satisfaction.
- Models and versions evaluated are a snapshot; results may not generalize as models rapidly change.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story.	Governance And Regulation	negative	medium	framing/scope of existing occupational AI benchmarks (economic-value / replacement narrative)	0.02
We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value.	Task Allocation	positive	high	evaluation of AI agents on expert-identified high-priority workflows for delegation	0.18
JobBench covers 130 agentic tasks across 35 occupations.	Task Allocation	null_result	high	scope/coverage of the benchmark (number of tasks and occupations)	n=130 130 tasks across 35 occupations 0.18
Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work.	Task Allocation	positive	high	realism of task inputs (heterogeneous reference files; information clutter)	0.18
Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task.	Output Quality	null_result	high	granularity of evaluation (number of binary rubric criteria per task)	n=130 35.6 binary criteria per task (average) 0.18
We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%.	Output Quality	negative	high	model performance on JobBench (aggregate score/accuracy as percent)	n=36 45.9% 0.18
We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.	Governance And Regulation	positive	high	intended shift in community priorities / framing of labour-market effects (replacement -> enhancement)	0.03