JobBench tests AI agents on 130 real-work tasks across 35 occupations using extensive fact-anchored rubrics and finds leading models succeed on less than half the required criteria, implying current agents are far from reliably performing the professional tasks humans actually want delegated.
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
Summary
Main Finding
JobBench is a new, human-centered benchmark that measures agent capability on the specific professional duties experts most want delegated to AI. Unlike GDP-oriented benchmarks that emphasize economically valuable end-to-end deliverables (a “replacement” story), JobBench evaluates source-grounded professional reasoning across messy, multi-file workspaces and scores agents with chained, fact-anchored binary rubrics that require every reasoning step to pass. Evaluating 36 model–scaffold configurations, the strongest setup (Claude Opus 4.7 under Claude Code) reaches only 45.9% on the main set, demonstrating that current agents remain far from reliably performing the duties workers most want automated.
Key Points
- Human-centered scope: Tasks are selected from Workbank, a survey of >1,500 workers who rated their willingness to delegate each O*NET work duty. JobBench focuses on duties with high reported delegation desire and nontrivial economic exposure.
- Coverage: 130 tasks across 35 occupations (65 main + 65 easy), spanning 10 SOC groups.
- Realistic workspaces: Tasks are packaged as agentic workspaces with heterogeneous reference files (502 files across 17 formats; mean 3.9 files/task). Sources include federal/state public records and synthesized files (main set includes synthesized files; easy set uses only real-world files).
- Fact-anchored chained rubrics: 4,631 binary criteria total (mean 35.6 criteria/task). Rubrics are chains of criteria; a rubric’s weight is awarded only if every criterion in its chain passes (no partial credit).
- Quality control: Tasks pass a three-stage audit (automated audit, annotator review, solve trials). 71% of candidate tasks pass the pipeline; union pass rate across accepted criteria is 95.4% (i.e., >95% of criteria were passed by at least one agent in sampling).
- Evaluation setup: Non-interactive, headless agent runs in isolated workspaces with a 60-minute timeout. Four agent scaffolds are used (Claude Code, Codex CLI, OpenCode, OpenClaw). Outputs are scored by an LLM judge (default: grok-4.1-fast; validated against an Opus-4.5 judge, with scores differing by <0.7%).
- Leaderboard results: 36 model–scaffold configurations tested. Top performance: Claude Opus 4.7 (Claude Code) = 45.9% on JobBench-Main. GPT-family Codex variants follow (e.g., GPT-5.5 under Codex = 42.7%). Outside Claude/GPT families, no configuration exceeds ~19%.
- Harder than GDP-focused benchmarks: GDPVal shows near-saturation (many models >70%), while JobBench-Main scores remain well below 50%. JobBench tasks also demand longer runs and more complex tool use (e.g., GPT-5.4 under Codex took ~2.4× as long as on GDPVal).
- Easy vs Main: The easy set (no web search, fewer conflicts) produces much higher scores (models improve ~26–31 points), indicating challenge design matters.
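The union pass rate mentioned under quality control can be illustrated with a minimal sketch. It assumes each sampled agent run is recorded as a mapping from criterion ID to a pass/fail verdict; the function name `union_pass_rate` and the criterion IDs are hypothetical and are not taken from the JobBench codebase.

```python
def union_pass_rate(runs: list[dict[str, bool]]) -> float:
    """Fraction of criteria passed by at least one sampled run."""
    # Every criterion that appears in any run counts toward the denominator.
    criteria = set().union(*(run.keys() for run in runs))
    if not criteria:
        return 0.0
    # A criterion is in the union as soon as a single run passes it.
    passed = {c for run in runs for c, ok in run.items() if ok}
    return len(passed) / len(criteria)


# Three hypothetical runs over four criteria; no run passes "c4".
runs = [
    {"c1": True, "c2": False, "c3": False, "c4": False},
    {"c1": False, "c2": True, "c3": False, "c4": False},
    {"c1": True, "c2": False, "c3": True, "c4": False},
]
rate = union_pass_rate(runs)
print(rate)  # 0.75
```

A high union rate alongside low individual scores suggests that the criteria are achievable in principle, even though no single agent satisfies most of them in one run.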
Data & Methods
- Task selection and grounding:
  - Base signal: Workbank worker delegation-desire ratings for O*NET work duties.
  - Occupation filter: Occupations with average desire > 3 and significant economic exposure (OEWS wages).
  - Feasibility filter: Duty must be digitalizable, evaluable, and supportable.
- Expert involvement:
  - Annotators and domain experts recruited from Prolific and Upwork.
  - Structured onboarding and annotation platform; logs of AI-assisted annotation usage.
- Task construction:
  - Each task includes: scenario query, workspace of reference files, binary criteria, and chained rubrics to reflect defensible professional reasoning.
  - Easy set: all references real-world and fewer reasoning challenges; Main set: mix of real and synthesized files and more complex reconciliation required.
- Rubric design constraints:
  - Self-contained, binary, objective, unambiguous.
- Validation and quality gates:
  - Automated audit agent verifies instruction-file consistency and rubric correctness.
  - Annotator review refines tasks.
  - Solve trial: multiple agents sample runs; tasks retained if union of passed rubrics > 90%.
- Benchmark statistics:
  - 130 tasks (65 main / 65 easy), 35 occupations, 502 reference files, 17 file formats.
  - 4,631 binary criteria; mean 35.6 criteria per task.
  - Judge: grok-4.1-fast by default, validated against Opus-4.5 judge (agreement within 0.7%).
- Models & scaffolds:
  - 36 configurations across families: Anthropic Claude (Opus/Sonnet/Haiku variants), OpenAI GPT-5 series & Codex variants, Google Gemini 3, Qwen-3.5-Plus, MiniMax, Kimi, xAI Grok, etc.
  - Four scaffolds: Claude Code, Codex CLI, OpenCode, OpenClaw; each provides tool policies (shell, file editing, web fetch, subagents).
- Metrics:
  - Per-task normalized score = weighted sum of rubrics that fully pass (z_r = 1 only when all criteria in a rubric pass). Final leaderboard = average per-task score across tasks.
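The all-or-nothing scoring described under Metrics can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual implementation: the `Rubric` class, the weights, and the choice to normalize by the total rubric weight of the task are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    weight: float          # hypothetical rubric weight
    criteria: list[bool]   # binary verdict for each criterion in the chain


def task_score(rubrics: list[Rubric]) -> float:
    """Weighted share of rubrics whose entire chain passes."""
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    # A rubric earns its weight only if every criterion passes (no partial credit).
    earned = sum(r.weight for r in rubrics if all(r.criteria))
    return earned / total  # assumed normalization: divide by total weight


def leaderboard_score(tasks: list[list[Rubric]]) -> float:
    """Unweighted mean of the per-task scores."""
    return sum(task_score(t) for t in tasks) / len(tasks)


task_a = [
    Rubric(weight=2.0, criteria=[True, True, True]),  # full chain passes
    Rubric(weight=1.0, criteria=[True, False]),       # one failure voids the rubric
    Rubric(weight=1.0, criteria=[True]),
]
task_b = [Rubric(weight=1.0, criteria=[False, True])]

print(task_score(task_a))                   # 0.75
print(leaderboard_score([task_a, task_b]))  # 0.375
```

In this example, `task_a` passes five of its six criteria yet scores 0.75 rather than 0.83, because the single failed criterion voids its whole rubric. This is one reason chained scores can sit well below raw criterion pass rates.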
Implications for AI Economics
- Reframe economic impact: JobBench demonstrates a practical alternative to GDP-centric automation metrics. Measuring what workers actually want delegated (and can trust an agent to do) matters for predicting adoption, productivity gains, and labor-market effects. Economic impact should incorporate delegation desire × capability × adoption, not just technical replaceability.
- Augmentation-first policy and product strategy: Companies and policymakers should prioritize building agents that handle duties professionals want offloaded (e.g., multi-source fact-checking, proposal assembly), as those agents are likelier to be adopted and to raise job satisfaction and productivity rather than purely displace labor.
- Measurement of value: Traditional metrics that price tasks by wages or GDP exposure miss the uptake channel: even if a task is automatable technically, workers may not want it delegated (or may require high reliability/traceability). Benchmarks like JobBench supply complementary signals (delegation-preference-aligned capability) that better predict real-world augmentation value.
- Research priorities:
  - Improve source-grounded multi-document reasoning, conflict resolution, and verifiable chains of inference (JobBench’s chained rubrics explicitly reward that).
  - Focus engineering on tool use, longer-horizon reasoning, provenance, and reproducible computations; scaffold design matters (the same base model can differ by several points across scaffolds).
  - Invest in robust human-in-the-loop workflows (explainability, editability, and verifiable outputs) to bridge from benchmark performance to workplace adoption.
- Caution for displacement forecasts: High automation exposure numbers (hours or GDP percentage) should not be equated with imminent job loss. JobBench suggests many worker-valued duties are still beyond reliable automation and that automation-driven productivity gains may be more plausible near-term than outright displacement.
- For policy and evaluation: Incorporate worker preference surveys (like Workbank) and human-centered benchmarks into regulatory impact assessments, grants, and procurement decisions that aim to deploy workplace AI safely and usefully.
Limitations and next steps (brief)
- Geographic and occupational scope: based on US O*NET/Workbank; expanding to other labor markets and larger worker samples would improve generality.
- Some main-set files are synthesized; broader real-data coverage would further test generalization.
- LLM-as-judge approach has limitations despite validation; human grading studies and mixed-method evaluation could strengthen reliability.
- Future work: longitudinal field trials to measure real adoption, productivity, and well-being effects when agents are deployed for duties workers request delegated.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. [Governance And Regulation] | negative | medium | framing/scope of existing occupational AI benchmarks (economic-value / replacement narrative) | 0.02 |
| We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. [Task Allocation] | positive | high | evaluation of AI agents on expert-identified high-priority workflows for delegation | 0.18 |
| JobBench covers 130 agentic tasks across 35 occupations. [Task Allocation] | null_result | high | scope/coverage of the benchmark (number of tasks and occupations) | n=130; 130 tasks across 35 occupations; 0.18 |
| Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. [Task Allocation] | positive | high | realism of task inputs (heterogeneous reference files; information clutter) | 0.18 |
| Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. [Output Quality] | null_result | high | granularity of evaluation (number of binary rubric criteria per task) | n=130; 35.6 binary criteria per task (average); 0.18 |
| We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. [Output Quality] | negative | high | model performance on JobBench (aggregate score/accuracy as percent) | n=36; 45.9%; 0.18 |
| We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable. [Governance And Regulation] | positive | high | intended shift in community priorities / framing of labour-market effects (replacement -> enhancement) | 0.03 |