Benchmarks that mimic NLP tasks overstate LLMs' readiness for real-world knowledge work; a three-step framework — name the work activity, specify the tested setting, and score the work product — plus an O*NET-derived activity inventory clarifies the strongest work claims a benchmark score can support and exposes gaps in popular datasets.

Design and Report Benchmarks for Knowledge Work

Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian · May 22, 2026

Source: arXiv · Paper type: theoretical · Evidence strength: n/a · Relevance: 7/10

The paper proposes a three-step framework — specify the work activity, the tested setting, and the scored work product — and an O*NET-derived inventory of 18 work activities to make explicit what benchmark scores for LLM agents can and cannot claim about real-world knowledge work.

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

Summary

Main Finding

Benchmarks for LLMs intended to measure “knowledge-work” capability must report three things explicitly—(1) the work activity being represented, (2) the tested setting (materials, tools, role, workflow state), and (3) the scored work product (not just visible output). Without this reporting, high benchmark scores can misleadingly support overly broad claims about real-world work capability. The paper operationalizes this via a practical three-step design/reporting approach and an 18-item cross-occupation work-activity inventory derived from O*NET task statements.

Key Points

Problem: Many current LLM/agent benchmarks follow traditional NLP input→output logic and report final-output metrics, which do not reliably indicate whether a system can perform situated, downstream-usable knowledge work in deployment.
Three-step reporting approach:
- Define the work activity the benchmark intends to represent (middle-grain between component tasks and whole occupations).
- Specify the tested setting: materials (what is supplied), tools (what the agent can do), role and scope (jurisdiction, decisions allowed), and workflow state (drafting, review, execution, handoff).
- Score the appropriate work product: include not only visible content but also state changes, revision traces, assumptions, links, and handoff information needed for downstream continuation.
Work product vs work output: A benchmark that scores only visible output (e.g., an answer) may miss whether that output is usable in subsequent work (e.g., executable code, reviewable patch, record-compliant document).
Activity-level reporting is necessary because domain labels (e.g., “software engineering”) are too broad, while component-task labels (e.g., “retrieval”) are too narrow and do not cover the relevant responsibilities.
The paper provides an 18-item cross-occupation work-activity inventory (e.g., analysis, investigation, coordination, record-keeping, troubleshooting, advising, design, inspection) to standardize what is being claimed by benchmarks.
Case analyses (GDPval, OfficeQA Pro, and APEX-SWE, plus a SWE-bench example) show how benchmark design choices determine the strongest supported claim and reveal common gaps between benchmarked tasks and deployment-relevant work claims.
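
The three-step reporting approach listed above can be sketched as a small report card. This is a minimal illustration, not the paper's own schema: the class names, field names, and the `missing_fields` helper are assumptions introduced here. The example entry fills in only what this summary states about SWE-bench and leaves every other field blank, so the helper shows which parts of the three steps remain unreported.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Setting:
    """Step 2: the tested setting (field names are hypothetical)."""
    materials: list[str]   # what is supplied to the system
    tools: list[str]       # what the agent can do
    role_and_scope: str    # jurisdiction, decisions allowed
    workflow_state: str    # e.g. drafting, review, execution, handoff


@dataclass
class BenchmarkReport:
    """Hypothetical report card covering the three reporting steps."""
    benchmark: str
    work_activity: str         # Step 1: the activity the benchmark represents
    setting: Setting           # Step 2: the tested setting
    scored_product: list[str]  # Step 3: what the scorer actually checks

    def missing_fields(self) -> list[str]:
        """Return the names of reporting fields that were left empty."""
        checks = {
            "work_activity": self.work_activity,
            "materials": self.setting.materials,
            "tools": self.setting.tools,
            "role_and_scope": self.setting.role_and_scope,
            "workflow_state": self.setting.workflow_state,
            "scored_product": self.scored_product,
        }
        return [name for name, value in checks.items() if not value]


# Only details stated in this summary are filled in; the rest stay blank.
report = BenchmarkReport(
    benchmark="SWE-bench",
    work_activity="",
    setting=Setting(
        materials=["GitHub issue", "repository state"],
        tools=[],
        role_and_scope="",
        workflow_state="",
    ),
    scored_product=["patch that passes the given tests"],
)

print(report.missing_fields())
```

A reviewer could run such a check before accepting a broad work claim: any field returned by `missing_fields` marks a dimension the score cannot speak to.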

Data & Methods

Data source: O*NET 30.2 task statements (job zones 3–5) as the raw reporting corpus for knowledge-work tasks.
Filtering/Screening:
- Initial corpus: 18,796 O*NET task statements.
- Knowledge-work filter and task-level screens produced 12,464 screened statements (direct manual, routine clerical, and performative tasks were removed).
- A stricter atlas-inclusion screen retained 8,372 statements for atlas construction.
Method pipeline:
- Profession-neutral rewrites of task statements.
- Embedding the rewrites, reducing dimensionality with UMAP, and clustering with HDBSCAN, which yielded 108 dense task groups.
- LLM-based summaries of clusters followed by expert-panel review to consolidate and label clusters.
- Final output: 18 cross-occupation work-activity labels; these labels were then applied back to the larger 12,464-statement reporting corpus.
Tools/algorithms noted: text embeddings, UMAP (for visualization/dimensionality reduction), HDBSCAN (density clustering), LLM summarization, and expert review.
Illustrative benchmark case work:
- SWE-bench (GitHub-issue based): shows scores tied to repository state + test oracle; score supports only the narrow claim “generate a patch that passes the given tests under the provided environment.”
- GDPval, OfficeQA Pro, APEX-SWE (analyzed as case studies): show how task-to-activity mapping, tested-setting choices, and scoring rules constrain the supported claims and where gaps arise.
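
The screening counts reported under Filtering/Screening imply retention rates that the summary does not state directly. The short calculation below derives them from the three reported counts; the stage labels and the `retention` helper are illustrative names, not terms from the paper.

```python
# Counts as reported in the screening steps of this summary.
FUNNEL = [
    ("initial O*NET task statements", 18_796),
    ("screened knowledge-work statements", 12_464),
    ("atlas-inclusion statements", 8_372),
]


def retention(funnel):
    """Return (stage, count, share of initial, share of previous stage)."""
    rows = []
    initial = funnel[0][1]
    previous = initial
    for stage, count in funnel:
        rows.append((stage, count, count / initial, count / previous))
        previous = count
    return rows


rows = retention(FUNNEL)
for stage, count, of_initial, of_previous in rows:
    print(f"{stage}: {count:,} "
          f"({of_initial:.1%} of initial, {of_previous:.1%} of previous stage)")
```

Roughly 66% of the initial statements pass the knowledge-work screens, and about 45% of the initial corpus (67% of the screened statements) enters atlas construction.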

Implications for AI Economics

Measurement validity for economic estimation: Researchers estimating productivity, task automation potential, or labor displacement from reported benchmark scores must consider supported-claim boundaries. Using conventional NLP-style scores without the three-step reporting risks mis-measuring automation potential and overstating substitutability.
Better inputs for task-level decomposition: The 18-item work-activity inventory provides a standardized activity taxonomy that economists can map to O*NET task/occupation descriptions to more precisely estimate which tasks are automatable or augmentable by LLMs.
External validity and deployment risk: Benchmarks that do not report tested settings (materials, tools, role/scope, workflow state) create ambiguity in counterfactuals. Economists and policy analysts should prefer benchmarks that document these dimensions when projecting adoption, productivity gains, or reallocation effects across occupations.
Estimating complementarities vs. substitution: Detailed reporting on work products (including hidden elements like state changes and handoff information) helps identify complementarities—tasks where AI improves worker productivity by generating usable artifacts for downstream actors—versus tasks where AI outputs are brittle and require substantial human remediation.
Policy and procurement design: Regulators and procurement agents should require benchmark-style reporting as part of vendor claims to assess whether systems truly deliver usable work products and to reduce mismatches between advertised capabilities and workplace needs (important for safety-critical and regulated sectors).
Incentives for benchmark creation: Creating benchmarks that score work products (not just outputs) and document tested settings will shift incentives toward measures that more closely track economic impact—encouraging development of systems that actually reduce task completion costs or meaningfully augment labor.
Guidance for empirical work:
- When using benchmark results in empirical models, condition estimates on the supported claim (activity+tested-setting+scored-product) rather than on domain-level labels.
- Use the paper’s case analyses to calibrate measurement error in common benchmark-to-real-world mappings (e.g., how often a “passing” code test actually corresponds to deployable fixes).
- For estimating labor market impacts, construct adoption scenarios that account for gaps identified by the three-step approach (e.g., requirements/coordination steps typically omitted by benchmarks).
Caution for macro forecasts: Aggregate forecasts of AI-driven productivity gains or job displacement based on aggregate benchmark scores should be down-weighted unless the underlying benchmarks demonstrate coverage of the activities, settings, and work products that match the deployment contexts assumed in the forecast.
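
The first guidance point, conditioning on the supported claim rather than on a domain label, can be illustrated with a small grouping sketch. Everything in it is an assumption made for illustration: the benchmark names (`bench_a`, `bench_b`, `bench_c`), the setting and product descriptions, and the percentage scores are placeholders, not results from any real benchmark. Only the activity labels (troubleshooting, design) come from the inventory examples listed in this summary.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records: names and scores are made up for illustration only.
RECORDS = [
    {"benchmark": "bench_a", "domain": "software engineering",
     "activity": "troubleshooting", "setting": "repository and tests supplied",
     "product": "patch passing tests", "score_pct": 60},
    {"benchmark": "bench_b", "domain": "software engineering",
     "activity": "troubleshooting", "setting": "repository and tests supplied",
     "product": "patch passing tests", "score_pct": 40},
    {"benchmark": "bench_c", "domain": "software engineering",
     "activity": "design", "setting": "requirements document supplied",
     "product": "reviewable design document", "score_pct": 20},
]


def mean_score_by(records, keys):
    """Average score_pct within groups defined by the given record keys."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["score_pct"])
    return {group: mean(scores) for group, scores in groups.items()}


# Pooling by domain hides the difference between the two supported claims.
by_domain = mean_score_by(RECORDS, ["domain"])

# Conditioning on activity + tested setting + scored product keeps them apart.
by_claim = mean_score_by(RECORDS, ["activity", "setting", "product"])

print(by_domain)
print(by_claim)
```

With these placeholder values the domain-level average blends two different supported claims into one number, while the claim-level grouping reports them separately, which is the distinction the guidance asks empirical models to preserve.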

Limitations noted by the paper (relevant to economists): the three-step approach is a reporting framework, not a full psychometric validity model; other benchmark quality concerns (rubrics, grader reliability, fairness) remain important; the 18-activity inventory is preliminary and derived from O*NET-based methods that may need further validation across industries and international contexts.


Assessment

Paper Type: theoretical
Evidence Strength: n/a. The paper is methodological/conceptual and does not present causal or empirical identification of AI effects; it offers a framework and illustrative case analyses rather than causal inference from data.
Methods Rigor: medium. The authors systematically derive an 18-item work-activity inventory from the O*NET task database and ground recommendations in a literature review and three concrete benchmark case analyses, but they do not provide empirical validation, user studies, or deployment tests to demonstrate the framework's predictive or explanatory power.
Sample: A conceptual literature review of work studies on roles, tools, and artifacts; derivation of 18 work activities from the O*NET occupational task database; and illustrative analyses of three existing benchmarks (GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products).
Themes: human_ai_collab, productivity
Generalizability:
- Framework focused on knowledge-work tasks; may not apply to manual, physical, or highly procedural work.
- Derived from the O*NET (U.S.-centric) occupational taxonomy; may not map cleanly to non-U.S. labor markets or informal work arrangements.
- Case analyses are limited to three benchmarks and do not include deployment or field validation, limiting external validation.
- Does not empirically test how well revised benchmarks predict real-world performance or economic outcomes like productivity or wages.
- May not capture domain-specific constraints, languages, or regulatory contexts outside the analyzed examples.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
Higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings.	Output Quality	negative	high	ability of a system to carry out knowledge work in real-world deployment settings	0.12
This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product.	Organizational Efficiency	positive	high	quality of benchmark-to-work claim mapping (explicitness of representation)	0.12
We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows.	Task Allocation	positive	high	organizational characteristics of knowledge work (roles, materials, tools, artifact usability)	0.12
We translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system.	Organizational Efficiency	positive	high	quality of benchmark design and reporting (alignment with real-world work concerns)	0.12
To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database.	Task Allocation	positive	high	inventory size and coverage (18 work activities derived)	18 work activities 0.12
We demonstrate the approach through three benchmark case analyses: GDPval, OfficeQA Pro, and APEX-SWE.	Research Productivity	positive	high	demonstration of approach via case analyses (number of cases = 3)	n=3 3 case analyses 0.12
GDPval [is] a non-code occupational deliverable benchmark.	Task Allocation	positive	high	nature of GDPval benchmark (non-code occupational deliverable)	0.06
OfficeQA Pro [is] a grounded document-analysis benchmark scored by final answers.	Output Quality	positive	high	scoring methodology and nature of OfficeQA Pro (grounded document-analysis, final-answer scoring)	0.06
APEX-SWE [is] a software-engineering benchmark with executable scored products.	Developer Productivity	positive	high	nature of APEX-SWE benchmark (software-engineering, executable product scoring)	0.06
These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.	Output Quality	positive	high	degree to which benchmark scores can support work claims; identification of gaps between task, setting, product, and claim	n=3 0.12
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare.	Innovation Output	positive	high	growth of literature/work on knowledge-work AI enabled by LLM agents in specified domains	0.12