Benchmarks that mimic NLP tasks overstate LLMs' readiness for real-world knowledge work; a three-step framework — name the work activity, specify the tested setting, and score the work product — plus an O*NET-derived activity inventory clarifies the strongest work claims a benchmark score can support and exposes gaps in popular datasets.
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
Summary
Main Finding
Benchmarks for LLMs intended to measure “knowledge-work” capability must report three things explicitly—(1) the work activity being represented, (2) the tested setting (materials, tools, role, workflow state), and (3) the scored work product (not just visible output). Without this reporting, high benchmark scores can misleadingly support overly broad claims about real-world work capability. The paper operationalizes this via a practical three-step design/reporting approach and an 18-item cross-occupation work-activity inventory derived from O*NET task statements.
Key Points
- Problem: Many current LLM/agent benchmarks follow traditional NLP input→output logic and report final-output metrics, which do not reliably indicate whether a system can perform situated, downstream-usable knowledge work in deployment.
- Three-step reporting approach:
  - Define the work activity the benchmark intends to represent (middle-grain between component tasks and whole occupations).
  - Specify the tested setting: materials (what is supplied), tools (what the agent can do), role and scope (jurisdiction, decisions allowed), and workflow state (drafting, review, execution, handoff).
  - Score the appropriate work product: include not only visible content but also state changes, revision traces, assumptions, links, and handoff information needed for downstream continuation.
- Work product vs work output: A benchmark that scores only visible output (e.g., an answer) may miss whether that output is usable in subsequent work (e.g., executable code, reviewable patch, record-compliant document).
- Activity-level reporting is necessary because domain labels (e.g., “software engineering”) are too broad, while component-task labels (e.g., “retrieval”) are too narrow to cover the relevant responsibilities.
- The paper provides an 18-item cross-occupation work-activity inventory (e.g., analysis, investigation, coordination, record-keeping, troubleshooting, advising, design, inspection) to standardize what is being claimed by benchmarks.
- Case analyses (GDPval, OfficeQA Pro, and APEX-SWE, plus a SWE-bench example) show how benchmark design choices determine the strongest supported claim and reveal common gaps between benchmarked tasks and deployment-relevant work claims.
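The three reporting steps in the list above can be sketched as a small record type. This is a minimal illustration only: the class names, field names, and example values are assumptions introduced here, not a schema defined by the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SettingSpec:
    """Tested-setting dimensions from the summary (field names are illustrative)."""
    materials: str = ""       # what is supplied to the system
    tools: str = ""           # what the agent can do
    role_and_scope: str = ""  # jurisdiction, decisions allowed
    workflow_state: str = ""  # drafting, review, execution, handoff


@dataclass
class BenchmarkReport:
    """Hypothetical report card covering the three reporting steps."""
    name: str
    work_activity: str = ""                                # step 1
    setting: SettingSpec = field(default_factory=SettingSpec)  # step 2
    scored_product: str = ""                               # step 3

    def missing_fields(self) -> List[str]:
        """Return the names of reporting fields that were left blank."""
        entries = {
            "work_activity": self.work_activity,
            "materials": self.setting.materials,
            "tools": self.setting.tools,
            "role_and_scope": self.setting.role_and_scope,
            "workflow_state": self.setting.workflow_state,
            "scored_product": self.scored_product,
        }
        return [key for key, value in entries.items() if not value.strip()]


# Made-up example values, for illustration only.
report = BenchmarkReport(
    name="example-benchmark",
    work_activity="troubleshooting",
    setting=SettingSpec(
        materials="repository snapshot and issue text",
        tools="shell and test runner",
    ),
    scored_product="patch that passes the given tests",
)
print(report.missing_fields())
```

A report with blank fields signals that the score's supported claim is underspecified; here the role/scope and workflow-state entries are the ones left open.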
Data & Methods
- Data source: O*NET 30.2 task statements (job zones 3–5) as the raw reporting corpus for knowledge-work tasks.
- Filtering/Screening:
  - Initial corpus: 18,796 O*NET task statements.
  - A knowledge-work filter and task-level screens produced 12,464 screened statements, removing direct manual, routine clerical, and performative tasks.
  - A stricter atlas-inclusion screen retained 8,372 statements for atlas construction.
- Method pipeline:
  - Profession-neutral rewrites of task statements.
  - Embedding of the rewrites, dimensionality reduction with UMAP, and clustering with HDBSCAN, yielding 108 dense task groups.
  - LLM-based summaries of clusters, followed by expert-panel review to consolidate and label clusters.
  - Final output: 18 cross-occupation work-activity labels, which were then applied back to the larger 12,464-statement reporting corpus.
- Tools/algorithms noted: text embeddings, UMAP (for visualization/dimensionality reduction), HDBSCAN (density clustering), LLM summarization, and expert review.
- Illustrative benchmark case work:
  - SWE-bench (GitHub-issue based): shows scores tied to repository state + test oracle; score supports only the narrow claim “generate a patch that passes the given tests under the provided environment.”
  - GDPval, OfficeQA Pro, APEX-SWE (analyzed as case studies): show how task-to-activity mapping, tested-setting choices, and scoring rules constrain the supported claims and where gaps arise.
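The screening counts reported above imply retention rates that are easy to check. The sketch below computes them from the three stated counts only; the stage labels and the `retention` helper are names introduced here for illustration.

```python
# Counts taken from the screening steps listed above.
FUNNEL = [
    ("initial O*NET task statements", 18_796),
    ("after knowledge-work and task-level screens", 12_464),
    ("after atlas-inclusion screen", 8_372),
]


def retention(funnel):
    """Return (label, count, share of previous stage, share of first stage)."""
    first = funnel[0][1]
    previous = first
    rows = []
    for label, count in funnel:
        rows.append((label, count, count / previous, count / first))
        previous = count
    return rows


for label, count, step_share, overall_share in retention(FUNNEL):
    print(f"{label:<45} {count:>7,}  step {step_share:6.1%}  overall {overall_share:6.1%}")
```

Under these counts, roughly two thirds of statements survive each screen, so the atlas is built from a little under half of the initial corpus.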
Implications for AI Economics
- Measurement validity for economic estimation: Researchers estimating productivity, task automation potential, or labor displacement from reported benchmark scores must consider supported-claim boundaries. Using conventional NLP-style scores without the three-step reporting risks mis-measuring automation potential and overstating substitutability.
- Better inputs for task-level decomposition: The 18-item work-activity inventory provides a standardized activity taxonomy that economists can map to O*NET task/occupation descriptions to more precisely estimate which tasks are automatable or augmentable by LLMs.
- External validity and deployment risk: Benchmarks that do not report tested settings (materials, tools, role/scope, workflow state) create ambiguity in counterfactuals. Economists and policy analysts should prefer benchmarks that document these dimensions when projecting adoption, productivity gains, or reallocation effects across occupations.
- Estimating complementarities vs. substitution: Detailed reporting on work products (including hidden elements like state changes and handoff information) helps identify complementarities—tasks where AI improves worker productivity by generating usable artifacts for downstream actors—versus tasks where AI outputs are brittle and require substantial human remediation.
- Policy and procurement design: Regulators and procurement agents should require benchmark-style reporting as part of vendor claims to assess whether systems truly deliver usable work products and to reduce mismatches between advertised capabilities and workplace needs (important for safety-critical and regulated sectors).
- Incentives for benchmark creation: Creating benchmarks that score work products (not just outputs) and document tested settings will shift incentives toward measures that more closely track economic impact—encouraging development of systems that actually reduce task completion costs or meaningfully augment labor.
- Guidance for empirical work:
  - When using benchmark results in empirical models, condition estimates on the supported claim (activity+tested-setting+scored-product) rather than on domain-level labels.
  - Use the paper’s case analyses to calibrate measurement error in common benchmark-to-real-world mappings (e.g., how often a “passing” code test actually corresponds to deployable fixes).
  - For estimating labor market impacts, construct adoption scenarios that account for gaps identified by the three-step approach (e.g., requirements/coordination steps typically omitted by benchmarks).
- Caution for macro forecasts: Aggregate forecasts of AI-driven productivity gains or job displacement based on aggregate benchmark scores should be down-weighted unless the underlying benchmarks demonstrate coverage of the activities, settings, and work products that match the deployment contexts assumed in the forecast.
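To make the guidance on conditioning concrete, the sketch below contrasts averaging scores by domain label with averaging by supported claim (activity, tested setting, scored product). All records and scores are made up for illustration, and `mean_by` is a hypothetical helper, not a method from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records:
# (domain label, work activity, tested setting, scored product, score)
RESULTS = [
    ("software engineering", "troubleshooting", "repo + test oracle", "passing patch", 0.60),
    ("software engineering", "troubleshooting", "repo + test oracle", "passing patch", 0.70),
    ("software engineering", "design", "requirements brief", "reviewable design doc", 0.20),
]


def mean_by(records, key):
    """Average the score (last field) within groups defined by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[-1])
    return {group: mean(scores) for group, scores in groups.items()}


# Domain-level pooling hides the difference between the two activities.
by_domain = mean_by(RESULTS, key=lambda r: r[0])

# Conditioning on the supported claim keeps them apart.
by_claim = mean_by(RESULTS, key=lambda r: r[1:4])

for claim, score in by_claim.items():
    print(claim, round(score, 2))
print(by_domain)
```

In this toy data the single domain-level average sits between two quite different claim-level averages, which is the kind of pooling the guidance warns against.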
Limitations noted by the paper (relevant to economists): the three-step approach is a reporting framework, not a full psychometric validity model; other benchmark quality concerns (rubrics, grader reliability, fairness) remain important; the 18-activity inventory is preliminary and derived from O*NET-based methods that may need further validation across industries and international contexts.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. | Output Quality | negative | high | ability of a system to carry out knowledge work in real-world deployment settings | 0.12 |
| This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. | Organizational Efficiency | positive | high | quality of benchmark-to-work claim mapping (explicitness of representation) | 0.12 |
| We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. | Task Allocation | positive | high | organizational characteristics of knowledge work (roles, materials, tools, artifact usability) | 0.12 |
| We translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. | Organizational Efficiency | positive | high | quality of benchmark design and reporting (alignment with real-world work concerns) | 0.12 |
| To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. | Task Allocation | positive | high | inventory size and coverage (18 work activities derived) | 18 work activities; 0.12 |
| We demonstrate the approach through three benchmark case analyses: GDPval, OfficeQA Pro, and APEX-SWE. | Research Productivity | positive | high | demonstration of approach via case analyses (number of cases = 3) | n=3; 3 case analyses; 0.12 |
| GDPval [is] a non-code occupational deliverable benchmark. | Task Allocation | positive | high | nature of GDPval benchmark (non-code occupational deliverable) | 0.06 |
| OfficeQA Pro [is] a grounded document-analysis benchmark scored by final answers. | Output Quality | positive | high | scoring methodology and nature of OfficeQA Pro (grounded document-analysis, final-answer scoring) | 0.06 |
| APEX-SWE [is] a software-engineering benchmark with executable scored products. | Developer Productivity | positive | high | nature of APEX-SWE benchmark (software-engineering, executable product scoring) | 0.06 |
| These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim. | Output Quality | positive | high | degree to which benchmark scores can support work claims; identification of gaps between task, setting, product, and claim | n=3; 0.12 |
| The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. | Innovation Output | positive | high | growth of literature/work on knowledge-work AI enabled by LLM agents in specified domains | 0.12 |