The Commonplace

A large new benchmark of real-world file workspaces finds AI agents far from human-level performance: the best system scores 68.7% against a human 80.7%, with an average across agents of just 47.4%, exposing persistent weaknesses in cross-file reasoning and workspace-level decision-making.

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Weizheng Wang, Hongzhang Huang, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu · May 05, 2026
Source: arXiv · Paper type: descriptive · Evidence: n/a · Relevance: 8/10
Workspace-Bench introduces a large, realistic benchmark of cross-file workspace tasks showing current AI agents average 47.4% (best 68.7%) versus human 80.7%, revealing substantial gaps in cross-file retrieval, contextual reasoning, and adaptive decision-making.

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.

Summary

Main Finding

Workspace-Bench 1.0 introduces a realistic, dependency-driven benchmark for “workspace learning” (agents reasoning over large, heterogeneous file ecosystems). Evaluation of 28 agent configurations (4 harnesses × 7 LLMs) shows that current autonomous agents are far from reliable: the average Rubrics Pass Rate is 47.4% and the best configuration reaches 68.7%, while humans with tools reach 80.7%. The major failure modes are heterogeneous file understanding and file-lineage tracing, and pass rates fall as tasks get harder (Easy 57.6% → Hard 40.5%). The benchmark also exposes large inference costs, measured in tokens and interaction turns, for some configurations.

Key Points

  • Benchmark scope

    • 5 realistic user personas (operations manager, logistics manager, product manager, backend developer, researcher).
    • 20,476 files across 74 file types (up to 20GB total).
    • 388 dependency-driven tasks with explicit file-dependency graphs.
    • 7,399 rubrics (average ~19.1 rubrics per task) assessing final outputs and intermediate decisions.
    • Workspace-Bench-Lite: 100-task subset preserving distribution while cutting evaluation cost by ≈70%.
  • Evaluation framework and tooling

    • Workspace-grounded evaluation with dual parallel acceleration and an Agent-as-a-Judge paradigm for fine-grained scoring (intermediate decisions + final correctness).
    • Tasks annotated and validated by experts; LLMs used only for auxiliary verification/rubric optimization.
    • Measured operational metrics: interaction turns, token consumption, task-specific costs.
  • Empirical results

    • Average Rubrics Pass Rate across 28 configurations: 47.4%.
    • Best-performing configuration: OpenClaw + Claude-Opus4.7 at 68.7% (human + tools: 80.7%).
    • Performance declines with task difficulty (Easy 57.6%, Hard 40.5%).
    • Significant variation across harnesses and LLMs. Harnesses boost weaker LLMs more than strong ones.
    • Cost explosions observed: e.g., DeepAgent + MiniMax-M2.7 consumed up to 58.1 interaction turns and ~0.61M tokens per task while underperforming (~45% pass rate).
  • Conceptual contributions

    • Explicit modeling and annotation of semantic relations, result-providing file aggregation, and file-lineage relations—dimensions often missing from prior benchmarks.
    • A five-stage characterization of agentic workspace learning (from data-insensitive guidance to data-driven self-evolution), with identified bottlenecks such as “orchestration singularity” and the “Data Association Gap.”
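
As a quick sanity check on the headline figures listed above, the following minimal sketch recomputes the derived quantities (rubrics per task, percentage-point gaps, share of tasks in the Lite subset). All inputs are copied from this summary; the variable and function names are illustrative only and do not come from the paper.

```python
# Figures copied from the summary above; no new data is introduced here.
TOTAL_RUBRICS = 7_399
TOTAL_TASKS = 388
LITE_TASKS = 100

HUMAN_WITH_TOOLS = 80.7   # Rubrics Pass Rate, percent
BEST_AGENT = 68.7
AVERAGE_AGENT = 47.4
EASY, HARD = 57.6, 40.5


def gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


# Average number of rubrics attached to each task.
rubrics_per_task = round(TOTAL_RUBRICS / TOTAL_TASKS, 1)

# Share of the full task set that the Lite subset covers, in percent.
lite_task_share = round(100 * LITE_TASKS / TOTAL_TASKS, 1)

print("rubrics per task:", rubrics_per_task)
print("human vs best agent (pp):", gap(HUMAN_WITH_TOOLS, BEST_AGENT))
print("human vs average agent (pp):", gap(HUMAN_WITH_TOOLS, AVERAGE_AGENT))
print("easy vs hard (pp):", gap(EASY, HARD))
print("Lite share of tasks (%):", lite_task_share)
```

The Lite subset covers roughly a quarter of the tasks, which is in the same range as the reported cost reduction of about 70%.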

Data & Methods

  • Data assembly

    • Persona-driven workspace simulation combining agent-based structure generation and hybrid file population (real files + controlled generation) to mimic messy, role-specific workspaces.
    • Files include heterogeneous modalities (docs, sheets, code, chat logs, etc.) and lineage/derivative relationships.
  • Task curation and annotation

    • 388 tasks collected from real office scenarios and curated with domain experts.
    • Each task paired with a file-dependency graph and manually validated reference outputs.
    • Rubrics include checks for: file discovery, correct aggregation of result files, intra- and inter-file reasoning, lineage tracing, and correctness of operations.
  • Evaluation

    • Tested 4 harnesses: ClaudeCode, DeepAgent, Hermes, OpenClaw.
    • Tested 7 backbone LLMs (examples from paper: Opus-4.7, GLM-5.1, MiniMax-M2.7, GPT-5.4, Kimi-2.5, Seed-2.0-Code, Gemini-3.1-Pro).
    • 28 harness×LLM configurations evaluated on the Lite subset (and full benchmark in some experiments).
    • Primary metric: Rubrics Pass Rate (granular rubric-level pass/fail aggregated).
    • Measured auxiliary costs: interaction turns, token consumption, and evaluation labeling effort (more than roughly 2,500 human-hours reported for full curation).
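
The evaluation setup above can be sketched in a few lines. This summary does not say whether the Rubrics Pass Rate pools all rubrics (micro-average) or averages per-task rates (macro-average), so the sketch below shows both as assumptions. The harness and model names are taken from the list above; the function names and the toy results are hypothetical.

```python
from itertools import product

HARNESSES = ["ClaudeCode", "DeepAgent", "Hermes", "OpenClaw"]
MODELS = [
    "Opus-4.7", "GLM-5.1", "MiniMax-M2.7", "GPT-5.4",
    "Kimi-2.5", "Seed-2.0-Code", "Gemini-3.1-Pro",
]


def configurations() -> list[tuple[str, str]]:
    """Every harness paired with every model: 4 x 7 = 28 configurations."""
    return list(product(HARNESSES, MODELS))


def micro_pass_rate(results: dict[str, list[bool]]) -> float:
    """Pool all rubrics across tasks, then take the fraction that passed."""
    total = sum(len(rubrics) for rubrics in results.values())
    passed = sum(sum(rubrics) for rubrics in results.values())
    return passed / total


def macro_pass_rate(results: dict[str, list[bool]]) -> float:
    """Compute a pass rate per task, then average across tasks."""
    per_task = [sum(rubrics) / len(rubrics) for rubrics in results.values()]
    return sum(per_task) / len(per_task)


# Toy data (hypothetical): task id -> pass/fail outcome for each rubric.
toy_results = {
    "task_a": [True, True, False, False],
    "task_b": [True, True],
}

print(len(configurations()))
print(micro_pass_rate(toy_results))
print(macro_pass_rate(toy_results))
```

The two aggregates diverge when tasks carry different numbers of rubrics, which matters when comparing scores reported under different conventions.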

Implications for AI Economics

  • Inference cost is a practical adoption bottleneck

    • The paper reports that enterprises cite inference cost as a top barrier and demonstrates configurations with massive token/turn footprints. High per-task token costs (e.g., hundreds of thousands of tokens in pathological runs) directly raise the marginal cost of automation in workplace settings.
    • Economic consequence: even promising models may be uneconomical unless token/interaction efficiency or pricing improves.
  • ROI and deployment readiness

    • Average agent reliability (47.4%) is substantially below human+tools (80.7%), implying limited replacement value for complex office workflows today.
    • Businesses must weigh savings from partial automation against costs: model inference, engineering (harnesses, parsers), oversight, error-handling and rework caused by agent mistakes.
  • Invest in targeted capabilities with high economic leverage

    • Heterogeneous file parsing and lineage-tracing are primary bottlenecks; investment in robust multimodal parsers, provenance-aware retrieval, and file-indexing infrastructure likely yields outsized returns (improving accuracy and reducing token churn).
    • Improvements in retrieval/RAG, compact context representations, and summarization of file clusters reduce token costs per task — directly improving per-task economics.
  • Harnesses and systems engineering matter

    • Agent harnesses can substantially improve weaker models, a cost-effective way to raise performance when replacing the LLM is expensive. This suggests economic value in engineering platforms that wrap existing models effectively (tool orchestration, caching, chunking).
    • For procurement decisions, buyers should evaluate harness+model combinations, not LLMs in isolation.
  • Need for hybrid human-in-the-loop models

    • Because human+tools still outperform autonomous agents, the economically optimal design for many workflows will be hybrid: agents do pre-filtering, extraction, and candidate synthesis; humans retain oversight for high-risk decisions.
    • This reduces liability and rework costs, while leveraging agent efficiency where reliability is achievable.
  • Benchmarking, procurement, and standards

    • Workspace-Bench provides a practical, task- and cost-aware evaluation standard that vendors and buyers can use to compare agent configurations on realistic, dependency-driven tasks.
    • Enterprises should demand metrics beyond accuracy: cost-per-task (tokens + compute), interaction turns, intermediate rubric reliability (to forecast rework), and lineage correctness (for compliance-sensitive domains).
  • Market and workforce effects

    • Slower-than-expected productivity gains for complex knowledge work: automation value will concentrate in tasks with limited cross-file dependencies or in well-structured workflows until workspace learning improves.
    • Demand growth for tools that organize and standardize workspace data (document indexing, provenance capture) — vendors can monetize improved data hygiene that lowers agent costs and raises accuracy.
  • Policy and risk management

    • For compliance-heavy domains, inability to trace file lineage and provenance reduces legal/operational trust in fully autonomous agents. Economic adoption requires better provenance guarantees and auditability.

Practical recommendations for economists, procurement teams, and product leaders

  • When estimating expected savings from agent deployment, include token/interaction costs, harness engineering costs, expected error/rework rates, and human supervision costs; use rubrics-like micro-metrics to forecast rework.
  • Prioritize investments in (a) workspace normalization (indexing/provenance), (b) retrieval and summarization layers that reduce context size, and (c) harness engineering that caps unnecessary LLM calls.
  • Use Workspace-Bench (or its Lite subset) as an economic stress-test when comparing vendor claims: measure both accuracy and per-task operational cost.
  • Consider hybrid workflows (agent pre-processing + human verification) until workspace-learning capabilities reach human parity for targeted task classes.
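
The first recommendation can be made concrete with a minimal per-task cost sketch. Only the token count (~0.61M) and the ~45% pass rate come from the results reported above. The token price, review cost, rework cost, and human-only cost are hypothetical placeholders, and treating one minus the rubric pass rate as a task failure rate is a simplifying assumption, not something the paper claims.

```python
def inference_cost_usd(tokens_per_task: float, usd_per_million_tokens: float) -> float:
    """Convert a per-task token footprint into dollars at a flat token price."""
    return tokens_per_task / 1_000_000 * usd_per_million_tokens


def hybrid_cost_usd(inference_usd: float, review_usd: float,
                    rework_usd: float, failure_rate: float) -> float:
    """Expected cost of agent work plus human review plus expected rework."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return inference_usd + review_usd + failure_rate * rework_usd


def savings_per_task_usd(human_only_usd: float, hybrid_usd: float) -> float:
    """Positive when the hybrid workflow is cheaper than human-only work."""
    return human_only_usd - hybrid_usd


# From the summary: ~0.61M tokens per task at a ~45% pass rate.
TOKENS_PER_TASK = 610_000
FAILURE_RATE = 1 - 0.45          # assumption: rubric pass rate as a proxy

# Hypothetical placeholders, NOT figures from the paper.
USD_PER_MILLION_TOKENS = 5.0
REVIEW_USD = 4.0
REWORK_USD = 15.0
HUMAN_ONLY_USD = 20.0

inference = inference_cost_usd(TOKENS_PER_TASK, USD_PER_MILLION_TOKENS)
hybrid = hybrid_cost_usd(inference, REVIEW_USD, REWORK_USD, FAILURE_RATE)
print(f"inference: ${inference:.2f}")
print(f"hybrid:    ${hybrid:.2f}")
print(f"savings:   ${savings_per_task_usd(HUMAN_ONLY_USD, hybrid):.2f}")
```

With these placeholder inputs the expected rework term dominates the inference term, which illustrates why forecasting rework from rubric-level reliability matters as much as token pricing.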


Assessment

Paper Type: descriptive

Evidence Strength: n/a. The paper presents a benchmark and empirical evaluation of agent capabilities rather than testing causal hypotheses about economic outcomes, so causal evidence strength is not applicable; results are descriptive performance metrics on the benchmark.

Methods Rigor: medium. The benchmark is large-scale and detailed (5 worker profiles, 74 file types, 20,476 files, 388 tasks, 7,399 rubrics) with human baselines and a cost-saving Lite subset, and multiple agent harnesses and foundation models are evaluated; however, the paper likely leaves open questions about sampling/selection of workspaces and tasks, annotation and rubric reliability (e.g., inter-annotator agreement), sensitivity to rubric design, representativeness of the chosen profiles and file types, and reproducibility of harness configurations.

Sample: Constructed realistic workspaces covering 5 worker profiles, 74 file types and 20,476 files (up to 20GB total), with 388 tasks each associated with an explicit file-dependency graph and 7,399 evaluation rubrics; also provides Workspace-Bench-Lite (100 tasks). Evaluations run on 4 agent harnesses and 7 foundation models, with human performance reported for comparison.

Themes: productivity, human_ai_collab

Generalizability

  • Workspaces are limited to 5 curated worker profiles and selected file types, which may not represent all industries or workflows.
  • Tasks and dependency graphs are curated and may not capture the full diversity or temporality of real-world workspaces.
  • Cultural, language, and domain biases in files/rubrics could limit applicability across regions or specialties.
  • Performance depends on agent harness and evaluation protocol; different interfaces, tool access, or model settings may yield different results.
  • The benchmark reflects capabilities at time of evaluation; rapidly evolving models may change relative performance.

Claims (10)

  • Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. [Other]

    • Direction: positive · Confidence: high · Outcome: ability of AI agents to use file dependencies to complete tasks · Details: 0.03
  • Existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. [Other]

    • Direction: negative · Confidence: medium · Outcome: coverage/realism of file dependencies in existing benchmarks · Details: 0.11
  • We construct Workspace-Bench with 5 worker profiles, 74 file types, 20,476 files (up to 20GB), 388 tasks, and 7,399 total rubrics, each task associated with its own file dependency graph. [Other]

    • Direction: positive · Confidence: high · Outcome: benchmark size and heterogeneity (worker profiles, file types, file count, task count, rubric count, max size) · Details: n=20476; 0.3
  • Workspace-Bench includes files up to 20GB in size. [Other]

    • Direction: positive · Confidence: high · Outcome: maximum file size in the benchmark · Details: up to 20GB; 0.3
  • We provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. [Other]

    • Direction: positive · Confidence: high · Outcome: evaluation cost (and distributional fidelity of the subset) · Details: n=100; about 70% reduction in evaluation costs; 0.18
  • We evaluate 4 popular agent harnesses and 7 foundation models on Workspace-Bench. [Other]

    • Direction: neutral · Confidence: high · Outcome: number of agent harnesses and foundation models evaluated · Details: n=11; 0.18
  • The best-performing agent reaches only 68.7% on the benchmark. [Other]

    • Direction: negative · Confidence: high · Outcome: benchmark score (agent performance) · Details: n=7399; 68.7%; 0.3
  • Human performance on the benchmark is 80.7%. [Other]

    • Direction: positive · Confidence: high · Outcome: benchmark score (human performance) · Details: n=7399; 80.7%; 0.3
  • The average performance across evaluated agents is only 47.4%. [Other]

    • Direction: negative · Confidence: high · Outcome: average benchmark score across agents · Details: n=7399; 47.4%; 0.3
  • Experimental results show that current agents remain far from reliable workspace learning. [Other]

    • Direction: negative · Confidence: high · Outcome: reliability of agents on workspace learning tasks · Details: n=7399; 0.18

Notes