A large new benchmark of real-world file workspaces finds AI agents far from human-level performance: the best system scores 68.7% against a human 80.7%, with an average across agents of just 47.4%, exposing persistent weaknesses in cross-file reasoning and workspace-level decision-making.
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.
Summary
Main Finding
Workspace-Bench 1.0 introduces a realistic, dependency-driven benchmark for “workspace learning” (agents reasoning over large, heterogeneous file ecosystems). Evaluation of 28 agent configurations (4 harnesses × 7 LLMs) shows current autonomous agents are far from reliable: the average Rubrics Pass Rate is 47.4% and the best configuration reaches about 68.7%, while humans with tools reach 80.7%. The major failure modes are heterogeneous file understanding and file-lineage tracing, and harder tasks reduce pass rates (Easy 57.6% → Hard 40.5%). The benchmark also exposes large inference costs, in both tokens and interaction turns, for some configurations.
Key Points
Benchmark scope
- 5 realistic user personas (operations manager, logistics manager, product manager, backend developer, researcher).
- 20,476 files across 74 file types (up to 20GB total).
- 388 dependency-driven tasks with explicit file-dependency graphs.
- 7,399 rubrics (average ~19.1 rubrics per task) assessing final outputs and intermediate decisions.
- Workspace-Bench-Lite: 100-task subset preserving distribution while cutting evaluation cost by ≈70%.
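The summary does not say how the Lite subset was drawn. As a minimal sketch, assuming proportional stratified sampling over some per-task label (the `key` function and the `persona` field below are hypothetical, not taken from the paper), a distribution-preserving subset could be selected like this:

```python
import random
from collections import defaultdict


def stratified_subset(tasks, key, k, seed=0):
    """Pick exactly k tasks so each stratum keeps roughly its share of the full set.

    Assumption: the strata (e.g. persona or difficulty) are whatever `key` returns;
    the paper's actual selection procedure is not described in this summary.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for task in tasks:
        strata[key(task)].append(task)

    total = len(tasks)
    # Largest-remainder allocation: floor each proportional quota,
    # then hand the leftover slots to the strata with the biggest remainders.
    exact = {s: k * len(members) / total for s, members in strata.items()}
    quota = {s: int(x) for s, x in exact.items()}
    leftover = k - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    subset = []
    for s, members in strata.items():
        subset.extend(rng.sample(members, quota[s]))
    return subset
```

Calling `stratified_subset(tasks, key=lambda t: t["persona"], k=100)` on a 388-task list would return 100 tasks whose persona mix tracks the full benchmark.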
Evaluation framework and tooling
- Workspace-grounded evaluation with dual parallel acceleration and an Agent-as-a-Judge paradigm for fine-grained scoring (intermediate decisions + final correctness).
- Tasks annotated and validated by experts; LLMs used only for auxiliary verification/rubric optimization.
- Measured operational metrics: interaction turns, token consumption, task-specific costs.
Empirical results
- Average Rubrics Pass Rate across 28 configurations: 47.4%.
- Best-performing configuration: OpenClaw + Claude-Opus4.7 ≈ 68.7% (human + tools: 80.7%).
- Performance declines with task difficulty (Easy 57.6%, Hard 40.5%).
- Significant variation across harnesses and LLMs. Harnesses boost weaker LLMs more than strong ones.
- Cost explosions observed: e.g., DeepAgent + MiniMax-M2.7 consumed up to 58.1 interaction turns and ~0.61M tokens per task while underperforming (~45% pass rate).
Conceptual contributions
- Explicit modeling and annotation of semantic relations, result-providing file aggregation, and file-lineage relations—dimensions often missing from prior benchmarks.
- A five-stage characterization of agentic workspace learning (from data-insensitive guidance to data-driven self-evolution), with identified bottlenecks such as “orchestration singularity” and the “Data Association Gap.”
Data & Methods
Data assembly
- Persona-driven workspace simulation combining agent-based structure generation and hybrid file population (real files + controlled generation) to mimic messy, role-specific workspaces.
- Files include heterogeneous modalities (docs, sheets, code, chat logs, etc.) and lineage/derivative relationships.
Task curation and annotation
- 388 tasks collected from real office scenarios and curated with domain experts.
- Each task paired with a file-dependency graph and manually validated reference outputs.
- Rubrics include checks for: file discovery, correct aggregation of result files, intra- and inter-file reasoning, lineage tracing, and correctness of operations.
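To make the file-dependency graph and lineage-tracing ideas concrete, here is a minimal sketch. The adjacency-map representation and all file names are hypothetical illustrations, not the benchmark's actual annotation format:

```python
from collections import deque


def trace_lineage(derived_from, target):
    """Return every upstream file that `target` depends on, nearest first (BFS)."""
    seen = {target}
    order = []
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:  # guard against shared or cyclic dependencies
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical workspace: each file maps to the files it was derived from.
derived_from = {
    "q3_report.docx": ["q3_summary.xlsx", "meeting_notes.md"],
    "q3_summary.xlsx": ["orders_raw.csv", "returns_raw.csv"],
}

print(trace_lineage(derived_from, "q3_report.docx"))
# ['q3_summary.xlsx', 'meeting_notes.md', 'orders_raw.csv', 'returns_raw.csv']
```

A lineage rubric could then check whether the files an agent actually opened cover this upstream set.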
Evaluation
- Tested 4 harnesses: ClaudeCode, DeepAgent, Hermes, OpenClaw.
- Tested 7 backbone LLMs (examples from paper: Opus-4.7, GLM-5.1, MiniMax-M2.7, GPT-5.4, Kimi-2.5, Seed-2.0-Code, Gemini-3.1-Pro).
- 28 harness×LLM configurations evaluated on the Lite subset (and full benchmark in some experiments).
- Primary metric: Rubrics Pass Rate (granular rubric-level pass/fail aggregated).
- Measured auxiliary costs: interaction turns, token consumption, and evaluation labeling effort (more than roughly 2,500 human-hours reported for full curation).
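The configuration grid and the primary metric can be sketched as follows. The summary does not state whether rubric results are pooled across tasks or averaged per task, so this sketch computes both; the `results` structure is an assumed format for illustration:

```python
from itertools import product

HARNESSES = ["ClaudeCode", "DeepAgent", "Hermes", "OpenClaw"]
MODELS = ["Opus-4.7", "GLM-5.1", "MiniMax-M2.7", "GPT-5.4",
          "Kimi-2.5", "Seed-2.0-Code", "Gemini-3.1-Pro"]

# Every harness paired with every backbone model: 4 x 7 = 28 configurations.
CONFIGS = list(product(HARNESSES, MODELS))


def pass_rates(results):
    """Aggregate rubric-level pass/fail outcomes.

    results: dict mapping task id -> list of booleans, one per rubric (assumed format).
    Returns (micro, macro): micro pools all rubrics, macro averages per-task rates.
    """
    non_empty = [r for r in results.values() if r]
    pooled = [passed for r in non_empty for passed in r]
    micro = sum(pooled) / len(pooled)
    macro = sum(sum(r) / len(r) for r in non_empty) / len(non_empty)
    return micro, macro
```

The two aggregates differ when tasks have unequal rubric counts: with one task passing 2 of 4 rubrics and another passing 2 of 2, the pooled rate is 4/6 while the per-task average is 0.75.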
Implications for AI Economics
Inference cost is a practical adoption bottleneck
- The paper reports that enterprises cite inference cost as a top barrier and shows configurations with very large token and turn footprints. High per-task token costs (e.g., hundreds of thousands of tokens in pathological runs) directly raise the marginal cost of automation in workplace settings.
- Economic consequence: even promising models may be uneconomical unless token/interaction efficiency or pricing improves.
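As a rough illustration of how token footprints turn into marginal cost, the sketch below converts tokens per task into dollars. The price is a placeholder assumption, not a real vendor rate, and the single blended rate ignores any difference between input and output token pricing:

```python
def cost_per_task(tokens, usd_per_million_tokens):
    """Dollar cost of one task at a single blended per-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def cost_per_passed_rubric(tokens, usd_per_million_tokens, rubrics, pass_rate):
    """Spread the task cost over the rubrics the agent actually satisfied."""
    passed = rubrics * pass_rate
    return cost_per_task(tokens, usd_per_million_tokens) / passed


# ~0.61M tokens per task is the figure reported for DeepAgent + MiniMax-M2.7.
# PLACEHOLDER_PRICE is hypothetical; substitute the actual contract rate.
PLACEHOLDER_PRICE = 3.0  # USD per million tokens (assumption)
task_cost = cost_per_task(610_000, PLACEHOLDER_PRICE)
rubric_cost = cost_per_passed_rubric(610_000, PLACEHOLDER_PRICE, rubrics=19.1, pass_rate=0.45)
print(f"per task: ${task_cost:.2f}, per passed rubric: ${rubric_cost:.2f}")
```

Dividing by passed rubrics rather than by tasks makes a costly but unreliable configuration look as expensive as it effectively is.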
ROI and deployment readiness
- Average agent reliability (47.4%) is substantially below human+tools (80.7%), implying limited replacement value for complex office workflows today.
- Businesses must weigh savings from partial automation against costs: model inference, engineering (harnesses, parsers), oversight, error-handling and rework caused by agent mistakes.
Invest in targeted capabilities with high economic leverage
- Heterogeneous file parsing and lineage-tracing are primary bottlenecks; investment in robust multimodal parsers, provenance-aware retrieval, and file-indexing infrastructure likely yields outsized returns (improving accuracy and reducing token churn).
- Improvements in retrieval/RAG, compact context representations, and summarization of file clusters reduce token costs per task — directly improving per-task economics.
Harnesses and systems engineering matter
- Agent harnesses can substantially improve weaker models (cost-effective way to raise performance when replacing LLMs is expensive). This suggests economic value in engineering platforms that wrap existing models effectively (tool orchestration, caching, chunking).
- For procurement decisions, buyers should evaluate harness+model combinations, not LLMs in isolation.
Need for hybrid human-in-the-loop models
- Because human+tools still outperform autonomous agents, the economically optimal design for many workflows will be hybrid: agents do pre-filtering, extraction, and candidate synthesis; humans retain oversight for high-risk decisions.
- This reduces liability and rework costs, while leveraging agent efficiency where reliability is achievable.
Benchmarking, procurement, and standards
- Workspace-Bench provides a practical, task- and cost-aware evaluation standard that vendors and buyers can use to compare agent configurations on realistic, dependency-driven tasks.
- Enterprises should demand metrics beyond accuracy: cost-per-task (tokens + compute), interaction turns, intermediate rubric reliability (to forecast rework), and lineage correctness (for compliance-sensitive domains).
Market and workforce effects
- Slower-than-expected productivity gains for complex knowledge work: automation value will concentrate in tasks with limited cross-file dependencies or in well-structured workflows until workspace learning improves.
- Demand growth for tools that organize and standardize workspace data (document indexing, provenance capture) — vendors can monetize improved data hygiene that lowers agent costs and raises accuracy.
Policy and risk management
- For compliance-heavy domains, inability to trace file lineage and provenance reduces legal/operational trust in fully autonomous agents. Economic adoption requires better provenance guarantees and auditability.
Practical recommendations for economists, procurement teams, and product leaders
- When estimating expected savings from agent deployment, include token/interaction costs, harness engineering costs, expected error/rework rates, and human supervision costs; use rubric-like micro-metrics to forecast rework.
- Prioritize investments in (a) workspace normalization (indexing/provenance), (b) retrieval and summarization layers that reduce context size, and (c) harness engineering that caps unnecessary LLM calls.
- Use Workspace-Bench (or its Lite subset) as an economic stress-test when comparing vendor claims: measure both accuracy and per-task operational cost.
- Consider hybrid workflows (agent pre-processing + human verification) until workspace-learning capabilities reach human parity for targeted task classes.
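The first recommendation can be turned into a simple per-task estimate. This is a sketch under stated assumptions: every dollar figure is hypothetical, and using a rubric pass rate as the share of attempts that need no rework is a rough proxy, since a task can pass most rubrics and still require fixing.

```python
from dataclasses import dataclass


@dataclass
class TaskEconomics:
    human_cost: float        # cost of a person completing the task unaided
    inference_cost: float    # tokens + compute for one agent attempt
    supervision_cost: float  # human review of the agent's output
    rework_cost: float       # cost of repairing a failed attempt
    success_rate: float      # assumed share of attempts needing no rework

    def agent_cost(self) -> float:
        """Expected cost of an agent attempt, including expected rework."""
        expected_rework = (1.0 - self.success_rate) * self.rework_cost
        return self.inference_cost + self.supervision_cost + expected_rework

    def net_saving(self) -> float:
        """Positive when the agent workflow is cheaper than the human baseline."""
        return self.human_cost - self.agent_cost()


# All dollar amounts are illustrative placeholders; 0.474 is the reported average pass rate.
estimate = TaskEconomics(human_cost=20.0, inference_cost=1.83,
                         supervision_cost=3.0, rework_cost=15.0, success_rate=0.474)
print(f"agent: ${estimate.agent_cost():.2f}, net saving: ${estimate.net_saving():.2f}")
```

Harness engineering cost is omitted here because it is a fixed cost; it would be amortized over the expected task volume and added to `agent_cost`.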
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. | Other | positive | high | ability of AI agents to use file dependencies to complete tasks | 0.03 |
| Existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. | Other | negative | medium | coverage/realism of file dependencies in existing benchmarks | 0.11 |
| We construct Workspace-Bench with 5 worker profiles, 74 file types, 20,476 files (up to 20GB), 388 tasks, and 7,399 total rubrics, each task associated with its own file dependency graph. | Other | positive | high | benchmark size and heterogeneity (worker profiles, file types, file count, task count, rubric count, max size) | n=20476; 0.3 |
| Workspace-Bench includes files up to 20GB in size. | Other | positive | high | maximum file size in the benchmark | up to 20GB; 0.3 |
| We provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. | Other | positive | high | evaluation cost (and distributional fidelity of the subset) | n=100; about 70% reduction in evaluation costs; 0.18 |
| We evaluate 4 popular agent harnesses and 7 foundation models on Workspace-Bench. | Other | neutral | high | number of agent harnesses and foundation models evaluated | n=11; 0.18 |
| The best-performing agent reaches only 68.7% on the benchmark. | Other | negative | high | benchmark score (agent performance) | n=7399; 68.7%; 0.3 |
| Human performance on the benchmark is 80.7%. | Other | positive | high | benchmark score (human performance) | n=7399; 80.7%; 0.3 |
| The average performance across evaluated agents is only 47.4%. | Other | negative | high | average benchmark score across agents | n=7399; 47.4%; 0.3 |
| Experimental results show that current agents remain far from reliable workspace learning. | Other | negative | high | reliability of agents on workspace learning tasks | n=7399; 0.18 |