A new benchmark reveals AI assistants still struggle with complex desktop workflows: on DeskCraft’s 538 long-horizon creative and engineering tasks, GPT-5.4 scores only ~32% on standard tasks and ~28% when realistic interactions are allowed. The suite exposes systematic failures in delivering long workflows and in proactively clarifying ambiguous instructions despite a formal mid-turn/post-turn collaboration protocol.

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang · June 02, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

DeskCraft is a long-horizon desktop GUI benchmark for professional creative and engineering workflows. It evaluates 18 agents on 538 tasks under a formal mid-turn and post-turn interaction protocol and finds persistent failures in long-horizon execution and proactive clarification (GPT-5.4 scores 31.6% on standard tasks and 27.6% on interactive tasks).

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification. We will open-source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft.

Summary

Main Finding

DeskCraft introduces a 538-task desktop GUI benchmark focused on long-horizon professional workflows and human-in-the-loop collaboration. State-of-the-art agents (including GPT-5.4) perform poorly on sustained multi-step professional tasks: the best models score ≈30–34% on standard tasks and ≈26–28% on interactive tasks. Three weaknesses stand out: (1) agents rarely deliver the artifacts required by L3 (long-horizon) workflows, (2) they seldom ask for clarification proactively or coordinate plans with the user, and (3) extending the action budget beyond ~100 steps yields only small gains. These results imply that current desktop agents are far from reliably automating professional creative and engineering desktop work without substantial human oversight.

Key Points

Benchmark scope
- 538 tasks total: 386 standard tasks (L1: 141 atomic, L2: 137 compositional, L3: 108 long-horizon) and 152 interactive tasks.
- Covers 11 desktop applications across 5 domains: office/browser/dev, image & vector design, video editing, audio production, 3D rendering.
- L3 long-horizon tasks typically require >50 execution steps and preserve multi-app handoffs and named artifacts.
Interaction protocol
- Deterministic phase-based user-agent interaction using three triggers: agent_ask (agent-initiated clarification), step_count (user-initiated interruption mid-execution), agent_done (post-turn feedback/corrections).
- Phases allow mid-turn and post-turn messages to evolve the goal in a reproducible way.
Evaluation and verification
- Execution-based verification: domain-aware verifiers extract structured state (project files/runtimes) and apply rule-based checks to score success (binary).
- 300+ evaluator functions; 279 asset files across tasks.
Models evaluated
- 18 agents across three families: proprietary frontier models (GPT-5.4, Kimi-K2.6), open-source generalist VLMs (Qwen variants), and open-source GUI-specialized foundation models.
- For interactive tasks, a fixed MLLM (Kimi-K2.5) acts as the deterministic user simulator.
Key empirical outcomes
- Top performance: GPT-5.4 reaches ≈31.6% on the standard split and ≈27.6% on the interactive split; in places the paper reports ≈33.8% for the strongest model on standard tasks.
- Kimi-K2.6 ≈25.7% on interactive under a 100-step limit.
- Sharp drop in success from L1→L2→L3; agents rarely ask clarifying questions proactively.
- Increasing step budgets to 300 steps yields limited additional successes beyond what 100 steps recover.

Data & Methods

Task formulation
- Each task τ = (s0, u0, Φ, E, R): initial desktop state s0, initial instruction u0, optional phase sequence Φ (phased user messages with triggers), environment E (Ubuntu VM), and evaluation function R.
- Agent action space: GUI operations + DONE / ASK / FAIL. Agents observe screenshots and the active instruction.
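The task tuple above can be pictured as a small record type. The sketch below is illustrative only: the class and field names (`Task`, `Phase`, `initial_state`, `evaluate`, and so on) are hypothetical and are not taken from the DeskCraft codebase, and representing s0 and E as strings is an assumption made for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Trigger(Enum):
    # Trigger names follow the summary; the enum itself is hypothetical.
    AGENT_ASK = "agent_ask"
    STEP_COUNT = "step_count"
    AGENT_DONE = "agent_done"


@dataclass(frozen=True)
class Phase:
    trigger: Trigger
    message: str                    # user message released when the trigger fires
    at_step: Optional[int] = None   # only meaningful for STEP_COUNT


@dataclass(frozen=True)
class Task:
    initial_state: str                                       # s0 (assumed: a snapshot id)
    instruction: str                                         # u0
    phases: tuple[Phase, ...] = ()                           # Phi; empty for standard tasks
    environment: str = "ubuntu-vm"                           # E
    evaluate: Callable[[dict], bool] = lambda state: False   # R, binary success

    @property
    def is_interactive(self) -> bool:
        # A task counts as interactive when it carries at least one phase.
        return len(self.phases) > 0


standard = Task(initial_state="snap-1", instruction="Export the logo as PNG")
interactive = Task(
    initial_state="snap-1",
    instruction="Export the logo",
    phases=(Phase(Trigger.AGENT_DONE, "Make it 512 px wide."),),
    evaluate=lambda state: state.get("width") == 512,
)
```

Under this reading, the standard and interactive splits differ only in whether the phase sequence is empty.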
Difficulty taxonomy
- L1: atomic single GUI ops (low step count).
- L2: composition of related L1 ops (2–4 dependent ops).
- L3: real delivery-style long-horizon workflows (>50 ops; cross-app; named inputs/outputs).
Interaction protocol (deterministic)
- Three composable triggers:
  - agent_ask: fires when agent emits ASK (agent requests clarification).
  - step_count: fires after a pre-specified number of agent steps (models user interruption).
  - agent_done: fires after agent emits DONE (post-completion feedback/corrections).
- The MLLM user simulator replies deterministically to triggers and to unexpected ASK.
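One way to read the three triggers is as a small dispatcher that checks the pending phase after every agent step. The sketch below assumes phases are consumed strictly in order and that the agent's special actions are the strings "ASK" and "DONE"; the function name `next_user_message` and the `Phase` fields are hypothetical, not DeskCraft's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Phase:
    trigger: str                    # "agent_ask", "step_count", or "agent_done"
    message: str                    # scripted user message for this phase
    at_step: Optional[int] = None   # step threshold, used by "step_count" only


def next_user_message(phases, cursor, action, step):
    """Return (message, new_cursor) if the pending phase fires, else (None, cursor)."""
    if cursor >= len(phases):
        return None, cursor
    phase = phases[cursor]
    fired = (
        (phase.trigger == "agent_ask" and action == "ASK")
        or (phase.trigger == "step_count"
            and phase.at_step is not None and step >= phase.at_step)
        or (phase.trigger == "agent_done" and action == "DONE")
    )
    return (phase.message, cursor + 1) if fired else (None, cursor)


# Hypothetical episode: an interruption at step 3, then post-turn feedback.
phases = [
    Phase("step_count", "Use the blue palette instead.", at_step=3),
    Phase("agent_done", "Also export a PDF copy."),
]
cursor, delivered = 0, []
for step, action in enumerate(["CLICK", "TYPE", "CLICK", "DONE"], start=1):
    message, cursor = next_user_message(phases, cursor, action, step)
    if message is not None:
        delivered.append((step, message))
```

Because firing depends only on the agent's action and the step counter, replaying the same trajectory releases the same messages at the same steps, which is the reproducibility property the protocol aims for.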
Verifiers and assets
- Verifiers extract semantic state from files/runtimes; checks are rule-atom based. Tasks backed by 300+ evaluators.
- Assets: 279 unique files in 19 formats, sourced publicly or authored by annotators.
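A rule-atom verifier of the kind described above might look like the following minimal sketch. It assumes that a task succeeds only when every atom holds on the extracted state; the helper names (`equals`, `at_least`, `verify`) and the state keys are hypothetical and do not come from DeskCraft's evaluators.

```python
from typing import Any, Callable

# A rule atom is a predicate over the structured state extracted from a project file.
RuleAtom = Callable[[dict[str, Any]], bool]


def equals(key: str, expected: Any) -> RuleAtom:
    """Atom that holds when state[key] equals the expected value."""
    return lambda state: state.get(key) == expected


def at_least(key: str, minimum: float) -> RuleAtom:
    """Atom that holds when state[key] is numeric and not below the minimum."""
    return lambda state: (
        isinstance(state.get(key), (int, float)) and state[key] >= minimum
    )


def verify(state: dict[str, Any], atoms: list[RuleAtom]) -> bool:
    """Binary success: every rule atom must hold on the extracted state."""
    return all(atom(state) for atom in atoms)


# Hypothetical state extracted from an exported image project.
state = {"format": "png", "width": 1920, "layer_count": 3}
atoms = [equals("format", "png"), at_least("width", 1024), at_least("layer_count", 2)]
passed = verify(state, atoms)
```

Keeping atoms as small predicates makes each check independently testable, and a failed task can be traced to the specific atom that did not hold.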
Experimental protocol
- Agents evaluated on Standard and Interactive splits; step budgets varied (e.g., 100 vs 300 steps) to probe long-horizon capacity and reliability (pass@k analysis used).
- Models include proprietary frontier agents, scaled open-source VLMs, and GUI-specialized models to probe scale and domain adaptation effects.
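The summary does not say which pass@k estimator the paper uses. A common choice is the unbiased estimator 1 − C(n − c, k) / C(n, k), where n is the number of attempts per task, c the number of successful attempts, and k the number of attempts the metric allows. The sketch below implements that estimator as an assumption, not as DeskCraft's confirmed method.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n attempts with c successes, succeeds."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 4 attempts, 1 success.
single = pass_at_k(n=4, c=1, k=1)   # 1 - 3/4
double = pass_at_k(n=4, c=1, k=2)   # 1 - 3/6
```

Averaging this quantity over tasks gives a benchmark-level pass@k; comparing k = 1 with larger k separates what an agent can occasionally do from what it does reliably.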

Implications for AI Economics

Limited near-term labor substitution for complex creative/engineering desktop work
- L1/L2 tasks (atomic + short compositions) are the most automatable; L3 workflow-level delivery remains unreliable.
- For employers considering automation, expect partial automation: routine/atomic steps may be offloaded, but full task handover for end-to-end delivery still requires human oversight.
Supervision and coordination costs matter
- Agents rarely initiate proactive clarification, and failures in long-horizon planning mean human-in-the-loop time will remain necessary. Any productivity model should include supervision/coordination time and error-correction costs as a persistent overhead.
- Error costs can be high in professional pipelines (e.g., incorrect renders, lost assets). Firms must weigh reduced per-step labor costs against increased review/rollback costs.
Productivity gains are non-linear and thresholded
- Because success rates on complex tasks are far below human parity, realized productivity gains will be concentrated in workflows that decompose into many reliable L1/L2 subtasks.
- Benchmarks like DeskCraft suggest adoption thresholds: only if agent reliability for a task family crosses a certain reliability × speed frontier will firms reallocate labor away from humans.
Value of domain specialization and evaluation for procurement
- GUI-specialized training and benchmarked verification matter. Purchasers should require task- or domain-specific benchmarks (not just general capability claims) and prefer models validated on execution-based, realistic workflows.
- Procurement contracts should specify step budgets, interaction behavior, and verifiable deliverable definitions (analogous to DeskCraft’s verifiers).
Impact on labor markets and skill composition
- Partial automation may shift labor away from routine GUI operations toward higher-level design, supervision, QA, and client-facing roles. This raises the complexity of the work left to humans but may reduce demand for low-skill desktop operation.
- Gig/contract labor markets may shift: fewer micro-tasks available, but higher demand for quality assurance and pipeline orchestration roles.
Costs of deployment and ROI modeling
- Deployment cost drivers: model API/compute, retraining/fine-tuning for app-specific actions, integration with local project files and verifiers, and human supervision hours.
- Simple ROI model to consider: Expected net gain per task = (task_value × automation_success_rate) − (supervision_cost × supervision_time) − (error_cost × failure_rate) − model_usage_cost. DeskCraft provides empirical estimates for automation_success_rate and failure_rate by task class.
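The ROI expression above translates directly into a few lines of code. In the sketch below, only the 31.6% success rate comes from the benchmark results (GPT-5.4 on standard tasks); every monetary value and the supervision time are hypothetical placeholders, and treating the failure rate as 1 − success rate is an assumption.

```python
from typing import Optional


def expected_net_gain(
    task_value: float,
    automation_success_rate: float,
    supervision_cost: float,        # cost per hour of human supervision
    supervision_time: float,        # hours of supervision per task
    error_cost: float,              # cost incurred when the agent fails
    model_usage_cost: float,        # API/compute cost per task
    failure_rate: Optional[float] = None,
) -> float:
    """Expected net gain per task, following the formula in the text."""
    if failure_rate is None:
        # Assumption: every attempt that does not succeed counts as a failure.
        failure_rate = 1.0 - automation_success_rate
    return (
        task_value * automation_success_rate
        - supervision_cost * supervision_time
        - error_cost * failure_rate
        - model_usage_cost
    )


gain = expected_net_gain(
    task_value=100.0,               # hypothetical
    automation_success_rate=0.316,  # GPT-5.4, standard split
    supervision_cost=60.0,          # hypothetical hourly rate
    supervision_time=0.25,          # hypothetical: 15 minutes
    error_cost=20.0,                # hypothetical
    model_usage_cost=2.0,           # hypothetical
)
```

With these placeholder inputs the expected net gain is about 0.92 per task (31.6 − 15 − 13.68 − 2), which illustrates how supervision and error costs can absorb most of the value at current success rates.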
Policy, liability, and certification
- For professional use (media, engineering), deterministic verifiers and routine benchmark validation should be part of certification/regulatory frameworks to allocate liability and set minimum reliability standards.
- Benchmarks like DeskCraft can support standards for safety/accuracy in deployed desktop agents.
Research and investment priorities from an economic perspective
- Investing in (a) domain-specific fine-tuning (GUI behavior + project-file reasoning), (b) improved proactive clarification policies, and (c) long-horizon planning architectures likely yields the largest marginal productivity returns.
- Firms and researchers should prioritize benchmarks containing realistic, multi-app workflows and deterministic verification to measure economically relevant capabilities.

Practical recommendations for economists and decision-makers
- When modeling automation impacts, incorporate realistic reliability rates by task difficulty (use DeskCraft L1/L2/L3 splits), and include supervision/verification time and error costs explicitly.
- Use execution-based benchmarks (like DeskCraft) as part of procurement and risk assessment for desktop automation tools.
- Expect short-term gains concentrated in task simplification and supervision-support tooling; full substitution of specialist creative/engineering desktop roles is unlikely until long-horizon planning and proactive human-agent coordination improve substantially.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This paper introduces and evaluates a benchmark rather than estimating causal effects; it reports agent performance on tasks but does not make causal claims requiring identification.
Methods Rigor: medium. The paper develops a systematic, multilevel task taxonomy and a formal interaction protocol, evaluates a substantial set of agents (18) on a large task set (538), and reports quantitative results; however, interactions appear simulated rather than real-user studies, evaluation metrics and task scripting may bias results, and external validation (e.g., field deployment or human-in-the-loop experiments) is limited or absent.
Sample: 538 benchmark tasks spanning professional desktop creative and engineering software across design, video, audio, and 3D creation, organized into a multilevel difficulty taxonomy with long-horizon tasks (>50 execution steps); evaluations run 18 proprietary and open-source agents (including GPT-5.4, reported at 31.6% on standard tasks and 27.6% on interactive tasks); interaction protocol formalizes mid-turn (agent clarification and user interruption) and post-turn (user feedback) exchanges; interactions appear evaluated via the benchmark's protocol rather than via extensive real-world user sessions.
Themes: human_ai_collab, productivity
Generalizability
- Benchmark tasks are curated/simulated and may not capture the full variability of real-world professional workflows.
- Interactions are formalized and likely simulated rather than measured in live user studies, limiting ecological validity.
- Coverage is limited to selected creative and engineering desktop applications; results may not generalize to other domains (e.g., enterprise software, coding IDEs).
- Performance of proprietary agents may reflect specific API access, model versions, or prompt engineering that change over time.
- Evaluation metrics and task scripts may bias toward certain solution styles and not reflect diverse user goals or multi-user collaboration scenarios.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
Existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront.	Research Productivity	negative	high	representation of real-world workflow complexity in prior benchmarks	0.18
We introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human-agent collaboration.	Research Productivity	positive	high	availability of a benchmark for long-horizon workflows and human-agent collaboration	0.3
DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps.	Task Completion Time	positive	high	task difficulty / horizon length (number of execution steps)	over 50 execution steps 0.3
DeskCraft covers professional creative software across design, video, audio, and 3D creation.	Research Productivity	positive	high	breadth of software domains covered by the benchmark	0.3
DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges.	Governance And Regulation	positive	high	formalization of human-agent interaction protocol	0.3
Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion.	Task Allocation	positive	high	coverage of interaction types (mid-turn and post-turn) in the protocol	0.18
We evaluate 18 proprietary and open source agents on 538 tasks.	Research Productivity	positive	high	evaluation sample size (agents and tasks)	n=538 18 agents 0.3
GPT-5.4 reaches 31.6% on standard tasks.	Output Quality	positive	high	task success rate / benchmark score on standard (non-interactive) tasks	n=538 31.6% 0.3
GPT-5.4 reaches 27.6% on interactive tasks.	Output Quality	positive	high	task success rate / benchmark score on interactive tasks	n=538 27.6% 0.3
Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification.	Output Quality	negative	high	failure modes: long-horizon workflow delivery and proactive clarification	n=538 0.18
We will open-source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft.	Research Productivity	positive	high	availability of benchmark artifacts (code, tasks, data) as open source	0.03