A new benchmark shows that AI assistants still struggle with complex desktop workflows: on DeskCraft’s 538 creative and engineering desktop tasks, GPT-5.4 scores only ~32% on standard tasks and ~28% on interactive tasks that include user exchanges during and after execution. The suite exposes systematic failures in delivering long-horizon workflows and in proactively clarifying ambiguous instructions, even under a formal mid-turn/post-turn collaboration protocol.
Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open-source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long-horizon workflow delivery and proactive clarification. We will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft.
Summary
Main Finding
DeskCraft is a 538-task desktop GUI benchmark focused on long-horizon professional workflows and human-in-the-loop collaboration. State-of-the-art agents (including GPT-5.4) perform poorly on sustained multi-step professional tasks: the best models score ≈30–34% on standard tasks and ≈26–28% on interactive tasks. The main failure modes are (1) failing to deliver complete artifacts on the hardest (L3) long-horizon workflows, (2) rarely asking clarifying questions or coordinating plans with the user, and (3) gaining little from action budgets beyond ~100 steps. These results imply that current desktop agents are far from reliably automating professional creative and engineering desktop work without substantial human oversight.
Key Points
- Benchmark scope
- 538 tasks total: 386 standard tasks (L1: 141 atomic, L2: 137 compositional, L3: 108 long-horizon) and 152 interactive tasks.
- Covers 11 desktop applications across 5 domains: office/browser/dev, image & vector design, video editing, audio production, 3D rendering.
- L3 long-horizon tasks typically require >50 execution steps and preserve multi-app handoffs and named artifacts.
- Interaction protocol
- Deterministic phase-based user-agent interaction using three triggers: agent_ask (agent-initiated clarification), step_count (user-initiated interruption mid-execution), agent_done (post-turn feedback/corrections).
- Phases allow mid-turn and post-turn messages to evolve the goal in a reproducible way.
- Evaluation and verification
- Execution-based verification: domain-aware verifiers extract structured state from project files and application runtimes, then apply rule-based checks to assign a binary success score.
- 300+ evaluator functions; 279 asset files across tasks.
- Models evaluated
- 18 agents across three families: proprietary frontier models (GPT-5.4, Kimi-K2.6), open-source generalist VLMs (Qwen variants), and open-source GUI-specialized foundation models.
- For interactive tasks, a fixed MLLM (Kimi-K2.5) acts as the deterministic user simulator.
- Key empirical outcomes
- Top performance: GPT-5.4 reaches ≈31.6% on the standard split and ≈27.6% on the interactive split (elsewhere the paper reports ≈33.8% for the strongest model on the standard split).
- Kimi-K2.6 ≈25.7% on interactive under a 100-step limit.
- Sharp drop in success from L1→L2→L3; agents rarely ask clarifying questions proactively.
- Increasing step budgets to 300 steps yields limited additional successes beyond what 100 steps recover.
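The three triggers in the interaction protocol above can be illustrated with a small simulation. This is a minimal sketch, not DeskCraft's implementation: the `Phase` class, `run_episode`, the toy policy, the strictly in-order firing of phases, and the placeholder reply to an unexpected ASK are all assumptions made for illustration. Only the trigger names (`agent_ask`, `step_count`, `agent_done`) and the ASK/DONE/FAIL actions come from the summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Phase:
    trigger: str                   # "agent_ask", "step_count", or "agent_done"
    message: str                   # user message released when the trigger fires
    at_step: Optional[int] = None  # step threshold, used only by "step_count"


def phase_fires(phase, action, step):
    """Return True if the next pending phase is triggered at this step."""
    if phase.trigger == "agent_ask":
        return action == "ASK"
    if phase.trigger == "step_count":
        return phase.at_step is not None and step >= phase.at_step
    if phase.trigger == "agent_done":
        return action == "DONE"
    return False


def run_episode(policy, instruction, phases, max_steps=100):
    """Toy loop: phases fire in order (an assumption), each adding one user message."""
    pending = list(phases)
    transcript = [instruction]  # every message the agent has seen so far
    for step in range(1, max_steps + 1):
        action = policy(step, transcript)
        if pending and phase_fires(pending[0], action, step):
            transcript.append(pending.pop(0).message)
            continue  # the goal has evolved, so the agent keeps working
        if action == "ASK":
            # Unexpected ASK: hypothetical placeholder reply.
            transcript.append("(no further details)")
            continue
        if action in ("DONE", "FAIL"):
            return action, step, transcript
    return "TIMEOUT", max_steps, transcript


def toy_policy(step, transcript):
    """Hypothetical agent: asks once, then declares DONE at steps 5 and 8."""
    if step == 1:
        return "ASK"
    if step in (5, 8):
        return "DONE"
    return "CLICK"


phases = [
    Phase("agent_ask", "Use 1080p."),
    Phase("step_count", "Also add a title card.", at_step=3),
    Phase("agent_done", "Export as MP4 too."),
]
status, steps, transcript = run_episode(toy_policy, "Make a 30 s promo clip.", phases)
```

In this toy run the first DONE is absorbed by the `agent_done` phase, which delivers post-turn feedback, and the episode ends only at the second DONE once no phases remain.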
Data & Methods
- Task formulation
- Each task τ = (s0, u0, Φ, E, R): initial desktop state s0, initial instruction u0, optional phase sequence Φ (phased user messages with triggers), environment E (Ubuntu VM), and evaluation function R.
- Agent action space: GUI operations + DONE / ASK / FAIL. Agents observe screenshots and the active instruction.
- Difficulty taxonomy
- L1: atomic single GUI ops (low step count).
- L2: composition of related L1 ops (2–4 dependent ops).
- L3: real delivery-style long-horizon workflows (>50 ops; cross-app; named inputs/outputs).
- Interaction protocol (deterministic)
- Three composable triggers:
- agent_ask: fires when agent emits ASK (agent requests clarification).
- step_count: fires after a pre-specified number of agent steps (models user interruption).
- agent_done: fires after agent emits DONE (post-completion feedback/corrections).
- The MLLM user simulator replies deterministically to triggers and to unexpected ASK.
- Verifiers and assets
- Verifiers extract semantic state from files/runtimes; checks are rule-atom based. Tasks backed by 300+ evaluators.
- Assets: 279 unique files in 19 formats, sourced publicly or authored by annotators.
- Experimental protocol
- Agents are evaluated on the Standard and Interactive splits. Step budgets are varied (e.g., 100 vs. 300 steps) to probe long-horizon capacity, and reliability is analyzed with pass@k.
- Models include proprietary frontier agents, scaled open-source VLMs, and GUI-specialized models to probe scale and domain adaptation effects.
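The task tuple τ = (s0, u0, Φ, E, R) and the rule-based binary verifier described above might be represented roughly as follows. This is a sketch under stated assumptions: the field types, the `Task` class, the example state keys, and the choice to combine rule atoms with a logical AND are illustrative and are not taken from the DeskCraft code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

State = Dict[str, Any]              # structured state extracted from files/runtimes
RuleAtom = Callable[[State], bool]  # one rule-based check over that state


@dataclass
class Task:
    s0: str                                           # initial desktop state (e.g., snapshot id)
    u0: str                                           # initial instruction
    phases: List[dict] = field(default_factory=list)  # optional phase sequence (Phi)
    env: str = "ubuntu-vm"                            # environment E
    atoms: List[RuleAtom] = field(default_factory=list)

    def evaluate(self, state: State) -> int:
        """R: binary score, 1 only if every rule atom holds (AND is an assumption)."""
        return int(all(atom(state) for atom in self.atoms))


# Hypothetical video-export task with three rule atoms.
task = Task(
    s0="snapshot-video-01",
    u0="Export the timeline as a 30-second 1080p MP4.",
    atoms=[
        lambda s: str(s.get("export_path", "")).endswith(".mp4"),
        lambda s: (s.get("width"), s.get("height")) == (1920, 1080),
        lambda s: abs(s.get("duration_s", 0.0) - 30.0) <= 1.0,
    ],
)

good = {"export_path": "out/final.mp4", "width": 1920, "height": 1080, "duration_s": 29.5}
bad = dict(good, width=1280, height=720)  # wrong resolution fails one atom

score_good = task.evaluate(good)
score_bad = task.evaluate(bad)
```

Scoring on extracted state rather than on screenshots is what makes the result reproducible: the same project file always yields the same binary outcome.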
Implications for AI Economics
- Limited near-term labor substitution for complex creative/engineering desktop work
- L1/L2 tasks (atomic + short compositions) are the most automatable; L3 workflow-level delivery remains unreliable.
- For employers considering automation, expect partial automation: routine/atomic steps may be offloaded, but full task handover for end-to-end delivery still requires human oversight.
- Supervision and coordination costs matter
- Agents rarely initiate proactive clarification, and failures in long-horizon planning mean human-in-the-loop time will remain necessary. Any productivity model should include supervision/coordination time and error-correction costs as a persistent overhead.
- Error costs can be high in professional pipelines (e.g., incorrect renders, lost assets). Firms must weigh reduced per-step labor costs against increased review/rollback costs.
- Productivity gains are non-linear and thresholded
- Because success rates are far below parity for complex tasks, realized productivity gains will be concentrated in workflows where tasks decompose into many reliable L1/L2 subtasks.
- Benchmarks like DeskCraft suggest adoption thresholds: firms will reallocate labor away from humans only once agents cross a combined reliability × speed threshold for a given task family.
- Value of domain specialization and evaluation for procurement
- GUI-specialized training and benchmarked verification matter. Purchasers should require task- or domain-specific benchmarks (not just general capability claims) and prefer models validated on execution-based, realistic workflows.
- Procurement contracts should specify step budgets, interaction behavior, and verifiable deliverable definitions (analogous to DeskCraft’s verifiers).
- Impact on labor markets and skill composition
- Partial automation may reallocate labor away from routine GUI operations toward higher-level design, supervision, QA, and client-facing roles. This raises the complexity of the work left to humans but may reduce demand for low-skill desktop operation.
- Gig/contract labor markets may shift: fewer micro-tasks available, but higher demand for quality assurance and pipeline orchestration roles.
- Costs of deployment and ROI modeling
- Deployment cost drivers: model API/compute, retraining/fine-tuning for app-specific actions, integration with local project files and verifiers, and human supervision hours.
- Simple ROI model to consider: Expected net gain per task = (task_value × automation_success_rate) − (supervision_cost × supervision_time) − (error_cost × failure_rate) − model_usage_cost. DeskCraft provides empirical estimates for automation_success_rate and failure_rate by task class.
- Policy, liability, and certification
- For professional use (media, engineering), deterministic verifiers and routine benchmark validation should be part of certification/regulatory frameworks to allocate liability and set minimum reliability standards.
- Benchmarks like DeskCraft can support standards for safety/accuracy in deployed desktop agents.
- Research and investment priorities from an economic perspective
- Investing in (a) domain-specific fine-tuning (GUI behavior + project-file reasoning), (b) improved proactive clarification policies, and (c) long-horizon planning architectures likely yields the largest marginal productivity returns.
- Firms and researchers should prioritize benchmarks containing realistic, multi-app workflows and deterministic verification to measure economically relevant capabilities.
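The simple ROI model given under "Costs of deployment and ROI modeling" can be written out as a short calculation. This is a sketch only: the function names and all monetary values are hypothetical, and setting failure_rate = 1 − automation_success_rate is an assumption that follows from binary scoring. The 31.6% and 27.6% success rates are the GPT-5.4 figures reported above.

```python
def expected_net_gain(task_value, success_rate, supervision_rate,
                      supervision_hours, error_cost, model_usage_cost):
    """Expected net gain per task, following the ROI formula in the text."""
    failure_rate = 1.0 - success_rate  # assumption: outcomes are binary
    return (task_value * success_rate
            - supervision_rate * supervision_hours
            - error_cost * failure_rate
            - model_usage_cost)


def break_even_success_rate(task_value, supervision_rate, supervision_hours,
                            error_cost, model_usage_cost):
    """Success rate at which the expected net gain is exactly zero."""
    fixed = supervision_rate * supervision_hours + model_usage_cost
    return (fixed + error_cost) / (task_value + error_cost)


# Hypothetical cost figures, chosen only to illustrate the arithmetic.
costs = dict(task_value=100.0, supervision_rate=40.0, supervision_hours=0.25,
             error_cost=20.0, model_usage_cost=2.0)

gain_standard = expected_net_gain(success_rate=0.316, **costs)
gain_interactive = expected_net_gain(success_rate=0.276, **costs)
threshold = break_even_success_rate(**costs)
```

The break-even rate comes from solving V·p − S − C·(1 − p) − M = 0 for p, which gives p = (S + C + M) / (V + C). It makes the adoption-threshold argument concrete: below that success rate, automating the task loses money under the assumed costs.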
Practical recommendations for economists and decision-makers
- When modeling automation impacts, incorporate realistic reliability rates by task difficulty (use DeskCraft L1/L2/L3 splits), and include supervision/verification time and error costs explicitly.
- Use execution-based benchmarks (like DeskCraft) as part of procurement and risk assessment for desktop automation tools.
- Expect short-term gains concentrated in task simplification and supervision-support tooling; full substitution of specialist creative/engineering desktop roles is unlikely until long-horizon planning and proactive human-agent coordination improve substantially.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. | Research Productivity | negative | high | representation of real-world workflow complexity in prior benchmarks | 0.18 |
| We introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human-agent collaboration. | Research Productivity | positive | high | availability of a benchmark for long-horizon workflows and human-agent collaboration | 0.3 |
| DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps. | Task Completion Time | positive | high | task difficulty / horizon length (number of execution steps) | over 50 execution steps; 0.3 |
| DeskCraft covers professional creative software across design, video, audio, and 3D creation. | Research Productivity | positive | high | breadth of software domains covered by the benchmark | 0.3 |
| DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. | Governance And Regulation | positive | high | formalization of human-agent interaction protocol | 0.3 |
| Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion. | Task Allocation | positive | high | coverage of interaction types (mid-turn and post-turn) in the protocol | 0.18 |
| We evaluate 18 proprietary and open source agents on 538 tasks. | Research Productivity | positive | high | evaluation sample size (agents and tasks) | n=538; 18 agents; 0.3 |
| GPT-5.4 reaches 31.6% on standard tasks. | Output Quality | positive | high | task success rate / benchmark score on standard (non-interactive) tasks | n=538; 31.6%; 0.3 |
| GPT-5.4 reaches 27.6% on interactive tasks. | Output Quality | positive | high | task success rate / benchmark score on interactive tasks | n=538; 27.6%; 0.3 |
| Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification. | Output Quality | negative | high | failure modes: long-horizon workflow delivery and proactive clarification | n=538; 0.18 |
| We will open-source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft. | Research Productivity | positive | high | availability of benchmark artifacts (code, tasks, data) as open source | 0.03 |