A new benchmark, CutVerse, tests autonomous GUI agents on 186 realistic media-editing tasks across seven professional applications and finds only 36% task success, highlighting persistent failures in long-horizon planning and domain-specific workflows.

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao · May 19, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

CutVerse is a benchmark of 186 realistic media post-production tasks across seven professional apps. Current autonomous GUI agents succeed on only 36% of its tasks, revealing limits in long-horizon reliability and domain-specific planning despite promising grounding and multimodal alignment.

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

Summary

Main Finding

CutVerse introduces a high-fidelity benchmark for evaluating GUI agents in professional media post-production and shows that current vision-language and GUI agents perform poorly on realistic editing workflows: state-of-the-art models achieve only ~36% task success. The paper demonstrates that long-horizon, dense, multimodal editing tasks (timeline manipulation, VFX, masking/tracking, audio alignment) pose fundamentally different and harder challenges than those covered by prior GUI benchmarks.

Key Points

Scope and novelty
- 186 human-verified, long-horizon tasks across 7 professional applications (e.g., Adobe Premiere Pro, After Effects, Photoshop) plus generative tools (ComfyUI, Keling).
- Tasks cover end-to-end post-production workflows (timeline editing, effects tuning, masking/tracking, audio rhythm editing, export) and integrate AIGC-to-editing pipelines (“Vibe Cutting”).
Data scale & complexity
- 2.43 hours of high-fidelity recordings, 3,484 atomic GUI interactions, average ~18.7 steps per trajectory (peak 239), ~23.8 interactions/minute.
- Timelines and multi-track controls dominate operations (timelines ≈ 46% of ops; layer/track controls ≈ 25%).
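The scale figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported in this summary; the variable names are illustrative, and it assumes one recorded trajectory per task, which the summary does not state explicitly.

```python
# Figures as reported in this summary (all are rounded values).
total_hours = 2.43
total_interactions = 3484
num_tasks = 186  # assumption: one trajectory per task

# Average number of atomic interactions per task.
steps_per_task = total_interactions / num_tasks

# Average interaction rate over the full recording time.
interactions_per_minute = total_interactions / (total_hours * 60)

print(f"steps per task: {steps_per_task:.1f}")
print(f"interactions per minute: {interactions_per_minute:.1f}")
```

The first value reproduces the reported ~18.7 steps per trajectory. The second comes out at roughly 23.9, close to the reported ~23.8; a gap of this size is plausible given that the recording duration is rounded to two decimals.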
Interaction model
- Agents are constrained to human-like motor actions (pixel-based clicks, drags, keyboard shortcuts) with vision-only perception — no privileged APIs or DOM access.
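To make the constraint concrete, the following is a minimal sketch of what a pixel-level, vision-only action space could look like. The class names, fields, and coordinates are hypothetical and chosen for illustration; the summary does not specify the benchmark's actual action schema.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Click:
    x: int  # screen pixel column
    y: int  # screen pixel row
    button: str = "left"


@dataclass(frozen=True)
class Drag:
    start: Tuple[int, int]  # (x, y) where the mouse button goes down
    end: Tuple[int, int]    # (x, y) where the mouse button is released


@dataclass(frozen=True)
class Hotkey:
    keys: Tuple[str, ...]  # keys pressed together, e.g. ("ctrl", "k")


# Hypothetical union of the three action kinds named in the text.
Action = Union[Click, Drag, Hotkey]


def describe(action: Action) -> str:
    """Render an action as a short, log-friendly string."""
    if isinstance(action, Click):
        return f"click({action.x}, {action.y}, {action.button})"
    if isinstance(action, Drag):
        return f"drag({action.start} -> {action.end})"
    if isinstance(action, Hotkey):
        return "hotkey(" + "+".join(action.keys) + ")"
    raise TypeError(f"unsupported action: {action!r}")


# Made-up example: select a clip, drag it along a timeline, press a shortcut.
trajectory = [Click(412, 880), Drag((412, 880), (640, 880)), Hotkey(("ctrl", "k"))]
for step in trajectory:
    print(describe(step))
```

Nothing in this representation refers to widget identifiers or application state, which mirrors the stated restriction that agents act only through screen coordinates and keystrokes.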
Infrastructure & evaluation
- Windows VM-based environment with resettable checkpoints for reproducible live execution.
- Parser that synchronizes high-framerate screen recordings with low-level I/O logs to produce grounded, milestone-driven trajectories.
- Milestone QA evaluator and fine-grained metrics that go beyond binary Success Rate to expose intermediate failures and error accumulation.
Baseline behavior and failure modes
- Stronger performance on procedural setup and file management; weak performance on sustained, compositional editing.
- Key bottlenecks: long-horizon planning/reliability, fine-grained spatial grounding in dense UIs, multimodal temporal alignment, missing compositional action abstractions.

Data & Methods

Dataset construction
- Human experts recorded authentic editing sessions across professional tools; recordings parsed into structured action trajectories and hierarchical milestones.
- Task types catalogued into 9 functional domains (effects tuning, export, asset import, audio editing, timeline arrangement, preview/validation, masking/tracking, launch/setup, generative workflows).
- Table highlights: effects/visual tuning (51 tasks, avg duration 52.8 s, avg 20.3 steps, labeled Extreme complexity); masking/tracking (10 tasks, avg duration 73.0 s, avg 25.4 steps).
Evaluation environment
- Custom Windows virtualization enforces pixel-level interaction and human-like constraints; VM checkpoints ensure reproducibility.
- No privileged API calls; agents must use continuous mouse/keyboard events based on visual frames.
Parser & representation
- Aligns frames and I/O logs to produce spatiotemporally grounded atomic actions and milestone hierarchies, enabling scalable, automated assessment.
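One simple way to align an input log with a screen recording is to map each logged event to the last frame captured at or before its timestamp. The sketch below illustrates that idea only; it is not the paper's parser, and the function name, event fields, and 60 fps frame rate are assumptions made for the example.

```python
from bisect import bisect_right
from typing import Dict, List


def align_events_to_frames(frame_times: List[float], events: List[Dict]) -> List[Dict]:
    """Attach to each event the index of the last frame captured at or before it.

    frame_times must be sorted ascending (seconds). Events that precede the
    first frame are mapped to frame 0.
    """
    aligned = []
    for event in events:
        idx = bisect_right(frame_times, event["t"]) - 1
        aligned.append({**event, "frame": max(idx, 0)})
    return aligned


# Hypothetical 60 fps recording (10 seconds) and two logged input events.
frame_times = [i / 60 for i in range(600)]
events = [
    {"t": 1.504, "type": "mouse_down", "x": 412, "y": 880},
    {"t": 2.010, "type": "mouse_up", "x": 640, "y": 880},
]

aligned = align_events_to_frames(frame_times, events)
for row in aligned:
    print(row["type"], "-> frame", row["frame"])
```

Choosing the frame at or before the event, rather than the nearest frame, pairs each action with the screen state that was visible when the action was taken. A mouse_down and mouse_up pair like the one above could then be merged into a single drag action in a later pass.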
Baselines & metrics
- Evaluated state-of-the-art VLM-based GUI agents and planner-executor architectures.
- Reported overall task success ≈ 36.0%; detailed per-milestone diagnostics reveal frequent intermediate failures even when some steps succeed.
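The difference between binary success and per-milestone diagnostics can be illustrated with a small sketch. The metric names and example runs below are hypothetical; the summary does not give the paper's exact metric definitions.

```python
from typing import List


def task_success(milestones: List[bool]) -> bool:
    """Binary success: every milestone must pass."""
    return all(milestones)


def milestone_completion(milestones: List[bool]) -> float:
    """Fraction of milestones that passed, regardless of order."""
    return sum(milestones) / len(milestones)


def progress_before_first_failure(milestones: List[bool]) -> float:
    """Fraction of milestones passed before the first failure occurred."""
    passed = 0
    for ok in milestones:
        if not ok:
            break
        passed += 1
    return passed / len(milestones)


# Made-up milestone outcomes for two tasks.
runs = {
    "import_and_trim": [True, True, True, True],
    "mask_and_track": [True, True, False, True, False],
}
for name, ms in runs.items():
    print(name, task_success(ms), milestone_completion(ms), progress_before_first_failure(ms))
```

In the second run, binary success reports a plain failure, while the other two measures show that 60% of milestones passed and that the first error occurred after 40% of the task. That is the kind of intermediate signal a success rate alone hides.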

Implications for AI Economics

Market opportunity and value capture
- Professional post-production is high-value: automating even a subset of fine-grained editing tasks could yield significant productivity gains for studios, agencies, and freelance creators. There is a clear commercial opportunity for specialized GUI agents and integrated AIGC→post-production pipelines (SaaS agents, enterprise integrations, plugin marketplaces).
R&D cost and timeline
- The benchmark highlights large technical gaps (long-horizon planning, pixel-precise control, multimodal alignment). Bridging these gaps requires concentrated R&D (data collection, model engineering, UI-specific training regimes, safety/QA), implying substantial upfront investment and a multi-year horizon before reliable, general-purpose agents reach production-grade performance.
Labor markets and task composition
- Near-term: agents are likely to complement rather than substitute skilled editors — automating repetitive setup and file management tasks while leaving high-level creative decisions and error-catching to humans. This suggests potential shifts in required skills (higher emphasis on overseeing agents, verifying outputs, and combining generative assets).
- Medium-term: partial automation could change billing models (task-based automation discounts, bundled editing-as-a-service) and reallocate human labor toward more creative, supervisory, or higher-value tasks.
Platform and ecosystem economics
- Lock-in and partnerships: effective deployment favors deep integration with professional tools (Adobe suite, DaVinci) or provision via VM-like sandboxes; platform owners could monetize agent extensions, but vendor lock-in and licensing of proprietary software will shape adoption costs and business models.
- Benchmarking value: CutVerse provides a standardized way to measure progress and risk, reducing buyer uncertainty—this can accelerate enterprise procurement of GUI-agent services by clarifying capabilities and ROI.
Investment & policy considerations
- Companies and investors should prioritize investments that address the benchmark’s bottlenecks (spatiotemporal grounding, long-horizon reasoning, compositional action primitives) rather than only improving generative output quality.
- Standards and evaluation protocols (like CutVerse) are important for auditing performance, liability allocation, and setting realistic SLAs for automation in creative industries.
Research & economic questions opened
- Cost-benefit analyses: which editing sub-tasks yield the highest ROI if automated? How do error rates and required human oversight affect net productivity gains?
- Labor displacement vs. augmentation: what is the net effect on employment across junior editors, senior editors, and related roles?
- Pricing models: subscription vs. per-task vs. outcome-based pricing for agent-assisted post-production services.

Short takeaway for economists and decision-makers: CutVerse quantifies a substantial capability gap that constrains immediate automation value in professional post-production. This implies promising long-term commercial opportunities but also significant near-term R&D costs and a likely transition where agents augment skilled labor before any broad substitution occurs. Benchmarks like CutVerse are therefore critical inputs for investment, procurement, and labor-market forecasting in the AIGC-driven creative economy.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides systematic empirical evaluation using a curated set of 186 expert-grounded, long-horizon tasks and a parser converting raw recordings into structured action trajectories, which produces concrete performance measures (36% task success). However, the evidence is limited by potential selection biases in task curation, possible parser errors or annotation noise, the modest task sample relative to the diversity of media workflows, and by not linking agent performance to downstream economic outcomes (productivity, time saved, firm-level impacts).

Methods Rigor: medium. The design shows substantial rigor: expert demonstrations across multiple professional applications, attention to multimodal grounding, and a scalable parser for precise trajectory extraction. Missing (in the provided summary) are detailed validation metrics for the parser, inter-annotator agreement for demonstrations, robustness checks across software versions and platforms, and sensitivity analyses over task selection and evaluation thresholds.

Sample: Expert demonstrations collected across 7 professional media applications (examples: Premiere Pro, Photoshop), covering 186 complex, long-horizon post-production tasks; data consist of raw screen recordings and low-level interaction logs transformed by a lightweight parser into structured, compositional GUI action trajectories for scalable evaluation of multiple autonomous GUI agents.

Themes: productivity, human_ai_collab

Generalizability
- Focused on media post-production; may not generalize to other professional domains (finance, engineering, medical software).
- Limited to 7 applications (likely Adobe-centric); results may vary across different software ecosystems, OS versions, or localized UIs.
- 186 tasks is substantial for a benchmark but may not capture the full diversity of real-world workflows and edge cases.
- Parser and trajectory extraction pipelines may introduce systematic errors that affect evaluation; performance depends on recording fidelity and logging protocols.
- Evaluated agents may be a non-representative subset of future or alternative architectures; results are time-sensitive as models and plugins evolve.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
GUI agents have made significant progress in web navigation and basic operating system tasks.	Other	positive	high	capability progress on web navigation and OS tasks	0.18
We introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments.	Other	positive	high	existence and design of a benchmark for GUI agents in media post-production	0.3
We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows.	Other	positive	high	size and scope of demonstration dataset (number of applications and tasks)	n=186; 0.3
The tasks involve dense multimodal interfaces and tightly coupled interaction sequences.	Other	positive	high	interface complexity and interaction coupling in tasks	0.18
We develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding.	Other	positive	high	ability to produce structured, grounded GUI action trajectories from recordings/logs	0.3
Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks.	Task Completion Time	negative	high	task success rate	n=186; 36.0% task success; 0.18
Current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution.	Other	positive	high	spatial grounding, multimodal alignment, coordinated action execution	0.18
However, models remain limited in long-horizon reliability and domain-specific planning.	Task Completion Time	negative	high	long-horizon reliability and domain-specific planning ability	0.18