A new benchmark, Cutverse, tests autonomous GUI agents on 186 realistic media-editing tasks across seven professional applications and finds just 36% task success, highlighting persistent failures in long-horizon planning and domain-specific workflows.
While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.
Summary
Main Finding
CutVerse introduces a high-fidelity benchmark for evaluating GUI agents in professional media post-production and shows that current vision-language and GUI agents perform poorly on realistic editing workflows: state-of-the-art models achieve only ~36% task success. The paper demonstrates that long-horizon, dense, multimodal editing tasks (timeline manipulation, VFX, masking/tracking, audio alignment) pose fundamentally different and harder challenges than prior GUI benchmarks.
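To see why long-horizon tasks are so unforgiving, a back-of-envelope calculation helps. The sketch below assumes that every step in a task succeeds independently with the same probability and that a task succeeds only if all steps do. That independence model is an illustrative simplification introduced here, not something the paper states, and the function names are hypothetical. Under it, the reported ~36% task success over an average of ~18.7 steps corresponds to a per-step reliability of roughly 0.95, which shows how quickly small per-step error rates compound.

```python
def implied_step_reliability(task_success: float, steps: float) -> float:
    """Per-step success rate p such that p ** steps == task_success.

    Assumes each step succeeds independently with the same probability.
    This is an illustrative simplification, not a claim from the paper.
    """
    if not 0.0 < task_success <= 1.0:
        raise ValueError("task_success must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return task_success ** (1.0 / steps)


def task_success_at(step_reliability: float, steps: float) -> float:
    """Task-level success when every one of `steps` steps must succeed."""
    return step_reliability ** steps


if __name__ == "__main__":
    # Figures taken from the summary: ~36% task success, ~18.7 steps on average.
    p = implied_step_reliability(0.36, 18.7)
    print(f"implied per-step reliability: {p:.3f}")
    # The longest recorded trajectory has 239 steps.
    print(f"same reliability over 239 steps: {task_success_at(p, 239):.2e}")
```

Real agents do not fail independently per step (errors cluster around hard sub-tasks and some mistakes are recoverable), so the numbers are only a rough intuition for error accumulation.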
Key Points
- Scope and novelty
- 186 human-verified, long-horizon tasks across 7 professional applications (e.g., Adobe Premiere Pro, After Effects, Photoshop) plus generative tools (ComfyUI, Keling).
- Tasks cover end-to-end post-production workflows (timeline editing, effects tuning, masking/tracking, audio rhythm editing, export) and integrate AIGC-to-editing pipelines (“Vibe Cutting”).
- Data scale & complexity
- 2.43 hours of high-fidelity recordings, 3,484 atomic GUI interactions, average ~18.7 steps per trajectory (peak 239), ~23.8 interactions/minute.
- Timelines and multi-track controls dominate operations (timelines ≈ 46% of ops; layer/track controls ≈ 25%).
- Interaction model
- Agents are constrained to human-like motor actions (pixel-based clicks, drags, keyboard shortcuts) with vision-only perception — no privileged APIs or DOM access.
- Infrastructure & evaluation
- Windows VM-based environment with resettable checkpoints for reproducible live execution.
- Parser that synchronizes high-framerate screen recordings with low-level I/O logs to produce grounded, milestone-driven trajectories.
- Milestone QA evaluator and fine-grained metrics that go beyond binary Success Rate to expose intermediate failures and error accumulation.
- Baseline behavior and failure modes
- Stronger performance on procedural setup and file management; weak performance on sustained, compositional editing.
- Key bottlenecks: long-horizon planning/reliability, fine-grained spatial grounding in dense UIs, multimodal temporal alignment, missing compositional action abstractions.
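The parser described above synchronizes screen recordings with low-level I/O logs. The summary does not give its implementation, so the following is only a minimal sketch of the general idea under stated assumptions: frame timestamps are sorted, input events carry a timestamp and pointer position, and a press/release pair is classified as a click or a drag by how far the pointer moved. `InputEvent`, `nearest_frame`, `to_actions`, and the `drag_px` threshold are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    t: float      # seconds since recording start
    kind: str     # "mouse_down", "mouse_up", or "key" (assumed event vocabulary)
    x: int = 0
    y: int = 0


def nearest_frame(frame_times: list[float], t: float) -> int:
    """Index of the frame whose timestamp is closest to t (frame_times sorted)."""
    if not frame_times:
        raise ValueError("no frames")
    i = bisect_left(frame_times, t)
    if i == 0:
        return 0
    if i == len(frame_times):
        return len(frame_times) - 1
    # Pick the closer neighbour; ties go to the earlier frame.
    return i - 1 if t - frame_times[i - 1] <= frame_times[i] - t else i


def to_actions(events: list[InputEvent], frame_times: list[float],
               drag_px: int = 5) -> list[dict]:
    """Pair mouse_down/mouse_up into click or drag actions, each tied to a frame."""
    actions: list[dict] = []
    down: InputEvent | None = None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "mouse_down":
            down = ev
        elif ev.kind == "mouse_up" and down is not None:
            moved = abs(ev.x - down.x) + abs(ev.y - down.y)
            actions.append({
                "type": "drag" if moved > drag_px else "click",
                "start": (down.x, down.y),
                "end": (ev.x, ev.y),
                # Ground the action in the frame visible when the press began.
                "frame": nearest_frame(frame_times, down.t),
            })
            down = None
        elif ev.kind == "key":
            actions.append({"type": "key",
                            "frame": nearest_frame(frame_times, ev.t)})
    return actions
```

Grounding each action in the frame at press time, rather than release time, keeps the screenshot consistent with what a vision-only agent would have seen before acting.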
Data & Methods
- Dataset construction
- Human experts recorded authentic editing sessions across professional tools; recordings parsed into structured action trajectories and hierarchical milestones.
- Task types catalogued into 9 functional domains (effects tuning, export, asset import, audio editing, timeline arrangement, preview/validation, masking/tracking, launch/setup, generative workflows).
- Table highlights: effects/visual tuning (51 tasks, avg duration 52.8s, avg 20.3 steps, labeled Extreme complexity); masking/tracking (10 tasks, avg duration 73.0s, avg 25.4 steps).
- Evaluation environment
- Custom Windows virtualization enforces pixel-level interaction and human-like constraints; VM checkpoints ensure reproducibility.
- No privileged API calls; agents must use continuous mouse/keyboard events based on visual frames.
- Parser & representation
- Aligns frames and I/O logs to produce spatiotemporally grounded atomic actions and milestone hierarchies, enabling scalable, automated assessment.
- Baselines & metrics
- Evaluated state-of-the-art VLM-based GUI agents and planner-executor architectures.
- Reported overall task success ≈ 36.0%; detailed per-milestone diagnostics reveal frequent intermediate failures even when some steps succeed.
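The gap between binary success and per-milestone diagnostics can be illustrated with a small sketch. The paper's exact metric definitions are not given in this summary, so the three quantities below (all-or-nothing success, fraction of milestones passed, and fraction of milestones passed before the first failure) are assumed, generic formulations, and `Episode` and `summarize` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Episode:
    task_id: str
    milestones: list[bool]  # ordered pass/fail result for each milestone


def task_success(ep: Episode) -> bool:
    """Binary success: every milestone must pass."""
    return all(ep.milestones)


def milestone_completion(ep: Episode) -> float:
    """Fraction of milestones passed, regardless of order."""
    return sum(ep.milestones) / len(ep.milestones)


def progress_before_failure(ep: Episode) -> float:
    """Fraction of milestones passed before the first failure."""
    passed = 0
    for ok in ep.milestones:
        if not ok:
            break
        passed += 1
    return passed / len(ep.milestones)


def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Average the three metrics over a set of episodes."""
    if not episodes or any(not ep.milestones for ep in episodes):
        raise ValueError("need at least one episode, each with milestones")
    n = len(episodes)
    return {
        "success_rate": sum(task_success(e) for e in episodes) / n,
        "milestone_completion": sum(milestone_completion(e) for e in episodes) / n,
        "progress_before_failure": sum(progress_before_failure(e) for e in episodes) / n,
    }
```

An episode that passes three of four milestones counts as a plain failure under binary success, while the milestone-level averages show how far the agent got and where it first broke down.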
Implications for AI Economics
- Market opportunity and value capture
- Professional post-production is high-value: automating even a subset of fine-grained editing tasks could yield significant productivity gains for studios, agencies, and freelance creators. There is a clear commercial opportunity for specialized GUI agents and integrated AIGC→post-production pipelines (SaaS agents, enterprise integrations, plugin marketplaces).
- R&D cost and timeline
- The benchmark highlights large technical gaps (long-horizon planning, pixel-precise control, multimodal alignment). Bridging these gaps requires concentrated R&D (data collection, model engineering, UI-specific training regimes, safety/QA), implying substantial upfront investment and a multi-year horizon before reliable, general-purpose agents reach production-grade performance.
- Labor markets and task composition
- Near-term: agents are likely to complement rather than substitute skilled editors — automating repetitive setup and file management tasks while leaving high-level creative decisions and error-catching to humans. This suggests potential shifts in required skills (higher emphasis on overseeing agents, verifying outputs, and combining generative assets).
- Medium-term: partial automation could change billing models (task-based automation discounts, bundled editing-as-a-service) and reallocate human labor toward more creative, supervisory, or higher-value tasks.
- Platform and ecosystem economics
- Lock-in and partnerships: effective deployment favors deep integration with professional tools (Adobe suite, DaVinci) or provision via VM-like sandboxes; platform owners could monetize agent extensions, but vendor lock-in and licensing of proprietary software will shape adoption costs and business models.
- Benchmarking value: CutVerse provides a standardized way to measure progress and risk, reducing buyer uncertainty—this can accelerate enterprise procurement of GUI-agent services by clarifying capabilities and ROI.
- Investment & policy considerations
- Companies and investors should prioritize investments that address the benchmark’s bottlenecks (spatiotemporal grounding, long-horizon reasoning, compositional action primitives) rather than only improving generative output quality.
- Standards and evaluation protocols (like CutVerse) are important for auditing performance, liability allocation, and setting realistic SLAs for automation in creative industries.
- Research & economic questions opened
- Cost-benefit analyses: which editing sub-tasks yield the highest ROI if automated? How do error rates and required human oversight affect net productivity gains?
- Labor displacement vs. augmentation: what is the net effect on employment across junior editors, senior editors, and related roles?
- Pricing models: subscription vs. per-task vs. outcome-based pricing for agent-assisted post-production services.
Short takeaway for economists and decision-makers: CutVerse quantifies a substantial capability gap that constrains immediate automation value in professional post-production. This implies promising long-term commercial opportunities but also significant near-term R&D costs and a likely transition where agents augment skilled labor before any broad substitution occurs. Benchmarks like CutVerse are therefore critical inputs for investment, procurement, and labor-market forecasting in the AIGC-driven creative economy.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| GUI agents have made significant progress in web navigation and basic operating system tasks. [Other] | positive | high | capability progress on web navigation and OS tasks | 0.18 |
| We introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. [Other] | positive | high | existence and design of a benchmark for GUI agents in media post-production | 0.3 |
| We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows. [Other] | positive | high | size and scope of demonstration dataset (number of applications and tasks) | n=186; 0.3 |
| The tasks involve dense multimodal interfaces and tightly coupled interaction sequences. [Other] | positive | high | interface complexity and interaction coupling in tasks | 0.18 |
| We develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. [Other] | positive | high | ability to produce structured, grounded GUI action trajectories from recordings/logs | 0.3 |
| Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks. [Task Completion Time] | negative | high | task success rate | n=186; 36.0% task success; 0.18 |
| Current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution. [Other] | positive | high | spatial grounding, multimodal alignment, coordinated action execution | 0.18 |
| However, models remain limited in long-horizon reliability and domain-specific planning. [Task Completion Time] | negative | high | long-horizon reliability and domain-specific planning ability | 0.18 |