CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.

Summary

Main Finding

CUA-SUITE introduces a large, open, expert-curated ecosystem of continuous human demonstration videos and dense UI annotations tailored to training and evaluating computer-use agents (CUAs). Its centerpiece, VIDEOCUA, offers ~55 hours (≈6M frames) of 30 fps video across ~10,000 tasks and 87 desktop applications, paired with kinematic cursor traces and rich multi-layered reasoning annotations. Paired datasets/benchmarks (GROUNDCUA and UI-VISION) provide 3.6M UI element annotations and a 450-task evaluation benchmark. Preliminary evaluation shows current foundation action models still fail on roughly 60% of professional desktop tasks, indicating substantial room for model improvement.

Key Points

Scope and scale
- VIDEOCUA: ~55 hours, ~6 million frames, ~10,000 task demonstrations across 87 open-source desktop applications at 30 fps.
- GROUNDCUA: ~56K annotated screenshots with >3.6 million UI element annotations (pixel-precise bounding boxes, textual labels; ~50% of elements classified into functional categories).
- UI-VISION: 450 high-quality task demonstrations as a diagnostic benchmark (Element Grounding, Layout Grounding, Action Prediction).
- Multi-layer reasoning annotations average ~497 words per step, and all interactions (clicks, drags, keystrokes, cursor kinematics) are logged with millisecond precision.
Methodology highlights
- Expert human task design and execution (realistic workflows, not synthetic templates).
- Continuous 30 fps screen recording preserves temporal dynamics (intermediate cursor motion/visual feedback) lost in screenshot-only datasets.
- Keyframes extracted immediately before state-changing actions; manual element bounding boxes and textual labels; OCR applied for long text.
- Expert verification and QA applied to annotations.
Comparative advantage
- More continuous video information than prior open datasets (e.g., ScaleCUA’s 2M screenshots amount to under 20 hours at 30 fps).
- Unifies continuous video + fine-grained grounding + evaluation benchmark in one open ecosystem.
Empirical diagnostics
- State-of-the-art grounding models show modest performance (top model avg ≈47.7% on UI-VISION grounding metrics).
- Spatial reasoning remains a core weakness across models.
Limitations (noted or implied)
- Focus on open-source desktop apps (good for releasability, but may not exactly match proprietary app UIs).
- 55 hours, while large relative to prior open datasets, remains limited relative to large-scale web/speech corpora; generalization to the full diversity of user behavior is not guaranteed.
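The duration and frame-count figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes, as the source's comparison does, that each ScaleCUA screenshot is treated as one frame of 30 fps video, and the helper names are hypothetical.

```python
FPS = 30  # frame rate stated in the source


def frames_to_hours(frames: int, fps: int = FPS) -> float:
    """Convert a frame count to hours of video at the given frame rate."""
    return frames / fps / 3600


def hours_to_frames(hours: float, fps: int = FPS) -> int:
    """Convert hours of video to a frame count at the given frame rate."""
    return round(hours * 3600 * fps)


# ScaleCUA: 2 million screenshots, counted as if they were 30 fps frames.
scalecua_hours = frames_to_hours(2_000_000)

# VideoCUA: roughly 55 hours of continuous 30 fps recording.
videocua_frames = hours_to_frames(55)

print(f"ScaleCUA equivalent: {scalecua_hours:.1f} hours")
print(f"VideoCUA frames: {videocua_frames:,}")
```

Under these assumptions the ScaleCUA equivalent comes to about 18.5 hours (under 20), and 55 hours corresponds to 5.94 million frames, consistent with the "~6 million" figure.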

Data & Methods

Application selection
- 87 open-source desktop applications across 12 categories (e.g., IDEs, 3D modeling, office, graphics) chosen to cover professional workflows; permissive licenses permit full public release.
Task design and recording
- Human experts authored and executed tasks that reflect realistic work; >10K demonstrations recorded.
- Continuous screen capture at 30 fps with synchronized millisecond-precision logs of all input events (clicks, drags, scrolls, typing).
Annotation pipeline
- Keyframes selected immediately before state-changing actions for grounding.
- Manual bounding box annotation of every visible UI element on keyframes; textual labels assigned (displayed text or concise summaries for long content).
- OCR (PaddleOCR) used to extract raw text for long segments; ~50% of elements labelled with one of eight high-level functional categories.
- Multi-layered reasoning and step-level textual annotations (rich natural-language explanations/history for actions).
- Expert human review and QA for annotations and trajectories.
Datasets and benchmarks included
- VIDEOCUA: continuous video + kinematic traces + reasoning annotations.
- GROUNDCUA: dense pixel-precise element grounding (3.6M annotations).
- UI-VISION: benchmark with 450 tasks focusing on grounding, layout understanding, and action prediction.
Evaluation findings (representative)
- Element grounding remains challenging; best models approach ~60% in Basic/Functional splits but fall much lower on Spatial reasoning.
- Foundation action models exhibit ~60% task failure rate on professional desktop tasks (high brittleness).
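The keyframe rule described in the annotation pipeline (take the frame immediately before each state-changing action) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `InputEvent` type, the set of event kinds treated as state-changing, and the assumption that frame k is captured at exactly k/30 seconds are all hypothetical.

```python
from dataclasses import dataclass

FPS = 30  # frame rate stated in the source

# Assumption: which event kinds change application state (cursor moves do not).
STATE_CHANGING = {"click", "drag", "scroll", "key"}


@dataclass(frozen=True)
class InputEvent:
    t_ms: int  # milliseconds since the start of the recording
    kind: str  # e.g. "click", "drag", "scroll", "key", "move"


def keyframe_index(t_ms: int, fps: int = FPS) -> int:
    """Index of the last frame captured strictly before t_ms.

    Frame k is assumed to be captured at k * 1000 / fps milliseconds, so we
    need the largest k with k * 1000 < t_ms * fps. Integer arithmetic avoids
    floating-point rounding at frame boundaries.
    """
    return max((t_ms * fps - 1) // 1000, 0)


def select_keyframes(events: list[InputEvent], fps: int = FPS) -> list[int]:
    """Sorted, de-duplicated keyframe indices for state-changing events."""
    return sorted(
        {keyframe_index(e.t_ms, fps) for e in events if e.kind in STATE_CHANGING}
    )


events = [
    InputEvent(50, "move"),
    InputEvent(100, "click"),
    InputEvent(500, "move"),
    InputEvent(1000, "click"),
    InputEvent(1001, "key"),
]
print(select_keyframes(events))
```

Because the full video and the event log are both retained, a sparse screenshot-plus-action dataset of this kind can be derived from the continuous recording, whereas the reverse is not possible.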

Implications for AI Economics

Lowering data barriers and R&D costs
- Public release of a high-quality, expert-curated dataset reduces a key fixed-cost input (task-specific, human-annotated training data) for building CUAs. This can accelerate entrants (startups, research labs) and reduce time-to-market for new assistant products.
- By providing an open foundation, CUA-SUITE can shift investment from expensive annotation collection toward model development and compute, potentially lowering marginal costs of improving CUAs.
Productivity and labor effects
- Better CUAs trained on this data could raise worker productivity for knowledge-intensive desktop tasks (data entry, content creation, coding support, CAD/3D workflows). The productivity gains are likely task- and skill-dependent: routine, structured desktop tasks are most automatable.
- Potential displacement risk exists for roles centered on repetitive GUI workflows, while complementary demand may rise for supervision, prompt engineering, tool integration, and higher-order decision tasks.
- Distributional effects: benefits may accrue unevenly across occupations and firms (larger firms integrating CUAs at scale could capture more gains), potentially increasing returns to capital (software/platforms) relative to labor.
Market structure and competitive dynamics
- Democratizing access to high-quality training data lowers entry barriers for smaller firms and open-source initiatives, but scale advantages (compute, deployment/integration with proprietary software ecosystems) will still matter. Winners may emerge among firms that combine models with broad platform access and enterprise integration.
- Because the dataset focuses on open-source apps, incumbent platform holders (proprietary OS and app vendors) may retain advantages tied to privileged integration, telemetry, and user data — where closed data remains a competitive moat.
Data as public infrastructure and policy
- CUA-SUITE functions as a public research good (infrastructure) that can accelerate general-purpose agent research; policymakers and funders could view similar datasets as high-leverage investments to steer innovation toward socially beneficial uses.
- Regulators and procurement actors should consider how open datasets change the landscape for accountability, auditing, and competition (e.g., benchmarking safety, measuring automation potential).
Research and measurement opportunities for economists
- Use the dataset to estimate automatable task shares: researchers can couple task-level performance of candidate models (trained/fine-tuned on VIDEOCUA) with occupational task data to estimate potential automation exposure across occupations.
- Field experiments / RCTs: deploy CUAs trained on this data in workplace pilots to measure productivity effects, reallocation of worker time, learning/upskilling dynamics, and wage effects.
- Study complementarities: measure how gains depend on worker skill, firm IT integration, and complementary investments (training, process redesign).
- Evaluate cost trade-offs: analyze compute/storage costs required to exploit continuous video vs. screenshot datasets and implications for development budget and carbon footprint.
Strategic considerations for stakeholders
- Firms: invest in integrating CUAs where task structure is high and UI interactions are standardized; expect initial performance gaps in spatial reasoning—design fallback/verification workflows to manage errors.
- Workers/policymakers: support retraining and role redefinition focusing on tasks requiring judgment, domain expertise, or non-routine problem solving.
- Researchers/funders: prioritize benchmarks and real-world evaluations (not just lab metrics) to track economic impacts and safety/performance in deployed settings.

Short actionable suggestions for economists and policy researchers
- Benchmark automation potential: fine-tune an agent on VIDEOCUA and map model success rates to occupational task databases (e.g., O*NET equivalents) to infer short-run automatable task shares.
- Design field pilots with enterprises using CUA-SUITE–trained agents to measure productivity, task reallocation, and worker welfare impacts.
- Track concentration dynamics: monitor which firms integrate CUAs with proprietary data flows and whether open datasets shift market power.
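The first suggestion, mapping model success rates onto occupational task data, might look like the sketch below. Everything here is hypothetical: the task names, success rates, and time weights are made-up placeholders rather than results from the paper or from any occupational database, and treating tasks without a benchmark score as not automatable (success 0) is a deliberately conservative assumption.

```python
def exposure_by_occupation(
    task_success: dict[str, float],
    occupation_tasks: dict[str, list[tuple[str, float]]],
) -> dict[str, float]:
    """Weight-averaged model success rate per occupation.

    task_success maps a task name to a model success rate in [0, 1].
    occupation_tasks maps an occupation to (task, weight) pairs, where the
    weight is a hypothetical share of working time spent on the task.
    Tasks with no measured success rate count as 0 (conservative assumption).
    """
    exposure = {}
    for occupation, tasks in occupation_tasks.items():
        total_weight = sum(weight for _, weight in tasks)
        if total_weight == 0:
            continue  # no usable task weights for this occupation
        weighted = sum(task_success.get(task, 0.0) * weight for task, weight in tasks)
        exposure[occupation] = weighted / total_weight
    return exposure


# Placeholder inputs, for illustration only.
task_success = {
    "spreadsheet_entry": 0.5,
    "image_editing": 0.25,
    "code_refactor": 0.75,
}
occupation_tasks = {
    "data_clerk": [("spreadsheet_entry", 3.0), ("image_editing", 1.0)],
    "developer": [("code_refactor", 1.0), ("client_meeting", 1.0)],
}

exposure = exposure_by_occupation(task_success, occupation_tasks)
for occupation, share in exposure.items():
    print(f"{occupation}: {share:.3f}")
```

A score of this kind measures benchmark success on matched tasks, not realized automation; adoption costs, error tolerance, and integration constraints would all need to be modeled separately.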

Overall, CUA-SUITE materially improves the data inputs available for building CUAs, which can accelerate automation of desktop work. The economic impact will depend on adoption, integration with proprietary ecosystems, and complementary investments in human capital and governance.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is primarily a dataset and benchmark paper, not a causal or inference study; empirical claims are descriptive/benchmarking (e.g., ~60% task failure) rather than causal effects, so conventional evidence-strength ratings for causal inference do not apply.
Methods Rigor: medium. The dataset appears carefully constructed: large scale (≈10,000 tasks, ~55 hours, 6M frames), continuous 30 fps recordings, kinematic cursor traces, dense multi-layer annotations, and a benchmark and grounding corpus; public release improves reproducibility. However, the paper (as summarized) lacks detail on sampling strategy, annotator recruitment/training, inter-annotator agreement/quality-control metrics, and representativeness across OS/locales/app versions, and the benchmark evaluation offers only a preliminary set of baselines, so some methodological transparency and validation are missing.
Sample: VideoCUA: ~10,000 human-demonstrated tasks spanning 87 desktop applications, continuous 30 fps screen recordings totaling ~55 hours and ~6 million frames, plus kinematic cursor traces and multi-layer reasoning annotations; GroundCUA: ~56,000 annotated screenshots with >3.6 million UI element annotations; UI-Vision: a benchmark for evaluating grounding and planning; all data and models publicly released.
Themes: productivity, human_ai_collab
Generalizability
- Focused on professional desktop applications; may not generalize to mobile, tablet, or command-line workflows.
- Likely biased toward the specific OS, locales, and application versions recorded (e.g., English UIs), limiting global applicability.
- Expert demonstrations may not reflect novice or heterogeneous end-user behavior and error modes.
- Total duration (~55 hours) is large for dense expert video but still limited relative to the long tail of possible tasks and UI states.
- Performance benchmarks depend on recorded environments and tooling; results may differ in deployed/heterogeneous real-world systems.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows.	Other	positive	high	promise/ability to automate complex desktop workflows	0.03
Progress toward general-purpose CUAs is bottlenecked by the scarcity of continuous, high-quality human demonstration videos.	Research Productivity	negative	high	availability of continuous, high-quality human demonstration videos (data scarcity)	0.18
Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents.	Research Productivity	positive	high	importance of continuous video vs. sparse screenshots for scaling CUAs	0.18
The largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video.	Other	negative	high	size/coverage of existing open dataset (ScaleCUA)	n=2000000 less than 20 hours 0.3
VideoCUA provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.	Other	positive	high	size and modality coverage of the VideoCUA dataset (tasks, hours, frames, annotations)	n=10000 approximately 55 hours; 6 million frames 0.3
Continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks (unlike sparse datasets that capture only final click coordinates).	Other	positive	high	information content and transformability of continuous video vs. sparse data	0.18
CUA-Suite provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.	Other	positive	high	size and scope of GroundCUA (annotated screenshots and UI element annotations) and availability of UI-Vision benchmark	n=56000 over 3.6 million UI element annotations 0.3
Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate).	Error Rate	negative	high	task failure rate of foundation action models on professional desktop applications	approximately 60% task failure rate 0.18
CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.	Research Productivity	positive	high	support for various research directions (capability to enable research)	0.03
All data and models are publicly released.	Other	positive	high	public availability of data and models	0.3

A new public dataset of 55 hours of continuous expert desktop recordings and 3.6M UI annotations aims to jump-start general-purpose computer-use agents; preliminary benchmarks show current action models fail about 60% of professional tasks, indicating substantial room for improvement.