A new pipeline turns 200 real software applications into more than 10,000 long-horizon agent tasks tied to U.S. occupations, creating CUA-World for evaluating computer-use agents; distilling trajectories into a 2B vision-language model and adding reviewer-audit loops yield measurable but modest performance gains on these economically relevant tasks.
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.
Summary
Main Finding
Gym-Anything is a scalable multi-agent framework that turns arbitrary GUI software into interactive environments, enabling the automated creation of CUA-World, a GDP-grounded collection of 10K+ realistic, long-horizon computer-use tasks across 200 software applications. The pipeline (creation + audit + propose-and-amplify + checklist verification) produces train/test splits and a very challenging long-horizon benchmark (CUA-World-Long). Distilling successful training-split trajectories into a 2B vision-language model lets it outperform models twice its size, and test-time auditing improves agent reliability (e.g., Gemini-3-Flash from 11.5% → 14.0%), but frontier agents still perform poorly on many long-horizon tasks (best ≈27.5% pass).
Key Points
- Gym-Anything: a library/specification that reduces environment creation to three staged setup scripts (install, configure, task_setup) + a config file and a unified gym-style API (screenshots as observations, mouse/keyboard actions).
- Multi-agent creation-audit loop:
  - Creation agent ($\text{Agent}_{C}$) writes setup scripts, finds/ingests realistic data, launches and interacts with the software, and produces evidence (screenshots, logs, outputs).
  - Audit agent ($\text{Agent}_{\text{audit}}$) verifies evidence against quality checklists and returns failures for iterative fixes.
  - Shared memory of learnings accelerates later environment builds; a summarization agent condenses memory periodically.
- Scalable task generation (propose-and-amplify):
  - An expensive agentic model proposes a small number of high-quality seed tasks per app.
  - A cheaper non-agentic LM amplifies the seeds into many tasks using in-context examples.
  - Tasks are evaluated with a checklist-based vision-language-model (VLM) verifier that uses privileged information embedded during setup for fine-grained scoring.
- Dataset and benchmark:
  - CUA-World: 10K+ tasks across 200 software apps with realistic data, train/test splits, multi-OS coverage.
  - CUA-World-Long: one long-horizon task per software application; tasks often require hundreds (sometimes 500+) steps and target realistic failure modes.
- Empirical findings:
  - Distilling successful trajectories into a 2B VLM yields better performance than models twice its size.
  - Performance scales with the number of distinct software environments in training.
  - Test-time auditing (independent VLM reviewer) raises pass rates modestly (example: Gemini-3-Flash from 11.5% → 14.0%).
  - Even top models struggle: best reported pass ≈27.5% on the long-horizon split.
- Assets released: code, infrastructure, and benchmark data to enable further research.
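The creation-audit loop described in the key points can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Evidence`, `creation_audit_loop`, `StubCreator`, and `stub_audit` are hypothetical names, the agents are plain callables standing in for LLM instances, and the `max_rounds` cutoff is an assumption (the source only says the loop iterates until the checklist is satisfied).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """Artifacts a creation agent returns to show that setup worked."""
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)


def creation_audit_loop(
    app: str,
    checklist: List[str],
    create: Callable[[str, List[str], List[str]], Evidence],
    audit: Callable[[Evidence, List[str]], List[str]],
    memory: List[str],
    max_rounds: int = 5,  # assumed safety cutoff, not from the source
) -> Tuple[bool, int]:
    """Alternate creation and audit until no checklist item fails."""
    failures: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        # The creator sees the previous audit failures and the shared memory.
        evidence = create(app, failures, memory)
        # The auditor returns the checklist items the evidence does not support.
        failures = audit(evidence, checklist)
        if not failures:
            memory.append(f"{app}: passed audit after {round_idx} round(s)")
            return True, round_idx
        # Record what went wrong so later builds can reuse the lesson.
        memory.extend(f"{app}: fix '{item}'" for item in failures)
    return False, max_rounds


class StubCreator:
    """Toy creator: fixes every failure it is told about on the next round."""

    def __init__(self) -> None:
        self.done = {"app launches"}

    def __call__(self, app: str, failures: List[str], memory: List[str]) -> Evidence:
        self.done.update(failures)
        return Evidence(screenshots=[f"{app}.png"], logs=sorted(self.done))


def stub_audit(evidence: Evidence, checklist: List[str]) -> List[str]:
    """Toy auditor: an item passes only if it appears in the logs."""
    return [item for item in checklist if item not in evidence.logs]


if __name__ == "__main__":
    shared_memory: List[str] = []
    ok, rounds = creation_audit_loop(
        "demo_app", ["app launches", "real data loaded"],
        StubCreator(), stub_audit, shared_memory,
    )
    print(ok, rounds)  # True 2
    print(shared_memory)
```

Keeping the auditor as a separate callable that sees only the evidence mirrors the independence the source emphasizes: the creator cannot declare its own setup correct.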
Data & Methods
- Software selection (GDP-grounded):
  - Start: O*NET occupations (~900), employment & wage data from BLS; scale to U.S. GDP using national accounts (BEA).
  - Discover software per occupation via LLM + web search → initial catalog (~16,600 apps, ~1,400 categories).
  - Attribute GDP to each software using $\mathrm{GDP}_{\text{software}} = \sum_{\text{occ}} \mathrm{GDP}_{\text{occ}} \times p_{\text{computer}} \times s_{\text{category}} \times s_{\text{software}}$, where $p_{\text{computer}}$ = fraction of the occupation using computers, $s_{\text{category}}$ = share of computer work for the software category, and $s_{\text{software}}$ = software share within the category (both estimated with LLM + web search).
  - Filter to sandboxable apps (self-hostable, free-tier, GUI, no special hardware) → ~3,400 candidates; apply tiered selection (economic importance + strategic/STEM coverage + SOC-group coverage) to pick 200 apps.
- Gym-Anything library:
  - Declarative specification: three sequential setup scripts + config; library manages OS backends (Linux, Windows, Android), container orchestration (docker, apptainer), display forwarding, checkpointing, caching, and a consistent observation/action interface.
  - Enables large parallelism (authors ran 400+ concurrent environments over ~1,600 CPUs).
- Multi-agent pipeline:
  - Agents are implemented as LLM instances with tool access (Claude variants in the paper), each with a distinct system prompt and toolset (bash/python, visual grounding tools).
  - The creation-audit loop iterates until the checklist criteria are satisfied; memory M collects environment-specific heuristics, and a summarization agent periodically reduces memory size to mitigate context fatigue.
- Task scaling & verification:
  - Seed generation by an expensive agent; LM amplification (e.g., 75×).
  - The VLM checklist verifier decomposes tasks into weighted subtasks using privileged metadata produced during setup (e.g., ground-truth tumor location); agents solving the tasks do not see this privileged information.
- Evaluation & training experiments:
  - Distillation of teacher trajectories into a 2B VLM; comparisons to larger models; evaluation on CUA-World and CUA-World-Long.
  - Test-time auditing: independent VLM inspects agent trajectories and produces guidance/required additional steps.
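The GDP attribution formula from the software-selection step can be worked through with a small example. This is a sketch under stated assumptions: `attribute_gdp` and the record keys are hypothetical names, and every number below is made up for illustration rather than taken from the paper or from BLS/BEA data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def attribute_gdp(records: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Sum GDP_occ * p_computer * s_category * s_software over occupations.

    Each record links one occupation to one software product and carries:
      software   - product name
      gdp_occ    - GDP attributed to the occupation (any consistent unit)
      p_computer - fraction of the occupation's work done on computers
      s_category - share of that computer work in the software's category
      s_software - the product's share within its category
    """
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[str(r["software"])] += (
            float(r["gdp_occ"])
            * float(r["p_computer"])
            * float(r["s_category"])
            * float(r["s_software"])
        )
    return dict(totals)


# Illustrative numbers only (not from the paper).
example_records = [
    # Occupation A uses SoftwareX: 100 * 0.5 * 0.25 * 0.5 = 6.25
    {"software": "SoftwareX", "gdp_occ": 100.0, "p_computer": 0.5,
     "s_category": 0.25, "s_software": 0.5},
    # Occupation B also uses SoftwareX: 200 * 0.25 * 0.5 * 0.25 = 6.25
    {"software": "SoftwareX", "gdp_occ": 200.0, "p_computer": 0.25,
     "s_category": 0.5, "s_software": 0.25},
    # Occupation A uses SoftwareY: 100 * 0.5 * 0.25 * 0.5 = 6.25
    {"software": "SoftwareY", "gdp_occ": 100.0, "p_computer": 0.5,
     "s_category": 0.25, "s_software": 0.5},
]

if __name__ == "__main__":
    for name, gdp in sorted(attribute_gdp(example_records).items()):
        print(f"{name}: {gdp:.2f}")
```

Because the two share terms are LLM-estimated, an error in either one scales a product's attributed GDP proportionally, which is worth keeping in mind when reading the resulting ranking.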
Implications for AI Economics
- Broader, GDP-relevant evaluation: Grounding environment selection in occupation-level GDP aligns benchmarks with economic value, enabling more relevant assessments of whether CUAs can automate or assist high-value digital work.
- Training signal for economically consequential tasks: Large, diverse, realistic trajectories across many domain-specific applications provide richer supervision for agents that might substitute for or augment human labor in digitally intensive occupations.
- Reveals current capability gaps: Weak performance on long-horizon, cross-software workflows implies that success on short-horizon benchmarks can substantially overstate real-world economic impact (limited substitution now, but potential with further progress).
- Auditing and verification are critical: Creation- and test-time audit agents materially improve environment quality and agent reliability. For economic deployment this suggests production CUAs will need independent verification layers (e.g., auditing agents, privileged checks) to mitigate false completion claims and unsafe errors.
- Scalability vs. representativeness trade-offs:
  - The pipeline scales environment creation, but selection is U.S. GDP-grounded and limited to sandboxable, free/self-hostable GUI apps, which biases coverage toward certain institutions, countries, and software types; broader economic conclusions should account for that.
  - Replacing proprietary or enterprise-only software with sandboxable analogs preserves some signal but may understate real complexity and integrations in production systems.
- Policy and labor-market consequences:
  - If CUAs trained on CUA-World generalize to real software, there could be productivity gains in many occupations (accounting, healthcare workflows, scientific analysis), but also displacement risks concentrated in digitally intensive tasks.
  - Measurement: Benchmarks grounded in economic data help quantify potential exposure but require careful updating as software mixes and work practices differ across firms and countries.
- Research and safety directions:
  - Future AI-economic analyses should combine scalable, GDP-grounded benchmarks (like CUA-World) with field trials and expert-in-the-loop auditing to estimate real-world productivity, risk, and complementarities.
  - The reliance on audit layers suggests that robustness, interpretability, and verifiability are essential for high-stakes deployment where economic value and liability are significant.
Limitations to note (relevant for economic interpretation): selection limited to sandboxable/free GUI software; GDP attribution uses LLM estimates for software shares (potential error); environments do not capture organizational constraints like permissions, enterprise integrations, or confidential datasets; substantial compute required to scale environments and training.
Contact/resources: authors release code, infra, and benchmark data (link: https://cmu-l3.github.io/gym-anything) to facilitate replication and further economic analyses.
Assessment
Claims (9)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Current research has largely focused on short-horizon tasks over a limited set of software with limited economic value (e.g., basic e-commerce and OS-configuration tasks). | Research Productivity | negative | high | scope and horizon of existing research tasks | 0.18 |
| We introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. | Research Productivity | positive | high | availability of a general framework for environment creation | 0.3 |
| Environment creation is framed as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software while producing evidence of correct setup; an independent audit agent verifies evidence against a quality checklist. | Research Productivity | positive | high | reliability/validity of environment setup via multi-agent workflow | 0.3 |
| Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. | Research Productivity | positive | high | number of software applications covered and occupational coverage | n=200; 200 software applications; 0.3 |
| The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. | Research Productivity | positive | high | number of long-horizon tasks and availability of realistic data and splits | n=10000; over 10K long-horizon tasks; 0.3 |
| CUA-World-Long is a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. | Research Productivity | positive | high | task horizon measured in number of steps | often requiring over 500 steps; 0.18 |
| Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. | Output Quality | positive | high | model performance on benchmark tasks (success metric unspecified in excerpt) | outperforms models 2× its size; 0.18 |
| Applying the same auditing principle at test time (a separate VLM reviews completed trajectories and provides feedback) improves Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. | Output Quality | positive | high | benchmark score (success rate) on CUA-World-Long | 11.5% to 14.0%; 0.18 |
| All code, infrastructure, and benchmark data are released to facilitate future research in realistic computer-use agents. | Research Productivity | positive | high | availability of code, infrastructure, and benchmark data | 0.3 |