A new pipeline turns real software into 10,000+ long-horizon agent environments tied to U.S. occupations, creating CUA-World for evaluating computer-use assistants; distilled 2B vision-language models and reviewer-audit loops yield measurable but modest performance gains on these economically relevant tasks.

Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal, Graham Neubig, Sean Welleck · April 07, 2026

arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

The paper introduces Gym-Anything and CUA-World, an automated pipeline and 10K+ task benchmark converting 200 real software applications into long-horizon computer-use environments grounded in an occupation/GDP taxonomy, and demonstrates that distilled VLMs and auditor/reviewer agents can modestly improve performance on these realistic tasks.

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

Summary

Main Finding

Gym-Anything is a scalable multi-agent framework that can turn arbitrary GUI software into interactive environments, enabling the automated creation of CUA-World — a GDP-grounded collection of 10K+ realistic, long-horizon computer-use tasks across 200 software applications. The pipeline (creation + audit + propose-and-amplify + checklist verification) produces train/test splits and a very challenging long-horizon benchmark (CUA-World-Long). Training signals distilled from trajectories improve a 2B vision‑language model to outperform larger models, and test‑time auditing improves agent reliability (e.g., Gemini-3-Flash from 11.5% → 14.0%), but frontier agents still perform poorly on many long-horizon tasks (best ≈27.5% pass).

Key Points

Gym-Anything: a library/specification that reduces environment creation to three staged setup scripts (install, configure, task_setup) + a config file and a unified gym-style API (screenshots as observations, mouse/keyboard actions).
Multi-agent creation-audit loop:
- Creation agent (Agent_C) writes setup scripts, finds/ingests realistic data, launches and interacts with the software, and produces evidence (screenshots, logs, outputs).
- Audit agent (Agent_audit) verifies evidence against quality checklists and returns failures for iterative fixes.
- Shared memory of learnings accelerates later environment builds; a summarization agent condenses memory periodically.
Scalable task generation (propose-and-amplify):
- Expensive agentic model proposes a small number of high-quality seed tasks per app.
- Cheaper non-agentic LM amplifies seeds into many tasks using in-context examples.
- Tasks evaluated with a checklist-based vision-language-model (VLM) verifier that uses privileged information embedded during setup for fine-grained scoring.
Dataset and benchmark:
- CUA-World: 10K+ tasks across 200 software apps with realistic data, train/test splits, multi-OS coverage.
- CUA-World-Long: one long-horizon task per software; tasks often require hundreds (sometimes 500+) steps and target realistic failure modes.
Empirical findings:
- Distilling successful trajectories into a 2B VLM yields better performance than models twice its size.
- Performance scales with the number of distinct software environments in training.
- Test-time auditing (independent VLM reviewer) raises pass rates modestly (example: Gemini-3-Flash from 11.5% → 14.0%).
- Even top models struggle: best reported pass ≈27.5% on the long-horizon split.
Assets released: code, infrastructure, and benchmark data to enable further research.

Data & Methods

Software selection (GDP-grounded):
- Start: O*NET occupations (~900), employment & wage data from BLS; scale to U.S. GDP using national accounts (BEA).
- Discover software per occupation via LLM + web search → initial catalog (~16,600 apps, ~1,400 categories).
- Attribute GDP to each software using: GDP_software = sum_over_occ (GDP_occ × p_computer × s_category × s_software) where p_computer = fraction of occupation using computers, s_category = share of computer work for the software category, s_software = software share within category (both estimated with LLM+web).
- Filter to sandboxable apps (self-hostable, free-tier, GUI, no special hardware) → ~3,400 candidates; apply tiered selection (economic importance + strategic/STEM coverage + SOC-group coverage) to pick 200 apps.
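The attribution formula above can be made concrete with a minimal Python sketch. The class name, function name, and all numbers are illustrative assumptions, not values or code from the paper; only the four factors and the sum over occupations come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccupationUse:
    """One occupation's inputs to the attribution formula (hypothetical container)."""
    gdp_occ: float      # GDP attributed to the occupation, in dollars
    p_computer: float   # fraction of the occupation using computers
    s_category: float   # share of computer work for the software category
    s_software: float   # software share within that category


def gdp_software(uses):
    """Sum GDP_occ * p_computer * s_category * s_software over occupations."""
    return sum(u.gdp_occ * u.p_computer * u.s_category * u.s_software for u in uses)


# Made-up inputs for two occupations, for illustration only
uses = [
    OccupationUse(gdp_occ=100e9, p_computer=0.5, s_category=0.2, s_software=0.1),
    OccupationUse(gdp_occ=40e9, p_computer=0.25, s_category=0.5, s_software=0.5),
]
total = gdp_software(uses)
print(f"{total / 1e9:.2f} billion dollars attributed")
```

Because the three share factors multiply, an error in any LLM-estimated share propagates proportionally into the final attribution, which is why the summary flags those estimates as a limitation.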
Gym-Anything library:
- Declarative specification: three sequential setup scripts + config; library manages OS backends (Linux, Windows, Android), container orchestration (docker, apptainer), display forwarding, checkpointing, caching, and a consistent observation/action interface.
- Enables large parallelism (authors ran 400+ concurrent environments over ~1,600 CPUs).
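To illustrate what "three sequential setup scripts + config" behind a gym-style interface could look like, here is a toy sketch. It is not the Gym-Anything API: `EnvSpec`, `Action`, `ToyEnv`, and the script file names are hypothetical, and the environment launches no real software. Only the stage names (install, configure, task_setup) and the screenshot-in, mouse/keyboard-out shape come from the summary.

```python
from dataclasses import dataclass, field


@dataclass
class EnvSpec:
    """Hypothetical declarative spec: three staged setup scripts plus a config."""
    name: str
    install: str
    configure: str
    task_setup: str
    config: dict = field(default_factory=dict)

    def stages(self):
        # The scripts run sequentially, in this order
        return [self.install, self.configure, self.task_setup]


@dataclass
class Action:
    kind: str        # "click" or "type" (assumed action vocabulary)
    x: int = 0
    y: int = 0
    text: str = ""


class ToyEnv:
    """Stand-in environment that only records what would have happened."""

    def __init__(self, spec, max_steps=5):
        self.spec = spec
        self.max_steps = max_steps
        self.steps = 0
        self.log = []

    def reset(self):
        self.steps = 0
        self.log = [f"ran {script}" for script in self.spec.stages()]
        return self._screenshot()

    def step(self, action):
        self.steps += 1
        self.log.append(action.kind)
        done = self.steps >= self.max_steps
        return self._screenshot(), done

    def _screenshot(self):
        # Placeholder bytes standing in for a real screenshot image
        return f"frame-{self.steps}".encode()


spec = EnvSpec("demo-app", "install.sh", "configure.sh", "task_setup.sh")
env = ToyEnv(spec, max_steps=2)
obs = env.reset()
obs, done = env.step(Action("click", x=10, y=20))
obs, done = env.step(Action("type", text="hello"))
```

Keeping the spec declarative is what would let a library swap OS backends or container runtimes underneath without changing the agent-facing loop.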
Multi-agent pipeline:
- Agents implemented as LLM instances with tool access (Claude variants in the paper), distinct system prompts and toolsets (bash/python, visual grounding tools).
- Creation-audit loop iterated until checklist criteria satisfied; memory M collected environment-specific heuristics; a summarization agent reduced memory size periodically to mitigate context fatigue.
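The creation-audit loop can be sketched as a plain control loop. This is a minimal illustration with stub callables in place of the LLM agents; the function names, the string-based checklist, and the memory format are assumptions, and the periodic memory summarization step is omitted.

```python
def creation_audit_loop(create, audit, checklist, memory, max_rounds=5):
    """Alternate creation and audit until the audit reports no failures."""
    failures = []
    for round_no in range(1, max_rounds + 1):
        evidence = create(failures, memory)      # creation agent fixes reported failures
        failures = audit(evidence, checklist)    # audit agent checks evidence independently
        if not failures:
            return evidence, round_no
        # Record learnings so later environment builds can reuse them
        memory.extend(f"audit failure: {item}" for item in failures)
    raise RuntimeError(f"checklist still failing after {max_rounds} rounds: {failures}")


class StubCreator:
    """Stub creation agent: starts with partial evidence, then fixes what is flagged."""

    def __init__(self):
        self.evidence = {"software launches"}

    def __call__(self, failures, memory):
        self.evidence |= set(failures)
        return set(self.evidence)


def stub_audit(evidence, checklist):
    # Stub audit agent: a checklist item fails if no evidence supports it
    return [item for item in checklist if item not in evidence]


checklist = ["software launches", "real data loaded", "screenshot captured"]
memory = []
evidence, rounds = creation_audit_loop(StubCreator(), stub_audit, checklist, memory)
print(rounds, sorted(evidence), memory)
```

Passing `memory` in from the caller, rather than creating it inside the loop, mirrors the idea that learnings are shared across environment builds.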
Task scaling & verification:
- Seed generation by expensive agent; LM amplification (e.g., 75×).
- VLM checklist verifier decomposes tasks into weighted subtasks using privileged metadata produced during setup (e.g., ground-truth tumor location) — agents solving tasks do not see this privileged info.
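A weighted checklist score of the kind described above could be computed as follows. This is a sketch under stated assumptions: in the paper a VLM judges each subtask using privileged setup metadata, whereas here each judgment is a hard-coded boolean, and the weights and the function name are invented for illustration.

```python
def checklist_score(items):
    """Return the weighted fraction of passed subtasks.

    items: list of (weight, passed) pairs, one per checklist subtask.
    """
    total = sum(weight for weight, _ in items)
    if total <= 0:
        raise ValueError("checklist weights must sum to a positive value")
    earned = sum(weight for weight, passed in items if passed)
    return earned / total


# Hypothetical three-item checklist for a single task
items = [
    (5, True),    # e.g. the correct file was opened
    (3, False),   # e.g. the measured value matches the privileged ground truth
    (2, True),    # e.g. the result was exported
]
score = checklist_score(items)
print(score)
```

Partial credit of this form distinguishes an agent that completed most of a long task from one that never started, which a single pass/fail bit cannot do.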
Evaluation & training experiments:
- Distillation of teacher trajectories into a 2B VLM; comparisons to larger models; evaluation on CUA-World and CUA-World-Long.
- Test-time auditing: independent VLM inspects agent trajectories and produces guidance/required additional steps.
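Test-time auditing follows the same pattern as the creation-audit loop, with a reviewer telling the agent what remains. The sketch below uses stub functions for both roles; the names, the list-of-steps task format, and the `max_reviews` budget are assumptions rather than details from the paper.

```python
def run_with_review(agent, reviewer, task, max_reviews=2):
    """Alternate agent attempts with reviewer feedback until nothing remains."""
    trajectory = []
    feedback = []  # reviewer's list of what remains; empty before the first attempt
    for _ in range(max_reviews + 1):
        trajectory.extend(agent(task, feedback))
        feedback = reviewer(task, trajectory)
        if not feedback:
            return trajectory, True
    return trajectory, False


def stub_agent(task, feedback):
    # Without feedback the stub stops early; with feedback it does what was flagged
    return list(feedback) if feedback else task[:1]


def stub_reviewer(task, trajectory):
    # Stub reviewer: report every required step missing from the trajectory
    return [step for step in task if step not in trajectory]


task = ["open file", "apply filter", "export report"]
trajectory, passed = run_with_review(stub_agent, stub_reviewer, task)
print(passed, trajectory)

# With no review budget the premature stop goes uncorrected
_, passed_without_review = run_with_review(stub_agent, stub_reviewer, task, max_reviews=0)
```

The stub agent's early stop stands in for the false completion claims that the reviewer is meant to catch.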

Implications for AI Economics

Broader, GDP-relevant evaluation: Grounding environment selection in occupation-level GDP aligns benchmarks with economic value, enabling more relevant assessments of whether CUAs can automate or assist high-value digital work.
Training signal for economically consequential tasks: Large, diverse, realistic trajectories across many domain-specific applications provide richer supervision for agents that might substitute for or augment human labor in digitally intensive occupations.
Reveals current capability gaps: Weak performance on long-horizon, cross-software workflows suggests that short-horizon benchmark success can substantially overstate real-world economic impact (limited substitution now, but potential with further progress).
Auditing and verification are critical: Creation- and test-time audit agents materially improve environment quality and agent reliability. For economic deployment this suggests production CUAs will need independent verification layers (e.g., auditing agents, privileged checks) to mitigate false completion claims and unsafe errors.
Scalability vs. representativeness trade-offs:
- The pipeline scales environment creation, but selection is US GDP‑grounded and limited to sandboxable, free/self-hostable GUI apps — this biases coverage toward certain institutions, countries, and software types; broader economic conclusions should account for that.
- Replacing proprietary or enterprise-only software with sandboxable analogs preserves some signal but may understate real complexity and integrations in production systems.
Policy and labor-market consequences:
- If CUAs trained on CUA-World generalize to real software, there could be productivity gains in many occupations (accounting, healthcare workflows, scientific analysis), but also displacement risks concentrated in digitally intensive tasks.
- Measurement: Benchmarks grounded in economic data help quantify potential exposure but require careful updating as software mixes and work practices differ across firms and countries.
Research and safety directions:
- Future AI-economic analyses should combine scalable, GDP-grounded benchmarks (like CUA-World) with field trials and expert-in-the-loop auditing to estimate real-world productivity, risk, and complementarities.
- The reliance on audit layers suggests that robustness, interpretability, and verifiability are essential for high-stakes deployment where economic value and liability are significant.

Limitations to note (relevant for economic interpretation): selection limited to sandboxable/free GUI software; GDP attribution uses LLM estimates for software shares (potential error); environments do not capture organizational constraints like permissions, enterprise integrations, or confidential datasets; substantial compute required to scale environments and training.

Contact/resources: authors release code, infra, and benchmark data (link: https://cmu-l3.github.io/gym-anything) to facilitate replication and further economic analyses.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. The paper introduces a dataset and environment-creation pipeline and reports model performance on benchmarks; it does not make or test causal claims about economic outcomes that would require identification strategies.
Methods Rigor: medium. The work is methodically engineered at scale: an automated multi-agent pipeline for environment creation, independent audit agents, realistic data and train/test splits, and multiple model evaluations (including distillation and reviewer-audit experiments). However, risks remain from automated setup errors and audit coverage, limited external validation of environment realism (vs. real worker behavior), and evaluations constrained to a handful of VLMs and distilled models rather than broad, real-world deployments.
Sample: CUA-World consists of over 10,000 long-horizon computer-use tasks created from 200 real software applications selected to cover occupations grounded in a U.S. GDP-based taxonomy; each task is configured with realistic/downloaded data and includes train/test splits; CUA-World-Long is a challenging subset of tasks often requiring 500+ steps. Experiments include distillation into a 2B vision-language model and evaluations using off-the-shelf VLMs (e.g., Gemini-3-Flash) with an auditor/reviewer agent.
Themes: human_ai_collab, productivity, adoption
Generalizability:
- Environments are simulated conversions of software and may not capture real-world worker variability or multi-user interactions.
- Occupational grounding is U.S.-centric (GDP-based), limiting geographic and sectoral representativeness.
- Selection of 200 software applications may be biased toward particular domains or software types.
- Model evaluations are limited to specific VLMs and a distilled 2B model, so performance may not generalize across architectures or future models.
- Automated setup and auditing could miss subtle errors, reducing fidelity to real-world workflows.
- Focus is on digital/computer-use tasks and excludes physical, offline, or interpersonal tasks.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Current research has largely focused on short-horizon tasks over a limited set of software with limited economic value (e.g., basic e-commerce and OS-configuration tasks).	Research Productivity	negative	high	scope and horizon of existing research tasks	0.18
We introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment.	Research Productivity	positive	high	availability of a general framework for environment creation	0.3
Environment creation is framed as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software while producing evidence of correct setup; an independent audit agent verifies evidence against a quality checklist.	Research Productivity	positive	high	reliability/validity of environment setup via multi-agent workflow	0.3
Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage.	Research Productivity	positive	high	number of software applications covered and occupational coverage	n=200 200 software applications 0.3
The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits.	Research Productivity	positive	high	number of long-horizon tasks and availability of realistic data and splits	n=10000 over 10K long-horizon tasks 0.3
CUA-World-Long is a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks.	Research Productivity	positive	high	task horizon measured in number of steps	often requiring over 500 steps 0.18
Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size.	Output Quality	positive	high	model performance on benchmark tasks (success metric unspecified in excerpt)	outperforms models 2× its size 0.18
Applying the same auditing principle at test time (a separate VLM reviews completed trajectories and provides feedback) improves Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%.	Output Quality	positive	high	benchmark score (success rate) on CUA-World-Long	11.5% to 14.0% 0.18
All code, infrastructure, and benchmark data are released to facilitate future research in realistic computer-use agents.	Research Productivity	positive	high	availability of code, infrastructure, and benchmark data	0.3