Synthetic 'computers' simulate month-long, user-specific work environments to train agents: 1,000 long-horizon runs (8+ hours, ~2,000 turns) produce richer experiential signals and improve agent performance on in- and out-of-domain productivity evaluations, though real-world validation is still pending.

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Tao Ge, Baolin Peng, Hao Cheng, Jianfeng Gao · April 30, 2026

arXiv · descriptive · low evidence · 7/10 relevance

The paper presents a scalable method to generate synthetic, user-specific computer environments and runs long-horizon agent simulations on 1,000 such worlds, yielding improved agent performance on productivity tasks in preliminary evaluations.

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer -- for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts -- until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.

Summary

Main Finding

The report introduces "Synthetic Computers at Scale," a methodology that generates artifact-rich, user-specific synthetic computer environments from personas and uses them to run long-horizon productivity simulations. Each simulation yields an agent trajectory covering about a month of simulated human work, with both process and outcome signals. In preliminary experiments, using these signals is reported to significantly improve agent performance on both in-domain and out-of-domain productivity evaluations. The approach is scalable in principle (persona pools at billion scale) and intended as a substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.

Key Points

Goal: produce realistic, user-grounded long-horizon trajectories for productivity agents without relying on private real-world computers.
Pipeline summary:
- Start from high-level personas → expand to detailed user profiles (identity, role, habits, tools, document behavior).
- Plan a filesystem policy and file inventory (directory tree, timestamps, file metadata, and an explicit dependency graph linking artifacts).
- Instantiate directories and content-rich artifacts using web retrieval for public files or LLM-driven synthesis when needed, honoring cross-file dependencies via topological ordering.
Simulation design:
- Two agents: a setup agent (creates month-scale productivity objectives tailored to the persona/computer) and a work agent (acts as the user, navigates the filesystem, coordinates with simulated collaborators, iteratively produces deliverables).
- Objectives correspond to about a month of human work and produce deliverables (documents, spreadsheets, decks, PDFs) plus detailed process traces.
Experimental scale and results (preliminary):
- Created 1,000 synthetic computers and ran one long-horizon simulation per computer.
- Typical simulation: >8 hours agent runtime and >2,000 interaction turns on average.
- Produced rich process and outcome signals; using these signals is reported to yield significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. The claim of significance is qualitative: the excerpt includes no specific numeric metrics.
Released resources: 100 synthetic computers (50 Windows-style, 50 macOS-style) and retrospective analysis reports for 500 simulations.
Claim of scalability: with large persona pools, the methodology could scale to millions–billions of synthetic user worlds given sufficient compute.

Data & Methods

Persona-driven creation:
- Personas are expanded into detailed user profiles capturing professional context (projects, collaborators, deliverables) and computer-use behavior (naming conventions, folder habits, tooling, tidiness).
Filesystem planning:
- Generate a filesystem policy (start time, drive layout, preferred folders, naming style, storage patterns).
- Plan file inventory and a directed dependency graph (files reference/derive from others) to ensure correlated, realistic artifact sets rather than i.i.d. file samples.
Artifact instantiation:
- Map logical paths to a portable on-disk layout that preserves OS semantics.
- Instantiate files in dependency-aware order (topological sort via Kahn's algorithm); public/downloadable artifacts are fetched when available, otherwise synthesized by LLMs equipped with artifact-creation tools.
- Artifacts are content-rich (spreadsheets with tabs, Word docs, PPT decks, PDFs) and reflect timestamped virtual histories.
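
The dependency-aware instantiation order can be illustrated with a minimal sketch of Kahn's algorithm. The file paths, the `depends_on` mapping, and the function name `instantiation_order` are hypothetical and chosen for illustration only; the paper's actual planner is not shown in this summary.

```python
from collections import deque


def instantiation_order(depends_on):
    """Order files so each one comes after every file it depends on.

    depends_on maps a file path to the list of paths it references or
    derives from. Raises ValueError if the dependencies form a cycle.
    """
    # Collect every node, including files that appear only as dependencies.
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)

    # in_degree[f] counts the dependencies of f that are not yet created.
    in_degree = {n: 0 for n in nodes}
    dependents = {n: [] for n in nodes}
    for path, deps in depends_on.items():
        for dep in set(deps):  # ignore duplicate edges
            in_degree[path] += 1
            dependents[dep].append(path)

    # Start from files with no dependencies; sorting keeps output deterministic.
    ready = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in sorted(dependents[current]):
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical inventory: a deck built from a report, both built from a sheet.
plan = {
    "Reports/q3_summary.docx": ["Data/q3_sales.xlsx"],
    "Decks/q3_review.pptx": ["Reports/q3_summary.docx", "Data/q3_sales.xlsx"],
    "Data/q3_sales.xlsx": [],
}
order = instantiation_order(plan)
```

With this toy inventory, the spreadsheet is created first, then the report that cites it, then the deck that draws on both, so a generator filling in each file can always read the artifacts it is supposed to reference.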
Long-horizon simulation:
- Setup agent produces several deliverable work packages tailored to the persona and existing artifacts (typically multiple interdependent deliverables).
- Work agent executes: filesystem navigation for grounding, using/modifying artifacts, coordinating with simulated collaborators, iterative revision, and failure recovery.
- Simulation outputs: detailed trajectories (search/planning/coordination/revision steps) and final deliverables.
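
The control flow of the two-agent design can be sketched as follows. This is a toy under stated assumptions: `setup_agent` and `work_agent_step` are hypothetical stand-ins for LLM calls, the objective and file paths are invented, and real runs involve filesystem navigation, collaborators, and revision that this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass
class Objective:
    name: str
    deliverables: list  # paths the work agent must produce


@dataclass
class Trajectory:
    turns: list = field(default_factory=list)   # process signal
    produced: set = field(default_factory=set)  # outcome signal


def setup_agent(profile):
    # Hypothetical stand-in: a real setup agent would read the synthetic
    # computer and propose objectives tailored to the user's role.
    return [
        Objective(
            name="Quarterly review",
            deliverables=["Reports/q3_summary.docx", "Decks/q3_review.pptx"],
        )
    ]


def work_agent_step(objective, produced):
    # Hypothetical stand-in for one turn: write the next missing deliverable.
    for path in objective.deliverables:
        if path not in produced:
            return {"action": "write_file", "path": path}
    return {"action": "done"}


def run_simulation(profile, max_turns=10_000):
    """Run the work agent until every objective is complete or turns run out."""
    traj = Trajectory()
    for objective in setup_agent(profile):
        while len(traj.turns) < max_turns:
            step = work_agent_step(objective, traj.produced)
            traj.turns.append((objective.name, step))  # log every turn
            if step["action"] == "done":
                break
            traj.produced.add(step["path"])
    return traj


traj = run_simulation({"role": "analyst"})
```

The point of the structure is that the turn log and the set of produced files are recorded separately, mirroring the summary's distinction between process trajectories and final deliverables.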
Evaluation:
- Use process trajectories and final artifacts as experiential signals for agent training/finetuning and evaluate on in-domain and out-of-domain productivity tasks. The report claims significant performance gains; raw evaluation metrics are not provided in the excerpt.

Implications for AI Economics

Training-data economics
- Synthetic alternatives to private user data reduce privacy barriers and legal/frictional costs of collecting realistic long-horizon trajectories, enabling wider training signal availability for productivity agents.
- Synthetic data can be duplicated, diversified, and scaled cheaply once the generation pipeline is built, lowering marginal cost per training example (but see compute cost caveats below).
Compute, storage, and marginal cost trade-offs
- Each long-horizon simulation is compute- and time-intensive (>8 hours runtime, thousands of turns). Scaling to millions of synthetic computers implies substantial upfront and operational compute/storage costs; economic viability depends on model re-use, amortization of generation costs, and downstream value (improved agent utility).
- There is a nontrivial engineering and infrastructure cost to synthesize realistic artifacts (rich spreadsheets, cross-file dependencies) compared with lightweight synthetic tasks.
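
A back-of-the-envelope projection makes the scaling cost concrete. The sketch below uses the reported figures as lower bounds (over 8 hours and more than 2,000 turns per run) and assumes one simulation per computer; `cost_per_agent_hour` is a hypothetical parameter, since the summary gives no price.

```python
HOURS_PER_RUN = 8       # lower bound: the report states "over 8 hours"
TURNS_PER_RUN = 2_000   # lower bound on the reported average


def project(n_computers, cost_per_agent_hour=None):
    """Lower-bound totals for one long-horizon run per synthetic computer."""
    result = {
        "agent_hours": n_computers * HOURS_PER_RUN,
        "turns": n_computers * TURNS_PER_RUN,
    }
    # The rate is an assumption supplied by the caller; none is in the source.
    if cost_per_agent_hour is not None:
        result["cost"] = result["agent_hours"] * cost_per_agent_hour
    return result


for n in (1_000, 10_000, 1_000_000, 1_000_000_000):
    p = project(n)
    print(f"{n:>13,} computers: {p['agent_hours']:>13,} agent-hours, "
          f"{p['turns']:>17,} turns")
```

At the reported scale of 1,000 runs this gives at least 8,000 agent-hours and 2,000,000 turns; at one million computers, at least 8,000,000 agent-hours. Storage and artifact-generation costs are not included.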
Labor market and productivity effects
- High-quality long-horizon training data could materially accelerate productivity agents' ability to perform knowledge-work tasks (planning, evidence-gathering, drafting deliverables). That suggests potential for both augmentation (higher output per worker) and substitution effects for mid- to high-skill white-collar roles.
- The distributional impact will depend on which professions and workflows are synthesized and improved first — those with abundant personas and standardized artifacts may see faster automation.
Value of process signals
- Process trajectories (planning, revision, coordination) are economically valuable beyond final artifacts: they enable training agents that can integrate into multi-step workflows, improve reliability, and reduce costly human oversight, increasing expected downstream returns from deployment.
Generalization and externalities
- While synthetic environments can be diversified, generation biases (persona design choices, tooling assumptions, language/cultural biases) will determine where agents generalize well or fail, potentially concentrating gains for contexts represented well in the persona pool.
- Synthetic data may lower privacy risk but creates risks from miscalibrated or unrealistic artifacts (leading to brittle agent behavior); poor generalization could produce negative externalities if deployed broadly (bad advice, miscoordination).
Policy and governance considerations
- Regulators and organizations should weigh benefits (reduced need for private data, faster capability improvements) against risks (automation impacts, biased deployments, accountability for synthetic-origin training data).
- Economic cost–benefit analyses should include compute costs, annotation/curation overhead, potential labor displacement, and the societal value of improved productivity tools.
Research and investment signals
- This methodology highlights investment opportunities in scalable synthetic-world generation, tooling for realistic artifact synthesis, and infrastructure to store and serve long-horizon trajectories for training agentic systems.
- It also motivates economic research into returns to scale: at what scale of synthetic persona coverage do marginal performance gains decline, and how do compute costs compare to gains in agent utility?

Suggestions for follow-up economic analysis
- Estimate per-simulation compute and storage cost and project costs for scaling to N = 10k/1M/1B synthetic computers.
- Model welfare impacts: augmentation vs substitution across occupations represented in persona pools.
- Empirically test generalization gaps between synthetic-trained agents and agents trained on privacy-preserving samples of real user trajectories to quantify trade-offs.
- Analyze market concentration risks if large providers control massive synthetic simulation substrates and downstream agent deployment.

Assessment

Paper Type: descriptive
Evidence Strength: low. Evidence comes from preliminary, simulation-based experiments (1,000 synthetic computers) showing improved agent performance; there is no causal identification versus real human productivity, limited external validation, and potential distributional mismatch between synthetic and real environments.
Methods Rigor: medium. The paper proposes a detailed, scalable methodology and runs long-horizon (8+ hour, ~2,000-turn) simulations at nontrivial scale, but evaluation appears preliminary: limited information on baselines, robustness checks, statistical significance, ablations, and validation against human-generated environments.
Sample: 1,000 synthetically generated 'computers' each containing realistic folder hierarchies and content-rich artifacts (documents, spreadsheets, presentations); two-agent long-horizon simulations per computer (a task-generating agent and a user-acting agent) producing month-scale, multi-deliverable objectives, with runs averaging 8+ hours and >2,000 turns; resulting synthetic experiential trajectories used for in-domain and out-of-domain productivity evaluations.
Themes: productivity, human_ai_collab, innovation
Generalizability:
- Synthetic environments may not capture full complexity of real user work patterns and noisiness.
- Simulated collaborators and user behavior may diverge from human interactions.
- Scale reported is 1,000 computers; claims about millions/billions assume compute and persona coverage not demonstrated.
- Domain and cultural diversity of synthetic personas/environments unclear.
- Evaluation lacks real-world productivity or human-subject validation.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations).	Adoption Rate	positive	high	creation of synthetic computer environments with realistic folder hierarchies and content-rich artifacts	0.18
Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer ... until these objectives are completed.	Task Completion Time	positive	high	ability to simulate long-horizon, user-conditioned productivity workflows	objectives require multiple professional deliverables and about a month of human work 0.18
In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them.	Adoption Rate	positive	high	number of synthetic computers created and simulated	n=1000 1,000 synthetic computers 0.18
Each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average.	Task Completion Time	positive	high	agent runtime per simulation run; number of turns per run	n=1000 over 8 hours of agent runtime and spans more than 2,000 turns on average 0.18
These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations.	Developer Productivity	positive	medium	agent performance on productivity evaluations (in-domain and out-of-domain)	significant improvements (not further quantified in the provided text) 0.11
Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs.	Adoption Rate	positive	high	scalability potential (number of synthetic user worlds producible)	can in principle scale to millions or even billions of synthetic user worlds 0.03
Scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.	Research Productivity	positive	high	suitability as a substrate for agent self-improvement and agentic RL	0.03