OpenComputer creates 1,000 verifiable desktop tasks across 33 applications and finds verifier-based scoring matches human judgments better than LLM-as-judge; despite partial successes, even frontier agents struggle to complete tasks end-to-end, revealing a gap in robust desktop automation.

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Jinbiao Wei, Qianran Ma, Yilun Zhao, Xiao Zhou, Kangqi Ni, Guo Gan, Arman Cohan · May 19, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

OpenComputer builds a verifier-grounded benchmark of 1,000 machine-checkable desktop tasks across 33 apps, shows verifier-based scoring aligns better with humans than LLM-as-judge, and finds modern agents often make partial progress but rarely achieve end-to-end task completion.

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

Summary

Main Finding

OpenComputer is a verifier-grounded framework that automates construction, verification, and evaluation of realistic desktop software tasks for computer-use agents. By making programmatic verification the organizing principle, it produces reproducible task instances (x, e, c) where success is checked via app-specific inspection endpoints. In a 33-application / 1,000-task release, verifiers produced by OpenComputer align more closely with human adjudicators than LLM-as-judge methods, and experiments show that even frontier models (best: GPT-5.4) struggle to achieve end-to-end reliable automation despite making partial progress. Open-source agents show large performance drops relative to previous benchmarks, revealing a persistent robustness gap in practical desktop automation.

Key Points

Framework design
- Four tightly coupled components: (1) app-specific state verifiers, (2) a self-evolving verification layer using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic, machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards.
Verifier-first philosophy
- Verification is treated as a software artifact: each app has a Python verifier exposing CLI/JSON endpoints that query stable inspection channels (e.g., browser CDP/Marionette, SQLite profile DBs, LibreOffice UNO, D-Bus, file parsing, accessibility APIs).
- Verifiers are unit- and integration-tested with realistic synthetic artifacts before use.
Self-evolving verification
- Calibration executions (~15 tasks/app) are run; an LLM evaluator and the programmatic verifier judge each outcome independently, and disagreements that stem from verifier errors trigger bounded verifier fixes. Cached trajectories remain unchanged during fixes.
Task generation
- Candidate human-like goals are filtered for complexity and data-generatability; tasks are retained only if a verifier can check the intended outcome (or a new endpoint is added).
- Finalized tasks are packaged as τ = (x, e, c) with executable environment initialization and machine-checkable checks.
Evaluation and scoring
- Runs are performed in fresh sandboxes; screenshots and actions are logged. Final checkers execute inside the sandbox and score reward as R = Npass / Ntotal, enabling partial credit.
Release and scale
- OpenComputer release: 33 desktop applications, 1,000 finalized tasks, average ~17.7 verifier endpoints per app, ~6.9 checks per task, ~1.3 seed files per task. Repo: https://github.com/echo0715/OpenComputer
Empirical findings
- Hard-coded verifiers align better with human adjudication than LLM-as-judge evaluation, particularly for fine-grained application state that screenshots do not reveal.
- Performance (from paper Table 2): GPT-5.4: 68.3% success rate, 88.4% average reward; Claude-Sonnet-4.6: 64.4% success, 76.6% reward; Kimi-K2.6: 58.8% success, 70.7% reward. Open-source models perform substantially worse (e.g., Qwen-3.5-27B: 32.3% success, 59.4% reward; GUI-OWL-1.5-8B: 5.7% success, 27.8% reward). Frontier models sometimes exceed their reported OSWorld-Verified scores, whereas open-source models drop sharply relative to their OSWorld-Verified scores.
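The difference between average reward and success rate gives a rough read on how much partial progress fails to convert into full completion. The sketch below only re-uses the Table 2 figures quoted above; the function name and the idea of reporting the difference in percentage points are illustrative choices, not a metric defined in the paper.

```python
# Figures as quoted above from the paper's Table 2:
# model -> (success rate %, average reward %).
table2 = {
    "GPT-5.4": (68.3, 88.4),
    "Claude-Sonnet-4.6": (64.4, 76.6),
    "Kimi-K2.6": (58.8, 70.7),
    "Qwen-3.5-27B": (32.3, 59.4),
    "GUI-OWL-1.5-8B": (5.7, 27.8),
}


def partial_credit_gap(success: float, reward: float) -> float:
    """Percentage points of average reward not matched by full task success.

    Illustrative helper (not from the paper).
    """
    return round(reward - success, 1)


gaps = {model: partial_credit_gap(s, r) for model, (s, r) in table2.items()}

# Print models from largest to smallest gap.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:20s} gap = {gap:4.1f} points")
```

On these figures the gap is largest for Qwen-3.5-27B (27.1 points) and smallest for Kimi-K2.6 (11.9 points).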

Data & Methods

Task instance specification
- Each task τ = (x, e, c): x = textual instruction for agent; e = executable environment initialization (creates files, bookmarks, profiles, etc.); c = set of machine-checkable success criteria implemented as verifier commands.
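As a minimal sketch of the triple, a task can be modelled as a small record and serialized to JSON. The field names, the seed script, and the verifier commands below are hypothetical; the summary does not give the actual task.json schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Task:
    """Illustrative container for one task tau = (x, e, c)."""

    instruction: str          # x: textual instruction shown to the agent
    init_commands: list[str]  # e: commands that set up the environment
    checks: list[str]         # c: verifier commands that must all pass

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Hypothetical example; script names and flags are made up for illustration.
example = Task(
    instruction="Bookmark https://example.org and name the bookmark 'Example'.",
    init_commands=["python seed_profile.py --empty-bookmarks"],
    checks=[
        "python toy_verifier.py check-bookmark-exists --url https://example.org",
        "python toy_verifier.py check-bookmark-title --url https://example.org --title Example",
    ],
)

print(example.to_json())
```

Keeping the checks as plain commands mirrors the description of c as verifier commands, so the same record can drive both environment setup and final scoring.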
Verifier construction
- For each app a, implement Va as a Python module exposing structured inspection and check-* endpoints returning JSON.
- Inspection channels chosen per-app: CDP for Chrome/Brave/Electron apps, Marionette for Firefox, SQLite parsing for profile DBs, LibreOffice UNO for office files, AT-SPI/accessibility for some GUI state, direct file parsing (PIL, ffprobe, document parsers) for document/media apps, D-Bus for certain services, etc.
- Verifier development follows a test plan with positive/negative cases and integration tests on the real sandboxed application.
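The verifier pattern described above (a Python module with check-* endpoints that return JSON) might look roughly like the following for a SQLite-backed profile database. This is a toy sketch: the `bookmarks` table, its columns, the endpoint name, and the CLI flags are assumptions for illustration and do not correspond to any real browser schema or to the paper's code.

```python
import argparse
import json
import sqlite3


def check_bookmark_exists(db_path: str, url: str) -> dict:
    """Toy check-* endpoint: is `url` present in a hypothetical bookmarks table?"""
    # Open read-only so the check cannot alter application state.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = conn.execute(
            "SELECT title FROM bookmarks WHERE url = ?", (url,)
        ).fetchone()
    finally:
        conn.close()
    return {
        "check": "check-bookmark-exists",
        "passed": row is not None,
        "observed": {"title": row[0] if row else None},
    }


def main(argv=None) -> int:
    """CLI wrapper: print the JSON verdict and return 0 on pass, 1 on fail."""
    parser = argparse.ArgumentParser(prog="toy_verifier")
    sub = parser.add_subparsers(dest="endpoint", required=True)
    p = sub.add_parser("check-bookmark-exists")
    p.add_argument("--db", required=True)
    p.add_argument("--url", required=True)
    args = parser.parse_args(argv)

    result = check_bookmark_exists(args.db, args.url)
    print(json.dumps(result))
    return 0 if result["passed"] else 1
```

Calling `main(["check-bookmark-exists", "--db", "profile.db", "--url", "https://example.org"])` prints one JSON object. Returning the observed state alongside the verdict is a design choice that makes failures easier to audit.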
Self-evolution (U)
- Generate calibration tasks (≈15 per app); run a strong agent to produce cached trajectories s0→sT.
- An LLM evaluator produces a criterion-level reference verdict; the programmatic verifier produces a machine verdict; a comparator aligns the two verdicts and attributes each disagreement. Verifier-side disagreements lead to code, endpoint, or documentation fixes and re-execution until agreement is reached or the budget is exhausted.
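The comparison step can be sketched as a criterion-by-criterion diff of the two verdicts. The data shapes and names here are assumptions; in particular, deciding whether a disagreement is the verifier's fault (the attribution step) is not modelled and would happen downstream of this function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disagreement:
    criterion: str
    reference: bool  # verdict from the LLM evaluator
    machine: bool    # verdict from the programmatic verifier


def compare_verdicts(
    reference: dict[str, bool], machine: dict[str, bool]
) -> list[Disagreement]:
    """Return the criteria on which the two verdicts differ."""
    unmatched = reference.keys() ^ machine.keys()
    if unmatched:
        raise ValueError(f"criteria present in only one verdict: {sorted(unmatched)}")
    return [
        Disagreement(name, reference[name], machine[name])
        for name in sorted(reference)
        if reference[name] != machine[name]
    ]


def agreement_rate(reference: dict[str, bool], machine: dict[str, bool]) -> float:
    """Fraction of criteria on which both verdicts agree."""
    if not reference:
        return 1.0
    return 1.0 - len(compare_verdicts(reference, machine)) / len(reference)


# Hypothetical criterion names for one cached trajectory.
ref = {"file_saved": True, "title_set": True, "font_is_bold": False}
mach = {"file_saved": True, "title_set": False, "font_is_bold": False}
for d in compare_verdicts(ref, mach):
    print(f"{d.criterion}: reference={d.reference}, machine={d.machine}")
```

Because the cached trajectory is fixed, re-running only the machine side after a verifier fix and diffing again is enough to see whether the disagreement is resolved.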
Task generation pipeline
- Proposal → filter by difficulty/data-generatability → ground against verifier endpoints (add endpoints if needed) → synthesize environment artifacts → emit task.json.
- Periodic task-extension workflow to avoid coverage collapse.
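The grounding step in this pipeline can be sketched as a set comparison between the endpoints a candidate task needs and the endpoints the app's verifier already exposes. The candidate format, endpoint names, and return shape are hypothetical; the paper's pipeline may represent these differently.

```python
def ground_candidates(
    candidates: list[dict], available_endpoints: set[str]
) -> tuple[list[dict], list[tuple[str, list[str]]]]:
    """Split candidates into checkable tasks and tasks needing new endpoints.

    Each candidate is a dict with a "goal" string and a
    "required_endpoints" list (both assumed field names).
    """
    retained = []
    needs_endpoint = []
    for cand in candidates:
        missing = sorted(set(cand["required_endpoints"]) - available_endpoints)
        if missing:
            # Either drop the task or add these endpoints to the verifier.
            needs_endpoint.append((cand["goal"], missing))
        else:
            retained.append(cand)
    return retained, needs_endpoint


# Hypothetical endpoints and candidate goals.
endpoints = {"check-bookmark-exists", "check-bookmark-title"}
proposals = [
    {"goal": "Bookmark a page", "required_endpoints": ["check-bookmark-exists"]},
    {"goal": "Pin a tab", "required_endpoints": ["check-tab-pinned"]},
]

kept, pending = ground_candidates(proposals, endpoints)
print("retained:", [c["goal"] for c in kept])
print("needs new endpoint:", pending)
```

Tracking the missing endpoints instead of silently discarding the candidate matches the "add endpoints if needed" branch of the pipeline.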
Evaluation harness
- Sandbox initialization, screenshot-action loop, logging of agents’ reasoning/actions/screenshots, optional final save action, run checkers inside sandbox, compute partial-credit reward R = Npass/Ntotal.
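The partial-credit reward is simple enough to state directly in code. In the sketch below, the reward function follows the formula R = Npass/Ntotal given above; treating "success" as all checks passing is an assumption, since this summary does not define the success criterion.

```python
def partial_credit_reward(check_passed: list[bool]) -> float:
    """R = N_pass / N_total over a task's final checks."""
    if not check_passed:
        raise ValueError("a task needs at least one check")
    return sum(check_passed) / len(check_passed)


def is_full_success(check_passed: list[bool]) -> bool:
    # Assumption: a task counts as a success only if every check passes (R == 1).
    return len(check_passed) > 0 and all(check_passed)


results = [True, True, True, True, True, False, False]  # 5 of 7 checks pass
reward = partial_credit_reward(results)
print(f"R = {reward:.3f}, success = {is_full_success(results)}")
# prints: R = 0.714, success = False
```

Under this reading, an agent can accumulate a high average reward while its success rate stays much lower, which is the pattern reported in the empirical findings.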
Scale & statistics
- 33 apps, 1,000 tasks; avg verifier endpoints per app = 17.7; avg checks per task = 6.9; avg seed files per task = 1.3.
Experimental comparisons
- Benchmarked multiple closed-source frontier models and several open-source agents; compared OpenComputer-verified scores to OSWorld-Verified where available; human adjudication used to validate verifier alignment.

Implications for AI Economics

Lower marginal cost and higher reproducibility of benchmarking
- Automating environment synthesis and machine-checkable verification reduces the human labor needed to produce large, diverse desktop-task benchmarks. This lowers fixed costs for evaluating and iterating agent designs and enables cheaper, repeatable evaluation at scale.
Better training signals and potential sample-efficiency gains
- Verifier-grounded rewards (precise, partial-credit, machine-checkable) enable more reliable supervised fine-tuning and RL with grounded reward signals. This can reduce wasted compute for poor or mis-evaluated episodes and improve the efficiency of producing deployable agents.
Productization and market opportunities
- The verifier module pattern suggests an industry service model: verified task suites, per-app verifier development, and sandbox orchestration could be productized for firms building workplace automation. Markets may emerge for high-quality verifier libraries, task generation services, and audited evaluation stacks.
Concentration of capability and competitive dynamics
- The observed robustness gap (frontier proprietary models substantially outperform open-source ones on verifiable real-world desktop tasks) implies short-term concentration of practical automation capabilities among well-resourced model providers. This affects the bar for automation adoption across firms and could delay commoditization until open-source models close the gap.
Labor-net effects and task reallocation
- Partial-success patterns (agents make progress but fail end-to-end) suggest near-term augmentation rather than full substitution: human workers may retain oversight and finishing tasks, shifting job content toward verification, exception handling, and higher-level decision making. This implies that productivity gains may be realized via human-AI complementarity first, with incremental labor reallocation rather than wholesale displacement.
Measurement and policy considerations
- Verifier-grounded benchmarks produce audit-ready, reproducible measures of agent capabilities. For regulators and procurement teams, such benchmarks make supplier claims easier to verify and can inform purchasing decisions and standards. They also surface where LLM-judge evaluations can be misleading, which matters for any policy that relies on reported capability metrics.
Cost structure for deployment
- While benchmark synthesis is automated, verifier maintenance and app-specific engineering remain nontrivial (endpoints, evolution loop). Firms must weigh ongoing engineering costs to extend verifiers across new applications versus expected automation value. This raises incentives for standardized inspection APIs and vendor cooperation to reduce integration costs.
Research and investment signals
- The framework highlights value in investments that improve robustness on fine-grained application state (not just visual understanding): system integration, reliable persistence actions, error recovery, and stateful API usage. These areas are likely high-return targets for both academic and industrial R&D funding.

Limitations to factor into economic interpretation
- Coverage: 33 apps is broad but not exhaustive; vertical-specific or proprietary applications remain outside scope.
- Maintenance costs: verifiers require upkeep as apps update; these ongoing costs affect long-term economics of automation.
- Sandbox realism vs. production heterogeneity: synthetic sandboxes reduce evaluation variance but may underrepresent deployment friction in heterogeneous enterprise environments.

Overall, OpenComputer materially reduces the human cost of constructing verifiable software-world benchmarks and supplies higher-fidelity evaluation and training signals, which should accelerate development of practical desktop automation. The observed robustness gap nonetheless implies that firms will face nontrivial engineering and model-capability investments before wide-scale, reliable task automation displaces significant labor.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides extensive empirical evaluation (1,000 machine-checkable tasks across 33 desktop applications) and compares verifier outputs to human adjudication, giving reasonably strong internal evidence about verifier alignment and agent performance; however, tasks are synthesized and constrained to verifiable desktop scenarios, there is limited real-world deployment evidence, and the work does not measure downstream economic outcomes, which limits broader causal or external claims.

Methods Rigor: medium. The framework integrates systematic verifiers, an execution-grounded verifier-improvement loop, a task-generation pipeline, and an evaluation harness that records full trajectories and computes auditable rewards, which is methodologically thorough; weaknesses include potential selection bias in task synthesis, unclear details about human adjudication sample size and annotation protocol, and limited stress-testing across heterogeneous real-world environments and user behaviors.

Sample: Evaluation covers 33 desktop applications and 1,000 finalized machine-checkable desktop tasks spanning browsers, office and creative tools, development environments, file managers, and communication apps; experiments compare hard-coded verifiers, LLM-as-judge evaluation, frontier agents, and open-source models within the OpenComputer simulated/verified environment.

Themes: productivity, human_ai_collab

Generalizability
- Tasks are synthesized and optimized for machine-checkability, which may not reflect the full complexity and ambiguity of real user workflows.
- Coverage is limited to 33 specific desktop applications and the particular OS/software stacks used; results may not generalize to other applications, platforms, or versions.
- The controlled, verifier-grounded environment may not capture network effects, multi-user interactions, or real-world variability (latency, permissions, config differences).
- Performance of agents may differ substantially in deployed settings with noisy inputs, diverse user goals, or tasks that lack precise state verification.
- Open-source vs frontier model comparisons may depend on model access, prompting details, and compute budget that are not universal.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications.	Adoption Rate	positive	high	coverage of applications and tasks (count)	n=1000 33 applications; 1,000 finalized tasks 0.3
OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards.	Other	positive	high	system architecture/components (qualitative)	0.3
OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state.	Decision Quality	positive	high	alignment with human adjudication / evaluation accuracy	n=1000 0.18
Frontier agents struggle with end-to-end completion despite partial progress.	Task Completion Time	negative	high	end-to-end task completion / success rate	0.18
Open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.	Output Quality	negative	medium	verified performance gap (score drop) between evaluations	sharp drops (magnitude not specified) 0.11
The self-evolving verification layer improves verifier reliability using execution-grounded feedback.	Training Effectiveness	positive	medium	verifier reliability / accuracy	0.11
The evaluation harness records full trajectories and computes auditable partial-credit rewards.	Other	positive	high	availability of full trajectories and partial-credit reward computation (qualitative/system capability)	0.3