OpenComputer creates 1,000 verifiable desktop tasks across 33 applications and finds verifier-based scoring matches human judgments better than LLM-as-judge; despite partial successes, even frontier agents struggle to complete tasks end-to-end, revealing a gap in robust desktop automation.
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
Summary
Main Finding
OpenComputer is a verifier-grounded framework that automates construction, verification, and evaluation of realistic desktop software tasks for computer-use agents. By making programmatic verification the organizing principle, it produces reproducible task instances (x, e, c) where success is checked via app-specific inspection endpoints. In a 33-application / 1,000-task release, verifiers produced by OpenComputer align more closely with human adjudicators than LLM-as-judge methods, and experiments show that even frontier models (best: GPT-5.4) struggle to achieve end-to-end reliable automation despite making partial progress. Open-source agents show large performance drops relative to previous benchmarks, revealing a persistent robustness gap in practical desktop automation.
Key Points
- Framework design
  - Four tightly coupled components: (1) app-specific state verifiers, (2) a self-evolving verification layer using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic, machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards.
- Verifier-first philosophy
  - Verification is treated as a software artifact: each app has a Python verifier exposing CLI/JSON endpoints that query stable inspection channels (e.g., browser CDP/Marionette, SQLite profile DBs, LibreOffice UNO, D-Bus, file parsing, accessibility APIs).
  - Verifiers are unit- and integration-tested with realistic synthetic artifacts before use.
- Self-evolving verification
  - Calibration executions (~15 tasks/app) are run. An LLM evaluator and the programmatic verifier judge each outcome independently, and disagreements that stem from verifier errors trigger bounded verifier fixes. Cached trajectories remain unchanged during fixes.
- Task generation
  - Candidate human-like goals are filtered for complexity and data-generatability; tasks are retained only if a verifier can check the intended outcome (or a new endpoint is added).
  - Finalized tasks are packaged as τ = (x, e, c) with executable environment initialization and machine-checkable checks.
- Evaluation and scoring
  - Runs are performed in fresh sandboxes; screenshots and actions are logged. Final checkers execute inside the sandbox and score reward as R = Npass / Ntotal, enabling partial credit.
- Release and scale
  - OpenComputer release: 33 desktop applications, 1,000 finalized tasks, average ~17.7 verifier endpoints per app, ~6.9 checks per task, ~1.3 seed files per task. Repo: https://github.com/echo0715/OpenComputer
- Empirical findings
  - Hard-coded verifiers align better with human adjudication than LLM-as-judge evaluation, particularly for fine-grained application state that screenshots do not reveal.
  - Performance (from paper Table 2): GPT-5.4 at 68.3% success rate and 88.4% average reward; Claude-Sonnet-4.6 at 64.4% success and 76.6% reward; Kimi-K2.6 at 58.8% success and 70.7% reward. Open-source models perform substantially worse (e.g., Qwen-3.5-27B: 32.3% success, 59.4% reward; GUI-OWL-1.5-8B: 5.7% success, 27.8% reward). Frontier models sometimes exceed their reported OSWorld-Verified scores, but open-source drops are large relative to OSWorld.
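
The task tuple τ = (x, e, c) and the partial-credit reward R = Npass / Ntotal described above can be illustrated with a minimal Python sketch. Everything named here (`Task`, `run_checks`, `reward`, `is_success`, the dictionary-based state) is a hypothetical stand-in, not OpenComputer's actual code; in particular, treating "success" as all checks passing is an assumption about how the success rate relates to the reward.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for real application state inside a sandbox.
State = Dict[str, dict]


@dataclass
class Task:
    instruction: str                       # x: textual instruction for the agent
    init_env: Callable[[], State]          # e: builds the initial environment
    checks: List[Callable[[State], bool]]  # c: machine-checkable criteria


def run_checks(task: Task, state: State) -> List[bool]:
    """Evaluate every criterion in c against the final state."""
    return [bool(check(state)) for check in task.checks]


def reward(results: List[bool]) -> float:
    """Partial-credit reward R = Npass / Ntotal."""
    if not results:
        raise ValueError("a task needs at least one check")
    return sum(results) / len(results)


def is_success(results: List[bool]) -> bool:
    """Assumption: a run counts as a success only if every check passes."""
    return all(results)


example_task = Task(
    instruction="Rename report.txt to final.txt and set its contents to 'done'.",
    init_env=lambda: {"files": {"report.txt": "draft"}},
    checks=[
        lambda s: "final.txt" in s["files"],
        lambda s: "report.txt" not in s["files"],
        lambda s: s["files"].get("final.txt") == "done",
    ],
)

# Simulate an agent that renames the file but never edits its contents.
state = example_task.init_env()
state["files"]["final.txt"] = state["files"].pop("report.txt")
results = run_checks(example_task, state)
```

In this toy run two of the three checks pass, so the reward is 2/3 while the run is not a success. That is the pattern the summary describes: agents accumulate partial credit without completing the task end-to-end.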
Data & Methods
- Task instance specification
  - Each task τ = (x, e, c): x = textual instruction for agent; e = executable environment initialization (creates files, bookmarks, profiles, etc.); c = set of machine-checkable success criteria implemented as verifier commands.
- Verifier construction
  - For each app a, implement Va as a Python module exposing structured inspection and check-* endpoints returning JSON.
  - Inspection channels chosen per-app: CDP for Chrome/Brave/Electron apps, Marionette for Firefox, SQLite parsing for profile DBs, LibreOffice UNO for office files, AT-SPI/accessibility for some GUI state, direct file parsing (PIL, ffprobe, document parsers) for document/media apps, D-Bus for certain services, etc.
  - Verifier development follows a test plan with positive/negative cases and integration tests on the real sandboxed application.
- Self-evolution (U)
  - Generate calibration tasks (≈15 per app); run a strong agent to produce cached trajectories s0→sT.
  - An LLM evaluator produces a criterion-level reference verdict; the programmatic verifier produces a machine verdict; a comparator aligns the two verdicts and attributes disagreements. Verifier-side disagreements lead to code/endpoint/doc fixes and re-execution until agreement or budget exhaustion.
- Task generation pipeline
  - Proposal → filter by difficulty/data-generatability → ground against verifier endpoints (add endpoints if needed) → synthesize environment artifacts → emit task.json.
  - Periodic task-extension workflow to avoid coverage collapse.
- Evaluation harness
  - Sandbox initialization, screenshot-action loop, logging of agents’ reasoning/actions/screenshots, optional final save action, run checkers inside sandbox, compute partial-credit reward R = Npass/Ntotal.
- Scale & statistics
  - 33 apps, 1,000 tasks; avg verifier endpoints per app = 17.7; avg checks per task = 6.9; avg seed files per task = 1.3.
- Experimental comparisons
  - Benchmarked multiple closed-source frontier models and several open-source agents; compared OpenComputer-verified scores to OSWorld-Verified where available; human adjudication used to validate verifier alignment.
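
To make the verifier and comparator steps above concrete, here is a minimal Python sketch of a check-* style endpoint that inspects a SQLite file and returns JSON, plus a comparator that lists the criteria on which two verdicts disagree. The function names, the JSON fields, and the single-table `bookmarks(url, title)` schema are all assumptions made for illustration; real browser profile databases use different schemas, and OpenComputer's actual endpoints may look nothing like this.

```python
import json
import sqlite3
import tempfile
from pathlib import Path
from typing import Dict, List


def check_bookmark_exists(db_path: str, url: str) -> str:
    """Hypothetical check-* endpoint: JSON verdict on whether `url` is bookmarked."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumed toy schema: bookmarks(url TEXT, title TEXT).
        row = conn.execute(
            "SELECT title FROM bookmarks WHERE url = ?", (url,)
        ).fetchone()
    finally:
        conn.close()
    return json.dumps(
        {
            "check": "bookmark_exists",
            "passed": row is not None,
            "title": row[0] if row else None,
        }
    )


def disagreements(machine: Dict[str, bool], reference: Dict[str, bool]) -> List[str]:
    """Criteria judged by both sides where the two verdicts differ."""
    return sorted(k for k in machine if k in reference and machine[k] != reference[k])


def demo() -> dict:
    """Build a throwaway database, then query it through the endpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = str(Path(tmp) / "profile.sqlite")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE bookmarks (url TEXT, title TEXT)")
        conn.execute(
            "INSERT INTO bookmarks VALUES (?, ?)",
            ("https://example.org/", "Example"),
        )
        conn.commit()
        conn.close()
        return json.loads(check_bookmark_exists(db_path, "https://example.org/"))


if __name__ == "__main__":
    print(demo())
    print(disagreements({"saved": True, "renamed": False},
                        {"saved": True, "renamed": True}))
```

The comparator here only surfaces which criteria disagree. Deciding whether a disagreement is the verifier's fault or the LLM evaluator's is the attribution step the summary describes, and it is not something this sketch attempts.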
Implications for AI Economics
- Lower marginal cost and higher reproducibility of benchmarking
  - Automating environment synthesis and machine-checkable verification reduces the human labor needed to produce large, diverse desktop-task benchmarks. This lowers fixed costs for evaluating and iterating agent designs and enables cheaper, repeatable evaluation at scale.
- Better training signals and potential sample-efficiency gains
  - Verifier-grounded rewards (precise, partial-credit, machine-checkable) enable more reliable supervised fine-tuning and RL with grounded reward signals. This can reduce wasted compute for poor or mis-evaluated episodes and improve the efficiency of producing deployable agents.
- Productization and market opportunities
  - The verifier module pattern suggests an industry service model: verified task suites, per-app verifier development, and sandbox orchestration could be productized for firms building workplace automation. Markets may emerge for high-quality verifier libraries, task generation services, and audited evaluation stacks.
- Concentration of capability and competitive dynamics
  - The observed robustness gap, in which frontier proprietary models substantially outperform open-source ones on verifiable real-world desktop tasks, implies short-term concentration of practical automation capabilities among well-resourced model providers. This affects the bar for automation adoption across firms and could delay commoditization until open-source models close the gap.
- Labor-net effects and task reallocation
  - Partial-success patterns (agents make progress but fail end-to-end) suggest near-term augmentation rather than full substitution: human workers may retain oversight and finishing tasks, shifting job content toward verification, exception handling, and higher-level decision making. This implies that productivity gains may be realized via human-AI complementarity first, with incremental labor reallocation rather than wholesale displacement.
- Measurement and policy considerations
  - Verifier-grounded benchmarks produce audit-ready, reproducible measures of agent capabilities. For regulators and procurement teams, such benchmarks make supplier claims easier to verify and can inform standards. They also surface where LLM-judge evaluations can be misleading, which matters for any policy relying on reported capability metrics.
- Cost structure for deployment
  - While benchmark synthesis is automated, verifier maintenance and app-specific engineering remain nontrivial (endpoints, evolution loop). Firms must weigh ongoing engineering costs to extend verifiers across new applications versus expected automation value. This raises incentives for standardized inspection APIs and vendor cooperation to reduce integration costs.
- Research and investment signals
  - The framework highlights value in investments that improve robustness on fine-grained application state (not just visual understanding): system integration, reliable persistence actions, error recovery, and stateful API usage. These areas are likely high-return targets for both academic and industrial R&D funding.
Limitations to factor into economic interpretation
- Coverage: 33 apps is broad but not exhaustive; vertical-specific or proprietary applications remain outside scope.
- Maintenance costs: verifiers require upkeep as apps update; these ongoing costs affect long-term economics of automation.
- Sandbox realism vs. production heterogeneity: synthetic sandboxes reduce evaluation variance but may underrepresent deployment friction in heterogeneous enterprise environments.
Overall, OpenComputer materially reduces the human cost of constructing verifiable software-world benchmarks and supplies higher-fidelity evaluation and training signals, which should accelerate development of practical desktop automation. Yet the observed robustness gap implies that firms will face nontrivial engineering and model-capability investments before wide-scale, reliable task automation displaces significant labor.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. [Adoption Rate] | positive | high | coverage of applications and tasks (count) | n=1000; 33 applications; 1,000 finalized tasks; 0.3 |
| OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. [Other] | positive | high | system architecture/components (qualitative) | 0.3 |
| OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. [Decision Quality] | positive | high | alignment with human adjudication / evaluation accuracy | n=1000; 0.18 |
| Frontier agents struggle with end-to-end completion despite partial progress. [Task Completion Time] | negative | high | end-to-end task completion / success rate | 0.18 |
| Open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation. [Output Quality] | negative | medium | verified performance gap (score drop) between evaluations | sharp drops (magnitude not specified); 0.11 |
| The self-evolving verification layer improves verifier reliability using execution-grounded feedback. [Training Effectiveness] | positive | medium | verifier reliability / accuracy | 0.11 |
| The evaluation harness records full trajectories and computes auditable partial-credit rewards. [Other] | positive | high | availability of full trajectories and partial-credit reward computation (qualitative/system capability) | 0.3 |