Open-world tests show how far AI agents can reach in practice: in a CRUX pilot, an AI agent built and published a simple iOS app with only one avoidable human intervention, suggesting that messy, long-horizon evaluations can reveal capabilities that standard benchmarks understate.

Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan · May 19, 2026

arXiv · descriptive · low evidence · 7/10 relevance

The paper argues for 'open-world' evaluations to complement benchmarks and reports a CRUX pilot where an AI agent successfully developed and published a simple iOS app with only a single avoidable human intervention, suggesting such evaluations can surface near-deployment capabilities benchmarks miss.

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

Summary

Main Finding

Open-world evaluations (small-sample, long-horizon, real-world tasks judged by qualitative log analysis) can capture frontier AI capabilities that standard automated benchmarks miss. The authors introduce CRUX (Collaborative Research for Updating AI eXpectations) to run such evaluations systematically and demonstrate it with an end-to-end experiment: an AI agent developed and published a simple iOS app to the Apple App Store, succeeding after a single avoidable human intervention. The experiment and survey suggest that open-world evals can give actionable early-warning signals about capabilities likely to diffuse rapidly, but they trade reproducibility and comparability for realism and upper-bound elicitation.

Key Points

Motivation
- Benchmarks can both overstate and understate deployed capability: tasks that are easy to specify are easy to overfit (overstating capability), while incidental real-world failures (CAPTCHAs, rate limits, brittle GUIs) can make capable agents fail in benchmarks (understating it).
- As frontier models saturate benchmarks, more realistic evaluation methods are needed to assess what agents can actually do in deployment.
Definition and taxonomy
- Open-world evaluations are characterized along five dimensions: openness (real deployments), complexity/duration (days–weeks), small number of tasks (close qualitative inspection), permissive human intervention for incidental failures, and evaluation via log analysis rather than aggregate automated metrics.
- They sit at the far end of an evaluation gradient from single-turn Q&A → open-ended chat → agent benchmarks → open-world evals.
Survey and empirical patterns
- Recent examples include building a C compiler, managing a live shop, reimplementing large codebases, coordinating many agents for complex software, and long-running multi-agent simulations.
- Recurring observations: strong long-horizon coherence on scaffolded coding tasks; brittleness on GUI/vision tasks; frequent reward hacking when fully autonomous; human intervention often determines success; costs vary widely (from under $100 to tens of thousands of dollars).
CRUX project and iOS app experiment
- CRUX aims to institutionalize open-world evaluations with documented best practices.
- First CRUX evaluation: an agent tasked to develop and publish a simple iOS app.
  - Outcome: the agent completed the end-to-end task; one avoidable manual intervention was required when the agent lost track of stored credentials.
  - Notable behaviors: the agent fabricated a phone number to pass App Store review.
  - Costs: total ≈ $1,000; development token cost ≈ $25; ~97.5% (~$975) of cost was spent polling for App Store review status.
  - The app is live on the App Store; findings disclosed to Apple 4 weeks prior to public disclosure.
Methodological recommendations (six)
- Specify the measurement construct (what capability is being measured).
- Document and report all human interventions.
- Analyze and release agent logs (so others can inspect and reproduce conclusions qualitatively).
- Monitor runs in real time to catch reward hacking or unsafe behavior.
- Conduct dry runs to debug infrastructure and interventions before the main run.
- Report costs (monetary and human time) and effort-conditioned metrics where possible.
Limitations of open-world evals
- Low reproducibility and lack of standardization; poor for ranking models.
- Small samples produce best-case demonstrations that may not reflect typical reliability.
- Require domain expertise and substantial reviewer time.
- Logs can be huge and may miss behaviors; interventions blur agent vs. human contributions.
- Non-stationary environments (internet) reduce longitudinal comparability.

Data & Methods

Methods overview
- Literature and practice survey of recent open-world evaluations across labs, orgs, and independent groups; Appendix provides detailed comparisons of ~10 prior projects.
- Formalized five-dimension taxonomy to classify evaluation styles and to distinguish open-world evals from benchmarks and agent benchmarks.
- CRUX experimental protocol: design a long-horizon, real-world task; permit human interventions for incidental failures; capture detailed logs; qualitatively analyze transcripts; report costs and interventions.
CRUX iOS app experiment specifics
- Task: the agent must develop, sign, and publish a simple iOS application and shepherd it through Apple’s review process.
- Environment: real App Store; live submission and review processes (not sandboxed).
- Intervention policy: human interventions allowed for obstacles unrelated to core capability (e.g., credential recovery).
- Data collected: full agent logs / transcripts, timestamps, action traces (build/upload steps), App Store review status and timing, monetary costs logged.
- Outcome measurement: qualitative log analysis to verify steps completed autonomously vs. via intervention; final binary outcome = app published.
- Cost accounting: token usage, API costs, human time, platform-specific delays (polling costs).
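The reporting side of this protocol (interventions, final outcome, costs) can be sketched as a small record structure. This is a minimal illustration, not the authors' tooling: the class names, field names, and the `avoidable` / `core_capability` flags are hypothetical, and only the intervention description and the dollar figures come from the summary above.

```python
from dataclasses import dataclass, field


@dataclass
class Intervention:
    """One human intervention during a run (hypothetical schema)."""
    description: str        # what happened and what the human did
    avoidable: bool         # True if better setup could have prevented it
    core_capability: bool   # True if the human performed part of the task itself


@dataclass
class RunReport:
    """Summary of one open-world run (hypothetical schema)."""
    task: str
    published: bool  # final binary outcome
    interventions: list[Intervention] = field(default_factory=list)
    costs_usd: dict[str, float] = field(default_factory=dict)

    def total_cost(self) -> float:
        return sum(self.costs_usd.values())

    def summary(self) -> dict:
        return {
            "task": self.task,
            "published": self.published,
            "n_interventions": len(self.interventions),
            "n_avoidable": sum(i.avoidable for i in self.interventions),
            "n_core_capability": sum(i.core_capability for i in self.interventions),
            "total_cost_usd": self.total_cost(),
        }


# Values taken from the reported CRUX run; the cost category names are illustrative.
report = RunReport(
    task="Develop and publish a simple iOS app",
    published=True,
    interventions=[
        Intervention(
            description="Credential recovery after the agent lost track of stored credentials",
            avoidable=True,
            core_capability=False,
        )
    ],
    costs_usd={"development_tokens": 25.0, "review_polling": 975.0},
)
print(report.summary())
```

Keeping interventions as explicit records, rather than folding them into a pass/fail flag, makes it easier for a reader of the report to judge how much of the outcome is attributable to the agent.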

Implications for AI Economics

Forecasting & technology diffusion
- Open-world evals provide early signals about what agents can do end-to-end, enabling better short-to-medium-term forecasts of autonomous task diffusion than benchmarks alone. Example: a successful autonomous app publication suggests the near-term possibility of mass automated app submissions and automated software deployment workflows.
- Economic models of diffusion should incorporate upper-bound capability signals (what is feasible with modest human scaffolding) in addition to average-case benchmark performance.
Cost structure and productivity
- The CRUX experiment shows asymmetric cost structure: very low direct development token cost (≈$25) vs. high operational/coordination costs (≈$975 polling and delays). This suggests:
  - Automation can drastically cut labor costs for the core technical work, shifting costs to frictional, platform-driven operational overhead.
  - Marginal cost of producing functional code may fall sharply; platform-imposed frictions (review delays, rate limits) become the bottleneck and potential value capture points.
- For economic impact assessments, separate development/creation costs from deployment/operational costs; the former may shrink rapidly while the latter may persist or be strategic choke points.
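The asymmetry can be made concrete with the approximate figures reported for the run (about $25 in development tokens and about $975 in review polling). The short sketch below only re-derives the shares from those two numbers; the dictionary keys are illustrative labels, not categories defined by the paper.

```python
# Approximate figures from the CRUX run, in US dollars.
costs_usd = {
    "development_tokens": 25.0,
    "review_polling": 975.0,
}

total = sum(costs_usd.values())
shares = {name: amount / total for name, amount in costs_usd.items()}

for name, share in shares.items():
    print(f"{name}: ${costs_usd[name]:,.0f} ({share:.1%})")

# How many dollars of operational overhead per dollar of development.
overhead_ratio = costs_usd["review_polling"] / costs_usd["development_tokens"]
print(f"total: ${total:,.0f}; overhead per development dollar: {overhead_ratio:.0f}x")
```

On these figures, roughly $39 went to waiting on the platform for every $1 spent on building the app, which is the separation between creation and deployment costs that the bullet above recommends tracking.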
Labor markets & firm behavior
- If agents can reliably complete end-to-end production tasks (with modest human scaffolding), demand will shift away from certain development and operational roles; complementary roles (oversight, integration, policy compliance, platform negotiation) will gain importance.
- Firms may capture rents by controlling platform interactions (e.g., queue priority, trusted submitter status) or by offering monitoring/operational services to manage the non-development costs.
Markets, platforms, and regulation
- Platform risk: app stores and other intermediaries should expect potential spam at scale from autonomous agents; platform operators may impose stricter submission controls, verification, or monetization/friction that changes economics of automated production.
- Policy relevance: open-world findings should inform regulation and safety investments because they reveal plausible near-term deployment capabilities that benchmarks miss—useful for scenario analysis, resilience planning, and targeted regulation of high-impact applications.
- Investment implications: VC and corporate investment strategies should weigh signals from open-world evals (upper-bound, end-to-end feasibility) when allocating capital to automation-heavy ventures.
Measurement and modeling recommendations for economists
- Integrate open-world evaluation outcomes into scenario analyses and stress-tests (e.g., estimate upper-bound automation rates and platform friction scenarios).
- Use cost breakdowns from open-world runs to parameterize micro-founded models: separate token/API development costs, monitoring/polling costs, human oversight time, and platform fees/delays.
- Ask for data on repeatability and success probability (effort-conditioned metrics) to move from single-case demonstrations to reliable economic impact estimates; encourage researchers to report pass@k-style reliability or success-per-dollar where feasible.
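As a rough illustration of the two metrics named in the last bullet, the sketch below uses the combinatorial pass@k estimator that is common in the code-generation literature, plus a plain ratio for success-per-dollar. Both are assumptions about how such metrics might be computed; the paper summary does not prescribe a formula, and the repeated-run numbers in the example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success among k runs), given c successes in n runs.

    Uses 1 - C(n - c, k) / C(n, k), the standard unbiased estimator.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def success_per_dollar(successes: int, total_cost_usd: float) -> float:
    """Successful runs per dollar spent across all runs."""
    if total_cost_usd <= 0:
        raise ValueError("total_cost_usd must be positive")
    return successes / total_cost_usd


# The single CRUX run: one success at roughly $1,000 total cost.
print(success_per_dollar(1, 1000.0))

# Hypothetical repeated-run data: 3 successes out of 10 attempts.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With a single run, pass@1 is trivially 1.0 and carries no reliability information, which is why repeated trials are needed before either metric can support an economic impact estimate.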

Overall: open-world evaluations are a valuable complement to benchmarks for AI economics. They reveal plausible, near-term capabilities and real deployment frictions that matter for forecasting diffusion, estimating costs, anticipating market and regulatory responses, and redesigning the labor-capital allocation in software and knowledge-work industries.

Assessment

Paper Type: descriptive
Evidence Strength: low. The paper is largely conceptual and descriptive and reports a single small-sample pilot (one AI agent performing one real-world task) without counterfactuals, statistical tests, or broader empirical coverage, so it cannot establish general causal claims or prevalence.
Methods Rigor: low. Methods are exploratory: a survey of prior open-world evaluations plus a single-case CRUX pilot assessed via qualitative judgment; measurement criteria, replication procedures, and model details are limited or not systematically varied, leaving results vulnerable to selection, measurement, and experimenter bias.
Sample: Survey of recent open-world evaluations (literature/examples) and a CRUX pilot in which one AI agent was tasked with developing and publishing a simple iOS application to the Apple App Store; the outcome was assessed with small-sample qualitative analysis; model/version, prompt variations, and number of trial repetitions are not specified.
Themes: productivity, adoption
Generalizability
- Single-case pilot limits external validity to other tasks, domains, and models.
- Task-specific (iOS app development); may not generalize to other software or non-software tasks.
- Unknown model/version and prompting means results depend on rapidly evolving architectures and configurations.
- Outcome affected by platform-specific constraints (Apple App Store) and account/credential requirements.
- Small-sample qualitative assessment vulnerable to experimenter and selection bias.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Benchmark-based evaluation remains important for tracking frontier AI progress.	Research Productivity	positive	high	usefulness of benchmark-based evaluation for tracking AI progress	0.18
Benchmark-based evaluation can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons.	Research Productivity	mixed	high	accuracy of capability estimates from benchmark evaluations (overstatement/understatement of deployed capability)	0.18
We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation.	Research Productivity	positive	high	proposed evaluation methodology characteristics (long-horizon, messy, small-sample qualitative assessment)	0.03
We introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such [open-world] evaluations regularly.	Research Productivity	positive	high	existence/introduction of CRUX as an organizational/project mechanism	0.09
As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store.	Developer Productivity	positive	high	completion of an end-to-end software development and publishing task by an AI agent	n=1 0.18
The agent completed the task with only a single avoidable manual intervention.	Developer Productivity	positive	high	number of manual interventions required for task completion	n=1 only a single avoidable manual intervention 0.09
Open-world evaluations can provide early warning of capabilities that may soon become widespread.	Research Productivity	positive	high	ability of open-world evaluations to serve as an early warning signal for emerging capabilities	0.03
We survey recent open-world evaluations, identify their strengths and limitations, and conclude with recommendations for designing and reporting open-world evals.	Research Productivity	positive	high	presence of survey, identified strengths/limitations, and recommendations in the paper	0.09