
Large language models are improving broadly across text tasks rather than suddenly mastering narrow task clusters: estimated success on tasks that take humans about 3–4 hours rose from roughly 50% in 2024‑Q2 to about 65% by 2025‑Q3, and success on most text tasks is projected to reach 80–95% by 2029 at a minimally sufficient quality level. That suggests gradual, economy‑wide automation (a ‘rising tide’) rather than imminent, concentrated disruptions, although actual labor impacts will depend on organizational adoption and quality requirements.

Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks

Matthias Mertens, Adam Kuzee, Brittany S. Harris, Harry Lyu, Wensu Li, Jonathan Rosenfeld, Meiri Anto, Martin Fleming, Neil Thompson · April 01, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

Using >3,000 O*NET-derived text tasks and >17,000 worker evaluations, the study finds broad, steadily improving LLM performance—supporting a 'rising tide' pattern rather than abrupt 'crashing waves'—with task success rates roughly 50% in 2024-Q2, 65% by 2025-Q3, and projected 80–95% by 2029 at a minimally sufficient quality level.

We propose that AI automation is a continuum between: (i) crashing waves where AI capabilities surge abruptly over small sets of tasks, and (ii) rising tides where the increase in AI capabilities is more continuous and broad-based. We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable. Based on more than 17,000 evaluations by workers from these jobs, we find little evidence of crashing waves (in contrast to recent work by METR), but substantial evidence that rising tides are the primary form of AI automation. AI performance is high and improving rapidly across a wide range of tasks. We estimate that, in 2024-Q2, AI models successfully complete tasks that take humans approximately 3-4 hours with about a 50% success rate, increasing to about 65% by 2025-Q3. If recent trends in AI capability growth persist, this pace of AI improvement implies that LLMs will be able to complete most text-related tasks with success rates of, on average, 80%-95% by 2029 at a minimally sufficient quality level. Achieving near-perfect success rates at this quality level or comparable success rates at superior quality would require several additional years. These AI capability improvements would impact the economy and labor market as organizations adopt AI, which could have a substantially longer timeline.

Summary

Main Finding

Across >11,000 text‑addressable O*NET task instances (evaluated by domain‑experienced workers and run on >40 LLMs), AI capability gains look like a “rising tide” rather than episodic “crashing waves.” Performance improves broadly across many tasks at once, with substantial and rapidly rising success rates for minimally sufficient (manager‑acceptable without edits) outputs. Model size and model vintage matter in different ways: larger models boost short‑task performance more, while newer model vintages shift performance up roughly in parallel across task lengths.

Key Points

Rising‑tide pattern: The relationship between AI success and log task duration is relatively flat on average. Pooled logistic slope for “minimally sufficient” (rating ≥7) is ≈ −0.31 in log10(task time), implying modest declines in success as task duration increases.
High current capability: Across surveyed models, many tasks already see substantial success rates (roughly half to three‑quarters depending on domain and threshold).
- In the pooled sample, models in 2024‑Q2 roughly achieved a 50% success rate on tasks that take humans ~3–4 hours.
- Frontier models (by 2024‑Q3) were already at ~50% success on some tasks taking about a day.
Rapid improvement:
- Implied “doubling time” (calendar time between releases required to handle tasks twice as long at the same success rate) ≈ 3.8 months.
- Average failure rates for tasks taking 5 minutes–24 hours halve every ~2.4–3.2 years; this corresponds to annual success increases of ~8–11 percentage points over 2024‑Q2 to 2025‑Q3.
Heterogeneity by job family: Success rates and slopes vary across domains. Some families show steeper negative slopes (greater sensitivity to task duration), others show near‑flat slopes.
Size vs. vintage:
- Larger models outperform smaller models disproportionately on short tasks (an “outward rotation” of the success–duration curve).
- Newer vintages outperform older ones with an approximately parallel upward shift (similar gains across task durations).
Extrapolation (with caveats): If recent trends persist, most text‑related tasks could reach ~80–95% success at a minimally sufficient level by ~2029; reaching near‑perfect or superior quality will take several additional years.
Important caveats: Preliminary and selective sample (only text‑addressable or partially text‑based O*NET tasks filtered for ≥10% time‑saving potential), primary outcome is a “minimally sufficient” manager acceptability threshold (rating ≥7), survey still ongoing, and results do not directly map to job automation shares (ignores integration/last‑mile costs, adoption lags, complementarities).
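The 2029 extrapolation above can be sanity-checked with simple arithmetic. The sketch below assumes the failure rate keeps halving at a constant interval (the ~2.4–3.2 year range quoted above). As a simplifying assumption that is not taken from the paper, it starts from the ~65% success rate reported for 2025‑Q3 and looks about 3.5 years ahead. The function name `project_success` and the chosen horizon are illustrative only.

```python
def project_success(success_now: float, years_ahead: float, halving_years: float) -> float:
    """Project a success rate forward, assuming the failure rate halves
    every `halving_years` years (constant exponential decay of failures)."""
    failure_now = 1.0 - success_now
    failure_later = failure_now * 0.5 ** (years_ahead / halving_years)
    return 1.0 - failure_later


if __name__ == "__main__":
    # Assumed starting point: ~65% success in 2025-Q3, projected ~3.5 years ahead.
    for halving_years in (2.4, 3.2):
        projected = project_success(0.65, 3.5, halving_years)
        print(f"halving every {halving_years} years -> {projected:.1%} success")
```

Under these assumptions both ends of the halving range land inside the 80–95% band, which is consistent with the projection being a straightforward continuation of the measured trend rather than an assumed acceleration.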

Data & Methods

Task sample: Tasks drawn from U.S. O*NET; GPT‑4 used to screen tasks for ≥10% time‑saving plausibility by LLMs. The survey includes two instances per selected task; tasks span ≈10 minutes to multiple days (most 20 minutes–10 hours).
Models: >40 LLMs of different sizes and vintages (small and large, pre‑ and post‑2025 models) were evaluated.
Evaluators: Human raters with relevant on‑the‑job experience judged outputs. Ratings used a 1–9 scale; main binary outcome = manager would accept output without edits (rating ≥7). Stricter thresholds (≥8 average, =9 superior) were also analyzed.
Estimation: Logistic regression of success on log10(task duration): Pr(success) = Λ(α + β log10 T). Coefficients interpreted and compared across job families, model size, and model vintage. Standard errors clustered by participant; robustness checks and alternative specifications reported.
Microfoundation: Paper offers an interpretation mapping slope β to the number of sequentially dependent substeps in tasks (one plausible microfoundation), but the empirical findings do not rely on that specific model.
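To make the functional form concrete, the sketch below evaluates Pr(success) = Λ(α + β log10 T) using the pooled slope β ≈ −0.31 reported under Key Points. The intercept is not given in this summary, so the hypothetical helper `calibrate_alpha` pins it with an assumed anchor: 50% success at a 3.5-hour task, the midpoint of the ~3–4 hour figure. Measuring T in hours is also an assumption; changing the time unit shifts only α, not β.

```python
import math

BETA = -0.31  # pooled slope per log10 unit of task time, as reported in this summary


def logistic(x: float) -> float:
    """Standard logistic function, written as Λ in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def calibrate_alpha(beta: float, hours_at_half: float) -> float:
    """Intercept that makes Pr(success) = 0.5 at `hours_at_half`.
    Since Λ(0) = 0.5, this requires α + β·log10(T50) = 0."""
    return -beta * math.log10(hours_at_half)


def success_prob(hours: float, alpha: float, beta: float = BETA) -> float:
    """Pr(success) = Λ(α + β·log10 T), with T in hours (assumed unit)."""
    return logistic(alpha + beta * math.log10(hours))


if __name__ == "__main__":
    alpha = calibrate_alpha(BETA, 3.5)  # assumed anchor, not a reported estimate
    for hours in (0.25, 1.0, 3.5, 24.0):
        print(f"{hours:>5} h -> {success_prob(hours, alpha):.1%}")
```

With a slope this shallow, a tenfold increase in task duration lowers the log-odds of success by only 0.31, which is the sense in which the success–duration curve is relatively flat.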

Implications for AI Economics

Broad, distributed productivity gains: A rising‑tide pattern implies many tasks across many occupations will see incremental automation simultaneously rather than concentrated, abrupt shocks to a narrow set of tasks. Expect wide spillovers rather than isolated collapses.
Labor market adjustment: Because performance improves across many tasks, displacement and reallocation pressures could be pervasive but more gradual at the task level. This may allow more time for retraining and adaptation than a crashing‑wave scenario, but cumulative impacts can still be large.
Wage and task composition effects: Widespread partial automation can compress wages for routine text tasks and shift worker time toward supervisory, integrative, and non‑automatable activities. Heterogeneity across job families suggests uneven impacts across sectors and occupations.
Firm adoption and macro timing: Capability improvements are rapid, but adoption depends on integration costs, “last‑mile” work, regulatory and organizational frictions. Thus, capability availability (short horizon) does not mean immediate full job automation (longer, uncertain horizon).
Policy priorities:
- Invest in task‑level monitoring and measurement (to map which tasks are automatable and at what quality).
- Support worker transitions: targeted retraining, portable benefits, and policies that ease reallocation.
- Encourage research and standards on evaluation quality (difference between “minimally sufficient” vs. “average/superior” outputs matters for safety‑critical domains).
Research gaps highlighted: estimation of adoption elasticities, quantifying last‑mile integration costs, general equilibrium effects, extension beyond text‑based tasks, and better mapping from task‑level success to occupation‑level employment outcomes.
Uncertainty: Projections assume that recent rapid gains continue; a slowdown in hardware, compute scaling, or algorithmic progress would reduce the projected near‑term impacts.

Short takeaway: For text‑based labor‑market tasks, the evidence so far points to a broad, steadily rising tide of AI capability, with substantial and rapid gains across many tasks. This is likely to reshape work over the coming decade, but with important domain heterogeneity and considerable uncertainty about timing and full economic consequences.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper uses a large task set (3,000+ O*NET-derived, text-addressable tasks) and a substantial number of human evaluations (>17,000), which provides meaningful descriptive evidence on LLM capabilities and trends; however, evaluations appear observational and potentially subjective, task selection is limited to text-based items, model sampling and evaluator protocols are not fully described here, and the forward projections rely on extrapolation of capability trends rather than causal evidence linking capabilities to economic outcomes.
Methods Rigor: medium. The study systematically measures performance across many tasks and uses worker judgments to assess success, suggesting careful measurement design, but potential weaknesses include non-random task sampling (O*NET text-only subset), possible evaluator bias, unclear inter-rater reliability or rubric standardization, limited information on which model families and releases are tested, and uncertainty in extrapolation methods used for multi-year projections.
Sample: Over 3,000 text-based tasks derived from the U.S. Department of Labor O*NET taxonomy and more than 17,000 human evaluations by workers from those jobs assessing LLM responses across multiple model versions/time points (preliminary/ongoing data covering 2024-Q2 through 2025-Q3 for measured trends and extrapolated forward to 2029).
Themes: productivity, labor_markets
Generalizability:
- Limited to text-addressable tasks; excludes non-text, multimodal, or domain-specific technical tasks (e.g., advanced coding, image/video work).
- O*NET-based tasks reflect U.S. occupational taxonomy and may not represent international job content.
- Human evaluator sample and rubric may introduce subjective bias; inter-rater reliability not reported.
- Performance on isolated tasks may not capture full job complexity, context-switching, or integrative tasks.
- Projections to 2029 depend on continued trends and do not account for future model architecture shifts, regulation, or adoption barriers.
- Does not measure firms' adoption rates or direct labor-market outcomes, so economic implications are speculative.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
AI automation is a continuum between (i) crashing waves where AI capabilities surge abruptly over small sets of tasks, and (ii) rising tides where the increase in AI capabilities is more continuous and broad-based.	Other	mixed	high	pattern of AI capability change across tasks (crashing waves vs rising tides)	0.03
We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable.	Other	null_result	high	coverage of LLM-addressable tasks (task sample)	n=3000 0.3
The evaluation is based on more than 17,000 evaluations by workers from these jobs.	Other	null_result	high	number of human evaluations	n=17000 0.3
We find little evidence of crashing waves (in contrast to recent work by METR).	Automation Exposure	null_result	high	presence of abrupt concentrated capability surges ('crashing waves')	n=17000 0.18
Substantial evidence that rising tides are the primary form of AI automation.	Automation Exposure	positive	high	breadth and continuity of AI capability improvements across tasks ('rising tides')	n=17000 0.18
AI performance is high and improving rapidly across a wide range of tasks.	Output Quality	positive	high	AI success/performance on tasks (performance level and trend)	n=17000 0.18
In 2024-Q2, AI models successfully complete tasks that take humans approximately 3-4 hours with about a 50% success rate.	Output Quality	positive	high	task success rate for tasks taking humans ~3–4 hours	n=17000 about a 50% success rate for tasks taking humans ~3-4 hours 0.18
AI success rates for those tasks increase to about 65% by 2025-Q3.	Output Quality	positive	high	projected task success rate by 2025-Q3	about 65% success rate 0.03
If recent trends in AI capability growth persist, LLMs will be able to complete most text-related tasks with success rates of, on average, 80%-95% by 2029 at a minimally sufficient quality level.	Output Quality	positive	high	projected average task success rate for most text-related tasks by 2029 (minimally sufficient quality)	80%-95% success rate 0.03
Achieving near-perfect success rates at this minimally sufficient quality level or comparable success rates at superior quality would require several additional years.	Output Quality	positive	high	time-to-reach near-perfect or superior-quality success rates	several additional years (unspecified) 0.03
These AI capability improvements would impact the economy and labor market as organizations adopt AI, which could have a substantially longer timeline.	Employment	mixed	high	impact on economy and labor market (timing and magnitude of effects)	0.03