Evidence (6507 claims)
Adoption: 7395 claims
Productivity: 6507 claims
Governance: 5877 claims
Human-AI Collaboration: 5157 claims
Innovation: 3492 claims
Org Design: 3470 claims
Labor Markets: 3224 claims
Skills & Training: 2608 claims
Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.
Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.
Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).
Standard PayGo degrades substantially under classroom-scale concurrency.
Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.
Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face.
Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).
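The parallel-phase maximum effect described above can be illustrated with a minimal sketch. It assumes a query is served in sequential phases, each of which fans out into concurrent API calls and completes only when its slowest call returns. The function names, the two-phase layout, and the latency values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def phase_latency(call_latencies: Sequence[float]) -> float:
    """A phase of concurrent calls finishes only when its slowest call returns."""
    return max(call_latencies)


def query_latency(phases: Sequence[Sequence[float]]) -> float:
    """Phases run one after another, so the per-phase maxima add up."""
    return sum(phase_latency(phase) for phase in phases)


# Hypothetical per-call latencies in seconds (illustrative only, not measured data):
# a parallel phase with three concurrent calls, then a single follow-up call.
example_phases = [[0.8, 1.2, 2.5], [0.5]]
total_seconds = query_latency(example_phases)
```

In this toy case the single slow call (2.5 s) sets the latency of the whole parallel phase, regardless of how fast the other two calls are. That is why tail latency under concurrency would affect a multi-agent system more than a system that issues one call per query.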
In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.
Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).
Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.
Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).
The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.
Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).
Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.
Theoretical claim / conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.
Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.
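As a rough illustration of the two quantities in this claim, the sketch below computes a correlation between predicted and realized token counts, and a mean signed error whose negative sign indicates underestimation. It assumes a Pearson coefficient, since the excerpt does not say which correlation measure the paper uses. The helper names and any input values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_signed_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average of (predicted - actual); a negative value means underestimation."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)
```

A model could score a moderate correlation and still be biased: the correlation captures whether predictions rank runs in the right order, while the signed error captures whether they are systematically too low.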
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.
Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.
Input tokens rather than output tokens drive the overall cost of agentic tasks.
Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.
Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).
Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes.
Background statement in the paper's introduction; general literature/field observation (no new primary data reported for this claim in the abstract).
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.
Vibe coding (unstructured GenAI-driven coding) promises rapid prototyping but often suffers from architectural drift, limited traceability, and reduced maintainability.
Paper asserts this as a motivating observation and characterizes vibe coding's weaknesses; the abstract frames these as commonly observed problems motivating the Shift-Up approach (no sample size given in abstract).
Every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance.
Findings from the nine-variant ablation study reported in the paper; comparison of variants that add each listed mechanism versus the memory+reflection combination.
There is a stark geopolitical divide between 'AI Core' nations and the Global South; the Global South faces acute risks of 'Digital Dependency' and eroded digital sovereignty.
Cross-study synthesis in the systematic review (2018-2026) identifying geopolitical patterns and risks; abstract does not quantify the number of studies or present empirical effect sizes.
The 'black box' nature of automated systems undermines the democratic social contract and principles of procedural justice, epitomised by the Australian 'Robo-debt' scandal.
Case study material and literature synthesized in the systematic review referencing the Australian Robo-debt case as an exemplar; abstract does not provide primary data or sample sizes.
Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness in volatile-demand, uncertain-supply industries.
Positioning/background statement in the paper motivating the integrated framework (literature-based claim).
The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable: each decision the agent makes is recorded directly in cells that belong to and reflect on the user.
Conceptual / domain-specific argument made by the authors (no empirical sample attached to the claim).
AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution: by the time users receive the output, all underlying decisions have already been made without their involvement.
Author assertion / conceptual description in the paper (no empirical quantification provided for this general statement).
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.
Author assertion / literature-level observation presented in the paper (no empirical sample reported for this claim).
Selective forgetting remains underexplored compared to retention in LLM agent memory research.
Authors' literature survey / position statement in paper (assertion made in abstract).
Beyond technical barriers there are organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities.
Interview data (over 30 interviews) reporting organizational challenges including limited AI literacy, diverse cultural attitudes across organizations, and lagging governance relative to agentic AI capabilities.
Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains.
Stakeholder interviews (over 30) reporting barriers to deployment; qualitative synthesis identifies data fragmentation, security/regulatory requirements, and legacy toolchain access as primary constraints.
Users push back against agent outputs (through corrections, failure reports, and interruptions) in 44% of all turns.
Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.
Agent-written code introduces more security vulnerabilities than code authored by humans.
Comparative analysis of security vulnerabilities attributed to agent-authored code versus human-authored code within the SWE-chat dataset (method details not specified in excerpt).
Just 44% of all agent-produced code survives into user commits.
Empirical measurement of code provenance and survival within the SWE-chat dataset: proportion of agent-produced code that becomes part of subsequent user commits across sessions.
Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.
Authors' summary claim supported by dataset-derived metrics such as agent code survival rate (44%) and user pushback (44% of turns); observational analysis of SWE-chat.
Regulated deployment imposes four load-bearing systems properties — deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale — and stateful architectures violate them by construction.
Conceptual/architectural argument presented in the paper (theoretical analysis), not an empirical measurement in the abstract.
The policy and research challenge posed by platform-mediated automation is not merely job quantity (technological unemployment) but institutional continuity — how societies reproduce practical competence when platforms optimize for efficiency rather than formation.
Normative and conceptual claim developed through literature synthesis (institutional economics, platform governance, workforce development); presented as an analytical reframing rather than an empirically tested hypothesis.
Entry-level roles have historically functioned as apprenticeships in which workers acquire tacit knowledge and critical judgment; if platforms curtail these formative occupational layers, organizations may lack future workers capable of exercising contextual reasoning required to manage complex systems.
Institutional economics and workforce development literature cited in the paper; conceptual synthesis without original empirical measurement reported.
Platform-mediated automation risks hollowing out labor structures from both directions: eroding repetitive, junior roles from below and automating supervisory coordination functions from above.
Theoretical argument synthesizing institutional economics and platform literature; articulated as a conceptual risk rather than demonstrated with original empirical data.
Algorithmic systems are displacing routine tasks across both low-wage entry-level work and middle-management functions.
Stated in paper's argumentation; supported by a literature-based review drawing on platform governance literature and recent research on AI-enhanced automation (no original empirical sample or quantitative study reported).
The observed negative OPM effect is consistent with short-term 'J-curve' transition costs (process redesign and capability buildup) during early AI adoption.
Interpretation of empirical patterns (short-term decline in OPM concurrent with no ROA change) offered by the authors as an explanatory mechanism; not presented as separately estimated or experimentally tested.
AI adoption had a significantly negative impact on the operating profit margin (OPM).
Causal analysis of KOSDAQ-listed companies (2018–2025) with AI-adoption timing identified via multi-step, contextually validated text analysis of DART business reports; endogeneity addressed using two-way fixed effects (TWFE) and Propensity Score Matching (PSM).
An alternative specification that makes different choices about the timing of the pervasiveness of AI yields less robust results, though it also suggests that AI is labor saving.
Reported sensitivity analysis / alternative empirical specification in the paper; authors state the alternative yields less robust results but still indicates labor-saving effects.
Our baseline model finds evidence that AI is input saving.
Outcome reported from the baseline empirical specification indicating reductions in inputs associated with AI (authors' baseline model results).
The infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it.
Authors' claim in the paper framing the research gap; presented as observational/argumentative (no empirical audit reported).
Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user.
Statement in paper's introduction/positioning; conceptual survey-style claim (no empirical study or systematic benchmark reported).
Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations.
Paper asserts that existing/standard benchmarks do not adequately isolate parsing and computation-orchestration abilities, motivating the new benchmark.
As multimodal AI achieves human-parity understanding of speech and gesture, [the keyboard's] necessity dissolves.
Theoretical claim supported by multidisciplinary review (history, neuroscience, technology, organizational studies); no quantified empirical test reported.
General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs.
Conceptual/argumentative claim stated in the paper's motivation; no empirical test reported in the abstract.
In the AI condition, the absolute reduction in retest performance was nonsignificant, and the retest performance decrement was larger than in the other condition (i.e., retention decreased more after using Copilot).
Comparison of retest (one-week) performance across conditions reported in results; authors report a nonsignificant reduction and larger decrement for the AI/Copilot condition (n=22).
Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment.
Authors' characterization of industry practice and limitations (assertion in paper; no empirical sample size reported in abstract).
Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations.
Statement in paper framing the problem; based on authors' characterization of current operational practice (no empirical sample size reported in abstract).
The paper identifies governance challenges such as accountability gaps, digital sovereignty risks, ethical pluralism, and strategic weaponization arising from embedding AI in diplomatic practice.
Conceptual and normative analysis section of the paper outlining risks and governance challenges; illustrated by examples and argumentation.
Thin training coverage fosters anxiety about substitution and slows diffusion of AI tools.
Reported associations from surveys of mid-level managers and technical staff, interviews, and document analysis across cases; thematic coding identified links between limited training, worker anxiety, and slower diffusion. (Sample size not reported.)
Upstream textile SMEs frequently exhibit constrained supply chain resilience owing to persistent information latency and structural dependence on downstream orders.
Background/contextual claim stated in paper (motivation for study); no specific quantitative test reported in abstract.
The pharmaceutical R&D process is persistently challenged by high financial costs, protracted timelines, and remarkably low success rates.
Background statement in the review synthesizing prior literature and field knowledge; no original empirical data or sample sizes reported in the provided text.