Evidence (4560 claims)

Claim counts by topic:

- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
Productivity
When employers have monopsony power, they choose technologies that expand this power beyond what a social planner would consider optimal.
Model results on monopsonistic employer incentives and their technological choices; discussion supported by citations.
Profit-maximizing firms pursue innovations that erode workers' market power by making them more easily replaceable, even at the expense of production efficiency; a social planner who values worker welfare would employ technologies that preserve workers' market power.
Theoretical analysis of interactions between technological choice and market power; supported by cited empirical evidence (e.g., Azar et al. 2023) in the paper.
A welfare-maximizing planner would choose to automate fewer tasks than production efficiency would dictate when workers' welfare is heavily weighted.
Model analysis of welfare-maximizing automation level compared to production-efficient automation; analytical result in the automation application.
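The gap between privately and socially optimal automation in the three claims above can be illustrated with a toy comparative-statics sketch. All functional forms below are assumptions chosen for illustration, not the paper's model: automation share `a` yields a concave production gain `g(a)` plus a pure wage transfer from workers to the firm; a planner who puts weight `omega` on worker welfare discounts that transfer and therefore picks a lower automation level, below the production-efficient one when `omega > 1`.

```python
import numpy as np

a = np.linspace(0.0, 1.0, 10001)
g = a - a**2          # production-efficiency gain, maximized at a = 0.5
s = 0.3               # per-task wage transfer from workers to the firm (assumed)

def argmax_a(values):
    """Return the automation level a that maximizes the given objective."""
    return float(a[int(np.argmax(values))])

a_eff  = argmax_a(g)            # production-efficient level (0.5)
a_firm = argmax_a(g + s * a)    # firm also values the wage transfer -> over-automates

for omega in (0.0, 1.0, 2.0):   # planner's weight on worker welfare
    a_plan = argmax_a(g + (1 - omega) * s * a)
    print(f"omega={omega}: planner automates a={a_plan:.2f}")

print(f"production-efficient: {a_eff:.2f}, firm chooses: {a_firm:.2f}")
```

With `omega = 0` the planner mimics the firm (0.65 > 0.5, excess automation); with `omega = 2` the planner automates less (0.35) than production efficiency alone would dictate, matching the qualitative result claimed above.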
Observed declines in browsing time due to ChatGPT adoption are concentrated in website categories such as search and news, which are highly exposed to substitution by generative AI.
Category-level browsing time changes across website classification; concentration of declines in categories identified as highly overlap-exposed to chatbot capabilities using web-scraping and LLM site-level overlap classification.
High-income and younger households adopt generative AI substantially faster than low-income and older counterparts, and this gap is widening over time ('generative AI divide').
Descriptive heterogeneity analysis using Comscore household demographics (income and age bins) and observed adoption trajectories across 2021–2024; authors report widening gap rather than convergence.
Diminishing returns manifest not only as a geometric flattening of the loss curve but also as rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.
Analytical claim in the paper about the implications of diminishing returns for cost pressure and innovation requirements (qualitative; no sample size in excerpt).
Prominent studies predict substantial job displacement due to automation.
Paper asserts this as background, referencing the existence of prominent studies in the literature (no specific citations or sample sizes provided in the abstract).
For organizations of n humans with AI agents, the optimal team size decreases with agent capability.
Derived implication from the stylized model's analysis of multi-human organizations interacting with AI agents.
There is no smooth sublinear regime for human effort; it transitions sharply from O(E) to O(1) with no intermediate scaling class.
Mathematical derivation from a stylized model of human-AI collaboration that assumes tasks decompose into atomic decisions, a fraction ν are novel, and specification/verification/error correction scale with task size.
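The sharp-transition claim above can be made concrete with a minimal sketch, assuming (as the stylized model's evidence note describes) that a task of E atomic decisions contains a fixed fraction nu of novel decisions requiring human attention, plus a constant specification overhead. The linear form below is an illustrative assumption, not the paper's exact derivation.

```python
def human_effort(E, nu, c=10.0):
    """Total human effort for a task of E atomic decisions.

    nu: fraction of decisions that are novel (require human attention)
    c:  fixed specification/verification overhead (assumed constant)
    """
    return nu * E + c

# Any nu > 0 eventually scales linearly in E; only nu = 0 collapses to O(1).
for nu in (0.1, 0.001, 0.0):
    small, large = human_effort(10_000, nu), human_effort(10_000_000, nu)
    print(f"nu={nu}: effort grows by factor {large / small:.1f} as E grows 1000x")
```

The point of the sketch: varying `nu` moves the *constant* in front of E, but never produces an intermediate scaling class like O(sqrt(E)); effort is Θ(E) for every positive `nu` and Θ(1) only at `nu = 0`.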
To date, maintenance and migration work has been done largely manually by human experts.
Background assertion in the paper's introduction/abstract; no empirical backing provided in abstract.
The regime divide deepens under AI capital concentration, admits a permanent displacement attractor in shallow markets, and generates equity market participation hysteresis in which the ERP remains elevated after employment has normalised.
Model-based assertions: analysis shows capital concentration magnifies regime separation, yields a permanent displacement attractor in shallow-market parameterizations, and produces hysteresis in participation leading to persistently elevated ERP after employment recovery.
The alignment risk channel is specific to agentic AI: correlated misalignment in AI objectives generates aggregate output shocks with fat left tails; formalised via Hansen-Sargent multiplier preferences, the resulting alignment risk premium (ARP) enters the equilibrium ERP decomposition as a priced factor additively separable from the participation wedge.
Theoretical formalisation in the paper: uses Hansen-Sargent multiplier preferences to capture model uncertainty/robustness and defines an ARP that is additively separable in the ERP decomposition.
The participation compression channel operates through household wealth: displacement pushes marginal households below the equity market entry cost κ, concentrating aggregate consumption risk on a shrinking investor pool and—by the Basak-Cuoco mechanism—raising the required risk premium even as fundamentals improve.
Model mechanism described in the paper: heterogeneous-agent model with an explicit market entry cost κ and reference to the Basak-Cuoco mechanism leading to a higher required risk premium when investor base shrinks.
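The participation-compression mechanism above can be sketched numerically. In the spirit of limited-participation models such as Basak-Cuoco, if only a fraction `lam` of household wealth bears all aggregate consumption risk, the required premium scales like `gamma * sigma2 / lam`; the specific functional form and parameter values below are illustrative assumptions, not the paper's calibration.

```python
gamma = 3.0     # assumed risk aversion of participating households
sigma2 = 0.02   # assumed variance of aggregate consumption growth

def required_premium(lam):
    """Equity premium when a fraction lam of wealth bears all aggregate risk."""
    return gamma * sigma2 / lam

# Displacement pushes marginal households below the entry cost kappa,
# shrinking the participating pool from, say, 60% to 30% of wealth:
print(f"broad participation (60%):      {required_premium(0.6):.3f}")
print(f"compressed participation (30%): {required_premium(0.3):.3f}")
```

Halving the participating pool doubles the required premium even with fundamentals (`gamma`, `sigma2`) unchanged, which is the qualitative channel the claim describes.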
AI can worsen financial and market performance if it crowds out normal R&D.
Paper's empirical analysis and interpretation linking AI dependence to poorer financial/market performance through displacement of standard R&D activities; presented as a study finding.
High AI dependency disclosed in financial reports does not improve firms' financial health and may even endanger it.
Empirical results drawn from the study's analysis of listed new energy vehicle and automobile manufacturers (2013–2023); statement appears in the paper's findings/conclusions.
AI dependency reduces financial safety for listed new energy vehicle and automobile manufacturers.
Empirical analysis of a sample of listed new energy vehicle and automobile manufacturers covering 2013–2023; the paper reports data analysis showing AI dependency reduces financial safety.
Performance degradation persists even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution.
Experiments comparing unstructured versus structured context provision; structured semantic layers (AST context, import graph resolution) were evaluated and models still degraded with more context.
Models' performance degrades monotonically from diff-only (config_A) to diff+file content (config_B) to full context (config_C) across all 8 models.
Systematic ablation across three frozen context configurations (config_A, config_B, config_C) reported; all 8 evaluated models show monotonic performance decline as more context is provided.
Eight frontier models detect only 15–31% of human-flagged issues on the diff-only configuration (config_A).
Empirical evaluation across 8 models on SWE-PRBench (350 PRs) under the diff-only configuration; reported detection rates of 15–31% relative to human-flagged issues.
There is a growing gap between rapid experimentation with AI tools and limited organizational capability to institutionalize them in everyday workflows.
Argument supported by targeted literature synthesis and review of recent scholarly and institutional sources; no primary empirical sample reported in this paper.
Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 55.0% accuracy on help prediction.
Experimental evaluation reported in the paper comparing eight multimodal models on the Help Prediction task with reported accuracy metric.
Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 44.6% accuracy on behavior state detection.
Experimental evaluation reported in the paper comparing eight multimodal models on the Behavior State Detection task with reported accuracy metric.
Ikema is a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old.
Demographic/descriptive claim reported in the paper's background (likely citing prior surveys or census estimates); the abstract states the ~1,300 speakers figure and age distribution.
The financial planning and investment management profession is undergoing a radical transformation driven by Generative AI (GenAI) and Agentic AI, creating urgent workforce displacement challenges that require coordinated government policy intervention alongside educational reform.
Author assertion in the paper's introduction/abstract; framing argument based on the paper's synthesized analysis (no empirical sample, no reported statistical test).
LLM design agents can fixate on existing paradigms and fail to explore alternatives when solving design challenges, potentially leading to suboptimal solutions (a pathology analogous to human designers).
Literature/background claim and authors' characterization of observed agent behavior; motivated the proposed metacognitive interventions. No numerical sample size reported.
Real estate pro forma development remains one of the most time-intensive functions in property investment, typically requiring twenty to forty hours per multifamily project through manual research, Excel-based modeling, and iterative scenario analysis.
Statement in paper asserting typical industry practice; not tied to the paper's controlled test. No empirical sample size or survey data reported alongside this assertion.
Traditional car-following models, such as the Intelligent Driver Model (IDM), often struggle to generalize across diverse traffic scenarios and typically do not account for fuel efficiency.
Literature-based statement within the paper motivating the study (review of limitations of traditional car-following models). No sample size reported.
Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity).
Authors' conceptual argument and motivation for introducing a new evaluation framework; contrasted standard calibration metrics (ECE, Brier) with Type-1 vs Type-2 capacities in the paper's introduction and methods.
Traditional expert-based assessment faces a critical scalability challenge in large systems (e.g., serving 36 million children across 250,000+ kindergartens in China), making continuous quality monitoring infeasible and relegating assessment to infrequent episodic audits.
Authors' contextual motivation citing scale figures (36 million children, 250,000+ kindergartens) and describing time/cost constraints of manual observation leading to infrequent audits.
Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate).
Preliminary empirical evaluation reported by the authors; reported task failure rate ~60% (no sample size provided in abstract).
The largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video.
Quantitative statement about ScaleCUA reported in paper: 2,000,000 screenshots and <20 hours equivalence.
Progress toward general-purpose CUAs is bottlenecked by the scarcity of continuous, high-quality human demonstration videos.
Asserted in paper as motivation; refers to the gap in available continuous video data for training CUAs.
Reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment.
Introductory problem statement in the paper arguing that large context prompts increase per-token API costs and latency for API-based LLMs; no quantitative study or sample size provided for this claim within the excerpt.
AI-enabled, democratised production is more likely to intensify competition and produce winner-take-most outcomes than to generate broadly distributed entrepreneurial success.
Synthesised theoretical prediction based on the unified framework (attention scarcity + free-entry dilution + superstar/preferential attachment dynamics) developed in the paper; no empirical validation provided.
When the framework is extended to include quality heterogeneity and reinforcement dynamics, equilibrium outcomes exhibit declining average payoffs.
Analytical extension of the baseline formal model to incorporate heterogeneous quality and reinforcement (preferential attachment) dynamics; theoretical derivation in the paper; no empirical sample.
In markets with near-zero marginal costs and free entry, increases in the number of producers dilute average attention and returns per producer.
Formal theoretical model introduced in the paper (Builder Saturation Effect) that assumes near-zero marginal costs, free entry, and finite human attention; no empirical sample or experimental data reported.
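The dilution logic of the Builder Saturation Effect claims above can be sketched in a few lines, under assumed forms: total audience attention A is fixed, n symmetric producers split it evenly and earn revenue proportional to their attention share against a fixed cost f, and free entry continues until profit hits zero. Parameter values are invented for illustration.

```python
A = 1_000_000.0   # total attention units (fixed, finite)
p = 0.01          # revenue per attention unit (assumed)
f = 50.0          # fixed cost per producer (assumed)

def profit(n):
    """Per-producer profit when n symmetric producers split attention A."""
    return p * A / n - f

n_star = p * A / f   # free entry drives profit to zero at this producer count
print(f"equilibrium producers: {n_star:.0f}")
print(f"attention per producer: {A / n_star:.0f} units, profit: {profit(n_star):.2f}")
```

Anything that lowers entry costs (such as AI-enabled production) raises `n_star` and dilutes per-producer attention further, while equilibrium profit stays pinned at zero, which is the "declining average payoffs" result in the preceding claims.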
Agent memories currently remain private and non-transferable because there is no way to validate their value.
Descriptive assertion in the paper about current state of agent memories; no empirical survey or measurement cited.
Insufficient organizational resources significantly inhibit AI adoption in procurement (β = -0.19, p < 0.05).
Questionnaire survey (n=326) analyzed with multiple linear regression; reported coefficient β = -0.19 with p < 0.05.
Measuring only technical model performance (such as predictive accuracy) is insufficient for assessing the strategic impact of AI in drug discovery.
Argued in the paper as a critique of current evaluation practices; presented as a conceptual point rather than supported by new empirical data in the excerpt.
Pressure remains high to increase the probability of success to improve the effectiveness of pharmaceutical R&D.
Asserted in the paper as motivational context for the work; framed as an industry pressure point rather than backed by a specific empirical sample or quantified survey in the excerpt.
Increasing cost and failure rates in the pharmaceutical R&D process have not fundamentally improved over the last decade.
Stated as a contextual observation in the paper's opening paragraph; presented as a summary of industry trends (no specific dataset, sample size, or citation included in the excerpt).
Without support, performance stays stable up to three issues but declines as additional issues increase cognitive load.
Empirical study / human-AI negotiation case study in a property rental scenario that varied the number of negotiated issues; the paper reports observed performance across different numbers of issues (no sample size for this specific comparison stated in the abstract).
Reliance on automated content generation introduces risks of cognitive overreliance, algorithmic bias, and strategic misalignment.
The paper articulates these risks as conceptual/qualitative concerns in its discussion; no quantitative estimates or empirical tests of these specific risks are reported in the provided excerpt.
Wide disagreement among AIs created confusion and undermined appropriate reliance on advice.
Reported experimental finding from the paper: manipulating within-panel disagreement across tasks produced wide disagreement conditions that, according to the abstract, led to confusion and reduced appropriate reliance. No quantitative metrics reported in abstract.
High within-panel consensus fostered overreliance on AI advice.
Experimental manipulation of within-panel consensus across the three tasks; the abstract reports that high consensus increased participants' reliance on AI (interpreted as overreliance). Specific measures and sample size not provided in abstract.
Improvements in AI ('better' AI) amplify the excess automation as well.
Model comparative statics: increased AI capabilities raise private incentives to automate, leading to more displacement than is socially optimal; theoretical analysis only.
More competition amplifies the excess automation (the automation arms race).
Comparative-statics result in the competitive task-based theoretical model showing increased competition raises firms' incentives to automate; no empirical sample.
The resulting loss from excess automation harms both workers and firm owners.
Welfare comparisons from the model showing negative payoff changes for workers (lower wages/less employment) and reduced owner returns when automation is excessive; theoretical analysis, no empirical data.
In a competitive task-based model, demand externalities trap rational firms in an automation arms race, displacing workers well beyond what is collectively optimal.
Formal equilibrium analysis in the paper's theoretical competitive task-based model; comparative statics and welfare analysis (no empirical sample).
Knowing that AI-driven displacement can erode demand is not enough for firms to stop automating.
Analytical result from the paper's competitive task-based model showing firms' incentives do not internalize demand externalities; no empirical sample.
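The arms-race claims above have the structure of a prisoner's dilemma, which a 2x2 sketch can make explicit. The payoffs below are invented for illustration (the paper's model is a continuum task-based equilibrium): automating saves a firm its own labor costs, but each displaced worker also shrinks product demand for both firms.

```python
import itertools

cost_saving = 3.0   # private gain to a firm from its own automation (assumed)
demand_loss = 2.0   # demand erosion imposed on EACH firm per automating firm (assumed)

def payoff(me_automates, rival_automates):
    """Payoff to 'me' given both firms' automation choices."""
    base = 10.0
    externality = demand_loss * (me_automates + rival_automates)
    return base + (cost_saving if me_automates else 0.0) - externality

for mine, rival in itertools.product([False, True], repeat=2):
    print(f"me={mine}, rival={rival}: payoff={payoff(mine, rival):.1f}")
```

Because the private saving (3.0) exceeds the demand loss a firm's own choice inflicts on itself (2.0), automating is a dominant strategy regardless of the rival's choice; yet mutual automation pays 9.0 versus 10.0 for mutual restraint. Knowing the externality exists does not change any single firm's best response, which is the point of the final claim above.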