Evidence (6507 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
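For readers who want to slice the matrix, a minimal Python sketch (rows transcribed from the table above; the "positive share" metric is our illustration, not a figure reported by any source):

```python
# Minimal sketch: share of positive findings among directional claims,
# using a few rows transcribed from the Evidence Matrix above.
# Columns per outcome: (positive, negative, mixed, null).
matrix = {
    "Firm Productivity":    (385, 46, 85, 17),
    "Job Displacement":     (11, 71, 16, 1),
    "Inequality Measures":  (36, 105, 40, 6),
    "Task Completion Time": (134, 18, 6, 5),
}

for outcome, (pos, neg, mixed, null) in matrix.items():
    directional = pos + neg  # claims with a clear direction of finding
    share = pos / directional if directional else float("nan")
    print(f"{outcome}: {share:.0%} of directional claims are positive")
```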
Productivity
The study's findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length.
Authors' reported limitations section citing specific threats to internal validity and measurement (session timing confound, differential attrition across conditions, and grading biases of the LLM used to evaluate documents).
The behavioral scaffolding intervention was associated with substantially lower document production.
Same field experiment (N=388); the behavioral scaffolding required joint AI use within pairs and was compared to unstructured use, with reported reductions in document production in the behavioral condition.
A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use.
Field experiment with 388 employees at a Fortune 500 retailer; random/experimental assignment to scaffolding conditions while all participants had access to the same AI tool; comparison reported between behavioral scaffolding condition and unstructured use.
LLMs lag behind humans in sustaining heterogeneity when divergence is rewarded.
Empirical comparison from the experiment showing humans are better able than LLMs to maintain diverse actions when the payoff structure rewards divergence; stated qualitatively in the abstract without numeric effect sizes or sample sizes.
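To make the payoff structure concrete, a toy sketch of a game that rewards divergence (our illustration of the described setup, not the paper's actual experiment):

```python
# Toy payoff rule rewarding divergence: each agent earns 1 only when
# its action differs from the majority action in the group.
from collections import Counter

def payoffs(actions):
    majority, _ = Counter(actions).most_common(1)[0]
    return [0 if a == majority else 1 for a in actions]

# The two divergent agents ("B", "C") earn 1; the majority earns 0.
print(payoffs(["A", "A", "B", "C"]))  # -> [0, 0, 1, 1]
```

Sustaining heterogeneity under such a rule requires agents to keep their actions spread out over time, which the paper reports humans do better than LLMs.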
Latent-outcome estimation faces a within-study noncomparability challenge: different indicators within a study may have different and possibly nonlinear relationships with the same latent outcome, making them not directly comparable.
Theoretical exposition in the paper describing heterogenous indicator-to-latent mappings and potential nonlinearity; illustrated with examples (no empirical sample size).
Latent-outcome estimation faces a cross-study noncomparability challenge: different measurement systems across studies may cause estimators to target different empirical quantities even when the underlying latent treatment effect is the same.
Conceptual and theoretical argumentation in the paper describing identification issues across studies due to differing measurement systems; supported by examples and discussion (no empirical sample size).
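A stylized formalization of both noncomparability challenges, in our notation rather than the paper's:

```latex
% Each observed indicator j in study s maps the latent outcome theta
% through its own, possibly nonlinear, link g_{sj}:
\[
  Y_{isj} = g_{sj}(\theta_i) + \varepsilon_{isj}, \qquad
  \tau_{sj} = \mathbb{E}\!\left[\, g_{sj}(\theta_i(1)) - g_{sj}(\theta_i(0)) \,\right].
\]
% Within a study, g_{s1} \neq g_{s2} makes \tau_{s1} and \tau_{s2}
% noncomparable; across studies, g_{1j} \neq g_{2j} means estimators
% target different empirical quantities even when the latent effect
% \mathbb{E}[\theta_i(1) - \theta_i(0)] is identical.
```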
Lower survival rates among BDA adopters are driven by greater uncertainty in sales.
Paper states greater uncertainty in sales is an interrelated factor explaining lower survival for BDA adopters, based on empirical analysis of German start-ups.
Lower survival rates among BDA adopters are driven by higher operating costs.
Paper reports that higher operating costs are an interrelated factor explaining lower survival among BDA adopters, based on the same empirical sample of German start-ups.
Start-ups using BDA face lower survival rates.
Empirical comparison of BDA adopters versus non-adopters in a large sample of German start-ups (survival analysis implied by reported outcome).
Enterprise sales organizations are systematically hampered by what this paper terms 'Revenue Friction': the cumulative productivity loss caused by fragmented, human-mediated data entry across disconnected CRM, ERP, and quoting systems.
Statement/definition presented in the paper excerpt. No empirical method, sample size, or quantitative evidence reported in the provided text.
Some of this reduced price is related to reduced input cost contributions, in particular labor and materials costs.
Decomposition/mediation analysis reported in the paper attributing part of the observed price reductions to declines in input cost contributions (labor and materials); exact methods, sample size, and statistical estimates not provided in the excerpt.
AI intensity is associated with lower prices charged to purchasers.
Empirical analysis reported in the paper linking measures of AI intensity to observed output prices (details of data sources, sample size, and specific methods not provided in the excerpt).
Foundation-model usage can increase compute-related emissions.
Conceptual/environmental concern highlighted in the paper about the carbon footprint of heavy model use and persistent storage; no quantified emissions analysis or lifecycle assessment presented.
These systems can cause skill atrophy.
Theoretical risk articulated in the paper that reliance on AI assistance may degrade human skills over time; no longitudinal skill-measurement or experimental evidence provided.
The same foundation-model systems can also intensify surveillance.
Cautionary claim in the paper noting the surveillance risk of durable, queryable traces and integrated tooling; presented as a conceptual risk rather than empirically measured increase in surveillance.
Baseline (non-structured) interactions had 16 of 50 accepted on first pass.
Reported counts in the paper for the baseline group (16 accepted of 50 baseline interactions).
In an observational study of documented interactions across four AI tools (Claude, ChatGPT, Cowork, Codex), incomplete context was associated with 72% of iteration cycles.
Observational study reported in the paper covering interactions across four AI tools; the paper reports the 72% figure.
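The reported counts support only simple summaries; a minimal sketch computing the baseline first-pass acceptance rate, with an illustrative normal-approximation confidence interval (the papers report only the raw figures):

```python
from math import sqrt

# Reported count: 16 of 50 baseline interactions accepted on first pass.
accepted, n = 16, 50
rate = accepted / n  # 0.32

# Normal-approximation 95% CI (our illustration, not a reported statistic).
se = sqrt(rate * (1 - rate) / n)
lo, hi = rate - 1.96 * se, rate + 1.96 * se
print(f"first-pass acceptance: {rate:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```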
Job insecurity emerges as a critical mediating factor influencing employee attitudes and behavioural responses to generative AI, including upskilling intentions and resistance to technological change.
Review-level synthesis identifying job insecurity reported in included studies as mediating relationships between AI adoption and employee attitudes/behaviours (e.g., upskilling, resistance).
Employees express concerns about role displacement (job loss or role changes) associated with generative AI adoption.
Reported across multiple studies included in the review; the review summarises these concerns as part of mixed employee perceptions.
These positive perceptions coexist with employee concerns about skill obsolescence related to generative AI.
Synthesis of studies included in the review documenting worker concerns about skills becoming obsolete due to AI-driven changes.
Income inequality, measured by the Gini index, rises moderately in every scenario we examine, because job losses and gains in wage and capital income have a polarising effect on the income distribution.
Calculation of Gini index across multiple simulated scenarios using the SWITCH-linked distributional analysis; reported in the report.
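For reference, the standard Gini computation the report's distributional analysis relies on, as a minimal sketch (the report itself computes this inside the SWITCH microsimulation, not with this code):

```python
def gini(incomes):
    """Gini index from a list of household disposable incomes (standard formula)."""
    xs = sorted(incomes)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))  # rank-weighted sum
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

# Toy example: a mean-preserving polarising shift raises the Gini.
before = [20, 30, 40, 50, 60]
after  = [15, 25, 40, 55, 65]   # same mean, more spread
print(gini(before), gini(after))  # -> 0.2, 0.26
```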
The largest average losses are experienced by middle and higher income households, for whom job displacement outweighs any wage or capital income gains. Lower income households also lose, but by much less.
Distributional results from microsimulation (SWITCH) applying scenario-led job displacement, wage and capital effects across income groups; reported in the report.
When these effects are combined, we find an average decline in household disposable income as a result of AI adoption.
Combined scenario simulations incorporating job displacement, wage effects and capital income effects linked to the Irish tax-benefit system using SWITCH; result reported in the report's main findings.
These wage gains are not large enough to counterbalance the average fall in income due to job displacement.
Combined simulation results (displacement + wage effects) using scenario assumptions and microsimulation (SWITCH), reported in the report's distributional analysis.
Those most likely to experience this disruption are found in higher income households, where the share of workers transitioning into unemployment is substantially larger than in lower income families.
Microsimulation (SWITCH) linking simulated job displacement scenarios to household income groups; results reported in the report.
In our central scenario — drawn from credible international estimates — around 7 per cent of current jobs could be displaced in the short–medium run.
Scenario simulation based on international estimates of AI exposure/adoption; central scenario reported in the report (linked to SWITCH microsimulation for distributional analysis).
AI tends to place higher earning and highly educated workers at greater risk of disruption, because the occupations most exposed to AI are predominantly in these groups.
Synthesis of international research on occupational exposure to AI and the report's analysis linking exposure to worker characteristics (education and earnings); presented as descriptive finding in the report.
Result 2: When managers are short-termist or worker skill has external value, the decision-maker's optimal policy can produce the augmentation trap, leaving the worker worse off than if AI had never been adopted.
Analytical result from the dynamic model comparing planner/objective variations (short-termist manager or externalities) and showing an outcome labeled the 'augmentation trap'.
Result 1: Even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption.
Analytical result from the dynamic model showing optimal adoption choice can lead to a steady-state where worker productivity is lower than pre-adoption (model-based comparative statics).
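A toy dynamic illustrating Result 1 under assumed parameters (our sketch, not the paper's model): AI adoption boosts output immediately while eroding skill each period, leaving steady-state output below the pre-adoption level.

```python
# Hypothetical parameters: immediate AI boost, per-period skill erosion,
# and a skill floor. Pre-adoption output equals skill = 1.0.
BOOST, EROSION, FLOOR = 0.5, 0.08, 0.3

skill = 1.0
path = []
for t in range(40):
    path.append(skill + BOOST)                    # output while using AI
    skill = max(FLOOR, skill - EROSION * skill)   # skill decays under AI use

print(f"initial output with AI: {path[0]:.2f}  (> 1.00 pre-adoption)")
print(f"steady-state output:    {path[-1]:.2f}  (< 1.00 pre-adoption)")
```

The front-loaded gain (1.50 vs 1.00) is why adoption can be rational even with full anticipation; the steady state (0.80) is the worker ending up less productive than before adoption.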
Experimental evidence shows that sustained use of AI tools can erode the expertise on which productivity gains depend (deskilling).
Statement in paper referencing experimental studies (no specific study, method, or sample size reported in the excerpt).
Claude Sonnet 4.6 achieves only a 33.3% completion rate on ClawBench.
Paper gives a concrete example performance result for Claude Sonnet 4.6 (reported completion percentage on the benchmark).
The authors evaluated 7 frontier models on ClawBench and found that both proprietary and open-source models can complete only a small portion of these tasks.
Paper reports evaluations of 7 models on the ClawBench tasks (empirical evaluation across the benchmark).
Aggressive compression increased total session cost by 67% despite reducing input tokens by 17%, because it shifted interpretive burden to the model's reasoning phase.
Result reported from the controlled experiment comparing log-format conditions; four conditions described but specific number of sessions/replications not provided in the abstract.
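The arithmetic behind this result is worth spelling out; a sketch with hypothetical per-token prices showing how a 17% input-token cut can coexist with a 67% total-cost increase once reasoning-phase output grows:

```python
# All numbers below except the two reported percentages (17% fewer input
# tokens, +67% total cost) are hypothetical illustrations.
IN_PRICE, OUT_PRICE = 3.0, 15.0         # $ per million tokens (assumed)

base_in, base_out = 1_000_000, 150_000
base_cost = base_in / 1e6 * IN_PRICE + base_out / 1e6 * OUT_PRICE

comp_in = base_in * (1 - 0.17)          # reported input-token reduction
target = base_cost * 1.67               # reported total-cost increase
# Reasoning/output tokens implied by the reported cost increase:
comp_out = (target - comp_in / 1e6 * IN_PRICE) / OUT_PRICE * 1e6

print(f"baseline cost ${base_cost:.2f} -> compressed ${target:.2f}")
print(f"reasoning tokens: {base_out:,} -> {comp_out:,.0f}")
```

Because output/reasoning tokens are priced far above input tokens, even a modest shift of interpretive burden to the reasoning phase can dominate the input savings.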
Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall.
Paper reports an evaluation across 17 models and states the maximum overall score observed was below 66%.
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval.
Statement in paper introduction contrasting prior benchmarks' focus on explicit recall with a claimed gap in evaluating implicit (non-declarative) memory; no systematic literature review or quantitative survey reported in the excerpt.
OpenAI o3 achieves only 17% of optimal collective performance.
Experimental measurement of collective performance for OpenAI o3 in the paper's multi-agent setup (value reported in abstract; no sample size provided there).
The study observed errors and limitations in both phases (test generation and refactoring), and manual intervention was necessary at times.
Case study observations reported in the paper describing observed model errors/limitations and instances requiring manual developer intervention.
Current AI coding assistants, such as GitHub Copilot and Amazon CodeWhisperer, emphasize developer speed and convenience, with energy impact not yet a primary focus.
Stated as an observation in the paper; no specific empirical comparison or quantification provided in this excerpt.
Frontend code, replicated across millions of page views, consumes significant energy and contributes directly to digital emissions.
Asserted in paper's introduction; no specific empirical data or sample reported in this excerpt.
We posit that persistence is reduced because AI conditions people to expect immediate answers, denying them the experience of working through challenges on their own.
Authors' proposed psychological mechanism / explanation inferred from observed behavior; presented as a hypothesis rather than directly proven causal mediator.
These negative effects (reduced persistence and impaired unassisted performance) emerge after only brief interactions with AI (approximately 10 minutes).
Experimental manipulation / exposure in RCTs where participants interacted with AI for about 10 minutes and subsequent outcomes were measured.
People are more likely to give up after interacting with AI (increased likelihood of quitting tasks unassisted).
Randomized controlled trials (N = 1,222) measuring rates of task abandonment/giving-up after AI interaction vs. control.
AI assistance impairs unassisted performance: although AI improves short-term performance, people perform significantly worse without AI after interacting with it.
Randomized controlled trials (N = 1,222) comparing performance with and without AI assistance across tasks; causal inference from randomized assignment.
Through a series of randomized controlled trials on human-AI interactions (N = 1,222), we provide causal evidence that AI assistance reduces persistence.
Randomized controlled trials (RCTs) on human-AI interactions with total sample size N = 1,222; persistence measured after AI interaction across tasks.
AI-assisted evaluation reduces variance in research quality.
SEM and regression analyses on OECD panel data report a decrease in variance of research quality measures associated with higher AIRC.
Current research has largely focused on short-horizon tasks on a narrow set of software with limited economic value (e.g., basic e-commerce and OS-configuration tasks).
Narrative literature/field observation reported in paper introduction (no numeric study reported in excerpt).
There is a fundamental gap in current agent capabilities: functional correctness alone is insufficient for design-aware issue resolution, motivating design-aware evaluation beyond functional correctness.
Synthesis of experimental findings: low design-satisfaction despite functional correctness, prevalence of design violations, and only partial improvement from guidance support the conclusion.
Design violations are widespread in agent-produced patches.
Empirical results from experiments on the benchmark showing many patches violate validated design constraints; backed by counts/percentages in evaluation (as summarized in abstract).
Test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying.
Experimental evaluation with state-of-the-art LLM-based agents on the benchmark (reported in paper). Sample implicit: benchmark issues (495) used to evaluate agents; comparison between test pass rates and design-satisfaction measured by verifier.
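A minimal sketch of the two metrics being contrasted, with hypothetical per-patch records standing in for the benchmark's verifier output:

```python
# Hypothetical per-patch records; the benchmark's verifier supplies the
# real booleans. Test-based resolution can overstate quality relative to
# full design satisfaction.
patches = [
    {"tests_pass": True,  "design_ok": True},
    {"tests_pass": True,  "design_ok": False},
    {"tests_pass": True,  "design_ok": False},
    {"tests_pass": False, "design_ok": False},
]

resolved = [p for p in patches if p["tests_pass"]]
satisfying = [p for p in resolved if p["design_ok"]]
print(f"resolved by tests: {len(resolved)}/{len(patches)}")
print(f"design-satisfying among resolved: {len(satisfying)}/{len(resolved)}")
```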
Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks.
Stated as an observation in the paper (abstract); no quantitative evidence, metrics, or comparative analysis provided in the excerpt.