Evidence (6507 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity (filtered view)
Organizational resistance to technological change hinders AI adoption in logistics operations.
Qualitative synthesis of 31 reviewed publications identifying organizational and cultural barriers to AI uptake.
Data security concerns are a key barrier to adopting AI in global supply chains.
Synthesis of themes from 31 scholarly sources in the structured literature review highlighting data/security-related implementation issues.
High initial investment costs are a significant barrier to AI implementation in logistics.
Synthesis of literature (31 sources) reporting implementation challenges and barriers identified across studies.
Existing coordination approaches often occupy two extremes: highly structured methods that rely on fixed roles/pipelines assigned a priori, and fully unstructured teams that enable adaptability but suffer inefficiencies like error propagation, inter-agent conflicts, and wasted resources.
Framing/background claim made in the paper (conceptual argument motivating LATTE).
The price-setter for cognitive labor is no longer the labor market.
Central normative/conceptual claim of the paper, supported by the analytical model and the CAW bound: the authors argue that the compute capital market (through the rental price of compute) sets the effective price for cognitive labor. Stated as the paper's concise position; based on theoretical derivation rather than empirical evidence.
Compute-Anchored Wage (CAW) bound: on tasks where human and agent cognitive labor are substitutes, the competitive human wage is bounded above by λ · k · r_c (where r_c is the rental rate of compute capital, k is the compute intensity of one effective agent-labor unit, and λ is the relative human-to-agent productivity).
Formal analytical result presented in the paper (mathematical derivation within the factor-pricing model). This is a theoretical bound derived from the model rather than an empirical estimate.
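The CAW bound is a one-line calculation; the sketch below shows how the three quantities combine. All parameter values are hypothetical, chosen only to illustrate the arithmetic — the paper reports no empirical estimates.

```python
# Illustrative sketch of the Compute-Anchored Wage (CAW) bound stated above:
# w_human <= lambda * k * r_c. All parameter values here are hypothetical.

def caw_bound(lam: float, k: float, r_c: float) -> float:
    """Upper bound on the competitive human wage per unit of cognitive labor.

    lam -- relative human-to-agent productivity (lambda)
    k   -- compute intensity of one effective agent-labor unit
    r_c -- rental rate of compute capital
    """
    return lam * k * r_c

# Hypothetical: humans are 1.2x as productive as an agent on the task, one
# agent-labor unit consumes 50 compute units, and compute rents at $0.10/unit.
bound = caw_bound(lam=1.2, k=50.0, r_c=0.10)
print(f"wage bound: ${bound:.2f}")  # → wage bound: $6.00
```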
Once agents are recognized as a production technology, the elastic-supply margin that anchors the equilibrium wage migrates from the labor market to the compute capital market.
Analytical derivation using a textbook factor-pricing framework (citing Mankiw 2020) within the paper's theoretical model; the derivation and verbal argument link supply-elasticity margins to the compute capital market. No empirical data reported in the excerpt.
The reform reduces industrial wastewater discharge, which improves agricultural production conditions (mechanism linking the reform to higher grain yield).
Mechanism analysis in the paper reporting reductions in industrial wastewater discharge following the reform (mediation channel analysis).
From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression.
Theoretical / information-theoretic analysis in the paper linking observed dynamics to entropy reduction and information bottleneck concepts.
Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium.
Computational simulation reported in the paper (described as a 'simple simulation'); no sample size or experimental dataset reported in the provided text.
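The paper's simulation is only described as "simple," so the toy model below is an assumption-laden stand-in, not the authors' code: a recommendation loop amplifies already-popular options, a `reliance` parameter controls how much choices follow that feedback versus independent uniform choice, and Shannon entropy of the final shares measures diversity.

```python
# Toy stand-in for the paper's "simple simulation" (the authors' actual model
# is not specified in this excerpt; everything below is an assumption).
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def simulate(reliance: float, steps: int = 100) -> float:
    p = [0.24, 0.22, 0.20, 0.18, 0.16]  # slightly non-uniform start
    n = len(p)
    for _ in range(steps):
        amp = [q * q for q in p]  # feedback amplifies popular options
        s = sum(amp)
        amp = [a / s for a in amp]
        # mix independent uniform choice with the AI recommendation
        p = [(1 - reliance) / n + reliance * a for a in amp]
    return entropy(p)

low_reliance = simulate(0.10)   # stays near uniform (entropy ~ ln 5)
high_reliance = simulate(0.95)  # collapses onto one dominant option
print(low_reliance, high_reliance)
```

Under low reliance the uniform mix dominates and diversity is preserved; under high reliance the amplification term makes the uniform equilibrium unstable and the system settles into the low-diversity state the claim describes.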
Tabular data does not have a foundation model that understands it natively; every approach to tabular AI today (from gradient-boosted trees to the latest tabular foundation models) requires a preprocessing pipeline before any model can consume the data.
Paper's survey/positioning statement asserting the current state of tabular AI approaches and their reliance on preprocessing pipelines (no specific empirical dataset given).
With strong exposure of low-wealth, high-MPC households and concentrated ownership, privately chosen automation can be excessive even though it raises high-skilled labor income.
Theoretical welfare/comparison analyses in the model with heterogeneous households (differing in wealth and marginal propensities to consume) and ownership concentration; shows private incentives lead to automation choices that are suboptimal from a social perspective under these parameter constellations.
Automation reduces paid human labor.
Model comparative statics in the same equilibrium framework showing substitution away from paid human labor as firms choose automation; result reported in the paper's static benchmark and general-equilibrium analysis.
Experimental results show that current agents remain far from reliable workspace learning.
Authors' interpretation based on the reported agent performance (best agent 68.7% vs. human 80.7%; average across agents 47.4%).
The average performance across evaluated agents is only 47.4%.
Reported mean performance across agents in the experiments (authors' aggregated result).
The best-performing agent reaches only 68.7% on the benchmark.
Experimental results reported by the authors (evaluation across tasks/rubrics).
AI development may reduce firms' labor income share.
Further analysis reported in the paper linking firm-level AI development to reductions in the labor income share within firms.
AI increases the firm-level skill premium by substituting for low-skilled labor.
Mechanism analysis reported in the paper (firm-level regressions investigating labor composition / substitution effects following AI development).
WIOA is not well-equipped to support large-scale, cross-industry labor transitions.
Low observed incidence of cross-industry occupational transitions and limited shifts into less automation-exposed occupations in the WIOA data (2017-2023) lead authors to conclude the program is poorly suited for large-scale cross-industry reallocation.
A substantial portion of WIOA participants simply return to their prior field after program participation.
Descriptive and outcome analyses on the WIOA participation records (2017-2023) showing many participants re-enter the same occupation/industry rather than transitioning to different occupations.
WIOA rarely shifts workers into less automation-exposed work.
Analysis of WIOA administrative records (2017-2023) using a newly introduced 'Retrainability Index' that decomposes outcomes into post-intervention wage recovery and shifts in routine task intensity (RTI). The paper reports low incidence of downward RTI (movement into less automation-exposed occupations) among participants.
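As a hedged illustration of the decomposition described above — the paper's actual Retrainability Index formula is not given in this excerpt — the two stated components can be computed per participant roughly as follows; the function and its inputs are hypothetical.

```python
# Hypothetical sketch of the two components named above: post-program wage
# recovery and the change in routine task intensity (RTI), where a negative
# delta means movement into less automation-exposed work. Not the paper's
# actual index formula.

def retrainability_components(wage_before, wage_after, rti_before, rti_after):
    """Return (wage_recovery, delta_rti) for one participant."""
    wage_recovery = wage_after / wage_before  # 1.0 = full wage recovery
    delta_rti = rti_after - rti_before        # < 0 = less routine work
    return wage_recovery, delta_rti

# Hypothetical participant: wage recovers to 95% of the pre-program level,
# RTI unchanged -- the pattern the paper reports as common (returning to
# similarly automation-exposed work).
recovery, d_rti = retrainability_components(50_000, 47_500, 0.8, 0.8)
print(recovery, d_rti)  # → 0.95 0.0
```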
Mechanism tests indicate innovation stagnation in mature firms with redundant AI is a pathway that limits productivity gains (i.e., AI can be associated with stagnant innovation in mature firms).
Mechanism analysis reported in the paper showing signs of reduced innovation-related gains or stagnation in mature, advanced firms using AI (interpreted as redundant AI leading to limited incremental innovation).
AI integration creates challenges such as workforce displacement that must be addressed.
Authors raise workforce displacement as a challenge/consideration in the paper's discussion; this appears as a qualitative claim rather than an empirically quantified result in the supplied text.
AI integration creates challenges such as algorithmic bias that must be addressed.
Authors identify algorithmic bias as a notable challenge in the discussion/conclusion; presented qualitatively rather than as an estimated empirical outcome in the supplied text.
Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).
Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named creative/interpersonal occupations showing opposite rankings.
Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.
Conceptual critique and comparison of existing AI-exposure indices vs. the authors' proposed learnability-focused approach (paper text argument and empirical comparisons implied later).
A full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.
Experimental intervention with full transparency of information between agents; authors report that even with full information exchange, dyads fail to reach optimal coordination, pointing to interactive grounding processes as the bottleneck.
The oracle baseline establishes that the coordination gap is not attributable to individual reasoning limitations.
Experimental baseline (oracle) in which individual reasoning is isolated and shown to be sufficient for identifying optimal allocations; details/sizes not given in the abstract.
Failures in referential binding occur, where agents lose track of commitments across turns.
Reported failure mode from multi-turn experiments: referential binding breakdowns leading to loss of commitments.
Agents rely on perfunctory fairness (equal resource splits) over reward-maximizing coordination.
Empirical observation from negotiation experiments where agents prefer equal splits rather than allocations that maximize joint reward, as reported in the paper.
Accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable.
Observed failure mode in multi-turn negotiation experiments: agents anchor on initial proposals and fail to revise, as reported by the authors.
Coordination degrades when shared interaction history is absent.
Experimental comparison of settings with and without shared interaction history (ablation showing worse coordination when history is removed).
While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models.
Experimental results comparing single-agent (isolated) performance and paired-agent (dyad) negotiation performance across multiple LLMs (open- and closed-source); specific sample sizes not reported in the abstract.
Current multi-agent LLM benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns.
Literature/benchmark survey claim by the authors (asserted in the paper; no numeric summary provided here).
We establish a Volume-Quality Inverse Law: code volume is a near-perfect predictor of structural degradation.
Empirical finding from the paper's analysis correlating code volume with measures of structural degradation; described as a 'near-perfect predictor'.
There exists a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code.
Multi-scale comparative analysis across models of differing capability showing higher-capability models produce larger (volume) and more highly-coupled code artifacts.
AI does not eliminate software flaws but rather introduces a distinct 'machine signature' of defects in generated code.
Systematic audit (multi-scale analysis) of AI-generated software across single-file algorithmic tasks and complex, agent-generated systems, reporting characteristic defect patterns attributed to machine generation.
The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long-term maintainability.
Framing statement in the paper; argument based on literature/practice that current evaluations emphasize functional correctness rather than maintainability.
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables.
Position asserted in the paper based on literature/benchmark trends and authors' field observations; no original empirical dataset or quantified analysis provided in the paper text excerpt.
Standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
Quantitative analysis reported in the paper comparing detection of the seven failure modes by standard metrics over evaluation cycles.
Standard metrics (ROUGE, BERTScore, accuracy/AUC, and agentic benchmarks such as HELM/MT-Bench/AgentBench/BIG-bench) fail to detect each of the seven production failure modes.
Empirical demonstration reported in the paper comparing standard metrics and agentic benchmarks against the seven failure modes.
The seven failure modes include compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks.
Author-provided list of example failure modes within the taxonomy; grounded in observations described in the paper.
Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings and do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production.
Author statement based on literature/framework review (references to HELM, MT-Bench, AgentBench, BIG-bench) and contrast with production agentic evaluation needs.
Prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users.
Cited prior work / literature claim reported in paper (no specific study details or sample sizes provided in excerpt).
LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control.
Paper's background/related-work motivation (literature summary and framing). No specific empirical data reported in excerpt.
Specification discipline, not model capability, is the binding constraint on AI-assisted software dependability.
Synthesis conclusion by the authors based on the multivocal literature review, telemetry findings, conceptual modeling (PRP/SGM), and the four-month pilot evaluation.
These conflicting findings constitute the Productivity-Reliability Paradox (PRP): a systematic phenomenon emerging from non-deterministic code generators and insufficient specification discipline.
Conceptual synthesis and interpretation by the paper's authors, based on the multivocal literature review, telemetry, and experimental evidence summarized above.
Telemetry across 10,000+ developers shows 91% longer code review times.
Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in review time.
The most rigorous randomized controlled trial (RCT) documents a 19% slowdown for experienced developers.
A single RCT cited in the paper described as the most rigorous trial; result reported as a 19% slowdown for experienced developers. Sample size for the RCT is not provided in the summary statement.
Compound-system-specific operational challenges arise when serving agentic workloads, including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics.
The paper presents a novel analysis and discussion of these challenges and supports them via case studies and operational lessons from the production deployment; no quantitative prevalence metrics or sample sizes are provided in the supplied text.