Evidence (13870 claims)
Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
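The matrix can be read row by row as shares of the total. The sketch below is a minimal illustration using three rows copied from the table; the function name `summarize` is hypothetical. It also reports how many claims in a row fall outside the four direction columns, since for some rows the four columns sum to less than the stated total (the page does not explain the difference).

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
# Treating a blank (dash) cell as 0 would be an assumption; these rows have none.
ROWS = {
    "Job Displacement": (12, 80, 20, 1, 113),
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Other": (749, 196, 98, 892, 1984),
}


def summarize(pos, neg, mixed, null, total):
    """Return the positive share of the total and the count outside the four columns."""
    classified = pos + neg + mixed + null
    return {
        "positive_share": pos / total,
        "unclassified": total - classified,
    }


for name, row in ROWS.items():
    s = summarize(*row)
    print(f"{name}: {s['positive_share']:.1%} positive, "
          f"{s['unclassified']} claims outside the four direction columns")
```
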
All artifacts associated with this study are publicly available at https://zenodo.org/records/18489222.
Statement in the paper providing a Zenodo link to artifacts.
This review identifies key research gaps and provides recommendations for future research and practice.
Authors' discussion and conclusion sections synthesizing gaps and offering recommendations based on the mapping results.
Satisfaction, Performance, and Efficiency are the most frequently investigated SPACE dimensions, whereas Communication and Activity remain underexplored.
Frequency counts and synthesis across the 39 included studies mapped to SPACE dimensions as reported by the authors.
Only 15% of the reviewed studies extend beyond three SPACE dimensions.
Authors' coding of included studies against the SPACE framework with reported proportion.
90% of the reviewed studies adopt a multi-dimensional perspective by examining at least two SPACE dimensions.
Authors' coding of included studies against the SPACE framework, yielding the reported proportion.
This paper is a systematic review and mapping of 39 peer-reviewed studies published between January 2014 and December 2024 that examine the impact of LLM-assistants on software developer productivity.
Authors conducted a systematic review and mapping exercise covering peer-reviewed studies within the stated date range; the paper reports the count of included studies as 39.
Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles.
Agent activity traces showing sequential decision counts per agent (trace-level telemetry).
The system consumed roughly 70B inference tokens across the deployment.
API/inference telemetry reporting total token usage.
More than 5,000 ETH was deployed by agents during the experiment.
Accounting of ETH held/deployed by agent-controlled vaults during deployment.
Agents executed about $20M in trading volume over the deployment.
Aggregate trading-volume accounting from the bounded onchain market during deployment.
The deployment produced roughly 300K onchain actions.
Onchain transaction logs aggregated over the deployment.
The system produced 7.5M agent invocations during the deployment.
System invocation logs reporting total agent calls across the deployment.
DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market.
Deployment logs and system telemetry from a 21-day field deployment reporting the number of user-funded agents.
Fears of AI automation do not primarily increase support for traditional interventions such as unemployment benefits and training programs.
Comparative analysis of policy preference responses in the 2024 OECD 'Risks that Matter' survey as reported in the paper.
Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02.
Correlation coefficients between stage-level metrics as reported in the benchmark; the excerpt does not specify whether they are Pearson or Spearman correlations.
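As an illustration of how such cross-stage correlations could be computed, here is a minimal sketch that assumes Pearson's r (the excerpt does not say which method was used). The stage scores are made-up placeholder values, not data from the paper, and `pearson_r` is a hypothetical helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-document scores for each pipeline stage (illustrative only).
scores = {
    "parsing": [0.91, 0.85, 0.78, 0.96, 0.88, 0.70],
    "retrieval": [0.62, 0.71, 0.55, 0.60, 0.80, 0.66],
    "generation": [0.74, 0.69, 0.81, 0.77, 0.72, 0.70],
}

for a, b in [("parsing", "retrieval"), ("parsing", "generation"), ("retrieval", "generation")]:
    print(f"{a}->{b}: r = {pearson_r(scores[a], scores[b]):.2f}")
```
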
We evaluate SecMate in a controlled study with 144 participants and 711 conversations.
Reported experimental study sample and conversation counts in the paper.
Given the limited sample size, the results should be interpreted as exploratory.
Authors explicitly note limited sample size (20 decks) and label findings exploratory.
Reliability (stability across repeated runs) varies substantially across models, with ICC values ranging from 0.240 to 0.930.
Paper reports intraclass correlation coefficient (ICC) analysis of model output reliability across runs, giving a range of ICC values from 0.240 to 0.930.
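For readers unfamiliar with the intraclass correlation coefficient, the sketch below computes one common variant, the one-way random-effects ICC(1,1), from a matrix of repeated scores (rows are rated items, columns are repeated runs). The excerpt does not say which ICC variant the paper used, so this choice, the function name `icc_oneway`, and the example scores are all assumptions.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: list of rows, one row per rated item, each row holding
    k repeated scores for that item (k must be the same for every row).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 items with the same number (>= 2) of repeats")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denom


# Hypothetical scores: 4 items, each scored in 5 repeated runs.
example = [
    [7.0, 7.5, 7.0, 6.5, 7.0],
    [4.0, 4.5, 5.0, 4.0, 4.5],
    [8.5, 8.0, 9.0, 8.5, 8.0],
    [5.5, 6.0, 5.0, 6.5, 5.5],
]
print(f"ICC(1,1) = {icc_oneway(example):.3f}")
```

Values near 1 indicate that repeated runs agree closely relative to the spread between items; values near 0 or below indicate that run-to-run noise dominates.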
To account for stochastic variation in outputs, each model pair was evaluated five times under identical conditions.
Paper states that each model pair was run five times under identical conditions to distinguish one-off variation from persistent tendencies.
Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages.
Paper reports a controlled simulation design in which each model assessed the same sample of 20 real pitch decks.
The study used three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2.
Explicit statement in the paper describing the experimental subjects: three named LLMs were evaluated.
The paper develops a typology of enterprise applications by their sensitivity to AI-induced shifts in make-or-buy economics.
Paper's stated contribution (conceptual typology based on analysis of application categories and AI sensitivity).
This paper adopts a conceptual research approach, combining transaction cost economics and the resource-based view with an assessment of current AI capabilities, to systematically re-evaluate the factors underlying the make-or-buy decision.
Paper's stated methodology and theoretical framing (methodological claim about the paper itself).
Empirically, the decomposition eliminates evidence of speculation in the 2020-2025 AI rally.
Empirical application of the proposed decomposition and bubble test to asset price data covering the 2020–2025 period associated with the AI rally (data analysis reported in the paper).
At this stage, AI adoption in Israel does not result in widespread layoffs; its primary impact lies in restructuring the labor market through a slowdown in recruitment, changes in job composition, and the emergence of new AI-related roles.
Empirical claim reported in the paper; the excerpt does not specify datasets, time periods, or sample sizes supporting this observation.
Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary.
Model architecture description in the paper specifying a 2-layer GCN encoder, twin critics, and a value network used for adversary control.
The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update.
Algorithmic description in the paper detailing the robust backup solver components (Kantorovich-Rubinstein dual, projected subgradient, primal-dual update).
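For orientation, the standard textbook forms behind these components can be written as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the symbols $V$ (value function), $\hat{P}$ (nominal next-state distribution), $\rho$ (ambiguity radius), $d$ (ground metric), $\lambda$ (dual variable), and $\eta_t$ (step size) are introduced here for illustration, and the second identity holds under mild regularity conditions.

```latex
% Kantorovich-Rubinstein duality for the Wasserstein-1 distance with ground metric d
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \Big( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \Big)

% Lagrangian dual of the worst-case expectation over a W_1 ball of radius \rho
\inf_{Q:\, W_1(Q, \hat{P}) \le \rho} \mathbb{E}_{Q}[V]
  \;=\; \sup_{\lambda \ge 0} \Big\{ -\lambda \rho
  + \mathbb{E}_{x \sim \hat{P}} \Big[ \inf_{y} \big( V(y) + \lambda\, d(x, y) \big) \Big] \Big\}

% Projected subgradient ascent on \lambda, with y^*_{\lambda}(x) a minimizer of the inner problem
\lambda_{t+1} \;=\; \max\!\Big( 0,\; \lambda_t + \eta_t
  \big( \mathbb{E}_{x \sim \hat{P}} \big[ d\big(x, y^{*}_{\lambda_t}(x)\big) \big] - \rho \big) \Big)
```

A direct consequence of the first identity is that, for a value function that is $L$-Lipschitz with respect to $d$, the worst-case expectation is bounded below by $\mathbb{E}_{\hat{P}}[V] - \rho L$.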
To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations.
Methodological description of a robust training objective: SAC optimized under a Wasserstein-1 ambiguity set using a graph-aligned Mahalanobis metric to encode spatial correlations.
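One way a Mahalanobis metric can be aligned with a graph is to build its matrix from the graph Laplacian. The sketch below assumes M = I + alpha * L, which is an illustrative choice and not necessarily the construction used in the paper; all names here are hypothetical. Under this choice, a perturbation that changes neighboring cells in the same direction is cheaper than one that changes them in opposite directions.

```python
from math import sqrt


def graph_laplacian(n, edges):
    """Unweighted graph Laplacian (degree matrix minus adjacency) as nested lists."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    return lap


def metric_matrix(n, edges, alpha=1.0):
    """Assumed construction: M = I + alpha * L (positive definite for alpha >= 0)."""
    lap = graph_laplacian(n, edges)
    return [
        [(1.0 if i == j else 0.0) + alpha * lap[i][j] for j in range(n)]
        for i in range(n)
    ]


def mahalanobis(x, y, m):
    """d_M(x, y) = sqrt((x - y)^T M (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    quad = sum(diff[i] * m[i][j] * diff[j] for i in range(n) for j in range(n))
    return sqrt(max(quad, 0.0))  # clamp tiny negative rounding errors


# Three cells on a path graph 0 - 1 - 2 (a stand-in for adjacent grid cells).
M = metric_matrix(3, [(0, 1), (1, 2)], alpha=1.0)
base = [0.0, 0.0, 0.0]
smooth_shift = [1.0, 1.0, 1.0]   # neighbors move together
rough_shift = [1.0, -1.0, 1.0]   # neighbors move in opposite directions

print("smooth:", mahalanobis(base, smooth_shift, M))
print("rough: ", mahalanobis(base, rough_shift, M))
```
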
These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints.
Method/algorithm description in the paper: a rolling MILP projection component implemented to enforce physical constraints (state-of-charge, charger port limits, feeder limits) at each decision step.
The policy learns over high-level intentions produced by a masked, temperature-annealed actor.
Method/algorithm description in the paper describing the actor design (masked, temperature-annealed) and the high-level intentions used for policy learning.
We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with variable action durations and mixed actions: discrete actions for serving, repositioning, and charging, together with continuous charging power.
Methodological description in the paper presenting the model formulation (hex-grid semi-MDP) and action space design; no external dataset required.
The analysis employs rigorous econometric methods including difference-in-differences estimation and propensity score matching to control for confounding variables across industry (NAICS 2-digit), firm size, geographic location, occupation-level characteristics, and macroeconomic conditions.
Methodological description in the paper specifying DiD and propensity score matching and listed covariates/controls.
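The core of a difference-in-differences estimate is a comparison of before/after changes between treated and control groups. The sketch below shows only that basic 2x2 calculation on made-up numbers; it omits the covariates, fixed effects, and propensity score matching the paper describes, and the function name `did_estimate` is hypothetical.

```python
from statistics import mean


def did_estimate(records):
    """Basic 2x2 difference-in-differences on group means.

    records: iterable of (treated, post, outcome) tuples, where `treated`
    and `post` are booleans and `outcome` is a number.
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, outcome in records:
        cells[(bool(treated), bool(post))].append(outcome)
    if any(not values for values in cells.values()):
        raise ValueError("every treated/post combination needs at least one record")

    m = {key: mean(values) for key, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


# Hypothetical outcomes (illustrative only, not data from the paper).
records = [
    (True, False, 10.0), (True, False, 12.0),    # treated, before
    (True, True, 15.0), (True, True, 17.0),      # treated, after
    (False, False, 9.0), (False, False, 11.0),   # control, before
    (False, True, 11.0), (False, True, 13.0),    # control, after
]
print("DiD estimate:", did_estimate(records))
```

The estimate is interpretable as a treatment effect only under the parallel-trends assumption, that is, that both groups would have changed by the same amount absent treatment.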
The study uses U.S. Census Bureau Business Trends and Outlook Survey data tracking over 1.2 million businesses.
Paper statement that it incorporates the Census Bureau Business Trends and Outlook Survey covering >1,200,000 businesses.
The analysis integrates the Anthropic Economic Index capturing approximately one million AI usage interactions.
Paper statement that the Anthropic Economic Index was used and captures ~1,000,000 AI usage interactions.
We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions.
Dataset and experimental log statistics reported in the paper.
We run AI-only games and conduct a user study pitting human players against AI opponents.
Method statement in the paper describing experiments with both AI-only and human-vs-AI games.
Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players' short-term interests align and diverge.
Specification of game mechanics and rules in the paper (design features of C2C).
We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective.
Description of a newly developed environment (paper introduces the game and its rules/design).
Overall, robot exposure is only weakly related to job-quality outcomes once controls and fixed effects are included.
Individual-level data from the European Working Conditions Telephone Survey (EWCTS) 2021 merged with country–industry robot exposure measures from International Federation of Robotics (IFR) statistics; weighted logistic regression models including individual and job controls and country and industry fixed effects.
There is no decrease in coding skills among new hires associated with GHC adoption.
Comparison of coding-skill indicators on LinkedIn profiles for new hires at GHC-adopting firms versus non-adopting firms; finding of no measurable decline in coding-skill measures.
Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time.
Clinical utility evaluation reports that inter-rater agreement was comparable between semantic-search-assisted abstraction and clinician-performed chart review.
The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset.
Methods: experiment described as optimization of embedding model and chunking using a physician-authored benchmark dataset.
The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework.
Methods description of system architecture and governance provided in the paper.
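The division of labor described here (vectors in one store, text and metadata in a separate key-value store) can be sketched in a few lines. Everything below is a toy stand-in: `toy_embed` is a hashed bag-of-words function, not the qwen3 embedding model; the in-memory list and dict replace the managed vector database and key-value store; and the chunk sizes and note texts are invented for illustration.

```python
import zlib
from math import sqrt

DIM = 256  # toy embedding width; an arbitrary choice for this sketch


def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text, size=8, overlap=2):
    """Split text into overlapping windows of `size` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


vector_index = []  # (chunk_id, vector): stand-in for the vector database
kv_store = {}      # chunk_id -> text and metadata: stand-in for the key-value store


def index_note(note_id, text):
    """Chunk a note, embed each chunk, and store vector and metadata separately."""
    for n, piece in enumerate(chunk(text)):
        chunk_id = f"{note_id}#{n}"
        vector_index.append((chunk_id, toy_embed(piece)))
        kv_store[chunk_id] = {"note_id": note_id, "text": piece}


def search(query, k=3):
    """Brute-force cosine search (vectors are unit length, so dot product suffices)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), cid) for cid, vec in vector_index]
    scored.sort(reverse=True)
    return [(score, kv_store[cid]) for score, cid in scored[:k]]


index_note("note-A", "patient has fever and cough")
index_note("note-B", "fracture of left wrist after fall")
for score, meta in search("fever cough", k=2):
    print(f"{score:.2f}", meta["note_id"], meta["text"])
```

A production system would replace the brute-force scan with an approximate nearest-neighbor index; the scan is used here only to keep the sketch self-contained.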
We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients.
Paper reports a production deployment at a large children's hospital and gives exact index counts: 166 million clinical notes, 484 million vectors, 1.68 million patients.
We develop an analytical model in which a firm jointly chooses AI deployment and cybersecurity investment under this governance-capability gap.
Methodological claim: the paper presents an analytical (theoretical) model describing joint choice of deployment and cybersecurity investment.
Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap.
Methodological description in the paper indicating the authors performed a systematic sensitivity analysis across environmental parameters (resource scarcity and temporal dominance) to measure performance differences between training modalities.
The central conclusion of this report is that foundational research on AI identity is needed.
Authors' stated conclusion of the paper.
We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment.
Conceptual definition presented by the authors (conceptual/terminological contribution rather than empirical evidence).
The sign reversal is a structural consequence of the reviewer effort collapse under log-concave quality distributions; this is proved analytically.
Formal analytical proofs in the paper that use the assumption of log-concave quality distributions to show the mechanism producing the sign reversal.
We formalize the distinction between compensatory and non-compensatory decision regimes and define a pre-execution legitimacy boundary.
Theoretical formalization presented in the paper (definitions and conceptual framework). No empirical evidence or sample size provided.