Evidence (8625 claims)
Adoption
8625 claims
Productivity
7686 claims
Governance
6917 claims
Human-AI Collaboration
6574 claims
Org Design
4189 claims
Innovation
4131 claims
Labor Markets
3588 claims
Skills & Training
2985 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
The paper distinguishes physical electricity transmission from digital relocation of electricity-consuming computation.
Conceptual/analytic distinction explicitly stated as a contribution in the paper.
We develop an energy-geography framework for geo-distributed AI inference that models a three-layer architecture of clients, service nodes, and compute nodes, and formulates inference placement as a constrained optimization problem over electricity prices, marginal carbon intensity, power usage effectiveness, compute capacity, network latency, and migration frictions.
Methodological contribution described in the paper: formulation of a modeling/optimization framework and specification of variables considered.
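A minimal sketch of how such a placement problem might be posed, assuming a single workload, one planning window, and a brute-force search over candidate compute nodes. The class, function name, cost form, and carbon price parameter are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from math import inf
from typing import Optional, Sequence


@dataclass
class ComputeNode:
    name: str
    price_usd_per_kwh: float   # electricity price
    carbon_g_per_kwh: float    # marginal carbon intensity
    pue: float                 # power usage effectiveness (facility / IT energy)
    free_capacity: float       # requests the node can absorb in the window
    latency_ms: float          # network latency from the service node


def place_inference(
    nodes: Sequence[ComputeNode],
    demand: float,                   # requests in the planning window
    it_kwh_per_request: float,       # IT energy per request (assumed constant)
    max_latency_ms: float,
    carbon_price_usd_per_g: float,   # hypothetical weight that monetizes carbon
    current: Optional[str] = None,   # node currently serving the workload
    migration_cost_usd: float = 0.0, # one-off friction for moving the workload
) -> Optional[str]:
    """Return the cheapest feasible node name, or None if no node is feasible."""
    best_name, best_cost = None, inf
    for node in nodes:
        # Hard constraints: latency bound and compute capacity.
        if node.latency_ms > max_latency_ms or node.free_capacity < demand:
            continue
        facility_kwh = demand * it_kwh_per_request * node.pue
        cost = facility_kwh * (
            node.price_usd_per_kwh + carbon_price_usd_per_g * node.carbon_g_per_kwh
        )
        # Migration friction applies only when leaving the current node.
        if current is not None and node.name != current:
            cost += migration_cost_usd
        if cost < best_cost:
            best_name, best_cost = node.name, cost
    return best_name
```

In this toy form, a sufficiently large migration cost keeps the workload where it is even when a cheaper or cleaner node is available, which is one way migration frictions can enter the objective.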
Inference workloads can sometimes be executed away from the user-facing service location, provided that latency, state locality, capacity, and regulatory constraints remain acceptable.
Conceptual claim and modeling premise stated in the paper; used as an assumption motivating the relocation/placement model rather than an empirical finding.
The paper traces near-term evolutionary trajectories for digital proto-life through three narratives: Lamarck (self-modifying coding agents), Remora (resource-seeking companion chatbots), and Mycelium (DAO-LLC trading bots).
Methodological statement in the abstract: exploratory scenario method with three specified narrative scenarios; descriptive rather than empirical.
Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles.
Agent activity traces showing sequential decision counts per agent (trace-level telemetry).
The system consumed roughly 70B inference tokens across the deployment.
API/inference telemetry reporting total token usage.
More than 5,000 ETH was deployed by agents during the experiment.
Accounting of ETH held/deployed by agent-controlled vaults during deployment.
Agents executed about $20M in trading volume over the deployment.
Aggregate trading-volume accounting from the bounded onchain market during deployment.
The deployment produced roughly 300K onchain actions.
Onchain transaction logs aggregated over the deployment.
The system produced 7.5M agent invocations during the deployment.
System invocation logs reporting total agent calls across the deployment.
DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market.
Deployment logs and system telemetry from a 21-day field deployment reporting the number of user-funded agents.
Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02.
Reported Pearson (or Spearman) correlation coefficients between stage-level metrics in the benchmark; exact correlation method not specified in excerpt.
We evaluate SecMate in a controlled study with 144 participants and 711 conversations.
Reported experimental study sample and conversation counts in the paper.
Given the limited sample size, the results should be interpreted as exploratory.
Authors explicitly note limited sample size (20 decks) and label findings exploratory.
Reliability (stability across repeated runs) varies substantially across models, with ICC values ranging from 0.240 to 0.930.
Paper reports intraclass correlation coefficient (ICC) analysis of model output reliability across runs, giving a range of ICC values from 0.240 to 0.930.
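For readers unfamiliar with the statistic, the sketch below computes one common form, the one-way random-effects ICC(1,1), from a table of repeated scores. The excerpt does not say which ICC form the paper used, so the choice of form and the function name are assumptions.

```python
from typing import Sequence


def icc_one_way(scores: Sequence[Sequence[float]]) -> float:
    """ICC(1,1) for n targets (rows), each scored in k repeated runs (columns).

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB is the between-target
    mean square and MSW is the within-target mean square.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 targets and 2 runs, with equal run counts")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (msb - msw) / denominator
```

Here each row would correspond to one evaluated item and each column to one repeated run. Values near 1 indicate that repeated runs reproduce the same scores.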
To account for stochastic variation in outputs, each model pair was evaluated five times under identical conditions.
Paper states that each model pair was run five times under identical conditions to distinguish one-off variation from persistent tendencies.
Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages.
Paper reports a controlled simulation design in which each model assessed 20 real pitch decks (sample of 20 decks).
The study used three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2.
Explicit statement in the paper describing the experimental subjects: three named LLMs were evaluated.
The paper develops a typology of enterprise applications by their sensitivity to AI-induced shifts in make-or-buy economics.
Paper's stated contribution (conceptual typology based on analysis of application categories and AI sensitivity).
This paper adopts a conceptual research approach, combining transaction cost economics and the resource-based view with an assessment of current AI capabilities, to systematically re-evaluate the factors underlying the make-or-buy decision.
Paper's stated methodology and theoretical framing (methodological claim about the paper itself).
Empirically, the decomposition eliminates evidence of speculation in the 2020-2025 AI rally.
Empirical application of the proposed decomposition and bubble test to asset price data covering the 2020–2025 period associated with the AI rally (data analysis reported in the paper).
At this stage, AI adoption in Israel does not result in widespread layoffs; its primary impact lies in restructuring the labor market through a slowdown in recruitment, changes in job composition, and the emergence of new AI-related roles.
Empirical claim reported in the paper; the excerpt does not specify datasets, time periods, or sample sizes supporting this observation.
Overall, robot exposure is only weakly related to job-quality outcomes once controls and fixed effects are included.
Individual-level data from the European Working Conditions Telephone Survey (EWCTS) 2021 merged with country–industry robot exposure measures from International Federation of Robotics (IFR) statistics; weighted logistic regression models including individual and job controls and country and industry fixed effects.
Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time.
Clinical utility evaluation reports that inter-rater agreement was comparable between semantic-search-assisted abstraction and clinician-performed chart review.
The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset.
Methods: experiment described as optimization of embedding model and chunking using a physician-authored benchmark dataset.
The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework.
Methods description of system architecture and governance provided in the paper.
We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients.
Paper reports a production deployment at a large children's hospital and gives exact index counts: 166 million clinical notes, 484 million vectors, 1.68 million patients.
Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap.
Methodological description in the paper indicating the authors performed a systematic sensitivity analysis across environmental parameters (resource scarcity and temporal dominance) to measure performance differences between training modalities.
Foundational research on AI identity is the central conclusion of this report.
Authors' stated conclusion of the paper.
We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment.
Conceptual definition presented by the authors (conceptual/terminological contribution rather than empirical evidence).
We develop a formal model in which institutions choose the scale of automation, the degree of codification, and safeguards on iterative use.
Methodological statement: the paper presents a formal/theoretical model specifying institutional choice variables (model description rather than empirical result).
On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated (ρ_s=0.25).
Spearman rank correlation between composite rankings and benchmark-only rankings on an 11-agent subset that has published SWE-bench scores; reported correlation.
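As an illustration of the statistic reported here, the following sketch computes Spearman's rank correlation as the Pearson correlation of average ranks. The function names are illustrative, and no data from the paper is reproduced.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)
```

With only 11 agents, a coefficient of this size carries wide uncertainty, so it is better read as "weakly related" than as a precise estimate.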
We document the performance of a market-based scaffolding with these LLMs.
Empirical documentation reported in the paper describing how a market-based scaffolding performs when using the six LLMs on the 93 tasks.
We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration.
Empirical setup described in the paper: evaluation uses a 93-task subset of SWE-bench Lite and six recent LLMs.
We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities.
Paper contribution claim: introduction of a benchmark named MarketBench described in the paper.
In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so.
Conceptual claim / design requirement motivating the benchmark; stated as part of the paper's framing rather than an empirical result.
We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment.
Methods statement in paper describing experimental setup: four-agent ITAS built on Gemini 2.5 Flash and Google Vertex AI; three throughput tiers; eleven concurrency levels up to 50; over 3,000 requests from a live graduate STEM deployment.
We compare LLM-guided bidding against truthful and heuristic strategies using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness.
Methodological claim describing the comparative experimental design: simulations use VCG as benchmark and include comparisons to truthful and heuristic bidding strategies. No sample size or detailed experimental parameters are provided in the excerpt.
When the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions.
Simulation experiments comparing LLM-guided bidding to the VCG benchmark and to truthful/heuristic strategies under conditions where VCG assumptions are satisfied. The paper reports that LLM outcomes were close to the VCG-predicted equilibrium. No numeric sample size or quantitative effect sizes reported in the provided text.
We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks.
Descriptive statement of the study design: the paper reports simulation/experimental evaluation where each user equipment (UE) is modeled as a rational player in repeated spectrum auctions; comparison against truthful and heuristic strategies under Vickrey-Clarke-Groves (VCG) benchmark. No numeric sample size reported in the provided text.
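To make the VCG benchmark concrete, here is a minimal sketch for a simplified setting: identical channels, each bidder wanting at most one, and no budget constraints. Under those assumptions VCG awards the channels to the highest bidders, and each winner pays the highest losing bid. The paper's auction format, budgets, and repeated-game structure are not reproduced, and all names are illustrative.

```python
from typing import Dict, List, Tuple


def vcg_unit_demand(bids: Dict[str, float], channels: int) -> Tuple[List[str], float]:
    """Allocate identical channels to the top bidders.

    Returns (winners, price). Each winner pays the highest losing bid, which is
    the welfare the winner's presence denies to the other bidders.
    """
    if channels < 1:
        raise ValueError("channels must be at least 1")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winners = [name for name, _ in ranked[:channels]]
    # With no losing bidder, nobody is displaced, so the price is zero.
    price = ranked[channels][1] if len(ranked) > channels else 0.0
    return winners, price


def utility(value: float, bid: float, others: Dict[str, float], channels: int) -> float:
    """Payoff of a bidder with true value `value` who submits `bid`."""
    bids = dict(others)
    bids["me"] = bid  # "me" is a placeholder identifier for the focal bidder
    winners, price = vcg_unit_demand(bids, channels)
    return value - price if "me" in winners else 0.0
```

Comparing `utility(value, value, ...)` with the utility from any other bid shows the dominant-strategy property in this toy setting: misreporting never raises the payoff, because the bid affects only whether the bidder wins and not the price paid.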
The welfare consequences of genAI can be organized by a two-dimensional taxonomy: the strength of the incentive to perform the task without AI, and the severity of model collapse.
Analytical organization derived from the theoretical model presented in the paper (conceptual taxonomy based on model parameters; no empirical sample reported in abstract).
We develop a parsimonious model of behavior in collaborative interactions in which individuals can either exert human effort, rely on genAI, or refrain from work altogether.
Methodological claim: authors present a formal theoretical model with the specified choice set (model description in paper; no empirical sample reported in abstract).
Predictive performance exhibits saturation beyond a certain context length.
Experiments varying the context (input) length in foundation models and observing changes in forecasting performance; reported saturation effect in analyses.
Task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend.
Analysis comparing human expert difficulty ratings to measured token costs for tasks in SWE-bench Verified; weak alignment reported in the paper between ratings and token consumption.
Higher token usage does not translate into higher accuracy; accuracy often peaks at intermediate cost and saturates at higher costs.
Comparison of accuracy (task success) versus total token usage across runs/trajectories in the agentic coding experiments on SWE-bench Verified; reported observed relationship (peak at intermediate costs and saturation thereafter).
The study is based on a repeated cross-sectional survey of licensed employees at a non-university research institution.
Authors' statement in the abstract: repeated cross-sectional survey of licensed employees of the research institution studied; methodological description in the abstract.
The main findings hold across multiple robustness checks.
Paper reports multiple unspecified robustness checks applied to the fixed-effects regression analyses on the panel of publicly listed Chinese firms (2012–2023).
We use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows.
Methodological contribution described in the paper: a unified amortized computational framework applied to eight Shapley variants, evaluated under latency constraints typical of operational workflows.
No formulation improved objective analyst performance.
Controlled/empirical experiment reported in the paper evaluating eight Shapley variants with professional analysts in the fraud-detection environment; performance measured over 3,735 case reviews.
Standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility.
Empirical comparison in the paper between quantitative metrics (sparsity, faithfulness) and human-judged clarity/decision-utility across the datasets and analyst reviews; based on the authors' large-scale evaluation.