Evidence (13,870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
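
One way to read the matrix is as a share of positive findings per outcome. The following is a minimal sketch, assuming the rows are available as tab-separated text. The function names `parse_matrix` and `positive_share` are illustrative, and reading a dash cell as zero is an assumption. In the table the four direction columns do not always sum to Total, so the sketch divides by the reported Total.

```python
def parse_matrix(text):
    """Parse tab-separated rows into {outcome: {column: int}}.

    Assumption: a dash cell ("—") means no claims and is read as 0.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split("\t")
    rows = {}
    for ln in lines[1:]:
        cells = ln.split("\t")
        rows[cells[0]] = {
            col: 0 if val == "—" else int(val)
            for col, val in zip(header[1:], cells[1:])
        }
    return rows


def positive_share(row):
    """Fraction of an outcome's claims that report a positive finding."""
    return row["Positive"] / row["Total"]


# Three rows copied from the matrix above.
SAMPLE = (
    "Outcome\tPositive\tNegative\tMixed\tNull\tTotal\n"
    "Firm Productivity\t435\t55\t88\t20\t604\n"
    "Job Displacement\t12\t80\t20\t1\t113\n"
    "Labor Share of Income\t17\t17\t17\t—\t51\n"
)

matrix = parse_matrix(SAMPLE)
for outcome, row in matrix.items():
    print(f"{outcome}: {positive_share(row):.1%} positive")
```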

All artifacts associated with this study are publicly available at https://zenodo.org/records/18489222.

Statement in the paper providing a Zenodo link to artifacts.

high null result The Impact of LLM-Assistants on Software Developer Productiv... availability of study artifacts

This review identifies key research gaps and provides recommendations for future research and practice.

Authors' discussion and conclusion sections synthesizing gaps and offering recommendations based on the mapping results.

high null result The Impact of LLM-Assistants on Software Developer Productiv... research gaps and recommendations (qualitative synthesis)

Satisfaction, Performance, and Efficiency are the most frequently investigated SPACE dimensions, whereas Communication and Activity remain underexplored.

Frequency counts and synthesis across the 39 included studies mapped to SPACE dimensions as reported by the authors.

high null result The Impact of LLM-Assistants on Software Developer Productiv... frequency of SPACE dimensions studied

Only 15% of the reviewed studies extend beyond three SPACE dimensions.

Authors' coding of included studies against the SPACE framework with reported proportion.

high null result The Impact of LLM-Assistants on Software Developer Productiv... proportion of studies examining >3 SPACE dimensions

90% of the reviewed studies adopt a multi-dimensional perspective by examining at least two SPACE dimensions.

Authors' coding of included studies against the SPACE framework, yielding the reported proportion.

high null result The Impact of LLM-Assistants on Software Developer Productiv... proportion of studies examining >=2 SPACE dimensions

This paper is a systematic review and mapping of 39 peer-reviewed studies published between January 2014 and December 2024 that examine the impact of LLM-assistants on software developer productivity.

Authors conducted a systematic review and mapping exercise covering peer-reviewed studies within the stated date range; the paper reports the count of included studies as 39.

high null result The Impact of LLM-Assistants on Software Developer Productiv... scope of literature reviewed (count of studies)

Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles.

Agent activity traces showing sequential decision counts per agent (trace-level telemetry).

high null result Operating-Layer Controls for Onchain Language-Model Agents U... number of prompt-state-action cycles per agent

The system consumed roughly 70B inference tokens across the deployment.

API/inference telemetry reporting total token usage.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... inference token consumption

More than 5,000 ETH was deployed by agents during the experiment.

Accounting of ETH held/deployed by agent-controlled vaults during deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... ETH deployed

Agents executed about $20M in trading volume over the deployment.

Aggregate trading-volume accounting from the bounded onchain market during deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... trading volume (USD)

The deployment produced roughly 300K onchain actions.

Onchain transaction logs aggregated over the deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... onchain actions (transactions executed)

The system produced 7.5M agent invocations during the deployment.

System invocation logs reporting total agent calls across the deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... agent invocations (usage)

DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market.

Deployment logs and system telemetry from a 21-day field deployment reporting the number of user-funded agents.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... number of active agents

Fears of AI automation do not primarily increase support for traditional interventions such as unemployment benefits and training programs.

Comparative analysis of policy preference responses in the 2024 OECD 'Risks that Matter' survey as reported in the paper.

high null result AI, the Future of Work, and the Politics of the Welfare Stat... public support for unemployment benefits and training programs

Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02.

Reported Pearson (or Spearman) correlation coefficients between stage-level metrics in the benchmark; exact correlation method not specified in excerpt.

high null result Benchmarking Complex Multimodal Document Processing Pipeline... correlation between stage-level quality metrics
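
To put the reported coefficients in context, the sketch below computes a Pearson correlation from scratch and squares the three reported values to give the shared variance. The excerpt does not say whether the paper used Pearson or Spearman, so Pearson is an assumption here, and `pearson_r` is an illustrative helper rather than the benchmark's code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Coefficients as reported in the claim above; r squared is the share of
# variance one stage-level metric shares with the other.
reported = [
    ("parsing->retrieval", 0.14),
    ("parsing->generation", 0.17),
    ("retrieval->generation", 0.02),
]
for label, r in reported:
    print(f"{label}: r = {r:.2f}, r^2 = {r * r:.4f}")
```

Under the Pearson assumption, even the strongest of the three links shares under 3% of its variance, which is consistent with the claim that the stages are close to independent.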

We evaluate SecMate in a controlled study with 144 participants and 711 conversations.

Reported experimental study sample and conversation counts in the paper.

high null result SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... study sample size and conversation count

Given the limited sample size, the results should be interpreted as exploratory.

Authors explicitly note limited sample size (20 decks) and label findings exploratory.

high null result Algorithmic personalities and the myth of neutrality: financ... interpretation caveat regarding sample size

Reliability (stability across repeated runs) varies substantially across models, with ICC values ranging from 0.240 to 0.930.

Paper reports intraclass correlation coefficient (ICC) analysis of model output reliability across runs, giving a range of ICC values from 0.240 to 0.930.

high null result Algorithmic personalities and the myth of neutrality: financ... output reliability (ICC)
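
As an illustration of how such a reliability figure can be computed, the sketch below implements the one-way random-effects intraclass correlation, often written ICC(1,1). The excerpt does not state which ICC form the paper used, so the choice of form, the function name `icc_oneway`, and the layout (rows as rated items, columns as repeated runs) are all assumptions.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: list of n rows (rated items), each holding k repeated scores.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items with the same number (>= 2) of runs")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denom


# Hypothetical scores: 3 items, each scored in 3 runs with no run-to-run noise.
stable = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(icc_oneway(stable))
```

A value near 1 means repeated runs rank items almost identically, while values near 0 mean run-to-run noise dominates the differences between items.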

To account for stochastic variation in outputs, each model pair was evaluated five times under identical conditions.

Paper states that each model pair was run five times under identical conditions to distinguish one-off variation from persistent tendencies.

high null result Algorithmic personalities and the myth of neutrality: financ... number of repeated runs (methodological)

Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages.

Paper reports a controlled simulation design in which each model assessed 20 real pitch decks (sample of 20 decks).

high null result Algorithmic personalities and the myth of neutrality: financ... number of pitch decks evaluated (methodological)

The study used three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2.

Explicit statement in the paper describing the experimental subjects: three named LLMs were evaluated.

high null result Algorithmic personalities and the myth of neutrality: financ... models evaluated (methodological)

The paper develops a typology of enterprise applications by their sensitivity to AI-induced shifts in make-or-buy economics.

Paper's stated contribution (conceptual typology based on analysis of application categories and AI sensitivity).

high null result The Buy-or-Build Decision, Revisited: How Agentic AI Changes... classification (typology) of enterprise applications by sensitivity to AI

This paper adopts a conceptual research approach, combining transaction cost economics and the resource-based view with an assessment of current AI capabilities, to systematically re-evaluate the factors underlying the make-or-buy decision.

Paper's stated methodology and theoretical framing (methodological claim about the paper itself).

high null result The Buy-or-Build Decision, Revisited: How Agentic AI Changes... methodological approach to studying make-or-buy decisions

Empirically, the decomposition eliminates evidence of speculation in the 2020-2025 AI rally.

Empirical application of the proposed decomposition and bubble test to asset price data covering the 2020–2025 period associated with the AI rally (data analysis reported in the paper).

high null result General-Purpose Technology and Speculative Bubble Detection presence (or absence) of speculative bubble evidence in the 2020–2025 AI rally

At this stage, AI adoption in Israel does not result in widespread layoffs; its primary impact lies in restructuring the labor market through a slowdown in recruitment, changes in job composition, and the emergence of new AI-related roles.

Empirical claim reported in the paper; the excerpt does not specify datasets, time periods, or sample sizes supporting this observation.

high null result Artificial Intelligence in Israel, Trends, Developments, and... employment changes attributable to AI adoption (layoffs, recruitment rates, job ...

Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary.

Model architecture description in the paper specifying a 2-layer GCN encoder, twin critics, and a value network used for adversary control.

high null result Semi-Markov Reinforcement Learning for City-Scale EV Ride-Ha... model architecture components (2-layer GCN encoder, twin critics, adversary-driv...

The robust backup uses the Kantorovich–Rubinstein dual, a projected subgradient inner loop, and a primal–dual risk-budget update.

Algorithmic description in the paper detailing the robust backup solver components (Kantorovich–Rubinstein dual, projected subgradient, primal-dual update).

high null result Semi-Markov Reinforcement Learning for City-Scale EV Ride-Ha... robust backup algorithm design and optimization procedure
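
For reference, the Kantorovich–Rubinstein dual named in this entry expresses the Wasserstein-1 distance as a supremum over 1-Lipschitz functions. The statement below is the standard textbook form, not an equation quoted from the paper; the symbols P, Q, f, and the generic ground metric d are notation chosen here.

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \Bigl( \mathbb{E}_{x \sim P}\bigl[f(x)\bigr] - \mathbb{E}_{y \sim Q}\bigl[f(y)\bigr] \Bigr),
\qquad
\|f\|_{\mathrm{Lip}} \;=\; \sup_{x \ne y} \frac{|f(x) - f(y)|}{d(x, y)} .
```

The dual form replaces a search over couplings of P and Q with a search over test functions f, which is what makes an inner subgradient loop a natural fit.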

To mitigate distributional shifts, we optimize a Soft Actor–Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations.

Methodological description of a robust training objective: SAC optimized under a Wasserstein-1 ambiguity set using a graph-aligned Mahalanobis metric to encode spatial correlations.

high null result Semi-Markov Reinforcement Learning for City-Scale EV Ride-Ha... robustness to distributional shift via Wasserstein-1 ambiguity set with graph-al...

These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints.

Method/algorithm description in the paper: a rolling MILP projection component implemented to enforce physical constraints (state-of-charge, charger port limits, feeder limits) at each decision step.

high null result Semi-Markov Reinforcement Learning for City-Scale EV Ride-Ha... constraint compliance via MILP projection (state-of-charge, port, feeder constra...

The policy learns over high-level intentions produced by a masked, temperature-annealed actor.

Method/algorithm description in the paper describing the actor design (masked, temperature-annealed) and the high-level intentions used for policy learning.

high null result Semi-Markov Reinforcement Learning for City-Scale EV Ride-Ha... policy representation (high-level intentions from masked, temperature-annealed a...

We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions (discrete actions for serving, repositioning, and charging, together with continuous charging power) and variable action durations.

Methodological description in the paper presenting the model formulation (hex-grid semi-MDP) and action space design; no external dataset required.

high null result Semi-Markov Reinforcement Learning for City-Scale EV Ride-Ha... problem formulation (hex-grid semi-MDP with mixed and continuous actions and var...

The analysis employs rigorous econometric methods including difference-in-differences estimation and propensity score matching to control for confounding variables across industry (NAICS 2-digit), firm size, geographic location, occupation-level characteristics, and macroeconomic conditions.

Methodological description in the paper specifying DiD and propensity score matching and listed covariates/controls.

high null result The Generative AI Revolution: Early Evidence of Structural T... methodological controls / identification strategy
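
The core of a difference-in-differences estimate can be shown with the simplest two-group, two-period case. This is a minimal sketch with hypothetical numbers; it omits the propensity score matching, covariates, and fixed effects the paper describes, and `did_estimate` is an illustrative name rather than the authors' code.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group means.

    Returns (change in treated group) minus (change in control group).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcome values for a few firms in each cell.
effect = did_estimate(
    treated_pre=[10, 12],
    treated_post=[15, 17],
    control_pre=[8, 10],
    control_post=[9, 11],
)
print(effect)
```

Subtracting the control group's change removes shocks common to both groups, so the estimate is only credible when the two groups would have followed parallel trends without treatment.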

The study uses U.S. Census Bureau Business Trends and Outlook Survey data tracking over 1.2 million businesses.

Paper statement that it incorporates the Census Bureau Business Trends and Outlook Survey covering >1,200,000 businesses.

high null result The Generative AI Revolution: Early Evidence of Structural T... business-level observations (adoption/behavior)

The analysis integrates the Anthropic Economic Index capturing approximately one million AI usage interactions.

Paper statement that the Anthropic Economic Index was used and captures ~1,000,000 AI usage interactions.

high null result The Generative AI Revolution: Early Evidence of Structural T... AI usage interactions (adoption/usage)

We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions.

Dataset and experimental log statistics reported in the paper.

high null result Cooperate to Compete: Strategic Coordination in Multi-Agent ... dataset size metrics (games, conversations, tokens, actions)

We run AI-only games and conduct a user study pitting human players against AI opponents.

Method statement in the paper describing experiments with both AI-only and human-vs-AI games.

high null result Cooperate to Compete: Strategic Coordination in Multi-Agent ... experimental setup (AI-only games and user study)

Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players' short-term interests align and diverge.

Specification of game mechanics and rules in the paper (design features of C2C).

high null result Cooperate to Compete: Strategic Coordination in Multi-Agent ... game mechanic: objective asymmetry and non-binding negotiation

We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective.

Description of a newly developed environment (paper introduces the game and its rules/design).

high null result Cooperate to Compete: Strategic Coordination in Multi-Agent ... environmental features (private negotiations, secret objectives)

Overall, robot exposure is only weakly related to job-quality outcomes once controls and fixed effects are included.

Individual-level data from the European Working Conditions Telephone Survey (EWCTS) 2021 merged with country–industry robot exposure measures from International Federation of Robotics (IFR) statistics; weighted logistic regression models including individual and job controls and country and industry fixed effects.

high null result Gendered Effects of Robotisation on Job Quality job-quality outcomes (aggregate across dimensions)

There is no decrease in coding skills among new hires associated with GHC adoption.

Comparison of coding-skill indicators on LinkedIn profiles for new hires at GHC-adopting firms versus non-adopting firms; finding of no measurable decline in coding-skill measures.

high null result Firms' GitHub Copilot adoption and labor market outcomes for... coding skills among new hires

Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time.

Clinical utility evaluation reports that inter-rater agreement was comparable between semantic-search-assisted abstraction and clinician-performed chart review.

high null result Health System Scale Semantic Search Across Unstructured Clin... inter-rater_agreement

The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset.

Methods: experiment described as optimization of embedding model and chunking using a physician-authored benchmark dataset.

high null result Health System Scale Semantic Search Across Unstructured Clin... model_and_chunking_configuration

The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework.

Methods description of system architecture and governance provided in the paper.

high null result Health System Scale Semantic Search Across Unstructured Clin... system_architecture / governance_compliance
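
The split between a vector index and a separate key-value store for text can be illustrated with a toy lookup. This is a sketch under stated assumptions, not the deployed system: the in-memory dicts stand in for the managed vector database and the key-value store, the three-dimensional vectors and chunk IDs are made up, and ranking by cosine similarity is assumed rather than confirmed by the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, vector_index, text_store, k=2):
    """Rank chunk IDs by similarity, then fetch their text from the KV store."""
    ranked = sorted(
        vector_index,
        key=lambda chunk_id: cosine(query_vec, vector_index[chunk_id]),
        reverse=True,
    )
    return [(chunk_id, text_store[chunk_id]) for chunk_id in ranked[:k]]


# Hypothetical stand-ins: chunk ID -> embedding, and chunk ID -> chunk text.
vector_index = {
    "note1#0": [1.0, 0.0, 0.0],
    "note1#1": [0.0, 1.0, 0.0],
    "note2#0": [0.9, 0.1, 0.0],
}
text_store = {
    "note1#0": "placeholder text for chunk 0 of note 1",
    "note1#1": "placeholder text for chunk 1 of note 1",
    "note2#0": "placeholder text for chunk 0 of note 2",
}

hits = search([1.0, 0.0, 0.0], vector_index, text_store, k=2)
for chunk_id, text in hits:
    print(chunk_id, "->", text)
```

Keeping text out of the vector index means the similarity search touches only compact vectors, and full text is fetched by key for the few chunks that are returned.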

We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients.

Paper reports a production deployment at a large children's hospital and gives exact index counts: 166 million clinical notes, 484 million vectors, 1.68 million patients.

high null result Health System Scale Semantic Search Across Unstructured Clin... number_of_notes_indexed / index_size

We develop an analytical model in which a firm jointly chooses AI deployment and cybersecurity investment under this governance-capability gap.

Methodological claim: the paper presents an analytical (theoretical) model describing joint choice of deployment and cybersecurity investment.

high null result The Security Cost of Intelligence: AI Capability, Cyber Risk... model of joint choice (AI deployment and cybersecurity investment)

Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap.

Methodological description in the paper indicating the authors performed a systematic sensitivity analysis across environmental parameters (resource scarcity and temporal dominance) to measure performance differences between training modalities.

high null result An Analysis of the Coordination Gap between Joint and Modula... coordination gap

Foundational research on AI identity is the central conclusion of this report.

Authors' stated conclusion of the paper.

high null result AI Identity: Standards, Gaps, and Research Directions for AI... priority recommendation for future research

We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment.

Conceptual definition presented by the authors (conceptual/terminological contribution rather than empirical evidence).

high null result AI Identity: Standards, Gaps, and Research Directions for AI... conceptualization of AI agent identity

The sign reversal is a structural consequence of the reviewer effort collapse under log-concave quality distributions; this is proved analytically.

Formal analytical proofs in the paper that use the assumption of log-concave quality distributions to show the mechanism producing the sign reversal.

high null result Buying the Right to Monitor: Editorial Design in AI-Assisted ... existence of sign reversal as a robust structural model implication under log-co...

We formalize the distinction between compensatory and non-compensatory decision regimes and define a pre-execution legitimacy boundary.

Theoretical formalization presented in the paper (definitions and conceptual framework). No empirical evidence or sample size provided.

high null result Right-to-Act: A Pre-Execution Non-Compensatory Decision Prot... formal definitions distinguishing decision regimes and the notion of a pre-execu...
