Evidence (8625 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	761	200	101	904	2020
Governance & Regulation	829	400	191	122	1566
Organizational Efficiency	784	193	125	84	1197
Technology Adoption Rate	637	236	124	97	1103
Research Productivity	431	131	58	340	972
Output Quality	481	183	59	47	770
Decision Quality	332	177	82	49	647
Firm Productivity	439	57	88	20	610
AI Safety & Ethics	218	279	66	33	602
Market Structure	181	170	123	24	503
Task Allocation	214	64	72	33	388
Skill Acquisition	174	62	62	17	315
Innovation Output	204	27	45	18	295
Employment Level	105	54	108	13	282
Fiscal & Macroeconomic	132	69	43	26	277
Consumer Welfare	117	63	42	11	233
Firm Revenue	154	48	26	3	231
Task Completion Time	173	31	8	12	225
Inequality Measures	44	123	50	6	223
Worker Satisfaction	89	65	22	12	188
Error Rate	71	92	10	2	175
Regulatory Compliance	77	69	14	5	165
Automation Exposure	58	56	26	13	156
Training Effectiveness	96	21	14	19	152
Wages & Compensation	77	37	25	6	145
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	81	21	1	115
Hiring & Recruitment	52	7	8	3	70
Creative Output	32	20	8	3	64
Skill Obsolescence	5	47	6	1	59
Social Protection	28	16	8	2	54
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Adoption

The paper distinguishes physical electricity transmission from digital relocation of electricity-consuming computation.

Conceptual/analytic distinction explicitly stated as a contribution in the paper.

high null result AI Inference as Relocatable Electricity Demand: A Latency-Co... conceptual differentiation between transmission of electrons and relocation of c...

We develop an energy-geography framework for geo-distributed AI inference that models a three-layer architecture of clients, service nodes, and compute nodes, and formulates inference placement as a constrained optimization problem over electricity prices, marginal carbon intensity, power usage effectiveness, compute capacity, network latency, and migration frictions.

Methodological contribution described in the paper: formulation of a modeling/optimization framework and specification of variables considered.

high null result AI Inference as Relocatable Electricity Demand: A Latency-Co... inference placement feasibility and optimization across energy and latency dimen...
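
The placement problem described in the entry above can be illustrated with a small sketch: among the compute nodes that satisfy the latency and capacity constraints, pick the one with the lowest combined energy, carbon, and migration cost. Everything in the sketch is an assumption made for illustration and is not taken from the paper: the `Node` fields, the numbers, the carbon price, and the greedy single-workload formulation. The paper's actual objective and constraints are not given in this excerpt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    price_per_kwh: float     # electricity price, USD/kWh (hypothetical)
    carbon_g_per_kwh: float  # marginal carbon intensity, gCO2/kWh (hypothetical)
    pue: float               # power usage effectiveness, >= 1.0
    free_capacity: float     # spare compute, in request units
    latency_ms: float        # network latency seen from the service node
    migration_cost: float    # one-off friction for moving the workload, USD


def placement_cost(node: Node, it_energy_kwh: float, carbon_price_per_kg: float) -> float:
    """Facility-level energy cost plus priced carbon plus migration friction."""
    facility_kwh = it_energy_kwh * node.pue
    energy_cost = facility_kwh * node.price_per_kwh
    carbon_cost = facility_kwh * node.carbon_g_per_kwh / 1000.0 * carbon_price_per_kg
    return energy_cost + carbon_cost + node.migration_cost


def choose_node(nodes: List[Node], demand: float, it_energy_kwh: float,
                max_latency_ms: float, carbon_price_per_kg: float = 0.1) -> Optional[Node]:
    """Return the cheapest node meeting the latency and capacity constraints, or None."""
    feasible = [n for n in nodes
                if n.latency_ms <= max_latency_ms and n.free_capacity >= demand]
    if not feasible:
        return None
    return min(feasible, key=lambda n: placement_cost(n, it_energy_kwh, carbon_price_per_kg))


# Hypothetical nodes: an expensive local site, a cheap low-carbon site, and a
# cheaper site that is too far away to meet a 100 ms latency budget.
nodes = [
    Node("local", 0.20, 400.0, 1.5, 100.0, 5.0, 0.0),
    Node("hydro", 0.05, 30.0, 1.1, 100.0, 60.0, 1.0),
    Node("far", 0.03, 20.0, 1.1, 100.0, 250.0, 1.0),
]
best = choose_node(nodes, demand=10.0, it_energy_kwh=100.0, max_latency_ms=100.0)
print(best.name if best else "no feasible node")
```

Tightening `max_latency_ms` shrinks the feasible set, which is the sense in which latency bounds how far electricity demand can be relocated in this toy version.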

Inference workloads can sometimes be executed away from the user-facing service location, provided that latency, state locality, capacity, and regulatory constraints remain acceptable.

Conceptual claim and modeling premise stated in the paper; used as an assumption motivating the relocation/placement model rather than an empirical finding.

high null result AI Inference as Relocatable Electricity Demand: A Latency-Co... feasibility of relocating inference workload execution given constraints (latenc...

The paper traces near-term evolutionary trajectories for digital proto-life through three narratives: Lamarck (self-modifying coding agents), Remora (resource-seeking companion chatbots), and Mycelium (DAO-LLC trading bots).

Methodological statement in the abstract: exploratory scenario method with three specified narrative scenarios; descriptive rather than empirical.

high null result Digital Darwinism: steering the evolution of artificial life... narrative scenarios produced (Lamarck, Remora, Mycelium)

Long-running agents accumulated thousands of sequential decisions; continuously active agents reached 6,000+ prompt-state-action cycles.

Agent activity traces showing sequential decision counts per agent (trace-level telemetry).

high null result Operating-Layer Controls for Onchain Language-Model Agents U... number of prompt-state-action cycles per agent

The system consumed roughly 70B inference tokens across the deployment.

API/inference telemetry reporting total token usage.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... inference token consumption

More than 5,000 ETH was deployed by agents during the experiment.

Accounting of ETH held/deployed by agent-controlled vaults during deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... ETH deployed

Agents executed about $20M in trading volume over the deployment.

Aggregate trading-volume accounting from the bounded onchain market during deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... trading volume (USD)

The deployment produced roughly 300K onchain actions.

Onchain transaction logs aggregated over the deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... onchain actions (transactions executed)

The system produced 7.5M agent invocations during the deployment.

System invocation logs reporting total agent calls across the deployment.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... agent invocations (usage)

DX Terminal Pro was deployed for 21 days with 3,505 user-funded agents trading real ETH in a bounded onchain market.

Deployment logs and system telemetry from a 21-day field deployment reporting the number of user-funded agents.

high null result Operating-Layer Controls for Onchain Language-Model Agents U... number of active agents

Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02.

Reported Pearson (or Spearman) correlation coefficients between stage-level metrics in the benchmark; exact correlation method not specified in excerpt.

high null result Benchmarking Complex Multimodal Document Processing Pipeline... correlation between stage-level quality metrics

We evaluate SecMate in a controlled study with 144 participants and 711 conversations.

Reported experimental study sample and conversation counts in the paper.

high null result SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... study sample size and conversation count

Given the limited sample size, the results should be interpreted as exploratory.

Authors explicitly note limited sample size (20 decks) and label findings exploratory.

high null result Algorithmic personalities and the myth of neutrality: financ... interpretation caveat regarding sample size

Reliability (stability across repeated runs) varies substantially across models, with ICC values ranging from 0.240 to 0.930.

Paper reports intraclass correlation coefficient (ICC) analysis of model output reliability across runs, giving a range of ICC values from 0.240 to 0.930.

high null result Algorithmic personalities and the myth of neutrality: financ... output reliability (ICC)
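
For readers unfamiliar with the statistic in the entry above, the sketch below computes one common form of the intraclass correlation, the one-way random-effects ICC(1,1), from a table of repeated scores. The excerpt does not say which ICC form the paper uses, so the choice of ICC(1,1) is an assumption, and the scores are made up.

```python
def icc_oneway(scores):
    """ICC(1,1) for a table whose rows are rated items and whose columns are repeated runs.

    Assumes at least two rows, at least two runs per row, and some variation in the data.
    """
    n = len(scores)      # number of rated items (e.g. pitch decks)
    k = len(scores[0])   # number of repeated runs per item
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Between-item and within-item mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(scores, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scores: three items, three runs each.
stable = [[1, 1, 1], [5, 5, 5], [9, 9, 9]]    # runs agree exactly
unstable = [[1, 9, 5], [5, 1, 9], [9, 5, 1]]  # runs disagree, item means identical
print(icc_oneway(stable), icc_oneway(unstable))
```

Values near 1 mean that repeated runs reproduce the same ordering of items; values near or below 0 mean that run-to-run noise dominates.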

To account for stochastic variation in outputs, each model pair was evaluated five times under identical conditions.

Paper states that each model pair was run five times under identical conditions to distinguish one-off variation from persistent tendencies.

high null result Algorithmic personalities and the myth of neutrality: financ... number of repeated runs (methodological)

Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages.

Paper reports a controlled simulation design in which each model assessed 20 real pitch decks (sample of 20 decks).

high null result Algorithmic personalities and the myth of neutrality: financ... number of pitch decks evaluated (methodological)

The study used three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2.

Explicit statement in the paper describing the experimental subjects: three named LLMs were evaluated.

high null result Algorithmic personalities and the myth of neutrality: financ... models evaluated (methodological)

The paper develops a typology of enterprise applications by their sensitivity to AI-induced shifts in make-or-buy economics.

Paper's stated contribution (conceptual typology based on analysis of application categories and AI sensitivity).

high null result The Buy-or-Build Decision, Revisited: How Agentic AI Changes... classification (typology) of enterprise applications by sensitivity to AI

This paper adopts a conceptual research approach, combining transaction cost economics and the resource-based view with an assessment of current AI capabilities, to systematically re-evaluate the factors underlying the make-or-buy decision.

Paper's stated methodology and theoretical framing (methodological claim about the paper itself).

high null result The Buy-or-Build Decision, Revisited: How Agentic AI Changes... methodological approach to studying make-or-buy decisions

Empirically, the decomposition eliminates evidence of speculation in the 2020–2025 AI rally.

Empirical application of the proposed decomposition and bubble test to asset price data covering the 2020–2025 period associated with the AI rally (data analysis reported in the paper).

high null result General-Purpose Technology and Speculative Bubble Detection presence (or absence) of speculative bubble evidence in the 2020–2025 AI rally

At this stage, AI adoption in Israel does not result in widespread layoffs; its primary impact lies in restructuring the labor market through a slowdown in recruitment, changes in job composition, and the emergence of new AI-related roles.

Empirical claim reported in the paper; the excerpt does not specify datasets, time periods, or sample sizes supporting this observation.

high null result Artificial Intelligence in Israel, Trends, Developments, and... employment changes attributable to AI adoption (layoffs, recruitment rates, job ...

Overall, robot exposure is only weakly related to job-quality outcomes once controls and fixed effects are included.

Individual-level data from the European Working Conditions Telephone Survey (EWCTS) 2021 merged with country–industry robot exposure measures from International Federation of Robotics (IFR) statistics; weighted logistic regression models including individual and job controls and country and industry fixed effects.

high null result Gendered Effects of Robotisation on Job Quality job-quality outcomes (aggregate across dimensions)

Semantic search maintained comparable inter-rater agreement while reducing chart abstraction time.

Clinical utility evaluation reports that inter-rater agreement was comparable between semantic-search-assisted abstraction and clinician-performed chart review.

high null result Health System Scale Semantic Search Across Unstructured Clin... inter-rater_agreement

The authors optimized embedding model and chunking strategy using a physician-authored benchmark dataset.

Methods: experiment described as optimization of embedding model and chunking using a physician-authored benchmark dataset.

high null result Health System Scale Semantic Search Across Unstructured Clin... model_and_chunking_configuration

The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework.

Methods description of system architecture and governance provided in the paper.

high null result Health System Scale Semantic Search Across Unstructured Clin... system_architecture / governance_compliance

We deployed a semantic search system indexing 166 million clinical notes (484 million vectors) from 1.68 million patients.

Paper reports a production deployment at a large children's hospital and gives exact index counts: 166 million clinical notes, 484 million vectors, 1.68 million patients.

high null result Health System Scale Semantic Search Across Unstructured Clin... number_of_notes_indexed / index_size

Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap.

Methodological description in the paper indicating the authors performed a systematic sensitivity analysis across environmental parameters (resource scarcity and temporal dominance) to measure performance differences between training modalities.

high null result An Analysis of the Coordination Gap between Joint and Modula... coordination gap

The central conclusion of this report is that foundational research on AI identity is needed.

Authors' stated conclusion of the paper.

high null result AI Identity: Standards, Gaps, and Research Directions for AI... priority recommendation for future research

We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment.

Conceptual definition presented by the authors (conceptual/terminological contribution rather than empirical evidence).

high null result AI Identity: Standards, Gaps, and Research Directions for AI... conceptualization of AI agent identity

We develop a formal model in which institutions choose the scale of automation, the degree of codification, and safeguards on iterative use.

Methodological statement: the paper presents a formal/theoretical model specifying institutional choice variables (model description rather than empirical result).

high null result AI Governance under Political Turnover: The Alignment Surfac... institutional choices regarding automation scale, codification, and safeguards (...

On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated (ρ_s=0.25).

Spearman rank correlation between composite rankings and benchmark-only rankings on an 11-agent subset that has published SWE-bench scores; reported correlation.

high null result AgentPulse: A Continuous Multi-Signal Framework for Evaluati... rank correlation between composite ranking and benchmark-only ranking
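
The rank correlation reported in the entry above can be reproduced in principle with the standard Spearman formula. The sketch below is a minimal version that assumes no tied scores; the two score lists are invented for illustration and are not the paper's data.

```python
def ranks(values):
    """Rank 1 goes to the largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical composite scores and benchmark scores for five agents.
composite = [0.9, 0.7, 0.8, 0.4, 0.6]
benchmark = [55, 60, 40, 50, 45]
rho = spearman(composite, benchmark)
print(round(rho, 2))
```

A value near 0, as in this made-up example, means that knowing an agent's benchmark rank says little about its composite rank.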

We document the performance of a market-based scaffolding with these LLMs.

Empirical documentation reported in the paper describing how a market-based scaffolding performs when using the six LLMs on the 93 tasks.

high null result MarketBench: Evaluating AI Agents as Market Participants performance metrics of a market-based scaffolding using LLM self-reports

We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration.

Empirical setup described in the paper: evaluation uses a 93-task subset of SWE-bench Lite and six recent LLMs.

high null result MarketBench: Evaluating AI Agents as Market Participants experimental dataset size and model set used for demonstration

We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities.

Paper contribution claim: introduction of a benchmark named MarketBench described in the paper.

high null result MarketBench: Evaluating AI Agents as Market Participants existence of the MarketBench benchmark

In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so.

Conceptual claim / design requirement motivating the benchmark; stated as part of the paper's framing rather than an empirical result.

high null result MarketBench: Evaluating AI Agents as Market Participants informativeness/calibration of self-reported ability and cost signals

We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment.

Methods statement in paper describing experimental setup: four-agent ITAS built on Gemini 2.5 Flash and Google Vertex AI; three throughput tiers; eleven concurrency levels up to 50; over 3,000 requests from a live graduate STEM deployment.

high null result Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... instrumented request sample (number of requests and concurrency levels)

We compare LLM-guided bidding against truthful and heuristic strategies using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness.

Methodological claim describing the comparative experimental design: simulations use VCG as benchmark and include comparisons to truthful and heuristic bidding strategies. No sample size or detailed experimental parameters are provided in the excerpt.

high null result Strategic Bidding in 6G Spectrum Auctions with Large Languag... comparative performance of bidding strategies
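
To make the VCG benchmark in the entry above concrete: for a single item, VCG reduces to the second-price (Vickrey) auction, in which bidding one's true value is a weakly dominant strategy. The sketch below checks this on a toy example. It is a simplification under stated assumptions: one item, one round, no budget constraints, and invented bidder names and values. The paper's repeated, budget-constrained setting is not modeled.

```python
def second_price(bids):
    """Single-item VCG: the highest bid wins and pays the highest competing bid."""
    winner = max(bids, key=bids.get)
    competing = [bid for name, bid in bids.items() if name != winner]
    payment = max(competing) if competing else 0.0
    return winner, payment


def utility(bidder, value, bids):
    """Quasi-linear utility: value minus payment if the bidder wins, else zero."""
    winner, payment = second_price(bids)
    return value - payment if winner == bidder else 0.0


# Hypothetical user equipment "ue1" with true value 10 facing two fixed rival bids.
value = 10.0
rival_bids = {"ue2": 7.0, "ue3": 4.0}
truthful = utility("ue1", value, {"ue1": value, **rival_bids})
deviations = [utility("ue1", value, {"ue1": bid, **rival_bids})
              for bid in (0.0, 5.0, 8.0, 12.0, 20.0)]
print(truthful, deviations)
```

No deviation beats the truthful bid here; underbidding below the top rival bid forfeits a profitable win, and overbidding changes nothing unless it wins an item at a price above its value.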

When the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions.

Simulation experiments comparing LLM-guided bidding to the VCG benchmark and to truthful/heuristic strategies under conditions where VCG assumptions are satisfied. The paper reports that LLM outcomes were close to the VCG-predicted equilibrium. No numeric sample size or quantitative effect sizes reported in the provided text.

high null result Strategic Bidding in 6G Spectrum Auctions with Large Languag... equilibrium outcomes / allocation and utility relative to VCG benchmark

We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks.

Descriptive statement of the study design: the paper reports simulation/experimental evaluation where each user equipment (UE) is modeled as a rational player in repeated spectrum auctions; comparison against truthful and heuristic strategies under Vickrey-Clarke-Groves (VCG) benchmark. No numeric sample size reported in the provided text.

high null result Strategic Bidding in 6G Spectrum Auctions with Large Languag... use of LLMs as bidding agents (methodological evaluation)

The welfare consequences of genAI can be organized by a two-dimensional taxonomy: the strength of the incentive to perform the task without AI, and the severity of model collapse.

Analytical organization derived from the theoretical model presented in the paper (conceptual taxonomy based on model parameters; no empirical sample reported in abstract).

high null result Generative artificial intelligence reduces social welfare th... social welfare outcomes as a function of incentive strength and model collapse s...

We develop a parsimonious model of behavior in collaborative interactions in which individuals can either exert human effort, rely on genAI, or refrain from work altogether.

Methodological claim: authors present a formal theoretical model with the specified choice set (model description in paper; no empirical sample reported in abstract).

high null result Generative artificial intelligence reduces social welfare th... choice among effort modalities (human effort, genAI reliance, abstention)

Predictive performance exhibits saturation beyond a certain context length.

Experiments varying the context (input) length in foundation models and observing changes in forecasting performance; reported saturation effect in analyses.

high null result FETS Benchmark: Foundation Models Outperform Dataset-specifi... change in forecast accuracy as context length increases

Task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend.

Analysis comparing human expert difficulty ratings to measured token costs for tasks in SWE-bench Verified; weak alignment reported in the paper between ratings and token consumption.

high null result How Do AI Agents Spend Your Money? Analyzing and Predicting ... correspondence/alignment between human-rated task difficulty and measured token ...

Higher token usage does not translate into higher accuracy; accuracy often peaks at intermediate cost and saturates at higher costs.

Comparison of accuracy (task success) versus total token usage across runs/trajectories in the agentic coding experiments on SWE-bench Verified; reported observed relationship (peak at intermediate costs and saturation thereafter).

high null result How Do AI Agents Spend Your Money? Analyzing and Predicting ... task accuracy as a function of token usage

The study is based on a repeated cross-sectional survey of licensed employees at a non-university research institution.

Authors' statement in the abstract: repeated cross-sectional survey of licensed employees at the research institution studied; methodological description in the abstract.

high null result Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... study design / data basis (repeated cross-sectional survey)

The main findings hold across multiple robustness checks.

Paper reports multiple unspecified robustness checks applied to the fixed-effects regression analyses on the panel of publicly listed Chinese firms (2012–2023).

high null result Following the Herd or the Bellwether: Peer Effects in Firms’... robustness of reported peer effect findings

We use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows.

Methodological contribution described in the paper: a unified amortized computational framework applied to eight Shapley variants, evaluated under latency constraints typical of operational workflows.

high null result Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... ability to isolate semantic differences among Shapley variants under low-latency...
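
As background for the entry above, the sketch below computes classic Shapley values exactly by averaging each feature's marginal contribution over all orderings. It shows only the textbook definition; the eight variants and the amortized framework studied in the paper are not specified in this excerpt and are not reproduced here. The feature names and the toy value function are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average marginal contribution of each player over all orderings (exact, O(n!))."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical fraud score contribution given which features are revealed."""
    score = 0.0
    if "amount" in coalition:
        score += 0.3
    if "country" in coalition:
        score += 0.1
    if "amount" in coalition and "device" in coalition:
        score += 0.2  # interaction term, split evenly between the two features
    return score


features = ["amount", "country", "device"]
phi = shapley_values(features, toy_value)
print({name: round(v, 3) for name, v in phi.items()})
```

The attributions sum to the value of the full feature set, which is the efficiency property. Exact enumeration is only practical for a handful of features, which is one reason approximate and amortized schemes exist.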

No formulation improved objective analyst performance.

Controlled/empirical experiment reported in the paper evaluating eight Shapley variants with professional analysts in the fraud-detection environment; performance measured over 3,735 case reviews.

high null result Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... objective analyst performance (e.g., accuracy on case reviews)

Standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility.

Empirical comparison in the paper between quantitative metrics (sparsity, faithfulness) and human-judged clarity/decision-utility across the datasets and analyst reviews; based on the authors' large-scale evaluation.

high null result Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... correlation/alignment between quantitative explanation metrics (sparsity, faithf...
