Evidence (11677 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5921 claims
- Human-AI Collaboration: 5192 claims
- Org Design: 3497 claims
- Innovation: 3492 claims
- Labor Markets: 3231 claims
- Skills & Training: 2608 claims
- Inequality: 1842 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
There is a persistent female disadvantage in work intensity.
Analysis of EWCTS 2021 with IFR robot-exposure measures, using weighted logit models controlling for individual and job covariates and fixed effects; gender-specific patterns examined via interaction terms.
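A minimal sketch of this design, assuming hypothetical variable names (`work_intensity`, `robot_exposure`, `female`, `survey_weight`) and a placeholder file path rather than the actual EWCTS/IFR extract:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical worker-level extract: binary work-intensity outcome,
# IFR-based robot-exposure measure, survey weights (placeholder path).
df = pd.read_csv("ewcts_2021_ifr.csv")

# Weighted logit with a gender interaction; occupation and country
# fixed effects enter as categorical dummies. Survey weights are passed
# as variance weights (one common choice for probability weights in GLM).
model = smf.glm(
    "work_intensity ~ robot_exposure * female"
    " + age + tenure + C(occupation) + C(country)",
    data=df,
    family=sm.families.Binomial(),
    var_weights=df["survey_weight"],
)
res = model.fit()

# The robot_exposure:female coefficient carries the gender-specific pattern.
print(res.summary())
```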
Monthly operational cost of running the system is approximately USD 4,000.
Full-scale performance characterization reports a monthly cost estimate of approximately USD 4,000.
Breach externalities expand the range of environments in which deployment is socially constrained.
Analytical model extension/discussion: inclusion of breach externalities increases the set of parameter values where socially optimal deployment is limited.
Optimal deployment falls below the no-risk benchmark, and this shortfall widens with breach-loss magnitude and with the authority exposure attached to more capable systems.
Analytical comparative-statics results from the model showing optimal deployment relative to a no-risk benchmark and sensitivity to breach-loss magnitude and authority exposure.
Central result (the 'deployment paradox'): in high-loss environments, more capable AI can lead a firm to deploy less of it, because under weak governance greater capability is deployed through broader authority exposure.
Analytical result derived from the paper's theoretical model (no empirical sample; comparative statics in the model demonstrate this effect).
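A toy comparative-statics sketch of this kind of result, under an assumed quadratic-loss payoff that is not the paper's actual model: deployment d yields benefit c*d (capability c) and expected breach loss p*L*a*d^2 (breach probability p, loss L, authority exposure a).

```python
import sympy as sp

d, c, L, a, p = sp.symbols("d c L a p", positive=True)

# Illustrative payoff (an assumption, not the paper's specification).
V = c * d - p * L * a * d**2

d_star = sp.solve(sp.diff(V, d), d)[0]
print(d_star)                      # c/(2*L*a*p): optimal deployment level

# Optimal deployment falls with breach loss and with authority exposure.
print(sp.diff(d_star, L))          # negative
print(sp.diff(d_star, a))          # negative

# Deployment-paradox flavor: if capability raises authority exposure fast
# enough (here a = c**2), then d* = 1/(2*p*L*c) *decreases* in capability.
d_star_coupled = sp.simplify(d_star.subs(a, c**2))
print(sp.diff(d_star_coupled, c))  # negative
```

The coupling `a = c**2` is the illustrative assumption doing the work: without it, d* increases in capability.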
The supply of AI-literate workers attenuates wage inequality effects.
Presented in the article as a distributional mechanism informed by synthesized theoretical and empirical findings; no concrete empirical methods or sample sizes are provided in the excerpt.
Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary.
Literature-review style claim made in the paper asserting a gap in prior research emphasis (novel cooperative architectures) versus investigation of training modality necessity.
The coordination gap advantage (between joint and modular training) diminishes in bottleneck environments, particularly under severe transport and processing constraints.
Results from a sensitivity analysis varying resource scarcity and temporal dominance, showing that the relative performance gap shrinks under bottleneck conditions with tight transport and processing constraints. Details of the experimental scenarios are not provided in the abstract.
These gaps are structural; more engineering effort alone will not close them.
Authors' argument/conclusion based on their analytical comparison and gap analysis (normative/assertive claim).
We identify five critical gaps (semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability) that no current technology or regulatory instrument resolves.
Gap analysis synthesized from the structured survey of industry trends, standards, and literature; presented as findings in the paper.
An evaluation of current technical and regulatory documents against the identity requirements of autonomous agents finds that none adequately address the challenge of governing nondeterministic, boundary-crossing entities.
Document review / evaluation reported in the abstract (structured survey of technical and regulatory documents); specific documents and number reviewed are not specified in the abstract.
A structural comparison of human and AI identity across four dimensions (substrate, persistence, verifiability, and legal standing) shows that the asymmetry is fundamental and that extending human frameworks to agents without structural modification produces systematic failures.
Authors' structural comparison (analytical/theoretical method) across four dimensions, reported as a core contribution of the paper.
This creates a problem no current infrastructure is equipped to solve: how do you identify, verify, and hold accountable an entity with no body, no persistent memory, and no legal standing?
Authors' gap analysis informed by a structured survey of industry trends, emerging standards, and technical literature; presented as a synthesized conclusion from that survey.
Before the AI transition, editors should tighten acceptance standards to curb rent-dissipating author polishing.
Optimal policy characterization in the model for the regime where AI capability is below the critical threshold; derived analytically under model assumptions.
When AI capability crosses a critical threshold, reviewer effort collapses discontinuously.
Analytical result proved within the paper's three-sided equilibrium model; threshold and collapse derived theoretically (no empirical sample).
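One standard way such a discontinuity can arise (a stylized sketch, not necessarily the paper's mechanism): let reviewer effort $e$ earn $b(\alpha)\,v(e) - c(e)$, where the informativeness of review $b(\alpha)$ falls continuously as author-side AI capability $\alpha$ rises, and the reviewer's outside option is 0. Then

$$
e^*(\alpha)=
\begin{cases}
\arg\max_{e>0}\big[b(\alpha)\,v(e)-c(e)\big] & \text{if } \max_{e>0}\big[b(\alpha)\,v(e)-c(e)\big]\ge 0,\\
0 & \text{otherwise.}
\end{cases}
$$

The interior optimum stays bounded away from zero while the maximized value falls continuously through zero at some $\bar\alpha$, so effort drops discontinuously to zero there.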
Generative AI acts as a disruptive technological shock to evaluative organizations.
Stated as the motivating premise and developed throughout via a theoretical three-sided equilibrium model in the paper; no empirical sample reported (the claim is supported by model construction and analysis).
The framework addresses emerging tensions captured in the Creativity Paradox, whereby GenAI may weaken intrinsic motivation, conceptual risk-taking, and evaluative depth.
Theoretical extension of paradox theory and conceptual discussion of potential negative effects; presented as conceptual risks rather than empirically demonstrated outcomes.
Making AI usable can thus make procedures easier for future governments to learn and exploit.
Synthesis concluding claim based on the paper's formal model and argumentation (theoretical; no empirical testing reported).
The model shows why expansions in AI use may be difficult to unwind.
Analytical conclusion from the paper's formal model (theoretical argument without empirical sample).
The model explains why reforms that initially improve oversight can later increase that vulnerability.
Analytical/theoretical result from the paper's formal model (presented as an explanation; no empirical data).
The model shows when these systems become vulnerable to strategic use from within government.
Analytical result derived from the paper's formal theoretical model (no empirical validation reported).
The compliance layer can also create a stable approval boundary that political successors learn to navigate while preserving the appearance of lawful administration.
Stated conclusion/insight from the paper's formal argument and conceptual framing (theoretical, no empirical sample).
Manual tools like mind maps support structure creation but lack intelligent (AI) assistance.
Paper's comparison of manual tools versus AI-augmented tools (background/related-work discussion; no empirical evaluation reported for this claim).
Current LLM-based systems let users query information but do not let users shape how knowledge is organized.
Paper's analysis of existing tools and limitations (literature/feature comparison described in introduction; no new empirical test reported).
Knowledge workers face increasing challenges in synthesizing information from multiple documents into structured conceptual understanding.
Statement in paper's introduction/motivation; conceptual observation (no empirical data reported here).
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.
Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).
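A sketch of how a subgroup-driven pattern like this can be checked with rank correlations; the data below are synthetic stand-ins, not the paper's eleven agents:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic n=11 subset: a closed-source high-capability subgroup whose
# adoption falls in capability, mixed with agents where it rises.
closed_high = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
capability = rng.uniform(0.2, 0.9, size=11)
adoption = np.where(closed_high, 1 - capability, capability)
adoption = adoption + rng.normal(0, 0.05, size=11)

df = pd.DataFrame({"capability": capability, "adoption": adoption,
                   "closed_high": closed_high})

# The pooled rank correlation can sit near zero even though the subgroup
# correlation is strongly negative, which is enough to reshuffle rankings.
print(spearmanr(df["capability"], df["adoption"]))
sub = df[df["closed_high"]]
print(spearmanr(sub["capability"], sub["adoption"]))
```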
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.
Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).
Self-assessment is a key bottleneck for market-style coordination of AI agents.
Conclusion drawn from empirical results (miscalibration findings, auction divergence, modest improvement from prior-information intervention) reported in the paper.
Auctions built from these self-reports diverge from a full-information allocation.
Simulation or empirical auction experiments using self-reported signals from the six LLMs on the 93 tasks, compared to a full-information allocation benchmark (method described in paper).
These LLMs are miscalibrated on both success probability and token usage.
Empirical evaluation of six LLMs on 93 SWE-bench Lite tasks assessing calibration of predicted success probabilities and token usage (as reported in the paper).
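A minimal sketch of the allocation comparison, with synthetic success probabilities standing in for the six LLMs' self-reports on the 93 tasks (the optimistic-bias term is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_tasks = 6, 93  # mirrors the paper's setup: six LLMs, 93 tasks

# Synthetic ground truth and optimistically biased self-reports.
true_p = rng.uniform(0.1, 0.9, size=(n_agents, n_tasks))
reported_p = np.clip(true_p + rng.normal(0.15, 0.10, size=true_p.shape), 0, 1)

# Assign each task to the agent with the highest reported vs true probability.
alloc_reported = reported_p.argmax(axis=0)
alloc_full = true_p.argmax(axis=0)

# Score both allocations against the truth.
tasks = np.arange(n_tasks)
print("share of tasks allocated differently:",
      (alloc_reported != alloc_full).mean())
print(f"expected success, self-report allocation: "
      f"{true_p[alloc_reported, tasks].mean():.3f}")
print(f"expected success, full-information allocation: "
      f"{true_p[alloc_full, tasks].mean():.3f}")
```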
Standard PayGo degrades substantially under classroom-scale concurrency.
Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.
Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect (a phase completes only when its slowest call returns) that single-agent systems do not face.
Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).
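A simulation sketch of that maximum effect, assuming lognormal per-call latencies (an assumption; the paper instruments real deployments):

```python
import numpy as np

rng = np.random.default_rng(2)

def phase_latency(n_samples, n_parallel):
    """A parallel phase finishes only when its slowest call returns."""
    draws = rng.lognormal(mean=0.0, sigma=0.6, size=(n_samples, n_parallel))
    return draws.max(axis=1)

single = phase_latency(100_000, 1)   # single-agent baseline
multi = phase_latency(100_000, 4)    # e.g. a four-agent parallel phase

# The max of k draws has a heavier right tail than one draw, so tail
# latency degrades faster than mean latency as concurrency grows.
for name, x in [("single call", single), ("4 parallel calls", multi)]:
    print(f"{name}: mean={x.mean():.2f}, p95={np.quantile(x, 0.95):.2f}")
```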
In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.
Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).
Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.
Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).
The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.
Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).
Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.
Theoretical claim / conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.
Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.
Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.
Input tokens rather than output tokens drive the overall cost of agentic tasks.
Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).
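A back-of-envelope sketch of why input tokens dominate: a multi-turn agent re-sends its whole context each turn, so cumulative input tokens grow roughly quadratically in the number of turns while output tokens grow linearly. All numbers below (turn counts, token sizes, per-token prices) are assumptions, not the paper's measurements:

```python
# Assumed per-turn accounting for an agentic coding loop (illustrative only).
system_prompt = 2_000      # tokens in the fixed prompt
per_turn_output = 500      # tokens the model generates each turn
per_turn_tool_obs = 1_500  # tokens of tool output appended each turn
turns = 40

input_tokens = 0
context = system_prompt
for _ in range(turns):
    input_tokens += context                        # whole context re-sent
    context += per_turn_output + per_turn_tool_obs

output_tokens = per_turn_output * turns

# Hypothetical pricing with the usual input/output asymmetry (USD per token).
price_in, price_out = 3e-6, 15e-6
print(f"input:  {input_tokens:,} tokens -> ${input_tokens * price_in:.2f}")
print(f"output: {output_tokens:,} tokens -> ${output_tokens * price_out:.2f}")
```

With these assumed numbers the input side costs roughly $4.92 against $0.30 for output, even though output tokens are priced five times higher.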
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.
Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).
Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes.
Background statement in the paper's introduction; general literature/field observation (no new primary data reported for this claim in the abstract).
Under our definition, contestants with types below a certain threshold (low types) always engage in benchmark hacking, whereas those above the threshold do not.
Theoretical result (characterization/theorem) derived from the contest model showing threshold behavior in equilibrium across contestant types.
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.
A vulnerability class is characterised for expected-utility maximisers, making them susceptible to adversarial gambles.
Formal characterization/definition and analytical derivation in the paper describing which expected-utility maximisers are vulnerable to adversarial (Pascal-type) offers; theoretical examples provided rather than empirical tests.
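A numeric sketch of the Pascal-type hole, assuming linear (unbounded) utility; the function and numbers are illustrative, not the paper's construction:

```python
def accepts(stake, eps, payoff):
    """An expected-utility maximiser with u(x) = x accepts iff EU > 0."""
    return eps * payoff - stake > 0

stake = 1_000_000   # what the adversary asks for up front
eps = 1e-12         # the agent's credence in the adversary's promise
payoff = 1e20       # the promised prize; the adversary can name any number

print(accepts(stake, eps, payoff))  # True: 1e-12 * 1e20 = 1e8 > 1e6
# Because utility is unbounded, the adversary can always name a payoff
# large enough to clear any stake at any positive credence.
```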
Ungoverned coupling between humans and AI can produce fragility, lock-in, polarization, and domination basins.
Theoretical/modeling analysis showing destabilizing dynamics and multiple basins of attraction when governance regularization is absent or weak; no empirical sample.
Classical robot ethics framed around obedience (e.g. Asimov's laws) is too narrow for contemporary AI systems.
Literature synthesis and conceptual argument drawing on developments in adaptive, generative, embodied, and embedded AI; no empirical sample reported.
Industry digital maturity weakens the effect of the peer leader on a focal firm’s AI adoption.
Interaction/heterogeneity analysis in fixed-effects regression models on panel data of publicly listed Chinese firms (2012–2023), using an industry digital maturity moderator.
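A sketch of such a moderation test, with hypothetical variable and file names standing in for the paper's firm-year panel:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical 2012-2023 firm-year panel of listed firms (placeholder path).
df = pd.read_csv("firm_panel.csv")

# Two-way fixed effects with the moderation term. A negative coefficient on
# peer_leader_ai:industry_digital_maturity would indicate that industry
# digital maturity weakens the peer-leader effect on focal-firm AI adoption.
model = smf.ols(
    "ai_adoption ~ peer_leader_ai * industry_digital_maturity"
    " + size + leverage + roa + C(firm_id) + C(year)",
    data=df,
)
res = model.fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
print(res.summary())
```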
Current evaluation proxies are insufficient for predicting downstream human impact.
Empirical results in the paper showing decoupling between standard quantitative proxies (e.g., sparsity, faithfulness) and human outcomes (clarity, decision utility, confidence) across datasets and analyst reviews.
A highlighting policy that is optimal for sophisticated agents can perform arbitrarily poorly when deployed to naive agents.
Constructive worst-case examples and theoretical bounds in the paper demonstrating arbitrarily large performance degradation when applying sophisticated-optimal policies to naive agents.
Optimizing highlighting for sophisticated agents can be computationally intractable, even in simple discrete and binary settings.
Theoretical complexity results and proofs in the paper showing hardness of the optimization problem under the sophisticated-agent model; no sample/calibration required (formal/algorithmic analysis).