Evidence (6507 claims)

Claim counts by category:

- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
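As an illustration of how the matrix can be read, a minimal sketch in Python computes the share of positive findings per outcome category from a few rows copied out of the table above (only these three rows are used; everything else is as printed in the matrix):

```python
# Minimal sketch: share of positive findings per outcome category,
# using three rows copied from the evidence matrix above.
# Column order: (positive, negative, mixed, null).
rows = {
    "Task Completion Time": (134, 18, 6, 5),
    "Inequality Measures": (36, 105, 40, 6),
    "Job Displacement": (11, 71, 16, 1),
}
for outcome, counts in rows.items():
    positive, total = counts[0], sum(counts)
    print(f"{outcome}: {positive / total:.0%} positive of {total} claims")
```

This makes the directional skew visible at a glance: task-completion-time claims lean strongly positive, while inequality and displacement claims lean negative.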
Productivity
The modeled joules-per-correct-answer metric varies by a factor of 6.2 across endpoints.
Modeled energy estimates combined with task accuracy to compute joules per correct answer across 78 endpoints.
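The metric behind this claim divides modeled energy per query by task accuracy. A minimal sketch, with invented endpoint names and numbers (not the study's data):

```python
# Hypothetical sketch of the joules-per-correct-answer metric: modeled
# energy per query divided by task accuracy. Endpoint names and values
# are invented for illustration.
def joules_per_correct(joules_per_query: float, accuracy: float) -> float:
    """Energy cost of one *correct* answer; diverges as accuracy -> 0."""
    return joules_per_query / accuracy

endpoints = {
    "endpoint_a": (120.0, 0.80),  # (modeled joules/query, accuracy)
    "endpoint_b": (450.0, 0.75),
}
costs = {name: joules_per_correct(j, acc) for name, (j, acc) in endpoints.items()}
spread = max(costs.values()) / min(costs.values())
print(costs, f"spread = {spread:.1f}x")
```

Dividing by accuracy is what lets an energy-efficient but inaccurate endpoint come out worse than a costlier, more accurate one.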
Across 78 endpoints, the same model on different endpoints differs in tail latency by an order of magnitude.
Empirical tail-latency measurements across 78 endpoints serving 12 model families.
The same model on different endpoints differs in fingerprint similarity to the first-party reference by up to 12 points.
Empirical measurement of fingerprint (output-distribution) similarity to a first-party reference across the same set of endpoints (78 endpoints, 12 model families).
Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code.
Empirical measurement across 78 endpoints and 12 model families comparing mean accuracy on math and code tasks.
The rise of digital agents will transform the foundations of production, labour markets, institutional arrangements and the international distribution of economic power.
Synthesis and theoretical projection across sections of the paper; presented as a broad conclusion without reported empirical quantification in the provided text.
There is a fundamental asymmetry between economic and social reproduction: digital agents can compensate for productive functions of the population but are unable to substitute the population's functions of social reproduction.
Theoretical argument and conceptual distinction in the paper; no empirical study measuring substitution in social reproduction provided.
These patterns suggest that AI adoption is associated with expected efficiency gains that shape both firms' pricing behaviour and their macroeconomic expectations.
Interpretation based on observed increases in productivity/profitability and different pricing/inflation expectations among adopters vs non-adopters in survey and DID analyses.
AI adoption leads both to job displacement and job creation, including the emergence of new occupational categories.
Abstract states the review examines empirical evidence on both job displacement and creation and the emergence of new occupations; no numeric counts or sample sizes provided in abstract.
The study identifies short-term transitional risks and long-term productivity gains associated with AI integration in the workforce.
Abstract states the paper evaluates both short-term risks and long-term productivity gains from AI integration based on the reviewed literature; no empirical quantification given in abstract.
AI-driven automation and augmentation are reshaping employment landscapes, with emphasis on sector-level disruption, skill transformation, and socioeconomic consequences.
Abstract states this as a conclusion of the review drawing on interdisciplinary empirical literature; no specific studies or sample sizes cited in abstract.
The accelerating deployment of artificial intelligence across industries has fundamentally altered the structure of global labour markets.
Statement in abstract summarizing a systematic review of interdisciplinary literature (economics, computer science, organizational behaviour, public policy); no specific sample size reported in abstract.
The magnitude of AI’s effect on potential GDP varied across industries and depended on the level of digital maturity, human resources, and institutional conditions.
Decompositional analysis across aggregated industry data and scenario-based modeling drawing on sectoral sources and reviews.
Failures are structured by task family and execution surface: HR, management, and multi-system business workflows are persistent bottlenecks, while local workspace repair is comparatively easier but still unsaturated.
Error-mode analysis across the 105 tasks and evaluated models reported in experiments; authors identify task-family-level patterns (HR, management, multi-system workflows) and relative ease of local workspace repair.
Whether LLM-based assistants improve or degrade code quality remains unresolved: existing studies report contradictory outcomes contingent on context and evaluation criteria.
Review finds mixed/contradictory findings across included studies regarding code quality effects.
The system tends to be factually correct when it answers but often omits information (i.e., 'the system is right when it answers — it just leaves things out').
Interpretation combining reported factual accuracy (85.5%) with low completeness (0.40) from benchmark results.
The study establishes statistically significant relationships between organizational AI adoption and compensation dynamics.
Econometric estimates (difference-in-differences and propensity score matched comparisons) using the combined datasets listed in the paper and controlling for industry, firm size, geography, occupation characteristics, and macroeconomic variables.
The study establishes statistically significant relationships between organizational AI adoption and changes in occupational structures.
Same econometric approach (difference-in-differences and propensity score matching) applied to combined datasets (Anthropic Economic Index, Census Business Trends and Outlook Survey, Federal Reserve regional surveys, labor market analytics), with controls for industry, firm size, location, occupation-level characteristics, and macroeconomic environment.
The study establishes statistically significant relationships between organizational AI adoption and changes in employment patterns in the United States during 2022–2025.
Econometric analysis using multiple large-scale data sources (Anthropic Economic Index, U.S. Census Bureau Business Trends and Outlook Survey, Federal Reserve regional surveys, labor market analytics) and methods described as difference-in-differences estimation and propensity score matching controlling for industry (NAICS 2-digit), firm size, geography, occupation characteristics, and macro conditions.
The paper extends paradox theory to conceptualise the Creativity Paradox in the context of GenAI.
Theoretical extension and conceptual development within the paper (no empirical tests reported).
Within that n=11 subset, 9 of 11 agents shift by at least 2 ranks between composite and benchmark-only rankings.
Comparison of rank positions between composite and benchmark-only rankings on the 11-agent subset; reported count of agents that moved at least 2 ranks.
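The rank-shift comparison described above can be sketched as follows; agent names and rank positions are invented for illustration, not taken from the study:

```python
# Hypothetical sketch of the rank-shift check: given each agent's position
# under the composite and benchmark-only rankings, count agents whose rank
# moves by at least 2. Agent names and ranks are invented.
composite = {"a1": 1, "a2": 2, "a3": 3, "a4": 4, "a5": 5}
benchmark_only = {"a1": 3, "a2": 1, "a3": 5, "a4": 4, "a5": 2}

shifted = [a for a in composite if abs(composite[a] - benchmark_only[a]) >= 2]
print(f"{len(shifted)} of {len(composite)} agents shift by >= 2 ranks")
```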
The four factors capture largely complementary information (n=50; ρ_max = 0.61 for Adoption-Ecosystem, all others |ρ| ≤ 0.37).
Correlation analysis among the four factor scores computed on the 50-agent sample; reported maximum inter-factor Pearson/Spearman correlation coefficients.
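The complementarity check amounts to pairwise rank correlations among factor scores, reporting the maximum absolute off-diagonal value. A minimal sketch with invented factor names and scores (the rank helper assumes no tied values):

```python
# Hypothetical sketch of the inter-factor correlation analysis: pairwise
# Spearman correlations among factor scores; the maximum absolute value
# indicates how redundant the factors are. Data here are invented.
def spearman(x, y):
    def ranks(v):  # assumes no ties
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

factors = {
    "adoption":  [0.1, 0.4, 0.35, 0.8, 0.9],
    "ecosystem": [0.2, 0.3, 0.5, 0.7, 0.95],
    "safety":    [0.9, 0.2, 0.6, 0.1, 0.4],
}
names = list(factors)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
rho = {(a, b): spearman(factors[a], factors[b]) for a, b in pairs}
print(max(abs(v) for v in rho.values()))
```

A low maximum off-diagonal correlation is what supports the "largely complementary information" reading; here the invented adoption/ecosystem pair is deliberately the most correlated, mirroring the ρ_max structure in the claim.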
Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users.
Empirical measurements from the instrumented system across concurrency up to 50 users and tier comparisons; the paper reports the observed saturation point near ~20 concurrent users.
Delegating tasks to genAI can be individually beneficial in the short term even as widespread adoption degrades future model performance (creating a social dilemma).
Result of the paper's behavioral model showing an individual-level incentive to use genAI versus a collective cost from adoption (theoretical/model-based; no empirical sample reported in abstract).
Token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens.
Observed run-to-run variability in total token counts for identical tasks across the collected agentic trajectories from eight frontier LLMs on SWE-bench Verified.
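The variability measurement reduces to a max/min ratio of total tokens per task across repeated runs; a minimal sketch with invented task IDs and token counts:

```python
# Hypothetical sketch of the run-to-run variability measurement: for each
# task, take total token counts across repeated runs and report the
# max/min ratio. Task IDs and counts are invented.
runs = {
    "task-1": [12_000, 15_500, 340_000],  # one run ballooned
    "task-2": [8_000, 9_200, 10_100],
}
ratios = {task: max(tokens) / min(tokens) for task, tokens in runs.items()}
worst = max(ratios.values())
print(ratios, f"worst spread = {worst:.1f}x")
```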
ASC (adaptive stopping criterion) halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost.
Reported experiment with ASC showing that it prevents harmful iterative refinement yet causes a measured cost described as 3.8 percentage points due to confidence elicitation.
Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ≈ 0.2%), and o4-mini (±0 pp) remain non-degrading under self-correction; GPT-5 degrades by -1.8 pp.
Reported measured changes in accuracy (percentage-point changes) and measured EIR values for the named models after applying iterative self-correction across the experiment suite.
Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction.
Empirical experiments reported across 7 LLMs and 3 benchmark datasets (GSM8K, MATH, StrategyQA) comparing outcomes of iterative self-correction as a function of measured EIR.
These efficiency gains are offset by a growing 'Efficiency-Legitimacy Paradox' (i.e., improvements in efficiency come with worsening legitimacy concerns).
Conceptual synthesis from the systematic review (2018–2026) identifying a recurring trade-off across reviewed studies; specific empirical quantification not provided in abstract.
There is a structural shift from 'street-level' bureaucracies to 'system-level' architectures, which can be defined as the institutional delegation of 'Artificial Discretion' to algorithmic infrastructures.
Synthesis from the PRISMA-guided systematic review of literature (2018–2026) reporting observed changes in administrative architectures; specific studies not enumerated in abstract.
As a General-Purpose Technology (GPT), Artificial Intelligence (AI) is fundamentally reconfiguring state capacity, as well as the mechanics of global economic management.
Systematic review of current research studies (2018–2026) conducted following PRISMA guidelines; synthesis of literature claiming broad institutional and macroeconomic effects. Number of studies not specified in abstract.
For LLM agents, memory management critically impacts efficiency, quality, and security.
Statement in paper framing and motivation; supported conceptually by literature linking memory design to system properties (no specific experimental details provided in abstract).
Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves.
Empirical analysis of authorship attribution across the 6,000 sessions in the SWE-chat dataset; percentages derived from session-level classification.
A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but DPM exposes one nondeterministic call while summarization exposes N compounding calls.
Determinism experiment with 10 replays per case at temperature zero; qualitative/quantitative observation about number of nondeterministic LLM calls exposed by each architecture.
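A determinism study of this shape can be sketched as replaying each case N times and counting distinct outputs; the `replay` function below is a stand-in stub (a seeded simulation, not the paper's harness), and case IDs are invented:

```python
# Minimal sketch of the determinism check: replay each case N times and
# count distinct outputs; more than one distinct output signals residual
# nondeterminism. `replay` is a stand-in stub, not the paper's harness.
import random

def replay(case_id: str, seed: int) -> str:
    # Simulate one temperature-zero replay whose single API call is
    # nondeterministic ~10% of the time (seeded here for repeatability).
    return "out-A" if random.Random(f"{case_id}:{seed}").random() < 0.9 else "out-B"

def distinct_outputs(case_id: str, n_replays: int = 10) -> int:
    return len({replay(case_id, seed) for seed in range(n_replays)})

print({case: distinct_outputs(case) for case in ["case-1", "case-2"]})
```

Under this framing, an architecture exposing one nondeterministic call per run (DPM) yields fewer distinct replay outputs than one whose N summarization calls each compound the nondeterminism.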
Multi-agent workflows and benchmark evaluation reveal current capabilities, limitations, and research frontiers in agentic AI for physical design.
The paper states it analyzes recent experience with multi-agent workflows and benchmark evaluation; the abstract does not provide specific benchmark names, metrics, or sample sizes.
AI is associated with a shift toward younger, relatively less educated workers.
Reported association in the paper's baseline empirical results linking AI presence/pervasiveness to changes in workforce composition (age and education).
Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI.
Authors' recommendation in the paper's conclusion based on experimental findings (performance, workload, emotion, retention outcomes).
Formal network verification has made substantial progress in proving correctness properties but is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior.
Authors' summary of the state of the art in network verification (assertion in paper; no empirical data in abstract).
Overall, the proposed HRL framework improves learning efficiency and scalability, outperforming heuristic baselines while remaining below the perfect-information oracle bound.
Results reported in the paper from simulation experiments comparing the HRL framework to heuristic baselines and the oracle; pairwise differences analyzed (Wilcoxon tests referenced). The paper asserts better performance than heuristics but still worse than the oracle.
How software developers interact with AI-powered tools, including Large Language Models (LLMs), plays a vital role in how those tools affect them.
Based on qualitative analysis of twenty-two interviews with software developers about using LLMs for software development; asserted as a central finding in the paper's analysis.
Benefits of technology and data analytics are context-dependent, with emerging markets facing unique regulatory and infrastructural barriers.
Narrative synthesis of included studies noting heterogeneity by context and reports of regulatory/infrastructural constraints in emerging markets.
Cybersecurity has a moderating effect on audit data analytics.
Synthesis statement in the review summarizing included studies that report cybersecurity influences the effectiveness/usability of audit data analytics.
CLARITI matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions.
Empirical evaluation comparing CLARITI and GPT-5 on a task set of underspecified software engineering issues; the result reported in the abstract indicates parity in resolution rate and a quantified reduction in questions (41%) but the abstract does not report sample size, test set composition, or statistical significance.
Such systems can produce fluent outputs that resemble reflection, but they lack temporal continuity, causal feedback, and anchoring in real-world interaction.
Descriptive claim made in the text contrasting surface-level fluency with missing properties; no empirical data or experiments provided.
This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.
Paper's stated contribution presenting theory and conceptual groundwork; no empirical validation provided in the abstract.
The LLM fallacy has implications for education, hiring, and AI literacy.
Implications and argumentation presented in the paper; these are prospective and conceptual rather than supported by empirical data in the abstract.
Further research is needed to explore the longitudinal impact of these AI deployments on local labor markets and the creation of indigenous datasets that reflect Cameroon’s unique linguistic diversity.
Authors' identified research gaps and recommendations; statement of future research needs rather than empirical result.
Removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success.
Qualitative and quantitative comparisons from the deployed evaluation across the three conditions (observations about turn counts, validation-feedback loops, and model hallucinations in unconstrained condition over the 25 scenario trials).
Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model.
Reported experimental results aggregated across two practical settings (AI consultancy and AI software team) and 12 tasks; direct comparison between AI Organizations of aligned models and a single aligned model.
Multi-agent "AI organizations" are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents.
Experimental comparison reported in the paper: experiments comparing multi-agent AI organizations to single aligned agents across tasks and settings (described below).
Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance.
Paper reports instances where top-performing (frontier) models outperform aggregate human expert accuracy on SciPredict, but concludes overall accuracies are insufficient for reliable experimental guidance.