Evidence (4137 claims)
- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
Governance
Series 2 consisted of local and API open-source systems (n = 6) administered in both blind and declared modes, with four systems re-administered under declared conditions.
Methods description detailing Series 2 composition and modes (blind and declared), noting that four systems were re-tested under declared conditions.
Series 1 consisted of frontier commercial systems administered blind (n = 7).
Methods description specifying Series 1 composition and blind administration.
The study employed 24 experimental conditions spanning 13 distinct LLM systems across two series.
Study design reported in Methods: Series 1 (frontier commercial, blind, n=7), Series 2 (local/API open-source, blind and declared, n=6), plus re-administered declared runs and ceiling-probe runs summing to 24 conditions.
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.
Data and methods section specifying dataset (NYSE TAQ SPY, 2015–2024), the number of pre-specified hypotheses (six), and experimental protocol with 150 autonomous agents.
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.
Study design description: recorded agents' methodological choices (measure selection, estimation procedures), resulting estimates, and tracked switching and dispersion metrics (IQR) across the three-stage protocol applied to SPY TAQ data (2015–2024) with 150 agents.
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.
Three-stage protocol: after stage 1 (independent analyses) and stage 2 (AI peer review), dispersion across agents (e.g., IQR) showed little change following the peer-review stage for each of the six hypotheses in the 150-agent pool.
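A minimal sketch of how the dispersion and switching metrics described above could be computed; the data, method labels, and magnitudes are illustrative assumptions, not values from the study:

```python
import numpy as np

def iqr(estimates):
    """Interquartile range: a robust dispersion measure across agents' estimates."""
    q75, q25 = np.percentile(estimates, [75, 25])
    return q75 - q25

def switching_rate(choices_before, choices_after):
    """Fraction of agents whose methodological choice changed between stages."""
    changed = sum(b != a for b, a in zip(choices_before, choices_after))
    return changed / len(choices_before)

# Hypothetical data: 150 agents' effect estimates for one hypothesis,
# before and after the AI peer-review stage.
rng = np.random.default_rng(0)
stage1 = rng.normal(0.0, 1.0, 150)
stage2 = stage1 + rng.normal(0.0, 0.05, 150)   # peer review barely moves estimates
choices1 = rng.choice(["ols", "quantile", "kernel"], 150)
choices2 = choices1.copy()
choices2[:8] = "ols"                           # a few agents switch after critique

print(f"IQR before review: {iqr(stage1):.3f}")
print(f"IQR after review:  {iqr(stage2):.3f}")  # minimal reduction, as reported
print(f"Switching rate:    {switching_rate(choices1, choices2):.3f}")
```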
The work is qualitative and exploratory: it presents naturalistic phenomena rather than causal empirical estimates and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
Future empirical work should measure calibration (user trust vs. model accuracy), hallucination rate, user comprehension of capability limits, and behavioral dependence on system recommendations.
Explicit methodological recommendations and suggested metrics in the paper; these are proposed future measurements rather than reported findings.
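Since these are proposed rather than reported measurements, the sketch below shows one plausible operationalization of two of them (calibration gap and hallucination rate); all names and data are hypothetical:

```python
import numpy as np

def calibration_gap(user_trust, model_correct):
    """Mean user-reported trust (0-1) minus observed model accuracy.
    Positive values indicate over-trust; negative values, under-trust."""
    return float(np.mean(user_trust) - np.mean(model_correct))

def hallucination_rate(outputs, is_unsupported):
    """Share of outputs judged unsupported by the source material."""
    flags = [is_unsupported(o) for o in outputs]
    return sum(flags) / len(flags)

# Hypothetical session data: per-interaction trust ratings and correctness labels.
trust = np.array([0.9, 0.8, 0.95, 0.7, 0.85])
correct = np.array([1, 0, 1, 0, 1])
print(f"Calibration gap: {calibration_gap(trust, correct):+.2f}")  # > 0 means over-trust
```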
Conversational AI differs from interpersonal conversation: it has no true beliefs/intentions or accountability and produces probabilistic, sometimes inconsistent outputs with opaque training/data provenance.
Analytical/distinctive claim based on properties of LLMs and machine learning models discussed in the paper; conceptual analysis, no empirical testing.
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
Methodology is primarily conceptual and normative: the paper synthesizes policy texts, safety standards, and crisis-management literature and relies on illustrative mappings and thought experiments rather than new empirical field data.
Authors' methodological description in the Data & Methods section (explicit statement about sources and use of thought experiments).
The paper defines and specifies four oversight modes (spanning near-full autonomy to strict human control) and provides criteria for selecting modes based on task complexity, risk level, and consequence severity.
Conceptual taxonomy developed in the paper; mapping exercises and triage framework (risk–complexity–consequence) presented as illustrative mappings (no empirical testing).
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
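A minimal sketch of the two quantities these proposed validations would estimate, under the simplifying assumption that task performance is summarized by a single accuracy score; the numbers are illustrative:

```python
def fidelity_loss(llm_score, extracted_rules_score):
    """Performance forfeited when an LLM's behavior is distilled into an
    explicit rule representation; a large loss suggests the task depends
    on tacit, rule-irreducible capability."""
    return llm_score - extracted_rules_score

def generalization_gap(in_distribution_score, shifted_score):
    """Performance drop under distribution shift; a small gap suggests
    transferable capability rather than memorized surface rules."""
    return in_distribution_score - shifted_score

# Hypothetical benchmark results on an allegedly rule-encodable task.
print(f"Fidelity loss:      {fidelity_loss(0.92, 0.74):.2f}")
print(f"Generalization gap: {generalization_gap(0.92, 0.88):.2f}")
```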
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
Measuring the marginal cost of runtime governance, mapping the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
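One way the threshold mapping could look in code; the action tiers and cutoffs below are placeholder assumptions, since calibrating them is precisely the open problem the paper identifies:

```python
from dataclasses import dataclass

@dataclass
class EnforcementPolicy:
    """Map a calibrated violation probability to a runtime enforcement action.
    Thresholds are placeholders: setting them means trading false positives
    (blocking useful work) against false negatives (letting violations through)."""
    warn_at: float = 0.2
    review_at: float = 0.5
    block_at: float = 0.8

    def action(self, p_violation: float) -> str:
        if p_violation >= self.block_at:
            return "block"      # halt the agent action outright
        if p_violation >= self.review_at:
            return "escalate"   # route to human review
        if p_violation >= self.warn_at:
            return "warn"       # log and notify, allow to proceed
        return "allow"

policy = EnforcementPolicy()
for p in (0.05, 0.3, 0.6, 0.9):
    print(p, "->", policy.action(p))
```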
Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.
Limitations section stating recruitment sources, small N, and bias toward severe cases.
The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).
Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
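A minimal sketch of how two of these proposed metrics (contest frequency and time-to-help by group) could be computed from channel logs; the log schema and groups are hypothetical:

```python
from statistics import median

# Hypothetical contest-channel logs: (user_group, contested, seconds_to_resolution)
logs = [
    ("group_a", True, 120), ("group_a", False, None), ("group_b", True, 600),
    ("group_b", True, 480), ("group_a", True, 90),    ("group_b", False, None),
]

def contest_frequency(rows):
    """Share of interactions in which the user invoked the contest channel."""
    return sum(r[1] for r in rows) / len(rows)

def time_to_help(rows, group):
    """Median resolution time for contested decisions within one user group;
    comparing groups surfaces access disparities."""
    times = [r[2] for r in rows if r[0] == group and r[2] is not None]
    return median(times) if times else None

print(f"Contest frequency: {contest_frequency(logs):.2f}")
for g in ("group_a", "group_b"):
    print(g, "median time-to-help (s):", time_to_help(logs, g))
```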
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
The pattern requires legibility at the contact point: when deferring or prioritizing, the robot must clearly communicate which mode is active and why.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
Metrics used to evaluate agents include operational stability (e.g., performance variance or the frequency of catastrophic failures), efficiency (e.g., cost, profit, or fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
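A minimal sketch of two of these metric families (operational stability and degradation), assuming per-episode rewards, failure flags, and per-difficulty scores are available; the function names and data are illustrative, not from the paper:

```python
import numpy as np

def operational_stability(rewards, failure_flags):
    """Lower reward variance and fewer catastrophic failures indicate a more stable agent."""
    return {"reward_variance": float(np.var(rewards)),
            "catastrophic_failure_rate": float(np.mean(failure_flags))}

def degradation_slope(difficulty_levels, scores):
    """Slope of score vs. difficulty; more negative means faster degradation."""
    return float(np.polyfit(difficulty_levels, scores, 1)[0])

# Hypothetical per-episode results for one agent across difficulty tiers.
rewards = [10.2, 9.8, 7.5, 10.1, 2.0]
failures = [0, 0, 0, 0, 1]
print(operational_stability(rewards, failures))
print("degradation slope:", degradation_slope([1, 2, 3, 4, 5], [0.9, 0.85, 0.7, 0.5, 0.3]))
```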
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
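A minimal sketch of the two-timescale loop this describes; the env/strategist/executor interfaces and the strategy_period parameter are assumptions for illustration, not the paper's implementation:

```python
def run_ese(env, strategist, executor, horizon, strategy_period=50):
    """Two-tier control loop: the strategist revises the high-level plan
    every `strategy_period` steps (slow timescale); the executor selects
    an action every step, conditioned on the current strategy (fast timescale)."""
    history = []
    strategy = strategist.plan(env.observe(), history)
    for t in range(horizon):
        if t > 0 and t % strategy_period == 0:
            strategy = strategist.plan(env.observe(), history)  # slow update
        action = executor.act(env.observe(), strategy)          # fast, per-step
        reward = env.step(action)
        history.append((t, action, reward))
    return history
```

Keeping strategy updates on a slower cadence is what yields the stated interpretability benefit: the plan in force at any step is an inspectable artifact rather than an implicit byproduct of per-step action selection.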
RetailBench environments are progressively more challenging, stress-testing adaptation and planning as complexity, stochasticity, and non-stationarity increase.
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.
Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.
Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.
Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).
Experiments were conducted on three benchmarks and across multiple LLM families to evaluate generation, scoring, calibration, robustness, and efficiency dimensions.
Data & Methods section summary in the paper stating systematic evaluation across three benchmarks and a variety of LLMs and verifiers.
Complete provenance of training data is often unavailable, so contamination detection is imperfect: some leakage may go undetected, and contamination may be overestimated in some categories.
Authors' stated limitation about unavailable/partial training-data provenance and methodological caveats for the lexical-matching pipeline and behavioral probes.
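A minimal sketch of the kind of lexical-matching check such a pipeline relies on; the n-gram size and scoring are assumptions, and, per the caveat above, verbatim overlap can both miss paraphrased leakage and flag benign boilerplate:

```python
def ngrams(text, n=8):
    """Set of contiguous word n-grams in a lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_score(benchmark_item, training_doc, n=8):
    """Share of the benchmark item's n-grams appearing verbatim in a training
    document; high overlap suggests possible contamination."""
    b, t = ngrams(benchmark_item, n), ngrams(training_doc, n)
    return len(b & t) / len(b) if b else 0.0

# Hypothetical benchmark question and training-document snippet.
item = "Which of the following best describes the role of mitochondria in the cell"
doc = "a passage noting the following best describes the role of mitochondria in the cell today"
print(f"overlap: {overlap_score(item, doc):.2f}")
```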
Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.
Authors' limitations: experiments were conducted only on the MMLU dataset (513 questions) and the six listed models; generalizability is therefore uncertain.
BenchPreS defines two complementary metrics—Misapplication Rate (MR) and Appropriate Application Rate (AAR)—to quantify over‑application and correct personalization, respectively.
Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.
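Translating those definitions into code, under the assumption that each test case carries a binary "applied" outcome and a binary "appropriate" label (the paper's exact scoring may differ):

```python
def misapplication_rate(applied, appropriate):
    """MR: among cases where personalization should NOT apply, the fraction
    where the model applied it anyway (over-application)."""
    cases = [a for a, ok in zip(applied, appropriate) if not ok]
    return sum(cases) / len(cases) if cases else 0.0

def appropriate_application_rate(applied, appropriate):
    """AAR: among cases where personalization SHOULD apply, the fraction
    where the model did apply it (correct personalization)."""
    cases = [a for a, ok in zip(applied, appropriate) if ok]
    return sum(cases) / len(cases) if cases else 0.0

# Hypothetical labels: applied[i] = model personalized; appropriate[i] = it should have.
applied     = [1, 1, 0, 1, 0, 1]
appropriate = [True, False, True, True, False, False]
print("MR: ", misapplication_rate(applied, appropriate))
print("AAR:", appropriate_application_rate(applied, appropriate))
```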
Research priorities include empirical testing and simulation of ISB-based control systems, cost–benefit analysis of proactive versus reactive AI governance, and distributional impact assessments.
Explicit research agenda proposed by the author (conceptual recommendation), not empirical results.
Key empirical metrics introduced and used are: AI adoption rates (sector-level intensity), a skill shift index, hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
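A minimal sketch of how two of the constructed metrics might be computed; the formulas here (an L1 dissimilarity for the skill shift index, a simple share for hybrid jobs) are plausible readings, not the paper's stated definitions:

```python
import numpy as np

def skill_shift_index(skill_shares_t0, skill_shares_t1):
    """Half the L1 distance between two skill-share distributions:
    0 = no change in the skill mix, 1 = complete turnover."""
    diff = np.abs(np.array(skill_shares_t1) - np.array(skill_shares_t0))
    return 0.5 * float(diff.sum())

def hybrid_job_share(job_flags):
    """Share of postings/jobs classified as hybrid (AI plus domain skills)."""
    return sum(job_flags) / len(job_flags)

# Hypothetical sector data: skill-share distributions over four skill groups.
print("Skill shift:", skill_shift_index([0.4, 0.3, 0.2, 0.1], [0.3, 0.3, 0.25, 0.15]))
print("Hybrid share:", hybrid_job_share([1, 0, 0, 1, 1, 0]))
```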
The study's main limitations include reliance on a simulated dataset rather than exhaustive administrative microdata, literature limited to selected publishers/years, and correlational (not causal) identification of some effects.
Authors' explicitly stated limitations in the paper's methods and discussion sections describing data choices (simulated dataset, selected publishers 2020–2024) and the observational/correlational nature of several analyses.
Limitation: Implementation heterogeneity — the costs and feasibility of the recommended HR changes vary by context and may affect generalisability.
Explicit limitation acknowledged in the paper; drawn from theoretical reasoning about contextual heterogeneity and practitioner variability.
Limitation: The framework is conceptual and requires empirical validation across sectors, firm sizes and AI‑intensity levels.
Explicit limitation acknowledged by the authors; based on the paper's method (theoretical synthesis, no original data).
The paper generates empirically testable propositions (e.g., how leader practices affect AI adoption speed, task reallocation, productivity, error rates, employee well‑being and turnover) and suggests natural‑experiment settings for evaluation.
Stated methodological output of the conceptual synthesis; the paper lists candidate empirical tests and research opportunities but contains no original empirical tests.
The available evidence consists mainly of promising empirical studies and case studies, but there are few long-run, generalized ROI or productivity estimates; results are heterogeneous across therapeutic areas.
Self-described limitation of the narrative review: heterogeneity of study designs and outcomes precluded pooled quantitative estimates and long-run ROI assessment.