Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.

Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.

high null result Fanar 2.0: Arabic Generative AI Stack training token count and dataset composition (three recipes)

Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.

Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).

high null result Fanar 2.0: Arabic Generative AI Stack compute infrastructure (GPU count & location)

Experiments were conducted on three benchmarks and across multiple LLM families to evaluate generation, scoring, calibration, robustness, and efficiency dimensions.

Data & Methods section summary in the paper stating systematic evaluation across three benchmarks and a variety of LLMs and verifiers.

high null result Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... experimental coverage (benchmarks and model families)

Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories).

Authors' stated limitation about unavailable/partial training-data provenance and methodological caveats for the lexical-matching pipeline and behavioral probes.

high null result Are Large Language Models Truly Smarter Than Humans? uncertainty in contamination detection accuracy due to incomplete provenance

Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.

Authors' limitations: experiments were conducted only on the MMLU dataset (513 questions) and on the listed six models; generalizability is therefore uncertain.

high null result Are Large Language Models Truly Smarter Than Humans? generalizability of contamination findings to other benchmarks/models

A three-layer evaluation framework was applied systematically: Layer 1 = syntactic validity; Layer 2 = semantic correctness; Layer 3 = hardware executability (with sublayer 3b = end-to-end evaluation on quantum hardware).

Methods section describes application of a three-layer evaluation framework to each reviewed system, including the explicit sublayer 3b definition.

high null result Generative AI for Quantum Circuits and Quantum Code: A Techn... evaluation framework definition and application
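
The three-layer framework above can be pictured as an ordered series of checks on a generated artifact. The sketch below is illustrative only: the `Layer` enum, the `highest_layer_passed` helper, the stop-at-first-failure gating, and the toy string checks are all assumptions, not the review's actual tooling.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Layer(Enum):
    # Layer names follow the framework described in the entry above.
    SYNTACTIC = "1: syntactic validity"
    SEMANTIC = "2: semantic correctness"
    HARDWARE = "3: hardware executability"
    HARDWARE_E2E = "3b: end-to-end evaluation on quantum hardware"


def highest_layer_passed(
    artifact: str,
    checks: Dict[Layer, Callable[[str], bool]],
) -> Optional[Layer]:
    """Run the supplied checks in layer order (hypothetical gating rule).

    Stops at the first layer that has no check or whose check fails, and
    returns the last layer passed, or None if even Layer 1 did not pass.
    """
    passed = None
    for layer in Layer:  # Enum iteration follows definition order
        check = checks.get(layer)
        if check is None or not check(artifact):
            break
        passed = layer
    return passed


# Toy checks for illustration only; real checks would parse and simulate.
toy_checks = {
    Layer.SYNTACTIC: lambda a: a.startswith("OPENQASM"),
    Layer.SEMANTIC: lambda a: "h q" in a,
}
result = highest_layer_passed("OPENQASM 3; h q[0];", toy_checks)
```

Here `result` is `Layer.SEMANTIC`, because no hardware check was supplied; a system evaluated only in simulation would top out at the same layer.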

The review grouped training regimes across the systems as supervised fine-tuning, verifier-in-the-loop reinforcement learning (RL), diffusion/graph generation, and agentic optimization.

Surveyed systems' training descriptions were classified into these training-regime categories during the review's analytical synthesis.

high null result Generative AI for Quantum Circuits and Quantum Code: A Techn... training regimes present among reviewed systems

The review organized artifacts along artifact-type axes: Qiskit code, OpenQASM programs, and circuit graphs.

Analytical organization described in the methods: artifact-type axis enumerated as Qiskit, OpenQASM, and circuit graphs across the surveyed systems.

high null result Generative AI for Quantum Circuits and Quantum Code: A Techn... artifact types covered in the field synthesis

"Quantum code" in this review is defined as program artifacts (Qiskit code, OpenQASM); quantum error-correcting code (QEC) generation was excluded.

Inclusion/exclusion criteria specified in the review explicitly limited scope to program artifacts such as Qiskit and OpenQASM and excluded QEC-focused works.

high null result Generative AI for Quantum Circuits and Quantum Code: A Techn... scope definition (inclusion/exclusion of QEC)

A structured scoping review (Hugging Face, arXiv, provenance tracing; Jan–Feb 2026) identified 13 generative systems and 5 supporting datasets relevant to quantum circuit / quantum code generation.

Structured search of Hugging Face model/dataset listings, arXiv literature, and provenance tracing conducted between January and February 2026; results yielded 13 systems and 5 datasets (sample counts reported in the review).

high null result Generative AI for Quantum Circuits and Quantum Code: A Techn... number of generative systems and datasets identified (13 systems, 5 datasets)

The reinforcement learning objective optimizes a combined utility that trades off task success and resource costs; the reward penalizes delays and failures.

The Learning method section describes training the high-level orchestrator with an RL reward that penalizes delays (latency/resource consumption) and failures; algorithmic and hyperparameter details are provided in the paper.

high null result When Should a Robot Think? Resource-Aware Reasoning via Rein... training objective: combined utility of task success and resource cost
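
As a rough illustration of the combined-utility objective in the entry above, the sketch below scores one episode as a success term minus penalties for failure and delay. The function name, the default weights, and the linear form are assumptions made for illustration; the paper's actual reward and hyperparameters are not reproduced here.

```python
def episode_reward(
    success: bool,
    latency_s: float,
    success_bonus: float = 1.0,    # assumed value
    failure_penalty: float = 1.0,  # assumed value
    latency_weight: float = 0.01,  # assumed value, per second of delay
) -> float:
    """Hypothetical combined utility: reward success, penalize failure and delay."""
    outcome = success_bonus if success else -failure_penalty
    return outcome - latency_weight * latency_s
```

Under this form, a successful episode with longer reasoning delays earns less than an equally successful faster one, which is the trade-off the orchestrator is described as learning.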

The experiments use empirical LLM latency profiles measured from ALFRED tasks to model realistic inference delays in simulation.

Environment/evaluation description states use of an embodied task suite based on ALFRED and empirical latency profiles to model realistic LLM inference delays.

high null result When Should a Robot Think? Resource-Aware Reasoning via Rein... latency modeling (empirical latency profiles)

Baselines for comparison include fixed reasoning strategies (always reason, never reason), heuristic triggers for invoking LLMs, and ablations of RARRL components.

Paper lists these baselines explicitly in the Baselines and comparisons section and reports experiments comparing RARRL to them.

high null result When Should a Robot Think? Resource-Aware Reasoning via Rein... baseline policy types used for comparison

The high-level orchestration policy uses observations that include current sensory observation, execution history, and remaining resources (e.g., remaining time or compute budget).

Key Points and Methods specify the observation space used by the orchestrator, listing sensory inputs, execution history, and resource remaining as inputs.

high null result When Should a Robot Think? Resource-Aware Reasoning via Rein... policy input features (sensory observation, execution history, remaining resourc...

RARRL trains only a high-level orchestration policy via reinforcement learning and does not retrain the existing low-level control/policy modules end-to-end.

Methods/Model architecture describe a hierarchical approach where low-level controllers are existing modules and are not retrained; RL is applied to the high-level orchestrator.

high null result When Should a Robot Think? Resource-Aware Reasoning via Rein... level of learning: high-level orchestration policy trained vs. low-level control...

RARRL (Resource-Aware Reasoning via Reinforcement Learning) is a hierarchical orchestration framework that learns a high-level policy to decide when an embodied agent should invoke LLM-based reasoning, which reasoning role to use, and how much compute budget to allocate.

Paper describes a hierarchical design with a learned high-level RL orchestrator that issues discrete decisions about reasoning invocation, reasoning role/mode, and compute budget allocation; architecture and decision space specified in Methods.

high null result When Should a Robot Think? Resource-Aware Reasoning via Rein... decision variables: whether to call an LLM, reasoning role/mode selected, comput...

BenchPreS defines two complementary metrics—Misapplication Rate (MR) and Appropriate Application Rate (AAR)—to quantify over‑application and correct personalization, respectively.

Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.

high null result BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Definition and use of MR and AAR metrics
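
A minimal sketch of how the two metrics above could be computed, assuming each test case records whether applying the stored preference was appropriate and whether the model applied it. The `(should_apply, did_apply)` encoding, the function name, and the choice of denominators (MR over inappropriate contexts, AAR over appropriate ones) are assumptions; the benchmark's own scoring procedure may differ.

```python
from typing import Iterable, Tuple


def mr_and_aar(cases: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (MR, AAR) from (should_apply, did_apply) pairs.

    MR:  share of inappropriate contexts where the preference was applied.
    AAR: share of appropriate contexts where the preference was applied.
    An empty group yields 0.0 (an arbitrary convention for this sketch).
    """
    cases = list(cases)
    inappropriate = [did for should, did in cases if not should]
    appropriate = [did for should, did in cases if should]
    mr = sum(inappropriate) / len(inappropriate) if inappropriate else 0.0
    aar = sum(appropriate) / len(appropriate) if appropriate else 0.0
    return mr, aar


# Made-up example: 2 appropriate contexts, 4 inappropriate ones.
example = [
    (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
mr, aar = mr_and_aar(example)
```

A model that applies preferences indiscriminately would push both numbers toward 1, so a low MR together with a high AAR is the desirable combination.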

Pilot randomized or quasi-experimental implementations of reduced workweeks (across firms, industries, or regions) are needed to measure effects on employment, productivity, wages, and consumption.

Research-design recommendation motivated by lack of contemporary causal evidence; not an empirical finding but a stated priority for rigorous testing.

high null result A Shorter Workweek as a Policy Response to AI-Driven Labor D... measured causal effects of reduced workweeks on employment, productivity, wages,...

There is limited direct causal identification separating technology-driven layoffs from incentive-driven layoffs in current firm-level data, creating a need for new firm-panel datasets linking AI adoption, executive pay/ownership, layoff decisions, and local demand outcomes.

Stated limitation of the paper and research-priority recommendation; assessment based on literature gaps noted in the synthesis rather than empirical gap quantification.

high null result A Shorter Workweek as a Policy Response to AI-Driven Labor D... availability/coverage of firm-level panel data capable of separating AI effects ...

Observed layoffs should be treated in empirical research as outcomes of firm governance and incentive structures; econometric studies estimating displacement from AI must control for managerial incentives and financial pressures.

Methodological recommendation based on the conceptual argument and literature linking governance/incentives to firm behavior; no new empirical demonstration provided.

high null result A Shorter Workweek as a Policy Response to AI-Driven Labor D... bias in estimated causal effect of AI on layoffs when not controlling for manage...

Research priorities include empirical testing and simulation of ISB-based control systems, cost–benefit analysis of proactive versus reactive AI governance, and distributional impact assessments.

Explicit research agenda proposed by the author (conceptual recommendation), not empirical results.

high null result DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... N/A (research agenda recommendation rather than an empirical outcome)

Key empirical metrics introduced and used are: AI adoption rates (sector-level intensity), Skill shift index, Hybrid job share, and employment levels/net changes by sector.

Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).

high null result AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... Defined metrics (AI adoption rate, Skill shift index, Hybrid job share, Employme...

The study's main limitations include reliance on a simulated dataset rather than exhaustive administrative microdata, literature limited to selected publishers/years, and correlational (not causal) identification of some effects.

Authors' explicitly stated limitations in the paper's methods and discussion sections describing data choices (simulated dataset, selected publishers 2020–2024) and the observational/correlational nature of several analyses.

high null result AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... Study validity/generalizability limitations

Further research is needed—randomized controlled trials, long-term impact measurement (earnings, employment stability, skill accumulation), distributional analysis, and model audits for bias.

Authors' stated research agenda and recommendations; not an empirical finding but a methodological recommendation following the pilot.

high null result AI-Driven Skill Mapping and Gig Economy Matching Algorithm f... long-term earnings, employment stability, skill accumulation, distributional out...

The authors explicitly note limitations: the study focuses on prediction (not causation), results are sensitive to data quality, workforce records may contain biases, and practical constraints like privacy and deployment complexity limit direct operational adoption.

Limitations section described by the authors listing prediction-versus-causation distinction, sensitivity to data quality, potential biases, privacy concerns, and deployment complexity.

high null result Adoption of AI-Based HR Analytics and Its Impact on Firm Pro... Scope and limitations of study conclusions (qualitative)

The study used a reproducible modeling pipeline (data cleaning, feature engineering, model training and tuning, systematic evaluation) applied to several freely available workforce datasets to enable replication.

Methods section describes a reproducible workflow including preprocessing steps, engineered features, hyperparameter tuning for each model class, cross-validation, and use of publicly available datasets.

high null result Adoption of AI-Based HR Analytics and Its Impact on Firm Pro... Reproducibility of predictive modeling workflow (procedural, not an empirical pe...

This work is conceptual/theoretical and reports no original empirical dataset; it explicitly calls for mixed-methods empirical validation (case studies, field experiments, longitudinal studies), measurement development, and multi-level data collection.

Explicit methodological statement in the paper describing its nature as a theoretical synthesis and listing empirical needs; no empirical sample provided.

high null result Revolutionizing Human Resource Development: A Theoretical Fr... presence/absence of original empirical data in the paper (none)

Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams.

Benchmarking experiment described in the study: four autonomous AI agents evaluated on the identical fresh challenge set used in the live onsite CTF.

high null result Understanding Human-AI Collaboration in Cybersecurity Compet... agent performance metrics on the fresh CTF challenge set (success rates, traject...

Data and methods: the study used an online experiment with 861 online-retail employees performing short-duration, virtual, task-focused collaborations; analyses focused on direct effects, moderation (emotion and partner type), mediation (service empathy), and moderated-mediation.

Methods description in the paper specifying design, sample size (n = 861), task context (temporary virtual teamwork), and analytic approach (hypothesis tests including moderation and mediation analyses).

high null result Adoption of AI partners in temporary tasks: exploring the ef... N/A (methodological claim about study design and analyses)

Teamwork partner type (human vs AI) has no direct, significant effect on collaboration proficiency for temporary virtual tasks.

Online experiment with employees in the online-retail industry (n = 861). Hypothesis testing showed no significant main effect of partner type on the outcome variable 'collaboration proficiency' in the reported analyses.

high null result Adoption of AI partners in temporary tasks: exploring the ef... collaboration proficiency

Empirical strategy: the main identification strategy uses panel regressions with a quadratic AI term and interaction terms for moderators, controlling for firm covariates and including fixed effects, with robustness checks (alternative measures, sub-samples).

Methods section description: panel regressions including AI and AI^2, interactions for moderators, controls, fixed effects, and robustness analyses reported in the paper.

high null result Attention to Whom? AI Adoption and Corporate Social Responsi... N/A (methodological claim)
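
One generic way to write a specification of the kind described in the entry above is shown below. All symbols are assumptions introduced for illustration: Y_{it} is the outcome for firm i in year t, M_{it} a moderator, X_{it} the firm covariates, and \mu_i and \lambda_t firm and year fixed effects. The paper's exact variables and fixed-effect structure are not stated in this listing.

```latex
Y_{it} = \beta_1 \mathrm{AI}_{it} + \beta_2 \mathrm{AI}_{it}^{2}
       + \beta_3 \left( \mathrm{AI}_{it} \times M_{it} \right) + \beta_4 M_{it}
       + \gamma^{\top} X_{it} + \mu_i + \lambda_t + \varepsilon_{it}
```

In a quadratic form like this, and ignoring the interaction term, the fitted relationship changes direction at \mathrm{AI}^{*} = -\beta_1 / (2\beta_2), which is why such specifications are commonly used to test for U-shaped or inverted-U effects.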

Data/sample claim: the empirical analysis uses a panel of 2,575 Chinese listed firms observed from 2013 to 2023.

Paper-stated sample description (panel dataset covering 2013–2023, N = 2,575 firms).

high null result Attention to Whom? AI Adoption and Corporate Social Responsi... N/A (sample description)

The paper recommends an empirical research agenda including field experiments comparing teams with and without AI mediation, structural models of labor supply and wages under reduced language frictions, microdata analysis of adopters, and measurement studies for coordination costs and mediated-action reliability.

Explicit recommendations and research agenda stated in the paper; this is a descriptive claim about the paper's content rather than an empirical finding.

high null result AI as a universal collaboration layer: Eliminating language ... existence of the recommended research agenda items in the paper

The paper's primary approach is conceptual/theoretical development and agenda-setting; it does not report large-scale empirical or experimental data.

Explicit methods statement in the paper: synthesis, illustrative examples, framework development; absence of reported empirical sample or experiments.

high null result AI as a universal collaboration layer: Eliminating language ... presence/absence of empirical/experimental data in the paper

The study's empirical base consists of 40 semi-structured interviews with cross-industry project practitioners in the UK, analyzed using thematic qualitative methods.

Stated data and methods in the paper: sample size (40), interview method, cross-industry sampling, and thematic analysis.

high null result AI in project teams: how trust calibration reconfigures team... study sample and methodology (empirical basis)

Limitation: Implementation heterogeneity — the costs and feasibility of the recommended HR changes vary by context and may affect generalisability.

Explicit limitation acknowledged in the paper; drawn from theoretical reasoning about contextual heterogeneity and practitioner variability.

high null result Symbiarchic leadership: leading integrated human and AI cybe... implementation costs; feasibility; effect on generalisability

Limitation: The framework is conceptual and requires empirical validation across sectors, firm sizes and AI‑intensity levels.

Explicit limitation acknowledged by the authors; based on the paper's method (theoretical synthesis, no original data).

high null result Symbiarchic leadership: leading integrated human and AI cybe... generalizability and empirical validity across contexts

The paper generates empirically testable propositions (e.g., how leader practices affect AI adoption speed, task reallocation, productivity, error rates, employee well‑being and turnover) and suggests natural‑experiment settings for evaluation.

Stated methodological output of the conceptual synthesis; the paper lists candidate empirical tests and research opportunities but contains no original empirical tests.

high null result Symbiarchic leadership: leading integrated human and AI cybe... AI adoption speed; task reallocation; productivity; error rates; employee well‑b...

Typical methods used are deep learning for property prediction and representation learning, protein-structure modelling tools, generative models for de novo design, NLP for knowledge extraction, and ADME/Tox in silico models integrated with traditional computational chemistry.

Methodological survey in the paper listing these approaches and examples of their application.

high null result Has AI Reshaped Drug Discovery, or Is There Still a Long Way... methods deployed in AI-driven drug discovery workflows

Commonly used data types in AI-driven drug discovery include biochemical/binding assay data, protein structural data, HTS results, ADME/Tox and PK datasets, omics/phenotypic readouts, and scientific literature/patents.

Cataloguing of data sources used across studies and company pipelines described in the paper.

high null result Has AI Reshaped Drug Discovery, or Is There Still a Long Way... types of datasets employed in model training and discovery workflows

AI became widely adopted in pharmaceutical discovery during the 2010s, driven by greater compute, larger datasets, and advances in deep learning.

Historical overview and trend analysis in the paper referencing increased compute availability, growth in public and proprietary datasets, and the rise of deep-learning publications and tools over the 2010s.

high null result Has AI Reshaped Drug Discovery, or Is There Still a Long Way... timeline and adoption rate of AI methods in pharmaceutical discovery

The available evidence consists mainly of promising empirical studies and case studies, but there are few long-run, generalized ROI or productivity estimates; results are heterogeneous across therapeutic areas.

Self-described limitation of the narrative review: heterogeneity of study designs and outcomes precluded pooled quantitative estimates and long-run ROI assessment.

high null result From Algorithm to Medicine: AI in the Discovery and Developm... evidence quality (availability of long-run ROI/productivity estimates) and heter...

AI applications span the full drug development pipeline, including target discovery, in silico screening and de novo design, preclinical safety models, clinical trial design and patient selection/monitoring, and post-marketing surveillance.

Comprehensive literature synthesis across preclinical, clinical, and post-marketing sources in the narrative review summarizing documented uses across these stages.

high null result From Algorithm to Medicine: AI in the Discovery and Developm... coverage of pipeline stages by AI applications (scope)

Current evidence is illustrative rather than systematic; there is a lack of long-run, quantitative measures of AI’s effect on late-stage clinical outcomes in the literature reviewed.

Explicit methodological statement in the paper: study is an expert/opinion synthesis and narrative review with no new causal econometric estimates or primary experimental data.

high null result Learning from the successes and failures of early artificial... existence/availability of long-run quantitative measures linking AI adoption to ...

Suggested metrics for researchers and investors to monitor include R&D cycle time, cost per IND/NDA, proportion of projects using AI, success rates at development stages, market concentration measures, and investment flows into AI-enabled biotech vs incumbents.

Recommendations made in the Implications section as metrics to watch; no empirical tracking or baseline measures provided.

high null result AI as the Catalyst for a New Paradigm in Biomedical Research recommended monitoring metrics for AI impact in pharma/biotech

Limitations of the analysis include limited empirical validation of archetypes or impacts and potential selection bias toward prominent firms and technologies.

Explicit limitations stated in the Data & Methods section of the paper.

high null result AI as the Catalyst for a New Paradigm in Biomedical Research generalizability and representativeness of the paper's claims

The paper is an editorial/conceptual synthesis rather than a primary empirical study: it uses qualitative analysis and illustrative examples, and reports no new quantitative estimates.

Explicit statement in the Data & Methods section of the paper describing document type, approach, evidence base, and limitations.

high null result AI as the Catalyst for a New Paradigm in Biomedical Research empirical evidence provision (absence of new quantitative data)

Ethical oversight and governance (addressing bias, consent, downstream risks) are critical constraints that must be addressed for AI to generate sustained benefits.

Normative synthesis referencing common ethical concerns; no empirical evaluation of oversight mechanisms in the paper.

high null result AI as the Catalyst for a New Paradigm in Biomedical Research ethical acceptability and downstream risk mitigation

Transparency and auditability for model behavior, provenance, and decisions are essential for trustworthy deployment and regulatory acceptance.

Policy and governance synthesis drawing on regulatory dynamics; no empirical study of regulatory outcomes included.

high null result AI as the Catalyst for a New Paradigm in Biomedical Research trustworthiness/regulatory acceptability of models

Rigorous model validation and reproducibility across datasets and settings are necessary constraints for successful AI deployment.

Normative claim in the editorial based on reproducibility concerns in ML and biomedical research; no reported validation trials within the paper.

high null result AI as the Catalyst for a New Paradigm in Biomedical Research reliability and generalizability of AI models across settings
