The Commonplace

Evidence (7953 claims)

Adoption: 5539 claims
Productivity: 4793 claims
Governance: 4333 claims
Human-AI Collaboration: 3326 claims
Labor Markets: 2657 claims
Innovation: 2510 claims
Org Design: 2469 claims
Skills & Training: 2017 claims
Inequality: 1378 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 402 112 67 480 1076
Governance & Regulation 402 192 122 62 790
Research Productivity 249 98 34 311 697
Organizational Efficiency 395 95 70 40 603
Technology Adoption Rate 321 126 73 39 564
Firm Productivity 306 39 70 12 432
Output Quality 256 66 25 28 375
AI Safety & Ethics 116 177 44 24 363
Market Structure 107 128 85 14 339
Decision Quality 177 76 38 20 315
Fiscal & Macroeconomic 89 58 33 22 209
Employment Level 77 34 80 9 202
Skill Acquisition 92 33 40 9 174
Innovation Output 120 12 23 12 168
Firm Revenue 98 34 22 154
Consumer Welfare 73 31 37 7 148
Task Allocation 84 16 33 7 140
Inequality Measures 25 77 32 5 139
Regulatory Compliance 54 63 13 3 133
Error Rate 44 51 6 101
Task Completion Time 88 5 4 3 100
Training Effectiveness 58 12 12 16 99
Worker Satisfaction 47 32 11 7 97
Wages & Compensation 53 15 20 5 93
Team Performance 47 12 15 7 82
Automation Exposure 24 22 9 6 62
Job Displacement 6 38 13 57
Hiring & Recruitment 41 4 6 3 54
Developer Productivity 34 4 3 1 42
Social Protection 22 10 6 2 40
Creative Output 16 7 5 1 29
Labor Share of Income 12 5 9 26
Skill Obsolescence 3 20 2 25
Worker Turnover 10 12 3 25
TraceR1 uses a two-stage training procedure; Stage 1 applies trajectory-level RL to predicted short-horizon trajectories, with rewards that enforce global consistency.
Method description in summary: Stage 1 uses trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.
high · null result · Anticipatory Planning for Multimodal AI Agents · trajectory-level plan coherence / global consistency
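As a reading aid, here is a minimal REINFORCE-style sketch contrasting a trajectory-level reward with per-step rewards: a single scalar scores the whole predicted action-state sequence. The consistency_score function and the optimizer choice are assumptions, not details from the paper.

```python
# Sketch: one scalar reward for the whole trajectory, broadcast to every step.
# consistency_score is a hypothetical stand-in for the paper's reward.

def trajectory_reward(trajectory, consistency_score):
    # trajectory: list of (action, predicted_state) pairs; scored once, globally
    return consistency_score(trajectory)

def reinforce_loss(step_log_probs, trajectory, consistency_score, baseline=0.0):
    advantage = trajectory_reward(trajectory, consistency_score) - baseline
    # Every step shares the same trajectory-level advantage, which is what
    # pushes the policy toward globally consistent rollouts.
    return -advantage * sum(step_log_probs)
```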
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
high · null result · When AI Navigates the Fog of War · evaluation components (verifiability checks, qualitative coding, longitudinal an...
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
high · null result · When AI Navigates the Fog of War · number of temporal nodes
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
high · null result · When AI Navigates the Fog of War · number and type of questions/prompts used
Measuring the marginal cost of runtime governance, characterizing the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
high · null result · Runtime Governance for AI Agents: Policies on Paths · existence of empirical research gaps (identified/not identified)
The work used no large empirical dataset or large-scale field experiments; it is primarily theoretical/formal, relying on simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
high · null result · Runtime Governance for AI Agents: Policies on Paths · use of empirical data (presence/absence of large-scale empirical evaluation)
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
high · null result · Runtime Governance for AI Agents: Policies on Paths · existence of calibrated thresholds and procedures (presence/absence)
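To make the open problem concrete, a toy decision rule mapping a calibrated violation probability to an enforcement action might look as follows; the costs and thresholds are hypothetical, since the paper identifies calibration as unsolved rather than prescribing a procedure.

```python
# Toy illustration of risk calibration: expected harm of letting an action
# through vs. expected friction of blocking it. All values are assumed.

def enforcement_action(p_violation: float,
                       cost_violation: float = 10.0,  # assumed cost of an uncaught violation
                       cost_block: float = 1.0,       # assumed cost of blocking a legitimate action
                       warn_threshold: float = 0.05) -> str:
    expected_harm = p_violation * cost_violation
    expected_friction = (1 - p_violation) * cost_block
    if expected_harm > expected_friction:   # with these costs: block once p > ~0.09
        return "block"
    if p_violation > warn_threshold:
        return "warn"
    return "allow"
```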
Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.
Limitations section stating recruitment sources, small N, and bias toward severe cases.
high · null result · Characterizing Delusional Spirals through Human-LLM Chat Log... · representativeness and generalizability of the sample
The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).
Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.
high · null result · Characterizing Delusional Spirals through Human-LLM Chat Log... · size of dataset (number of users and total messages)
Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.
Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.
high · null result · Deep Learning-Driven Black-Box Doherty Power Amplifier with ... · number of fabricated prototypes and test frequency (2.75 GHz)
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · the specified metrics (contest frequency/outcomes, time-to-help, satisfaction, p...
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · presence/absence of empirical data in the paper (binary)
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · availability and specificity of contest channels (system functionality)
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · legibility of active mode (user understanding at time of deferral)
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · existence of governance-approved admissible modes (system design property)
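A minimal sketch of the three design properties described in the entries above: a governance-approved menu of admissible modes, legible mode selection, and an outcome-specific contest channel. All names and fields are invented for illustration, not taken from the paper.

```python
# Illustrative encoding of the design pattern; no empirical claims attached.

from enum import Enum

class Mode(Enum):  # governance-approved menu; nothing outside it is selectable
    DEFER = "defer"
    PRIORITIZE_ACCESSIBILITY = "prioritize_accessibility"
    FIRST_COME_FIRST_SERVED = "first_come_first_served"

def select_mode(requested: str, rationale: str) -> tuple:
    """Constrain prioritization to vetted options and surface the reason."""
    mode = Mode(requested)  # raises ValueError for any non-approved mode
    return mode, f"Active mode: {mode.value} ({rationale})"  # legibility at the contact point

def contest(decision_id: str, reason: str) -> dict:
    """Outcome-specific contest: targets one decision, not the global rules."""
    return {"decision_id": decision_id, "reason": reason, "status": "under_review"}
```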
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · operational stability, efficiency, and robustness/degradation metrics
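One plausible operationalization of the three metric families named above, computed from per-episode records; the paper's exact formulas are not given in this summary, so the definitions below are assumptions.

```python
from statistics import mean, pvariance

def operational_stability(profits, catastrophic_flags):
    """Lower variance and fewer catastrophic failures = more stable."""
    return {"profit_variance": pvariance(profits),
            "catastrophic_rate": mean(catastrophic_flags)}

def efficiency(profits, costs):
    """Mean net outcome per episode (cost/profit/fulfillment in the paper)."""
    return mean(p - c for p, c in zip(profits, costs))

def degradation(score_by_difficulty):
    """Score drop from the easiest to the hardest environment tier."""
    tiers = sorted(score_by_difficulty)  # difficulty levels, easiest first
    return score_by_difficulty[tiers[0]] - score_by_difficulty[tiers[-1]]
```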
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · baseline agent architectures used for comparison
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · number of LLMs evaluated (n = 8)
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · agent architectural modularity (temporal decomposition into strategy vs executio...
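A minimal sketch of the two-tier temporal decomposition just described: a strategy module re-planned at a slower cadence and an execution module acting every step. The module interfaces and the re-planning interval K are assumptions.

```python
# Sketch of differing update cadences; not the paper's reference implementation.

def run_episode(env, strategy_module, execution_module, horizon=1000, K=50):
    strategy = strategy_module.plan(env.observe())            # slow, high-level
    for t in range(1, horizon + 1):
        if t % K == 0:
            strategy = strategy_module.plan(env.observe())    # periodic strategic update
        action = execution_module.act(env.observe(), strategy)  # fast, per-step
        env.step(action)
```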
RetailBench environments are made progressively more challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · environment difficulty gradient (complexity/stochasticity/non-stationarity level...
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · benchmark realism and coverage of non-stationarity for long-horizon decision-mak...
Roughly 25% of the training corpus is Italian-language data.
Corpus composition reported by the authors: Italian-language share ≈25% of total training tokens. The summary cites this proportion but does not list the datasets or language-detection methodology.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · percentage share of Italian-language tokens in the training corpus
The model was trained on approximately 2.5 trillion tokens of data.
Training-data size reported in the paper (aggregate token count ≈2.5T). The summary provides this number; no per-dataset breakdown or provenance details are included in the summary.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · total number of training tokens
Approximately 3 billion parameters are active per inference pass (sparse activation at runtime).
Paper reports sparse MoE design with ≈3B active parameters per forward pass. Evidence comes from model design description (active set / routing), not from independent runtime FLOP logs in the summary.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · active parameters used per inference
EngGPT2-16B-A3B is a Mixture-of-Experts (MoE) model trained from scratch with a total of 16 billion parameters.
Model specification reported in the paper: architecture described as MoE and total parameter count listed as 16B. No contrary empirical test needed; claim is a declarative model spec.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · model architecture and total parameter count
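A back-of-envelope check that ~3B active parameters is consistent with a 16B-parameter MoE. The expert count, top-k routing, and shared-parameter split below are hypothetical; the summary reports only the two headline figures.

```python
# Consistency check of the 16B-total / ~3B-active figures under assumed routing.

shared = 1.0e9                             # assumed always-active params (attention, embeddings)
n_experts = 64                             # assumed
expert_size = (16e9 - shared) / n_experts  # ~0.23B params per expert
top_k = 8                                  # assumed experts routed per token

active = shared + top_k * expert_size
print(f"active ≈ {active / 1e9:.1f}B of 16B total")  # -> active ≈ 2.9B
```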
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
high · null result · Fanar 2.0: Arabic Generative AI Stack · existence and intended domain of specialized models
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
high · null result · Fanar 2.0: Arabic Generative AI Stack · model existence, size (4B), and intended function (bilingual moderation)
Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
high · null result · Fanar 2.0: Arabic Generative AI Stack · model lineage/architecture (Fanar-27B ← Gemma-3-27B)
The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.
Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.
high · null result · Fanar 2.0: Arabic Generative AI Stack · training token count and dataset composition (three recipes)
Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.
Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).
high · null result · Fanar 2.0: Arabic Generative AI Stack · compute infrastructure (GPU count & location)
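For scale, a rough compute estimate using the common FLOPs ≈ 6·N·D rule of thumb; the peak throughput and utilization figures are assumptions, as the paper reports only the GPU count and token budget.

```python
# Back-of-envelope training-compute estimate; all efficiency numbers assumed.

N = 27e9           # parameters (Fanar-27B)
D = 120e9          # training tokens
flops = 6 * N * D  # ≈ 1.9e22 FLOPs

peak = 1e15        # ~1 PFLOP/s BF16 per H100 (approximate, dense)
mfu = 0.4          # assumed model FLOPs utilization
days = flops / (peak * mfu) / (256 * 86400)
print(f"{flops:.1e} FLOPs ≈ {days:.1f} days on 256 H100s at {mfu:.0%} MFU")
```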
Experiments were conducted on three benchmarks and across multiple LLM families to evaluate generation, scoring, calibration, robustness, and efficiency dimensions.
Data & Methods section summary in the paper stating systematic evaluation across three benchmarks and a variety of LLMs and verifiers.
high · null result · Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... · experimental coverage (benchmarks and model families)
Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories).
Authors' stated limitation about unavailable/partial training-data provenance and methodological caveats for the lexical-matching pipeline and behavioral probes.
high · null result · Are Large Language Models Truly Smarter Than Humans? · uncertainty in contamination detection accuracy due to incomplete provenance
Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.
Authors' limitations: experiments were conducted only on the MMLU dataset (513 questions) and on the listed six models; generalizability is therefore uncertain.
high · null result · Are Large Language Models Truly Smarter Than Humans? · generalizability of contamination findings to other benchmarks/models
A three-layer evaluation framework was applied systematically: Layer 1 = syntactic validity; Layer 2 = semantic correctness; Layer 3 = hardware executability (with sublayer 3b = end-to-end evaluation on quantum hardware).
Methods section describes application of a three-layer evaluation framework to each reviewed system, including the explicit sublayer 3b definition.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · evaluation framework definition and application
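The three-layer framework can be read as a short-circuiting pipeline, sketched below with hypothetical checker callables; the review defines the layers but does not supply a reference implementation.

```python
# Sketch of the layered evaluation: each layer is only checkable if the
# previous one passed. parse/check_semantics/transpile/run_on_hardware are
# hypothetical stand-ins.

def evaluate(artifact, parse, check_semantics, transpile, run_on_hardware):
    result = {"L1_syntactic": False, "L2_semantic": False,
              "L3_executable": False, "L3b_end_to_end": False}
    try:
        circuit = parse(artifact)  # Layer 1: is it valid Qiskit/OpenQASM?
        result["L1_syntactic"] = True
    except Exception:
        return result              # fail fast: nothing else is checkable
    result["L2_semantic"] = check_semantics(circuit)  # Layer 2: right computation?
    if result["L2_semantic"]:
        result["L3_executable"] = transpile(circuit)  # Layer 3: maps to hardware?
        if result["L3_executable"]:
            result["L3b_end_to_end"] = run_on_hardware(circuit)  # 3b: full device run
    return result
```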
The review grouped training regimes across the systems as supervised fine-tuning, verifier-in-the-loop reinforcement learning (RL), diffusion/graph generation, and agentic optimization.
Surveyed systems' training descriptions were classified into these training-regime categories during the review's analytical synthesis.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · training regimes present among reviewed systems
The review organized artifacts along artifact-type axes: Qiskit code, OpenQASM programs, and circuit graphs.
Analytical organization described in the methods: artifact-type axis enumerated as Qiskit, OpenQASM, and circuit graphs across the surveyed systems.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · artifact types covered in the field synthesis
"Quantum code" in this review is defined as program artifacts (Qiskit code, OpenQASM); quantum error-correcting code (QEC) generation was excluded.
Inclusion/exclusion criteria specified in the review explicitly limited scope to program artifacts such as Qiskit and OpenQASM and excluded QEC-focused works.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · scope definition (inclusion/exclusion of QEC)
A structured scoping review (Hugging Face, arXiv, provenance tracing; Jan–Feb 2026) identified 13 generative systems and 5 supporting datasets relevant to quantum circuit / quantum code generation.
Structured search of Hugging Face model/dataset listings, arXiv literature, and provenance tracing conducted between January and February 2026; results yielded 13 systems and 5 datasets (sample counts reported in the review).
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · number of generative systems and datasets identified (13 systems, 5 datasets)
The reinforcement learning objective optimizes a combined utility that trades off task success and resource costs; the reward penalizes delays and failures.
Learning method section describes training the high-level orchestrator with an RL reward that penalizes delays (latency/resource consumption) and failures, and that algorithmic/hyperparameter details are provided.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · training objective: combined utility of task success and resource cost
The experiments use empirical LLM latency profiles measured from ALFRED tasks to model realistic inference delays in simulation.
Environment/evaluation description states use of an embodied task suite based on ALFRED and empirical latency profiles to model realistic LLM inference delays.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · latency modeling (empirical latency profiles)
Baselines for comparison include fixed reasoning strategies (always reason, never reason), heuristic triggers for invoking LLMs, and ablations of RARRL components.
Paper lists these baselines explicitly in the Baselines and comparisons section and reports experiments comparing RARRL to them.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · baseline policy types used for comparison
The high-level orchestration policy uses observations that include current sensory observation, execution history, and remaining resources (e.g., remaining time or compute budget).
Key Points and Methods specify the observation space used by the orchestrator, listing sensory inputs, execution history, and resource remaining as inputs.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · policy input features (sensory observation, execution history, remaining resourc...
RARRL trains only a high-level orchestration policy via reinforcement learning and does not retrain the existing low-level control/policy modules end-to-end.
Methods/Model architecture describe a hierarchical approach where low-level controllers are existing modules and are not retrained; RL is applied to the high-level orchestrator.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · level of learning: high-level orchestration policy trained vs. low-level control...
RARRL (Resource-Aware Reasoning via Reinforcement Learning) is a hierarchical orchestration framework that learns a high-level policy to decide when an embodied agent should invoke LLM-based reasoning, which reasoning role to use, and how much compute budget to allocate.
Paper describes a hierarchical design with a learned high-level RL orchestrator that issues discrete decisions about reasoning invocation, reasoning role/mode, and compute budget allocation; architecture and decision space specified in Methods.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · decision variables: whether to call an LLM, reasoning role/mode selected, comput...
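Pulling the RARRL entries above together, a minimal sketch of the orchestrator's observation inputs, its discrete decision, and a combined-utility reward; only the high-level policy is trained, with low-level modules left frozen. Field names, the role set, and the penalty weights are assumptions.

```python
# Sketch of the RARRL decision interface; values and names are illustrative.

from dataclasses import dataclass

@dataclass
class Observation:
    sensory: object        # current sensory observation
    history: list          # execution history so far
    budget_left: float     # remaining time/compute budget

@dataclass
class Decision:
    invoke_llm: bool       # whether to call LLM-based reasoning at all
    role: str              # which reasoning role/mode (e.g., "plan", "replan")
    budget: float          # compute budget allocated to this call

def reward(success: bool, delay: float, failed: bool,
           w_delay: float = 0.1, w_fail: float = 1.0) -> float:
    """Combined utility: task success minus penalties for delays and failures."""
    return float(success) - w_delay * delay - w_fail * float(failed)
```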
BenchPreS defines two complementary metrics—Misapplication Rate (MR) and Appropriate Application Rate (AAR)—to quantify over‑application and correct personalization, respectively.
Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.
high · null result · BenchPreS: A Benchmark for Context-Aware Personalized Prefer... · Definition and use of MR and AAR metrics
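One plausible reading of the two definitions, with MR computed over cases where personalization was inappropriate and AAR over cases where it was appropriate; the exact denominators are not spelled out in the summary.

```python
# Sketch of MR / AAR under one assumed reading of the definitions.

def mr_aar(records):
    """records: list of (applied: bool, appropriate: bool) per test case."""
    inappropriate = [r for r in records if not r[1]]
    appropriate = [r for r in records if r[1]]
    mr = sum(applied for applied, _ in inappropriate) / max(len(inappropriate), 1)
    aar = sum(applied for applied, _ in appropriate) / max(len(appropriate), 1)
    return mr, aar
```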
Pilot randomized or quasi-experimental implementations of reduced workweeks (across firms, industries, or regions) are needed to measure effects on employment, productivity, wages, and consumption.
Research-design recommendation motivated by lack of contemporary causal evidence; not an empirical finding but a stated priority for rigorous testing.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · measured causal effects of reduced workweeks on employment, productivity, wages,...
There is limited direct causal identification separating technology-driven layoffs from incentive-driven layoffs in current firm-level data, creating a need for new firm-panel datasets linking AI adoption, executive pay/ownership, layoff decisions, and local demand outcomes.
Stated limitation of the paper and research-priority recommendation; assessment based on literature gaps noted in the synthesis rather than empirical gap quantification.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · availability/coverage of firm-level panel data capable of separating AI effects ...
Observed layoffs should be treated in empirical research as outcomes of firm governance and incentive structures; econometric studies estimating displacement from AI must control for managerial incentives and financial pressures.
Methodological recommendation based on the conceptual argument and literature linking governance/incentives to firm behavior; no new empirical demonstration provided.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · bias in estimated causal effect of AI on layoffs when not controlling for manage...
Research priorities include empirical testing and simulation of ISB-based control systems, cost–benefit analysis of proactive versus reactive AI governance, and distributional impact assessments.
Explicit research agenda proposed by the author (conceptual recommendation), not empirical results.
high · null result · DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... · n/a (research agenda recommendation rather than an empirical outcome)
Key empirical metrics introduced and used are AI adoption rate (sector-level intensity), skill shift index, hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
high · null result · AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... · Defined metrics (AI adoption rate, Skill shift index, Hybrid job share, Employme...