The Commonplace

Evidence (7953 claims)

Adoption: 5539 claims
Productivity: 4793 claims
Governance: 4333 claims
Human-AI Collaboration: 3326 claims
Labor Markets: 2657 claims
Innovation: 2510 claims
Org Design: 2469 claims
Skills & Training: 2017 claims
Inequality: 1378 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 402 112 67 480 1076
Governance & Regulation 402 192 122 62 790
Research Productivity 249 98 34 311 697
Organizational Efficiency 395 95 70 40 603
Technology Adoption Rate 321 126 73 39 564
Firm Productivity 306 39 70 12 432
Output Quality 256 66 25 28 375
AI Safety & Ethics 116 177 44 24 363
Market Structure 107 128 85 14 339
Decision Quality 177 76 38 20 315
Fiscal & Macroeconomic 89 58 33 22 209
Employment Level 77 34 80 9 202
Skill Acquisition 92 33 40 9 174
Innovation Output 120 12 23 12 168
Firm Revenue 98 34 22 154
Consumer Welfare 73 31 37 7 148
Task Allocation 84 16 33 7 140
Inequality Measures 25 77 32 5 139
Regulatory Compliance 54 63 13 3 133
Error Rate 44 51 6 101
Task Completion Time 88 5 4 3 100
Training Effectiveness 58 12 12 16 99
Worker Satisfaction 47 32 11 7 97
Wages & Compensation 53 15 20 5 93
Team Performance 47 12 15 7 82
Automation Exposure 24 22 9 6 62
Job Displacement 6 38 13 57
Hiring & Recruitment 41 4 6 3 54
Developer Productivity 34 4 3 1 42
Social Protection 22 10 6 2 40
Creative Output 16 7 5 1 29
Labor Share of Income 12 5 9 26
Skill Obsolescence 3 20 2 25
Worker Turnover 10 12 3 25
TraceR1 uses a two-stage training procedure; Stage 1 applies trajectory-level RL to predicted short-horizon trajectories, with rewards that enforce global consistency.
Method description in summary: Stage 1 uses trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.
high · null result · Anticipatory Planning for Multimodal AI Agents · trajectory-level plan coherence / global consistency
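As a reading aid, here is a minimal REINFORCE-style sketch contrasting a trajectory-level reward with per-step rewards: a single scalar scores the whole predicted action-state sequence. The consistency_score function and the optimizer choice are assumptions, not details from the paper.

```python
# Sketch: one scalar reward for the whole trajectory, broadcast to every step.
# consistency_score is a hypothetical stand-in for the paper's reward.

def trajectory_reward(trajectory, consistency_score):
    # trajectory: list of (action, predicted_state) pairs; scored once, globally
    return consistency_score(trajectory)

def reinforce_loss(step_log_probs, trajectory, consistency_score, baseline=0.0):
    advantage = trajectory_reward(trajectory, consistency_score) - baseline
    # Every step shares the same trajectory-level advantage, which is what
    # pushes the policy toward globally consistent rollouts.
    return -advantage * sum(step_log_probs)
```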
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
high · null result · When AI Navigates the Fog of War · evaluation components (verifiability checks, qualitative coding, longitudinal an...
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
high · null result · When AI Navigates the Fog of War · number of temporal nodes
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
high · null result · When AI Navigates the Fog of War · number and type of questions/prompts used
Measuring the marginal cost of runtime governance, characterizing the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
high · null result · Runtime Governance for AI Agents: Policies on Paths · existence of empirical research gaps (identified/not identified)
The work used no large empirical dataset or large-scale field experiments; it is primarily theoretical/formal, relying on simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
high · null result · Runtime Governance for AI Agents: Policies on Paths · use of empirical data (presence/absence of large-scale empirical evaluation)
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
high · null result · Runtime Governance for AI Agents: Policies on Paths · existence of calibrated thresholds and procedures (presence/absence)
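To make the open problem concrete, a toy decision rule mapping a calibrated violation probability to an enforcement action might look as follows; the costs and thresholds are hypothetical, since the paper identifies calibration as unsolved rather than prescribing a procedure.

```python
# Toy illustration of risk calibration: expected harm of letting an action
# through vs. expected friction of blocking it. All values are assumed.

def enforcement_action(p_violation: float,
                       cost_violation: float = 10.0,  # assumed cost of an uncaught violation
                       cost_block: float = 1.0,       # assumed cost of blocking a legitimate action
                       warn_threshold: float = 0.05) -> str:
    expected_harm = p_violation * cost_violation
    expected_friction = (1 - p_violation) * cost_block
    if expected_harm > expected_friction:   # with these costs: block once p > ~0.09
        return "block"
    if p_violation > warn_threshold:
        return "warn"
    return "allow"
```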
Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.
Limitations section stating recruitment sources, small N, and bias toward severe cases.
high · null result · Characterizing Delusional Spirals through Human-LLM Chat Log... · representativeness and generalizability of the sample
The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).
Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.
high · null result · Characterizing Delusional Spirals through Human-LLM Chat Log... · size of dataset (number of users and total messages)
Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.
Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.
high · null result · Deep Learning-Driven Black-Box Doherty Power Amplifier with ... · number of fabricated prototypes and test frequency (2.75 GHz)
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · the specified metrics (contest frequency/outcomes, time-to-help, satisfaction, p...
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · presence/absence of empirical data in the paper (binary)
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · availability and specificity of contest channels (system functionality)
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · legibility of active mode (user understanding at time of deferral)
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
high · null result · Designing for Disagreement: Front-End Guardrails for Assista... · existence of governance-approved admissible modes (system design property)
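A minimal sketch of the three design properties described in the entries above: a governance-approved menu of admissible modes, legible mode selection, and an outcome-specific contest channel. All names and fields are invented for illustration, not taken from the paper.

```python
# Illustrative encoding of the design pattern; no empirical claims attached.

from enum import Enum

class Mode(Enum):  # governance-approved menu; nothing outside it is selectable
    DEFER = "defer"
    PRIORITIZE_ACCESSIBILITY = "prioritize_accessibility"
    FIRST_COME_FIRST_SERVED = "first_come_first_served"

def select_mode(requested: str, rationale: str) -> tuple:
    """Constrain prioritization to vetted options and surface the reason."""
    mode = Mode(requested)  # raises ValueError for any non-approved mode
    return mode, f"Active mode: {mode.value} ({rationale})"  # legibility at the contact point

def contest(decision_id: str, reason: str) -> dict:
    """Outcome-specific contest: targets one decision, not the global rules."""
    return {"decision_id": decision_id, "reason": reason, "status": "under_review"}
```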
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · operational stability, efficiency, and robustness/degradation metrics
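One plausible operationalization of the three metric families named above, computed from per-episode records; the paper's exact formulas are not given in this summary, so the definitions below are assumptions.

```python
from statistics import mean, pvariance

def operational_stability(profits, catastrophic_flags):
    """Lower variance and fewer catastrophic failures = more stable."""
    return {"profit_variance": pvariance(profits),
            "catastrophic_rate": mean(catastrophic_flags)}

def efficiency(profits, costs):
    """Mean net outcome per episode (cost/profit/fulfillment in the paper)."""
    return mean(p - c for p, c in zip(profits, costs))

def degradation(score_by_difficulty):
    """Score drop from the easiest to the hardest environment tier."""
    tiers = sorted(score_by_difficulty)  # difficulty levels, easiest first
    return score_by_difficulty[tiers[0]] - score_by_difficulty[tiers[-1]]
```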
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · baseline agent architectures used for comparison
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · number of LLMs evaluated (n = 8)
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · agent architectural modularity (temporal decomposition into strategy vs executio...
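A minimal sketch of the two-tier temporal decomposition just described: a strategy module re-planned at a slower cadence and an execution module acting every step. The module interfaces and the re-planning interval K are assumptions.

```python
# Sketch of differing update cadences; not the paper's reference implementation.

def run_episode(env, strategy_module, execution_module, horizon=1000, K=50):
    strategy = strategy_module.plan(env.observe())            # slow, high-level
    for t in range(1, horizon + 1):
        if t % K == 0:
            strategy = strategy_module.plan(env.observe())    # periodic strategic update
        action = execution_module.act(env.observe(), strategy)  # fast, per-step
        env.step(action)
```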
RetailBench environments are made progressively more challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · environment difficulty gradient (complexity/stochasticity/non-stationarity level...
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · benchmark realism and coverage of non-stationarity for long-horizon decision-mak...
Roughly 25% of the training corpus is Italian-language data.
Corpus composition reported by the authors: Italian-language share ≈25% of total training tokens. The summary cites this proportion but does not list the datasets or language-detection methodology.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · percentage share of Italian-language tokens in the training corpus
The model was trained on approximately 2.5 trillion tokens of data.
Training-data size reported in the paper (aggregate token count ≈2.5T). The summary provides this number; no per-dataset breakdown or provenance details are included in the summary.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · total number of training tokens
Approximately 3 billion parameters are active per inference pass (sparse activation at runtime).
Paper reports sparse MoE design with ≈3B active parameters per forward pass. Evidence comes from model design description (active set / routing), not from independent runtime FLOP logs in the summary.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · active parameters used per inference
EngGPT2-16B-A3B is a Mixture-of-Experts (MoE) model trained from scratch with a total of 16 billion parameters.
Model specification reported in the paper: architecture described as MoE and total parameter count listed as 16B. No contrary empirical test needed; claim is a declarative model spec.
high · null result · EngGPT2: Sovereign, Efficient and Open Intelligence · model architecture and total parameter count
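A back-of-envelope check that ~3B active parameters is consistent with a 16B-parameter MoE. The expert count, top-k routing, and shared-parameter split below are hypothetical; the summary reports only the two headline figures.

```python
# Consistency check of the 16B-total / ~3B-active figures under assumed routing.

shared = 1.0e9                             # assumed always-active params (attention, embeddings)
n_experts = 64                             # assumed
expert_size = (16e9 - shared) / n_experts  # ~0.23B params per expert
top_k = 8                                  # assumed experts routed per token

active = shared + top_k * expert_size
print(f"active ≈ {active / 1e9:.1f}B of 16B total")  # -> active ≈ 2.9B
```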
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
high · null result · Fanar 2.0: Arabic Generative AI Stack · existence and intended domain of specialized models
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
high · null result · Fanar 2.0: Arabic Generative AI Stack · model existence, size (4B), and intended function (bilingual moderation)
Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
high · null result · Fanar 2.0: Arabic Generative AI Stack · model lineage/architecture (Fanar-27B ← Gemma-3-27B)
The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.
Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.
high · null result · Fanar 2.0: Arabic Generative AI Stack · training token count and dataset composition (three recipes)
Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.
Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).
high · null result · Fanar 2.0: Arabic Generative AI Stack · compute infrastructure (GPU count & location)
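For scale, a rough compute estimate using the common FLOPs ≈ 6·N·D rule of thumb; the peak throughput and utilization figures are assumptions, as the paper reports only the GPU count and token budget.

```python
# Back-of-envelope training-compute estimate; all efficiency numbers assumed.

N = 27e9           # parameters (Fanar-27B)
D = 120e9          # training tokens
flops = 6 * N * D  # ≈ 1.9e22 FLOPs

peak = 1e15        # ~1 PFLOP/s BF16 per H100 (approximate, dense)
mfu = 0.4          # assumed model FLOPs utilization
days = flops / (peak * mfu) / (256 * 86400)
print(f"{flops:.1e} FLOPs ≈ {days:.1f} days on 256 H100s at {mfu:.0%} MFU")
```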
Experiments were conducted on three benchmarks and across multiple LLM families to evaluate generation, scoring, calibration, robustness, and efficiency dimensions.
Data & Methods section summary in the paper stating systematic evaluation across three benchmarks and a variety of LLMs and verifiers.
high · null result · Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... · experimental coverage (benchmarks and model families)
Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories).
Authors' stated limitation about unavailable/partial training-data provenance and methodological caveats for the lexical-matching pipeline and behavioral probes.
high · null result · Are Large Language Models Truly Smarter Than Humans? · uncertainty in contamination detection accuracy due to incomplete provenance
Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.
Authors' limitations: experiments were conducted only on the MMLU dataset (513 questions) and on the listed six models; generalizability is therefore uncertain.
high · null result · Are Large Language Models Truly Smarter Than Humans? · generalizability of contamination findings to other benchmarks/models
A three-layer evaluation framework was applied systematically: Layer 1 = syntactic validity; Layer 2 = semantic correctness; Layer 3 = hardware executability (with sublayer 3b = end-to-end evaluation on quantum hardware).
Methods section describes application of a three-layer evaluation framework to each reviewed system, including the explicit sublayer 3b definition.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · evaluation framework definition and application
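The three-layer framework can be read as a short-circuiting pipeline, sketched below with hypothetical checker callables; the review defines the layers but does not supply a reference implementation.

```python
# Sketch of the layered evaluation: each layer is only checkable if the
# previous one passed. parse/check_semantics/transpile/run_on_hardware are
# hypothetical stand-ins.

def evaluate(artifact, parse, check_semantics, transpile, run_on_hardware):
    result = {"L1_syntactic": False, "L2_semantic": False,
              "L3_executable": False, "L3b_end_to_end": False}
    try:
        circuit = parse(artifact)  # Layer 1: is it valid Qiskit/OpenQASM?
        result["L1_syntactic"] = True
    except Exception:
        return result              # fail fast: nothing else is checkable
    result["L2_semantic"] = check_semantics(circuit)  # Layer 2: right computation?
    if result["L2_semantic"]:
        result["L3_executable"] = transpile(circuit)  # Layer 3: maps to hardware?
        if result["L3_executable"]:
            result["L3b_end_to_end"] = run_on_hardware(circuit)  # 3b: full device run
    return result
```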
The review grouped training regimes across the systems as supervised fine-tuning, verifier-in-the-loop reinforcement learning (RL), diffusion/graph generation, and agentic optimization.
Surveyed systems' training descriptions were classified into these training-regime categories during the review's analytical synthesis.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · training regimes present among reviewed systems
The review organized artifacts along artifact-type axes: Qiskit code, OpenQASM programs, and circuit graphs.
Analytical organization described in the methods: artifact-type axis enumerated as Qiskit, OpenQASM, and circuit graphs across the surveyed systems.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · artifact types covered in the field synthesis
"Quantum code" in this review is defined as program artifacts (Qiskit code, OpenQASM); quantum error-correcting code (QEC) generation was excluded.
Inclusion/exclusion criteria specified in the review explicitly limited scope to program artifacts such as Qiskit and OpenQASM and excluded QEC-focused works.
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · scope definition (inclusion/exclusion of QEC)
A structured scoping review (Hugging Face, arXiv, provenance tracing; Jan–Feb 2026) identified 13 generative systems and 5 supporting datasets relevant to quantum circuit / quantum code generation.
Structured search of Hugging Face model/dataset listings, arXiv literature, and provenance tracing conducted between January and February 2026; results yielded 13 systems and 5 datasets (sample counts reported in the review).
high · null result · Generative AI for Quantum Circuits and Quantum Code: A Techn... · number of generative systems and datasets identified (13 systems, 5 datasets)
The reinforcement learning objective optimizes a combined utility that trades off task success and resource costs; the reward penalizes delays and failures.
Learning method section describes training the high-level orchestrator with an RL reward that penalizes delays (latency/resource consumption) and failures, and that algorithmic/hyperparameter details are provided.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · training objective: combined utility of task success and resource cost
The experiments use empirical LLM latency profiles measured from ALFRED tasks to model realistic inference delays in simulation.
Environment/evaluation description states use of an embodied task suite based on ALFRED and empirical latency profiles to model realistic LLM inference delays.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · latency modeling (empirical latency profiles)
Baselines for comparison include fixed reasoning strategies (always reason, never reason), heuristic triggers for invoking LLMs, and ablations of RARRL components.
Paper lists these baselines explicitly in the Baselines and comparisons section and reports experiments comparing RARRL to them.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · baseline policy types used for comparison
The high-level orchestration policy uses observations that include current sensory observation, execution history, and remaining resources (e.g., remaining time or compute budget).
Key Points and Methods specify the observation space used by the orchestrator, listing sensory inputs, execution history, and resource remaining as inputs.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · policy input features (sensory observation, execution history, remaining resourc...
RARRL trains only a high-level orchestration policy via reinforcement learning and does not retrain the existing low-level control/policy modules end-to-end.
Methods/Model architecture describe a hierarchical approach where low-level controllers are existing modules and are not retrained; RL is applied to the high-level orchestrator.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · level of learning: high-level orchestration policy trained vs. low-level control...
RARRL (Resource-Aware Reasoning via Reinforcement Learning) is a hierarchical orchestration framework that learns a high-level policy to decide when an embodied agent should invoke LLM-based reasoning, which reasoning role to use, and how much compute budget to allocate.
Paper describes a hierarchical design with a learned high-level RL orchestrator that issues discrete decisions about reasoning invocation, reasoning role/mode, and compute budget allocation; architecture and decision space specified in Methods.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · decision variables: whether to call an LLM, reasoning role/mode selected, comput...
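Pulling the RARRL entries above together, a minimal sketch of the orchestrator's observation inputs, its discrete decision, and a combined-utility reward; only the high-level policy is trained, with low-level modules left frozen. Field names, the role set, and the penalty weights are assumptions.

```python
# Sketch of the RARRL decision interface; values and names are illustrative.

from dataclasses import dataclass

@dataclass
class Observation:
    sensory: object        # current sensory observation
    history: list          # execution history so far
    budget_left: float     # remaining time/compute budget

@dataclass
class Decision:
    invoke_llm: bool       # whether to call LLM-based reasoning at all
    role: str              # which reasoning role/mode (e.g., "plan", "replan")
    budget: float          # compute budget allocated to this call

def reward(success: bool, delay: float, failed: bool,
           w_delay: float = 0.1, w_fail: float = 1.0) -> float:
    """Combined utility: task success minus penalties for delays and failures."""
    return float(success) - w_delay * delay - w_fail * float(failed)
```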
BenchPreS defines two complementary metrics—Misapplication Rate (MR) and Appropriate Application Rate (AAR)—to quantify over‑application and correct personalization, respectively.
Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.
high · null result · BenchPreS: A Benchmark for Context-Aware Personalized Prefer... · Definition and use of MR and AAR metrics
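One plausible reading of the two definitions, with MR computed over cases where personalization was inappropriate and AAR over cases where it was appropriate; the exact denominators are not spelled out in the summary.

```python
# Sketch of MR / AAR under one assumed reading of the definitions.

def mr_aar(records):
    """records: list of (applied: bool, appropriate: bool) per test case."""
    inappropriate = [r for r in records if not r[1]]
    appropriate = [r for r in records if r[1]]
    mr = sum(applied for applied, _ in inappropriate) / max(len(inappropriate), 1)
    aar = sum(applied for applied, _ in appropriate) / max(len(appropriate), 1)
    return mr, aar
```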
Pilot randomized or quasi-experimental implementations of reduced workweeks (across firms, industries, or regions) are needed to measure effects on employment, productivity, wages, and consumption.
Research-design recommendation motivated by lack of contemporary causal evidence; not an empirical finding but a stated priority for rigorous testing.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · measured causal effects of reduced workweeks on employment, productivity, wages,...
There is limited direct causal identification separating technology-driven layoffs from incentive-driven layoffs in current firm-level data, creating a need for new firm-panel datasets linking AI adoption, executive pay/ownership, layoff decisions, and local demand outcomes.
Stated limitation of the paper and research-priority recommendation; assessment based on literature gaps noted in the synthesis rather than empirical gap quantification.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · availability/coverage of firm-level panel data capable of separating AI effects ...
Observed layoffs should be treated in empirical research as outcomes of firm governance and incentive structures; econometric studies estimating displacement from AI must control for managerial incentives and financial pressures.
Methodological recommendation based on the conceptual argument and literature linking governance/incentives to firm behavior; no new empirical demonstration provided.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · bias in estimated causal effect of AI on layoffs when not controlling for manage...
Research priorities include empirical testing and simulation of ISB-based control systems, cost–benefit analysis of proactive versus reactive AI governance, and distributional impact assessments.
Explicit research agenda proposed by the author (conceptual recommendation), not empirical results.
high · null result · DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... · n/a (research agenda recommendation rather than an empirical outcome)
Key empirical metrics introduced and used are AI adoption rate (sector-level intensity), skill shift index, hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
high · null result · AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... · Defined metrics (AI adoption rate, Skill shift index, Hybrid job share, Employme...