Evidence (2215 claims)
Claims by category:
- Adoption — 5126 claims
- Productivity — 4409 claims
- Governance — 4049 claims
- Human-AI Collaboration — 2954 claims
- Labor Markets — 2432 claims
- Org Design — 2273 claims
- Innovation — 2215 claims
- Skills & Training — 1902 claims
- Inequality — 1286 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Innovation
We develop the Institutional Fitness Manifold, a mathematical framework that evaluates AI systems along four dimensions: capability, institutional trust, affordability, and sovereign compliance.
Theoretical/model development presented in the paper (formal definition of the manifold and its four dimensions).
There have been five eras of AI development since 1943, and within the current Generative AI Era there are four distinct epochs, each initiated by a discontinuous event.
Descriptive/historical classification within the paper (counts of eras and epochs; named initiating events such as the transformer and the 'DeepSeek Moment').
Open research challenges that define the research agenda include scaling beyond benchmarks, achieving compositionality over changes, metrics for validating specifications, handling rich logics, and designing human-AI specification interactions.
Authors' explicit enumeration of open problems and a proposed multi-disciplinary research agenda; presented as expert opinion rather than empirical finding.
A graph neural network (GNN) is built over the agents' reasoning embeddings, and trading decisions are made by a PPO-DSR policy.
Method description: the paper reports embedding agents' reasoning, building a graph neural network (GNN) from those embeddings, and using a PPO-DSR reinforcement learning policy to trade. Specific GNN/PPO-DSR hyperparameters and architecture are not provided in the excerpt.
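The graph-construction step can be read as connecting agents whose reasoning embeddings are similar; a minimal sketch under that assumption (the paper's actual graph definition, GNN architecture, and PPO-DSR details are not in the excerpt):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_graph(embeddings, threshold=0.5):
    """Connect agent nodes whose reasoning embeddings are similar.

    Returns an edge list over node indices; in the paper's pipeline a GNN
    would then operate on this graph (details not given in the excerpt).
    """
    edges = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                edges.append((i, j))
    return edges

# Four agents' (toy) reasoning embeddings
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(build_graph(emb))  # → [(0, 1), (2, 3)]
```

The threshold and cosine metric are illustrative choices, not the paper's.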
Four LLM agents output scores along with reasoning.
Method description: the paper states that four LLM agents produce numeric scores and associated textual reasoning. The number of agents is explicitly given as four; no further architecture or model-family details included in the excerpt.
BlindTrade anonymizes tickers and company names (blindfolding agents by anonymizing all identifiers).
Methodological description in the paper: the system design explicitly replaces tickers and company names with anonymized identifiers. Implementation details and examples not provided in the excerpt.
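The blindfolding step amounts to a deterministic identifier-substitution pass; a minimal sketch (BlindTrade's actual implementation is not described in the excerpt, and the `ASSET_n` alias scheme is an assumption):

```python
def anonymize(documents, identifiers):
    """Replace tickers/company names with stable anonymous IDs.

    Aliases are assigned in first-seen order, so the same real name
    always maps to the same anonymized identifier across documents.
    """
    mapping = {}
    out = []
    for doc in documents:
        for name in identifiers:
            if name in doc:
                alias = mapping.setdefault(name, f"ASSET_{len(mapping) + 1}")
                doc = doc.replace(name, alias)
        out.append(doc)
    return out, mapping

docs = ["AAPL rallied after Apple's earnings.", "AAPL and MSFT diverged."]
blinded, key = anonymize(docs, ["AAPL", "Apple", "MSFT"])
print(blinded[0])  # ASSET_1 rallied after ASSET_2's earnings.
```

Keeping the alias mapping stable matters: agents can still reason about an asset across documents without knowing which company it is.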
The dependent variable is the Market Opportunity Index, which is a combination of indicators of innovation activity, the share of firms with new products, and the share of opportunity-oriented entrepreneurs.
Paper provides the construction/definition of the dependent variable (components listed in the excerpt).
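A composite of the three listed components can be sketched as a weighted average; equal weights are an assumption for illustration, since the paper's aggregation rule is not reported in the excerpt:

```python
def market_opportunity_index(innovation_activity, new_product_share,
                             opportunity_entrepreneur_share,
                             weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted composite of the three components named in the excerpt.

    Assumes all components are already normalized to a common scale;
    the paper's actual weighting/normalization is not given.
    """
    components = (innovation_activity, new_product_share,
                  opportunity_entrepreneur_share)
    return sum(w * c for w, c in zip(weights, components))

# Toy component values on a common 0-1 scale
print(market_opportunity_index(0.6, 0.3, 0.45))
```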
The model used lags of the dependent variable to take into account inertia in the development of entrepreneurial opportunities, and the stability of the impact of cognitive tools was tested.
Paper states the model specification included lagged dependent variables and that stability tests for the impact of cognitive tools were performed (no further details on lag length or test statistics in the excerpt).
The study's methodological foundation was panel econometric modelling, which made it possible to account for cross-country differences over time and for the dynamics of domestic indicators.
Description of methods in the paper: use of panel econometric modelling on an international panel over the 2020–2024 period (sample size not specified in the excerpt).
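One standard dynamic panel form consistent with the lagged-dependent-variable description (the symbols are assumptions; the excerpt gives no explicit specification) is

```latex
MOI_{it} = \alpha\, MOI_{i,t-1} + \beta^{\top} x_{it} + \mu_i + \lambda_t + \varepsilon_{it},
```

where MOI_{it} is the Market Opportunity Index for country i in year t, x_{it} collects the cognitive-tool indicators, μ_i and λ_t are country and time effects, and the lagged term captures inertia in opportunity development; stability of the cognitive-tool impact would then be checked by re-estimating β across subsamples or alternative specifications.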
Practical recommendations for firms and policymakers include investing in training for AI curation/evaluation/coordination, experimenting with decentralised decision rights and governance safeguards, and monitoring competitive dynamics related to model/platform providers.
Policy and practitioner takeaways explicitly presented in the discussion/implications sections, deriving from the conceptual framework and mapped literature.
The paper recommends a research agenda for AI economists: causal microeconometric studies (DiD, IVs, RCTs), structural models with hybrid human–AI agents, measurement work on GenAI use, distributional analysis and policy evaluation.
Explicit recommendations listed in the implications and research agenda sections; logical follow‑on from bibliometric findings about gaps in causal and measurement evidence.
Bibliometric mapping profiles the intellectual structure and evolution of the field but does not establish causal effects of GenAI on organisational outcomes.
Methodological limitation explicitly stated in the paper; bibliometric approach (co‑word, citation, thematic mapping) is descriptive and historical in scope.
Co‑word and thematic analyses reveal six coherent conceptual clusters that bridge technical AI topics (e.g., LLMs, GANs) with managerial themes (e.g., autonomy, coordination, decision‑making).
Thematic mapping and co‑word network analysis performed on the 212‑paper corpus; identification of six clusters reported in results.
Bibliometric and conceptual tools (VOSviewer, Bibliometrix) were used to identify performance trends, co‑word structures, thematic maps, and conceptual evolution in the GenAI–organisation literature.
Methods section: use of VOSviewer for network visualization and Bibliometrix for bibliometric statistics, co‑word analysis, thematic mapping and Sankey thematic evolution.
The study analysed a corpus of 212 Scopus‑indexed publications covering 2018–2025 to map emergent literature on Generative AI and organisational change.
Bibliometric dataset constructed from Scopus; sample size = 212 peer‑reviewed articles; time window 2018–2025; analyses performed with Bibliometrix and VOSviewer.
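The co-word step underlying such analyses counts keyword co-occurrences across papers; a minimal sketch with toy keywords echoing the cluster themes (VOSviewer/Bibliometrix internals are not reproduced here):

```python
from collections import Counter
from itertools import combinations

def coword_counts(papers):
    """Count keyword co-occurrences across papers (core of co-word analysis).

    Each paper is a set of keywords; an edge's weight is the number of
    papers in which both keywords appear. Clustering the resulting
    weighted network yields the thematic groups.
    """
    counts = Counter()
    for keywords in papers:
        for a, b in combinations(sorted(keywords), 2):
            counts[(a, b)] += 1
    return counts

papers = [
    {"LLMs", "decision-making", "autonomy"},
    {"LLMs", "coordination"},
    {"GANs", "decision-making", "autonomy"},
]
edges = coword_counts(papers)
print(edges[("autonomy", "decision-making")])  # co-occurs in 2 papers
```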
The paper identifies future research directions, including empirical causal studies on how DPP+AI interventions change recycling rates, second‑hand market prices, and firm investment in circular processes; and modeling firm strategy around proprietary vs shared DPP data.
Stated research agenda and gaps in the paper informed by the study's findings and limitations; these are recommendations rather than empirical claims.
The study used a mixed-methods design focused on the Italian fashion and cosmetics industries, employing two online surveys, k‑means clustering (consumer segmentation), principal component analysis (to identify underlying dimensions of DPP functionalities and sustainability practices), and logistic regression (to identify adoption drivers).
Methods section summary provided in the paper; explicit statement of methods and industry context. Note: sample sizes and survey instrument details are not provided in the summary.
Two consumer segments were identified: 'aware' consumers (environmentally attuned and receptive to digital innovation and sustainability information) and 'unaware' consumers (prioritize immediate, tangible benefits like price and convenience over sustainability information).
K‑means cluster analysis applied to consumer responses from one of the online surveys in the Italian fashion and cosmetics context; summary identifies two clusters; sample sizes not reported.
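A two-cluster segmentation of this kind can be sketched with a minimal k-means on toy survey features (the paper's actual features, sample sizes, and library are not reported; the feature names here are assumptions):

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        centroids = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Toy (sustainability_interest, price_sensitivity) survey responses:
# one receptive segment, one price/convenience-driven segment.
pts = [(0.9, 0.2), (0.8, 0.3), (0.85, 0.25),
       (0.2, 0.9), (0.3, 0.8), (0.25, 0.85)]
cents, groups = kmeans(pts)
print(sorted(len(g) for g in groups))  # two segments of three respondents
```

With well-separated responses like these, the two recovered clusters correspond to the 'aware' and 'unaware' profiles described above.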
This work is a conceptual/policy analysis rather than an original empirical study.
Explicit statement in the paper's Data & Methods section.
Study limitations include single-country (China) listed‑firm sample and reliance on secondary/administrative proxies for digitalization and innovation, which may miss internal qualitative aspects and introduce measurement error.
Authors’ stated limitations: sample restricted to Chinese A-share listed firms (2012–2022) and measures of digitalization/innovation derived from administrative/secondary data rather than direct observation/survey of internal practices.
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation should follow via suggested experiments.
Explicit statement in the paper about nature of contribution.
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).
Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.
MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic/physically-rich than prior datasets).
Design and motivation section in paper stating dataset construction targets clutter, occlusion, object variety, and complex object relations; dataset includes explicit contact annotations to capture interactions.
MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).
Dataset description in paper: collected real-world kitchen scenes and annotated object-level 3D shapes, poses, and contact/interaction labels. (No scene/instance counts provided in the supplied summary.)
Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.
Summary states measured benefits were qualitative and process metrics; no detailed quantitative throughput/coverage numbers provided. (Meta-claim about the evidence reported.)
Measuring the marginal cost of runtime governance, the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.
Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.
Roughly 25% of the training corpus is Italian-language data.
Corpus composition reported by the authors: Italian-language share ≈25% of total training tokens. The summary cites this proportion but does not list the datasets or language-detection methodology.
The model was trained on approximately 2.5 trillion tokens of data.
Training-data size reported in the paper (aggregate token count ≈2.5T). The summary provides this number; no per-dataset breakdown or provenance details are included in the summary.
Approximately 3 billion parameters are active per inference (sparse activation / ~3B active parameters at runtime).
Paper reports sparse MoE design with ≈3B active parameters per forward pass. Evidence comes from model design description (active set / routing), not from independent runtime FLOP logs in the summary.
EngGPT2-16B-A3B is a Mixture-of-Experts (MoE) model trained from scratch with a total of 16 billion parameters.
Model specification reported in the paper: architecture described as MoE and total parameter count listed as 16B. No contrary empirical test needed; claim is a declarative model spec.
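The 16B-total versus ~3B-active split follows directly from sparse expert routing: every token uses the shared (attention/embedding) parameters plus only its top-k routed experts. A sketch of the bookkeeping, with illustrative expert counts and sizes (assumptions, not reported in the summary):

```python
def moe_param_counts(shared, n_experts, expert_size, top_k):
    """Total vs per-token active parameter counts in a sparse MoE.

    `shared` covers parameters every token touches; only `top_k` of the
    `n_experts` expert blocks are activated per forward pass.
    """
    total = shared + n_experts * expert_size
    active = shared + top_k * expert_size
    return total, active

# Illustrative numbers only: chosen to reproduce the reported 16B / ~3B
# split; the actual expert count and sizes are not in the summary.
total, active = moe_param_counts(shared=1.0e9, n_experts=60,
                                 expert_size=0.25e9, top_k=8)
print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.0f}B")
```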
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
Fanar-27B was produced by continual pre-training from the Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.
Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.
Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.
Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).
A three-layer evaluation framework was applied systematically: Layer 1 = syntactic validity; Layer 2 = semantic correctness; Layer 3 = hardware executability (with sublayer 3b = end-to-end evaluation on quantum hardware).
Methods section describes application of a three-layer evaluation framework to each reviewed system, including the explicit sublayer 3b definition.
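The layered framework reads naturally as a short-circuiting pipeline: an artifact only reaches semantic or hardware checks once it passes the layers before them. A minimal sketch with hypothetical checks (the review's actual checkers are a parser, simulators, and quantum hardware):

```python
def evaluate(artifact, layers):
    """Run ordered evaluation layers, stopping at the first failure.

    `layers` is a list of (name, check) pairs; each check returns bool.
    Returns (layers passed, first failing layer or None).
    """
    passed = []
    for name, check in layers:
        if not check(artifact):
            return passed, name
        passed.append(name)
    return passed, None

# Hypothetical boolean checks for illustration only.
layers = [
    ("syntactic validity", lambda a: a["parses"]),
    ("semantic correctness", lambda a: a["correct"]),
    ("hardware executability", lambda a: a["runs_on_hw"]),
]
artifact = {"parses": True, "correct": True, "runs_on_hw": False}
print(evaluate(artifact, layers))
```

Sublayer 3b (end-to-end evaluation on quantum hardware) would slot in as an additional check after "hardware executability".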
The review grouped training regimes across the systems as supervised fine-tuning, verifier-in-the-loop reinforcement learning (RL), diffusion/graph generation, and agentic optimization.
Surveyed systems' training descriptions were classified into these training-regime categories during the review's analytical synthesis.
The review organized artifacts along artifact-type axes: Qiskit code, OpenQASM programs, and circuit graphs.
Analytical organization described in the methods: artifact-type axis enumerated as Qiskit, OpenQASM, and circuit graphs across the surveyed systems.
"Quantum code" in this review is defined as program artifacts (Qiskit code, OpenQASM); quantum error-correcting code (QEC) generation was excluded.
Inclusion/exclusion criteria specified in the review explicitly limited scope to program artifacts such as Qiskit and OpenQASM and excluded QEC-focused works.
A structured scoping review (Hugging Face, arXiv, provenance tracing; Jan–Feb 2026) identified 13 generative systems and 5 supporting datasets relevant to quantum circuit / quantum code generation.
Structured search of Hugging Face model/dataset listings, arXiv literature, and provenance tracing conducted between January and February 2026; results yielded 13 systems and 5 datasets (sample counts reported in the review).
This work is conceptual/theoretical and reports no original empirical dataset; it explicitly calls for mixed-methods empirical validation (case studies, field experiments, longitudinal studies), measurement development, and multi-level data collection.
Explicit methodological statement in the paper describing its nature as a theoretical synthesis and listing empirical needs; no empirical sample provided.