Evidence (2954 claims)
Claims by topic:
- Adoption: 5126 claims
- Productivity: 4409 claims
- Governance: 4049 claims
- Human-AI Collaboration: 2954 claims
- Labor Markets: 2432 claims
- Org Design: 2273 claims
- Innovation: 2215 claims
- Skills & Training: 1902 claims
- Inequality: 1286 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | 0 | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | 0 | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | 0 | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | 0 | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | 0 | 23 |
| Labor Share of Income | 7 | 4 | 9 | 0 | 20 |
Human-AI Collaboration (filtered claims)
**Claim:** Improving explainability can trade off with predictive performance, privacy, and robustness; these trade-offs must be managed rather than ignored.
**Evidence:** A review aggregating technical literature and conceptual analyses documenting trade-offs reported by researchers (e.g., simpler interpretable models sometimes having lower predictive accuracy, disclosure risks to privacy, robustness concerns); no single causal estimate is provided.
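The accuracy half of this trade-off is easy to reproduce on toy data. A minimal sketch, assuming scikit-learn is available; the dataset is synthetic, the example stands in for the pattern the review describes rather than any cited study, and the gap does not appear in every setting:

```python
# Interpretable-vs-accurate trade-off on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A depth-3 tree can be printed and audited; the boosted ensemble cannot.
interpretable = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
black_box = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("depth-3 tree accuracy:     ", interpretable.score(X_te, y_te))
print("gradient boosting accuracy:", black_box.score(X_te, y_te))
```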
**Claim:** The evidence base presented is limited to a single SME pilot, so generalizability across sectors, firm sizes, and data regimes is untested and requires further research.
**Evidence:** An explicit limitation noted in the paper; the pilot described is a single case study (sample size: one SME pilot).
**Claim:** Tasks that are routine, repetitive, or pattern-based (e.g., boilerplate coding, refactoring, unit test generation, some accessibility fixes) will be increasingly automated by AI.
**Evidence:** Task-level decomposition and examples of current automation capabilities (code generation, test suggestion tools); a conceptual projection rather than empirical measurement.
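To make "pattern-based" concrete: a unit-test skeleton is a near-deterministic function of the target's signature, which is why this class of task is automatable. A toy sketch using only the standard library; `make_test_stub` and the `slugify` target are invented for illustration, and real AI tooling goes far beyond this:

```python
# Boilerplate test generation as a mechanical, signature-driven task.
import inspect

def make_test_stub(func) -> str:
    """Emit a pytest skeleton for `func` derived from its signature alone."""
    params = ", ".join(f"{p}=..." for p in inspect.signature(func).parameters)
    return (
        f"def test_{func.__name__}():\n"
        f"    result = {func.__name__}({params})\n"
        f"    assert result == ...  # TODO: fill in the expected value\n"
    )

def slugify(text: str, sep: str = "-") -> str:
    """Example target function."""
    return sep.join(text.lower().split())

print(make_test_stub(slugify))
```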
**Claim:** Upfront costs for AI adoption are substantial: development, clinical validation, regulatory compliance, EHR integration, and ongoing monitoring.
**Evidence:** Implementation and regulatory literature synthesized in the review documenting typical cost categories and reported expenditures for clinical AI projects.

**Claim:** Large language models (LLMs) suffer from hallucinations (fabricated facts), overconfidence, and unpredictable failure modes in open-ended tasks.
**Evidence:** Technical papers and benchmarks on LLM factuality, calibration, and failure modes summarized in the review; empirical evaluations showing instances of fabricated outputs and calibration issues.
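Overconfidence in particular has a standard quantification in the calibration literature: Expected Calibration Error (ECE), the confidence-weighted gap between a model's stated confidence and its realized accuracy. A minimal sketch, assuming NumPy; the confidence/correctness pairs are fabricated purely to show the computation:

```python
# Expected Calibration Error: 0.0 means stated confidence matches accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; average |accuracy - confidence| per bin,
    weighted by the fraction of predictions in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# An overconfident model: ~93% average stated confidence, 50% accuracy.
conf = [0.95, 0.90, 0.99, 0.85, 0.92, 0.97]
hits = [1, 0, 1, 0, 0, 1]
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")
```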
**Claim:** Contemporary AI systems have no capacity for physical examination, sensorimotor procedures, or direct patient-contact diagnostics.
**Evidence:** Technical limitations of CNNs and LLMs described in the literature (lack of embodiment, no sensorimotor capabilities) and the absence of credible empirical demonstrations of safe autonomous physical clinical procedures in the reviewed studies.

**Claim:** Current models exhibit poor out-of-distribution (OOD) generalization: performance degrades when inputs differ from training distributions.
**Evidence:** Technical literature and robustness/domain-shift research reviewed in the paper documenting declines in model accuracy under domain shift and dataset changes.
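The degradation is easy to reproduce in miniature: fit a model, then score it on test inputs perturbed away from the training distribution, a crude stand-in for the corruption-style shift benchmarks in the robustness literature. Synthetic data; the specific numbers carry no empirical weight:

```python
# In-distribution vs. corrupted-input accuracy for the same model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
X_shifted = X_te + rng.normal(scale=3.0, size=X_te.shape)  # inputs pushed off-distribution

print("clean test accuracy:  ", model.score(X_te, y_te))
print("shifted test accuracy:", model.score(X_shifted, y_te))
```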
**Claim:** Heterogeneity in study designs and contexts within the literature limits direct comparability and generalizability of findings.
**Evidence:** A limitation noted in the paper based on the authors' assessment of diversity across the 103 reviewed studies (varying methods, contexts, metrics).

**Claim:** Institutional inertia, fragmented governance structures, limited technical capacity, and weak data stewardship impede scale-up of AI systems in the public sector.
**Evidence:** Thematic synthesis of barriers reported across empirical studies and institutional reports within the systematic review (103 items).

**Claim:** Low- and middle-income contexts face persistent gaps in infrastructure, data ecosystems, and talent retention that slow AI adoption in public governance.
**Evidence:** Consistent findings across multiple studies in the 103-item corpus reporting infrastructure deficits, weak data ecosystems, and brain-drain/retention issues in LMIC settings.
**Claim:** AI-generated code can introduce security vulnerabilities and raise licensing and intellectual-property concerns.
**Evidence:** Case studies of security incidents, analyses of generated-code provenance, and vulnerability-detection studies synthesized in the review.

**Claim:** LLMs sometimes generate incorrect, nonsensical, or insecure code (hallucinations).
**Evidence:** Multiple benchmarks, code-generation accuracy tests, and incident case studies documented in the empirical literature showing incorrect or fabricated outputs.

**Claim:** Results reflect small-scale e-commerce use cases; external validity to larger firms, other sectors, or more complex tasks is not established.
**Evidence:** The scope of deployments is limited to small-scale e-commerce settings, as stated in the methods; no cross-sector or large-firm samples are reported in the summary.

**Claim:** The study's evidence is observational rather than from randomized controlled trials, so causal estimates of productivity impacts are suggestive rather than definitive.
**Evidence:** Declared study design: applied experimentation and observational analysis of deployments (no randomized assignment); the methods section explicitly notes the observational limitation.

**Claim:** High upfront costs, weak digital and physical infrastructure, limited access to credit, low digital literacy, insecure land tenure, and sociocultural factors (including gendered access) limit uptake of digital and precision technologies among smallholders.
**Evidence:** Consistent findings across program evaluations, qualitative stakeholder interviews, participatory assessments, and case studies cited in the synthesis.

**Claim:** Integrating AI raises questions of accountability, transparency, fairness, privacy, and bias; managerial responsibility includes governance design, validation, and audit of AI decisions.
**Evidence:** A normative, governance-focused synthesis citing ethical frameworks and illustrative cases; it identifies governance tasks and validation/audit needs rather than empirical prevalence rates.
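What "audit of AI decisions" implies operationally can be sketched as an append-only decision log. The structure below is a minimal illustration of the idea, not a standard and not the paper's proposal; every field name and example value is invented:

```python
# A minimal audit record for an AI-assisted decision (illustrative schema).
import json
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class DecisionRecord:
    model_id: str                  # which model/version produced the output
    input_digest: str              # hash or reference to the input, not raw PII
    output: str                    # the decision or recommendation made
    confidence: float              # model-reported confidence, if available
    human_reviewer: Optional[str]  # who signed off, for accountability
    overridden: bool = False       # whether a human overruled the model
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def log_decision(record: DecisionRecord, sink) -> None:
    """Append one decision as a JSON line to an audit sink."""
    sink.write(json.dumps(asdict(record)) + "\n")

log_decision(DecisionRecord("credit-scorer-v3", "sha256:ab12",
                            "approve", 0.87, human_reviewer="analyst-42"),
             sys.stdout)
```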
**Claim:** Deficits in governance, auditing, and interpretability constrain the safe deployment of generative AI in firms.
**Evidence:** Synthesis of industry reports and conceptual literature noting gaps in governance and interpretability; no quantitative governance dataset reported.

**Claim:** Algorithmic biases in generative AI can amplify and codify discriminatory patterns in organizational decisions.
**Evidence:** Extensive literature on algorithmic bias synthesized in the review and applied to generative models; case examples referenced.

**Claim:** Generative AI use introduces significant organizational risks, including data privacy breaches and leakage when models or third-party services are used.
**Evidence:** Conceptual analysis and references to documented incidents and industry reports within the review; no single aggregated incident dataset provided.

**Claim:** Generated code can introduce security vulnerabilities.
**Evidence:** Security analyses and code audits documenting examples where LLM-generated code contains known vulnerability patterns; incident-oriented case studies and controlled experiments assessing vulnerability incidence.
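The single most frequently flagged pattern in such audits is string-built SQL. The sketch below contrasts the injectable form commonly seen in generated code with the parameterized fix; the table and data are invented for illustration:

```python
# SQL injection: the classic vulnerability pattern in generated code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name: str):
    # Generated-code anti-pattern: user input spliced into the query string.
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(name: str):
    # Parameterized query: the driver treats the value as data, not SQL.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print("unsafe:", find_user_unsafe(payload))  # injection leaks every row
print("safe:  ", find_user_safe(payload))    # no user is literally named that
```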
**Claim:** LLMs can produce plausible-looking but incorrect or insecure code ("hallucinations").
**Evidence:** Benchmarks and controlled tests demonstrating incorrect outputs; security analyses and replicated examples showing erroneous or insecure snippets produced by LLMs across multiple models and prompts.

**Claim:** Dependence on LLM behavior means hallucinations, bias, or misaligned reasoning can propagate into simulated outcomes; chain-of-thought reasoning may be hard to fully verify, posing interpretability and auditability challenges.
**Evidence:** The paper's cautions section listing potential failure modes and ethical/interpretability risks; these are identified risks rather than quantified failures observed in experiments.

**Claim:** The study is limited by being a single-domain (CMM) case study with a likely modest sample size and dependence on specific AR hardware and MLLM capabilities; further validation across other machines and larger samples is needed.
**Evidence:** The authors note these limitations in their discussion; the summary explicitly lists the single-case domain, likely modest sample size, and dependency on particular hardware/MLLM as limitations.

**Claim:** Governing-logic stability uncertainty (whether decision logic or objectives remain stationary) is a distinct risk posed by agentic AI.
**Evidence:** Conceptual argument and proposed taxonomy; no empirical tests reported.

**Claim:** Epistemic grounding uncertainty (uncertainty about how or why an AI produced a particular output) increases with agentic AI.
**Evidence:** Literature synthesis on model-level opacity and the limits of causal explanation; conceptual reasoning in the paper.

**Claim:** Behavioral trajectory uncertainty (difficulty predicting long-run actions) is a primary form of uncertainty introduced by agentic AI.
**Evidence:** Conceptual classification and argument; proposed as one of three principal uncertainties; no empirical estimation.

**Claim:** Integration cost: AI-generated outputs often require human revision, testing, and manual integration into existing systems.
**Evidence:** Reported practitioner experience and observed practices from the field study at Netlight; the authors note time and effort spent on revision and integration; no quantitative time-cost estimates provided.

**Claim:** AI systems lack full project context, design rationale, and long-term constraints, creating context gaps for development tasks.
**Evidence:** Interviews and workflow observations at Netlight where practitioners reported contextual limitations of AI tools; qualitative examples provided; single-firm qualitative evidence.

**Claim:** AI outputs commonly contain errors and hallucinations: generated code can be incorrect, incomplete, or misleading.
**Evidence:** Practitioner reports and observed interactions with AI tools documented in the Netlight qualitative study; specific instances and practitioner concerns described in the paper; no quantitative error rates provided.

**Claim:** Integration and engineering complexity (legacy systems, privacy/compliance pipelines, multi-channel platforms) is a persistent barrier to deployment.
**Evidence:** Industry case studies and practitioner reports synthesized in the review documenting integration challenges; no systematic cost accounting or sample sizes presented.

**Claim:** Hallucinations and factual errors from generative AI can damage service quality and customer trust.
**Evidence:** Documented failure cases and empirical reports from the literature aggregated by the review; no novel incident count or experimental data in this paper.

**Claim:** Generative AI is susceptible to social and representational biases and to factual errors or hallucinations; it lacks tacit, contextual domain expertise.
**Evidence:** Documented examples in the literature of biased outputs and hallucinations; controlled evaluations and audits of model outputs; qualitative reports highlighting the lack of tacit knowledge in domain-specific tasks.

**Claim:** The quality of AI-generated outputs is highly variable; models frequently produce mediocre but plausible-sounding content that requires human filtering.
**Evidence:** Multiple user studies and qualitative reports documenting variability in output quality and the need for human curation; outcome measures include error rates, user-rated quality, and time spent vetting.

**Claim:** Factual errors and hallucinations create misinformation risks and can produce costly service failures.
**Evidence:** Model evaluation studies, incident case reports from deployments, and academic/industry analyses documenting hallucination rates and concrete failure examples.
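A "hallucination rate" in such studies is usually operationalized as the share of generated claims unsupported by a trusted reference. The sketch below shows the shape of that measurement; production pipelines use retrieval or entailment models rather than the exact-match check here, and all data is fabricated:

```python
# Share of generated claims not backed by a reference knowledge base.
def hallucination_rate(generated_claims, supported_claims) -> float:
    supported = set(supported_claims)
    unsupported = [c for c in generated_claims if c not in supported]
    return len(unsupported) / len(generated_claims)

outputs = [
    "The refund window is 30 days.",
    "Premium users get 24/7 phone support.",  # not in the knowledge base
    "Shipping is free over $50.",
]
knowledge_base = [
    "The refund window is 30 days.",
    "Shipping is free over $50.",
]
print(f"hallucination rate: {hallucination_rate(outputs, knowledge_base):.0%}")
```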
**Claim:** Resource, compute, privacy, and deployment costs associated with CRAEA were not fully quantified in the paper.
**Evidence:** The authors note that resource, compute, privacy, and deployment costs were not fully quantified; no cost analyses or benchmarks are provided in the summary.

**Claim:** Evaluation was performed in an artificial, simulated home environment; real-world transfer, robustness to noisy perception, and hardware constraints therefore remain open questions.
**Evidence:** The authors explicitly state that evaluations occurred in a simulated home environment and acknowledge limits on real-world transfer and robustness; this is a stated limitation rather than an experimental finding.

**Claim:** High linguistic diversity in Africa makes building and evaluating multilingual language technologies more difficult and is a barrier to inclusive AI.
**Evidence:** Synthesis of technical literature on NLP and multilingual model development and policy/NGO reports highlighting missing language resources; no original model evaluation reported.

**Claim:** Structural constraints, including limited digital infrastructure, scarce and skewed data, and high linguistic diversity, complicate AI development, deployment, and evaluation in African contexts.
**Evidence:** Desk review of infrastructure and data-availability reports and scholarly literature demonstrating gaps and their effects; no new measurement in this paper.

**Claim:** Privacy concerns, regulatory/compliance issues, biased or opaque models, and the need for change management and HR-analytics capability building are significant risks constraining adoption.
**Evidence:** Recurring risks and constraints reported by multiple included studies; summarized in the review's "risks and constraints" theme.

**Claim:** Implementation of data-driven HRM faces recurring challenges: data quality, privacy and ethics, algorithmic bias, and deficiencies in skills and organizational readiness.
**Evidence:** Commonly reported implementation issues across the 47 reviewed studies; extracted as a central theme in the review's thematic analysis.

**Claim:** Constraints and risks include model risk (overfitting, drift), algorithmic bias, privacy and data-sharing limits, legacy ERP complexity, interoperability challenges, and limited organizational readiness and skills.
**Evidence:** Reviewed literature (empirical studies, technical evaluations, and standards) documenting technical and organizational failures, risk incidents, and common barriers to implementation.
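Of the model risks listed, drift is the most mechanically monitorable. A minimal sketch of the kind of check model-risk frameworks call for, comparing a feature's training-time distribution against recent production data with a two-sample Kolmogorov-Smirnov test; the data and the 0.01 threshold are illustrative, and this assumes NumPy and SciPy:

```python
# Feature-drift check: training distribution vs. recent production data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # captured at fit time
prod_feature = rng.normal(loc=0.4, scale=1.2, size=1000)   # drifted in production

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"drift detected (KS={stat:.3f}, p={p_value:.2g}); consider retraining")
else:
    print("no significant drift detected")
```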
**Claim:** Algorithmic bias, unequal digital financial literacy, caregiving time constraints, and limited access to personalized solutions can sustain or reproduce gender investment gaps if not addressed.
**Evidence:** Synthesis of literature on barriers to financial inclusion and AI fairness concerns, plus platform-report observations (a review of empirical and conceptual studies, not a single empirical test).

**Claim:** In some settings, women statistically exhibit greater risk aversion than men.
**Evidence:** Summary of empirical survey and experimental studies on gender differences in risk attitudes discussed in the review (multiple cross-sectional and lab/field experiments referenced).

**Claim:** Evaluation is carried out under three frozen context configurations (diff only: config_A; diff with file content: config_B; full context: config_C), enabling systematic ablation of context-provision strategies.
**Evidence:** Methodological description: three fixed context configurations defined and used for ablation experiments.
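One way the three frozen configurations could be represented in an evaluation harness is shown below. The `config_A`/`config_B`/`config_C` names follow the paper; the field names and prompt-assembly logic are assumptions for illustration:

```python
# Frozen context configurations for the ablation (field names assumed).
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextConfig:
    name: str
    include_diff: bool
    include_file_content: bool
    include_full_context: bool

CONFIGS = {
    "config_A": ContextConfig("config_A", True, False, False),  # diff only
    "config_B": ContextConfig("config_B", True, True, False),   # diff + file content
    "config_C": ContextConfig("config_C", True, True, True),    # full context
}

def build_input(cfg: ContextConfig, diff: str, file_content: str,
                project_context: str) -> str:
    """Assemble the model input for one configuration."""
    parts = [diff] if cfg.include_diff else []
    if cfg.include_file_content:
        parts.append(file_content)
    if cfg.include_full_context:
        parts.append(project_context)
    return "\n\n".join(parts)

print(build_input(CONFIGS["config_B"], "diff text", "file text", "project text"))
```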
**Claim:** We performed an extensive evaluation of 37 state-of-the-art Vision-Language Models on MultihopSpatial.
**Evidence:** Empirical evaluation described in the paper listing the number of models evaluated (37).
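The excerpt does not describe the benchmark format or the model interface, so the harness below is schematic: the example structure, the `answer` method, and the dummy model are all hypothetical placeholders for whatever the paper actually used.

```python
# Schematic exact-match evaluation loop for multi-hop spatial QA.
def evaluate(vlm, examples):
    """Fraction of examples where the model's answer matches the gold answer."""
    correct = sum(
        vlm.answer(ex["image"], ex["question"]).strip().lower() == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)

class DummyVLM:
    """Stand-in for one of the 37 evaluated models (hypothetical interface)."""
    name = "dummy-vlm"
    def answer(self, image, question):
        return "left"

examples = [
    {"image": None, "question": "Is the cup left of the plate?", "answer": "left"},
    {"image": None, "question": "Is the lamp above the shelf?", "answer": "no"},
]
print(DummyVLM.name, evaluate(DummyVLM(), examples))
```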
**Claim:** Economic evaluations of GLAI should account for end-to-end risk externalities (error propagation, institutional trust, rights impacts), not only short-term productivity gains.
**Evidence:** A methodological recommendation grounded in a conceptual synthesis of technical, behavioral, and legal risks; a normative argument rather than an empirical result.

**Claim:** Generative Legal AI (GLAI) systems are built on token-prediction (LLM) architectures rather than formal legal-reasoning architectures.
**Evidence:** Conceptual and technical analysis in the paper distinguishing GLAI from other legal tech; literature synthesis on common LLM architectures; a qualitative/technical review with no original empirical dataset or sample size.

**Claim:** Through a thematic review of existing research, the authors identified recurring themes about incentive schemes: their components, how researchers manipulate them, and their impact on research outcomes.
**Evidence:** The authors' stated method and findings: a thematic review (the scope and number of reviewed papers are not specified in the excerpt).

**Claim:** A critical aspect of conducting human-AI decision-making studies is the role of participants, who are often recruited through crowdsourcing platforms.
**Evidence:** A claim based on the authors' thematic literature review noting participant-sourcing practices (specific studies and counts not given in the excerpt).

**Claim:** Researchers conduct empirical studies investigating how humans use AI assistance for decision-making and how this collaboration affects results.
**Evidence:** A statement summarizing the research landscape, supported implicitly by the authors' thematic review of existing empirical studies (number of studies not specified in the excerpt).