Evidence (5126 claims)
| Topic | Claims |
|---|---|
| Adoption | 5126 |
| Productivity | 4409 |
| Governance | 4049 |
| Human-AI Collaboration | 2954 |
| Labor Markets | 2432 |
| Org Design | 2273 |
| Innovation | 2215 |
| Skills & Training | 1902 |
| Inequality | 1286 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
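One way to read the matrix is to turn each row into direction shares and to check whether the four listed directions account for the stated total. The sketch below does this for a handful of rows copied from the table. Treating an empty cell as zero is an assumption, and the function and field names are illustrative only. For some rows the four columns sum to less than the Total; the sketch only reports that gap and does not try to explain it.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
# Assumption: a cell shown as a dash in the table is treated as 0 here.
ROWS = {
    "Other": (369, 105, 58, 432, 972),
    "Firm Productivity": (273, 33, 68, 10, 389),
    "Task Completion Time": (71, 5, 3, 1, 80),
    "Job Displacement": (5, 28, 12, 0, 45),
    "Skill Obsolescence": (3, 18, 2, 0, 23),
}


def summarize(row):
    """Return direction shares and the count not covered by the four columns."""
    positive, negative, mixed, null, total = row
    listed = positive + negative + mixed + null
    return {
        "positive_share": positive / total,
        "negative_share": negative / total,
        "unlisted": total - listed,  # 0 when the columns add up to Total
    }


for name, row in ROWS.items():
    s = summarize(row)
    print(
        f"{name}: {s['positive_share']:.1%} positive, "
        f"{s['negative_share']:.1%} negative, "
        f"{s['unlisted']} claims outside the listed directions"
    )
```

Shares are computed against the stated Total rather than the column sum, so a row with unlisted claims will have shares that add up to less than 100%.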
Adoption
The current literature is skewed toward descriptive and engineering work; there is a lack of causal, field‑experimental evidence on NLP interventions' effects on customer behavior and firm profits.
Review coding of study types in the sample (engineering/descriptive vs. experimental/causal) showing few field experiments or causal designs.
Important gaps include customer acquisition, personalization at scale, use of external text sources (social media, news, reviews), operational process improvement, and cross‑channel integration.
Gap detection via low‑density regions in the UMAP thematic map of sentence‑transformer embeddings and manual review showing low article counts for these topics within the 109‑article sample.
Existing literature on NLP in marketing is concentrated around customer retention tasks (e.g., churn prediction, complaint handling, relationship management).
Thematic clustering from sentence‑transformer embeddings of article text combined with UMAP visualization, and manual review of article topics and keywords identifying frequent retention‑related themes.
NLP applications in bank marketing are severely under‑studied.
Descriptive result from the PRISMA review showing that only 8 of 109 articles (≈7%) focused on NLP in bank marketing, plus thematic mapping showing sparse coverage of the bank‑marketing/NLP intersection.
AI‑enabled platforms can magnify winner‑takes‑most dynamics in digital services trade, concentrating market power.
Theoretical and empirical literature on network effects and platform markets reviewed in the paper; illustrative examples (no novel empirical aggregation).
Current data governance regimes in China can impede cross‑border data flows.
Comparative policy analysis and literature documenting data localization and privacy/regulatory regimes that restrict flows (descriptive evidence in the review).
Institutional barriers—fragmented international rules on data flows and privacy, regulatory divergence including data localization, weak participation in multilateral rule setting, and uneven domestic regulation of platforms—impede digital services trade.
Comparative policy analysis and literature review, supported by policy documents and case examples (qualitative evidence; no original econometric tests).
Vietnam's civil-law features—statutory specificity, formal procedures, and constitutional principles like legal certainty and fairness—make straightforward AI deployment legally fraught.
Close textual analysis of Vietnam's statutes, constitutional provisions, and administrative procedures (doctrinal legal analysis); no quantitative sample.
Automated decisions complicate assigning responsibility and hinder judicial and administrative reviewability.
Doctrinal examination of accountability and review mechanisms in administrative law plus comparative institutional analysis of automated decision-making governance.
Opaque AI models risk violating notice, reason-giving, and appeal rights protected under administrative due process.
Analysis of procedural due-process requirements (notice, reason-giving, appeal) in Vietnam's legal framework and assessment of opacity issues in algorithmic systems; qualitative reasoning, no empirical testing.
Provider incentives may be misaligned (e.g., optimizing for engagement or test performance instead of durable learning), requiring contracts, regulation, or purchaser design to align incentives.
Consensus from interdisciplinary workshop (50 scholars) highlighting incentive risks and market-design considerations; descriptive, not empirical.
Extensive learner data needed to personalize AI feedback raises privacy and data-governance concerns (consent, storage, usage).
Qualitative consensus from workshop participants (50 scholars) noting data-collection requirements and governance risks; no empirical governance studies included.
Automated feedback may not capture pedagogical nuances expert teachers use (motivation, socio-emotional cues, complex reasoning), limiting pedagogical fit.
Expert syntheses from the workshop of 50 scholars highlighting limits of automation relative to expert teacher judgment; no empirical comparisons presented.
AI-generated feedback can be incorrect, misleading, or misaligned with learning objectives; assessing feedback quality is nontrivial.
Repeated concern raised across workshop participants (50 scholars) in qualitative synthesis; noted as a substantive risk and open challenge rather than empirically quantified here.
Integration costs—domain modeling, human-in-the-loop protocols, and regulatory/liability frameworks—are significant barriers to deployment.
Conceptual assessment of operational and regulatory requirements; no quantified cost studies provided.
Argumentation frameworks (AFs) and LLMs may be gamed or misled; adversaries may exploit these systems, leading to strategic argumentation or manipulation.
Conceptual security/adversarial concern based on known vulnerabilities in ML and strategic behavior; no adversarial tests reported.
Faithful extraction—aligning LLM-extracted arguments with formal AF primitives and ensuring fidelity to source evidence—is a key technical challenge.
Paper's explicit identification of failure modes and alignment issues; grounded in documented limitations of IE/LLMs (no empirical quantification here).
Computational argumentation approaches have required heavy feature engineering and domain-specific knowledge to be effective.
Conceptual claim grounded in prior work and practical experience reported in the literature; no quantitative cost estimates provided in the paper.
Automation bias (human tendency to defer to automated outputs) compounds the risk that GLAI errors become embedded in legal processes.
Behavioral literature review on automation bias and trust in AI systems; applied to legal-context vignettes. No primary empirical test within the paper.
A key architectural risk is interoperability failure and fragmentation across vendors and protocols in agent ecosystems.
Comparative analysis with IoT and other platform histories showing vendor/protocol fragmentation; argument is conceptual and illustrative rather than empirically measured for future agent ecosystems.
Domains such as disaster response, healthcare, industrial automation, and mobility will be affected and are safety‑critical, where failures have high social and economic cost.
Domain examples and policy reasoning; draws on general knowledge about those sectors and potential harms; no new empirical damage quantification provided in the paper.
IoT digitized perception at scale but exposed limitations such as fragmentation, weak security, limited autonomy, and poor sustainability.
Historical and comparative analysis of IoT deployments and literature cited illustratively in the paper; qualitative evidence from prior IoT incidents and ecosystem studies rather than new empirical data.
Adoption requires hardware (VR headsets, capable GPUs) and integration effort, implying upfront capital expenditure for labs/observatories.
Paper explicitly notes hardware requirements (VR headsets, capable GPUs) and integration effort as part of adoption considerations; common-sense assessment of required capital.
Current models heavily rely on large static datasets and batch training and exhibit poor lifelong/continual learning.
Synthesis of common practices in contemporary ML (supervised pretraining and offline training paradigms); no new experiments provided.
When identical replies are labeled as coming from AI rather than from a human, recipients report feeling less heard and less validated (an attribution effect).
Controlled attribution labeling experiment within the study: identical replies presented with different source labels (AI vs. human) and recipient-rated perceptions of being heard/validated measured.
HindSight scores are negatively correlated with LLM-judged novelty (Spearman ρ = −0.29, p < 0.01), indicating LLM judges tend to overvalue novel-sounding ideas that do not materialize in the literature.
Reported Spearman correlation between HindSight scores and LLM-judged novelty across the generated ideas; ρ = −0.29 with p < 0.01. Interpretation that LLMs overvalue novel-sounding ideas is drawn from the negative correlation.
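For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the ranks of two variables. The following is a minimal standard-library sketch of that computation. The two score lists are made-up toy values, not data from the study, and the variable names are illustrative. It does not compute a p-value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only (hypothetical): one score pair per generated idea.
hindsight_scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
llm_novelty = [2, 3, 5, 1, 4, 6]

print(f"rho = {spearman_rho(hindsight_scores, llm_novelty):.2f}")
```

A negative ρ means ideas ranked as more novel by the judge tend to rank lower on the other score; it does not by itself establish why.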
Barriers to adoption include toolchain cost, trace data storage/transfer demands, IP-security concerns when sharing traces, and organizational inertia.
Listed as practical caveats and limitations in the summary; based on authors' experience and reasoning rather than quantified study.
Adoption requires up-front investment in tooling and infrastructure for deterministic capture/replay, plus management of large trace data and integration with existing validation/IP/security workflows.
Authors explicitly list these practical caveats in the summary: needs tooling/infrastructure, trace data management, and integration with validation flows and IP/security constraints. (Descriptive claim based on implementation experience; no cost figures provided.)
Static ACLs evaluate deterministic rules that ignore partial execution paths and therefore can only capture a subset of organizational constraints.
Formal argument and examples showing static ACLs map to Policy functions that do not depend on partial_path; illustrative limitations presented.
Runtime evaluation imposes additional compute, latency, logging, and engineering costs that increase the marginal cost of deploying agents.
Operational discussion in the paper outlining additional runtime compute and logging requirements; cost implications argued qualitatively; no empirical cost measurements provided.
Prompt-level instructions and static access control lists (ACLs) are limited special cases of a more general runtime policy-evaluation framework and cannot, in general, enforce path-dependent rules.
Formalization showing prompt/system messages and static ACLs map to restricted forms of the Policy(agent_id, partial_path, proposed_action, org_state) function; logical proof/argument in the paper and illustrative counterexamples.
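The distinction can be illustrated with a small sketch built around the `Policy(agent_id, partial_path, proposed_action, org_state)` signature named above. Everything else here is hypothetical and chosen for illustration: the `Action` type, the ACL entries, and the example rule ("no outbound email after customer records were read") are assumptions, not the paper's formalization.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Action:
    tool: str       # hypothetical tool name, e.g. "read_db"
    resource: str   # hypothetical target of the action


# Hypothetical static ACL: which agent may call which tool.
STATIC_ACL = {("agent-1", "read_db"), ("agent-1", "send_email")}


def static_acl_policy(agent_id: str, partial_path: Sequence[Action],
                      proposed_action: Action, org_state: dict) -> bool:
    """Static ACL as a restricted Policy: partial_path is accepted but ignored."""
    return (agent_id, proposed_action.tool) in STATIC_ACL


def path_dependent_policy(agent_id: str, partial_path: Sequence[Action],
                          proposed_action: Action, org_state: dict) -> bool:
    """Illustrative path-dependent rule layered on top of the static ACL."""
    if not static_acl_policy(agent_id, partial_path, proposed_action, org_state):
        return False
    read_customers = any(
        a.tool == "read_db" and a.resource == "customers" for a in partial_path
    )
    # Deny outbound email once customer records appear earlier in the path.
    return not (proposed_action.tool == "send_email" and read_customers)


read = Action("read_db", "customers")
email = Action("send_email", "external")

print(static_acl_policy("agent-1", [read], email, {}))      # ACL cannot see the path
print(path_dependent_policy("agent-1", [read], email, {}))  # path-aware rule
print(path_dependent_policy("agent-1", [], email, {}))      # same action, empty path
```

The same proposed action receives different decisions depending on what precedes it, which is exactly the information a function that ignores `partial_path` cannot use.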
LLM-based agent behavior is non-deterministic and path-dependent: an agent's safety/compliance risk depends on the entire execution path, not just the current prompt or single action.
Formal/abstract execution model defined in the paper (states, actions, execution paths) and conceptual arguments/illustrative examples showing how earlier states/actions affect later behavior; no large-scale empirical dataset reported.
Qualitative case studies show modality-specific failures, such as correct entity recognition paired with an incorrect factual attribute.
Paper includes qualitative examples/case studies from the benchmark where models identify entities in images correctly but produce incorrect time-sensitive attributes (e.g., current officeholder or company status).
Real-world deployment will require representative data coverage and online adaptation despite the method’s robustness mechanisms.
Authors' discussion/limitations section: theoretical requirements for persistently exciting/representative trajectories for DeePC and recommendation for online adaptation and continual data collection for deployment.
Agent performance degrades markedly as environment complexity, stochasticity, and non-stationarity increase, revealing core limitations of current LLM-based agents for long-horizon, multi-factor decision problems.
Experimental results across progressively harder RetailBench environments showing performance falloff for multiple LLMs under increased task complexity and non-stationarity.
Behavioral memorization probe (TS‑Guessing) signaled memorization above chance for 72.5% of prompts across all models and items.
Experiment 3 — TS‑Guessing behavioral probe applied exhaustively to all 513 MMLU questions × six models (total prompts = 513×6); statistical thresholds used to classify above-chance memorization signals, yielding 72.5% of prompts flagged.
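The scale implied by these figures can be checked with simple arithmetic. The sketch below computes the total prompt count from 513 questions and six models, then lists which flagged-prompt counts would round to 72.5% at one decimal place. The exact flagged count is not stated in the summary, so the result is a range consistent with the reported percentage, not a reported number.

```python
QUESTIONS = 513
MODELS = 6
REPORTED_PERCENT = 72.5  # reported share of prompts flagged, one decimal place

total_prompts = QUESTIONS * MODELS

# Every flagged count k whose percentage rounds to the reported figure.
consistent_counts = [
    k for k in range(total_prompts + 1)
    if round(100 * k / total_prompts, 1) == REPORTED_PERCENT
]

print(f"total prompts: {total_prompts}")
print(f"flagged counts consistent with {REPORTED_PERCENT}%: "
      f"{consistent_counts[0]} to {consistent_counts[-1]}")
```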
Paraphrase / indirect-reference diagnostic: on a 100-question subset, average accuracy dropped by 7.0 percentage points under indirect referencing.
Experiment 2 — paraphrase/indirect-reference diagnostic applied to a 100-question subset of MMLU; measured delta between original and paraphrased question accuracy averaged to 7.0 percentage points.
STEM items show higher lexical contamination (18.1%) relative to the overall rate.
Category-level results from Experiment 1 (lexical matching) on the MMLU dataset (513 questions), aggregated by subject domain to compute an 18.1% contamination rate for STEM categories.
Overall lexical contamination: 13.8% of MMLU items show evidence of exposure in training data.
Experiment 1 — lexical contamination detection pipeline that searched model training–era public corpora and the open web for literal or near-literal occurrences of the 513 MMLU questions/answers; per-item contamination flags aggregated to produce the 13.8% figure.
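A common way to detect literal or near-literal occurrences is word n-gram overlap between a benchmark item and candidate documents. The sketch below shows that general technique under stated assumptions: the n-gram length, the 0.5 threshold, the function names, and the tiny corpus are all illustrative choices, and nothing here is claimed to match the pipeline used in the audit.

```python
import re


def word_ngrams(text, n=8):
    """Set of lowercase word n-grams in text (punctuation stripped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item, document, n=8):
    """Fraction of the item's n-grams that also occur in the document."""
    item_grams = word_ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to match
    return len(item_grams & word_ngrams(document, n)) / len(item_grams)


def contamination_rate(items, corpus, n=8, threshold=0.5):
    """Share of items whose overlap with any corpus document meets the threshold."""
    if not items:
        return 0.0
    flagged = [
        any(overlap_ratio(item, doc, n) >= threshold for doc in corpus)
        for item in items
    ]
    return sum(flagged) / len(items)


# Toy data (hypothetical), using short n-grams because the texts are short.
items = [
    "What is the capital of France?",
    "Which gas do plants absorb from the air?",
]
corpus = ["Quiz dump: what is the capital of France? Answer: Paris."]

print(contamination_rate(items, corpus, n=3))
```

Aggregating the per-item flags gives an overall rate, and grouping the flags by subject before averaging would give category-level rates of the kind reported for STEM.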
Public leaderboards overstate modern LLM capabilities because substantial portions of benchmark QA items appear in (or are memorized from) training data, inflating measured accuracy.
Multi-method contamination audit across six frontier LLMs (GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, Qwen3-235B) evaluated on the MMLU benchmark (513 questions, 57 subjects), using lexical matching, paraphrase sensitivity, and behavioral memorization probes that together show systematic leakage.
None of the 13 systems report end-to-end evaluation on real quantum hardware (Layer 3b).
Systematic check of reported experiments for each of the 13 systems found no documented real-device, end-to-end hardware execution results (explicit Layer 3b reporting was absent).
Aggregating informal and recommendation data raises privacy and consent issues in low-regulation contexts, requiring governance safeguards.
Policy and ethical consideration based on the nature of the data used; no specific privacy-impact assessment reported in the summary.
NLP/ML systems can inherit biases from inputs (underrepresentation, noisy self-reports, biased recommendations) and may therefore disadvantage some youth unless transparency and fairness constraints are implemented.
Reasoned risk assessment grounded in known properties of ML/NLP; the pilot summary does not report an audit or measured bias outcomes.
There are limited randomized controlled trials or longitudinal evaluations; few studies measure patient-relevant outcomes or economic impacts.
Literature synthesis noting scarcity of RCTs and long-term observational studies, and absence of widespread patient-outcome and cost-effectiveness evaluations in existing publications.
Many published studies focus on standalone algorithm accuracy rather than clinician–AI joint performance in routine workflows.
Review of the literature categorizing study designs (preponderance of algorithm development/validation studies, fewer reader-in-the-loop, simulation, or deployment studies).
Regulators and payers remain central bottlenecks—AI can accelerate discovery but cannot bypass clinical evidence requirements.
Policy discussion and regulatory analysis in the paper noting that approvals require clinical evidence independent of discovery modality.
Downstream clinical development costs and translational failure rates remain the major drivers of total R&D expenditure; early-stage AI savings may not translate into proportionate increases in approved drugs.
Economic analysis and discussion in the paper referencing known cost distributions in drug development and historical attrition rates in clinical phases.
Inherent biological complexity and translational gaps between in silico predictions, preclinical models, and human biology constrain downstream success rates.
Review of translational failures and literature cited in the paper demonstrating mismatch between preclinical signals and clinical outcomes; conceptual analysis of biological complexity.
Gaps exist between computational designs and chemical/experimental feasibility (e.g., synthetic accessibility and assay readiness), limiting the usefulness of some generative outputs.
Case studies and critiques in the paper showing generated molecules that are synthetically infeasible or incompatible with experimental constraints; discussion of missing integration of practical constraints in many generative models.
Many models have limited interpretability and insufficient uncertainty quantification, hampering trust and decision-making.
Methodological analysis in the paper noting common deep-learning approaches lacking clear interpretability and uncertainty estimates; references to literature on model explainability and calibration gaps.