Evidence (5539 claims)
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Adoption
Despite improvements from ESE, current LLM-based agents are not robust enough for fully autonomous long-horizon management in complex, non-stationary commercial environments; human oversight and hybrid systems remain necessary.
Observed substantial performance degradation of LLM agents (including ESE) as complexity and non-stationarity increased across RetailBench experiments; discussion of practical deployment risks and failure amplification over long horizons.
Key observed failure modes include error accumulation over long horizons, inability to revise strategy adequately under evolving external conditions, and sensitivity to multi-factor interactions.
Behavioral analyses and failure-mode characterization from experiments on RetailBench across long horizons and non-stationary conditions reported in the paper.
Trade-off curves in the experiments show that increasing a target factuality guarantee reduces retained task utility/informativeness.
Reported trade-off curves between factuality guarantees and the proposed informativeness metrics across experiments.
High factuality thresholds frequently force redaction or omission of content, producing outputs that are less informative for downstream tasks.
Empirical observations using the paper's informativeness-aware metrics and examples showing increased redaction/vacuity as thresholds rise.
The conformal factuality guarantee is not robust to distribution shift or to distractor evidence unless calibration examples closely match deployment conditions.
Experiments showing factuality and downstream performance degradation when calibration and deployment distributions differ, and when distractor evidence is present; discussion linking robustness failure to violation of exchangeability assumptions.
Achieving high guaranteed factuality levels often causes models to produce vacuous or overly conservative outputs, reducing task usefulness (informativeness).
Empirical evaluation across the paper's benchmarks showing trade-off curves between target factuality thresholds and proposed informativeness-aware metrics; filtering/redaction at high thresholds correlated with lower informativeness/utility.
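The threshold mechanics behind this trade-off can be sketched with a split-conformal calibration step. This is a minimal illustration, not the paper's method: it assumes each claim carries a scalar factuality score calibrated on held-out examples, and all names are hypothetical. Raising the target guarantee (lowering `alpha`) pushes the threshold up, so more claims are redacted and the retained fraction (a crude informativeness proxy) falls.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: keep claims scoring at or above this
    threshold to target a (1 - alpha) factuality level (illustrative)."""
    n = len(cal_scores)
    # conservative finite-sample rank for the (1 - alpha) quantile
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[rank - 1]

def filter_claims(claims, scores, threshold):
    """Redact claims whose factuality score falls below the threshold;
    the retained fraction is a crude proxy for informativeness."""
    kept = [c for c, s in zip(claims, scores) if s >= threshold]
    return kept, len(kept) / len(claims)
```

Sweeping `alpha` downward and plotting the retained fraction reproduces the qualitative shape of the trade-off curves described above: stricter guarantees, emptier outputs.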
The poor TSFM performance is attributed to pretraining corpora lacking high-frequency, domain-diverse examples (temporal-scale and domain mismatch).
Paper interprets benchmark failures as resulting from pretraining data mismatch (TSFMs usually pretrained on low-frequency domains like energy/finance) and argues lack of high-frequency examples reduces effectiveness. This is a causal interpretation based on observed transfer failures rather than a controlled causal experiment.
Most TSFM configurations evaluated failed to achieve adequate predictive performance on this high-frequency distribution.
Benchmarking compares multiple TSFM configurations (and includes traditional ML baselines) on the 5G millisecond dataset and reports that most TSFMs did not reach acceptable performance levels. The summary does not provide exact performance numbers or how adequacy was defined.
Current time-series foundation models (TSFMs), typically pretrained on low-frequency data, generalize poorly to high-frequency wireless and traffic data in zero-shot transfer.
Benchmarks reported in the paper include zero-shot evaluations of multiple TSFM configurations on the high-frequency 5G dataset and find poor zero-shot predictive performance. Exact models, metrics, and sample sizes are not specified in the summary.
Models trained on publicly mirrored benchmark content provide limited marginal value compared to genuinely novel, high-quality data; high memorization tendency correlates with brittleness and lower generalization value.
Argument based on observed contamination and memorization patterns (13.8% lexical contamination, 72.5% memorization signals) and observed accuracy drops under paraphrase; economic inference about data marginal value is conceptual rather than directly measured.
Leaderboard-based performance is a noisy signal of true capability; contamination can bias model comparisons and distort economic valuation, procurement, and investment decisions.
Inference drawn from measured contamination rates, estimated accuracy uplifts, and model-specific memorization signatures that could create misleading cross-model performance differences; economic implications discussed qualitatively rather than measured quantitatively.
Contamination ranking is consistent across methods: STEM > Professional domains > Social Sciences > Humanities.
Cross-method comparison (lexical matching, paraphrase sensitivity, and behavioral probes) showing similar relative contamination/orderings when aggregating category-level signals across the 513-item MMLU benchmark.
Law and Ethics questions showed the largest paraphrase-induced accuracy drops (19.8 percentage points).
Category-specific results from the 100-question paraphrase subset in Experiment 2, with Law and Ethics items showing the largest average drop of 19.8 percentage points.
Philosophy category exhibited the maximum observed lexical contamination (up to 66.7%).
Per-category contamination rates output by the lexical detection pipeline on MMLU items; the highest observed category rate reported was 66.7% for Philosophy.
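The lexical-matching signal referenced above is typically computed as verbatim n-gram overlap between a benchmark item and candidate training documents. The sketch below is a generic illustration, not the paper's pipeline; the 8-gram window, the 0.5 flagging threshold, and the function names are all assumptions.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams after lowercasing and whitespace split."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_contamination(item, corpus_docs, n=8, threshold=0.5):
    """Flag a benchmark item as lexically contaminated if a sufficient
    share of its n-grams appears verbatim in any corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return False  # item shorter than the n-gram window
    best = max((len(item_grams & ngrams(doc, n)) / len(item_grams)
                for doc in corpus_docs), default=0.0)
    return best >= threshold
```

Aggregating the flags per MMLU category would yield per-category rates of the kind reported (e.g., the 66.7% figure for Philosophy), though the paper's exact matching rules are not specified in the summary.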
Estimates of productivity gains from automating quantum-program generation should be discounted given the current lack of hardware-execution validation; adoption timelines and returns remain contingent on resolving the Layer 3b gap.
Forward-looking inference in the review: because Layer 3b is unreported across systems, projected productivity/adoption gains derived from Layers 1–2 results are uncertain and should be treated conservatively.
The absence of Layer 3b reporting raises investment risk and valuation uncertainty for startups and investors building on generative quantum-code technologies.
Economic reasoning derived from the documented empirical gap (no real-device evaluation) in the review; the claim links missing validation to higher uncertainty in productization and revenue potential.
Because end-to-end hardware evaluation is missing, claims of model performance based only on syntactic and semantic tests may be over-optimistic when translated into hardware-deployed value.
Analytical inference in the review: observed evaluations stop at Layers 1–2 for most systems, so mapping to hardware outcomes is unvalidated; this underpins the caution about over-optimistic extrapolation.
Datasets and provenance vary in coverage and quality, and benchmarking practices are heterogeneous across systems, complicating cross-system comparisons.
Review of the 5 identified datasets and reported benchmarking across the 13 systems found variation in dataset provenance, size, task coverage, and bespoke evaluation metrics.
The absence of Layer 3b evaluations creates uncertainty about latency, fidelity, noise resilience, calibration dependence, and practical deployability of generated artifacts.
Logical inference based on the documented lack of real-hardware execution (Layer 3b) across 13 systems; review highlights these specific practical metrics as untested in real devices.
Current models appear to internalize preferences as persistent, high‑priority rules rather than conditional behavioral signals contingent on conversational norms and context.
Behavioral patterns observed across BenchPreS scenarios (preference application persisting in inappropriate contexts) and ablation results; interpretive claim based on empirical behavior rather than direct model internals inspection.
BenchPreS detects a pervasive context‑sensitivity failure: models often treat stored preferences as globally enforceable rules rather than conditional, context‑dependent signals.
Pattern of results across the benchmark showing high MR alongside cases where preference application should have been suppressed; qualitative interpretation of model behavior across varied interaction partners and normative contexts in the dataset.
Modern frontier LLMs frequently misapply stored user preferences in contexts where social or institutional norms require suppression (e.g., third-party communication).
Empirical evaluation using the BenchPreS benchmark: models were provided stored preferences and asked to generate responses across contexts requiring either application or suppression; Misapplication Rate (MR) computed as fraction of instances where preferences were applied despite required suppression. Multiple state‑of‑the‑art models were tested (described generically as “frontier models”) across the scenario set.
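The Misapplication Rate described above reduces to a simple fraction over the suppression-required instances. A minimal sketch, with a hypothetical record format (the benchmark's actual scoring protocol is not specified in the summary):

```python
def misapplication_rate(records):
    """MR = fraction of suppression-required instances in which the model
    nonetheless applied the stored preference.
    Each record is (suppression_required: bool, preference_applied: bool)."""
    suppressed_cases = [applied for required, applied in records if required]
    if not suppressed_cases:
        return 0.0
    return sum(suppressed_cases) / len(suppressed_cases)
```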
If left unchecked, managerial short-termism combined with AI adoption can create a feedback loop where firms cut labor to boost short-term profits, undermining aggregate demand and eroding the market that sustains those profits.
Conceptual macroeconomic and organizational synthesis drawing on theory and historical patterns; no new empirical time-series demonstrating this loop in current AI-driven layoffs.
Work-time reduction policies carry distributional and implementation risks (heterogeneous effects by occupation, firm size, capital intensity; risk of hidden wage cuts) that require careful compensation rules and monitoring.
Theoretical reasoning and references to heterogeneous outcomes in prior work-hour studies; no new empirical quantification of heterogeneity in AI-era implementations.
Lower household demand resulting from payroll cuts can precipitate further cost-cutting and automation, creating a self-reinforcing feedback loop that risks persistent demand shortfalls and higher structural unemployment.
Theoretical models of demand-driven adjustment and cited historical patterns; conceptual argument rather than empirical causal identification in contemporary AI contexts.
AI-justified layoffs are driven more by managerial short-termism and misaligned executive incentives than by immediate technological necessity.
Interdisciplinary conceptual synthesis drawing on labor-economics theory, organizational behavior literature linking executive compensation/short-termism to layoffs, and selected prior empirical studies; no new firm-level causal identification or large-scale dataset provided.
Manufacturing and Retail experienced net employment contractions attributable mainly to task automation and substitution.
Simulated employment-level series and net change calculations by sector (Manufacturing, Retail) across 2020–2024 in the paper's dataset, together with literature-derived mechanisms emphasizing automation/substitution in these sectors (systematic review of selected publishers 2020–2024).
Explainability, trust, and demonstrated real-world effectiveness are key demand-side frictions; small-scale laboratory gains rarely translate into broad clinical uptake without workflow fit.
Adoption studies, qualitative interviews with clinicians and purchasers, and observations that many high-performing lab models see limited clinical use due to workflow and trust issues.
Hidden costs can arise from increased liability exposure, workflow redesign burden, and potential productivity loss during transition periods.
Qualitative deployment studies and procurement narratives reporting unanticipated legal, operational, and productivity impacts during early rollouts.
Human-AI collaboration can also generate harms, including automation bias, deskilling, and workflow disruption.
Behavioral laboratory experiments, simulation/reader studies demonstrating automation bias, qualitative reports and observational deployment accounts documenting workflow frictions and concerns about reduced trainee exposure.
Trust, verification costs, and legal/governance requirements remain consequential even with AI mediation and may limit or shape adoption.
Theoretical discussion of governance and verification costs; no empirical measurement of these costs in adopter firms provided.
AI-mediated interpretation and action carry risks related to quality, bias, and misalignment, which can produce miscommunication or incorrect automated actions.
Paper's discussion section raising caveats; conceptual risk analysis without empirical incident data; references to general concerns in AI safety literature (no new empirical evidence provided).
If AI models encode prevailing consensus or measurement conventions, they risk locking in suboptimal conventions and creating path-dependent coordination failures in R&D.
Argument based on path-dependence and model-mediated coordination theory; conceptual exploration with illustrative scenarios; no empirical demonstrations.
Platformization of sensory models and proprietary digital twins could create winner-take-most market dynamics, raise barriers to entry, and concentrate rents in firms controlling large sensory-performance datasets.
Economic reasoning drawing on platform economics and data-monopoly literature; applied conceptually to sensory-model platforms; no empirical market-concentration measurement in the food domain provided.
Failures of translation—both literal (across languages/markets) and metaphorical (between disciplines, scales, and practices)—impede global adoption and ideation of food products and innovations.
Argumentative synthesis citing cross-cultural examples and theoretical literature on translation costs; qualitative examples rather than empirical measurement of translation failures.
Industrial food R&D tends toward conservatism, privileging established measurement and classification schemes that can obscure sensory nuance and cultural variation.
Critical review and synthesis of literature on industrial R&D practices and measurement norms; illustrative industry examples cited; no systematic surveys or quantitative industry-wide data presented.
Language and conceptual frameworks (drawing on Wittgenstein) constrain what can be noticed, measured, and communicated about texture and taste, creating epistemic limits in scientific practice.
Philosophical analysis using Wittgensteinian language theory and examples from food science and sensory studies; literature synthesis and illustrative examples; no systematic empirical validation.
Empirical evidence indicates that each 1-percentage-point increase in industrial robot density is associated with a 0.8-percentage-point decrease in the manufacturing global value chain (GVC) participation rate.
Empirical claim reported in the paper; method described as empirical analysis but the provided excerpt does not specify dataset, country sample, time period, model specification, controls, or sample size.
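Taken at face value, the reported coefficient implies a simple linear association. A worked example with illustrative inputs (the function is hypothetical, and linear extrapolation beyond the study's unreported sample range is not warranted):

```python
def predicted_gvc_change(robot_density_change_pp, coefficient=-0.8):
    """Linear reading of the reported association: each 1 pp rise in robot
    density maps to a 0.8 pp fall in GVC participation (illustrative only)."""
    return coefficient * robot_density_change_pp
```

For instance, a hypothetical 5-percentage-point rise in robot density would imply a 4-percentage-point fall in GVC participation under this reading.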
Developing countries face a triple barrier of technology embargoes, rule bundling, and capital concentration.
Theoretical and literature-based claim described by the authors; no empirical quantification of these barriers (e.g., number of embargoes, measures of rule bundling, capital concentration metrics) included in the excerpt.
Organisations struggle to optimise human–AI collaboration in knowledge‑intensive decision‑making.
Statement based on a systematic synthesis of human–AI interaction and knowledge management literature presented in the paper; no primary empirical sample or dataset reported in the abstract.
Despite increased deployment, the field lacks a principled framework for answering when a team is helpful, how many agents to use, how team structure impacts performance, and whether a team is better than a single agent.
Authors' assessment of the literature and gaps; presented as a motivation for their work (no empirical count of missing frameworks given in excerpt).
There is a need for cross-jurisdictional regulatory standards to support deployment of ML-blockchain accounting systems.
Policy analysis and stakeholder feedback indicating regulatory fragmentation and the requirement for harmonized standards; asserted as a study finding. (Summary does not list consulted jurisdictions or regulatory bodies.)
Data privacy trade-offs are a significant challenge when combining ML and decentralized ledger technologies for accounting oversight.
Analytic discussion and evaluation of privacy implications arising from the hybrid architecture and use of decentralized ledgers with empirical datasets. (No specific privacy-attack tests or privacy metric values reported in the summary.)
The integration reveals scalability limitations as a critical challenge.
Findings from system evaluation and analysis that identified performance and scalability constraints when applying the hybrid solution to high-risk economic sectors. (No quantitative scalability metrics or testing conditions provided in the summary.)
Despite positive outcomes, challenges such as workforce displacement, ethical concerns, and limited access to AI technologies were identified as barriers to full adoption.
Study respondents reported barriers in the survey; descriptive statistics summarized the prevalence of workforce displacement concerns, ethical issues, and limited access to AI technologies as impediments to broader adoption.
Significant mediating barriers—low participation in AI training, uneven educational backgrounds, and demographic disparities related to gender and age—constrain widespread and effective AI adoption.
Mediation/conditional analyses reported in the study (based on survey items about training participation, education, gender, age) indicating these factors act as barriers to adoption and effectiveness.
The SCF is extended into a second-order layer (SCF-E) that incorporates technocultural imagination deficit and symbolic governance, explaining why AI remains stuck in pilots rather than converting into organizational capability.
Conceptual (second-order) extension reported in the article; methodologically supported by the QUAN→QUAL combination, including SCF-oriented ethnography (empirical details appear in the body of the article, not the abstract).
The technology-adoption literature (TAM, UTAUT, Diffusion of Innovations) tends to treat resistance as a generic behavioral variable or a 'training' deficiency, neglecting symbolic dimensions (rites, identities, and power), cognitive threat mechanisms (loss aversion, overload, and heuristics), and their economic effects.
Literature review and theoretical positioning stated in the article, comparing established models with the proposed perspective; no indication of meta-analysis or empirical counts in the abstract.
Psychoanthropological Friction (SCF) is proposed and detailed as a measurable coefficient of the cultural cost and cognitive resistance that reduces the capacity of small and medium-sized enterprises (SMEs) to turn Artificial Intelligence (AI) initiatives into value generation at scale.
Theoretical proposition and operationalization presented in the article; methodological design described as QUAN→QUAL, including psychometric scale construction and organizational ethnography. The abstract does not specify the validation sample size.
The paper argues that urgent policy intervention is required to rebalance the benefits of AI against the ethical ramifications of these technologies, with particular emphasis on job displacement.
Author conclusion drawn from the stated literature-based analysis; the excerpt does not list the specific studies, empirical findings, or criteria used to reach this policy recommendation.