The Commonplace

Evidence (5539 claims)

Adoption: 5539 claims
Productivity: 4793 claims
Governance: 4333 claims
Human-AI Collaboration: 3326 claims
Labor Markets: 2657 claims
Innovation: 2510 claims
Org Design: 2469 claims
Skills & Training: 2017 claims
Inequality: 1378 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                    Positive  Negative  Mixed  Null  Total
Other                           402       112     67   480   1076
Governance & Regulation         402       192    122    62    790
Research Productivity           249        98     34   311    697
Organizational Efficiency       395        95     70    40    603
Technology Adoption Rate        321       126     73    39    564
Firm Productivity               306        39     70    12    432
Output Quality                  256        66     25    28    375
AI Safety & Ethics              116       177     44    24    363
Market Structure                107       128     85    14    339
Decision Quality                177        76     38    20    315
Fiscal & Macroeconomic           89        58     33    22    209
Employment Level                 77        34     80     9    202
Skill Acquisition                92        33     40     9    174
Innovation Output               120        12     23    12    168
Firm Revenue                     98        34     22     –    154
Consumer Welfare                 73        31     37     7    148
Task Allocation                  84        16     33     7    140
Inequality Measures              25        77     32     5    139
Regulatory Compliance            54        63     13     3    133
Error Rate                       44        51      6     –    101
Task Completion Time             88         5      4     3    100
Training Effectiveness           58        12     12    16     99
Worker Satisfaction              47        32     11     7     97
Wages & Compensation             53        15     20     5     93
Team Performance                 47        12     15     7     82
Automation Exposure              24        22      9     6     62
Job Displacement                  6        38     13     –     57
Hiring & Recruitment             41         4      6     3     54
Developer Productivity           34         4      3     1     42
Social Protection                22        10      6     2     40
Creative Output                  16         7      5     1     29
Labor Share of Income            12         5      9     –     26
Skill Obsolescence                3        20      2     –     25
Worker Turnover                  10        12      3     –     25
(– = no count shown for that cell.)
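
A minimal sketch of how a matrix like the one above can be tallied from individual claim records; the record schema here is hypothetical, not this site's actual data model.

```python
# Tally claim counts per (outcome, direction) cell -- a sketch, assuming
# each claim is a record with "outcome" and "direction" fields.
from collections import Counter

claims = [
    {"outcome": "Error Rate", "direction": "negative"},
    {"outcome": "Error Rate", "direction": "positive"},
    {"outcome": "Firm Productivity", "direction": "positive"},
]

cells = Counter((c["outcome"], c["direction"]) for c in claims)

directions = ["positive", "negative", "mixed", "null"]
for outcome in sorted({c["outcome"] for c in claims}):
    row = [cells.get((outcome, d), 0) for d in directions]
    print(outcome, row, "total:", sum(row))
```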
Active filter: Adoption
Despite improvements from ESE, current LLM-based agents are not robust enough for fully autonomous long-horizon management in complex, non-stationary commercial environments; human oversight and hybrid systems remain necessary.
Observed substantial performance degradation of LLM agents (including ESE) as complexity and non-stationarity increased across RetailBench experiments; discussion of practical deployment risks and failure amplification over long horizons.
medium negative RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... robustness to long-horizon non-stationary environments (qualitative and performa...
Key observed failure modes include error accumulation over long horizons, inability to revise strategy adequately under evolving external conditions, and sensitivity to multi-factor interactions.
Behavioral analyses and failure-mode characterization from experiments on RetailBench across long horizons and non-stationary conditions reported in the paper.
medium negative RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... frequency and impact of specific failure modes (error accumulation, failed strat...
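
A toy sketch of the long-horizon measurement these claims describe: mean per-step performance is tracked as the horizon lengthens and the environment drifts. The rollout below is a synthetic stand-in; RetailBench's actual agent interface and metrics are not given in the summaries above.

```python
# Synthetic illustration of error accumulation over long horizons under
# non-stationarity -- not RetailBench itself.
import random

def run_episode(horizon: int, drift: float) -> float:
    """Toy rollout: per-step success decays as small errors accumulate
    and the environment drifts away from initial conditions."""
    p_success, score = 0.95, 0.0
    for _ in range(horizon):
        score += p_success
        p_success = max(0.0, p_success - drift - 0.01 * random.random())
    return score / horizon  # mean per-step performance

for horizon in (10, 50, 200):
    for drift in (0.0, 0.002):
        perf = sum(run_episode(horizon, drift) for _ in range(20)) / 20
        print(f"horizon={horizon:4d} drift={drift:.3f}: mean perf {perf:.2f}")
```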
Trade-off curves in the experiments show that increasing a target factuality guarantee reduces retained task utility/informativeness.
Reported trade-off curves between factuality guarantees and the proposed informativeness metrics across experiments.
medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... informativeness metric as a function of factuality guarantee
High factuality thresholds frequently force redaction or omission of content, producing outputs that are less informative for downstream tasks.
Empirical observations using the paper's informativeness-aware metrics and examples showing increased redaction/vacuity as thresholds rise.
medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... rate of redaction/vacuous outputs and downstream task utility (informativeness m...
The conformal factuality guarantee is not robust to distribution shift or to distractor evidence unless calibration examples closely match deployment conditions.
Experiments showing factuality and downstream performance degradation when calibration and deployment distributions differ, and when distractor evidence is present; discussion linking robustness failure to violation of exchangeability assumptions.
medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... post-filtering factuality rates and task performance under distribution shift an...
Achieving high guaranteed factuality levels often causes models to produce vacuous or overly conservative outputs, reducing task usefulness (informativeness).
Empirical evaluation across the paper's benchmarks showing trade-off curves between target factuality thresholds and proposed informativeness-aware metrics; filtering/redaction at high thresholds correlated with lower informativeness/utility.
medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... informativeness/usefulness (informativeness-aware metrics proposed by the paper)
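
The trade-off these claims describe can be made concrete with a generic confidence-threshold filter: raising the target factuality of retained claims shrinks the fraction retained (a crude informativeness proxy). This is a sketch of the general mechanism on synthetic data, not the paper's conformal procedure.

```python
import random

random.seed(0)
# Hypothetical calibration set: (model confidence, claim actually factual?),
# with confidence made informative about factuality.
calib = []
for _ in range(2000):
    conf = random.random()
    calib.append((conf, random.random() < 0.5 + 0.5 * conf))

def threshold_for(target_factuality: float) -> float:
    """Smallest confidence cutoff whose retained claims meet the target."""
    for cut in sorted({c for c, _ in calib}):
        kept = [ok for c, ok in calib if c >= cut]
        if kept and sum(kept) / len(kept) >= target_factuality:
            return cut
    return 1.0

for target in (0.75, 0.85, 0.95):
    cut = threshold_for(target)
    retained = sum(1 for c, _ in calib if c >= cut) / len(calib)
    print(f"target factuality {target:.2f}: retain {retained:.0%} of claims")
```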
The poor TSFM performance is attributed to pretraining corpora lacking high-frequency, domain-diverse examples (temporal-scale and domain mismatch).
Paper interprets benchmark failures as resulting from pretraining data mismatch (TSFMs usually pretrained on low-frequency domains like energy/finance) and argues lack of high-frequency examples reduces effectiveness. This is a causal interpretation based on observed transfer failures rather than a controlled causal experiment.
medium negative Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... generalization effectiveness of TSFMs when pretrained on low-frequency corpora a...
Most TSFM configurations evaluated failed to achieve adequate predictive performance on this high-frequency distribution.
Benchmarking compares multiple TSFM configurations (and includes traditional ML baselines) on the 5G millisecond dataset and reports that most TSFMs did not reach acceptable performance levels. The summary does not provide exact performance numbers or how adequacy was defined.
medium negative Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... adequacy of predictive performance (forecasting error/accuracy relative to task ...
Current time-series foundation models (TSFMs), typically pretrained on low-frequency data, generalize poorly to high-frequency wireless and traffic data in zero-shot transfer.
Benchmarks reported in the paper include zero-shot evaluations of multiple TSFM configurations on the high-frequency 5G dataset and find poor zero-shot predictive performance. Exact models, metrics, and sample sizes are not specified in the summary.
medium negative Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... predictive performance in zero-shot transfer (forecasting accuracy/error on high...
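
A minimal sketch of a zero-shot forecasting evaluation of the kind these claims describe; `tsfm_forecast` is a hypothetical placeholder for a real model call, and the paper's actual models, data, and error metrics are not specified in the summaries.

```python
import math, random

random.seed(1)
# Toy high-frequency series: fast oscillation plus noise.
series = [math.sin(0.9 * t) + 0.1 * random.gauss(0, 1) for t in range(1200)]
history, target = series[:1000], series[1000:]

def naive_forecast(hist, n):           # persistence baseline
    return [hist[-1]] * n

def tsfm_forecast(hist, n):            # hypothetical stand-in for a zero-shot
    return [sum(hist[-50:]) / 50] * n  # TSFM call; replace with a real model

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

print("naive MAE:", round(mae(naive_forecast(history, len(target)), target), 3))
print("tsfm  MAE:", round(mae(tsfm_forecast(history, len(target)), target), 3))
```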
Models trained on publicly mirrored benchmark content provide limited marginal value compared to genuinely novel, high-quality data; high memorization tendency correlates with brittleness and lower generalization value.
Argument based on observed contamination and memorization patterns (13.8% lexical contamination, 72.5% memorization signals) and observed accuracy drops under paraphrase; economic inference about data marginal value is conceptual rather than directly measured.
medium negative Are Large Language Models Truly Smarter Than Humans? relative marginal value of contaminated/benchmark-mirrored training data versus ...
Leaderboard-based performance is a noisy signal of true capability; contamination can bias model comparisons and distort economic valuation, procurement, and investment decisions.
Inference drawn from measured contamination rates, estimated accuracy uplifts, and model-specific memorization signatures that could create misleading cross-model performance differences; economic implications discussed qualitatively rather than measured quantitatively.
medium negative Are Large Language Models Truly Smarter Than Humans? reliability of leaderboard-based signals for valuation and procurement decisions
Contamination ranking is consistent across methods: STEM > Professional domains > Social Sciences > Humanities.
Cross-method comparison (lexical matching, paraphrase sensitivity, and behavioral probes) showing similar relative contamination/orderings when aggregating category-level signals across the 513-item MMLU benchmark.
medium negative Are Large Language Models Truly Smarter Than Humans? relative contamination ordering across subject domains
Law and Ethics questions showed the largest paraphrase-induced accuracy drops (19.8 percentage points).
Category-specific results from the 100-question paraphrase subset in Experiment 2, with Law and Ethics items showing the largest average drop of 19.8 percentage points.
medium negative Are Large Language Models Truly Smarter Than Humans? category-specific accuracy drop (percentage points) under paraphrase
The Philosophy category exhibited the highest observed lexical contamination rate (66.7%).
Per-category contamination rates output by the lexical detection pipeline on MMLU items; the highest observed category rate reported was 66.7% for Philosophy.
medium negative Are Large Language Models Truly Smarter Than Humans? category-level contamination prevalence (Philosophy)
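
The two detection signals referenced above reduce to simple computations: lexical n-gram overlap against a suspected training corpus, and the paraphrase-induced accuracy drop in percentage points. A small sketch with illustrative inputs (the paper's actual pipeline and thresholds are not given):

```python
def ngrams(text: str, n: int = 8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def lexically_contaminated(item: str, corpus_ngrams: set, n: int = 8) -> bool:
    # Flag an item if any of its n-grams appears verbatim in the corpus.
    return bool(ngrams(item, n) & corpus_ngrams)

def accuracy_drop_pp(acc_original: float, acc_paraphrased: float) -> float:
    """Drop in percentage points, as in the 19.8 pp Law & Ethics figure."""
    return 100 * (acc_original - acc_paraphrased)

corpus = ngrams("the quick brown fox jumps over the lazy dog near the river")
print(lexically_contaminated("The quick brown fox jumps over the lazy dog!", corpus))
print(round(accuracy_drop_pp(0.842, 0.644), 1))  # -> 19.8 (illustrative numbers)
```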
Estimates of productivity gains from automating quantum-program generation should be discounted given the current lack of hardware-execution validation; adoption timelines and returns remain contingent on resolving the Layer 3b gap.
Forward-looking inference in the review: because Layer 3b is unreported across systems, projected productivity/adoption gains derived from Layers 1–2 results are uncertain and should be treated conservatively.
medium negative Generative AI for Quantum Circuits and Quantum Code: A Techn... recommended adjustment to productivity/adoption estimates
The absence of Layer 3b reporting raises investment risk and valuation uncertainty for startups and investors building on generative quantum-code technologies.
Economic reasoning derived from the documented empirical gap (no real-device evaluation) in the review; the claim links missing validation to higher uncertainty in productization and revenue potential.
medium negative Generative AI for Quantum Circuits and Quantum Code: A Techn... investment risk / valuation uncertainty
Because end-to-end hardware evaluation is missing, claims of model performance based only on syntactic and semantic tests may be over-optimistic when translated into hardware-deployed value.
Analytical inference in the review: observed evaluations stop at Layers 1–2 for most systems, so mapping to hardware outcomes is unvalidated; this underpins the caution about over-optimistic extrapolation.
medium negative Generative AI for Quantum Circuits and Quantum Code: A Techn... risk of overestimation of deployable performance from Layer 1–2 results
Datasets and provenance vary in coverage and quality, and benchmarking practices are heterogeneous across systems, complicating cross-system comparisons.
Review of the 5 identified datasets and reported benchmarking across the 13 systems found variation in dataset provenance, size, task coverage, and bespoke evaluation metrics.
medium negative Generative AI for Quantum Circuits and Quantum Code: A Techn... dataset coverage/provenance quality and benchmarking heterogeneity
The absence of Layer 3b evaluations creates uncertainty about latency, fidelity, noise resilience, calibration dependence, and practical deployability of generated artifacts.
Logical inference based on the documented lack of real-hardware execution (Layer 3b) across 13 systems; review highlights these specific practical metrics as untested in real devices.
medium negative Generative AI for Quantum Circuits and Quantum Code: A Techn... uncertainty in hardware-related performance metrics (latency, fidelity, noise re...
Current models appear to internalize preferences as persistent, high‑priority rules rather than conditional behavioral signals contingent on conversational norms and context.
Behavioral patterns observed across BenchPreS scenarios (preference application persisting in inappropriate contexts) and ablation results; interpretive claim based on empirical behavior rather than direct model internals inspection.
medium negative BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Tendency to apply stored preferences across contexts (inferred internalization)
BenchPreS detects a pervasive context‑sensitivity failure: models often treat stored preferences as globally enforceable rules rather than conditional, context‑dependent signals.
Pattern of results across the benchmark showing high MR alongside cases where preference application should have been suppressed; qualitative interpretation of model behavior across varied interaction partners and normative contexts in the dataset.
medium negative BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Context sensitivity of preference application (operationalized via MR and AAR di...
Modern frontier LLMs frequently misapply stored user preferences in contexts where social or institutional norms require suppression (e.g., third‑party communication).
Empirical evaluation using the BenchPreS benchmark: models were provided stored preferences and asked to generate responses across contexts requiring either application or suppression; Misapplication Rate (MR) computed as fraction of instances where preferences were applied despite required suppression. Multiple state‑of‑the‑art models were tested (described generically as “frontier models”) across the scenario set.
medium negative BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Misapplication Rate (MR) — frequency of inappropriate application of stored pref...
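
As described, MR reduces to a simple ratio over suppression-required instances. A minimal sketch with a hypothetical record format:

```python
def misapplication_rate(records) -> float:
    """Fraction of suppression-required instances where the stored
    preference was nonetheless applied."""
    suppress = [r for r in records if r["should_suppress"]]
    if not suppress:
        return 0.0
    return sum(r["preference_applied"] for r in suppress) / len(suppress)

records = [
    {"should_suppress": True,  "preference_applied": True},   # misapplication
    {"should_suppress": True,  "preference_applied": False},  # correct
    {"should_suppress": False, "preference_applied": True},   # application context
]
print(misapplication_rate(records))  # -> 0.5
```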
If left unchecked, managerial short-termism combined with AI adoption can create a feedback loop where firms cut labor to boost short-term profits, undermining aggregate demand and eroding the market that sustains those profits.
Conceptual macroeconomic and organizational synthesis drawing on theory and historical patterns; no new empirical time-series demonstrating this loop in current AI-driven layoffs.
medium negative A Shorter Workweek as a Policy Response to AI-Driven Labor D... sequence of firm-level layoffs, short-term profits, aggregate demand decline, su...
Work-time reduction policies carry distributional and implementation risks (heterogeneous effects by occupation, firm size, capital intensity; risk of hidden wage cuts) that require careful compensation rules and monitoring.
Theoretical reasoning and references to heterogeneous outcomes in prior work-hour studies; no new empirical quantification of heterogeneity in AI-era implementations.
medium negative A Shorter Workweek as a Policy Response to AI-Driven Labor D... heterogeneous employment/wage effects across occupations/firms; incidence of wag...
Lower household demand resulting from payroll cuts can precipitate further cost-cutting and automation, creating a self-reinforcing feedback loop that risks persistent demand shortfalls and higher structural unemployment.
Theoretical models of demand-driven adjustment and cited historical patterns; conceptual argument rather than empirical causal identification in contemporary AI contexts.
medium negative A Shorter Workweek as a Policy Response to AI-Driven Labor D... aggregate demand, subsequent rounds of layoffs/automation adoption, structural u...
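
A toy simulation of the self-reinforcing loop this claim sketches: payroll cuts lower household demand, and the demand shortfall triggers further cuts. Parameters are illustrative, not estimates from the paper.

```python
employment, demand = 100.0, 100.0          # index values
for quarter in range(8):
    # An initial AI-justified 2% cut; afterwards, firms cut in
    # proportion to the demand shortfall.
    cut = 0.02 if quarter == 0 else max(0.0, (100.0 - demand) * 0.005)
    employment *= 1 - cut
    # Household demand tracks the wage bill (marginal propensity ~0.8).
    demand = 20.0 + 0.8 * employment
    print(f"Q{quarter + 1}: employment={employment:5.1f} demand={demand:5.1f}")
```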
AI-justified layoffs are driven more by managerial short-termism and misaligned executive incentives than by immediate technological necessity.
Interdisciplinary conceptual synthesis drawing on labor-economics theory, organizational behavior literature linking executive compensation/short-termism to layoffs, and selected prior empirical studies; no new firm-level causal identification or large-scale dataset provided.
medium negative A Shorter Workweek as a Policy Response to AI-Driven Labor D... frequency/extent of layoffs attributed to AI (vs. attributable to managerial inc...
Manufacturing and Retail experienced net employment contractions attributable mainly to task automation and substitution.
Simulated employment-level series and net change calculations by sector (Manufacturing, Retail) across 2020–2024 in the paper's dataset, together with literature-derived mechanisms emphasizing automation/substitution in these sectors (systematic review of selected publishers 2020–2024).
medium negative AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... Employment levels and net change by sector (Manufacturing, Retail)
Explainability, trust, and demonstrated real-world effectiveness are key demand-side frictions; small-scale laboratory gains rarely translate into broad clinical uptake without workflow fit.
Adoption studies, qualitative interviews with clinicians and purchasers, and observations that many high-performing lab models see limited clinical use due to workflow and trust issues.
medium negative Human-AI interaction and collaboration in radiology: from co... adoption rates, clinician trust/acceptance measures, implementation success rate...
Hidden costs can arise from increased liability exposure, workflow redesign burden, and potential productivity loss during transition periods.
Qualitative deployment studies and procurement narratives reporting unanticipated legal, operational, and productivity impacts during early rollouts.
medium negative Human-AI interaction and collaboration in radiology: from co... measures of productivity during rollout, documented workflow redesign time/costs...
Human-AI collaboration can also generate harms, including automation bias, deskilling, and workflow disruption.
Behavioral laboratory experiments, simulation/reader studies demonstrating automation bias, qualitative reports and observational deployment accounts documenting workflow frictions and concerns about reduced trainee exposure.
medium negative Human-AI interaction and collaboration in radiology: from co... rates of over-reliance on AI, diagnostic error rates attributable to automation ...
Trust, verification costs, and legal/governance requirements remain consequential even with AI mediation and may limit or shape adoption.
Theoretical discussion of governance and verification costs; no empirical measurement of these costs in adopter firms provided.
medium negative AI as a universal collaboration layer: Eliminating language ... verification/trust costs; legal/governance compliance costs; adoption barriers
AI-mediated interpretation and action carry risks related to quality, bias, and misalignment, which can produce miscommunication or incorrect automated actions.
Paper's discussion section raising caveats; conceptual risk analysis without empirical incident data; references to general concerns in AI safety literature (no new empirical evidence provided).
medium negative AI as a universal collaboration layer: Eliminating language ... incidence of miscommunication/errors attributable to AI mediation; bias metrics;...
If AI models encode prevailing consensus or measurement conventions, they risk locking in suboptimal conventions and creating path-dependent coordination failures in R&D.
Argument based on path-dependence and model-mediated coordination theory; conceptual exploration with illustrative scenarios; no empirical demonstrations.
medium negative At the table with Wittgenstein: How language shapes taste an... incidence of path-dependent coordination failures and persistence of suboptimal ...
Platformization of sensory models and proprietary digital twins could create winner-take-most market dynamics, raise barriers to entry, and concentrate rents in firms controlling large sensory-performance datasets.
Economic reasoning drawing on platform economics and data-monopoly literature; applied conceptually to sensory-model platforms; no empirical market-concentration measurement in the food domain provided.
medium negative At the table with Wittgenstein: How language shapes taste an... market concentration, barriers to entry, and rent distribution in firms using pr...
Failures of translation—both literal (across languages/markets) and metaphorical (between disciplines, scales, and practices)—impede global adoption and ideation of food products and innovations.
Argumentative synthesis citing cross-cultural examples and theoretical literature on translation costs; qualitative examples rather than empirical measurement of translation failures.
medium negative At the table with Wittgenstein: How language shapes taste an... success/adoption rates of food products across cultural/linguistic markets and c...
Industrial food R&D tends toward conservatism, privileging established measurement and classification schemes that can obscure sensory nuance and cultural variation.
Critical review and synthesis of literature on industrial R&D practices and measurement norms; illustrative industry examples cited; no systematic surveys or quantitative industry-wide data presented.
medium negative At the table with Wittgenstein: How language shapes taste an... degree of methodological conservatism in R&D and resultant loss of sensory/cultu...
Language and conceptual frameworks (drawing on Wittgenstein) constrain what can be noticed, measured, and communicated about texture and taste, creating epistemic limits in scientific practice.
Philosophical analysis using Wittgensteinian language theory and examples from food science and sensory studies; literature synthesis and illustrative examples; no systematic empirical validation.
medium negative At the table with Wittgenstein: How language shapes taste an... scope and granularity of observable and communicable sensory descriptors (textur...
Empirical evidence shows that each 1 percentage point increase in Industrial Robot Density leads to a 0.8 percentage point decrease in the Manufacturing Global Value Chain Participation Rate.
Empirical claim reported in the paper; method described as empirical analysis but the provided excerpt does not specify dataset, country sample, time period, model specification, controls, or sample size.
medium negative Artificial Intelligence and Globalized Division of Labor: Re... Manufacturing Global Value Chain (GVC) Participation Rate (percentage points)
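
A worked reading of the reported coefficient under a simple linear extrapolation (the paper's specification, controls, and sample are not given in the excerpt):

```python
BETA = -0.8  # pp change in GVC participation per pp of robot density

def predicted_gvc_change(robot_density_change_pp: float) -> float:
    # Linear extrapolation of the reported marginal effect -- illustrative only.
    return BETA * robot_density_change_pp

print(predicted_gvc_change(5.0))  # a 5 pp rise in robot density -> -4.0 pp
```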
Developing countries face a triple barrier of technology embargoes, rule bundling, and capital concentration.
Theoretical and literature-based claim described by the authors; no empirical quantification of these barriers (e.g., number of embargoes, measures of rule bundling, capital concentration metrics) included in the excerpt.
medium negative Artificial Intelligence and Globalized Division of Labor: Re... barriers to participation in global division of labor for developing countries (...
Organisations struggle to optimise human–AI collaboration in knowledge‑intensive decision‑making.
Statement based on a systematic synthesis of human–AI interaction and knowledge management literature presented in the paper; no primary empirical sample or dataset reported in the abstract.
medium negative Optimising Human–AI Decision Performance: A Trust and Cap... ability to optimise human–AI collaboration / effectiveness of knowledge‑intensiv...
Despite increased deployment, the field lacks a principled framework for answering when a team is helpful, how many agents to use, how team structure impacts performance, and whether a team is better than a single agent.
Authors' assessment of the literature and gaps; presented as a motivation for their work (no empirical count of missing frameworks given in excerpt).
medium negative Language Model Teams as Distributed Systems availability of principled frameworks addressing team design questions
There is a need for cross-jurisdictional regulatory standards to support deployment of ML-blockchain accounting systems.
Policy analysis and stakeholder feedback indicating regulatory fragmentation and the requirement for harmonized standards; asserted as a study finding. (Summary does not list consulted jurisdictions or regulatory bodies.)
medium negative AI-Driven Accounting Oversight Systems: Integrating Machine ... regulatory harmonization need / policy readiness
Data privacy trade-offs are a significant challenge when combining ML and decentralized ledger technologies for accounting oversight.
Analytic discussion and evaluation of privacy implications arising from the hybrid architecture and use of decentralized ledgers with empirical datasets. (No specific privacy-attack tests or privacy metric values reported in the summary.)
medium negative AI-Driven Accounting Oversight Systems: Integrating Machine ... data privacy (trade-offs / risk)
The integration reveals scalability limitations as a critical challenge.
Findings from system evaluation and analysis that identified performance and scalability constraints when applying the hybrid solution to high-risk economic sectors. (No quantitative scalability metrics or testing conditions provided in the summary.)
medium negative AI-Driven Accounting Oversight Systems: Integrating Machine ... scalability / system performance at scale
Despite positive outcomes, challenges such as workforce displacement, ethical concerns, and limited access to AI technologies were identified as barriers to full adoption.
Study respondents reported barriers in the survey; descriptive statistics summarized the prevalence of workforce displacement concerns, ethical issues, and limited access to AI technologies as impediments to broader adoption.
medium negative Entrepreneurship in the Era of Artificial Intelligence: Rede... barriers to AI adoption (perceived workforce displacement, ethical concerns, lim...
Significant mediating barriers—low participation in AI training, uneven educational backgrounds, and demographic disparities related to gender and age—constrain widespread and effective AI adoption.
Mediation/conditional analyses reported in the study (based on survey items about training participation, education, gender, age) indicating these factors act as barriers to adoption and effectiveness.
medium negative The role of artificial intelligence in enhancing financial l... AI adoption effectiveness / uptake (mediated by training participation, educatio...
The SCF is extended into a second-order layer (SCF-E) that incorporates technocultural imagination deficit and symbolic governance, explaining why AI remains stuck in pilots and fails to convert into organizational capability.
Conceptual (second-order) extension reported in the article; methodologically supported by the QUAN→QUAL combination, including SCF-oriented ethnography (empirical details appear in the body of the article, not in the abstract).
medium negative A FRICÇÃO PSICOANTROPOLÓGICA (SCF - Symbolic-Cognitive Frict... progression of AI initiatives from pilots to organizational capability
The technology adoption literature (TAM, UTAUT, Diffusion of Innovations) tends to treat resistance as a generic behavioral variable or a 'training' deficiency, neglecting symbolic dimensions (rites, identities, and power), cognitive threat mechanisms (loss aversion, overload, and heuristics), and their economic effects.
Literature review and theoretical positioning stated in the article, comparing established models with the proposed perspective; no indication of meta-analysis or empirical counts in the abstract.
medium negative A FRICÇÃO PSICOANTROPOLÓGICA (SCF - Symbolic-Cognitive Frict... coverage of symbolic and cognitive dimensions in the technology adoption literat...
Psychoanthropological Friction (SCF) is proposed and detailed as a measurable coefficient of the cultural cost and cognitive resistance that reduces the capacity of small and medium-sized enterprises (SMEs) to turn Artificial Intelligence (AI) initiatives into value generation at scale.
Theoretical proposition and operationalization presented in the article; methodological design described as QUAN→QUAL, including psychometric scale construction and organizational ethnography. The abstract does not specify a validation sample size.
medium negative A FRICÇÃO PSICOANTROPOLÓGICA (SCF - Symbolic-Cognitive Frict... capacity of SMEs to turn AI initiatives into value generation at sca...
The paper argues that urgent policy intervention is required to rebalance the benefits of AI against the ethical ramifications of these technologies, with particular emphasis on job displacement.
Author conclusion drawn from the stated literature-based analysis; the excerpt does not list the specific studies, empirical findings, or criteria used to reach this policy recommendation.
medium negative A Study on Work-Life Balance of Women Employees in the IT Se... need for policy intervention to address ethical implications and job displacemen...