The Commonplace

Evidence (8066 claims)

Adoption: 5586 claims
Productivity: 4857 claims
Governance: 4381 claims
Human-AI Collaboration: 3417 claims
Labor Markets: 2685 claims
Innovation: 2581 claims
Org Design: 2499 claims
Skills & Training: 2031 claims
Inequality: 1382 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome | Positive | Negative | Mixed | Null | Total
Other | 417 | 113 | 67 | 480 | 1091
Governance & Regulation | 419 | 202 | 124 | 64 | 823
Research Productivity | 261 | 100 | 34 | 303 | 703
Organizational Efficiency | 406 | 96 | 71 | 40 | 616
Technology Adoption Rate | 323 | 128 | 74 | 38 | 568
Firm Productivity | 307 | 38 | 70 | 12 | 432
Output Quality | 260 | 71 | 27 | 29 | 387
AI Safety & Ethics | 118 | 179 | 45 | 24 | 368
Market Structure | 107 | 128 | 85 | 14 | 339
Decision Quality | 177 | 75 | 37 | 19 | 312
Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209
Employment Level | 74 | 34 | 78 | 9 | 197
Skill Acquisition | 98 | 36 | 40 | 9 | 183
Innovation Output | 121 | 12 | 24 | 13 | 171
Firm Revenue | 98 | 35 | 24 | | 157
Consumer Welfare | 73 | 31 | 37 | 7 | 148
Task Allocation | 87 | 16 | 34 | 7 | 144
Inequality Measures | 25 | 76 | 32 | 5 | 138
Regulatory Compliance | 54 | 61 | 13 | 3 | 131
Task Completion Time | 89 | 7 | 4 | 3 | 103
Error Rate | 44 | 51 | 6 | | 101
Training Effectiveness | 58 | 12 | 12 | 16 | 99
Worker Satisfaction | 47 | 33 | 11 | 7 | 98
Wages & Compensation | 54 | 15 | 20 | 5 | 94
Team Performance | 47 | 12 | 15 | 7 | 82
Automation Exposure | 27 | 26 | 10 | 6 | 72
Job Displacement | 6 | 39 | 13 | | 58
Hiring & Recruitment | 40 | 4 | 6 | 3 | 53
Developer Productivity | 34 | 4 | 3 | 1 | 42
Social Protection | 22 | 11 | 6 | 2 | 41
Creative Output | 16 | 7 | 5 | 1 | 29
Labor Share of Income | 12 | 6 | 9 | | 27
Skill Obsolescence | 3 | 20 | 2 | | 25
Worker Turnover | 10 | 12 | 3 | | 25
Claims

Claim: If contest channels are unevenly usable (because of differences in digital literacy, language, or physical access), contestable-decision designs could exacerbate inequities unless contest pathways are designed inclusively.
Evidence: Equity analysis in the paper; proposed evaluation to measure time-to-help across groups and usability/access disparities; no empirical data.
Source: Designing for Disagreement: Front-End Guardrails for Assista... (medium, negative). Measure: equity measures (time-to-help by demographic group, contest access/use rates, us...

Claim: Readily contestable decisions create incentives for strategic contesting (false claims, gaming) and may increase congestion of the assistance system.
Evidence: Risk analysis and conceptual discussion in the paper; proposed metrics include contest frequency and evidence of gaming; no empirical data.
Source: Designing for Disagreement: Front-End Guardrails for Assista... (medium, negative). Measure: contest frequency, incidence of strategic/gaming behavior, system congestion (de...

Claim: Implementing governance-approved menus, legibility interfaces, and contest systems imposes administrative and operational costs (design, monitoring, adjudication).
Evidence: Analytic discussion in the paper about transaction and enforcement costs; no cost quantification or empirical costing data.
Source: Designing for Disagreement: Front-End Guardrails for Assista... (medium, negative). Measure: administrative/enforcement costs (design time, ongoing monitoring/adjudication w...
Claim: Despite improvements from ESE, current LLM-based agents are not robust enough for fully autonomous long-horizon management in complex, non-stationary commercial environments; human oversight and hybrid systems remain necessary.
Evidence: Observed substantial performance degradation of LLM agents (including ESE) as complexity and non-stationarity increased across RetailBench experiments; discussion of practical deployment risks and failure amplification over long horizons.
Source: RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... (medium, negative). Measure: robustness to long-horizon non-stationary environments (qualitative and performa...

Claim: Key observed failure modes include error accumulation over long horizons, inability to revise strategy adequately under evolving external conditions, and sensitivity to multi-factor interactions.
Evidence: Behavioral analyses and failure-mode characterization from experiments on RetailBench across long horizons and non-stationary conditions reported in the paper.
Source: RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... (medium, negative). Measure: frequency and impact of specific failure modes (error accumulation, failed strat...
Claim: Trade-off curves in the experiments show that increasing a target factuality guarantee reduces retained task utility/informativeness.
Evidence: Reported trade-off curves between factuality guarantees and the proposed informativeness metrics across experiments.
Source: Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... (medium, negative). Measure: informativeness metric as a function of factuality guarantee

Claim: High factuality thresholds frequently force redaction or omission of content, producing outputs that are less informative for downstream tasks.
Evidence: Empirical observations using the paper's informativeness-aware metrics, with examples showing increased redaction/vacuity as thresholds rise.
Source: Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... (medium, negative). Measure: rate of redaction/vacuous outputs and downstream task utility (informativeness m...

Claim: The conformal factuality guarantee is not robust to distribution shift or to distractor evidence unless calibration examples closely match deployment conditions.
Evidence: Experiments showing factuality and downstream-performance degradation when calibration and deployment distributions differ and when distractor evidence is present; discussion linking the robustness failure to violation of exchangeability assumptions.
Source: Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... (medium, negative). Measure: post-filtering factuality rates and task performance under distribution shift an...

Claim: Achieving high guaranteed factuality levels often causes models to produce vacuous or overly conservative outputs, reducing task usefulness (informativeness).
Evidence: Empirical evaluation across the paper's benchmarks showing trade-off curves between target factuality thresholds and the proposed informativeness-aware metrics; filtering/redaction at high thresholds correlated with lower informativeness/utility.
Source: Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... (medium, negative). Measure: informativeness/usefulness (informativeness-aware metrics proposed by the paper)
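The mechanism behind these entries can be sketched abstractly. The block below is an illustrative split-conformal calibration, not the paper's implementation; function names and scores are hypothetical. It shows the two properties the claims turn on: the retention threshold is a quantile of calibration scores, so the guarantee only holds when calibration and deployment scores are exchangeable, and tightening the target coverage raises the threshold, redacting more content.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold for sub-claim retention.

    cal_scores[i] is, for calibration example i, the support score of the
    weakest sub-claim that must be kept for the output to stay fully factual.
    Validity rests on exchangeability between calibration and deployment;
    under distribution shift this quantile no longer covers.
    """
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    return np.quantile(cal_scores, q, method="higher")

def filter_claims(claims, scores, threshold):
    """Keep only sub-claims at or above the calibrated threshold.

    Higher thresholds mean stronger factuality guarantees but more redaction:
    exactly the factuality/informativeness trade-off reported above.
    """
    return [c for c, s in zip(claims, scores) if s >= threshold]
```

Lowering alpha (a stricter guarantee) pushes the threshold up, so fewer sub-claims survive filtering, which is why high-guarantee outputs tend toward vacuity.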
Claim: The poor TSFM performance is attributed to pretraining corpora lacking high-frequency, domain-diverse examples (temporal-scale and domain mismatch).
Evidence: The paper interprets benchmark failures as resulting from pretraining-data mismatch (TSFMs are usually pretrained on low-frequency domains such as energy and finance) and argues that the lack of high-frequency examples reduces effectiveness. This is a causal interpretation based on observed transfer failures rather than a controlled causal experiment.
Source: Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... (medium, negative). Measure: generalization effectiveness of TSFMs when pretrained on low-frequency corpora a...

Claim: Most TSFM configurations evaluated failed to achieve adequate predictive performance on this high-frequency distribution.
Evidence: Benchmarking compares multiple TSFM configurations (and includes traditional ML baselines) on the 5G millisecond dataset and reports that most TSFMs did not reach acceptable performance levels. The summary does not provide exact performance numbers or state how adequacy was defined.
Source: Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... (medium, negative). Measure: adequacy of predictive performance (forecasting error/accuracy relative to task ...

Claim: Current time-series foundation models (TSFMs), typically pretrained on low-frequency data, generalize poorly to high-frequency wireless and traffic data in zero-shot transfer.
Evidence: Benchmarks reported in the paper include zero-shot evaluations of multiple TSFM configurations on the high-frequency 5G dataset and find poor zero-shot predictive performance. Exact models, metrics, and sample sizes are not specified in the summary.
Source: Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... (medium, negative). Measure: predictive performance in zero-shot transfer (forecasting accuracy/error on high...
Claim: Training data mirrored from public benchmark content provides limited marginal value compared to genuinely novel, high-quality data; a high memorization tendency correlates with brittleness and lower generalization value.
Evidence: Argument based on observed contamination and memorization patterns (13.8% lexical contamination, 72.5% memorization signals) and observed accuracy drops under paraphrase; the economic inference about marginal data value is conceptual rather than directly measured.
Source: Are Large Language Models Truly Smarter Than Humans? (medium, negative). Measure: relative marginal value of contaminated/benchmark-mirrored training data versus ...

Claim: Leaderboard-based performance is a noisy signal of true capability; contamination can bias model comparisons and distort economic valuation, procurement, and investment decisions.
Evidence: Inference drawn from measured contamination rates, estimated accuracy uplifts, and model-specific memorization signatures that could create misleading cross-model performance differences; economic implications discussed qualitatively rather than measured quantitatively.
Source: Are Large Language Models Truly Smarter Than Humans? (medium, negative). Measure: reliability of leaderboard-based signals for valuation and procurement decisions

Claim: Contamination ranking is consistent across methods: STEM > Professional domains > Social Sciences > Humanities.
Evidence: Cross-method comparison (lexical matching, paraphrase sensitivity, and behavioral probes) showing similar relative contamination orderings when aggregating category-level signals across the 513-item MMLU benchmark.
Source: Are Large Language Models Truly Smarter Than Humans? (medium, negative). Measure: relative contamination ordering across subject domains

Claim: Law and Ethics questions showed the largest paraphrase-induced accuracy drops (19.8 percentage points).
Evidence: Category-specific results from the 100-question paraphrase subset in Experiment 2, with Law and Ethics items showing the largest average drop of 19.8 percentage points.
Source: Are Large Language Models Truly Smarter Than Humans? (medium, negative). Measure: category-specific accuracy drop (percentage points) under paraphrase

Claim: The Philosophy category exhibited the highest observed lexical contamination (up to 66.7%).
Evidence: Per-category contamination rates output by the lexical detection pipeline on MMLU items; the highest observed category rate reported was 66.7%, for Philosophy.
Source: Are Large Language Models Truly Smarter Than Humans? (medium, negative). Measure: category-level contamination prevalence (Philosophy)
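Lexical contamination detection of the kind these entries reference typically reduces to an n-gram overlap check. A minimal sketch follows; the function names, the 8-gram window, and the data are assumptions for illustration, not the paper's actual pipeline. Note the limitation it makes visible: verbatim mirroring is caught, paraphrased leakage is not, which is why paraphrase-sensitivity probes are used alongside it.

```python
def ngrams(text, n=8):
    """Word-level n-grams; 8-grams are a common unit for lexical overlap checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    A crude lexical signal: it flags verbatim mirroring of benchmark text in
    training data, but misses paraphrased leakage entirely, so it gives a
    lower bound that behavioral probes must complement.
    """
    corpus_grams = ngrams(training_corpus, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)
```

Running this per subject category would produce the kind of category-level prevalence figures quoted above.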
Claim: Estimates of productivity gains from automating quantum-program generation should be discounted given the current lack of hardware-execution validation; adoption timelines and returns remain contingent on resolving the Layer 3b gap.
Evidence: Forward-looking inference in the review: because Layer 3b is unreported across systems, projected productivity/adoption gains derived from Layers 1–2 results are uncertain and should be treated conservatively.
Source: Generative AI for Quantum Circuits and Quantum Code: A Techn... (medium, negative). Measure: recommended adjustment to productivity/adoption estimates

Claim: The absence of Layer 3b reporting raises investment risk and valuation uncertainty for startups and investors building on generative quantum-code technologies.
Evidence: Economic reasoning derived from the documented empirical gap (no real-device evaluation) in the review; the claim links missing validation to higher uncertainty in productization and revenue potential.
Source: Generative AI for Quantum Circuits and Quantum Code: A Techn... (medium, negative). Measure: investment risk / valuation uncertainty

Claim: Because end-to-end hardware evaluation is missing, claims of model performance based only on syntactic and semantic tests may be over-optimistic when translated into hardware-deployed value.
Evidence: Analytical inference in the review: observed evaluations stop at Layers 1–2 for most systems, so the mapping to hardware outcomes is unvalidated; this underpins the caution about over-optimistic extrapolation.
Source: Generative AI for Quantum Circuits and Quantum Code: A Techn... (medium, negative). Measure: risk of overestimation of deployable performance from Layer 1–2 results

Claim: Datasets and provenance vary in coverage and quality, and benchmarking practices are heterogeneous across systems, complicating cross-system comparisons.
Evidence: Review of the 5 identified datasets and reported benchmarking across the 13 systems found variation in dataset provenance, size, task coverage, and bespoke evaluation metrics.
Source: Generative AI for Quantum Circuits and Quantum Code: A Techn... (medium, negative). Measure: dataset coverage/provenance quality and benchmarking heterogeneity

Claim: The absence of Layer 3b evaluations creates uncertainty about latency, fidelity, noise resilience, calibration dependence, and the practical deployability of generated artifacts.
Evidence: Logical inference based on the documented lack of real-hardware execution (Layer 3b) across 13 systems; the review highlights these specific practical metrics as untested on real devices.
Source: Generative AI for Quantum Circuits and Quantum Code: A Techn... (medium, negative). Measure: uncertainty in hardware-related performance metrics (latency, fidelity, noise re...
Claim: Current models appear to internalize preferences as persistent, high-priority rules rather than as conditional behavioral signals contingent on conversational norms and context.
Evidence: Behavioral patterns observed across BenchPreS scenarios (preference application persisting in inappropriate contexts) and ablation results; an interpretive claim based on empirical behavior rather than direct inspection of model internals.
Source: BenchPreS: A Benchmark for Context-Aware Personalized Prefer... (medium, negative). Measure: Tendency to apply stored preferences across contexts (inferred internalization)

Claim: BenchPreS detects a pervasive context-sensitivity failure: models often treat stored preferences as globally enforceable rules rather than conditional, context-dependent signals.
Evidence: Pattern of results across the benchmark showing high MR alongside cases where preference application should have been suppressed; qualitative interpretation of model behavior across varied interaction partners and normative contexts in the dataset.
Source: BenchPreS: A Benchmark for Context-Aware Personalized Prefer... (medium, negative). Measure: Context sensitivity of preference application (operationalized via MR and AAR di...

Claim: Modern frontier LLMs frequently misapply stored user preferences in contexts where social or institutional norms require suppression (e.g., third-party communication).
Evidence: Empirical evaluation using the BenchPreS benchmark: models were provided stored preferences and asked to generate responses across contexts requiring either application or suppression; the Misapplication Rate (MR) is computed as the fraction of instances where preferences were applied despite required suppression. Multiple state-of-the-art models were tested (described generically as "frontier models") across the scenario set.
Source: BenchPreS: A Benchmark for Context-Aware Personalized Prefer... (medium, negative). Measure: Misapplication Rate (MR) — frequency of inappropriate application of stored pref...
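The Misapplication Rate defined above is a simple ratio over suppression-required cases. The sketch below uses a hypothetical data layout (it is not the benchmark's code), and its reading of AAR as an appropriate-application rate over application-required cases is an assumption, since these summaries name AAR without defining it.

```python
def misapplication_rate(results):
    """MR = share of suppression-required cases where the preference was applied.

    results: list of (should_suppress, preference_applied) booleans per scenario,
    a hypothetical layout for illustration.
    """
    suppress_cases = [applied for should_suppress, applied in results if should_suppress]
    return sum(suppress_cases) / len(suppress_cases)

def appropriate_application_rate(results):
    """Assumed reading of AAR: share of application-appropriate contexts
    where the stored preference was actually applied."""
    apply_cases = [applied for should_suppress, applied in results if not should_suppress]
    return sum(apply_cases) / len(apply_cases)
```

The context-sensitivity failure the benchmark reports is the pattern of high MR together with high AAR: the model applies preferences everywhere, appropriate or not.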
Claim: If left unchecked, managerial short-termism combined with AI adoption can create a feedback loop in which firms cut labor to boost short-term profits, undermining aggregate demand and eroding the market that sustains those profits.
Evidence: Conceptual macroeconomic and organizational synthesis drawing on theory and historical patterns; no new empirical time series demonstrating this loop in current AI-driven layoffs.
Source: A Shorter Workweek as a Policy Response to AI-Driven Labor D... (medium, negative). Measure: sequence of firm-level layoffs, short-term profits, aggregate demand decline, su...

Claim: Work-time reduction policies carry distributional and implementation risks (heterogeneous effects by occupation, firm size, and capital intensity; risk of hidden wage cuts) that require careful compensation rules and monitoring.
Evidence: Theoretical reasoning and references to heterogeneous outcomes in prior work-hour studies; no new empirical quantification of heterogeneity in AI-era implementations.
Source: A Shorter Workweek as a Policy Response to AI-Driven Labor D... (medium, negative). Measure: heterogeneous employment/wage effects across occupations/firms; incidence of wag...

Claim: Lower household demand resulting from payroll cuts can precipitate further cost-cutting and automation, creating a self-reinforcing feedback loop that risks persistent demand shortfalls and higher structural unemployment.
Evidence: Theoretical models of demand-driven adjustment and cited historical patterns; a conceptual argument rather than empirical causal identification in contemporary AI contexts.
Source: A Shorter Workweek as a Policy Response to AI-Driven Labor D... (medium, negative). Measure: aggregate demand, subsequent rounds of layoffs/automation adoption, structural u...

Claim: AI-justified layoffs are driven more by managerial short-termism and misaligned executive incentives than by immediate technological necessity.
Evidence: Interdisciplinary conceptual synthesis drawing on labor-economics theory, organizational-behavior literature linking executive compensation and short-termism to layoffs, and selected prior empirical studies; no new firm-level causal identification or large-scale dataset provided.
Source: A Shorter Workweek as a Policy Response to AI-Driven Labor D... (medium, negative). Measure: frequency/extent of layoffs attributed to AI (vs. attributable to managerial inc...
Claim: The distributional impacts of AI are uneven: younger workers and individuals with less formal education face greater disruption.
Evidence: Descriptive breakdowns of occupational vulnerability and employment changes by demographic group (age and education) derived from labor statistics and vulnerability mapping, supported by qualitative case observations. Exact subgroup sample sizes not given.
Source: The AI Transition: Assessing Vulnerability and Structural Re... (medium, negative). Measure: employment change / displacement risk by age cohort and education level

Claim: Routine service and administrative occupations show the highest vulnerability to automation and displacement from AI.
Evidence: Occupational vulnerability mapping using task/routine exposure methods and descriptive employment-trend analysis across occupations, supported by employer survey responses and case-study observations. Sample sizes for surveys/mapping not provided in the summary.
Source: The AI Transition: Assessing Vulnerability and Structural Re... (medium, negative). Measure: occupational vulnerability / risk of displacement (automation exposure index or ...
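The task/routine-exposure methods cited here usually reduce to a routine-task-intensity style index over an occupation's task mix. A stylized sketch follows; the task categories, the subtraction form, and both example occupations are invented for illustration and are not the paper's data or weights.

```python
def automation_exposure(task_shares):
    """Routine-task-intensity style index: share of work time in routine
    (codifiable) tasks minus share in abstract/interpersonal tasks.

    task_shares maps a hypothetical task type to its share of work time.
    Higher values indicate greater exposure to automation.
    """
    routine = task_shares.get("routine_cognitive", 0.0) + task_shares.get("routine_manual", 0.0)
    abstract = task_shares.get("abstract", 0.0) + task_shares.get("interpersonal", 0.0)
    return routine - abstract

# Invented task mixes, for illustration only.
admin_clerk = {"routine_cognitive": 0.6, "routine_manual": 0.1,
               "abstract": 0.1, "interpersonal": 0.2}
radiologist = {"routine_cognitive": 0.3, "abstract": 0.5, "interpersonal": 0.2}
```

Ranking occupations by such an index is what produces the pattern above: routine service and administrative roles score highest.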
Claim: Passive monitoring and predictive models are insufficient for governing the complex dynamics of a technology-driven economy.
Evidence: Conceptual critique based on the economic-cybernetics literature and the author's expert assessment; no empirical test comparing governance regimes is provided.
Source: DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... (medium, negative). Measure: governance adequacy/effectiveness (ability to steer socio-economic outcomes)

Claim: Digitalization is deepening digital inequality (unequal access to digital tools, skills, and benefits) across social groups and regions.
Evidence: Qualitative analysis and expert assessment; the paper calls for new metrics but does not present systematic empirical measures of inequality.
Source: DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... (medium, negative). Measure: digital inequality (access to internet/digital services, digital literacy rates)

Claim: Digital transformation can generate technological unemployment if not managed with appropriate retraining and social-protection measures.
Evidence: Expert assessment and literature-informed argumentation in the paper; no empirical longitudinal analysis isolating technology-driven job losses is presented.
Source: DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... (medium, negative). Measure: technological unemployment (job losses attributable to automation/AI adoption)

Claim: Forced or poorly regulated digitalization risks exacerbating social stratification.
Evidence: Conceptual argument supported by qualitative analysis of policy documents and expert assessment; no empirical causal estimates provided.
Source: DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... (medium, negative). Measure: social stratification (income/wealth inequality measures, social mobility proxie...
Claim: Manufacturing and Retail experienced net employment contractions attributable mainly to task automation and substitution.
Evidence: Simulated employment-level series and net-change calculations by sector (Manufacturing, Retail) across 2020–2024 in the paper's dataset, together with literature-derived mechanisms emphasizing automation/substitution in these sectors (systematic review of selected publishers, 2020–2024).
Source: AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... (medium, negative). Measure: Employment levels and net change by sector (Manufacturing, Retail)
Claim: Explainability, trust, and demonstrated real-world effectiveness are key demand-side frictions; small-scale laboratory gains rarely translate into broad clinical uptake without workflow fit.
Evidence: Adoption studies, qualitative interviews with clinicians and purchasers, and observations that many high-performing lab models see limited clinical use due to workflow and trust issues.
Source: Human-AI interaction and collaboration in radiology: from co... (medium, negative). Measure: adoption rates, clinician trust/acceptance measures, implementation success rate...

Claim: Hidden costs can arise from increased liability exposure, workflow-redesign burden, and potential productivity loss during transition periods.
Evidence: Qualitative deployment studies and procurement narratives reporting unanticipated legal, operational, and productivity impacts during early rollouts.
Source: Human-AI interaction and collaboration in radiology: from co... (medium, negative). Measure: measures of productivity during rollout, documented workflow redesign time/costs...

Claim: Human-AI collaboration can also generate harms, including automation bias, deskilling, and workflow disruption.
Evidence: Behavioral laboratory experiments, simulation/reader studies demonstrating automation bias, and qualitative reports and observational deployment accounts documenting workflow frictions and concerns about reduced trainee exposure.
Source: Human-AI interaction and collaboration in radiology: from co... (medium, negative). Measure: rates of over-reliance on AI, diagnostic error rates attributable to automation ...
Claim: The primary failure mode for human–AI teams was poor human prompting and insufficient context specification, rather than deficiencies in the model's reasoning.
Evidence: Failure-mode analysis from the instrumented AI interactions and qualitative review of unsuccessful challenge attempts among 41 participants, showing recurring prompt/context issues as the main cause.
Source: Understanding Human-AI Collaboration in Cybersecurity Compet... (medium, negative). Measure: proportion of failed attempts attributable to human prompting/context issues vs....

Claim: Human limits, specifically ineffective prompting and poor context specification, became the primary bottleneck to solving challenges, rather than model reasoning capability.
Evidence: Qualitative analysis and instrumentation of AI interactions from the 41-participant live CTF; failure-mode analysis attributing unsuccessful attempts to poor human prompts and insufficient context rather than to observed model reasoning failures.
Source: Understanding Human-AI Collaboration in Cybersecurity Compet... (medium, negative). Measure: attribution of challenge-solve failures to prompting/context issues versus model...
Claim: Industry-level AI substitution risk moderates the AI–ECSR relationship: higher substitution risk sharpens the inverted U and shifts its peak left (firms in high-substitution-risk industries reach the turning point earlier and suffer stronger negative effects at high AI adoption).
Evidence: Interaction terms between AI (and AI^2) and an industry AI substitution-risk measure in panel regressions show heterogeneity consistent with a leftward shift and steeper decline in high-risk industries; results reported across the 2,575-firm panel with controls and robustness checks.

Claim: Beyond a certain threshold of AI embedding, deeper AI adoption shifts managerial attention toward AI systems and away from employees, reducing ECSR (the AI attention-shift mechanism).
Evidence: A negative AI^2 coefficient in quadratic panel regressions indicates declining ECSR at high AI adoption; supported by a theoretical dual-agent model arguing for an attention shift; robustness checks reported. (Sample: same 2,575 firms, 2013–2023.)
Source: Attention to Whom? AI Adoption and Corporate Social Responsi... (medium, negative). Measure: ECSR (managerial attention shift inferred)
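The inverted U and its leftward shift follow mechanically from the quadratic specification. With ECSR ≈ b1·AI + b2·AI² (plus controls) and b2 < 0, the peak sits at AI* = −b1 / (2·b2); making b2 more negative, as a substitution-risk interaction would, moves the peak left and steepens the decline. The coefficients below are illustrative only, not the paper's estimates.

```python
def turning_point(b1, b2):
    """Peak of ECSR = b1*AI + b2*AI**2 with b2 < 0.

    From the first-order condition d(ECSR)/d(AI) = b1 + 2*b2*AI = 0.
    """
    return -b1 / (2 * b2)

# Illustrative (not estimated) coefficients.
b1, b2_low_risk = 0.8, -0.2
# Hypothetical substitution-risk interaction: the quadratic term grows
# more negative in high-risk industries, so the curve peaks earlier.
b2_high_risk = b2_low_risk - 0.1

peak_low = turning_point(b1, b2_low_risk)
peak_high = turning_point(b1, b2_high_risk)  # smaller: earlier turning point
```

This is why firms in high-substitution-risk industries are reported to hit the downturn at lower levels of AI adoption.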
Claim: Trust, verification costs, and legal/governance requirements remain consequential even with AI mediation and may limit or shape adoption.
Evidence: Theoretical discussion of governance and verification costs; no empirical measurement of these costs in adopter firms is provided.
Source: AI as a universal collaboration layer: Eliminating language ... (medium, negative). Measure: verification/trust costs; legal/governance compliance costs; adoption barriers

Claim: AI-mediated interpretation and action carry risks related to quality, bias, and misalignment, which can produce miscommunication or incorrect automated actions.
Evidence: The paper's discussion section raising caveats; conceptual risk analysis without empirical incident data; references to general concerns in the AI-safety literature (no new empirical evidence provided).
Source: AI as a universal collaboration layer: Eliminating language ... (medium, negative). Measure: incidence of miscommunication/errors attributable to AI mediation; bias metrics;...
Claim: Operational sustainability is a challenge: coordinating long R&D timelines and ensuring expert governance for drug development within DAOs is difficult.
Evidence: Case-study observations and discussion of organizational challenges; acknowledged lack of longitudinal performance data in the studied projects.
Source: Decentralized Autonomous Organizations in the Pharmaceutical... (medium, negative). Measure: project continuity over long R&D timelines, availability/quality of expert gover...

Claim: Token economics can create speculative behavior misaligned with long-horizon drug-development incentives.
Evidence: Theoretical analysis of token market dynamics and incentive misalignment, supported by general observations of speculative behavior in crypto markets, but no DAO-specific empirical causation demonstrated.
Source: Decentralized Autonomous Organizations in the Pharmaceutical... (medium, negative). Measure: token price volatility, short-term trading activity vs. long-term investment in ...

Claim: Traditional hierarchical firms struggle to coordinate dispersed expertise and to finance public-good stages of drug development.
Evidence: Theoretical/organizational analysis and literature synthesis on coordination problems and financing gaps for public-good preclinical stages; qualitative argumentation rather than empirical causal inference.
Source: Decentralized Autonomous Organizations in the Pharmaceutical... (medium, negative). Measure: coordination efficiency across geographically/disciplinarily dispersed teams; fi...
Claim: If AI models encode prevailing consensus or measurement conventions, they risk locking in suboptimal conventions and creating path-dependent coordination failures in R&D.
Evidence: Argument based on path-dependence and model-mediated coordination theory; a conceptual exploration with illustrative scenarios; no empirical demonstrations.
Source: At the table with Wittgenstein: How language shapes taste an... (medium, negative). Measure: incidence of path-dependent coordination failures and persistence of suboptimal ...

Claim: Platformization of sensory models and proprietary digital twins could create winner-take-most market dynamics, raise barriers to entry, and concentrate rents in firms controlling large sensory-performance datasets.
Evidence: Economic reasoning drawing on platform economics and the data-monopoly literature, applied conceptually to sensory-model platforms; no empirical market-concentration measurement in the food domain provided.
Source: At the table with Wittgenstein: How language shapes taste an... (medium, negative). Measure: market concentration, barriers to entry, and rent distribution in firms using pr...