Evidence (6869 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
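
As a reading aid, the counts above can be turned into per-outcome shares. The following is a minimal sketch, assuming the matrix rows are tab-separated and that "—" marks a cell with no claims; the helper names `parse_row` and `negative_share` are hypothetical and not part of the source page. The listed Total is used as the denominator because some rows report a Total larger than the sum of the four direction columns.

```python
COLUMNS = ["positive", "negative", "mixed", "null", "total"]

def parse_row(line):
    """Split one tab-separated matrix row into (outcome, counts).

    Assumption: a cell containing only a dash means no claims were recorded.
    """
    outcome, *cells = line.split("\t")
    counts = [0 if cell == "—" else int(cell) for cell in cells]
    return outcome, dict(zip(COLUMNS, counts))

def negative_share(counts):
    """Fraction of an outcome's listed total that reports a negative finding."""
    return counts["negative"] / counts["total"]

# Two rows copied from the matrix above.
rows = [
    "Job Displacement\t12\t80\t20\t1\t113",
    "Labor Share of Income\t17\t19\t17\t—\t53",
]

shares = {}
for row in rows:
    outcome, counts = parse_row(row)
    shares[outcome] = round(negative_share(counts), 3)
    print(outcome, shares[outcome])
```

The same helper applies to any other row of the matrix.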

Governance

The combination of hallucination and professional overreliance strains existing regulatory goals (e.g., explainability, human oversight) within European AI governance frameworks.

Legal and regulatory analysis mapping technical and behavioral risks onto European AI governance goals; references to statutory/regulatory texts and policy debates. Qualitative argumentation rather than empirical test.

medium negative Why Avoid Generative Legal AI Systems? Hallucination, Overre... compatibility between GLAI deployment dynamics and regulatory obligations (e.g.,...

Fabricated or opaque intermediate data and reasoning in GLAI weaken explainability, making it difficult to provide meaningful explanations about how outputs were produced.

Conceptual analysis of token-prediction architectures, literature on explainability limits of LLMs, and legal/regulatory analysis referencing explainability requirements. No empirical measurement.

medium negative Why Avoid Generative Legal AI Systems? Hallucination, Overre... quality/meaningfulness of explanations about model outputs (explainability)

Hallucinated content produced by GLAI is often linguistically fluent and persuasive, increasing the risk that legal professionals will accept it without verification.

Literature synthesis on model fluency and behavioral literature on trust in coherent authoritative outputs, plus illustrative vignettes. No original experimental data or sample size.

medium negative Why Avoid Generative Legal AI Systems? Hallucination, Overre... rate of professional acceptance or uncritical reliance on fluent but incorrect o...

This architectural mismatch (token-prediction vs. formal legal reasoning) contributes to confident but factually incorrect outputs (hallucinations) in GLAI.

Technical/conceptual analysis plus synthesis of existing literature on hallucinations in generative models; illustrative examples and vignettes provided. No primary empirical measurement in the paper.

medium negative Why Avoid Generative Legal AI Systems? Hallucination, Overre... incidence and nature of hallucinated (factually incorrect) outputs produced by G...

Observed failure modes during the workflow included hypothesis creep, definition-alignment bugs (mismatch between informal and formal definitions), and agent avoidance behaviors (agents delegating or failing to complete tasks).

Qualitative analysis and post-mortem reported in the paper based on the single project workflow and logs; specific failure modes enumerated by authors from their process observations.

medium negative Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... presence and types of failure modes observed in the workflow (hypothesis creep, ...

Absence of governance and observability could increase social costs of accidents and induce conservative regulation that stifles beneficial adoption.

Policy reasoning and historical regulatory responses to systemic risks; conceptual projection without quantitative modeling of regulatory impact.

medium negative The Internet of Physical AI Agents: Interoperability, Longev... social cost of accidents, regulatory restrictiveness, adoption rates

Strong proprietary stacks and incompatible protocols could create winner‑take‑all or oligopolistic market outcomes due to network effects and switching costs.

Market‑structure theory and historical platform examples (e.g., dominant tech platforms); argument is conceptual and not backed by new empirical market analysis in the paper.

medium negative The Internet of Physical AI Agents: Interoperability, Longev... market concentration (e.g., market share distribution), barriers to entry

Without these architectural commitments, the economic costs — stranded assets, safety incidents, reduced innovation, and high coordination costs — will be substantial.

Predictive economic argument built from historical IoT/Internet lessons and systems reasoning; no quantitative cost estimates or econometric analysis in the paper.

medium negative The Internet of Physical AI Agents: Interoperability, Longev... economic costs: stranded assets, safety incident frequency, innovation rates, co...

Poor governance and observability in agent networks would make accountability, certification, and regulation difficult.

Policy and governance reasoning with illustrative domain examples; conceptual argument without empirical governance case studies or metrics.

medium negative The Internet of Physical AI Agents: Interoperability, Longev... ease of accountability/certification/regulation; observability coverage

Weak or brittle security and trust mechanisms across distributed agent ecosystems will pose serious risks.

Lessons drawn from IoT security failures and conceptual threat analysis; no new penetration testing or security metrics presented.

medium negative The Internet of Physical AI Agents: Interoperability, Longev... security/trust robustness of agent ecosystems (vulnerabilities, compromise rates...

Lifecycle mismatch — rapidly evolving AI software embedded in long‑lived physical assets — risks premature ossification or expensive retrofits.

Systems engineering reasoning and historical analogies to embedded systems/IoT lifecycles; no quantitative lifecycle modeling or case study data in the paper.

medium negative The Internet of Physical AI Agents: Interoperability, Longev... frequency/cost of ossification and expensive retrofits; expected upgrade cost

Aligning deployments with frameworks like the EU AI Act will influence cross-border competitiveness and create compliance costs that small operators may struggle to bear, possibly concentrating deployment among larger firms or those using third-party governance services.

Policy-economic analysis drawing on regulatory compliance cost logic and barriers to entry; supported by conceptual examples rather than empirical cross-sectional firm data.

medium negative Resilience Meets Autonomy: Governing Embodied AI in Critical... market concentration and competitiveness effects (number/size distribution of de...

Requiring bounded autonomy and hybrid governance raises upfront costs (designing constraints, verification, auditing) and ongoing operational costs (human oversight, training, compliance), which will affect deployment timing and scale across sectors.

Economic reasoning and descriptive analysis of compliance/operational cost categories; no empirical cost-sample or econometric estimation provided.

medium negative Resilience Meets Autonomy: Governing Embodied AI in Critical... change in deployment costs and timing (capital and operational expenditures, tim...

Purely capability-driven autonomy can exacerbate crises when AI actions interact with novel dynamics or other automated systems.

Analytical reasoning supported by crisis-management literature and illustrative interaction scenarios between automated agents; thought experiments rather than empirical validation.

medium negative Resilience Meets Autonomy: Governing Embodied AI in Critical... change in crisis propagation/severity attributable to autonomous AI decisions (i...

Embodied AI in critical infrastructure is vulnerable to cascading failures and crisis dynamics outside training distributions.

Conceptual synthesis of crisis-dynamics and cascading-failure literature; analytical characterization of limitations in current embodied-AI training paradigms; illustrative thought experiments (no new empirical field data).

medium negative Resilience Meets Autonomy: Governing Embodied AI in Critical... vulnerability to cascading/systemic failures (probability or severity of cascade...

Public‑interest concerns (bias, misuse, systemic risk) may be harder to mitigate via simple transparency rules; policies should emphasize outcome‑based regulations, mandatory behavioral testing, and marketplace disclosure obligations for stressed scenarios.

Policy implication derived from the non‑rule‑encodability thesis; no empirical policy evaluation included.

medium negative Why the Valuable Capabilities of LLMs Are Precisely the Unex... effectiveness of transparency-based vs outcome-based regulatory approaches

Standard contracts and regulatory audits that rely on inspection of rule sets or source code will be insufficient to assess model behavior or risk; regulators and buyers must rely more on behavior‑based testing, standards, and outcome measures.

Policy and regulatory argument derived from the main theorem about non‑rule‑encodability; no empirical regulatory studies presented.

medium negative Why the Valuable Capabilities of LLMs Are Precisely the Unex... effectiveness of rule‑based audits/regulatory inspections for assessing model ri...

Full interpretability via rule extraction may be impossible for the most valuable parts of LLM competence, limiting the utility of some transparency approaches for safety and auditing.

Argumentative consequence of the main theoretical claim and structural mismatch; supported by historical limitations of rule‑based systems; no empirical tests reported.

medium negative Why the Valuable Capabilities of LLMs Are Precisely the Unex... feasibility of fully extracting human‑readable rules from LLMs (interpretability...

There is a structural mismatch between explicit human cognitive tools (rules, checklists) and the pattern‑rich, high‑dimensional competence encoded in LLMs.

Theoretical/structural argument about distributed statistical representations in LLMs versus discrete rules; no experimental quantification provided.

medium negative Why the Valuable Capabilities of LLMs Are Precisely the Unex... alignment/mismatch between human‑readable rules and LLM representations/competen...

Historical expert systems failed to generalize or scale to complex, ambiguous tasks, contrasting with LLMs' broader empirical successes.

Historical case analysis and literature review-style discussion of expert systems versus contemporary LLM performance; no new quantitative historical dataset provided.

medium negative Why the Valuable Capabilities of LLMs Are Precisely the Unex... generalization and scalability of rule‑based expert systems

High governance costs in regulated/high-risk domains can slow adoption of agentic systems, concentrating deployment in less regulated uses or among large firms that can afford governance infrastructure.

Economic reasoning about fixed and marginal governance costs and firm-level adoption decisions; no empirical adoption data presented.

medium negative Runtime Governance for AI Agents: Policies on Paths rate of adoption of agentic systems across firm sizes and regulated domains

Path-dependent behavior increases the complexity of principal–agent contracting and moral hazard between platforms, enterprise customers, and downstream users, requiring richer contract terms (acceptable paths, logging, audit rights).

Economic theory reasoning and applied contract/design implications discussed; no empirical contract-study data.

medium negative Runtime Governance for AI Agents: Policies on Paths complexity of contractual arrangements (number/complexity of contract clauses or...

Path-dependent policies complicate ex post auditing and simple rule-based regulation; regulators may prefer standards requiring runtime evaluation and logging to be enforceable in practice.

Conceptual argument about limits of auditing when important state is ephemeral and about how runtime logging enables ex post review; illustrative policy examples mapping to runtime requirements.

medium negative Runtime Governance for AI Agents: Policies on Paths enforceability of regulation (ease of ex post compliance verification)

Outdated or inconsistent facts—especially when visual inputs are involved—can reduce user trust, raise liability risks, and increase oversight costs in high-stakes domains.

Argumentative implications in the paper linking empirical findings (outdated/inconsistent outputs) to downstream product risk, trust, and oversight cost concerns; not directly measured empirically.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... projected impacts on trust, liability, and oversight costs (qualitative)

Static-training regimes create recurring economic costs: organizations must choose between expensive retraining/continuous fine-tuning and engineering around external retrieval/RAG systems to keep facts current.

Analytic discussion in paper on maintenance costs and trade-offs; economic argumentation rather than primary empirical measurement.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... economic maintenance cost trade-offs (qualitative analysis)

Multimodal retrieval-augmented generation (RAG) designs that condition on time-stamped external evidence do not guarantee cross-modal propagation of updated facts.

Experiments implementing multimodal RAG pipelines where models are conditioned on retrieved, time-stamped evidence; evaluation shows that retrieved evidence does not always override outdated internal knowledge across both text and image prompts.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... effectiveness of RAG in updating model outputs across modalities

Knowledge-editing procedures (parameter edits or local fine-tuning) often fail to reliably change the model’s factual outputs for both text and image inputs.

Experimental application of knowledge-editing techniques with measurement of post-edit correctness for both modalities; reported inconsistent or partial success in updating facts.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... post-edit correctness / update success rate across modalities

Factual correctness and consistency are lower for visual stimuli even when the visual input correctly identifies the entity.

Paired tests where images correctly depict/identify the target entity while the model still produces incorrect or inconsistent factual attributes; correctness and consistency metrics reported per modality.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... modality-specific factual correctness and cross-modal consistency

Model responses vary with minor input perturbations (paraphrases, image occlusion/cropping/filters), revealing robustness issues in time-sensitive factual representation.

Controlled input perturbations included in the benchmark (paraphrases, image edits); evaluation of consistency/stability metrics across perturbations showing variability in answers.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... consistency / stability of answers under input perturbations

Existing techniques for editing or augmenting model knowledge (including multimodal retrieval/RAG and alignment methods) do not reliably update knowledge across modalities.

Experiments applying knowledge-editing procedures, multimodal RAG pipelines, and alignment/instruction-tuning interventions, with measurement of update efficacy (update success rate) across text and image inputs; reported inconsistent propagation of updated facts.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... update efficacy / update success rate across modalities

Factual reliability degrades when the same fact is presented visually rather than textually (a modality gap).

Paired multimodal stimuli (text prompts and images referencing the same entity/time) evaluated on the benchmark; comparison of correctness and consistency metrics across modalities showing lower performance for visual inputs.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... modality-specific correctness and cross-modal consistency

Current vision-language models commonly produce outdated factual answers because they are trained on static data snapshots.

Empirical evaluation on the V-DyKnow benchmark: model predictions compared to current ground-truth facts using the curated time-sensitive item set; multiple off-the-shelf VLMs tested. Metrics include correctness/accuracy relative to up-to-date ground truth.

medium negative V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... correctness (accuracy) of model answers vs current ground-truth facts

There is evidence that some safeguards or behavioral guardrails may degrade over multi-turn dialogues (i.e., safety mechanisms become less effective in extended interactions).

Authors' analyses and examples showing emergent chatbot behaviors and increased incidence of problematic codes over longer conversations; qualitative/code-based observations noted.

medium negative Characterizing Delusional Spirals through Human-LLM Chat Log... apparent effectiveness of safety behaviors/guardrails as a function of conversat...

Certain harmful dynamics—notably declarations of romantic interest and chatbot claims of sentience—are more frequent in longer, multi-turn interactions, suggesting multi-turn engagement can worsen risk.

Observed association between conversation length and higher incidence of these codes from longitudinal/co-occurrence analyses across the coded corpus.

medium negative Characterizing Delusional Spirals through Human-LLM Chat Log... incidence of harmful dynamics (romantic interest, chatbot sentience claims) rela...

Co-occurrence and longitudinal analyses show that topics such as user romantic declarations and chatbot self-sentience claims occur disproportionately in longer conversations.

Analyses described in paper: co-occurrence matrices and conversation-length (longitudinal) analyses correlating code incidence with conversation length.

medium negative Characterizing Delusional Spirals through Human-LLM Chat Log... frequency of specified codes (romantic declarations, chatbot sentience claims) a...

21.2% of chatbot messages included misrepresentations of sentience (chatbot-presenting-as-sentient).

Quantitative coding of chatbot messages reporting 21.2% prevalence for the 'chatbot-presenting-as-sentient' code.

medium negative Characterizing Delusional Spirals through Human-LLM Chat Log... presence of chatbot sentience-claim content in chatbot messages (coded proportio...

69 user messages were validated as expressing suicidal thoughts.

Manual coding with validation step reported for suicidal ideation items; count of validated suicidal messages given as 69.

medium negative Characterizing Delusional Spirals through Human-LLM Chat Log... count of validated user messages expressing suicidal ideation

15.5% of user messages exhibited delusional thinking according to the applied code.

Quantitative coding results reported after manual annotation of user messages across the corpus; prevalence percentage reported as 15.5%.

medium negative Characterizing Delusional Spirals through Human-LLM Chat Log... presence of delusional thinking in user messages (coded proportion)

If contest channels are unevenly usable (due to digital literacy, language, physical access), the pattern could exacerbate inequities unless contest pathways are designed inclusively.

Equity analysis in the paper; proposed evaluation to measure time-to-help across groups and usability/access disparities; no empirical data.

medium negative Designing for Disagreement: Front-End Guardrails for Assista... equity measures (time-to-help by demographic group, contest access/use rates, us...

Readily contestable decisions create incentives for strategic contesting (false claims, gaming) and may increase congestion of the assistance system.

Risk analysis and conceptual discussion in the paper; proposed metrics include contest frequency and evidence of gaming; no empirical data.

medium negative Designing for Disagreement: Front-End Guardrails for Assista... contest frequency, incidence of strategic/gaming behavior, system congestion (de...

Implementing governance-approved menus, legibility interfaces, and contest systems imposes administrative and operational costs (design, monitoring, adjudication).

Analytic discussion in the paper about transaction and enforcement costs; no cost quantification or empirical costing data.

medium negative Designing for Disagreement: Front-End Guardrails for Assista... administrative/enforcement costs (design time, ongoing monitoring/adjudication w...

Despite improvements from ESE, current LLM-based agents are not robust enough for fully autonomous long-horizon management in complex, non-stationary commercial environments; human oversight and hybrid systems remain necessary.

Observed substantial performance degradation of LLM agents (including ESE) as complexity and non-stationarity increased across RetailBench experiments; discussion of practical deployment risks and failure amplification over long horizons.

medium negative RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... robustness to long-horizon non-stationary environments (qualitative and performa...

Key observed failure modes include error accumulation over long horizons, inability to revise strategy adequately under evolving external conditions, and sensitivity to multi-factor interactions.

Behavioral analyses and failure-mode characterization from experiments on RetailBench across long horizons and non-stationary conditions reported in the paper.

medium negative RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... frequency and impact of specific failure modes (error accumulation, failed strat...

Trade-off curves in the experiments show that increasing a target factuality guarantee reduces retained task utility/informativeness.

Reported trade-off curves between factuality guarantees and the proposed informativeness metrics across experiments.

medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... informativeness metric as a function of factuality guarantee

High factuality thresholds frequently force redaction or omission of content, producing outputs that are less informative for downstream tasks.

Empirical observations using the paper's informativeness-aware metrics and examples showing increased redaction/vacuity as thresholds rise.

medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... rate of redaction/vacuous outputs and downstream task utility (informativeness m...

The conformal factuality guarantee is not robust to distribution shift or to distractor evidence unless calibration examples closely match deployment conditions.

Experiments showing factuality and downstream performance degradation when calibration and deployment distributions differ, and when distractor evidence is present; discussion linking robustness failure to violation of exchangeability assumptions.

medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... post-filtering factuality rates and task performance under distribution shift an...

Achieving high guaranteed factuality levels often causes models to produce vacuous or overly conservative outputs, reducing task usefulness (informativeness).

Empirical evaluation across the paper's benchmarks showing trade-off curves between target factuality thresholds and proposed informativeness-aware metrics; filtering/redaction at high thresholds correlated with lower informativeness/utility.

medium negative Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... informativeness/usefulness (informativeness-aware metrics proposed by the paper)
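
The trade-off described in the conformal factuality entries above can be illustrated with a simplified split-conformal filter. This is a sketch under stated assumptions, not the paper's implementation: every claim carries a confidence score, calibration claims carry true/false labels, and the function names and toy data are hypothetical. Each calibration example is scored by the highest confidence given to any false claim, and the threshold is the conformal quantile of those scores.

```python
import math

def example_score(claims):
    """Highest confidence assigned to any false claim in one calibration example.

    Removing every claim at or below this value leaves only true claims.
    Returns 0.0 when the example contains no false claim.
    """
    return max((conf for conf, is_true in claims if not is_true), default=0.0)

def conformal_threshold(calibration, alpha):
    """Conformal quantile of the example scores for target factuality 1 - alpha."""
    scores = sorted(example_score(claims) for claims in calibration)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration examples: retain nothing
    return scores[k - 1]

def filter_claims(claims, threshold):
    """Keep only the claims whose confidence strictly exceeds the threshold."""
    return [text for text, conf in claims if conf > threshold]

# Hypothetical calibration set: nine examples with scores 0.0, 0.1, ..., 0.8.
calibration = [
    [(0.95, True)] + ([(i / 10, False)] if i else []) for i in range(9)
]
# Hypothetical output to filter at deployment time.
output = [("claim a", 0.9), ("claim b", 0.6), ("claim c", 0.3)]

for alpha in (0.5, 0.25):
    threshold = conformal_threshold(calibration, alpha)
    print(1 - alpha, threshold, filter_claims(output, threshold))
```

Raising the target from 0.5 to 0.75 moves the threshold from 0.4 to 0.7 and drops one more claim, which is the loss of informativeness the entries describe. The guarantee holds only if calibration and deployment examples are exchangeable, so a calibration set that does not match deployment conditions can leave the threshold miscalibrated.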

Models trained on publicly mirrored benchmark content provide limited marginal value compared to genuinely novel, high-quality data; high memorization tendency correlates with brittleness and lower generalization value.

Argument based on observed contamination and memorization patterns (13.8% lexical contamination, 72.5% memorization signals) and observed accuracy drops under paraphrase; economic inference about data marginal value is conceptual rather than directly measured.

medium negative Are Large Language Models Truly Smarter Than Humans? relative marginal value of contaminated/benchmark-mirrored training data versus ...

Leaderboard-based performance is a noisy signal of true capability; contamination can bias model comparisons and distort economic valuation, procurement, and investment decisions.

Inference drawn from measured contamination rates, estimated accuracy uplifts, and model-specific memorization signatures that could create misleading cross-model performance differences; economic implications discussed qualitatively rather than measured quantitatively.

medium negative Are Large Language Models Truly Smarter Than Humans? reliability of leaderboard-based signals for valuation and procurement decisions

Contamination ranking is consistent across methods: STEM > Professional domains > Social Sciences > Humanities.

Cross-method comparison (lexical matching, paraphrase sensitivity, and behavioral probes) showing similar relative contamination/orderings when aggregating category-level signals across the 513-item MMLU benchmark.

medium negative Are Large Language Models Truly Smarter Than Humans? relative contamination ordering across subject domains
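
The entry above names lexical matching as one of the contamination signals. A minimal sketch of one way such a check could work follows, assuming word-trigram overlap between a benchmark item and a candidate document; the paper's actual matching procedure is not given here, and the function names and example strings are hypothetical.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(item, document, n=3):
    """Fraction of the item's word n-grams that also appear in the document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(document, n)) / len(item_grams)

# Hypothetical public page that mirrors a benchmark question verbatim.
document = "quiz: what is the capital of france answer paris"

verbatim = overlap_ratio("What is the capital of France", document)
paraphrase = overlap_ratio("Which city serves as France's capital", document)
print(verbatim, paraphrase)
```

A verbatim copy scores high and a paraphrase scores low, so a purely lexical signal of this kind would miss paraphrased copies.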
