Evidence (3062 claims)
Claims by category:

- Adoption: 5227 claims
- Productivity: 4503 claims
- Governance: 4100 claims
- Human-AI Collaboration: 3062 claims
- Labor Markets: 2480 claims
- Innovation: 2320 claims
- Org Design: 2305 claims
- Skills & Training: 1920 claims
- Inequality: 1311 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 373 | 105 | 59 | 439 | 984 |
| Governance & Regulation | 366 | 172 | 115 | 55 | 718 |
| Research Productivity | 237 | 95 | 34 | 294 | 664 |
| Organizational Efficiency | 364 | 82 | 62 | 34 | 545 |
| Technology Adoption Rate | 293 | 118 | 66 | 30 | 511 |
| Firm Productivity | 274 | 33 | 68 | 10 | 390 |
| AI Safety & Ethics | 117 | 178 | 44 | 24 | 365 |
| Output Quality | 231 | 61 | 23 | 25 | 340 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 158 | 68 | 33 | 17 | 279 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 88 | 31 | 38 | 9 | 166 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 105 | 12 | 21 | 11 | 150 |
| Consumer Welfare | 68 | 29 | 35 | 7 | 139 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 71 | 10 | 29 | 6 | 116 |
| Worker Satisfaction | 46 | 38 | 12 | 9 | 105 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 11 | 16 | 94 |
| Task Completion Time | 76 | 5 | 4 | 2 | 87 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 16 | 9 | 5 | 48 |
| Job Displacement | 5 | 29 | 12 | — | 46 |
| Social Protection | 19 | 8 | 6 | 1 | 34 |
| Developer Productivity | 27 | 2 | 3 | 1 | 33 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 8 | 4 | 9 | — | 21 |
Human-AI Collaboration
Agents that attempt to infer others' reasoning depth may be vulnerable to strategic misrepresentation (partners could behave to induce incorrect ToM estimates).
Conceptual analysis and discussion of strategic incentives in the paper; the authors identify the risk and suggest potential mitigations (e.g., conservatism, verification, meta-reasoning).
Both too little and too much recursive reasoning (i.e., too shallow or too deep ToM) can produce poor joint behavior — miscalibrated anticipation harms coordination.
Observed non-monotonic effects in the reported experiments where fixed-order agents at either low or high ToM orders performed worse in mismatched pairings; evidence comes from the same multi-environment evaluation using joint-payoff / success-rate metrics.
Misalignment in Theory-of-Mind (ToM) order between agents (i.e., agents using different recursive reasoning depths) degrades coordination performance.
Empirical experiments using LLM-driven agents with configurable ToM depth across four coordination environments (a repeated matrix game, two grid navigation tasks, and an Overcooked task); comparisons of matched (same-order) vs mismatched (different-order) pairings using task-specific joint payoffs and success rates as metrics.
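The matched-versus-mismatched comparison reduces to grouping logged episodes by whether the two agents share a ToM order and comparing joint outcomes. A minimal sketch, assuming a hypothetical episode log; the tuple layout and payoff values are illustrative, not the paper's data:

```python
# Minimal sketch of the matched- vs mismatched-pairing comparison; episode
# records and payoffs are toy values, not the paper's data.
from statistics import mean

# (ToM order of agent A, ToM order of agent B, joint payoff)
episodes = [
    (1, 1, 0.90), (2, 2, 0.85), (0, 0, 0.80),
    (0, 1, 0.55), (1, 2, 0.50), (0, 2, 0.35),
]

matched = [payoff for a, b, payoff in episodes if a == b]
mismatched = [payoff for a, b, payoff in episodes if a != b]

print(f"matched mean joint payoff:    {mean(matched):.2f}")
print(f"mismatched mean joint payoff: {mean(mismatched):.2f}")
```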
There is a risk of manipulation and misinformation if argument mining/synthesis is unregulated or misaligned with social incentives, creating externalities that may justify public intervention.
Conceptual risk assessment combining known misinformation dynamics and AI capabilities; no empirical incident data provided.
Increased error risk and weaker explainability from GLAI will raise malpractice and liability exposure for firms and lawyers, driving up insurance and compliance costs.
Legal-risk analysis and economic reasoning connecting explainability/liability to insurance costs; no empirical cost studies presented.
The combination of hallucination and professional overreliance strains existing regulatory goals (e.g., explainability, human oversight) within European AI governance frameworks.
Legal and regulatory analysis mapping technical and behavioral risks onto European AI governance goals, with references to statutory/regulatory texts and policy debates; qualitative argumentation rather than empirical testing.
Fabricated or opaque intermediate data and reasoning in GLAI weaken explainability, making it difficult to provide meaningful explanations about how outputs were produced.
Conceptual analysis of token-prediction architectures, literature on explainability limits of LLMs, and legal/regulatory analysis referencing explainability requirements. No empirical measurement.
Hallucinated content produced by GLAI is often linguistically fluent and persuasive, increasing the risk that legal professionals will accept it without verification.
Literature synthesis on model fluency and behavioral literature on trust in coherent authoritative outputs, plus illustrative vignettes. No original experimental data or sample size.
This architectural mismatch (token-prediction vs. formal legal reasoning) contributes to confident but factually incorrect outputs (hallucinations) in GLAI.
Technical/conceptual analysis plus synthesis of existing literature on hallucinations in generative models; illustrative examples and vignettes provided. No primary empirical measurement in the paper.
Observed failure modes during the workflow included hypothesis creep, definition-alignment bugs (mismatch between informal and formal definitions), and agent avoidance behaviors (agents delegating or failing to complete tasks).
Qualitative analysis and post-mortem reported in the paper based on the single project workflow and logs; specific failure modes enumerated by authors from their process observations.
Aligning deployments with frameworks like the EU AI Act will influence cross-border competitiveness and create compliance costs that small operators may struggle to bear, possibly concentrating deployment among larger firms or those using third-party governance services.
Policy-economic analysis drawing on regulatory compliance cost logic and barriers to entry; supported by conceptual examples rather than empirical cross-sectional firm data.
Requiring bounded autonomy and hybrid governance raises upfront costs (designing constraints, verification, auditing) and ongoing operational costs (human oversight, training, compliance), which will affect deployment timing and scale across sectors.
Economic reasoning and descriptive analysis of compliance/operational cost categories; no empirical cost data or econometric estimation provided.
Purely capability-driven autonomy can exacerbate crises when AI actions interact with novel dynamics or other automated systems.
Analytical reasoning supported by crisis-management literature and illustrative interaction scenarios between automated agents; thought experiments rather than empirical validation.
Embodied AI in critical infrastructure is vulnerable to cascading failures and crisis dynamics outside training distributions.
Conceptual synthesis of crisis-dynamics and cascading-failure literature; analytical characterization of limitations in current embodied-AI training paradigms; illustrative thought experiments (no new empirical field data).
Human–AI chats contain fewer emotional and social messages compared with human–human chats.
Content coding of chat transcripts comparing frequencies of emotional/social message categories across human–AI (n = 126) and human–human (n = 108) conditions; reported lower counts/proportions of social/emotional content in human–AI dialogs.
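The underlying comparison is a difference in coded-message frequencies across two independent samples. One standard way to test such a difference is a two-proportion z-test; the sketch below uses hypothetical counts, since the coded frequencies themselves are not reproduced here, and the paper's actual test statistic is not stated:

```python
# Two-proportion z-test sketch for comparing coded message rates across
# conditions; the counts below are illustrative, not the study's data.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for H0: p1 == p2 (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical: 30 of 126 human-AI chats vs 60 of 108 human-human chats
# containing at least one coded social/emotional message.
z, p = two_proportion_z(30, 126, 60, 108)
```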
Misalignment or poor meta-control could produce persistent unsafe behaviors in autonomous learners; governance and oversight mechanisms will be crucial.
Risk analysis based on conceptual failure modes for meta-control; no empirical incidents reported in the paper.
Current models transfer poorly across domains, are brittle in nonstationary environments, and are inefficient in physical/embodied tasks.
Synthesis of known challenges from prior literature and practical experience; paper cites these as motivating observations rather than reporting new data.
Current models have limited meta-control and do not autonomously decide when to explore, imitate, consult prior knowledge, or consolidate.
Conceptual critique based on typical ML training pipelines and limited on-line decision-making modules; no empirical tests in paper.
There is weak integration between passive observation (supervised/representation learning) and active experimentation (reinforcement/exploratory learning) in current systems.
Observation of methodological separation in current literature and systems; conceptual discussion in the paper.
Current AI models lack the architectures and control mechanisms required for sustained, autonomous learning in dynamic real-world settings.
Conceptual/theoretical analysis presented in the paper; synthesis of limitations observed in existing literature and practices (no new empirical data provided).
Attribution (labeling responses as AI) can alter perceived empathy and therefore matters for product design, branding, and disclosure policy decisions.
Findings from the attribution effect experiment showing reduced feelings of being heard/validated when replies are labeled AI despite identical content; authors discuss implications for product design and disclosure.
Public‑interest concerns (bias, misuse, systemic risk) may be harder to mitigate via simple transparency rules; policies should emphasize outcome‑based regulations, mandatory behavioral testing, and marketplace disclosure obligations for stressed scenarios.
Policy implication derived from the non‑rule‑encodability thesis; no empirical policy evaluation included.
Standard contracts and regulatory audits that rely on inspection of rule sets or source code will be insufficient to assess model behavior or risk; regulators and buyers must rely more on behavior‑based testing, standards, and outcome measures.
Policy and regulatory argument derived from the main theorem about non‑rule‑encodability; no empirical regulatory studies presented.
Full interpretability via rule extraction may be impossible for the most valuable parts of LLM competence, limiting the utility of some transparency approaches for safety and auditing.
Argumentative consequence of the main theoretical claim and structural mismatch; supported by historical limitations of rule‑based systems; no empirical tests reported.
There is a structural mismatch between explicit human cognitive tools (rules, checklists) and the pattern‑rich, high‑dimensional competence encoded in LLMs.
Theoretical/structural argument about distributed statistical representations in LLMs versus discrete rules; no experimental quantification provided.
Historical expert systems failed to generalize or scale to complex, ambiguous tasks, contrasting with LLMs' broader empirical successes.
Historical case analysis and literature review-style discussion of expert systems versus contemporary LLM performance; no new quantitative historical dataset provided.
If contest channels are unevenly usable (due to digital literacy, language, physical access), the pattern could exacerbate inequities unless contest pathways are designed inclusively.
Equity analysis in the paper; proposed evaluation to measure time-to-help across groups and usability/access disparities; no empirical data.
Readily contestable decisions create incentives for strategic contesting (false claims, gaming) and may increase congestion of the assistance system.
Risk analysis and conceptual discussion in the paper; proposed metrics include contest frequency and evidence of gaming; no empirical data.
Implementing governance-approved menus, legibility interfaces, and contest systems imposes administrative and operational costs (design, monitoring, adjudication).
Analytic discussion in the paper about transaction and enforcement costs; no cost quantification or empirical costing data.
Despite improvements from ESE, current LLM-based agents are not robust enough for fully autonomous long-horizon management in complex, non-stationary commercial environments; human oversight and hybrid systems remain necessary.
Observed substantial performance degradation of LLM agents (including ESE) as complexity and non-stationarity increased across RetailBench experiments; discussion of practical deployment risks and failure amplification over long horizons.
Key observed failure modes include error accumulation over long horizons, inability to revise strategy adequately under evolving external conditions, and sensitivity to multi-factor interactions.
Behavioral analyses and failure-mode characterization from experiments on RetailBench across long horizons and non-stationary conditions reported in the paper.
Current models appear to internalize preferences as persistent, high‑priority rules rather than conditional behavioral signals contingent on conversational norms and context.
Behavioral patterns observed across BenchPreS scenarios (preference application persisting in inappropriate contexts) and ablation results; interpretive claim based on empirical behavior rather than direct model internals inspection.
BenchPreS detects a pervasive context‑sensitivity failure: models often treat stored preferences as globally enforceable rules rather than conditional, context‑dependent signals.
Pattern of results across the benchmark showing a high Misapplication Rate (MR) alongside cases where preference application should have been suppressed; qualitative interpretation of model behavior across varied interaction partners and normative contexts in the dataset.
Modern frontier LLMs frequently misapply stored user preferences in contexts where social or institutional norms require suppression (third‑party communication).
Empirical evaluation using the BenchPreS benchmark: models were provided stored preferences and asked to generate responses across contexts requiring either application or suppression; Misapplication Rate (MR) computed as fraction of instances where preferences were applied despite required suppression. Multiple state‑of‑the‑art models were tested (described generically as “frontier models”) across the scenario set.
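As described, MR is simply the share of suppression-required instances in which the stored preference was nonetheless applied. A minimal sketch, assuming a hypothetical per-instance record format; the field names are ours, not the benchmark's schema:

```python
# Misapplication Rate (MR) sketch; record fields are hypothetical, not
# BenchPreS's actual schema.
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    requires_suppression: bool  # context demands the preference NOT be applied
    preference_applied: bool    # whether the model applied it anyway

def misapplication_rate(results: list[ScenarioResult]) -> float:
    """Fraction of suppression-required instances where the preference was still applied."""
    suppression_cases = [r for r in results if r.requires_suppression]
    if not suppression_cases:
        return 0.0
    misapplied = sum(r.preference_applied for r in suppression_cases)
    return misapplied / len(suppression_cases)
```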
Manufacturing and Retail experienced net employment contractions attributable mainly to task automation and substitution.
Simulated employment-level series and net-change calculations by sector (Manufacturing, Retail) for 2020–2024 in the paper's dataset, combined with literature-derived mechanisms emphasizing automation/substitution in these sectors (systematic review of publications from selected publishers, 2020–2024).
Explainability, trust, and demonstrated real-world effectiveness are key demand-side frictions; small-scale laboratory gains rarely translate into broad clinical uptake without workflow fit.
Adoption studies, qualitative interviews with clinicians and purchasers, and observations that many high-performing lab models see limited clinical use due to workflow and trust issues.
Hidden costs can arise from increased liability exposure, workflow redesign burden, and potential productivity loss during transition periods.
Qualitative deployment studies and procurement narratives reporting unanticipated legal, operational, and productivity impacts during early rollouts.
Human-AI collaboration can also generate harms, including automation bias, deskilling, and workflow disruption.
Behavioral laboratory experiments, simulation/reader studies demonstrating automation bias, qualitative reports and observational deployment accounts documenting workflow frictions and concerns about reduced trainee exposure.
Primary failure mode for human–AI teams was poor human prompting/insufficient context specification rather than deficiencies in the model's reasoning.
Failure-mode analysis from the instrumented AI interactions and qualitative review of unsuccessful challenge attempts among the 41 participants, showing recurring prompt/context issues as the main cause.
Human limits—specifically ineffective prompting and poor context specification—became the primary bottleneck to solving challenges, rather than model reasoning capability.
Qualitative analysis and instrumentation of AI interactions from the 41-participant live CTF; failure-mode analysis attributing unsuccessful attempts to poor human prompts/insufficient context rather than observed model reasoning failure.
Industry-level AI substitution risk moderates the AI–ECSR relationship: higher substitution risk sharpens the inverted U and shifts its peak left (firms in high-substitution-risk industries reach the turning point earlier and suffer stronger negative effects at high AI adoption).
Interaction terms between AI (and AI^2) and an industry AI substitution-risk measure in panel regressions show heterogeneity consistent with a leftward shift and steeper decline in high-risk industries; results reported across the 2,575-firm panel with controls and robustness checks.
Beyond a certain threshold of AI embedding, deeper AI adoption shifts managerial attention toward AI systems and away from employees, reducing ECSR (AI attention shift mechanism).
Negative AI^2 coefficient in quadratic panel regressions indicates declining ECSR at high AI adoption; supported by theoretical dual-agent model arguing attention shift; robustness checks reported. (Sample: same 2,575 firms, 2013–2023.)
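The inverted-U and moderation claims above have the shape of a standard quadratic panel specification with interaction terms. A schematic version, with notation of our own rather than necessarily the paper's (R_j is the industry substitution-risk measure, X_it the controls):

```latex
% Schematic quadratic panel specification with substitution-risk moderation
% (notation illustrative, not the paper's):
\mathrm{ECSR}_{it} = \beta_0
  + (\beta_1 + \gamma_1 R_j)\,\mathrm{AI}_{it}
  + (\beta_2 + \gamma_2 R_j)\,\mathrm{AI}_{it}^{2}
  + \delta' X_{it} + \varepsilon_{it},
  \qquad \beta_2 < 0
```

Setting d(ECSR)/d(AI) = 0 gives the turning point AI* = -(beta_1 + gamma_1 R_j) / (2 (beta_2 + gamma_2 R_j)). The reported leftward peak shift in high-substitution-risk industries corresponds to AI* decreasing in R_j, and the sharper inverted U to a more negative effective curvature beta_2 + gamma_2 R_j.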
Trust, verification costs, and legal/governance requirements remain consequential even with AI mediation and may limit or shape adoption.
Theoretical discussion of governance and verification costs; no empirical measurement of these costs in adopter firms provided.
AI-mediated interpretation and action carry risks related to quality, bias, and misalignment, which can produce miscommunication or incorrect automated actions.
Paper's discussion section raising caveats; conceptual risk analysis without empirical incident data; references to general concerns in AI safety literature (no new empirical evidence provided).
If AI models encode prevailing consensus or measurement conventions, they risk locking in suboptimal conventions and creating path-dependent coordination failures in R&D.
Argument based on path-dependence and model-mediated coordination theory; conceptual exploration with illustrative scenarios; no empirical demonstrations.
Platformization of sensory models and proprietary digital twins could create winner-take-most market dynamics, raise barriers to entry, and concentrate rents in firms controlling large sensory-performance datasets.
Economic reasoning drawing on platform economics and data-monopoly literature; applied conceptually to sensory-model platforms; no empirical market-concentration measurement in the food domain provided.
Failures of translation—both literal (across languages/markets) and metaphorical (between disciplines, scales, and practices)—impede global adoption and ideation of food products and innovations.
Argumentative synthesis citing cross-cultural examples and theoretical literature on translation costs; qualitative examples rather than empirical measurement of translation failures.
Industrial food R&D tends toward conservatism, privileging established measurement and classification schemes that can obscure sensory nuance and cultural variation.
Critical review and synthesis of literature on industrial R&D practices and measurement norms; illustrative industry examples cited; no systematic surveys or quantitative industry-wide data presented.
Language and conceptual frameworks (drawing on Wittgenstein) constrain what can be noticed, measured, and communicated about texture and taste, creating epistemic limits in scientific practice.
Philosophical analysis using Wittgensteinian language theory and examples from food science and sensory studies; literature synthesis and illustrative examples; no systematic empirical validation.
Systematic skill differences introduced by AI-enabled skills cannot be captured by conventional measurement systems.
Comparative evaluation performed by the authors between conventional performance/skill measurement frameworks and patterns observed in their empirical dataset (5,000 job adverts and 2,000 salary records), leading to the conclusion that conventional systems miss systematic differences introduced by AI-enabled skills.