Evidence (11633 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	609	159	77	736	1615
Governance & Regulation	664	329	160	99	1273
Organizational Efficiency	624	143	105	70	949
Technology Adoption Rate	502	176	98	78	861
Research Productivity	348	109	48	322	836
Output Quality	391	120	44	40	595
Firm Productivity	385	46	85	17	539
Decision Quality	275	143	62	34	521
AI Safety & Ethics	183	241	59	30	517
Market Structure	152	154	109	20	440
Task Allocation	158	50	56	26	295
Innovation Output	178	23	38	17	257
Skill Acquisition	137	52	50	13	252
Fiscal & Macroeconomic	120	64	38	23	252
Employment Level	93	46	96	12	249
Firm Revenue	130	43	26	3	202
Consumer Welfare	99	51	40	11	201
Inequality Measures	36	105	40	6	187
Task Completion Time	134	18	6	5	163
Worker Satisfaction	79	54	16	11	160
Error Rate	64	78	8	1	151
Regulatory Compliance	69	64	14	3	150
Training Effectiveness	81	15	13	18	129
Wages & Compensation	70	25	22	6	123
Team Performance	74	16	21	9	121
Automation Exposure	41	48	19	9	120
Job Displacement	11	71	16	1	99
Developer Productivity	71	14	9	3	98
Hiring & Recruitment	49	7	8	3	67
Social Protection	26	14	8	2	50
Creative Output	26	14	6	2	49
Skill Obsolescence	5	37	5	1	48
Labor Share of Income	12	13	12	—	37
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

An audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both (Periodic-with-floor and history-conditioned suspicion-escalation) auditor extensions.

Construction of the OffAuditDrift auditee strategy in the paper and simulation/theoretical demonstration that it can evade both proposed auditor policies by exploiting auditor commitment.

high negative A Benchmark for Strategic Auditee Gaming Under Continuous Co... effectiveness of an audit-aware auditee strategy at defeating auditor policies

We identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously (formalized as Observation 1).

Theoretical observation/proposition in the paper (Observation 1) derived from the formal model of continuous auditing under noise-aware static auditing rules.

high negative A Benchmark for Strategic Auditee Gaming Under Continuous Co... trade-off between coverage gaps and granularity gaps in static auditing designs

Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions.

Specification and enumeration of auditee strategies in the paper (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift); conceptual examples and inclusion in simulator.

high negative A Benchmark for Strategic Auditee Gaming Under Continuous Co... types of auditee strategic behaviors available under continuous audits

Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work.

Conceptual and theoretical argument in the paper, motivated by regulatory context; formalization of continuous auditing as a multi-round interaction (T-round Stackelberg game).

high negative A Benchmark for Strategic Auditee Gaming Under Continuous Co... existence of a distinct class of strategic gaming (audit-evasion behaviors) unde...

The reform reduces industrial wastewater discharge, which improves agricultural production conditions (mechanism linking the reform to higher grain yield).

Mechanism analysis in the paper reporting reductions in industrial wastewater discharge following the reform (mediation channel analysis).

high negative Can water resource tax reform increase grain yield?—Evidence... industrial wastewater discharge

A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.

Empirical comparison in simulator experiments indicating that optimizing for exact action accuracy (matching individual actions) can harm higher-level trace distribution alignment; observed in the studies contrasting deterministic copying/value-based approaches with Trace-Prior RL.

high negative Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... exact action accuracy vs. aggregate trace alignment (distributional match)

Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior.

Empirical observation in simulator experiments comparing deterministic value-based RL and deterministic copying agents to other approaches; observed collapsed/shortcut pricing behaviors when uncertainty is unresolved.

high negative Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... policy action distribution / pricing choices (shortcut behavior)

This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices.

Theoretical diagnosis supported by simulator setup and observed ambiguity in agent-visible states mapping to multiple competitor prices; derived from the two-hotel simulator design where key competitor variables are hidden from Hotel A.

high negative Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... policy robustness / correctness under partial observability (mapping from observ...

GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1.

Model-level observation from the ASR analysis within the experiment (paper reports GPT-4.1 had perfect TSR and HF1 but failed trajectory-level fidelity).

high negative Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... trajectory fidelity vs. standard metrics (TSR, HF1)

Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly.

Empirical evaluation reported in the paper: HMASP tested across 18 LLMs and 90,000 task instances; analysis via ASR showing checkpoint-skipping behavior for 10 models and correct enforcement for 8 models.

high negative Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... adherence to expected workflow transitions (confirmation checkpoint adherence)

From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression.

Theoretical / information-theoretic analysis in the paper linking observed dynamics to entropy reduction and information bottleneck concepts.

high negative Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Sy... entropy (diversity/support) of the human-AI data loop and its interpretation as ...

Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium.

Computational simulation reported in the paper (described as a 'simple simulation'); no sample size or experimental dataset reported in the provided text.

high negative Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Sy... system transitioning to a low-diversity, suboptimal equilibrium as reliance on A...

Tabular data does not have a foundation model that understands it natively; every approach to tabular AI today (from gradient-boosted trees to the latest tabular foundation models) requires a preprocessing pipeline before any model can consume the data.

Paper's survey/positioning statement asserting the current state of tabular AI approaches and their reliance on preprocessing pipelines (no specific empirical dataset given).

high negative Data Language Models: A New Foundation Model Class for Tabul... presence/absence of a native tabular foundation model and the need for preproces...

With strong exposure of low-wealth, high-MPC households and concentrated ownership, privately chosen automation can be excessive even though it raises high-skilled labor income.

Theoretical welfare/comparison analyses in the model with heterogeneous households (differing in wealth and marginal propensities to consume) and ownership concentration; shows private incentives lead to automation choices that are suboptimal from a social perspective under these parameter constellations.

high negative The Demand Externality of Automation extent of automation chosen relative to social optimum (welfare-relevant automat...

Automation reduces paid human labor.

Model comparative statics in the same equilibrium framework showing substitution away from paid human labor as firms choose automation; result reported in the paper's static benchmark and general-equilibrium analysis.

high negative The Demand Externality of Automation paid human labor (labor share / labor employed in production)

DePAI entails risks including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, requiring value-sensitive design and continuously adaptive governance.

Risk analysis and conceptual argument in the paper identifying possible failure modes and recommended design/governance responses; no empirical incidence data provided.

high negative DAO-enabled decentralized physical AI: A new paradigm for hu... security, centralization, incentive failure, legal exposure, and intrinsic motiv...

Experimental results show that current agents remain far from reliable workspace learning.

Authors' interpretation based on the reported agent performance (< best agent 68.7% vs human 80.7%, average 47.4%).

high negative Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... reliability of agents on workspace learning tasks

The average performance across evaluated agents is only 47.4%.

Reported mean performance across agents in the experiments (authors' aggregated result).

high negative Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... average benchmark score across agents

The best-performing agent reaches only 68.7% on the benchmark.

Experimental results reported by the authors (evaluation across tasks/rubrics).

high negative Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... benchmark score (agent performance)

These industry visions have implications for human experts, whose professional lives may be transformed and revalued by the expert-annotation industry.

Synthesis and interpretation of themes from public statements by five data-annotation firms and CEOs; authors draw implications for professionals based on observed framings and industry positioning.

high negative Cheap Expertise: Mapping and Challenging Industry Perspectiv... professional transformation and revaluation of human experts (risk of role chang...

Human expertise is viewed by the industry as an extractable resource whose value can be judged relative to AI expertise.

The paper's thematic analysis of public-facing statements from five annotation firms/CEOs showing language that frames human expertise as a resource to be extracted and monetized for AI training.

high negative Cheap Expertise: Mapping and Challenging Industry Perspectiv... valuation and treatment of human expertise (commodification/extraction)

The industry envisions AI expertise as cheap, meaning that it can offer a better return on investment than human expertise.

Interpretive coding of statements from five data-annotation firms and their CEOs on social media and podcasts indicating that AI-based expertise is framed as lower-cost and higher-ROI relative to human experts.

high negative Cheap Expertise: Mapping and Challenging Industry Perspectiv... relative valuation/price of AI expertise versus human expertise (implications fo...

These dynamics may produce an asymmetric barbell-shaped structure of value capture in advanced economies: high-volume synthetic production controlled by owners of AI infrastructure at one pole, and scarce, high-status human labor valued for verified human presence at the other.

Conceptual projection and economic argument in the paper (no empirical decomposition, distributional statistics, or sample reported in the excerpt).

high negative Human-Provenance Verification should be Treated as Labor Inf... concentration of value capture across economic actors (inequality / distribution...

AI compresses the value of standardized middle-tier labor by making good-enough synthetic substitutes scalable at low marginal cost, hollowing out the middle of the skill distribution currently categorized by knowledge work.

Conceptual/theoretical argument presented in the paper (no reported empirical sample, statistical analysis, or quantified experiment in the excerpt).

high negative Human-Provenance Verification should be Treated as Labor Inf... value of standardized middle-tier knowledge work (wages / scarcity premiums)

AI development may reduce firms' labor income share.

Further analysis reported in the paper linking firm-level AI development to reductions in the labor income share within firms.

high negative The Impact of Artificial Intelligence on the Labor Skill Pre... firms' labor income share

AI increases the firm-level skill premium by substituting for low-skilled labor.

Mechanism analysis reported in the paper (firm-level regressions investigating labor composition / substitution effects following AI development).

high negative The Impact of Artificial Intelligence on the Labor Skill Pre... low-skilled labor employment / displacement (substitution away from low-skilled ...

The cultural and technical misalignment of the data center and electric power sectors makes coordination difficult.

Analytic claim in the paper describing differing design principles, operational philosophies, and economic incentives as sources of misalignment; presented as conceptual analysis without empirical measurement in the excerpt.

high negative From Barrier to Bridge: The Case for AI Data Center/Power Gr... ease/difficulty of coordination between sectors

A single hyperscale training campus can draw power comparable to a mid-sized city, driven by one tightly synchronized job whose demand swings by hundreds of megawatts in seconds.

Concrete illustrative assertion in the paper about facility-level power draw and rapid demand swings; no numeric source, dataset, or case-study details provided in the excerpt.

high negative From Barrier to Bridge: The Case for AI Data Center/Power Gr... power draw (MW) and rapid demand swing magnitude/timescale

AI training data centers break that assumption (load diversity).

Argumentative claim in the paper asserting that characteristics of AI training workloads violate the load-diversity assumption; no quantitative study included in the excerpt.

high negative From Barrier to Bridge: The Case for AI Data Center/Power Gr... degree to which aggregate grid demand is smoothed by uncorrelated loads (i.e., l...

WIOA is not well-equipped to support large-scale, cross-industry labor transitions.

Low observed incidence of cross-industry occupational transitions and limited shifts into less automation-exposed occupations in the WIOA data (2017-2023) lead authors to conclude the program is poorly suited for large-scale cross-industry reallocation.

high negative Did US Worker Retraining Reduce Participant Automation Expos... cross-industry occupational transitions / shifts in RTI after program participat...

A substantial portion of WIOA participants simply return to their prior field after program participation.

Descriptive and outcome analyses on the WIOA participation records (2017-2023) showing many participants re-enter the same occupation/industry rather than transitioning to different occupations.

high negative Did US Worker Retraining Reduce Participant Automation Expos... occupational/industry re-entry (return to prior field) following program partici...

WIOA rarely shifts workers into less automation-exposed work.

Analysis of WIOA administrative records (2017-2023) using a newly introduced 'Retrainability Index' that decomposes outcomes into post-intervention wage recovery and shifts in routine task intensity (RTI). The paper reports low incidence of downward RTI (movement into less automation-exposed occupations) among participants.

high negative Did US Worker Retraining Reduce Participant Automation Expos... change in Routine Task Intensity (RTI) of occupations post-participation

Mechanism tests indicate innovation stagnation in mature firms with redundant AI is a pathway that limits productivity gains (i.e., AI can be associated with stagnant innovation in mature firms).

Mechanism analysis reported in the paper showing signs of reduced innovation-related gains or stagnation in mature, advanced firms using AI (interpreted as redundant AI leading to limited incremental innovation).

high negative The Heterogeneous Effects of Artificial Intelligence on Ente... Innovation activity / productivity implications

AI integration creates challenges such as workforce displacement that must be addressed.

Authors raise workforce displacement as a challenge/consideration in the paper's discussion; this appears as a qualitative claim rather than an empirically quantified result in the supplied text.

high negative Research on the Transformation Acceleration of Financial Ins... Workforce displacement

AI integration creates challenges such as algorithmic bias that must be addressed.

Authors identify algorithmic bias as a notable challenge in the discussion/conclusion; presented qualitatively rather than as an estimated empirical outcome in the supplied text.

high negative Research on the Transformation Acceleration of Financial Ins... Algorithmic bias

Responsible AI research typically focuses on examining the use and impacts of deployed AI systems, and there is currently limited visibility into the pre-deployment decisions to pursue building such systems.

Argument and literature framing presented in the paper based on a scoping review of academic literature, civil society resources, and grey literature.

high negative To Build or Not to Build? Factors that Lead to Non-Developme... visibility into pre-deployment decision-making for AI development

This concentration can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low.

Theoretical result/argument from the model linking concentrated decision-energy to increased systemic risk despite low local error rates.

high negative AI Safety as Control of Irreversibility: A Systems Framework... probability of irreversible system-level loss

Efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node.

Derived from the paper's formal model and argumentation about system dynamics (efficiency and feedback mechanisms); theoretical rather than empirical evidence.

high negative AI Safety as Control of Irreversibility: A Systems Framework... concentration of decision-energy (centralization of decision authority)

Declining deployment friction changes the safety problem at its root: safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density.

Main theoretical argument of the paper; supported by conceptual framing and a formal model that introduces decision-density considerations.

high negative AI Safety as Control of Irreversibility: A Systems Framework... safety framing (control of irreversibility)

Recent AI systems compress the distance between capability growth and capability deployment.

Conceptual and descriptive claim in the paper's introduction; supported by theoretical argumentation and illustrative examples rather than empirical measurement.

high negative AI Safety as Control of Irreversibility: A Systems Framework... deployment speed / adoption

Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).

Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named creative/interpersonal occupations showing opposite rankings.

high negative What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... relative RL feasibility vs. general AI exposure for named creative/interpersonal...

Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.

Conceptual critique and comparison of existing AI-exposure indices vs. the authors' proposed learnability-focused approach (paper text argument and empirical comparisons implied later).

high negative What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... accuracy/misclassification of occupations by AI-exposure indices vs. learnabilit...

A full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.

Experimental intervention with full transparency of information between agents; authors report that even with full information exchange, dyads fail to reach optimal coordination, pointing to interactive grounding processes as the bottleneck.

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... coordination performance under full information transparency

The oracle baseline establishes that the coordination gap is not attributable to individual reasoning limitations.

Experimental baseline (oracle) in which individual reasoning is isolated and shown to be sufficient for identifying optimal allocations; details/sizes not given in the abstract.

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... attribution of coordination gap to individual reasoning limitations

Failures in referential binding occur, where agents lose track of commitments across turns.

Reported failure mode from multi-turn experiments: referential binding breakdowns leading to loss of commitments.

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... referential binding / tracking of commitments across turns

Agents rely on perfunctory fairness (equal resource splits) over reward-maximizing coordination.

Empirical observation from negotiation experiments where agents prefer equal splits rather than allocations that maximize joint reward, as reported in the paper.

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... allocation strategy preference (equal split vs reward-maximizing)

Accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable.

Observed failure mode in multi-turn negotiation experiments: agents anchor on initial proposals and fail to revise, as reported by the authors.

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... propensity to revise initial proposals / anchoring behavior

Coordination degrades when shared interaction history is absent.

Experimental comparison of settings with and without shared interaction history (ablation showing worse coordination when history is removed).

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... coordination performance as a function of shared interaction history

While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models.

Experimental results comparing single-agent (isolated) performance and paired-agent (dyad) negotiation performance across multiple LLMs (open- and closed-source); specific sample sizes not reported in the abstract.

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... achievement of Pareto-optimal allocations in dyadic negotiation

Current multi-agent LLM benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns.

Literature/benchmark survey claim by the authors (asserted in the paper; no numeric summary provided here).

high negative Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... coverage of dynamic grounding in benchmarks

« Prev 1 2 3 … 17 18 19 … 232 233 Next »