Evidence (6507 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity (filtered view)
Organizational resistance to technological change hinders AI adoption in logistics operations.
Qualitative synthesis of 31 reviewed publications identifying organizational and cultural barriers to AI uptake.
Data security concerns are a key barrier to adopting AI in global supply chains.
Synthesis of themes from 31 scholarly sources in the structured literature review highlighting data/security-related implementation issues.
High initial investment costs are a significant barrier to AI implementation in logistics.
Synthesis of literature (31 sources) reporting implementation challenges and barriers identified across studies.
Existing coordination approaches often occupy two extremes: highly structured methods that rely on fixed roles/pipelines assigned a priori, and fully unstructured teams that enable adaptability but suffer inefficiencies like error propagation, inter-agent conflicts, and wasted resources.
Framing/background claim made in the paper (conceptual argument motivating LATTE).
The price-setter for cognitive labor is no longer the labor market.
Central normative/conceptual claim of the paper, supported by the analytical model and the CAW bound: the authors argue that the compute capital market (through the rental price of compute) sets the effective price for cognitive labor. Stated as the paper's concise position; based on theoretical derivation rather than empirical evidence.
Compute-Anchored Wage (CAW) bound: on tasks where human and agent cognitive labor are substitutes, the competitive human wage is bounded above by λ · k · r_c (where r_c is the rental rate of compute capital, k is the compute intensity of one effective agent-labor unit, and λ is the relative human-to-agent productivity).
Formal analytical result presented in the paper (mathematical derivation within the factor-pricing model). This is a theoretical bound derived from the model rather than an empirical estimate.
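The CAW bound is a one-line calculation; the sketch below shows how the three quantities combine. All parameter values are hypothetical, chosen only to illustrate the arithmetic — the paper reports no empirical estimates.

```python
# Illustrative sketch of the Compute-Anchored Wage (CAW) bound stated above:
# w_human <= lambda * k * r_c. All parameter values here are hypothetical.

def caw_bound(lam: float, k: float, r_c: float) -> float:
    """Upper bound on the competitive human wage per unit of cognitive labor.

    lam -- relative human-to-agent productivity (lambda)
    k   -- compute intensity of one effective agent-labor unit
    r_c -- rental rate of compute capital
    """
    return lam * k * r_c

# Hypothetical: humans are 1.2x as productive as an agent on the task, one
# agent-labor unit consumes 50 compute units, and compute rents at $0.10/unit.
bound = caw_bound(lam=1.2, k=50.0, r_c=0.10)
print(f"wage bound: ${bound:.2f}")  # → wage bound: $6.00
```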
Once agents are recognized as a production technology, the elastic-supply margin that anchors the equilibrium wage migrates from the labor market to the compute capital market.
Analytical derivation using a textbook factor-pricing framework (citing Mankiw 2020) within the paper's theoretical model; the derivation and verbal argument link supply-elasticity margins to the compute capital market. No empirical data reported in the excerpt.
The reform reduces industrial wastewater discharge, which improves agricultural production conditions (mechanism linking the reform to higher grain yield).
Mechanism analysis in the paper reporting reductions in industrial wastewater discharge following the reform (mediation channel analysis).
From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression.
Theoretical / information-theoretic analysis in the paper linking observed dynamics to entropy reduction and information bottleneck concepts.
Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium.
Computational simulation reported in the paper (described as a 'simple simulation'); no sample size or experimental dataset reported in the provided text.
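The paper's simulation is only described as "simple," so the toy model below is an assumption-laden stand-in, not the authors' code: a recommendation loop amplifies already-popular options, a `reliance` parameter controls how much choices follow that feedback versus independent uniform choice, and Shannon entropy of the final shares measures diversity.

```python
# Toy stand-in for the paper's "simple simulation" (the authors' actual model
# is not specified in this excerpt; everything below is an assumption).
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def simulate(reliance: float, steps: int = 100) -> float:
    p = [0.24, 0.22, 0.20, 0.18, 0.16]  # slightly non-uniform start
    n = len(p)
    for _ in range(steps):
        amp = [q * q for q in p]  # feedback amplifies popular options
        s = sum(amp)
        amp = [a / s for a in amp]
        # mix independent uniform choice with the AI recommendation
        p = [(1 - reliance) / n + reliance * a for a in amp]
    return entropy(p)

low_reliance = simulate(0.10)   # stays near uniform (entropy ~ ln 5)
high_reliance = simulate(0.95)  # collapses onto one dominant option
print(low_reliance, high_reliance)
```

Under low reliance the uniform mix dominates and diversity is preserved; under high reliance the amplification term makes the uniform equilibrium unstable and the system settles into the low-diversity state the claim describes.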
Tabular data does not have a foundation model that understands it natively; every approach to tabular AI today (from gradient-boosted trees to the latest tabular foundation models) requires a preprocessing pipeline before any model can consume the data.
Paper's survey/positioning statement asserting the current state of tabular AI approaches and their reliance on preprocessing pipelines (no specific empirical dataset given).
With strong exposure of low-wealth, high-MPC households and concentrated ownership, privately chosen automation can be excessive even though it raises high-skilled labor income.
Theoretical welfare/comparison analyses in the model with heterogeneous households (differing in wealth and marginal propensities to consume) and ownership concentration; shows private incentives lead to automation choices that are suboptimal from a social perspective under these parameter constellations.
Automation reduces paid human labor.
Model comparative statics in the same equilibrium framework showing substitution away from paid human labor as firms choose automation; result reported in the paper's static benchmark and general-equilibrium analysis.
Experimental results show that current agents remain far from reliable workspace learning.
Authors' interpretation based on the reported agent performance (best agent 68.7% vs. human 80.7%; average across agents 47.4%).
The average performance across evaluated agents is only 47.4%.
Reported mean performance across agents in the experiments (authors' aggregated result).
The best-performing agent reaches only 68.7% on the benchmark.
Experimental results reported by the authors (evaluation across tasks/rubrics).
AI development may reduce firms' labor income share.
Further analysis reported in the paper linking firm-level AI development to reductions in the labor income share within firms.
AI increases the firm-level skill premium by substituting for low-skilled labor.
Mechanism analysis reported in the paper (firm-level regressions investigating labor composition / substitution effects following AI development).
WIOA is not well-equipped to support large-scale, cross-industry labor transitions.
Low observed incidence of cross-industry occupational transitions and limited shifts into less automation-exposed occupations in the WIOA data (2017-2023) lead authors to conclude the program is poorly suited for large-scale cross-industry reallocation.
A substantial portion of WIOA participants simply return to their prior field after program participation.
Descriptive and outcome analyses on the WIOA participation records (2017-2023) showing many participants re-enter the same occupation/industry rather than transitioning to different occupations.
WIOA rarely shifts workers into less automation-exposed work.
Analysis of WIOA administrative records (2017-2023) using a newly introduced 'Retrainability Index' that decomposes outcomes into post-intervention wage recovery and shifts in routine task intensity (RTI). The paper reports low incidence of downward RTI (movement into less automation-exposed occupations) among participants.
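As a hedged illustration of the decomposition described above — the paper's actual Retrainability Index formula is not given in this excerpt — the two stated components can be computed per participant roughly as follows; the function and its inputs are hypothetical.

```python
# Hypothetical sketch of the two components named above: post-program wage
# recovery and the change in routine task intensity (RTI), where a negative
# delta means movement into less automation-exposed work. Not the paper's
# actual index formula.

def retrainability_components(wage_before, wage_after, rti_before, rti_after):
    """Return (wage_recovery, delta_rti) for one participant."""
    wage_recovery = wage_after / wage_before  # 1.0 = full wage recovery
    delta_rti = rti_after - rti_before        # < 0 = less routine work
    return wage_recovery, delta_rti

# Hypothetical participant: wage recovers to 95% of the pre-program level,
# RTI unchanged -- the pattern the paper reports as common (returning to
# similarly automation-exposed work).
recovery, d_rti = retrainability_components(50_000, 47_500, 0.8, 0.8)
print(recovery, d_rti)  # → 0.95 0.0
```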
Mechanism tests indicate innovation stagnation in mature firms with redundant AI is a pathway that limits productivity gains (i.e., AI can be associated with stagnant innovation in mature firms).
Mechanism analysis reported in the paper showing signs of reduced innovation-related gains or stagnation in mature, advanced firms using AI (interpreted as redundant AI leading to limited incremental innovation).
AI integration creates challenges such as workforce displacement that must be addressed.
Authors raise workforce displacement as a challenge/consideration in the paper's discussion; this appears as a qualitative claim rather than an empirically quantified result in the supplied text.
AI integration creates challenges such as algorithmic bias that must be addressed.
Authors identify algorithmic bias as a notable challenge in the discussion/conclusion; presented qualitatively rather than as an estimated empirical outcome in the supplied text.
Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).
Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named creative/interpersonal occupations showing opposite rankings.
Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.
Conceptual critique and comparison of existing AI-exposure indices vs. the authors' proposed learnability-focused approach (paper text argument and empirical comparisons implied later).
A full-transparency intervention establishes that information exchange alone is insufficient: the bottleneck lies in the interactive processes of joint plan formation, commitment, and execution that constitute dynamic grounding.
Experimental intervention with full transparency of information between agents; authors report that even with full information exchange, dyads fail to reach optimal coordination, pointing to interactive grounding processes as the bottleneck.
The oracle baseline establishes that the coordination gap is not attributable to individual reasoning limitations.
Experimental baseline (oracle) in which individual reasoning is isolated and shown to be sufficient for identifying optimal allocations; details/sizes not given in the abstract.
Failures in referential binding occur, where agents lose track of commitments across turns.
Reported failure mode from multi-turn experiments: referential binding breakdowns leading to loss of commitments.
Agents rely on perfunctory fairness (equal resource splits) over reward-maximizing coordination.
Empirical observation from negotiation experiments where agents prefer equal splits rather than allocations that maximize joint reward, as reported in the paper.
Accumulated context can itself become a liability through stubborn anchoring, where initial proposals are treated as axiomatic rather than negotiable.
Observed failure mode in multi-turn negotiation experiments: agents anchor on initial proposals and fail to revise, as reported by the authors.
Coordination degrades when shared interaction history is absent.
Experimental comparison of settings with and without shared interaction history (ablation showing worse coordination when history is removed).
While individual agents can identify Pareto-optimal allocations in isolation, agent dyads consistently fail to reach them across open- and closed-source models.
Experimental results comparing single-agent (isolated) performance and paired-agent (dyad) negotiation performance across multiple LLMs (open- and closed-source); specific sample sizes not reported in the abstract.
Current multi-agent LLM benchmarks focus on static, one-shot tasks, overlooking the ability to repair grounding breakdowns across turns.
Literature/benchmark survey claim by the authors (asserted in the paper; no numeric summary provided here).
We establish a Volume-Quality Inverse Law: code volume is a near-perfect predictor of structural degradation.
Empirical finding from the paper's analysis correlating code volume with measures of structural degradation; described as a 'near-perfect predictor'.
There exists a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code.
Multi-scale comparative analysis across models of differing capability showing higher-capability models produce larger (volume) and more highly-coupled code artifacts.
AI does not eliminate software flaws but rather introduces a distinct 'machine signature' of defects in generated code.
Systematic audit (multi-scale analysis) of AI-generated software across single-file algorithmic tasks and complex, agent-generated systems, reporting characteristic defect patterns attributed to machine generation.
The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long-term maintainability.
Framing statement in the paper; argument based on literature/practice that current evaluations emphasize functional correctness rather than maintainability.
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables.
Position asserted in the paper based on literature/benchmark trends and authors' field observations; no original empirical dataset or quantified analysis provided in the paper text excerpt.
Standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
Quantitative analysis reported in the paper comparing detection of the seven failure modes by standard metrics over evaluation cycles.
Standard metrics (ROUGE, BERTScore, accuracy/AUC, and agentic benchmarks such as HELM/MT-Bench/AgentBench/BIG-bench) fail to detect each of the seven production failure modes.
Empirical demonstration reported in the paper comparing standard metrics and agentic benchmarks against the seven failure modes.
The seven failure modes include compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks.
Author-provided list of example failure modes within the taxonomy; grounded in observations described in the paper.
Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings and do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production.
Author statement based on literature/framework review (references to HELM, MT-Bench, AgentBench, BIG-bench) and contrast with production agentic evaluation needs.
Prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users.
Cited prior work / literature claim reported in paper (no specific study details or sample sizes provided in excerpt).
LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control.
Paper's background/related-work motivation (literature summary and framing). No specific empirical data reported in excerpt.
Specification discipline, not model capability, is the binding constraint on AI-assisted software dependability.
Synthesis conclusion by the authors based on the multivocal literature review, telemetry findings, conceptual modeling (PRP/SGM), and the four-month pilot evaluation.
These conflicting findings constitute the Productivity-Reliability Paradox (PRP): a systematic phenomenon emerging from non-deterministic code generators and insufficient specification discipline.
Conceptual synthesis and interpretation by the paper's authors, based on the multivocal literature review, telemetry, and experimental evidence summarized above.
Telemetry across 10,000+ developers shows 91% longer code review times.
Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in review time.
The most rigorous randomized controlled trial (RCT) documents a 19% slowdown for experienced developers.
A single RCT cited in the paper described as the most rigorous trial; result reported as a 19% slowdown for experienced developers. Sample size for the RCT is not provided in the summary statement.
Compound-system-specific operational challenges arise when serving agentic workloads, including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics.
The paper presents a novel analysis and discussion of these challenges and supports them via case studies and operational lessons from the production deployment; no quantitative prevalence metrics or sample sizes are provided in the supplied text.