Evidence (6507 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
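For readers who want to slice the matrix, a minimal Python sketch (rows transcribed from the table above; the "positive share" metric is our illustration, not a figure reported by any source):

```python
# Minimal sketch: share of positive findings among directional claims,
# using a few rows transcribed from the Evidence Matrix above.
# Columns per outcome: (positive, negative, mixed, null).
matrix = {
    "Firm Productivity":    (385, 46, 85, 17),
    "Job Displacement":     (11, 71, 16, 1),
    "Inequality Measures":  (36, 105, 40, 6),
    "Task Completion Time": (134, 18, 6, 5),
}

for outcome, (pos, neg, mixed, null) in matrix.items():
    directional = pos + neg  # claims with a clear direction of finding
    share = pos / directional if directional else float("nan")
    print(f"{outcome}: {share:.0%} of directional claims are positive")
```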
Productivity
The study's findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length.
Authors' reported limitations section citing specific threats to internal validity and measurement (session timing confound, differential attrition across conditions, and grading biases of the LLM used to evaluate documents).
The behavioral scaffolding intervention was associated with substantially lower document production.
Same field experiment (N=388); the behavioral scaffolding required joint AI use within pairs and was compared to unstructured use, with reported reductions in document production in the behavioral condition.
A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use.
Field experiment with 388 employees at a Fortune 500 retailer; random/experimental assignment to scaffolding conditions while all participants had access to the same AI tool; comparison reported between behavioral scaffolding condition and unstructured use.
LLMs lag behind humans in sustaining heterogeneity when divergence is rewarded.
Empirical comparison from the experiment showing humans are better able than LLMs to maintain diverse actions when the payoff structure rewards divergence; stated qualitatively in the abstract without numeric effect sizes or sample sizes.
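To make the payoff structure concrete, a toy sketch of a game that rewards divergence (our illustration of the described setup, not the paper's actual experiment):

```python
# Toy payoff rule rewarding divergence: each agent earns 1 only when
# its action differs from the majority action in the group.
from collections import Counter

def payoffs(actions):
    majority, _ = Counter(actions).most_common(1)[0]
    return [0 if a == majority else 1 for a in actions]

# The two divergent agents ("B", "C") earn 1; the majority earns 0.
print(payoffs(["A", "A", "B", "C"]))  # -> [0, 0, 1, 1]
```

Sustaining heterogeneity under such a rule requires agents to keep their actions spread out over time, which the paper reports humans do better than LLMs.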
Latent-outcome estimation faces a within-study noncomparability challenge: different indicators within a study may have different and possibly nonlinear relationships with the same latent outcome, making them not directly comparable.
Theoretical exposition in the paper describing heterogenous indicator-to-latent mappings and potential nonlinearity; illustrated with examples (no empirical sample size).
Latent-outcome estimation faces a cross-study noncomparability challenge: different measurement systems across studies may cause estimators to target different empirical quantities even when the underlying latent treatment effect is the same.
Conceptual and theoretical argumentation in the paper describing identification issues across studies due to differing measurement systems; supported by examples and discussion (no empirical sample size).
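A stylized formalization of both noncomparability challenges, in our notation rather than the paper's:

```latex
% Each observed indicator j in study s maps the latent outcome theta
% through its own, possibly nonlinear, link g_{sj}:
\[
  Y_{isj} = g_{sj}(\theta_i) + \varepsilon_{isj}, \qquad
  \tau_{sj} = \mathbb{E}\!\left[\, g_{sj}(\theta_i(1)) - g_{sj}(\theta_i(0)) \,\right].
\]
% Within a study, g_{s1} \neq g_{s2} makes \tau_{s1} and \tau_{s2}
% noncomparable; across studies, g_{1j} \neq g_{2j} means estimators
% target different empirical quantities even when the latent effect
% \mathbb{E}[\theta_i(1) - \theta_i(0)] is identical.
```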
Lower survival rates among BDA adopters are driven by greater uncertainty in sales.
Paper states greater uncertainty in sales is an interrelated factor explaining lower survival for BDA adopters, based on empirical analysis of German start-ups.
Lower survival rates among BDA adopters are driven by higher operating costs.
Paper reports that higher operating costs are an interrelated factor explaining lower survival among BDA adopters, based on the same empirical sample of German start-ups.
Start-ups using BDA face lower survival rates.
Empirical comparison of BDA adopters versus non-adopters in a large sample of German start-ups (survival analysis implied by reported outcome).
Enterprise sales organizations are systematically hampered by what this paper terms 'Revenue Friction': the cumulative productivity loss caused by fragmented, human-mediated data entry across disconnected CRM, ERP, and quoting systems.
Statement/definition presented in the paper excerpt. No empirical method, sample size, or quantitative evidence reported in the provided text.
Some of this reduced price is related to reduced input cost contributions, in particular labor and materials costs.
Decomposition/mediation analysis reported in the paper attributing part of the observed price reductions to declines in input cost contributions (labor and materials); exact methods, sample size, and statistical estimates not provided in the excerpt.
AI intensity is associated with lower prices charged to purchasers.
Empirical analysis reported in the paper linking measures of AI intensity to observed output prices (details of data sources, sample size, and specific methods not provided in the excerpt).
Foundation-model usage can increase compute-related emissions.
Conceptual/environmental concern highlighted in the paper about the carbon footprint of heavy model use and persistent storage; no quantified emissions analysis or lifecycle assessment presented.
These systems can cause skill atrophy.
Theoretical risk articulated in the paper that reliance on AI assistance may degrade human skills over time; no longitudinal skill-measurement or experimental evidence provided.
The same foundation-model systems can also intensify surveillance.
Cautionary claim in the paper noting the surveillance risk of durable, queryable traces and integrated tooling; presented as a conceptual risk rather than empirically measured increase in surveillance.
Baseline (non-structured) interactions had 16 of 50 accepted on first pass.
Reported counts in the paper for the baseline group (16 accepted of 50 baseline interactions).
In an observational study of documented interactions across four AI tools (Claude, ChatGPT, Cowork, Codex), incomplete context was associated with 72% of iteration cycles.
Observational study reported in the paper covering interactions across four AI tools; the paper reports the 72% figure.
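The reported counts support only simple summaries; a minimal sketch computing the baseline first-pass acceptance rate, with an illustrative normal-approximation confidence interval (the papers report only the raw figures):

```python
from math import sqrt

# Reported count: 16 of 50 baseline interactions accepted on first pass.
accepted, n = 16, 50
rate = accepted / n  # 0.32

# Normal-approximation 95% CI (our illustration, not a reported statistic).
se = sqrt(rate * (1 - rate) / n)
lo, hi = rate - 1.96 * se, rate + 1.96 * se
print(f"first-pass acceptance: {rate:.0%} (95% CI {lo:.0%}-{hi:.0%})")
```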
Job insecurity emerges as a critical mediating factor influencing employee attitudes and behavioural responses to generative AI, including upskilling intentions and resistance to technological change.
Review-level synthesis identifying job insecurity reported in included studies as mediating relationships between AI adoption and employee attitudes/behaviours (e.g., upskilling, resistance).
Employees express concerns about role displacement (job loss or role changes) associated with generative AI adoption.
Reported across multiple studies included in the review; the review summarises these concerns as part of mixed employee perceptions.
These positive perceptions coexist with employee concerns about skill obsolescence related to generative AI.
Synthesis of studies included in the review documenting worker concerns about skills becoming obsolete due to AI-driven changes.
Income inequality, measured by the Gini index, rises moderately in every scenario we examine, because job losses and gains in wage and capital income have a polarising effect on the income distribution.
Calculation of Gini index across multiple simulated scenarios using the SWITCH-linked distributional analysis; reported in the report.
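For reference, the standard Gini computation the report's distributional analysis relies on, as a minimal sketch (the report itself computes this inside the SWITCH microsimulation, not with this code):

```python
def gini(incomes):
    """Gini index from a list of household disposable incomes (standard formula)."""
    xs = sorted(incomes)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))  # rank-weighted sum
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

# Toy example: a mean-preserving polarising shift raises the Gini.
before = [20, 30, 40, 50, 60]
after  = [15, 25, 40, 55, 65]   # same mean, more spread
print(gini(before), gini(after))  # -> 0.2, 0.26
```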
The largest average losses are experienced by middle and higher income households, for whom job displacement outweighs any wage or capital income gains. Lower income households also lose, but by much less.
Distributional results from microsimulation (SWITCH) applying scenario-led job displacement, wage and capital effects across income groups; reported in the report.
When these effects are combined, we find an average decline in household disposable income as a result of AI adoption.
Combined scenario simulations incorporating job displacement, wage effects and capital income effects linked to the Irish tax-benefit system using SWITCH; result reported in the report's main findings.
These wage gains are not large enough to counterbalance the average fall in income due to job displacement.
Combined simulation results (displacement + wage effects) using scenario assumptions and microsimulation (SWITCH), reported in the report's distributional analysis.
Those most likely to experience this disruption are found in higher income households, where the share of workers transitioning into unemployment is substantially larger than in lower income families.
Microsimulation (SWITCH) linking simulated job displacement scenarios to household income groups; results reported in the report.
In our central scenario — drawn from credible international estimates — around 7 per cent of current jobs could be displaced in the short–medium run.
Scenario simulation based on international estimates of AI exposure/adoption; central scenario reported in the report (linked to SWITCH microsimulation for distributional analysis).
AI tends to place higher earning and highly educated workers at greater risk of disruption, because the occupations most exposed to AI are predominantly in these groups.
Synthesis of international research on occupational exposure to AI and the report's analysis linking exposure to worker characteristics (education and earnings); presented as descriptive finding in the report.
Result 2: When managers are short-termist or worker skill has external value, the decision-maker's optimal policy can produce the augmentation trap, leaving the worker worse off than if AI had never been adopted.
Analytical result from the dynamic model comparing planner/objective variations (short-termist manager or externalities) and showing an outcome labeled the 'augmentation trap'.
Result 1: Even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption.
Analytical result from the dynamic model showing optimal adoption choice can lead to a steady-state where worker productivity is lower than pre-adoption (model-based comparative statics).
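A toy dynamic illustrating Result 1 under assumed parameters (our sketch, not the paper's model): AI adoption boosts output immediately while eroding skill each period, leaving steady-state output below the pre-adoption level.

```python
# Hypothetical parameters: immediate AI boost, per-period skill erosion,
# and a skill floor. Pre-adoption output equals skill = 1.0.
BOOST, EROSION, FLOOR = 0.5, 0.08, 0.3

skill = 1.0
path = []
for t in range(40):
    path.append(skill + BOOST)                    # output while using AI
    skill = max(FLOOR, skill - EROSION * skill)   # skill decays under AI use

print(f"initial output with AI: {path[0]:.2f}  (> 1.00 pre-adoption)")
print(f"steady-state output:    {path[-1]:.2f}  (< 1.00 pre-adoption)")
```

The front-loaded gain (1.50 vs 1.00) is why adoption can be rational even with full anticipation; the steady state (0.80) is the worker ending up less productive than before adoption.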
Experimental evidence shows that sustained use of AI tools can erode the expertise on which productivity gains depend (deskilling).
Statement in paper referencing experimental studies (no specific study, method, or sample size reported in the excerpt).
Claude Sonnet 4.6 achieves only a 33.3% completion rate on ClawBench.
Paper gives a concrete example performance result for Claude Sonnet 4.6 (reported completion percentage on the benchmark).
The authors evaluated 7 frontier models on ClawBench and found that both proprietary and open-source models can complete only a small portion of these tasks.
Paper reports evaluations of 7 models on the ClawBench tasks (empirical evaluation across the benchmark).
Aggressive compression increased total session cost by 67% despite reducing input tokens by 17%, because it shifted interpretive burden to the model's reasoning phase.
Result reported from the controlled experiment comparing log-format conditions; four conditions described but specific number of sessions/replications not provided in the abstract.
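The arithmetic behind this result is worth spelling out; a sketch with hypothetical per-token prices showing how a 17% input-token cut can coexist with a 67% total-cost increase once reasoning-phase output grows:

```python
# All numbers below except the two reported percentages (17% fewer input
# tokens, +67% total cost) are hypothetical illustrations.
IN_PRICE, OUT_PRICE = 3.0, 15.0         # $ per million tokens (assumed)

base_in, base_out = 1_000_000, 150_000
base_cost = base_in / 1e6 * IN_PRICE + base_out / 1e6 * OUT_PRICE

comp_in = base_in * (1 - 0.17)          # reported input-token reduction
target = base_cost * 1.67               # reported total-cost increase
# Reasoning/output tokens implied by the reported cost increase:
comp_out = (target - comp_in / 1e6 * IN_PRICE) / OUT_PRICE * 1e6

print(f"baseline cost ${base_cost:.2f} -> compressed ${target:.2f}")
print(f"reasoning tokens: {base_out:,} -> {comp_out:,.0f}")
```

Because output/reasoning tokens are priced far above input tokens, even a modest shift of interpretive burden to the reasoning phase can dominate the input savings.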
Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall.
Paper reports an evaluation across 17 models and states the maximum overall score observed was below 66%.
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval.
Statement in paper introduction contrasting prior benchmarks' focus on explicit recall with a claimed gap in evaluating implicit (non-declarative) memory; no systematic literature review or quantitative survey reported in the excerpt.
OpenAI o3 achieves only 17% of optimal collective performance.
Experimental measurement of collective performance for OpenAI o3 in the paper's multi-agent setup (value reported in abstract; no sample size provided there).
The study observed errors and limitations in both phases (test generation and refactoring), and manual intervention was necessary at times.
Case study observations reported in the paper describing observed model errors/limitations and instances requiring manual developer intervention.
Current AI coding assistants, such as GitHub Copilot and Amazon CodeWhisperer, emphasize developer speed and convenience, with energy impact not yet a primary focus.
Stated as an observation in the paper; no specific empirical comparison or quantification provided in this excerpt.
Frontend code, replicated across millions of page views, consumes significant energy and contributes directly to digital emissions.
Asserted in paper's introduction; no specific empirical data or sample reported in this excerpt.
We posit that persistence is reduced because AI conditions people to expect immediate answers, denying them the experience of working through challenges on their own.
Authors' proposed psychological mechanism / explanation inferred from observed behavior; presented as a hypothesis rather than directly proven causal mediator.
These negative effects (reduced persistence and impaired unassisted performance) emerge after only brief interactions with AI (approximately 10 minutes).
Experimental manipulation / exposure in RCTs where participants interacted with AI for about 10 minutes and subsequent outcomes were measured.
People are more likely to give up after interacting with AI (increased likelihood of quitting tasks unassisted).
Randomized controlled trials (N = 1,222) measuring rates of task abandonment/giving-up after AI interaction vs. control.
AI assistance impairs unassisted performance: although AI improves short-term performance, people perform significantly worse without AI after interacting with it.
Randomized controlled trials (N = 1,222) comparing performance with and without AI assistance across tasks; causal inference from randomized assignment.
Through a series of randomized controlled trials on human-AI interactions (N = 1,222), we provide causal evidence that AI assistance reduces persistence.
Randomized controlled trials (RCTs) on human-AI interactions with total sample size N = 1,222; persistence measured after AI interaction across tasks.
AI-assisted evaluation reduces variance in research quality.
SEM and regression analyses on OECD panel data report a decrease in variance of research quality measures associated with higher AIRC.
Current research has largely focused on short-horizon tasks on a narrow set of software with limited economic value (e.g., basic e-commerce and OS-configuration tasks).
Narrative literature/field observation reported in paper introduction (no numeric study reported in excerpt).
There is a fundamental gap in current agent capabilities: functional correctness alone is insufficient for design-aware issue resolution, motivating design-aware evaluation beyond functional correctness.
Synthesis of experimental findings: low design-satisfaction despite functional correctness, prevalence of design violations, and only partial improvement from guidance support the conclusion.
Design violations are widespread in agent-produced patches.
Empirical results from experiments on the benchmark showing many patches violate validated design constraints; backed by counts/percentages in evaluation (as summarized in abstract).
Test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying.
Experimental evaluation with state-of-the-art LLM-based agents on the benchmark (reported in paper). Sample implicit: benchmark issues (495) used to evaluate agents; comparison between test pass rates and design-satisfaction measured by verifier.
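A minimal sketch of the two metrics being contrasted, with hypothetical per-patch records standing in for the benchmark's verifier output:

```python
# Hypothetical per-patch records; the benchmark's verifier supplies the
# real booleans. Test-based resolution can overstate quality relative to
# full design satisfaction.
patches = [
    {"tests_pass": True,  "design_ok": True},
    {"tests_pass": True,  "design_ok": False},
    {"tests_pass": True,  "design_ok": False},
    {"tests_pass": False, "design_ok": False},
]

resolved = [p for p in patches if p["tests_pass"]]
satisfying = [p for p in resolved if p["design_ok"]]
print(f"resolved by tests: {len(resolved)}/{len(patches)}")
print(f"design-satisfying among resolved: {len(satisfying)}/{len(resolved)}")
```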
Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks.
Stated as an observation in the paper (abstract); no quantitative evidence, metrics, or comparative analysis provided in the excerpt.