Evidence (5157 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human-AI Collaboration
Increasing the complexity of the information structure has a significant and negative impact on information aggregation, suggesting AI agents may suffer from the same limitations as humans when reasoning about others.
Experimental manipulation of information-structure complexity in the controlled trading experiment; measured change in aggregation performance (log error of last price) as complexity increases.
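The excerpt does not spell out the log-error construction; a minimal sketch of one plausible reading (absolute log deviation of the last traded price from the true value, averaged within each complexity level) follows. All records and values are illustrative, not the experiment's data.

```python
import numpy as np

def log_error(last_price: float, true_value: float) -> float:
    """Absolute log-scale error between the market's last price and the true value."""
    return abs(np.log(last_price) - np.log(true_value))

# Hypothetical per-market records: (complexity level, last price, true value).
markets = [(1, 102.0, 100.0), (1, 97.0, 100.0),
           (3, 140.0, 100.0), (3, 65.0, 100.0)]

by_complexity = {}
for level, price, truth in markets:
    by_complexity.setdefault(level, []).append(log_error(price, truth))

for level, errs in sorted(by_complexity.items()):
    print(f"complexity={level}: mean log error = {np.mean(errs):.3f}")
```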
Users push back against agent outputs (through corrections, failure reports, and interruptions) in 44% of all turns.
Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.
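A minimal sketch of turn-level pushback coding, assuming a keyword/regex-style flagging scheme; the actual SWE-chat coding rules are not given in the excerpt.

```python
import re

# Illustrative pushback signals; these patterns are assumptions for the sketch.
PUSHBACK_PATTERNS = [
    re.compile(r"\bthat'?s (wrong|incorrect|not right)\b", re.IGNORECASE),
    re.compile(r"\b(stop|wait|undo that)\b", re.IGNORECASE),
    re.compile(r"\b(doesn'?t work|still fails|you broke)\b", re.IGNORECASE),
]

def is_pushback(turn: str) -> bool:
    """Flag a user turn containing a correction, failure report, or interrupt."""
    return any(p.search(turn) for p in PUSHBACK_PATTERNS)

turns = ["Stop, that's wrong.", "Looks good, thanks!", "It still fails on test 3."]
rate = sum(is_pushback(t) for t in turns) / len(turns)
print(f"pushback rate: {rate:.0%}")  # the paper reports 44% across >63,000 prompts
```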
Agent-written code introduces more security vulnerabilities than code authored by humans.
Comparative analysis of security vulnerabilities attributed to agent-authored code versus human-authored code within the SWE-chat dataset (method details not specified in excerpt).
Just 44% of all agent-produced code survives into user commits.
Empirical measurement of code provenance and survival within the SWE-chat dataset: proportion of agent-produced code that becomes part of subsequent user commits across sessions.
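One way such a survival rate might be computed, assuming line-level provenance matching; the paper's actual tracking granularity is not specified in the excerpt.

```python
def survival_rate(agent_lines: set[str], committed_lines: set[str]) -> float:
    """Fraction of agent-authored lines that persist into the user's commit.
    Line-level matching is an assumption of this sketch; the paper may track
    provenance differently."""
    if not agent_lines:
        return 0.0
    return len(agent_lines & committed_lines) / len(agent_lines)

agent = {"def add(a, b):", "    return a + b", "print(add(1, 2))"}
committed = {"def add(a, b):", "    return a + b"}  # user dropped the print
print(f"survival: {survival_rate(agent, committed):.0%}")
```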
Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.
Authors' summary claim supported by dataset-derived metrics such as agent code survival rate (44%) and user pushback (44% of turns); observational analysis of SWE-chat.
Regulated deployment imposes four load-bearing systems properties — deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale — and stateful architectures violate them by construction.
Conceptual/architectural argument presented in the paper (theoretical analysis), not an empirical measurement in the abstract.
Evaluation of four leading AI platforms shows that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient.
Empirical evaluation described in paper: four AI platforms tested on benchmark; reported average accuracy of 15% for RAG-based approaches on cases with insufficient information.
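A scoring sketch under the assumption that, on insufficient-information cases, the correct behavior is to abstain (flag the insufficiency) rather than answer; the benchmark's exact rubric is not given in the excerpt.

```python
# Each case records whether the retrieved context sufficed and whether the
# model abstained. Accuracy on insufficient cases = abstention rate there.
cases = [
    {"sufficient": False, "model_abstained": False},
    {"sufficient": False, "model_abstained": True},
    {"sufficient": False, "model_abstained": False},
    {"sufficient": True,  "model_abstained": False},
]
insufficient = [c for c in cases if not c["sufficient"]]
accuracy = sum(c["model_abstained"] for c in insufficient) / len(insufficient)
print(f"accuracy on insufficient-information cases: {accuracy:.0%}")  # 33% here
```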
Unemployment insurance adjudication has seen rapid integration of AI systems, and the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually.
Contextual/introductory claim in paper; references to domain-scale impact and bottleneck; no specific numeric study sample provided in excerpt.
A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking.
Statement in paper framing the problem; general literature/contextual claim (no specific experiment cited in the excerpt).
Brevity, semantic isolation and rhetorical register independently predict representational outcome (i.e., which submissions are included/excluded in summaries).
Statistical/semantic analysis (presumably regression or causal inference) reported in the paper linking textual features—brevity, semantic isolation, rhetorical register—to representational outcomes.
Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI, with exclusion rates of 33%–88% in such clusters.
Cluster/semantic analysis reported in the paper showing higher exclusion rates for clusters labeled as dissent/scepticism/critique.
In topic B, 15.3% of participants are effectively excluded by the official summary.
Empirical measurement reported in the paper quantifying participants 'effectively excluded' when comparing source submissions to official summary coverage.
In topic A, 16.9% of participants are effectively excluded by the official summary.
Empirical measurement reported in the paper quantifying participants 'effectively excluded' when comparing source submissions to official summary coverage.
The official government summary underperforms a random-participant baseline for topic B (coverage degradation of -8.0%).
Empirical comparison in the paper between official government summary and a random-participant baseline using the n=5,253 consultation responses.
The official government summary underperforms a random-participant baseline for topic A (coverage degradation of -9.1%).
Empirical comparison in the paper between official government summary and a random-participant baseline using the n=5,253 consultation responses.
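A minimal sketch of how coverage and "effective exclusion" could be operationalized with embeddings, assuming cosine similarity against the summary and a fixed exclusion threshold; all inputs below are synthetic placeholders, not the consultation data.

```python
import numpy as np

def coverage_stats(summary_vec, submission_vecs, threshold=0.5):
    """Cosine similarity of each participant's submission to a summary;
    participants below the threshold count as 'effectively excluded'.
    The embeddings and the 0.5 threshold are assumptions of this sketch."""
    sims = submission_vecs @ summary_vec / (
        np.linalg.norm(submission_vecs, axis=1) * np.linalg.norm(summary_vec))
    return sims.mean(), (sims < threshold).mean()

rng = np.random.default_rng(0)
subs = rng.normal(size=(5253, 64))   # one placeholder embedding per response
official = rng.normal(size=64)       # placeholder embedding of official summary

# Random-participant baseline: "summarize" by picking one submission at random.
baseline = subs[rng.integers(len(subs))]

cov_official, excl_official = coverage_stats(official, subs)
cov_baseline, _ = coverage_stats(baseline, subs)
print(f"coverage degradation vs baseline: {cov_official - cov_baseline:+.3f}")
print(f"effectively excluded (official): {excl_official:.1%}")
```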
LLMs endorsed fraudulent investments at 0% across all models tested.
Preregistered experiment across seven leading LLMs producing 3,360 AI advisory conversations; reported 0% endorsement of objectively fraudulent opportunities.
Endorsement reversal occurred in fewer than 3 in 1,000 observations.
Observed incidence reported from the preregistered experiment (3,360 AI advisory conversations); statement in paper reporting incidence <3/1,000.
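For scale, an incidence below 3 in 1,000 across the 3,360 conversations corresponds to roughly ten or fewer reversal events in the whole experiment.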
The policy and research challenge posed by platform-mediated automation is not merely job quantity (technological unemployment) but institutional continuity — how societies reproduce practical competence when platforms optimize for efficiency rather than formation.
Normative and conceptual claim developed through literature synthesis (institutional economics, platform governance, workforce development); presented as an analytical reframing rather than an empirically tested hypothesis.
Entry-level roles have historically functioned as apprenticeships in which workers acquire tacit knowledge and critical judgment; if platforms curtail these formative occupational layers, organizations may lack future workers capable of exercising contextual reasoning required to manage complex systems.
Institutional economics and workforce development literature cited in the paper; conceptual synthesis without original empirical measurement reported.
Platform-mediated automation risks hollowing out labor structures from both directions: eroding repetitive, junior roles from below and automating supervisory coordination functions from above.
Theoretical argument synthesizing institutional economics and platform literature; articulated as a conceptual risk rather than demonstrated with original empirical data.
Algorithmic systems are displacing routine tasks across both low-wage entry-level work and middle-management functions.
Stated in paper's argumentation; supported by a literature-based review drawing on platform governance literature and recent research on AI-enhanced automation (no original empirical sample or quantitative study reported).
A gender gap in generative AI adoption persists, concentrated in the most exposed occupations.
Stratified/descriptive and regression analyses of the 2024 EWCS showing gender differences in self-reported generative AI adoption, with the gap largest among occupations with highest exposure; sample >36,600 workers across 35 countries.
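A sketch of the kind of interaction model such an analysis might use, run on synthetic data; the variable names, effect sizes, and logit specification are assumptions for illustration, not the EWCS analysis itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data shaped like the note's description: adoption declines for
# women mainly where occupational AI exposure is high (assumed effect sizes).
rng = np.random.default_rng(0)
n = 5_000
female = rng.integers(0, 2, n)
exposure = rng.uniform(0, 1, n)  # occupational AI exposure score
logit_p = 0.5 + 0.8 * exposure - 0.2 * female - 1.2 * female * exposure
adopted = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

df = pd.DataFrame({"adopted": adopted, "female": female, "exposure": exposure})
model = smf.logit("adopted ~ female * exposure", data=df).fit(disp=0)
print(model.params)  # a negative female:exposure term = gap widens with exposure
```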
The infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it.
Authoritative claim in paper framing the research gap; presented as observational/argumentative (no empirical audit reported).
Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user.
Statement in paper's introduction/positioning; conceptual survey-style claim (no empirical study or systematic benchmark reported).
Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations.
Paper asserts that existing/standard benchmarks do not adequately isolate parsing and computation-orchestration abilities, motivating the new benchmark.
As multimodal AI achieves human-parity understanding of speech and gesture, [the keyboard's] necessity dissolves.
Theoretical claim supported by multidisciplinary review (history, neuroscience, technology, organizational studies); no quantified empirical test reported.
General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs.
Conceptual/argumentative claim stated in the paper's motivation; no empirical test reported in the abstract.
There was a nonsignificant absolute retest performance reduction in the AI condition, and a larger retest performance decrement in the AI condition than in the control condition (i.e., retention decreased more after using Copilot).
Comparison of retest (one-week) performance across conditions reported in results; authors report a nonsignificant reduction and larger decrement for the AI/Copilot condition (n=22).
Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment.
Authors' characterization of industry practice and limitations (assertion in paper; no empirical sample size reported in abstract).
Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations.
Statement in paper framing the problem; based on authors' characterization of current operational practice (no empirical sample size reported in abstract).
Thick subjectivist theories of meaning in life and meaningful work—those theories that emphasize that meaning-conferring activities are historically formed—enable us to appreciate how some losses cannot be made up, even if there are in principle ample alternative sources of meaning to be found elsewhere.
Theoretical claim about the explanatory power of 'thick subjectivist' normative theories; argued via conceptual philosophical analysis in the paper (no empirical testing reported).
Even if there are rich non-work sources of meaning, this does not entail that there is not a significant and multi-faceted loss of meaning, one that cannot be compensated for or offset elsewhere.
Normative/philosophical argument presented in the paper (conceptual reasoning rather than empirical measurement; no sample size).
The argument that non-work goods can replace work-derived meaning fails to consider the embeddedness and thickness of meaning in human lives.
Philosophical/theoretical critique based on conceptual analysis (author's argument invoking the notions of embeddedness and thickness of meaning; no empirical study reported).
The paper identifies governance challenges such as accountability gaps, digital sovereignty risks, ethical pluralism, and strategic weaponization arising from embedding AI in diplomatic practice.
Conceptual and normative analysis section of the paper outlining risks and governance challenges; illustrated by examples and argumentation.
Thin training coverage fosters anxiety about substitution and slows diffusion of AI tools.
Reported associations from surveys of mid-level managers and technical staff, interviews, and document analysis across cases; thematic coding identified links between limited training, worker anxiety, and slower diffusion. (Sample size not reported.)
Upstream textile SMEs frequently exhibit constrained supply chain resilience owing to persistent information latency and structural dependence on downstream orders.
Background/contextual claim stated in paper (motivation for study); no specific quantitative test reported in abstract.
Platforms can exploit workers' uncertainty about the cost of labor to effectively suppress wages.
Interpretation / implication drawn from the theoretical model and the result that a platform can achieve coverage while paying only O(log(M)/M) fraction of total labor cost under assumptions about workers' cost estimates.
There exists a simple pricing strategy for the platform that covers all M tasks with wait time O(M) while paying only an O(log(M)/M) fraction of the total cost of labor.
Theoretical result from the paper's posted-price procurement model under stated assumptions on workers' estimated costs; formal analysis/proof showing existence of such a pricing strategy for general M (no empirical sample).
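To see why an O(log(M)/M) cost fraction is striking, a quick numeric illustration follows; the constant factor is set to 1 here, which is an assumption about the hidden constants in the bound.

```python
import math

# The platform's share of total labor cost under the stated asymptotic bound,
# with an assumed constant factor of 1, shrinks rapidly as M grows.
for M in (10, 100, 1_000, 10_000):
    fraction = math.log(M) / M
    print(f"M={M:>6}: platform pays ~{fraction:.4f} of total labor cost")
```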
Because the technical threshold for this transition is already crossed at modest engineering effort, the window for protective frameworks covering disclosure, consent, compensation and deployment restriction is the present, while deployment remains optional rather than infrastructural.
Authors' normative claim based on their implementation (distillation and deployment) and interpretation that modest engineering sufficed; used to argue policy urgency for disclosure/consent/compensation frameworks.
We term this the Relic condition: when publication systems make stable reasoning architectures legible, extractable and cheaply deployable, the public record of intellectual labor becomes raw material for its own functional replacement.
Conceptual framing introduced by the authors as an interpretation of the observed results and their implications; not an empirical measurement but a named condition/argument.
Agency in software engineering is primarily constrained by organizational policies rather than individual preferences.
Authors' synthesis of qualitative results across the ACTA/Delphi and task/review phases indicating organizational policy factors were cited as primary constraints.
Existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions, and do not capture forecasting over continuous quantities.
Literature/benchmark critique asserted in the paper (argument that current benchmarks focus on simple judgmental formats and miss continuous numerical forecasting capabilities).
Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Empirical observations from QuantSightBench evaluation showing model calibration performance as a function of magnitude (paper statement noting sharp degradation and overconfidence at extremes).
The top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all fall at least 10 percentage points short of the 90% coverage target.
Reported empirical coverage percentages from evaluation on QuantSightBench for the listed models (paper provides these percentage values).
None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target.
Empirical evaluation on the newly introduced QuantSightBench benchmark across 11 frontier and open-weight models; models were assessed on empirical coverage of prediction intervals versus a 90% target (paper statement).
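Empirical coverage of a stated 90% interval is simple to compute; a minimal sketch with hypothetical intervals (not benchmark data) follows.

```python
import numpy as np

def empirical_coverage(lows, highs, truths):
    """Fraction of realized values falling inside the stated 90% intervals."""
    lows, highs, truths = map(np.asarray, (lows, highs, truths))
    return np.mean((lows <= truths) & (truths <= highs))

# Hypothetical elicited intervals for three quantities (illustrative only).
cov = empirical_coverage([10, 0.5, 1e6], [20, 0.9, 5e6], [18, 0.95, 2e6])
print(f"empirical coverage: {cov:.1%} vs 90% target")  # 66.7% here: overconfident
```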
The study identified significant implementation challenges including algorithmic bias, digital divide concerns, data privacy risks, and low technology readiness among HR teams in Tier 2 cities.
Synthesis of qualitative case study findings from 4 organizations plus survey responses (N=150) reporting barriers and risks encountered during adoption.
Current attack policies do not saturate LinuxArena (human-crafted attacks evade monitors at substantially higher rates than model-generated attacks, indicating headroom for attackers).
Empirical observation comparing human-crafted attacks (LaStraj) and elicited model-generated attacks; authors interpret higher human evasion rates as evidence that current automated attack policies have not saturated the challenge posed by LinuxArena.
LaStraj is a dataset of human-crafted attack trajectories that evade monitors at substantially higher rates than any model-generated attacks we elicited.
Authors release LaStraj and report empirical comparisons showing human-crafted trajectories evade monitors at higher rates than the model-generated attacks they tested (exact evasion rates and sample sizes not provided in the excerpt).
Against a GPT-5-nano trusted monitor at a 1% step-wise false positive rate, Claude Opus 4.6 achieves roughly a 23% undetected sabotage success rate.
Empirical sabotage evaluation reported by the authors: a trusted monitor (GPT-5-nano) is operated at a specified step-wise false positive rate, and the attacking model's (Claude Opus 4.6) undetected success rate is reported. (Sample size / number of evaluated runs not provided in the excerpt.)
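A sketch of the control-evaluation arithmetic implied here: calibrate the monitor's step-level threshold to a 1% false positive rate on benign steps, then count attacks whose every step stays under it. The scores below are synthetic, and this setup is an assumption, not the paper's exact procedure.

```python
import numpy as np

def threshold_at_fpr(benign_scores, fpr=0.01):
    """Step-level suspicion threshold calibrated to the target false
    positive rate on benign steps."""
    return np.quantile(benign_scores, 1 - fpr)

def undetected_rate(attack_trajectories, thr):
    """An attack counts as undetected only if every step stays below threshold."""
    return np.mean([traj.max() < thr for traj in attack_trajectories])

rng = np.random.default_rng(1)
benign = rng.normal(0.0, 1.0, 10_000)                     # benign-step scores
attacks = [rng.normal(0.8, 1.0, 20) for _ in range(500)]  # per-step attack scores
thr = threshold_at_fpr(benign)
print(f"threshold={thr:.2f}, undetected={undetected_rate(attacks, thr):.1%}")
```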
Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Qualitative and quantitative analysis of errors observed across the DELEGATE-52 experiments (19 LLMs) showing sparse, high-severity, and silently introduced errors that accumulate over long workflows.
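How sparse, silent errors compound over long interaction can be made concrete with a one-line probability model; the per-edit error rate below is an illustrative assumption, not a rate measured in the DELEGATE-52 experiments.

```python
# If each delegated edit silently introduces an error with probability p,
# the chance the document carries at least one silent error after T edits
# compounds as 1 - (1 - p)**T.
p = 0.02  # assumed illustrative per-edit silent-error rate
for T in (1, 10, 50, 100):
    print(f"T={T:>3}: P(at least one silent error) = {1 - (1 - p) ** T:.1%}")
```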