Evidence (16496 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding (for example, "what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1

Organizational structures, bias susceptibility, retraining constraints, and interface design co-determine system stability, error propagation, and optimization ceilings.

Conceptual claim based on synthesis of literature across organizational adoption and ML lifecycle management (no empirical tests or sample sizes reported).

high negative Optimizing Human Capital in AI-Enabled Architectures: A Syst... system stability and error propagation (incidence and spread of errors) and limi...

Human interfaces define throughput limits in areas such as prompt engineering, data-stream curation, adjudication of model outputs, and the orchestration of hybrid automation workflows including robotics, scraping, and digitization.

Theoretical assertion supported by the paper's systems-oriented analysis and literature synthesis (no empirical measurement or sample size provided).

high negative Optimizing Human Capital in AI-Enabled Architectures: A Syst... throughput / task completion capacity for workflows involving human-AI interacti...

Despite accelerating advances in AI capabilities, human capital remains the enduring and dominant system constraint.

Argument and synthesis of emerging research across human-AI interaction, ML lifecycle management, organizational adoption, and adult learning theory (conceptual synthesis; no empirical sample size reported).

high negative Optimizing Human Capital in AI-Enabled Architectures: A Syst... constraint on overall AI system performance (human capital as limiting factor)

GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.

Argument combining conceptual critique of GDP with empirical demonstration on G7 data using the GAGI index (authors' normative policy recommendation).

high negative GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-... ability of GDP-only monitoring to detect distributional harms from automation (v...

The divergence between welfare-adjusted prosperity (GAGI) and headline GDP widens sharply after 2022, temporally coincident with the after-effects of COVID and the acceleration of generative-AI deployment, though this evidence alone does not demonstrate causation.

Temporal pattern observed in the authors' G7 2010–2026 empirical series (associational observation; authors explicitly note lack of causal identification).

high negative GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-... increase in the gap (widening divergence) between GAGI and GDP per capita after ...

Applying GAGI to the G7 economies over 2010–2026 shows that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth.

Empirical analysis performed on G7 countries over 2010–2026 (sample: 7 economies; time series comparison of GAGI vs. GDP per capita).

high negative GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-... divergence between welfare-adjusted prosperity (GAGI) and headline GDP per capit...

What is missing from the macroeconomic monitoring toolkit is an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change is legible to a regulator.

Normative/methodological claim by the authors arguing for a practical monitoring statistic (no empirical test; statement of need).

high negative GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-... presence/absence of an operational, auditable macroeconomic monitoring statistic

GDP per capita is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact.

Conceptual/definitional argument presented by the authors in the paper (no empirical test reported).

high negative GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-... ability of GDP per capita to reflect consumer welfare (specifically distribution...

Mediation analysis: AI adoption contracts employment in production and managerial positions.

Mediation models using occupational/role-level employment categories showing reductions in production and managerial headcounts associated with AI adoption.

high negative Creative disruption or destructive inequality? Firm-level ev... employment in production and managerial roles

AI adoption widens intra-firm pay disparities (increases pay inequality within firms).

Regression analyses showing divergent effects on employee vs. executive pay and explicit measures of intra-firm pay disparity in the panel data.

high negative Creative disruption or destructive inequality? Firm-level ev... intra-firm pay disparities (inequality between employees and executives)

Oligopolistic capture of productivity gains is intelligible as an outcome of AI-driven assetisation (i.e., productivity gains are appropriated by a small number of firms).

Theoretical claim based on political economy argument about assetisation and market power; no empirical sample or quantitative evidence reported in the excerpt.

high negative From human capital to asset ownership: AI as rentier asset distribution of productivity gains (capture by oligopolies)

Labour markets for university-educated workers are where the explanatory limits of human capital theory are most consequentially exposed.

Theoretical critique supported by political economy / sociological reasoning (no empirical sample reported).

high negative From human capital to asset ownership: AI as rentier asset adequacy of human capital theory to explain outcomes in university-educated labo...

AI should be understood as a productive rentier asset whose returns derive from constructed scarcity and access control rather than from commodity exchange.

Conceptual/theoretical framing based on political economy and sociological analysis (argumentative, no empirical sample reported).

high negative From human capital to asset ownership: AI as rentier asset basis of economic returns to AI (constructed scarcity and access control vs comm...

Capital distortion negatively moderates the effect of industrial robots on firm-level TFP (i.e., capital distortion reduces the positive impact of robots on TFP).

Moderation/interaction analysis using the same panel data of Chinese listed firms and industrial robots (2006–2019); reported tests of interaction between robot application and measures of capital distortion.

high negative The application of industrial robots, capital distortion, an... total factor productivity (TFP) at the firm level (moderated by capital distorti...

Longer system responses and more information-providing turns negatively affect user satisfaction.

Statistical modeling of user satisfaction using features of multi-turn interactions (response length, number of information-providing turns) derived from the 49 participant sessions; models show negative associations reported in the paper.

high negative Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NF... user satisfaction

Accuracy of developer+LLM assessments against expert ground truth is low.

Comparison of participant/LLM assessment outcomes to expert-annotated ground truth for the 148 NFRs; reported low accuracy in the paper.

high negative Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NF... accuracy of assessments relative to expert ground truth

Existing benchmarks typically report accuracy for a single model on a single run, which systematically understates real-world LLM capabilities—particularly under heterogeneous data distributions—because (i) different models get different questions correct according to their specializations, and (ii) given a budget, multiple generations can be sampled and selectively retained.

Argument supported by the paper's empirical Capability Frontier analysis (21 models, 16 benchmarks) and supporting simulations demonstrating specialization and gains from multiple samples/selection.

high negative The Capability Frontier: Benchmarks Miss 82% of Model Perfor... reported benchmark accuracy vs achievable collective capability
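
The gap this claim describes, between one model's single-run accuracy and what a pool of specialised models can achieve together, can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration, not the paper's Capability Frontier metric: the function names, the per-question correctness flags, and the "oracle" selection rule (a question counts as solved if any model answers it correctly) are all hypothetical, and the oracle is only an upper bound on what selection could retain.

```python
def single_model_accuracy(results):
    """Accuracy of each model on its own (one run per question)."""
    return {model: sum(flags) / len(flags) for model, flags in results.items()}


def oracle_accuracy(results):
    """Share of questions answered correctly by at least one model.

    Assumes every model was scored on the same questions in the same order.
    """
    per_question = zip(*results.values())
    solved = [any(flags) for flags in per_question]
    return sum(solved) / len(solved)


# Made-up correctness flags for three hypothetical models on four questions.
results = {
    "model_a": [True, True, False, False],
    "model_b": [False, True, True, False],
    "model_c": [False, False, False, True],
}

best_single = max(single_model_accuracy(results).values())
collective = oracle_accuracy(results)
print(f"best single model: {best_single:.2f}, any-model oracle: {collective:.2f}")
```

In this made-up data no model exceeds 50% alone, yet every question is solved by some model, which is the kind of specialisation effect the claim refers to.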

LLM-based coding harnesses grant agents broad file and shell access, while the configuration layer that steers them (rules files, agent definitions, IDE-specific markdown) is largely unmanaged.

Author statement / qualitative assessment in the paper describing typical LLM harness privileges and the management state of configuration artifacts; not tied to a quantified experiment in the excerpt.

high negative A Deterministic Control Plane for LLM Coding Agents management state of configuration layer (qualitative)

Age-normalised commit rates for agent configs are lower than CI/CD workflows: 0.4 vs 0.6 commits/month.

Comparative, age-normalised commits-per-month statistic reported for agent configs versus CI/CD workflows (method: commit-rate normalization by file age).

high negative A Deterministic Control Plane for LLM Coding Agents commits per month (age-normalised)
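
An age-normalised commit rate of the kind reported above can be computed by dividing a file's commit count by its age in months. The sketch below is a minimal illustration under stated assumptions: the function name, the choice of 30.44 days as an average month, and the example dates are hypothetical, and the paper's exact normalisation is not given in the excerpt.

```python
from datetime import date


def commits_per_month(commit_count, first_commit, observed_on, days_per_month=30.44):
    """Commits divided by file age in months (age measured from the first commit).

    days_per_month is an assumed average month length, not a value from the paper.
    """
    age_days = (observed_on - first_commit).days
    if age_days <= 0:
        raise ValueError("file age must be positive to normalise by it")
    return commit_count / (age_days / days_per_month)


# Hypothetical file: 4 commits over roughly ten months.
rate = commits_per_month(4, date(2024, 1, 1), date(2024, 11, 1))
print(f"{rate:.2f} commits/month")
```

Normalising by age keeps recently created files from looking artificially quiet next to long-lived ones, which is presumably why the comparison with CI/CD workflows is made on this basis.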

Configurations are rarely revised: 58% are single-commit files.

Commit-history analysis of agent configuration files; proportion of files with only a single commit reported as 58%.

high negative A Deterministic Control Plane for LLM Coding Agents proportion of configs with a single commit

75.5% of clone pairs cross organisational boundaries (i.e., duplicated configs often span different organizations).

Analysis of clone pairs derived from SHA-256 matches; percentage of clone pairs classified as crossing organisational boundaries reported in the study.

high negative A Deterministic Control Plane for LLM Coding Agents share of clone pairs crossing organisational boundaries

Agent configurations propagate as undeclared shared components: 10.1% of tracked paths are SHA-256 exact duplicates across independent repositories (fork-adjusted, threshold-independent).

File-content analysis across collected agent configuration files using SHA-256 exact matching; sample drawn from the stated 6,145 agent config files.

high negative A Deterministic Control Plane for LLM Coding Agents rate of exact-duplicate configuration paths
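
SHA-256 exact matching, as described in this card, amounts to hashing each file's contents and grouping paths whose digests coincide. The sketch below illustrates that idea only: the function names, the `(repo, path, content)` input shape, and the sample files are hypothetical, and the paper's fork adjustment is not modelled here.

```python
import hashlib
from collections import defaultdict


def find_cross_repo_duplicates(files):
    """Group (repo, path) pairs by SHA-256 of content; keep groups spanning >1 repo.

    `files` is an iterable of (repo, path, content_bytes) tuples (assumed shape).
    """
    by_digest = defaultdict(list)
    for repo, path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append((repo, path))
    return {
        digest: entries
        for digest, entries in by_digest.items()
        if len({repo for repo, _ in entries}) > 1
    }


def duplicate_rate(files):
    """Share of tracked paths whose content also appears in another repository."""
    files = list(files)
    dupes = find_cross_repo_duplicates(files)
    duplicated = sum(len(entries) for entries in dupes.values())
    return duplicated / len(files) if files else 0.0


# Made-up sample: two repositories share one byte-identical rules file.
sample = [
    ("org-a/app", "rules.md", b"Always run tests.\n"),
    ("org-b/tool", "rules.md", b"Always run tests.\n"),
    ("org-c/lib", "agent.md", b"Prefer small diffs.\n"),
    ("org-a/app", "docs/rules.md", b"Use type hints.\n"),
]
print(f"duplicate rate: {duplicate_rate(sample):.0%}")
```

Because the comparison is on exact digests, it needs no similarity threshold, which matches the "threshold-independent" wording in the claim; near-duplicates with even a one-byte difference are not detected.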

We surface threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents (e.g., shortcuts).

Empirical analysis of failures and shortcuts observed when evaluating more capable agents on CORE-Bench Hard (case-study observations).

high negative Life After Benchmark Saturation: A Case Study of CORE-Bench construct validity issues (shortcuts)

When a benchmark's accuracy saturates, it is often retired and replaced with a more challenging version; this accuracy-centric approach privileges accuracy and misses the opportunity to study six other key dimensions of agent performance (construct validity issues such as shortcuts, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold, and uplift from human-agent collaboration).

Argument and framing in the paper supported by conceptual analysis and the CORE-Bench Hard case study (qualitative reasoning and empirical examples).

high negative Life After Benchmark Saturation: A Case Study of CORE-Bench coverage of evaluation dimensions (accuracy-centric vs. multidimensional evaluat...

Diagnostic heuristic: if letting AI in makes the task feel effortless, it is in the wrong place.

Authors' heuristic for educators (conceptual guidance; no empirical test reported in the excerpt).

high negative The Effortless Trap: Productive Struggle, AI, and the Illusi... perceived effort during task when AI is allowed

An unguarded AI helper left high-school students about 17% worse on an unaided exam than peers with no tool at all.

Described as the 'strongest causal evidence' in the paper; empirical study of high-school students measuring unaided exam performance. (Study design details and sample size not provided in the excerpt.)

high negative The Effortless Trap: Productive Struggle, AI, and the Illusi... score on an unaided exam

Used poorly, AI replaces the cognitive work that learning requires and leaves an illusion of learning: a confident sense of mastery that collapses on the unaided task.

Authors' conceptual claim supported in the paper by reference to causal evidence (see following empirical claims); no sample size given in the excerpt.

high negative The Effortless Trap: Productive Struggle, AI, and the Illusi... performance on unaided tasks (collapse of apparent mastery)

After removing Sybil-flagged feedback, 15.5%, 72.3%, and 89.4% of rated agents on Ethereum, BSC, and Base respectively are left with no valid feedback.

Recomputation of rated agents' feedback counts after excluding reviews flagged by the paper's Sybil detection methodology (reported percentages come from that post-filtering analysis).

high negative Can Trustless Agents Be Trusted? An Empirical Study of the E... fraction of rated agents left with zero valid feedback after Sybil-flag removal

A substantial fraction of reviewers exhibit coordinated Sybil behavior: 73.6%, 59.2%, and 90.6% across Ethereum, BSC, and Base respectively.

Sybil-detection analysis applied to reviewer activity patterns in the Reputation registry across the three chains (reported percentages derived from that analysis).

high negative Can Trustless Agents Be Trusted? An Empirical Study of the E... fraction of reviewers flagged as exhibiting coordinated Sybil behavior

Feedback records in the Registry are rarely grounded in verifiable interactions.

Cross-checks between reputation feedback records and on-chain/off-chain evidence of corresponding verifiable interactions (including x402 payment transactions) as described in the paper.

high negative Can Trustless Agents Be Trusted? An Empirical Study of the E... proportion of feedback records that can be linked to verifiable interactions

The Registry, as currently deployed, cannot function as a trust signal because registration/reputation values are not commensurable.

Analysis of on-chain Reputation registry entries and their formats/values showing heterogeneity and lack of common scale or semantics (qualitative and quantitative examination reported in the paper).

high negative Can Trustless Agents Be Trusted? An Empirical Study of the E... commensurability/consistency of reputation values recorded in the registry

Most registrations are placeholders rather than active agents: only a small fraction (3%, 4%, and 15% across Ethereum, BSC, and Base) expose a valid ERC-8004 registration file with at least one live service endpoint.

On-chain crawl of Identity registration events plus retrieval/validation of off-chain registration files and live service endpoint checks across the three chains (as reported in the paper).

high negative Can Trustless Agents Be Trusted? An Empirical Study of the E... fraction of registered identities exposing a valid registration file with at lea...

Experts in the study assign a 14% probability to 'rapid-progress' scenarios characterized by substantial GDP growth, declining labor force participation, and accelerating wealth inequality.

Result from the 2025 forecasting study of experts (69 economists + 52 AI experts), reporting a probability estimate (14%) for a named scenario with specified macroeconomic and labor-market features.

high negative Preparing Organizations for AI's Economic Disruption: Eviden... probability assigned to a rapid-progress scenario with substantial GDP growth, d...

Developed economies leverage educational capital to mitigate the adverse inequality effects of AI adoption.

Reported interaction/moderation findings from OLS and Random Forest analyses on the World Bank/OECD dataset showing weaker or offset association between AI adoption and Gini in higher-education / higher-development country groups.

high negative Analyzing the Impact of Artificial Intelligence Adoption on ... Gini index (income inequality)

There is a substantial lag in the adoption of state-of-the-art AI techniques in RERS research.

Synthesis of methodological findings from the 59-study review, including low deep-learning usage (15%) and absence of state-of-the-art XAI/TL implementations.

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... adoption of state-of-the-art AI techniques in RERS literature

Fairness auditing in RERS research is limited despite documented discrimination risks in housing markets.

Assessment of evaluation and ethical practices across the 59 reviewed studies; authors note few studies perform fairness auditing and cite broader literature on discrimination risks in housing.

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... presence/absence of fairness auditing in RERS studies

The literature is dominated by residential property studies (91% of reviewed works).

Domain/topic classification of the 59 studies in the review; authors report 91% focus on residential properties.

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... proportion of studies focused on residential properties

Research is geographically concentrated in Asia (56% of reviewed studies).

Geographic coding of study origins in the systematic review; authors report 56% of studies originate from Asia (n=59 total studies).

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... geographic distribution of RERS research (share from Asia)

A reliance on proprietary datasets is pervasive: 80% of reviewed studies use proprietary data.

Reported percentage (80%) derived from categorization of dataset types across the 59 reviewed studies.

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... use of proprietary datasets in RERS research

No reviewed work implements state-of-the-art post hoc explainable AI (XAI) or transfer learning (TL) frameworks.

Systematic review of methods reported across 59 studies; authors state absence of implementations of state-of-the-art post hoc XAI or TL.

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... implementation of state-of-the-art post hoc XAI and transfer learning frameworks...

Deep learning is employed in 15% of reviewed RERS studies.

Count/percentage reported from the systematic review of 59 studies; percentage (15%) directly reported in paper.

high negative Real Estate Recommender Systems: A PRISMA-Compliant Systemat... use of deep learning techniques in RERS studies

Barriers limiting full AI adoption in auditing include resistance to change, algorithm aversion, heuristics and biases, lack of transparency, expertise and training gaps, and technological complexity.

Systematic Literature Review (SLR) of 43 studies synthesizing reported inhibitors and challenges to AI uptake in auditing from empirical and conceptual papers.

high negative AI in auditing: Drivers and barriers to its adoption and the... Barriers/inhibitors to AI adoption in auditing

When external shocks occur, model convergence triggers nonlinear 'algorithmic resonance' and 'digital herding effects,' amplifying localized market disturbances into systemic crises.

Theoretical mechanism and conceptual dynamics described in the paper (narrative/analytic reasoning; no empirical testing reported).

high negative A Theoretical Framework for AI and Financial Stability: The ... amplification of localized shocks into systemic crises

The core transmission chain is: profit-driven financial institutions adopt similar data sources and model architectures, leading to increased model correlation and convergent decision logic at the macro level.

Analytical transmission-chain argument constructed in the paper's framework (conceptual mechanism; no empirical sample).

high negative A Theoretical Framework for AI and Financial Stability: The ... model correlation / convergent decision logic

Homogeneity, opacity, overfitting tendencies, and technological dependency emerge as novel sources of systemic risk under widespread AI adoption in finance.

Conceptual identification of novel risk channels within the paper's theoretical framework (analytical reasoning rather than empirical testing).

high negative A Theoretical Framework for AI and Financial Stability: The ... emergence of novel systemic risk factors

AI does not eliminate financial risks but rather shifts them from traditional balance-sheet and leverage domains to realms characterized by algorithmic dependence and technological vulnerability.

The paper's primary theoretical argument based on a developed dual analytical framework and conceptual analysis (no empirical sample reported).

high negative A Theoretical Framework for AI and Financial Stability: The ... financial risk / financial stability (shift in risk domains)

The effectiveness of prompt injection rapidly diminishes as more candidates inject, collapsing when manipulation becomes widespread.

Controlled experiments that vary the share of candidates performing prompt injection and observe changes in manipulation effectiveness; exact sample size not provided in the abstract.

high negative Prompt Injection in Automated Résumé Screening with Large La... change in manipulation effectiveness as measured by shifts in applicant rankings

GenAI adoption carries risks including overreliance on models, misalignment between model outputs and human needs, and uneven performance across tasks and contexts.

Reported adverse effects and risks identified in the reviewed literature (task-level experiments and applied studies summarized by the paper).

high negative Generative AI, Digital Infrastructure, and Firm Productivity... error rates, misalignment incidents, quality failures due to overreliance

AI nudification has expanded from targeting public figures to increasingly harming individuals within users' own social circles.

Comparison of the target-demographic distribution in this study (55.8% non-celebrities) to prior studies (4.7% non-celebrities), interpreted as a shift in who is targeted.

high negative From Celebrities to Anyone: Characterizing AI Nudification C... shift in target demographics from public figures to private/non-celebrity indivi...

The ecosystem runs on a small cohort of active producers, with the most prolific producing 780 items.

Contributor activity analysis of the collected dataset showing distribution of items per producer and identifying the most prolific producer with 780 items.

high negative From Celebrities to Anyone: Characterizing AI Nudification C... number of SNEACI items produced by the most prolific contributor
