Evidence (13827 claims)
| Topic | Claims |
|---|---|
| Adoption | 8454 |
| Productivity | 7544 |
| Governance | 6789 |
| Human-AI Collaboration | 6327 |
| Org Design | 4126 |
| Innovation | 4058 |
| Labor Markets | 3520 |
| Skills & Training | 2924 |
| Inequality | 2057 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
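As a reading aid for the matrix, the sketch below computes each direction's share of the listed total for three rows copied from the table. The function name `direction_shares` and the choice of rows are illustrative assumptions, not part of the source. The four direction columns do not always add up to the Total column (for Firm Productivity they sum to 598 against a listed 604), so shares are taken against the listed total and need not sum to exactly 100%.

```python
# Three rows copied from the Evidence Matrix:
# (positive, negative, mixed, null, total as listed).
rows = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Inequality Measures": (44, 122, 49, 6, 221),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(counts):
    """Return each direction's share of the listed Total column."""
    positive, negative, mixed, null, total = counts
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
    }


shares = {name: direction_shares(c) for name, c in rows.items()}
for name, s in shares.items():
    print(f"{name}: {s['positive']:.0%} positive, {s['negative']:.0%} negative")
```

Read this way, the Firm Productivity row is mostly positive findings, while the Inequality Measures and Job Displacement rows are mostly negative ones.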
Collaboration between humans and AI enhances decision-making, efficiency, and innovation.
Reported result from thematic evaluation of literature and secondary data (qualitative synthesis). No sample size or quantified effect provided.
AI improves overall organisational productivity.
Authors' synthesis of peer-reviewed studies and secondary data indicating productivity impacts (qualitative literature review). No quantitative sample size reported.
AI increases human capacities.
Conclusion from comprehensive analysis of peer-reviewed literature and thematic evaluation of secondary data (literature review). No primary sample size reported.
Policy responses should prioritise governance frameworks that emphasise equity, accountability, and inclusive distribution of value to address concentrated digital power.
Normative policy recommendations derived from the paper's conceptual analysis and synthesis of recent literature (policy prescription, no empirical evaluation reported).
Time and effort dissociate: participants reported lower subjective effort with AI despite equivalent completion times.
Empirical result reported in the abstract: subjective effort ratings were lower for AI-assisted conditions even though measured completion times were equivalent (preregistered study, N = 1237).
Participants predicted AI to be significantly faster.
Empirical result reported in the abstract: participants' predicted completion times indicated AI-assisted completion would be faster than independent completion (statistical significance claimed). Sample from preregistered study (N = 1237).
Large language models (LLMs) have the potential to boost human productivity by speeding up task completion, provided users know when to offload cognitive work to them.
Framing/introductory claim in the paper (theoretical/argumentative), no direct empirical evidence reported in the abstract.
Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.
Aggregate of dataset creation, benchmark results, algorithm (curriculum-LoRA) efficiency gains, and system integration reported in the paper; claim is a stated implication/claim about practical feasibility for local administrations.
The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline.
System-level description and implementation in the paper that embeds curriculum-LoRA within a closed-loop pipeline for policy evaluation and iteration.
Curriculum-LoRA Pareto-dominates every configuration tested.
Empirical comparisons across the tested configurations in the experiments reported in the paper; curriculum-LoRA outperforms or matches all other configurations on the fidelity-versus-cost Pareto frontier.
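To make the Pareto claim concrete, here is a minimal sketch of a standard two-objective dominance check over fidelity (higher is better) and per-call cost (lower is better). The configuration names and all numbers are hypothetical placeholders, not results from the paper, and the paper's own comparison procedure may differ.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    fidelity: float  # higher is better
    cost: float      # per-call cost, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.fidelity >= b.fidelity and a.cost <= b.cost
    strictly_better = a.fidelity > b.fidelity or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(configs):
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical values chosen only to illustrate the check.
configs = [
    Config("no-profile prompt", 0.55, 1.0),
    Config("rich-profile prompt", 0.70, 10.0),
    Config("adapter-based", 0.70, 1.0),
]
front = pareto_front(configs)
print([c.name for c in front])
```

In this toy setup the adapter-based configuration matches the best fidelity at a tenth of the cost, so it is the only point left on the front.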
Curriculum-LoRA is a parameter-efficient personalization framework that, by closing the fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost.
Experimental evaluation comparing curriculum-LoRA to baselines on fidelity and per-call cost metrics; reported result that curriculum-LoRA attains comparable fidelity while reducing per-call cost by about a factor of ten.
Adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline.
Benchmark comparisons between prompting strategies that include rich life-history profiles versus a no-profile baseline across the evaluated LLMs, using the interview-derived dataset to assess fidelity.
The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains.
Reported dataset construction: two-hour semi-structured interviews with each of 92 residents, organized around nine governance domains; reported total text volume of about 1.2 million characters.
Wage gains coincide with an increase in within-firm wage dispersion in small firms, with wage variance rising by around 7.5%.
Within-firm wage variance analysis (likely computed from worker-level wages aggregated to firm-level dispersion) showing a ~7.5% increase in wage variance in small firms after automation adoption.
Using a difference-in-differences framework exploiting import lumpiness in product categories linked to automation technologies, we find a positive average adoption effect on adopters’ average wages, which stabilizes at around 4% five years after an automation spike.
Difference-in-differences (DiD) estimation exploiting time variation in import 'spikes' in automation-related product categories (including robots) on the integrated panel of Italian importing firms (2011–2019).
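The logic of a difference-in-differences estimate can be sketched on synthetic data as follows. This is a canonical two-group, two-period version, not the paper's event-study specification around import spikes. Every number here, including the 4% effect built into the simulated wages, is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

adopter = rng.integers(0, 2, n)  # 1 = firm with an automation import spike (synthetic)
post = rng.integers(0, 2, n)     # 1 = observation after the spike (synthetic)
true_effect = 0.04               # assumed effect on log wages, for illustration only

# Simulated log wages: group gap + common time trend + treatment effect + noise.
log_wage = (
    3.0
    + 0.10 * adopter
    + 0.02 * post
    + true_effect * adopter * post
    + rng.normal(0.0, 0.05, n)
)

# OLS with an interaction term; its coefficient is the DiD estimate.
X = np.column_stack([np.ones(n), adopter, post, adopter * post])
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)
did = beta[3]
print(f"DiD estimate: {did:.3f}")
```

The interaction coefficient recovers the simulated effect because the level gap between adopters and non-adopters and the common time trend are absorbed by the other two terms.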
Mincer-type wage regressions reveal that automation adopters pay approximately 3% higher wages after controlling for worker sorting.
Mincer-style wage regressions with controls for worker sorting (individual-level regression analysis on the integrated dataset).
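For orientation, a textbook Mincer-type specification with an adopter indicator can be written as below. The excerpt does not give the paper's exact equation, so the symbols are assumptions: $w_{ijt}$ is the wage of worker $i$ at firm $j$ in year $t$, $A_{jt}$ marks automation-adopting firms, $S_i$ is schooling, $E_{it}$ is experience, and $\delta_t$ are year effects. How the paper controls for worker sorting is not stated here.

```latex
\ln w_{ijt} = \alpha + \beta A_{jt} + \rho S_i + \gamma_1 E_{it} + \gamma_2 E_{it}^2 + \delta_t + \varepsilon_{ijt}
```

Under this form, a coefficient $\beta \approx 0.03$ corresponds to roughly 3% higher wages at adopting firms, holding the worker characteristics fixed.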
The automation wage premium for adopting firms stands at approximately 10%.
Descriptive comparison of wages between automation-adopting firms and others using integrated firm-worker-trade data for Italian importing firms (2011–2019).
The design isolates the contribution of the platform's algorithm to the outcome and separates it from the contribution of the creative content.
Methodological claim supported by the proposed three-arm design and its empirical demonstration in the live campaign.
Roughly three-quarters of the absolute reallocation is algorithmic.
Empirical decomposition from the live Meta campaign reported in the paper (proportion of total reallocation attributed to algorithmic channel).
In a live Meta campaign with a women-targeted text fragment, the algorithmic channel raises female impression share by +2.07 ppt.
Empirical result from a live Meta campaign reported in the paper; conveys a measured effect size (+2.07 percentage points).
We propose a three-arm design that adds an arm exposing the algorithm to the treatment metadata while holding the user-facing creative identical to control, point-identifying the natural indirect (algorithmic) and direct (creative) effects without sequential ignorability.
Methodological proposal in the paper (design description and identification claim); presumably supported by theoretical derivation/proof in the paper.
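The arithmetic behind a three-arm decomposition can be sketched as follows, under the assumption that the algorithmic effect is the metadata-only arm minus control and the creative effect is the full-treatment arm minus the metadata-only arm. The function name, the share formula, and the input values are hypothetical; the paper's estimators and its definition of "absolute reallocation" may differ.

```python
def decompose(control, metadata_only, full_treatment):
    """Split a total effect into algorithmic and creative components.

    control        : control creative, control metadata
    metadata_only  : control creative shown to users, treatment metadata
                     exposed to the delivery algorithm
    full_treatment : treatment creative and treatment metadata
    """
    algorithmic = metadata_only - control
    creative = full_treatment - metadata_only
    total = full_treatment - control
    abs_sum = abs(algorithmic) + abs(creative)
    # Share of the absolute movement attributed to the algorithmic channel.
    algorithmic_share = abs(algorithmic) / abs_sum if abs_sum else float("nan")
    return {
        "algorithmic": algorithmic,
        "creative": creative,
        "total": total,
        "algorithmic_share": algorithmic_share,
    }


# Hypothetical female impression shares in percent, one value per arm.
result = decompose(control=50.0, metadata_only=52.0, full_treatment=52.5)
print(result)
```

With these placeholder inputs the algorithmic channel accounts for 2.0 of the 2.5 percentage points of movement.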
The platform's delivery algorithm routes each creative to the audience it predicts will engage.
Descriptive claim in paper about algorithmic delivery behavior; likely supported by platform operational details and the motivating discussion.
Online advertising platforms host hundreds of thousands of A/B tests.
Statement in paper (assertion about industry scale); no sample size or citation provided in excerpt.
The paper proposes recommendations for adapting employment policy to the conditions of AI transformation.
Policy recommendations derived from the paper's analysis of statistical data, industry reviews, and regulatory/legal documents; recommendations are proposed by the authors (not empirically validated within the paper).
In 2024-2025, the labor market of Uzbekistan is characterized by duality: there is an increasing demand for IT specialists and workers with digital skills.
Analysis of 2024–2025 labor market statistics and industry reviews cited in the paper (no numerical sample size or survey sampling reported).
The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.
Stated design/ethical objective in the paper; normative claim about intended social and governance outcomes rather than an empirically validated result.
FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead.
Design rationale/claim in the paper about interoperability and incremental adoption strategy; no empirical deployment, integration case studies, or measured overhead reductions presented.
FP treats policy, provenance, and audit as first-class concerns.
Design/architectural claim in the paper stating that policy, provenance, and audit are prioritized within FP; no empirical compliance or audit trials presented.
FP provides economic primitives for metering, receipts, and settlement.
Design claim in the paper listing economic primitives as part of FP; no deployment or economic experiments reported.
FP supports native multi-party organization and event-based collaboration.
Feature/architecture claim in the paper describing native support for multi-party organization and event-driven collaboration; no empirical evaluation or user studies provided.
FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations.
Design specification/feature claim in the paper describing FP's data and entity model; no empirical interoperability study reported.
This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society.
Claim of authorship/introduction in the paper; architectural/design proposal rather than an evaluated system.
Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight.
Normative/requirements statement in the paper describing necessary capabilities for scaled multi-agent systems; no empirical validation or experimental data provided.
Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another.
Statement in the paper's introductory/abstract text presenting an observed trend; conceptual/qualitative claim without empirical data or measured sample.
Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions.
Citation to prior literature reported in the paper (background literature review claiming general findings about perceptions of AI narrative explanations).
Narrative explanations increased reliance on the AI, both when the AI prediction was correct and when it was incorrect.
Findings from the paper's human behavioral experiment reporting increased reliance on AI with accompanying narratives under both correct and incorrect AI prediction conditions.
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare.
Statement grounded in observation of recent literature trends and the cited body of work on LLM agents applied to coding, research, and healthcare domains.
These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
Qualitative findings from the three case analyses demonstrating how different design choices limit or enable particular work claims and exposing gaps between task, setting, and scored product.
APEX-SWE [is] a software-engineering benchmark with executable scored products.
Description of the APEX-SWE benchmark in the paper's case analysis.
OfficeQA Pro [is] a grounded document-analysis benchmark scored by final answers.
Description of the OfficeQA Pro benchmark in the paper's case analysis.
GDPval [is] a non-code occupational deliverable benchmark.
Description of the GDPval benchmark in the paper's case analysis.
We demonstrate the approach through three benchmark case analyses: GDPval, OfficeQA Pro, and APEX-SWE.
Empirical/methodological demonstration reported in paper via three case analyses of existing benchmarks; the paper applies its three-step approach to each case.
To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database.
Method described in paper: mapping/derivation from the O*NET occupational task database to produce an inventory of 18 work activities.
We translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system.
Paper provides prescriptive guidance derived from conceptual analysis and the reviewed literature; guidance illustrated via application to case benchmarks.
We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows.
Literature review of work studies cited in the paper; synthesis of organizational features of knowledge work.
This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product.
Methodological contribution described in paper; approach presented and motivated, and later applied in case analyses (three benchmark case studies).
European AI companies increasingly face differing regulatory expectations across global markets, and European institutions should provide structured support (advisory mechanisms, regulatory guidance, dialogue with partner jurisdictions) to help companies navigate emerging compliance requirements abroad.
Combined descriptive claim and policy recommendation; the text asserts increasing regulatory asymmetry faced by firms but provides no empirical data or firm-level survey evidence.
Systematic monitoring of global regulatory developments (for example through foresight functions within the European Commission or the AI Office) would help anticipate regulatory divergence and support future adjustments to European governance frameworks.
Policy recommendation advocating institutional monitoring mechanisms; argumentative justification rather than empirical demonstration in the text.
European regulators should monitor whether conversational systems begin to assume intermediary or gatekeeping roles within digital ecosystems and consider how existing platform governance frameworks might apply.
Policy recommendation advocating monitoring and potential regulatory application; no empirical study in text demonstrating current gatekeeping behavior.
Risk assessments and auditing standards should explicitly examine interaction design, including engagement optimisation mechanisms, recommendation loops, and other features that may encourage behavioural influence or dependency.
Normative recommendation arguing current frameworks focus mainly on outputs; no empirical evaluation or sample reported.