Evidence (6574 claims)
| Topic | Claims |
|---|---|
| Adoption | 8625 |
| Productivity | 7686 |
| Governance | 6917 |
| Human-AI Collaboration | 6574 |
| Org Design | 4189 |
| Innovation | 4131 |
| Labor Markets | 3588 |
| Skills & Training | 2985 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
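The matrix can be read as per-outcome shares rather than raw counts. The sketch below is illustrative only: it copies four rows from the table and computes each direction's share of the listed claims. The function name `direction_shares` and the choice of denominator are assumptions, not something the page defines. In some rows the Total exceeds the sum of the four direction columns (for example, Firm Productivity lists 604 claims against a total of 610); the page does not explain the gap, so the sketch reports it separately as `unlisted`.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (439, 57, 88, 20, 610),
    "Job Displacement": (12, 81, 21, 1, 115),
    "Inequality Measures": (44, 123, 50, 6, 223),
    "Task Completion Time": (173, 31, 8, 12, 225),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the listed claims, plus the unlisted remainder.

    Assumption: shares are taken over the four listed columns, not over Total.
    """
    listed = positive + negative + mixed + null
    return {
        "positive": positive / listed,
        "negative": negative / listed,
        "mixed": mixed / listed,
        "null": null / listed,
        # Claims counted in Total but not in any of the four direction columns.
        "unlisted": total - listed,
    }


for outcome, counts in ROWS.items():
    s = direction_shares(*counts)
    print(
        f"{outcome}: {s['positive']:.1%} positive, "
        f"{s['negative']:.1%} negative, {s['unlisted']} unlisted"
    )
```

Dividing by the listed claims keeps the four shares summing to one; dividing by Total instead would be an equally defensible choice and would change the percentages slightly in rows with a nonzero remainder.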
Human-AI Collaboration
The platform's delivery algorithm routes each creative to the audience it predicts will engage.
Descriptive claim in paper about algorithmic delivery behavior; likely supported by platform operational details and the motivating discussion.
Online advertising platforms host hundreds of thousands of A/B tests.
Statement in paper (assertion about industry scale); no sample size or citation provided in excerpt.
The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.
Stated design/ethical objective in the paper; normative claim about intended social and governance outcomes rather than an empirically validated result.
FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead.
Design rationale/claim in the paper about interoperability and incremental adoption strategy; no empirical deployment, integration case studies, or measured overhead reductions presented.
FP treats policy, provenance, and audit as first-class concerns.
Design/architectural claim in the paper stating that policy, provenance, and audit are prioritized within FP; no empirical compliance or audit trials presented.
FP provides economic primitives for metering, receipts, and settlement.
Design claim in the paper listing economic primitives as part of FP; no deployment or economic experiments reported.
FP supports native multi-party organization and event-based collaboration.
Feature/architecture claim in the paper describing native support for multi-party organization and event-driven collaboration; no empirical evaluation or user studies provided.
FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations.
Design specification/feature claim in the paper describing FP's data and entity model; no empirical interoperability study reported.
This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society.
Claim of authorship/introduction in the paper; architectural/design proposal rather than an evaluated system.
Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight.
Normative/requirements statement in the paper describing necessary capabilities for scaled multi-agent systems; no empirical validation or experimental data provided.
Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another.
Statement in the paper's introductory/abstract text presenting an observed trend; conceptual/qualitative claim without empirical data or measured sample.
Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions.
Citation to prior literature reported in the paper (background literature review claiming general findings about perceptions of AI narrative explanations).
Narrative explanations increased reliance on the AI, both when the AI prediction was correct and when it was incorrect.
Findings from the paper's human behavioral experiment reporting increased reliance on AI with accompanying narratives under both correct and incorrect AI prediction conditions.
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare.
Statement grounded in observation of recent literature trends and the cited body of work on LLM agents applied to coding, research, and healthcare domains.
These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.
Qualitative findings from the three case analyses demonstrating how different design choices limit or enable particular work claims and exposing gaps between task, setting, and scored product.
APEX-SWE [is] a software-engineering benchmark with executable scored products.
Description of the APEX-SWE benchmark in the paper's case analysis.
OfficeQA Pro [is] a grounded document-analysis benchmark scored by final answers.
Description of the OfficeQA Pro benchmark in the paper's case analysis.
GDPval [is] a non-code occupational deliverable benchmark.
Description of the GDPval benchmark in the paper's case analysis.
We demonstrate the approach through three benchmark case analyses: GDPval, OfficeQA Pro, and APEX-SWE.
Empirical/methodological demonstration reported in paper via three case analyses of existing benchmarks; the paper applies its three-step approach to each case.
To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database.
Method described in paper: mapping/derivation from the O*NET occupational task database to produce an inventory of 18 work activities.
We translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system.
Paper provides prescriptive guidance derived from conceptual analysis and the reviewed literature; guidance illustrated via application to case benchmarks.
We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows.
Literature review of work studies cited in the paper; synthesis of organizational features of knowledge work.
This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product.
Methodological contribution described in paper; approach presented and motivated, and later applied in case analyses (three benchmark case studies).
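The three steps named in the claim above (work activity, tested setting, scored product) can be pictured as a small reporting record. The following is a minimal sketch, not the paper's schema: the class names `SettingSpec` and `BenchmarkReport`, the field names, and the `unspecified` check are all hypothetical. The four setting fields follow the elements the guidance lists (materials, tools, roles, constraints), and `ExampleBench` is a made-up benchmark, not one of the three analysed cases.

```python
from dataclasses import dataclass, field


@dataclass
class SettingSpec:
    """Hypothetical description of the tested setting."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """Hypothetical record tying a benchmark score to the work claim it supports."""
    benchmark: str
    work_activity: str = ""          # step 1: the work activity under evaluation
    setting: SettingSpec = field(default_factory=SettingSpec)  # step 2
    scored_product: str = ""         # step 3: the work product that gets scored

    def unspecified(self) -> list[str]:
        """List the elements the report leaves blank."""
        gaps = []
        if not self.work_activity:
            gaps.append("work_activity")
        for name in ("materials", "tools", "roles", "constraints"):
            if not getattr(self.setting, name):
                gaps.append(f"setting.{name}")
        if not self.scored_product:
            gaps.append("scored_product")
        return gaps


# Made-up example: only the work activity and the tools are filled in.
report = BenchmarkReport(
    benchmark="ExampleBench",
    work_activity="draft a summary document",
    setting=SettingSpec(tools=["spreadsheet"]),
)
print(report.unspecified())
```

A blank field here stands for a gap between the benchmarked task and the broader work claim; the record only makes such gaps visible and does not judge whether a given claim is justified.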
European AI companies increasingly face differing regulatory expectations across global markets, and European institutions should provide structured support (advisory mechanisms, regulatory guidance, dialogue with partner jurisdictions) to help companies navigate emerging compliance requirements abroad.
Combined descriptive claim and policy recommendation; the text asserts increasing regulatory asymmetry faced by firms but provides no empirical data or firm-level survey evidence.
Systematic monitoring of global regulatory developments (for example through foresight functions within the European Commission or the AI Office) would help anticipate regulatory divergence and support future adjustments to European governance frameworks.
Policy recommendation advocating institutional monitoring mechanisms; argumentative justification rather than empirical demonstration in the text.
European regulators should monitor whether conversational systems begin to assume intermediary or gatekeeping roles within digital ecosystems and consider how existing platform governance frameworks might apply.
Policy recommendation advocating monitoring and potential regulatory application; no empirical study in text demonstrating current gatekeeping behavior.
Risk assessments and auditing standards should explicitly examine interaction design, including engagement optimisation mechanisms, recommendation loops, and other features that may encourage behavioural influence or dependency.
Normative recommendation arguing current frameworks focus mainly on outputs; no empirical evaluation or sample reported.
European institutions (in particular the European AI Office) should issue guidance on how systems designed for sustained social or emotional interaction should be assessed in the implementation of the AI Act.
Policy recommendation contained in the text; prescriptive argument rather than an empirical finding; no supporting data or empirical evaluation provided.
Existing regulatory frameworks will need to consider risks that arise not only from system outputs but also from longer-term patterns of human–AI interaction.
Normative recommendation based on the document's argument that conversational AI generates risks through sustained interaction; no empirical method or data reported.
The paper proposes five evaluation dimensions for AutoResearch systems: novelty, validity, impact, reliability, and provenance.
Paper explicitly proposes these five dimensions as an evaluation rubric; conceptual proposal.
The field can be organized around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication.
Authors propose this five-condition organizational framework as part of their survey and synthesis; conceptual contribution.
Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution within AutoResearch.
Paper-introduced terminology and conceptual delineation of a sub-region of the AutoResearch spectrum; definitional statement.
AutoResearch is defined as the developmental spectrum of AI-powered scientific workflow automation.
Paper provides an explicit definitional framing (terminology introduced by authors); conceptual contribution rather than empirical finding.
This shift marks a transition from task-level AI for science to workflow-level research automation.
Conceptual argument backed by literature survey and examples of systems that coordinate multiple research tasks; no single quantitative study reported.
Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision.
Survey / conceptual synthesis of recent AI research systems and literature; paper presents this as an observed trend rather than reporting original empirical measurements.
The study advances multilevel propositions and outlines a research agenda for examining legitimacy in hybrid human–AI decision systems.
Paper presents multilevel theoretical propositions and a suggested agenda for future empirical research (conceptual contribution; no empirical validation reported).
Human judgment remains essential for contextual interpretation and accountability in hybrid human–AI decision systems.
Conceptual claim advanced through theoretical argumentation and literature references in the paper (no empirical sample reported).
Legitimacy of AI-enabled decisions depends on transparency, explainability, and perceived fairness.
Conceptual argument and literature synthesis in the paper emphasizing transparency, explainability, and fairness as determinants (no empirical sample reported).
AI enhances efficiency and consistency in organizational decision-making.
Theoretical claim supported by referenced literature and conceptual argumentation within the paper (no empirical test or sample reported).
Procedural, distributive, and cognitive legitimacy are key dimensions of decision legitimacy in AI-enabled organizations.
Conceptual development in the paper drawing on institutional theory, socio-technical systems, and behavioral decision-making; literature synthesis and theoretical argumentation (no empirical sample reported).
Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.
Argument supported by observed cases in the experiments where models with similar overall ranks differed on capability axes and jaggedness, implying additional diagnostic value.
Newer frontier-tier models score higher on average.
Aggregate results from the head-to-head tournament comparing nine models across sampled games (>36k matches).
We introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games.
Methodological contribution described in paper (jaggedness metric).
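The excerpt does not give the formula for the jaggedness measure, so the following is only an illustrative proxy under stated assumptions: each game has a scalar advantage for the model (for instance a win-rate margin), pairs of strategically similar games are supplied by the caller, and jaggedness is taken as the mean absolute change in advantage across those pairs. The function name, the inputs, and the numbers are hypothetical.

```python
def jaggedness(advantage, similar_pairs):
    """Mean absolute difference in advantage across pairs of similar games.

    advantage: dict mapping game id -> the model's advantage on that game.
    similar_pairs: list of (game_a, game_b) judged strategically similar.
    This is an assumed proxy, not the measure defined in the paper.
    """
    if not similar_pairs:
        raise ValueError("need at least one pair of similar games")
    total = sum(abs(advantage[a] - advantage[b]) for a, b in similar_pairs)
    return total / len(similar_pairs)


pairs = [("g1", "g2"), ("g2", "g3")]

# Made-up numbers: one model whose advantage is stable across similar games,
# and one whose advantage swings between them.
smooth = {"g1": 0.10, "g2": 0.12, "g3": 0.11}
jagged = {"g1": 0.40, "g2": -0.30, "g3": 0.35}

print(jaggedness(smooth, pairs))
print(jaggedness(jagged, pairs))
```

Under this proxy, two models with similar average advantage can still differ sharply in jaggedness, which is the kind of contrast the claim describes.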
We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness).
Methodological description in paper introducing the capability-profile decomposition.
The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination.
Method claim about generator capability described in the paper.
We introduce GENSTRAT, which uses procedurally generated strategic environments to address the limitations of fixed benchmarks.
Methodological contribution described in paper: design and implementation of GENSTRAT.
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings.
Introductory statement in the paper situating motivation; no empirical data reported in the abstract to quantify the increase.
We propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.
Authors' prescriptive recommendations derived from interview insights; recommendations are not empirically validated in the study.
We propose steps that AI companies can take to make the invisible work more visible.
Authors' normative recommendations based on synthesis of the qualitative interview findings; not empirically tested within the paper.
Some of these changes are positive, such as smoother collaboration between peers.
Interviewee accounts from the 24-participant qualitative study reporting perceived improvements in peer collaboration due to AI tools.