Evidence (6491 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human-AI Collaboration
In the short run, with fixed human capital, wages, and job boundaries, AI raises productivity by reducing the time required to perform steps.
Model distinction between short-run (fixed job design and skills) and long-run horizons; short-run optimization shows AI reduces expected execution times for steps, thereby raising productivity.
Aggregating heterogeneous firms that deploy a commonly available AI technology yields an aggregate production function that admits a constant elasticity of substitution (CES) representation with three inputs: aggregate manual labor, aggregate AI-assisted labor, and aggregate capital.
Theoretical aggregation argument drawing on Houthakker (1955) and Levhari (1968), deriving a macro-level CES representation from a microfounded algorithmic cost function defined by firms' joint optimization over AI deployment and job design.
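For orientation, a generic three-input CES form consistent with this description can be written as follows. This is only a sketch: the symbols ($A$, the share weights $\alpha_M, \alpha_A, \alpha_K$, and the substitution parameter $\rho$) are assumptions introduced here, and the paper's own derivation may use a nested form or a different normalization.

```latex
% Y: aggregate output; L_M: aggregate manual labor;
% L_A: aggregate AI-assisted labor; K: aggregate capital.
% A, alpha_M, alpha_A, alpha_K, rho are illustrative symbols (assumed).
Y = A \left[ \alpha_M L_M^{\rho} + \alpha_A L_A^{\rho} + \alpha_K K^{\rho} \right]^{1/\rho},
\qquad \alpha_M + \alpha_A + \alpha_K = 1, \quad \rho < 1, \ \rho \neq 0,
\qquad \sigma = \frac{1}{1 - \rho}
```

In this standard form, $\sigma$ is the common elasticity of substitution between any pair of inputs; the limit $\rho \to 0$ gives the Cobb-Douglas case.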
Improvements in AI quality generate non-linear effects on labor demand and wages because firms' cost-minimizing AI deployment and job designs change discretely at particular AI quality thresholds (microfoundation for the productivity J-curve).
Theoretical analysis of discrete switches in the cost-minimizing arrangement as AI success probability and execution times change; characterization of threshold effects and discussion linking to the J-curve phenomenon (model results and comparative statics).
Adjacency to AI-executed steps increases the likelihood that a given step is executed by AI (local complementarities): a step is more likely to be AI-executed in occupations where its neighboring steps are also AI-executed.
Empirical comparisons of conceptually similar steps across occupations paired with workflow adjacency information and realized AI execution outcomes from Anthropic’s Economic Index; statistical tests reported in the paper.
AI-executed steps co-occur in contiguous chains rather than being randomly scattered across a production workflow.
Empirical analysis linking O*NET tasks to human assessments of AI exposure (Eloundou et al., 2024), realized AI execution outcomes from Anthropic’s Economic Index (Handa et al., 2025), and GPT-generated workflow orderings for occupations; statistical tests comparing observed contiguity to random/scaled baselines reported in the paper.
Platforms should implement AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.
Policy/recommendation derived from the paper's empirical findings on consumption preferences, producer behaviors, and the moderating role of distribution algorithms.
AIGC creators achieve aggregate engagement comparable to HGC creators by producing content at high volume (a 'scale-over-preference' dynamic).
Analysis of creation and engagement patterns in the dataset showing that AIGC creators compensate for lower per-item engagement by higher production volume, yielding comparable aggregate engagement levels to HGC creators.
Consumers show a marked preference for Human-Generated Content (HGC) over Artificial Intelligence-Generated Content (AIGC).
Comparative analysis of consumption behavior in the longitudinal dataset; the paper reports consumption metrics that indicate higher consumer preference for HGC versus AIGC (e.g., relative engagement per item).
AI facilitates access to distant knowledge domains.
Theoretical model (Schumpeterian quality-ladder recombinant-innovation framework). The paper models R&D as recombining ideas across a knowledge space and shows analytically that AI increases firms' ability to combine ideas across longer distances.
A statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage.
Application of conformal prediction to the LLM interval outputs in the experiment, resulting in expanded intervals that attain the target coverage.
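How such a recalibration can widen intervals is sketched below using split conformal prediction on interval outputs. This is a minimal illustration under stated assumptions, not the paper's procedure: the function names `conformal_margin` and `recalibrate`, the symmetric widening, and the calibration numbers are all hypothetical.

```python
import math

def conformal_margin(cal_lo, cal_hi, cal_true, alpha=0.1):
    """Margin to add to each side of an interval for roughly (1 - alpha) coverage.

    cal_lo / cal_hi: interval bounds produced by the model on a held-out
    calibration set; cal_true: the corresponding known true values.
    """
    # Nonconformity score: how far the truth falls outside the interval
    # (negative when the truth lies strictly inside it).
    scores = sorted(
        max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_true)
    )
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return float("inf")
    return scores[k - 1]

def recalibrate(lo, hi, margin):
    """Widen (or, for a negative margin, shrink) an interval symmetrically."""
    return lo - margin, hi + margin

# Hypothetical calibration data: model intervals vs. known true values.
cal_lo = [40, 10, 55, 20, 30, 5, 60, 25, 15]
cal_hi = [50, 20, 65, 30, 40, 15, 70, 35, 25]
cal_true = [52, 15, 70, 18, 35, 30, 61, 36, 14]

margin = conformal_margin(cal_lo, cal_hi, cal_true, alpha=0.2)
new_interval = recalibrate(40, 50, margin)
```

When the original intervals are overconfident, many calibration truths fall outside them, the scores are positive, and the resulting margin expands every new interval.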
Larger, more capable models produce more accurate estimates.
Empirical experiment asking eleven LLMs to estimate population statistics (health prevalence rates, personality trait distributions, labor market figures) and comparing accuracy across models of different capability.
The paper proposes five architectural requirements for genuine human oversight systems.
Stated methodological/prescriptive contribution of the paper (a proposal rather than an empirical finding); no sample size or empirical validation reported in the provided excerpt.
The proposed framework outlines a pathway toward large-scale cooperative intelligence and offers a constructive perspective on the coevolution of human and artificial agents in the informational ecosystems of the future.
Claim about the paper's contribution; based on conceptual synthesis and theoretical framing rather than empirical validation.
A voluntary ecosystem of free rational agents, human and artificial, who cooperate through transparent and fair exchange of information maximizes their adaptive capacity and long-term well-being.
Normative proposition in the paper derived from theoretical principles (information theory, collective intelligence); presented as a proposed ideal rather than an empirically tested policy.
Emerging opportunities exist for stabilizing these ecosystems through new forms of informational verification and monitoring made possible by advanced artificial agents.
Forward-looking claim grounded in conceptual analysis of capabilities of advanced agents; proposed as an opportunity in the paper rather than demonstrated empirically.
Systems that preserve diversity of exploration while minimizing barriers to information exchange exhibit superior capacity for discovery and adaptation in complex environments.
Theoretical claim supported by the paper's appeal to principles from information theory, adaptive systems, and collective intelligence; presented as an argument rather than as an empirically validated result.
Increasing the strictness of algorithmic control paradoxically increases the evolutionary fitness of coordinated resistance (e.g., coordinated log-offs).
Results from the EGT model and simulations showing fitness/payoff changes for coordinated resistance strategies as platform surveillance strictness parameter increases; model-only (no empirical N reported).
The future of transformative transformer-based AI is fundamentally many, not one.
Concluding synthesis and normative prediction based on the paper's theoretical arguments and literature synthesis; no empirical data or quantified projection provided in the excerpt.
Developing diverse AI teams addresses critics' concerns that current models are constrained by past data and lack the creative insight required for innovation.
Argumentative claim drawing on conceptual critique of current models and the proposed remedy of diverse AI teams; supported by referenced disciplinary literatures but no empirical validation provided in the excerpt.
Having a diverse team broadens the search for solutions, delays premature consensus, and allows for the pursuit of unconventional approaches.
Theoretical/argumentative claim referencing literature in complex systems and organizational behavior as support; no quantitative evidence or sample reported in the excerpt.
Deep intellectual breakthroughs should be expected to come from epistemically diverse groups of AI agents working together rather than singular superintelligent agents.
Predictive/theoretical claim motivated by referenced research and formal results in complex systems, organizational behavior, and philosophy of science; no empirical experiment or sample size given in the excerpt.
We should abandon the individual approach if we're hoping for AI to support groundbreaking innovation and scientific discovery.
Normative prescription based on theoretical argument and synthesis of literature from complex systems, organizational behavior, and philosophy of science; no empirical trial or quantified evaluation reported in the excerpt.
With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.
Forward-looking claim by the authors extrapolating from current prototype results and potential improvements; no empirical evidence provided that it already exceeds traditional methods.
ARQuest shows great potential to improve user satisfaction and streamline insurance processes.
Interpretation based on experimental findings (fewer questions, user preference) and the proposed framework; forward-looking claim rather than a fully established empirical result.
Adaptive versions were preferred by users for their more fluid and engaging experience.
User preference reported from the experiments (qualitative/user feedback or preference metric); specific measures and sample size not provided in the excerpt.
Adaptive versions powered by GPT models required fewer questions.
Experimental result reported in paper comparing question counts between adaptive GPT-powered questionnaires and traditional questionnaires; no numeric counts or sample sizes provided in the excerpt.
Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions.
Described methods/techniques used within the ARQuest system implementation in the paper.
The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires.
Methodological contribution described in the paper (framework design); description of components and intended function rather than a quantified outcome.
Only interventions that reshape risk allocation can plausibly shift stable system-level behaviour.
Argument based on the paper's game-theoretic reasoning and stylised example (theoretical claim; no empirical testing reported in the abstract).
Artificial intelligence (AI) is widely promoted as a promising technological response to healthcare capacity and productivity pressures.
Author assertion in the paper's introduction/abstract, based on literature/policy discourse (no empirical sample or quantitative analysis reported in the abstract).
We open-source the complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts.
Paper statement committing to open-sourcing the benchmark components and artifacts.
We evaluated leading agent frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B).
Paper reports extensive evaluations using the listed agent frameworks and LLM models paired together to run the benchmark scenarios.
Execution-based evaluators were implemented with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments.
Paper statement describing the evaluation methodology and the specific metrics used for regression, classification and health-assessment tasks.
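The metrics named above can be sketched in plain Python as follows. This is an illustrative sketch rather than the benchmark's evaluation scripts: the function names are hypothetical, the F1 variant is assumed to be binary (the excerpt does not say whether it is binary, macro, or micro averaged), and categorical matching is assumed to be a case-insensitive exact match.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, e.g. for Remaining Useful Life regression."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def f1_binary(y_true, y_pred, positive=1):
    """F1-score for one positive class (assumed binary variant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def categorical_match(expected, actual):
    """Exact match on a category label, ignoring case and outer whitespace."""
    return expected.strip().lower() == actual.strip().lower()
```

For example, `mae([100, 80, 60], [90, 85, 60])` averages absolute errors of 10, 5, and 0, while `f1_binary` balances precision and recall for a chosen fault class.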
We construct 65 specialized tools across two MCP servers to enable interactions for the benchmark.
Paper statement reporting the number of specialized tools (65) and that they are deployed across two MCP servers as part of the benchmark implementation.
The benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation.
Explicit statement in paper listing the number of scenarios (75), number of asset classes (7) and enumerating the 5 task categories; benchmark construction described by authors.
PHMForge is the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers.
Paper statement introducing PHMForge as a benchmark and describing its construction to evaluate LLM agents via MCP servers; benchmark implementation is presented in the manuscript.
Design implication: adaptive AI coaching systems should align support intensity with individual readiness, rather than assuming universal effectiveness.
Authors' design recommendation derived from experimental results showing heterogeneous effects by personality profile.
The system is in production, serving 21 industry verticals with 650+ agents.
Deployment claim reported in paper (production system metrics: number of verticals and agents).
We propose a framework for output-side ontological validation (response validation, reasoning verification, compliance checking).
Proposed framework described in paper (conceptual/procedural proposal; not described as empirically validated in abstract).
We introduce ontology-constrained tool discovery via SQL-pushdown scoring.
Methodological/implementation contribution described in the paper (technical mechanism introduced).
Improvements from ontology coupling are greatest where LLM parametric knowledge is weakest—particularly in Vietnam-localized domains.
Observed pattern reported from the controlled experiment across the five industries, with stronger improvements in Vietnam-localized domains (no per-industry sample sizes reported in abstract).
Ontology-coupled agents significantly outperform ungrounded agents on Role Consistency (p < .001, W = .614).
Controlled experiment with 600 runs; statistical test reported (p-value and W statistic provided in abstract).
Ontology-coupled agents significantly outperform ungrounded agents on Regulatory Compliance (p = .003, W = .318).
Controlled experiment with 600 runs; statistical test reported (p-value and W statistic provided in abstract).
Ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460).
Controlled experiment with 600 runs; statistical test reported (p-value and W statistic provided in abstract).
We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking).
Theoretical/formalization contribution described in the paper (conceptual and methodological development).
Our approach introduces a three-layer ontological framework--Role, Domain, and Interaction ontologies--that provides formal semantic grounding for LLM-based enterprise agents.
Design contribution described in the paper (formal model specification).
We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning.
System design and implementation claim: description of architecture and its implementation in the FAOS platform (technical/design evidence reported in paper).
The analysis identifies seventeen emerging occupational categories benefiting from reinstatement effects, concentrated in human-AI collaboration, AI governance, and domain-specific AI operations roles.
Modeling/taxonomy result reported in the paper listing 17 emerging occupational categories characterized as benefiting from reinstatement effects (human-AI collaboration, governance, operations).
Our findings indicate increasing agent activity in open-source projects.
Trend analysis reported in the paper showing growth in agent-originated activity within the assembled dataset of PRs and associated metadata.
Effective collaboration with AI for software engineering (SE) tasks may benefit from functional design rather than replicating human SEI traits, thereby redefining collaboration as functional alignment.
Authors' conclusion and recommendation derived from qualitative interview evidence (10 practitioners) and the proposed concept of functional equivalents.