Evidence (6491 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
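The Total column does not always equal the sum of the four direction columns (for Firm Productivity the four columns sum to 600 against a listed total of 606), so any share computed from the matrix depends on the denominator chosen. The following is a minimal Python sketch of that arithmetic on three rows copied from the table. The function names are hypothetical, and the choice to compute shares over the four classified columns rather than the listed total is an assumption, not something the page states.

```python
# Three rows copied from the matrix above:
# outcome -> (positive, negative, mixed, null, listed_total)
ROWS = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 92, 10, 2, 173),
}


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the classified claims in one row.

    Assumption: the denominator is the sum of the four direction columns,
    not the listed Total.
    """
    classified = positive + negative + mixed + null
    counts = {"positive": positive, "negative": negative,
              "mixed": mixed, "null": null}
    return {name: count / classified for name, count in counts.items()}


def unclassified(positive, negative, mixed, null, listed_total):
    """Claims counted in Total but in none of the four direction columns."""
    return listed_total - (positive + negative + mixed + null)


for outcome, (pos, neg, mix, nul, total) in ROWS.items():
    shares = direction_shares(pos, neg, mix, nul)
    gap = unclassified(pos, neg, mix, nul, total)
    print(f"{outcome}: {shares['positive']:.1%} positive, "
          f"{shares['negative']:.1%} negative, "
          f"{gap} claims outside the four columns")
```

Reporting the gap alongside the shares makes it visible when a row's total includes claims with no recorded direction.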
Topic filter: Human-AI Collaboration
This reframes the question from whether the model can think to whether the human-AI system can reason.
Conceptual reframing stated in the paper; no empirical evidence required as it is a change of perspective.
We introduce 'The Architect's Pen' as a practical method where the human uses the model as an external medium for structured reflection by embedding phases of articulation, critique, and revision into human-AI interaction.
Method description / practical proposal included in the paper; no experimental evaluation, user study, or quantitative validation reported.
This perspective emphasizes collaborative intelligence, combining human judgment and contextual understanding with machine speed, memory, and associative capacity.
Theoretical claim about complementary strengths of humans and models within the proposed framework; presented without empirical tests.
Building on recent work on 'System-2' learning, reflective reasoning can be relocated to the interaction layer and framed as a cognitive protocol that can be structured, measured, and governed using existing systems.
Conceptual extension of prior literature ('System-2' learning) into an interaction-layer protocol; no empirical protocol testing or measurement evidence provided.
Reasoning should be treated as a relational process distributed between human and model rather than an internal capability of either.
Methodological proposal / theoretical framing presented by the authors; no empirical validation reported.
Large language models have advanced rapidly, from pattern recognition to emerging forms of reasoning.
Stated as an observational claim in the paper's introduction; no empirical evaluation or dataset provided.
This approach aligns with emerging compliance expectations, including the EU AI Act and ISO/IEC 42001, by making reasoning processes traceable under real conditions of use.
Claim of regulatory alignment made by the authors; presented as an interpretive argument about legal and standards requirements, not supported by empirical analysis or legal review data in this excerpt.
Stabilising interaction makes uncertainty and drift visible before enforcement is applied, enabling more precise capability governance.
Normative/operational claim in the paper about the anticipated effect of the proposed interventions; no empirical test or measurement reported in this excerpt.
Together, these layers form a missing operational substrate for governance by increasing signal-to-noise at the point of use.
Argumentative claim from the paper proposing that the combined interventions improve the information available at the decision point; no empirical validation or sample size provided here.
This paper is the first in a five-paper research series on stabilising human-AI reasoning that proposes a two-layer approach: Parts II–IV introduce human-side mechanisms (uncertainty cues, conflict surfacing, auditable reasoning traces) and Part V develops a model-side Epistemic Control Loop (ECL) that detects instability and modulates generation.
Descriptive claim about the structure and scope of the paper series as stated by the authors; internal to the publication (no external dataset).
Large language models are increasingly integrated into decision-making in areas such as healthcare, law, finance, engineering, and government.
Statement in the paper describing an observed adoption trend; no empirical dataset, sample size, or quantitative analysis is reported in the text.
For settings with multiple interventions, a tractable approximation that prioritizes interventions based on the magnitude of the policy-value discrepancy is effective.
Proposed algorithm/approximation in the paper (methodological contribution); evaluated empirically in simulations and experiments described in the paper.
In the single-intervention regime, the optimal strategy is to recommend the action that maximizes the human value function.
Theoretical result derived in the paper within a Markov decision process model for single-intervention settings.
Policy-value inconsistencies naturally identify opportunities for intervention.
Analytical/formal argument within a Markov decision process framework showing that when human policy-value consistency fails, discrepancies indicate intervention opportunities.
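The three claims above can be illustrated with a toy tabular example. This is a sketch under stated assumptions, not the paper's algorithm: the excerpt does not define the discrepancy measure, so the gap used here (best action value minus the value of the action the human actually takes, both under the human's own value function) is an assumption, as are the names `value_gaps` and `plan_interventions` and the toy states and values.

```python
from typing import Dict, List, Tuple

State = str
Action = str


def value_gaps(
    q_human: Dict[State, Dict[Action, float]],
    policy_human: Dict[State, Action],
) -> List[Tuple[State, Action, float]]:
    """Compare the human's chosen action with the action that their own
    value function ranks highest, state by state.

    Returns (state, recommended_action, gap) sorted by gap, largest first.
    The gap definition is an illustrative assumption.
    """
    gaps = []
    for state, action_values in q_human.items():
        best_action = max(action_values, key=action_values.get)
        gap = action_values[best_action] - action_values[policy_human[state]]
        gaps.append((state, best_action, gap))
    return sorted(gaps, key=lambda item: item[2], reverse=True)


def plan_interventions(q_human, policy_human, budget):
    """Keep the `budget` states with the largest strictly positive gaps."""
    ranked = value_gaps(q_human, policy_human)
    return [(s, a) for s, a, gap in ranked if gap > 0][:budget]


# Hypothetical human value function and policy.
q = {
    "triage": {"escalate": 1.0, "wait": 0.2},
    "review": {"approve": 0.6, "reject": 0.5},
    "close": {"archive": 0.9, "reopen": 0.1},
}
pi = {"triage": "wait", "review": "reject", "close": "archive"}

print(plan_interventions(q, pi, budget=1))  # single-intervention case
print(plan_interventions(q, pi, budget=2))  # greedy multi-intervention case
```

With a budget of one, the sketch recommends the value-maximizing action in the state where the human's policy and values disagree most; larger budgets simply take more states from the same ranking, which is the greedy approximation the first claim describes in outline.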
These cooperation mechanisms become more effective under evolutionary pressures to maximize individual payoffs.
Authors report results from experiments or simulations applying evolutionary-pressure dynamics (selection for payoff-maximizing agents) and observing increased effectiveness of mechanisms; no numeric results or sample sizes in excerpt.
Contracting and mediation are the most effective mechanisms for achieving cooperative outcomes between capable LLMs.
Empirical results from the authors' experiments across four social dilemmas comparing mechanism performance; specifics (which models, quantitative cooperation rates) are not included in the excerpt.
Continuous learning and diversity of ideas are essential if AI is to play a meaningful role in original scientific discovery.
Normative/conditional claim supported by conceptual reasoning in the article; no empirical evidence or measured sample provided.
AI is likely to fundamentally reshape scientific publication.
Author's argument and discussion of implications for publishing and evaluation; no reported empirical study.
There is a gradual path from AI as a research tool to AI as a scientific collaborator.
Narrative/theoretical progression outlined in the article; conceptual roadmap rather than empirical demonstration.
AI for Science is especially important because it may transform not only the efficiency of research, but also the structure of scientific collaboration, discovery, publishing, and evaluation.
Argumentative/theoretical analysis in the article; forward-looking claim without reported empirical data or experimental sample.
The most important significance of the AI revolution, especially the rise of large language models, lies not simply in automation, but in a fundamental change in how complex information and human know-how are carried, replicated, and shared.
Conceptual argument presented in the article (theoretical/essayistic reasoning); no empirical sample or quantitative study reported.
The paper proposes a conceptual framework of the underlying mechanisms of the LLM fallacy and a typology of its manifestations across computational, linguistic, analytical, and creative domains.
Author(s) contribution described in the paper (framework and typology); no empirical testing reported in the abstract.
The rapid integration of large language models (LLMs) into everyday workflows has transformed how individuals perform cognitive tasks such as writing, programming, analysis, and multilingual communication.
Author(s) assertion based on literature review and conceptual overview; no empirical sample or experiment reported in the abstract.
A hybrid AI-human sprint planning framework should assign algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution.
Theoretical framework proposed by the authors, motivated by the experimental findings (trade-offs observed between efficiency and risk capture/rework) and qualitative analysis.
Human-only planning excels at adaptability.
Controlled experiment comparing human-only, AI-only, and hybrid models; qualitative indicators of planning robustness and adaptability showed superior adaptability for human-only planning.
AI-only planning minimizes time and cost.
Controlled, three-condition experiment (AI-only, human-only, hybrid) conducted on a live client deliverable at a mid-sized digital agency; quantitative metrics included time and cost measures (reported alongside estimation accuracy, rework rates, and scope change recovery time).
The bounded-autonomy architecture is a practical, deployed approach for making imperfect language models operationally useful in enterprise systems.
Deployment and reported performance in the described multi-tenant enterprise application evaluation (completion rates, safety interceptions, speedups); the paper synthesizes these empirical results to support the practical claim.
The enterprise application remains the source of truth for business logic and authorization, while the orchestration engine operates over an explicit published actions manifest.
Architectural proposal and implementation details described in the paper; asserted as part of the bounded-autonomy design deployed in the enterprise application.
Several safety properties are structurally enforced by code, and these checks intercepted all targeted violations regardless of model output.
Design and deployment of bounded-autonomy architecture with typed action contracts, permission-aware capability exposure, scoped context, validation before side effects, and consumer-side execution boundaries; empirical claim that these code-enforced properties intercepted targeted violations during evaluation.
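The idea of validating a model-proposed action against a published manifest before any side effect can be sketched as follows. Everything here is hypothetical: `ActionContract`, `RejectedAction`, `execute`, and the `refund_order` action are illustrative names, not the paper's API, and the checks shown (manifest membership, permission, parameter names and types) are only one plausible reading of "typed action contracts" and "permission-aware capability exposure".

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


@dataclass(frozen=True)
class ActionContract:
    """One entry in a hypothetical published actions manifest."""
    name: str
    required_permission: str
    param_types: Dict[str, type]
    handler: Callable[..., Any]  # owned by the application, not the model


class RejectedAction(Exception):
    """Raised before any side effect when a proposal fails validation."""


def execute(manifest: Dict[str, ActionContract],
            user_permissions: FrozenSet[str],
            proposal: Dict[str, Any]) -> Any:
    """Validate a model-proposed action against the manifest, then run it."""
    contract = manifest.get(proposal.get("action"))
    if contract is None:
        raise RejectedAction("action is not in the manifest")
    if contract.required_permission not in user_permissions:
        raise RejectedAction("caller lacks the required permission")
    params = proposal.get("params", {})
    if set(params) != set(contract.param_types):
        raise RejectedAction("parameters do not match the contract")
    for key, expected in contract.param_types.items():
        if not isinstance(params[key], expected):
            raise RejectedAction(f"parameter {key!r} has the wrong type")
    # Only after every check passes does the application-owned handler run.
    return contract.handler(**params)


# Hypothetical manifest with a single action.
manifest = {
    "refund_order": ActionContract(
        name="refund_order",
        required_permission="orders:refund",
        param_types={"order_id": str, "amount_cents": int},
        handler=lambda order_id, amount_cents: f"refunded {amount_cents} on {order_id}",
    )
}

proposal = {"action": "refund_order",
            "params": {"order_id": "A-17", "amount_cents": 500}}
print(execute(manifest, frozenset({"orders:refund"}), proposal))
```

Because the checks run in ordinary code before the handler is called, a malformed or unauthorized proposal is rejected no matter what the model generated, which is the sense in which such properties are structural rather than prompt-dependent.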
Both AI conditions delivered 13–18x speedup over manual operation.
Timing/performance comparison across the three experimental conditions (manual operation, unconstrained AI, full bounded autonomy) within the deployed evaluation; reported speedup range 13–18x relative to manual operation.
The bounded-autonomy system completed 23 of 25 tasks with zero unsafe executions.
Evaluation in a deployed multi-tenant enterprise application across 25 scenario trials spanning seven failure families; comparison across three conditions (manual, unconstrained AI with safety layers disabled, full bounded autonomy).
Overall, GAI provides a principled and scalable approach to integrating AI-generated information.
Summary claim in the abstract based on the combination of the theoretical properties and empirical results reported in the paper.
Across applications, GAI improves confidence interval coverage without inflating width.
Empirical claim reported across the paper's multiple application studies (the abstract states that CI coverage improves without inflating width); the quantitative analysis is presumably in the main text or appendix.
In health insurance choice, GAI cuts labeling requirements by over 90% while maintaining decision accuracy.
Reported empirical result from the paper's health insurance choice experiment; the abstract gives the >90% reduction claim but no sample size or exact metrics.
In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information.
Empirical experiment in a retail pricing application comparing multiple estimators given identical auxiliary inputs; stated as consistent outperformance in the abstract (no numerical effect sizes or sample sizes provided there).
In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%.
Reported empirical result from the paper's conjoint analysis experiment(s); exact sample size and experimental details are not stated in the abstract.
Empirically, GAI outperforms benchmarks across diverse settings.
Empirical experiments reported across multiple application settings (conjoint analysis, retail pricing, health insurance choice) comparing GAI to alternative estimators/benchmarks.
The authors establish asymptotic normality for the GAI estimator and show a 'safe default' property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive.
The paper claims formal theoretical results (asymptotic normality and efficiency comparisons) — supported by analytic derivations/proofs in the manuscript as referenced in the abstract.
GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with a flexible, nonparametric relationship between LLM-generated outputs and human labels.
The paper presents a methodological proposal (Generative Augmented Inference) and states theoretical properties (orthogonal moment construction, consistency, valid inference) — supported by formal asymptotic analysis/proofs in the paper (the abstract references establishing asymptotic normality).
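The intuition behind a "safe default" that never does worse than human-only estimation can be illustrated with a textbook control-variate mean estimator. This is explicitly not GAI's orthogonal moment construction, which the excerpt does not specify; it is a simpler, swapped-in technique, and the function name `augmented_mean` and the toy data are hypothetical. The sketch assumes the pool of AI outputs is much larger than the human-labeled sample.

```python
from statistics import mean


def _cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def augmented_mean(human_labels, ai_on_labeled, ai_on_all):
    """Control-variate estimate of a population mean.

    human_labels  : human labels on a small labeled sample
    ai_on_labeled : AI outputs for those same items
    ai_on_all     : AI outputs for the full, larger pool

    estimate = mean(human) - lam * (mean(ai on labeled) - mean(ai on all))

    lam is the regression slope of the human labels on the AI outputs.
    If the AI outputs carry no signal, lam is near zero and the estimate
    falls back to the human-only mean.
    """
    var_ai = _cov(ai_on_labeled, ai_on_labeled)
    lam = _cov(human_labels, ai_on_labeled) / var_ai if var_ai > 0 else 0.0
    correction = mean(ai_on_labeled) - mean(ai_on_all)
    return mean(human_labels) - lam * correction, lam


# Uninformative AI signal: the estimate is the human-only mean.
print(augmented_mean([1.0, 2.0, 3.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0, 9.0]))

# Perfectly predictive AI signal: the estimate shifts to the pool mean.
print(augmented_mean([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0, 5.0]))
```

The weight adapts to how predictive the auxiliary signal is, which mirrors, in a much simpler setting, the claim that gains appear only when the auxiliary information is predictive.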
This work takes a foundational step toward dignified human-AI interaction futures by balancing productivity with the preservation of human expertise.
Author-stated contribution and goal of the paper (conceptual + empirical work). Abstract claims contribution but does not present quantified validation of 'foundational' status.
AI delivers initial operational/productivity gains in high-stakes work settings.
Claimed empirical observation from the year-long study (abstract: 'Initial operational gains'). No quantitative productivity metrics reported in abstract.
The framework operationalizes 'sociotechnical immunity' via dual-purpose mechanisms that both serve institutional quality goals and build worker power to detect, contain, and recover from skill erosion while preserving human identity.
Descriptive claim about the nature of the proposed framework as stated in the abstract; no empirical performance metrics provided in abstract.
We offer a framework for dignified Human-AI interaction co-constructed with professional knowledge workers facing AI-induced skill erosion without traditional labor protections.
Paper contribution: a proposed framework described as co-constructed with knowledge workers; the abstract states the aim and intended beneficiaries but reports no empirical validation details.
Clear specifications, explicit governance, and ongoing human-AI collaboration are critical for successful scaling of regression automation.
Conclusions and recommendations derived from the case study's lessons and mixed-method evaluation.
The Copilot achieves 30-50% code reuse when generating candidate test scripts.
Quantitative result reported in the paper's evaluation (stated 30-50% code reuse in the abstract/summary).
Mixed-method evaluation shows the AI accelerates script authoring and increases throughput.
Empirical claim based on the paper's mixed-method evaluation (qualitative and quantitative data reported in the case study); specific sample sizes not provided in the summary.
Automated regression testing is essential for maintaining rapid, high-quality delivery in Agile and Scrum organizations.
Introductory/position statement in the paper; general premise motivating the case study (no specific empirical test reported).
AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers.
Empirical evaluation on MLE-Bench reported in the paper (benchmark ranking, metric = medal rate).
AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization; each sub-agent is itself an LLM-based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches.
System architecture description in the paper (methods/architecture section).
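The manager-plus-three-sub-agents structure can be sketched as a small coordination loop. This is a structural illustration only: in the described system each sub-agent is an LLM-based agent with multi-step reasoning and tool use, whereas here each is a stub function, and the fixed designer, coder, tuner order, the shared state dictionary, and all names are assumptions rather than details from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Each sub-agent is modelled as a function from shared state to an update.
SubAgent = Callable[[Dict[str, str]], Dict[str, str]]


@dataclass
class Manager:
    """Coordinates sub-agents by passing a shared state through them."""
    sub_agents: Dict[str, SubAgent]
    # Assumed order; the source only says the manager coordinates the three.
    order: List[str] = field(
        default_factory=lambda: ["designer", "coder", "tuner"])

    def run(self, task_description: str) -> Dict[str, str]:
        state = {"task": task_description}
        for role in self.order:
            # Each sub-agent reads the state so far and contributes one field.
            state.update(self.sub_agents[role](state))
        return state


# Stub sub-agents standing in for LLM-based agents.
def designer(state: Dict[str, str]) -> Dict[str, str]:
    return {"plan": f"baseline model for: {state['task']}"}


def coder(state: Dict[str, str]) -> Dict[str, str]:
    return {"code": f"train.py implementing [{state['plan']}]"}


def tuner(state: Dict[str, str]) -> Dict[str, str]:
    return {"result": f"tuned run of {state['code']}"}


manager = Manager({"designer": designer, "coder": coder, "tuner": tuner})
final_state = manager.run("classify images of leaves")
for key, value in final_state.items():
    print(f"{key}: {value}")
```

A real manager would likely loop, retry failed steps, and route work back to earlier sub-agents; the single pass here is kept only to show the division of responsibilities.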
We introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data.
Methodological contribution: system design and implementation described in the paper (introduction/methods).