Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
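
A minimal Python sketch of how the matrix can be summarised per outcome, using four rows copied from the table above. The helper names `positive_share` and `without_direction` are illustrative and not part of the source. For some rows the four direction columns sum to less than the stated Total (for example, Other: 758 + 199 + 100 + 900 = 1957 against a Total of 2007); the page does not explain the gap, so the sketch divides by the stated Total and reports the difference separately.

```python
# Rows copied from the matrix above: (positive, negative, mixed, null, total).
ROWS = {
    "Other": (758, 199, 100, 900, 2007),
    "AI Safety & Ethics": (218, 277, 65, 33, 599),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def positive_share(row):
    """Fraction of an outcome's claims that report a positive finding."""
    positive, _negative, _mixed, _null, total = row
    return positive / total


def without_direction(row):
    """Claims counted in Total but in none of the four direction columns."""
    *directions, total = row
    return total - sum(directions)


for name, row in ROWS.items():
    print(f"{name}: {positive_share(row):.1%} positive, "
          f"{without_direction(row)} without a listed direction")
```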

Filter: Human-AI Collaboration

This reframes the question from whether the model can think to whether the human-AI system can reason.

Conceptual reframing stated in the paper; no empirical evidence required as it is a change of perspective.

high positive Governing Reflective Human-AI Collaboration: A Framework for... system_level_reasoning_evaluation (human-AI system reasoning instead of model-on...

We introduce 'The Architect's Pen' as a practical method where the human uses the model as an external medium for structured reflection by embedding phases of articulation, critique, and revision into human-AI interaction.

Method description / practical proposal included in the paper; no experimental evaluation, user study, or quantitative validation reported.

high positive Governing Reflective Human-AI Collaboration: A Framework for... structured_reflection_via_interaction_protocol (articulation/critique/revision l...

This perspective emphasizes collaborative intelligence, combining human judgment and contextual understanding with machine speed, memory, and associative capacity.

Theoretical claim about complementary strengths of humans and models within the proposed framework; presented without empirical tests.

high positive Governing Reflective Human-AI Collaboration: A Framework for... collaborative_intelligence (integration of human judgment and machine capabiliti...

Building on recent work on 'System-2' learning, reflective reasoning can be relocated to the interaction layer and framed as a cognitive protocol that can be structured, measured, and governed using existing systems.

Conceptual extension of prior literature ('System-2' learning) into an interaction-layer protocol; no empirical protocol testing or measurement evidence provided.

high positive Governing Reflective Human-AI Collaboration: A Framework for... measurability_and_governability_of_reasoning (via interaction protocols)

Reasoning should be treated as a relational process distributed between human and model rather than an internal capability of either.

Methodological proposal / theoretical framing presented by the authors; no empirical validation reported.

high positive Governing Reflective Human-AI Collaboration: A Framework for... system_level_reasoning_capability (human-AI distributed reasoning)

Large language models have advanced rapidly, from pattern recognition to emerging forms of reasoning.

Stated as an observational claim in the paper's introduction; no empirical evaluation or dataset provided.

high positive Governing Reflective Human-AI Collaboration: A Framework for... model_capability (advancement from pattern recognition to emerging reasoning)

This approach aligns with emerging compliance expectations, including the EU AI Act and ISO/IEC 42001, by making reasoning processes traceable under real conditions of use.

Claim of regulatory alignment made by the authors; presented as interpretive/legal/standards-relevant argument rather than supported by empirical analysis or legal review data in this excerpt.

high positive The Missing Knowledge Layer in AI: A Framework for Stable Hu... alignment with regulatory/compliance requirements (traceability of reasoning)

Stabilising interaction makes uncertainty and drift visible before enforcement is applied, enabling more precise capability governance.

Normative/operational claim in the paper about the anticipated effect of the proposed interventions; no empirical test or measurement reported in this excerpt.

high positive The Missing Knowledge Layer in AI: A Framework for Stable Hu... visibility of uncertainty/drift and precision of capability governance

Together, these layers form a missing operational substrate for governance by increasing signal-to-noise at the point of use.

Argumentative claim from the paper proposing that the combined interventions improve the information available at the decision point; no empirical validation or sample size provided here.

high positive The Missing Knowledge Layer in AI: A Framework for Stable Hu... signal-to-noise ratio of reasoning outputs at point of use (informational qualit...

This paper is the first in a five-paper research series on stabilising human-AI reasoning that proposes a two-layer approach: Parts II–IV introduce human-side mechanisms (uncertainty cues, conflict surfacing, auditable reasoning traces) and Part V develops a model-side Epistemic Control Loop (ECL) that detects instability and modulates generation.

Descriptive claim about the structure and scope of the paper series as stated by the authors; internal to the publication (no external dataset).

high positive The Missing Knowledge Layer in AI: A Framework for Stable Hu... proposal of methodological architecture for stabilising human-AI reasoning

Large language models are increasingly integrated into decision-making in areas such as healthcare, law, finance, engineering, and government.

Statement in the paper describing an observed adoption trend; no empirical dataset, sample size, or quantitative analysis reported in the text.

high positive The Missing Knowledge Layer in AI: A Framework for Stable Hu... integration/adoption of LLMs into decision-making

For settings with multiple interventions, a tractable approximation that prioritizes interventions based on the magnitude of the policy-value discrepancy is effective.

Proposed algorithm/approximation in the paper (methodological contribution); evaluated empirically in simulations and experiments described in the paper.

high positive Improving Human Performance with Value-Aware Interventions: ... effectiveness of intervention prioritization under intervention budget constrain...

In the single-intervention regime, the optimal strategy is to recommend the action that maximizes the human value function.

Theoretical result derived in the paper within a Markov decision process model for single-intervention settings.

high positive Improving Human Performance with Value-Aware Interventions: ... optimality of single-intervention recommendation (maximizing human value functio...

Policy-value inconsistencies naturally identify opportunities for intervention.

Analytical/formal argument within a Markov decision process framework showing that when human policy-value consistency fails, discrepancies indicate intervention opportunities.

high positive Improving Human Performance with Value-Aware Interventions: ... identification of states/actions where intervention is beneficial (policy-value ...
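
The three value-aware intervention claims above can be illustrated with a toy tabular example. This is a sketch under stated assumptions, not the paper's algorithm: the action values `Q_h`, the `human_policy`, and the definition of discrepancy as "best human-valued action minus the action actually taken" are all hypothetical choices made for illustration.

```python
# Hypothetical human action values Q_h[state][action] (assumed, not from the paper).
Q_h = {
    "s0": {"left": 1.0, "right": 3.0},
    "s1": {"left": 2.0, "right": 2.5},
    "s2": {"left": 4.0, "right": 1.0},
}
# Hypothetical human policy: the action the human would take unaided.
human_policy = {"s0": "left", "s1": "left", "s2": "left"}


def recommend(state):
    """Single-intervention rule: the action maximizing the human value function."""
    return max(Q_h[state], key=Q_h[state].get)


def discrepancy(state):
    """Gap between the best human-valued action and the action actually taken."""
    return Q_h[state][recommend(state)] - Q_h[state][human_policy[state]]


def prioritize(budget):
    """States with the largest positive discrepancy, limited to the budget."""
    gaps = {state: discrepancy(state) for state in Q_h}
    candidates = [state for state in gaps if gaps[state] > 0]
    return sorted(candidates, key=gaps.get, reverse=True)[:budget]


print(recommend("s0"), prioritize(budget=1))
```

In this toy setup the human's policy already matches their own values in `s2`, so no intervention is proposed there; the budget is spent where the policy and the value function disagree most.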

These cooperation mechanisms become more effective under evolutionary pressures to maximize individual payoffs.

Authors report results from experiments or simulations applying evolutionary-pressure dynamics (selection for payoff-maximizing agents) and observing increased effectiveness of mechanisms; no numeric results or sample sizes in excerpt.

high positive CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and... mechanism effectiveness (cooperation outcomes) under evolutionary pressure

Contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models.

Empirical results from the authors' experiments across four social dilemmas comparing mechanism performance; specifics (which models, quantitative cooperation rates) are not included in the excerpt.

high positive CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and... effectiveness of mechanisms at producing cooperative outcomes

Continuous learning and diversity of ideas are essential if AI is to play a meaningful role in original scientific discovery.

Normative/conditional claim supported by conceptual reasoning in the article; no empirical evidence or measured sample provided.

high positive The Agentification of Scientific Research: A Physicist's Per... AI's effectiveness in contributing to original scientific discovery

AI is likely to fundamentally reshape scientific publication.

Author's argument and discussion of implications for publishing and evaluation; no reported empirical study.

high positive The Agentification of Scientific Research: A Physicist's Per... structure and practice of scientific publication

There is a gradual path from AI as a research tool to AI as a scientific collaborator.

Narrative/theoretical progression outlined in the article; conceptual roadmap rather than empirical demonstration.

high positive The Agentification of Scientific Research: A Physicist's Per... role of AI in research from tool to collaborator

AI for Science is especially important because it may transform not only the efficiency of research, but also the structure of scientific collaboration, discovery, publishing, and evaluation.

Argumentative/theoretical analysis in the article; forward-looking claim without reported empirical data or experimental sample.

high positive The Agentification of Scientific Research: A Physicist's Per... efficiency of research and the structure of scientific collaboration, discovery,...

The most important significance of the AI revolution, especially the rise of large language models, lies not simply in automation, but in a fundamental change in how complex information and human know-how are carried, replicated, and shared.

Conceptual argument presented in the article (theoretical/essayistic reasoning); no empirical sample or quantitative study reported.

high positive The Agentification of Scientific Research: A Physicist's Per... how complex information and human know-how are carried, replicated, and shared

The paper proposes a conceptual framework of the underlying mechanisms of the LLM fallacy and a typology of its manifestations across computational, linguistic, analytical, and creative domains.

Author(s) contribution described in the paper (framework and typology); no empirical testing reported in the abstract.

high positive The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... formal framework and typology coverage across domains

The rapid integration of large language models (LLMs) into everyday workflows has transformed how individuals perform cognitive tasks such as writing, programming, analysis, and multilingual communication.

Author(s) assertion based on literature review and conceptual overview; no empirical sample or experiment reported in the abstract.

high positive The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... how individuals perform cognitive tasks (writing, programming, analysis, multili...

A hybrid AI-human sprint planning framework should assign algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution.

Theoretical framework proposed by the authors, motivated by the experimental findings (trade-offs observed between efficiency and risk capture/rework) and qualitative analysis.

high positive Cognitive Offloading in Agile Teams: How Artificial Intellig... task allocation between AI and humans / recommended planning process

Human-only planning excels at adaptability.

Controlled experiment comparing human-only, AI-only, and hybrid models with qualitative indicators of planning robustness and adaptability showing superior adaptability for human-only planning.

high positive Cognitive Offloading in Agile Teams: How Artificial Intellig... adaptability / planning robustness

AI-only planning minimizes time and cost.

Controlled, three-condition experiment (AI-only, human-only, hybrid) conducted on a live client deliverable at a mid-sized digital agency; quantitative metrics included time and cost measures (reported alongside estimation accuracy, rework rates, and scope change recovery time).

high positive Cognitive Offloading in Agile Teams: How Artificial Intellig... time and cost

The bounded-autonomy architecture is a practical, deployed approach for making imperfect language models operationally useful in enterprise systems.

Deployment and reported performance in the described multi-tenant enterprise application evaluation (completion rates, safety interceptions, speedups); the paper synthesizes these empirical results to support the practical claim.

high positive Bounded Autonomy for Enterprise AI: Typed Action Contracts a... operational usefulness of LLMs in enterprise context

The enterprise application remains the source of truth for business logic and authorization, while the orchestration engine operates over an explicit published actions manifest.

Architectural proposal and implementation details described in the paper; asserted as part of the bounded-autonomy design deployed in the enterprise application.

high positive Bounded Autonomy for Enterprise AI: Typed Action Contracts a... system design property (source-of-truth and orchestration behavior)

Several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output.

Design and deployment of bounded-autonomy architecture with typed action contracts, permission-aware capability exposure, scoped context, validation before side effects, and consumer-side execution boundaries; empirical claim that these code-enforced properties intercepted targeted violations during evaluation.

high positive Bounded Autonomy for Enterprise AI: Typed Action Contracts a... interception of targeted violations / enforcement of safety properties
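
The idea of a published actions manifest with validation before side effects can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ActionContract`, `MANIFEST`, `Rejected`, the `refund_order` action, and the permission string are all hypothetical names. The point shown is that a model-proposed action is checked in code against the manifest, the caller's permissions, and the parameter types before any handler runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionContract:
    name: str
    required_permission: str
    param_types: dict  # parameter name -> expected Python type


# Hypothetical manifest; in the described design the application publishes it.
MANIFEST = {
    "refund_order": ActionContract(
        "refund_order", "orders:refund", {"order_id": str, "amount_cents": int}
    ),
}


class Rejected(Exception):
    """Raised when a proposed action fails validation."""


def validate(proposal, user_permissions):
    contract = MANIFEST.get(proposal.get("action"))
    if contract is None:
        raise Rejected("action not in manifest")
    if contract.required_permission not in user_permissions:
        raise Rejected("permission denied")
    params = proposal.get("params", {})
    if set(params) != set(contract.param_types):
        raise Rejected("unexpected or missing parameters")
    for key, expected in contract.param_types.items():
        if type(params[key]) is not expected:
            raise Rejected(f"parameter {key!r} has the wrong type")
    return contract, params


def execute(proposal, user_permissions, handlers):
    # Validation happens first, so a rejected proposal never reaches a handler.
    contract, params = validate(proposal, user_permissions)
    return handlers[contract.name](**params)
```

Because the checks do not depend on what the model generated beyond the structured proposal, a malformed or unauthorised proposal is rejected regardless of how it was produced.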

Both AI conditions delivered 13–18x speedup over manual operation.

Timing/performance comparison across the three experimental conditions (manual operation, unconstrained AI, full bounded autonomy) within the deployed evaluation; reported speedup range 13–18x relative to manual operation.

high positive Bounded Autonomy for Enterprise AI: Typed Action Contracts a... task completion time (speedup vs. manual)

The bounded-autonomy system completed 23 of 25 tasks with zero unsafe executions.

Evaluation in a deployed multi-tenant enterprise application across 25 scenario trials spanning seven failure families; comparison across three conditions (manual, unconstrained AI with safety layers disabled, full bounded autonomy).

high positive Bounded Autonomy for Enterprise AI: Typed Action Contracts a... tasks completed / unsafe executions

Overall, GAI provides a principled and scalable approach to integrating AI-generated information.

Summary claim in the abstract based on the combination of the theoretical properties and empirical results reported in the paper.

high positive Generative Augmented Inference scalability and principled integration of AI-generated information

Across applications, GAI improves confidence interval coverage without inflating width.

Empirical claim reported across the multiple application studies in the paper (the abstract states that CI coverage improves without inflating width); the main text or appendix presumably contains the quantitative analysis.

high positive Generative Augmented Inference confidence interval coverage and width (statistical inference quality)

In health insurance choice, GAI cuts labeling requirements by over 90% while maintaining decision accuracy.

Reported empirical result from the paper's health insurance choice experiment; the abstract gives the >90% reduction claim but does not include sample size or exact metrics.

high positive Generative Augmented Inference human labeling requirements; decision accuracy

In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information.

Empirical experiment in a retail pricing application comparing multiple estimators given identical auxiliary inputs; stated as consistent outperformance in the abstract (no numerical effect sizes or sample sizes provided there).

high positive Generative Augmented Inference estimator performance in retail pricing (e.g., predictive or decision accuracy /...

In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%.

Reported empirical result from the paper's conjoint analysis experiment(s); exact sample size and experimental details are not stated in the abstract.

high positive Generative Augmented Inference estimation error; human labeling requirements

Empirically, GAI outperforms benchmarks across diverse settings.

Empirical experiments reported across multiple application settings (conjoint analysis, retail pricing, health insurance choice) comparing GAI to alternative estimators/benchmarks.

high positive Generative Augmented Inference overall performance relative to benchmarks (estimation error / predictive perfor...

The authors establish asymptotic normality for the GAI estimator and show a 'safe default' property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive.

The paper claims formal theoretical results (asymptotic normality and efficiency comparisons) — supported by analytic derivations/proofs in the manuscript as referenced in the abstract.

high positive Generative Augmented Inference estimation efficiency (asymptotic variance / efficiency relative to baseline)

GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with a flexible, nonparametric relationship between LLM-generated outputs and human labels.

The paper presents a methodological proposal (Generative Augmented Inference) and states theoretical properties (orthogonal moment construction, consistency, valid inference) — supported by formal asymptotic analysis/proofs in the paper (the abstract references establishing asymptotic normality).

high positive Generative Augmented Inference consistent estimation and valid inference (statistical estimation properties)
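
To give a feel for how AI-generated outputs can augment a small set of human labels, here is a sketch of a much simpler, textbook control-variate correction for estimating a mean. It is explicitly not the GAI estimator or its orthogonal moment construction, whose details are not given on this page; the function name `augmented_mean` and the data are hypothetical. The weight is estimated from the labeled sample, so when the AI predictions are uncorrelated with the human labels the weight is zero and the estimate falls back to the human-only mean, which loosely mirrors the "safe default" idea described above.

```python
from statistics import mean


def augmented_mean(y_labeled, f_labeled, f_unlabeled):
    """Human-label mean adjusted by AI predictions with a data-driven weight.

    y_labeled:   human labels on the small labeled sample
    f_labeled:   AI-generated predictions on that same sample
    f_unlabeled: AI-generated predictions on a larger unlabeled pool
    Returns (estimate, weight).
    """
    y_bar, f_bar = mean(y_labeled), mean(f_labeled)
    var_f = sum((f - f_bar) ** 2 for f in f_labeled)
    cov_yf = sum((y - y_bar) * (f - f_bar) for y, f in zip(y_labeled, f_labeled))
    # Regression-style weight; zero when predictions carry no signal.
    weight = cov_yf / var_f if var_f > 0 else 0.0
    estimate = y_bar + weight * (mean(f_unlabeled) - f_bar)
    return estimate, weight


# Predictions that track the labels shift the estimate toward the pool.
print(augmented_mean([1, 2, 3, 4], [2, 4, 6, 8], [4, 6, 8]))
# Predictions uncorrelated with the labels leave the human-only mean unchanged.
print(augmented_mean([1, 2, 3, 4], [1, -1, -1, 1], [5, 5, 5]))
```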

This work takes a foundational step toward dignified human-AI interaction futures by balancing productivity with the preservation of human expertise.

Author-stated contribution and goal of the paper (conceptual + empirical work). Abstract claims contribution but does not present quantified validation of 'foundational' status.

high positive From Future of Work to Future of Workers: Addressing Asympto... balance between productivity and preservation of expertise

AI delivers initial operational/productivity gains in high-stakes work settings.

Claimed empirical observation from the year-long study (abstract: 'Initial operational gains'). No quantitative productivity metrics reported in abstract.

high positive From Future of Work to Future of Workers: Addressing Asympto... operational gains / productivity

The framework operationalizes 'sociotechnical immunity' via dual-purpose mechanisms that both serve institutional quality goals and build worker power to detect, contain, and recover from skill erosion while preserving human identity.

Descriptive claim about the proposed framework as stated in the abstract; no empirical performance metrics provided in abstract.

high positive From Future of Work to Future of Workers: Addressing Asympto... mechanisms for detection/containment/recovery from skill erosion and preservatio...

We offer a framework for dignified Human-AI interaction co-constructed with professional knowledge workers facing AI-induced skill erosion without traditional labor protections.

Paper contribution: proposed framework described as co-constructed with knowledge workers; the abstract states the aim and intended beneficiaries but does not report empirical validation details.

high positive From Future of Work to Future of Workers: Addressing Asympto... design of human-AI interaction frameworks to mitigate skill erosion and protect ...

Clear specifications, explicit governance, and ongoing human-AI collaboration are critical for successful scaling of regression automation.

Conclusions and recommendations derived from the case study's lessons and mixed-method evaluation.

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... success of scaling regression automation / effectiveness of human-AI teaming

The Copilot achieves 30-50% code reuse when generating candidate test scripts.

Quantitative result reported in the paper's evaluation (stated 30-50% code reuse in the abstract/summary).

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... code reuse in generated test scripts

Mixed-method evaluation shows the AI accelerates script authoring and increases throughput.

Empirical claim based on the paper's mixed-method evaluation (qualitative and quantitative data reported in the case study); specific sample sizes not provided in the summary.

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... script authoring speed and throughput

Automated regression testing is essential for maintaining rapid, high-quality delivery in Agile and Scrum organizations.

Introductory/position statement in the paper; general premise motivating the case study (no specific empirical test reported).

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... ability to maintain rapid, high-quality delivery

AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers.

Empirical evaluation on MLE-Bench reported in the paper (benchmark ranking, metric = medal rate).

high positive AIBuildAI: An AI Agent for Automatically Building AI Models medal rate (task success rate) on MLE-Bench

AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization; each sub-agent is itself an LLM-based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches.

System architecture description in the paper (methods/architecture section).

high positive AIBuildAI: An AI Agent for Automatically Building AI Models system architecture and claimed capabilities (multistep reasoning, tool use, end...
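
The manager-plus-three-sub-agents layout described above can be sketched as a simple coordinator. This is an illustrative skeleton only: the sub-agents here are plain stub functions rather than LLM-based agents, the strictly sequential designer, coder, tuner order is an assumption, and the `score` value is a placeholder, not a result from the paper.

```python
from typing import Callable, Dict

# A sub-agent takes the shared state and returns an updated copy of it.
SubAgent = Callable[[dict], dict]


def designer(state: dict) -> dict:
    """Stub: chooses a modeling strategy for the task."""
    return {**state, "plan": f"baseline model for {state['task']}"}


def coder(state: dict) -> dict:
    """Stub: turns the plan into an implementation."""
    return {**state, "code": f"# implements: {state['plan']}"}


def tuner(state: dict) -> dict:
    """Stub: trains and tunes; the score is a placeholder, not a real metric."""
    return {**state, "score": 0.5}


def manager(task: str, agents: Dict[str, SubAgent],
            order=("designer", "coder", "tuner")) -> dict:
    """Coordinates the sub-agents, passing state from one to the next."""
    state = {"task": task}
    for role in order:
        state = agents[role](state)
    return state


result = manager("tabular classification",
                 {"designer": designer, "coder": coder, "tuner": tuner})
print(result["plan"])
```

A real system would presumably let the manager loop back, for example returning to the coder after a failed training run; the fixed order here just keeps the sketch short.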

We introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data.

Methodological contribution: system design and implementation described in the paper (introduction/methods).

high positive AIBuildAI: An AI Agent for Automatically Building AI Models ability to produce AI models from task descriptions and training data
