Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
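
The rows of the matrix can be summarized with a few lines of arithmetic. The sketch below copies three rows from the table and computes the positive share of each total. It also reports how many claims in the Total column are not accounted for by the four direction columns; that such a remainder corresponds to claims without a coded direction is an assumption, not something the page states.

```python
# Three rows copied from the matrix above: (positive, negative, mixed, null, total).
ROWS = {
    "Other": (758, 199, 100, 900, 2007),
    "Output Quality": (476, 179, 59, 47, 761),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def summarize(row):
    """Return (positive share of total, claims not covered by the four directions)."""
    positive, negative, mixed, null, total = row
    remainder = total - (positive + negative + mixed + null)
    return positive / total, remainder


for name, row in ROWS.items():
    share, remainder = summarize(row)
    print(f"{name}: {share:.1%} positive, {remainder} outside the four direction columns")
```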

Filter: Human-AI Collaboration

Reactive adaptive feedback (Study 2) based on real-time deviations in joint mental effort (JME) and/or joint visual attention (JVA) improves collaboration outcomes. Combined feedback targeting both dimensions yields the strongest improvements in performance, regulatory coherence, and cognitive-to-attentional causality, outperforming single-channel feedback.

Study 2 experimental intervention delivering reactive adaptive feedback tied to real-time JME/JVA deviations; comparisons between combined feedback, single-channel feedback, and presumably control conditions reported in paper (no per-condition sample sizes in abstract).

high positive Cognitive Alignment Drives Attention: Modeling and Supportin... collaboration performance, regulatory coherence, cognitive-to-attentional causal...

In Study 1, there is a stable causal relationship in which JME predicts JVA (cognitive alignment drives attentional coordination).

Causal modeling / causal inference applied to time-series measures (JME, JVA) from dual eye-tracking and pupillometry in Study 1; reported directionality JME -> JVA.

high positive Cognitive Alignment Drives Attention: Modeling and Supportin... causal influence of joint mental effort (JME) on joint visual attention (JVA)

High-performing dyads show a greater prevalence of productive high-JME–high-JVA episodes.

Episode-based analysis in Study 1 using eye-tracking/pupillometry to identify and count high-JME–high-JVA episodes; comparison between performance groups reported in paper.

high positive Cognitive Alignment Drives Attention: Modeling and Supportin... frequency/prevalence of high-JME–high-JVA episodes
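
One way to picture the episode-based analysis described above is to threshold both signals and count contiguous runs where both are high. This is only a sketch: the thresholds, the sampling scheme, and the minimum episode length are assumptions, and the function name `high_high_episodes` is hypothetical. The excerpt does not say how the paper defines an episode.

```python
from typing import List, Tuple


def high_high_episodes(jme: List[float], jva: List[float],
                       jme_thr: float, jva_thr: float,
                       min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, where both signals stay high."""
    n = min(len(jme), len(jva))
    episodes = []
    start = None  # index where the current high-high run began, if any
    for i in range(n):
        both_high = jme[i] >= jme_thr and jva[i] >= jva_thr
        if both_high and start is None:
            start = i
        elif not both_high and start is not None:
            if i - start >= min_len:
                episodes.append((start, i))
            start = None
    # Close a run that is still open at the end of the recording.
    if start is not None and n - start >= min_len:
        episodes.append((start, n))
    return episodes


# Made-up samples for one dyad, not study data.
jme = [0.9, 0.9, 0.1, 0.8, 0.8, 0.8]
jva = [0.9, 0.8, 0.9, 0.2, 0.9, 0.9]
print(high_high_episodes(jme, jva, jme_thr=0.5, jva_thr=0.5))
```

Comparing the number or total duration of such runs between performance groups would then be a separate statistical step.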

In natural collaboration (Study 1), high-performing dyads exhibit significantly higher joint mental effort (JME) and joint visual attention (JVA) than lower-performing dyads.

Study 1 empirical comparison of dyads during collaborative debugging using dual eye-tracking and pupillometry; performance-based grouping (high-performing vs lower-performing). Exact per-study sample not specified in abstract.

high positive Cognitive Alignment Drives Attention: Modeling and Supportin... joint mental effort (JME) and joint visual attention (JVA)

Human labor retains premium value when human judgment, attention, accountability, authorship, or relational participation is not incidental to the output but constitutive of what is being purchased (the paper proposes 'constitutive human presence' as the relevant standard for evaluating hybrid human-AI work).

Conceptual definition and prescriptive standard introduced in the paper; no empirical validation or measurement reported in the excerpt.

high positive Human-Provenance Verification should be Treated as Labor Inf... retention of premium value for human labor under the 'constitutive human presenc...

Because these premiums depend on credible verification, AI governance should treat human-provenance verification systems as labor infrastructure rather than as luxury authenticity labels.

Normative/policy recommendation based on the paper's conceptual analysis; the excerpt contains argumentation but no empirical evaluation of governance interventions.

high positive Human-Provenance Verification should be Treated as Labor Inf... policy classification / regulatory treatment of human-provenance verification sy...

AI-saturated markets are likely to create Veblen-good premiums, termed human-provenance premiums, for verified human presence (i.e., consumers will pay price premiums for verified human-produced outputs).

Theoretical claim drawing on economic reasoning about Veblen goods and market preferences; paper presents argumentation rather than reported empirical estimation in the excerpt.

high positive Human-Provenance Verification should be Treated as Labor Inf... price premium for verified human-produced outputs (willingness-to-pay / premium ...

This compression reallocates demand for human labor toward work valued for its visible human character (performative humanity), including relational presence, aesthetic provenance, and accountability.

Theoretical/conceptual reasoning and typology proposed in the paper (no empirical sample or measurement reported in the excerpt).

high positive Human-Provenance Verification should be Treated as Labor Inf... demand for human-valued labor (employment or demand shifts toward specific human...

Results provide operations managers with tech-backed playbooks for responsible resource use without compromising profit motives, enabling operational excellence while meeting environmental and social responsibilities.

Paper conclusion/implication statement asserting managerial applicability of findings; grounded in the study's reported results but presented as a recommendation/implication rather than a quantified finding.

high positive Green Supply Chain Optimization: AI and IoT for Ethical Reso... availability/applicability of managerial playbooks for responsible resource use ...

Firms maintain competitive costs while implementing AI-IoT eco-networks.

Paper claims that waste and emissions reductions are achieved without compromising costs; specific cost metrics or statistical tests not provided in the abstract.

high positive Green Supply Chain Optimization: AI and IoT for Ethical Reso... cost competitiveness / operational costs

Firms embracing AI-IoT eco-networks trim carbon output by 20-35%.

Paper results reported as empirical findings; presumably measured via carbon footprint assessments and IoT/operational metrics across the case study firms and facilities.

high positive Green Supply Chain Optimization: AI and IoT for Ethical Reso... carbon output / emissions

Firms embracing AI-IoT eco-networks cut waste by 30-50%.

Paper results reported as empirical findings; based on mixed-methods case studies of 12 multinational companies and IoT data from 45 facilities (as stated in methods).

high positive Green Supply Chain Optimization: AI and IoT for Ethical Reso... waste (resource/material waste)

Networked IoT sensors were placed in factories, trucks, storage sites, and upstream suppliers, and the resulting real-time data were paired with machine-learning routines to schedule preventive maintenance, forecast orders, and guide blockchain tracking, routing adjustments, and automated decisions that balance green goals with everyday performance.

Paper description of system design and interventions: placement of sensors across supply chain nodes and pairing with ML routines for maintenance, forecasting, blockchain tracking, routing, and automated decisions.

high positive Green Supply Chain Optimization: AI and IoT for Ethical Reso... implementation of AI-IoT system functions (preventive maintenance scheduling, or...

Proactive feedback produces post-intervention gains in Joint Visual Attention (JVA) and Joint Mental Effort (JME).

Within-subject empirical study with 26 dyads reporting post-intervention increases in JVA and JME measures following proactive feedback.

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... Joint Visual Attention (JVA) and Joint Mental Effort (JME)

Proactive feedback significantly improves feedback uptake.

Reported results from the within-subject study (26 dyads) indicating higher uptake/adoption of feedback when proactive feedback was provided.

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... feedback uptake (adoption of scaffolded suggestions)

Proactive feedback significantly improves task efficiency.

Within-subject empirical study with 26 dyads reported in the paper; authors report significant improvement in task efficiency for proactive feedback condition.

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... task efficiency (e.g., time to complete debugging tasks)

In a within-subject study with 26 pair-programming dyads, proactive feedback significantly improves debugging success.

Within-subject empirical study reported in the paper with 26 pair-programming dyads; statistical claim of significant improvement in debugging success under proactive feedback condition.

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... debugging success

ProPACT uses a hierarchical adaptive policy that delivers minimally intrusive scaffolds while fading support during productive collaboration.

Algorithm/policy design described in the paper (hierarchical adaptive policy and scaffold delivery/fading behavior).

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... scaffolding intrusiveness and adaptive fading of support

ProPACT employs an XGBoost-based forecasting model to predict emerging suboptimal collaboration states up to 30 seconds in advance.

Modeling and evaluation described in the paper; forecasting model implementation stated as XGBoost and claim of 30-second-ahead prediction (trained/evaluated on study data from the paper).

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... prediction of emerging suboptimal collaboration states (prediction horizon up to...
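
Forecasting a collaboration state some seconds ahead can be framed as ordinary supervised learning: the features observed at time t are paired with the state label observed at t plus the horizon. The sketch below shows only that pairing step. The helper name `make_forecast_pairs`, the one-sample-per-second rate, and the toy data are assumptions; the resulting pairs could then be passed to any classifier, such as an XGBoost model, but the paper's actual features and training setup are not given in the excerpt.

```python
def make_forecast_pairs(features, labels, horizon_steps):
    """Pair the feature vector at step t with the state label at step t + horizon_steps."""
    if horizon_steps < 1:
        raise ValueError("horizon_steps must be at least 1")
    n = min(len(features), len(labels))
    usable = max(n - horizon_steps, 0)  # steps that still have a label horizon_steps ahead
    X = [features[t] for t in range(usable)]
    y = [labels[t + horizon_steps] for t in range(usable)]
    return X, y


# Toy example: 5 one-second samples, label 1 marks a suboptimal state (made-up data).
features = [[0], [1], [2], [3], [4]]
labels = [0, 0, 1, 1, 0]
X, y = make_forecast_pairs(features, labels, horizon_steps=2)
print(X, y)
```

With samples taken once per second, a 30-second horizon would correspond to `horizon_steps=30`.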

ProPACT constructs a multimodal dyadic learner model based on Joint Visual Attention (JVA), Joint Mental Effort (JME), and individual mental effort.

System design / modeling description in the paper (multimodal dyadic learner model specification).

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... Joint Visual Attention (JVA) and Joint Mental Effort (JME) measurements

ProPACT is a proactive AI-driven adaptive collaborative tutor that treats collaboration itself as the object of instruction.

System description presented in the paper (design/implementation claim); authors introduce ProPACT as an AI-driven adaptive collaborative tutor.

high positive ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor ... collaboration quality

These evolved models improve downstream end-to-end agentic data-science (ADS) performance, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.

Empirical evaluation on the BLADE benchmark comparing downstream ADS performance using Copilot CLI, Claude Code, and Codex with and without the evolved models; reported maximum improvement 'up to 73%'.

high positive Agentic-imodels: Evolving agentic interpretability tools via... downstream ADS performance on the BLADE benchmark (measured as benchmark perform...

The evolved models generalize to new datasets.

Reported experiments showing performance of the evolved models on datasets not used during evolution / training (as described in the paper's experimental results).

high positive Agentic-imodels: Evolving agentic interpretability tools via... generalization of model performance to new datasets

The evolved models jointly improve agent-facing interpretability (as measured by the LLM-based metric) and generalize to new interpretability tests.

Experimental evaluation using the proposed LLM-based interpretability metric, including tests on held-out interpretability evaluations described in the paper.

high positive Agentic-imodels: Evolving agentic interpretability tools via... agent-facing interpretability (LLM-graded simulatable test performance)

The evolved models jointly improve predictive performance.

Experimental results reported in the paper comparing evolved models to baselines on predictive metrics across datasets (details of datasets and metrics referenced in the experiments section).

high positive Agentic-imodels: Evolving agentic interpretability tools via... predictive performance (e.g., prediction accuracy or other predictive metrics)

We introduce a novel LLM-based interpretability metric built from a suite of LLM-graded tests that probe whether a fitted model's string representation is 'simulatable' by an LLM (i.e., whether the LLM can answer questions about the model's behavior by reading its string output alone).

Design and specification of an interpretability metric based on LLM-graded tests, described in the paper; metric operationalized by asking LLMs questions about models' string representations.

high positive Agentic-imodels: Evolving agentic interpretability tools via... agent-facing interpretability as measured by LLM-graded simulatable tests
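
The idea of a simulatability test can be illustrated with a small harness: a reader sees only the model's string form, answers questions about its behavior, and the score is the fraction answered correctly. Everything here is a hypothetical stand-in. `simulatability_score` and `toy_reader` are illustrative names, and `toy_reader` is a hand-written parser taking the place of an LLM call. The paper's actual tests and grading procedure are not described in the excerpt.

```python
from typing import Callable, List, Tuple


def simulatability_score(model_text: str,
                         tests: List[Tuple[str, float]],
                         ask: Callable[[str, str], float],
                         tol: float = 1e-6) -> float:
    """Fraction of questions answered correctly from the model's text alone."""
    if not tests:
        raise ValueError("need at least one test")
    passed = sum(abs(ask(model_text, q) - expected) <= tol for q, expected in tests)
    return passed / len(tests)


def toy_reader(model_text: str, question: str) -> float:
    """Stand-in for an LLM: reads 'y = a*x + b' and a question such as 'x=3'."""
    rhs = model_text.split("=", 1)[1]
    slope_part, intercept_part = rhs.split("+")
    slope = float(slope_part.replace("*x", "").strip())
    intercept = float(intercept_part.strip())
    x = float(question.split("=", 1)[1])
    return slope * x + intercept


model_text = "y = 2.0*x + 1.0"          # string form of a fitted model (made up)
tests = [("x=3", 7.0), ("x=0", 1.0)]    # expected outputs from the real model
print(simulatability_score(model_text, tests, toy_reader))
```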

Agentic-imodels develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric.

Implemented library of scikit-learn-compatible regressors described in the paper and used in experiments; optimization objective includes predictive performance and an LLM-based interpretability metric.

high positive Agentic-imodels: Evolving agentic interpretability tools via... availability and optimization of a library of regressors (predictive performance...

We introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents.

Description and implementation of a new system (Agentic-imodels) presented in the paper; methodological contribution described as an autoresearch loop that evolves tools.

high positive Agentic-imodels: Evolving agentic interpretability tools via... existence and implementation of Agentic-imodels (a system-level contribution)

Decomposing AI recommendations into individually verifiable claims linked to source guidelines produces substantially higher clinician trust than traditional explainability approaches in high-stakes clinical decisions.

Summary/meaning statement in paper supported by randomized trial results (N=356; reported effect sizes and proportions showing higher trust for atomic fact-checking vs traditional approaches).

high positive Atomic Fact-Checking Increases Clinician Trust in Large Lang... clinician trust in AI treatment recommendations

Traditional transparency/explainability mechanisms showed a dose-response gradient of improvement over baseline (Cohen's d ranged from 0.25 to 0.50).

Randomized trial comparisons reported in paper; effect sizes for traditional mechanisms given as a range (d = 0.25 to 0.50).

high positive Atomic Fact-Checking Increases Clinician Trust in Large Lang... clinician trust (trust ratings) under traditional transparency mechanisms

Atomic fact-checking increased the proportion of clinicians expressing trust from 26.9% to 66.5%.

Randomized trial reported these proportions for trust in the paper (356 clinicians; reported percentages).

high positive Atomic Fact-Checking Increases Clinician Trust in Large Lang... proportion of clinicians expressing trust

Atomic fact-checking produced a large effect on clinician trust (Cohen's d = 0.94).

Randomized trial reported in paper with 356 clinicians; effect size reported directly (Cohen's d = 0.94).

high positive Atomic Fact-Checking Increases Clinician Trust in Large Lang... clinician trust (trust ratings)
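
For readers less familiar with the effect sizes quoted in these entries, the sketch below shows the standard pooled-standard-deviation form of Cohen's d and the arithmetic behind the change in trust proportions. The rating lists are made-up illustrations, not trial data, and the excerpt does not state which variant of d the paper used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Made-up trust ratings, for illustration only.
d_example = cohens_d([4, 5, 6], [2, 3, 4])

# Proportions quoted in the entry above: 26.9% before, 66.5% after.
point_change = round(66.5 - 26.9, 1)   # change in percentage points
ratio = 66.5 / 26.9                    # relative increase

print(d_example, point_change, round(ratio, 2))
```

By the common rule of thumb, d near 0.2 is read as small, 0.5 as medium, and 0.8 or more as large, which is why 0.94 is described as a large effect.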

The paper reframes AI safety as layered control, authorization, and externally reviewable limits, linking alignment, security engineering, organizational economics, and institutional design.

Synthesis and prescriptive claim based on the paper's theoretical analysis and proposed framework; supported by conceptual integration rather than empirical testing.

high positive AI Safety as Control of Irreversibility: A Systems Framework... safety governance approach (layered controls and limits)

The main result is a boundary stabilization theorem showing that safety need not require proving that advanced systems are always correct; instead it requires institutional and technical designs that prevent irreversible power from being released by a single high-efficiency node.

Formal/theoretical claim presented as the paper's primary theorem (a 'boundary stabilization theorem') demonstrated within the paper's formal model.

high positive AI Safety as Control of Irreversibility: A Systems Framework... safety (effectiveness of layered controls vs. proof-of-correctness)

The no-talk baseline establishes that communication is necessary.

Experimental no-talk baseline showing worse coordination without communication between agents.

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... coordination performance with vs without communication

These results highlight dynamic grounding as a critical and understudied axis of multi-agent coordination.

Synthesis/interpretation of the experimental findings reported in the paper.

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... importance of dynamic grounding for multi-agent coordination

We introduce an iterated, multi-turn negotiation game in which two agents allocate shared resources toward private projects with verifiable jointly optimal outcomes.

Methodological contribution described in the paper (design of a new multi-turn negotiation game).

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... existence of a multi-turn negotiation benchmark with verifiable optimal outcomes

Grounding is the collaborative process of establishing mutual belief sufficient for the current communicative purpose.

Conceptual/definitional statement presented by the authors (no empirical data reported).

high positive Talk is Cheap, Communication is Hard: Dynamic Grounding Fail... definition of grounding

The frontier for AI-augmented science is not acceleration; it is the redesign of the certifying infrastructure around these new scarcities.

Prescriptive conclusion in the paper arguing priority of institutional redesign over mere speed gains; presented without empirical testing in the excerpt.

high positive AI-Augmented Science and the New Institutional Scarcities prioritization of redesigning certifying infrastructure versus accelerating scie...

Competent-looking judgment, including selecting, ranking, attributing, and certifying, is now produced at scale at marginal cost approaching zero, inverting the dominant economics-of-AI reading that treats judgment as the scarce complement to cheap prediction.

Argumentative/theoretical claim in the paper; no empirical sample, experiment, or quantitative data reported in the excerpt (implicit basis: observation of scalable AI outputs).

high positive AI-Augmented Science and the New Institutional Scarcities production of competent-looking judgment (selecting, ranking, attributing, certi...

HAAS can serve as a pre-deployment workbench for comparing and inspecting human–AI allocation policies before organisational commitment.

Claim about intended use and demonstration of HAAS as an implemented tool; based on the framework implementation and benchmark experiments reported. No deployment-scale evaluation or sample sizes provided in the excerpt.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... ability to compare and inspect allocation policies prior to deployment

In manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously — a workload-buffering effect.

Domain-specific empirical result reported for the manufacturing benchmark in the paper, comparing operational performance and fatigue under different governance strengths. No numeric sample size or effect sizes provided in the excerpt.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... operational performance and worker fatigue

Task–agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum (from human-only to fully autonomous) embedded in a reproducible benchmark spanning software engineering and manufacturing.

Design and benchmark description within the paper; specification of five cognitive dimensions and a five-mode autonomy spectrum and a reproducible benchmark across two domains. No numeric sample size provided.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... representation of task–agent fit and benchmarking across domains

HAAS combines a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback.

Descriptive claim about the implemented HAAS framework as presented in the paper; method description of system architecture (rule-based expert system + contextual-bandit learner). No sample size reported.

high positive HAAS: A Policy-Aware Framework for Adaptive Task Allocation ... mechanism for adaptive task allocation (selected collaboration mode)
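
The two-stage design described above, with rules first and learning second, can be sketched as follows. This is not the HAAS implementation. The mode labels between human-only and fully autonomous, the single governance rule, and the epsilon-greedy learner are all assumptions chosen for brevity; the excerpt says only that a contextual bandit selects among feasible modes.

```python
import random
from collections import defaultdict

# Hypothetical labels for a five-mode autonomy spectrum.
MODES = ["human_only", "ai_assist", "shared", "ai_lead", "fully_autonomous"]


def feasible_modes(task):
    """Rule layer: remove modes that violate governance constraints (illustrative rule)."""
    if task.get("safety_critical"):
        return [m for m in MODES if m not in ("ai_lead", "fully_autonomous")]
    return list(MODES)


class EpsilonGreedyBandit:
    """Keeps a running mean reward per (context, mode) pair."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self, context, modes):
        # Explore with probability epsilon, otherwise pick the best known mode.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(modes)
        return max(modes, key=lambda m: self.means[(context, m)])

    def update(self, context, mode, reward):
        key = (context, mode)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


bandit = EpsilonGreedyBandit(epsilon=0.0)
task = {"safety_critical": True}
allowed = feasible_modes(task)             # rules applied before any learning
bandit.update("manufacturing", "shared", reward=1.0)
choice = bandit.select("manufacturing", allowed)
print(allowed, choice)
```

Because the learner only ever sees the filtered list, it cannot pick a mode the rule layer has excluded, whatever rewards it has observed.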

The field's near-term research agenda should explicitly include collecting and using triadic data.

Normative recommendation in the paper; presented as the authors' advised research priority rather than empirically justified within the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... inclusion of triadic data collection/use in near-term research agendas in the SW...

This data is the empirical key to four open questions in agent training.

Argumentative claim in the paper asserting that triadic data is central to four open research questions, which the excerpt does not specify; no empirical demonstration is included in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... resolvability of four open questions in agent training using triadic data

This triadic data is capturable in 12-18 months with methods already mature in adjacent fields.

Claim in the paper based on authors' assessment of methodological maturity in adjacent fields; no empirical project timeline or pilot data is provided in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... time required to collect a triadic dataset using existing methods

Any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher through a four-tier evidence framework: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation.

Methodological proposal in the paper outlining a four-tier evidence framework; presented as normative guidance rather than validated by application to a corpus in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... quality and trustworthiness of fine-tuning corpora as judged by the four-tier fr...

The canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure.

Prescriptive specification in the paper proposing two concrete dataset types as canonical instantiations; presented as design/recommendation rather than empirically tested.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... availability and suitability of dataset modalities (stimulated-recall expert tra...

The substrate for the next generation of software-engineering (SWE) agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs; it is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both.

Argument and conceptual proposal in the paper; no empirical validation or comparative experiments are provided in the excerpt.

high positive The Conversations Beneath the Code: Triadic Data for Long-Ho... effectiveness of training data substrates for improving agent performance on lon...
