Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
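
The matrix above can be read programmatically. The following is a minimal sketch, assuming each row is tab-separated in the order Outcome, Positive, Negative, Mixed, Null, Total, and that the dash placeholder means zero claims. The helper names `parse_count` and `direction_shares` are illustrative, not part of the source. In some rows the Total column exceeds the sum of the four direction columns (for example, Other lists 758 + 199 + 100 + 900 = 1957 against a Total of 2007), and the page does not say why, so the sketch normalizes by the sum of the listed direction counts instead of Total.

```python
def parse_count(cell: str) -> int:
    """Convert one table cell to an int, treating the dash placeholder as zero (assumption)."""
    cell = cell.strip()
    return 0 if cell in {"—", "-", ""} else int(cell.replace(",", ""))


def direction_shares(row: str) -> dict[str, float]:
    """Return each direction's share of the four listed direction counts for one row."""
    _name, *cells = row.split("\t")
    labels = ["Positive", "Negative", "Mixed", "Null"]
    counts = [parse_count(c) for c in cells[:4]]
    listed = sum(counts)
    return {
        label: (count / listed if listed else 0.0)
        for label, count in zip(labels, counts)
    }


# Row copied from the matrix: Job Displacement.
row = "Job Displacement\t12\t80\t20\t1\t113"
shares = direction_shares(row)
print({label: round(share, 3) for label, share in shares.items()})
```

For this row the four direction counts happen to sum to the listed Total (113), so the Negative share is 80/113.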

Filtered by: Human-AI Collaboration

A Neural Boosted Tree model with entity embeddings for textile attributes was constructed and achieved a mean R2 of 0.921 in cross-validation, surpassing benchmark methods.

Model training and cross-validation reported in paper using the e-commerce dataset; comparison to benchmark methods reported (specific benchmarks not listed in abstract).

high positive Enhancing Supply Chain Resilience in Textile SMEs: A Human-C... forecasting accuracy (mean R2)
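
To make the reported metric concrete, here is a minimal sketch of how a mean R2 across cross-validation folds is typically computed. It is not the paper's pipeline: the fold data below are toy values invented for illustration, and the function names `r2_score` and `mean_cv_r2` are assumptions of this sketch.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_cv_r2(folds):
    """Average the per-fold R2 over a list of (y_true, y_pred) pairs."""
    scores = [r2_score(y_true, y_pred) for y_true, y_pred in folds]
    return sum(scores) / len(scores)


# Toy held-out folds (illustrative only, not data from the paper).
folds = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect predictions
    ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # always predicts the fold mean
]
print(mean_cv_r2(folds))
```

Because each fold is scored on held-out data, the mean summarizes out-of-sample fit; a model that only predicts the fold mean scores 0 on that fold.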

The framework incorporates ethically compliant acquisition of consumer demand signals, semantic translation of unstructured market data into textile engineering attributes, machine-learning-based demand forecasting, and human-centric decision support.

Description of framework components and design choices presented in paper (methodological/architectural claim).

high positive Enhancing Supply Chain Resilience in Textile SMEs: A Human-C... presence of specified framework components (ethical data acquisition, semantic t...

This study develops and validates a customer-to-manufacturer (C2M) intelligence framework that enables data-driven production planning using publicly available e-commerce data.

Methodological development described in paper; validation based on ML modeling using e-commerce data and a 12-month field deployment at one Taiwanese dyeing SME.

high positive Enhancing Supply Chain Resilience in Textile SMEs: A Human-C... feasibility and validation of a C2M intelligence framework for production planni...

The paper introduces a novel posted-price procurement model with coverage objectives for studying platform procurement of human input.

Methodological contribution declared in the paper: presentation of a new formal model (posted-price procurement with coverage objectives).

high positive Stochastic wage suppression on gig platforms and how to orga... model formulation / methodological innovation

A small coalition of targeted low-cost workers who commit to a price floor forces the platform's total spending to change from logarithmic to linear in M.

Theoretical analysis within the model showing that when a targeted subset of low-cost workers commit to a minimum price, the asymptotic scaling of platform spending increases from logarithmic (in M) to linear (in M); proof-based, no empirical sample.

high positive Stochastic wage suppression on gig platforms and how to orga... platform's total spending / total payments to workers (scaling in M)

A research-degree-student survey showed high performance ratings across information reliability, theoretical depth and logical rigor, with pronounced ceiling effects on a 7-point scale, despite all participants already being frontier-model users.

Authors report results from a survey of research-degree students evaluating the scholar-bots on specified dimensions (information reliability, theoretical depth, logical rigor) using a 7-point scale and note ceiling effects; participants reportedly were experienced model users.

high positive The Relic Condition: When Published Scholarship Becomes Mate... student-rated performance on reliability, theoretical depth, logical rigor (7-po...

Recovered panel scores placed Scholar A between 7.9 and 8.9/10 and Scholar B between 8.5 and 8.9/10 under multi-turn debate conditions.

Paper reports numeric panel scores (ranges) for the two scholar-bots in multi-turn debate scenarios; scores are presented as recovered panel evaluations.

high positive The Relic Condition: When Published Scholarship Becomes Mate... panel evaluation scores (0-10 scale) under multi-turn debate

Appointment-level recommendations placed both bots at or above Senior Lecturer level in the Australian university system.

Authors state that appointment-level syntheses from assessors recommended both scholar-bots at or above the Senior Lecturer rank (Australian system); based on the experts' syntheses.

high positive The Relic Condition: When Published Scholarship Becomes Mate... appointment/rank recommendation

Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining.

Authors report that the preserved set of expert review and supervision reports (from the three assessors) rated scholar-bot outputs as attaining the benchmark standards used for assessment.

high positive The Relic Condition: When Published Scholarship Becomes Mate... benchmark attainment in review and supervision reports

The scholar-bots were deployed across doctoral supervision, peer review, lecturing and panel-style academic exchange.

Authors report deployment of the generated scholar-bots in multiple academic task contexts (doctoral supervision, peer review, lecturing, panel debates); reported as part of evaluation protocol.

high positive The Relic Condition: When Published Scholarship Becomes Mate... ability to perform academic tasks (supervision, peer review, lecturing, panel ex...

We converted those systems into structured inference-time constraints for a large language model.

Authors describe a pipeline that transforms the extracted scholar reasoning artefacts into inference-time constraints applied to an LLM; presented as part of methods for the two scholar cases.

high positive The Relic Condition: When Published Scholarship Becomes Mate... conversion of extracted reasoning systems into inference-time constraints

We extracted the scholarly reasoning systems of two internationally prominent humanities and social science scholars from their published corpora alone.

Authors report an extraction procedure applied to the published corpora of two named scholars; claim is descriptive of dataset and method (n=2).

high positive The Relic Condition: When Published Scholarship Becomes Mate... successful extraction of reasoning systems from published corpora

From synthesis of results, we suggest three practices that focus on preserving agency in software engineering for coding, learning, and mentorship, especially as AI grows increasingly autonomous.

Authors' prescriptive recommendations derived from the paper's qualitative synthesis; presented as proposed practices rather than empirically tested interventions.

high positive From Junior to Senior: Allocating Agency and Navigating Prof... Recommended practices intended to preserve developer agency

Seniors leverage pre-AI foundational instincts to steer modern tools and possess valuable perspectives for mentoring juniors in their early AI-encouraged career development.

Qualitative accounts from senior participants in the Delphi/ACTA process and blind reviews showing seniors reference pre-AI practices and see mentoring value.

high positive From Junior to Senior: Allocating Agency and Navigating Prof... Seniors' ability to direct AI tools based on prior foundations and their perceiv...

Juniors enter as AI‑natives; seniors adapted mid‑career.

Authors' synthesis from a three-phase mixed-methods study: ACTA combined with a Delphi process (5 seniors), an AI-assisted debugging task (10 juniors), and blind reviews of junior prompt histories by 5 additional seniors.

high positive From Junior to Senior: Allocating Agency and Navigating Prof... Whether developers began their careers with AI tools (AI-native status) versus a...

Prediction intervals are a more suitable evaluation format than point estimates for numerical forecasting because they require scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes.

Conceptual/analytical argument presented in the paper explaining why prediction intervals better capture uncertainty and testability for continuous numerical forecasting (no empirical proof provided in the excerpt).

high positive QuantSightBench: Evaluating LLM Quantitative Forecasting wit... suitability of evaluation format (prediction intervals vs point estimates)
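
Two of the properties named in this claim, calibration and internal consistency across confidence levels, can be checked with very little code. The sketch below is an illustration under stated assumptions, not QuantSightBench's scoring method: the forecasts and outcomes are toy values, and `empirical_coverage` and `is_nested` are hypothetical helper names.

```python
def empirical_coverage(intervals, outcomes):
    """Fraction of realized outcomes that fall inside their (low, high) interval."""
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, outcomes))
    return hits / len(outcomes)


def is_nested(inner, outer):
    """Consistency check: a lower-confidence interval should sit inside a higher-confidence one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Toy forecasts for four questions (illustrative only).
outcomes = [10, 25, 40, 100]
intervals_50 = [(8, 12), (30, 35), (38, 45), (50, 60)]
intervals_90 = [(5, 15), (20, 40), (30, 50), (40, 80)]

coverage_50 = empirical_coverage(intervals_50, outcomes)
coverage_90 = empirical_coverage(intervals_90, outcomes)
all_nested = all(is_nested(i, o) for i, o in zip(intervals_50, intervals_90))

print(coverage_50, coverage_90, all_nested)
```

A well-calibrated forecaster's empirical coverage should approach the nominal level (0.5 and 0.9 here) as the number of questions grows; a point estimate offers no comparable check.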

Technology-driven recruitment has emerged as a strategic imperative for organizations seeking competitive advantage in talent acquisition.

Argumentative/interpretive claim in the paper's introduction and discussion, supported by survey findings (N=150) indicating perceived strategic importance.

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... perceived strategic importance / adoption intent

The paper proposes the Technology-Enabled Recruitment Optimization Framework (TEROF), a structured implementation model designed to guide organizations through the phased adoption of recruitment technology.

Paper synthesizes its empirical findings into a named framework (TEROF) described in the discussion/conclusions; based on combined survey (N=150) and case-study analysis (4 organizations).

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... adoption guidance / implementation framework

Video interview platforms improved recruiter productivity by 41%.

Reported quantitative finding from the study's survey (N=150) and corroborating case study observations.

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... recruiter productivity

AI-powered resume screening reduced initial shortlisting time by 64%.

Reported quantitative result in the paper derived from the survey of HR professionals (N=150) and illustrated in case studies.

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... initial shortlisting time

Integrated technology-driven recruitment produced a 52% reduction in cost-per-hire relative to traditional methods.

Reported quantitative finding from the study's survey (N=150) and supporting case studies (4 organizations).

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... cost-per-hire

Adoption of integrated recruitment technology yielded a 45% improvement in candidate quality as measured by first-year performance ratings.

Reported quantitative result from the survey (N=150) and case study evidence using first-year performance ratings as the quality metric.

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... first-year employee performance (candidate quality)

Organizations adopting integrated technology-driven recruitment platforms experienced an average reduction in time-to-hire of 38%.

Reported quantitative finding based on the paper's mixed-methods analysis (survey of 150 HR professionals and corroborating qualitative case studies of 4 organizations).

high positive A Study on the Effectiveness of Technology-Driven Recruitmen... time-to-hire

These results suggest that LinuxArena has meaningful headroom for both attackers and defenders, making it a strong testbed for developing and evaluating future control protocols.

Authors synthesize results from sabotage evaluations, monitor evaluations, and the LaStraj human-attack dataset to conclude there is room for improvement on both attacker and defender sides; this is presented as an implication/recommendation rather than a strictly measured outcome.

high positive LinuxArena: A Control Setting for AI Agents in Live Producti... suitability/quality of LinuxArena as a testbed (headroom for attacker and defend...

LinuxArena contains 184 side tasks representing safety failures such as data exfiltration and backdooring.

Authors report the number of side tasks and describe their nature (safety failures) in the dataset/control setting documentation.

high positive LinuxArena: A Control Setting for AI Agents in Live Producti... number of side (safety-failure) tasks in LinuxArena

LinuxArena contains 1,671 main tasks representing legitimate software engineering work.

Authors report the number of main tasks when describing the contents of LinuxArena.

high positive LinuxArena: A Control Setting for AI Agents in Live Producti... number of main (legitimate) tasks in LinuxArena

LinuxArena contains 20 environments.

Authors report constructing LinuxArena and state the number of environments explicitly in the paper's description of the dataset/control setting.

high positive LinuxArena: A Control Setting for AI Agents in Live Producti... number of environments in the LinuxArena control setting

We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows; DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains (e.g., coding, crystallography, and music notation).

Paper describes creation of a benchmark/dataset called DELEGATE-52 covering 52 professional domains and designed to simulate long delegated document-editing workflows.

high positive LLMs Corrupt Your Documents When You Delegate benchmark scope / domain coverage

Drawing on Moral Foundations Theory and a multi-stakeholder perspective, moral (mis)alignment matters for the meaningful integration of AI in sensitive contexts.

Paper's theoretical framing and normative claim (method: conceptual synthesis using Moral Foundations Theory and multi-stakeholder argumentation; no empirical sample or quantitative results reported in the supplied text).

high positive Smart But Not Moral? Moral Alignment In Human-AI Decision-Ma... meaningful integration/adoption of AI in sensitive/high-stakes contexts

Moral alignment is defined as the perceived congruence between the values embedded in an AI system's decision logic and the moral intuitions of stakeholders.

Explicit definitional statement in the paper (conceptual definition; no empirical measurement reported in the supplied text).

high positive Smart But Not Moral? Moral Alignment In Human-AI Decision-Ma... perceived congruence between AI values and stakeholder moral intuitions (definit...

Moral alignment may be a more fundamental dimension of human-AI decision-making than functional or behavioral alignment.

Paper's central argumentative claim (theoretical proposition building on conceptual reasoning and prior theory; no empirical evidence or sample size reported in the supplied text).

high positive Smart But Not Moral? Moral Alignment In Human-AI Decision-Ma... relative fundamental status of moral alignment in human-AI decision-making

In high-stakes AI-supported decisions, considerations are not purely technical but involve moral judgments about fairness, responsibility, and harm.

Stated as a conceptual assertion in the paper's framing/abstract; presented as an observation building on prior literature (no empirical method or sample size reported in the supplied text).

high positive Smart But Not Moral? Moral Alignment In Human-AI Decision-Ma... presence of moral judgments in decision-making

Our paper contributes to the emerging discourse on AI overreliance and provides an understanding of the appropriate degree of reliance as essential to developers making the most of these powerful technologies.

Authors' claimed contribution based on synthesis of themes from twenty-two interviews and presentation of the reliance-control framework.

high positive Towards an Appropriate Level of Reliance on AI: A Preliminar... developers' ability to effectively use AI tools (appropriate degree of reliance)

The reliance-control framework can be used to recommend future research to explore different control levels supported by current and emergent LLM-driven tools.

Paper explicitly uses the framework to motivate and recommend directions for future research; based on qualitative interview findings (n=22) and authors' synthesis.

high positive Towards an Appropriate Level of Reliance on AI: A Preliminar... research directions and scope (exploration of control levels)

We propose a preliminary reliance-control framework where the level of control can be used to identify AI overreliance and underreliance.

Authors present a conceptual/framework contribution derived from analysis of the twenty-two interviews; this is a proposed (theoretical) framework rather than an experimentally validated one.

high positive Towards an Appropriate Level of Reliance on AI: A Preliminar... ability to identify overreliance and underreliance (framework applicability)

Fairness should be evaluated at the system level (the interacting agents) rather than solely at the level of individual models, because fairness can be an emergent, procedural property of decentralized agent interaction.

Conceptual framing supported by the triage experiments showing emergent fairness properties from agent interaction that were not present at the single-agent level.

high positive Beyond Arrow's Impossibility: Fairness as an Emergent Proper... appropriateness of system-level versus model-level evaluation for fairness

Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart.

Behavioral observations from the triage negotiation trials where aligned agents contested allocations proposed by biased or unaligned agents and adjusted final allocations in ways that increased access for marginalized groups while not fully changing the adversarial agent's preferences.

high positive Beyond Arrow's Impossibility: Fairness as an Emergent Proper... change in allocations for marginalized groups due to contestation in multi-agent...

Neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone.

Comparative analysis of individual-agent allocations versus joint allocations after three rounds of negotiation in the hospital triage simulation; claim based on observed differences between solitary and joint outcomes.

high positive Beyond Arrow's Impossibility: Fairness as an Emergent Proper... ethical adequacy / fairness of allocations (individual vs joint)

Fairness in language models emerges through interaction and exchange among agents, rather than being solely a property of a single, centrally optimized model.

Controlled simulation using a hospital triage framework in which two agents negotiate over three structured debate rounds; one agent is aligned via retrieval-augmented generation (RAG) and the other is unaligned or adversarially prompted. Observed final allocations and negotiation dynamics used to support the claim.

high positive Beyond Arrow's Impossibility: Fairness as an Emergent Proper... emergent fairness of joint allocations produced by multi-agent interaction

By framing disclosure as epistemic infrastructure, this work outlines a conceptual roadmap for future empirical and design research on Human–AI collaboration.

High-level, forward-looking claim about the paper's contribution to research agenda (conceptual argument). No empirical validation in the abstract.

high positive Who Gets Credit? Operationalizing AI Disclosure as Epistemic... influence on future empirical and design research agendas

We contribute a research instrument that operationalizes these configurations in a collaborative chat setting and articulate testable design conjectures.

Paper contribution: a research instrument and set of conjectures described by the authors (design/methodological artifact). The abstract does not report empirical deployment or sample size.

high positive Who Gets Credit? Operationalizing AI Disclosure as Epistemic... operationalization of disclosure configurations in a collaborative chat research...

We introduce an AI Disclosure Design Space that conceptualizes disclosure as an epistemic coordination mechanism.

Paper contribution: conceptual artifact (design space) introduced by the authors; this is a descriptive/foundational claim about the paper's contents.

high positive Who Gets Credit? Operationalizing AI Disclosure as Epistemic... conceptualization of disclosure as an epistemic coordination mechanism

What matters in practice is the design of disclosure: how systems reveal, signal, or conceal AI assistance within collaboration.

Central theoretical argument of the paper (conceptual/design claim); no empirical validation reported in the abstract.

high positive Who Gets Credit? Operationalizing AI Disclosure as Epistemic... effects of AI disclosure design on collaboration

Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency.

Conclusion drawn from the paper's empirical work: identification of task relevance and user answerability properties, operationalization via RL rewards, and the CLARITI evaluation showing fewer questions for matched resolution rate; abstract does not report experimental details or metrics beyond the 41% reduction.

high positive Asking What Matters: Reward-Driven Clarification for Softwar... clarification efficiency (fewer questions for similar resolution performance)

CLARITI is an 8B-parameter clarification module.

Model specification reported in the abstract; factual description of the trained model's scale (no further empirical detail provided in the abstract).

high positive Asking What Matters: Reward-Driven Clarification for Softwar... model parameter count

We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module.

Methodological claim: the paper reports implementation of multi-stage RL rewards and training of a clarification model named CLARITI with 8 billion parameters (claim reported in abstract; no training dataset size reported).

high positive Asking What Matters: Reward-Driven Clarification for Softwar... ability to train a clarification module using the proposed reward design

Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide).

Analytical methods reported in the paper: Shapley attribution and distributional comparisons applied to datasets of software engineering tasks and simulated user responses (abstract mentions these methods but gives no numeric sample size).

high positive Asking What Matters: Reward-Driven Clarification for Softwar... importance of information features for predicting task success and simulated-use...
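
For readers unfamiliar with Shapley attribution, the sketch below computes exact Shapley values for a toy value function by averaging each feature's marginal contribution over all orderings. It is not the paper's analysis: the feature names (`repro_steps`, `expected_behavior`, `env_details`) and the success-rate numbers in `toy_success_rate` are hypothetical, chosen only to show how credit for an interaction effect is split.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_success_rate(info):
    """Hypothetical task success rate given which pieces of information are available."""
    rate = 0.2  # baseline with no clarification
    if "repro_steps" in info:
        rate += 0.4
    if "expected_behavior" in info:
        rate += 0.2
    if {"repro_steps", "expected_behavior"} <= info:
        rate += 0.1  # interaction bonus, split evenly between the two features
    return rate


features = ["repro_steps", "expected_behavior", "env_details"]
phi = shapley_values(features, toy_success_rate)
print({name: round(v, 3) for name, v in phi.items()})
```

The attributions sum to the gain of the full feature set over the baseline (0.9 - 0.2 = 0.7 here), and a feature that never changes the value, such as `env_details` in this toy, receives zero. Exact enumeration is factorial in the number of features, so practical analyses usually rely on sampling-based approximations.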

Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions.

Background claim stated in the paper's introduction/abstract; likely supported by literature on underspecified task specifications and/or the authors' motivating examples (no specific sample size or experiment reported in the abstract).

high positive Asking What Matters: Reward-Driven Clarification for Softwar... frequency/occurrence of incomplete task specifications (need for clarification)

The approach provides a practical path toward more transparent, controllable, and accountable AI use without requiring new model architectures.

Authors' asserted benefit of the proposed interaction-layer framework; no empirical demonstration that transparency, control, or accountability are achieved or that no architectural changes are required in practice.

high positive Governing Reflective Human-AI Collaboration: A Framework for... transparency_controllability_accountability_of_AI_use

The framework enables auditable reasoning traces and supports alignment with emerging governance standards, including the EU AI Act and ISO/IEC 42001.

Stated compliance/alignment claim linking the proposed interaction-layer approach to existing regulatory standards; no compliance testing or audit examples reported.

high positive Governing Reflective Human-AI Collaboration: A Framework for... auditable_reasoning_traces_and_regulatory_alignment (EU AI Act, ISO/IEC 42001)
