Evidence (6491 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
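A quick consistency check on the matrix above can be sketched as follows. The snippet copies four rows verbatim from the table, computes the share of positive findings, and compares the sum of the four direction columns with the reported Total. The names `ROWS`, `summarize`, and `SUMMARY` are illustrative. In some rows the Total exceeds the directional sum; reading that gap as claims with no assigned direction is an assumption, not something the page states.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Hiring & Recruitment": (52, 7, 8, 3, 70),
}

def summarize(rows):
    """Return {outcome: (positive_share, gap)} for each row.

    gap = reported Total minus the sum of the four direction columns.
    Interpreting a nonzero gap as "direction not assigned" is an assumption.
    """
    out = {}
    for name, (pos, neg, mixed, null, total) in rows.items():
        directional = pos + neg + mixed + null
        out[name] = (round(pos / total, 3), total - directional)
    return out

SUMMARY = summarize(ROWS)
for name, (share, gap) in SUMMARY.items():
    print(f"{name}: positive share {share:.1%}, gap vs. Total {gap}")
```

On these four rows the gap is nonzero for Governance & Regulation and Firm Productivity and zero for the other two, so the Total column should not be assumed to equal the row sum.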
Human-AI Collaboration
A Neural Boosted Tree model with entity embeddings for textile attributes was constructed and achieved a mean R² of 0.921 in cross-validation, surpassing benchmark methods.
Model training and cross-validation reported in paper using the e-commerce dataset; comparison to benchmark methods reported (specific benchmarks not listed in abstract).
The framework incorporates ethically compliant acquisition of consumer demand signals, semantic translation of unstructured market data into textile engineering attributes, machine-learning-based demand forecasting, and human-centric decision support.
Description of framework components and design choices presented in paper (methodological/architectural claim).
This study develops and validates a customer-to-manufacturer (C2M) intelligence framework that enables data-driven production planning using publicly available e-commerce data.
Methodological development described in paper; validation based on ML modeling using e-commerce data and a 12-month field deployment at one Taiwanese dyeing SME.
The paper introduces a novel posted-price procurement model with coverage objectives for studying platform procurement of human input.
Methodological contribution declared in the paper: presentation of a new formal model (posted-price procurement with coverage objectives).
A small coalition of targeted low-cost workers who commit to a price floor forces the platform's total spending to change from logarithmic to linear in M.
Theoretical analysis within the model showing that when a targeted subset of low-cost workers commit to a minimum price, the asymptotic scaling of platform spending increases from logarithmic (in M) to linear (in M); proof-based, no empirical sample.
A research-degree-student survey showed high performance ratings across information reliability, theoretical depth and logical rigor, with pronounced ceiling effects on a 7-point scale, despite all participants already being frontier-model users.
Authors report results from a survey of research-degree students evaluating the scholar-bots on specified dimensions (information reliability, theoretical depth, logical rigor) using a 7-point scale and note ceiling effects; participants reportedly were experienced model users.
Recovered panel scores placed Scholar A between 7.9 and 8.9/10 and Scholar B between 8.5 and 8.9/10 under multi-turn debate conditions.
Paper reports numeric panel scores (ranges) for the two scholar-bots in multi-turn debate scenarios; scores are presented as recovered panel evaluations.
Appointment-level recommendations placed both bots at or above Senior Lecturer level in the Australian university system.
Authors state that appointment-level syntheses from assessors recommended both scholar-bots at or above the Senior Lecturer rank (Australian system); based on the experts' syntheses.
Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining.
Authors report that the preserved set of expert review and supervision reports (from the three assessors) rated scholar-bot outputs as attaining the benchmark standards used for assessment.
The scholar-bots were deployed across doctoral supervision, peer review, lecturing and panel-style academic exchange.
Authors report deployment of the generated scholar-bots in multiple academic task contexts (doctoral supervision, peer review, lecturing, panel debates); reported as part of evaluation protocol.
We converted those systems into structured inference-time constraints for a large language model.
Authors describe a pipeline that transforms the extracted scholar reasoning artefacts into inference-time constraints applied to an LLM; presented as part of methods for the two scholar cases.
We extracted the scholarly reasoning systems of two internationally prominent humanities and social science scholars from their published corpora alone.
Authors report an extraction procedure applied to the published corpora of two named scholars; claim is descriptive of dataset and method (n=2).
From synthesis of results, we suggest three practices that focus on preserving agency in software engineering for coding, learning, and mentorship, especially as AI grows increasingly autonomous.
Authors' prescriptive recommendations derived from the paper's qualitative synthesis; presented as proposed practices rather than empirically tested interventions.
Seniors leverage pre-AI foundational instincts to steer modern tools and possess valuable perspectives for mentoring juniors in their early AI-encouraged career development.
Qualitative accounts from senior participants in the Delphi/ACTA process and blind reviews showing seniors reference pre-AI practices and see mentoring value.
Juniors enter as AI-natives; seniors adapted mid-career.
Authors' synthesis from a three-phase mixed-methods study: ACTA combined with a Delphi process (5 seniors), an AI-assisted debugging task (10 juniors), and blind reviews of junior prompt histories by 5 additional seniors.
Prediction intervals are a more suitable evaluation format than point estimates for numerical forecasting because they require scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes.
Conceptual/analytical argument presented in the paper explaining why prediction intervals better capture uncertainty and testability for continuous numerical forecasting (no empirical proof provided in the excerpt).
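The three requirements named in this claim (calibration, internal consistency across confidence levels, and scale awareness) can each be checked mechanically. The sketch below is a minimal illustration and not the paper's evaluation protocol: the function names and the toy forecasts are hypothetical, and the interval (Winkler) score is a standard scoring rule used here as a stand-in for whatever metric the authors adopt.

```python
def coverage(intervals, outcomes):
    """Calibration: fraction of outcomes that fall inside their (low, high) interval."""
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, outcomes))
    return hits / len(outcomes)

def is_nested(narrow, wide):
    """Internal consistency: a higher-confidence interval must contain the lower-confidence one."""
    return wide[0] <= narrow[0] and narrow[1] <= wide[1]

def interval_score(low, high, y, alpha):
    """Interval (Winkler) score for a central (1 - alpha) interval; lower is better.

    Width rewards sharp intervals on the right scale; misses are penalized
    in proportion to how far the outcome lies outside the interval.
    """
    penalty = 0.0
    if y < low:
        penalty = (2 / alpha) * (low - y)
    elif y > high:
        penalty = (2 / alpha) * (y - high)
    return (high - low) + penalty

# Hypothetical toy forecasts: 50% and 90% intervals for four quantities.
outcomes = [10.0, 52.0, 7.0, 120.0]
pi50 = [(8, 12), (40, 50), (6, 9), (90, 110)]
pi90 = [(5, 15), (30, 60), (4, 12), (80, 130)]

cov50 = coverage(pi50, outcomes)   # ideally close to 0.50
cov90 = coverage(pi90, outcomes)   # ideally close to 0.90
consistent = all(is_nested(n, w) for n, w in zip(pi50, pi90))
miss_score = interval_score(40, 50, 52.0, alpha=0.5)
print(cov50, cov90, consistent, miss_score)
```

A point estimate offers none of these handles: there is no width to compare against the scale of the quantity, no second confidence level to nest, and no coverage rate to calibrate.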
Technology-driven recruitment has emerged as a strategic imperative for organizations seeking competitive advantage in talent acquisition.
Argumentative/interpretive claim in the paper's introduction and discussion, supported by survey findings (N=150) indicating perceived strategic importance.
The paper proposes the Technology-Enabled Recruitment Optimization Framework (TEROF), a structured implementation model designed to guide organizations through the phased adoption of recruitment technology.
Paper synthesizes its empirical findings into a named framework (TEROF) described in the discussion/conclusions; based on combined survey (N=150) and case-study analysis (4 organizations).
Video interview platforms improved recruiter productivity by 41%.
Reported quantitative finding from the study's survey (N=150) and corroborating case study observations.
AI-powered resume screening reduced initial shortlisting time by 64%.
Reported quantitative result in the paper derived from the survey of HR professionals (N=150) and illustrated in case studies.
Integrated technology-driven recruitment produced a 52% reduction in cost-per-hire relative to traditional methods.
Reported quantitative finding from the study's survey (N=150) and supporting case studies (4 organizations).
Adoption of integrated recruitment technology yielded a 45% improvement in candidate quality as measured by first-year performance ratings.
Reported quantitative result from the survey (N=150) and case study evidence using first-year performance ratings as the quality metric.
Organizations adopting integrated technology-driven recruitment platforms experienced an average reduction in time-to-hire of 38%.
Reported quantitative finding based on the paper's mixed-methods analysis (survey of 150 HR professionals and corroborating qualitative case studies of 4 organizations).
These results suggest that LinuxArena has meaningful headroom for both attackers and defenders, making it a strong testbed for developing and evaluating future control protocols.
Authors synthesize results from sabotage evaluations, monitor evaluations, and the LaStraj human-attack dataset to conclude there is room for improvement on both attacker and defender sides; this is presented as an implication/recommendation rather than a strictly measured outcome.
LinuxArena contains 184 side tasks representing safety failures such as data exfiltration and backdooring.
Authors report the number of side tasks and describe their nature (safety failures) in the dataset/control setting documentation.
LinuxArena contains 1,671 main tasks representing legitimate software engineering work.
Authors report the number of main tasks when describing the contents of LinuxArena.
LinuxArena contains 20 environments.
Authors report constructing LinuxArena and state the number of environments explicitly in the paper's description of the dataset/control setting.
We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows; DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains (e.g., coding, crystallography, and music notation).
Paper describes creation of a benchmark/dataset called DELEGATE-52 covering 52 professional domains and designed to simulate long delegated document-editing workflows.
Drawing on Moral Foundations Theory and a multi-stakeholder perspective, the paper argues that moral (mis)alignment matters for the meaningful integration of AI in sensitive contexts.
Paper's theoretical framing and normative claim (method: conceptual synthesis using Moral Foundations Theory and multi-stakeholder argumentation; no empirical sample or quantitative results reported in the supplied text).
Moral alignment is defined as the perceived congruence between the values embedded in an AI system's decision logic and the moral intuitions of stakeholders.
Explicit definitional statement in the paper (conceptual definition; no empirical measurement reported in the supplied text).
Moral alignment may be a more fundamental dimension of human-AI decision-making than functional or behavioral alignment.
Paper's central argumentative claim (theoretical proposition building on conceptual reasoning and prior theory; no empirical evidence or sample size reported in the supplied text).
In high-stakes AI-supported decisions, considerations are not purely technical but involve moral judgments about fairness, responsibility, and harm.
Stated as a conceptual assertion in the paper's framing/abstract; presented as an observation building on prior literature (no empirical method or sample size reported in the supplied text).
Our paper contributes to the emerging discourse on AI overreliance and argues that understanding the appropriate degree of reliance is essential for developers to make the most of these powerful technologies.
Authors' claimed contribution based on synthesis of themes from twenty-two interviews and presentation of the reliance-control framework.
The reliance-control framework can be used to recommend future research to explore different control levels supported by current and emergent LLM-driven tools.
Paper explicitly uses the framework to motivate and recommend directions for future research; based on qualitative interview findings (n=22) and authors' synthesis.
We propose a preliminary reliance-control framework where the level of control can be used to identify AI overreliance and underreliance.
Authors present a conceptual/framework contribution derived from analysis of the twenty-two interviews; this is a proposed (theoretical) framework rather than an experimentally validated one.
Fairness should be evaluated at the system level (the interacting agents) rather than solely at the level of individual models, because fairness can be an emergent, procedural property of decentralized agent interaction.
Conceptual framing supported by the triage experiments showing emergent fairness properties from agent interaction that were not present at the single-agent level.
Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart.
Behavioral observations from the triage negotiation trials, in which aligned agents contested allocations proposed by biased or unaligned agents and adjusted final allocations in ways that increased access for marginalized groups without fully changing the adversarial agent's preferences.
Neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone.
Comparative analysis of individual-agent allocations versus joint allocations after three rounds of negotiation in the hospital triage simulation; claim based on observed differences between solitary and joint outcomes.
Fairness in language models emerges through interaction and exchange among agents, rather than being solely a property of a single, centrally optimized model.
Controlled simulation using a hospital triage framework in which two agents negotiate over three structured debate rounds; one agent is aligned via retrieval-augmented generation (RAG) and the other is unaligned or adversarially prompted. Observed final allocations and negotiation dynamics used to support the claim.
By framing disclosure as epistemic infrastructure, this work outlines a conceptual roadmap for future empirical and design research on Human–AI collaboration.
High-level, forward-looking claim about the paper's contribution to research agenda (conceptual argument). No empirical validation in the abstract.
We contribute a research instrument that operationalizes these configurations in a collaborative chat setting and articulate testable design conjectures.
Paper contribution: a research instrument and set of conjectures described by the authors (design/methodological artifact). The abstract does not report empirical deployment or sample size.
We introduce an AI Disclosure Design Space that conceptualizes disclosure as an epistemic coordination mechanism.
Paper contribution: conceptual artifact (design space) introduced by the authors; this is a descriptive/foundational claim about the paper's contents.
What matters in practice is the design of disclosure: how systems reveal, signal, or conceal AI assistance within collaboration.
Central theoretical argument of the paper (conceptual/design claim); no empirical validation reported in the abstract.
Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency.
Conclusion drawn from the paper's empirical work: identification of the task relevance and user answerability properties, their operationalization via RL rewards, and the CLARITI evaluation showing fewer questions at a matched resolution rate; the abstract does not report experimental details or metrics beyond the 41% reduction.
CLARITI is an 8B-parameter clarification module.
Model specification reported in the abstract; factual description of the trained model's scale (no further empirical detail provided in the abstract).
We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module.
Methodological claim: the paper reports implementation of multi-stage RL rewards and training of a clarification model named CLARITI with 8 billion parameters (claim reported in abstract; no training dataset size reported).
Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide).
Analytical methods reported in the paper: Shapley attribution and distributional comparisons applied to datasets of software engineering tasks and simulated user responses (abstract mentions these methods but gives no numeric sample size).
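For readers unfamiliar with Shapley attribution, the idea is to credit each piece of information with its average marginal contribution to task success over all orders in which the pieces could be supplied. The sketch below computes exact Shapley values for a toy case. It is not the paper's pipeline: the two information items and every success rate in `SUCCESS` are hypothetical numbers chosen for illustration.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for p in order:
            # Marginal gain from adding p to the information already present.
            totals[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical value function: task success rate given which information is present.
SUCCESS = {
    frozenset(): 0.2,
    frozenset({"repro_steps"}): 0.5,
    frozenset({"expected_output"}): 0.4,
    frozenset({"repro_steps", "expected_output"}): 0.8,
}

PHI = shapley_values(["repro_steps", "expected_output"], SUCCESS.__getitem__)
print(PHI)
```

The attributions sum to the gain from supplying everything (0.8 minus 0.2 in this toy case), which is what makes the method usable for ranking which information predicts success. Exact enumeration grows factorially with the number of items, so larger settings typically rely on sampling-based approximations.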
Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions.
Background claim stated in the paper's introduction/abstract; likely supported by literature on underspecified task specifications and/or the authors' motivating examples (no specific sample size or experiment reported in the abstract).
The approach provides a practical path toward more transparent, controllable, and accountable AI use without requiring new model architectures.
Authors' asserted benefit of the proposed interaction-layer framework; no empirical demonstration that transparency, control, or accountability are achieved or that no architectural changes are required in practice.
The framework enables auditable reasoning traces and supports alignment with emerging governance standards, including the EU AI Act and ISO/IEC 42001.
Stated compliance/alignment claim linking the proposed interaction-layer approach to existing regulatory standards; no compliance testing or audit examples reported.