Evidence (6491 claims)
Adoption (8570 claims)
Productivity (7631 claims)
Governance (6869 claims)
Human-AI Collaboration (6491 claims)
Org Design (4175 claims)
Innovation (4114 claims)
Labor Markets (3566 claims)
Skills & Training (2966 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
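As a reading aid, the direction counts in a row can be turned into shares. The sketch below copies three rows from the matrix above and normalizes each by the sum of its four displayed columns. That normalization is an assumption: the Total column is not always the sum of the displayed directions (Firm Productivity lists 606 against a displayed sum of 600).

```python
# Counts copied from three rows of the matrix: (positive, negative, mixed, null).
rows = {
    "Firm Productivity": (435, 57, 88, 20),
    "Error Rate": (69, 92, 10, 2),
    "Job Displacement": (12, 80, 20, 1),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the displayed counts in one row."""
    total = sum(counts)
    return {name: count / total for name, count in zip(DIRECTIONS, counts)}


shares = {outcome: direction_shares(counts) for outcome, counts in rows.items()}

for outcome, row_shares in shares.items():
    summary = ", ".join(f"{name} {value:.1%}" for name, value in row_shares.items())
    print(f"{outcome}: {summary}")
```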
Human-AI Collaboration
Positive Alignment is a distinct and necessary agenda within AI alignment research.
Normative argumentation in the paper advocating for a separate research agenda (no empirical validation presented).
Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative.
Paper's definitional proposal / conceptual framing (normative definition rather than empirical evidence).
Policy frameworks are necessary to govern verifiable machine intelligence in modern socio-technical infrastructures.
Normative recommendation and policy discussion in the paper; no empirical policy evaluation or legislative case studies are presented in the supplied text.
Process-based supervision has broader implications for algorithmic fairness and can reduce black-box opacity.
High-level discussion in the paper linking process-verifiability to fairness and reduced opacity; no empirical fairness audits or quantitative fairness metrics reported in the provided text.
Integrating reinforcement learning with process-oriented feedback can foster a more transparent AI ecosystem where the path to a conclusion is as scrutinized as the conclusion itself.
Conceptual claim and proposed benefit in the paper; presented as an argument rather than supported by empirical transparency or interpretability studies in the supplied text.
Process-based supervision significantly improves the reliability of models in high-stakes domains such as law, medicine, and engineering.
Asserted by the authors as an advantage of PRMs for high-stakes applications; presented as argumentation rather than backed by reported empirical trials or case-study sample sizes in the provided text.
Optimizing PRMs through reinforcement learning enhances the verifiability and robustness of multi-step reasoning in large-scale model architectures.
Central argumentative claim of the paper (theoretical proposal and conceptual analysis); no experimental results or quantitative evaluation provided in the text supplied.
Process-Based Reward Models (PRMs) assign value to each distinct stage of a reasoning chain, providing a more granular signal for training than outcome-only approaches.
Methodological description and conceptual argument in the paper; described as a design/approach rather than empirically validated with data.
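The contrast between outcome-only and process-based supervision in the claim above can be illustrated with a minimal sketch. It is not the paper's method: `toy_step_scorer` is a hypothetical stand-in for a learned PRM, and the arithmetic chain is invented for illustration.

```python
from typing import Callable, List


def outcome_only_rewards(steps: List[str], final_correct: bool) -> List[float]:
    """Outcome-only supervision: one signal, attached to the final step."""
    rewards = [0.0] * len(steps)
    if steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def process_rewards(
    steps: List[str], score_step: Callable[[List[str], str], float]
) -> List[float]:
    """Process-based supervision: every step is scored given the steps before it."""
    return [score_step(steps[:i], step) for i, step in enumerate(steps)]


def toy_step_scorer(prefix: List[str], step: str) -> float:
    """Hypothetical scorer: checks steps of the form 'a + b = c'.

    A real PRM would be a learned model; this only makes the example runnable.
    """
    left, right = step.split("=")
    a, b = left.split("+")
    return 1.0 if int(a) + int(b) == int(right) else 0.0


# Invented reasoning chain with an error in the second step.
chain = ["2 + 3 = 5", "5 + 4 = 10", "10 + 1 = 11"]

print(outcome_only_rewards(chain, final_correct=False))  # [0.0, 0.0, 0.0]
print(process_rewards(chain, toy_step_scorer))           # [1.0, 0.0, 1.0]
```

The outcome-only signal says only that the chain failed; the per-step signal localizes the faulty step, which is the granularity the claim refers to.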
Methodologically, the study demonstrates how expert reasoning can be operationalized as a benchmark for evaluating AI systems in urban infrastructure contexts, addressing gaps in empirical assessment and governance tools.
Study design: creation of Delphi-derived rubric from 20 experts and its use as an evaluation benchmark for six LLMs; reported as a methodological contribution.
The Delphi process elicited and refined expert reasoning criteria, producing a rubric that emphasized public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability.
Method: Delphi process with 20 infrastructure professionals that generated and refined reasoning criteria; resulting rubric content reported in paper.
Experiments show consistent advantages in viewer engagement.
Reported experimental comparison vs named baselines; claim of consistent advantage in viewer engagement without numeric effect size provided in the excerpt.
Experiments show consistent advantages in tactfulness.
Reported experimental comparison vs named baselines; claim of consistent advantage in tactfulness without numeric effect size provided in the excerpt.
Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 18% on factual correctness.
Reported experimental comparison vs named baselines; specific numeric improvement stated (18% gain on factual correctness). Evaluation dataset or sample size not provided in the excerpt.
Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness.
Reported experimental comparison vs named baselines; specific numeric improvement stated (23% gain on informativeness). Evaluation dataset or sample size not provided in the excerpt.
We fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection.
Methodological contribution: fine-tuning an LLM on the collected annotated data, described in the paper.
We collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents.
Dataset creation reported in the methods: explicitly states 1,475 annotated live-commerce interactions.
We construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise.
Methodological contribution described in the paper: construction of a domain knowledge base and curated sales terminology lexicon.
A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure.
Conceptual description in the paper's introduction; no empirical data or experimental method cited in the excerpt.
A causal ablation confirms that each of the four mechanical enforcement primitives is individually necessary.
Causal ablation experiments reported by authors in the synthetic banking domain: removing each primitive degrades performance/governance, implying individual necessity. Abstract does not report exact experimental counts or effect sizes.
Mechanical enforcement raises task accuracy from MCC ~0.43 to 0.88.
Reported Matthews correlation coefficient (MCC) for task accuracy under text-only governance (≈0.43) versus mechanical enforcement (≈0.88) in the paper's synthetic experiments; sample size not provided in abstract.
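For readers unfamiliar with the metric, the Matthews correlation coefficient can be computed from a binary confusion matrix as sketched below. The confusion-matrix counts are made up for illustration and are not the paper's data; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


# Illustrative counts only, not taken from the paper.
print(mcc(tp=45, tn=45, fp=5, fn=5))
```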
Mechanical enforcement more than doubles deferral information content.
Comparison of information-content measures for deferrals between text-only governance and mechanical enforcement in the synthetic banking domain experiments; exact numeric basis not given in abstract.
Mechanical enforcement reduces the rate of deferrals that carry no decision-relevant information by 73%.
Head-to-head comparison between text-only governance and a mechanically enforced architecture (four primitives) in the paper's synthetic banking experiments; specific sample size not stated in abstract.
There is a positive spillover effect on AI-ineligible chats: treated workers adapted their multitasking workflow to devote greater attention to these chats.
Experiment-level observations comparing worker behavior on AI-ineligible chats between treatment and control; treated workers reallocated attention/effort (multitasking workflow changes) leading to improved attention on AI-ineligible chats.
Early intervention is essential for sustaining high post-escalation intervention effort.
Temporal analysis of intervention timing within the randomized experiment showing an association between earlier human intervention after escalation and higher subsequent intervention effort.
Human intervention preserves service quality in algorithm-triggered technical escalations (unresolved customer issues beyond the AI's capability).
Experimental subgroup analysis of escalations categorized as algorithm-triggered technical escalations; post-escalation human interventions were observed to maintain service quality in these cases.
By reframing reskilling as a shared, supported, and bounded process, AI-driven change can foster long-term career resilience, professional identity renewal, and sustainable human–AI integration.
Conceptual conclusion/implication drawn by the authors from the proposed model and recommendations; no empirical validation included in the paper.
The paper advances a set of sustainable, collective strategies—such as role-linked learning, protected learning time, skill prioritization, and phased AI adoption—to interrupt the reskilling loop and redistribute adaptive demands across organizations.
Prescriptive/theoretical recommendations proposed by the authors; no empirical evaluation or trial evidence presented.
The paper proposes a reconstructed labour law framework based on economic dependency rather than traditional employment classification, including recognition of dependent contractor status, platform liability for worker welfare, algorithmic transparency, social security obligations, and specialised grievance mechanisms.
Normative legal/policy proposal articulated by the author(s) based on theoretical argument and the comparative analysis of existing regulatory gaps; prescriptive recommendation rather than empirically tested intervention.
The appropriate design response to Metis tasks is centaur architectures in which humans lead and AI supports, rather than pursuing further automation.
Prescriptive recommendation based on the conceptual analysis and normative reasoning in the paper; not supported by empirical evaluation or quantified comparisons of architectures.
These verified assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
User study reported in the paper: more than 400 participants performed code-comprehension tasks with and without the verified assertions.
Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions.
Empirical evaluation reported in the paper: Viverra's ability to generate code with verified assertions was evaluated on a test set of 18 programming tasks.
Viverra verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers.
Method description: the paper states that verification is done compositionally and in a best-effort way using a portfolio of bounded model checkers (implementation/algorithmic claim).
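One way to picture compositional, best-effort verification with a portfolio is sketched below. This is not Viverra's implementation: the `Verdict` type, the per-function grouping of assertions, and both toy checkers are hypothetical, and the checkers only imitate backends with different strengths rather than performing bounded model checking.

```python
from enum import Enum
from typing import Callable, Dict, List


class Verdict(Enum):
    VERIFIED = "verified"
    REFUTED = "refuted"
    UNKNOWN = "unknown"


# Hypothetical checker interface: (assertion text, unrolling bound) -> verdict.
Checker = Callable[[str, int], Verdict]


def check_with_portfolio(assertion: str, checkers: List[Checker], bound: int) -> Verdict:
    """Try each checker in turn; the first conclusive verdict wins."""
    for checker in checkers:
        verdict = checker(assertion, bound)
        if verdict is not Verdict.UNKNOWN:
            return verdict
    # Best effort: if no checker is conclusive, the assertion stays unverified.
    return Verdict.UNKNOWN


def verify_program(
    assertions_by_function: Dict[str, List[str]], checkers: List[Checker], bound: int
) -> Dict[str, Dict[str, Verdict]]:
    """Compositional (assumed to mean per function): check each function separately."""
    return {
        function: {a: check_with_portfolio(a, checkers, bound) for a in assertions}
        for function, assertions in assertions_by_function.items()
    }


def fast_checker(assertion: str, bound: int) -> Verdict:
    # Toy stand-in: gives up on anything that mentions a loop.
    return Verdict.UNKNOWN if "loop" in assertion else Verdict.VERIFIED


def slow_checker(assertion: str, bound: int) -> Verdict:
    # Toy stand-in: conclusive only for small bounds.
    return Verdict.VERIFIED if bound <= 8 else Verdict.UNKNOWN


program = {"sum_array": ["result >= 0", "loop index < n"]}
portfolio = [fast_checker, slow_checker]

print(verify_program(program, portfolio, bound=8))
print(verify_program(program, portfolio, bound=10))
```

With the larger bound, neither toy checker is conclusive for the loop assertion, so it is reported as unknown instead of blocking the other results.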
Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties.
Method section description: the workflow described in the paper explicitly states LLM prompting to produce C programs and candidate assertions (methodological claim, illustrated with examples).
Viverra automatically produces formally verified annotations alongside generated code to aid users' understanding of the generated program.
System description in the paper: Viverra is presented as a system that generates code together with formally verified annotations; implementation details and demonstration are described (no precise external benchmark cited here).
Participants cited inclusivity as their primary reason for preferring LLM facilitators.
Post-task survey responses where participants reported reasons for preferring LLM-facilitated discussion; inclusivity reported as the primary reason.
Participants consistently preferred facilitated discussion.
Survey responses collected after deliberation across both studies indicating participant preference for facilitated discussions over no facilitation.
The study offers actionable insights for leaders seeking to balance innovation, capability development and ethical governance in AI-enabled workplaces while sustaining human interpretive authority, accountability and responsibility over time.
Implications and recommendations derived from the study's qualitative findings (28 interviews) and interpretive synthesis.
AI reshapes contemporary work by augmenting, rather than substituting, human roles.
Qualitative semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive services in Europe and Asia; thematic and interpretive analysis supported by organizational document review.
The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons).
Policy and technical recommendations presented in the paper (proposal, not empirical test).
We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim.
Paper's conceptual contribution defining 'fragile assurance' and illustrating the notion with argumentation/examples.
We formalize the structural mismatch between required and achievable verification access as the 'audit gap'.
Paper introduces a formal definition and conceptual framing called the 'audit gap'.
AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability.
Paper's review of AI governance frameworks enacted between 2019 and early 2026 (policy/literature review as reported in the paper).
Task complexity positively moderates the relationships between GenAI usage patterns and knowledge integration capability.
Moderation analysis using three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; interaction terms between task complexity and GenAI usage patterns reported to have positive effects on knowledge integration capability.
Employees' knowledge integration capability plays a critical complementary mediating role in the relationships between GenAI usage patterns (exploitative and exploratory) and creativity.
Mediation analysis conducted on three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; knowledge integration capability measured and tested as mediator between GenAI usage patterns and creativity outcomes.
Exploratory GenAI use is more strongly positively associated with radical creativity than incremental creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploratory GenAI use with radical vs. incremental creativity (mediation and moderation models reported in paper).
Exploitative GenAI use is more strongly positively associated with incremental creativity than radical creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploitative GenAI use with incremental vs. radical creativity (mediation and moderation models reported in paper).
Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code.
Systematic examination of multiple prompt dimensions in the paper, reporting that function signatures, constraints, and style descriptions had the largest measured influence on readability scores.
We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode.
Empirical evaluation reported in paper using 5,869 scenarios drawn from WoC and LeetCode; LLM-generated code samples were produced and scored with the readability model.
We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code.
Description in paper of a newly constructed readability model combining textual, structural, program, and visual features; model development is presented as a methodological contribution (no numeric effect size).
Visualizing spatial (localization) uncertainty in the annotation interface improves human-in-the-loop annotation, making localization uncertainty a lever for improving annotation quality and efficiency.
Synthesis/interpretation in the paper based on the controlled study results (120 participants) and box-level analysis showing improved label quality and reduced time when uncertainty cues were shown.