Evidence (13827 claims)
| Topic | Claims |
|---|---|
| Adoption | 8454 |
| Productivity | 7544 |
| Governance | 6789 |
| Human-AI Collaboration | 6327 |
| Org Design | 4126 |
| Innovation | 4058 |
| Labor Markets | 3520 |
| Skills & Training | 2924 |
| Inequality | 2057 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
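To compare rows of very different sizes, it can help to convert the counts into shares of each row's Total. The sketch below does this for three rows copied from the matrix. In some rows the four direction counts sum to less than the Total (Firm Productivity lists 598 of 604), and the source does not explain the remainder, so the code reports it separately under the assumed label `unlisted`. The function and variable names are illustrative only.

```python
# Rows copied from the evidence matrix: (positive, negative, mixed, null, total).
rows = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Inequality Measures": (44, 122, 49, 6, 221),
    "Job Displacement": (12, 80, 20, 1, 113),
}

DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the row total.

    'unlisted' (an assumed label, not from the source) is the part of the
    total that the four direction columns do not account for.
    """
    *by_direction, total = counts
    shares = {name: n / total for name, n in zip(DIRECTIONS, by_direction)}
    shares["unlisted"] = (total - sum(by_direction)) / total
    return shares


for outcome, counts in rows.items():
    shares = direction_shares(counts)
    print(outcome, {name: round(value, 3) for name, value in shares.items()})
```

Read this way, Job Displacement is dominated by negative findings (80 of 113 claims), while Firm Productivity is dominated by positive ones (435 of 604).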
AI adoption significantly influenced skill transformation (β = 0.67, p < 0.001).
Structural equation modeling (SEM) on primary survey data from n=320 employees across IT, banking, manufacturing, education, and service sectors; reported standardized path coefficient β = 0.67 with p < 0.001.
AI adoption significantly influenced employment patterns (β = 0.63, p < 0.001).
Structural equation modeling (SEM) on primary survey data from n=320 employees across IT, banking, manufacturing, education, and service sectors; reported standardized path coefficient β = 0.63 with p < 0.001.
We distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
Abstract states the paper provides practical design principles derived from the theoretical work; basis is methodological/theoretical synthesis (no empirical sample size provided in abstract).
We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective.
Abstract claims demonstration and description of theoretically optimal approaches; evidence likely consists of analytical results and/or illustrative demonstrations in the paper (no sample size reported in abstract).
We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time.
Abstract states the development of a framework and derivations of optimal policies across specified informational regimes; evidence is theoretical derivations (no empirical details in abstract).
In practice OPE accuracy depends heavily on the logging policy used to collect data for computing the estimate.
Assertion in abstract motivated by the authors' study of logging policy design; implies analytical results in the paper relating logging policy to OPE error (no sample size given in abstract).
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy, enabling high-stakes experimentation without live deployment.
Statement in abstract describing OPE and its role; conceptual/theoretical description (no sample size or empirical study reported in the abstract).
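A minimal sketch of one standard OPE estimator, inverse propensity scoring (IPS), may help make the dependence on the logging policy concrete. The abstract as summarized here does not say which estimator the paper analyzes, so IPS, the three-action toy problem with Bernoulli rewards, and every name below (`pi_e`, `pi_b`, `reward_mean`) are assumptions for illustration, not the authors' method.

```python
def true_value(pi_e, reward_mean):
    """Expected reward of the target policy pi_e."""
    return sum(p * r for p, r in zip(pi_e, reward_mean))


def ips_estimate(actions, rewards, pi_e, pi_b):
    """IPS estimate of the target policy's value from logged data.

    Each logged reward is reweighted by pi_e(a) / pi_b(a). This requires
    pi_b(a) > 0 for every action the target policy can take (overlap).
    """
    weights = [pi_e[a] / pi_b[a] for a in actions]
    return sum(w * r for w, r in zip(weights, rewards)) / len(actions)


def ips_per_sample_variance(pi_e, pi_b, reward_mean):
    """Variance of a single IPS term, assuming Bernoulli rewards.

    For Bernoulli rewards R^2 = R, so the second moment of the weighted
    reward is sum_a pi_e(a)^2 / pi_b(a) * reward_mean(a).
    """
    value = true_value(pi_e, reward_mean)
    second_moment = sum(
        (pe ** 2 / pb) * r for pe, pb, r in zip(pi_e, pi_b, reward_mean)
    )
    return second_moment - value ** 2


# Hypothetical toy problem: the target policy mostly plays action 2.
pi_e = [0.1, 0.1, 0.8]
reward_mean = [0.2, 0.5, 0.6]

logging_policies = {
    "on-policy": pi_e,
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "mismatched": [0.8, 0.1, 0.1],  # rarely logs the action the target favors
}

for name, pi_b in logging_policies.items():
    print(name, ips_per_sample_variance(pi_e, pi_b, reward_mean))
```

All three logging policies give an unbiased IPS estimate of the same target value. They differ only in variance. Among these three, the mismatched policy is the noisiest because the action the target policy favors receives a large importance weight but is rarely logged.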
These verified assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
User study reported in the paper: a study involving more than 400 participants measured performance on code-comprehension tasks with and without the verified assertions (sample size reported as >400 participants).
Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions.
Empirical evaluation reported in the paper: a test set of 18 programming tasks was used to evaluate Viverra's ability to generate code with verified assertions (sample size = 18 tasks).
Viverra verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers.
Method description: the paper states that verification is done compositionally and in a best-effort way using a portfolio of bounded model checkers (implementation/algorithmic claim).
Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties.
Method section description: the workflow described in the paper explicitly states LLM prompting to produce C programs and candidate assertions (methodological claim, illustrated with examples).
Viverra automatically produces formally verified annotations alongside generated code to aid users' understanding of the generated program.
System description in the paper: Viverra is presented as a system that generates code together with formally verified annotations; implementation details and demonstration are described (no precise external benchmark cited here).
Participants cited inclusivity as their primary reason for preferring LLM facilitators.
Post-task survey responses where participants reported reasons for preferring LLM-facilitated discussion; inclusivity reported as the primary reason.
Participants consistently preferred facilitated discussion.
Survey responses collected after deliberation across both studies indicating participant preference for facilitated discussions over no facilitation.
The study offers actionable insights for leaders seeking to balance innovation, capability development and ethical governance in AI-enabled workplaces while sustaining human interpretive authority, accountability and responsibility over time.
Implications and recommendations derived from the study's qualitative findings (28 interviews) and interpretive synthesis.
AI reshapes contemporary work by augmenting, rather than substituting, human roles.
Qualitative semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive services in Europe and Asia; thematic and interpretive analysis supported by organizational document review.
The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons).
Policy and technical recommendations presented in the paper (proposal, not empirical test).
We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim.
Paper's conceptual contribution defining 'fragile assurance' and illustrating the notion with argumentation/examples.
We formalize the structural mismatch between required and achievable verification access as the 'audit gap'.
Paper introduces a formal definition and conceptual framing called the 'audit gap'.
AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability.
Paper's review of AI governance frameworks enacted between 2019 and early 2026 (policy/literature review as reported in the paper).
Task complexity positively moderates the relationships between GenAI usage patterns and knowledge integration capability.
Moderation analysis using three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; interaction terms between task complexity and GenAI usage patterns reported to have positive effects on knowledge integration capability.
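As a rough illustration of what a positive moderation effect means, the sketch below generates noise-free synthetic data from a linear model with an interaction term and compares the slope of the outcome on usage at two levels of the moderator. The coefficients, the variable names, and the two complexity levels are hypothetical and are not taken from the study, which reports only that the interaction terms were positive.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_outcome(usage, complexity):
    """Synthetic outcome with hypothetical coefficients (not from the study).

    The last term is the interaction: a positive coefficient means the
    effect of usage grows as task complexity increases.
    """
    return 1.0 + 0.5 * usage + 0.3 * complexity + 0.2 * usage * complexity


usage_levels = [1.0, 2.0, 3.0, 4.0, 5.0]

# Simple slopes: regress the outcome on usage separately at each complexity level.
slopes = {}
for complexity in (0.0, 1.0):
    outcome = [simulate_outcome(u, complexity) for u in usage_levels]
    slopes[complexity] = ols_slope(usage_levels, outcome)

# With a one-unit gap between the two levels, this difference equals the interaction coefficient.
interaction = slopes[1.0] - slopes[0.0]
print(slopes, interaction)
```

The slope at the higher complexity level exceeds the slope at the lower level by the interaction coefficient, which is the pattern a positive moderation effect describes.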
Employees' knowledge integration capability plays a critical complementary mediating role in the relationships between GenAI usage patterns (exploitative and exploratory) and creativity.
Mediation analysis conducted on three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; knowledge integration capability measured and tested as mediator between GenAI usage patterns and creativity outcomes.
Exploratory GenAI use is more strongly positively associated with radical creativity than incremental creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploratory GenAI use with radical vs. incremental creativity (mediation and moderation models reported in paper).
Exploitative GenAI use is more strongly positively associated with incremental creativity than radical creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploitative GenAI use with incremental vs. radical creativity (mediation and moderation models reported in paper).
Adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs (strong transferability).
Transfer experiments reported in the abstract show that adversarial inputs evolved on a small proxy model remain effective against larger commercial models; the abstract provides no numeric transfer success rates.
Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines.
Experimental results reported in the abstract: evaluations on the MATH benchmark and comparisons against benign and manually crafted missing-premise baselines across four SOTA models.
We propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems using a hierarchical genetic algorithm (HGA) operating on structured problem decompositions and optimizing a composite fitness function to maximize response length and reflective overthinking markers.
Methodological description of the proposed approach (HGA and composite fitness) as presented by the authors in the abstract.
Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code.
Systematic examination of multiple prompt dimensions in the paper, reporting that function signatures, constraints, and style descriptions had the largest measured influence on readability scores.
We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode.
Empirical evaluation reported in paper using 5,869 scenarios drawn from WoC and LeetCode; LLM-generated code samples were produced and scored with the readability model.
We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code.
Description in paper of a newly constructed readability model combining textual, structural, program, and visual features; model development is presented as a methodological contribution (no numeric effect size).
The study demonstrates that recent archival case evidence can be used rigorously to analyze an emerging strategic phenomenon without reducing the study to a purely descriptive literature review.
Methodological claim supported by the paper's demonstration of within-case coding and cross-case pattern matching applied to recent archival documents for the four firms.
The paper develops a process view of AIECI built on sensing, interpretation, and orchestration as the sequence through which AI inputs are transformed into competitive intelligence capability, intelligence-informed decisions, and economic outcomes.
Theoretical contribution synthesized from cross-case analysis and conceptual development within the paper.
Competitive intelligence (the process of sensing, interpreting, and orchestrating responses), rather than AI as a standalone automation tool, is the strategic mechanism through which value is created.
Theoretical argument supported by within-case coding and cross-case synthesis of archival materials from four firms demonstrating how AI functions as part of an intelligence infrastructure rather than as isolated automation.
Across the four cases, AIECI delivered strategic speed under uncertainty (faster, better-timed decisions in uncertain environments).
Archival case evidence (public disclosures and corporate materials) showing firms using AI-enabled intelligence to accelerate decision cycles and respond more quickly to market signals.
Across the four cases, AIECI improved allocation quality (better targeting and resource allocation decisions).
Within- and cross-case coding of corporate materials from the four sampled firms reporting improvements in campaign targeting, budget allocation, and resource deployment linked to AI-driven intelligence.
Across the four cases, AIECI produced efficiency gains and cost relief for firms.
Cross-case evidence from archival corporate disclosures and reports for Walmart, Unilever, Sprinklr, and DoubleVerify showing operational/marketing efficiencies and cost savings linked to AI-enabled competitive intelligence.
Across the four cases, AIECI generated value through revenue acceleration.
Cross-case findings from a qualitative comparative multiple-case design using public archival evidence (annual reports, 10-Ks, earnings releases, corporate materials) for four firms (Walmart, Unilever, Sprinklr, DoubleVerify).
Policy options should centre on building institutional capacity for AGI situational awareness, strengthening Europe's position in the AI value chain, and developing frameworks for international stability in an era of increasingly capable AI systems.
Paper's recommended policy agenda derived from its assessment of risks and gaps (as stated in abstract); the abstract does not report empirical testing of these options or quantified expected effects.
These findings point to a need for a coordinated European preparedness agenda.
Paper's synthesis and policy recommendation based on the identified capability and governance gaps (as stated in abstract); recommendation not supported by quantified impact estimates in the abstract.
A plausible window for AGI emergence falls between 2030 and 2040, or potentially earlier, though substantial uncertainty remains.
Paper's synthesis of empirical trends in AI capabilities, expert forecasting surveys, and policy analysis (as stated in abstract). No specific sample size or survey details provided in the abstract.
Visualizing spatial (localization) uncertainty in the annotation interface improves human-in-the-loop annotation (i.e., localization uncertainty is a lever to improve annotation quality/efficiency).
Synthesis/interpretation in the paper based on the controlled study results (120 participants) and box-level analysis showing improved label quality and reduced time when uncertainty cues were shown.
A box-level analysis confirms that the uncertainty cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes.
Box-level analysis reported in paper comparing annotator behavior across predicted boxes with differing localization uncertainty; analysis shows effort reallocation toward boxes labeled as high-uncertainty.
In the same controlled study, participants who received uncertainty cues were faster overall (reduced annotation time).
Same controlled user study with 120 participants comparing interfaces with and without spatial-uncertainty visualizations; paper reports that participants with cues were faster overall.
In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality.
Controlled user study reported in the paper; 120 participants; comparison between annotators who received visualized spatial-uncertainty cues via a purpose-built interface and those who did not; paper reports label quality outcomes.
The model identifies simple measures/conditions that characterize when productivity paradoxes and skill polarization arise.
Theoretical derivations and analytical characterizations within the model yielding threshold conditions and measures parameterizing when paradoxical outcomes occur (model-based; no empirical validation).
Replicating the within-subject experiment with simulated users recovers aggregate model hierarchies (i.e., the same ranking of models at the population level).
A replication of the human within-subject experiment using simulated users; authors report that aggregate model ranking/hierarchy is preserved between simulators and humans.
People reward sycophancy and relationship-seeking behaviours in short-term evaluations.
Participant judgments in the blinded multi-turn conversation experiment (530 participants) indicated higher short-term preference ratings for outputs exhibiting sycophancy/relationship-seeking.
Preference fine-tuning (P-DPO) significantly outperforms both a generic model and personalised prompting in blinded multi-turn conversations with human participants.
Within-subject blinded multi-turn conversation experiment with 530 human participants comparing P-DPO, a generic model, and personalised prompting; statistical comparison reported in paper (claimed 'significantly outperforms').
Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings.
Experimental evaluation section comparing performance of MLLMs, specialized detectors, and human participants on the benchmark.
We synthesized fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models.
Dataset augmentation methodology using six SOTA image editing/generation models to produce fake-damaged images.