The Commonplace

Evidence (13827 claims)

Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                     Positive Negative    Mixed     Null    Total
Other                            749      195       97      889     1979
Governance & Regulation          815      391      188      121     1539
Organizational Efficiency        771      189      124       83     1177
Technology Adoption Rate         624      233      123       96     1084
Research Productivity            410      121       56      331      929
Output Quality                   466      177       59       47      749
Decision Quality                 320      174       75       42      618
Firm Productivity                435       55       88       20      604
AI Safety & Ethics               214      276       65       33      593
Market Structure                 178      166      122       24      495
Task Allocation                  206       64       70       31      376
Skill Acquisition                165       57       60       17      299
Innovation Output                201       27       41       18      288
Employment Level                 105       51      107       13      278
Fiscal & Macroeconomic           131       69       43       26      276
Consumer Welfare                 116       63       42       11      232
Firm Revenue                     149       46       26        3      224
Inequality Measures               44      122       49        6      221
Task Completion Time             169       29        8       12      219
Worker Satisfaction               89       61       20       12      182
Error Rate                        69       91       10        2      172
Regulatory Compliance             76       68       14        5      163
Training Effectiveness            92       19       13       19      145
Wages & Compensation              77       36       25        6      144
Automation Exposure               51       54       22       12      142
Team Performance                  86       17       27        9      140
Developer Productivity            94       17       14        6      132
Job Displacement                  12       80       20        1      113
Hiring & Recruitment              51        7        8        3       69
Skill Obsolescence                 5       45        6        1       57
Creative Output                   31       16        7        2       57
Social Protection                 27       16        8        2       53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
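The matrix is easier to compare across outcomes when each direction is expressed as a share of the row's Total. The sketch below does that for three rows copied from the table; the function name and the choice of Total as the denominator are assumptions made for illustration. Note that the four direction counts do not always add up to Total (Firm Productivity: 598 of 604), so the shares need not sum to one.

```python
# A few rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "AI Safety & Ethics": (214, 276, 65, 33, 593),
    "Job Displacement": (12, 80, 20, 1, 113),
}

def direction_shares(counts):
    """Return each direction's share of the row's Total column (illustrative helper)."""
    positive, negative, mixed, null, total = counts
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
    }

shares = {name: direction_shares(c) for name, c in ROWS.items()}
for name, s in shares.items():
    print(f"{name}: " + ", ".join(f"{k} {v:.1%}" for k, v in s.items()))
```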
AI adoption significantly influenced skill transformation (β = 0.67, p < 0.001).
Structural equation modeling (SEM) on the survey sample (n=320); reported standardized path coefficient β = 0.67 with p < 0.001.
AI adoption significantly influenced employment patterns (β = 0.63, p < 0.001).
Structural equation modeling (SEM) on primary survey data from n=320 employees across IT, banking, manufacturing, education, and service sectors; reported standardized path coefficient β = 0.63 with p < 0.001.
We distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
Abstract states the paper provides practical design principles derived from the theoretical work; basis is methodological/theoretical synthesis (no empirical sample size provided in abstract).
high positive Logging Policy Design for Off-Policy Evaluation practical guidance / design principles for logging policy selection under constr...
We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective.
Abstract claims demonstration and description of theoretically optimal approaches; evidence likely consists of analytical results and/or illustrative demonstrations in the paper (no sample size reported in abstract).
high positive Logging Policy Design for Off-Policy Evaluation impact of treatment (logging) selection on OPE performance and derivation of opt...
We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time.
Abstract states the development of a framework and derivations of optimal policies across specified informational regimes; evidence is theoretical derivations (no empirical details in abstract).
high positive Logging Policy Design for Off-Policy Evaluation optimal logging policies for minimizing OPE error under different informational ...
In practice OPE accuracy depends heavily on the logging policy used to collect data for computing the estimate.
Assertion in abstract motivated by the authors' study of logging policy design; implies analytical results in the paper relating logging policy to OPE error (no sample size given in abstract).
high positive Logging Policy Design for Off-Policy Evaluation OPE accuracy / OPE error
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy, enabling high-stakes experimentation without live deployment.
Statement in abstract describing OPE and its role; conceptual/theoretical description (no sample size or empirical study reported in the abstract).
high positive Logging Policy Design for Off-Policy Evaluation ability to estimate target policy value without live deployment (OPE capability)
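As a rough illustration of the OPE setup described above, the sketch below uses inverse propensity scoring (IPS), one standard OPE estimator; the listing does not say which estimators the paper analyses. The two-action bandit, the reward rates, and both policies are hypothetical values chosen only to make the example runnable.

```python
import random

def ips_estimate(logs, target_policy):
    """Inverse propensity scoring: mean of reward * pi_target(a) / pi_log(a)."""
    total = 0.0
    for action, reward, logging_prob in logs:
        total += reward * target_policy[action] / logging_prob
    return total / len(logs)

def collect_logs(logging_policy, mean_reward, n, rng):
    """Simulate n logged rounds as (action, reward, propensity under the logging policy)."""
    actions = list(logging_policy)
    weights = [logging_policy[a] for a in actions]
    logs = []
    for _ in range(n):
        a = rng.choices(actions, weights=weights)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        logs.append((a, r, logging_policy[a]))
    return logs

rng = random.Random(0)
mean_reward = {"A": 0.2, "B": 0.6}      # hypothetical Bernoulli reward rates
logging_policy = {"A": 0.5, "B": 0.5}   # policy that collected the data
target_policy = {"A": 0.1, "B": 0.9}    # policy to evaluate without deploying it

logs = collect_logs(logging_policy, mean_reward, 20000, rng)
estimate = ips_estimate(logs, target_policy)
true_value = sum(target_policy[a] * mean_reward[a] for a in target_policy)
print(f"IPS estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

The importance weights are larger when the logging policy rarely takes actions the target policy favours, which is one way the choice of logging policy feeds into the estimator's error.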
These verified assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
User study reported in the paper: a study involving more than 400 participants measured performance on code-comprehension tasks with and without the verified assertions (sample size reported as >400 participants).
high positive Viverra: Text-to-Code with Guarantees users' performance on code-comprehension tasks
Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions.
Empirical evaluation reported in the paper: a test set of 18 programming tasks was used to evaluate Viverra's ability to generate code with verified assertions (sample size = 18 tasks).
high positive Viverra: Text-to-Code with Guarantees rate/ability to generate code with verified assertions
Viverra verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers.
Method description: the paper states that verification is done compositionally and in a best-effort way using a portfolio of bounded model checkers (implementation/algorithmic claim).
high positive Viverra: Text-to-Code with Guarantees verification of assertions using bounded model checkers
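The idea of checking candidate assertions on a best-effort basis can be pictured with a toy stand-in. The sketch below is not Viverra's pipeline and does not call a bounded model checker: it is written in Python rather than C, the function and the three candidate assertions are hypothetical, and "verification" here just means exhaustively trying every input inside a small bound.

```python
from itertools import product

def clamp_add(a, b, limit=100):
    """Hypothetical generated function: add two non-negative ints, saturating at limit."""
    s = a + b
    return s if s <= limit else limit

# Hypothetical candidate assertions: name -> predicate over (a, b, result).
CANDIDATES = {
    "result_within_limit": lambda a, b, r: r <= 100,
    "result_at_least_each_input": lambda a, b, r: r >= a and r >= b,
    "result_equals_sum": lambda a, b, r: r == a + b,  # false once the sum saturates
}

def check_bounded(func, candidates, bound):
    """Try each candidate on all inputs in [0, bound]^2; record the first counterexample."""
    verdicts = {}
    for name, pred in candidates.items():
        counterexample = None
        for a, b in product(range(bound + 1), repeat=2):
            if not pred(a, b, func(a, b)):
                counterexample = (a, b)
                break
        verdicts[name] = counterexample
    return verdicts

verdicts = check_bounded(clamp_add, CANDIDATES, bound=100)
for name, cex in verdicts.items():
    print(name, "holds within bound" if cex is None else f"fails at {cex}")
```

Only the assertions with no counterexample inside the bound would be kept; a pass here says nothing about inputs outside the bound.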
Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties.
Method section description: the workflow described in the paper explicitly states LLM prompting to produce C programs and candidate assertions (methodological claim, illustrated with examples).
high positive Viverra: Text-to-Code with Guarantees generation of C program plus candidate assertions
Viverra automatically produces formally verified annotations alongside generated code to aid users' understanding of the generated program.
System description in the paper: Viverra is presented as a system that generates code together with formally verified annotations; implementation details and demonstration are described (no precise external benchmark cited here).
high positive Viverra: Text-to-Code with Guarantees availability of formally verified annotations alongside generated code
Participants cited inclusivity as their primary reason for preferring LLM facilitators.
Post-task survey responses where participants reported reasons for preferring LLM-facilitated discussion; inclusivity reported as the primary reason.
high positive Real-Time Group Dynamics with LLM Facilitation: Evidence fro... self-reported reasons for facilitator preference (inclusivity)
Participants consistently preferred facilitated discussion.
Survey responses collected after deliberation across both studies indicating participant preference for facilitated discussions over no facilitation.
high positive Real-Time Group Dynamics with LLM Facilitation: Evidence fro... participant preference for facilitated discussion (self-report)
The study offers actionable insights for leaders seeking to balance innovation, capability development and ethical governance in AI-enabled workplaces while sustaining human interpretive authority, accountability and responsibility over time.
Implications and recommendations derived from the study's qualitative findings (28 interviews) and interpretive synthesis.
high positive Reimagining work in the age of intelligent automation: a qua... guidance for leadership on balancing innovation and governance
AI reshapes contemporary work by augmenting, rather than substituting, human roles.
Qualitative semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive services in Europe and Asia; thematic and interpretive analysis supported by organizational document review.
high positive Reimagining work in the age of intelligent automation: a qua... nature of human roles (augmentation vs substitution)
The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons).
Policy and technical recommendations presented in the paper (proposal, not empirical test).
high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance_and_regulation
We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim.
Paper's conceptual contribution defining 'fragile assurance' and illustrating the notion with argumentation/examples.
We formalize the structural mismatch between required and achievable verification access as the 'audit gap'.
Paper introduces a formal definition and conceptual framing called the 'audit gap'.
high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance_and_regulation
AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability.
Paper's review of AI governance frameworks enacted between 2019 and early 2026 (policy/literature review as reported in the paper).
high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance_and_regulation
Task complexity positively moderates the relationships between GenAI usage patterns and knowledge integration capability.
Moderation analysis using three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; interaction terms between task complexity and GenAI usage patterns reported to have positive effects on knowledge integration capability.
high positive The impact of generative artificial intelligence (GenAI) usa... knowledge integration capability
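A positive moderation effect means the slope of the outcome on the predictor is steeper when the moderator is higher. The sketch below illustrates this with synthetic data and a binary high/low complexity flag, which is a simplification: the study's measures, scales, and model specification are not given in the listing, and the slopes 0.2 and 0.5 are invented. With a binary moderator, the difference between the two group slopes equals the interaction coefficient of a fully interacted linear regression.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
low, high = ([], []), ([], [])   # (usage, capability) samples per complexity group
for _ in range(4000):
    usage = rng.gauss(0, 1)               # synthetic GenAI usage score
    complex_task = rng.random() < 0.5     # synthetic binary task-complexity flag
    slope = 0.5 if complex_task else 0.2  # invented slopes, for illustration only
    capability = slope * usage + rng.gauss(0, 1)
    xs, ys = high if complex_task else low
    xs.append(usage)
    ys.append(capability)

slope_low = ols_slope(*low)
slope_high = ols_slope(*high)
interaction = slope_high - slope_low      # usage x complexity interaction estimate
print(f"low: {slope_low:.2f}, high: {slope_high:.2f}, interaction: {interaction:.2f}")
```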
Employees' knowledge integration capability plays a critical complementary mediating role in the relationships between GenAI usage patterns (exploitative and exploratory) and creativity.
Mediation analysis conducted on three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; knowledge integration capability measured and tested as mediator between GenAI usage patterns and creativity outcomes.
high positive The impact of generative artificial intelligence (GenAI) usa... creativity (incremental and radical) via mediator knowledge integration capabili...
Exploratory GenAI use is more strongly positively associated with radical creativity than incremental creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploratory GenAI use with radical vs. incremental creativity (mediation and moderation models reported in paper).
high positive The impact of generative artificial intelligence (GenAI) usa... radical creativity (and compared to incremental creativity)
Exploitative GenAI use is more strongly positively associated with incremental creativity than radical creativity.
Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploitative GenAI use with incremental vs. radical creativity (mediation and moderation models reported in paper).
high positive The impact of generative artificial intelligence (GenAI) usa... incremental creativity (and compared to radical creativity)
Adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs (strong transferability).
Reported transfer experiments in the abstract showing that evolved adversarial inputs from a small proxy model remain effective against larger commercial models; no numeric transfer success rates provided in abstract.
high positive Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... transfer effectiveness of adversarial inputs (ability to induce overthinking / i...
Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines.
Experimental results reported in the abstract: evaluations on the MATH benchmark and comparisons against benign and manually crafted missing-premise baselines across four SOTA models.
high positive Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... output length (response length) on MATH benchmark
We propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems using a hierarchical genetic algorithm (HGA) operating on structured problem decompositions and optimizing a composite fitness function to maximize response length and reflective overthinking markers.
Methodological description of the proposed approach (HGA and composite fitness) as presented by the authors in the abstract.
high positive Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... ability to induce overthinking / increase in response length (method capability)
Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code.
Systematic examination of multiple prompt dimensions in the paper, reporting that function signatures, constraints, and style descriptions had the largest measured influence on readability scores.
high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... impact_of_prompt_dimensions_on_readability
We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode.
Empirical evaluation reported in paper using 5,869 scenarios drawn from WoC and LeetCode; LLM-generated code samples were produced and scored with the readability model.
high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... coverage of evaluation / dataset size for readability assessment
We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code.
Description in paper of a newly constructed readability model combining textual, structural, program, and visual features; model development is presented as a methodological contribution (no numeric effect size).
high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... code_readability (measured via the proposed readability model)
The study demonstrates that recent archival case evidence can be used rigorously to analyze an emerging strategic phenomenon without reducing the study to a purely descriptive literature review.
Methodological claim supported by the paper's demonstration of within-case coding and cross-case pattern matching applied to recent archival documents for the four firms.
high positive Artificial Intelligence Enabled Competitive Intelligence as ... validity and rigor of archival case methods for studying emerging strategic phen...
The paper develops a process view of AIECI built on sensing, interpretation, and orchestration as the sequence through which AI inputs are transformed into competitive intelligence capability, intelligence-informed decisions, and economic outcomes.
Theoretical contribution synthesized from cross-case analysis and conceptual development within the paper.
high positive Artificial Intelligence Enabled Competitive Intelligence as ... conceptual/process model of how AI inputs are transformed into economic outcomes...
Competitive intelligence (the process of sensing, interpreting, and orchestrating responses) rather than AI as a standalone automation tool is the strategic mechanism through which value is created.
Theoretical argument supported by within-case coding and cross-case synthesis of archival materials from four firms demonstrating how AI functions as part of an intelligence infrastructure rather than as isolated automation.
high positive Artificial Intelligence Enabled Competitive Intelligence as ... role of competitive intelligence as the mechanism linking AI inputs to economic ...
Across the four cases, AIECI delivered strategic speed under uncertainty (faster, better-timed decisions in uncertain environments).
Archival case evidence (public disclosures and corporate materials) showing firms using AI-enabled intelligence to accelerate decision cycles and respond more quickly to market signals.
high positive Artificial Intelligence Enabled Competitive Intelligence as ... strategic speed under uncertainty (reduced time-to-decision and faster strategic...
Across the four cases, AIECI improved allocation quality (better targeting and resource allocation decisions).
Within- and cross-case coding of corporate materials from the four sampled firms reporting improvements in campaign targeting, budget allocation, and resource deployment linked to AI-driven intelligence.
high positive Artificial Intelligence Enabled Competitive Intelligence as ... improved allocation quality (better targeting/allocating marketing and operation...
Across the four cases, AIECI produced efficiency gains and cost relief for firms.
Cross-case evidence from archival corporate disclosures and reports for Walmart, Unilever, Sprinklr, and DoubleVerify showing operational/marketing efficiencies and cost savings linked to AI-enabled competitive intelligence.
high positive Artificial Intelligence Enabled Competitive Intelligence as ... efficiency improvements and cost relief (reduced costs or improved resource use ...
Across the four cases, AIECI generated value through revenue acceleration.
Cross-case findings from a qualitative comparative multiple-case design using public archival evidence (annual reports, 10-Ks, earnings releases, corporate materials) for four firms (Walmart, Unilever, Sprinklr, DoubleVerify).
high positive Artificial Intelligence Enabled Competitive Intelligence as ... revenue acceleration (increased sales or faster revenue growth attributed to AIE...
Policy options should centre on building institutional capacity for AGI situational awareness, strengthening Europe's position in the AI value chain, and developing frameworks for international stability in an era of increasingly capable AI systems.
Paper's recommended policy agenda derived from its assessment of risks and gaps (as stated in abstract); the abstract does not report empirical testing of these options or quantified expected effects.
high positive Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation
These findings point to a need for a coordinated European preparedness agenda.
Paper's synthesis and policy recommendation based on the identified capability and governance gaps (as stated in abstract); recommendation not supported by quantified impact estimates in the abstract.
high positive Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation
A plausible window for AGI emergence falls between 2030 and 2040, or potentially earlier, though substantial uncertainty remains.
Paper's synthesis of empirical trends in AI capabilities, expert forecasting surveys, and policy analysis (as stated in abstract). No specific sample size or survey details provided in the abstract.
Visualizing spatial (localization) uncertainty in the annotation interface improves the quality and efficiency of human-in-the-loop annotation.
Synthesis/interpretation in the paper based on the controlled study results (120 participants) and box-level analysis showing improved label quality and reduced time when uncertainty cues were shown.
high positive From Model Uncertainty to Human Attention: Localization-Awar... human-in-the-loop annotation quality and efficiency
A box-level analysis confirms that the uncertainty cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes.
Box-level analysis reported in paper comparing annotator behavior across predicted boxes with differing localization uncertainty; analysis shows effort reallocation toward boxes labeled as high-uncertainty.
high positive From Model Uncertainty to Human Attention: Localization-Awar... annotator effort allocation across predicted boxes
In the controlled study with 120 participants, those who received uncertainty cues were faster overall (reduced annotation time).
Same controlled user study with 120 participants comparing interfaces with and without spatial-uncertainty visualizations; paper reports that participants with cues were faster overall.
In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality.
Controlled user study reported in the paper; 120 participants; comparison between annotators who received visualized spatial-uncertainty cues via a purpose-built interface and those who did not; paper reports label quality outcomes.
The model identifies simple measures/conditions that characterize when productivity paradoxes and skill polarization arise.
Theoretical derivations and analytical characterizations within the model yielding threshold conditions and measures parameterizing when paradoxical outcomes occur (model-based; no empirical validation).
high positive Human-AI Productivity Paradoxes: Modeling the Interplay of S... predictive conditions/thresholds for productivity paradoxes and skill polarizati...
Replicating the within-subject experiment with simulated users recovers aggregate model hierarchies (i.e., the same ranking of models at the population level).
A replication of the human within-subject experiment using simulated users; authors report that aggregate model ranking/hierarchy is preserved between simulators and humans.
high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... agreement in aggregate model rankings between simulated-user evaluations and hum...
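One way to make "recovers aggregate model hierarchies" concrete is to compare the ranking of models implied by human scores with the ranking implied by simulated-user scores. The sketch below does this with Kendall's tau; the model names and scores are hypothetical placeholders, not values from the paper, and the listing does not say which agreement measure the authors used.

```python
from itertools import combinations

def ranking(scores):
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same models (assumes no ties)."""
    concordant = discordant = 0
    for m1, m2 in combinations(scores_a, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical mean preference scores per model; placeholders only.
human = {"model_a": 0.62, "model_b": 0.51, "model_c": 0.45}
simulated = {"model_a": 0.70, "model_b": 0.48, "model_c": 0.41}

same_order = ranking(human) == ranking(simulated)
tau = kendall_tau(human, simulated)
print(f"same ranking: {same_order}, Kendall tau: {tau:.2f}")
```

A tau of 1 means every pair of models is ordered the same way by both groups; agreement at this aggregate level does not imply agreement for individual participants.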
People reward sycophancy and relationship-seeking behaviours in short-term evaluations.
Participant judgments in the blinded multi-turn conversations (same 530-participant experiment) indicated higher short-term preference ratings for outputs exhibiting sycophancy/relationship-seeking.
high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... participant short-term preference ratings for model outputs showing sycophancy/r...
Preference fine-tuning (P-DPO) significantly outperforms both a generic model and personalised prompting in blinded multi-turn conversations with human participants.
Within-subject blinded multi-turn conversation experiment with 530 human participants comparing P-DPO, a generic model, and personalised prompting; statistical comparison reported in paper (claimed 'significantly outperforms').
high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... human preference / model ranking as judged by participants in blinded multi-turn...
Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings.
Experimental evaluation section comparing performance of MLLMs, specialized detectors, and human participants on the benchmark.
high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... comparative detection performance across model classes and humans
We synthesized fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models.
Dataset augmentation methodology using six SOTA image editing/generation models to produce fake-damaged images.
high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... generation of fake-damaged images via six models