Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
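
A minimal sketch of how the matrix rows could be read programmatically, assuming tab-separated cells and treating a dash as zero (both are assumptions about the export, not documented behaviour). The names `parse_row` and `positive_share` are hypothetical. The listed Total can exceed the sum of the four direction columns (for Governance & Regulation the columns sum to 1515 against a Total of 1539), so the sketch divides by the listed Total rather than by the column sum.

```python
# Three rows copied from the matrix above; "\u2014" is the dash used for empty cells.
ROWS = """\
Governance & Regulation\t815\t391\t188\t121\t1539
AI Safety & Ethics\t214\t276\t65\t33\t593
Labor Share of Income\t17\t17\t17\t\u2014\t51
"""


def parse_row(line):
    """Split one tab-separated row; a dash cell is read as 0 (an assumption)."""
    name, *cells = line.split("\t")
    pos, neg, mixed, null, total = (0 if c == "\u2014" else int(c) for c in cells)
    return {
        "outcome": name,
        "positive": pos,
        "negative": neg,
        "mixed": mixed,
        "null": null,
        "total": total,
    }


def positive_share(row):
    """Positive claims as a fraction of the listed Total."""
    return row["positive"] / row["total"]


rows = [parse_row(line) for line in ROWS.splitlines()]
for row in rows:
    print(f"{row['outcome']}: {positive_share(row):.1%} positive")
```

Dividing by the listed Total keeps the shares comparable across rows even where some claims carry no direction label.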

AI adoption significantly influenced skill transformation (β = 0.67, p < 0.001).

Structural equation modeling (SEM) on the primary survey sample (n=320); reported standardized path coefficient β = 0.67 with p < 0.001.

high positive ARTIFICIAL INTELLIGENCE, AUTOMATION, AND LABOR MARKET TRANSF... skill transformation

AI adoption significantly influenced employment patterns (β = 0.63, p < 0.001).

Structural equation modeling (SEM) on primary survey data from n=320 employees across IT, banking, manufacturing, education, and service sectors; reported standardized path coefficient β = 0.63 with p < 0.001.

high positive ARTIFICIAL INTELLIGENCE, AUTOMATION, AND LABOR MARKET TRANSF... employment patterns

We distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.

Abstract states the paper provides practical design principles derived from the theoretical work; basis is methodological/theoretical synthesis (no empirical sample size provided in abstract).

high positive Logging Policy Design for Off-Policy Evaluation practical guidance / design principles for logging policy selection under constr...

We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective.

Abstract claims demonstration and description of theoretically optimal approaches; evidence likely consists of analytical results and/or illustrative demonstrations in the paper (no sample size reported in abstract).

high positive Logging Policy Design for Off-Policy Evaluation impact of treatment (logging) selection on OPE performance and derivation of opt...

We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time.

Abstract states the development of a framework and derivations of optimal policies across specified informational regimes; evidence is theoretical derivations (no empirical details in abstract).

high positive Logging Policy Design for Off-Policy Evaluation optimal logging policies for minimizing OPE error under different informational ...

In practice OPE accuracy depends heavily on the logging policy used to collect data for computing the estimate.

Assertion in abstract motivated by the authors' study of logging policy design; implies analytical results in the paper relating logging policy to OPE error (no sample size given in abstract).

high positive Logging Policy Design for Off-Policy Evaluation OPE accuracy / OPE error

Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy, enabling high-stakes experimentation without live deployment.

Statement in abstract describing OPE and its role; conceptual/theoretical description (no sample size or empirical study reported in the abstract).

high positive Logging Policy Design for Off-Policy Evaluation ability to estimate target policy value without live deployment (OPE capability)
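
The OPE claims above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's method: it uses inverse propensity scoring (IPS), a standard OPE estimator, on a hypothetical two-action problem. The reward rates, policies, and helper names (`ips_estimate`, `collect_logs`, `spread`) are all made up for illustration. It compares how much the IPS estimate varies when the same target policy is evaluated from data logged by two different logging policies.

```python
import random


def ips_estimate(logs, target_prob):
    """IPS: average of reward * pi_target(action) / pi_logging(action)."""
    return sum(r * target_prob[a] / p for a, p, r in logs) / len(logs)


def collect_logs(logging_prob, mean_reward, n, rng):
    """Simulate n logged interactions as (action, logging propensity, reward)."""
    actions = list(logging_prob)
    weights = [logging_prob[a] for a in actions]
    logs = []
    for _ in range(n):
        a = rng.choices(actions, weights=weights)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0  # Bernoulli reward
        logs.append((a, logging_prob[a], r))
    return logs


rng = random.Random(0)
mean_reward = {"A": 0.2, "B": 0.6}  # hypothetical true reward rates
target = {"A": 0.1, "B": 0.9}       # hypothetical target policy to evaluate
true_value = sum(target[a] * mean_reward[a] for a in target)

uniform = {"A": 0.5, "B": 0.5}      # logging policy that explores both actions
skewed = {"A": 0.95, "B": 0.05}     # logging policy that rarely plays B


def spread(logging_prob, trials=100, n=300):
    """Mean and variance of the IPS estimate over repeated logging runs."""
    ests = [
        ips_estimate(collect_logs(logging_prob, mean_reward, n, rng), target)
        for _ in range(trials)
    ]
    mean = sum(ests) / len(ests)
    var = sum((e - mean) ** 2 for e in ests) / len(ests)
    return mean, var


mean_u, var_u = spread(uniform)
mean_s, var_s = spread(skewed)
print(f"true value      : {true_value:.3f}")
print(f"uniform logging : mean={mean_u:.3f} var={var_u:.5f}")
print(f"skewed logging  : mean={mean_s:.3f} var={var_s:.5f}")
```

In this toy setup both logging policies give unbiased estimates, but the skewed policy rarely plays the action the target policy favours, so its importance weights are large and the estimate is much noisier.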

These verified assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.

User study reported in the paper: a study involving more than 400 participants measured performance on code-comprehension tasks with and without the verified assertions (sample size reported as >400 participants).

high positive Viverra: Text-to-Code with Guarantees users' performance on code-comprehension tasks

Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions.

Empirical evaluation reported in the paper: a test set of 18 programming tasks was used to evaluate Viverra's ability to generate code with verified assertions (sample size = 18 tasks).

high positive Viverra: Text-to-Code with Guarantees rate/ability to generate code with verified assertions

Viverra verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers.

Method description: the paper states that verification is done compositionally and in a best-effort way using a portfolio of bounded model checkers (implementation/algorithmic claim).

high positive Viverra: Text-to-Code with Guarantees verification of assertions using bounded model checkers

Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties.

Method section description: the workflow described in the paper explicitly states LLM prompting to produce C programs and candidate assertions (methodological claim, illustrated with examples).

high positive Viverra: Text-to-Code with Guarantees generation of C program plus candidate assertions

Viverra automatically produces formally verified annotations alongside generated code to aid users' understanding of the generated program.

System description in the paper: Viverra is presented as a system that generates code together with formally verified annotations; implementation details and demonstration are described (no precise external benchmark cited here).

high positive Viverra: Text-to-Code with Guarantees availability of formally verified annotations alongside generated code

Participants cited inclusivity as their primary reason for preferring LLM facilitators.

Post-task survey responses where participants reported reasons for preferring LLM-facilitated discussion; inclusivity reported as the primary reason.

high positive Real-Time Group Dynamics with LLM Facilitation: Evidence fro... self-reported reasons for facilitator preference (inclusivity)

Participants consistently preferred facilitated discussion.

Survey responses collected after deliberation across both studies indicating participant preference for facilitated discussions over no facilitation.

high positive Real-Time Group Dynamics with LLM Facilitation: Evidence fro... participant preference for facilitated discussion (self-report)

The study offers actionable insights for leaders seeking to balance innovation, capability development and ethical governance in AI-enabled workplaces while sustaining human interpretive authority, accountability and responsibility over time.

Implications and recommendations derived from the study's qualitative findings (28 interviews) and interpretive synthesis.

high positive Reimagining work in the age of intelligent automation: a qua... guidance for leadership on balancing innovation and governance

AI reshapes contemporary work by augmenting, rather than substituting, human roles.

Qualitative semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive services in Europe and Asia; thematic and interpretive analysis supported by organizational document review.

high positive Reimagining work in the age of intelligent automation: a qua... nature of human roles (augmentation vs substitution)

The paper proposes a technical and regulatory pivot: bounding the evidentiary weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes (specifically linear probes, activation patching, and before/after-training comparisons).

Policy and technical recommendations presented in the paper (proposal, not empirical test).

high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance_and_regulation

We introduce the concept of 'fragile assurance' to describe cases where the evidential structure does not support the asserted safety claim.

Paper's conceptual contribution defining 'fragile assurance' and illustrating the notion with argumentation/examples.

high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... ai_safety_and_ethics

We formalize the structural mismatch between required and achievable verification access as the 'audit gap'.

Paper introduces a formal definition and conceptual framing called the 'audit gap'.

high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance_and_regulation

AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability.

Paper's review of AI governance frameworks enacted between 2019 and early 2026 (policy/literature review as reported in the paper).

high positive Position: Behavioural Assurance Cannot Verify the Safety Cla... governance_and_regulation

Task complexity positively moderates the relationships between GenAI usage patterns and knowledge integration capability.

Moderation analysis using three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; interaction terms between task complexity and GenAI usage patterns reported to have positive effects on knowledge integration capability.

high positive The impact of generative artificial intelligence (GenAI) usa... knowledge integration capability

Employees' knowledge integration capability plays a critical complementary mediating role in the relationships between GenAI usage patterns (exploitative and exploratory) and creativity.

Mediation analysis conducted on three-wave lagged survey data from 381 matched employees in knowledge-intensive firms in China; knowledge integration capability measured and tested as mediator between GenAI usage patterns and creativity outcomes.

high positive The impact of generative artificial intelligence (GenAI) usa... creativity (incremental and radical) via mediator knowledge integration capabili...

Exploratory GenAI use is more strongly positively associated with radical creativity than incremental creativity.

Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploratory GenAI use with radical vs. incremental creativity (mediation and moderation models reported in paper).

high positive The impact of generative artificial intelligence (GenAI) usa... radical creativity (and compared to incremental creativity)

Exploitative GenAI use is more strongly positively associated with incremental creativity than radical creativity.

Three-wave lagged survey design; 381 valid matched employees from knowledge-intensive firms in China; statistical analysis comparing associations of exploitative GenAI use with incremental vs. radical creativity (mediation and moderation models reported in paper).

high positive The impact of generative artificial intelligence (GenAI) usa... incremental creativity (and compared to radical creativity)

Adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs (strong transferability).

Reported transfer experiments in the abstract showing that evolved adversarial inputs from a small proxy model remain effective against larger commercial models; no numeric transfer success rates provided in abstract.

high positive Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... transfer effectiveness of adversarial inputs (ability to induce overthinking / i...

Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines.

Experimental results reported in the abstract: evaluations on the MATH benchmark and comparisons against benign and manually crafted missing-premise baselines across four SOTA models.

high positive Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... output length (response length) on MATH benchmark

We propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems using a hierarchical genetic algorithm (HGA) operating on structured problem decompositions and optimizing a composite fitness function to maximize response length and reflective overthinking markers.

Methodological description of the proposed approach (HGA and composite fitness) as presented by the authors in the abstract.

high positive Inducing Overthink: Hierarchical Genetic Algorithm-based DoS... ability to induce overthinking / increase in response length (method capability)

Function signatures, constraints and style descriptions emerge as the most influential prompt dimensions affecting the readability of LLM-generated code.

Systematic examination of multiple prompt dimensions in the paper, reporting that function signatures, constraints, and style descriptions had the largest measured influence on readability scores.

high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... impact_of_prompt_dimensions_on_readability

We evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases including World of Code (WoC) and LeetCode.

Empirical evaluation reported in paper using 5,869 scenarios drawn from WoC and LeetCode; LLM-generated code samples were produced and scored with the readability model.

high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... coverage of evaluation / dataset size for readability assessment

We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code.

Description in paper of a newly constructed readability model combining textual, structural, program, and visual features; model development is presented as a methodological contribution (no numeric effect size).

high positive The Readability Spectrum: Patterns, Issues, and Prompt Effec... code_readability (measured via the proposed readability model)

The study demonstrates that recent archival case evidence can be used rigorously to analyze an emerging strategic phenomenon without reducing the study to a purely descriptive literature review.

Methodological claim supported by the paper's demonstration of within-case coding and cross-case pattern matching applied to recent archival documents for the four firms.

high positive Artificial Intelligence Enabled Competitive Intelligence as ... validity and rigor of archival case methods for studying emerging strategic phen...

The paper develops a process view of AIECI built on sensing, interpretation, and orchestration as the sequence through which AI inputs are transformed into competitive intelligence capability, intelligence-informed decisions, and economic outcomes.

Theoretical contribution synthesized from cross-case analysis and conceptual development within the paper.

high positive Artificial Intelligence Enabled Competitive Intelligence as ... conceptual/process model of how AI inputs are transformed into economic outcomes...

Competitive intelligence (the process of sensing, interpreting, and orchestrating responses) rather than AI as a standalone automation tool is the strategic mechanism through which value is created.

Theoretical argument supported by within-case coding and cross-case synthesis of archival materials from four firms demonstrating how AI functions as part of an intelligence infrastructure rather than as isolated automation.

high positive Artificial Intelligence Enabled Competitive Intelligence as ... role of competitive intelligence as the mechanism linking AI inputs to economic ...

Across the four cases, AIECI delivered strategic speed under uncertainty (faster, better-timed decisions in uncertain environments).

Archival case evidence (public disclosures and corporate materials) showing firms using AI-enabled intelligence to accelerate decision cycles and respond more quickly to market signals.

high positive Artificial Intelligence Enabled Competitive Intelligence as ... strategic speed under uncertainty (reduced time-to-decision and faster strategic...

Across the four cases, AIECI improved allocation quality (better targeting and resource allocation decisions).

Within- and cross-case coding of corporate materials from the four sampled firms reporting improvements in campaign targeting, budget allocation, and resource deployment linked to AI-driven intelligence.

high positive Artificial Intelligence Enabled Competitive Intelligence as ... improved allocation quality (better targeting/allocating marketing and operation...

Across the four cases, AIECI produced efficiency gains and cost relief for firms.

Cross-case evidence from archival corporate disclosures and reports for Walmart, Unilever, Sprinklr, and DoubleVerify showing operational/marketing efficiencies and cost savings linked to AI-enabled competitive intelligence.

high positive Artificial Intelligence Enabled Competitive Intelligence as ... efficiency improvements and cost relief (reduced costs or improved resource use ...

Across the four cases, AIECI generated value through revenue acceleration.

Cross-case findings from a qualitative comparative multiple-case design using public archival evidence (annual reports, 10-Ks, earnings releases, corporate materials) for four firms (Walmart, Unilever, Sprinklr, DoubleVerify).

high positive Artificial Intelligence Enabled Competitive Intelligence as ... revenue acceleration (increased sales or faster revenue growth attributed to AIE...

Policy options should centre on building institutional capacity for AGI situational awareness, strengthening Europe's position in the AI value chain, and developing frameworks for international stability in an era of increasingly capable AI systems.

Paper's recommended policy agenda derived from its assessment of risks and gaps (as stated in abstract); the abstract does not report empirical testing of these options or quantified expected effects.

high positive Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

These findings point to a need for a coordinated European preparedness agenda.

Paper's synthesis and policy recommendation based on the identified capability and governance gaps (as stated in abstract); recommendation not supported by quantified impact estimates in the abstract.

high positive Europe and the Geopolitics of AGI: The Need for a Preparedne... governance_and_regulation

A plausible window for AGI emergence falls between 2030 and 2040, or potentially earlier, though substantial uncertainty remains.

Paper's synthesis of empirical trends in AI capabilities, expert forecasting surveys, and policy analysis (as stated in abstract). No specific sample size or survey details provided in the abstract.

high positive Europe and the Geopolitics of AGI: The Need for a Preparedne... other

Visualizing spatial (localization) uncertainty in the annotation interface improves human-in-the-loop annotation (i.e., localization uncertainty is a lever to improve annotation quality/efficiency).

Synthesis/interpretation in the paper based on the controlled study results (120 participants) and box-level analysis showing improved label quality and reduced time when uncertainty cues were shown.

high positive From Model Uncertainty to Human Attention: Localization-Awar... human-in-the-loop annotation quality and efficiency

A box-level analysis confirms that the uncertainty cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes.

Box-level analysis reported in paper comparing annotator behavior across predicted boxes with differing localization uncertainty; analysis shows effort reallocation toward boxes labeled as high-uncertainty.

high positive From Model Uncertainty to Human Attention: Localization-Awar... annotator effort allocation across predicted boxes

In the same controlled study, participants who received uncertainty cues were faster overall (reduced annotation time).

Same controlled user study with 120 participants comparing interfaces with and without spatial-uncertainty visualizations; paper reports that participants with cues were faster overall.

high positive From Model Uncertainty to Human Attention: Localization-Awar... task completion time

In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality.

Controlled user study reported in the paper; 120 participants; comparison between annotators who received visualized spatial-uncertainty cues via a purpose-built interface and those who did not; paper reports label quality outcomes.

high positive From Model Uncertainty to Human Attention: Localization-Awar... label quality

The model identifies simple measures/conditions that characterize when productivity paradoxes and skill polarization arise.

Theoretical derivations and analytical characterizations within the model yielding threshold conditions and measures parameterizing when paradoxical outcomes occur (model-based; no empirical validation).

high positive Human-AI Productivity Paradoxes: Modeling the Interplay of S... predictive conditions/thresholds for productivity paradoxes and skill polarizati...

Replicating the within-subject experiment with simulated users recovers aggregate model hierarchies (i.e., the same ranking of models at the population level).

A replication of the human within-subject experiment using simulated users; authors report that aggregate model ranking/hierarchy is preserved between simulators and humans.

high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... agreement in aggregate model rankings between simulated-user evaluations and hum...

People reward sycophancy and relationship-seeking behaviours in short-term evaluations.

Participant judgments in the blinded multi-turn conversations (same 530-participant experiment) indicated higher short-term preference ratings for outputs exhibiting sycophancy/relationship-seeking.

high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... participant short-term preference ratings for model outputs showing sycophancy/r...

Preference fine-tuning (P-DPO) significantly outperforms both a generic model and personalised prompting in blinded multi-turn conversations with human participants.

Within-subject blinded multi-turn conversation experiment with 530 human participants comparing P-DPO, a generic model, and personalised prompting; statistical comparison reported in paper (claimed 'significantly outperforms').

high positive PRISM-X: Experiments on Personalised Fine-Tuning with Human ... human preference / model ranking as judged by participants in blinded multi-turn...

Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings.

Experimental evaluation section comparing performance of MLLMs, specialized detectors, and human participants on the benchmark.

high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... comparative detection performance across model classes and humans

We synthesized fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models.

Dataset augmentation methodology using six SOTA image editing/generation models to produce fake-damaged images.

high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... generation of fake-damaged images via six models
