The Commonplace

Evidence (13870 claims)

Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome | Positive | Negative | Mixed | Null | Total
Other | 749 | 196 | 98 | 892 | 1984
Governance & Regulation | 817 | 394 | 188 | 121 | 1544
Organizational Efficiency | 771 | 189 | 124 | 83 | 1177
Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088
Research Productivity | 411 | 123 | 56 | 332 | 933
Output Quality | 467 | 178 | 59 | 47 | 751
Decision Quality | 320 | 174 | 75 | 42 | 618
Firm Productivity | 435 | 55 | 88 | 20 | 604
AI Safety & Ethics | 214 | 276 | 65 | 33 | 593
Market Structure | 178 | 167 | 122 | 24 | 496
Task Allocation | 207 | 64 | 71 | 32 | 379
Skill Acquisition | 165 | 59 | 60 | 17 | 301
Innovation Output | 203 | 27 | 43 | 18 | 292
Employment Level | 105 | 52 | 107 | 13 | 279
Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276
Consumer Welfare | 116 | 63 | 42 | 11 | 232
Firm Revenue | 150 | 48 | 26 | 3 | 227
Inequality Measures | 44 | 122 | 49 | 6 | 221
Task Completion Time | 169 | 29 | 8 | 12 | 219
Worker Satisfaction | 89 | 63 | 20 | 12 | 184
Error Rate | 69 | 92 | 10 | 2 | 173
Regulatory Compliance | 76 | 68 | 14 | 5 | 163
Training Effectiveness | 93 | 21 | 13 | 19 | 148
Wages & Compensation | 77 | 36 | 25 | 6 | 144
Automation Exposure | 51 | 54 | 22 | 12 | 142
Team Performance | 86 | 17 | 27 | 9 | 140
Developer Productivity | 94 | 17 | 14 | 6 | 132
Job Displacement | 12 | 80 | 20 | 1 | 113
Hiring & Recruitment | 51 | 7 | 8 | 3 | 69
Creative Output | 31 | 17 | 7 | 3 | 59
Skill Obsolescence | 5 | 46 | 6 | 1 | 58
Social Protection | 27 | 16 | 8 | 2 | 53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
The paper defines a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR).
Explicit listing of newly proposed operational metrics in the paper; this is a descriptive claim about the paper's content (theoretical definitions), no sample size or empirical estimation provided in the excerpt.
high positive Cognitive Amplification vs Cognitive Delegation in Human-AI ... operational metrics for human-AI cognitive interaction (CAI*, D, HRI, HCDR)
The paper introduces a conceptual and mathematical framework to distinguish cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI).
Explicit contribution claim in the paper (description of a conceptual and mathematical framework); evidence consists of the model and formal definitions presented in the paper (no external empirical validation reported in the excerpt).
high positive Cognitive Amplification vs Cognitive Delegation in Human-AI ... mode of human-AI interaction (amplification vs delegation)
Artificial intelligence generates positive spatial spillovers for UCEE (positive effects on neighboring regions).
Spatial Durbin model reported in the abstract indicating positive spillover coefficients for artificial intelligence.
high positive How artificial intelligence and environmental regulation inf... UCEE index (spatial spillover effect of AI)
The Global Malmquist–Luenberger (GML) index and its efficiency change (EC) and technological change (TC) components stay above 1, indicating sustained efficiency gains dominated by technological progress.
GML index and decomposition results reported in the abstract based on the panel data and GML computation.
high positive How artificial intelligence and environmental regulation inf... GML index and its EC and TC components (measures of productivity/efficiency chan...
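The reading of the GML result above rests on the usual multiplicative decomposition of Malmquist-type indices, GML = EC x TC, where a value above 1 means improvement over the previous period. The sketch below illustrates that arithmetic only; the function names and the sample values are hypothetical and are not taken from the paper.

```python
def gml_from_components(ec: float, tc: float) -> float:
    """Combine efficiency change (EC) and technological change (TC) into GML."""
    return ec * tc


def dominant_component(ec: float, tc: float) -> str:
    """Name the component that contributes more to the period's change."""
    return "TC" if tc > ec else "EC"


def cumulative_index(yearly_gml):
    """Chain year-on-year GML values into a cumulative level (base = 1.0)."""
    level = 1.0
    levels = []
    for g in yearly_gml:
        level *= g
        levels.append(level)
    return levels


# Hypothetical year: small efficiency gain, larger technological gain.
print(gml_from_components(1.0, 1.5))   # 1.5
print(dominant_component(1.0, 1.5))    # TC
print(cumulative_index([2.0, 1.5]))    # [2.0, 3.0]
```

When both components stay above 1 and TC exceeds EC, the index grows each period with technological progress as the larger factor, which is the pattern the claim describes.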
Nationally, the average UCEE index rises from about 0.3 to above 0.7 over the sample period.
Computed UCEE index results from the Super-SBM model applied to the panel of 30 provinces (2013–2022) as reported in the abstract.
high positive How artificial intelligence and environmental regulation inf... UCEE index (average, national)
Recent advances in large language models, tool-using agents, and financial machine learning are shifting financial automation from isolated prediction tasks to integrated decision systems that can perceive information, reason over objectives, and generate or execute actions.
Literature synthesis and conceptual statement in the paper's introduction describing recent technological advances and their effects on financial automation; no empirical sample size reported.
high positive AI Agents in Financial Markets: Architecture, Applications, ... shift in type of financial automation (from isolated prediction to integrated de...
SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
Conceptual/positioning claim made by the authors about the intended shift in benchmarking perspective enabled by SOL-ExecBench.
high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... benchmarking_objective_shift_toward_hardware_efficiency
To support robust evaluation of agentic optimizers, we provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis-based checks against common reward-hacking strategies.
Method/tool claim in paper describing the provided evaluation harness and its engineered controls (list of features included).
high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... evaluation_robustness_and_integrity_of_benchmarking
We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes.
Paper defines the SOL Score metric and states its interpretive meaning (fraction of gap closed between baseline and hardware SOL bound).
high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... fraction_of_gap_closed_to_hardware_bound
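One way to read the SOL Score described above is as a normalized gap-closure fraction. The sketch below assumes runtimes in milliseconds and a linear formula, (baseline - candidate) / (baseline - SOL). The function name `sol_score`, the formula, and the sample runtimes are assumptions for illustration; the paper's exact definition may differ, for example in clamping or aggregation.

```python
def sol_score(t_baseline_ms: float, t_candidate_ms: float, t_sol_ms: float) -> float:
    """Fraction of the baseline-to-SOL runtime gap closed by a candidate kernel.

    Assumed formula (not taken from the paper):
        (t_baseline - t_candidate) / (t_baseline - t_sol)
    0.0 means no better than the baseline; 1.0 means the SOL bound is reached.
    """
    gap = t_baseline_ms - t_sol_ms
    if gap <= 0:
        raise ValueError("baseline must be slower than the SOL bound")
    return (t_baseline_ms - t_candidate_ms) / gap


# Hypothetical runtimes: baseline 10 ms, SOL bound 4 ms, candidate 7 ms.
print(sol_score(10.0, 7.0, 4.0))  # 0.5
```

Under this reading the target is fixed by hardware, so a faster software baseline changes the starting point of the gap but not its end point.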
SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.
Methodological claim: introduction of SOLAR pipeline to compute analytic hardware-grounded SOL bounds and use of those bounds as benchmark targets, as described in the paper.
high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... proximity_to_hardware_speed_of_light_bounds
The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities.
Paper description of benchmark coverage (workload direction and data types; inclusion of kernels tied to Blackwell hardware features).
high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... coverage_of_workloads_and_datatypes
We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs.
Paper reports construction of the benchmark with counts: 235 CUDA kernel problems and 124 source models; descriptive dataset claim in the manuscript.
high positive SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GP... benchmark_problem_count_and_coverage
Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
high positive AI, Productivity, and Labor Markets: A Review of the Empiric... regulatory approach to AI governance (strategy of forbearance vs. new regulation...
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
high positive AI, Productivity, and Labor Markets: A Review of the Empiric... barriers to entry / startup costs and rate of new product development
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregate reported effect ranges but does not list a single pooled sample size.
high positive AI, Productivity, and Labor Markets: A Review of the Empiric... task-level productivity in coding, writing, and customer service
The study contributes to theory by empirically integrating technological, human, and institutional dimensions within a single architectural framework, moving beyond isolated analyses of digital credit.
Author-stated contribution based on combining measures of algorithmic credit systems, human capability, and institutional design and testing interactions in the same regression models.
high positive Architecting financial well-being in algorithmic credit syst... theoretical contribution / integrative framework
Moderation analysis reveals that higher levels of human capability and stronger institutional design amplify the positive effects of algorithmic credit systems and mitigate their adverse effects (i.e., they strengthen repayment and resilience effects and reduce financial stress).
Reported moderation analyses using interaction terms in the regression models on the 400-user cross-sectional sample; results described as significant moderation by human capability and institutional design.
high positive Architecting financial well-being in algorithmic credit syst... conditional effects on repayment behavior, financial resilience, and financial s...
Algorithmic credit systems are positively associated with financial resilience.
Regression analyses reported show a positive relationship between algorithmic credit system use and measures of financial resilience in the sample of 400 users.
Algorithmic credit systems are positively associated with repayment behavior.
Multiple regression results reported in the study indicate a positive association between use of algorithmic credit systems and repayment behavior, based on a cross-sectional survey of 400 users.
Measurement reliability and validity were established through Cronbach's alpha and principal component analysis.
Paper states that Cronbach’s alpha and principal component analysis (PCA) were used to establish measurement reliability and validity.
high positive Architecting financial well-being in algorithmic credit syst... measurement reliability/validity
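For readers unfamiliar with the reliability statistic named above, Cronbach's alpha for a k-item scale is k / (k - 1) * (1 - sum of item variances / variance of the summed score). The sketch below implements that standard formula with the Python standard library; the item scores are hypothetical and do not come from the study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of k lists, one per survey item, each holding one score
    per respondent (all lists the same length).
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("every item needs one score per respondent")
    # Summed scale score for each respondent.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance(totals)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical data: two items that move together perfectly across 4 respondents.
print(round(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]), 6))  # 1.0
```

Values near 1 indicate that the items vary together; conventions for an acceptable threshold vary by field.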
The study used a quantitative, explanatory, cross-sectional design and employed multiple regression and moderation analyses to assess relationships among algorithmic credit systems, human capability, institutional design, and financial-wellbeing outcomes.
Methods described explicitly: quantitative explanatory cross-sectional design; analytical methods named as multiple regression and moderation analyses.
high positive Architecting financial well-being in algorithmic credit syst... research design / analytic methods
Data were collected from 400 users of algorithmic and digitally mediated credit platforms.
Study reports a quantitative, explanatory, cross-sectional survey of users; sample size explicitly stated as 400.
high positive Architecting financial well-being in algorithmic credit syst... sample_size / data source
Institutional design (enforceable rules, auditable logs, human oversight on high-impact actions) is a precondition for safe delegation of real authority to LLM agents; systems should be stress-tested under governance-like constraints before assignment of real authority.
Policy recommendation derived from simulation findings that governance structure strongly influences corruption-related outcomes and that safeguards alone are not consistently sufficient; grounded in experiments and rubric-assessed outcomes across 28,112 transcript segments.
high positive I Can't Believe It's Corrupt: Evaluating Corruption in Multi... safety of delegation to LLM agents (compliance with rules, avoidance of abuse)
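The safeguards listed above (enforceable rules, auditable logs, human oversight on high-impact actions) can be pictured as a thin gate in front of an agent's actions. The sketch below is a toy illustration under that assumption; `GovernedAgent`, `approve`, and the action names are hypothetical and are not taken from the paper's simulation code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GovernedAgent:
    # Callback standing in for a human reviewer; returns True to approve.
    approve: Callable[[str], bool]
    # Append-only record of every attempted action and its outcome.
    audit_log: List[Dict[str, object]] = field(default_factory=list)

    def act(self, action: str, high_impact: bool) -> bool:
        """Allow low-impact actions; require approval for high-impact ones."""
        allowed = (not high_impact) or bool(self.approve(action))
        self.audit_log.append(
            {"action": action, "high_impact": high_impact, "allowed": allowed}
        )
        return allowed


# Hypothetical reviewer that rejects every high-impact request.
agent = GovernedAgent(approve=lambda action: False)
print(agent.act("read_report", high_impact=False))    # True
print(agent.act("transfer_funds", high_impact=True))  # False
print(len(agent.audit_log))                           # 2
```

Logging denied attempts as well as allowed ones is what makes the record auditable: a reviewer can see what the agent tried, not only what it did.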
Among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity.
Comparative analysis within the multi-agent governance simulations across different authority structures and model identities; outcomes aggregated and compared across regimes (based on the 28,112 transcript segments scored).
high positive I Can't Believe It's Corrupt: Evaluating Corruption in Multi... corruption-related outcomes / rule-breaking
Integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption.
Argument and recommendation based on results from multi-agent governance simulations evaluating rule-breaking and abuse; conclusions drawn from aggregate outcomes across simulated regimes and interventions (see study of 28,112 transcript segments).
high positive I Can't Believe It's Corrupt: Evaluating Corruption in Multi... institutional integrity / safety of delegation to LLM agents
The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS.
Paper includes link to the open-source datasets and the AgentDS website.
The strongest solutions arise from human-AI collaboration.
Analysis of competition results showing top-performing submissions employed human-AI collaborative approaches rather than AI-only baselines (results from 29 teams / 80 participants).
high positive AgentDS Technical Report: Benchmarking the Future of Human-A... performance of human-AI collaborative solutions
We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
Paper describes the creation of the AgentDS benchmark and an associated competition as the study's primary methodological contribution.
high positive AgentDS Technical Report: Benchmarking the Future of Human-A... benchmark for evaluating AI agents and human-AI collaboration
Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows.
Statement in the paper referencing recent developments in LLMs and AI agents; presented as motivation rather than validated empirically within the paper.
high positive AgentDS Technical Report: Benchmarking the Future of Human-A... automation of data science workflow
Data science plays a critical role in transforming complex data into actionable insights across numerous domains.
Background statement in the paper (no empirical test or dataset provided to support this claim).
high positive AgentDS Technical Report: Benchmarking the Future of Human-A... transforming complex data into actionable insights
LLM-generated peer reviews assign scores that, on average, are a full point higher than human reviews.
Analysis of scores in the conference peer review dataset comparing LLM-generated vs human reviews; the excerpt states an average increase of one full point but does not include sample size or scale range.
high positive How LLMs Distort Our Written Language assigned review scores
About 21% of scientific peer reviews at a recent top AI conference were LLM-generated in the wild.
Analysis of peer reviews from a recent top AI conference reported in the paper; the excerpt reports the 21% figure but does not give the total number of reviews.
high positive How LLMs Distort Our Written Language share/proportion of peer reviews that were AI-generated
Even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning.
Experiment in which LLMs were given expert feedback and explicit instructions to perform only grammar edits; comparisons show significant semantic alteration despite constrained instructions; sample size not provided.
high positive How LLMs Distort Our Written Language semantic alteration of text despite constrained grammar-only prompt
Using a dataset of human-written essays (collected in 2021 before widespread LLM release), asking an LLM to revise essays based on human-written feedback induces large changes in the resulting content and meaning.
Controlled experiments applying LLM revision to a pre-LLM essay dataset and comparing pre- and post-revision content/semantics; dataset described as collected in 2021 but sample size not stated in the excerpt.
high positive How LLMs Distort Our Written Language magnitude of content and semantic changes after LLM revision
In a human user study, extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question.
Human user study reported in the paper; the excerpt gives the quantified result (nearly 70% increase) but does not report sample size here.
high positive How LLMs Distort Our Written Language proportion of essays judged as neutral in answering the topic question
LLMs consistently alter the intended meaning of human writing.
Experiments in which human-written essays were revised by LLMs (including prompts asking only for grammar edits) and comparison of pre- and post-LLM text semantics; exact sample sizes not stated in the excerpt.
high positive How LLMs Distort Our Written Language degree of semantic change / alteration of intended meaning
LLMs alter the voice and tone of human writing.
Reported results from a human user study and subsequent experiments comparing original human-written text to LLM-assisted/LLM-revised text; sample sizes not provided in the excerpt.
high positive How LLMs Distort Our Written Language change in voice and tone of writing
Large language models (LLMs) are used by over a billion people globally, most often to assist with writing.
Statement in paper (likely based on external usage statistics or surveys cited by authors); no sample size reported in the provided text.
high positive How LLMs Distort Our Written Language LLM adoption and primary use case (writing assistance)
The code and data used in the study are publicly available at the referenced repository.
Paper statement that code and data are publicly available at a repository (link provided in paper).
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... availability of replication materials (code and data)
A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals outcomes are most sensitive to officer deployment levels.
Reported sensitivity analysis across patrol radius, officer count, and reporting probability showing officer count as the most influential parameter in the simulation outcomes.
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... sensitivity of bias/detection outcomes to simulation parameters (patrol radius, ...
Persistent Gini coefficients of 0.43 to 0.62 across all conditions indicate concentrated detection inequality.
Reported range of Gini coefficients from simulation experiments across conditions.
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... Gini Coefficient (detection distribution inequality)
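As a reminder of what a Gini coefficient in this range means, the sketch below computes the standard rank-weighted Gini for a list of non-negative counts. The per-area detection counts are hypothetical, and the paper's exact unit of aggregation is not stated in the excerpt.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted formula on ascending data, ranks i = 1..n:
    #   G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical detections per area.
print(gini([5, 5, 5, 5]))   # 0.0  (evenly spread)
print(gini([0, 0, 0, 10]))  # 0.75 (all detections in one of four areas)
```

With n areas the maximum of this estimator is (n - 1) / n, so values of 0.43 to 0.62 sit well away from an even spread.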
Experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 15,714 in 2019.
Reported experimental result from simulations on Baltimore data giving mean annual DIR up to 15,714 for 2019.
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... Disparate Impact Ratio (DIR)
We compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score.
Statement of metrics computed and the number of observations (264 city-year-mode observations) reported in the paper.
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... monthly bias metrics (DIR, Demographic Parity Gap, Gini, Bias Amplification Scor...
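Two of the metrics named above have common textbook forms: a disparate impact ratio as the quotient of two groups' rates, and a demographic parity gap as their absolute difference. The sketch below uses those forms with hypothetical detection counts; the paper's exact definitions, group choices, and the composite Bias Amplification Score are not given in the excerpt and are not reproduced here.

```python
def detection_rate(detections: int, population: int) -> float:
    """Detections per resident for one group."""
    if population <= 0:
        raise ValueError("population must be positive")
    return detections / population


def disparate_impact_ratio(rate_a: float, rate_b: float) -> float:
    """Ratio of group A's rate to group B's rate (1.0 = parity)."""
    if rate_b == 0:
        raise ZeroDivisionError("reference group rate is zero")
    return rate_a / rate_b


def demographic_parity_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between the two groups' rates (0.0 = parity)."""
    return abs(rate_a - rate_b)


# Hypothetical month: 250 detections per 1000 in group A, 125 per 1000 in group B.
rate_a = detection_rate(250, 1000)
rate_b = detection_rate(125, 1000)
print(disparate_impact_ratio(rate_a, rate_b))   # 2.0
print(demographic_parity_gap(rate_a, rate_b))   # 0.125
```

A ratio of this kind grows very large whenever the reference group's rate approaches zero, while the gap stays bounded by 1, which is one reason to report both.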
The study uses 145,000+ Part 1 crime records from Baltimore (2017–2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data.
Reported dataset sizes and data sources in the paper (crime records from Baltimore and Chicago; ACS demographic augmentation).
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... data sample size / dataset composition
We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact.
Description of methods in paper: coupling a GAN (CTGAN) for synthetic crime generation with a Noisy OR detection/patrol model; method-level claim rather than a numerical result.
high positive Unmasking Algorithmic Bias in Predictive Policing: A GAN-Bas... bias propagation through enforcement pipeline (simulation framework)
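A Noisy OR model, as named above, treats each potential detector as an independent chance of detection and combines them as 1 - prod(1 - p_i). The sketch below shows that combination rule only; the per-officer and citizen-report probabilities are hypothetical, and how the paper maps patrol radius or officer count to these probabilities is not stated in the excerpt.

```python
def noisy_or(probabilities):
    """Probability that at least one independent source detects the event."""
    miss = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        # The event goes undetected only if every source misses it.
        miss *= 1.0 - p
    return 1.0 - miss


# Hypothetical crime event within range of two officers (0.5 each).
print(noisy_or([0.5, 0.5]))       # 0.75
# Same event with an added citizen report channel (0.5).
print(noisy_or([0.5, 0.5, 0.5]))  # 0.875
# No sources in range: nothing can be detected.
print(noisy_or([]))               # 0.0
```

Each added source raises the detection probability with diminishing returns, so areas that already have dense patrol coverage accumulate detections faster than sparsely covered ones.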
Empirical simulations of five game scenarios (ranging from repeated prisoner's dilemma to stylized repeated marketing promotion games) validate the theoretical predictions: AI agents naturally exhibit the proposed reasoning patterns and attain stable equilibrium behaviors intrinsically.
Simulation experiments reported in the paper across five distinct game scenarios; these simulations are presented as empirical validation of the theoretical results.
high positive Reasonably reasoning AI agents can avoid game-theoretic fail... frequency/occurrence of stable equilibrium behaviors (Nash-like play) in simulat...
Relaxing the common-knowledge payoff assumption—allowing stage payoffs to be unknown and each agent to observe only its own privately realized stochastic payoffs—still yields the same on-path Nash convergence guarantee.
Theoretical extension/proof in the paper showing convergence results hold under private, stochastic stage payoffs (no common knowledge of payoffs).
high positive Reasonably reasoning AI agents can avoid game-theoretic fail... on-path Nash convergence under private, stochastic payoffs
We prove that 'reasonably reasoning' agents—agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs—eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game.
Formal theoretical proof provided in the paper (mathematical analysis of agent belief-formation and best-response learning leading to on-path closeness to Nash equilibria).
high positive Reasonably reasoning AI agents can avoid game-theoretic fail... on-path proximity (weak closeness) to Nash equilibrium of the continuation game
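The two ingredients named above, forming beliefs from past observations and best responding to them, also appear in the classical fictitious play procedure. The sketch below uses fictitious play on a one-shot prisoner's dilemma stage game as a much simpler stand-in; it is not the paper's agent model or proof setting, and the payoff numbers are hypothetical.

```python
def best_response(payoff, belief):
    """Index of the action with the highest expected payoff.

    payoff[a][b] is the player's own payoff for own action a against
    opponent action b; belief[b] is the believed probability of b.
    """
    expected = [
        sum(p * payoff[a][b] for b, p in enumerate(belief))
        for a in range(len(payoff))
    ]
    return max(range(len(expected)), key=lambda a: expected[a])


def fictitious_play(payoff_row, payoff_col, rounds=20):
    """Both players best respond to the empirical frequency of past play."""
    counts_row = [1, 1]  # observed row-player actions (uniform prior)
    counts_col = [1, 1]  # observed column-player actions (uniform prior)
    history = []
    for _ in range(rounds):
        belief_about_col = [c / sum(counts_col) for c in counts_col]
        belief_about_row = [c / sum(counts_row) for c in counts_row]
        a = best_response(payoff_row, belief_about_col)
        b = best_response(payoff_col, belief_about_row)
        counts_row[a] += 1
        counts_col[b] += 1
        history.append((a, b))
    return history


# Hypothetical prisoner's dilemma payoffs; action 0 = cooperate, 1 = defect.
PD = [[3, 0], [5, 1]]
print(fictitious_play(PD, PD)[-1])  # (1, 1)
```

Here play settles on mutual defection, the unique Nash equilibrium of the stage game. The paper's result concerns repeated games and continuation-game equilibria, where richer outcomes are possible.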
Off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training.
Stated claim in the paper supported by a combination of theoretical results (formal proofs about convergence properties of 'reasonably reasoning' agents) and empirical simulations across five game scenarios (including repeated prisoner's dilemma and stylized repeated marketing promotion games).
high positive Reasonably reasoning AI agents can avoid game-theoretic fail... attainment of Nash-like play / strategic equilibrium (zero-shot)
End-to-end verified pipelines can produce provably correct code from informal specifications.
The paper surveys early research demonstrating pipelines that go from informal specifications to formally verified code; the provided text does not include experimental sample sizes or benchmarks.
high positive Intent Formalization: A Grand Challenge for Reliable Coding ... provable correctness of generated code