The Commonplace

Evidence (6507 claims)

Adoption: 7395 claims
Productivity: 6507 claims
Governance: 5877 claims
Human-AI Collaboration: 5157 claims
Innovation: 3492 claims
Org Design: 3470 claims
Labor Markets: 3224 claims
Skills & Training: 2608 claims
Inequality: 1835 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 609 159 77 736 1615
Governance & Regulation 664 329 160 99 1273
Organizational Efficiency 624 143 105 70 949
Technology Adoption Rate 502 176 98 78 861
Research Productivity 348 109 48 322 836
Output Quality 391 120 44 40 595
Firm Productivity 385 46 85 17 539
Decision Quality 275 143 62 34 521
AI Safety & Ethics 183 241 59 30 517
Market Structure 152 154 109 20 440
Task Allocation 158 50 56 26 295
Innovation Output 178 23 38 17 257
Skill Acquisition 137 52 50 13 252
Fiscal & Macroeconomic 120 64 38 23 252
Employment Level 93 46 96 12 249
Firm Revenue 130 43 26 3 202
Consumer Welfare 99 51 40 11 201
Inequality Measures 36 105 40 6 187
Task Completion Time 134 18 6 5 163
Worker Satisfaction 79 54 16 11 160
Error Rate 64 78 8 1 151
Regulatory Compliance 69 64 14 3 150
Training Effectiveness 81 15 13 18 129
Wages & Compensation 70 25 22 6 123
Team Performance 74 16 21 9 121
Automation Exposure 41 48 19 9 120
Job Displacement 11 71 16 1 99
Developer Productivity 71 14 9 3 98
Hiring & Recruitment 49 7 8 3 67
Social Protection 26 14 8 2 50
Creative Output 26 14 6 2 49
Skill Obsolescence 5 37 5 1 48
Labor Share of Income 12 13 12 37
Worker Turnover 11 12 3 26
Industry 1 1
Modeled joules per correct answer varies by a factor of 6.2 across endpoints.
Modeled energy estimate combined with task accuracy to compute joules per correct answer across 78 endpoints.
high mixed Token Arena: A Continuous Benchmark Unifying Energy and Cogn... joules per correct answer (modeled energy efficiency)
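The joules-per-correct-answer metric in the entry above can be illustrated with a minimal sketch. It assumes the metric is simply modeled energy divided by the number of correctly answered tasks, and that the "factor" is the ratio of the worst to the best endpoint. The `EndpointResult` type, the endpoint names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class EndpointResult:
    name: str             # hypothetical endpoint label
    energy_joules: float  # modeled total energy for the evaluation run
    n_correct: int        # number of tasks answered correctly


def joules_per_correct(result: EndpointResult) -> float:
    """Modeled energy divided by correct answers (assumed definition)."""
    if result.n_correct == 0:
        return float("inf")  # no correct answers: efficiency is undefined
    return result.energy_joules / result.n_correct


def spread_factor(results: list[EndpointResult]) -> float:
    """Ratio of the least to the most efficient endpoint."""
    values = [joules_per_correct(r) for r in results]
    finite = [v for v in values if v != float("inf")]
    return max(finite) / min(finite)


# Illustrative values only.
results = [
    EndpointResult("endpoint-a", 1200.0, 80),
    EndpointResult("endpoint-b", 3000.0, 50),
    EndpointResult("endpoint-c", 2000.0, 64),
]

for r in results:
    print(r.name, joules_per_correct(r))
print("spread factor:", spread_factor(results))
```

Under this reading, a reported factor of 6.2 would mean the least efficient endpoint spends 6.2 times as many modeled joules per correct answer as the most efficient one.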
Across 78 endpoints, the same model on different endpoints differs in tail latency by an order of magnitude.
Empirical tail-latency measurements across 78 endpoints serving 12 model families.
The same model on different endpoints differs in fingerprint similarity to first party by up to 12 points.
Empirical measurement of fingerprint (output-distribution) similarity to a first-party reference across the same set of endpoints (78 endpoints, 12 model families).
high mixed Token Arena: A Continuous Benchmark Unifying Energy and Cogn... fingerprint similarity to first-party reference (endpoint fidelity)
Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code.
Empirical measurement across 78 endpoints and 12 model families comparing mean accuracy on math and code tasks.
high mixed Token Arena: A Continuous Benchmark Unifying Energy and Cogn... mean accuracy on math and code benchmarks
The rise of digital agents will transform the foundations of production, labour markets, institutional arrangements and the international distribution of economic power.
Synthesis and theoretical projection across sections of the paper; presented as a broad conclusion without reported empirical quantification in the provided text.
high mixed DIGITAL AGENTS AS FUNCTIONAL EQUIVALENTS OF ECONOMIC ACTORS:... transformation of production systems, labour markets, institutions, and internat...
There is a fundamental asymmetry between economic and social reproduction: digital agents can compensate for productive functions of the population but are unable to substitute the population's functions of social reproduction.
Theoretical argument and conceptual distinction in the paper; no empirical study measuring substitution in social reproduction provided.
high mixed DIGITAL AGENTS AS FUNCTIONAL EQUIVALENTS OF ECONOMIC ACTORS:... capacity of digital agents to substitute productive vs social reproduction funct...
These patterns suggest that AI adoption is associated with expected efficiency gains that shape both firms' pricing behaviour and their macroeconomic expectations.
Interpretation based on observed increases in productivity/profitability and different pricing/inflation expectations among adopters vs non-adopters in survey and DID analyses.
high mixed The economic impact of artificial intelligence: evidence fro... interpretive link between productivity/profitability gains and firms' pricing an...
AI adoption leads both to job displacement and job creation, including the emergence of new occupational categories.
Abstract states the review examines empirical evidence on both job displacement and creation and the emergence of new occupations; no numeric counts or sample sizes provided in abstract.
high mixed AI and the Transformation of Human Employment: Challenges, O... job destruction and creation; emergence of new occupations
The study identifies short-term transitional risks and long-term productivity gains associated with AI integration in the workforce.
Abstract states the paper evaluates both short-term risks and long-term productivity gains from AI integration based on the reviewed literature; no empirical quantification given in abstract.
high mixed AI and the Transformation of Human Employment: Challenges, O... transitional risks and productivity gains
AI-driven automation and augmentation are reshaping employment landscapes, with emphasis on sector-level disruption, skill transformation, and socioeconomic consequences.
Abstract states this as a conclusion of the review drawing on interdisciplinary empirical literature; no specific studies or sample sizes cited in abstract.
high mixed AI and the Transformation of Human Employment: Challenges, O... employment landscape changes (sector disruption, skill transformation, socioecon...
The accelerating deployment of artificial intelligence across industries has fundamentally altered the structure of global labour markets.
Statement in abstract summarizing a systematic review of interdisciplinary literature (economics, computer science, organizational behaviour, public policy); no specific sample size reported in abstract.
high mixed AI and the Transformation of Human Employment: Challenges, O... structure of global labour markets
The magnitude of AI’s effect on potential GDP varied across industries and depended on the level of digital maturity, human resources, and institutional conditions.
Decompositional analysis across aggregated industry data and scenario-based modeling drawing on sectoral sources and reviews.
high mixed THE IMPACT OF AI ON POTENTIAL GDP AND LONG-TERM ECONOMIC GRO... industry-specific magnitude of AI contribution to GDP
Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated.
Error-mode analysis across the 105 tasks and evaluated models reported in experiments; authors identify task-family-level patterns (HR, management, multi-system workflows) and relative ease of local workspace repair.
high mixed Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... failure distribution by task family / execution surface
Whether LLM-based assistants improve or degrade code quality remains unresolved: existing studies report contradictory outcomes contingent on context and evaluation criteria.
Review finds mixed/contradictory findings across included studies regarding code quality effects.
high mixed The Impact of LLM-Assistants on Software Developer Productiv... code quality (e.g., correctness, maintainability, defects)
The system tends to be factually correct when it answers but often omits information (i.e., 'the system is right when it answers — it just leaves things out').
Interpretation combining reported factual accuracy (85.5%) with low completeness (0.40) from benchmark results.
high mixed Benchmarking Complex Multimodal Document Processing Pipeline... factual accuracy vs. answer completeness
The study establishes statistically significant relationships between organizational AI adoption and compensation dynamics.
Econometric estimates (difference-in-differences and propensity score matched comparisons) using the combined datasets listed in the paper and controlling for industry, firm size, geography, occupation characteristics, and macroeconomic variables.
high mixed The Generative AI Revolution: Early Evidence of Structural T... compensation dynamics (wages/pay)
The study establishes statistically significant relationships between organizational AI adoption and changes in occupational structures.
Same econometric approach (difference-in-differences and propensity score matching) applied to combined datasets (Anthropic Economic Index, Census Business Trends and Outlook Survey, Federal Reserve regional surveys, labor market analytics), with controls for industry, firm size, location, occupation-level characteristics, and macroeconomic environment.
The study establishes statistically significant relationships between organizational AI adoption and changes in employment patterns in the United States during 2022–2025.
Econometric analysis using multiple large-scale data sources (Anthropic Economic Index, U.S. Census Bureau Business Trends and Outlook Survey, Federal Reserve regional surveys, labor market analytics) and methods described as difference-in-differences estimation and propensity score matching controlling for industry (NAICS 2-digit), firm size, geography, occupation characteristics, and macro conditions.
The paper extends paradox theory to conceptualise the Creativity Paradox in the context of GenAI.
Theoretical extension and conceptual development within the paper (no empirical tests reported).
high mixed Beyond the Creativity Paradox: A Theory-informed Framework f... extension of paradox theory (Creativity Paradox)
Within that n=11 subset, 9 of 11 agents shift by at least 2 ranks between composite and benchmark-only rankings.
Comparison of rank positions between composite and benchmark-only rankings on the 11-agent subset; reported count of agents that moved at least 2 ranks.
high mixed AgentPulse: A Continuous Multi-Signal Framework for Evaluati... count/proportion of agents with ≥2-rank shifts
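A count such as "9 of 11 agents shift by at least 2 ranks" can be reproduced with a short sketch like the one below. It assumes each ranking is obtained by sorting agents on a score in descending order, with ties broken by input order. The function names, agent labels, and scores are hypothetical and do not come from the paper.

```python
def rank_positions(scores: dict[str, float]) -> dict[str, int]:
    """Map each agent to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {agent: position + 1 for position, agent in enumerate(ordered)}


def count_rank_shifts(
    composite: dict[str, float],
    benchmark_only: dict[str, float],
    min_shift: int = 2,
) -> int:
    """Count agents whose rank differs by at least min_shift between rankings."""
    composite_rank = rank_positions(composite)
    benchmark_rank = rank_positions(benchmark_only)
    return sum(
        1
        for agent in composite_rank
        if abs(composite_rank[agent] - benchmark_rank[agent]) >= min_shift
    )


# Illustrative scores only.
composite = {"agent-a": 0.9, "agent-b": 0.8, "agent-c": 0.7, "agent-d": 0.6}
benchmark_only = {"agent-a": 0.5, "agent-b": 0.9, "agent-c": 0.6, "agent-d": 0.7}

shifted = count_rank_shifts(composite, benchmark_only)
print(f"{shifted} of {len(composite)} agents shift by at least 2 ranks")
```

In this toy data, agent-a drops from rank 1 to rank 4 and agent-d rises from rank 4 to rank 2, so two of the four agents meet the threshold.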
The four factors capture largely complementary information (n=50; ρ_max = 0.61 for Adoption-Ecosystem, all others |ρ| ≤ 0.37).
Correlation analysis among the four factor scores computed on the 50-agent sample; reported maximum inter-factor Pearson/Spearman correlation coefficients.
high mixed AgentPulse: A Continuous Multi-Signal Framework for Evaluati... inter-factor correlations (Adoption vs Ecosystem and other factor pairs)
Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users.
Empirical measurements from the instrumented system across concurrency up to 50 users and tier comparisons; the paper reports the observed saturation point near ~20 concurrent users.
high mixed Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response time (latency) and saturation threshold (concurrency where reserved cap...
Delegating tasks to genAI can be individually beneficial in the short term even as widespread adoption degrades future model performance (creating a social dilemma).
Result of the paper's behavioral model showing an individual-level incentive to use genAI versus a collective cost from adoption (theoretical/model-based; no empirical sample reported in abstract).
high mixed Generative artificial intelligence reduces social welfare th... individual short-term benefit vs future model performance (collective welfare)
Token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens.
Observed run-to-run variability in total token counts for identical tasks across the collected agentic trajectories from eight frontier LLMs on SWE-bench Verified.
high mixed How Do AI Agents Spend Your Money? Analyzing and Predicting ... run-to-run variability in total token consumption for the same task
ASC (adaptive stopping criterion) halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost.
Reported experiment with ASC showing that it prevents harmful iterative refinement yet causes a measured cost described as 3.8 percentage points due to confidence elicitation.
high mixed When Does LLM Self-Correction Help? A Control-Theoretic Mark... trade-off between stopping harmful refinement and a confidence-elicitation cost ...
Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ≈ 0.2%), and o4-mini (±0 pp) remain non-degrading under self-correction; GPT-5 degrades by -1.8 pp.
Reported measured changes in accuracy (percentage-point changes) and measured EIR values for the named models after applying iterative self-correction across the experiment suite.
high mixed When Does LLM Self-Correction Help? A Control-Theoretic Mark... accuracy change from self-correction
Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (≤ 0.5%) separating beneficial from harmful self-correction.
Empirical experiments reported across 7 LLMs and 3 benchmark datasets (GSM8K, MATH, StrategyQA) comparing outcomes of iterative self-correction as a function of measured EIR.
high mixed When Does LLM Self-Correction Help? A Control-Theoretic Mark... accuracy change from self-correction as a function of EIR
These efficiency gains are offset by a growing 'Efficiency-Legitimacy Paradox' (i.e., improvements in efficiency come with worsening legitimacy concerns).
Conceptual synthesis from the systematic review (2018-2026) identifying a recurring trade-off across reviewed studies; specific empirical quantification not provided in abstract.
high mixed Artificial Intelligence, Public Policy and Governance - impl... trade-off between administrative efficiency and democratic legitimacy/procedural...
There is a structural shift from 'street-level' bureaucracies to 'system-level' architectures, which can be defined as the institutional division of 'Artificial Discretion' to algorithmic infrastructures.
Synthesis from the PRISMA-guided systematic review of literature (2018-2026) reporting observed changes in administrative architectures; specific studies not enumerated in abstract.
high mixed Artificial Intelligence, Public Policy and Governance - impl... institutional/administrative architecture (shift from street-level to system-lev...
As a General-Purpose Technology (GPT), Artificial Intelligence (AI) is fundamentally reconfiguring state capacity, as well as the mechanics of global economic management.
Systematic review of current research studies (2018-2026) conducted following PRISMA guidelines; synthesis of literature claiming broad institutional and macroeconomic effects. Number of studies not specified in abstract.
high mixed Artificial Intelligence, Public Policy and Governance - impl... state capacity and the mechanics of global economic management
For LLM agents, memory management critically impacts efficiency, quality, and security.
Statement in paper framing and motivation; supported conceptually by literature linking memory design to system properties (no specific experimental details provided in abstract).
high mixed FSFM: A Biologically-Inspired Framework for Selective Forget... efficiency, content quality, and security of LLM agents
Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves.
Empirical analysis of authorship attribution across the 6,000 sessions in the SWE-chat dataset; percentages derived from session-level classification.
high mixed SWE-chat: Coding Agent Interactions From Real Users in the W... distribution of code authorship across sessions (agent-dominant vs human-only se...
A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but DPM exposes one nondeterministic call while summarization exposes N compounding calls.
Determinism experiment with 10 replays per case at temperature zero; qualitative/quantitative observation about number of nondeterministic LLM calls exposed by each architecture.
high mixed Stateless Decision Memory for Enterprise AI Agents system nondeterminism / number of nondeterministic LLM calls exposed per decisio...
Multi-agent workflows and benchmark evaluation reveal current capabilities, limitations, and research frontiers in agentic AI for physical design.
The paper states it analyzes recent experience with multi-agent workflows and benchmark evaluation; the abstract does not provide specific benchmark names, metrics, or sample sizes.
high mixed Invited: Agentic AI for Physical Design R&D: Status and Pros... capabilities and limitations as identified via multi-agent workflows and benchma...
AI is associated with a shift toward younger, relatively less educated workers.
Reported association in the paper's baseline empirical results linking AI presence/pervasiveness to changes in workforce composition (age and education).
high mixed Early Estimates of the Impact of AI Within BEA’s Industry Ec... worker composition by age and education
Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI.
Authors' recommendation in the paper's conclusion based on experimental findings (performance, workload, emotion, retention outcomes).
high mixed Fast and Forgettable: A Controlled Study of Novices' Perform... educational practice recommendation (pair programming vs AI-assisted instruction...
Formal network verification has made substantial progress in proving correctness properties but is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior.
Authors' summary of the state of the art in network verification (assertion in paper; no empirical data in abstract).
high mixed Aether: Network Validation Using Agentic AI and Digital Twin applicability of formal verification to live/continuous change
Overall, the proposed HRL framework improves learning efficiency and scalability, outperforming heuristic baselines while remaining below the perfect-information oracle bound.
Results reported in the paper from simulation experiments comparing the HRL framework to heuristic baselines and the oracle; pairwise differences analyzed (Wilcoxon tests referenced). The paper asserts better performance than heuristics but still worse than the oracle.
high mixed Omnichannel Supply Chains Amid Demand Shocks: A Centralized ... policy performance (learning efficiency, scalability, and supply-chain control p...
How software developers interact with AI-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them.
Based on qualitative analysis of twenty-two interviews with software developers about using LLMs for software development; asserted as a central finding in the paper's analysis.
high mixed Towards an Appropriate Level of Reliance on AI: A Preliminar... impact of AI tools on developers (broadly: productivity, skills, quality)
Benefits of technology and data analytics are context-dependent, with emerging markets facing unique regulatory and infrastructural barriers.
Narrative synthesis of included studies noting heterogeneity by context and reports of regulatory/infrastructural constraints in emerging markets.
high mixed The Use of Technology and Data Analytics in Modern Auditing:... realized benefits / adoption in varying contexts
Cybersecurity has a moderating effect on audit data analytics.
Synthesis statement in the review summarizing included studies that report cybersecurity influences the effectiveness/usability of audit data analytics.
high mixed The Use of Technology and Data Analytics in Modern Auditing:... effectiveness of audit data analytics
CLARITI matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions.
Empirical evaluation comparing CLARITI and GPT-5 on a task set of underspecified software engineering issues; the result reported in the abstract indicates parity in resolution rate and a quantified reduction in questions (41%) but the abstract does not report sample size, test set composition, or statistical significance.
high mixed Asking What Matters: Reward-Driven Clarification for Softwar... resolution rate (task success) and number of clarifying questions generated
They can produce fluent outputs that resemble reflection, but lack temporal continuity, causal feedback, and anchoring in real-world interaction.
Descriptive claim made in the text contrasting surface-level fluency with missing properties; no empirical data or experiments provided.
high mixed Governing Reflective Human-AI Collaboration: A Framework for... fluency vs. temporal_continuity, causal_feedback, real-world_anchoring
This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.
Paper's stated contribution presenting theory and conceptual groundwork; no empirical validation provided in the abstract.
high mixed The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... interaction between augmented cognitive performance and changes in self-percepti...
The LLM fallacy has implications for education, hiring, and AI literacy.
Implications and argumentation presented in the paper; these are prospective and conceptual rather than supported by empirical data in the abstract.
high mixed The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... impacts on education practices, hiring decisions, and AI literacy needs
Further research is needed to explore the longitudinal impact of these AI deployments on local labor markets and the creation of indigenous datasets that reflect Cameroon’s unique linguistic diversity.
Authors' identified research gaps and recommendations; statement of future research needs rather than empirical result.
high mixed A Framework for Sovereign AI Governance and Economic Growth ... longitudinal impacts on local labor markets and creation/use of indigenous lingu...
Removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success.
Qualitative and quantitative comparisons from the deployed evaluation across the three conditions (observations about turn counts, validation-feedback loops, and model hallucinations in unconstrained condition over the 25 scenario trials).
high mixed Bounded Autonomy for Enterprise AI: Typed Action Contracts a... number of interaction turns to correct outcome; presence of hallucinated success
Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model.
Reported experimental results aggregated across two practical settings (AI consultancy and AI software team) and 12 tasks; direct comparison between AI Organizations of aligned models and a single aligned model.
high mixed AI Organizations are More Effective but Less Aligned than In... solution utility (higher) and model misalignment (greater)
Multi-agent "AI organizations" are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents.
Experimental comparison reported in the paper: experiments comparing multi-agent AI organizations to single aligned agents across tasks and settings (described below).
high mixed AI Organizations are More Effective but Less Aligned than In... solution utility (effectiveness at achieving business goals) and model alignment...
Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance.
Paper reports instances where top-performing (frontier) models outperform aggregate human expert accuracy on SciPredict, but concludes overall accuracies are insufficient for reliable experimental guidance.
high mixed SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... prediction_accuracy / usability_for_guidance