The Commonplace
Home Dashboard Papers Evidence Syntheses Digests 🎲

Evidence (11633 claims)

Adoption
7395 claims
Productivity
6507 claims
Governance
5877 claims
Human-AI Collaboration
5157 claims
Innovation
3492 claims
Org Design
3470 claims
Labor Markets
3224 claims
Skills & Training
2608 claims
Inequality
1835 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 609 159 77 736 1615
Governance & Regulation 664 329 160 99 1273
Organizational Efficiency 624 143 105 70 949
Technology Adoption Rate 502 176 98 78 861
Research Productivity 348 109 48 322 836
Output Quality 391 120 44 40 595
Firm Productivity 385 46 85 17 539
Decision Quality 275 143 62 34 521
AI Safety & Ethics 183 241 59 30 517
Market Structure 152 154 109 20 440
Task Allocation 158 50 56 26 295
Innovation Output 178 23 38 17 257
Skill Acquisition 137 52 50 13 252
Fiscal & Macroeconomic 120 64 38 23 252
Employment Level 93 46 96 12 249
Firm Revenue 130 43 26 3 202
Consumer Welfare 99 51 40 11 201
Inequality Measures 36 105 40 6 187
Task Completion Time 134 18 6 5 163
Worker Satisfaction 79 54 16 11 160
Error Rate 64 78 8 1 151
Regulatory Compliance 69 64 14 3 150
Training Effectiveness 81 15 13 18 129
Wages & Compensation 70 25 22 6 123
Team Performance 74 16 21 9 121
Automation Exposure 41 48 19 9 120
Job Displacement 11 71 16 1 99
Developer Productivity 71 14 9 3 98
Hiring & Recruitment 49 7 8 3 67
Social Protection 26 14 8 2 50
Creative Output 26 14 6 2 49
Skill Obsolescence 5 37 5 1 48
Labor Share of Income 12 13 12 37
Worker Turnover 11 12 3 26
Industry 1 1
Of these four, integration capacity is the least developed for scientific institutions and the most binding: no improvement in AI tooling can buy it.
Normative/diagnostic claim in the paper about relative scarcity and irreducibility of integration capacity; no empirical measures or sample provided in the excerpt.
high negative AI-Augmented Science and the New Institutional Scarcities relative development of integration capacity in scientific institutions and its ...
Four complements then become scarce and load-bearing for AI-augmented science: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition).
Theoretical framework proposed by the paper; list of four complements presented as an argument without empirical quantification in the excerpt.
high negative AI-Augmented Science and the New Institutional Scarcities scarcity of verified signal, legitimacy, authentic provenance, and integration c...
We establish a Volume-Quality Inverse Law: code volume is a near perfect predictor of structural degradation.
Empirical finding from the paper's analysis correlating code volume with measures of structural degradation; described as 'near perfect predictor'.
high negative AI-Generated Smells: An Analysis of Code and Architecture in... structural degradation (predicted by code volume)
There exists a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code.
Multi-scale comparative analysis across models of differing capability showing higher-capability models produce larger (volume) and more highly-coupled code artifacts.
high negative AI-Generated Smells: An Analysis of Code and Architecture in... code volume and coupling (architectural complexity)
AI does not eliminate software flaws but rather introduces a distinct 'machine signature' of defects in generated code.
Systematic audit (multi-scale analysis) of AI-generated software across single-file algorithmic tasks and complex, agent-generated systems, reporting characteristic defect patterns attributed to machine generation.
high negative AI-Generated Smells: An Analysis of Code and Architecture in... presence and patterning of defects in AI-generated code (machine signature of de...
The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long term maintainability.
Framing statement in the paper; argument based on literature/practice that current evaluations emphasize functional correctness rather than maintainability.
high negative AI-Generated Smells: An Analysis of Code and Architecture in... emphasis of evaluation metrics (functional correctness vs maintainability)
Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables.
Position asserted in the paper based on literature/benchmark trends and authors' field observations; no original empirical dataset or quantified analysis provided in the paper text excerpt.
high negative The Conversations Beneath the Code: Triadic Data for Long-Ho... performance on short-horizon benchmarks versus performance on long-horizon, mult...
Standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
Quantitative analysis reported in the paper comparing detection of the seven failure modes by standard metrics over evaluation cycles.
high negative Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... proportion and timing of detection of failure modes by standard metrics
Standard metrics (ROUGE, BERTScore, accuracy/AUC, and agentic benchmarks such as HELM/MT-Bench/AgentBench/BIG-bench) fail to detect each of the seven production failure modes.
Empirical demonstration reported in the paper comparing standard metrics and agentic benchmarks against the seven failure modes.
high negative Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... detection capability of standard metrics/benchmarks for production failure modes
The seven failure modes include compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks.
Author-provided list of example failure modes within the taxonomy; grounded in observations described in the paper.
high negative Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... types of failure modes affecting production agentic systems
Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings and do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production.
Author statement based on literature/framework review (references to HELM, MT-Bench, AgentBench, BIG-bench) and contrast with production agentic evaluation needs.
high negative Evaluating Agentic AI in the Wild: Failure Modes, Drift Patt... ability of existing LLM evaluation frameworks to address continuous production a...
Prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users.
Cited prior work / literature claim reported in paper (no specific study details or sample sizes provided in excerpt).
high negative U-Define: Designing User Workflows for Hard and Soft Constra... usability of constraint specification (rigidity and understandability of numeric...
LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control.
Paper's background/related-work motivation (literature summary and framing). No specific empirical data reported in excerpt.
high negative U-Define: Designing User Workflows for Hard and Soft Constra... reliability and control over LLM outputs
The most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify with current methods.
Argumentative claim in the position paper linking capability value to unverifiability; no empirical validation or measurement of 'value' or verifiability included.
high negative Reliable AI Needs to Externalize Implicit Knowledge: A Human... verifiability of high-level AI capabilities (reasoning, judgment, intuition)
Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap in verifying AI's implicit knowledge.
Conceptual critique in the paper of existing verification/validation approaches; no systematic review or empirical comparison provided.
high negative Reliable AI Needs to Externalize Implicit Knowledge: A Human... verifiability of AI knowledge (explicit vs implicit)
Implicit knowledge remains unexternalized because documentation cost exceeds perceived value.
Presented as an economic/theoretical explanation in the paper; no empirical study, sample, or cost estimates provided.
high negative Reliable AI Needs to Externalize Implicit Knowledge: A Human... degree of externalization of implicit knowledge (documentation vs tacit retentio...
Specification discipline, not model capability, is the binding constraint on AI-assisted software dependability.
Synthesis conclusion by the authors based on the multivocal literature review, telemetry findings, conceptual modeling (PRP/SGM), and the four-month pilot evaluation.
high negative The Productivity-Reliability Paradox: Specification-Driven G... software dependability (reliability) in AI-assisted development
These conflicting findings constitute the Productivity-Reliability Paradox (PRP): a systematic phenomenon emerging from non-deterministic code generators and insufficient specification discipline.
Conceptual synthesis and interpretation by the paper's authors, based on the multivocal literature review, telemetry, and experimental evidence summarized above.
high negative The Productivity-Reliability Paradox: Specification-Driven G... software dependability / trade-off between productivity and reliability
Telemetry across 10,000+ developers shows 91% longer code review times.
Observational telemetry data aggregated across >10,000 developers reported in the paper; metric reported is percent increase in review time.
The most rigorous randomized controlled trial (RCT) documents a 19% slowdown for experienced developers.
A single RCT cited in the paper described as the most rigorous trial; result reported as a 19% slowdown for experienced developers. Sample size for the RCT is not provided in the summary statement.
high negative The Productivity-Reliability Paradox: Specification-Driven G... developer productivity (task completion speed)
Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target
Stated as a limitation in the paper (conceptual and computational argument); no benchmarks or computational cost measurements reported.
high negative Position: agentic AI orchestration should be Bayes-consisten... computational feasibility and conceptual tractability of making LLMs fully Bayes...
Compound-system-specific operational challenges arise when serving agentic workloads, including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics.
The paper presents a novel analysis and discussion of these challenges and supports the points via case studies and operational lessons from the production deployment; no quantitative prevalence metrics or sample sizes are provided in the provided text.
high negative Scalable Inference Architectures for Compound AI Systems: A ... operational challenges: fan-out overhead, cold-start propagation, heterogeneous ...
Whether it is the periodic compulsory recoinage in medieval Europe or Gesell's stamp scrip, both are essentially mechanisms for taxing money holdings.
Interpretive/historical claim presented by the authors; no empirical testing or sample reported in the excerpt.
high negative RSDM: The Consensus Honest Money in the AI Era degree_to_which_historical_monetary_policies_function_as_a_tax_on_money_holdings
The devaluation of money runs through almost the whole process of history, from the weight reduction and purity decrease of metallic coin to the unanchored over-issuance of paper currency.
Historical summary/claim by the authors referencing long-run monetary history; no specific empirical study or sample size given in the excerpt.
high negative RSDM: The Consensus Honest Money in the AI Era occurrence_of_currency_devaluation_over_history
Disparities may lead to AI bias and governance challenges that potentially leave the poorest communities excluded from the Fourth Industrial Revolution.
Paper lists AI bias and governance challenges as potential consequences of uneven AI development; presented as conceptual/ethical/political risks without empirical quantification in the excerpt.
high negative GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... AI bias and governance failures leading to exclusion
These disparities risk causing economic isolation and social inequality.
Qualitative claim in the paper listing potential socio-economic risks of uneven AI adoption; no supporting empirical estimates in the excerpt.
high negative GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... economic isolation and social inequality
These disparities carry the risk of a deepening digital divide.
Stated as a consequence/risk in the paper; presented qualitatively without empirical quantification in the excerpt.
high negative GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... digital divide (differential access/use of digital technologies)
Projections indicate that without additional measures, these disparities are likely to increase.
Paper reports forward-looking projections or scenario analysis (methods, assumptions, and quantitative projection details not given in the excerpt).
high negative GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... future global disparities / inequality in AI and digital access
Low-income regions (in particular parts of Africa and South Asia) lag significantly behind in both education and access to digital technologies.
Statement in the paper based on comparative assessment of education levels and digital access across regions; the excerpt provides no numeric data or described sample.
high negative GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... education levels and access to digital technologies
Keeping humans in the loop can sometimes make the decision worse.
Argumentative/diagnostic statement in the paper (theoretical assertion; no experimental or observational effect sizes reported in the excerpt).
high negative Leading Across the Spectrum of Human-AI Relationships: A Con... decision quality when humans are kept in the loop
Leaders may believe oversight remains meaningful when it has become ceremonial.
Conceptual warning in the paper about erosion of meaningful oversight (no empirical validation provided in the excerpt).
high negative Leading Across the Spectrum of Human-AI Relationships: A Con... meaningfulness/effectiveness of oversight
The central risk is misrecognition: leaders may keep a human-centered story in place after decision-shaping authority has shifted elsewhere (e.g., to AI).
Analytic/diagnostic claim in the paper (conceptual warning; no empirical sample or measured incidence provided).
high negative Leading Across the Spectrum of Human-AI Relationships: A Con... degree of accurate recognition of who holds decision-shaping authority
Current AI agents implement only the first half of CLS (fast exemplar/hippocampal-style storage) and lack the slow weight-consolidation half.
Analytic claim in paper comparing current AI agent designs to CLS; no empirical evaluation reported in abstract.
high negative Contextual Agentic Memory is a Memo, Not True Memory presence/absence of slow weight-consolidation mechanisms in AI agents
Agents that rely only on lookup are structurally vulnerable to persistent memory poisoning as injected content propagates across all future sessions.
Theoretical/security argument presented in paper; claims about propagation of injected content across sessions; no empirical attack experiments detailed in abstract.
high negative Contextual Agentic Memory is a Memo, Not True Memory vulnerability to persistent memory poisoning
Conflating the two produces agents that face a provable generalization ceiling on compositionally novel tasks that no increase in context size or retrieval quality can overcome.
Formal claim asserted in paper (formalization of limitations and proofs claimed); no empirical sample detailed in abstract.
high negative Contextual Agentic Memory is a Memo, Not True Memory generalization performance on compositionally novel tasks
Conflating retrieval and weight-based memory produces agents that accumulate notes indefinitely without developing expertise.
Theoretical argument/formalization presented in paper; claim based on analysis of how lookup-only systems fail to consolidate abstract knowledge; no empirical sample reported in abstract.
high negative Contextual Agentic Memory is a Memo, Not True Memory expertise development / continued accumulation of notes
Treating lookup as memory is a category error with provable consequences for security.
Theoretical/formal argument and formalization in paper; security consequences (e.g., persistent poisoning) claimed; no empirical sample reported in abstract.
high negative Contextual Agentic Memory is a Memo, Not True Memory security (vulnerability to persistent memory poisoning)
Treating lookup as memory is a category error with provable consequences for long-term learning.
Theoretical/formal argument asserted in the paper, drawing on formalization and Complementary Learning Systems theory; no empirical sample reported in abstract.
high negative Contextual Agentic Memory is a Memo, Not True Memory long-term learning
Treating lookup as memory is a category error with provable consequences for agent capability.
Theoretical/formal argument asserted in the paper (formalization and proofs claimed); no empirical sample reported in abstract.
Current agentic memory systems (vector stores, retrieval-augmented generation, scratchpads, and context-window management) do not implement memory: they implement lookup.
Conceptual/analytic claim stated in paper; supported by comparison of existing agent memory mechanisms (vector stores, RAG, scratchpads, context-window management) to the paper's definition of 'memory'. No empirical sample reported.
high negative Contextual Agentic Memory is a Memo, Not True Memory whether systems implement memory vs. lookup
Novices more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark.
Annotation-based comparison in the 27K WildChat transcript sample indicating higher rates of 'invisible' failures (apparent successes that are actually incorrect or insufficient) among novice users.
high negative A paradox of AI fluency invisible failure rate (apparent success but incorrect outcome)
Fluent users experience more failures than novices.
Quantitative comparison of failure occurrences across user-fluency strata in the 27K annotated transcript sample from WildChat-4.8M.
high negative A paradox of AI fluency failure rate (errors / failed turns)
Reactive approaches paired with automation or creation produced breakdowns (reduced effectiveness).
Thematic evidence from interviewees describing instances where reactive leadership combined with high automation-or-creation use led to coordination or accountability breakdowns across the 34 cases.
high negative E-leadership and human-AI collaboration: socio-technical ali... perceived team effectiveness (breakdowns)
Workers acquire skills through generative AI tools but lack credible ways to signal or validate these skills in competitive freelance markets (a structural challenge the paper terms 'invisible competencies').
Reported finding and conceptual contribution based on the paper's mixed-methods study (survey + semi-structured interviews).
high negative Upskilling with Generative AI: Practices and Challenges for ... ability to signal/validate skills acquired via generative AI in freelance market...
There is a shift from learning as growth to learning as survival, where upskilling is oriented toward immediate market viability rather than long-term development.
Reported thematic finding from the paper's interviews and survey of freelance knowledge workers.
high negative Upskilling with Generative AI: Practices and Challenges for ... orientation of upskilling (immediate market viability vs long-term development)
Freelancers do not treat generative AI as their primary learning resource due to inconsistency, lack of contextual relevance, and verification overhead.
Reported finding from the paper's mixed-methods study (survey + semi-structured interviews with freelance knowledge workers).
high negative Upskilling with Generative AI: Practices and Challenges for ... role of generative AI in freelancers' learning stacks / barriers to using it as ...
Freelance workers must continually acquire new skills to remain competitive in online labor markets, yet they lack the organizational training, mentorship, and infrastructure available to traditional employees.
Framing statement in the paper's introduction / literature review (not reported as an empirical result from this study).
high negative Upskilling with Generative AI: Practices and Challenges for ... need for continual upskilling and availability of organizational training/mentor...
Existing approaches address data quality but not data valuation.
Literature review / background discussion in paper contrasting prior work on data quality with lack of approaches for data valuation.
high negative Calibrating Attribution Proxies for Reward Allocation in Par... coverage of data valuation in existing approaches
Suppression bias is the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold.
Definition and characterization of a proposed failure mode provided in the paper (conceptual/theoretical).
high negative Learning from Disagreement: Clinician Overrides as Implicit ... bias in recorded overrides leading to omission of correct-but-difficult recommen...
Existing approaches, runtime guardrails, training-time alignment, and post-hoc auditing treat governance as an external constraint rather than an internalized behavioral principle, leaving agents vulnerable to unsafe and irreversible actions.
Author's conceptual/literature critique presented in the paper (argumentative claim, no empirical sample or experiment reported for this statement).
high negative Think Before You Act -- A Neurocognitive Governance Model fo... vulnerability to unsafe and irreversible actions