The Commonplace

Evidence (8570 claims)

Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Adoption
The paper identifies three properties of LLM agents that distinguish the present challenge from prior bot-detection problems: identity discontinuity by design, task-based instantiation, and agent-to-agent loops.
Analytic claim based on synthesis of agent architecture literature; presented as conceptual identification rather than empirically tested properties.
high negative The Vanishing User: Web Analytics in an Agent-Dominated Inte... distinctive properties of LLM agents relevant to detection and measurement
A click may reflect an optimization routine, a proxy objective, or a recursive agent-to-agent exchange rather than meaningful human intent, and traditional inference frameworks cannot reliably distinguish among these possibilities.
Theoretical claim derived from literature on agent behaviors, agent-to-agent interactions, and limitations of existing inference frameworks; no empirical discrimination test reported in this paper excerpt.
high negative The Vanishing User: Web Analytics in an Agent-Dominated Inte... reliability of attribution of click events to meaningful human intent
The presence of autonomous AI agents weakens the interpretive value of core web analytics metrics, including sessions, engagement, conversion, and retention.
Argument based on conceptual synthesis of how non-human, non-persistent actors generate signals that undermine standard metric interpretations (position paper; no original empirical test included).
high negative The Vanishing User: Web Analytics in an Agent-Dominated Inte... interpretive validity of core web analytics metrics (sessions, engagement, conve...
Unlike crawlers and traditional bots, autonomous AI agents do not possess persistent identities or psychologically grounded motivations; they are task-specific, dynamically instantiated processes whose behaviors are contingent and often orchestrated by external systems.
Conceptual analysis informed by literature on agent architecture and LLM-based agents; no primary empirical measurement presented in this paper excerpt.
high negative The Vanishing User: Web Analytics in an Agent-Dominated Inte... identity persistence and motivational structure of autonomous AI agents (vs. tra...
Conventional web analytics treats the human user as its fundamental unit of analysis, assuming stable preferences, identifiable intentions, and behavioral patterns that unfold over time.
Conceptual statement supported by literature synthesis and critique of standard web-analytics assumptions (position paper; no primary empirical sample reported).
high negative The Vanishing User: Web Analytics in an Agent-Dominated Inte... validity of web analytics' human-centered unit-of-analysis assumption (stability...
Because AI-assisted development tools lack production awareness, generated artifacts may exhibit brittle behavior and limited deployability.
Paper asserts that lack of production awareness leads to brittle artifacts and limited deployability; no quantitative measures or sample sizes provided in the abstract.
high negative Architectural Constraints Alignment in AI-assisted, Platform... brittleness of artifacts and deployability
AI-assisted development tools often lack awareness of architectural constraints, infrastructure dependencies, and organizational standards required in production environments.
Asserted observation in the paper arguing limitations of general-purpose AI code generation when targeting production-ready systems; no empirical sample size or methodological details provided in the excerpt.
high negative Architectural Constraints Alignment in AI-assisted, Platform... awareness of architectural constraints / suitability for production
Nominally cheaper models can incur higher total cost due to token-intensive reasoning.
Cost and token usage analysis reported in the paper showing cheaper-per-token models may generate more tokens and thus higher total cost in practice.
high negative Switchcraft: AI Model Router for Agentic Tool Calling total inference cost as a function of token usage and per-token price
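The cost claim above reduces to simple arithmetic: total cost is tokens generated times the per-token price, so a lower price can be outweighed by a higher token count. The sketch below illustrates this with hypothetical token counts and prices; none of the numbers or names come from the paper.

```python
def total_cost(tokens_generated: int, price_per_million_tokens: float) -> float:
    """Total inference cost in dollars for one request."""
    return tokens_generated * price_per_million_tokens / 1_000_000

# Hypothetical figures, chosen only to illustrate the effect.
# A nominally cheap model that reasons at length:
cheap_model_cost = total_cost(tokens_generated=12_000, price_per_million_tokens=2.0)
# A pricier model that answers concisely:
large_model_cost = total_cost(tokens_generated=1_500, price_per_million_tokens=10.0)

# The cheaper-per-token model ends up costing more in total.
print(cheap_model_cost > large_model_cost)
```

Under these assumed figures the cheap model costs more per request, even though its per-token price is five times lower.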
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets.
Stated as background/motivation in the paper (conceptual claim; no empirical sample size reported).
high negative Switchcraft: AI Model Router for Agentic Tool Calling inference cost / developer tendency to use large models
Patent text similarity analysis confirms a 'homogenization trap' (AI-associated increases in patent-text similarity).
Text-similarity analysis of patent documents reported in the paper showing increased patent similarity associated with AI use.
high negative The Inverted-U Relationship Between AI and Corporate Innovat... patent text similarity (homogenization of patent content)
Industry concentration negatively moderates the AI–innovation relationship.
Moderation analysis/interacted fixed-effects models indicating that higher industry concentration weakens the AI→innovation effect.
high negative The Inverted-U Relationship Between AI and Corporate Innovat... moderating effect of industry concentration on AI → innovation
Cascade performance is limited primarily by structural cost (a cascade pays for the cheap model before any escalation decision), rather than by a shortage of intermediate stages.
Synthesis of theoretical insights and empirical results reported in the paper (theoretical analysis of structural costs + empirical comparisons showing limited benefit from additional stages).
high negative Is Escalation Worth It? A Decision-Theoretic Characterizatio... primary constraint on cascade performance (structural cost vs availability of in...
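The structural cost described above can be illustrated with a two-stage cascade: the cheap model is always paid for, and the large model is paid for only on escalation. This is a minimal sketch under assumed per-call costs and escalation rates; the function names and numbers are hypothetical and are not the paper's formal model.

```python
def cascade_expected_cost(c_cheap: float, c_large: float, p_escalate: float) -> float:
    """Expected cost per query: cheap model always runs, large model runs on escalation."""
    return c_cheap + p_escalate * c_large

def break_even_escalation_rate(c_cheap: float, c_large: float) -> float:
    """Escalation rate above which the cascade costs more than calling the large model directly."""
    return 1.0 - c_cheap / c_large

# Hypothetical per-call costs in arbitrary units.
c_cheap, c_large = 1.0, 10.0

low_escalation = cascade_expected_cost(c_cheap, c_large, p_escalate=0.5)
high_escalation = cascade_expected_cost(c_cheap, c_large, p_escalate=0.95)

print(low_escalation < c_large)   # cascade is cheaper than always using the large model
print(high_escalation > c_large)  # the up-front cheap call makes the cascade more expensive
```

In this toy setting, adding more intermediate stages would add further up-front payments of the same kind, which is one way to read the claim that extra stages do not remove the constraint.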
Optimized subsequence cascades do not deliver practically meaningful held-out gains over the pairwise envelope.
Empirical evaluation on the five benchmarks comparing optimized subsequence cascades to the pairwise envelope; reported lack of practically meaningful held-out improvement.
high negative Is Escalation Worth It? A Decision-Theoretic Characterizatio... held-out performance gains of optimized subsequence cascades relative to the pai...
Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope.
Empirical comparison across the reported benchmarks and models showing that full fixed chains achieve worse cost-quality tradeoffs than the pairwise envelope (experimental results described in the paper).
high negative Is Escalation Worth It? A Decision-Theoretic Characterizatio... relative cost-quality performance of full fixed-chain cascades versus the pairwi...
Municipal 311 call centers and complaint intake systems face a structural mismatch between incoming volume and classification capacity that produces a bottleneck and differential service quality that follows income and racial lines.
Stated in the paper's introduction; cites prior work (Liu 2024 SLA) as support for the differential service-quality / demographic claim. No sample size or quantitative result reported in the excerpt.
high negative Scaling the Queue: Reinforcement Learning for Equitable Call... differential service quality by income and race
Organizational resistance to technological change hinders AI adoption in logistics operations.
Qualitative synthesis of 31 reviewed publications identifying organizational and cultural barriers to AI uptake.
high negative Evaluating the Role of Artificial Intelligence in Optimizing... organizational resistance as an adoption barrier
Data security concerns are a key barrier to adopting AI in global supply chains.
Synthesis of themes from 31 scholarly sources in the structured literature review highlighting data/security-related implementation issues.
high negative Evaluating the Role of Artificial Intelligence in Optimizing... data security concerns as an adoption barrier
High initial investment costs are a significant barrier to AI implementation in logistics.
Synthesis of literature (31 sources) reporting implementation challenges and barriers identified across studies.
high negative Evaluating the Role of Artificial Intelligence in Optimizing... adoption barriers (initial investment costs)
AGI (Artificial General Intelligence) is problematic both conceptually and definitionally.
Authorial assertion in the paper stating AGI is problematic as a concept and definition; framed as a conditioning assumption that shapes the subsequent analysis.
high negative Pathways to AGI conceptual_and_definitional_soundness_of_AGI
The paper argues that the current commercial AI development trajectory should not be assumed to be inevitable.
Authorial methodological claim in the paper's framing/introductory text; presented as a normative methodological stance rather than empirical evidence.
high negative Pathways to AGI policy_assumption_of_inevitability
There is an absence of agreed-upon benchmarks for evaluating AI systems.
Introductory chapter notes lack of standardized evaluation benchmarks as a cross-cutting concern; presented as an analytical observation by the task force.
high negative Introduction: Artificial Intelligence, Politics, and Politic... existence of standardized evaluation benchmarks for AI
AI systems exhibit bias.
Introductory chapter points to bias in AI systems as a recurring theme; supported by the broader literature cited in the report (no numerical sample reported in the introduction).
high negative Introduction: Artificial Intelligence, Politics, and Politic... bias and fairness issues in AI system outputs and decisions
AI model outputs are often opaque and non-replicable.
Introductory chapter identifies opacity and non-replicability of AI outputs as a cross-cutting theme; claim is based on literature synthesis and conceptual critique in the report.
high negative Introduction: Artificial Intelligence, Politics, and Politic... transparency and replicability of AI model outputs
A small number of AI corporations have unprecedented power.
Introductory chapter highlights the theme of concentrated corporate power in AI; asserted as an observational claim in the report's framing rather than derived from a presented empirical sample in the introduction.
high negative Introduction: Artificial Intelligence, Politics, and Politic... concentration of corporate power in the AI industry (market control, platform in...
Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels.
Empirical experiments reported in the paper evaluating three frontier large language models on three task domains (short stories, marketing slogans, alternative-uses) and finding ρ < 1 (below parity) across crowding kernels. The abstract specifies three models but does not report the number of generated samples per model or other sample-size details.
high negative Ex Ante Evaluation of AI-Induced Idea Diversity Collapse human-relative diversity ratio (ρ) indicating excess crowding
Evaluating creative AI only at the level of individual outputs creates an evaluation blind spot: AI can improve individual outputs while increasing population-level crowding.
Theoretical/conceptual claim in the paper arguing that improvements at the individual-output level can still increase similarity (crowding) at the population level; no empirical numbers given in the abstract.
high negative Ex Ante Evaluation of AI-Induced Idea Diversity Collapse population-level crowding (diversity collapse)
Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones.
Conceptual argument presented in the paper's introduction motivating a population-level perspective on creative outputs (no empirical sample size reported).
high negative Ex Ante Evaluation of AI-Induced Idea Diversity Collapse loss of value due to similarity (population-level creative value)
Any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score.
Formal theoretical result/proof presented in the paper based on the transformation-graph semantic-class model.
Once announced, a metric that scores content variants directly becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm.
Modeling argument in the paper (transformation graph / semantic classes) and supported by formal analysis and experimental checks described in the paper.
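The manipulability argument in the two claims above can be shown with a toy example: if two variants in the same harmful semantic class receive different scores, a platform can swap in the lower-scoring variant and improve its measured score while true harm stays the same. This is a sketch under assumed scores; the variant names, scores, and helper functions are hypothetical and do not reproduce the paper's transformation-graph model.

```python
# Hypothetical variants: name -> (semantic class, score assigned by the metric).
# Both belong to the same harmful class, but the metric scores them differently.
VARIANTS = {
    "variant_a": ("harmful_class", 0.9),
    "variant_b": ("harmful_class", 0.2),
}
# Assumption: true harm depends only on the semantic class, not the surface variant.
TRUE_HARM = {"harmful_class": 1.0}

def measured_score(served):
    """Average metric score over the served recommendations."""
    return sum(VARIANTS[v][1] for v in served) / len(served)

def true_harm(served):
    """Average true harm over the served recommendations."""
    return sum(TRUE_HARM[VARIANTS[v][0]] for v in served) / len(served)

def reroute(served):
    """Replace each item with the lowest-scoring variant in its semantic class."""
    best = {}
    for name, (cls, score) in VARIANTS.items():
        if cls not in best or score < VARIANTS[best[cls]][1]:
            best[cls] = name
    return [best[VARIANTS[v][0]] for v in served]

baseline = ["variant_a", "variant_a"]
gamed = reroute(baseline)

print(measured_score(baseline), measured_score(gamed))  # measured score drops
print(true_harm(baseline), true_harm(gamed))            # true harm is unchanged
```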
A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Empirical comparison in simulator experiments indicating that optimizing for exact action accuracy (matching individual actions) can harm higher-level trace distribution alignment; observed in the studies contrasting deterministic copying/value-based approaches with Trace-Prior RL.
high negative Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... exact action accuracy vs. aggregate trace alignment (distributional match)
Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior.
Empirical observation in simulator experiments comparing deterministic value-based RL and deterministic copying agents to other approaches; observed collapsed/shortcut pricing behaviors when uncertainty is unresolved.
high negative Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... policy action distribution / pricing choices (shortcut behavior)
This failure is a Goodhart-style failure under partial observability: Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices.
Theoretical diagnosis supported by simulator setup and observed ambiguity in agent-visible states mapping to multiple competitor prices; derived from the two-hotel simulator design where key competitor variables are hidden from Hotel A.
high negative Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... policy robustness / correctness under partial observability (mapping from observ...
GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1.
Model-level observation from the ASR analysis within the experiment (paper reports GPT-4.1 had perfect TSR and HF1 but failed trajectory-level fidelity).
high negative Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... trajectory fidelity vs. standard metrics (TSR, HF1)
Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly.
Empirical evaluation reported in the paper: HMASP tested across 18 LLMs and 90,000 task instances; analysis via ASR showing checkpoint-skipping behavior for 10 models and correct enforcement for 8 models.
high negative Beyond Task Success: Measuring Workflow Fidelity in LLM-Base... adherence to expected workflow transitions (confirmation checkpoint adherence)
DePAI entails risks including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, requiring value-sensitive design and continuously adaptive governance.
Risk analysis and conceptual argument in the paper identifying possible failure modes and recommended design/governance responses; no empirical incidence data provided.
high negative DAO-enabled decentralized physical AI: A new paradigm for hu... security, centralization, incentive failure, legal exposure, and intrinsic motiv...
WIOA is not well-equipped to support large-scale, cross-industry labor transitions.
The low observed incidence of cross-industry occupational transitions and limited shifts into less automation-exposed occupations in the WIOA data (2017-2023) lead the authors to conclude that the program is poorly suited for large-scale cross-industry reallocation.
high negative Did US Worker Retraining Reduce Participant Automation Expos... cross-industry occupational transitions / shifts in RTI after program participat...
A substantial portion of WIOA participants simply return to their prior field after program participation.
Descriptive and outcome analyses on the WIOA participation records (2017-2023) showing many participants re-enter the same occupation/industry rather than transitioning to different occupations.
high negative Did US Worker Retraining Reduce Participant Automation Expos... occupational/industry re-entry (return to prior field) following program partici...
WIOA rarely shifts workers into less automation-exposed work.
Analysis of WIOA administrative records (2017-2023) using a newly introduced 'Retrainability Index' that decomposes outcomes into post-intervention wage recovery and shifts in routine task intensity (RTI). The paper reports low incidence of downward RTI (movement into less automation-exposed occupations) among participants.
high negative Did US Worker Retraining Reduce Participant Automation Expos... change in Routine Task Intensity (RTI) of occupations post-participation
Mechanism tests indicate that innovation stagnation in mature firms with redundant AI is one pathway limiting productivity gains; that is, AI can be associated with stagnant innovation in mature firms.
Mechanism analysis reported in the paper showing signs of reduced innovation-related gains or stagnation in mature, advanced firms using AI (interpreted as redundant AI leading to limited incremental innovation).
high negative The Heterogeneous Effects of Artificial Intelligence on Ente... Innovation activity / productivity implications
AI integration creates challenges such as workforce displacement that must be addressed.
Authors raise workforce displacement as a challenge/consideration in the paper's discussion; this appears as a qualitative claim rather than an empirically quantified result in the supplied text.
AI integration creates challenges such as algorithmic bias that must be addressed.
Authors identify algorithmic bias as a notable challenge in the discussion/conclusion; presented qualitatively rather than as an estimated empirical outcome in the supplied text.
Responsible AI research typically focuses on examining the use and impacts of deployed AI systems, and there is currently limited visibility into the pre-deployment decisions to pursue building such systems.
Argument and literature framing presented in the paper based on a scoping review of academic literature, civil society resources, and grey literature.
high negative To Build or Not to Build? Factors that Lead to Non-Developme... visibility into pre-deployment decision-making for AI development
Concentration of decision-energy can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low.
Theoretical result/argument from the model linking concentrated decision-energy to increased systemic risk despite low local error rates.
high negative AI Safety as Control of Irreversibility: A Systems Framework... probability of irreversible system-level loss
Efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node.
Derived from the paper's formal model and argumentation about system dynamics (efficiency and feedback mechanisms); theoretical rather than empirical evidence.
high negative AI Safety as Control of Irreversibility: A Systems Framework... concentration of decision-energy (centralization of decision authority)
Declining deployment friction changes the safety problem at its root: safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density.
Main theoretical argument of the paper; supported by conceptual framing and a formal model that introduces decision-density considerations.
high negative AI Safety as Control of Irreversibility: A Systems Framework... safety framing (control of irreversibility)
Recent AI systems compress the distance between capability growth and capability deployment.
Conceptual and descriptive claim in the paper's introduction; supported by theoretical argumentation and illustrative examples rather than empirical measurement.
high negative AI Safety as Control of Irreversibility: A Systems Framework... deployment speed / adoption
Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).
Empirical comparison between the RL Feasibility Index and existing AI-exposure measures, with named creative/interpersonal occupations showing opposite rankings.
high negative What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... relative RL feasibility vs. general AI exposure for named creative/interpersonal...
Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.
Conceptual critique and comparison of existing AI-exposure indices vs. the authors' proposed learnability-focused approach (paper text argument and empirical comparisons implied later).
high negative What Jobs Can AI Learn? Measuring Exposure by Reinforcement ... accuracy/misclassification of occupations by AI-exposure indices vs. learnabilit...
Of the four scarce complements (verified signal, legitimacy, authentic provenance, and integration capacity), integration capacity is the least developed for scientific institutions and the most binding: no improvement in AI tooling can buy it.
Normative/diagnostic claim in the paper about relative scarcity and irreducibility of integration capacity; no empirical measures or sample provided in the excerpt.
high negative AI-Augmented Science and the New Institutional Scarcities relative development of integration capacity in scientific institutions and its ...
Four complements then become scarce and load-bearing for AI-augmented science: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition).
Theoretical framework proposed by the paper; list of four complements presented as an argument without empirical quantification in the excerpt.
high negative AI-Augmented Science and the New Institutional Scarcities scarcity of verified signal, legitimacy, authentic provenance, and integration c...