The Commonplace

Evidence (8570 claims)

Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
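As a reading aid for the matrix above, the sketch below converts a few rows into direction shares. The row values are copied from the table; the function and variable names are illustrative. The four direction columns do not always add up to the listed Total (Firm Productivity sums to 600 against a Total of 606), so shares are computed against the listed Total rather than the column sum.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
rows = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Skill Obsolescence": (5, 46, 6, 1, 58),
    "Task Completion Time": (173, 31, 8, 12, 225),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the listed row total, as fractions."""
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
    }


shares = {name: direction_shares(*counts) for name, counts in rows.items()}

for name, s in shares.items():
    print(f"{name}: {s['positive']:.0%} positive, {s['negative']:.0%} negative")
```

Read this way, Job Displacement and Skill Obsolescence are dominated by negative findings, while Firm Productivity and Task Completion Time are dominated by positive ones.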
Filtered by: Adoption
The model shows when these systems become vulnerable to strategic use from within government.
Analytical result derived from the paper's formal theoretical model (no empirical validation reported).
high negative AI Governance under Political Turnover: The Alignment Surfac... vulnerability of automated systems to strategic internal use
The compliance layer can also create a stable approval boundary that political successors learn to navigate while preserving the appearance of lawful administration.
Stated conclusion/insight from the paper's formal argument and conceptual framing (theoretical, no empirical sample).
high negative AI Governance under Political Turnover: The Alignment Surfac... creation of a stable approval boundary exploitable by successive governments
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.
Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).
high negative AgentPulse: A Continuous Multi-Signal Framework for Evaluati... Adoption-Capability correlation among closed-source high-capability agents
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.
Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).
high negative AgentPulse: A Continuous Multi-Signal Framework for Evaluati... scope of measurement of static benchmarks (capability vs. deployment/adoption)
Self-assessment is a key bottleneck for market-style coordination of AI agents.
Conclusion drawn from empirical results (miscalibration findings, auction divergence, modest improvement from prior-information intervention) reported in the paper.
high negative MarketBench: Evaluating AI Agents as Market Participants importance of self-assessment calibration for successful market coordination
Auctions built from these self-reports diverge from a full-information allocation.
Simulation or empirical auction experiments using self-reported signals from the six LLMs on the 93 tasks, compared to a full-information allocation benchmark (method described in paper).
high negative MarketBench: Evaluating AI Agents as Market Participants difference between allocations produced by auctions using self-reports and full-...
These LLMs are miscalibrated on both success probability and token usage.
Empirical evaluation of six LLMs on 93 SWE-bench Lite tasks assessing calibration of predicted success probabilities and token usage (as reported in the paper).
high negative MarketBench: Evaluating AI Agents as Market Participants calibration of self-reported success probability and token usage
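To make "miscalibrated" concrete, here is a minimal sketch of two common calibration measures applied to self-reported success probabilities. The data are invented for illustration, and the entry above does not say which metrics the paper uses, so the Brier score and the mean calibration gap are assumptions rather than the authors' method.

```python
def brier_score(predicted, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


def calibration_gap(predicted, outcomes):
    """Mean predicted probability minus observed success rate.

    A positive value means the agent is overconfident on average.
    """
    return sum(predicted) / len(predicted) - sum(outcomes) / len(outcomes)


# Hypothetical self-reports for six tasks and whether each task was solved.
predicted = [0.9, 0.8, 0.9, 0.7, 0.85, 0.95]
outcomes = [1, 0, 0, 1, 0, 1]

brier = brier_score(predicted, outcomes)
gap = calibration_gap(predicted, outcomes)
print(f"Brier score: {brier:.3f}, calibration gap: {gap:+.2f}")
```

In this made-up sample the agent predicts 85% success on average but solves half the tasks, so the gap is positive.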
Standard PayGo degrades substantially under classroom-scale concurrency.
Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.
high negative Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response time (latency) degradation under concurrency
Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face.
Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).
high negative Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response latency (task completion time)
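The "parallel-phase maximum effect" can be illustrated with a short simulation: when several calls run concurrently, the phase finishes only when the slowest call returns, so the expected phase latency grows with the number of calls. The exponential latency distribution, its 1-second mean, and the function name are assumptions for illustration only; they are not taken from the paper's measurements.

```python
import random


def mean_parallel_latency(n_calls, trials, rng):
    """Estimate the mean latency of a phase that waits for n_calls concurrent calls.

    Each call latency is drawn from an exponential distribution with a mean of
    1.0 second (an illustrative assumption). The phase latency is the maximum
    of the individual call latencies.
    """
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(n_calls))
    return total / trials


rng = random.Random(0)  # fixed seed so the sketch is reproducible
m1 = mean_parallel_latency(1, 20_000, rng)  # single-agent: one call per query
m4 = mean_parallel_latency(4, 20_000, rng)  # four concurrent calls per query
print(f"1 call: {m1:.2f}s, 4 concurrent calls: {m4:.2f}s")
```

Under these assumptions the four-call phase takes roughly twice as long on average as a single call, even though no individual call is slower.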
In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.
Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).
high negative Generative artificial intelligence reduces social welfare th... collective (social) welfare
Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.
Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).
high negative Generative artificial intelligence reduces social welfare th... spillover adoption and amplified welfare losses
The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.
Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).
high negative Generative artificial intelligence reduces social welfare th... social welfare for high-value tasks
Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.
Theoretical/conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.
high negative Generative artificial intelligence reduces social welfare th... output diversity and accuracy
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.
Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... correlation and bias between model self-predicted token usage and actual token u...
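The two quantities in this entry, correlation and underestimation bias, can be computed as in the sketch below. The token counts are invented, and the mean relative error is one plausible way to express "systematic underestimation"; the paper's exact definitions are not given in the entry above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_relative_bias(predicted, actual):
    """Mean of (predicted - actual) / actual; negative means underestimation."""
    return sum((p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical per-task token counts: what the model predicted vs. what it used.
predicted_tokens = [100_000, 150_000, 120_000, 200_000, 90_000]
actual_tokens = [400_000, 300_000, 900_000, 600_000, 350_000]

r = pearson(predicted_tokens, actual_tokens)
bias = mean_relative_bias(predicted_tokens, actual_tokens)
print(f"correlation: {r:.2f}, mean relative bias: {bias:+.0%}")
```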
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.
Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... average total token consumption per model (tokens consumed by model A minus mode...
Input tokens rather than output tokens drive the overall cost of agentic tasks.
Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... share/contribution of input tokens vs output tokens to total token consumption
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.
Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... total token consumption (agentic vs. code reasoning/code chat)
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.
high negative The Last Harness You'll Ever Build need for human (expert) harness engineering
A vulnerability class is characterised for expected-utility maximisers that makes them susceptible to adversarial gambles.
Formal characterization/definition and analytical derivation in the paper describing which expected-utility maximisers are vulnerable to adversarial (Pascal-type) offers; theoretical examples provided rather than empirical tests.
high negative Bounding the Long Tail: Ai Norms for Decision-Making Under N... vulnerability of expected-utility maximisers to adversarial gambles
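A small worked example of the vulnerability: an agent with unbounded linear utility accepts any gamble whose promised payoff is large enough to outweigh a tiny success probability, so an adversary only needs to name a bigger number. The bounded variant is one commonly discussed response and is shown purely for contrast; it is an assumption here, not necessarily the norm the paper proposes. All names and values are illustrative.

```python
def accepts_linear(p, payoff, cost):
    """Expected-utility maximiser with unbounded linear utility."""
    return p * payoff - cost > 0


def accepts_bounded(p, payoff, cost, cap):
    """Same decision rule, but utility from the payoff is capped at `cap`."""
    return p * min(payoff, cap) - cost > 0


p = 1e-9       # probability the adversary's promise is honoured
cost = 10.0    # what the agent must hand over up front
payoff = 1e12  # promised payoff, chosen by the adversary so that p * payoff > cost

print("linear utility accepts:", accepts_linear(p, payoff, cost))
print("bounded utility accepts:", accepts_bounded(p, payoff, cost, cap=1e6))
```

With the cap in place, raising the promised payoff further cannot change the decision, because `p * cap` stays below the cost.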
Industry digital maturity weakens the effect of the peer leader on a focal firm’s AI adoption.
Interaction/heterogeneity analysis in fixed-effects regression models on panel data of publicly listed Chinese firms (2012–2023), using an industry digital maturity moderator.
high negative Following the Herd or the Bellwether: Peer Effects in Firms’... focal firm AI adoption level (moderated by industry digital maturity for peer le...
Current evaluation proxies are insufficient for predicting downstream human impact.
Empirical results in the paper showing decoupling between standard quantitative proxies (e.g., sparsity, faithfulness) and human outcomes (clarity, decision utility, confidence) across datasets and analyst reviews.
high negative Rethinking XAI Evaluation: A Human-Centered Audit of Shapley... predictive validity of quantitative evaluation proxies for human impact
A highlighting policy that is optimal for sophisticated agents can perform arbitrarily poorly when deployed to naive agents.
Constructive worst-case examples and theoretical bounds in the paper demonstrating arbitrarily large performance degradation when applying sophisticated-optimal policies to naive agents.
high negative Algorithmic Feature Highlighting for Human-AI Decision-Makin... performance (loss in decision quality) of highlighting policies when agent type ...
Optimizing highlighting for sophisticated agents can be computationally intractable, even in simple discrete and binary settings.
Theoretical complexity results and proofs in the paper showing hardness of the optimization problem under the sophisticated-agent model; no sample/calibration required (formal/algorithmic analysis).
high negative Algorithmic Feature Highlighting for Human-AI Decision-Makin... computational tractability of the highlighting optimization problem
The regulatory architecture is in place; the verification instrument is not.
Paper's high-level diagnosis asserting that regulations establish obligations but lack a technical instrument for quantitative verification of acceptable risk.
high negative Bounding the Black Box: A Statistical Certification Framewor... presence of regulatory architecture versus presence of technical verification in...
The systems most in need of oversight are opaque statistical inference engines that resist white-box scrutiny.
Paper's characterization/analysis of contemporary high-risk AI systems as opaque statistical models that are difficult to inspect via white-box methods.
high negative Bounding the Black Box: A Statistical Certification Framewor... degree of model opacity / resistance to white-box scrutiny among high-risk AI sy...
This gap is not theoretical: as the EU AI Act moves into full enforcement, developers face mandatory conformity assessments without established methodologies for producing quantitative safety evidence.
Argument in paper linking imminent enforcement of EU AI Act to practical conformity-assessment requirements for developers and asserting lack of established methodologies for quantitative safety evidence.
high negative Bounding the Black Box: A Statistical Certification Framewor... availability of established methodologies for producing quantitative safety evid...
None [of these regulatory frameworks] specifies what 'acceptable risk' means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold.
Paper's critical analysis of existing regulatory instruments, arguing absence of quantitative definitions and verification methods.
high negative Bounding the Black Box: A Statistical Certification Framewor... presence or absence of quantitative acceptable-risk definitions and technical ve...
There is a stark geopolitical divide between 'AI Core' nations and the Global South; the Global South faces acute risks of 'Digital Dependency' and eroded digital sovereignty.
Cross-study synthesis in the systematic review (2018-2026) identifying geopolitical patterns and risks; abstract does not quantify the number of studies or present empirical effect sizes.
high negative Artificial Intelligence, Public Policy and Governance - impl... digital dependency and digital sovereignty
The 'black box' nature of automated systems undermines the democratic social contract and principles of procedural justice, epitomised by the Australian 'Robo-debt' scandal.
Case study material and literature synthesized in the systematic review referencing the Australian Robo-debt case as an exemplar; abstract does not provide primary data or sample sizes.
high negative Artificial Intelligence, Public Policy and Governance - impl... democratic legitimacy and procedural justice
Agentic AI introduces novel challenges related to market stability, regulatory compliance, interpretability, and systemic risk.
Survey discussion synthesizing literature on systemic and governance risks of autonomous systems in markets; draws on conceptual and empirical prior work but does not present new quantitative results.
high negative Agentic Artificial Intelligence in Finance: A Comprehensive ... market stability, regulatory compliance burden, interpretability deficits, syste...
Scalable AI tutoring for procedural skill learning requires structured knowledge representations, yet constructing these representations remains a labor-intensive bottleneck.
Background/claim made in the paper's introduction framing the problem; no specific quantitative evidence reported in the abstract.
high negative Developing Models of Procedural Skills using an AI-assisted ... effort required to construct structured knowledge representations
Under-represented groups tend to be systematically under-observed because of historical exclusion and selective feedback, which exacerbates uncertainty for those groups.
Conceptual claim supported by illustrative examples (e.g., lending context) and simulations demonstrating selective feedback effects; literature citation likely included in paper.
high negative Fairness under uncertainty in sequential decisions observation frequency/data availability for under-represented groups; resulting ...
Policies that ignore the unobserved (counterfactual) space can harm decision makers (via unrealized gains or losses) and subjects (via compounding exclusion and reduced access).
Theoretical argumentation and illustrative examples (e.g., loan denial counterfactuals) and modelled simulations showing downstream harms when ignoring unobserved outcomes.
high negative Fairness under uncertainty in sequential decisions unrealized gains/losses for decision makers; compounding exclusion and reduced a...
Experiments on simulated data with varying bias show that unequal uncertainty and selective feedback produce disparities across groups.
Simulation experiments described in the paper manipulate bias and feedback patterns and report resulting group disparities (synthetic datasets; experiment details in methods/results sections).
high negative Fairness under uncertainty in sequential decisions group disparities (fairness metrics)
Industrial firms face a dual challenge: (1) the development and deployment of digital technologies and (2) the proliferation and integration of the corresponding skills portfolios.
Conceptual framing and literature synthesis presented in the paper (identification by authors); not tied to a specific quantitative sample in the provided text.
high negative Industry 4.0 Inc.—Mergers and acquisitions and the digital t... ability to develop and deploy digital technologies and integrate skills portfoli...
Renewable energy adoption further reinforces the beneficial effect of digital trade on emissions under stronger regulatory stringency (mediation via renewable energy and regulation).
Structural equation modelling (SEM) on the monthly panel (38 OECD economies, 2000–2024) assessing mediation paths through renewable energy adoption and regulatory stringency; reported as reinforcing the digital trade effect.
There is a carbon-pricing threshold at USD 40 per tonne, above which emissions decline significantly (Δ = −15%, p < 0.01).
Carbon-pricing threshold analysis applied to the monthly panel of 38 OECD economies (2000–2024); threshold identified and associated pre/post comparison reports a 15% decline with p < 0.01.
The environmental effect of digital trade becomes stronger (more negative on emissions) when combined with AI-enhanced logistics (interaction effect).
Econometric models including interaction terms for AI-enhanced logistics and digital trade on the monthly panel (38 OECD economies, 2000–2024); interaction effects identified via regression and machine-learning threshold techniques.
GVC participation is significantly associated with lower CO2 emissions (β = −0.064, p < 0.01).
Econometric analysis on a monthly panel of 38 OECD economies from 2000–2024 using fixed-effects models; coefficient and p-value reported in paper.
Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness in volatile-demand, uncertain-supply industries.
Positioning/background statement in the paper motivating the integrated framework (literature-based claim).
high negative Hybrid Deep Learning Approach for Coupled Demand Forecasting... effectiveness of isolated forecasting/optimization approaches
The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable: each decision the agent makes is recorded directly in cells that belong to and reflect on the user.
Conceptual / domain-specific argument made by the authors (no empirical sample attached to the claim).
high negative Auditing and Controlling AI Agent Actions in Spreadsheets risk associated with automated changes to user-owned artifacts
AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution: by the time users receive the output, all underlying decisions have already been made without their involvement.
Author assertion / conceptual description in the paper (no empirical quantification provided for this general statement).
high negative Auditing and Controlling AI Agent Actions in Spreadsheets process transparency / accessibility during execution
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.
Author assertion / literature-level observation presented in the paper (no empirical sample reported for this claim).
high negative Auditing and Controlling AI Agent Actions in Spreadsheets user oversight ability
A threat model taxonomy mapping misuse vectors to hardware, software, institutional, and liability layers illustrates why no single governance mechanism suffices.
Threat model taxonomy developed in the paper (conceptual taxonomy; illustrative mapping rather than empirical testing).
high negative The Open-Weight Paradox: Why Restricting Access to AI Models... completeness/adequacy of single governance mechanisms
Restricting access to open-weight models deepens asymmetries while driving proliferation into unsupervised settings.
Argumentation and threat-model reasoning in the paper describing likely consequences of restrictions (theoretical analysis; no empirical sample cited).
high negative The Open-Weight Paradox: Why Restricting Access to AI Models... geopolitical asymmetries and proliferation into unsupervised settings
Access restrictions, without governed alternatives, may displace risks rather than reduce them.
Theoretical argument and threat-model analysis in the paper showing possible risk displacement (conceptual reasoning; no empirical sample reported).
high negative The Open-Weight Paradox: Why Restricting Access to AI Models... risk displacement vs risk reduction from access restrictions
Selective forgetting remains underexplored compared to retention in LLM agent memory research.
Authors' literature survey / position statement in paper (assertion made in abstract).
high negative FSFM: A Biologically-Inspired Framework for Selective Forget... extent of research coverage on forgetting vs retention
Beyond technical barriers there are organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities.
Data from over 30 interviews reporting organizational challenges including limited AI literacy, diverse cultural attitudes across organizations, and lagging governance relative to agentic AI capabilities.
high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... organizational readiness factors (AI literacy, culture, governance alignment)
Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains.
Over 30 stakeholder interviews reporting barriers to deployment; qualitative synthesis identifies data fragmentation, security/regulatory requirements, and legacy toolchain access as primary constraints.
high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... barriers to AI adoption in engineering/manufacturing
Users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns.
Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.
high negative SWE-chat: Coding Agent Interactions From Real Users in the W... rate of user pushback per interaction turn