The Commonplace

Evidence (6507 claims)

Adoption (7395 claims)
Productivity (6507 claims)
Governance (5877 claims)
Human-AI Collaboration (5157 claims)
Innovation (3492 claims)
Org Design (3470 claims)
Labor Markets (3224 claims)
Skills & Training (2608 claims)
Inequality (1835 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 609 159 77 736 1615
Governance & Regulation 664 329 160 99 1273
Organizational Efficiency 624 143 105 70 949
Technology Adoption Rate 502 176 98 78 861
Research Productivity 348 109 48 322 836
Output Quality 391 120 44 40 595
Firm Productivity 385 46 85 17 539
Decision Quality 275 143 62 34 521
AI Safety & Ethics 183 241 59 30 517
Market Structure 152 154 109 20 440
Task Allocation 158 50 56 26 295
Innovation Output 178 23 38 17 257
Skill Acquisition 137 52 50 13 252
Fiscal & Macroeconomic 120 64 38 23 252
Employment Level 93 46 96 12 249
Firm Revenue 130 43 26 3 202
Consumer Welfare 99 51 40 11 201
Inequality Measures 36 105 40 6 187
Task Completion Time 134 18 6 5 163
Worker Satisfaction 79 54 16 11 160
Error Rate 64 78 8 1 151
Regulatory Compliance 69 64 14 3 150
Training Effectiveness 81 15 13 18 129
Wages & Compensation 70 25 22 6 123
Team Performance 74 16 21 9 121
Automation Exposure 41 48 19 9 120
Job Displacement 11 71 16 1 99
Developer Productivity 71 14 9 3 98
Hiring & Recruitment 49 7 8 3 67
Social Protection 26 14 8 2 50
Creative Output 26 14 6 2 49
Skill Obsolescence 5 37 5 1 48
Labor Share of Income 12 13 12 37
Worker Turnover 11 12 3 26
Industry 1 1
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.
Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).
high negative AgentPulse: A Continuous Multi-Signal Framework for Evaluati... Adoption-Capability correlation among closed-source high-capability agents
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.
Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).
high negative AgentPulse: A Continuous Multi-Signal Framework for Evaluati... scope of measurement of static benchmarks (capability vs. deployment/adoption)
Standard PayGo degrades substantially under classroom-scale concurrency.
Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.
high negative Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response time (latency) degradation under concurrency
Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face.
Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).
high negative Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response latency (task completion time)
In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.
Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).
high negative Generative artificial intelligence reduces social welfare th... collective (social) welfare
Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.
Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).
high negative Generative artificial intelligence reduces social welfare th... spillover adoption and amplified welfare losses
The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.
Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).
high negative Generative artificial intelligence reduces social welfare th... social welfare for high-value tasks
Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.
Theoretical/conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.
high negative Generative artificial intelligence reduces social welfare th... output diversity and accuracy
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.
Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... correlation and bias between model self-predicted token usage and actual token u...
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.
Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... average total token consumption per model (tokens consumed by model A minus mode...
Input tokens rather than output tokens drive the overall cost of agentic tasks.
Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... share/contribution of input tokens vs output tokens to total token consumption
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.
Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).
high negative How Do AI Agents Spend Your Money? Analyzing and Predicting ... total token consumption (agentic vs. code reasoning/code chat)
Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes.
Background statement in the paper's introduction; general literature/field observation (no new primary data reported for this claim in the abstract).
high negative Learning-augmented robotic automation for real-world manufac... robustness of fixed waypoint script manipulation
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.
high negative The Last Harness You'll Ever Build need for human (expert) harness engineering
Vibe coding (unstructured GenAI-driven coding) promises rapid prototyping but often suffers from architectural drift, limited traceability, and reduced maintainability.
Paper asserts this as a motivating observation and characterizes vibe coding's weaknesses; the abstract frames these as commonly observed problems motivating the Shift-Up approach (no sample size given in abstract).
high negative Shift-Up: A Framework for Software Engineering Guardrails in... architectural drift, traceability, maintainability
Every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance.
Findings from the nine-variant ablation study reported in the paper; comparison of variants that add each listed mechanism versus the memory+reflection combination.
high negative AEL: Agent Evolving Learning for Open-Ended Environments performance (e.g., Sharpe ratio or other benchmark metrics) relative to memory+r...
There is a stark geopolitical divide between 'AI Core' nations and the Global South; the Global South faces acute risks of 'Digital Dependency' and eroded digital sovereignty.
Cross-study synthesis in the systematic review (2018-2026) identifying geopolitical patterns and risks; abstract does not quantify the number of studies or present empirical effect sizes.
high negative Artificial Intelligence, Public Policy and Governance - impl... digital dependency and digital sovereignty
The 'black box' nature of automated systems undermines the democratic social contract and principles of procedural justice, epitomised by the Australian 'Robo-debt' scandal.
Case study material and literature synthesized in the systematic review referencing the Australian Robo-debt case as an exemplar; abstract does not provide primary data or sample sizes.
high negative Artificial Intelligence, Public Policy and Governance - impl... democratic legitimacy and procedural justice
Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness in volatile-demand, uncertain-supply industries.
Positioning/background statement in the paper motivating the integrated framework (literature-based claim).
high negative Hybrid Deep Learning Approach for Coupled Demand Forecasting... effectiveness of isolated forecasting/optimization approaches
The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable: each decision the agent makes is recorded directly in cells that belong to and reflect on the user.
Conceptual / domain-specific argument made by the authors (no empirical sample attached to the claim).
high negative Auditing and Controlling AI Agent Actions in Spreadsheets risk associated with automated changes to user-owned artifacts
AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution: by the time users receive the output, all underlying decisions have already been made without their involvement.
Author assertion / conceptual description in the paper (no empirical quantification provided for this general statement).
high negative Auditing and Controlling AI Agent Actions in Spreadsheets process transparency / accessibility during execution
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.
Author assertion / literature-level observation presented in the paper (no empirical sample reported for this claim).
high negative Auditing and Controlling AI Agent Actions in Spreadsheets user oversight ability
Selective forgetting remains underexplored compared to retention in LLM agent memory research.
Authors' literature survey / position statement in paper (assertion made in abstract).
high negative FSFM: A Biologically-Inspired Framework for Selective Forget... extent of research coverage on forgetting vs retention
Beyond technical barriers there are organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities.
Interview data (over 30 interviews) reporting organizational challenges, including limited AI literacy, diverse cultural attitudes across organizations, and governance that lags behind agentic AI capabilities.
high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... organizational readiness factors (AI literacy, culture, governance alignment)
Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains.
Over 30 stakeholder interviews reporting barriers to deployment; qualitative synthesis identifies data fragmentation, security/regulatory requirements, and legacy toolchain access as primary constraints.
high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... barriers to AI adoption in engineering/manufacturing
Users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns.
Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.
high negative SWE-chat: Coding Agent Interactions From Real Users in the W... rate of user pushback per interaction turn
Agent-written code introduces more security vulnerabilities than code authored by humans.
Comparative analysis of security vulnerabilities attributed to agent-authored code versus human-authored code within the SWE-chat dataset (method details not specified in excerpt).
high negative SWE-chat: Coding Agent Interactions From Real Users in the W... security vulnerabilities introduced by agent-written code versus human-written c...
Just 44% of all agent-produced code survives into user commits.
Empirical measurement of code provenance and survival within the SWE-chat dataset: proportion of agent-produced code that becomes part of subsequent user commits across sessions.
high negative SWE-chat: Coding Agent Interactions From Real Users in the W... survival/usefulness of agent-produced code (proportion incorporated into commits...
Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.
Authors' summary claim supported by dataset-derived metrics such as agent code survival rate (44%) and user pushback (44% of turns); observational analysis of SWE-chat.
high negative SWE-chat: Coding Agent Interactions From Real Users in the W... overall agent efficiency in natural developer workflows (qualitative synthesis)
Regulated deployment imposes four load-bearing systems properties — deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale — and stateful architectures violate them by construction.
Conceptual/architectural argument presented in the paper (theoretical analysis), not an empirical measurement in the abstract.
high negative Stateless Decision Memory for Enterprise AI Agents compatibility of stateful architectures with regulatory/system properties
The policy and research challenge posed by platform-mediated automation is not merely job quantity (technological unemployment) but institutional continuity — how societies reproduce practical competence when platforms optimize for efficiency rather than formation.
Normative and conceptual claim developed through literature synthesis (institutional economics, platform governance, workforce development); presented as an analytical reframing rather than an empirically tested hypothesis.
high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... institutional continuity and human capital reproduction (quality of workforce fo...
Entry-level roles have historically functioned as apprenticeships in which workers acquire tacit knowledge and critical judgment; if platforms curtail these formative occupational layers, organizations may lack future workers capable of exercising contextual reasoning required to manage complex systems.
Institutional economics and workforce development literature cited in the paper; conceptual synthesis without original empirical measurement reported.
high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... human capital formation (tacit knowledge acquisition and contextual reasoning ca...
Platform-mediated automation risks hollowing out labor structures from both directions: eroding repetitive, junior roles from below and automating supervisory coordination functions from above.
Theoretical argument synthesizing institutional economics and platform literature; articulated as a conceptual risk rather than demonstrated with original empirical data.
high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... structural change in occupational layers (hollowing out of junior and supervisor...
Algorithmic systems are displacing routine tasks across both low-wage entry-level work and middle-management functions.
Stated in paper's argumentation; supported by a literature-based review drawing on platform governance literature and recent research on AI-enhanced automation (no original empirical sample or quantitative study reported).
high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... displacement of routine tasks (across entry-level and middle-management roles)
The observed negative OPM effect is consistent with short-term 'J-curve' transition costs (process redesign and capability buildup) during early AI adoption.
Interpretation of empirical patterns (short-term decline in OPM concurrent with no ROA change) offered by the authors as an explanatory mechanism; not presented as separately estimated or experimentally tested.
high negative The Dynamic Causal Effects of Corporate AI Adoption on Profi... operating profit margin dynamics / transition costs interpretation
AI adoption had a significantly negative impact on the operating profit margin (OPM).
Causal analysis of KOSDAQ-listed companies (2018–2025) with AI-adoption timing identified via multi-step, contextually validated text analysis of DART business reports; endogeneity addressed using two-way fixed effects (TWFE) and Propensity Score Matching (PSM).
high negative The Dynamic Causal Effects of Corporate AI Adoption on Profi... operating profit margin (OPM)
An alternative specification that makes different choices about the timing of the pervasiveness of AI yields less robust results, though it also suggests that AI is labor saving.
Reported sensitivity analysis / alternative empirical specification in the paper; authors state the alternative yields less robust results but still indicates labor-saving effects.
high negative Early Estimates of the Impact of AI Within BEA’s Industry Ec... labor use (labor-saving effect)
Our baseline model finds evidence that AI is input saving.
Outcome reported from the baseline empirical specification indicating reductions in inputs associated with AI (authors' baseline model results).
high negative Early Estimates of the Impact of AI Within BEA’s Industry Ec... use of inputs (e.g., labor/capital inputs)
The infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it.
Authors' claim in the paper framing the research gap; presented as observational/argumentative (no empirical audit reported).
high negative ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... availability of cross-user collaboration infrastructure and governance mechanism...
Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user.
Statement in paper's introduction/positioning; conceptual survey-style claim (no empirical study or systematic benchmark reported).
high negative ClawNet: Human-Symbiotic Agent Network for Cross-User Autono... automation scope (single-user vs multi-user)
Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations.
Paper asserts that existing/standard benchmarks do not adequately isolate parsing and computation-orchestration abilities, motivating the new benchmark.
high negative Time Series Augmented Generation for Financial Applications benchmark adequacy for isolating parsing/computation orchestration
As multimodal AI achieves human-parity understanding of speech and gesture, [the keyboard's] necessity dissolves.
Theoretical claim supported by multidisciplinary review (history, neuroscience, technology, organizational studies); no quantified empirical test reported.
high negative The Instrumental Dissolution of Typing: Why AI Challenges th... necessity/usage of keyboard as default input
General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs.
Conceptual/argumentative claim stated in the paper's motivation; no empirical test reported in the abstract.
high negative Learning from AVA: Early Lessons from a Curated and Trustwor... misinformation risk / epistemic humility
In the AI condition, retest performance showed a nonsignificant absolute reduction and a larger retest decrement (i.e., retention decreased more after using Copilot).
Comparison of retest (one-week) performance across conditions reported in results; authors report a nonsignificant reduction and larger decrement for the AI/Copilot condition (n=22).
high negative Fast and Forgettable: A Controlled Study of Novices' Perform... retest performance (learning retention) after one week
Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment.
Authors' characterization of industry practice and limitations (assertion in paper; no empirical sample size reported in abstract).
high negative Aether: Network Validation Using Agentic AI and Digital Twin test coverage and post-deployment error incidence
Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations.
Statement in paper framing the problem; based on authors' characterization of current operational practice (no empirical sample size reported in abstract).
high negative Aether: Network Validation Using Agentic AI and Digital Twin manual effort / error-proneness of network change validation
The paper identifies governance challenges such as accountability gaps, digital sovereignty risks, ethical pluralism, and strategic weaponization arising from embedding AI in diplomatic practice.
Conceptual and normative analysis section of the paper outlining risks and governance challenges; illustrated by examples and argumentation.
high negative Strategic Cognition and Artificial Diplomacy: Designing Huma... presence of governance risks (accountability gaps, digital sovereignty, ethical ...
Thin training coverage fosters anxiety about substitution and slows diffusion of AI tools.
Reported associations from surveys of mid-level managers and technical staff, interviews, and document analysis across cases; thematic coding identified links between limited training, worker anxiety, and slower diffusion. (Sample size not reported.)
high negative Overcoming Resistance to Change: Artificial Intelligence in ... worker anxiety and speed of diffusion/adoption
Upstream textile SMEs frequently exhibit constrained supply chain resilience owing to persistent information latency and structural dependence on downstream orders.
Background/contextual claim stated in paper (motivation for study); no specific quantitative test reported in abstract.
high negative Enhancing Supply Chain Resilience in Textile SMEs: A Human-C... supply chain resilience (constrained due to information latency and downstream o...
The pharmaceutical R&D process is persistently challenged by high financial costs, protracted timelines, and remarkably low success rates.
Background statement in the review synthesizing prior literature and field knowledge; no original empirical data or sample sizes reported in the provided text.
high negative Artificial intelligence in drug discovery from advanced mole... financial costs, timelines, and success rates of pharmaceutical R&D