The Commonplace

Evidence (13827 claims)

Adoption (8454 claims)
Productivity (7544 claims)
Governance (6789 claims)
Human-AI Collaboration (6327 claims)
Org Design (4126 claims)
Innovation (4058 claims)
Labor Markets (3520 claims)
Skills & Training (2924 claims)
Inequality (2057 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 749 195 97 889 1979
Governance & Regulation 815 391 188 121 1539
Organizational Efficiency 771 189 124 83 1177
Technology Adoption Rate 624 233 123 96 1084
Research Productivity 410 121 56 331 929
Output Quality 466 177 59 47 749
Decision Quality 320 174 75 42 618
Firm Productivity 435 55 88 20 604
AI Safety & Ethics 214 276 65 33 593
Market Structure 178 166 122 24 495
Task Allocation 206 64 70 31 376
Skill Acquisition 165 57 60 17 299
Innovation Output 201 27 41 18 288
Employment Level 105 51 107 13 278
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 116 63 42 11 232
Firm Revenue 149 46 26 3 224
Inequality Measures 44 122 49 6 221
Task Completion Time 169 29 8 12 219
Worker Satisfaction 89 61 20 12 182
Error Rate 69 91 10 2 172
Regulatory Compliance 76 68 14 5 163
Training Effectiveness 92 19 13 19 145
Wages & Compensation 77 36 25 6 144
Automation Exposure 51 54 22 12 142
Team Performance 86 17 27 9 140
Developer Productivity 94 17 14 6 132
Job Displacement 12 80 20 1 113
Hiring & Recruitment 51 7 8 3 69
Skill Obsolescence 5 45 6 1 57
Creative Output 31 16 7 2 57
Social Protection 27 16 8 2 53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
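
As a reading aid for the matrix above, the sketch below computes the share of positive findings for a few rows. The counts are copied from the table. The function name `positive_share` and the choice to divide by the sum of the four direction columns (rather than the Total column, which for some rows is larger than that sum) are assumptions made for illustration.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20),
    "Error Rate": (69, 91, 10, 2),
    "Job Displacement": (12, 80, 20, 1),
}


def positive_share(counts):
    """Fraction of directional claims that report a positive finding.

    Assumption: the denominator is the sum of the four direction columns,
    not the table's Total column.
    """
    positive = counts[0]
    return positive / sum(counts)


for outcome, counts in ROWS.items():
    print(f"{outcome}: {positive_share(counts):.1%} positive")
```

Read this way, outcomes such as Job Displacement and Error Rate skew negative, while Firm Productivity skews strongly positive.
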
Artificial intelligence enhances analytics, automates routine tasks, personalizes interactions, and supports decision-making.
Aggregate finding reported in the abstract based on thematic synthesis of the reviewed literature (160 articles).
high positive The implementation of artificial intelligence in organizatio... organizational_capabilities (analytics, automation, personalization, decision_su...
There are convergent patterns of AI adoption in human resources, marketing and customer services, logistics, and finance.
Synthesis claim from the systematic review of the 160 included peer‑reviewed articles as reported in the abstract.
high positive The implementation of artificial intelligence in organizatio... patterns_of_adoption_across_functions
Clear specifications, explicit governance, and ongoing human-AI collaboration are critical for successful scaling of regression automation.
Conclusions and recommendations derived from the case study's lessons and mixed-method evaluation.
high positive Human-AI Collaboration for Scaling Agile Regression Testing:... success of scaling regression automation / effectiveness of human-AI teaming
The Copilot achieves 30-50% code reuse when generating candidate test scripts.
Quantitative result reported in the paper's evaluation (stated 30-50% code reuse in the abstract/summary).
high positive Human-AI Collaboration for Scaling Agile Regression Testing:... code reuse in generated test scripts
Mixed-method evaluation shows the AI accelerates script authoring and increases throughput.
Empirical claim based on the paper's mixed-method evaluation (qualitative and quantitative data reported in the case study); specific sample sizes not provided in the summary.
high positive Human-AI Collaboration for Scaling Agile Regression Testing:... script authoring speed and throughput
Automated regression testing is essential for maintaining rapid, high-quality delivery in Agile and Scrum organizations.
Introductory/position statement in the paper; general premise motivating the case study (no specific empirical test reported).
high positive Human-AI Collaboration for Scaling Agile Regression Testing:... ability to maintain rapid, high-quality delivery
AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers.
Empirical evaluation on MLE-Bench reported in the paper (benchmark ranking, metric = medal rate).
high positive AIBuildAI: An AI Agent for Automatically Building AI Models medal rate (task success rate) on MLE-Bench
AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization; each sub-agent is itself an LLM-based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches.
System architecture description in the paper (methods/architecture section).
high positive AIBuildAI: An AI Agent for Automatically Building AI Models system architecture and claimed capabilities (multistep reasoning, tool use, end...
We introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data.
Methodological contribution: system design and implementation described in the paper (introduction/methods).
high positive AIBuildAI: An AI Agent for Automatically Building AI Models ability to produce AI models from task descriptions and training data
This tension reveals a pattern we call 'bounded delegation': developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself.
Interpretive result from the paper's qualitative thematic analysis of survey responses (n=860), labeled by the authors as the 'bounded delegation' pattern.
high positive To Copilot and Beyond: 22 AI Systems Developers Want Built preferred boundary of automation / delegation
Developers wanted systems enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout.
Reported constraints and desiderata from the thematic analysis of survey responses (n=860).
high positive To Copilot and Beyond: 22 AI Systems Developers Want Built desired governance/security features for AI tools (authority scoping, provenance...
Developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation.
Thematic findings from the paper's human-in-the-loop, multi-model council-based analysis of survey responses (n=860).
high positive To Copilot and Beyond: 22 AI Systems Developers Want Built requested placement/timing of quality signals in developer workflow
Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories.
Qualitative analysis method described in the paper applied to the survey responses (n=860); result reported as identification of 22 desired AI systems organized into five categories.
high positive To Copilot and Beyond: 22 AI Systems Developers Want Built catalog of desired AI systems and task categories
For listed firms, AI patents command a robust market-value premium in both countries.
Firm-level analysis linking AI patenting to market valuation for listed firms in both countries (regression or valuation analysis implied by statement).
high positive AI Patents in the United States and China: Measurement, Orga... market-value premium for listed firms associated with AI patents
China surpasses the United States in recent annual AI patent counts.
Time-series patent count comparison using classifier-applied corpora (paper reports that recent annual counts are higher for China than the U.S.).
high positive AI Patents in the United States and China: Measurement, Orga... annual number of AI patents (patent counts)
There is broad convergence in AI patenting intensity and subfield composition between the United States and China.
Comparative analysis of AI patenting intensity and subfield composition across the two patent corpora (US 1976-2023, China 2010-2023) reported in paper.
high positive AI Patents in the United States and China: Measurement, Orga... AI patenting intensity and distribution across AI subfields
Applying the classifier to granted U.S. patents (1976-2023) and Chinese patents (2010-2023), we document rapid growth in AI patenting in both countries.
Application of classifier to full corpora of granted U.S. patents (1976-2023) and Chinese patents (2010-2023); time-series counts of AI patents reported.
high positive AI Patents in the United States and China: Measurement, Orga... number of granted AI patents over time (patent counts)
The classifier generalizes well to Chinese patents based on citation and lexical validation.
Validation analyses described as citation-based and lexical validation applied to Chinese patents (paper states generalization to Chinese patents via these validation methods).
high positive AI Patents in the United States and China: Measurement, Orga... generalization / validity of classifier on Chinese patents
Our classifier substantially improves the existing USPTO approach, achieving 97.0% precision, 91.3% recall, and a 94.0% F1 score.
Reported classifier evaluation metrics (precision, recall, F1) presumably on held-out test data; comparison stated against the existing USPTO approach.
high positive AI Patents in the United States and China: Measurement, Orga... classification performance (precision, recall, F1)
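
The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The snippet below is a worked check only, not the authors' evaluation code; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the claim above.
f1 = f1_score(0.970, 0.913)
print(f"F1 = {f1:.4f}")
```

The result is about 0.9406, which agrees with the reported 94.0% to within the rounding of the published precision and recall.
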
We develop a high-precision classifier to measure artificial intelligence (AI) patents by fine-tuning PatentSBERTa on manually labeled data from the USPTO's AI Patent Dataset.
Methodological description in paper: fine-tuning PatentSBERTa on manually labeled USPTO AI Patent Dataset (manually labeled training data and model fine-tuning stated).
high positive AI Patents in the United States and China: Measurement, Orga... ability to classify patents as AI-related (classifier development)
The results demonstrate the importance of considering interacting systems of AI agents when doing both capabilities and safety research.
Authors' interpretation/generalization based on experimental findings comparing multi-agent organizations and single agents across tasks and settings.
high positive AI Organizations are More Effective but Less Aligned than In... research priorities/considerations for capabilities and safety research (implica...
BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility.
Design claim in abstract describing the benchmark's automated scoring system and rubric size (100+ criteria) defined by expert bankers.
high positive BankerToolBench: Evaluating AI Agents in End-to-End Investme... number of rubric criteria for automated evaluation
Substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments.
Conclusion drawn from the authors' empirical study and the reported final-system performance; presented as a general methodological claim (supporting data referenced in paper but not detailed in excerpt).
high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... effectiveness of economic penalties as an alignment method
The final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase.
Quantitative performance result reported for the mature phase of the system in the paper's abstract; Sharpe ratio provided as a single-number metric (no sample size, number of trading periods, or statistical significance reported in the excerpt).
The MAS abandoned overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥95% code coverage constraint matrix.
Design and outcome claim in the paper: introduction of STDAW/RO-Lock and reported enforcement of a ≥95% code coverage constraint as part of the aligned architecture (qualitative + a coverage threshold stated).
high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... code coverage (>=95%) and reduction in hallucinations / overfitting
The system evolved from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture over the course of the study.
Reported longitudinal observations from the 20-month empirical study described in the paper (qualitative system evolution claim; no numeric counts provided in excerpt).
high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... system architecture and behaviour (turnover rate, sycophancy, liquidity awarenes...
We introduce Out-of-Money Reinforcement Learning (OOM-RL): deploying agents into the non-stationary, high-friction reality of live financial markets to utilize capital depletion as an un-hackable negative gradient.
Methodological claim / novel paradigm introduced by the paper; described as implemented in the study (no numerical sample size given in excerpt).
high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... use of financial loss (capital depletion) as negative training signal for agent ...
Established regional telcos and banks are leveraging proprietary data to develop digital loan products.
Observations and interviews from the nine-month ethnography describing practices of regional telcos and banks in Nairobi developing digital loan products using proprietary data.
high positive Risk, Data, Alignment: Making Credit Scoring Work in Kenya use of proprietary data by telcos and banks to create digital loan products (ado...
For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict
Explicit reproducibility statement and URL provided in the paper.
high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... data_and_code_availability
SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process?
Statement of research goals and scope in the paper introducing the SciPredict benchmark and accompanying evaluations.
high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... research_questions_addressed
Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment.
Reported stratified accuracy of human experts on SciPredict tasks by self-reported predictability judgments; accuracy rises from ≈5% (when judged not predictable) to ≈80% (when judged predictable).
high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... calibration_of_human_confidence_vs_accuracy
We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry.
Construction of the SciPredict benchmark described in the paper; explicitly reports 405 tasks and 33 sub-fields.
The future of Nagpur's industrial belt depends not on resisting automation, but on an aggressive reskilling strategy to bridge the gap between current workforce capabilities and future technological requirements.
Normative policy conclusion in the paper recommending reskilling as the primary response; based on the paper's analysis of task changes and projected role shifts; no program evaluation or empirical evidence of reskilling effectiveness reported in the excerpt.
high positive PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... need for reskilling / workforce skill acquisition
There is a projected surge in demand for 'AI-collaborative' roles such as machine maintenance, data supervision, and process optimization.
Projection in the paper based on analysis of task complementarities between humans and AI, listing specific roles expected to grow; no quantitative demand estimates or sample sizes provided in the excerpt.
high positive PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... projected demand for AI-collaborative roles (machine maintenance, data supervisi...
The paper documents 14 deliberate conservative assumptions — including frozen base GDP, no AI-on-AI compounding, a permanent friction floor, and conservative capture rates — all of which directionally understate the benefit.
Paper lists 14 conservative modeling assumptions and claims they bias results downward (i.e., understate potential benefits).
high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... directional bias of model assumptions relative to potential benefits
Even excluding demand expansion and robotics layers entirely, the direct productivity contribution alone reaches approximately $940 billion per year by 2036.
Model output reported in the paper when removing demand expansion and robotics layers.
high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... direct productivity contribution to annual GDP by 2036 excluding demand expansio...
In all four scenarios, cumulative net GDP exceeds cumulative AI infrastructure investment before 2036, with the base case achieving payback in 2033.
Model financial calculation comparing cumulative net GDP uplift to cumulative AI infrastructure investment across scenarios; explicit payback year reported for base case.
high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... year when cumulative net GDP exceeds cumulative AI infrastructure investment (pa...
The base-case scenario yields approximately $1,057 billion in net annual GDP uplift by 2036, equivalent to 3.6 percent of 2024 GDP; the bear case produces $796 billion, the bull case $1,368 billion, and an agentic scenario produces $2,521 billion.
Model scenario outputs presented in the paper (four scenarios differentiated by capture rate and friction assumptions).
high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... net annual GDP uplift by 2036 (US, scenario-specific)
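
The base-case figures imply a 2024 GDP denominator, which can be recovered by simple division. This is a back-of-envelope check using only the numbers quoted above. The variable names are illustrative, and the implied GDP value is derived here rather than stated in the source; because the 3.6 percent figure is rounded, the result is approximate.

```python
# Net annual GDP uplift by 2036, in billions of dollars, as quoted above.
uplift_by_scenario = {"bear": 796, "base": 1057, "bull": 1368, "agentic": 2521}

# The base case is stated to equal 3.6 percent of 2024 GDP.
base_share = 0.036
implied_gdp_2024 = uplift_by_scenario["base"] / base_share  # billions of dollars

# Express every scenario as a share of that implied (derived) GDP figure.
shares = {name: uplift / implied_gdp_2024 for name, uplift in uplift_by_scenario.items()}

print(f"Implied 2024 GDP: ${implied_gdp_2024:,.0f} billion")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of implied 2024 GDP")
```
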
Sector-specific productivity gain percentages are anchored to published evidence, including a randomized controlled trial of GitHub Copilot (Kalliamvakou et al., 2023), JPMorgan CEO disclosures, and Cognizant's New Work New World 2026 research.
Paper states productivity percentages are anchored to published evidence and specifically cites Kalliamvakou et al. (2023) RCT, JPMorgan CEO disclosures, and Cognizant (2026).
high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... sector-specific productivity gain percentages used in the model
A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
Design/implementation claim in paper describing deployment approach using YAML configuration rather than engineering work.
high positive Governed Reasoning for Institutional AI deployment effort required to support a new institutional decision domain
We introduce governability — how reliably a system knows when it should not act autonomously — as a primary evaluation axis for institutional AI alongside accuracy.
Conceptual contribution/metric proposed by authors in paper; no empirical validation reported in the excerpt.
high positive Governed Reasoning for Institutional AI governability (system's ability to know when not to act autonomously)
Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set.
Empirical benchmark reported in paper on the 11-case evaluation set; counts of silent errors given for Cognitive Core and baselines.
high positive Governed Reasoning for Institutional AI count of silent errors (incorrect determinations that executed without human-rev...
Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve.
Empirical benchmark reported in paper on the 11-case evaluation set; accuracies explicitly stated for three systems.
high positive Governed Reasoning for Institutional AI accuracy on prior authorization appeal cases
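
Because the evaluation set has only 11 cases, each reported accuracy corresponds to a whole number of correct cases, and a single case moves accuracy by about 9 percentage points. The sketch below recovers the nearest count for each percentage. These counts are inferred from the quoted percentages and are not reported in the excerpt.

```python
N_CASES = 11  # size of the evaluation set, as stated above

# Accuracies as quoted in the claim.
reported = {"Cognitive Core": 0.91, "ReAct": 0.55, "Plan-and-Solve": 0.45}

# Nearest whole number of correct cases consistent with each percentage.
inferred = {system: round(acc * N_CASES) for system, acc in reported.items()}

for system, correct in inferred.items():
    print(f"{system}: {correct}/{N_CASES} correct = {correct / N_CASES:.1%}")
```
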
We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.
Design/proposal described in paper (architectural specification); no empirical evaluation reported for the architecture itself in the excerpt.
high positive Governed Reasoning for Institutional AI system governability and auditability as properties of the decision substrate
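
A tamper-evident hash chain of the kind named above can be illustrated in a few lines. This is a generic sketch using Python's standard `hashlib`, not the Cognitive Core implementation. The record fields, the `append_entry` and `verify_chain` helpers, and the all-zero genesis hash are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash, record):
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_entry(ledger, record):
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"record": record, "prev": prev_hash, "hash": entry_hash(prev_hash, record)}
    )


def verify_chain(ledger):
    """Recompute every hash; any edited record or broken link fails the check."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical audit records, loosely named after two of the listed primitives.
ledger = []
append_entry(ledger, {"step": "retrieve", "case": 1})
append_entry(ledger, {"step": "govern", "case": 1, "human_review": True})
print(verify_chain(ledger))  # chain is intact

ledger[0]["record"]["case"] = 2  # tamper with an earlier record
print(verify_chain(ledger))  # verification now fails
```

Each entry's hash depends on the one before it, so changing any earlier record invalidates every later link unless the whole chain is recomputed.
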
Organisations should invest in customisation capabilities for AI recruitment tools, implement comprehensive change management strategies, and maintain robust post-hire evaluation procedures.
Authors' recommendations derived from thematic findings and participant perspectives across two firms (qualitative synthesis of n = 22 interviews).
high positive The augmented recruiter: examining AI integration and decisi... recommended_organisational_practices_for_AI_recruitment
AI functioned optimally as an augmentative technology rather than as a replacement for human decision-makers in recruitment.
Findings: participants across the two case firms described AI being most effective when augmenting human judgment rather than replacing it (interviews n = 22).
high positive The augmented recruiter: examining AI integration and decisi... role_of_AI (augmentation vs replacement)
AI significantly enhanced efficiency through process standardisation and automation.
Findings based on participant accounts in thematic analysis (interviews n = 22) describing process optimisation and automation benefits.
high positive The augmented recruiter: examining AI integration and decisi... efficiency (process standardisation and automation)
The Principle of Maximum Heterogeneity reveals a convergence of complex phenomena across fields onto simple underlying design principles with important predictive value for future distributed production systems.
Synthesis claim in the paper arguing cross-field convergence and predictive value based on the theoretical model and conceptual examples; no empirical validation or forecasting trials reported.
high positive The Principle of Maximum Heterogeneity Optimises Productivit... predictive value of the model/principles for future distributed production syste...
The principles derived (including the Principle of Maximum Heterogeneity) can be used as a blueprint for constructing ideal distributed production systems; demonstrated by suggesting specific redesigns for compute systems executing large-scale AI.
Paper includes suggested redesigns for compute systems as demonstrations of the blueprint; these are proposed designs/illustrative applications rather than empirically validated interventions or trials.
high positive The Principle of Maximum Heterogeneity Optimises Productivit... design-guided performance improvements in compute systems for large-scale AI (pr...
The Principle of Maximum Heterogeneity applies recursively across all layers of nested production systems.
Theoretical claim within the paper arguing recursive applicability across nested system layers (e.g., neurons, firms, ecosystems); supported by conceptual reasoning and model exposition rather than empirical multi-layer tests.
high positive The Principle of Maximum Heterogeneity Optimises Productivit... emergence/spread of heterogeneity across nested layers