The Commonplace

Evidence (13870 claims)

Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 749 196 98 892 1984
Governance & Regulation 817 394 188 121 1544
Organizational Efficiency 771 189 124 83 1177
Technology Adoption Rate 627 233 123 96 1088
Research Productivity 411 123 56 332 933
Output Quality 467 178 59 47 751
Decision Quality 320 174 75 42 618
Firm Productivity 435 55 88 20 604
AI Safety & Ethics 214 276 65 33 593
Market Structure 178 167 122 24 496
Task Allocation 207 64 71 32 379
Skill Acquisition 165 59 60 17 301
Innovation Output 203 27 43 18 292
Employment Level 105 52 107 13 279
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 116 63 42 11 232
Firm Revenue 150 48 26 3 227
Inequality Measures 44 122 49 6 221
Task Completion Time 169 29 8 12 219
Worker Satisfaction 89 63 20 12 184
Error Rate 69 92 10 2 173
Regulatory Compliance 76 68 14 5 163
Training Effectiveness 93 21 13 19 148
Wages & Compensation 77 36 25 6 144
Automation Exposure 51 54 22 12 142
Team Performance 86 17 27 9 140
Developer Productivity 94 17 14 6 132
Job Displacement 12 80 20 1 113
Hiring & Recruitment 51 7 8 3 69
Creative Output 31 17 7 3 59
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
The main results are robust to inclusion of firm, industry, and year fixed effects, DID identification using the 2018 SCD pilot, and multiple robustness checks addressing potential confounders and endogeneity.
Authors report baseline regressions with firm/industry/year fixed effects, DID specifications exploiting the 2018 Supply Chain Innovation and Application Pilot Program as a quasi-natural experiment, and a battery of robustness tests (alternative specifications and additional controls).
high positive Supply Chain Digitalization and its Impact on Green Innovati... robustness of estimated SCD effects on corporate green innovation
The positive effect of SCD on green innovation is stronger for substantive green innovation (actual environmentally beneficial R&D and technologies) than for strategic green innovation (symbolic/labeling or reputation‑oriented activities).
Heterogeneous outcome analysis splitting green innovation into 'substantive' (e.g., green patents, technological R&D outputs) versus 'strategic' (signaling/compliance indicators); regression and DID estimates show larger and statistically significant coefficients for substantive measures compared to smaller or weaker effects on strategic measures.
high positive Supply Chain Digitalization and its Impact on Green Innovati... substantive green innovation (green patents, concrete environmental R&D outputs)...
Supply chain digitalization (SCD) significantly increases corporate green innovation among Chinese A-share listed firms (2012–2022).
Panel analysis of Chinese A-share listed firms over 2012–2022 using regression models with firm, industry, and year fixed effects; difference-in-differences (DID) identification exploiting the 2018 Supply Chain Innovation and Application Pilot Program as an exogenous shock to SCD; firm-level controls included; multiple robustness checks reported.
high positive Supply Chain Digitalization and its Impact on Green Innovati... corporate green innovation (aggregate measures of green innovation such as green...
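The difference-in-differences (DID) logic behind the entries above can be illustrated with a minimal two-group, two-period sketch. This is not the paper's specification, which uses panel regressions with firm, industry, and year fixed effects; the function name `did_estimate` and all numbers below are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period DID: (treated change) minus (control change)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical green-innovation scores per firm (illustrative, not from the paper).
treated_pre = [2.0, 3.0, 4.0]
treated_post = [5.0, 6.0, 7.0]
control_pre = [2.0, 2.0, 2.0]
control_post = [3.0, 3.0, 3.0]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

The control group's change stands in for what would have happened to treated firms without the pilot, which is why the estimate is only credible under a parallel-trends assumption.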
Algorithmic transparency and interpretability are important so investors and regulators can understand how ESG inputs affect automated decision systems.
Normative recommendation grounded in literature on model risk, accountability, and regulatory needs; not an empirical finding but a consensus implication of reviewed work.
high positive SUSTAINABILITY ISSUES IN FINANCIAL ACCOUNTING RESEARCH model interpretability / stakeholder understanding / accountability
Research priorities include empirically quantifying AI's effects on productivity, wages, inequality, and environmental costs; developing standardized sustainability and governance metrics; and evaluating regulatory impacts on innovation and welfare.
Stated research agenda based on gaps identified in the narrative review; identifies directions for future empirical work rather than presenting new empirical findings.
high positive The Evolution and Societal Impact of Artificial Intelligence... empirical evidence and standardized metrics for AI impacts (productivity, labor-...
AI has progressed from symbolic systems to data-driven, generative architectures and large-scale computational infrastructures, becoming a foundational technology across sectors.
Narrative synthesis of historical and technical literature across AI research and innovation studies; qualitative tracing of architectural shifts (symbolic → statistical → deep learning/generative models) and increased deployment across industries. No original empirical measurement or sample size reported in this paper.
high positive The Evolution and Societal Impact of Artificial Intelligence... technological evolution and cross-sector adoption (foundational-technology statu...
MYRIAD-EU synthesizes progress and remaining challenges and proposes concrete directions for continued research and practice in multi-hazard, multi-risk DRR.
Overall project scope: synthesis and reflection on interdisciplinary research and practice conducted across MYRIAD-EU (2021–2025), as reported in the paper.
high positive Reducing risk together: moving towards a more holistic appro... existence of a consolidated synthesis and recommended research/practice directio...
MYRIAD-EU conducted in-depth, place-based case studies co-produced with local stakeholders to test methods and tools for multi-risk assessment.
Reported methods include in-depth place-based case studies co-produced with local stakeholders as part of MYRIAD-EU activities (2021–2025).
high positive Reducing risk together: moving towards a more holistic appro... testing and validation of methods and tools via co-produced case studies
The main results are robust to the inclusion of controls and to a range of heterogeneity and moderation checks, suggesting that the findings are not driven by simple time trends or obvious confounders.
Reported robustness checks in the staggered-DID framework (control variables, alternative specifications, subgroup tests) and discussion of parallel-trends assumption.
high positive How Does Urban Green Data Center Policy Empower Corporate En... corporate energy utilization efficiency (stability of estimated policy effect ac...
Implementation of urban green data center pilot policies leads to measurable improvements in firms' energy utilization efficiency.
Staggered-adoption difference-in-differences (DID) using an unbalanced firm–year panel of Chinese A-share listed firms linked to prefecture-level cities (2012–2024); treatment is timing/location of urban green data center pilot designation; results reported as statistically significant and robust to controls and alternative specifications.
high positive How Does Urban Green Data Center Policy Empower Corporate En... corporate energy utilization efficiency
Mechanisms linking digital services to export performance include reduced transaction and search costs, platform network and scale effects, data as an input improving service quality and customization, and task‑level specialization changing comparative advantage.
Conceptual/theoretical synthesis drawing on multiple strands of literature and illustrative case studies presented in the review (no new causal identification).
high positive Analysis of Digital Services Trade and Export Competitivenes... export performance of digital services (via transaction costs, service quality, ...
Digital services trade is shifting from traditional cross‑border delivery toward online, platform‑based models, with cross‑border data flows a core input and determinant of competitiveness.
Integrative literature and policy review synthesizing domestic and international studies; theoretical/conceptual synthesis and cited case examples (no new econometric analysis or primary microdata).
high positive Analysis of Digital Services Trade and Export Competitivenes... mode of digital services delivery and export competitiveness (role of platforms ...
Policy recommendations include standards on explainability, audit trails, certification for finance/tax AI systems, stronger data governance, and public–private coordination to update regulatory guidance.
Paper's policy and governance recommendations drawn from case findings and literature synthesis; prescriptive content rather than evaluated interventions.
high positive Explore the Impact of Generative AI on Finance and Taxation existence/adoption of standards, improvements in regulatory clarity and complian...
Deployments should build governance, explainability, and auditability into systems and start with pilots on high-volume, well-structured tasks before scaling.
Paper recommendations based on case experience and analytic framing; advocated strategy rather than empirically validated at scale within the paper.
high positive Explore the Impact of Generative AI on Finance and Taxation deployment success rate, governance completeness, pilot-to-scale learning outcom...
To mitigate risks and realize benefits, finance/tax deployments should pair AI systems with human-in-the-loop controls and clear escalation paths.
Prescriptive recommendation grounded in case lessons and literature on safe AI deployment; presented as a best-practice guideline rather than tested intervention.
high positive Explore the Impact of Generative AI on Finance and Taxation safety/accuracy of outputs, reduction in erroneous autonomous actions
Technical building blocks leveraged in these deployments include large language models (LLMs), OCR plus structured information extraction, retrieval-augmented generation (RAG) and knowledge bases, and process automation/RPA.
Explicit technical characteristics section and case descriptions in the paper identify these components as core to implementations.
high positive Explore the Impact of Generative AI on Finance and Taxation capability enabling: natural language understanding, document extraction accurac...
Generative AI is used for risk control and audit functions, including real-time monitoring, fraud detection, KYC/AML screening, and automated exception reporting.
Reported use-cases in the two case organizations and corroborating industry reports discussed in the literature review portion of the paper.
high positive Explore the Impact of Generative AI on Finance and Taxation timeliness of monitoring, fraud detection rate, KYC/AML screening coverage, exce...
For tax declaration, generative AI enables extraction of tax-relevant facts from invoices and contracts, drafting of tax returns, compliance checks, and scenario simulations.
Case examples and literature synthesis describing OCR + information extraction and LLM-assisted drafting workflows used in practice.
high positive Explore the Impact of Generative AI on Finance and Taxation accuracy and speed of tax fact extraction, draft return quality, compliance-chec...
Generative AI is applied to fund management tasks such as cashflow forecasting, anomaly detection, and automated workflows for payments and collections.
Case descriptions and technical mapping in the paper showing implementations at the sharing center and professional services firm level.
high positive Explore the Impact of Generative AI on Finance and Taxation cashflow forecast accuracy, anomaly detection precision/recall, automation rate ...
Accounting automation use-cases include automated bookkeeping, reconciliations, journal entry suggestion, and error detection using LLMs and document understanding.
Detailed scope mapping and case examples at Xiaomi and Deloitte illustrating these accounting applications; supported by a literature review of technical capabilities.
high positive Explore the Impact of Generative AI on Finance and Taxation functionality/performance in accounting tasks: bookkeeping accuracy, reconciliat...
Realizing those AI-driven gains in Vietnam requires legal and institutional redesigns.
Close reading of Vietnam's constitutional provisions, administrative statutes, procedural rules and judicial doctrine (doctrinal legal analysis) combined with comparative lessons from other jurisdictions; no quantitative data.
high positive ARTIFICIAL INTELLIGENCE AND ADMINISTRATIVE GOVERNANCE: A CRI... feasibility of AI deployment (legal/institutional compatibility enabling efficie...
A supplemental theological differentiator probe achieved perfect rank-order agreement between the two ceiling judges (Spearman rs = 1.00), supporting judge reliability for the ceiling probe.
Reported Spearman rank correlation rs = 1.00 between Gemini Pro and Copilot Pro on the theological differentiator probe used as a reliability check.
high positive Literary Narrative as Moral Probe : A Cross-System Framework... Spearman rank-order agreement (rs) between the two ceiling judges on the theolog...
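For readers unfamiliar with the statistic, the sketch below shows how a Spearman rank correlation of 1.00 arises when two judges order items identically. It uses the no-ties formula rs = 1 - 6 Σd² / (n(n² - 1)); the judge scores are hypothetical and are not the paper's data.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rs(x, y):
    """Spearman's rs via the rank-difference formula (valid only without ties)."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores from two judges on four items (different scales, same ordering).
judge_a = [7.5, 6.0, 9.0, 4.5]
judge_b = [70, 55, 88, 40]

rs = spearman_rs(judge_a, judge_b)
print(rs)
```

Because rs depends only on ranks, the judges can use different scales and still agree perfectly; with few items, a perfect rs is easier to reach than with many.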
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.
high positive The Future of Feedback: How Can AI Help Transform Feedback t... existence and quality of RCTs and long-run studies; availability of validated me...
CABP (Context-Aware Broker Protocol) extends JSON-RPC with identity-scoped request routing via a six-stage broker pipeline to ensure correct identity and policy propagation.
Design and protocol specification included in the paper; formal description and broker-pipeline semantics documented as a deliverable.
high positive Bridging Protocol and Production: Design Patterns for Deploy... correctness of identity and policy propagation across broker pipeline (as define...
Different model families (Sonnet 4.6 vs. Opus 4.6) exhibit stable, systematic differences in methodological preferences and choice patterns—distinct empirical 'styles'.
Comparison of choice patterns and methodological decisions across agents instantiated with Sonnet 4.6 versus Opus 4.6 within the 150-agent experiment, showing consistent between-family differences in measure selection and estimation procedures.
high positive Nonstandard Errors in AI Agents frequency/distribution of methodological choices by model family (categorical ch...
Agents split on measure choice (e.g., autocorrelation vs. variance-ratio tests; dollar-volume vs. share-volume measures), producing different substantive estimates from the same raw data and hypotheses.
Observed categorical divergences in measure selection across the 150 agents during independent analyses of SPY TAQ (2015–2024); documented alternative test/measure families and corresponding divergent effect estimates for the six hypotheses.
high positive Nonstandard Errors in AI Agents measure selection (categorical) and resulting substantive effect estimates (cont...
AI-to-AI variation (nonstandard errors, NSEs) across autonomous coding agents produces substantial uncertainty in empirical results analogous to human researcher heterogeneity.
Experimental results from 150 autonomous Claude Code agents (two model families: Sonnet 4.6 and Opus 4.6) independently analyzing the same SPY TAQ data (NYSE TAQ, 2015–2024) on six pre-specified hypotheses; recorded agent-to-agent variation in methodological choices and resulting effect estimates (dispersion measured via IQR and related diagnostics).
high positive Nonstandard Errors in AI Agents agent-to-agent variation in methodological choices and effect estimates (dispers...
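The entry above reports dispersion via the interquartile range (IQR). A minimal sketch of that measure follows, assuming one point estimate per agent for a single hypothesis; the estimates are hypothetical, and the "inclusive" quartile method is an assumption, since the paper's exact quantile convention is not given here.

```python
from statistics import quantiles


def iqr(estimates):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = quantiles(estimates, n=4, method="inclusive")
    return q3 - q1


# Hypothetical effect estimates from nine agents analyzing the same data.
agent_estimates = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.45]

spread = iqr(agent_estimates)
print(round(spread, 4))
```

The IQR ignores the tails, so a single outlying agent (such as the 0.45 above) does not inflate the dispersion measure the way a standard deviation would.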
Observations span multiple agent platforms (Moltbook, The Colony, 4claw) with more than 167,000 agents interacting as peers.
Author-reported coverage from naturalistic observations across the named platforms during the one-month observation window; count reported as ≈167k agents.
high positive When Openclaw Agents Learn from Each Other: Insights from Em... number of agents observed interacting as peers
The mechanism generalizes to another field: models trained on economics publication records reach ~70% accuracy on a similar benchmark.
Analogue of the management experiment performed in economics: models fine-tuned on economics journal publication records were evaluated on an economics benchmark and achieved approximately 70% accuracy. (Exact dataset sizes, benchmarks, and train/test splits not specified in the provided text.)
high positive Machines acquire scientific taste from institutional traces Accuracy on an economics research-pitch benchmark
Fine-tuned models trained on publication records each outperform every frontier model and the expert panel; the best single model achieves 59% accuracy on the benchmark.
Language models fine-tuned on historical journal accept/reject records were evaluated on the held-out four-tier benchmark; reported performance shows each fine-tuned model exceeds the frontier-model average and the human-panel baseline, with the best model at 59% accuracy. (Exact training set size and benchmark sample count not specified here.)
high positive Machines acquire scientific taste from institutional traces Accuracy on the four-tier management research-pitch benchmark
Panels of journal editors and editorial board members reach 42% accuracy by majority vote on the same four-tier benchmark.
Human baseline obtained by soliciting judgments from journal editors and editorial board members on the held-out benchmark and computing majority-vote accuracy (reported as 42%). (Number of human raters and benchmark size not given in supplied text.)
high positive Machines acquire scientific taste from institutional traces Majority-vote accuracy on the four-tier management research-pitch benchmark
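Majority-vote accuracy, as used for the human baseline above, can be sketched as follows. The panel votes and true tiers are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption; the paper's handling of ties is not stated in the supplied text.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


def majority_vote_accuracy(panel_votes, truth):
    """Share of items where the panel's majority label matches the true tier."""
    hits = sum(majority_vote(votes) == t for votes, t in zip(panel_votes, truth))
    return hits / len(truth)


# Hypothetical tier votes (1-4) from three raters on five pitches.
panel_votes = [
    [1, 1, 2],
    [3, 4, 4],
    [2, 2, 2],
    [1, 3, 3],
    [4, 2, 4],
]
truth = [1, 4, 3, 3, 2]

accuracy = majority_vote_accuracy(panel_votes, truth)
print(accuracy)
```

On a four-tier benchmark with balanced tiers, uniform guessing would score about 25%, which is the natural reference point for the reported accuracies.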
Fine-tuning language models on historical journal publication decisions recovers an evaluative "scientific taste" that frontier (zero-shot) models and expert editor panels cannot reliably reproduce.
Fine-tuned models were trained on years of journal publication decisions (institutional accept/reject records) and evaluated on a held-out four-tier benchmark of management research pitches; performance compared to zero-shot evaluations of frontier models and to panels of journal editors (majority-vote). (Sample sizes for training records and held-out benchmark not specified in the provided text.)
high positive Machines acquire scientific taste from institutional traces Ability to predict publication-worthiness as measured by tier prediction accurac...
An asynchronous sliding-window engine treats the GPU as a sliding compute window and overlaps GPU computation with CPU-side parameter updates and multi-tier I/O to hide data movement and synchronization overheads.
System design and implementation described in the paper: an asynchronous runtime that coordinates GPU kernels, CPU updates, and multi-tier I/O. This is a design/implementation claim rather than a measured outcome; the summary links the design to performance improvements.
high positive An Efficient Heterogeneous Co-Design for Fine-Tuning on a Si... system behavior (overlap of compute and I/O / synchronization)
The A-ToM mechanism estimates a partner's likely ToM order from interaction history and uses that estimate to predict the partner's next action, which then informs the agent's policy choices.
Method description and implementation details provided in the paper: estimator over ToM orders based on past interactions + conditional action prediction feeding into decision-making; validated in the reported experiments.
high positive Adaptive Theory of Mind for LLM-based Multi-Agent Coordinati... accuracy/usefulness of inferred ToM order for partner-action prediction and subs...
Empirical evaluation was performed across four coordination environments: a repeated matrix game, two grid navigation tasks, and an Overcooked task.
Methods section describes these four benchmark environments used for all reported comparisons between fixed-order agents and A-ToM agents; evaluation metrics were joint payoffs and task-specific success measures.
high positive Adaptive Theory of Mind for LLM-based Multi-Agent Coordinati... coordination performance (joint payoff, success rate) as used in experiments
Modular outputs (question histories, security checks, rubric scores, summaries) enable post-hoc review and explainability.
Architectural design and output artifacts described in the paper (logs and structured outputs per agent); these artifacts provide material for explanation and audit.
high positive CoMAI: A Collaborative Multi-Agent Framework for Robust and ... interpretability and auditability (availability of logs and structured outputs)
Adaptive difficulty and multidimensional evaluation allow dynamic tailoring of questions to candidate performance.
Implementation of adaptive testing logic within the workflow described in the paper, with experiments involving dynamic difficulty adjustment; detailed metrics of adaptation effectiveness are not provided in the summary.
high positive CoMAI: A Collaborative Multi-Agent Framework for Robust and ... ability to adapt question difficulty and evaluate multiple skill dimensions
Operating as a pre-processor (rather than modifying the generator) enables modular integration with existing LLMs and provides an explicit decision point for clarification.
Novelty/architecture claim in the paper explaining that C.A.P. runs before generation and therefore can be plugged into existing LLM pipelines; described design rationale (no empirical integration study presented).
high positive A Context Alignment Pre-processor for Enhancing the Coherenc... ease of integration / ability to attach to existing generation pipelines
C.A.P. verifies semantic alignment between the current expanded prompt and the weighted history and triggers a structured clarification protocol when similarity is below a threshold.
Component-level description: alignment verification via semantic embeddings (cosine similarity) or learned classifiers and threshold-based decision branching to initiate clarification; described protocol templates (no empirical validation provided).
high positive A Context Alignment Pre-processor for Enhancing the Coherenc... alignment detection (similarity score) and number/rate of triggered clarificatio...
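The threshold-based alignment check described above might look like the following sketch. It assumes the prompt and the history have already been embedded as equal-length vectors; the function names and the 0.5 threshold are hypothetical, since the paper reports no concrete values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def needs_clarification(prompt_vec, history_vec, threshold=0.5):
    """True when similarity falls below the (hypothetical) threshold."""
    return cosine_similarity(prompt_vec, history_vec) < threshold


# Toy 2-D embeddings: an on-topic prompt and an off-topic one.
history = [1.0, 0.0]
print(needs_clarification([0.9, 0.1], history))  # closely aligned with history
print(needs_clarification([0.0, 1.0], history))  # orthogonal to history
```

Choosing the threshold is the main design decision: set too high, the system interrupts users with needless clarification requests; set too low, drift goes undetected.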
C.A.P. retrieves dialogue history using a time-weighted decay so recent context is prioritized (approximating human conversational focus).
Design description of a 'time-weighted context retrieval' component; authors propose temporal decay functions (e.g., exponential decay, half-life parameter) applied to dialogue-turn embeddings or metadata (no empirical results reported).
high positive A Context Alignment Pre-processor for Enhancing the Coherenc... recency-weighted relevance of retrieved context / retrieval precision for recent...
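One way to realize the exponential decay with a half-life parameter mentioned above is sketched below. It assumes dialogue turns are stored oldest-first as embedding vectors and that age is counted in turns; the names `decay_weight` and `weighted_history` and the default half-life of 3 turns are hypothetical.

```python
def decay_weight(age_in_turns, half_life=3.0):
    """Weight halves every `half_life` turns; the newest turn (age 0) gets 1.0."""
    return 0.5 ** (age_in_turns / half_life)


def weighted_history(turn_vectors, half_life=3.0):
    """Decay-weighted average of turn embeddings; the list is ordered oldest first."""
    n = len(turn_vectors)
    weights = [decay_weight(n - 1 - i, half_life) for i in range(n)]
    total = sum(weights)
    dim = len(turn_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, turn_vectors)) / total
        for d in range(dim)
    ]


# Toy 1-D embeddings: an old turn at 0.0 and a recent turn at 1.0.
summary = weighted_history([[0.0], [1.0]], half_life=1.0)
print(summary)
```

A shorter half-life makes the summary track the latest turns more closely, at the cost of forgetting earlier context sooner.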
C.A.P. is a pre-generation module that expands user utterances to recover omitted premises and implications.
Architecture and methods description in the paper specifying a 'semantic expansion' component; suggested implementations via knowledge-bases or small LLM prompts to generate premises, paraphrases, and implications (no empirical evaluation reported).
high positive A Context Alignment Pre-processor for Enhancing the Coherenc... recovered implicit premises / coverage of implied goals in expanded prompt
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.
Argument from the formal properties of argumentation frameworks (AFs) and their representation; no empirical user studies are reported, and the claim relies on established formal semantics.
high positive Argumentative Human-AI Decision-Making: Toward AI Agents Tha... inspectability/traceability of inference chains (auditability)
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).
Established literature on formal argumentation (e.g., Dung-style AFs) and the paper's conceptual description; no new empirical data reported.
high positive Argumentative Human-AI Decision-Making: Toward AI Agents Tha... existence and machine-checkability of formal inferential chains (inspectability/...
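To make "machine-checkable" concrete, the sketch below computes the grounded extension of a Dung-style abstract argumentation framework by iterating its characteristic function from the empty set. This is a standard textbook construction rather than anything specified in the paper, and the three-argument example is hypothetical.

```python
def grounded_extension(arguments, attacks):
    """Least fixed point of F(S) = {a : every attacker of a is attacked by S}."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    extension = set()
    while True:
        defended = {
            a
            for a in arguments
            if all(
                any((d, b) in attacks for d in extension)
                for b in attackers[a]
            )
        }
        if defended == extension:
            return extension
        extension = defended


# Hypothetical chain: a attacks b, and b attacks c.
arguments = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "c")}

accepted = grounded_extension(arguments, attacks)
print(sorted(accepted))
```

Here `a` is unattacked and so accepted, `a` defeats `b`, and that in turn reinstates `c`; every step of the acceptance chain can be traced by inspecting the attack relation.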
The development artifacts are fully transparent and reproducible: the repository includes an archive of 229 human prompts and a git history with 213 commits.
Paper reports counts of prompts (229) and git commits (213) and states these archives are public; these are concrete repository metrics (n=1 development repository).
high positive Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... number of human prompts archived (229); number of git commits (213); public avai...
The Lean kernel provided full machine verification of all formalized statements in the development.
Paper reports 'Full verification by the Lean kernel' for the Lean 4 development; supported by availability of the Lean 4 repository and verified theorem artifacts (n=1 project).
high positive Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... machine-checked verification status of formalized statements (verified/unverifie...
A specialized prover (Aristotle) automatically closed 111 lemmas during the development.
Quantitative verification metric reported in the paper: 111 lemmas automatically closed by Aristotle; claim tied to the Lean development and prover logs (single project count).
high positive Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... number of lemmas automatically discharged by Aristotle (111)
The AI-assisted pipeline combined an AI reasoning model (Gemini DeepThink) to generate the proof, an agentic coding tool (Claude Code) to translate the proof to Lean, a specialized automated prover (Aristotle) that closed 111 lemmas, and the Lean kernel to fully verify the result.
Project workflow description and verification metrics in the paper; reported counts and named components (Gemini DeepThink, Claude Code, Aristotle, Lean kernel); repository and logs purportedly document toolchain usage (n=1 project; 111 lemmas closed by Aristotle reported).
high positive Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... composition of toolchain and number of lemmas automatically discharged (111)
A complete formalization in Lean 4 of the equilibrium characterization for the Vlasov–Maxwell–Landau (VML) system was produced through an AI-assisted pipeline.
Single-project artifact: a Lean 4 development containing formal statements, proof scripts and verified theorems reported by the paper (n=1 project); authors report full machine verification by the Lean kernel and provide the repository as public evidence.
high positive Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... completeness of formalization / machine-checked verification of the VML equilibr...
In the human–human benchmark, repeated pre-play communication substantially increases cooperation.
Reference benchmark data from Dvorak & Fehrler (2024), human–human sample n = 108, showing higher cooperation under repeated communication relative to less frequent communication; comparison reported in the paper.
high positive Playing Against the Machine: Cooperation, Communication, and... change in cooperation rate associated with repeated communication in human–human...
Evaluation metrics for the benchmark include task-specific metrics such as win-rate for battling and completion time for speedruns, as well as strategic robustness measures.
Paper's evaluation section lists metrics used: win-rate, completion time, strategic robustness; describes how they are computed and used to compare agents.
high positive The PokeAgent Challenge: Competitive and Long-Context Learni... evaluation metrics used (win-rate, completion time, strategic robustness)