The Commonplace

Evidence (14055 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                     Positive Negative    Mixed     Null    Total
Other                            758      199      100      900     2007
Governance & Regulation          826      400      191      122     1563
Organizational Efficiency        777      193      124       84     1189
Technology Adoption Rate         635      233      124       97     1098
Research Productivity            422      128       57      336      954
Output Quality                   476      179       59       47      761
Decision Quality                 328      177       81       47      640
Firm Productivity                435       57       88       20      606
AI Safety & Ethics               218      277       65       33      599
Market Structure                 180      170      123       24      502
Task Allocation                  213       64       72       33      387
Skill Acquisition                170       61       61       17      309
Innovation Output                203       27       43       18      292
Employment Level                 105       54      107       13      281
Fiscal & Macroeconomic           131       69       43       26      276
Consumer Welfare                 117       63       42       11      233
Firm Revenue                     153       48       26        3      230
Task Completion Time             173       31        8       12      225
Inequality Measures               44      122       49        6      221
Worker Satisfaction               89       65       22       12      188
Error Rate                        69       92       10        2      173
Regulatory Compliance             77       69       14        5      165
Automation Exposure               56       56       26       13      154
Training Effectiveness            94       21       13       19      149
Wages & Compensation              77       36       25        6      144
Team Performance                  86       17       27       10      141
Developer Productivity            95       17       14        6      133
Job Displacement                  12       80       20        1      113
Hiring & Recruitment              52        7        8        3       70
Creative Output                   31       18        8        3       61
Skill Obsolescence                 5       46        6        1       58
Social Protection                 27       16        8        2       53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
The paper examines operational logic, defining features and emerging use cases of agentic payments across retail, e-commerce and decentralised finance.
Stated scope in the abstract; analysis and case-study-driven review across specified sectors (retail, e-commerce, DeFi). No sample sizes reported.
high null result AI Agents in Payments: Applications, Risks and Regulations emerging use cases / sector-level application
Agentic payments refer to transactions initiated and completed by AI agents without direct human intervention.
Explicit definitional statement in the abstract (conceptual definition provided by the authors).
high null result AI Agents in Payments: Applications, Risks and Regulations definition/characterisation of a payment modality
Current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes.
Synthesis of mixed results from controlled studies, meta-analyses, and benchmarks reported in the paper (no single sample size given in abstract).
high null result Agentic Agile-V: From Vibe Coding to Verified Engineering in... engineering outcomes (overall improvement from autonomous code generation)
However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance.
Reported comparison between the coordinated workflow and a strong combined-summary baseline for exoplanet vetting indicating no meaningful improvement.
high null result Cross-domain benchmarks reveal when coordinated AI agents im... relative performance vs. combined-summary baseline for exoplanet vetting
All [the listed orchestration frameworks] follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn.
Author assertion based on architectural analysis of the listed frameworks (observation of orchestration pattern in the named projects).
high null result Compiling Agentic Workflows into LLM Weights: Near-Frontier ... architectural pattern (external orchestrator behavior)
The study used established measurement scales to assess AI-driven learning culture, knowledge orchestration, organisational intelligence and innovation performance.
Methods: authors report use of established scales for AIDLC, KO, OI and IP in the questionnaire.
high null result Enhancing innovation in Pakistan’s IT sector measurement validity / constructs used
Structured questionnaires were distributed between March and October 2025 to employees involved in innovation, learning and project management roles in Karachi, Lahore and Islamabad.
Methods section description of data collection period, target respondent roles, and cities covered.
high null result Enhancing innovation in Pakistan’s IT sector data collection protocol (timing and respondent roles)
Most respondents held undergraduate or postgraduate degrees in computer science, engineering or business-related disciplines.
Sample demographic summary from the survey (N=348).
high null result Enhancing innovation in Pakistan’s IT sector respondent educational background
After screening the data, 348 valid responses were analyzed.
Structured questionnaires distributed March–October 2025 to employees in medium and large IT firms in Karachi, Lahore and Islamabad; screening produced 348 valid responses (sample description in methods).
high null result Enhancing innovation in Pakistan’s IT sector sample_size
The paper draws on empirical studies from 2024–2026.
Methodological statement in the paper specifying the time window of empirical studies used in the analysis.
high null result The Algorithmic Mirror: Can Artificial Intelligence Truly Mi... temporal scope of literature reviewed
This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks.
Comparative evaluation reported in the paper showing that single-threshold (binary) scoring metrics do not exhibit the inverse-scaling pattern observed with tail-inclusive distributional metrics (specific metrics and calculations not given in excerpt).
high null result Is Capability a Liability? More Capable Language Models Make... relationship between model capability and accuracy under single-threshold metric...
Domain knowledge does not reliably rescue calibration.
Experiments reported in the paper where domain-knowledge interventions (procedures or prompts incorporating domain knowledge) were applied and did not consistently improve forecast calibration (details not provided in excerpt).
high null result Is Capability a Liability? More Capable Language Models Make... forecast calibration after incorporating domain knowledge
Using large language models, we measure the AIO level of Chinese listed companies from 2010 to 2023.
Authors report constructing firm-level measures of artificial intelligence orientation (AIO) by applying large language models to corporate texts/disclosures for Chinese listed companies over the 2010–2023 period.
high null result Artificial intelligence orientation and decarbonization spil... artificial intelligence orientation (AIO) measurement
We compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred.
Study design: comparison between trait set responsible for incidents (from incident reports) and stated developer preferences collected from a sample of 197 developers working on those tasks.
high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... developers' preferred AI system traits (self-reported)
We compared the extracted traits with the traits that 202 workers highly familiar with those tasks would have preferred.
Study design: a comparison between LLM-extracted traits from incident reports and stated preferences from a sample of 202 workers familiar with the tasks.
high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... workers' preferred AI system traits (self-reported preferences)
We used an LLM-as-an-expert approach to extract the main traits of the AI systems involved in those incidents using an established framework of twelve traits.
Methods statement: applied a Large Language Model to code/extract AI system traits from the incident reports using an established 12-trait framework.
high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... trait classification of AI systems involved in incidents
We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors.
Descriptive statement in paper: dataset comprised 1,524 incident reports, covering 171 occupational tasks and 12 industry sectors (dataset construction / corpus used for analysis).
high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... scope and coverage of analyzed incident reports (number of incidents, tasks, and...
This study provides the first cross-class synthesis covering raw materials, work-in-process, and finished goods within a unified evaluative framework, positioning machine learning and deep reinforcement learning methods alongside classical policy families and quantifying the boundary conditions for each approach.
Author-stated theoretical contribution and scope of the review (coverage of raw materials, WIP, finished goods and methods).
high null result Equitable railway corridor investment under demand uncertain... breadth and novelty of synthesis across inventory classes and methods
A random-effects model estimated by restricted maximum likelihood was applied to pool percentage cost-reduction effect sizes across 18 studies admissible to quantitative synthesis.
Methods reported in the paper: random-effects meta-analysis using REML across 18 studies eligible for quantitative pooling.
high null result Equitable railway corridor investment under demand uncertain... pooled percentage cost-reduction effect sizes
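As a rough illustration of what pooling effect sizes under a random-effects model involves, the sketch below uses the DerSimonian-Laird moment estimator. This is a simpler closed-form substitute, not the restricted maximum likelihood estimator the paper reports. The function name, effect sizes, and variances are hypothetical and are not taken from the 18 studies.

```python
import math

def pool_random_effects(effects, variances):
    """Pool effect sizes with a DerSimonian-Laird random-effects model.

    NOTE: the paper reports REML; DerSimonian-Laird is swapped in here
    because it has a closed form. Inputs are per-study effects and their
    sampling variances.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                  # fixed-effect weights
    sum_w = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]    # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mu, se, tau2

# Hypothetical percentage cost reductions and sampling variances.
effects = [12.0, 18.5, 9.0, 22.0, 15.5]
variances = [4.0, 6.25, 3.0, 9.0, 5.0]
pooled, se, tau2 = pool_random_effects(effects, variances)
print(f"pooled = {pooled:.2f}, se = {se:.2f}, tau^2 = {tau2:.2f}")
```

A REML fit would replace the moment-based `tau2` with an iteratively estimated value; the weighting and pooling steps stay the same.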
A systematic review and meta-analytic synthesis of 31 peer-reviewed studies published between 2004 and 2025 was conducted following the PRISMA 2020 protocol.
Study methods reported in the paper: systematic review following PRISMA 2020; sample of 31 peer-reviewed studies dated 2004–2025.
high null result Equitable railway corridor investment under demand uncertain... number and coverage of studies included in the review
The study uses AI patent counts as its indicator of artificial intelligence.
Methodological note: AI patent counts are used as the dependent variable; data: G8 countries plus Türkiye, 2010-2020.
high null result AR-GE HARCAMALARININ VE VERGİ TEŞVİKLERİNİN YAPAY ZEKAYA ETK... AI patent counts (descriptive/dependent variable statement)
The study uses PyQu to quantify changes across five quality attributes for Python code.
Methodological description: application of PyQu (an ML-based quality assessment tool for Python) to measure five quality attributes before and after refactoring edits.
high null result Quality and Security Signals in AI-Generated Python Refactor... five PyQu quality attributes (measured by the tool)
From the observed diffs, we derive a taxonomy of 24 recurring change operations.
Manual/automated analysis of diffs from the studied agentic refactoring PRs to identify and categorize recurring change operations into a 24-item taxonomy.
high null result Quality and Security Signals in AI-Generated Python Refactor... count and categorization of recurring change operations present in diffs
We will release the reanalysis pipeline to support replication.
Authors' statement of intent in the paper to release code/pipeline for replication.
high null result When Skills Don't Help: A Negative Result on Procedural Know... availability of reanalysis pipeline (planned release)
In offensive cybersecurity, the marginal benefit of Skills collapses: the spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold).
Statistical re-analysis of the 180-run CTF study comparing no-Skills vs full-Skills conditions: reported spread = 8.9 percentage points; reported p-values from χ² and Cochran–Armitage trend tests; reported Cohen's h comparisons.
high null result When Skills Don't Help: A Negative Result on Procedural Know... task pass rate (success rate) in Capture-the-Flag offensive cybersecurity tasks
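The 0.2 small-effect threshold cited above refers to Cohen's h, the difference between arcsine-transformed proportions. The sketch below shows the calculation for two pass rates that differ by 8.9 percentage points. The rates themselves are hypothetical, since the entry reports only the spread, and `cohens_h` is an illustrative helper name.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2

# Hypothetical pass rates 8.9 percentage points apart (not the study's values).
no_skills, full_skills = 0.500, 0.589
h = cohens_h(full_skills, no_skills)
print(f"h = {h:.3f}; below the 0.2 small-effect threshold: {abs(h) < 0.2}")
```

Because of the arcsine transform, the same spread in percentage points gives a larger h near 0 or 1 than near 0.5, so the actual value depends on the base rates in the study.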
Those four documentation conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation.
Authors map the four documentation-line-count conditions from the re-analyzed study to skill-ablation categories (No/Experiential/Curated/Comprehensive) as part of their interpretive re-analysis.
high null result When Skills Don't Help: A Negative Result on Procedural Know... mapping of documentation richness to Skill-ablation categories
We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions (55, 1,478, 1,976, and 4,147 lines).
Authors' re-analysis of an existing controlled study consisting of 180 runs and four documentation conditions with the stated line counts; this is a descriptive claim about the re-analysis dataset and experimental conditions.
high null result When Skills Don't Help: A Negative Result on Procedural Know... reanalysis dataset size and documentation-condition line counts
Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate.
Empirical evaluation: 660 trials run using Claude Code on the minimal-pair repos with hidden tests; reported comparison of pass rates between clean and messy repo variants showing no change.
high null result Does Code Cleanliness Affect Coding Agents? A Controlled Min... pass rate (task success on hidden tests)
Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity.
Measurement approach stated in the abstract (unified rubric with listed dimensions).
The study uses three LLM systems: ChatGPT, Claude, and Grok.
Method description in the paper's abstract naming the three LLMs evaluated.
The evaluation covers four task types: summarization, planning, explanation, and coding.
Method description in the paper's abstract listing the four task types used for evaluation.
The study compares three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt.
Experimental design described in the paper (three prompt conditions stated in the abstract).
high null result Less Back-and-Forth: A Comparative Study of Structured Promp... experimental_condition
Large language models (LLMs) are widely used for open-ended tasks.
Stated as background/context in the paper's introduction; no quantitative data reported in the abstract.
high null result Less Back-and-Forth: A Comparative Study of Structured Promp... use_of_llms_for_open_ended_tasks
We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao.
Statement of experimental methodology describing the types of evaluations performed (public datasets, simulated auctions, and online deployment).
high null result Generative Auto-Bidding with Unified Modeling and Exploratio... scope and environments of experiments (public datasets, simulations, live deploy...
This study used a controlled mixed-design experiment with 60 participants who completed analytical survival ranking tasks in multi-turn human–AI collaborations, with pre/post measurements and two types of prompting training (general or sycophancy-focused).
Methodological description in the paper's abstract/summary.
high null result The Hidden Cost of Contextual Sycophancy: an AI Literacy Int... study design / methodological description
Reported empirical values are transformed through transparent indicators such as relative growth, CAGR, growth multipliers, stock-flow ratios, concentration ratios, and HHI.
Methodological description and application in the paper listing these specific indicators used to summarize public data on AI investment, adoption, robots, compute, and labour-market reallocation.
high null result The Agentic Economy: Humans, AI Agents, Robots, and the Meas... data transformation / indicator usage
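Several of the indicators named in this entry have standard textbook definitions, sketched below. The function names and sample figures are hypothetical; the paper's own data and exact formulas are not given in the excerpt, and the stock-flow ratio is omitted because its definition there is not stated.

```python
def relative_growth(start, end):
    """Change relative to the starting value, e.g. 0.25 for +25%."""
    return (end - start) / start

def growth_multiplier(start, end):
    """How many times larger the end value is than the start value."""
    return end / start

def cagr(start, end, years):
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1.0 / years) - 1.0

def concentration_ratio(shares, k):
    """Combined share of the k largest units (shares as fractions)."""
    return sum(sorted(shares, reverse=True)[:k])

def hhi(shares):
    """Herfindahl-Hirschman index: sum of squared shares (fractions)."""
    return sum(s ** 2 for s in shares)

# Hypothetical figures: a series growing from 100 to 121 over two years,
# and a market split among three units.
print(f"CAGR: {cagr(100.0, 121.0, 2):.3f}")
shares = [0.5, 0.3, 0.2]
print(f"CR2: {concentration_ratio(shares, 2):.2f}, HHI: {hhi(shares):.2f}")
```

With shares expressed as fractions, HHI ranges from 1/n for n equal units up to 1 for a single unit; multiplying by 10,000 gives the points scale often used in practice.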
The study uses a conceptual-empirical quantitative diagnostic design rather than a causal econometric model.
Explicit methodological statement in the paper describing the design choice and rejecting causal econometric modeling in favor of diagnostics using public institutional data and transparent indicators.
high null result The Agentic Economy: Humans, AI Agents, Robots, and the Meas... study methodology (diagnostic vs causal modeling)
The agentic economy is not yet a completed global order, but its transition pressure is measurable enough to require a distinct economic vocabulary, reproducible diagnostics, and future sector-level measurement.
Synthesis of diagnostic indicators (AI investment/adoption trends, robot stock, compute-energy coupling, labour reallocation measures) showing measurable transition pressures; conclusion drawn from the conceptual-empirical diagnostic.
high null result The Agentic Economy: Humans, AI Agents, Robots, and the Meas... degree of completion of 'agentic economy' transition / measurability of transiti...
Following PRISMA 2020 guidelines, searches across Google Scholar, Web of Science, Scopus, ScienceDirect, and CNKI yielded 1,562 initial records, of which 21 studies published between 2019 and 2026 met inclusion criteria.
Methodological description of the systematic literature review reported in the paper: initial records = 1,562; included studies = 21; publication years 2019–2026.
high null result Application of Artificial Intelligence in Human Resource Man... number of records screened and studies included
Small and medium-sized enterprises (SMEs) constitute over 98.5% of businesses in many economies including China.
Descriptive statistic reported in the paper's background/intro; source of the statistic not specified within the summary provided.
high null result Application of Artificial Intelligence in Human Resource Man... share of businesses that are SMEs
This study analyzes developments through April 2026.
Explicit timeframe statement in the paper's summary/introduction.
high null result AI for Auto-Research: Roadmap & User Guide temporal coverage of the review/analysis
The authors provide source code for their framework on GitHub to encourage further research.
Statement in the paper that the source code is available on GitHub; verifiable by visiting the repository (link not provided in the excerpt).
high null result Modelling Customer Trajectories with Reinforcement Learning ... availability of implementation/source code
Heuristics such as TSP and PNN are commonly used as inexpensive approximations for customer trajectories.
Descriptive claim about common practice cited in the paper; used as motivation for proposing the RL approach (no quantitative survey evidence provided in the excerpt).
high null result Modelling Customer Trajectories with Reinforcement Learning ... use of heuristic methods (TSP, PNN) for trajectory approximation
We conducted a randomized controlled experiment in which participants—analogs of early-career knowledge workers—were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance.
Statement of experimental design in the paper (randomized controlled experiment assigning participants to either traditional resources or LLM assistance; participants described as analogs of early-career knowledge workers).
high null result Generative AI and the Productivity Divide: Human-AI Compleme... experimental assignment / study design (treatment vs control)
Results remain robust across checks.
Robustness checks reported by the authors (unspecified in abstract) that do not overturn the main findings.
high null result Dissipation of Debt Financing Privilege on Corporate AI Wash... robustness of core findings (debt financing cost increase for AI washing firms)
China's 14th Five Year Plan (FYP) is used as a quasi-natural experiment / strategic policy shock to study effects of AI washing.
Research design leverages the FYP announcement as an exogenous policy shock in a difference-in-differences framework (design claim; no sample size in abstract).
high null result Dissipation of Debt Financing Privilege on Corporate AI Wash... policy shock (use of FYP as quasi-experiment)
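The difference-in-differences logic behind this design can be sketched with a basic two-group, two-period comparison. The group labels, the debt-cost figures, and the `did_estimate` helper are all hypothetical; the paper's actual specification (controls, fixed effects, sample) is not given in the abstract.

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # The control group's change stands in for the trend the treated
    # group would have followed without the policy shock.
    return treated_change - control_change

# Hypothetical debt financing costs (%) before and after the policy shock.
treated_pre = [4.0, 4.2, 3.8]
treated_post = [5.1, 5.3, 4.9]
control_pre = [4.1, 3.9, 4.0]
control_post = [4.4, 4.2, 4.3]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"DiD estimate: {effect:.2f} percentage points")
```

The estimate is only interpretable as a policy effect if the two groups would have followed parallel trends without the shock, which is an assumption of the design rather than something this calculation checks.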
AI washing is identified as the residual between AI narrative intensity and patent output.
Constructed a firm-level AI washing proxy by regressing AI narrative intensity on patent output and using the residual; described as the study's measurement approach (no sample size reported in the abstract).
high null result Dissipation of Debt Financing Privilege on Corporate AI Wash... AI washing measure (residual between narrative intensity and patent output)
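The residual-based proxy described here can be illustrated with a one-variable least-squares fit: regress narrative intensity on patent output and keep what the patents do not explain. The firm values and the `ols_residuals` helper are hypothetical, and the paper's regression may include further controls that the abstract does not list.

```python
from statistics import mean

def ols_residuals(x, y):
    """Residuals from a simple least-squares regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical firm-level data: patent counts and AI narrative intensity.
patents = [0.0, 2.0, 5.0, 10.0, 20.0]
narrative = [0.30, 0.10, 0.25, 0.40, 0.60]

washing_proxy = ols_residuals(patents, narrative)
for p, r in zip(patents, washing_proxy):
    print(f"patents = {p:5.1f}  residual = {r:+.3f}")
```

A positive residual marks a firm that talks about AI more than its patent output would predict, which is the sense in which the residual serves as a washing measure.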
Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.
Prescriptive conclusion derived from the observed cross-configuration heterogeneity in the paper's empirical results.
high null result Same Signal, Different Semantics: A Cross-Framework Behavior... validity/generalizability of behavioral findings across agent configurations
Framework identity accounts for more of the between-configuration variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%.
Variance decomposition / explained-variance analysis reported for 'mean turns' across configurations (reported percentages: 64% vs 10%).
high null result Same Signal, Different Semantics: A Cross-Framework Behavior... mean turns (average number of turns per task)
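One way to arrive at "share of between-configuration variance explained" figures of this kind is a sum-of-squares decomposition over configuration means, sketched below. This assumes a balanced design in which every framework is paired with every LLM. The paper's exact procedure is not described in the excerpt, and the configuration labels and turn counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def between_config_shares(mean_turns):
    """Share of between-configuration variance tied to each factor.

    mean_turns maps (framework, llm) -> mean turns for that configuration.
    Returns (framework_share, llm_share).
    """
    values = list(mean_turns.values())
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)

    def factor_share(index):
        groups = defaultdict(list)
        for key, value in mean_turns.items():
            groups[key[index]].append(value)
        # Sum of squares between the levels of this factor.
        ss = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
        return ss / ss_total

    return factor_share(0), factor_share(1)

# Hypothetical mean turns for three frameworks crossed with two LLMs.
mean_turns = {
    ("A", "x"): 10.0, ("A", "y"): 12.0,
    ("B", "x"): 14.0, ("B", "y"): 16.0,
    ("C", "x"): 18.0, ("C", "y"): 20.0,
}
fw_share, llm_share = between_config_shares(mean_turns)
print(f"framework: {fw_share:.0%}, LLM: {llm_share:.0%}")
```

In this toy data the two factors are purely additive, so their shares sum to one; in real data the remainder is attributable to framework-by-LLM interaction.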
The analysis separates framework effects from LLM effects by holding each layer fixed in turn and measures one behavior–outcome effect per configuration to examine agreement across configurations.
Methods description in the paper: experimental design holding LLM or framework fixed to disentangle effects.
high null result Same Signal, Different Semantics: A Cross-Framework Behavior... behavior–outcome effects per configuration (methodological approach)