The Commonplace

Evidence (6507 claims)

Adoption: 7395 claims
Productivity: 6507 claims
Governance: 5877 claims
Human-AI Collaboration: 5157 claims
Innovation: 3492 claims
Org Design: 3470 claims
Labor Markets: 3224 claims
Skills & Training: 2608 claims
Inequality: 1835 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                     Positive  Negative  Mixed  Null  Total
Other                            609       159     77   736   1615
Governance & Regulation          664       329    160    99   1273
Organizational Efficiency        624       143    105    70    949
Technology Adoption Rate         502       176     98    78    861
Research Productivity            348       109     48   322    836
Output Quality                   391       120     44    40    595
Firm Productivity                385        46     85    17    539
Decision Quality                 275       143     62    34    521
AI Safety & Ethics               183       241     59    30    517
Market Structure                 152       154    109    20    440
Task Allocation                  158        50     56    26    295
Innovation Output                178        23     38    17    257
Skill Acquisition                137        52     50    13    252
Fiscal & Macroeconomic           120        64     38    23    252
Employment Level                  93        46     96    12    249
Firm Revenue                     130        43     26     3    202
Consumer Welfare                  99        51     40    11    201
Inequality Measures               36       105     40     6    187
Task Completion Time             134        18      6     5    163
Worker Satisfaction               79        54     16    11    160
Error Rate                        64        78      8     1    151
Regulatory Compliance             69        64     14     3    150
Training Effectiveness            81        15     13    18    129
Wages & Compensation              70        25     22     6    123
Team Performance                  74        16     21     9    121
Automation Exposure               41        48     19     9    120
Job Displacement                  11        71     16     1     99
Developer Productivity            71        14      9     3     98
Hiring & Recruitment              49         7      8     3     67
Social Protection                 26        14      8     2     50
Creative Output                   26        14      6     2     49
Skill Obsolescence                 5        37      5     1     48
Labor Share of Income             12        13     12           37
Worker Turnover                   11        12      3           26
Industry                           1                             1
Filtered by topic: Productivity
Existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions, and do not capture forecasting over continuous quantities.
Literature/benchmark critique asserted in the paper (argument that current benchmarks focus on simple judgmental formats and miss continuous numerical forecasting capabilities).
high negative QuantSightBench: Evaluating LLM Quantitative Forecasting wit... scope/coverage of existing evaluation formats
Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Empirical observations from QuantSightBench evaluation showing model calibration performance as a function of magnitude (paper statement noting sharp degradation and overconfidence at extremes).
high negative QuantSightBench: Evaluating LLM Quantitative Forecasting wit... calibration / overconfidence of prediction intervals across magnitudes
The top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all fall at least 10 percentage points short of the 90% coverage target.
Reported empirical coverage percentages from evaluation on QuantSightBench for the listed models (paper provides these percentage values).
high negative QuantSightBench: Evaluating LLM Quantitative Forecasting wit... empirical coverage (prediction interval coverage) for specific models
None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target.
Empirical evaluation on the newly introduced QuantSightBench benchmark across 11 frontier and open-weight models; models were assessed on empirical coverage of prediction intervals versus a 90% target (paper statement).
high negative QuantSightBench: Evaluating LLM Quantitative Forecasting wit... empirical coverage (prediction interval coverage)
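The coverage shortfalls reported here reduce to a single definition: empirical coverage is the fraction of realized outcomes that fall inside a model's stated prediction interval, compared against the nominal 90% target. A minimal sketch of that computation (the intervals and outcomes are invented for illustration; QuantSightBench's actual data format is not given in this excerpt):

```python
def empirical_coverage(intervals, outcomes):
    """Fraction of true outcomes falling inside the model's [lo, hi] prediction intervals."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)

# Toy check: 3 of 4 outcomes covered -> 75% empirical coverage,
# i.e. 15 percentage points short of the nominal 90% target.
intervals = [(0, 10), (5, 20), (100, 200), (1, 2)]
outcomes = [4, 25, 150, 1.5]
print(empirical_coverage(intervals, outcomes))  # 0.75
```

By this measure, a model whose nominally 90% intervals cover only ~75-79% of outcomes is systematically overconfident, which is the pattern the benchmark reports.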
The study identified significant implementation challenges including algorithmic bias, digital divide concerns, data privacy risks, and low technology readiness among HR teams in Tier 2 cities.
Synthesis of qualitative case study findings from 4 organizations plus survey responses (N=150) reporting barriers and risks encountered during adoption.
high negative A Study on the Effectiveness of Technology-Driven Recruitmen... implementation challenges / risks
Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interactions.
Qualitative and quantitative analysis of errors observed across the DELEGATE-52 experiments (19 LLMs) showing sparse, high-severity, and silently introduced errors that accumulate over long workflows.
high negative LLMs Corrupt Your Documents When You Delegate error severity and silent corruption over time
Degradation severity is exacerbated by document size, length of interaction, or presence of distractor files.
Additional experiments and analyses varying document size, interaction length, and presence of distractor files reported in the paper showing increased degradation under these conditions.
high negative LLMs Corrupt Your Documents When You Delegate severity of document degradation / error rate
Agentic tool use does not improve performance on DELEGATE-52.
Additional experiments reported in the paper that compare plain LLM delegation vs. agentic tool-using configurations on DELEGATE-52 and find no performance improvement from agentic tool use.
high negative LLMs Corrupt Your Documents When You Delegate task performance on DELEGATE-52 (document quality/corruption)
Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.
Reported results from the experiment evaluating 19 LLMs on DELEGATE-52; these named models are highlighted and an average corruption fraction (25%) is reported at the end of long workflows.
high negative LLMs Corrupt Your Documents When You Delegate proportion of document content corrupted
Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation.
Large-scale experiment reported in the paper evaluating 19 LLMs on DELEGATE-52 long delegated workflows; observed document degradation across models.
high negative LLMs Corrupt Your Documents When You Delegate document degradation / output quality
Underreliance on AI might deprive software developers of potential gains in productivity and quality.
Stated in the paper and motivated by themes from twenty-two developer interviews indicating missed benefits when developers underuse LLM tools.
high negative Towards an Appropriate Level of Reliance on AI: A Preliminar... productivity and output quality
Overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills).
Paper explicitly states this risk and grounds the discussion in findings from twenty-two developer interviews (qualitative evidence and participant-reported concerns).
high negative Towards an Appropriate Level of Reliance on AI: A Preliminar... atrophy of critical thinking skills / skill degradation
Small and medium-sized practices face challenges of skill gaps and resource constraints that hinder adoption of technology and data analytics.
Consistent findings across included studies highlighting barriers in small and medium-sized practices (SMPs).
high negative The Use of Technology and Data Analytics in Modern Auditing:... ability to adopt and implement technology/data analytics
AI adoption is reinforcing existing structural disparities within the BRICS bloc, creating a two‑tier productivity hierarchy (China & India vs. Brazil, Russia & South Africa).
Observed divergence in TFP trajectories and differing links between AI indicators and TC/EC across the five BRICS economies; comparative analysis shows stronger frontier-shifting effects in China and India and weaker or negative effects in the other three economies.
high negative AI-driven productivity dynamics in BRICS economies: Evidence... Cross-country divergence in Total Factor Productivity (TFP) growth and its compo...
Brazil, Russia, and South Africa experience stagnation or decline in both efficiency and technological advancement over 2005–2023.
Malmquist TFP decomposition (EC and TC) for each BRICS economy showing flat or negative trends in EC and TC for Brazil, Russia, and South Africa during 2005–2023.
high negative AI-driven productivity dynamics in BRICS economies: Evidence... Efficiency Change (EC) and Technological Change (TC) components of the Malmquist...
Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors.
Negative capability claim based on the authors' survey of prior work (asserted limitation); no systematic benchmark or exhaustive evaluation numbers provided in the excerpt.
high negative Agent-Aided Design for Dynamic CAD Models capability to generate complex 3D assemblies with moving parts
Effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide.
Analytic claim supported by the paper's empirical study of clarification in real software engineering tasks (methods mentioned: quantifying types of information affecting task success and simulated-user question-answering; no sample size given in the abstract).
high negative Asking What Matters: Reward-Driven Clarification for Softwar... impact of missing information and answerability on task success
Large language models remain confined to linguistic simulation rather than grounded understanding.
Conceptual assertion in the paper arguing limits of current models; no empirical tests or measurements reported.
high negative Governing Reflective Human-AI Collaboration: A Framework for... grounded_understanding (absence thereof)
Fluency is not reliability: without structures that stabilise both human and model reasoning, AI cannot be trusted or governed where it matters most.
Central thesis/claim of the paper; normative argument synthesising the paper's observations and proposals rather than an empirically tested finding provided here.
high negative The Missing Knowledge Layer in AI: A Framework for Stable Hu... trustworthiness/governability of AI in high-stakes contexts
Humans often mistake fluency for reliability: when a model responds smoothly, users tend to trust it, even when both model and user are drifting together.
Behavioral/psychological assertion in the paper referencing human interaction patterns with fluent outputs; no experimental data or sample size reported in this paper excerpt.
high negative The Missing Knowledge Layer in AI: A Framework for Stable Hu... user trust in model outputs
LLMs produce fluent outputs even when their internal reasoning has drifted; a confident answer can conceal uncertainty, speculation, or inconsistency, and small changes in phrasing can lead to different conclusions.
Conceptual/observational claim presented in the paper; no original empirical test or sample size reported here.
high negative The Missing Knowledge Layer in AI: A Framework for Stable Hu... reliability/consistency of model outputs (decision quality)
The opacity, fluency, and low-friction interaction patterns of LLMs obscure the boundary between human and machine contribution, leading users to infer competence from outputs rather than from the processes that generate them.
Theoretical argument grounded in prior literature on automation bias and cognitive offloading; presented as explanatory mechanism in the paper rather than an empirically tested causal estimate.
high negative The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... user inference of competence (output-based vs process-based attribution)
The paper introduces the 'LLM fallacy,' a cognitive attribution error in which individuals misinterpret LLM-assisted outputs as evidence of their own independent competence, producing a systematic divergence between perceived and actual capability.
Conceptual/theoretical claim and formal definition offered in the paper; no empirical validation reported in the abstract.
high negative The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... divergence between perceived competence and actual competence when using LLM out...
Most Sub-Saharan African states still lack the institutional frameworks needed to turn these innovations into sustainable development.
Comparative policy analysis stated in the paper; no quantitative sample size or formal survey data reported in the excerpt.
high negative A Framework for Sovereign AI Governance and Economic Growth ... presence/absence of institutional frameworks enabling AI-driven sustainable deve...
Efficiency (e.g., minimizing time and cost with AI-only planning) does not equal effectiveness: optimizing for efficiency can erode team cognition and reduce decision quality.
Synthesis of experimental quantitative results (time/cost vs. risk capture and rework) and qualitative assessment indicating that AI-driven efficiency can come at the expense of risk awareness and planning robustness.
high negative Cognitive Offloading in Agile Teams: How Artificial Intellig... trade-off between efficiency and decision quality / team cognition
Human-only planning incurs substantial overhead.
Same controlled experiment reporting that human-only planning produced higher time and cost overheads relative to AI-assisted approaches.
high negative Cognitive Offloading in Agile Teams: How Artificial Intellig... planning overhead (time/cost)
AI-only planning increases rework due to unstated assumptions.
Experiment measured rework rates and accompanying qualitative analysis attributing increased rework in the AI-only condition to unstated assumptions made by algorithmic planning.
AI-only planning significantly degrades risk capture rates.
Same controlled three-condition experiment on a live client deliverable; paper reports measures/qualitative indicators of risk capture rates and states degradation for AI-only condition.
Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class.
Empirical observation during the 25 scenario trials spanning seven failure families in the deployed multi-tenant evaluation; the paper reports two instances of wrong-entity mutations that were not blocked by consumer-contributed protections.
high negative Bounded Autonomy for Enterprise AI: Typed Action Contracts a... wrong-entity mutation errors (escaped protections)
The unconstrained AI configuration completed only 17 of 25 tasks.
Same evaluation described above: deployed multi-tenant enterprise application, 25 scenario trials comparing unconstrained AI (safety layers disabled) against bounded autonomy and manual operation.
Infrastructure constraints, particularly in developing countries, limit AI adoption in auditing.
Thematic analysis of reviewed articles noting infrastructure limitations (e.g., ICT infrastructure) in developing-country contexts.
high negative Implementing Artificial Intelligence in Auditing: A Systemat... infrastructure constraints affecting AI adoption
Limitations in auditor competencies (skills and training) hinder effective AI adoption in auditing.
Thematic findings across the sample of articles report auditor competency gaps as a challenge to AI implementation.
high negative Implementing Artificial Intelligence in Auditing: A Systemat... auditor competencies / skill gaps
Ethical and data privacy concerns are persistent challenges to AI implementation in auditing.
Recurring theme in the reviewed literature identified via thematic analysis; papers cite ethics and privacy as obstacles.
high negative Implementing Artificial Intelligence in Auditing: A Systemat... ethical and data privacy concerns as barriers
Several challenges persist for AI adoption in auditing, including high technology investment costs.
Thematic analysis of barriers reported across the 15 articles highlighting cost as a recurrent challenge.
high negative Implementing Artificial Intelligence in Auditing: A Systemat... barrier: technology investment costs to AI adoption
Conventional methods that use AI predictions as direct proxies for true labels can be inefficient or unreliable when the relationship between AI outputs and human labels is weak or misspecified.
The paper's motivation and critique of standard proxy-using approaches; asserted in the abstract as background rationale for the proposed method.
high negative Generative Augmented Inference efficiency/reliability of estimators using AI outputs as direct proxies
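The failure mode critiqued here can be made concrete with a toy simulation (all numbers, the additive bias, and the variable names are hypothetical, not from the paper): when the relationship between AI outputs and human labels is misspecified, an estimator that plugs AI predictions in as if they were true labels inherits the misspecification directly.

```python
import random

random.seed(0)

# Hypothetical setup: true human labels, and AI predictions that are
# correlated with them but systematically shifted upward (misspecified proxy).
truth = [random.gauss(0.0, 1.0) for _ in range(10_000)]
ai_pred = [t + 0.5 + random.gauss(0.0, 0.2) for t in truth]

true_mean = sum(truth) / len(truth)
naive_proxy_mean = sum(ai_pred) / len(ai_pred)  # treats AI outputs as labels

# The naive estimator absorbs the AI's shift wholesale.
print(round(naive_proxy_mean - true_mean, 2))  # close to the injected +0.5 bias
```

Methods in this line of work instead model the AI-output-to-label relationship explicitly, so that a weak or biased proxy degrades efficiency rather than validity.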
Human review remains necessary for maintainability and correct domain interpretation of generated scripts.
Qualitative finding from the mixed-method case study indicating limitations and the need for human oversight.
high negative Human-AI Collaboration for Scaling Agile Regression Testing:... maintainability and domain-correctness of test scripts
Validated test specifications accumulate faster than they are automated in many teams, limiting regression coverage and increasing manual work.
Observational claim stated in the paper as a motivating problem; likely based on industry experience and the Hacon case study context.
high negative Human-AI Collaboration for Scaling Agile Regression Testing:... regression coverage and manual testing workload
Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise.
Statement summarizing limitations of prior work (literature review/background in the paper).
high negative AIBuildAI: An AI Agent for Automatically Building AI Models scope and limitations of existing AutoML approaches
Developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation.
Background statement in paper's introduction; general literature context rather than a specific empirical test within this paper.
high negative AIBuildAI: An AI Agent for Automatically Building AI Models human labor intensity / need for expert practitioners in AI model development
Most AI tooling targets that fraction [the ~10% of the workday spent writing code].
Assertion made in the paper (abstract) as an observed mismatch between where AI tooling focuses and overall developer work activities.
high negative To Copilot and Beyond: 22 AI Systems Developers Want Built focus of AI tooling relative to developer time allocation
Failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
Qualitative/quantitative failure analysis reported in abstract identifying obstacle categories (example given: cross-artifact consistency breakdowns).
high negative BankerToolBench: Evaluating AI Agents in End-to-End Investme... types of failure modes encountered (e.g., cross-artifact consistency issues)
Bankers rate 0% of GPT-5.4's outputs as client-ready.
Human ratings by bankers reported in abstract indicating none of the evaluated outputs from GPT-5.4 were judged client-ready.
high negative BankerToolBench: Evaluating AI Agents in End-to-End Investme... proportion of model outputs rated as client-ready by bankers
Even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria.
Evaluation results reported in abstract: model-level rubric pass/fail aggregated to show best model failure rate approaching ~50% of criteria.
high negative BankerToolBench: Evaluating AI Agents in End-to-End Investme... rubric criteria pass/fail rate for GPT-5.4
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows.
Author assertion in paper abstract arguing current benchmarks are insufficient; presented as motivation for developing BTB rather than empirically tested within the abstract.
high negative BankerToolBench: Evaluating AI Agents in End-to-End Investme... fidelity of AI benchmarks to professional workflows
Models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation.
Analysis in the paper comparing model self-reported confidence / predictability judgments to actual accuracy across the 405 tasks; reports ≈20% accuracy irrespective of confidence/predictability judgments.
high negative SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... calibration_of_confidence_vs_accuracy
Human expert performance on the benchmark is approximately 20%.
Reported comparison between human experts and models on SciPredict tasks; the paper states human performance is ≈20% (evaluated on the benchmark tasks).
Model accuracies on SciPredict are 14-26%.
Empirical evaluation of multiple LLMs on the SciPredict benchmark (405 tasks); the paper reports aggregate model accuracy range 14–26%.
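The finding that accuracy sits near 20% "regardless of confidence" corresponds to a flat accuracy-by-confidence profile. A minimal sketch of that check (the records and bin width are invented; SciPredict's actual scoring format is not shown in this excerpt):

```python
from collections import defaultdict

def accuracy_by_confidence(records, bin_width=0.25):
    """Per-bin accuracy for (confidence, correct) pairs; a flat profile means
    self-reported confidence carries no information about correctness."""
    bins = defaultdict(list)
    for conf, correct in records:
        bins[int(conf / bin_width)].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

# Invented records mimicking the reported pattern: ~20% accuracy in both the
# low-confidence and high-confidence bins.
records = [(0.1, 1), (0.1, 0), (0.1, 0), (0.1, 0), (0.1, 0),
           (0.9, 1), (0.9, 0), (0.9, 0), (0.9, 0), (0.9, 0)]
print(accuracy_by_confidence(records))  # {0: 0.2, 3: 0.2}
```

A well-calibrated forecaster would instead show accuracy rising with confidence across bins.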
Regulatory and labor friction is scored per sector using actual compliance frameworks (Basel III, FDA AI guidance, HIPAA) and BLS union density data, and is applied as a haircut to base adoption rates via an S-curve ramp.
Paper description of friction scoring method referencing specific regulatory frameworks and BLS union density; applied in the model as a haircut and S-curve adoption ramp.
high negative AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... adjustment (haircut) to sectoral adoption rates due to regulatory and labor fric...
Restricting AI productivity gains to the labor-generated portion of each sector's gross value added reduces the naive addressable base by approximately 72 percent.
Bottom-up sectoral model described in the paper that applies labor share to gross value added across 21 NAICS industries; the paper explicitly states the labor-generated restriction reduces the naive addressable base by ~72%.
high negative AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... reduction in naive AI-addressable economic base when restricting gains to labor-...
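Both adjustments described for this model are straightforward arithmetic on sectoral gross value added; a sketch under invented inputs (the sector figures, the logistic S-curve parameters, and the function names are illustrative assumptions, not the paper's actual specification):

```python
import math

def addressable_base(gva, labor_share):
    """Restrict AI gains to the labor-generated portion of gross value added."""
    return gva * labor_share

def adoption_with_friction(base_rate, friction_score, t, midpoint=5.0, steepness=1.0):
    """Apply a regulatory/labor friction haircut to the base adoption rate,
    ramped in over time t via a logistic S-curve (illustrative functional form)."""
    ramp = 1.0 / (1.0 + math.exp(-steepness * (t - midpoint)))
    return base_rate * (1.0 - friction_score) * ramp

# Invented example: a sector with $100B GVA and a 28% labor share leaves a
# $28B addressable base, a 72% reduction from the naive $100B figure.
print(round(addressable_base(100.0, 0.28), 2))  # 28.0
# A 0.6 base adoption rate under a 0.3 friction score, late in the ramp:
print(round(adoption_with_friction(0.6, 0.3, t=10), 3))  # 0.417
```

The ~72% figure reported above is the economy-wide analogue of this per-sector restriction, aggregated across the 21 NAICS industries.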
Environmental demands place an upper bound on the degree of heterogeneity required in a distributed production system.
Theoretical claim derived from the Distributed Production System framework and discussed in the paper; supported by conceptual argument and model constraints rather than empirical data; no sample size reported.
high negative The Principle of Maximum Heterogeneity Optimises Productivit... required degree of heterogeneity (upper bound) given environmental demands