Evidence (6507 claims)

- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity
Existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions, and do not capture forecasting over continuous quantities.
Literature/benchmark critique asserted in the paper (argument that current benchmarks focus on simple judgmental formats and miss continuous numerical forecasting capabilities).
Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Empirical observations from QuantSightBench evaluation showing model calibration performance as a function of magnitude (paper statement noting sharp degradation and overconfidence at extremes).
The top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all fall at least 10 percentage points short of the 90% coverage target.
Reported empirical coverage percentages from evaluation on QuantSightBench for the listed models (paper provides these percentage values).
None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target.
Empirical evaluation on the newly introduced QuantSightBench benchmark across 11 frontier and open-weight models; models were assessed on empirical coverage of prediction intervals versus a 90% target (paper statement).
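To make the coverage metric concrete, here is a minimal sketch of how empirical coverage against a 90% target can be computed. Everything below (the data, interval widths, and function name) is illustrative and assumed, not taken from QuantSightBench:

```python
import numpy as np

def empirical_coverage(lower, upper, truth):
    """Fraction of realized values that fall inside the model's intervals."""
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    return float(np.mean((truth >= lower) & (truth <= upper)))

# Illustrative data only: realized quantities plus a model's 90% intervals.
rng = np.random.default_rng(0)
truth = rng.normal(0.0, 1.0, size=1_000)
# An overconfident model emits intervals narrower than the true 90% band
# (roughly ±1.645 standard deviations for a standard normal).
lower = np.full_like(truth, -1.2)
upper = np.full_like(truth, 1.2)

print(f"coverage: {empirical_coverage(lower, upper, truth):.1%} vs. 90% target")
```

Coverage landing below the nominal 90% is exactly the overconfidence pattern the claims above describe.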
The study identified significant implementation challenges including algorithmic bias, digital divide concerns, data privacy risks, and low technology readiness among HR teams in Tier 2 cities.
Synthesis of qualitative case study findings from 4 organizations plus survey responses (N=150) reporting barriers and risks encountered during adoption.
Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents and compound over long interactions.
Qualitative and quantitative analysis of errors observed across the DELEGATE-52 experiments (19 LLMs) showing sparse, high-severity, and silently introduced errors that accumulate over long workflows.
Degradation severity worsens with larger documents, longer interactions, and the presence of distractor files.
Additional experiments and analyses varying document size, interaction length, and presence of distractor files reported in the paper showing increased degradation under these conditions.
Agentic tool use does not improve performance on DELEGATE-52.
Additional experiments reported in the paper that compare plain LLM delegation vs. agentic tool-using configurations on DELEGATE-52 and find no performance improvement from agentic tool use.
Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.
Reported results from the experiment evaluating 19 LLMs on DELEGATE-52; these named models are highlighted and an average corruption fraction (25%) is reported at the end of long workflows.
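The paper's DELEGATE-52 scoring procedure is not spelled out in this excerpt; as a hedged illustration only, one simple way to quantify a corruption fraction is a line-level diff between the reference document and the model-maintained copy (the helper name and sample strings below are invented):

```python
import difflib

def corruption_fraction(reference: str, delegated: str) -> float:
    """Share of reference lines that no longer survive unchanged.

    An illustrative metric, not the paper's: it counts reference lines
    absent from the matching blocks of the delegated document.
    """
    ref_lines = reference.splitlines()
    matcher = difflib.SequenceMatcher(None, ref_lines, delegated.splitlines())
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / max(len(ref_lines), 1)

before = "alpha\nbeta\ngamma\ndelta"
after = "alpha\nbeta\nGAMMA\ndelta"   # one silently corrupted line
print(f"{corruption_fraction(before, after):.0%} corrupted")  # prints 25%
```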
Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation.
Large-scale experiment reported in the paper evaluating 19 LLMs on DELEGATE-52 long delegated workflows; observed document degradation across models.
Underreliance on AI might deprive software developers of potential gains in productivity and quality.
Stated in the paper and motivated by themes from twenty-two developer interviews indicating missed benefits when developers underuse LLM tools.
Overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills).
Paper explicitly states this risk and grounds the discussion in findings from twenty-two developer interviews (qualitative evidence and participant-reported concerns).
Small and medium-sized practices face challenges of skill gaps and resource constraints that hinder adoption of technology and data analytics.
Consistent findings across included studies highlighting barriers in small and medium-sized practices (SMPs).
AI adoption is reinforcing existing structural disparities within the BRICS bloc, creating a two‑tier productivity hierarchy (China & India vs. Brazil, Russia & South Africa).
Observed divergence in TFP trajectories and differing links between AI indicators and TC/EC across the five BRICS economies; comparative analysis shows stronger frontier-shifting effects in China and India and weaker or negative effects in the other three economies.
Brazil, Russia, and South Africa experience stagnation or decline in both efficiency and technological advancement over 2005–2023.
Malmquist TFP decomposition (EC and TC) for each BRICS economy showing flat or negative trends in EC and TC for Brazil, Russia, and South Africa during 2005–2023.
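For reference, the standard (Färe et al.) Malmquist decomposition this evidence draws on splits TFP change into efficiency change (EC, catch-up to the frontier) and technical change (TC, movement of the frontier itself); the notation below is the textbook form, not copied from the paper:

$$
M^{t,t+1} \;=\; \underbrace{\frac{D^{t+1}\!\left(x^{t+1}, y^{t+1}\right)}{D^{t}\!\left(x^{t}, y^{t}\right)}}_{\text{EC}} \;\times\; \underbrace{\left[\frac{D^{t}\!\left(x^{t+1}, y^{t+1}\right)}{D^{t+1}\!\left(x^{t+1}, y^{t+1}\right)} \cdot \frac{D^{t}\!\left(x^{t}, y^{t}\right)}{D^{t+1}\!\left(x^{t}, y^{t}\right)}\right]^{1/2}}_{\text{TC}}
$$

where $D^{t}$ is the distance function measured against the period-$t$ frontier. Flat or negative EC and TC together are what the stagnation claim above refers to: no catch-up to the frontier and no outward movement of the frontier itself.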
Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors.
Negative capability claim based on the authors' survey of prior work (asserted limitation); no systematic benchmark or exhaustive evaluation numbers provided in the excerpt.
Effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide.
Analytic claim supported by the paper's empirical study of clarification in real software engineering tasks (methods mentioned: quantifying types of information affecting task success and simulated-user question-answering; no sample size given in the abstract).
Large language models remain confined to linguistic simulation rather than grounded understanding.
Conceptual assertion in the paper arguing limits of current models; no empirical tests or measurements reported.
Fluency is not reliability: without structures that stabilise both human and model reasoning, AI cannot be trusted or governed where it matters most.
Central thesis/claim of the paper; normative argument synthesising the paper's observations and proposals rather than an empirically tested finding provided here.
Humans often mistake fluency for reliability: when a model responds smoothly, users tend to trust it, even when both model and user are drifting together.
Behavioral/psychological assertion in the paper referencing human interaction patterns with fluent outputs; no experimental data or sample size reported in this paper excerpt.
LLMs produce fluent outputs even when their internal reasoning has drifted; a confident answer can conceal uncertainty, speculation, or inconsistency, and small changes in phrasing can lead to different conclusions.
Conceptual/observational claim presented in the paper; no original empirical test or sample size reported here.
The opacity, fluency, and low-friction interaction patterns of LLMs obscure the boundary between human and machine contribution, leading users to infer competence from outputs rather than from the processes that generate them.
Theoretical argument grounded in prior literature on automation bias and cognitive offloading; presented as explanatory mechanism in the paper rather than an empirically tested causal estimate.
The paper introduces the 'LLM fallacy,' a cognitive attribution error in which individuals misinterpret LLM-assisted outputs as evidence of their own independent competence, producing a systematic divergence between perceived and actual capability.
Conceptual/theoretical claim and formal definition offered in the paper; no empirical validation reported in the abstract.
Most Sub-Saharan African states still lack the institutional frameworks needed to turn these innovations into sustainable development.
Comparative policy analysis stated in the paper; no quantitative sample size or formal survey data reported in the excerpt.
Efficiency (e.g., minimizing time and cost with AI-only planning) does not equal effectiveness: optimizing for efficiency can erode team cognition and reduce decision quality.
Synthesis of experimental quantitative results (time/cost vs. risk capture and rework) and qualitative assessment indicating that AI-driven efficiency can come at the expense of risk awareness and planning robustness.
Human-only planning incurs substantial overhead.
Same controlled experiment reporting that human-only planning produced higher time and cost overheads relative to AI-assisted approaches.
AI-only planning increases rework due to unstated assumptions.
Experiment measured rework rates and accompanying qualitative analysis attributing increased rework in the AI-only condition to unstated assumptions made by algorithmic planning.
AI-only planning significantly degrades risk capture rates.
Same controlled three-condition experiment on a live client deliverable; paper reports measures/qualitative indicators of risk capture rates and states degradation for AI-only condition.
Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class.
Empirical observation during the 25 scenario trials spanning seven failure families in the deployed multi-tenant evaluation; the paper reports two instances of wrong-entity mutations that were not blocked by consumer-contributed protections.
The unconstrained AI configuration completed only 17 of 25 tasks.
Same evaluation described above: deployed multi-tenant enterprise application, 25 scenario trials comparing unconstrained AI (safety layers disabled) against bounded autonomy and manual operation.
Infrastructure constraints, particularly in developing countries, limit AI adoption in auditing.
Thematic analysis of reviewed articles noting infrastructure limitations (e.g., ICT infrastructure) in developing-country contexts.
Limitations in auditor competencies (skills and training) hinder effective AI adoption in auditing.
Thematic findings across the sample of articles report auditor competency gaps as a challenge to AI implementation.
Ethical and data privacy concerns are persistent challenges to AI implementation in auditing.
Recurring theme in the reviewed literature identified via thematic analysis; papers cite ethics and privacy as obstacles.
Several challenges persist for AI adoption in auditing, including high technology investment costs.
Thematic analysis of barriers reported across the 15 articles highlighting cost as a recurrent challenge.
Conventional methods that use AI predictions as direct proxies for true labels can be inefficient or unreliable when the relationship between AI outputs and human labels is weak or misspecified.
The paper's motivation and critique of standard proxy-using approaches; asserted in the abstract as background rationale for the proposed method.
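As a toy illustration of this failure mode (all distributions below are invented; this is not the paper's setup or its proposed estimator), treating weakly correlated AI scores as if they were human labels biases a simple prevalence estimate, while the small human-labeled subset alone stays unbiased but noisy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
human = rng.random(n) < 0.30              # true labels, 30% prevalence
# AI output only weakly tracks the human label: it flips 35% of the time.
ai = np.where(rng.random(n) < 0.65, human, ~human)

labeled = rng.choice(n, size=500, replace=False)  # small human-labeled subset

print("truth:         0.30")
print(f"naive proxy:   {ai.mean():.2f}")              # biased (~0.44 here)
print(f"human subset:  {human[labeled].mean():.2f}")  # unbiased but noisier
```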
Human review remains necessary for maintainability and correct domain interpretation of generated scripts.
Qualitative finding from the mixed-method case study indicating limitations and the need for human oversight.
Validated test specifications accumulate faster than they are automated in many teams, limiting regression coverage and increasing manual work.
Observational claim stated in the paper as a motivating problem; likely based on industry experience and the Hacon case study context.
Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise.
Statement summarizing limitations of prior work (literature review/background in the paper).
Developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation.
Background statement in paper's introduction; general literature context rather than a specific empirical test within this paper.
Most AI tooling targets that fraction [the ~10% of the workday spent writing code].
Assertion made in the paper (abstract) as an observed mismatch between where AI tooling focuses and overall developer work activities.
Failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
Qualitative/quantitative failure analysis reported in abstract identifying obstacle categories (example given: cross-artifact consistency breakdowns).
Bankers rate 0% of GPT-5.4's outputs as client-ready.
Human ratings by bankers reported in abstract indicating none of the evaluated outputs from GPT-5.4 were judged client-ready.
Even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria.
Evaluation results reported in abstract: model-level rubric pass/fail aggregated to show best model failure rate approaching ~50% of criteria.
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows.
Author assertion in paper abstract arguing current benchmarks are insufficient; presented as motivation for developing BTB rather than empirically tested within the abstract.
Models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation.
Analysis in the paper comparing model self-reported confidence / predictability judgments to actual accuracy across the 405 tasks; reports ≈20% accuracy irrespective of confidence/predictability judgments.
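A minimal sketch of the kind of check this describes: bin predictions by self-reported confidence and compare per-bin accuracy. The bin edges, function name, and simulated data are assumptions for illustration, not the paper's protocol:

```python
import numpy as np

def accuracy_by_confidence(confidence, correct, bins=(0.0, 0.5, 0.8, 1.01)):
    """Accuracy within self-reported confidence bands.

    Flat accuracy across bands means confidence carries no signal about
    which predictions are reliable -- the failure mode described above.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    result = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence >= lo) & (confidence < hi)
        if mask.any():
            result[f"[{lo:.1f}, {hi:.1f})"] = float(correct[mask].mean())
    return result

# Illustrative: 405 tasks where confidence is uncorrelated with correctness.
rng = np.random.default_rng(1)
conf = rng.uniform(0.0, 1.0, 405)
corr = rng.random(405) < 0.20
print(accuracy_by_confidence(conf, corr))   # roughly 0.2 in every band
```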
Human expert performance on the benchmark is approximately 20%.
Reported comparison between human experts and models on SciPredict tasks; the paper states human performance is ≈20% (evaluated on the benchmark tasks).
Model accuracies on SciPredict are 14–26%.
Empirical evaluation of multiple LLMs on the SciPredict benchmark (405 tasks); the paper reports aggregate model accuracy range 14–26%.
Regulatory and labor friction is scored per sector using actual compliance frameworks (Basel III, FDA AI guidance, HIPAA) and BLS union density data, and is applied as a haircut to base adoption rates via an S-curve ramp.
Paper description of friction scoring method referencing specific regulatory frameworks and BLS union density; applied in the model as a haircut and S-curve adoption ramp.
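A minimal sketch of the haircut-plus-ramp structure described here. Only that structure comes from the paper; the logistic functional form and every constant below (t_mid, steepness, the example rates) are assumptions:

```python
import math

def effective_adoption(t, base_rate, friction, t_mid=5.0, steepness=0.9):
    """Base adoption rate ramped along a logistic S-curve over time t,
    then reduced by a per-sector friction haircut in [0, 1]."""
    ramp = 1.0 / (1.0 + math.exp(-steepness * (t - t_mid)))
    return base_rate * ramp * (1.0 - friction)

# e.g. a sector with a 60% adoption ceiling and heavy regulatory/labor
# friction (haircut of 0.4):
for year in (1, 5, 10):
    print(year, round(effective_adoption(year, base_rate=0.60, friction=0.4), 3))
```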
Restricting AI productivity gains to the labor-generated portion of each sector's gross value added reduces the naive addressable base by approximately 72 percent.
Bottom-up sectoral model described in the paper that applies labor share to gross value added across 21 NAICS industries; the paper explicitly states the labor-generated restriction reduces the naive addressable base by ~72%.
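A worked toy version of that restriction (sector names, GVA figures, and labor shares are invented, and deliberately contrived to land near the paper's ~72% figure; the paper itself uses 21 NAICS industries):

```python
sectors = {
    #              gross value added ($bn), labor share of GVA
    "manufacturing": (2_000, 0.50),
    "finance":       (1_500, 0.35),
    "real_estate":   (2_500, 0.05),   # GVA dominated by capital income
}

naive_base = sum(gva for gva, _ in sectors.values())
labor_base = sum(gva * share for gva, share in sectors.values())

reduction = 1.0 - labor_base / naive_base
print(f"addressable base shrinks by {reduction:.0%}")   # ~72% here
```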
Environmental demands place an upper bound on the degree of heterogeneity required in a distributed production system.
Theoretical claim derived from the Distributed Production System framework and discussed in the paper; supported by conceptual argument and model constraints rather than empirical data; no sample size reported.