Evidence (5192 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5921 claims
- Human-AI Collaboration: 5192 claims
- Org Design: 3497 claims
- Innovation: 3492 claims
- Labor Markets: 3231 claims
- Skills & Training: 2608 claims
- Inequality: 1842 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human–AI Collaboration
Robust resilience stems from 'bounded autonomy': constraining what an AI may decide and when humans must intervene.
Normative proposal grounded in synthesis of safety standards, crisis-management practices, and conceptual arguments; specification of autonomy dimensions (authority scope, temporal limits, performance envelopes, fail-safes).
Human–AI chat logs contain more explicit strategy commitments (stated rules) than human–human chats.
Content analysis / coding of natural-language chat logs from the human–AI experiment (human–AI n = 126) and the human–human benchmark (n = 108); coding counts show higher frequency of explicit commitments/statements of rules in human–AI messages.
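The coding comparison above amounts to comparing the rate of explicit-commitment messages across the two chat corpora. A minimal sketch of such a two-proportion test, using illustrative counts (the actual coded frequencies are not given here):

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test for a difference in proportions (pooled SE)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the normal CDF
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

# Illustrative counts (hypothetical, not from the paper): subjects whose
# chats were coded as containing an explicit strategy commitment.
z, p = two_proportion_ztest(k1=62, n1=126, k2=31, n2=108)
print(f"z = {z:.2f}, p = {p:.4f}")
```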
Human–human subjects converge to Tit‑for‑Tat under one condition and to unconditional cooperation under the repeated-communication condition.
Strategy-estimation and behavioral trajectory analysis from the human–human benchmark (Dvorak & Fehrler 2024; n = 108) reported in the paper, showing condition-dependent convergence to Tit‑for‑Tat and to unconditional cooperation under repeated communication.
Strategy estimation indicates human–AI subjects tend to favor Grim Trigger when allowed pre-play communication.
Strategy-estimation/classification applied to subjects' choices in the human–AI condition with pre-play chat (subset of the human–AI n = 126); inferred strategy prevalence shows elevated assignment to Grim Trigger-type rules.
Version 1.0 of iDaVIE marks its integration into operational workflows and establishes a base for future capabilities.
Authors report that v1.0 has been used in verification and mask-refinement loops for real datasets (MeerKAT, ASKAP, APERTIF); no detailed deployment metrics provided.
Immersive inspection tools like iDaVIE are complements to automated ML pipelines by helping generate higher-quality labels and curated training examples.
Paper argues conceptual complementarity and cites iDaVIE's use for mask refinement and curated subcube export; no experimental comparison of label quality or downstream ML performance provided.
iDaVIE accelerates inspection-driven parts of astronomy workflows (e.g., mask refinement, verification).
Reported use cases where iDaVIE was used to refine masks and verify sources in real datasets; no measured time-per-task or throughput statistics provided.
iDaVIE has already been integrated into real pipelines (MeerKAT, ASKAP, APERTIF) and used to improve quality control, refine detection masks, and identify new sources.
Author statement of integration and use cases citing verification of HI data cubes from MeerKAT, ASKAP and APERTIF; no quantitative deployment counts or independent validation provided in the text.
There is a need for policies supporting workforce transitions (retraining, portability of skills) and safety/regulation for embodied agents operating in public spaces.
Policy recommendation grounded in anticipated labor and safety risks; proposed but not empirically evaluated.
Benchmarks and tasks that mix observation and intervention (imitation with sparse feedback, active imitation, transfer under domain shift, continual learning streams) are required to evaluate the architecture.
Proposal for evaluation tasks and benchmarks; not empirically validated in the paper.
Embodied robotics experiments are necessary to evaluate real-world constraints such as sample efficiency, physical affordances, and motor learning.
Methodological recommendation recognizing simulation-to-real gaps; no experiments reported.
Simulated environments (procedural, nonstationary), multi-agent social domains, and open-world 3D simulators are appropriate for scalable iteration to test the proposed architecture.
Methodological recommendation and suggested experimental approaches; not tested in the paper.
Neuromodulatory systems and meta-decision circuits in animals provide analogies for implementing meta-control (M) in artificial systems.
Neuroscience analogy cited to motivate architectural choices; not empirically instantiated in the paper.
Developmental trajectories can scaffold gradual competence (from observation to exploratory action) and should be reflected in training curricula.
Argument from developmental biology and learning theory; proposed as a design principle rather than empirically tested here.
Evolution supplies inductive biases and slow structural priors that can be leveraged in artificial learners.
Biological analogy and theoretical suggestion; no empirical experiments presented to quantify effect in AI systems.
The taxonomy and measurement approach provide operational metrics to quantify empathic communication for economic analyses (productivity, customer satisfaction, retention).
Authors propose that their data-driven taxonomy and automated/coding measures can be used as metrics; the paper demonstrates derivation and use in trial outcomes but does not present direct economic outcome measurements.
LLM-generated responses frequently score as more empathic than human-written responses in blinded evaluations.
Blinded evaluations comparing LLM-generated replies to human-written replies using recipient/judge ratings of perceived empathy (as described in the paper's blinded tests). Exact blinded-test sample sizes are not specified in the summary; the comparison derives from the study's evaluation procedures.
LLMs are more likely to complement human tacit skills than to replace explicit rule‑following jobs; value accrues to workers and firms that integrate model outputs with human judgment and tacit expertise.
Labor‑economics style argument and theoretical reasoning; no empirical labor market analysis provided.
Commoditization via rule extraction is limited; firms that can harness and deploy tacit LLM capabilities will retain economic rents.
Theoretical economic argument based on non‑rule‑encodability; no empirical firm‑level data included.
The highest‑value attributes of LLMs may be inherently non‑decomposable into simple, auditable rules, which increases the value of proprietary, black‑box models and strengthens economies of scale and scope for large model providers.
Economic reasoning and theoretical implications drawn from the central thesis; no empirical market analyses provided.
Some LLM capabilities are tacit, practice‑derived, or 'insight'‑like, akin to the Chinese concept of Wu (sudden insight through practiced skill).
Philosophical framing and analogy to the concept of tacit knowledge (Wu); argumentative rather than empirical support.
The economically valuable capabilities of large language models are precisely those that cannot be fully encoded as a complete, human‑readable set of discrete rules.
Formal, conceptual argument (proof by contradiction) plus qualitative historical case analysis comparing expert systems and LLMs; no new empirical datasets or experiments reported.
Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies.
Release of SlideRL dataset (288 rollouts) and code repository; general statement about reproducibility benefits.
Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling.
Observed empirical gain of +33.1% for the tuned 7B model over its untuned base, and 91.2% relative performance versus Claude Opus 4.6; the cost-effectiveness implication (tuning few parameters rather than scaling model size) is drawn from these results.
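The paper's exact fine-tuning setup is not detailed here; one common way to make well under 1% of parameters trainable is a LoRA-style low-rank adapter. A minimal numpy sketch under assumed layer sizes (`d_in`, `d_out`, rank `r`, and `alpha` are hypothetical, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the paper's 7B model is not reproduced here.
d_in, d_out, r = 1024, 1024, 4  # r = adapter rank

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, initialized to zero
alpha = 8.0                            # scaling hyperparameter

def forward(x):
    """LoRA-style forward pass: frozen W plus a scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

trainable = A.size + B.size
total = W.size + trainable
print(f"trainable fraction: {trainable / total:.4%}")
```

Because `B` starts at zero, the adapted layer initially matches the frozen base exactly; training then moves only the small `A` and `B` factors.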
The inverse-specification reward—where an LLM attempts to recover the original brief from generated slides—provides a holistic fidelity signal.
Reward design: inverse-specification component implemented and used as part of composite reward; claimed to measure fidelity via recovery accuracy.
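A minimal sketch of how such an inverse-specification reward could be computed; the `recover_brief` stub stands in for the LLM call, and token overlap stands in for the paper's recovery-accuracy measure (both are assumptions for illustration):

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def recover_brief(slides: str) -> str:
    """Hypothetical stub for the LLM call that attempts to reconstruct
    the original brief from the generated slides."""
    return "quarterly sales review for the emea region"

def inverse_spec_reward(brief: str, slides: str) -> float:
    """Score slide fidelity by how well the original brief is recovered."""
    return jaccard(brief, recover_brief(slides))

reward = inverse_spec_reward(
    brief="quarterly sales review for the emea region",
    slides="Slide 1: EMEA Q3 sales ...",
)
print(f"inverse-specification reward: {reward:.2f}")
```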
Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.
Cross-model comparison across six models on the 48-task benchmark, with analyses showing instruction adherence and tool-use compliance better predict agent performance than parameter count.
The proposed algorithm's performance is robust to heterogeneous populations in the synthetic experiments (i.e., it continues to find core alternatives under varying degrees of population heterogeneity).
Empirical robustness checks reported in the experiments where population heterogeneity is varied and performance (core-attainment frequency) is evaluated.
The authors compare their sampling algorithm against classical social-choice rules and LLM-based heuristics and report superior core-attainment frequency for their method.
Experimental comparisons described in the paper between the proposed algorithm and baseline methods (classical social-choice rules, LLM-based heuristics) on the synthetic dataset; results summarized in the experiments section.
On a synthetic text-preference dataset, the proposed algorithm reliably finds alternatives that lie in the proportional veto core.
Empirical experiments reported in the paper using a synthetic dataset of text preferences; evaluation metric reported as frequency (proportion) of runs where the returned alternative is in the proportional veto core.
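The evaluation metric presupposes a core-membership check. Under one common formalization of the proportional veto core (a coalition of s of n voters can veto up to ceil(s·m/n) − 1 of the m alternatives; the paper's exact definition may differ), a brute-force check for small instances looks like:

```python
from itertools import combinations
from math import ceil

def in_proportional_veto_core(x, rankings):
    """Check core membership: coalition S (|S| = s) blocks x if at least
    m - v(S) alternatives are unanimously preferred to x by S, where
    v(S) = ceil(s * m / n) - 1. Brute force over all coalitions."""
    n = len(rankings)
    m = len(rankings[0])
    alts = set(rankings[0])
    for s in range(1, n + 1):
        v = ceil(s * m / n) - 1
        for S in combinations(range(n), s):
            unanimously_preferred = {
                y for y in alts - {x}
                if all(rankings[i].index(y) < rankings[i].index(x) for i in S)
            }
            if len(unanimously_preferred) >= m - v:
                return False
    return True

# Three voters with identical rankings (best to worst): only 'a' survives.
rankings = [["a", "b", "c"]] * 3
print({x: in_proportional_veto_core(x, rankings) for x in "abc"})
```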
Temporal grounding (restricting models to contemporaneous information) should be adopted as a methodological best practice in economic research using LLMs to avoid leakage and produce more realistic assessments of model forecasting ability.
Study methodology and rationale emphasize temporal grounding; authors recommend it as best practice based on the observed benefits in reducing retrospective contamination.
Because the conflict unfolded after the training cutoffs of contemporary frontier LLMs, the dataset and analyses provide an archival, hindsight-free benchmark for studying model reasoning.
Case selection rationale: the 2026 Middle East conflict was deliberately chosen because it occurred after the training cutoffs of the evaluated frontier models; dataset preserves contemporaneous queries and model outputs.
Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric).
Evaluation across 11 temporally defined nodes during the early 2026 Middle East conflict using 42 node-specific verifiable questions and 5 exploratory prompts; results assessed via verifiability checks and qualitative coding for strategic reasoning of outputs from contemporary frontier LLMs constrained to contemporaneous information.
Legible decision modes and recorded contest pathways improve verifiability and lower information asymmetries, aiding regulators and platforms in monitoring and reducing litigation/reputational risk.
Analytic claim in the implications section; argued conceptually and tied to proposed logging/audit tools; no empirical validation.
The pattern can reduce costly misallocations caused by LLM unpredictability by constraining policy options, improving overall allocation efficiency in expectation.
Theoretical argument in the paper tying constrained policy space to reduced variability and misallocation risk; no empirical testing or quantitative model provided.
The pattern is proposed to improve legibility, procedural legitimacy, and actionability relative to systems without these elements (framed as evaluation goals).
Evaluation agenda and proposed user-study metrics in the paper (legibility tests, perceived fairness surveys, contest effectiveness measures); no empirical results yet.
Bounded calibration with contestability avoids opaque silent defaults that mask value choices and avoids wide-open user-configurable value sliders that offload moral choice under stress.
Normative rationale and argumentation in the paper; compared qualitatively against two alternative design approaches; no empirical comparison.
Bounded calibration with contestability is a viable design pattern for LLM-enabled robots that must allocate scarce, real-time assistance among multiple people.
Conceptual/design proposal in the paper; illustrated with a concrete public-concourse robot vignette; no empirical deployment or sample data reported.
Modular strategy/execution architectures (like ESE) can materially improve the stability and efficiency of LLM-driven operational decision systems, increasing their attractiveness for deployment in retail, logistics, and supply-chain contexts.
Empirical improvements observed with ESE on RetailBench relative to monolithic baselines, coupled with analysis of deployment considerations and domain relevance discussed in the paper.
ESE improves operational stability and efficiency relative to baselines that do not separate strategy from execution.
Empirical comparisons reported in the experiments: eight contemporary LLMs evaluated on multiple RetailBench environments, with ESE compared against monolithic LLM agents and other baselines using metrics of operational stability (e.g., variance or frequency of catastrophic failures) and efficiency (e.g., cost/profit/fulfillment).
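The stability and efficiency metrics named above can be sketched as simple summaries over per-episode outcomes; the profit figures and failure threshold below are hypothetical illustrations, not RetailBench values:

```python
import statistics

def stability_metrics(episode_profits, failure_threshold=0.0):
    """Summarize operational stability for an agent's episodes: mean
    profit, profit dispersion, and rate of catastrophic episodes
    (profit below a threshold; the threshold is an assumption)."""
    return {
        "mean_profit": statistics.mean(episode_profits),
        "profit_stdev": statistics.stdev(episode_profits),
        "catastrophic_rate": sum(p < failure_threshold for p in episode_profits)
                             / len(episode_profits),
    }

# Hypothetical profits: a strategy/execution (ESE-style) agent vs a
# monolithic baseline prone to occasional catastrophic episodes.
ese = [12.0, 11.5, 12.3, 11.8, 12.1]
mono = [14.0, -3.0, 13.5, 2.0, -5.0]
print(stability_metrics(ese))
print(stability_metrics(mono))
```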
ESE enables interpretable and adaptive strategy updates intended to counteract error accumulation and environmental drift.
Design features of the strategy module (slower updates, interpretable strategy representation) and qualitative analysis in the paper linking these features to reduced error accumulation and strategy drift in experiments.
Policy implication: prioritize large-scale, targeted reskilling and lifelong learning programs to enable workforce adaptability and capture AI complementarity gains.
Policy recommendations derived from the paper's findings (association between AI adoption and skill shifts, heterogeneous sectoral impacts) and from a literature synthesis linking reskilling interventions to better labor outcomes; the recommendation is prescriptive rather than empirically tested within the study.
The paper provides empirical support for the complementarity hypothesis: AI tends to reconfigure jobs and create hybrid roles rather than eliminate employment wholesale.
Convergence of simulated sectoral employment patterns (some sectors showing net gains and hybrid-role growth), the strong correlation between AI adoption and skill shifts (r = 0.71), and corroborating studies from the literature synthesis emphasizing augmentation and hybridization mechanisms.
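The reported association is a standard Pearson correlation between sector-level adoption and skill-shift measures. A minimal sketch on hypothetical values (not the paper's data, which yielded r = 0.71):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sector-year values: AI adoption index vs skill-shift index.
adoption    = [0.10, 0.25, 0.30, 0.45, 0.60, 0.70]
skill_shift = [0.05, 0.20, 0.35, 0.30, 0.55, 0.65]
print(f"r = {pearson_r(adoption, skill_shift):.2f}")
```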
Institutional reskilling programs and governance frameworks markedly moderate labor-market outcomes: better frameworks correlate with more complementarities and lower net job loss.
Integration of literature-derived mechanisms with simulated empirical patterns; the paper reports correlation and moderation-style comparisons across simulated sector-year cases incorporating policy and institutional variables (described in the methods), supported by studies in the systematic review linking policy interventions to labor outcomes.
Healthcare and IT Services experienced net employment gains consistent with AI complementarity (augmented tasks and creation of new hybrid roles).
Simulated sectoral employment trends and net-change metrics for Healthcare and IT Services (2020–2024) presented in the paper, supported by literature synthesis examples showing human–AI complementarities in these sectors.
The largest rises in hybrid jobs occurred in IT Services and Healthcare.
Sectoral decomposition of hybrid job share trends in the simulated dataset across the seven industries (2020–2024) and supporting qualitative/quantitative findings from the literature synthesis focused on IT Services and Healthcare.
Hybrid human–AI jobs increased substantially across all seven analyzed sectors between 2020 and 2024.
Descriptive trend analysis of the simulated dataset's hybrid job share metric (fraction of roles reclassified as human–AI hybrid) for the seven industries over 2020–2024, combined with corroborating examples from the literature synthesis (selected ACM/IEEE/Springer studies 2020–2024).
Firms should pair strong-performing ensemble/deep models with explainability tools (e.g., feature-importance, SHAP) and fairness audits, and prefer pilot human-in-the-loop implementations to validate economic impacts and reduce operational risks.
Authors' practical recommendations based on empirical model performance, interpretability analyses, and noted limitations; presented as guidance rather than empirically validated interventions.
Variable-contribution analyses (feature importance / model explanation techniques) clarified which inputs drive predictions, making results actionable for HR decision-making.
The paper reports use of feature-importance and model-explanation methods to quantify variable contributions and interpretable outputs intended for HR practitioners.
Employee engagement/participation levels, learning agility (pace of acquiring new skills), tenure in current role, and perceived workload/manageability are consistently among the most important predictors of job performance in the datasets examined.
Feature-importance and model-explanation analyses (e.g., feature importance, SHAP-style approaches) applied across multiple publicly available workforce datasets produced consistently high importance scores for these variables.
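One SHAP-style way to obtain such importance scores is permutation importance: shuffle a feature's column and measure how much the model's error rises. A sketch on synthetic data (feature names, coefficients, and the linear model are hypothetical, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic workforce-style data (hypothetical, not the paper's datasets):
# columns: engagement, learning agility, tenure, perceived workload.
n = 500
X = rng.normal(size=(n, 4))
# Make engagement and learning agility the dominant drivers of performance.
y = (1.5 * X[:, 0] + 1.0 * X[:, 1] + 0.2 * X[:, 2] + 0.1 * X[:, 3]
     + rng.normal(scale=0.3, size=n))

# Fit a simple least-squares model as the stand-in predictor.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(Xm):
    return float(np.mean((Xm @ coef - y) ** 2))

baseline = mse(X)

def permutation_importance(j):
    """Importance of feature j: MSE increase when its column is shuffled."""
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return mse(Xp) - baseline

names = ["engagement", "learning_agility", "tenure", "workload"]
for j, name in enumerate(names):
    print(f"{name:16s} {permutation_importance(j):.3f}")
```

Features carrying more predictive weight show a larger error increase when permuted, which is the intuition behind the consistently high scores reported for engagement and learning agility.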
The models' superior performance hinges on their ability to capture complex, non-linear patterns in features (e.g., engagement, learning agility, tenure, workload perception).
Inference from comparative model performance: non-linear models (ensembles, DNNs) outperform linear baselines; feature engineering captured engagement dynamics and learning trends; variable-contribution analyses highlighted these feature types as influential.