Evidence (13870 claims)
Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
AI-driven productivity and data externalities can reconfigure which countries/regions specialize in which activities, with implications for labor demand, offshoring, and services trade patterns.
Mechanism and theory-based analysis drawing on literature about comparative advantage, automation, and data externalities; empirical testing recommended but not performed in the paper.
Standard international trade models should be updated to incorporate data as an input, platform-mediated matching, algorithmic complementarities, and costs of regulatory fragmentation.
Theoretical critique and modeling recommendations based on mechanism analysis; no new formal model calibration or empirical testing presented in the paper.
AI-enabled markets tend toward winner-take-most platforms amplified by network effects.
Theoretical reasoning supported by platform literature and case illustrations of platform concentration dynamics; empirical magnitudes not estimated in the paper.
Competitive advantage is shifting away from asset- and labor-intensive models toward data-, model-, and platform-driven advantages, altering comparative advantage and market structure.
Mechanism/theoretical analysis drawing on platform and AI economics literature and qualitative examples; no empirical estimation provided in the paper.
Regulatory design acts as an economic instrument that can balance social value from AI with protection of rights, affecting social welfare, public trust, and long-term adoption rates.
Normative synthesis combining legal and economic reasoning; suggested as a theoretical mechanism rather than empirically validated within the paper.
Automation of routine administrative tasks may reduce demand for certain clerical roles while increasing demand for oversight, auditing, and legal-technical expertise, altering public-sector labor composition and retraining needs.
Qualitative labor-market reasoning based on task-based automation literature and the administrative context; no field labor data or sample is provided.
Current LLMs produce deep, reliable reasoning mainly in domains with rigorous, pre-existing abstractions (mathematics, programming) and underperform in domains that lack such formal abstractions.
Performance comparisons and observed patterns referenced qualitatively (e.g., better behavior on math and code tasks) drawn from existing literature and practitioner reports; the paper does not present new controlled benchmark experiments.
AI feedback may either augment teacher productivity (complementarity) or substitute for routine teacher feedback tasks (substitution), with unclear net labor impacts.
Workshop deliberations among 50 scholars highlighting competing theoretical scenarios; no causal labor-market evidence provided.
Easier conversational access to models can substitute for routine cognitive labor while complementing high-skill work; miscalibrated trust affects labor outcomes and supervision costs.
Labor and task-allocation implications argued conceptually; no labor-market empirical evidence or quantified substitution/complementarity rates presented.
Firms can compete on front-end design (transparency, trustworthiness) as a socially beneficial quality signal, but absent regulation competition may favor more persuasive (less honest) interfaces.
Economic argument about product differentiation and competitive incentives, drawn from market theory and literature; no empirical market study provided.
Misleading cues can create short-term surplus (user satisfaction) but long-term welfare losses if overtrust causes harms or misinformation.
Theoretical economic argument based on information asymmetry and externalities; no empirical quantification in the paper.
LLM-based chatbots’ conversational naturalness increases usability and adoption but also triggers misleading mental models (e.g., anthropomorphism, overtrust).
Paper-level main finding based on conceptual analysis and literature synthesis from HCI, ethics, and conversational analysis; no new large-scale empirical study or sample reported.
The approach shifts some resource demand from GPU clusters to CPU, memory, and storage I/O, meaning local SSD and CPU provisioning can become the new bottleneck.
Authors note the system relies on multi-tier I/O and CPU-side updates to enable single-GPU fine-tuning; the summary highlights this resource-shift as a risk/consideration. No quantitative cost or workload-specific tradeoff analysis is provided in the summary.
Human experts will likely shift roles from sole decision-makers to adjudicators, challengers, and validators of AI-generated arguments, changing required skills toward critical evaluation and dialectical oversight.
Conceptual labor-market projection; no empirical labor studies or surveys presented.
Productivity gains from partial automation may be offset by negative externalities (incorrect legal outcomes, appeals, reputational damage) that impose social and private costs not captured by narrow productivity measures.
Theoretical economic analysis and illustrative case vignettes describing error propagation; no empirical quantification of externalities.
Market demand will likely split between providers offering generative convenience with liability exposure and providers offering certified/verified, explainable tools at a premium, creating a two-tier market.
Market-structure analysis and illustrative projections; no empirical market data or sample size.
Reported monetary supervision cost was low (~$200) for this project, but the paper cautions that general equilibrium effects and scaling may change costs as demand for supervisors rises.
Paper provides reported supervision cost (≈$200) for the single project and includes a caveat about external validity and scaling; cost is self-reported and contextualized by authors.
Because these agents will be embedded in safety‑critical infrastructure, economic and technical outcomes will depend heavily on system architecture choices.
Systems‑engineering and policy reasoning drawing on analogies to Internet/IoT evolution and domain examples (disaster response, healthcare, industrial automation, mobility); conceptual argumentation rather than empirical measurement.
Policymakers must weigh productivity gains from higher autonomy against increased systemic risk and governance costs; optimal allocation will vary by sector (high-consequence systems justify stricter human oversight; lower-consequence tasks may tolerate more autonomy).
Normative policy analysis and cost–benefit reasoning; sector-differentiated triage framework proposed (no quantitative welfare or sectoral optimization performed).
Bounded-autonomy governance internalizes some externalities from automated interactions, reducing the probability of cascading failures and associated economic damages, but misaligned or heterogeneous governance across firms/sectors can still generate systemic vulnerabilities.
Theoretical argument combining externalities literature and governance design principles; illustrative scenarios and policy reasoning (no empirical validation).
Modern critical infrastructure increasingly uses embodied AI for monitoring, predictive maintenance, and decision support, but these systems are typically trained for statistically representable uncertainty rather than systemic, cascading crises.
Review and synthesis of policy texts, industry descriptions, and safety/AI standards cited in the paper (EU AI Act, ISO standards) and literature on embodied-AI applications; conceptual argument (no original empirical sample).
Cooperation with the AI is sustained mainly through conditional rule-based strategies rather than through trust-building, emotional, and social channels.
Synthesis of behavioral trajectories (cooperation plateauing below human–human levels), strategy-estimation results (prevalence of rule-based strategies such as Grim Trigger), and chat-content analysis (more explicit commitments, fewer social/emotional messages) from the laboratory experiment (human–AI n = 126) and comparison to human–human benchmark (n = 108).
When allowed repeated communication with the AI, human subjects remain behaviorally dispersed and do not converge to a single dominant strategy.
Strategy-estimation results for the human–AI repeated-chat treatment (from the experiment, n = 126) showing heterogeneous assignment across strategy classes and lack of convergence over time.
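The Grim Trigger strategy named in the evidence note above can be illustrated with a minimal sketch of a repeated prisoner's dilemma. The payoff values, the round count, and the partner strategies (`always_cooperate`, `defect_in_round`) are illustrative assumptions and are not taken from the experiment.

```python
# Illustrative repeated prisoner's dilemma. Payoffs are assumed, not from the study.
PAYOFFS = {  # (my_move, partner_move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def grim_trigger(own_history, partner_history):
    """Cooperate until the partner defects once, then defect forever."""
    return "D" if "D" in partner_history else "C"

def always_cooperate(own_history, partner_history):
    """Hypothetical partner that never defects."""
    return "C"

def defect_in_round(n):
    """Hypothetical partner that cooperates except in round n (0-indexed)."""
    def strategy(own_history, partner_history):
        return "D" if len(own_history) == n else "C"
    return strategy

def play(strategy_a, strategy_b, rounds=10):
    """Play both strategies against each other; return histories and total payoffs."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
    return hist_a, hist_b, score_a, score_b
```

A single defection by the partner switches Grim Trigger to permanent defection, which is why this kind of cooperation is conditional and rule-based rather than built on trust.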
Increasing benign-agent count and agent stubbornness are practical levers for improving robustness, but both carry costs: added compute/operational cost for scaling agents, and degraded consensus/coordination when stubbornness is high.
Argumentation supported by simulation results showing improved robustness with more agents or higher stubbornness, combined with discussion of computational cost (scaling) and observed consensus degradation; computational cost is presented as conceptual/operational reasoning rather than quantified in the summary.
Naïvely lowering trust weights assigned to suspected adversaries can limit adversarial influence but may also hinder cooperation and reduce task performance.
Simulations manipulating fixed trust weights and observing tradeoffs between reduced adversarial sway and decreased cooperative task performance/convergence; conceptual analysis of the tradeoff is provided.
Raising agents' innate stubbornness (peer resistance) reduces susceptibility to adversarial manipulation but impairs the network's ability to reach consensus or coordinate effectively.
Combined theoretical reasoning from FJ model (stubbornness is weight on innate opinion) and simulation experiments varying stubbornness parameters; measured outcomes include adversarial influence and measures of convergence/coordination or task performance.
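The stubbornness tradeoff described above can be sketched with a small Friedkin-Johnsen (FJ) style update, in which each agent mixes its innate opinion with a weighted average of current opinions. The network (four fully connected agents with uniform weights), the innate opinions, the single fully stubborn adversary, and the two summary measures are all assumptions chosen for illustration. They are not the paper's simulation setup.

```python
def fj_simulate(innate, stubbornness, weights, steps=200):
    """Iterate x_i <- s_i * innate_i + (1 - s_i) * sum_j W_ij * x_j."""
    x = list(innate)
    n = len(x)
    for _ in range(steps):
        x = [
            stubbornness[i] * innate[i]
            + (1.0 - stubbornness[i]) * sum(weights[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
    return x

def run(benign_stubbornness):
    """Return (adversarial pull, benign opinion spread) for one stubbornness level."""
    # Assumed setup: three benign agents plus one adversary pinned at opinion 1.0.
    innate = [0.0, 0.2, 0.4, 1.0]
    s = [benign_stubbornness] * 3 + [1.0]  # adversary never updates
    n = len(innate)
    weights = [[1.0 / n] * n for _ in range(n)]  # uniform averaging, self included
    final = fj_simulate(innate, s, weights)
    benign = final[:3]
    # Pull: how far the mean benign opinion moved from its innate mean.
    pull = sum(benign) / 3 - sum(innate[:3]) / 3
    # Spread: gap between benign agents (0 means they reached consensus).
    spread = max(benign) - min(benign)
    return pull, spread
```

Under these assumptions, `run(0.0)` lets the adversary drag the benign agents all the way to its opinion while they agree with each other, and `run(0.9)` keeps them close to their innate opinions but leaves them in disagreement.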
BenchPress evaluation shows that Pokemon battling tests capabilities largely orthogonal to those covered by common LLM benchmarks (i.e., it stresses different skill sets).
Paper applies a BenchPress matrix/method to quantify coverage relative to standard benchmarks and reports near-orthogonality for battling tasks in the matrix results.
The study documents a 'silent empathy' effect: people often feel empathic concern but fail to express it in ways that align with normative empathic communication; targeted feedback helps close that expression gap.
Analysis showing mismatch between internal empathic concern (implied by context/self-report/ratings) and the presence of idiomatic empathic moves in participants' messages; targeted personalized feedback increased use of normative empathic expressions.
Investments in interpretability that aim to fully 'rule‑ify' LLM competence may have diminishing returns; economic value may be better captured by research into robust behavioral evaluation, stress testing, and hybrid human‑AI workflows, while partial interpretability remains valuable.
R&D allocation and interpretability economics argument built on the central thesis; suggestion rather than empirical finding.
The paper challenges a purely rule‑based view of scientific explanation: some explanatory power will remain in implicit model structure rather than explicit rules.
Philosophical/epistemological argument based on the main thesis about tacit competence; no empirical validation.
LLMs can provide useful inputs for near-term economic and logistical forecasting in crises (e.g., supply-chain disruptions, commodity market impacts, transport/logistics constraints), but their political/strategic forecasts should be used cautiously.
Observed stronger and more verifiable performance on economic/logistical question types in the 42-node evaluation; weaker reliability on politically ambiguous multi-actor issues reported in qualitative coding and verifiability checks.
Model narratives evolve over time: earlier node outputs emphasize rapid containment, while later node outputs increasingly describe regional entrenchment and attritional de-escalation scenarios.
Longitudinal analysis across 11 temporal nodes comparing thematic/narrative content of model responses; qualitative coding tracked shifts in dominant scenario framings from early to later nodes.
Model reliability is uneven across domains: performance is stronger on structured economic and logistical questions than on politically ambiguous, multi-actor strategic issues.
Domain-specific comparison of model outputs on node-specific verifiable questions and exploratory prompts, with higher verifiability/accuracy and more consistent inferences reported for economic/logistical items versus greater ambiguity and lower consistency on political/multi-actor items.
Liability regimes and penalties should account for limits of enforced compliance and false positives/negatives from probabilistic policy evaluations.
Normative/economic discussion in the paper highlighting probabilistic outputs of the Policy function and calibration challenges; no empirical validation.
Firms will trade off compliance strictness against service quality (task completion rates), creating an economic tradeoff that shapes market offerings (e.g., safer-but-slower vs. faster-but-riskier agents).
Economic reasoning and conceptual models in the paper; suggested objective balancing task completion and legal/reputational costs; no empirical market data.
Alignment and instruction tuning approaches intended to encourage up-to-date answers improve some behaviors but do not reliably solve time-sensitivity and cross-modal consistency issues.
Experiments applying alignment/instruction-tuning methods with measurement of correctness and consistency; reported partial or inconsistent improvements rather than full resolution.
Diagnostic analysis links outdated predictions to (i) the static, time-stamped nature of training/evaluation datasets and (ii) mechanistic limits in how multimodal representations encode and retrieve temporal facts.
Error attribution analyses connecting incorrect answers to training snapshot timestamps and dataset provenance; representation-level analyses and qualitative case studies demonstrating multimodal encoding/retrieval limits.
For models/dynamics with negative LLE (contracting behavior), investment in parallel Newton tooling is likely to pay off; for expanding/chaotic dynamics (positive LLE), alternative architectural or modeling changes may be more cost-effective.
Application of the LLE convergence criterion derived in the thesis combined with empirical demonstrations on representative tasks indicating correlation between LLE sign and parallel solver performance; economic recommendation is interpretive.
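To make the sign of the largest Lyapunov exponent (LLE) concrete, here is a minimal sketch that estimates it for a one-dimensional map by averaging log|f'(x)| along a trajectory. The logistic map, the parameter values, and the function name `logistic_lle` are illustrative assumptions. The thesis's own dynamics and convergence criterion are not reproduced here.

```python
import math

def logistic_lle(r, x0=0.2, burn_in=100, steps=10_000):
    """Estimate the LLE of x -> r * x * (1 - x) as the mean of log|f'(x)|."""
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        deriv = r * (1.0 - 2.0 * x)  # derivative of the map at x
        total += math.log(max(abs(deriv), 1e-300))  # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps

contracting = logistic_lle(2.5)  # orbit settles on a stable fixed point
chaotic = logistic_lle(3.9)      # orbit is chaotic
```

With `r = 2.5` the estimate is negative (nearby trajectories contract), and with `r = 3.9` it is positive (nearby trajectories diverge). This is the distinction the recommendation above turns on.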
The economic value of deploying DeePC-based controllers depends critically on representativeness of training data and the costs of online adaptation and safety verification.
Authors' deployment-risk analysis and discussion of trade-offs (qualitative), grounded in methodological requirements of DeePC (need for representative, persistently exciting data and safeguards).
System-level improvements from the controller do not imply uniform spatial or temporal benefits; distributional effects may favor certain routes or neighborhoods.
Authors' discussion and caution about distributional effects and equity; possibly supported by spatial analyses in simulation (qualitative discussion in paper).
Sparse MoE designs reduce active compute per query but can introduce serving complexity (routing, memory bandwidth, batching) that may require specialized infrastructure.
Architectural property of sparse MoE (sparse activation) and the paper's discussion of deployment trade-offs; the summary notes the need for specialized serving infra and potential transitional costs. This is an argument supported by known MoE deployment literature rather than novel empirical measurements in the summary.
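The sparse-activation property mentioned above can be sketched with a toy top-k router: only `k` of the experts run for a given input, and their outputs are combined with softmax weights over the selected router logits. The scalar experts, the logits, and the function names are hypothetical. Real MoE layers operate on tensors and batch tokens per expert, which is where the serving complexity comes from.

```python
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_forward(x, logits, experts, k=2):
    """Run only the routed experts; return the mixed output and the experts used."""
    routes = top_k_route(logits, k)
    y = sum(w * experts[i](x) for i, w in routes)
    return y, [i for i, _ in routes]

# Four toy experts; expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(4)]
y, used = moe_forward(1.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
active_fraction = len(used) / len(experts)  # share of experts actually evaluated
```

Here two of four experts run, so active compute is half of the dense equivalent, but the set of active experts changes per input, which complicates batching and memory placement.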
Deploying conformal factuality systems increases development cost (collecting representative calibration data) and inference cost (verifier compute), though efficient verifiers mitigate inference cost.
Discussion and empirical cost measurements: need for representative calibration datasets to maintain guarantees; measured verifier FLOPs; qualitative economic analysis in the paper.
Conformal filtering improves formal reliability (statistical factuality guarantees) but does not, by itself, deliver robustness and task utility without careful system design.
Aggregate empirical results: improved factuality guarantees after calibration/filtering, but concurrent reductions in informativeness and sensitivity to distribution shift/distractors unless calibration/data-processing are adapted.
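One common way to obtain this kind of guarantee is split-conformal calibration of a score threshold, sketched below. For each calibration example the sketch records the highest verifier score given to any false claim, takes a conformal quantile of those values, and then keeps only claims scoring strictly above it. The data layout, function names, and scores are assumptions for illustration. The paper's exact procedure may differ.

```python
import math

def calibrate_threshold(calibration, alpha):
    """calibration: list of examples, each a list of (score, is_factual) pairs.

    Returns a threshold such that, under exchangeability, a new example has all
    retained claims factual with probability at least 1 - alpha.
    """
    # Per-example nonconformity: highest score assigned to a false claim
    # (0.0 when the example contains no false claims).
    r = sorted(
        max((score for score, ok in example if not ok), default=0.0)
        for example in calibration
    )
    n = len(r)
    rank = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index (1-based)
    if rank > n:
        return float("inf")  # too little calibration data: retain nothing
    return r[rank - 1]

def conformal_filter(claims, threshold):
    """Keep (text, score) claims whose score is strictly above the threshold."""
    return [(text, score) for text, score in claims if score > threshold]
```

The threshold rises as `alpha` shrinks, so stricter guarantees remove more claims. That is the loss of informativeness noted above, and it also shows why the calibration set must be representative of deployment data.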
Fine-tuning TSFMs on the high-frequency 5G data provides limited recovery; many configurations still perform poorly after fine-tuning.
Paper reports experiments including fine-tuning regimes where TSFMs were fine-tuned on the new dataset; results indicate limited improvement in many configurations. Specific fine-tuning procedures, dataset sizes, and quantitative results are not provided in the summary.
DeepSeek-R1 exhibits a distributed memorization signature: 76.6% partial reconstruction rate but 0% verbatim recall on the TS‑Guessing probe.
Model-specific results from Experiment 3 (TS‑Guessing) reporting per-model rates of partial reconstruction and verbatim recall across the 513 MMLU items for DeepSeek-R1.
Quantitative comparisons across tested models show systematic Misapplication Rate even in settings where Appropriate Application Rate is high.
Aggregated MR and AAR statistics reported for multiple frontier models across the benchmark showing co‑occurrence of high AAR and nontrivial MR.
Prompt‑based defensive instructions (explicitly instructing models to suppress preferences where inappropriate) reduce misapplication but fail to fully eliminate it.
Ablation experiments adding prompt‑based safety/defenses to model inputs and measuring MR and AAR; defenses produced reductions in MR but residual misapplication remained.
Attempts to mitigate misapplication with stronger reasoning prompts (e.g., chain‑of‑thought) reduce Misapplication Rate but do not eliminate it.
Ablation applying reasoning prompts and chain‑of‑thought style instructions to models, comparing MR before and after; reported reductions in MR but persistence of non‑zero MR across scenarios.
Models that more faithfully enforce stored preferences achieve higher Appropriate Application Rate (AAR) but also systematically have higher Misapplication Rate (MR), indicating a trade‑off between correct personalization and harmful over‑application.
Ablation experiments varying strength of preference encoding and measuring resulting AAR and MR per model; quantitative comparisons across models showing positive correlation between stronger preference adherence and both higher AAR and higher MR.
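The two rates can be read as conditional frequencies over labeled scenarios. The sketch below assumes that AAR is the share of scenarios where a stored preference should be applied and was, and that MR is the share of scenarios where it should not be applied but was. These definitions, the record format, and the function name are assumptions, since the summary does not spell out the benchmark's formulas.

```python
def application_rates(records):
    """records: iterable of (should_apply, applied) boolean pairs, one per scenario.

    Returns (AAR, MR) under the assumed definitions described in the lead-in.
    """
    relevant = [applied for should, applied in records if should]
    irrelevant = [applied for should, applied in records if not should]
    aar = sum(relevant) / len(relevant) if relevant else float("nan")
    mr = sum(irrelevant) / len(irrelevant) if irrelevant else float("nan")
    return aar, mr

# Hypothetical outcomes for a model that applies preferences eagerly.
eager = [(True, True), (True, True), (True, True),
         (False, True), (False, True), (False, False), (False, False)]
# Hypothetical outcomes for a more cautious model.
cautious = [(True, True), (True, False), (True, False),
            (False, False), (False, False), (False, False), (False, True)]
```

With these made-up records the eager model scores higher on both rates than the cautious one, mirroring the reported pattern that stronger preference adherence raises AAR and MR together.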
Reducing payrolls raises short-term firm profitability but reduces aggregate household income and consumption.
Macroeconomic accounting and labor-demand theory combined with historical examples of payroll reductions; argument is theoretical/conceptual rather than estimated with new aggregate time-series regression evidence.