Evidence (5539 claims)
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
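The matrix above lends itself to simple programmatic queries. A minimal sketch, using a few rows transcribed from the table (with "—" cells treated as zero counts), computes each outcome's share of positive findings:

```python
# Minimal sketch: share of positive findings per outcome, from a few
# rows transcribed out of the evidence matrix above. "—" cells -> 0.
rows = {
    # outcome: (positive, negative, mixed, null)
    "Firm Productivity":    (306, 39, 70, 12),
    "AI Safety & Ethics":   (116, 177, 44, 24),
    "Job Displacement":     (6, 38, 13, 0),
    "Task Completion Time": (88, 5, 4, 3),
}

def positive_share(counts):
    """Fraction of claims coded Positive among the four directions."""
    total = sum(counts)
    return counts[0] / total if total else 0.0

for outcome, counts in rows.items():
    print(f"{outcome}: {positive_share(counts):.0%} positive")
```

Note that shares are computed over the four direction columns rather than the table's Total column, since some row totals exceed the sum of the listed directions.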
Adoption
The paper challenges a purely rule‑based view of scientific explanation: some explanatory power will remain in implicit model structure rather than explicit rules.
Philosophical/epistemological argument based on the main thesis about tacit competence; no empirical validation.
LLMs can provide useful inputs for near-term economic and logistical forecasting in crises (e.g., supply-chain disruptions, commodity market impacts, transport/logistics constraints), but their political/strategic forecasts should be used cautiously.
Observed stronger and more verifiable performance on economic/logistical question types in the 42-node evaluation; weaker reliability on politically ambiguous multi-actor issues reported in qualitative coding and verifiability checks.
Model narratives evolve over time: earlier node outputs emphasize rapid containment, while later node outputs increasingly describe regional entrenchment and attritional de-escalation scenarios.
Longitudinal analysis across 11 temporal nodes comparing thematic/narrative content of model responses; qualitative coding tracked shifts in dominant scenario framings from early to later nodes.
Model reliability is uneven across domains: performance is stronger on structured economic and logistical questions than on politically ambiguous, multi-actor strategic issues.
Domain-specific comparison of model outputs on node-specific verifiable questions and exploratory prompts, with higher verifiability/accuracy and more consistent inferences reported for economic/logistical items versus greater ambiguity and lower consistency on political/multi-actor items.
Liability regimes and penalties should account for limits of enforced compliance and false positives/negatives from probabilistic policy evaluations.
Normative/economic discussion in the paper highlighting probabilistic outputs of the Policy function and calibration challenges; no empirical validation.
Firms will trade off compliance strictness against service quality (task completion rates), creating an economic tradeoff that shapes market offerings (e.g., safer-but-slower vs. faster-but-riskier agents).
Economic reasoning and conceptual models in the paper; suggested objective balancing task completion and legal/reputational costs; no empirical market data.
Alignment and instruction tuning approaches intended to encourage up-to-date answers improve some behaviors but do not reliably solve time-sensitivity and cross-modal consistency issues.
Experiments applying alignment/instruction-tuning methods with measurement of correctness and consistency; reported partial or inconsistent improvements rather than full resolution.
Diagnostic analysis links outdated predictions to (i) the static, time-stamped nature of training/evaluation datasets and (ii) mechanistic limits in how multimodal representations encode and retrieve temporal facts.
Error attribution analyses connecting incorrect answers to training snapshot timestamps and dataset provenance; representation-level analyses and qualitative case studies demonstrating multimodal encoding/retrieval limits.
For models/dynamics with negative LLE (contracting behavior), investment in parallel Newton tooling is likely to pay off; for expanding/chaotic dynamics (positive LLE), alternative architectural or modeling changes may be more cost-effective.
Application of the LLE convergence criterion derived in the thesis combined with empirical demonstrations on representative tasks indicating correlation between LLE sign and parallel solver performance; economic recommendation is interpretive.
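The sign test behind this recommendation can be illustrated on a toy system. This sketch estimates the largest Lyapunov exponent (LLE) of the 1-D logistic map via the average log-derivative along a trajectory; the map and parameter values are illustrative stand-ins, not the systems or the criterion's exact form from the thesis:

```python
import math

def logistic_lle(r, x0=0.4, burn_in=500, n=5000):
    """Estimate the LLE of x -> r*x*(1-x) as the mean log |f'(x)|
    along a trajectory, after discarding transients."""
    x = x0
    for _ in range(burn_in):  # discard transients
        x = r * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        acc += math.log(abs(r * (1 - 2 * x)))  # log of the map's derivative
        x = r * x * (1 - x)
    return acc / n

# Negative LLE (contracting) -> parallel Newton tooling likely pays off;
# positive LLE (chaotic) -> alternative approaches may be more cost-effective.
print(logistic_lle(2.5))  # contracting regime: LLE < 0
print(logistic_lle(3.9))  # chaotic regime: LLE > 0
```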
The economic value of deploying DeePC-based controllers depends critically on representativeness of training data and the costs of online adaptation and safety verification.
Authors' deployment-risk analysis and discussion of trade-offs (qualitative), grounded in methodological requirements of DeePC (need for representative, persistently exciting data and safeguards).
System-level improvements from the controller do not imply uniform spatial/temporal benefits—distributional effects may favor certain routes or neighborhoods.
Authors' discussion and caution about distributional effects and equity; possibly supported by spatial analyses in simulation (qualitative discussion in paper).
Sparse MoE designs reduce active compute per query but can introduce serving complexity (routing, memory bandwidth, batching) that may require specialized infrastructure.
Architectural property of sparse MoE (sparse activation) and the paper's discussion of deployment trade-offs; the summary notes the need for specialized serving infrastructure and potential transitional costs. This is an argument supported by known MoE deployment literature rather than novel empirical measurements in the summary.
Deploying conformal factuality systems increases development cost (collecting representative calibration data) and inference cost (verifier compute), though efficient verifiers mitigate inference cost.
Discussion and empirical cost measurements: need for representative calibration datasets to maintain guarantees; measured verifier FLOPs; qualitative economic analysis in the paper.
Conformal filtering improves formal reliability (statistical factuality guarantees) but does not, by itself, deliver robustness and task utility without careful system design.
Aggregate empirical results: improved factuality guarantees after calibration/filtering, but concurrent reductions in informativeness and sensitivity to distribution shift/distractors unless calibration/data-processing are adapted.
Fine-tuning TSFMs on the high-frequency 5G data provides limited recovery; many configurations still perform poorly after fine-tuning.
The paper reports fine-tuning experiments in which TSFMs were fine-tuned on the new dataset; results indicate limited improvement in many configurations. Specific fine-tuning procedures, dataset sizes, and quantitative results are not provided in the summary.
DeepSeek-R1 exhibits a distributed memorization signature: 76.6% partial reconstruction rate but 0% verbatim recall on the TS‑Guessing probe.
Model-specific results from Experiment 3 (TS‑Guessing) reporting per-model rates of partial reconstruction and verbatim recall across the 513 MMLU items for DeepSeek-R1.
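The per-model rates can be reproduced with simple bookkeeping over per-item outcomes. A hedged sketch follows; the item flags are fabricated for illustration, since the actual probe outputs and scoring rules from Experiment 3 are not given in the summary:

```python
# Hedged sketch of how TS-Guessing rates could be tallied per model.
def tally_rates(items):
    """items: list of dicts with boolean 'partial' and 'verbatim' flags,
    one per benchmark item. Returns (partial_rate, verbatim_rate)."""
    n = len(items)
    partial = sum(1 for it in items if it["partial"])
    verbatim = sum(1 for it in items if it["verbatim"])
    return partial / n, verbatim / n

# Toy illustration with 4 fabricated items (not the 513 MMLU items):
toy = [
    {"partial": True,  "verbatim": False},
    {"partial": True,  "verbatim": False},
    {"partial": False, "verbatim": False},
    {"partial": True,  "verbatim": False},
]
p, v = tally_rates(toy)
print(f"partial reconstruction: {p:.1%}, verbatim recall: {v:.1%}")
```

A high partial rate with zero verbatim recall, as computed here, is the "distributed memorization" signature the claim describes.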
Quantitative comparisons across tested models show a systematic Misapplication Rate (MR) even in settings where the Appropriate Application Rate (AAR) is high.
Aggregated MR and AAR statistics reported for multiple frontier models across the benchmark showing co‑occurrence of high AAR and nontrivial MR.
Prompt‑based defensive instructions (explicitly instructing models to suppress preferences where inappropriate) reduce misapplication but fail to fully eliminate it.
Ablation experiments adding prompt‑based safety/defenses to model inputs and measuring MR and AAR; defenses produced reductions in MR but residual misapplication remained.
Attempts to mitigate misapplication with stronger reasoning prompts (e.g., chain‑of‑thought) reduce Misapplication Rate but do not eliminate it.
Ablation applying reasoning prompts and chain‑of‑thought style instructions to models, comparing MR before and after; reported reductions in MR but persistence of non‑zero MR across scenarios.
Models that more faithfully enforce stored preferences achieve higher Appropriate Application Rate (AAR) but also systematically have higher Misapplication Rate (MR), indicating a trade‑off between correct personalization and harmful over‑application.
Ablation experiments varying strength of preference encoding and measuring resulting AAR and MR per model; quantitative comparisons across models showing positive correlation between stronger preference adherence and both higher AAR and higher MR.
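The AAR/MR bookkeeping behind this trade-off can be sketched as follows. The scoring rules and toy data are illustrative assumptions; the benchmark's actual item format is not given in the summary:

```python
def aar_mr(items):
    """items: list of (applied, appropriate_context) booleans.
    AAR = share of appropriate contexts where the preference was applied;
    MR  = share of inappropriate contexts where it was applied anyway."""
    in_ok = [applied for applied, ctx in items if ctx]
    in_bad = [applied for applied, ctx in items if not ctx]
    return sum(in_ok) / len(in_ok), sum(in_bad) / len(in_bad)

# Toy models: stronger preference adherence raises both AAR and MR.
strong = [(True, True)] * 9 + [(False, True)] + \
         [(True, False)] * 3 + [(False, False)] * 7
cautious = [(True, True)] * 6 + [(False, True)] * 4 + \
           [(True, False)] * 1 + [(False, False)] * 9

print("strong:", aar_mr(strong))      # higher AAR, higher MR
print("cautious:", aar_mr(cautious))  # lower AAR, lower MR
```

The structure mirrors a hit-rate/false-alarm trade-off: pushing AAR up by enforcing preferences more aggressively also pushes MR up.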
Reducing payrolls raises short-term firm profitability but reduces aggregate household income and consumption.
Macroeconomic accounting and labor-demand theory combined with historical examples of payroll reductions; argument is theoretical/conceptual rather than estimated with new aggregate time-series regression evidence.
Finance, Education, and Transportation show mixed dynamics: both displacement of routine tasks and creation of new hybrid roles.
Descriptive sectoral analyses from the simulated dataset (hybrid share, task-displacement indicators, employment changes) covering Finance, Education, Transportation (2020–2024), plus mixed-evidence studies from the literature synthesis (ACM/IEEE/Springer 2020–2024).
Improved matches and clearer skill signals can raise short-term wages for matched youth, while longer-term wage dynamics will depend on supply responses and bargaining power shifts.
The pilot reports higher short-term wages for matched youth; longer-term effects are discussed as conditional and were not measured in the pilot.
Overall, economic benefits from AI in radiology are plausible but conditional on human-AI interaction design, governance, workforce effects, and payment structures; net value is not determined by algorithmic accuracy alone.
Synthesis of the heterogeneous literature (laboratory, reader, observational, qualitative) and conceptual economic analysis highlighting dependencies beyond algorithmic performance.
The net effect of AI on clinician burnout is ambiguous: tools can remove tedious tasks but may introduce new cognitive, administrative, and liability stresses.
Mixed qualitative and small-scale observational studies with variable findings on burnout-related measures after AI introduction.
Changes in workload composition can reduce routine burdens but may shift cognitive load to follow-up decisions and managing AI outputs.
Observational and qualitative studies of deployed systems reporting redistribution of tasks and clinician-reported changes in cognitive demands.
Economic outcomes depend on complementarity versus substitution: AI that augments radiologists can raise output per worker; AI that substitutes tasks may reduce demand for certain diagnostic activities.
Theoretical economic frameworks and case studies of task reallocation in early deployments; empirical workforce-impact studies limited.
Automation bias can increase undue reliance on AI, while algorithmic aversion can drive underuse of helpful tools.
Cognitive and behavioral studies and reader simulations demonstrating both increased acceptance/overtrust in automated outputs in some settings and rejection/discounting of AI advice in others.
Real clinical value depends critically on how AI tools interact with radiologists in practice (integration design and human-AI interaction).
Conceptual models and synthesis of reader studies, simulation/interaction studies, usability and qualitative deployment evaluations that compare standalone algorithm performance versus clinician+AI workflows.
Trust calibration influences project performance outcomes: organizations tend toward metric-driven evaluation of AI outputs and use AI to strategically augment human expertise, but miscalibrated trust risks overreliance, or a misplaced focus on metrics, either of which can harm performance.
Based on participants' reported experiences in the 40 interviews and interpretive thematic analysis linking trust practices to observed/perceived performance consequences (shift to metric-based evaluation, strategic use, and noted risks).
Trust calibration shapes collaboration patterns, including delegation of oversight to systems or specialists, changes in communication networks (who talks to whom), and erosion of informal ad hoc communications used previously for tacit coordination.
Observed in interview narratives (40 interviews) and thematic coding showing repeated reports of shifted oversight roles, altered communication pathways, and reduced informal coordination after AI integration.
Trust calibration is produced and maintained through ongoing boundary work between humans and machines (i.e., teams continuously negotiate which inputs/responsibilities are treated as human versus machine).
Derived from participants' accounts in the 40 interviews and thematic analysis documenting repeated examples of role negotiation and boundary-setting between people and AI systems during project routines.
Trust in AI within project-based work is situational and socially distributed across team members, rather than a stable individual attitude.
The claim is based on thematic qualitative analysis of 40 semi-structured interviews with project professionals across multiple industries in the UK. Interview data showed variation in how different team members described their trust in systems depending on role, task, and context.
Explicit governance reduces negative externalities (bias, privacy breaches, loss of trust) but entails compliance costs that should be factored into adoption and diffusion models.
Conceptual claim synthesizing trade‑off arguments from governance and risk literatures and practitioner examples; not measured empirically in the paper.
Embedding AI into workflows may change firm boundaries (e.g., outsourcing models vs. in‑house systems) and make investments in internal auditability and explainability strategic assets.
Theoretical implication drawn from synthesis of organizational boundary theory and practitioner trends; suggested rather than empirically demonstrated within the paper.
AI is likely to continue shifting the frontier of early discovery and increase the throughput and quality of hypotheses, but persistent biological uncertainty and the cost of clinical validation mean AI will complement—not fully replace—traditional R&D for the foreseeable future.
Synthesis of technological trends, application successes and limitations, translational risk, and economic reasoning presented throughout the paper.
Proprietary data, precompetitive consortia, and platform consolidation can create barriers to entry; public-data initiatives could alter competitive dynamics.
Market-structure analysis and discussion of data-access models in the paper, with examples of consortia and proprietary platform effects.
Expect strong returns-to-scale and winner-take-most dynamics: large incumbents and well-funded startups with proprietary data/compute may dominate the field.
Economic reasoning and observations in the paper about data/compute concentration, platform effects, and market outcomes.
Realizing economic gains at scale from AI in drug R&D is constrained by data quality and access, high implementation and integration costs, regulatory uncertainty, and ethical/legal concerns; these constraints will shape how gains are distributed across firms, countries, and patients.
Aggregate conclusion of the narrative review synthesizing documented benefits and recurring constraints from published studies, case reports, industry/regulatory analyses; qualitative synthesis without quantitative projection of distributional outcomes.
Adoption of AI in pharma will increase demand for computational biologists, ML engineers, and data scientists and may displace or redefine some traditional bench roles.
Labor-market trend reports and organizational case studies included in the review noting hiring patterns and role changes; qualitative synthesis rather than comprehensive labor-market study.
AI could lower discovery costs and permit more entrants in niche/specialty therapy discovery, but clinical development costs remain a major barrier to entry.
Synthesis of reported reductions in early-stage discovery costs and persistent high clinical trial costs from studies and industry reports; heterogeneous evidence across therapeutic areas.
Upfront capital and proprietary data requirements may advantage large incumbents or well-funded startups and could increase market concentration unless data-sharing or open platforms emerge.
Market-structure analysis and industry examples in the narrative review; inference based on observed data-asset advantages and investment needs across firms.
AI shifts the cost structure of drug R&D toward higher fixed costs (data infrastructure, compute, ML talent) and potentially lower marginal costs for candidate generation and some preclinical activities.
Economic synthesis and industry reports in the review describing capital-intensive investments and reduced per-unit costs in algorithmic candidate generation; largely conceptual and based on case examples.
Early-stage unit costs and time-per-hit can fall with AI, but late-stage clinical trial costs driven by biology remain the primary bottleneck to overall R&D productivity gains.
Qualitative assessment of stage-specific effects based on industry observations and conceptual decomposition of R&D stages; no new cost accounting or econometric estimates provided.
AI can improve specific stages of drug discovery but cannot eliminate fundamental biological uncertainty.
Conceptual and thematic analysis across technological capability and R&D integration levels; supported by illustrative examples showing limits of prediction in complex biology.
Two opposing market forces will act: (a) democratization lowering entry barriers for startups, and (b) concentration where firms with premium proprietary data and integrated AI capture outsized returns.
Conceptual economic analysis and illustrative industry observations; no empirical market-structure measurement presented.
AI (including machine learning, generative AI, and NLP) is reshaping biomedical research and pharmaceutical R&D by creating distinct adoption archetypes within large pharmaceutical companies.
Editorial / conceptual synthesis using qualitative analysis and archetype classification based on cross-industry observations and illustrative examples; no systematic measurement or sample size reported.
Emerging technologies (AI, digital twins, computational rheology) can compress high-dimensional sensory/rheological spaces into actionable models, enabling faster iteration in R&D and altering how firms value R&D inputs.
Theoretical projection and literature-based argument about technological capabilities; illustrative scenarios offered; no empirical trials or measured productivity changes reported.
There is potential for timely, personalized interventions (nudges/warnings) that could reduce harm, but causal evidence of long‑term effectiveness is limited.
Many studies propose or evaluate intervention prototypes and report feasibility/short‑term outcomes, while the review notes scarce randomized or longitudinal evaluations measuring welfare outcomes.
Techniques to mitigate data scarcity—transfer learning, data augmentation, physics-informed priors, active learning, and leveraging multimodal data—provide partial improvements but do not fully resolve generalization limits.
Review of methodological papers and empirical studies applying these techniques; synthesis indicates improvements in certain contexts but ongoing limitations documented across sources.