Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

AI-driven productivity and data externalities can reconfigure which countries/regions specialize in which activities, with implications for labor demand, offshoring, and services trade patterns.

Mechanism and theory-based analysis drawing on literature about comparative advantage, automation, and data externalities; empirical testing recommended but not performed in the paper.

medium mixed Path Analysis of Digital Economy and Reconstruction of Inter... specialization patterns, labor demand, offshoring levels, services trade composi...

Standard international trade models should be updated to incorporate data as an input, platform-mediated matching, algorithmic complementarities, and costs of regulatory fragmentation.

Theoretical critique and modeling recommendations based on mechanism analysis; no new formal model calibration or empirical testing presented in the paper.

medium mixed Path Analysis of Digital Economy and Reconstruction of Inter... adequacy and predictive accuracy of trade models for AI-era trade patterns

AI-enabled markets tend toward winner-take-most platforms amplified by network effects.

Theoretical reasoning supported by platform literature and case illustrations of platform concentration dynamics; empirical magnitudes not estimated in the paper.

medium mixed Path Analysis of Digital Economy and Reconstruction of Inter... market concentration / platform dominance

Competitive advantage is shifting away from asset- and labor-intensive models toward data-, model-, and platform-driven advantages, altering comparative advantage and market structure.

Mechanism/theoretical analysis drawing on platform and AI economics literature and qualitative examples; no empirical estimation provided in the paper.

medium mixed Path Analysis of Digital Economy and Reconstruction of Inter... comparative advantage (sectoral specialization), market structure (incumbency, c...

Regulatory design acts as an economic instrument that can balance social value from AI with protection of rights, affecting social welfare, public trust, and long-term adoption rates.

Normative synthesis combining legal and economic reasoning; suggested as a theoretical mechanism rather than empirically validated within the paper.

medium mixed ARTIFICIAL INTELLIGENCE AND ADMINISTRATIVE GOVERNANCE: A CRI... social welfare, public trust, long-term AI adoption rates

Automation of routine administrative tasks may reduce demand for certain clerical roles while increasing demand for oversight, auditing, and legal-technical expertise, altering public-sector labor composition and retraining needs.

Qualitative labor-market reasoning based on task-based automation literature and the administrative context; no field labor data or sample provided.

medium mixed ARTIFICIAL INTELLIGENCE AND ADMINISTRATIVE GOVERNANCE: A CRI... demand for different job categories (clerical roles vs oversight/legal-technical...

Current LLMs produce deep, reliable reasoning mainly in domains with rigorous, pre-existing abstractions (mathematics, programming) and underperform in domains that lack such formal abstractions.

Performance comparisons and observed patterns referenced qualitatively (e.g., better behavior on math and code tasks) drawn from existing literature and practitioner reports; the paper does not present new controlled benchmark experiments.

medium mixed An Alternative Trajectory for Generative AI reasoning accuracy and reliability across domains (e.g., test performance on mat...

AI feedback may either augment teacher productivity (complementarity) or substitute for routine teacher feedback tasks (substitution), with unclear net labor impacts.

Workshop deliberations among 50 scholars highlighting competing theoretical scenarios; no causal labor-market evidence provided.

medium mixed The Future of Feedback: How Can AI Help Transform Feedback t... teacher time allocation; demand for teacher skills; employment levels in educati...

Easier conversational access to models can substitute for routine cognitive labor while complementing high-skill work; miscalibrated trust affects labor outcomes and supervision costs.

Labor and task-allocation implications argued conceptually; no labor-market empirical evidence or quantified substitution/complementarity rates presented.

medium mixed Why We Need to Destroy the Illusion of Speaking to A Human: ... labor substitution for routine tasks, complementarity with high-skill tasks, sup...

Firms can compete on front-end design (transparency, trustworthiness) as a socially beneficial quality signal, but absent regulation competition may favor more persuasive (less honest) interfaces.

Economic argument about product differentiation and competitive incentives, drawn from market theory and literature; no empirical market study provided.

medium mixed Why We Need to Destroy the Illusion of Speaking to A Human: ... firm competition strategies, prevalence of transparent vs. persuasive interfaces...

Misleading cues can create short-term surplus (user satisfaction) but long-term welfare losses if overtrust causes harms or misinformation.

Theoretical economic argument based on information asymmetry and externalities; no empirical quantification in the paper.

medium mixed Why We Need to Destroy the Illusion of Speaking to A Human: ... short-term user satisfaction vs. long-term welfare (harms from misinformation/ov...

LLM-based chatbots’ conversational naturalness increases usability and adoption but also triggers misleading mental models (e.g., anthropomorphism, overtrust).

Paper-level main finding based on conceptual analysis and literature synthesis from HCI, ethics, and conversational analysis; no new large-scale empirical study or sample reported.

medium mixed Why We Need to Destroy the Illusion of Speaking to A Human: ... usability, adoption (engagement/use rates), and prevalence of misleading mental ...

The approach shifts some resource demand from GPU clusters to CPU, memory, and storage I/O, meaning local SSD and CPU provisioning can become the new bottleneck.

Authors note the system relies on multi-tier I/O and CPU-side updates to enable single-GPU fine-tuning; the summary highlights this resource-shift as a risk/consideration. No quantitative cost or workload-specific tradeoff analysis is provided in the summary.

medium mixed An Efficient Heterogeneous Co-Design for Fine-Tuning on a Si... relative resource utilization (GPU vs CPU/host memory/SSD I/O) and potential bot...

Human experts will likely shift roles from sole decision-makers to adjudicators, challengers, and validators of AI-generated arguments, changing required skills toward critical evaluation and dialectical oversight.

Conceptual labor-market projection; no empirical labor studies or surveys presented.

medium mixed Argumentative Human-AI Decision-Making: Toward AI Agents Tha... changes in job tasks, skill demand, and employment shares for expert validators/...

Productivity gains from partial automation may be offset by negative externalities (incorrect legal outcomes, appeals, reputational damage) that impose social and private costs not captured by narrow productivity measures.

Theoretical economic analysis and illustrative case vignettes describing error propagation; no empirical quantification of externalities.

medium mixed Why Avoid Generative Legal AI Systems? Hallucination, Overre... net social welfare/productivity after accounting for error-related externalities

Market demand will likely split between providers offering generative convenience with liability exposure and providers offering certified/verified, explainable tools at a premium, creating a two-tier market.

Market-structure analysis and illustrative projections; no empirical market data or sample size.

medium mixed Why Avoid Generative Legal AI Systems? Hallucination, Overre... market segmentation between riskier low-cost generative providers and premium ve...

Reported monetary supervision cost was low (~$200) for this project, but the paper cautions that general equilibrium effects and scaling may change costs as demand for supervisors rises.

Paper provides reported supervision cost (≈$200) for the single project and includes a caveat about external validity and scaling; cost is self-reported and contextualized by authors.

medium mixed Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... monetary supervision cost for this project (≈$200) and authors' caution about sc...

Because these agents will be embedded in safety‑critical infrastructure, economic and technical outcomes will depend heavily on system architecture choices.

Systems‑engineering and policy reasoning drawing on analogies to Internet/IoT evolution and domain examples (disaster response, healthcare, industrial automation, mobility); conceptual argumentation rather than empirical measurement.

medium mixed The Internet of Physical AI Agents: Interoperability, Longev... economic costs and technical system performance/resilience

Policymakers must weigh productivity gains from higher autonomy against increased systemic risk and governance costs; optimal allocation will vary by sector (high-consequence systems justify stricter human oversight; lower-consequence tasks may tolerate more autonomy).

Normative policy analysis and cost–benefit reasoning; sector-differentiated triage framework proposed (no quantitative welfare or sectoral optimization performed).

medium mixed Resilience Meets Autonomy: Governing Embodied AI in Critical... policy-optimal oversight allocation by sector (trade-off between productivity ga...

Bounded-autonomy governance internalizes some externalities from automated interactions, reducing the probability of cascading failures and associated economic damages, but misaligned or heterogeneous governance across firms/sectors can still generate systemic vulnerabilities.

Theoretical argument combining externalities literature and governance design principles; illustrative scenarios and policy reasoning (no empirical validation).

medium mixed Resilience Meets Autonomy: Governing Embodied AI in Critical... net effect on systemic risk (probability and expected loss from cascades) under ...

Modern critical infrastructure increasingly uses embodied AI for monitoring, predictive maintenance, and decision support, but these systems are typically trained for statistically representable uncertainty rather than systemic, cascading crises.

Review and synthesis of policy texts, industry descriptions, and safety/AI standards cited in the paper (EU AI Act, ISO standards) and literature on embodied-AI applications; conceptual argument (no original empirical sample).

medium mixed Resilience Meets Autonomy: Governing Embodied AI in Critical... mismatch between training uncertainty assumptions and real-world systemic crisis...

Cooperation with the AI is sustained mainly through conditional rule-based strategies rather than through trust-building, emotional, and social channels.

Synthesis of behavioral trajectories (cooperation plateauing below human–human levels), strategy-estimation results (prevalence of rule-based strategies such as Grim Trigger), and chat-content analysis (more explicit commitments, fewer social/emotional messages) from the laboratory experiment (human–AI n = 126) and comparison to human–human benchmark (n = 108).

medium mixed Playing Against the Machine: Cooperation, Communication, and... mechanism of cooperation (relative contribution of conditional rule-following vs...
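
For readers unfamiliar with the rule-based strategy named above, a minimal sketch of Grim Trigger in a repeated game follows. It uses the textbook definition (cooperate until the partner defects once, then defect permanently); the move labels and the opponent sequence are hypothetical and are not taken from the experiment.

```python
def grim_trigger(opponent_moves):
    """Cooperate ("C") until the opponent has defected ("D") once, then defect forever."""
    return "D" if "D" in opponent_moves else "C"


def play(opponent_plan):
    """Return the Grim Trigger player's moves against a fixed sequence of opponent moves."""
    my_moves = []
    for round_index in range(len(opponent_plan)):
        # The decision uses only the opponent's moves from earlier rounds.
        my_moves.append(grim_trigger(opponent_plan[:round_index]))
    return my_moves


# Hypothetical opponent that defects once, in round 3, then returns to cooperation.
opponent = ["C", "C", "D", "C", "C", "C"]
print(play(opponent))  # ['C', 'C', 'C', 'D', 'D', 'D']
```

The single defection ends cooperation for the rest of the game, which is why such a strategy is conditional and rule-based rather than dependent on trust-building messages.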

When allowed repeated communication with the AI, human subjects remain behaviorally dispersed and do not converge to a single dominant strategy.

Strategy-estimation results for the human–AI repeated-chat treatment (from the experiment, n = 126) showing heterogeneous assignment across strategy classes and lack of convergence over time.

medium mixed Playing Against the Machine: Cooperation, Communication, and... strategy convergence / dispersion (distribution of inferred strategies over time...

Increasing benign-agent count and agent stubbornness are practical levers for improving robustness, but both carry costs: added compute/operational cost for scaling agents, and degraded consensus/coordination when stubbornness is high.

Argumentation supported by simulation results showing improved robustness with more agents or higher stubbornness, combined with discussion of computational cost (scaling) and observed consensus degradation; computational cost is presented as conceptual/operational reasoning rather than quantified in the summary.

medium mixed Don't Trust Stubborn Neighbors: A Security Framework for Age... robustness to manipulation (improvement), computational/operational cost (increa...

Naïvely lowering trust weights assigned to suspected adversaries can limit adversarial influence but may also hinder cooperation and reduce task performance.

Simulations manipulating fixed trust weights and observing tradeoffs between reduced adversarial sway and decreased cooperative task performance/convergence; conceptual analysis of the tradeoff is provided.

medium mixed Don't Trust Stubborn Neighbors: A Security Framework for Age... adversarial influence (reduction) and cooperative task performance / convergence...

Raising agents' innate stubbornness (peer resistance) reduces susceptibility to adversarial manipulation but impairs the network's ability to reach consensus or coordinate effectively.

Combined theoretical reasoning from the Friedkin-Johnsen (FJ) model (stubbornness is the weight on an agent's innate opinion) and simulation experiments varying stubbornness parameters; measured outcomes include adversarial influence and measures of convergence/coordination or task performance.

medium mixed Don't Trust Stubborn Neighbors: A Security Framework for Age... adversarial influence (reduction) and network coordination/consensus metrics or ...
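
The stubbornness trade-off can be illustrated with a minimal sketch of Friedkin-Johnsen (FJ) opinion updates. The setup is an assumption made for illustration only: a fully connected network with uniform trust weights, one adversary whose opinion never changes, and hypothetical innate opinions and parameter values. None of these specifics come from the paper.

```python
def fj_simulate(innate, stubbornness, adversary_value, steps=200):
    """Synchronous FJ updates for benign agents facing one fixed adversary.

    Each benign agent i moves to
        s * innate[i] + (1 - s) * mean(opinions of all other agents),
    where s is the stubbornness (weight on the innate opinion).
    """
    opinions = list(innate)
    n = len(opinions)
    for _ in range(steps):
        updated = []
        for i in range(n):
            # Neighbors are all other benign agents plus the adversary (uniform weights).
            others = [opinions[j] for j in range(n) if j != i] + [adversary_value]
            neighbor_mean = sum(others) / len(others)
            updated.append(stubbornness * innate[i] + (1 - stubbornness) * neighbor_mean)
        opinions = updated
    return opinions


def summarize(opinions, innate):
    """Return (shift toward the adversary, spread among benign agents)."""
    shift = abs(sum(opinions) / len(opinions) - sum(innate) / len(innate))
    spread = max(opinions) - min(opinions)
    return shift, spread


innate = [0.0, 0.2, 0.4, 0.6]   # hypothetical innate opinions
adversary = 1.0                 # hypothetical fixed adversarial opinion

low = summarize(fj_simulate(innate, 0.1, adversary), innate)
high = summarize(fj_simulate(innate, 0.8, adversary), innate)
print("low stubbornness  (shift, spread):", low)
print("high stubbornness (shift, spread):", high)
```

In this toy setup the higher stubbornness value gives a smaller shift toward the adversary but a larger spread among benign agents, which mirrors the robustness-versus-consensus trade-off described in the claim.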

BenchPress evaluation shows that Pokémon battling tests capabilities largely orthogonal to those measured by common LLM benchmarks (i.e., it stresses different skill sets).

Paper applies a BenchPress matrix/method to quantify coverage relative to standard benchmarks and reports near-orthogonality for battling tasks in the matrix results.

medium mixed The PokeAgent Challenge: Competitive and Long-Context Learni... coverage/overlap metric from BenchPress matrix comparing PokeAgent Battling to s...

The study documents a 'silent empathy' effect: people often feel empathic concern but fail to express it in ways that align with normative empathic communication; targeted feedback helps close that expression gap.

Analysis showing mismatch between internal empathic concern (implied by context/self-report/ratings) and the presence of idiomatic empathic moves in participants' messages; targeted personalized feedback increased use of normative empathic expressions.

medium mixed Practicing with Language Models Cultivates Human Empathic Co... gap between experienced empathy and expressed empathic moves (alignment with nor...

Investments in interpretability that aim to fully 'rule‑ify' LLM competence may have diminishing returns; economic value may be better captured by research into robust behavioral evaluation, stress testing, and hybrid human‑AI workflows, while partial interpretability remains valuable.

R&D allocation and interpretability economics argument built on the central thesis; suggestion rather than empirical finding.

medium mixed Why the Valuable Capabilities of LLMs Are Precisely the Unex... returns to different types of interpretability/AI safety R&D

The paper challenges a purely rule‑based view of scientific explanation: some explanatory power will remain in implicit model structure rather than explicit rules.

Philosophical/epistemological argument based on the main thesis about tacit competence; no empirical validation.

medium mixed Why the Valuable Capabilities of LLMs Are Precisely the Unex... completeness of rule‑based scientific explanations when applied to LLM behavior

LLMs can provide useful inputs for near-term economic and logistical forecasting in crises (e.g., supply-chain disruptions, commodity market impacts, transport/logistics constraints), but their political/strategic forecasts should be used cautiously.

Observed stronger and more verifiable performance on economic/logistical question types in the 42-node evaluation; weaker reliability on politically ambiguous multi-actor issues reported in qualitative coding and verifiability checks.

medium mixed When AI Navigates the Fog of War usefulness for forecasting (economic/logistical forecasting accuracy/utility vs....

Model narratives evolve over time: earlier node outputs emphasize rapid containment, while later node outputs increasingly describe regional entrenchment and attritional de-escalation scenarios.

Longitudinal analysis across 11 temporal nodes comparing thematic/narrative content of model responses; qualitative coding tracked shifts in dominant scenario framings from early to later nodes.

medium mixed When AI Navigates the Fog of War narrative framing over time (frequency of containment vs. entrenchment/attrition...

Model reliability is uneven across domains: performance is stronger on structured economic and logistical questions than on politically ambiguous, multi-actor strategic issues.

Domain-specific comparison of model outputs on node-specific verifiable questions and exploratory prompts, with higher verifiability/accuracy and more consistent inferences reported for economic/logistical items versus greater ambiguity and lower consistency on political/multi-actor items.

medium mixed When AI Navigates the Fog of War domain-specific accuracy/reliability (economic/logistical vs. political/strategi...

Liability regimes and penalties should account for limits of enforced compliance and false positives/negatives from probabilistic policy evaluations.

Normative/economic discussion in the paper highlighting probabilistic outputs of the Policy function and calibration challenges; no empirical validation.

medium mixed Runtime Governance for AI Agents: Policies on Paths appropriateness of liability frameworks given probabilistic enforcement (policy ...

Firms will trade off compliance strictness against service quality (task completion rates), creating an economic tradeoff that shapes market offerings (e.g., safer-but-slower vs. faster-but-riskier agents).

Economic reasoning and conceptual models in the paper; suggested objective balancing task completion and legal/reputational costs; no empirical market data.

medium mixed Runtime Governance for AI Agents: Policies on Paths tradeoff curve between task completion rate and compliance risk (expected violat...

Alignment and instruction tuning approaches intended to encourage up-to-date answers improve some behaviors but do not reliably solve time-sensitivity and cross-modal consistency issues.

Experiments applying alignment/instruction-tuning methods with measurement of correctness and consistency; reported partial or inconsistent improvements rather than full resolution.

medium mixed V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... changes in correctness and consistency after alignment/instruction tuning

Diagnostic analysis links outdated predictions to (i) the static, time-stamped nature of training/evaluation datasets and (ii) mechanistic limits in how multimodal representations encode and retrieve temporal facts.

Error attribution analyses connecting incorrect answers to training snapshot timestamps and dataset provenance; representation-level analyses and qualitative case studies demonstrating multimodal encoding/retrieval limits.

medium mixed V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... attribution of errors to dataset temporal mismatch and representation/mechanisti...

For models/dynamics with a negative largest Lyapunov exponent (LLE), i.e., contracting behavior, investment in parallel Newton tooling is likely to pay off; for expanding/chaotic dynamics (positive LLE), alternative architectural or modeling changes may be more cost-effective.

Application of the LLE convergence criterion derived in the thesis combined with empirical demonstrations on representative tasks indicating correlation between LLE sign and parallel solver performance; economic recommendation is interpretive.

medium mixed Unifying Optimization and Dynamics to Parallelize Sequential... return-on-investment / suitability of parallelization conditioned on LLE sign
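
To make the sign criterion concrete, the sketch below estimates the largest Lyapunov exponent of a one-dimensional map as the long-run average of log|f'(x_t)| along a trajectory. The logistic map and the two parameter values are stand-ins chosen for illustration; they are assumptions and are not the sequential models studied in the thesis.

```python
import math


def logistic_lle(r, x0=0.2, burn_in=500, steps=5000):
    """Estimate the LLE of x -> r * x * (1 - x) by averaging log|f'(x)| over a trajectory."""
    x = x0
    for _ in range(burn_in):
        # Discard the transient so the average reflects long-run behavior.
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(steps):
        derivative = r * (1 - 2 * x)
        # Guard against log(0) if the derivative happens to vanish exactly.
        total += math.log(max(abs(derivative), 1e-300))
        x = r * x * (1 - x)
    return total / steps


contracting = logistic_lle(2.5)  # trajectory settles on a stable fixed point
chaotic = logistic_lle(3.9)      # trajectory is chaotic

print("r=2.5 LLE estimate:", contracting)
print("r=3.9 LLE estimate:", chaotic)
```

A negative estimate indicates that nearby trajectories converge (the contracting case in the claim), while a positive estimate indicates that they diverge.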

The economic value of deploying DeePC-based controllers depends critically on representativeness of training data and the costs of online adaptation and safety verification.

Authors' deployment-risk analysis and discussion of trade-offs (qualitative), grounded in methodological requirements of DeePC (need for representative, persistently exciting data and safeguards).

medium mixed Data-driven generalized perimeter control: Zürich case study net economic value after accounting for data collection, adaptation, and verific...

System-level improvements from the controller do not imply uniform spatial/temporal benefits—distributional effects may favor certain routes or neighborhoods.

Authors' discussion and caution about distributional effects and equity; possibly supported by spatial analyses in simulation (qualitative discussion in paper).

medium mixed Data-driven generalized perimeter control: Zürich case study spatial/temporal distribution of travel-time changes across network links or nei...

Sparse MoE designs reduce active compute per query but can introduce serving complexity (routing, memory bandwidth, batching) that may require specialized infrastructure.

Architectural property of sparse MoE (sparse activation) and the paper's discussion of deployment trade-offs; the summary notes the need for specialized serving infra and potential transitional costs. This is an argument supported by known MoE deployment literature rather than novel empirical measurements in the summary.

medium mixed EngGPT2: Sovereign, Efficient and Open Intelligence trade-off between per-query active compute reduction and increased serving/opera...
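
The "active compute per query" point can be sketched with generic top-k expert routing. This is a common sparse MoE pattern, not EngGPT2's actual router; the function name, logits, and expert count below are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and renormalize their softmax weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    max_logit = max(router_logits[i] for i in top)
    # Subtract the max before exponentiating for numerical stability.
    exp_scores = {i: math.exp(router_logits[i] - max_logit) for i in top}
    total = sum(exp_scores.values())
    return {i: exp_scores[i] / total for i in top}


# Hypothetical router outputs for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.9]
weights = top_k_route(logits, k=2)
active_fraction = len(weights) / len(logits)

print("selected experts and weights:", weights)
print("fraction of experts active:", active_fraction)
```

Only the selected experts run for this token, but different tokens select different experts, which is where the routing, memory-bandwidth, and batching complexity mentioned in the claim comes from.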

Deploying conformal factuality systems increases development cost (collecting representative calibration data) and inference cost (verifier compute), though efficient verifiers mitigate inference cost.

Discussion and empirical cost measurements: need for representative calibration datasets to maintain guarantees; measured verifier FLOPs; qualitative economic analysis in the paper.

medium mixed Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... development effort for calibration data, inference compute cost (FLOPs), margina...

Conformal filtering improves formal reliability (statistical factuality guarantees) but does not, by itself, deliver robustness and task utility without careful system design.

Aggregate empirical results: improved factuality guarantees after calibration/filtering, but concurrent reductions in informativeness and sensitivity to distribution shift/distractors unless calibration/data-processing are adapted.

medium mixed Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... post-filtering factuality guarantees, informativeness metrics, robustness under ...
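
The calibrate-then-filter step can be sketched as a split-conformal threshold on verifier scores. This follows one common formulation of conformal factuality and is not necessarily the paper's exact procedure; the function names, scores, and calibration data are hypothetical.

```python
import math


def calibrate_threshold(calibration, alpha):
    """Pick a score threshold from labeled calibration examples.

    Each example is a list of (verifier_score, is_correct) claims. Its
    nonconformity score is the highest verifier score given to an incorrect
    claim, i.e., the smallest threshold that removes all of its errors.
    """
    scores = sorted(
        max((s for s, ok in example if not ok), default=float("-inf"))
        for example in calibration
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index (1-based)
    if k > n:
        return float("inf")  # too few calibration examples: retain nothing
    return scores[k - 1]


def filter_claims(claims, threshold):
    """Keep only claims whose verifier score strictly exceeds the threshold."""
    return [(text, score) for text, score in claims if score > threshold]


# Hypothetical calibration set: one incorrect and one correct claim per example.
calibration = [[(s, False), (0.95, True)] for s in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]]
threshold = calibrate_threshold(calibration, alpha=0.25)

new_claims = [("claim A", 0.90), ("claim B", 0.55), ("claim C", 0.61)]
kept = filter_claims(new_claims, threshold)
print("threshold:", threshold)
print("kept:", kept)
```

The guarantee relies on the new response being exchangeable with the calibration examples, which is why distribution shift can undermine it, and every claim that falls below the threshold is dropped, which is the informativeness cost noted in the claim.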

Fine-tuning time-series foundation models (TSFMs) on the high-frequency 5G data provides limited recovery; many configurations still perform poorly after fine-tuning.

Paper reports experiments including fine-tuning regimes where TSFMs were fine-tuned on the new dataset; results indicate limited improvement in many configurations. Specific fine-tuning procedures, dataset sizes, and quantitative results are not provided in the summary.

medium mixed Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... predictive performance after fine-tuning (forecasting accuracy/error)

DeepSeek-R1 exhibits a distributed memorization signature: 76.6% partial reconstruction rate but 0% verbatim recall on the TS‑Guessing probe.

Model-specific results from Experiment 3 (TS‑Guessing) reporting per-model rates of partial reconstruction and verbatim recall across the 513 MMLU items for DeepSeek-R1.

medium mixed Are Large Language Models Truly Smarter Than Humans? partial reconstruction rate and verbatim recall rate (per-model)

Quantitative comparisons across tested models show a systematic, nonzero Misapplication Rate (MR) even in settings where the Appropriate Application Rate (AAR) is high.

Aggregated MR and AAR statistics reported for multiple frontier models across the benchmark showing co‑occurrence of high AAR and nontrivial MR.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Co‑occurrence of high Appropriate Application Rate (AAR) and nonzero Misapplicat...

Prompt‑based defensive instructions (explicitly instructing models to suppress preferences where inappropriate) reduce misapplication but fail to fully eliminate it.

Ablation experiments adding prompt‑based safety/defenses to model inputs and measuring MR and AAR; defenses produced reductions in MR but residual misapplication remained.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Misapplication Rate (MR) and Appropriate Application Rate (AAR) under prompt‑bas...

Attempts to mitigate misapplication with stronger reasoning prompts (e.g., chain‑of‑thought) reduce Misapplication Rate but do not eliminate it.

Ablation applying reasoning prompts and chain‑of‑thought style instructions to models, comparing MR before and after; reported reductions in MR but persistence of non‑zero MR across scenarios.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Change in Misapplication Rate (MR) after applying chain‑of‑thought / reasoning p...

Models that more faithfully enforce stored preferences achieve higher Appropriate Application Rate (AAR) but also systematically have higher Misapplication Rate (MR), indicating a trade‑off between correct personalization and harmful over‑application.

Ablation experiments varying strength of preference encoding and measuring resulting AAR and MR per model; quantitative comparisons across models showing positive correlation between stronger preference adherence and both higher AAR and higher MR.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Appropriate Application Rate (AAR) and Misapplication Rate (MR) — trade‑off rela...

Reducing payrolls raises short-term firm profitability but reduces aggregate household income and consumption.

Macroeconomic accounting and labor-demand theory combined with historical examples of payroll reductions; argument is theoretical/conceptual rather than estimated with new aggregate time-series regression evidence.

medium mixed A Shorter Workweek as a Policy Response to AI-Driven Labor D... firm profitability (short-term) and aggregate household income/consumption
