Evidence (14055 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Raising agents' innate stubbornness (peer resistance) reduces susceptibility to adversarial manipulation but impairs the network's ability to reach consensus or coordinate effectively.

Combined theoretical reasoning from the Friedkin-Johnsen (FJ) model, in which stubbornness is the weight an agent places on its innate opinion, and simulation experiments varying the stubbornness parameters; measured outcomes include adversarial influence and measures of convergence/coordination or task performance.

medium mixed Don't Trust Stubborn Neighbors: A Security Framework for Age... adversarial influence (reduction) and network coordination/consensus metrics or ...
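
The trade-off in this claim can be illustrated with a toy Friedkin-Johnsen simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: the 4-agent network, the innate opinions, the fully stubborn adversary, and the names `fj_equilibrium` and `summarize` are all hypothetical choices made for illustration.

```python
import numpy as np

def fj_equilibrium(W, innate, stubbornness, steps=500):
    """Iterate the FJ update x <- lam * s + (1 - lam) * W @ x.

    lam is each agent's weight on its own innate opinion s (its stubbornness);
    W is a row-stochastic matrix of weights placed on neighbors' opinions.
    """
    lam = np.asarray(stubbornness, dtype=float)
    s = np.asarray(innate, dtype=float)
    x = s.copy()
    for _ in range(steps):
        x = lam * s + (1.0 - lam) * (W @ x)
    return x

# Hypothetical network: agent 0 is an adversary pinned at opinion 1.0;
# agents 1-3 are honest and average equally over the other three agents.
W = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1/3, 0.0, 1/3, 1/3],
    [1/3, 1/3, 0.0, 1/3],
    [1/3, 1/3, 1/3, 0.0],
])
innate = np.array([1.0, 0.0, 0.2, 0.4])  # assumed innate opinions

def summarize(lam_honest):
    """Return (adversarial shift, honest disagreement) at equilibrium."""
    lam = np.array([1.0, lam_honest, lam_honest, lam_honest])
    honest = fj_equilibrium(W, innate, lam)[1:]
    shift = honest.mean() - innate[1:].mean()  # pull toward the adversary
    spread = honest.max() - honest.min()       # residual disagreement
    return shift, spread

shift_low, spread_low = summarize(0.1)    # weakly stubborn honest agents
shift_high, spread_high = summarize(0.8)  # strongly stubborn honest agents
print(f"lam=0.1: shift={shift_low:.3f}, spread={spread_low:.3f}")
print(f"lam=0.8: shift={shift_high:.3f}, spread={spread_high:.3f}")
```

In this toy setting, raising the honest agents' stubbornness shrinks the shift toward the adversary's opinion but widens the disagreement that remains among the honest agents, which is the qualitative pattern the claim describes.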

BenchPress evaluation shows Pokémon battling evaluates capabilities largely orthogonal to common LLM benchmarks (i.e., it stresses different skill sets).

Paper applies a BenchPress matrix/method to quantify coverage relative to standard benchmarks and reports near-orthogonality for battling tasks in the matrix results.

medium mixed The PokeAgent Challenge: Competitive and Long-Context Learni... coverage/overlap metric from BenchPress matrix comparing PokeAgent Battling to s...

The study documents a 'silent empathy' effect: people often feel empathic concern but fail to express it in ways that align with normative empathic communication; targeted feedback helps close that expression gap.

Analysis showing mismatch between internal empathic concern (implied by context/self-report/ratings) and the presence of idiomatic empathic moves in participants' messages; targeted personalized feedback increased use of normative empathic expressions.

medium mixed Practicing with Language Models Cultivates Human Empathic Co... gap between experienced empathy and expressed empathic moves (alignment with nor...

Investments in interpretability that aim to fully 'rule‑ify' LLM competence may have diminishing returns; economic value may be better captured by research into robust behavioral evaluation, stress testing, and hybrid human‑AI workflows, while partial interpretability remains valuable.

R&D allocation and interpretability economics argument built on the central thesis; suggestion rather than empirical finding.

medium mixed Why the Valuable Capabilities of LLMs Are Precisely the Unex... returns to different types of interpretability/AI safety R&D

The paper challenges a purely rule‑based view of scientific explanation: some explanatory power will remain in implicit model structure rather than explicit rules.

Philosophical/epistemological argument based on the main thesis about tacit competence; no empirical validation.

medium mixed Why the Valuable Capabilities of LLMs Are Precisely the Unex... completeness of rule‑based scientific explanations when applied to LLM behavior

LLMs can provide useful inputs for near-term economic and logistical forecasting in crises (e.g., supply-chain disruptions, commodity market impacts, transport/logistics constraints), but their political/strategic forecasts should be used cautiously.

Observed stronger and more verifiable performance on economic/logistical question types in the 42-node evaluation; weaker reliability on politically ambiguous multi-actor issues reported in qualitative coding and verifiability checks.

medium mixed When AI Navigates the Fog of War usefulness for forecasting (economic/logistical forecasting accuracy/utility vs....

Model narratives evolve over time: earlier node outputs emphasize rapid containment, while later node outputs increasingly describe regional entrenchment and attritional de-escalation scenarios.

Longitudinal analysis across 11 temporal nodes comparing thematic/narrative content of model responses; qualitative coding tracked shifts in dominant scenario framings from early to later nodes.

medium mixed When AI Navigates the Fog of War narrative framing over time (frequency of containment vs. entrenchment/attrition...

Model reliability is uneven across domains: performance is stronger on structured economic and logistical questions than on politically ambiguous, multi-actor strategic issues.

Domain-specific comparison of model outputs on node-specific verifiable questions and exploratory prompts, with higher verifiability/accuracy and more consistent inferences reported for economic/logistical items versus greater ambiguity and lower consistency on political/multi-actor items.

medium mixed When AI Navigates the Fog of War domain-specific accuracy/reliability (economic/logistical vs. political/strategi...

Liability regimes and penalties should account for limits of enforced compliance and false positives/negatives from probabilistic policy evaluations.

Normative/economic discussion in the paper highlighting probabilistic outputs of the Policy function and calibration challenges; no empirical validation.

medium mixed Runtime Governance for AI Agents: Policies on Paths appropriateness of liability frameworks given probabilistic enforcement (policy ...

Firms will trade off compliance strictness against service quality (task completion rates), creating an economic tradeoff that shapes market offerings (e.g., safer-but-slower vs. faster-but-riskier agents).

Economic reasoning and conceptual models in the paper; suggested objective balancing task completion and legal/reputational costs; no empirical market data.

medium mixed Runtime Governance for AI Agents: Policies on Paths tradeoff curve between task completion rate and compliance risk (expected violat...

Alignment and instruction tuning approaches intended to encourage up-to-date answers improve some behaviors but do not reliably solve time-sensitivity and cross-modal consistency issues.

Experiments applying alignment/instruction-tuning methods with measurement of correctness and consistency; reported partial or inconsistent improvements rather than full resolution.

medium mixed V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... changes in correctness and consistency after alignment/instruction tuning

Diagnostic analysis links outdated predictions to (i) the static, time-stamped nature of training/evaluation datasets and (ii) mechanistic limits in how multimodal representations encode and retrieve temporal facts.

Error attribution analyses connecting incorrect answers to training snapshot timestamps and dataset provenance; representation-level analyses and qualitative case studies demonstrating multimodal encoding/retrieval limits.

medium mixed V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... attribution of errors to dataset temporal mismatch and representation/mechanisti...

For models/dynamics with negative LLE (contracting behavior), investment in parallel Newton tooling is likely to pay off; for expanding/chaotic dynamics (positive LLE), alternative architectural or modeling changes may be more cost-effective.

Application of the LLE convergence criterion derived in the thesis combined with empirical demonstrations on representative tasks indicating correlation between LLE sign and parallel solver performance; economic recommendation is interpretive.

medium mixed Unifying Optimization and Dynamics to Parallelize Sequential... return-on-investment / suitability of parallelization conditioned on LLE sign
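
The sign test referred to in this claim can be sketched for a scalar recurrence, where the largest Lyapunov exponent (LLE) is the long-run average of log|f'(x_t)| along a trajectory. This is a minimal illustration, not the thesis's criterion or code: the two example maps, the step counts, and the function name `largest_lyapunov_exponent` are assumptions chosen for the example.

```python
import math

def largest_lyapunov_exponent(f, df, x0, steps=5000, burn_in=500):
    """Estimate the LLE of x_{t+1} = f(x_t) as the mean of log|f'(x_t)|."""
    x = x0
    for _ in range(burn_in):  # discard the initial transient
        x = f(x)
    total = 0.0
    for _ in range(steps):
        # Floor the derivative magnitude to avoid log(0) at a critical point.
        total += math.log(max(abs(df(x)), 1e-300))
        x = f(x)
    return total / steps

# Assumed contracting recurrence: a scalar tanh cell with |f'| <= 0.5.
a, b = 0.5, 0.1
lle_contracting = largest_lyapunov_exponent(
    lambda x: math.tanh(a * x + b),
    lambda x: a * (1.0 - math.tanh(a * x + b) ** 2),
    x0=0.3,
)

# Assumed expanding recurrence: the logistic map at r = 4, a standard
# example of chaotic dynamics.
lle_chaotic = largest_lyapunov_exponent(
    lambda x: 4.0 * x * (1.0 - x),
    lambda x: 4.0 * (1.0 - 2.0 * x),
    x0=0.2,
)

print(f"contracting map LLE: {lle_contracting:.3f}")
print(f"chaotic map LLE:     {lle_chaotic:.3f}")
```

Under the claim's reading, the first recurrence (negative LLE) is the kind of dynamics where parallel Newton tooling is expected to pay off, while the second (positive LLE) is the kind where other modeling changes may be more cost-effective.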

The economic value of deploying DeePC-based controllers depends critically on representativeness of training data and the costs of online adaptation and safety verification.

Authors' deployment-risk analysis and discussion of trade-offs (qualitative), grounded in methodological requirements of DeePC (need for representative, persistently exciting data and safeguards).

medium mixed Data-driven generalized perimeter control: Zürich case study net economic value after accounting for data collection, adaptation, and verific...

System-level improvements from the controller do not imply uniform spatial/temporal benefits—distributional effects may favor certain routes or neighborhoods.

Authors' discussion and caution about distributional effects and equity; possibly supported by spatial analyses in simulation (qualitative discussion in paper).

medium mixed Data-driven generalized perimeter control: Zürich case study spatial/temporal distribution of travel-time changes across network links or nei...

Sparse MoE designs reduce active compute per query but can introduce serving complexity (routing, memory bandwidth, batching) that may require specialized infrastructure.

Architectural property of sparse MoE (sparse activation) and the paper's discussion of deployment trade-offs; the summary notes the need for specialized serving infra and potential transitional costs. This is an argument supported by known MoE deployment literature rather than novel empirical measurements in the summary.

medium mixed EngGPT2: Sovereign, Efficient and Open Intelligence trade-off between per-query active compute reduction and increased serving/opera...

Deploying conformal factuality systems increases development cost (collecting representative calibration data) and inference cost (verifier compute), though efficient verifiers mitigate inference cost.

Discussion and empirical cost measurements: need for representative calibration datasets to maintain guarantees; measured verifier FLOPs; qualitative economic analysis in the paper.

medium mixed Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... development effort for calibration data, inference compute cost (FLOPs), margina...

Conformal filtering improves formal reliability (statistical factuality guarantees) but does not, by itself, deliver robustness and task utility without careful system design.

Aggregate empirical results: improved factuality guarantees after calibration/filtering, but concurrent reductions in informativeness and sensitivity to distribution shift/distractors unless calibration/data-processing are adapted.

medium mixed Is Conformal Factuality for RAG-based LLMs Robust? Novel Met... post-filtering factuality guarantees, informativeness metrics, robustness under ...

Fine-tuning TSFMs on the high-frequency 5G data provides limited recovery; many configurations still perform poorly after fine-tuning.

Paper reports experiments including fine-tuning regimes where TSFMs were fine-tuned on the new dataset; results indicate limited improvement in many configurations. Specific fine-tuning procedures, dataset sizes, and quantitative results are not provided in the summary.

medium mixed Bridging the High-Frequency Data Gap: A Millisecond-Resoluti... predictive performance after fine-tuning (forecasting accuracy/error)

DeepSeek-R1 exhibits a distributed memorization signature: 76.6% partial reconstruction rate but 0% verbatim recall on the TS‑Guessing probe.

Model-specific results from Experiment 3 (TS‑Guessing) reporting per-model rates of partial reconstruction and verbatim recall across the 513 MMLU items for DeepSeek-R1.

medium mixed Are Large Language Models Truly Smarter Than Humans? partial reconstruction rate and verbatim recall rate (per-model)

Quantitative comparisons across tested models show a systematically nonzero Misapplication Rate (MR) even in settings where the Appropriate Application Rate (AAR) is high.

Aggregated MR and AAR statistics reported for multiple frontier models across the benchmark showing co‑occurrence of high AAR and nontrivial MR.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Co‑occurrence of high Appropriate Application Rate (AAR) and nonzero Misapplicat...

Prompt‑based defensive instructions (explicitly instructing models to suppress preferences where inappropriate) reduce misapplication but fail to fully eliminate it.

Ablation experiments adding prompt‑based safety/defenses to model inputs and measuring MR and AAR; defenses produced reductions in MR but residual misapplication remained.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Misapplication Rate (MR) and Appropriate Application Rate (AAR) under prompt‑bas...

Attempts to mitigate misapplication with stronger reasoning prompts (e.g., chain‑of‑thought) reduce Misapplication Rate but do not eliminate it.

Ablation applying reasoning prompts and chain‑of‑thought style instructions to models, comparing MR before and after; reported reductions in MR but persistence of non‑zero MR across scenarios.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Change in Misapplication Rate (MR) after applying chain‑of‑thought / reasoning p...

Models that more faithfully enforce stored preferences achieve higher Appropriate Application Rate (AAR) but also systematically have higher Misapplication Rate (MR), indicating a trade‑off between correct personalization and harmful over‑application.

Ablation experiments varying strength of preference encoding and measuring resulting AAR and MR per model; quantitative comparisons across models showing positive correlation between stronger preference adherence and both higher AAR and higher MR.

medium mixed BenchPreS: A Benchmark for Context-Aware Personalized Prefer... Appropriate Application Rate (AAR) and Misapplication Rate (MR) — trade‑off rela...

Reducing payrolls raises short-term firm profitability but reduces aggregate household income and consumption.

Macroeconomic accounting and labor-demand theory combined with historical examples of payroll reductions; argument is theoretical/conceptual rather than estimated with new aggregate time-series regression evidence.

medium mixed A Shorter Workweek as a Policy Response to AI-Driven Labor D... firm profitability (short-term) and aggregate household income/consumption

Reviving model-based central planning tools (ISB+NDMS) risks political-economy problems and requires evaluation of efficiency and flexibility compared to market coordination.

Analytic discussion and normative argument in the paper; no empirical comparative study provided.

medium mixed DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... efficiency and flexibility of coordination mechanisms; political-economy risks (...

Russia's digitalization and adoption of AI/Big Data are reshaping the country's socio-economic infrastructure in multifaceted and systemic ways.

Qualitative analysis of national strategies and policy documents plus the author's expert assessments; no sample size or statistical testing reported.

medium mixed DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... systemic change in socio-economic infrastructure (broad, descriptive)

Finance, Education, and Transportation show mixed dynamics: both displacement of routine tasks and creation of new hybrid roles.

Descriptive sectoral analyses from the simulated dataset (hybrid share, task-displacement indicators, employment changes) covering Finance, Education, Transportation (2020–2024), plus mixed-evidence studies from the literature synthesis (ACM/IEEE/Springer 2020–2024).

medium mixed AI-Driven Transformation of Labor Markets: Skill Shifts, Hyb... Hybrid job share, task-displacement indicators, employment levels by sector

Improved matches and clearer skill signals can raise short-term wages for matched youth, while longer-term wage dynamics will depend on supply responses and bargaining power shifts.

Pilot reports higher reported short-term wages; longer-term effects are discussed as conditional and not measured in the pilot.

medium mixed AI-Driven Skill Mapping and Gig Economy Matching Algorithm f... short-term wages; long-term wage dynamics (not measured)

Overall, economic benefits from AI in radiology are plausible but conditional on human-AI interaction design, governance, workforce effects, and payment structures; net value is not determined by algorithmic accuracy alone.

Synthesis of the heterogeneous literature (laboratory, reader, observational, qualitative) and conceptual economic analysis highlighting dependencies beyond algorithmic performance.

medium mixed Human-AI interaction and collaboration in radiology: from co... net economic value/ROI, clinical outcomes, adoption and sustainability metrics

The net effect of AI on clinician burnout is ambiguous: tools can remove tedious tasks but may introduce new cognitive, administrative, and liability stresses.

Mixed qualitative and small-scale observational studies with variable findings on burnout-related measures after AI introduction.

medium mixed Human-AI interaction and collaboration in radiology: from co... burnout survey scores, task satisfaction, administrative burden metrics

Changes in workload composition can reduce routine burdens but may shift cognitive load to follow-up decisions and managing AI outputs.

Observational and qualitative studies of deployed systems reporting redistribution of tasks and clinician-reported changes in cognitive demands.

medium mixed Human-AI interaction and collaboration in radiology: from co... time allocation across task types, subjective cognitive workload scores, frequen...

Economic outcomes depend on complementarity versus substitution: AI that augments radiologists can raise output per worker; AI that substitutes tasks may reduce demand for certain diagnostic activities.

Theoretical economic frameworks and case studies of task reallocation in early deployments; empirical workforce-impact studies limited.

medium mixed Human-AI interaction and collaboration in radiology: from co... radiologist productivity metrics, employment levels/demand for diagnostic activi...

Automation bias can increase undue reliance on AI, while algorithmic aversion can drive underuse of helpful tools.

Cognitive and behavioral studies and reader simulations demonstrating both increased acceptance/overtrust in automated outputs in some settings and rejection/discounting of AI advice in others.

medium mixed Human-AI interaction and collaboration in radiology: from co... rates of clinician acceptance/use of AI recommendations, error rates when follow...

Real clinical value depends critically on how AI tools interact with radiologists in practice (integration design and human-AI interaction).

Conceptual models and synthesis of reader studies, simulation/interaction studies, usability and qualitative deployment evaluations that compare standalone algorithm performance versus clinician+AI workflows.

medium mixed Human-AI interaction and collaboration in radiology: from co... clinician-AI joint diagnostic performance, patient-relevant outcomes, workflow m...

Practical takeaway: effectiveness of human–AI teaming in security tasks depends heavily on human ability to formulate context-rich prompts; autonomous workflows that self-manage prompting and tool selection can be more effective.

Synthesis of empirical observations from the live CTF (41 participants) and the autonomous agent benchmark (4 agents), showing human prompting failures limiting team performance and autonomous agents with self-directed prompting achieving higher performance.

medium mixed Understanding Human-AI Collaboration in Cybersecurity Compet... relative effectiveness (challenge solve rates/rankings) conditional on human pro...

Participants’ perceptions, trust, and expectations about the AI shifted after hands-on use (qualitative observation).

Pre- vs. post-AI qualitative measures and observational analysis collected during the live CTF (self-reports/observations of trust and expectations after using the instrumented AI).

medium mixed Understanding Human-AI Collaboration in Cybersecurity Compet... qualitative changes in participant perceptions, trust, and expectations after ha...

Implication for substitution: Because there was no main effect of partner type on collaboration proficiency, AI teammates may substitute for humans on short, temporary tasks without clear productivity loss—conditional on emotional and empathetic factors.

Inference by authors based on the null main effect of partner type combined with the observed role of emotion and service empathy in moderating/mediating collaboration proficiency (experimental evidence, n = 861).

medium mixed Adoption of AI partners in temporary tasks: exploring the ef... productivity / collaboration proficiency

Theoretical framing: an attention-based view (ABV) and a dual-agent model capture two opposing mechanisms—(1) human attention gain from initial AI–human collaboration and (2) AI attention shift under deep embedding—that jointly generate the inverted U-shaped AI–ECSR relationship.

The paper develops and presents ABV and a dual-agent theoretical model to explain observed empirical patterns; model predictions align qualitatively with regression results and heterogeneity tests.

medium mixed Attention to Whom? AI Adoption and Corporate Social Responsi... Managerial attention (theoretical/mediating construct)

Trust calibration influences project performance outcomes: organizations tend toward metric-driven evaluation of AI outputs and use AI to strategically augment human expertise, but miscalibration risks overreliance or inappropriate metric focus that can harm performance.

Based on participants' reported experiences in the 40 interviews and interpretive thematic analysis linking trust practices to observed/perceived performance consequences (shift to metric-based evaluation, strategic use, and noted risks).

medium mixed AI in project teams: how trust calibration reconfigures team... project performance (measured outputs, augmentation of expertise, error rates/qu...

Trust calibration shapes collaboration patterns, including delegation of oversight to systems or specialists, changes in communication networks (who talks to whom), and erosion of informal ad hoc communications used previously for tacit coordination.

Observed in interview narratives (40 interviews) and thematic coding showing repeated reports of shifted oversight roles, altered communication pathways, and reduced informal coordination after AI integration.

medium mixed AI in project teams: how trust calibration reconfigures team... collaboration dynamics (oversight delegation, communication patterns, informal c...

Trust calibration is produced and maintained through ongoing boundary work between humans and machines (i.e., teams continuously negotiate which inputs/responsibilities are treated as human versus machine).

Derived from participants' accounts in the 40 interviews and thematic analysis documenting repeated examples of role negotiation and boundary-setting between people and AI systems during project routines.

medium mixed AI in project teams: how trust calibration reconfigures team... trust calibration practices / boundary work (who is responsible for tasks/inputs...

Trust in AI within project-based work is situational and socially distributed across team members, rather than a stable individual attitude.

The claim is based on thematic qualitative analysis of 40 semi-structured interviews with project professionals across multiple industries in the UK. Interview data showed variation in how different team members described their trust in systems depending on role, task, and context.

medium mixed AI in project teams: how trust calibration reconfigures team... trust in AI (nature/distribution of trust across individuals and situations)

Explicit governance reduces negative externalities (bias, privacy breaches, loss of trust) but entails compliance costs that should be factored into adoption and diffusion models.

Conceptual claim synthesizing trade‑off arguments from governance and risk literatures and practitioner examples; not measured empirically in the paper.

medium mixed Symbiarchic leadership: leading integrated human and AI cybe... incidence of bias/privacy breaches/loss of trust; governance/compliance costs

Embedding AI into workflows may change firm boundaries (e.g., outsourcing models vs. in‑house systems) and make investments in internal auditability and explainability strategic assets.

Theoretical implication drawn from synthesis of organizational boundary theory and practitioner trends; suggested rather than empirically demonstrated within the paper.

medium mixed Symbiarchic leadership: leading integrated human and AI cybe... firm boundaries (insourcing vs outsourcing); value of internal governance capabi...

AI is likely to continue shifting the frontier of early discovery and increase the throughput and quality of hypotheses, but persistent biological uncertainty and the cost of clinical validation mean AI will complement—not fully replace—traditional R&D for the foreseeable future.

Synthesis of technological trends, application successes and limitations, translational risk, and economic reasoning presented throughout the paper.

medium mixed Has AI Reshaped Drug Discovery, or Is There Still a Long Way... long-run role of AI in drug discovery (degree of complementarity versus replacem...

Proprietary data, precompetitive consortia, and platform consolidation can create barriers to entry; public-data initiatives could alter competitive dynamics.

Market-structure analysis and discussion of data-access models in the paper, with examples of consortia and proprietary platform effects.

medium mixed Has AI Reshaped Drug Discovery, or Is There Still a Long Way... barriers to entry and competitive dynamics influenced by data-sharing models and...

Expect strong returns-to-scale and winner-take-most dynamics: large incumbents and well-funded startups with proprietary data/compute may dominate the field.

Economic reasoning and observations in the paper about data/compute concentration, platform effects, and market outcomes.

medium mixed Has AI Reshaped Drug Discovery, or Is There Still a Long Way... market concentration and returns-to-scale in AI-driven drug discovery firms

Realizing economic gains at scale from AI in drug R&D is constrained by data quality and access, high implementation and integration costs, regulatory uncertainty, and ethical/legal concerns; these constraints will shape how gains are distributed across firms, countries, and patients.

Aggregate conclusion of the narrative review synthesizing documented benefits and recurring constraints from published studies, case reports, industry/regulatory analyses; qualitative synthesis without quantitative projection of distributional outcomes.

medium mixed From Algorithm to Medicine: AI in the Discovery and Developm... scale of economic gains (industry-wide productivity); distributional outcomes ac...

Adoption of AI in pharma will increase demand for computational biologists, ML engineers, and data scientists and may displace or redefine some traditional bench roles.

Labor-market trend reports and organizational case studies included in the review noting hiring patterns and role changes; qualitative synthesis rather than comprehensive labor-market study.

medium mixed From Algorithm to Medicine: AI in the Discovery and Developm... employment composition by role; hiring demand for computational vs. bench roles
