Evidence (5539 claims)
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Adoption
The paper provides concrete, regulation-inspired policy examples (e.g., content prohibition, sensitive data exfiltration) showing how such constraints map into the Policy function.
Worked, illustrative examples included in the paper mapping regulatory constraints to the Policy(agent_id, partial_path, proposed_action, org_state) formalism.
Runtime policy evaluation can intercept, score, log, allow/modify/block actions, and update organizational state as part of an agent's execution loop (reference implementation architecture).
Reference implementation design described in the paper (runtime policy evaluator hooks, logging, enforcement actions); architectural reasoning and pseudo-workflows provided; no production deployment data.
Policies can be formalized as deterministic functions p_violation = Policy(agent_id, partial_path, proposed_action, org_state) that return a probability or score of violation for a proposed next action.
Formal definition and mapping in the paper; worked examples showing how regulatory-style constraints map into this function; no large-scale empirical validation.
Effective governance for agentic LLM systems requires treating the execution path as the central object and performing runtime evaluation of proposed next actions given the partial path.
Theoretical argument and formal proposal of runtime policy evaluator that takes (agent_id, partial_path, proposed_action, org_state) and returns a violation probability; reference architecture described; illustrative examples.
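A minimal sketch of the runtime policy-evaluation loop described above. The `Policy(agent_id, partial_path, proposed_action, org_state)` signature comes from the paper; the rule contents, scores, and threshold below are purely illustrative stand-ins, not the paper's implementation.

```python
def policy(agent_id, partial_path, proposed_action, org_state):
    """Return an estimated violation probability for a proposed next action.
    The checks are illustrative regulation-inspired examples."""
    score = 0.0
    if proposed_action.get("type") == "publish" and "prohibited" in proposed_action.get("content", ""):
        score = max(score, 0.9)                 # content prohibition
    if proposed_action.get("target") == "external" and org_state.get("data_class") == "sensitive":
        score = max(score, 0.95)                # sensitive data exfiltration
    if len(partial_path) > org_state.get("max_steps", 100):
        score = max(score, 0.5)                 # runaway execution path
    return score

def execute_with_policy(agent_id, actions, org_state, threshold=0.8):
    """Intercept, score, log, and allow/block each proposed action,
    treating the growing execution path as the central object."""
    path, log = [], []
    for action in actions:
        p_violation = policy(agent_id, path, action, org_state)
        decision = "block" if p_violation >= threshold else "allow"
        log.append((action, p_violation, decision))
        if decision == "allow":
            path.append(action)                 # action joins the partial path
    return path, log
```

A fuller evaluator would also modify actions and update `org_state` inside the loop, as the reference architecture describes.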
Multiple off-the-shelf vision-language models (closed-source and open-source) representative of current state-of-the-art architectures were benchmarked.
Paper reports experiments across a mix of closed-source and open-source VLMs; exact model names provided in the released materials.
Evaluation targets include correctness, consistency, and update efficacy, operationalized via quantitative metrics (accuracy, consistency rates, update success rate).
Methods section describing evaluation metrics and how correctness, consistency, and update efficacy are measured across experiments.
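One way the three metric families might be operationalized; the function names and input shapes are assumptions for illustration, not the paper's code.

```python
def accuracy(preds, golds):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def consistency_rate(groups):
    """Fraction of items whose answer is identical across all input
    perturbations; `groups` maps item id -> answers under each perturbation."""
    return sum(len(set(answers)) == 1 for answers in groups.values()) / len(groups)

def update_success_rate(post_update_preds, updated_golds):
    """Fraction of updated facts answered correctly after a knowledge update."""
    return accuracy(post_update_preds, updated_golds)
```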
A curated set of time-sensitive factual items (e.g., officeholders, company statuses, recent awards/results) was used to construct the benchmark.
Benchmark composition description listing categories of time-sensitive facts and methodology for curation of items used in experiments.
The authors release the V-DyKnow benchmark, code, and evaluation data for community use.
Statement in paper and accompanying release materials indicating benchmark, code, and evaluation data are publicly available.
V-DyKnow is a benchmark specifically designed to evaluate time-sensitive factual knowledge in vision-language models across both text and image modalities.
Release and description of the benchmark in the paper: curated set of time-sensitive factual items, paired multimodal stimuli (text + images), input perturbations, and evaluation scripts. Methodological description of benchmark composition and tasks.
Ethical handling: the study involved sensitive material (self-harm, trauma) and authors applied validation and careful handling consistent with research ethics.
Ethics section and methods describing sensitivity of material and precautions taken in data handling and validation.
Selected coded items (for example, suicidal messages) were validated by the authors to increase reliability of certain critical annotations.
Methods section describing validation procedures applied to selected items such as suicidal ideation.
The authors developed and applied a manual codebook of 28 behavioral/phenomenological codes (e.g., delusional thinking, suicidal ideation, chatbot sentience claims, romantic interest) across the full corpus.
Method section describing construction of a 28-code inventory and manual coding applied to entire dataset.
The parallel associative scan performs the reductions required by Newton-style updates across time steps, enabling parallelism across the sequence length.
Algorithmic construction and implementation details in the thesis showing how associative scan operations aggregate intermediate Jacobian/update information across time; examples provided in the implementation section.
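The idea can be illustrated on the simplest case: an affine recurrence h_t = a_t·h_{t-1} + b_t, whose per-step maps compose associatively and can therefore be reduced by a scan. This toy sketch runs the combine tree serially; real implementations (e.g., via `jax.lax.associative_scan`) execute the same tree in parallel across the sequence. It is not the thesis's algorithm, just a minimal instance of the scan mechanism.

```python
def combine(f, g):
    """Compose two affine maps (a, b), each meaning x -> a*x + b.
    Applying f then g gives x -> g_a*(f_a*x + f_b) + g_b."""
    (fa, fb), (ga, gb) = f, g
    return (ga * fa, ga * fb + gb)

def associative_scan(elems):
    """Inclusive scan by recursive doubling (Hillis-Steele); serial here,
    but each inner pass is independent work parallelizable over i."""
    out = list(elems)
    n, step = len(out), 1
    while step < n:
        for i in range(n - 1, step - 1, -1):   # downward keeps in-place safe
            out[i] = combine(out[i - step], out[i])
        step *= 2
    return out

# Recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0:
coeffs = [(0.5, 1.0), (2.0, -1.0), (1.0, 3.0)]
prefix = associative_scan(coeffs)
h = [a * 0.0 + b for a, b in prefix]   # apply each prefix map to h_0
```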
The thesis proves linear convergence rates for a family of fixed-point/Newton-like solvers, with rates depending on approximation accuracy and stability properties of the chosen method.
Mathematical proofs and convergence theorems provided in the theoretical analysis section establishing linear rates under stated assumptions (bounds on approximation error, stability metrics).
Evaluation of dynamical systems can be cast as solving a system of nonlinear equations, enabling parallel solution methods.
Methodological framing and derivation in the thesis showing recurrent updates and Markov transitions can be represented as a global nonlinear root-finding problem; algorithmic constructions follow from this representation.
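A toy illustration of the framing, under assumed names: a sequential recurrence h_t = f(h_{t-1}, x_t) is recast as the global system F(H) = 0 with F_t(H) = H_t - f(H_{t-1}, x_t), then solved by Jacobi-style fixed-point sweeps in which every time step updates simultaneously. This is a minimal instance of the representation, not the thesis's solver.

```python
import math

def f(h_prev, x):
    """Toy recurrent update, a stand-in for the dynamical system."""
    return math.tanh(0.5 * h_prev + x)

def sequential(xs, h0=0.0):
    """Ordinary step-by-step evaluation of the recurrence."""
    h, out = h0, []
    for x in xs:
        h = f(h, x)
        out.append(h)
    return out

def solve_as_fixed_point(xs, h0=0.0, iters=50):
    """Solve F(H) = 0 by fixed-point iteration H_t <- f(H_{t-1}, x_t);
    each sweep updates all time steps at once, so it parallelizes
    across the sequence length (exact after len(xs) sweeps here,
    since the causal system is lower-triangular)."""
    H = [0.0] * len(xs)
    for _ in range(iters):
        prev = [h0] + H[:-1]
        H = [f(p, x) for p, x in zip(prev, xs)]   # parallelizable sweep
    return H

xs = [0.3, -0.2, 0.8]
```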
Explicit enforcement of signal constraints in DeePC provides a safety/operational advantage over many pure learning approaches that do not explicitly enforce hard constraints.
Algorithmic formulation includes constraints in the optimization; paper contrasts this with unconstrained learning-based controllers and demonstrates constrained, feasible actuation in simulation.
DeePC can compute traffic-light actuation sequences that respect hard operational and safety constraints (e.g., phasing, minimum/maximum green times).
Formulation of DeePC as a constrained optimization problem in the paper with explicit constraint terms for signal phasing and safety; implemented in simulation experiments where constraints are enforced in the controller optimization.
Reframing urban traffic dynamics with behavioral systems theory allows system evolution to be learned and predicted directly from measured input–output data (no explicit model identification).
Theoretical exposition in the paper showing that traffic trajectories can be represented as linear combinations of past measured trajectories via Hankel/data matrices; used as the basis for predictive control (DeePC).
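The data matrix at the heart of this representation can be sketched directly: a Hankel matrix built from a measured trajectory, whose columns are overlapping windows. In the behavioral view, any length-`depth` trajectory the (sufficiently excited) system can generate is a linear combination of these columns. The signal here is illustrative scalar data, not traffic measurements.

```python
def hankel(signal, depth):
    """Hankel matrix with `depth` rows from a 1-D measured signal.
    Column j is the window signal[j : j + depth]."""
    cols = len(signal) - depth + 1
    return [[signal[i + j] for j in range(cols)] for i in range(depth)]

traj = [1, 2, 3, 4, 5]
H = hankel(traj, 3)
# H == [[1, 2, 3],
#       [2, 3, 4],
#       [3, 4, 5]]
```

DeePC then selects a combination of these columns (subject to the signal constraints) directly inside its optimization, with no intermediate parametric model.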
Applying DeePC yields measurable improvements in system-level outcomes (reduced total travel time and CO2 emissions) in a very large, high-fidelity microscopic simulation of Zürich.
Simulation experiments in a city-scale, high-fidelity microscopic closed-loop simulator of Zürich comparing DeePC-controlled signals against baseline controllers (e.g., fixed-time or standard adaptive schemes); reported reductions in aggregated metrics (total travel time and CO2 emissions).
A model-free traffic control approach (DeePC) can steer urban traffic via dynamic traffic-light control without building explicit traffic models.
Algorithmic/theoretical development (behavioral systems theory + DeePC) and controller-in-loop experiments in a high-fidelity microscopic closed-loop simulator of Zürich demonstrating closed-loop control using only input–output trajectory data (Hankel matrices) rather than parametric model identification.
The model weights will be open (open-weight release) to support European sovereignty and adoption.
Authors state intent to publish open weights and position the model as an open-weight European alternative; the summary reports this as a declared objective. The paper likely includes a licensing/availability statement.
Calibration data must be representative of deployment data to preserve conformal statistical guarantees in practice.
Theoretical requirement of exchangeability for conformal guarantees combined with empirical results where mismatched calibration caused guarantee violations or degraded factuality.
The paper introduces informativeness-aware metrics to measure task utility under conformal filtering, going beyond pure factuality rates.
Methodological contribution described: new metrics that penalize vacuous outputs and quantify retained task utility after filtering.
Decomposing generated outputs into atomic claims and calibrating a verifier score threshold on held-out data yields a statistically valid guarantee (under exchangeability) that claims passing the threshold meet a target factuality level.
Method description and theoretical use of conformal calibration applied to per-claim scores, with held-out calibration set used to set the threshold; conforms to standard conformal prediction methodology presented in the paper.
Conformal factuality provides distribution-free statistical guarantees for claim-level correctness in retrieval-augmented LLM outputs.
The paper applies conformal calibration to atomic claims: decompose outputs into atomic claims, score each claim with a verifier, and calibrate a score threshold on held-out (exchangeable) data to guarantee a target claim-level factuality rate. This is a theoretical property of conformal methods described and implemented in the paper.
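A simplified claim-level sketch of the calibration step, under assumed names: given held-out verifier scores with correctness labels, sweep candidate thresholds from high to low and keep the lowest one at which retained claims meet the target factuality rate. This omits the finite-sample correction a full conformal procedure would apply and is not the paper's implementation.

```python
def calibrate_threshold(cal_scores, cal_correct, alpha=0.1):
    """Smallest verifier-score threshold such that, on the held-out
    calibration set, claims scoring at or above it are factual at
    rate >= 1 - alpha. Lower thresholds retain more claims."""
    pairs = sorted(zip(cal_scores, cal_correct), reverse=True)
    kept_correct = kept_total = 0
    best = None
    for score, correct in pairs:          # sweep thresholds high -> low
        kept_total += 1
        kept_correct += correct
        if kept_correct / kept_total >= 1 - alpha:
            best = score                  # this cutoff still meets the target
    return best

def filter_claims(claims, scores, tau):
    """Keep only atomic claims whose verifier score clears the threshold."""
    return [c for c, s in zip(claims, scores) if s >= tau]
```

Under exchangeability of calibration and deployment data (the requirement noted above), the calibrated threshold transfers its factuality guarantee to new outputs.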
Traditional machine-learning baselines were included for comparison in the benchmarks.
Paper explicitly states that traditional ML baselines were used alongside TSFMs in benchmarking experiments. The summary does not list which baselines or their quantitative results.
The dataset sampling resolution is at the millisecond level, enabling forecasting horizons from 1 step (100 ms) up to 96 steps (9.6 s).
Paper states sampling resolution is millisecond-level and defines forecasting tasks spanning 1 to 96 steps (100 ms to 9.6 s). This is a methodological description rather than an experimental metric.
Introduces a new millisecond-resolution dataset of wireless channel and traffic-condition measurements from an operational 5G deployment.
Paper describes collection of operational 5G telemetry at millisecond sampling resolution; dataset is presented as a novel domain addition to TSFM pretraining corpora. Exact number of records/sessions not specified in the provided summary.
Under pathological label heterogeneity (mutually exclusive local labels) FederatedFactory restores CIFAR-10 classification accuracy from a collapsed baseline of 11.36% to 90.57%.
Empirical experiment reported on CIFAR-10 configured as a pathological heterogeneity stress test; paper reports baseline collapsed accuracy (11.36%) and FederatedFactory result (90.57%). (Specific sample sizes / client counts not provided in the summary.)
A single communication round of generative-module exchange suffices for clients to synthesize class-balanced datasets locally and align their training data.
Paper reports a single exchange of generative modules across clients (one communication round) and uses that to synthesize a globally class-balanced training set at each client; experiments (CIFAR-10, MedMNIST, ISIC2019) are run under this one-round regime.
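A schematic of the one-round regime, with trivial samplers standing in for trained generative modules; everything below (names, data, client setup) is a hypothetical illustration of the protocol shape, not the paper's system.

```python
import random

def make_generator(cls, seed):
    """Stand-in for a client's trained per-class generative module:
    emits labeled placeholder samples for class `cls`."""
    rng = random.Random(seed)
    return lambda n: [(f"sample_{cls}_{rng.random():.3f}", cls) for _ in range(n)]

# Pathological heterogeneity: each client holds (and can model) one class only.
clients = {c: {c: make_generator(c, c)} for c in range(3)}

# Single communication round: every client receives every generator.
shared = {}
for gens in clients.values():
    shared.update(gens)

def balanced_dataset(generators, per_class=4):
    """Each client synthesizes a class-balanced local training set."""
    data = []
    for cls, gen in sorted(generators.items()):
        data.extend(gen(per_class))
    return data

local = balanced_dataset(shared)
```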
Convergence of the three complementary methods (lexical, paraphrase, behavioral) strengthens confidence that contamination is real and systematically inflates scores.
Triangulation across Experiment 1 (lexical detection on public corpora), Experiment 2 (paraphrase robustness on 100-question subset), and Experiment 3 (TS‑Guessing on all items); consistent patterns observed across methods.
All 13 surveyed generative systems report addressing syntactic validity (Layer 1).
For each of the 13 systems the review reports syntactic/parse/compile checks or token-level validity tests under Layer 1 in the systematic application of the evaluation framework.
BenchPreS can be used as an evaluative tool for mechanism designers and regulators to measure and compare models' context‑sensitivity to guide incentives, penalties, or certification regimes.
Methodological claim about the benchmark's applicability: BenchPreS produces MR and AAR metrics that can be used for comparisons; paper suggests use in policy/design contexts.
BenchPreS provides a benchmark and evaluation protocol that systematically varies stored user preference, interaction partner (self vs third party), and normative requirement to assess appropriate suppression or application of preferences.
Dataset construction and evaluation procedure described: scenario generation varying preference, partner, and normative appropriateness; MR and AAR computed across the scenario set.
Historical transitions in standard work hours (e.g., six-day to five-day week) show that phased implementation, collective bargaining, and complementary policies can make work-time reductions feasible and economically beneficial.
Historical analyses and case studies of past industrialized-country workweek transitions cited in the synthesis; evidence drawn from historical institutional records and prior economic histories rather than a unified econometric analysis.
The paper advances a replicable interdisciplinary synthesis method and provides a simulated dataset and transparent protocols enabling other researchers to adapt the approach.
Methods section detailing systematic literature search protocols (ACM/IEEE/Springer, 2020–2024), inclusion criteria, simulation parameterization for the cross-sectoral dataset (seven industries, 2020–2024), and stated reproducibility materials.
AI adoption is strongly associated with workforce skill transformation (reported correlation r = 0.71).
Correlational analysis reported in the paper using the simulated cross-sectoral dataset that mirrors employment trends across seven industries (Manufacturing, Healthcare, Finance, Education, Transportation, Retail, IT Services) over 2020–2024. This corresponds to sector-year observations (7 sectors × 5 years = 35 observations) and is triangulated with findings from a systematic literature synthesis (ACM, IEEE, Springer publications 2020–2024).
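The reported r = 0.71 is a standard Pearson correlation over the 35 sector-year observations. A sketch of the computation on illustrative toy vectors (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy sector-year pairs (AI adoption index, skill-transformation index):
adoption = [0.1, 0.3, 0.4, 0.6, 0.8]
skills   = [0.2, 0.2, 0.5, 0.5, 0.9]
```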
The evaluation compared models on multiple metrics (accuracy, precision, recall, F1, AUC) across repeated trials and cross-company tests, and reported gains for AI methods across these metrics.
Evaluation protocol described: repeated trials, cross-validation, holdout sets, cross-company tests; reported performance improvements for AI models on the listed metrics.
Ensemble methods and deep learning models show the largest and most consistent improvements in predictive performance relative to classic statistical models.
Aggregate results across repeated trials and evaluation metrics indicate Random Forests and Gradient Boosting (ensembles) and deep neural networks outperform linear/logistic regression and other baselines on the publicly available datasets used.
Modern AI-driven prediction methods (especially ensemble models and deep neural networks) systematically outperform traditional statistical approaches at predicting job performance in publicly available workforce datasets.
Direct model comparison reported in the paper: baseline statistical models (linear/logistic regression) versus machine learning models (Random Forest, Gradient Boosting, SVM, deep neural networks) evaluated on multiple publicly available workforce datasets using cross-validation and holdout sets; performance reported on accuracy, precision, recall, F1, and AUC across repeated trials.
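The comparison metrics named above all derive from the confusion matrix; a minimal binary-case sketch (the paper's own evaluation code is not available here):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

AUC additionally requires ranked scores rather than hard labels, which is why it is reported separately from the threshold-based metrics.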
Research priorities include rigorous real-world trials assessing patient outcomes, cost-effectiveness, and labor impacts; comparative studies of integration strategies; measurement of long-run workforce effects; and development of standard metrics and monitoring frameworks.
Explicit recommendations from the narrative review based on identified gaps: scarcity of RCTs, economic analyses, and long-term workforce studies.
Economists and researchers should measure organizational mediators (governance, mentoring practices, learning processes) alongside AI adoption and use empirical designs such as difference-in-differences with phased rollouts, randomized mentoring/training interventions, matched employer–employee panels, and IV exploiting exogenous shocks to innovation backing to identify causal effects.
Methodological recommendations and proposed empirical designs contained in the paper; no implementation or empirical results reported.
The integrated framework links multi-level outcomes: micro (individual skills, task performance), meso (team coordination, workflows), and macro (organizational strategy, innovation, productivity) effects to adaptive structuration processes and affordance actualization.
Framework specification and theoretical mapping across levels in the conceptual paper; no empirical validation or sample.
The paper develops a conceptual framework that integrates Adaptive Structuration Theory (AST) and Affordance Actualization Theory (AAT) to explain how effective human–AI collaboration can be structured within organizations.
Conceptual/theoretical synthesis and literature integration combining AST and AAT streams; no original empirical data or sample reported (theoretical development).
Reward shaping at the assignment layer enables an explicit trade-off between diagnostic accuracy and human labor by incorporating penalties for human involvement.
Methodology section describing reward shaping and experimental comparisons showing different accuracy/human-effort trade-offs (results reported in paper; exact experimental details not provided in the summary).
Masked reinforcement learning techniques constrain or mask action spaces, reducing exploration over huge symptom/action spaces.
Paper describes use of masked RL to limit action options during training and execution; used in both assignment and execution layers (methodological claim supported by algorithmic description and experiments).
The upper layer ('master') learns turn-by-turn human–machine assignment using masked reinforcement learning with reward shaping to balance accuracy and human cost.
Methodological description in the paper and empirical results from experiments using masked RL and reward-shaped objectives at the assignment layer (implementation and experimental setup reported; dataset/sample size not specified in summary).
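The two mechanisms above, action masking and human-cost reward shaping, can be sketched in isolation; action indices, logits, and the penalty weight below are hypothetical, not the paper's configuration.

```python
import math

def masked_softmax(logits, mask):
    """Action distribution with invalid actions masked to probability 0,
    so the policy never explores outside the permitted action set."""
    exps = [math.exp(l) if m else 0.0 for l, m in zip(logits, mask)]
    z = sum(exps)
    return [e / z for e in exps]

def shaped_reward(correct_diagnosis, human_turns, lam=0.1):
    """Accuracy reward minus a per-turn penalty for human involvement;
    the lambda knob trades diagnostic accuracy against human labor."""
    return (1.0 if correct_diagnosis else 0.0) - lam * human_turns

# Assignment-layer actions: 0 = route turn to machine, 1 = route to human.
logits = [0.2, 1.5]
mask = [True, False]          # e.g. no clinician available this turn
probs = masked_softmax(logits, mask)
```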
The paper advances augmentation debates by articulating the leader’s practical role when decision lead‑agency shifts between humans and AI and by detailing systemic HR changes needed to sustain performance, legitimacy and well‑being.
Stated contribution of the conceptual synthesis comparing existing augmentation and leadership literatures and providing an HR‑focused framework; descriptive of the paper's intellectual contribution.
Core practice 4 — Embed governance: make accountability, bias testing, privacy safeguards, audit trails, escalation thresholds and human oversight explicit and routine.
Prescriptive governance practice grounded in literature on algorithmic accountability and risk management and in practitioner examples; presented without original empirical validation.
Core practice 3 — Manage the human–AI relationship: build adoption, psychological safety and calibrated trust; address automation anxiety and misuse.
Framework recommendation synthesizing organizational‑psychology and technology adoption literature plus practitioner observations; not tested empirically in the paper.