Evidence (14055 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping.

Paper contribution: authors present a four-tier taxonomy as a conceptual framework; this is a descriptive/constructive claim about the content of the paper itself.

high neutral Generative AI Advertising as a Problem of Trustworthy Commer... categorization of types of commercial influence in generative systems

This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals.

Method: sociotechnical audit comparing six commercial LLMs against a rubric derived through a Delphi process with 20 infrastructure professionals.

high neutral Governance risks of AI reasoning in urban infrastructure thr... comparison of LLM reasoning to expert-derived rubric

Regulatory technology is viewed as a governance arrangement that organizes relations between firms, banks, insurers, logistics actors, buyers, and regulators.

Conceptual framing developed through the interpretive synthesis of multiple literature streams in the paper.

high neutral RegTech-enabled governance of sanctions-safe enterprise ecos... conceptual role of RegTech in organizing inter-actor relations

We design a budget split intervention that directly incorporates unknown users and targets users with Google-inferred gender labels (male, female).

Authors' stated experimental/intervention design implemented in collaboration with a state-level government agency; methodological claim about the intervention (no sample size or deployment details in the excerpt).

high neutral Into the Unknown: Accounting for Missing Demographic Data wh... design and implementation of a budget split intervention incorporating unknown u...

The framework reframes the central question of autonomous software engineering from whether a foundation model can produce a patch to whether the model-harness-environment system can produce a verifiably correct, attributed, and maintainable change.

Conceptual reframing and argument presented in the abstract as a conclusion of the proposed framework and evaluation approach.

high neutral AI Harness Engineering: A Runtime Substrate for Foundation-M... ability of the overall system (model+harness+environment) to produce verifiably ...

We formalize this substrate as 'AI Harness Engineering' and identify eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording.

Methodological/conceptual contribution described in the paper (abstract) that lists eleven component responsibilities as part of the formalization.

high neutral AI Harness Engineering: A Runtime Substrate for Foundation-M... completeness and scope of responsibilities required for a runtime harness

Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results.

Experimental setup reported by authors: 9 models from 3 providers, with 30 trials per model using real SDK tool-use and autonomous graph queries.

high neutral Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... experimental coverage and evaluation methodology (models invoked graph query too...

Oracle Poisoning manipulates the data agents reason over, not their instructions, distinguishing it from prompt injection.

Theoretical distinction and definitional comparison made by the authors (conceptual argument in the paper).

high neutral Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... mechanism of attack (data-layer vs instruction-layer manipulation)

A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering.

Methodological claim describing a six-gate producer audit procedure in the paper to diagnose engineering failures vs. commercial steering.

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... ability to distinguish engineering failures from commercial steering

Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual).

Method description of the paired counterfactual experimental design used by TourMart.

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... steering delta (difference in acceptance between commission-aware and minimum-di...
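
The paired counterfactual described in this claim can be sketched as follows. This is a minimal illustration, not TourMart's implementation: the `PairedTrial` record, the boolean acceptance outcomes, and the sample data are hypothetical, and the delta is assumed to be a simple difference in acceptance rates across pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedTrial:
    """One traveler-bundle pair evaluated under both prompts (hypothetical record)."""
    traveler_id: str
    bundle_id: str
    accepted_commission_aware: bool  # outcome under the commission-aware prompt
    accepted_factual: bool           # outcome under the minimum-disclosure template


def steering_delta(trials):
    """Mean within-pair difference in acceptance: commission-aware minus factual."""
    if not trials:
        raise ValueError("need at least one paired trial")
    diffs = [int(t.accepted_commission_aware) - int(t.accepted_factual) for t in trials]
    return sum(diffs) / len(diffs)


# Hypothetical data: traveler and bundle are held fixed within each pair.
trials = [
    PairedTrial("t1", "b1", True, False),
    PairedTrial("t2", "b1", True, True),
    PairedTrial("t3", "b2", False, False),
    PairedTrial("t4", "b2", True, False),
]
delta = steering_delta(trials)
print(delta)
```

Because each pair shares the same traveler and bundle, any nonzero delta in this sketch is attributable to the prompt alone rather than to differences between travelers or offers.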

We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers — lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare).

Methodological proposal described in paper: design of an audit instrument and two formal levers (lambda, kappa).

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... audit instrument capability for measuring message-induced perception shifts unde...

Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice.

Descriptive assertion in paper about product/industry UI change; no empirical sample or formal measurement reported in excerpt.

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... interface format (ranked-list → single-sentence conversational recommendation)

We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget.

Methods / experimental setup reported in the paper: five function-calling benchmarks and a DistilBERT classifier trained and deployed under latency constraints.

high neutral Switchcraft: AI Model Router for Agentic Tool Calling evaluation framework and classifier training/deployment

We show that ρ ≥ 1 is the no-excess-crowding parity condition and connect Δ to an adoption game with exposure-dependent redundancy costs.

Theoretical result derived in the paper linking the human-relative diversity ratio ρ to a parity condition and relating the excess-crowding coefficient Δ to an adoption-game model with exposure-dependent redundancy costs.

high neutral Ex Ante Evaluation of AI-Induced Idea Diversity Collapse parity condition for no-excess-crowding (ρ ≥ 1) and economic/game-theoretic rela...

We characterize optimal and fair policies in the short term.

Theoretical results/characterizations presented in the paper identifying optimal policies and fair-policy structures for the short-term setting.

high neutral Price of Fairness in Short-Term and Long-Term Algorithmic Se... policy optimality under short-term fairness constraints

We theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF).

Theoretical analysis in the paper using the Price of Fairness formalism to study trade-offs.

high neutral Price of Fairness in Short-Term and Long-Term Algorithmic Se... trade-off between utility (decision-maker objective) and fairness constraints (P...
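
As a rough illustration of the trade-off in this claim, the sketch below computes a Price of Fairness for a toy selection problem. It assumes PoF is the relative utility lost when a fairness constraint is imposed; the paper's exact definition is not given in the excerpt. The equal per-group quota is a stand-in constraint, and the scores are hypothetical.

```python
def top_k_utility(scores, k):
    """Utility of selecting the k highest-scoring candidates."""
    return sum(sorted(scores, reverse=True)[:k])


def fair_utility(scores_by_group, quotas):
    """Utility when each group fills its own quota with its best candidates."""
    return sum(top_k_utility(scores, quotas[group])
               for group, scores in scores_by_group.items())


def price_of_fairness(u_opt, u_fair):
    """Relative utility loss from the fairness constraint (assumed definition)."""
    if u_opt <= 0:
        raise ValueError("unconstrained utility must be positive")
    return (u_opt - u_fair) / u_opt


# Hypothetical candidate scores for two groups; four slots in total.
scores_by_group = {"a": [9.0, 8.0, 7.0, 6.0], "b": [5.0, 4.0, 3.0, 2.0]}
all_scores = [s for scores in scores_by_group.values() for s in scores]

u_opt = top_k_utility(all_scores, 4)                        # unconstrained optimum
u_fair = fair_utility(scores_by_group, {"a": 2, "b": 2})    # equal-quota policy
pof = price_of_fairness(u_opt, u_fair)
print(u_opt, u_fair, round(pof, 4))
```

A PoF of 0 under this definition would mean the fairness constraint costs nothing; larger values mean the decision-maker gives up a larger share of attainable utility.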

We introduce notions of group fairness for both the short and long term.

Methodological contribution in the paper: formal definitions of short-term and long-term group fairness introduced by the authors.

high neutral Price of Fairness in Short-Term and Long-Term Algorithmic Se... definitions of group fairness (short-term, long-term)

The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).

Authors' framing and explicit statements of intended contribution; supported by the failure diagnosis, diagnostic protocol, and Trace-Prior RL repair demonstrated in simulator experiments.

high neutral Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... methodological reproducibility and conceptual framing

We position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots.

Architectural/framework description in the paper that maps DePIN elements into a vertically integrated stack; conceptual/mapping method without empirical measurement.

high neutral DAO-enabled decentralized physical AI: A new paradigm for hu... conceptual integration of DePIN components into a vertical infrastructure stack

We evaluate 4 popular agent harnesses and 7 foundation models on Workspace-Bench.

Experimental setup reported in the paper listing 4 agent harnesses and 7 foundation models used in evaluations.

high neutral Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tas... number of agent harnesses and foundation models evaluated

The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool.

System description in paper noting that the curated index is available via a Gosset MCP server for external models to call.

high neutral Curated AI beats frontier LLMs at pharma asset discovery availability of curated index as callable MCP server

All five systems receive the same natural-language query and the same JSON output schema.

Methodological detail reported in paper describing controlled inputs across systems.

high neutral Curated AI beats frontier LLMs at pharma asset discovery consistency of input/query and output schema across systems

We benchmark Gosset ... against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets.

Experimental benchmark described in paper: direct comparison of Gosset versus four named models on 10 targets; methodological statement.

high neutral Curated AI beats frontier LLMs at pharma asset discovery comparative retrieval performance on 10 niche oncology/immunology targets

Weight-based memory generalizes by applying abstract rules to inputs never seen before.

Conceptual claim grounded in the paper's theoretical distinction between weight-based learning and retrieval; references Complementary Learning Systems theory; no empirical sample in abstract.

high neutral Contextual Agentic Memory is a Memo, Not True Memory type of generalization performed by weight-based memory

Retrieval generalizes by similarity to stored cases.

Conceptual claim stated in paper (distinction between retrieval-based and weight-based generalization); supported by theoretical characterization, not empirical data in abstract.

high neutral Contextual Agentic Memory is a Memo, Not True Memory type of generalization performed by retrieval systems

The study uses LinkedIn and GitHub data to examine firms' adoption of GitHub Copilot and related SWE skills and labor outcomes.

Statement of data sources and study design reported in the paper (LinkedIn profiles/skill listings linked to GitHub repository/adoption signals).

high neutral Firms' GitHub Copilot adoption and labor market outcomes for... data sources / methodological description

The process of synthesizing information is inherently iterative: users explore content, identify relationships between concepts, and continuously reorganize their mental models.

Conceptual description of the cognitive/process characteristics in the paper's background/motivation (no empirical measurement reported).

high neutral MindTrellis: Co-Creating Knowledge Structures with AI throug... iterative nature of knowledge synthesis (exploration, relation identification, r...

Many practical machine learning applications are online and sequential, meaning prior decisions inform future ones — a setting in which fairness challenges differ from standard supervised learning.

Background claim in the paper motivating the work; literature context and conceptual discussion rather than new empirical data.

high neutral Fairness under uncertainty in sequential decisions characterization of ML application setting (online/sequential)

The paper establishes a taxonomy of forgetting mechanisms: passive decay-based, active deletion-based, safety-triggered, and adaptive reinforcement-based.

Explicit taxonomy presented in paper (listed in abstract).

high neutral FSFM: A Biologically-Inspired Framework for Selective Forget... classification of forgetting mechanisms

We evaluate Aether over synthetic network change scenarios covering main classes of network changes and on past incidents from a major ISP operational network.

Evaluation methodology stated in paper abstract: tested on synthetic scenarios and historical incidents from one major ISP (no numeric sample size provided in abstract).

high neutral Aether: Network Validation Using Agentic AI and Digital Twin evaluation dataset composition (synthetic scenarios + past ISP incidents)

Expert assessment involved three senior academics producing reports and appointment-level syntheses.

Paper states that three senior academics produced assessment reports and synthesised appointment-level recommendations; n=3 assessors.

high neutral The Relic Condition: When Published Scholarship Becomes Mate... expert assessment procedure (number and type of assessors)

The distillation pipeline used an eight-layer extraction method and a nine-module skill architecture grounded in local, closed-corpus analysis.

Methods description in paper specifying an eight-layer extraction approach and nine-module skill architecture; presented as the technical design of the distillation pipeline.

high neutral The Relic Condition: When Published Scholarship Becomes Mate... pipeline architecture (layers/modules)

Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback.

Descriptive claim about the general architecture of Agent-Aided Design systems as asserted by the authors (methodological description), not an empirical test; no quantitative evaluation provided here.

high neutral Agent-Aided Design for Dynamic CAD Models system architecture / iterative design loop (agent writes code, compiles, visual...

We evaluate four mechanisms to enable cooperation: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players.

Description of experimental design / mechanisms evaluated in the study across four social dilemmas; details on implementation and sample sizes not provided in the excerpt.

high neutral CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and... comparative effectiveness of four cooperation mechanisms

CoCoGen+ formulates each training round as a weighted potential game in which organizations strategically decide how much synthetic data to generate by balancing learning performance gains against computational costs and competition-caused utility losses.

Theoretical formulation and game-theoretic modeling provided in the paper (analytical derivation); no empirical sample size reported.

high neutral Cooperate to Compete: Strategic Data Generation and Incentiv... synthetic_data_generation_quantity (strategy)

The paper provides lessons for scaling regression automation and enabling effective human-AI teaming in Agile settings.

Stated contribution of the paper (synthesis of lessons from the industrial case study).

high neutral Human-AI Collaboration for Scaling Agile Regression Testing:... availability of lessons and guidance

The Copilot was integrated with Hacon's CI pipelines and operates asynchronously as a 'silent AI teammate', producing candidate scripts for human review.

System integration and deployment description within the case study (implementation detail reported in the paper).

high neutral Human-AI Collaboration for Scaling Agile Regression Testing:... operational mode and integration with CI (asynchronous candidate generation for ...

We conducted an exploratory industrial case study of the Hacon Test Automation Copilot, an agentic AI system that generates system-level regression test scripts from validated specifications using retrieval-augmented generation and a multi-agent workflow.

Methodological claim: description of the study design and the system; the paper reports a single industrial case study at Hacon (a Siemens company).

high neutral Human-AI Collaboration for Scaling Agile Regression Testing:... capability to generate system-level regression test scripts

Predictive outputs are translated into allocation rules, with emphasis on mean–variance optimization, shrinkage-based risk estimation, risk parity, hierarchical allocation, and reinforcement-learning-based dynamic rebalancing.

Surveyed literature on portfolio construction and allocation techniques described in the review (methodological overview; no single empirical dataset or sample size).

high neutral Artificial Intelligence in Financial Decision-Making methods for converting predictions into portfolio allocation rules
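
One of the allocation rules named in this claim, risk parity, can be illustrated in its simplest form. The sketch below uses inverse-volatility weighting, a naive variant that ignores correlations and equalizes risk contributions only when assets are uncorrelated. The asset names and return series are hypothetical, and this is not taken from the review itself.

```python
import statistics


def inverse_volatility_weights(returns_by_asset):
    """Weight each asset in proportion to 1 / (sample std. dev. of its returns)."""
    vols = {asset: statistics.stdev(r) for asset, r in returns_by_asset.items()}
    if any(v == 0 for v in vols.values()):
        raise ValueError("every asset needs non-zero volatility")
    inverse = {asset: 1.0 / v for asset, v in vols.items()}
    total = sum(inverse.values())
    return {asset: x / total for asset, x in inverse.items()}


# Hypothetical return series: asset B is twice as volatile as asset A.
returns_by_asset = {
    "A": [0.01, -0.01, 0.01, -0.01],
    "B": [0.02, -0.02, 0.02, -0.02],
}
weights = inverse_volatility_weights(returns_by_asset)
print(weights)
```

The less volatile asset receives the larger weight, so each position contributes a similar amount of standalone risk. A full risk-parity solver would work from the covariance matrix instead.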

SAFI measures LLM performance on text-based representations of skills, not full occupational execution.

Methodological caveat stated by the authors clarifying the scope and limits of SAFI.

high neutral The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... scope of SAFI measure (text-based representations vs full job execution)

We propose an AI Impact Matrix that positions skills into four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk.

Conceptual/interpretive framework introduced by the authors; described in text as proposed by the paper.

high neutral The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... interpretive classification of skills into four impact quadrants

Legitimate accountability is axiomatized through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated).

Paper presents an explicit axiomatization listing these four properties as definitions/axioms forming the normative criteria for legitimate accountability.

high neutral The Accountability Horizon: An Impossibility Theorem for Gov... formal criteria for legitimate accountability
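
The four properties in this claim can be read as checks on a responsibility allocation. The sketch below is a toy encoding, not the paper's formalism: it assumes responsibility is a share per agent summing to 1, causal contribution is a boolean flag, and predictive capacity is an upper bound on each share. All names and numbers are hypothetical.

```python
def check_accountability(responsibility, causal, foresee_bound, tol=1e-9):
    """Report which of the four properties a toy allocation satisfies."""
    agents = list(responsibility)
    return {
        # Attributability: a positive share requires causal contribution.
        "attributability": all(responsibility[a] <= tol or causal[a] for a in agents),
        # Foreseeability Bound: no share exceeds the agent's predictive capacity.
        "foreseeability_bound": all(
            responsibility[a] <= foresee_bound[a] + tol for a in agents
        ),
        # Non-Vacuity: at least one agent bears a non-trivial share.
        "non_vacuity": any(responsibility[a] > tol for a in agents),
        # Completeness: shares are fully allocated (assumed to sum to 1).
        "completeness": abs(sum(responsibility.values()) - 1.0) <= tol,
    }


causal = {"operator": True, "agent": True}

ok = check_accountability(
    {"operator": 0.6, "agent": 0.4}, causal, {"operator": 0.6, "agent": 0.4}
)
bad = check_accountability(
    {"operator": 0.7, "agent": 0.3}, causal, {"operator": 0.5, "agent": 0.3}
)
print(ok)
print(bad)
```

In this toy encoding, whenever the foreseeability bounds sum to less than 1, no allocation can satisfy both the Foreseeability Bound and Completeness. The second example shows that tension for one particular allocation.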

Collective behaviour is characterised through interaction graphs and joint action spaces.

Paper specifies interaction graphs and joint action spaces as part of the formal model (definitions and formal structure).

high neutral The Accountability Horizon: An Impossibility Theorem for Gov... formal representation of collective behaviour

Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social).

Paper defines autonomy as a 4-dimensional information-theoretic profile (conceptual/mathematical definition within the formal model).

high neutral The Accountability Horizon: An Impossibility Theorem for Gov... measure/characterisation of agent autonomy

Using a strictly algorithmic baseline (mathematical bottleneck aggregation), we calculate Relative Occupational Automation Indices (OAI) for the U.S. labor market based on the DWA-level scores.

Method and calculation claim: algorithmic baseline aggregation applied across the 923 occupations / 2,087 DWAs to produce OAIs mapped to the U.S. labor market. Specific aggregation formula referenced but not numerically detailed in the excerpt.

high neutral Bounded by Risk, Not Capability: Quantifying AI Occupational... Relative Occupational Automation Index (OAI)
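
The excerpt does not give the aggregation formula, so the following is only a guess at what a bottleneck aggregation could look like. It assumes each occupation's index is the minimum of its DWA-level scores, so the least automatable activity caps the occupation. The occupation names and scores are hypothetical, and the "relative" scaling is omitted because the excerpt does not specify it.

```python
def bottleneck_index(dwa_scores):
    """Occupation-level index as the minimum DWA score (assumed bottleneck rule)."""
    if not dwa_scores:
        raise ValueError("an occupation needs at least one DWA score")
    return min(dwa_scores)


# Hypothetical DWA-level automation scores in [0, 1] for two occupations.
occupations = {
    "occupation_x": [0.9, 0.8, 0.2],
    "occupation_y": [0.6, 0.5, 0.55],
}
indices = {name: bottleneck_index(scores) for name, scores in occupations.items()}
print(indices)
```

Under this rule, an occupation with mostly high scores and one low score ranks below an occupation with uniformly moderate scores, which a mean-based aggregation would not show.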

We deconstructed 923 occupations into 2,087 Detailed Work Activities (DWAs).

Explicit data processing claim in the paper: mapping of 923 occupations to 2,087 DWAs for analysis.

high neutral Bounded by Risk, Not Capability: Quantifying AI Occupational... coverage of occupations and DWAs used for analysis

The economic model for IASCA follows the FDA's PDUFA precedent, with progressive certification fees representing 0.1-1% of model training costs.

Proposal specifies that IASCA's funding would mirror the FDA PDUFA model and states a fee range of 0.1–1% of model training costs; this is an asserted financing mechanism, not empirically validated in the excerpt.

high neutral IASCA: The International AI Safety Certification Authority —... progressive certification fees equal to 0.1-1% of model training costs
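
For scale, the stated 0.1-1% range can be turned into a fee band for a given training cost. The sketch below shows only the lower and upper bounds, since the excerpt does not describe how the progressive schedule moves between them. The function name and the 100 million training cost are hypothetical.

```python
def certification_fee_range(training_cost, low_bps=10, high_bps=100):
    """Fee bounds at 0.1% (10 basis points) and 1% (100 basis points) of cost."""
    if training_cost < 0:
        raise ValueError("training cost must be non-negative")
    return training_cost * low_bps / 10_000, training_cost * high_bps / 10_000


# Hypothetical model with a training cost of 100 million (any currency unit).
low_fee, high_fee = certification_fee_range(100_000_000)
print(low_fee, high_fee)
```

Rates are expressed in basis points so the arithmetic is done on integers before the final division.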

IASCA is modelled after existing international and national regulatory bodies such as the IAEA, FAA, and FDA.

Proposal explicitly states IASCA is modelled after the IAEA, FAA, and FDA; this is an analogy/organizational design claim rather than an empirical finding.

high neutral IASCA: The International AI Safety Certification Authority —... institutional design modeled on IAEA/FAA/FDA

A variance decomposition indicates that most expert disagreement about long-run macroeconomic outcomes is driven by differing beliefs about the economic effects of highly capable AI, rather than disagreement about the pace of AI capability progress.

Authors' variance-decomposition analysis of survey responses separating components due to beliefs about AI capabilities vs. beliefs about economic effects given capabilities (methodological details referenced but not provided in excerpt).

high neutral Forecasting the Economic Effects of AI sources of expert disagreement (capabilities vs. economic effects)

A life insurance system integrated into an industry partner mobile app was tested in two experiments.

Paper reports two experiments running the ARQuest-enabled life insurance system inside a partner mobile app; experimental setup is stated though sample sizes are not provided in the excerpt.

high neutral AI in Insurance: Adaptive Questionnaires for Improved Risk P... experimental evaluation of system in partner app
