Evidence (6917 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	761	200	101	904	2020
Governance & Regulation	829	400	191	122	1566
Organizational Efficiency	784	193	125	84	1197
Technology Adoption Rate	637	236	124	97	1103
Research Productivity	431	131	58	340	972
Output Quality	481	183	59	47	770
Decision Quality	332	177	82	49	647
Firm Productivity	439	57	88	20	610
AI Safety & Ethics	218	279	66	33	602
Market Structure	181	170	123	24	503
Task Allocation	214	64	72	33	388
Skill Acquisition	174	62	62	17	315
Innovation Output	204	27	45	18	295
Employment Level	105	54	108	13	282
Fiscal & Macroeconomic	132	69	43	26	277
Consumer Welfare	117	63	42	11	233
Firm Revenue	154	48	26	3	231
Task Completion Time	173	31	8	12	225
Inequality Measures	44	123	50	6	223
Worker Satisfaction	89	65	22	12	188
Error Rate	71	92	10	2	175
Regulatory Compliance	77	69	14	5	165
Automation Exposure	58	56	26	13	156
Training Effectiveness	96	21	14	19	152
Wages & Compensation	77	37	25	6	145
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	81	21	1	115
Hiring & Recruitment	52	7	8	3	70
Creative Output	32	20	8	3	64
Skill Obsolescence	5	47	6	1	59
Social Protection	28	16	8	2	54
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Governance

Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic developed in the theory.

Methodological statement that the paper uses structured cross-domain illustrations to ground and discipline the theoretical claims; no empirical sample reported.

high neutral Redrawing the AI Map: A Theory of Accountability Boundaries ... theoretical grounding via domain illustrations

There are three accountability-boundary strategies in agentic ecosystems: component, integrated, and dual-track.

Theoretical categorization introduced by the authors as part of the capability-level theory; illustrated with cross-domain examples rather than empirical testing.

high neutral Redrawing the AI Map: A Theory of Accountability Boundaries ... classification of boundary strategy

The study used standard scientific methods, employing a comparative approach and inductive and deductive methods to identify patterns of interaction between legal regulation and technological development.

Methodology section of the paper explicitly states the use of comparative, inductive and deductive methods and theoretical synthesis.

high neutral ECONOMIC SYSTEMS IN THE CONTEXT OF DIGITALISATION AND AI: TH... methodological approach used in the study

The paper develops a theoretical and legal model that treats law as an integral part of the economic system influencing income distribution, labour relations, market structure and productivity dynamics.

Model construction through synthesis of theoretical perspectives using inductive and deductive methods and comparative legal analysis (methodology described in the paper).

high neutral ECONOMIC SYSTEMS IN THE CONTEXT OF DIGITALISATION AND AI: TH... role of legal frameworks in shaping economic institutional conditions (income di...

Few benchmarks achieve widespread use (examples given include GPQA Diamond, LiveCodeBench, AIME 2025).

Empirical observation from the dataset showing that only a small number of benchmarks are highlighted across multiple builders/releases; specific named benchmarks are cited as relatively widely used.

high neutral Unsteady Metrics and Benchmarking Cultures of AI Model Build... frequency of benchmark highlighting across builders/releases

We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping.

Paper contribution: authors present a four-tier taxonomy as a conceptual framework; this is a descriptive/constructive claim about the content of the paper itself.

high neutral Generative AI Advertising as a Problem of Trustworthy Commer... categorization of types of commercial influence in generative systems

This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals.

Method: sociotechnical audit comparing six commercial LLMs to a rubric created via a Delphi process with 20 infrastructure professionals (Delphi-derived rubric).

high neutral Governance risks of AI reasoning in urban infrastructure thr... comparison of LLM reasoning to expert-derived rubric

Regulatory technology is viewed as a governance arrangement that organizes relations between firms, banks, insurers, logistics actors, buyers, and regulators.

Conceptual framing developed through the interpretive synthesis of multiple literature streams in the paper.

high neutral RegTech-enabled governance of sanctions-safe enterprise ecos... conceptual role of RegTech in organizing inter-actor relations

We design a budget split intervention that directly incorporates unknown users and targets users with Google-inferred gender labels (male, female).

Authors' stated experimental/intervention design implemented in collaboration with a state-level government agency; methodological claim about the intervention (no sample size or deployment details in the excerpt).

high neutral Into the Unknown: Accounting for Missing Demographic Data wh... design and implementation of a budget split intervention incorporating unknown u...

Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results.

Experimental setup reported by authors: 9 models from 3 providers, with 30 trials per model using real SDK tool-use and autonomous graph queries.

high neutral Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... experimental coverage and evaluation methodology (models invoked graph query too...

Oracle Poisoning manipulates the data agents reason over, not their instructions, distinguishing it from prompt injection.

Theoretical distinction and definitional comparison made by the authors (conceptual argument in the paper).

high neutral Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... mechanism of attack (data-layer vs instruction-layer manipulation)

A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering.

Methodological claim describing a six-gate producer audit procedure in the paper to diagnose engineering failures vs. commercial steering.

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... ability to distinguish engineering failures from commercial steering

Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual).

Method description of the paired counterfactual experimental design used by TourMart.

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... steering delta (difference in acceptance between commission-aware and minimum-di...
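
As a rough illustration of the paired counterfactual design in this record, the sketch below computes a steering delta as the mean paired difference in acceptance between the two prompt conditions, with traveler and bundle held fixed within each pair. The `PairedTrial` fields, the `steering_delta` function, and the sample data are all hypothetical; the excerpt does not give TourMart's actual formula, so this is one plausible reading rather than the paper's method.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedTrial:
    """One traveler/bundle pair observed under both prompt conditions (hypothetical schema)."""
    traveler_id: str
    bundle_id: str
    accepted_commission_aware: bool  # outcome under the commission-aware prompt
    accepted_factual: bool           # outcome under the minimum-disclosure factual template


def steering_delta(trials):
    """Mean paired difference in acceptance (commission-aware minus factual).

    Assumption: acceptance is binary and the delta is a simple average of
    within-pair differences; the source does not confirm this definition.
    """
    if not trials:
        raise ValueError("need at least one paired trial")
    return mean(
        int(t.accepted_commission_aware) - int(t.accepted_factual) for t in trials
    )


# Made-up data for illustration only.
trials = [
    PairedTrial("t1", "b1", True, False),
    PairedTrial("t2", "b1", True, True),
    PairedTrial("t3", "b2", False, False),
    PairedTrial("t4", "b2", True, False),
]
delta = steering_delta(trials)
print(f"steering delta: {delta:+.2f}")
```

Because each pair shares the same traveler and bundle, any nonzero delta in this sketch is attributable to the prompt condition alone, which is the point of pairing.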

We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers — lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare).

Methodological proposal described in paper: design of an audit instrument and two formal levers (lambda, kappa).

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... audit instrument capability for measuring message-induced perception shifts unde...

Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice.

Descriptive assertion in paper about product/industry UI change; no empirical sample or formal measurement reported in excerpt.

high neutral TourMart: A Parametric Audit Instrument for Commission Steer... interface format (ranked-list → single-sentence conversational recommendation)

We characterize optimal and fair policies in the short term.

Theoretical results/characterizations presented in the paper identifying optimal policies and fair-policy structures for the short-term setting.

high neutral Price of Fairness in Short-Term and Long-Term Algorithmic Se... policy optimality under short-term fairness constraints

We theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF).

Theoretical analysis in the paper using the Price of Fairness formalism to study trade-offs.

high neutral Price of Fairness in Short-Term and Long-Term Algorithmic Se... trade-off between utility (decision-maker objective) and fairness constraints (P...
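
For orientation, one common way to formalize a Price of Fairness is as the ratio of the best achievable utility without fairness constraints to the best achievable utility under them. The symbols below ($U$, $\Pi$, $\Pi_{\text{fair}}$) are assumptions introduced for illustration; the excerpt does not state the paper's own definition, which may differ (for example, a relative loss instead of a ratio).

```latex
\[
  \mathrm{PoF}
  \;=\;
  \frac{\max_{\pi \in \Pi} U(\pi)}
       {\max_{\pi \in \Pi_{\text{fair}}} U(\pi)},
  \qquad
  \Pi_{\text{fair}} \subseteq \Pi .
\]
```

Under this reading, and assuming utilities are positive, the ratio is at least 1, and a value of 1 means the fairness constraint costs the decision-maker nothing.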

We introduce notions of group fairness for both the short and long term.

Methodological contribution in the paper: formal definitions of short-term and long-term group fairness introduced by the authors.

high neutral Price of Fairness in Short-Term and Long-Term Algorithmic Se... definitions of group fairness (short-term, long-term)

The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).

Authors' framing and explicit statements of intended contribution; supported by the failure diagnosis, diagnostic protocol, and Trace-Prior RL repair demonstrated in simulator experiments.

high neutral Market-Alignment Risk in Pricing Agents: Trace Diagnostics a... methodological reproducibility and conceptual framing

We position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots.

Architectural/framework description in the paper that maps DePIN elements into a vertically integrated stack; conceptual/mapping method without empirical measurement.

high neutral DAO-enabled decentralized physical AI: A new paradigm for hu... conceptual integration of DePIN components into a vertical infrastructure stack

Weight-based memory generalizes by applying abstract rules to inputs never seen before.

Conceptual claim grounded in the paper's theoretical distinction between weight-based learning and retrieval; references Complementary Learning Systems theory; no empirical sample in abstract.

high neutral Contextual Agentic Memory is a Memo, Not True Memory type of generalization performed by weight-based memory

Retrieval generalizes by similarity to stored cases.

Conceptual claim stated in paper (distinction between retrieval-based and weight-based generalization); supported by theoretical characterization, not empirical data in abstract.

high neutral Contextual Agentic Memory is a Memo, Not True Memory type of generalization performed by retrieval systems

Many practical machine learning applications are online and sequential, meaning prior decisions inform future ones — a setting in which fairness challenges differ from standard supervised learning.

Background claim in the paper motivating the work; literature context and conceptual discussion rather than new empirical data.

high neutral Fairness under uncertainty in sequential decisions characterization of ML application setting (online/sequential)

We evaluate four mechanisms to enable cooperation: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to whom players delegate decision making, and (4) contract agreements for outcome-conditional payments between players.

Description of experimental design / mechanisms evaluated in the study across four social dilemmas; details on implementation and sample sizes not provided in the excerpt.

high neutral CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and... comparative effectiveness of four cooperation mechanisms

CoCoGen+ formulates each training round as a weighted potential game in which organizations strategically decide how much synthetic data to generate by balancing learning performance gains against computational costs and competition-caused utility losses.

Theoretical formulation and game-theoretic modeling provided in the paper (analytical derivation); no empirical sample size reported.

high neutral Cooperate to Compete: Strategic Data Generation and Incentiv... synthetic_data_generation_quantity (strategy)

Predictive outputs are translated into allocation rules, with emphasis on mean–variance optimization, shrinkage-based risk estimation, risk parity, hierarchical allocation, and reinforcement-learning-based dynamic rebalancing.

Surveyed literature on portfolio construction and allocation techniques described in the review (methodological overview; no single empirical dataset or sample size).

high neutral Artificial Intelligence in Financial Decision-Making methods for converting predictions into portfolio allocation rules

Legitimate accountability is axiomatized through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated).

Paper presents an explicit axiomatization listing these four properties as definitions/axioms forming the normative criteria for legitimate accountability.

high neutral The Accountability Horizon: An Impossibility Theorem for Gov... formal criteria for legitimate accountability

Collective behaviour is characterised through interaction graphs and joint action spaces.

Paper specifies interaction graphs and joint action spaces as part of the formal model (definitions and formal structure).

high neutral The Accountability Horizon: An Impossibility Theorem for Gov... formal representation of collective behaviour

Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social).

Paper defines autonomy as a 4-dimensional information-theoretic profile (conceptual/mathematical definition within the formal model).

high neutral The Accountability Horizon: An Impossibility Theorem for Gov... measure/characterisation of agent autonomy

Using a strictly algorithmic baseline (mathematical bottleneck aggregation), we calculate Relative Occupational Automation Indices (OAI) for the U.S. labor market based on the DWA-level scores.

Method and calculation claim: algorithmic baseline aggregation applied across the 923 occupations / 2,087 DWAs to produce OAIs mapped to the U.S. labor market. Specific aggregation formula referenced but not numerically detailed in the excerpt.

high neutral Bounded by Risk, Not Capability: Quantifying AI Occupational... Relative Occupational Automation Index (OAI)

We deconstructed 923 occupations into 2,087 Detailed Work Activities (DWAs).

Explicit data processing claim in the paper: mapping of 923 occupations to 2,087 DWAs for analysis.

high neutral Bounded by Risk, Not Capability: Quantifying AI Occupational... coverage of occupations and DWAs used for analysis

The economic model for IASCA follows the FDA's PDUFA precedent, with progressive certification fees representing 0.1-1% of model training costs.

Proposal specifies that IASCA's funding would mirror the FDA PDUFA model and states a fee range of 0.1–1% of model training costs; this is an asserted financing mechanism, not empirically validated in the excerpt.

high neutral IASCA: The International AI Safety Certification Authority —... progressive certification fees equal to 0.1-1% of model training costs

IASCA is modelled after existing international and national regulatory bodies such as the IAEA, FAA, and FDA.

Proposal explicitly states IASCA is modelled after the IAEA, FAA, and FDA; this is an analogy/organizational design claim rather than an empirical finding.

high neutral IASCA: The International AI Safety Certification Authority —... institutional design modeled on IAEA/FAA/FDA

We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance).

Controlled experiment reported in the paper: 600 runs across five named industries (experimental setup reported in abstract).

high neutral Ontology-Constrained Neural Reasoning in Enterprise Agentic ... experimental performance of ontology-coupled vs ungrounded agents across industr...

The paper addresses three institutional audiences: enterprise finance and operations teams; government and regulatory bodies developing AI labor displacement frameworks; and financial markets requiring a machine labor index as a long-duration economic signal.

Stated intended audiences in the paper (descriptive statement).

high neutral HEWU: A Standardized Framework for Measuring Machine-Generat... intended institutional audiences

Costinot and Werning (2023) develop a sufficient-statistic approach and find optimal technology taxes of 1–3.7% on robots.

Citation reported in the paper summarizing Costinot and Werning (2023)'s quantitative sufficient-statistic estimate.

high neutral NBER WORKING PAPER SERIES optimal robot tax rate

Guerreiro et al. (2022) characterize the optimal Mirrleesian tax system with automation and find that robot taxes should be transitional: high when incumbent workers cannot retrain, converging to zero as new cohorts adjust skill investments.

Citation reported in the paper summarizing Guerreiro et al. (2022)'s theoretical result on transitional robot taxes.

high neutral NBER WORKING PAPER SERIES optimal robot tax path over time

If labor becomes economically redundant, the policy focus shifts from steering innovation to redesigning public finance and redistribution (e.g., new tax instruments, redistribution mechanisms).

Theoretical scenario analysis in the paper with references to related works (Korinek and Juelfs 2024; Korinek and Lockwood 2026).

high neutral NBER WORKING PAPER SERIES policy priority shift (steering -> public finance/redistribution)

We critically compare LLM-generated rulings against 10,000 real-world court judgments from China Judgments Online (CJOL).

Dataset statement: the paper compares model outputs to a corpus of 10,000 CJOL labor dispute judgments.

high neutral LLM Safety in Judicial AI: A Stress Test of Social Media Inf... agreement / deviation between LLM-generated rulings and CJOL judgments

We introduce a novel stress test that evaluates LLM-generated labor dispute outcomes by injecting social media sentiment as an external pressure.

Methodological description in the paper: a designed stress test where social media sentiment is used to perturb LLM outputs for labor dispute cases.

high neutral LLM Safety in Judicial AI: A Stress Test of Social Media Inf... sensitivity of LLM-generated labor dispute outcomes to injected social media sen...

The paper treats data as a new type of production factor and endogenizes it within the production function.

Theoretical/methodological: the paper constructs a macro-level theoretical model that explicitly includes data as an endogenous input in the production function (no empirical/sample data).

high neutral Study on the impact of big data sharing on individuals’ welf... inclusion of data as a production factor (model specification)

In the near term, the most plausible equilibrium is bounded autonomy, in which AI agents operate as supervised co-pilots, monitoring systems, and constrained execution modules embedded within human decision processes.

Theoretical argument and forward-looking assessment by the authors based on the proposed framework and plausibility considerations; not presented as the result of a causal empirical study in the excerpt.

high neutral AI Agents in Financial Markets: Architecture, Applications, ... expected equilibrium mode of AI agent autonomy in finance (bounded autonomy / su...

Economic evaluations of GLAI should account for end-to-end risk externalities (error propagation, institutional trust, rights impacts), not only short-term productivity gains.

Methodological recommendation grounded in conceptual synthesis of technical, behavioral, and legal risks; normative argument rather than empirical result.

high neutral Why Avoid Generative Legal AI Systems? Hallucination, Overre... comprehensiveness of economic evaluations (inclusion of externalities vs. narrow...

Generative Legal AI (GLAI) systems are built on token-prediction (LLM) architectures rather than formal legal-reasoning architectures.

Conceptual and technical analysis in the paper distinguishing GLAI from other legal-tech; literature synthesis on common LLM architectures. No original empirical dataset or sample size—qualitative/technical review.

high neutral Why Avoid Generative Legal AI Systems? Hallucination, Overre... underlying model architecture type (token-prediction vs. formal-reasoning)

The paper's formalism shows that prompt/system messages shape distributions over possible execution paths (indirect control) but do not evaluate actual partial paths at runtime.

Formal mapping in the paper that treats prompts as shaping prior over paths; conceptual argument and illustrative examples.

high neutral Runtime Governance for AI Agents: Policies on Paths degree of control over execution path (distributional shaping vs. path-specific ...

Returns to AI are heterogeneous across firms; estimating treatment effects requires attention to selection, complementarities, and dynamic adoption pipelines.

Methodological argument referencing treatment-effect literature and observed firm heterogeneity; supported by conceptual examples rather than a single empirical treatment-effect estimate.

high neutral Modern Management in the Age of Artificial Intelligence: Str... heterogeneity in returns to AI adoption (firm-level productivity or performance ...

In our setting, the locus of AI bias is not estimation but interpretation.

Overall experiment results: agent coefficient/estimate distributions remained aligned with human consensus and largely unchanged under biased prompts, while final-verdict outcomes were flip-prone under confirmatory prompts (e.g., Claude Code 10%→90%).

high null result AI Coding Agents in Social Science: Methodologically Diverse... whether bias manifests in estimation (coefficients) versus interpretation (verdi...

Unlike for biased human analysts working with the same data, the anti-immigration prior prompt does not shift agents' aggregate estimates or final verdicts.

Comparison of the effect of an anti-immigration prior on human analysts (reported bias) versus agents (20 runs), showing that agent aggregate estimates and final verdict rates remained stable despite changes in methodological decisions.

high null result AI Coding Agents in Social Science: Methodologically Diverse... aggregate effect estimates and final verdict support rates under the anti-immigr...

No agent model exactly matches any human model.

Specification-by-specification comparison showing that none of the agent-generated models (from 20 executions) are identical to any human analyst's model in the many-analysts baseline.

high null result AI Coding Agents in Social Science: Methodologically Diverse... exact match count between agent models and human analyst models

Both agents' effect estimates remain broadly aligned with the human consensus.

Comparison of effect estimate distributions from Claude Code and Codex (20 runs each) to the human many-analysts consensus; reported alignment/broad agreement between agent estimates and human consensus.

high null result AI Coding Agents in Social Science: Methodologically Diverse... distribution of estimated effects (coefficients) relative to human consensus
