Evidence (6917 claims)
Adoption: 8625 claims
Productivity: 7686 claims
Governance: 6917 claims
Human-AI Collaboration: 6574 claims
Org Design: 4189 claims
Innovation: 4131 claims
Labor Markets: 3588 claims
Skills & Training: 2985 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
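The matrix can be read as direction shares per outcome. The sketch below copies three rows from the table and divides each direction by the reported Total. The function names are illustrative and not part of the source. In some rows the four listed directions sum to less than the Total (Firm Productivity: 604 of 610); the source does not say why, so the sketch reports the gap rather than explaining it.

```python
# A few rows copied from the matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (439, 57, 88, 20, 610),
    "Job Displacement": (12, 81, 21, 1, 115),
    "Inequality Measures": (44, 123, 50, 6, 223),
}


def direction_shares(row):
    """Return each direction's share of the row's reported Total."""
    positive, negative, mixed, null, total = row
    counts = {
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
    }
    return {name: count / total for name, count in counts.items()}


def unlisted(row):
    """Claims counted in Total but absent from the four listed directions."""
    *directions, total = row
    return total - sum(directions)


for outcome, row in ROWS.items():
    shares = {k: round(v, 3) for k, v in direction_shares(row).items()}
    print(outcome, shares, "unlisted:", unlisted(row))
```

Dividing by the reported Total rather than by the row sum keeps the shares comparable with the Total column as published.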
Governance
Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic developed in the theory.
Methodological statement that the paper uses structured cross-domain illustrations to ground and discipline the theoretical claims; no empirical sample reported.
There are three accountability-boundary strategies in agentic ecosystems: component, integrated, and dual-track.
Theoretical categorization introduced by the authors as part of the capability-level theory; illustrated with cross-domain examples rather than empirical testing.
The study used standard scientific methods, employing a comparative approach and inductive and deductive methods to identify patterns of interaction between legal regulation and technological development.
Methodology section of the paper explicitly states the use of comparative, inductive and deductive methods and theoretical synthesis.
The paper develops a theoretical and legal model that treats law as an integral part of the economic system influencing income distribution, labour relations, market structure and productivity dynamics.
Model construction through synthesis of theoretical perspectives using inductive and deductive methods and comparative legal analysis (methodology described in the paper).
Few benchmarks achieve widespread use (examples given include GPQA Diamond, LiveCodeBench, AIME 2025).
Empirical observation from the dataset showing that only a small number of benchmarks are highlighted across multiple builders/releases; specific named benchmarks are cited as relatively widely used.
We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping.
Paper contribution: authors present a four-tier taxonomy as a conceptual framework; this is a descriptive/constructive claim about the content of the paper itself.
This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals.
Method: sociotechnical audit comparing six commercial LLMs to a rubric created via a Delphi process with 20 infrastructure professionals (Delphi-derived rubric).
Regulatory technology is viewed as a governance arrangement that organizes relations between firms, banks, insurers, logistics actors, buyers, and regulators.
Conceptual framing developed through the interpretive synthesis of multiple literature streams in the paper.
We design a budget split intervention that directly incorporates unknown users and targets users with Google-inferred gender labels (male, female).
Authors' stated experimental/intervention design implemented in collaboration with a state-level government agency; methodological claim about the intervention (no sample size or deployment details in the excerpt).
Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results.
Experimental setup reported by authors: 9 models from 3 providers, with 30 trials per model using real SDK tool-use and autonomous graph queries.
Oracle Poisoning manipulates the data agents reason over, not their instructions, distinguishing it from prompt injection.
Theoretical distinction and definitional comparison made by the authors (conceptual argument in the paper).
A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering.
Methodological claim describing a six-gate producer audit procedure in the paper to diagnose engineering failures vs. commercial steering.
Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual).
Method description of the paired counterfactual experimental design used by TourMart.
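A paired counterfactual of this kind reduces to a mean of within-pair differences. The sketch below is a minimal illustration, not TourMart's implementation. The outcome coding (1 if the higher-commission option is chosen) and the sample data are assumptions made for the example.

```python
from statistics import mean


def steering_delta(pairs):
    """Mean paired difference: commission-aware arm minus factual-template arm.

    Each pair holds the outcome for the same traveler and bundle under the
    two prompts, so anything specific to that traveler or bundle cancels out.
    """
    return mean(aware - factual for aware, factual in pairs)


# Hypothetical outcomes: 1 if the higher-commission option was chosen.
# Each tuple is (commission_aware, factual_template) for one traveler/bundle.
pairs = [(1, 0), (1, 1), (0, 0), (1, 0)]
print(steering_delta(pairs))
```

Holding the traveler and bundle fixed within each pair is what lets the difference be attributed to the prompt rather than to the case.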
We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers — lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare).
Methodological proposal described in paper: design of an audit instrument and two formal levers (lambda, kappa).
Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice.
Descriptive assertion in paper about product/industry UI change; no empirical sample or formal measurement reported in excerpt.
We characterize optimal and fair policies in the short term.
Theoretical results/characterizations presented in the paper identifying optimal policies and fair-policy structures for the short-term setting.
We theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF).
Theoretical analysis in the paper using the Price of Fairness formalism to study trade-offs.
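The excerpt does not give the paper's exact Price of Fairness formula. One common definition is the relative utility loss from imposing the fairness constraint, and the sketch below assumes that form. The function name and the example utilities are hypothetical.

```python
def price_of_fairness(optimal_utility, fair_utility):
    """Relative utility lost by restricting to fair policies.

    Assumed form: (U_opt - U_fair) / U_opt, with U_opt > 0.
    The paper may define PoF differently (for example, as a ratio).
    """
    if optimal_utility <= 0:
        raise ValueError("optimal_utility must be positive")
    if fair_utility > optimal_utility:
        raise ValueError("a constrained optimum cannot beat the unconstrained one")
    return (optimal_utility - fair_utility) / optimal_utility


# Hypothetical utilities for an unconstrained and a fair policy.
print(price_of_fairness(10.0, 8.0))
```

Under this definition a value of 0 means fairness is free and a value near 1 means the fair policy forfeits almost all utility.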
We introduce notions of group fairness for both the short and long term.
Methodological contribution in the paper: formal definitions of short-term and long-term group fairness introduced by the authors.
The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).
Authors' framing and explicit statements of intended contribution; supported by the failure diagnosis, diagnostic protocol, and Trace-Prior RL repair demonstrated in simulator experiments.
We position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots.
Architectural/framework description in the paper that maps DePIN elements into a vertically integrated stack; conceptual/mapping method without empirical measurement.
Weight-based memory generalizes by applying abstract rules to inputs never seen before.
Conceptual claim grounded in the paper's theoretical distinction between weight-based learning and retrieval; references Complementary Learning Systems theory; no empirical sample in abstract.
Retrieval generalizes by similarity to stored cases.
Conceptual claim stated in paper (distinction between retrieval-based and weight-based generalization); supported by theoretical characterization, not empirical data in abstract.
Many practical machine learning applications are online and sequential, meaning prior decisions inform future ones — a setting in which fairness challenges differ from standard supervised learning.
Background claim in the paper motivating the work; literature context and conceptual discussion rather than new empirical data.
We evaluate four mechanisms to enable cooperation: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to whom players delegate decision making, and (4) contract agreements for outcome-conditional payments between players.
Description of experimental design / mechanisms evaluated in the study across four social dilemmas; details on implementation and sample sizes not provided in the excerpt.
CoCoGen+ formulates each training round as a weighted potential game in which organizations strategically decide how much synthetic data to generate by balancing learning performance gains against computational costs and competition-caused utility losses.
Theoretical formulation and game-theoretic modeling provided in the paper (analytical derivation); no empirical sample size reported.
Predictive outputs are translated into allocation rules, with emphasis on mean–variance optimization, shrinkage-based risk estimation, risk parity, hierarchical allocation, and reinforcement-learning-based dynamic rebalancing.
Surveyed literature on portfolio construction and allocation techniques described in the review (methodological overview; no single empirical dataset or sample size).
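Of the allocation rules listed, risk parity is the simplest to show in a few lines. The sketch below uses inverse-volatility weighting, a simplified variant that ignores correlations between assets. It is an illustration under that assumption, not a method attributed to the review, and the volatilities are made up.

```python
def inverse_volatility_weights(volatilities):
    """Weights proportional to 1/sigma, normalized to sum to one.

    With correlations ignored, each asset then contributes the same
    amount of standalone risk (weight * volatility).
    """
    if any(v <= 0 for v in volatilities):
        raise ValueError("volatilities must be positive")
    inverse = [1.0 / v for v in volatilities]
    total = sum(inverse)
    return [x / total for x in inverse]


# Hypothetical annualized volatilities for three assets.
vols = [0.10, 0.20, 0.40]
weights = inverse_volatility_weights(vols)
print([round(w, 4) for w in weights])
print([round(w * v, 4) for w, v in zip(weights, vols)])
```

Full risk parity would use the covariance matrix and an iterative solver. The inverse-volatility shortcut coincides with it only when all pairwise correlations are equal.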
Legitimate accountability is axiomatized through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated).
Paper presents an explicit axiomatization listing these four properties as definitions/axioms forming the normative criteria for legitimate accountability.
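The four properties can be read as checks on a responsibility allocation. The sketch below encodes them over numeric shares. The encoding is an assumption made for illustration: responsibility shares that sum to 1, a boolean causal-contribution flag, and a predictive-capacity score in [0, 1]. The paper's formalization may differ.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    causal_contribution: bool   # did the agent causally contribute?
    predictive_capacity: float  # hypothetical score in [0, 1]
    responsibility: float       # share of responsibility assigned


def check_axioms(agents, tol=1e-9):
    """Evaluate the four properties for one allocation (assumed encoding)."""
    return {
        # Attributability: only causal contributors may bear responsibility.
        "attributability": all(
            a.causal_contribution for a in agents if a.responsibility > tol
        ),
        # Foreseeability Bound: responsibility cannot exceed predictive capacity.
        "foreseeability_bound": all(
            a.responsibility <= a.predictive_capacity + tol for a in agents
        ),
        # Non-Vacuity: at least one agent bears non-trivial responsibility.
        "non_vacuity": any(a.responsibility > tol for a in agents),
        # Completeness: all responsibility is allocated (shares sum to one).
        "completeness": abs(sum(a.responsibility for a in agents) - 1.0) <= tol,
    }


good = [
    Agent("developer", True, 0.7, 0.6),
    Agent("operator", True, 0.5, 0.4),
    Agent("bystander", False, 0.9, 0.0),
]
bad = [
    Agent("developer", True, 0.7, 0.8),   # exceeds predictive capacity
    Agent("bystander", False, 0.9, 0.2),  # no causal contribution
]
print(check_axioms(good))
print(check_axioms(bad))
```

The second allocation stays complete and non-vacuous but violates Attributability and the Foreseeability Bound, showing that the properties constrain allocations independently.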
Collective behaviour is characterised through interaction graphs and joint action spaces.
Paper specifies interaction graphs and joint action spaces as part of the formal model (definitions and formal structure).
Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social).
Paper defines autonomy as a 4-dimensional information-theoretic profile (conceptual/mathematical definition within the formal model).
Using a strictly algorithmic baseline (mathematical bottleneck aggregation), we calculate Relative Occupational Automation Indices (OAI) for the U.S. labor market based on the DWA-level scores.
Method and calculation claim: algorithmic baseline aggregation applied across the 923 occupations / 2,087 DWAs to produce OAIs mapped to the U.S. labor market. Specific aggregation formula referenced but not numerically detailed in the excerpt.
We deconstructed 923 occupations into 2,087 Detailed Work Activities (DWAs).
Explicit data processing claim in the paper: mapping of 923 occupations to 2,087 DWAs for analysis.
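The excerpt names bottleneck aggregation but does not give the formula. One plausible reading is that an occupation is only as automatable as its least automatable activity, which is a minimum over DWA-level scores. The sketch below assumes that reading. The occupation names and scores are hypothetical, and the step that makes the index "relative" is not shown because the excerpt does not describe it.

```python
def bottleneck_index(dwa_scores):
    """Occupation-level index as the minimum of its DWA-level scores.

    Assumption: 'bottleneck aggregation' means the weakest activity
    caps the whole occupation. The paper's formula may differ.
    """
    if not dwa_scores:
        raise ValueError("an occupation needs at least one DWA score")
    return min(dwa_scores)


# Hypothetical DWA-level automation scores in [0, 1].
occupations = {
    "hypothetical_occupation_a": [0.9, 0.8, 0.3],
    "hypothetical_occupation_b": [0.6, 0.7],
}
indices = {name: bottleneck_index(scores) for name, scores in occupations.items()}
print(indices)
```

Under a minimum rule, one hard-to-automate activity keeps the index low even when most activities score high, unlike a mean.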
The economic model for IASCA follows the FDA's PDUFA precedent, with progressive certification fees representing 0.1-1% of model training costs.
Proposal specifies that IASCA's funding would mirror the FDA PDUFA model and states a fee range of 0.1–1% of model training costs; this is an asserted financing mechanism, not empirically validated in the excerpt.
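The stated range translates into simple bounds on the fee. The sketch below computes only the lower and upper bound, since the excerpt does not describe how the progressive schedule moves between them. The training cost is a hypothetical figure.

```python
def certification_fee_bounds(training_cost, low_bps=10, high_bps=100):
    """Fee bounds at 0.1% (10 bps) and 1% (100 bps) of training cost.

    Only the endpoints of the stated range are computed; the progressive
    schedule between them is not specified in the excerpt.
    """
    if training_cost < 0:
        raise ValueError("training_cost must be non-negative")
    return training_cost * low_bps / 10_000, training_cost * high_bps / 10_000


# Hypothetical training cost of 100 million (currency units).
low, high = certification_fee_bounds(100_000_000)
print(low, high)
```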
IASCA is modelled after existing international and national regulatory bodies such as the IAEA, FAA, and FDA.
Proposal explicitly states IASCA is modelled after the IAEA, FAA, and FDA; this is an analogy/organizational design claim rather than an empirical finding.
We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance).
Controlled experiment reported in the paper: 600 runs across five named industries (experimental setup reported in abstract).
The paper addresses three institutional audiences: enterprise finance and operations teams; government and regulatory bodies developing AI labor displacement frameworks; and financial markets requiring a machine labor index as a long-duration economic signal.
Stated intended audiences in the paper (descriptive statement).
Costinot and Werning (2023) develop a sufficient-statistic approach and find optimal technology taxes of 1–3.7% on robots.
Citation reported in the paper summarizing Costinot and Werning (2023)'s quantitative sufficient-statistic estimate.
Guerreiro et al. (2022) characterize the optimal Mirrleesian tax system with automation and find that robot taxes should be transitional: high when incumbent workers cannot retrain, converging to zero as new cohorts adjust skill investments.
Citation reported in the paper summarizing Guerreiro et al. (2022)'s theoretical result on transitional robot taxes.
If labor becomes economically redundant, the policy focus shifts from steering innovation to redesigning public finance and redistribution (e.g., new tax instruments, redistribution mechanisms).
Theoretical scenario analysis in the paper with references to related works (Korinek and Juelfs 2024; Korinek and Lockwood 2026).
We critically compare LLM-generated rulings against 10,000 real-world court judgments from China Judgments Online (CJOL).
Dataset statement: the paper compares model outputs to a corpus of 10,000 CJOL labor dispute judgments.
We introduce a novel stress test that evaluates LLM-generated labor dispute outcomes by injecting social media sentiment as an external pressure.
Methodological description in the paper: a designed stress test where social media sentiment is used to perturb LLM outputs for labor dispute cases.
The paper treats data as a new type of production factor and endogenizes it within the production function.
Theoretical/methodological: the paper constructs a macro-level theoretical model that explicitly includes data as an endogenous input in the production function (no empirical/sample data).
In the near term, the most plausible equilibrium is bounded autonomy, in which AI agents operate as supervised co-pilots, monitoring systems, and constrained execution modules embedded within human decision processes.
Theoretical argument and forward-looking assessment by the authors based on the proposed framework and plausibility considerations; not presented as the result of a causal empirical study in the excerpt.
Economic evaluations of GLAI should account for end-to-end risk externalities (error propagation, institutional trust, rights impacts), not only short-term productivity gains.
Methodological recommendation grounded in conceptual synthesis of technical, behavioral, and legal risks; normative argument rather than empirical result.
Generative Legal AI (GLAI) systems are built on token-prediction (LLM) architectures rather than formal legal-reasoning architectures.
Conceptual and technical analysis in the paper distinguishing GLAI from other legal-tech; literature synthesis on common LLM architectures. No original empirical dataset or sample size—qualitative/technical review.
The paper's formalism shows that prompt/system messages shape distributions over possible execution paths (indirect control) but do not evaluate actual partial paths at runtime.
Formal mapping in the paper that treats prompts as shaping a prior over paths; conceptual argument and illustrative examples.
Returns to AI are heterogeneous across firms; estimating treatment effects requires attention to selection, complementarities, and dynamic adoption pipelines.
Methodological argument referencing treatment-effect literature and observed firm heterogeneity; supported by conceptual examples rather than a single empirical treatment-effect estimate.
In our setting, the locus of AI bias is not estimation but interpretation.
Overall experiment results: agent coefficient/estimate distributions remained aligned with human consensus and largely unchanged under biased prompts, while final-verdict outcomes were flip-prone under confirmatory prompts (e.g., Claude Code 10%→90%).
Unlike biased human analysts working with the same data, agents do not shift their aggregate estimates or final verdicts under the anti-immigration prior prompt.
Comparison of the effect of an anti-immigration prior on human analysts (reported bias) versus agents (20 runs), showing that agent aggregate estimates and final verdict rates remained stable despite changes in methodological decisions.
No agent model exactly matches any human model.
Specification-by-specification comparison showing that none of the agent-generated models (from 20 executions) are identical to any human analyst's model in the many-analysts baseline.
Both agents' effect estimates remain broadly aligned with the human consensus.
Comparison of effect estimate distributions from Claude Code and Codex (20 runs each) to the human many-analysts consensus; reported alignment/broad agreement between agent estimates and human consensus.