Evidence (14055 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
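Because the outcome categories differ widely in size, the directional counts are easier to compare as shares. The following is a minimal sketch using three rows copied from the matrix above. The helper name `direction_shares` is illustrative. Shares are computed over the four listed directions rather than the Total column, since the listed counts do not always sum to the stated total (for example, the Firm Productivity row lists 600 directional claims against a total of 606).

```python
# Rows copied from the evidence matrix: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (435, 57, 88, 20),
    "Job Displacement": (12, 80, 20, 1),
    "Error Rate": (69, 92, 10, 2),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the listed directional claims."""
    total = sum(counts)
    return {name: count / total for name, count in zip(DIRECTIONS, counts)}


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative"
    )
```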
We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping.
Paper contribution: authors present a four-tier taxonomy as a conceptual framework; this is a descriptive/constructive claim about the content of the paper itself.
This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals.
Method: sociotechnical audit comparing six commercial LLMs against a rubric created via a Delphi process with 20 infrastructure professionals.
Regulatory technology is viewed as a governance arrangement that organizes relations between firms, banks, insurers, logistics actors, buyers, and regulators.
Conceptual framing developed through the interpretive synthesis of multiple literature streams in the paper.
We design a budget split intervention that directly incorporates unknown users and targets users with Google-inferred gender labels (male, female).
Authors' stated experimental/intervention design implemented in collaboration with a state-level government agency; methodological claim about the intervention (no sample size or deployment details in the excerpt).
The framework reframes the central question of autonomous software engineering from whether a foundation model can produce a patch to whether the model-harness-environment system can produce a verifiably correct, attributed, and maintainable change.
Conceptual reframing and argument presented in the abstract as a conclusion of the proposed framework and evaluation approach.
We formalize this substrate as 'AI Harness Engineering' and identify eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording.
Methodological/conceptual contribution described in the paper (abstract) that lists eleven component responsibilities as part of the formalization.
Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results.
Experimental setup reported by authors: 9 models from 3 providers, with 30 trials per model using real SDK tool-use and autonomous graph queries.
Oracle Poisoning manipulates the data agents reason over, not their instructions, distinguishing it from prompt injection.
Theoretical distinction and definitional comparison made by the authors (conceptual argument in the paper).
A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering.
Methodological claim describing a six-gate producer audit procedure in the paper to diagnose engineering failures vs. commercial steering.
Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual).
Method description of the paired counterfactual experimental design used by TourMart.
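The paired counterfactual described above can be sketched as a per-pair difference between the two prompt conditions. This is a toy illustration only: the record fields, the outcome scores, and the function name `steering_deltas` are assumptions, not TourMart's actual data model or metric.

```python
from statistics import mean

# Hypothetical audit records: each holds one traveler and one bundle fixed and
# stores an outcome score under both prompt conditions. The field names and the
# scores are illustrative assumptions, not values from the paper.
RECORDS = [
    {"traveler": "t1", "bundle": "b1", "commission_aware": 0.62, "factual": 0.50},
    {"traveler": "t2", "bundle": "b1", "commission_aware": 0.48, "factual": 0.47},
    {"traveler": "t3", "bundle": "b2", "commission_aware": 0.71, "factual": 0.55},
]


def steering_deltas(records):
    """Return the per-pair difference: commission-aware minus factual."""
    return [r["commission_aware"] - r["factual"] for r in records]


deltas = steering_deltas(RECORDS)
print("per-pair deltas:", [round(d, 2) for d in deltas])
print("mean steering delta:", round(mean(deltas), 3))
```

Because traveler and bundle are identical within each pair, any difference in the score is attributable to the prompt condition rather than to the inputs.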
We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers — lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare).
Methodological proposal described in paper: design of an audit instrument and two formal levers (lambda, kappa).
Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice.
Descriptive assertion in paper about product/industry UI change; no empirical sample or formal measurement reported in excerpt.
We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget.
Methods / experimental setup reported in the paper: five function-calling benchmarks and a DistilBERT classifier trained and deployed under latency constraints.
We show that ρ ≥ 1 is the no-excess-crowding parity condition and connect Δ to an adoption game with exposure-dependent redundancy costs.
Theoretical result derived in the paper linking the human-relative diversity ratio ρ to a parity condition and relating the excess-crowding coefficient Δ to an adoption-game model with exposure-dependent redundancy costs.
We characterize optimal and fair policies in the short term.
Theoretical results/characterizations presented in the paper identifying optimal policies and fair-policy structures for the short-term setting.
We theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF).
Theoretical analysis in the paper using the Price of Fairness formalism to study trade-offs.
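For orientation, one common convention defines the Price of Fairness as the ratio of the best achievable utility to the best utility achievable under the fairness constraint. The sketch below uses that convention on a made-up finite policy set; the paper's own definition, utilities, and fairness notion may differ, and the policy names and numbers are hypothetical.

```python
# Hypothetical policy set: utilities and fairness flags are made up for illustration.
POLICIES = [
    {"name": "greedy", "utility": 10.0, "fair": False},
    {"name": "balanced", "utility": 8.0, "fair": True},
    {"name": "uniform", "utility": 5.0, "fair": True},
]


def price_of_fairness(policies):
    """Ratio of the best unconstrained utility to the best fair utility (>= 1)."""
    best_overall = max(p["utility"] for p in policies)
    best_fair = max(p["utility"] for p in policies if p["fair"])
    return best_overall / best_fair


print("PoF:", price_of_fairness(POLICIES))
```

Under this convention a value of 1 means fairness costs nothing, and larger values mean more utility is given up to satisfy the constraint.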
We introduce notions of group fairness for both the short and long term.
Methodological contribution in the paper: formal definitions of short-term and long-term group fairness introduced by the authors.
The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).
Authors' framing and explicit statements of intended contribution; supported by the failure diagnosis, diagnostic protocol, and Trace-Prior RL repair demonstrated in simulator experiments.
We position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots.
Architectural/framework description in the paper that maps DePIN elements into a vertically integrated stack; conceptual/mapping method without empirical measurement.
We evaluate 4 popular agent harnesses and 7 foundation models on Workspace-Bench.
Experimental setup reported in the paper listing 4 agent harnesses and 7 foundation models used in evaluations.
The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool.
System description in paper noting that the curated index is available via a Gosset MCP server for external models to call.
All five systems receive the same natural-language query and the same JSON output schema.
Methodological detail reported in paper describing controlled inputs across systems.
We benchmark Gosset ... against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets.
Experimental benchmark described in paper: direct comparison of Gosset versus four named frontier systems on 10 targets; methodological statement.
Weight-based memory generalizes by applying abstract rules to inputs never seen before.
Conceptual claim grounded in the paper's theoretical distinction between weight-based learning and retrieval; references Complementary Learning Systems theory; no empirical sample in abstract.
Retrieval generalizes by similarity to stored cases.
Conceptual claim stated in paper (distinction between retrieval-based and weight-based generalization); supported by theoretical characterization, not empirical data in abstract.
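The contrast between the two forms of generalization can be illustrated with a toy example. It is a sketch under strong assumptions: the stored cases follow y = 2x, "retrieval" is a one-nearest-neighbour lookup, and "weight-based memory" is a single least-squares weight. None of these choices come from the paper.

```python
# Stored (input, output) cases, all consistent with the abstract rule y = 2x.
STORED_CASES = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def retrieve(query):
    """Retrieval: answer with the output of the most similar stored case."""
    _, nearest_y = min(STORED_CASES, key=lambda case: abs(case[0] - query))
    return nearest_y


def fit_weight(cases):
    """Weight-based learning: least-squares slope for a line through the origin."""
    return sum(x * y for x, y in cases) / sum(x * x for x, _ in cases)


WEIGHT = fit_weight(STORED_CASES)


def apply_rule(query):
    """Apply the learned rule to any input, including ones never seen before."""
    return WEIGHT * query


print("retrieval at x=10:", retrieve(10.0))
print("learned rule at x=10:", apply_rule(10.0))
```

For an input far from every stored case, the lookup can only return the closest case's output, while the learned weight extrapolates the rule.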
The study uses LinkedIn and GitHub data to examine firms' adoption of GitHub Copilot and related SWE skills and labor outcomes.
Statement of data sources and study design reported in the paper (LinkedIn profiles/skill listings linked to GitHub repository/adoption signals).
The process of synthesizing information is inherently iterative: users explore content, identify relationships between concepts, and continuously reorganize their mental models.
Conceptual description of the cognitive/process characteristics in the paper's background/motivation (no empirical measurement reported).
Many practical machine learning applications are online and sequential, meaning prior decisions inform future ones — a setting in which fairness challenges differ from standard supervised learning.
Background claim in the paper motivating the work; literature context and conceptual discussion rather than new empirical data.
The paper establishes a taxonomy of forgetting mechanisms: passive decay-based, active deletion-based, safety-triggered, and adaptive reinforcement-based.
Explicit taxonomy presented in paper (listed in abstract).
We evaluate Aether over synthetic network change scenarios covering main classes of network changes and on past incidents from a major ISP operational network.
Evaluation methodology stated in paper abstract: tested on synthetic scenarios and historical incidents from one major ISP (no numeric sample size provided in abstract).
Expert assessment involved three senior academics producing reports and appointment-level syntheses.
Paper states that three senior academics produced assessment reports and synthesised appointment-level recommendations; n=3 assessors.
The distillation pipeline used an eight-layer extraction method and a nine-module skill architecture grounded in local, closed-corpus analysis.
Methods description in paper specifying an eight-layer extraction approach and nine-module skill architecture; presented as the technical design of the distillation pipeline.
Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback.
Descriptive claim about the general architecture of Agent-Aided Design systems as asserted by the authors (methodological description), not an empirical test; no quantitative evaluation provided here.
We evaluate four mechanisms to enable cooperation: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players.
Description of experimental design / mechanisms evaluated in the study across four social dilemmas; details on implementation and sample sizes not provided in the excerpt.
CoCoGen+ formulates each training round as a weighted potential game in which organizations strategically decide how much synthetic data to generate by balancing learning performance gains against computational costs and competition-caused utility losses.
Theoretical formulation and game-theoretic modeling provided in the paper (analytical derivation); no empirical sample size reported.
The paper provides lessons for scaling regression automation and enabling effective human-AI teaming in Agile settings.
Stated contribution of the paper (synthesis of lessons from the industrial case study).
The Copilot was integrated with Hacon's CI pipelines and operates asynchronously as a 'silent AI teammate', producing candidate scripts for human review.
System integration and deployment description within the case study (implementation detail reported in the paper).
We conducted an exploratory industrial case study of the Hacon Test Automation Copilot, an agentic AI system that generates system-level regression test scripts from validated specifications using retrieval-augmented generation and a multi-agent workflow.
Methodological claim: description of the study design and the system; the paper reports a single industrial case study at Hacon (a Siemens company).
Predictive outputs are translated into allocation rules, with emphasis on mean–variance optimization, shrinkage-based risk estimation, risk parity, hierarchical allocation, and reinforcement-learning-based dynamic rebalancing.
Surveyed literature on portfolio construction and allocation techniques described in the review (methodological overview; no single empirical dataset or sample size).
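As a concrete instance of turning a risk estimate into an allocation rule, the sketch below computes inverse-volatility weights, a simplified form of risk parity that matches equal risk contribution only when all pairwise correlations are equal. The volatility figures and the function name are illustrative assumptions, not taken from the review.

```python
def inverse_volatility_weights(volatilities):
    """Weight each asset in proportion to 1 / volatility, normalised to sum to 1."""
    inverse = [1.0 / v for v in volatilities]
    total = sum(inverse)
    return [x / total for x in inverse]


# Hypothetical annualised volatilities for three assets.
vols = [0.10, 0.20, 0.40]
weights = inverse_volatility_weights(vols)

for vol, weight in zip(vols, weights):
    # weight * vol is the same for every asset, which is the point of the rule.
    print(f"vol={vol:.2f} weight={weight:.4f} weight*vol={weight * vol:.4f}")
```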
SAFI measures LLM performance on text-based representations of skills, not full occupational execution.
Methodological caveat stated by the authors clarifying the scope and limits of SAFI.
We propose an AI Impact Matrix that positions skills into four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk.
Conceptual/interpretive framework introduced by the authors; described in text as proposed by the paper.
Legitimate accountability is axiomatized through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated).
Paper presents an explicit axiomatization listing these four properties as definitions/axioms forming the normative criteria for legitimate accountability.
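The four properties can be read as checks on a candidate allocation of responsibility. The sketch below is one possible numeric reading, not the paper's formalization: it assumes responsibility shares are normalised to sum to 1 and that causal contribution and predictive capacity are scores in [0, 1]. The agent names and values are hypothetical.

```python
def check_allocation(responsibility, causal, foresight, tol=1e-9):
    """Check a responsibility allocation against the four properties.

    All three arguments map agent name -> number. The numeric encoding is an
    assumption made for illustration.
    """
    agents = list(responsibility)
    return {
        # Responsibility requires causal contribution.
        "attributability": all(
            responsibility[a] <= tol or causal[a] > 0 for a in agents
        ),
        # Responsibility cannot exceed predictive capacity.
        "foreseeability_bound": all(
            responsibility[a] <= foresight[a] + tol for a in agents
        ),
        # At least one agent bears non-trivial responsibility.
        "non_vacuity": any(responsibility[a] > tol for a in agents),
        # All responsibility is allocated (shares sum to 1).
        "completeness": abs(sum(responsibility.values()) - 1.0) <= tol,
    }


result = check_allocation(
    responsibility={"developer": 0.6, "operator": 0.4, "bystander": 0.0},
    causal={"developer": 0.7, "operator": 0.3, "bystander": 0.0},
    foresight={"developer": 0.8, "operator": 0.5, "bystander": 0.1},
)
print(result)
```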
Collective behaviour is characterised through interaction graphs and joint action spaces.
Paper specifies interaction graphs and joint action spaces as part of the formal model (definitions and formal structure).
Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social).
Paper defines autonomy as a 4-dimensional information-theoretic profile (conceptual/mathematical definition within the formal model).
Using a strictly algorithmic baseline (mathematical bottleneck aggregation), we calculate Relative Occupational Automation Indices (OAI) for the U.S. labor market based on the DWA-level scores.
Method and calculation claim: algorithmic baseline aggregation applied across the 923 occupations / 2,087 DWAs to produce OAIs mapped to the U.S. labor market. Specific aggregation formula referenced but not numerically detailed in the excerpt.
We deconstructed 923 occupations into 2,087 Detailed Work Activities (DWAs).
Explicit data processing claim in the paper: mapping of 923 occupations to 2,087 DWAs for analysis.
The economic model for IASCA follows the FDA's PDUFA precedent, with progressive certification fees representing 0.1-1% of model training costs.
Proposal specifies that IASCA's funding would mirror the FDA PDUFA model and states a fee range of 0.1–1% of model training costs; this is an asserted financing mechanism, not empirically validated in the excerpt.
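To give a sense of scale, the stated 0.1-1% range can be applied to a training budget. The sketch below computes only the lower and upper bounds; the proposal's progressive schedule within that range is not given in the excerpt, and the $100 million training cost is a hypothetical figure.

```python
def fee_range(training_cost_usd):
    """Return (low, high) certification fees at 0.1% and 1% of training cost.

    Uses integer basis points (10 bps and 100 bps) to avoid float rounding.
    """
    low = training_cost_usd * 10 // 10_000
    high = training_cost_usd * 100 // 10_000
    return low, high


# Hypothetical training cost of $100 million.
low, high = fee_range(100_000_000)
print(f"fee range: ${low:,} to ${high:,}")
```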
IASCA is modelled after existing international and national regulatory bodies such as the IAEA, FAA, and FDA.
Proposal explicitly states IASCA is modelled after the IAEA, FAA, and FDA; this is an analogy/organizational design claim rather than an empirical finding.
A variance decomposition indicates that most expert disagreement about long-run macroeconomic outcomes is driven by differing beliefs about the economic effects of highly capable AI, rather than disagreement about the pace of AI capability progress.
Authors' variance-decomposition analysis of survey responses separating components due to beliefs about AI capabilities vs. beliefs about economic effects given capabilities (methodological details referenced but not provided in excerpt).
A life insurance system integrated into an industry partner mobile app was tested in two experiments.
Paper reports two experiments running the ARQuest-enabled life insurance system inside a partner mobile app; experimental setup is stated though sample sizes are not provided in the excerpt.