Evidence (6917 claims)
Adoption: 8625 claims
Productivity: 7686 claims
Governance: 6917 claims
Human-AI Collaboration: 6574 claims
Org Design: 4189 claims
Innovation: 4131 claims
Labor Markets: 3588 claims
Skills & Training: 2985 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Governance
At the design layer, Codex matches human methodological diversity.
Comparison of methodological specifications produced by Codex (20 independent executions) to the many-analysts human baseline; reported similarity in diversity metrics between Codex outputs and human analysts.
We run 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy problem and compare them against a many-analysts human baseline.
Experimental method described in the paper: 20 independent runs/executions of each agent model (Claude Code and Codex), compared to an existing many-analysts human baseline.
Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
Claim of artifact availability hosted on GitHub (URL provided) as part of the paper's resources.
Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability.
Factual claim referencing existing standards (MCP and A2A) and their scopes; no citations or supporting documentation included in the provided excerpt.
Production deployments are no longer one human supervising one model; they are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries.
Stated as a general characterization of modern production deployments; no quantitative data or case counts provided in the excerpt.
The value of an in-band cooperative deny signal (Recuse Signal) is an empirical question: it was previously unmeasured and the paper measures whether compliant LLM agents honor such a signal.
Motivation and framing in the paper; they position their controlled experiment as the measurement addressing this previously unmeasured question.
We searched seven databases (plus backward and forward citation searching) and synthesised 13 empirical studies published between 2018 and 2025.
Methods reported in abstract: PRISMA-ScR scoping review with a preregistered protocol; explicit count of included studies and publication date range.
From Codeforces histories we build an AI-prompt signature characterised by more first-attempt acceptances and fewer attempts and retries, consistent with AI-assisted practice.
Empirical construction from CF submission histories (pattern: increased first-try accepts, fewer retries). Method: analysis of historical submission logs; sample size not stated in abstract.
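As a rough illustration of the two quantities named in this claim, the sketch below computes a first-attempt acceptance rate and a mean number of attempts per problem from one user's submission history. The record layout (`problem_id`, `accepted`), the function name `signature`, and the sample history are all hypothetical; the paper's actual signature construction is not described in the excerpt.

```python
from collections import defaultdict

def signature(submissions):
    """Summarise one user's chronological submission history.

    submissions: list of (problem_id, accepted) tuples, oldest first.
    This layout is an assumption made for illustration only.
    """
    attempts = defaultdict(int)   # total submissions per problem
    first_try = {}                # verdict of the first submission per problem
    for problem_id, accepted in submissions:
        attempts[problem_id] += 1
        if attempts[problem_id] == 1:
            first_try[problem_id] = accepted
    n = len(attempts)
    if n == 0:
        # No problems attempted: report zeros rather than dividing by zero.
        return {"first_attempt_acceptance_rate": 0.0,
                "mean_attempts_per_problem": 0.0}
    return {
        "first_attempt_acceptance_rate": sum(first_try.values()) / n,
        "mean_attempts_per_problem": sum(attempts.values()) / n,
    }

# Hypothetical history: problem A needed a retry, B and C passed first time.
history = [("A", False), ("A", True), ("B", True), ("C", True)]
sig = signature(history)
print(sig)
```

Under the claim's framing, a higher first-attempt rate together with fewer attempts per problem would move a user toward the AI-prompt signature; any threshold for flagging would have to come from the paper itself.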
The International Collegiate Programming Contest (ICPC) and the International Olympiad in Informatics (IOI) prohibit AI under proctoring and admit entrants through qualification rounds, whereas online Codeforces (CF) contests are unproctored and open to all.
Descriptive factual claim about contest rules and formats (institutional description in paper); based on contest rules and organizational formats referenced by authors.
Future research should adopt a more intersectional approach exploring how race, class, and geography interact with gender to shape platform work experiences.
Research limitations and implications section of the paper recommends more intersectional research directions.
This paper conducted a systematic literature review and thematic synthesis of 48 peer‑reviewed studies (2010–2024) to analyze the gendered dynamics of AI‑mediated digital labor.
Methods statement in the paper: systematic literature review and thematic synthesis; explicitly reports reviewing 48 peer‑reviewed studies covering 2010–2024.
This study benchmarks Algeria’s readiness to adopt AI against Morocco, Egypt, and Turkey using data from the World Bank (2022), the Oxford Insights Government AI Readiness Index, and sector-specific studies.
Methodological statement in the paper specifying data sources used for the comparative assessment (World Bank 2022, Oxford Insights index, sector studies).
Over 100 participants collaborated with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours.
Study description: experimental participants (reported as "Over 100 participants") each paired with one of four named models on a ~5-hour coding task designed to mimic real-world workflows.
We conduct the first large-scale study of human oversight in AI coding sabotage.
Authors state they ran a large-scale user study; described as the first such study focused on human oversight in AI coding sabotage (methodological claim).
Verified word-count analysis of the Executive Order shows the word 'security' appears 17× and the word 'cyber' appears 14×, while there are zero mentions of 'labor', 'education', 'culture', 'fairness', 'transparency', 'attribution', 'provenance', 'meaning', or 'commons'.
Automated/count-based analysis of the EO text (single-document word-count reported in the paper).
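A count of this kind can be sketched in a few lines. The example below assumes whole-word, case-insensitive matching, which may differ from the rule the paper used (for instance, under this rule "cybersecurity" would not count toward "cyber"). The function name `count_terms` and the sample sentence are hypothetical stand-ins, not the Executive Order text.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count whole-word, case-insensitive occurrences of each term."""
    # Tokenise into lowercase alphabetic words, then tally them.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    # Counter returns 0 for words that never appear.
    return {term: words[term.lower()] for term in terms}

# Hypothetical sample text; a real run would load the document under study.
sample = "Cyber security and national security depend on cyber readiness."
counts = count_terms(sample, ["security", "cyber", "labor"])
print(counts)
```

Because the result depends on tokenisation choices (hyphenation, compounds, plural forms), reproducing the reported figures would require the paper's exact matching rule.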
The aggregate Stanford HAI AI Vibrancy Score shows no significant within-country effect on tourism’s direct GDP share after controlling for macroeconomic factors.
Fixed-effects estimation with clustered standard errors on panel data from 33 countries (2017–2023); reported coefficient β = 0.061, p = 0.622, with macroeconomic controls.
These are mechanism-oriented synthetic results, not estimates of real firm behavior in a jurisdiction or industry.
Explicit qualification in the abstract stating the scope and limits of inference (paper text).
The study uses a synthetic agent-based reinforcement-learning simulation that separates actual conduct near a legal threshold from proximity in the computable enforcement signal.
Methodological description in abstract: ABM/RL simulation with explicit separation of conduct vs. computable signal; run counts reported (150 seed-level scenario runs, 378 computability-sweep runs, 288 Latin-hypercube runs) and a 2,880,000-row firm-period panel.
Ordinary adaptive updates do not reliably reduce boundary search.
ABM/RL simulation experiments reported in the paper (multiple runs and the firm-period panel); qualitative comparative statement from simulation outputs.
There is no evidence of improved win rates for AI-flagged complaints; AI-flagged complaints are more likely to be dismissed and to terminate at earlier procedural phases.
Outcome analysis linking AI-flag status to litigation outcomes (win rates, dismissal rates, termination phase) using case metadata.
This study uses panel data from 281 Chinese cities between 2005 and 2022, treats establishment of national GIPs as a quasi‑natural experiment, and applies a double machine learning approach.
Methods description in the paper explicitly states data coverage (281 Chinese cities, 2005–2022), research design (quasi‑natural experiment), and estimation strategy (double machine learning).
Experts rated 24 AI risks on harm probability and severity, sector and actor vulnerability, actor responsibility, and overall concern.
Study design described in paper: set of 24 defined AI risks rated across several dimensions by Delphi panel participants (n=272).
We conducted a three-round Delphi study in late 2025 with 272 international AI experts.
Methodological description in the paper: three-round Delphi study, timing reported as late 2025, sample size reported as 272 international AI experts.
This study constructs a comprehensive evaluation system of urban ecological resilience from three dimensions: potential, elasticity, and stability.
Methodological description in the paper: authors state they constructed a composite resilience evaluation system composed of three specified dimensions for prefecture-level cities.
Explicit commercial content (product placement) shows no engagement premium (−3.8%, not significant).
Analysis comparing videos labeled for explicit commercial content (product placement) to others; reported percent difference and non-significance.
We conducted a multimodal AI audit of 5,051 videos across 79 kidfluencer channels using weak supervision (LLM-based classification of titles and GPT-4 Vision analysis of thumbnails and descriptions across six literature-grounded dimensions) to assign a probabilistic exploitation score to each video.
Described dataset and methods in paper: multimodal automated pipeline combining weak supervision labeling functions (LLM classifiers on titles, GPT-4 Vision on thumbnails/descriptions) applied to 5,051 videos from 79 channels.
The study uses listed companies in China's manufacturing industry from 2010 to 2023 as the research sample.
Authors explicitly state the empirical sample: listed manufacturing firms in China covering 2010–2023.
The positive relationship between BDTA and CEE remains robust after a series of robustness tests and endogeneity tests.
Authors state they conducted robustness checks and endogeneity tests (unspecified in the summary) and report that the main regression results remain robust.
Brain privacy has both personal and social attributes; its protection therefore implicates individual interests and technological development.
Normative/legal argumentation and conceptual analysis presented in the paper (no empirical data reported).
New York City’s Local Law 144 mandates annual bias audits to increase transparency.
Statement of law/policy in paper (factual claim about NYC Local Law 144); legal requirement as described in the text.
The fairness of AI-enabled hiring systems remains uncertain.
Statement in paper (background/interpretive claim); no direct empirical measure provided in the excerpt.
The study employs a comparative mixed-methods approach (comparative institutional analysis) of leading financial systems in China, the United States, and the United Kingdom (2022–2025), integrating secondary quantitative indicators with qualitative documentary evidence.
Direct methodological statement in the abstract describing the study design and data sources.
The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows.
Conceptual argument in the paper articulating difference between two defined concepts (Agentic Technical Debt vs Stochastic Tax); no empirical demonstration.
Stochastic Tax is the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds.
Paper provides a formal definition / conceptual framing of 'Stochastic Tax'; stated as an operational concept (no empirical quantification provided).
Agentic Technical Debt is the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed.
Paper provides a formal definition / conceptual framing of 'Agentic Technical Debt'; presented as a definitional contribution rather than an empirically measured quantity.
Agentic AI systems reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback.
Descriptive/definitional statement in the paper; presented as characteristics of agentic systems rather than supported by empirical measurement.
Agentic AI systems are increasingly being explored as production infrastructure.
Stated as an observation in the paper's introduction/abstract; no empirical data, sample, or formal measurement provided (conceptual/observational claim).
We identify four archetypes (data orchestrators, aggregators, niche specialists, and cloud orchestrators).
Paper states it develops a taxonomy and explicitly lists four archetypes; based on the taxonomy development and conceptual classification reported in the paper (no sample size or quantitative empirical test reported in abstract).
Regression models and moderation analyses were performed in R to examine associations between governance exposure, AI maturity, and adaptation intensity.
Methods statement: 'Regression models and moderation analyses were performed in R (R Computing, Austria) to examine associations between governance exposure, AI maturity, and adaptation intensity.'
Path-specific composite indices for bifurcation, modularity, ethical signaling, and compartmentalization were quantified using validated scales.
Methods description in the paper: 'Path-specific composite indices ... were quantified using validated scales.'
The study coded 500 adaptation events.
Explicit statement: 'and 500 coded adaptation events.'
The qualitative dataset included 48 executive and technical informants.
Explicit statement: 'including 48 executive and technical informants'.
The study uses a comparative multi-case dataset of 12 multinational firms (4 tri-jurisdictional, 4 Atlantic, 4 China-primary).
Explicit dataset description in the paper: 'A comparative multi-case dataset of 12 multinational firms (4 tri-jurisdictional, 4 Atlantic, 4 China-primary) was analyzed.'
This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions.
Comparison of action-recipient patterns vs action-type distributions across the experimental conditions in the simulation; reported observation that action-type distributions did not show increased negative actions and that audits of action logs (action types) failed to reveal the bias.
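The difference between auditing action types and auditing action recipients can be illustrated with a toy log. Everything in this sketch is hypothetical: the `(action_type, recipient_group)` record layout, the group labels "X" and "Y", and the two example logs are invented for illustration and are not taken from the paper's simulation.

```python
def action_type_share(log, action_type):
    """Fraction of all logged actions that have the given type."""
    return sum(1 for a, _ in log if a == action_type) / len(log)

def recipient_share(log, action_type, group):
    """Among actions of the given type, fraction aimed at the given group."""
    targeted = [g for a, g in log if a == action_type]
    return targeted.count(group) / len(targeted)

# Hypothetical logs of (action_type, recipient_group) pairs.
baseline = [("penalize", "X"), ("penalize", "Y"),
            ("reward", "X"), ("reward", "Y")]
treated = [("penalize", "Y"), ("penalize", "Y"),
           ("reward", "X"), ("reward", "X")]

# An audit of action types alone sees identical distributions...
type_gap = (action_type_share(treated, "penalize")
            - action_type_share(baseline, "penalize"))

# ...while conditioning on the recipient exposes the shift toward group Y.
recipient_gap = (recipient_share(treated, "penalize", "Y")
                 - recipient_share(baseline, "penalize", "Y"))

print(type_gap, recipient_gap)
```

In this toy case the share of negative actions is unchanged across conditions, yet all of them fall on one group in the treated log, which is the pattern an action-type audit would miss.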
The counterfactual toll is explicitly non-unique: the paper demonstrates that more than one valid toll can exist.
Mathematical argument in the paper identifying conditions or constructions that lead to multiple valid tolls (formal counterexample or theorem on non-uniqueness).
The paper proposes a policy framework consisting of six groups of solutions for Vietnam to both promote AI development and control risks in the digital age.
Declared in abstract: the paper presents a six-group policy framework for Vietnam; the framework itself is the paper's output (proposal), not empirically tested in the paper.
This study employs document synthesis and comparative analysis of international policies.
Methodological statement in the paper abstract describing the research approach; no sample size specified beyond document sources.
The rise of artificial intelligence (AI) is shaping a new Agent Economy (AE), in which autonomous AI agents represent humans in performing a wide range of complex tasks.
Statement in paper abstract/intro (conceptual definition); no empirical data or sample reported.
The study contributes a taxonomy of AI workforce impact, a Workforce Resilience Readiness Score (WRRS), an AI Workforce Trust Index (AWTI), an Ethical Automation Boundary concept, and a pilot empirical validation design.
Declared methodological and conceptual contributions in the paper (these are presented as deliverables of the study; no validated results reported in the excerpt).
The International Labour Organization's 2025 update highlights the need to assess the exposure of generative AI at the task level using task data, expert input, and AI model predictions.
Reference to ILO 2025 update recommendation described in the paper (policy/technical guidance rather than primary empirical data in the excerpt).