Evidence (14055 claims)
Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance).
Controlled experiment reported in the paper: 600 runs across five named industries (experimental setup reported in abstract).
The paper addresses three institutional audiences: enterprise finance and operations teams; government and regulatory bodies developing AI labor displacement frameworks; and financial markets requiring a machine labor index as a long-duration economic signal.
Stated intended audiences in the paper (descriptive statement).
BCR is a minimalist, single-stage training paradigm that trains the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy.
Methodological description presented in the paper describing the training procedure and objective (single-stage, per-instance accuracy reward, N-problem batching in shared context).
The framework is calibrated with O*NET task data, a survey of 3,778 domain experts, and GPT-4o-derived task decompositions, and implemented in computer vision.
Calibration and empirical implementation using O*NET, a domain expert survey (n=3,778), and GPT-4o task decompositions; applied to computer vision tasks.
We introduce an entropy-based measure of task complexity that maps model accuracy into a labor substitution ratio, quantifying human labor displacement at each accuracy level.
New metric proposed in the paper (entropy-based task complexity) and mapping procedure from accuracy to substitution ratio; implemented in the framework.
Costinot and Werning (2023) develop a sufficient-statistic approach and find optimal technology taxes of 1–3.7% on robots.
Citation reported in the paper summarizing Costinot and Werning (2023)'s quantitative sufficient-statistic estimate.
Guerreiro et al. (2022) characterize an optimal Mirrleesian tax system with automation and find that robot taxes should be transitional: high when incumbent workers cannot retrain, converging to zero as new cohorts adjust skill investments.
Citation reported in the paper summarizing Guerreiro et al. (2022)'s theoretical result on transitional robot taxes.
If labor becomes economically redundant, the policy focus shifts from steering innovation to redesigning public finance and redistribution (e.g., new tax instruments, redistribution mechanisms).
Theoretical scenario analysis in the paper with references to related works (Korinek and Juelfs 2024; Korinek and Lockwood 2026).
Evaluation is carried out under three frozen context configurations (diff only: config_A; diff with file content: config_B; full context: config_C) enabling systematic ablation of context provision strategies.
Methodological description: three fixed context configurations defined and used for ablation experiments.
Traffic performance is evaluated using the Fundamental Diagram (FD) under varying driver heterogeneity, heterogeneous time-gap penetration levels, and different shares of RL-controlled vehicles.
Description of experimental/evaluation setup in the paper: macroscopic evaluation via Fundamental Diagram across varied scenario parameters. No numeric sample size provided in the claim text.
CriQ is a sister app to Dream11, India's largest fantasy sports platform with over 250 million users.
Descriptive statement in the paper providing context about the application domain and user base.
We performed an extensive evaluation of 37 state-of-the-art Vision-Language Models on MultihopSpatial.
Empirical evaluation described in the paper listing the number of models evaluated (37).
We critically compare LLM-generated rulings against 10,000 real-world court judgments from China Judgments Online (CJOL).
Dataset statement: the paper compares model outputs to a corpus of 10,000 CJOL labor dispute judgments.
We introduce a novel stress test that evaluates LLM-generated labor dispute outcomes by injecting social media sentiment as an external pressure.
Methodological description in the paper: a designed stress test where social media sentiment is used to perturb LLM outputs for labor dispute cases.
The paper treats data as a new type of production factor and endogenizes it within the production function.
Theoretical/methodological: the paper constructs a macro-level theoretical model that explicitly includes data as an endogenous input in the production function (no empirical/sample data).
In the near term, the most plausible equilibrium is bounded autonomy, in which AI agents operate as supervised co-pilots, monitoring systems, and constrained execution modules embedded within human decision processes.
Theoretical argument and forward-looking assessment by the authors based on the proposed framework and plausibility considerations; not presented as the result of a causal empirical study in the excerpt.
Economic evaluations of GLAI should account for end-to-end risk externalities (error propagation, institutional trust, rights impacts), not only short-term productivity gains.
Methodological recommendation grounded in conceptual synthesis of technical, behavioral, and legal risks; normative argument rather than empirical result.
Generative Legal AI (GLAI) systems are built on token-prediction (LLM) architectures rather than formal legal-reasoning architectures.
Conceptual and technical analysis in the paper distinguishing GLAI from other legal-tech; literature synthesis on common LLM architectures. No original empirical dataset or sample size—qualitative/technical review.
The paper's formalism shows that prompt/system messages shape distributions over possible execution paths (indirect control) but do not evaluate actual partial paths at runtime.
Formal mapping in the paper that treats prompts as shaping a prior over paths; conceptual argument and illustrative examples.
Through a thematic review of existing research, the authors identified recurring themes about incentive schemes: their components, how researchers manipulate them, and their impact on research outcomes.
Authors' stated method and findings: thematic review (the scope/number of reviewed papers not specified in excerpt).
A critical aspect of conducting human–AI decision-making studies is the role of participants, often recruited through crowdsourcing platforms.
Claim based on the authors' thematic literature review noting participant sourcing practices (specific studies and counts not given in excerpt).
Researchers conduct empirical studies investigating how humans use AI assistance for decision-making and how this collaboration impacts results.
Statement summarizing the research landscape; supported implicitly by the authors' thematic review of existing empirical studies (number of studies not specified in excerpt).
The study provides empirical evidence specific to a small open EU economy (Slovakia) on the relationship between AI adoption and labour productivity.
Use of harmonised Eurostat enterprise and productivity data for Slovakia and EU27 over 2021–2024, analysed with descriptive statistics, gap analysis, dynamics of change, correlation, and an illustrative regression model.
Returns to AI are heterogeneous across firms; estimating treatment effects requires attention to selection, complementarities, and dynamic adoption pipelines.
Methodological argument referencing treatment-effect literature and observed firm heterogeneity; supported by conceptual examples rather than a single empirical treatment-effect estimate.
Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
Claim of artifact availability hosted on GitHub (URL provided) as part of the paper's resources.
Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability.
Factual claim referencing existing standards (MCP and A2A) and their scopes; no citations or supporting documentation included in the provided excerpt.
Production deployments are no longer one human supervising one model; they are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries.
Stated as a general characterization of modern production deployments; no quantitative data or case counts provided in the excerpt.
The six middle macros form a low-contrast band between the poles; equivalence testing (TOST at d = 0.2) admits only 1 out of 15 macro-pair comparisons as equivalent.
Authors' analysis of pairwise macro comparisons using Two One-Sided Tests (TOST) for equivalence at Cohen's d = 0.2.
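As a rough illustration of the equivalence test named above, the sketch below runs TOST on two samples with bounds of ±0.2 in Cohen's d units. It uses a large-sample normal approximation rather than the t distribution. The function name `tost_equivalent`, the 0.05 significance level, and the sample data are assumptions for illustration, not details from the paper. Six groups give 15 unordered pairs, which is consistent with the comparison count in the claim.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_equivalent(x, y, d_bound=0.2, alpha=0.05):
    """TOST with equivalence bounds of +/- d_bound (Cohen's d), normal approximation."""
    nx, ny = len(x), len(y)
    # Pooled standard deviation, used both for Cohen's d and the standard error.
    sp = sqrt(((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2))
    diff = mean(x) - mean(y)
    se = sp * sqrt(1 / nx + 1 / ny)
    delta = d_bound * sp  # equivalence bound on the raw scale
    z = NormalDist()
    p_lower = 1 - z.cdf((diff + delta) / se)  # H0: diff <= -delta
    p_upper = z.cdf((diff - delta) / se)      # H0: diff >= +delta
    # Equivalence is declared only if both one-sided tests reject.
    return max(p_lower, p_upper) < alpha


# Six groups yield C(6, 2) unordered pairs.
n_pairs = len(list(combinations(range(6), 2)))

# Hypothetical data: the same values at two sample sizes, plus a shifted copy.
large = [i % 10 for i in range(2000)]
small = [i % 10 for i in range(20)]
shifted = [v + 5 for v in large]

print(n_pairs)
print(tost_equivalent(large, large))
print(tost_equivalent(small, small))
print(tost_equivalent(large, shifted))
```

TOST can fail to declare equivalence even when two samples have identical means, because a small sample does not pin the difference down tightly enough. That may be one reason a "low-contrast band" admits so few equivalent pairs.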
We decomposed 1,961 O*NET Detailed Work Activities (DWAs) into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert human-in-the-loop (HITL) calibration.
Empirical method reported by the authors: automated multi-agent LLM pipeline plus 31-expert HITL calibration producing the stated counts (1,961 DWAs -> 15,817 micro-actions).
Empirical research since Frey and Osborne (2017) has converged on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions.
Literature synthesis / statement in the paper referencing Frey and Osborne (2017) and subsequent empirical work using continuous exposure scores.
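A minimal sketch of what "linear aggregation across capability dimensions" can look like, assuming the score is a weighted average of per-dimension scores that already lie in [0, 1]. The dimension names, weights, and the function `exposure_score` are hypothetical and are not taken from any of the cited studies.

```python
def exposure_score(capability_scores, weights):
    """Weighted average of per-dimension scores; stays in [0, 1] when inputs do."""
    if any(not 0.0 <= s <= 1.0 for s in capability_scores.values()):
        raise ValueError("capability scores must lie in [0, 1]")
    total = sum(weights.values())
    if any(w < 0 for w in weights.values()) or total <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # Linear aggregation: sum of weight * score, normalised by the total weight.
    return sum(weights[k] * capability_scores[k] for k in weights) / total


# Hypothetical occupation scored on three illustrative dimensions.
scores = {"language": 0.8, "perception": 0.4, "dexterity": 0.1}
weights = {"language": 2.0, "perception": 1.0, "dexterity": 1.0}
score = exposure_score(scores, weights)
print(score)
```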
Retrieval augmentation and scientist persona prompting yield only marginal gains.
Ablation/augmentation experiments comparing baseline LLM outputs to versions augmented with retrieval or scientist-persona prompting, showing only small improvements in judged quality.
6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption.
Reported study participation and rating counts: 6,749 respondents providing 25,139 rating sets on specified dimensions.
We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas that large language models (LLMs) generated from the context and puzzles of their own papers.
Study recruitment described in paper: invitations sent to authors of 121,640 recent preprints across multiple fields (biology, medicine, chemistry, social sciences).
The findings provide empirical insights for managing employee wellbeing and refining human resource strategies during organizational digital transformation.
Authors' stated implications in the discussion, based on the reported empirical associations and moderation results from the survey of 411 employees.
The study draws on the Conservation of Resources Theory and the Cognitive Appraisal Theory of Stress to explain how AI application influences employees' job insecurity via resource gain and resource threat mechanisms.
Theoretical framing stated in the introduction and discussion explaining the mechanisms (resource gain vs. resource threat) underlying the observed U-shaped association.
Data were collected via mixed online and offline questionnaires: 453 questionnaires were distributed (242 online, 211 offline); 449 were returned (242 online, 207 offline); following validity screening, 411 valid questionnaires were retained (219 online, 192 offline), yielding an effective response rate of 90.73%.
Reported survey administration and response counts provided in the methods section of the paper.
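The reported counts can be checked with a few lines of arithmetic. The sketch below assumes that "effective response rate" means valid questionnaires divided by distributed questionnaires, which is the reading that reproduces the stated 90.73%.

```python
# Counts as reported in the claim above.
distributed = {"online": 242, "offline": 211}
returned = {"online": 242, "offline": 207}
valid = {"online": 219, "offline": 192}

total_distributed = sum(distributed.values())
total_returned = sum(returned.values())
total_valid = sum(valid.values())

# Assumption: effective response rate = valid / distributed, in percent.
effective_rate = round(100 * total_valid / total_distributed, 2)

print(total_distributed, total_returned, total_valid)
print(effective_rate)
```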
Devil's Advocate (DA) is an AI assistant that critiques the human's initial ideas, whereas Dialectical Inquiry (DI) provides alternatives and synthesizes a resolution.
Conceptual/definitional claim in the paper describing the operationalization of DA and DI for the experiments.
This research empirically compares DA and DI in AI contexts.
Paper reports experimental comparison between AI behaviors implementing Devil's Advocate (DA) and Dialectical Inquiry (DI) across the studies.
Both studies examine benefit (information elaboration) and cost (cognitive load) pathways when AI supports SDM.
Paper explicitly frames both studies to measure information elaboration as a benefit pathway and cognitive load as a cost pathway; stated measurement plan in methods.
Study 2 tests mind-shaping interventions through user strategy training.
Study design described in the paper: a second experiment (Study 2) manipulating user strategy training (mind-shaping) to evaluate effects on SDM processes and outcomes.
Study 1 tests tool-shaping interventions by comparing three AI bot prototype conditions (Information-only, DA, DI) against a control treatment.
Study design described in the paper: randomized/controlled experiment (Study 1) with four conditions (three AI prototype conditions plus control).
The 'do no harm' property is confirmed empirically.
Abstract states empirical confirmation in simulations and applications; specifics (e.g., datasets, sample sizes) not included in abstract.
Including AI predictions as covariates has a 'do no harm' property: the adjusted estimator reverts to the unadjusted difference in means when predictions are uninformative.
Stated theoretical property in the paper and described as empirically confirmed in simulations and applications (per abstract).
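To make the "do no harm" property concrete, here is a sketch using a common-slope regression adjustment (an ANCOVA-style estimator). This is an assumed form chosen for illustration, not necessarily the estimator in the paper, and the function names and data are hypothetical. When the predictions are uncorrelated with outcomes within each arm, the fitted slope is zero and the adjusted estimate equals the plain difference in means.

```python
from statistics import mean


def diff_in_means(y_t, y_c):
    """Unadjusted estimate: treated mean minus control mean."""
    return mean(y_t) - mean(y_c)


def adjusted_estimate(y_t, x_t, y_c, x_c):
    """Difference in means corrected for imbalance in the AI prediction x."""
    def centered(values):
        m = mean(values)
        return [v - m for v in values]

    yt, xt, yc, xc = centered(y_t), centered(x_t), centered(y_c), centered(x_c)
    # Pooled within-arm slope of outcome on prediction.
    sxy = sum(a * b for a, b in zip(yt, xt)) + sum(a * b for a, b in zip(yc, xc))
    sxx = sum(b * b for b in xt) + sum(b * b for b in xc)
    beta = sxy / sxx if sxx > 0 else 0.0
    return diff_in_means(y_t, y_c) - beta * (mean(x_t) - mean(x_c))


# Uninformative predictions: x is uncorrelated with y inside each arm.
y_t, x_t = [1, 2, 3, 4], [1, -1, -1, 1]
y_c, x_c = [0, 1, 2, 3], [2, 0, 0, 2]
uninformative = adjusted_estimate(y_t, x_t, y_c, x_c)

# Informative predictions: y = x + 2 for treated units, y = x for controls.
y_t2, x_t2 = [3, 4, 5, 6], [1, 2, 3, 4]
y_c2, x_c2 = [0, 1, 2, 5], [0, 1, 2, 5]
informative = adjusted_estimate(y_t2, x_t2, y_c2, x_c2)

print(uninformative, diff_in_means(y_t, y_c))
print(informative, diff_in_means(y_t2, y_c2))
```

In the second dataset the arms differ in their average prediction, so the unadjusted difference overstates the built-in effect of 2, while the adjusted estimate recovers it.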
The model frames near-complete AGI substitution not merely as an efficiency transition but as a boundary case for value production under a strict political-economy theory of value.
Interpretive conclusion drawn from the theoretical model and its limiting-case implications (conceptual/theoretical claim; no empirical sample).
Under the paper's core value-theoretic assumption, AGI transfers value but does not itself create new value.
Explicit model assumption / value-theoretic premise stated in the paper (theoretical assumption, no empirical backing).
The paper distinguishes technical substitutability (the feasible replacement ceiling implied by AGI capability) from actual adoption (the realized replacement share chosen under cost, profitability, and adoption frictions).
Conceptual/theoretical definition introduced in the political-economy model (no empirical sample; definitional argument within the paper).
Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96).
Blind-panel scoring of generated reports from agents A and B; panel size and panel methodology not specified in abstract.
The value of an in-band cooperative deny signal (Recuse Signal) is an empirical question: it was previously unmeasured and the paper measures whether compliant LLM agents honor such a signal.
Motivation and framing in the paper; they position their controlled experiment as the measurement addressing this previously unmeasured question.
We searched seven databases (plus backward and forward citation searching) and synthesised 13 empirical studies published between 2018 and 2025.
Methods reported in abstract: PRISMA-ScR scoping review with a preregistered protocol; explicit count of included studies and publication date range.
Self-evaluated creative performance remained unchanged when using GenAI.
Same experiment with 82 participants; authors report no significant difference in self-evaluated creative performance between GenAI users and controls.