Evidence (13870 claims)
Adoption
8467 claims
Productivity
7558 claims
Governance
6805 claims
Human-AI Collaboration
6363 claims
Org Design
4132 claims
Innovation
4065 claims
Labor Markets
3526 claims
Skills & Training
2945 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
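
A quick arithmetic check on the matrix: in several rows the four direction columns sum to less than the Total column. The sketch below uses four rows copied from the table to compute that residual and the share of positive findings. Treating a dash cell as zero, and reading the residual as claims that carry none of the four listed directions, are assumptions; the page does not state either.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
# A dash cell in the table is treated as 0 here (assumption).
rows = {
    "Other": (749, 196, 98, 892, 1984),
    "Governance & Regulation": (817, 394, 188, 121, 1544),
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Labor Share of Income": (17, 17, 17, 0, 51),
}


def residual(row):
    """Total minus the sum of the four direction columns."""
    *directions, total = row
    return total - sum(directions)


def positive_share(row):
    """Fraction of the row's total claims that report a positive finding."""
    return row[0] / row[-1]


for name, row in rows.items():
    print(f"{name}: residual={residual(row)}, positive share={positive_share(row):.2f}")
```

A non-zero residual means the Total column cannot be reproduced from the four direction columns alone, so shares are computed against Total rather than against the row sum.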
Most existing approaches implicitly assume that once a decision is produced, it is eligible for execution.
Author assertion / conceptual critique of existing approaches presented in the paper (no empirical test reported).
Most existing approaches to AI safety, risk management, and governance focus on post-hoc validation, probabilistic risk estimation, or certification of model behavior.
Author statement summarizing the literature / prior work in AI safety and governance (conceptual claim in the paper's introduction). No empirical survey or sample size reported.
We develop a formal model in which institutions choose the scale of automation, the degree of codification, and safeguards on iterative use.
Methodological statement: the paper presents a formal/theoretical model specifying institutional choice variables (model description rather than empirical result).
On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated (ρ_s=0.25).
Spearman rank correlation between composite rankings and benchmark-only rankings on an 11-agent subset that has published SWE-bench scores; reported correlation.
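For readers unfamiliar with the statistic in the claim above, here is a minimal sketch of a Spearman rank correlation for tie-free data, using the closed form 1 - 6·Σd² / (n(n² - 1)). The eleven score pairs are hypothetical placeholders, not the paper's data, and the function names are illustrative.

```python
def ranks(values):
    """Return 1-based ranks (1 = largest value). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for 11 agents (illustrative only).
composite = [71, 64, 58, 80, 49, 66, 53, 77, 61, 45, 69]
benchmark = [33, 52, 41, 47, 38, 29, 55, 36, 44, 50, 31]
print(round(spearman_rho(composite, benchmark), 2))
```

A value near 0 means the two rankings order the agents almost independently, while values near 1 or -1 mean they agree or are reversed.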
We document the performance of a market-based scaffolding with these LLMs.
Empirical documentation reported in the paper describing how a market-based scaffolding performs when using the six LLMs on the 93 tasks.
We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration.
Empirical setup described in the paper: evaluation uses a 93-task subset of SWE-bench Lite and six recent LLMs.
We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities.
Paper contribution claim: introduction of a benchmark named MarketBench described in the paper.
In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so.
Conceptual claim / design requirement motivating the benchmark; stated as part of the paper's framing rather than an empirical result.
We instrument ITAS, a four-agent tutoring system built on Gemini 2.5 Flash and Google Vertex AI, across three throughput tiers (Standard PayGo, Priority PayGo, and Provisioned Throughput) and eleven concurrency levels up to 50 simultaneous users, producing over 3,000 requests drawn from a live graduate STEM deployment.
Methods statement in paper describing experimental setup: four-agent ITAS built on Gemini 2.5 Flash and Google Vertex AI; three throughput tiers; eleven concurrency levels up to 50; over 3,000 requests from a live graduate STEM deployment.
We compare LLM-guided bidding against truthful and heuristic strategies using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness.
Methodological claim describing the comparative experimental design: simulations use VCG as benchmark and include comparisons to truthful and heuristic bidding strategies. No sample size or detailed experimental parameters are provided in the excerpt.
When the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions.
Simulation experiments comparing LLM-guided bidding to the VCG benchmark and to truthful/heuristic strategies under conditions where VCG assumptions are satisfied. The paper reports that LLM outcomes were close to the VCG-predicted equilibrium. No numeric sample size or quantitative effect sizes reported in the provided text.
We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks.
Descriptive statement of the study design: the paper reports simulation/experimental evaluation where each user equipment (UE) is modeled as a rational player in repeated spectrum auctions; comparison against truthful and heuristic strategies under Vickrey-Clarke-Groves (VCG) benchmark. No numeric sample size reported in the provided text.
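The dominant-strategy truthfulness that the VCG benchmark provides can be illustrated with its simplest special case, a single-item second-price (Vickrey) auction. This sketch is a deliberate simplification and not the paper's setup: it ignores repetition, budget constraints, and multi-unit spectrum allocation, and the bidder names and values are hypothetical.

```python
def vickrey_outcome(bids):
    """Single-item second-price auction: highest bid wins, pays the second-highest bid."""
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


def utility(bidder, true_value, bids):
    """Quasi-linear utility: value minus payment if the bidder wins, else zero."""
    winner, price = vickrey_outcome(bids)
    return true_value - price if winner == bidder else 0.0


# Hypothetical user equipment "ue1" values the channel at 10; rivals bid 7 and 4.
rivals = {"ue2": 7, "ue3": 4}
for own_bid in (10, 15, 6):  # truthful, overbid, underbid
    bids = {"ue1": own_bid, **rivals}
    print(own_bid, utility("ue1", 10, bids))
```

Because the winner's payment depends only on the rivals' bids, deviating from the true value never raises utility in this one-shot setting and can lower it, as the underbid case shows.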
The welfare consequences of genAI can be organized by a two-dimensional taxonomy: the strength of the incentive to perform the task without AI, and the severity of model collapse.
Analytical organization derived from the theoretical model presented in the paper (conceptual taxonomy based on model parameters; no empirical sample reported in abstract).
We develop a parsimonious model of behavior in collaborative interactions in which individuals can either exert human effort, rely on genAI, or refrain from work altogether.
Methodological claim: authors present a formal theoretical model with the specified choice set (model description in paper; no empirical sample reported in abstract).
Predictive performance exhibits saturation beyond a certain context length.
Experiments varying the context (input) length in foundation models and observing changes in forecasting performance; reported saturation effect in analyses.
Task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend.
Analysis comparing human expert difficulty ratings to measured token costs for tasks in SWE-bench Verified; weak alignment reported in the paper between ratings and token consumption.
Higher token usage does not translate into higher accuracy; accuracy often peaks at intermediate cost and saturates at higher costs.
Comparison of accuracy (task success) versus total token usage across runs/trajectories in the agentic coding experiments on SWE-bench Verified; reported observed relationship (peak at intermediate costs and saturation thereafter).
Learning-based control offers a more adaptive alternative, but it remains unclear whether such methods... can sustain hours of reliable operation, deliver consistent quality, and behave safely around people on a live production line.
Framing of a research gap in the paper's introduction; no primary experimental data presented here (statement of uncertainty motivating the study).
The study is based on a repeated cross-sectional survey of licensed employees at a non-university research institution.
Author statement in the abstract: repeated cross-sectional survey of licensed employees at the research institution studied; methodological description in the abstract.
The paper provides a natural definition of benchmark hacking in this strategic context by comparing a player's equilibrium effort allocation to that of a single-agent baseline scenario.
Conceptual/theoretical definition introduced in the model comparing equilibrium effort allocations to a single-agent (non-competitive) baseline.
The main findings hold across multiple robustness checks.
Paper reports multiple unspecified robustness checks applied to the fixed-effects regression analyses on the panel of publicly listed Chinese firms (2012–2023).
We use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows.
Methodological contribution described in the paper: a unified amortized computational framework applied to eight Shapley variants, evaluated under latency constraints typical of operational workflows.
No formulation improved objective analyst performance.
Controlled/empirical experiment reported in the paper evaluating eight Shapley variants with professional analysts in the fraud-detection environment; performance measured over 3,735 case reviews.
Standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility.
Empirical comparison in the paper between quantitative metrics (sparsity, faithfulness) and human-judged clarity/decision-utility across the datasets and analyst reviews; based on the authors' large-scale evaluation.
We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews.
Experimental methods reported in the paper: evaluation across four risk datasets and a fraud-detection environment with professional analysts; stated sample of 3,735 case reviews.
A central issue is how humans interpret the algorithm's choice of features, which affects the design and evaluation of highlighting policies.
Framing and motivation in the paper: conceptual claim motivating the formal models and analysis (theoretical/argumentative).
We illustrate our framework in a calibrated empirical exercise based on the American Housing Survey.
An empirical/calibrated exercise using data from the American Housing Survey reported in the paper; the claim is that the framework is illustrated empirically (data-based demonstration).
Humans may interpret the algorithm's choice of features in different ways: a sophisticated agent correctly conditions on the selection rule, while a naive agent updates only on revealed feature values and treats the selection event as exogenous.
Conceptual/behavioral modeling in the paper that defines two agent-types (sophisticated vs naive) and analyzes their distinct inference processes (theoretical/modeling).
Highlighting can be modeled as a constrained information policy that selects a small number of features to reveal.
Modeling framework developed in the paper: formal definition of highlighting as an information policy with a feature-selection constraint (theoretical/modeling).
We study this question using 10,659 matched human-agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account.
Descriptive statement of the study dataset reported in the paper: dataset of 10,659 matched human-agent pairs from Moltbook with public linkage to owner's Twitter/X account.
The paper proposes a conceptual framework linking AI adoption to employability and role transformation, mediated by skill adaptation, continuous learning, and organizational readiness.
Author-proposed conceptual framework presented in the review paper (theoretical linkage based on literature synthesis).
This study takes food delivery riders as its research subject and analyzes the difficulty of determining labor relations under AIGC.
Methodological statement in the paper specifying the chosen subject of analysis (food delivery riders); this is an explicit description of the paper's scope rather than an empirical finding.
The paper develops an interdisciplinary conceptual framework that integrates insights from economics, management theory, and digital governance to characterize algorithmic enterprises.
Methodological claim about the paper's approach; stated in abstract as the paper's contribution (conceptual framework built from interdisciplinary literature).
Future research should strengthen cross-national comparisons, longitudinal tracking, and interdisciplinary collaboration to support development of a technology governance framework that balances efficiency with equity.
Author recommendation based on identified research gaps in the literature review (prescriptive/recommendation).
Existing research has clear gaps: limited evidence from developing-country contexts, insufficient attention to within-occupation heterogeneity, incomplete accounts of psychological mechanisms underlying AI anxiety, and a shortage of rigorous evaluations of reskilling policy effectiveness.
Author's assessment based on the reviewed literature identifying thematic gaps and methodological limitations (critical literature review).
The study uses a mixed-methods design combining a quantitative survey of 312 senior managers/strategy professionals and 28 semi-structured interviews across four sectors in Zimbabwe.
Methods reported in the paper: quantitative survey n = 312; qualitative 28 interviews across manufacturing, financial services, telecommunications, and retail.
This study leverages the establishment of National New-Generation Artificial Intelligence Innovation and Development Pilot Zones as a quasi-natural experiment and employs a multi-period DID model on A-share listed manufacturing firms from 2010 to 2023.
Methodological description provided in the paper: policy rollout as quasi-natural experiment; multi-period difference-in-differences estimation; sample frame specified as A-share listed manufacturing firms on the Shanghai and Shenzhen Stock Exchanges, 2010–2023.
The paper synthesizes sector-specific insights across manufacturing, information technology, healthcare, and finance to examine AI's influence on task automation, job augmentation, and skill requirements.
Descriptive claim about the scope of the review (sectors named in the abstract); no breakdown of sectoral evidence or counts provided in the abstract.
There is a lack of comparative sectoral assessments and standardized risk evaluation frameworks in the literature.
Identified research gap reported by the authors from their systematic review (no counts or formal gap-analysis metrics provided in the abstract).
A structured methodology (systematic review) was adopted to identify literature on AI-driven job transformation and associated employment risks using major academic databases.
Methodological statement in the paper claiming a systematic review approach (specific databases, search terms, inclusion/exclusion criteria and number of studies are not reported in the abstract).
An exploratory evaluation compared unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application.
Paper reports an exploratory evaluation / comparative study described in the abstract; the task context is a web application development exercise comparing three approaches (no sample size reported in abstract).
The First Fundamental Theorem of Welfare Economics assumes that welfare-bearing agents are autonomous and implicitly relies on a binary distinction between autonomy and instrumentality.
Explicit statement in the paper's introduction/abstract describing the theorem's assumptions; conceptual/theoretical textual analysis (no empirical sample).
This paper was generated by AI, using https://github.com/chenandrewy/ralph-wiggum-asset-pricing/.
Author statement in the abstract declaring the paper was generated by AI and providing a GitHub link.
The paper integrates information processing theory, the resource-based view, and the dynamic capabilities perspective to develop an integrated framework linking digital technology adoption, visibility, and resilience.
Theoretical framing described in the paper (explicit mention of the three theories and their integration).
The study employs hierarchical regression, structural equation modeling (SEM), and rigorous endogeneity controls including instrumental variables and propensity score matching.
Methods section summary reported in the paper; explicit listing of regression, SEM, IV, and propensity score matching.
The study draws on survey data from 742 manufacturing and logistics firms across 23 countries.
Reported sample description in the paper: survey of 742 firms across 23 countries (manufacturing and logistics).
This review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Methodological statement in the paper's abstract indicating PRISMA adherence; no further protocol details or study counts provided in the abstract.
The staggered expansion of Turkey's national natural gas pipeline network provides plausibly exogenous variation in connectivity because pipeline routing is determined by energy distribution priorities rather than digital demand.
Identification strategy described by the authors: using pipeline expansion as an instrument/conduit for fiber-optic deployment; argument rests on institutional routing rules and timing.
We evaluate structural validity, semantic alignment, reproducibility, and refinement effort to characterize authoring scalability.
Reported evaluation dimensions in the paper; implies empirical assessments were performed along these axes (details not provided in the abstract).
The paper foregrounds industrial firms' own digital agency as a less understood aspect in the literature on digitalization and governance.
Authors' positioning of their contribution and literature review claim in the paper (qualitative/theoretical claim).