Evidence (2432 claims)

Claim counts by topic:
- Adoption: 5126
- Productivity: 4409
- Governance: 4049
- Human-AI Collaboration: 2954
- Labor Markets: 2432
- Org Design: 2273
- Innovation: 2215
- Skills & Training: 1902
- Inequality: 1286
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Labor Markets
Research priorities include empirically quantifying AI's effects on productivity, wages, inequality, and environmental costs; developing standardized sustainability and governance metrics; and evaluating regulatory impacts on innovation and welfare.
Stated research agenda based on gaps identified in the narrative review; identifies directions for future empirical work rather than presenting new empirical findings.
AI has progressed from symbolic systems to data-driven, generative architectures and large-scale computational infrastructures, becoming a foundational technology across sectors.
Narrative synthesis of historical and technical literature across AI research and innovation studies; qualitative tracing of architectural shifts (symbolic → statistical → deep learning/generative models) and increased deployment across industries. No original empirical measurement or sample size reported in this paper.
Policy recommendations include standards on explainability, audit trails, certification for finance/tax AI systems, stronger data governance, and public–private coordination to update regulatory guidance.
Paper's policy and governance recommendations drawn from case findings and literature synthesis; prescriptive content rather than evaluated interventions.
Deployments should build governance, explainability, and auditability into systems and start with pilots on high-volume, well-structured tasks before scaling.
Paper recommendations based on case experience and analytic framing; an advocated strategy rather than one empirically validated at scale within the paper.
To mitigate risks and realize benefits, finance/tax deployments should combine AI with human-in-the-loop controls and clear escalation paths.
Prescriptive recommendation grounded in case lessons and literature on safe AI deployment; presented as a best-practice guideline rather than tested intervention.
Technical building blocks leveraged in these deployments include large language models (LLMs), OCR plus structured information extraction, retrieval-augmented generation (RAG) and knowledge bases, and process automation/RPA.
Explicit technical characteristics section and case descriptions in the paper identify these components as core to implementations.
Generative AI is used for risk control and audit functions, including real-time monitoring, fraud detection, KYC/AML screening, and automated exception reporting.
Reported use-cases in the two case organizations and corroborating industry reports discussed in the literature review portion of the paper.
For tax declaration, generative AI enables extraction of tax-relevant facts from invoices and contracts, drafting of tax returns, compliance checks, and scenario simulations.
Case examples and literature synthesis describing OCR + information extraction and LLM-assisted drafting workflows used in practice.
Generative AI is applied to fund management tasks such as cashflow forecasting, anomaly detection, and automated workflows for payments and collections.
Case descriptions and technical mapping in the paper showing implementations at the sharing center and professional services firm level.
Accounting automation use-cases include automated bookkeeping, reconciliations, journal entry suggestion, and error detection using LLMs and document understanding.
Detailed scope mapping and case examples in Xiaomi and Deloitte illustrating these accounting applications; supported by literature review of technical capabilities.
Realizing AI-driven gains in Vietnam requires legal and institutional redesign.
Close reading of Vietnam's constitutional provisions, administrative statutes, procedural rules and judicial doctrine (doctrinal legal analysis) combined with comparative lessons from other jurisdictions; no quantitative data.
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.
Observations span multiple agent platforms (Moltbook, The Colony, 4claw) with more than 167,000 agents interacting as peers.
Author-reported coverage from naturalistic observations across the named platforms during the one-month observation window; the count is reported as more than 167,000 agents.
Modular outputs (question histories, security checks, rubric scores, summaries) enable post-hoc review and explainability.
Architectural design and output artifacts described in the paper (logs and structured outputs per agent); these artifacts provide material for explanation and audit.
Adaptive difficulty and multidimensional evaluation allow dynamic tailoring of questions to candidate performance.
Implementation of adaptive testing logic within the workflow described in the paper, with experiments involving dynamic difficulty adjustment; detailed metrics of adaptation effectiveness are not provided in the summary.
Operating as a pre-processor (rather than modifying the generator) enables modular integration with existing LLMs and provides an explicit decision point for clarification.
Novelty/architecture claim in the paper explaining that C.A.P. runs before generation and therefore can be plugged into existing LLM pipelines; described design rationale (no empirical integration study presented).
C.A.P. verifies semantic alignment between the current expanded prompt and the weighted history and triggers a structured clarification protocol when similarity is below a threshold.
Component-level description: alignment verification via semantic embeddings (cosine similarity) or learned classifiers and threshold-based decision branching to initiate clarification; described protocol templates (no empirical validation provided).
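The alignment check described above can be sketched as follows; the function names and the 0.75 threshold are illustrative assumptions, not the paper's implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def needs_clarification(prompt_vec, history_vec, threshold=0.75):
    """Trigger the structured clarification protocol when semantic
    alignment between prompt and weighted history falls below threshold."""
    return cosine_similarity(prompt_vec, history_vec) < threshold
```

In a real system the vectors would come from a sentence-embedding model and the threshold would be tuned on held-out dialogues.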
C.A.P. retrieves dialogue history using a time-weighted decay so recent context is prioritized (approximating human conversational focus).
Design description of a 'time-weighted context retrieval' component; authors propose temporal decay functions (e.g., exponential decay, half-life parameter) applied to dialogue-turn embeddings or metadata (no empirical results reported).
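A minimal sketch of such a decay weighting, assuming the exponential/half-life form the authors mention as one option:

```python
def decay_weights(turn_ages, half_life=5.0):
    """Exponential decay weight per dialogue turn, with age measured in
    turns: a turn exactly one half-life old receives weight 0.5, so recent
    context dominates retrieval (half_life is an illustrative parameter)."""
    return [0.5 ** (age / half_life) for age in turn_ages]
```

These weights would scale the similarity scores of dialogue-turn embeddings before retrieval.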
C.A.P. is a pre-generation module that expands user utterances to recover omitted premises and implications.
Architecture and methods description in the paper specifying a 'semantic expansion' component; suggested implementations via knowledge-bases or small LLM prompts to generate premises, paraphrases, and implications (no empirical evaluation reported).
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.
Argument from formal properties of AFs and representation; no empirical user studies but relies on known formal semantics.
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).
Established literature on formal argumentation (e.g., Dung-style AFs) and the paper's conceptual description; no new empirical data reported.
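As a concrete instance of the formal semantics referenced here, the grounded extension of a Dung-style argumentation framework can be computed as the least fixed point of the characteristic function; this is a generic sketch, not code from the paper:

```python
def grounded_extension(args, attacks):
    """Grounded extension of a Dung AF.
    args: set of argument ids; attacks: set of (attacker, target) pairs.
    Iterates the characteristic function: an argument is acceptable when
    every one of its attackers is itself attacked by the current set."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in args}
    defended = set()
    while True:
        new = {a for a in args
               if all(any((d, b) in attacks for d in defended)
                      for b in attackers[a])}
        if new == defended:
            return defended
        defended = new
```

For attacks a→b→c, the unattacked argument a is accepted first and then defends c, yielding {a, c}; a mutual attack a↔b yields the empty extension, matching the skeptical grounded semantics.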
Evaluation metrics for the benchmark include task-specific metrics such as win-rate for battling and completion time for speedruns, as well as strategic robustness measures.
Paper's evaluation section lists metrics used: win-rate, completion time, strategic robustness; describes how they are computed and used to compare agents.
Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.
Paper describes and releases an open-source harness for orchestrating LLMs/agents and provides standardized scenarios and evaluation tools intended to support reproducibility.

Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.
Paper reports organization/validation via a NeurIPS 2025 competition, states participation of 100+ teams, and includes documentation/analyses of top submissions.
The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.
Paper/documentation notes a live leaderboard for Battling and provides self-contained evaluation pipelines/orchestration for Speedrunning intended to support reproducible runs.
Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.
Paper presents baseline implementations and experiments spanning heuristic, RL, and LLM-based agents and describes training procedures and architectures used for each baseline category.
The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).
Paper structure and dataset descriptions specify two tracks, their scopes, and the inclusion of a multi-agent orchestration system for the Speedrunning Track.
The Battling Track dataset contains more than 20 million recorded battle trajectories.
Paper reports a Battling Track dataset of >20M recorded battle trajectories collected from simulated/match play; size reported explicitly in dataset and methods section.
PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.
Paper describes design and motivation of the benchmark, detailing two tracks (Battling and Speedrunning) intended to capture partial observability, adversarial/game-theoretic interactions, and long-horizon sequential planning; benchmark implementation built on Pokemon simulator and described task specifications.
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.
Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.
Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.
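Pass@k is typically computed with the unbiased estimator of Chen et al.: given n samples per task of which c pass, the chance that at least one of k drawn samples succeeds. The paper's exact computation is not specified in the summary; this is the standard formula:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), the probability that a
    draw of k samples (without replacement) from n, of which c are
    correct, contains at least one correct sample."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Pass@1 reduces to the plain success rate c/n; Pass@128 requires n >= 128 samples per task.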
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.
Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.
Historical transitions in standard work hours (e.g., six-day to five-day week) show that phased implementation, collective bargaining, and complementary policies can make work-time reductions feasible and economically beneficial.
Historical analyses and case studies of past industrialized-country workweek transitions cited in the synthesis; evidence drawn from historical institutional records and prior economic histories rather than a unified econometric analysis.
The paper advances a replicable interdisciplinary synthesis method and provides a simulated dataset and transparent protocols enabling other researchers to adapt the approach.
Methods section detailing systematic literature search protocols (ACM/IEEE/Springer, 2020–2024), inclusion criteria, simulation parameterization for the cross-sectoral dataset (seven industries, 2020–2024), and stated reproducibility materials.
AI adoption is strongly associated with workforce skill transformation (reported correlation r = 0.71).
Correlational analysis reported in the paper using the simulated cross-sectoral dataset that mirrors employment trends across seven industries (Manufacturing, Healthcare, Finance, Education, Transportation, Retail, IT Services) over 2020–2024. This corresponds to sector-year observations (7 sectors × 5 years = 35 observations) and is triangulated with findings from a systematic literature synthesis (ACM, IEEE, Springer publications 2020–2024).
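The reported coefficient is, in principle, a plain Pearson correlation over the 35 sector-year observations; a generic sketch (the simulated dataset itself is not reproduced here):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences, e.g. an
    AI-adoption index and a skill-transformation index per sector-year."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

With only 35 observations the confidence interval around r = 0.71 is wide, which is worth bearing in mind when interpreting the claim.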
Research priorities include rigorous real-world trials assessing patient outcomes, cost-effectiveness, and labor impacts; comparative studies of integration strategies; measurement of long-run workforce effects; and development of standard metrics and monitoring frameworks.
Explicit recommendations from the narrative review based on identified gaps: scarcity of RCTs, economic analyses, and long-term workforce studies.
Reward shaping at the assignment layer enables an explicit trade-off between diagnostic accuracy and human labor by incorporating penalties for human involvement.
Methodology section describing reward shaping and experimental comparisons showing different accuracy/human-effort trade-offs (results reported in paper; exact experimental details not provided in the summary).
Masked reinforcement learning techniques constrain or mask action spaces, reducing exploration over huge symptom/action spaces.
Paper describes use of masked RL to limit action options during training and execution; used in both assignment and execution layers (methodological claim supported by algorithmic description and experiments).
The upper layer ('master') learns turn-by-turn human–machine assignment using masked reinforcement learning with reward shaping to balance accuracy and human cost.
Methodological description in the paper and empirical results from experiments using masked RL and reward-shaped objectives at the assignment layer (implementation and experimental setup reported; dataset/sample size not specified in summary).
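A minimal sketch of the two ingredients named above, action masking and a reward shaped by a human-cost penalty; the function names and the 0.2 penalty are illustrative assumptions, not the paper's parameters:

```python
def masked_argmax(q_values, mask):
    """Greedy action selection with invalid actions masked out, shrinking
    exploration over a large symptom/action space."""
    valid = [i for i, ok in enumerate(mask) if ok]
    return max(valid, key=lambda i: q_values[i])

def shaped_reward(correct_diagnosis, human_involved, human_cost=0.2):
    """Accuracy reward minus a penalty whenever the turn is assigned to a
    human; human_cost sets the accuracy-vs-labor trade-off at the
    assignment ('master') layer."""
    return (1.0 if correct_diagnosis else 0.0) - (human_cost if human_involved else 0.0)
```

Raising `human_cost` pushes the learned policy toward machine handling at some cost in diagnostic accuracy, which is exactly the trade-off the reward-shaping claim describes.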
Returns to advanced digital skills vary by firm size/type: the wage return in large Chaebol conglomerates is approximately 18.7%, significantly higher than the ~9.5% return in Small and Medium-sized Enterprises (SMEs), indicating a 'skills–scale' complementarity effect.
Heterogeneity analysis within the extended Mincerian wage regression framework using KLIPS micro-data, comparing estimated returns across firm types (Chaebol vs SMEs). (Sample size and exact model specification not provided in the excerpt.)
Workers with only general digital literacy receive a wage premium of approximately 5.8% (after controlling for education, experience, and demographics).
Same empirical framework: extended Mincerian wage equation on KLIPS micro-data with controls for education, experience, and demographic characteristics. (Sample size not specified in the provided excerpt.)
Workers possessing specialized digital skills (e.g., data analysis, programming, automation control) enjoy a significant wage premium of approximately 14.2% after controlling for years of education, work experience, and demographic characteristics.
Empirical estimation using an extended Mincerian wage equation on micro-data from the Korean Labor and Income Panel Study (KLIPS); models control for years of education, work experience, and demographic covariates. (Sample size not specified in the provided excerpt.)
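The extended Mincerian specification behind these estimates can be written in a standard form; the exact controls are not given in the excerpt, so the skill indicators below are illustrative:

```latex
\ln w_i = \alpha + \beta_1\,\mathrm{DigLit}_i + \beta_2\,\mathrm{DigSpec}_i
        + \gamma_1\,\mathrm{Educ}_i + \gamma_2\,\mathrm{Exp}_i + \gamma_3\,\mathrm{Exp}_i^2
        + \delta' X_i + \varepsilon_i
```

where $\mathrm{DigLit}_i$ and $\mathrm{DigSpec}_i$ indicate general digital literacy and specialized digital skills and $X_i$ collects demographic controls; the reported premia correspond to $e^{\beta}-1$, e.g. $\beta_2 \approx 0.133$ implies the ~14.2% premium.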
The model is disciplined using data from the Michigan Survey of Consumers and the Survey of Professional Forecasters, targeting key empirical moments.
Calibration/estimation strategy described in the paper: parameters are chosen to match moments from the Michigan Survey of Consumers and SPF (targeted empirical moments). Specific moments and calibration targets are reported in the paper.
I develop a search-and-matching model with sticky wages and endogenous separations.
Theoretical/model contribution: construction and analysis of a calibrated search-and-matching framework that incorporates wage stickiness and endogenous separation decisions.
Workers and firms face information frictions about the aggregate state of the economy (modeled explicitly).
Assumption and mechanism built into the paper's theoretical framework: a search-and-matching model with information frictions for both sides of the market (model specification).
Households form dispersed, backward-looking expectations about macroeconomic conditions.
Survey evidence from the Michigan Survey of Consumers showing dispersion in individual expectations and patterns consistent with backward-looking (slow/updating) belief formation about macro variables; exact sample sizes and empirical specifications are provided in the paper (not in the summary).
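Backward-looking belief formation of this kind is commonly written as adaptive updating; a standard textbook form, not necessarily the paper's exact law of motion:

```latex
\mathbb{E}_{i,t}\left[x_{t+1}\right]
  = \mathbb{E}_{i,t-1}\left[x_{t}\right]
  + \lambda\left(x_{t-1} - \mathbb{E}_{i,t-1}\left[x_{t}\right]\right),
  \qquad 0 < \lambda \le 1
```

so each household $i$ revises its forecast only partially toward the last observed realization, generating both the dispersion and the sluggish updating seen in the Michigan survey data.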
DARE posits that responsible AI deployment requires the simultaneous and integrated development of Digital readiness, Administrative governance, Resilience & ethics, and Economic equity.
Descriptive claim about the framework's components as reported in the abstract (conceptual proposition).
This paper introduces the DARE Framework, a holistic, four-dimensional model for national AI strategy and international cooperation.
Factual description of paper content in abstract — the framework is introduced by the authors (conceptual/model contribution).
AI tools—ranging from machine learning algorithms in inventory management to natural language processing in customer engagement—are applied in micro‑enterprise contexts.
Descriptive synthesis from included articles reporting specific AI applications (ML for inventory management; NLP for customer engagement) across the reviewed literature.
Global efforts toward establishing shared norms and multilateral cooperation are underway through initiatives led by the United Nations, OECD, UNESCO, and G7.
Qualitative document review identifying initiatives and normative efforts by multilateral organizations (organizations named; specific initiatives referenced qualitatively but not enumerated as a dataset).