Evidence (3231 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5921 claims
- Human-AI Collaboration: 5192 claims
- Org Design: 3497 claims
- Innovation: 3492 claims
- Labor Markets: 3231 claims
- Skills & Training: 2608 claims
- Inequality: 1842 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Labor Markets
Higher robot density is associated with productivity gains, particularly in low-robotized sectors such as Ukraine’s mining and metallurgical industry.
Empirical evidence is cited from international and industry-specific studies reviewed in the paper (literature-review/meta-analytic evidence); no Ukraine-specific causal estimate with a reported sample size appears in the summary.
Human-replacing technologies also have an indirect impact on productivity by increasing total factor productivity (TFP).
Analytical argumentation in the paper supported by references to empirical studies showing TFP effects of automation/digitalization; literature synthesis rather than a new econometric estimate presented for Ukraine.
Human-replacing technologies (mechanization, automation, robotization, digitalization and AI-augmentation) make a direct contribution to labour productivity growth in Ukraine's mining and metallurgical sector.
Sectoral analysis and synthesis in the paper drawing on empirical international and industry-specific studies; literature review of productivity impacts of mechanization/automation/robotization/digitalization/AI in industrial contexts.
There is untapped potential for optimizing the interaction between artificial intelligence and the labor market, and AI needs to be adapted to the specifics of national economic models.
Conclusions drawn from the envelope-model results showing heterogeneity across countries and implied gaps/opportunities for policy and adaptation; the paper emphasizes policy implications and the need for AI adaptation to national economic specifics.
Certain countries can optimally transform AI diffusion into positive domestic labor-market outcomes (economic development and realization of human capital potential): the Netherlands, France, Portugal, Italy, and Malta.
Comparative envelope-model analysis across the sample of European Union countries identified the countries judged able to transform AI diffusion optimally into labor-market and human-capital outcomes; these five countries are named in the paper.
Introducing an 'AI Engineer' occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.
Speculative policy suggestion based on the co-attractor framework and empirical observation that vocabulary exists but population cohesion is absent.
Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations.
Empirical application of the method to a dataset of 8.2 million US resumes spanning 2022–2026; claim that results match known/established occupations (implies validation against existing taxonomy or known labels).
The co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population.
Methodological claim describing the approach applied to resume data: independent tests of vocabulary cohesion and population cohesion, plus ablation experiments. Supported by the method's implementation on the resume dataset.
A genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary.
Theoretical/conceptual proposal introduced by the authors as the defining mechanism for occupational emergence; motivates the detection method.
Occupations form and evolve faster than classification systems can track.
Argument supported by the paper's analysis approach and motivating observation; asserted as motivation for developing a detection method. No specific numerical test reported in the excerpt beyond the large resume dataset.
Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregate reported effect ranges but does not list a single pooled sample size.
The authors provide a demo video, a hosted website, and an installable package demonstrating JobMatchAI.
Paper explicitly states availability of a demo video, a hosted website, and an installable package. No links, access dates, or artifact verification details are provided in the excerpt.
The authors provide a hybrid retrieval stack combining BM25, a skill knowledge graph, and semantic components to evaluate skill generalization.
Paper describes a hybrid retrieval stack composed of BM25, a knowledge graph, and semantic retrieval components intended for evaluation of skill generalization. No evaluation metrics or comparisons are included in the excerpt.
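One way such a hybrid stack can combine evidence is linear score fusion across the lexical (BM25), knowledge-graph, and semantic components. The sketch below is an illustrative assumption only; the function names, weights, and normalization scheme are not reported in the excerpt:

```python
def fuse_scores(bm25, kg_overlap, semantic, weights=(0.4, 0.3, 0.3)):
    """Weighted linear fusion of normalized component scores.

    bm25, kg_overlap, semantic: scores in [0, 1] from the lexical,
    knowledge-graph, and embedding retrievers respectively.
    The weights are illustrative placeholders, not values from the paper.
    """
    w1, w2, w3 = weights
    return w1 * bm25 + w2 * kg_overlap + w3 * semantic

def rerank(candidates):
    """Sort candidate (doc_id, bm25, kg, sem) tuples by fused score."""
    return sorted(candidates,
                  key=lambda c: fuse_scores(c[1], c[2], c[3]),
                  reverse=True)
```

Linear fusion is only one option; learned rerankers or reciprocal-rank fusion are common alternatives when component scores are not directly comparable.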
The authors release JobSearch-XS benchmark.
Paper explicitly states release of the JobSearch-XS benchmark. No dataset size, annotation protocol, or access URL provided in the excerpt.
JobMatchAI integrates Transformer embeddings, skill knowledge graphs, and interpretable reranking.
Statement in paper describing system architecture and components (implementation claim). No quantitative implementation details or component-level ablation results provided in the supplied excerpt.
Research priorities include empirically quantifying AI's effects on productivity, wages, inequality, and environmental costs; developing standardized sustainability and governance metrics; and evaluating regulatory impacts on innovation and welfare.
Stated research agenda based on gaps identified in the narrative review; identifies directions for future empirical work rather than presenting new empirical findings.
AI has progressed from symbolic systems to data-driven, generative architectures and large-scale computational infrastructures, becoming a foundational technology across sectors.
Narrative synthesis of historical and technical literature across AI research and innovation studies; qualitative tracing of architectural shifts (symbolic → statistical → deep learning/generative models) and increased deployment across industries. No original empirical measurement or sample size reported in this paper.
Policy recommendations include standards on explainability, audit trails, certification for finance/tax AI systems, stronger data governance, and public–private coordination to update regulatory guidance.
Paper's policy and governance recommendations drawn from case findings and literature synthesis; prescriptive content rather than evaluated interventions.
Deployments should build governance, explainability, and auditability into systems and start with pilots on high-volume, well-structured tasks before scaling.
Paper recommendations based on case experience and analytic framing; advocated strategy rather than empirically validated at scale within the paper.
To mitigate risks and realize benefits, AI systems in finance/tax should combine AI with human-in-the-loop controls and clear escalation paths.
Prescriptive recommendation grounded in case lessons and literature on safe AI deployment; presented as a best-practice guideline rather than tested intervention.
Technical building blocks leveraged in these deployments include large language models (LLMs), OCR plus structured information extraction, retrieval-augmented generation (RAG) and knowledge bases, and process automation/RPA.
Explicit technical characteristics section and case descriptions in the paper identify these components as core to implementations.
Generative AI is used for risk control and audit functions, including real-time monitoring, fraud detection, KYC/AML screening, and automated exception reporting.
Reported use-cases in the two case organizations and corroborating industry reports discussed in the literature review portion of the paper.
For tax declaration, generative AI enables extraction of tax-relevant facts from invoices and contracts, drafting of tax returns, compliance checks, and scenario simulations.
Case examples and literature synthesis describing OCR + information extraction and LLM-assisted drafting workflows used in practice.
Generative AI is applied to fund management tasks such as cashflow forecasting, anomaly detection, and automated workflows for payments and collections.
Case descriptions and technical mapping in the paper showing implementations at the sharing center and professional services firm level.
Accounting automation use-cases include automated bookkeeping, reconciliations, journal entry suggestion, and error detection using LLMs and document understanding.
Detailed scope mapping and case examples in Xiaomi and Deloitte illustrating these accounting applications; supported by literature review of technical capabilities.
Realizing those AI-driven gains in Vietnam requires legal and institutional redesigns.
Close reading of Vietnam's constitutional provisions, administrative statutes, procedural rules and judicial doctrine (doctrinal legal analysis) combined with comparative lessons from other jurisdictions; no quantitative data.
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.
Observations span multiple agent platforms (Moltbook, The Colony, 4claw) with more than 167,000 agents interacting as peers.
Author-reported coverage from naturalistic observations across the named platforms during the one-month observation window; count reported as more than 167,000 agents.
Modular outputs (question histories, security checks, rubric scores, summaries) enable post-hoc review and explainability.
Architectural design and output artifacts described in the paper (logs and structured outputs per agent); these artifacts provide material for explanation and audit.
Adaptive difficulty and multidimensional evaluation allow dynamic tailoring of questions to candidate performance.
Implementation of adaptive testing logic within the workflow described in the paper, with experiments involving dynamic difficulty adjustment; detailed metrics of adaptation effectiveness are not provided in the summary.
Operating as a pre-processor (rather than modifying the generator) enables modular integration with existing LLMs and provides an explicit decision point for clarification.
Novelty/architecture claim in the paper explaining that C.A.P. runs before generation and therefore can be plugged into existing LLM pipelines; described design rationale (no empirical integration study presented).
C.A.P. verifies semantic alignment between the current expanded prompt and the weighted history and triggers a structured clarification protocol when similarity is below a threshold.
Component-level description: alignment verification via semantic embeddings (cosine similarity) or learned classifiers and threshold-based decision branching to initiate clarification; described protocol templates (no empirical validation provided).
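The threshold-based branching can be sketched with cosine similarity over embeddings. The `needs_clarification` helper and the 0.55 threshold are illustrative assumptions, not values reported by the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def needs_clarification(prompt_vec, history_vec, threshold=0.55):
    """Trigger the structured clarification branch when the expanded
    prompt drifts from the weighted history.

    threshold is an illustrative placeholder; the paper describes
    threshold-based branching without reporting a value.
    """
    return cosine(prompt_vec, history_vec) < threshold
```

In practice the vectors would come from a sentence-embedding model; here they are plain lists so the decision logic stands alone.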
C.A.P. retrieves dialogue history using a time-weighted decay so recent context is prioritized (approximating human conversational focus).
Design description of a 'time-weighted context retrieval' component; authors propose temporal decay functions (e.g., exponential decay, half-life parameter) applied to dialogue-turn embeddings or metadata (no empirical results reported).
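An exponential-decay weighting of this kind might look like the following sketch; the `half_life` parameter and helper names are illustrative, since the paper proposes decay functions without reporting specific values:

```python
def decay_weights(turn_ages, half_life=5.0):
    """Exponential decay weight per dialogue turn.

    turn_ages: age of each turn, in turns before the current one
    (0 = most recent).
    half_life: number of turns after which a turn's weight halves
    (illustrative parameter, not a value from the paper).
    """
    return [0.5 ** (age / half_life) for age in turn_ages]

def weighted_history(turn_embeddings, turn_ages, half_life=5.0):
    """Decay-weighted average of turn embeddings, so recent turns
    dominate the retrieved context representation."""
    w = decay_weights(turn_ages, half_life)
    total = sum(w)
    dim = len(turn_embeddings[0])
    return [sum(wi * e[d] for wi, e in zip(w, turn_embeddings)) / total
            for d in range(dim)]
```

With a half-life of 5 turns, a turn from 10 turns ago contributes a quarter of the weight of the current turn.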
C.A.P. is a pre-generation module that expands user utterances to recover omitted premises and implications.
Architecture and methods description in the paper specifying a 'semantic expansion' component; suggested implementations via knowledge-bases or small LLM prompts to generate premises, paraphrases, and implications (no empirical evaluation reported).
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.
Argument from formal properties of AFs and representation; no empirical user studies but relies on known formal semantics.
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).
Established literature on formal argumentation (e.g., Dung-style AFs) and the paper's conceptual description; no new empirical data reported.
Evaluation metrics for the benchmark include task-specific metrics such as win-rate for battling and completion time for speedruns, as well as strategic robustness measures.
Paper's evaluation section lists metrics used: win-rate, completion time, strategic robustness; describes how they are computed and used to compare agents.
Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.
Paper describes and releases an open-source orchestration harness for orchestrating LLMs/agents and provides standardized scenarios and evaluation tools meant for reproducibility.
Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.
Paper reports organization/validation via a NeurIPS 2025 competition, states participation of 100+ teams, and includes documentation/analyses of top submissions.
The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.
Paper/documentation notes a live leaderboard for Battling and provides self-contained evaluation pipelines/orchestration for Speedrunning intended to support reproducible runs.
Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.
Paper presents baseline implementations and experiments spanning heuristic, RL, and LLM-based agents and describes training procedures and architectures used for each baseline category.
The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).
Paper structure and dataset descriptions specify two tracks, their scopes, and the inclusion of a multi-agent orchestration system for the Speedrunning Track.
The Battling Track dataset contains more than 20 million recorded battle trajectories.
Paper reports a Battling Track dataset of >20M recorded battle trajectories collected from simulated/match play; size reported explicitly in dataset and methods section.
PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.
Paper describes design and motivation of the benchmark, detailing two tracks (Battling and Speedrunning) intended to capture partial observability, adversarial/game-theoretic interactions, and long-horizon sequential planning; benchmark implementation built on Pokemon simulator and described task specifications.
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.
Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.
Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.
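Pass@1 and Pass@k are standard sampling metrics; a common unbiased estimator (Chen et al., 2021) is shown below as a reference sketch, not necessarily the exact computation used in the paper:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts succeeds,
    given that c of the n attempts are correct."""
    if n - c < k:
        # Fewer failures than draws: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under a fixed interaction budget, n is the same for every method, so Pass@1 and Pass@k comparisons isolate how reliably each method converts attempts into successes.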
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.
Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.
Historical transitions in standard work hours (e.g., six-day to five-day week) show that phased implementation, collective bargaining, and complementary policies can make work-time reductions feasible and economically beneficial.
Historical analyses and case studies of past industrialized-country workweek transitions cited in the synthesis; evidence drawn from historical institutional records and prior economic histories rather than a unified econometric analysis.