Evidence (3231 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5921 claims
- Human-AI Collaboration: 5192 claims
- Org Design: 3497 claims
- Innovation: 3492 claims
- Labor Markets: 3231 claims
- Skills & Training: 2608 claims
- Inequality: 1842 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Labor Markets
Higher robot density is associated with productivity gains, particularly in low-robotized sectors such as Ukraine’s mining and metallurgical industry.
Empirical evidence is cited from international and industry-specific studies reviewed in the paper (literature-review/meta-analytic evidence); no Ukraine-specific causal estimate with a reported sample size appears in the summary.
Human-replacing technologies also have an indirect impact on productivity by increasing total factor productivity (TFP).
Analytical argumentation in the paper supported by references to empirical studies showing TFP effects of automation/digitalization; literature synthesis rather than a new econometric estimate presented for Ukraine.
Human-replacing technologies (mechanization, automation, robotization, digitalization and AI-augmentation) make a direct contribution to labour productivity growth in Ukraine's mining and metallurgical sector.
Sectoral analysis and synthesis in the paper drawing on empirical international and industry-specific studies; literature review of productivity impacts of mechanization/automation/robotization/digitalization/AI in industrial contexts.
There is untapped potential for optimizing the interaction between artificial intelligence and the labor market, and AI needs to be adapted to the specifics of national economic models.
Conclusions drawn from the envelope-model results showing heterogeneity across countries and implied gaps/opportunities for policy and adaptation; the paper emphasizes policy implications and the need for AI adaptation to national economic specifics.
Certain countries can optimally transform AI diffusion into positive domestic labor-market outcomes (economic development and realization of human capital potential): the Netherlands, France, Portugal, Italy, and Malta.
Comparative envelope-model analysis across the sample of European Union countries identified the countries judged able to transform AI diffusion optimally into labor-market and human-capital outcomes; these five countries are named in the paper.
Introducing an 'AI Engineer' occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.
Speculative policy suggestion based on the co-attractor framework and empirical observation that vocabulary exists but population cohesion is absent.
Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations.
Empirical application of the method to a dataset of 8.2 million US resumes spanning 2022–2026; claim that results match known/established occupations (implies validation against existing taxonomy or known labels).
The co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population.
Methodological claim describing the approach applied to resume data: independent tests of vocabulary cohesion and population cohesion, plus ablation experiments. Supported by the method's implementation on the resume dataset.
A genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary.
Theoretical/conceptual proposal introduced by the authors as the defining mechanism for occupational emergence; motivates the detection method.
Occupations form and evolve faster than classification systems can track.
Argument supported by the paper's analysis approach and motivating observation; asserted as motivation for developing a detection method. No specific numerical test reported in the excerpt beyond the large resume dataset.
Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregate reported effect ranges but does not list a single pooled sample size.
The authors provide a demo video, a hosted website, and an installable package demonstrating JobMatchAI.
Paper explicitly states availability of a demo video, a hosted website, and an installable package. No links, access dates, or artifact verification details are provided in the excerpt.
The authors provide a hybrid retrieval stack combining BM25, a skill knowledge graph, and semantic components to evaluate skill generalization.
Paper describes a hybrid retrieval stack composed of BM25, a knowledge graph, and semantic retrieval components intended for evaluation of skill generalization. No evaluation metrics or comparisons are included in the excerpt.
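One way such a hybrid stack can combine evidence is linear score fusion across the lexical (BM25), knowledge-graph, and semantic components. The sketch below is an illustrative assumption only; the function names, weights, and normalization scheme are not reported in the excerpt:

```python
def fuse_scores(bm25, kg_overlap, semantic, weights=(0.4, 0.3, 0.3)):
    """Weighted linear fusion of normalized component scores.

    bm25, kg_overlap, semantic: scores in [0, 1] from the lexical,
    knowledge-graph, and embedding retrievers respectively.
    The weights are illustrative placeholders, not values from the paper.
    """
    w1, w2, w3 = weights
    return w1 * bm25 + w2 * kg_overlap + w3 * semantic

def rerank(candidates):
    """Sort candidate (doc_id, bm25, kg, sem) tuples by fused score."""
    return sorted(candidates,
                  key=lambda c: fuse_scores(c[1], c[2], c[3]),
                  reverse=True)
```

Linear fusion is only one option; learned rerankers or reciprocal-rank fusion are common alternatives when component scores are not directly comparable.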
The authors release JobSearch-XS benchmark.
Paper explicitly states release of the JobSearch-XS benchmark. No dataset size, annotation protocol, or access URL provided in the excerpt.
JobMatchAI integrates Transformer embeddings, skill knowledge graphs, and interpretable reranking.
Statement in paper describing system architecture and components (implementation claim). No quantitative implementation details or component-level ablation results provided in the supplied excerpt.
Research priorities include empirically quantifying AI's effects on productivity, wages, inequality, and environmental costs; developing standardized sustainability and governance metrics; and evaluating regulatory impacts on innovation and welfare.
Stated research agenda based on gaps identified in the narrative review; identifies directions for future empirical work rather than presenting new empirical findings.
AI has progressed from symbolic systems to data-driven, generative architectures and large-scale computational infrastructures, becoming a foundational technology across sectors.
Narrative synthesis of historical and technical literature across AI research and innovation studies; qualitative tracing of architectural shifts (symbolic → statistical → deep learning/generative models) and increased deployment across industries. No original empirical measurement or sample size reported in this paper.
Policy recommendations include standards on explainability, audit trails, certification for finance/tax AI systems, stronger data governance, and public–private coordination to update regulatory guidance.
Paper's policy and governance recommendations drawn from case findings and literature synthesis; prescriptive content rather than evaluated interventions.
Deployments should build governance, explainability, and auditability into systems and start with pilots on high-volume, well-structured tasks before scaling.
Paper recommendations based on case experience and analytic framing; advocated strategy rather than empirically validated at scale within the paper.
To mitigate risks and realize benefits, AI systems in finance/tax should combine AI with human-in-the-loop controls and clear escalation paths.
Prescriptive recommendation grounded in case lessons and literature on safe AI deployment; presented as a best-practice guideline rather than tested intervention.
Technical building blocks leveraged in these deployments include large language models (LLMs), OCR plus structured information extraction, retrieval-augmented generation (RAG) and knowledge bases, and process automation/RPA.
Explicit technical characteristics section and case descriptions in the paper identify these components as core to implementations.
Generative AI is used for risk control and audit functions, including real-time monitoring, fraud detection, KYC/AML screening, and automated exception reporting.
Reported use-cases in the two case organizations and corroborating industry reports discussed in the literature review portion of the paper.
For tax declaration, generative AI enables extraction of tax-relevant facts from invoices and contracts, drafting of tax returns, compliance checks, and scenario simulations.
Case examples and literature synthesis describing OCR + information extraction and LLM-assisted drafting workflows used in practice.
Generative AI is applied to fund management tasks such as cashflow forecasting, anomaly detection, and automated workflows for payments and collections.
Case descriptions and technical mapping in the paper showing implementations at the sharing center and professional services firm level.
Accounting automation use-cases include automated bookkeeping, reconciliations, journal entry suggestion, and error detection using LLMs and document understanding.
Detailed scope mapping and case examples in Xiaomi and Deloitte illustrating these accounting applications; supported by literature review of technical capabilities.
Realizing those AI-driven gains in Vietnam requires legal and institutional redesigns.
Close reading of Vietnam's constitutional provisions, administrative statutes, procedural rules and judicial doctrine (doctrinal legal analysis) combined with comparative lessons from other jurisdictions; no quantitative data.
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.
Observations span multiple agent platforms (Moltbook, The Colony, 4claw) with more than 167,000 agents interacting as peers.
Author-reported coverage from naturalistic observations across the named platforms during the one-month observation window; count reported as more than 167,000 agents.
Modular outputs (question histories, security checks, rubric scores, summaries) enable post-hoc review and explainability.
Architectural design and output artifacts described in the paper (logs and structured outputs per agent); these artifacts provide material for explanation and audit.
Adaptive difficulty and multidimensional evaluation allow dynamic tailoring of questions to candidate performance.
Implementation of adaptive testing logic within the workflow described in the paper, with experiments involving dynamic difficulty adjustment; detailed metrics of adaptation effectiveness are not provided in the summary.
Operating as a pre-processor (rather than modifying the generator) enables modular integration with existing LLMs and provides an explicit decision point for clarification.
Novelty/architecture claim in the paper explaining that C.A.P. runs before generation and therefore can be plugged into existing LLM pipelines; described design rationale (no empirical integration study presented).
C.A.P. verifies semantic alignment between the current expanded prompt and the weighted history and triggers a structured clarification protocol when similarity is below a threshold.
Component-level description: alignment verification via semantic embeddings (cosine similarity) or learned classifiers and threshold-based decision branching to initiate clarification; described protocol templates (no empirical validation provided).
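The threshold-based branching can be sketched with cosine similarity over embeddings. The `needs_clarification` helper and the 0.55 threshold are illustrative assumptions, not values reported by the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def needs_clarification(prompt_vec, history_vec, threshold=0.55):
    """Trigger the structured clarification branch when the expanded
    prompt drifts from the weighted history.

    threshold is an illustrative placeholder; the paper describes
    threshold-based branching without reporting a value.
    """
    return cosine(prompt_vec, history_vec) < threshold
```

In practice the vectors would come from a sentence-embedding model; here they are plain lists so the decision logic stands alone.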
C.A.P. retrieves dialogue history using a time-weighted decay so recent context is prioritized (approximating human conversational focus).
Design description of a 'time-weighted context retrieval' component; authors propose temporal decay functions (e.g., exponential decay, half-life parameter) applied to dialogue-turn embeddings or metadata (no empirical results reported).
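An exponential-decay weighting of this kind might look like the following sketch; the `half_life` parameter and helper names are illustrative, since the paper proposes decay functions without reporting specific values:

```python
def decay_weights(turn_ages, half_life=5.0):
    """Exponential decay weight per dialogue turn.

    turn_ages: age of each turn, in turns before the current one
    (0 = most recent).
    half_life: number of turns after which a turn's weight halves
    (illustrative parameter, not a value from the paper).
    """
    return [0.5 ** (age / half_life) for age in turn_ages]

def weighted_history(turn_embeddings, turn_ages, half_life=5.0):
    """Decay-weighted average of turn embeddings, so recent turns
    dominate the retrieved context representation."""
    w = decay_weights(turn_ages, half_life)
    total = sum(w)
    dim = len(turn_embeddings[0])
    return [sum(wi * e[d] for wi, e in zip(w, turn_embeddings)) / total
            for d in range(dim)]
```

With a half-life of 5 turns, a turn from 10 turns ago contributes a quarter of the weight of the current turn.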
C.A.P. is a pre-generation module that expands user utterances to recover omitted premises and implications.
Architecture and methods description in the paper specifying a 'semantic expansion' component; suggested implementations via knowledge-bases or small LLM prompts to generate premises, paraphrases, and implications (no empirical evaluation reported).
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.
Argument from formal properties of AFs and representation; no empirical user studies but relies on known formal semantics.
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).
Established literature on formal argumentation (e.g., Dung-style AFs) and the paper's conceptual description; no new empirical data reported.
Evaluation metrics for the benchmark include task-specific metrics such as win-rate for battling and completion time for speedruns, as well as strategic robustness measures.
Paper's evaluation section lists metrics used: win-rate, completion time, strategic robustness; describes how they are computed and used to compare agents.
Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.
Paper describes and releases an open-source orchestration harness for orchestrating LLMs/agents and provides standardized scenarios and evaluation tools meant for reproducibility.
Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.
Paper reports organization/validation via a NeurIPS 2025 competition, states participation of 100+ teams, and includes documentation/analyses of top submissions.
The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.
Paper/documentation notes a live leaderboard for Battling and provides self-contained evaluation pipelines/orchestration for Speedrunning intended to support reproducible runs.
Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.
Paper presents baseline implementations and experiments spanning heuristic, RL, and LLM-based agents and describes training procedures and architectures used for each baseline category.
The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).
Paper structure and dataset descriptions specify two tracks, their scopes, and the inclusion of a multi-agent orchestration system for the Speedrunning Track.
The Battling Track dataset contains more than 20 million recorded battle trajectories.
Paper reports a Battling Track dataset of >20M recorded battle trajectories collected from simulated/match play; size reported explicitly in dataset and methods section.
PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.
Paper describes design and motivation of the benchmark, detailing two tracks (Battling and Speedrunning) intended to capture partial observability, adversarial/game-theoretic interactions, and long-horizon sequential planning; benchmark implementation built on Pokemon simulator and described task specifications.
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.
Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.
Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.
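Pass@1 and Pass@k are standard sampling metrics; a common unbiased estimator (Chen et al., 2021) is shown below as a reference sketch, not necessarily the exact computation used in the paper:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn without replacement from n attempts succeeds,
    given that c of the n attempts are correct."""
    if n - c < k:
        # Fewer failures than draws: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under a fixed interaction budget, n is the same for every method, so Pass@1 and Pass@k comparisons isolate how reliably each method converts attempts into successes.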
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.
Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.
Historical transitions in standard work hours (e.g., six-day to five-day week) show that phased implementation, collective bargaining, and complementary policies can make work-time reductions feasible and economically beneficial.
Historical analyses and case studies of past industrialized-country workweek transitions cited in the synthesis; evidence drawn from historical institutional records and prior economic histories rather than a unified econometric analysis.