Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
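One way to read the matrix is as per-outcome shares rather than raw counts. The sketch below is illustrative only: it copies a handful of rows from the table above, enters "—" as 0, and uses hypothetical names (`ROWS`, `summarize`) that are not part of the source. It also shows that the four direction columns do not always add up to Total (for example, 1930 of 1979 for "Other"); the source does not say how the remaining claims are classified.

```python
# Rows copied from the matrix above:
# (outcome, positive, negative, mixed, null, total); "—" is entered as 0.
ROWS = [
    ("Other", 749, 195, 97, 889, 1979),
    ("Governance & Regulation", 815, 391, 188, 121, 1539),
    ("Job Displacement", 12, 80, 20, 1, 113),
    ("Labor Share of Income", 17, 17, 17, 0, 51),
    ("Worker Turnover", 11, 12, 0, 3, 26),
]


def summarize(row):
    """Return the positive share and the count not covered by the four columns."""
    name, pos, neg, mixed, null, total = row
    directional = pos + neg + mixed + null
    return {
        "outcome": name,
        "positive_share": pos / total,
        # Claims counted in Total but in none of the four direction columns.
        "outside_columns": total - directional,
    }


summaries = [summarize(r) for r in ROWS]
for s in summaries:
    print(
        f"{s['outcome']}: {s['positive_share']:.1%} positive, "
        f"{s['outside_columns']} claims outside the four direction columns"
    )
```

Shares are computed against Total rather than against the sum of the four columns, so they stay comparable across rows where the two differ.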

Modern methodological assessment emphasizes the importance of recording individual contribution in various areas, assessing not only the fulfillment and quality of assignments, but also aspects such as collaboration, creativity, innovative behavior and professional growth.

Descriptive conclusion from the scoping review synthesizing themes across 29 empirical studies (2020–2025).

high positive The influence of AI-Driven Employee Performance Management (... dimensions included in performance assessments (collaboration, creativity, innov...

Employee Performance Management (EPM) systems are undergoing a pivotal shift from annual manual data collection ... into more agile human resource operations.

Claim summarized from the scoping review of 29 empirical studies (PRISMA-ScR adherence stated).

high positive The influence of AI-Driven Employee Performance Management (... character/tempo of EPM processes (manual annual -> agile/continuous)

Findings provide practical insights for AI implementation that prioritize management capability and adaptability to external environments.

Authors' interpretation and managerial implication drawn from empirical PLS-SEM (mediation/moderation) and fsQCA results on 251 firms.

high positive AI for decision-making: exploring the linkage from AI capabi... organizational effectiveness of AI implementation (management capability and ada...

Decision-making agility is a critical conduit linking AI capabilities to improving organizational outcomes.

Inference from PLS-SEM mediation results reported in paper indicating AI capability effects on performance operate via decision-making agility; analysis based on survey of 251 firms.

high positive AI for decision-making: exploring the linkage from AI capabi... organizational/firm performance mediated by decision-making agility

Two sub-dimensions of AI capability, technical infrastructure and management, affect performance outcomes through decision-making agility.

PLS-SEM results reported in paper showing relationships among measured constructs (AI capability sub-dimensions, decision-making agility, and performance outcomes); based on survey of 251 firms.

high positive AI for decision-making: exploring the linkage from AI capabi... firm performance (through decision-making agility)

Developing an integrated national AI strategic framework is critically necessary to position Georgia as a regional technological leader.

Method: policy recommendation derived from the paper's sectoral analysis and comparative study of successful national strategies; argumentative/normative claim rather than experimental evidence.

high positive Economic Impact of Artificial Intelligence and Policy Framew... Georgia's positioning as a regional technological leader

An effective AI ecosystem requires an adaptive regulatory framework, infrastructural investments, the integration of ethical standards, and cross-sectoral coordination.

Method: synthesis of findings from the comparative policy analysis and literature; policy prescription based on observed patterns across successful national strategies.

high positive Economic Impact of Artificial Intelligence and Policy Framew... effectiveness of AI ecosystem / governance quality

Countries such as Singapore, the United Kingdom, Canada, and France have achieved AI policy success through institutional flexibility and targeted policies independent of the dominant USA and China models.

Method: comparative analysis of national AI strategies and institutional arrangements across the four named countries; qualitative assessment (no numeric sample size).

high positive Economic Impact of Artificial Intelligence and Policy Framew... success of national AI strategies / national innovation outcomes

The paper analyzes the sectoral economic effects of AI using projections from Goldman Sachs, McKinsey, Penn Wharton, and the IMF, and assesses the potential for technology integration in Georgia's finance, healthcare, and education sectors.

Method: synthesis/analysis of published projections from Goldman Sachs, McKinsey, Penn Wharton, and IMF applied to Georgia's sectoral context; comparative assessment of applicability to finance, healthcare, education in Georgia. (No sample size reported.)

high positive Economic Impact of Artificial Intelligence and Policy Framew... potential for technology integration in finance, healthcare, and education

The impact of household-side digital economy applications on labor-structure change is significantly greater than that of government- and enterprise-side applications.

Heterogeneity analysis using provincial panel data (2013–2024) comparing household-, government-, and enterprise-side measures of digital-economy application and their associations with servicization/industrialization.

high positive The impact of China's digital economy development on changes... relative impact magnitudes of household- vs government- vs enterprise-side digit...

The driving effect of industrial digitalization on changes in the labor structure is stronger than that of digital industrialization.

Comparative effect estimates from the same provincial panel (2013–2024) separating two dimensions of the digital economy: 'digital industrialization' and 'industrial digitalization'.

high positive The impact of China's digital economy development on changes... relative magnitude of impact of industrial digitalization versus digital industr...

Establishing this prospective forecasting infrastructure is a critical technical requirement for managing the current global workforce realignment around AI.

Argumentative claim made by the authors in the paper's conclusion/positioning; presented as a normative recommendation rather than an empirically demonstrated necessity.

high positive Toward an AI-Powered Computational Testbed for Workforce Pol... necessity of prospective forecasting infrastructure for managing workforce reali...

The article details the computational architecture required to construct this simulation platform and defines the privacy, accuracy, and representativeness safeguards necessary for responsible deployment.

Statement of the paper's content and contributions (architectural description and discussion of safeguards); this is a claim about what the paper contains rather than an empirical finding.

high positive Toward an AI-Powered Computational Testbed for Workforce Pol... specification of computational architecture and specification of privacy, accura...

Among consenting populations, these agents can be seeded with HR records, validated psychometric measures, and digital activity data to simulate employees' cognitive, emotional, and behavioral trajectories across successive workdays during planned organizational changes.

Proposal/specification in the paper describing how the simulation would be constructed and what inputs it could use; no empirical evaluation or results reported in the excerpt.

high positive Toward an AI-Powered Computational Testbed for Workforce Pol... ability to simulate employees' cognitive, emotional, and behavioral daily trajec...

We combine recent advances in LLM-powered generative agents with foundational management science and organizational behavior research to propose dynamic employee agents.

Descriptive/methodological claim about the paper's proposed approach; represents a design/proposal rather than empirical validation.

high positive Toward an AI-Powered Computational Testbed for Workforce Pol... availability of a proposed simulation approach (dynamic employee agents) combini...

The integration of artificial intelligence into knowledge work currently affects a substantial share of the global workforce.

Claim presented in the paper as background/context; no supporting empirical sample, statistics, or citations provided in the excerpt.

high positive Toward an AI-Powered Computational Testbed for Workforce Pol... share of the global workforce affected by AI integration in knowledge work

The activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require.

Description of the classroom activity in the paper (students construct tasks, review peers' tasks for ambiguity, and evaluate systems), supported by qualitative reflections.

high positive Teaching AI Through Benchmark Construction: QuestBench as a ... student exposure to AI tools combined with critical evaluation practices

Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs.

Qualitative reflections reported from five student contributors (n=5) included in the paper, used as evidence for educational impact.

high positive Teaching AI Through Benchmark Construction: QuestBench as a ... students' conceptualization of professional knowledge and ability to judge AI ou...

Across thirteen evaluated systems, the best-performing system, GPT-5.5, reaches a 57.58% pass rate.

Empirical evaluation results reported in the paper naming GPT-5.5 as best performer with a 57.58% pass rate on QuestBench.

high positive Teaching AI Through Benchmark Construction: QuestBench as a ... pass rate of top-performing model

The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

URL provided in the paper pointing to the hosted dataset on Hugging Face.

high positive Teaching AI Through Benchmark Construction: QuestBench as a ... public availability of dataset

The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains.

Statement in the paper specifying dataset composition: 256 questions and 14 domains; dataset artifact referenced and released.

high positive Teaching AI Through Benchmark Construction: QuestBench as a ... creation of benchmark dataset (question count and domain coverage)

We introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work.

Description of course design and pedagogical practice in the paper (course activity where students construct benchmarks and evaluate systems). No numerical sample size for the course cohort reported in the excerpt.

high positive Teaching AI Through Benchmark Construction: QuestBench as a ... students' ability to test and judge AI (educational practice introduced)

We survey recent open-world evaluations, identify their strengths and limitations, and conclude with recommendations for designing and reporting open-world evals.

Statement of the paper's content: literature/methods survey and synthesis, with detailed recommendations in the conclusions (qualitative content).

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... presence of survey, identified strengths/limitations, and recommendations in the...

Open-world evaluations can provide early warning of capabilities that may soon become widespread.

Inference drawn by authors based on the reported open-world experiment (the iOS app trial) and a survey of recent open-world evaluations; claim is presented as a suggested benefit rather than proven at scale.

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... ability of open-world evaluations to serve as an early warning signal for emergi...

The agent completed the task with only a single avoidable manual intervention.

Direct observation from the paper's described experiment (single-agent trial producing the iOS app and publishing it; authors report occurrence of one avoidable manual intervention). Sample size = 1.

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... number of manual interventions required for task completion

As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store.

Empirical demonstration described in the paper: a single experimental task in which an AI agent was assigned to develop and publish a simple iOS app. Sample size implied by description: 1 trial/instance.

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... completion of an end-to-end software development and publishing task by an AI ag...

We introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such [open-world] evaluations regularly.

Paper describes the CRUX project as a proposed/introduced initiative (project description); no empirical trial-size or rollout numbers reported in the abstract.

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... existence/introduction of CRUX as an organizational/project mechanism

We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation.

Methodological proposal/definition presented in the paper (conceptual argument and rationale); described as design recommendation rather than empirically validated at scale.

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... proposed evaluation methodology characteristics (long-horizon, messy, small-samp...

Benchmark-based evaluation remains important for tracking frontier AI progress.

Conceptual assertion in paper's introduction/abstract and literature context; no empirical sample reported for this claim (position statement).

high positive Open-World Evaluations for Measuring Frontier AI Capabilitie... usefulness of benchmark-based evaluation for tracking AI progress

Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/ .

Availability statement in the paper (link to repository).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... data and code availability

AgroVG provides task-specific protocols for box-set matching and query-level mask coverage.

Methodological contribution described in the paper (evaluation protocols designed for the benchmark).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... availability of evaluation protocols (box-set matching, mask coverage)

AgroVG supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes.

Dataset/task specification described in the paper (task types T1 and T2 and query regime coverage).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... support for bounding-box and instance-mask grounding across target families and ...

AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy.

Dataset construction / reported dataset statistics in the paper (explicit count and composition).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... number of annotation-grounded image-query pairs and coverage across target famil...

We introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present.

Paper contribution: description of a new benchmark and its task formulation (benchmark construction and formalization).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... benchmark formulation (generalized set prediction capability)
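The set-prediction formulation above (return every matching instance, or abstain when none is present) can be illustrated with a small scoring sketch. This is not AgroVG's actual box-set matching protocol, which the excerpt does not describe; the greedy IoU matching, the 0.5 threshold, and the names `iou` and `match_box_sets` are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_box_sets(pred: List[Box], gt: List[Box], iou_thr: float = 0.5) -> Dict:
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes.

    The threshold of 0.5 is an assumed default, not a value from the paper.
    """
    # All candidate pairs, highest IoU first.
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    for score, i, j in pairs:
        if score < iou_thr:
            break  # remaining pairs overlap too little to count as matches
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
    tp = len(used_pred)
    return {
        "tp": tp,                   # localized targets
        "fp": len(pred) - tp,       # spurious boxes
        "fn": len(gt) - tp,         # missed targets (set completeness)
        # Target-absent query answered with no boxes.
        "correct_abstention": not pred and not gt,
    }


multi = match_box_sets(
    pred=[(0, 0, 10, 10), (20, 20, 30, 30)],
    gt=[(1, 1, 10, 10), (50, 50, 60, 60)],
)
absent_ok = match_box_sets(pred=[], gt=[])
absent_bad = match_box_sets(pred=[(0, 0, 1, 1)], gt=[])
```

In this sketch a single result covers all three query regimes: false negatives capture incomplete target sets, false positives on an empty ground-truth set capture failed abstention, and the IoU threshold captures localization accuracy.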

Evaluating agricultural visual grounding therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention.

Methodological assertion in the paper motivating the benchmark design (conceptual requirement for evaluation metrics and protocols).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... completeness of evaluation (localization accuracy, completeness, abstention)

Visual grounding is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting.

Framing / motivation statement in the paper abstract/introduction (conceptual argument linking visual grounding capability to downstream agri-applications).

high positive AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... ability to enable agricultural applications (selective weeding, disease monitori...

Enterprise capability adaptation serves as the key support for implementing intelligent international marketing models.

Conclusion from the paper's review and content analysis of literature (2010–2025); presented as a synthesized enabling factor rather than empirically quantified effect.

high positive Research on International Marketing in the Context of Intell... role of enterprise capability adaptation in model implementation

Mainstream innovation models include data-driven precision marketing, AI-powered cross-border CRM, intelligent omnichannel integration, and cross-cultural intelligent localization marketing.

Summary from the paper's systematic review and content analysis of core literature (2010–2025); descriptive synthesis, no primary experimental sample size reported.

high positive Research on International Marketing in the Context of Intell... prevalent innovation models in international marketing

New theoretical frameworks have emerged: data-driven precision marketing theory, nonlinear customer journey reconstruction theory, cross-border intelligent value co-creation theory, and global intelligent marketing ecosystem theory.

Identified via the paper's systematic review and content analysis of literature from 2010–2025; presented as conceptual/theoretical developments rather than quantified empirical effects.

high positive Research on International Marketing in the Context of Intell... emergence of new international marketing theoretical frameworks

Intelligent technologies have increased international marketing ROI by 12%–25%.

Mixed-method systematic review and content analysis of core literature sources from 2010 to 2025 (as reported in the paper). No primary dataset or sample size reported for this quantified range.

high positive Research on International Marketing in the Context of Intell... international marketing ROI

Structured production-process management and size are significant predictors of AI adoption.

Regression/associational analysis from the Census Bureau survey showing that measures of structured production-process management and establishment size predict reported AI use; sample ~28,500 establishments.

high positive The Adoption of Industrial AI in America AI adoption (predicted by management structure and size)

The paper's methodology enables classification of automation exposure that disentangles labour-substituting from labour-augmenting automation, identifies the relevant technology channel, and records the material role of AI — allowing exposure levels, labour margins, technological channels and AI involvement to be treated as separate dimensions across development stages.

Description of the task-based, country-specific classification approach and the multidimensional labels produced (labour margin, technology channel, AI involvement) across 124 countries.

high positive Global Automation Atlas granularity/capacity of the measurement methodology (ability to separate multipl...

Females seem to be disproportionately more exposed to labour-substituting automation than males.

Gender-disaggregated exposure analysis derived from task-country labels combined with workforce composition by gender across countries; reported descriptive comparison indicating higher substitution exposure for females.

high positive Global Automation Atlas gender gap in exposure to labour-substituting automation (female vs male exposur...

Less technologically advanced forms of automation account for more than half of exposed tasks in low-income countries but about one quarter in high-income countries; more complex technological channels generally rise with income levels.

Breakdown of exposed tasks by technological channel across the 124-country task-country dataset; descriptive comparison across income groups (low- vs high-income).

high positive Global Automation Atlas share of exposed tasks attributed to 'less technologically advanced' channels vs...

Exposure to automation is highly uneven across countries, ranging from 3.3% of tasks in South Sudan to 61.6% in China, and exposure rises strongly with income (with substantial within-group variation).

Descriptive statistics from the task-country atlas covering 124 countries (2.33M task-country labels); reported minimum and maximum exposure percentages and summary comparison across income groups.

high positive Global Automation Atlas share/percentage of tasks exposed to automation

Our measure spans 124 countries, generating an atlas of 2.33 million task-country labels for economies covering 99% of world population and GDP.

Statement in paper describing the constructed task-based, country-specific measure and the generated dataset (124 countries, 2.33 million task-country labels), covering ~99% of world population and GDP.

high positive Global Automation Atlas coverage of task-country labels / dataset scope

Deployed FLUID increases Active Hours by +0.05%.

Reported online metric improvement from production experiments/deployment as stated in the paper. No statistical significance, confidence intervals, or sample sizes provided in the excerpt.

high positive FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... Active Hours

Deployed FLUID increases Cold-Start Room Views by +2.05%.

Reported online metric improvement from production experiments/deployment as stated in the paper. No statistical significance, confidence intervals, or sample sizes provided in the excerpt.

high positive FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... Cold-Start Room Views

Deployed FLUID delivers an online gain of +0.55% Quality Watch Duration.

Reported online metric improvement from production experiments/deployment as stated in the paper. No statistical significance, confidence intervals, or sample sizes provided in the excerpt.

high positive FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... Quality Watch Duration

FLUID was deployed on industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally.

Authors' deployment statement in the paper indicating production rollout across industrial recommenders and noting a combined user base (statement of scope/scale). No A/B sample sizes reported in the excerpt.

high positive FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... deployment scale (reach/user base)
