Evidence (13827 claims)
Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
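The matrix is easier to compare across outcomes once the raw counts are converted to shares. The following is a minimal sketch using a few rows copied from the table. In several rows the four direction columns sum to slightly less than Total (for example, 598 versus 604 for Firm Productivity), so the sketch normalizes by the listed directions; how the remainder should be treated is an assumption, and the function names are illustrative.

```python
# Rows copied from the Evidence Matrix:
# (outcome, positive, negative, mixed, null, total)
ROWS = [
    ("Firm Productivity", 435, 55, 88, 20, 604),
    ("AI Safety & Ethics", 214, 276, 65, 33, 593),
    ("Task Completion Time", 169, 29, 8, 12, 219),
    ("Job Displacement", 12, 80, 20, 1, 113),
]


def direction_shares(positive, negative, mixed, null):
    """Share of each direction among the claims listed in the four columns."""
    listed = positive + negative + mixed + null
    return {
        "positive": positive / listed,
        "negative": negative / listed,
        "mixed": mixed / listed,
        "null": null / listed,
    }


def unlisted_claims(positive, negative, mixed, null, total):
    """Claims counted in Total but not in any of the four direction columns."""
    return total - (positive + negative + mixed + null)


if __name__ == "__main__":
    for outcome, pos, neg, mix, nul, total in ROWS:
        shares = direction_shares(pos, neg, mix, nul)
        gap = unlisted_claims(pos, neg, mix, nul, total)
        print(f"{outcome}: positive {shares['positive']:.1%}, "
              f"negative {shares['negative']:.1%}, unlisted {gap}")
```

Normalizing by the listed directions keeps the four shares summing to one; dividing by Total instead would leave a small unexplained residual in rows where the columns do not add up.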
Modern methodological assessment emphasizes recording individual contributions across several areas, covering not only the completion and quality of assignments but also collaboration, creativity, innovative behavior, and professional growth.
Descriptive conclusion from the scoping review synthesizing themes across 29 empirical studies (2020–2025).
Employee Performance Management (EPM) systems are undergoing a pivotal shift from annual manual data collection ... into more agile human research operations.
Claim summarized from the scoping review of 29 empirical studies (PRISMA-ScR adherence stated).
Findings provide practical insights for AI implementation that prioritize management capability and adaptability to external environments.
Authors' interpretation and managerial implication drawn from empirical PLS-SEM (mediation/moderation) and fsQCA results on 251 firms.
Decision-making agility is a critical conduit linking AI capabilities to improved organizational outcomes.
Inference from the PLS-SEM mediation results reported in the paper, which indicate that the effects of AI capability on performance operate via decision-making agility; analysis based on a survey of 251 firms.
Two sub-dimensions of AI capability, technical infrastructure and management, affect performance outcomes through decision-making agility.
PLS-SEM results reported in the paper showing relationships among the measured constructs (AI capability sub-dimensions, decision-making agility, and performance outcomes); based on a survey of 251 firms.
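The mediation pathway in these claims can be written in the standard product-of-coefficients form. This is a generic sketch rather than the paper's specification: the symbols are assumptions introduced for illustration, with $X$ an AI capability sub-dimension, $M$ decision-making agility, and $Y$ a performance outcome.

```latex
\begin{align}
M &= i_M + a\,X + \varepsilon_M \\
Y &= i_Y + c'\,X + b\,M + \varepsilon_Y \\
\text{indirect effect} &= a\,b, \qquad \text{total effect} = c' + a\,b
\end{align}
```

Under this form, the claim that capability affects performance "through" agility corresponds to a nonzero indirect effect $a\,b$; the excerpt does not report the estimated coefficients.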
Developing an integrated national AI strategic framework is critically necessary to position Georgia as a regional technological leader.
Method: policy recommendation derived from the paper's sectoral analysis and comparative study of successful national strategies; argumentative/normative claim rather than experimental evidence.
An effective AI ecosystem requires an adaptive regulatory framework, infrastructural investments, the integration of ethical standards, and cross-sectoral coordination.
Method: synthesis of findings from the comparative policy analysis and literature; policy prescription based on observed patterns across successful national strategies.
Countries such as Singapore, the United Kingdom, Canada, and France have achieved AI policy success through institutional flexibility and targeted policies independent of the dominant USA and China models.
Method: comparative analysis of national AI strategies and institutional arrangements across the four named countries; qualitative assessment (no numeric sample size).
The paper analyzes the sectoral economic effects of AI using projections from Goldman Sachs, McKinsey, Penn Wharton, and the IMF, and assesses the potential for technology integration in Georgia's finance, healthcare, and education sectors.
Method: synthesis/analysis of published projections from Goldman Sachs, McKinsey, Penn Wharton, and IMF applied to Georgia's sectoral context; comparative assessment of applicability to finance, healthcare, education in Georgia. (No sample size reported.)
The impact of household-side digital economy applications on labor-structure change is significantly greater than that of government- and enterprise-side applications.
Heterogeneity analysis using provincial panel data (2013–2024) comparing household-, government-, and enterprise-side measures of digital-economy application and their associations with servicization/industrialization.
The driving effect of industrial digitalization on changes in the labor structure is stronger than that of digital industrialization.
Comparative effect estimates from the same provincial panel (2013–2024) separating two dimensions of the digital economy: 'digital industrialization' and 'industrial digitalization'.
Establishing this prospective forecasting infrastructure is a critical technical requirement for managing the current global workforce realignment around AI.
Argumentative claim made by the authors in the paper's conclusion/positioning; presented as a normative recommendation rather than an empirically demonstrated necessity.
The article details the computational architecture required to construct this simulation platform and defines the privacy, accuracy, and representativeness safeguards necessary for responsible deployment.
Statement of the paper's content and contributions (architectural description and discussion of safeguards); this is a claim about what the paper contains rather than an empirical finding.
Among consenting populations, these agents can be seeded with HR records, validated psychometric measures, and digital activity data to simulate employees' cognitive, emotional, and behavioral trajectories across successive workdays during planned organizational changes.
Proposal/specification in the paper describing how the simulation would be constructed and what inputs it could use; no empirical evaluation or results reported in the excerpt.
We combine recent advances in LLM-powered generative agents with foundational management science and organizational behavior research to propose dynamic employee agents.
Descriptive/methodological claim about the paper's proposed approach; represents a design/proposal rather than empirical validation.
The integration of artificial intelligence into knowledge work currently affects a substantial share of the global workforce.
Claim presented in the paper as background/context; no supporting empirical sample, statistics, or citations provided in the excerpt.
The activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require.
Description of the classroom activity in the paper (students construct tasks, review peers' tasks for ambiguity, and evaluate systems), supported by qualitative reflections.
Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs.
Qualitative reflections reported from five student contributors (n=5) included in the paper, used as evidence for educational impact.
Across thirteen evaluated systems, the best-performing system, GPT-5.5, reaches a 57.58% pass rate.
Empirical evaluation results reported in the paper naming GPT-5.5 as best performer with a 57.58% pass rate on QuestBench.
The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.
URL provided in the paper pointing to the hosted dataset on Hugging Face.
The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains.
Statement in the paper specifying dataset composition: 256 questions and 14 domains; dataset artifact referenced and released.
We introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work.
Description of course design and pedagogical practice in the paper (course activity where students construct benchmarks and evaluate systems). No numerical sample size for the course cohort reported in the excerpt.
We survey recent open-world evaluations, identify their strengths and limitations, and conclude with recommendations for designing and reporting open-world evals.
Statement of the paper's content: a literature/methods survey and synthesis, with detailed recommendations included in the conclusions (qualitative content).
Open-world evaluations can provide early warning of capabilities that may soon become widespread.
Inference drawn by authors based on the reported open-world experiment (the iOS app trial) and a survey of recent open-world evaluations; claim is presented as a suggested benefit rather than proven at scale.
The agent completed the task with only a single avoidable manual intervention.
Direct observation from the paper's described experiment (single-agent trial producing the iOS app and publishing it; authors report occurrence of one avoidable manual intervention). Sample size = 1.
As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store.
Empirical demonstration described in the paper: a single experimental task in which an AI agent was assigned to develop and publish a simple iOS app. Sample size implied by description: 1 trial/instance.
We introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such [open-world] evaluations regularly.
Paper describes the CRUX project as a proposed/introduced initiative (project description); no empirical trial-size or rollout numbers reported in the abstract.
We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation.
Methodological proposal/definition presented in the paper (conceptual argument and rationale); described as design recommendation rather than empirically validated at scale.
Benchmark-based evaluation remains important for tracking frontier AI progress.
Conceptual assertion in the paper's introduction/abstract and literature context; no empirical sample reported for this claim (position statement).
Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.
Availability statement in the paper (link to repository).
AgroVG provides task-specific protocols for box-set matching and query-level mask coverage.
Methodological contribution described in the paper (evaluation protocols designed for the benchmark).
AgroVG supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes.
Dataset/task specification described in the paper (task types T1 and T2 and query regime coverage).
AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy.
Dataset construction / reported dataset statistics in the paper (explicit count and composition).
We introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present.
Paper contribution: description of a new benchmark and its task formulation (benchmark construction and formalization).
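The set-prediction formulation above (return every matching instance, or abstain when no target is present) can be scored with a simple box-set matcher. The following is a minimal sketch and not AgroVG's protocol, which the excerpt does not describe: the greedy one-to-one matching, the 0.5 IoU threshold, and all function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_box_sets(pred, gt, thr=0.5):
    """Greedy one-to-one matching of predicted to ground-truth boxes.

    Returns (true_positives, false_positives, false_negatives).
    thr is an assumed IoU threshold, not a value taken from the paper.
    """
    # Score every prediction/ground-truth pair, best overlap first.
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    tp = 0
    for score, i, j in pairs:
        if score < thr:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_gt:
            continue  # each box may be matched at most once
        used_pred.add(i)
        used_gt.add(j)
        tp += 1
    return tp, len(pred) - tp, len(gt) - tp


def abstained_correctly(pred, gt):
    """True when the query has no target and the model returned nothing."""
    return not gt and not pred
```

With this matcher, a target-absent query penalizes any returned box as a false positive, while a multi-target query penalizes each missed instance as a false negative, which is what makes completeness and abstention measurable alongside localization.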
Evaluating agricultural visual grounding therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention.
Methodological assertion in the paper motivating the benchmark design (conceptual requirement for evaluation metrics and protocols).
Visual grounding is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting.
Framing / motivation statement in the paper abstract/introduction (conceptual argument linking visual grounding capability to downstream agri-applications).
Enterprise capability adaptation serves as the key support for implementing intelligent international marketing models.
Conclusion from the paper's review and content analysis of literature (2010–2025); presented as a synthesized enabling factor rather than empirically quantified effect.
Mainstream innovation models include data-driven precision marketing, AI-powered cross-border CRM, intelligent omnichannel integration, and cross-cultural intelligent localization marketing.
Summary from the paper's systematic review and content analysis of core literature (2010–2025); descriptive synthesis, no primary experimental sample size reported.
New theoretical frameworks have emerged: data-driven precision marketing theory, nonlinear customer journey reconstruction theory, cross-border intelligent value co-creation theory, and global intelligent marketing ecosystem theory.
Identified via the paper's systematic review and content analysis of literature from 2010–2025; presented as conceptual/theoretical developments rather than quantified empirical effects.
Intelligent technologies have increased international marketing ROI by 12%–25%.
Mixed-method systematic review and content analysis of core literature sources from 2010 to 2025 (as reported in the paper). No primary dataset or sample size reported for this quantified range.
Structured production-process management and size are significant predictors of AI adoption.
Regression/associational analysis from the Census Bureau survey showing that measures of structured production-process management and establishment size predict reported AI use; sample ~28,500 establishments.
The paper's methodology enables classification of automation exposure that disentangles labour-substituting from labour-augmenting automation, identifies the relevant technology channel, and records the material role of AI — allowing exposure levels, labour margins, technological channels and AI involvement to be treated as separate dimensions across development stages.
Description of the task-based, country-specific classification approach and the multidimensional labels produced (labour margin, technology channel, AI involvement) across 124 countries.
Females appear to be disproportionately exposed to labour-substituting automation relative to males.
Gender-disaggregated exposure analysis derived from task-country labels combined with workforce composition by gender across countries; reported descriptive comparison indicating higher substitution exposure for females.
Less technologically advanced forms of automation account for more than half of exposed tasks in low-income countries but about one quarter in high-income countries; more complex technological channels generally rise with income levels.
Breakdown of exposed tasks by technological channel across the 124-country task-country dataset; descriptive comparison across income groups (low- vs high-income).
Exposure to automation is highly uneven across countries, ranging from 3.3% of tasks in South Sudan to 61.6% in China, and exposure rises strongly with income (with substantial within-group variation).
Descriptive statistics from the task-country atlas covering 124 countries (2.33M task-country labels); reported minimum and maximum exposure percentages and summary comparison across income groups.
Our measure spans 124 countries, generating an atlas of 2.33 million task-country labels for economies covering 99% of world population and GDP.
Statement in paper describing the constructed task-based, country-specific measure and the generated dataset (124 countries, 2.33 million task-country labels), covering ~99% of world population and GDP.
Deployed FLUID increases Active Hours by +0.05%.
Reported online metric improvement from production experiments/deployment as stated in the paper. No statistical significance, confidence intervals, or sample sizes provided in the excerpt.
Deployed FLUID increases Cold-Start Room Views by +2.05%.
Reported online metric improvement from production experiments/deployment as stated in the paper. No statistical significance, confidence intervals, or sample sizes provided in the excerpt.
Deployed FLUID delivers an online gain of +0.55% in Quality Watch Duration.
Reported online metric improvement from production experiments/deployment as stated in the paper. No statistical significance, confidence intervals, or sample sizes provided in the excerpt.
FLUID was deployed on industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally.
Authors' deployment statement in the paper indicating production rollout across industrial recommenders and noting a combined user base (statement of scope/scale). No A/B sample sizes reported in the excerpt.