Evidence (2954 claims)
- Adoption: 5126 claims
- Productivity: 4409 claims
- Governance: 4049 claims
- Human-AI Collaboration: 2954 claims
- Labor Markets: 2432 claims
- Org Design: 2273 claims
- Innovation: 2215 claims
- Skills & Training: 1902 claims
- Inequality: 1286 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
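One way to read the matrix is by the share of positive findings per outcome category. A minimal sketch, using (Positive, Total) pairs copied from a few rows of the table above and taking the reported Total column as given:

```python
# Share of positive findings per outcome category, using (Positive, Total)
# pairs copied from the Evidence Matrix; totals are taken as reported.
rows = {
    "Firm Productivity": (273, 389),
    "AI Safety & Ethics": (112, 358),
    "Task Completion Time": (71, 80),
    "Job Displacement": (5, 45),
}

positive_share = {name: pos / total for name, (pos, total) in rows.items()}

# Print categories from most to least positive.
for name, share in sorted(positive_share.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%} positive")
```

The spread is wide: task-speed outcomes skew strongly positive, while displacement and safety outcomes skew negative, which is worth keeping in mind when reading aggregate claim counts.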
Human-AI Collaboration
Helicoid dynamics is a specific failure regime: a system engages competently, drifts into error, accurately names what went wrong, and then reproduces the same pattern at a higher level of sophistication, recognizing that it is looping yet continuing nonetheless.
Definition introduced in the paper and illustrated by the reported case series; the claim is conceptual/phenomenological rather than a statistical result.
A minimal linearized model demonstrates how coupling strength, persistence, and dissipation determine local stability and oscillatory regimes through spectral conditions on the Jacobian.
Analytic linear model and local stability analysis in the paper: computation of the Jacobian and derivation of spectral conditions (eigenvalue locations) that separate stable from oscillatory regimes; illustrative examples within the paper (no empirical data).
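The kind of spectral condition described can be sketched for a generic 2×2 linearization: stability requires eigenvalues with negative real parts, and a nonzero imaginary part signals oscillation. The parameter values below are hypothetical illustrations, not the paper's actual model:

```python
import cmath

def classify_2x2(a11, a12, a21, a22):
    """Classify the local regime of a 2x2 Jacobian via its eigenvalues.

    Eigenvalues solve lambda^2 - tr*lambda + det = 0; stability requires
    negative real parts, and nonzero imaginary parts signal oscillation.
    """
    tr = a11 + a22
    det = a11 * a22 - a12 * a21
    disc = cmath.sqrt(tr * tr - 4 * det)
    eigs = [(tr + disc) / 2, (tr - disc) / 2]
    stable = all(l.real < 0 for l in eigs)
    oscillatory = any(abs(l.imag) > 1e-12 for l in eigs)
    return stable, oscillatory

# Hypothetical parameters: dissipation on the diagonal, coupling off-diagonal.
# Weak symmetric coupling: a stable node (real, negative eigenvalues).
print(classify_2x2(-1.0, 0.2, 0.2, -1.0))   # stable, non-oscillatory
# Antisymmetric coupling dominating weak dissipation: a damped spiral.
print(classify_2x2(-0.1, 1.0, -1.0, -0.1))  # stable, oscillatory
```

Sweeping the off-diagonal coupling terms while holding the diagonal dissipation fixed traces the boundary between the node and spiral regimes, which is the qualitative picture the claim describes.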
Distinct AI features (recommendation engines, chatbots, and comparison tools) influence consumer outcomes when modeled as latent constructs.
Methodological claim: the study modeled three AI features as latent constructs and analyzed their relationships with dependent variables using SEM (quantitative questionnaire data).
Both time constraints and LLM use significantly alter the characteristics of decision-makers' mental representations.
Results from the 2 × 2 experiment (N = 348) comparing representation-related measures across manipulated conditions; reported statistically significant differences associated with time constraints and with LLM use.
We develop a theoretical framework - the productivity funnel - that traces how technological potential narrows through successive stages, from access and digital infrastructure, through organizational absorption and human capital adaptation, to ultimate value capture.
Conceptual/theoretical development presented in the paper; no empirical sample needed (framework-building).
Effects of curated Skills are highly heterogeneous across domains (e.g., +4.5 pp in Software Engineering vs. +51.9 pp in Healthcare).
Per-domain pass-rate deltas reported in the paper (SkillsBench per-domain analysis). The example domain deltas (+4.5 pp and +51.9 pp) are taken from the reported per-domain results.
The study's qualitative and exploratory design limits generalizability; the proposed framework requires quantitative testing and broader samples (practicing architects, firms, cross-cultural contexts).
Explicit limitations stated by authors; study is based on semi-structured interviews with architecture students (N unspecified) and inductive thematic analysis.
XChronos reframes transhumanist technology evaluation in experiential terms, creating both market opportunities and measurement/regulatory challenges for AI economics.
Synthesis and concluding argument in the paper summarizing proposed implications; conceptual reasoning without empirical tests.
Across 182 reviewed studies, LLM-generated synthetic participants have modest and inconsistent fidelity to human participants.
Systematic review and synthesis of 182 empirical and methodological studies comparing LLM-generated participants to human samples; studies were coded and analyzed for fidelity outcomes.
Participant targeting: 44% of programs targeted doctors and 44% targeted medical students (with possible overlap), and 56% targeted entry‑to‑practice career stages.
Participant audience and career-stage data extracted from the 27 included programs; proportions reported in the review.
Most programs were delivered in academic settings: 56% of evaluated programs reported an academic setting.
Setting information extracted from the 27 included programs, with 56% reported as delivered in academic settings.
A plurality of programs were short in duration: 44% of programs were categorized as short courses.
Extraction of program length from the 27 included studies; 44% were classified as short courses per the review's categorization.
Most programs were introductory in content: 67% of included programs taught introductory AI concepts rather than advanced/technical AI skills.
Program content extraction across the 27 included studies yielded that 67% were classified as teaching introductory AI.
The methodological landscape of the evidence base is heterogeneous, consisting of cross-sectional surveys, case studies, quasi-experimental designs, and a limited number of longitudinal analyses.
Study design information was extracted from the 145 included studies revealing a mix of designs and relatively few longitudinal or experimental studies.
Human factors (training, trust calibration, workflows) determine whether clinicians accept, override, or ignore GenAI suggestions.
Qualitative and quantitative human-AI interaction studies and pilot deployments discussed in the paper; specific sample sizes and effect sizes are not reported in the paper.
Safety and net benefit of GenAI CDS hinge on deployment details: user interface, real-time feedback, uncertainty quantification, calibration, and how recommendations are presented (strong vs. suggestive).
Human factors and implementation studies referenced; early A/B tests and human-AI interaction research suggest interface and presentation affect acceptance and error rates; no large-scale standardized implementation trial data cited.
Reimbursement models (fee-for-service vs. capitation) will influence whether cost savings from GenAI are realized or offset by increased service volume.
Economic incentive framework and prior health-economics literature cited; the paper does not provide direct empirical tests but references plausible incentive channels.
RL and adaptive methods are good for real-time adaptation but can be myopic, require large amounts of interaction data, and struggle to incorporate long-term preference structure and ethical constraints.
Surveyed properties of reinforcement learning and adaptive methods in HRI/RS literature; no new empirical evaluation in this paper.
The community knowledge functions both as practical how-to guidance and as collective experimentation with platform rules and revenue mechanisms.
Observed dual nature in the 377-video corpus: instructional workflows alongside demonstrations/testing of platform-tailored monetization tactics and workarounds.
Typical practices emphasized by creators include rapid mass production of content, productizing prompt engineering, repurposing existing material via synthesis/localization, and packaging AI outputs as sellable creative services or assets.
Recurring practices surfaced through qualitative coding of workflows, tools, and pipelines described in the 377 videos.
Across the 377 videos, creators converge on a set of repeatable use cases and platform‑tailored monetization tactics.
Thematic coding of 377 videos produced a catalog of recurring use cases and tactics; the paper reports convergence across that sample.
YouTube creators have collectively constructed and circulated a practical knowledge repository about how to monetize GenAI-driven creative work.
Systematic qualitative content analysis (thematic coding) of 377 publicly available YouTube videos in which creators promote GenAI workflows and monetization strategies.
Choice of scaffold materially affects outcomes: an open-source scaffold outperformed vendor-provided scaffolds by up to approximately 5 percentage points.
Comparative experiments across three scaffolding approaches (vendor scaffolds and at least one open-source scaffold) showing up to ~5 percentage point differences in measured outcomes.
Adoption of NFD approaches in regulated domains will depend on standards for validation, auditability, and update procedures.
Implications and governance discussion emphasizing regulatory constraints (finance, healthcare) and the need for validation/audit standards; a logical/normative claim rather than an empirical finding.
Limitations include generalizability beyond Chatbot Arena data, calibration of priors on novel tasks, audit costs/latency, user comprehension/cognitive load, and strategic manipulation.
Authors' stated limitations and open questions; these are candid acknowledgements rather than empirical findings.
RAD requires estimating cost distributions and choosing a reference policy and quantile-weighting function; these choices determine the method's conservatism and sample efficiency.
Methodological and practical considerations discussed in the paper; noted dependency on estimation and design choices (no quantitative sample-efficiency results provided in the summary).
Explanations change workflows, shift responsibilities between humans and machines, and can reshape power dynamics—creating both opportunities (better oversight) and risks (over-reliance, gaming).
Qualitative and conceptual studies synthesized in the review, including socio-technical analyses and case studies reporting observed or theorized workflow and responsibility shifts; no meta-analytic causal estimate.
Explanations increase user trust principally when they are understandable, actionable, and aligned with users’ domain knowledge; opaque or overly technical explanations can fail to build trust or even decrease it.
Thematic synthesis of empirical and conceptual studies in the reviewed literature reporting conditional effects of explanation form and comprehensibility on trust; review notes heterogeneity in study designs and contexts.
Explainability improves perceived legitimacy, user trust, and organizational accountability only when technical transparency is paired with human-centered explanation design and governance mechanisms.
Synthesis of studies from the reviewed literature showing conditional effects of algorithmic interpretability combined with explanation design and governance; derived via thematic coding across technical and social-science sources (no new primary experimental data reported).
Explainability is a necessary but not sufficient condition for trustworthy AI in high-stakes domains.
Systematic literature review (thematic coding and synthesis) of interdisciplinary scholarship (peer-reviewed research, technical reports, policy documents); the paper synthesizes conceptual and empirical studies rather than presenting new primary data. Emphasis on high-stakes domains (healthcare, finance, public sector).
Some patients value human contact for sensitive cases; automated interactions can feel impersonal.
Semi-structured interviews with patients/staff and open-ended survey responses documenting preferences for human interaction in sensitive/complex complaints.
Data‑driven policies can either amplify or mitigate inequalities depending on data representativeness, model design, and deployment governance.
Multiple empirical examples and theoretical analyses in the review highlighting cases of both harm (bias amplification) and mitigation, identified across the 103 items.
Citizen acceptance, transparency, and perceived fairness strongly shape adoption trajectories and the political feasibility of AI tools in government.
Repeated empirical findings in the reviewed literature linking public trust, transparency measures, and fairness perceptions to successful or failed deployments (drawn from multiple case studies in the 103 items).
Adoption of AI and data-driven governance is highly uneven across jurisdictions and sectors, driven by institutional capacity, governance frameworks, and public trust.
Cross‑regional and cross‑sector comparisons in the review corpus (103 items) showing varying maturity levels and repeated identification of institutional capacity, governance arrangements, and trust factors as determinants.
Productivity gains from generative AI depend on task mix, integration design, and the availability of complementary human skills.
Theoretical evaluation and synthesis of heterogeneous empirical findings; authors highlight variation across firms, sectors, and tasks.
Existing evidence is time-sensitive and heterogeneous: rapidly evolving models, heterogeneous study designs, and many short-term lab/microtask studies limit direct comparability and long-run inference.
Meta-observation from the review: documented methodological limitations across the literature (variation in models, tasks, metrics; prevalence of short-term studies).
Methodological caveats across the literature (heterogeneity of tasks/measures, publication bias, short-term studies) limit the generalizability of current findings.
Meta-level critique within the synthesis noting study heterogeneity, likely publication/short-term biases, and variable domain-specific performance dependent on user expertise and workflows.
Standard productivity metrics are likely to undercount the value generated by AI-augmented ideation; quality-adjusted measures of creative output are required.
Measurement critique based on the mismatch between existing productivity statistics and the kinds of upstream idea-generation gains observed in empirical studies; supported by the review's methodological discussion.
Realized value from AI methods (ML, predictive analytics, anomaly detection, XAI) is conditional: these technical methods deliver capabilities only when combined with strong data governance, standardized processes, and change management.
Thematic synthesis across the systematic review (2020–2025) showing repeated case-study and practitioner-report evidence that technical gains failed to scale without governance, process standardization, and organizational change efforts.
The hybrid estimator (GA+SQP) is computationally more intensive than single-stage MLE/local optimization, implying a trade-off between estimation reliability and runtime cost.
Reported runtime and computational cost comparisons in estimation experiments: the paper notes longer runtimes for GA+SQP versus standard optimizers while documenting improvements in objective values and convergence behavior.
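The reliability-versus-runtime trade-off behind such hybrids can be sketched in miniature. As stand-ins for GA and SQP (neither is reproduced here), this uses seeded random search as the global stage and finite-difference gradient descent as the local stage, on a hypothetical multimodal objective:

```python
import math
import random

def objective(x):
    # Hypothetical multimodal 1-D objective with several local minima.
    return math.sin(5 * x) + 0.1 * (x - 1.0) ** 2

def global_stage(n_samples=200, lo=-3.0, hi=3.0, seed=0):
    # Stand-in for the GA stage: broad, expensive stochastic exploration.
    rng = random.Random(seed)
    return min((rng.uniform(lo, hi) for _ in range(n_samples)), key=objective)

def local_stage(x, lr=0.01, steps=500, eps=1e-6):
    # Stand-in for the SQP stage: cheap local refinement from a good start.
    for _ in range(steps):
        grad = (objective(x + eps) - objective(x - eps)) / (2 * eps)
        x -= lr * grad
    return x

x0 = global_stage()        # runtime grows with n_samples (the reliability knob)
x_star = local_stage(x0)   # fast polish; alone it could stall in a poor basin
```

A single-stage local optimizer started from an arbitrary point converges to whichever basin it lands in; paying for the global stage first buys robustness at the cost of many extra function evaluations, mirroring the reported GA+SQP runtime trade-off.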
Results and implications are limited by the sample and context: evidence comes from law students on a single issue-spotting exam using one brief training intervention, so generalizability to experienced professionals, other tasks, or other models is untested.
Authors’ reported sample (164 law students) and explicit caution about generalizability in the study summary; the intervention and outcome are specific to one exam and one ~10-minute training.
Some mechanism-specific estimates are imprecise due to the sample size; confidence intervals for those estimates are wide.
Authors report wide confidence intervals for mechanism decomposition (principal stratification) results based on the randomized sample of 164 students.
Performance degradation persists even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution.
Experiments comparing unstructured versus structured context provision; structured semantic layers (AST context, import graph resolution) were evaluated and models still degraded with more context.
Models' performance degrades monotonically from diff-only (config_A) to diff+file content (config_B) to full context (config_C) across all 8 models.
Systematic ablation across three frozen context configurations (config_A, config_B, config_C) reported; all 8 evaluated models show monotonic performance decline as more context is provided.
Eight frontier models detect only 15–31% of human-flagged issues on the diff-only configuration (config_A).
Empirical evaluation across 8 models on SWE-PRBench (350 PRs) under the diff-only configuration; reported detection rates of 15–31% relative to human-flagged issues.
There is a growing gap between rapid experimentation with AI tools and limited organizational capability to institutionalize them in everyday workflows.
Argument supported by targeted literature synthesis and review of recent scholarly and institutional sources; no primary empirical sample reported in this paper.
Evaluations across eight state-of-the-art multimodal models show that they achieved only 55.0% accuracy on help prediction.
Experimental evaluation reported in the paper comparing eight multimodal models on the Help Prediction task with reported accuracy metric.
Evaluations across eight state-of-the-art multimodal models show that they achieved only 44.6% accuracy on behavior state detection.
Experimental evaluation reported in the paper comparing eight multimodal models on the Behavior State Detection task with reported accuracy metric.
Ikema is a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old.
Demographic/descriptive claim reported in the paper's background (likely citing prior surveys or census estimates); the abstract states the ~1,300 speakers figure and age distribution.
The financial planning and investment management profession is undergoing a radical transformation driven by Generative AI (GenAI) and Agentic AI. This shift creates urgent workforce-displacement challenges that require coordinated government policy intervention alongside educational reform.
Author assertion in the paper's introduction/abstract; framing argument based on the paper's synthesized analysis (no empirical sample, no reported statistical test).