Evidence (5157 claims)
Claim counts by topic:
- Adoption: 7395
- Productivity: 6507
- Governance: 5877
- Human-AI Collaboration: 5157
- Innovation: 3492
- Org Design: 3470
- Labor Markets: 3224
- Skills & Training: 2608
- Inequality: 1835
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Active filter: Human-AI Collaboration
Science has repeatedly delegated its bottlenecks to machines—first inference, then search, then measurement, then the full workflow—and each delegation solves one problem while exposing a harder one underneath.
Interpretive historical argument drawing on examples across AI-for-science milestones (e.g., DENDRAL, search and inference systems, measurement automation, and contemporary end-to-end workflows). No quantitative sample or experimental method reported.
Testing revealed that the AI excels at computational tasks but consistently misses nuanced factors such as new-construction rent premiums and infrastructure-proximity effects, validating the framework's hybrid structure as essential for professional-grade underwriting.
Findings from the controlled ChatGPT-4 test on the single 150-unit scenario: qualitative and comparative observations showing AI handled computations well but failed to capture specific local-market nuances, leading authors to endorse a hybrid human-AI framework.
Phase Two requires human-led professional validation to correct AI limitations, apply local market knowledge, and integrate risk factors.
Framework description supported by observations from the controlled test where human review was used to correct AI outputs and apply local knowledge (e.g., adjusting for nuanced market factors).
AI assistance in safety engineering is fundamentally a collaboration design problem rather than merely a software procurement decision: the same tool can either degrade or improve analysis quality depending entirely on how it is used.
Synthesis of the formal framework and analytic results in the paper (theoretical argument; no empirical sample reported).
The paper concludes by discussing open challenges in evaluating harmful manipulation by AI models.
Paper includes a discussion/conclusion section enumerating open challenges; stated in abstract.
We identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others.
Empirical comparison across three locales (US, UK, India) showing statistically significant differences in manipulation outcomes by geography.
Context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used.
Comparative analysis across three domains (public policy, finance, health) showing differences in manipulative behaviour and/or impact by domain in the empirical study.
AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating that these metrics answer fundamentally different evaluation questions.
Metric comparison across models showing that AUROC_2-based ranking and M-ratio-based ranking are fully inverted in the reported results on the evaluated dataset.
Temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity.
Experimental manipulation (temperature changes) applied to models; reported result that Type-2 criterion shifted with temperature while meta-d' was stable for two models (out of four) in the 224,000-trial dataset.
Metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics.
Domain-level analyses reported in the paper showing per-domain M-ratio results and identification of different weakest domains per model, contrasted with aggregate metric behavior.
Metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar — Mistral achieves the highest d' but the lowest M-ratio.
Empirical comparison of Type-1 sensitivity (d') and metacognitive efficiency (M-ratio) across the four evaluated LLMs on the 224,000 QA trials; explicit statement that Mistral had highest d' but lowest M-ratio.
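For orientation, the definitions behind these metrics as standardly used in the metacognition literature are sketched below; the paper's exact estimation procedure is not restated here, and the notation is assumed for illustration.

```latex
% Type-1 sensitivity: discrimination of correct from incorrect answers
d' = \Phi^{-1}(\mathrm{HR}) - \Phi^{-1}(\mathrm{FAR})

% meta-d': the Type-1 sensitivity implied by the observed confidence ratings,
% fit via the Type-2 ROC; AUROC_2 is the area under that Type-2 ROC.

% Metacognitive efficiency:
\text{M-ratio} = \frac{\text{meta-}d'}{d'}
```

An M-ratio near 1 means confidence tracks accuracy about as well as Type-1 sensitivity allows, while values below 1 indicate metacognitive inefficiency; this is how a model can post the highest d' yet the lowest M-ratio.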
Organizational culture and technological readiness moderate the effectiveness of generative AI integration in decision-making processes.
The paper reports moderation effects tested in the SEM framework using survey data from senior managers, decision-makers, and AI adoption specialists (SmartPLS). No numeric moderator effect sizes or sample size provided in the excerpt.
The implementation of human-replacing technologies significantly transforms skill demand: it reduces reliance on low-skilled labour while increasing demand for qualified engineers, system operators, and digital-technology specialists.
Sector-specific analysis and review of international labour-market studies cited in the article documenting skill-biased effects of automation and digitalization; qualitative assessment for Ukraine's mining and metallurgical sector under workforce shortage conditions.
The framework implies threshold effects in training and capability acquisition: when the teaching horizon lies below the prerequisite depth of the target, additional instruction cannot produce successful completion of teaching; once that depth is reached, completion becomes feasible.
Model-derived threshold result described in the abstract (mathematical analysis of prerequisite depth vs. teaching horizon).
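A minimal formal restatement of that threshold, with notation assumed for illustration rather than taken from the paper:

```latex
% Let D(g) denote the prerequisite depth of the target capability g
% and H the available teaching horizon.
\text{teaching of } g \text{ can complete} \iff H \ge D(g)
% Below the threshold (H < D(g)), additional instruction within the same
% horizon cannot produce completion; once H \ge D(g), completion is feasible.
```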
The value of information depends on whether downstream users can absorb and act on it: a signal conveys meaning only to a learner with the structural capacity to decode it (an explanation that clarifies a concept for one user may be indistinguishable from noise to another who lacks the relevant prerequisites).
Conceptual argument motivating the model; theoretical reasoning described in the paper's intro/abstract.
Generative AI serves as an effective 'wingman' for employment lawyers, capable of replacing substantial junior associate work while requiring continued human expertise for client counseling, supervision, and final legal advice preparation.
Authors' synthesis of experimental results showing AI-produced substantive analysis plus discussion about remaining limitations (e.g., citation errors) and required human oversight; qualitative assertion about substitutability for junior associate tasks.
PPS gains are task-dependent: gains are large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning tasks.
Task-level analysis across the three domains (business, technical, travel) within the controlled study (60 tasks total); authors report differential performance patterns by domain/ambiguity.
AI usage has dual effects on employees: it can both enhance innovative behavior and predict disengagement, as revealed by a dual-path (SOR-based) model.
Interpretation/synthesis from the four-stage longitudinal study of 285 finance professionals using a dual-path model based on SOR theory (combining the mediation and moderation results).
We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap.
Experimental evaluation reported in the paper: authors state they ran experiments on 14 different large language models, under zero-shot and retrieval-augmented configurations, and observed differing performance across models.
Artificial intelligence embedded in human decision-making can either enhance human reasoning or induce excessive cognitive dependence.
Stated as a conceptual claim in the paper's introduction/abstract; supported by the paper's conceptual framing (theoretical argument), no empirical sample or experimental data reported here.
These productivity gains are most pronounced for lower-skilled workers, producing a pattern the authors call “skill compression.”
Cross-study pattern reported in the literature review: comparative evidence across worker-skill strata in multiple empirical papers showing larger relative gains for lower-skilled/junior workers; specific underlying studies and sample sizes are not enumerated in the brief.
Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt.
Controlled experiment described in the paper: 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art LLMs under five prompt framing conditions.
These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science.
Interpretation based on competition results where AI-only baselines underperformed relative to many participant teams and top solutions used human-AI collaboration.
These findings indicate a misalignment between the perceived benefit of AI writing and an implicit, consistent effect on the semantics of human writing, with potential implications for cultural and scientific institutions.
Synthesis and interpretation of the paper's empirical results (user study, essay revision experiments, and peer-review analysis); presented as the paper's broader conclusion.
The paper formalizes the distinction using a signal-aggregation model in which an organization maintains an anchor belief and achieves agreement through two exclusion channels: (1) report shrinkage toward the anchor and (2) a tolerance rule that discards reports deviating beyond a threshold.
Analytical formal model presented in the paper specifying an anchor belief and two exclusion mechanisms; model assumptions and mechanisms are explicit in the theoretical development. No empirical sample.
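A minimal numerical sketch of the two exclusion channels described above; the function name, shrinkage weight, averaging rule, and tolerance value are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def aggregate(reports, anchor, shrink=0.5, tol=1.0):
    """Toy version of the anchor-belief aggregation described above.

    Channel 1 (shrinkage): each report is pulled toward the anchor belief.
    Channel 2 (tolerance rule): shrunken reports that still deviate from the
    anchor by more than `tol` are discarded before averaging.
    """
    reports = np.asarray(reports, dtype=float)
    shrunk = anchor + (1 - shrink) * (reports - anchor)   # pull toward anchor
    kept = shrunk[np.abs(shrunk - anchor) <= tol]         # exclude outliers
    if kept.size == 0:
        return anchor                                     # nothing survives: keep anchor
    return kept.mean()

# Example: heterogeneous reports around an anchor of 0.0
print(aggregate([-3.0, -0.5, 0.2, 0.4, 2.5], anchor=0.0))
```

Tightening either channel (larger `shrink`, smaller `tol`) produces agreement without integrating the excluded reports, which is the exclusionary route to cohesion the model distinguishes from genuine integration.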
Organizational cohesion is observationally ambiguous: it can arise either from genuine information integration (debate and synthesis of heterogeneous inputs) or from exclusionary processes (conformity pressure, gatekeeping, intolerance of dissent).
Conceptual argument and formal definition in the paper framing; supported by the analytic distinction introduced in the paper between integration and exclusion as alternative generative mechanisms for observed agreement. No empirical sample—argument is theoretical and illustrated by model construction.
The authors identify ten evaluation practices that teams use, ranging from lightweight interpretive checks to formal organizational processes (examples: qualitative user reviews, red-team testing, A/B experiments, telemetry/log analysis, structured annotation, governance/meta-evaluation).
Thematic coding of 19 interview transcripts produced a taxonomy enumerating ten practices (paper reports the taxonomy as an outcome).
The net educational value of AI-generated feedback depends on alignment with pedagogical goals, quality evaluation, integration with human teaching, and governance to manage equity, privacy, and incentives.
Synthesis statement from the meeting report produced by 50 interdisciplinary scholars; conceptual judgment rather than empirical proof.
Convergence after exemplar exposure occurred by both tightening of estimates within a measure family and by agents switching measure families.
Agent-level tracking across stages showed two patterns following exemplar exposure: (1) reduced within-family dispersion (tighter estimates) and (2) categorical switches in measure selection by some agents, as recorded across the 150-agent sample.
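A small sketch of how those two convergence patterns could be computed from per-agent records; the data, column names, and schema are hypothetical, not the study's.

```python
import pandas as pd

# Hypothetical per-agent records before and after exemplar exposure.
df = pd.DataFrame({
    "agent":    [1, 1, 2, 2, 3, 3],
    "stage":    ["pre", "post"] * 3,
    "family":   ["A", "A", "B", "A", "A", "A"],
    "estimate": [10.0, 7.5, 14.0, 8.0, 6.0, 7.0],
})

# Pattern 1: within-family dispersion tightens after exposure
dispersion = df.groupby(["stage", "family"])["estimate"].std()

# Pattern 2: some agents switch measure families between stages
families = df.pivot(index="agent", columns="stage", values="family")
n_switched = (families["pre"] != families["post"]).sum()

print(dispersion)
print("agents that switched family:", n_switched)
```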
LLMs excel at extracting and generating arguments from unstructured text but are opaque and hard to evaluate or trust.
Synthesis of recent LLM literature and observed properties (generation capability vs. opacity); no empirical evaluation within this paper.
The paper is primarily theoretical and historical; empirical validation is needed to quantify the irreducible component of LLM value, and practical degrees of rule‑extractability may exist even if some capabilities remain tacit.
Stated limitations section acknowledging the theoretical nature of the work and the need for empirical follow‑up.
If an LLM's full capability were reducible to an explicit rule set, that rule set would be an expert system; because expert systems are empirically and historically weaker than LLMs, this leads to a contradiction (supporting non‑rule‑encodability).
Logical proof‑by‑contradiction presented in the paper, supported by conceptual mapping between rule sets and expert systems and qualitative historical comparisons.
Teamwork partner type moderates the effect of service empathy on collaboration proficiency (i.e., the impact of service empathy on proficiency differs by human vs AI partner).
Reported interaction/moderated-mediation analyses from the online experiment (n = 861) indicating a significant partner-type × service-empathy interaction predicting collaboration proficiency.
Employees' emotional state significantly moderates the relationship between partner type (human vs AI) and collaboration proficiency.
Moderation analyses reported from the same online experimental dataset (n = 861), testing interaction terms between partner type and measured employee emotion on collaboration proficiency; authors report a significant moderating effect.
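For readers unfamiliar with how such moderation claims are operationalized, a generic interaction specification is sketched below; the coefficient names and the indicator coding are assumptions, and the authors' exact model (covariates, moderated-mediation paths) is not reproduced here.

```latex
% Generic moderation specification (illustrative, not the authors' exact model):
\mathrm{Proficiency}_i = \beta_0 + \beta_1\,\mathrm{Empathy}_i
  + \beta_2\,\mathrm{Partner}_i
  + \beta_3\,(\mathrm{Empathy}_i \times \mathrm{Partner}_i) + \varepsilon_i
% Partner_i is an indicator (human = 0, AI = 1); a significant \beta_3 is what
% "partner type moderates the effect of service empathy" means operationally.
```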
AI adoption has an inverted U-shaped effect on employee-related corporate social responsibility (ECSR).
Panel regression with quadratic specification (AI and AI^2) showing statistically significant positive coefficient on AI and statistically significant negative coefficient on AI^2; sample of 2,575 Chinese listed firms observed 2013–2023; controls, firm and/or year fixed effects and robustness checks reported.
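A sketch of the quadratic panel specification behind an inverted-U finding of this kind; the controls, fixed-effects structure, and symbols are illustrative and not copied from the paper.

```latex
% Quadratic specification (sketch):
\mathrm{ECSR}_{it} = \beta_1\,\mathrm{AI}_{it} + \beta_2\,\mathrm{AI}_{it}^2
  + \gamma' X_{it} + \mu_i + \lambda_t + \varepsilon_{it},
\qquad \beta_1 > 0,\ \beta_2 < 0
% The inverted U implies a turning point at AI^* = -\beta_1 / (2\beta_2):
% ECSR rises with AI adoption up to AI^* and declines beyond it.
```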
Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged.
Measured token usage for agent runs with and without skills, reporting a range from modest token savings up to a 451% token increase with no corresponding change in pass rates.
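As a quick arithmetic check on what a 451% increase means, a hypothetical baseline is worked through below; the baseline figure is an assumption for illustration only.

```python
baseline_tokens = 10_000                     # hypothetical baseline run without skills
increase = 4.51                              # a 451% increase over baseline
with_skills = baseline_tokens * (1 + increase)
print(with_skills)                           # 55100.0 tokens, i.e. ~5.5x the baseline
```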
The research methodology combines systemic analysis, comparative assessment of international practices, and analytical generalization of organizational-learning models, enabling it to capture both structural trends and concrete institutional responses to technological change.
Methodological statement from the paper describing its approach; this is a factual claim about methods used rather than an empirical finding.
Model output can be treated as evidence for studying human behavior, but there are important epistemic limits to interpreting model-generated text as direct evidence of human beliefs or social facts.
Epistemic analysis and methodological critique in the paper (discussion of limits of treating model outputs as evidence); no single empirical test cited in the provided text.
The validity of human–AI decision-making studies hinges on participants' behaviours; effective incentives can potentially affect these behaviours.
Conclusion from the authors' thematic review and theoretical rationale linking incentive design to participant behaviour and study validity (no quantitative effect sizes provided in excerpt).
The study's counterfactual analytical model links HR indicators (training intensity, absenteeism, labor productivity, turnover rates, workforce allocation) to organizational performance outcomes using regression-based simulations and predictive estimation.
Methodological claim explicitly stated: model construction from an industrial firm dataset using regression-based simulations and predictive techniques. (Specific sample size, variable operationalizations, and time frame not reported in the description.)
Helicoid dynamics is a specific failure regime: a system engages competently, drifts into error, accurately names what went wrong, then reproduces the same pattern at a higher level of sophistication, recognizing it is looping and continuing nonetheless.
Definition introduced in the paper and illustrated by the reported case series; the claim is conceptual/phenomenological rather than a statistical result.
A minimal linear specification (linearized model) demonstrates how coupling strength, persistence, and dissipation determine local stability and oscillatory regimes through spectral conditions on the Jacobian.
Analytic linear model and local stability analysis in the paper: computation of Jacobian, derivation of spectral conditions (eigenvalue locations) that separate stable/oscillatory regimes; illustrative examples within the paper (no empirical data).
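The spectral conditions referenced above are the standard local-stability results for a linearized system; the paper's specific Jacobian entries are not reproduced here, and the notation is assumed.

```latex
% For the linearization \dot{x} = J\,x around a fixed point:
\text{locally stable} \iff \operatorname{Re}\,\lambda_k < 0
  \ \text{for every eigenvalue } \lambda_k \text{ of } J
% Oscillatory (spiral) regimes arise when some eigenvalues have nonzero
% imaginary parts; coupling strength, persistence, and dissipation enter J
% and thereby move the eigenvalues across these regimes.
```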
Distinct AI features (recommendation engines, chatbots, and comparison tools) influence consumer outcomes when modeled as latent constructs.
Methodological claim: the study modeled three AI features as latent constructs and analyzed their relationships with dependent variables using SEM (quantitative questionnaire data).
Both time constraints and LLM use significantly alter the characteristics of decision-makers' mental representations.
Results from the 2 × 2 experiment (N = 348) comparing representation-related measures across manipulated conditions; reported statistically significant differences associated with time constraints and with LLM use.
We develop a theoretical framework, the productivity funnel, that traces how technological potential narrows through successive stages: from access and digital infrastructure, through organizational absorption and human capital adaptation, to ultimate value capture.
Conceptual/theoretical development presented in the paper; no empirical sample needed (framework-building).
Effects of curated Skills are highly heterogeneous across domains (e.g., +4.5 pp in Software Engineering vs. +51.9 pp in Healthcare).
Per-domain pass-rate deltas reported in the paper (SkillsBench per-domain analysis). The example domain deltas (+4.5 pp and +51.9 pp) are taken from the reported per-domain results.
The study's qualitative and exploratory design limits generalizability; the proposed framework requires quantitative testing and broader samples (practicing architects, firms, cross-cultural contexts).
Explicit limitations stated by authors; study is based on semi-structured interviews with architecture students (N unspecified) and inductive thematic analysis.
XChronos reframes transhumanist technology evaluation in experiential terms, creating both market opportunities and measurement/regulatory challenges for AI economics.
Synthesis and concluding argument in the paper summarizing proposed implications; conceptual reasoning without empirical tests.
Across 182 reviewed studies, LLM-generated synthetic participants have modest and inconsistent fidelity to human participants.
Systematic review and synthesis of 182 empirical and methodological studies comparing LLM-generated participants to human samples; studies were coded and analyzed for fidelity outcomes.
Participant targeting: 44% of programs targeted doctors and 44% targeted medical students (with possible overlap), and 56% targeted entry‑to‑practice career stages.
Participant audience and career-stage data extracted from the 27 included programs; proportions reported in the review.