Evidence (2954 claims)
Claim counts by topic:
| Topic | Claims |
|---|---|
| Adoption | 5126 |
| Productivity | 4409 |
| Governance | 4049 |
| Human-AI Collaboration | 2954 |
| Labor Markets | 2432 |
| Org Design | 2273 |
| Innovation | 2215 |
| Skills & Training | 1902 |
| Inequality | 1286 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Human-AI Collaboration
Claims and their supporting evidence, filtered to this topic.
Automation bias (human tendency to defer to automated outputs) compounds the risk that GLAI errors become embedded in legal processes.
Behavioral literature review on automation bias and trust in AI systems; applied to legal-context vignettes. No primary empirical test within the paper.
Cooperation with the AI plateaus and never reaches the near-complete cooperation levels observed in human–human interactions.
Time-series/trajectory analysis of cooperation rates in the lab human–AI experiment (n = 126) compared to the human–human benchmark (n = 108); reported convergence/end-state cooperation levels show AI condition asymptotes below the human–human condition.
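A minimal sketch of this kind of trajectory comparison, assuming per-round cooperation rates for each condition and fitting an exponential approach to an asymptote; the round counts, noise levels, and rates below are hypothetical placeholders, not the study's data:

```python
# Sketch: estimate end-state (asymptotic) cooperation per condition by
# fitting a saturating curve to per-round cooperation rates.
# All numbers are hypothetical, not the paper's data.
import numpy as np
from scipy.optimize import curve_fit

def saturating(t, asymptote, rate, start):
    # Cooperation rises from `start` toward `asymptote` at speed `rate`.
    return asymptote - (asymptote - start) * np.exp(-rate * t)

rng = np.random.default_rng(0)
rounds = np.arange(1, 21)
coop_hh = 0.95 - 0.55 * np.exp(-0.3 * rounds) + rng.normal(0, 0.02, 20)
coop_hai = 0.70 - 0.30 * np.exp(-0.3 * rounds) + rng.normal(0, 0.02, 20)

for label, series in (("human-human", coop_hh), ("human-AI", coop_hai)):
    (asym, _, _), _ = curve_fit(saturating, rounds, series, p0=[0.8, 0.2, 0.4])
    print(f"{label}: fitted asymptote = {asym:.2f}")  # human-AI plateaus lower
```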
Adoption requires hardware (VR headsets, capable GPUs) and integration effort, implying upfront capital expenditure for labs/observatories.
Paper explicitly notes hardware requirements (VR headsets, capable GPUs) and integration effort as part of adoption considerations; common-sense assessment of required capital.
Current models heavily rely on large static datasets and batch training and exhibit poor lifelong/continual learning.
Synthesis of common practices in contemporary ML (supervised pretraining and offline training paradigms); no new experiments provided.
When identical replies are labeled as coming from AI rather than from a human, recipients report feeling less heard and less validated (an attribution effect).
Controlled attribution labeling experiment within the study: identical replies presented with different source labels (AI vs. human) and recipient-rated perceptions of being heard/validated measured.
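A minimal analysis sketch for an attribution-labeling design like this one, assuming between-subjects ratings of feeling heard/validated under each source label; sample sizes, scale, and values are hypothetical:

```python
# Sketch: compare "felt heard" ratings for identical replies labeled as
# AI-written vs human-written. Hypothetical data on a 1-7 scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ai_label = rng.normal(4.1, 1.2, 120)      # replies labeled "from AI"
human_label = rng.normal(4.8, 1.2, 120)   # same replies labeled "from human"

t, p = stats.ttest_ind(human_label, ai_label)
print(f"attribution effect on feeling heard: t = {t:.2f}, p = {p:.4f}")
```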
The empirical validation is performed only on synthetic text-preference data rather than real-world user populations, so field deployment effects and richer preference models remain to be tested.
Experiments section states synthetic dataset for text preferences and notes absence of field experiments on real user populations.
The theoretical results (algorithms and sample-complexity bounds) assume truthful, exogenous preferences and simple sampling access; strategic behavior or costly reporting could change the information requirements.
Modeling assumptions explicitly stated in the paper (sampling access to truthful preferences) and discussion in the implications/limitations section noting the need to consider strategic behavior and reporting costs.
Matching information-theoretic lower bounds are proved, establishing that no algorithm can guarantee finding an (approximate) proportional veto-core element with fewer queries than the stated bounds (i.e., the sample complexity is optimal).
Lower-bound proofs in the theoretical section of the paper showing impossibility results that match the upper-bound rates.
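In outline, "matching" has its usual meaning; a hedged sketch with a placeholder rate $g$, since the excerpt does not state the exact bound:

```latex
% Upper bound: an algorithm finds an epsilon-approximate proportional
% veto-core element using O(g(n, epsilon)) preference queries.
% Lower bound: any algorithm needs Omega(g(n, epsilon)) queries.
% Together these pin down the sample complexity up to constants:
\mathrm{SC}(n, \varepsilon) \;=\; \Theta\bigl(g(n, \varepsilon)\bigr)
```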
Agent performance degrades markedly as environment complexity, stochasticity, and non-stationarity increase, revealing core limitations of current LLM-based agents for long-horizon, multi-factor decision problems.
Experimental results across progressively harder RetailBench environments showing performance falloff for multiple LLMs under increased task complexity and non-stationarity.
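A schematic of the difficulty sweep such an evaluation implies; `make_env` and `run_agent` below are illustrative stubs, not RetailBench's actual interface:

```python
# Sketch: sweep environment difficulty and record agent scores to expose
# performance falloff. Stubs stand in for real environment/agent rollouts.
from itertools import product

def make_env(complexity: int, stochasticity: float) -> dict:
    return {"complexity": complexity, "stochasticity": stochasticity}

def run_agent(env: dict) -> float:
    # Stub scoring rule mimicking degradation under harder settings.
    return max(0.0, 1.0 - 0.15 * env["complexity"] - env["stochasticity"])

for complexity, noise in product(range(1, 5), (0.0, 0.2, 0.4)):
    score = run_agent(make_env(complexity, noise))
    print(f"complexity={complexity} noise={noise:.1f} -> score={score:.2f}")
```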
There are limited randomized controlled trials or longitudinal evaluations; few studies measure patient-relevant outcomes or economic impacts.
Literature synthesis noting scarcity of RCTs and long-term observational studies, and absence of widespread patient-outcome and cost-effectiveness evaluations in existing publications.
Many published studies focus on standalone algorithm accuracy rather than clinician–AI joint performance in routine workflows.
Review of the literature categorizing study designs (preponderance of algorithm development/validation studies, fewer reader-in-the-loop, simulation, or deployment studies).
Advanced technologies' complexity and lack of explainability create risks for audit reliability and professional judgement.
Findings from literature synthesis and professional/regulatory perspectives included in the review; presented as an identified risk/challenge rather than quantified effect.
Audit 5.0 introduces key challenges: data quality and integration issues, complexity and explainability of advanced technologies, regulatory and ethical uncertainty, and skills shortages combined with cultural resistance.
Systematic literature review and synthesis of professional standards and regulatory perspectives; assertions based on reviewed literature rather than a single empirical dataset.
At the question level, incorrect chatbot suggestions substantially reduce caseworker accuracy, with a two-thirds reduction on easy questions where the control group performed best.
Question-level analysis from the randomized experiment comparing cases where chatbot suggestions were incorrect versus control; paper reports a ~66% reduction in accuracy on easy questions when chatbot suggestions were incorrect (exact denominators and statistics not provided in the excerpt).
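The "two-thirds reduction" is a relative drop; a worked example with hypothetical accuracy values (the paper's denominators are not given in the excerpt):

```python
# Worked example of a ~66% relative accuracy reduction. Hypothetical values.
control_accuracy = 0.90        # easy questions, no incorrect suggestion
with_wrong_suggestion = 0.30   # easy questions, incorrect suggestion shown
relative_drop = (control_accuracy - with_wrong_suggestion) / control_accuracy
print(f"relative reduction: {relative_drop:.0%}")  # ~67%, i.e. two-thirds
```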
When incentive signals depend non-trivially on persistent environmental memory, the resulting dynamics generically cannot be reduced to a static global objective defined solely over the agent state space (i.e., no global potential function over agents exists in the generic case).
A genericity theorem/argument in the paper (mathematical demonstration showing that for nontrivial dependence on environmental memory the closed-loop vector field is, for a generic set of parameterizations, not gradient of any scalar function on agent space).
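One standard way to see the obstruction (a hedged sketch consistent with the claim, not the paper's exact proof): a gradient flow on agent space forces a symmetric Jacobian, and coupling to persistent memory generically breaks that symmetry.

```latex
% Closed-loop dynamics with persistent environmental memory m:
\dot{x} = F(x, m), \qquad \dot{m} = G(x, m).
% If F were the gradient of a global potential V on agent space alone,
% F = -\nabla_x V(x), its Jacobian would equal the (symmetric) negative
% Hessian of V:
\frac{\partial F_i}{\partial x_j} = \frac{\partial F_j}{\partial x_i}
\quad \text{for all } i, j,
% a symmetry condition that fails for generic memory-dependent F.
```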
AI notably reduces customer stability in sports enterprises (SEs).
Empirical estimation using a double machine learning (DML) model on a panel dataset of 45 Chinese listed SEs (2012–2023); authors report a statistically significant negative effect of AI on customer stability.
The sample is limited to Chinese A-share-listed design enterprises (2014–2023), which may limit generalizability to small and medium-sized enterprises (SMEs) or firms in other countries/regions.
Study sample description: A-share-listed design-oriented enterprises in China between 2014 and 2023; authors explicitly note this as a limitation.
Using TFP as a proxy for project efficiency aggregates effects at the firm level and therefore lacks micro-level insight into specific project workflows or design iteration processes.
Methodological limitation acknowledged in the paper: TFP is used as a firm-level proxy and the dataset does not include micro-level project workflow or iteration logs.
There exists a systemic governance vacuum around GenAI, including gaps in privacy, accountability, and intellectual property protections.
Authors' synthesis of governance-related gaps reported across the 28 secondary studies and research agendas in the review.
Societal and ethical risks—such as bias, misuse, and skill erosion—constrain GenAI adoption.
Themes synthesized from the reviewed literature (28 papers) reporting societal and ethical concerns associated with GenAI deployment.
Technical unreliability—manifesting as hallucinations and performance drift—is a major constraint on GenAI adoption.
Recurring identification of technical reliability issues (hallucinations, performance drift) in the 28 reviewed papers and authors' aggregation of technical risks.
Adoption of GenAI is constrained by multiple interrelated challenges.
Cross-paper synthesis from the systematic review of 28 studies identifying recurring barriers and constraints reported in the literature.
Human judgment is constrained by bounded rationality, cognitive biases, and information-processing limitations.
Cited as established findings from prior research across decision sciences and related fields (extensive literature evidence referenced; no new empirical data in this paper's abstract).
Key implementation challenges include data quality and integration, model interpretability, cybersecurity and privacy, regulatory/compliance uncertainty, skills gaps among accounting professionals, and implementation costs.
Identified by the paper through literature review and practitioner reports; these are presented as recurring barriers rather than quantified with a specific sample.
Many studies on serious-game DSTs are small-scale or experimental, and long-term impact data on behavioral change and emissions outcomes are sparse, limiting generalizability.
Review of the literature summarized in the chapter showing predominance of case studies, prototypes, and short-term evaluations rather than longitudinal or large-sample studies.
Ensuring scientific validity of game models, scaling co-design processes, measuring real-world behavioral change, and aligning incentives (policy/subsidies, markets) are remaining challenges to using serious games for DST uptake.
Chapter discussion of limitations and gaps identified in the reviewed literature; absence or sparsity of long-term validation studies and large-scale co-design implementations documented in existing research.
Current uptake of DSTs for net zero remains limited because of issues of trust, usability, lack of evidence linking actions to farm profitability, and poor integration into farmer workflows.
Literature synthesis, qualitative interviews and surveys, case studies documenting low adoption and barriers; multiple practice reports and studies cited in the chapter. Many studies report limited or uneven adoption across contexts.
Using LLM participants without rigorous validation can bias external validity and causal inference in economic research.
Review documents cognitive misalignments and distortions that can bias estimated behaviors, preferences, or treatment effects; authors highlight this as a risk.
Overfitting/contamination: LLMs can reproduce pre-training or fine-tuning data (stochastic parroting) and leak training-set content into outputs.
Multiple reviewed studies documenting examples of content reproduction and data leakage; categorized as overfitting/contamination in the review.
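A minimal sketch of one common leakage check implied by this category, assuming access to candidate source texts: flag outputs that share long verbatim n-grams with them (the n-gram length and strings below are hypothetical):

```python
# Sketch: crude contamination check via verbatim n-gram overlap between a
# model output and a candidate training/source text. Hypothetical inputs.
def ngrams(text: str, n: int = 8) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 8) -> float:
    out = ngrams(output, n)
    return len(out & ngrams(source, n)) / len(out) if out else 0.0

# A ratio near 1.0 suggests the output reproduces source content verbatim.
print(overlap_ratio("the quick brown fox jumps over the lazy dog today",
                    "he said the quick brown fox jumps over the lazy dog today"))
```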
Misleading believability: LLM outputs may look plausible but be incorrect or unrepresentative, risking overconfidence in synthetic data.
Reported instances in the literature and organized failure taxonomy describing plausible-looking but inaccurate synthetic responses.
Distortions: LLM outputs can exhibit systematic biases relative to target human distributions.
Empirical findings across reviewed studies showing output distributions from LLMs that deviate from human sample distributions; aggregated in the distortions failure category.
Cognitive misalignments: LLMs differ from humans in reasoning, goals, and bounded rationality, which can alter behavior in economic and strategic tasks.
Multiple studies in the review reported systematic differences in reasoning and goal-directed behavior when comparing LLM outputs to human participants; coded under the cognitive misalignment category.
Major failure modes limiting synthetic participants as direct substitutes for humans are: cognitive misalignments, distortions, misleading believability, and overfitting/contamination.
Standardized taxonomy developed by coding the 182 studies into generalizable indicators and organizing failure types into four categories.
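A minimal sketch of how one of these failure modes, distortions, can be quantified, assuming categorical response distributions for matched human and LLM samples; the category labels and probabilities below are hypothetical:

```python
# Sketch: total variation distance between an LLM's response distribution
# and a human benchmark over the same answer categories. Hypothetical data.
human = {"cooperate": 0.55, "defect": 0.30, "abstain": 0.15}
llm = {"cooperate": 0.80, "defect": 0.15, "abstain": 0.05}

tv = 0.5 * sum(abs(human[k] - llm[k]) for k in human)
print(f"TV distance: {tv:.2f}")  # 0 = identical distributions, 1 = disjoint
```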
No evaluated program reported Kirkpatrick‑Barr level‑4 outcomes (organizational change, patient outcomes, or sustained metacognitive mastery).
Reviewers mapped reported outcomes from all 27 included programs and found none that demonstrated organizational-level impacts or patient‑level outcomes (level 4).
Because the design is cross-sectional and sampling purposive/geographically constrained, causal inference and generalizability are limited.
Authors' stated limitations in the summary: cross-sectional design and purposive, geographically constrained sample (Karnataka, India).
Workplace stress is associated with lower employee retention.
PLS-SEM analysis on a cross-sectional survey of N = 350 pharmaceutical workers in Karnataka, India (purposive sampling). Reported direct path: Stress → Retention, β = 0.321, p < 0.001. (Note: the paper interprets this as stress reducing retention; sign/coding conventions of the variables are not detailed in the summary.)
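For readers unfamiliar with the convention, β here is a standardized path coefficient. A minimal illustration of a standardized coefficient on hypothetical data; note this is plain OLS on z-scores, not the paper's PLS-SEM estimation:

```python
# Sketch: standardized coefficient (beta) from z-scored variables.
# Hypothetical data; the study itself estimates paths via PLS-SEM.
import numpy as np

rng = np.random.default_rng(1)
stress = rng.normal(size=350)
retention = 0.32 * stress + rng.normal(scale=0.95, size=350)

z = lambda v: (v - v.mean()) / v.std()
beta = np.polyfit(z(stress), z(retention), 1)[0]  # slope on z-scores
print(f"standardized beta: {beta:.3f}")
```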
If deployed without mitigation, GenAI CDS risks widening disparities by performing worse on underrepresented groups or being unequally distributed across resource-rich versus resource-poor settings.
Fairness literature, subgroup performance concerns, and distributional risk analysis cited in the paper; direct empirical demonstrations of widened disparities due to GenAI CDS are limited in the literature per the paper.
Limited public datasets and vendor lock-in constrain independent reproducible evaluations and audits of current generative models in healthcare.
Observation and policy analysis in the paper noting scarcity of public clinical datasets for state-of-the-art models and proprietary constraints; no dataset counts provided.
GenAI CDS creates data privacy and security risks because of high-value medical data and use of external cloud services.
Known cybersecurity risks and documented incidents in health IT; the paper cites the general risk context rather than specific breach sample counts tied to GenAI deployments.
GenAI CDS can amplify bias and inequities if training data underrepresent groups or reflect historical disparities.
Fairness and robustness audit literature and subgroup performance analyses referenced in the paper; specific empirical demonstrations for contemporary GenAI CDS are limited and sample sizes not given.
GenAI CDS systems hallucinate and can produce incorrect but plausible recommendations, which can cause patient harm if trusted unchecked.
Documented failure modes of generative models and examples from controlled evaluations; the paper references known hallucination behavior from model audits and case reports, though it does not quantify incidence rates or provide large-scale observational harm data.
Inequities in climate-AI systems appear across three development phases—Inputs, Process, and Outputs—creating multiple failure points where Global North advantages propagate into final products.
Conceptual framework developed from cross-disciplinary synthesis, literature review, and illustrative examples (Inputs → Process → Outputs mapping).
Foundation-model development and high-performance computing (HPC) capacity are overwhelmingly located in the Global North.
Descriptive mapping of global HPC infrastructure and foundation-model authorship described in the paper (infrastructure mapping and authorship analysis). No single quantitative sample size reported; evidence based on spatial mapping and documented locations of compute centers and model-development institutions.
On the 22 postdating (contamination-free) incidents, no agent achieved end-to-end exploitation success in any of the 110 agent–incident pairs evaluated.
Empirical evaluation of 110 agent–incident pairs reported in the study (end-to-end exploit attempts on the 22 incidents).
The original EVMbench had a data contamination risk because it relied on audit-contest data published before every evaluated model's release, which could have been seen during model training.
Timing relationship between the audit-contest dataset used by EVMbench and the release dates of evaluated models (dataset predated model releases).
The original EVMbench evaluation was narrow: it evaluated 14 agent configurations and most models were tested only with their vendor-provided scaffold.
Description of the original EVMbench experimental setup (number of agent configurations and scaffold usage) cited in this study.
There is a risk that NFD will overfit to individual practices and lead to privacy/IP leakage if crystallization is not carefully governed.
Limitations and risk analysis in the paper; conceptual argument and case study discussion raising privacy/IP concerns. No empirical incidence rates provided.
NFD requires sustained practitioner engagement and incentive alignment to be effective.
Limitations and discussion sections of the paper explicitly state this requirement; logical inference from method (human-in-the-loop commercialization and continual crystallization).
Limitations of the study include reliance on self-reported perceptions (subject to response and survivorship bias), lack of experimental/causal identification, potential non-representative sample, and cross-sectional design limiting inference about long-term productivity effects.
Authors' stated limitations in the paper summary.
Standard RLHF expected-cost constraints ignore distributional shape and can fail under heavy tails or rare catastrophic events.
Analytic/motivating argument presented in the paper contrasting expectation-based constraints with distributional behavior; illustrative examples and discussion of heavy-tailed/rare-event failure modes (no sample-size or dataset details provided in the summary).
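To make the contrast concrete (a hedged sketch; the paper's own formulation may differ): an expected-cost constraint bounds only the mean of the cost $C$, while a tail-sensitive alternative such as conditional value-at-risk also constrains the upper tail:

```latex
% Expectation constraint: blind to tail shape.
\mathbb{E}[C] \le \beta
% A heavy-tailed C can satisfy this while rare catastrophic costs remain.
% Tail-sensitive alternative, e.g. CVaR at level alpha:
\mathrm{CVaR}_\alpha(C)
  = \mathbb{E}\bigl[\, C \mid C \ge \mathrm{VaR}_\alpha(C) \,\bigr]
  \le \beta
```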