The Commonplace

Evidence (5157 claims)

Adoption: 7395 claims
Productivity: 6507 claims
Governance: 5877 claims
Human-AI Collaboration: 5157 claims
Innovation: 3492 claims
Org Design: 3470 claims
Labor Markets: 3224 claims
Skills & Training: 2608 claims
Inequality: 1835 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 609 159 77 736 1615
Governance & Regulation 664 329 160 99 1273
Organizational Efficiency 624 143 105 70 949
Technology Adoption Rate 502 176 98 78 861
Research Productivity 348 109 48 322 836
Output Quality 391 120 44 40 595
Firm Productivity 385 46 85 17 539
Decision Quality 275 143 62 34 521
AI Safety & Ethics 183 241 59 30 517
Market Structure 152 154 109 20 440
Task Allocation 158 50 56 26 295
Innovation Output 178 23 38 17 257
Skill Acquisition 137 52 50 13 252
Fiscal & Macroeconomic 120 64 38 23 252
Employment Level 93 46 96 12 249
Firm Revenue 130 43 26 3 202
Consumer Welfare 99 51 40 11 201
Inequality Measures 36 105 40 6 187
Task Completion Time 134 18 6 5 163
Worker Satisfaction 79 54 16 11 160
Error Rate 64 78 8 1 151
Regulatory Compliance 69 64 14 3 150
Training Effectiveness 81 15 13 18 129
Wages & Compensation 70 25 22 6 123
Team Performance 74 16 21 9 121
Automation Exposure 41 48 19 9 120
Job Displacement 11 71 16 1 99
Developer Productivity 71 14 9 3 98
Hiring & Recruitment 49 7 8 3 67
Social Protection 26 14 8 2 50
Creative Output 26 14 6 2 49
Skill Obsolescence 5 37 5 1 48
Labor Share of Income 12 13 12 0 37
Worker Turnover 11 12 3 0 26
Industry 1 1
Filter: Human-AI Collaboration
Science has repeatedly delegated its bottlenecks to machines—first inference, then search, then measurement, then the full workflow—and each delegation solves one problem while exposing a harder one underneath.
Interpretive historical argument drawing on examples across AI-for-science milestones (e.g., DENDRAL, search and inference systems, measurement automation, and contemporary end-to-end workflows). No quantitative sample or experimental method reported.
high mixed A Brief History of AI for Scientific Discovery: Open Researc... pattern of delegation and emergent bottlenecks in research workflows
Testing revealed AI excels at computational tasks but consistently misses nuanced factors like new construction rent premiums and infrastructure proximity impacts, validating the framework's hybrid structure as essential for professional-grade underwriting.
Findings from the controlled ChatGPT-4 test on the single 150-unit scenario: qualitative and comparative observations showing AI handled computations well but failed to capture specific local-market nuances, leading authors to endorse a hybrid human-AI framework.
Phase Two requires human-led professional validation to correct AI limitations, apply local market knowledge, and integrate risk factors.
Framework description supported by observations from the controlled test where human review was used to correct AI outputs and apply local knowledge (e.g., adjusting for nuanced market factors).
AI assistance in safety engineering is fundamentally a collaboration design problem rather than merely a software procurement decision: the same tool can either degrade or improve analysis quality depending entirely on how it is used.
Synthesis of the formal framework and analytic results in the paper (theoretical argument; no empirical sample reported).
The paper concludes by discussing open challenges in evaluating harmful manipulation by AI models.
Paper includes a discussion/conclusion section enumerating open challenges; stated in abstract.
high mixed Evaluating Language Models for Harmful Manipulation identification of open research and evaluation challenges
We identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others.
Empirical comparison across three locales (US, UK, India) showing statistically significant differences in manipulation outcomes by geography.
high mixed Evaluating Language Models for Harmful Manipulation geographic variation in manipulative behaviour/effects
Context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used.
Comparative analysis across three domains (public policy, finance, health) showing differences in manipulative behaviour and/or impact by domain in the empirical study.
high mixed Evaluating Language Models for Harmful Manipulation variation in manipulative behaviour/effects across use domains
AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions.
Metric comparison across models showing that AUROC_2-based ranking and M-ratio-based ranking are fully inverted in the reported results on the evaluated dataset.
high mixed Do LLMs Know What They Know? Measuring Metacognitive Efficie... model ranking by AUROC_2 versus model ranking by M-ratio
Temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity.
Experimental manipulation (temperature changes) applied to models; reported result that Type-2 criterion shifted with temperature while meta-d' was stable for two models (out of four) in the 224,000-trial dataset.
high mixed Do LLMs Know What They Know? Measuring Metacognitive Efficie... Type-2 criterion (confidence policy) and meta-d' (metacognitive capacity)
Metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics.
Domain-level analyses reported in the paper showing per-domain M-ratio results and identification of different weakest domains per model, contrasted with aggregate metric behavior.
high mixed Do LLMs Know What They Know? Measuring Metacognitive Efficie... domain-specific metacognitive efficiency (M-ratio) across task domains
Metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar — Mistral achieves the highest d' but the lowest M-ratio.
Empirical comparison of Type-1 sensitivity (d') and metacognitive efficiency (M-ratio) across the four evaluated LLMs on the 224,000 QA trials; explicit statement that Mistral had highest d' but lowest M-ratio.
high mixed Do LLMs Know What They Know? Measuring Metacognitive Efficie... Type-1 sensitivity (d') and metacognitive efficiency (M-ratio)
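The d', meta-d', and M-ratio claims above all rest on standard signal-detection quantities. A minimal sketch of those definitions (Type-1 d' from hit and false-alarm rates, M-ratio as meta-d' scaled by d'); note that estimating meta-d' itself requires a model fit not shown here, so it is taken as given, and the numeric inputs below are illustrative, not from the paper:

```python
from statistics import NormalDist

def type1_dprime(hit_rate: float, fa_rate: float) -> float:
    """Type-1 sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def m_ratio(meta_dprime: float, dprime: float) -> float:
    """Metacognitive efficiency: meta-d' relative to Type-1 d'.
    Values below 1 mean confidence tracks accuracy less well than it could."""
    return meta_dprime / dprime

# Toy illustration of the Mistral-style dissociation: strong Type-1
# discrimination paired with low metacognitive efficiency.
d = type1_dprime(0.85, 0.20)   # roughly 1.88
eff = m_ratio(0.6, d)          # well below 1
```

This makes the dissociation in the claims concrete: two models can share the same d' while differing sharply in M-ratio, because the latter depends on how well confidence reports track trial-level accuracy.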
Organizational culture and technological readiness moderate the effectiveness of generative AI integration in decision-making processes.
The paper reports moderation effects tested in the SEM framework using survey data from senior managers, decision-makers, and AI adoption specialists (SmartPLS). No numeric moderator effect sizes or sample size provided in the excerpt.
high mixed The Strategic Impact of Generative Artificial Intelligence o... effectiveness of generative AI integration in decision-making (moderation effect...
Implementation of human-replacing technologies leads to significant transformations in skill demand: it reduces reliance on low-skilled labour while increasing demand for qualified engineers, system operators and specialists in digital technologies.
Sector-specific analysis and review of international labour-market studies cited in the article documenting skill-biased effects of automation and digitalization; qualitative assessment for Ukraine's mining and metallurgical sector under workforce shortage conditions.
high mixed Human-replacing technologies as a driver of labour productiv... skill demand composition (shift from low-skilled to high-skilled roles)
The framework implies threshold effects in training and capability acquisition: when the teaching horizon lies below the prerequisite depth of the target, additional instruction cannot yield successful teaching; once that depth is reached, completion becomes feasible.
Model-derived threshold result described in the abstract (mathematical analysis of prerequisite depth vs. teaching horizon).
high mixed A Mathematical Theory of Understanding feasibility of successful teaching / completion of instruction
The value of information depends on whether downstream users can absorb and act on it: a signal conveys meaning only to a learner with the structural capacity to decode it (an explanation that clarifies a concept for one user may be indistinguishable from noise to another who lacks the relevant prerequisites).
Conceptual argument motivating the model; theoretical reasoning described in the paper's intro/abstract.
high mixed A Mathematical Theory of Understanding ability to interpret instructional signals / effective information transfer
Generative AI serves as an effective 'wingman' for employment lawyers, capable of replacing substantial junior associate work while requiring continued human expertise for client counseling, supervision, and final legal advice preparation.
Authors' synthesis of experimental results showing AI-produced substantive analysis plus discussion about remaining limitations (e.g., citation errors) and required human oversight; qualitative assertion about substitutability for junior associate tasks.
high mixed Robot Wingman: Using AI to Assess an Employment Termination potential replacement of junior associate tasks and required human oversight
PPS gains are task-dependent: gains are large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning tasks.
Task-level analysis across the three domains (business, technical, travel) within the controlled study (60 tasks total); authors report differential performance patterns by domain/ambiguity.
high mixed Evaluating 5W3H Structured Prompting for Intent Alignment in... relative_performance_by_task_domain (PPS vs baselines)
AI usage has dual effects on employees: it can both enhance innovative behavior and predict disengagement, as revealed by a dual-path (SOR-based) model.
Interpretation/synthesis from the four-stage longitudinal study of 285 finance professionals using a dual-path model based on SOR theory (combining the mediation and moderation results).
high mixed Autonomous enhancement or emotional depletion? The dual-path... innovative work behavior and work disengagement behavior (dual outcomes)
We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap.
Experimental evaluation reported in the paper: authors state they ran experiments on 14 different large language models, under zero-shot and retrieval-augmented configurations, and observed differing performance across models.
high mixed FinTradeBench: A Financial Reasoning Benchmark for LLMs model performance on financial reasoning benchmark (accuracy/score across models...
Artificial intelligence embedded in human decision-making can either enhance human reasoning or induce excessive cognitive dependence.
Stated as a conceptual claim in the paper's introduction/abstract; supported by the paper's conceptual framing (theoretical argument), no empirical sample or experimental data reported here.
high mixed Cognitive Amplification vs Cognitive Delegation in Human-AI ... human reasoning quality / cognitive dependence
These productivity gains are most pronounced for lower-skilled workers, producing a pattern the authors call “skill compression.”
Cross-study pattern reported in the literature review: comparative evidence across worker-skill strata in multiple empirical papers showing larger relative gains for lower-skilled/junior workers; specific underlying studies and sample sizes are not enumerated in the brief.
high mixed AI, Productivity, and Labor Markets: A Review of the Empiric... relative productivity/gains by worker skill level (leading to 'skill compression...
Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt.
Controlled experiment described in the paper: 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art LLMs under five prompt framing conditions.
high mixed Measuring and Exploiting Confirmation Bias in LLM-Assisted S... confirmation bias as measured by vulnerability detection performance
These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science.
Interpretation based on competition results where AI-only baselines underperformed relative to many participant teams and top solutions used human-AI collaboration.
high mixed AgentDS Technical Report: Benchmarking the Future of Human-A... implications for automation vs. human expertise
These findings indicate a misalignment between the perceived benefit of AI writing and an implicit, consistent effect on the semantics of human writing, with potential implications for cultural and scientific institutions.
Synthesis and interpretation of the paper's empirical results (user study, essay revision experiments, and peer-review analysis); presented as the paper's broader conclusion.
high mixed How LLMs Distort Our Written Language alignment between perceived benefits and actual semantic effects of AI writing; ...
The paper formalizes the distinction using a signal-aggregation model in which an organization maintains an anchor belief and achieves agreement through two exclusion channels: (1) report shrinkage toward the anchor and (2) a tolerance rule that discards reports deviating beyond a threshold.
Analytical formal model presented in the paper specifying an anchor belief and two exclusion mechanisms; model assumptions and mechanisms are explicit in the theoretical development. No empirical sample.
high mixed Cohesion as Concentration: Exclusion-Driven Fragility in Fin... mechanisms producing agreement (report shrinkage, tolerance-based discarding)
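The two exclusion channels named above can be sketched in a few lines. This is an illustrative toy, not the paper's formal model: the shrinkage rule, tolerance test, and pooling step are assumptions chosen only to show how the two channels interact:

```python
def aggregate(anchor: float, reports: list[float],
              shrink: float = 0.5, tolerance: float = 1.0) -> float:
    """Toy two-channel exclusion: (1) shrink each report toward the anchor,
    (2) discard shrunk reports deviating from it beyond `tolerance`,
    then pool the anchor with the survivors."""
    shrunk = [anchor + (1 - shrink) * (r - anchor) for r in reports]
    kept = [s for s in shrunk if abs(s - anchor) <= tolerance]
    pooled = [anchor] + kept
    return sum(pooled) / len(pooled)

# The dissenting report (10.0) is first shrunk to 5.0, then discarded by the
# tolerance rule, so the pooled belief stays near the anchor.
belief = aggregate(anchor=0.0, reports=[0.4, -0.6, 10.0])
```

The point of the sketch is that observed agreement (a pooled belief close to the anchor) says nothing about whether dissenting information was integrated or excluded, which is exactly the ambiguity the paper's model formalizes.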
Organizational cohesion is observationally ambiguous: it can arise either from genuine information integration (debate and synthesis of heterogeneous inputs) or from exclusionary processes (conformity pressure, gatekeeping, intolerance of dissent).
Conceptual argument and formal definition in the paper framing; supported by the analytic distinction introduced in the paper between integration and exclusion as alternative generative mechanisms for observed agreement. No empirical sample—argument is theoretical and illustrated by model construction.
high mixed Cohesion as Concentration: Exclusion-Driven Fragility in Fin... source of observed cohesion (integration versus exclusion)
The authors identify ten evaluation practices that teams use, ranging from lightweight interpretive checks to formal organizational processes (examples: qualitative user reviews, red-team testing, A/B experiments, telemetry/log analysis, structured annotation, governance/meta-evaluation).
Thematic coding of 19 interview transcripts produced a taxonomy enumerating ten practices (paper reports the taxonomy as an outcome).
high mixed Results-Actionability Gap: Understanding How Practitioners E... taxonomy/count and description of evaluation practices
The net educational value of AI-generated feedback depends on alignment with pedagogical goals, quality evaluation, integration with human teaching, and governance to manage equity, privacy, and incentives.
Synthesis statement from the meeting report produced by 50 interdisciplinary scholars; conceptual judgment rather than empirical proof.
high mixed The Future of Feedback: How Can AI Help Transform Feedback t... net educational value (composite of learning outcomes, equity metrics, privacy c...
Convergence after exemplar exposure occurred by both tightening of estimates within a measure family and by agents switching measure families.
Agent-level tracking across stages showed two patterns following exemplar exposure: (1) reduced within-family dispersion (tighter estimates) and (2) categorical switches in measure selection by some agents, as recorded across the 150-agent sample.
high mixed Nonstandard Errors in AI Agents within-family dispersion (IQR) and measure-family switching frequency (binary/ca...
LLMs excel at extracting and generating arguments from unstructured text but are opaque and hard to evaluate or trust.
Synthesis of recent LLM literature and observed properties (generation capability vs. opacity); no empirical evaluation within this paper.
high mixed Argumentative Human-AI Decision-Making: Toward AI Agents Tha... argument extraction/generation performance and model interpretability/trustworth...
The paper is primarily theoretical and historical; empirical validation is needed to quantify the irreducible component of LLM value, and practical degrees of rule‑extractability may exist even if some capabilities remain tacit.
Stated limitations section acknowledging the theoretical nature of the work and the need for empirical follow‑up.
high mixed Why the Valuable Capabilities of LLMs Are Precisely the Unex... need for empirical validation and degree of rule‑extractability of LLM capabilit...
If an LLM's full capability were reducible to an explicit rule set, that rule set would be an expert system; because expert systems are empirically and historically weaker than LLMs, this leads to a contradiction (supporting non‑rule‑encodability).
Logical proof‑by‑contradiction presented in the paper, supported by conceptual mapping between rule sets and expert systems and qualitative historical comparisons.
high mixed Why the Valuable Capabilities of LLMs Are Precisely the Unex... logical consistency of the reducibility-to-rules claim (validity of the contradi...
Teamwork partner type moderates the effect of service empathy on collaboration proficiency (i.e., the impact of service empathy on proficiency differs by human vs AI partner).
Reported interaction/moderated-mediation analyses from the online experiment (n = 861) indicating a significant partner-type × service-empathy interaction predicting collaboration proficiency.
Employees' emotional state significantly moderates the relationship between partner type (human vs AI) and collaboration proficiency.
Moderation analyses reported from the same online experimental dataset (n = 861), testing interaction terms between partner type and measured employee emotion on collaboration proficiency; authors report a significant moderating effect.
AI adoption has an inverted U-shaped effect on employee-related corporate social responsibility (ECSR).
Panel regression with quadratic specification (AI and AI^2) showing statistically significant positive coefficient on AI and statistically significant negative coefficient on AI^2; sample of 2,575 Chinese listed firms observed 2013–2023; controls, firm and/or year fixed effects and robustness checks reported.
high mixed Attention to Whom? AI Adoption and Corporate Social Responsi... Employee-related corporate social responsibility (ECSR)
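For a quadratic specification of the kind described (positive coefficient on AI, negative on AI squared), the inverted U peaks at x* = -b1 / (2*b2). A minimal sketch with hypothetical coefficients (the values below are illustrative, not the paper's estimates):

```python
def inverted_u_peak(b1: float, b2: float) -> float:
    """Turning point of y = b1*x + b2*x^2 for an inverted U
    (b1 > 0, b2 < 0): x* = -b1 / (2 * b2)."""
    assert b1 > 0 and b2 < 0, "inverted U needs positive linear, negative quadratic term"
    return -b1 / (2 * b2)

# Hypothetical coefficients: ECSR rises with AI adoption up to roughly x* = 4,
# then declines beyond that point.
peak = inverted_u_peak(0.8, -0.1)
```

The same two signs reported in the regression (b1 positive, b2 negative, both significant) are what license reading the estimates as an interior peak rather than a monotone effect.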
Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged.
Measured token usage for agent runs with and without skills, reporting a range from modest token savings up to a 451% token increase with no corresponding change in pass rates.
high mixed SWE-Skills-Bench: Do Agent Skills Actually Help in Real-Worl... token usage/overhead (percent change) and its relation to pass rates
The research methodology combines systemic analysis, comparative assessment of international practices, and analytical generalization of organizational learning models, enabling it to capture both structural trends and concrete institutional responses to technological change.
Methodological statement from the paper describing its approach; this is a factual claim about methods used rather than an empirical finding.
high mixed EDUCATIONAL AND PROFESSIONAL STRATEGIES FOR PREPARING HUMAN ... ability to capture structural trends and institutional responses (through the ch...
Model output can be treated as evidence for studying human behavior, but there are important epistemic limits to interpreting model-generated text as direct evidence of human beliefs or social facts.
Epistemic analysis and methodological critique in the paper (discussion of limits of treating model outputs as evidence); no single empirical test cited in the provided text.
high mixed The Third Ambition: Artificial Intelligence and the Science ... validity and limits of using LLM outputs as evidence about human behavior and so...
The validity of human–AI decision-making studies hinges on participants' behaviours; effective incentives can potentially affect these behaviours.
Conclusion from the authors' thematic review and theoretical rationale linking incentive design to participant behaviour and study validity (no quantitative effect sizes provided in excerpt).
high mixed Incentive-Tuning: Understanding and Designing Incentives for... participant behaviour (engagement, effort, strategy) and resulting study validit...
The study's counterfactual analytical model links HR indicators (training intensity, absenteeism, labor productivity, turnover rates, workforce allocation) to organizational performance outcomes using regression-based simulations and predictive estimation.
Methodological claim explicitly stated: model construction from an industrial firm dataset using regression-based simulations and predictive techniques. (Specific sample size, variable operationalizations, and time frame not reported in the description.)
high mixed Artificial Intelligence and Human Resource Management: A Cou... methodological estimate of counterfactual organizational performance outcomes
Helicoid dynamics is a specific failure regime: a system engages competently, drifts into error, accurately names what went wrong, then reproduces the same pattern at a higher level of sophistication, recognizing it is looping and continuing nonetheless.
Definition introduced in the paper and illustrated by the reported case series; the claim is conceptual/phenomenological rather than a statistical result.
high mixed AI Knows What's Wrong But Cannot Fix It: Helicoid Dynamics i... incidence and qualitative characterization of the helicoid pattern in LLM intera...
A minimal linear specification (linearized model) demonstrates how coupling strength, persistence, and dissipation determine local stability and oscillatory regimes through spectral conditions on the Jacobian.
Analytic linear model and local stability analysis in the paper: computation of Jacobian, derivation of spectral conditions (eigenvalue locations) that separate stable/oscillatory regimes; illustrative examples within the paper (no empirical data).
high mixed How Intelligence Emerges: A Minimal Theory of Dynamic Adapti... local stability/oscillatory behavior characterized by Jacobian eigenvalues (spec...
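The spectral conditions invoked above reduce to checking eigenvalues of the Jacobian: negative real parts give local stability, nonzero imaginary parts give oscillation. A toy 2x2 linearization in that spirit; the specific matrix mapping coupling, persistence, and dissipation into the Jacobian is an assumption for illustration, not the paper's model:

```python
import numpy as np

def local_regime(coupling: float, persistence: float, dissipation: float) -> str:
    """Classify the local regime of a toy 2x2 linearization from the
    Jacobian spectrum: stable iff all eigenvalue real parts are negative,
    oscillatory iff any eigenvalue is complex."""
    J = np.array([[-dissipation,  coupling],
                  [-coupling,     persistence - dissipation]])
    eig = np.linalg.eigvals(J)
    stable = bool(np.all(eig.real < 0))
    oscillatory = bool(np.any(np.abs(eig.imag) > 1e-12))
    if stable:
        return "damped oscillation" if oscillatory else "stable node"
    return "sustained/growing oscillation" if oscillatory else "unstable"

# Strong coupling with dissipation yields damped oscillation; weak coupling
# with heavy dissipation yields a stable node.
print(local_regime(2.0, 0.5, 1.0))
print(local_regime(0.1, 1.0, 2.0))
```

In this toy form, stronger coupling relative to persistence pushes the eigenvalues off the real axis (oscillation), while dissipation pulls their real parts negative (stability), mirroring the claim's three-parameter trade-off.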
Distinct AI features (recommendation engines, chatbots, and comparison tools) influence consumer outcomes when modeled as latent constructs.
Methodological claim: the study modeled three AI features as latent constructs and analyzed their relationships with dependent variables using SEM (quantitative questionnaire data).
high mixed Role of artificial intelligence on consumer buying behavior:... influence on consumer trust, perceived decision-making support, and purchase int...
Both time constraints and LLM use significantly alter the characteristics of decision-makers' mental representations.
Results from the 2 × 2 experiment (N = 348) comparing representation-related measures across manipulated conditions; reported statistically significant differences associated with time constraints and with LLM use.
high mixed AI-Augmented Strategic Decision-Making Under Time Constraint... characteristics of mental representations (representation-related measures colle...
We develop a theoretical framework - the productivity funnel - that traces how technological potential narrows through successive stages, from access and digital infrastructure, through organizational absorption and human capital adaptation, to ultimate value capture.
Conceptual/theoretical development presented in the paper; no empirical sample needed (framework-building).
high mixed The complementarity trap: AI adoption and value capture n/a (theoretical framework describing stages leading to value capture)
Effects of curated Skills are highly heterogeneous across domains (e.g., +4.5 pp in Software Engineering vs. +51.9 pp in Healthcare).
Per-domain pass-rate deltas reported in the paper (SkillsBench per-domain analysis). The example domain deltas (+4.5 pp and +51.9 pp) are taken from the reported per-domain results.
high mixed SkillsBench: Benchmarking How Well Agent Skills Work Across ... task pass rate (per-domain average delta)
The study's qualitative and exploratory design limits generalizability; the proposed framework requires quantitative testing and broader samples (practicing architects, firms, cross-cultural contexts).
Explicit limitations stated by authors; study is based on semi-structured interviews with architecture students (N unspecified) and inductive thematic analysis.
high mixed Human–AI Collaboration in Architectural Design Education: To... generalizability / external validity of findings and framework
XChronos reframes transhumanist technology evaluation in experiential terms, creating both market opportunities and measurement/regulatory challenges for AI economics.
Synthesis and concluding argument in the paper summarizing proposed implications; conceptual reasoning without empirical tests.
high mixed XChronos and Conscious Transhumanism: A Philosophical Framew... shift in evaluation criteria toward experiential measures and resultant market/r...
Across 182 reviewed studies, LLM-generated synthetic participants have modest and inconsistent fidelity to human participants.
Systematic review and synthesis of 182 empirical and methodological studies comparing LLM-generated participants to human samples; studies were coded and analyzed for fidelity outcomes.
high mixed Synthetic Participants Generated by Large Language Models: A... fidelity of synthetic participants to human participants (behavioral/response si...
Participant targeting: 44% of programs targeted doctors and 44% targeted medical students (with possible overlap), and 56% targeted entry‑to‑practice career stages.
Participant audience and career-stage data extracted from the 27 included programs; proportions reported in the review.
high mixed Assessing the effectiveness of artificial intelligence educa... target audience (doctors, medical students) and career stage distribution (entry...