The Commonplace

A cross-model audit of 70+ LLMs on ~26,000 real queries finds outputs have become strikingly similar, creating an 'Artificial Hivemind' that erodes stylistic and cognitive diversity; firms must invest in governance, process redesign and multi-model strategies to avoid correlated errors and loss of creative value.

The Artificial Hivemind: Rethinking Work Design and Leadership in the Age of Homogenized AI
Jonathan H. Westover · Fetched March 12, 2026 · Human Capital Leadership Review
Source: Semantic Scholar · Paper type: descriptive · Evidence: medium · Relevance: 7/10
Across ~26,000 real-world queries and 70+ language models, the paper documents strong intra-model repetition and striking inter-model convergence — an 'Artificial Hivemind' — and shows that reward models and automated evaluations favor consensus outputs, reducing stylistic and cognitive diversity with organizational risks for creativity and correlated errors.

This article examines the organizational implications of behavioral homogeneity in large language models (LLMs), a phenomenon we term the "Artificial Hivemind." Drawing on a comprehensive analysis of 26,000 real-world user queries and 70+ language models, we reveal that contemporary AI systems exhibit pronounced intra-model repetition and inter-model convergence, generating strikingly similar outputs despite variations in architecture, training, and scale. From an organizational leadership and work design perspective, this convergence poses critical challenges: the erosion of creative diversity in AI-assisted workflows, the potential amplification of groupthink in decision-making processes, and misalignment between organizational needs for pluralistic solutions and AI capabilities. We introduce evidence-based organizational responses spanning leadership communication strategies, work redesign initiatives, and governance frameworks. Our findings demonstrate that current reward models and AI evaluation systems are miscalibrated to human preferences when responses exhibit comparable quality but divergent styles—a critical gap for organizations deploying AI at scale. This research provides practitioners with actionable frameworks for diagnosing AI homogenization in their workflows, redesigning roles to preserve human creativity, and building governance structures that promote cognitive diversity rather than algorithmic conformity.

Summary

Main Finding

Contemporary LLMs display strong intra-model repetition and inter-model convergence — an "Artificial Hivemind" — producing highly similar outputs across different architectures, training regimes, and scales. This homogenization erodes creative diversity in AI-assisted work, risks amplifying groupthink in organizational decision-making, and reveals miscalibration in reward models and evaluation systems that prefer consensus-style outputs even when stylistically diverse alternatives are equally high-quality.

Key Points

  • Evidence of two related phenomena:
    • Intra-model repetition: single models often produce repetitive, low-diversity responses across similar prompts.
    • Inter-model convergence: different models (70+ examined) frequently generate strikingly similar outputs for the same real-world queries.
  • Dataset and scale: analysis based on ~26,000 real-world user queries and outputs from 70+ language models.
  • Evaluation gap: current reward models and automated evaluation metrics are biased toward consensus/high-probability responses, misaligning machine incentives with organizational needs for stylistic and cognitive diversity.
  • Organizational risks:
    • Reduced creativity and solution variety in AI-augmented workflows.
    • Increased susceptibility to groupthink and correlated errors across teams using different models.
    • Misfit between organizational decision problems that require pluralistic perspectives and AI systems optimized for narrow consensus.
  • Organizational responses proposed:
    • Leadership communication strategies to value and solicit diverse human/AI perspectives.
    • Work redesign: roles and processes to preserve human creativity (e.g., contrarian roles, ensemble-based workflows, mandated diversity checks).
    • Governance frameworks: auditing for homogenization, calibration of evaluation/reward systems to reward diversity, procurement policies avoiding monoculture.
  • Practical tools: diagnostic frameworks and metrics (e.g., inter-model similarity, response entropy) for detecting and tracking AI homogenization in workflows.
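The inter-model similarity diagnostic listed above can be sketched in a few lines. The snippet below is a minimal illustration only (model names and answers are hypothetical, and the paper's actual measure may use embedding-based semantic similarity rather than word overlap):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def inter_model_similarity(responses: dict[str, str]) -> float:
    """Mean pairwise similarity across models' answers to one query.

    `responses` maps a model name to its output for the same prompt.
    Values near 1.0 indicate hivemind-style convergence.
    """
    pairs = list(combinations(responses.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical audit data: three models answering the same query.
answers = {
    "model_a": "remote work improves focus and reduces commute time",
    "model_b": "remote work improves focus and reduces commute stress",
    "model_c": "hybrid schedules balance collaboration with deep work",
}
print(inter_model_similarity(answers))
```

Tracked over a sample of queries, a rising mean similarity would flag growing convergence across the model set.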

Data & Methods

  • Data: ≈26,000 real-world user queries paired with outputs from over 70 distinct language models spanning different providers, architectures, and scales.
  • Analyses reported:
    • Quantitative similarity measures (semantic/textual similarity, clustering) to assess intra- and inter-model output overlap.
    • Diversity metrics (e.g., entropy, distinct-n style measures) to quantify repetition and variability.
    • Human preference assessments demonstrating miscalibration: when outputs are similar in judged quality but differ in style, reward models and automated evaluations favor the consensus-style outputs.
    • Comparative evaluation of models across prompt types and task contexts to identify where homogenization is most pronounced.
  • Note: the paper synthesizes empirical measurement with organizational-design recommendations; exact statistical methods, thresholds, and model lists are detailed in the article.
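The diversity metrics named above (entropy, distinct-n) have standard formulations; the sketch below shows one common variant, not necessarily the paper's exact implementation:

```python
import math
from collections import Counter

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of n-grams that are unique across a set of responses.
    Low values signal intra-model repetition."""
    ngrams = []
    for t in texts:
        toks = t.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def token_entropy(texts: list[str]) -> float:
    """Shannon entropy (bits) of the pooled token distribution.
    Lower entropy means the responses reuse a narrower vocabulary."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Applied to many responses from one model on similar prompts, these quantify intra-model repetition; applied across models, they complement pairwise similarity scores.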

Implications for AI Economics

  • Productivity and complementarities:
    • Homogenized AI outputs reduce the value of AI as a source of varied cognitive complements to human labor, potentially lowering productivity gains from human–AI collaboration in tasks needing creativity and exploration.
    • Firms that can design processes to preserve human diversity (and to elicit diverse AI outputs) may capture greater productivity gains, increasing returns to organizational capability rather than to raw model access.
  • Market structure and vendor dynamics:
    • Inter-model convergence undermines product differentiation across AI providers, increasing price competition on marginal features (latency, cost) and shifting competitive advantage to governance, integration, and workflow design services.
    • Reduced differentiation could accelerate commoditization of base LLM outputs but open a market for value-adds (diversity-promoting tools, ensemble services, customization for non-conformity).
  • Labor demand and skills:
    • Greater premium on human skills that preserve or generate diversity: contrarian reasoning, editorial curation, prompt engineering focused on diversity, and governance roles (AI auditors, diversity officers).
    • Routine augmentation tasks that rely on consensus outputs may be more easily automated, while tasks requiring pluralistic solutions remain human-intensive.
  • Investment and governance costs:
    • Organizations will incur additional costs to audit, procure, and govern LLM use (diversity audits, recalibrating reward models, multi-model infrastructures), shifting some economic benefits of AI toward governance and integration spending.
  • Policy and standards:
    • Findings motivate regulatory attention to systemic risks from algorithmic homogenization (e.g., correlated errors in critical systems) and potential standards for measuring and disclosing model diversity characteristics.
  • Evaluation economics:
    • Miscalibrated reward/evaluation systems can lead firms to prefer models that maximize apparent evaluation scores but minimize useful diversity, producing allocative inefficiencies in model choice and adoption decisions.

Actionable takeaway for economists and practitioners: measure inter-model similarity and response diversity as part of ROI and procurement analyses; factor in governance and role redesign costs when estimating net returns to LLM deployment; and treat organizational capability to sustain cognitive diversity as a critical, economically valuable complement to model access.
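One way to operationalize the correlated-error concern in a procurement audit is to measure how often multiple models fail on the same queries. The sketch below assumes a hypothetical audit format (per-model boolean error vectors over a shared query set) and is not a method from the paper:

```python
def correlated_error_rate(errors: dict[str, list[bool]]) -> float:
    """Share of queries on which two or more models err simultaneously,
    among queries where any model errs. High values mean a multi-model
    strategy buys little redundancy (a 'hivemind' failure mode)."""
    per_query = list(zip(*errors.values()))  # one tuple per query
    any_err = [q for q in per_query if any(q)]
    if not any_err:
        return 0.0
    return sum(1 for q in any_err if sum(q) >= 2) / len(any_err)

# Hypothetical audit: True = model answered query incorrectly.
audit = {
    "model_a": [True, False, True, False],
    "model_b": [True, False, False, False],
    "model_c": [True, True, False, False],
}
print(correlated_error_rate(audit))
```

A rate near 1.0 would indicate that adding models does not diversify away failure risk, which is exactly the scenario the multi-model procurement guidance is meant to catch.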

Assessment

Paper Type: descriptive
Evidence Strength: medium — The paper presents broad, systematic empirical evidence of intra- and inter-model similarity using a large cross-section of models (70+) and ~26,000 real-world queries, plus human preference checks showing evaluator/reward-model miscalibration; however, it is descriptive (observational) and does not establish causal effects of homogenization on firm-level productivity or labor outcomes, and some key details (sampling frame, representativeness, statistical thresholds) are not fully specified.
Methods Rigor: medium — Uses standard quantitative similarity and diversity metrics and complementary human evaluations across many models and prompts, which is appropriate for the research question; but the paper appears to lack a causal identification strategy, the selection of queries and models may introduce bias, and the write-up (as summarized) does not report robustness checks, pre-registration, or detailed inference procedures that would raise the rigor to high.
Sample: Approximately 26,000 real-world user queries paired with outputs from over 70 distinct language models spanning multiple providers, architectures, and scales; analyses include automated similarity/diversity metrics (semantic/textual similarity, entropy, distinct-n) and human preference/evaluation comparisons between high-probability consensus-style outputs and stylistically diverse alternatives.
Themes: human_ai_collab, productivity, org_design, governance, adoption
Generalizability:
  • Query sample may be non-representative (source and selection biases of the ~26k queries not fully described).
  • Model set, while large (70+), may over- or under-represent key provider families, finetuned variants, or on-prem/custom models.
  • Results depend on prompt phrasing, system messages, decoding/temperature settings, and API defaults, which vary across deployments.
  • Temporal generalizability is limited: models and reward/evaluation systems evolve rapidly, so homogenization levels may change over time.
  • Findings on output similarity do not directly translate into measured economic outcomes (productivity, wages) without additional causal evidence.

Claims (17)

Each entry lists the claim followed by its outcome category, direction, confidence, outcome measure, and (where reported) sample size:

  1. Contemporary LLMs display strong intra-model repetition (single models often produce repetitive, low-diversity responses across similar prompts).
     Outcome: Creativity · Direction: positive · Confidence: medium · Measure: within-model response diversity (entropy, distinct-n, repetition rates) · n=26,000 · 0.11
  2. Contemporary LLMs show inter-model convergence — different models frequently generate highly similar outputs for the same real-world queries.
     Outcome: Creativity · Direction: positive · Confidence: medium · Measure: inter-model output similarity (semantic/textual similarity scores, clustering overlap) · n=26,000 · 0.11
  3. The analysis dataset comprises approximately 26,000 real-world user queries paired with outputs from over 70 distinct language models spanning different providers, architectures, and scales.
     Outcome: Other · Direction: positive · Confidence: high · Measure: dataset size and model count · n=26,000 · 0.18
  4. Current reward models and automated evaluation metrics are biased toward consensus/high-probability responses, preferring consensus-style outputs even when stylistically diverse alternatives are judged equally high-quality by humans.
     Outcome: Decision Quality · Direction: negative · Confidence: medium · Measure: alignment between reward-model/automated evaluation scores and human quality judgments (bias toward consensus) · 0.11
  5. Homogenization of LLM outputs erodes creative diversity in AI-assisted work and reduces the variety of solutions produced.
     Outcome: Creativity · Direction: negative · Confidence: medium · Measure: creative diversity / number of distinct solution variants produced · 0.11
  6. Homogenized outputs increase organizational susceptibility to groupthink and correlated errors across teams using different models.
     Outcome: Error Rate · Direction: negative · Confidence: medium · Measure: risk of correlated errors / susceptibility to groupthink (conceptual risk inferred from output correlation) · 0.11
  7. Reward-model and evaluation miscalibration can cause organizations to prefer models that maximize apparent evaluation scores at the expense of useful stylistic or cognitive diversity.
     Outcome: Adoption Rate · Direction: negative · Confidence: medium · Measure: model selection bias driven by automated evaluation scores; reduction in diversity as a side-effect of evaluation-driven selection · 0.11
  8. Organizational responses to homogenization include leadership communication strategies, work redesign (contrarian roles, ensemble workflows, mandated diversity checks), and governance frameworks (auditing, procurement policies avoiding monoculture).
     Outcome: Organizational Efficiency · Direction: positive · Confidence: high · Measure: proposed organizational interventions to preserve cognitive and stylistic diversity · 0.18
  9. The paper provides practical diagnostic tools and metrics (e.g., inter-model similarity, response entropy) for detecting and tracking AI homogenization in workflows.
     Outcome: Organizational Efficiency · Direction: positive · Confidence: high · Measure: operational diagnostic metrics (inter-model similarity, entropy, distinct-n) · 0.18
  10. Homogenized AI outputs reduce the value of AI as a source of varied cognitive complements to human labor, potentially lowering productivity gains from human–AI collaboration in tasks requiring creativity and exploration.
      Outcome: Firm Productivity · Direction: negative · Confidence: medium · Measure: productivity gains from human–AI collaboration (theoretical implication inferred from diversity loss) · 0.11
  11. Firms that design processes to preserve human diversity and elicit diverse AI outputs may capture greater productivity gains, increasing returns to organizational capability rather than to raw model access.
      Outcome: Firm Productivity · Direction: positive · Confidence: low · Measure: firm-level productivity or returns to organizational capability versus model access · 0.05
  12. Inter-model convergence undermines product differentiation across AI providers and could accelerate commoditization of base LLM outputs.
      Outcome: Market Structure · Direction: negative · Confidence: medium · Measure: vendor product differentiation / commoditization of base outputs · 0.11
  13. Reduced differentiation opens market opportunities for value-add services (diversity-promoting tools, ensemble services, customization for non-conformity) and shifts competitive advantage toward governance and workflow integration.
      Outcome: Firm Revenue · Direction: mixed · Confidence: low · Measure: market demand for value-added services and governance/integration capabilities · 0.05
  14. Labor demand will shift toward skills that preserve or generate diversity (contrarian reasoning, editorial curation, diversity-focused prompt engineering, AI auditors), while routine augmentation tasks that rely on consensus outputs may be more easily automated.
      Outcome: Skill Acquisition · Direction: mixed · Confidence: low · Measure: demand for specific human skills and automation of routine consensus-based tasks · 0.05
  15. Organizations will incur additional governance and procurement costs (diversity audits, recalibration of reward models, multi-model infrastructures) to mitigate homogenization, shifting some economic benefits of AI toward governance spending.
      Outcome: Organizational Efficiency · Direction: negative · Confidence: medium · Measure: governance and procurement costs associated with LLM deployment · 0.11
  16. The findings motivate regulatory attention to systemic risks from algorithmic homogenization (e.g., correlated errors in critical systems) and potential standards for measuring and disclosing model diversity characteristics.
      Outcome: Governance And Regulation · Direction: positive · Confidence: medium · Measure: regulatory action / disclosure standards regarding model diversity · 0.11
  17. Actionable takeaway: organizations should measure inter-model similarity and response diversity as part of ROI and procurement analyses and factor in governance and role-redesign costs when estimating net returns to LLM deployment.
      Outcome: Adoption Rate · Direction: positive · Confidence: high · Measure: inclusion of diversity metrics and governance cost estimates in ROI/procurement decisions · 0.18

Entities

  • Outcomes: Artificial Hivemind; Inter-model convergence; Intra-model repetition; AI homogenization; Reduced creativity in AI-augmented work; Amplified groupthink / correlated errors
  • Dataset: Dataset of ~26,000 real-world user queries with paired model outputs
  • AI tools: 70+ distinct language models (diverse providers, architectures, scales); Large Language Models (LLMs); Reward models
  • Methods: Automated evaluation metrics; Inter-model similarity metric; Response entropy (diversity metric); Semantic/textual similarity measures; Clustering analysis; Diversity metrics (entropy; distinct-n); Human preference assessments (human evaluations); Ensemble-based workflows; Leadership communication strategies to solicit diverse perspectives; Work redesign (contrarian roles; mandated diversity checks); Governance frameworks for AI homogenization (auditing; procurement policies)
  • Populations: Real-world users submitting queries; Organizational decision-makers and teams
  • Institutions: Firms (organizations adopting LLMs); AI providers (model vendors)
