The Commonplace

AI judges stray from real rulings, and public opinion pushes them further off course: introducing social-media sentiment into LLM prompts inflates compensation forecasts and destabilizes outcomes, with the largest distortions in disputes involving low-skilled occupations and emotionally charged topics.

LLM Safety in Judicial AI: A Stress Test of Social Media Influence on Real-World Judgments
Yixuan Xie, Yao He, Xiaoyu Yang, Xu Gai, Pan Hui · March 14, 2026 · Proceedings of the AAAI Conference on Artificial Intelligence
Source: OpenAlex · Paper type: quasi-experimental · Evidence strength: medium · Relevance: 7/10
LLMs produce labor-dispute outcomes that systematically deviate from real Chinese court judgments, and injecting social-media sentiment into prompts amplifies these deviations—often inflating compensation predictions, especially for low-skilled and emotionally charged cases.

Integrating Large Language Models (LLMs) into judicial decision-making demands rigorous safety examination against non-legal influences. This paper presents a novel stress test where we evaluate LLM-generated labor dispute outcomes by introducing social media sentiment as an external pressure, critically comparing them against 10,000 real-world court judgments from China Judgments Online (CJOL). Our findings reveal significant LLM safety vulnerabilities: models exhibit inherent deviations from real rulings, and public opinion substantially amplifies these discrepancies, leading to unstable and often inflated compensation predictions. Furthermore, these safety risks are compounded across low-skilled occupational categories and emotionally charged topics. This study uncovers critical threats to judicial integrity and public trust, underscoring the urgent need for robust safeguards against non-legal influences in AI legal systems.

Summary

Main Finding

LLMs used to generate judicial outcomes for labor disputes diverge substantially from real court judgments, and exposure to structured social media sentiment (Douyin) amplifies those deviations. Models tend to inflate compensation predictions under negative public sentiment, with amplified effects concentrated in low-skilled occupations and emotionally charged topics — raising risks for judicial integrity, distributional economic outcomes, and public trust in AI-assisted adjudication.

Key Points

  • Stress-test setup: authors compare LLM-generated compensation outcomes to 10,000 real labor dispute judgments from China Judgments Online (CJOL) and inject quantified social-media sentiment from Douyin as an external pressure signal.
  • Datasets: 309,642 cleaned Douyin comments (from ~319k collected) and 10,000 CJOL labor cases (2019–2021). Code and data processing details are on the project GitHub.
  • Sentiment and topic processing:
    • Fine-tuned Erlangshen-MacBERT binary sentiment classifier: accuracy 92.24%, F1 91.81%.
    • BERTopic + manual clustering produced five top-level topic dimensions: worker identity, worker income, employer evaluation, labor legal relationships, and job-seeking difficulties.
    • Negative sentiment predominant: e.g., labor legal relationships 86.11% negative; worker income 80.21% negative.
  • Metrics introduced to quantify safety risk:
    • CCR_before / CCR_after: relative change of LLM-predicted compensation vs. real compensation (before/after sentiment input).
    • OICDR (Opinion-Influenced Compensation Deviation Ratio): how much public opinion amplifies deviation.
    • CIM (Compensation Inflation Multiple): LLM_after / |Real amount| (magnitude of inflation).
  • Model behavior:
    • Tested models: farui-plus, ChatGLM-4-9b, DeepSeek V2.5, gemma-2-9b, Qwen2.5-7B.
    • farui-plus showed the largest sensitivity and inflation (CCR 0.851 → 2.874; OICDR 560.139; CIM 3.873).
    • DeepSeek V2.5 was comparatively stable (CCR 0.396 → 0.476; OICDR 62.271; CIM 1.476).
    • Other models (ChatGLM-4, gemma, Qwen) showed moderate baseline deviations and substantial amplification under sentiment.
  • Contextual vulnerabilities:
    • Low-skilled occupational categories and emotionally charged topics exhibit larger CCR increases and higher CIM and OICDR values.
    • Even under neutral prompts many models exhibited baseline deviations from real judgments, indicating inherent model mismatch with legal benchmarks.
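The three safety metrics above can be sketched as simple functions. Only CIM is given explicitly in the source (LLM_after / |Real amount|); the CCR and OICDR definitions below are illustrative assumptions and do not necessarily reproduce the paper's exact formulas (the reported OICDR values suggest a different scaling).

```python
def ccr(pred: float, real: float) -> float:
    """Compensation change ratio: relative deviation of the model's
    predicted compensation from the real court-awarded amount.
    (Assumed definition; the paper's exact formula may differ.)"""
    return abs(pred - real) / abs(real)

def oicdr(ccr_before: float, ccr_after: float) -> float:
    """Opinion-Influenced Compensation Deviation Ratio: percentage
    amplification of deviation after sentiment injection.
    (Assumed definition; the paper's exact formula may differ.)"""
    return (ccr_after - ccr_before) / ccr_before * 100.0

def cim(pred_after: float, real: float) -> float:
    """Compensation Inflation Multiple, stated in the paper as
    LLM_after / |Real amount|."""
    return pred_after / abs(real)

# Hypothetical amounts (CNY) for illustration only:
real_amt, pred_before, pred_after = 10_000, 8_500, 28_700
baseline_dev = ccr(pred_before, real_amt)   # deviation with no social input
stressed_dev = ccr(pred_after, real_amt)    # deviation after sentiment injection
inflation = cim(pred_after, real_amt)       # 28_700 / 10_000 = 2.87
```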

Data & Methods

  • Judicial data:
    • 10,000 labor dispute cases sampled (stratified random sampling across 2019–2021) focusing on Article 38, Para.1 of China’s Labor Contract Law and relevant causes of action to capture discretionary compensation determinations.
    • Preprocessing removed judges’ reasoning, PII, and final rulings to create prompts that require the model to generate outcomes.
    • Ground-truth compensation amounts extracted with ChatGLM-4 plus scripted post-processing.
  • Occupation labeling:
    • Extracted 3,623 job titles using ChatGLM-4; three annotators mapped jobs to ISCO-08 groups (Fleiss' kappa = 0.829).
  • Social-media data:
    • Crawled 386 short videos and ~319k comments; cleaned to 309,642 usable comments.
    • Sentiment classifier fine-tuned on 6,000 curated comments (from 10k annotated examples).
    • Topic modeling per sentiment category (BERTopic + TopicTuner), manual topic filtering and clustering.
  • LLM evaluation protocol:
    • Baseline phase: LLM prompted as a judge to decide compensation strictly according to law (no social input).
    • Influence phase: LLM prompts augmented with structured public-opinion indicators (topic signals + engagement metrics: total comment count, negative proportion, user engagement, average replies)—this isolates sentiment influence from raw text artifacts.
    • Compensation outputs collected; final arithmetic (summing payment items, excluding interest) performed by researchers to compute CCR/CIM/OICDR.
  • Quantitative comparisons across models and occupation/topic strata using CCR_before, CCR_after, OICDR, and CIM.

Implications for AI Economics

  • Redistribution and financial impacts
    • Inflated compensation predictions (higher CIMs) change monetary transfers implied by rulings. If LLM-assisted outputs influence actual adjudication or settlement behavior, this can shift wealth between employers and workers, altering firm costs and household incomes.
  • Labor-market signaling and inequality
    • Disproportionate amplification in low-skilled occupations could worsen existing inequalities: systematic over- or under-compensation across occupations affects employment costs, hiring decisions, and bargaining power of different worker groups.
  • Strategic behavior and market responses
    • Parties (employers, plaintiffs, lawyers) may adapt to LLM-influenced norms (e.g., basing settlement demands on model outputs), potentially creating self-reinforcing distortions in compensation expectations and litigation strategies.
  • Legal risk, insurance, and transaction costs
    • Increased variability and potential bias in AI-assisted outcomes raise litigation risk and uncertainty. Insurers, employers, and courts may face higher compliance and monitoring costs; risk premia could increase in labor contracts.
  • Policy, regulation, and governance economics
    • Findings argue for regulatory oversight, mandated stress-testing, auditing, and liability frameworks for judicial AI systems. Economically efficient regulation should weigh the gains from AI assistance (speed, access) against distributional harms and reputational costs to institutions.
  • Deployment design and procurement choices matter
    • Model selection and fine-tuning materially affect economic outcomes. Procurement of judicial AI tools should incorporate robustness-to-sentiment metrics (CCR/OICDR/CIM) and domain-aligned calibration as contract requirements.
  • Cost–benefit and social-welfare evaluation
    • Any evaluation of AI adoption in courts must internalize externalities: changes in settlement rates, enforcement behavior, employer labor-costs, and trust in legal institutions—requiring macro- and microeconomic modeling beyond accuracy metrics.
  • Recommended mitigations with economic rationale
    • Restrict LLMs to decision-support (not final judgments) to avoid automatic monetary transfers driven by biased outputs.
    • Apply domain-specific, legally grounded fine-tuning and constraints (hard legal rules encoded) to reduce variability; while costly to develop, this lowers economic risk and uncertainty.
    • Mandate adversarial stress tests (including social-media pressure scenarios) and enforce transparency/audit trails to limit hidden distributional effects.
    • Continuous monitoring and model choice based on empirically measured robustness metrics; investment in such governance reduces downstream economic losses from misruled compensation.
  • Research priorities for AI economics
    • Quantify welfare effects of LLM-induced changes in judicial outcomes (e.g., simulation of settlement behavior, firm hiring decisions, and redistribution).
    • Model dynamic feedback loops where public sentiment and LLM outputs co-evolve (endogenous public reactions to perceived judicial bias).
    • Cross-jurisdictional and longitudinal studies to understand how LLM adoption changes legal markets, insurance pricing, and labor contracts over time.
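The procurement recommendation above could be operationalized as a screening gate over measured robustness metrics. The thresholds below are hypothetical; the model figures are the paper's reported values for farui-plus and DeepSeek V2.5.

```python
def passes_robustness_gate(
    metrics: dict,
    max_ccr_after: float = 1.0,
    max_oicdr: float = 100.0,
    max_cim: float = 1.5,
) -> bool:
    """Screen a candidate judicial-AI model against robustness-to-sentiment
    thresholds. Cutoffs are hypothetical; a real procurement process would
    calibrate them to legal and budgetary tolerances."""
    return (
        metrics["ccr_after"] <= max_ccr_after
        and metrics["oicdr"] <= max_oicdr
        and metrics["cim"] <= max_cim
    )

# Reported values from the paper:
farui_plus = {"ccr_after": 2.874, "oicdr": 560.139, "cim": 3.873}
deepseek_v25 = {"ccr_after": 0.476, "oicdr": 62.271, "cim": 1.476}
```

Under these illustrative cutoffs, DeepSeek V2.5 clears the gate while farui-plus does not, mirroring the stability ranking reported in the paper.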

Bottom line: the paper provides empirical evidence that LLMs can be both inherently misaligned with legal benchmarks and highly sensitive to social-media sentiment; in economic terms, that implies tangible redistributional, market-behavioral, and institutional risks unless adoption is accompanied by robust safeguards, procurement standards, and economic impact assessment.

Assessment

Paper type: quasi_experimental

Evidence strength: medium. The study provides direct, experimental evidence that LLM outputs shift when exposed to manipulated social-media sentiment and that these shifts increase deviations from real court judgments; the sample (10,000 cases) is large. However, strength is limited because the analysis concerns model behavior in a lab setting (not deployed judges), depends on how sentiment was measured and encoded and on which LLMs were tested, and therefore does not establish impacts on real-world judicial decision-making or broad economic outcomes.

Methods rigor: medium. A large, well-defined case sample and a clear within-case manipulation strengthen causal claims about model responsiveness to sentiment, but rigor is weakened by likely omissions or uncertainties: (a) unspecified or unvalidated sentiment measurement and prompt encoding; (b) potential lack of randomization checks, robustness across different LLM architectures and temperatures, or sensitivity analyses; and (c) limited discussion (in the summary) of controls for confounders, model calibration, and whether outputs were post-processed or aggregated.

Sample: 10,000 real-world Chinese labor-dispute judgments drawn from China Judgments Online (CJOL); corresponding LLM-generated rulings produced for the same cases under baseline and multiple social-media sentiment conditions (sentiment sourced from Douyin and encoded into prompts); analysis stratified by occupational skill level and topic emotionality.
Themes: governance, labor_markets

Identification: Within-case stress test: the authors pair 10,000 real labor-dispute cases from China Judgments Online with LLM-generated rulings and then systematically vary an external input (social-media sentiment) while holding case facts constant; causal effects of sentiment on model outputs are identified by comparing LLM predictions for the same case across different sentiment conditions (within-case comparisons / manipulated-prompt experiments).

Generalizability:
  • Findings apply only to the specific LLMs, prompt designs, and sentiment-encoding methods used and may not generalize to other models or deployments.
  • Results are based on Chinese labor-dispute cases and judgment styles; applicability to other legal systems, case types (criminal, or civil beyond labor), or jurisdictions is limited.
  • Lab-based prompt manipulations do not fully capture how LLMs would be used in real judicial workflows or how human oversight would mitigate biases.
  • The quality and representativeness of the social-media sentiment data (platform, sampling, language processing) may limit external validity.
  • Implications for actual economic outcomes (e.g., labor-market wages, long-term productivity) are indirect and not measured.

Claims (8)

Each entry lists the claim, its outcome area, direction, and confidence, then the outcome measured and details.

  • We introduce a novel stress test that evaluates LLM-generated labor dispute outcomes by injecting social media sentiment as an external pressure. (Decision Quality; neutral; high confidence.) Outcome: sensitivity of LLM-generated labor dispute outcomes to injected social media sentiment. n=10,000; 0.48.
  • We critically compare LLM-generated rulings against 10,000 real-world court judgments from China Judgments Online (CJOL). (Decision Quality; neutral; high confidence.) Outcome: agreement/deviation between LLM-generated rulings and CJOL judgments. n=10,000; 0.8.
  • Models exhibit inherent deviations from real rulings. (Decision Quality; negative; high confidence.) Outcome: magnitude and frequency of deviations between LLM outputs and actual court judgments. n=10,000; 0.48.
  • Public opinion (social media sentiment) substantially amplifies deviations between LLM outputs and real rulings. (Decision Quality; negative; high confidence.) Outcome: change in deviation between LLM outputs and CJOL rulings when social media sentiment is introduced. n=10,000; 0.48.
  • The sentiment-induced divergences lead to unstable and often inflated compensation predictions by the models. (Wages; negative; high confidence.) Outcome: predicted compensation amounts (inflation and instability) from LLMs versus CJOL judgments. n=10,000; 0.48.
  • These safety risks are compounded (stronger) for low-skilled occupational categories. (Decision Quality; negative; high confidence.) Outcome: interaction effect (deviation/amplification magnitude by occupational skill level, low-skilled vs. others). n=10,000; 0.48.
  • These safety risks are compounded for emotionally charged topics. (Decision Quality; negative; high confidence.) Outcome: change in deviation/amplification of model outputs for emotionally charged topics. n=10,000; 0.48.
  • These findings uncover critical threats to judicial integrity and public trust and underscore the urgent need for robust safeguards against non-legal influences in AI legal systems. (Governance And Regulation; negative; high confidence.) Outcome: potential impact on judicial integrity and public trust (qualitative/inferential). n=10,000; 0.08.
