AI judges stray from real rulings, and public opinion pushes them further off course: introducing social-media sentiment into LLM prompts inflates compensation forecasts and destabilizes outcomes, with the largest distortions in low-skilled and emotionally charged labor disputes.
Integrating Large Language Models (LLMs) into judicial decision-making demands rigorous safety testing against non-legal influences. This paper presents a novel stress test in which we evaluate LLM-generated labor dispute outcomes by introducing social media sentiment as an external pressure, critically comparing them against 10,000 real-world court judgments from China Judgments Online (CJOL). Our findings reveal significant LLM safety vulnerabilities: models exhibit inherent deviations from real rulings, and public opinion substantially amplifies these discrepancies, leading to unstable and often inflated compensation predictions. These safety risks are further compounded in low-skilled occupational categories and emotionally charged topics. This study uncovers critical threats to judicial integrity and public trust, underscoring the urgent need for robust safeguards against non-legal influences in AI legal systems.
Summary
Main Finding
LLMs used to generate judicial outcomes for labor disputes diverge substantially from real court judgments, and exposure to structured social media sentiment (Douyin) amplifies those deviations. Models tend to inflate compensation predictions under negative public sentiment, with amplified effects concentrated in low-skilled occupations and emotionally charged topics — raising risks for judicial integrity, distributional economic outcomes, and public trust in AI-assisted adjudication.
Key Points
- Stress-test setup: authors compare LLM-generated compensation outcomes to 10,000 real labor dispute judgments from China Judgments Online (CJOL) and inject quantified social-media sentiment from Douyin as an external pressure signal.
- Datasets: 309,642 cleaned Douyin comments (from ~319k collected) and 10,000 CJOL labor cases (2019–2021). Code and data processing details are on the project GitHub.
- Sentiment and topic processing:
- Fine-tuned Erlangshen-MacBERT binary sentiment classifier: accuracy 92.24%, F1 91.81%.
- BERTopic + manual clustering produced five top-level topic dimensions: worker identity, worker income, employer evaluation, labor legal relationships, and job-seeking difficulties.
- Negative sentiment predominant: e.g., labor legal relationships 86.11% negative; worker income 80.21% negative.
- Metrics introduced to quantify safety risk:
- CCRbefore / CCRafter: relative change of LLM-predicted compensation vs. real compensation (before/after sentiment input).
- OICDR (Opinion-Influenced Compensation Deviation Ratio): how much public opinion amplifies deviation.
- CIM (Compensation Inflation Multiple): LLM_after / |Real amount| (magnitude of inflation).
- Model behavior:
- Tested models: farui-plus, ChatGLM-4-9b, DeepSeek V2.5, gemma-2-9b, Qwen2.5-7B.
- farui-plus showed the largest sensitivity and inflation (CCR 0.851 → 2.874; OICDR 560.139; CIM 3.873).
- DeepSeek V2.5 was comparatively stable (CCR 0.396 → 0.476; OICDR 62.271; CIM 1.476).
- Other models (ChatGLM-4, gemma, Qwen) showed moderate baseline deviations and substantial amplification under sentiment.
- Contextual vulnerabilities:
- Low-skilled occupational categories and emotionally charged topics exhibit larger CCR increases and higher CIM and OICDR values.
- Even under neutral prompts many models exhibited baseline deviations from real judgments, indicating inherent model mismatch with legal benchmarks.
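The three deviation metrics above can be sketched as per-case formulas. The exact definitions and aggregation belong to the paper, so the forms below (in particular OICDR as a percentage amplification of deviation) are illustrative assumptions rather than the authors' implementation, and the reported model-level figures are aggregates that these toy formulas will not reproduce:

```python
# Sketch of the deviation metrics (hypothetical per-case formulas;
# the paper's exact definitions and aggregation may differ).

def ccr(predicted: float, real: float) -> float:
    """Relative deviation of the LLM's predicted compensation
    from the real court award (computed before and after the
    sentiment signal is injected)."""
    return abs(predicted - real) / abs(real)

def oicdr(ccr_before: float, ccr_after: float) -> float:
    """Opinion-Influenced Compensation Deviation Ratio: assumed here
    to be the percentage amplification of deviation once public
    opinion is introduced."""
    return (ccr_after - ccr_before) / ccr_before * 100

def cim(predicted_after: float, real: float) -> float:
    """Compensation Inflation Multiple: LLM_after / |Real amount|."""
    return predicted_after / abs(real)

# Toy case: court awarded 50,000; the model predicts 70,000 without
# sentiment and 120,000 after sentiment is injected.
before = ccr(70_000, 50_000)    # 0.4
after = ccr(120_000, 50_000)    # 1.4
print(before, after, oicdr(before, after), cim(120_000, 50_000))
```

On this toy case the model's deviation amplifies by 250% (OICDR) and the post-sentiment prediction is 2.4 times the real award (CIM), illustrating how a single case feeds the three metrics.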
Data & Methods
- Judicial data:
- 10,000 labor dispute cases sampled (stratified random sampling across 2019–2021) focusing on Article 38, Para.1 of China’s Labor Contract Law and relevant causes of action to capture discretionary compensation determinations.
- Preprocessing removed judges’ reasoning, PII, and final rulings to create prompts that require the model to generate outcomes.
- Compensation ground truth extracted with ChatGLM-4 followed by scripted post-processing.
- Occupation labeling:
- Extracted ~3,623 job titles using ChatGLM-4; three annotators mapped jobs to ISCO-08 groups (Fleiss kappa = 0.829).
- Social-media data:
- Crawled 386 short videos and ~319k comments; cleaned to 309,642 usable comments.
- Sentiment classifier fine-tuned on 6,000 curated comments (from 10k annotated examples).
- Topic modeling per sentiment category (BERTopic + TopicTuner), manual topic filtering and clustering.
- LLM evaluation protocol:
- Baseline phase: LLM prompted as a judge to decide compensation strictly according to law (no social input).
- Influence phase: LLM prompts augmented with structured public-opinion indicators (topic signals plus engagement metrics: total comment count, negative proportion, user engagement, average replies); this isolates the sentiment influence from raw-text artifacts.
- Compensation outputs collected; final arithmetic (summing payment items, excluding interest) performed by researchers to compute CCR/CIM/OICDR.
- Quantitative comparisons across models and occupation/topic strata using CCRbefore, CCRafter, OICDR, and CIM.
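The two-phase protocol can be illustrated with a minimal prompt-construction sketch. The wording, field names, and opinion dictionary below are hypothetical stand-ins, not the paper's actual templates or engagement metrics:

```python
# Illustrative two-phase prompting sketch (prompt text and field names
# are hypothetical, not the paper's exact templates).

def baseline_prompt(case_facts: str) -> str:
    # Baseline phase: the model is asked to rule strictly by law,
    # with no social-media input.
    return (
        "You are a judge. Based strictly on the Labor Contract Law, "
        "determine the compensation owed in the following dispute.\n"
        f"Case facts: {case_facts}"
    )

def influence_prompt(case_facts: str, opinion: dict) -> str:
    # Influence phase: structured indicators replace raw comment text,
    # isolating the sentiment signal from raw-text artifacts.
    signal = (
        f"related topic: {opinion['topic']}; "
        f"total comments: {opinion['total_comments']}; "
        f"negative proportion: {opinion['negative_share']:.0%}; "
        f"average replies per comment: {opinion['avg_replies']:.1f}"
    )
    return baseline_prompt(case_facts) + "\nPublic opinion signal: " + signal

facts = "Worker resigned citing unpaid wages and claims statutory severance."
p = influence_prompt(
    facts,
    {"topic": "worker income", "total_comments": 12840,
     "negative_share": 0.8021, "avg_replies": 2.3},
)
print(p)
```

Comparing the model's numeric outputs under `baseline_prompt` and `influence_prompt` for the same case is what yields the before/after pair behind CCRbefore and CCRafter.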
Implications for AI Economics
- Redistribution and financial impacts
- Inflated compensation predictions (higher CIMs) change monetary transfers implied by rulings. If LLM-assisted outputs influence actual adjudication or settlement behavior, this can shift wealth between employers and workers, altering firm costs and household incomes.
- Labor-market signaling and inequality
- Disproportionate amplification in low-skilled occupations could worsen existing inequalities: systematic over- or under-compensation across occupations affects employment costs, hiring decisions, and bargaining power of different worker groups.
- Strategic behavior and market responses
- Parties (employers, plaintiffs, lawyers) may adapt to LLM-influenced norms (e.g., basing settlement demands on model outputs), potentially creating self-reinforcing distortions in compensation expectations and litigation strategies.
- Legal risk, insurance, and transaction costs
- Increased variability and potential bias in AI-assisted outcomes raise litigation risk and uncertainty. Insurers, employers, and courts may face higher compliance and monitoring costs; risk premia could increase in labor contracts.
- Policy, regulation, and governance economics
- Findings argue for regulatory oversight, mandated stress-testing, auditing, and liability frameworks for judicial AI systems. Economically efficient regulation should weigh the gains from AI assistance (speed, access) against distributional harms and reputational costs to institutions.
- Deployment design and procurement choices matter
- Model selection and fine-tuning materially affect economic outcomes. Procurement of judicial AI tools should incorporate robustness-to-sentiment metrics (CCR/OICDR/CIM) and domain-aligned calibration as contract requirements.
- Cost–benefit and social-welfare evaluation
- Any evaluation of AI adoption in courts must internalize externalities: changes in settlement rates, enforcement behavior, employer labor-costs, and trust in legal institutions—requiring macro- and microeconomic modeling beyond accuracy metrics.
- Recommended mitigations with economic rationale
- Restrict LLMs to decision-support (not final judgments) to avoid automatic monetary transfers driven by biased outputs.
- Apply domain-specific, legally grounded fine-tuning and constraints (hard legal rules encoded) to reduce variability; while costly to develop, this lowers economic risk and uncertainty.
- Mandate adversarial stress tests (including social-media pressure scenarios) and enforce transparency/audit trails to limit hidden distributional effects.
- Continuous monitoring and model choice based on empirically measured robustness metrics; investment in such governance reduces downstream economic losses from misruled compensation.
- Research priorities for AI economics
- Quantify welfare effects of LLM-induced changes in judicial outcomes (e.g., simulation of settlement behavior, firm hiring decisions, and redistribution).
- Model dynamic feedback loops where public sentiment and LLM outputs co-evolve (endogenous public reactions to perceived judicial bias).
- Cross-jurisdictional and longitudinal studies to understand how LLM adoption changes legal markets, insurance pricing, and labor contracts over time.
Bottom line: the paper provides empirical evidence that LLMs can be both inherently misaligned with legal benchmarks and highly sensitive to social-media sentiment; in economic terms, that implies tangible redistributional, market-behavioral, and institutional risks unless adoption is accompanied by robust safeguards, procurement standards, and economic impact assessment.
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We introduce a novel stress test that evaluates LLM-generated labor dispute outcomes by injecting social media sentiment as an external pressure. | Decision Quality | neutral | high | sensitivity of LLM-generated labor dispute outcomes to injected social media sentiment | n=10000; 0.48 |
| We critically compare LLM-generated rulings against 10,000 real-world court judgments from China Judgments Online (CJOL). | Decision Quality | neutral | high | agreement / deviation between LLM-generated rulings and CJOL judgments | n=10000; 0.8 |
| Models exhibit inherent deviations from real rulings. | Decision Quality | negative | high | magnitude and frequency of deviations between LLM outputs and actual court judgments | n=10000; 0.48 |
| Public opinion (social media sentiment) substantially amplifies deviations between LLM outputs and real rulings. | Decision Quality | negative | high | change in deviation between LLM outputs and CJOL rulings when social media sentiment is introduced | n=10000; 0.48 |
| The sentiment-induced divergences lead to unstable and often inflated compensation predictions by the models. | Wages | negative | high | predicted compensation amounts (inflation and instability) from LLMs versus CJOL judgments | n=10000; 0.48 |
| These safety risks are compounded (stronger) for low-skilled occupational categories. | Decision Quality | negative | high | interaction effect: deviation/amplification magnitude by occupational skill level (low-skilled vs others) | n=10000; 0.48 |
| These safety risks are compounded for emotionally charged topics. | Decision Quality | negative | high | change in deviation/amplification of model outputs for emotionally charged topics | n=10000; 0.48 |
| These findings uncover critical threats to judicial integrity and public trust and underscore the urgent need for robust safeguards against non-legal influences in AI legal systems. | Governance And Regulation | negative | high | potential impact on judicial integrity and public trust (qualitative/inferential) | n=10000; 0.08 |