AMEL: Accumulated Message Effects on LLM Judgments

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

Summary

Main Finding

Conversation history with a prevailing polarity systematically shifts LLM binary judgments toward that polarity — an effect the author terms AMEL (Accumulated Message Effects on LLM Judgments). The shift is small-to-moderate overall (Cohen’s d ≈ −0.17), concentrated on items where the model is uncertain at baseline, saturates after very few prior turns, and is asymmetric: negative histories push models more strongly than positive ones (paired per-item ratio ≈1.62×).

Key Points

Scope and magnitude
- Data: 75,898 API calls to 11 models (OpenAI, Anthropic, Google, and four open-source models run locally); after parsing exclusions, 7,631 bias-score observations.
- Overall mean bias score BS̄ = −0.052 (95% CI [−0.059, −0.045]); t(7630) = −14.69, p < 10−46; Cohen’s d = −0.17.
- 9 of 11 models show significant conformity bias after Bonferroni correction; effect sizes vary by model (e.g., GPT-4.1 Nano d = −0.34; GPT-5.2 d = −0.17).
Dependence on uncertainty and item type
- Items where the model is empirically uncertain at baseline (nonzero binary entropy) absorb roughly twice the bias: d = −0.34 for high-entropy items vs. d = −0.15 for deterministic-baseline items.
- Author-coded “ambiguous” items are most affected; clear negatives barely move.
Negativity asymmetry
- Negative-skewed histories induce stronger bias than positive-skewed ones (paired absolute-bias ratio 1.62×; paired t p < 10−39).
- Aggregate effect is biased toward “no” even when histories are neutral.
Accumulation and saturation
- Bias does not grow with longer contexts: 5 biased turns ≈ 50 biased turns (Spearman |r| < 0.01, p > 0.94).
- Models “recognize the pattern” quickly; additional examples produce little extra shift.
Scale and model heterogeneity
- Larger models within families tend to be less susceptible but not immune (scaling reduces bias but does not eliminate it).
- Some exceptions: Gemini family showed inconsistent scaling; Qwen3 4B showed contrarian behavior (shift opposite to polarity).
Mechanistic characterization (follow-up experiments)
- The shift is continuous in token-probability space (not a threshold flip).
- Negativity asymmetry arises from both token-level and semantic components (per-model attribution exploratory given sample sizes).
- Positional explanations (primacy/recency) don’t explain AMEL: biased turns placed anywhere in 50-turn histories produce the same shift.
Domains tested
- Effect detected in all three domains tested (code review d = −0.24 > nutrition d = −0.13 > content moderation d = −0.12).
Practical mitigate recommendations from the paper
- Best fix: evaluate each item in a fresh context.
- If batching/batching-in-conversation is unavoidable for cost reasons, balance the polarity of the history or apply de-biasing detectors/corrections (e.g., B-SCORE, log-prob corrections).

Data & Methods

Design
- Between-subjects: identical test items evaluated either in a clean baseline context or following histories with different polarity mixes.
- Four history conditions: Baseline (no history), No-saturated (90% “no”, 10% “yes”), Yes-saturated (90% “yes”, 10% “no”), Neutral (50/50).
- Context lengths N ∈ {5, 10, 20, 50}.
Items and domains
- Three binary classification domains: Code review (“Is this code production-ready?”), Content moderation (“Is this comment appropriate?”), Nutritional assessment (“Is this a healthy choice?”).
- For each domain: 55 positive / 55 negative items for history construction; 21 test items (7 clear positive, 7 ambiguous, 7 clear negative).
Models and sampling
- 11 models from 4 providers, mix of API models and local OSS inference; temperature = 1.0; 10 repetitions per condition; ~76K API calls in main experiment.
Outcome metric
- Bias Score (BS) per item-model-polarity-length: BS = P(r | treatment) − P(r | baseline), where r* is the target response for the saturated polarity (“no” for no-sat/neutral; “yes” for yes-sat).
- Parsing: multi-step parser for binary yes/no; unparseable outputs excluded (~8.3% overall; differential rates across conditions reported).
Statistical analysis
- One-sample t-tests on BS to test deviation from zero, Cohen’s d reported; Bonferroni correction applied (factor 21) for primary group tests.
- Mixed-effects analysis to estimate ICC and account for nested observations; Spearman correlation and OLS for accumulation tests.
Robustness & limitations noted
- Some small-sample cells (Gemini) and parsing-exclusion concentrated in specific models (Claude Opus).
- Author-coded categories validated via intra-rater checks but items are limited (21 per domain). Follow-up mechanistic tests are informative but exploratory due to smaller n on some attribution analyses.

Implications for AI Economics

Measurement error in market outcomes and metrics
- When LLMs are used as automated evaluators (hiring screens, code review accept/rejects, content moderation, credit underwriting proxies, quality scoring), AMEL introduces context-dependent, asymmetric measurement error.
- Because bias concentrates on uncertain cases, selection effects will be non-random: borderline cases (where economic margins are often highest) are most likely to be mismeasured or systematically skewed by prior evaluation history.
- Negativity asymmetry implies systematic downward drift (more “no” outcomes) in batched evaluative processes, which can bias estimates of product quality, agent behavior, or policy impact if LLM judgments are treated as ground truth.
Design of evaluation pipelines and cost trade-offs
- There is a concrete trade-off between cost (batching many items in a single conversation to reduce API calls) and measurement bias. Fresh-context evaluation reduces AMEL but increases compute/API cost; balancing histories or applying correction scores (B-SCORE, log-prob corrections) is a partial mitigation that leaves residual bias.
- Economic evaluations and experiments that rely on LLM judges (A/B tests, platform audits, scorer-mediated auctions) should account for AMEL when designing data collection (randomize context, use fresh contexts for critical items, or explicitly model the bias).
Incentives and feedback loops
- If LLM evaluators feed back into training pipelines, RLHF, or automated decision systems, AMEL can create self-reinforcing equilibria (drift equilibria): a stream of rejections may entrench conservative (more negative) policies; approvals may fail to overcome baseline negativity. This changes dynamic incentives for agents whose outcomes depend on LLM judgments.
- Platform-level metrics that influence incentives (e.g., reviewer approval rates, moderation thresholds) may be systematically biased downward, changing firm behavior and societal outcomes (e.g., over-removal of content, under-promotion of borderline novel work).
Heterogeneity and general equilibrium concerns
- Heterogeneous model susceptibility implies that economic analyses comparing platforms or time periods must control for model family and scale; naive comparisons could conflate model susceptibility with real changes in underlying quality or behavior.
- Contrarian or idiosyncratic models (e.g., Qwen3 4B) can generate heteroskedastic, model-dependent measurement error that complicates aggregation across evaluators.
Recommendations for empirical economists and policymakers
- Treat LLM-based judgments as noisy, context-dependent measures. When possible, obtain fresh-context evaluations for a representative subsample to estimate and correct for AMEL-driven bias.
- Stratify analyses by baseline uncertainty: estimate bias conditional on model-empirical entropy to avoid differential attenuation and misattribution.
- When batching is necessary, use balanced histories or run de-biasing detectors (B-SCORE, log-prob corrections) and report residual bias estimates; incorporate those into standard errors or bias-correction factors.
- For audits, regulation, or audits that rely on LLM judgments, require a protocol that prevents cross-item contamination (fresh contexts) or mandates disclosure of context-history treatment and correction methods.
Research agenda points
- Quantify economic welfare consequences of AMEL in concrete deployments (moderation-induced content suppression, hiring/admissions decisions, automated loan screening).
- Model the equilibrium dynamics when LLM evaluations feed into agent strategies and platform policies (endogenous responses to biased evaluation signals).
- Evaluate cost-effectiveness of mitigations across scales: when is paying for fresh-context evaluation justified by reductions in misclassification externalities?
- Further study of mechanistic origins (RLHF, token-level priors) to target editing or training-time fixes that lower susceptibility without prohibitive costs.

Summary takeaway: AMEL is a reproducible, cross-model phenomenon that meaningfully distorts LLM-as-judge outputs — especially on borderline items — and creates asymmetric (negative-leaning) measurement error. For economists using or regulating LLM-based evaluations, the safest operational rule is to evaluate items in fresh contexts or explicitly measure and correct for history-driven drift when batching is unavoidable.

Assessment

Paper Typequasi_experimental Evidence Strengthhigh — Large-scale, pre-registered-style experimental design with 75,898 API calls across 11 models and 4 providers, clearly randomized history treatments, strong statistical significance, robustness checks (entropy stratification, context-length tests, token-distribution analysis), and replication across multiple model families. Methods Rigorhigh — Careful treatment manipulation, large sample sizes, multiple model providers and sizes, stratified analyses for uncertainty, tests for alternative mechanisms (token-level vs semantic, position effects), and practical mitigation experiments (fresh context, balancing), all of which demonstrate thorough internal validity and robustness. Sample75,898 API calls to 11 LLMs from 4 providers (OpenAI, Anthropic, Google, plus four open-source models); identical test items presented either in isolation or after histories saturated with positive or negative evaluations; analyses stratify items by baseline uncertainty (entropy) and include follow-up experiments probing token distributions, position, and batching. Themeshuman_ai_collab governance IdentificationRandomized manipulation of conversation history polarity (predominantly positive, predominantly negative, or isolated/neutral) applied to identical evaluation items across many API calls and models, with within-item comparisons and model-level replication to isolate the causal effect of prior messages on subsequent LLM judgments. GeneralizabilityExperimental histories may be more saturated/extreme than typical real-world conversational contexts, Tasks tested (automated evaluation scenarios) may not cover all evaluation domains (e.g., hiring, high-stakes grading, non-English content), Results depend on specific model versions, prompts, and API settings (temperature, system prompts) which evolve over time, Black-box nature of proprietary models limits mechanistic interpretation and transfer to future architectures, Downstream organizational or human-in-the-loop interactions and economic consequences are not directly measured

Claims (11)

Claim	Direction	Confidence	Outcome	Details
We conducted 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models). Other	null_result	high	experimental sample size / scope (number of API calls and models)	n=75898 0.8
Models shift toward the conversation's prevailing polarity (accumulated message effect on LLM judgments, AMEL). Decision Quality	mixed	high	directional bias in LLM judgments toward preceding conversation polarity	n=75898 d = -0.17, p < 10^-46 0.8
The accumulated-message effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Decision Quality	mixed	high	magnitude of AMEL as a function of item baseline uncertainty (entropy)	d = -0.34 for high-entropy items; d = -0.15 when the baseline is deterministic 0.8
Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman \|r\| < 0.01; OLS slope p = 0.80). Decision Quality	null_result	high	relationship between context length and magnitude of AMEL	Spearman \|r\| < 0.01; OLS slope p = 0.80 0.48
There is a negativity asymmetry: negative histories induce 1.62x more bias than positive (paired per item; t = 13.46, p < 10^-39, n = 2,481). Decision Quality	negative	high	relative bias magnitude induced by negative versus positive conversation histories	n=2481 negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39) 0.8
Scaling helps but does not solve the accumulated-message effect (Anthropic models: Haiku -0.22 to Opus -0.17; OpenAI models: Nano -0.34 to GPT-5.2 -0.17). Decision Quality	mixed	high	AMEL magnitude as a function of model scale/variant	Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17 0.48
At the token-probability level, the distribution shifts continuously rather than via a threshold when histories bias later judgments. Ai Safety And Ethics	mixed	medium	change in token probability distribution induced by biased histories	0.29
The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Ai Safety And Ethics	negative	medium	sources (token-level vs semantic) of the observed negativity asymmetry	0.05
Position of biased turns does not matter: five biased turns placed anywhere in a 50-turn history produce the same shift. Decision Quality	null_result	high	dependence of AMEL on the position of biased messages in conversation history	0.48
The simplest practical fix for evaluation pipelines is to use a fresh context per item; when batching is unavoidable, balancing the history helps reduce bias. Organizational Efficiency	positive	medium	effectiveness of mitigation strategies (fresh context per item; balancing histories) in reducing AMEL	0.29
Large language models are routinely used as automated evaluators (to review code, moderate content, or score outputs), often with many items passing through one conversation. Other	null_result	high	prevalence of LLM use as automated evaluators	0.24