Changing the composition of feedback and asking annotators for probabilities reduces systematic under-detection of rare events, and applying a simple recalibration step to aggregated probabilities produces better-calibrated labels that materially improve CNN performance out of sample.

Managing Cognitive Bias in Human Labeling Operations for Rare-Event AI: Evidence from a Field Experiment

Gunnar P. Epping, Andrew Caplin, Erik Duhaime, William R. Holmes, Daniel Martin, Jennifer S. Trueblood · March 12, 2026

arXiv · RCT · High evidence · 9/10 relevance

Randomizing the prevalence of positives in the feedback stream and eliciting probabilistic judgments reduces rare-event misses, and a lightweight pipeline-level recalibration of elicited probabilities substantially improves label calibration and downstream CNN rare-event detection.

Many operational AI systems depend on large-scale human annotation to detect rare but consequential events (e.g., fraud, defects, and medical abnormalities). When positives are rare, the prevalence effect induces systematic cognitive biases that inflate misses and can propagate through the AI lifecycle via biased training labels. We analyze prior experimental evidence and run a field experiment on DiagnosUs, a medical crowdsourcing platform, in which we hold the true prevalence in the unlabeled stream fixed (20% blasts) while varying (i) the prevalence of positives in the gold-standard feedback stream (20% vs. 50%) and (ii) the response interface (binary labels vs. elicited probabilities). We then post-process probabilistic labels using a linear-in-log-odds recalibration approach at the worker and crowd levels, and train convolutional neural networks on the resulting labels. Balanced feedback and probabilistic elicitation reduce rare-event misses, and pipeline-level recalibration substantially improves both classification performance and probabilistic calibration; these gains carry through to downstream CNN reliability out of sample.

Summary

Main Finding

When positive events are rare, providing balanced gold-standard feedback and eliciting probabilistic (rather than binary) judgments from annotators reduces misses on rare events. Applying linear-in-log-odds recalibration at the worker and crowd levels further improves label quality (both classification and calibration), and these improvements persist when training downstream convolutional neural networks, yielding more reliable out-of-sample AI performance.

Key Points

Prevalence effect: Low true prevalence of positives leads annotators to systematically under-report positives (higher miss rates), which biases training labels and harms downstream models.
Experimental manipulation: On a medical crowdsourcing platform (DiagnosUs), true prevalence in the unlabeled stream was held at 20% blasts while two factors were varied:
- Gold-standard feedback prevalence shown to workers: 20% vs. 50%.
- Response interface: binary labels vs. elicited probabilities.
Eliciting probabilities reduced miss rates relative to binary labels.
Providing balanced feedback (50% positives in the gold-standard feedback stream) reduced prevalence-driven bias compared with reflecting the true (rarer) prevalence in feedback.
Post-collection recalibration using a linear-in-log-odds model applied at both individual-worker and crowd levels substantially improved probabilistic calibration and classification metrics.
Benefits of these human-level interventions and recalibration transferred to downstream CNNs trained on the processed labels: better classification performance and more reliable probability estimates out of sample.

Data & Methods

Setting: Field experiment on DiagnosUs, a medical crowdsourcing platform, in which annotators label white blood cell images for "blasts" (a rare medical abnormality).
Design:
- True prevalence in the unlabeled labeling stream fixed at 20% positives.
- Two orthogonally randomized manipulations: feedback prevalence (20% vs. 50%) and response interface (binary labels vs. elicited probabilities).
Label processing:
- Collected probabilistic labels where applicable.
- Applied linear-in-log-odds recalibration to individual worker outputs and to aggregated crowd outputs. (In practice: fit an affine transform in log-odds space that maps worker-reported probabilities to calibrated probabilities.)
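
The recalibration step above can be illustrated with a minimal sketch. It assumes the map has the form logit(p_cal) = a + b * logit(p_reported) and that a and b are fit by maximum likelihood against gold-standard labels; the summary does not say how the authors estimate or aggregate, so the function names (`fit_llo`, `apply_llo`, `crowd_probability`), the damped Newton fit, and the log-odds averaging across workers are all assumptions made for illustration.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 so the log stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Numerically stable inverse of logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def _softplus(z):
    """log(1 + exp(z)) without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def _loglik(a, b, x, y):
    """Bernoulli log-likelihood of labels y under sigmoid(a + b * x)."""
    return sum(-_softplus(-(a + b * xi)) if yi else -_softplus(a + b * xi)
               for xi, yi in zip(x, y))

def fit_llo(reported, truth, iters=50, damping=1e-9):
    """Fit (a, b) in logit(p_cal) = a + b * logit(p_reported).

    Assumed estimator: maximum likelihood via Newton steps with step halving.
    `reported` are elicited probabilities, `truth` are 0/1 gold-standard labels.
    """
    x = [logit(p) for p in reported]
    a, b = 0.0, 1.0  # start from the identity map (no recalibration)
    ll = _loglik(a, b, x, truth)
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, truth):
            q = sigmoid(a + b * xi)
            w = q * (1.0 - q)
            g0 += yi - q
            g1 += (yi - q) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        h00 += damping  # tiny damping keeps the 2x2 system invertible
        h11 += damping
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        step = 1.0
        while step > 1e-4 and _loglik(a + step * da, b + step * db, x, truth) < ll:
            step *= 0.5  # back off if the full Newton step overshoots
        a, b = a + step * da, b + step * db
        ll = _loglik(a, b, x, truth)
        if abs(step * da) + abs(step * db) < 1e-10:
            break
    return a, b

def apply_llo(p, a, b):
    """Map a reported probability to a recalibrated one."""
    return sigmoid(a + b * logit(p))

def crowd_probability(worker_probs, worker_params, crowd_params=(0.0, 1.0)):
    """Hypothetical crowd step: recalibrate each worker, average the
    log-odds, then apply a second (crowd-level) affine map."""
    z = [a + b * logit(p) for p, (a, b) in zip(worker_probs, worker_params)]
    ca, cb = crowd_params
    return sigmoid(ca + cb * sum(z) / len(z))
```

In this sketch, a slope b above 1 stretches reports that are compressed toward the middle of the scale, and a positive intercept a shifts probabilities upward, which is the direction one would expect to need if annotators under-report rare positives. Worker-level parameters would be fit on each worker's gold-standard items and crowd-level parameters on the aggregated log-odds.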
Downstream modeling:
- Trained convolutional neural networks on the labels produced under each experimental condition, both raw and recalibrated.
- Evaluated both classification performance (e.g., miss/false-positive behavior) and probabilistic calibration out of sample.
Outcome measures: miss rates on rare positives, aggregate classification metrics, and calibration. These are inferred from the description; the summary does not state the exact metrics or sample sizes.
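
Because the summary does not name the exact metrics, the sketch below shows common choices for the outcome families listed above: miss rate and false-positive rate for classification, and Brier score and expected calibration error (ECE) for calibration. The 0.5 decision threshold, the equal-width binning, and all function names are assumptions for illustration, not the paper's evaluation code.

```python
def miss_rate(probs, truth, threshold=0.5):
    """Share of true positives scored below the threshold (false negative rate)."""
    positives = [p for p, y in zip(probs, truth) if y == 1]
    if not positives:
        raise ValueError("no positive items to evaluate")
    return sum(p < threshold for p in positives) / len(positives)

def false_positive_rate(probs, truth, threshold=0.5):
    """Share of true negatives scored at or above the threshold."""
    negatives = [p for p, y in zip(probs, truth) if y == 0]
    if not negatives:
        raise ValueError("no negative items to evaluate")
    return sum(p >= threshold for p in negatives) / len(negatives)

def brier_score(probs, truth):
    """Mean squared difference between probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, truth)) / len(probs)

def expected_calibration_error(probs, truth, n_bins=10):
    """Weighted average gap between mean predicted probability and
    observed positive rate, over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, truth):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for items in bins:
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            mean_y = sum(y for _, y in items) / len(items)
            ece += len(items) / len(probs) * abs(mean_p - mean_y)
    return ece
```

Computing these on held-out items with ground truth, once for raw labels and once for recalibrated labels, gives the kind of before-and-after comparison the summary describes. With positives at 20% prevalence, miss rate is worth reporting separately because overall accuracy can look high even when many positives are missed.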

Implications for AI Economics

Labeling pipeline design matters for rare-event detection: Small changes to feedback composition and the elicitation interface can materially affect label quality and thus model reliability, implying procurement contracts and annotation platform design should internalize these effects.
Cost–benefit tradeoffs: Balancing feedback toward more positives (e.g., 50% in gold-standard feedback) likely requires sourcing or synthesizing more positive examples and may raise short-term costs, but can reduce costly misses downstream—important when misses have high economic or safety costs.
Eliciting probabilities adds information value: Probabilistic judgments are more informative than binary labels for rare events, enabling better recalibration and improved downstream models without necessarily increasing labeling volume.
Pipeline-level calibration is high-leverage: Simple post-hoc recalibration (linear-in-log-odds) at worker and crowd levels can recover substantial downstream performance and calibration, offering a low-cost intervention compared with collecting more labels or building more complex models.
Procurement and incentives: Contracts and UI choices (feedback prevalence shown, whether to ask for probabilities) are levers that platforms and firms can use to reduce systematic annotation bias; these should be considered alongside traditional price-per-label decisions.
Generalization & risk: Improvements in annotation and calibration reduce model risk for rare but consequential events, affecting expected loss calculations, insurance, regulatory compliance, and investment decisions in AI systems deployed in high-stakes domains.
Research directions for economic analysis: quantify the cost-effectiveness of balanced feedback and probabilistic elicitation across prevalence regimes; incorporate these interventions into active learning and optimal labeling budgets; study long-run worker learning and strategic responses to feedback prevalence.

Assessment

Paper Type: RCT
Evidence Strength: High. The paper reports a randomized controlled field experiment that manipulates operational levers (feedback prevalence and interface) and measures both immediate labeling outcomes and downstream effects on trained CNNs with ground-truth labels, providing direct causal evidence; strength is moderated by setting-specific limits (one perceptual task and a single crowdsourcing platform).
Methods Rigor: High. Carefully designed 2×2 randomized field experiment, pre-existing laboratory experiments analyzed for mechanistic insight, clear use of ground-truth labels from hematopathologists, robust aggregation and recalibration procedures (individual and crowd LLO), and evaluation of downstream model calibration and error trade-offs; limitations include potential self-selection into contests, unspecified sample size details for the field sample in the excerpt, and domain specificity to a medical image task.
Sample:
- Study 1: laboratory experiments (Trueblood et al. 2021) with Vanderbilt undergraduates (Study 1a N=39; Study 1b N=57) classifying 300 Wright-stained white blood cell images (50% blasts) with ground truth from hematopathology faculty (data on OSF).
- Study 2: randomized field experiment on DiagnosUs (crowd annotators) labeling white blood cell images; unlabeled QA stream prevalence fixed at 20% blasts while gold-standard feedback stream prevalence was randomized to 20% or 50% and interface randomized to binary vs. elicited probabilities; resulting labeled datasets were used to train and evaluate convolutional neural networks against ground truth.
Themes: human_ai_collab, org_design
Identification: Randomized field experiment on a crowdsourcing platform (DiagnosUs) that orthogonally assigns annotators/contests to a 2×2 treatment (gold-standard feedback prevalence: 20% vs. 50% × response interface: binary vs. elicited probabilities) while holding the true unlabeled (production) stream prevalence fixed at 20%; uses random assignment to causally identify the effect of feedback prevalence and interface on labeling behavior and downstream CNN performance, and complements this with post-hoc linear-in-log-odds recalibration at worker and crowd levels.
Generalizability:
- Single perceptual task (white blood cell / blast detection) may not generalize to non-visual or more abstract labeling domains (e.g., text moderation, NLP).
- Crowdsourced annotators and contest-style incentives on DiagnosUs may differ from professional/expert labeling operations (radiologists, experienced fraud analysts).
- Examined prevalence levels (20% vs. 50% GS feedback; unlabeled 20%) may not map directly to settings with much rarer true base rates (<<1%).
- Platform-specific workflow (interleaved GS items, leaderboard incentives) limits transferability to different annotation pipelines.
- CNN architectures, training regimes, and dataset sizes used may affect downstream model transferability to other model families and scales.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
When positives are rare, the prevalence effect induces systematic cognitive biases that inflate misses and can propagate through the AI lifecycle via biased training labels.	Error Rate	negative	medium	miss rate (false negative rate) for rare positives; downstream bias in training labels	0.6
In a field experiment on the DiagnosUs medical crowdsourcing platform, the authors held the true prevalence in the unlabeled stream fixed at 20% (blasts) while varying the prevalence of positives in the gold-standard feedback stream (20% vs. 50%) and the response interface (binary labels vs. elicited probabilities).	Other	null_result	high	experimental manipulations (true prevalence, feedback prevalence, response interface); these are independent variables in the study design	1.0
Balanced feedback (higher positive prevalence in the feedback stream) and probabilistic elicitation reduce rare-event misses.	Error Rate	positive	medium	rare-event miss rate (false negative rate for positive examples)	0.6
Post-processing probabilistic labels using a linear-in-log-odds recalibration approach at the worker and crowd levels substantially improves classification performance.	Output Quality	positive	medium	classification performance of models trained on labels (e.g., accuracy, AUC or other classification metrics)	0.6
Pipeline-level recalibration substantially improves probabilistic calibration of labels.	Output Quality	positive	medium	probabilistic calibration (e.g., calibration error, Brier score, reliability diagrams)	0.6
The improvements from balanced feedback, probabilistic elicitation, and pipeline-level recalibration carry through to downstream convolutional neural network (CNN) reliability out of sample.	Output Quality	positive	medium	downstream CNN out-of-sample reliability (e.g., generalization performance, accuracy, calibration on held-out test data)	0.6
Eliciting probabilities (instead of forcing binary labels) enables post-hoc recalibration that improves both individual-worker and crowd-level label quality.	Output Quality	positive	medium	label quality at worker and crowd levels (measured via calibration and classification-relevant metrics)	0.6