Changing the composition of feedback and asking annotators for probabilities reduces systematic under-detection of rare events, and applying a simple recalibration step to aggregated probabilities produces better-calibrated labels that materially improve CNN performance out of sample.
Many operational AI systems depend on large-scale human annotation to detect rare but consequential events (e.g., fraud, defects, and medical abnormalities). When positives are rare, the prevalence effect induces systematic cognitive biases that inflate misses and can propagate through the AI lifecycle via biased training labels. We analyze prior experimental evidence and run a field experiment on DiagnosUs, a medical crowdsourcing platform, in which we hold the true prevalence in the unlabeled stream fixed (20% blasts) while varying (i) the prevalence of positives in the gold-standard feedback stream (20% vs. 50%) and (ii) the response interface (binary labels vs. elicited probabilities). We then post-process probabilistic labels using a linear-in-log-odds recalibration approach at the worker and crowd levels, and train convolutional neural networks on the resulting labels. Balanced feedback and probabilistic elicitation reduce rare-event misses, and pipeline-level recalibration substantially improves both classification performance and probabilistic calibration; these gains carry through to downstream CNN reliability out of sample.
Summary
Main Finding
When positive events are rare, providing balanced gold-standard feedback and eliciting probabilistic (rather than binary) judgments from annotators reduces misses on rare events. Applying linear-in-log-odds recalibration at the worker and crowd levels further improves label quality (both classification and calibration), and these improvements persist when the labels are used to train downstream convolutional neural networks, yielding more reliable out-of-sample AI performance.
Key Points
- Prevalence effect: Low true prevalence of positives leads annotators to systematically under-report positives (higher miss rates), which biases training labels and harms downstream models.
- Experimental manipulation: On a medical crowdsourcing platform (DiagnosUs), true prevalence in the unlabeled stream was held at 20% blasts while two factors were varied:
  - Gold-standard feedback prevalence shown to workers: 20% vs. 50%.
  - Response interface: binary labels vs. elicited probabilities.
- Eliciting probabilities reduced miss rates relative to binary labels.
- Providing balanced feedback (50% positives in the gold-standard feedback stream) reduced prevalence-driven bias compared with reflecting the true (rarer) prevalence in feedback.
- Post-collection recalibration using a linear-in-log-odds model applied at both individual-worker and crowd levels substantially improved probabilistic calibration and classification metrics.
- Benefits of these human-level interventions and recalibration transferred to downstream CNNs trained on the processed labels: better classification performance and more reliable probability estimates out of sample.
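The linear-in-log-odds recalibration named in the key points can be written as an affine map in log-odds space. The notation here is an assumption for illustration and is not taken from the study: $p$ is a reported probability, $\hat{p}$ the recalibrated probability, and $a$, $b$ the fitted intercept and slope.

$$
\operatorname{logit}(\hat{p}) = a + b\,\operatorname{logit}(p),
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1-p}
$$

Equivalently, with $\delta = e^{a}$ and $\gamma = b$:

$$
\hat{p} = \frac{\delta\,p^{\gamma}}{\delta\,p^{\gamma} + (1-p)^{\gamma}}
$$

As a worked example with made-up parameters: a reported $p = 0.2$ has odds $0.25$. With $a = \ln 4$ and $b = 1$, the recalibrated odds are $4 \times 0.25 = 1$, so $\hat{p} = 0.5$. A positive intercept therefore shifts probabilities upward, which is the direction needed when reports are systematically too low on rare positives; a slope above 1 pushes reports toward the extremes.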
Data & Methods
- Setting: Field experiment on DiagnosUs, a medical crowdsourcing platform, in which workers labeled images for "blasts" (a rare medical abnormality).
- Design:
  - True prevalence in the unlabeled labeling stream fixed at 20% positives.
  - Randomized orthogonal manipulations: feedback prevalence (20% vs. 50%) and response interface (binary vs. probabilistic estimates).
- Label processing:
  - Collected probabilistic labels where applicable.
  - Applied linear-in-log-odds recalibration to individual worker outputs and to aggregated crowd outputs. (In practice: fit an affine transform in log-odds space that maps worker-reported probabilities to calibrated probabilities.)
- Downstream modeling:
  - Trained convolutional neural networks on labels produced under the different conditions (raw vs. recalibrated).
  - Evaluated both classification performance (e.g., miss/false-positive behavior) and probabilistic calibration out of sample.
- Outcome measures: miss rates on rare positives, aggregate classification metrics, and calibration (implicit from description; exact metrics and sample sizes not stated in the summary).
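The recalibration and outcome measures described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the study's implementation: the function names (`fit_llo`, `apply_llo`, `miss_rate`, `brier_score`), the Newton fitting procedure, the 0.5 decision threshold, and the synthetic data are all hypothetical choices made here. It fits the affine map in log-odds space by minimizing log-loss against gold-standard labels.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_llo(p_reported, y_gold, n_iter=50, ridge=1e-8):
    """Fit logit(p_cal) = a + b * logit(p_reported) by Newton's method on log-loss.

    Assumption: gold-standard 0/1 labels are available for the fitted items.
    """
    x = logit(p_reported)
    y = np.asarray(y_gold, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    theta = np.array([0.0, 1.0])  # start from the identity map
    for _ in range(n_iter):
        mu = sigmoid(X @ theta)
        grad = X.T @ (mu - y)
        w = mu * (1 - mu)
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(2)
        step = np.linalg.solve(hess, grad)
        theta = theta - step
        if np.max(np.abs(step)) < 1e-10:
            break
    return float(theta[0]), float(theta[1])

def apply_llo(p_reported, a, b):
    """Map reported probabilities to recalibrated ones."""
    return sigmoid(a + b * logit(p_reported))

def miss_rate(p, y, threshold=0.5):
    """Share of true positives scored below the threshold (false negative rate)."""
    p, y = np.asarray(p, dtype=float), np.asarray(y)
    positives = y == 1
    return float(np.sum((p < threshold) & positives) / np.sum(positives))

def brier_score(p, y):
    """Mean squared error between probabilities and 0/1 outcomes."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

# Synthetic demo (made-up data): reports are biased low relative to the truth.
rng = np.random.default_rng(0)
true_logit = rng.normal(-1.4, 1.5, size=5000)
y = (rng.random(5000) < sigmoid(true_logit)).astype(float)
reported = sigmoid((true_logit - 0.8) / 1.5)

a, b = fit_llo(reported, y)
recal = apply_llo(reported, a, b)
print("a, b:", a, b)
print("miss rate raw vs. recalibrated:", miss_rate(reported, y), miss_rate(recal, y))
print("Brier raw vs. recalibrated:", brier_score(reported, y), brier_score(recal, y))
```

The same `fit_llo` call could be applied once per worker and again to aggregated crowd probabilities. How the study aggregates workers is not stated in this summary, so that step is left out of the sketch.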
Implications for AI Economics
- Labeling pipeline design matters for rare-event detection: Small changes to feedback composition and the elicitation interface can materially affect label quality and thus model reliability, implying procurement contracts and annotation platform design should internalize these effects.
- Cost–benefit tradeoffs: Balancing feedback toward more positives (e.g., 50% in gold-standard feedback) likely requires sourcing or synthesizing more positive examples and may raise short-term costs, but can reduce costly misses downstream—important when misses have high economic or safety costs.
- Eliciting probabilities adds information value: Probabilistic judgments are more informative than binary labels for rare events, enabling better recalibration and improved downstream models without necessarily increasing labeling volume.
- Pipeline-level calibration is high-leverage: Simple post-hoc recalibration (linear-in-log-odds) at worker and crowd levels can recover substantial downstream performance and calibration, offering a low-cost intervention compared with collecting more labels or building more complex models.
- Procurement and incentives: Contracts and UI choices (feedback prevalence shown, whether to ask for probabilities) are levers that platforms and firms can use to reduce systematic annotation bias; these should be considered alongside traditional price-per-label decisions.
- Generalization & risk: Improvements in annotation and calibration reduce model risk for rare but consequential events, affecting expected loss calculations, insurance, regulatory compliance, and investment decisions in AI systems deployed in high-stakes domains.
- Research directions for economic analysis: quantify the cost-effectiveness of balanced feedback and probabilistic elicitation across prevalence regimes; incorporate these interventions into active learning and optimal labeling budgets; study long-run worker learning and strategic responses to feedback prevalence.
Assessment
Claims (7)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| When positives are rare, the prevalence effect induces systematic cognitive biases that inflate misses and can propagate through the AI lifecycle via biased training labels. | Error Rate | negative | medium | miss rate (false negative rate) for rare positives; downstream bias in training labels | 0.6 |
| In a field experiment on the DiagnosUs medical crowdsourcing platform, the authors held the true prevalence in the unlabeled stream fixed at 20% (blasts) while varying the prevalence of positives in the gold-standard feedback stream (20% vs. 50%) and the response interface (binary labels vs. elicited probabilities). | Other | null_result | high | experimental manipulations (true prevalence, feedback prevalence, response interface); these are independent variables in the study design | 1.0 |
| Balanced feedback (higher positive prevalence in the feedback stream) and probabilistic elicitation reduce rare-event misses. | Error Rate | positive | medium | rare-event miss rate (false negative rate for positive examples) | 0.6 |
| Post-processing probabilistic labels using a linear-in-log-odds recalibration approach at the worker and crowd levels substantially improves classification performance. | Output Quality | positive | medium | classification performance of models trained on labels (e.g., accuracy, AUC or other classification metrics) | 0.6 |
| Pipeline-level recalibration substantially improves probabilistic calibration of labels. | Output Quality | positive | medium | probabilistic calibration (e.g., calibration error, Brier score, reliability diagrams) | 0.6 |
| The improvements from balanced feedback, probabilistic elicitation, and pipeline-level recalibration carry through to downstream convolutional neural network (CNN) reliability out of sample. | Output Quality | positive | medium | downstream CNN out-of-sample reliability (e.g., generalization performance, accuracy, calibration on held-out test data) | 0.6 |
| Eliciting probabilities (instead of forcing binary labels) enables post-hoc recalibration that improves both individual-worker and crowd-level label quality. | Output Quality | positive | medium | label quality at worker and crowd levels (measured via calibration and classification-relevant metrics) | 0.6 |