Showing models' spatial uncertainty to annotators increases labeling quality while cutting annotation time in a controlled 120-person study. Uncertainty cues focus human effort on poorly localized boxes and away from well-localized ones, improving annotation efficiency.

From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review

Moussa Kassem Sbeyti, Joshua Holstein, Philipp Spitzer, Nadja Klein, Gerhard Satzger · May 12, 2026

arXiv · RCT · medium evidence · 7/10 relevance

Visualizing localization uncertainty in an annotation interface causally improves bounding-box label quality and reduces annotation time in a 120-participant controlled study by redirecting annotator effort to high-uncertainty predictions.

High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at https://mos-ks.github.io/MUHA/.

Summary

Main Finding

Communicating model-estimated localization (aleatoric) uncertainty through color-coded visual cues in an annotation interface both improves bounding-box label quality and reduces annotation time. In a randomized controlled experiment (120 participants, 1,800 trials) reviewing model-proposed boxes on autonomous-driving images (KITTI), the uncertainty-aware interface produced a small but statistically significant mean IoU (mIoU) gain (~0.70 percentage points; p = 0.015, d = 0.40) and a 7.2% reduction in annotation time per image (26.69 s → 24.76 s; p = 0.045). Benefits grow with image difficulty: gains are larger on medium and difficult images. Code: https://mos-ks.github.io/MUHA/.

Key Points

Problem addressed: model outputs for object detection have two independent components (class label and spatial boundaries). Class confidence does not reveal mislocalization risk, so human reviewers often underinspect spatial errors.
Intervention: an interface that visualizes per-coordinate aleatoric localization uncertainty (σ_top, σ_bottom, σ_left, σ_right) as color-coded box borders (blue = certain → red = uncertain), guiding reviewer attention to spatially unreliable predictions.
Experimental design:
- Randomized controlled experiment: Baseline (standard boxes + class score) vs. Uncertainty Visualization.
- 120 participants, 1,800 annotation trials on KITTI images relabeled by authors for ground truth.
- Image difficulty binned into Easy / Medium / Difficult based on detector uncertainty.
Outcomes:
- Annotation quality: Baseline participants: 88.28% ± 1.81 mIoU; with uncertainty cues: 88.98% ± 1.69 mIoU. One-tailed t-test p = 0.015. LME confirms treatment effect (β = 0.70, SE = 0.32, p = 0.030).
- Efficiency: time per image reduced by 7.2% (26.69s → 24.76s). One-tailed t-test p = 0.045. LME directionally consistent; significance stronger on medium/difficult images.
- Mechanism (attention reallocation): box-level analysis (Hungarian matching; N = 10,574 box observations) and GEE logistic regression show that uncertainty cues increase edit rates on high-uncertainty boxes and decrease edits on low-uncertainty boxes (interaction β = 0.736, p < .001). Without cues, annotators tended to adjust low-uncertainty boxes instead.
- Cognitive load and satisfaction: no significant increase in self-reported cognitive load; satisfaction similar across conditions.
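The interaction reported in the mechanism bullet can be written out explicitly. The form below is a sketch consistent with the predictors this summary lists (condition, log-transformed localization uncertainty, and their interaction); the symbols T_i (1 if participant i saw uncertainty cues, 0 otherwise), σ_ij (localization uncertainty of box j reviewed by participant i), and the coefficient names are notation introduced here, and the base of the logarithm is not stated in this summary.

```latex
\operatorname{logit} \Pr(\text{edit}_{ij} = 1)
  = \beta_0 + \beta_1 T_i + \beta_2 \log \sigma_{ij}
  + \beta_3 \, T_i \log \sigma_{ij},
\qquad \hat{\beta}_3 = 0.736
```

Under this reading, a positive β_3 means the log-odds of editing a box rise more steeply with uncertainty when cues are shown than in the baseline. On the odds scale, the per-unit effect of log σ is larger in the treatment group by a factor of exp(0.736) ≈ 2.09.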
Robustness: effects persist in medium+difficult subset; attenuate when controlling for initial prediction quality in the full sample (expected because easy images offer little room for improvement).
Scope/limitations noted by authors: study on autonomous-driving images (KITTI); intervention communicates aleatoric (data) uncertainty, not epistemic model uncertainty; gains depend on meaningful uncertainty calibration.

Data & Methods

Data:
- Dataset: KITTI autonomous-driving images; authors relabeled a ground-truth set to correct benchmarking imprecision.
- Model outputs: probabilistic object detector that produces per-coordinate aleatoric localization uncertainty (σ for each box edge).
Interface:
- Baseline: standard predicted bounding boxes and class scores.
- Treatment: same predictions plus color-coded box borders reflecting localization uncertainty.
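As a rough illustration of the treatment interface, the sketch below maps each edge's uncertainty to a border color on a blue-to-red ramp. The linear ramp, the clipping bounds `sigma_min` and `sigma_max`, and the function names are assumptions made for this example; the summary states only that borders run from blue (certain) to red (uncertain).

```python
# Hypothetical helper names; the linear blue-to-red ramp is an assumption.
EDGES = ("top", "bottom", "left", "right")


def sigma_to_rgb(sigma, sigma_min, sigma_max):
    """Map an uncertainty value to an (R, G, B) color: blue = certain, red = uncertain."""
    if sigma_max <= sigma_min:
        raise ValueError("sigma_max must exceed sigma_min")
    # Normalize to [0, 1] and clip values that fall outside the chosen range.
    t = (sigma - sigma_min) / (sigma_max - sigma_min)
    t = min(1.0, max(0.0, t))
    return (round(255 * t), 0, round(255 * (1.0 - t)))


def edge_colors(sigmas, sigma_min, sigma_max):
    """Return one border color per box edge from a dict of per-edge sigmas."""
    return {edge: sigma_to_rgb(sigmas[edge], sigma_min, sigma_max) for edge in EDGES}


# Example box: well-localized top and left edges, uncertain right edge.
example = {"top": 0.5, "bottom": 4.0, "left": 0.5, "right": 8.0}
for edge, rgb in edge_colors(example, 0.0, 8.0).items():
    print(edge, rgb)
```

Coloring each edge separately, rather than the whole box, is what lets a reviewer see which side of a box is likely misplaced.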
Experimental protocol:
- Participants randomized to Baseline vs. Uncertainty Visualization.
- 1,800 total annotation trials across 120 participants.
- Image difficulty binned by predicted localization uncertainty.
Analyses:
- Primary comparisons: independent-samples t-tests for mIoU and time per image (one-tailed where justified).
- Mixed models: Linear Mixed Effects (LME) models with treatment as fixed effect and participant random intercepts for repeated measures; robustness checks (winsorizing, covariate adjustment for self-efficacy and task familiarity).
- Box-level behavior: Hungarian matching of predicted vs. submitted boxes; Generalized Estimating Equations (GEE) logistic regression with an exchangeable working correlation to model the probability that a box was edited. Predictors: condition, log-transformed localization uncertainty, and their interaction.
- Cognitive load: self-reported measures included as covariates; no significant interaction with treatment.
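The box-matching step can be sketched as follows. This is not the authors' code: it swaps in an exhaustive search over one-to-one assignments for the Hungarian algorithm. Both return an assignment with maximum total IoU, but exhaustive search is only practical for a handful of boxes; `scipy.optimize.linear_sum_assignment` is the usual choice at scale. Using IoU as the matching score and an IoU tolerance to decide whether a box counts as edited are assumptions of this example.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(predicted, submitted):
    """Return (pred_idx, sub_idx) pairs that maximize total IoU (brute force)."""
    swap = len(predicted) > len(submitted)
    small, large = (submitted, predicted) if swap else (predicted, submitted)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(iou(small[i], large[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(perm))
    pairs = [(j, i) if swap else (i, j) for i, j in best_pairs]
    return sorted(pairs)


def edited_flags(predicted, submitted, tol=1e-6):
    """Flag a matched predicted box as edited if its IoU with the submission is below 1 - tol."""
    return {p: iou(predicted[p], submitted[s]) < 1.0 - tol
            for p, s in match_boxes(predicted, submitted)}


predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
submitted = [(20, 20, 30, 31), (0, 0, 10, 10)]  # second box was stretched by the annotator
print(match_boxes(predicted, submitted))
print(edited_flags(predicted, submitted))
```

The resulting per-box edit flags are the kind of binary outcome a box-level logistic model would take as its response.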
Statistical highlights:
- mIoU treatment effect: t(117.4) = 2.19, p = 0.015; LME β = 0.70, SE = 0.32, p = 0.030.
- Time per image: t(117.4) = −1.714, p = 0.045 (7.2% reduction).
- Box-level GEE: treatment × uncertainty interaction β = 0.736, p < .001 (treatment redirects edits to high-uncertainty boxes).
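The reported mIoU statistics can be cross-checked from the summary values alone. The sketch below assumes that the ± figures are standard deviations and that the 120 participants were split 60/60 between conditions; neither is stated in this summary. The fractional degrees of freedom are consistent with Welch's unequal-variance t-test, so that is the formula used here.

```python
import math

# Reported summary statistics (mIoU, in percentage points).
mean_base, sd_base = 88.28, 1.81
mean_cue, sd_cue = 88.98, 1.69
n_base = n_cue = 60  # ASSUMPTION: equal split of the 120 participants


def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def cohens_d(m1, s1, m2, s2):
    """Cohen's d with the pooled SD for equal group sizes."""
    return (m1 - m2) / math.sqrt((s1 ** 2 + s2 ** 2) / 2)


t_stat, df = welch_from_summary(mean_cue, sd_cue, n_cue, mean_base, sd_base, n_base)
d = cohens_d(mean_cue, sd_cue, mean_base, sd_base)
print(f"t({df:.1f}) = {t_stat:.2f}, d = {d:.2f}")  # prints: t(117.4) = 2.19, d = 0.40
```

Under these assumptions the computation reproduces the reported t(117.4) = 2.19 and d = 0.40. The time-per-image test cannot be checked the same way because its standard deviations are not given here.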

Implications for AI Economics

Reduced labeling cost per label (time): observed 7.2% time savings implies that, for time-priced annotation labor, the same budget can procure roughly 7–8% more reviewed images (1 / (1 − 0.072) ≈ 1.078). This is a simple illustrative ROI; exact savings depend on task mix and worker pay models.
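The throughput arithmetic can be made concrete with a few lines. The per-image times come from the study; the 100,000-image volume is a hypothetical figure chosen only to illustrate scale, and the function names are introduced for this example.

```python
def time_reduction(t_base, t_new):
    """Fractional reduction in time per image."""
    return 1.0 - t_new / t_base


def throughput_ratio(t_base, t_new):
    """Images reviewed per unit of budget, relative to baseline, under time-priced labor."""
    return t_base / t_new


def hours_saved(t_base, t_new, n_images):
    """Reviewer hours saved over n_images at the given per-image times (seconds)."""
    return (t_base - t_new) * n_images / 3600.0


T_BASE, T_CUE = 26.69, 24.76   # seconds per image, as reported
N_IMAGES = 100_000             # HYPOTHETICAL volume, for illustration only

print(f"time reduction:   {time_reduction(T_BASE, T_CUE):.1%}")
print(f"throughput ratio: {throughput_ratio(T_BASE, T_CUE):.3f}")
print(f"hours saved:      {hours_saved(T_BASE, T_CUE, N_IMAGES):.1f}")
```

The throughput ratio exceeds one plus the time reduction because the same budget is divided by a smaller per-image cost.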
Improved data value and downstream performance: a measurable mIoU increase (~0.7 percentage points overall, larger on harder examples) implies higher-quality ground truth that can reduce label-noise propagation into model retraining cycles. Because labeling errors compound across iterations, even modest per-label quality gains can yield disproportionate downstream model-performance and operational-safety benefits—especially in high-stakes domains (autonomous driving, healthcare).
Targeted review & prioritization (cost-effective auditing): localization uncertainty can be used as a prioritization signal in review pipelines (review high-localization-uncertainty boxes first). This aligns annotation effort with value: reviewers focus on labels where human correction yields the largest marginal improvement, which improves cost-effectiveness relative to uniform review.
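One way such prioritization might look in a review pipeline is sketched below. The `Box` record, the use of the largest per-edge σ as the ranking score, and the fixed review budget are all assumptions of this example; the study itself evaluates a visual cue, not an automated queue.

```python
from dataclasses import dataclass


@dataclass
class Box:
    box_id: str
    sigmas: tuple  # per-edge uncertainty: (top, bottom, left, right)


def review_queue(boxes, budget):
    """Return up to `budget` boxes, most uncertain first.

    ASSUMPTION: a box is scored by its single most uncertain edge.
    """
    ranked = sorted(boxes, key=lambda b: max(b.sigmas), reverse=True)
    return ranked[:budget]


boxes = [
    Box("a", (0.4, 1.0, 0.3, 0.2)),
    Box("b", (0.5, 0.6, 5.0, 0.4)),
    Box("c", (3.0, 0.2, 0.1, 0.3)),
]
for box in review_queue(boxes, budget=2):
    print(box.box_id, max(box.sigmas))
```

Scoring by the maximum edge uncertainty surfaces boxes with one badly placed side; a mean or sum would favor boxes that are moderately uncertain on every side instead.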
Productization and procurement:
- Labeling platforms and buyers (enterprises, ML teams) can integrate localization-uncertainty visualization and/or use localization-uncertainty heuristics to price or route tasks (e.g., premium for uncertain-box review, or automated triage).
- Buyers should require probabilistic detection outputs (or require vendors to compute localization uncertainty) to realize these gains.
Complementarity with active learning and budgeting:
- Combine localization uncertainty with active-selection strategies to allocate labeling budget between new-image annotation vs. review of model-proposed labels.
- Use uncertainty-aware routing to decide which instances to invest more reviewer time in or to request additional annotator redundancy.
Implementation caveats and risks:
- Calibration matters: benefits rely on meaningful (well-calibrated) localization uncertainty estimates. Poor calibration could misdirect attention and create false assurance.
- Aleatoric vs. epistemic: this approach addresses irreducible data noise (aleatoric). Epistemic uncertainty (model uncertainty) requires separate estimation and may also be valuable for prioritization.
- Human factors: the study used 120 participants (not specified as professional annotators); transfer to production labeling teams may require retraining, UX tuning, and continuous monitoring.
- Perverse incentives and gaming: if reviewers or vendors are paid per edit/time, routing that highlights easy corrections could change behavior; contract and QC design should account for this.
Strategic takeaway for AI economics:
- Small per-instance efficiency and quality gains from interface-level improvements can scale into large aggregate cost savings and substantial improvements in model training set quality when deployed across large labeling pipelines.
- Investments in probabilistic models and uncertainty-aware tooling are likely cost-effective, especially where (a) labeling volume is large, (b) spatial accuracy matters, and (c) labeling budgets are constrained.

Assessment

Paper Type: rct
Evidence Strength: medium. The study uses a controlled experiment with a reasonably sized sample (120 participants), which provides credible internal validity for the causal effect of the UI treatment; however, external validity is limited by the lab-style setting, unspecified participant population and dataset diversity, potential short-term exposure only, and missing detail on pre-registration/power and blinding.
Methods Rigor: medium. Design includes random assignment and fine-grained box-level analysis, and code availability increases transparency; but the abstract omits details on recruitment (crowdworkers vs experts), dataset/task variety, balance checks, multiple hypothesis correction, and longer-run or production-setting robustness, which limits assessment of overall rigor.
Sample: Controlled study with N=120 human participants who performed bounding-box annotation tasks on images where model predictions provided class labels and spatial boxes; treatment participants saw visualizations of localization uncertainty; the abstract does not report participant recruitment source, expertise, or the specific image dataset(s) used.
Themes: productivity, human_ai_collab
Identification: Randomized controlled experiment. Participants were assigned to annotation interfaces with vs without visualized spatial (localization) uncertainty, and causal effects were estimated by comparing label quality and annotation time across groups, with box-level analyses linking effort shifts to uncertainty cues.
Generalizability:
- Lab/controlled-task setting may not reflect real-world, large-scale annotation pipelines.
- Unknown participant pool (crowdworkers vs expert annotators) limits external validity.
- Results are specific to bounding-box localization tasks and a particular UI design; they may not generalize to segmentation, keypoints, or other annotation modalities.
- Dataset composition and object classes are unspecified, so effects may vary with task difficulty and image complexity.
- Short-term study; unclear whether improvements persist over longer annotation sessions or scale to production workflows.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality.	Output Quality	positive	high	label quality	n=120 0.6
In the same controlled study, participants who received uncertainty cues were faster overall (reduced annotation time).	Task Completion Time	positive	high	task completion time	n=120 0.6
A box-level analysis confirms that the uncertainty cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes.	Task Allocation	positive	high	annotator effort allocation across predicted boxes	0.6
Visualizing spatial (localization) uncertainty in the annotation interface improves human-in-the-loop annotation (i.e., localization uncertainty is a lever to improve annotation quality/efficiency).	Task Allocation	positive	high	human-in-the-loop annotation quality and efficiency	n=120 0.6
Existing AI-assisted annotation workflows typically offer annotators no signal about where spatial (localization) errors are most likely, causing humans to potentially underinspect subtly misplaced boxes.	Error Rate	negative	medium	rate of underinspection / missed localization errors	0.18
AI-assisted annotation has become standard in large-scale labeling workflows.	Adoption Rate	positive	medium	adoption of AI-assisted annotation	0.18