AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.

Summary

Main Finding

AgroVG is a new multi-source benchmark that frames agricultural visual grounding as a generalized set-prediction problem (return all matching instances or abstain). It contains 10,071 annotation-grounded image–query pairs drawn from 10 datasets across six agricultural target families, and provides two complementary tasks: T1 bounding-box grounding (all six families) and T2 instance-mask grounding (sources with reliable instance-level pixel annotations). Zero-shot evaluation of 26 model configurations (closed-source MLLMs, open-source VLMs, specialized grounding systems) shows large gaps: the best multi-target Set-F1 reaches only 0.35 and the best positive-query mask success rate at IoU@0.75 stays below 0.17, with frequent failures on multi-target completeness, target-absent abstention (hallucination), and fine-grained mask grounding.

Key Points

Problem framing:
- Agricultural grounding → generalized set-prediction: queries may expect one, many, or zero targets.
- Two output granularities: T1 (boxes) and T2 (instance masks).
- Evaluation regimes: single-target, multi-target (K>1), target-absent (K=0).
Dataset scope:
- 10,071 image–query pairs, from 10 source datasets.
- Six target families: crop/weed, fruit, wheat head, pest, plant disease, tree canopy.
- T1 uses nine sources (all families); T2 uses five mask-capable sources (crop/weed, fruit, plant disease).
- Per-family query counts (approx.): crop/weed 2,335; fruit 2,457; wheat head 1,091; pest 1,255; plant disease 2,358; tree canopy 575.
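As a quick consistency check, the per-family counts listed above can be summed and converted to shares of the benchmark. The figures are copied from the list; the variable names are illustrative only.

```python
# Per-family query counts as listed in the summary above.
family_counts = {
    "crop/weed": 2335,
    "fruit": 2457,
    "wheat head": 1091,
    "pest": 1255,
    "plant disease": 2358,
    "tree canopy": 575,
}

# The counts should add up to the reported benchmark size.
total = sum(family_counts.values())
shares = {name: count / total for name, count in family_counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The listed counts add up to the reported 10,071 pairs, and tree canopy is the smallest family by a wide margin.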
Construction principles:
- Normalize heterogeneous annotations into a unified instance-level schema.
- Audit-driven sampling to balance sources, families, and instance densities.
- Expert visual review and split-aware query generation so every query is annotation-grounded and verified.
- Do not synthesize masks from boxes; use only native masks or masks derived from pixel-level annotations.
Evaluation:
- T1: box-set matching protocol (metrics like Set-F1).
- T2: query-level mask coverage protocol (IoU thresholds, per-query success).
- Multi-granularity reporting across regimes, families, sources, and model types.
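To make the box-set idea concrete, here is a minimal sketch of a set-level F1 for one query. It is not the benchmark's official protocol: the greedy one-to-one matching, the IoU threshold of 0.5, and the convention that an empty prediction on an empty target set scores 1.0 are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def set_f1(pred: List[Box], gt: List[Box], iou_thr: float = 0.5) -> float:
    """Set-level F1 under greedy one-to-one matching (assumed protocol)."""
    if not pred and not gt:
        return 1.0  # assumed convention: correct abstention
    if not pred or not gt:
        return 0.0  # hallucination or missed targets
    # Consider candidate pairs from highest IoU to lowest.
    pairs = sorted(
        ((box_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thr:
            break
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# Three targets, two of them found: precision 1.0, recall 2/3.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
pred_boxes = [(0, 0, 10, 10), (21, 21, 30, 30)]
example_score = set_f1(pred_boxes, gt_boxes)
print(round(example_score, 3))
```

A score of this kind penalizes both missed instances (lower recall) and spurious boxes (lower precision), which is why multi-target queries are harder than single-target ones under a set-oriented metric.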
Performance findings:
- Substantial headroom for improvement—current models struggle especially on multi-instance completeness and high-IoU mask success.
- Models either hallucinate on target-absent queries or over-abstain on positives, indicating calibration/uncertainty issues.
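The two failure modes in the last point can be measured separately. The sketch below assumes each query is summarized as a pair (number of ground-truth targets, number of predicted targets); the function name and record format are hypothetical, not taken from the benchmark code.

```python
from typing import Dict, List, Tuple


def abstention_rates(records: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize abstention behavior from (n_gt, n_pred) pairs per query.

    hallucination_rate: share of target-absent queries with any prediction.
    over_abstention_rate: share of positive queries with no prediction.
    """
    absent = [n_pred for n_gt, n_pred in records if n_gt == 0]
    positive = [n_pred for n_gt, n_pred in records if n_gt > 0]
    hallucination = (
        sum(n > 0 for n in absent) / len(absent) if absent else float("nan")
    )
    over_abstention = (
        sum(n == 0 for n in positive) / len(positive) if positive else float("nan")
    )
    return {
        "hallucination_rate": hallucination,
        "over_abstention_rate": over_abstention,
    }


# Toy records: three target-absent queries and three positive queries.
toy_records = [(0, 0), (0, 2), (1, 1), (3, 0), (2, 2), (0, 1)]
rates = abstention_rates(toy_records)
print(rates)
```

Reporting both rates together makes the trade-off visible: a model can lower one simply by raising the other, so neither number is informative alone.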

Data & Methods

Sources and normalization:
- Aggregates datasets such as PhenoBench, CropAndWeed, MinneApple, ACFR, MegaFruits, GWHD, IP102, PlantSeg, OAM-TCD, and others.
- Unified instance schema: image metadata, target family, instance ID, bbox, mask reference when available, annotation provenance.
- Masks used only when pixel-level evidence exists; derivation rules applied where appropriate (not from boxes alone).
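A unified instance schema of this kind could be sketched as follows. All class and field names are hypothetical, chosen to mirror the items listed above; the rule that a query's regime follows from its number of target IDs reflects the single/multi/absent framing in the summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instance:
    instance_id: str
    family: str                              # e.g. "fruit", "pest"
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    mask_ref: Optional[str] = None           # None when no pixel-level evidence exists
    provenance: str = ""                     # where the annotation came from


@dataclass
class ImageRecord:
    image_id: str
    source: str
    width: int
    height: int
    instances: List[Instance] = field(default_factory=list)


@dataclass(frozen=True)
class Query:
    query_id: str
    image_id: str
    text: str
    target_ids: Tuple[str, ...]  # empty tuple for target-absent queries

    @property
    def regime(self) -> str:
        k = len(self.target_ids)
        if k == 0:
            return "target-absent"
        return "single-target" if k == 1 else "multi-target"


# Illustrative records; identifiers and values are made up.
record = ImageRecord(
    image_id="img_0001",
    source="example_source",
    width=640,
    height=480,
    instances=[
        Instance("i1", "fruit", (10, 10, 50, 50), mask_ref="masks/i1.png"),
        Instance("i2", "fruit", (200, 40, 260, 100)),  # box only, no mask
    ],
)
q_multi = Query("q1", "img_0001", "all the fruit", ("i1", "i2"))
q_absent = Query("q2", "img_0001", "the pest on the leaf", ())
print(q_multi.regime, q_absent.regime)
```

Keeping `mask_ref` optional, rather than filling it from the box, matches the stated principle that masks are never synthesized from boxes.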
Sampling and quality control:
- Audit table per image (source, family, queryable instance counts, density buckets, eligibility).
- Source-aware sampling and density stratification to avoid dominance by any single dataset.
- Expert review panels to remove corrupted/ambiguous/low-quality cases.
- Final dev/test splits assigned at image level (no overlap).
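The sampling and split steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the density bucket boundaries, the per-stratum cap, the 50/50 split fraction, and the audit-row keys are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set


def density_bucket(n_instances: int) -> str:
    """Assumed bucket boundaries, for illustration only."""
    if n_instances == 0:
        return "empty"
    if n_instances == 1:
        return "single"
    return "few" if n_instances <= 10 else "dense"


def stratified_sample(audit_rows: List[Dict], per_stratum: int, seed: int = 0) -> List[str]:
    """Take at most `per_stratum` eligible images per (source, density) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in audit_rows:
        if row["eligible"]:
            key = (row["source"], density_bucket(row["n_instances"]))
            strata[key].append(row["image_id"])
    chosen: List[str] = []
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        chosen.extend(ids[:per_stratum])
    return chosen


def split_by_image(image_ids: List[str], test_fraction: float = 0.5, seed: int = 0) -> Dict[str, Set[str]]:
    """Assign whole images to dev or test so no image appears in both."""
    ids = sorted(set(image_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return {"test": set(ids[:n_test]), "dev": set(ids[n_test:])}


# Toy audit table with two sources; source A is much larger than source B.
audit = [
    {"image_id": f"A{i}", "source": "A", "n_instances": i % 15, "eligible": True}
    for i in range(30)
] + [
    {"image_id": f"B{i}", "source": "B", "n_instances": 1, "eligible": i % 2 == 0}
    for i in range(10)
]
sampled = stratified_sample(audit, per_stratum=3)
splits = split_by_image(sampled)
print(len(sampled), len(splits["dev"]), len(splits["test"]))
```

Capping each stratum keeps a large source from dominating, and splitting at the image level means every query about an image lands in the same split.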
Query generation:
- Split-aware, template-based queries covering category-level and fine-grained references (geometric ranks, spatial selection, relative relations, quantification).
- Every positive query is traceable to verified target IDs; target-absent queries are verified empty relative to benchmark annotations.
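A template-based generator that keeps every query traceable to instance IDs might look like the sketch below. The templates, function names, and the rule of skipping ambiguous cases are assumptions for illustration; the benchmark's actual templates are not reproduced here.

```python
from typing import Dict, List, Optional


def all_query(instances: List[Dict], category: str) -> Dict:
    """Quantified query 'all the <category>' grounded to every matching ID."""
    ids = [inst["id"] for inst in instances if inst["category"] == category]
    return {"text": f"all the {category}", "target_ids": ids}


def leftmost_query(instances: List[Dict], category: str) -> Optional[Dict]:
    """Spatial-selection query 'the leftmost <category>'.

    Returns None when the query would be ill-posed: fewer than two
    candidates (no real selection) or a tie for the leftmost position.
    An empty candidate set yields a target-absent query instead.
    """
    text = f"the leftmost {category}"
    matches = [inst for inst in instances if inst["category"] == category]
    if not matches:
        return {"text": text, "target_ids": []}
    if len(matches) < 2:
        return None
    # Rank candidates by the x-coordinate of the box center.
    ranked = sorted(
        ((inst["bbox"][0] + inst["bbox"][2]) / 2, inst["id"]) for inst in matches
    )
    if ranked[0][0] == ranked[1][0]:
        return None  # ambiguous answer, skip rather than guess
    return {"text": text, "target_ids": [ranked[0][1]]}


# Illustrative annotations for one image (made-up values).
annotations = [
    {"id": "a1", "category": "apple", "bbox": (50, 10, 70, 30)},
    {"id": "a2", "category": "apple", "bbox": (5, 12, 25, 32)},
    {"id": "l1", "category": "leaf", "bbox": (0, 0, 10, 10)},
]
q_left = leftmost_query(annotations, "apple")
q_none = leftmost_query(annotations, "pear")
q_all = all_query(annotations, "apple")
print(q_left, q_none, q_all)
```

Because each query carries its `target_ids`, a reviewer can check the text against the annotated instances, and an empty list marks a target-absent query relative to the benchmark annotations.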
Evaluation setup:
- Zero-shot evaluation of 26 model configurations: closed-source multi-modal LLMs (MLLMs), open-source vision-language models (VLMs), and specialized grounding systems.
- T1 metrics: set-oriented box matching (Set-F1, precision/recall by set).
- T2 metrics: per-query mask IoU thresholds (e.g., IoU@0.75) and mask coverage statistics.
- Diagnostic breakdowns by target family, regime (single/multi/absent), and instance density.
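For the mask task, a per-query success rate at an IoU threshold can be sketched as below. The assumption that query-level IoU compares the union of predicted masks with the union of ground-truth masks is ours; the benchmark's exact coverage protocol may differ. Masks are represented as sets of (row, col) pixels to keep the example dependency-free.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def rect(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Helper: rectangular mask covering rows [r0, r1) and cols [c0, c1)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def union_mask(masks: List[Mask]) -> Mask:
    out: Mask = set()
    for m in masks:
        out |= m
    return out


def query_mask_iou(pred_masks: List[Mask], gt_masks: List[Mask]) -> float:
    """IoU between the union of predictions and the union of targets (assumed)."""
    pred, gt = union_mask(pred_masks), union_mask(gt_masks)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0


def positive_success_rate(queries: List[Tuple[List[Mask], List[Mask]]], thr: float = 0.75) -> float:
    """Share of positive queries (non-empty ground truth) with IoU >= thr."""
    positives = [(p, g) for p, g in queries if union_mask(g)]
    if not positives:
        return float("nan")
    hits = sum(query_mask_iou(p, g) >= thr for p, g in positives)
    return hits / len(positives)


target = [rect(0, 0, 4, 4)]  # 16 pixels
toy_queries = [
    ([rect(0, 0, 4, 4)], target),  # IoU 1.0, success
    ([rect(0, 0, 4, 3)], target),  # IoU 12/16 = 0.75, success at the threshold
    ([rect(0, 0, 4, 2)], target),  # IoU 0.5, failure
    ([], []),                      # target-absent, excluded from this rate
]
rate = positive_success_rate(toy_queries)
print(round(rate, 3))
```

At a threshold of 0.75, a prediction that covers only half of a target fails outright, which helps explain why high-IoU mask success is much lower than box-level scores.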

Implications for AI Economics

Market demand and product differentiation
- Large performance gaps on realistic agricultural grounding tasks imply significant demand for specialized models and services (precision weeding, disease localization, targeted harvesting).
- Generalist MLLMs/VLMs are insufficient zero-shot; firms can capture value by offering domain-tuned models, data labeling, and integrated robot/edge solutions.
Data and annotation economics
- Construction required extensive annotation normalization, expert review, and split-aware query generation — indicating high per-sample human-labor costs.
- Economic opportunities: specialized annotation services, labeling marketplaces for agricultural masks/instance identities, and tools to reduce expert time (semi-automatic mask extraction).
- Sustainable dataset creation will likely need cost-sharing across stakeholders (agribusiness, equipment manufacturers, public research grants).
R&D and investment signals
- Low mask IoU performance and poor multi-target completeness point to promising R&D targets: better domain adaptation, fine-grained segmentation models, uncertainty-aware abstention mechanisms, and active learning to prioritize costly annotations.
- Investors should weigh opportunities in datasets, model fine-tuning pipelines, lightweight on-device inference for robotics, and hybrid human-AI systems.
Deployment risk and liability
- Hallucination on target-absent queries and incomplete multi-target retrieval translate directly into operational risk (missed pests/diseased plants or wasted interventions), affecting expected ROI of automation.
- Buyers of agricultural AI systems will value calibrated uncertainty, human-in-the-loop overrides, and conservative decision thresholds—these features can command premium pricing or be mandated by procurement standards.
Competitive dynamics and value capture
- Multi-source, open benchmarks like AgroVG lower barriers for entrants to demonstrate competence but also highlight need for proprietary labeled data and fine-tuned models to achieve production performance—encouraging hybrid strategies (open benchmarking + proprietary augmentation).
- Platform providers that combine hardware, perception models, and annotation pipelines may capture more value than pure-model vendors, because real-world efficacy depends on integrated sensing and verification.
Policy and public-good considerations
- Agricultural efficiency and food-supply resilience are public goods; public funding for high-quality, domain-specific datasets (and open benchmarks) can accelerate safe adoption while distributing development costs.
- Standardized evaluation (AgroVG) can inform regulatory guidance and procurement standards for agricultural automation.
Short recommendations for practitioners and policymakers
- For model buyers: require benchmark-based evaluations (including target-absent behavior), insist on uncertainty calibration, and budget for domain-specific fine-tuning/annotation.
- For researchers/companies: prioritize multi-instance recall, high-IoU mask accuracy, and abstention calibration; focus annotation budgets where marginal model gains are largest (small/occluded instances, dense clusters).
- For policymakers/institutions: subsidize shared datasets and labeling infrastructure to reduce duplication and lower adoption costs for smallholders.

Overall, AgroVG quantifies a clear performance gap between current generalist models and the requirements of practical agricultural grounding tasks. That gap creates concrete economic opportunities — and risks — across data services, model specialization, platform integration, and regulated deployment in precision agriculture.

Assessment

Paper Type: descriptive

Evidence Strength: n/a. This is a dataset and benchmark paper that evaluates model capability; it does not attempt causal identification or estimate causal effects.

Methods Rigor: high. The authors assemble a multi-source dataset (10,071 image–query pairs) spanning ten source datasets and six agricultural target families, define clear task formulations (box and mask set prediction, existence-aware abstention), provide task-specific protocols and metrics (e.g., Set-F1, mask coverage at IoU thresholds), and run a broad zero-shot evaluation across 26 model configurations including closed-source MLLMs, open-source VLMs, and specialized grounding systems; potential limitations (subset mask availability, zero-shot setting, and source sampling biases) are acknowledged but do not undermine the benchmark design.

Sample: 10,071 annotation-grounded image–query pairs pooled from ten publicly available source datasets covering six target families (crop/weed, fruit, wheat head, pest, plant disease, and tree canopy); supports bounding-box grounding across all families and instance-mask grounding where reliable pixel annotations exist; queries include single-target, multi-target, and target-absent cases.

Themes: productivity, adoption

Generalizability:
- Geographic and crop-type coverage is limited to the ten source datasets used and may underrepresent global crop diversity and regional imaging conditions.
- Instance-mask annotations are available only for a subset of sources, so mask evaluation does not cover all target families equally.
- Evaluation is primarily in zero-shot model settings; fine-tuned or domain-adapted model performance may differ substantially.
- Possible annotation biases or inconsistencies across source datasets (labeling protocols, query phrasing) could affect comparability.
- The benchmark uses static images and language queries, so results may not fully capture performance in robotic or real-time field deployments.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Visual grounding is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting.	Innovation Output	positive	high	ability to enable agricultural applications (selective weeding, disease monitoring, targeted harvesting)	0.03
Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image.	Other	negative	high	difficulty of reliable evaluation for agricultural visual grounding	0.18
Evaluating agricultural visual grounding therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention.	Other	positive	high	completeness of evaluation (localization accuracy, completeness, abstention)	0.03
We introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present.	Adoption Rate	positive	high	benchmark formulation (generalized set prediction capability)	0.18
AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy.	Other	positive	high	number of annotation-grounded image-query pairs and coverage across target families	n=10071; 10,071 annotation-grounded image-query pairs; 0.3
AgroVG supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes.	Other	positive	high	support for bounding-box and instance-mask grounding across target families and query regimes	0.3
AgroVG provides task-specific protocols for box-set matching and query-level mask coverage.	Other	positive	high	availability of evaluation protocols (box-set matching, mask coverage)	0.18
Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35.	Output Quality	negative	high	multi-target Set-F1	n=26; Set-F1 = 0.35; 0.3
Zero-shot evaluation shows the best positive-query mask success rate at IoU@0.75 remains below 0.17.	Output Quality	negative	high	positive-query mask success rate at IoU@0.75	n=26; mask success rate < 0.17 at IoU@0.75; 0.3
Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.	Other	positive	high	data and code availability	0.3

A new 10,071-pair agricultural visual-grounding benchmark exposes large capability gaps in current models: the best zero-shot multi-target Set-F1 is only 0.35 and the best positive-query mask success rate at IoU@0.75 remains below 0.17, underscoring limits to reliable field automation.