Multimodal LLMs can 'know' when text conflicts with what they see or hear but usually don't say so; probing shows mismatch signals in hidden states and a decoding bottleneck that prevents grounded rejections, especially for audio.
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
Summary
Main Finding
The paper documents a consistent Representation–Action Gap in omnimodal LLMs: models reliably encode that a textual premise conflicts with their visual or audio input (representation), but that internal signal usually fails to influence the model’s behavior (action). In practice, models often accept false premises about what they see/hear rather than reject them, and this failure is modality-asymmetric (audio grounding is weaker). A lightweight diagnostic intervention (probe-guided logit adjustment, PGLA) that reinjects the encoded mismatch into decoding materially improves rejection rates, indicating the bottleneck is translation to output rather than perception.
Key Points
- IMAVB benchmark:
- 500 long-form movie clips (1–5 min, 20.7 hours) from three sources.
- 2×2 design: modality targeted (vision, audio) × premise condition (standard, misleading).
- Misleading variants swap exactly one premise detail; answers E/F correspond to “visual detail incorrect” / “audio detail incorrect.”
- Behavioral outcomes:
- Most open-source models under-reject misleading premises: standard accuracy is reasonable (40–75%), yet they catch misleading vision premises in at most 16.2% of cases and misleading audio premises in at most 6.6% (some models score 0% on audio under fixed option order).
- Two models (Qwen3-Omni, Gemini 3.1 Pro) over-reject: they flag many misleading premises but also reject standard questions, which costs them standard QA accuracy.
- The failure modes are robust to prompt variants and option shuffling, and they are not limited to open-source models: the proprietary Gemini 3.1 Pro was tested as well.
- Representational analysis:
- Linear probes recover the standard vs. misleading distinction from hidden states at high accuracy (up to ~86% on vision probes).
- After projecting out text-predictive features (residualization), vision probe accuracy remains above text-only baselines, which indicates that the mismatch signal is genuinely multimodal rather than a lexical cue.
- The audio-side multimodal signal is weaker than the vision-side.
- Logit-lens analyses show the correct-token signal is often aligned in mid-stack representations but does not propagate to the final output distribution.
- Cross-modal interference:
- Effects vary by architecture. Removing audio sometimes improves visual misleading detection (audio interference), but not uniformly; one model (MiniCPM-o 2.6) showed AV synergy.
- Diagnostic intervention:
- Probe-Guided Logit Adjustment (PGLA): using the probe’s encoded mismatch signal to adjust logits at decoding yields an average +15.0 percentage-point improvement in rejection across the eight open-source models, evidence that the hidden signal is actionable once it is reintroduced.
- Experimental rigor:
- Multiple diagnostics (prompt ablations, temporal stratifications, option shuffles) and cross-validation grouped by video to avoid leakage.
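Grouping cross-validation folds by video means that every question variant derived from one clip sits on the same side of the train/test split, so a probe cannot score well just by recognizing a clip it has already seen. The following is a minimal sketch of that idea in plain Python, not the paper's code: the function name `group_kfold`, the greedy balancing rule, and the `clip_N` identifiers are assumptions made for illustration.

```python
from collections import defaultdict

def group_kfold(groups, n_splits=4):
    """Yield (train_idx, test_idx) pairs in which all samples that share a
    group label (here: a video id) fall into the same fold."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least as many groups as folds")

    # Greedy balancing (an illustrative choice): place the largest groups
    # first, each into the fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        lightest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[lightest].extend(members[g])

    for f in range(n_splits):
        test = sorted(folds[f])
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, test

# Hypothetical data: 8 clips, each contributing 4 question variants.
video_ids = [f"clip_{i // 4}" for i in range(32)]
splits = list(group_kfold(video_ids, n_splits=4))

for train, test in splits:
    train_videos = {video_ids[i] for i in train}
    test_videos = {video_ids[i] for i in test}
    # No video contributes samples to both sides of a split.
    assert train_videos.isdisjoint(test_videos)
```

A plain random split over the 32 samples would almost certainly place variants of one clip in both train and test, which is the leakage the grouped scheme is meant to prevent.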
Data & Methods
- Dataset construction:
- Movie cut-scenes preserved intact (no edits to video/audio).
- Three-pass annotation pipeline producing temporally grounded captions; QA generated by Qwen3.5-27B and manually quality-checked across 500 items.
- Each item yields four question variants: standard vs. misleading for vision and audio targets; same A–D options for both, with E/F appended for misleading cases.
- Models evaluated:
- Eight open-source omnimodal LLMs: OLA, OmniVinci, Qwen2.5-Omni, MiniCPM-o 2.6, Uni-MoE-2.0-Omni, Baichuan-Omni-1.5, Video-SALMONN-2, Qwen3-Omni.
- One proprietary model (Gemini 3.1 Pro) included for baseline behavior.
- Evaluation protocol:
- Six-choice multiple choice: A–D content answers; E = “visual detail incorrect”; F = “audio detail incorrect.”
- Fixed option order and K=3 random shuffles to measure position bias.
- Temperature=0 decoding with top-p/k=1; metrics: per-split accuracy (std_v, std_a, mis_v, mis_a) and balanced accuracy.
- Interpretability tools:
- Linear probes (logistic regression) trained per layer on last-token hidden states with 4-fold group CV (groups by video).
- Residualized probes: project out text-predictive variance using ridge regression to isolate multimodal signal from lexical cues.
- Logit-lens projection: project intermediate hidden states through the model’s RMSNorm + unembedding to compute per-layer probability mass on the ground-truth token.
- Diagnostic intervention:
- Probe-Guided Logit Adjustment (PGLA): use probe outputs to adjust decoding logits at inference and measure behavioral change.
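To make the PGLA idea concrete, here is a toy sketch of one way a probe output could be folded into option logits. The summary above does not give the paper's exact update rule, so everything specific here is an assumption: the additive log-odds shift, the strength parameter `alpha`, the function names `pgla` and `pick`, and the example logits are hypothetical. Only the meaning of options E and F follows the evaluation protocol described above.

```python
import math

# E = "visual detail incorrect", F = "audio detail incorrect" (per the protocol).
REJECT = {"vision": "E", "audio": "F"}

def pgla(option_logits, probe_prob, modality, alpha=2.0, eps=1e-6):
    """Return a copy of the option logits in which the rejection option for
    `modality` is shifted by alpha * logit(probe_prob).

    probe_prob is the probe's estimated probability that the premise
    conflicts with the input. The shift is zero at 0.5, positive above it,
    and negative below it. This update rule is an illustrative assumption.
    """
    p = min(max(probe_prob, eps), 1.0 - eps)  # keep the log-odds finite
    shift = alpha * math.log(p / (1.0 - p))
    adjusted = dict(option_logits)
    adjusted[REJECT[modality]] += shift
    return adjusted

def pick(logits):
    """Greedy choice: the option with the largest logit."""
    return max(logits, key=logits.get)

# Hypothetical logits for one misleading vision question: the model prefers
# content answer A even though rejection option E has some support.
logits = {"A": 2.1, "B": 0.3, "C": -0.5, "D": 0.0, "E": 0.4, "F": -1.0}

before = pick(logits)
after = pick(pgla(logits, probe_prob=0.9, modality="vision"))
```

In this toy case a confident probe (0.9) moves the greedy choice from A to E, while an uninformative probe (0.5) leaves the logits unchanged. A shift that is too aggressive would push a model toward the over-rejection failure mode, so `alpha` would need tuning against standard-question accuracy.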
Implications for AI Economics
- Deployment risk & externalities:
- Omnimodal agents that “accept” false premises about the environment (silent compliance) pose outsized economic and safety risks in real-world deployments (autonomous inspection, surveillance, trading based on sensor inputs, decision support). Economic losses can arise when downstream actors trust model outputs unaware of unexpressed internal uncertainty.
- Evaluation & procurement:
- Standard cooperative benchmarks and task accuracy metrics can mask grounding failures. Procuring or pricing models should include representation→action gap diagnostics (e.g., IMAVB-style tests, probes, logit-lens checks) as part of model evaluation and SLAs.
- Value of interpretability & lightweight fixes:
- The paper shows interpretability tools can reveal actionable signals not present in outputs and that inexpensive inference-time fixes (PGLA) can materially improve behavior. For firms and regulators, investing in probe-based auditing and low-cost inference corrections may be cost-effective compared with full re-training or larger architectures.
- Product design and incentive structures:
- If agents are rewarded only for task performance, they may learn behaviors that ignore internal mismatch signals. Economic mechanisms (contracts, reward shaping, verification clauses) should align incentives to ensure internal evidence of mismatch is surfaced. This matters for marketplaces of AI services, insurance models, and liability allocation.
- Market differentiation and certification:
- Models that demonstrably translate perceptual signals into refusals/corrections can command premium value in safety-critical domains. Certification standards could require tests for representation→action alignment across modalities.
- Research & policy directions for AI economics:
- Quantify the economic benefit of closing the gap: estimate avoided losses from increased rejection accuracy in targeted domains (e.g., inspection, medical triage, autonomous monitoring).
- Model incentives for providers to deploy probe-based mitigations versus retraining; analyze costs, latency trade-offs, and consumer willingness to pay.
- Design contract terms and regulatory tests that require models to (a) surface internal uncertainty/mismatch signals, (b) have traceable audit logs of decision provenance, and (c) meet minimum multimodal grounding thresholds.
- Operational recommendations:
- Incorporate internal-state probes and logit-lens audits into model evaluation pipelines used by procurement teams.
- For latency-tolerant contexts, deploy probe-guided inference adjustments as a stopgap while pursuing architectural/training fixes.
- Prioritize improving audio-grounding capabilities if the use case relies on audio evidence, since audio grounding is the weaker modality.
- Limitations relevant to economic interpretation:
- Results are based on long-form movie clips; domain transfer (real-world sensors, live audio) needs testing. Economic impact estimates must account for domain differences.
- Intervention (PGLA) is diagnostic; long-term robustness and adversarial resilience require further evaluation.
Overall, the paper implies that economic actors (buyers, regulators, insurers) should not rely solely on observed task accuracy for multimodal agents: internal representations can be accurate without behavioral manifestation, and cost-effective audit/intervention tools exist that materially change outcomes.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. | positive | high | other | n=500; 0.3 |
| Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise–perception mismatches even when the same models almost never reject the false claim in their outputs. | negative | high | decision_quality | n=9; 0.18 |
| Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. | negative | high | error_rate | n=500; 0.18 |
| The gap is modality-asymmetric (audio grounding underperforms vision). | negative | high | decision_quality | n=500; 0.18 |
| The gap is prompt-resistant across seven variants. | negative | high | decision_quality | n=500; 0.18 |
| As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. | positive | high | decision_quality | n=500; 0.18 |
| Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception. | negative | medium | decision_quality | n=500; 0.11 |