Multimodal LLMs can 'know' when text conflicts with what they see or hear but usually don't say so; probing shows mismatch signals in hidden states and a decoding bottleneck that prevents grounded rejections, especially for audio.

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, Ziwei Liu · May 13, 2026

arXiv · Descriptive · Medium evidence · Relevance 7/10

Omnimodal LLMs reliably encode conflicts between sensory input and a textual premise in their hidden states but often fail to express those conflicts in outputs, indicating a translation (action) bottleneck rather than a perception failure.

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

Summary

Main Finding

The paper documents a consistent Representation–Action Gap in omnimodal LLMs: models reliably encode that a textual premise conflicts with their visual or audio input (representation), but that internal signal usually fails to influence the model’s behavior (action). In practice, models often accept false premises about what they see/hear rather than reject them, and this failure is modality-asymmetric (audio grounding is weaker). A lightweight diagnostic intervention (probe-guided logit adjustment, PGLA) that reinjects the encoded mismatch into decoding materially improves rejection rates, indicating the bottleneck is translation to output rather than perception.

Key Points

IMAVB benchmark:
- 500 long-form movie clips (1–5 min each, 20.7 hours in total) from three sources.
- 2×2 design: modality targeted (vision, audio) × premise condition (standard, misleading).
- Misleading variants swap exactly one premise detail; answers E/F correspond to “visual detail incorrect” / “audio detail incorrect.”
Behavioral outcomes:
- Most open-source models under-reject misleading premises: they reach good standard accuracy (40–75%) but catch misleading vision premises in at most 16.2% of cases and misleading audio premises in at most 6.6% (some models score 0% on audio under fixed option order).
- Two models (Qwen3-Omni, Gemini 3.1 Pro) over-reject: they catch many misleading premises but sacrifice standard QA accuracy.
- The failure modes are robust to prompt variants and option shuffling; results generalize beyond open-source models (Gemini tested).
Representational analysis:
- Linear probes recover the standard vs. misleading distinction from hidden states at high accuracy (up to ~86% on vision probes).
- After projecting out text-predictive features (residualization), vision probe accuracy remains above text-only baselines — i.e., the multimodal mismatch signal is genuinely present.
- The audio-side multimodal signal is weaker than the vision-side.
- Logit-lens analyses show the correct-token signal is often aligned in mid-stack representations but does not propagate to the final output distribution.
Cross-modal interference:
- Effects vary by architecture. Removing the audio stream sometimes improves detection of misleading visual premises (audio interference), but not uniformly; one model (MiniCPM-o 2.6) showed audio-visual (AV) synergy instead.
Diagnostic intervention:
- Probe-Guided Logit Adjustment (PGLA): using the probe’s encoded mismatch to adjust logits at decoding yields an average +15.0 percentage-point improvement in rejection across the eight open-source models—evidence the hidden signal is actionable if reintroduced.
Experimental rigor:
- Multiple diagnostics (prompt ablations, temporal stratifications, option shuffles) and cross-validation grouped by video to avoid leakage.
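Grouping cross-validation folds by video matters because each clip yields several question variants; if variants of one clip landed in both train and test folds, a probe could score well by recognizing the clip. The following is a minimal sketch of a grouped split in plain Python. The function name `group_k_fold`, the greedy fold assignment, and the toy video IDs are assumptions made for illustration, not the authors' code.

```python
from collections import defaultdict


def group_k_fold(groups, k=4):
    """Yield (train_idx, test_idx) pairs where no group spans both sides.

    groups: one group label (here, a video ID) per sample.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        min(folds, key=len).extend(by_group[g])

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j, fold in enumerate(folds) if j != f for i in fold)
        yield train, test


# Toy data (hypothetical): 8 videos, 4 question variants each.
groups = [f"video_{v}" for v in range(8) for _ in range(4)]
splits = list(group_k_fold(groups, k=4))
```

Each test fold here holds every variant of two videos, so a probe is always scored on clips it never saw during training.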

Data & Methods

Dataset construction:
- Movie cut-scenes preserved intact (no edits to video/audio).
- Three-pass annotation pipeline producing temporally grounded captions; QA generated by Qwen3.5-27B and manually quality-checked across 500 items.
- Each item yields four question variants: standard vs. misleading for vision and audio targets; same A–D options for both, with E/F appended for misleading cases.
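To make the 2x2 layout concrete, here is a small sketch of how one clip could expand into its four question variants. The class and function names, the example questions, and the choice of E (vision) or F (audio) as the gold answer for misleading variants are assumptions for illustration; the benchmark's actual schema is not described in this summary.

```python
from dataclasses import dataclass

# Rejection options as worded in the evaluation protocol.
REJECT_OPTIONS = {"E": "visual detail incorrect", "F": "audio detail incorrect"}


@dataclass(frozen=True)
class QuestionVariant:
    clip_id: str
    modality: str   # "vision" or "audio"
    premise: str    # "standard" or "misleading"
    question: str
    options: dict   # option letter -> option text
    answer: str     # gold option letter


def build_variants(clip_id, questions, options, gold):
    """Expand one clip into the four cells of the 2x2 design.

    questions: {(modality, premise): question text}
    options:   {modality: {"A": ..., "B": ..., "C": ..., "D": ...}}
    gold:      {modality: correct A-D letter for the standard question}
    """
    variants = []
    for modality, reject_letter in (("vision", "E"), ("audio", "F")):
        for premise in ("standard", "misleading"):
            opts = dict(options[modality])  # same A-D options in both conditions
            answer = gold[modality]
            if premise == "misleading":
                opts.update(REJECT_OPTIONS)  # append E/F
                answer = reject_letter       # ASSUMPTION: gold is the matching rejection
            variants.append(QuestionVariant(
                clip_id, modality, premise,
                questions[(modality, premise)], opts, answer))
    return variants


# Hypothetical item, invented for illustration only.
variants = build_variants(
    "clip_001",
    questions={
        ("vision", "standard"): "What does the man in the red coat pick up?",
        ("vision", "misleading"): "What does the man in the blue coat pick up?",
        ("audio", "standard"): "What happens after the doorbell rings?",
        ("audio", "misleading"): "What happens after the phone rings?",
    },
    options={
        "vision": {"A": "A letter", "B": "A key", "C": "A hat", "D": "A cup"},
        "audio": {"A": "He stands up", "B": "He laughs", "C": "He leaves", "D": "He sits"},
    },
    gold={"vision": "B", "audio": "A"},
)
```

The misleading question differs from its standard counterpart in exactly one premise detail (coat color, sound source), which is what lets conflict detection be scored separately from ordinary comprehension.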
Models evaluated:
- Eight open-source omnimodal LLMs: OLA, OmniVinci, Qwen2.5-Omni, MiniCPM-o 2.6, Uni-MoE-2.0-Omni, Baichuan-Omni-1.5, Video-SALMONN-2, Qwen3-Omni.
- One proprietary model (Gemini 3.1 Pro) included for baseline behavior.
Evaluation protocol:
- Six-choice multiple choice: A–D content answers; E = “visual detail incorrect”; F = “audio detail incorrect.”
- Fixed option order and K=3 random shuffles to measure position bias.
- Temperature=0 decoding with top-p/k=1; metrics: per-split accuracy (std_v, std_a, mis_v, mis_a) and balanced accuracy.
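The per-split metrics can be sketched as follows. The split names come from the protocol above; the record format, the toy predictions, and in particular the definition of balanced accuracy as the unweighted mean of the four split accuracies are assumptions, since this summary does not give the paper's exact formula.

```python
from collections import defaultdict

SPLITS = ("std_v", "std_a", "mis_v", "mis_a")


def per_split_accuracy(records):
    """records: iterable of (split, predicted_letter, gold_letter)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, pred, gold in records:
        total[split] += 1
        correct[split] += int(pred == gold)
    return {s: correct[s] / total[s] for s in SPLITS if total[s]}


def balanced_accuracy(acc):
    # ASSUMPTION: unweighted mean over the four splits, so that high
    # standard accuracy cannot hide a near-zero rejection rate.
    return sum(acc[s] for s in SPLITS) / len(SPLITS)


# Toy predictions from a hypothetical under-rejecting model: it answers
# A-D even when the gold label is a rejection (E or F).
records = [
    ("std_v", "A", "A"), ("std_v", "B", "B"), ("std_v", "C", "C"), ("std_v", "A", "D"),
    ("std_a", "A", "A"), ("std_a", "B", "B"), ("std_a", "C", "A"), ("std_a", "D", "B"),
    ("mis_v", "E", "E"), ("mis_v", "A", "E"), ("mis_v", "B", "E"), ("mis_v", "C", "E"),
    ("mis_a", "A", "F"), ("mis_a", "B", "F"), ("mis_a", "C", "F"), ("mis_a", "D", "F"),
]
acc = per_split_accuracy(records)
score = balanced_accuracy(acc)
```

Reporting the four splits separately is what exposes both failure modes: under-rejection shows up as low `mis_v`/`mis_a`, over-rejection as depressed `std_v`/`std_a`.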
Interpretability tools:
- Linear probes (logistic regression) trained per layer on last-token hidden states with 4-fold group CV (groups by video).
- Residualized probes: project out text-predictive variance using ridge regression to isolate multimodal signal from lexical cues.
- Logit-lens projection: project intermediate hidden states through the model’s RMSNorm + unembedding to compute per-layer probability mass on the ground-truth token.
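The logit-lens step can be illustrated with a toy computation: normalize an intermediate hidden state with RMSNorm, multiply by the unembedding matrix, and read off the softmax probability of the ground-truth token. The dimensions, weights, and hidden states below are invented for illustration; a real analysis would use the model's own final-norm gain and unembedding weights.

```python
import math


def rms_norm(hidden, gain, eps=1e-6):
    """RMSNorm: scale by the root-mean-square, then apply a learned gain."""
    rms = math.sqrt(sum(x * x for x in hidden) / len(hidden) + eps)
    return [g * x / rms for x, g in zip(hidden, gain)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens_prob(hidden, gain, unembed, token_id):
    """Probability mass on token_id when decoding directly from `hidden`."""
    normed = rms_norm(hidden, gain)
    logits = [sum(w * x for w, x in zip(row, normed)) for row in unembed]
    return softmax(logits)[token_id]


# Toy setup (hypothetical): 2-d hidden states, 3-token vocabulary.
gain = [1.0, 1.0]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per token
layers = [[0.1, 0.1], [0.2, 1.0], [1.0, 0.2]]     # early, middle, final layer
gold_token = 1

probs = [logit_lens_prob(h, gain, unembed, gold_token) for h in layers]
```

In this toy example the middle layer puts the most mass on the gold token and the final layer the least, mirroring the reported pattern in which the correct-token signal is aligned mid-stack but does not reach the output distribution.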
Diagnostic intervention:
- Probe-Guided Logit Adjustment (PGLA): use probe outputs to adjust decoding logits at inference and measure behavioral change.
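One way such an adjustment could look is sketched below. The summary does not give PGLA's update rule, so everything here is an assumption: the bias is taken to be the probe's log-odds scaled by a hypothetical strength `alpha` and added to the logit of the relevant rejection option before the argmax.

```python
import math


def pgla_adjust(option_logits, p_mismatch, reject_option, alpha=2.0):
    """Shift the rejection option's logit by the probe's scaled log-odds.

    option_logits: {option letter: logit} from the model's answer position
    p_mismatch:    probe's probability that premise and perception conflict
    reject_option: "E" (vision) or "F" (audio)
    alpha:         hypothetical strength; ASSUMPTION, not from the paper
    """
    p = min(max(p_mismatch, 1e-6), 1 - 1e-6)  # keep the log-odds finite
    bias = alpha * math.log(p / (1 - p))      # > 0 when the probe sees a conflict
    adjusted = dict(option_logits)
    adjusted[reject_option] += bias
    return adjusted


def pick(option_logits):
    """Greedy choice, matching temperature-0 decoding."""
    return max(option_logits, key=option_logits.get)


# Toy logits (hypothetical): the model prefers content answer A.
logits = {"A": 2.0, "B": 0.5, "C": 0.1, "D": -0.3, "E": 1.0, "F": -1.0}

before = pick(logits)
after_conflict = pick(pgla_adjust(logits, p_mismatch=0.9, reject_option="E"))
after_no_conflict = pick(pgla_adjust(logits, p_mismatch=0.2, reject_option="E"))
```

Using log-odds means a confident "no conflict" reading pushes the rejection option down as well as up, which is one way to limit over-rejection on standard questions; whether the authors' method behaves this way is not stated in the summary.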

Implications for AI Economics

Deployment risk & externalities:
- Omnimodal agents that “accept” false premises about the environment (silent compliance) pose outsized economic and safety risks in real-world deployments (autonomous inspection, surveillance, trading based on sensor inputs, decision support). Economic losses can arise when downstream actors trust model outputs unaware of unexpressed internal uncertainty.
Evaluation & procurement:
- Standard cooperative benchmarks and task accuracy metrics can mask grounding failures. Procuring or pricing models should include representation→action gap diagnostics (e.g., IMAVB-style tests, probes, logit-lens checks) as part of model evaluation and SLAs.
Value of interpretability & lightweight fixes:
- The paper shows interpretability tools can reveal actionable signals not present in outputs and that inexpensive inference-time fixes (PGLA) can materially improve behavior. For firms and regulators, investing in probe-based auditing and low-cost inference corrections may be cost-effective compared with full re-training or larger architectures.
Product design and incentive structures:
- If agents are rewarded only for task performance, they may learn behaviors that ignore internal mismatch signals. Economic mechanisms (contracts, reward shaping, verification clauses) should align incentives to ensure internal evidence of mismatch is surfaced. This matters for marketplaces of AI services, insurance models, and liability allocation.
Market differentiation and certification:
- Models that demonstrably translate perceptual signals into refusals/corrections can command premium value in safety-critical domains. Certification standards could require tests for representation→action alignment across modalities.
Research & policy directions for AI economics:
- Quantify the economic benefit of closing the gap: estimate avoided losses from increased rejection accuracy in targeted domains (e.g., inspection, medical triage, autonomous monitoring).
- Model incentives for providers to deploy probe-based mitigations versus retraining; analyze costs, latency trade-offs, and consumer willingness to pay.
- Design contract terms and regulatory tests that require models to (a) surface internal uncertainty/mismatch signals, (b) have traceable audit logs of decision provenance, and (c) meet minimum multimodal grounding thresholds.
Operational recommendations:
- Incorporate internal-state probes and logit-lens audits into model evaluation pipelines used by procurement teams.
- For latency-tolerant contexts, deploy probe-guided inference adjustments as a stopgap while pursuing architectural/training fixes.
- Prioritize improving audio grounding if the use case relies on audio evidence, since the gap is modality-asymmetric (audio grounding is weaker than vision).
Limitations relevant to economic interpretation:
- Results are based on long-form movie clips; domain transfer (real-world sensors, live audio) needs testing. Economic impact estimates must account for domain differences.
- Intervention (PGLA) is diagnostic; long-term robustness and adversarial resilience require further evaluation.

Overall, the paper implies that economic actors (buyers, regulators, insurers) should not rely solely on observed task accuracy for multimodal agents: internal representations can be accurate without behavioral manifestation, and cost-effective audit/intervention tools exist that materially change outcomes.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The study uses a well-structured benchmark, multiple models, hidden-state probes, and an intervention (PGLA) that consistently changes behavior, which gives strong internal validity for the specific task and dataset. However, evidence is limited to a curated set of movie clips, a finite set of models (mostly open-source plus one closed), and synthetic 'misleading premise' prompts, so external validity to other modalities, domains, languages, and real-world deployments is uncertain.

Methods Rigor: high. Careful 2x2 experimental design, sizable benchmark (500 clips), evaluation across multiple model families including a leading closed-source model, systematic probe analyses linking representation to behavior, prompt-robustness checks, and an explicit intervention (PGLA) to test mechanism provide rigorous, multi-angle evidence; potential weaknesses are dataset curation choices and model coverage, which the authors acknowledge.

Sample: IMAVB, a curated benchmark of 500 long-form movie clips with questions constructed in a 2x2 crossing of target modality (vision vs audio) and premise condition (standard vs misleading); evaluated on eight open-source omnimodal LLMs plus Gemini 3.1 Pro, using seven prompt variants, probing hidden states for mismatch signals and applying a probe-guided logit adjustment intervention to decoding.

Themes: human_ai_collab, adoption

Identification: Constructed 2x2 benchmark (target modality: vision vs audio; premise condition: standard vs misleading) on 500 curated long-form movie clips to separate conflict detection from ordinary comprehension; compare internal representations (probes on hidden states) to behavioral outputs (model answers) to detect a Representation-Action Gap; validate robustness across eight open-source omnimodal LLMs plus Gemini 3.1 Pro, seven prompt variants, and test a causal intervention (probe-guided logit adjustment, PGLA) that re-injects the encoded mismatch signal into decoding to demonstrate the translation step drives failures.

Generalizability:
- Curated long-form movie clips may not reflect real-world conversational, instructional, or surveillance settings.
- Language and cultural scope likely limited (presumably English movie clips), limiting cross-linguistic generalizability.
- Models tested are a subset of current and future omnimodal architectures; results may differ for other architectures or larger proprietary models.
- Misleading premise formulation is synthetic; different types of contradictions or adversarial prompts may produce different behavior.
- Focuses on audio and vision modalities only; other sensors or multimodal contexts (e.g., live interaction, noisy audio) may show different patterns.

Claims (7)

Claim	Direction	Confidence	Outcome	Details
We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension.	positive	high	other	n=500 0.3
Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise–perception mismatches even when the same models almost never reject the false claim in their outputs.	negative	high	decision_quality	n=9 0.18
Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy.	negative	high	error_rate	n=500 0.18
The gap is modality-asymmetric (audio grounding underperforms vision).	negative	high	decision_quality	n=500 0.18
The gap is prompt-resistant across seven variants.	negative	high	decision_quality	n=500 0.18
As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.	positive	high	decision_quality	n=500 0.18
Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.	negative	medium	decision_quality	n=500 0.11