The Commonplace
Home Dashboard Papers Evidence Digests 🎲
← Papers

Prompt framing creates a confirmation-bias blind spot in LLM code reviewers: when changes are framed as bug-free or as security fixes, detection rates fall sharply and attackers can reintroduce known vulnerabilities in up to 88% of trials against an autonomous agent; removing metadata or adding explicit instructions largely undoes the effect.

Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review
Dimitris Mitropoulos, Nikolaos Alexopoulos, Georgios Alexopoulos, Diomidis Spinellis · March 19, 2026
arxiv quasi_experimental medium evidence 7/10 relevance Source PDF
Framing code changes as bug-free or as security/functional improvements systematically induces confirmation bias in LLM-based code review—substantially reducing vulnerability detection and enabling adversarial reintroduction of known vulnerabilities, though simple debiasing (metadata redaction/explicit instructions) largely restores detection.

Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, with strongly asymmetric effects: false negatives increase sharply while false positive rates change little. Bias effects vary by vulnerability type, with injection flaws being more susceptible to them than memory corruption bugs. Study 2 evaluates exploitability in practice mimicking adversarial pull requests that reintroduce known vulnerabilities while framed as security improvements or urgent functionality fixes via their pull request metadata. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Our results show that confirmation bias poses a weakness in LLM-based code review, with implications on how AI-assisted development tools are deployed.

Summary

Main Finding

LLM-based code review exhibits strong confirmation bias: when commit/PR metadata frames a change as bug-free or security-improving, vulnerability detection rates fall dramatically (16–93 percentage points across models). This effect is asymmetric (false negatives rise much more than false positives) and varies by vulnerability type (injection-related flaws are more susceptible than memory-corruption bugs). Adversaries can exploit this bias in realistic supply-chain attacks—successful approvals occur in 35% of one-shot attacks against an interactive assistant (GitHub Copilot) and up to 88% against an autonomous review agent (Claude Code) when attackers iteratively refine their framing. Simple debiasing (explicit instructions to ignore metadata and metadata redaction) largely restores detection.

Key Points

  • Scope and scale
    • Controlled experiments: ~10,000 LLM queries (4 models × 5 framing conditions × 497 files) using 250 CVE–patched pairs (selected from CrossVuln).
    • Models evaluated: GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash, DeepSeek V3.
    • Framing conditions: Neutral, Weak Bug, Strong Bug, Weak Bug-free, Strong Bug-free.
  • Main quantitative results
    • Bug-free framing reduced detection rates by 16–93 percentage points depending on model and condition.
    • Framing effects are strongly asymmetric: large increases in false negatives; false positive rates change little.
    • Study 2 exploitability: adversarial PR metadata caused acceptance/approval in 35% of one-shot attacks against Copilot and 88% against Claude Code (autonomous) when attackers could iterate.
    • Debiasing: explicit instructions to ignore metadata recovered 100% of detections for interactive assistants; combining metadata redaction and explicit instructions recovered ~93.75–94% effectiveness for autonomous agents.
  • Heterogeneity and failure modes
    • Injection-type vulnerabilities (e.g., XSS, SQLi) were more vulnerable to framing effects than memory-safety issues (e.g., buffer overflows).
    • Many high-failure cases involved missing protections plausibly framed as unnecessary overhead, enabling plausible adversarial narratives.
  • Practical artifacts and disclosure
    • Authors released prompt templates, ~10k LLM responses, 51 synthetic PRs (34 interactive, 17 autonomous), manual annotations, and replication packages.
    • Experiments performed in isolated environments; maintainers were notified.

Data & Methods

  • Dataset
    • Source: CrossVuln dataset (real-world CVEs and patches).
    • Final sample: 250 CVE–patched pairs (497 files after exclusions), stratified across languages and CWEs (C, PHP, JavaScript; CWE Top 25 types including XSS, SQL Injection, OOB Read/Write).
    • Patch characteristics: median patch size = 5 lines; many changes are small/localized.
  • Experimental design (Study 1: controlled bias measurement)
    • Each file evaluated under five framing conditions with a structured required output (VULNERABLE: YES/NO; LINE_NUMBER; CODE_FRAGMENT; EXPLANATION) to allow automated parsing and manual validation.
    • Metrics: true positive rate (TPR) on vulnerable files, false positive rate (FPR) on patched files. Confirmation bias quantified as change in TPR between neutral and bug-free framings; significance via two-proportion z-tests and Cohen’s h.
    • Manual validation: every positive detection on vulnerable files compared against ground-truth CVE patches to classify detections as correct / incorrect / generic. Additional manual analysis of consensus failures (cases where all models failed under bug-free framing) and sampled false positives.
  • Exploitability experiments (Study 2)
    • Threat model: adversary reintroduces known vulnerable code (revert of the patch) in a PR, with metadata crafted to frame the change as benign or security-improving.
    • Deployment contexts: interactive assistant (GitHub Copilot using GPT-4 option) and autonomous review agent (Claude Code action in CI).
    • Synthetic/private repo setup used to replay real vulnerabilities and measure approval/acceptance rates; adversaries allowed iterative refinement for autonomous-agent attacks.
    • Evaluated debiasing strategies: metadata redaction and explicit prompt instructions to ignore metadata.
  • Output and reproducibility
    • Approximately 9,940 queries / ~10,000 LLM responses saved; full replication artifacts publicly released.

Implications for AI Economics

  • Increased systemic risk and negative externalities
    • Wide deployment of LLM-based automated review increases the scale at which confirmation-bias-based attacks can be attempted cheaply: framing manipulations are low-cost to produce and can be automated, enabling mass-targeted supply-chain attacks that create outsized negative externalities across downstream ecosystems.
    • Small, plausibly benign changes (median patch ~5 lines) are especially at risk; because many packages are transitively reused, a single successful bypass can produce large downstream costs.
  • Incentives, moral hazard, and reliance
    • Organizations that substitute human review with LLM assistants or fully autonomous agents reduce marginal human oversight costs but increase exposure to targeted manipulations. There is a moral-hazard incentive to underinvest in safeguards (metadata policies, human checks) if immediate productivity gains are prioritized over low-probability high-impact supply-chain failure.
    • Attackers exploit predictable behavior; markets will value vendors and integrators that demonstrably mitigate confirmation-bias vulnerabilities.
  • Market and regulatory consequences
    • Demand for debiasing tools and services (metadata redaction, hardened review prompts, automated provenance checks) will grow. Vendors offering certified or audited LLM-review pipelines may command premium pricing.
    • Insurance and liability markets will need new risk models: insurers may require documented debiasing and human-in-the-loop controls for coverage of supply-chain security incidents.
    • Regulators and large platform operators might mandate provenance controls or minimum auditability for automated code-review agents used in critical infrastructure and widely reused OSS components.
  • Cost–benefit tradeoffs and operational recommendations
    • The paper shows relatively cheap, effective mitigations (explicit ignore-metadata instructions, metadata redaction) that largely recover detection. Implementing these mitigations is likely to be cost-effective relative to the high expected loss from supply-chain compromise—suggesting a near-term, high ROI for adopting them.
    • However, for autonomous agents (CI-integrated), mitigations are slightly less effective and may require combining redaction with stricter guardrails and human approvals for high-risk repositories.
    • Firms must balance throughput gains from automation against added governance costs (tooling, audits, policy enforcement). Economic decisions should internalize the tail risk: higher-value or high-dependency projects warrant stronger controls and manual checkpoints.
  • Longer-term market dynamics
    • The discovery of a behavioral failure mode that depends on natural-language metadata creates incentives for model vendors to (a) harden models against contextual anchoring, (b) offer prompt-guardrail features, and (c) provide certified “security-mode” review endpoints.
    • Open-source maintainers and large consumers of OSS may form standards (e.g., metadata hygiene, automated provenance verification) that new tools must support; hybrid human+LLM workflows that combine automated triage with mandatory human sign-off on security-relevant PRs will become a differentiator.
  • Research and audit economics
    • The public release of artifacts lowers monitoring and auditing costs for third parties and may catalyze markets for independent audits of LLM-assisted security tools. Continued benchmarking and red-team evaluations should be economically incentivized (bug bounty programs, third-party attestations).

Takeaway: Confirmation bias in LLM-assisted code review is a material economic risk for software supply chains because it enables low-cost, scalable attacks and changes the value proposition of automated review tools. The paper identifies cheap, effective mitigations that firms should adopt immediately while encouraging market and regulatory adjustments (auditing, insurance conditions, vendor certification) to internalize and price the residual risk.

Assessment

Paper Typequasi_experimental Evidence Strengthmedium — Strong internal validity from controlled manipulations, a sizable sample of 250 CVE/patch pairs, multiple state-of-the-art models, and both measurement and exploitability experiments; however, external validity is limited by the specific set of models, CVEs, repository contexts, simulated PR workflows, and potential model/version drift, so results may not generalize across all real-world deployments. Methods Rigorhigh — Systematic, controlled experimental design with multiple framing conditions, cross-model comparisons, explicit adversarial tests, and evaluation of mitigations (redaction/instructions); sample size is substantial for this type of study and outcomes are measured quantitatively, though the paper does not report (in the provided summary) field deployment trials with human reviewers or long-term ecological validation. SampleDataset of 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art LLM code-review models under five prompt/PR framing conditions (Study 1); Study 2 used adversarially constructed pull requests that reintroduce known vulnerabilities tested against GitHub Copilot (interactive assistant) in one-shot attacks and Claude Code (autonomous agent) in iterative attack setups within real project configurations; debiasing interventions (metadata redaction and explicit instructions) were also evaluated. Themeshuman_ai_collab governance IdentificationControlled framing experiments that vary only the review prompt/PR metadata while holding the underlying CVE/patch code constant: detection rates compared across five randomized framing conditions for 250 CVE/patch pairs and across four LLMs; complementary adversarial PR experiments manipulate metadata (security-improvement vs. bug-free phrasing) in one-shot and iterative attack setups and test debiasing (metadata redaction, explicit instructions) to identify causal effect of framing on vulnerability detection. GeneralizabilityLimited to the specific LLM models and versions tested (results may differ on other or updated models), CVEs and codebases sampled may not reflect all vulnerability types or real-world repository diversity, Simulated pull requests and automated agents may not capture interactions with human reviewers or organizational review processes, Adversary capabilities in the experiments (e.g., ability to iteratively refine framing) may be stronger or weaker than in practice, Temporal validity: model behaviour can change with updates, fine-tuning, or different system prompts

Claims (8)

ClaimDirectionConfidenceOutcomeDetails
Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs evaluated across four state-of-the-art models under five framing conditions for the review prompt. Error Rate mixed high confirmation bias as measured by vulnerability detection performance
n=250
0.8
Framing a change as bug-free reduces vulnerability detection rates by 16-93%. Error Rate negative high vulnerability detection rate
n=250
16-93% reduction
0.8
The framing effect is strongly asymmetric: false negatives increase sharply while false positive rates change little. Error Rate negative high false negative rate and false positive rate
n=250
0.48
Bias effects vary by vulnerability type, with injection flaws being more susceptible to framing bias than memory corruption bugs. Error Rate negative medium change in vulnerability detection rate by vulnerability type
0.29
Adversarial pull request framing (e.g., labeled as security improvements or urgent functionality fixes) succeeds in reintroducing known vulnerabilities in 35% of cases against GitHub Copilot under one-shot attacks. Error Rate negative high attack success rate (vulnerability reintroduction accepted/not detected)
35% success rate
0.48
Adversarial framing succeeds in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing to increase attack success. Error Rate negative high attack success rate (vulnerability reintroduction accepted/not detected)
88% success rate
0.48
Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and 94% of autonomous cases. Error Rate positive medium restoration of vulnerability detection (post-intervention detection rate)
100% restored (interactive), 94% (autonomous)
0.29
Confirmation bias poses a weakness in LLM-based code review, with implications on how AI-assisted development tools are deployed. Organizational Efficiency negative high reliability/security of LLM-based code review
0.48

Notes