The Commonplace

Code-writing AI agents often pass tests but fail to respect project design standards: a benchmark of 495 real issues finds under half of auto-resolved patches meet repository-specific design constraints, and test success is a weak predictor of design compliance.

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
Kai Yu, Zhenhao Zhou, Junhao Zeng, Ying Wang, Xueying Du, Zhiqiang Yuan, Junwei Liu, Ziyu Zhou, Yujia Wang, Chong Wang, Xin Peng · April 07, 2026
arxiv · descriptive · medium evidence · 7/10 relevance
A new benchmark that makes implicit repository design constraints explicit shows that test-based correctness overestimates LLM-agent patch quality: fewer than half of resolved issues fully satisfy design constraints, and passing tests is poorly correlated with design compliance.

Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents SWE-Shield, a benchmark that makes such implicit design constraints explicit and measurable. SWE-Shield is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.

Summary

Main Finding

Pass rates on test suites substantially overstate the real-world quality of LLM‑generated patches. When design constraints (project conventions, error‑handling policies, maintainability rules, etc.) extracted from historical code reviews are taken into account, many test‑passing patches still violate important design requirements. The paper introduces SWE-Shield, a benchmark and toolchain to extract, validate, and verify such constraints, and shows that state‑of‑the‑art agents achieve far lower “design satisfaction” than test pass rates imply.

Key Points

  • Problem: Existing issue‑resolution benchmarks evaluate success primarily by test pass rate, but real PR acceptance depends heavily on implicit design constraints (architectural conventions, API consistency, error‑handling styles, cross‑cutting tradeoffs).
  • SWE-Shield: A new benchmark that explicitly links issues to validated, scenario‑grounded design constraints, enabling design‑aware evaluation of LLM‑based issue resolution.
  • Data scale: SWE-Shield contains 495 issues from 6 repositories, with 1,787 manually validated design constraints. Using DesignHunter, the authors initially identified 10,885 candidate constraints from pull requests.
  • DesignHunter: An LLM‑based two‑stage extraction pipeline:
    • Stage I: sliding‑window extraction of atomic design suggestions from review threads and validation against commit history (to see if suggestions were adopted).
    • Stage II: hierarchical clustering/aggregation to form generalized design constraints (problem description, options, applicable conditions, example code).
  • Patch verifier: An LLM‑based “judge” that assesses whether a generated patch satisfies the linked design constraints (outputs satisfied/violated/neutral).
  • Metrics introduced: Pass Rate (test correctness), Design Satisfaction Rate (DSR), Design Violation Rate (DVR).
  • Empirical results (representative figures reported):
    • Pass Rate: 70.25%–75.95% on SWE-Shield_verified; up to 42.69% on SWE-Shield_pro (harder set).
    • Design Satisfaction Rate: 32.64%–50.20% (much lower than pass rates).
    • Design Violation Rate: up to 45.85%.
    • Association between functional correctness and design compliance is weak/negligible (χ² tests mostly non‑significant; Cramér’s V ≤ 0.11).
    • Model choice yields only modest DSR gains (≤ ~12 percentage points across foundation models); many violations are shared across models.
    • Providing explicit, issue‑specific design guidance reduces violations modestly (DVR down by up to 6.35 percentage points), but residual violation rates remain >30%.
  • Conclusion: Test‑based metrics give an inflated view of usefulness; there is a clear need for design‑aware evaluation and improved model capabilities for design reasoning.
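
To make the three metrics and the association test concrete, here is a minimal Python sketch. This summary does not give the paper's exact formulas, so the definitions below are assumptions: a patch counts as design-violating when at least one linked constraint is violated, as design-satisfying when none is violated and at least one is satisfied, and all rates use every evaluated patch as the denominator. The `PatchResult` type and the toy data are hypothetical. For a 2×2 table, Cramér's V reduces to the standard sqrt(χ²/n).

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class PatchResult:
    """Hypothetical record for one agent-generated patch."""
    passed_tests: bool
    verdicts: list[str]  # per linked constraint: "satisfied", "violated", or "neutral"


def is_violating(p: PatchResult) -> bool:
    # Assumption: a single violated constraint makes the patch design-violating.
    return "violated" in p.verdicts


def is_satisfying(p: PatchResult) -> bool:
    # Assumption: no violation and at least one satisfied constraint.
    return not is_violating(p) and "satisfied" in p.verdicts


def rates(results: list[PatchResult]) -> dict[str, float]:
    """Pass Rate, DSR, and DVR as fractions of all evaluated patches."""
    n = len(results)
    if n == 0:
        raise ValueError("no patches to evaluate")
    return {
        "pass_rate": sum(r.passed_tests for r in results) / n,
        "dsr": sum(is_satisfying(r) for r in results) / n,
        "dvr": sum(is_violating(r) for r in results) / n,
    }


def cramers_v(results: list[PatchResult]) -> float:
    """Cramér's V for the 2x2 table (tests passed?) x (design-violating?)."""
    a = sum(r.passed_tests and is_violating(r) for r in results)
    b = sum(r.passed_tests and not is_violating(r) for r in results)
    c = sum(not r.passed_tests and is_violating(r) for r in results)
    d = sum(not r.passed_tests and not is_violating(r) for r in results)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column: report no measurable association
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom  # Pearson chi-squared, 2x2 shortcut
    return sqrt(chi2 / n)


# Toy data (hypothetical): violations are spread evenly across passing and failing patches.
results = [
    PatchResult(True, ["satisfied", "neutral"]),
    PatchResult(True, ["violated", "satisfied"]),
    PatchResult(False, ["satisfied"]),
    PatchResult(False, ["violated"]),
]
print(rates(results), cramers_v(results))
```

In the toy data, half of the passing patches and half of the failing patches violate a constraint, so Cramér's V is 0. That mirrors the pattern the paper reports, where knowing the test outcome says little about design compliance.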

Data & Methods

  • Data sources:
    • Starting points: existing issue‑resolution benchmarks (SWE‑bench‑Verified and SWE‑bench‑Pro).
    • Mining pull requests, code review threads, commits and PR discussions from six real open‑source repositories (examples include Django).
  • Design constraint extraction (DesignHunter):
    • Sliding‑window LLM prompts to decompose long review threads into atomic suggestions while avoiding lost‑in‑the‑middle problems.
    • Validate atomic suggestions against commit history (adopted vs rejected).
    • Cluster and aggregate suggestions addressing the same problem to form structured design constraints (problem, options, applicability, code snippets).
    • Manual validation step to ensure fidelity for the final benchmark.
  • Patch verification:
    • LLM‑based verifier (LLM‑as‑a‑judge) compares generated patches and reasoning traces against the linked constraints to produce satisfaction/violation labels.
    • Human judges were used to validate benchmark items and final labels.
  • Experiments:
    • Evaluate several state‑of‑the‑art LLMs/agents (within the same agent framework) on SWE‑Shield.
    • Report Pass Rate, DSR, DVR; statistical tests for association between functional correctness and design compliance; analysis of constraint types missed and effects of explicit guidance.
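
The sliding-window step of the extraction pipeline can be pictured with a short Python sketch. This is not DesignHunter's implementation: the window size, the overlap, and the `sliding_windows` helper are assumptions chosen for illustration, and the LLM call that would process each window is omitted.

```python
from typing import Iterator, Sequence


def sliding_windows(
    comments: Sequence[str], size: int = 4, overlap: int = 1
) -> Iterator[list[str]]:
    """Yield overlapping windows of review comments (hypothetical helper).

    Short windows keep each prompt small, and the overlap carries some
    context across window boundaries.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < len(comments):
        yield list(comments[start:start + size])
        if start + size >= len(comments):  # this window reached the end of the thread
            break
        start += step


# Hypothetical six-comment review thread.
thread = [f"comment {i}" for i in range(1, 7)]
windows = list(sliding_windows(thread))
for w in windows:
    print(w)
```

Each window would then be sent to the extraction prompt separately, so that no suggestion sits deep in the middle of a long context.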

Implications for AI Economics

  • Hidden quality and deployment risk:
    • Relying on pass‑rate metrics can lead firms to overestimate the productivity and safety of LLM‑based developer tools. Deploying agents that generate design‑violating patches risks higher review/rework costs, regressions, and technical debt.
  • Misaligned incentives and benchmarking:
    • Benchmarks that reward only test‑passing encourage models optimized for tests rather than maintainability or project conventions. This skews R&D investment toward superficial fixes rather than long‑term usable automation.
  • Labour substitution vs complementarity:
    • Residual high violation rates (>30%) and modest gains from model choice suggest continued need for skilled human reviewers. LLMs may substitute for low‑complexity tasks but are more likely to complement humans for design‑sensitive work, affecting staffing, training, and the nature of developer workflows.
  • Product valuation and ROI:
    • Estimated ROI for tools automating issue resolution should factor in the cost of additional human review, rework from design violations, and potential downstream maintenance costs. Pass‑rate–only estimates overstate productivity gains.
  • Market opportunities:
    • Demand exists for “design‑aware” tooling: models and pipelines that can extract and respect project conventions, plus verification tools that flag design non‑compliance. Benchmarks like SWE‑Shield can catalyze product differentiation.
  • Procurement and compliance:
    • Organizations with strict architectural or regulatory constraints should require design‑aware evaluation in procurement. Relying on functional tests alone is insufficient for safety‑critical or heavily regulated software.
  • Research and investment priorities:
    • Investors and R&D managers should favor models and methods that improve contextual, longitudinal reasoning (learning from project history/reviews), and invest in labelled datasets and verifiers that surface implicit constraints.
  • Standardization and governance:
    • There is value in community standards for design‑aware evaluation (analogous to test suites), which would reshape competitive dynamics and create incentives for models that are robust in project contexts.

Recommended short actions for stakeholders:

  • Benchmark designers: incorporate design constraints and human‑in‑the‑loop validation into leaderboards.
  • Product teams/vendors: report design‑aware metrics (DSR/DVR) in addition to pass rates; build constraint extraction and verification into pipelines.
  • Investors/procurement: request evidence of design compliance on representative repositories before wide deployment; budget for residual human review.
  • Researchers: prioritize learning from historical review data and improving models’ ability to reason about applicability conditions and cross‑cutting project conventions.

Limitations & directions:

  • Extraction and verification rely on LLMs and manual validation; the taxonomy of constraints is not yet standardized.
  • Future work: expand repositories, formalize constraint taxonomies, improve automatic verifiers, and measure economic impacts (cost/benefit) in field deployments.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper builds a sizeable, validated benchmark (495 issues, 1,787 constraints) mined from real pull requests and runs systematic experiments with multiple state-of-the-art agents, giving credible empirical support for its claims; however, key measurements rely on an LLM-based verifier (which can introduce measurement error and subjectivity), the dataset covers only six repositories, and tests of interventions are not causal, limiting the strength of generalizable conclusions.

Methods Rigor: medium. Methodologically the paper is careful: it mines constraints from real PRs, validates them, aligns with existing SWE-bench variants, and evaluates multiple agents and interventions; nevertheless, the use of an LLM verifier as the primary compliance checker, possible selection bias in repository/PR sampling, and limited external validation of the verifier reduce overall rigor.

Sample: Benchmark constructed by mining real-world pull requests across six software repositories, yielding 495 issues linked to 1,787 validated, repository-specific design constraints; constraints are hand-validated and automatically checked with an LLM-based verifier; experiments evaluate several state-of-the-art LLM-based code-repair/agent systems and measure test pass rates, design-constraint satisfaction, and the effect of providing design guidance.

Themes: productivity, human_ai_collab

Generalizability:

  • Only six repositories, which may not represent broader open-source ecosystems or proprietary codebases.
  • Likely limited to the particular programming languages, architectures, and repo sizes present in the sample.
  • Selection bias toward PRs where design discussion exists or is extractable.
  • Design constraint labeling and compliance assessment depend on an LLM verifier, which may misclassify or be sensitive to prompt/temperature.
  • Findings are about agent performance at repo-level issue resolution and may not generalize to other AI-assisted software tasks or future agent versions.

Claims (8)

  • Claim: We construct a design-aware benchmark (SWE-Shield) by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier. [Other]
    • Direction: positive; Confidence: high; Outcome: other
    • Details: 0.3
  • Claim: SWE-Shield contains 495 issues and 1,787 validated design constraints across six repositories. [Other]
    • Direction: positive; Confidence: high; Outcome: other
    • Details: n=495; 1,787 validated constraints; 0.3
  • Claim: Test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying. [Output Quality]
    • Direction: negative; Confidence: high; Outcome: design-satisfaction of patches (design compliance)
    • Details: n=495; fewer than half fully design-satisfying (<50%); 0.18
  • Claim: Design violations are widespread in agent-produced patches. [Output Quality]
    • Direction: negative; Confidence: high; Outcome: number/occurrence of design violations
    • Details: n=495; 0.18
  • Claim: Functional correctness (test-based correctness) exhibits negligible statistical association with design satisfaction. [Output Quality]
    • Direction: null_result; Confidence: high; Outcome: association between functional correctness and design satisfaction
    • Details: n=495; negligible association (not statistically meaningful); 0.18
  • Claim: Providing issue-specific design guidance reduces design violations, but substantial non-compliance remains. [Output Quality]
    • Direction: mixed; Confidence: high; Outcome: design violations / design satisfaction
    • Details: n=495; reduces violations (magnitude not specified in abstract); 0.18
  • Claim: SWE-bench alignment: SWE-Shield is aligned with SWE-bench-Verified and SWE-bench-Pro. [Other]
    • Direction: positive; Confidence: high; Outcome: benchmark alignment
    • Details: 0.09
  • Claim: There is a fundamental gap in current agent capabilities: functional correctness alone is insufficient for design-aware issue resolution, motivating design-aware evaluation beyond functional correctness. [Other]
    • Direction: negative; Confidence: high; Outcome: agent capability for design-aware issue resolution
    • Details: n=495; 0.03

Notes