Code-writing AI agents often pass tests but fail to respect project design standards: a benchmark of 495 real issues finds under half of auto-resolved patches meet repository-specific design constraints, and test success is a weak predictor of design compliance.
Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents SWE-Shield, a benchmark that makes such implicit design constraints explicit and measurable. SWE-Shield is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.
Summary
Main Finding
Pass rates on test suites substantially overstate the real-world quality of LLM‑generated patches. When design constraints (project conventions, error‑handling policies, maintainability rules, etc.) extracted from historical code reviews are taken into account, many test‑passing patches still violate important design requirements. The paper introduces SWE-Shield, a benchmark and toolchain to extract, validate, and verify such constraints, and shows that state‑of‑the‑art agents achieve far lower “design satisfaction” than test pass rates imply.
Key Points
- Problem: Existing issue‑resolution benchmarks evaluate success primarily by test pass rate, but real PR acceptance depends heavily on implicit design constraints (architectural conventions, API consistency, error‑handling styles, cross‑cutting tradeoffs).
- SWE-Shield: A new benchmark that explicitly links issues to validated, scenario‑grounded design constraints, enabling design‑aware evaluation of LLM‑based issue resolution.
- Data scale: SWE-Shield contains 495 issues from 6 repositories, with 1,787 manually validated design constraints. These were distilled from 10,885 candidate constraints that DesignHunter initially identified in pull requests.
- DesignHunter: An LLM‑based two‑stage extraction pipeline:
  - Stage I: sliding‑window extraction of atomic design suggestions from review threads and validation against commit history (to see if suggestions were adopted).
  - Stage II: hierarchical clustering/aggregation to form generalized design constraints (problem description, options, applicable conditions, example code).
- Patch verifier: An LLM‑based “judge” that assesses whether a generated patch satisfies the linked design constraints (outputs satisfied/violated/neutral).
- Metrics introduced: Pass Rate (test correctness), Design Satisfaction Rate (DSR), Design Violation Rate (DVR).
- Empirical results (representative figures reported):
  - Pass Rate: 70.25%–75.95% on SWE-Shield_verified; up to 42.69% on SWE-Shield_pro (harder set).
  - Design Satisfaction Rate: 32.64%–50.20% (much lower than pass rates).
  - Design Violation Rate: up to 45.85%.
  - Association between functional correctness and design compliance is weak/negligible (χ² tests mostly non‑significant; Cramér’s V ≤ 0.11).
  - Model choice yields only modest DSR gains (≤ ~12 percentage points across foundation models); many violations are shared across models.
  - Providing explicit, issue‑specific design guidance reduces violations modestly (DVR down by up to 6.35 percentage points), but residual violation rates remain >30%.
- Conclusion: Test‑based metrics give an inflated view of usefulness; there is a clear need for design‑aware evaluation and improved model capabilities for design reasoning.
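To make the three metrics above concrete, here is a minimal sketch of how they could be computed from verifier output. The summary does not give the paper's exact formulas, so the definitions below are assumptions: Pass Rate is the share of patches that pass their tests, while DSR and DVR are the shares of "satisfied" and "violated" verdicts pooled over all patch-constraint pairs. The `PatchResult` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PatchResult:
    """Hypothetical record for one agent-generated patch."""
    passed_tests: bool   # True if the patch passes the issue's test suite
    verdicts: List[str]  # one label per linked constraint:
                         # "satisfied", "violated", or "neutral"


def pass_rate(results: List[PatchResult]) -> float:
    """Share of patches that pass their tests."""
    return sum(r.passed_tests for r in results) / len(results)


def _verdict_share(results: List[PatchResult], label: str) -> float:
    """Share of all verdicts (pooled across patches) equal to `label`.

    ASSUMPTION: verdict-level pooling; the paper may aggregate per issue.
    """
    all_verdicts = [v for r in results for v in r.verdicts]
    return all_verdicts.count(label) / len(all_verdicts)


def design_satisfaction_rate(results: List[PatchResult]) -> float:
    return _verdict_share(results, "satisfied")


def design_violation_rate(results: List[PatchResult]) -> float:
    return _verdict_share(results, "violated")


# Made-up example data, not taken from the benchmark.
sample = [
    PatchResult(passed_tests=True, verdicts=["satisfied", "violated"]),
    PatchResult(passed_tests=False, verdicts=["neutral", "satisfied"]),
]
print(pass_rate(sample))                 # 0.5
print(design_satisfaction_rate(sample))  # 0.5
print(design_violation_rate(sample))     # 0.25
```

Because "neutral" verdicts count toward neither rate, DSR and DVR need not sum to 100% under this definition, which is consistent with the ranges reported above.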
Data & Methods
- Data sources:
  - Starting points: existing issue‑resolution benchmarks (SWE‑bench‑Verified and SWE‑bench‑Pro).
  - Mining pull requests, code review threads, commits and PR discussions from six real open‑source repositories (examples include Django).
- Design constraint extraction (DesignHunter):
  - Sliding‑window LLM prompts to decompose long review threads into atomic suggestions while avoiding lost‑in‑the‑middle problems.
  - Validate atomic suggestions against commit history (adopted vs rejected).
  - Cluster and aggregate suggestions addressing the same problem to form structured design constraints (problem, options, applicability, code snippets).
  - Manual validation step to ensure fidelity for the final benchmark.
- Patch verification:
  - LLM‑based verifier (LLM‑as‑a‑Judge) compares generated patches and reasoning traces against the linked constraints to produce satisfaction/violation labels.
  - Human judges validate benchmark items and final labels.
- Experiments:
  - Evaluate several state‑of‑the‑art LLMs/agents (within the same agent framework) on SWE‑Shield.
  - Report Pass Rate, DSR, DVR; statistical tests for association between functional correctness and design compliance; analysis of constraint types missed and effects of explicit guidance.
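The association test mentioned above can be illustrated with a small, dependency-free sketch. It computes the Pearson chi-square statistic and Cramér's V for a contingency table of test outcome (rows) against design outcome (columns). The function name and the example counts are hypothetical, no continuity correction is applied, and the summary does not say which exact procedure the authors used.

```python
import math
from typing import List, Tuple


def chi2_and_cramers_v(table: List[List[int]]) -> Tuple[float, float]:
    """Pearson chi-square statistic and Cramér's V for an r x c table.

    ASSUMPTIONS: every row and column total is positive, and no
    continuity correction is applied.
    """
    rows, cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(rows)) for j in range(cols)]

    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # V = sqrt(chi2 / (n * (min(r, c) - 1))); ranges from 0 to 1.
    v = math.sqrt(chi2 / (n * (min(rows, cols) - 1)))
    return chi2, v


# Made-up counts: rows = tests passed / failed,
# columns = design-satisfying / design-violating.
independent = [[30, 30], [20, 20]]   # same ratio in both rows
associated = [[40, 10], [10, 40]]    # outcomes move together

print(chi2_and_cramers_v(independent))  # (0.0, 0.0)
print(chi2_and_cramers_v(associated))   # chi2 = 36.0, V close to 0.6
```

A value of V near 0, as in the first table, means that knowing whether a patch passes its tests says almost nothing about whether it satisfies the design constraints; the reported V ≤ 0.11 sits close to that end of the scale.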
Implications for AI Economics
- Hidden quality and deployment risk:
  - Relying on pass‑rate metrics can lead firms to overestimate the productivity and safety of LLM‑based developer tools. Deploying agents that generate design‑violating patches risks higher review/rework costs, regressions, and technical debt.
- Misaligned incentives and benchmarking:
  - Benchmarks that reward only test‑passing encourage models optimized for tests rather than maintainability or project conventions. This skews R&D investment toward superficial fixes rather than long‑term usable automation.
- Labour substitution vs complementarity:
  - Residual high violation rates (>30%) and modest gains from model choice suggest continued need for skilled human reviewers. LLMs may substitute for low‑complexity tasks but are more likely to complement humans for design‑sensitive work, affecting staffing, training, and the nature of developer workflows.
- Product valuation and ROI:
  - Estimated ROI for tools automating issue resolution should factor in the cost of additional human review, rework from design violations, and potential downstream maintenance costs. Pass‑rate–only estimates overstate productivity gains.
- Market opportunities:
  - Demand exists for “design‑aware” tooling: models and pipelines that can extract and respect project conventions, plus verification tools that flag design non‑compliance. Benchmarks like SWE‑Shield can catalyze product differentiation.
- Procurement and compliance:
  - Organizations with strict architectural or regulatory constraints should require design‑aware evaluation in procurement. Relying on functional tests alone is insufficient for safety‑critical or heavily regulated software.
- Research and investment priorities:
  - Investors and R&D managers should favor models and methods that improve contextual, longitudinal reasoning (learning from project history/reviews), and invest in labelled datasets and verifiers that surface implicit constraints.
- Standardization and governance:
  - There is value in community standards for design‑aware evaluation (analogous to test suites), which would reshape competitive dynamics and create incentives for models that are robust in project contexts.
Recommended short actions for stakeholders:
- Benchmark designers: incorporate design constraints and human‑in‑the‑loop validation into leaderboards.
- Product teams/vendors: report design‑aware metrics (DSR/DVR) in addition to pass rates; build constraint extraction and verification into pipelines.
- Investors/procurement: request evidence of design compliance on representative repositories before wide deployment; budget for residual human review.
- Researchers: prioritize learning from historical review data and improving models’ ability to reason about applicability conditions and cross‑cutting project conventions.
Limitations & directions:
- Extraction and verification rely on LLMs and manual validation; the taxonomy of constraints is not yet standardized.
- Future work: expand repositories, formalize constraint taxonomies, improve automatic verifiers, and measure economic impacts (cost/benefit) in field deployments.
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We construct a design-aware benchmark (SWE-Shield) by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier. | Other | positive | high | other | 0.3 |
| SWE-Shield contains 495 issues and 1,787 validated design constraints across six repositories. | Other | positive | high | other | n=495; 1,787 validated constraints; 0.3 |
| Test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying. | Output Quality | negative | high | design-satisfaction of patches (design compliance) | n=495; fewer than half fully design-satisfying (<50%); 0.18 |
| Design violations are widespread in agent-produced patches. | Output Quality | negative | high | number/occurrence of design violations | n=495; 0.18 |
| Functional correctness (test-based correctness) exhibits negligible statistical association with design satisfaction. | Output Quality | null_result | high | association between functional correctness and design satisfaction | n=495; negligible association (not statistically meaningful); 0.18 |
| Providing issue-specific design guidance reduces design violations, but substantial non-compliance remains. | Output Quality | mixed | high | design violations / design satisfaction | n=495; reduces violations (magnitude not specified in abstract); 0.18 |
| SWE-bench alignment: SWE-Shield is aligned with SWE-bench-Verified and SWE-bench-Pro. | Other | positive | high | benchmark alignment | 0.09 |
| There is a fundamental gap in current agent capabilities: functional correctness alone is insufficient for design-aware issue resolution, motivating design-aware evaluation beyond functional correctness. | Other | negative | high | agent capability for design-aware issue resolution | n=495; 0.03 |