The Commonplace

A clarification module trained on Shapley-derived relevance and simulated answerability matches GPT-5’s issue-resolution success while cutting clarification questions by 41%, suggesting more efficient human–AI collaboration in software engineering; however, results rely on simulated users and a narrow task set, tempering claims about real-world gains.

Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks
Sanidhya Vijayvargiya, Vijay Viswanathan, Graham Neubig · April 16, 2026
arxiv · quasi_experimental · medium evidence · 7/10 relevance
Grounding clarification reward design in empirically measured task relevance (Shapley) and simulated user answerability yields a trained 8B-parameter module (CLARITI) that matches GPT-5’s resolution rate on underspecified software tasks while asking 41% fewer questions.

Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions. However, effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide. We study clarification in real software engineering tasks by quantifying which types of information most affect task success and which questions elicit useful responses from simulated users. Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide). We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module that matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions. Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency.

Summary

Main Finding

Training a clarification module (CLARITI, 8B params) with rewards grounded in (1) the empirical impact of different missing information types on task success and (2) characteristics of user-answerable questions yields more efficient, effective clarification for software-engineering tasks. CLARITI matches GPT-5’s task-resolution rate while asking 41% fewer questions (3.0 vs 5.1 on average) and attains ~88% of the fully specified upper-bound performance, demonstrating that principled reward design can reduce user burden and preserve agent effectiveness.
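
The 41% figure follows directly from the two averages quoted above. As a quick check, a minimal sketch (the function and variable names are illustrative, not from the paper):

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


gpt5_questions = 5.1     # average questions per task asked by GPT-5 (from the summary)
clariti_questions = 3.0  # average questions per task asked by CLARITI (from the summary)

reduction = relative_reduction(gpt5_questions, clariti_questions)
print(f"{reduction:.1%}")  # 41.2%
```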

Key Points

  • Problem framed: many real-world software engineering requests are underspecified; agents should ask clarifying questions that (a) matter for success and (b) are realistically answerable by users.
  • RQ1 — What information matters:
    • Derived a 6-category taxonomy of missing information from SWE-Bench Verified issues: Error information, Implementation details, Expected behavior, External references, Reproduction steps, Version/Environment.
    • Shapley (SHAP) analysis across 700 underspecified instances shows error information has the largest marginal association with agent success, followed by implementation details; other categories (expected behavior, external refs, etc.) are less predictive. Importantly, frequency of omission does not equal importance.
    • Underspecifying issues reduces agent success substantially (fully specified → 43.8% success; underspecified → 23.7%).
  • RQ2 — What makes questions answerable:
    • Constructed answerability labels by checking whether the missing info requested by a question is present in the full issue (simulated user proxy).
    • Distributional analysis (Vargha–Delaney effect sizes) reveals four high-level strategies that characterize answerable questions:
      • Ground in evidence: ask for concrete artifacts (e.g., stack trace).
      • Demand specificity: request precise values (e.g., exact Python version).
      • Minimize scope: request the smallest isolating unit (e.g., a 10-line reproducer).
      • Ensure actionability: ask for things users can run/observe (e.g., run tests and share output).
    • There is a practical trade-off: more questions can plateau or even reduce net performance while increasing user burden; redundant or unanswerable questions waste user effort.
  • RQ3 — Training with empirically grounded rewards:
    • Designed multi-stage reward signals combining: impact-weighted task relevance (from RQ1), answerability heuristics (from RQ2), and auxiliary desiderata (non-redundancy, diversity).
    • Trained CLARITI (8B) to generate clarification questions before the coding agent runs; used GPT-5 as a simulated user to answer questions during evaluation.
    • Results: CLARITI attains GPT-5-level resolution rates while generating substantially fewer questions (41% reduction), and reaches ~88% of the fully specified upper bound.
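
To make the Shapley attribution in RQ1 concrete, here is a minimal sketch that computes exact Shapley values over the six taxonomy categories. It is only an illustration under loud assumptions: the paper fits predictive models and computes SHAP values, whereas this sketch uses a hand-written toy value function. `TOY_GAIN`, `BASE_RATE`, and the interaction term are hypothetical numbers, not results from the paper.

```python
from itertools import combinations
from math import factorial

# The six categories of missing information named in the taxonomy.
CATEGORIES = [
    "error_information", "implementation_details", "expected_behavior",
    "external_references", "reproduction_steps", "version_environment",
]

# HYPOTHETICAL per-category gains in success rate; not values from the paper.
TOY_GAIN = {
    "error_information": 0.10, "implementation_details": 0.06,
    "expected_behavior": 0.02, "external_references": 0.01,
    "reproduction_steps": 0.02, "version_environment": 0.01,
}
BASE_RATE = 0.20  # HYPOTHETICAL success rate with every category missing


def success_rate(present: frozenset) -> float:
    """Toy value function: success rate given which categories are present."""
    rate = BASE_RATE + sum(TOY_GAIN[c] for c in present)
    # Toy interaction: error info and reproduction steps reinforce each other.
    if {"error_information", "reproduction_steps"} <= present:
        rate += 0.02
    return rate


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


phi = shapley_values(CATEGORIES, success_rate)
for name, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {contribution:+.3f}")
```

With only six binary features, exact enumeration (64 coalitions) is cheap; the attributions sum to the gap between the fully specified and fully underspecified value, which is the property that makes them usable as relevance weights.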

Data & Methods

  • Data:
    • Source: SWE-Bench Verified repository issues; expert codebook created from 112 highly underspecified issues, yielding the 6-category taxonomy.
    • Constructed dataset: 500 base issues → 3 underspecified rewrites each → 1,500 variants; experiments sample 700 instances for impact analysis and 500 for clarification experiments.
  • Agent and evaluation:
    • Agent backbone fixed: Seed OSS 36B Instruct used within OpenHands sandbox (agents can edit files, run bash/Python, iteratively refine).
    • Success metric: binary, i.e., whether the produced patch passes the repository's test suite.
    • Clarification interaction: single-turn clarification phase where the clarification module can ask multiple questions; simulated user implemented with GPT-5 which has the fully specified issue.
  • Analytical methods:
    • RQ1: represent presence/absence of categories as binary features; train predictive models and compute SHAP values (bootstrap CIs via 10k resamples) to estimate marginal contribution of each category to task success.
    • RQ2: generate candidate questions with GPT-5 and GPT-5 nano; classify questions as answerable/unanswerable/redundant by comparing to full issue (via GPT-5 judge); use Vargha–Delaney effect sizes to find distributional differences in linguistic/structural features.
    • RQ3: design multi-stage RL-style rewards combining impact (SHAP-derived weights), answerability signals, non-redundancy, and diversity; train CLARITI (8B) and compare to GPT-5/GPT-5 nano baselines on resolution rate and number of questions.
  • Key quantitative outcomes:
    • Fully specified success: 43.8%; underspecified baseline: 23.7% (drop ~20.1 pp).
    • CLARITI: matches GPT-5’s resolution (~36.8% reported in figure), asks 3.0 questions on average vs GPT-5’s 5.1 (41% fewer), achieves ~88% of fully specified upper bound.
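
The Vargha–Delaney effect size used in RQ2 can be sketched in a few lines. The A statistic estimates the probability that a value drawn from one group exceeds a value drawn from the other, counting ties as half. The feature values below are hypothetical (imagine a per-question count of concrete artifacts requested); they are not data from the paper.

```python
def vargha_delaney_a(xs, ys):
    """Vargha-Delaney A: P(X > Y) + 0.5 * P(X == Y) over all cross-group pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# HYPOTHETICAL feature values for two groups of questions.
answerable = [2, 3, 1, 2, 4]
unanswerable = [0, 1, 1, 0, 2]

a = vargha_delaney_a(answerable, unanswerable)
print(f"A = {a:.2f}")  # A = 0.88
```

A value of 0.5 means the two distributions are indistinguishable on that feature; values near 1 (or 0) mean the feature is systematically larger (or smaller) for answerable questions.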

Implications for AI Economics

  • Efficiency and cost reduction:
    • Reducing the number of clarification questions directly lowers user time cost and interaction friction, improving the value-per-interaction of AI copilots. Fewer, higher-quality questions reduce human effort and can reduce overall time-to-resolution.
    • For service providers, improved clarification efficiency can decrease compute waste from failed runs and repeated agent attempts, improving per-task compute ROI.
  • Model-size vs. design trade-offs:
    • CLARITI (8B) trained with principled rewards can match a larger model’s (GPT-5) resolution rates on clarification while reducing user burden. This suggests targeted investment in specialized clarification modules can be a cost-effective alternative to always using larger generalist models for the clarification step.
  • Reward design as a value lever:
    • Grounding reward signals in empirical impact on downstream metrics (here, test-passing patches) aligns model incentives with economic value (task success, not just linguistic plausibility). This reduces wasted effort and can better capture marginal benefit of information — a core economic optimization.
  • Product and pricing implications:
    • Systems that minimize user burden (fewer questions) while maintaining success rates increase perceived product quality and may enable tiered offerings (e.g., low-latency clarification modules bundled with heavier code-generation compute).
    • Companies should weigh upfront costs of building empirically grounded clarification modules (data collection, training) against recurring savings from reduced human time and compute per task.
  • Deployment & market risks:
    • Simulated-user training/evaluation (using GPT-5 as proxy) may overestimate answerability in real heterogeneous user populations; misalignment could reduce real-world effectiveness and imply re-training costs or customer support overhead.
    • Incentive mis-specification risk: reward designs that over-emphasize minimizing questions could under-ask and reduce correctness, so economic objectives must balance user time, success probability, and potential downstream support costs.
  • Generality and scalability:
    • The two-dimensional framing (task relevance × user answerability) is transferable to other domains where user-provided specifications matter (legal, medical intake, data annotation). Applying the same empirical-impact + answerability pipeline can help optimize human-AI interaction economics across products.

Limitations (economic perspective)

  • The paper evaluates with simulated users and a single coding agent backbone; real user heterogeneity and different agent capabilities will affect realized economic benefits.
  • Training the clarification module has upfront compute and data costs (8B model training); a cost–benefit analysis is needed for deployment decisions comparing saved user time/compute vs training/deployment expenses.

Overall, the study provides actionable evidence that reward designs grounded in empirical impact and realistic answerability lead to more economically efficient human–AI interactions for task-oriented agents.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper provides internal evidence that reward design grounded in Shapley-based relevance and simulated user answerability improves clarification efficiency (matching GPT-5’s resolution with far fewer questions). However, the evaluation relies heavily on simulated users and a specific software-engineering task distribution, limiting external validity and causal claims about real-world human-AI interactions.

Methods Rigor: medium. Uses principled tools (Shapley attribution, distributional comparisons, RL training, and direct baseline comparisons) and reports quantitative gains, indicating careful experimental design. Rigor is reduced by reliance on simulated user behavior rather than extensive human-subject trials, potential sensitivity to dataset and simulator design, and possible lack of randomized human evaluations or robustness checks across diverse developer populations and task types.

Sample: Held-out dataset of underspecified software-engineering tasks/issues drawn from engineering workflows (authors’ curated/evaluated corpus); simulated users that generate answerability responses; CLARITI (8B-parameter clarification module) trained with multi-stage RL rewards; evaluation compares resolution rate and question count against GPT-5 and other baselines on the same task distribution.

Themes: human_ai_collab, productivity

Identification: Estimate the importance of missing information for task success using Shapley value attribution and distributional comparisons; operationalize 'useful clarification' as a combination of task relevance (Shapley) and user answerability (simulated-response models) and then train a clarification policy via multi-stage reinforcement learning; evaluate causally by comparing resolution rates and question counts of the trained module (CLARITI) versus baselines (including GPT-5) on held-out underspecified software-engineering tasks.

Generalizability:

  • Relies on simulated users rather than large-scale human-subject experiments, so real user behavior may differ.
  • Limited to software engineering/bug/task-clarification domain and to the specific dataset collected by authors.
  • Findings may not transfer to other domains (e.g., creative writing, medical tasks) where answerability and relevance differ.
  • Performance comparison depends on the particular GPT-5 baseline and model scale; results may change with different LLMs or sizes.
  • Cultural, linguistic, and team-workflow variations among real developers are not accounted for.

Claims (7)

  • Claim: Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions. (Task Allocation)
    • Direction: positive
    • Confidence: high
    • Outcome: frequency/occurrence of incomplete task specifications (need for clarification)
    • 0.48
  • Claim: Effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide. (Developer Productivity)
    • Direction: negative
    • Confidence: high
    • Outcome: impact of missing information and answerability on task success
    • 0.48
  • Claim: Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide). (Task Allocation)
    • Direction: positive
    • Confidence: high
    • Outcome: importance of information features for predicting task success and simulated-user answerability
    • 0.48
  • Claim: We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module. (Developer Productivity)
    • Direction: positive
    • Confidence: high
    • Outcome: ability to train a clarification module using the proposed reward design
    • 0.48
  • Claim: CLARITI matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions. (Developer Productivity)
    • Direction: mixed
    • Confidence: high
    • Outcome: resolution rate (task success) and number of clarifying questions generated
    • Details: 41% fewer questions
    • 0.48
  • Claim: CLARITI is an 8B-parameter clarification module. (Other)
    • Direction: positive
    • Confidence: high
    • Outcome: model parameter count
    • 0.48
  • Claim: Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency. (Organizational Efficiency)
    • Direction: positive
    • Confidence: high
    • Outcome: clarification efficiency (fewer questions for similar resolution performance)
    • Details: 41% fewer questions (as part of supporting result)
    • 0.48
