
Developers want AI to take on the assembly and housekeeping around coding, not the craft itself; Microsoft engineers demand provenance, uncertainty signaling, and strict authority limits, revealing a preference for 'bounded delegation' rather than full automation.

To Copilot and Beyond: 22 AI Systems Developers Want Built
Rudrajit Choudhuri, Christian Bird, Carmen Badea, Anita Sarma · April 09, 2026 · arXiv (Cornell University)
OpenAlex · descriptive · medium evidence · 7/10 relevance · Source PDF
A survey of 860 Microsoft developers shows they want AI to automate peripheral 'assembly' work around coding while preserving core craft identity, demanding provenance, uncertainty signaling, authority scoping, and least-privilege controls — a pattern the authors call 'bounded delegation.'

Developers spend roughly one-tenth of their workday writing code, yet most AI tooling targets that fraction. This paper asks what should be built for the rest. We surveyed 860 Microsoft developers to understand where they want AI support, and where they want it to stay out. Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories. For each, we describe the problem it solves, what makes it hard to build, and the constraints developers place on its behavior. Our findings point to a growing right-shift burden in AI-assisted development: developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation, while enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout. This tension reveals a pattern we call "bounded delegation": developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself. That boundary tracks where they locate professional identity, suggesting that the value of AI tooling may lie as much in where and how precisely it stops as in what it does.

Summary

Main Finding

Developers want AI systems that augment the non-coding, verification-heavy parts of software engineering—not just code generation. From a large survey (n = 860 Microsoft developers) and a human-in-the-loop thematic analysis, the authors identify 22 concrete AI systems across five task categories that developers want built. Across those systems developers consistently require four guardrails (authority scoping, provenance, uncertainty signaling, least-privilege access). The central pattern is “bounded delegation”: developers want AI to take over assembly and support work around their craft, but not the craft itself, even where AI capability might plausibly exist.

Key Points

  • Mismatch in current tooling: coding is ~10% of developers’ time, yet most AI tooling focuses on code generation. Developers need help with the remaining 90% (debugging, review, documentation, ops, onboarding, compliance).
  • 22 requested AI systems cluster into five categories: Development; Design & Planning; Quality & Risk Management; Infrastructure & Operations; Meta-work (documentation, onboarding, stakeholder communication).
  • Emphasis on verification: developers want quality signals embedded earlier (at authorship/point-of-change) to keep pace with accelerated code generation and to reduce downstream review/triage costs.
  • Four required guardrails for acceptability (a minimal schema sketch follows this list):
    • Explicit authority scoping (clear human vs. AI roles)
    • Provenance (traceable origins of suggestions)
    • Uncertainty signaling (express model confidence/limits)
    • Least-privilege access (minimal privileges for AI actions)
  • Bounded delegation is not purely capability-driven: developers maintain boundaries even for tasks they believe models could perform, linking the boundary to professional identity, accountability, and ownership.
  • Methodology highlight: a multi-model, human-in-the-loop pipeline (three models independently propose themes; reconciliation; human validation; three-model coding with chain-of-thought rationales; Krippendorff’s α used for IRR). Models used: GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6.
  • Risks called out: “workslop” (AI-produced-but-shallow output), accumulating AI-induced technical debt, developer fatigue and burnout, and expanded verification burden as generative tools accelerate production but not downstream assurance.
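To make the guardrails concrete, here is a minimal, hypothetical schema in Python. The paper states the four guardrails as requirements, not as a data model, so every name below (Authority, Provenance, AIAction, requires_human) is an assumption made purely for illustration.

```python
# Hypothetical sketch: one way to represent the four guardrails in code.
# Nothing here comes from the paper's artifacts; all names are invented.
from dataclasses import dataclass, field
from enum import Enum


class Authority(Enum):
    """Explicit authority scoping: what the AI may do on its own."""
    SUGGEST_ONLY = "suggest_only"          # human reviews every output
    ACT_WITH_APPROVAL = "act_with_approval"
    ACT_AUTONOMOUSLY = "act_autonomously"


@dataclass
class Provenance:
    """Traceable origins of a suggestion."""
    model_id: str                          # which model produced it
    source_files: list[str]                # artifacts it was derived from
    retrieval_refs: list[str] = field(default_factory=list)


@dataclass
class AIAction:
    description: str
    authority: Authority                   # authority scoping
    provenance: Provenance                 # provenance
    confidence: float                      # uncertainty signaling, in [0, 1]
    scopes: frozenset = frozenset()        # least-privilege access grants


def requires_human(action: AIAction, confidence_floor: float = 0.8) -> bool:
    """Route low-confidence or limited-authority actions to a human."""
    return (action.authority is not Authority.ACT_AUTONOMOUSLY
            or action.confidence < confidence_floor)
```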

Data & Methods

  • Data source: IRB‑approved survey of 860 Microsoft software developers (July 2025); respondents spanned product groups, roles, geographies. Survey included Likert items and two open-ended questions per selected task category: (1) where they want AI help, (2) what they do not want AI to handle.
  • Task taxonomy: Development; Design & Planning; Quality & Risk Management; Infrastructure & Operations; Meta-work.
  • Response counts (open-ended): Development n=816, Design & Planning n=548, Meta-work n=532, Quality & Risk Management n=401, Infrastructure & Operations n=283 (2,580 total response sets).
  • Analysis pipeline:
    • Stage 1: Three independent large models (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) performed theme discovery on responses.
    • Stage 2: Consolidation/reconciliation of themes using GPT-5.2 with rules (retain single-model themes only if supported by ≥3 participant responses).
    • Stage 3: Two researchers validated and refined the reconciled codebook (fidelity, grounding, distinctness), adding positive/negative examples.
    • Stage 4: All three models independently coded every response against the approved codebook; each decision required a rationale (chain-of-thought). ISSUE_* flags captured problematic responses (~11% flagged).
    • Stage 5: Inter-rater reliability assessed via Krippendorff’s α (themes ranged 0.81–0.97, mean 0.94). Final theme assignments made by 2-of-3 majority vote; researchers spot-checked assignments. A toy sketch of the vote and the α computation follows this list.
  • Limitations noted by authors: stated preferences (not revealed behavior); cross-sectional and Microsoft-specific sample; use of LLMs for coding (mitigated by multi-model corroboration and human review).
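Under simplifying assumptions (nominal theme labels, three coders per response, no missing data), the Stage 4–5 mechanics reduce to a majority vote plus Krippendorff's α over coincidence counts. The sketch below is illustrative; the authors' actual tooling is not described in this summary, and both function names are invented.

```python
# Toy reimplementation of the 2-of-3 vote and Krippendorff's alpha
# (nominal data, complete ratings). Not the authors' code.
from collections import Counter


def majority_vote(labels: list[str]) -> str | None:
    """2-of-3 majority over one response's codes; None means no majority."""
    value, count = Counter(labels).most_common(1)[0]
    return value if count >= 2 else None   # ties escalate to human review


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal data; every coder rates every unit."""
    ratings = [r for unit in units for r in unit]
    n = len(ratings)
    totals = Counter(ratings)

    # Observed disagreement: within-unit pairs of differing labels.
    d_obs = 0.0
    for unit in units:
        counts = Counter(unit)
        pairs = sum(counts[c] * counts[k]
                    for c in counts for k in counts if c != k)
        d_obs += pairs / (len(unit) - 1)
    d_obs /= n

    # Expected disagreement: differing-label pairs across all ratings.
    d_exp = sum(totals[c] * totals[k]
                for c in totals for k in totals if c != k) / (n * (n - 1))

    return 1.0 - d_obs / d_exp


# Toy example: three models coding four responses against one theme.
codes = [["yes", "yes", "yes"], ["yes", "yes", "no"],
         ["no", "no", "no"], ["yes", "no", "no"]]
print([majority_vote(u) for u in codes])            # ['yes', 'yes', 'no', 'no']
print(round(krippendorff_alpha_nominal(codes), 3))  # 0.389
```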

Implications for AI Economics

  • Shifts in task composition and labor demand
    • Complementarity over substitution: The “bounded delegation” pattern implies AI will substitute for routine assembly work but increase demand for verification, review, and system-level coordination. Labor demand is likely to shift toward higher-skilled verification, auditing, and interpretive roles (raising skill premiums for those activities).
    • Right-shift burden: As generative AI accelerates production of code/artifacts, downstream tasks (review, testing, incident triage) expand. Firms may face higher ongoing labor costs for assurance and maintenance unless they invest in complementary AI tooling that reduces verification costs.
  • Productivity accounting and the “productivity paradox”
    • Short-run apparent gains from code generation can be offset by downstream rework, “workslop,” and technical debt. Standard productivity metrics that count lines or tasks produced will overstate net gains unless they incorporate verification / correction time and technical-debt externalities.
    • Empirical ROI assessments for AI tools must include downstream verification costs and quality-adjusted outputs; otherwise firms risk misallocating investment toward front-end code generation instead of whole-lifecycle tools.
  • Technology investment and market opportunities
    • Demand for provenance, uncertainty, authority-scoping, and least-privilege features creates markets for specialized tools: provenance/audit layers, model-certification services, access-control middleware, uncertainty-calibration systems, and tooling for integrating AI outputs into CI/CD pipelines with traceable approvals.
    • Firms that build end-to-end augmentation (not just generation) may capture more value by lowering verification costs and liability exposure—creating a competitive advantage for platforms that embed the four guardrails.
  • Governance, compliance, and regulatory costs
    • The emphasis on provenance and explicit authority suggests higher governance/compliance needs. Industries with regulatory constraints will particularly value traceability and limited delegation; this raises compliance-service demand and potentially increases compliance costs for adopters.
    • Policy implications: regulators and standards bodies may focus on provenance, accountability, and least-privilege defaults—creating minimum-cost-to-comply features that become market norms.
  • Distribution of gains and value capture
    • Platform economics: vendors supplying well-governed augmentation stacks (provenance + verification + controlled automation) are positioned to extract rents because these features are complementary to firms’ risk-management needs.
    • Internal vs external capture: Organizations that internalize tooling for verification and provenance may retain more of the productivity gains, while those that rely on generic generators without governance will bear higher externalities and possibly lose value to downstream costs.
  • Wage, training, and organizational implications
    • Reskilling requirements: demand for skills in verification, AI-system scoping, audit, and integration will increase; organizations must invest in training and may face short-term wage pressure for scarce verification expertise.
    • Task redesign: managers should redesign roles to explicitly allocate authority and review responsibilities, and to measure outcomes in quality-adjusted terms.
  • Measurement and evaluation recommendations
    • Broaden ROI metrics: include downstream correction time, incidence of technical debt, mean time to detect/fix post-deployment, and developer well-being indicators (burnout risk); a toy accounting sketch follows this list.
    • Pilot evaluations: firms should pilot augmentation tools that implement the four guardrails and measure whole-lifecycle effects before scaling generative-only tools.
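A toy accounting sketch of these recommendations: net value is front-end generation savings minus the right-shifted costs the paper warns about. The figures are invented to show the bookkeeping, not estimates from the study.

```python
# Toy whole-lifecycle ROI accounting. All inputs are hypothetical.

def net_hours_saved(gen_hours_saved: float,
                    extra_review_hours: float,
                    rework_hours: float,
                    debt_interest_hours: float) -> float:
    """Net gain = authoring time saved minus downstream assurance costs."""
    return (gen_hours_saved - extra_review_hours
            - rework_hours - debt_interest_hours)


# A tool that saves 10 authoring hours/week can still be net-negative
# once expanded review, rework, and technical-debt servicing are counted.
print(net_hours_saved(gen_hours_saved=10.0,
                      extra_review_hours=6.0,
                      rework_hours=3.0,
                      debt_interest_hours=2.0))          # -1.0
```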

Overall, the paper implies that the economic value of AI in software engineering depends less on raw generation capability and more on complementary investments that reduce verification burden, provide trust/provenance, and respect human authority. For economists and managers, this suggests focusing on complementarities, governance costs, and whole-lifecycle accounting when assessing AI investments and their labor-market impacts.

Assessment

Paper Type: descriptive
Evidence Strength: medium — Based on a sizable (n=860) survey and systematic thematic analysis, the paper provides rich descriptive evidence about developer preferences; however, findings are self-reported, non-experimental, and confined to a single firm, so they do not establish causal effects or broad external validity.
Methods Rigor: medium — The study uses a human-in-the-loop, multi-model council-based thematic analysis, which suggests careful qualitative coding and triangulation, and the sample size is substantial for survey work; but details on sampling frame, response rates, coder agreement, and validation of themes are not reported here, and potential selection and social-desirability biases remain.
Sample: Survey of 860 Microsoft software developers; qualitative responses analyzed via a human-in-the-loop, multi-model council-based thematic analysis to identify 22 desired AI systems across five task categories; demographic and role breakdowns and response-rate details are not provided in the summary.
Themes: human_ai_collab · productivity · org_design · adoption
Generalizability:
  • All respondents are Microsoft employees and may not reflect developers at other firms, startups, or open-source contexts
  • Self-selected survey respondents may over-represent developers who are engaged with or opinionated about AI tooling
  • Findings are based on self-reported preferences and stated constraints, not observed behavior or productivity measures
  • Cultural and organizational practices at Microsoft (e.g., security, code review norms) may shape responses in ways not generalizable to other environments
  • Lack of demographic/role breakdown limits assessment of representativeness across seniority, language stack, or domain

Claims (8)

  • Claim: Developers spend roughly one-tenth of their workday writing code.
    Category: Task Allocation · Direction: null_result · Confidence: high
    Outcome: fraction of workday spent writing code · Details: 0.18
  • Claim: Most AI tooling targets that fraction [the ~10% of the workday spent writing code].
    Category: Adoption Rate · Direction: negative · Confidence: high
    Outcome: focus of AI tooling relative to developer time allocation · Details: 0.18
  • Claim: We surveyed 860 Microsoft developers to understand where they want AI support, and where they want it to stay out.
    Category: Adoption Rate · Direction: null_result · Confidence: high
    Outcome: developer preferences for AI support / rejection · Details: n=860; 0.3
  • Claim: Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories.
    Category: Adoption Rate · Direction: positive · Confidence: high
    Outcome: catalog of desired AI systems and task categories · Details: n=860; 0.18
  • Claim: Developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation.
    Category: Output Quality · Direction: positive · Confidence: high
    Outcome: requested placement/timing of quality signals in developer workflow · Details: n=860; 0.18
  • Claim: Developers wanted systems enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout.
    Category: AI Safety and Ethics · Direction: positive · Confidence: high
    Outcome: desired governance/security features for AI tools (authority scoping, provenance, uncertainty signaling, least-privilege) · Details: n=860; 0.18
  • Claim: This tension reveals a pattern we call 'bounded delegation': developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself.
    Category: Automation Exposure · Direction: positive · Confidence: high
    Outcome: preferred boundary of automation / delegation · Details: n=860; 0.03
  • Claim: That boundary tracks where they locate professional identity, suggesting that the value of AI tooling may lie as much in where and how precisely it stops as in what it does.
    Category: Worker Satisfaction · Direction: mixed · Confidence: medium
    Outcome: relationship between automation boundary and professional identity / perceived value of AI tools · Details: n=860; 0.02
