Structuring GenAI-driven development with executable requirements and architectural artifacts curbs agent drift and moves engineers from line-by-line coding to higher-level design work; early lab tests on a web app show improved traceability and maintainability compared with unstructured vibe coding.

Shift-Up: A Framework for Software Engineering Guardrails in AI-native Software Development -- Initial Findings

Petrus Lipsanen, Liisa Rannikko, François Christophe, Konsta Kalliokoski, Vlad Stirbu, Tommi Mikkonen · April 22, 2026

arXiv · descriptive · low evidence · 7/10 relevance

Embedding machine-readable requirements and architectural artifacts (BDD, C4, ADRs) as guardrails for GenAI-driven development stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation in an exploratory web-app evaluation.

Generative AI (GenAI) is reshaping software engineering by shifting development from manual coding toward agent-driven implementation. While vibe coding promises rapid prototyping, it often suffers from architectural drift, limited traceability, and reduced maintainability. Applying the design science research (DSR) methodology, this paper proposes Shift-Up, a framework that reinterprets established software engineering practices, like executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs), as structural guardrails for GenAI-native development. Preliminary findings from our exploratory evaluation compare unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application. These findings indicate that embedding machine-readable requirements and architectural artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation activities. The results suggest that traditional software engineering artifacts can serve as effective control mechanisms in AI-assisted development.

Summary

Main Finding

Embedding traditional software-engineering artifacts (machine-readable requirements, executable acceptance tests, architectural models and ADRs) as persistent guardrails in GenAI-driven development (the "Shift‑Up" framework) stabilizes agent behavior, reduces reactive human intervention, and shifts developer effort from low-level coding to higher-level design, orchestration and validation—at the cost of higher upfront investment and slower early velocity.

Key Points

Problem: "Vibe coding" (unstructured, prompt-driven agent development) accelerates prototyping but often produces architectural drift, poor traceability and maintainability.
Proposal: Shift‑Up reinterprets established practices (BDD/executable requirements, C4 architectural modeling, ADRs) as machine‑readable guardrails that are continuously fed to GenAI agents.
Workflow: Produce SRS → decompose into user stories → transform into executable BDD acceptance tests (Robot Framework) → generate C4/ADR context → create implementation roadmap and GitHub issues → iterate: agent implements issue → run acceptance tests → repeat until tests pass.
Empirical comparison (exploratory):
- Three paradigms evaluated on a snack-bar full‑stack web app: unstructured vibe coding (Lovable), structured prompt engineering (VS Code + GPT‑5.0‑Codex), and initial Shift‑Up (Claude Sonnet 4.5 + GPT‑5.0‑Codex).
- Shift‑Up produced 68 user stories and 175 Robot Framework test cases; 176 prompts were recorded across the structured prompt engineering and Shift‑Up runs for analysis.
- Prompt-role distributions:
  - Shift‑Up: proceed-with-next-step 62%, execute-acceptance-tests 16%, developer-identified-fixes 9%, acceptance-of-agent-solution 7%, initiate-next-step 5% — indicating strategic orchestration and automated validation.
  - Structured prompt engineering: 52% of prompts reacted to agent output (GUI/IDE fixes) and 27% proceeded to the next step, indicating more reactive human intervention.
Tradeoffs: Shift‑Up increases human control, structured constraints and guardrails, but requires higher upfront work and reduces early raw development speed (qualitative assessment).
Partial answers to research questions:
- RQ1: Structured, machine-readable artifacts increase agent autonomy for implementation when paired with continuous executable validation.
- RQ2: Initial evidence suggests executable requirements reduce agent drift vs. prompt-only workflows, but results are preliminary and project-specific.
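
The final step of the workflow above (agent implements an issue, acceptance tests run, repeat until they pass) can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' tooling: `Issue`, `implement`, `tests_pass`, and `max_attempts` are hypothetical names. `implement` stands in for a prompt sent to a GenAI agent, and `tests_pass` stands in for a run of the executable acceptance suite.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Issue:
    """One unit of work from the implementation roadmap (hypothetical shape)."""
    title: str
    attempts: int = 0
    done: bool = False


def run_guardrailed_loop(
    issues: List[Issue],
    implement: Callable[[Issue, str], None],
    tests_pass: Callable[[Issue], bool],
    max_attempts: int = 5,
) -> List[Issue]:
    """Drive each issue until its acceptance tests pass or attempts run out.

    `implement` is a placeholder for prompting an agent; `tests_pass` is a
    placeholder for running the acceptance tests. Both are assumptions.
    """
    for issue in issues:
        feedback = "proceed with next step"
        while issue.attempts < max_attempts:
            issue.attempts += 1
            implement(issue, feedback)
            if tests_pass(issue):
                issue.done = True
                break
            # The test result, not a human reading the code, drives the retry.
            feedback = "acceptance tests failed; fix and retry"
    return issues


if __name__ == "__main__":
    # Fake agent and fake test runner, for illustration only.
    def fake_implement(issue: Issue, feedback: str) -> None:
        print(f"{issue.title}: {feedback}")

    def fake_tests(issue: Issue) -> bool:
        return issue.attempts >= 2  # pretend the second attempt succeeds

    result = run_guardrailed_loop([Issue("Add product list")], fake_implement, fake_tests)
    print(result[0].done, result[0].attempts)
```

In this sketch the human supplies only the initial "proceed" instruction; retries are triggered by the test gate, which is one way to read the high share of proceed-with-next-step prompts reported for Shift-Up.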

Data & Methods

Research paradigm: Design Science Research (DSR), used to design, implement, and conduct an exploratory evaluation of the Shift‑Up artifact.
Implementation context: Development of a full‑stack web application (GUI, PostgreSQL, admin, backend logic).
Tools and agents:
- Lovable platform for unstructured vibe coding.
- VS Code + GPT‑5.0‑Codex for structured prompt engineering and parts of Shift‑Up.
- Claude Sonnet 4.5 for early Shift‑Up artifact generation and refinement.
- Robot Framework used for executable BDD acceptance tests.
Evaluation setup:
- Human authors acted as developers; in all runs humans wrote no production code—work was delegated to GenAI agents via prompts.
- Recorded all prompts and developer journal notes; measured implementation time and categorized prompts.
- Inductive qualitative analysis of prompts to identify interaction patterns and developer roles.
- Produced artifacts: SRS, 68 user stories, 175 test cases, C4/ADR documents, implementation roadmap, GitHub issues, branch-and-PR workflow with automated test gating.
Limitations:
- Exploratory and project-specific (single common application domain).
- Shift‑Up rollout incomplete at time of report; technical/code-quality metrics deferred to subsequent analysis.
- Small‑scale, qualitative-focused evidence; no large-sample or long‑term maintenance evaluation yet.
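
The prompt categorization described under Evaluation setup amounts to labeling each recorded prompt and reporting each label's share. A minimal sketch of that tally follows, assuming the labels already exist as strings. The function name `category_shares` and the toy log are hypothetical, and the toy log is not the study's data.

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return each category's share of all labeled prompts, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders categories from most to least frequent.
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Toy log, NOT the study's data: labels are made up for illustration.
toy_log = [
    "proceed-with-next-step",
    "proceed-with-next-step",
    "execute-acceptance-tests",
    "developer-identified-fixes",
]
print(category_shares(toy_log))
```

Because each share is rounded independently, a reported distribution can sum to slightly less or more than 100%.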

Implications for AI Economics

Division of labor and skill structure:
- Shift from low‑level coding to higher‑value tasks (requirements, architecture, validation) implies increased demand for skills in specification, orchestration, verification and system design rather than routine implementation. This can raise the skill premium for designers/architects and reduce demand for traditional coder roles engaged in boilerplate implementation.
Productivity and returns:
- Agents plus guardrails can raise effective productivity by enabling agents to operate autonomously across implementation steps. However, the necessary upfront investment in machine‑readable specifications imposes fixed costs that are amortized over larger projects or repeated product lines, favoring firms with scale or repeated-product business models.
Tradeoffs in speed vs. quality (cost structure):
- Pure vibe coding yields rapid prototyping with lower upfront costs but higher downstream quality and maintenance costs (technical debt). Shift‑Up increases upfront costs (specification time) and slows initial delivery but likely reduces rework, maintenance costs, and architectural drift—changing the timing and composition of costs across the project lifecycle.
Market structure and firm strategy:
- Firms that adopt guardrail approaches may gain competitive advantage through more predictable, maintainable AI‑generated products, attracting clients in quality‑sensitive markets (finance, healthcare, enterprise). Startups seeking rapid MVPs may still prefer vibe coding initially, creating bifurcation in market practices.
Platform dependence and transaction costs:
- The unstructured approach (e.g., Lovable) can create platform lock‑in; Shift‑Up aims to reduce lock‑in by producing standard artifacts (SRS, BDD tests, C4, ADR) and GitHub‑centered workflows that are more portable. However, guardrail creation has transaction costs (time, expertise) that change contracting and procurement decisions.
Governance, standards and public goods:
- Widespread adoption of machine‑readable guardrails suggests opportunities for standardization (test/spec formats, ADR templates). Standard artifacts could lower verification costs across vendors, facilitate auditing, and support markets for third‑party validation and certification—important for regulated industries.
Labor market and organizational design:
- Organizations should reconfigure roles: invest in requirements engineering, test engineering, and architecture; provide training for staff to orchestrate and validate agents. Compensation and hiring practices may shift to favor domain experts who can author precise, machine‑readable requirements.
Investment and financing implications:
- Investors should assess whether a startup’s development model leans on fast prototyping or on guardrailed, maintainable delivery. The latter model may show higher initial burn but lower long‑term risk, potentially affecting valuation, time-to-market expectations and capital needs.
Policy and externalities:
- If unguarded GenAI development creates widespread low‑quality software, negative externalities (security risks, maintenance burden) could increase social costs. Policies or industry standards incentivizing guardrails (e.g., through procurement rules, certification) could internalize such externalities.

Takeaway for economists and policymakers: The economic impact of GenAI on software production depends critically on organizational choices about guardrails. There are clear tradeoffs—speed vs. durability, lower up-front costs vs. lower downstream rework—that will shape firm behavior, market structure, labor demand and returns to scale in AI-assisted software production.
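
The upfront-versus-downstream cost tradeoff can be made concrete with a simple linear model, in which each approach has a fixed upfront cost plus a constant cost per period. This is an illustrative assumption, not a model from the paper. The function name, parameters, and numbers below are hypothetical, and the paper reports no cost figures.

```python
from typing import Optional


def break_even_period(
    upfront_guardrailed: float,
    upfront_vibe: float,
    per_period_guardrailed: float,
    per_period_vibe: float,
) -> Optional[float]:
    """Period at which cumulative costs are equal under a linear cost model.

    cumulative_cost(t) = upfront + per_period * t for each approach.
    Returns None unless the guardrailed approach has both a higher upfront
    cost and a lower per-period cost, which is the case the text describes.
    """
    extra_upfront = upfront_guardrailed - upfront_vibe
    per_period_saving = per_period_vibe - per_period_guardrailed
    if extra_upfront <= 0 or per_period_saving <= 0:
        return None
    return extra_upfront / per_period_saving


# Hypothetical numbers in arbitrary cost units, not taken from the paper.
print(break_even_period(40, 10, 2, 5))  # 10.0
```

Under these made-up numbers the guardrailed approach costs more for the first ten periods and less afterwards, so projects shorter than the break-even horizon would favor vibe coding and longer ones would favor guardrails.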


Assessment

Paper Type: descriptive
Evidence Strength: low. Findings come from a preliminary, exploratory evaluation comparing three development styles on a single web-application task with no strong causal design, no reported sample size or statistical inference, and likely limited replication, so claims about GenAI's effects on productivity and maintainability are suggestive but not robust.
Methods Rigor: low. The study uses design science and a small-scale comparative evaluation without randomized assignment, pre-registered measures, blinded assessment, or long-term deployment data; the methodology is appropriate for early-stage artifact validation but lacks rigor for causal or generalizable claims.
Sample: Exploratory lab-style evaluation where participants/agents implemented a web application under three conditions: unstructured vibe coding, structured prompt engineering, and the Shift-Up framework embedding machine-readable requirements (BDD), C4 architectural models, and ADRs; exact number of participants, their expertise, and specifics of the GenAI models used are not reported.
Themes: human_ai_collab, productivity
Generalizability:
- Small, preliminary, single-application evaluation limits external validity.
- Unknown participant expertise and small sample are likely unrepresentative of real-world teams.
- Results depend on the particular GenAI models, prompt setups, and toolchains used.
- Short-term prototyping task does not capture long-term maintainability or production deployment issues.
- Lab conditions may differ from complex organizational processes and scale.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Generative AI (GenAI) is reshaping software engineering by shifting development from manual coding toward agent-driven implementation.	Automation Exposure	positive	high	shift toward agent-driven implementation (automation exposure)	0.18
Vibe coding (unstructured GenAI-driven coding) promises rapid prototyping but often suffers from architectural drift, limited traceability, and reduced maintainability.	Output Quality	negative	high	architectural drift, traceability, maintainability	0.18
This paper proposes Shift-Up, a framework that reinterprets established software engineering practices (executable requirements / BDD, C4 architectural modeling, and architecture decision records / ADRs) as structural guardrails for GenAI-native development.	Organizational Efficiency	positive	high	use of traditional SE artifacts as structural guardrails	0.03
An exploratory evaluation compared unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application.	Other	null_result	high	comparative evaluation of development approaches	0.18
Embedding machine-readable requirements and architectural artifacts stabilizes agent behavior.	Developer Productivity	positive	medium	agent behavior stability	0.05
Embedding machine-readable requirements and architectural artifacts reduces implementation drift.	Output Quality	positive	high	implementation drift	0.09
Using these artifacts shifts human effort toward higher-level design and validation activities.	Task Allocation	positive	medium	allocation of human effort to design and validation	0.05
Traditional software engineering artifacts can serve as effective control mechanisms in AI-assisted development.	Organizational Efficiency	positive	high	effectiveness of traditional SE artifacts as control mechanisms	0.09