On-the-fly AI agents often deliver brittle, improvised outputs unsuited to high‑stakes use; embedding disciplined software engineering and a shared 'AI Workflow Store' of hardened workflows can yield far more reliable, reusable, and secure agent behavior, albeit at higher upfront compute and development cost.
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes (iterative design, rigorous testing, adversarial evaluation, staged deployment, and more) that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically constrained agent *workflows* that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an *AI Workflow Store* that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the "on-the-fly" paradigm to navigate effectively.
Summary
Main Finding
The paper argues that the prevailing “on-the-fly” agent paradigm—where LLM-based personal agents synthesize and execute multi-step tool chains in seconds—systematically short‑circuits essential software‑engineering (SE) practices and therefore produces brittle, insecure, and sometimes dangerous behavior. To address this, the authors propose an AI Workflow Store: a backend-driven ecosystem in which SE‑hardened, reusable, and deterministically constrained workflows are engineered, vetted, and stored for reuse by local agents. Properly designed and reused across users, these workflows can (1) materially improve robustness and security versus purely on‑the‑fly synthesis, and (2) amortize the engineering cost across many requests and users.
Key Points
- Problem diagnosis
  - On-the-fly agent loops prioritize speed and low cost at the expense of the iterative design, testing, adversarial evaluation, staged rollout, and monitoring that produce robust software.
  - Real failures include wrong-account actions, data deletion, unauthorized transfers, and prompt-injection attacks via untrusted inputs (e.g., email).
- Vision: AI Workflow Store
  - Architecture layers:
    - Local agent: matches user requests to stored workflows and invokes them (cheap, low latency).
    - Workflow repository: stores hardened, parameterizable workflows; supports filtering, refactoring, and versioning.
    - Backend SE agent team: performs an AI-driven SE lifecycle (requirements, design exploration, implementation, adversarial testing, staged deployment) to create new workflows when needed.
  - Workflows are durable artifacts (code + policies + invariants) rather than ephemeral LLM prompts or one-off scripts.
- Tradeoffs and design tensions
  - Robustness vs. generality: precise, coded workflows give stronger guarantees but cover fewer cases; natural-language/skill descriptions are more general but probabilistic.
  - Latency vs. rigor: hardened workflows require more upfront compute/time to produce; this cost must be amortized by reuse across users.
  - Automation limits: design exploration and adversarial testing are difficult to automate fully and may require curated datasets, human oversight, and repositories of design patterns.
- Example (motivating): “Book the Airbnb Bob recommended”
  - On-the-fly approach: LLM search + extraction + booking; vulnerable to selecting the wrong email and to prompt injection.
  - In-loop defenses: policy generation can help, but it often lacks context and support for retroactive privilege adjustments.
  - Engineered workflow: classify the recommendation, then present a trusted “Book this recommendation” overlay that constrains the action and requires user confirmation; this reduces the attack surface and is testable.
- Lifecycle and repository dynamics
  - Backend teams aim to generalize individual requests into reusable workflow abstractions to maximize amortization.
  - Continuous refactoring, deduplication, and generalization are core backend tasks to keep the repository valuable.
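The three architecture layers and the confirmation step from the booking example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's design: the names `Workflow`, `WorkflowStore`, `handle`, the keyword-overlap matching rule, and the example URL are all hypothetical, and a real repository would need far richer matching, versioning, and policy checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Workflow:
    """A hardened, parameterizable workflow: vetted code plus declared invariants."""
    name: str
    version: str
    keywords: frozenset[str]            # hypothetical matching metadata
    required_params: tuple[str, ...]    # parameters the caller must supply
    run: Callable[[dict], str]          # the vetted, deterministic action
    needs_confirmation: bool = True     # invariant: the user must approve


@dataclass
class WorkflowStore:
    """In-memory stand-in for the workflow repository."""
    workflows: list[Workflow] = field(default_factory=list)
    backlog: list[str] = field(default_factory=list)  # requests for the backend team

    def match(self, request: str) -> Optional[Workflow]:
        """Return the workflow sharing the most keywords with the request, if any."""
        words = set(request.lower().split())
        best = max(self.workflows, key=lambda w: len(w.keywords & words), default=None)
        if best is None or not (best.keywords & words):
            return None
        return best


def handle(store: WorkflowStore, request: str, params: dict,
           confirm: Callable[[Workflow, dict], bool]) -> str:
    """Local-agent step: invoke a stored workflow or defer to the backend."""
    wf = store.match(request)
    if wf is None:
        # No hardened workflow exists yet: queue the request instead of improvising.
        store.backlog.append(request)
        return "queued for backend engineering"
    missing = [p for p in wf.required_params if p not in params]
    if missing:
        return f"rejected: missing parameters {missing}"
    if wf.needs_confirmation and not confirm(wf, params):
        return "cancelled by user"
    return wf.run(params)


# Hypothetical workflow for the motivating example.
book = Workflow(
    name="book_recommended_listing",
    version="1.0.0",
    keywords=frozenset({"book", "airbnb", "recommended"}),
    required_params=("listing_url",),
    run=lambda p: f"booked {p['listing_url']}",
)
store = WorkflowStore([book])

result = handle(
    store,
    "Book the Airbnb Bob recommended",
    {"listing_url": "https://example.com/listing/1"},
    confirm=lambda wf, p: True,  # stands in for the trusted confirmation overlay
)
print(result)  # booked https://example.com/listing/1
```

The point of the sketch is the control flow: the local agent never synthesizes a tool chain itself. It either runs a vetted workflow behind an explicit confirmation gate or records the request so the backend can engineer a workflow for it.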
Data & Methods
- Nature of the paper: conceptual/vision paper (no empirical dataset or randomized experiments presented).
- Methods used in the paper:
  - Conceptual analysis of failures in current agent architectures and mapping to historical lessons from software engineering.
  - A realistic motivating example (email → booking) to compare three design approaches (vanilla on-the-fly, on-the-fly with simple guardrails, and an engineered workflow).
  - Architectural sketch and taxonomy of components and responsibilities (local agent, workflow repo, backend SE agent team).
  - Identification of design variables (workflow representation spectrum, robustness vs. generality, synchronous/asynchronous production, automation limits).
  - Enumeration of research challenges and open problems (workflow specification languages, adversarial test generation, discovery/matching, privacy/authorization, governance, incentives).
- Proposed evaluation directions (qualitative and implied quantitative work to be done):
  - Define metrics for robustness (reliability/security under benign and adversarial settings) and generality (coverage of request space).
  - Adversarial testing frameworks and staged rollouts to empirically compare on-the-fly vs. workflow-based approaches.
  - Cost/benefit and amortization studies to evaluate when upfront SE investment pays off given reuse rates.
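The amortization question in the last item can be made concrete with a simple linear cost model. The model is an assumption introduced here for illustration, not something the paper specifies: a workflow costs a fixed upfront amount to engineer plus a small cost per invocation, while on-the-fly synthesis costs a larger amount on every request. The function names and the numbers below are hypothetical and in arbitrary cost units.

```python
import math
from typing import Optional


def average_cost_per_use(upfront: float, per_invocation: float, uses: int) -> float:
    """Average cost of one workflow invocation once the upfront cost is spread over `uses`."""
    if uses <= 0:
        raise ValueError("uses must be positive")
    return upfront / uses + per_invocation


def break_even_uses(upfront: float, workflow_per_use: float,
                    on_the_fly_per_use: float) -> Optional[int]:
    """Smallest reuse count at which the workflow is no more expensive than on-the-fly.

    Returns None when the workflow is not cheaper per invocation, in which case
    the upfront cost is never recovered under this model.
    """
    saving = on_the_fly_per_use - workflow_per_use
    if saving <= 0:
        return None
    return math.ceil(upfront / saving)


# Illustrative numbers only.
n = break_even_uses(upfront=100.0, workflow_per_use=0.25, on_the_fly_per_use=0.75)
print(n)                                       # 200
print(average_cost_per_use(100.0, 0.25, 200))  # 0.75
```

Under these assumed numbers the workflow matches the on-the-fly cost at 200 uses and is cheaper beyond that. The model deliberately ignores the reliability and security benefits the paper emphasizes; folding in an expected cost of failures would lower the break-even point further.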
Implications for AI Economics
- Upfront investment and amortization economics
  - The Workflow Store requires higher upfront compute and labor to produce hardened workflows. Economic viability hinges on reuse: the more users/requests a workflow serves, the lower the average cost per use, which creates economies of scale.
  - If tasks are sufficiently similar across users, amortization yields cost savings and improved reliability; if tasks remain highly idiosyncratic, amortization may fail and costs remain high.
- Platformization and market structure
  - Workflow repositories have strong platform and network-effects potential. Trusted, well-tested workflows become valuable assets (intellectual property or platform monopolies).
  - Markets may bifurcate into public/shared workflow libraries (commons, standards) and proprietary workflow catalogs (vendor lock-in, paid access).
- Labor and specialization
  - New productive roles and firms: backend SE agent teams, workflow auditors, adversarial-test providers, and workflow marketplaces. Some human oversight remains required; work shifts from prompt crafting to workflow design, validation, and governance.
  - Automation of parts of the SE lifecycle (via AI) can reduce human costs but not eliminate the need for specialist skills in design exploration and security testing, creating a mixed capital/labor substitution story.
- Compute and pricing effects
  - Net compute demand may shift: more compute moves upstream into workflow engineering (higher per-workflow cost), while local agents use less compute per invocation. Pricing models for agent services might reflect workflow engineering costs (subscription, per-workflow fee, tiered SLAs).
  - Firms may bundle access to hardened workflows as premium features; consumers may face tradeoffs between price, latency, and safety guarantees.
- Security externalities and public policy
  - Systemic security improvements from hardened workflows reduce negative externalities (fraud, data loss) and may justify public investment or standards/certification regimes for critical workflows (financial transfers, healthcare, enterprise admin).
  - Liability and regulation: provenance, audit logs, and staged deployment practices become economically relevant; regulators may require minimum SE practices for high-stakes workflows.
- Incentives and market failures
  - Without reuse incentives or reputation mechanisms, firms may underinvest in rigor (free-riding or winner-take-most dynamics). Reputation systems, certification, open standards, or regulation can align incentives toward sufficient engineering.
  - Supply of good adversarial datasets and design patterns is a public good; underprovision could limit the effectiveness of automated SE pipelines.
- Consumer surplus and productivity
  - More reliable agents expand the set of tasks consumers are willing to delegate to agents, increasing agent-driven economic activity and productivity gains.
  - However, higher reliability could increase moral hazard or over-reliance, raising new regulatory/insurance considerations.
- Research & investment priorities from an economic lens
  - Fund/curate shared adversarial datasets and workflow benchmarks to reduce duplication and improve automation quality.
  - Invest in workflow discovery/matching algorithms and metadata markets so costly engineering is reused effectively.
  - Explore pricing and contracting models (e.g., certified workflows vs. quick on-the-fly modes) that let users choose risk/latency/cost tradeoffs.
Summary takeaway: The AI Workflow Store reframes robustness as an engineering and economic design problem: it requires larger upfront SE investment and shared repositories to convert per-prompt fragility into amortizable, reusable assets. Realizing the vision will reshape incentives, platform economics, labor roles, and compute allocation, and may require governance to internalize security externalities.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. [Adoption Rate] | null_result | high | paradigm_adoption | 0.06 |
| The on-the-fly paradigm short-circuits disciplined software engineering processes (iterative design, rigorous testing, adversarial evaluation, staged deployment, and more) that have delivered relatively reliable and secure systems. [Output Quality] | negative | high | reliability and security (degree to which SE processes are applied) | 0.02 |
| By focusing on rapid, real-time synthesis, AI agents are effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them. [AI Safety and Ethics] | negative | high | suitability for high-stakes use / risk to users | 0.02 |
| Integrating rigorous software engineering processes into the agentic loop will produce production-grade, hardened, and deterministically constrained agent workflows that substantially outperform brittle on-the-fly synthesis. [Output Quality] | positive | high | workflow reliability/security and overall performance compared to on-the-fly synthesis | 0.02 |
| Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community. [Adoption Rate] | negative | high | resource_costs (compute/time) and implications for amortization/adoption | 0.02 |
| An AI Workflow Store of hardened and reusable workflows would allow agents to invoke workflows with far greater reliability and security than improvised tool chains. [Output Quality] | positive | high | reliability and security of agent-invoked workflows | 0.02 |
| The research challenges for this vision stem from a broader flexibility-robustness tension that requires moving beyond the on-the-fly paradigm to navigate effectively. [Organizational Efficiency] | mixed | high | trade-off between flexibility and robustness in agent design | 0.02 |