On-the-fly AI agents often deliver brittle, improvised outputs unsuited to high‑stakes use; embedding disciplined software engineering and a shared 'AI Workflow Store' of hardened workflows can yield far more reliable, reusable, and secure agent behavior, albeit at higher upfront compute and development cost.

Engineering Robustness into Personal Agents with the AI Workflow Store

Roxana Geambasu, Mariana Raykova, Pierre Tholoniat, Trishita Tiwari, Lillian Tsai, Wen Zhang · May 11, 2026

arXiv · theoretical · evidence: n/a · relevance: 7/10

The paper argues that real-time, on-the-fly AI agents tend to deliver improvised, brittle prototypes rather than production-grade systems and proposes integrating rigorous software-engineering practices plus a shared 'AI Workflow Store' of hardened, reusable workflows to produce more reliable, secure agent behavior.

The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, adversarial evaluation, staged deployment, and more -- that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent *workflows* that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an *AI Workflow Store* that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the "on-the-fly" paradigm to navigate effectively.

Summary

Main Finding

The paper argues that the prevailing “on-the-fly” agent paradigm—where LLM-based personal agents synthesize and execute multi-step tool chains in seconds—systematically short‑circuits essential software‑engineering (SE) practices and therefore produces brittle, insecure, and sometimes dangerous behavior. To address this, the authors propose an AI Workflow Store: a backend-driven ecosystem in which SE‑hardened, reusable, and deterministically constrained workflows are engineered, vetted, and stored for reuse by local agents. Properly designed and reused across users, these workflows can (1) materially improve robustness and security versus purely on‑the‑fly synthesis, and (2) amortize the engineering cost across many requests and users.

Key Points

Problem diagnosis
- On‑the‑fly agent loops prioritize speed and low cost at the expense of the iterative design, testing, adversarial evaluation, staged rollout, and monitoring that produce robust software.
- Real failures include wrong-account actions, data deletion, unauthorized transfers, and prompt‑injection attacks via untrusted inputs (e.g., email).
Vision: AI Workflow Store
- Architecture layers:
  - Local agent: matches user requests to stored workflows and invokes them (cheap, low latency).
  - Workflow repository: stores hardened, parameterizable workflows; supports filtering, refactoring, and versioning.
  - Backend SE agent team: performs an AI‑driven SE lifecycle (requirements, design exploration, implementation, adversarial testing, staged deployment) to create new workflows when needed.
- Workflows are durable artifacts (code + policies + invariants) rather than ephemeral LLM prompts or one‑off scripts.
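
The division of labor between the three layers can be sketched in a few lines of Python. This is a minimal illustration, not the paper's design: the names `Workflow`, `WorkflowRepository`, and `local_agent` are hypothetical, and keyword overlap stands in for whatever request-matching mechanism a real store would use.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Workflow:
    """A hardened, parameterizable workflow (hypothetical representation)."""
    name: str
    version: str
    keywords: FrozenSet[str]      # crude stand-in for real request matching
    run: Callable[[dict], str]    # deterministic, pre-vetted code path


class WorkflowRepository:
    """Stores vetted workflows; versioning is reduced to a string here."""

    def __init__(self) -> None:
        self._workflows: List[Workflow] = []

    def publish(self, workflow: Workflow) -> None:
        self._workflows.append(workflow)

    def match(self, request: str) -> Optional[Workflow]:
        words = set(request.lower().split())
        # Pick the workflow sharing the most keywords with the request.
        best = max(self._workflows, key=lambda w: len(w.keywords & words), default=None)
        if best is None or not (best.keywords & words):
            return None
        return best


def local_agent(request: str, params: dict, repo: WorkflowRepository) -> str:
    """Invoke a stored workflow if one matches; otherwise defer to the backend."""
    workflow = repo.match(request)
    if workflow is None:
        # No hardened workflow yet: hand the request to the backend SE agent
        # team instead of improvising a tool chain on the fly.
        return "queued for backend workflow engineering"
    return workflow.run(params)
```

The point of the sketch is the fallback branch: when nothing matches, the local agent does not synthesize its own plan but routes the request to the slower, more rigorous backend path.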
Tradeoffs and design tensions
- Robustness vs. generality: precise, coded workflows give stronger guarantees but cover fewer cases; natural‑language/skill descriptions are more general but probabilistic.
- Latency vs. rigor: hardened workflows require more upfront compute/time to produce; this cost must be amortized by reuse across users.
- Automation limits: design exploration and adversarial testing are difficult to automate fully and may require curated datasets, human oversight, and repositories of design patterns.
Example (motivating): “Book the Airbnb Bob recommended”
- On‑the‑fly approach: LLM search + extraction + booking — vulnerable to choosing wrong emails and prompt injection.
- In‑loop defenses: policy generation can help, but it often lacks context and the ability to adjust privileges retroactively.
- Engineered workflow: classify the recommendation, then present a trusted “Book this recommendation” overlay that constrains the action and requires user confirmation. This reduces the attack surface and makes the behavior testable.
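
To make the contrast concrete, here is a hedged sketch of what a deterministically constrained booking workflow could look like. Everything in it is an assumption for illustration: the `Email` type, the `book_recommendation` function, the sender-equality check (a real system would need authenticated sender identity), and the `confirm` callback that stands in for the trusted overlay.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Email:
    sender: str
    body: str
    listing_url: Optional[str]  # assumed to come from a separate, tested parser


def book_recommendation(
    emails: List[Email],
    trusted_sender: str,
    confirm: Callable[[str], bool],
    book: Callable[[str], str],
) -> str:
    """Book a listing recommended by `trusted_sender`, only after user confirmation."""
    # Invariant 1: only emails from the named contact are considered, so text
    # in other (untrusted) emails can never steer the booking.
    candidates = [e for e in emails if e.sender == trusted_sender and e.listing_url]
    if len(candidates) != 1:
        # Invariant 2: ambiguity stops the workflow instead of guessing.
        return "aborted: expected exactly one recommendation"
    url = candidates[0].listing_url
    # Invariant 3: the booking action runs only after explicit confirmation,
    # standing in for the trusted "Book this recommendation" overlay.
    if not confirm(url):
        return "aborted: user declined"
    return book(url)
```

Because each invariant is an ordinary branch in ordinary code, it can be unit-tested and attacked with adversarial inputs before release, which is not possible for a plan an LLM improvises per request.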
Lifecycle and repository dynamics
- Backend teams aim to generalize individual requests into reusable workflow abstractions to maximize amortization.
- Continuous refactoring, deduplication and generalization are core backend tasks to keep the repository valuable.

Data & Methods

Nature of the paper: conceptual/vision paper (no empirical dataset or randomized experiments presented).
Methods used in the paper:
- Conceptual analysis of failures in current agent architectures and mapping to historical lessons from software engineering.
- A realistic motivating example (email → booking) to compare three design approaches (vanilla on‑the‑fly, on‑the‑fly with simple guardrails, and an engineered workflow).
- Architectural sketch and taxonomy of components and responsibilities (local agent, workflow repo, backend SE agent team).
- Identification of design variables (workflow representation spectrum, robustness vs. generality, synchronous/asynchronous production, automation limits).
- Enumeration of research challenges and open problems (workflow specification languages, adversarial test generation, discovery/matching, privacy/authorization, governance, incentives).
Proposed evaluation directions (qualitative and implied quantitative work to be done):
- Define metrics for robustness (reliability/security under benign and adversarial settings) and generality (coverage of request space).
- Adversarial testing frameworks and staged rollouts to empirically compare on‑the‑fly vs. workflow‑based approaches.
- Cost/benefit and amortization studies to evaluate when upfront SE investment pays off given reuse rates.
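
The paper does not define these metrics, so the following is only one possible operationalization, with hypothetical names (`Trial`, `robustness`, `generality`): robustness as the success rate on handled requests, reported separately for benign and adversarial inputs, and generality as the share of requests the system can handle at all.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    handled: bool      # did the system accept the request at all?
    correct: bool      # did it do the right (and safe) thing?
    adversarial: bool  # was the input crafted by an attacker?


def robustness(trials: List[Trial], adversarial: bool) -> float:
    """Fraction of handled trials in one setting that ended correctly."""
    subset = [t for t in trials if t.handled and t.adversarial == adversarial]
    return sum(t.correct for t in subset) / len(subset) if subset else float("nan")


def generality(trials: List[Trial]) -> float:
    """Fraction of all requests the system was able to handle (coverage)."""
    return sum(t.handled for t in trials) / len(trials) if trials else float("nan")
```

Reporting the two numbers side by side makes the robustness-versus-generality tension visible: a workflow store can score high on robustness simply by refusing most requests, which shows up as low generality.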

Implications for AI Economics

Upfront investment and amortization economics
- The Workflow Store requires higher upfront compute and labor to produce hardened workflows. Economic viability hinges on reuse: the more users and requests a workflow serves, the lower its average cost per use, which creates economies of scale.
- If tasks are sufficiently similar across users, amortization yields cost savings and improved reliability; if tasks remain highly idiosyncratic, amortization may fail and costs remain high.
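
The amortization argument reduces to simple arithmetic. The sketch below assumes a one-time engineering cost per workflow and constant per-invocation costs for both approaches; these simplifications and the function names are illustrative assumptions, and the model deliberately ignores the cost of failures, which would further favor hardened workflows.

```python
import math


def average_cost_per_use(upfront_cost: float, per_use_cost: float, uses: int) -> float:
    """Upfront engineering cost spread over `uses`, plus the cost of each invocation."""
    if uses <= 0:
        raise ValueError("uses must be positive")
    return upfront_cost / uses + per_use_cost


def break_even_uses(upfront_cost: float, workflow_per_use: float, on_the_fly_per_use: float) -> float:
    """Smallest number of uses at which the workflow is no more expensive per use.

    Returns math.inf when the workflow is not cheaper per invocation, i.e. the
    upfront cost can never be recovered on cost grounds alone.
    """
    saving = on_the_fly_per_use - workflow_per_use
    if saving <= 0:
        return math.inf
    return math.ceil(upfront_cost / saving)
```

For example, with an upfront cost of 1000 units, 1 unit per workflow invocation, and 5 units per on-the-fly run, `break_even_uses(1000, 1.0, 5.0)` gives 250 uses; highly idiosyncratic tasks that never reach that reuse count are the case in which amortization fails.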
Platformization and market structure
- Workflow repositories have strong platform and network‑effects potential. Trusted, well‑tested workflows become valuable assets, whether as intellectual property or as the basis for platform monopolies.
- Markets may bifurcate into public/shared workflow libraries (commons, standards) and proprietary workflow catalogs (vendor lock‑in, paid access).
Labor and specialization
- New productive roles and firms: backend SE agent teams, workflow auditors, adversarial‑test providers, and workflow marketplaces. Some human oversight remains required; work shifts from prompt crafting to workflow design, validation, and governance.
- Automation of parts of the SE lifecycle (via AI) can reduce human costs but not eliminate the need for specialist skills in design exploration and security testing—creating a mixed capital/labor substitution story.
Compute and pricing effects
- Net compute demand may shift: more compute offloaded to upstream workflow engineering (higher per‑workflow cost) but lower compute per invocation for local agents. Pricing models for agent services might reflect workflow engineering costs (subscription, per‑workflow fee, tiered SLAs).
- Firms may bundle access to hardened workflows as premium features; consumers may face tradeoffs between price, latency, and safety guarantees.
Security externalities and public policy
- Systemic security improvements from hardened workflows reduce negative externalities (fraud, data loss) and may justify public investment or standards/certification regimes for critical workflows (financial transfers, healthcare, enterprise admin).
- Liability and regulation: provenance, audit logs, and staged deployment practices become economically relevant—regulators may require minimum SE practices for high‑stakes workflows.
Incentives and market failures
- Without reuse incentives or reputation mechanisms, firms may underinvest in rigor (free‑riding or winner‑take‑most dynamics). Reputation systems, certification, open standards, or regulation can align incentives toward sufficient engineering.
- Supply of good adversarial datasets and design patterns is a public good; underprovision could limit the effectiveness of automated SE pipelines.
Consumer surplus and productivity
- More reliable agents expand the set of tasks consumers are willing to delegate to agents, increasing agent‑driven economic activity and productivity gains.
- However, higher reliability could increase moral hazard or over‑reliance, raising new regulatory/insurance considerations.
Research & investment priorities from an economic lens
- Fund/curate shared adversarial datasets and workflow benchmarks to reduce duplication and improve automation quality.
- Invest in workflow discovery/matching algorithms and metadata markets so costly engineering is reused effectively.
- Explore pricing and contracting models (e.g., certified workflows vs. quick on‑the‑fly modes) that let users choose risk/latency/cost tradeoffs.

Summary takeaway: The AI Workflow Store reframes robustness as an engineering and economic design problem: larger upfront SE investments and shared repositories are needed to convert per‑prompt fragility into amortizable, reusable assets. Realizing the vision would reshape incentives, platform economics, labor roles, and compute allocation, and may require governance to internalize security externalities.

Assessment

Paper Type: theoretical
Evidence Strength: n/a. The paper is conceptual and prescriptive: it presents arguments, a vision (the AI Workflow Store), and research challenges rather than empirical tests or causal identification, so there is no empirical evidence to rate.
Methods Rigor: n/a. No empirical methods, identification, or formal estimation procedures are applied; the rigor pertains to argumentative coherence and plausibility rather than methodological implementation.
Sample: No empirical sample or dataset; the paper offers a conceptual critique of the 'on-the-fly' agent paradigm, a proposed architecture (hardened, reusable AI workflows / AI Workflow Store), illustrative examples, and an outline of research challenges.
Themes: productivity, org_design, governance, adoption, human_ai_collab
Generalizability:
- No empirical validation: claims are conceptual and may not hold across real-world settings without testing.
- Assumes organizations have incentives and resources to develop, share, and maintain hardened workflows.
- Ignores heterogeneity in tasks/domains where on-the-fly synthesis may be preferable (creative, exploratory, ad-hoc use).
- Implementation feasibility depends on compute costs, latency tolerances, and integration with existing tooling and governance.
- Security, regulatory, and market-infrastructure constraints that vary across jurisdictions could limit adoption.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts.	Adoption Rate	null_result	high	paradigm_adoption	0.06
The on-the-fly paradigm short-circuits disciplined software engineering processes -- iterative design, rigorous testing, adversarial evaluation, staged deployment, and more -- that have delivered relatively reliable and secure systems.	Output Quality	negative	high	reliability and security (degree to which SE processes are applied)	0.02
By focusing on rapid, real-time synthesis, AI agents are effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them.	AI Safety and Ethics	negative	high	suitability for high-stakes use / risk to users	0.02
Integrating rigorous software engineering processes into the agentic loop will produce production-grade, hardened, and deterministically-constrained agent workflows that substantially outperform brittle on-the-fly synthesis.	Output Quality	positive	high	workflow reliability/security and overall performance compared to on-the-fly synthesis	0.02
Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.	Adoption Rate	negative	high	resource_costs (compute/time) and implications for amortization/adoption	0.02
An AI Workflow Store of hardened and reusable workflows would allow agents to invoke workflows with far greater reliability and security than improvised tool chains.	Output Quality	positive	high	reliability and security of agent-invoked workflows	0.02
The research challenges for this vision stem from a broader flexibility–robustness tension that requires moving beyond the on-the-fly paradigm to navigate effectively.	Organizational Efficiency	mixed	high	trade-off between flexibility and robustness in agent design	0.02