The Commonplace

Constraining LLMs with a bounded‑autonomy orchestration layer eliminated unsafe executions in a deployed enterprise test while preserving large speed gains. The constrained system finished 23 of 25 scenarios with zero unsafe executions; an unconstrained configuration finished only 17 of 25, let wrong-entity mutations through, and hallucinated successes. Both AI configurations ran workflows 13–18× faster than manual operation.

Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution
Sarmad Sohail, Ghufran Haider · April 16, 2026
arXiv · quasi_experimental · medium evidence · 7/10 relevance
A bounded‑autonomy orchestration architecture that constrains LLM actions via typed action contracts, permissioned capabilities, scoped context, and validation completed 23/25 tasks with zero unsafe executions and achieved 13–18× speedups versus manual operation, outperforming an unconstrained LLM configuration which completed 17/25 tasks and allowed unsafe mutations.

Large language models are increasingly used as natural-language interfaces to enterprise software, but their direct use as system operators remains unsafe. Model errors can propagate into unauthorized actions, malformed requests, cross-workspace execution, and other costly failures. We argue this is primarily an execution architecture problem. We present a bounded-autonomy architecture in which language models may interpret intent and propose actions, but all executable behavior is constrained by typed action contracts, permission-aware capability exposure, scoped context, validation before side effects, consumer-side execution boundaries, and optional human approval. The enterprise application remains the source of truth for business logic and authorization, while the orchestration engine operates over an explicit published actions manifest. We evaluate the architecture in a deployed multi-tenant enterprise application across three conditions: manual operation, unconstrained AI with safety layers disabled, and full bounded autonomy. Across 25 scenario trials spanning seven failure families, the bounded-autonomy system completed 23 of 25 tasks with zero unsafe executions, while the unconstrained configuration completed only 17 of 25. Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class. Both AI conditions delivered 13-18x speedup over manual operation. Critically, removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success. Several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output. The result is a practical, deployed architecture for making imperfect language models operationally useful in enterprise systems.

Summary

Main Finding

The Bounded Autonomy Layer (BAL) is an execution architecture that limits LLMs to proposing actions against typed action contracts while enforcing permission-aware capability exposure, tenant/workspace scoping, pre-execution validation, consumer-side execution, and optional human approval. The paper's central claim is that this architecture makes imperfect language models operationally useful in enterprise settings. In a deployed multi-tenant application, BAL completed 23/25 scenario tasks with zero unsafe executions (its two failures were safely contained), compared with 17/25 completions for an unconstrained AI configuration. Both AI conditions produced large speedups (≈13–18×) over manual operation, but removing BAL’s safety layers reduced task completion and exposed a failure class (wrong-entity mutations) that consumer backends are structurally blind to.
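To make the wrong-entity failure class concrete, here is a minimal Python sketch of a candidate-returning disambiguation step. It is an illustration under stated assumptions, not the paper's implementation: `Entity`, `AmbiguousEntityError`, `resolve_entity`, `propose_archive`, and the sample project names are all hypothetical. The point it shows is that a backend would accept either matching project (the user is permitted and the payload is well formed), so the only place to stop a wrong pick is before a target is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    name: str


class AmbiguousEntityError(Exception):
    """Structured error that returns candidates instead of guessing."""

    def __init__(self, query: str, candidates: list[Entity]):
        super().__init__(f"{len(candidates)} entities match {query!r}")
        self.query = query
        self.candidates = candidates


def resolve_entity(query: str, entities: list[Entity]) -> Entity:
    """Return the single entity whose name contains `query`, or refuse."""
    needle = query.strip().lower()
    matches = [e for e in entities if needle in e.name.lower()]
    if not matches:
        raise LookupError(f"no entity matches {query!r}")
    if len(matches) > 1:
        # More than one plausible target: never pick one silently.
        raise AmbiguousEntityError(query, matches)
    return matches[0]


# Hypothetical workspace data; both projects are valid, permitted targets.
PROJECTS = [
    Entity("p-1", "Acme Renewal 2025"),
    Entity("p-2", "Acme Renewal 2026"),
]


def propose_archive(query: str) -> dict:
    """Turn a model-supplied entity reference into a proposal or a question."""
    try:
        target = resolve_entity(query, PROJECTS)
    except AmbiguousEntityError as err:
        return {
            "status": "needs_clarification",
            "candidates": [c.id for c in err.candidates],
        }
    return {"status": "ready", "entity_id": target.id}
```

In this sketch, `propose_archive("acme renewal")` yields a clarification request listing both candidates, while `propose_archive("2026")` resolves to a single project. A real system would presumably match on richer signals than substring containment.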

Key Points

  • Architectural framing: Treat model unreliability primarily as an execution-architecture problem, not just a model-quality problem. Aim for bounded autonomy (models plan/propose; application enforces execution).
  • Core mechanisms:
    • Typed action contracts: every executable capability is a typed, versioned contract including schema, permission predicates, validation, execution semantics, and outcome behavior.
    • Permission-aware capability exposure: BAL reasons only over actions the authenticated user is permitted to perform (granted-action synchronization).
    • Consumer-side execution: all side effects run through the consumer application’s own services and authorization, preserving business logic and audit trails.
    • Scoped operational context: tenant, workspace, and user identity are first-class runtime inputs to prevent cross-tenant/workspace leakage.
    • Pre-side-effect validation: schema-derived validation and structured errors prevent malformed payloads from reaching backends.
    • Explicit ambiguity handling: disambiguation workflows and candidate-returning errors enforce clarification for ambiguous entity selection.
    • Human approval gates for high-consequence workflows.
    • Versioned, published actions manifest as the orchestration surface (not an implicit or external policy surface).
  • Threats targeted: unauthorized actions, cross-tenant/workspace errors, ambiguous entity resolution, malformed payloads, premature execution of high-impact workflows, planners acting over stale/unguarded capabilities, and direct model-to-backend mutation.
  • Empirical observations:
    • BAL’s code-enforced mechanisms (permission filtering, workspace isolation, manifest governance) intercepted 100% of the violations they target.
    • Consumer backend checks caught many violations even without BAL, but could not catch wrong-entity mutations when the user had permission and payloads were structurally valid — only BAL’s disambiguation/confirmation intercepted that class.
    • Removing safety layers reduced usefulness: structured validation feedback helped the model converge to correct outcomes faster; unconstrained models retried with generic errors or hallucinated success.
  • Positioning vs prior work: complements content guardrails (NeMo Guardrails, LlamaFirewall), tool-access policies (Progent, OAP), and runtime governance (MI9, Agent-C) by governing how enterprise side effects actually execute after a tool call, binding safety to application-owned contracts rather than external policy layers.
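The core mechanisms listed above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's code: `ActionContract`, `Context`, `exposed_actions`, `validate`, `run_proposal`, the `invoice:archive` permission string, and the `archive_invoice` handler are all assumed names, and the schema is reduced to a field-to-type mapping. It shows three of the ideas together: the planner only sees permitted actions, payloads are checked before any side effect, and the side effect itself runs in a handler owned by the consumer application.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Context:
    """Scoped operational context passed to every execution."""
    tenant_id: str
    workspace_id: str
    user_id: str
    permissions: frozenset[str]


@dataclass(frozen=True)
class ActionContract:
    """One versioned entry in the published actions manifest."""
    name: str
    version: str
    required_permission: str
    schema: dict[str, type]  # field name -> expected Python type
    execute: Callable[[dict[str, Any], Context], Any]  # consumer-side handler


class ValidationError(Exception):
    """Carries structured errors the model can use to repair its payload."""

    def __init__(self, errors: list[str]):
        super().__init__("; ".join(errors))
        self.errors = errors


def exposed_actions(manifest: dict[str, ActionContract], ctx: Context) -> dict[str, ActionContract]:
    """Permission-aware exposure: hide actions the user cannot perform."""
    return {n: c for n, c in manifest.items() if c.required_permission in ctx.permissions}


def validate(contract: ActionContract, payload: dict[str, Any]) -> list[str]:
    """Check a proposed payload against the contract schema."""
    errors = []
    for field, expected in contract.schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    errors.extend(f"unknown field: {f}" for f in payload if f not in contract.schema)
    return errors


def run_proposal(manifest, ctx: Context, action_name: str, payload: dict[str, Any]) -> Any:
    """Gate a model proposal, then delegate to the consumer's handler."""
    allowed = exposed_actions(manifest, ctx)
    if action_name not in allowed:
        raise PermissionError(f"action not available: {action_name}")
    contract = allowed[action_name]
    errors = validate(contract, payload)
    if errors:
        raise ValidationError(errors)  # no side effect has happened yet
    return contract.execute(payload, ctx)


LEDGER: list[str] = []  # stands in for the consumer application's own state


def archive_invoice(payload: dict[str, Any], ctx: Context) -> str:
    """Hypothetical consumer-owned handler; scope comes from ctx, not the model."""
    LEDGER.append(f"{ctx.tenant_id}/{ctx.workspace_id}:{payload['invoice_id']}")
    return "archived"


MANIFEST = {
    "archive_invoice": ActionContract(
        name="archive_invoice",
        version="1.0.0",
        required_permission="invoice:archive",
        schema={"invoice_id": str},
        execute=archive_invoice,
    )
}
```

Note the design choice that tenant and workspace come from `Context` rather than from the payload, so a model cannot redirect an action across scopes by editing its own arguments. Human approval gates and disambiguation are omitted here for brevity.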

Data & Methods

  • Implementation: BAL integrated into a deployed multi-tenant enterprise application. BAL provides orchestration over a published actions manifest and routes execution through the consumer app’s APIs and callbacks.
  • Experimental design:
    • Three experimental conditions: manual operation baseline; unconstrained AI (safety layers selectively disabled); and full bounded-autonomy (BAL) configuration.
    • 25 scenario trials spanning seven failure families aligned to the threat model (unauthorized actions, cross-tenant/workspace errors, ambiguous entities, malformed inputs, unsafe high-impact workflows, ungoverned capabilities, direct model-to-backend mutation).
  • Evaluation metrics:
    • Task completion rate (successful end-to-end completion).
    • Unsafe executions (side-effecting actions that violated safety constraints).
    • Speedup vs manual operation (interaction/turns and elapsed time).
    • Qualitative failure analysis by failure family.
  • Results summary:
    • BAL: 23/25 tasks completed; 0 unsafe executions (the two incomplete tasks were contained without enterprise side effects); ≈13.5× speedup vs manual.
    • Unconstrained AI: 17/25 tasks completed; some unsafe executions (notably wrong-entity mutations escaped consumer-side checks).
    • Both AI conditions delivered large speedups (13–18×) over manual operation, but only the constrained architecture combined this with higher completion and zero unsafe executions.
  • Additional methodological notes: structured validation feedback reduced the number of interaction turns required for model convergence; architectural invariants (e.g., manifest-based capability filtering) are code-enforced and non-statistical.
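As a rough aid for reading the completion counts above, the following Python sketch converts them to rates and attaches 95% Wilson score intervals. This analysis is not in the paper, which reports no statistical uncertainty; the helper names `wilson_interval` and `summarize` are assumptions for illustration. The intervals also treat each trial as independent, which may not hold if the same scenarios were run under each condition, so they should be read only as an indication of how wide the uncertainty is at n=25.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Completion counts as reported in the summary: (completed, total).
RESULTS = {
    "bounded autonomy": (23, 25),
    "unconstrained": (17, 25),
}


def summarize(results: dict[str, tuple[int, int]]) -> dict[str, dict[str, float]]:
    """Compute completion rate and interval bounds for each condition."""
    rows = {}
    for name, (done, total) in results.items():
        low, high = wilson_interval(done, total)
        rows[name] = {"rate": done / total, "low": low, "high": high}
    return rows


if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print(f"{name}: {row['rate']:.0%} (95% CI {row['low']:.0%} to {row['high']:.0%})")
```

The completion rates are 92% and 68%, a gap of six tasks; with only 25 trials per condition the intervals around each rate are wide.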

Implications for AI Economics

  • Productivity vs risk trade-off: BAL-style architectures enable large productivity gains (order-of-magnitude speedups) while materially reducing downside risk from model errors. This improves expected ROI of AI assistants by increasing throughput and reducing expected loss from costly mistakes.
  • Lower marginal cost of adoption: By shifting safety enforcement into application-integrated contracts and consumer-side execution, enterprises can deploy assistants without rewriting core business logic or accepting elevated operational risk, lowering the friction/cost of integrating LLMs into workflows.
  • Governance and compliance economics: Contract-based governance (action manifests tied to application auth and validation) simplifies compliance and auditability compared with external policy layers, reducing maintenance overhead and regulatory risk exposure. This has implications for liability, insurance premiums, and audit costs.
  • Productization and market opportunity: There is demand for middleware that implements BAL-style guarantees (typed contracts, manifest publication, permission sync, consumer-side routing). Vendors that offer standardized, easy-to-integrate execution governance stacks can capture enterprise spend on safe automation infrastructure.
  • Labor and organizational effects: Faster task completion may enable role redesign (focus on oversight, exception handling, higher-level tasks), shifting labor from routine execution toward supervision and policy/contract engineering. Human-in-the-loop approval gates preserve jobs where high-consequence judgment remains necessary.
  • Pricing and business models: Enterprises may prefer subscription or platform pricing that bundles safety guarantees; vendors can market differentiated tiers by the strictness of execution governance and auditability. Insurers and compliance officers may demand architectural features like BAL as underwriting or certification criteria.
  • Counterintuitive insight for adoption strategy: Relaxing safety layers does not necessarily increase utility; structured validation and clarified error feedback can increase both usefulness and efficiency. Investment in execution governance can therefore enhance both value capture and risk reduction simultaneously.
  • Standardization implications: Existing tool-description standards (e.g., MCP) do not fully address execution authority and contract binding. Market and standards efforts that incorporate application-bound action contracts and manifest semantics will accelerate interoperable, safe enterprise automation.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper reports clear, operational metrics from a real deployed multi‑tenant application (task completion, unsafe executions, speedups) showing large differences between conditions, which provides useful empirical evidence. However, the sample is small (25 scenario trials), scenarios are likely curated, there is no randomization or statistical analysis reported, and results come from a single deployment and set of model/configuration choices, limiting confidence in broader causal claims.
Methods Rigor: medium. Strengths include a deployed, end‑to‑end system test, well‑specified safety mechanisms, and concrete failure categories; weaknesses are small trial counts, likely nonrandom scenario selection, limited reporting of statistical uncertainty or ablation detail, and potential experimenter/selection biases and single‑site implementation choices.
Sample: Deployed multi‑tenant enterprise application; evaluation used 25 scenario trials covering seven failure families; three experimental conditions compared: manual operator, unconstrained AI (safety layers disabled), and full bounded‑autonomy architecture with typed action contracts, permission-aware capability exposure, scoped context, validation, execution boundaries, and optional human approval; outcomes include task completion, unsafe executions (e.g., unauthorized actions, malformed requests, cross‑workspace execution), and workflow speedups (13–18x over manual).
Themes: human_ai_collab, productivity
Identification: Comparative deployed evaluation across three conditions (manual operation, unconstrained LLM with safety layers disabled, and bounded‑autonomy LLM) using 25 scenario trials spanning seven failure families; causal claims are based on observed differences in task completion rates, unsafe executions, and turnaround times across these conditions without randomization or formal causal identification techniques.
Generalizability:
  • Small number of curated scenarios (25) may not represent real operational diversity.
  • Single deployed application and implementation; results may not transfer to other enterprise domains or architectures.
  • Specific LLM model(s) and tuning/configuration used are not broadly representative of all models.
  • Human approval/workflow integration and operator behavior may vary across organizations.
  • Adversarial or unexpected real‑world inputs beyond the tested failure families may reveal different failure modes.

Claims (9)

  • The bounded-autonomy system completed 23 of 25 tasks with zero unsafe executions. [Organizational Efficiency]
    • Direction: positive · Confidence: high · Outcome: tasks completed / unsafe executions
    • Details: n=25 · 23 of 25 tasks completed with zero unsafe executions · 0.48
  • The unconstrained AI configuration completed only 17 of 25 tasks. [Organizational Efficiency]
    • Direction: negative · Confidence: high · Outcome: tasks completed
    • Details: n=25 · 17 of 25 tasks completed · 0.48
  • Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class. [Error Rate]
    • Direction: negative · Confidence: high · Outcome: wrong-entity mutation errors (escaped protections)
    • Details: n=25 · Two wrong-entity mutations escaped all consumer-contributed layers · 0.48
  • Both AI conditions delivered 13–18x speedup over manual operation. [Task Completion Time]
    • Direction: positive · Confidence: high · Outcome: task completion time (speedup vs. manual)
    • Details: n=25 · 13-18x speedup over manual operation · 0.48
  • Removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success. [Output Quality]
    • Direction: mixed · Confidence: high · Outcome: number of interaction turns to correct outcome; presence of hallucinated success
    • Details: n=25 · 0.48
  • Several safety properties are structurally enforced by code and intercepted all targeted violations regardless of model output. [Error Rate]
    • Direction: positive · Confidence: high · Outcome: interception of targeted violations / enforcement of safety properties
    • Details: n=25 · All targeted violations were intercepted by code-enforced safety properties (as reported) · 0.48
  • The enterprise application remains the source of truth for business logic and authorization, while the orchestration engine operates over an explicit published actions manifest. [Governance And Regulation]
    • Direction: positive · Confidence: high · Outcome: system design property (source-of-truth and orchestration behavior)
    • Details: 0.08
  • The system evaluation was performed in a deployed multi-tenant enterprise application across three conditions: manual operation, unconstrained AI with safety layers disabled, and full bounded autonomy. [Other]
    • Direction: null_result · Confidence: high · Outcome: experimental design and conditions
    • Details: n=25 · 0.48
  • The bounded-autonomy architecture is a practical, deployed approach for making imperfect language models operationally useful in enterprise systems. [Organizational Efficiency]
    • Direction: positive · Confidence: high · Outcome: operational usefulness of LLMs in enterprise context
    • Details: n=25 · 0.48
