
A large procurement workflow that looks well supported at the state level can still hide sizable next-step uncertainty: after refining the state space from 42 to 668 states, state-action blind mass is 0.0165 at τ=50 and rises to 0.1253 at τ=1000, implying greater oversight needs than coarse metrics suggest. A simple maximum-probability score m(s) tracked autonomous-step accuracy on held-out data within about 3.4 percentage points on average.

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Biplab Pal, Santanu Bhattacharya · March 25, 2026

arXiv · theoretical · medium evidence · 7/10 relevance

A measure-theoretic Markov framework applied to a large procurement event log shows that workflows that appear well supported at coarse state granularity can hide substantial next-step uncertainty (state-action blind mass), and that these blind masses directly determine expected oversight burden for agentic AI.

Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available.

Summary

Main Finding

The paper introduces a Markovian, measure-theoretic framework that makes pre-deployment reliability and oversight cost for agentic AI auditable from event logs. Two finite-sample metrics, state blind-spot mass B_n(τ) and state-action blind mass B^SA_{π,n}(τ), together with an entropy-and-risk escalation gate, identify where historical support (and therefore justified autonomy) ends. Applied to a large enterprise procurement log (BPI 2019), the framework shows that coarse state coverage can look adequate while a substantial fraction of actual decision mass is under-supported at the state-action level (e.g., B^SA_{π,n}(1000) = 0.1253 under a refined state), and that the same gate that defines a reliability envelope also determines expected oversight cost analytically. A simple chronological held-out agent experiment shows that the greedy-action confidence m(s) = max_a π̂(a|s) tracks realized autonomous-step accuracy within ≈3.4 percentage points on average.

Key Points

Stochastic gap: inserting agentic (stochastic) policies into near-deterministic workflows turns evaluation from “is the next step plausible?” into “is the trajectory distribution historically supported and economically governable?”
Two operational coverage metrics:
- State blind-spot mass B_n(τ): deployment state-occupancy mass in states with fewer than τ historical visits.
- State-action blind mass B^SA_{π,n}(τ): deployment state-action occupancy mass in state-action pairs with fewer than τ visits; this is more directly relevant to agentic decisions.
Support mismatch is formalized via measure decomposition; finite-sample blind masses are practical analogues of singular (unsupported) deployment mass.
Entropy and risk:
- Local ambiguity is measured by Shannon entropy H(π̂(·|s)).
- Value- and exception-weighted risk scores w(s), w^SA(s,a) capture economic consequence.
- A HITL escalation gate G_{τ,h_0,w_0}(s) escalates states with low support, high entropy, or high risk.
Oversight-cost identity:
- Expected per-case cost under gate G: C(π;G) = c_A E[T] + (c_H − c_A) Σ_s D_π(s) G(s).
- Adding an error penalty yields C_λ(π;G) and directly couples permissiveness of autonomy with expected error exposure.
Surrogates for case-level outcomes:
- Zero-touch completion surrogate: C̃_0(G) = E[∏_t (1 − G(s_t)) m(s_t)]
- Safe-completion-with-fallback surrogate: R̃_safe(G) = E[∏_t (G(s_t) + (1 − G(s_t)) m(s_t))]
- In held-out tests, m(s) proved a good step-level accuracy surrogate (≈3.4 pp average error).
Empirical audit highlights:
- State refinement (activity → activity+item_type+GR → +value_bin+actor) expands states from 42 → 190 → 668 and state-action pairs from 498 → 1,217 → 3,262.
- Even with 1.6M events, refined state-action blind mass is substantial: B^SA_{π,n}(50) = 0.0165; B^SA_{π,n}(200) = 0.0462; B^SA_{π,n}(1000) = 0.1253.
- Highest entropy contexts concentrate in human approval and exception-handling, especially in high-value bins.
Practicality: the framework is directly implementable where operational event logs exist and produces interpretable metrics that link statistical support, local ambiguity, economic consequence, and human-touch requirements.
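The oversight-cost identity listed above can be checked with a small worked example. The sketch below assumes that D_π(s) denotes the expected number of visits to state s per case (so that Σ_s D_π(s) = E[T]); the state names, visit counts, and cost values are hypothetical and chosen only for illustration.

```python
def oversight_cost(visits_per_case, gate, c_auto, c_human):
    """C(pi; G) = c_A * E[T] + (c_H - c_A) * sum_s D_pi(s) * G(s).

    visits_per_case: state -> expected visits per case (assumed meaning of D_pi).
    gate: state -> True if the state is escalated to a human.
    """
    expected_steps = sum(visits_per_case.values())  # E[T]
    escalated_visits = sum(d for s, d in visits_per_case.items() if gate[s])
    return c_auto * expected_steps + (c_human - c_auto) * escalated_visits


# Hypothetical three-state workflow; only the approval step is gated.
visits = {"create_po": 1.0, "approve": 1.5, "pay": 1.0}
gate = {"create_po": False, "approve": True, "pay": False}

cost = oversight_cost(visits, gate, c_auto=0.1, c_human=2.0)

# Equivalent direct accounting: autonomous visits cost c_A, escalated visits cost c_H.
direct = 0.1 * (1.0 + 1.0) + 2.0 * 1.5
```

Both computations give a per-case cost of 3.2 in these toy units: every step pays the autonomous cost, and escalated visits pay the human premium c_H − c_A on top.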

Data & Methods

Data:
- BPI 2019 purchase-to-pay event log: 251,734 cases, 1,595,923 events, 42 distinct activities; both human and system actors.
- Case lengths: mean 6.34 events, median 5, 99th pct 24, maximum 990. Self-loop rate ~15.7%.
State abstractions evaluated:
- s(1): activity only (42 states)
- s(2): activity + item_type + goods-receipt flag (190 states)
- s(3): activity + item_type + goods-receipt flag + discretized value bin + actor class (668 states)
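As a rough illustration of the three abstraction levels, the sketch below builds each state key from a single event record. The field names (`activity`, `item_type`, `gr_flag`, `value`, `actor_class`) and the value-bin edges are hypothetical placeholders, not the log's actual column names or the paper's binning.

```python
def value_bin(amount, edges=(100.0, 1000.0, 10000.0)):
    """Index of the value bin; the bin edges here are illustrative assumptions."""
    return sum(amount >= edge for edge in edges)


def state_s1(event):
    """Level 1: activity only."""
    return (event["activity"],)


def state_s2(event):
    """Level 2: activity + item type + goods-receipt flag."""
    return (event["activity"], event["item_type"], event["gr_flag"])


def state_s3(event):
    """Level 3: level 2 + discretized value bin + actor class."""
    return state_s2(event) + (value_bin(event["value"]), event["actor_class"])


# Hypothetical event record.
event = {
    "activity": "approve_invoice",
    "item_type": "standard",
    "gr_flag": True,
    "value": 250.0,
    "actor_class": "human",
}

s1, s2, s3 = state_s1(event), state_s2(event), state_s3(event)
```

Each refinement appends context to the key, so events that share a coarse state can land in different refined states, which is why the number of distinct states grows as context is added.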
Empirical estimators:
- Empirical next-step policy π̂(a|s) = N(s,a)/N(s)
- Empirical transition kernel P̂(s'|s,a) = N(s,a,s')/N(s,a)
- Empirical occupancy d̂_π(s) and d̂_π(s,a) from training log counts
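The count-based estimators above can be sketched in a few lines. This is a minimal illustration, assuming each case is an ordered list of (state, action) steps and that occupancy is the share of training events at each state or state-action pair; the `END` marker for the successor of a case's final step is an assumption added here so that every kernel row sums to one.

```python
from collections import Counter

END = "<END>"  # hypothetical terminal marker (assumption, not from the source)


def fit_estimators(cases):
    """Return empirical policy, transition kernel, and occupancies from counts."""
    n_s, n_sa, n_sas = Counter(), Counter(), Counter()
    for case in cases:
        for i, (s, a) in enumerate(case):
            nxt = case[i + 1][0] if i + 1 < len(case) else END
            n_s[s] += 1
            n_sa[(s, a)] += 1
            n_sas[(s, a, nxt)] += 1
    total = sum(n_s.values())
    policy = {(s, a): n / n_s[s] for (s, a), n in n_sa.items()}
    kernel = {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}
    occ_s = {s: n / total for s, n in n_s.items()}
    occ_sa = {sa: n / total for sa, n in n_sa.items()}
    return policy, kernel, occ_s, occ_sa


# Toy training log with three cases (hypothetical states and actions).
cases = [
    [("A", "x"), ("B", "y")],
    [("A", "x"), ("B", "z")],
    [("A", "w"), ("C", "y")],
]
policy, kernel, occ_s, occ_sa = fit_estimators(cases)
```

In this toy log, state A is visited three times and action x is taken twice from it, so `policy[("A", "x")]` is 2/3, and both of those steps lead to B, so `kernel[("A", "x", "B")]` is 1.0.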
Blind-mass definitions:
- B_n(τ) = Σ_s d̂_π(s) 1{N(s) < τ}
- B^SA_{π,n}(τ) = Σ_{s,a} d̂_π(s,a) 1{N(s,a) < τ}
- Risk-weighted B^{SA,∗}_{π,n}(τ) = Σ_{s,a} d̂_π(s,a) w^SA(s,a) 1{N(s,a) < τ}
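A toy computation makes the gap between state-level and state-action-level coverage concrete. The sketch below assumes occupancy is estimated as the share of training events, and uses hypothetical counts and risk weights; it is an illustration of the definitions, not the paper's code.

```python
from collections import Counter


def blind_masses(n_sa, tau, w_sa=None):
    """Return (state blind mass, state-action blind mass, risk-weighted SA blind mass).

    n_sa: (state, action) -> training count N(s, a).
    w_sa: optional (state, action) -> risk weight; hypothetical values below.
    """
    n_s = Counter()
    for (s, _a), n in n_sa.items():
        n_s[s] += n  # N(s) is the sum of N(s, a) over actions
    total = sum(n_sa.values())
    b_state = sum(n for n in n_s.values() if n < tau) / total
    b_sa = sum(n for n in n_sa.values() if n < tau) / total
    b_sa_risk = None
    if w_sa is not None:
        b_sa_risk = sum(n * w_sa[sa] for sa, n in n_sa.items() if n < tau) / total
    return b_state, b_sa, b_sa_risk


# Hypothetical counts: both states are well visited, but two actions are rare.
counts = {
    ("approve", "release"): 900,
    ("approve", "reject"): 60,
    ("approve", "rework"): 40,
    ("pay", "clear"): 1000,
}
weights = {
    ("approve", "release"): 1.0,
    ("approve", "reject"): 2.0,
    ("approve", "rework"): 1.0,
    ("pay", "clear"): 1.0,
}
b_state, b_sa, b_sa_risk = blind_masses(counts, tau=100, w_sa=weights)
```

With τ = 100, no state is blind (each has 1,000 visits), yet 5% of the decision mass sits on state-action pairs with fewer than 100 observations, and the risk-weighted figure rises to 8% because the rarer `reject` action carries a higher weight.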
HITL gate: G_{τ,h_0,w_0}(s) = 1{N(s) < τ ∨ H(π̂(·|s)) > h_0 ∨ w(s) > w_0}
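The gate is a disjunction of three threshold tests, which the following sketch spells out. Entropy is computed in nats (natural logarithm); the source does not state the base, so this is an assumption, as are the threshold values used in the examples.

```python
import math


def entropy(probs):
    """Shannon entropy in nats (log base is an assumption)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hitl_gate(n_s, action_probs, risk, tau, h0, w0):
    """True means escalate: low support OR high entropy OR high risk."""
    return n_s < tau or entropy(action_probs) > h0 or risk > w0


# Hypothetical thresholds.
TAU, H0, W0 = 50, 0.5, 1.0

well_supported = hitl_gate(1000, [0.99, 0.01], risk=0.1, tau=TAU, h0=H0, w0=W0)
low_support = hitl_gate(10, [0.99, 0.01], risk=0.1, tau=TAU, h0=H0, w0=W0)
ambiguous = hitl_gate(1000, [0.5, 0.5], risk=0.1, tau=TAU, h0=H0, w0=W0)
high_risk = hitl_gate(1000, [0.99, 0.01], risk=2.0, tau=TAU, h0=H0, w0=W0)
```

Only the first state stays autonomous. Each of the other three trips exactly one condition, which shows that any single trigger is enough to escalate.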
Agent simulation and evaluation:
- Chronological 80/20 split by case completion time: 201,387 cases (train), 50,347 cases (held-out).
- Simulated agent: on held-out state s, escalate if G(s) = 1; otherwise pick the greedy action â(s) = argmax_a π̂(a|s).
- Evaluation metrics:
  - A_event, A_case (autonomous decision/case shares)
  - Realized zero-touch completion C^test_0(G) and realized safe-completion R^test_safe(G)
  - Mean human touches (H^test_case(G))
- Comparison of theoretical surrogates C̃_0(G), R̃_safe(G) against realized outcomes; m(s) tracked realized step accuracy within ≈3.4 percentage points on average.
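The evaluation loop described above can be sketched as follows. This is an imitation-style illustration under stated assumptions: a step counts as correct when the greedy action matches the logged action, states unseen in training are escalated (their count is zero, below any τ), and the surrogate expectations are approximated by averaging over held-out cases. The policy, gate, and cases are hypothetical toy data.

```python
def evaluate(cases, policy, gate):
    """cases: list of cases, each a list of (state, logged_action).
    policy: state -> {action: probability}; gate: state -> True to escalate."""
    events = auto = correct = touches = 0
    conf_sum = c0_sum = rsafe_sum = 0.0
    zero_touch = safe = 0
    for case in cases:
        case_zero, case_safe, c0, rsafe = True, True, 1.0, 1.0
        for s, logged in case:
            events += 1
            if gate.get(s, True):  # unseen states are escalated
                touches += 1
                case_zero, c0 = False, 0.0  # surrogate factor (1 - G) * m = 0
                continue  # safe-completion factor is 1 when escalated
            m = max(policy[s].values())  # m(s): greedy-action confidence
            greedy = max(policy[s], key=policy[s].get)
            hit = greedy == logged
            auto += 1
            correct += hit
            conf_sum += m
            case_zero, case_safe = case_zero and hit, case_safe and hit
            c0, rsafe = c0 * m, rsafe * m
        zero_touch += case_zero
        safe += case_safe
        c0_sum += c0
        rsafe_sum += rsafe
    n = len(cases)
    return {
        "auto_event_share": auto / events,
        "step_accuracy": correct / auto if auto else float("nan"),
        "mean_confidence": conf_sum / auto if auto else float("nan"),
        "zero_touch": zero_touch / n,
        "safe_completion": safe / n,
        "surrogate_zero_touch": c0_sum / n,
        "surrogate_safe": rsafe_sum / n,
        "mean_human_touches": touches / n,
    }


policy = {"A": {"x": 0.9, "y": 0.1}, "B": {"u": 0.6, "v": 0.4}}
gate = {"A": False, "B": True}
held_out = [[("A", "x"), ("B", "u")], [("A", "y"), ("B", "v")], [("A", "x")]]
report = evaluate(held_out, policy, gate)
```

Comparing `mean_confidence` with `step_accuracy` is the step-level calibration check the summary refers to. On three toy cases the two differ widely (0.9 versus 2/3), which is expected at this sample size and says nothing about the paper's reported 3.4-point gap.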

Implications for AI Economics

Pre-deployment auditing is necessary: large logs and apparently well-covered coarse states can mask substantial decision-level blind mass; firms should compute state-action coverage (B^SA), not just state coverage.
Data requirements grow with state refinement: incorporating economically relevant context (value bins, actor class, case-level context) greatly expands the state-action space and therefore the sample size needed to justify autonomous decisioning. This implies increasing marginal data collection costs (or the need to simplify process state) for wider autonomy.
Automation vs oversight tradeoff is quantifiable: the gate produces explicit formulas for expected oversight cost and error exposure (C(π;G), C_λ(π;G)). These let firms evaluate whether automation reduces total cost or merely shifts costs to human oversight.
ROI and scaling risk: because blind mass and entropy concentrate on higher-consequence and exception-handling states, automating for marginal productivity gains can leave a residual set of expensive human-in-the-loop tasks that dominate operational cost, which may help explain why automation projects fail to scale or realize projected ROI.
Policy and product design guidance:
- Use event-log audits to identify where process redesign (reduce branching, add validation rules), additional labeled data collection, or human fallback is most valuable.
- Prioritize automated handling of low-entropy, well-supported transitions while gating high-entropy/high-risk transitions.
- Entropy- and risk-weighted gating allocates scarce human attention to high-consequence ambiguity and improves cost-effectiveness versus naive coverage thresholds.
Regulatory, governance, and insurance relevance: the framework supplies audit-ready, interpretable metrics (blind masses, expected human touches, error-cost tradeoffs) that can be used in compliance checks, contractual SLAs, and risk quantification for insurers.
Limitations and caution:
- Observational-log-based evaluation cannot observe counterfactual next states; empirical agent evaluation is imitation-style and assumes the logged next action is “correct.”
- Threshold choice (τ, h_0, w_0), state encoding decisions, and the choice of cost parameters (c_A, c_H, λ) materially affect conclusions; these must be tailored to business context.
- The simple greedy-agent model is a baseline; more sophisticated agents or interventions (e.g., active learning, targeted data collection) will change the sample-efficiency and cost tradeoffs.
Broad economic takeaway: credible, economically viable deployment of agentic AI in enterprises depends as much on statistical support at the decision (state-action) level and on structured gating (entropy+risk) as on model accuracy; absent such audits, automation programs risk underestimating human oversight costs and overestimating scalable ROI.

Assessment

Paper Type: theoretical
Evidence Strength: medium. The paper provides a formal, measure-theoretic framework and validates it on a large, real-world event log (251,734 cases), giving credible descriptive evidence about blind spots and oversight burden in that workflow; however, findings are based on a single procurement dataset and a log-driven simulated agent rather than interventions or causal identification across multiple settings, limiting external validity and causal claims.
Methods Rigor: high. The work develops precise mathematical quantities (state blind-spot mass, state-action blind mass, entropy-based escalation gate, oversight-cost identity), applies them to a large, well-documented process log with clear train/holdout splitting, and reports concrete metrics (e.g., blind mass values, m(s) tracking accuracy), indicating careful theoretical and empirical implementation.
Sample: Business Process Intelligence Challenge 2019 purchase-to-pay event log: 251,734 cases, 1,595,923 events, 42 distinct workflow actions; chronological 80/20 split used to build a log-driven simulated agent and evaluate held-out performance; state refinements expanded action/state space to 668 states by including case context, economic magnitude, and actor class.
Themes: human_ai_collab, governance, org_design
Generalizability:
- Single-domain: results derived from one enterprise procurement workflow and may not hold for other processes or sectors
- Simulated agent: agent behavior is simulated from logs rather than observed deployed AI systems interacting with humans
- Event-log requirement: framework assumes rich operational logs are available and accurately capture relevant context
- Ignores dynamic human adaptation: does not model how humans or organizations would change behavior in response to deployed agents
- Simplifying oversight model: oversight costs and escalation modeled abstractly (entropy gate) and may omit heterogeneity in human oversight capacity or economic incentives

Claims (7)

Claim	Direction	Confidence	Outcome	Details
We develop a measure-theoretic Markov framework for agentic AI in organizations, whose core quantities are state blind-spot mass B_n(\tau), state-action blind mass B^{SA}_{\pi,n}(\tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. Other	positive	high	other	0.12
We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. Other	positive	high	other	n=251734 0.2
Refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668. Other	positive	high	other	n=251734 expands state space from 42 to 668 0.12
Refining the state (as above) raises state-action blind mass from 0.0165 at \tau=50 to 0.1253 at \tau=1000. Automation Exposure	negative	high	state-action blind mass (measure of unsupported next-step decisions)	n=251734 0.0165 at tau=50 to 0.1253 at tau=1000 0.12
On the held-out split, m(s) = max_a \hat{\pi}(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. Output Quality	positive	high	accuracy of autonomous step selection (realized autonomous step accuracy)	within 3.4 percentage points on average 0.12
The same quantities that delimit statistically credible autonomy (blind masses, escalation gate, m(s), etc.) also determine expected oversight burden (the framework includes an expected oversight-cost identity over the workflow visitation measure). Organizational Efficiency	positive	high	expected oversight burden / oversight cost	0.12
The framework is designed for direct application to engineering processes for which operational event logs are available. Adoption Rate	positive	high	adoptability / applicability to engineering processes	0.06