A large procurement workflow that looks well supported at the state level can still hide sizable next-step uncertainty. Refining the state description expands the state space from 42 to 668 states, and under that refined state the state-action blind mass ranges from 0.0165 (τ=50) to 0.1253 (τ=1000), which implies higher oversight needs than coarse metrics suggest. On held-out data, a simple maximum-probability score m(s) predicted autonomous-step accuracy to within about 3.4 percentage points on average.
Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available.
Summary
Main Finding
The paper introduces a Markovian, measure-theoretic framework that makes pre-deployment reliability and oversight cost for agentic AI auditable from event logs. Two finite-sample metrics, state blind-spot mass B_n(τ) and state-action blind mass B^SA_{π,n}(τ), together with an entropy-and-risk escalation gate, identify where historical support (and therefore justified autonomy) ends. Applied to a large enterprise procurement log (BPI 2019), the framework shows that coarse state coverage can look adequate while a substantial fraction of actual decision mass is under-supported at the state-action level (e.g., B^SA_{π,n}(1000) = 0.1253 under a refined state), and that the same gate that defines a reliability envelope also determines expected oversight cost analytically. A simple chronological held-out agent experiment shows that the greedy-action confidence m(s) = max_a π̂(a|s) tracks realized autonomous-step accuracy within ≈3.4 percentage points on average.
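The two blind-mass metrics can be illustrated with a minimal sketch. It assumes that occupancy is estimated from the same training counts used for the visit thresholds, and the activity names in `toy_log` are hypothetical, not taken from the BPI 2019 log.

```python
from collections import Counter

def blind_masses(cases, tau):
    """Return (state blind mass, state-action blind mass) at threshold tau.

    `cases` is a list of traces; each trace is a list of (state, action)
    pairs. Assumption: occupancy is the normalized visit count from this
    same log, so a state's mass is N(s) / total.
    """
    n_s = Counter()   # N(s): visits to each state
    n_sa = Counter()  # N(s, a): visits to each state-action pair
    for trace in cases:
        for s, a in trace:
            n_s[s] += 1
            n_sa[(s, a)] += 1
    total = sum(n_s.values())
    if total == 0:
        return 0.0, 0.0
    # Mass sitting in states / pairs visited fewer than tau times.
    b_state = sum(c for c in n_s.values() if c < tau) / total
    b_sa = sum(c for c in n_sa.values() if c < tau) / total
    return b_state, b_sa

# Hypothetical toy log: four short cases.
toy_log = [
    [("create_po", "approve"), ("approve", "receive_goods")],
    [("create_po", "approve"), ("approve", "receive_goods")],
    [("create_po", "approve"), ("approve", "change_price")],
    [("create_po", "cancel")],
]
b_state, b_sa = blind_masses(toy_log, tau=3)
```

In this toy log every state has at least three visits, so the state blind mass is 0, while four of the seven logged decisions fall in state-action pairs with fewer than three visits, giving a state-action blind mass of 4/7. This is the same qualitative pattern the paper reports at scale.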
Key Points
- Stochastic gap: inserting agentic (stochastic) policies into near-deterministic workflows turns evaluation from “is the next step plausible?” into “is the trajectory distribution historically supported and economically governable?”
- Two operational coverage metrics:
  - State blind-spot mass B_n(τ): deployment state occupancy mass in states with < τ historical visits.
  - State-action blind mass B^SA_{π,n}(τ): deployment state-action occupancy mass in state-action pairs with < τ visits, which is more directly relevant to agentic decisions.
- Support mismatch is formalized via measure decomposition; finite-sample blind masses are practical analogues of singular (unsupported) deployment mass.
- Entropy and risk:
  - Local ambiguity is measured by Shannon entropy H(π̂(·|s)).
  - Value- and exception-weighted risk scores w(s), w^SA(s,a) capture economic consequence.
  - A HITL escalation gate G_{τ,h_0,w_0}(s) escalates states with low support, high entropy, or high risk.
- Oversight-cost identity:
  - Expected per-case cost under gate G: C(π;G) = c_A E[T] + (c_H − c_A) Σ_s D_π(s) G(s).
  - Adding an error penalty yields C_λ(π;G) and directly couples permissiveness of autonomy with expected error exposure.
- Surrogates for case-level outcomes:
  - Zero-touch completion surrogate: C̃_0(G) = E[∏_t (1 − G(s_t)) m(s_t)]
  - Safe-completion-with-fallback surrogate: R̃_safe(G) = E[∏_t (G(s_t) + (1 − G(s_t)) m(s_t))]
  - In held-out tests, m(s) proved a good step-level accuracy surrogate (≈3.4 pp average error).
- Empirical audit highlights:
  - State refinement (activity → activity+item_type+GR → +value_bin+actor) expands states from 42 → 190 → 668 and state-action pairs from 498 → 1,217 → 3,262.
  - Even with 1.6M events, refined state-action blind mass is substantial: B^SA_{π,n}(50) = 0.0165; B^SA_{π,n}(200) = 0.0462; B^SA_{π,n}(1000) = 0.1253.
  - The highest-entropy contexts concentrate in human approval and exception handling, especially in high-value bins.
- Practicality: the framework is directly implementable where operational event logs exist and produces interpretable metrics that link statistical support, local ambiguity, economic consequence, and human-touch requirements.
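The gate, the oversight-cost identity, and the two case-level surrogates listed above can be combined in a short sketch. Everything numeric here is hypothetical: the three states, their counts, policies, risk scores, visitation values, thresholds, and unit costs are illustrative only. Entropy is computed in nats, and the sketch assumes that the visitation measure sums to the expected case length E[T]; neither choice is confirmed by the summary.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gate(n_visits, policy_row, risk, tau, h0, w0):
    """Return 1 (escalate) on low support, high entropy, or high risk."""
    return int(n_visits < tau or entropy(policy_row.values()) > h0 or risk > w0)

def expected_cost(visitation, gates, c_a, c_h):
    """C(pi;G) = c_A * E[T] + (c_H - c_A) * sum_s D_pi(s) * G(s)."""
    expected_len = sum(visitation.values())  # assumption: sum_s D_pi(s) = E[T]
    gated = sum(d * gates[s] for s, d in visitation.items())
    return c_a * expected_len + (c_h - c_a) * gated

def surrogates(traces, gates, m):
    """Average the zero-touch and safe-completion products over traces."""
    zero_touch = safe = 0.0
    for trace in traces:
        zero_touch += math.prod((1 - gates[s]) * m[s] for s in trace)
        safe += math.prod(gates[s] + (1 - gates[s]) * m[s] for s in trace)
    return zero_touch / len(traces), safe / len(traces)

# Hypothetical inputs for three abstract states.
counts = {"A": 500, "B": 40, "C": 300}
policy = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.5, "y": 0.5}, "C": {"x": 0.8, "y": 0.2}}
risk = {"A": 0.1, "B": 0.2, "C": 0.9}
visitation = {"A": 2.0, "B": 0.5, "C": 1.0}   # expected visits per case
m = {s: max(row.values()) for s, row in policy.items()}  # m(s) = max_a pi_hat(a|s)

gates = {s: gate(counts[s], policy[s], risk[s], tau=50, h0=0.6, w0=0.5) for s in counts}
cost = expected_cost(visitation, gates, c_a=1.0, c_h=10.0)
zero_touch, safe = surrogates([["A", "A", "C"], ["A", "B"], ["A", "A"]], gates, m)
```

With these inputs state B is escalated for thin support and state C for high risk, so the expected cost is 1.0 × 3.5 + 9.0 × 1.5 = 17.0 per case. The zero-touch surrogate is 0.27, because only the third trace avoids every gated state, while the safe-completion surrogate is 0.84, because gated steps are treated as handled correctly by the human fallback.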
Data & Methods
- Data:
  - BPI 2019 purchase-to-pay event log: 251,734 cases, 1,595,923 events, 42 distinct activities; both human and system actors.
  - Case lengths: mean 6.34 events, median 5, 99th percentile 24, maximum 990. Self-loop rate approximately 15.7%.
- State abstractions evaluated:
  - s(1): activity only (42 states)
  - s(2): activity + item_type + goods-receipt flag (190 states)
  - s(3): activity + item_type + goods-receipt flag + discretized value bin + actor class (668 states)
- Empirical estimators:
  - Empirical next-step policy: π̂(a|s) = N(s,a)/N(s)
  - Empirical transition kernel: P̂(s'|s,a) = N(s,a,s')/N(s,a)
  - Empirical occupancy d̂_π(s) and d̂_π(s,a) from training log counts
- Blind-mass definitions:
  - B_n(τ) = Σ_s d̂_π(s) 1{N(s) < τ}
  - B^SA_{π,n}(τ) = Σ_{s,a} d̂_π(s,a) 1{N(s,a) < τ}
  - Risk-weighted: B^{SA,*}_{π,n}(τ) = Σ_{s,a} d̂_π(s,a) w^SA(s,a) 1{N(s,a) < τ}
- HITL gate: G_{τ,h_0,w_0}(s) = 1{N(s) < τ ∨ H(π̂(·|s)) > h_0 ∨ w(s) > w_0}
- Agent simulation and evaluation:
  - Chronological 80/20 split by case completion time: 201,387 cases (train), 50,347 cases (held-out).
  - Simulated agent: on held-out state s, escalate if G(s) = 1; otherwise pick the greedy action â(s) = argmax_a π̂(a|s).
- Evaluation metrics:
  - A_event, A_case (autonomous decision/case shares)
  - Realized zero-touch completion C^test_0(G) and realized safe completion R^test_safe(G)
  - Mean human touches per case, H^test_case(G)
  - Comparison of theoretical surrogates C̃_0(G), R̃_safe(G) against realized outcomes; m(s) tracked realized step accuracy within ≈3.4 percentage points on average.
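The held-out evaluation described above can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the gate uses only the support threshold τ (the entropy and risk terms are omitted), the case records with `end_time` and `steps` fields are a hypothetical data layout, and the activity names are invented.

```python
from collections import Counter, defaultdict

def chronological_split(cases, train_frac=0.8):
    """Sort cases by completion time and cut at train_frac."""
    ordered = sorted(cases, key=lambda c: c["end_time"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def fit_policy(train):
    """Count-based estimates: N(s) and pi_hat(a|s) = N(s,a) / N(s)."""
    n_sa = defaultdict(Counter)
    for case in train:
        for s, a in case["steps"]:
            n_sa[s][a] += 1
    n_s = {s: sum(c.values()) for s, c in n_sa.items()}
    pi_hat = {s: {a: k / n_s[s] for a, k in c.items()} for s, c in n_sa.items()}
    return n_s, pi_hat

def evaluate(test, n_s, pi_hat, tau):
    """Replay held-out steps with a support-only gate and a greedy agent."""
    total = autonomous = correct = 0
    confidence = 0.0
    for case in test:
        for s, logged_action in case["steps"]:
            total += 1
            if n_s.get(s, 0) < tau:  # escalate unseen or thinly supported states
                continue
            greedy, m = max(pi_hat[s].items(), key=lambda kv: kv[1])
            autonomous += 1
            confidence += m  # m(s) = max_a pi_hat(a|s)
            correct += int(greedy == logged_action)
    return {
        "autonomous_share": autonomous / total if total else 0.0,
        "realized_accuracy": correct / autonomous if autonomous else 0.0,
        "mean_confidence": confidence / autonomous if autonomous else 0.0,
    }

# Hypothetical cases: two common patterns plus one unseen state at the end.
A = [("po", "approve"), ("approve", "pay")]
B = [("po", "approve"), ("approve", "reject")]
patterns = [A, A, A, B, A, A, B, A, A, B + [("dispute", "resolve")]]
cases = [{"end_time": t, "steps": p} for t, p in enumerate(patterns)]

train, test = chronological_split(cases)
n_s, pi_hat = fit_policy(train)
report = evaluate(test, n_s, pi_hat, tau=5)
```

On this toy data the agent acts autonomously on four of five held-out steps, with mean confidence 0.875 against a realized accuracy of 0.75. Comparing `mean_confidence` with `realized_accuracy` is the step-level analogue of the check in which m(s) tracked accuracy within about 3.4 percentage points. As in the paper's setup, "correct" here means agreement with the logged action.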
Implications for AI Economics
- Pre-deployment auditing is necessary: large logs and apparently well-covered coarse states can mask substantial decision-level blind mass; firms should compute state-action coverage (B^SA), not just state coverage.
- Data requirements grow with state refinement: incorporating economically relevant context (value bins, actor class, case-level context) greatly expands the state-action space and therefore the sample size needed to justify autonomous decisioning. This implies increasing marginal data collection costs (or the need to simplify process state) for wider autonomy.
- Automation vs oversight tradeoff is quantifiable: the gate produces explicit formulas for expected oversight cost and error exposure (C(π;G), C_λ(π;G)). These let firms evaluate whether automation reduces total cost or merely shifts costs to human oversight.
- ROI and scaling risk: because blind mass and entropy concentrate in higher-consequence and exception-handling states, automating for marginal productivity gains can leave a residual set of expensive human-in-the-loop tasks that dominate operational cost. This may help explain why many projects fail to scale or to realize projected ROI.
- Policy and product design guidance:
  - Use event-log audits to identify where process redesign (reduce branching, add validation rules), additional labeled data collection, or human fallback is most valuable.
  - Prioritize automated handling of low-entropy, well-supported transitions while gating high-entropy/high-risk transitions.
  - Entropy- and risk-weighted gating allocates scarce human attention to high-consequence ambiguity and improves cost-effectiveness versus naive coverage thresholds.
- Regulatory, governance, and insurance relevance: the framework supplies audit-ready, interpretable metrics (blind masses, expected human touches, error-cost tradeoffs) that can be used in compliance checks, contractual SLAs, and risk quantification for insurers.
- Limitations and caution:
  - Observational-log-based evaluation cannot observe counterfactual next states; empirical agent evaluation is imitation-style and assumes the logged next action is “correct.”
  - Threshold choice (τ, h_0, w_0), state encoding decisions, and the choice of cost parameters (c_A, c_H, λ) materially affect conclusions; these must be tailored to business context.
  - The simple greedy-agent model is a baseline; more sophisticated agents or interventions (e.g., active learning, targeted data collection) will change the sample-efficiency and cost tradeoffs.
- Broad economic takeaway: credible, economically viable deployment of agentic AI in enterprises depends as much on statistical support at the decision (state-action) level and on structured gating (entropy+risk) as on model accuracy; absent such audits, automation programs risk underestimating human oversight costs and overestimating scalable ROI.
Assessment
Claims (7)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We develop a measure-theoretic Markov framework for agentic AI in organizations, whose core quantities are state blind-spot mass B_n(τ), state-action blind mass B^SA_{π,n}(τ), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. | Other | positive | high | other | 0.12 |
| We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. | Other | positive | high | other | n=251734; 0.2 |
| Refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668. | Other | positive | high | other | n=251734; expands state space from 42 to 668; 0.12 |
| Refining the state (as above) raises state-action blind mass from 0.0165 at τ=50 to 0.1253 at τ=1000. | Automation Exposure | negative | high | state-action blind mass (measure of unsupported next-step decisions) | n=251734; 0.0165 at τ=50 to 0.1253 at τ=1000; 0.12 |
| On the held-out split, m(s) = max_a π̂(a\|s) tracks realized autonomous step accuracy within 3.4 percentage points on average. | Output Quality | positive | high | accuracy of autonomous step selection (realized autonomous step accuracy) | within 3.4 percentage points on average; 0.12 |
| The same quantities that delimit statistically credible autonomy (blind masses, escalation gate, m(s), etc.) also determine expected oversight burden (the framework includes an expected oversight-cost identity over the workflow visitation measure). | Organizational Efficiency | positive | high | expected oversight burden / oversight cost | 0.12 |
| The framework is designed for direct application to engineering processes for which operational event logs are available. | Adoption Rate | positive | high | adoptability / applicability to engineering processes | 0.06 |