A trajectory-level metric exposes hidden workflow shortcuts in LLM-based payment systems—ten of 18 models skip a critical checkout confirmation while appearing perfect under existing metrics—and ASR-guided fixes boost completion rates by up to 93.8 percentage points.

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

Donghao Huang, Joon Kiat Chua, Zhaoxia Wang · May 07, 2026

arxiv descriptive medium evidence 7/10 relevance Source PDF

Agentic Success Rate (ASR), a trajectory-fidelity metric, reveals systematic agent-level workflow shortcuts in LLM-based payment orchestration that standard metrics miss, and ASR-guided fixes substantially improve task completion rates.

LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.

Summary

Main Finding

The paper introduces Agentic Success Rate (ASR), a trajectory-level metric for LLM-based multi-agent payment workflows that compares expected and observed agent-to-agent transitions. Evaluating 18 LLMs on 90,000 HMASP instances, ASR reveals systematic, policy-relevant workflow deviations (not visible to Task Success Rate or unordered handoff F1). In particular, 10/18 models routinely skip an intermediate confirmation checkpoint during payment checkout—creating auditable gaps—while 8 models enforce the checkpoint perfectly. ASR-guided prompt and routing fixes produce very large task-success gains (up to +93.8 percentage points), showing engineering (not always model replacement) can eliminate many failures.

Key Points

Problem with existing metrics:
- Task Success Rate (TSR) reports only end outcomes; Agent Handoff F1 (HF1) treats handoffs as unordered edges. Neither captures whether agents followed the required ordered workflow steps.
ASR definition (trajectory-level, transition/bigram based):
- Represent an expected trajectory E and observed O as transition multisets BE and BO (each consecutive agent pair is a transition).
- Transition Recall (TR) = |BE ∩ BO| / |BE|
- Transition Precision (TP) = |BE ∩ BO| / |BO|
- ASR = F1(TR, TP)
- Example: expected 10 transitions but model produced 8 (skipped checkpoint) → TR = 8/10 = 80%, TP = 8/8 = 100%, ASR = F1(80%,100%) ≈ 88.9%.
Empirical discoveries:
- Dataset: HMASP 1,000 data points across 4 scenarios (T1–T4), 5 independent repeats per model–scenario → 90,000 runs.
- Models: 18 LLMs (open-weight and proprietary), including GPT-4.1, GPT-5.2, Qwen variants, Llama3.1 variants, Gemma4 variants, GPT-OSS, Mistral, Magistral, etc.
- Ten models systematically skip an explicit CPA confirmation checkpoint in payment processing (T3), producing a single consistent shortcut trajectory across affected samples—thus deterministic and not merely stochastic noise.
- Several models (including GPT-4.1) achieved perfect TSR and HF1 but still had ASR < 100% due to the skipped checkpoint; GPT-5.2 and a set of models achieved perfect ASR.
Interventions and results:
- Two ASR-guided engineering techniques: (1) iterative prompt engineering specifying role, handoffs, boundaries; (2) deterministic routing guards (cart-items guard, keyword-based card-view fallback).
- Large TSR improvements for weaker models after interventions (examples: Llama3.1:8b +93.8 pp, Magistral:24b +54.2 pp, Llama3.1:70b +33.5 pp).
Regulatory importance:
- Skipping confirmation/checkpoints breaks audit trails required by standards like PCI-DSS; outcome correctness alone is insufficient for compliance.

Data & Methods

Metric:
- ASR = F1 score computed on transition (bigram) multisets between expected and observed agent sequences; decomposed into Transition Recall (completeness) and Transition Precision (efficiency).
Experimental setup:
- HMASP evaluation dataset: 1,000 inputs over four scenarios: T1 (card registration), T2 (card retrieval), T3 (payment processing), T4 (irrelevant input rejection).
- Models: 18 LLMs (both locally run open models and proprietary OpenAI models). Local models run with quantized weights on Apple M5 Max.
- Repeats: 5 independent runs per model–scenario pair; report mean ± std across repeats (90,000 total instances).
- Baseline metrics reported: TSR, HF1 (existing), and ASR (proposed).
System engineering guided by ASR diagnostics:
- Prompt refinements to clarify required checkpoint behavior and role handoffs.
- Deterministic routing guards to enforce or suppress routes based on cart_items or keyword matches.
Key analytical observation:
- ASR reveals ordered-sequence deviations; HF1 can be perfect even when ordering (and thus checkpoint enforcement) is violated.

Implications for AI Economics

Compliance and regulatory risk valuation:
- In regulated financial systems, auditability and procedural fidelity are economic assets; skipping checkpoints increases regulatory, legal, and reputational risk even when outcomes are correct. ASR provides a more appropriate signal for quantifying these risks.
Cost trade-offs: engineering vs model replacement
- Large TSR gains from prompt/routing fixes imply that investing in system engineering (guided by trajectory metrics) can be more cost-effective than upgrading to larger/proprietary models. This affects procurement and total-cost-of-ownership decisions.
Incentives and metric design
- Using only outcome-level metrics (TSR) or unordered handoff metrics (HF1) misaligns incentives: operators might prioritize final result accuracy while neglecting required process steps. Deployers and regulators should adopt trajectory-aware metrics (or weighted variants) to align operational incentives with compliance needs.
Market differentiation and product claims
- Models/systems that enforce procedural fidelity (high ASR) can credibly claim superior auditability and lower compliance risk; this may become a competitive dimension for agentic payment products and services.
Risk-adjusted deployment strategies
- ASR (or transition-weighted ASR) can be used to prioritize stricter guard rails or human-in-the-loop checks for high-risk transitions, improving capital allocation for supervision and insurance costs.
Broader applicability and externalities
- Trajectory fidelity matters beyond payments (healthcare, legal, procurement). Economically, failure to model ordered workflows can create negative externalities (liability, fraud, customer harm) that are not captured by outcome-only KPIs.
Policy recommendations
- Regulators and industry standards should require or recommend trajectory-level conformance checks for agentic workflows in high-stakes domains; metrics like ASR can be operationalized into audits and certification criteria.

If you want, I can: - Produce a one-page slide-ready summary with key figures (e.g., ASR formula, example transition-skipping case, top metric improvements). - Suggest how to operationalize ASR as a monitoring KPI (alerts, thresholds, weighted-critical-transition scoring).

Assessment

Paper Typedescriptive Evidence Strengthmedium — Large-scale empirical evaluation (18 LLMs, ~90,000 task instances) provides strong descriptive evidence that the proposed metric (ASR) uncovers systematic workflow deviations invisible to TSR and HF1; however the study is not causal with respect to broader economic outcomes and appears constrained to a single HMASP payment workflow, limiting external validity. Methods Rigorhigh — Introduces a principled trajectory-level metric with decomposed components (Transition Recall/Precision), applies it systematically across many models and instances, and demonstrates practical interventions (prompt refinements and routing guards) that yield large, measurable improvements; while full replication details (e.g., task diversity, statistical testing, pre-registration) are not provided in the summary, the scale and the interventional component indicate rigorous methodology. SampleEvaluations performed on the Hierarchical Multi-Agent System for Payments (HMASP) over 18 different LLMs and approximately 90,000 payment-task instances, comparing observed agent transition sequences to expected workflows and testing prompt/routing interventions; includes models such as GPT-4.1 and GPT-5.2. Themesgovernance adoption human_ai_collab GeneralizabilityRestricted to a single HMASP payment workflow—other workflows or domains may differ, Models tested (18) may not represent the full population of current or future LLMs or model configurations, Unclear whether task instances are synthetic, simulated, or drawn from real-world production logs, which affects external validity, Findings may depend on the specific agent orchestration, prompt templates, and checkpoint definitions used, Regulatory and UI constraints in other payment systems could change the relevance of ASR diagnostics

Claims (8)

Claim	Direction	Confidence	Outcome	Details
LLM-based multi-agent systems are increasingly deployed for payment workflows. Adoption Rate	positive	high	deployment/adoption of LLM-based multi-agent systems for payment workflows	0.03
Prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. Other	null_result	high	coverage/expressiveness of evaluation metrics (TSR, HF1)	0.09
We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Other	positive	high	trajectory fidelity / transition-level execution accuracy	0.18
Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Error Rate	negative	high	adherence to expected workflow transitions (confirmation checkpoint adherence)	n=90000 10 of 18 models systematically skip a confirmation checkpoint during payment checkout; 8 models enforce the checkpoint perfectly 0.3
GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1. Error Rate	negative	high	trajectory fidelity vs. standard metrics (TSR, HF1)	perfect TSR and HF1 (for GPT-4.1) while exhibiting hidden workflow shortcuts (ASR indicates deviation) 0.18
GPT-5.2 achieves perfect ASR. Error Rate	positive	high	Agentic Success Rate (ASR)	perfect ASR 0.18
Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models. Error Rate	positive	high	Task Success Rate (TSR)	+93.8 percentage points 0.18
Trajectory-level evaluation is essential in regulated domains. Governance And Regulation	positive	medium	suitability/necessity of trajectory-level evaluation in regulated contexts	0.02