A trajectory-level metric exposes hidden workflow shortcuts in LLM-based payment systems—ten of 18 models skip a critical checkout confirmation while appearing perfect under existing metrics—and ASR-guided fixes boost completion rates by up to 93.8 percentage points.
LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.
Summary
Main Finding
The paper introduces Agentic Success Rate (ASR), a trajectory-level metric for LLM-based multi-agent payment workflows that compares expected and observed agent-to-agent transitions. Evaluating 18 LLMs on 90,000 HMASP instances, ASR reveals systematic, policy-relevant workflow deviations (not visible to Task Success Rate or unordered handoff F1). In particular, 10/18 models routinely skip an intermediate confirmation checkpoint during payment checkout—creating auditable gaps—while 8 models enforce the checkpoint perfectly. ASR-guided prompt and routing fixes produce very large task-success gains (up to +93.8 percentage points), showing engineering (not always model replacement) can eliminate many failures.
Key Points
- Problem with existing metrics:
- Task Success Rate (TSR) reports only end outcomes; Agent Handoff F1 (HF1) treats handoffs as unordered edges. Neither captures whether agents followed the required ordered workflow steps.
- ASR definition (trajectory-level, transition/bigram based):
- Represent an expected trajectory E and observed O as transition multisets BE and BO (each consecutive agent pair is a transition).
- Transition Recall (TR) = |BE ∩ BO| / |BE|
- Transition Precision (TP) = |BE ∩ BO| / |BO|
- ASR = F1(TR, TP)
- Example: expected 10 transitions but model produced 8 (skipped checkpoint) → TR = 8/10 = 80%, TP = 8/8 = 100%, ASR = F1(80%,100%) ≈ 88.9%.
- Empirical discoveries:
- Dataset: HMASP 1,000 data points across 4 scenarios (T1–T4), 5 independent repeats per model–scenario → 90,000 runs.
- Models: 18 LLMs (open-weight and proprietary), including GPT-4.1, GPT-5.2, Qwen variants, Llama3.1 variants, Gemma4 variants, GPT-OSS, Mistral, Magistral, etc.
- Ten models systematically skip an explicit CPA confirmation checkpoint in payment processing (T3), producing a single consistent shortcut trajectory across affected samples—thus deterministic and not merely stochastic noise.
- Several models (including GPT-4.1) achieved perfect TSR and HF1 but still had ASR < 100% due to the skipped checkpoint; GPT-5.2 and a set of models achieved perfect ASR.
- Interventions and results:
- Two ASR-guided engineering techniques: (1) iterative prompt engineering specifying role, handoffs, boundaries; (2) deterministic routing guards (cart-items guard, keyword-based card-view fallback).
- Large TSR improvements for weaker models after interventions (examples: Llama3.1:8b +93.8 pp, Magistral:24b +54.2 pp, Llama3.1:70b +33.5 pp).
- Regulatory importance:
- Skipping confirmation/checkpoints breaks audit trails required by standards like PCI-DSS; outcome correctness alone is insufficient for compliance.
Data & Methods
- Metric:
- ASR = F1 score computed on transition (bigram) multisets between expected and observed agent sequences; decomposed into Transition Recall (completeness) and Transition Precision (efficiency).
- Experimental setup:
- HMASP evaluation dataset: 1,000 inputs over four scenarios: T1 (card registration), T2 (card retrieval), T3 (payment processing), T4 (irrelevant input rejection).
- Models: 18 LLMs (both locally run open models and proprietary OpenAI models). Local models run with quantized weights on Apple M5 Max.
- Repeats: 5 independent runs per model–scenario pair; report mean ± std across repeats (90,000 total instances).
- Baseline metrics reported: TSR, HF1 (existing), and ASR (proposed).
- System engineering guided by ASR diagnostics:
- Prompt refinements to clarify required checkpoint behavior and role handoffs.
- Deterministic routing guards to enforce or suppress routes based on cart_items or keyword matches.
- Key analytical observation:
- ASR reveals ordered-sequence deviations; HF1 can be perfect even when ordering (and thus checkpoint enforcement) is violated.
Implications for AI Economics
- Compliance and regulatory risk valuation:
- In regulated financial systems, auditability and procedural fidelity are economic assets; skipping checkpoints increases regulatory, legal, and reputational risk even when outcomes are correct. ASR provides a more appropriate signal for quantifying these risks.
- Cost trade-offs: engineering vs model replacement
- Large TSR gains from prompt/routing fixes imply that investing in system engineering (guided by trajectory metrics) can be more cost-effective than upgrading to larger/proprietary models. This affects procurement and total-cost-of-ownership decisions.
- Incentives and metric design
- Using only outcome-level metrics (TSR) or unordered handoff metrics (HF1) misaligns incentives: operators might prioritize final result accuracy while neglecting required process steps. Deployers and regulators should adopt trajectory-aware metrics (or weighted variants) to align operational incentives with compliance needs.
- Market differentiation and product claims
- Models/systems that enforce procedural fidelity (high ASR) can credibly claim superior auditability and lower compliance risk; this may become a competitive dimension for agentic payment products and services.
- Risk-adjusted deployment strategies
- ASR (or transition-weighted ASR) can be used to prioritize stricter guard rails or human-in-the-loop checks for high-risk transitions, improving capital allocation for supervision and insurance costs.
- Broader applicability and externalities
- Trajectory fidelity matters beyond payments (healthcare, legal, procurement). Economically, failure to model ordered workflows can create negative externalities (liability, fraud, customer harm) that are not captured by outcome-only KPIs.
- Policy recommendations
- Regulators and industry standards should require or recommend trajectory-level conformance checks for agentic workflows in high-stakes domains; metrics like ASR can be operationalized into audits and certification criteria.
If you want, I can: - Produce a one-page slide-ready summary with key figures (e.g., ASR formula, example transition-skipping case, top metric improvements). - Suggest how to operationalize ASR as a monitoring KPI (alerts, thresholds, weighted-critical-transition scoring).
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| LLM-based multi-agent systems are increasingly deployed for payment workflows. Adoption Rate | positive | high | deployment/adoption of LLM-based multi-agent systems for payment workflows |
0.03
|
| Prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. Other | null_result | high | coverage/expressiveness of evaluation metrics (TSR, HF1) |
0.09
|
| We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Other | positive | high | trajectory fidelity / transition-level execution accuracy |
0.18
|
| Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Error Rate | negative | high | adherence to expected workflow transitions (confirmation checkpoint adherence) |
n=90000
10 of 18 models systematically skip a confirmation checkpoint during payment checkout; 8 models enforce the checkpoint perfectly
0.3
|
| GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1. Error Rate | negative | high | trajectory fidelity vs. standard metrics (TSR, HF1) |
perfect TSR and HF1 (for GPT-4.1) while exhibiting hidden workflow shortcuts (ASR indicates deviation)
0.18
|
| GPT-5.2 achieves perfect ASR. Error Rate | positive | high | Agentic Success Rate (ASR) |
perfect ASR
0.18
|
| Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models. Error Rate | positive | high | Task Success Rate (TSR) |
+93.8 percentage points
0.18
|
| Trajectory-level evaluation is essential in regulated domains. Governance And Regulation | positive | medium | suitability/necessity of trajectory-level evaluation in regulated contexts |
0.02
|