Lab-style benchmarks and common NLP scores miss systemic failures when AI agents run continuously in production: four of seven production-specific failure modes were invisible to ROUGE, BERTScore and standard agentic benchmarks, and the authors offer PAEF, a continuous, production-oriented evaluation framework and open-source implementation to close the detection gap.

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

Mukund Pandey · May 02, 2026

arXiv · descriptive · medium evidence · relevance 7/10

Standard single-session benchmarks and common NLP metrics systematically miss or lag in detecting several failure modes unique to continuously operating agentic systems, so the authors propose PAEF — a five-dimension, continuous-evaluation framework with an open-source reference implementation — to monitor production traffic and catch these blind spots.

Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.

Summary

Main Finding

Production deployments of agentic LLM systems exhibit failure modes that standard, episodic evaluation metrics (ROUGE, BERTScore, accuracy/AUC, AgentBench, MT-Bench) either miss entirely or catch only after a lag. The paper (Pandey, 2026) (1) defines seven production-specific failure modes observed at billion-event scale, (2) shows empirically that standard metrics miss four of them and detect the other three only after multiple evaluation cycles, and (3) introduces PAEF, a five-dimension, continuous evaluation framework (with open-source reference code), to detect and monitor these failures in production traffic.

Key Points

Seven production-specific failure modes (short descriptions):
- Cascading Decision Error / Coherence Illusion (FM-1): an early wrong decision propagates and becomes internally coherent downstream.
- Silent Degradation via Availability-Truth Decoupling (FM-2): degraded tools return schema-valid but stale or partial data, so downstream logic appears correct.
- Distribution Collapse Under Metric Optimization (FM-3): optimizing proxies reduces output diversity and user value while point metrics stay stable.
- Consistency Collapse Across Entry Points (FM-4): semantically identical requests sent through different surfaces produce inconsistent outcomes.
- Explanation-Decision Decoupling (FM-5): a correct decision is paired with an incorrect attribution or explanation, misleading audits and debugging.
- Silent Correctness Erosion Under Latency Pressure (FM-6): fallbacks triggered under latency SLAs degrade decision quality while system SLAs remain green.
- Proxy Goal Convergence / Reward Hacking (FM-7): the system optimizes a monitored proxy (e.g., CTR) while true long-horizon objectives (retention, trust) degrade.
Empirical gap: standard metrics fail to detect four of the seven failure modes at all and detect the remaining three only after multiple evaluation cycles (i.e., with lag). The paper provides a detection-coverage table mapping common metrics to each failure mode.
PAEF (Production Agentic Evaluation Framework): designed for continuous monitoring on production traffic (live or shadow); it is distribution-aware, correlates across signals, and can be integrated into CI. It produces scalar scores with thresholds that can block a deployment, flag it, or mark it as a regression.
PAEF provides concrete metrics and implementations for the failure modes, including:
- Cascade Uncertainty (propagation of low-confidence inputs to high-confidence downstream outputs; coherence-illusion score),
- Tool Reliability (categorizing tool calls as SUCCESS / PARTIAL / FAILED; partial response rate PRR; latency–quality correlation ρLQ),
- Distribution Health (intra-session diversity, normalized output entropy, repeat-rate/unique-creator distribution),
- Cross-surface consistency checks (paired-request replay across API/UI/batch),
- Explanation / attribution consistency checks (perturbation-based counterfactuals),
- Long-horizon / multi-objective monitoring and counterfactual evaluation to detect proxy optimization / goal drift.
The authors open-sourced a reference implementation: https://github.com/mukund1985/llm-eval-toolkit
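As an illustration of the Distribution Health signals listed above (normalized output entropy and repeat rate over sliding windows), here is a minimal sketch. It is not taken from the authors' repository: the function names, the `catalog_size` normalizer, and the window handling are assumptions made for this example, and the paper's exact definitions may differ.

```python
import math
from collections import Counter


def normalized_entropy(items, catalog_size):
    """Shannon entropy of item frequencies, scaled into [0, 1].

    Assumption: we normalize by log(catalog_size), the number of items the
    system could have produced, so a collapse onto a few items lowers the score.
    """
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0 or catalog_size < 2:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return min(1.0, max(0.0, entropy / math.log(catalog_size)))


def repeat_rate(items):
    """Fraction of outputs that repeat an item already seen in the window."""
    if not items:
        return 0.0
    return 1.0 - len(set(items)) / len(items)


def sliding_entropy(stream, window, catalog_size):
    """Normalized entropy for every full window of `window` consecutive items."""
    items = list(stream)
    return [
        normalized_entropy(items[i:i + window], catalog_size)
        for i in range(len(items) - window + 1)
    ]
```

A falling `sliding_entropy` series alongside stable point metrics would be one possible signature of FM-3; which drop counts as significant is a deployment-specific choice.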

Data & Methods

Empirical grounding: observations and failure examples are drawn from production systems operating at O(10^9) events/day (real-world incidents and signal correlations across large-scale pipelines).
Formalization: each failure mode is given a formal definition and a detection signature. Several metrics are defined mathematically in the paper (examples shown in text):
- Cascade Uncertainty: coherence illusion score CIS(s) computed from confidence scores across pipeline steps and aggregate cascade score.
- Tool Reliability: partial response rate PRR = (#PARTIAL calls)/total, and latency–quality Pearson correlation ρLQ; combined into a tool reliability score.
- Distribution Health: entropy and intra-session diversity computed over sliding windows; monitoring repeat-rate and unique-creator distribution.
- Attribution checks: perturbation consistency/counterfactuals to test explanation–decision coupling.
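The Tool Reliability quantities above can be sketched directly from their definitions: PRR as the share of PARTIAL calls, and ρLQ as the Pearson correlation between latency and a quality signal. The `ToolCall` record and its `quality` field are hypothetical names for this example. The paper's rule for combining the two into a single tool reliability score is not reproduced in this summary, so it is left out here.

```python
import math
from dataclasses import dataclass


@dataclass
class ToolCall:
    status: str        # "SUCCESS", "PARTIAL", or "FAILED"
    latency_ms: float
    quality: float     # hypothetical downstream quality signal in [0, 1]


def partial_response_rate(calls):
    """PRR = (#PARTIAL calls) / total calls; 0.0 for an empty batch."""
    if not calls:
        return 0.0
    return sum(c.status == "PARTIAL" for c in calls) / len(calls)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if undefined (n < 2 or no variance)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def latency_quality_correlation(calls):
    """rho_LQ: correlation between call latency and the quality signal."""
    return pearson([c.latency_ms for c in calls], [c.quality for c in calls])
```

A strongly negative `latency_quality_correlation` means slower calls coincide with lower quality, the kind of pattern FM-6 describes.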
Benchmarking / validation: the paper maps common evaluation metrics (ROUGE, BERTScore, accuracy/AUC, AgentBench, MT-Bench) against the seven failure modes and demonstrates gaps. It also reports experimental validation on public benchmarks to show detection gaps and PAEF coverage (details and code in the referenced repo).
Implementation focus: PAEF is designed for streaming/continuous evaluation, outputs CI-friendly scalar scores with configurable thresholds, and explicitly supports cross-signal correlation (tool health × decision quality, latency × correctness, attribution × perturbation impact).
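To show how CI-friendly scalar scores with configurable thresholds might gate a deployment, here is a small sketch. The `gate` function, the `Verdict` levels, the dimension names, and the threshold values are all assumptions for this example, not PAEF's actual interface.

```python
from enum import Enum


class Verdict(Enum):
    PASS = 0
    FLAG = 1
    BLOCK = 2


def gate(scores, thresholds):
    """Compare per-dimension scores (higher is healthier) against thresholds.

    thresholds maps a dimension name to (block_below, flag_below).
    Returns the most severe verdict and a list of human-readable reasons.
    A missing score is flagged rather than silently passed (an assumption).
    """
    verdict = Verdict.PASS
    reasons = []
    for name, (block_below, flag_below) in thresholds.items():
        score = scores.get(name)
        if score is None:
            level, why = Verdict.FLAG, f"{name}: no score reported"
        elif score < block_below:
            level, why = Verdict.BLOCK, f"{name}: {score:.2f} < {block_below:.2f}"
        elif score < flag_below:
            level, why = Verdict.FLAG, f"{name}: {score:.2f} < {flag_below:.2f}"
        else:
            continue
        reasons.append(why)
        if level.value > verdict.value:
            verdict = level
    return verdict, reasons


if __name__ == "__main__":
    # Hypothetical thresholds and scores for two evaluation dimensions.
    thresholds = {"tool_reliability": (0.70, 0.90), "distribution_health": (0.40, 0.60)}
    scores = {"tool_reliability": 0.95, "distribution_health": 0.55}
    verdict, reasons = gate(scores, thresholds)
    print(verdict.name, reasons)
```

In a CI job, a wrapper could exit non-zero on `Verdict.BLOCK` and post the reasons as a warning on `Verdict.FLAG`.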

Implications for AI Economics

Hidden negative externalities and measurement problems:
- Short-term proxy optimization (e.g., CTR) can raise measured performance while eroding long-term user value (retention, trust). This is a classic Goodhart/measurement problem with material economic consequences: short-run gains misrepresent product value and can drive investment/compensation decisions that are socially or commercially suboptimal.
- Undetected cohort-specific errors (FM-1) or cross-surface inconsistencies (FM-4) can create unequal value or harms across user segments, producing distributional welfare losses and regulatory compliance risks.
Operational and monitoring costs:
- Continuous, cross-signal monitoring (as PAEF prescribes) increases engineering, instrumentation, and compute costs relative to episodic benchmarking; organizations must trade off upfront monitoring expense versus expected avoided losses from undetected drift/failures.
- The need for multi-team, cross-metric coordination (product metrics, long-horizon KPIs, infra metrics) increases organizational complexity and may shift how performance bonuses, SLAs, and vendor contracts are structured.
Product design and incentives:
- Contracts and incentive mechanisms should be redesigned to reward long-horizon objectives (retention, trust, long-term engagement quality) rather than short-horizon proxies, to mitigate reward-hacking externalities (FM-7).
- Business metrics, A/B testing frameworks, and experiment horizons should be adjusted to capture distributional and long-term outcomes; otherwise optimization will produce economically inefficient equilibria.
Investment and risk management:
- Investors and managers evaluating AI products should discount superficially strong point-metric performance unless continuous, distribution-aware monitoring and causal attribution checks are in place.
- PAEF-like monitoring reduces operational tail risk (regulatory fines, reputational damage, churn) — this risk reduction has economic value that can be compared to monitoring costs.
Regulation and standards:
- Explainability failures (FM-5) have compliance implications: correct decisions with incorrect attributions risk misleading audits. Regulators may demand perturbation-based attribution checks or continuous monitoring requirements for certain domains (finance, health).
- Industry standards for production evaluation (continuous, cross-signal) would align incentives and lower information asymmetry between vendors and buyers.

Practical takeaway for economists and product managers: evaluating agentic AI requires shifting investment from episodic capability benchmarking to continuous, distribution-aware monitoring that aligns measurable metrics with long-horizon value. PAEF offers an actionable framework and tooling to operationalize that shift; the cost of adopting such systems should be weighed against the nontrivial economic risks of undetected production failures (user harm, churn, regulatory exposure, and misallocated R&D spend).

Assessment

Paper Type: descriptive

Evidence Strength: medium. The claims are supported by very large-scale, real-world production logs (billion-event scale) and systematic empirical demonstrations that common metrics fail to detect specific failure modes, which gives strong ecological validity for production settings; however, the evidence is observational, context-specific, and lacks randomized or counterfactual validation and external replication across diverse organizations.

Methods Rigor: medium. The paper develops a taxonomy grounded in extensive production telemetry, conducts empirical tests showing metric blind spots, and provides an open-source reference implementation (PAEF) enabling continuous evaluation; but methodological details on labeling, cross-system sampling, potential selection biases, and quantitative thresholds for detection are not fully specified (no randomized interventions or external validation), limiting rigor for causal claims.

Sample: Telemetry and interaction logs from agentic AI systems operating in production at billion-event scale, including continuous multi-step agent sessions, tool-call traces, outcome signals, and downstream failures; documentation does not specify the exact number of deployments, industries, or whether data came from one or multiple organizations.

Themes: adoption, productivity

Identification: Observational comparison. The authors analyze billion-event production logs from agentic AI deployments to document seven production-specific failure modes and then empirically compare the sensitivity of standard metrics/benchmarks (ROUGE, BERTScore, accuracy/AUC, HELM/MT-Bench/AgentBench/BIG-bench) to these failures; there is no causal identification strategy for effects on economic outcomes.

Generalizability:
- May be specific to large organizations with high-volume agentic deployments and rich telemetry; not representative of small-scale or episodic usage.
- Findings may vary by domain (customer support, e-commerce, developer tools) and by the specific agent architectures and tools used.
- Relies on availability of structured production logs and outcome signals; not applicable where telemetry or tool integration is limited.
- Taxonomy and PAEF thresholds may require tuning for safety-critical or highly regulated settings and were not validated across such domains.
- Benchmarks and metrics evaluated are a moving target; conclusions may shift as benchmarks evolve or agents are retrained.

Claims (7)

Claim	Topic	Direction	Confidence	Outcome	Details
Existing evaluation frameworks for large language models, including HELM, MT-Bench, AgentBench, and BIG-bench, are designed for controlled, single-session, lab-scale settings and do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production.	AI Safety and Ethics	negative	high	ability of existing LLM evaluation frameworks to address continuous production agentic evaluation challenges	0.18
The paper presents a taxonomy of seven failure modes unique to production agentic systems.	AI Safety and Ethics	positive	high	cataloging of distinct failure modes in production agentic systems	n=7 0.18
The evaluation challenges that emerge in continuous production operation include compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks.	AI Safety and Ethics	negative	high	types of evaluation challenges affecting production agentic systems	0.18
The taxonomy and its failure modes are grounded in observations from systems operating at billion-event scale.	AI Safety and Ethics	positive	high	empirical grounding (scale) of observations used to derive the taxonomy	observations from systems operating at billion-event scale 0.18
Standard metrics (ROUGE, BERTScore, accuracy/AUC, and agentic benchmarks such as HELM/MT-Bench/AgentBench/BIG-bench) either fail to detect each of the seven production failure modes or detect it only after a lag.	AI Safety and Ethics	negative	high	detection capability of standard metrics/benchmarks for production failure modes	n=7 0.18
Standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.	AI Safety and Ethics	negative	high	proportion and timing of detection of failure modes by standard metrics	n=7 standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles 0.18
The paper proposes PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs.	AI Safety and Ethics	positive	high	provision of a continuous, production-focused evaluation framework (PAEF)	0.03