In a lab benchmark on a 24-microservice demo, the Causely causal intelligence layer cut time-to-diagnosis by 63%, token use by 60%, and tool calls by 78%, raised root-cause accuracy from 75% to 100%, and compressed the investigation footprint nearly fivefold.

Causely: A Causal Intelligence Layer for Enterprise AI. A Benchmark Study on SRE and Reliability Workflows

Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller, Shmuel Kliger · May 18, 2026

arXiv · quasi_experimental · medium evidence · 7/10 relevance

In a controlled 24-microservice benchmark with injected faults, adding the Causely causal intelligence layer cut mean time-to-diagnosis by 63%, reduced token use by 60% and tool calls by 78%, and raised root-cause accuracy from 75% to 100%, while lowering API cost per run by 57%.

AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability. We propose Causely, a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships that are anchored to an ontological representation of the managed environment. Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production. We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application. Our experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends). Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline. On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63%, mean token consumption by 60%, and mean tool-call count by 78%, compressing the investigation footprint by 4.8× and lowering direct API cost per run by 57%; root-cause-diagnosis accuracy rises from 75% to 100%.

Summary

Main Finding

Supplying AI agents in SRE/reliability workflows with a persistent causal intelligence layer (Causely) — a structured, live model of topology, attribute dependencies, and causal relationships — materially improves operational performance versus agents that reason from raw telemetry. In a controlled benchmark over a 24‑microservice OpenTelemetry demo with injected faults, Causely reduced mean time-to-diagnosis by 63%, token consumption per run by 60%, tool-call count by 78% (compressing investigation footprint ~4.8×), lowered direct API cost per run by 57%, and increased root-cause-diagnosis accuracy from 75% to 100%.

Key Points

Causely concept: a causal intelligence layer CI = (GT, KC, GC, GA)
- GT: live topology graph of entities and relations (conn, layer, comp)
- KC: environment‑agnostic causal knowledge base (entity types → root causes ↔ symptoms + propagation rules)
- GC: environment‑specific causality graph (instantiated causal edges with probabilities)
- GA: attribute dependency graph for continuous attribute relationships (latency, utilization, etc.)
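The four components can be pictured as plain data structures. The following is a minimal sketch, not Causely's actual schema: every class name, field name, and example value is a hypothetical illustration chosen to mirror the GT/KC/GC/GA descriptions above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """An edge in the topology graph GT; kind is one of conn, layer, comp."""
    source: str
    target: str
    kind: str


@dataclass(frozen=True)
class CausalRule:
    """An environment-agnostic entry in KC: a root cause and its symptoms."""
    entity_type: str
    root_cause: str
    symptoms: tuple


@dataclass(frozen=True)
class CausalEdge:
    """An instantiated edge in GC, carrying a probability."""
    cause: str
    effect: str
    probability: float


@dataclass
class CausalIntelligenceLayer:
    topology: list = field(default_factory=list)        # GT
    knowledge_base: list = field(default_factory=list)  # KC
    causality: list = field(default_factory=list)       # GC
    attribute_deps: dict = field(default_factory=dict)  # GA: attribute -> attributes it depends on

    def effects_of(self, cause: str) -> list:
        """Return (effect, probability) pairs for one instantiated cause."""
        return [(e.effect, e.probability) for e in self.causality if e.cause == cause]


# Hypothetical example values, for illustration only.
ci = CausalIntelligenceLayer(
    topology=[Relation("checkout", "payment", "conn")],
    knowledge_base=[CausalRule("Service", "cpu_congestion", ("high_latency",))],
    causality=[CausalEdge("payment.cpu_congestion", "checkout.high_latency", 0.9)],
    attribute_deps={"checkout.latency": ["payment.latency"]},
)
print(ci.effects_of("payment.cpu_congestion"))
```

The point of the split is that an agent can ask a narrow question (such as `effects_of`) against a maintained model instead of re-deriving the same structure from raw telemetry on every run.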
Use cases evaluated: health assessment, incident impact analysis, root cause localization, remediation/evaluation.
Experimental scope:
- 72 runs total.
- Four agent configurations: Claude Code, OpenAI Codex, HolmesGPT (Gemini Pro 3 backend), HolmesGPT (Claude Sonnet backend).
- Two scenarios: active fault (injected) and healthy baseline.
- Two conditions: with and without access to Causely.
- Environment: 24-microservice OpenTelemetry demo app; controlled fault injections.
Headline quantitative results (active-fault scenario, aggregated):
- Time-to-diagnosis: −63.2% mean (per-config reductions 34.8%–82.8%)
- Token consumption: −60% mean
- Tool-call count: −78% mean
- Investigation footprint: 4.8× compression
- Direct API cost per run: −57%
- Root-cause-diagnosis accuracy: from 75% → 100%
Additional behaviors:
- Baseline agents often expended more time/tokens on the healthy baseline than on active faults (up to 7.2× tokens) because raw telemetry lacks a clear “no fault” stopping signal; causal grounding makes “nothing wrong” a first-class, cheap answer.
- Hallucinated-incident behavior in healthy runs was eliminated for some agents and significantly reduced for others when using Causely.

Data & Methods

Benchmark design:
- Factorial experiment: agent type × scenario (healthy vs active) × causal grounding (with/without).
- Repeated runs to reach 72 total experimental trials (paper includes detailed appendix on factorial structure).
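The factorial structure can be enumerated directly. This is a small sketch that assumes only the three factors listed above; the label strings are shorthand, and how the 72 runs were actually distributed across cells is not stated in this summary.

```python
from itertools import product

AGENTS = ["Claude Code", "OpenAI Codex", "HolmesGPT (Gemini)", "HolmesGPT (Sonnet)"]
SCENARIOS = ["active fault", "healthy baseline"]
CONDITIONS = ["with Causely", "without Causely"]

# Every combination of agent, scenario, and grounding condition is one design cell.
cells = list(product(AGENTS, SCENARIOS, CONDITIONS))
total_runs = 72
runs_per_cell = total_runs / len(cells)

print(len(cells))      # 16
print(runs_per_cell)   # 4.5
```

Because 72 does not divide evenly into 16 cells, replication was presumably unequal across cells or spread over a further factor such as fault type; the paper's appendix would need to be consulted to confirm which.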
Environment and fault injection:
- 24-microservice demo application instrumented with OpenTelemetry.
- Controlled, documented fault injections targeting typical cloud-native failure modes; appendix describes fault types and injection protocol.
Metrics collected:
- Wall-clock time-to-diagnosis (seconds)
- Token consumption per run
- Tool-call count (number of external/instrumentation calls)
- Direct API cost per run (USD; token-based cost proxy)
- Diagnostic correctness/accuracy per a rubric (appendix)
Analysis:
- Comparative means and per-configuration breakdowns reported (no formal p-values shown in the excerpt; primary claims are mean improvements and qualitative patterns).
- Qualitative analysis of agent behavior (e.g., search loops on raw telemetry vs targeted causal queries).
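A comparison of means of this kind can be sketched as follows. The run values are invented purely for illustration and are not taken from the paper; the agent names and the `pct_reduction` helper are likewise hypothetical.

```python
from statistics import mean


def pct_reduction(baseline: list, treated: list) -> float:
    """Percentage by which the treated mean falls below the baseline mean."""
    base = mean(baseline)
    return 100.0 * (base - mean(treated)) / base


# Hypothetical time-to-diagnosis values in seconds, invented for illustration only.
runs = {
    "agent_a": {"baseline": [300.0, 340.0], "causely": [100.0, 120.0]},
    "agent_b": {"baseline": [200.0, 220.0], "causely": [130.0, 150.0]},
}

# Per-configuration reductions.
per_config = {name: pct_reduction(r["baseline"], r["causely"]) for name, r in runs.items()}

# Reduction computed on all runs pooled together.
pooled = pct_reduction(
    [x for r in runs.values() for x in r["baseline"]],
    [x for r in runs.values() for x in r["causely"]],
)

# Average of the per-configuration reductions.
mean_of_configs = mean(per_config.values())

print(per_config, pooled, mean_of_configs)
```

The pooled figure and the average of per-configuration figures generally differ, which is one reason a headline mean should be read alongside the per-configuration range.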
Limitations noted in the paper:
- Controlled, single-demo environment — generalization to diverse real-world deployments not yet demonstrated.
- Proprietary Causely implementation; results may depend on quality/coverage of the causal KB and topology extraction.
- Limited set of agents and models tested.
- Cost analysis tied to assumed token pricing and API billing models.

Implications for AI Economics

Visibility of token economics:
- As major providers shift from seat-based to per-token usage (or otherwise align pricing to compute), the “semantic-interpretation tax” (agents repeatedly consuming tokens to interpret raw telemetry) becomes a directly billable cost. The paper shows that reducing this tax can more than halve direct API spend per investigation (a 57% reduction in the benchmark).
Enterprise ROI and TCO:
- Integrating a causal intelligence layer can produce substantial per-investigation cost savings (paper: ~57% reduction), faster diagnostics (lower operational downtime), and higher correctness — a strong economic case for investing engineering effort to build or adopt such a layer for large-scale SRE/observability operations.
- Payback will depend on incident frequency, average agent-run costs under provider pricing, and integration/maintenance costs of the causal layer.
Provider economics and incentives:
- Reduced token consumption at the customer level reduces revenue per interaction for LLM providers if pricing remains per-token; providers may respond with new product tiers, bundled tooling, or their own causal/context layers to maintain margins.
- Alternatively, providers could partner with or monetize hosted causal layers (managed CI services) — creating new product differentiation.
Operational scale and marginal costs:
- If causal grounding enables cheaper, faster, and more reliable agentic automation, enterprises may scale agent deployments (more agents, continuous monitoring), increasing overall API volumes even if per-run token usage falls. Net spend depends on elasticity of agent usage and pricing.
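The elasticity argument reduces to simple arithmetic. As a rough sketch, assuming the 57% per-run cost reduction carries over unchanged and that spend is just runs times cost per run (both simplifying assumptions; the function names are illustrative):

```python
def net_spend_ratio(cost_reduction_pct: float, usage_multiplier: float) -> float:
    """New total spend divided by old total spend."""
    return usage_multiplier * (1.0 - cost_reduction_pct / 100.0)


def breakeven_usage_multiplier(cost_reduction_pct: float) -> float:
    """Usage growth at which total spend returns to its original level."""
    if not 0 <= cost_reduction_pct < 100:
        raise ValueError("cost reduction must be in [0, 100)")
    return 1.0 / (1.0 - cost_reduction_pct / 100.0)


print(f"{breakeven_usage_multiplier(57.0):.2f}x usage growth to break even")
print(f"{net_spend_ratio(57.0, 2.0):.2f} of prior spend at 2x usage")
print(f"{net_spend_ratio(57.0, 3.0):.2f} of prior spend at 3x usage")
```

Under these assumptions, total API spend stays below its prior level until agent usage grows by roughly 2.3 times; beyond that point overall volume rises even though each run is cheaper.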
Risks and strategic considerations:
- Engineering and maintenance overhead: building and continuously maintaining GT/KC/GC/GA (topology detection, causal KB coverage, attribute models) has nontrivial operational cost; enterprises must compare that to token/API savings and reliability gains.
- Vendor lock-in and data strategy: a proprietary causal KB tied to a vendor could create lock-in; enterprises should evaluate portability, standards, and interoperability.
- Generalization uncertainty: results are promising but derived from a single benchmark; firms should pilot in their own environments before wide rollout.
Recommendations for stakeholders:
- Enterprises: run a cost-benefit pilot estimating token-cost savings × incident rate minus engineering/maintenance cost of a CI layer; prioritize high‑value services or frequent incident classes.
- Providers: consider offering integrated causal/context layers or tooling to help customers reduce token usage while preserving model margins (e.g., managed CI, hybrid charging models).
- Researchers/policymakers: evaluate generalizability across heterogeneous, multi-tenant, and noisy production environments; study competitive effects if CI adoption materially reduces provider per-API revenue.
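The enterprise cost-benefit estimate above can be written out as a short calculation. This is a sketch only: apart from the 57% reduction reported in the paper, every input (cost per run, run volume, layer cost) is a hypothetical placeholder, and the model counts direct API savings alone, ignoring downtime or accuracy benefits.

```python
def annual_net_benefit(
    baseline_cost_per_run: float,
    cost_reduction_pct: float,
    runs_per_year: float,
    annual_layer_cost: float,
) -> float:
    """Token-cost savings times run volume, minus the cost of running the layer."""
    savings_per_run = baseline_cost_per_run * cost_reduction_pct / 100.0
    return savings_per_run * runs_per_year - annual_layer_cost


def breakeven_runs_per_year(
    baseline_cost_per_run: float,
    cost_reduction_pct: float,
    annual_layer_cost: float,
) -> float:
    """Run volume at which savings exactly cover the layer's annual cost."""
    savings_per_run = baseline_cost_per_run * cost_reduction_pct / 100.0
    return annual_layer_cost / savings_per_run


# Hypothetical inputs: $2.00 per baseline run, 10,000 runs/year, $5,000/year layer cost.
net = annual_net_benefit(2.0, 57.0, 10_000, 5_000.0)
threshold = breakeven_runs_per_year(2.0, 57.0, 5_000.0)
print(f"net benefit: ${net:,.0f}; break-even at {threshold:,.0f} runs/year")
```

A pilot would replace the placeholders with measured values and could add a term for avoided downtime if that can be estimated.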

Suggested next steps for practitioners: run small-scale trials feeding an LLM agent with a prototype causal/context layer (even simple topology + symptom mappings), measure token/time/cost deltas against baseline workflows, and track how maintenance effort scales as topology and services evolve.
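A prototype of the "simple topology + symptom mappings" idea might look like the sketch below. It is not Causely's method: the dependency graph, service names, and the intersection heuristic are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical call graph: service -> services it depends on.
DEPENDS_ON = {
    "frontend": ["checkout", "cart"],
    "checkout": ["payment", "cart"],
    "cart": ["cache"],
    "payment": [],
    "cache": [],
}


def reachable(service: str, graph: dict) -> set:
    """Return the service plus everything it transitively depends on."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def candidate_root_causes(symptomatic: set, graph: dict) -> set:
    """Services that every symptomatic service is, or transitively depends on."""
    if not symptomatic:
        return set()  # no symptoms: "nothing wrong" is an immediate answer
    return set.intersection(*(reachable(s, graph) for s in symptomatic))


print(candidate_root_causes({"frontend", "checkout", "cart"}, DEPENDS_ON))
print(candidate_root_causes(set(), DEPENDS_ON))
```

Even this crude filter narrows the search to a handful of services before an agent issues any telemetry queries, which makes it a reasonable baseline for measuring token, time, and cost deltas in a pilot.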

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. Internal validity is reasonably strong because of controlled fault injections and direct comparisons of agent configurations, yielding large effect sizes; however external validity is limited (synthetic demo app, limited fault types, small and unspecified number of runs), and the paper provides no evidence of randomized allocation, blinding, pre-registration, broader deployment, or robustness across diverse production environments.
Methods Rigor: medium. The study reports multiple clear outcome metrics and compares several agent backends, indicating thoughtful experimental design, but it omits key methodological details (number of trials, statistical tests/confidence intervals, randomization, model/hyperparameter controls, and replication), and evaluation appears confined to a single demo workload which risks overfitting the intervention to that environment.
Sample: A controlled benchmark using a 24-microservice OpenTelemetry demo application with injected faults; four agent configurations tested (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends) under two scenarios (active incident and healthy baseline); comparisons run with and without the Causely causal intelligence layer. The manuscript does not report the exact number of trial runs, distribution of fault types, or worker/operator interactions.
Themes: productivity, human_ai_collab
Identification: Controlled A/B-style benchmark in a lab environment: the authors inject faults into a 24-microservice OpenTelemetry demo app and compare agent performance with and without the Causely intervention across two scenarios (active incident and healthy baseline). Causal claims rely on same-environment comparisons of metrics (time-to-diagnosis, tokens, tool-calls, cost, accuracy) under matched fault injections rather than on randomized field deployment or blinded evaluation.
Generalizability:
- Synthetic demo application (24-microservice OpenTelemetry demo) may not reflect production heterogeneity.
- Limited set of injected fault types and scenarios; may not cover real-world incidents.
- Small, unspecified number of runs and agent versions reduces statistical robustness.
- Only four agent/back-end configurations tested; results may not hold for other models or future versions.
- Lab conditions omit real operator workflows, noisy/partial observability, and concurrent incidents.
- Cost reductions depend on current token/pricing models and may change with provider pricing.
- Possible tuning or optimization for the demo app could overstate performance gains in the wild.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability.	Other	negative	high	semantic-interpretation costs (tokens, latency, inferential reliability)	0.24
Causely is a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships anchored to an ontological representation of the managed environment.	Other	positive	high	system representation of environment topology, dependencies, and causal relationships	0.08
Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production.	Other	positive	high	availability of a live, queryable model derived from telemetry for diagnosis and action	0.48
We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application.	Other	neutral	high	benchmark evaluation setup (24-microservice demo with injected faults)	n=24 0.48
Experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends).	Other	neutral	high	agent configuration comparisons	n=4 0.48
Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline.	Other	neutral	high	experimental condition (Causely vs. no Causely) across two scenarios	0.48
On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63%.	Task Completion Time	positive	high	mean time-to-diagnosis	63% reduction 0.48
On the active-fault scenario, causal grounding reduces mean token consumption by 60%.	Organizational Efficiency	positive	high	mean token consumption	60% reduction 0.48
On the active-fault scenario, causal grounding reduces mean tool-call count by 78%.	Organizational Efficiency	positive	high	mean tool-call count	78% reduction 0.48
Causely compresses the investigation footprint by 4.8× (in the active-fault scenario).	Organizational Efficiency	positive	high	investigation footprint (aggregate investigation resource/effort)	4.8× compression 0.48
Causal grounding lowers direct API cost per run by 57%.	Organizational Efficiency	positive	high	direct API cost per run	57% reduction 0.48
Root-cause-diagnosis accuracy rises from 75% to 100% when agents have causal grounding (Causely) in the active-fault scenario.	Decision Quality	positive	high	root-cause-diagnosis accuracy	from 75% to 100% 0.48