Stateless Decision Memory for Enterprise AI Agents

Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures. We argue this reflects a hidden requirement: regulated deployment is load-bearing on four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call; summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprise's preference for weaker but replayable retrieval pipelines, and that DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.

Summary

Main Finding

Deterministic Projection Memory (DPM) pairs an append-only event log with a single task-conditioned, temperature-zero projection at decision time. It preserves the four systems properties that enterprises need (deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scale) while matching or exceeding the decision alignment of a strong stateful summarization baseline. At tight memory budgets DPM substantially improves factual precision and reasoning coherence and runs 7–15× faster (one LLM call versus N incremental calls). The core contribution is a systems argument: statelessness is the load-bearing property driving enterprise adoption, and DPM shows that it can be achieved without sacrificing decision quality on regulated tasks that fit within a single projection call.
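The architecture described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names, the prompt wording, the `LLMCall` interface, the hard character cap, and the `fake_llm` stand-in are all assumptions introduced here so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Event:
    index: int  # position in the log, used for citations
    text: str


@dataclass
class EventLog:
    """Append-only log: events are added, never edited or removed."""
    _events: List[Event] = field(default_factory=list)

    def append(self, text: str) -> Event:
        event = Event(index=len(self._events), text=text)
        self._events.append(event)
        return event

    def snapshot(self) -> Tuple[Event, ...]:
        # Immutable view handed to the projection.
        return tuple(self._events)


# Hypothetical LLM interface: (prompt, temperature) -> completion text.
LLMCall = Callable[[str, float], str]


def project(log: EventLog, task: str, budget_chars: int, llm: LLMCall) -> str:
    """Single task-conditioned projection pi(E, T, B) -> M."""
    numbered = "\n".join(f"[{e.index}] {e.text}" for e in log.snapshot())
    prompt = (
        f"Task: {task}\n"
        f"Write at most {budget_chars} characters with sections "
        "FACTS, REASONING, COMPLIANCE. Cite event indices like [3].\n"
        f"Events:\n{numbered}"
    )
    memory = llm(prompt, 0.0)  # exactly one call, at temperature zero
    # Hard truncation is an assumption of this sketch; the source only
    # says the projection is budget-bounded.
    return memory[:budget_chars]


def fake_llm(prompt: str, temperature: float) -> str:
    # Stand-in for a real model so the sketch runs without network access.
    return (
        "FACTS: income verified [0]. "
        "REASONING: debt-to-income within limit [1]. "
        "COMPLIANCE: adverse-action notice not required [2]."
    )


log = EventLog()
log.append("Applicant income verified")
log.append("Debt-to-income ratio computed")
log.append("Decision path reviewed for notice requirements")
memory = project(log, "mortgage underwriting decision", 1338, fake_llm)
```

Because `project` reads only the immutable snapshot and holds no state between calls, any worker can serve any request, which is the statelessness property the summary emphasizes.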

Key Points

Enterprise constraints matter: regulated decision systems require deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scaling. Many stateful memory architectures violate these properties by design (they accumulate mutable state via repeated LLM calls).
DPM architecture: immutable append-only event log E plus a single projection π(E, T, B) → M at decision time. Projection emits structured memory (facts / reasoning / compliance), cites event indices, runs at temperature 0, and is budget-bounded.
Operational advantage: DPM reduces the nondeterminism/replay surface from N intermediate LLM calls to a single projection call, making byte-exact replay feasible if paired with deterministic inference.
Empirical outcome:
- Benchmark: LongHorizon-Bench (10 cases: 5 mortgage, 5 claims; ~26–28k chars, 82–96 events/case).
- Conditions: Summ-only baseline (incremental summarization after each event) vs. DPM.
- Budgets: tight=1,338 chars (20× compression), moderate=5,352 (5×), loose=13,381 (2×).
- Metrics: FRP (factual precision – anchor recovery), RCS (reasoning coherence), EDA (decision accuracy), CRR (compliance reconstruction).
- Results (paired tests, n=10):
  - Tight budget: FRP +0.515 (p=0.001, Cohen’s h=1.17); RCS +0.533 (p=0.003, h=1.13). EDA and CRR each improved by Δ=+0.50 with p≈0.065–0.066: large effect sizes that do not reach the conventional p<0.05 threshold.
  - Moderate & loose budgets: no statistically significant difference on the four axes.
- Speed: DPM 7–15× faster because it makes one LLM call at decision time instead of many incremental consolidation calls.
Determinism study: temperature-zero calls against the live API (Anthropic claude-haiku) show residual API-level nondeterminism (byte drift on the order of single-digit tokens). Structural asymmetry: DPM exposes one nondeterministic call; stateful summarization exposes N compounding calls.
Scope & limitations:
- DPM applies to trajectory memory (events within a single decision); it does not replace corpus retrieval/indexing.
- Single-projection DPM requires the trajectory to fit a single model context window; hierarchical DPM for longer horizons is future work and reintroduces intermediate calls.
- Byte-exact replay still requires a deterministic inference runtime; DPM only minimizes the practical surface that must be made deterministic.
Practitioner output: TAMS, a task-property heuristic for choosing between stateless (DPM/RAG) and stateful memory architectures. Choose DPM/stateless when replay, audit, isolation, and scale are primary and trajectories fit a single projection; choose stateful when an agent must edit memory mid-trajectory or requires richer internal deliberation than a single projection can capture.
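The TAMS selection rule in the last key point can be read as a small decision function. The sketch below is one possible encoding under stated assumptions: the field names, the precedence given to the stateful conditions, and the "undetermined" fallback are illustrative choices made here, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    # Field names are illustrative, not taken from the paper.
    needs_replay_audit_isolation_scale: bool
    fits_single_projection: bool
    edits_memory_mid_trajectory: bool
    needs_deliberation_beyond_projection: bool


def select_architecture(task: TaskProfile) -> str:
    """Return 'stateless', 'stateful', or 'undetermined' for a task."""
    # Assumption: capabilities a single projection cannot provide take
    # precedence, since a stateless design cannot supply them at all.
    if task.edits_memory_mid_trajectory or task.needs_deliberation_beyond_projection:
        return "stateful"
    if task.needs_replay_audit_isolation_scale and task.fits_single_projection:
        return "stateless"
    # Neither rule fires, e.g. a trajectory too long for one projection.
    return "undetermined"


underwriting = TaskProfile(
    needs_replay_audit_isolation_scale=True,
    fits_single_projection=True,
    edits_memory_mid_trajectory=False,
    needs_deliberation_beyond_projection=False,
)
choice = select_architecture(underwriting)
```

The "undetermined" branch covers cases the summary leaves open, such as trajectories that exceed one context window, where hierarchical DPM is described only as future work.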

Data & Methods

Benchmark: LongHorizon-Bench (regulated decisioning domains: mortgage underwriting under ECOA/Reg B; insurance claims adjudication).
- 10 cases (5 loan, 5 claim), each ~26–28k characters, 82–96 events.
- Ground truth constructed by decision-first inversion so all required anchors are derivable.
Architectures compared:
- Summ-only (stateful incremental summarization; summary updated after each event).
- DPM (append-only log, single projection at decision time).
Memory budgets: tight 1,338 chars (20× compression), moderate 5,352 (5×), loose 13,381 (2×). Under Summ-only these were running-summary caps; under DPM they were target projection lengths.
Backend: claude-haiku-4-5-20251001 for agents and judges, temperature=0, fixed seed in call stack; judge calls for RCS/CRR used case-specific rubrics.
Statistics: paired permutation tests (10,000 resamples) paired by case; paired-bootstrap 95% CIs on mean deltas; Cohen’s h for effect sizes on proportion metrics. Four decision-alignment axes evaluated: FRP, RCS, EDA, CRR.
Determinism experiment: 10 replays per case at temperature zero; measured byte-level drift.
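The statistics named above (a paired permutation test and Cohen's h) can be sketched with the standard library alone. This is a generic sign-flip version of the paired test, assumed here because the summary does not specify the exact variant; the per-case scores are made-up placeholders, not the paper's data.

```python
import math
import random
from typing import Sequence


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def paired_permutation_p(
    a: Sequence[float],
    b: Sequence[float],
    resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on per-case differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(resamples):
        # Under the null, each case's difference is equally likely
        # to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (resamples + 1)


# Hypothetical per-case scores for n=10 cases (illustration only).
dpm_scores = [0.9, 0.8, 1.0, 0.7, 0.9, 0.8, 1.0, 0.9, 0.8, 0.7]
summ_scores = [0.4, 0.3, 0.5, 0.2, 0.3, 0.4, 0.5, 0.3, 0.2, 0.4]

p_value = paired_permutation_p(dpm_scores, summ_scores)
effect = cohens_h(
    sum(dpm_scores) / len(dpm_scores),
    sum(summ_scores) / len(summ_scores),
)
```

With n=10 there are only 2^10 = 1,024 distinct sign patterns, so the smallest two-sided p-value an exact sign-flip test can produce is about 0.002; the reported p=0.0014 suggests the authors' resampling scheme differs in some detail from this sketch.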

Implications for AI Economics

Compute & API cost reductions:
- One-shot projection vs N incremental consolidations reduces LLM calls per decision from O(N) to O(1), directly lowering API/compute spend. The measured 7–15× latency improvement suggests substantial per-decision resource savings and higher throughput on fixed compute budgets.
Operational & compliance cost reductions:
- Structural determinism and auditable rationales simplify regulatory investigations and internal audits (fewer artifacts to log, inspect, or reconstruct). This reduces legal/compliance risk and the engineering overhead of retrofitting audits onto stateful systems.
- Multi-tenant isolation by construction reduces privacy/leakage risk and lowers costs for data governance and tenant-scoping infrastructure.
Scaling economics:
- Statelessness enables elastic horizontal scaling without per-request node affinity or heavyweight shared state coordination. That lowers operational complexity and OPEX for large-scale enterprise deployments (fewer persistent caches, less stateful orchestration).
Product & market implications:
- Enterprises will likely favor memory architectures that trade expressive power for operational guarantees when regulated decisions are involved. This explains continued industry preference for RAG-like pipelines despite academic gains in stateful memory accuracy.
- Research on stateful memory must internalize systems constraints (determinism, auditability, tenancy, scale) to increase enterprise adoption — delivering decision-quality gains alone is not sufficient.
Trade-offs and investment choices:
- To obtain byte-exact replay, DPM still requires investment in a deterministic inference runtime (self-hosted weights, deterministic samplers). This is an upfront capex/engineering cost, but a more tractable one because only one projection call must be pinned.
- For tasks requiring in-trajectory memory editing/deliberation or for trajectories exceeding current model context windows, stateful architectures or hierarchical DPM variants may still be necessary; enterprises must evaluate the economic trade-off between increased model/API costs and the business value of additional agent capabilities.
Recommendation for decision-makers:
- When compliance, auditability, tenant isolation, and scalable throughput are primary drivers (typical in regulated verticals), adopt stateless projection designs (DPM/RAG) where feasible — this reduces operating costs and regulatory risk with little/no decision-quality penalty in many practical settings.
- Budget and context-window constraints matter: DPM is especially advantageous when memory budgets are tight (where it outperforms summarization).
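The one-versus-N pattern behind the cost and replay arguments above can be made concrete with a toy model. The call-count rule for summarization (one update per event plus one decision call) is an assumption chosen because it reproduces the reported 83–97 logged calls for 82–96 events; the per-call reproducibility `q` and the independence of calls are hypothetical simplifications, not measurements from the paper.

```python
def logged_llm_calls(num_events: int, architecture: str) -> int:
    """LLM calls logged per decision under each architecture."""
    if architecture == "dpm":
        # Reported in the source: two logged calls per decision.
        return 2
    if architecture == "summarization":
        # Assumption: one summary update per event plus one decision call.
        return num_events + 1
    raise ValueError(f"unknown architecture: {architecture}")


def exact_replay_probability(q: float, nondeterministic_calls: int) -> float:
    """Chance that every call reproduces byte-for-byte.

    Assumes each call reproduces independently with probability q,
    which is a simplification made for this sketch.
    """
    return q ** nondeterministic_calls


# Hypothetical per-call reproducibility of 99%.
q = 0.99
dpm_replay = exact_replay_probability(q, 1)    # one exposed call
summ_replay = exact_replay_probability(q, 82)  # N compounding calls
call_ratio = logged_llm_calls(82, "summarization") / logged_llm_calls(82, "dpm")
```

Under these assumptions a 99% per-call reproducibility leaves DPM at 0.99 while an 82-call summarization chain drops to roughly 0.44, which illustrates why pinning a single call is the cheaper route to exact replay.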

Summary takeaways for AI economists: stateless projection architectures materially change the cost-risk profile of deploying long-horizon decision agents in regulated settings by lowering per-decision compute, simplifying regulatory evidence assembly, improving isolation, and enabling simpler horizontal scaling. These operational benefits can explain enterprise adoption patterns and should be included in economic models of memory-augmented agent deployment and in evaluations of new memory research.

Assessment

Paper Type: other
Evidence Strength: medium. The paper reports large effect sizes and low p-values from paired tests and includes performance, speed, and determinism metrics, but evidence rests on a small set of 10 benchmark cases, a limited set of memory budgets, and results tied to particular LLM APIs/implementations so external validity is uncertain.
Methods Rigor: medium. The authors use sensible experimental controls (paired comparisons, permutation testing, deterministic replay at temperature zero) and measure multiple relevant outcomes (factual precision, reasoning coherence, latency, audit surface), but key methodological details that affect reproducibility/generalizability (exact models/APIs, prompt engineering, dataset construction, evaluator procedures, and robustness across many task types) are not fully reported or the sample is small.
Sample: Ten regulated decisioning cases drawn from a LongHorizon-Bench-style suite covering underwriting, claims adjudication, and tax examination; experiments run at three memory budgets including a binding 20x compression ratio; determinism study uses 10 replays per case at temperature=0; comparisons measure factual precision, reasoning coherence, LLM call counts, and latency.
Themes: governance, adoption, org_design, productivity, human_ai_collab
Identification: Controlled within-benchmark system comparison: the authors implement two memory architectures (Deterministic Projection Memory vs summarization-based memory) and evaluate them on 10 regulated decisioning cases across three memory budgets; statistical differences are assessed with paired permutation tests (n=10) reporting Cohen's h and p-values; a determinism study runs 10 replay attempts per case at temperature=0 to measure nondeterminism and audit-call counts.
Generalizability:
- Small number of benchmark cases (n=10) limits representativeness across regulated decision tasks
- Results may depend on specific LLM/API implementations, prompt engineering, and evaluation procedures not fully specified
- Memory-budget definitions (e.g., 20x compression) and hardware/latency characteristics may not map to all enterprise environments
- Regulatory and audit requirements vary across jurisdictions and industries, limiting direct transferability
- Benchmarks may not capture real-world data distribution shifts, adversarial inputs, or integration complexities in production systems

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures.	Adoption Rate	positive	medium	prevalence of retrieval-augmented pipelines in enterprise deployment	0.01
Regulated deployment imposes four load-bearing systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction.	Governance And Regulation	negative	high	compatibility of stateful architectures with regulatory/system properties	0.02
We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time.	Other	positive	high	architecture design (DPM specification)	0.02
On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds.	Output Quality	positive	high	relative performance (match/outperform) of DPM vs summarization-based memory across memory budgets	n=10 0.12
At a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) compared to summarization-based memory (paired permutation, n=10).	Output Quality	positive	high	factual precision	n=10 +0.52 (Cohen's h=1.17, p=0.0014) 0.12
At a 20x compression ratio, DPM improves reasoning coherence by +0.53 (Cohen's h=1.13, p=0.0034) compared to summarization-based memory (paired permutation, n=10).	Output Quality	positive	high	reasoning coherence	n=10 +0.53 (h=1.13, p=0.0034) 0.12
DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N.	Task Completion Time	positive	high	decision-time latency / number of LLM calls	7-15x faster; one LLM call at decision time instead of N 0.12
A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but DPM exposes one nondeterministic call while summarization exposes N compounding calls.	AI Safety And Ethics	mixed	high	system nondeterminism / number of nondeterministic LLM calls exposed per decision	n=10 DPM: one nondeterministic call; summarization: N compounding calls 0.12
The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench.	Governance And Regulation	positive	high	number of LLM calls logged per decision (audit surface)	DPM logs two LLM calls per decision; summarization logs 83-97 0.12
Statelessness is the load-bearing property explaining enterprises' preference for weaker but replayable retrieval pipelines, and DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.	Governance And Regulation	positive	high	trade-off between stateless architectures and decisioning performance / auditability	n=10 0.12