The Commonplace

A dual-process memory system lets LLMs sustain long-running scientific collaboration beyond full-context limits, maintaining 70–85% accuracy with 1–2s latency while using roughly 62% fewer tokens, and complements retrieval-augmented approaches by better handling numeric and temporal queries.

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents
Nikola Milosevic · May 17, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
A Dual Process Memory Architecture that separates short-term episodic context from consolidated long-term knowledge preserves high accuracy (70–85%) and low latency (1–2s) at scales where full-context LLMs fail, using roughly 62% fewer tokens and demonstrating complementary strengths to RAG.

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.

Summary

Main Finding

A Dual-Process (episodic + neocortical) memory architecture—with a fixed 10-message episodic buffer and an asynchronously consolidated natural-language “profile” (neocortical memory)—enables LLM-based scientific assistants to operate coherently across long-horizon research workflows (up to 15,000 messages) while using far fewer tokens and retaining high task accuracy. The system sustains 70–85% accuracy at scales where full-context approaches fail, reduces token usage by ~62% (45,434 vs. 120,000+), and supports sub-2s latency. Consolidation quality (not raw context capacity) is the primary scalability bottleneck; empirical growth of consolidated memory is ≈3 tokens per message in realistic workflows.
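The ~62% figure follows directly from the two token counts quoted above. As a quick check, taking 120,000 as the baseline (the source gives "120,000+", so the true reduction is at least this large):

```latex
\[
1 - \frac{45{,}434}{120{,}000} \approx 1 - 0.379 = 0.621 \approx 62\%
\]
```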

Key Points

  • Architecture

    • Episodic buffer: fixed sliding window W = 10 messages, raw uncompressed; guarantees constant-time recent state access and preserves discourse/coherence.
    • Neocortical memory: growing, consolidated natural-language profile updated asynchronously after every message by a specialized consolidation LLM (GPT-4o-mini in experiments).
    • Consolidation protocol: extract scientific facts, detect contradictions, merge with temporal precedence (recent wins), produce a compact, loss-minimized summary suitable for downstream LLM consumption.
  • Empirical findings

    • Large-scale evaluation over 15,000 messages with 1,440 queries and cross-model validation across six LLMs (OpenAI, Anthropic, Google).
    • Dual-Process maintained 70–85% overall accuracy at long horizons where full-context baselines collapsed.
    • Token efficiency: ~62% fewer tokens compared to full-context baselines (45,434 vs 120,000+).
    • Latency: 1–2 seconds for query responses in the tested setup.
    • Query-type trade-offs: Dual-Process excels on numeric/temporal/current-state queries (65–90% accuracy); RAG-style retrieval excels on historical/hall-of-fame retrieval (60–85% accuracy).
    • RAG failure mode: near-0% accuracy on recent-state queries due to cosine-similarity prioritizing semantic match over temporal recency.
    • Sim-to-Real gap: synthetic workflows may suggest constant memory footprint, but realistic scientific dialogues produce linear growth (~3 tokens/message); consolidation quality (lossiness, conflict resolution) limits scale more than raw context size.
    • Cognitive Event Horizon: observed empirically near 2,000 messages for full-context approaches in realistic scientific workflows.
  • Comparison to prior work

    • Differs from MemGPT: consolidation runs implicitly and automatically in the background, and the episodic window is replaced deterministically rather than through LLM-invoked paging.
    • Addresses limitations of MemoryBank and RAG by providing temporal-aware consolidation and a separation designed for evolving scientific state (parameter updates, contradictory findings, precise numeric facts).
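The architecture bullets above (fixed 10-message window, consolidated profile, "recent wins" merging) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_extract` is a hypothetical stand-in for the consolidation LLM that only parses `key=value` tokens, the `Fact` and `DualProcessMemory` names are assumptions, and consolidation runs synchronously here whereas the source describes it as asynchronous.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Fact:
    key: str        # e.g. "learning_rate"
    value: str      # e.g. "3e-4"
    msg_index: int  # index of the message that stated the fact


def toy_extract(message: str, msg_index: int) -> list:
    """Hypothetical stand-in for the consolidation LLM: parses 'key=value' tokens."""
    facts = []
    for token in message.split():
        if "=" in token:
            key, value = token.split("=", 1)
            facts.append(Fact(key.strip(",."), value.strip(",."), msg_index))
    return facts


class DualProcessMemory:
    def __init__(self, window: int = 10):
        # Episodic buffer: deque(maxlen=...) drops the oldest message automatically.
        self.episodic = deque(maxlen=window)
        # Consolidated profile: one entry per fact key.
        self.profile = {}
        self.seen = 0

    def add_message(self, message: str) -> None:
        self.episodic.append(message)
        for fact in toy_extract(message, self.seen):
            current = self.profile.get(fact.key)
            # Temporal precedence: the more recent statement overwrites the older one.
            if current is None or fact.msg_index >= current.msg_index:
                self.profile[fact.key] = fact
        self.seen += 1

    def build_context(self) -> str:
        """Prompt context = consolidated profile + raw recent messages."""
        lines = [f"{f.key}: {f.value} (as of message {f.msg_index})"
                 for f in self.profile.values()]
        return "PROFILE\n" + "\n".join(lines) + "\n\nRECENT\n" + "\n".join(self.episodic)


memory = DualProcessMemory(window=10)
for i in range(25):
    memory.add_message(f"step {i}: lr={i}")
print(len(memory.episodic))        # 10: the window stays fixed
print(memory.profile["lr"].value)  # 24: the latest value replaced earlier ones
```

The point of the split is that the prompt size depends on the profile and the window, not on the full message history.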

Data & Methods

  • Evaluation corpus

    • Realistic scientific workflow simulation: 15,000 messages representing months of research-style interaction (hypotheses, protocols, parameter updates, results).
    • 1,440 evaluation queries sampling different question types: recent-state (numeric/temporal), historical retrieval, multi-hop technical reasoning.
  • Cross-model validation

    • Experiments run across six LLMs drawn from three families (OpenAI, Anthropic, Google) to measure architecture-level behavior independent of a single model.
  • Architecture & consolidation

    • Episodic buffer: deterministic replacement, W = 10 messages.
    • Consolidation LLM: asynchronous extraction/merge protocol executed on each message; inputs: last W messages, current profile, newest exchange.
    • Consolidation outcome: appended/updated natural-language facts in the consolidated profile; empirical growth ≈3 tokens per message.
  • Metrics

    • Accuracy stratified by query type (numeric/temporal vs historical).
    • Token consumption (total tokens held in context), latency per query, and degradation thresholds (Cognitive Event Horizon).
    • Comparative baselines: full-context (extended window) and RAG-style retrieval.
  • Key quantitative results

    • Accuracy: Dual-Process 70–85% overall at large scale; Dual-Process 65–90% on numeric/temporal; RAG 60–85% on historical retrieval.
    • Token usage: 45,434 tokens vs 120,000+ for full-context baseline in the evaluated long-run scenario.
    • Latency: 1–2 seconds per query in the reported testbed.
    • Consolidated profile sizes: capable of managing profiles with 14,000+ scientific facts (~125k tokens in accumulated facts across archive) while keeping episodic buffer fixed.
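The recent-state weakness reported for the RAG-style baseline can be reproduced on a toy store. In the sketch below the two stored messages, their three-dimensional "embeddings", and the query vector are all hand-made assumptions chosen so that the older message is the closer semantic match. The recency-aware variant is an illustration of the general idea of adding temporal metadata, not a method from the paper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# (message index, text, hypothetical embedding)
store = [
    (120, "Set annealing temperature to 58 C", [0.9, 0.1, 0.0]),
    (9400, "Revised annealing temperature: 61 C after gel results", [0.7, 0.3, 0.2]),
]
# Hypothetical embedding of "What is the current annealing temperature?"
query_vec = [1.0, 0.0, 0.0]


def top_by_similarity(items, q):
    """Plain cosine-similarity retrieval: ignores when a fact was stated."""
    return max(items, key=lambda item: cosine(item[2], q))


def top_by_similarity_then_recency(items, q, min_sim=0.8):
    """Keep sufficiently similar items, then prefer the most recent one."""
    candidates = [item for item in items if cosine(item[2], q) >= min_sim]
    return max(candidates, key=lambda item: item[0])


print(top_by_similarity(store, query_vec)[1])               # the stale message 120 value
print(top_by_similarity_then_recency(store, query_vec)[1])  # the revised message 9400 value
```

Pure similarity ranking returns the superseded 58 C setting because its wording is closer to the query, which is the failure pattern described for recent-state queries.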

Implications for AI Economics

  • Cost efficiency and compute trade-offs

    • Token-cost reduction: reducing context token count by ~62% directly lowers per-query inference costs when compared with full-context strategies (fewer tokens fed to expensive LLMs; smaller attention matrices).
    • Latency and throughput: sub-2s latencies and constant-size episodic buffer improve responsiveness and allow more queries per unit time on the same hardware.
    • Consolidation cost vs re-processing costs: asynchronous consolidation introduces extra background LLM calls (an added operational cost), but these are amortized over many downstream queries and eliminate repeated re-processing of full histories on every query. This is likely cheaper than repeatedly running large context windows, especially since standard self-attention cost grows quadratically with context length.
  • Product and pricing strategies

    • Hybrid routing value: architectures should route queries by intent—route recent-state, numeric, or multi-hop planning queries to the Dual-Process augmented model; route historical or provenance queries to RAG stores. This routing maximizes accuracy per dollar.
    • Subscription or tier models: offer baseline Dual-Process assistants with periodic consolidation; premium tiers can include denser auditing (cross-model validation, human-in-the-loop reconciliation) and higher-frequency consolidation for mission-critical pipelines.
    • Infrastructure investment focus: spending on high-quality consolidation models/pipelines (better extract/merge logic, temporal metadata) yields outsized ROI because consolidation quality is the main scalability bottleneck, not raw context capacity.
  • Economic modeling and break-even considerations

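    • Amortized consolidation cost per message: compare the consolidation LLM call cost C_c per message against the cost C_r per query of expanding full context or reprocessing history; break-even occurs where cumulative consolidation cost (#messages × C_c) equals cumulative reprocessing cost (#queries × C_r), and Dual-Process is cheaper once #queries × C_r exceeds it.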
    • Memory growth rate (~3 tokens/msg) enables cost forecasting: expected long-run context size ≈ base_profile + 3 × total_messages; operators can forecast token storage and retrieval costs linearly.
    • Attention/compute scaling: architectures that rely on massive context windows face quadratic growth in attention compute as context lengthens; Dual-Process replaces that with consolidation cost that grows linearly in the number of messages and constant-time episodic access, improving long-run marginal cost.
  • Operational risks & mitigation (economic consequences)

    • Consolidation errors (incorrect merges or temporal precedence misapplied) can propagate costly scientific mistakes; investing in cross-model validation and reconciliation workflows (e.g., RAG + Dual-Process cross-checking, human audit for critical updates) is economically justified for high-stakes domains.
    • Sim-to-Real benchmarking risk: product planning based on synthetic benchmarks underestimates memory growth and consolidation needs. Realistic workload profiling should be required before launch to avoid under-provisioning and unanticipated operating costs.
  • Market and innovation impacts

    • Enables new product category: persistent scientific AI collaborators capable of multi-month project memory, increasing the value of AI in drug discovery, genomics, and lab automation. This may shift procurement from ad-hoc LLM calls to platform subscriptions that include persistent profiles and consolidation services.
    • Incentivizes specialization: investments in domain-specific consolidation LLMs (trained or prompt-tuned for scientific precision and contradiction resolution) will be commercially valuable.
    • Complementary market for verification tools: demand for validation, provenance, and temporal-aware retrieval services (RAG augmentation with temporal metadata) will increase.
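The linear growth forecast and the break-even comparison from the economic-modeling bullets can be written as two small functions. This is a sketch under stated assumptions: the function names and the cost units are hypothetical, break-even is taken as the point where cumulative consolidation cost equals cumulative reprocessing cost, and the default growth rate is the ~3 tokens/message reported in the source.

```python
def forecast_profile_tokens(total_messages, base_profile_tokens=0.0, growth_per_message=3.0):
    """Expected consolidated-profile size: base + growth rate x message count."""
    return base_profile_tokens + growth_per_message * total_messages


def break_even_queries(total_messages, consolidation_cost_per_message, reprocess_cost_per_query):
    """Number of queries at which (#messages x C_c) equals (#queries x C_r)."""
    return total_messages * consolidation_cost_per_message / reprocess_cost_per_query


# Forecast at the evaluated scale of 15,000 messages with an empty starting profile.
print(forecast_profile_tokens(15_000))  # 45000.0

# Hypothetical cost units: 1 per consolidation call, 50 per full-history query.
print(break_even_queries(15_000, 1.0, 50.0))  # 300.0
```

With these illustrative units, consolidation pays for itself once more than 300 queries are served over the 15,000-message history; real values of C_c and C_r would have to come from measured pricing and workloads.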

Summary recommendation for practitioners and economists

  • For builders: adopt the Dual-Process pattern (fixed small episodic window + asynchronous consolidation) and measure consolidation growth empirically on realistic workflows; invest first in consolidation quality (temporal metadata, contradiction resolution) and hybrid routing to RAG for archival retrieval.
  • For economic modeling: include amortized consolidation costs and projected 3 tokens/message growth in TCO; stress-test product economics using realistic conversation simulations (not synthetic toy dialogs) to avoid underestimating long-run compute/storage needs.
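Hybrid routing by query intent could start as simply as the sketch below. The keyword list, the `route` function, and the backend labels `"rag"` and `"dual_process"` are all hypothetical; the source recommends routing by intent but does not describe a classifier, and a production router would likely need something more robust than keyword matching.

```python
import re

# Hypothetical cues that a query asks about archived or superseded state.
HISTORICAL = re.compile(
    r"\b(originally|initially|history|previously|earlier|provenance|when did)\b",
    re.IGNORECASE,
)


def route(query: str) -> str:
    """Send historical/provenance queries to RAG, everything else to Dual-Process."""
    if HISTORICAL.search(query):
        return "rag"
    return "dual_process"


print(route("What is the current learning rate?"))              # dual_process
print(route("Which buffer did we originally use in phase 1?"))  # rag
```

Defaulting to the Dual-Process path reflects the reported trade-off: it is the stronger option for recent-state, numeric, and temporal questions, while RAG is reserved for explicitly historical lookups.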

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports a large-scale, cross-model experimental evaluation (15,000 messages, 1,440 queries, six LLMs from three families) with clear quantitative comparisons (accuracy, token use, latency). This provides substantive internal evidence that the Dual Process Memory Architecture reduces token usage and preserves accuracy at scales where full-context approaches fail. However, it does not establish causal effects on real-world productivity or economic outcomes, lacks detail on ground-truth labeling, statistical significance, and deployment heterogeneity, and relies in part on synthetic tests that differ from realistic workflows.

Methods Rigor: medium. The experimental design has strengths: multi-model cross-validation, large message counts, comparison to relevant baselines (full-context and RAG), and measurement of multiple performance dimensions (accuracy by query type, token growth, latency). But rigor is limited by missing methodological details in the description provided, e.g., sampling of queries/messages, how accuracy was measured and adjudicated, absence of statistical tests/confidence intervals, limited disclosure of hyperparameters/prompting and system/hardware setup, and potential tuning for selected models or tasks which may bias results.

Sample: Experimental dataset of ~15,000 messages drawn from simulated and 'realistic' scientific workflows; evaluation across six LLMs from three families (OpenAI, Anthropic, Google) with 1,440 labeled queries; synthetic probes plus realistic iterative analysis sessions showing consolidation growth ~3 tokens per message; stress tests including profiles with 14,000+ scientific facts (~125k tokens); performance metrics include accuracy by query type (numeric/temporal, historical retrieval), token usage (e.g., 45,434 vs >120,000), and end-to-end latency (1–2 seconds).

Themes: human_ai_collab, productivity

Generalizability

  • Limited model coverage: six models from three vendors may not represent the broader LLM ecosystem or future models with different architectures.
  • Domain specificity: experiments focus on scientific workflows and dense technical content; results may not generalize to conversational, legal, creative, or customer-support domains.
  • Synthetic vs real workflows: observed 'Sim-to-Real' gap indicates synthetic tests understate memory growth, so other real-world workflows may behave differently.
  • Deployment differences: hardware, inference infrastructure, prompt engineering, and caching strategies can materially affect latency and token costs.
  • Evaluation scope: accuracy and token metrics are proxies for productivity but do not measure downstream scientific output, user satisfaction, or economic impact.
  • Unclear labeling/benchmarking procedures: without details on ground truth and inter-annotator agreement, accuracy estimates may not transfer to other datasets.

Claims (9)

  • Claim: Context window saturation is a critical bottleneck as LLMs evolve into persistent scientific collaborators, because iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content.
    Other · Direction: negative · Confidence: high
    Outcome: context window saturation in scientific workflows
    Details: 0.03

  • Claim: Monolithic approaches suffer from quadratic cost scaling and cognitive degradation when used for long scientific workflows.
    Other · Direction: negative · Confidence: high
    Outcome: computational cost scaling and cognitive degradation of monolithic LLM approaches
    Details: 0.18

  • Claim: The Dual Process Memory Architecture decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message).
    Other · Direction: positive · Confidence: high
    Outcome: episodic window size; long-term memory growth rate (tokens/message)
    Details: n=15000 · constant 10-message window; growing at approximately 3 tokens/message · 0.18

  • Claim: We performed a large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries.
    Other · Direction: neutral · Confidence: high
    Outcome: evaluation sample size and cross-model coverage
    Details: n=15000 · 15,000 messages; six LLMs from three families; 1,440 queries · 0.3

  • Claim: Full-context models fail at 10,000 messages due to context overflow.
    Other · Direction: negative · Confidence: high
    Outcome: failure (ability to answer/maintain accuracy) at ~10,000 messages
    Details: n=1440 · 0.18

  • Claim: The Dual Process system maintains 70-85% accuracy with 1-2 second latency while using 62% fewer tokens (45,434 vs 120,000+ limit) compared to full-context approaches.
    Other · Direction: positive · Confidence: high
    Outcome: accuracy; latency (seconds); token usage
    Details: n=1440 · 70-85% accuracy; 1-2 second latency; 62% fewer tokens (45,434 vs 120,000+ limit) · 0.18

  • Claim: Cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85% accuracy).
    Output Quality · Direction: mixed · Confidence: high
    Outcome: accuracy on numeric/temporal queries; accuracy on historical retrieval queries
    Details: n=1440 · Dual Process: 65-90% accuracy; RAG: 60-85% accuracy · 0.18

  • Claim: There is a "Sim-to-Real" gap: synthetic tests maintain constant memory usage but realistic workflows exhibit linear memory growth of about 3 tokens per message, with consolidation quality emerging as the primary scalability bottleneck.
    Other · Direction: negative · Confidence: high
    Outcome: memory growth rate (tokens per message); identification of consolidation quality as bottleneck
    Details: n=15000 · about 3 tokens/message linear growth · 0.18

  • Claim: The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), enabling sustained operation beyond full-context limits.
    Other · Direction: positive · Confidence: high
    Outcome: number of scientific facts and token footprint the system can manage (profile capacity)
    Details: n=14000 · 14,000+ scientific facts (125k tokens) · 0.18
