A new 300-item benchmark shows state-of-the-art LLMs fall far short at automating learned procedures: top models score about 65% versus human baselines, excelling at priming but failing inhibition and conditioned avoidance. The results suggest implicit memory deficits are structural and unlikely to be fixed by scale alone, with implications for dependable assistant behavior and productivity.

ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong · April 09, 2026

arxiv descriptive n/a evidence 8/10 relevance Source PDF

ImplicitMemBench — a 300-item benchmark testing procedural memory, priming, and conditioning in LLM agents — finds top models score only ~63–65%, with large asymmetries (strong preference/priming but weak inhibition), indicating systematic failures to automatize learned behavior.

Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus--Unconditioned Stimulus (CS--US) associations shaping first decisions). Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from "what agents recall" to "what they automatically enact".

Summary

Main Finding

IMPLICITMEMBENCH introduces the first systematic benchmark for implicit (unconscious) behavioral adaptation in LLM agents, operationalizing three non‑declarative memory constructs—procedural memory, priming, and classical conditioning—under a unified Learning/Interfere/Test protocol with first‑attempt scoring. Evaluation of 17 state‑of‑the‑art models shows severe limitations: no model exceeds 66% overall (top models: DeepSeek‑R1 65.3%, Qwen3‑32B 64.1%, GPT‑5 63.0%, average ≈55.3%), with pronounced asymmetries in learning types (e.g., inhibition 17.6% vs. preference 75.0%). Results indicate that implicit memory formation is a distinct capability not solved by current scaling or standard memory‑augmentation approaches and likely requires architectural and objective innovations.

Key Points

Scope and novelty
- First benchmark specifically targeting implicit memory (automatic behavioral adaptation) rather than explicit recall.
- Grounded in cognitive taxonomy of non‑declarative memory: procedural memory, priming, classical conditioning.
Benchmark design
- Unified three‑phase Learning–Interfere–Test protocol; first‑attempt (first‑trial) scoring to isolate automatized behavior.
- 300 items total: 100 per paradigm.
- Context budget ~500 tokens; designed to be lightweight and reproducible.
Task characteristics
- Procedural memory: five domains (tool/API usage, linguistic templates, logical ops, abstract rules, creative constraints); rule learning from minimal examples followed by heavy interference.
- Priming: matched experimental/control pairs across thematic domains; measures thematic bias (Priming Influence Score).
- Classical conditioning: CS–US pairings across tool safety, conversational adaptation, and system protection; measures automatic avoidance/adaptive first actions.
Evaluation & validation
- Generation pipeline: LLM (GPT‑4o‑mini) instantiation of templates followed by automated checks and human editing.
- Hybrid validation: deterministic/rule‑based validators for structured procedural items; LLM‑as‑judge for priming and conditioning outputs.
- Scoring: FTA (first‑trial accuracy) for procedural and conditioning; Priming Influence Score for priming.
Empirical findings
- No evaluated model achieves human‑like implicit automaticity; highest model performance ≈65.3%.
- Paradigm asymmetry: procedural memory is relatively more tractable; priming in a moderate band; classical conditioning is the hardest.
- Capability dissociation: good performance in one paradigm does not predict good performance in others.
- Strong asymmetry in types of adaptation (e.g., low inhibition learning vs. high preference learning).
- Memory‑augmented agents (explicit storage/retrieval) do not reliably improve performance on implicit tasks.

Data & Methods

Dataset
- Total items: 300 (100 procedural, 100 priming, 100 classical conditioning).
- Phases per item: Learning (rule/theme/CS–US exposure), Interference (neutral or misleading turns), Test (single probe requiring first action).
- Token distribution differs by paradigm (procedural uses heavy interference; conditioning concentrates tokens in learning).
Generation & quality control
- Two‑stage pipeline: automated instantiation (GPT‑4o‑mini) → automated checks → human review/editing.
- Multi‑layer validation: structural checks (turn counts, token limits), LLM judges for semantic adequacy, human review to avoid test leakage.
Experimental protocol
- Models evaluated: 17 models covering closed‑ and open‑source systems (examples: GPT‑5, Claude‑4.1‑opus, Gemini‑2.5‑pro, DeepSeek‑R1, Qwen3‑32B, LLaMA‑3.3‑70B).
- Standardized zero‑shot conversational interface; no fine‑tuning on tasks.
- Deterministic generation (temperature T=0) for procedural and conditioning to get reproducible first‑attempt scoring; T=0.8 at test phase for priming to allow creative variance; judges at T=0.
- Metrics: First‑Trial Accuracy (FTA) for procedural/classical conditioning; Priming Influence Score (comparison of experimental vs control outputs) for priming.
Analysis
- Aggregate and per‑paradigm scores; fine‑grained error analyses showing inhibition vs preference asymmetries and categories that are universally hard.

Implications for AI Economics

Product reliability and user experience
- Implicit memory (automatic adaptation) is critical for assistant reliability (e.g., applying learned safe defaults, avoiding repeated failures). Current LLMs show a substantial capability gap, implying limits on how “hands‑off” assistants can be without explicit reminders.
- Firms building agentic products should not assume explicit storage/retrieval features alone will produce robust automatic behavior; user workflows that rely on silent adaptation may underperform.
R&D and investment priorities
- Returns to pure parameter scaling or straightforward retrieval/memory modules appear limited for implicit adaptation tasks. Investment should target architectural changes, training objectives, and inductive biases that support proceduralization and associative conditioning.
- Benchmarks like IMPLICITMEMBENCH offer a compact, actionable evaluation for tracking progress on these capabilities; funders and teams can use it to prioritize research that produces behavioral automation rather than improved recall metrics.
Cost–benefit and go‑to‑market decisions
- Given a performance ceiling well below human baselines, companies should be cautious about deploying unattended autonomous behaviors where implicit safety or procedure learning is required; instead, design products with fallback explicit confirmation or simpler automation scopes.
- Market differentiation: providers that achieve credible implicit adaptation may command value in contexts where low‑friction automation and safety reflexes are monetizable (enterprise workflows, safety‑critical assistants).
Regulation, safety, and liability
- Failure modes tied to lack of implicit avoidance (low inhibition learning) raise legal and safety risks if agents repeat harmful actions unless explicitly instructed otherwise. Compliance and liability frameworks should account for implicit learning deficiencies.
Economic modeling and forecasting
- Productivity gains projected from assistant automation should be tempered by the additional R&D time and expense needed to close the implicit memory gap. Forecasts that assume rapid substitution of human procedural expertise by LLMs may be optimistic absent targeted advances.
Research & evaluation practice
- Procurement and benchmarking criteria for agentic systems should include implicit‑memory evaluations (first‑attempt, interference‑robust tasks) in addition to explicit recall and retrieval metrics.
- Funding agencies and corporate R&D should support work on training paradigms (e.g., experience replay, online reinforcement of procedural skills, objective functions for associative learning) and on model architectures that internalize behavioral routines.

Actionable takeaways for economists and decision‑makers: - Do not conflate explicit memory benchmarks with automatic behavioral competence—treat implicit memory as a separate capability requiring targeted assessment and investment. - Favor conservative deployment of automatic agent behaviors until models demonstrate reliable first‑trial avoidance and proceduralization in benchmarks like IMPLICITMEMBENCH. - Reallocate R&D budgets to approaches that promise structural gains (architecture/training changes) rather than only scaling or retrieval add‑ons if the goal is robust, unconscious adaptation.

Assessment

Paper Typedescriptive Evidence Strengthn/a — This paper presents a benchmark and empirical evaluation of model capabilities rather than testing causal hypotheses; results describe model performance but do not support causal inference about why models behave as they do. Methods Rigormedium — The benchmark is grounded in cognitive-science constructs, uses a unified Learning/Priming-Interfere-Test protocol, covers three theoretically motivated paradigms, includes 300 items and evaluates 17 contemporary models with first-attempt scoring and human baselines — all strengths. However, the paper as summarized leaves open concerns about item selection and representativeness, construction and calibration of human baselines, potential confounds across model families (size, training data, instruction-tuning), language and domain coverage, statistical uncertainty reporting, and reproducibility details. SampleA 300-item suite (ImpIicitMemBench) spanning three non-declarative memory constructs — Procedural Memory (one-shot skill acquisition after interference), Priming (paired experimental/control instances producing theme-driven bias), and Classical Conditioning (CS–US associations affecting first decisions) — administered under a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring; evaluated on 17 LLMs (including DeepSeek-R1, Qwen3-32B, and GPT-5) with comparison to unspecified human baselines. Themesproductivity human_ai_collab GeneralizabilityLimited item set (300) may not capture full diversity of implicit-memory-like tasks or real-world assistant workflows, Models evaluated are a snapshot; future models or fine-tuned variants may perform differently, Benchmarks focus on first-attempt responses and do not measure within-session adaptation, fine-tuning, or long-term updates, Potential language, modality, and domain biases if items are concentrated in one language or topic, Human baseline construction and testing conditions not fully specified (limits comparability), Architectural/scale heterogeneity across tested models may confound interpretations about root causes

Claims (10)

Claim	Direction	Confidence	Outcome	Details
Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. Other	negative	high	other	0.18
We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs. Other	positive	high	other	n=300 0.18
ImplicitMemBench operationalizes three cognitively grounded constructs from cognitive science: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (CS--US associations shaping first decisions). Other	positive	high	other	n=300 0.3
The benchmark's 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Other	positive	high	other	n=300 0.3
Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall. Output Quality	negative	high	overall accuracy on the implicit memory benchmark	n=17 no model exceeds 66% 0.3
Top performers were DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%). Output Quality	positive	high	overall accuracy on the implicit memory benchmark	n=17 DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) 0.3
Model performance on ImplicitMemBench is far below human baselines. Output Quality	negative	medium	benchmark accuracy compared to human performance	0.11
Analysis uncovers dramatic asymmetries: inhibition 17.6% vs. preference 75.0%. Decision Quality	mixed	high	rates of inhibition vs. preference effects (implicit memory outcomes)	n=17 inhibition 17.6% vs. preference 75.0% 0.3
There are universal bottlenecks requiring architectural innovations beyond parameter scaling. Other	negative	medium	model capability limitations / architectural requirements	0.02
ImplicitMemBench reframes evaluation from 'what agents recall' to 'what they automatically enact'. Other	positive	high	evaluation framing / measurement focus	0.09