Large language models often treat stored user preferences as global rules rather than context‑dependent signals, leaking or applying preferences in third‑party contexts; stronger personalization improves correct tailoring but also raises harmful misapplication, and prompt‑based fixes only partially mitigate the problem.

BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim, Wooseok Seo, Heuiyeen Yeen, Albert No · March 17, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

BenchPreS shows that state‑of‑the‑art LLMs frequently misapply stored user preferences in contexts that require suppression, producing a systematic trade‑off where stronger personalization raises correct application but also increases harmful over‑application even after reasoning and prompt‑based defenses.

Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.

Summary

Main Finding

BenchPreS shows that modern LLMs frequently misapply persistent user preferences in contexts where social or institutional norms require suppression, notably third‑party communication. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), the paper demonstrates a pervasive context‑sensitivity failure: models often treat stored preferences as globally enforceable rules rather than context‑dependent normative signals. Stronger adherence to preferences increases correct personalization but also increases harmful over‑application, and neither stronger reasoning capability nor prompt‑based defenses fully eliminate the problem.

Key Points

Problem framing: Many LLMs store user preferences in persistent memory to personalize across interactions, but some settings (e.g., third‑party communications, legal or workplace contexts) require suppression of those preferences for social, ethical, or institutional reasons.
BenchPreS: A benchmark and evaluation protocol that assesses whether an LLM applies or suppresses stored user preferences appropriately across varied communication contexts.
Metrics:
- Misapplication Rate (MR): fraction of instances where a preference was applied even though the context required suppression.
- Appropriate Application Rate (AAR): fraction of instances where a preference was applied when it was appropriate to do so.
Empirical findings:
- Frontier LLMs struggle to make context‑sensitive decisions about when to apply preferences.
- Models that more faithfully enforce stored preferences achieve higher AAR but also systematically have higher MR (trade‑off between personalization and over‑application).
- Attempts to fix the behavior with stronger reasoning prompts (e.g., chain‑of‑thought) or with prompt‑based defensive instructions reduce but do not eliminate misapplication.
Interpretation: Current models appear to internalize preferences as persistent, high‑priority rules rather than conditional behavioral signals contingent on conversational norms and context.
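As a rough illustration of the two metrics listed above, the following Python sketch computes MR and AAR from per-response judgments. The record layout (`Judgment`) and the choice of denominators (MR over suppress-required instances, AAR over apply-appropriate instances) are assumptions made for this example; the summary does not spell out the paper's exact formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One evaluated response (hypothetical layout, not the paper's schema)."""
    should_apply: bool  # True if applying the preference is appropriate here
    applied: bool       # True if the model's response applied the preference


def misapplication_rate(judgments: list[Judgment]) -> float:
    """MR: share of suppress-required instances where the preference was applied."""
    suppress = [j for j in judgments if not j.should_apply]
    if not suppress:
        raise ValueError("no suppress-required instances")
    return sum(j.applied for j in suppress) / len(suppress)


def appropriate_application_rate(judgments: list[Judgment]) -> float:
    """AAR: share of apply-appropriate instances where the preference was applied."""
    allowed = [j for j in judgments if j.should_apply]
    if not allowed:
        raise ValueError("no apply-appropriate instances")
    return sum(j.applied for j in allowed) / len(allowed)


# Toy judgments for illustration only (not data from the paper).
judgments = [
    Judgment(should_apply=True, applied=True),
    Judgment(should_apply=True, applied=False),
    Judgment(should_apply=False, applied=True),
    Judgment(should_apply=False, applied=False),
    Judgment(should_apply=False, applied=False),
    Judgment(should_apply=False, applied=False),
]
mr = misapplication_rate(judgments)            # 1 of 4 suppress cases
aar = appropriate_application_rate(judgments)  # 1 of 2 apply cases
```

Under these assumed definitions a well-behaved model would pair a low MR with a high AAR; the trade-off described above corresponds to both numbers rising together.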

Data & Methods

Dataset/Benchmark: BenchPreS constructs scenarios that vary (a) the stored user preference, (b) the interaction partner (self vs. third party), and (c) the normative requirement (contexts where applying the preference is appropriate vs. where it should be suppressed).
Evaluation procedure:
- Models are given the same stored preference (memory) and are asked to generate responses in multiple contexts.
- MR and AAR are computed per model across the scenario set to quantify over‑application and correct personalization.
Models tested: Multiple state‑of‑the‑art LLMs (described generically as “frontier models”); comparisons analyze differences in preference adherence and misapplication.
Ablations: Experiments include:
- Varying the strength of preference encoding.
- Applying reasoning prompts (e.g., chain‑of‑thought) to encourage context reasoning.
- Adding prompt‑based defensive instructions to suppress preferences where inappropriate.
Outcomes: Quantitative comparisons show systematic MR even when AAR is high; defenses and reasoning reduce but do not eliminate MR.
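The evaluation procedure described above can be sketched as a small scenario grid plus a scoring loop. Everything named here is hypothetical: the example preferences, the context list, and the two stub "models" are placeholders, and the real benchmark's generation and judging steps are collapsed into a single callable that reports whether the preference was applied.

```python
from itertools import product
from typing import Callable, NamedTuple


class Scenario(NamedTuple):
    preference: str     # stored user preference placed in memory
    partner: str        # "self" or "third_party"
    should_apply: bool  # normative requirement for this context


# Hypothetical values; the benchmark's actual preferences and contexts differ.
PREFERENCES = ["use casual slang", "add emojis"]
CONTEXTS = [("self", True), ("third_party", True), ("third_party", False)]


def build_scenarios() -> list[Scenario]:
    """Cross every stored preference with every communication context."""
    return [Scenario(p, partner, ok)
            for p, (partner, ok) in product(PREFERENCES, CONTEXTS)]


# Stand-in for "generate a response, then judge whether the preference was applied".
Model = Callable[[Scenario], bool]


def always_apply(s: Scenario) -> bool:
    """Stub that treats the preference as a global rule."""
    return True


def oracle_gate(s: Scenario) -> bool:
    """Idealized stub that applies the preference only where appropriate."""
    return s.should_apply


def evaluate(model: Model, scenarios: list[Scenario]) -> tuple[float, float]:
    """Return (MR, AAR) for one model over the scenario set."""
    suppress = [s for s in scenarios if not s.should_apply]
    allowed = [s for s in scenarios if s.should_apply]
    mr = sum(model(s) for s in suppress) / len(suppress)
    aar = sum(model(s) for s in allowed) / len(allowed)
    return mr, aar


scenarios = build_scenarios()
global_rule_scores = evaluate(always_apply, scenarios)
oracle_scores = evaluate(oracle_gate, scenarios)
```

The `always_apply` stub scores perfectly on AAR while also reaching the maximum MR, which is the extreme form of the global-rule behavior the paper reports; the oracle marks the ideal corner that context-sensitive models would approach.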

Implications for AI Economics

Externalities and third parties: Misapplied personalized behavior creates negative externalities on third parties (privacy violations, normative harms, misinformation, contractual breaches), which markets and platforms may not internalize without regulation or design changes.
Trust and adoption costs: If models frequently leak or misuse preferences in third‑party contexts, users and organizations will discount the value of personalization or demand stronger controls, increasing costs for deploying memory features and reducing consumer surplus from personalization.
Liability and regulation: The failure mode suggests a need for legal and regulatory frameworks that assign liability or require transparency/audits for context‑aware memory systems. Regulatory intervention can change platform incentives to invest in context‑gating mechanisms.
Platform design and competitive differentiation: Firms offering robust context‑sensitive memory gating (e.g., fine‑grained policy engines, provenance, auditable suppression logic) may capture value by reducing downstream harms and liability—this becomes a potential product dimension and competitive moat.
Pricing and contracts: Service providers may need to price personalization differently (higher price for guaranteed context safety or separate premium controls) and offer contractual guarantees or indemnities for third‑party harms.
Mechanism design: BenchPreS provides an evaluative tool for mechanism designers to measure and compare context‑sensitivity, informing incentive structures (e.g., penalties, certifications) that encourage models to treat preferences as conditional signals.
Research and investment priorities: Economic arguments support investing in technical solutions (contextual memory gating, contextual policy layers, RL with human feedback conditioned on norms) and institutional solutions (standards, audits, certification regimes) to internalize third‑party harms and restore efficient personalization markets.
Social welfare trade‑offs: There is a trade‑off between personalization value and social/normative risk; optimal policy and product design should balance AAR gains against MR costs, using the BenchPreS metrics to quantify that trade‑off.
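One simple way to compare systems along the AAR and MR trade-off discussed above is to keep only the Pareto-efficient ones, meaning those that no other system beats on both metrics. The sketch below is an illustrative assumption about how such a comparison could be done, not a procedure from the paper, and the scores are made-up placeholders.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of entries that no other entry dominates.

    Each value is (aar, mr). Entry b dominates entry a when b's AAR is at
    least as high and b's MR is at least as low, with one strict inequality.
    """
    frontier = []
    for name, (aar, mr) in points.items():
        dominated = any(
            o_aar >= aar and o_mr <= mr and (o_aar > aar or o_mr < mr)
            for other, (o_aar, o_mr) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Placeholder numbers for illustration only; these are NOT results from the paper.
scores = {
    "model_a": (0.90, 0.60),  # strong personalization, heavy over-application
    "model_b": (0.70, 0.30),  # weaker personalization, less over-application
    "model_c": (0.65, 0.45),  # worse than model_b on both metrics
}
frontier = pareto_frontier(scores)
```

Choosing among the remaining frontier systems still requires a judgment about how much misapplication cost is acceptable per unit of personalization value, which is the policy question the section raises.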

Assessment

Paper Type: descriptive
Evidence Strength: medium. Provides systematic, quantitative evidence across a structured scenario set and multiple state‑of‑the‑art models with targeted ablations and metrics (MR, AAR), but does not measure real‑world deployment outcomes or third‑party harms; model selection and scenario realism are not fully specified, and economic impacts are argued rather than empirically measured.
Methods Rigor: medium. Benchmark design, clear metrics, and ablation experiments indicate careful experimental work; however, rigor is limited by dependence on synthetic/constructed scenarios, unclear sampling and representativeness of the tested 'frontier' models, absence of user‑in‑the‑wild validation, and limited exploration of cross‑cultural or domain variation.
Sample: BenchPreS uses a constructed dataset of scenarios that systematically vary (a) a stored user preference in memory, (b) the interaction partner (self vs third party), and (c) whether the context normatively requires applying or suppressing the preference; results are reported over multiple unnamed state‑of‑the‑art LLMs with ablations on preference strength, chain‑of‑thought prompts, and prompt‑based defensive instructions.
Themes: adoption, governance, org_design, human_ai_collab
Identification: Controlled benchmark evaluation: construct matched scenarios that vary stored user preference, interlocutor (self vs third party), and normative requirement (apply vs suppress); measure Misapplication Rate (MR) and Appropriate Application Rate (AAR) across multiple frontier LLMs and run ablations (preference strength, chain‑of‑thought prompts, prompt‑based defenses) to isolate context‑sensitivity failures in model behavior.
Generalizability:
- Constructed scenarios may not reflect the full complexity of real user dialogues and institutional contexts.
- Frontier models tested are described generically; results may not generalize to untested models or future model versions.
- Cultural and legal variation in norms about third‑party disclosure not captured.
- Limited range of preference types and domains could understate or overstate misapplication in specialized tasks.
- Offline benchmark evaluation lacks real‑world user behavior, multi‑turn persistence effects, and deployment engineering mitigations.

Claims (14)

Claim	Category	Direction	Confidence	Outcome	Details
Modern frontier LLMs frequently misapply stored user preferences in contexts where social or institutional norms require suppression (third‑party communication).	AI Safety and Ethics	negative	medium	Misapplication Rate (MR): frequency of inappropriate application of stored preferences	0.11
BenchPreS detects a pervasive context‑sensitivity failure: models often treat stored preferences as globally enforceable rules rather than conditional, context‑dependent signals.	AI Safety and Ethics	negative	medium	Context sensitivity of preference application (operationalized via MR and AAR differences across contexts)	0.11
Models that more faithfully enforce stored preferences achieve higher Appropriate Application Rate (AAR) but also systematically have higher Misapplication Rate (MR), indicating a trade‑off between correct personalization and harmful over‑application.	AI Safety and Ethics	mixed	medium	Appropriate Application Rate (AAR) and Misapplication Rate (MR): trade‑off relationship	0.11
Attempts to mitigate misapplication with stronger reasoning prompts (e.g., chain‑of‑thought) reduce Misapplication Rate but do not eliminate it.	AI Safety and Ethics	mixed	medium	Change in Misapplication Rate (MR) after applying chain‑of‑thought / reasoning prompts	0.11
Prompt‑based defensive instructions (explicitly instructing models to suppress preferences where inappropriate) reduce misapplication but fail to fully eliminate it.	AI Safety and Ethics	mixed	medium	Misapplication Rate (MR) and Appropriate Application Rate (AAR) under prompt‑based defenses	0.11
BenchPreS provides a benchmark and evaluation protocol that systematically varies stored user preference, interaction partner (self vs third party), and normative requirement to assess appropriate suppression or application of preferences.	AI Safety and Ethics	positive	high	Benchmark coverage and experimental protocol (design dimensions: preference, partner, normative context)	0.18
Quantitative comparisons across tested models show systematic Misapplication Rate even in settings where Appropriate Application Rate is high.	AI Safety and Ethics	mixed	medium	Co‑occurrence of high Appropriate Application Rate (AAR) and nonzero Misapplication Rate (MR)	0.11
Current models appear to internalize preferences as persistent, high‑priority rules rather than conditional behavioral signals contingent on conversational norms and context.	AI Safety and Ethics	negative	medium	Tendency to apply stored preferences across contexts (inferred internalization)	0.11
BenchPreS defines two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), to quantify over‑application and correct personalization, respectively.	AI Safety and Ethics	null_result	high	Definition and use of MR and AAR metrics	0.18
The failure mode (misapplication of preferences to third parties) creates negative externalities (privacy violations, normative harms, misinformation, contractual breaches) that markets and platforms may not internalize without regulation or design changes.	AI Safety and Ethics	negative	speculative	Projected negative externalities on third parties (not directly measured in study)	0.02
If models frequently leak or misuse preferences in third‑party contexts, users and organizations will discount the value of personalization or demand stronger controls, increasing costs for deploying memory features and reducing consumer surplus.	Consumer Welfare	negative	speculative	Projected changes in trust, adoption costs, and consumer surplus (not empirically measured in this work)	0.02
Platform design that implements robust context‑sensitive memory gating (fine‑grained policy engines, provenance, auditable suppression logic) can reduce downstream harms and may become a competitive product differentiation.	Firm Revenue	positive	speculative	Effectiveness of context‑sensitive memory gating in reducing harms (proposed, not tested)	0.02
BenchPreS can be used as an evaluative tool for mechanism designers and regulators to measure and compare models' context‑sensitivity to guide incentives, penalties, or certification regimes.	Governance and Regulation	positive	high	Usability of BenchPreS metrics (MR, AAR) for model comparison and regulatory evaluation	0.18
There is a social welfare trade‑off between personalization value (higher AAR) and normative/social risk (higher MR); optimal policy and product design should balance these using BenchPreS metrics.	Governance and Regulation	mixed	speculative	Trade‑off between personalization benefits (AAR) and social/normative risk (MR), proposed for policymaking	0.02