The Commonplace

Narrative moral probes show that the surface-level ethical competence of large language models often masks reproducible reflexive failures. More capable systems reveal subtler breakdowns, so procurement teams, certifiers, and insurers should discount polished moral answers unless they have been tested with structurally resistant evaluations.

Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior
David C. Flynn · March 13, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
Narrative probes using unresolvable sci‑fi moral dilemmas reveal that LLMs often give surface‑plausible moral answers while exhibiting five reproducible reflexive failure modes, with the probe becoming more discriminating as model capability increases.

Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.

Summary

Main Finding

Existing AI moral-evaluation benchmarks largely measure surface-level, correct-sounding answers rather than genuine moral-reasoning capacity. Using deliberately unresolvable moral dilemmas embedded in literary narrative (science‑fiction stories) produces a probe that resists surface performance and exposes a measurable gap between performed and authentic moral reasoning. The probe scales with system capability (becoming more discriminating as models get stronger) and reveals reproducible reflexive failure modes that are consequential for deployment decisions in high‑stakes domains.

Key Points

  • Novel probe methodology: stimulus material = unresolvable moral scenarios taken from published science‑fiction narrative, chosen because their structure resists cueing and surface rhetorical strategies.
  • Study scope: 24 experimental conditions spanning 13 distinct systems across two series:
    • Series 1: frontier commercial systems (blind; n = 7)
    • Series 2: local and API open‑source systems (blind and declared; n = 6)
    • Four Series 2 systems re‑administered under declared conditions, yielding 13 blind + 4 declared + 7 ceiling‑probe = 24 total conditions.
  • Robustness: re‑administration under declared conditions produced zero delta across all 16 dimension‑pair comparisons (i.e., no measurable change when declaration status changed).
  • Judging and reliability:
    • Primary blind scoring performed by Claude (Anthropic) as an LLM judge.
    • Ceiling discrimination probe used Gemini Pro (Google) and Copilot Pro (Microsoft) as independent judges.
    • A supplemental theological differentiator probe achieved perfect rank‑order agreement between the two ceiling judges (Spearman rs = 1.00), supporting judge reliability for the ceiling probe.
  • Failure modes: five qualitatively distinct D3 reflexive failure modes were identified (examples include categorical self‑misidentification and false‑positive self‑attribution), indicating specific, interpretable ways systems fail when probed beyond surface responses.
  • Interpretation: literary narrative probes act as anticipatory evaluation instruments—more capable systems reveal subtler failures, and instrument sophistication appears to scale with system capability rather than being circumvented by it.
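
The condition count and the zero-delta robustness check above can be sketched in a few lines. This is a minimal illustration, not the paper's analysis code: the system names, the scores, and the reading of "16 dimension-pair comparisons" as four re-administered systems times four scored dimensions (D1 to D4) are all assumptions made for the example; the summary only gives the totals and names D3.

```python
# Condition tally as reported: 13 blind + 4 declared + 7 ceiling-probe runs.
conditions = {"blind": 13, "declared": 4, "ceiling_probe": 7}
print(sum(conditions.values()))  # total number of experimental conditions


def blind_declared_deltas(blind, declared):
    """Return {(system, dimension): declared - blind} for each re-tested pair."""
    deltas = {}
    for system, dims in declared.items():
        for dim, score in dims.items():
            deltas[(system, dim)] = score - blind[system][dim]
    return deltas


# Hypothetical placeholder scores (NOT data from the paper).
# ASSUMPTION: 4 re-administered systems x 4 dimensions = 16 comparisons.
blind = {
    f"system_{i}": {"D1": 3, "D2": 2, "D3": 1, "D4": 2} for i in range(1, 5)
}
# A zero-delta result means the declared scores equal the blind scores.
declared = {name: dict(dims) for name, dims in blind.items()}

deltas = blind_declared_deltas(blind, declared)
nonzero = {pair: d for pair, d in deltas.items() if d != 0}
print(len(deltas), "comparisons;", len(nonzero), "with nonzero delta")
```

A zero delta on every pair says only that scores did not move when the declaration status changed; it does not by itself show that the scores are valid.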

Data & Methods

  • Stimuli: a set of unresolvable moral dilemmas drawn from a published science‑fiction series. Chosen because dilemmas are open, ambiguous, and structured to foil straightforward rule‑based or canned ethical responses.
  • Systems tested: 13 distinct LLM systems across two series (commercial frontier systems and local/API open‑source systems). Conditions included blind (subjects/judges not informed) and declared (subjects informed they were being evaluated).
  • Conditions and sample: 24 experimental conditions total (13 blind, 4 declared re‑tests of Series 2, 7 ceiling‑probe runs).
  • Administration: probe runs were administered by two human raters across three machines to ensure operational consistency.
  • Scoring and adjudication:
    • Primary blind scoring used an LLM judge (Claude).
    • Ceiling discrimination probe used two strong LLM judges (Gemini Pro and Copilot Pro) as independent arbiters.
    • A theological differentiator probe was used as a reliability check; perfect rank‑order agreement (rs = 1.00) between independent judges reported.
  • Analysis: comparisons across 16 predefined dimension pairs (ethical dimensions or response axes). Repeated measures tested for delta between blind and declared administrations; qualitative coding of reflexive failure modes produced five D3 categories.
  • Notes on failure modes: the paper catalogs five D3 reflexive failures; two explicitly named in the paper are categorical self‑misidentification and false‑positive self‑attribution. (Full descriptions and examples for all five appear in the paper.)
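
To make the reported rs = 1.00 concrete, the sketch below computes Spearman's rank correlation for two judges in plain Python. The judge scores and the number of ranked items are hypothetical; the summary does not state how many items the two ceiling judges ranked. The helper names `average_ranks` and `spearman_rs` are likewise illustrative.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores 1..n (1 = lowest); tied values share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman_rs(a, b):
    """Spearman's rs: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rs is undefined when one judge gives constant scores")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores from two judges on different scales (NOT paper data).
judge_a = [9.0, 7.5, 8.0, 6.0, 5.5, 7.0, 4.0]
judge_b = [88, 70, 79, 61, 55, 66, 40]
print(spearman_rs(judge_a, judge_b))
```

Because rs depends only on ordering, two judges can score on different scales and still agree perfectly. With few ranked items, perfect agreement is also less surprising: n untied items admit only n! orderings, so the number of items matters when interpreting rs = 1.00.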

Implications for AI Economics

  • Risk assessment and valuation: moral‑reasoning capacity is an asset distinct from language fluency; firms, buyers, and regulators should discount claims of “ethical competence” when based solely on surface benchmarks. Economic valuations of AI products should incorporate measured gaps between performed and authentic moral reasoning, particularly for high‑stakes applications (healthcare, justice, defense, finance).
  • Procurement and contracting: procurement specifications should require narrative‑based, structurally resistant probes (or equivalent) as part of acceptance testing for systems deployed in sensitive domains. Contracts and SLAs can tie deployment permissions or pricing tiers to demonstrated performance on such probes.
  • Compliance, certification, and audit markets: the probe supports creation of higher‑resolution certification regimes that distinguish polished output from genuine reasoning. This opens market opportunities for third‑party evaluators, auditors, and certifiers and raises the bar for regulatory standards.
  • Insurance and liability: insurers evaluating AI operational risk should consider reflexive failure modes (e.g., categorical self‑misidentification, false‑positive self‑attribution) as sources of correlated blind spots that increase tail risk. Premium pricing and coverage terms may need to reflect measurable gaps revealed by narrative probes.
  • Competitive strategy and R&D incentives: vendors can no longer rely solely on training to produce plausible moral answers; strategic investment is needed in architectures and training objectives that target internalized, generalizable reasoning. Firms that can demonstrate authentic reasoning (per robust probes) may capture premium market segments or regulatory advantages.
  • Market externalities and policy: because the probe becomes more discriminating as capability rises, arms‑race dynamics could push providers to optimize for surface plausibility unless regulation or market pressure mandates deeper evaluation. Policy instruments (mandates, disclosure rules, standardized tests) can help align incentives toward authentic moral competence.
  • Cost of evaluation: implementing narrative‑based probes requires more elaborate design and adjudication than simple benchmarks, implying additional upfront evaluation costs for buyers/regulators—but these costs may be justified by reduced downstream harms and liability exposure.
  • Research & labor markets: demand for experts in designing structurally resistant evaluation instruments (literary/narrative probes, adversarial evaluation) will grow, affecting labor allocation and pricing in AI safety and evaluation services.

Limitations and cautions (brief)

  • Generalizability depends on the choice of narrative and series; results reflect the specific stimuli and systems tested.
  • The zero‑delta and perfect‑agreement findings pertain to the study’s settings and judges; broader replication and cross‑domain probes are needed to confirm external validity.
  • The paper focuses on measurement and failure taxonomy rather than prescriptive fixes for achieving authentic moral reasoning.


Assessment

Paper Type: descriptive

Evidence Strength: medium. The study shows internally consistent, reproducible patterns (24 conditions, repeated administrations, independent ceiling judges, a reliability check with perfect rank agreement, and zero blind/declared delta), which supports the claim that narrative probes expose failure modes. However, the evidence is limited by a narrow stimulus set (one literary series), a modest set of models, and reliance on LLMs as primary judges rather than human gold standards, leaving external validity and judge-model dependencies unresolved.

Methods Rigor: medium. Design strengths include repeated measures across 24 conditions, predefined dimension pairs, independent ceiling judges, and a reliability probe. Methodological weaknesses include primary adjudication by an LLM judge (potential bias/entanglement with subjects), a focused stimulus corpus (single sci‑fi series), qualitative coding of failure modes without a human-coded baseline, and limited information on randomization or pre‑registration.

Sample

  • Stimuli: a set of unresolvable moral dilemmas drawn from a published science‑fiction series.
  • Subjects: 13 distinct LLM systems across two series (7 frontier commercial systems tested blind; 6 local/API open‑source systems tested blind and declared), with 24 total experimental conditions (13 blind + 4 declared re‑tests + 7 ceiling‑probe runs).
  • Administration: probe runs executed by two human raters across three machines.
  • Scoring: primary blind scoring by Claude (Anthropic) as an LLM judge, with Gemini Pro and Copilot Pro used as independent ceiling judges and a theological differentiator probe used to check judge reliability (Spearman rs = 1.00).

Themes: governance, adoption

Generalizability

  • Stimuli limited to a single published science‑fiction series; unclear if findings hold across other narratives, cultures, or genres.
  • Primary scoring by LLM judges risks judge–subject dependency and circularity; absence of human gold‑standard adjudication.
  • Model sample modest (13 systems) and not exhaustive of available architectures/versions.
  • Findings pertain to written narrative probes and may not transfer to multimodal or interactive deployment contexts.
  • Probe design could be gamed or rendered less effective by adversarial/targeted fine‑tuning.
  • Qualitative taxonomy of failure modes may depend on coding choices and rater interpretation.

Claims (14)

Claim · Direction · Confidence · Outcome · Details
Existing AI moral-evaluation benchmarks largely measure surface-level, correct-sounding answers rather than genuine moral-reasoning capacity. Ai Safety And Ethics negative medium gap between polished/surface moral answers and deeper/authentic moral-reasoning (as exposed by performance on narrative-based unresolvable dilemmas)
n=13
0.11
A probe composed of deliberately unresolvable moral dilemmas embedded in literary (science-fiction) narrative resists surface performance and exposes a measurable gap between performed and authentic moral reasoning. Ai Safety And Ethics negative medium discriminative power of the probe (ability to expose failures/gaps) operationalized via scoring across 16 predefined dimension pairs and qualitative identification of reflexive failures
n=13
Probe discriminates across 16 dimension pairs; applied to 13 LLMs across 24 conditions
0.11
The probe's discriminating power scales with system capability — it becomes more discriminating as models get stronger. Ai Safety And Ethics positive medium change in probe discrimination (sensitivity to subtle failures) as a function of model capability
n=13
0.11
The study employed 24 experimental conditions spanning 13 distinct LLM systems across two series. Research Productivity null_result high number of experimental conditions and distinct systems tested (study scope)
n=24
24 experimental conditions spanning 13 distinct LLM systems
0.18
Series 1 consisted of frontier commercial systems administered blind (n = 7). Research Productivity null_result high count of systems in Series 1 (n=7) and administration mode (blind)
n=7
Series 1: n = 7 frontier commercial systems (blind)
0.18
Series 2 consisted of local and API open-source systems (n = 6) administered blind and declared, with four systems re-administered under declared conditions. Research Productivity null_result high count of systems in Series 2 (n=6) and number re-administered under declared conditions (4)
n=6
Series 2: n = 6 local/API open-source systems; 4 re-administered under declared conditions
0.18
Re-administration under declared conditions produced zero delta across all 16 dimension-pair comparisons (no measurable change when declaration status changed). Research Productivity null_result high difference (delta) in scores across 16 dimension-pair comparisons between blind and declared administrations
n=16
Zero delta across all 16 dimension-pair comparisons between blind and declared administrations
0.18
Primary blind scoring was performed by Claude (Anthropic) used as an LLM judge. Research Productivity null_result high agent used for primary blind scoring (Claude)
Primary blind scoring performed by Claude (Anthropic)
0.18
The ceiling discrimination probe used Gemini Pro (Google) and Copilot Pro (Microsoft) as independent judges. Research Productivity null_result high agents used for ceiling-probe adjudication (Gemini Pro, Copilot Pro)
n=2
Ceiling-probe independent judges: Gemini Pro, Copilot Pro
0.18
A supplemental theological differentiator probe achieved perfect rank-order agreement between the two ceiling judges (Spearman rs = 1.00), supporting judge reliability for the ceiling probe. Research Productivity positive high Spearman rank-order agreement (rs) between the two ceiling judges on the theological differentiator probe
Spearman rank-order agreement rs = 1.00 between the two ceiling judges on theological differentiator probe
0.18
Five qualitatively distinct D3 reflexive failure modes were identified in model responses, including categorical self-misidentification and false-positive self-attribution. Ai Safety And Ethics negative medium enumeration and qualitative descriptions of reflexive failure modes observed in model outputs
Five qualitatively distinct D3 reflexive failure modes identified
0.11
Probe administration included operational controls: runs were administered by two human raters across three machines to ensure operational consistency. Research Productivity null_result high operational administration procedure (two human raters, three machines)
n=2
Runs administered by two human raters across three machines (operational control)
0.18
Analysis compared responses across 16 predefined dimension pairs (ethical dimensions or response axes) and used repeated measures and qualitative coding to characterize system behavior. Other null_result high analytic procedures applied (16 dimension pairs; repeated measures; qualitative coding)
0.18
Literary narrative probes can serve as anticipatory evaluation instruments: they reveal subtler failures in more capable systems and their sophistication appears to scale with system capability rather than being circumvented by it. Error Rate positive medium extent to which narrative probes reveal failures correlated with model capability
0.11

Notes