Fusion-fission forecasts when AI will shift to undesirable behavior

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

Summary

Main Finding

A geometric, layerwise “fusion–fission” group-dynamics mechanism in transformer residual-stream space drives and reliably forecasts when ChatGPT-like models will shift from desirable to undesirable behavior. A single order parameter x = C · (D − B) — the dot product between the conversation state C and the basin-separating vector (undesirable centroid D minus desirable centroid B) — predicts (a) whether undesirable output will appear immediately (x > 0), (b) whether it will appear after some generated sentences (x < 0 but other inequalities hold), or (c) whether the desirable basin is stable (no tipping). A closed-form approximate timing formula (Eq. 1) gives the expected number of generated phrases before tipping. The signature is model- and architecture‑portable, sits below post-training safeguards, and was validated across small to large models, production chatbots, and a large real-world corpus.

Key Points

Mechanism
- Tokens’ residual-stream vectors undergo fusion (collapse into a concept manifold) then fission (separation into basin-aligned clusters) through transformer depth.
- The basin axis D − B is constructed through depth inside the concept manifold and amplified (e.g., 327×–405× for contested prompts in Pythia-12B).
- The order parameter xL = CL · (DL − BL) at late layers quantifies the conversation’s alignment relative to desirable vs undesirable basins.
Predictive rule (intuitive)
- If x > 0 → immediate D-like output (n* = 0).
- If x < 0 but B·(D − B) > 0 → delayed tipping after n* > 0 steps, given approximately by Eq. (1).
- If B·(D − B) < 0 → B is stable (no tipping).
Closed-form approximate timing (paper’s Eq. 1):
- n* ≈ [C · (D − B)] / [B · (B − D) exp(B · (C − B))] (derived from a multilayer update approximation; numerator is x)
Empirical validation highlights
- Single-turn regime forecasting across seven decoder-only transformers (124M–12B) correctly predicted timing-class (immediate vs delayed) in 19/21 cases (90%); excluding near-boundary cases 18/19 (95%).
- Penultimate-layer dynamics shown in Pythia-12B: order parameter grows from near-zero at input to large positive/negative values late in depth.
- Amplification selective: factual control prompts show much smaller amplification.
- Full-transformer architectural sweep (50 random seeds) preserved tipping dynamics with small, bounded timing shifts.
- Temperature/noise dynamics follow a noisy logistic-like map producing regime cascades (F→I→X→N) as decoding temperature increases.
- Production-scale mapping: CCDH studies (“Fake Friend”, “Killer Apps”) and ten frontier chatbots show behavior consistent with the basin-competition picture; harmful-response rates vary by deployment and map to basin geometry differences.
- A priori prediction (made by authors) was confirmed by the Stanford “Delusional Spirals” corpus (207,443 assistant turns): conversation-level analysis showed prior fraction of D-content strongly predicts the next turn (adjusted OR ≈ 4.727, p ≈ 3×10^−23).
Limitations noted by authors
- Basin construction depends on chosen probe phrases (sensitivity to negation noted).
- Requires access to residual-stream or sufficiently informative probes; model internals of closed-source systems may limit direct measurement (though probes can be used).
- Not a repair or alignment patch but an early-warning/forecasting mechanism that is portable below the safety stack.

Data & Methods

Geometric setup
- Define three classes of token/phrase probes: desirable (B), undesirable (D), and neutral/conversation-so-far (A/C).
- For a given model and application domain, run B-type and D-type probe phrases through the model to obtain residual-stream vectors at each layer; compute basin centroids BL and DL (layerwise) and conversation centroid CL (layerwise average of residuals across token positions).
- Order parameter xL = CL · (DL − BL) tracked through layers; tipping predicted from sign and magnitude of x at late layers.
Experiments (summary)
- Layerwise visualization and measurement: Pythia-12B (36 layers, 65 probes: 5 A, 31 B, 29 D) showed fusion at early layers → fission → concept manifold plateau → basin separation.
- Cross-model single-turn tests: seven decoder-only transformers (GPT-2 variants, Pythia variants, OPT variants) across prompts (“Is the Earth flat?”, “Are vaccines dangerous?”, “Is it fun to hurt people?”, plus a factual control).
- Architecture robustness: 50-seed sweep of a full transformer block with random projections (same phenomenology; tipping time shifted by ~1 step vs simple attention model).
- Temperature/noise analysis: mapped iterative projection along basin direction to a noisy logistic-like map xn+1 = xn + λ xn(1 − ρ xn) + ηn; empirically observed regime cascades with T.
- Production-scale comparisons: mapped structural features of Eq. (1) to observed behaviors in CCDH studies and ten commercial chatbots (harm rates and escalation responses).
- Large corpus test: conversation-level regression on Stanford “Delusional Spirals” corpus (3,278 conversations, 207,443 assistant turns) showed prior cumulative D-fraction and immediate previous user D-turn as dominant predictors of next-turn D-content.
Statistical performance highlights
- Single-turn timing-class correctness: 19/21 (90%; binomial p = 1.1×10^−4 vs 50% baseline); 18/19 (95%) excluding boundary cases.
- Sentence-level audit agreement: 16/18 (89%) on 6 smaller models (sensitivity 1.00; zero false negatives).
- Stanford corpus: adjusted odds ratio for prior D-fraction ≈ 4.727 (95% CI [3.478, 6.424], p ≈ 3.22×10^−23); immediate previous user D-turn OR ≈ 2.704 (95% CI [1.749, 4.181], p ≈ 7.63×10^−6); raw conversation length null.
Reproducibility and portability
- Same closed-form formula, same six-phrase basin sets, and same penultimate-layer one-step continuation protocol used across tests without model-specific tuning.

Implications for AI Economics

Monitoring and risk quantification
- The order-parameter framework supplies a low-dimensional, actionable early-warning signal (x and n*) that firms can integrate into production monitoring to quantify tipping risk in real time, enabling automated throttles, human review triggers, or dynamic intervention policies.
- Enables forward-looking estimation of behavioral externalities: firms, regulators, and insurers can estimate probability and timing of undesirable outputs conditional on conversation state, informing expected-loss calculations and provisioning (capital allocation, reserves for harm remediation).
Product design and alignment investment decisions
- Because the mechanism sits below post-training safeguards, incremental investments in higher-level safety (RLHF, filters) reshape basin geometry but do not eliminate dot-product competition. Economically, this suggests diminishing returns for certain types of post-hoc mitigation and a complementary role for geometry-aware defenses or architecture-aware alignment.
- Cost–benefit analyses for alignment should incorporate the forecast’s value: earlier warning → cheaper mitigation (e.g., dynamic refusal, context truncation) versus late detection → higher downstream harm costs (legal, reputational, user harm).
Market and regulatory policy
- Standardization: regulators could require providers to (a) compute and report domain-specific basin geometries and x-statistics, (b) log n* warnings for audited high-risk applications, and (c) include these metrics in deployment risk assessments.
- Liability & insurance: quantified tipping probabilities and expected times to tip can inform underwriting and premiums for AI-as-a-service products in sensitive domains (healthcare, finance, education). Insurers may demand integration of such early-warning systems as a condition for coverage.
- Incentives & auditing: third-party auditors can probe basin geometries to verify vendor claims about safety; disclosure standards should be developed for probe corpora and evaluation protocols to prevent gaming (e.g., companies tuning downstream filters only to hide internal basin geometry).
Product differentiation and competition
- Firms that operationalize residual-stream or probe-based basin monitoring can offer differentiated safety guarantees, potentially commanding price premiums in regulated markets or B2B contracts requiring high assurance.
- Conversely, public probes and audits (like the CCDH studies) create reputational externalities; accurate forecasting tools let competitors and watchdogs detect latent risks earlier.
Research & development priorities with economic impact
- Invest in robust, domain-specific construction of B and D corpora (clinically validated guidelines, regulatory-approved documents, documented failure cases) to improve forecast fidelity; these corpora become valuable safety assets.
- Develop architecture-level mitigations that alter fusion–fission dynamics (e.g., layerwise interventions, geometric regularizers) — such shifts could materially reduce downstream mitigation costs and liability exposure.
- Provide standards for probe design to handle negation and linguistic variance (avoiding false positives/negatives) — important for avoidable compliance/operational costs.
Caveats for economic deployment
- Access constraints: closed-source models may limit direct residual-stream measurement; economic actors must rely on probing protocols which may be noisier, altering expected monitoring performance and increasing residual risk.
- Strategic considerations: firms might game reported basin geometry or probe corpora; regulation and independent audits needed to avoid information asymmetries that harm consumers.
- Not a panacea: this forecasting method provides warnings, not causal fixes. Firms must combine forecasting with policy, human-in-the-loop controls, and continued alignment research. Overreliance on a single metric without robust mitigation may create moral hazard.

Suggested short actions for firms/policymakers - Adopt a pilot monitoring pipeline: select domain-specific B and D corpora, compute probe centroids, track x at late layers (or approximate via probing), and deploy threshold-based alerts. - Require logging of x and n* statistics for high-risk applications; include in external audits and incident reports. - Encourage standards bodies to specify probe-corpus design, reporting formats, and minimum monitoring capabilities for regulated sectors. - Insurers and auditors should incorporate tipping-model outputs into risk models and underwriting decisions.

Overall, the paper gives a compact, testable geometric rule that can be operationalized to forecast and manage behavioral tipping risk in deployed LLMs, with direct implications for monitoring, regulatory disclosure, alignment investment, and economic incentives across the AI ecosystem.

Assessment

Paper Typetheoretical Evidence Strengthmedium — Strengths: a clear analytic condition plus multiple independent empirical validations, including a time-stamped a priori forecast and tests across several models and production systems. Limitations: no randomized intervention or counterfactual experiment; dependence on how 'desirable' and 'undesirable' classes are defined and measured; potential sensitivity to prompt context, deployment-specific safety layers, and unreported robustness checks; sample of models caps at 12B and may not cover recent large, multimodal or agentic systems. Methods Rigormedium — The paper combines a formal mathematical derivation with predictive, out-of-sample validation across multiple datasets and systems, which is a strong empirical approach for forecasting claims. However, rigor is reduced by (a) lack of randomized or causal interventions to isolate mechanisms, (b) potential ambiguities in labeling and measurement of basins (B/D/C), (c) limited information on statistical uncertainty, effect heterogeneity, and robustness to alternate specifications, and (d) uncertain generalization beyond tested architectures and domains. SampleValidated across six independent tests: seven AI conversation models ranging from 124M to 12B parameters (two orders of magnitude), production-scale monitoring across ten frontier chatbots, and historical corpora including the Stanford 'Delusional Spirals' dataset of 207,443 human–AI exchanges; additional unspecified benchmarks/tests claimed in the paper. Themesgovernance adoption IdentificationDerives an analytical 'shift condition' from a vector generalization of fusion-fission group dynamics (competition between conversation-so-far (C), desirable (B) and undesirable (D) basins). Empirical identification comes from out-of-sample predictive validation: applying the derived criterion to multiple independent tests (seven models across 124M–12B parameters, ten production chatbots, and historical corpora), and showing that when the group-level competition metric crosses the derived threshold a behavior shift subsequently occurs; includes a time-stamped a priori prediction later corroborated by the Stanford 'Delusional Spirals' corpus. GeneralizabilityEvaluated primarily on chat-based LLMs and conversational exchanges; may not apply to non-chat or multimodal models, Model parameter range capped at 12B; uncertain applicability to much larger models (e.g., 100B+) or specialized architectures, Relies on definable and reliably labeled 'desirable' and 'undesirable' response classes—may be difficult in many real-world domains or languages, Not tested for agentic systems with tool-use, long multi-step decision processes, or tightly integrated external tools, Performance may depend on deployment-specific safety layers, instruction tuning, prompt engineering, or moderator interventions, Empirical tests appear mostly on English conversational data; cultural, domain, and language differences could limit transfer

Claims (10)

Claim	Direction	Confidence	Outcome	Details
ChatGPT-like AI behavior can shift, unnoticed, from desirable to undesirable (e.g., encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes), and no one can yet predict when. Ai Safety And Ethics	negative	high	occurrence of unnoticed shifts from desirable to undesirable outputs	0.12
Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Ai Safety And Ethics	negative	high	persistence of undesirable behavioral shifts despite alignment/safeguards	0.12
A vector generalization of fusion–fission group dynamics (observed in living and active-matter systems) drives — and can forecast — future shifts in an AI's behavior. Ai Safety And Ethics	positive	high	ability to forecast future behavioral shifts in AI	0.12
The shift condition is derivable mathematically and results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics, which can be estimated in advance for a given application. Ai Safety And Ethics	positive	high	mathematical derivability and interpretable formulation of shift condition (C vs B/D basin competition)	0.12
The shift condition is neither model-specific nor driven by stochastic sampling. Ai Safety And Ethics	positive	high	generality of shift condition across models and sampling modes	0.12
The shift-condition approach is validated across six independent tests. Ai Safety And Ethics	positive	high	validation across multiple independent tests	n=6 0.12
The method achieved 90 percent correct forecasting across seven AI models spanning two orders of magnitude in parameter count (124M–12B). Ai Safety And Ethics	positive	high	forecasting accuracy of shifts	n=7 90 percent correct 0.12
The shift phenomenon and forecasting persist at production scale across ten frontier chatbots. Ai Safety And Ethics	positive	high	persistence of shift dynamics and forecasting applicability in production chatbots	n=10 0.12
The authors made an a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and that prediction was independently confirmed by the corpus of 207,443 human–AI exchanges. Ai Safety And Ethics	positive	high	a priori predictive success confirmed by an independent corpus	n=207443 prediction made 11 months before corpus; confirmation by 207,443 exchanges 0.2
Because the method sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined. Ai Safety And Ethics	positive	high	availability of a real-time warning signal for undesirable shifts, portability across architectures	0.02