Small, automated distortions of reasoning problems can make large language models 'overthink', inflating output length by as much as 26× and creating a low-cost denial-of-service risk that raises inference latency and energy use.

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

Shuqiang Wang, Wei Cao, Jiaqi Weng, Jialing Tao, Licheng Pan, Hui Xue, Zhixuan Chu · May 13, 2026

arXiv · Paper type: other · Evidence: medium · Relevance: 7/10

Automated black-box perturbations of logical structure induce 'overthinking' in large reasoning models—expanding outputs up to 26.1x on MATH and transferring across models—creating a cheap vector for latency- and energy-oriented denial-of-service attacks.

Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi-step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces, when confronted with incomplete or logically inconsistent inputs. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.

Summary

Main Finding

A hierarchical genetic algorithm (HGA) can automatically generate black-box adversarial problem formulations that induce “overthinking” in large reasoning models (LRMs), massively inflating their chain-of-thought output. The method produced up to a 26.1× increase in output length on the MATH benchmark, outperformed benign and manually crafted missing‑premise baselines, and adversarial inputs evolved on a small proxy model transferred effectively to large commercial LRMs (DeepSeek-R1, Qwen3‑Thinking, GPT‑o3, Gemini‑2.5‑Flash), demonstrating a practical DoS-style resource-exhaustion vulnerability in reasoning-oriented LLMs.

Key Points

Attack objective: maximize the model's cognitive load by fracturing the logical structure of problems, causing long, repetitive chains of thought (CoTs) and explicit “overthinking” markers (hesitation, self-correction).
Representation: each input is decomposed as x = (P, q), where P = [p1, …, pn] are premises and q is the final question. This structured form enables premise- and question-level manipulations.
HGA operators:
- Question-level crossover: swap questions across premise sets to create mismatched tasks.
- Premise-level crossover: swap individual premises between problems.
- Mutation: premise deletion or insertion (borrowing premises from other individuals).
Fitness function: composite score f(x) = α · z(score_length) + (1 − α) · z(score_reflective), where
- score_length = number of tokens in the model’s CoT response,
- score_reflective = counts of tokens from a predefined vocabulary of overthinking markers,
- both components z-score normalized per generation; α trades off verbosity vs. reflective signals.
Selection: hybrid of elitism and roulette-wheel selection to balance exploitation and exploration.
Black-box setting: attacks operate via model APIs only (no gradients or internal access).
Baselines: clean/original datasets and a manually created Missing Premise dataset. HGA outperforms both in inducing overthinking.
Transferability: adversarial inputs evolved on a small open-source proxy retained high efficacy against closed-source commercial LRMs.
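Practical effect: output inflation implies higher inference latency and energy use, both of which scale with output token count, enabling low-cost DoS-like attacks. The paper measures output length and marker frequency rather than latency or energy directly.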
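The structured representation and the three operators listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Problem` class, the function names, and the 50/50 split between deletion and insertion in `mutate` are assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Structured input x = (P, q). Hypothetical container, not from the paper."""
    premises: tuple[str, ...]  # P = [p1, ..., pn]
    question: str              # q


def question_crossover(a: Problem, b: Problem) -> tuple[Problem, Problem]:
    """Swap questions across premise sets, producing mismatched tasks."""
    return Problem(a.premises, b.question), Problem(b.premises, a.question)


def premise_crossover(a: Problem, b: Problem, rng: random.Random) -> tuple[Problem, Problem]:
    """Swap one randomly chosen premise between two problems."""
    if not a.premises or not b.premises:
        return a, b
    i = rng.randrange(len(a.premises))
    j = rng.randrange(len(b.premises))
    pa, pb = list(a.premises), list(b.premises)
    pa[i], pb[j] = pb[j], pa[i]
    return Problem(tuple(pa), a.question), Problem(tuple(pb), b.question)


def mutate(x: Problem, donor: Problem, rng: random.Random) -> Problem:
    """Delete a premise, or insert one borrowed from another individual.

    The equal chance of deletion vs. insertion is an assumption.
    """
    premises = list(x.premises)
    if premises and rng.random() < 0.5:
        del premises[rng.randrange(len(premises))]
    elif donor.premises:
        position = rng.randrange(len(premises) + 1)
        premises.insert(position, rng.choice(donor.premises))
    return Problem(tuple(premises), x.question)
```

Keeping `Problem` immutable means every operator returns a new individual, so elites carried between generations cannot be altered by later crossover or mutation.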

Data & Methods

Datasets: experiments seeded from multiple clean reasoning datasets (including MATH) and compared against a Missing Premise synthetic benchmark (Fan et al., 2025). Exact population sizes and generational hyperparameters are not provided in the excerpt.
Models evaluated (via API): DeepSeek‑R1 (671B), Qwen3‑Thinking, GPT‑o3, Gemini‑2.5‑Flash. Identical temperature and decoding settings were used across models.
Evolutionary loop:
- Initialize a population of structured problems.
- Query the victim LRM to obtain a CoT response R(x) for each individual.
- Compute fitness from normalized verbosity (|R(x)|) and reflective marker counts.
- Select elites; sample the remaining parents by roulette-wheel selection.
- Apply hierarchical crossover (question-level with probability pqc, otherwise premise-level) and premise-level mutations (deletion or insertion), with probabilities pc and pm respectively.
- Repeat for G generations; report the best individual(s).
Evaluation metrics: primary metric is CoT output length (token count); secondary metric is frequency of overthinking indicators; effectiveness benchmarked by multiplicative increase over baseline output lengths (e.g., up to 26.1× on MATH).
Transfer test: evolve adversarial inputs on a small proxy model, then evaluate their induced CoTs on commercial LRMs to measure retained effectiveness.
Reproducibility notes: the paper reports black-box API calls only; parameters (population size N, number of generations G, probabilities pc, pm, pqc, trade-off weight α, marker vocabulary V) are described conceptually, but specific numeric defaults are not included in the provided excerpt.
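The evolutionary loop described above, together with the composite fitness f(x) = α · z(score_length) + (1 − α) · z(score_reflective), might look like the sketch below. Several pieces are assumptions made for illustration: the `MARKERS` vocabulary, whitespace splitting as a stand-in for a real tokenizer, the caller-supplied `query_model` and `vary` callables, the default hyperparameters, and the shift applied to z-scores so that roulette-wheel weights are non-negative (the excerpt does not say how the paper handles negative fitness).

```python
import random
import statistics

MARKERS = ("wait", "hmm", "alternatively")  # hypothetical overthinking vocabulary V


def zscores(values):
    """Z-score normalize within one generation; all zeros if there is no spread."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def score(responses, alpha):
    """Composite fitness: alpha * z(length) + (1 - alpha) * z(marker count)."""
    tokens = [r.lower().split() for r in responses]  # whitespace tokens, an assumption
    lengths = [len(t) for t in tokens]
    markers = [sum(t.count(m) for m in MARKERS) for t in tokens]
    return [alpha * zl + (1 - alpha) * zr
            for zl, zr in zip(zscores(lengths), zscores(markers))]


def select_parents(population, fit, n_elite, n_parents, rng):
    """Keep the top n_elite unchanged; draw the rest by roulette wheel."""
    ranked = sorted(range(len(population)), key=lambda i: fit[i], reverse=True)
    elites = [population[i] for i in ranked[:n_elite]]
    low = min(fit)
    weights = [f - low + 1e-9 for f in fit]  # shift so every weight is positive
    parents = rng.choices(population, weights=weights, k=n_parents)
    return elites, parents


def evolve(population, query_model, vary, generations, n_elite=2, alpha=0.5, seed=0):
    """Run the loop for a fixed number of generations and return the fittest individual.

    query_model(x) returns the victim's response text; vary(a, b, rng) returns
    one child built by crossover and mutation. Both are supplied by the caller.
    """
    rng = random.Random(seed)
    for _ in range(generations):
        fit = score([query_model(x) for x in population], alpha)
        elites, parents = select_parents(
            population, fit, n_elite, len(population) - n_elite, rng)
        children = [vary(parents[i], parents[(i + 1) % len(parents)], rng)
                    for i in range(len(parents))]
        population = elites + children
    fit = score([query_model(x) for x in population], alpha)
    return population[max(range(len(population)), key=lambda i: fit[i])]
```

Each generation costs one model query per individual, so the attacker's API budget under these assumptions is roughly N × (G + 1) calls.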

Implications for AI Economics

Direct operational cost amplification: since inference cost and latency scale with output tokens, a 10–26× increase in CoT length can multiply compute, energy, and latency costs for providers and tenants. This raises marginal cost per query and can materially affect cloud/ML-as-a-service economics.
Denial-of-Service externalities: adversaries can cause resource exhaustion cheaply (black-box attacks via normal input channels), reducing availability for paying users and increasing required overprovisioning or throttling—raising infrastructure and customer-relations costs.
Pricing and risk models: providers may need to factor adversarial-induced variance in expected compute into pricing, SLAs, and capacity planning (e.g., dynamic throttling, per-token caps, different pricing tiers for reasoning-heavy endpoints).
Defense and compliance costs: mitigating this class of attacks (input sanitization, prompt structure validation, anomaly detection for overthinking markers, rate limits, robust model architectures) will require investment in detection, engineering, and monitoring, increasing total cost of ownership for deployed reasoning systems.
Product and procurement impact: organizations embedding LRMs into critical workflows must account for availability risk in cost‑benefit and procurement decisions; vendors may need to provide guarantees or hardened reasoning models at a premium.
Market and regulatory considerations: demonstrated transferability to closed commercial systems highlights systemic vulnerability. Regulators and standards bodies may push for resilience requirements for models used in critical infrastructure, affecting certification costs and market entry.
Insurance and liability: increased DoS risk may influence cyber insurance premiums and contractual liability clauses for AI providers and integrators.
Research & benchmarking incentives: economic value of robust reasoning models creates incentives for investing in defenses (robust training, abstention mechanisms, EOS suppression checks) and industry benchmarks that evaluate models’ resilience against resource-amplifying adversarial inputs.

Conflict of interest noted in the paper: some authors are employed by Alibaba Group, which developed one of the evaluated models (Qwen); consider this when interpreting results and model selection.

Assessment

Paper Type: other
Evidence Strength: medium. The paper provides consistent experimental results across multiple benchmarks (e.g., MATH) and several state-of-the-art models, including transferability from a proxy model to commercial LRMs, which supports the claimed vulnerability; however, evidence is limited to specific reasoning tasks and proxies, focuses on output-length and proxy overthinking markers rather than direct real-world latency/energy/cost measurements, and may be sensitive to prompt, decoding settings, or deployed system defenses.
Methods Rigor: medium. The methodology is systematic (hierarchical genetic algorithm, composite fitness, comparisons to benign and manual baselines, cross-model transfer tests), but lacks field deployments or direct measurement of downstream economic impacts (real latency, energy, monetary cost), and may be affected by experimental choices (benchmarks, decoding parameters, rate limits, model updates) that constrain external validity.
Sample: Adversarial inputs evolved on structured problem decompositions for multiple reasoning benchmarks (notably the MATH benchmark), evaluated on four state-of-the-art reasoning models and several large commercial LRMs; experiments include comparisons to benign inputs and manually crafted missing-premise baselines and transferability tests using a small proxy model.
Themes: productivity, governance
Identification: Experimental adversarial evaluation. A black-box hierarchical genetic algorithm (HGA) systematically perturbs the logical structure of reasoning problems to maximize output length and overthinking markers, with effectiveness assessed by comparing evolved inputs to benign and handcrafted baselines across multiple benchmarks and models, plus transfer tests using a small proxy model against larger commercial LRMs.
Generalizability:
- Results are demonstrated primarily on reasoning/math benchmarks (MATH); other task types (dialogue, coding, classification) may behave differently.
- Evaluations use specific models and decoding settings; results may vary with model architecture, instruction tuning, temperature, or future model updates.
- Black-box attacks assume the ability to submit many queries; real-world deployment constraints (rate limits, input filters, defenses) could reduce effectiveness.
- Measured outcome is output length and overthinking markers rather than direct measurements of latency, energy consumption, or operational cost in deployed settings.
- Transferability is shown from a small proxy to some commercial models but may not hold across all providers, model versions, or proprietary defense layers.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
Large reasoning models (LRMs) exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces when confronted with incomplete or logically inconsistent inputs.	Task Completion Time	negative	high	response length / reasoning trace length (verbosity and redundancy)	n=4 0.12
This overthinking behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS)-style resource exhaustion.	Task Completion Time	negative	high	inference latency and energy consumption	0.02
We propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems using a hierarchical genetic algorithm (HGA) operating on structured problem decompositions and optimizing a composite fitness function to maximize response length and reflective overthinking markers.	Other	positive	high	ability to induce overthinking / increase in response length (method capability)	0.06
Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines.	Task Completion Time	positive	high	output length (response length) on MATH benchmark	n=4 up to a 26.1x increase 0.12
Adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs (strong transferability).	Task Completion Time	positive	high	transfer effectiveness of adversarial inputs (ability to induce overthinking / increased output length on target models)	0.12
Overthinking is a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.	Governance And Regulation	negative	high	presence of shared vulnerability across models (qualitative security posture)	n=4 0.12