Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, and in forecasts of synthetic SIR epidemics with a matched linear control; it replicates on real-world datasets covering COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.

Summary

Main Finding

More-capable LLMs can produce worse distributional (probabilistic) forecasts when the true time series exhibits superlinear growth followed by regime change. The failure is concentrated in the upper tail: larger or more post-trained models overcommit to aggressive extrapolations of growth, inflating their upper quantiles. When growth reverts, the elevated upper tail yields large CRPS losses. This inverse scaling appears across simulated and real-world series (epidemics, housing bubbles, hyperinflation) and is invisible to common single-threshold metrics (e.g., Brier at a fixed cutoff).

Key Points

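Inverse-scaling pattern: At long horizons, model capability (measured by ECI) correlates negatively with CRPS-based forecast quality on tasks with superlinear growth and regime-change risk. The ρ values below are sign-adjusted, so a negative ρ means higher-ECI models score worse.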
- ForecastBench-Sim (FBSim) pooled at horizon 7: ρ = −0.42 (N=28).
- Synthetic SIR series: ρ = −0.62 (N=27).
- The pattern also replicates on COVID-19 incidence, housing, and hyperinflation (e.g., housing ρ = −0.67, N=30).
- Pre-vaccine measles pooled: positive scaling at short horizons → inverse scaling at longer horizons (e.g., ρ ≈ −0.42 at 12–20 weeks).
Mechanism isolated: the combination of superlinear (exponential) growth + tail risk of regime change. A linear-growth control with identical crash structure shows positive scaling instead.
Failure localization: per-quantile pinball decomposition shows p90 (upper tail) drives the inversion; p10 remains relatively stable.
Role of scale and post-training: a controlled Llama-3.1 2×2 (70B / 405B × base / instruct) shows both larger scale and RLHF-style post-training independently and jointly amplify the upper-tail overcommitment (interaction p < .0001).
Domain cues: simply naming the domain or adding a minimal uncertainty cue has inconsistent effects — sometimes attenuates (e.g., COVID-19), sometimes not (hyperinflation).
Metric sensitivity: single-threshold scoring (Brier) can hide the problem and even reverse the sign of the capability–performance relationship on the same forecast outputs. CRPS (integrated over thresholds) reveals upper-tail costs.
Practical recommendation from authors: evaluations for LLM forecasting should report continuous, unbounded distributional scoring rules (e.g., CRPS) alongside threshold metrics.
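The metric-sensitivity point can be illustrated with a minimal Python sketch. The two forecasts, the threshold, and the outcome below are made-up numbers, and clamping the CDF outside the elicited quantiles is an assumption of this sketch rather than anything stated in the source. Both forecasts imply the same probability at the threshold, so their Brier scores match, while the pinball loss at p90 separates them.

```python
LEVELS = (0.10, 0.25, 0.50, 0.75, 0.90)


def pinball(q, y, tau):
    """Pinball (quantile) loss of quantile forecast q at level tau for outcome y."""
    return tau * (y - q) if y >= q else (1.0 - tau) * (q - y)


def cdf_at(quantiles, x):
    """Piecewise-linear CDF through the elicited quantiles.

    Assumption: outside the elicited range the CDF is clamped to 0.10 / 0.90.
    """
    if x <= quantiles[0]:
        return LEVELS[0]
    if x >= quantiles[-1]:
        return LEVELS[-1]
    points = list(zip(quantiles, LEVELS))
    for (q_lo, p_lo), (q_hi, p_hi) in zip(points, points[1:]):
        if q_lo <= x <= q_hi:
            if q_hi == q_lo:
                return p_hi
            return p_lo + (p_hi - p_lo) * (x - q_lo) / (q_hi - q_lo)


def brier(quantiles, threshold, y):
    """Brier score for the binary question 'will the value exceed threshold?'."""
    p_exceed = 1.0 - cdf_at(quantiles, threshold)
    outcome = 1.0 if y > threshold else 0.0
    return (p_exceed - outcome) ** 2


# Hypothetical forecasts: same lower half, very different upper tails.
modest = (80, 90, 100, 110, 120)
inflated = (80, 90, 100, 150, 300)
threshold = 100   # hypothetical binary cutoff
y = 70            # hypothetical outcome after a regime change

brier_modest = brier(modest, threshold, y)
brier_inflated = brier(inflated, threshold, y)
p90_modest = pinball(modest[-1], y, 0.90)
p90_inflated = pinball(inflated[-1], y, 0.90)
p10_modest = pinball(modest[0], y, 0.10)
p10_inflated = pinball(inflated[0], y, 0.10)
```

In this toy case the threshold score cannot tell the two forecasts apart, and the p10 loss is also identical; only the upper-quantile loss grows with the inflated tail.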

Data & Methods

Capability measure: Epoch Capabilities Index (ECI) aggregated over standard LLM benchmarks (range in study ~114–155).
Forecast elicitation: models asked for five quantiles (p10, p25, p50, p75, p90); CRPS computed from piecewise-linear CDF; pinball loss used for per-quantile decomposition; Brier used for binary/threshold questions.
Benchmarks / datasets:
- ForecastBench-Sim (FBSim): contamination-free, procedurally generated rollouts from the FreeCiv game; paired binary and continuous questions across six templates (treasury, territory, population, city count, technology count, overall score).
- Synthetic mechanism: 50 SIR epidemic series (exponential rise then intervention-driven decline) and a matched linear-growth control with same crash jumps.
- Real-world replication:
  - COVID-19 daily incidence (Our World in Data; 60 countries).
  - S&P/Case-Shiller monthly home prices for 19 US metros (history through Dec 2005).
  - Monthly CPI for 12 hyperinflation episodes across 10 countries.
  - Measles (Project Tycho), pre-vaccine US seasons 1928–1962: N=1,339 state-seasons after quality filters.
  - Influenza (modern ILINet and historical) used as a negative control (no inversion observed).
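As a rough illustration of the synthetic mechanism listed above (exponential rise, intervention-driven decline, and a linear control with the same crash), the following Python sketch generates one such pair of series. It is not the paper's generator: every parameter value, the discrete-time update, and the rule used to match the linear control are assumptions made for this example.

```python
def sir_incidence(days=120, population=1_000_000, beta=0.25, gamma=0.10,
                  intervention_day=60, beta_after=0.05, initial_infected=10):
    """Daily new infections from a discrete-time SIR model.

    All default values are hypothetical. The transmission rate drops from
    `beta` to `beta_after` on `intervention_day`, which produces the crash.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    series = []
    for day in range(days):
        rate = beta if day < intervention_day else beta_after
        new_infections = rate * susceptible * infected / population
        recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - recoveries
        series.append(new_infections)
    return series


def linear_control(series):
    """Linear ramp to the same peak, followed by the same post-peak values.

    Assumed matching rule: only the growth phase is replaced, so the crash
    structure is identical to the SIR series.
    """
    peak_day = series.index(max(series))
    if peak_day == 0:
        return list(series)
    start, peak = series[0], series[peak_day]
    ramp = [start + (peak - start) * d / peak_day for d in range(peak_day)]
    return ramp + series[peak_day:]


epidemic = sir_incidence()
control = linear_control(epidemic)
```

Holding the crash fixed and changing only the growth phase is what lets a comparison of this kind attribute a difference in scaling behavior to superlinear growth rather than to the crash itself.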
Models tested: ~27–30 modern LLMs across providers and families; within-family controlled tests on Llama-3.1 (70B vs 405B, base vs instruct).
Statistical reporting: Spearman ρ between ECI and scoring-rule means (sign-adjusted so positive = better performance with higher ECI); bootstrap CIs and Wilcoxon signed-rank tests for within-family contrasts.
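The scoring described under forecast elicitation can be sketched in Python as follows. The source states only that CRPS is computed from a piecewise-linear CDF through the five quantiles; how the CDF is closed beyond p10 and p90 is not specified, so the linear tail extension here (continuing the slope of the adjacent segment until the CDF reaches 0 or 1) and the numerical integration are assumptions of this sketch.

```python
from bisect import bisect_right

LEVELS = (0.10, 0.25, 0.50, 0.75, 0.90)


def pinball_loss(q, y, tau):
    """Pinball loss for a single quantile forecast q at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def cdf_knots(quantiles):
    """Knots (xs, ps) of a piecewise-linear CDF running from 0 to 1.

    Assumption: each tail continues with the slope of the adjacent segment.
    Quantiles must be non-decreasing.
    """
    q = list(quantiles)
    lo = q[0] - (q[1] - q[0]) * (LEVELS[0] / (LEVELS[1] - LEVELS[0]))
    hi = q[-1] + (q[-1] - q[-2]) * ((1.0 - LEVELS[-1]) / (LEVELS[-1] - LEVELS[-2]))
    return [lo] + q + [hi], [0.0] + list(LEVELS) + [1.0]


def cdf(xs, ps, x):
    """Evaluate the piecewise-linear CDF at x."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    i = bisect_right(xs, x) - 1
    return ps[i] + (ps[i + 1] - ps[i]) * (x - xs[i]) / (xs[i + 1] - xs[i])


def crps(quantiles, y, steps=20000):
    """CRPS = integral of (F(x) - 1[x >= y])^2 dx, by the midpoint rule."""
    xs, ps = cdf_knots(quantiles)
    a, b = min(xs[0], y), max(xs[-1], y)
    if b == a:
        return 0.0
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * h
        indicator = 1.0 if x >= y else 0.0
        total += (cdf(xs, ps, x) - indicator) ** 2
    return total * h


# Hypothetical forecasts scored against a hypothetical post-crash outcome.
crps_modest = crps((80, 90, 100, 110, 120), 70)
crps_inflated = crps((80, 90, 100, 150, 300), 70)
```

Because CRPS integrates over all thresholds and is unbounded, stretching the upper quantiles raises the score whenever the outcome lands low, which is the behavior the per-quantile pinball decomposition then localizes to p90.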

Implications for AI Economics

Model selection hazards: Using capability scores (or choosing larger / more-instructed LLMs) as a proxy for forecasting quality can be misleading for tail-sensitive economic/financial tasks. More-capable models may overcommit to explosive growth scenarios and understate tail-downside probabilities once regimes change.
Risk of mispriced tail risk: In finance and macro policy, overconfident upper-tail forecasts can cause systematic overestimation of upside (e.g., asset-price continuations, demand surges) or underappreciation of downside scenarios, leading to poor policy, investment, or risk-management decisions.
Evaluation practice: Economic and financial forecasting evaluations of LLMs should
- Elicit full predictive distributions (not just point estimates or binary thresholds),
- Report tail-inclusive continuous proper scoring rules (CRPS, pinball losses) and per-quantile calibration,
- Inspect upper-tail calibration (p90, p95, etc.) explicitly, and
- Stress-test models on superlinear-growth + regime-change scenarios.
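One simple way to inspect per-quantile calibration, as the list above suggests, is to compare each nominal level with the fraction of outcomes that fall at or below the corresponding quantile. The Python sketch below uses made-up forecasts and outcomes, and the function name is hypothetical.

```python
LEVELS = (0.10, 0.25, 0.50, 0.75, 0.90)


def empirical_coverage(forecasts, outcomes):
    """Fraction of outcomes at or below each elicited quantile.

    `forecasts` is a sequence of five-quantile tuples (ordered as LEVELS),
    `outcomes` the matching realized values. For a calibrated forecaster
    the fraction at each level is close to the level itself.
    """
    if len(forecasts) != len(outcomes) or not outcomes:
        raise ValueError("need equally many forecasts and outcomes (at least one)")
    counts = [0] * len(LEVELS)
    for quantiles, y in zip(forecasts, outcomes):
        for j, q in enumerate(quantiles):
            if y <= q:
                counts[j] += 1
    n = len(outcomes)
    return {level: count / n for level, count in zip(LEVELS, counts)}


# Hypothetical data: the same forecast issued four times.
forecasts = [(1, 2, 3, 4, 5)] * 4
outcomes = [0.5, 2.5, 3.5, 6.0]
coverage = empirical_coverage(forecasts, outcomes)
```

Coverage at p90 well above 0.90 across many questions would be consistent with an upper quantile set too high. Coverage says nothing about how far the quantile overshoots, so it complements the pinball and CRPS scores rather than replacing them.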
Deployment precautions: For high-stakes or tail-sensitive tasks, consider:
- Complementing or replacing large/instructed LLMs with models explicitly trained for calibrated probabilistic forecasting,
- Using ensembles or Bayesian approaches that temper aggressive extrapolation priors,
- Applying conservative decision rules that inflate upper-tail uncertainty,
- Incorporating domain-specific intervention models or mechanistic priors when regime change is plausible.
Research & policy directions:
- Investigate training and fine-tuning methods that reduce upper-tail overcommitment (data augmentation with regime-change examples, calibration objectives emphasizing tails),
- Study internal representations and why post-training / instruction increases overcommitment,
- Establish benchmark standards for LLM forecasting in economics that require tail-aware metrics before model approval for operational use.
Broader note: Improvements in average or thresholded accuracy do not guarantee better outcomes for tail-sensitive economic decisions. Regulators, financial institutions, and policy teams should demand distributional calibration evidence, not just headline capability scores.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper demonstrates a consistent inverse-scaling pattern across a contamination-free simulation and multiple real-world datasets, and uses within-family variation to strengthen the link between model capability and tail miscalibration; however, it is observational with potential confounds (prompting, decoding/temperature choices, model family coverage, dataset selection), and lacks randomized or structural identification that would fully rule out alternative explanations.
Methods Rigor: high. Uses a principled, contamination-free simulated benchmark (FBSim), a matched linear control, per-quantile decomposition to reveal distributional shifts, replication across several real-world series, and within-family analysis of scale and post-training. Together these are strong methodological choices; remaining concerns are about evaluation sensitivity to prompting/decoding and the representativeness of the chosen model families and datasets.
Sample: The benchmark suite includes ForecastBench-Sim (simulated tasks with superlinear growth and regime-change tail risk), synthetic SIR epidemics with a matched linear control, and real-world series: COVID-19 case/death data, measles incidence, housing market series, and historical hyperinflation episodes. Models evaluated include multiple LLMs across capability levels and a within-family sweep of Llama-3.1 (different scales and post-training variants). Evaluation uses per-quantile (continuous) scoring and conventional single-threshold metrics.
Themes: adoption, governance
Identification: Comparative empirical evaluation: cross-model comparisons across model sizes and post-training variants (within-family Llama-3.1), a contamination-free simulated benchmark (ForecastBench-Sim) and a matched linear-control task to rule out data-leakage/artifacts, per-quantile decomposition to isolate tail behavior, and replication on multiple real-world datasets (COVID-19, measles, housing, hyperinflation); no randomized or instrumental causal identification.
Generalizability:
- Findings specifically concern forecasting tasks with superlinear growth and regime-change tail risk and may not apply to stationary or linear time series.
- Results may depend on prompting, decoding/temperature settings, and post-processing choices not fully generalizable across practitioners.
- Model families evaluated (notably Llama-3.1 and other unnamed LLMs) may not represent all contemporary architectures or fine-tuning regimes (e.g., heavily specialized forecasting models).
- Real-world replication uses selected datasets (COVID-19, measles, housing, hyperinflation) and may not generalize to other domains (e.g., macro aggregates, commodity prices).
- Evaluation conclusions depend on the choice of scoring rules; some operational decision contexts use thresholded metrics where the effect might be obscured.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change ... more capable models produce worse distributional forecasts.	Output Quality	negative	high	distributional forecast quality / calibration	0.18
The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control.	Output Quality	negative	high	forecast performance on simulated SIR epidemics (distributional forecasts)	0.18
The pattern replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.	Output Quality	negative	high	forecast performance on real-world time series (distributional forecasts / calibration)	0.18
A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.	Output Quality	negative	high	upper-tail forecast calibration / shift in predictive quantiles	0.18
A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.	Output Quality	negative	high	relationship between model scale / post-training and forecasting calibration (distributional quality)	0.18
Domain knowledge does not reliably rescue calibration.	Output Quality	null_result	high	forecast calibration after incorporating domain knowledge	0.18
This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks.	Output Quality	null_result	high	relationship between model capability and accuracy under single-threshold metrics	0.18
Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs.	Output Quality	negative	high	capability–accuracy relationship under tail-inclusive scoring (impact of model capability on forecast accuracy when tails are included)	0.18
We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.	Other	positive	high	evaluation methodology for LLM forecasting (metric selection)	0.18

Bigger LLMs can make riskier forecasts: larger and post-trained models inflate the upper tail on outbreaks, hyperinflation and housing booms, appearing better on common threshold tests but worse on tail-inclusive scoring, risking overly aggressive extrapolations in high-stakes forecasting.