Leading language models are overconfident forecasters: none of 11 evaluated models hit the 90% prediction-interval target—Gemini 3.1 Pro (79.1%), Grok 4 (76.4%) and GPT-5.4 (75.3%) perform best—while calibration collapses at extreme magnitudes.

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Jeremy Qin, Maksym Andriushchenko · April 17, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

Benchmarking with prediction intervals shows that current frontier LLMs are systematically overconfident in continuous numerical forecasting, with none achieving the 90% coverage target and top models reaching roughly 75–79% coverage.

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.

Summary

Main Finding

QuantSightBench introduces prediction intervals as a rigorous evaluation interface for LLM numerical forecasting and shows that current frontier and open-weight LLMs are systematically overconfident. Across 1,000 diverse forecasting questions (90% nominal confidence), no model reaches the 90% coverage target; top performers (Gemini 3.1 Pro, Grok 4, GPT-5.4) achieve roughly 75–79% coverage. Calibration degrades sharply at extreme magnitudes and for harder questions, revealing scale-sensitivity and persistent undercoverage.

Key Points

New evaluation proposal: prediction intervals (lower, upper bounds at a stated confidence) are argued to be a natural, testable interface for continuous numerical forecasts and preferable to point estimates.
QuantSightBench: benchmark of 1,000 automatically generated numeric-forecast questions spanning economics, public health, demographics, etc., constructed to avoid trivial recall or leakage.
Settings examined:
- Zero-shot (question only)
- Background-context (question + curated background news)
- Agentic (model can iteratively retrieve from a fixed corpus)
Main metrics:
- Coverage: proportion of true values inside predicted intervals (targets 1−α; typically 90%)
- Mean Log Interval Score (MLIS): log-transformed Winkler interval score combining sharpness and calibration (lower is better)
Main empirical results:
- No evaluated model achieves nominal 90% coverage. Top coverage (agentic, 90% target): Gemini 3.1 Pro 79.1%, Grok 4 76.4%, GPT-5.4 75.3%.
- Systematic overconfidence: models' intervals are too narrow overall.
- Scale sensitivity: coverage declines with increasing magnitude; many models fall below 65% coverage for very large quantities (100K+).
- Harder questions (as proxied by more agentic retrieval iterations) have lower coverage and higher MLIS; more iterations signal difficulty rather than benefit from extra retrieval in aggregate.
- Models do widen intervals when their point errors are larger (positive correlation between relative width and mean relative error), but the adjustment is insufficient and inconsistent across models.
Ablations:
- Providing background context improves coverage and MLIS vs zero-shot.
- Increasing target confidence (e.g., 80→90→95%) produces wider intervals; models still under-cover but MLIS can improve as intervals appropriately widen.
- More internal reasoning effort generally improves calibration/sharpness for most models.
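
The two metrics listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `coverage` and `winkler_score` follow the standard textbook definitions, while `mean_log_interval_score` is an assumption about what "log-transformed" means here (the Winkler score applied to log10-transformed bounds and outcomes, which requires positive quantities). The paper's exact MLIS formula may differ.

```python
import math

def coverage(intervals, truths):
    """Fraction of true values that fall inside their [lower, upper] interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)

def winkler_score(lo, hi, y, alpha=0.10):
    """Standard Winkler interval score for a (1 - alpha) interval; lower is better.

    The score is the interval width plus a penalty of (2 / alpha) times the
    distance by which the outcome y misses the interval.
    """
    score = hi - lo
    if y < lo:
        score += (2.0 / alpha) * (lo - y)
    elif y > hi:
        score += (2.0 / alpha) * (y - hi)
    return score

def mean_log_interval_score(intervals, truths, alpha=0.10):
    """ASSUMPTION: Winkler score computed on log10-transformed values.

    Only valid for strictly positive bounds and outcomes; the benchmark's
    actual transformation is not specified in this summary.
    """
    scores = [
        winkler_score(math.log10(lo), math.log10(hi), math.log10(y), alpha)
        for (lo, hi), y in zip(intervals, truths)
    ]
    return sum(scores) / len(scores)

# Illustrative (made-up) forecasts: three 90% intervals and their outcomes.
intervals = [(10, 100), (1, 10), (100, 1000)]
truths = [50, 100, 500]
print(coverage(intervals, truths))
print(mean_log_interval_score(intervals, truths))
```

Working in log space makes a miss by a factor of ten cost the same whether the quantity is in the tens or the millions, which is one plausible reason to log-transform before averaging across questions of very different scale.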

Data & Methods

Data construction:
- Built on OpenForecast pipeline with modifications to retain only quantitative, uncertain-future questions.
- Background articles drawn from Jan–Aug 2025; forecasting questions generated from articles with resolution dates Sep 2025–Jan 2026 to reduce leakage.
- Retrieval corpus: ~320,000 news articles, deduplicated, chunked (512 tokens), embedded with OpenAI text-embedding-3-large.
- Final benchmark: 1,000 quantitative forecasting questions across multiple domains (examples include silver supply deficit in 2025, climate finance commitments, fatality counts).
Models evaluated: 11 models including frontier proprietary (OpenAI, Anthropic, Google, xAI) and open-weight models (GLM, Kimi, DeepSeek, Grok). All models had documented knowledge cutoffs prior to Sep 2025 to avoid training-set leakage.
Evaluation settings: zero-shot, background-context, agentic (iterative retrieval up to several iterations).
Metrics:
- Coverage at specified confidence (default 90%).
- Mean Log Interval Score (MLIS) — log-transformed Winkler interval score to normalize across scales while preserving a proper scoring rule.
Analysis dimensions:
- Coverage and MLIS per model, across settings.
- Breakdown by ground-truth magnitude bins.
- Relationship between number of agentic retrieval iterations and performance.
- Interval width vs. point error (modulation of uncertainty).
Limitations noted by authors:
- Preprint, under review.
- Evaluation limited to news-derived forecast questions and the chosen retrieval corpus.
- Models constrained by declared knowledge cutoffs.
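
The breakdown by ground-truth magnitude listed under the analysis dimensions can be sketched as below. The bin edges and the `(lower, upper, truth)` record layout are hypothetical choices for illustration; only the "100K+" bucket is named in this summary.

```python
from collections import defaultdict

# Hypothetical magnitude bins as (inclusive lower edge, exclusive upper edge, label).
BINS = [
    (0, 1, "<1"),
    (1, 1_000, "1-1K"),
    (1_000, 100_000, "1K-100K"),
    (100_000, float("inf"), "100K+"),
]

def magnitude_bin(truth):
    """Return the label of the bin containing |truth|."""
    size = abs(truth)
    for low, high, label in BINS:
        if low <= size < high:
            return label
    raise ValueError(f"cannot bin value: {truth!r}")

def coverage_by_magnitude(records):
    """Empirical coverage per magnitude bin.

    records: iterable of (lower, upper, truth) tuples, one per question.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lower, upper, truth in records:
        label = magnitude_bin(truth)
        totals[label] += 1
        hits[label] += int(lower <= truth <= upper)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up records for illustration only.
records = [
    (0.1, 0.9, 0.5),
    (10, 20, 30),
    (5_000, 9_000, 7_000),
    (200_000, 300_000, 250_000),
    (200_000, 300_000, 500_000),
]
print(coverage_by_magnitude(records))
```

Comparing each bin's coverage against the nominal 90% target is what exposes scale sensitivity: a model can look acceptable in aggregate while undercovering badly in the largest bin.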

Implications for AI Economics

For economic forecasting and policy-making, the study highlights key risks and practical recommendations:
- Risk of overconfidence: LLMs tend to understate uncertainty. Using raw LLM-produced point estimates or narrow intervals without calibration could produce systematically biased (overconfident) decisions—dangerous in macro policy, fiscal planning, and financial risk assessments.
- Prediction intervals as best practice: Economists and practitioners should prefer interval forecasts (with validated calibration diagnostics) over single-number outputs when using LLMs for numerical forecasts.
- Scale-aware evaluation and reporting: LLMs degrade on very large magnitudes and fractional quantities; economic applications (GDP, national debt, large population counts, rates) require explicit checks of coverage across relevant scales and domains.
- Human-in-the-loop and decision rules: Use LLM intervals as one input among others; require model calibration checks and conservative decision rules (e.g., widen reported intervals, combine with domain models) before acting on high-stakes forecasts.
- Calibration interventions:
  - Post-hoc methods (e.g., conformal prediction, ensemble calibration) can restore coverage guarantees and should be applied when possible—especially when model-native intervals are miscalibrated.
  - Fine-tuning or training objectives that incentivize honest uncertainty (proper scoring rules) could improve native interval quality in future model iterations.
- Retrieval and information synthesis: Agentic retrieval helps but more iterations often mark harder questions; provenance and transparency of retrieved evidence matter for trustworthy economic forecasts.
- Benchmarking and research directions:
  - Researchers in AI economics should adopt prediction-interval-based benchmarks (like QuantSightBench) for model evaluation and for assessing downstream decision performance under uncertainty.
  - Future work should test alternative corpora (e.g., proprietary economic data), longer horizons, and horizon-specific calibration, and should develop methods that correct scale sensitivity.
- Practical deployment checklist for economists using LLM forecasts:
  - Require interval forecasts (not just point estimates).
  - Validate empirical coverage on held-out tasks similar in scale/domain.
  - Apply post-hoc calibration when needed (conformal/ensemble).
  - Flag high-magnitude and high-iteration (hard) questions for additional human or model-based scrutiny.
  - Report interval sharpness and calibration (e.g., MLIS and empirical coverage) alongside forecasts.
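
The "apply post-hoc calibration" item in the checklist can be illustrated with a split-conformal adjustment in the style of conformalized quantile regression. This is a sketch of one well-known technique, not a method evaluated in the paper. It assumes a held-out calibration set of resolved questions that is exchangeable with the new ones; the function names are hypothetical.

```python
import math

def conformal_margin(cal_intervals, cal_truths, alpha=0.10):
    """Margin to add to both ends of a model interval to target (1 - alpha) coverage.

    The conformity score is positive when the outcome falls outside the
    interval (by that distance) and negative when it falls inside.
    """
    scores = sorted(
        max(lo - y, y - hi) for (lo, hi), y in zip(cal_intervals, cal_truths)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed finite-sample quantile
    if rank > n:
        # Too few calibration points to support a finite margin at this alpha.
        return float("inf")
    return scores[rank - 1]

def calibrate(interval, margin):
    """Widen (or, for a negative margin, shrink) an interval by the margin."""
    lo, hi = interval
    return (lo - margin, hi + margin)

# Made-up calibration set: the model always answered [10, 20].
cal_intervals = [(10, 20)] * 9
cal_truths = [12, 15, 18, 11, 19, 22, 25, 8, 14]

margin = conformal_margin(cal_intervals, cal_truths, alpha=0.10)
print(margin)
print(calibrate((10, 20), margin))
```

Because the benchmark's quantities span many orders of magnitude, a single additive margin is unlikely to suit every question; computing the scores on log-transformed values, or separately per magnitude bin, is one possible adaptation, again an assumption rather than something the paper reports.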

Summary judgement: QuantSightBench supplies an actionable evaluation framework for continuous numerical forecasting with LLMs and surfaces important failure modes (overconfidence, scale sensitivity) that are directly relevant to economic forecasting and policy use. Users should treat current LLM interval outputs as informative but systematically overconfident unless calibrated and combined with domain safeguards.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is a benchmarking/evaluation study of model forecasting performance, not a paper making causal claims; it provides direct empirical measurements of model calibration rather than causal identification.
Methods Rigor: high. Introduces a purpose-built benchmark (QuantSightBench) designed to elicit prediction intervals, evaluates 11 frontier and open-weight LLMs across multiple settings, and reports standard metrics (empirical coverage and interval sharpness) and calibration across magnitudes; methods appear systematic and appropriate for the stated evaluation goal, though results depend on benchmark design choices and prompt/interface details.
Sample: QuantSightBench, a new benchmark of continuous numerical forecasting tasks drawn from domains including economics, public health, and social demographics; evaluates 11 frontier and open-weight LLMs (examples reported: Gemini 3.1 Pro, Grok 4, GPT-5.4) under multiple prompting/settings, measuring prediction-interval coverage and sharpness against true outcomes and a 90% target.
Themes: productivity, human_ai_collab
Generalizability:
- Limited set of models (11) and specific model versions; performance may change as models update.
- Benchmark task selection and prompt/elicitation format may influence outcomes and not represent all real-world forecasting tasks.
- Evaluation measures model outputs in isolation (no human-in-the-loop or decision-context assessment).
- Performance may vary by domain, language, geographic or temporal scope not fully covered by the benchmark.
- Commercial model access, temperature/settings, and tokenization differences could affect reproducibility.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target.	Decision Quality	negative	high	empirical coverage (prediction interval coverage)	n=11 0.18
The top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all fall at least 10 percentage points short of the 90% coverage target.	Decision Quality	negative	high	empirical coverage (prediction interval coverage) for specific models	n=11 Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), GPT-5.4 (75.3%) 0.18
Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.	Decision Quality	negative	high	calibration / overconfidence of prediction intervals across magnitudes	n=11 0.18
We introduce a new benchmark QuantSightBench to assess prediction-interval forecasting capability and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness.	Other	null_result	high	benchmark introduction / evaluation of empirical coverage and interval sharpness	0.09
Prediction intervals are a more suitable evaluation format than point estimates for numerical forecasting because they require scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes.	Other	positive	high	suitability of evaluation format (prediction intervals vs point estimates)	0.03
Existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions, and do not capture forecasting over continuous quantities.	Other	negative	high	scope/coverage of existing evaluation formats	0.09