The Commonplace

Large language models give fairly accurate point estimates but are wildly overconfident — their nominal 95% credible intervals contain the truth only 9–44% of the time; a simple statistical recalibration (conformal prediction) expands intervals to recover the intended coverage.

Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always
Luka Hobor, Mario Brcic, Mihael Kovac, Kristijan Poje · April 02, 2026
arxiv · descriptive · medium evidence · 7/10 relevance
LLMs, especially larger ones, give more accurate point estimates but are severely overconfident in their 95% credible intervals (observed coverage 9–44%), a deficiency that can be corrected with conformal prediction recalibration.

Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model's reasoning effort (low, medium, high) to test whether more "thinking" improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9–44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.

Summary

Main Finding

Larger model size improves point-estimate accuracy when using LLMs for Bayesian elicitation, but increasing a model’s “reasoning effort” (chain-of-thought / thinking budget) does not reliably improve accuracy or calibration. Critically, all evaluated models are severely overconfident: nominal 95% credible intervals contain the true value only 9–44% of the time. A post-hoc statistical recalibration (split conformal prediction with normalized residuals) reliably restores near-nominal coverage by expanding intervals (typical expansion factors ≈ 2–5×). Web-search/tool augmentation modestly helps weaker models but often degrades already-accurate ones.
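The coverage figure quoted above is the share of questions whose true value falls inside the interval the model reported. A minimal Python sketch of that check follows; the function name and the four example values are hypothetical illustrations, not the paper's code or data.

```python
from typing import Sequence


def empirical_coverage(
    truths: Sequence[float],
    lowers: Sequence[float],
    uppers: Sequence[float],
) -> float:
    """Fraction of true values that fall inside their reported intervals."""
    if not (len(truths) == len(lowers) == len(uppers)):
        raise ValueError("truths, lowers and uppers must have the same length")
    if len(truths) == 0:
        raise ValueError("at least one question is required")
    # Count intervals [lower, upper] that contain the true value.
    hits = sum(lo <= y <= hi for y, lo, hi in zip(truths, lowers, uppers))
    return hits / len(truths)


# Made-up example: four questions with nominal 95% intervals.
truths = [0.40, 0.25, 0.10, 0.55]
lowers = [0.15, 0.20, 0.08, 0.30]
uppers = [0.22, 0.30, 0.12, 0.45]
coverage = empirical_coverage(truths, lowers, uppers)
```

A well-calibrated set of 95% intervals would give a value near 0.95 on a large sample; values far below that indicate overconfidence.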

Key Points

  • Dataset & scope: 400 estimation questions sampled from four real-world datasets across domains — Big Five (personality), NHANES (U.S. health survey), NCD‑RisC (global health), and Glassdoor (labor market). Questions ask for a point estimate plus a 95% credible interval.
  • Models: 11 LLMs from multiple vendors (OpenAI GPT‑5.2/mini/nano, Anthropic Claude family, Google Gemini Pro/Flash, DeepSeek variants). Each model tested at low/medium/high reasoning‑effort where supported; two non‑thinking controls included.
  • Raw calibration failure: empirical coverage of nominal 95% intervals ranges from only 9% to 44% across model/effort combinations, indicating large, systematic overconfidence across families and effort levels.
  • Reasoning effort: pooled tests show no consistent improvement in negative log‑likelihood (NLL) or coverage with increased effort; effort often increases interval width but not calibration (i.e., more “thinking” widens intervals without making them accurate).
  • Model size matters: larger/more capable models (e.g., Gemini Pro, Claude Opus) have lower NLL and better coverage than smaller siblings.
  • Conformal recalibration: split conformal prediction using normalized residuals (calibration set 30%, test 70%) restores near‑nominal coverage; conformal quantiles imply intervals must be scaled by ≈2–5×.
  • Tool-augmented elicitation: web search yielded a net performance degradation overall (win rate ~40% vs baseline), though it modestly helped the weakest model and was near-neutral on data-rich topics. Harmful effects were concentrated in already-strong models and in specialized health/personality domains.
  • Domain heterogeneity: models perform well on frequently discussed/socially prominent topics (Big Five, Glassdoor) but systematically under-predict common health prevalences (NHANES / NCD‑RisC), e.g., a predicted median of ~18% vs. a true median of ~40%.
  • Practical constraints: conformal recalibration needs labeled calibration examples per model–domain group (authors flag groups with <15 calibration points as insufficient). Selection/abstention bias is present (models sometimes refuse or abstain), which can bias apparent performance.

Data & Methods

  • Questions: 100 randomly selected parameterized questions per dataset (400 total). Questions focus on derived population statistics (proportions and means) not trivially extractable from published summaries.
  • Models & effort: 11 LLMs tested; reasoning-effort controlled via vendor API parameters or thinking‑token budgets (low/medium/high). Two non‑thinking controls (GPT‑4.1, DeepSeek‑Chat).
  • Response extraction: structured parsing of numeric triplets (point estimate, lower, upper). Invalid or malformed responses were recorded; invalid rates varied across models, and some models had high invalid rates at low budgets.
  • Metrics:
    • Negative Log‑Likelihood (NLL) under a binomial or Gaussian likelihood, as appropriate to the target.
    • Coverage: empirical fraction of true values within the reported 95% intervals.
    • Relative sharpness: interval width normalized by point estimate (CV-style).
    • MdAPE and other accuracy measures for contextualization.
  • Conformal recalibration: split conformal prediction with normalized nonconformity scores s_i = |y_i − ŷ_i| / σ_i, where σ_i = u_i − l_i (predicted interval width). Calibrate on 30% of responses per model/effort/dataset, compute the conformal quantile q̂, then scale intervals on the test set as [ŷ_i ± q̂·σ_i]. This approach is distribution-free under exchangeability.
  • Tool experiment: for six models, web search allowed on 25 matched questions per dataset (low effort); models autonomously invoked search.
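The conformal recalibration bullet above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names and tuple layouts are hypothetical, and the quantile uses the standard finite-sample split-conformal rank ⌈(n+1)(1−α)⌉, which is an assumption here since the summary does not state the exact rule the paper used.

```python
import math
from typing import List, Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float = 0.05) -> float:
    """Finite-sample split-conformal quantile of the calibration scores."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Standard rank ceil((n + 1) * (1 - alpha)); an assumed rule, see lead-in.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for a finite interval at this alpha.
        return math.inf
    return sorted(scores)[rank - 1]


def recalibrate(
    cal: Sequence[Tuple[float, float, float, float]],
    test: Sequence[Tuple[float, float, float]],
    alpha: float = 0.05,
) -> Tuple[float, List[Tuple[float, float]]]:
    """cal rows: (y, y_hat, lower, upper); test rows: (y_hat, lower, upper)."""
    scores = []
    for y, y_hat, lower, upper in cal:
        width = upper - lower  # sigma_i = u_i - l_i
        if width <= 0:
            raise ValueError("interval width must be positive")
        scores.append(abs(y - y_hat) / width)  # s_i = |y_i - y_hat_i| / sigma_i
    q_hat = conformal_quantile(scores, alpha)
    # Recalibrated interval: [y_hat_i - q_hat * sigma_i, y_hat_i + q_hat * sigma_i]
    intervals = [
        (y_hat - q_hat * (upper - lower), y_hat + q_hat * (upper - lower))
        for y_hat, lower, upper in test
    ]
    return q_hat, intervals
```

Under this rank rule the quantile is infinite for small calibration sets (fewer than 19 points at α = 0.05), which is one way to see why recalibration needs a minimum number of labeled examples per group. Because σ_i is the full interval width, a q̂ of 0.5 reproduces an interval of the original width centered on the point estimate, and larger values widen it.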

Implications for AI Economics

  • Reliability of LLM uncertainty in economic/forecast settings: raw LLM‑reported credible intervals are highly unreliable and systematically overconfident. Feeding them directly into economic decision or risk models would severely understate uncertainty.
  • Need for post-hoc calibration: conformal prediction (or equivalent recalibration) should be treated as a necessary step before deploying LLM-elicited uncertainty in economic analyses. Expect to expand reported intervals substantially (commonly 2–5×).
  • Calibration costs and data needs: conformal recalibration requires labeled calibration examples that are representative of the target domain and model behavior (the authors flag groups with fewer than 15 calibration points as insufficient). In applied settings this imposes an explicit data‑collection/calibration cost that must be budgeted.
  • Model selection vs. compute budgets: larger models give materially better point estimates and somewhat better calibration; paying for larger models often yields greater value than paying for higher reasoning budgets. Economically, allocating budget to more capable models is generally preferable to increasing chain‑of‑thought/time limits.
  • Thinking budgets not cost‑effective for calibration: increasing reasoning effort consumes compute without reliably improving calibration or accuracy. For practitioners and platforms, this implies limited marginal returns on paying for extended reasoning budgets when the goal is well‑calibrated uncertainty.
  • Tool augmentation is risky: integrating web search/tooling can harm predictions from models that already have strong internal knowledge, so tool use should be validated per model and per domain. For ensemble or marketplace design, adding tools should come with domain‑specific validation rather than assumed improvement.
  • Domain sensitivity: LLMs perform better on commonly represented topics in training data; specialized domains (e.g., detailed health statistics) are more error‑prone. Economic models relying on LLMs for niche or technical statistics should be conservative, require stronger calibration, or keep human expert oversight.
  • Market & policy implications:
    • Market for calibration services: demand for model‑agnostic calibration layers and validated calibration datasets is likely to grow.
    • Procurement and standards: contracts that use LLM outputs for decision-making should require documented calibration, coverage guarantees, and holdout evaluation sets.
    • Labor substitution: LLMs can accelerate elicitation workflows but cannot yet replace human experts for calibrated probabilistic forecasts without additional statistical correction and oversight.
  • Practical recommendations for economists:
    • Always evaluate empirical coverage before trusting reported credences.
    • Use conformal or similar recalibration methods and budget for calibration data.
    • Prefer larger, better‑performing models when accuracy matters; don’t rely on higher reasoning budgets alone.
    • Validate any tool/search augmentation per model/domain.
    • Account for abstention/selection bias when interpreting model outputs and build procedures to handle refusals.

Limitations to bear in mind: selection bias from model abstentions can make some models look artificially good; conformal guarantees rely on exchangeability (i.e., calibration set representative of test cases); tool-augmentation results are preliminary and domain/sample limited.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Findings are based on direct, repeatable empirical tests across eleven LLMs and multiple domains with a clear metric (95% interval coverage), producing consistent overconfidence patterns; however, the set of tasks, prompts, model versions, and sample sizes is limited and potentially sensitive to elicitation details, so external robustness is not fully established.

Methods Rigor: medium. The study uses sensible evaluation metrics (coverage rates) and a principled recalibration method (conformal prediction), and it experimentally varies reasoning effort and web access; but rigor is reduced by likely arbitrariness in prompt design, limited reporting on sample sizes and selection of population statistics, potential multiple-testing or overfitting concerns for recalibration, and dependence on specific model snapshots.

Sample: Eleven large language models (unspecified versions) were asked to estimate population-level statistics across domains including health prevalence rates, personality trait distributions, and labor market figures; for each query models returned point estimates plus 95% credible intervals under three reasoning-effort conditions (low/medium/high); a preliminary variant provided web-search access; conformal recalibration used held-out examples to adjust interval widths.

Themes: human_ai_collab, adoption

Identification: Compare model-produced 95% credible intervals to known ground-truth population statistics across multiple domains and models; vary model capability (size) and prompted reasoning effort (low/medium/high) experimentally; evaluate empirical coverage (fraction of intervals containing the true value) and test a post-hoc conformal prediction recalibration to adjust interval widths.

Generalizability:
  • Limited to the specific eleven models and versions tested; other or newer models may behave differently.
  • Results may depend on prompt wording and elicitation protocol (prompt sensitivity).
  • Selected population-statistic tasks may overrepresent common or widely discussed topics and underrepresent specialized or obscure data.
  • Conformal recalibration performance may not generalize if calibration data are not representative of deployment queries.
  • Preliminary web-search experiment is small and may not generalize to full internet-enabled agents or retrieval-augmented pipelines.
  • Temporal generalizability: model behavior can change with updates; ground-truth statistics may vary by location/time.

Claims (7)

  • Claim: Larger, more capable models produce more accurate estimates. [Output Quality]
    • Direction: positive
    • Confidence: high
    • Outcome: accuracy of population-statistic estimates
    • Details: n=11; 0.18
  • Claim: Increasing reasoning effort (low, medium, high) provides no consistent benefit to estimation performance. [Output Quality]
    • Direction: null_result
    • Confidence: high
    • Outcome: impact of prompting/reasoning effort on estimate accuracy and calibration
    • Details: 0.18
  • Claim: All models are severely overconfident: their 95% intervals contain the true value only 9–44% of the time, far below the expected 95%. [Error Rate]
    • Direction: negative
    • Confidence: high
    • Outcome: empirical coverage rate of 95% credible intervals
    • Details: 9–44% coverage (vs. expected 95%); 0.3
  • Claim: A statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. [Decision Quality]
    • Direction: positive
    • Confidence: high
    • Outcome: coverage of recalibrated credible intervals (post-conformal prediction)
    • Details: 0.3
  • Claim: In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. [Output Quality]
    • Direction: mixed
    • Confidence: high
    • Outcome: change in predictive accuracy with web search access
    • Details: 0.18
  • Claim: Models performed well on commonly discussed topics but struggled with specialized health data. [Output Quality]
    • Direction: mixed
    • Confidence: high
    • Outcome: topic-specific estimation accuracy
    • Details: 0.18
  • Claim: LLM uncertainty estimates require statistical correction before they can be used in decision-making. [Decision Quality]
    • Direction: negative
    • Confidence: high
    • Outcome: adequacy of raw LLM uncertainty estimates for decision-making (calibration/coverage)
    • Details: 0.18
