Large language models differ in whether they 'know what they don't know': Mistral gets the most facts right but shows the weakest metacognitive efficiency, while aggregate calibration metrics can mislead by reflecting confidence thresholds rather than true metacognitive sensitivity.
Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
Summary
Main Finding
Applying Type-2 Signal Detection Theory (meta-d′ and the M-ratio = meta-d′/d′) to LLM confidence reveals structure invisible to standard calibration and ranking metrics. Across 224,000 factual QA trials on four 7–9B models, models with similar or superior Type‑1 sensitivity can have substantially worse metacognitive efficiency. In particular, Mistral-7B achieved the highest Type‑1 d′ but the lowest M-ratio (meaning it “knows” answers but is poor at signaling when it is right), while Gemma-2 and Llama-3-Base showed near‑optimal M-ratios despite lower d′. AUROC2 and M-ratio can produce fully inverted model rankings, so calibration/ranking metrics alone can mislead deployment choices that rely on confidence.
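For readers new to signal detection theory, the two quantities in the ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it uses the textbook equal-variance d′ computed from hit and false-alarm counts with a log-linear correction (add 0.5 to each count), the counts are hypothetical, and the summary does not say how LLM trials are mapped onto hits and false alarms. meta-d′ is taken as an input because fitting it requires a maximum-likelihood procedure that is not shown here.

```python
from statistics import NormalDist


def d_prime_loglinear(hits, misses, false_alarms, correct_rejections):
    """Equal-variance d' = z(hit rate) - z(false-alarm rate).

    The log-linear correction adds 0.5 to each count (and therefore 1 to each
    row total) so that rates of exactly 0 or 1 cannot produce infinite z-scores.
    """
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)


def m_ratio(meta_d_prime, d_prime):
    """Metacognitive efficiency: meta-d' expressed as a fraction of d'."""
    return meta_d_prime / d_prime


# Hypothetical counts, for illustration only.
d = d_prime_loglinear(hits=80, misses=20, false_alarms=20, correct_rejections=80)
print(f"d' = {d:.3f}")
```

An M-ratio below 1 then reads directly as the share of Type-1 information that the confidence signal fails to carry.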
Key Points
- Conceptual advance: Distinguishes Type‑1 sensitivity (d′, how well the model discriminates correct from incorrect answers) from Type‑2 metacognitive sensitivity (meta-d′, how well the model’s confidence discriminates correct from incorrect), and normalizes with M-ratio = meta-d′ / d′.
- Metric interpretation:
  - M-ratio = 1: confidence captures all Type-1 information (optimal metacognition).
  - M-ratio < 1: metacognitive loss (confidence discards information available to the decision).
  - M-ratio > 1: confidence accesses information beyond binary correctness.
- Main empirical results (T = 1.0 on TriviaQA):
  - Mistral-7B-Instruct: d′ = 1.597; M-ratio = 0.852 (95% CI [0.765, 0.941]); highest d′ but significantly suboptimal metacognition.
  - Llama-3-8B-Base: d′ = 1.407; M-ratio = 1.048 (95% CI [0.952, 1.152]).
  - Llama-3-8B-Instruct: d′ = 1.386; M-ratio = 0.983 (95% CI [0.886, 1.083]).
  - Gemma-2-9B-Instruct: d′ = 0.946; M-ratio = 1.048 (95% CI [0.913, 1.191]).
- AUROC2 vs M-ratio: Models ranked highest by AUROC2 (which inherits Type‑1 advantage) can be ranked lowest by M-ratio; relying on AUROC2/ECE can favor models with high raw accuracy but poor usable confidence.
- Domain specificity: M-ratio varies by knowledge domain within models: a model can be metacognitively strong in one domain and weak in another, and the weakest domain differs from model to model.
- Temperature effects: Changing sampling temperature shifts the Type-2 criterion (confidence policy) while meta-d′ remains stable for two of the four models, indicating that confidence policy and metacognitive capacity are separable.
- Robustness and protocol: Analyses were pre-registered for three models (Gemma was added after registration). NLP (normalized token log-probability) served as the continuous confidence signal, and monotonicity checks confirmed that NLP correlates with accuracy. Multiple robustness checks (binning, UVSDT, replication on Natural Questions) are reported.
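The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below copies the T = 1.0 TriviaQA values listed above, back-calculates an implied meta-d′ as M-ratio × d′ (an approximation, since the inputs are rounded), flags which confidence intervals exclude 1, and ranks the models by d′ and by M-ratio. The dictionary layout and function names are illustrative choices, not part of the paper's code.

```python
# Values copied from the T = 1.0 TriviaQA results in the Key Points list.
reported = {
    "Mistral-7B-Instruct": {"d": 1.597, "m_ratio": 0.852, "ci": (0.765, 0.941)},
    "Llama-3-8B-Base": {"d": 1.407, "m_ratio": 1.048, "ci": (0.952, 1.152)},
    "Llama-3-8B-Instruct": {"d": 1.386, "m_ratio": 0.983, "ci": (0.886, 1.083)},
    "Gemma-2-9B-Instruct": {"d": 0.946, "m_ratio": 1.048, "ci": (0.913, 1.191)},
}


def implied_meta_d(entry):
    """Back-calculate meta-d' from M-ratio = meta-d' / d' (rounded inputs)."""
    return entry["m_ratio"] * entry["d"]


def ci_excludes_one(entry):
    """True when the 95% CI for M-ratio lies entirely above or below 1."""
    low, high = entry["ci"]
    return high < 1.0 or low > 1.0


# Rankings from best to worst on each metric (ties keep listing order).
by_d = sorted(reported, key=lambda m: reported[m]["d"], reverse=True)
by_m = sorted(reported, key=lambda m: reported[m]["m_ratio"], reverse=True)

for name, entry in reported.items():
    flag = "CI excludes 1" if ci_excludes_one(entry) else "CI includes 1"
    print(f"{name}: implied meta-d' = {implied_meta_d(entry):.3f} ({flag})")
print("Ranked by d':     ", by_d)
print("Ranked by M-ratio:", by_m)
```

Only the Mistral interval lies entirely below 1, which matches the description of its metacognition as significantly suboptimal.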
Data & Methods
- Models: Llama-3-8B-Instruct, Llama-3-8B-Base (Meta); Mistral-7B-Instruct-v0.3; Gemma-2-9B-Instruct.
- Data: TriviaQA (5,000 sampled Qs, domain-tagged), Natural Questions (3,000 short-answer Qs) → total trials: 4 models × 8,000 Qs × 7 temperatures = 224,000 trials.
- Confidence signal: Normalized log-probability (NLP) of generated answer (continuous), binned into ordered categories for SDT analysis (Maniscalco & Lau format).
- Main metrics:
  - d′: Type-1 sensitivity (NLP separation of correct vs incorrect).
  - meta-d′: Type-2 sensitivity estimated by maximum likelihood (Maniscalco & Lau method, Hautus correction).
  - M-ratio = meta-d′ / d′: metacognitive efficiency.
  - AUROC2, ECE, Brier used for comparison.
- Temperatures tested: T ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0}; the main hypotheses were tested at T = 1.0.
- Inference/compute: Quantized GGUF models (Q5_K_M), inference via llama-cpp-python with Vulkan.
- Statistical inference: Bootstrap 95% CIs (10,000 resamples); pre-registered hypotheses for metacognition, domain effects, temperature dissociation, and hidden structure. Several robustness checks (binning sensitivity, unequal variance SDT, replication) reported.
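Parts of this pipeline can be sketched on synthetic data. The example below is a rough illustration under several assumptions: it bins a continuous confidence score into ordered levels using quantile edges (the paper's actual binning rule is not given here), computes the area under the Type-2 ROC from those levels, and forms a percentile bootstrap interval. It does not perform the maximum-likelihood meta-d′ fit, the data are simulated, all function names are hypothetical, and the resample count is kept well below the 10,000 used in the paper so that it runs quickly.

```python
import numpy as np


def bin_confidence(conf, n_bins=4):
    """Map continuous confidence to ordered levels 0..n_bins-1 via quantile edges."""
    edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(conf, edges)


def auroc2(levels, correct):
    """Type-2 ROC area: P(level | correct > level | incorrect) + 0.5 * P(tie)."""
    pos = levels[correct]
    neg = levels[~correct]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties


def bootstrap_ci(levels, correct, n_resamples=500, seed=0):
    """Percentile bootstrap 95% CI for AUROC2, resampling trials with replacement."""
    rng = np.random.default_rng(seed)
    n = len(levels)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        if correct[idx].all() or not correct[idx].any():
            continue  # a resample needs both correct and incorrect trials
        stats.append(auroc2(levels[idx], correct[idx]))
    return np.percentile(stats, [2.5, 97.5])


# Synthetic trials: confidence is shifted upward when the answer is correct.
rng = np.random.default_rng(42)
correct = rng.random(500) < 0.6
conf = rng.normal(loc=np.where(correct, 1.0, 0.0), scale=1.0)

levels = bin_confidence(conf)
area = auroc2(levels, correct)
low, high = bootstrap_ci(levels, correct)
print(f"AUROC2 = {area:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Because AUROC2 is computed directly from the confidence levels, it rises with Type-1 performance as well as with metacognitive sensitivity, which is why the summary treats it separately from M-ratio.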
Implications for AI Economics
- Model procurement and evaluation: Selecting models purely on accuracy, ECE or AUROC2 can be economically suboptimal when downstream systems use model confidence for decisions (selective prediction, routing, triage). Buyers should include M-ratio/meta-d′ in procurement metrics to capture usable uncertainty.
- Value of confidence: The economic value of a model's confidence signal depends on metacognitive efficiency. A lower-accuracy model with higher M-ratio can yield better cost-adjusted outcomes (fewer costly false trusts) than a higher-accuracy model with poor M-ratio.
- Deployment design and costs: Decisions about abstention thresholds, human-in-the-loop allocation, and escalation policies should incorporate M-ratio by domain. Domain-specific M-ratio estimates can optimize allocation of human review resources and reduce error-related costs.
- Pricing and contracts: SLAs and pricing for model access could be stratified by metacognitive efficiency (e.g., higher prices for models/services that reliably indicate uncertainty), or include clauses tying penalties/bonuses to post-hoc risk metrics that relate to M-ratio.
- Insurance and risk management: Insurers and firms assessing liability from model errors should consider M-ratio as a measure of how well the system protects against high-cost errors via informative confidence. This affects premiums for AI-driven services.
- R&D and investment priorities: Returns to investment differ between improving Type-1 sensitivity (more data, larger models) and improving metacognitive efficiency (better uncertainty estimation, calibration layers, or auxiliary heads). Meta-d′ helps quantify marginal gains in usable confidence per dollar and guides R&D allocation.
- Market signaling and competition: Publicizing M-ratio and domain-level metacognitive performance can become a competitive differentiator in model marketplaces, analogous to latency/throughput and accuracy metrics today.
- Policy and standards: Regulators and standard-setters may require disclosure of metacognitive efficiency for high-risk applications (medical, legal, financial) because it better predicts whether a model will reliably flag its own uncertainty.
- Human–AI labor substitution: Models with high M-ratio can more safely substitute for human labor in tasks requiring autonomous decisions; conversely, models with low M-ratio necessitate more human oversight, changing labor cost calculations and ROI.
- Experimental pricing and productization: Companies can design products that monetize confidence-aware features (e.g., conservative mode that abstains more often but yields higher trust), with M-ratio guiding where those features pay off.
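The cost argument running through these points can be made concrete with a toy selective-prediction model. The sketch below is an assumption-laden illustration, not an analysis from the paper: confidence is modeled as Gaussian with unit variance, centered at +separation/2 for correct answers and -separation/2 for incorrect ones, where `separation` is a stand-in for Type-2 sensitivity rather than a fitted meta-d′. Answers above a threshold are accepted, the rest go to human review, and the cost figures and the two models A and B are entirely hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def expected_cost(threshold, accuracy, separation, cost_error, cost_review):
    """Expected per-query cost when answers with confidence > threshold are accepted.

    Accepted wrong answers incur cost_error; everything below the threshold
    is sent to review at cost_review; accepted correct answers cost nothing.
    """
    p_accept_wrong = (1 - accuracy) * (1 - PHI(threshold + separation / 2))
    p_accept_right = accuracy * (1 - PHI(threshold - separation / 2))
    p_review = 1 - p_accept_wrong - p_accept_right
    return cost_error * p_accept_wrong + cost_review * p_review


def best_policy(accuracy, separation, cost_error=10.0, cost_review=1.0):
    """Grid-search the acceptance threshold that minimizes expected cost."""
    thresholds = [t / 100 for t in range(-500, 501)]
    costs = [
        expected_cost(t, accuracy, separation, cost_error, cost_review)
        for t in thresholds
    ]
    best = min(range(len(costs)), key=costs.__getitem__)
    return thresholds[best], costs[best]


# Hypothetical models: A is more accurate, B has more informative confidence.
threshold_a, cost_a = best_policy(accuracy=0.75, separation=0.8)
threshold_b, cost_b = best_policy(accuracy=0.70, separation=1.6)
print(f"Model A: threshold {threshold_a:+.2f}, expected cost {cost_a:.3f}")
print(f"Model B: threshold {threshold_b:+.2f}, expected cost {cost_b:.3f}")
```

Which model comes out cheaper depends on the assumed error and review costs, so a comparison like this would need real cost estimates and fitted per-domain sensitivities before informing a procurement decision.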
Limitations to consider when applying these implications: study limited to four relatively small open‑weight models and to NLP as the confidence variable; equal-variance SDT used as a baseline (UVSDT assessed in robustness checks). M-ratio requires sufficient per-domain trials to estimate reliably; hierarchical estimation may be needed for fine-grained economic decisions.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). [Decision Quality] | negative | high | confounding of calibration metrics between Type-1 sensitivity (knowledge) and Type-2 metacognitive sensitivity | 0.18 |
| We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. [Decision Quality] | positive | high | decomposition of Type-1 vs Type-2 capacities using meta-d' and M-ratio | 0.3 |
| We applied this framework to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials. [Decision Quality] | positive | high | empirical evaluation of models' Type-1 and Type-2 metrics across factual QA trials | n=224000; 0.3 |
| Metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio. [Decision Quality] | mixed | high | Type-1 sensitivity (d') and metacognitive efficiency (M-ratio) | n=224000; 0.18 |
| Metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics. [Decision Quality] | mixed | high | domain-specific metacognitive efficiency (M-ratio) across task domains | n=224000; 0.18 |
| Temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity. [Decision Quality] | mixed | high | Type-2 criterion (confidence policy) and meta-d' (metacognitive capacity) | n=224000; 0.18 |
| AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. [Decision Quality] | mixed | high | model ranking by AUROC_2 versus model ranking by M-ratio | n=224000; 0.18 |
| The meta-d' framework reveals which models 'know what they don't know' versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. [Decision Quality] | positive | high | distinction between true metacognitive capacity and apparent calibration driven by criterion placement | n=224000; 0.18 |
| The analysis was pre-registered and code and data are publicly available. [Other] | positive | high | research transparency (pre-registration and public code/data) | 0.3 |