Large language models differ in whether they 'know what they don't know': Mistral gets the most facts right but shows the weakest metacognitive efficiency, while aggregate calibration metrics can mislead by reflecting confidence thresholds rather than true metacognitive sensitivity.

Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Jon-Paul Cacioli · March 26, 2026

arXiv · descriptive · high evidence · 7/10 relevance

Applying Type-2 signal-detection metrics to 224k QA trials, the paper shows that LLMs differ markedly in metacognitive efficiency (M-ratio) independent of accuracy, that efficiency is domain-specific, and that temperature shifts can change confidence policy without altering underlying metacognitive sensitivity.

Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.

Summary

Main Finding

Applying Type-2 Signal Detection Theory (meta-d′ and the M-ratio = meta-d′/d′) to LLM confidence reveals structure invisible to standard calibration and ranking metrics. Across 224,000 factual QA trials on four 7–9B models, models with similar or superior Type‑1 sensitivity can have substantially worse metacognitive efficiency. In particular, Mistral-7B achieved the highest Type‑1 d′ but the lowest M-ratio (meaning it “knows” answers but is poor at signaling when it is right), while Gemma-2 and Llama-3-Base showed near‑optimal M-ratios despite lower d′. AUROC2 and M-ratio can produce fully inverted model rankings, so calibration/ranking metrics alone can mislead deployment choices that rely on confidence.

Key Points

Conceptual advance: Distinguishes Type‑1 sensitivity (d′, how well the model discriminates correct from incorrect answers) from Type‑2 metacognitive sensitivity (meta-d′, how well the model’s confidence discriminates correct from incorrect), and normalizes with M-ratio = meta-d′ / d′.
Metric interpretation:
- M-ratio = 1: confidence captures all Type‑1 information (optimal metacognition).
- M-ratio < 1: metacognitive loss (confidence discards information available to the decision).
- M-ratio > 1: confidence accesses information beyond binary correctness.
Main empirical results (T = 1.0 on TriviaQA):
- Mistral-7B-Instruct: d′ = 1.597; M-ratio = 0.852 (95% CI [0.765, 0.941]) — highest d′ but significantly suboptimal metacognition.
- Llama-3-8B-Base: d′ = 1.407; M-ratio = 1.048 (95% CI [0.952, 1.152]).
- Llama-3-8B-Instruct: d′ = 1.386; M-ratio = 0.983 (95% CI [0.886, 1.083]).
- Gemma-2-9B-Instruct: d′ = 0.946; M-ratio = 1.048 (95% CI [0.913, 1.191]).
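As a reading aid for the four results above, the following sketch checks which reported 95% CIs exclude M-ratio = 1 and back-computes the meta-d′ implied by M-ratio = meta-d′ / d′. The numbers are copied from the list above. The implied meta-d′ values are derived here from rounded figures, not reported in this summary, so treat them as approximate. The helper names are illustrative.

```python
# Values copied from the T = 1.0 TriviaQA results listed above.
RESULTS = {
    "Mistral-7B-Instruct": {"d_prime": 1.597, "m_ratio": 0.852, "ci": (0.765, 0.941)},
    "Llama-3-8B-Base": {"d_prime": 1.407, "m_ratio": 1.048, "ci": (0.952, 1.152)},
    "Llama-3-8B-Instruct": {"d_prime": 1.386, "m_ratio": 0.983, "ci": (0.886, 1.083)},
    "Gemma-2-9B-Instruct": {"d_prime": 0.946, "m_ratio": 1.048, "ci": (0.913, 1.191)},
}


def ci_excludes_one(ci):
    """True when the interval lies entirely below or entirely above 1."""
    low, high = ci
    return high < 1.0 or low > 1.0


def implied_meta_d(row):
    """Rearranged definition: meta-d' = M-ratio * d' (approximate, from rounded inputs)."""
    return row["m_ratio"] * row["d_prime"]


# Models whose M-ratio is below 1 with a CI that excludes 1.
suboptimal = [
    name
    for name, row in RESULTS.items()
    if row["m_ratio"] < 1.0 and ci_excludes_one(row["ci"])
]

for name, row in RESULTS.items():
    print(f"{name}: implied meta-d' = {implied_meta_d(row):.3f}, "
          f"CI excludes 1: {ci_excludes_one(row['ci'])}")
print("Significantly suboptimal:", suboptimal)
```

Only the Mistral interval lies entirely below 1, which matches the "significantly suboptimal" label in the list.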
AUROC2 vs M-ratio: Models ranked highest by AUROC2 (which inherits Type‑1 advantage) can be ranked lowest by M-ratio; relying on AUROC2/ECE can favor models with high raw accuracy but poor usable confidence.
Domain specificity: M-ratio varies by knowledge domain within models — a model can be metacognitively strong in one domain and weak in another (e.g., different weakest domains across models).
Temperature effects: Changing sampling temperature shifts the Type‑2 criterion (confidence policy) while meta-d′ remains stable for two of the four models, indicating that confidence policy and metacognitive capacity are separable.
Robustness and protocol: Analyses were pre-registered for three models (Gemma was added post‑registration). NLP (normalized token log‑probability) served as the continuous confidence signal, and monotonicity checks confirmed that NLP correlates with accuracy. Reported robustness checks include binning sensitivity, unequal-variance SDT (UVSDT), and replication on Natural Questions.

Data & Methods

Models: Llama-3-8B-Instruct, Llama-3-8B-Base (Meta); Mistral-7B-Instruct-v0.3; Gemma-2-9B-Instruct.
Data: TriviaQA (5,000 sampled Qs, domain-tagged), Natural Questions (3,000 short-answer Qs) → total trials: 4 models × 8,000 Qs × 7 temperatures = 224,000 trials.
Confidence signal: Normalized log-probability (NLP) of generated answer (continuous), binned into ordered categories for SDT analysis (Maniscalco & Lau format).
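The confidence pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code. It assumes NLP is the mean per-token log-probability, uses quantile bin edges, and picks four bins; none of these choices is specified in this summary. The exact count layout passed to the Maniscalco & Lau fitting routine is also not given here, so the per-rating counts below are only an approximation of that input.

```python
import bisect
import statistics


def normalized_logprob(token_logprobs):
    """Assumed normalization: mean log-probability per generated token."""
    return sum(token_logprobs) / len(token_logprobs)


def bin_confidence(scores, n_bins=4):
    """Map continuous scores to ordered ratings 1..n_bins using quantile edges."""
    edges = statistics.quantiles(scores, n=n_bins, method="inclusive")
    return [bisect.bisect_right(edges, s) + 1 for s in scores]


def rating_counts(ratings, correct, n_bins=4):
    """Count how often each rating occurs on correct and on incorrect trials."""
    n_correct = [0] * n_bins
    n_incorrect = [0] * n_bins
    for rating, ok in zip(ratings, correct):
        (n_correct if ok else n_incorrect)[rating - 1] += 1
    return n_correct, n_incorrect


# Hypothetical per-answer NLP values and correctness flags, for illustration only.
scores = [-4.0, -3.0, -2.0, -1.0, -0.8, -0.6, -0.4, -0.2]
correct = [False, False, False, True, False, True, True, True]

ratings = bin_confidence(scores)
n_correct, n_incorrect = rating_counts(ratings, correct)
print(ratings, n_correct, n_incorrect)
```

Quantile edges keep the bins roughly equally populated, which helps avoid empty cells in the fit. The summary's mention of binning-sensitivity checks suggests the choice of bins matters enough to verify.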
Main metrics:
- d′: Type‑1 sensitivity (NLP separation of correct vs incorrect).
- meta-d′: Type‑2 sensitivity estimated by maximum likelihood (Maniscalco & Lau method, Hautus correction).
- M-ratio = meta-d′ / d′: metacognitive efficiency.
- AUROC2, ECE, Brier used for comparison.
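For contrast with meta-d′ and M-ratio, the comparison metrics can be computed directly from per-trial confidence and correctness. The sketch below uses standard textbook definitions and assumes confidence has already been mapped to a probability in [0, 1]. How the paper performs that mapping from NLP is not stated in this summary, and the example data are made up.

```python
def auroc2(conf, correct):
    """P(confidence on a correct trial > confidence on an incorrect trial); ties count 0.5."""
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = [c for c, ok in zip(conf, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(conf, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - float(ok)) ** 2 for c, ok in zip(conf, correct)) / len(conf)


def ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(conf, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, float(ok)))
    total = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            total += abs(accuracy - mean_conf) * len(members) / len(conf)
    return total


# Hypothetical trials, for illustration only.
conf = [0.95, 0.85, 0.75, 0.65, 0.35, 0.25]
correct = [True, True, False, True, False, False]
print(auroc2(conf, correct), brier(conf, correct), ece(conf, correct))
```

All three scores depend on how often the model is right as well as on how confidence tracks correctness, which is the conflation the M-ratio is meant to factor out.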
Temperatures tested: T ∈ {0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0}; main hypotheses were tested at T = 1.0.
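To make the temperature manipulation concrete, the sketch below applies standard temperature scaling (dividing logits by T before the softmax) over the seven tested values. The logits are hypothetical. Whether the paper's NLP is read from the temperature-scaled distribution or from the unscaled one is not stated in this summary.

```python
import math

TEMPERATURES = [0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0]


def softmax_with_temperature(logits, temperature):
    """Standard temperature scaling: softmax over logits / T."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits, for illustration only.
logits = [2.0, 1.0, 0.0]
top_probs = [max(softmax_with_temperature(logits, t)) for t in TEMPERATURES]

for t, p in zip(TEMPERATURES, top_probs):
    print(f"T = {t}: top-token probability = {p:.4f}")
```

Higher temperatures flatten the distribution, so the top-token probability falls monotonically across the grid. This is one plausible route by which temperature could move the confidence criterion without changing what the model knows.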
Inference/compute: Quantized GGUF models (Q5_K_M), inference via llama-cpp-python with Vulkan.
Statistical inference: Bootstrap 95% CIs (10,000 resamples); pre-registered hypotheses for metacognition, domain effects, temperature dissociation, and hidden structure. Several robustness checks (binning sensitivity, unequal variance SDT, replication) reported.
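A bootstrap interval of the kind mentioned above can be sketched with a percentile bootstrap over trials. This is an assumption-laden illustration: the summary does not say which bootstrap variant the paper uses, and the statistic here is plain accuracy on synthetic data. In the paper's setting the statistic would be an M-ratio refit on each resample. The demo uses 2,000 resamples to stay fast; the default matches the 10,000 reported.

```python
import random


def bootstrap_ci(trials, statistic, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample trials with replacement, recompute the statistic."""
    rng = random.Random(seed)
    n = len(trials)
    estimates = sorted(
        statistic([trials[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def accuracy(trials):
    """Stand-in statistic; swap in an M-ratio estimator for the real analysis."""
    return sum(trials) / len(trials)


# Synthetic data: 200 trials, 140 correct (hypothetical).
trials = [1] * 140 + [0] * 60
ci_low, ci_high = bootstrap_ci(trials, accuracy, n_resamples=2_000)
print(f"accuracy = {accuracy(trials):.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```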

Implications for AI Economics

Model procurement and evaluation:
- Selecting models purely on accuracy, ECE or AUROC2 can be economically suboptimal when downstream systems use model confidence for decisions (selective prediction, routing, triage). Buyers should include M-ratio/meta-d′ in procurement metrics to capture usable uncertainty.
Value of confidence:
- The economic value of a model’s confidence signal depends on metacognitive efficiency. A lower-accuracy model with higher M-ratio can yield better cost-adjusted outcomes (fewer costly false trusts) than a higher-accuracy model with poor M-ratio.
Deployment design and costs:
- Decisions about abstention thresholds, human-in-the-loop allocation, and escalation policies should incorporate M-ratio by domain. Domain‑specific M-ratio estimates can optimize allocation of human review resources and reduce error‑related costs.
Pricing and contracts:
- SLAs and pricing for model access could be stratified by metacognitive efficiency (e.g., higher prices for models/services that reliably indicate uncertainty), or include clauses tying penalties/bonuses to post‑hoc risk metrics that relate to M-ratio.
Insurance and risk management:
- Insurers and firms assessing liability from model errors should consider M-ratio as a measure of how well the system protects against high‑cost errors via informative confidence. This affects premiums for AI-driven services.
R&D and investment priorities:
- Returns to investment differ: improving Type‑1 sensitivity (more data, larger models) vs improving metacognitive efficiency (better uncertainty estimation, calibration layers, or auxiliary heads). Meta-d′ helps quantify marginal gains in usable confidence per dollar and guides R&D allocation.
Market signaling and competition:
- Publicizing M-ratio and domain‑level metacognitive performance can become a competitive differentiator in model marketplaces, analogous to latency/throughput and accuracy metrics today.
Policy and standards:
- Regulators and standard‑setters may require disclosure of metacognitive efficiency for high‑risk applications (medical, legal, financial) because it better predicts whether a model will reliably flag its own uncertainty.
Human–AI labor substitution:
- Models with high M-ratio can more safely substitute for human labor in tasks requiring autonomous decisions; conversely, models with low M‑ratio necessitate more human oversight, changing labor cost calculations and ROI.
Experimental pricing and productization:
- Companies can design products that monetize confidence-aware features (e.g., conservative mode that abstains more often but yields higher trust), with M-ratio guiding where those features pay off.
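The selective-prediction argument running through the points above can be illustrated with a toy cost model. Everything in this sketch is hypothetical: the two models, their confidence values, and the cost figures are constructed to show the mechanism and are not taken from the paper. The sketch does not compute M-ratio itself; it only shows why confidence that separates right from wrong answers can outweigh a gap in raw accuracy.

```python
def selective_cost(conf, correct, threshold, cost_error=10.0, cost_review=1.0):
    """Average cost per query when answers below the threshold go to human review."""
    total = 0.0
    for c, ok in zip(conf, correct):
        if c < threshold:
            total += cost_review   # escalated to a human
        elif not ok:
            total += cost_error    # trusted a wrong answer
    return total / len(conf)


def best_policy(conf, correct, **costs):
    """Search observed confidence values (plus 'review everything') for the cheapest threshold."""
    candidates = sorted(set(conf)) + [float("inf")]
    best = min(candidates, key=lambda t: selective_cost(conf, correct, t, **costs))
    return best, selective_cost(conf, correct, best, **costs)


# Model A (hypothetical): 80% accurate, but confidence is identical on every answer.
conf_a = [0.8] * 10
correct_a = [True] * 8 + [False] * 2

# Model B (hypothetical): 70% accurate, with low confidence exactly on its errors.
conf_b = [0.9] * 7 + [0.2] * 3
correct_b = [True] * 7 + [False] * 3

threshold_a, cost_a = best_policy(conf_a, correct_a)
threshold_b, cost_b = best_policy(conf_b, correct_b)
print(f"Model A: threshold = {threshold_a}, cost per query = {cost_a:.2f}")
print(f"Model B: threshold = {threshold_b}, cost per query = {cost_b:.2f}")
```

With these made-up costs, the best policy for Model A is to review everything, while Model B only needs review on the answers it flags, so the less accurate model is cheaper to operate.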

Limitations to consider when applying these implications: study limited to four relatively small open‑weight models and to NLP as the confidence variable; equal-variance SDT used as a baseline (UVSDT assessed in robustness checks). M-ratio requires sufficient per-domain trials to estimate reliably; hierarchical estimation may be needed for fine-grained economic decisions.


Assessment

Paper Type: descriptive
Evidence Strength: high. Large-scale empirical evaluation (224,000 factual QA trials), pre-registered analysis, public code/data, and application of a principled Type-2 signal-detection framework provide strong internal evidence about model metacognitive properties and how common calibration metrics conflate distinct capacities.
Methods Rigor: high. Uses established Type-2 SDT (meta-d', M-ratio) rather than ad-hoc calibration metrics, compares multiple models, runs domain-specific analyses, manipulates temperature to probe causal effects on confidence policy, and follows a pre-registered protocol with public materials.
Sample: Four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) evaluated on ~224,000 factual question-answering trials spanning multiple domains; analyses include domain-stratified results and temperature variation experiments; accuracy and confidence responses were recorded per trial.
Themes: human_ai_collab, adoption, productivity
Generalizability:
- Only four small-to-medium sized models were tested; results may not hold for larger or differently trained models.
- Tasks limited to factual QA; findings may not translate to generative, reasoning, coding, or multi-turn dialogue tasks.
- Domain coverage unspecified and may not represent the full distribution of real-world user queries.
- Evaluation in controlled experimental settings may differ from interactive human-AI deployments where feedback and calibration occur over time.

Claims (9)

Claim	Direction	Confidence	Outcome	Details
Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). Decision Quality	negative	high	confounding of calibration metrics between Type-1 sensitivity (knowledge) and Type-2 metacognitive sensitivity	0.18
We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Decision Quality	positive	high	decomposition of Type-1 vs Type-2 capacities using meta-d' and M-ratio	0.3
We applied this framework to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials. Decision Quality	positive	high	empirical evaluation of models' Type-1 and Type-2 metrics across factual QA trials	n=224000 0.3
Metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar — Mistral achieves the highest d' but the lowest M-ratio. Decision Quality	mixed	high	Type-1 sensitivity (d') and metacognitive efficiency (M-ratio)	n=224000 0.18
Metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics. Decision Quality	mixed	high	domain-specific metacognitive efficiency (M-ratio) across task domains	n=224000 0.18
Temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity. Decision Quality	mixed	high	Type-2 criterion (confidence policy) and meta-d' (metacognitive capacity)	n=224000 0.18
AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. Decision Quality	mixed	high	model ranking by AUROC_2 versus model ranking by M-ratio	n=224000 0.18
The meta-d' framework reveals which models 'know what they don't know' versus which merely appear well-calibrated due to criterion placement — a distinction with direct implications for model selection, deployment, and human-AI collaboration. Decision Quality	positive	high	distinction between true metacognitive capacity and apparent calibration driven by criterion placement	n=224000 0.18
The analysis was pre-registered and code and data are publicly available. Other	positive	high	research transparency (pre-registration and public code/data)	0.3