Most large language models behave like model-based 'mentalizing' teachers in a controlled graph-teaching task, with Bayesian teacher models best explaining their choices; however, simple scaffolding prompts that help humans do not consistently improve LLM tutoring and can even reduce performance on harder test problems.

Do Large Language Models Mentalize When They Teach?

Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, Ilia Sucholutsky · April 02, 2026

arxiv descriptive medium evidence 5/10 relevance Source PDF

In a controlled graph-teaching task most contemporary LLMs choose edges consistent with a Bayes-optimal, mentalizing teacher model and achieve high teaching scores, but lightweight scaffolding prompts that help humans do not reliably improve—and can sometimes worsen—LLM teaching on held-out hard graphs.

How do LLMs decide what to teach next: by reasoning about a learner's knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner's trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models' choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.

Summary

Main Finding

Most contemporary LLMs in this controlled Graph Teaching task behave like model-based, mentalizing teachers: their trial-by-trial choices are best explained by a Bayes-Optimal Teacher (inverse planning) model, they achieve high teaching scores and show graph-wise performance profiles correlated with humans. However, lightweight scaffolding prompts (inference- or reward-focused auxiliary steps) — while complied with on the auxiliary selection step — do not reliably improve downstream teaching on heuristic-incongruent test graphs and can sometimes reduce performance. In short: prompt compliance ≠ improved teaching decisions.

Key Points

Task-level outcome
- Across a baseline of 40 trials (20 graphs + flipped versions), most LLMs produced stable, high teaching performance close to the Bayes-Optimal benchmark.
- Several LLMs’ graph-by-graph performance profiles strongly correlated with humans (r ≈ 0.76–0.89 for seven models).
- LLMs’ per-run Teaching Scores clustered in the high-performing range (less bimodality than humans), though some models were more variable.
Cognitive strategy fits
- Trial-by-trial choices were fit using the same cognitive model family used for human subjects.
- Bayesian Information Criterion (BIC) model comparison identified the Bayes-Optimal Teacher as the best fit for most LLMs.
- Heuristic and non-mentalizing utility models (e.g., reward heuristic, depth heuristic, path-average utility, Q-values) explained fewer LLM teachers than they did humans.
Scaffolding intervention
- Design: 2×3 (training congruency: Heuristic-Congruent vs Heuristic-Incongruent) × (No scaffolding, Inference scaffolding, Reward scaffolding).
- LLMs reliably performed the auxiliary scaffolding tasks as instructed (they selected edges aligned with the scaffolding prompt).
- Despite auxiliary compliance, Inference scaffolding did not consistently improve test-phase teaching; Reward scaffolding sometimes reduced test performance and weakened the Bayes-Optimal fit advantage.
- The congruency effects and intervention benefits observed in humans did not reliably replicate for LLMs.
Practical interpretation
- LLMs can “appear” to mentalize (and do so in this task), but their susceptibility to auxiliary prompts does not guarantee improved downstream decision-making; prompting can change surface behavior without improving the underlying policy.

Data & Methods

Task: Graph Teaching (adapted from Harootonian et al., 2025)
- Deterministic, directed acyclic graphs with rewards on nodes; learner knows only a subset of edges and selects an optimal path given its partial knowledge.
- Teacher observes the learner’s single trajectory, full graph & rewards, and must reveal one edge to maximally increase the learner’s eventual expected return after replanning.
- No feedback on outcomes; each trial treated as a distinct learner. Within a simulated teacher, conversation history (including prior trials and scaffold responses) was preserved.
Models fitted to each simulated teacher’s trial choices
- Bayes-Optimal Teacher (BOT; inverse planning), No-Inverse-Planning Bayesian Teacher, Prior-only Bayesian Teacher.
- Heuristics: Reward Heuristic (sum of endpoint rewards), Depth Heuristic, reward+depth linear combination.
- Non-mentalizing utilities: Tabular Q-value, Path-averaged utility.
- Choice rule: softmax over model utilities with inverse-temperature β fit per simulated teacher; model comparison via BIC.
Experimental sets
- Baseline: ~40 simulated teachers per model run; each teacher completed 40 trials (20 unique graphs + flipped).
- Scaffolding intervention: 2×3 design; simulated teachers per model per condition typically ~20–30; training consisted of 15 trials (5 training graphs × flips × repeats), followed by 5 heuristic-incongruent test graphs (no scaffolding on test).
- Prompts: text-only instructions closely matched human experiment; scaffolding conditions added an auxiliary step requesting three edges (either likely unknown edges — Inference, or edges connected to highest-value nodes — Reward). Auxiliary responses appended to history before teaching prompt.
LLMs evaluated (examples reported)
- Included a range: GPT-3.5, GPT-4o, GPT-o3-mini, GPT-4.1-mini, GPT-o4-mini, GPT-5, Llama 4 Maverick, Gemini 2.5 Flash, Qwen 3 Next 80B, Claude Haiku 4.5, Claude Sonnet 4.5 (sample sizes per model varied).
Key quantitative summaries
- Graph-wise correlations vs human profile: several models r ≈ 0.76–0.89 (p < 1e-4); others moderate; a few not significant.
- Teaching scores: LLM distributions concentrated near Bayes-Optimal; humans showed a bimodal distribution (suggesting mixture of strategies).
- BIC model fits favored BOT for most LLMs; some heterogeneity across models (a few showed mixed best-fits).

Implications for AI Economics

For deployment and product design
- Superficial prompt compliance is not a reliable signal of improved downstream behavior: designers relying solely on short scaffolding prompts to steer tutoring policies may overestimate gains. Investments in prompt engineering alone may yield limited or inconsistent returns.
- To improve tutoring quality, organizations should consider stronger interventions (fine-tuning, RLHF / policy shaping, constrained decoding, or integrated user-modeling modules) rather than only auxiliary prompts at inference time.
For evaluation and auditing
- Cognitive-model-based audits (e.g., fitting BOT vs heuristic models) are a compact, informative tool for auditing tutoring policies and differentiating surface-level compliance from deeper strategy changes. Economic evaluations of tutoring systems should include such behavioral-model benchmarks, not just output-level metrics.
For labor and market effects
- If off-the-shelf LLMs naturally exhibit mentalizing-like teaching in simple controlled tasks, they may substitute for some routine tutoring tasks, but the fragility of scaffolding effects suggests limits: robust, context-sensitive tutoring likely still requires human-in-the-loop oversight or more deeply adapted models. This nuance affects forecasts about displacement and augmentation in the tutoring labor market.
For policy and regulation
- Regulators and institutions should require empirical evidence that prompting or transparency affordances actually improve learning outcomes (not just model outputs). Claims about “adaptive” or “personalized” tutoring should be validated against downstream outcomes and behaviorally grounded models.
For cost–benefit and R&D prioritization
- Short-term, low-cost prompt interventions can change model outputs but may not justify adoption costs if they do not improve—and can sometimes harm—learning outcomes. Prioritize investments in approaches that change model policies or incorporate explicit learner models (e.g., model-based tutoring architectures), and measure ROI using downstream learning metrics.
Research agenda (economic relevance)
- Study generalizability: test whether BOT-like behavior persists in richer, longer-horizon, interactive learning settings with real student feedback.
- Compare cost-effectiveness of interventions: prompt engineering vs fine-tuning vs RL-based policy shaping on long-run student outcomes.
- Quantify externalities: how unreliable scaffolding effects affect trust, adoption, and the competitiveness of LLM-based tutoring services.

Limitations to keep in mind - The experiments are simulated: learners were synthetic (optimal planners with partial transition knowledge); real students produce richer, noisy behavior and provide feedback that could alter model behavior. - The task is narrow (single-edge reveal in small DAGs); generalization to full curricula, longer dialogues, or open-ended instruction is uncertain. - LLMs were not allowed to learn online from feedback in this setup; real deployments often incorporate user data and iterative updates.

Bottom line: Many LLMs can behave like mentalizing teachers in this controlled task, but compliance with auxiliary prompts is not a reliable mechanism for improving tutoring effectiveness. For economically meaningful deployment, firms and policymakers should prioritize deeper interventions and rigorous behavior-based evaluation.

Assessment

Paper Typedescriptive Evidence Strengthmedium — The paper provides controlled, systematic simulation experiments across many contemporary LLMs, uses established cognitive models, and reports model-comparison metrics (BIC) and trial-level analyses, giving credible internal evidence about LLM behavior in the specific task; however, claims about LLM 'mentalizing' are limited to a stylized graph-teaching paradigm with simulated learners and no real-world tutoring outcomes, reducing external validity. Methods Rigorhigh — The authors replicate a previously validated human experimental paradigm, test a broad set of LLMs with multiple simulated runs per model, fit a well-specified family of cognitive models by maximum likelihood, compare models with BIC, and run an intervention (scaffolding) with counterbalanced conditions; reporting appears thorough and directly comparable to human data. SampleSimulated 'teachers' comprised multiple independent runs (typically ~30–40 per model in baseline; ~20–30 per condition in scaffolding) of a range of commercial and open LLMs (e.g., GPT-3.5, GPT-4o, GPT-4.1-mini, GPT-o4-mini, GPT-o3-mini, GPT-5, Gemini 2.5 Flash, Claude Sonnet/Haiku 4.5, Llama 4 Maverick, Qwen 3 Next 80B). Baseline: each simulated teacher completed 40 trials (20 unique directed reward graphs plus flipped versions). Scaffolding experiment: 15 training trials (congruent or incongruent pools, with flipped repeats) followed by 5 held-out incongruent test graphs; human comparison data are from Harootonian et al. (2025) (human N shown ~100). No feedback was supplied to models during the task. Themeshuman_ai_collab skills_training GeneralizabilityTask is a stylized, deterministic graph-teaching paradigm and may not generalize to real-world tutoring or complex instructional settings, Simulations use synthetic learners and no live human learners, so downstream learning outcomes and interaction dynamics are untested, Only single-edge revelations and specific prompt templates were tested; different prompt designs, multi-turn tutoring, or richer pedagogical actions could change behavior, LLM behavior may depend on particular versions, API settings, or system prompts; study covers many but not all models and configurations, No long-run adaptation or feedback loop (models received no outcome feedback), so findings do not capture learning-by-teaching or online fine-tuning effects

Claims (9)

Claim	Direction	Confidence	Outcome	Details
Teaching Scores were essentially flat over trials for all model groups: the correlation between trial number and Teaching Score was small in magnitude for every group (\|r\| < 0.1). Decision Quality	null_result	high	Teaching Score (trial-by-trial change)	\|r\| < 0.1 0.18
Most LLMs showed strong alignment with humans in graph-by-graph performance: seven models had large positive Pearson correlations with human performance (r ≈ 0.76–0.89, all p < 10^-4), and two additional models showed moderate correlations (r ≈ 0.46–0.56, p < .05); GPT-3.5 and Llama-4 Maverick were not significantly correlated with humans. Decision Quality	positive	high	Graph-wise Teaching Score correlation with human performance	r ≈ 0.76–0.89 (seven models); r ≈ 0.46–0.56 (two models); GPT-3.5 r=0.27; Llama-4 Maverick r=-0.08 0.3
Most LLM teachers were concentrated in the higher-performing range, close to the Bayes Optimal Teacher benchmark, overlapping with higher-performing human subjects; some models (notably GPT-o4-mini, Gemini 2.5 Flash, Claude Sonnet 4.5) showed more variability. Decision Quality	positive	high	Average Teaching Score (individual-level distribution)	0.18
Bayes Optimal Teacher (BOT) provides the best overall account (via BIC) for the trial-by-trial teaching choices of most LLM models. Decision Quality	positive	high	Model fit (BIC) to LLM trial-by-trial choices	0.3
In scaffolding conditions LLMs reliably executed the auxiliary selection step: under Reward Scaffolding they preferentially selected edges ranked highest by reward, and under Inference Scaffolding they preferentially selected edges ranked as more likely unknown to the learner. Decision Quality	positive	high	Probability of choosing edges in auxiliary selection (scaffold compliance)	0.3
Despite reliably performing the auxiliary scaffolding step, scaffolding did not reliably improve LLM teaching on heuristic-incongruent test graphs and sometimes reduced performance (notably after Reward Scaffolding). Decision Quality	negative	high	Test-phase Teaching Score after scaffolding	0.3
Reward Scaffolding reduced the Bayes Optimal Teacher's advantage in model-fit (∆BIC) for a subset of models, making the Reward Heuristic relatively more competitive, but there was no broad shift toward heuristic-dominant fits analogous to humans. Decision Quality	mixed	medium	∆BIC (relative model evidence) comparing BOT and Reward Heuristic	0.11
Overall, most LLMs achieve high Teaching Scores and are best fit by the Bayes Optimal Teacher, suggesting model-based (mentalizing) teaching strategies rather than model-free heuristics. Decision Quality	positive	high	Teaching Score and cognitive model fit	0.18
Simulated teachers: for each LLM, we ran multiple independent runs treated as simulated teachers (typically around 30–40 per model in the Baseline Experiment and around 20–30 per condition × training-group cell in the Scaffolding Intervention Experiment); the conversation context was reset between teachers and preserved across trials within a teacher; LLMs were not given feedback about the outcome of their teaching to prevent learning during the task. Other	other	high	experimental design / simulation protocol (methodological claim)	0.3