Frontier LLMs can make strategically realistic, hindsight‑free inferences about unfolding crises using only contemporaneous public data, but performance is uneven: they are more reliable on economic and logistical questions than on politically ambiguous, multi‑actor scenarios. Model narratives also shifted across time points, from expecting rapid containment toward scenarios of regional entrenchment, which suggests LLMs are a useful but imperfect input for crisis‑related economic forecasting and risk pricing.

When AI Navigates the Fog of War

Ming Li, Xirui Li, Tianyi Zhou · March 17, 2026

arXiv · descriptive · medium evidence · relevance 7/10

When restricted to contemporaneous public information at 11 moments during the early 2026 Middle East conflict, frontier LLMs often produced strategically realistic inferences—especially on economic, logistical, and capacity questions—while showing weaker and more inconsistent performance on politically ambiguous multi‑actor scenarios, and their narratives shifted over time from rapid containment to regional entrenchment.

Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.

Summary

Main Finding

Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric). However, their reliability is uneven across domains (stronger on economic/logistical questions than on politically ambiguous multi-actor issues), and their narratives evolve over time from expectations of rapid containment toward scenarios of regional entrenchment and attritional de‑escalation. The study provides a temporally grounded, hindsight‑free snapshot of model reasoning during the early stages of the 2026 Middle East conflict.

Key Points

Temporally grounded evaluation: The study assesses model outputs at 11 discrete temporal nodes during the early conflict, requiring reasoning only from information public at each node to reduce training-data leakage.
Question design: 42 node‑specific verifiable questions plus 5 general exploratory questions probe both factual inferences and higher‑level strategic reasoning.
Mitigating hindsight bias: Constraining models to contemporaneous evidence creates a setting closer to real-time analysis and substantially reduces the chance that later-known outcomes leak into model answers.
Strategic realism: Models frequently infer deeper structural incentives and plausible strategic dynamics rather than repeating surface rhetoric or official statements.
Domain heterogeneity: Performance is stronger in structured domains (economic, logistical, capacity assessments) and weaker in politically ambiguous, multi‑actor scenarios where motives and alliances are fluid.
Temporal evolution in outputs: Model narratives shift across nodes—initial outputs emphasize rapid containment, while later outputs increasingly describe broader regional entrenchment and attritional trajectories.
Archival value: Because the conflict unfolded after model training cutoffs and was still ongoing at the time of writing, the dataset and analyses act as an archival benchmark allowing future researchers to study model reasoning without hindsight contamination.
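
The domain-heterogeneity point can be made concrete with a small tally of graded answers per domain. This is a minimal sketch, not the authors' evaluation code: the record layout, the domain labels, and every row of data are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Hypothetical graded records: (node index, domain label, graded correct?).
# These rows are illustrative placeholders, NOT results from the study.
records = [
    (1, "economic", True),
    (1, "political", False),
    (2, "logistical", True),
    (2, "political", True),
    (3, "economic", True),
    (3, "political", False),
]


def accuracy_by_domain(rows):
    """Return {domain: fraction of questions graded correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _node, domain, is_correct in rows:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {domain: correct[domain] / totals[domain] for domain in totals}


domain_accuracy = accuracy_by_domain(records)
for domain, acc in sorted(domain_accuracy.items()):
    print(f"{domain}: {acc:.2f}")
```

Grouping the same records by node index instead of domain would give the longitudinal view described under temporal evolution.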

Data & Methods

Case selection: Early stages of the 2026 Middle East conflict, deliberately chosen because it occurred after the training cutoff of contemporary frontier LLMs.
Temporal nodes: 11 moments during the crisis were defined to capture changing public information and uncertainty.
Question set: 42 node‑specific verifiable questions (designed to be answerable from contemporaneous public sources) plus 5 broader exploratory prompts to elicit strategic narratives.
Leakage control: For each node, models were constrained to use only information that would have been publicly available at that time, substantially reducing training-data leakage and retrospective bias.
Evaluation approach: Combination of verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes to observe narrative evolution.
Models evaluated: Current state‑of‑the‑art (frontier) LLMs whose training cutoffs precede the conflict; the study relies on the gap between those cutoffs and the unfolding crisis to test forward reasoning.
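
The leakage-control step above can be sketched as a timestamp filter applied before a prompt is built for each temporal node. This is an illustration under stated assumptions, not the paper's pipeline: the `Document` type, the `visible_at` and `build_prompt` helpers, the prompt wording, and all dates and texts are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    published: datetime  # when the item became publicly available
    text: str


def visible_at(node_time, corpus):
    """Keep only documents published at or before the node time, oldest first."""
    allowed = [doc for doc in corpus if doc.published <= node_time]
    return sorted(allowed, key=lambda doc: doc.published)


def build_prompt(node_time, corpus, question):
    """Assemble a prompt that exposes only contemporaneous sources."""
    context = "\n".join(
        f"[{doc.published:%Y-%m-%d}] {doc.text}"
        for doc in visible_at(node_time, corpus)
    )
    return (
        f"Today is {node_time:%Y-%m-%d}. Use only the sources below.\n"
        f"{context}\n\nQuestion: {question}"
    )


# Placeholder corpus and node date (hypothetical values, not from the study).
corpus = [
    Document(datetime(2026, 1, 5), "placeholder report A"),
    Document(datetime(2026, 1, 12), "placeholder report B"),
    Document(datetime(2026, 1, 20), "placeholder report C"),
]
node_time = datetime(2026, 1, 12)
prompt = build_prompt(node_time, corpus, "placeholder question?")
print(prompt)
```

A filter of this kind only controls what is placed in the prompt; it does not control what a model already knows from training, which is why the case was chosen to fall after the models' training cutoffs.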

Implications for AI Economics

Forecasting and risk pricing: LLMs can provide useful inputs for near‑term economic and logistical forecasting in crises (supply-chain disruptions, commodity market impacts, transport/logistics constraints), but their political/strategic forecasts should be used cautiously.
Asymmetric reliability: Economic agents and risk models that integrate LLM outputs should weight model inferences more heavily in structured domains (capacity estimates, trade flows, sanctions impact) and downweight or cross‑validate politically ambiguous predictions.
Real‑time market behavior: Because model narratives evolve with incoming information, automated or semi‑automated decision systems must account for shifting model priors and avoid overreacting to early outputs that favor rapid containment scenarios.
Investment and insurance: Insurers, investors, and firms can use temporally grounded LLM analyses as one input for scenario generation and stress testing, but should combine them with specialized geopolitical expertise and probabilistic calibration.
Methodological best practice: Temporal grounding (restricting models to contemporaneous information) should be adopted in economic research using LLMs to avoid leakage and to produce more realistic assessments of model forecasting ability.
Policy and governance: Regulators and policymakers relying on LLMs for crisis analysis should require documentation of temporal constraints, uncertainty quantification, and domain‑specific validation before these models influence high‑stakes decisions.
Research agenda: Further work should quantify calibration and skill over longer horizons, develop ensembles that pair LLMs with domain specialists (e.g., political scientists, logisticians), and expand temporally grounded benchmarks across different conflict types to map where LLMs add value for economic decision‑making.
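
One way to read the asymmetric-reliability recommendation is as a domain-dependent blend between an LLM-derived probability and a baseline estimate. The sketch below is an assumption-laden illustration, not a method from the paper: the `blend` function, the weight table, and its numeric values are all hypothetical and would need to be calibrated against real validation data.

```python
# Hypothetical trust weights per domain; the values are illustrative only
# and are NOT estimates reported by the study.
DOMAIN_WEIGHT = {"economic": 0.8, "logistical": 0.8, "political": 0.3}
DEFAULT_WEIGHT = 0.5  # assumed fallback for domains not listed above


def blend(p_llm, p_baseline, domain):
    """Linear blend: trust the LLM more in structured domains, less elsewhere."""
    for p in (p_llm, p_baseline):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    weight = DOMAIN_WEIGHT.get(domain, DEFAULT_WEIGHT)
    return weight * p_llm + (1.0 - weight) * p_baseline


# Same LLM estimate and baseline, different domains.
economic_estimate = blend(0.9, 0.5, "economic")
political_estimate = blend(0.9, 0.5, "political")
print(round(economic_estimate, 2), round(political_estimate, 2))
```

With identical inputs, the politically ambiguous estimate is pulled further toward the baseline, which is the downweighting behavior the recommendation describes.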

Assessment

Paper Type: descriptive
Evidence Strength: medium. Strengths include a clear temporal restriction that reduces hindsight leakage, a sizeable set of node-specific verifiable questions, and longitudinal analysis showing narrative evolution; weaknesses are single-case scope (one conflict), reliance on qualitative coding for strategic reasoning, limited detail on inter-rater reliability and quantitative metrics of calibration/skill, and uncertain coverage of different model versions.
Methods Rigor: medium. Methods are careful about leakage control and include verifiability checks and longitudinal design, but rigor is limited by potential selection bias in node/question choice, subjective coding of strategic realism, limited transparency about model versions and evaluation protocols, and lack of formal statistical calibration of predictive skill.
Sample: Frontier large language models evaluated at their post-cutoff states on the early stages of the 2026 Middle East conflict, using 11 temporal nodes capturing public-information snapshots; 42 node-specific verifiable questions plus 5 broader exploratory prompts; evaluation combined factual verifiability checks and qualitative coding of strategic reasoning, with models restricted to contemporaneous public sources at each node.
Themes: human_ai_collab, adoption
Identification: Temporal grounding: evaluate frontier LLMs at 11 predefined temporal nodes during the early 2026 Middle East conflict and constrain model prompts and verifiability checks to information that was publicly available at each node to reduce training-data leakage and hindsight contamination; combine node-specific verifiable questions (42) with 5 exploratory prompts and qualitative coding of strategic reasoning.

Generalizability

Single case study: findings derived from one geopolitical crisis and may not generalize to other conflict types or non-conflict settings.
Model-version and deployment variability: results depend on specific frontier LLMs and their post-cutoff states; future or alternative models may differ substantially.
Prompt and question set dependence: chosen temporal nodes and the 47 prompts shape findings and may not represent the space of possible queries analysts would use.
Language, regional, and source biases: public-information availability and media coverage patterns in this conflict affect verifiability and model performance.
Outcome scope: assesses model reasoning and narrative generation, not downstream economic impacts or decision-maker behavior, limiting direct applicability to real-world economic outcomes.

Claims (13)

Claim	Category	Direction	Confidence	Outcome	Details
Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric).	Decision Quality	positive	medium	reasoning quality / frequency of responses exhibiting strategic realism (qualitative coding of model outputs)	n=47 0.11
Model reliability is uneven across domains: performance is stronger on structured economic and logistical questions than on politically ambiguous, multi-actor strategic issues.	Decision Quality	mixed	medium	domain-specific accuracy/reliability (economic/logistical vs. political/strategic)	n=47 0.11
Model narratives evolve over time: earlier node outputs emphasize rapid containment, while later node outputs increasingly describe regional entrenchment and attritional de-escalation scenarios.	Decision Quality	mixed	medium	narrative framing over time (frequency of containment vs. entrenchment/attrition themes)	n=11 0.11
Temporally grounding model inputs (constraining models to contemporaneous public information at each node) substantially reduces the risk of training-data leakage and hindsight bias.	Research Productivity	positive	high	presence/absence or reduction of training-data leakage/hindsight bias (procedural control)	0.18
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.	Research Productivity	null_result	high	number and type of questions/prompts used	n=47 0.18
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.	Research Productivity	null_result	high	number of temporal nodes	n=11 0.18
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.	Research Productivity	null_result	high	evaluation components (verifiability checks, qualitative coding, longitudinal analysis)	0.18
Because the conflict unfolded after the training cutoffs of contemporary frontier LLMs, the dataset and analyses provide an archival, hindsight-free benchmark for studying model reasoning.	Research Productivity	positive	medium	availability of a hindsight-free archival benchmark (dataset existence and timing relative to model cutoffs)	0.11
LLMs can provide useful inputs for near-term economic and logistical forecasting in crises (e.g., supply-chain disruptions, commodity market impacts, transport/logistics constraints), but their political/strategic forecasts should be used cautiously.	Decision Quality	mixed	medium	usefulness for forecasting (economic/logistical forecasting accuracy/utility vs. political/strategic forecast reliability)	n=47 0.11
Economic agents and risk models that integrate LLM outputs should weight inferences more heavily in structured domains (capacity estimates, trade flows, sanctions impact) and downweight or cross-validate politically ambiguous predictions.	Decision Quality	positive	low	recommended weighting/usage strategy for LLM-derived inputs in economic risk models (prescriptive)	0.05
Because model narratives evolve with incoming information, automated or semi-automated decision systems must account for shifting model priors and avoid overreacting to early outputs that favor rapid containment scenarios.	Decision Quality	mixed	low	risk of overreaction / need for accounting for evolving model priors (operational recommendation)	0.05
Temporal grounding (restricting models to contemporaneous information) should be adopted as a methodological best practice in economic research using LLMs to avoid leakage and produce more realistic assessments of model forecasting ability.	Research Productivity	positive	medium	recommended methodological practice adoption (procedural recommendation)	0.11
Future research should quantify calibration and skill of LLMs over longer horizons, develop ensembles that pair LLMs with domain specialists, and expand temporally grounded benchmarks across different conflict types.	Research Productivity	null_result	speculative	future research outputs (calibration metrics, ensemble methods, expanded benchmarks)	0.02