Pretrained 'foundation' time-series models beat dataset-specific machine-learning models across 54 energy datasets, with the largest gains when covariates are used and at higher aggregation levels. The gains hold across data categories; forecast accuracy correlates with spectral entropy and saturates beyond a certain context length.

FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

Marco Obermeier, Marco Pruckner, Florian Haselbeck, Andreas Zeiselmair · April 24, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Pretrained time-series foundation models consistently outperform dataset-specific machine-learning approaches across 54 datasets and 9 data categories, especially when using covariates and at higher aggregation levels; predictive performance correlates with spectral entropy and saturates beyond a certain context length.

Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.

Summary

Main Finding

Foundation (time-series) models pretrained across many domains substantially outperform dataset-specific machine-learning baselines (XGBoost, random forest) on a broad, use-case differentiated benchmark of energy time-series forecasting (FETS). This holds across zero-shot and fine-tuned settings and across 54 datasets in 9 energy data categories — even though the classical baselines were trained with full access to historic target data. Covariate-informed foundation models achieve the largest gains.

Key Points

Benchmark: FETS assembles 54 openly available energy datasets spanning 9 categories (electricity, heat, mobility, grid, market, generation, etc.) organized by stakeholder-driven use cases and forecast attributes (horizon, resolution, aggregation).
Models compared: modern time-series foundation models (e.g., Chronos-2 and other recent TSFMs supporting univariate and covariate modes) versus task-specific classical ML (XGBoost, random forest) with domain-style feature engineering.
Settings: univariate zero-shot, covariate zero-shot, and task-specific fine-tuning; multiple horizons and context lengths; evaluation includes probabilistic outputs (quantiles/uncertainty).
Core empirical results:
- Foundation models consistently outperform dataset-specific optimized ML across all examined settings and data categories.
- Covariate-informed foundation models deliver the strongest performance gains.
- Predictive performance correlates strongly with spectral entropy of the series (datasets with lower spectral entropy are easier to forecast).
- There is performance saturation beyond a certain context (history) length — more context helps up to a point, then yields diminishing returns.
- Foundation models perform relatively better at higher aggregation levels (national load, district heating, power-grid aggregates).
Practical benefits emphasized: scalability, generalization to new assets or markets, robustness under data scarcity and privacy constraints (zero-shot capability reduces need to share sensitive historical data).
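The spectral-entropy finding in the key points can be made concrete with a small sketch. The summary does not give the paper's exact definition, so the code assumes a common one: the Shannon entropy of the normalized periodogram, scaled to the range 0 to 1. The function name `spectral_entropy` and its parameters are illustrative, not taken from the benchmark.

```python
import numpy as np

def spectral_entropy(x, normalize=True):
    """Shannon entropy of the normalized power spectrum of a 1-D series.

    Assumed definition (not confirmed by the source): periodogram via FFT,
    zero-frequency bin dropped, entropy optionally divided by log(n_bins).
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove the mean level
    power = np.abs(np.fft.rfft(x)) ** 2    # unscaled periodogram
    power = power[1:]                      # drop the zero-frequency bin
    total = power.sum()
    if len(power) < 2 or total == 0:
        return 0.0                         # too short or no spectral content
    p = power / total                      # treat the spectrum as a distribution
    p = p[p > 0]                           # avoid log(0)
    h = -np.sum(p * np.log(p))
    if normalize:
        h /= np.log(len(power))            # 1.0 corresponds to a flat spectrum
    return float(h)

# A strongly periodic series versus white noise
t = np.arange(1024)
periodic = np.sin(2 * np.pi * t / 32)
noise = np.random.default_rng(0).standard_normal(1024)
print(spectral_entropy(periodic), spectral_entropy(noise))
```

Under this definition a series dominated by a few periodic components scores near 0 and a noise-like series scores near 1, which matches the intuition that low-entropy series are easier to forecast.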

Data & Methods

Dataset collection: 54 datasets representative of energy forecasting tasks (short-term focus), covering electricity loads, renewable generation (PV, wind), market prices, grid flows, district heating, EV/mobility loads, and ancillary-service-related signals. Datasets are organized by stakeholders and forecast attributes (horizon, aggregation, resolution).
Model suite:
- Time-series foundation models leveraging large-scale pretraining and patch/token embeddings; models evaluated include transformer-based TSFMs supporting covariates (e.g., Chronos-2 and peers), and other architectures (state-space, xLSTM-style) where applicable.
- Classical baselines: task-specifically trained XGBoost and random forest with energy-typical features (lags, calendar, weather covariates where available).
Evaluation protocol:
- Zero-shot evaluation: pretrained TSFMs applied without task-specific training.
- Fine-tuning: TSFMs adapted on target task data.
- Baselines fully trained on each dataset (had access to full historic target series).
- Forecast horizons and context lengths varied; performance aggregated across datasets and data categories. Probabilistic forecasts (quantiles/uncertainty bands) considered.
Analyses: performance vs. spectral entropy, impact of context length, aggregation-level effects.
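The evaluation protocol mentions probabilistic forecasts expressed as quantiles. The summary does not say which metric the benchmark uses to score them; a common choice is the pinball (quantile) loss, sketched below as an assumption. The names `pinball_loss` and `mean_quantile_loss` are hypothetical helpers, not part of the benchmark.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for a forecast of quantile level q in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) is weighted by q, over-prediction by 1 - q
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def mean_quantile_loss(y_true, quantile_preds):
    """Average pinball loss over a dict mapping quantile level -> predictions."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in quantile_preds.items()]
    return float(np.mean(losses))

# Illustrative values only: three observations and two predicted quantiles
y = [10.0, 12.0, 14.0]
preds = {0.1: [9.0, 10.5, 12.0], 0.9: [11.5, 13.0, 15.5]}
print(mean_quantile_loss(y, preds))
```

Averaging the loss over several quantile levels gives one score per dataset that rewards both sharp and well-calibrated uncertainty bands, which is one way results could be aggregated across datasets and categories.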

Implications for AI Economics

Economies of scale and winner-takes-most dynamics:
- Pretrained foundation models create strong economies of scale: one large pretraining expense can serve many downstream forecasting tasks. This favors centralized providers able to amortize pretraining costs, potentially concentrating market power in forecast-service provision.
Reduced data-dependence and lowered entry costs:
- Zero-shot and few-shot capabilities lower the need for large local historical datasets and extensive task-specific engineering. Small suppliers and new market entrants (e.g., distributed asset owners) can obtain high-quality forecasts without costly data collection or specialist teams.
Value of pretraining data and data markets:
- The benchmark highlights the value of diverse pretraining corpora. This raises questions about pricing, ownership, and trading of time-series datasets (who benefits from sharing data that improves foundation models).
Privacy, regulation, and data governance:
- Strong zero-shot performance helps in privacy-sensitive contexts (less need to share raw history), but also concentrates value in entities that control pretraining corpora. Regulatory frameworks will need to address transparency, liability, and access to pretrained models in critical infrastructure contexts.
Labor and skill-shift effects:
- Demand may shift from per-task model engineering to (a) curation of pretraining data, (b) fine-tuning/validation for local constraints, and (c) model-risk oversight and integration. This changes labor demand within energy analytics and consulting markets.
Market efficiency and welfare:
- Better forecasts at scale can reduce operational inefficiencies (e.g., imbalance costs, reserve procurement, congestion redispatch), potentially lowering system costs and improving integration of renewables. However, distributional impacts depend on who captures the forecast value (central providers vs. local operators).
Pricing of forecasting services and competitive dynamics:
- With superior pretrained models, there will be product differentiation (pretrained generalist vs. bespoke local models). Pricing strategies (subscription, per-forecast, licensing of model weights, or data-access fees) and contract design will affect competition and access.
Research/practical priorities for AI economics:
- Quantify welfare gains across levels (national system, utilities, retail consumers).
- Study market structure impacts: concentration risk, barriers to entry, pricing power of pretrained model providers.
- Value-of-data analyses: marginal contribution of different dataset classes to pretraining value.
- Policy design: recommendations for data-sharing incentives, open-pretraining initiatives for public goods, and regulatory guardrails for critical infrastructure forecasting.

Suggested next steps for researchers: measure the economic value of improved forecasts from foundation models in concrete market settings (imbalance cost reduction, reserve procurement), evaluate alternative business models for provisioning pretrained forecasts, and model the competitive implications of centralized pretraining on energy analytics markets.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents broad empirical benchmarking across 54 datasets and multiple forecasting settings, which supports external validity for forecasting performance claims; however, it does not establish causal mechanisms, may depend on choices of baselines, pretraining data, and tuning, and lacks evidence of real-world operational impacts (e.g., economic outcomes), so confidence in general-purpose superiority is substantial but not definitive.
Methods Rigor: high. The study assembles a large, structured benchmark (54 datasets, 9 categories), compares foundation models to dataset-specific optimized ML baselines across multiple settings, and conducts diagnostic analyses (spectral entropy, context-length, aggregation effects); these elements indicate careful experimental design, though results still depend on implementation details (hyperparameter tuning parity, pretraining data composition, compute budgets) that would determine reproducibility.
Sample: A heterogeneous collection of 54 time-series datasets spanning 9 energy-related data categories (including national load, district heating, power grid data and other metered series), with varying temporal resolutions, geographic coverage and aggregation levels; evaluation includes both covariate-informed and target-only forecasting tasks across multiple forecasting contexts.
Themes: productivity, adoption
Generalizability:
- Focused on energy time series; may not generalize to non-energy domains or non-temporal tasks.
- Performance depends on pretraining data composition and scale; different pretraining regimes could change results.
- Operational constraints (latency, compute cost, data privacy) not fully evaluated for deployment.
- Benchmarks may be sensitive to baseline tuning, hyperparameter choices, and metric selection.
- Findings about context-length saturation and aggregation may not hold for very long horizons or rare-event forecasting.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories.	Output Quality	positive	high	predictive performance of time series forecasts (forecast accuracy/output quality)	n=54 0.3
Covariate-informed foundation models achieve the strongest performance.	Output Quality	positive	high	predictive performance of covariate-informed vs non-covariate models	n=54 0.18
Foundation models outperform classical machine learning approaches despite the latter having seen the full historic target data during training.	Output Quality	positive	high	forecasting accuracy when classical ML had access to full historic targets	n=54 0.3
There is a strong correlation between predictive performance and spectral entropy.	Output Quality	mixed	medium	correlation between predictive performance and spectral entropy of time series	n=54 0.11
Predictive performance exhibits saturation beyond a certain context length.	Task Completion Time	null_result	high	change in forecast accuracy as context length increases	0.18
Foundation models show improved performance at higher aggregation levels such as national load, district heating, and power grid data.	Output Quality	positive	high	forecast accuracy stratified by aggregation level	0.18
The FETS benchmark collects and analyzes 54 datasets across 9 data categories guided by typical stakeholder interests.	Other	positive	high	breadth of dataset coverage (count and categories)	n=54 0.3
Foundation models are strong potential solutions for scalable and generalizable forecasting in the energy domain, particularly in data-constrained and privacy-sensitive settings.	Adoption Rate	positive	medium	suitability of foundation models for data-constrained and privacy-sensitive forecasting contexts	n=54 0.02
The paper provides a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories.	Other	positive	high	existence of a structured overview framework	0.18