Next‑generation time‑series foundation models beat classical forecasters in zero‑shot prediction for cloud workloads, but they do not automatically improve server consolidation outcomes; choosing the right predictive quantile matters more for balancing efficiency and reliability.

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li, Hongkai Li, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu · June 11, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

Foundation time-series models give superior zero-shot forecasting accuracy on diverse cloud workloads, but that accuracy does not automatically yield better consolidation decisions—careful calibration of predictive quantiles is the key lever to balance utilization gains against service reliability.

Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.

Summary

Main Finding

Although modern time-series foundation models (pretrained, zero-shot) achieve superior forecasting accuracy on diverse cloud workloads, that accuracy does not automatically translate into better downstream consolidation decisions. The paper introduces CloudCons, an end-to-end benchmark showing that decision utility depends critically on how forecasts are used (not just point accuracy). In particular, selecting predictive quantiles (i.e., calibration of probabilistic outputs) is a decisive lever to trade off resource efficiency versus service reliability.

Key Points

Problem: cloud operators over-provision to meet SLAs, which keeps utilization low (typical CPU utilization is ~15–20%). Forecast-then-optimize consolidation can reduce the number of active physical machines (PMs) by anticipating demand and packing VMs accordingly, but the link between forecast accuracy and consolidation utility had not been verified.
Contribution: CloudCons — a publicly released, end-to-end benchmark that (a) assembles multi-cloud datasets, (b) runs forecast-then-optimize pipelines, and (c) evaluates both forecasting and consolidation utility across several dimensions.
Pivotal empirical result: foundation models often dominate on zero-shot forecasting metrics, but their advantage frequently fails to improve the optimization outcomes (fewer active PMs, load balance, reliability) unless forecasts are translated into decisions with calibrated uncertainty (quantiles).
Actionable insight: predictive quantile selection (e.g., choosing a higher quantile to be conservative) is a simple, powerful policy knob to balance utilization gains against risk of SLA violations. Properly tuned quantiles can yield practical improvements even when point-forecast advantages are modest.
Benchmark scope: covers statistical baselines, supervised deep models, and several state-of-the-art time-series foundation models; covers multiple optimization approaches (heuristics, metaheuristics, exact solver).

Data & Methods

Datasets (multi-cloud, preprocessed with a unified pipeline):
- Huawei2025: 174 series, ~3.35M points; mixed diurnal and stochastic patterns.
- Azure2019: 10,800 series, ~104.5M points; “pulse-like” workloads: long flat spots plus sharp bursts.
- Borg2019-d & Borg2019-e: 414 and 618 series respectively, several million points; Borg-d = near-zero baseline with high-frequency jitter, Borg-e = strong 24-hr diurnal synchronization.
- Data cleaning: remove very short/low-variance series, interpolate missing values, aggregate to multiple granularities (5-min, 30-min, 1-hour).
- Time-series features extracted: seasonality, spikiness, autocorrelation (ACF), flat spots, average utilization.
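The cleaning steps listed above (interpolating missing values, then aggregating to coarser granularities) could be sketched as follows. This is a minimal illustration, not the paper's pipeline: the function names, the choice of linear interpolation, and mean aggregation are all assumptions, and the sample values are made up.

```python
from typing import List, Optional


def interpolate_missing(values: List[Optional[float]]) -> List[float]:
    """Linearly interpolate None gaps; edge gaps copy the nearest observation."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(values)
    # Leading and trailing gaps: repeat the nearest observed value.
    for i in range(known[0]):
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        out[i] = values[known[-1]]
    # Interior gaps: walk the straight line between neighbouring observations.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def aggregate(values: List[float], factor: int) -> List[float]:
    """Average consecutive blocks of `factor` points; drop an incomplete tail."""
    n_blocks = len(values) // factor
    return [sum(values[b * factor:(b + 1) * factor]) / factor for b in range(n_blocks)]


# Hypothetical 5-minute CPU readings with gaps (None = missing sample).
raw_5min = [10.0, None, 30.0, 40.0, None, None, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0]
clean_5min = interpolate_missing(raw_5min)
series_30min = aggregate(clean_5min, factor=6)  # six 5-minute points per 30-minute bucket
```

Mean aggregation smooths bursts; an operator more concerned with peaks might aggregate with `max` instead, which is an equally plausible reading of "aggregate".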
Forecasting stage (Stage I):
- Statistical models: AutoARIMA, AutoETS, AutoTheta.
- Deep learning: DeepAR, TFT, DLinear, PatchTST.
- Foundation (pretrained) models evaluated zero-shot: Moirai 2, Chronos 2, TimesFM 2.5, Sundial, TOTO, FlowState-9.1M, Kairos 50M.
- Outputs: point and probabilistic forecasts (quantiles / predictive CDF).
Optimization stage (Stage II):
- Problem: dynamic bin-packing formulation to assign N VMs to M PMs minimizing the number of active PMs, subject to capacity constraints across the forecast horizon and assignment constraints (binary x_ij, y_j variables).
- Solvers/evaluated methods:
  - Heuristics: First Fit Decreasing (FFD), Best Fit Decreasing (BFD) — low latency, industry-standard.
  - Meta-heuristic: Ant Colony Optimization (ACO).
  - Exact: Gurobi mixed-integer solver (used as the performance upper bound and reference benchmark).
- Decision procedure: use forecasted demands (or a chosen predictive quantile) for capacity checks across the horizon; placement kept static for the horizon to reduce migration overhead.
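To make the decision procedure concrete, here is a hedged sketch of First Fit Decreasing with capacity checks at every step of the forecast horizon and a static placement. It assumes identical PMs, a single resource dimension, and sorting by peak forecast demand; these simplifications, the function name, and the demand numbers are illustrative assumptions rather than details taken from the paper.

```python
from typing import Dict, List


def first_fit_decreasing(demand: Dict[str, List[float]], capacity: float) -> List[List[str]]:
    """Pack VMs onto identical PMs so that summed demand fits at every forecast step.

    `demand` maps a VM id to its per-step forecast demand over the horizon,
    e.g. a chosen predictive quantile. Returns one list of VM ids per active PM.
    """
    horizon = len(next(iter(demand.values())))
    # Largest peak demand first (the "decreasing" part of FFD).
    order = sorted(demand, key=lambda vm: max(demand[vm]), reverse=True)
    loads: List[List[float]] = []    # per-PM aggregated load at each step
    placement: List[List[str]] = []  # per-PM list of assigned VMs
    for vm in order:
        if max(demand[vm]) > capacity:
            raise ValueError(f"{vm} exceeds the capacity of a single PM")
        for pm, load in enumerate(loads):
            # First fit: take the first PM that stays within capacity at all steps.
            if all(load[t] + demand[vm][t] <= capacity for t in range(horizon)):
                loads[pm] = [load[t] + demand[vm][t] for t in range(horizon)]
                placement[pm].append(vm)
                break
        else:
            # No open PM fits, so activate a new one.
            loads.append(list(demand[vm]))
            placement.append([vm])
    return placement


# Hypothetical demand (percent of one PM) over a two-step horizon.
q50 = {"vm_a": [50, 20], "vm_b": [20, 50], "vm_c": [30, 30]}
q90 = {"vm_a": [70, 40], "vm_b": [40, 70], "vm_c": [50, 50]}
pms_q50 = first_fit_decreasing(q50, capacity=100)
pms_q90 = first_fit_decreasing(q90, capacity=100)
```

In this toy case the median forecast packs all three VMs onto one PM, while the more conservative 90th-percentile forecast needs three PMs. That gap is the efficiency cost paid for the extra safety margin.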
Evaluation suite (five dimensions):
- Prediction error: MASE (mean absolute scaled error, a scale-independent counterpart to MAE) and CRPS (continuous ranked probability score).
- Resource efficiency: Util (utilization ratio = consumed resources / capacity of active PMs).
- Load balance and service reliability: measured via metrics such as peak-to-average ratio and violation counts (paper reports PAR, VS, VR but centers on trade-offs between utilization and reliability).
- Uncertainty quantification: PICP (prediction interval coverage probability), MPIW (mean prediction interval width), and the Winkler score.
- End-to-end decision utility: how forecasting choices translate into fewer active PMs, migration frequency, and SLA violations.
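Several of the metrics named above have standard textbook definitions, sketched below in plain Python. These follow the common definitions and may differ in detail from the benchmark's own implementation. CRPS and the load-balance and violation metrics are omitted, and the sample numbers are made up.

```python
from typing import Sequence


def mase(y_true: Sequence[float], y_pred: Sequence[float],
         y_train: Sequence[float], season: int = 1) -> float:
    """Forecast MAE divided by the in-sample MAE of a seasonal-naive forecast."""
    mae = sum(abs(a - f) for a, f in zip(y_true, y_pred)) / len(y_true)
    naive = [abs(y_train[t] - y_train[t - season]) for t in range(season, len(y_train))]
    return mae / (sum(naive) / len(naive))


def picp(y_true: Sequence[float], lower: Sequence[float], upper: Sequence[float]) -> float:
    """Share of observations that fall inside the prediction interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def mpiw(lower: Sequence[float], upper: Sequence[float]) -> float:
    """Mean width of the prediction interval."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


def winkler(y_true: Sequence[float], lower: Sequence[float],
            upper: Sequence[float], alpha: float) -> float:
    """Interval width plus a 2/alpha penalty per unit of miss, averaged over steps."""
    total = 0.0
    for y, lo, hi in zip(y_true, lower, upper):
        score = hi - lo
        if y < lo:
            score += 2.0 / alpha * (lo - y)
        elif y > hi:
            score += 2.0 / alpha * (y - hi)
        total += score
    return total / len(y_true)


def utilization(consumed: Sequence[float], active_pm_capacity: Sequence[float]) -> float:
    """Consumed resources divided by the total capacity of active PMs."""
    return sum(consumed) / sum(active_pm_capacity)


# Made-up numbers: an 80% interval (alpha = 0.2) that misses the last point.
y_train = [10, 12, 11, 13, 12]
y_true, y_pred = [14, 12, 15], [13, 13, 14]
lower, upper = [11, 11, 12], [15, 15, 14]
scores = {
    "MASE": mase(y_true, y_pred, y_train),
    "PICP": picp(y_true, lower, upper),
    "MPIW": mpiw(lower, upper),
    "Winkler": winkler(y_true, lower, upper, alpha=0.2),
    "Util": utilization([30, 50, 20], [100, 100]),
}
```

PICP and MPIW are best read together: an interval can reach high coverage simply by being wide. The Winkler score combines both by adding a penalty for each miss to the interval width.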

Implications for AI Economics

Forecast quality is necessary but not sufficient for economic gains. Procurement decisions that focus only on forecast-error benchmarks risk overpaying for models whose accuracy does not materially reduce operational costs or SLA penalties.
Foundation models provide valuable zero-shot forecasting that reduces the need for dataset-specific retraining — this lowers model maintenance, compute, and latency costs. But the economic benefit materializes only after coupling forecasts to decision-aware calibration (quantiles) and appropriate optimization.
Predictive quantiles are an operational policy instrument analogous to risk buffers in economic decision-making:
- Choosing higher quantiles (more conservative) reduces risk of SLA violations but lowers resource efficiency (fewer PMs reclaimed).
- Choosing lower quantiles increases utilization and cost savings but raises probability (and expected cost) of violations.
- Operators should perform cost–benefit analysis: trade expected savings from shutting down PMs vs expected penalty / cost of SLA violations and migrations.
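The cost-benefit comparison described above can be sketched as a selection over backtested quantiles. Everything here is hypothetical: the `BacktestResult` type, the linear cost model, and all numbers are illustrative assumptions, not results or methods from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BacktestResult:
    quantile: float   # predictive quantile used for capacity checks
    active_pms: int   # PMs left powered on by the consolidation stage
    violations: int   # steps where realized load exceeded PM capacity


def total_cost(r: BacktestResult, pm_cost: float, violation_penalty: float) -> float:
    """Assumed linear cost: running cost per active PM plus a penalty per violation."""
    return r.active_pms * pm_cost + r.violations * violation_penalty


def pick_quantile(results: List[BacktestResult], pm_cost: float,
                  violation_penalty: float) -> BacktestResult:
    """Return the backtest entry with the lowest total cost."""
    return min(results, key=lambda r: total_cost(r, pm_cost, violation_penalty))


# Hypothetical backtest: higher quantiles keep more PMs on but violate less often.
results = [
    BacktestResult(quantile=0.50, active_pms=40, violations=30),
    BacktestResult(quantile=0.80, active_pms=46, violations=8),
    BacktestResult(quantile=0.90, active_pms=50, violations=3),
    BacktestResult(quantile=0.99, active_pms=60, violations=0),
]
cheap_violations = pick_quantile(results, pm_cost=10.0, violation_penalty=5.0)
costly_violations = pick_quantile(results, pm_cost=10.0, violation_penalty=50.0)
```

With a low violation penalty the 0.80 quantile is cheapest; raising the penalty tenfold shifts the choice to 0.99. The same forecasts therefore lead to different placements depending on the operator's risk and cost profile.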
Recommendations for practitioners and economists analyzing AI investments:
- Use decision-centric benchmarks (like CloudCons) that measure downstream economic metrics (utilization savings, expected SLA penalty) rather than only forecasting metrics.
- When evaluating foundation models, include the cost of calibration/tuning and the operational testing needed to select quantiles that match organizational risk preferences.
- Compare total cost of ownership: model licensing, training, and serving costs, plus optimization compute and migration overheads, against expected resource savings and avoided infrastructure costs.
- Consider hybrid operational policies: use fast heuristics (FFD/BFD) with conservative quantiles for latency-sensitive systems; use higher-quality optimization (e.g., Gurobi or metaheuristics) during low-load windows when computation time is available.
Research & policy directions:
- Design cost-aware benchmarks that directly map forecast + consolidation outcomes into monetary metrics (e.g., $ saved per month, expected SLA cost).
- Study market dynamics: willingness-to-pay for foundation models should reflect downstream decision utility, not raw forecasting accuracy.
- Encourage standardized reporting: when vendors claim forecasting superiority, require evidence of end-to-end economic impact (resource savings, SLA impact) or at least calibrated decision-utility evaluations.

Overall, CloudCons highlights that in AI-for-operations contexts the relevant economic question is not “which model predicts best?” but “which forecasting + decision pipeline yields the best economic outcome under our risk/cost profile?” The benchmark provides tools and datasets to answer that question empirically.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper uses large, real-world trace datasets from three major cloud providers and evaluates a wide range of forecasting approaches end-to-end against a concrete downstream consolidation task, giving credible empirical evidence; however, the results are based on offline simulation/benchmarks (not live production deployments), depend on specific consolidation/policy and cost assumptions, and do not establish causal mechanisms beyond algorithmic performance differences.

Methods Rigor: medium. The authors construct multi-provider, high-quality datasets, compare statistical, deep-learning, and foundation models, and analyze decision utility with sensitivity to predictive quantiles, indicating careful experimental design; nevertheless, important implementation and evaluation details (e.g., consolidation controller specifics, cost/reliability weightings, hyperparameter tuning parity, and absence of production A/B tests) limit methodological rigor from a deployment-causal perspective.

Sample: Large-scale time-series traces of resource usage from three cloud environments (Huawei Cloud, Microsoft Azure, and Google Borg) covering diverse workload types (synchronized diurnal patterns, stochastic pulse-like bursts, and high-frequency noise); used for offline, end-to-end forecasting and simulated consolidation experiments (exact time spans and instance counts not specified in the summary).

Themes: productivity, adoption, innovation

Generalizability:
- Results may depend on specific workload mixes and may not generalize to other cloud providers or on-premise clusters.
- Findings rely on offline simulation of consolidation policies rather than live production deployments or A/B tests.
- Decision-utility outcomes depend on chosen cost models, SLA definitions, and consolidation controllers, which may differ across operators.
- Model implementations, training budgets, and hyperparameter choices may affect relative performance; results may change with alternative configurations.
- Dataset time periods or rare-event occurrences (e.g., flash crowds, outages) may be underrepresented.

Claims (10)

Claim	Category	Direction	Outcome	Confidence & Evidence	Details
Resource utilization in cloud data centers remains at low levels due to conservative over-provisioning to guarantee service reliability.	Organizational Efficiency	negative	resource utilization (low levels) driven by over-provisioning	Reading fidelity: high; study strength: medium	0.18
The forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands.	Organizational Efficiency	positive	effectiveness of consolidation via anticipating future demands	Reading fidelity: high; study strength: medium	0.18
Existing benchmarks for time-series forecasting focus solely on prediction error metrics; the decision utility of advanced forecasting (foundation) models remains unverified.	Decision Quality	negative	coverage of evaluation metrics (prediction error vs decision utility)	Reading fidelity: high; study strength: medium	0.18
We propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation.	Research Productivity	positive	availability of an end-to-end benchmark for cloud consolidation forecasting	Reading fidelity: high; study strength: high	0.3
We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise.	Research Productivity	positive	diversity of workload patterns represented in datasets	Reading fidelity: high; study strength: high	0.3
We conduct an extensive evaluation of statistical, deep learning, and foundation models on the CloudCons benchmark.	Research Productivity	positive	comparative evaluation coverage across model classes	Reading fidelity: high; study strength: high	0.3
Time-series foundation models demonstrate superior zero-shot forecasting accuracy.	Output Quality	positive	zero-shot forecasting accuracy	Reading fidelity: high; study strength: medium	0.18
The superior zero-shot forecasting accuracy of foundation models does not inherently translate into better decision utility for resource consolidation.	Decision Quality	null_result	decision utility in resource consolidation (trade-off between resource efficiency and service reliability)	Reading fidelity: high; study strength: medium	0.18
Selection of predictive quantiles is a critical lever: by calibrating predictive quantiles, one can balance the trade-off between resource efficiency and service reliability.	Organizational Efficiency	positive	trade-off between resource efficiency and service reliability as affected by predictive quantile choice	Reading fidelity: high; study strength: medium	0.18
CloudCons and the authors' analyses provide actionable guidelines and vital insights for real-world deployment decisions of forecasting-driven consolidation.	Adoption Rate	positive	practical guidance for deployment decisions	Reading fidelity: medium; study strength: low	0.05