A safe deep reinforcement learning controller can halve simulated building heating costs while always honoring grid flexibility requests; the adaptive safety filter secures compliance but incurs a small rise in comfort breaches compared with unconstrained RL.

Safe Deep Reinforcement Learning for Building Heating Control and Demand-side Flexibility

Colin Jüni, Mina Montazeri, Yi Guo, Federica Bellizio, Giovanni Sansavini, Philipp Heer · April 17, 2026

arXiv · descriptive · low evidence · 7/10 relevance

A safe deep RL heating controller with a real-time adaptive safety filter achieves up to 50% simulated energy/cost savings versus a rule-based controller while guaranteeing compliance with operator flexibility requests and slightly increasing comfort violations relative to an unconstrained RL agent.

Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety-filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy and cost efficiency -- achieving up to 50% savings compared to a rule-based controller -- while outperforming a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.

Summary

Main Finding

A model-free safe deep reinforcement learning (DRL) framework — combining a Deep Deterministic Policy Gradient (DDPG) controller with a Real-time Adaptive Safety Filter (RASF) — can optimize building space heating to provide demand-side flexibility while guaranteeing full compliance with operator flexibility requests. In simulation on a living-lab apartment, the approach achieves up to 50% energy/cost savings versus a rule-based controller and outperforms an unconstrained DRL agent on energy and cost metrics, at the cost of only a small increase in comfort-temperature violations.

Key Points

Problem: Buildings are large energy consumers and can provide flexibility to the grid, but DRL controllers’ exploratory behavior risks violating contractual energy constraints or occupant comfort.
Core contribution: A model-free RASF that enforces cumulative energy constraints in real time without requiring a physical model or prior system identification.
DRL core: DDPG (actor-critic) handles continuous valve-opening control (heat pump). Agent state includes temperatures, solar/ambient conditions, time features, BAU energy, energy already used in the flexibility window, and flexibility window timing.
Safety filter mechanics:
- Compute the remaining average safe action u_{1,t} = E_{remaining,t} / t_{toend,t}, where E_{remaining,t} = E_{BAU,t} − E_t (this enforces the cumulative budget).
- Adapt the tolerance τ_t based on normalized temperature risk and price favorability: τ_t = τ_{base,t} · (w_1·µ_{temp,t} + (1−w_1)·µ_{price,t})^{1.5}
- Construct bounds u_{maximal,t} = min(u_{1,t}·(1+τ_t), 1) and u_{minimal,t} = max(u_{1,t}·(1−τ_t), 0).
- Clip or adjust the DRL action u_t to u_{safe,t} within [u_{minimal,t}, u_{maximal,t}]; this guarantees the cumulative flexibility constraint is met while allowing adaptive slack.
Reward design balances thermal comfort, electricity cost, and flexibility compliance (penalty α for constraint violations).
Results (simulation):
- Full compliance with flexibility requests when RASF is active.
- Up to 50% energy/cost savings relative to a rule-based controller.
- Better energy and cost performance than a standalone DRL agent, with only slight increase in comfort violations.
Modeling: Hybrid Physically Consistent Neural Network (PCNN) used to simulate room thermal dynamics (linear physics component + data-driven residual).
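The safety filter mechanics listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rasf_filter`, its argument names, the default `w1 = 0.5`, and the convention that energy is normalized so a fully open valve consumes one unit per time step are all hypothetical. The clamp on the average action is a guard added for this sketch so the bounds can never invert.

```python
def rasf_filter(u_drl, e_bau, e_used, t_to_end,
                tau_base, mu_temp, mu_price, w1=0.5):
    """Project a raw DRL action onto the adaptive safe interval.

    Assumptions (not from the source): actions and per-step energy share
    the same normalized [0, 1] scale, and tau_base, mu_temp and mu_price
    all lie in [0, 1].
    """
    if t_to_end <= 0:
        raise ValueError("flexibility window has already ended")

    # Remaining budget spread evenly over the remaining steps.
    e_remaining = e_bau - e_used
    u1 = e_remaining / t_to_end
    # Guard added for this sketch: keep the average action feasible.
    u1 = min(max(u1, 0.0), 1.0)

    # Adaptive tolerance from temperature risk and price favorability.
    mix = w1 * mu_temp + (1.0 - w1) * mu_price
    tau = tau_base * mix ** 1.5

    # Bounds around the average action, kept inside the valve range.
    u_max = min(u1 * (1.0 + tau), 1.0)
    u_min = max(u1 * (1.0 - tau), 0.0)

    # Clip the DRL action into [u_min, u_max].
    u_safe = min(max(u_drl, u_min), u_max)
    return u_safe, u_min, u_max


# Hypothetical numbers: 6 of 10 budget units left, 12 steps to go.
u_safe, u_min, u_max = rasf_filter(
    u_drl=0.9, e_bau=10.0, e_used=4.0, t_to_end=12,
    tau_base=0.4, mu_temp=1.0, mu_price=1.0)
print(u_safe, u_min, u_max)
```

With both indicators at zero the tolerance vanishes and the filter returns exactly the remaining average action, which is the most conservative setting in this sketch.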

Data & Methods

Testbed / data:
- Simulation based on a living-lab apartment (UMAR unit) in the NEST building (Empa, Dübendorf, Switzerland). Uses real ambient/solar data and building historical data for training/simulation.
Thermal model:
- PCNN: T_{room,t+1} = D_{t+1} (physics term) + P_{t+1} (NN-learned residual). Chosen for interpretability and transferability across buildings.
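To make the split between the physics term and the learned residual concrete, here is a deliberately simplified one-step sketch. Everything specific in it is an assumption for illustration: the linear energy balance and its coefficients `a`, `b`, `c`, the two-feature input, and the tiny hand-written network are hypothetical and are not taken from the paper. The sketch also assumes the residual sees only non-control features, so the effect of heating enters solely through the physics term.

```python
import math


def physics_term(t_room, t_amb, t_neigh, u, a=0.8, b=0.05, c=0.02):
    """Linear energy balance: heating gain minus losses.

    The coefficients a, b, c are hypothetical placeholders.
    """
    heating = a * u
    loss_outside = b * (t_room - t_amb)
    loss_neighbor = c * (t_room - t_neigh)
    return t_room + heating - loss_outside - loss_neighbor


def residual_term(features, w_hidden, w_out):
    """Tiny one-hidden-layer network standing in for the learned residual."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))


def pcnn_step(t_room, t_amb, t_neigh, u, features, w_hidden, w_out):
    """Next room temperature as physics term plus data-driven residual."""
    d_next = physics_term(t_room, t_amb, t_neigh, u)
    p_next = residual_term(features, w_hidden, w_out)
    return d_next + p_next


# Hypothetical features: normalized solar irradiation and a time-of-day sine.
features = [0.3, 0.5]
w_hidden = [[0.1, -0.2], [0.05, 0.3]]
w_out = [0.2, -0.1]
print(pcnn_step(20.0, 0.0, 20.0, 0.5, features, w_hidden, w_out))
```

Because the valve opening `u` only appears in `physics_term` with a positive coefficient, more heating can never lower the predicted temperature in this sketch, whatever the residual weights are.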
DRL algorithm:
- DDPG (continuous action) with actor µ(S|θ^µ) and critic Q(S,u|θ^Q).
- State vector includes: solar irradiation, ambient temperature, room and neighbor temperatures, time features, heating scenario flag, time-to-start/stop of flexibility window, BAU energy E_{BAU,t} and consumed energy E_t.
- Reward: R_t = β·R_{temp,t} − δ·R_{price,t} − R_{flex,t}, where R_{temp,t} penalizes comfort violations, R_{price,t} is electricity cost, and R_{flex,t} penalizes flexibility non-compliance (with different α penalties depending on violation conditions).
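The three-term reward can be illustrated with a short sketch. The summary does not give the exact form of each term, so the choices below are assumptions: the comfort term is taken as a non-positive penalty equal to the distance outside a hypothetical 21 to 24 °C band, the price term is price times energy, and a single `alpha` replaces the condition-dependent penalties mentioned above. All names and default weights are hypothetical.

```python
def comfort_term(t_room, t_low=21.0, t_high=24.0):
    """Non-positive comfort term: zero inside the band, negative outside.

    The band limits are hypothetical placeholders.
    """
    return -max(t_low - t_room, 0.0, t_room - t_high)


def reward(t_room, price, energy, e_used, e_bau,
           beta=1.0, delta=1.0, alpha=10.0):
    """Weighted comfort term minus weighted cost minus flexibility penalty."""
    r_temp = comfort_term(t_room)
    r_price = price * energy
    # Assumed penalty: proportional to energy used beyond the BAU budget.
    r_flex = alpha * max(e_used - e_bau, 0.0)
    return beta * r_temp - delta * r_price - r_flex


# Comfortable and within budget: only the electricity cost is charged.
print(reward(t_room=22.0, price=0.2, energy=1.0, e_used=3.0, e_bau=5.0))
# Too cold and over budget: all three terms reduce the reward.
print(reward(t_room=20.0, price=0.2, energy=1.0, e_used=6.0, e_bau=5.0))
```

Making `alpha` large relative to `beta` and `delta` pushes the agent to respect the budget on its own, although in the framework described here the guarantee comes from the safety filter rather than from the penalty.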
Safety filter design and implementation:
- RASF is entirely model-free (uses observed T_{room,t}, electricity price ζ(t), and BAU/accumulated energy estimates) and enforces a cumulative energy budget over the flexibility window.
- Tolerance τt adapts over time and with operational context (price/temperature) via normalized indicators and a base tolerance schedule.
Baselines:
- Rule-based controller (RBC).
- Standalone DRL (DDPG) without the RASF.
Evaluation metrics:
- Energy consumption and electricity cost savings, compliance with flexibility message (cumulative energy within window), and thermal comfort violations.
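A rough sketch of how these metrics could be computed from a simulated trajectory follows. The data layout, the function name `evaluate`, the comfort band defaults, and the reading of compliance as "energy in the window does not exceed the BAU budget" are assumptions made for illustration; the summary does not specify how the authors compute them.

```python
def evaluate(steps, e_bau, window, t_low=21.0, t_high=24.0,
             baseline_cost=None):
    """Summarize one simulated episode.

    steps: list of (energy, price, t_room) tuples, one per time step.
    window: (start, stop) step indices of the flexibility window,
            with stop exclusive.
    """
    energy = sum(e for e, _, _ in steps)
    cost = sum(e * p for e, p, _ in steps)

    start, stop = window
    window_energy = sum(e for e, _, _ in steps[start:stop])

    result = {
        "energy": energy,
        "cost": cost,
        # Assumed compliance rule: stay at or below the budget.
        "compliant": window_energy <= e_bau,
        # Number of steps with the room outside the comfort band.
        "comfort_violations": sum(
            1 for _, _, t in steps if t < t_low or t > t_high),
    }
    if baseline_cost:
        # Relative saving against a reference controller's cost.
        result["cost_saving"] = 1.0 - cost / baseline_cost
    return result


episode = [(1.0, 0.2, 22.0), (0.5, 0.4, 20.5), (0.0, 0.3, 21.5)]
print(evaluate(episode, e_bau=2.0, window=(0, 2), baseline_cost=0.8))
```

Running the same function on trajectories from the rule-based controller, the standalone agent, and the filtered agent would yield directly comparable rows.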

Implications for AI Economics

Monetizing Flexibility: Model-free safe DRL with guaranteed compliance enables buildings to reliably participate in flexibility markets (capacity, load-shift, or energy-constrained products) without extensive modelling costs, increasing potential revenue streams for building owners and aggregators.
Reduced transaction/friction costs: The RASF’s model-free nature lowers setup and maintenance costs compared with MPC or model-based safety layers (less requirement for building-specific identification), improving scalability and lowering barriers for mass participation.
Risk mitigation and contracting: Guaranteed compliance addresses counterparties’ risk concerns (grid operators/market platforms), enabling tighter contracts and potentially better prices for flexibility due to lower delivery risk.
Market design and pricing signals: Adaptive tolerance leverages price signals to opportunistically consume when cheap and tighten during expensive periods; this suggests that well-designed price signals can be exploited safely by AI controllers to improve system efficiency.
Aggregation & coordination: Safe, compliant single-building controllers make aggregation simpler and more reliable for virtual power plants or aggregators, reducing monitoring and penalty costs and supporting trust in automated provision.
Trade-offs & externalities:
- Small increase in comfort violations demonstrates trade-offs between economic benefit and occupant experience; regulators/market designers may need standards or incentive alignment (e.g., comfort-compensating tariffs).
- Dependence on accurate BAU estimates and reliable communications: errors in BAU or delays could affect service quality; markets must account for measurement and baseline uncertainties.
Policy and regulation:
- Demonstrated safety guarantees can support regulatory acceptance of ML-based controllers in demand-response programs, but real-world validation and certification processes will be needed.
Future economic research directions:
- Value stacking: quantify combined revenues from energy arbitrage, flexibility markets, and local services (ancillary) under safe DRL regimes.
- Contract design: how to price flexibility given guaranteed delivery vs probabilistic offers.
- Aggregator business models: optimal allocation of tolerance across portfolios to balance revenue and comfort penalties.
- Market impacts: large-scale adoption may alter price volatility and reshape short-term markets; endogenous modeling required.

Limitations noted by the authors (economic relevance):
- Results are simulation-based using PCNN; field deployment and robustness under communication delays, measurement noise, and occupant behavior variability remain to be demonstrated.
- BAU estimation and data requirements for PCNN and DRL training may affect performance and baseline settlement in markets.

Overall, the paper shows a practical pathway for safe, scalable AI-enabled building participation in electricity flexibility markets — potentially lowering costs and operational risk, and expanding the supply of reliable distributed flexibility.

Assessment

Paper Type: descriptive
Evidence Strength: low. Findings appear to come from simulated experiments comparing the proposed safe RL controller to a rule-based baseline and a standalone RL agent; there is no causal identification strategy or real-world field validation reported, so estimated savings and compliance guarantees may not generalize to operational buildings or markets.
Methods Rigor: medium. The paper uses a standard deep RL algorithm (DDPG) and introduces a real-time adaptive safety filter with quantitative comparisons against baselines, which is methodologically appropriate for a control-focused study; however, rigor is limited by reliance on simulation, likely few baseline varieties, and no reported robustness checks, ablation studies, or real-world deployment.
Sample: Experiments are conducted on simulated building thermal models (space heating) with occupant comfort constraints; the controller is trained via interaction with the model using time-varying electricity prices and explicit flexibility requests from a system operator, and evaluated against a rule-based controller and a standalone DDPG controller across simulated scenarios.
Themes: innovation adoption
Generalizability:
- Results are from simulation rather than field deployment, so real-world HVAC dynamics, sensor noise, and actuator delays may change outcomes.
- Potentially limited building types and thermal parameter sets used in experiments (single-building or narrow set of models).
- Occupant behavior and comfort preferences in real buildings are more heterogeneous than simulated constraints.
- Transferability of trained RL policies across buildings/seasons without retraining is unclear.
- Market, regulatory, and aggregator incentives that affect flexibility value are not modeled in full.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details	Score
Buildings account for approximately 40% of global energy consumption.	Fiscal And Macroeconomic	null_result	high	share of global energy consumption accounted for by buildings	approximately 40%	0.18
Enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency given the growing share of intermittent renewable energy sources.	Consumer Welfare	positive	high	grid stability and energy efficiency enabled by demand-side flexibility in HVAC		0.03
This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators.	Organizational Efficiency	positive	high	ability to optimize building heating while providing demand-side flexibility		0.18
A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility.	Organizational Efficiency	positive	high	occupant comfort, energy cost, and flexibility provision resulting from DDPG-trained controller		0.18
We propose a real-time adaptive safety-filter to ensure that the system operates within predefined constraints during demand-side flexibility provision; the proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators.	Regulatory Compliance	positive	high	compliance with flexibility requests from system operators		0.18
The proposed real-time adaptive safety filter improves energy and cost efficiency, achieving up to 50% savings compared to a rule-based controller.	Organizational Efficiency	positive	high	energy and cost efficiency (savings) relative to a rule-based controller	up to 50% savings compared to a rule-based controller	0.18
The proposed safety-filter outperforms a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.	Organizational Efficiency	mixed	high	energy metrics, cost metrics, and comfort temperature violations		0.18