The Commonplace

AI-generated reward functions for building controllers improve modeled comfort equity and cut energy costs by 3.2%, with the largest comfort gains for elderly females after iterative refinement; however, results are simulation-based and demographic gaps remain.

OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings
Shadmehr Zaregarizi, Khashayar Yavari · May 27, 2026
arXiv · quasi_experimental · medium evidence · 7/10 relevance
In simulation, iterative LLM-mediated reward shaping for a DRL building controller improves modeled comfort equity across four demographic profiles and reduces energy costs modestly, but notable disparities—especially for elderly females—persist despite refinement.

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights (rather than performing per-step inference) across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

Summary

Main Finding

LLM-guided, iterative reward shaping (OccuReward) can materially improve demographic equity in DRL-based building energy control while also modestly reducing energy cost. The authors introduce a Comfort Equity Index (CEI) and feed it, together with per-profile satisfaction, back to an LLM (Gemini 1.5), which re-weights reward components (notably Solar and SoC, i.e. battery state-of-charge, utilization). This raised all four empirically grounded occupant profiles above a 0.5 satisfaction threshold. The largest gain was for Elderly Females (satisfaction 0.12 → 0.80, +567%), and overall energy cost fell by 3.2% in the equity-aware round. However, residual disparities remain because of structural environmental limits (e.g., HVAC/setpoint constraints), indicating that reward-level intervention alone is insufficient for full parity.

Key Points

  • Contributions
    • Introduces the Comfort Equity Index (CEI), an inequity metric derived as 1 − Jain’s fairness index applied to per-profile comfort satisfaction.
    • Proposes an iterative LLM-in-the-loop reward-shaping method that uses CEI and per-profile feedback to refine DRL reward functions.
    • Empirically analyzes distributional outcomes of LLM-generated rewards in CityLearn v2 and documents both successes and limits.
  • Quantitative outcomes (Round 1 → Round 3)
    • Elderly Female satisfaction: 0.12 → 0.80 (+567%)
    • Young Male: 0.85 → 1.00 (+17.6%)
    • Mid-aged Female: 0.78 → 1.00 (+28.2%)
    • Health Sensitive: 0.65 → 1.00 (+53.8%)
    • Energy cost: normalized metric improved by 3.2% after equity-aware refinement.
    • CEI: 0.19 (Rounds 1–2) → 0.0082 (Round 3), i.e., much lower inequity.
  • Mechanism of improvement
    • The LLM rebalanced reward weights (increasing Solar/SoC incentives), enabling the agent to leverage local generation/storage to meet tighter thermal demands without a pure cost trade-off.
  • Persistent limits
    • Structural constraints of the simulation (ambient temperature ranges, HVAC capacity, lack of setpoint control) prevent full parity—reward shaping hit a “comfort ceiling.”
  • Robustness & scope
    • Experiments used five random seeds per round (five seeds × three rounds = 15 runs). Results are specific to the CityLearn Phase 1 schema and one LLM (Gemini 1.5), and the Elderly Female profile rests on a small sample (n=298).
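The Round 1 and Round 3 figures above can be cross-checked with a few lines of arithmetic. The sketch below is a minimal illustration, assuming CEI is computed directly on the four reported mean satisfaction scores (the paper may aggregate per seed first); the variable and function names are hypothetical. On these inputs it reproduces the reported percentage gains and gives CEI ≈ 0.185 for Round 1 (reported as 0.19) and ≈ 0.0082 for Round 3.

```python
"""Cross-check of the reported Round 1 -> Round 3 numbers (illustrative sketch)."""

# Mean per-profile comfort satisfaction as reported (Round 1 and Round 3).
ROUND_1 = {"Young Male": 0.85, "Mid-aged Female": 0.78,
           "Health Sensitive": 0.65, "Elderly Female": 0.12}
ROUND_3 = {"Young Male": 1.00, "Mid-aged Female": 1.00,
           "Health Sensitive": 1.00, "Elderly Female": 0.80}


def jain_index(scores):
    """Jain's fairness index (sum x)^2 / (n * sum x^2); equals 1.0 when all scores are equal."""
    total = sum(scores)
    return total * total / (len(scores) * sum(s * s for s in scores))


def cei(scores):
    """Comfort Equity Index = 1 - Jain's index; 0 means perfect equity."""
    return 1.0 - jain_index(scores)


def pct_change(before, after):
    """Relative change in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    print(f"CEI Round 1: {cei(list(ROUND_1.values())):.4f}")  # 0.1854
    print(f"CEI Round 3: {cei(list(ROUND_3.values())):.4f}")  # 0.0082
    for name in ROUND_1:
        # Elderly Female prints +566.7%, which the paper rounds to +567%.
        print(f"{name}: {pct_change(ROUND_1[name], ROUND_3[name]):+.1f}%")
```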

Data & Methods

  • Occupant data
    • Source: ASHRAE Global Thermal Comfort Database II.
    • Filtered records: 13,440 valid votes (after filtering on age, sex, and comfort responses).
    • Four empirical profiles, each with a preferred temperature range taken from the interquartile range (IQR) of its comfortable records:
      • Young Male (18–35, n=4,503): 23.6–27.7°C, flexibility ±2.0°C
      • Elderly Female (65–95, n=298): 21.3–23.8°C, flexibility ±1.0°C
      • Mid-aged Female (40–55, n=1,504): 21.9–26.2°C, flexibility ±1.5°C
      • Health Sensitive (45–60, n=3,798): 22.1–27.9°C, flexibility ±0.5°C
  • Comfort satisfaction function
    • Profile satisfaction S_i(T) = 1.0 if T lies in the preferred range; otherwise it decreases with distance from the nearest range boundary, scaled by the profile-specific flexibility parameter.
  • Comfort Equity Index (CEI)
    • CEI = 1 − J, where J = (Σ_i S_i)² / (n · Σ_i S_i²) is Jain’s fairness index over the n per-profile satisfaction scores. Since J ∈ [1/n, 1], CEI ∈ [0, 1 − 1/n] (at most 0.75 for the four profiles here), with 0 = perfect equity.
  • Simulation & agent
    • Environment: CityLearn v2.1.2, 5-building residential district (2022 Phase 1).
    • Agent: Soft Actor-Critic (Stable-Baselines3), MLP with two hidden layers of 256 units, lr = 3e-4, batch size = 256, trained for 50,000 timesteps.
    • Baseline: Rule-Based Controller (RBC) with fixed seasonal setpoints; energy KPIs normalized to RBC.
  • LLM-in-the-loop reward shaping
    • LLM: Gemini 1.5 Flash API. The LLM generates Python reward logic and absolute coefficients once per round; the generated reward is then evaluated locally during training, with no per-step LLM calls.
    • Reward form: R = −Σ_j w_j · d_KPI_j, where d_KPI_j is the j-th normalized KPI and w_j its LLM-assigned weight.
    • Three rounds:
      • Round 1: LLM given energy objectives only (equity_weight = 0).
      • Round 2: LLM refines the reward using energy-KPI feedback only (naive refinement).
      • Round 3: the equity-aware prompt adds CEI and per-profile satisfaction to the energy-KPI feedback, and the LLM may include an equity weight (equity_weight ≈ 0.15).
  • Evaluation
    • Five random seeds per round; results reported as mean ± standard deviation. Metrics: per-profile satisfaction, CEI, normalized energy cost.
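The satisfaction function, CEI, and reward form described above can be combined in one short sketch. This is not the authors' code, and several pieces are assumptions: the source does not give the exact out-of-range decay, so linear decay reaching zero at the flexibility margin is assumed; dividing each KPI by its RBC value follows the baseline description; and subtracting CEI scaled by equity_weight is a hypothetical reading of how the Round 3 equity weight enters the reward. The names `Profile`, `satisfaction`, and `shaped_reward`, and the KPI values in the usage block, are illustrative.

```python
"""Satisfaction, CEI, and shaped reward (illustrative sketch, not the authors' code)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    t_low: float        # lower bound of preferred range, degrees C
    t_high: float       # upper bound of preferred range, degrees C
    flexibility: float  # degrees C past the boundary where satisfaction reaches 0 (ASSUMED)


# Preferred ranges and flexibility values as reported for the four profiles.
PROFILES = [
    Profile("Young Male", 23.6, 27.7, 2.0),
    Profile("Elderly Female", 21.3, 23.8, 1.0),
    Profile("Mid-aged Female", 21.9, 26.2, 1.5),
    Profile("Health Sensitive", 22.1, 27.9, 0.5),
]


def satisfaction(profile: Profile, temp_c: float) -> float:
    """1.0 inside the preferred range; outside it, an ASSUMED linear decay to 0."""
    if profile.t_low <= temp_c <= profile.t_high:
        return 1.0
    if temp_c < profile.t_low:
        distance = profile.t_low - temp_c
    else:
        distance = temp_c - profile.t_high
    return max(0.0, 1.0 - distance / profile.flexibility)


def cei(scores: list[float]) -> float:
    """Comfort Equity Index: 1 - Jain's fairness index of per-profile scores."""
    total = sum(scores)
    if total == 0:
        return 0.0  # all-zero scores treated as equal (a convention, not from the source)
    return 1.0 - total * total / (len(scores) * sum(s * s for s in scores))


def shaped_reward(kpis: dict[str, float], rbc_kpis: dict[str, float],
                  weights: dict[str, float], scores: list[float],
                  equity_weight: float = 0.0) -> float:
    """R = -sum_j w_j * d_j - equity_weight * CEI, with d_j = KPI_j / KPI_j(RBC).

    The CEI penalty is a HYPOTHETICAL use of equity_weight; with the default
    equity_weight = 0.0 the reward is energy-only, as in Round 1.
    """
    energy_term = sum(w * kpis[k] / rbc_kpis[k] for k, w in weights.items())
    return -energy_term - equity_weight * cei(scores)


if __name__ == "__main__":
    indoor_temp_c = 24.5
    scores = [satisfaction(p, indoor_temp_c) for p in PROFILES]
    # Hypothetical KPI values and LLM-assigned weights, for illustration only.
    kpis = {"cost": 0.9, "carbon": 1.1}
    rbc_kpis = {"cost": 1.0, "carbon": 1.0}
    weights = {"cost": 0.6, "carbon": 0.4}
    print(scores)
    print(f"reward: {shaped_reward(kpis, rbc_kpis, weights, scores, equity_weight=0.15):.4f}")
```

At 24.5 °C only the Elderly Female profile falls outside its range, so it alone pulls CEI above zero and, with a nonzero equity_weight, lowers the reward.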

Implications for AI Economics

  • Distributional externalities matter in building AI
    • Aggregate energy/cost optimization can generate uneven welfare outcomes across demographic groups. Economic evaluations of AI controllers should include distributional metrics (like CEI) in cost–benefit analyses.
  • LLMs can find Pareto-improving reward designs
    • The study shows an instance where integrating equity feedback led to both better fairness and modest cost savings (3.2%). This suggests potential market value for equity-aware controllers: improved social welfare at limited or even negative marginal cost.
  • Policy and procurement
    • Regulators and building owners should require algorithmic audits for distributional impacts and include occupant-equity KPIs (CEI or similar) in procurement specifications for intelligent building management systems (BMS).
  • Product and market opportunities
    • There is a potential commercial niche for “equity-aware” control products and services (LLM-assisted reward engineering + setpoint-control integration) that can be marketed on welfare and compliance grounds.
  • Limits and further investment needs
    • Reward-level interventions can’t fully substitute for physical or architectural control capabilities (e.g., zone-specific setpoint authority). Economists estimating benefits should account for required capital/retrofit investment to unlock full equity gains.
  • Research and evaluation recommendations
    • Broader validation needed: multi-LLM comparisons, diverse environments, longer training and deployment horizons, and richer occupant sampling (address small-n subgroups).
    • Incorporate CEI (or comparable distributional metrics) into distributionally weighted social welfare functions when valuing energy-efficiency interventions.
  • Cautions: Goodhart’s law and robustness
    • Reward engineering is susceptible to reward hacking; continuous monitoring and multi-metric governance are necessary to avoid gaming and unintended regressions in other KPIs.
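The Pareto-improvement point above can be made concrete with a dominance check on the reported means. In this sketch (hypothetical names), Round 3 counts as a Pareto improvement over Round 1 if no profile's satisfaction falls, cost does not rise, and at least one of them strictly improves. Cost is expressed relative to a Round 1 value of 1.0, assuming the reported 3.2% reduction is measured against Round 1; the check uses means only and ignores seed-to-seed variance.

```python
"""Pareto-dominance check on reported mean outcomes (illustrative sketch)."""

ROUND_1 = {"Young Male": 0.85, "Mid-aged Female": 0.78,
           "Health Sensitive": 0.65, "Elderly Female": 0.12}
ROUND_3 = {"Young Male": 1.00, "Mid-aged Female": 1.00,
           "Health Sensitive": 1.00, "Elderly Female": 0.80}
COST_ROUND_1 = 1.0                 # normalized reference point (assumed)
COST_ROUND_3 = 1.0 * (1 - 0.032)   # 3.2% lower, assumed relative to Round 1


def is_pareto_improvement(before, after, cost_before, cost_after):
    """True if no profile is worse off, cost does not rise, and something strictly improves."""
    no_one_worse = all(after[k] >= before[k] for k in before) and cost_after <= cost_before
    someone_better = any(after[k] > before[k] for k in before) or cost_after < cost_before
    return no_one_worse and someone_better


if __name__ == "__main__":
    print(is_pareto_improvement(ROUND_1, ROUND_3, COST_ROUND_1, COST_ROUND_3))  # True
```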

Summary recommendation for AI economists: incorporate distributional measures like CEI into the economic appraisal of control AI, evaluate both reward-level and system-level interventions (including capital upgrades for setpoint control), and account for the non-negligible social welfare gains that equity-aware optimization can unlock alongside modest cost improvements.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper demonstrates consistent, substantial changes in simulated outcomes attributable to the LLM-shaped rewards and reports multiple runs, but the evidence is limited to simulation, a small set of four aggregated occupant profiles, proprietary LLM API outputs, and a modest number of experimental repetitions, which reduces external validity and robustness.

Methods Rigor: medium. The methods combine a standard DRL algorithm (Soft Actor-Critic), a novel metric (Comfort Equity Index), and iterative LLM refinement with multiple runs, but key details (prompting, LLM variability, statistical testing, ablations, sensitivity to environment/climate/building models) are omitted or limited, and reproducibility is constrained by proprietary API use.

Sample: Simulated experiments in CityLearn v2 using a Soft Actor-Critic agent and reward functions generated/refined by the Gemini API; occupant comfort preferences drawn from four empirically grounded profiles derived from the ASHRAE Global Thermal Comfort Database II (13,440 votes) representing Young Males, Mid-aged Females, Health Sensitive, and Elderly Females; results aggregated across 15 experimental runs (five seeds in each of three rounds of LLM-mediated reward refinement).

Themes: human_ai_collab, inequality

Identification: Uses controlled simulation experiments (CityLearn v2) to compare DRL agent outcomes under baseline vs. LLM-refined reward functions across three iterative rounds; causal attribution relies on counterfactual comparison within the same simulated environment and occupant-profile scenarios.

Generalizability

  • Simulation-only results: no field or real-building deployments to validate transfer to live systems.
  • Only four aggregated occupant profiles are used; these may not capture full demographic, cultural, or regional diversity.
  • Findings depend on a single building/environment simulator (CityLearn v2) and specific climate/building models.
  • Uses one DRL algorithm (Soft Actor-Critic); other controllers may respond differently.
  • LLM outputs (Gemini API) are proprietary and non-deterministic, limiting reproducibility.
  • The Comfort Equity Index (CEI) is a new metric whose external validity and sensitivity require further validation.
  • Energy–comfort trade-offs may differ under realistic occupancy dynamics and multi-occupant interactions not modeled here.

Claims (12)

  • We introduce the Comfort Equity Index (CEI) as a novel feedback signal. [Inequality]
    Direction: positive · Confidence: high · Outcome: Comfort Equity Index (CEI) · Details: 0.48
  • We use four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes). [Other]
    Direction: null_result · Confidence: high · Outcome: occupant profile representation (number of votes in dataset) · Details: n=13440; 0.8
  • We deploy a Soft Actor-Critic (SAC) agent in CityLearn v2 for experiments. [Other]
    Direction: null_result · Confidence: high · Outcome: DRL agent deployment (method) · Details: 0.8
  • We employ the Gemini API to generate reward function logic and weights across three refinement rounds rather than performing per-step inference. [Other]
    Direction: null_result · Confidence: high · Outcome: LLM-mediated reward generation method · Details: n=3; 0.8
  • Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. [Consumer Welfare]
    Direction: negative · Confidence: high · Outcome: occupant satisfaction (per demographic group) · Details: n=15; 0.48
  • By Round 3, equity-aware LLM refinement improves satisfaction for Young Males (+17.6%). [Consumer Welfare]
    Direction: positive · Confidence: high · Outcome: satisfaction for Young Males · Details: n=15; +17.6%; 0.48
  • By Round 3, equity-aware LLM refinement improves satisfaction for Mid-aged Females (+28.2%). [Consumer Welfare]
    Direction: positive · Confidence: high · Outcome: satisfaction for Mid-aged Females · Details: n=15; +28.2%; 0.48
  • By Round 3, equity-aware LLM refinement improves satisfaction for Health Sensitive (+53.8%). [Consumer Welfare]
    Direction: positive · Confidence: high · Outcome: satisfaction for Health Sensitive occupants · Details: n=15; +53.8%; 0.48
  • By Round 3, equity-aware LLM refinement improves satisfaction for Elderly Females (+567%). [Consumer Welfare]
    Direction: positive · Confidence: high · Outcome: satisfaction for Elderly Females · Details: n=15; +567%; 0.48
  • By Round 3, equity-aware LLM refinement reduces energy costs by 3.2%. [Organizational Efficiency]
    Direction: positive · Confidence: high · Outcome: energy costs · Details: n=15; 3.2%; 0.48
  • Reward-level intervention (via equity-aware LLM refinement) significantly improves equity, but demographic disparities in AI-driven controllers persist. [Inequality]
    Direction: mixed · Confidence: high · Outcome: equity in occupant comfort across demographic groups · Details: n=15; 0.48
  • LLM-mediated reward design can affect demographic equity in occupant comfort (i.e., LLM reward shaping has the potential to exhibit or exacerbate disparities). [Inequality]
    Direction: mixed · Confidence: medium · Outcome: demographic equity in occupant comfort · Details: n=15; 0.29

Notes