OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

Summary

Main Finding

LLM-guided, iterative reward shaping (OccuReward) can materially improve demographic equity in DRL-based building energy control while also modestly reducing energy cost. By introducing a Comfort Equity Index (CEI) and feeding per-profile satisfaction back to an LLM (Gemini 1.5), the authors re-weighted reward components (notably Solar and SoC utilization) and raised all four empirically grounded occupant profiles above a 0.5 satisfaction threshold. The largest gain was for Elderly Females (satisfaction 0.12 → 0.80, +567%), and overall energy cost fell by 3.2% in the equity-aware round. However, residual disparities remain due to structural environmental limits (e.g., HVAC/setpoint constraints), indicating reward-level intervention alone is insufficient for full parity.

Key Points

Contributions
- Introduces the Comfort Equity Index (CEI), an inequity metric derived as 1 − Jain’s fairness index applied to per-profile comfort satisfaction.
- Proposes an iterative LLM-in-the-loop reward-shaping method that uses CEI and per-profile feedback to refine DRL reward functions.
- Empirically analyzes distributional outcomes of LLM-generated rewards in CityLearn v2 and documents both successes and limits.
Quantitative outcomes (Round 1 → Round 3)
- Elderly Female satisfaction: 0.12 → 0.80 (+567%)
- Young Male: 0.85 → 1.00 (+17.6%)
- Mid-aged Female: 0.78 → 1.00 (+28.2%)
- Health Sensitive: 0.65 → 1.00 (+53.8%)
- Energy cost: normalized energy cost fell by 3.2% after equity-aware refinement.
- CEI: 0.19 (Rounds 1–2) → 0.0082 (Round 3), i.e., much lower inequity.
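
The relative gains quoted in this list follow directly from the Round 1 and Round 3 satisfaction scores. A minimal sketch that recomputes them is shown here; the dictionary and function names are illustrative and do not come from the paper.

```python
# Round 1 and Round 3 mean satisfaction per profile, as reported in this summary.
SATISFACTION = {
    "Elderly Female": (0.12, 0.80),
    "Young Male": (0.85, 1.00),
    "Mid-aged Female": (0.78, 1.00),
    "Health Sensitive": (0.65, 1.00),
}


def relative_gain_pct(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100.0


gains = {
    name: relative_gain_pct(round1, round3)
    for name, (round1, round3) in SATISFACTION.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

Because the Elderly Female baseline is so low (0.12), a 0.68-point absolute gain becomes a very large relative gain; the absolute change is the more comparable quantity across profiles.
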
Mechanism of improvement
- The LLM rebalanced reward weights (increasing Solar/SoC incentives), enabling the agent to leverage local generation/storage to meet tighter thermal demands without a pure cost trade-off.
Persistent limits
- Structural constraints of the simulation (ambient temperature ranges, HVAC capacity, lack of setpoint control) prevent full parity—reward shaping hit a “comfort ceiling.”
Robustness & scope
- Experiments used five random seeds per round (five seeds × three rounds = 15 runs). Results are specific to the CityLearn 2022 Phase 1 schema and a single LLM (Gemini 1.5); the Elderly Female profile rests on a small sample (n=298).

Data & Methods

Occupant data
- Source: ASHRAE Global Thermal Comfort Database II.
- Filtered records: 13,440 valid votes remain after filtering on age, sex, and comfort.
- Four empirical profiles constructed with preferred temperature ranges from IQR of comfortable records:
  - Young Male (18–35, n=4,503): 23.6–27.7°C, flexibility ±2.0°C
  - Elderly Female (65–95, n=298): 21.3–23.8°C, flexibility ±1.0°C
  - Mid-aged Female (40–55, n=1,504): 21.9–26.2°C, flexibility ±1.5°C
  - Health Sensitive (45–60, n=3,798): 22.1–27.9°C, flexibility ±0.5°C
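
The preferred ranges in this list are described as the interquartile range (IQR) of temperatures among records voted comfortable. A minimal sketch of that step using only the standard library follows. The function name and the sample temperatures are hypothetical, and the quantile method is an assumption because the summary does not state which one the authors used.

```python
from statistics import quantiles


def preferred_range(comfortable_temps_c):
    """Return (Q1, Q3) of the temperatures at which occupants reported comfort."""
    # quantiles(..., n=4) returns the three quartile cut points.
    # Assumption: Python's default 'exclusive' method; the paper may differ.
    q1, _median, q3 = quantiles(comfortable_temps_c, n=4)
    return q1, q3


# Made-up temperatures in °C, for illustration only (not from the database).
sample_temps = [21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0]
low, high = preferred_range(sample_temps)
print(f"preferred range: {low:.1f} to {high:.1f} °C")
```

With a subgroup as small as the Elderly Female profile (n=298), the quartile estimates are more sensitive to the choice of quantile method and to individual records than they are for the larger groups.
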
Comfort satisfaction function
- Profile satisfaction S_i(T) = 1.0 if T ∈ preferred range; otherwise decreases with distance from range boundary scaled by a profile-specific flexibility parameter.
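
The summary does not give the exact decay formula, so the sketch below assumes a linear drop that reaches zero once the temperature is one flexibility width outside the preferred range. The `Profile` class and `satisfaction` function are illustrative names; only the ranges and flexibility values come from the profile list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    t_low: float        # lower bound of preferred range, °C
    t_high: float       # upper bound of preferred range, °C
    flexibility: float  # tolerance outside the range, °C


PROFILES = [
    Profile("Young Male", 23.6, 27.7, 2.0),
    Profile("Elderly Female", 21.3, 23.8, 1.0),
    Profile("Mid-aged Female", 21.9, 26.2, 1.5),
    Profile("Health Sensitive", 22.1, 27.9, 0.5),
]


def satisfaction(profile: Profile, temp_c: float) -> float:
    """S_i(T): 1.0 inside the preferred range, decaying outside it.

    Assumption: linear decay, clipped at 0 (not confirmed by the source).
    """
    if profile.t_low <= temp_c <= profile.t_high:
        return 1.0
    if temp_c < profile.t_low:
        distance = profile.t_low - temp_c
    else:
        distance = temp_c - profile.t_high
    return max(0.0, 1.0 - distance / profile.flexibility)


scores = {p.name: satisfaction(p, 24.3) for p in PROFILES}
```

Under this assumed form, an indoor temperature of 24.3°C fully satisfies three of the profiles but gives the Elderly Female profile a score of about 0.5, because it sits 0.5°C above that profile's upper bound. This illustrates how a single shared temperature can favor the wider ranges.
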
Comfort Equity Index (CEI)
- CEI = 1 − J, where J is Jain’s fairness index applied to the vector of per-profile satisfaction scores. CEI ∈ [0,1], with 0 = perfect equity.
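
Jain's fairness index for n non-negative scores x_1..x_n is J = (Σ x_i)² / (n · Σ x_i²). A minimal sketch of the CEI built on it is given here, checked against the per-profile satisfaction values reported in this summary. The function names are illustrative, and returning J = 1 for an all-zero vector is an assumed convention.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if sum_sq == 0:
        return 1.0  # assumed convention: all-zero scores count as equal
    return (total * total) / (n * sum_sq)


def comfort_equity_index(satisfaction_scores):
    """CEI = 1 - J; 0 means every profile has the same satisfaction."""
    return 1.0 - jain_index(satisfaction_scores)


# Order: Young Male, Elderly Female, Mid-aged Female, Health Sensitive.
round1 = [0.85, 0.12, 0.78, 0.65]
round3 = [1.00, 0.80, 1.00, 1.00]

cei_round1 = comfort_equity_index(round1)
cei_round3 = comfort_equity_index(round3)
print(f"CEI round 1: {cei_round1:.4f}, round 3: {cei_round3:.4f}")
```

These inputs reproduce the reported values of roughly 0.19 and 0.0082. Since J is bounded below by 1/n, the CEI for four profiles cannot exceed 0.75, so the practical range is narrower than [0,1].
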
Simulation & agent
- Environment: CityLearn v2.1.2, 5-building residential district (2022 Phase 1).
- Agent: Soft Actor-Critic (Stable-Baselines3), MLP 2×256, lr = 3e-4, batch = 256, trained 50,000 timesteps.
- Baseline: Rule-Based Controller (RBC) with fixed seasonal setpoints; energy KPIs normalized to RBC.
LLM-in-the-loop reward shaping
- LLM: Gemini 1.5 Flash API. In each refinement round the LLM generates Python reward logic and absolute coefficients; the generated reward function then runs locally during training, with no per-step LLM calls.
- Reward form: R = −Σ_j w_j · d_KPI_j (weighted sum of normalized KPIs).
- Three rounds:
  - Round 1: LLM given energy objectives only (equity_weight = 0).
  - Round 2: LLM refines for energy KPIs only (naive refinement).
  - Round 3: Equity-aware LLM prompt includes energy KPIs + CEI + per-profile satisfaction; LLM permitted to include equity weight (equity_weight ≈ 0.15).
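
The weighted-sum reward and the three rounds can be sketched as follows. The KPI names, KPI values, and weights are hypothetical placeholders, and adding the equity term as `equity_weight * cei` is an assumption: the summary gives the weight (about 0.15) but not how it enters the reward.

```python
def shaped_reward(kpis, weights, cei=0.0, equity_weight=0.0):
    """R = -(sum_j w_j * KPI_j + equity_weight * CEI).

    `kpis` and `weights` map KPI names to normalized values and coefficients.
    With equity_weight = 0 this reduces to the energy-only form of Rounds 1-2.
    """
    energy_term = sum(weights[name] * kpis[name] for name in weights)
    return -(energy_term + equity_weight * cei)


# Hypothetical normalized KPIs (1.0 = rule-based baseline) and weights.
kpis = {"cost": 0.9, "carbon": 0.8, "peak": 1.1}
weights = {"cost": 0.5, "carbon": 0.3, "peak": 0.2}

# Rounds 1-2: energy objectives only.
energy_only = shaped_reward(kpis, weights)

# Round 3: same KPIs plus an inequity penalty, using the reported CEI of 0.19.
equity_aware = shaped_reward(kpis, weights, cei=0.19, equity_weight=0.15)
```

In this form a higher CEI always lowers the reward, so the agent is penalized for unequal comfort even when the energy KPIs are unchanged. Because the function is plain Python evaluated locally, the LLM is needed only between rounds, not at each environment step.
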
Evaluation
- Five random seeds per round; reported means ± std. Metrics: per-profile satisfaction, CEI, normalized energy cost.

Implications for AI Economics

Distributional externalities matter in building AI
- Aggregate energy/cost optimization can generate uneven welfare outcomes across demographic groups. Economic evaluations of AI controllers should include distributional metrics (like CEI) in cost–benefit analyses.
LLMs can discover Pareto-improving trade-offs
- The study shows an instance where integrating equity feedback led to both better fairness and modest cost savings (3.2%). This suggests market value for equity-aware controllers: improved social welfare with limited or even negative marginal cost.
Policy and procurement
- Regulators and building owners should require algorithmic audits for distributional impacts and include occupant-equity KPIs (CEI or similar) in procurement/specification for intelligent BMS systems.
Product and market opportunities
- There is a potential commercial niche for “equity-aware” control products and services (LLM-assisted reward engineering + setpoint-control integration) that can be marketed on welfare and compliance grounds.
Limits and further investment needs
- Reward-level interventions can’t fully substitute for physical or architectural control capabilities (e.g., zone-specific setpoint authority). Economists estimating benefits should account for required capital/retrofit investment to unlock full equity gains.
Research and evaluation recommendations
- Broader validation needed: multi-LLM comparisons, diverse environments, longer training and deployment horizons, and richer occupant sampling (address small-n subgroups).
- Incorporate CEI (or comparable distributional metrics) into distributionally weighted social welfare functions when valuing energy-efficiency interventions.
Cautions: Goodhart’s law and robustness
- Reward engineering is susceptible to reward hacking; continuous monitoring and multi-metric governance are necessary to avoid gaming and unintended regressions in other KPIs.

Summary recommendation for AI economists: incorporate distributional measures like CEI into the economic appraisal of control AI, evaluate both reward-level and system-level interventions (including capital upgrades for setpoint control), and account for the non-negligible social welfare gains that equity-aware optimization can unlock alongside modest cost improvements.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper demonstrates consistent, substantial changes in simulated outcomes attributable to the LLM-shaped rewards and reports multiple runs, but evidence is limited to simulation, a small set of four aggregated occupant profiles, proprietary LLM API outputs, and a modest number of experimental repetitions, which reduces external validity and robustness.
Methods Rigor: medium. Methods combine a standard DRL algorithm (Soft Actor-Critic), a novel metric (Comfort Equity Index), and iterative LLM refinement with multiple runs, but key details (prompting, LLM variability, statistical testing, ablations, sensitivity to environment/climate/building models) are either omitted or limited, and reproducibility is constrained by proprietary API use.
Sample: Simulated experiments in CityLearn v2 using a Soft Actor-Critic agent and reward functions generated/refined by the Gemini API; occupant behavior drawn from four empirically grounded profiles derived from the ASHRAE Global Thermal Comfort Database II (13,440 votes) representing Young Males, Mid-aged Females, Health Sensitive, and Elderly Females; results aggregated across 15 experimental runs and three rounds of LLM-mediated reward refinement.
Themes: human_ai_collab, inequality
Identification: Uses controlled simulation experiments (CityLearn v2) to compare DRL agent outcomes under baseline vs. LLM-refined reward functions across three iterative rounds; causal attribution relies on counterfactual comparison within the same simulated environment and occupant-profile scenarios.
Generalizability
- Simulation-only results: no field or real-building deployments to validate transfer to live systems.
- Only four aggregated occupant profiles used; may not capture full demographic, cultural, or regional diversity.
- Findings depend on a single building/environment simulator (CityLearn v2) and specific climate/building models.
- Uses one DRL algorithm (Soft Actor-Critic); different controllers may respond differently.
- LLM outputs (Gemini API) are proprietary and non-deterministic, limiting reproducibility.
- Comfort Equity Index (CEI) is a new metric whose external validity and sensitivity require further validation.
- Energy and comfort trade-offs may differ under realistic occupancy dynamics and multi-occupant interactions not modeled here.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce the Comfort Equity Index (CEI) as a novel feedback signal.	Inequality	positive	high	Comfort Equity Index (CEI)	0.48
We use four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes).	Other	null_result	high	occupant profile representation (number of votes in dataset)	n=13440 0.8
We deploy a Soft Actor-Critic (SAC) agent in CityLearn v2 for experiments.	Other	null_result	high	DRL agent deployment (method)	0.8
We employ the Gemini API to generate reward function logic and weights across three refinement rounds rather than performing per-step inference.	Other	null_result	high	LLM-mediated reward generation method	n=3 0.8
Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds.	Consumer Welfare	negative	high	occupant satisfaction (per demographic group)	n=15 0.48
By Round 3, equity-aware LLM refinement improves satisfaction for Young Males (+17.6%).	Consumer Welfare	positive	high	satisfaction for Young Males	n=15 +17.6% 0.48
By Round 3, equity-aware LLM refinement improves satisfaction for Mid-aged Females (+28.2%).	Consumer Welfare	positive	high	satisfaction for Mid-aged Females	n=15 +28.2% 0.48
By Round 3, equity-aware LLM refinement improves satisfaction for Health Sensitive (+53.8%).	Consumer Welfare	positive	high	satisfaction for Health Sensitive occupants	n=15 +53.8% 0.48
By Round 3, equity-aware LLM refinement improves satisfaction for Elderly Females (+567%).	Consumer Welfare	positive	high	satisfaction for Elderly Females	n=15 +567% 0.48
By Round 3, equity-aware LLM refinement reduces energy costs by 3.2%.	Organizational Efficiency	positive	high	energy costs	n=15 3.2% 0.48
Reward-level intervention (via equity-aware LLM refinement) significantly improves equity, but demographic disparities in AI-driven controllers persist.	Inequality	mixed	high	equity in occupant comfort across demographic groups	n=15 0.48
LLM-mediated reward design can affect demographic equity in occupant comfort (i.e., LLM reward shaping has the potential to exhibit or exacerbate disparities).	Inequality	mixed	medium	demographic equity in occupant comfort	n=15 0.29

AI-generated reward functions for building controllers improve modeled comfort equity and cut energy costs by 3.2%, with the largest comfort gains for elderly females after iterative refinement; however, results are simulation-based and demographic gaps remain.