Augmenting tabular housing-placement models with fine-tuned LLM casenote summaries raises accuracy and narrows error disparities in a nonprofit dataset; zero-shot LLM use produces mixed fairness effects and findings are limited by short, redacted notes and single-site data.

Auditing LLMs for Algorithmic Fairness in Casenote-Augmented Tabular Prediction

Xiao Qi Lee, Ezinne Nwankwo, Angela Zhou · April 21, 2026

arXiv · correlational · low evidence · relevance 7/10

Fine-tuning LLMs to produce casenote summaries and augment tabular models improves multi-class housing-placement prediction accuracy and reduces certain algorithmic fairness disparities on a single nonprofit dataset, while zero-shot LLM classification yields mixed fairness outcomes.

LLMs are increasingly being considered for prediction tasks in high-stakes social service settings, but their algorithmic fairness properties in this context are poorly understood. In this short technical report, we audit the algorithmic fairness of LLM-based tabular classification on a real housing placement prediction task, augmented with street outreach casenotes from a nonprofit partner. We audit multi-class classification error disparities. We find that a fine-tuned model augmented with casenote summaries can improve accuracy while reducing algorithmic fairness disparities. We experiment with variable importance improvements to zero-shot tabular classification and find mixed results on resulting algorithmic fairness. Overall, given historical inequities in housing placement, it is crucial to audit LLM use. We find that leveraging LLMs to augment tabular classification with casenote summaries can safely leverage additional text information at low implementation burden. The outreach casenotes are fairly short and heavily redacted. Our assessment is that LLM zero-shot classification does not introduce additional textual biases beyond algorithmic biases in tabular classification. Combining fine-tuning and leveraging casenote summaries can improve accuracy and algorithmic fairness.

Summary

Main Finding

Fine-tuning Llama 3 70B on serialized tabular records for housing-placement prediction substantially improves overall accuracy (up to 70.2%) but can widen subgroup disparities (larger Equality-of-Opportunity gaps). Augmenting tabular prompts with LLM-generated casenote summaries improves zero-shot performance, notably for the smaller model, and can reduce fairness gaps: for the fine-tuned 70B model, summaries cut the mean EoO gap from about 0.071 to about 0.021 at a cost of 1.6 percentage points of accuracy. Explicitly emphasizing historically predictive tabular features (e.g., max prior placement) increases accuracy but risks reinforcing historical bias; effects vary by model size and fine-tuning.

Key Points

Dataset and task
- 471 clients from a street-outreach nonprofit (seen monthly 2019–2021).
- Outcome Y ∈ {0,1,2,3}: highest housing placement by 2021 (0 = streets, 3 = permanent supportive housing).
- Tabular covariates include demographics (gender, race, age range), outreach metadata, binary casenote-extracted indicators, count of prior engagements, and max prior placement (identified as most predictive).
- Casenotes are short and heavily redacted.
Models evaluated
- Llama 3 variants: 17B and 70B (Together AI API), tested zero-shot and fine-tuned.
- Random forest baseline on tabular covariates.
- Prompt format: tabular record serialized to natural language; optional LLM-generated casenote summary; optional prompt cue for top feature.
Fairness metrics
- Multiclass Statistical Parity (SP) and Equality of Opportunity (EoO), reported as mean and max classwise deviations across race (White, Black, Other) and gender (Male, Female).
- Evaluations restricted to subgroups with sufficient sample size.
Quantitative highlights (selected)
- Fine-tuned 70B (no summaries): accuracy 70.2%, EoO mean ≈ 0.071, EoO max ≈ 0.209 (accuracy up, fairness gaps larger).
- Fine-tuned 70B + summaries: accuracy 68.6% (slight drop), EoO mean ≈ 0.021, EoO max ≈ 0.097 (notable reduction in EoO disparities).
- Zero-shot 17B: accuracy 33.1% → with summaries 47.0% (≈+13.9 pp), EoO mean reduced for some variants.
- Random forest baseline: accuracy 61.2% but worse SP metrics and comparable EoO to some LLMs.
Feature-importance cueing
- Cueing the top tabular feature (max prior placement, i.e., the highest placement before 2019) in prompts raised accuracy, especially for zero-shot 17B, but often increased EoO disparities. This suggests that explicitly surfacing historically predictive features can propagate structural bias.
- Interactions depend on model capacity: 17B and 70B responded differently to feature cues and summaries. Larger or fine-tuned models may have learned a balance across features that overemphasis on a single feature disrupts.
Other observations
- Zero-shot LLMs tended to over-predict class 2, a bias toward a "safe" intermediate placement level.
- Female TPRs were consistently lower than male TPRs across models (dataset male:female ≈ 4:1).
- The authors explored alternative splits, SMOTE oversampling, prompt optimization, and few-shot prompting; none consistently improved both accuracy and fairness.

Data & Methods

Data: 471 client records, casenotes (short, redacted), tabular baseline covariates; outcome is highest placement level by 2021.
Serialization: Tabular fields (and optionally casenote summaries and feature cues) converted to a natural-language prompt for LLM input.
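The serialization step can be illustrated with a minimal sketch. The function name `serialize_record`, the field names, the prompt wording, and the example record are all hypothetical and chosen for illustration; the summary does not reproduce the authors' actual prompt template.

```python
def serialize_record(record, summary=None, cue_feature=None):
    """Turn one tabular record into a natural-language classification prompt.

    Hypothetical template: the wording is an assumption, not the paper's prompt.
    """
    # One "field is value" clause per column, in the record's own order.
    clauses = [f"{name.replace('_', ' ')} is {value}" for name, value in record.items()]
    parts = ["Client record: " + "; ".join(clauses) + "."]
    if summary:
        # Optional LLM-generated casenote summary (augmented variants).
        parts.append("Casenote summary: " + summary.strip())
    if cue_feature is not None:
        # Optional cue that points the model at one feature already in the record.
        parts.append(f"Pay particular attention to {cue_feature.replace('_', ' ')}.")
    parts.append("Predict the highest housing placement level (0, 1, 2, or 3). "
                 "Answer with one digit.")
    return "\n".join(parts)


# Made-up record for illustration only; not drawn from the dataset.
example = {"gender": "Male", "age_range": "35-44",
           "prior_engagements": 6, "max_prior_placement": 1}
prompt = serialize_record(example,
                          summary="Client asked about shelter options.",
                          cue_feature="max_prior_placement")
print(prompt)
```

Keeping the summary and the feature cue as optional arguments makes it straightforward to generate each prompt variant (tabular only, with summary, with cue) from the same record, so that accuracy and fairness differences can be attributed to the prompt component rather than to formatting drift.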
Models:
- Llama 3 17B and 70B, evaluated zero-shot and via fine-tuning (train/val/test primarily 75/10/15).
- Random forest trained on tabular covariates as baseline.
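A 75/10/15 split like the one mentioned above could be produced as follows. This sketch stratifies by outcome class so that each split keeps roughly the same class mix, which is an assumption: the summary does not say whether the authors stratified, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.75, 0.10, 0.15), seed=0):
    """Return (train, val, test) index lists, split separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train, val, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        # Whatever remains goes to test, so every index is used exactly once.
        test += idx[n_train + n_val:]
    return train, val, test
```

With only 471 clients, the 15% test set holds roughly 70 records, so per-group, per-class counts in the test set can be very small; this is one reason fairness gaps on such data should be read with caution.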
Casenote summaries: generated by LLM and appended to the prompt in augmented variants.
Fairness evaluation:
- Multiclass Statistical Parity: per-class deviation of predicted rates from overall prediction rates (reported as SP mean and SP max).
- Multiclass Equality of Opportunity: per-class TPR deviations from marginal TPR (reported as EoO mean and EoO max).
- Protected attributes: race (White/Black/Other) and gender (Male/Female).
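The two metrics can be sketched in a few lines. The aggregation below (absolute deviations pooled over every group and class, then averaged or maximized) is one plausible reading of the description above and is an assumption; the paper may aggregate differently. The function name `classwise_gaps` is hypothetical.

```python
from collections import defaultdict


def classwise_gaps(y_true, y_pred, groups, classes=(0, 1, 2, 3)):
    """Return (sp_mean, sp_max, eoo_mean, eoo_max) over all (group, class) pairs."""
    n = len(y_true)
    idx_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        idx_by_group[g].append(i)

    sp_gaps, eoo_gaps = [], []
    for k in classes:
        # Overall rate of predicting class k, and overall (marginal) TPR for class k.
        overall_rate = sum(p == k for p in y_pred) / n
        pos = [i for i in range(n) if y_true[i] == k]
        overall_tpr = sum(y_pred[i] == k for i in pos) / len(pos) if pos else None

        for idx in idx_by_group.values():
            # Statistical parity: group prediction rate vs. overall prediction rate.
            group_rate = sum(y_pred[i] == k for i in idx) / len(idx)
            sp_gaps.append(abs(group_rate - overall_rate))

            # Equality of opportunity: group TPR vs. marginal TPR.
            # Skipped when the group has no true class-k members (TPR undefined).
            g_pos = [i for i in idx if y_true[i] == k]
            if overall_tpr is not None and g_pos:
                group_tpr = sum(y_pred[i] == k for i in g_pos) / len(g_pos)
                eoo_gaps.append(abs(group_tpr - overall_tpr))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(sp_gaps), max(sp_gaps, default=0.0), mean(eoo_gaps), max(eoo_gaps, default=0.0)
```

Calling `classwise_gaps(y_true, y_pred, race)` and `classwise_gaps(y_true, y_pred, gender)` separately gives one set of four numbers per protected attribute. Reporting the max alongside the mean matters because a single badly served group-class pair can be hidden by an average.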
Additional experiments: prompt engineering, feature cueing (top feature and up to 10 additional features), alternative train/test splits, SMOTE oversampling; authors report mixed/unstable gains.

Implications for AI Economics

Performance–fairness tradeoffs matter in resource-allocation contexts. Higher aggregate accuracy from fine-tuning can come at the cost of increased disparities; cost-benefit analyses of deploying tuned LLMs must include subgroup welfare impacts (who benefits and who is harmed).
Value of unstructured information: LLM summarization can extract useful contextual signals from casenotes that improve zero-shot performance at low implementation cost and can also mitigate some fairness gaps. This suggests economically attractive investments in lightweight LLM-based feature extraction when structured data are limited.
Model capacity and calibration: larger and fine-tuned models can better leverage nuanced context but are also more likely to amplify historical imbalances if training data are imbalanced. Deployment decisions should consider model capacity, fine-tuning regimen, and the distributional makeup of populations served.
Caution on feature highlighting: surfacing historically predictive features (even non-protected ones) can improve predictive accuracy but may entrench inequities because those features correlate with protected group membership. Economists and policymakers should avoid naive reliance on historically predictive signals without considering fairness externalities.
Practical recommendations for practitioners and policymakers:
- Always audit multiclass fairness (SP and EoO) on relevant subgroups before deployment.
- Prefer augmenting tabular models with LLM-derived summaries rather than only fine-tuning large LLMs, when the goal includes reducing disparities.
- Monitor subgroup-specific TPRs and worst-case gaps (EoO max), not just overall accuracy.
- Where fine-tuning is used, mitigate imbalances (sampling, fairness-aware objectives, separate calibrations) because fine-tuning can disproportionately benefit majority groups.
- Pilot with smaller models as robustness checks: smaller models may be less sensitive to noisy text inputs and can reveal failure modes.
Research & policy directions:
- Broader studies on longer, less-redacted casenotes and larger populations to test generalizability.
- Develop fairness-aware LLM fine-tuning and prompt interventions that balance accuracy with subgroup parity in multiclass outcomes.
- Economic evaluation of downstream impacts (allocation efficiency vs distributional harms) to guide acceptable tradeoffs in social-service deployments.


Assessment

Paper Type: correlational
Evidence Strength: low. The report presents empirical model comparisons on a single, observational dataset without experimental identification or external validation; results show promising accuracy and fairness improvements but may reflect dataset-specific patterns, overfitting, or limited robustness checks.
Methods Rigor: medium. The authors use reasonable technical approaches (fine-tuning, zero-shot evaluation, multi-class error disparity audits, and variable-importance experiments) and report direct model comparisons, but the study lacks randomized or quasi-experimental identification, detailed robustness/sensitivity analyses, comprehensive fairness metrics, and transparency about sample size and hyperparameter choices.
Sample: Tabular housing-placement prediction data from a single nonprofit partner augmented with short, heavily redacted street-outreach casenotes (casenote summaries used as text features); outcome is multi-class housing placement; exact sample size and demographic composition not reported in the summary.
Themes: governance, inequality
Generalizability:
- Single nonprofit dataset; results may not generalize to other providers, regions, or populations.
- Casenotes are short and heavily redacted; richer textual records could change results.
- Unclear sample size and representativeness; potential selection bias toward served clients.
- Findings depend on specific model fine-tuning and preprocessing choices.
- Outcome limited to housing-placement classification; other social-service tasks may behave differently.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
A fine-tuned model augmented with casenote summaries can improve accuracy while reducing algorithmic fairness disparities on the housing placement multi-class classification task.	AI Safety and Ethics	positive	high	multi-class classification accuracy; classification error disparities across demographic or protected groups	0.3
Variable importance improvements to zero-shot tabular classification produce mixed results with respect to algorithmic fairness.	AI Safety and Ethics	mixed	high	algorithmic fairness (classification error disparities) resulting from variable-importance adjustments to zero-shot classification	0.3
LLM zero-shot classification does not introduce additional textual biases beyond the algorithmic biases already present in tabular classification.	AI Safety and Ethics	null_result	high	additional textual bias introduced by LLM zero-shot classification relative to tabular-only classifiers	0.3
Leveraging LLMs to augment tabular classification with casenote summaries can safely incorporate additional text information with low implementation burden.	AI Safety and Ethics	positive	high	feasibility/safety of augmenting tabular models with LLM casenote summaries; implementation burden	0.3
The outreach casenotes used in the study are fairly short and heavily redacted.	Other	null_result	high	casenote length and degree of redaction	0.3
Given historical inequities in housing placement, it is crucial to audit LLM use in this context.	Governance and Regulation	positive	high	need for auditing LLMs (policy recommendation)	0.05