The Commonplace

An audit of a college Early Warning System shows it misallocates support: younger, male and international students are over-flagged while older and female students are under-identified. Compressing continuous risk scores into percentile tiers further magnifies these inequities across the intervention pipeline.

Fairness Audits of Institutional Risk Models in Deployed ML Pipelines
Kelly McConvey, Dipto Das, Maya Ghai, Angelina Zhai, Rosa Lee, Shion Guha · April 21, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
A replica-based audit of Centennial College's Early Warning System finds that younger, male, and international students are disproportionately flagged for support (many of whom succeed), while older and female students with comparable dropout risk are under-identified, and percentile-based post-processing amplifies these disparities.

Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.

Summary

Main Finding

A replica-based fairness audit of a deployed Early Warning System (EWS) at a public college finds that demographic disparities are (1) present in the training data, (2) learned by separate XGBoost intake models, and (3) amplified by percentile-based post-processing into fixed intervention tiers. Younger, male, and international students are over-flagged (receiving more intensive interventions) while older and female students with comparable risk are under-identified. Percentile bucketing (Low/Medium/High) is identified as the principal, locatable mechanism that converts prediction differences into amplified allocation disparities.

Key Points

  • Pipeline stages audited: (1) intake data, (2) model predictions (separate XGBoost for domestic vs international), (3) percentile-based post-processing into Low (top 50%), Medium (next 27%), High (bottom 23%).
  • Major baseline disparities in the data:
    • International success = 85% vs domestic 67%.
    • Female > Male: domestic 73% vs 59%; international 89% vs 82%.
  • Model performance and error patterns:
    • International model accuracy ≈ 91%; domestic ≈ 82%.
    • Female students have higher false-positive rates (domestic 32% vs male 23%; international 26% vs male 18%) but lower false-negative rates.
    • Strong age effects: e.g., domestic students 36–40 have FPR > 0.60; ages 19–20 show higher FNR.
  • Post-processing amplification:
    • Continuous probabilities are thresholded at ≈0.80 and ≈0.39 to form three tiers; Medium Risk bin is poorly calibrated (Brier Score 0.18).
    • ~21% of High Risk students ultimately succeed (nontrivial misallocation).
    • The male–female High Risk gap widens after bucketing (example: from 36% to 40% in the reported comparison), corresponding to roughly 300 additional male students/year receiving intensive supports.
    • Students aged 25 and under are much more likely to be categorized High Risk than students aged 36 and over (example: 94% vs 75% in a reported comparison).
  • Conceptual mechanisms identified:
    • Task formulation mismatch: the model predicts non-completion but is used as a proxy for “need for support.”
    • Institutional-priority drift: the target (program non-completion) reflects institutional retention/funding priorities rather than student-centered outcomes.
    • Uncertainty relabeled as moderate need: the Medium Risk bin converts model uncertainty into an intervention rule.
  • Limitations: replica (not direct production model) though trained on institutional data and specs; outcome conflates dropout/transfer/program change (construct-validity issue); no race/ethnicity or disability data; no data on actual intervention delivery or student responses.
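The error-rate gaps listed above come from a small set of group-level rates. As a minimal sketch (not the authors' code), the following computes Statistical Parity Difference, Equal Opportunity Difference, and the false-positive-rate gap for two groups, plus a Brier score. The toy data, the function names, and the choice to code the "positive" class as 1 are assumptions for illustration; this summary does not state which outcome the paper treats as positive.

```python
from typing import Dict, Sequence


def group_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Selection rate, true-positive rate, and false-positive rate for one group (0/1 labels)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "selection_rate": (tp + fp) / len(pairs),
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
    }


def pairwise_gaps(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Group A minus group B for each rate."""
    return {
        "spd": a["selection_rate"] - b["selection_rate"],  # Statistical Parity Difference
        "eod": a["tpr"] - b["tpr"],                        # Equal Opportunity Difference
        "delta_fpr": a["fpr"] - b["fpr"],                  # False Positive Rate gap
    }


def brier_score(probs: Sequence[float], y_true: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for p, t in zip(probs, y_true)) / len(probs)


# Hypothetical toy groups: (true labels, predicted labels).
rates_a = group_rates([1, 1, 0, 0, 0], [1, 1, 1, 0, 0])
rates_b = group_rates([1, 1, 0, 0, 0], [1, 0, 0, 0, 0])
gaps = pairwise_gaps(rates_a, rates_b)
print(gaps)
```

Reporting the signed difference per pair, and then the maximum absolute difference across pairs, matches the way the metrics are described in this summary; the sign convention (A minus B) is a choice made here.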

Data & Methods

  • Data: student records 2011–2019 (102,353 records; 61,375 domestic; 40,978 international). Filtered to first-semester intake students. Test set: 15,461 students (9,209 domestic; 6,252 international).
  • Labels: Successful (completed program within allowed period) vs Unsuccessful (withdrawal, failure, transfer); institution’s definition used to replicate deployed system.
  • Modeling: Separate XGBoost models for domestic and international populations; feature sets differ per admissions processes (e.g., English test scores for international, high-school grades for domestic); 45 intake/program/admissions features retained after missing-data removal; sensitive attributes (age, gender, residency, funding, first-generation) included per design; race/disability unavailable.
  • Training protocol: chronological split 70/15/15; SMOTE oversampling to balance classes; hyperparameters tuned via grid search.
  • Fairness evaluation metrics: Statistical Parity Difference (SPD), Equal Opportunity Difference (EOD), False Positive Rate gaps (∆FPR), Calibration Error / Brier Score; pairwise group disparities and maximum absolute differences reported.
  • Post-processing: probabilities converted to percentiles with fixed quotas: Low = top 50%, Medium = next 27%, High = bottom 23% (thresholds ≈0.80 and ≈0.39).
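The fixed-quota post-processing step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the deployed code: it treats the score as a predicted probability of success (which is how the quotas and thresholds read here), and the function name `assign_tiers` and the toy cohorts are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def assign_tiers(p_success: Sequence[float],
                 low_share: float = 0.50,
                 high_share: float = 0.23) -> List[str]:
    """Rank students by predicted success and fill fixed quotas.

    The top `low_share` of scores become Low risk, the bottom `high_share`
    become High risk, and everyone in between is Medium.
    """
    n = len(p_success)
    order = sorted(range(n), key=lambda i: p_success[i])  # lowest score first
    n_high = round(n * high_share)
    n_low = round(n * low_share)
    tiers = ["Medium"] * n
    for i in order[:n_high]:
        tiers[i] = "High"
    for i in order[n - n_low:]:
        tiers[i] = "Low"
    return tiers


# Hypothetical cohort with scores spread over [0, 1).
spread_cohort = [i / 100 for i in range(100)]
spread_tiers = assign_tiers(spread_cohort)

# Hypothetical cohort in which every student scores above 0.90.
strong_cohort = [0.90 + i / 1000 for i in range(100)]
strong_tiers = assign_tiers(strong_cohort)

print(Counter(spread_tiers))
print(Counter(strong_tiers))
```

Both cohorts end up with the same tier counts, because the quota depends only on rank. This is the property that lets small differences in predicted probability turn into fixed head-count differences between groups.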

Implications for AI Economics

  • Resource allocation and efficiency:
    • Algorithms convert historical completion disparities into concrete distribution of scarce support resources; misallocation (e.g., 21% of High Risk succeed) creates direct efficiency losses and opportunity costs for both students and institutions.
    • Fixed-quota percentile tiers can produce head-count distortions (over-investing in groups that are over-flagged and under-investing in under-flagged groups), affecting marginal return on advising and support spending.
  • Distributional and externality effects:
    • Algorithmic allocation shifts costs/benefits across demographic groups, with potential net welfare losses for under-identified groups (older/female) and deadweight intervention for over-identified groups (younger/male/international).
    • These distributional shifts can reinforce institutional incentives (retention-focused funding), producing feedback loops that entrench inequality (the ASP-HEI cycle).
  • Policy and regulatory relevance:
    • Audits must look beyond per-stage fairness metrics to trace how design choices (target construction, post-processing) transform predictions into institutional policy; regulators and funders should require pipeline-level transparency.
    • Construct-validity failures (predicting non-completion vs. who “needs” support) imply that compliance with statistical fairness may still produce economically and ethically undesirable allocations.
  • Practical mitigations with economic effects:
    • Replace fixed-percentile bins with calibrated probability thresholds to better align intervention cost with expected benefit (improves allocative efficiency).
    • Apply group-conditional calibration or multi-objective optimization (balance retention prediction accuracy against allocation parity) to reduce systematic misallocation while controlling for budget constraints.
    • Revisit target definitions and possibly decompose the outcome (dropout vs transfer vs program-change) to better match interventions to true needs — this can improve targeting and increase marginal returns on intervention spending.
  • Research directions for AI economics:
    • Quantify welfare and fiscal impacts of misallocation (cost per unnecessary intervention; foregone gains from missed interventions).
    • Model dynamic feedbacks: how algorithmic allocation affects future enrollments, completion rates, and institutional incentives.
    • Study trade-offs between accuracy, calibration, and allocation fairness under budget constraints to inform design of institutionally feasible intervention policies.
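The mitigation of replacing fixed-percentile bins with probability thresholds can be illustrated with a short sketch. It assumes the scores are calibrated probabilities of success and reuses the approximate cutoffs reported in this summary (0.80 and 0.39) as absolute thresholds purely for illustration; the function name and both toy cohorts are hypothetical, and choosing the cutoffs in practice would require a cost and benefit analysis that is not attempted here.

```python
from collections import Counter
from typing import List, Sequence


def tier_by_threshold(p_success: Sequence[float],
                      low_cut: float = 0.80,
                      high_cut: float = 0.39) -> List[str]:
    """Assign tiers from absolute probability cutoffs instead of rank quotas."""
    tiers = []
    for p in p_success:
        if p >= low_cut:
            tiers.append("Low")
        elif p < high_cut:
            tiers.append("High")
        else:
            tiers.append("Medium")
    return tiers


# Hypothetical cohort with scores spread evenly between 0.005 and 0.995.
cohort_a = [(i + 0.5) / 100 for i in range(100)]
# Hypothetical cohort in which every score lies above 0.50.
cohort_b = [0.503 + i / 200 for i in range(100)]

counts_a = Counter(tier_by_threshold(cohort_a))
counts_b = Counter(tier_by_threshold(cohort_b))
print(counts_a)
print(counts_b)
```

Unlike a rank quota, the High tier here shrinks to zero for the second cohort, so the number of intensive interventions follows the predicted risk rather than a fixed share of students. The trade-off is that tier sizes are no longer known in advance, which matters under a fixed support budget.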

Summary takeaway: technical fairness fixes at the model level are insufficient if post-processing or target choice translates historical inequities into allocation rules. For economists and policy-makers, the key levers are (a) defining the right target, (b) designing allocation rules that align probabilities with resource constraints and equity goals, and (c) auditing the entire pipeline to measure real economic and distributive impacts.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper uses a replica-based audit with institutional training data and standard fairness metrics across the full pipeline, which provides solid observational evidence of disparate treatment; however, it does not employ a causal identification strategy (e.g., randomization or quasi-experimental variation) to rule out alternative explanations for disparities or to estimate counterfactual impacts of interventions.

Methods Rigor: high. The authors replicate the deployed model using institutional training data and design specifications, evaluate disparities at multiple pipeline stages (training data, predictions, post-processing), and draw on multi-year collaboration and prior ethnographic work to contextualize construct validity, indicating careful, multi-method auditing and transparent replication practices.

Sample: Multi-year administrative student data from Centennial College used to replicate the institution's Early Warning System; includes training data, model predictions, post-processed risk tiers, demographic variables (gender, age, residency/international status), and outcome information on retention/dropout across cohorts (sample size and exact years not specified in the abstract).

Themes: inequality, governance

Generalizability:

  • Single-institution (Centennial College): findings may not generalize to other colleges, universities, or national contexts.
  • EWS design and post-processing (percentile-based risk tiers) may be specific to this deployment and not representative of other systems.
  • Demographic categorizations and student population composition (e.g., international student share) are context-specific.
  • Replica-based approach may differ from proprietary model internals if undocumented design choices were unavailable.
  • Temporal scope and local policies (support allocation procedures) may limit applicability to other time periods or institutions.

Claims (7)

Each entry lists the claim (with its category tag in brackets), followed by Direction, Confidence, Outcome, and Details.

  • We present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. [Governance And Regulation]
    • Direction: positive · Confidence: high
    • Outcome: successful replication of the deployed EWS
    • Details: 0.18
  • We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. [Governance And Regulation]
    • Direction: positive · Confidence: high
    • Outcome: fairness metrics (disparities) across pipeline stages
    • Details: 0.18
  • Younger, male, and international students are disproportionately flagged for support by the EWS, even when many ultimately succeed. [Task Allocation]
    • Direction: negative · Confidence: high
    • Outcome: rate of being flagged for support (EWS risk flag) versus eventual success/dropout
    • Details: 0.18
  • Older and female students with comparable dropout risk are under-identified by the EWS. [Task Allocation]
    • Direction: negative · Confidence: high
    • Outcome: identification/flagging rate for support relative to comparable dropout risk
    • Details: 0.18
  • Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. [Task Allocation]
    • Direction: negative · Confidence: high
    • Outcome: change in disparity magnitude after post-processing (probability → percentile risk tiers)
    • Details: 0.18
  • Disparities emerge and compound across stages of the ML pipeline (training data, model predictions, and post-processing). [Task Allocation]
    • Direction: negative · Confidence: high
    • Outcome: cumulative disparity across pipeline stages
    • Details: 0.18
  • This work provides a replicable methodology for auditing institutional ML systems and highlights the importance of evaluating construct validity alongside statistical fairness. [Governance And Regulation]
    • Direction: positive · Confidence: high
    • Outcome: availability/replicability of audit methodology and emphasis on construct validity
    • Details: 0.18

Notes