A GPT-5 audit finds similar job and industry recommendations for simulated male and female applicants but markedly gendered language: women are described as empathetic and relational, men as analytical and leadership-oriented, which raises fairness concerns for AI-assisted hiring.
In recent years, generative artificial intelligence (GenAI) systems have taken on increasingly important roles in selection processes, personnel recruitment and the analysis of candidates' profiles. However, the use of large language models (LLMs) risks reproducing, and in some cases amplifying, the gender stereotypes and biases already present in the labour market. This paper evaluates and measures this phenomenon by analysing how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience, focusing on Italian graduates under 35. The model was prompted to suggest jobs for 24 simulated candidate profiles balanced in terms of gender, age, experience and professional field. Although no significant differences emerged in job titles or industries, the adjectives attributed to female and male candidates showed gendered linguistic patterns: the model tended to associate women with emotional and empathetic traits and men with strategic and analytical ones. The research raises an ethical question about the use of these models in sensitive processes and highlights the need for transparency and fairness in future digital labour markets.
Summary
Main Finding
A prompt-based audit of GPT-5 using 24 simulated profiles of under-35 Italian graduates found no systematic differences in suggested job titles or industries by gender, but did find consistent gendered linguistic patterns: the model tended to describe female candidates with emotional/empathetic adjectives and male candidates with strategic/analytical adjectives. This indicates that LLMs can reproduce gendered trait stereotypes even when overt occupational recommendations appear gender-neutral.
Key Points
- Study focus: whether a state-of-the-art generative model (GPT-5) exhibits gendered responses when recommending occupations and describing candidates.
- Experimental setup: 24 simulated candidate profiles balanced on gender, age (under 35), work experience, and professional field.
- Primary result: job titles and recommended industries showed no meaningful gender split; descriptive language did.
- Language bias pattern: women were more frequently attributed emotional, interpersonal, or empathetic traits; men were more frequently attributed strategic, analytical, or leadership-oriented traits.
- Ethical concern: these trait attributions can influence human decision-makers and downstream automated systems in hiring and selection contexts, potentially reproducing and amplifying labor-market gender stereotypes.
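The balanced setup in the key points can be illustrated with a small full-factorial grid. This is a minimal sketch, not the study's actual design: the source only states that the 24 profiles are balanced on gender, age, experience and professional field, so the factor levels and the names `build_profiles` and `matched_pairs` below are hypothetical.

```python
from itertools import product

# Hypothetical factor levels (assumption): the source does not list them.
GENDERS = ("female", "male")
AGE_BANDS = ("25-29", "30-34")
EXPERIENCE = ("entry-level", "experienced")
FIELDS = ("engineering", "economics", "humanities")


def build_profiles():
    """Return one profile per combination of factor levels (full factorial)."""
    return [
        {"gender": g, "age_band": a, "experience": e, "field": f}
        for g, a, e, f in product(GENDERS, AGE_BANDS, EXPERIENCE, FIELDS)
    ]


def matched_pairs(profiles):
    """Group profiles that differ only in gender into (female, male) pairs."""
    by_key = {}
    for p in profiles:
        key = (p["age_band"], p["experience"], p["field"])
        by_key.setdefault(key, {})[p["gender"]] = p
    return [(pair["female"], pair["male"]) for pair in by_key.values()]


profiles = build_profiles()
pairs = matched_pairs(profiles)
print(len(profiles), len(pairs))  # 24 12
```

With these assumed levels, 2 x 2 x 2 x 3 gives 24 profiles and 12 gender-matched pairs, so any difference within a pair can only come from the gender field.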
Data & Methods
- Data: 24 synthetic candidate profiles representing Italian graduates under 35; profiles balanced across gender, age, experience level, and sector/field.
- Model: GPT-5 (prompted to suggest jobs and describe candidate attributes).
- Procedure: each profile was submitted with consistent prompts; outputs were compared across matched male/female versions.
- Analysis: qualitative and lexical analysis of the adjectives and descriptors produced for each gender; comparison of suggested job titles/industries for evidence of distributional differences.
- Limitations (implicit in the study design):
  - Small, synthetic sample (24 profiles) limits generalizability and statistical power.
  - Single-model audit (GPT-5); results may differ across models or versions.
  - Focus on Italian graduates under 35 restricts geographic and demographic scope.
  - Analysis centered on language in outputs, not on downstream hiring outcomes in real organizations.
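The lexical comparison described above could be approximated with a simple lexicon tally. The sketch below is an assumption about one way to do it, not the paper's method: the two word lists, the score definition and the function names are all hypothetical, and the example sentences are invented for illustration rather than taken from GPT-5 outputs.

```python
import re
from collections import Counter

# Hypothetical lexicons (assumption), loosely following the trait families
# named in the summary; the paper's actual word lists are not given.
COMMUNAL = {"empathetic", "emotional", "caring", "supportive", "collaborative"}
AGENTIC = {"strategic", "analytical", "decisive", "ambitious", "logical"}


def trait_counts(text):
    """Count lexicon hits per trait family in a lower-cased, tokenized text."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in COMMUNAL:
            counts["communal"] += 1
        elif token in AGENTIC:
            counts["agentic"] += 1
    return counts


def stereotype_score(text):
    """(communal - agentic) / (communal + agentic); 0.0 when nothing matches."""
    c = trait_counts(text)
    total = c["communal"] + c["agentic"]
    return 0.0 if total == 0 else (c["communal"] - c["agentic"]) / total


def paired_gap(female_text, male_text):
    """Score difference within one matched female/male pair."""
    return stereotype_score(female_text) - stereotype_score(male_text)


# Invented example sentences, not model outputs.
female_desc = "An empathetic and supportive communicator with analytical skills."
male_desc = "A strategic, analytical and decisive profile."
gap = paired_gap(female_desc, male_desc)
```

A positive gap means the female version of a pair received relatively more communal descriptors than the male version. A fixed lexicon misses synonyms and context, so a real audit would likely need part-of-speech tagging or human coding as well.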
Implications for AI Economics
- Bias transmission and amplification: Even where occupational recommendations seem neutral, gendered trait attributions can bias human evaluators or automated downstream scorers, affecting hiring, promotion, and wage trajectories.
- Statistical discrimination risks: Models that attach stereotyped traits to demographic groups can reinforce employers’ beliefs about group productivity or fit, leading to persistent labor-market segmentation and reduced labor-market efficiency.
- Signaling and human capital valuation: If LLMs systematically emphasize interpersonal traits for women and analytical traits for men, this can distort perceived skill endowments and influence which skills are signaled, trained for, or rewarded in the market.
- Measurement and auditability: The paper underscores the need for routine auditing of generative models used in selection, including linguistic-content audits and not just outcome parity checks, because subtle language cues shape decisions.
- Policy and governance: Findings support policy interventions for AI in hiring: transparency requirements (model cards, prompt disclosures), mandated bias testing across attributes, documentation of training data provenance, and human-in-the-loop safeguards.
- Practical mitigations for firms and platforms:
  - Use controlled templates that avoid open-ended trait generation when supporting selection.
  - Apply de-biasing techniques for generated language and tune models against stereotype tests.
  - Combine algorithmic recommendations with structured, criterion-based human review to reduce influence of stereotyped descriptors.
- Research agenda: scale audits across models, larger and more diverse candidate sets, measure downstream effects on real hiring decisions (field experiments), and develop quantitative metrics linking lexical stereotype measures to hiring outcomes and labor-market inequalities.
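Given the small sample, one quantitative check that fits the research agenda is an exact sign-flip permutation test on per-pair differences in a lexical stereotype measure. The sketch below assumes 12 matched female/male pairs (inferred from 24 gender-balanced profiles) and a hypothetical function name; it is not an analysis reported in the paper, and the input shown is an extreme illustrative case rather than study data.

```python
from itertools import product


def sign_flip_p_value(gaps):
    """Exact two-sided sign-flip permutation test for paired differences.

    Under the null hypothesis of no gender effect, each pair's gap is equally
    likely to be positive or negative, so every sign pattern is enumerated.
    """
    observed = abs(sum(gaps))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(gaps)):
        total += 1
        stat = abs(sum(s * g for s, g in zip(signs, gaps)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            hits += 1
    return hits / total


# Extreme illustrative input (assumption): all 12 pairs show the same gap.
p = sign_flip_p_value([1.0] * 12)
print(p)  # 2 of 4096 sign patterns are as extreme: 0.00048828125
```

With 12 pairs there are only 4096 sign patterns, so full enumeration is cheap and avoids distributional assumptions that a sample this small cannot support.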
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Generative AI (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles. [Hiring] | positive | medium | presence/role of GenAI systems in recruitment and selection processes | 0.05 |
| Large language models (LLMs) risk reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market. [AI Safety and Ethics] | negative | medium | presence and amplification of gender stereotypes/bias in LLM outputs | 0.05 |
| This study evaluates how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background for under-35-year-old Italian graduates. [Hiring] | positive | high | occupation suggestions produced by GPT-5 for specified candidate profiles | 0.09 |
| The model was prompted to suggest jobs to 24 simulated candidate profiles balanced in terms of gender, age, experience and professional field. [Hiring] | positive | high | number and composition of simulated candidate profiles used in the experiment | n=24; 0.09 |
| No significant differences emerged in job titles and industry suggested by GPT-5 across genders. [Hiring] | null_result | medium | suggested job titles and industry assignments by GPT-5 across male and female profiles | n=24; no significant differences in job titles/industries across genders (reported); 0.05 |
| Gendered linguistic patterns emerged in the adjectives attributed to female and male candidates: GPT-5 tended to associate women with emotional and empathetic traits and men with strategic and analytical traits. [AI Safety and Ethics] | negative | medium | adjectives/descriptive language used by GPT-5 to characterize candidates | n=24; gendered adjective patterns (women: emotional/empathetic; men: strategic/analytical); 0.05 |
| The findings raise ethical concerns about using such models in sensitive selection processes and highlight the need for transparency and fairness in digital labour markets. [AI Safety and Ethics] | negative | medium | ethical risk and need for transparency/fairness when deploying LLMs in recruitment | 0.05 |