A GPT-5 audit finds similar job and industry recommendations for simulated male and female applicants but markedly gendered language: women are described as empathetic and relational, men as analytical and leadership-oriented, which raises fairness concerns for AI-assisted hiring.

Gender Bias in Generative AI-assisted Recruitment Processes

Martina Ullasci, Marco Rondina, Riccardo Coppola, Antonio Vetrò · March 12, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: low · Relevance: 7/10

In a controlled audit using 24 simulated under-35 Italian graduate profiles, GPT-5 produced similar job-title and industry suggestions across genders but systematically used different descriptive adjectives, portraying women with relational/emotional traits and men with leadership/analytical/practical traits.

In recent years, generative artificial intelligence (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles. However, the employment of large language models (LLMs) risks reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market. The objective of this paper is to evaluate and measure this phenomenon, analysing how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background, focusing on under-35-year-old Italian graduates. The model has been prompted to suggest jobs to 24 simulated candidate profiles, which are balanced in terms of gender, age, experience and professional field. Although no significant differences emerged in job titles and industry, gendered linguistic patterns emerged in the adjectives attributed to female and male candidates, indicating a tendency of the model to associate women with emotional and empathetic traits, while men with strategic and analytical ones. The research raises an ethical question regarding the use of these models in sensitive processes, highlighting the need for transparency and fairness in future digital labour markets.

Summary

Main Finding

A prompt-based audit of GPT-5 using 24 simulated profiles of under-35 Italian graduates found no systematic differences in suggested job titles or industries by gender, but did find consistent gendered linguistic patterns: the model tended to describe female candidates with emotional/empathetic adjectives and male candidates with strategic/analytical adjectives. This indicates that LLMs can reproduce gendered trait stereotypes even when overt occupational recommendations appear gender-neutral.

Key Points

Study focus: whether a state-of-the-art generative model (GPT-5) exhibits gendered responses when recommending occupations and describing candidates.
Experimental setup: 24 simulated candidate profiles balanced on gender, age (under 35), work experience, and professional field.
Primary result: suggested job titles and industries showed no statistically significant differences by gender; the adjectives used to describe candidates did.
Language bias pattern: women were more frequently attributed emotional, interpersonal, or empathetic traits; men were more frequently attributed strategic, analytical, or leadership-oriented traits.
Ethical concern: these trait attributions can influence human decision-makers and downstream automated systems in hiring and selection contexts, potentially reproducing and amplifying labor-market gender stereotypes.

Data & Methods

Data: 24 synthetic candidate profiles representing Italian graduates under 35; profiles balanced across gender, age, experience level, and sector/field.
Model: GPT-5, accessed through the ChatGPT web interface with default settings between August and September 2025, and prompted to suggest jobs and describe candidate attributes.
Procedure: each profile was submitted three times with the same prompt (72 observations in total); outputs were compared across matched male/female versions.
Analysis: outputs were manually coded into job-title, industry, and adjective categories; adjective use and suggested job titles/industries were then compared across genders, with χ² tests for distributional differences.
Limitations (implied by the study design):
- Small, synthetic sample (24 profiles) limits generalizability and statistical power.
- Single-model audit (GPT-5) — results may differ across models or versions.
- Focus on Italian graduates under 35 restricts geographic and demographic scope.
- Analysis centered on language in outputs, not on downstream hiring outcomes in real organizations.
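
The balanced design described above can be made concrete with a short Python sketch. The factor levels (two genders, three occupational macro-areas, two experience levels, three prompt repetitions) come from the study description; the layout of two base profiles per cell is an assumption inferred from 24 / (2 × 3 × 2), and the `profile_id` format and the `build_design` helper are hypothetical names used only for illustration.

```python
from itertools import product

# Factor levels as described in the study summary.
GENDERS = ("female", "male")
MACRO_AREAS = ("Cognitive", "Socio-Relational", "Technical")
EXPERIENCE_LEVELS = ("Junior", "Senior")
REPEATS_PER_PROFILE = 3  # each profile was prompted three times

# Assumption: two base profiles per (area, level) cell, each issued in a
# matched female and male version, which yields 24 profiles overall.
BASE_PROFILES_PER_CELL = 2


def build_design():
    """Return one record per prompt run in the assumed balanced design."""
    runs = []
    cells = product(MACRO_AREAS, EXPERIENCE_LEVELS, range(BASE_PROFILES_PER_CELL))
    for area, level, variant in cells:
        for gender in GENDERS:
            # Hypothetical identifier format, for illustration only.
            profile_id = f"{area}-{level}-{variant}-{gender}"
            for repeat in range(REPEATS_PER_PROFILE):
                runs.append(
                    {
                        "profile_id": profile_id,
                        "gender": gender,
                        "area": area,
                        "level": level,
                        "repeat": repeat,
                    }
                )
    return runs


runs = build_design()
profiles = {run["profile_id"] for run in runs}
print(len(profiles), len(runs))  # 24 72
```

Because every base profile appears once per gender, any difference between the female and male outputs of a matched pair cannot be attributed to field or experience level under this layout.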

Implications for AI Economics

Bias transmission and amplification: Even where occupational recommendations seem neutral, gendered trait attributions can bias human evaluators or automated downstream scorers, affecting hiring, promotion, and wage trajectories.
Statistical discrimination risks: Models that attach stereotyped traits to demographic groups can reinforce employers’ beliefs about group productivity or fit, leading to persistent labor-market segmentation and reduced labor-market efficiency.
Signaling and human capital valuation: If LLMs systematically emphasize interpersonal traits for women and analytical traits for men, this can distort perceived skill endowments and influence which skills are signaled, trained for, or rewarded in the market.
Measurement and auditability: The paper underscores the need for routine auditing of generative models used in selection — including linguistic-content audits, not just outcome parity checks — as subtle language cues shape decisions.
Policy and governance: Findings support policy interventions for AI in hiring: transparency requirements (model cards, prompt disclosures), mandated bias testing across attributes, documentation of training data provenance, and human-in-the-loop safeguards.
Practical mitigations for firms and platforms:
- Use controlled templates that avoid open-ended trait generation when supporting selection.
- Apply de-biasing techniques for generated language and tune models against stereotype tests.
- Combine algorithmic recommendations with structured, criterion-based human review to reduce influence of stereotyped descriptors.
Research agenda: scale audits across models and across larger, more diverse candidate sets; measure downstream effects on real hiring decisions through field experiments; and develop quantitative metrics linking lexical stereotype measures to hiring outcomes and labor-market inequalities.
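
A linguistic-content audit of the kind called for under "Measurement and auditability" could be prototyped as follows. This is a minimal sketch, not the authors' method: the paper relied on manual open coding, whereas this example swaps in a small keyword lexicon. The lexicon, the sample outputs, and the function names are all hypothetical, and the toy counts are far too small for a reliable χ² test; they only show the mechanics.

```python
import math
import re
from collections import Counter

# Hypothetical mini-lexicon; a real audit would need a validated coding scheme.
TRAIT_LEXICON = {
    "relational": {"empathetic", "caring", "collaborative", "supportive"},
    "analytical": {"analytical", "strategic", "decisive", "logical"},
}


def count_traits(texts):
    """Count lexicon hits per trait category across a list of model outputs."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            for category, words in TRAIT_LEXICON.items():
                if token in words:
                    counts[category] += 1
    return counts


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). With one degree of freedom the survival
    function of the chi-square distribution is erfc(sqrt(x / 2)).
    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))


# Made-up example outputs, NOT taken from the paper.
female_outputs = [
    "An empathetic and collaborative professional.",
    "Supportive, caring and strategic.",
]
male_outputs = [
    "A strategic and analytical leader.",
    "Decisive, logical and collaborative.",
]

female_counts = count_traits(female_outputs)
male_counts = count_traits(male_outputs)
table = [
    [female_counts["relational"], female_counts["analytical"]],
    [male_counts["relational"], male_counts["analytical"]],
]
stat, p_value = chi_square_2x2(table)
print(table, round(stat, 2), round(p_value, 3))
```

Running the same counts on job-title or industry categories would give the outcome-parity check, so the two audits can share one pipeline and differ only in what is coded.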

Assessment

Paper Type: descriptive
Evidence Strength: low. Findings are based on 24 simulated profiles (72 prompts) and one LLM (GPT-5) under default settings; while adjective differences are statistically significant, the small, synthetic sample, single-model design, binary gender operationalization, and manual coding limit external validity and preclude causal claims.
Methods Rigor: medium. The authors used a controlled, balanced profile design, repeated trials, statistical tests (χ²), and documented coding procedures with cross-checks, but the study relies on a small n, one model/version, a single prompting strategy, and primarily manual open coding, which introduces subjectivity and limits robustness.
Sample: 24 synthetic Italian university graduate profiles (12 female, 12 male), all under 35, balanced across three occupational macro-areas (Cognitive, Socio-Relational, Technical) and two experience levels (Junior, Senior); each profile was prompted three times via the ChatGPT-5 web interface (default settings) between August and September 2025, for a total of 72 observations; outputs were coded into job title, industry, and adjective categories.
Themes: labor_markets, inequality, human_ai_collab
Generalizability:
- Small synthetic sample (N=24 profiles, 72 prompts) limits statistical power and representativeness.
- Profiles are simulated, not drawn from real applicants or administrative hiring data.
- Geographically and culturally specific (Italian graduates) and age-limited (<35).
- Gender operationalized as binary (excludes non-binary and intersectional identities).
- Single LLM (GPT-5) and single prompting strategy; results may not hold for other models or prompts.
- Data collected at one point in time; model updates could change behavior.
- Manual open coding introduces subjectivity in categorization.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Generative AI (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles.	Hiring	positive	medium	presence/role of GenAI systems in recruitment and selection processes	0.05
Large language models (LLMs) risk reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market.	AI Safety and Ethics	negative	medium	presence and amplification of gender stereotypes/bias in LLM outputs	0.05
This study evaluates how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background for under-35-year-old Italian graduates.	Hiring	positive	high	occupation suggestions produced by GPT-5 for specified candidate profiles	0.09
The model was prompted to suggest jobs to 24 simulated candidate profiles balanced in terms of gender, age, experience and professional field.	Hiring	positive	high	number and composition of simulated candidate profiles used in the experiment	n=24; 0.09
No significant differences emerged in job titles and industry suggested by GPT-5 across genders.	Hiring	null_result	medium	suggested job titles and industry assignments by GPT-5 across male and female profiles	n=24; no significant differences in job titles/industries across genders (reported); 0.05
Gendered linguistic patterns emerged in the adjectives attributed to female and male candidates: GPT-5 tended to associate women with emotional and empathetic traits and men with strategic and analytical traits.	AI Safety and Ethics	negative	medium	adjectives/descriptive language used by GPT-5 to characterize candidates	n=24; gendered adjective patterns (women: emotional/empathetic; men: strategic/analytical); 0.05
The findings raise ethical concerns about using such models in sensitive selection processes and highlight the need for transparency and fairness in digital labour markets.	AI Safety and Ethics	negative	medium	ethical risk and need for transparency/fairness when deploying LLMs in recruitment	0.05