A hybrid AI hiring engine that combines Transformer embeddings with a skills knowledge graph improves candidate retrieval over traditional keyword search on the JobSearch-XS benchmark and adds factor-level explanations; the system is released with a demo and an installable package.

JobMatchAI: An Intelligent Job Matching Platform Using Knowledge Graphs, Semantic Search and Explainable AI

Mayank Vyas, Abhijit Chakraborty, Vivek Gupta · March 15, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

JobMatchAI is a production-ready hybrid candidate matching system that combines Transformer embeddings, a skills knowledge graph, and an interpretable reranker to improve retrieval and skill generalization over keyword-based search, demonstrated on the JobSearch-XS benchmark.

Recruiters and job seekers rely on search systems to navigate labor markets, making candidate matching engines critical for hiring outcomes. Most systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores. We introduce JobMatchAI, a production-ready system integrating Transformer embeddings, skill knowledge graphs, and interpretable reranking. Our system optimizes utility across skill fit, experience, location, salary, and company preferences, providing factor-wise explanations through resume-driven search workflows. We release JobSearch-XS benchmark and a hybrid retrieval stack combining BM25, knowledge graph and semantic components to evaluate skill generalization. We assess system performance on JobSearch-XS across retrieval tasks, provide a demo video, a hosted website and installable package.

Summary

Main Finding

JobMatchAI is a deployable, microservices-based job matching platform that combines lexical (BM25) retrieval, dense semantic (Sentence Transformer) retrieval, and a curated skill knowledge graph, then applies a white-box, multi-factor reranker whose component scores are narrated by an LLM. The architecture strictly separates deterministic scoring from generative explanation (the LLM sees only precomputed factor scores and KG evidence), producing auditable, low-latency (P50 < 100 ms) explainable matches. On the authors’ JobSearch-XS benchmark, the full pipeline achieves NDCG@10 ≈ 0.81 (≈7% relative gain vs BM25) while supporting interactive weight control and transparent explanations.

Key Points

Hybrid retrieval stack: parallel BM25 (Elasticsearch), ANN semantic kNN (all-MiniLM-L6-v2 embeddings, 384-d), and Neo4j multi-hop KG traversal. Per-channel k values: BM25 k=150, semantic kNN k=150, KG k=75; union cap ≈400.
Query enrichment: extracts entities/skills from free text or resumes, expands skills via depth-2 RELATED_TO traversal in the KG to bridge synonyms/related skills (e.g., Kubernetes ↔ container orchestration).
Fusion: reciprocal rank fusion (RRF) with query-adaptive weights (short queries favor KG, longer queries favor text).
Explainable reranking: deterministic utility U(c,j) = Σ_f w_f · ϕ_f(c,j) with six interpretable factors (default weights shown): Skill 0.35 (Jaccard + KG relatedness bonus), Experience 0.25 (level distance), Location 0.15, Salary 0.10, Semantic similarity 0.10 (cosine of embeddings), Company fit 0.05. Users can adjust weights interactively.
LLM explanations: the LLM is fed only the six factor scores and supporting KG paths (not raw documents), so explanations stay grounded and auditable and cannot hallucinate score-causing evidence.
Deliverables: live demo, installable package, and JobSearch-XS benchmark release (1,283 NYC civil-service roles, 30 queries, ~29K silver-label pairs; smaller human-verified gold set).
Evaluation: full pipeline (with reranking) achieves NDCG@10 ≈ 0.81, median latency ≈82 ms; recall tradeoffs noted (Recall@100 ≈0.35 due to fusion caps). Pilot user study (N=20) showed users rated explanations helpful and valued weight controls; faithfulness checks showed no unsupported claims in sampled explanations.
Limitations: small gold label set, lower recall due to conservative fusion/union cap, KG dependent on curated synonym tables (may miss emerging skills), limited user-study power, no full bias audit.
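
The white-box utility described in the key points can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes every factor score ϕ_f is already normalized to [0, 1], it uses plain Jaccard overlap for the skill factor and omits the KG relatedness bonus (whose exact form the summary does not give), and the function names and example values are hypothetical.

```python
# Default factor weights as listed in the summary; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "skill": 0.35,
    "experience": 0.25,
    "location": 0.15,
    "salary": 0.10,
    "semantic": 0.10,
    "company": 0.05,
}


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def contributions(factors: dict, weights: dict = DEFAULT_WEIGHTS) -> dict:
    """Per-factor terms w_f * phi_f, i.e. the numbers an explanation layer could cite."""
    return {name: weights[name] * factors[name] for name in weights}


def utility(factors: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Deterministic utility U(c, j) = sum over factors of w_f * phi_f(c, j)."""
    return sum(contributions(factors, weights).values())


# Hypothetical candidate/job pair; all factor values are made up for illustration.
candidate_skills = {"python", "kubernetes", "sql"}
job_skills = {"python", "sql", "docker"}
example_factors = {
    "skill": jaccard(candidate_skills, job_skills),  # 2 shared / 4 total = 0.5
    "experience": 1.0,
    "location": 1.0,
    "salary": 0.5,
    "semantic": 0.8,
    "company": 0.0,
}
example_score = utility(example_factors)
```

Because the score is a plain weighted sum, changing a weight (as the interactive sliders presumably do) only rescales one term, and the per-factor contributions remain directly inspectable.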

Data & Methods

Data:
- JobSearch-XS benchmark: 1,283 NYC civil-service job postings, 30 queries, 29K silver labels (KG-derived), a small human-verified gold set (≈20 queries / 40 judged pairs; κ ≈ 0.85 between annotators).
- Production-style ingestion used NYC Open Data for evaluation; the system supports multi-source crawlers with deduplication.
Representations & Storage:
- Embeddings: all-MiniLM-L6-v2 (384 dimensions) for dense retrieval and semantic similarity feature.
- Lexical index + ANN: Elasticsearch with BM25 and vector kNN (HNSW).
- Knowledge graph: Neo4j with node types Candidate, Job, Skill, Location, Company; relations include HAS_SKILL, REQUIRES_SKILL, RELATED_TO, LOCATED_IN.
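
To make the depth-2 RELATED_TO expansion concrete, here is a small sketch that runs a bounded breadth-first search over an in-memory adjacency map. It stands in for the Neo4j traversal rather than reproducing it: the edge data, the `expand_skills` name, and the choice to return hop distances are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical RELATED_TO edges, stored as an undirected adjacency map.
RELATED_TO = {
    "kubernetes": {"container orchestration", "docker"},
    "container orchestration": {"kubernetes"},
    "docker": {"kubernetes", "containers"},
    "containers": {"docker", "linux"},
    "linux": {"containers"},
}


def expand_skills(seeds: set, graph: dict, max_depth: int = 2) -> dict:
    """Breadth-first expansion; maps each reachable skill to its hop distance (seeds are 0)."""
    depth = {skill: 0 for skill in seeds}
    queue = deque(seeds)
    while queue:
        skill = queue.popleft()
        if depth[skill] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(skill, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[skill] + 1
                queue.append(neighbor)
    return depth


expanded = expand_skills({"kubernetes"}, RELATED_TO)
```

With this toy graph, "container orchestration" and "docker" are one hop from "kubernetes", "containers" is two hops away, and "linux" (three hops) is excluded by the depth limit.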
Pipeline:
- Ingestion → dual indexing (ES + Neo4j) → retrieval (parallel BM25 / semantic kNN / KG traversal) → RRF fusion (RRF smoothing constant k=60, distinct from the per-channel retrieval k) → hard-constraint filtering (visa, degree, etc.) → white-box reranking → LLM-based explanation.
- Query enrichment produces structured query representation ⟨entities, skills, expanded skills, embedding, keywords⟩.
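
The fusion step of the pipeline can be illustrated with a short reciprocal rank fusion sketch. It uses the standard RRF score, the sum over channels of 1 / (k + rank), with k=60 as stated above. The optional per-channel weights are one plausible way to express the query-adaptive weighting and are an assumption, as are the function name, the `cap` parameter, and the example rankings.

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(rankings: dict, weights: Optional[dict] = None,
             k: int = 60, cap: Optional[int] = None) -> list:
    """Fuse ranked ID lists; each channel adds weight / (k + rank) for every ID it returns."""
    scores = defaultdict(float)
    for channel, ranked_ids in rankings.items():
        w = 1.0 if weights is None else weights.get(channel, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):  # ranks are 1-based
            scores[doc_id] += w / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if cap is None else fused[:cap]


# Hypothetical per-channel results for one query.
example_rankings = {
    "bm25": ["j1", "j2", "j3"],
    "semantic": ["j2", "j1", "j4"],
    "kg": ["j4", "j2"],
}
unweighted_order = [doc for doc, _ in rrf_fuse(example_rankings)]
kg_heavy_order = [doc for doc, _ in rrf_fuse(example_rankings, weights={"kg": 2.0})]
```

Here "j2" ranks first because all three channels return it near the top; doubling the KG weight, as might be done for a short query, moves "j4" ahead of "j1".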
Evaluation metrics:
- Ranking: NDCG@5/10, MRR.
- Coverage: Recall@50/100.
- Latency: P50, with P95 also reported.
- User study: Likert-scale ratings across relevance, synonym handling, explanation helpfulness, slider usefulness; qualitative feedback and a small automated faithfulness audit on explanations.
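
For readers unfamiliar with the ranking and coverage metrics listed above, the following sketch shows common textbook definitions. It assumes the linear-gain form of DCG (gain divided by log2(rank + 1)); the summary does not say which gain function the authors use, and the helper names are illustrative.

```python
import math


def dcg_at_k(gains: list, k: int) -> float:
    """Discounted cumulative gain over the top k, using linear gains."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: list, relevance: dict, k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering of the judged items."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids: list, relevant: set) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list) -> float:
    """MRR over (ranked_ids, relevant_set) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def recall_at_k(ranked_ids: list, relevant: set, k: int) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)
```

Recall@k is bounded by how many relevant items survive into the fused candidate pool, which is consistent with the summary's note that the union cap limits Recall@100.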
Experiments & Findings:
- Ablation table shows the contribution of each retrieval channel; KG-only retrieval gives perfect recall on KG-reachable pairs but is not scalable; the best overall results come from combining channels with the reranker.
- Per-split performance: train/dev/test splits are skill-disjoint; test NDCG@10 is lower than train, reflecting the zero-shot skill generalization challenge.
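
One way to obtain skill-disjoint splits is to assign whole skills, rather than individual queries, to a split. The sketch below is purely illustrative and is not taken from the paper: it assumes each query has a single primary skill, and the hash-bucket scheme, percentages, and function names are hypothetical.

```python
import hashlib


def split_for_skill(skill: str, dev_pct: int = 15, test_pct: int = 15) -> str:
    """Deterministically map a skill to a split via a stable hash bucket in [0, 100)."""
    bucket = int(hashlib.sha256(skill.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_queries(query_skill: dict) -> dict:
    """Group query IDs by the split of their primary skill, so no skill spans two splits."""
    splits = {"train": [], "dev": [], "test": []}
    for query_id, skill in query_skill.items():
        splits[split_for_skill(skill)].append(query_id)
    return splits
```

Because assignment depends only on the skill string, every query about the same skill lands in the same split, so skills evaluated in the test split are never seen during training or tuning.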

Implications for AI Economics

Reducing search frictions in labor markets:
- Better matching (bridging synonyms and nonlinear career paths) can reduce search costs and unemployment duration, improving matching efficiency in online labor markets.
- KG-enabled expansion allows non-lexical matches (skills expressed differently) which can increase effective labor supply to employers and broaden opportunities for workers with transferable skills.
Welfare and wage effects:
- Improved match quality could increase surplus for both employers (better hires, lower vacancy costs) and workers (higher match value, possibly higher wages), but distributional impacts depend on how matches are ranked and who is favored by the KG topology.
- If the system systematically favors certain skill clusters, it could channel demand towards subsets of workers, affecting wage dispersion and career mobility.
Signaling and complementarities:
- Transparent factor scores let job seekers understand gaps (skill, experience, salary mismatch) and may influence their investment in upskilling or job search strategy—affecting human capital investment dynamics.
- Recruiters may adjust behavior (e.g., posting differently) knowing how the platform matches and explains results, creating strategic complementarities between platform design and labor-market signaling.
Market structure and platform effects:
- A deployable, explainable stack lowers entry costs for niche job platforms focused on interpretability or regulated settings (e.g., public sector hiring), potentially increasing competition among matching platforms.
- Auditability can reduce regulatory friction: white-box scoring and logged factor-level data facilitate compliance and fairness assessments, which is valuable as regulation around automated hiring grows.
Risks and externalities:
- Bias propagation: embeddings and KG curation can encode historical biases (e.g., undervaluing nontraditional experience). Without active fairness interventions, automated re-ranking may reinforce inequalities.
- Misinterpretation risk: per-factor scores might be seen as objective measures of candidate worth, affecting employer decisions and candidate behavior; careful UI/education and governance are needed.
- Search concentration: improved matching could increase concentration if a few platforms become dominant suppliers of high-quality matches; this has implications for platform market power and fee structures.
Research and policy opportunities:
- The JobSearch-XS benchmark and released code enable empirical work on zero-shot skill generalization, explainability metrics in hiring, and fairness-aware reranking—valuable for economics research on matching markets and algorithmic governance.
- Policymakers could leverage audit-ready platforms for regulated hires (civil service), where traceability of ranking decisions is legally and socially important.

Overall, JobMatchAI is a concrete prototype showing how hybrid retrieval + KG + deterministic, factorized scoring plus constrained LLM narration can deliver auditable, explainable job matching. For AI economics, its main value is both operational (reducing matching frictions) and methodological (providing auditable signals and a benchmark for studying effects of algorithmic matching on labor-market outcomes), while requiring attention to bias audits, recall tradeoffs, and distributional impacts.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is a systems paper presenting a production search/matching system and a benchmark; it does not attempt causal identification of AI impacts on economic outcomes.
Methods Rigor: medium. The paper builds a hybrid retrieval stack (BM25 + knowledge graph + transformer embeddings), introduces an interpretable reranker, and evaluates on a released benchmark (JobSearch-XS); however, it appears to rely on offline retrieval metrics and system demos rather than randomized experiments or real-world hiring outcome validation, and details on dataset representativeness and evaluation protocols are limited.
Sample: Evaluation uses the newly released JobSearch-XS benchmark (resume and job-description pairs) and offline retrieval tasks; the system is demonstrated on production-ready components (embedding models, skill knowledge graph, BM25) with a hosted demo and installable package, but no A/B tests on live hiring pipelines are reported.
Themes: labor_markets, adoption, human_ai_collab
Generalizability:
- JobSearch-XS benchmark may not reflect the full diversity of real-world hiring (industries, firm sizes, countries).
- Results focus on retrieval metrics, not downstream hiring outcomes (interviews, offers, time-to-hire, wages).
- Performance depends on the quality and coverage of the skills knowledge graph and embedding models, which may not generalize across languages or niche occupations.
- Production constraints (latency, privacy, integration with applicant tracking systems) may limit deployability at different organizations.
- Potential dataset or model biases in resumes and job descriptions could affect fairness and applicability across demographic groups.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
JobMatchAI integrates Transformer embeddings, skill knowledge graphs, and interpretable reranking.	Other	positive	high	system design / component integration (presence of Transformer embeddings, knowledge graph, and interpretable reranking)	0.03
JobMatchAI is production-ready.	Adoption Rate	positive	medium	production readiness (availability of deployable artifacts such as hosted site and installable package)	0.02
JobMatchAI optimizes utility across skill fit, experience, location, salary, and company preferences.	Decision Quality	positive	medium	aggregate utility across factors: skill fit, experience, location, salary, company preferences	0.02
JobMatchAI provides factor-wise explanations through resume-driven search workflows.	AI Safety and Ethics	positive	medium	explainability: factor-wise explanations presented to users within resume-driven search workflows	0.02
The authors release JobSearch-XS benchmark.	Other	positive	high	availability of JobSearch-XS benchmark (artifact release)	0.03
The authors provide a hybrid retrieval stack combining BM25, a skill knowledge graph, and semantic components to evaluate skill generalization.	Other	positive	high	retrieval stack composition (BM25 + knowledge graph + semantic components) intended for skill generalization evaluation	0.03
The authors assess system performance on JobSearch-XS across retrieval tasks.	Output Quality	null_result	medium	retrieval performance on JobSearch-XS tasks (metrics unspecified in excerpt)	0.02
The authors provide a demo video, a hosted website, and an installable package demonstrating JobMatchAI.	Other	positive	high	availability of demonstration artifacts (video, hosted website, installable package)	0.03
Most existing candidate matching systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores.	Decision Quality	negative	medium	limitations of extant systems: keyword-filter behavior, failure on skill synonyms and nonlinear careers, consequence of missed candidates and opaque scores	0.02