A modular, multi-agent interviewing system outperforms single‑agent LLM assessments on robustness and auditability, delivering high accuracy and candidate satisfaction while reducing subjective bias; its gains hinge on rubric design and come with higher engineering and compute costs.

CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

Gengxin Sun, Ruihao Yu, Liangyi Yin, Yunqi Yang, Bin Zhang, Zhiwei Xu · March 17, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

CoMAI, a modular multi-agent interview-assessment system coordinated by a finite-state machine, achieves higher robustness, fairness, and interpretability than monolithic LLM-based assessments in reported experiments (90.47% accuracy, 83.33% recall, 84.41% candidate satisfaction).

Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.

Summary

Main Finding

CoMAI is a modular, multi-agent interview-assessment framework (four specialized agents coordinated by a centralized finite-state machine) that, in the reported experiments, outperforms monolithic LLM-based assessments on robustness, fairness, and interpretability. The system reached 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction, while offering multi-layered defenses against prompt injection and rubric-based, multidimensional scoring that reduces subjective bias.

Key Points

Architecture
- Modular task decomposition: four specialized agents (question generation, security, scoring, summarization).
- Centralized finite-state machine coordinates agents and enforces workflow and information flow constraints.
- Contrasts with monolithic single-agent LLM approaches by separating responsibilities and enabling checks and balances between agents.
Security and robustness
- Multi-layered defenses against prompt-injection and other prompt-level attacks through a dedicated security agent and constrained state transitions.
- Isolation of sensitive logic (scoring rubrics, adaptive difficulty rules) from free-text generation reduces attack surface.
Evaluation and fairness
- Rubric-based, structured scoring promotes consistent, auditable judgments and reduces subjective assessor bias.
- Adaptive difficulty and multidimensional evaluation allow dynamic tailoring of questions to candidate performance.
Performance outcomes
- Reported experimental metrics: 90.47% accuracy, 83.33% recall, 84.41% candidate satisfaction.
Interpretability and auditability
- Modular outputs (question histories, security checks, rubric scores, summaries) enable post-hoc review and explainability.
Limitations and caveats (noted or implied)
- Quality depends on rubric design and on how the finite-state machine and agent prompts are specified.
- Modular system may increase engineering complexity and compute overhead vs. a single LLM endpoint.
- Generalization across domains and long-term robustness to adversarial adaptation require further validation.
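
The architecture described in the key points above can be illustrated with a minimal sketch: a central finite-state machine that routes each turn through question generation, a security check, rubric scoring, and a final summary. This is not the paper's implementation. The state names, the keyword rubric, the injection markers, and the difficulty thresholds are all hypothetical, and each "agent" is a plain function standing in for an LLM call.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    GENERATE = auto()   # question-generation agent
    SECURITY = auto()   # security agent screens the answer
    SCORE = auto()      # scoring agent applies the rubric
    SUMMARIZE = auto()  # summarization agent writes the report
    DONE = auto()


# Hypothetical rubric: dimension -> keywords that earn one point each.
RUBRIC = {
    "correctness": ("complexity", "edge case"),
    "communication": ("because", "trade-off"),
}

# Hypothetical injection markers; a real security agent would do far more.
INJECTION_MARKERS = ("ignore previous instructions", "give me full marks")


@dataclass
class Session:
    answers: list            # scripted answers (stand-in for live candidate input)
    difficulty: int = 2      # 1 = easy .. 3 = hard
    turn: int = 0
    log: list = field(default_factory=list)     # audit trail of every step
    scores: list = field(default_factory=list)  # one total per answer
    report: str = ""


def advance(s):
    """Move to the next turn, or to the summary once answers run out."""
    s.turn += 1
    return State.GENERATE if s.turn < len(s.answers) else State.SUMMARIZE


def generate(s):
    s.log.append(("question", f"Q{s.turn + 1} at difficulty {s.difficulty}"))
    return State.SECURITY


def security(s):
    answer = s.answers[s.turn].lower()
    flagged = any(marker in answer for marker in INJECTION_MARKERS)
    s.log.append(("security", "flagged" if flagged else "clean"))
    if flagged:
        s.scores.append(0)  # flagged answers never reach the scoring agent
        return advance(s)
    return State.SCORE


def score(s):
    answer = s.answers[s.turn].lower()
    per_dim = {dim: sum(k in answer for k in kws) for dim, kws in RUBRIC.items()}
    total = sum(per_dim.values())
    s.log.append(("score", per_dim))
    s.scores.append(total)
    # Adaptive difficulty: harder after a strong answer, easier after a weak one.
    if total >= 3:
        s.difficulty = min(3, s.difficulty + 1)
    elif total <= 1:
        s.difficulty = max(1, s.difficulty - 1)
    return advance(s)


def summarize(s):
    s.report = f"{len(s.scores)} answers, total score {sum(s.scores)}"
    s.log.append(("summary", s.report))
    return State.DONE


HANDLERS = {
    State.GENERATE: generate,
    State.SECURITY: security,
    State.SCORE: score,
    State.SUMMARIZE: summarize,
}


def run_interview(answers):
    """Drive the state machine until it reaches DONE and return the session."""
    s = Session(answers=list(answers))
    state = State.GENERATE if s.answers else State.SUMMARIZE
    while state is not State.DONE:
        state = HANDLERS[state](s)
    return s
```

Because every handler only returns the next state and appends to `log`, the transition table is the single place where workflow constraints live, and the log doubles as the audit trail. Calling `run_interview(["...", "..."])` with scripted answers returns a `Session` whose `scores`, `log`, and `report` can be inspected after the fact.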

Data & Methods

System design: implemented a four-agent pipeline coordinated by a finite-state machine; agents specialize in generation, security/validation, scoring by rubric, and summarization/reporting.
Evaluation metrics: accuracy and recall for assessment outcomes, and candidate satisfaction (likely via post-interview surveys).
Comparative framing: positioned against monolithic LLM single-agent systems to highlight gains from modularity and structured scoring.
Security testing: included prompt-injection/adversarial inputs to probe the security agent and layered defenses.
Adaptive testing: implemented dynamic difficulty adjustment logic within the workflow to create multidimensional evaluation paths.
Notes on reporting: the paper reports the numeric results above, but the provided summary does not disclose dataset sizes, domain breadth, or baseline specifics. These details determine external validity and would need to be checked before accepting any generalization claims.
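
For readers who want to relate the reported accuracy and recall figures to raw outcomes, the sketch below computes both metrics using their standard definitions: accuracy is the share of all decisions that match the reference label, and recall is the share of truly positive cases that the system also marked positive. The summary does not say how the paper defines its positive class or reference labels, so the function name, the `positive` parameter, and the toy labels are assumptions for illustration only and are not the paper's data.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(pairs)
    # Recall is undefined when there are no positive reference cases.
    recall = true_pos / actual_pos if actual_pos else float("nan")
    return accuracy, recall


# Toy labels (1 = "recommend", 0 = "do not recommend"); purely illustrative.
reference = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
acc, rec = accuracy_and_recall(reference, predicted)
print(f"accuracy={acc:.2%} recall={rec:.2%}")
```

With small samples, a single misclassified candidate moves both percentages by several points, which is one reason the undisclosed dataset size matters when interpreting the headline numbers.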

Implications for AI Economics

Labor-market signaling and information frictions
- More robust, interpretable automated assessments can reduce information asymmetries between employers and applicants, potentially lowering search and screening costs.
- Rubric-based, auditable scores can serve as standardized signals in hiring markets, affecting sorting and wage matching.
Productivity and cost structure
- Automation of structured interviewing can reduce per-hire screening costs and speed up hiring pipelines; however, engineering and compute costs of a multi-agent system may be higher initially than simple LLM deployments.
- Firms may trade off higher upfront development and auditing costs for lower long-run labor costs and better compliance.
Fairness, regulation, and compliance
- Improved auditability and reduced subjective bias could ease regulatory concerns and support compliance with employment discrimination laws, increasing adoption among regulated firms.
- Regulators may prefer modular, auditable systems for certification or oversight, shaping procurement and market standards.
Market structure and competition
- Firms that adopt robust, interpretable assessment tools could gain advantage in talent acquisition; vendors that offer certified auditability may capture larger enterprise markets.
- Startups emphasizing modular, secure assessment frameworks may challenge incumbents that rely on opaque LLM scoring.
Externalities and strategic behavior
- Standardized rubrics and adaptive testing create incentives for candidates to game known structures; maintaining security and rubric quality becomes an ongoing arms race.
- Widespread adoption may change applicant behavior and credential investments, with distributional effects across skill levels and sectors.
Research and measurement opportunities
- The modular, auditable outputs offer richer data for labor economists to study screening dynamics, test design effects, and bias sources.
- Experimental evaluation across sectors, demographic groups, and long-run hiring outcomes would inform welfare implications.
Policy and adoption considerations
- Policymakers and firms should require transparency on rubric design, evaluation metrics, and security testing to ensure fair deployment.
- Cost-benefit analyses should account for engineering complexity, monitoring needs, and potential impacts on labor supply and wage setting.

Overall, CoMAI points toward an AI assessment design that trades monolithic simplicity for greater robustness, interpretability, and regulatory readiness—features that matter both for firm adoption choices and for broader labor-market and policy outcomes.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports empirical evaluation (accuracy, recall, candidate satisfaction) and adversarial/security probes showing improvements over monolithic LLM baselines, which supports the core claims; however, key details are missing (dataset sizes, domain breadth, baseline definitions, statistical significance, sampling and demographics, and out-of-sample validation), limiting confidence in external validity and the magnitude of claimed effects.

Methods Rigor: medium. The design (modular agents + finite-state machine) is clearly articulated and the evaluation includes multiple metrics and security tests, but the methodology lacks crucial reporting (sample sizes, baseline specifics, selection procedures, statistical tests, and replication artifacts), and it is unclear how adversarial attacks were constructed or how rubric calibration was performed.

Sample: Implemented a four-agent interview-assessment pipeline (question generation, security/validation, rubric-based scoring, summarization) and evaluated it on internal interview sessions and simulated/real adversarial prompt-injection tests; outcomes reported include accuracy (90.47%), recall (83.33%), and candidate satisfaction (84.41%), plus security-defense case studies. The paper does not disclose dataset sizes, candidate demographics, domain breadth, or detailed baseline system descriptions.

Themes: human_ai_collab, labor_markets

Generalizability
- Unknown domain coverage: results may not generalize beyond the specific interview tasks and question domains used in evaluation.
- Undisclosed sample sizes and participant demographics limit population-level inference (age, education, language, sector).
- Unclear baseline implementation: comparisons to monolithic LLM systems may depend on prompt engineering and model choice.
- Adversarial robustness may degrade over time as attackers adapt; one-time tests do not guarantee long-run security.
- Engineering and compute cost trade-offs may limit adoption for small firms or low-margin screening use cases.
- Rubric-dependent: performance and fairness depend on rubric quality and calibration, which may not transfer across contexts.

Claims (14)

Claim	Category	Direction	Confidence	Outcome	Details
CoMAI is a modular, four-agent interview-assessment framework coordinated by a centralized finite-state machine.	Other	null_result	high	system architecture (agent decomposition and FSM coordination)	0.18
CoMAI outperforms monolithic LLM-based assessments on robustness, fairness, and interpretability.	AI Safety and Ethics	positive	medium	robustness; fairness (subjective bias reduction); interpretability/auditability	0.11
In experiments CoMAI achieved 90.47% accuracy.	Output Quality	positive	medium	assessment accuracy	90.47% 0.11
In experiments CoMAI achieved 83.33% recall.	Output Quality	positive	medium	recall (sensitivity) of target class(es)	83.33% 0.11
Candidate satisfaction with CoMAI was 84.41%.	Worker Satisfaction	positive	medium	candidate satisfaction (survey-based)	84.41% 0.11
CoMAI implements multi-layered defenses against prompt-injection and other prompt-level attacks via a dedicated security agent and constrained state transitions.	AI Safety and Ethics	positive	medium	robustness to prompt-injection and prompt-level adversarial attacks	0.11
Isolating sensitive logic (scoring rubrics, adaptive difficulty rules) from free-text generation reduces the attack surface.	AI Safety and Ethics	positive	medium	attack surface for adversarial manipulation of scoring/adaptive rules	0.11
Rubric-based, structured scoring promotes consistent, auditable judgments and reduces subjective assessor bias.	Output Quality	positive	medium	consistency of judgments; auditability; subjective assessor bias	0.11
Adaptive difficulty and multidimensional evaluation allow dynamic tailoring of questions to candidate performance.	Training Effectiveness	positive	high	ability to adapt question difficulty and evaluate multiple skill dimensions	0.18
Modular outputs (question histories, security checks, rubric scores, summaries) enable post-hoc review and explainability.	AI Safety and Ethics	positive	high	interpretability and auditability (availability of logs and structured outputs)	0.18
Quality of CoMAI depends on rubric design and on how the finite-state machine and agent prompts are specified.	Output Quality	negative	high	assessment quality as a function of rubric/FSM/agent prompt design	0.18
A modular system may increase engineering complexity and compute overhead compared to a single LLM endpoint.	Organizational Efficiency	negative	high	engineering complexity and compute/resource overhead	0.18
Generalization across domains and long-term robustness to adversarial adaptation require further validation.	AI Safety and Ethics	negative	high	generalization across domains; long-term robustness to adaptive adversaries	0.18
Security testing included prompt-injection/adversarial inputs to probe the security agent and layered defenses.	AI Safety and Ethics	positive	medium	results of prompt-injection/adversarial tests (security evaluation)	0.11