A multi-agent virtual customer assistant (VCA) that feeds device diagnostics to an LLM nearly doubles successful cybersecurity fixes, lifting correct resolutions from ~50% to over 90%; participants report strong willingness to substitute it for human IT support, and stepwise guidance raises perceived pleasantness and cuts user burden.

SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization

Yair Meidan, Omri Haller, Yulia Moshan, Shahaf David, Dudu Mimran, Yuval Elovici, Asaf Shabtai · April 29, 2026

arxiv · quasi_experimental · medium evidence · 7/10 relevance

SecMate, a multi-agent VCA that integrates device-level diagnostics, implicit user proficiency, and a context-aware recommender, substantially improves cybersecurity troubleshooting success (correct resolutions rising from ~50% to >90% vs an LLM-only baseline), reduces user burden, and elicits strong willingness to substitute human IT support at lower cost.

Recent advances in large language models and agentic frameworks have enabled virtual customer assistants (VCAs) for complex support. We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals. Device specificity is provided by a lightweight local diagnostic utility, while user specificity relies on implicit proficiency inference and profile-aware troubleshooting. Service specificity is achieved through a proactive, context-aware recommender. We evaluate SecMate in a controlled study with 144 participants and 711 conversations. Device-level evidence increased correct resolutions from about 50% to over 90% relative to an LLM-only baseline, while step-by-step guidance improved pleasantness and reduced user burden. The recommender achieved high relevance (MRR@1=0.75), and participants showed strong willingness to substitute human IT support at costs well below human benchmarks. We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.

Summary

Main Finding

SecMate — a multi-agent virtual customer assistant that combines device-level evidence (Clue Collector), implicit user proficiency profiling (ProfiLLM), a profile-aware troubleshooter, a follow-up question generator, a proactive recommender, and a confidence-guided orchestrator — substantially improves self-service cybersecurity troubleshooting. In a controlled study (144 participants, 711 conversations), device-grounded evidence increased correct resolutions from roughly 50% (LLM-only baseline) to over 90%. Profile-aware, stepwise guidance raised perceived pleasantness and lowered user burden, and the embedded recommender achieved high relevance (MRR@1 ≈ 0.75). Participants reported strong willingness to substitute human IT support at conversation costs well below human benchmarks.

Key Points

Architecture and agents
- Clue Collector (CC): lightweight local diagnostic utility that collects OS, processes, network, installed software, etc., with explicit user consent to ground diagnostics.
- Profiler: implicit, continuous inference of user IT/cybersecurity proficiency using a 23-dimensional ProfiLLM taxonomy; scores inform response complexity and action prioritization.
- Profile-Aware Troubleshooter: LLM-based agent that prioritizes suitable (not merely probable) diagnostic/remediation paths and adapts phrasing/detail to user skill.
- Follow-Up Question Generator: LLM identifies missing information and elicits targeted, profile-aware clarifications.
- Proactive Recommender: ImpReSS-based implicit recommender extended with LLM-generated short rationales; ranks relevant support or security products during troubleshooting.
- Orchestrator: computes diagnosis confidence (Dconf) and coordinates when to ask questions, collect device evidence, or present segmented step-by-step solutions.
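The Orchestrator's coordination role can be illustrated with a minimal routing sketch. This is not the authors' implementation: the `DialogueState` fields, the `Action` names, the 0.8 threshold, and the order of the checks are all assumptions made for illustration. Only the ideas of a diagnosis confidence (Dconf), consent-gated evidence collection, and retrieving evidence only when informative come from the summary.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ASK_FOLLOW_UP = "ask_follow_up"
    COLLECT_DEVICE_EVIDENCE = "collect_device_evidence"
    PRESENT_SOLUTION_STEP = "present_solution_step"


@dataclass
class DialogueState:
    dconf: float                 # diagnosis confidence in [0, 1] (hypothetical scale)
    cc_consented: bool           # user agreed to run the Clue Collector
    cc_evidence_collected: bool  # device evidence was already retrieved
    evidence_informative: bool   # device evidence is expected to help this diagnosis


def next_action(state: DialogueState, threshold: float = 0.8) -> Action:
    """Pick the next step; the threshold value is an assumed placeholder."""
    # Confident enough: move on to a segmented, step-by-step solution.
    if state.dconf >= threshold:
        return Action.PRESENT_SOLUTION_STEP
    # Otherwise prefer device evidence, but only with consent, only once,
    # and only when it is expected to be informative.
    if (state.cc_consented
            and not state.cc_evidence_collected
            and state.evidence_informative):
        return Action.COLLECT_DEVICE_EVIDENCE
    # Fall back to a targeted clarification question.
    return Action.ASK_FOLLOW_UP
```

In this sketch a low-confidence state with consent and no evidence yet routes to evidence collection, while the same state without consent routes to a follow-up question.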
Evaluation highlights
- Study: repeated-measures, randomized configuration–scenario assignment; four SecMate configurations (neither component / CC only / adaptation only / both) vs. LLM-only VCA baseline; five realistic cybersecurity scenarios (PC performance, online gaming/adware, airport Wi‑Fi risk, safe PC/firewall disabled, unexpected cursor movement/remote access).
- Outcomes: device evidence retrieval (CC) drove large gains in correct resolutions (≈50% → >90% vs. LLM-only). Stepwise, profile-aware interaction improved pleasantness and reduced user effort/overwhelm. Accurate implicit profiling improved solution ordering and communication; misprofiling significantly degraded perceived quality. Recommender achieved MRR@1 ≈ 0.75 and was perceived positively, though timing/presentation trade-offs affect conversational smoothness.
Privacy, implementation, and reproducibility
- Implemented with GPT-4o in a microservice agentic stack (LangChain, LangGraph, LangSmith), hosted on AWS with encryption, JWT, Cognito, PII anonymization via Microsoft Presidio. CC use requires consent and the Orchestrator only retrieves evidence when informative.
- Code and an annotated dataset of 711 conversations (DSComplete) to be released for reproducibility upon paper acceptance.
Limitations noted by authors
- Participant pool: engineering students (young, highly educated), which limits external validity for the general population and SMB employees.
- Scenarios: five predefined scenarios simulated on experiment laptops — broader scenario coverage and field deployment remain future work.
- Systemic risks: misprofiling and inappropriate automation can reduce trust; device-evidence collection raises consent/privacy and liability considerations.

Data & Methods

Participants and data
- 144 participants (mean age 25; 69% male) produced 711 troubleshooting conversations across five cybersecurity scenarios.
- Ground-truth labels: self-reported IT/cybersecurity proficiency questionnaires, personality trait inventories; conversations annotated per protocol.
Experimental design
- Configurations: SecMateNone, SecMateCC, SecMateAdap, SecMateBoth (combinations of CC and profile adaptation), plus an LLM-only VCABaseline. UI and recommender presentation held constant across conditions.
- Randomized, repeated-measures assignment of configurations to scenarios; within-respondent z-score normalization of Likert responses; analyses via linear mixed-effects models (configuration fixed effect; participant and scenario random effects).
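Within-respondent z-score normalization rescales each participant's Likert ratings against that participant's own mean and spread, so that habitual high or low raters become comparable. A minimal sketch follows; the function name, the input layout, the use of the population standard deviation, and the handling of constant raters are assumptions, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_respondent(responses):
    """Normalize ratings per participant.

    responses: list of (participant_id, rating) pairs.
    Returns a list of (participant_id, z_score) pairs in the same order.
    """
    by_participant = defaultdict(list)
    for pid, rating in responses:
        by_participant[pid].append(rating)

    # Per-participant mean and population standard deviation (assumed choice).
    stats = {pid: (mean(r), pstdev(r)) for pid, r in by_participant.items()}

    normalized = []
    for pid, rating in responses:
        mu, sigma = stats[pid]
        # A participant who gave identical ratings has no spread; map to 0.0.
        z = 0.0 if sigma == 0 else (rating - mu) / sigma
        normalized.append((pid, z))
    return normalized
```

The normalized scores would then serve as the response variable in the mixed-effects models, with configuration as the fixed effect and participant and scenario as random effects.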
Metrics
- Effectiveness (binary: reached expected outcome), efficiency (iterations to correct solution), perceived effectiveness/pleasantness/ease (Likert → z-scores), overwhelm (distinct diagnostic paths), recommender relevance (MRR@1), profiler error (MAE between inferred and ground-truth proficiencies), and substitution willingness (willingness to use VCA vs. human IT).
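Two of these metrics have standard definitions that can be sketched directly. MRR@k averages the reciprocal rank of the first relevant item within the top k positions, so at k = 1 it reduces to the share of conversations whose top-ranked recommendation is relevant. MAE averages the absolute gap between inferred and ground-truth proficiency scores. The function names and the toy item labels below are hypothetical, and the sample data is invented for illustration rather than drawn from the study.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=1):
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def mean_absolute_error(inferred, ground_truth):
    """Average absolute difference between paired scores."""
    return sum(abs(a - b) for a, b in zip(inferred, ground_truth)) / len(inferred)


# Toy data: one ranked recommendation list per conversation (hypothetical labels).
ranked = [["vpn", "av"], ["fw", "vpn"], ["av", "fw"], ["fw", "av"]]
relevant = [{"vpn"}, {"vpn"}, {"av"}, {"vpn"}]

print(mrr_at_k(ranked, relevant, k=1))  # 0.5: two of four top picks are relevant
print(mrr_at_k(ranked, relevant, k=2))  # 0.625: one extra hit at rank 2 adds 0.5 / 4
```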
Key quantitative results reported
- Correct-resolution improvement with device evidence: ≈50% (LLM-only) → >90% (with CC).
- Recommender: MRR@1 ≈ 0.75.
- Profile-aware, stepwise guidance: statistically significant improvements in pleasantness and reduced iterations/overwhelm (exact effect sizes reported in paper’s analyses).
Engineering / privacy stack
- GPT-4o LLM, LangChain/LangGraph orchestration, microservices on AWS VPC, HTTPS/WSS, JWT, Cognito, MS Presidio for PII anonymization.

Implications for AI Economics

Unit-cost and labor substitution
- Large increases in self-service success (to over 90% correct resolutions) and strong substitution willingness suggest meaningful reductions in human IT support demand for routine cybersecurity troubleshooting. Per-interaction costs (cloud + model + orchestration) are reported to be well below human-benchmarked session costs, implying substantial unit-cost savings at scale for SMBs and MSSPs.
Revenue and business model opportunities
- Built-in, contextual recommender (MRR@1 ≈ 0.75) creates monetizable channels: product upsells, managed services, and affiliate/partner models for security tools; potential to improve lifetime value per customer while keeping support costs low.
Labor market and organizational impacts
- Routine tier-1 support roles face downward demand; human staff can be redeployed to higher-complexity incident response, strategy, or oversight roles. MSSPs may shift pricing from per-ticket to platform/subscription models offering automated troubleshooting plus human escalation.
Investment and operational trade-offs
- Development, integration, and compliance costs (building CC with consent flows, secure telemetry, auditing, and liability controls) are non-trivial. Firms must invest in robust profiling, QA, and monitoring to avoid costly misdiagnoses and reputational loss.
Risk externalities and regulation
- Device evidence collection raises privacy and consent costs, and potential regulatory scrutiny (data collection, storage, cross-border processing). Misprofiling or erroneous remediation could create negative externalities (security incidents, data loss) and legal exposure — increasing expected cost of deployment unless mitigated.
Market competition and diffusion
- Releasing code + dataset lowers entry barriers, accelerating both competition and the diffusion of such VCAs. This may compress margins for vendors but increase adoption across SMBs.
Productivity and macro effects
- Improved automated cybersecurity support raises overall organizational resilience and productivity, potentially lowering incident-related downtime. Aggregate labor reallocation could increase demand for more skilled AI+security professionals, shifting the labor supply toward upskilling and higher-value tasks.

Takeaways for economists and policymakers: SecMate demonstrates that multi-agent, evidence-grounded VCAs can materially substitute routine cybersecurity support, creating cost savings and new monetization paths but also producing privacy, liability, and labor-shift externalities that should be quantified when assessing broader economic impact.

Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The study reports large, consistent effects (resolution rates rising from ~50% to >90%) on a reasonably sized sample (144 participants, 711 conversations) in a controlled setting and compares to a clear LLM-only baseline; however, external validity concerns (lab/controlled setting, unclear representativeness of participants and devices), incomplete reporting on randomization/blinding and baseline configuration, and unknown sensitivity to LLM/version and prompt engineering limit confidence in strong causal generalization to real-world IT environments.

Methods Rigor: medium. Strengths include a controlled experiment with many conversations, multiple outcome measures (correct resolution, pleasantness, user burden, MRR@1), and public release of code and an annotated dataset enabling reproducibility; weaknesses are the absence of explicit details about randomization, allocation, statistical controls, pre-registration, participant recruitment and demographics, and potential measurement biases (simulated diagnostics, lab conditions), which reduces methodological transparency and robustness.

Sample: 144 participants engaged in 711 troubleshooting conversations in a controlled user study; interactions included cybersecurity troubleshooting tasks across devices, comparing SecMate (multi-agent VCA with device/user/service specificity) against an LLM-only baseline; device signals were provided by a lightweight local diagnostic utility and user proficiency inferred implicitly; the dataset and code are released alongside richly annotated conversation-level labels.

Themes: human_ai_collab, adoption, productivity

Identification: Controlled experimental comparison between the SecMate multi-agent system and an LLM-only baseline, where the key treatment is the provision of device-level evidence (from a local diagnostic utility) and step-by-step guidance; service-specific recommendations evaluated with ranking metrics (MRR@1). The writeup does not explicitly state randomization, blinding, or preregistration, so causal claims rest on the controlled treatment assignment and outcome differences across conditions rather than a fully documented randomized trial.

Generalizability
- lab_setting_vs_real_world: study conducted in a controlled environment which may not capture production IT support complexity
- participant_demographics_unknown: limited information on recruitment, skills, or representativeness of participants (e.g., crowdworkers vs enterprise users)
- limited_device_and_issue_scope: diagnostic utility and tasks may cover a narrow set of devices, OS versions, and cybersecurity issues
- simulated_diagnostics: the lightweight local diagnostic utility may not replicate noisy/heterogeneous signals from real enterprise hardware
- short_term_interactions: study measures immediate troubleshooting outcomes but not long-term effects (learning, follow-up issues)
- model_and_prompt_specificity: results may depend on the specific LLM version, system prompts, and agent architecture used

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals.	Other	positive	high	system capability to integrate device, user, and service specificity	0.08
Device specificity is provided by a lightweight local diagnostic utility.	Other	positive	high	presence and role of a local diagnostic utility for device specificity	0.08
User specificity relies on implicit proficiency inference and profile-aware troubleshooting.	Other	positive	high	ability to infer user proficiency and use profiles for troubleshooting	0.08
Service specificity is achieved through a proactive, context-aware recommender.	Other	positive	high	use of a proactive, context-aware recommender for service specificity	0.08
We evaluate SecMate in a controlled study with 144 participants and 711 conversations.	Other	null_result	high	study sample size and conversation count	n=144 0.8
Device-level evidence increased correct resolutions from about 50% to over 90% relative to an LLM-only baseline.	Output Quality	positive	high	correct resolutions (successful troubleshooting)	n=711 from about 50% to over 90% relative to an LLM-only baseline 0.8
Step-by-step guidance improved pleasantness and reduced user burden.	Consumer Welfare	positive	high	pleasantness (user satisfaction) and user burden	n=144 0.48
The recommender achieved high relevance (MRR@1=0.75).	Decision Quality	positive	high	recommendation relevance (MRR@1)	n=711 MRR@1=0.75 0.48
Participants showed strong willingness to substitute human IT support at costs well below human benchmarks.	Adoption Rate	positive	medium	willingness to substitute human IT support (cost thresholds / preferences)	n=144 0.29
We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.	Other	positive	high	availability of codebase and annotated dataset	0.8