A CERN-deployed assistant, Archi, helps CMS technical operators answer operational queries by reasoning across documentation, historical records and live monitoring; locally hosted open-weight models proved competitive, enabling private, production-grade support without cloud dependencies.

Archi: Agentic Operations at the CMS Experiment

Pietro Lugato, Luca Lavezzo, Jason Mohoney, Hasan Ozturk, Muhammad Hassan Ahmed, Juan Pablo Salas, Viphava Ohm, Krittin Phornsiricharoenphant, Gabriele Benelli, Mariarosaria D'Alfonso, Manasvita Joshi, Warren Nam, Aron Soha, Samantha Sunnarborg, Austin Swinney, Jack Tucker, Dmytro Kovalskyi, Tim Kraska, Christoph Paus · June 03, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Archi, an open-source framework deployed at CERN's CMS, helps technical operators resolve real-world operational queries by integrating heterogeneous data and running private, locally-hosted language models that perform competitively with larger alternatives.

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

Summary

Main Finding

Archi is an open-source end-to-end framework that ingests heterogeneous scientific collaboration data, builds a retrieval layer, and runs configurable agentic workflows. A production instance for the CMS Computing Operations team (deployed Feb 2026) demonstrates real operational utility: agents that combine indexed documentation with live-tool calls can resolve operational queries, reduce search/friction, and produce grounded, traceable answers. Locally-hosted open-weight models perform competitively with frontier models, enabling fully private deployments for sensitive data.

Key Points

Framework and deployment
- Archi provides: modular collectors, ingestion (markdown conversion, anonymization, chunking), Postgres+pgvector storage, retrieval (BM25, vector, metadata), an agent runtime (supports cloud and local LLMs), tool connectors for live services, and a chat UI with RBAC and monitoring.
- Designed for privacy (anonymization, local open-weight model support), reliability (answers grounded in ingested data + tool traces), accessibility, and modularity.
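The anonymization and chunking steps named above can be illustrated with a minimal sketch. It is not Archi's actual code: the e-mail pattern, the `[EMAIL]` placeholder, the function names, and the window sizes are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: the source only says personal data is stripped,
# not which patterns are used.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text: str) -> str:
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_RE.sub("[EMAIL]", text)


def chunk(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Overlapping windows keep a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of some duplicated storage.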
CMS CompOps deployment
- Ingested corpus ≈ 10k documents (wiki pages, git docs, JIRA tickets, etc.).
- Live tool integrations: Rucio monitoring & client, HTCondor monitoring, DeepWiki, etc.
- Agent uses ReAct pattern: alternating reasoning and tool calls with tool-call traces exposed to users.
- Production usage (13 Feb–29 Apr 2026): 20 users, 393 conversations, 598 user messages, mean response time 107 s, 99 explicit feedback events (70% positive).
- Representative successes: diagnosing transfer failures, investigating CPU-efficiency drops, and retrieving procedures. Noted failure mode: environment/version mismatches when agent lacks visibility into local CLI versions.
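The ReAct pattern described above (alternating reasoning and tool calls, with the trace shown to the user) can be sketched as follows. This is a toy under stated assumptions: `scripted_model` stands in for an LLM, `transfer_status` and its output are invented, and none of the names come from Archi.

```python
from typing import Callable


# Hypothetical stand-in for a live tool; a real connector would query a service.
def transfer_status(dataset: str) -> str:
    return f"{dataset}: 3 transfers failed with 'destination quota exceeded'"


TOOLS: dict[str, Callable[[str], str]] = {"transfer_status": transfer_status}


def scripted_model(question: str, trace: list[dict]) -> dict:
    """Stand-in for an LLM: call one tool, then answer from its observation."""
    if not trace:
        return {"thought": "I need the live transfer state.",
                "tool": "transfer_status", "arg": "/Example/Dataset"}
    return {"thought": "The observation explains the failure.",
            "answer": f"Likely cause: {trace[-1]['observation']}"}


def react(question: str, model=scripted_model, max_steps: int = 5):
    """Alternate reasoning and tool calls; return the answer and the trace."""
    trace: list[dict] = []
    for _ in range(max_steps):
        step = model(question, trace)
        if "answer" in step:
            return step["answer"], trace
        observation = TOOLS[step["tool"]](step["arg"])
        trace.append({**step, "observation": observation})
    return "No answer within the step budget.", trace


answer, trace = react("Why are transfers failing?")
```

Returning the trace alongside the answer is what lets a UI show each tool call and its result, so an operator can check what the answer was grounded in.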
Evaluation and results
- Two curated question sets from production: 63Q (human + automated grading) and 270Q (automated grading). Of the 270Q, 71 questions required live-tool access.
- Compared configurations: Bare LLM, Single-shot RAG, Agent-no-live (iterative, static corpus), Agent-with-live (iterative + live tools — production config).
- Models tested: OpenAI GPT-5 family (GPT-5.5 used in evaluation/production), and open-weight Qwen3.6-27B & Qwen3.6-35B (served with vLLM on H200 GPUs).
- Human panel (63Q) mean scores (correctness/usefulness): GPT-5.5 Agent-with-live = 4.16; Qwen3.6-35B Agent-with-live ≈ 3.80; Qwen3.6-27B Agent-no-live ≈ 3.74; Single-shot RAG performed substantially worse (~2.45).
- Automated LLM judges (63Q) mean (relevance/completeness/specificity/helpfulness): GPT-5.5 Agent-with-live = 4.59; Qwen3.6-27B Agent-with-live = 4.32; Qwen3.6-27B Agent-no-live = 4.10; Single-shot RAG = 3.44.
- Key takeaways: agentic iterative use + live tools outperforms single-shot RAG; GPT-5.5 led metrics but open-weight models were competitive, especially when paired with the retrieval/tooling stack.
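To make the size of the gaps concrete, the short calculation below takes the human-panel means quoted above and expresses each configuration as a gain over single-shot RAG. The numbers are copied from the bullets (some are given only approximately there); the variable names are illustrative.

```python
# Human-panel mean scores (63Q) as quoted in the bullets above.
human_scores = {
    "GPT-5.5 Agent-with-live": 4.16,
    "Qwen3.6-35B Agent-with-live": 3.80,  # reported as approximate
    "Qwen3.6-27B Agent-no-live": 3.74,    # reported as approximate
    "Single-shot RAG": 2.45,              # reported as approximate
}

baseline = human_scores["Single-shot RAG"]

# Gain of each agentic configuration over the single-shot RAG baseline.
gains = {
    name: round(score - baseline, 2)
    for name, score in human_scores.items()
    if name != "Single-shot RAG"
}

# Gap between the frontier model and the best open-weight configuration.
frontier_gap = round(
    human_scores["GPT-5.5 Agent-with-live"]
    - human_scores["Qwen3.6-35B Agent-with-live"],
    2,
)
```

On these figures the distance between any agentic configuration and single-shot RAG is several times larger than the distance between GPT-5.5 and the open-weight models.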

Data & Methods

System architecture
- Collectors: web scrapers (SSO-protected pages via Selenium), Git repos, JIRA, local files; configurable scheduled updates.
- Ingestion: conversion to markdown, NER-based anonymization (spaCy), regex stripping for personal data, chunking; stored as text + embeddings in Postgres + pgvector.
- Retrieval: BM25 + vector + metadata filters; RAG layer and an agent runtime built using LangGraph and ReAct-style agent classes; tools exposed through MCP wrappers.
- UI/ops: chat interface with trace of tool calls and a document list per response, Grafana for usage/quality, A/B testing support, RBAC.
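One way to combine BM25, vector similarity, and metadata filters is sketched below. The source does not say how Archi merges the two rankings; reciprocal rank fusion is used here as an assumption. The corpus, the document IDs, and the term-count cosine (a stand-in for real embeddings in pgvector) are likewise invented for illustration.

```python
import math
from collections import Counter

# Toy corpus; document IDs and texts are invented for illustration.
DOCS = {
    "wiki/transfers": "rucio transfer failed because the destination quota was exceeded",
    "jira/efficiency": "htcondor jobs show low cpu efficiency on some worker nodes",
    "git/runbook": "procedure to restart the transfer daemon after a failure",
}


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Okapi BM25 score of every document for the query."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def cosine_scores(query: str, docs: dict[str, str]) -> dict[str, float]:
    """Stand-in for embedding similarity: cosine over raw term counts."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scores = {}
    for doc_id, text in docs.items():
        d = Counter(tokenize(text))
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        dot = sum(q[t] * d[t] for t in q)
        scores[doc_id] = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
    return scores


def hybrid_search(query: str, docs: dict[str, str] = DOCS,
                  source_prefix: str = "", k: int = 60) -> list[str]:
    """Filter by a metadata prefix, then fuse both rankings with RRF."""
    candidates = {d: t for d, t in docs.items() if d.startswith(source_prefix)}
    if not candidates:
        return []
    fused: Counter = Counter()
    for scores in (bm25_scores(query, candidates), cosine_scores(query, candidates)):
        ranking = sorted(scores, key=scores.get, reverse=True)
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return [doc_id for doc_id, _ in fused.most_common()]
```

Rank fusion avoids having to put BM25 scores and cosine similarities on a common scale, which is why it is a common default when the two retrievers are otherwise independent.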
Evaluation methodology
- Question sets assembled from real production traffic and curated by a CMS operator.
- Configurations tested offline against the same static corpus and live-tool emulators where applicable.
- Human evaluation: blinded CMS experts rated correctness/usefulness (1–5) and ranked responses.
- Automated evaluation: LLM judges (GLM-5.1 for 270Q; average of 4 judges for 63Q) rated relevance, completeness, specificity, helpfulness on a reference-free rubric.
- Model serving setup: Qwen models on vLLM using two NVIDIA H200 GPUs, 16 CPUs, 400 GB RAM; Archi services on CPU nodes (8 CPUs, 32 GB RAM). GPT-5.5 via OpenAI API (used under CERN AI privacy guidelines).
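The automated scoring above reports a single mean over four rubric dimensions and, for the 63Q set, four judges. A minimal sketch of one possible aggregation follows. The order of averaging (dimensions first, then judges) is an assumption, and the judge names and ratings are invented, not taken from the paper.

```python
from statistics import mean

DIMENSIONS = ("relevance", "completeness", "specificity", "helpfulness")

# Hypothetical 1-5 ratings from four judges for a single response.
ratings = {
    "judge_a": {"relevance": 5, "completeness": 4, "specificity": 4, "helpfulness": 5},
    "judge_b": {"relevance": 4, "completeness": 4, "specificity": 3, "helpfulness": 4},
    "judge_c": {"relevance": 5, "completeness": 5, "specificity": 4, "helpfulness": 4},
    "judge_d": {"relevance": 4, "completeness": 3, "specificity": 4, "helpfulness": 4},
}


def response_score(judge_ratings: dict[str, dict[str, int]]) -> float:
    """Average the rubric dimensions per judge, then average across judges."""
    per_judge = [mean(r[d] for d in DIMENSIONS) for r in judge_ratings.values()]
    return mean(per_judge)


score = response_score(ratings)
```

Because every judge rates every dimension here, averaging dimensions first or judges first gives the same number; the order only matters if some ratings are missing.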

Implications for AI Economics

Value creation and productivity
- Targeted agentic systems like Archi can substantially reduce search/friction costs in specialized technical workflows (examples: diagnosing transfers, efficiency investigations, retrieving procedures). This implies time savings per operator and faster incident resolution — a direct productivity gain and potential reduction in labor-hour costs per incident.
- Exposing tool-call traces and grounding answers in internal data reduces rework from hallucinations and potentially lowers escalation costs (fewer false-positive escalations).
Cost trade-offs: hosted frontier models vs. local open-weight models
- Frontier models (GPT-5.5) achieved higher scores, but open-weight models (Qwen variants) were competitive when combined with a strong ingestion/retrieval and tool stack.
- Economic trade-off: buying API access to frontier models vs. investing capital and O&M in on-prem hardware (e.g., H200 GPUs, large RAM footprints, engineering to run vLLM). Competitive open-weight performance reduces reliance on external providers for privacy-sensitive domains, changing procurement and recurring licensing expenditure profiles.
- Hidden costs: on-prem deployment requires skilled engineering, maintenance, and power/infrastructure expenses; cloud APIs shift those to the provider but introduce recurring costs and privacy considerations.
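The API-versus-on-prem trade-off above reduces to a break-even query volume. The sketch below shows the arithmetic only. The paper reports no prices, so every parameter name is an assumption, and any figure passed in is hypothetical.

```python
def cost_per_query(tokens_per_query: float, usd_per_million_tokens: float) -> float:
    """API cost of one query, given an assumed token volume and price."""
    return tokens_per_query * usd_per_million_tokens / 1_000_000


def monthly_onprem_cost(hardware_usd: float, amortization_months: int,
                        monthly_opex_usd: float) -> float:
    """Amortized hardware plus recurring costs (power, staff time, hosting)."""
    return hardware_usd / amortization_months + monthly_opex_usd


def break_even_queries(tokens_per_query: float, usd_per_million_tokens: float,
                       hardware_usd: float, amortization_months: int,
                       monthly_opex_usd: float) -> float:
    """Monthly query volume above which on-prem is cheaper than the API."""
    onprem = monthly_onprem_cost(hardware_usd, amortization_months, monthly_opex_usd)
    return onprem / cost_per_query(tokens_per_query, usd_per_million_tokens)


# Purely hypothetical inputs, not figures from the paper.
volume = break_even_queries(
    tokens_per_query=20_000, usd_per_million_tokens=10.0,
    hardware_usd=72_000, amortization_months=36, monthly_opex_usd=1_000,
)
```

Below the break-even volume the API is cheaper on cost alone, so a low-traffic deployment would have to justify local hosting on privacy or availability grounds instead.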
Privacy, governance, and compliance economics
- The ability to run open-weight models locally, combined with anonymization pipelines, allows organizations handling sensitive data to avoid the compliance and legal costs of sending data to external APIs. The economic benefit takes the form of reduced regulatory risk and lower compliance overhead.
- RBAC and service-account designs keep agent capabilities within users’ existing privileges, limiting liability; such governance features affect allowable procurement paths and the expected insurance/regulatory exposure.
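Keeping agent capabilities within a user's existing privileges can be sketched as a filter on the tool list before the agent runs. The role names, the tool names, and the one-role-per-tool model below are assumptions made for illustration, not Archi's actual permission scheme.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    name: str
    required_role: str  # hypothetical: one role gates each tool


@dataclass
class User:
    name: str
    roles: set[str] = field(default_factory=set)


# Hypothetical registry; the role assignments are invented.
TOOLS = [
    Tool("search_docs", "reader"),
    Tool("rucio_monitoring", "operator"),
    Tool("rucio_client", "admin"),
]


def tools_for(user: User, tools: list[Tool] = TOOLS) -> list[str]:
    """Return only the tools whose required role the user already holds."""
    return [tool.name for tool in tools if tool.required_role in user.roles]


operator_tools = tools_for(User("alice", {"reader", "operator"}))
```

Filtering before the agent is constructed means a tool the user may not call never appears in the prompt, so the model cannot be talked into invoking it.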
Public goods and shared development
- Archi is open-source and modular; shared investments (connectors, ingestion tooling, evaluation suites) reduce duplication across scientific collaborations and lower marginal cost of adoption. This can shift market dynamics: more federated, community-maintained infrastructure vs. proprietary vertical offerings.
Investment signals and market demand
- Demonstrated utility for domain-specific operational agents creates demand for: (a) improved retrieval + tool integrations (engineering talent, middleware products), (b) on-prem/edge-serving model infrastructure, and (c) evaluation and auditing tooling to measure ROI and risk.
- Organizations may reallocate budgets from generalized LLM API spend to internal ML Ops, data engineering, and model-serving infrastructure if on-prem performance is sufficient.
Risk & reliability economics
- Grounding answers in internal corpora and exposing tool traces reduces operational risk from erroneous recommendations, which has economic value (fewer costly mistakes, lower mean time to resolution). However, incomplete coverage (e.g., missing live tools or environment visibility) can introduce error modes that require mitigation investment (additional connectors, version-awareness).
Measurement and ROI
- Archi’s built-in evaluation framework and A/B testing enable tracking of performance improvements and direct measurement of operational metrics — essential for justifying continued investment and estimating ROI on infrastructure vs. API spend.
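A minimal sketch of the A/B measurement mentioned above: users are assigned to a variant deterministically, and the share of positive feedback is compared per variant. The hashing scheme, the variant labels, and the event format are assumptions; the source does not describe how Archi implements A/B testing.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to a variant by hashing the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return variants[digest[0] % len(variants)]


def positive_rate(events: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of positive feedback per variant, from (variant, is_positive) pairs."""
    totals: Counter = Counter()
    positives: Counter = Counter()
    for variant, is_positive in events:
        totals[variant] += 1
        positives[variant] += int(is_positive)
    return {variant: positives[variant] / totals[variant] for variant in totals}


# Hypothetical feedback events.
rates = positive_rate([("A", True), ("A", False), ("B", True)])
```

Hash-based assignment keeps each user on the same variant across sessions without storing an assignment table. With only tens of users, as in the deployment described here, differences in these rates should be read cautiously.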

Overall, Archi exemplifies an economic case where combining retrieval, tooling, and selective model choices (frontier vs open-weight) yields configurable trade-offs between performance, privacy, and total cost of ownership that organizations managing sensitive, specialized operations must weigh.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper reports results from a real-world deployment and evaluation using operator feedback plus human and automated grading of production questions, which provides practical, ecological evidence of effectiveness; however, there is no randomized or quasi-experimental design, limited information on sample size and baseline comparisons, and potential selection and measurement biases that limit causal claims.
Methods Rigor: medium. The implementation and evaluation appear systematic (end-to-end system, production question set, human + automated graders), and the project is deployed in an operational environment; but the paper lacks controlled experiments, pre-registered metrics, detailed grader protocols, or counterfactual baselines, reducing methodological rigor for causal inference.
Sample: Deployment at CERN's CMS Computing Operations team (since February 2026) supporting technical operators; the system ingests heterogeneous organizational sources (documentation, historical data, live monitoring) and was evaluated on operator feedback plus a question set collected from production usage, graded by human panels and automated checks; models used include locally-hosted open-weight language models. The exact numbers of operators and questions, the grade distributions, and the timeframe beyond 'since Feb 2026' are not reported in the abstract.
Themes: human_ai_collab, productivity, adoption, org_design
Generalizability:
- Single-site deployment (CMS at CERN); domain-specific to particle-physics operations
- Small and possibly self-selected operator sample; uptake and feedback may not represent all users
- Integration depends on CMS-specific data sources and monitoring systems, limiting transfer to other organizations without similar infrastructure
- Evaluation lacks control groups and may be sensitive to local workflows, training, and documentation quality
- Performance tied to specific locally-hosted models and hardware; results may differ with other model choices or cloud-based setups

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Archi is an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them.	Other	positive	high	system functionality (ingestion, organization, agent deployment)	0.18
An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators.	Adoption Rate	positive	high	deployment / adoption by a team	0.18
The deployed Archi instance offers retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems.	Other	positive	high	data integration and retrieval capability	0.18
We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels.	Other	null_result	high	evaluation methodology (feedback and graded question set)	0.18
The system proves effective at operational tasks, resolving real-world queries posed by CMS operators.	Task Completion Time	positive	medium	resolution of real-world queries / task completion	0.11
Locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.	Output Quality	positive	medium	model performance (competitiveness) and capability for private data management	0.11
Archi enables fully private management of sensitive data by using locally-hosted, open-weight models.	Other	positive	medium	privacy / data management capability	0.05