A CERN-deployed assistant, Archi, helps CMS technical operators answer operational queries by reasoning across documentation, historical records and live monitoring; locally hosted open-weight models proved competitive, enabling private, production-grade support without cloud dependencies.
We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.
Summary
Main Finding
Archi is an open-source, end-to-end framework that ingests heterogeneous scientific collaboration data, builds a retrieval layer, and runs configurable agentic workflows. A production instance for the CMS Computing Operations team (deployed February 2026) demonstrates real operational utility: agents that combine indexed documentation with live-tool calls can resolve operational queries, reduce search friction, and produce grounded, traceable answers. Locally hosted open-weight models perform competitively with frontier models, enabling fully private deployments for sensitive data.
Key Points
- Framework and deployment
  - Archi provides: modular collectors, ingestion (markdown conversion, anonymization, chunking), Postgres+pgvector storage, retrieval (BM25, vector, metadata), an agent runtime (supports cloud and local LLMs), tool connectors for live services, and a chat UI with RBAC and monitoring.
  - Designed for privacy (anonymization, local open-weight model support), reliability (answers grounded in ingested data + tool traces), accessibility, and modularity.
- CMS CompOps deployment
  - Ingested corpus ≈ 10k documents (wiki pages, git docs, JIRA tickets, etc.).
  - Live tool integrations: Rucio monitoring & client, HTCondor monitoring, DeepWiki, etc.
  - Agent uses the ReAct pattern: alternating reasoning and tool calls, with tool-call traces exposed to users.
  - Production usage (13 Feb–29 Apr 2026): 20 users, 393 conversations, 598 user messages, mean response time 107 s, 99 explicit feedback events (70% positive).
  - Representative successes: diagnosing transfer failures, investigating CPU-efficiency drops, and retrieving procedures. Noted failure mode: environment/version mismatches when the agent lacks visibility into local CLI versions.
- Evaluation and results
  - Two curated question sets from production: 63Q (human + automated grading) and 270Q (automated grading). Of the 270Q, 71 questions required live-tool access.
  - Compared configurations: Bare LLM, Single-shot RAG, Agent-no-live (iterative, static corpus), Agent-with-live (iterative + live tools; the production configuration).
  - Models tested: OpenAI GPT-5 family (GPT-5.5 used in evaluation/production) and open-weight Qwen3.6-27B & Qwen3.6-35B (served with vLLM on H200 GPUs).
  - Human panel (63Q) mean scores (correctness/usefulness): GPT-5.5 Agent-with-live = 4.16; Qwen3.6-35B Agent-with-live ≈ 3.80; Qwen3.6-27B Agent-no-live ≈ 3.74; Single-shot RAG performed substantially worse (~2.45).
  - Automated LLM judges (63Q) mean (relevance/completeness/specificity/helpfulness): GPT-5.5 Agent-with-live = 4.59; Qwen3.6-27B Agent-with-live = 4.32; Qwen3.6-27B Agent-no-live = 4.10; Single-shot RAG = 3.44.
  - Key takeaways: iterative agentic use with live tools outperforms single-shot RAG; GPT-5.5 led on the reported metrics, but open-weight models were competitive, especially when paired with the retrieval/tooling stack.
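The ReAct pattern mentioned in the key points (alternating reasoning and tool calls, with the trace exposed to the user) can be illustrated with a minimal sketch. This is not Archi's implementation, which the summary says is built on LangGraph; every name here (`Step`, `Trace`, `run_react`, `scripted_policy`, `fake_transfer_monitor`, the site label `T2_EXAMPLE`) is hypothetical, and a scripted function stands in for the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None
    tool_input: str = ""
    answer: Optional[str] = None


@dataclass
class Trace:
    """Record of tool calls that can be shown next to the answer."""
    entries: List[Dict[str, str]] = field(default_factory=list)


Policy = Callable[[str, List[str]], Step]  # stand-in for the LLM (assumption)
Tool = Callable[[str], str]


def run_react(question: str, policy: Policy, tools: Dict[str, Tool],
              max_steps: int = 5) -> Tuple[str, Trace]:
    """Alternate reasoning and tool calls until the policy returns an answer."""
    trace = Trace()
    observations: List[str] = []
    for _ in range(max_steps):
        step = policy(question, observations)
        if step.answer is not None:
            return step.answer, trace
        tool = tools.get(step.tool or "")
        observation = tool(step.tool_input) if tool else f"unknown tool: {step.tool}"
        # Every call is logged so the user can audit how the answer was reached.
        trace.entries.append({
            "thought": step.thought,
            "tool": step.tool or "",
            "input": step.tool_input,
            "observation": observation,
        })
        observations.append(observation)
    return "No answer within the step budget.", trace


def fake_transfer_monitor(query: str) -> str:
    """Hypothetical live tool; a real one would query a monitoring service."""
    return "quota exceeded at destination"


def scripted_policy(question: str, observations: List[str]) -> Step:
    """Scripted stand-in for a model: one tool call, then an answer."""
    if not observations:
        return Step(thought="Check the transfer monitor first.",
                    tool="transfer_monitor", tool_input="site=T2_EXAMPLE")
    return Step(thought="The observation explains the failure.",
                answer=f"Likely cause: {observations[-1]}")


if __name__ == "__main__":
    answer, trace = run_react("Why are transfers failing?", scripted_policy,
                              {"transfer_monitor": fake_transfer_monitor})
    print(answer)
    for entry in trace.entries:
        print(entry)
```

The step budget (`max_steps`) is a design choice of this sketch: it bounds latency and cost when the model keeps calling tools without converging.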
Data & Methods
- System architecture
  - Collectors: web scrapers (SSO-protected pages via Selenium), Git repos, JIRA, local files; configurable scheduled updates.
  - Ingestion: conversion to markdown, NER-based anonymization (spaCy), regex stripping for personal data, chunking; stored as text + embeddings in Postgres + pgvector.
  - Retrieval: BM25 + vector + metadata filters; RAG layer and an agent runtime built using LangGraph and ReAct-style agent classes; tools exposed through MCP wrappers.
  - UI/ops: chat interface with a trace of tool calls and a document list per response, Grafana for usage/quality, A/B testing support, RBAC.
- Evaluation methodology
  - Question sets assembled from real production traffic and curated by a CMS operator.
  - Configurations tested offline against the same static corpus and live-tool emulators where applicable.
  - Human evaluation: blinded CMS experts rated correctness/usefulness (1–5) and ranked responses.
  - Automated evaluation: LLM judges (GLM-5.1 for 270Q; average of 4 judges for 63Q) rated relevance, completeness, specificity, and helpfulness on a reference-free rubric.
  - Model serving setup: Qwen models on vLLM using two NVIDIA H200 GPUs, 16 CPUs, 400 GB RAM; Archi services on CPU nodes (8 CPUs, 32 GB RAM). GPT-5.5 via OpenAI API (used under CERN AI privacy guidelines).
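The ingestion steps listed above (regex stripping of personal data, then chunking) can be sketched as follows. This is an illustration under stated assumptions, not Archi's code: the email pattern, the `[EMAIL]` placeholder, the word-based chunk size and overlap, and the function names are all hypothetical, and the spaCy NER pass and the embedding step are omitted to keep the example dependency-free.

```python
import re
from typing import List

# Deliberately simple email pattern (assumption); real pipelines need more rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into word windows of `size`, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def ingest(document: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Scrub first, then chunk, so no chunk ever stores the raw address."""
    return chunk_words(scrub_emails(document), size=size, overlap=overlap)


if __name__ == "__main__":
    doc = "Contact jane.doe@example.org for access to the transfer logs."
    for chunk in ingest(doc, size=6, overlap=2):
        print(chunk)
```

Scrubbing before chunking matters in this sketch because an address split across a chunk boundary would otherwise escape the pattern.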
Implications for AI Economics
- Value creation and productivity
  - Targeted agentic systems like Archi can substantially reduce search and friction costs in specialized technical workflows (examples: diagnosing transfers, efficiency investigations, retrieving procedures). This implies time savings per operator and faster incident resolution, which is a direct productivity gain and a potential reduction in labor-hour costs per incident.
  - Exposing tool-call traces and grounding answers in internal data reduces rework from hallucinations and potentially lowers escalation costs (fewer false-positive escalations).
- Cost trade-offs: hosted frontier models vs. local open-weight models
  - Frontier models (GPT-5.5) achieved higher scores, but open-weight models (Qwen variants) were competitive when combined with a strong ingestion/retrieval and tool stack.
  - Economic trade-off: buying API access to frontier models vs. investing capital and O&M in on-prem hardware (e.g., H200 GPUs, large RAM footprints, engineering to run vLLM). Competitive open-weight performance reduces reliance on external providers for privacy-sensitive domains, changing procurement and recurring licensing expenditure profiles.
  - Hidden costs: on-prem deployment requires skilled engineering, maintenance, and power/infrastructure expenses; cloud APIs shift those to the provider but introduce recurring costs and privacy considerations.
- Privacy, governance, and compliance economics
  - The ability to run open-weight models locally, together with anonymization pipelines, allows organizations handling sensitive data to avoid the compliance/legal costs of sending data to external APIs. This translates into reduced regulatory risk and lower compliance overhead.
  - RBAC and service-account designs keep agent capabilities within users’ existing privileges, limiting liability; such governance features affect allowable procurement paths and the expected insurance/regulatory exposure.
- Public goods and shared development
  - Archi is open-source and modular; shared investments (connectors, ingestion tooling, evaluation suites) reduce duplication across scientific collaborations and lower the marginal cost of adoption. This can shift market dynamics: more federated, community-maintained infrastructure vs. proprietary vertical offerings.
- Investment signals and market demand
  - Demonstrated utility for domain-specific operational agents creates demand for: (a) improved retrieval + tool integrations (engineering talent, middleware products), (b) on-prem/edge-serving model infrastructure, and (c) evaluation and auditing tooling to measure ROI and risk.
  - Organizations may reallocate budgets from generalized LLM API spend to internal ML Ops, data engineering, and model-serving infrastructure if on-prem performance is sufficient.
- Risk & reliability economics
  - Grounding answers in internal corpora and exposing tool traces reduces operational risk from erroneous recommendations, which has economic value (fewer costly mistakes, lower mean time to resolution). However, incomplete coverage (e.g., missing live tools or environment visibility) can introduce error modes that require mitigation investment (additional connectors, version-awareness).
- Measurement and ROI
  - Archi’s built-in evaluation framework and A/B testing enable tracking of performance improvements and direct measurement of operational metrics, which is essential for justifying continued investment and estimating ROI on infrastructure vs. API spend.
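The trade-off between recurring API spend and on-prem investment discussed above reduces to simple break-even arithmetic. The sketch below shows that calculation only; the function names and every figure in the usage example are hypothetical placeholders in arbitrary currency units, not costs reported for Archi or CMS.

```python
import math
from typing import Optional


def monthly_on_prem_cost(capex: float, amortization_months: int,
                         monthly_opex: float) -> float:
    """Amortized hardware cost plus recurring operations cost per month."""
    if amortization_months <= 0:
        raise ValueError("amortization_months must be positive")
    return capex / amortization_months + monthly_opex


def break_even_month(capex: float, monthly_opex: float,
                     monthly_api_spend: float) -> Optional[int]:
    """First month in which cumulative API spend covers capex plus opex.

    Returns None when running on-prem costs at least as much per month
    as the API, since the upfront investment is then never recovered.
    """
    monthly_saving = monthly_api_spend - monthly_opex
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


if __name__ == "__main__":
    # Hypothetical placeholder figures, chosen only to exercise the formula.
    capex, opex, api = 120_000, 2_000, 7_000
    print("amortized monthly cost:", monthly_on_prem_cost(capex, 48, opex))
    print("break-even month:", break_even_month(capex, opex, api))
```

The model ignores the hidden costs listed above (engineering time, power) unless they are folded into `monthly_opex`, and it assumes answer quality is comparable across the two options.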
Overall, Archi exemplifies an economic case where combining retrieval, tooling, and selective model choices (frontier vs open-weight) yields configurable trade-offs between performance, privacy, and total cost of ownership that organizations managing sensitive, specialized operations must weigh.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Archi is an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. [Other] | positive | high | system functionality (ingestion, organization, agent deployment) | 0.18 |
| An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators. [Adoption Rate] | positive | high | deployment / adoption by a team | 0.18 |
| The deployed Archi instance offers retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. [Other] | positive | high | data integration and retrieval capability | 0.18 |
| We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. [Other] | null_result | high | evaluation methodology (feedback and graded question set) | 0.18 |
| The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. [Task Completion Time] | positive | medium | resolution of real-world queries / task completion | 0.11 |
| Locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data. [Output Quality] | positive | medium | model performance (competitiveness) and capability for private data management | 0.11 |
| Archi enables fully private management of sensitive data by using locally-hosted, open-weight models. [Other] | positive | medium | privacy / data management capability | 0.05 |