A new industrial PHM benchmark finds leading LLM agents complete just 68% of realistic diagnostics and prognostics tasks, plagued by tool-orchestration errors and poor generalization; top configurations show 23% incorrect sequencing, a 14.9 percentage-point drop on multi-asset reasoning, and only 42.7% success on held-out datasets.

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

Ayan Das, Dhaval Patel · April 02, 2026

arXiv · descriptive · medium evidence · relevance 7/10

PHMForge is a comprehensive benchmark of 75 PHM scenarios and 65 tools showing that state-of-the-art LLM agents complete only 68% of industrial prognostics and health-management tasks and fail systematically on tool orchestration, multi-asset reasoning, and cross-equipment generalization.

Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.

Summary

Main Finding

PHMForge is a new, open-source benchmark that evaluates LLM-based agents on realistic, high-stakes Prognostics & Health Management (PHM) workflows. Across 75 SME-vetted scenarios and 65 domain-specific MCP tools, frontier agent configurations (e.g., Claude Sonnet 4.0 + Claude Code) reached at best 68% task completion and exhibited systematic failures in tool sequencing (23% incorrect sequencing), multi-asset reasoning (14.9 percentage-point performance drop), and cross-equipment generalization (42.7% on held-out datasets). The results indicate that current agentic LLMs and frameworks are promising but far from deployment-ready for industrial PHM without substantial domain engineering and governance.

Key Points

Scope and realism
- 75 expert-curated scenarios covering 7 asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines, etc.).
- Five PHM task categories: RUL Prediction (15 scenarios), Fault Classification (15), Engine Health Analysis (30), Cost-Benefit Analysis, Safety/Policy Evaluation.
- Ground-truth templates and SME-driven scenario generation (no LLMs used in scenario creation).
Tooling and protocol
- Implemented 65 specialized tools across two MCP servers (Prognostics Server: 38 tools; Intelligent Maintenance Server: 27 tools).
- Tools expose data loading, model training/prediction, signal analysis, cost modeling, safety/compliance checks, etc.
- Introduces an “unknown-tools” evaluation mode: agents must select appropriate functions from vague industrial queries without explicit tool names.
Evaluation setup
- Execution-based validation using task-commensurate metrics: MAE/RMSE for RUL; F1/accuracy for classification; categorical matching for health assessments; cost/compliance thresholds for strategic tasks.
- Models tested: Claude Sonnet 4.0 / 3.5, GPT-4o / GPT-4-Turbo, Granite-3.0-8B; agent frameworks: ReAct, Cursor Agent, Claude Code.
- Deterministic evaluation (temperature = 0) for reproducibility.
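
The task-commensurate metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the benchmark's evaluator code: the function names are hypothetical, and the F1 shown is a single-class F1, whereas the paper may use a macro or weighted average.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, e.g. for RUL predictions in cycles."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def f1_for_class(y_true, y_pred, positive):
    """F1 for one class (assumption: averaging across classes is left out)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def categorical_match(expected, predicted):
    """Exact label match for health assessments, ignoring case and padding."""
    return expected.strip().lower() == predicted.strip().lower()
```

For example, `mae([100, 80], [90, 85])` averages absolute errors of 10 and 5 cycles, while `rmse` on the same inputs is slightly larger because the 10-cycle miss is squared before averaging.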
Performance highlights and failure modes
- Best-performing combinations achieved 68% task completion overall.
- Engine Health Analysis was the hardest category (range 23.3%–60.0%).
- Systematic failures: 23% incorrect tool sequencing, a 14.9 pp degradation on multi-asset reasoning, and only 42.7% performance on held-out cross-equipment datasets.
Curation effort and openness
- Dataset selection pipeline reduced 52 candidates to 18 final datasets used for scenarios.
- SME involvement: 396 person-hours across domain experts.
- Full benchmark (scenarios, tools, evaluators) open-sourced for community use and ongoing expansion.

Data & Methods

Scenario curation
- Multi-stage pipeline: asset selection → scenario type → data-provision mode → output template → query formulation. Emphasis on stakeholder-voiced, operational prompts (e.g., “Which engines in our flight fleet are nearing critical limits?”).
- Filtering criteria for datasets: community validation (>1,000 downloads, >50 citations), explicit ground truth, PHM task alignment.
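
The dataset filtering criteria above can be read as a simple predicate over candidate metadata. The sketch below is an assumption-laden illustration: the `CandidateDataset` fields and function names are hypothetical, and it assumes all criteria must hold together, which the summary does not state explicitly.

```python
from dataclasses import dataclass

@dataclass
class CandidateDataset:
    # Hypothetical metadata record; field names are not from the paper.
    name: str
    downloads: int
    citations: int
    has_ground_truth: bool
    phm_task_aligned: bool

def passes_filter(d: CandidateDataset) -> bool:
    """Apply the stated thresholds (assumed to be combined with AND)."""
    return (
        d.downloads > 1000
        and d.citations > 50
        and d.has_ground_truth
        and d.phm_task_aligned
    )

def select_datasets(candidates):
    """Keep only the candidates that satisfy every criterion."""
    return [d for d in candidates if passes_filter(d)]
```

A pipeline of this shape would be what narrows a candidate pool (52 in the paper) down to the final set (18), although the actual selection also involved expert judgment.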
Ground truth and validation
- Task-specific ground-truth objects per scenario (numerical bounds, categorical mappings, structured templates), created either from dataset labels or SME adjudication.
- Metrics chosen to match task semantics and allow equitable aggregation across heterogeneous tasks.
Evaluation formalism
- Agent A with model M must select and sequence tools Tτ from the MCP servers given query Q and dataset context D; success evaluated by validate(ŷ, G) where G is scenario ground truth.
- Execution-based scoring (not just language-only or API-call correctness).
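
One way to picture validate(ŷ, G) is as a dispatch on the kind of ground-truth object attached to a scenario. The following is a minimal sketch under stated assumptions: the dictionary layout, the `kind` values, and the function names are hypothetical and chosen only to mirror the three ground-truth types named above (numerical bounds, categorical mappings, structured templates).

```python
def validate(y_hat, ground_truth):
    """Return True if the agent's final output y_hat satisfies ground truth G."""
    kind = ground_truth["kind"]
    if kind == "numeric_bounds":
        # Regression-style tasks: the answer must fall inside accepted bounds.
        return ground_truth["low"] <= y_hat <= ground_truth["high"]
    if kind == "categorical":
        # Health assessments: exact label match.
        return y_hat == ground_truth["label"]
    if kind == "template":
        # Structured outputs: every required field must match.
        return all(y_hat.get(k) == v for k, v in ground_truth["fields"].items())
    raise ValueError(f"unknown ground-truth kind: {kind}")

def completion_rate(outcomes):
    """Fraction of scenarios whose validate() call returned True."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)
```

Because the check runs on the output produced by actually executing the tool chain, an agent that calls the right tools in the wrong order still fails the scenario.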
Implementations
- Two MCP servers implemented to reflect realistic industrial tool interfaces.
- 65 domain-specific functions implemented to support the scenarios.
- Experiments run deterministically (T=0) across multiple agent frameworks and model backbones.

Implications for AI Economics

Deployment cost drivers are substantial
- Substantial upfront engineering and SME effort is required to make agentic PHM usable: the benchmark itself required 65 tool implementations and ~396 SME person-hours. Real deployments will need even more integration, testing, and regulatory alignment.
- Poor cross-asset generalization (only 42.7% on held-out datasets) implies a high marginal cost per new asset class, i.e., weak economies of scope. Organizations should expect significant customization costs when expanding agentic PHM across heterogeneous fleets.
Value vs. risk trade-offs
- Best-case task completion at 68% and notable sequencing/decision errors (23% incorrect sequencing) mean automation today would carry nontrivial operational and safety risk. Economic value of automation must be adjusted for potential cost of incorrect actions (repairs, downtime, safety incidents, regulatory fines).
- For high-stakes assets, human-in-the-loop oversight remains essential. The economic model shifts from pure substitution of labor to hybrid workflows where LLM agents reduce some analytic labor but add supervision, verification, and incident remediation costs.
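
The risk adjustment described above can be made concrete with a simple expected-value calculation. This is an illustrative sketch only: apart from the 68% completion rate, every number and name is a hypothetical assumption, and the model ignores correlated failures and tail events.

```python
def expected_net_value(p_success, value_per_task, cost_per_error, oversight_cost):
    """Expected value of automating one task, net of error and supervision costs."""
    return (
        p_success * value_per_task
        - (1 - p_success) * cost_per_error
        - oversight_cost
    )

def break_even_success_rate(value_per_task, cost_per_error, oversight_cost):
    """Success rate at which expected net value is exactly zero."""
    return (cost_per_error + oversight_cost) / (value_per_task + cost_per_error)

# Hypothetical inputs: a correct task is worth 100 units, an incorrect action
# costs 200 units, and human supervision costs 10 units per task.
net = expected_net_value(0.68, 100.0, 200.0, 10.0)
threshold = break_even_success_rate(100.0, 200.0, 10.0)
print(f"net value per task: {net:.2f}, break-even success rate: {threshold:.2f}")
```

Under these made-up costs, a 68% success rate sits just below the break-even point of 70%, which illustrates why the value of automation depends heavily on how expensive an incorrect action is.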
Market opportunities and business models
- Demand for standardized, domain-specific MCP tooling and verifiable PHM toolkits is likely to grow; vendors who provide well-documented, certified tool APIs (and catalogs to address the “unknown-tools” challenge) can capture value by reducing integration costs for adopters.
- Benchmark-driven services (continuous scenario curation, SME validation, compliance adapters) will be necessary and monetizable; PHMForge’s “living benchmark” model highlights curation as a recurring cost center.
Investment and policy implications
- Investments should prioritize robust tool-discovery, multi-asset reasoning, verifiable safety checks, and deterministic evaluators (to reduce tail risks that drive insurance/policy costs).
- Regulators and insurers will likely require execution-based validation and auditable toolchains; the benchmark’s execution-focused design aligns with what payers and regulators will want to see.
Research-to-practice timeline
- PHMForge sets concrete baselines showing that further progress is needed before wide-scale autonomous deployment; companies should plan staged rollouts emphasizing high-value, low-risk tasks (e.g., diagnostics, triage) while investing in tool catalogs, verification layers, and SME-driven governance to reduce residual deployment risk.

Concluding practical takeaway: PHMForge demonstrates that agentic LLMs can aid industrial PHM but that significant domain engineering, tooling, and governance remain necessary. Economically, this means meaningful near-term costs and ongoing maintenance spending, with the promise of longer-run returns only if firms invest in standardized, verifiable tooling and address the current gaps in generalization and safe orchestration.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides systematic, execution-based benchmark evidence across 75 curated scenarios, 65 specialized tools, multiple agent frameworks, and several frontier LLMs, which supports robust descriptive claims about agent performance; however, it does not identify causal effects on economic outcomes or real-world deployments, and results may be sensitive to scenario curation, simulator fidelity, and the specific models/frameworks evaluated.

Methods Rigor: high. Design shows strong methodological care: expert-curated scenarios spanning seven asset classes and five task categories, implementation of specialized tools and MCP servers, task-appropriate evaluators (MAE/RMSE, F1, categorical matching), multiple agent frameworks and LLMs, and held-out tests for generalization; remaining concerns are limited external/operational validation and dependence on simulator realism.

Sample: Benchmark dataset of 75 expert-curated PHM scenarios across 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines), covering 5 task types (RUL prediction, fault classification, engine health analysis, cost-benefit analysis, safety/policy evaluation); 65 specialized tool implementations exposed via two MCP servers; evaluated with multiple agent frameworks (ReAct, Cursor Agent, Claude Code) paired with LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B); metrics include MAE/RMSE, F1, categorical matching, plus held-out dataset tests.

Themes: human_ai_collab, adoption

Generalizability
- Uses simulated MCP servers and curated scenarios that may not capture full operational complexity or noise of real-world industrial systems.
- Limited to seven asset classes and selected tasks; other assets/tasks may exhibit different agent behavior.
- Evaluations restricted to a small set of current LLMs and agent frameworks; results may change with new models or tool designs.
- Performance metrics (MAE/RMSE, F1, categorical matching) may not capture all safety, regulatory, or economic consequences in production.
- Benchmarks focus on agent-level task completion rather than firm-level productivity, cost, or labor-market impacts.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
PHMForge is the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers.	Other	positive	high	availability of a domain-specific benchmark for LLM agents	0.03
The benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation.	Other	positive	high	count and coverage of benchmark scenarios, asset classes, and task categories	n=75 0.3
We construct 65 specialized tools across two MCP servers to enable interactions for the benchmark.	Other	positive	high	number of specialized tools and server deployment	n=65 0.3
Execution-based evaluators were implemented with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments.	Output Quality	positive	high	metricized evaluation of model outputs (MAE/RMSE, F1, categorical matching)	0.3
We evaluated leading agent frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B).	Other	positive	high	evaluation coverage across agent frameworks and LLMs	n=75 0.18
Even top-performing configurations achieve only 68% task completion.	Output Quality	negative	high	task completion rate	n=75 68% task completion 0.18
There are systematic failures in tool orchestration, with 23% incorrect sequencing.	Task Allocation	negative	high	rate of incorrect tool sequencing	n=75 23% incorrect sequencing 0.18
Multi-asset reasoning causes a 14.9 percentage point degradation in performance.	Output Quality	negative	high	performance degradation (percentage points) when reasoning across multiple assets	n=75 14.9 percentage point degradation 0.18
Cross-equipment generalization is poor, with 42.7% performance on held-out datasets.	Output Quality	negative	high	held-out dataset performance (cross-equipment generalization)	n=75 42.7% on held-out datasets 0.18
We open-source the complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts.	Other	positive	high	availability of open-source benchmark artifacts	0.3