A new industrial PHM benchmark finds leading LLM agents complete just 68% of realistic diagnostics and prognostics tasks, plagued by tool-orchestration errors and poor generalization; top configurations show 23% incorrect sequencing, a 14.9 percentage-point drop on multi-asset reasoning, and only 42.7% success on held-out datasets.
Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage-point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.
Summary
Main Finding
PHMForge is a new, open-source benchmark that evaluates LLM-based agents on realistic, high-stakes Prognostics & Health Management (PHM) workflows. Across 75 SME-vetted scenarios and 65 domain-specific MCP tools, frontier agent configurations (e.g., Claude Sonnet 4.0 + Claude Code) reached at best 68% task completion and exhibited systematic failures in tool sequencing (23% incorrect sequencing), multi-asset reasoning (14.9 percentage-point performance drop), and cross-equipment generalization (42.7% on held-out datasets). The results indicate that current agentic LLMs and frameworks are promising but far from deployment-ready for industrial PHM without substantial domain engineering and governance.
Key Points
- Scope and realism
  - 75 expert-curated scenarios covering 7 asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines, etc.).
  - Five PHM task categories: RUL Prediction (15 scenarios), Fault Classification (15), Engine Health Analysis (30), Cost-Benefit Analysis, and Safety/Policy Evaluation.
  - Ground-truth templates and SME-driven scenario generation (no LLMs used in scenario creation).
- Tooling and protocol
  - Implemented 65 specialized tools across two MCP servers (Prognostics Server: 38 tools; Intelligent Maintenance Server: 27 tools).
  - Tools expose data loading, model training/prediction, signal analysis, cost modeling, safety/compliance checks, etc.
  - Introduces an “unknown-tools” evaluation mode: agents must select appropriate functions from vague industrial queries without explicit tool names.
- Evaluation setup
  - Execution-based validation using task-commensurate metrics: MAE/RMSE for RUL; F1/accuracy for classification; categorical matching for health assessments; cost/compliance thresholds for strategic tasks.
  - Models tested: Claude Sonnet 4.0 / 3.5, GPT-4o / GPT-4-Turbo, Granite-3.0-8B; agent frameworks: ReAct, Cursor Agent, Claude Code.
  - Deterministic evaluation (temperature = 0) for reproducibility.
- Performance highlights and failure modes
  - Best-performing combinations achieved 68% task completion overall.
  - Engine Health Analysis was the hardest category (range 23.3%–60.0%).
  - Systematic failures: 23% incorrect tool sequencing, a 14.9 pp degradation on multi-asset reasoning, and only 42.7% performance on held-out cross-equipment datasets.
- Curation effort and openness
  - Dataset selection pipeline reduced 52 candidates to 18 final datasets used for scenarios.
  - SME involvement: 396 person-hours of domain-expert effort.
  - Full benchmark (scenarios, tools, evaluators) open-sourced for community use and ongoing expansion.
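The 23% incorrect-sequencing figure implies some check that an agent's tool calls happen in a valid order. The summary does not say how PHMForge scores sequencing, so the following is only a minimal sketch. It assumes each scenario comes with a required ordered list of tools and that unrelated calls in between are tolerated. The tool names (`load_dataset`, `train_rul_model`, `predict_rul`) and both function names are hypothetical, not taken from the benchmark.

```python
from typing import Mapping, Sequence


def respects_required_order(calls: Sequence[str], required: Sequence[str]) -> bool:
    """Return True if `required` occurs in `calls` as an ordered subsequence."""
    remaining = iter(calls)
    # `step in remaining` consumes the iterator up to the first match, so each
    # required step has to appear after the previously matched one.
    return all(step in remaining for step in required)


def incorrect_sequencing_rate(
    traces: Mapping[str, Sequence[str]],
    required: Mapping[str, Sequence[str]],
) -> float:
    """Fraction of scenario traces that violate their required tool order."""
    if not traces:
        raise ValueError("no traces to score")
    violations = sum(
        not respects_required_order(calls, required[scenario_id])
        for scenario_id, calls in traces.items()
    )
    return violations / len(traces)


# Hypothetical tool names, used only for illustration.
REQUIRED = {
    "s1": ["load_dataset", "train_rul_model", "predict_rul"],
    "s2": ["load_dataset", "train_rul_model", "predict_rul"],
}
TRACES = {
    # Extra call in the middle, but the required order is preserved.
    "s1": ["load_dataset", "plot_signal", "train_rul_model", "predict_rul"],
    # Prediction attempted before any data was loaded or model trained.
    "s2": ["predict_rul", "load_dataset", "train_rul_model"],
}
RATE = incorrect_sequencing_rate(TRACES, REQUIRED)
```

A subsequence check is deliberately lenient: it flags only ordering violations, not redundant or irrelevant calls. A stricter scorer could compare against a dependency graph instead of a single linear order.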
Data & Methods
- Scenario curation
  - Multi-stage pipeline: asset selection → scenario type → data-provision mode → output template → query formulation. Emphasis on stakeholder-voiced, operational prompts (e.g., “Which engines in our flight fleet are nearing critical limits?”).
  - Filtering criteria for datasets: community validation (>1,000 downloads, >50 citations), explicit ground truth, PHM task alignment.
- Ground truth and validation
  - Task-specific ground-truth objects per scenario (numerical bounds, categorical mappings, structured templates), created either from dataset labels or SME adjudication.
  - Metrics chosen to match task semantics and allow equitable aggregation across heterogeneous tasks.
- Evaluation formalism
  - Agent A with model M must select and sequence tools Tτ from the MCP servers given query Q and dataset context D; success is evaluated by validate(ŷ, G), where ŷ is the agent's output and G is the scenario ground truth.
  - Execution-based scoring (not just language-only or API-call correctness).
- Implementations
  - Two MCP servers implemented to reflect realistic industrial tool interfaces.
  - 65 domain-specific functions implemented to support the scenarios.
  - Experiments run deterministically (temperature = 0) across multiple agent frameworks and model backbones.
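The validate(ŷ, G) step with task-commensurate metrics can be sketched roughly as below. This is an illustration under stated assumptions, not PHMForge's evaluator: the task labels, the ground-truth dictionary keys (`values`, `max_rmse`, `labels`, `min_f1`, `category`), and the pass/fail thresholds are all hypothetical, and F1 is simplified to the binary case.

```python
import math
from typing import Any, Mapping, Sequence


def _check_lengths(y_true: Sequence, y_pred: Sequence) -> None:
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    _check_lengths(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    _check_lengths(y_true, y_pred)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for one positive class: 2*TP / (2*TP + FP + FN)."""
    _check_lengths(y_true, y_pred)
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def validate(task: str, y_hat: Any, ground_truth: Mapping[str, Any]) -> bool:
    """Pass/fail decision for one scenario (thresholds are assumed, not sourced)."""
    if task == "rul":
        return rmse(ground_truth["values"], y_hat) <= ground_truth["max_rmse"]
    if task == "fault_classification":
        return binary_f1(ground_truth["labels"], y_hat) >= ground_truth["min_f1"]
    if task == "health":
        # Categorical matching: the predicted label must equal the reference.
        return y_hat == ground_truth["category"]
    raise ValueError(f"unknown task type: {task}")
```

Reducing every scenario to a boolean is one way heterogeneous metrics could be aggregated into a single task-completion rate. Whether the benchmark aggregates this way is not stated in the summary.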
Implications for AI Economics
- Deployment cost drivers are substantial
  - Substantial upfront engineering and SME effort is required to make agentic PHM usable: the benchmark itself required 65 tool implementations and ~396 SME person-hours. Real deployments will need even more integration, testing, and regulatory alignment.
  - Poor cross-asset generalization (large drop on held-out assets) implies a high marginal cost per new asset class, that is, weak economies of scope. Organizations should expect significant customization costs when expanding agentic PHM across heterogeneous fleets.
- Value vs. risk trade-offs
  - Best-case task completion at 68% and notable sequencing/decision errors (23% incorrect sequencing) mean automation today would carry nontrivial operational and safety risk. Economic value of automation must be adjusted for potential cost of incorrect actions (repairs, downtime, safety incidents, regulatory fines).
  - For high-stakes assets, human-in-the-loop oversight remains essential. The economic model shifts from pure substitution of labor to hybrid workflows where LLM agents reduce some analytic labor but add supervision, verification, and incident remediation costs.
- Market opportunities and business models
  - Demand for standardized, domain-specific MCP tooling and verifiable PHM toolkits is likely to grow; vendors who provide well-documented, certified tool APIs (and catalogs to address the “unknown-tools” challenge) can capture value by reducing integration costs for adopters.
  - Benchmark-driven services (continuous scenario curation, SME validation, compliance adapters) will be necessary and monetizable; PHMForge’s “living benchmark” model highlights curation as a recurring cost center.
- Investment and policy implications
  - Investments should prioritize robust tool-discovery, multi-asset reasoning, verifiable safety checks, and deterministic evaluators (to reduce tail risks that drive insurance/policy costs).
  - Regulators and insurers will likely require execution-based validation and auditable toolchains; the benchmark’s execution-focused design aligns with what payers and regulators will want to see.
- Research-to-practice timeline
  - PHMForge sets concrete baselines showing that further progress is needed before wide-scale autonomous deployment; companies should plan staged rollouts emphasizing high-value, low-risk tasks (e.g., diagnostics, triage) while investing in tool catalogs, verification layers, and SME-driven governance to reduce residual deployment risk.
Concluding practical takeaway: PHMForge demonstrates that agentic LLMs can aid industrial PHM but that significant domain engineering, tooling, and governance remain necessary. Economically, this means meaningful near-term costs and ongoing maintenance spending, with the promise of longer-run returns only if firms invest in standardized, verifiable tooling and address the current gaps in generalization and safe orchestration.
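The point that automation value must be adjusted for the cost of incorrect actions and for added supervision can be made concrete with a back-of-the-envelope expected-value calculation. The sketch below is not from the paper. It assumes every task that is not completed correctly incurs a fixed error cost, that oversight is a flat per-task cost, and the monetary figures in the usage example are invented for illustration. Only the 68% completion rate comes from the benchmark results.

```python
def expected_net_value_per_task(
    p_success: float,
    benefit: float,
    error_cost: float,
    oversight_cost: float,
) -> float:
    """Expected value of automating one task: p*B - (1 - p)*C - O."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    return p_success * benefit - (1.0 - p_success) * error_cost - oversight_cost


def break_even_success_rate(
    benefit: float,
    error_cost: float,
    oversight_cost: float,
) -> float:
    """Success rate p at which p*B - (1 - p)*C - O = 0, i.e. p = (C + O) / (B + C)."""
    if benefit + error_cost <= 0:
        raise ValueError("benefit + error_cost must be positive")
    return (error_cost + oversight_cost) / (benefit + error_cost)


# Hypothetical per-task figures in arbitrary currency units.
BENEFIT = 100.0        # labor saved when the agent completes the task correctly
ERROR_COST = 200.0     # remediation cost when it does not
OVERSIGHT_COST = 10.0  # human verification applied to every task

NET_AT_68 = expected_net_value_per_task(0.68, BENEFIT, ERROR_COST, OVERSIGHT_COST)
BREAK_EVEN = break_even_success_rate(BENEFIT, ERROR_COST, OVERSIGHT_COST)
```

With these invented figures, the net value at a 68% completion rate is about -6 per task and the break-even success rate is 0.70. The result depends entirely on the assumed ratio of error cost to benefit, which is why the same agent could be worthwhile for low-risk triage and uneconomic for high-stakes assets.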
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| PHMForge is the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. | Other | positive | high | availability of a domain-specific benchmark for LLM agents | 0.03 |
| The benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. | Other | positive | high | count and coverage of benchmark scenarios, asset classes, and task categories | n=75; 0.3 |
| We construct 65 specialized tools across two MCP servers to enable interactions for the benchmark. | Other | positive | high | number of specialized tools and server deployment | n=65; 0.3 |
| Execution-based evaluators were implemented with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. | Output Quality | positive | high | metricized evaluation of model outputs (MAE/RMSE, F1, categorical matching) | 0.3 |
| We evaluated leading agent frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B). | Other | positive | high | evaluation coverage across agent frameworks and LLMs | n=75; 0.18 |
| Even top-performing configurations achieve only 68% task completion. | Output Quality | negative | high | task completion rate | n=75; 68% task completion; 0.18 |
| There are systematic failures in tool orchestration, with 23% incorrect sequencing. | Task Allocation | negative | high | rate of incorrect tool sequencing | n=75; 23% incorrect sequencing; 0.18 |
| Multi-asset reasoning causes a 14.9 percentage point degradation in performance. | Output Quality | negative | high | performance degradation (percentage points) when reasoning across multiple assets | n=75; 14.9 percentage point degradation; 0.18 |
| Cross-equipment generalization is poor, with 42.7% performance on held-out datasets. | Output Quality | negative | high | held-out dataset performance (cross-equipment generalization) | n=75; 42.7% on held-out datasets; 0.18 |
| We open-source the complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts. | Other | positive | high | availability of open-source benchmark artifacts | 0.3 |