A new benchmark reveals Text-to-CAD is not yet industrially ready: large language models often produce executable code and plausible geometry but rarely satisfy engineering criteria for functionality, manufacturability and assembly. MUSE’s multi-stage, rubric-driven evaluation shows a consistent failure cascade and provides a realistic yardstick for progress toward engineering-grade CAD generation.

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu · May 27, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

MUSE introduces a Text-to-CAD benchmark and multi-stage evaluation showing that current LLMs frequently fail a cascade from executable code to valid geometry to engineering-ready designs, with most models falling short on functionality, manufacturability, and assemblability.

Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

Summary

Main Finding

MUSE is a new benchmark and evaluation pipeline for Text-to-CAD that moves beyond shape-matching to measure engineering usefulness (functionality, manufacturability, assemblability) of generated B‑Rep assemblies. Using structured Design Specifications and a three-stage funneled protocol (code execution → geometric checks → design‑intent rubrics judged by a VLM), MUSE reveals a strong failure cascade: many LLM outputs execute, fewer yield geometrically valid CAD, and only a small fraction meet engineering-grade criteria. Closed‑source models outperform open‑source ones, but even the best model reaches limited success on fine‑grained engineering criteria.

Key Points

Core contribution: MUSE benchmark for Text‑to‑CAD focused on multi‑part, editable B‑Rep assemblies plus an engineering‑grounded evaluation protocol.
Design Specification S = ⟨D (design description), G (physical assembly graph), Ω (valid parameter space), M (manufacturing plan)⟩ drives both task definition and rubric generation.
Three‑stage evaluation:
- Code check: execute the generated CadQuery script and export STEP.
- Geometric check: four binary checks (watertight, manifold, self‑intersection free, overlap free).
- Design‑intent alignment: convert valid STEP to engineering views and score against a design‑specific rubric on functionality, manufacturability, assemblability.
Dataset: 106 engineering‑focused instances covering diverse processes (CNC milling, 3D printing, laser cutting), materials (timber, PLA, etc.), and connection methods (nailing, snap‑fit, interlocking). Instances were created via a human‑in‑the‑loop pipeline: expert seed modeling → LLM augmentation → engineering view generation → LLM + human review to synthesize Design Specifications.
Rubric and judge: design‑specific rubrics are generated from the Design Specification, reference drawings, and engineering knowledge tables. A rubric‑based VLM judge scores the final stage; the authors validated its outputs against human annotations and report that it is reliable.
Empirical results: wide model sweep (closed‑source and open‑source). Typical performance pattern:
- Many models can produce executable scripts (code success), but fewer produce geometrically valid STEP files.
- Final engineering alignment is low on average. Reported summary numbers: closed‑source average final score ≈ 20% (the best closed‑source model, GPT‑5.5, reaches ≈ 52%); open‑source average final score ≈ 3.7%.
Key failure mode: a “failure cascade” in which errors compound from code generation, through geometric integrity, to nuanced engineering constraints (tolerances, joints, manufacturability).
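To make the Design Specification tuple S = ⟨D, G, Ω, M⟩ more concrete, here is a minimal sketch of how such a record could be represented and sanity-checked. The class names, field names, and the example box are all hypothetical illustrations; the summary does not describe MUSE's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One edge of the physical assembly graph G (hypothetical shape)."""
    part_a: str
    part_b: str
    method: str  # e.g. "snap-fit", "nailing", "interlocking"


@dataclass
class DesignSpecification:
    """S = <D, G, Omega, M>; every field name here is an assumption."""
    description: str                                  # D
    parts: list[str]                                  # nodes of G
    connections: list[Connection]                     # edges of G
    parameter_space: dict[str, tuple[float, float]]   # Omega: name -> (min, max)
    manufacturing_plan: dict[str, str]                # M: part -> process

    def out_of_range(self, params: dict[str, float]) -> list[str]:
        """Return names of parameters that are missing or fall outside Omega."""
        bad = []
        for name, (lo, hi) in self.parameter_space.items():
            value = params.get(name)
            if value is None or not (lo <= value <= hi):
                bad.append(name)
        return bad

    def dangling_connections(self) -> list[Connection]:
        """Return edges of G that reference a part not listed in `parts`."""
        known = set(self.parts)
        return [c for c in self.connections
                if c.part_a not in known or c.part_b not in known]


# Made-up example instance, not taken from the MUSE dataset.
spec = DesignSpecification(
    description="Two-part PLA box with a snap-fit lid",
    parts=["base", "lid"],
    connections=[Connection("base", "lid", "snap-fit")],
    parameter_space={"wall_mm": (1.2, 3.0), "clearance_mm": (0.1, 0.4)},
    manufacturing_plan={"base": "3D printing", "lid": "3D printing"},
)
```

Calling `spec.out_of_range({"wall_mm": 2.0, "clearance_mm": 0.5})` flags `clearance_mm`, which is the kind of parameter-space violation a rubric derived from Ω might be expected to catch.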

Data & Methods

Dataset construction:
- 106 curated design instances.
- Seed models authored by expert designers in STEP, converted to CadQuery scripts.
- LLM augmentation (e.g., Claude Opus 4.7) expanded variants (stylistic and functional) to increase diversity.
- Engineering views (Top, Front, Right, Isometric) generated from B‑Rep geometry (hidden lines, contours) rather than photoreal renders.
- Design Specifications synthesized via few‑shot prompting (GPT‑5.5) and human review.
Evaluation pipeline:
- Prompt each LLM to generate CadQuery code given the Design Specification.
- Stage 1: sandbox execution to check script runs and exports STEP.
- Stage 2: run four binary geometric checks (watertight, manifold, self‑intersection free, overlap free); only passers proceed.
- Stage 3: convert STEP to standardized engineering drawings; generate a task‑specific rubric (6 binary subcriteria: Assembly‑ready, Connectable, Well‑toleranced, Functional, Robust, Manufacturable) and apply a rubric‑based VLM judge comparing output vs references.
- VLM judge validated against human annotations for reliability.
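The gating logic of the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the three stage functions are passed in as hypothetical callables standing in for the sandbox, the CAD-kernel checks, and the VLM judge, and the scoring rule (fraction of the six subcriteria satisfied, zero for samples gated out earlier) is assumed rather than taken from the paper.

```python
from typing import Callable, Optional

GEOMETRIC_CHECKS = ("watertight", "manifold", "self_intersection_free", "overlap_free")
RUBRIC_CRITERIA = ("assembly_ready", "connectable", "well_toleranced",
                   "functional", "robust", "manufacturable")


def evaluate_sample(
    script: str,
    run_script: Callable[[str], Optional[str]],
    check_geometry: Callable[[str], dict],
    judge: Callable[[str], dict],
) -> dict:
    """Run one generated script through the three gated stages.

    run_script returns a STEP path or None; check_geometry and judge return
    {name: bool} dicts. All three are hypothetical stand-ins.
    """
    result = {"code_ok": False, "geometry_ok": False, "rubric": {}, "final_score": 0.0}

    step_path = run_script(script)          # Stage 1: execute and export STEP
    if step_path is None:
        return result
    result["code_ok"] = True

    geometry = check_geometry(step_path)    # Stage 2: all four checks must pass
    if not all(geometry.get(name, False) for name in GEOMETRIC_CHECKS):
        return result
    result["geometry_ok"] = True

    verdicts = judge(step_path)             # Stage 3: six binary subcriteria
    result["rubric"] = {c: bool(verdicts.get(c, False)) for c in RUBRIC_CRITERIA}
    # Assumed scoring rule: fraction of subcriteria satisfied.
    result["final_score"] = sum(result["rubric"].values()) / len(RUBRIC_CRITERIA)
    return result


# Stub stages for demonstration only.
def all_checks_pass(path: str) -> dict:
    return {name: True for name in GEOMETRIC_CHECKS}


def not_watertight(path: str) -> dict:
    return {**all_checks_pass(path), "watertight": False}


def half_judge(path: str) -> dict:
    return {"assembly_ready": True, "connectable": True, "functional": True}


passed = evaluate_sample("# script", lambda s: "out.step", all_checks_pass, half_judge)
bad_geometry = evaluate_sample("# script", lambda s: "out.step", not_watertight, half_judge)
crashed = evaluate_sample("# script", lambda s: None, all_checks_pass, half_judge)
```

Keeping the stages as injected callables means the same gating code can be exercised with stubs, as here, or wired to a real sandbox and judge.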
Models evaluated include closed‑ and open‑source LLMs (examples: GPT‑5.5, Claude‑opus‑4.7, Gemini‑3.1‑pro, GLM, Qwen family, Llama‑3.1), with per‑stage and per‑criterion reporting.
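Per‑stage and per‑criterion reporting of this kind amounts to simple aggregation over per‑sample records. The sketch below assumes a hypothetical record layout (`code_ok`, `geometry_ok`, `rubric`, `final_score`) and uses toy data; none of the numbers are results from the paper.

```python
def stage_rates(results: list[dict]) -> dict:
    """Summarize per-stage pass rates and the mean final score."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "code_pass_rate": sum(r["code_ok"] for r in results) / n,
        "geometry_pass_rate": sum(r["geometry_ok"] for r in results) / n,
        "mean_final_score": sum(r["final_score"] for r in results) / n,
    }


def criterion_rates(results: list[dict], criteria: list[str]) -> dict:
    """Per-criterion pass rate; samples gated out earlier count as failures."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        c: sum(bool(r.get("rubric", {}).get(c, False)) for r in results) / n
        for c in criteria
    }


# Toy records for one hypothetical model (not MUSE results).
toy = [
    {"code_ok": False, "geometry_ok": False, "rubric": {}, "final_score": 0.0},
    {"code_ok": True, "geometry_ok": False, "rubric": {}, "final_score": 0.0},
    {"code_ok": True, "geometry_ok": True,
     "rubric": {"functional": True, "manufacturable": False}, "final_score": 0.5},
    {"code_ok": True, "geometry_ok": True,
     "rubric": {"functional": True, "manufacturable": True}, "final_score": 1.0},
]

summary = stage_rates(toy)
by_criterion = criterion_rates(toy, ["functional", "manufacturable"])
```

On the toy data the rates shrink from 0.75 (code) to 0.5 (geometry) to a mean final score of 0.375, which illustrates how a funnel of this shape makes a failure cascade visible.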

Implications for AI Economics

Productivity vs. readiness gap: MUSE quantifies that current LLMs can produce plausible CAD code but struggle to meet downstream engineering requirements. This limits immediate automation-driven productivity gains in product design and manufacturing; meaningful gains require models to clear the geometric and engineering checks.
Value of high‑quality, expert‑curated data: The human‑in‑the‑loop pipeline and expert seed models were essential and costly. Benchmarks like MUSE raise the bar for dataset quality; firms will invest in expensive expert curation to train/evaluate models that can claim engineering readiness. This raises dataset creation costs and creates a scarce asset (high‑quality engineering specs and annotated assemblies).
Market differentiation and rents for closed‑source providers: Closed‑source models currently show better performance on engineering metrics. If that gap persists, vendors offering higher‑fidelity Text‑to‑CAD capabilities can charge premium prices to design/engineering firms, leading to differentiation and potential monopoly rents for providers who achieve engineering‑grade outputs.
Labor and skill reallocation, not immediate displacement: Because fine‑grained engineering criteria remain a failure point, human designers and engineers retain a central role (verification, tolerance specification, assembly planning). The transition is likely to shift tasks toward higher‑level specification, oversight, and error correction rather than outright elimination of design jobs in the short term.
Downstream capital and liability considerations: Integrating imperfect automatic CAD generation into manufacturing pipelines creates risks (manufacturing failures, safety issues). Firms will face choices about investing in extra verification, insurance, and stricter standards before deploying automated outputs—raising operational costs and slowing adoption.
Incentives shaped by evaluation metrics: MUSE moves evaluation incentives from visual similarity toward engineering utility. This can redirect R&D investments (models, training data, safety checks) toward truly usable CAD outputs. Economically, better benchmarks align product development investments with buyer needs (reduced integration costs).
Cost of evaluation and compute: The three‑stage pipeline (sandboxed execution, geometry analysis, VLM judging) requires compute, CAD kernels, and tooling. Scaling such evaluations (for model training or leaderboards) imposes nontrivial operational costs that will favor well‑funded organizations or platforms that can absorb them.
Standardization and procurement: MUSE could become a procurement/benchmark standard for tools marketed to engineering customers. Standardized, engineering‑grounded benchmarks reduce information asymmetry between vendors and buyers, affecting purchasing decisions and potentially accelerating adoption once models meet thresholds.
Open vs closed‑source ecosystem effects: The large performance gap reported implies potential short‑term consolidation around high‑performing closed systems. However, availability of the MUSE dataset and code could lower barriers for open‑source progress; market dynamics will depend on how quickly open models close the engineering‑grade gap.
Incentives for complementary services: Given persistent gaps, there is an economic opportunity for value‑added services such as automated verification pipelines, CAD‑to‑manufacturing translators, hybrid human‑AI design teams, and insurance/validation offerings.
Regulatory and safety externalities: As AI moves toward producing manufacturable artifacts, regulators, standards bodies, and insurers may require benchmarks like MUSE for certification. Compliance costs and risk management will factor into the economics of deploying Text‑to‑CAD systems in high‑stakes domains.

Overall, MUSE provides a more realistic performance signal for Text‑to‑CAD systems and thus has significant implications for where economic value will accrue (models that reliably clear engineering checks, firms that can curate expert datasets, and services that manage verification/liability).

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is a benchmark and evaluation paper rather than a causal or inferential study; it reports system performance on a curated dataset and does not attempt causal identification of economic effects.
Methods Rigor: high. The paper designs a realistic, multi-stage evaluation protocol (code check, geometric check, design-intent alignment), constructs a dataset of editable B-Rep assemblies with structured design specifications, and implements a rubric-based VLM judge that is validated against human annotations; these choices address multiple failure modes beyond simple shape metrics, though rubric subjectivity and dataset selection remain potential concerns.
Sample: A curated benchmark dataset (MUSE) of complex, multi-part boundary-representation (B-Rep) assemblies paired with structured design specifications and design-intent rubrics; evaluation runs include a variety of closed-source and open-source LLMs queried to generate CAD via code, with automatic VLM-based rubric scoring and human annotation used for validation; dataset and code are released on a project website.
Themes: productivity, innovation, human_ai_collab
Generalizability:
- Benchmark focuses on B-Rep assemblies and may not cover other CAD formats or all industrial design domains (e.g., sheet metal, castings, large assemblies).
- Dataset composition and item selection may bias toward particular part types or design styles, limiting representativeness of broader industrial work.
- Rubric-based VLM judge, though validated, may not generalize to different engineering standards, industries, or cultural interpretations of manufacturability/assemblability.
- Results reflect the evaluated LLMs and prompting pipelines at time of study; future models or toolchains could perform differently.
- Does not evaluate downstream engineering validation steps (finite-element analysis, tolerance stacks, material specification), so practical deployability is only partially assessed.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability.	Output Quality	negative	high	adequacy of geometric similarity metrics to capture functionality, manufacturability, and assemblability	0.18
We introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies.	Other	positive	high	availability of a Text-to-CAD benchmark for complex B-Rep assemblies	0.3
MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment.	Other	positive	high	evaluation pipeline effectiveness (code executability, geometric validity, design-intent alignment)	0.3
The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality.	Output Quality	positive	high	assessed functionality, manufacturability, and assemblability of generated CAD models	0.3
To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation.	Other	positive	high	reliability of rubric-based VLM judge (agreement with human annotation)	0.18
Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria.	Output Quality	negative	high	success rates at stages: code executability, geometry validity, engineering-ready design (fine-grained engineering criteria)	0.18
Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.	Other	positive	high	utility of benchmark and evaluation framework for advancing Text-to-CAD toward engineering design	0.18
Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.	Other	positive	high	availability of project website, leaderboard, dataset, and code	0.3
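
The claim about judge reliability rests on agreement between VLM verdicts and human annotations. This summary does not say which agreement statistic the authors use; Cohen's kappa is one common choice for binary verdicts, and the sketch below computes it on made-up labels purely as an illustration.

```python
def cohen_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters giving binary verdicts on the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is formally
        # undefined, and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy verdicts on eight rubric items (not data from the paper).
judge_labels = [True, True, False, False, True, False, True, False]
human_labels = [True, True, False, False, True, False, False, True]
kappa = cohen_kappa(judge_labels, human_labels)
```

Here the raters agree on 6 of 8 items (0.75) against a chance level of 0.5, giving a kappa of 0.5; raw agreement alone would overstate reliability when pass rates are skewed.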