A new large benchmark for industrial CAD automation finds multimodal LLMs can approximate outer shapes but struggle to produce correct, executable parametric designs; fine-tuning helps inside known families, but generalization to novel parts remains weak.
Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
Summary
Main Finding
BenchCAD is a large, expert-verified benchmark (17,900 executable CadQuery programs across 106 industrial part families) that decomposes industrial CAD reasoning into perception, operation understanding, parametric abstraction, and executable program synthesis. Evaluations across 10+ leading MLLMs and CAD-specialist models show that current systems can often recover coarse outer geometry but generally fail to produce faithful, editable parametric CAD programs: they miss fine 3D structure, misinterpret industrial parameters, and substitute advanced operations (helical sweeps, lofts, twist-extrudes, involute gear constructions) with simpler sketch+extrude patterns. Supervised fine-tuning and RL improve in-distribution performance, but out-of-distribution generalization to held-out families remains limited.
Key Points
- Dataset scale and scope
  - 17,900 execution-verified CadQuery programs.
  - 106 named industrial families (fasteners, gears, springs, drills, brackets, flanges, propellers, etc.).
  - 49 CadQuery operations covered (including advanced ops often absent in prior corpora).
  - 49% of families anchored to ISO/DIN/EN/ASME/IEC standards (52/106 families).
- Capability decomposition and tasks
  - Four-level capability hierarchy: L1 Holistic Visual Recognition → L2 CAD Operation Understanding → L3 Industrial Parametric Abstraction → L4 Compositional Spatial/Code Reasoning.
  - Four evaluation tasks: VISION2CODE (image→CadQuery), CODE EDIT (instruction-guided edits), VISION QA (image-based numeric QA), CODE QA (program-based numeric QA).
  - BenchCAD-QA: 2,400 paired image/code numeric-QA items; BenchCAD-Edit: 748 curated edit pairs.
- Metrics and protocols
  - Execution-verified outputs; rotation- and scale-invariant scoring.
  - Metrics: exec_pct, voxel IoU, Chamfer and Hausdorff distances, Feature-F1, essential-op recall, normalized accuracy for edits (headroom-normalized IoU improvement).
  - QA accuracy uses ±5% tolerance for ratios, exact match for integers.
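The scoring rules above can be sketched in a few lines. This is a minimal illustration, not the benchmark's released evaluation code: the function names, the set-of-voxel-indices representation, the zero-reference fallback, and the exact form of the headroom normalization (gain divided by the IoU still available before the edit) are all assumptions made for this example.

```python
def voxel_iou(pred: set, target: set) -> float:
    """Intersection-over-union of two sets of occupied voxel indices (i, j, k)."""
    union = pred | target
    if not union:
        return 0.0  # both shapes empty: treat as no overlap (assumed convention)
    return len(pred & target) / len(union)


def headroom_normalized_gain(iou_before: float, iou_after: float) -> float:
    """Fraction of the remaining IoU headroom closed by an edit (assumed form)."""
    headroom = 1.0 - iou_before
    if headroom <= 0.0:
        return 0.0  # nothing left to improve
    return (iou_after - iou_before) / headroom


def qa_correct(pred: float, gold: float, rel_tol: float = 0.05) -> bool:
    """Exact match for integer answers, +/-5% relative tolerance for ratios."""
    if isinstance(gold, int) and not isinstance(gold, bool):
        return pred == gold
    if gold == 0:
        return abs(pred) <= rel_tol  # assumed absolute fallback for a zero reference
    return abs(pred - gold) <= rel_tol * abs(gold)
```

For instance, an edit that lifts IoU from 0.5 to 0.75 closes half of the available headroom, so `headroom_normalized_gain(0.5, 0.75)` returns 0.5.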
- Empirical findings
  - Models more easily recover outer envelopes than correct parametric programs or operation sequences.
  - Vision QA significantly underperforms Code QA on identical questions, which indicates that visual recognition, not just reasoning, is a major bottleneck.
  - Code edit (instruction-following edits) is consistently harder than greenfield code generation.
  - Simple API-level edits are nearly solved; compositional and multi-step edits are still difficult.
  - CAD-specialist models tend to get high IoU on extrude-heavy families but underperform on non-extrude advanced operations.
  - SFT + RL on BenchCAD increases operation coverage and executable generation in-distribution; substantial OOD gaps remain.
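To make the operation-substitution finding concrete, essential-op recall can be approximated by statically listing the method calls in a generated program and checking them against the operations the reference design depends on. The sketch below is an assumption about how such a metric could be computed, not the benchmark's implementation; `called_methods` and `essential_op_recall` are hypothetical helper names, and the generated snippet is only parsed, never executed.

```python
import ast


def called_methods(source: str) -> set:
    """Names of all methods invoked as attribute calls, e.g. `.extrude(...)`."""
    tree = ast.parse(source)
    return {
        node.func.attr
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }


def essential_op_recall(generated: str, essential_ops: set) -> float:
    """Share of the reference's essential operations that the generated code calls."""
    if not essential_ops:
        return 1.0
    try:
        used = called_methods(generated)
    except SyntaxError:
        return 0.0  # unparseable output recalls nothing
    return len(essential_ops & used) / len(essential_ops)


# Hypothetical case: the reference needs a twisted extrusion of a polygon,
# but the model falls back to a plain sketch-and-extrude.
reference_ops = {"polygon", "twistExtrude"}
generated = '''
import cadquery as cq
result = cq.Workplane("XY").polygon(6, 10.0).extrude(20.0)
'''
print(essential_op_recall(generated, reference_ops))  # 0.5
```

A program like this can still score a reasonable IoU on the outer envelope while recalling only half of the essential operations, which is the gap the findings above describe.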
- Release and reproducibility
  - Dataset and evaluation code publicly released (data CC-BY-4.0, code MIT).
  - Generation pipeline uses domain experts, standard-table sampling, and a sandbox-execution + visual sign-off pipeline.
Data & Methods
- Data generation pipeline
  - Domain experts implemented parameterized builders per family using CadQuery; builders respect engineering constraints and standard tables.
  - Each family exposes a typed parameter schema, sampler, validator, and deterministic builder.
  - Three difficulty tiers (easy/medium/hard) per family, created by extending parameter ranges and optional features.
  - Programs that fail to compile, exceed a 30 s runtime, or yield degenerate volume are quarantined; surviving programs receive visual sign-off.
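The schema/sampler/validator/builder split described above might look roughly like the following for a single simple family. Everything here is a hypothetical illustration: the `WasherParams` schema, the `SIZE_TABLE` values (not checked against any standard), and the function names are assumptions, and the builder only emits CadQuery source text. Sandbox execution, the runtime limit, the degenerate-volume check, and visual sign-off are omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WasherParams:
    """Typed parameter schema for a hypothetical flat-washer family (mm)."""
    inner_d: float
    outer_d: float
    thickness: float


# Illustrative size table (inner_d, outer_d, thickness); not verified against a standard.
SIZE_TABLE = [(4.3, 9.0, 0.8), (5.3, 10.0, 1.0), (6.4, 12.0, 1.6)]


def sample(rng: random.Random) -> WasherParams:
    """Sampler: draw one row from the size table."""
    return WasherParams(*rng.choice(SIZE_TABLE))


def validate(p: WasherParams) -> bool:
    """Validator: reject parameter sets that cannot form a washer."""
    return 0 < p.inner_d < p.outer_d and p.thickness > 0


def build(p: WasherParams) -> str:
    """Deterministic builder: the same parameters always give the same program text."""
    return (
        "import cadquery as cq\n"
        f"result = (cq.Workplane('XY').circle({p.outer_d / 2})"
        f".circle({p.inner_d / 2}).extrude({p.thickness}))\n"
    )


def generate(seed: int, n: int) -> list:
    """Sample n parameter sets and keep the programs that pass validation."""
    rng = random.Random(seed)
    candidates = (sample(rng) for _ in range(n))
    return [build(p) for p in candidates if validate(p)]
```

Seeding the sampler keeps the whole pipeline reproducible: `generate(0, 5)` returns the same five programs on every run.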
- Datasets
  - BenchCAD: 17,900 verified CadQuery parts with multi-view renders and metadata.
  - BenchCAD-QA: 2,400 numeric QA items paired across image/code modalities.
  - BenchCAD-Edit: 748 verified edit pairs across five edit types (literal replacement, chained transform, relative computation, feature editing, geometry rebuilding).
- Models evaluated
  - Proprietary frontier MLLMs: GPT-4o, GPT-5.3 (chat/thinking variants), Claude Opus 4.7, Gemini 3.1 Pro, OpenAI o3, Moonshot Kimi.
  - Open-source MLLMs/code LLMs: Qwen3-VL, InternVL3, gpt-oss-120b, nemotron-3.
  - CAD-specialist models: cadrille-RL, CADEvolve v3.
- Training experiments
  - Trained a Qwen3-VL-2B baseline with three SFT variants: iid (BenchCAD + extrusion-heavy data), ood (BenchCAD minus 10 mechanical families), baseline (no BenchCAD).
  - RL stage (on-policy, GRPO-style) applied to SFT checkpoints with reward r = 0.2·ess + 0.8·IoU, where ess is essential-op recall; parse errors are penalized.
  - Outcome: SFT+RL improves in-distribution operation recall and execution, but OOD generalization remains limited.
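The reward used in the RL stage is a simple weighted sum, sketched below. The 0.2/0.8 weights come from the summary above; the size of the parse-error penalty is not stated there, so `PARSE_ERROR_PENALTY` is an assumed placeholder. The group-relative advantage shown afterwards is the usual GRPO normalization and is included only as an assumption about what "GRPO-style" implies for this setup.

```python
import statistics

PARSE_ERROR_PENALTY = -1.0  # assumed value; the source only says parse errors are penalized


def reward(ess: float, iou: float, parsed: bool = True) -> float:
    """r = 0.2 * essential-op recall + 0.8 * IoU, with a penalty for unparseable code."""
    if not parsed:
        return PARSE_ERROR_PENALTY
    return 0.2 * ess + 0.8 * iou


def group_advantages(rewards: list) -> list:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]
```

Because IoU carries 80% of the weight, a program that matches the outer shape with the wrong operations still earns most of the reward; the 20% operation term is what pushes the policy toward sweeps, lofts, and twist-extrudes where they are essential.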
Implications for AI Economics
- Near-term automation potential is limited and selective
  - Current models are good at recovering global geometry but not the parametric/operation-level representations that engineers need for editability, manufacturability, and compliance. This implies limited immediate displacement of skilled CAD engineers; instead, models will augment routine/extrusion-heavy tasks first.
  - Business ROI for automation will be highest in domains with repetitive, extrusion-dominant parts or where canonical parameter tables exist (e.g., standard fasteners, basic housings).
- Value of domain-specific, standards-anchored data
  - BenchCAD demonstrates the economic value of domain-aligned, expert-curated datasets (standard anchoring, verified builders). Firms possessing such corpora gain an advantage when fine-tuning models for industrial CAD tasks.
  - Standards-aligned datasets reduce regulatory and interoperability frictions, increasing the commercial attractiveness of CAD automation in regulated industries.
- Competitive dynamics and moats
  - Companies that invest in expert-verified, operation-rich corpora and iterative SFT+RL pipelines can create a durable edge. The cost of producing high-quality parametric datasets (expert time, verification, engineering constraints) is a non-trivial barrier to entry.
  - Public release under permissive licenses (CC-BY, MIT) lowers entry costs and may accelerate competition, benefiting downstream toolmakers and small firms but diluting proprietary data moats.
- Market segmentation and specialized products
  - Expect a two-tier market: general-purpose MLLM CAD assistants (good at rough geometry, ideation) and CAD-specialist models/SaaS for industrial use cases requiring parametric fidelity, standards compliance, and edit fidelity.
  - Firms may monetize specialist services via subscription APIs, fine-tuning-as-a-service, or human-in-the-loop workflows that certify parametric outputs.
- Labor and skill shifts
  - Demand will shift from routine model-building and trivial parametrization to tasks requiring higher-level design judgement: constraint formulation, validation, integration with manufacturing constraints, and supervising ML outputs.
  - Upskilling needs: CAD users will need model-verification and prompt-engineering skills, and familiarity with parametric programming paradigms and standards.
- Measurement of automation readiness and investment decision-making
  - BenchCAD provides a structured signal to estimate technology readiness for specific families (e.g., certain gears or springs remain hard). Firms can use benchmark metrics (op recall, exec_pct, edit accuracy) to prioritize automation investments where the expected productivity gains are supported by measured performance.
  - Because Code QA accuracy is well above Vision QA accuracy on the same questions, exposing models to symbolic representations (templates, partial programs, parameter schemas) is likely to increase practical utility faster than purely visual workflows.
- R&D and regulatory policy implications
  - Policy and procurement in safety-sensitive sectors should require benchmarks or certification for parametric fidelity and standards compliance rather than shape-only metrics.
  - Public funding of open, standards-anchored benchmarks (like BenchCAD) can lower adoption barriers for SMEs and catalyze more trustworthy industrial automation.
- Cost structure and training economics
  - Achieving high parametric fidelity requires specialized corpora and compute (SFT + RL). Firms will face trade-offs between building proprietary datasets vs. relying on public benchmarks and fine-tuning existing open models.
  - Given the limited OOD generalization observed, continuous data collection (new families, edge cases) and maintenance are necessary, implying ongoing operating costs rather than one-time model purchases.
Practical recommendations for firms and policymakers
- Firms: start with hybrid workflows that use models for initial geometry drafts and retain engineers for parametric verification and edits; invest in collecting standards-aligned parametric corpora for high-value families.
- Vendors: prioritize operation coverage (helices, lofts, twist-extrudes, involute gears) and provide transparent metrics (exec_pct, edit accuracy) for customers to evaluate model fit.
- Policymakers / procurement teams: require benchmarks/tests that verify parametric/edit fidelity (not just rendered shape) for certified toolchains in regulated domains.
- Researchers: focus on bridging the visual→parametric gap, robustness to OOD families, and compositional edit reasoning to unlock broader economic impact.
Overall, BenchCAD clarifies where CAD automation currently adds value and what remains to be solved before wide industrial substitution: useful productivity gains are feasible in constrained, standard-rich contexts, but broad automation across diverse industrial families requires further advances in parametric understanding, operation reasoning, and OOD generalization.
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families. | Research Productivity | positive | high | presence_and_scope_of_dataset | n=17900; 0.3 |
| BenchCAD evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing. | Research Productivity | positive | high | evaluation_task_coverage | 0.18 |
| BenchCAD enables fine-grained analysis across perception, parametric abstraction, and executable program synthesis. | Research Productivity | positive | high | analysis_capability_of_benchmark | 0.18 |
| Across 10+ frontier models, current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. | Output Quality | mixed | high | faithfulness_of_generated_parametric_CAD_programs | 0.18 |
| Common failures include missing fine 3D structure. | Output Quality | negative | high | completeness_of_3D_structure_in_generated_models | 0.18 |
| Common failures include misinterpreting industrial design parameters. | Output Quality | negative | high | accuracy_of_inferred_design_parameters | 0.18 |
| Common failures include replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. | Output Quality | negative | high | use_of_appropriate_CAD_operations_in_generated_code | 0.18 |
| Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. | Output Quality | mixed | high | in-distribution_performance_and_out-of-distribution_generalization | 0.18 |
| Industrial CAD code generation requires models to produce executable parametric programs from visual or textual inputs and to understand 3D structure, infer engineering parameters, and choose CAD operations that reflect design and manufacture. | Other | positive | high | required_capabilities_for_task | 0.03 |
| BenchCAD positions itself as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation. | Research Productivity | positive | high | benchmark_intended_impact_on_industrial_readiness | 0.03 |