Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

We present an empirical study of how far general-purpose coding agents, without hardware-specific training, can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.

Summary

Main Finding

General-purpose coding agents (Claude Code with Opus 4.5/4.6) with no HLS-specific training can materially improve FPGA high-level-synthesis (HLS) designs when organized as an “agent factory” (a two-stage, multi-agent pipeline). By decomposing designs, using an ILP to assemble promising sub-kernel variants, and then scaling exploration across multiple independent agents, the authors obtain a mean 8.27× speedup over baseline at 10 agents. Gains concentrate on complex kernels (streamcluster >20×, kmeans ≈10×). Agents repeatedly rediscover effective HLS patterns (e.g., ARRAY_PARTITION) and find cross-function transformations that pragma-only searches miss. The limitations (small benchmark set, single model family and toolchain, uneven gains) point to clear future work.

Key Points

Two-stage agent-factory pipeline:
- Stage 1: Decompose program into sub-functions; spawn one optimizer agent per sub-function to generate M=7 variants (baseline, conservative, pipeline variants, aggressive unroll, alternate transforms). Evaluate correctness and synthesize each variant to measure latency (L) and area (A). Use an ILP minimizing global latency subject to area budget to select top-N global combinations.
- Stage 2: Spawn N exploration agents, one per top-ILP solution, to explore cross-function optimizations (pragma recombination, loop fusion/inlining, memory restructuring, algebraic/compute rewrites). Each agent iterates synthesize/evaluate to build Pareto-optimal area–latency designs.
Empirical scaling: varying N ∈ {1,2,4,8,10} shows that increasing agent count expands the explored Pareto front; mean best speedup over baseline reaches 8.27× at N=10. Strongest gains on “hard” benchmarks (streamcluster, cfd, kmeans, leukocyte, lavamd).
Agents recover known hardware heuristics without task-specific training:
- ARRAY_PARTITION applied to relieve memory bottlenecks.
- PIPELINE is often harmful unless loop-carried dependencies and memory bandwidth are addressed first.
ILP top-ranked candidates are not always the origin of final best designs — global, cross-function exploration in Stage 2 finds improvements invisible to sub-kernel-only search.
Limitations highlighted by authors:
- Small benchmark set (12 kernels), a single toolchain (AMD Vitis HLS), and a single model family (Claude Opus).
- Baselines are bounded pragma enumerations (strong but not state-of-the-art DSE like AutoDSE).
- Gains uneven; some kernels saturate early; under tight area budgets agents may increase area without proportional latency reduction.
- HLS synthesis runtimes (minutes per run) and stochastic agent behaviors limit exhaustive coverage.

Data & Methods

Benchmarks: 12 kernels — 6 from HLS-Eval (AES, DES, KMP, NW, PRESENT, SHA256) and 6 from Rodinia-HLS (lavamd, kmeans, hotspot, leukocyte/lc_dilate, cfd/cfd_step_factor, streamcluster).
Tooling and model: Claude Code (Opus 4.5/4.6) as the agent LLM; AMD Vitis HLS for synthesis. Experiments averaged over 5 runs. Fixed clock: 10 ns (35 ns for streamcluster).
Stage 1 variant strategy (per sub-function): M = 7 variants including baseline, conservative pragmas, multiple PIPELINE II choices, pipeline+UNROLL, and alternate transforms (ARRAY_PARTITION, INLINE, closed-form rewrites).
Variant evaluation: functional-correctness testing → HLS synthesis → record (L_m^k, A_m^k). Discard incorrect variants.
ILP formulation:
- Decision variables x_m^k ∈ {0,1} (one variant per function).
- Objective: minimize L_total(x), where L_total encodes the call-graph-derived composition (sum along sequential paths, max over parallel branches, loop multipliers).
- Constraint: Σ_{k,m} A_m^k x_m^k ≤ A_budget ; Σ_m x_m^k = 1 ∀k.
- ILP used to enumerate top-N feasible global configurations S = {s1..sN}.
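The selection step above can be illustrated with a small brute-force sketch. It is not the authors' solver: it assumes every sub-function runs sequentially, so L_total is a plain sum (the source also mentions max over parallel branches and loop multipliers, which are omitted here), and it enumerates all combinations instead of calling an ILP solver. The function name `top_n_configs` and all latency/area numbers are hypothetical.

```python
from itertools import product

def top_n_configs(variants, area_budget, n):
    """Return up to n feasible configurations, lowest total latency first.

    variants: {function_name: [(latency, area), ...]}, with functionally
    incorrect variants already discarded.
    Assumption: functions execute sequentially, so latencies simply add.
    """
    names = sorted(variants)
    feasible = []
    # Picking exactly one variant index per function enforces
    # sum_m x_m^k = 1 for every function k.
    for choice in product(*(range(len(variants[k])) for k in names)):
        latency = sum(variants[k][m][0] for k, m in zip(names, choice))
        area = sum(variants[k][m][1] for k, m in zip(names, choice))
        if area <= area_budget:  # area constraint
            feasible.append((latency, area, dict(zip(names, choice))))
    # Sort by latency, then area; the key avoids comparing the dicts.
    feasible.sort(key=lambda t: (t[0], t[1]))
    return feasible[:n]

# Hypothetical (latency, area) measurements for two sub-functions.
variants = {
    "load": [(100, 10), (60, 25), (40, 60)],
    "compute": [(400, 20), (150, 70), (90, 150)],
}
top = top_n_configs(variants, area_budget=120, n=3)
for latency, area, choice in top:
    print(latency, area, choice)
```

Exhaustive enumeration only works for a handful of functions and variants; with M = 7 variants per function the space grows as 7^K, which is presumably why the source formulates the problem as an ILP.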
Stage 2 exploration: spawn N agents (one per si); each agent applies design-wide optimization paths (pragma composition, code restructuring, memory optimization, compute simplifications), iteratively checks correctness, synthesizes, and records (L,A). Final design D* is the feasible explored design with minimum latency.
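As a rough illustration of how the recorded (L, A) points could be post-processed, the sketch below extracts an area-latency Pareto front and picks D* as the lowest-latency design within an area budget. The helper names `pareto_front` and `best_feasible` and the sample points are hypothetical; the source does not describe its bookkeeping at this level of detail.

```python
def pareto_front(designs):
    """Return the non-dominated (latency, area) points, sorted by area.

    A design is dominated if another design has area <= and latency <=,
    with at least one of the two strictly smaller.
    """
    front = []
    best_latency = float("inf")
    # Walk designs from smallest area up; keep a point only if it
    # strictly improves on the best latency seen so far.
    for latency, area in sorted(set(designs), key=lambda d: (d[1], d[0])):
        if latency < best_latency:
            front.append((latency, area))
            best_latency = latency
    return front

def best_feasible(designs, area_budget):
    """Pick D*: the minimum-latency design whose area fits the budget."""
    feasible = [d for d in designs if d[1] <= area_budget]
    return min(feasible, key=lambda d: (d[0], d[1])) if feasible else None

# Hypothetical (latency, area) results pooled from all agents.
explored = [(500, 30), (210, 95), (250, 80), (300, 90), (180, 140), (210, 120)]
front = pareto_front(explored)
d_star = best_feasible(explored, area_budget=120)
print(front)
print(d_star)
```

Pooling results from all N agents before computing the front matches the observation in the source that the best design need not come from the top-ranked ILP candidate.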
Baselines:
- For HLS-Eval: bounded exhaustive enumeration per-loop of 5 options (none, PIPELINE II∈{1,2}, UNROLL factor∈{2,4}) with ILP composition (pragmas-only baseline).
- For Rodinia-HLS: compare to reference optimized implementations (tiling, pipelining, double-buffering).
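To make the size of the HLS-Eval baseline space concrete, here is a minimal sketch of the five per-loop options as Vitis HLS pragma strings. The loop labels and the function `enumerate_pragma_configs` are hypothetical. The source states that per-loop results are composed with an ILP; the full cross product below is shown only to illustrate how quickly the space grows (5^n for n loops).

```python
from itertools import product

# The five per-loop options listed for the baseline.
LOOP_OPTIONS = [
    "",                             # no pragma
    "#pragma HLS PIPELINE II=1",
    "#pragma HLS PIPELINE II=2",
    "#pragma HLS UNROLL factor=2",
    "#pragma HLS UNROLL factor=4",
]

def enumerate_pragma_configs(loop_labels):
    """Yield one {loop_label: pragma} mapping per combination of options."""
    for combo in product(LOOP_OPTIONS, repeat=len(loop_labels)):
        yield dict(zip(loop_labels, combo))

# Hypothetical kernel with two labeled loops.
configs = list(enumerate_pragma_configs(["L_outer", "L_inner"]))
print(len(configs))
```

Each mapping would still have to be written into the kernel source and synthesized, so at minutes per synthesis run the bound on options per loop matters.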
Metrics & results presentation: Pareto fronts of latency speedup vs area; mean speedup table for agent scaling steps (1→2→4→8→10), with overall mean best speedup 8.27× at N=10.

Implications for AI Economics

Inference-time compute as a productized resource:
- Agent scaling (more parallel LLM agents) functions like buying more inference-time compute to expand search/optimization. This suggests a direct economic trade-off: increased cloud/compute spend (LLM inference + additional HLS synth runs) can substitute for engineering time or privileged domain expertise. Pricing of such services (per-agent or per-hour inference costs) becomes central to ROI.
Labor substitution/augmentation and productivity:
- For routine-to-moderately-complex HLS tasks, general-purpose LLM agents can reproduce domain heuristics and find nontrivial optimizations, implying potential to reduce hours of skilled HLS engineer time or to augment less-experienced engineers. The economic value depends on agent success rates, required human oversight, and cost per synthesized design iteration.
Cost structure and diminishing returns:
- Mean 8.27× speedup at N=10 shows positive returns to scaling but with uneven gains across workloads and non-monotonicities (stochastic exploration). Marginal benefit per added agent varies — economic adoption should model diminishing returns and the cost of additional synthesis runs (HLS runtime) plus inference cost.
Product and market impacts:
- EDA/HLS tool vendors and cloud providers can productize multi-agent optimization pipelines (integrated LLM + synthesis) as a managed offering. Value propositions: faster design cycles, better Pareto trade-offs (latency vs area), lower skilled-labor requirements.
- Startups/consultancies can offer HLS-optimization-as-a-service that charges based on compute budget (N) and synthesis runs, or on delivered improvements (performance/area gains).
Risks, operational costs, and externalities:
- Dependence on a particular LLM family and synthesis tool constrains portability and increases vendor lock-in risk. Economic assessments must include model access costs, IP and licensing, long HLS runtimes (time-to-solution), and verification overheads (functional correctness, hardware validation).
- Quality-control and safety: incorrect transformations have hardware correctness implications; human-in-the-loop verification adds labor costs which offset some automation gains.
Research and investment directions to improve economic viability:
- Quantify cost-benefit: convert agent/inference + synthesis costs into dollars and compare to saved engineer hours for typical enterprise projects to estimate payback periods.
- Portfolio optimization: adaptive strategies that pick N based on expected marginal improvement and budget constraints (cost-aware agent scaling).
- Broader evaluation (more benchmarks, ASIC targets, alternate HLS toolchains) to increase confidence for industrial adoption and to better model market value across segments.
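One way to picture cost-aware agent scaling is a simple stopping rule: keep increasing N while the extra speedup per extra dollar stays above a threshold. The sketch below is only an illustration. The speedup curve, the cost per agent, and the threshold are all made-up numbers, not results from the paper, and the function `pick_agent_count` is hypothetical.

```python
def pick_agent_count(speedup_by_n, cost_per_agent, min_gain_per_dollar):
    """Greedy choice of N based on marginal speedup per marginal dollar.

    speedup_by_n: {agent_count: best speedup observed at that count}.
    Stops at the first step whose marginal gain per dollar falls below
    the threshold, so a later recovery in the curve is not considered.
    """
    counts = sorted(speedup_by_n)
    chosen = counts[0]
    for prev, cur in zip(counts, counts[1:]):
        extra_cost = (cur - prev) * cost_per_agent
        gain = speedup_by_n[cur] - speedup_by_n[prev]
        if gain / extra_cost < min_gain_per_dollar:
            break
        chosen = cur
    return chosen

# Hypothetical speedup curve and costs (NOT taken from the paper).
speedup_by_n = {1: 2.0, 2: 3.0, 4: 4.5, 8: 5.0, 10: 5.1}
n_star = pick_agent_count(speedup_by_n, cost_per_agent=10.0, min_gain_per_dollar=0.02)
print(n_star)
```

Because the source reports non-monotonic, workload-dependent gains, a greedy stop like this could end too early on some kernels; a real policy would likely need per-workload estimates.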
Policy/market considerations:
- IP, provenance, and reproducibility concerns (who owns the optimized code? which model/tool produced it?) may influence procurement and regulatory decisions in safety-critical hardware domains.

Summary: The paper demonstrates a promising economic lever — paying for more inference-time compute (agent scaling) to expand automated HLS optimization — which can reduce engineering effort and improve hardware trade-offs for many kernels. Practical adoption requires careful modeling of inference and synthesis costs, human verification overhead, and workload-specific returns.

Assessment

Paper Type: other
Evidence Strength: medium. The paper presents systematic empirical results across 12 standard HLS kernels and reports large, consistent speedups with an explicit baseline and scaling experiments (1→10 agents). However, the sample is modest, there is no statistical testing or uncertainty quantification reported, comparisons to strongest human or automated baselines are limited, and results depend on a single HLS tool and specific LLM versions, reducing external validity.
Methods Rigor: medium. The two-stage pipeline (sub-kernel ILP assembly plus expert-agent global search) is clearly described and evaluated with controlled scaling of agent count; benchmark choices (HLS-Eval, Rodinia-HLS) are standard. Missing or unclear elements include details on randomized seeds/repeatability, compute and monetary cost of agents, sensitivity analyses (e.g., different LLMs, HLS settings, area constraints), and comparisons to human experts or other automated optimizers.
Sample: Empirical evaluation on 12 kernels drawn from HLS-Eval and Rodinia-HLS; optimizations performed with Claude Code (Opus 4.5/4.6) driving AMD Vitis HLS; experiments scale the number of autonomous optimization agents from 1 to 10 and apply a two-stage pipeline (per-subkernel optimization + ILP assembly followed by multi-agent global search); reported metric is synthesized design throughput/speedup relative to a baseline implementation (baseline details not fully specified in the summary).
Themes: productivity, innovation
Generalizability:
- Small and curated benchmark set (12 kernels) may not represent full range of real-world hardware designs.
- Results tied to AMD Vitis HLS and specific versions of Claude Code; different tools or LLMs may perform differently.
- Area constraints and other synthesis/hardware targets are not varied widely, limiting applicability across architectures.
- Unclear robustness to random seeds, prompt/hyperparameter changes, or varying compute budgets.
- Does not measure engineering effort, cost, or time-to-solution, only final synthesized performance.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents.	Other	positive	high	existence and design of the two-stage agent factory pipeline	0.2
In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint.	Other	positive	high	description of Stage 1 decomposition and ILP-based assembly	0.2
In Stage 2, the pipeline launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition.	Other	positive	high	description of Stage 2 expert-agent exploration of cross-function optimizations	0.2
We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS.	Other	positive	high	evaluation dataset and toolchain used	n=12 0.2
Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline.	Task Completion Time	positive	high	execution/performance speedup relative to baseline	n=12 8.27× speedup 0.12
Larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×.	Task Completion Time	positive	high	execution/performance speedup for specific benchmarks	n=1 exceeds 20× (streamcluster); approximately 10× (kmeans) 0.12
Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training.	Innovation Output	positive	medium	discovery of known hardware optimization patterns by agents	n=12 0.07
The best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search.	Output Quality	positive	high	origin/ranking of best designs relative to ILP candidates	n=12 0.12
These results establish agent scaling as a practical and effective axis for HLS optimization.	Organizational Efficiency	positive	medium	practical effectiveness of scaling the number of agents for HLS optimization	n=12 0.07

Multiple general-purpose coding agents can dramatically accelerate hardware designs: a two-stage agent factory using up to 10 Claude Code agents achieves a mean 8.27× HLS speedup across 12 benchmarks, with peaks above 20×, and finds cross-kernel optimizations missed by per-subkernel search.