Multiple general-purpose coding agents can dramatically accelerate hardware designs: a two-stage agent factory using up to 10 Claude Code agents achieves a mean 8.27× HLS speedup across 12 benchmarks, with peaks above 20×, and finds cross-kernel optimizations missed by per-subkernel search.
We present an empirical study of how far general-purpose coding agents, without hardware-specific training, can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
Summary
Main Finding
General-purpose coding agents (Claude Code with Opus 4.5/4.6) with no HLS-specific training can materially improve FPGA high-level-synthesis (HLS) designs when organized as an “agent factory”, a two-stage, multi-agent pipeline. The pipeline decomposes each design, uses an ILP to assemble promising sub-kernel variants, then scales exploration across multiple independent agents; scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline. Gains concentrate on complex kernels (streamcluster >20×, kmeans ≈10×). Agents repeatedly rediscover effective HLS patterns (e.g., ARRAY_PARTITION) and find cross-function transformations that pragma-only searches miss. The limitations (small benchmark set, single model family and toolchain, uneven gains across kernels) point to clear future work.
Key Points
- Two-stage agent-factory pipeline:
  - Stage 1: Decompose program into sub-functions; spawn one optimizer agent per sub-function to generate M=7 variants (baseline, conservative, pipeline variants, aggressive unroll, alternate transforms). Evaluate correctness and synthesize each variant to measure latency (L) and area (A). Use an ILP minimizing global latency subject to area budget to select top-N global combinations.
  - Stage 2: Spawn N exploration agents, one per top-ILP solution, to explore cross-function optimizations (pragma recombination, loop fusion/inlining, memory restructuring, algebraic/compute rewrites). Each agent iterates synthesize/evaluate to build Pareto-optimal area–latency designs.
- Empirical scaling: varying N ∈ {1,2,4,8,10} shows that increasing agent count expands the explored Pareto front; mean best speedup over baseline reaches 8.27× at N=10. Strongest gains on “hard” benchmarks (streamcluster, cfd, kmeans, leukocyte, lavamd).
- Agents recover known hardware heuristics without task-specific training:
  - ARRAY_PARTITION applied to relieve memory bottlenecks.
  - PIPELINE is often harmful unless loop-carried dependencies and memory bandwidth are addressed first.
- ILP top-ranked candidates are not always the origin of final best designs — global, cross-function exploration in Stage 2 finds improvements invisible to sub-kernel-only search.
- Limitations highlighted by the authors:
  - Small benchmark set (12 kernels), a single toolchain (AMD Vitis HLS), and a single model family (Claude Opus).
  - Baselines are bounded pragma enumerations (strong, but not state-of-the-art DSE such as AutoDSE).
  - Gains are uneven: some kernels saturate early, and under tight area budgets agents may increase area without a proportional latency reduction.
  - HLS synthesis runtimes (minutes per run) and stochastic agent behavior limit exhaustive coverage.
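The Pareto-optimal area–latency designs mentioned under Stage 2 can be extracted from a pool of synthesized designs with a single sorted sweep. The following is a minimal sketch, not the authors' implementation; the `pareto_front` helper, the tuple layout, and the design names and numbers are all hypothetical placeholders.

```python
from typing import List, Tuple

# (name, latency, area); lower is better for both metrics.
Design = Tuple[str, float, float]


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design beats on both latency and area."""
    # Sort by latency, then area, so one sweep can track the smallest area seen so far.
    ordered = sorted(designs, key=lambda d: (d[1], d[2]))
    front: List[Design] = []
    best_area = float("inf")
    for name, latency, area in ordered:
        # Keep a design only if it uses strictly less area than every
        # design that is at least as fast.
        if area < best_area:
            front.append((name, latency, area))
            best_area = area
    return front


# Hypothetical designs pooled from several agents (illustrative numbers only).
designs = [
    ("baseline", 1000, 50),
    ("pipelined", 400, 80),
    ("partitioned", 250, 120),
    ("bloated", 450, 130),  # slower and larger than "pipelined", so dominated
    ("tiny", 1200, 40),
]
front = pareto_front(designs)
```

Pooling the designs from all N agents before calling such a helper is one way to see how adding agents expands the explored front.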
Data & Methods
- Benchmarks: 12 kernels — 6 from HLS-Eval (AES, DES, KMP, NW, PRESENT, SHA256) and 6 from Rodinia-HLS (lavamd, kmeans, hotspot, leukocyte/lc_dilate, cfd/cfd_step_factor, streamcluster).
- Tooling and model: Claude Code (Opus 4.5/4.6) as the agent LLM; AMD Vitis HLS for synthesis. Results are averaged over 5 runs. Fixed clock period: 10 ns (35 ns for streamcluster).
- Stage 1 variant strategy (per sub-function): M = 7 variants including baseline, conservative pragmas, multiple PIPELINE II choices, pipeline+UNROLL, and alternate transforms (ARRAY_PARTITION, INLINE, closed-form rewrites).
- Variant evaluation: functional-correctness testing → HLS synthesis → record (L_m^k, A_m^k). Discard incorrect variants.
- ILP formulation:
  - Decision variables x_m^k ∈ {0,1} (one variant per function).
  - Objective: minimize L_total(x), where L_total encodes the call-graph-derived composition (sum along sequential paths, max over parallel branches, loop multipliers).
  - Constraints: Σ_{k,m} A_m^k x_m^k ≤ A_budget ; Σ_m x_m^k = 1 ∀k.
  - ILP used to enumerate top-N feasible global configurations S = {s1..sN}.
- Stage 2 exploration: spawn N agents (one per si); each agent applies design-wide optimization paths (pragma composition, code restructuring, memory optimization, compute simplifications), iteratively checks correctness, synthesizes, and records (L,A). Final design D* is the feasible explored design with minimum latency.
- Baselines:
  - For HLS-Eval: bounded exhaustive per-loop enumeration over 5 options (none, PIPELINE II∈{1,2}, UNROLL factor∈{2,4}) with ILP composition (pragmas-only baseline).
  - For Rodinia-HLS: comparison against reference optimized implementations (tiling, pipelining, double-buffering).
- Metrics & results presentation: Pareto fronts of latency speedup vs area; mean speedup table for agent scaling steps (1→2→4→8→10), with overall mean best speedup 8.27× at N=10.
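The Stage 1 selection step (pick one variant per sub-function, stay within the area budget, keep the N lowest-latency combinations) can be illustrated with a small sketch. It is not the authors' ILP: it swaps in brute-force enumeration for an ILP solver and assumes a purely sequential call graph, so L_total reduces to a sum of per-function latencies weighted by call counts. The `top_n_configs` helper, the `calls` multipliers, and all numbers are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# One variant = (latency L_m^k, area A_m^k) measured after synthesis.
Variant = Tuple[float, float]


def top_n_configs(
    variants: Dict[str, List[Variant]],
    calls: Dict[str, int],
    area_budget: float,
    n: int,
) -> List[Tuple[float, float, Dict[str, int]]]:
    """Return up to n feasible (latency, area, choice) tuples, lowest latency first."""
    names = sorted(variants)
    feasible = []
    # Choosing exactly one index per function enforces sum_m x_m^k = 1 for every k.
    for choice in product(*(range(len(variants[k])) for k in names)):
        # Assumption: sequential composition only, with a call-count multiplier.
        latency = sum(calls[k] * variants[k][m][0] for k, m in zip(names, choice))
        area = sum(variants[k][m][1] for k, m in zip(names, choice))
        if area <= area_budget:  # area constraint
            feasible.append((latency, area, dict(zip(names, choice))))
    # Rank by latency, breaking ties in favor of smaller area.
    feasible.sort(key=lambda c: (c[0], c[1]))
    return feasible[:n]


# Hypothetical measurements for two sub-functions (illustrative numbers only).
variants = {
    "load": [(100, 10), (60, 25)],
    "compute": [(400, 20), (150, 60), (90, 110)],
}
calls = {"load": 1, "compute": 2}
top = top_n_configs(variants, calls, area_budget=100, n=3)
```

Enumeration visits every combination (7^K for M = 7 variants and K sub-functions), so it only suits small decompositions; an ILP solver would be the natural choice beyond that, and parallel branches would need a max instead of a sum.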
Implications for AI Economics
- Inference-time compute as a productized resource:
  - Agent scaling (more parallel LLM agents) functions like buying more inference-time compute to expand search/optimization. This suggests a direct economic trade-off: increased cloud/compute spend (LLM inference + additional HLS synth runs) can substitute for engineering time or privileged domain expertise. Pricing of such services (per-agent or per-hour inference costs) becomes central to ROI.
- Labor substitution/augmentation and productivity:
  - For routine-to-moderately-complex HLS tasks, general-purpose LLM agents can reproduce domain heuristics and find nontrivial optimizations, implying potential to reduce hours of skilled HLS engineer time or to augment less-experienced engineers. The economic value depends on agent success rates, required human oversight, and cost per synthesized design iteration.
- Cost structure and diminishing returns:
  - Mean 8.27× speedup at N=10 shows positive returns to scaling but with uneven gains across workloads and non-monotonicities (stochastic exploration). Marginal benefit per added agent varies, so economic adoption should model diminishing returns and the cost of additional synthesis runs (HLS runtime) plus inference cost.
- Product and market impacts:
  - EDA/HLS tool vendors and cloud providers can productize multi-agent optimization pipelines (integrated LLM + synthesis) as a managed offering. Value propositions: faster design cycles, better Pareto trade-offs (latency vs area), lower skilled-labor requirements.
  - Startups/consultancies can offer HLS-optimization-as-a-service that charges based on compute budget (N) and synthesis runs, or on delivered improvements (performance/area gains).
- Risks, operational costs, and externalities:
  - Dependence on a particular LLM family and synthesis tool constrains portability and increases vendor lock-in risk. Economic assessments must include model access costs, IP and licensing, long HLS runtimes (time-to-solution), and verification overheads (functional correctness, hardware validation).
  - Quality-control and safety: incorrect transformations have hardware correctness implications; human-in-the-loop verification adds labor costs which offset some automation gains.
- Research and investment directions to improve economic viability:
  - Quantify cost-benefit: convert agent/inference + synthesis costs into dollars and compare to saved engineer hours for typical enterprise projects to estimate payback periods.
  - Portfolio optimization: adaptive strategies that pick N based on expected marginal improvement and budget constraints (cost-aware agent scaling).
  - Broader evaluation (more benchmarks, ASIC targets, alternate HLS toolchains) to increase confidence for industrial adoption and to better model market value across segments.
- Policy/market considerations:
  - IP, provenance, and reproducibility concerns (who owns the optimized code? which model/tool produced it?) may influence procurement and regulatory decisions in safety-critical hardware domains.
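The cost-aware agent scaling idea above (choose N from expected marginal improvement and a budget) could be prototyped along the following lines. This is a sketch under stated assumptions, not a method from the paper: the `pick_agent_count` helper, the flat per-agent cost, the gain-per-dollar threshold, and the speedup curve are all hypothetical, and the curve is deliberately not the paper's data.

```python
from typing import Dict


def pick_agent_count(
    speedups: Dict[int, float],
    cost_per_agent: float,
    budget: float,
    min_gain_per_dollar: float,
) -> int:
    """Grow N step by step while each step is affordable and pays for itself."""
    counts = sorted(speedups)
    if counts[0] * cost_per_agent > budget:
        raise ValueError("budget does not cover the smallest agent count")
    chosen = counts[0]
    for prev, nxt in zip(counts, counts[1:]):
        extra_cost = (nxt - prev) * cost_per_agent
        gain = speedups[nxt] - speedups[prev]
        # Stop at the first step that exceeds the budget or whose marginal
        # speedup per dollar falls below the threshold.
        if nxt * cost_per_agent > budget or gain / extra_cost < min_gain_per_dollar:
            break
        chosen = nxt
    return chosen


# Hypothetical expected-speedup curve over N (placeholder values only).
expected = {1: 2.0, 2: 3.0, 4: 4.5, 8: 5.0, 10: 5.1}
n_agents = pick_agent_count(
    expected, cost_per_agent=50.0, budget=500.0, min_gain_per_dollar=0.005
)
```

Because the greedy rule stops at the first unprofitable step, it can miss later gains when returns are non-monotonic, which the summary notes can happen with stochastic exploration; comparing all affordable N directly is a simple alternative.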
Summary: The paper demonstrates a promising economic lever — paying for more inference-time compute (agent scaling) to expand automated HLS optimization — which can reduce engineering effort and improve hardware trade-offs for many kernels. Practical adoption requires careful modeling of inference and synthesis costs, human verification overhead, and workload-specific returns.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. | positive | high | Other: existence and design of the two-stage agent factory pipeline | 0.2 |
| In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. | positive | high | Other: description of Stage 1 decomposition and ILP-based assembly | 0.2 |
| In Stage 2, the pipeline launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. | positive | high | Other: description of Stage 2 expert-agent exploration of cross-function optimizations | 0.2 |
| We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. | positive | high | Other: evaluation dataset and toolchain used | n=12; 0.2 |
| Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline. | positive | high | Task Completion Time: execution/performance speedup relative to baseline | n=12; 8.27× speedup; 0.12 |
| Larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. | positive | high | Task Completion Time: execution/performance speedup for specific benchmarks | n=1; exceeds 20× (streamcluster), approximately 10× (kmeans); 0.12 |
| Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training. | positive | medium | Innovation Output: discovery of known hardware optimization patterns by agents | n=12; 0.07 |
| The best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. | positive | high | Output Quality: origin/ranking of best designs relative to ILP candidates | n=12; 0.12 |
| These results establish agent scaling as a practical and effective axis for HLS optimization. | positive | medium | Organizational Efficiency: practical effectiveness of scaling the number of agents for HLS optimization | n=12; 0.07 |