An LLM-driven 'Agentic Architect' framework autonomously evolves microarchitecture components that match or beat state-of-the-art designs in cycle-accurate simulation. For example, an evolved prefetcher yields a 1.76x geomean IPC speedup over no prefetching, ahead of its VA/AMPM Lite seed (1.59x) and SMS (1.55x). The improvements are demonstrated only in simulation and depend on seed quality and prompt guidance.
Rapid advances in Large Language Models (LLMs) create new opportunities by enabling efficient exploration of broad, complex design spaces. This is particularly valuable in computer architecture, where performance depends on microarchitectural designs and policies drawn from vast combinatorial spaces. We introduce Agentic Architect, an agentic AI framework for computer architecture design exploration and optimization that combines LLM-driven code evolution with cycle-accurate simulation. The human architect specifies the optimization target, seed design, scoring function, simulator interface, and benchmark split, while the LLM explores implementations within these constraints. Across cache replacement, data prefetching, and branch prediction, Agentic Architect matches or exceeds state-of-the-art designs. Our best evolved cache replacement design achieves a 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x). Our evolved branch predictor achieves a 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x). Finally, our evolved prefetcher achieves a 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x). Our analysis surfaces several findings about agentic AI-driven microarchitecture design. Across evolved designs, components often correspond to known techniques; the novelty lies in how they are coordinated. The architect's role is shifting, but the human remains central. Seed quality bounds what search can achieve: evolution can refine and extend an existing mechanism, but cannot compensate for a weak foundation. Likewise, objectives, constraints, and prompt guidance affect reliability and generalization. Overall, Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization.
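The abstract states each gap between an evolved design and its reference as a percentage. As a reading of the arithmetic only (not something the authors state), those figures match the difference between the two speedups expressed in percentage points of the shared baseline, for example 1.76 - 1.59 = 0.17, rather than the ratio of the two speedups. The short Python sketch below computes both quantities from the reported numbers; the helper names `point_gap` and `relative_gain` are illustrative.

```python
# Geomean IPC speedups over each domain's baseline, as reported in the abstract:
# (evolved design, reference design).
reported = {
    "cache replacement vs Mockingjay": (1.062, 1.056),
    "branch predictor vs Hashed Perceptron seed": (1.100, 1.085),
    "prefetcher vs VA/AMPM Lite seed": (1.76, 1.59),
    "prefetcher vs SMS": (1.76, 1.55),
}


def point_gap(evolved, reference):
    """Difference in speedup, in percentage points of the shared baseline."""
    return (evolved - reference) * 100.0


def relative_gain(evolved, reference):
    """IPC gain of the evolved design relative to the reference, in percent."""
    return (evolved / reference - 1.0) * 100.0


for name, (evolved, reference) in reported.items():
    print(f"{name}: {point_gap(evolved, reference):.1f} points, "
          f"{relative_gain(evolved, reference):.1f}% relative")
```

For the prefetcher, the gap of 17 points over the seed corresponds to a relative IPC gain of about 10.7%, and the 21 points over SMS to about 13.5%. For the cache replacement and branch prediction results the two readings nearly coincide because the speedups are close to 1.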
Summary
Main Finding
Agentic Architect demonstrates that agentic AI — LLM-driven code evolution coupled with cycle-accurate simulation — can autonomously explore and optimize microarchitectural design spaces and produce policies that match or exceed human state-of-the-art across multiple domains (cache replacement, prefetching, branch prediction). The human architect’s role shifts from hand-coding policies to defining seeds, constraints, scoring, and evaluation structure; these human choices critically shape outcomes.
Key Points
- Framework and scope
- Agentic Architect is an open, modular framework pairing an evolutionary agent that uses LLMs to mutate code with cycle-accurate simulators (e.g., ChampSim) for automated evaluation.
- The framework is simulator-agnostic and model-agnostic (supports different LLMs and evolutionary algorithms).
- Representative domains and results
- Cache replacement: the evolved policy yields a 1.062× geomean IPC speedup over LRU, +0.6% over Mockingjay (Mockingjay = 1.056× over LRU).
- Branch prediction: the evolved predictor yields a 1.100× geomean IPC speedup over Bimodal, +1.5% over its Hashed Perceptron seed (seed = 1.085×), with up to a 39% reduction in mispredictions on the most sensitive trace.
- Prefetching: the evolved prefetcher yields a 1.76× geomean IPC speedup over no prefetching, +17% over its VA/AMPM Lite seed (1.59×) and +21% over the reference SOTA prefetcher (SMS, 1.55×).
- Evolution and engineering choices that matter
- Seed quality bounds search: evolution refines/extends strong seeds but cannot compensate for poor foundations.
- Scoring function matters: composite scores that balance IPC with domain-specific penalties reduce artifacts and improve generalization.
- Trace selection controls generalization: diverse training traces reduce overfitting; narrow traces risk brittle policies.
- Prompt strategy > LLM choice: minimal problem-focused prompts outperform highly prescriptive prompts; choice of LLM is less decisive.
- Practical safeguards: compilation-gated evolution (discard candidates that fail to compile) and simulation timeouts (terminate expensive candidates) are essential for search productivity.
- Qualitative finding about novelty
- Evolved designs largely recombine known techniques; novelty typically arises in how components and policies are coordinated rather than inventing totally new primitives.
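The point above about composite scoring can be made concrete with a small sketch. The summary does not give the exact formula, so everything here is an assumption for illustration: the score is a geomean IPC speedup over a baseline minus a weighted mean penalty, and `composite_fitness`, `penalty_weight`, and the penalty values are hypothetical names and numbers.

```python
import math


def geomean(values):
    """Geometric mean of positive values (each trace weighted equally)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def composite_fitness(candidate_ipc, baseline_ipc, penalties, penalty_weight=0.1):
    """Geomean IPC speedup over the baseline minus a weighted mean penalty.

    candidate_ipc, baseline_ipc: dicts mapping trace name -> IPC.
    penalties: dict mapping trace name -> non-negative domain penalty
               (hypothetical, e.g. normalized excess memory traffic).
    """
    traces = sorted(baseline_ipc)
    speedups = [candidate_ipc[t] / baseline_ipc[t] for t in traces]
    mean_penalty = sum(penalties[t] for t in traces) / len(traces)
    return geomean(speedups) - penalty_weight * mean_penalty


# Toy numbers, not results from the paper.
baseline = {"trace_a": 1.0, "trace_b": 0.5}
candidate = {"trace_a": 1.2, "trace_b": 0.6}
penalties = {"trace_a": 0.0, "trace_b": 0.2}
score = composite_fitness(candidate, baseline, penalties)
print(f"composite fitness: {score:.3f}")
```

Using a geomean keeps one very sensitive trace from dominating the score, and the penalty term gives the search a reason to avoid designs that raise IPC through side effects the architect considers undesirable.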
Data & Methods
- Evaluation infrastructure
- Cycle-accurate microarchitectural simulators (ChampSim used in experiments; framework supports others).
- Benchmarks / traces: standard suites (SPEC CPU2006 and SPEC CPU2017 are mentioned) with an explicit training / evaluation split to test generalization.
- Evolutionary loop
- Iterative process: select parent -> prompt LLM to mutate -> compile candidate -> simulate on training traces -> collect per-trace metrics -> aggregate into scalar fitness -> update population.
- Compilation-gated: compile failures receive sentinel fitness and compiler feedback is returned to LLM for subsequent iterations.
- Simulation timeouts enforce practical runtime budgets for candidate evaluation.
- Interfaces & constraints
- Domain hooks constrain candidate structure (e.g., replacement policies must implement find_victim and update_replacement_state; branch predictors/prefetchers have their own required hooks).
- The system prompt communicates the simulator interface, constraints, and resource budgets to the LLM.
- Evolutionary frameworks and LLMs
- Plug-in evolutionary backends (OpenEvolve, AdaEvolve) and LLMs (examples cited: Gemini, Codex, Opus) were used as interchangeable components to evaluate sensitivity.
- Metrics
- Primary metric: IPC (end-to-end performance); secondary metrics: MPKI, misprediction rates, cache misses, etc.
- Evaluator aggregates per-trace metrics (equal weighting in experiments) into composite fitness scores (performance + penalty terms).
- Seeds and baselines
- Seeds used include existing strong designs: Mockingjay (cache replacement), VA/AMPM Lite (prefetch), Hashed Perceptron (branch predictor). Baselines include LRU and other classical predictors. Reference SOTA prefetcher: SMS.
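The evolutionary loop described above (select parent, mutate, compile, simulate, aggregate, update) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's code: the LLM call, the compiler, and the simulator are passed in as plain functions, the toy stand-ins at the bottom are hypothetical, and parent selection is a simple two-way tournament chosen for brevity.

```python
import random

# Fitness given to candidates that fail to compile or exceed the time budget.
SENTINEL = float("-inf")


def evaluate(source, compile_fn, simulate_fn, traces):
    """Return (fitness, feedback); compile failures and timeouts get SENTINEL."""
    ok, log = compile_fn(source)
    if not ok:
        return SENTINEL, log  # compiler output is fed back to the next mutation
    try:
        per_trace = [simulate_fn(source, trace) for trace in traces]
    except TimeoutError:
        return SENTINEL, "simulation timed out"
    return sum(per_trace) / len(per_trace), ""  # equal weight per trace


def evolve(seed, mutate_fn, compile_fn, simulate_fn, traces, iterations, rng):
    """Run the loop and return the best (fitness, source) pair found."""
    fitness, feedback = evaluate(seed, compile_fn, simulate_fn, traces)
    population = [(fitness, seed)]
    for _ in range(iterations):
        # Two-way tournament: sample up to two members, keep the fitter one.
        contenders = rng.sample(population, k=min(2, len(population)))
        _, parent = max(contenders)
        child = mutate_fn(parent, feedback)
        fitness, feedback = evaluate(child, compile_fn, simulate_fn, traces)
        population.append((fitness, child))
    return max(population)


# Hypothetical stand-ins so the sketch runs without an LLM or a simulator.
def toy_mutate(parent, feedback):
    return parent + "x"


def toy_compile(source):
    ok = len(source) % 3 != 0
    return ok, "" if ok else "error: toy compile failure"


def toy_simulate(source, trace):
    return float(len(source))


best_fitness, best_source = evolve(
    "ab", toy_mutate, toy_compile, toy_simulate,
    traces=["t1", "t2"], iterations=5, rng=random.Random(0),
)
print(best_fitness, best_source)
```

In a real setup the compile and simulate steps would typically be external processes; Python's `subprocess.run` accepts a `timeout` argument and raises `subprocess.TimeoutExpired`, which could be mapped to the sentinel fitness in the same way as `TimeoutError` here.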
Implications for AI Economics
- R&D productivity and cost structure
- Automation potential: Agentic Architect can accelerate iteration cycles in microarchitectural research, reducing human-hours per design iteration and enabling more extensive search over combinatorial spaces. This can raise marginal R&D output per engineer-hour.
- Compute & evaluation costs matter: the approach shifts cost from human labor to compute (LLM calls + many cycle-accurate simulations). Firms with access to large compute budgets or efficient simulation pipelines capture more of the value. Thus, capital intensity rises.
- Value of domain knowledge and incumbency
- Seed advantage: because seed quality bounds search outcomes, incumbents with mature IP and high-quality baseline designs retain strategic advantage — they can use agentic AI to further refine existing advantages rather than being displaced immediately.
- Architect role specialization: the highest-value human tasks shift to defining search space, scoring functions, constraint specification, and trace selection — activities with high returns to expertise and domain knowledge.
- Diffusion and democratization
- Open-source release of frameworks like Agentic Architect lowers barriers to entry for smaller teams and academic labs, potentially democratizing some parts of hardware design. However, effective use still requires domain expertise and compute resources for simulation.
- Competitive dynamics & product cycles
- Faster innovation cycles: automated exploration could shorten time-to-market for microarchitecture improvements, intensifying competitive pressure and shortening product life cycles in processor design segments where marginal IPC gains are commercially valuable.
- Platformization effect: toolchains that combine domain-specific evaluation infrastructure with agentic AI may become strategic platforms; firms that own both simulation/data and tailored scoring environments can derive disproportionate advantages.
- Labor market impacts
- Role shifts, not wholesale replacement: demand will likely decline for routine policy-implementation tasks but increase for roles that design evaluation regimes, scoring functions, and integrate evolved components into robust, silicon-feasible implementations (verification, hardware constraints, safety).
- Policy, IP, and governance considerations
- Intellectual property ambiguity: LLM-generated designs raise questions about provenance, patentability, and IP ownership, especially when the underlying LLMs were trained on public code or prior art. Organizations must design governance and compliance processes.
- Reproducibility and externalities: dependence on expensive compute and proprietary LLMs can reduce reproducibility. Open-source frameworks help but do not fully neutralize compute-access asymmetries.
- Risk of overfitting and externalities on market valuation
- Overfitting risk: without careful trace selection and scoring, evolved designs can overfit to benchmark suites, producing overstated gains that don’t generalize to real-world workloads. Investors and managers should be wary of benchmark-driven claims without robust out-of-sample validation.
- Cost-benefit for adoption: adopters must weigh expected IPC gains (and their translation to product value) against compute and integration costs. Incremental percentage-point gains in IPC can be economically meaningful in some markets (e.g., datacenter TCO) but not in others.
Limitations and open questions (economic framing)
- Marginal cost per candidate (LLM + simulation) is non-trivial; quantifying ROI requires workload-to-revenue translation for IPC gains.
- The approach favors domains with strong simulators and measurable metrics; it is less directly applicable to design problems where evaluation is costly, noisy, or requires physical fabrication.
- The long-term competitive landscape depends on who controls seeds, trace corpora (real workloads), and compute infrastructure.
Overall, Agentic Architect illustrates a near-term pattern likely to repeat across engineering domains: agentic AI amplifies exploration capacity, but human-designed search structure, domain expertise, and compute endowments determine who captures economic value.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| We introduce Agentic Architect, an agentic AI framework for computer architecture design exploration and optimization that combines LLM-driven code evolution with cycle-accurate simulation. | positive | high | innovation_output | 0.18 |
| Across cache replacement, data prefetching, and branch prediction, Agentic Architect matches or exceeds state-of-the-art designs. | positive | high | task_completion_time | 0.18 |
| Our best evolved cache replacement design achieves a 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x). | positive | high | task_completion_time | 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x); 0.18 |
| Our evolved branch predictor achieves a 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x). | positive | high | task_completion_time | 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x); 0.18 |
| Our evolved prefetcher achieves a 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x). | positive | high | task_completion_time | 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x); 0.18 |
| Across evolved designs, components often correspond to known techniques; the novelty lies in how they are coordinated. | mixed | high | innovation_output | 0.09 |
| The architect's role is shifting, but the human remains central. | mixed | high | skill_acquisition | 0.09 |
| Seed quality bounds what search can achieve: evolution can refine and extend an existing mechanism, but cannot compensate for a weak foundation. | negative | high | innovation_output | 0.18 |
| Objectives, constraints, and prompt guidance affect reliability and generalization. | mixed | high | organizational_efficiency | 0.09 |
| Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization. | positive | medium | adoption_rate | 0.02 |