An LLM-driven 'Agentic Architect' framework autonomously evolves microarchitecture components that match or beat state-of-the-art designs in cycle-accurate simulation. For example, an evolved prefetcher reaches a 1.76x geomean IPC speedup over no prefetching, ahead of its VA/AMPM Lite seed (1.59x) and SMS (1.55x). The improvements are demonstrated only in simulation and are contingent on seed quality and prompt guidance.

Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

Alexander Blasberg, Vasilis Kypriotis, Dimitrios Skarlatos · April 28, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Agentic Architect uses LLM-driven code evolution with cycle-accurate simulation to produce microarchitecture components that match or exceed state-of-the-art cache replacement, branch prediction, and prefetching designs in simulation.

Rapid advances in Large Language Models (LLMs) create new opportunities by enabling efficient exploration of broad, complex design spaces. This is particularly valuable in computer architecture, where performance depends on microarchitectural designs and policies drawn from vast combinatorial spaces. We introduce Agentic Architect, an agentic AI framework for computer architecture design exploration and optimization that combines LLM-driven code evolution with cycle-accurate simulation. The human architect specifies the optimization target, seed design, scoring function, simulator interface, and benchmark split, while the LLM explores implementations within these constraints. Across cache replacement, data prefetching, and branch prediction, Agentic Architect matches or exceeds state-of-the-art designs. Our best evolved cache replacement design achieves a 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x). Our evolved branch predictor achieves a 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x). Finally, our evolved prefetcher achieves a 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x). Our analysis surfaces several findings about agentic AI-driven microarchitecture design. Across evolved designs, components often correspond to known techniques; the novelty lies in how they are coordinated. The architect's role is shifting, but the human remains central. Seed quality bounds what search can achieve: evolution can refine and extend an existing mechanism, but cannot compensate for a weak foundation. Likewise, objectives, constraints, and prompt guidance affect reliability and generalization. Overall, Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization.

Summary

Main Finding

Agentic Architect demonstrates that agentic AI — LLM-driven code evolution coupled with cycle-accurate simulation — can autonomously explore and optimize microarchitectural design spaces and produce policies that match or exceed human state-of-the-art across multiple domains (cache replacement, prefetching, branch prediction). The human architect’s role shifts from hand-coding policies to defining seeds, constraints, scoring, and evaluation structure; these human choices critically shape outcomes.

Key Points

Framework and scope
- Agentic Architect is an open, modular framework pairing an evolutionary agent that uses LLMs to mutate code with cycle-accurate simulators (e.g., ChampSim) for automated evaluation.
- The framework is simulator-agnostic and model-agnostic (supports different LLMs and evolutionary algorithms).
Representative domains and results
- Cache replacement: evolved policy yields 1.062× geomean IPC vs LRU, and +0.6% over Mockingjay (Mockingjay = 1.056× vs LRU).
- Branch prediction: evolved predictor yields 1.100× geomean IPC vs Bimodal, and +1.5% over its Hashed Perceptron seed (seed = 1.085×); up to 39% reduction in mispredictions on the most sensitive trace.
- Prefetching: evolved prefetcher yields 1.76× geomean IPC vs no prefetching, +17% over VA/AMPM Lite seed (1.59×), and +21% over reference SOTA (SMS, 1.55×).
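
The margins quoted above (0.6%, 1.5%, 17%, 21%) equal the difference between the two speedup figures, each measured against the same baseline. The summary does not say whether the paper intends this reading or a relative gain of the evolved design over the reference design. The sketch below computes both from the reported speedups; the function names are illustrative and not taken from the paper.

```python
# Geomean IPC speedups as reported in the summary (common baseline = 1.0).
reported = {
    "cache replacement (vs Mockingjay)": {"evolved": 1.062, "reference": 1.056, "stated_pct": 0.6},
    "branch prediction (vs seed)": {"evolved": 1.100, "reference": 1.085, "stated_pct": 1.5},
    "prefetching (vs seed)": {"evolved": 1.76, "reference": 1.59, "stated_pct": 17.0},
    "prefetching (vs SMS)": {"evolved": 1.76, "reference": 1.55, "stated_pct": 21.0},
}


def point_gap(evolved, reference):
    """Difference of the two speedups, in percentage points of the common baseline."""
    return (evolved - reference) * 100.0


def relative_gain(evolved, reference):
    """Relative IPC gain of the evolved design over the reference design, in percent."""
    return (evolved / reference - 1.0) * 100.0


for name, r in reported.items():
    gap = point_gap(r["evolved"], r["reference"])
    rel = relative_gain(r["evolved"], r["reference"])
    print(f"{name}: stated {r['stated_pct']}%, point gap {gap:.1f}, relative gain {rel:.1f}%")
```

Under the relative reading, the prefetcher's 17-point gap over its seed corresponds to roughly a 10.7% IPC gain, and the 21-point gap over SMS to roughly 13.5%. For the cache and branch results the two readings nearly coincide because all speedups are close to 1.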
Evolution and engineering choices that matter
- Seed quality bounds search: evolution refines/extends strong seeds but cannot compensate for poor foundations.
- Scoring function matters: composite scores that balance IPC with domain-specific penalties reduce artifacts and improve generalization.
- Trace selection controls generalization: diverse training traces reduce overfitting; narrow traces risk brittle policies.
- Prompt strategy > LLM choice: minimal problem-focused prompts outperform highly prescriptive prompts; choice of LLM is less decisive.
- Practical safeguards: compilation-gated evolution (discard candidates that fail to compile) and simulation timeouts (terminate expensive candidates) are essential for search productivity.
Qualitative finding about novelty
- Evolved designs largely recombine known techniques; novelty typically arises in how components and policies are coordinated rather than inventing totally new primitives.

Data & Methods

Evaluation infrastructure
- Cycle-accurate microarchitectural simulators (ChampSim used in experiments; framework supports others).
- Benchmarks / traces: standard suites (SPEC 2006 / SPEC 2017 mentioned) with explicit training / evaluation split to test generalization.
Evolutionary loop
- Iterative process: select parent -> prompt LLM to mutate -> compile candidate -> simulate on training traces -> collect per-trace metrics -> aggregate into scalar fitness -> update population.
- Compilation-gated: candidates that fail to compile receive a sentinel fitness, and the compiler output is returned to the LLM as feedback for subsequent iterations.
- Simulation timeouts enforce practical runtime budgets for candidate evaluation.
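
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the LLM call, the compiler, and the simulator are replaced by toy callbacks, a "design" is a single integer, the sentinel value and tournament selection are choices made for the example, and a timeout is modeled by the simulator returning `None`.

```python
import random
from statistics import geometric_mean

SENTINEL_FITNESS = float("-inf")  # assumed sentinel value for compile failures


def evolve(seed, mutate, compile_candidate, simulate, aggregate, traces, iterations, rng):
    """Compilation-gated evolutionary loop; returns (best_fitness, best_candidate, history)."""
    population = [(aggregate([simulate(seed, t) for t in traces]), seed)]
    history = list(population)
    feedback = ""  # compiler output handed to the next mutation
    for _ in range(iterations):
        # Select a parent: best of a small random tournament (an assumed strategy).
        contenders = rng.sample(population, k=min(2, len(population)))
        parent = max(contenders, key=lambda entry: entry[0])[1]
        # Mutate: an LLM call in the real framework, a callback here.
        candidate = mutate(parent, feedback, rng)
        ok, compiler_output = compile_candidate(candidate)
        if not ok:
            # Gate: record the failure with sentinel fitness, keep it out of the parent pool.
            history.append((SENTINEL_FITNESS, candidate))
            feedback = compiler_output
            continue
        feedback = ""
        metrics = [simulate(candidate, t) for t in traces]
        if any(m is None for m in metrics):  # None models a simulation timeout
            continue
        entry = (aggregate(metrics), candidate)
        population.append(entry)
        history.append(entry)
    best_fitness, best = max(population, key=lambda entry: entry[0])
    return best_fitness, best, history


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_mutate(parent, feedback, rng):
    return parent + rng.choice([-3, -1, 1, 3])


def toy_compile(candidate):
    ok = candidate >= 0
    return ok, "" if ok else "error: negative table size"


def toy_simulate(candidate, trace_optimum):
    if candidate > 64:
        return None  # pretend the simulation exceeded its time budget
    return 1.0 / (1.0 + abs(candidate - trace_optimum))  # IPC-like score in (0, 1]


rng = random.Random(0)
best_fitness, best, history = evolve(
    seed=2, mutate=toy_mutate, compile_candidate=toy_compile, simulate=toy_simulate,
    aggregate=geometric_mean, traces=[8, 10, 12], iterations=200, rng=rng)
seed_fitness = history[0][0]
print(best, round(best_fitness, 3), best_fitness >= seed_fitness)
```

Because the seed stays in the population, the returned fitness can never fall below the seed's, which mirrors the observation that evolution refines a seed rather than replacing it.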
Interfaces & constraints
- Domain hooks constrain candidate structure (e.g., replacement policies must implement find_victim and update_replacement_state; branch predictors/prefetchers have their own required hooks).
- The system prompt communicates the simulator interface, constraints, and resource budgets to the LLM.
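
To make the hook constraint concrete, here is a small Python analogue of a replacement policy that exposes the two hooks named above. It is only a sketch: the real simulator hooks are C++ functions that take more arguments than shown, and the simplified signatures, the `LRUPolicy` class, and the `has_required_hooks` check are assumptions made for illustration.

```python
class LRUPolicy:
    """Toy LRU replacement policy exposing the two required hooks."""

    def __init__(self, num_sets, num_ways):
        self.num_ways = num_ways
        # last_used[set][way] holds the logical time of the most recent access.
        self.last_used = [[0] * num_ways for _ in range(num_sets)]
        self.clock = 0

    def find_victim(self, set_index):
        """Return the way to evict: the least recently used way in this set."""
        ages = self.last_used[set_index]
        return min(range(self.num_ways), key=lambda way: ages[way])

    def update_replacement_state(self, set_index, way, hit):
        """Record an access (hit or fill) to the given way."""
        self.clock += 1
        self.last_used[set_index][way] = self.clock


def has_required_hooks(policy, hooks=("find_victim", "update_replacement_state")):
    """Structural check a harness could run before simulating a candidate."""
    return all(callable(getattr(policy, name, None)) for name in hooks)


policy = LRUPolicy(num_sets=1, num_ways=4)
for way in (0, 1, 2, 3, 0):  # fill all four ways, then touch way 0 again
    policy.update_replacement_state(0, way, hit=False)
victim = policy.find_victim(0)
print(has_required_hooks(policy), victim)
```

Fixing the hook names in advance lets the search change a policy's internal state and logic freely while the simulator keeps calling the same entry points.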
Evolutionary frameworks and LLMs
- Plug-in evolutionary backends (OpenEvolve, AdaEvolve) and LLMs (examples cited: Gemini, Codex, Opus) were used as interchangeable components to evaluate sensitivity.
Metrics
- Primary metric: IPC (end-to-end performance); secondary metrics: MPKI, misprediction rates, cache misses, etc.
- Evaluator aggregates per-trace metrics (equal weighting in experiments) into composite fitness scores (performance + penalty terms).
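
One way to picture the aggregation step is a geometric mean of per-trace IPC speedups minus a penalty term. The summary does not give the paper's scoring formulas, so the subtractive penalty, its weight, and the IPC values below are illustrative assumptions only.

```python
from statistics import geometric_mean


def composite_fitness(candidate_ipc, baseline_ipc, penalty=0.0, penalty_weight=1.0):
    """Equal-weighted geomean IPC speedup over all traces, minus a weighted penalty.

    candidate_ipc and baseline_ipc map trace name to IPC; the penalty stands in for
    any domain-specific cost (assumed form, not taken from the paper).
    """
    speedups = [candidate_ipc[trace] / baseline_ipc[trace] for trace in baseline_ipc]
    return geometric_mean(speedups) - penalty_weight * penalty


# Made-up IPC values for two hypothetical traces.
baseline = {"trace_a": 1.0, "trace_b": 2.0}
candidate = {"trace_a": 1.21, "trace_b": 2.0}

score = composite_fitness(candidate, baseline)
penalized = composite_fitness(candidate, baseline, penalty=0.05)
print(round(score, 3), round(penalized, 3))
```

The geometric mean gives every trace equal weight in ratio terms, so a large win on one trace cannot hide a proportional loss on another as easily as it could under an arithmetic mean.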
Seeds and baselines
- Seeds used include existing strong designs: Mockingjay (cache replacement), VA/AMPM Lite (prefetch), Hashed Perceptron (branch predictor). Baselines include LRU and other classical predictors. Reference SOTA prefetcher: SMS.

Implications for AI Economics

R&D productivity and cost structure
- Automation potential: Agentic Architect can accelerate iteration cycles in microarchitectural research, reducing human-hours per design iteration and enabling more extensive search over combinatorial spaces. This can raise marginal R&D output per engineer-hour.
- Compute & evaluation costs matter: the approach shifts cost from human labor to compute (LLM calls + many cycle-accurate simulations). Firms with access to large compute budgets or efficient simulation pipelines capture more of the value. Thus, capital intensity rises.
Value of domain knowledge and incumbency
- Seed advantage: because seed quality bounds search outcomes, incumbents with mature IP and high-quality baseline designs retain strategic advantage — they can use agentic AI to further refine existing advantages rather than being displaced immediately.
- Architect role specialization: the highest-value human tasks shift to defining search space, scoring functions, constraint specification, and trace selection — activities with high returns to expertise and domain knowledge.
Diffusion and democratization
- Open-source release of frameworks like Agentic Architect lowers barriers to entry for smaller teams and academic labs, potentially democratizing some parts of hardware design. However, effective use still requires domain expertise and compute resources for simulation.
Competitive dynamics & product cycles
- Faster innovation cycles: automated exploration could shorten time-to-market for microarchitecture improvements, intensifying competitive pressure and shortening product life cycles in processor design segments where marginal IPC gains are commercially valuable.
- Platformization effect: toolchains that combine domain-specific evaluation infrastructure with agentic AI may become strategic platforms; firms that own both simulation/data and tailored scoring environments can derive disproportionate advantages.
Labor market impacts
- Role shifts, not wholesale replacement: demand will likely decline for routine policy-implementation tasks but increase for roles that design evaluation regimes, scoring functions, and integrate evolved components into robust, silicon-feasible implementations (verification, hardware constraints, safety).
Policy, IP, and governance considerations
- Intellectual property ambiguity: LLM-generated designs raise questions about provenance, patentability, and IP ownership, especially when the underlying LLMs were trained on public code or prior art. Organizations must design governance and compliance processes.
- Reproducibility and externalities: dependence on expensive compute and proprietary LLMs can reduce reproducibility. Open-source frameworks help but do not fully neutralize compute-access asymmetries.
Risk of overfitting and externalities on market valuation
- Overfitting risk: without careful trace selection and scoring, evolved designs can overfit to benchmark suites, producing overstated gains that don’t generalize to real-world workloads. Investors and managers should be wary of benchmark-driven claims without robust out-of-sample validation.
- Cost-benefit for adoption: adopters must weigh expected IPC gains (and their translation to product value) against compute and integration costs. Incremental percentage-point gains in IPC can be economically meaningful in some markets (e.g., datacenter TCO) but not in others.

Limitations and open questions (economic framing)
- Marginal cost per candidate (LLM + simulation) is non-trivial; quantifying ROI requires workload-to-revenue translation for IPC gains.
- The approach favors domains with strong simulators and measurable metrics; it is less directly applicable to design problems where evaluation is costly, noisy, or requires physical fabrication.
- The long-term competitive landscape depends on who controls seeds, trace corpora (real workloads), and compute infrastructure.

Overall, Agentic Architect illustrates a near-term pattern likely to repeat across engineering domains: agentic AI amplifies exploration capacity, but human-designed search structure, domain expertise, and compute endowments determine who captures economic value.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Results are based on cycle-accurate simulation across multiple microarchitectural components with direct comparisons to strong baselines, giving credible engineering evidence that the LLM-driven framework can produce high-performing designs. However, the evidence is limited to simulation and a constrained benchmark set, and it is seed- and prompt-dependent, so external validity to real silicon and broad workloads is untested.

Methods Rigor: medium. The study uses cycle-accurate simulators, benchmark splits, and comparisons to state-of-the-art designs and seeds (e.g., Mockingjay, Hashed Perceptron, VA/AMPM Lite, SMS), and reports geomean IPC gains. It appears to rely on engineering search experiments without formal statistical inference, with limited disclosure of LLM, compute settings, and search hyperparameters, and possibly without broad robustness checks or hardware validation.

Sample: Experimental engineering study using a cycle-accurate microarchitecture simulator and benchmark suites (benchmark split provided by the human architect) to evolve designs for cache replacement, data prefetching, and branch prediction. Seed designs included Mockingjay (cache), Hashed Perceptron (branch predictor), and VA/AMPM Lite (prefetcher), with SMS as the reference prefetcher. The evaluation reports geomean IPC improvements across the chosen benchmarks and compares to LRU, Bimodal, and other baselines. It uses an open-source LLM-driven agentic framework (Agentic Architect) to generate and evolve code/designs.

Themes: human_ai_collab, innovation

Generalizability
- Simulation-only results may not translate directly to real hardware or full-system performance.
- Limited and potentially narrow benchmark set; designs may overfit to the chosen workloads.
- Performance is seed-dependent; weak seeds limit what evolution can discover.
- Results depend on an unspecified LLM, prompts, and compute budget, which affects reproducibility.
- Findings may not generalize across different microarchitectural platforms, ISAs, or technology nodes.
- Potential for overfitting to the held-out benchmark split via extensive automated search.

Claims (10)

Claim	Direction	Confidence	Outcome	Details
We introduce Agentic Architect, an agentic AI framework for computer architecture design exploration and optimization that combines LLM-driven code evolution with cycle-accurate simulation.	positive	high	innovation_output	0.18
Across cache replacement, data prefetching, and branch prediction, Agentic Architect matches or exceeds state-of-the-art designs.	positive	high	task_completion_time	0.18
Our best evolved cache replacement design achieves a 1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x).	positive	high	task_completion_time	1.062x geomean IPC speedup over LRU, 0.6% over Mockingjay (1.056x) 0.18
Our evolved branch predictor achieves a 1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x).	positive	high	task_completion_time	1.100x geomean IPC speedup over Bimodal, 1.5% over its Hashed Perceptron seed (1.085x) 0.18
Our evolved prefetcher achieves a 1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x).	positive	high	task_completion_time	1.76x geomean IPC speedup over no prefetching, 17% over its VA/AMPM Lite seed (1.59x) and 21% over SMS (1.55x) 0.18
Across evolved designs, components often correspond to known techniques; the novelty lies in how they are coordinated.	mixed	high	innovation_output	0.09
The architect's role is shifting, but the human remains central.	mixed	high	skill_acquisition	0.09
Seed quality bounds what search can achieve: evolution can refine and extend an existing mechanism, but cannot compensate for a weak foundation.	negative	high	innovation_output	0.18
Objectives, constraints, and prompt guidance affect reliability and generalization.	mixed	high	organizational_efficiency	0.09
Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization.	positive	medium	adoption_rate	0.02