A production-aligned benchmark finds leading LLM kernel agents fail to improve real-world inference speed: the best agent is 6% slower than hardened systems while others perform markedly worse. FastKernels—covering 46 representative architectures and matching production interfaces—highlights benchmark-production misalignment as a key bottleneck to deploying agent-generated kernels.

FastKernels: Benchmarking GPU Kernel Generation in Production

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari · May 22, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

FastKernels is a production-aligned kernel benchmark and minimal inference framework showing that current LLM kernel-generation agents rarely translate sandbox gains into production throughput, with the best agent achieving only 0.94× relative to hardened baselines.

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

Summary

Main Finding

FASTKERNELS is a benchmark-as-framework that closes the gap between kernel-generation research and real production inference. By deriving tasks top-down from real HuggingFace model architectures, exposing production-grade interfaces, replaying real tensors, and including multi-GPU communication kernels, FASTKERNELS makes kernel-level improvements directly evaluable inside full inference pipelines. Evaluating state-of-the-art kernel agents on FASTKERNELS reveals that production-aligned evaluation materially reduces apparent gains: the strongest agent achieved only 0.94× aggregate speedup versus production baselines (weaker agents: 0.78× and 0.53×), demonstrating that benchmark–production misalignment is a key bottleneck for translating research gains into real throughput improvements.

Code: https://github.com/Snowflake-AI-Research/fastkernels

Key Points

Benchmark-as-framework: FASTKERNELS is both a benchmark and a minimal production-grade inference framework (OpenAI-compatible serving API, continuous batching, chunked prefills) so generated kernels execute in the same pipeline they would be deployed into.
Top-down, real-architecture construction: Tasks are derived from 46 representative model architectures across 8 categories (dense & MoE LLMs, linear-attention & SSMs, vision, audio, video, robotics/recSys/world models). This set subsumes 96.2% (409/425) of HuggingFace Transformers architectures in terms of compute primitives.
Compositional task hierarchy: Four levels — L1 primitives (matmul, RMSNorm, RoPE, activations, etc.), L2 fused operators, L3 layers/blocks, L4 full models — allow reuse of optimized lower-level kernels when optimizing higher-level modules.
Production-compatible interfaces: Each task’s constructor and forward signature matches the corresponding module in state-of-the-art production libraries (e.g., vLLM, SGLang), enabling near-copy-paste deployment.
Multi-GPU communication kernels: First kernel benchmark to include tensor-parallel collectives (all-reduce/reduce-scatter), MoE expert-parallel all-to-all dispatch/combine, and overlap kernels—capturing effects that single-GPU benchmarks miss.
Rigorous evaluation metrics (MACROEVAL): Calibrated correctness per family, macro-averaged correctness and coverage, and geometric aggregation of throughput–latency speedups with a default blended metric. Invalid items (crash/NaN/hang/incorrect) are excluded from speedup but penalized via coverage.
Real-tensor replay and profiling: For data-dependent ops (e.g., MoE routing) FASTKERNELS replays captured tensors; integrates NVIDIA Nsight Compute/Systems and MLflow for profiling and lineage tracking.
Empirical finding: Agents that scored well on previous operator-level/synthetic benchmarks lose performance (or cause correctness/compatibility issues) in production-like evaluation; even the best current agent fell short of production baselines (0.94× aggregate speedup, i.e., below parity).
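The interface-parity point above (matching constructor and forward signatures) can be illustrated with a minimal sketch. The two classes and their parameters below are hypothetical stand-ins, not the actual vLLM, SGLang, or FastKernels modules; the only real API used is Python's standard `inspect.signature`.

```python
import inspect


class ReferenceRMSNorm:
    """Hypothetical stand-in for a production module (signature is assumed)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        self.hidden_size = hidden_size
        self.eps = eps

    def forward(self, x, residual=None):
        raise NotImplementedError


class CandidateRMSNorm:
    """Hypothetical stand-in for an agent-generated replacement."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        self.hidden_size = hidden_size
        self.eps = eps

    def forward(self, x, residual=None):
        raise NotImplementedError


def same_interface(reference: type, candidate: type) -> bool:
    """True if constructor and forward signatures match exactly.

    Signature equality covers parameter names, order, kinds, defaults,
    and annotations.
    """
    for name in ("__init__", "forward"):
        ref_sig = inspect.signature(getattr(reference, name))
        cand_sig = inspect.signature(getattr(candidate, name))
        if ref_sig != cand_sig:
            return False
    return True


print(same_interface(ReferenceRMSNorm, CandidateRMSNorm))  # True
```

A check of this kind only establishes that a candidate can be swapped in without changing call sites; it says nothing about numerical correctness or speed, which need separate tests.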

Data & Methods

Task construction:
- Load HuggingFace configuration and architecture definitions; walk forward pass and inline model constants to generate standalone tasks.
- Audit each task to ensure semantic correctness, matching tensor shapes/dtypes, real execution behavior, and interface parity with production modules.
- No synthetic tasks: every task corresponds to an operation run in a real model.
Coverage:
- 46 end-to-end L4 architectures across 8 categories; underlying L1–L3 kernels chosen to minimize redundancy.
- Audit across 425 HF modeling files: 409/425 covered (96.2%); 5 architectures require a new kernel, 2 require external libs.
Task hierarchy:
- Level 1: primitive operators (attention variants, normalizations, activations, quantize/dequantize, RoPE, etc.)
- Level 2: fused operators (residual+norm+quant, attention+proj, MoE gate+dispatch+expert)
- Level 3: layers/blocks (transformer decoder, SSM blocks, MoE layers)
- Level 4: full end-to-end models
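The compositional hierarchy above can be pictured as a tree in which each task is built from tasks at strictly lower levels. The sketch below is illustrative only: the `Task` class, the task names, and the specific composition are assumptions made for the example, not FastKernels' actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task node: level 1 (primitive) through 4 (full model)."""

    name: str
    level: int
    children: list["Task"] = field(default_factory=list)

    def __post_init__(self):
        # A task may only be composed from strictly lower-level tasks.
        for child in self.children:
            if child.level >= self.level:
                raise ValueError(f"{child.name} cannot be a child of {self.name}")


def primitives(task: Task) -> set[str]:
    """Names of all Level 1 tasks reachable from `task`."""
    if task.level == 1:
        return {task.name}
    found: set[str] = set()
    for child in task.children:
        found |= primitives(child)
    return found


# Illustrative composition (names are made up for the example).
matmul = Task("matmul", 1)
rope = Task("rope", 1)
rms_norm = Task("rms_norm", 1)
attention_proj = Task("attention_proj", 2, [matmul, rope])
decoder_layer = Task("decoder_layer", 3, [attention_proj, rms_norm])
model = Task("full_model", 4, [decoder_layer])

print(sorted(primitives(model)))  # ['matmul', 'rms_norm', 'rope']
```

Under this view, an optimized Level 1 kernel is reusable by every higher-level task whose subtree contains it.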
Benchmark tiers:
- Tier 1: Kernel-level tests with an input registry derived from full workloads; compare outputs and runtimes; replay real tensors for data-dependent ops.
- Tier 2: End-to-end model runs measuring throughput/latency/serving behavior.
- Tier 3: Standardized evaluation sweeps (fixed models, TP configs, workloads) used for leaderboard.
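A Tier 1 style check (replay registered inputs, compare outputs, compare runtimes) might look roughly like the sketch below. It is a CPU-only, pure-Python illustration under stated assumptions: the function names, the registry layout, and the tolerance values are hypothetical, and the tolerance rule `|out - ref| <= atol + rtol * |ref|` is the common NumPy-style convention rather than a confirmed detail of the benchmark.

```python
import math
import time


def rms_norm_reference(x, eps=1e-6):
    """Reference RMSNorm over a list of floats (no learned weight)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def rms_norm_candidate(x, eps=1e-6):
    """Stand-in for a generated kernel; a different but equivalent formulation."""
    inv = (math.fsum(v * v for v in x) / len(x) + eps) ** -0.5
    return [v * inv for v in x]


def outputs_match(ref, out, atol=1e-6, rtol=1e-5):
    """Elementwise tolerance check; NaN or inf in the output counts as a failure."""
    return len(ref) == len(out) and all(
        math.isfinite(o) and abs(o - r) <= atol + rtol * abs(r)
        for r, o in zip(ref, out)
    )


def median_runtime(fn, x, repeats=5):
    """Median wall-clock time of fn(x) over a few repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        samples.append(time.perf_counter() - start)
    return max(sorted(samples)[len(samples) // 2], 1e-12)


def run_tier1(reference, candidate, input_registry):
    """Replay each registered input; report correctness and speedup."""
    results = []
    for name, x in input_registry.items():
        ok = outputs_match(reference(x), candidate(x))
        speedup = None
        if ok:  # speedup is only meaningful for correct outputs
            speedup = median_runtime(reference, x) / median_runtime(candidate, x)
        results.append({"input": name, "correct": ok, "speedup": speedup})
    return results


# Hypothetical registry of inputs captured from a real workload.
registry = {
    "decode_step": [0.1 * i for i in range(1, 257)],
    "prefill_chunk": [math.sin(i) for i in range(1024)],
}
for row in run_tier1(rms_norm_reference, rms_norm_candidate, registry):
    print(row["input"], row["correct"])
```

Timing real GPU kernels additionally requires device synchronization before reading the clock, since kernel launches are asynchronous; the wall-clock timing here is only a placeholder for that step.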
MACROEVAL specifics:
- Per-family discrepancy D_f and calibrated correctness mapping C_{i,r} with thresholds g_i (numerical nondeterminism based on atol/rtol per dtype) and f_i (quality-cliff threshold set per family).
- Macro-averaging across families prevents LLMs or other dominant families from skewing results.
- Throughput and latency blended via a geometric combination with default λ = 0.5; family-level and macro geometric means aggregate speedups.
- Score_default = S_macro · C_macro · Coverage_macro (product of macro speedup, macro correctness, and macro coverage).
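The aggregation described above can be sketched as follows. Several details are assumptions, since the summary does not spell them out: the blend is taken as throughput^λ · latency^(1−λ), correctness and coverage are macro-averaged arithmetically, and a family with no valid items is skipped in the speedup mean while contributing zero correctness and zero coverage. The dictionary layout and function names are hypothetical.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def blended_speedup(throughput_speedup, latency_speedup, lam=0.5):
    """Geometric blend of the two speedups (assumed form of the blended metric)."""
    return throughput_speedup ** lam * latency_speedup ** (1.0 - lam)


def default_score(families, lam=0.5):
    """Product of macro speedup, macro correctness, and macro coverage.

    `families` maps a family name to a list of items. Invalid items
    (crash/NaN/hang/incorrect) carry only {"valid": False}: they are
    excluded from speedup but lower the family's coverage.
    """
    speedups, correctness, coverage = [], [], []
    for items in families.values():
        valid = [it for it in items if it["valid"]]
        coverage.append(len(valid) / len(items))
        if not valid:
            correctness.append(0.0)  # assumption: empty family scores zero
            continue
        speedups.append(
            geomean([blended_speedup(it["tput"], it["lat"], lam) for it in valid])
        )
        correctness.append(sum(it["correctness"] for it in valid) / len(valid))
    if not speedups:
        return 0.0
    s_macro = geomean(speedups)                     # macro speedup
    c_macro = sum(correctness) / len(correctness)   # macro correctness
    cov_macro = sum(coverage) / len(coverage)       # macro coverage
    return s_macro * c_macro * cov_macro


example = {
    "llm": [
        {"valid": True, "tput": 1.0, "lat": 1.0, "correctness": 1.0},
        {"valid": False},
    ],
    "vision": [
        {"valid": True, "tput": 4.0, "lat": 1.0, "correctness": 1.0},
    ],
}
print(round(default_score(example), 4))  # 1.0607
```

In this example the macro speedup is √2, but one invalid item out of two in the `llm` family pulls macro coverage to 0.75, so the final score is about 1.06 rather than 1.41. Averaging per family first means a family with many items does not outweigh a family with few.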
Tooling:
- Nsight Compute and Nsight Systems integrated for kernel and trace-level profiling.
- MLflow logging for lineage and benchmark history.
Empirical evaluation:
- FASTKERNELS itself runs at parity with hardened systems (vLLM, SGLang) for mainstream LLM serving and outperforms upstream references on some under-served architectures.
- State-of-the-art kernel agents benchmarked on FASTKERNELS: strongest agent 0.94× aggregate speedup (i.e., about 6% below parity with production baselines); other agents 0.78× and 0.53×.
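To make the reported figures concrete, the short calculation below converts each aggregate speedup into a percentage change and an equivalent time ratio. It assumes the aggregate behaves like a throughput ratio against the baseline; the actual metric blends throughput and latency, so this is an interpretation aid rather than the paper's definition.

```python
def throughput_change_pct(speedup: float) -> float:
    """Percent change relative to the baseline (negative means slower)."""
    return (speedup - 1.0) * 100.0


def time_ratio(speedup: float) -> float:
    """Time per unit of work relative to the baseline."""
    return 1.0 / speedup


for s in (0.94, 0.78, 0.53):
    print(s, round(throughput_change_pct(s), 1), round(time_ratio(s), 2))
# 0.94 -6.0 1.06
# 0.78 -22.0 1.28
# 0.53 -47.0 1.89
```

Read this way, the strongest agent delivers about 6% less work per unit time than the production baseline, and the weakest needs close to twice as long for the same work.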

Implications for AI Economics

More realistic benchmarks change where value accrues. Improvements measured in synthetic, single-GPU sandboxes may not generate economic value if they break integration, correctness, or multi-GPU schedules. FASTKERNELS shifts research incentives toward deployable optimizations that produce measurable production throughput and latency benefits.
Reduced wasted engineering and compute spend. By surfacing integration and communication issues early (via module-compatible interfaces, multi-GPU kernels, and real-tensor replay), teams save time and cloud/GPU costs that would otherwise be spent iterating on non-deployable kernels.
Better ROI signal for kernel-generation agents. With production-aligned metrics (MACROEVAL), investors, engineering managers, and product owners can more accurately assess whether improvements from LLM-based kernel agents will translate into lower serving costs or higher capacity for the same hardware.
Labor and tooling economics:
- Demand shifts from isolated kernel-tuning expertise toward tooling and agent-integration engineers who can validate, deploy, and monitor generated kernels in production pipelines.
- Open, production-aligned benchmarks lower adoption friction for automated kernel-generation tools and can accelerate commoditization of deployment tooling (copy-paste module insertion).
Competitive dynamics and standards:
- Benchmarks that directly map to production behavior favor vendors and tools that integrate seamlessly with existing serving frameworks (vLLM, SGLang). Standardizing module-level interfaces and evaluation may reduce switching costs and create clearer market leaders in inference stacks.
Risk management and reliability:
- The calibrated correctness and coverage framing highlights that faster kernels are economically worthless (or harmful) if they cause correctness regressions; procurement and deployment decisions should weigh coverage/correctness metrics as heavily as raw speedups.
Policy and procurement:
- Procurement decisions at scale (cloud providers, large enterprises) can use production-aligned benchmarks to justify investments in kernel-agent tooling that demonstrably reduce per-inference costs under real workloads and multi-GPU setups.
Research prioritization:
- Funding and teams should emphasize multi-GPU-aware, interface-compatible optimization techniques and end-to-end validation rather than chasing isolated operator speed-of-light bounds that do not consider deployment constraints.

Caveats and limitations:
- FASTKERNELS targets broad but not complete coverage; a small number of HuggingFace architectures still require new kernels or external libraries.
- The benchmark’s production parity depends on the chosen production baselines (vLLM, SGLang, etc.); different production stacks might yield different relative outcomes.
- While FASTKERNELS reduces reward-hacking vectors inherent to synthetic benchmarks, agents could still overfit to the benchmark’s fixed interfaces and workloads if not diversified over time.

Overall, FASTKERNELS provides a pragmatic, deployment-focused evaluation surface that should improve the economic relevance of kernel-generation research by aligning evaluation metrics with what matters for production throughput, latency, correctness, and deployability.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents extensive engineering evidence: a curated benchmark of 46 representative architectures (covering 96.2% of HuggingFace Transformer variants) and direct throughput comparisons against hardened production frameworks. This gives a credible, production-oriented signal. However, the evidence is limited to the chosen architectures, hardware and compilation stacks, and a specific set of kernel agents, so results may not generalize to all models, platforms, or future agent designs.
Methods Rigor: medium. The authors construct a carefully chosen minimal benchmark, align task interfaces with production library modules, and perform head-to-head throughput evaluations versus real production baselines (vLLM, SGLang). This is strong systems methodology. Missing or unclear elements (from the abstract) include details on hardware variation, statistical treatment of measurement noise, the selection and number of agent runs, and whether comparisons were blinded or repeated across diverse environments, all of which temper the rigor rating.
Sample: A minimal benchmark suite called FastKernels comprising 46 representative architectures across 8 categories that collectively subsume 409 of 425 (96.2%) HuggingFace Transformers architectures; evaluation compares outputs of state-of-the-art LLM-kernel-generating agents against production inference frameworks (e.g., vLLM, SGLang) on mainstream GPU inference serving and several under-served architectures, reporting aggregate relative throughput (best agent 0.94×, others 0.78× and 0.53×); code and benchmark are publicly released.
Themes: productivity, adoption
Generalizability:
- Focused on GPU kernel generation and inference; does not cover TPUs, CPUs, or other accelerators.
- Benchmark tailored to transformer-style architectures; may not generalize to non-transformer models or novel architectures.
- Evaluations limited to the particular production frameworks, compilation stacks, and hardware used; different stacks or driver/firmware versions may yield different results.
- Only a subset of kernel-generating agents were tested; future agents or training regimes might perform differently.
- Throughput-focused metrics may not capture other production concerns (latency tail, memory footprint, power, or implementation/maintenance cost).

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones.	Other	negative	high	benchmark-production alignment	0.18
Agents trained on such benchmarks learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.	Error Rate	negative	medium	error_rate (correctness degradation) and integration compatibility	0.11
FastKernels is built around a minimal set of 46 representative architectures spanning 8 categories.	Other	positive	high	benchmark coverage (number of representative architectures and categories)	n=46 46 representative architectures spanning 8 categories 0.3
The FastKernels kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures.	Other	positive	high	architecture coverage (proportion subsumed)	n=425 96.2% (409/425) 0.3
FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving.	Firm Productivity	positive	medium	inference throughput / runtime performance (parity vs. vLLM and SGLang)	parity with vLLM and SGLang 0.11
FastKernels substantially exceeds upstream references on under-served architectures.	Firm Productivity	positive	medium	inference performance relative to upstream references on under-served architectures	0.05
Evaluating state-of-the-art kernel agents on FastKernels, the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.	Firm Productivity	negative	high	aggregate runtime speedup relative to production baselines	n=46 0.94× (strongest agent), 0.78×, 0.53× (weaker agents) 0.3
FastKernels is released (code available) as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements.	Adoption Rate	positive	high	availability of code / reproducibility (release of benchmark and framework)	https://github.com/Snowflake-AI-Research/fastkernels 0.09