A production-aligned benchmark finds leading LLM kernel agents fail to improve real-world inference speed: the best agent is 6% slower than hardened systems while others perform markedly worse. FastKernels—covering 46 representative architectures and matching production interfaces—highlights benchmark-production misalignment as a key bottleneck to deploying agent-generated kernels.
LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only $0.94\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels.
Summary
Main Finding
FASTKERNELS is a benchmark-as-framework that closes the gap between kernel-generation research and real production inference. By deriving tasks top-down from real HuggingFace model architectures, exposing production-grade interfaces, replaying real tensors, and including multi-GPU communication kernels, FASTKERNELS makes kernel-level improvements directly evaluable inside full inference pipelines. Evaluating state-of-the-art kernel agents on FASTKERNELS reveals that production-aligned evaluation materially reduces apparent gains: the strongest agent achieved only 0.94× aggregate speedup versus production baselines (weaker agents: 0.78× and 0.53×), demonstrating that benchmark–production misalignment is a key bottleneck for translating research gains into real throughput improvements.
Code: https://github.com/Snowflake-AI-Research/fastkernels
Key Points
- Benchmark-as-framework: FASTKERNELS is both a benchmark and a minimal production-grade inference framework (OpenAI-compatible serving API, continuous batching, chunked prefills) so generated kernels execute in the same pipeline they would be deployed into.
- Top-down, real-architecture construction: Tasks are derived from 46 representative model architectures across 8 categories (dense & MoE LLMs, linear-attention & SSMs, vision, audio, video, robotics/recSys/world models). This set subsumes 96.2% (409/425) of HuggingFace Transformers architectures in terms of compute primitives.
- Compositional task hierarchy: Four levels — L1 primitives (matmul, RMSNorm, RoPE, activations, etc.), L2 fused operators, L3 layers/blocks, L4 full models — allow reuse of optimized lower-level kernels when optimizing higher-level modules.
- Production-compatible interfaces: Each task’s constructor and forward signature matches the corresponding module in state-of-the-art production libraries (e.g., vLLM, SGLang), enabling near-copy-paste deployment.
- Multi-GPU communication kernels: First kernel benchmark to include tensor-parallel collectives (all-reduce/reduce-scatter), MoE expert-parallel all-to-all dispatch/combine, and overlap kernels—capturing effects that single-GPU benchmarks miss.
- Rigorous evaluation metrics (MACROEVAL): Calibrated correctness per family, macro-averaged correctness and coverage, and geometric aggregation of throughput–latency speedups with a default blended metric. Invalid items (crash/NaN/hang/incorrect) are excluded from speedup but penalized via coverage.
- Real-tensor replay and profiling: For data-dependent ops (e.g., MoE routing), FASTKERNELS replays captured tensors; it also integrates NVIDIA Nsight Compute/Systems for profiling and MLflow for lineage tracking.
- Empirical finding: Agents that scored well on previous operator-level/synthetic benchmarks lose performance (or cause correctness/compatibility issues) in production-like evaluation; even the best current agent fell short of production baselines (0.94× aggregate, i.e., roughly 6% below parity).
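The interface-parity idea in the key points can be illustrated with a small check. This is a minimal sketch, not the benchmark's own tooling: `ReferenceRMSNorm`, `CandidateRMSNorm`, and `signature_mismatches` are hypothetical names invented for illustration, and the reference class is a stand-in rather than a real vLLM or SGLang module. It compares constructor and `forward` parameter lists using Python's standard `inspect` module.

```python
import inspect


class ReferenceRMSNorm:
    """Hypothetical stand-in for a production module (not a real vLLM/SGLang class)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        self.hidden_size = hidden_size
        self.eps = eps

    def forward(self, hidden_states, residual=None):
        raise NotImplementedError


class CandidateRMSNorm:
    """Hypothetical agent-generated replacement whose forward signature has drifted."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        self.hidden_size = hidden_size
        self.eps = eps

    def forward(self, x):  # renamed argument and dropped `residual`
        raise NotImplementedError


def signature_mismatches(reference: type, candidate: type) -> list[str]:
    """Return the methods whose parameter lists differ between the two classes."""
    mismatches = []
    for method in ("__init__", "forward"):
        ref_sig = inspect.signature(getattr(reference, method))
        cand_sig = inspect.signature(getattr(candidate, method))
        # Compare name, kind, and default of every parameter, in order.
        ref_view = [(p.name, p.kind, p.default) for p in ref_sig.parameters.values()]
        cand_view = [(p.name, p.kind, p.default) for p in cand_sig.parameters.values()]
        if ref_view != cand_view:
            mismatches.append(method)
    return mismatches


print(signature_mismatches(ReferenceRMSNorm, CandidateRMSNorm))
```

A check like this only catches signature drift; it says nothing about tensor shapes, dtypes, or numerical behavior, which the benchmark audits separately.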
Data & Methods
- Task construction:
- Load HuggingFace configuration and architecture definitions; walk the forward pass and inline model constants to generate standalone tasks.
- Audit each task to ensure semantic correctness, matching tensor shapes/dtypes, real execution behavior, and interface parity with production modules.
- No synthetic tasks: every task corresponds to an operation run in a real model.
- Coverage:
- 46 end-to-end L4 architectures across 8 categories; underlying L1–L3 kernels chosen to minimize redundancy.
- Audit across 425 HF modeling files: 409/425 covered (96.2%); 5 architectures require a new kernel, 2 require external libs.
- Task hierarchy:
- Level 1: primitive operators (attention variants, normalizations, activations, quantize/dequantize, RoPE, etc.)
- Level 2: fused operators (residual+norm+quant, attention+proj, MoE gate+dispatch+expert)
- Level 3: layers/blocks (transformer decoder, SSM blocks, MoE layers)
- Level 4: full end-to-end models
- Benchmark tiers:
- Tier 1: Kernel-level tests with an input registry derived from full workloads; compare outputs and runtimes; replay real tensors for data-dependent ops.
- Tier 2: End-to-end model runs measuring throughput/latency/serving behavior.
- Tier 3: Standardized evaluation sweeps (fixed models, TP configs, workloads) used for leaderboard.
- MACROEVAL specifics:
- Per-family discrepancy $D_f$ and calibrated correctness mapping $C_{i,r}$ with thresholds $g_i$ (numerical nondeterminism based on atol/rtol per dtype) and $f_i$ (quality-cliff threshold set per family).
- Macro-averaging across families prevents LLMs or other dominant families from skewing results.
- Throughput and latency blended via a geometric combination with default λ = 0.5; family-level and macro geometric means aggregate speedups.
- $\text{Score}_{\text{default}} = S_{\text{macro}} \cdot C_{\text{macro}} \cdot \text{Coverage}_{\text{macro}}$ (product of macro speedup, macro correctness, and macro coverage).
- Tooling:
- Nsight Compute and Nsight Systems integrated for kernel and trace-level profiling.
- MLflow logging for lineage and benchmark history.
- Empirical evaluation:
- FASTKERNELS itself runs at parity with hardened systems (vLLM, SGLang) for mainstream LLM serving and outperforms upstream references on some under-served architectures.
- State-of-the-art kernel agents benchmarked on FASTKERNELS: the strongest agent reaches 0.94× aggregate speedup (about 6% below the production baseline); the other agents reach 0.78× and 0.53×.
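The MACROEVAL aggregation described above can be sketched in a few lines. This is an illustrative reading, not the benchmark's implementation. Several details are assumptions not given in the summary: the linear ramp between the two correctness thresholds, the exact form of the throughput-latency blend (a weighted geometric mean with weight `lam`), arithmetic averaging for correctness and coverage, and all function and key names.

```python
import math


def calibrated_correctness(discrepancy: float, g: float, f: float) -> float:
    """Map a discrepancy to [0, 1]: 1 at or below g, 0 at or above f.

    The linear ramp in between is an assumption; the source only names the thresholds.
    """
    if discrepancy <= g:
        return 1.0
    if discrepancy >= f:
        return 0.0
    return (f - discrepancy) / (f - g)


def blended_speedup(tput: float, lat: float, lam: float = 0.5) -> float:
    """Weighted geometric blend of throughput and latency speedups (assumed form)."""
    return tput ** lam * lat ** (1.0 - lam)


def geometric_mean(values: list[float]) -> float:
    return math.exp(sum(math.log(v) for v in values) / len(values))


def macro_score(families: dict[str, list[dict]]) -> float:
    """Macro speedup * macro correctness * macro coverage.

    Invalid items are excluded from the speedup but lower their family's coverage.
    """
    speedups, correctness, coverage = [], [], []
    for items in families.values():
        valid = [item for item in items if item["valid"]]
        coverage.append(len(valid) / len(items))
        if not valid:
            continue  # no speedup to average; the family still counts with coverage 0
        speedups.append(
            geometric_mean([blended_speedup(i["tput"], i["lat"]) for i in valid])
        )
        correctness.append(sum(i["correctness"] for i in valid) / len(valid))
    if not speedups:
        return 0.0
    s_macro = geometric_mean(speedups)  # each family weighs the same
    c_macro = sum(correctness) / len(correctness)
    coverage_macro = sum(coverage) / len(coverage)
    return s_macro * c_macro * coverage_macro


# Toy data: one family has a crashed item, the other a single fast kernel.
families = {
    "dense_llm": [
        {"valid": True, "tput": 1.0, "lat": 1.0,
         "correctness": calibrated_correctness(1e-4, g=1e-3, f=1e-1)},
        {"valid": False},
    ],
    "vision": [
        {"valid": True, "tput": 4.0, "lat": 1.0,
         "correctness": calibrated_correctness(5e-4, g=1e-3, f=1e-1)},
    ],
}
score = macro_score(families)
print(round(score, 4))
```

In the toy data, the crashed item does not drag down the speedup of the `dense_llm` family, but it halves that family's coverage, so the final score is still penalized.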
Implications for AI Economics
- More realistic benchmarks change where value accrues. Improvements measured in synthetic, single-GPU sandboxes may not generate economic value if they break integration, correctness, or multi-GPU schedules. FASTKERNELS shifts research incentives toward deployable optimizations that produce measurable production throughput and latency benefits.
- Reduced wasted engineering and compute spend. By surfacing integration and communication issues early (via module-compatible interfaces, multi-GPU kernels, and real-tensor replay), teams save time and cloud/GPU costs that would otherwise be spent iterating on non-deployable kernels.
- Better ROI signal for kernel-generation agents. With production-aligned metrics (MACROEVAL), investors, engineering managers, and product owners can more accurately assess whether improvements from LLM-based kernel agents will translate into lower serving costs or higher capacity for the same hardware.
- Labor and tooling economics:
- Demand shifts from isolated kernel-tuning expertise toward tooling and agent-integration engineers who can validate, deploy, and monitor generated kernels in production pipelines.
- Open, production-aligned benchmarks lower adoption friction for automated kernel-generation tools and can accelerate commoditization of deployment tooling (copy-paste module insertion).
- Competitive dynamics and standards:
- Benchmarks that directly map to production behavior favor vendors and tools that integrate seamlessly with existing serving frameworks (vLLM, SGLang). Standardizing module-level interfaces and evaluation may reduce switching costs and create clearer market leaders in inference stacks.
- Risk management and reliability:
- The calibrated correctness and coverage framing highlights that faster kernels are economically worthless (or harmful) if they cause correctness regressions; procurement and deployment decisions should weigh coverage/correctness metrics as heavily as raw speedups.
- Policy and procurement:
- Procurement decisions at scale (cloud providers, large enterprises) can use production-aligned benchmarks to justify investments in kernel-agent tooling that demonstrably reduce per-inference costs under real workloads and multi-GPU setups.
- Research prioritization:
- Funding and teams should emphasize multi-GPU-aware, interface-compatible optimization techniques and end-to-end validation rather than chasing isolated operator speed-of-light bounds that do not consider deployment constraints.
Caveats and limitations
- FASTKERNELS targets broad but not complete coverage; a small number of HuggingFace architectures still require new kernels or external libraries.
- The benchmark’s production parity depends on the chosen production baselines (vLLM, SGLang, etc.); different production stacks might yield different relative outcomes.
- While FASTKERNELS reduces reward-hacking vectors inherent to synthetic benchmarks, agents could still overfit to the benchmark’s fixed interfaces and workloads if not diversified over time.
Overall, FASTKERNELS provides a pragmatic, deployment-focused evaluation surface that should improve the economic relevance of kernel-generation research by aligning evaluation metrics with what matters for production throughput, latency, correctness, and deployability.
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. | Other | negative | high | benchmark-production alignment | 0.18 |
| Agents trained on such benchmarks learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. | Error Rate | negative | medium | error_rate (correctness degradation) and integration compatibility | 0.11 |
| FastKernels is built around a minimal set of 46 representative architectures spanning 8 categories. | Other | positive | high | benchmark coverage (number of representative architectures and categories) | n=46; 46 representative architectures spanning 8 categories; 0.3 |
| The FastKernels kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. | Other | positive | high | architecture coverage (proportion subsumed) | n=425; 96.2% (409/425); 0.3 |
| FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving. | Firm Productivity | positive | medium | inference throughput / runtime performance (parity vs. vLLM and SGLang) | parity with vLLM and SGLang; 0.11 |
| FastKernels substantially exceeds upstream references on under-served architectures. | Firm Productivity | positive | medium | inference performance relative to upstream references on under-served architectures | 0.05 |
| Evaluating state-of-the-art kernel agents on FastKernels, the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×. | Firm Productivity | negative | high | aggregate runtime speedup relative to production baselines | n=46; 0.94× (strongest agent), 0.78×, 0.53× (weaker agents); 0.3 |
| FastKernels is released (code available) as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. | Adoption Rate | positive | high | availability of code / reproducibility (release of benchmark and framework) | https://github.com/Snowflake-AI-Research/fastkernels; 0.09 |