Commit 5e2548d
Merge pull request #21 from ScalingIntelligence/kernelbench-arxiv
add kernelbench arxiv materials
2 parents: 1b769d2 + bd42ced
7 files changed: +44, -2 lines

_data/people.yml
Lines changed: 0 additions & 2 deletions

@@ -76,13 +76,11 @@ anneouyang:
   name: Anne Ouyang
   url: https://anneouyang.com/
   title: Rotating PhD Student
-  not_current: True

 simonguo:
   name: Simon Guo
   url: https://simonguo.tech/
   title: Rotating PhD Student
-  not_current: True

 # Alumni

_data/tags.yml
Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@
 - system
 - model
 - dataset
+- benchmark
 - empirical study
 # domain
 - natural language processing

_pubs/fast.md
Lines changed: 1 addition & 0 deletions

@@ -23,6 +23,7 @@ tags:
 - natural language processing
 - generative ai
 - highlight
+- ml systems
 teaser: Explore the cutting-edge Full-stack Accelerator Search Technique (FAST), a game-changing framework designed to optimize hardware accelerators for today's dynamic deep learning demands. This innovative approach fine-tunes every aspect of the hardware-software stack, from datapath design to software scheduling and compiler optimizations. By targeting bottlenecks in leading models like EfficientNet and BERT, FAST creates accelerators that deliver up to 3.7× better performance per watt compared to TPU-v3 for single workloads, and 2.4× better for a range of tasks. Discover how FAST can revolutionize datacenter efficiency and performance.
 materials:
 - name: A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

_pubs/hydragen.md
Lines changed: 1 addition & 0 deletions

@@ -21,6 +21,7 @@ doi: 10.48550/arXiv.2402.05099
 tags:
 - machine learning
 - generative ai
+- ml systems
 teaser: Hydragen also enables the use of very long shared contexts with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
 materials:
 - name: Hydragen

_pubs/kernelbench.md
Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
+---
+title: 'KernelBench: Can LLMs Write Efficient GPU Kernels?'
+authors:
+  - key: anneouyang
+    equal: true
+  - key: simonguo
+    equal: true
+  - name: Simran Arora
+    affiliation: Stanford
+  - name: Alex L. Zhang
+    affiliation: Princeton
+  - name: William Hu
+    affiliation: Stanford
+  - name: Christopher Ré
+    affiliation: Stanford
+  - key: azaliamirhoseini
+venue: preprint
+year: 2025
+date: 2025-02-18
+has_pdf: true
+doi: 10.48550/arXiv.2502.10517
+tags:
+  - benchmark
+  - generative ai
+  - ml systems
+teaser: KernelBench is a benchmark and environment for evaluating language models' ability to generate efficient GPU kernels.
+materials:
+  - name: Paper
+    url: https://arxiv.org/abs/2502.10517
+    type: file-pdf
+  - name: KernelBench Codebase
+    url: https://github.com/ScalingIntelligence/KernelBench
+    type: code
+  - name: KernelBench Dataset
+    url: https://huggingface.co/datasets/ScalingIntelligence/KernelBench
+    type: database
+  - name: Blog post
+    url: /blogs/kernelbench/
+    type: link
+---
+Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.
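The fast_p metric described in the abstract can be sketched as follows. This is a minimal illustration, not code from the KernelBench repository: it assumes each generated kernel is summarized as a (correct, speedup) pair, where speedup is baseline time divided by kernel time.

```python
def fast_p(results, p=1.0):
    """Fraction of generated kernels that are functionally correct AND
    achieve a speedup greater than threshold p over the baseline.

    results: list of (correct: bool, speedup: float) pairs, where
    speedup = baseline_time / kernel_time for that task.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Example: 4 generated kernels; only correct kernels faster than baseline count.
sample = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
print(fast_p(sample, p=1.0))  # 0.5
```

Raising p tightens the criterion, which matches the abstract's note that the benchmark gets harder as the speedup threshold increases; at p = 1.0 the metric reduces to "correct and faster than the PyTorch baseline".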

imgs/teasers/kernelbench.png
657 KB (binary file not shown)

pubs/kernelbench.pdf
2.92 MB (binary file not shown)

0 commit comments