Commit 5e2548d
Merge pull request #21 from ScalingIntelligence/kernelbench-arxiv
add kernelbench arxiv materials
2 parents: 1b769d2 + bd42ced
7 files changed: +44, -2 lines

_data/people.yml
Lines changed: 0 additions & 2 deletions

@@ -76,13 +76,11 @@ anneouyang:
   name: Anne Ouyang
   url: https://anneouyang.com/
   title: Rotating PhD Student
-  not_current: True

 simonguo:
   name: Simon Guo
   url: https://simonguo.tech/
   title: Rotating PhD Student
-  not_current: True

 # Alumni

_data/tags.yml
Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@
 - system
 - model
 - dataset
+- benchmark
 - empirical study
 # domain
 - natural language processing

_pubs/fast.md
Lines changed: 1 addition & 0 deletions

@@ -23,6 +23,7 @@ tags:
 - natural language processing
 - generative ai
 - highlight
+- ml systems
 teaser: Explore the cutting-edge Full-stack Accelerator Search Technique (FAST), a game-changing framework designed to optimize hardware accelerators for today's dynamic deep learning demands. This innovative approach fine-tunes every aspect of the hardware-software stack, from datapath design to software scheduling and compiler optimizations. By targeting bottlenecks in leading models like EfficientNet and BERT, FAST creates accelerators that deliver up to 3.7× better performance per watt compared to TPU-v3 for single workloads, and 2.4× better for a range of tasks. Discover how FAST can revolutionize datacenter efficiency and performance.
 materials:
 - name: A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

_pubs/hydragen.md
Lines changed: 1 addition & 0 deletions

@@ -21,6 +21,7 @@ doi: 10.48550/arXiv.2402.05099
 tags:
 - machine learning
 - generative ai
+- ml systems
 teaser: Hydragen also enables the use of very long shared contexts with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
 materials:
 - name: Hydragen

_pubs/kernelbench.md
Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
+---
+title: 'KernelBench: Can LLMs Write Efficient GPU Kernels?'
+authors:
+  - key: anneouyang
+    equal: true
+  - key: simonguo
+    equal: true
+  - name: Simran Arora
+    affiliation: Stanford
+  - name: Alex L. Zhang
+    affiliation: Princeton
+  - name: William Hu
+    affiliation: Stanford
+  - name: Christopher Ré
+    affiliation: Stanford
+  - key: azaliamirhoseini
+venue: preprint
+year: 2025
+date: 2025-02-18
+has_pdf: true
+doi: 10.48550/arXiv.2502.10517
+tags:
+  - benchmark
+  - generative ai
+  - ml systems
+teaser: KernelBench is a benchmark and environment for evaluating language models' ability to generate efficient GPU kernels.
+materials:
+  - name: Paper
+    url: https://arxiv.org/abs/2502.10517
+    type: file-pdf
+  - name: KernelBench Codebase
+    url: https://github.com/ScalingIntelligence/KernelBench
+    type: code
+  - name: KernelBench Dataset
+    url: https://huggingface.co/datasets/ScalingIntelligence/KernelBench
+    type: database
+  - name: Blog post
+    url: /blogs/kernelbench/
+    type: link
+---
+Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.
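The fast_p metric described in the abstract can be sketched as follows. This is a minimal illustration, not code from the KernelBench repository: it assumes each generated kernel is summarized as a (correct, speedup) pair, where speedup is baseline time divided by kernel time.

```python
def fast_p(results, p=1.0):
    """Fraction of generated kernels that are functionally correct AND
    achieve a speedup greater than threshold p over the baseline.

    results: list of (correct: bool, speedup: float) pairs, where
    speedup = baseline_time / kernel_time for that task.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# Example: 4 generated kernels; only correct kernels faster than baseline count.
sample = [(True, 1.4), (True, 0.9), (False, 2.0), (True, 1.1)]
print(fast_p(sample, p=1.0))  # 0.5
```

Raising p tightens the criterion, which matches the abstract's note that the benchmark gets harder as the speedup threshold increases; at p = 1.0 the metric reduces to "correct and faster than the PyTorch baseline".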

imgs/teasers/kernelbench.png
657 KB (binary file not shown)

pubs/kernelbench.pdf
2.92 MB (binary file not shown)

0 commit comments