---
layout: post
title: "CacheGen: Store Your KV Cache on Disk or S3—Load Blazingly Fast!"
thumbnail-img: /assets/img/cachegen.png
share-img: /assets/img/cachegen.png
author: Kuntai Du
image: /assets/img/cachegen.png
---
**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It makes your KV cache up to **3× smaller than quantization** while keeping response quality high. Stop wasting compute—get instant first-token times and smooth LLM serving at cloud scale.

<div align="center">
<img src="/assets/img/cachegen.png" alt="CacheGen slashes KV cache loading time from disk.">
<p><em>CacheGen slashes KV cache loading time from disk.</em></p>
</div>

---
## Why CacheGen?
Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive.
While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**hot contexts quickly outgrow memory**.

Storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
**CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.
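
Here is what this can look like in practice. The sketch below wires things up through LMCache's vLLM integration (CacheGen ships as LMCache's compression codec); the connector name and the `LMCACHE_*` environment variables follow LMCache's published examples but can differ across versions, and the model name and `contract.txt` are placeholders, so treat it as an illustration rather than a drop-in recipe.

```python
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache reads its settings from environment variables (or a YAML file via
# LMCACHE_CONFIG_FILE). These key names follow LMCache's examples; check the
# docs of your release for the exact spelling.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"                   # tokens per KV chunk
os.environ["LMCACHE_LOCAL_DISK"] = "file:///tmp/lmcache/"  # where KV caches persist
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "20"           # disk budget, in GB

# Route vLLM's KV cache through LMCache, both saving and loading it.
kv_cfg = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any vLLM-supported model
    kv_transfer_config=kv_cfg,
)

long_doc = open("contract.txt").read()  # a long context reused across queries

# The first query pays the prefill once and persists the compressed KV cache;
# later queries sharing the same prefix reload it from disk instead of
# recomputing it.
for question in ["Summarize the key terms.", "List every deadline."]:
    outputs = llm.generate(
        [f"{long_doc}\n\nQuestion: {question}"],
        SamplingParams(temperature=0, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)
```

To target remote storage such as S3 instead of local disk, LMCache's configs expose a `remote_url` setting plus a serializer option (`remote_serde`) where the CacheGen codec is selected; again, the exact keys depend on the LMCache version you run.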