---
title: "CacheGen: Storing your KV cache into persistent store while loading blazingly fast!"
3
+
title: "CacheGen: Storing your KV cache to disk and AWS S3 while loading blazingly fast!"
thumbnail-img: /assets/img/cachegen.png
share-img: /assets/img/cachegen.png
author: Kuntai Du
image: /assets/img/cachegen.png
---
**TL;DR:** 🚀 CacheGen lets you quickly load KV caches from disk or even from AWS S3 storage! It compresses your KV cache to about 3x smaller than quantization alone, while still letting you generate high-quality responses. Stop recomputing—get faster first-token times and smoother LLM serving at cloud scale!
---
## Why CacheGen?
Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
Existing LLM engines like vLLM already support caching these contexts (in the form of KV caches) in GPU memory (and, with LMCache, in CPU memory).
But for popular chat and agentic applications, even GPU and CPU memory combined may not be enough, and loading KV caches from storage devices like disk or AWS S3 is slow -- often even slower than just recomputing the KV caches.
**CacheGen** lets you **store** the KV cache in remote storage (S3, disk, etc.) and **load it back -- way faster than recomputing from text**. Perfect for remembering the valuable contexts of all your users and agents.
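
To make the store/load flow concrete, here is a minimal sketch of the idea (not LMCache's actual API): the KV cache of a long context is encoded into a compact bitstream once, pushed to S3, and later fetched and decoded instead of redoing the prefill. The `encode_kv`/`decode_kv` helpers and the bucket name are placeholders -- CacheGen's real encoder is a custom lossy compressor, and the production integration lives inside LMCache.

```python
# A hypothetical sketch of the store/load flow.  `encode_kv` / `decode_kv`
# stand in for CacheGen's real compressor and decompressor; here they just
# use torch.save / torch.load so the example runs end to end.
import hashlib
import io

import boto3
import torch

s3 = boto3.client("s3")
BUCKET = "my-kv-cache-bucket"  # placeholder bucket name


def encode_kv(kv_cache):
    # Placeholder: CacheGen actually emits a compact, lossily compressed bitstream.
    buf = io.BytesIO()
    torch.save(kv_cache, buf)
    return buf.getvalue()


def decode_kv(blob):
    # Placeholder: CacheGen decodes the bitstream back into KV tensors.
    return torch.load(io.BytesIO(blob))


def store_kv(context: str, kv_cache) -> str:
    """Encode the KV cache for `context` and upload it, keyed by the context hash."""
    key = hashlib.sha256(context.encode()).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=key, Body=encode_kv(kv_cache))
    return key


def load_kv(context: str):
    """Fetch and decode the KV cache for `context`, or return None on a cache miss."""
    key = hashlib.sha256(context.encode()).hexdigest()
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except s3.exceptions.NoSuchKey:
        return None  # not cached yet -- fall back to a normal prefill
    return decode_kv(obj["Body"].read())
```

In practice this happens inside the serving engine, so the decoded KV tensors land directly in GPU memory; LMCache handles that wiring for you.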
---