
Commit 444eb53

update blog
Signed-off-by: Kuntai Du <[email protected]>
1 parent 4659714 commit 444eb53

File tree

2 files changed (+17 −13 lines)


_posts/2024-07-31-cachegen.md renamed to _posts/2025-07-31-cachegen.md

Lines changed: 17 additions & 13 deletions
@@ -1,28 +1,33 @@
 ---
 layout: post
-title: "CacheGen: Storing your KV cache to disk and AWS S3 while loading blazingly fast!"
+title: "CacheGen: Store Your KV Cache on Disk or S3—Load Blazingly Fast!"
 thumbnail-img: /assets/img/cachegen.png
 share-img: /assets/img/cachegen.png
 author: Kuntai Du
 image: /assets/img/cachegen.png
 ---

-**TL;DR:** 🚀 CacheGen allows you to quickly load KV caches from disk or even from AWS S3 storage! It compresses your KV cache 3x smaller compared to quantization, while still allow you generating high-quality responses. Stop recomputing—get faster first-token times and smoother LLM serving at cloud scale!
+**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization** while keeping response quality high. Stop wasting compute—get instant first-token times and smooth LLM serving at cloud scale.
+
+<div align="center">
+<img src="/assets/img/cachegen.png" alt="comparison" style="width: 97%; vertical-align:middle;">
+<p><em>CacheGen slashes KV cache loading time from disk.</em></p>
+</div>

 ---

 ## Why CacheGen?

-Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
-Existing LLM engines like vLLM already supports caching these contexts (in the form of KV cache) in GPU memory (and with LMCache, in CPU memory).
-But for your popular chatting applications and agentic applications, even GPU and CPU memory altogether may not be enough, but loading KV caches from storage devices like disk and AWS S3 is slow -- even slower than just recomputing the KV caches.
-**CacheGen** lets you **store** the KV cache to remote storage (S3, disk, etc.), and **load it back —- way faster than recomputing from text**. Perfect for remembering the valuable contexts for all your users and agents.
+Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive.
+While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**hot contexts quickly outgrow memory**.
+
+Storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
+**CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.

 ---

 ## Key Results 📊

-
 | System | Mean TTFT (ms) | Mean TPOT (ms) |
 |-----------------------|:--------------:|:--------------:|
 | **LMCache + CacheGen**| **737** | **47.7** |
@@ -31,15 +36,14 @@ But for your popular chatting applications and agentic applications, even GPU an
 | DeepInfra | 2,949 | 79.0 |
 | Baseten | 113,239 | 174.9 |

-
-Takeaway: **CacheGen cuts Time-To-First-Token (TTFT) by up to 3× compared to other baselines!**
+**Takeaway:** CacheGen cuts Time-To-First-Token (TTFT) by up to **3×** compared to other baselines, and reduces per-token latency, too.

 ---

 ## How Does It Work?

-- **Compress:** CacheGen encodes and compresses the KV cache (using residue coding and custom quantization).
-- **Decompress:** High-performance CUDA kernel to quickly decompress KV caches.
+- **Compress:** CacheGen encodes KV cache with custom quantization and residue coding—making files up to 3× smaller than quantized tensors.
+- **Decompress:** Fast CUDA kernels restore the cache in milliseconds, right into GPU memory.

 ---

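The two bullets in the added text above are the whole pipeline at a high level: K/V vectors of neighbouring tokens tend to be similar, so CacheGen stores small *residues* between adjacent tokens and quantizes those residues to a few bits, keeping only occasional anchor tokens at full precision. Below is a minimal sketch of that intuition only; it is not LMCache's actual codec, and the function names, chunk size, and bit width are invented for illustration (the real implementation adds per-layer tuning, entropy coding of the residues, and CUDA decode kernels).

```python
# Conceptual sketch only -- NOT the real CacheGen codec. It shows the two ideas named
# in the bullets above: residue (delta) coding across adjacent tokens, plus low-bit
# quantization of those residues.
import torch

def compress_kv(kv: torch.Tensor, chunk: int = 16, bits: int = 4):
    """kv: [num_tokens, hidden] slice of a KV cache (one layer, keys or values)."""
    kv = kv.float()
    num_tokens, hidden = kv.shape
    pad = (-num_tokens) % chunk
    kv = torch.nn.functional.pad(kv, (0, 0, 0, pad))      # pad token dim to a multiple of `chunk`
    blocks = kv.view(-1, chunk, hidden)                    # [num_chunks, chunk, hidden]
    anchors = blocks[:, 0, :].clone()                      # first token of each chunk kept at full precision
    residue = blocks[:, 1:, :] - blocks[:, :-1, :]         # deltas between adjacent tokens are small
    scale = residue.abs().amax() / (2 ** (bits - 1) - 1) + 1e-8
    q = (residue / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1).to(torch.int8)
    return anchors, q, scale, num_tokens                   # low-bit residues dominate the payload

def decompress_kv(anchors, q, scale, num_tokens):
    residue = q.float() * scale
    blocks = torch.cat([anchors.unsqueeze(1), residue], dim=1)
    kv = blocks.cumsum(dim=1)                              # rebuild every chunk from its anchor
    return kv.reshape(-1, anchors.shape[-1])[:num_tokens]

# Round-trip check on a fake single-layer cache of 1,000 tokens:
kv = torch.randn(1000, 1024)
restored = decompress_kv(*compress_kv(kv))
print((kv - restored).abs().mean())                        # small reconstruction error
```

The anchors bound how far quantization error can drift within a chunk, which is why response quality holds up even at aggressive bit widths.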
@@ -52,9 +56,9 @@ uv pip install lmcache
 # Start cache server
 lmcache_server localhost 65434

-# Start vLLM+LMCache servers (example config below)
+# Start vLLM+LMCache server (using CacheGen)
 LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
-```
+

 example.yaml
 ```yaml
assets/img/cachegen.png

208 KB
