Commit 9820dca

change
Signed-off-by: Kuntai Du <[email protected]>
1 parent e34cc7e commit 9820dca

File tree

1 file changed: +95, -0 lines changed


_posts/2024-07-31-cachegen.md

Lines changed: 95 additions & 0 deletions
---
layout: post
title: "CacheGen: Storing your KV cache into persistent store while loading blazingly fast!"
thumbnail-img: /assets/img/cachegen.png
share-img: /assets/img/cachegen.png
author: Kuntai Du
image: /assets/img/cachegen.png
---

**TL;DR:** 🚀 CacheGen lets you store and load LLM KV cache from S3 (or any storage), with **3–4× faster loading** and **4× less bandwidth** than a quantized KV cache. Stop recomputing—get faster first-token times and smoother LLM serving, even at cloud scale!

---

## Why CacheGen?

Modern LLMs rely on long contexts, but reprocessing those contexts on every request is slow and wastes compute.
**CacheGen** lets you **persist** the KV cache to remote storage (S3, disk, etc.) and **load it back—way faster than recomputing from text**. Perfect for multi-user, distributed, or bursty workloads.

---

## Key Results 📊

| System                 | Mean TTFT (ms) | Mean TPOT (ms) |
|------------------------|:--------------:|:--------------:|
| **LMCache + CacheGen** | **737**        | **47.7**       |
| Naive vLLM             | 4,355          | 247.6          |
| Fireworks              | 2,353          | 664.7          |
| DeepInfra              | 2,949          | 79.0           |
| Baseten                | 113,239        | 174.9          |

- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6× compared to naive vLLM.**
- **Drastically reduces per-token generation latency (TPOT).**
- **Saves up to 4× bandwidth** vs. a quantized KV cache.
- Keeps decoding fast; decode overhead is negligible.

---

## How Does It Work?

- **Compress:** CacheGen encodes and compresses the KV cache using custom quantization and coding inspired by video codecs (see the sketch after this list).
- **Stream:** It loads the KV cache in chunks and adapts the compression level on the fly to the available bandwidth and SLA.
- **Persist:** Store the cache to S3, disk, or anywhere else; cold starts and multi-node serving become near-instant.
- **Fallback:** If bandwidth drops, CacheGen can fall back to recomputing from text, so the faster of the two paths is always taken.
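
To make the compress, stream, and persist idea concrete, here is a minimal, self-contained sketch. It is not CacheGen's actual codec or the LMCache API: it just chunks a fake KV tensor, applies naive per-chunk int8 quantization, and round-trips the chunks through an in-memory dictionary standing in for S3 or a cache server. The chunk size, key names, and helper functions are illustrative assumptions.

```python
# Conceptual sketch only: NOT CacheGen's real codec or the LMCache API.
# It shows the shape of the idea: chunk the KV cache, quantize each chunk,
# persist the chunks to "remote storage", then stream them back instead of
# recomputing the prefill.
import numpy as np

CHUNK_TOKENS = 256  # hypothetical chunk size, in tokens


def compress_chunk(chunk: np.ndarray) -> tuple[bytes, float]:
    """Naive per-chunk int8 quantization with a single scale factor."""
    scale = float(np.abs(chunk).max()) / 127.0 or 1.0
    quantized = np.round(chunk / scale).astype(np.int8)
    return quantized.tobytes(), scale


def decompress_chunk(blob: bytes, scale: float, shape: tuple) -> np.ndarray:
    """Undo the quantization for one chunk."""
    quantized = np.frombuffer(blob, dtype=np.int8).reshape(shape)
    return quantized.astype(np.float32) * scale


# Fake KV cache for a 1024-token context: (tokens, kv_heads, head_dim).
kv_cache = np.random.randn(1024, 8, 128).astype(np.float32)

object_store: dict[str, tuple[bytes, float, tuple]] = {}  # stand-in for S3/disk

# "Persist": compress chunk by chunk and upload.
for start in range(0, kv_cache.shape[0], CHUNK_TOKENS):
    chunk = kv_cache[start : start + CHUNK_TOKENS]
    blob, scale = compress_chunk(chunk)
    object_store[f"ctx-demo/chunk-{start}"] = (blob, scale, chunk.shape)

# "Stream": load the chunks back and reassemble instead of re-running prefill.
restored = np.concatenate(
    [
        decompress_chunk(*object_store[f"ctx-demo/chunk-{start}"])
        for start in range(0, kv_cache.shape[0], CHUNK_TOKENS)
    ],
    axis=0,
)
print("max abs reconstruction error:", float(np.abs(restored - kv_cache).max()))
```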
---

## Quick Start 🛠️

```bash
uv pip install vllm
uv pip install lmcache

# Start cache server
lmcache_server localhost 65434

# Start vLLM+LMCache server (example config below)
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```

`example.yaml`:

```yaml
chunk_size: 2048
local_cpu: False
remote_url: "lm://localhost:65434"
remote_serde: "cachegen"
```
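
Once the servers from the Quick Start are up, you can exercise the cache through vLLM's OpenAI-compatible API. The snippet below is a sketch rather than an official example: it assumes the server above is listening on port 8020 and simply sends the same long prefix twice, so the second request can reuse the KV cache that LMCache persisted (with the `cachegen` serializer) instead of recomputing it.

```python
# Sketch: hit the Quick Start server twice with the same long prefix.
# Assumes the vLLM + LMCache server above is running on localhost:8020.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="dummy")

# A stand-in for a long document; in practice this would be your real context.
long_context = "LMCache keeps KV caches outside GPU memory. " * 500

for attempt in (1, 2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {
                "role": "user",
                "content": long_context + "\n\nSummarize the text above in one sentence.",
            }
        ],
        max_tokens=64,
    )
    elapsed = time.perf_counter() - start
    # The second attempt should be noticeably faster, since the shared prefix's
    # KV cache can be loaded from the cache server rather than recomputed.
    print(f"attempt {attempt}: {elapsed:.2f}s -> {response.choices[0].message.content!r}")
```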
Benchmarks and more examples are available in the CacheGen GitHub repository.

## Why Use CacheGen?

- **Persistent KV cache:** Never pay the cold-start penalty again.
- **Fast context reuse:** Instantly load multi-GB contexts, even from S3.
- **Cloud & multi-node ready:** No need for fast interconnects.
- **Plug-and-play:** Integrates with vLLM, LangChain, and the rest of your stack.

Stop wasting GPU cycles on recompute. Store, stream, and serve context—faster than ever.

## Try It Now!

- LMCache GitHub
- LMIgnite platform
- CacheGen Paper (SIGCOMM'24)
- Join our Slack

**CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way!** 🚀
