
Commit 4659714

update cachegen
Signed-off-by: Kuntai Du <[email protected]>
1 parent 9820dca commit 4659714

File tree

1 file changed (+15, -35 lines)


_posts/2024-07-31-cachegen.md

Lines changed: 15 additions & 35 deletions
@@ -1,20 +1,22 @@
 ---
 layout: post
-title: "CacheGen: Storing your KV cache into persistent store while loading blazingly fast!"
+title: "CacheGen: Storing your KV cache to disk and AWS S3 while loading blazingly fast!"
 thumbnail-img: /assets/img/cachegen.png
 share-img: /assets/img/cachegen.png
 author: Kuntai Du
 image: /assets/img/cachegen.png
 ---

-**TL;DR:** 🚀 CacheGen lets you store and load LLM KV cache from S3 (or any storage), with **3–4× faster loading** and **4× less bandwidth** than quantization. Stop recomputing—get faster first-token times and smoother LLM serving, even at cloud scale!
+**TL;DR:** 🚀 CacheGen allows you to quickly load KV caches from disk or even from AWS S3 storage! It compresses your KV cache to be 3x smaller than quantization, while still allowing you to generate high-quality responses. Stop recomputing—get faster first-token times and smoother LLM serving at cloud scale!

 ---

 ## Why CacheGen?

-Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
-**CacheGen** lets you **persist** the KV cache to remote storage (S3, disk, etc.), and **load it back—way faster than recomputing from text**. Perfect for multi-user, distributed, or bursty workloads.
+Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
+Existing LLM engines like vLLM already support caching these contexts (in the form of KV cache) in GPU memory (and, with LMCache, in CPU memory).
+But for popular chat and agentic applications, even GPU and CPU memory together may not be enough, and loading KV caches from storage devices like disk and AWS S3 is slow -- often even slower than just recomputing the KV caches.
+**CacheGen** lets you **store** the KV cache to remote storage (S3, disk, etc.), and **load it back -- way faster than recomputing from text**. Perfect for remembering the valuable contexts for all your users and agents.

 ---

@@ -23,27 +25,21 @@ Modern LLMs rely on long contexts, but reprocessing those every time is slow and

 | System | Mean TTFT (ms) | Mean TPOT (ms) |
 |-----------------------|:--------------:|:--------------:|
-| **LMCache + CacheGen**| **737** | **47.7** |
+| **LMCache + CacheGen**| **737** | **47.7** |
 | Naive vLLM | 4,355 | 247.6 |
 | Fireworks | 2,353 | 664.7 |
 | DeepInfra | 2,949 | 79.0 |
 | Baseten | 113,239 | 174.9 |

-- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6× compared to naive vLLM.**
-- **Drastically reduces generation latency per token (TPOT).**

-- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6×.**
-- **Saves up to 4× bandwidth** vs. quantized KV cache.
-- Keeps decoding fast; decode overhead is negligible.
+Takeaway: **CacheGen cuts Time-To-First-Token (TTFT) by up to 3× compared to other baselines!**

 ---

 ## How Does It Work?

-- **Compress:** CacheGen encodes and compresses the KV cache (using custom quantization and coding, inspired by video codecs).
-- **Stream:** Loads KV cache in chunks; adapts compression live based on bandwidth/SLA.
-- **Persist:** Store to S3, disk, or anywhere. Cold starts and multi-node serving are now instant.
-- **Fallback:** If bandwidth drops, CacheGen can fallback to text recompute—always the fastest path.
+- **Compress:** CacheGen encodes and compresses the KV cache (using residue coding and custom quantization).
+- **Decompress:** High-performance CUDA kernel to quickly decompress KV caches.

 ---

@@ -68,28 +64,12 @@ remote_url: "lm://localhost:65434"
 remote_serde: "cachegen"
 ```

+## Contact

-Benchmark and more examples in CacheGen GitHub.
+- **LMCache Github: [https://github.com/LMCache/LMCache](https://github.com/LMCache/LMCache)**
+- **Chat with the Developers** **[Interest Form](https://forms.gle/mQfQDUXbKfp2St1z7)**
+- **LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ)**
+- **vLLM Production-Stack [channel](https://vllm-dev.slack.com/archives/C089SMEAKRA)**

-## Why Use CacheGen?
-
-Persistent KV cache: Never pay cold-start penalty again.
-
-Fast context reuse: Instantly load multi-GB context, even from S3.
-
-Cloud & multi-node ready: No need for fast interconnects.
-
-Plug-and-play: Integrates with vLLM, LangChain, your stack.
-
-Stop wasting GPU cycles on recompute. Store, stream, and serve context—faster than ever.
-
-## Try It Now!
-LMCache GitHub
-
-LMIgnite platform
-
-CacheGen Paper (SIGCOMM'24)
-
-Join our Slack

 **CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way!** 🚀
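
To make the **Compress** / **Decompress** bullets in the updated post concrete, here is a minimal, illustrative sketch of residue (delta) coding followed by coarse quantization on a slice of KV cache. This is not CacheGen's actual codec or its CUDA decompression kernel; the tensor shape, the 4-bit width, and the single global scale are assumptions chosen only for illustration.

```python
# Illustrative sketch: residue (delta) coding across tokens + 4-bit quantization.
# Not CacheGen's real implementation; shapes and bit width are assumptions.
import numpy as np

def compress_kv(kv: np.ndarray, num_bits: int = 4):
    """kv: [num_tokens, hidden] float32 slice of one KV-cache layer."""
    anchor = kv[0]                           # keep the first token's vector uncompressed
    residue = kv[1:] - kv[:-1]               # deltas between adjacent tokens are small
    scale = np.abs(residue).max() / (2 ** (num_bits - 1) - 1) + 1e-8
    q = np.clip(np.round(residue / scale),
                -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1).astype(np.int8)
    return anchor, q, float(scale)           # a real codec would also entropy-code q

def decompress_kv(anchor: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    residue = q.astype(np.float32) * scale                  # dequantize the residues
    rest = anchor + np.cumsum(residue, axis=0)              # undo the delta coding
    return np.vstack([anchor, rest])

kv = np.random.randn(128, 1024).astype(np.float32)          # fake 128-token KV slice
anchor, q, scale = compress_kv(kv)
kv_hat = decompress_kv(anchor, q, scale)
print(q.nbytes / kv.nbytes, np.abs(kv - kv_hat).max())      # size ratio, reconstruction error
```

A production codec would work per chunk (to bound the accumulated quantization error from the cumulative sum) and entropy-code the quantized residues before writing them to disk or S3.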

0 commit comments
