
Commit 4659714

update cachegen
Signed-off-by: Kuntai Du <[email protected]>
1 parent 9820dca commit 4659714

File tree

1 file changed (+15, -35 lines)


_posts/2024-07-31-cachegen.md

Lines changed: 15 additions & 35 deletions
@@ -1,20 +1,22 @@
 ---
 layout: post
-title: "CacheGen: Storing your KV cache into persistent store while loading blazingly fast!"
+title: "CacheGen: Storing your KV cache to disk and AWS S3 while loading blazingly fast!"
 thumbnail-img: /assets/img/cachegen.png
 share-img: /assets/img/cachegen.png
 author: Kuntai Du
 image: /assets/img/cachegen.png
 ---

-**TL;DR:** 🚀 CacheGen lets you store and load LLM KV cache from S3 (or any storage), with **3–4× faster loading** and **4× less bandwidth** than quantization. Stop recomputing—get faster first-token times and smoother LLM serving, even at cloud scale!
+**TL;DR:** 🚀 CacheGen allows you to quickly load KV caches from disk or even from AWS S3 storage! It compresses your KV cache to be 3x smaller than quantization, while still allowing you to generate high-quality responses. Stop recomputing—get faster first-token times and smoother LLM serving at cloud scale!

 ---

 ## Why CacheGen?

-Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
-**CacheGen** lets you **persist** the KV cache to remote storage (S3, disk, etc.), and **load it back—way faster than recomputing from text**. Perfect for multi-user, distributed, or bursty workloads.
+Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
+Existing LLM engines like vLLM already support caching these contexts (in the form of KV cache) in GPU memory (and, with LMCache, in CPU memory).
+But for popular chat and agentic applications, even GPU and CPU memory together may not be enough, and loading KV caches from storage devices like disk and AWS S3 is slow -- often even slower than just recomputing the KV caches.
+**CacheGen** lets you **store** the KV cache to remote storage (S3, disk, etc.), and **load it back -- way faster than recomputing from text**. Perfect for remembering the valuable contexts for all your users and agents.

 ---

@@ -23,27 +25,21 @@ Modern LLMs rely on long contexts, but reprocessing those every time is slow and

 | System | Mean TTFT (ms) | Mean TPOT (ms) |
 |-----------------------|:--------------:|:--------------:|
-| **LMCache + CacheGen**| **737** | **47.7** |
+| **LMCache + CacheGen**| **737** | **47.7** |
 | Naive vLLM | 4,355 | 247.6 |
 | Fireworks | 2,353 | 664.7 |
 | DeepInfra | 2,949 | 79.0 |
 | Baseten | 113,239 | 174.9 |

-- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6× compared to naive vLLM.**
-- **Drastically reduces generation latency per token (TPOT).**

-- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6×.**
-- **Saves up to 4× bandwidth** vs. quantized KV cache.
-- Keeps decoding fast; decode overhead is negligible.
+Takeaway: **CacheGen cuts Time-To-First-Token (TTFT) by up to 3× compared to other baselines!**

 ---

 ## How Does It Work?

-- **Compress:** CacheGen encodes and compresses the KV cache (using custom quantization and coding, inspired by video codecs).
-- **Stream:** Loads KV cache in chunks; adapts compression live based on bandwidth/SLA.
-- **Persist:** Store to S3, disk, or anywhere. Cold starts and multi-node serving are now instant.
-- **Fallback:** If bandwidth drops, CacheGen can fallback to text recompute—always the fastest path.
+- **Compress:** CacheGen encodes and compresses the KV cache (using residue coding and custom quantization).
+- **Decompress:** High-performance CUDA kernel to quickly decompress KV caches.

 ---

@@ -68,28 +64,12 @@ remote_url: "lm://localhost:65434"
 remote_serde: "cachegen"
 ```

+## Contact

-Benchmark and more examples in CacheGen GitHub.
+- **LMCache Github: [https://github.com/LMCache/LMCache](https://github.com/LMCache/LMCache)**
+- **Chat with the Developers** **[Interest Form](https://forms.gle/mQfQDUXbKfp2St1z7)**
+- **LMCache [slack](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ)**
+- **vLLM Production-Stack [channel](https://vllm-dev.slack.com/archives/C089SMEAKRA)**

-## Why Use CacheGen?
-
-Persistent KV cache: Never pay cold-start penalty again.
-
-Fast context reuse: Instantly load multi-GB context, even from S3.
-
-Cloud & multi-node ready: No need for fast interconnects.
-
-Plug-and-play: Integrates with vLLM, LangChain, your stack.
-
-Stop wasting GPU cycles on recompute. Store, stream, and serve context—faster than ever.
-
-## Try It Now!
-LMCache GitHub
-
-LMIgnite platform
-
-CacheGen Paper (SIGCOMM'24)
-
-Join our Slack

 **CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way!** 🚀
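
To make the **Compress** / **Decompress** bullets in the updated post concrete, here is a minimal, illustrative sketch of residue (delta) coding followed by coarse quantization on a slice of KV cache. This is not CacheGen's actual codec or its CUDA decompression kernel; the tensor shape, the 4-bit width, and the single global scale are assumptions chosen only for illustration.

```python
# Illustrative sketch: residue (delta) coding across tokens + 4-bit quantization.
# Not CacheGen's real implementation; shapes and bit width are assumptions.
import numpy as np

def compress_kv(kv: np.ndarray, num_bits: int = 4):
    """kv: [num_tokens, hidden] float32 slice of one KV-cache layer."""
    anchor = kv[0]                           # keep the first token's vector uncompressed
    residue = kv[1:] - kv[:-1]               # deltas between adjacent tokens are small
    scale = np.abs(residue).max() / (2 ** (num_bits - 1) - 1) + 1e-8
    q = np.clip(np.round(residue / scale),
                -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1).astype(np.int8)
    return anchor, q, float(scale)           # a real codec would also entropy-code q

def decompress_kv(anchor: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    residue = q.astype(np.float32) * scale                  # dequantize the residues
    rest = anchor + np.cumsum(residue, axis=0)              # undo the delta coding
    return np.vstack([anchor, rest])

kv = np.random.randn(128, 1024).astype(np.float32)          # fake 128-token KV slice
anchor, q, scale = compress_kv(kv)
kv_hat = decompress_kv(anchor, q, scale)
print(q.nbytes / kv.nbytes, np.abs(kv - kv_hat).max())      # size ratio, reconstruction error
```

A production codec would work per chunk (to bound the accumulated quantization error from the cumulative sum) and entropy-code the quantized residues before writing them to disk or S3.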

0 commit comments
