---
layout: post
title: "CacheGen: Storing your KV cache in a persistent store while loading it blazingly fast!"
thumbnail-img: /assets/img/cachegen.png
share-img: /assets/img/cachegen.png
author: Kuntai Du
image: /assets/img/cachegen.png
---

**TL;DR:** 🚀 CacheGen lets you store and load the LLM KV cache from S3 (or any storage), with **3–4× faster loading** and **4× less bandwidth** than a quantized KV cache. Stop recomputing—get faster first-token times and smoother LLM serving, even at cloud scale!

---

## Why CacheGen?

Modern LLMs rely on long contexts, but reprocessing those contexts on every request is slow and wastes compute.
**CacheGen** lets you **persist** the KV cache to remote storage (S3, disk, etc.) and **load it back—far faster than recomputing from text**. Perfect for multi-user, distributed, or bursty workloads.

---

## Key Results 📊

| System                 | Mean TTFT (ms) | Mean TPOT (ms) |
|------------------------|:--------------:|:--------------:|
| **LMCache + CacheGen** | **737**        | **47.7**       |
| Naive vLLM             | 4,355          | 247.6          |
| Fireworks              | 2,353          | 664.7          |
| DeepInfra              | 2,949          | 79.0           |
| Baseten                | 113,239        | 174.9          |

- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6× compared to naive vLLM** (4,355 ms → 737 ms).
- **Drastically reduces per-token generation latency (TPOT).**
- **Saves up to 4× bandwidth** vs. a quantized KV cache.
- Keeps generation fast: the KV-cache decoding (decompression) overhead is negligible.

---

## How Does It Work?

- **Compress:** CacheGen encodes and compresses the KV cache using custom quantization and coding inspired by video codecs (see the sketch after this list).
- **Stream:** Loads the KV cache in chunks and adapts the compression level on the fly based on available bandwidth and the SLA.
- **Persist:** Stores the cache to S3, disk, or anywhere else, so cold starts and multi-node serving become near-instant.
- **Fallback:** If bandwidth drops too low, CacheGen falls back to recomputing from text, so the fastest available path is always taken.

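To make the **Compress** bullet concrete, here is a minimal, illustrative sketch of "delta, then quantize, then entropy-code" for one KV-cache chunk. It is not CacheGen's actual codec: the real encoder uses layer-aware quantization and an arithmetic coder, while this sketch uses `zlib` as a stand-in for the entropy-coding stage, and the chunk shape and `scale` value are invented for the example.

```python
# Illustrative sketch only -- NOT the CacheGen codec.
import zlib
import numpy as np

def encode_kv_chunk(kv: np.ndarray, scale: float = 0.05) -> bytes:
    """Compress one chunk of K or V values with shape (num_tokens, hidden_dim)."""
    # Neighboring tokens tend to have similar K/V values, so token-to-token
    # deltas concentrate around zero and quantize/entropy-code well.
    deltas = np.diff(kv, axis=0, prepend=np.zeros((1, kv.shape[1]), dtype=kv.dtype))
    quantized = np.clip(np.round(deltas / scale), -127, 127).astype(np.int8)
    return zlib.compress(quantized.tobytes(), level=6)  # stand-in for arithmetic coding

def decode_kv_chunk(blob: bytes, shape: tuple, scale: float = 0.05) -> np.ndarray:
    """Invert encode_kv_chunk (lossy: only quantization error remains)."""
    quantized = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return np.cumsum(quantized.astype(np.float32) * scale, axis=0)  # undo the deltas

# Example: one 2048-token chunk; random data compresses far worse than real KV tensors.
chunk = np.random.randn(2048, 1024).astype(np.float32)
blob = encode_kv_chunk(chunk)
restored = decode_kv_chunk(blob, chunk.shape)
print(f"raw {chunk.nbytes / 1e6:.1f} MB -> encoded {len(blob) / 1e6:.1f} MB")
```
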
---

## Quick Start 🛠️

```bash
# Install vLLM and LMCache
uv pip install vllm
uv pip install lmcache

# Start the LMCache cache server
lmcache_server localhost 65434

# Start the vLLM + LMCache server (example config below)
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```

**example.yaml:**
```yaml
chunk_size: 2048                     # number of tokens per KV-cache chunk
local_cpu: False                     # skip the local CPU-memory cache tier
remote_url: "lm://localhost:65434"   # the lmcache_server started above
remote_serde: "cachegen"             # serialize chunks with the CacheGen codec
```
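
Once both processes are up, the vLLM endpoint speaks the usual OpenAI-compatible HTTP API, so a quick way to see the cache at work is to send the same long-context prompt twice. The sketch below is only a sanity check under the Quick Start assumptions (port 8020, the Llama model above); `my_long_document.txt` is a hypothetical input file, and the second request should come back much faster because its KV cache is loaded instead of recomputed.

```python
# Illustrative sanity check: send the same long-context prompt twice and compare latency.
# Assumes the vLLM + LMCache server from the Quick Start is listening on port 8020.
import time
import requests

URL = "http://localhost:8020/v1/completions"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 32})
    resp.raise_for_status()
    return time.perf_counter() - start

# Hypothetical long document; any multi-thousand-token text works.
long_context = open("my_long_document.txt").read()
prompt = long_context + "\n\nSummarize the document above in one sentence."

cold = timed_request(prompt)  # first call: prefill runs and the KV cache is stored
warm = timed_request(prompt)  # second call: the KV cache is loaded, not recomputed
print(f"cold: {cold:.2f}s  warm: {warm:.2f}s")
```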

Benchmarks and more examples are in the CacheGen GitHub repo.

## Why Use CacheGen?

- **Persistent KV cache:** Never pay the cold-start penalty again.
- **Fast context reuse:** Instantly load multi-GB contexts, even from S3.
- **Cloud & multi-node ready:** No need for fast interconnects.
- **Plug-and-play:** Integrates with vLLM, LangChain, and the rest of your stack.

Stop wasting GPU cycles on recompute. Store, stream, and serve context—faster than ever.

## Try It Now!

- LMCache GitHub
- LMIgnite platform
- CacheGen Paper (SIGCOMM'24)
- Join our Slack

**CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way!** 🚀