
Commit 444eb53

update blog
Signed-off-by: Kuntai Du <[email protected]>
1 parent 4659714 commit 444eb53

File tree

2 files changed (+17 −13 lines)


_posts/2024-07-31-cachegen.md renamed to _posts/2025-07-31-cachegen.md

Lines changed: 17 additions & 13 deletions
@@ -1,28 +1,33 @@
 ---
 layout: post
-title: "CacheGen: Storing your KV cache to disk and AWS S3 while loading blazingly fast!"
+title: "CacheGen: Store Your KV Cache on Disk or S3—Load Blazingly Fast!"
 thumbnail-img: /assets/img/cachegen.png
 share-img: /assets/img/cachegen.png
 author: Kuntai Du
 image: /assets/img/cachegen.png
 ---

-**TL;DR:** 🚀 CacheGen allows you to quickly load KV caches from disk or even from AWS S3 storage! It compresses your KV cache 3x smaller compared to quantization, while still allow you generating high-quality responses. Stop recomputing—get faster first-token times and smoother LLM serving at cloud scale!
+**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization** while keeping response quality high. Stop wasting compute—get instant first-token times and smooth LLM serving at cloud scale.
+
+<div align="center">
+<img src="/assets/img/cachegen.png" alt="comparison" style="width: 97%; vertical-align:middle;">
+<p><em>CacheGen slashes KV cache loading time from disk.</em></p>
+</div>

 ---

 ## Why CacheGen?

-Modern LLMs rely on long contexts, but reprocessing those every time is slow and wastes compute.
-Existing LLM engines like vLLM already supports caching these contexts (in the form of KV cache) in GPU memory (and with LMCache, in CPU memory).
-But for your popular chatting applications and agentic applications, even GPU and CPU memory altogether may not be enough, but loading KV caches from storage devices like disk and AWS S3 is slow -- even slower than just recomputing the KV caches.
-**CacheGen** lets you **store** the KV cache to remote storage (S3, disk, etc.), and **load it back —- way faster than recomputing from text**. Perfect for remembering the valuable contexts for all your users and agents.
+Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive.
+While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**hot contexts quickly outgrow memory**.
+
+Storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
+**CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.

 ---

 ## Key Results 📊

-
 | System | Mean TTFT (ms) | Mean TPOT (ms) |
 |-----------------------|:--------------:|:--------------:|
 | **LMCache + CacheGen**| **737** | **47.7** |
@@ -31,15 +36,14 @@ But for your popular chatting applications and agentic applications, even GPU an
 | DeepInfra | 2,949 | 79.0 |
 | Baseten | 113,239 | 174.9 |

-
-Takeaway: **CacheGen cuts Time-To-First-Token (TTFT) by up to 3× compared to other baselines!**
+**Takeaway:** CacheGen cuts Time-To-First-Token (TTFT) by up to **3×** compared to other baselines, and reduces per-token latency, too.

 ---

 ## How Does It Work?

-- **Compress:** CacheGen encodes and compresses the KV cache (using residue coding and custom quantization).
-- **Decompress:** High-performance CUDA kernel to quickly decompress KV caches.
+- **Compress:** CacheGen encodes KV cache with custom quantization and residue coding—making files up to 3× smaller than quantized tensors.
+- **Decompress:** Fast CUDA kernels restore the cache in milliseconds, right into GPU memory.

 ---

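The two bullets in the added text above are the whole pipeline at a high level: K/V vectors of neighbouring tokens tend to be similar, so CacheGen stores small *residues* between adjacent tokens and quantizes those residues to a few bits, keeping only occasional anchor tokens at full precision. Below is a minimal sketch of that intuition only; it is not LMCache's actual codec, and the function names, chunk size, and bit width are invented for illustration (the real implementation adds per-layer tuning, entropy coding of the residues, and CUDA decode kernels).

```python
# Conceptual sketch only -- NOT the real CacheGen codec. It shows the two ideas named
# in the bullets above: residue (delta) coding across adjacent tokens, plus low-bit
# quantization of those residues.
import torch

def compress_kv(kv: torch.Tensor, chunk: int = 16, bits: int = 4):
    """kv: [num_tokens, hidden] slice of a KV cache (one layer, keys or values)."""
    kv = kv.float()
    num_tokens, hidden = kv.shape
    pad = (-num_tokens) % chunk
    kv = torch.nn.functional.pad(kv, (0, 0, 0, pad))      # pad token dim to a multiple of `chunk`
    blocks = kv.view(-1, chunk, hidden)                    # [num_chunks, chunk, hidden]
    anchors = blocks[:, 0, :].clone()                      # first token of each chunk kept at full precision
    residue = blocks[:, 1:, :] - blocks[:, :-1, :]         # deltas between adjacent tokens are small
    scale = residue.abs().amax() / (2 ** (bits - 1) - 1) + 1e-8
    q = (residue / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1).to(torch.int8)
    return anchors, q, scale, num_tokens                   # low-bit residues dominate the payload

def decompress_kv(anchors, q, scale, num_tokens):
    residue = q.float() * scale
    blocks = torch.cat([anchors.unsqueeze(1), residue], dim=1)
    kv = blocks.cumsum(dim=1)                              # rebuild every chunk from its anchor
    return kv.reshape(-1, anchors.shape[-1])[:num_tokens]

# Round-trip check on a fake single-layer cache of 1,000 tokens:
kv = torch.randn(1000, 1024)
restored = decompress_kv(*compress_kv(kv))
print((kv - restored).abs().mean())                        # small reconstruction error
```

The anchors bound how far quantization error can drift within a chunk, which is why response quality holds up even at aggressive bit widths.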
@@ -52,9 +56,9 @@ uv pip install lmcache
 # Start cache server
 lmcache_server localhost 65434

-# Start vLLM+LMCache servers (example config below)
+# Start vLLM+LMCache server (using CacheGen)
 LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
-```
+

 example.yaml
 ```yaml
assets/img/cachegen.png

208 KB
