---
layout: post
title: "CacheGen: Storing your KV cache in a persistent store while loading it blazingly fast!"
thumbnail-img: /assets/img/cachegen.png
share-img: /assets/img/cachegen.png
author: Kuntai Du
image: /assets/img/cachegen.png
---

**TL;DR:** 🚀 CacheGen lets you store and load the LLM KV cache from S3 (or any storage), with **3–4× faster loading** and **4× less bandwidth** than a quantized KV cache. Stop recomputing—get faster first-token times and smoother LLM serving, even at cloud scale!

---

## Why CacheGen?

Modern LLMs rely on long contexts, but reprocessing those contexts on every request is slow and wastes compute.
**CacheGen** lets you **persist** the KV cache to remote storage (S3, disk, etc.) and **load it back—far faster than recomputing from text**. Perfect for multi-user, distributed, or bursty workloads.

---

## Key Results 📊

| System                 | Mean TTFT (ms) | Mean TPOT (ms) |
|------------------------|:--------------:|:--------------:|
| **LMCache + CacheGen** | **737**        | **47.7**       |
| Naive vLLM             | 4,355          | 247.6          |
| Fireworks              | 2,353          | 664.7          |
| DeepInfra              | 2,949          | 79.0           |
| Baseten                | 113,239        | 174.9          |

- **CacheGen cuts Time-To-First-Token (TTFT) by up to 6× compared to naive vLLM** (4,355 ms → 737 ms).
- **Drastically reduces per-token generation latency (TPOT).**
- **Saves up to 4× bandwidth** vs. a quantized KV cache.
- Keeps generation fast: the KV-cache decoding (decompression) overhead is negligible.

---

## How Does It Work?

- **Compress:** CacheGen encodes and compresses the KV cache using custom quantization and coding inspired by video codecs (see the sketch after this list).
- **Stream:** Loads the KV cache in chunks and adapts the compression level on the fly based on available bandwidth and the SLA.
- **Persist:** Stores the cache to S3, disk, or anywhere else, so cold starts and multi-node serving become near-instant.
- **Fallback:** If bandwidth drops too low, CacheGen falls back to recomputing from text, so the fastest available path is always taken.

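To make the **Compress** bullet concrete, here is a minimal, illustrative sketch of "delta, then quantize, then entropy-code" for one KV-cache chunk. It is not CacheGen's actual codec: the real encoder uses layer-aware quantization and an arithmetic coder, while this sketch uses `zlib` as a stand-in for the entropy-coding stage, and the chunk shape and `scale` value are invented for the example.

```python
# Illustrative sketch only -- NOT the CacheGen codec.
import zlib
import numpy as np

def encode_kv_chunk(kv: np.ndarray, scale: float = 0.05) -> bytes:
    """Compress one chunk of K or V values with shape (num_tokens, hidden_dim)."""
    # Neighboring tokens tend to have similar K/V values, so token-to-token
    # deltas concentrate around zero and quantize/entropy-code well.
    deltas = np.diff(kv, axis=0, prepend=np.zeros((1, kv.shape[1]), dtype=kv.dtype))
    quantized = np.clip(np.round(deltas / scale), -127, 127).astype(np.int8)
    return zlib.compress(quantized.tobytes(), level=6)  # stand-in for arithmetic coding

def decode_kv_chunk(blob: bytes, shape: tuple, scale: float = 0.05) -> np.ndarray:
    """Invert encode_kv_chunk (lossy: only quantization error remains)."""
    quantized = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return np.cumsum(quantized.astype(np.float32) * scale, axis=0)  # undo the deltas

# Example: one 2048-token chunk; random data compresses far worse than real KV tensors.
chunk = np.random.randn(2048, 1024).astype(np.float32)
blob = encode_kv_chunk(chunk)
restored = decode_kv_chunk(blob, chunk.shape)
print(f"raw {chunk.nbytes / 1e6:.1f} MB -> encoded {len(blob) / 1e6:.1f} MB")
```
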
---

## Quick Start 🛠️

```bash
# Install vLLM and LMCache
uv pip install vllm
uv pip install lmcache

# Start the LMCache cache server
lmcache_server localhost 65434

# Start the vLLM + LMCache server (example config below)
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```

**example.yaml:**
```yaml
chunk_size: 2048                     # number of tokens per KV-cache chunk
local_cpu: False                     # skip the local CPU-memory cache tier
remote_url: "lm://localhost:65434"   # the lmcache_server started above
remote_serde: "cachegen"             # serialize chunks with the CacheGen codec
```
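
Once both processes are up, the vLLM endpoint speaks the usual OpenAI-compatible HTTP API, so a quick way to see the cache at work is to send the same long-context prompt twice. The sketch below is only a sanity check under the Quick Start assumptions (port 8020, the Llama model above); `my_long_document.txt` is a hypothetical input file, and the second request should come back much faster because its KV cache is loaded instead of recomputed.

```python
# Illustrative sanity check: send the same long-context prompt twice and compare latency.
# Assumes the vLLM + LMCache server from the Quick Start is listening on port 8020.
import time
import requests

URL = "http://localhost:8020/v1/completions"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def timed_request(prompt: str) -> float:
    start = time.perf_counter()
    resp = requests.post(URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 32})
    resp.raise_for_status()
    return time.perf_counter() - start

# Hypothetical long document; any multi-thousand-token text works.
long_context = open("my_long_document.txt").read()
prompt = long_context + "\n\nSummarize the document above in one sentence."

cold = timed_request(prompt)  # first call: prefill runs and the KV cache is stored
warm = timed_request(prompt)  # second call: the KV cache is loaded, not recomputed
print(f"cold: {cold:.2f}s  warm: {warm:.2f}s")
```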

Benchmarks and more examples are in the CacheGen GitHub repo.

## Why Use CacheGen?

- **Persistent KV cache:** Never pay the cold-start penalty again.
- **Fast context reuse:** Instantly load multi-GB contexts, even from S3.
- **Cloud & multi-node ready:** No need for fast interconnects.
- **Plug-and-play:** Integrates with vLLM, LangChain, and the rest of your stack.

Stop wasting GPU cycles on recompute. Store, stream, and serve context—faster than ever.

## Try It Now!

- LMCache GitHub
- LMIgnite platform
- CacheGen Paper (SIGCOMM'24)
- Join our Slack

**CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way!** 🚀