|
49 | 49 | "- Configure some settings for GPU processing\n",
|
50 | 50 | "- Defines batch processing parameters (8 requests per batch, 2 GPU workers)\n",
|
51 | 51 | "\n",
|
| 52 | + "#### Model Source Configuration\n", |
| 53 | + "\n", |
| 54 | + "The `model_source` parameter supports several loading methods:\n", |
| 55 | + "\n", |
| 56 | + "* **Hugging Face Hub** (default): Use repository ID `model_source=\"meta-llama/Llama-2-7b-chat-hf\"`\n", |
| 57 | + "* **Local Directory**: Use file path `model_source=\"/path/to/my/local/model\"`\n", |
| 58 | + "* **Other Sources**: ModelScope via environment variables `VLLM_MODELSCOPE_DOWNLOADS_DIR`\n", |
| 59 | + "\n", |
| 60 | + "For complete model support and options, see the [official vLLM documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).\n", |
| 61 | + "\n", |
52 | 62 | "```python\n",
|
53 | 63 | "import ray\n",
|
54 | 64 | "from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig\n",
|
|
60 | 70 | " dtype=\"half\",\n",
|
61 | 71 | " max_model_len=1024,\n",
|
62 | 72 | " ),\n",
|
| 73 | + " # Batch size: Larger batches increase throughput but reduce fault tolerance\n", |
| 74 | + " # - Small batches (4-8): Better for fault tolerance and memory constraints\n", |
| 75 | + " # - Large batches (16-32): Higher throughput, better GPU utilization\n", |
| 76 | + " # - Choose based on your Ray Cluster size and memory availability\n", |
63 | 77 | " batch_size=8,\n",
|
| 78 | + " # Concurrency: Number of vLLM engine workers to spawn \n", |
| 79 | + " # - Set to match your total GPU count for maximum utilization\n", |
| 80 | + " # - Each worker gets assigned to a GPU automatically by Ray scheduler\n", |
| 81 | + " # - Can use all GPUs across head and worker nodes\n", |
64 | 82 | " concurrency=2,\n",
|
65 | 83 | ")\n",
|
66 | 84 | "```"
|
|
105 | 123 | "cell_type": "markdown",
|
106 | 124 | "metadata": {},
|
107 | 125 | "source": [
|
| 126 | + "#### Running the Pipeline\n", |
108 | 127 | "Now we can run the batch inference pipeline on our data, it will:\n",
|
109 | 128 | "- In the background, the processor will download the model into memory where vLLM serves it locally (on Ray Cluster) for use in inference\n",
|
110 | 129 | "- Generate a sample Ray Dataset with 32 rows (0-31) to process\n",
|
|