|
49 | 49 | "- Configure some settings for GPU processing\n",
|
50 | 50 | "- Defines batch processing parameters (8 requests per batch, 2 GPU workers)\n",
|
51 | 51 | "\n",
|
| 52 | + "#### Model Source Configuration\n", |
| 53 | + "\n", |
| 54 | + "The `model_source` parameter supports several loading methods:\n", |
| 55 | + "\n", |
| 56 | + "* **Hugging Face Hub** (default): Use repository ID `model_source=\"meta-llama/Llama-2-7b-chat-hf\"`\n", |
| 57 | + "* **Local Directory**: Use file path `model_source=\"/path/to/my/local/model\"`\n", |
| 58 | + "* **Other Sources**: ModelScope via environment variables `VLLM_MODELSCOPE_DOWNLOADS_DIR`\n", |
| 59 | + "\n", |
| 60 | + "For complete model support and options, see the [official vLLM documentation](https://docs.vllm.ai/en/latest/models/supported_models.html).\n", |
| 61 | + "\n", |
52 | 62 | "```python\n",
|
53 | 63 | "import ray\n",
|
54 | 64 | "from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig\n",
|
|
60 | 70 | " dtype=\"half\",\n",
|
61 | 71 | " max_model_len=1024,\n",
|
62 | 72 | " ),\n",
|
| 73 | + " # Batch size: Larger batches increase throughput but reduce fault tolerance\n", |
| 74 | + " # - Small batches (4-8): Better for fault tolerance and memory constraints\n", |
| 75 | + " # - Large batches (16-32): Higher throughput, better GPU utilization\n", |
| 76 | + " # - Choose based on your Ray Cluster size and memory availability\n", |
63 | 77 | " batch_size=8,\n",
|
| 78 | + " # Concurrency: Number of vLLM engine workers to spawn \n", |
| 79 | + " # - Set to match your total GPU count for maximum utilization\n", |
| 80 | + " # - Each worker gets assigned to a GPU automatically by Ray scheduler\n", |
| 81 | + " # - Can use all GPUs across head and worker nodes\n", |
64 | 82 | " concurrency=2,\n",
|
65 | 83 | ")\n",
|
66 | 84 | "```"
|
|
105 | 123 | "cell_type": "markdown",
|
106 | 124 | "metadata": {},
|
107 | 125 | "source": [
|
| 126 | + "#### Running the Pipeline\n", |
108 | 127 | "Now we can run the batch inference pipeline on our data, it will:\n",
|
109 | 128 | "- In the background, the processor will download the model into memory where vLLM serves it locally (on Ray Cluster) for use in inference\n",
|
110 | 129 | "- Generate a sample Ray Dataset with 32 rows (0-31) to process\n",
|
|