Releases: HabanaAI/vllm-fork
v0.9.0.1+Gaudi-1.22.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.22.0 and above
Running vLLM on Gaudi with Docker Compose
Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and the Gaudi software. Please follow the instructions to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
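As a rough sketch only (the compose file location and options below are assumptions, not the fork's documented layout), bringing the prebuilt image up with Docker Compose typically amounts to:
$ cd vllm-fork # assumes the compose file is shipped in the fork checkout
$ docker compose -f docker-compose.yml up -d # hypothetical compose file wrapping the prebuilt vLLM + Gaudi image
$ docker compose logs -f # follow the logs until the OpenAI-compatible server reports it is ready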
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
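A quick, non-authoritative way to confirm the runtime registration (exact output varies with the Docker version and with how the package configured the daemon) is:
$ docker info | grep -i runtimes # the list should include habana alongside runc
$ cat /etc/docker/daemon.json # the runtimes section is expected to map habana to the habana-container-runtime binary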
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
Note
Prerequisite: Starting with Intel Gaudi software version 1.22.x, the RHEL Docker image must be created manually before running the commands below. Additionally, the path to that Docker image must be updated in the Dockerfile.hpu.ubi file.
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
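Optionally, confirm that the PyTorch bridge can see the accelerators from Python. This one-liner assumes the habana_frameworks PyTorch plugin listed above is installed; treat it as a sketch rather than an official verification step:
$ python -c "import habana_frameworks.torch.hpu as hthpu; print(hthpu.is_available(), hthpu.device_count())" # expect True and the number of visible Gaudi cards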
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.9.0.1+Gaudi-1.22.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
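After any of the options above, a quick smoke test is to start the OpenAI-compatible server and send a single completion request. The model name below is only an example drawn from the validated configurations (substitute any model you have access to), and the first request may take a while because of HPU Graph warmup:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --dtype bfloat16 &
$ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello, Gaudi!", "max_tokens": 32}'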
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes using tensor parallelism with multiprocessing or Ray, and HCCL. | Documentation Example HCCL reference |
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile, with full support for FP8 and BF16 precisions. | vLLM HPU backend execution modes |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step schedulin... |
v0.8.5+Gaudi-1.22.0-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistal&Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
v0.8.5.post1+Gaudi-1.21.3
What's Changed
- Update requirements-hpu.txt by @michalkuligowski in #1018
- [SW-224648] Redirect test logs to file by @bmyrcha in #1016
- add ScaleToHwAligned for loading fp8 vllm model by @changwangss in #941
- Fix async callback ordering by @madamczyk-intel in #1023
- Implement Pipeline Parallelism support for HPU. by @jmaksymczuk in #1000
- Make lazy mode autodetection more robust by @kzawora-intel in #921
- [SW-224648] Fix test logs redirection by @bmyrcha in #1026
- [CI] Add APC tests by @kzawora-intel in #866
- [SW-225233] Adjust method of getting synapse_build by @bmyrcha in #1044
- Add more testowners by @adobrzyn in #1046
- APC - Remove prompt attn with context and use existing implementation by @adobrzyn in #1020
- Add exponential bucketing integration by @kzawora-intel in #642
- Marketing requested additional details of the ramp-up phase. by @MohitIntel in #1069
- Add in Dockerfile.hpu.ubi by @AnetaKaczynska in #1077
- Synchronize vLLM flags to support cross-node inference by @IT-Forrest in #897
- Set VLLM_T_COMPILE_FULLGRAPH=False in CI multi-modal tests by @afierka-intel in #1042
- Enable APC pre-merge tests to compile test suite by @afierka-intel in #1076
- IG: fix multimodal reshape for Qwen2.5-VL (revet #691) by @imangohari1 in #1081
- Fix embedding model accuracy issue when merged prefill is enabled by @libinta in #1047
- Enable dynamic shape for torch.compile under flag by @anko-intel in #1078
- [SW-225980] Allow to skip pytest for non-code related changes by @bmyrcha in #1092
- Update CODEOWNERS by @mgawarkiewicz-intel in #1107
- fix prepare_cos_sin invoke in RotaryEmbedding by @zhouyu5 in #1035
- multi-image support for llama3.2 [1/N] by @zhouyu5 in #926
- Add t.compile fp8 performance test to jenkins by @bkowalskiINTEL in #1066
- Update run-tests.sh by @michalkuligowski in #1117
- Rebase - 2025.04.06 by @kzawora-intel in #947
- Revert "Rebase - 2025.04.06" by @kzawora-intel in #1128
- Rebase mar 24 again by @michalkuligowski in #1127
- Restore fsdpa calibration by @madamczyk-intel in #1086
- Rebase mar 24 fixed by @michalkuligowski in #1130
- Simplify calling torch.compile by @anko-intel in #1140
- Bump xgrammar from 0.1.11 to 0.1.18 by @dependabot[bot] in #1043
- Update requirements-hpu.txt by @afierka-intel in #1125
- Modify RobertaEmbedding forward as custom op method by @yeonsily in #996
- [TC] Fix to graph break inside set_block_mapping by @jczaja in #1143
- [SW-224668] Fix for LLaMA LoRA test_layers_hpu by @rsshaik1 in #1074
- [SW-224666] Fix for LLaMA LoRA test_lora_manager_hpu by @rsshaik1 in #1070
- Fix profiling collection for VLLM_PT_PROFILE by @mswiniarsk in #1156
- Enable torchrun on Gaudi by @czhu15 in #974
- Minor fix regd. VLLM_GRAPH_PROMPT_RATIO in README_GAUDI.md by @MohitIntel in #1168
- Fix accuracy issue for llama 3.2 vision models. by @libinta in #1176
- add test owner by @jikunshang in #1082
- Add additional devs to TESTOWNERS by @bkowalskiINTEL in #1075
- Update CODEOWNERS by @michalkuligowski in #1185
- [SPEC_DECODE][V0] fix for spec decode eagle after rebase by @xuechendi in #1150
- Fix fixture duplication in async_engine tests by @akarnows in #1180
- Rebase apr 25 by @michalkuligowski in #1166
- [SW-225282] - Handle Batch Dimension for LoRA by @hlahkar in #1182
- Rebase apr 30 by @michalkuligowski in #1190
- Reduce recompilations when using merged_prefill by @madamczyk-intel in #1167
- Update TESTOWNERS by @madamczyk-intel in #1200
- [SW-225635] Adjust logging in CI by @bmyrcha in #1202
- Switch V1 env to False as default by @afierka-intel in #1206
- Update codeowners by @madamczyk-intel in #1217
- Rebase may 06 by @michalkuligowski in #1207
- [V1] Set dynamo cache size even if warmup is skipped by @Kacper-Pietkun in #1173
- Introduce block_softmax_adjustment kernel by @madamczyk-intel in #1174
- add missing transpose in MultiHeadAttention by @zhouyu5 in #1218
- [Spec Decode] Fix MLP speculative failing issue after rebase to Apr 30 by @xuechendi in #1210
- [Deepseek R1][v0] Porting deepseek r1 to habana_main by @xuechendi in #1161
- Set vllm-hpu-extension to 89030c by @madamczyk-intel in #1228
- Set hpu-extension to a060794 by @madamczyk-intel in #1232
- Add VLLM_PROFILE_* flags to V1 by @madamczyk-intel in #1203
- Update Dockerfile.hpu.ubi by @AnetaKaczynska in #1205
- Fix INC Finalization Check by @yiliu30 in #1230
- [CI] Align t.compile and lazy test definitions by @anko-intel in #1157
- [SW-228109][v0] [llama4 ]Llama 4 support for vLLM fork by @leopck in #1235
- fix dummy sequence length setting in llama3.2 by @zhouyu5 in #1229
- Enable Delayed Sampling by default by @mswiniarsk in #937
- [V1] Port t.compile optimizations from V0 to V1 by @Kacper-Pietkun in #1237
- [V1] enable fp8 by @Kacper-Pietkun in #1222
- Switch to V0 by default in envs.py by @kwisniewski98 in #1233
- [SW-228755] Fix CI for v0 spec decode fix by @xuechendi in #1252
- Apply test permission by @zhouyu5 in #1258
- [CI] Align t.compile and lazy tests by @anko-intel in #1250
- [BugFix] Fix --disable-log-stats in V1 server mode vllm-project#17600 by @iboiko-habana in #1249
- [SW-219737][habana_main] Support MTP to deepseek by @xuechendi in #1254
- fix text only input for llama3.2 by @zhouyu5 in #1262
- Remove intel implementation of top-p/top-k sampling method by @afierka-intel in #1243
- [CI] Add benchamrk return status by @anko-intel in #1259
- [habana_main]enable padding_aware_scheduler for speculative decoding by @xuechendi in #1264
- Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile by @anko-intel in #1184
- [SW-228365] - Update test cases for Lora by @hlahkar in #1256
- fix embedding crash with torch.compile by @libinta in #1213
- WA for CI - pkg resources by @adobrzyn in #1280
- [SW-228266] Fix LoRA layers test by @hlahkar in #1276
- Skip guards after fully warmup the model by @anko-intel in #1272
- Replace in-place add with out-of-place add in layernorm forward_hpu. by @jmaksymc in #1281
- Add 256 as possible option within block-size arg by @ksmusz in #1279
- Flat KV cache layout by @kdamaszk in #1106
- [Bugfix] config.head_dim is now explicitly set to None (vllm-project#18432) by @adobrzyn in https://github.com/HabanaAI/vllm-fork/pull/...
v0.8.5+Gaudi-1.21.2-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistal&Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
v0.8.5.post1+Gaudi-1.21.2
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.2 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.8.5.post1+Gaudi-1.21.2
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes using tensor parallelism with multiprocessing or Ray, and HCCL. | Documentation Example HCCL reference |
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (functional releas... |
v0.7.2+Gaudi-1.21.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.0 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.7.2+Gaudi-1.21.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes using tensor parallelism with multiprocessing or Ray, and HCCL. | Documentation Example HCCL reference |
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (functional release) ... |
v0.6.6.post1+Gaudi-1.20.0
vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.20.0
Requirements and Installation
Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Ubuntu 22.04 LTS OS
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.20.0 and above
Quick Start Using Dockerfile
Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.6.post1+Gaudi-1.20.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node using tensor parallelism with Ray and HCCL. | Documentation Example HCCL reference |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile. | vLLM HPU backend execution modes |
Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi) such as mpt-7b. | vLLM supported models |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using the AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using the AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. | Documentation Example |
Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. The vLLM HPU backend supports it alongside Ray. | Documentation |
Unsupported Features
- Beam s...
v0.6.4.post2+Gaudi-1.19.0
vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.19.0
Requirements and Installation
Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Ubuntu 22.04 LTS OS
- Python 3.10
- Intel Gaudi accelerator
- Intel Gaudi software version 1.19.0 and above
Quick Start Using Dockerfile
Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.4.post2+Gaudi-1.19.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node using tensor parallelism with Ray and HCCL. | Documentation Example HCCL reference |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile. | vLLM HPU backend execution modes |
Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi) such as mpt-7b. | vLLM supported models |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (experimental) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. | Documentation Example |
Unsupported Features
- Beam search
- AWQ quantization
- Prefill chunking (mixed-batch inferencing)
Supported Configurations
The following configurations have been validated to be functional with Gaudi2 devices. Configurations that are not listed may or may not work.
- meta-llama/Llama-2-7b on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
- meta-llama/Llama-2-7b-chat-hf on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with rando...
v0.6.4.post2+Gaudi-1.19.2
What's Changed
- Update CODEOWNERS by @iboiko-habana in #658
- [BLOCKER] Fix in v1.19.2 for dataclass error due to triton package update by @MohitIntel in #726
Full Changelog: v0.6.4.post2+Gaudi-1.19.0...v0.6.4.post2+Gaudi-1.19.2
v0.6.4.post2+Gaudi-1.19.1
What's Changed
- Update CODEOWNERS by @iboiko-habana in #658
- [BLOCKER] Fix in v1.19.1 for dataclass error due to triton package update by @MohitIntel in #727
Full Changelog: v0.6.4.post2+Gaudi-1.19.0...v0.6.4.post2+Gaudi-1.19.1