Releases: HabanaAI/vllm-fork
v0.9.0.1+Gaudi-1.22.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.22.0 and above
Running vLLM on Gaudi with Docker Compose
Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and the Gaudi software. Please follow the instructions to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
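As a rough sketch only (the compose file location and options below are assumptions, not the fork's documented layout), bringing the prebuilt image up with Docker Compose typically amounts to:
$ cd vllm-fork # assumes the compose file is shipped in the fork checkout
$ docker compose -f docker-compose.yml up -d # hypothetical compose file wrapping the prebuilt vLLM + Gaudi image
$ docker compose logs -f # follow the logs until the OpenAI-compatible server reports it is ready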
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
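A quick, non-authoritative way to confirm the runtime registration (exact output varies with the Docker version and with how the package configured the daemon) is:
$ docker info | grep -i runtimes # the list should include habana alongside runc
$ cat /etc/docker/daemon.json # the runtimes section is expected to map habana to the habana-container-runtime binary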
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
Note
Prerequisite: Starting with Intel Gaudi software version 1.22.x, the RHEL Docker image must be created manually before running the commands below. Additionally, the path to that Docker image must be updated in the Dockerfile.hpu.ubi file.
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
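Optionally, confirm that the PyTorch bridge can see the accelerators from Python. This one-liner assumes the habana_frameworks PyTorch plugin listed above is installed; treat it as a sketch rather than an official verification step:
$ python -c "import habana_frameworks.torch.hpu as hthpu; print(hthpu.is_available(), hthpu.device_count())" # expect True and the number of visible Gaudi cards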
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.9.0.1+Gaudi-1.22.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
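After any of the options above, a quick smoke test is to start the OpenAI-compatible server and send a single completion request. The model name below is only an example drawn from the validated configurations (substitute any model you have access to), and the first request may take a while because of HPU Graph warmup:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --dtype bfloat16 &
$ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "Hello, Gaudi!", "max_tokens": 32}'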
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes using tensor parallelism with multiprocessing or Ray, and HCCL. | Documentation Example HCCL reference |
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile, with full support for FP8 and BF16 precisions. | vLLM HPU backend execution modes |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step schedulin... |
v0.8.5+Gaudi-1.22.0-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistal&Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
v0.8.5.post1+Gaudi-1.21.3
What's Changed
- Update requirements-hpu.txt by @michalkuligowski in #1018
- [SW-224648] Redirect test logs to file by @bmyrcha in #1016
- add ScaleToHwAligned for loading fp8 vllm model by @changwangss in #941
- Fix async callback ordering by @madamczyk-intel in #1023
- Implement Pipeline Parallelism support for HPU. by @jmaksymczuk in #1000
- Make lazy mode autodetection more robust by @kzawora-intel in #921
- [SW-224648] Fix test logs redirection by @bmyrcha in #1026
- [CI] Add APC tests by @kzawora-intel in #866
- [SW-225233] Adjust method of getting synapse_build by @bmyrcha in #1044
- Add more testowners by @adobrzyn in #1046
- APC - Remove prompt attn with context and use existing implementation by @adobrzyn in #1020
- Add exponential bucketing integration by @kzawora-intel in #642
- Marketing requested additional details of the ramp-up phase. by @MohitIntel in #1069
- Add in Dockerfile.hpu.ubi by @AnetaKaczynska in #1077
- Synchronize vLLM flags to support cross-node inference by @IT-Forrest in #897
- Set VLLM_T_COMPILE_FULLGRAPH=False in CI multi-modal tests by @afierka-intel in #1042
- Enable APC pre-merge tests to compile test suite by @afierka-intel in #1076
- IG: fix multimodal reshape for Qwen2.5-VL (revet #691) by @imangohari1 in #1081
- Fix embedding model accuracy issue when merged prefill is enabled by @libinta in #1047
- Enable dynamic shape for torch.compile under flag by @anko-intel in #1078
- [SW-225980] Allow to skip pytest for non-code related changes by @bmyrcha in #1092
- Update CODEOWNERS by @mgawarkiewicz-intel in #1107
- fix prepare_cos_sin invoke in RotaryEmbedding by @zhouyu5 in #1035
- multi-image support for llama3.2 [1/N] by @zhouyu5 in #926
- Add t.compile fp8 performance test to jenkins by @bkowalskiINTEL in #1066
- Update run-tests.sh by @michalkuligowski in #1117
- Rebase - 2025.04.06 by @kzawora-intel in #947
- Revert "Rebase - 2025.04.06" by @kzawora-intel in #1128
- Rebase mar 24 again by @michalkuligowski in #1127
- Restore fsdpa calibration by @madamczyk-intel in #1086
- Rebase mar 24 fixed by @michalkuligowski in #1130
- Simplify calling torch.compile by @anko-intel in #1140
- Bump xgrammar from 0.1.11 to 0.1.18 by @dependabot[bot] in #1043
- Update requirements-hpu.txt by @afierka-intel in #1125
- Modify RobertaEmbedding forward as custom op method by @yeonsily in #996
- [TC] Fix to graph break inside set_block_mapping by @jczaja in #1143
- [SW-224668] Fix for LLaMA LoRA test_layers_hpu by @rsshaik1 in #1074
- [SW-224666] Fix for LLaMA LoRA test_lora_manager_hpu by @rsshaik1 in #1070
- Fix profiling collection for VLLM_PT_PROFILE by @mswiniarsk in #1156
- Enable torchrun on Gaudi by @czhu15 in #974
- Minor fix regd. VLLM_GRAPH_PROMPT_RATIO in README_GAUDI.md by @MohitIntel in #1168
- Fix accuracy issue for llama 3.2 vision models. by @libinta in #1176
- add test owner by @jikunshang in #1082
- Add additional devs to TESTOWNERS by @bkowalskiINTEL in #1075
- Update CODEOWNERS by @michalkuligowski in #1185
- [SPEC_DECODE][V0] fix for spec decode eagle after rebase by @xuechendi in #1150
- Fix fixture duplication in async_engine tests by @akarnows in #1180
- Rebase apr 25 by @michalkuligowski in #1166
- [SW-225282] - Handle Batch Dimension for LoRA by @hlahkar in #1182
- Rebase apr 30 by @michalkuligowski in #1190
- Reduce recompilations when using merged_prefill by @madamczyk-intel in #1167
- Update TESTOWNERS by @madamczyk-intel in #1200
- [SW-225635] Adjust logging in CI by @bmyrcha in #1202
- Switch V1 env to False as default by @afierka-intel in #1206
- Update codeowners by @madamczyk-intel in #1217
- Rebase may 06 by @michalkuligowski in #1207
- [V1] Set dynamo cache size even if warmup is skipped by @Kacper-Pietkun in #1173
- Introduce block_softmax_adjustment kernel by @madamczyk-intel in #1174
- add missing transpose in MultiHeadAttention by @zhouyu5 in #1218
- [Spec Decode] Fix MLP speculative failing issue after rebase to Apr 30 by @xuechendi in #1210
- [Deepseek R1][v0] Porting deepseek r1 to habana_main by @xuechendi in #1161
- Set vllm-hpu-extension to 89030c by @madamczyk-intel in #1228
- Set hpu-extension to a060794 by @madamczyk-intel in #1232
- Add VLLM_PROFILE_* flags to V1 by @madamczyk-intel in #1203
- Update Dockerfile.hpu.ubi by @AnetaKaczynska in #1205
- Fix INC Finalization Check by @yiliu30 in #1230
- [CI] Align t.compile and lazy test definitions by @anko-intel in #1157
- [SW-228109][v0] [llama4 ]Llama 4 support for vLLM fork by @leopck in #1235
- fix dummy sequence length setting in llama3.2 by @zhouyu5 in #1229
- Enable Delayed Sampling by default by @mswiniarsk in #937
- [V1] Port t.compile optimizations from V0 to V1 by @Kacper-Pietkun in #1237
- [V1] enable fp8 by @Kacper-Pietkun in #1222
- Switch to V0 by default in envs.py by @kwisniewski98 in #1233
- [SW-228755] Fix CI for v0 spec decode fix by @xuechendi in #1252
- Apply test permission by @zhouyu5 in #1258
- [CI] Align t.compile and lazy tests by @anko-intel in #1250
- [BugFix] Fix --disable-log-stats in V1 server mode vllm-project#17600 by @iboiko-habana in #1249
- [SW-219737][habana_main] Support MTP to deepseek by @xuechendi in #1254
- fix text only input for llama3.2 by @zhouyu5 in #1262
- Remove intel implementation of top-p/top-k sampling method by @afierka-intel in #1243
- [CI] Add benchamrk return status by @anko-intel in #1259
- [habana_main]enable padding_aware_scheduler for speculative decoding by @xuechendi in #1264
- Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile by @anko-intel in #1184
- [SW-228365] - Update test cases for Lora by @hlahkar in #1256
- fix embedding crash with torch.compile by @libinta in #1213
- WA for CI - pkg resources by @adobrzyn in #1280
- [SW-228266] Fix LoRA layers test by @hlahkar in #1276
- Skip guards after fully warmup the model by @anko-intel in #1272
- Replace in-place add with out-of-place add in layernorm forward_hpu. by @jmaksymc in #1281
- Add 256 as possible option within block-size arg by @ksmusz in #1279
- Flat KV cache layout by @kdamaszk in #1106
- [Bugfix] config.head_dim is now explicitly set to None (vllm-project#18432) by @adobrzyn in https://github.com/HabanaAI/vllm-fork/pull/...
v0.8.5+Gaudi-1.21.2-aice-v0
What's Changed
- Re-integrate HPU after upstream refactors by @kzawora-intel in #20
- Fix model_output_idx on HPU by @madamczyk-intel in #27
- Allow block_sizes: 64 and 128 by @madamczyk-intel in #28
- Rebase habana_main up to cc466a3 by @kzawora-intel in #26
- WA: Disable cumsum in HPU _prepare_prompt by @kzawora-intel in #30
- bs/seq bucketing for prompt and decode by @madamczyk-intel in #33
- Cleanup: Fix HPU auto-detection in setup.py by @kzawora-intel in #34
- Cleanup: Restore int64 sampling by @kzawora-intel in #35
- Cleanup: Llama whitespace fix by @kzawora-intel in #36
- Cleanup: Restore pyproject.toml by @kzawora-intel in #37
- Add vLLM high-level profiler by @DamianSzwichtenberg in #29
- Add release docs for Gaudi by @kzawora-intel in #32
- Minor: update release tag in README by @kzawora-intel in #39
- Fix error with high-level profiler in multi-card scenario by @DamianSzwichtenberg in #38
- Static fused moe op by @jkaniecki in #41
- WA: Remove pyproject.toml, bypass HPU autodetection by @kzawora-intel in #45
- Use setuptools older than 70.0.0 by @madamczyk-intel in #42
- Add VLLM_SKIP_WARMUP flag by @madamczyk-intel in #43
- Graphs v2 by @madamczyk-intel in #44
- Remove usage of wrap_in_hpu_graph in PT eager by @kzawora-intel in #47
- Add HPU support to benchmark_latency and benchmark_throughput by @kzawora-intel in #49
- Use int32 seeds for random sampler on HPU by @kzawora-intel in #50
- Add host memory profiling to HabanaMemoryProfiler by @kzawora-intel in #51
- Bump ray version to 2.23.0 by @kzawora-intel in #52
- Skip incompatible tests with HPU by @afierka-intel in #46
- Enable PA_SPLIT_VALUE by default by @kzawora-intel in #54
- Add syncs in mixtral weight loader by @jkaniecki in #55
- HPU: Change KV-cache layout by @madamczyk-intel in #56
- Add more detailed event names to profiler by @kzawora-intel in #57
- Disable value splitting by default on G3 by @madamczyk-intel in #58
- Fix for OOM in Llama 70b by @tzielinski-habana in #60
- Enable high-level profiler on multiple instances by @DamianSzwichtenberg in #61
- Add mark steps to prevent OOM in static moe op by @jkaniecki in #65
- Add Mistal&Mixtral supported configurations by @szutenberg in #64
- Normalize router weights in MoE OP by @jkaniecki in #72
- Revert "Disable value splitting by default on G3" by @tzielinski-habana in #74
- Add more metrics to high level profiler by @kzawora-intel in #63
- [Hardware][Gaudi]Add alibi support by @wenbinc-Bin in #69
- Remove allgather workaround in logits_processor by @kzawora-intel in #76
- habana_main rebase by @kzawora-intel in #81
- Conform to vLLM formatting rules by @kzawora-intel in #83
- SiLU memory leak in fwd by @michalkuligowski in #87
- habana_main rebase v4 by @kzawora-intel in #85
- Add workaround for RuntimeError: Invalid inputs for scatter_nd_onnx by @kzawora-intel in #107
- Refactor forward_hpu of RMSNorm by @kzawora-intel in #128
- Refactor & re-enable HPU RoPE for Gaudi1 by @kzawora-intel in #129
- formatting fixes by @kzawora-intel in #132
- Address upstream PR code review comments by @kzawora-intel in #133
- Whitespace fix by @kzawora-intel in #134
- Add torch.compile support by @kzawora-intel in #48
- habana_main rebase v5 by @kzawora-intel in #108
- Add constraints for HPU UnquantizedFusedMoEMethod by @kzawora-intel in #137
- Remove redundant torch.device call by @kzawora-intel in #139
- Add functools.wraps decorator to with_mark_steps by @kzawora-intel in #138
- Add HPU platform and HpuCommunicator for TP by @kzawora-intel in #136
- Re-enable FusedRoPE by @kzawora-intel in #145
- Overhaul HPU memory management in HPUGraph capture by @kzawora-intel in #147
- Allocate blocks from id=1 for HPU by @kdamaszk in #160
- Revert "Allocate blocks from id=1 for HPU" by @kzawora-intel in #163
- Reimplement silu_and_mul for mixtral by @jkaniecki in #167
- Enable GitHub Actions static checks for habana_main by @kzawora-intel in #177
- remove reminder_comment.yml by @kzawora-intel in #179
- Fix logger initialization in ops.py by @kzawora-intel in #178
- 1.17 documentation update by @kzawora-intel in #172
- Readme 1.17 update by @kzawora-intel in #186
- Support FP8 INC in vLLM by @nirda7 in #144
- [Doc][BugFix] Update setup instructions and reference links by @MohitIntel in #191
- split gptbigcode forward by @libinta in #194
- Enable FusedSDPA for prompt attention with VLLM_PROMPT_USE_FUSEDSDPA by @libinta in #168
- Enable LoRA support for HPU by @scsudhak-intel in #170
- Compile mode bug fix for LoRA by @scsudhak-intel in #196
- Ensure buckets do not exceed the batch token limit by @kzawora-intel in #206
- Make max_num_batched_tokens behavior more verbose, add legacy mode by @kzawora-intel in #208
- Update paddings computed to adjust selected_token_indices by @vivekgoe in #210
- Port not warmed-up configurations log warnings by @adobrzyn in #222
- Remove mark step from static MoE loop by @jkaniecki in #231
- Enable llama-405b - w/a for memory allocation error by @afierka-intel in #184
- [bugfix] handle large bucket minimums correctly by @kzawora-intel in #235
- Remove token budget from decode buckets by @kzawora-intel in #241
- [habana_main bugfix] Fix min bucket boundary calculation by @kzawora-intel in #239
- Mask based BGMV implementation by @hlahkar in #223
- Dispersed dummy slots by @madamczyk-intel in #243
- Use PT_COMPILE_ONLY_MODE during warmup by @mfylcek in #227
- Do not pass warmup_mode to execute_model_kwargs by @kzawora-intel in #229
- Add error handling for PT_COMPILE_ONLY_MODE by @kzawora-intel in #251
- Hardcode fastapi version due to pydantic error by @hlahkar in #255
- Mask based BGMV implementation for LoRA Embedding by @scsudhak-intel in #247
- Eliminate graph breaks for torch.compile mode by @yuwenzho in #202
- Port flat PA from habana_next to habana_main by @dolszewska in #169
- Add disable_tensor_cache=True to HPUGraph capture by @kzawora-intel in #252
- Fix dispersed slots by @madamczyk-intel in #261
- Skip compilation warnings during warmup phase by @jkaniecki in #262
- Port PT Profiler to habana_main by @adobrzyn in #256
- Fix Lo...
v0.8.5.post1+Gaudi-1.21.2
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.2 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.8.5.post1+Gaudi-1.21.2
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes using tensor parallelism with multiprocessing or Ray, and HCCL. | Documentation Example HCCL reference |
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (functional releas... |
v0.7.2+Gaudi-1.21.0
vLLM with Intel® Gaudi® AI Accelerators
This README provides instructions on how to run vLLM with Intel Gaudi devices.
Requirements and Installation
To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.21.0 and above
Quick Start Using Dockerfile
Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.
Ubuntu
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Red Hat Enterprise Linux for Use with Red Hat OpenShift AI
$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.7.2+Gaudi-1.21.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from the vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes using tensor parallelism with multiprocessing or Ray, and HCCL. | Documentation Example HCCL reference |
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node or multiple nodes with pipeline parallelism. | Documentation Running Pipeline Parallelism |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile. | vLLM HPU backend execution modes |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (functional release) ... |
v0.6.6.post1+Gaudi-1.20.0
vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.20.0
Requirements and Installation
Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Ubuntu 22.04 LTS OS
- Python 3.10
- Intel Gaudi 2 and 3 AI accelerators
- Intel Gaudi software version 1.20.0 and above
Quick Start Using Dockerfile
Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.6.post1+Gaudi-1.20.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node using tensor parallelism with Ray and HCCL. | Documentation Example HCCL reference |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile. | vLLM HPU backend execution modes |
Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi) such as mpt-7b. | vLLM supported models |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using the AutoAWQ library. | Library |
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using the AutoGPTQ library. | Library |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. | Documentation Example |
Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. The vLLM HPU backend supports it alongside Ray. | Documentation |
Unsupported Features
- Beam s...
v0.6.4.post2+Gaudi-1.19.0
vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.19.0
Requirements and Installation
Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
Requirements
- Ubuntu 22.04 LTS OS
- Python 3.10
- Intel Gaudi accelerator
- Intel Gaudi software version 1.19.0 and above
Quick Start Using Dockerfile
Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:
$ docker build -f Dockerfile.hpu -t vllm-hpu-env .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
Tip
If you encounter the error "docker: Error response from daemon: Unknown runtime specified habana.", please refer to the "Install Optional Packages" section of Install Driver and Software and the "Configure Container Runtime" section of Docker Installation. Make sure the habanalabs-container-runtime package is installed and the habana container runtime is registered.
Build from Source
Environment Verification
To verify that the Intel Gaudi software was correctly installed, run the following:
$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed
Refer to System Verification and Final Tests for more details.
Run Docker Image
It is highly recommended to use the latest Docker image from Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.
Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:
$ docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
Build and Install vLLM
There are currently several ways to install vLLM with Intel® Gaudi®; pick one of the following options:
1. Build and Install the stable version
vLLM releases are made periodically to align with Intel® Gaudi® software releases. The stable version is released as a tag and supports the fully validated features and performance optimizations of Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.4.post2+Gaudi-1.19.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop
2. Build and Install the latest from vLLM-fork
The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
3. Build and Install from vLLM main source
If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
Supported Features
Feature | Description | References |
---|---|---|
Offline batched inference | Offline inference using LLM class from vLLM Python API | Quickstart Example |
Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | Documentation Example |
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding. | N/A |
Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node using tensor parallelism with Ray and HCCL. | Documentation Example HCCL reference |
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time, to be later replayed during inference, significantly reducing host overheads. | Documentation vLLM HPU backend execution modes Optimization guide |
Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile. | vLLM HPU backend execution modes |
Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi) such as mpt-7b. | vLLM supported models |
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). | Documentation |
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | Documentation Example vLLM supported models |
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable by the standard --num-scheduler-steps parameter. | Feature RFC |
Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by the standard --enable-prefix-caching parameter. | Documentation Details |
Speculative decoding (experimental) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters. | Documentation Example |
Unsupported Features
- Beam search
- AWQ quantization
- Prefill chunking (mixed-batch inferencing)
Supported Configurations
The following configurations have been validated to be functional with Gaudi2 devices. Configurations that are not listed may or may not work.
- meta-llama/Llama-2-7b on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with random or greedy sampling
- meta-llama/Llama-2-7b-chat-hf on single HPU, or with tensor parallelism on 2x and 8x HPU, BF16 datatype with rando...
v0.6.4.post2+Gaudi-1.19.2
What's Changed
- Update CODEOWNERS by @iboiko-habana in #658
- [BLOCKER] Fix in v1.19.2 for dataclass error due to triton package update by @MohitIntel in #726
Full Changelog: v0.6.4.post2+Gaudi-1.19.0...v0.6.4.post2+Gaudi-1.19.2
v0.6.4.post2+Gaudi-1.19.1
What's Changed
- Update CODEOWNERS by @iboiko-habana in #658
- [BLOCKER] Fix in v1.19.1 for dataclass error due to triton package update by @MohitIntel in #727
Full Changelog: v0.6.4.post2+Gaudi-1.19.0...v0.6.4.post2+Gaudi-1.19.1