Releases: HabanaAI/vllm-fork

v0.9.0.1+Gaudi-1.22.0

05 Sep 07:29
2e9b2b3

vLLM with Intel® Gaudi® AI Accelerators

This README provides instructions on how to run vLLM with Intel Gaudi devices.

Requirements and Installation

To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • Python 3.10
  • Intel Gaudi 2 and 3 AI accelerators
  • Intel Gaudi software version 1.22.0 and above

Running vLLM on Gaudi with Docker Compose

Starting with the 1.22 release, we are introducing ready-to-run container images that bundle vLLM and Gaudi software. Please follow the instructions to quickly launch vLLM on Gaudi using a prebuilt Docker image and Docker Compose, with options for custom parameters and benchmarking.
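
As a minimal sketch of the flow (the compose file location, service definition, and the MODEL variable below are assumptions for illustration; the linked instructions are authoritative for this release):

$ cd vllm-fork                                      # assumes the compose file ships inside this repository
$ export MODEL=meta-llama/Llama-3.1-8B-Instruct     # hypothetical model choice; pick any supported model
$ docker compose up -d                              # pull the prebuilt vLLM-on-Gaudi image and start the server
$ docker compose logs -f                            # follow the server logs until it reports it is ready
$ docker compose down                               # stop and remove the containers when done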

Quick Start Using Dockerfile

Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.

Ubuntu

$ docker build -f Dockerfile.hpu -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Tip

If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.

Red Hat Enterprise Linux for Use with Red Hat OpenShift AI

Note

Prerequisite: Starting with Intel Gaudi software version 1.22.x, the RHEL Docker image must be built manually before running the commands below. Additionally, the path to that Docker image must be updated in the Dockerfile.hpu.ubi file.
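
For example (a sketch only; <your-rhel-gaudi-image:tag> is a placeholder for the image you built manually in the previous step):

$ grep -n '^FROM' Dockerfile.hpu.ubi   # locate the base-image line(s) that reference the Gaudi RHEL image
$ # edit the file so that line points at the image you built, e.g. FROM <your-rhel-gaudi-image:tag>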

$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Build from Source

Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to System Verification and Final Tests for more details.
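
As an additional quick check (a sketch, assuming the habana_frameworks PyTorch bridge listed above is installed), you can confirm that PyTorch can see the accelerators:

$ python -c "import habana_frameworks.torch.hpu as hthpu; print('HPU available:', hthpu.is_available()); print('Device count:', hthpu.device_count())"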

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:

$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest

Build and Install vLLM

There are multiple ways to install vLLM with Intel® Gaudi®; pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. Each stable version is tagged and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.9.0.1+Gaudi-1.22.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from the vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop
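
Whichever option you choose, a quick offline-inference smoke test can confirm the build works end to end (a sketch; the model name is only an example and is downloaded from Hugging Face on first use):

$ python - <<'EOF'
from vllm import LLM, SamplingParams

# Small example model; swap in any model supported on Gaudi.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
EOF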

Supported Features

Feature | Description | References
Offline batched inference | Offline inference using the LLM class from the vLLM Python API | Quickstart, Example
Online inference via OpenAI-Compatible Server | Online inference using an HTTP server that implements the OpenAI Chat and Completions API | Documentation, Example
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup | N/A
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices | N/A
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding | N/A
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism, using multiprocessing or Ray together with HCCL | Documentation, Example, HCCL reference
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism | Documentation, Running Pipeline Parallelism
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs are recorded ahead of time and replayed later during inference, significantly reducing host overheads | Documentation, vLLM HPU backend execution modes, Optimization guide
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile, with full support for FP8 and BF16 precisions | vLLM HPU backend execution modes
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC) | Documentation
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using the AutoAWQ library | Library
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using the AutoGPTQ library | Library
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models | Documentation, Example, vLLM supported models
Multi-step schedulin...

v0.8.5+Gaudi-1.22.0-aice-v0

13 Aug 08:17
Pre-release

What's Changed


v0.8.5.post1+Gaudi-1.21.3

22 Jul 17:34
3bcdfd4
Pre-release

What's Changed


v0.8.5+Gaudi-1.21.2-aice-v0

16 Jul 02:12
578b34a
Pre-release

What's Changed


v0.8.5.post1+Gaudi-1.21.2

01 Jul 14:53
9f1222c

vLLM with Intel® Gaudi® AI Accelerators

This README provides instructions on how to run vLLM with Intel Gaudi devices.

Requirements and Installation

To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • Python 3.10
  • Intel Gaudi 2 and 3 AI accelerators
  • Intel Gaudi software version 1.21.2 and above

Quick Start Using Dockerfile

Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.

Ubuntu

$ docker build -f Dockerfile.hpu -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Tip

If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.

Red Hat Enterprise Linux for Use with Red Hat OpenShift AI

$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Build from Source

Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to System Verification and Final Tests for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:

$ docker pull vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Build and Install vLLM

There are multiple ways to install vLLM with Intel® Gaudi®; pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. Each stable version is tagged and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.8.5.post1+Gaudi-1.21.2
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from the vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

Feature | Description | References
Offline batched inference | Offline inference using the LLM class from the vLLM Python API | Quickstart, Example
Online inference via OpenAI-Compatible Server | Online inference using an HTTP server that implements the OpenAI Chat and Completions API | Documentation, Example
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup | N/A
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices | N/A
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding | N/A
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism, using multiprocessing or Ray together with HCCL | Documentation, Example, HCCL reference
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism | Documentation, Running Pipeline Parallelism
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs are recorded ahead of time and replayed later during inference, significantly reducing host overheads | Documentation, vLLM HPU backend execution modes, Optimization guide
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile | vLLM HPU backend execution modes
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC); not fully supported with the torch.compile execution mode | Documentation
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using the AutoAWQ library | Library
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using the AutoGPTQ library | Library
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models | Documentation, Example, vLLM supported models
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable via the standard --num-scheduler-steps parameter | Feature RFC
Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable via the standard --enable-prefix-caching parameter | Documentation, Details
Speculative decoding (functional releas...

v0.7.2+Gaudi-1.21.0

19 May 14:07
0275ce4

vLLM with Intel® Gaudi® AI Accelerators

This README provides instructions on how to run vLLM with Intel Gaudi devices.

Requirements and Installation

To set up the execution environment, please follow the instructions in the Gaudi Installation Guide. To achieve the best performance on HPU, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • Python 3.10
  • Intel Gaudi 2 and 3 AI accelerators
  • Intel Gaudi software version 1.21.0 and above

Quick Start Using Dockerfile

Set up the container with the latest Intel Gaudi Software Suite release using the Dockerfile.

Ubuntu

$ docker build -f Dockerfile.hpu -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Tip

If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to the "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.

Red Hat Enterprise Linux for Use with Red Hat OpenShift AI

$ docker build -f Dockerfile.hpu.ubi -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Build from Source

Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to System Verification and Final Tests for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:

$ docker pull vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.21.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Build and Install vLLM

There are multiple ways to install vLLM with Intel® Gaudi®; pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. Each stable version is tagged and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.7.2+Gaudi-1.21.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install --upgrade pip
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from the vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

Feature | Description | References
Offline batched inference | Offline inference using the LLM class from the vLLM Python API | Quickstart, Example
Online inference via OpenAI-Compatible Server | Online inference using an HTTP server that implements the OpenAI Chat and Completions API | Documentation, Example
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup | N/A
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices | N/A
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding | N/A
Tensor parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across multiple nodes with tensor parallelism, using multiprocessing or Ray together with HCCL | Documentation, Example, HCCL reference
Pipeline parallel inference (single or multi-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across single or multiple nodes with pipeline parallelism | Documentation, Running Pipeline Parallelism
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs are recorded ahead of time and replayed later during inference, significantly reducing host overheads | Documentation, vLLM HPU backend execution modes, Optimization guide
Inference with torch.compile | vLLM HPU backend supports inference with torch.compile | vLLM HPU backend execution modes
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC); not fully supported with the torch.compile execution mode | Documentation
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using the AutoAWQ library | Library
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using the AutoGPTQ library | Library
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models | Documentation, Example, vLLM supported models
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable via the standard --num-scheduler-steps parameter | Feature RFC
Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable via the standard --enable-prefix-caching parameter | Documentation, Details
Speculative decoding (functional release) ...

v0.6.6.post1+Gaudi-1.20.0

26 Feb 09:53
6af2f67

vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.20.0

Requirements and Installation

Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • Ubuntu 22.04 LTS OS
  • Python 3.10
  • Intel Gaudi 2 and 3 AI accelerators
  • Intel Gaudi software version 1.20.0 and above

Quick Start Using Dockerfile

Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:

$ docker build -f Dockerfile.hpu -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Tip

If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.

Build from Source

Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to System Verification and Final Tests for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:

$ docker pull vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

Build and Install vLLM

There are multiple ways to install vLLM with Intel® Gaudi®; pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. Each stable version is tagged and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.6.post1+Gaudi-1.20.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from the vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

Feature | Description | References
Offline batched inference | Offline inference using the LLM class from the vLLM Python API | Quickstart, Example
Online inference via OpenAI-Compatible Server | Online inference using an HTTP server that implements the OpenAI Chat and Completions API | Documentation, Example
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup | N/A
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices | N/A
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding | N/A
Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node with tensor parallelism, using Ray and HCCL | Documentation, Example, HCCL reference
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs are recorded ahead of time and replayed later during inference, significantly reducing host overheads | Documentation, vLLM HPU backend execution modes, Optimization guide
Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile | vLLM HPU backend execution modes
Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi), such as mpt-7b | vLLM supported models
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC) | Documentation
AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using the AutoAWQ library | Library
AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using the AutoGPTQ library | Library
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models | Documentation, Example, vLLM supported models
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable via the standard --num-scheduler-steps parameter | Feature RFC
Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable via the standard --enable-prefix-caching parameter | Documentation, Details
Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters | Documentation, Example
Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM; the vLLM HPU backend supports it alongside Ray (a launch sketch follows this table) | Documentation
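
For illustration, a hedged launch sketch for the tensor parallel and multiprocessing/Ray rows above (the model name and port are placeholders; the flags shown are standard vLLM CLI options):

$ # 2-way tensor parallelism on a single node, using the default multiprocessing runtime
$ vllm serve facebook/opt-6.7b \
    --tensor-parallel-size 2 \
    --distributed-executor-backend mp \
    --port 8000
$ # to run on Ray instead, pass --distributed-executor-backend ray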

Unsupported Features

  • Beam s...

v0.6.4.post2+Gaudi-1.19.0

12 Feb 09:46
faf27e2

vLLM with Intel® Gaudi® AI Accelerators - Gaudi Software Suite 1.19.0

Requirements and Installation

Please follow the instructions provided in the Gaudi Installation Guide to set up the execution environment. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.

Requirements

  • Ubuntu 22.04 LTS OS
  • Python 3.10
  • Intel Gaudi accelerator
  • Intel Gaudi software version 1.19.0 and above

Quick Start Using Dockerfile

Set up the container with the latest release of the Gaudi Software Suite using the Dockerfile:

$ docker build -f Dockerfile.hpu -t vllm-hpu-env  .
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env

Tip

If you are facing the following error: docker: Error response from daemon: Unknown runtime specified habana., please refer to "Install Optional Packages" section of Install Driver and Software and "Configure Container Runtime" section of Docker Installation. Make sure you have habanalabs-container-runtime package installed and that habana container runtime is registered.

Build from Source

Environment Verification

To verify that the Intel Gaudi software was correctly installed, run the following:

$ hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
$ apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
$ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
$ pip list | grep neural # verify that neural-compressor is installed

Refer to System Verification and Final Tests for more details.

Run Docker Image

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the Intel Gaudi documentation for more details.

Use the following commands to run a Docker image. Make sure to update the versions below as listed in the Support Matrix:

$ docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest

Build and Install vLLM

There are multiple ways to install vLLM with Intel® Gaudi®; pick one of the following options:

1. Build and Install the stable version

vLLM releases are published periodically to align with Intel® Gaudi® software releases. Each stable version is tagged and supports the fully validated features and performance optimizations in Gaudi's vLLM-fork. To install the stable release from HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout v0.6.4.post2+Gaudi-1.19.0
$ pip install -r requirements-hpu.txt
$ python setup.py develop

2. Build and Install the latest from vLLM-fork

The latest features and performance optimizations are developed in Gaudi's vLLM-fork and periodically upstreamed to the vLLM main repository. To install the latest HabanaAI/vLLM-fork, run the following:

$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop

3. Build and Install from the vLLM main source

If you prefer to build and install directly from the main vLLM source, to which new features are periodically upstreamed, run the following:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -r requirements-hpu.txt
$ python setup.py develop

Supported Features

Feature | Description | References
Offline batched inference | Offline inference using the LLM class from the vLLM Python API | Quickstart, Example
Online inference via OpenAI-Compatible Server | Online inference using an HTTP server that implements the OpenAI Chat and Completions API | Documentation, Example
HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup | N/A
Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices | N/A
Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, and Rotary Positional Encoding | N/A
Tensor parallel inference (single-node multi-HPU) | vLLM HPU backend supports multi-HPU inference across a single node with tensor parallelism, using Ray and HCCL | Documentation, Example, HCCL reference
Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs are recorded ahead of time and replayed later during inference, significantly reducing host overheads | Documentation, vLLM HPU backend execution modes, Optimization guide
Inference with torch.compile (experimental) | vLLM HPU backend experimentally supports inference with torch.compile | vLLM HPU backend execution modes
Attention with Linear Biases (ALiBi) | vLLM HPU backend supports models utilizing Attention with Linear Biases (ALiBi), such as mpt-7b | vLLM supported models
INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC) | Documentation
LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models | Documentation, Example, vLLM supported models
Multi-step scheduling support | vLLM HPU backend includes multi-step scheduling support for host overhead reduction, configurable via the standard --num-scheduler-steps parameter | Feature RFC
Automatic prefix caching (experimental) | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable via the standard --enable-prefix-caching parameter | Documentation, Details
Speculative decoding (experimental) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via the standard --speculative_model and --num_speculative_tokens parameters (a launch sketch follows this table) | Documentation, Example
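
A hedged launch sketch for the speculative decoding row above (target and draft models are placeholders; the flags are written with the CLI's dashed spelling of the parameters named in the table):

$ vllm serve facebook/opt-6.7b \
    --speculative-model facebook/opt-125m \
    --num-speculative-tokens 5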

Unsupported Features

  • Beam search
  • AWQ quantization
  • Prefill chunking (mixed-batch inferencing)

Supported Configurations

The following configurations have been validated to function with Gaudi2 devices. Configurations that are not listed may or may not work.


v0.6.4.post2+Gaudi-1.19.2

10 Feb 23:19
61f141c
Pre-release

What's Changed

Full Changelog: v0.6.4.post2+Gaudi-1.19.0...v0.6.4.post2+Gaudi-1.19.2

v0.6.4.post2+Gaudi-1.19.1

10 Feb 23:17
1ea378e
Pre-release

What's Changed

Full Changelog: v0.6.4.post2+Gaudi-1.19.0...v0.6.4.post2+Gaudi-1.19.1