
Releases: NVIDIA/k8s-device-plugin

v0.15.0

17 Apr 12:22
435bfb7

The NVIDIA GPU Device Plugin v0.15.0 release includes the following major changes:

Consolidated the NVIDIA GPU Device Plugin and NVIDIA GPU Feature Discovery repositories

Since the NVIDIA GPU Device Plugin and GPU Feature Discovery (GFD) components are often used together, we have consolidated the repositories. The primary goal was to streamline the development and release process; functionality remains unchanged. The user-facing changes are as follows:

  • The two components will use the same version, meaning that the GFD version jumps from v0.8.2 to v0.15.0.
  • The two components now use the same container image, meaning that nvcr.io/nvidia/k8s-device-plugin is to be used instead of nvcr.io/nvidia/gpu-feature-discovery. Note that this may mean that the gpu-feature-discovery command needs to be explicitly specified.

In order to facilitate the transition for users that rely on a standalone GFD deployment, this release includes a gpu-feature-discovery helm chart in the device plugin helm repository.
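Migrating a standalone GFD deployment to the consolidated repository could then look like the following sketch (the nvdp repository alias and the nvdp-gfd release name are illustrative, not required values):

```shell
# Add (or update) the device plugin helm repository, which now also hosts GFD.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install the standalone gpu-feature-discovery chart at the consolidated version.
helm upgrade -i nvdp-gfd nvdp/gpu-feature-discovery \
  --version 0.15.0 \
  --namespace gpu-feature-discovery \
  --create-namespace
```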

Added experimental support for GPU partitioning using MPS.

This release of the NVIDIA GPU Device Plugin includes experimental support for GPU sharing using CUDA MPS. Feedback on this feature is appreciated.

This functionality is not production-ready and has a number of known issues, including:

  • The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.
  • There is no synchronization between the CUDA MPS control daemon and the GPU Device Plugin under restarts or configuration changes. This means that workloads may crash if they lose access to shared resources controlled by the CUDA MPS control daemon.
  • MPS is only supported for full GPUs.
  • It is not possible to "combine" MPS GPU requests to allow for access to more memory by a single container.
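As a sketch of what opting in looks like, the MPS stanza of the plugin configuration mirrors the existing time-slicing one (the replica count here is an arbitrary example):

```yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
```

With this configuration each full GPU is advertised as 10 nvidia.com/gpu replicas and, per the limitations above, a container may request at most one of them.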

Deprecation Notice

The following table shows a set of new CUDA driver and runtime version labels and their existing equivalents. The existing labels should be considered deprecated and will be removed in a future release.

New Label                                  Deprecated Label
nvidia.com/cuda.driver-version.major       nvidia.com/cuda.driver.major
nvidia.com/cuda.driver-version.minor       nvidia.com/cuda.driver.minor
nvidia.com/cuda.driver-version.revision    nvidia.com/cuda.driver.rev
nvidia.com/cuda.driver-version.full        (no existing equivalent)
nvidia.com/cuda.runtime-version.major      nvidia.com/cuda.runtime.major
nvidia.com/cuda.runtime-version.minor      nvidia.com/cuda.runtime.minor
nvidia.com/cuda.runtime-version.full       (no existing equivalent)
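Workloads that select nodes on these labels should move to the new names. A hypothetical pod that pins itself to nodes with a major CUDA driver version of 12 would use the new label like this (pod name, container name, and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-12-workload        # placeholder name
spec:
  nodeSelector:
    # New label; replaces the deprecated nvidia.com/cuda.driver.major
    nvidia.com/cuda.driver-version.major: "12"
  containers:
  - name: app                   # placeholder container
    image: my-cuda-app:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1
```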

Full Changelog: v0.14.0...v0.15.0

Changes since v0.15.0-rc.2

  • Moved the nvidia-device-plugin.yml static deployment from the root of the repository to deployments/static/nvidia-device-plugin.yml.
  • Simplify PCI device classes in NFD worker configuration.
  • Update CUDA base image version to 12.4.1.
  • Switch to an Ubuntu 22.04-based CUDA image for the default image.
  • Add new CUDA driver and runtime version labels to align with other NFD version labels.
  • Update NFD dependency to v0.15.3.

v0.15.0-rc.2

  • Bump CUDA base image version to 12.3.2
  • Add cdi-cri device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.
  • Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where
    these limits are not applied for devices if set by UUID.
  • Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.
  • Set mps device memory limit by index.
  • Explicitly set sharing.mps.failRequestsGreaterThanOne = true.
  • Run tail -f for each MPS daemon to output logs.
  • Enforce replica limits for MPS sharing.
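The new cdi-cri strategy is selected in the same way as the existing device list strategies. A minimal plugin config file sketch (the field layout follows the documented v1 config format; only the deviceListStrategy value is new):

```yaml
version: v1
flags:
  plugin:
    # Pass allocated devices via the CDIDevices CRI field rather than
    # via environment variables or annotations.
    deviceListStrategy: cdi-cri
```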

v0.15.0-rc.1

  • Import GPU Feature Discovery into the GPU Device Plugin repo. This means that the same version and container image is used for both components.
  • Add tooling to create a kind cluster for local development and testing.
  • Update go-gpuallocator dependency to migrate away from the deprecated gpu-monitoring-tools NVML bindings.
  • Remove legacyDaemonsetAPI config option. This was only required for k8s versions < 1.16.
  • Add support for MPS sharing.
  • Bump CUDA base image version to 12.3.1

v0.15.0-rc.2

18 Mar 11:48

What's Changed

  • Bump CUDA base image version to 12.3.2
  • Add cdi-cri device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.
  • Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where
    these limits are not applied for devices if set by UUID.
  • Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.
  • Set mps device memory limit by index.
  • Explicitly set sharing.mps.failRequestsGreaterThanOne = true.
  • Run tail -f for each MPS daemon to output logs.
  • Enforce replica limits for MPS sharing.

v0.14.5

29 Feb 10:23
3d549fb

What's Changed

  • Update the nvidia-container-toolkit go dependency. This fixes a bug in CDI spec generation on systems where lib -> usr/lib symlinks exist.
  • Update the CUDA base images to 12.3.2

Full Changelog: v0.14.4...v0.14.5

v0.15.0-rc.1

26 Feb 13:59
Pre-release

What's Changed

  • Import GPU Feature Discovery into the GPU Device Plugin repo. This means that the same version and container image is used for both components.
  • Add tooling to create a kind cluster for local development and testing.
  • Update go-gpuallocator dependency to migrate away from the deprecated gpu-monitoring-tools NVML bindings.
  • Remove legacyDaemonsetAPI config option. This was only required for k8s versions < 1.16.
  • Add support for MPS sharing.
  • Bump CUDA base image version to 12.3.1

Full Changelog: v0.14.0...v0.15.0-rc.1

v0.14.4

29 Jan 14:42
cde1a66

What's Changed

  • Update to refactored go-gpuallocator code. This permanently fixes the NVML_NVLINK_MAX_LINKS value addressed in a
    hotfix in v0.14.3. This also addresses a bug due to uninitialized NVML when calling go-gpuallocator.

Full Changelog: v0.14.3...v0.14.4

v0.14.3

15 Nov 13:09

Bug fixes

  • Patched vendored NVML_NVLINK_MAX_LINKS to 18 to support devices with 18 NVLinks

Dependency updates

  • Bumped CUDA base images version to 12.3.0

Full Changelog: v0.14.2...v0.14.3

v0.14.2

20 Oct 09:59

This release bumps dependencies.

Dependency Updates

  • Updated CUDA Base Image to 12.2.2
  • Updated GPU Feature Discovery version to v0.8.2

Full Changelog: v0.14.1...v0.14.2

v0.14.1

13 Jul 09:35

This release fixes bugs and bumps dependencies.

Bug fixes

  • Fixed parsing of deviceListStrategy in device plugin config (#410)

Dependency Updates

  • Updated CUDA Base Image to 12.2.0
  • Update GPU Feature Discovery version to v0.8.1
  • Update Node Feature Discovery to v0.13.2
  • Updated Go dependencies.

Full Changelog: v0.14.0...v0.14.1

v0.14.0

03 Apr 21:09

Full Changelog: v0.13.0...v0.14.0

Changes

  • Promote v0.14.0-rc.3 to v0.14.0
  • Bumped nvidia-container-toolkit dependency to latest version for newer CDI spec generation code
  • Updated GFD subchart to version v0.8.0

Changes from v0.14.0-rc.3

  • Removed the --cdi-enabled config option and instead trigger CDI injection based on cdi-annotation strategy.
  • Bumped go-nvlib dependency to latest version to support new MIG profiles.
  • Added cdi-annotation-prefix config option to control how CDI annotations are generated.
  • Renamed driver-root-ctr-path config option added in v0.14.0-rc.1 to container-driver-root.
  • Updated GFD subchart to version v0.8.0-rc.2

Changes from v0.14.0-rc.2

  • Fix bug from v0.14.0-rc.1 when using cdi-enabled=false

Changes from v0.14.0-rc.1

  • Added --cdi-enabled flag to GPU Device Plugin. With this enabled, the device plugin will generate CDI specifications for available NVIDIA devices. Allocation will add CDI annotations (cdi.k8s.io/*) to the response. These are read by a CDI-enabled runtime to make the required modifications to a container being created.
  • Updated GFD subchart to version v0.8.0-rc.1
  • Bumped Golang version to 1.20.1
  • Bumped CUDA base images version to 12.1.0
  • Switched to klog for logging
  • Added a static deployment file for Microshift
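For context, the CDI annotations attached to the allocation response follow the generic cdi.k8s.io/* key format from the CDI specification, with fully-qualified device names as values. A hand-written illustration (the key suffix and device index are hypothetical):

```yaml
metadata:
  annotations:
    # General form: cdi.k8s.io/<unique-suffix>: <vendor>/<class>=<device>
    cdi.k8s.io/nvidia-device-plugin_ctr0: nvidia.com/gpu=0   # hypothetical
```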

Note:

The container image nvcr.io/nvidia/k8s-device-plugin:v0.14.0-ubi8 contains the following high-severity CVEs:

  • CVE-2023-0286 - Vulnerability found in os package type (rpm) - openssl-libs
  • CVE-2023-24329 - Vulnerability found in os package type (rpm) - platform-python and python3-libs

v0.14.0-rc.3

29 Mar 12:56
Pre-release

Full Changelog: v0.14.0-rc.2...v0.14.0-rc.3

Changes

  • Removed the --cdi-enabled config option and instead trigger CDI injection based on cdi-annotation strategy.
  • Bumped go-nvlib dependency to latest version to support new MIG profiles.
  • Added cdi-annotation-prefix config option to control how CDI annotations are generated.
  • Renamed driver-root-ctr-path config option added in v0.14.0-rc.1 to container-driver-root.
  • Updated GFD subchart to version v0.8.0-rc.2