20.08

Pre-release
@dholt released this 14 Aug 18:30
· 48 commits to release-20.08 since this release

DeepOps 20.08 Release Notes

NOTE: Use the 20.08.1 release instead of this one; it includes various bug fixes.

What's New

  • DGX A100 support
  • NVIDIA HPC SDK
  • Spack package manager
  • HPL Burn-in test
  • MPI Operator

Changes

  • Slurm 20.02.4, Pyxis v0.8.0, Enroot v3.1.1
  • Kubernetes v1.17.9 (Kubespray v2.13.3), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • DGX OS 4.5
  • DGX role updated to current versions/packages
  • K8S DCGM Exporter 1.7.2 (port switch from 9101 to 9400)
  • Bug fixes and enhancements
  • Default NFS configurations have changed

Bugs/Enhancements

  • General Kubeflow installation and polling improvements (along with Jenkins tests)
  • Kubeflow deletion now actually deletes Kubeflow along with Istio, cert-manager, etc.
  • Kubeflow installation now automatically installs the MPI Operator
  • DCGM/Grafana dashboard updates
  • General cleanup and version pinning in K8S monitoring deployment script
  • Improved Jenkins testing (new tests for Spack, Kubeflow, and CentOS; additional debugging, scale tests, and fixes)
  • Pinned Rook/Ceph versions
  • Updated, improved, and spell-checked documentation (slurm-perf, Kubeflow, Kubernetes, Lmod, Spack, EasyBuild)
  • Slurm MPI now defaults to pmix if available
  • golang galaxy role bumped to 2.4.0
  • Improved Trident usability
  • New default config variables (install_chrony, ...)
  • General reorg of Slurm role and slurm-cluster.yml
  • Dedicated lmod playbook
  • Replaced a few helm repos with stable version
  • GPU plugin is now installed with helm install

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 20.06, run git diff 20.08 20.06 -- config.example/
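
As a complement to reading the diff, the following is a minimal sketch of one way to list variables that exist in an example file but are missing from the matching file in an existing config. The helper name list_new_vars is hypothetical and not part of DeepOps. It assumes the variables are top-level, uncommented YAML keys (name: at the start of a line), so nested or commented-out keys are ignored.

```shell
# Hypothetical helper (not part of DeepOps): print top-level YAML keys
# that appear in an example file but not in an existing config file.
# Assumption: variables are uncommented "name:" entries at column 0.
list_new_vars() {
  example_file="$1"
  config_file="$2"
  keys_example="$(mktemp)"
  keys_config="$(mktemp)"

  # Extract "name:" at the start of a line, drop the colon, sort uniquely.
  grep -oE '^[A-Za-z_][A-Za-z0-9_]*:' "$example_file" | tr -d ':' | sort -u > "$keys_example"
  grep -oE '^[A-Za-z_][A-Za-z0-9_]*:' "$config_file" | tr -d ':' | sort -u > "$keys_config"

  # comm -23 keeps only the lines unique to the first sorted input,
  # i.e. keys defined in the example but absent from the config.
  comm -23 "$keys_example" "$keys_config"

  rm -f "$keys_example" "$keys_config"
}
```

It could be called as list_new_vars config.example/group_vars/all.yml config/group_vars/all.yml, where both paths are assumptions about the repository layout; adjust them to your checkout. Any name it prints (for example install_chrony, if your config predates it) is a candidate to copy into the existing config.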

You must also upgrade Helm on your provisioner node. You can do this manually, using ./scripts/install_helm.sh as a reference.