Deep Learning Training Service v1.4.0

@Anbang-Hu Anbang-Hu released this 04 Feb 05:17
· 10 commits to r1.4 since this release

Job Manager

  • Improve 95th percentile job creation time (from job submission to "scheduling") from 400s to 46s.
  • Speed up job initialization by prebuilding and copying required apt packages from an init container
  • Per-user password for SSH login to user jobs
  • Azure blobfuse plugin(s) for a job
  • Custom Docker registry secret(s) for a job
  • Scheduling jobs on pure CPU machines
  • VC machine hard assignment
  • Provide consistent environment variables for training in both interactive and non-interactive SSH sessions

RESTful API

  • Improve 95th percentile latency for job info and permission-related RESTful APIs from 2000ms to <500ms.
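
For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute a 95th percentile, the nearest-rank method. The release does not state which method was used to produce the figures above, and the function name and sample values here are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds, for illustration only.
latencies_ms = [120, 95, 430, 210, 180, 2500, 160, 140, 300, 110]
p95_ms = percentile_nearest_rank(latencies_ms, 95)
```

A percentile target like this tracks the slow tail of requests rather than the average, so a single outlier among a small number of samples can dominate the result.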

Web Portal (Dashboard)

  • Speed up page loading for "View and Manage Jobs" ("View and Manage Jobs V2")
  • Dashboard as a Kubernetes service

User Synchronization

  • Automate the user/group permission update process

Storage Manager

  • Scan NFS and send an alert email for over-sized (boundary) paths when NFS storage usage exceeds a threshold.
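
The scan described above could look roughly like the sketch below. It is not the Storage Manager's actual code: the function names, the choice to treat each top-level directory under the mount point as a scan unit, and both thresholds are assumptions, and the email step is left out.

```python
import os
import shutil


def usage_fraction(mount_point):
    """Return the used fraction (0.0 to 1.0) of the filesystem at mount_point."""
    usage = shutil.disk_usage(mount_point)
    return usage.used / usage.total if usage.total else 0.0


def dir_size_bytes(path):
    """Sum the sizes of all files under path, skipping unreadable entries."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue  # file vanished or is not accessible
    return total


def find_oversized_paths(mount_point, usage_threshold, size_limit_bytes):
    """Return (path, size) pairs over the limit, largest first.

    The scan runs only when overall usage exceeds usage_threshold.
    Assumption: each top-level directory is one scan unit.
    """
    if usage_fraction(mount_point) <= usage_threshold:
        return []
    oversized = []
    with os.scandir(mount_point) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = dir_size_bytes(entry.path)
                if size > size_limit_bytes:
                    oversized.append((entry.path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Checking overall usage first keeps the expensive directory walk off the common path where the share still has headroom.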

Repair Manager

  • Detect uncorrectable ECC errors and send an alert email

Fundamental

  • Fix occasional NFS mount failures upon machine restart