MLPerf Checkpointing on AKS #91

wolfgang-desalvador · 2025-11-14T16:21:10Z

This pull request introduces a new MLPerf Storage checkpoint benchmark deployment for Azure Kubernetes Service (AKS), including a Helm chart, Docker image, and supporting documentation. It also adds enhancements to the AKS deployment script, such as optional operator toggles and automated Azure Container Registry (ACR) provisioning and attachment. The changes streamline the process for benchmarking storage on AKS and improve cluster setup automation.

MLPerf Storage Benchmark Deployment:

Added a new Helm chart (mlperf-storage-checkpoint) for deploying the MLPerf Storage checkpoint benchmark on AKS with the Kubeflow MPI Operator, including templates, values, helper functions, and release notes. [1] [2] [3] [4] [5]
Created a comprehensive README for running the checkpoint benchmark on AKS, detailing prerequisites, quick start, configuration, scaling, and architecture.

Docker Image Build and Workflow:

Added a Dockerfile for building the MLPerf Storage benchmark container, and a GitHub Actions workflow to automate building and publishing the image to the GitHub Container Registry. [1] [2]

AKS Deployment Script Enhancements:

Introduced optional toggles for operator installation and integrated ACR provisioning in the AKS deployment script, allowing users to skip certain steps or automate ACR creation and cluster attachment. [1] [2] [3]
Improved AMLFS role assignment logic to also assign Contributor role on the primary resource group, ensuring broader access for storage operations.

Copilot

Pull Request Overview

This PR introduces MLPerf Storage checkpoint benchmarking capabilities for AKS, including a complete Helm chart deployment, containerized benchmark environment, and enhanced AKS deployment automation with optional ACR integration.

Key Changes:

Added Helm chart for MLPerf Storage checkpoint benchmark deployment using Kubeflow MPI Operator with configurable worker scaling and resource allocation
Implemented automated ACR provisioning and attachment in AKS deployment script with optional operator installation toggles
Created Docker image build pipeline with GitHub Actions for publishing MLPerf Storage benchmark container

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
infrastructure_references/aks/scripts/deploy-aks.sh	Enhanced deployment script with ACR integration, operator installation toggles, and extended AMLFS role assignments
examples/mlperf_storage/docker/Dockerfile	Ubuntu 22.04-based container with MLPerf Storage v2.0, OpenMPI, and SSH configuration for distributed benchmarking
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/values.yaml	Helm values defining benchmark parameters, MPI configuration, resource requests, and storage mounting
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/mpijob.yaml	MPIJob template with launcher/worker specs, SSH readiness checks, and checkpoint benchmark execution
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/_helpers.tpl	Standard Helm helper functions for naming, labels, and chart metadata
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/NOTES.txt	Post-install instructions with status commands, configuration summary, and usage examples
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/Chart.yaml	Helm chart metadata defining application type and version 0.1.0
examples/mlperf_storage/aks/README.md	Comprehensive guide covering prerequisites, deployment steps, configuration options, and MPIJob architecture
.github/workflows/build_mlperf_storage.yml	GitHub Actions workflow for building and publishing MLPerf Storage container to GHCR

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

infrastructure_references/aks/scripts/deploy-aks.sh

.github/workflows/build_mlperf_storage.yml

examples/mlperf_storage/docker/Dockerfile

.github/workflows/build_mlperf_storage.yml

examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/mpijob.yaml

examples/mlperf_storage/aks/README.md

examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/values.yaml

… size calculation, and training run - Implemented Helm chart for MLPerf Storage checkpointing with MPIJob configuration. - Added Helm chart for MLPerf Storage training dataset generation, including job specifications and resource management. - Introduced Helm chart for calculating training dataset size, with detailed configuration and command invocation. - Created Helm chart for executing MLPerf Storage training runs using MPI, with comprehensive logging and resource allocation. - Defined default values and configurations for all new charts, ensuring compatibility with Kueue workload management. - Enhanced documentation in NOTES.txt for each chart, providing users with clear instructions and configuration summaries.

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 11 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-25T19:36:19Z

infrastructure_references/aks/scripts/deploy-aks.sh

+	local acr_location="${ACR_LOCATION:-${AZURE_REGION}}"
+
+	# Always randomize if user didn't override ACR_NAME (still default "${CLUSTER_NAME}acr"). If ACR_RANDOMIZE=true, re-randomize regardless.
+	if [[ -z "${ACR_NAME}" || "${ACR_RANDOMIZE}" == "true" ]]; then


The condition at line 698 has a logic issue. The variable ACR_NAME is initialized as an empty string ("") on line 59, which means -z "${ACR_NAME}" will always be true even when the user hasn't overridden it. This causes the randomization to always occur, making the comment on line 697 misleading.

Consider changing line 59 to not set a default value:

: "${ACR_NAME:=}" # ACR name (must be globally unique)

Or alternatively, update the condition on line 698 to properly check if the user has provided a custom value:

if [[ "${ACR_NAME}" == "" || "${ACR_RANDOMIZE}" == "true" ]]; then

Copilot · 2025-11-25T19:36:19Z

infrastructure_references/aks/scripts/deploy-aks.sh

+		random_suffix="$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c ${remaining} 2>/dev/null || true)" #|| true
+
+		generated_name="${base_prefix}${random_suffix}"
+		ACR_NAME="${generated_name:0:10}"


ACR names in Azure must be 5-50 characters long and contain only alphanumeric characters. The current logic truncates the generated name to 10 characters maximum, but there's no validation that the final name is at least 5 characters long.

If base_prefix is very short (1-4 characters) and the random suffix generation fails or produces fewer characters than needed, the resulting ACR name could be less than 5 characters and fail during creation.

Consider adding validation after line 712:

if [[ ${#ACR_NAME} -lt 5 ]]; then echo "❌ Generated ACR name '${ACR_NAME}' is too short (minimum 5 characters required)" exit 1 fi

Suggested change

ACR_NAME="${generated_name:0:10}"

ACR_NAME="${generated_name:0:10}"

if [[ ${#ACR_NAME} -lt 5 ]]; then

echo "❌ Generated ACR name '${ACR_NAME}' is too short (minimum 5 characters required)"

exit 1

fi

Copilot · 2025-11-25T19:36:20Z

examples/mlperf_storage/docker/Dockerfile

+    python3 -m venv /opt/mlperf-storage && \
+    /opt/mlperf-storage/bin/pip3 install --no-cache-dir -e .
+
+# Fix libmpi.so link 


[nitpick] The comment has a trailing space that should be removed for consistency.

Suggested change

# Fix libmpi.so link

# Fix libmpi.so link

Copilot · 2025-11-25T19:36:20Z

examples/mlperf_storage/aks/helm/mlperf-training-dataset-generation/templates/mpijob.yaml

+                  --num-processes "{{ mul .Values.mpi.workers .Values.mpi.slotsPerWorker }}"  \
+                  --data-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.dataDir }} \
+                  --params dataset.num_files_train={{ .Values.benchmark.numFilesTrain }} \
+                  --hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}


[nitpick] Line 63 has very long command arguments that extend beyond typical line length limits, making it difficult to read and maintain. Consider breaking this into multiple lines for better readability:

--params dataset.num_files_train={{ .Values.benchmark.numFilesTrain }} \ --hosts ${HOSTS} \ {{- if .Values.benchmark.debug }} --debug {{- end }} \ {{- if .Values.benchmark.verbose }} --verbose {{- end }} \ {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}

Suggested change

--hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}

--hosts ${HOSTS} \

{{- if .Values.benchmark.debug }} --debug \{{- end }}

{{- if .Values.benchmark.verbose }} --verbose \{{- end }}

{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }}{{- end }}

Copilot · 2025-11-25T19:36:20Z

examples/mlperf_storage/aks/helm/mlperf-training-dataset-size/templates/job.yaml

+              --max-accelerators {{ .Values.benchmark.maxAccelerators }} \
+              --num-client-hosts {{ .Values.benchmark.numClientHosts }} \
+              --accelerator-type {{ .Values.benchmark.acceleratorType }} \
+              --results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}


Lines 33-34 have inconsistent formatting. Line 33 ends without a backslash but line 34 continues with additional parameters on a separate line. Either merge line 34 into line 33, or add a backslash at the end of line 33 to indicate line continuation:

--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} \ {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}

Suggested change

--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}

--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} \

Copilot · 2025-11-25T19:36:21Z

examples/mlperf_storage/aks/helm/mlperf-checkpointing/templates/mpijob.yaml

+                  --num-checkpoints-write "{{ .Values.benchmark.numCheckpointsWrite }}" \
+                  --num-checkpoints-read "{{ .Values.benchmark.numCheckpointsRead }}" \
+                  --client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \
+                  --hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}


[nitpick] Line 65 has very long command arguments that extend beyond typical line length limits, making it difficult to read and maintain. Consider breaking this into multiple lines for better readability:

--client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \ --hosts "${HOSTS}" \ {{- if .Values.benchmark.debug }} --debug {{- end }} \ {{- if .Values.benchmark.verbose }} --verbose {{- end }} \ {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}

Suggested change

--hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}

--hosts "${HOSTS}" \

{{- if .Values.benchmark.debug }}--debug \{{- end }}

{{- if .Values.benchmark.verbose }}--verbose \{{- end }}

{{- if .Values.benchmark.additionalParams }}{{ .Values.benchmark.additionalParams }}{{- end }}

Copilot · 2025-11-25T19:36:21Z

examples/mlperf_storage/aks/helm/mlperf-checkpointing/templates/mpijob.yaml

+                  --num-checkpoints-read "{{ .Values.benchmark.numCheckpointsRead }}" \
+                  --client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \
+                  --hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
+                  


Line 66 contains only whitespace and should be removed. This blank line appears within the command block and serves no purpose.

Suggested change

Copilot · 2025-11-25T19:36:22Z

examples/mlperf_storage/aks/helm/mlperf-checkpointing/templates/NOTES.txt

+	* Model flag is a sizing hint; adjust as needed for future logic.
+
+UPGRADE EXAMPLE
+	helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \


The command invocation template shows incorrect chart name. It refers to ./mlperf-checkpointing-checkpoint but the actual chart directory name is mlperf-checkpointing. This would cause the upgrade command to fail.

Change to:

helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \

Suggested change

helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \

helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \

Copilot · 2025-11-25T19:36:22Z

.github/workflows/build_mlperf_storage.yml

+  push:
+    branches:
+      - main
+      - add-mlperf-storage-checkpointing


The workflow is triggered on push to the add-mlperf-storage-checkpointing branch, which appears to be a feature branch. Once this PR is merged, this trigger should be removed to avoid unnecessary workflow runs. Only the main branch trigger should remain for production use.

Suggested change

- add-mlperf-storage-checkpointing

Copilot · 2025-11-25T19:36:22Z

infrastructure_references/aks/scripts/deploy-aks.sh

+	echo "  INSTALL_GPU_OPERATOR     Install GPU Operator during 'all' (default: true)"
+	echo "  INSTALL_ACR              Attach/create ACR during 'all' (default: true)"
+	echo "  ACR_RANDOMIZE            Force random ACR name (always randomizes, even if ACR_NAME set) (default: false)"
+	echo "  ACR_NAME                 ACR name (default: \"${CLUSTER_NAME}acr\")"


The default value shown in the help text is misleading. Line 810 states the default is "${CLUSTER_NAME}acr", but on line 59, ACR_NAME is initialized to an empty string "". The help text should reflect the actual default behavior:

echo " ACR_NAME ACR name (default: auto-generated from cluster name)"

This aligns with the actual behavior where an empty ACR_NAME triggers automatic generation in the create_or_attach_acr function.

Suggested change

echo " ACR_NAME ACR name (default: \"${CLUSTER_NAME}acr\")"

echo " ACR_NAME ACR name (default: auto-generated from cluster name)"

MLPerf Checkpointing on AKS

1845f06

wolfgang-desalvador requested review from Copilot, edwardsp and surajssd November 14, 2025 16:21

Copilot started reviewing on behalf of wolfgang-desalvador November 14, 2025 16:21 View session

Fix container name in the pipeline for MLPerf Storage

70b8277

Copilot finished reviewing on behalf of wolfgang-desalvador November 14, 2025 16:24

Include the pipeline file for MLPerf Storage in monitored push

9653e72

Copilot AI reviewed Nov 14, 2025

View reviewed changes

wolfgang-desalvador and others added 9 commits November 14, 2025 18:00

Add push to MLPerf Storage example

d449f72

Fix MLPerf image name

cff45ec

linter and minor readme edits

da8fb40

Merge branch 'main' into add-mlperf-storage-checkpointing

8ab8640

Added kueue support

1a7d366

Merge 1a7d366 into e29ca29

fc8a453

super-linter: fix linting issues [skip ci]

8671adf

copilot suggestions

4b61923

removed push from feature branch

9dd5598

edwardsp approved these changes Nov 17, 2025

View reviewed changes

wolfgang-desalvador self-assigned this Nov 20, 2025

wolfgang-desalvador added 5 commits November 20, 2025 21:34

Add link to the main README for MLPerf Storage

a9e67b5

Switch MLPerf Storage to MPICH

8bbae46

Add current branch to MLPerfStorage image build

a1b7a85

Fix branch names in MLPerfStorage image build workflow

b027591

wolfgang-desalvador requested a review from Copilot November 25, 2025 19:31

Copilot started reviewing on behalf of wolfgang-desalvador November 25, 2025 19:31 View session

Copilot finished reviewing on behalf of wolfgang-desalvador November 25, 2025 19:34

Copilot AI reviewed Nov 25, 2025

View reviewed changes

Manage startup check with mpirun

948ff53

-                  --hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
+                  --hosts ${HOSTS} \
+                  {{- if .Values.benchmark.debug }} --debug \{{- end }}
+                  {{- if .Values.benchmark.verbose }} --verbose \{{- end }}
+                  {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }}{{- end }}

	--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}
	--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} \

-                  --hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
+                  --hosts "${HOSTS}" \
+                  {{- if .Values.benchmark.debug }}--debug \{{- end }}
+                  {{- if .Values.benchmark.verbose }}--verbose \{{- end }}
+                  {{- if .Values.benchmark.additionalParams }}{{ .Values.benchmark.additionalParams }}{{- end }}

	helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \
	helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \

	echo " ACR_NAME ACR name (default: \"${CLUSTER_NAME}acr\")"
	echo " ACR_NAME ACR name (default: auto-generated from cluster name)"

MLPerf Checkpointing on AKS #91

Are you sure you want to change the base?

MLPerf Checkpointing on AKS #91

Uh oh!

Conversation

wolfgang-desalvador commented Nov 14, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants