-
Notifications
You must be signed in to change notification settings - Fork 9
MLPerf Checkpointing on AKS #91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces MLPerf Storage checkpoint benchmarking capabilities for AKS, including a complete Helm chart deployment, containerized benchmark environment, and enhanced AKS deployment automation with optional ACR integration.
Key Changes:
- Added Helm chart for MLPerf Storage checkpoint benchmark deployment using Kubeflow MPI Operator with configurable worker scaling and resource allocation
- Implemented automated ACR provisioning and attachment in AKS deployment script with optional operator installation toggles
- Created Docker image build pipeline with GitHub Actions for publishing MLPerf Storage benchmark container
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| infrastructure_references/aks/scripts/deploy-aks.sh | Enhanced deployment script with ACR integration, operator installation toggles, and extended AMLFS role assignments |
| examples/mlperf_storage/docker/Dockerfile | Ubuntu 22.04-based container with MLPerf Storage v2.0, OpenMPI, and SSH configuration for distributed benchmarking |
| examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/values.yaml | Helm values defining benchmark parameters, MPI configuration, resource requests, and storage mounting |
| examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/mpijob.yaml | MPIJob template with launcher/worker specs, SSH readiness checks, and checkpoint benchmark execution |
| examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/_helpers.tpl | Standard Helm helper functions for naming, labels, and chart metadata |
| examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/NOTES.txt | Post-install instructions with status commands, configuration summary, and usage examples |
| examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/Chart.yaml | Helm chart metadata defining application type and version 0.1.0 |
| examples/mlperf_storage/aks/README.md | Comprehensive guide covering prerequisites, deployment steps, configuration options, and MPIJob architecture |
| .github/workflows/build_mlperf_storage.yml | GitHub Actions workflow for building and publishing MLPerf Storage container to GHCR |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/mpijob.yaml
Outdated
Show resolved
Hide resolved
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/values.yaml
Outdated
Show resolved
Hide resolved
… size calculation, and training run - Implemented Helm chart for MLPerf Storage checkpointing with MPIJob configuration. - Added Helm chart for MLPerf Storage training dataset generation, including job specifications and resource management. - Introduced Helm chart for calculating training dataset size, with detailed configuration and command invocation. - Created Helm chart for executing MLPerf Storage training runs using MPI, with comprehensive logging and resource allocation. - Defined default values and configurations for all new charts, ensuring compatibility with Kueue workload management. - Enhanced documentation in NOTES.txt for each chart, providing users with clear instructions and configuration summaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local acr_location="${ACR_LOCATION:-${AZURE_REGION}}" | ||
|
|
||
| # Always randomize if user didn't override ACR_NAME (still default "${CLUSTER_NAME}acr"). If ACR_RANDOMIZE=true, re-randomize regardless. | ||
| if [[ -z "${ACR_NAME}" || "${ACR_RANDOMIZE}" == "true" ]]; then |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition at line 698 has a logic issue. The variable ACR_NAME is initialized as an empty string ("") on line 59, which means -z "${ACR_NAME}" will always be true even when the user hasn't overridden it. This causes the randomization to always occur, making the comment on line 697 misleading.
Consider changing line 59 to not set a default value:
: "${ACR_NAME:=}" # ACR name (must be globally unique)Or alternatively, update the condition on line 698 to properly check if the user has provided a custom value:
if [[ "${ACR_NAME}" == "" || "${ACR_RANDOMIZE}" == "true" ]]; then| random_suffix="$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c ${remaining} 2>/dev/null || true)" #|| true | ||
|
|
||
| generated_name="${base_prefix}${random_suffix}" | ||
| ACR_NAME="${generated_name:0:10}" |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACR names in Azure must be 5-50 characters long and contain only alphanumeric characters. The current logic truncates the generated name to 10 characters maximum, but there's no validation that the final name is at least 5 characters long.
If base_prefix is very short (1-4 characters) and the random suffix generation fails or produces fewer characters than needed, the resulting ACR name could be less than 5 characters and fail during creation.
Consider adding validation after line 712:
if [[ ${#ACR_NAME} -lt 5 ]]; then
echo "❌ Generated ACR name '${ACR_NAME}' is too short (minimum 5 characters required)"
exit 1
fi| ACR_NAME="${generated_name:0:10}" | |
| ACR_NAME="${generated_name:0:10}" | |
| if [[ ${#ACR_NAME} -lt 5 ]]; then | |
| echo "❌ Generated ACR name '${ACR_NAME}' is too short (minimum 5 characters required)" | |
| exit 1 | |
| fi |
| python3 -m venv /opt/mlperf-storage && \ | ||
| /opt/mlperf-storage/bin/pip3 install --no-cache-dir -e . | ||
|
|
||
| # Fix libmpi.so link |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment has a trailing space that should be removed for consistency.
| # Fix libmpi.so link | |
| # Fix libmpi.so link |
| --num-processes "{{ mul .Values.mpi.workers .Values.mpi.slotsPerWorker }}" \ | ||
| --data-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.dataDir }} \ | ||
| --params dataset.num_files_train={{ .Values.benchmark.numFilesTrain }} \ | ||
| --hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }} |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Line 63 has very long command arguments that extend beyond typical line length limits, making it difficult to read and maintain. Consider breaking this into multiple lines for better readability:
--params dataset.num_files_train={{ .Values.benchmark.numFilesTrain }} \
--hosts ${HOSTS} \
{{- if .Values.benchmark.debug }} --debug {{- end }} \
{{- if .Values.benchmark.verbose }} --verbose {{- end }} \
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}| --hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }} | |
| --hosts ${HOSTS} \ | |
| {{- if .Values.benchmark.debug }} --debug \{{- end }} | |
| {{- if .Values.benchmark.verbose }} --verbose \{{- end }} | |
| {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }}{{- end }} |
| --max-accelerators {{ .Values.benchmark.maxAccelerators }} \ | ||
| --num-client-hosts {{ .Values.benchmark.numClientHosts }} \ | ||
| --accelerator-type {{ .Values.benchmark.acceleratorType }} \ | ||
| --results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 33-34 have inconsistent formatting. Line 33 ends without a backslash but line 34 continues with additional parameters on a separate line. Either merge line 34 into line 33, or add a backslash at the end of line 33 to indicate line continuation:
--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} \
{{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}| --results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} | |
| --results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} \ |
| --num-checkpoints-write "{{ .Values.benchmark.numCheckpointsWrite }}" \ | ||
| --num-checkpoints-read "{{ .Values.benchmark.numCheckpointsRead }}" \ | ||
| --client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \ | ||
| --hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }} |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Line 65 has very long command arguments that extend beyond typical line length limits, making it difficult to read and maintain. Consider breaking this into multiple lines for better readability:
--client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \
--hosts "${HOSTS}" \
{{- if .Values.benchmark.debug }} --debug {{- end }} \
{{- if .Values.benchmark.verbose }} --verbose {{- end }} \
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}| --hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }} | |
| --hosts "${HOSTS}" \ | |
| {{- if .Values.benchmark.debug }}--debug \{{- end }} | |
| {{- if .Values.benchmark.verbose }}--verbose \{{- end }} | |
| {{- if .Values.benchmark.additionalParams }}{{ .Values.benchmark.additionalParams }}{{- end }} |
| --num-checkpoints-read "{{ .Values.benchmark.numCheckpointsRead }}" \ | ||
| --client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \ | ||
| --hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }} | ||
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 66 contains only whitespace and should be removed. This blank line appears within the command block and serves no purpose.
| * Model flag is a sizing hint; adjust as needed for future logic. | ||
|
|
||
| UPGRADE EXAMPLE | ||
| helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \ |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command invocation template shows incorrect chart name. It refers to ./mlperf-checkpointing-checkpoint but the actual chart directory name is mlperf-checkpointing. This would cause the upgrade command to fail.
Change to:
helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \
| helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \ | |
| helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \ |
| push: | ||
| branches: | ||
| - main | ||
| - add-mlperf-storage-checkpointing |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workflow is triggered on push to the add-mlperf-storage-checkpointing branch, which appears to be a feature branch. Once this PR is merged, this trigger should be removed to avoid unnecessary workflow runs. Only the main branch trigger should remain for production use.
| - add-mlperf-storage-checkpointing |
| echo " INSTALL_GPU_OPERATOR Install GPU Operator during 'all' (default: true)" | ||
| echo " INSTALL_ACR Attach/create ACR during 'all' (default: true)" | ||
| echo " ACR_RANDOMIZE Force random ACR name (always randomizes, even if ACR_NAME set) (default: false)" | ||
| echo " ACR_NAME ACR name (default: \"${CLUSTER_NAME}acr\")" |
Copilot
AI
Nov 25, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value shown in the help text is misleading. Line 810 states the default is "${CLUSTER_NAME}acr", but on line 59, ACR_NAME is initialized to an empty string "". The help text should reflect the actual default behavior:
echo " ACR_NAME ACR name (default: auto-generated from cluster name)"This aligns with the actual behavior where an empty ACR_NAME triggers automatic generation in the create_or_attach_acr function.
| echo " ACR_NAME ACR name (default: \"${CLUSTER_NAME}acr\")" | |
| echo " ACR_NAME ACR name (default: auto-generated from cluster name)" |
This pull request introduces a new MLPerf Storage checkpoint benchmark deployment for Azure Kubernetes Service (AKS), including a Helm chart, Docker image, and supporting documentation. It also adds enhancements to the AKS deployment script, such as optional operator toggles and automated Azure Container Registry (ACR) provisioning and attachment. The changes streamline the process for benchmarking storage on AKS and improve cluster setup automation.
MLPerf Storage Benchmark Deployment:
mlperf-storage-checkpoint) for deploying the MLPerf Storage checkpoint benchmark on AKS with the Kubeflow MPI Operator, including templates, values, helper functions, and release notes. [1] [2] [3] [4] [5]Docker Image Build and Workflow:
AKS Deployment Script Enhancements: