Skip to content

Conversation

@wolfgang-desalvador
Copy link
Collaborator

This pull request introduces a new MLPerf Storage checkpoint benchmark deployment for Azure Kubernetes Service (AKS), including a Helm chart, Docker image, and supporting documentation. It also adds enhancements to the AKS deployment script, such as optional operator toggles and automated Azure Container Registry (ACR) provisioning and attachment. The changes streamline the process for benchmarking storage on AKS and improve cluster setup automation.

MLPerf Storage Benchmark Deployment:

  • Added a new Helm chart (mlperf-storage-checkpoint) for deploying the MLPerf Storage checkpoint benchmark on AKS with the Kubeflow MPI Operator, including templates, values, helper functions, and release notes. [1] [2] [3] [4] [5]
  • Created a comprehensive README for running the checkpoint benchmark on AKS, detailing prerequisites, quick start, configuration, scaling, and architecture.

Docker Image Build and Workflow:

  • Added a Dockerfile for building the MLPerf Storage benchmark container, and a GitHub Actions workflow to automate building and publishing the image to the GitHub Container Registry. [1] [2]

AKS Deployment Script Enhancements:

  • Introduced optional toggles for operator installation and integrated ACR provisioning in the AKS deployment script, allowing users to skip certain steps or automate ACR creation and cluster attachment. [1] [2] [3]
  • Improved AMLFS role assignment logic to also assign Contributor role on the primary resource group, ensuring broader access for storage operations.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces MLPerf Storage checkpoint benchmarking capabilities for AKS, including a complete Helm chart deployment, containerized benchmark environment, and enhanced AKS deployment automation with optional ACR integration.

Key Changes:

  • Added Helm chart for MLPerf Storage checkpoint benchmark deployment using Kubeflow MPI Operator with configurable worker scaling and resource allocation
  • Implemented automated ACR provisioning and attachment in AKS deployment script with optional operator installation toggles
  • Created Docker image build pipeline with GitHub Actions for publishing MLPerf Storage benchmark container

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
infrastructure_references/aks/scripts/deploy-aks.sh Enhanced deployment script with ACR integration, operator installation toggles, and extended AMLFS role assignments
examples/mlperf_storage/docker/Dockerfile Ubuntu 22.04-based container with MLPerf Storage v2.0, OpenMPI, and SSH configuration for distributed benchmarking
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/values.yaml Helm values defining benchmark parameters, MPI configuration, resource requests, and storage mounting
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/mpijob.yaml MPIJob template with launcher/worker specs, SSH readiness checks, and checkpoint benchmark execution
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/_helpers.tpl Standard Helm helper functions for naming, labels, and chart metadata
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/templates/NOTES.txt Post-install instructions with status commands, configuration summary, and usage examples
examples/mlperf_storage/aks/helm/mlperf-storage-checkpoint/Chart.yaml Helm chart metadata defining application type and version 0.1.0
examples/mlperf_storage/aks/README.md Comprehensive guide covering prerequisites, deployment steps, configuration options, and MPIJob architecture
.github/workflows/build_mlperf_storage.yml GitHub Actions workflow for building and publishing MLPerf Storage container to GHCR

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wolfgang-desalvador wolfgang-desalvador self-assigned this Nov 20, 2025
… size calculation, and training run

- Implemented Helm chart for MLPerf Storage checkpointing with MPIJob configuration.
- Added Helm chart for MLPerf Storage training dataset generation, including job specifications and resource management.
- Introduced Helm chart for calculating training dataset size, with detailed configuration and command invocation.
- Created Helm chart for executing MLPerf Storage training runs using MPI, with comprehensive logging and resource allocation.
- Defined default values and configurations for all new charts, ensuring compatibility with Kueue workload management.
- Enhanced documentation in NOTES.txt for each chart, providing users with clear instructions and configuration summaries.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

local acr_location="${ACR_LOCATION:-${AZURE_REGION}}"

# Always randomize if user didn't override ACR_NAME (still default "${CLUSTER_NAME}acr"). If ACR_RANDOMIZE=true, re-randomize regardless.
if [[ -z "${ACR_NAME}" || "${ACR_RANDOMIZE}" == "true" ]]; then
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition at line 698 has a logic issue. The variable ACR_NAME is initialized as an empty string ("") on line 59, which means -z "${ACR_NAME}" will always be true even when the user hasn't overridden it. This causes the randomization to always occur, making the comment on line 697 misleading.

Consider changing line 59 to not set a default value:

: "${ACR_NAME:=}"  # ACR name (must be globally unique)

Or alternatively, update the condition on line 698 to properly check if the user has provided a custom value:

if [[ "${ACR_NAME}" == "" || "${ACR_RANDOMIZE}" == "true" ]]; then

Copilot uses AI. Check for mistakes.
random_suffix="$(LC_ALL=C tr -dc 'a-z0-9' </dev/urandom | head -c ${remaining} 2>/dev/null || true)" #|| true

generated_name="${base_prefix}${random_suffix}"
ACR_NAME="${generated_name:0:10}"
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACR names in Azure must be 5-50 characters long and contain only alphanumeric characters. The current logic truncates the generated name to 10 characters maximum, but there's no validation that the final name is at least 5 characters long.

If base_prefix is very short (1-4 characters) and the random suffix generation fails or produces fewer characters than needed, the resulting ACR name could be less than 5 characters and fail during creation.

Consider adding validation after line 712:

if [[ ${#ACR_NAME} -lt 5 ]]; then
    echo "❌ Generated ACR name '${ACR_NAME}' is too short (minimum 5 characters required)"
    exit 1
fi
Suggested change
ACR_NAME="${generated_name:0:10}"
ACR_NAME="${generated_name:0:10}"
if [[ ${#ACR_NAME} -lt 5 ]]; then
echo "❌ Generated ACR name '${ACR_NAME}' is too short (minimum 5 characters required)"
exit 1
fi

Copilot uses AI. Check for mistakes.
python3 -m venv /opt/mlperf-storage && \
/opt/mlperf-storage/bin/pip3 install --no-cache-dir -e .

# Fix libmpi.so link
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment has a trailing space that should be removed for consistency.

Suggested change
# Fix libmpi.so link
# Fix libmpi.so link

Copilot uses AI. Check for mistakes.
--num-processes "{{ mul .Values.mpi.workers .Values.mpi.slotsPerWorker }}" \
--data-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.dataDir }} \
--params dataset.num_files_train={{ .Values.benchmark.numFilesTrain }} \
--hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Line 63 has very long command arguments that extend beyond typical line length limits, making it difficult to read and maintain. Consider breaking this into multiple lines for better readability:

--params dataset.num_files_train={{ .Values.benchmark.numFilesTrain }} \
--hosts ${HOSTS} \
{{- if .Values.benchmark.debug }} --debug {{- end }} \
{{- if .Values.benchmark.verbose }} --verbose {{- end }} \
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
Suggested change
--hosts ${HOSTS} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
--hosts ${HOSTS} \
{{- if .Values.benchmark.debug }} --debug \{{- end }}
{{- if .Values.benchmark.verbose }} --verbose \{{- end }}
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }}{{- end }}

Copilot uses AI. Check for mistakes.
--max-accelerators {{ .Values.benchmark.maxAccelerators }} \
--num-client-hosts {{ .Values.benchmark.numClientHosts }} \
--accelerator-type {{ .Values.benchmark.acceleratorType }} \
--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 33-34 have inconsistent formatting. Line 33 ends without a backslash but line 34 continues with additional parameters on a separate line. Either merge line 34 into line 33, or add a backslash at the end of line 33 to indicate line continuation:

--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} \
{{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
Suggested change
--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }}
--results-dir {{ .Values.storage.mountPath }}/{{ .Values.storage.resultsDir }} {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} \

Copilot uses AI. Check for mistakes.
--num-checkpoints-write "{{ .Values.benchmark.numCheckpointsWrite }}" \
--num-checkpoints-read "{{ .Values.benchmark.numCheckpointsRead }}" \
--client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \
--hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Line 65 has very long command arguments that extend beyond typical line length limits, making it difficult to read and maintain. Consider breaking this into multiple lines for better readability:

--client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \
--hosts "${HOSTS}" \
{{- if .Values.benchmark.debug }} --debug {{- end }} \
{{- if .Values.benchmark.verbose }} --verbose {{- end }} \
{{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
Suggested change
--hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
--hosts "${HOSTS}" \
{{- if .Values.benchmark.debug }}--debug \{{- end }}
{{- if .Values.benchmark.verbose }}--verbose \{{- end }}
{{- if .Values.benchmark.additionalParams }}{{ .Values.benchmark.additionalParams }}{{- end }}

Copilot uses AI. Check for mistakes.
--num-checkpoints-read "{{ .Values.benchmark.numCheckpointsRead }}" \
--client-host-memory-in-gb "{{ .Values.benchmark.clientHostMemoryGiB }}" \
--hosts "${HOSTS}" {{- if .Values.benchmark.debug }} --debug {{- end }} {{- if .Values.benchmark.verbose }} --verbose {{- end }} {{- if .Values.benchmark.additionalParams }} {{ .Values.benchmark.additionalParams }} {{- end }}
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 66 contains only whitespace and should be removed. This blank line appears within the command block and serves no purpose.

Suggested change

Copilot uses AI. Check for mistakes.
* Model flag is a sizing hint; adjust as needed for future logic.

UPGRADE EXAMPLE
helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command invocation template shows incorrect chart name. It refers to ./mlperf-checkpointing-checkpoint but the actual chart directory name is mlperf-checkpointing. This would cause the upgrade command to fail.

Change to:

helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \
Suggested change
helm upgrade {{ .Release.Name }} ./mlperf-checkpointing-checkpoint \
helm upgrade {{ .Release.Name }} ./mlperf-checkpointing \

Copilot uses AI. Check for mistakes.
push:
branches:
- main
- add-mlperf-storage-checkpointing
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow is triggered on push to the add-mlperf-storage-checkpointing branch, which appears to be a feature branch. Once this PR is merged, this trigger should be removed to avoid unnecessary workflow runs. Only the main branch trigger should remain for production use.

Suggested change
- add-mlperf-storage-checkpointing

Copilot uses AI. Check for mistakes.
echo " INSTALL_GPU_OPERATOR Install GPU Operator during 'all' (default: true)"
echo " INSTALL_ACR Attach/create ACR during 'all' (default: true)"
echo " ACR_RANDOMIZE Force random ACR name (always randomizes, even if ACR_NAME set) (default: false)"
echo " ACR_NAME ACR name (default: \"${CLUSTER_NAME}acr\")"
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value shown in the help text is misleading. Line 810 states the default is "${CLUSTER_NAME}acr", but on line 59, ACR_NAME is initialized to an empty string "". The help text should reflect the actual default behavior:

echo "  ACR_NAME                 ACR name (default: auto-generated from cluster name)"

This aligns with the actual behavior where an empty ACR_NAME triggers automatic generation in the create_or_attach_acr function.

Suggested change
echo " ACR_NAME ACR name (default: \"${CLUSTER_NAME}acr\")"
echo " ACR_NAME ACR name (default: auto-generated from cluster name)"

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants