The goal is to run a deep learning task in containers on an HPC environment managed by Slurm.
Slurm is an alternative to Kubernetes as a compute environment. Its advantage is native job scheduling, and it is a key player in the HPC industry.
In this work, we create an HPC environment on AWS using AWS ParallelCluster and then fine-tune a Llama 3.1 (8B) model using supervised fine-tuning (SFT). The task itself runs inside a container.
We install the ParallelCluster CLI tool via
pip install "aws-parallelcluster"
To provision the cluster, you can generate the configuration via:
pcluster configure --config config.yaml
The naive configuration may not suit most purposes, especially deep learning. A modified version of the configuration can be found here.
Then create the cluster via
pcluster create-cluster --cluster-name dlhpc --cluster-configuration config.yaml
This configuration
- specifies the VPC/subnet to speed up provisioning
- installs Pyxis and Enroot (NVIDIA's alternative to Singularity/Apptainer) to run containers
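If you want to validate the configuration before creating any resources, recent ParallelCluster 3.x CLIs support a dry run (a minimal sketch, reusing the cluster name and config file from above):

```bash
# Validate config.yaml without creating any AWS resources
pcluster create-cluster --cluster-name dlhpc \
    --cluster-configuration config.yaml \
    --dryrun true
```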
You can see the status of the cluster via:
pcluster list-clusters
Do not SSH before the cluster/CloudFormation status reads Ready/Created, even if the HeadNode is ready! Otherwise the Slurm commands or other tools might not be available yet.
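To see a more detailed status (the JSON output includes clusterStatus and the CloudFormation stack status), you can also run:

```bash
# Wait until the status fields report CREATE_COMPLETE before logging in
pcluster describe-cluster --cluster-name dlhpc
```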
You can log in to the head node via
pcluster ssh --cluster-name dlhpc -i ~/.ssh/<your_key>.pem
Inside the head node, you can SSH to other compute nodes via
ssh <name_of_compute_node>
This can be useful, for example, for monitoring GPU usage or other debugging. To get the names of the compute nodes, you can use the sinfo command, which lists the status of all of the compute nodes.
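As a minimal sketch (the node name below is purely illustrative; use whatever sinfo reports):

```bash
# List all compute nodes and their state (idle, alloc, down, ...)
sinfo -N -l

# SSH into a compute node and watch GPU utilization during training
ssh queue1-dy-gpu-1
watch -n 2 nvidia-smi
```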
We need to provision the container that will run the training code. Having a container is not a prerequisite, since we could install our Python dependencies directly on the host itself. While this might work in our AWS environment, it is generally discouraged in a shared HPC environment.
We will assume that a PyTorch image is already built and hosted on an image repository, e.g. Docker Hub, GHCR, etc. A sample image can be found in the docker_image folder. (Since we are on AWS, it is better to use a PyTorch image built by/for AWS.) Import the image with Enroot via
enroot import -o llm.sqsh docker://<username>/<image>:<tag>
This command must be run either inside the home directory (which is shared among all nodes) or on another shared file system, so that the image is accessible across the entire cluster. For large model training, it is better to use FSx for Lustre as the shared file system due to its high performance.
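For example, assuming an FSx for Lustre file system mounted at /fsx (the mount point is an assumption of this sketch):

```bash
# Import the image onto the shared file system so every node can reach it
cd /fsx
enroot import -o llm.sqsh docker://<username>/<image>:<tag>
# The resulting squashfs file can later be passed to Pyxis via
#   --container-image=/fsx/llm.sqsh
```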
Also, since our code will
- push the model to the Hub, and
- log metrics to WandB,
we can install the huggingface-cli and wandb on our head node via
pip install 'huggingface_hub[cli]' wandb
and then log in via the CLIs. This will save the credentials somewhere within the home directory, which will be accessible not only to the compute nodes but also to the containers.
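For example:

```bash
# Store the Hugging Face token (kept under ~/.cache/huggingface/ by default)
huggingface-cli login

# Store the Weights & Biases API key (kept in ~/.netrc by default)
wandb login
```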
Create your training code and save it to one of the shared paths. For our task, we can use the default SFT script provided by Hugging Face:
wget https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/sft.py
You have to create a batch script (a sample is shown here) in order to submit a batch job. Among other things, it
- specifies the training container image via the --container-image argument,
- mounts the home directory inside the container via --container-mount-home, and
- sets the working directory inside the container via --container-workdir.
A sketch of such a script is given below.
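Below is a minimal sketch of such a batch script. It assumes the image was imported to /fsx/llm.sqsh, that sft.py lives in /fsx/llm on an FSx for Lustre mount, and that the GPU queue exposes a gpu GRES; the sft.py arguments (model, dataset, output directory) are illustrative and may differ across TRL versions:

```bash
#!/bin/bash
#SBATCH --job-name=llama-sft
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --output=sft_%j.out
#SBATCH --error=sft_%j.err

# The --container-* options come from the Pyxis plugin: run the step inside
# the imported image, mount the shared file system and the home directory
# (for the Hugging Face / W&B credentials), and set the working directory.
srun --container-image=/fsx/llm.sqsh \
     --container-mounts=/fsx:/fsx \
     --container-mount-home \
     --container-workdir=/fsx/llm \
     python sft.py \
         --model_name_or_path meta-llama/Llama-3.1-8B \
         --dataset_name trl-lib/Capybara \
         --output_dir /fsx/llm/output
```

The batch script itself runs on the allocated compute node; the srun step (everything after the Pyxis flags) runs inside the container.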
If you need to run tasks before (or after) the training job, e.g. to install Python packages, you can append the --task-prolog (or --task-epilog) option to the srun command, e.g. --task-prolog="setup.sh".
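A minimal sketch of such a prolog script (the package list is purely illustrative; setup.sh must be reachable on the compute nodes, e.g. on the shared file system):

```bash
#!/bin/bash
# setup.sh - runs before each task launched via srun --task-prolog.
# Install anything the training code needs that the image does not ship with.
pip install --quiet trl peft
```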
To submit a job, run
sbatch task.sbatch
You can see the job queue and the status of the jobs via
squeue
To view the output generated by the runs, you can inspect the *.err and *.out files in the working directory.
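For example, using the file names from the batch script sketch above:

```bash
# Follow the job's stdout/stderr while the run is in progress
tail -f sft_*.out sft_*.err
```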
Finally, you can delete the cluster and all the resources via
pcluster delete-cluster --cluster-name dlhpc