use shared storage for megatron checkpointing #22
Conversation
  --optimizer_cpu_offload true \
  --use_precision_aware_optimizer true \
  --use_hf 1 \
  --wandb_project qwen3_moe_megatron \
- Wandb expects the latest_checkpoint_iter.txt file to be accessible from all nodes.
- The swift docs recommend checkpointing to "shared storage".
- So we need to checkpoint to the training cache (see the sketch below).
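A hypothetical per-node sanity check (the marker filename comes from the comment above; its exact location under CKPT_DIR is an assumption, and CKPT_DIR itself is defined in the hunk below):

# Hypothetical check, not part of the diff: once checkpoints live on shared
# storage, every rank should be able to read the same iteration marker.
MARKER="${CKPT_DIR}/latest_checkpoint_iter.txt"
if [[ -f "${MARKER}" ]]; then
  echo "rank ${BT_NODE_RANK} sees iteration $(cat "${MARKER}")"
else
  echo "rank ${BT_NODE_RANK} cannot see ${MARKER}" >&2
fi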
cd /root/
export DATASET="zai-org/LongAlign-10k"
export MODEL_ID="Qwen/Qwen3-30B-A3B-Instruct-2507"
export CKPT_DIR=${BT_RW_CACHE_DIR}/${BT_TRAINING_JOB_ID}
Here the checkpoint dir lives in the cache and is namespaced by the training job ID, so the next job doesn't overwrite this job's data.
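Purely illustrative (made-up values, not from the PR): with BT_RW_CACHE_DIR=/mnt/cache and BT_TRAINING_JOB_ID=job-0042, the line expands to

# Illustrative expansion only:
#   CKPT_DIR=/mnt/cache/job-0042
# so a later job (e.g. job-0043) writes its checkpoints to a sibling
# directory instead of clobbering this one.
echo "Checkpoints for this job: ${CKPT_DIR}"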
if [[ "${BT_NODE_RANK}" == "0" ]]; then
  echo "Setting up continuous rsync from shared file system to checkpointing directory"
  # Start a background loop that continuously syncs
  (
    while true; do
      rsync -avz --delete $CKPT_DIR/ $BT_CHECKPOINT_DIR/
      sleep 30  # Sync every 30 seconds
    done
  ) &
  RSYNC_PID=$!
  echo "Continuous rsync started with PID: $RSYNC_PID"
fi
On rank 0 we run rsync in a background loop to copy data from the cache to the checkpointing volume.
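One thing the hunk doesn't show is teardown. A minimal sketch of stopping the loop before the final blocking rsync, assuming RSYNC_PID is still in scope at that point (not part of the diff):

# Hypothetical teardown: stop the background sync loop so it doesn't race
# with the final blocking rsync.
if [[ -n "${RSYNC_PID:-}" ]]; then
  kill "${RSYNC_PID}" 2>/dev/null || true
  wait "${RSYNC_PID}" 2>/dev/null || true
fi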
# Perform final synchronization to ensure everything is synced
echo "Performing final rsync..."
rsync -avz --delete $CKPT_DIR/ $BT_CHECKPOINT_DIR/
Make sure everything is synced by doing one final blocking rsync of the directories.
swift export \
  --mcore_model "${CKPT_DIR}/${V0_DIR}" \
  --to_hf true \
  --torch_dtype bfloat16 \
  --output_dir megatron_output/hf_converted \
  --push_to_hub true \
  --hub_token $HF_TOKEN \
  --hub_model_id rayraycano/megatron-qwen3-30b-a3b

echo "Final synchronization complete!"
We convert the weights back to HF format here, but let's avoid uploading to the HF repository.
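A minimal sketch of the same export with the upload dropped, using only flags that already appear in the diff:

# Sketch only: keep the HF conversion local and skip the hub upload.
swift export \
  --mcore_model "${CKPT_DIR}/${V0_DIR}" \
  --to_hf true \
  --torch_dtype bfloat16 \
  --output_dir megatron_output/hf_converted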
pushd $CKPT_DIR
ls -la
V0_DIR=$(echo v0-*)
popd
echo "V0_DIR: $V0_DIR"
There's an intermediate directory whose name carries the datestamp of the training run.
One checkpoint:
training_dir/v0-20250925-123459/iter_000040/....
Another checkpoint from the same run:
training_dir/v0-20250925-123459/iter_000080/....
This block is us figuring out "v0-20250925-123459".
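If the cache dir could ever hold more than one v0-* directory, a hedged alternative (assuming newest-by-mtime is the run we want) would be:

# Sketch only: pick the most recently modified v0-* directory instead of
# relying on bare glob expansion.
V0_DIR=$(basename "$(ls -td "${CKPT_DIR}"/v0-* | head -n 1)")
echo "V0_DIR: ${V0_DIR}"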
set +e
run_megatron_training 2>&1 | tee training.log
EXIT_CODE=$?
set -e  # Re-enable exit on error
Megatron can hang at the end of highly distributed workloads. I did some searching, and it looks like this isn't uncommon (#1541, #735, #1207).
So here we allow a non-zero exit code, capture it, and then re-enable error detection.
Piping through tee to training.log was just something Claude suggested; I don't know how useful it is (it just keeps a copy of the output on disk).
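One caveat worth flagging (my observation, not something in the PR): because of the pipe, $? holds tee's exit status rather than the training run's. A sketch that captures the training command's status instead:

# Sketch only: PIPESTATUS[0] is the exit status of the first command in the
# pipeline (run_megatron_training), not tee's.
set +e
run_megatron_training 2>&1 | tee training.log
EXIT_CODE=${PIPESTATUS[0]}
set -e  # Re-enable exit on error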
This script is pretty hacky, but it does everything necessary to make the project "magical" when we press play.
Can we add a case study to https://www.notion.so/ml-infra/Truss-Training-Tooling-Idea-27891d247273802da15fed940e85e440#27891d247273802da15fed940e85e440?