use shared storage for megatron checkpointing #22
Conversation
  --optimizer_cpu_offload true \
  --use_precision_aware_optimizer true \
  --use_hf 1 \
  --wandb_project qwen3_moe_megatron \
- Wandb expects the latest_checkpoint_iter.txt file to be accessible from all nodes.
- The swift docs recommend checkpointing to "shared storage".
- So we need to checkpoint to the training cache (see the sketch below).
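A hypothetical per-node sanity check (the marker filename comes from the comment above; its exact location under CKPT_DIR is an assumption, and CKPT_DIR itself is defined in the hunk below):

# Hypothetical check, not part of the diff: once checkpoints live on shared
# storage, every rank should be able to read the same iteration marker.
MARKER="${CKPT_DIR}/latest_checkpoint_iter.txt"
if [[ -f "${MARKER}" ]]; then
  echo "rank ${BT_NODE_RANK} sees iteration $(cat "${MARKER}")"
else
  echo "rank ${BT_NODE_RANK} cannot see ${MARKER}" >&2
fi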
cd /root/
export DATASET="zai-org/LongAlign-10k"
export MODEL_ID="Qwen/Qwen3-30B-A3B-Instruct-2507"
export CKPT_DIR=${BT_RW_CACHE_DIR}/${BT_TRAINING_JOB_ID}
Here the checkpoint dir lives in the cache and is namespaced by the training job ID, so the next job doesn't overwrite this job's data.
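Purely illustrative (made-up values, not from the PR): with BT_RW_CACHE_DIR=/mnt/cache and BT_TRAINING_JOB_ID=job-0042, the line expands to

# Illustrative expansion only:
#   CKPT_DIR=/mnt/cache/job-0042
# so a later job (e.g. job-0043) writes its checkpoints to a sibling
# directory instead of clobbering this one.
echo "Checkpoints for this job: ${CKPT_DIR}"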
if [[ "${BT_NODE_RANK}" == "0" ]]; then
  echo "Setting up continuous rsync from shared file system to checkpointing directory"
  # Start a background loop that continuously syncs
  (
    while true; do
      rsync -avz --delete $CKPT_DIR/ $BT_CHECKPOINT_DIR/
      sleep 30  # Sync every 30 seconds
    done
  ) &
  RSYNC_PID=$!
  echo "Continuous rsync started with PID: $RSYNC_PID"
fi
On rank 0 we run rsync in a background loop to copy data from the cache to the checkpointing volume.
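One thing the hunk doesn't show is teardown. A minimal sketch of stopping the loop before the final blocking rsync, assuming RSYNC_PID is still in scope at that point (not part of the diff):

# Hypothetical teardown: stop the background sync loop so it doesn't race
# with the final blocking rsync.
if [[ -n "${RSYNC_PID:-}" ]]; then
  kill "${RSYNC_PID}" 2>/dev/null || true
  wait "${RSYNC_PID}" 2>/dev/null || true
fi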
# Perform final synchronization to ensure everything is synced
echo "Performing final rsync..."
rsync -avz --delete $CKPT_DIR/ $BT_CHECKPOINT_DIR/
Make sure everything is synced by doing one final blocking rsync of the directories.
swift export \
  --mcore_model "${CKPT_DIR}/${V0_DIR}" \
  --to_hf true \
  --torch_dtype bfloat16 \
  --output_dir megatron_output/hf_converted \
  --push_to_hub true \
  --hub_token $HF_TOKEN \
  --hub_model_id rayraycano/megatron-qwen3-30b-a3b

echo "Final synchronization complete!"
We convert the weights back to HF format here, but let's avoid uploading to the HF repository.
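A minimal sketch of the same export with the upload dropped, using only flags that already appear in the diff:

# Sketch only: keep the HF conversion local and skip the hub upload.
swift export \
  --mcore_model "${CKPT_DIR}/${V0_DIR}" \
  --to_hf true \
  --torch_dtype bfloat16 \
  --output_dir megatron_output/hf_converted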
pushd $CKPT_DIR
ls -la
V0_DIR=$(echo v0-*)
popd
echo "V0_DIR: $V0_DIR"
There's an intermediate directory whose name carries the datestamp of the training run.
One checkpoint:
training_dir/v0-20250925-123459/iter_000040/....
Another checkpoint from the same run:
training_dir/v0-20250925-123459/iter_000080/....
This block is us figuring out "v0-20250925-123459".
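If the cache dir could ever hold more than one v0-* directory, a hedged alternative (assuming newest-by-mtime is the run we want) would be:

# Sketch only: pick the most recently modified v0-* directory instead of
# relying on bare glob expansion.
V0_DIR=$(basename "$(ls -td "${CKPT_DIR}"/v0-* | head -n 1)")
echo "V0_DIR: ${V0_DIR}"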
set +e
run_megatron_training 2>&1 | tee training.log
EXIT_CODE=$?
set -e  # Re-enable exit on error
Megatron can hang at the end of highly distributed workloads. I did some searching, and it looks like this isn't uncommon (#1541, #735, #1207).
So here we allow a non-zero exit code, capture it, and then re-enable error detection.
Piping through tee to training.log was just something Claude suggested; I don't know how useful it is (it just keeps a copy of the output on disk).
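One caveat worth flagging (my observation, not something in the PR): because of the pipe, $? holds tee's exit status rather than the training run's. A sketch that captures the training command's status instead:

# Sketch only: PIPESTATUS[0] is the exit status of the first command in the
# pipeline (run_megatron_training), not tee's.
set +e
run_megatron_training 2>&1 | tee training.log
EXIT_CODE=${PIPESTATUS[0]}
set -e  # Re-enable exit on error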
This script is pretty hacky, but it does everything necessary to make the project "magical" when we press play.
Can we add a case study to https://www.notion.so/ml-infra/Truss-Training-Tooling-Idea-27891d247273802da15fed940e85e440#27891d247273802da15fed940e85e440?