@rcano-baseten rcano-baseten commented Sep 22, 2025

This script is pretty hacky but does all the necessary things to make the project "magical" when we press play.

  1. What should the script in the cookbook look like, given the state of the product today?
  2. Are there truss utility functions or CLI commands that we could add for customers to invoke?
  3. How should the product change based on the things we need to do here?

Can we add a case study to https://www.notion.so/ml-infra/Truss-Training-Tooling-Idea-27891d247273802da15fed940e85e440#27891d247273802da15fed940e85e440?

--optimizer_cpu_offload true \
--use_precision_aware_optimizer true \
--use_hf 1 \
--wandb_project qwen3_moe_megatron \
Contributor Author

  • Wandb expects the latest_checkpoint_iter.txt file to be accessible from all nodes.
  • The swift docs recommend checkpointing to "shared storage".
  • So we need to checkpoint to the training cache.

cd /root/
export DATASET="zai-org/LongAlign-10k"
export MODEL_ID="Qwen/Qwen3-30B-A3B-Instruct-2507"
export CKPT_DIR=${BT_RW_CACHE_DIR}/${BT_TRAINING_JOB_ID}
Contributor Author

Here, the checkpoint dir lives in the cache and is namespaced by training job ID, so the next job doesn't overwrite this job's data.
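A minimal sketch of that namespacing, in case it is useful for the cookbook discussion. `resolve_ckpt_dir` is a hypothetical helper (not an existing truss function); it assumes only that `BT_RW_CACHE_DIR` and `BT_TRAINING_JOB_ID` are set, as in the script above, and it fails early if either is empty.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a per-job checkpoint directory inside the cache.
resolve_ckpt_dir() {
  # ${N:?msg} aborts with an error if the argument is missing or empty.
  local cache_dir="${1:?cache dir is required}"
  local job_id="${2:?job id is required}"
  # Strip one trailing slash so we never produce "cache//job".
  local dir="${cache_dir%/}/${job_id}"
  mkdir -p "$dir"
  printf '%s\n' "$dir"
}

# In the training script this could replace the plain export:
#   CKPT_DIR="$(resolve_ckpt_dir "$BT_RW_CACHE_DIR" "$BT_TRAINING_JOB_ID")"
#   export CKPT_DIR
```

Creating the directory up front means a missing or read-only cache shows up before training starts rather than at the first checkpoint.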

Comment on lines +26 to +37
if [[ "${BT_NODE_RANK}" == "0" ]]; then
    echo "Setting up continuous rsync from shared file system to checkpointing directory"
    # Start a background loop that continuously syncs
    (
        while true; do
            rsync -avz --delete "$CKPT_DIR/" "$BT_CHECKPOINT_DIR/"
            sleep 30 # Sync every 30 seconds
        done
    ) &
    RSYNC_PID=$!
    echo "Continuous rsync started with PID: $RSYNC_PID"
fi
Contributor Author

We set up rsync in the background to continuously copy data from the cache to the checkpointing volume.
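One possible shape for this if it ever becomes a reusable utility: a start/stop pair, so the loop can be cleaned up when the script exits. This is only a sketch; `start_background_sync`, `stop_background_sync`, and `SYNC_PID` are hypothetical names, and the rsync flags are the ones already used in the script.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the background sync loop.

# Start a loop that mirrors src into dst every `interval` seconds.
# The PID of the loop is stored in the global SYNC_PID.
start_background_sync() {
  local src="$1" dst="$2" interval="${3:-30}"
  (
    while true; do
      # Keep looping even if a single rsync run fails.
      rsync -avz --delete "${src%/}/" "${dst%/}/" \
        || echo "rsync failed; will retry in ${interval}s" >&2
      sleep "$interval"
    done
  ) &
  SYNC_PID=$!
}

# Stop the loop if it is still running, and reap it.
stop_background_sync() {
  if [[ -n "${SYNC_PID:-}" ]] && kill -0 "$SYNC_PID" 2>/dev/null; then
    kill "$SYNC_PID" 2>/dev/null || true
    wait "$SYNC_PID" 2>/dev/null || true
  fi
  SYNC_PID=""
}

# Possible usage on rank 0:
#   start_background_sync "$CKPT_DIR" "$BT_CHECKPOINT_DIR" 30
#   trap stop_background_sync EXIT
```

The trap is a design choice rather than a requirement: it just avoids leaving the loop running if the script dies partway through.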

Comment on lines +111 to +113
# Perform final synchronization to ensure everything is synced
echo "Performing final rsync..."
rsync -avz --delete "$CKPT_DIR/" "$BT_CHECKPOINT_DIR/"
Contributor Author

Make sure everything is synced by running one final, blocking rsync of the directories.
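If the final sync needs to be more defensive, a small retry wrapper is one option. This is a sketch, not what the script does today; `final_sync` and its retry count are hypothetical, and the rsync flags are copied from the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: blocking sync with a few retries.
# Returns 0 on the first successful rsync, 1 if every attempt fails.
final_sync() {
  local src="$1" dst="$2" attempts="${3:-3}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    if rsync -avz --delete "${src%/}/" "${dst%/}/"; then
      return 0
    fi
    echo "final rsync attempt ${i}/${attempts} failed" >&2
    sleep 1
  done
  return 1
}

# Possible usage at the end of the training script:
#   final_sync "$CKPT_DIR" "$BT_CHECKPOINT_DIR" 3
```

If the background loop is still running at this point, stopping it first (for example with `kill "$RSYNC_PID"`) avoids two rsync processes writing to the same destination at once.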

Comment on lines +121 to +130
swift export \
--mcore_model "${CKPT_DIR}/${V0_DIR}" \
--to_hf true \
--torch_dtype bfloat16 \
--output_dir megatron_output/hf_converted \
--push_to_hub true \
--hub_token $HF_TOKEN \
--hub_model_id rayraycano/megatron-qwen3-30b-a3b

echo "Final synchronization complete!"
Contributor Author

We convert the weights back to HF format here, but let's avoid uploading to an HF repository.
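A sketch of what the export could look like without the upload. It reuses only the flags already present in the command above and drops the three hub-related ones, on the assumption that omitting `--push_to_hub` leaves uploading disabled. `build_export_cmd` is a hypothetical helper, and writing the converted weights under `$BT_CHECKPOINT_DIR` is also an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assemble the export command as an array so it can be
# logged or inspected before it is run. Nothing is executed here.
build_export_cmd() {
  local mcore_dir="$1" out_dir="$2"
  EXPORT_CMD=(
    swift export
    --mcore_model "$mcore_dir"
    --to_hf true
    --torch_dtype bfloat16
    --output_dir "$out_dir"
  )
}

# Possible usage (output location is an assumption):
#   build_export_cmd "${CKPT_DIR}/${V0_DIR}" "${BT_CHECKPOINT_DIR}/hf_converted"
#   echo "Running: ${EXPORT_CMD[*]}"
#   "${EXPORT_CMD[@]}"
```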

Comment on lines +116 to +120
pushd $CKPT_DIR
ls -la
V0_DIR=$(echo v0-*)
popd
echo "V0_DIR: $V0_DIR"
Contributor Author

There's an intermediate directory, named with a datestamp, for each training run.

One checkpoint:

training_dir/v0-20250925-123459/iter_000040/....

Another checkpoint from the same run:

training_dir/v0-20250925-123459/iter_000080/....

This snippet is us figuring out the "v0-20250925-123459" part.
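The `echo v0-*` approach returns the literal string `v0-*` when nothing matches and a space-separated list when several directories match. A slightly more defensive sketch is below; `latest_run_dir` is a hypothetical helper, and picking the last match assumes the datestamped names sort chronologically.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of the newest v0-* directory under
# `parent`, or return 1 if there is none.
latest_run_dir() {
  local parent="$1"
  local matches
  # nullglob makes an unmatched pattern expand to nothing instead of itself.
  shopt -s nullglob
  matches=("${parent%/}"/v0-*/)
  shopt -u nullglob
  if ((${#matches[@]} == 0)); then
    echo "no v0-* directory under $parent" >&2
    return 1
  fi
  # Glob results are sorted, so the last entry has the latest datestamp.
  local last="${matches[${#matches[@]}-1]}"
  last="${last%/}"
  printf '%s\n' "${last##*/}"
}

# Possible usage:
#   V0_DIR="$(latest_run_dir "$CKPT_DIR")"
```

Note that the helper turns `nullglob` off again unconditionally, which is fine for this script but would need care in a script that relies on `nullglob` elsewhere.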

Comment on lines +92 to +95
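set +e # Allow a non-zero exit so we can capture it
run_megatron_training 2>&1 | tee training.log
EXIT_CODE=${PIPESTATUS[0]} # Exit status of the training command, not of tee
set -e # Re-enable exit on error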
Contributor Author

Megatron, in highly distributed workloads, might hang at the end. I did some searching, and it looks like this isn't uncommon (#1541, #735, #1207).

So here we allow a non-zero exit code, capture it, and then re-enable error detection.

Piping to training.log was just something Claude suggested; I don't know how useful it is.
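One thing to watch with the pipe: after `cmd | tee file`, `$?` is the exit status of `tee` unless `pipefail` is set, so the training exit code can be lost. Bash's `PIPESTATUS` array holds the status of each command in the pipeline. A minimal sketch, where `run_and_log` is a hypothetical wrapper:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, mirror its output to a log file,
# and return the command's own exit status rather than tee's.
run_and_log() {
  local log_file="$1"
  shift
  "$@" 2>&1 | tee "$log_file"
  # PIPESTATUS[0] is the exit status of the first command in the pipeline.
  return "${PIPESTATUS[0]}"
}

# Possible usage; the `||` keeps `set -e` from aborting on a non-zero exit:
#   EXIT_CODE=0
#   run_and_log training.log run_megatron_training || EXIT_CODE=$?
```

As for the utility of `tee`: it keeps the output streaming to the console while also leaving a copy in training.log, which can then be inspected or synced along with the checkpoints.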
