Only save rank0 with deepspeed_stage_3 #15665
Unanswered
ZeyiLiao asked this question in DDP / multi-GPU / multi-node
Hi,
I am using DeepSpeed stage 3 and found that the checkpoint only saves the `state_dict` and `optim_dict` from rank 0.
I am not sure whether using only the rank 0 result is good enough, or whether the checkpoint should instead contain the results from all ranks, which are then consolidated with `zero_to_fp32` or `convert_zero_checkpoint_to_fp32_state_dict`.
(By the way, which of the two should we use? I have found many conflicting answers online.)
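A toy, pure-Python sketch of the ZeRO stage 3 idea behind this question, assuming a simplified model in which each flattened parameter is split into equal, zero-padded shards, one per rank. The helpers `partition` and `consolidate` are hypothetical and do not reproduce DeepSpeed's on-disk checkpoint format; they only illustrate why a single rank's shard cannot restore a full tensor, which is the gap that consolidation tools such as `zero_to_fp32` are meant to close.

```python
import math
from typing import List


def partition(flat: List[float], world_size: int) -> List[List[float]]:
    """Toy model of ZeRO-3 sharding (hypothetical helper): split a flattened
    parameter into equal shards, one per rank, zero-padding the tail so the
    length divides evenly by world_size."""
    shard_size = math.ceil(len(flat) / world_size)
    padded = flat + [0.0] * (shard_size * world_size - len(flat))
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


def consolidate(shards: List[List[float]], numel: int) -> List[float]:
    """Rebuild the full parameter (hypothetical helper): concatenate shards in
    rank order and drop the padding. Every rank's shard is required."""
    full = [x for shard in shards for x in shard]
    return full[:numel]


if __name__ == "__main__":
    weight = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # 7 elements
    shards = partition(weight, world_size=4)        # shard_size = 2, one pad element
    print(shards[0])                                # rank 0 holds only 2 of the 7 values
    assert consolidate(shards, len(weight)) == weight
```

In this toy model, rank 0's shard alone recovers two of seven values; only concatenating all four shards restores the original parameter.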