CPU DDP freezes on cluster node, but not on local machine #16223
Unanswered
moritzschaefer asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
On my server node, training a LightningModule using DDP leads to a freeze, even before entering the training loop.
This is the last output before it freezes: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
The node has 2 GPUs, and the freeze occurs independently of whether accelerator is set to "gpu" or "cpu".
The source code I used is from this lightning demo: https://colab.research.google.com/drive/1F_RNcHzTfFuQf-LeKvSlud6x7jXYkG31 (class MNISTModel)
Here is the trainer code:
trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu")  # freezes with accelerator="gpu" and accelerator="cpu" alike
trainer.fit(mnist_model)
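For completeness, here is the full script I'm running as one file. This is a minimal sketch: the MNISTModel body is paraphrased from the linked notebook, and the in-model train_dataloader is my addition so that trainer.fit(mnist_model) works standalone.

```python
import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    """Single-layer MNIST classifier, as in the Lightning demo notebook."""

    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True,
                        transform=transforms.ToTensor())
        return DataLoader(dataset, batch_size=32)


if __name__ == "__main__":
    mnist_model = MNISTModel()
    # Freezes at "Initializing distributed: ..." with both accelerator values
    trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu")
    trainer.fit(mnist_model)
```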
I installed a fresh pytorch_lightning conda environment to make sure that an old/unsupported package is not the issue here.
Notably, on my local machine, running trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="cpu") does not lead to a freeze, so it appears to be a hardware/machine/environment issue. Any idea how to debug this?
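In case it helps with debugging suggestions: a minimal torch.distributed-only smoke test I can run on the node to check whether the hang reproduces without Lightning. This is a sketch under the assumption that the problem is in the rendezvous itself; gloo is the backend used for CPU DDP, and TORCH_DISTRIBUTED_DEBUG / TORCH_CPP_LOG_LEVEL are standard PyTorch debug env vars, not Lightning-specific.

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings equivalent to a single-node, 2-process run
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo is the backend Lightning uses for CPU DDP
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"rank {rank}/{world_size} initialized")
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    # Ask torch.distributed for verbose logs during init
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
    mp.spawn(worker, args=(2,), nprocs=2)
```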
PS: cross posted on slack (https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1672591654954199)