CPU DDP freezes on cluster node, but not on local machine #16223
Unanswered
moritzschaefer asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
On my server node, training a LightningModule using DDP leads to a freeze, even before entering the training loop.
This is the last output before it freezes: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
The node has 2 GPUs, and the freeze occurs independently of whether accelerator is set to "gpu" or "cpu".
The source code I used is from this lightning demo: https://colab.research.google.com/drive/1F_RNcHzTfFuQf-LeKvSlud6x7jXYkG31 (class MNISTModel)
Here is the trainer code:
trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu")  # freezes with accelerator="gpu" and accelerator="cpu" alike
trainer.fit(mnist_model)
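For completeness, here is the full script I'm running as one file. This is a minimal sketch: the MNISTModel body is paraphrased from the linked notebook, and the in-model train_dataloader is my addition so that trainer.fit(mnist_model) works standalone.

```python
import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl


class MNISTModel(pl.LightningModule):
    """Single-layer MNIST classifier, as in the Lightning demo notebook."""

    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True,
                        transform=transforms.ToTensor())
        return DataLoader(dataset, batch_size=32)


if __name__ == "__main__":
    mnist_model = MNISTModel()
    # Freezes at "Initializing distributed: ..." with both accelerator values
    trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="gpu")
    trainer.fit(mnist_model)
```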
I installed a fresh pytorch_lightning conda environment to make sure that an old/unsupported package is not the issue here.
Notably, on my local machine, running trainer = pl.Trainer(devices=2, strategy="ddp", accelerator="cpu") does not lead to a freeze, so it appears to be a hardware/machine/environment issue. Any idea how to debug this?
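In case it helps with debugging suggestions: a minimal torch.distributed-only smoke test I can run on the node to check whether the hang reproduces without Lightning. This is a sketch under the assumption that the problem is in the rendezvous itself; gloo is the backend used for CPU DDP, and TORCH_DISTRIBUTED_DEBUG / TORCH_CPP_LOG_LEVEL are standard PyTorch debug env vars, not Lightning-specific.

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings equivalent to a single-node, 2-process run
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo is the backend Lightning uses for CPU DDP
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"rank {rank}/{world_size} initialized")
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    # Ask torch.distributed for verbose logs during init
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
    mp.spawn(worker, args=(2,), nprocs=2)
```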
PS: cross posted on slack (https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1672591654954199)