DDP Hangs with TORCH_DISTRIBUTED_DEBUG = DETAIL #13503

Reading @akihironitta's response and looking at the documentation again, I noticed that they set the environment variable prior to calling mp.spawn. Moving the os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL' line outside of the main function prevented the hang (see the sketch after the snippet below).

import argparse
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def _…
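
For reference, a minimal sketch of what the working ordering looks like: the environment variable is set at module scope, before pytorch_lightning (and therefore mp.spawn) ever runs, rather than inside the main function. The Trainer arguments below (accelerator, devices, strategy='ddp_spawn', max_epochs) are illustrative placeholders rather than the exact settings from the original run, and the sketch assumes the rest of BoringModel and RandomDataset follow the standard Lightning bug-report template shown above.

import os

# Setting the debug level at module scope means it is already in the
# environment before the DDP workers are spawned; setting it inside
# main() was what led to the hang described above.
os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'

from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

def main():
    # BoringModel and RandomDataset as defined in the snippet above
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)
    trainer = Trainer(
        accelerator='gpu',      # illustrative settings, not the original run's
        devices=2,
        strategy='ddp_spawn',   # spawn-based DDP, i.e. mp.spawn under the hood
        max_epochs=1,
    )
    trainer.fit(model, train_dataloaders=train_loader)

if __name__ == '__main__':
    main()

With the variable set before mp.spawn is ever called, the same script that previously hung completes normally.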

Answer selected by kelvins64