DDP Hangs with TORCH_DISTRIBUTED_DEBUG = DETAIL #13503
-
I'm not certain whether this is user error or a PyTorch/Lightning issue, so am posting a discussion instead. Adding the line `os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"` to the BoringModel script makes DDP training hang. To reproduce:

```python
import argparse
import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run(cl_args):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()

    # Start changed code
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # set to DETAIL for runtime logging.
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args(cl_args.split() if cl_args else None)
    trainer = Trainer.from_argparse_args(args)
    # End changed code

    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run('--gpus 2 --strategy ddp')
```
-
I confirmed the hang with my script, too: https://github.com/akihironitta/gist/blob/repro/13503-torch-dist-debug-detail/pl_boring_model/main.py

Environment:

```
$ pip list | grep torch
torch              1.12.0+cu116
torchaudio         0.12.0+cu116
torchmetrics       0.9.2
torchvision        0.13.0+cu116
```

Giving the doc page a read and trying out a few runs, I think the env var is supposed to be set on rank 0 only, so instead, you might want to set the env var outside the script.
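For example, exporting the variable in the shell, e.g. `TORCH_DISTRIBUTED_DEBUG=DETAIL python main.py --gpus 2 --strategy ddp`, should mean every launched rank inherits the same value. If you want to double-check what each rank actually sees, a minimal sketch (not from the thread; the callback name is made up, the hooks and attributes are standard Lightning API) would be:

```python
# Minimal sketch: log what each rank sees for TORCH_DISTRIBUTED_DEBUG once the
# distributed processes are up. DebugEnvCheck is a made-up name for illustration.
import os

from pytorch_lightning import Callback


class DebugEnvCheck(Callback):
    def setup(self, trainer, pl_module, stage=None):
        value = os.environ.get("TORCH_DISTRIBUTED_DEBUG", "<unset>")
        print(f"[rank {trainer.global_rank}] TORCH_DISTRIBUTED_DEBUG={value}")


# Hypothetical usage with the repro above:
# trainer = Trainer.from_argparse_args(args, callbacks=[DebugEnvCheck()])
```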
-
Reading @akihironitta's response and looking at the documentation again, I noticed that they set the environment variable prior to calling `mp.spawn`. Moving the `os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'` line outside of the main function prevented hanging:

```python
import argparse

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


# Start changed code
import os

os.environ['TORCH_DISTRIBUTED_DEBUG'] = 'DETAIL'
# End changed code


def run(cl_args):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()

    # Start changed code
    parser = argparse.ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args(cl_args.split() if cl_args else None)
    trainer = Trainer.from_argparse_args(args)
    # End changed code

    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run('--gpus 2 --strategy ddp')
```

I presume this has to do with where the Trainer is forking the process. In summary, it seems one can either set the environment variable outside the script, as @akihironitta suggested, or set it at module level so that it is in place before the Trainer launches the other ranks.
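Not from the original posts, but if anyone wants to confirm that the ranks really end up with different debug settings, one option is to compare the value across ranks once the process group exists, e.g. from `on_fit_start`. This is a sketch only; `check_debug_env` is a made-up helper, and if a mismatch is already breaking collectives the check itself can hang rather than raise:

```python
# Sketch of a cross-rank consistency check for TORCH_DISTRIBUTED_DEBUG.
# Assumes an initialized process group (all_gather_object needs torch >= 1.8).
import os

import torch.distributed as dist


def check_debug_env():
    if not dist.is_available() or not dist.is_initialized():
        return
    value = os.environ.get("TORCH_DISTRIBUTED_DEBUG", "<unset>")
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, value)
    if len(set(gathered)) > 1:
        raise RuntimeError(f"TORCH_DISTRIBUTED_DEBUG differs across ranks: {gathered}")
```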