DDPShardedStrategy with gradient accumulation #13426
Unanswered
SerezD asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 4 replies
-
I haven't been able to reproduce this error. Which versions of Lightning and FairScale are you using? Below is the code I used to try to reproduce it:

import os
import torch
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.strategies import DDPShardedStrategy
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        accelerator='gpu',
        devices=2,
        strategy=DDPShardedStrategy(),
        accumulate_grad_batches=12,
        num_sanity_val_steps=0,
        max_epochs=1,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)


if __name__ == "__main__":
    run()
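If it helps, here is a minimal snippet for reporting the versions in question; it only relies on the standard __version__ attributes of the installed packages:

import fairscale
import pytorch_lightning as pl
import torch

# Print the package versions relevant to this thread.
print("torch:", torch.__version__)
print("pytorch_lightning:", pl.__version__)
print("fairscale:", fairscale.__version__)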
-
I need to use both DDPShardedStrategy and accumulate_grad_batches > 1, but this combination outputs a warning during training.
The question is: how can I remove the warning (by using a no_sync() context)?
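For context, here is a minimal sketch of the generic no_sync() gradient-accumulation pattern in plain PyTorch rather than Lightning. It assumes a DDP-style wrapper that exposes a no_sync() context manager (both torch.nn.parallel.DistributedDataParallel and FairScale's ShardedDataParallel do): synchronization is skipped for every micro-batch except the last one of each accumulation window. The train_with_accumulation helper and its argument names are illustrative, not an existing API.

import contextlib

def train_with_accumulation(ddp_model, optimizer, dataloader, accumulate=12):
    # Accumulate gradients locally and only reduce them across ranks
    # on the last micro-batch of each accumulation window.
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        is_sync_step = (i + 1) % accumulate == 0
        # no_sync() suppresses the inter-process gradient reduction;
        # on the sync step a null context lets the reduction run as usual.
        ctx = contextlib.nullcontext() if is_sync_step else ddp_model.no_sync()
        with ctx:
            loss = ddp_model(batch).sum() / accumulate
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()

In Lightning itself, accumulation is driven by accumulate_grad_batches, so whether no_sync() is applied under the hood depends on the strategy implementation in the installed Lightning/FairScale versions, which is presumably why the versions were asked about above.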