DDP: All devices get the same data #16548
Replies: 1 comment 11 replies
-
This only gets added if you set it.
At the very beginning, typically before you do anything with random number generators, data, the model, training, etc.
No, each GPU should only see 1/N of the data, where N is the number of GPUs. If this isn't the case, it means the distributed sampler wasn't applied.
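To make the 1/N split concrete, here is a minimal pure-Python sketch of the round-robin sharding scheme that `torch.utils.data.DistributedSampler` uses (the function name `shard_indices` is illustrative, not a library API; note the padding step, which can make a few indices appear on two ranks):

```python
# Illustrative sketch: shard `dataset_len` indices across `num_replicas`
# ranks, round-robin style, as a distributed sampler would.
import math

def shard_indices(dataset_len, num_replicas, rank):
    indices = list(range(dataset_len))
    per_replica = math.ceil(dataset_len / num_replicas)
    # Pad by wrapping around so every rank gets the same number of samples.
    indices += indices[: per_replica * num_replicas - dataset_len]
    # Each rank takes every num_replicas-th index, starting at its rank.
    return indices[rank::num_replicas]

shards = [shard_indices(10, 4, r) for r in range(4)]
# Every rank gets ceil(10/4) = 3 indices; together they cover the dataset.
```

If all ranks print identical index lists instead of shards like these, the sampler replacement never happened.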
-
Hi!
I am attempting to implement multi-GPU training for our single-cell genomics model written with Pyro / scvi-tools - a collaboration between the Stegle, Bayraktar and Teichmann groups (Wellcome Sanger Institute, DKFZ, EMBL, including @macwiatrak @gtca), as well as with @adamgayoso (scvi-tools). This project would also help scvi-tools (a single-cell genomics modelling project) provide multi-GPU training for all models.
Our current model uses a custom Dataset and BatchSampler (map-style) to load various variables from the AnnData object using both `obs` and `var` indices. This is a reasonably complex project with many moving parts, so, given that I am new to Lightning and multi-GPU training, it is hard for me to produce simpler reproducible examples. `DistributedSamplerWrapper` makes sense for our application, so any suggestions on what is going on and how to fix the issues would be great.
Following on from the discussion in the parameter loading issue, @awaelchli, here is a more detailed description:
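For context, the kind of custom batch sampler described above can be reduced to something like the following hypothetical stand-in (not our actual code; the plain index grouping replaces the AnnData-specific logic, and `randomise_batches` is the renamed shuffle argument mentioned under Problem 2 below). It is this object that a `DistributedSamplerWrapper` would need to re-shard per rank:

```python
# Hypothetical minimal stand-in for a custom map-style BatchSampler:
# yields lists of `obs` indices, one list per batch.
import random

class SimpleBatchSampler:
    def __init__(self, n_obs, batch_size, randomise_batches=True, seed=0):
        self.n_obs = n_obs
        self.batch_size = batch_size
        self.randomise_batches = randomise_batches
        self.seed = seed

    def __iter__(self):
        order = list(range(self.n_obs))
        if self.randomise_batches:
            random.Random(self.seed).shuffle(order)
        for start in range(0, self.n_obs, self.batch_size):
            yield order[start:start + self.batch_size]

    def __len__(self):
        return (self.n_obs + self.batch_size - 1) // self.batch_size

batches = list(SimpleBatchSampler(10, 4, randomise_batches=False))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```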
Problem 1 (main problem): all devices get the same data, suggesting that the distributed sampler wrapper doesn't select a different subset of the data for each rank. After reading #7186 and other related issues, I don't understand what the correct setup is to address this.
Problem 2: during `trainer.fit()`, `sampler.shuffle` is set to False for the training batch sampler (my sampler, not `DistributedSamplerWrapper`). This can be worked around by replacing `shuffle` with a differently named argument (e.g. `randomise_batches`); however, Problem 1 still holds: all GPUs see the same data. This line sets `shuffle` to False for all samplers except `RandomSampler`. The `DistributedSamplerWrapper` docs say this should not happen, but maybe it does [bug]?

Also, `self.trainer.train_dataloaders` doesn't exist before or after `trainer.fit`. We use a `pl.LightningDataModule` with `.setup()` and `.train_dataloader()` methods.

Also, the lack of debug messages about the seed suggests that `worker_init_fn=pl_worker_init_function` is never called, i.e. it is never added. If I add it manually, I see that all workers in all ranks have the same seed.

Problem 3: I am not using `DistributedSampler`; however, setting `replace_sampler_ddp=False` doesn't raise any issues, errors or warnings. Setting `replace_sampler_ddp=False` is also the only way I can get different GPU devices to see different data batches. As far as I understand, this is not correct, because each process iterates the full set of data batches, so every observation is seen by every process.
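To illustrate why the Problem 3 setup duplicates work: if every rank iterates the full batch list, each observation is processed N times per epoch, whereas striding batches across ranks (the effect a working `DistributedSamplerWrapper` should produce) removes the overlap. A pure-Python sketch under those assumptions:

```python
# Sketch: without distributed sampling, every rank iterates all batches,
# so each sample is seen num_replicas times per epoch. Taking every
# num_replicas-th batch per rank removes the duplication.
from collections import Counter

batches = [[0, 1], [2, 3], [4, 5], [6, 7]]  # toy batch list
num_replicas = 2

# replace_sampler_ddp=False with no sharding: both ranks see everything.
seen_unsharded = Counter(
    s for rank in range(num_replicas) for batch in batches for s in batch
)

# Per-rank batch striding: rank r takes batches r, r+N, r+2N, ...
seen_sharded = Counter(
    s
    for rank in range(num_replicas)
    for batch in batches[rank::num_replicas]
    for s in batch
)
# seen_unsharded counts each sample twice; seen_sharded counts each once.
```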
Problem 4: Where in the training script should `seed_everything(1, workers=True)` be called? Currently, it is called when the scvi-tools package is loaded, at the very start of the script.
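On seeding: wherever the call site ends up, dataloader workers on different ranks need distinct seeds. A minimal sketch of deriving a unique seed per (global rank, worker) pair, assuming a base seed from `seed_everything` (the function name and formula here are illustrative, not Lightning's exact implementation):

```python
# Illustrative seed derivation: distinct per (rank, worker) so that no two
# dataloader workers across ranks draw identical random streams.
def derive_worker_seed(base_seed, global_rank, worker_id, num_workers):
    return base_seed + global_rank * num_workers + worker_id

seeds = {
    (rank, worker): derive_worker_seed(1, rank, worker, num_workers=2)
    for rank in range(2)      # 2 GPUs
    for worker in range(2)    # 2 dataloader workers each
}
# All four seeds are distinct; identical seeds across ranks (the symptom
# under Problem 2) would make every rank shuffle/augment identically.
```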