Configure optimizers in 1.8 with FSDP? #16402
Unanswered
w2ex
asked this question in
DDP / multi-GPU / multi-node
Replies: 3 comments
-
It seems I also hit something similar moving from 1.7 to 1.8. Does anybody have a clue?
-
I also have the same issue.
-
What is the correct way to use
-
Hello,
I recently made the jump to PL 1.8.4 from 1.7.7.
However, it seems to be breaking my script. I use Fairscale FSDP to shard my model.
Originally, I used `self.model.parameters()` in the `configure_optimizers` function of my LightningModule to pass a list of dicts of the form `[{"params": param, "weight_decay": self.weight_decay}]` to my optimizer. This now raises the error `optimizer got an empty parameter list`, which seems consistent with the note I see in the doc here. Following this note and the error displayed, I tried simply using `torch.optim.Optimizer(self.trainer.model.parameters(), ...)`.
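For reference, here is a minimal sketch of the param-group construction I'm describing, in plain torch (the model, learning rate, and weight-decay value here are placeholders, not my actual setup; in my LightningModule the list is built from `self.model.parameters()`):

```python
import torch
from torch import nn

# Placeholder model and hyperparameters, just to illustrate the shape
# of the per-parameter dicts passed to the optimizer.
model = nn.Linear(4, 2)
weight_decay = 0.01

# One dict per parameter, each carrying its own weight_decay, as in my
# configure_optimizers.
param_groups = [
    {"params": [p], "weight_decay": weight_decay}
    for p in model.parameters()
]

optimizer = torch.optim.SGD(param_groups, lr=1e-3)
```

Under FSDP in 1.8 this same construction yields the empty-parameter-list error, since the parameters are apparently no longer visible through `self.model.parameters()` at that point.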
However, when I do this, PL seems to no longer detect any parameters:
It appears that `self.trainer.model.parameters()` returns one generator per shard, each containing a single Parameter holding all of the parameter values of that shard. Training then fails on batch 2 (edit: this is due to my batch accumulator; it actually fails on the first call to the optimizer) with this error:
`The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.`
EDIT: Using the model hooks to print messages and try to locate the error, it appears that this happens after the `on_before_backward()` hook but before the `on_after_backward()` hook of the last batch of the accumulator. This is weird, because the backward passes of the previous batches of the accumulator complete just fine. My guess is that this is due to the optimizer call.
EDIT 2: it appears this error occurs when using FSDP and the Fairscale checkpointing wrapper simultaneously (tried on 2 different architectures). Replacing FSDP with DDP, OR removing the checkpointing, solves the issue. But I need both.
Am I doing anything wrong here? It used to work fine up to 1.7.7, and following the instructions from the doc does not resolve anything.
Thank you.