Configure optimizers in 1.8 with FSDP? #16402
Unanswered
w2ex
asked this question in
DDP / multi-GPU / multi-node
Replies: 3 comments
-
It seems I also hit something similar moving from 1.7 to 1.8. Does anybody have a clue?
-
I also have the same issue.
-
What is the correct way to use
-
Hello,
I recently made the jump to PL 1.8.4 from 1.7.7.
However, it seems to be breaking my script. I use Fairscale FSDP to shard my model.
Originally, I used `self.model.parameters()` in the `configure_optimizers` function of my LightningModule to pass a list of dicts of the form `[{"params": param, "weight_decay": self.weight_decay}]` to my optimizer. This now raises the error `optimizer got an empty parameter list`, which seems consistent with the note I see in the doc here. Following this note and the error displayed, I tried simply using `torch.optim.Optimizer(self.trainer.model.parameters(), ...)`.
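For reference, here is a minimal sketch of the param-group construction I'm describing, in plain torch (the model, learning rate, and weight-decay value here are placeholders, not my actual setup; in my LightningModule the list is built from `self.model.parameters()`):

```python
import torch
from torch import nn

# Placeholder model and hyperparameters, just to illustrate the shape
# of the per-parameter dicts passed to the optimizer.
model = nn.Linear(4, 2)
weight_decay = 0.01

# One dict per parameter, each carrying its own weight_decay, as in my
# configure_optimizers.
param_groups = [
    {"params": [p], "weight_decay": weight_decay}
    for p in model.parameters()
]

optimizer = torch.optim.SGD(param_groups, lr=1e-3)
```

Under FSDP in 1.8 this same construction yields the empty-parameter-list error, since the parameters are apparently no longer visible through `self.model.parameters()` at that point.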
However, when I do this, PL seems to no longer detect any parameters:
It appears that `self.trainer.model.parameters()` returns one generator per shard, each containing a single Parameter holding all of the parameter values of that shard. Training then fails on batch 2 (edit: this is due to my batch accumulator; it actually fails on the first call to the optimizer) with this error:
`The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.`
EDIT: Using the model hooks to print messages and try to locate the error, it appears that this happens after the `on_before_backward()` hook but before the `on_after_backward()` hook of the last batch of the accumulator. This is weird, because the backward passes of the previous batches of the accumulator complete just fine. My guess is that this is due to the optimizer call.
EDIT 2: it appears this error occurs when using FSDP and the Fairscale checkpointing wrapper simultaneously (tried on 2 different architectures). Replacing FSDP with DDP, OR removing the checkpointing, solves the issue. But I need both.
Am I doing anything wrong here? It used to work fine up to 1.7.7, and following the instructions from the doc does not resolve anything.
Thank you.