What is the proper way to load large pretrained HF model (e.g. bloom) for inference and finetuning? #17419
Unanswered
richarddwang asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
- Hello, did you fix the weird message "invalidate trace cache @ step xxx"?
Problem
I want to load a Hugging Face model that is larger than my 40GB A100, for inference and even for further finetuning. This makes CPU offloading necessary, so I want to load the model with DeepSpeed ZeRO-3 offload.
While I can run bloomz with huggingface/transformers-bloom-inference -> bloom-ds-zero-inference.py via
deepspeed --num_gpus 2 bloom-ds-zero-inference.py --cpu_offload
I can not load the model with Lightning in the same way (a minimal sketch of my naive attempt is below): it always results in CUDA OOM on the line
self.model = AutoModelForCausalLM.from_pretrained...
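A minimal sketch of that naive attempt (simplified; the exact hook, model name, and trainer flags here are illustrative, not my full script):

```python
import pytorch_lightning as pl
from transformers import AutoModelForCausalLM


class NaiveLitModel(pl.LightningModule):
    def __init__(self, model_name: str = "bigscience/bloomz"):
        super().__init__()
        self.model_name = model_name

    def configure_sharded_model(self):
        # Under the DeepSpeed strategy this hook runs inside deepspeed.zero.Init;
        # loading the full pretrained checkpoint here is where CUDA OOM happens.
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)


trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_3_offload",  # ZeRO stage 3 with CPU offload
    precision=16,
)
```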
My trial to fix the problem
After some research and trials, the approach sketched below makes things work, launched with
deepspeed --include "localhost:6,7" myscript.py
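Simplified, the working module looks roughly like this (class and argument names, and the way the DeepSpeed config dict is passed in, are illustrative, not my exact code):

```python
import pytorch_lightning as pl
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig


class LitBloom(pl.LightningModule):
    def __init__(self, model_name: str = "bigscience/bloomz", deepspeed_config: dict = None):
        super().__init__()
        self.model_name = model_name
        self.deepspeed_config = deepspeed_config

    def setup(self, stage=None):
        # Keeping a live reference to HfDeepSpeedConfig *before* from_pretrained
        # makes transformers detect ZeRO-3 and partition / CPU-offload the weights
        # as they are loaded, instead of materializing the full model first.
        self.dschf = HfDeepSpeedConfig(self.deepspeed_config)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
```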
What I fixed:
- configure_sharded_model is called under the deepspeed.zero.Init context, which seems to be meant for randomly initializing a model rather than for constructing it from pretrained parameters, as suggested in HF/DeepSpeed Integration.
- So I load the pretrained model in the setup hook instead, and keep the HfDeepSpeedConfig(deepspeed_config) object alive so that the weights are sharded instantly as they are loaded, as suggested in nontrainer-deepspeed-integration (the config I have in mind is sketched after this list).
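For completeness, a ZeRO-3 CPU-offload config along these lines is what I mean by deepspeed_config (the values and the way it is handed to the Trainer are illustrative, not my exact settings):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

# Illustrative ZeRO-3 config with parameter and optimizer offload to CPU;
# the same dict is what HfDeepSpeedConfig receives in the setup hook above.
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # local ranks exposed by the launcher, not the physical ids 6 and 7
    strategy=DeepSpeedStrategy(config=deepspeed_config),
)
```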
What needs to be fixed:
- Device selection with the deepspeed launcher: to use CUDA devices 6 and 7, I have to specify --include "localhost:6,7" when launching the script and specify devices=[0,1] for the Trainer, which is not a great experience.
- trainer.validate: there is an error when computing metrics. The error seems related to problematic distributed process handling.
Ask for help