What is the proper way to load large pretrained HF model (e.g. bloom) for inference and finetuning? #17419
Unanswered
richarddwang asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
- Hello, did you fix the weird message "invalidate trace cache @ step xxx"?
Problem
I want to load a Hugging Face model that is larger than my 40GB A100, for inference and even for further finetuning. This makes CPU offloading necessary, so I want to load the model with DeepSpeed ZeRO-3 offload.
While I can run bloomz with huggingface/transformers-bloom-inference -> bloom-ds-zero-inference.py via
deepspeed --num_gpus 2 bloom-ds-zero-inference.py --cpu_offload
I can not load the model with Lightning in the same way (a minimal sketch of my naive attempt is below): it always results in CUDA OOM on the line
self.model = AutoModelForCausalLM.from_pretrained...
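A minimal sketch of that naive attempt (simplified; the exact hook, model name, and trainer flags here are illustrative, not my full script):

```python
import pytorch_lightning as pl
from transformers import AutoModelForCausalLM


class NaiveLitModel(pl.LightningModule):
    def __init__(self, model_name: str = "bigscience/bloomz"):
        super().__init__()
        self.model_name = model_name

    def configure_sharded_model(self):
        # Under the DeepSpeed strategy this hook runs inside deepspeed.zero.Init;
        # loading the full pretrained checkpoint here is where CUDA OOM happens.
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)


trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_3_offload",  # ZeRO stage 3 with CPU offload
    precision=16,
)
```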
My trial to fix the problem
After some research and trials, the approach sketched below makes things work, launched with
deepspeed --include "localhost:6,7" myscript.py
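Simplified, the working module looks roughly like this (class and argument names, and the way the DeepSpeed config dict is passed in, are illustrative, not my exact code):

```python
import pytorch_lightning as pl
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig


class LitBloom(pl.LightningModule):
    def __init__(self, model_name: str = "bigscience/bloomz", deepspeed_config: dict = None):
        super().__init__()
        self.model_name = model_name
        self.deepspeed_config = deepspeed_config

    def setup(self, stage=None):
        # Keeping a live reference to HfDeepSpeedConfig *before* from_pretrained
        # makes transformers detect ZeRO-3 and partition / CPU-offload the weights
        # as they are loaded, instead of materializing the full model first.
        self.dschf = HfDeepSpeedConfig(self.deepspeed_config)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
```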
What I fixed:
- configure_sharded_model is called under the deepspeed.zero.Init context, which seems to be meant for randomly initializing a model rather than for constructing it from pretrained parameters, as suggested in HF/DeepSpeed Integration.
- So I load the pretrained model in the setup hook instead, and keep the HfDeepSpeedConfig(deepspeed_config) object alive so that the weights are sharded instantly as they are loaded, as suggested in nontrainer-deepspeed-integration (the config I have in mind is sketched after this list).
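For completeness, a ZeRO-3 CPU-offload config along these lines is what I mean by deepspeed_config (the values and the way it is handed to the Trainer are illustrative, not my exact settings):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

# Illustrative ZeRO-3 config with parameter and optimizer offload to CPU;
# the same dict is what HfDeepSpeedConfig receives in the setup hook above.
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # local ranks exposed by the launcher, not the physical ids 6 and 7
    strategy=DeepSpeedStrategy(config=deepspeed_config),
)
```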
What needs to be fixed:
- Device selection with the deepspeed launcher: to use CUDA devices 6 and 7, I have to specify --include "localhost:6,7" when launching the script and specify devices=[0,1] for the Trainer, which is not a great experience.
- trainer.validate: there is an error when computing metrics. The error seems related to problematic distributed process handling.
Ask for help