Fabric, How to run lig-gpt/finetune/lora.py on multiple nodes #18404

Andcircle · 2023-08-26T18:55:43Z

Andcircle
Aug 26, 2023

The code is here https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py
the environment is: 2 nodes x 8 A100 80gb

I did:

python finetune/lora.py on both master and worker node
lightning run model finetune/lora.py --strategy=fsdp --devices=8 --num-nodes=2 --accelerator=cuda --precision="bf16" --main-address $MASTER_ADDR --main-port=$MASTER_PORT on both master and worker node

Always get following error:

seems like fabric is default to port 38807, how can I customize it? the only port open is 23456 in my use case
in 2nd approach as above mentioned, although I appointed port to be $MASTER_PORT, but fabric.launch still looking for 38807
(when I use lightning run start, should I change the code in finetune/lora.py ?)

Traceback (most recent call last):
  File "/sensei-fs/users/zhangli/workspace/lit-gpt/finetune/lora_llama2.py", line 346, in <module>
    CLI(setup)
  File "/home/user/.local/lib/python3.8/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/home/user/.local/lib/python3.8/site-packages/jsonargparse/_cli.py", line 181, in _run_component
    return component(**cfg)
  File "/sensei-fs/users/zhangli/workspace/lit-gpt/finetune/lora_llama2.py", line 100, in setup
    fabric.launch(main, data_dir, checkpoint_dir, out_dir, quantize)
  File "/home/user/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 834, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 920, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/lightning/fabric/fabric.py", line 925, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/sensei-fs/users/zhangli/workspace/lit-gpt/finetune/lora_llama2.py", line 155, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir, speed_monitor)
  File "/sensei-fs/users/zhangli/workspace/lit-gpt/finetune/lora_llama2.py", line 181, in train
    validate(fabric, model, val_data, tokenizer, max_seq_length)  # sanity check
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/sensei-fs/users/zhangli/workspace/lit-gpt/finetune/lora_llama2.py", line 270, in validate
    logits = model(input_ids)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/lightning/fabric/wrappers.py", line 118, in forward
    output = self._forward_module(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 825, in forward
    args, kwargs = _pre_forward(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 428, in _pre_forward
    unshard_fn(state, handle)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 463, in _pre_forward_unshard
    _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 346, in _unshard
    handle.unshard()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/flat_param.py", line 1251, in unshard
    padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/flat_param.py", line 1339, in _all_gather_flat_param
    dist.all_gather_into_tensor(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2897, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fabric, How to run lig-gpt/finetune/lora.py on multiple nodes #18404

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Fabric, How to run lig-gpt/finetune/lora.py on multiple nodes #18404

Uh oh!

Andcircle Aug 26, 2023

Replies: 0 comments

Andcircle
Aug 26, 2023