NCCL WARN Duplicate GPU detected #13086
Unanswered
PedroRASB
asked this question in
DDP / multi-GPU / multi-node
Replies: 1 comment
-
Hello, I encountered similar issues with PyTorch training on multiple nodes using PBS and NCCL. Interestingly, the code works fine with the 'gloo' backend, but it is noticeably slower. Have you found a solution to this problem?
Original question:
Hello, I am trying to run multi-node training on a PBS HPC cluster. In the single-node case my code runs fine, but with more nodes I always get the following warning:
init.cc:521 NCCL WARN Duplicate GPU detected
Followed by the error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646756402876/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
The code below is an example for running on 2 nodes, each with 2 GPUs (Tesla V100):
Thank you very much for the help!