NCCL WARN Duplicate GPU detected #13086
Unanswered
PedroRASB
asked this question in
DDP / multi-GPU / multi-node
Replies: 1 comment
-
Hello, I encountered similar issues with PyTorch training on multiple nodes using PBS and NCCL. Interestingly, the code works fine with the 'gloo' backend, but it is noticeably slower. Have you found a solution to this problem?
Original question:
Hello, I am trying to run multi-node training on a PBS HPC cluster. In the single-node case my code runs fine, but with more nodes I always get the following warning:
init.cc:521 NCCL WARN Duplicate GPU detected
Followed by the error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646756402876/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
The code below is an example for running on 2 nodes, each with 2 GPUs (Tesla V100):
Thank you very much for the help!