Execution Issue of VASP 6.4.1 with Openacc and GPU

#1 Post by guorong_weng » Mon Aug 07, 2023 5:18 am

Dear VASP folks,
I have recently been trying to compile VASP 6.4.1 with OpenACC on our local GPU machine (NVIDIA RTX A5000, 8 GPUs).
Compilation succeeds with the following "makefile.include" file, using NVIDIA HPC SDK 23.7 and FFTW3 installed in the indicated directories.

Code: Select all

# Default precompiler options
CPP_OPTIONS = \
              -DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dqd_emulate \
              -Dfock_dblbuf \
              -D_OPENMP \
              -D_OPENACC \
              -DUSENCCL -DUSENCCLP2P

CPP         = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX)  > $*$(SUFFIX)

# N.B.: you might need to change the cuda-version here
#       to one that comes with your NVIDIA-HPC SDK
FC          = mpif90 -acc -gpu=cc60,cc70,cc80,cuda12.2 -mp
FCL         = mpif90 -acc -gpu=cc60,cc70,cc80,cuda12.2 -mp -c++libs

FREE        = -Mfree -Mx,231,0x1

FFLAGS      = -Mbackslash -Mlarge_arrays

OFLAG       = -fast

DEBUG       = -Mfree -O0 -traceback

OBJECTS     = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

LLIBS       = -cudalib=cublas,cusolver,cufft,nccl -cuda

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o minimax_dependence.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = nvfortran
CC_LIB      = nvc -w
FFLAGS_LIB  = -O1 -Mfixed

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = nvc++ --no_warnings

## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
# When compiling on the target machine itself , change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp host

# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
#NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')

# If the above fails, then NVROOT needs to be set manually
NVHPC_PATH  ?= /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk
NVVERSION   = 23.7
NVROOT      = $(NVHPC_PATH)/Linux_x86_64/$(NVVERSION)

## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
#OFLAG_IN   = -fast -Mwarperf
#SOURCE_IN  := nonlr.o

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd
INCS       += -I$(QD)/include/qd

# BLAS (mandatory)
BLAS        = -L$(NVROOT)/compilers/lib -lblas

# LAPACK (mandatory)
LAPACK      = -L$(NVROOT)/compilers/lib -llapack

# scaLAPACK (mandatory)
SCALAPACK   = -L$(NVROOT)/comm_libs/mpi/lib -Mscalapack


# FFTW (mandatory)
FFTW_ROOT  ?= /home/gwen/libraries/fftw-3.3.10/fftw
LLIBS      += -L$(FFTW_ROOT)/lib -lfftw3 -lfftw3_omp
INCS       += -I$(FFTW_ROOT)/include

The "LD_LIBRARY_PATH" is exported as follows:

Code: Select all

And the output of "ldd vasp_std" reads as follows:

Code: Select all

	linux-vdso.so.1 (0x00007ffec7542000)
	libqdmod.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/extras/qd/lib/libqdmod.so.0 (0x00007fb326400000)
	libqd.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/extras/qd/lib/libqd.so.0 (0x00007fb326000000)
	liblapack_lp64.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/liblapack_lp64.so.0 (0x00007fb325589000)
	libblas_lp64.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libblas_lp64.so.0 (0x00007fb323738000)
	libmpi_usempif08.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi_usempif08.so.40 (0x00007fb323400000)
	libmpi_usempi_ignore_tkr.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007fb323000000)
	libmpi_mpifh.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi_mpifh.so.40 (0x00007fb322c00000)
	libmpi.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi.so.40 (0x00007fb322600000)
	libscalapack_lp64.so.2 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libscalapack_lp64.so.2 (0x00007fb321f82000)
	libnvhpcwrapcufft.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvhpcwrapcufft.so (0x00007fb321c00000)
	libcufft.so.11 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcufft.so.11 (0x00007fb316e00000)
	libcusolver.so.11 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcusolver.so.11 (0x00007fb30fc00000)
	libcudaforwrapnccl.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudaforwrapnccl.so (0x00007fb30f800000)
	libnccl.so.2 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/12.2/nccl/lib/libnccl.so.2 (0x00007fb2fe800000)
	libcublas.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcublas.so.12 (0x00007fb2f7e00000)
	libcublasLt.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcublasLt.so.12 (0x00007fb2d5e00000)
	libcudaforwrapblas.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudaforwrapblas.so (0x00007fb2d5a00000)
	libcudaforwrapblas117.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudaforwrapblas117.so (0x00007fb2d5600000)
	libcudart.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libcudart.so.12 (0x00007fb2d5200000)
	libcudafor_120.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudafor_120.so (0x00007fb2cf200000)
	libcudafor.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudafor.so (0x00007fb2cee00000)
	libacchost.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacchost.so (0x00007fb2cea00000)
	libaccdevaux.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevaux.so (0x00007fb2ce600000)
	libacccuda.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacccuda.so (0x00007fb2ce200000)
	libcudadevice.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudadevice.so (0x00007fb2cde00000)
	libcudafor2.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudafor2.so (0x00007fb2cda00000)
	libnvf.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvf.so (0x00007fb2cd200000)
	libnvhpcatm.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvhpcatm.so (0x00007fb2cce00000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb2ccbd4000)
	libnvomp.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvomp.so (0x00007fb2cba00000)
	libnvcpumath.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvcpumath.so (0x00007fb2cb400000)
	libnvc.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvc.so (0x00007fb2cb000000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb2cadd8000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb32663f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb326319000)
	libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007fb326635000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb326630000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb326629000)
	libopen-rte.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libopen-rte.so.40 (0x00007fb2caa00000)
	libopen-pal.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libopen-pal.so.40 (0x00007fb2ca400000)
	librdmacm.so.1 => /lib/x86_64-linux-gnu/librdmacm.so.1 (0x00007fb3262fa000)
	libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007fb3262d7000)
	libnuma.so.1 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libnuma.so.1 (0x00007fb2ca000000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fb326622000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fb3262bb000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb3262b6000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb326683000)
	libnvJitLink.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libnvJitLink.so.12 (0x00007fb2c6c00000)
	libcusparse.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcusparse.so.12 (0x00007fb2b6e00000)
	libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007fb326291000)
	libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007fb3236b5000)

After installation, I export the following environment variables:

Code: Select all

export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_STACKSIZE=512m

and then launch "make test".

Immediately, the output shows the following error, repeated in each test folder:

Code: Select all

[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: lambda-scalar

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
 running    4 mpi-ranks, with    1 threads/rank, on    1 nodes
 distrk:  each k-point on    2 cores,    2 groups
 distr:  one band on    1 cores,    2 groups
 OpenACC runtime initialized ...    4 GPUs detected
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F  at line: 898                                  |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |

[lambda-scalar:1552414] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[lambda-scalar:1552414] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

As a beginner in VASP, I have no idea how to track this error down. I hope someone can help me out here. Thanks a lot.


Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#2 Post by alexey.tal » Tue Aug 08, 2023 12:03 pm

Dear Gwen,
The message
[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
indicates that the issue is in the communication network.
Does this calculation run on a single GPU?

Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#3 Post by guorong_weng » Tue Aug 08, 2023 10:51 pm

Hi Alexey. By default, four GPUs are used for testing in the VASP package. I set CUDA_VISIBLE_DEVICES to include 4 GPUs.
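
A minimal sketch of restricting a run to four of the eight GPUs. The device indices 0 to 3 are an assumption, since the post does not say which GPUs were selected. Counting the entries makes it easy to match the number of MPI ranks to the number of visible devices, which is the usual setup for the OpenACC port (one rank per GPU).

```shell
# Assumed indices: the post does not state which 4 of the 8 GPUs were chosen.
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Count the comma-separated entries so the MPI rank count can be matched to it.
n_gpus=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{ print NF }')
echo "visible GPUs: $n_gpus"
```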

Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#4 Post by alexey.tal » Wed Aug 09, 2023 10:21 am

You can run the tests with a single MPI rank by changing the number of ranks in testsuite/fast.conf and then executing the tests with the following command:
./runtest --fast fast.conf

Were you able to run the tests without GPUs?

Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#5 Post by guorong_weng » Thu Aug 10, 2023 8:40 pm

Hi Alexey. The crash has been resolved by using all four libraries from Intel oneAPI. However, the following warning still persists:

Code: Select all

[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: lambda-scalar

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

So far I can only suppress it by setting btl_base_warn_component_unused to 0, and I am worried that performance is reduced. Is there any way to resolve this problem? Thanks.
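
For reference, a sketch of the two usual ways to pass an MCA parameter such as btl_base_warn_component_unused to Open MPI: as an environment variable with the OMPI_MCA_ prefix, or with --mca on the mpirun command line. The rank count and binary name are taken from the commands quoted in this thread. The mpirun line is only printed here, not executed.

```shell
# Option 1: environment variable; Open MPI reads variables named OMPI_MCA_<parameter>.
export OMPI_MCA_btl_base_warn_component_unused=0

# Option 2: the same parameter on the command line (printed, not executed).
cmd="mpirun -np 4 --mca btl_base_warn_component_unused 0 vasp_std"
echo "$cmd"
```

Either form only silences the message; it does not change which transport Open MPI selects.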

Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#6 Post by alexey.tal » Fri Aug 11, 2023 8:13 am

Do you have an InfiniBand connection? If not, you can manually choose the shared-memory communication fabric:

Code: Select all

mpirun -np 4 -genv I_MPI_FABRICS=shm vasp_std 

Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#7 Post by guorong_weng » Fri Aug 11, 2023 6:01 pm

I believe this will finally resolve my problem.
Since I am using Open MPI from the NVIDIA HPC SDK rather than Intel MPI, I am using the following command:

Code: Select all

mpirun -np 4 --mca btl [fabric options] vasp_std

I am wondering which of the fabric options below work best for VASP:

Code: Select all

                 MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.5)
                 MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v3.1.5)
                 MCA btl: smcuda (MCA v2.1.0, API v3.0.0, Component v3.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.5)
                 MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v3.1.5)

Re: Execution Issue of VASP 6.4.1 with Openacc and GPU

#8 Post by alexey.tal » Mon Aug 14, 2023 8:37 am

As far as I understand, you are running this job on a single node with multiple GPUs and without any inter-node communication, so you don't need openib or tcp, but you should specify one of the shared-memory options. I think you should be able to get the best performance with --mca btl self,vader. I don't know whether smcuda might give some advantage, so you may want to try that option too.
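
A small sketch for comparing the two shared-memory candidates mentioned above. It only prints the mpirun command lines so they can be inspected first. NP, VASP_BIN and the helper name btl_cmd are illustrative, and pairing smcuda with self is an assumption modeled on the self,vader suggestion.

```shell
# Illustrative defaults: "-np 4" and "vasp_std" are the values used in this thread.
NP=${NP:-4}
VASP_BIN=${VASP_BIN:-vasp_std}

# Hypothetical helper: build the mpirun command line for a given BTL list.
btl_cmd() {
    # $1: comma-separated BTL components, e.g. "self,vader"
    echo "mpirun -np $NP --mca btl $1 $VASP_BIN"
}

# Print one command per candidate; run each in the same test directory
# and compare the wall times to see which is faster on this machine.
for btl in self,vader self,smcuda; do
    btl_cmd "$btl"
done
```

Because openib is not in either list, the OpenFabrics warning should no longer appear with these settings.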

