Installing Infrastructure Software
List of Infrastructure Software
The installation node fields are expressed as follows:
- M: Management node
- L: Login node
- C: Compute node
Software Name | Component Name | Version | Service Name | Installation Node | Notes |
---|---|---|---|---|---|
nfs | nfs-utils | 1.3.0 | nfs-server | M | |
 | nfs-kernel-server | 1.3.0 | nfs-server | M | |
 | nfs-client | 1.3.0 | nfs | C,L | |
ntp | ntp | 4.2.6 | ntpd | M | |
slurm | ohpc-slurm-server | 1.3.3 | munge,slurmctld | M | |
 | ohpc-slurm-client | 1.3.3 | munge,slurmd | C,L | |
ganglia | ganglia-gmond-ohpc | 3.7.2 | gmond | M,C,L | |
singularity | singularity-ohpc | 2.4 | | M | |
cuda | cudnn | 7 | | C | Only needs to be installed on the GPU node |
 | cuda | 9.1 | | C | |
mpi | openmpi3-gnu7-ohpc | 3.0.0 | | M | Install at least one of three types of MPI |
 | mpich-gnu7-ohpc | 3.2 | | M | |
 | mvapich2-gnu7-ohpc | 2.2 | | M | |
Set the Local Repository for Management Node
Download the local repository
Configuring the local repository
Upload the package to the management node. Run the commands below to configure the local Lenovo OpenHPC repository:
For CentOS:
$ sudo mkdir -p $ohpc_repo_dir
$ sudo tar xvf Lenovo-OpenHPC-1.3.3.CentOS_7.x86_64.tar -C $ohpc_repo_dir
$ sudo $ohpc_repo_dir/make_repo.sh
For SLES:
$ sudo mkdir -p $ohpc_repo_dir
$ sudo tar xvf Lenovo-OpenHPC-1.3.3.SLES.x86_64.tar -C $ohpc_repo_dir
$ sudo $ohpc_repo_dir/make_repo.sh
$ sudo rpm --import $ohpc_repo_dir/SLE_12/repodata/repomd.xml.key
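Optionally, you can confirm that the local repository was registered on the management node. This is only an illustrative check; the exact repository id depends on the repo file generated by make_repo.sh:
# Look for the Lenovo OpenHPC entry in the repository list (id shown may differ)
$ sudo yum repolist | grep -i openhpc       # CentOS
$ sudo zypper repos | grep -i openhpc       # SLES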
Configuring the Local Repository for Compute and Login Nodes
For CentOS:
$ sudo psh all yum --setopt=\*.skip_if_unavailable=1 -y install yum-utils
$ sudo cp /etc/yum.repos.d/Lenovo.OpenHPC.local.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/CentOS_7" >> /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/CentOS_7/repodata/repomd.xml.key" >> /var/tmp/Lenovo.OpenHPC.local.repo
# Distribute repo files
$ sudo xdcp all /var/tmp/Lenovo.OpenHPC.local.repo /etc/yum.repos.d/
$ sudo psh all echo -e %_excludedocs 1 \>\> ~/.rpmmacros
Run the following command to disable yum access to external repositories.
Note
Perform this step according to your actual situation. If the operating system does not have enough packages installed, subsequent installation steps may fail.
$ sudo psh all yum-config-manager --disable CentOS\*
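Optionally, you can confirm which repositories remain enabled on the nodes after this change. This is only a sketch using the psh and xcoll tools already used in this document; repository names depend on your system:
# List the repositories that are still enabled on every node
$ sudo psh all "yum repolist enabled" | xcoll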
For SLES:
$ sudo cp /etc/zypp/repos.d/Lenovo.OpenHPC.local.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/SLE_12" >> /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/SLE_12/repodata/repomd.xml.key" >> /var/tmp/Lenovo.OpenHPC.local.repo
# Distribute repo files
$ sudo xdcp all /var/tmp/Lenovo.OpenHPC.local.repo /etc/zypp/repos.d/
$ sudo psh all rpm --import http://${sms_name}/${ohpc_repo_dir}/SLE_12/repodata/repomd.xml.key
$ sudo psh all echo -e %_excludedocs 1 \>\> ~/.rpmmacros
Configuring LiCO Dependencies Repository
For CentOS:
Download the package: https://hpc.lenovo.com/lico/downloads/5.1/lico-dep-5.1.el7.x86_64.tgz
$ sudo mkdir -p $lico_dep_repo_dir
$ sudo tar xvf lico-dep-5.1.el7.x86_64.tgz -C $lico_dep_repo_dir
$ sudo $lico_dep_repo_dir/mklocalrepo.sh
$ sudo cp /etc/yum.repos.d/lico-dep.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/lico-dep.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/lico-dep.repo
$ sudo echo "baseurl=http://${sms_name}/${lico_dep_repo_dir}" >> /var/tmp/lico-dep.repo
$ sudo echo "gpgkey=http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL7" >> /var/tmp/lico-dep.repo
# Distribute configuration
$ sudo xdcp all /var/tmp/lico-dep.repo /etc/yum.repos.d
For SLES:
Download the package: https://hpc.lenovo.com/lico/downloads/lico-dep-5.1.sle12.x86_64.tgz
$ sudo mkdir -p $lico_dep_repo_dir
$ sudo tar xvf lico-dep-5.1.sle12.x86_64.tgz -C $lico_dep_repo_dir
$ sudo $lico_dep_repo_dir/mklocalrepo.sh
$ sudo rpm --import $lico_dep_repo_dir/RPM-GPG-KEY-LICO-DEP-SLE12
$ sudo cp /etc/zypp/repos.d/lico-dep.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/lico-dep.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/lico-dep.repo
$ sudo echo "baseurl=http://${sms_name}/${lico_dep_repo_dir}" >> /var/tmp/lico-dep.repo
$ sudo echo "gpgkey=http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-SLE12" >> /var/tmp/lico-dep.repo
# Distribute configuration
$ sudo xdcp all /var/tmp/lico-dep.repo /etc/zypp/repos.d
$ sudo psh all rpm --import http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-SLE12
Installing slurm
For CentOS:
$ sudo yum -y install lenovo-ohpc-base
$ sudo yum -y install ohpc-slurm-server
$ sudo psh all yum -y install ohpc-base-compute ohpc-slurm-client lmod-ohpc
Configuring pam_slurm
Note
The following optional command prevents non-root logins to the compute nodes unless the user logging in already has a slurm job running on that node:
$ sudo psh all echo "\""account required pam_slurm.so"\"" \>\> /etc/pam.d/sshd
For SLES:
$ sudo zypper install lenovo-ohpc-base
$ sudo zypper install ohpc-slurm-server
$ sudo psh all zypper install -y --force-resolution ohpc-base-compute ohpc-slurm-client lmod-ohpc
Configuring pam_slurm
Note
The following optional command prevents non-root logins to the compute nodes unless the user logging in already has a slurm job running on that node:
$ sudo psh all echo "\""account required pam_slurm.so"\"" \>\> /etc/pam.d/sshd
Configuring nfs
For CentOS:
Note
Run the following commands to create the shared directory /opt/ohpc/pub. This directory is necessary. If you have already shared this directory, you can skip this step.
# Management node shares the Lenovo OpenHPC directory
$ sudo yum -y install nfs-utils
$ sudo echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
$ sudo exportfs -a
# Installing NFS for cluster nodes
$ sudo psh all yum -y install nfs-utils
# Configure shared directory for cluster nodes
$ sudo psh all mkdir -p /opt/ohpc/pub
$ sudo psh all echo "\""${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"\"" \>\> /etc/fstab
# Mount shared directory
$ sudo psh all mount /opt/ohpc/pub
Note
Run the following commands to create the user shared directory. This document takes /home as an example; you can choose another directory.
# Management node shares /home and the Lenovo OpenHPC package directory
$ sudo echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
$ sudo exportfs -a
# If /home is already mounted, unmount it first
$ sudo psh all "sed -i '/ \/home /d' /etc/fstab"
$ sudo psh all umount /home
# Configure a shared directory for cluster nodes
$ sudo psh all echo "\""${sms_ip}:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0"\"" \>\> /etc/fstab
# Mount a shared directory
$ sudo psh all mount /home
For SLES:
Note
Run the following commands to create the shared directory /opt/ohpc/pub. This directory is necessary. If you have already shared this directory, you can skip this step.
# Management node shares the Lenovo OpenHPC directory
$ sudo zypper install nfs-kernel-server
$ sudo echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
$ sudo exportfs -a
# Configure shared directory for cluster nodes
$ sudo psh all zypper install -y --force-resolution nfs-client
$ sudo psh all mkdir -p /opt/ohpc/pub
$ sudo psh all echo "\""${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"\"" \>\> /etc/fstab
# Mount shared directory
$ sudo psh all mount /opt/ohpc/pub
Note
Run the following commands to create the user shared directory. This document takes /home as an example; you can choose another directory.
# Management node shares /home and the Lenovo OpenHPC package directory
$ sudo echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
$ sudo exportfs -a
# If /home is already mounted, unmount it first
$ sudo psh all "sed -i '/ \/home /d' /etc/fstab"
$ sudo psh all umount /home
# Configure a shared directory for cluster nodes
$ sudo psh all echo "\""${sms_ip}:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0"\"" \>\> /etc/fstab
# Mount a shared directory
$ sudo psh all mount /home
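For either operating system, an optional sanity check after mounting is to confirm that the shared directories are visible on every node. This is only a sketch using the psh and xcoll tools already used in this document:
# Confirm that /home and /opt/ohpc/pub are mounted on all nodes
$ sudo psh all "df -h /home /opt/ohpc/pub" | xcoll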
Configuring ntp
Note
If the ntp service has already been configured for the nodes in the cluster, skip this step.
For CentOS:
$ sudo echo "server 127.127.1.0" >> /etc/ntp.conf
$ sudo echo "fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf
$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd
$ sudo psh all yum -y install ntp
$ sudo psh all echo "\""server ${sms_ip}"\"" \>\> /etc/ntp.conf
# Startup
$ sudo psh all systemctl enable ntpd
$ sudo psh all systemctl start ntpd
# check service
psh all "ntpq -p | tail -n 1"
For SLES:
$ sudo echo "server 127.127.1.0" >> /etc/ntp.conf
$ sudo echo "fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf
$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd
$ sudo psh all zypper install -y --force-resolution ntp
$ sudo psh all echo "\""server ${sms_ip}"\"" \>\> /etc/ntp.conf
# Startup
$ sudo psh all systemctl enable ntpd
$ sudo psh all systemctl start ntpd
# check service
psh all "ntpq -p | tail -n 1"
Installing cuda and cudnn
Note
Run the commands below to install CUDA and CUDNN on all the GPU compute nodes. If only a subset of nodes have GPUs, replace the "compute" argument in the psh commands with the node range corresponding to the GPU nodes:
Installing cuda
Download cuda_9.1.85_387.26_linux.run to the shared directory. If you are installing according to this document, the shared directory is /home.
Download address: https://developer.nvidia.com/cuda-downloads
$ sudo psh compute systemctl set-default multi-user.target
$ sudo psh compute reboot
Installing nvidia drivers
Download address:
Note
We suggest that you install the kernel patch for the Spectre/Meltdown security issue; you can get the patch from here:
Then make sure to install the kernel-devel package that matches the running kernel; an optional check is sketched below. If this has already been done, the kernel-devel package can be omitted from the driver installation commands that follow. Otherwise, run them as shown:
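This optional check compares the running kernel with the installed kernel-devel package on the GPU nodes. It is only a sketch using the psh and xcoll tools already used in this document:
# Compare the running kernel version with the installed kernel-devel package
$ sudo psh compute "uname -r" | xcoll
$ sudo psh compute "rpm -q kernel-devel" | xcoll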
$ sudo psh compute rpm -ivh /home/nvidia-diag-driver-local-repo-rhel7-390.46-1.0-1.x86_64.rpm
$ sudo psh compute yum install -y cuda-drivers
$ sudo psh compute rpm -ivh /home/nvidia-diag-driver-local-repo-sles123-390.46-1.0-1.x86_64.rpm
$ sudo psh compute zypper --gpg-auto-import-keys install -y --force-resolution cuda-drivers
$ psh compute perl -pi -e "s/NVreg_DeviceFileMode=0660/NVreg_DeviceFileMode=0666/" /etc/modprobe.d/50-nvidia-default.conf
$ psh compute reboot -h now
Installing cuda
$ sudo psh compute yum install -y kernel-devel gcc gcc-c++
$ sudo psh compute /home/cuda_9.1.85_387.26_linux.run --silent --toolkit --samples --no-opengl-libs --verbose --override
$ sudo psh compute zypper install -y --force-resolution kernel-devel gcc gcc-c++
$ sudo psh compute /home/cuda_9.1.85_387.26_linux.run --silent --toolkit --samples --no-opengl-libs --verbose --override
Download cudnn
Download cudnn-9.1-linux-x64-v7.1.tgz into the /root directory. The official website:
Installing cudnn
$ cd ~
$ tar -xvf cudnn-9.1-linux-x64-v7.1.tgz
$ sudo xdcp compute cuda/include/cudnn.h /usr/local/cuda/include
$ sudo xdcp compute cuda/lib64/libcudnn_static.a /usr/local/cuda/lib64
$ sudo xdcp compute cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64
$ sudo psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64/libcudnn.so.7"
$ sudo psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7 /usr/local/cuda/lib64/libcudnn.so"
$ sudo psh compute chmod a+r /usr/local/cuda/include/cudnn.h
$ sudo psh compute chmod a+r /usr/local/cuda/lib64/libcudnn*
Configuring Environment Variables
$ sudo echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf
$ sudo echo "export CUDA_HOME=/usr/local/cuda" >> /etc/profile.d/cuda.sh
$ sudo echo "export PATH=/usr/local/cuda/bin:\$PATH" >> /etc/profile.d/cuda.sh
Distribute Configuration
$ sudo xdcp compute /etc/ld.so.conf.d/cuda.conf /etc/ld.so.conf.d/cuda.conf
$ sudo xdcp compute /etc/profile.d/cuda.sh /etc/profile.d/cuda.sh
Run the commands below on the GPU nodes to determine if the GPU can be identified:
$ sudo psh compute ldconfig
$ sudo psh compute nvidia-smi
$ sudo psh compute "cd /root/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery; make; ./deviceQuery" | xcoll
Set CUDA to start automatically
# Configuration
$ sudo psh compute sed -i '/Wants=syslog.target/a\Before=slurmd.service' /usr/lib/systemd/system/nvidia-persistenced.service
$ sudo psh compute systemctl daemon-reload
$ sudo psh compute systemctl enable nvidia-persistenced
$ sudo psh compute systemctl start nvidia-persistenced
# Add configure file
$ cat << eof > /usr/lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Before=slurmd.service
Wants=syslog.target
[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user root
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
eof
# Distribute configure file
xdcp compute /usr/lib/systemd/system/nvidia-persistenced.service /usr/lib/systemd/system/nvidia-persistenced.service
# Restart service
psh compute systemctl daemon-reload
psh compute systemctl enable nvidia-persistenced
psh compute systemctl start nvidia-persistenced
Installing slurm
Configuring slurm
Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/slurm.conf to the /etc/slurm/ directory on the management node and change it as needed.
Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/gres.conf to the /etc/slurm/ directory on the management node and change it as needed.
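For orientation only, the fields you will typically adjust in slurm.conf and gres.conf look like the sketch below. The cluster name, management node hostname (head1), hardware counts, and GPU device paths are illustrative assumptions, not values from the downloaded templates:
# slurm.conf excerpt (illustrative values)
ClusterName=mycluster
ControlMachine=head1
NodeName=c[1-2] Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
PartitionName=normal Nodes=c[1-2] Default=YES MaxTime=24:00:00 State=UP
# gres.conf excerpt for a GPU node (illustrative values)
NodeName=c1 Name=gpu File=/dev/nvidia[0-1]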
Distribute Configuration
$ sudo xdcp all /etc/slurm/slurm.conf /etc/slurm/slurm.conf
$ sudo xdcp all /etc/munge/munge.key /etc/munge/munge.key
Startup service
# Startup Management Node
$ sudo systemctl enable munge
$ sudo systemctl enable slurmctld
$ sudo systemctl restart munge
$ sudo systemctl restart slurmctld
# Startup Other Node
$ sudo psh all systemctl enable munge
$ sudo psh all systemctl restart munge
$ sudo psh all systemctl enable slurmd
$ sudo psh all systemctl restart slurmd
Note
If a problem occurs with slurm, please refer to How To Solve slurm Common Problem.
Installing ganglia
Installing gmond
# Management node
$ sudo yum -y install ganglia-gmond-ohpc
# Other node
$ sudo psh all yum install -y ganglia-gmond-ohpc
# Management node
$ sudo zypper install ganglia-gmond-ohpc
# Other node
$ sudo psh all zypper install -y --force-resolution ganglia-gmond-ohpc
Configuring gmond
Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/management/gmond.conf to /etc/ganglia/gmond.conf on the management node.
Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/gmond.conf to /var/tmp/gmond.conf on the management node.
Note
Modify the hostname in the udp_send_channel section to the management node hostname, according to your actual situation.
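As an illustration of what to edit, a udp_send_channel section in gmond.conf usually looks like the sketch below. The hostname head1 is an assumed example and the port value comes from the downloaded file, not from this document:
udp_send_channel {
  # host should be the management node hostname (head1 is an assumed example)
  host = head1
  port = 8649
  ttl = 1
}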
Modifying Kernel Parameters
$ sudo echo net.core.rmem_max=10485760 > /usr/lib/sysctl.d/gmond.conf
$ sudo /usr/lib/systemd/systemd-sysctl gmond.conf
$ sudo sysctl -w net.core.rmem_max=10485760
Distribute Configuration
$ sudo xdcp all /var/tmp/gmond.conf /etc/ganglia/gmond.conf
Startup service
# Management node
$ sudo systemctl enable gmond
$ sudo systemctl start gmond
# Other node
$ sudo psh all systemctl enable gmond
$ sudo psh all systemctl start gmond
# Make sure all nodes are listed
$ sudo gstat -a
Installing mpi
Installing mpi Module
$ sudo yum -y install openmpi3-gnu7-ohpc mpich-gnu7-ohpc mvapich2-gnu7-ohpc
$ sudo zypper install openmpi3-gnu7-ohpc mpich-gnu7-ohpc mvapich2-gnu7-ohpc
Note
The above commands install three MPI modules (openmpi, mpich, and mvapich) on the system; the user can then use Lmod to choose the specific MPI module to use.
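For reference, selecting an MPI module with Lmod typically looks like the following sketch; the exact module names and versions shown depend on what was installed:
# List available modules and switch the active MPI (module names are illustrative)
$ module avail
$ module swap openmpi3 mvapich2
$ module list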
Set The Default mpi
$ sudo yum -y install lmod-defaults-gnu7-openmpi3-ohpc
$ sudo yum -y install lmod-defaults-gnu7-mpich-ohpc
$ sudo yum -y install lmod-defaults-gnu7-mvapich2-ohpc
$ sudo zypper install lmod-defaults-gnu7-openmpi3-ohpc
$ sudo zypper install lmod-defaults-gnu7-mpich-ohpc
$ sudo zypper install lmod-defaults-gnu7-mvapich2-ohpc
Here is the table of interconnect support for each MPI type from OpenHPC:

MPI Type | Ethernet (TCP) | InfiniBand | Omni-Path |
---|---|---|---|
MPICH | X | | |
MVAPICH2 | | X | |
MVAPICH2 (psm2) | | | X |
OpenMPI | X | X | X |
OpenMPI (PMIx) | X | X | X |
Note
If you want to use MVAPICH2 (psm2), install mvapich2-psm2-gnu7-ohpc. If you want to use OpenMPI (PMIx), install openmpi3-pmix-slurm-gnu7-ohpc. However, openmpi3-gnu7-ohpc and openmpi3-pmix-slurm-gnu7-ohpc are incompatible with each other, and mvapich2-psm2-gnu7-ohpc and mvapich2-gnu7-ohpc are incompatible with each other.
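One possible way to switch to the psm2/PMIx variants on CentOS is sketched below. Removing the conflicting packages first is an assumption based on the incompatibility noted above; adapt the commands to your environment (use zypper on SLES):
# Remove the conflicting default packages, then install the psm2/PMIx variants (illustrative)
$ sudo yum -y remove openmpi3-gnu7-ohpc mvapich2-gnu7-ohpc
$ sudo yum -y install openmpi3-pmix-slurm-gnu7-ohpc mvapich2-psm2-gnu7-ohpc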
Installing singularity
Singularity is a lightweight container framework designed for HPC.
Installing singularity
$ sudo yum -y install singularity-ohpc
$ sudo zypper install singularity-ohpc
Installing openhpc default environment
Add singularity to the module lists in /opt/ohpc/pub/modulefiles/ohpc:
# Add in the module try-add section
module try-add singularity
# Add in the module del section
module del singularity
Make the configuration file take effect
$ source /etc/profile.d/lmod.sh
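As a quick check that the module environment picked up singularity, the following sketch can be used; it assumes the singularity module name matches the package installed above:
# Load the module and confirm the singularity command is available
$ module load singularity
$ singularity --version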
Note
Changes to /opt/ohpc/pub/modulefiles/ohpc may be lost when the default modules are changed by installing an lmod-defaults* package. In that case, modify the /opt/ohpc/pub/modulefiles/ohpc file again, or, alternatively, add module try-add singularity to the bottom of /etc/profile.d/lmod.sh.
Checkpoint B
Checking slurm
$ sudo sinfo
...
PARTITION AVAIL  TIMELIMIT   NODES  STATE NODELIST
normal*      up  1-00:00:00      2   idle c[1-2]
...
Attention
The status of all nodes should be idle; idle* is not acceptable.
Add a test account
$ sudo useradd -m test
$ sudo echo "MERGE:" > syncusers
$ sudo echo "/etc/passwd -> /etc/passwd" >> syncusers
$ sudo echo "/etc/group -> /etc/group" >> syncusers
$ sudo echo "/etc/shadow -> /etc/shadow" >> syncusers
$ sudo xdcp all -F syncusers
Run and Test mpi
$ su - test
$ mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c
$ srun -n 8 -N 1 -w compute --pty /bin/bash
$ prun ./a.out
...
Master compute host = c1
Resource manager = slurm
Launch cmd = mpiexec.hydra -bootstrap slurm ./a.out
Hello, world (8 procs total)
--> Process # 0 of 8 is alive. -> c1
--> Process # 4 of 8 is alive. -> c2
--> Process # 1 of 8 is alive. -> c1
--> Process # 5 of 8 is alive. -> c2
--> Process # 2 of 8 is alive. -> c1
--> Process # 6 of 8 is alive. -> c2
--> Process # 3 of 8 is alive. -> c1
--> Process # 7 of 8 is alive. -> c2
Note
After the command finishes, make sure you exit back to the root user on the management node.