Install Basic Cluster Software
Basic Software List

The abbreviations used in the "Install nodes" column of the table below are:

- M: management node
- L: login node
- C: compute node
| Software | Component | Version | Service | Install nodes | Notes |
|---|---|---|---|---|---|
| nfs | nfs-utils | 1.3.0 | nfs-server | M | |
| nfs | nfs-kernel-server | 1.3.0 | nfs-server | M | |
| nfs | nfs-client | 1.3.0 | nfs | C,L | |
| ntp | ntp | 4.2.6 | ntpd | M | |
| slurm | ohpc-slurm-server | 1.3.3 | munge,slurmctld | M | |
| slurm | ohpc-slurm-client | 1.3.3 | munge,slurmd | C,L | |
| ganglia | ganglia-gmond-ohpc | 3.7.2 | gmond | M,C,L | |
| singularity | singularity-ohpc | 2.4 | | M | |
| cuda | cudnn | 7 | | C | Required only on GPU nodes |
| cuda | cuda | 9.1 | | C | Required only on GPU nodes |
| mpi | openmpi3-gnu7-ohpc | 3.0.0 | | M | Install at least one of the three MPI implementations |
| mpi | mpich-gnu7-ohpc | 3.2 | | M | |
| mpi | mvapich2-gnu7-ohpc | 2.2 | | M | |
Set Up the Local Repository on the Management Node

Download the local repository package

Configure the local repository

Upload the package to the management node and run the following commands to configure the Lenovo OpenHPC local repository:
- CentOS:

$ sudo mkdir -p $ohpc_repo_dir
$ sudo tar xvf Lenovo-OpenHPC-1.3.3.CentOS_7.x86_64.tar -C $ohpc_repo_dir
$ sudo $ohpc_repo_dir/make_repo.sh
- SLES:

$ sudo mkdir -p $ohpc_repo_dir
$ sudo tar xvf Lenovo-OpenHPC-1.3.3.SLES.x86_64.tar -C $ohpc_repo_dir
$ sudo $ohpc_repo_dir/make_repo.sh
$ sudo rpm --import $ohpc_repo_dir/SLE_12/repodata/repomd.xml.key
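Optionally, confirm that the package manager on the management node can now see the local repository. This is only a sanity check; the repository name shown comes from the .repo file created by make_repo.sh:

# CentOS: the Lenovo OpenHPC entry should appear in the enabled repository list
$ sudo yum repolist | grep -i openhpc
# SLES: the same check with zypper
$ sudo zypper repos | grep -i openhpc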
Configure the Local Repository for Compute and Login Nodes
- CentOS:
$ sudo psh all yum --setopt=\*.skip_if_unavailable=1 -y install yum-utils
$ sudo cp /etc/yum.repos.d/Lenovo.OpenHPC.local.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/CentOS_7" >> /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/CentOS_7/repodata/repomd.xml.key" >> /var/tmp/Lenovo.OpenHPC.local.repo
# Distribute repo files
$ sudo xdcp all /var/tmp/Lenovo.OpenHPC.local.repo /etc/yum.repos.d/
$ sudo psh all echo -e %_excludedocs 1 \>\> ~/.rpmmacros
Disable yum repositories that point to external networks

Note

This step is optional and depends on your environment. If the operating system does not already have enough packages installed, disabling the external repositories may cause later installation steps to fail.
$ sudo psh all yum-config-manager --disable CentOS\*
- SLES:

$ sudo cp /etc/zypp/repos.d/Lenovo.OpenHPC.local.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "baseurl=http://${sms_name}/${ohpc_repo_dir}/SLE_12" >> /var/tmp/Lenovo.OpenHPC.local.repo
$ sudo echo "gpgkey=http://${sms_name}/${ohpc_repo_dir}/SLE_12/repodata/repomd.xml.key" >> /var/tmp/Lenovo.OpenHPC.local.repo
# Distribute repo files
$ sudo xdcp all /var/tmp/Lenovo.OpenHPC.local.repo /etc/zypp/repos.d/
$ sudo psh all rpm --import http://${sms_name}/${ohpc_repo_dir}/SLE_12/repodata/repomd.xml.key
$ sudo psh all echo -e %_excludedocs 1 \>\> ~/.rpmmacros
Configure the LiCO Dependency Repository
- CentOS:

Download the package: https://hpc.lenovo.com/lico/downloads/5.1/lico-dep-5.1.el7.x86_64.tgz
$ sudo mkdir -p $lico_dep_repo_dir
$ sudo tar xvf lico-dep-5.1.el7.x86_64.tgz -C $lico_dep_repo_dir
$ sudo $lico_dep_repo_dir/mklocalrepo.sh
$ sudo cp /etc/yum.repos.d/lico-dep.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/lico-dep.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/lico-dep.repo
$ sudo echo "baseurl=http://${sms_name}/${lico_dep_repo_dir}" >> /var/tmp/lico-dep.repo
$ sudo echo "gpgkey=http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL7" >> /var/tmp/lico-dep.repo
# Distribute configuration
$ sudo xdcp all /var/tmp/lico-dep.repo /etc/yum.repos.d
- SLES:

Download the package: https://hpc.lenovo.com/lico/downloads/5.1/lico-dep-5.1.sle12.x86_64.tgz
$ sudo mkdir -p $lico_dep_repo_dir
$ sudo tar xvf lico-dep-5.1.sle12.x86_64.tgz -C $lico_dep_repo_dir
$ sudo $lico_dep_repo_dir/mklocalrepo.sh
$ sudo rpm --import $lico_dep_repo_dir/RPM-GPG-KEY-LICO-DEP-SLE12
$ sudo cp /etc/zypp/repos.d/lico-dep.repo /var/tmp
$ sudo sed -i '/^baseurl=/d' /var/tmp/lico-dep.repo
$ sudo sed -i '/^gpgkey=/d' /var/tmp/lico-dep.repo
$ sudo echo "baseurl=http://${sms_name}/${lico_dep_repo_dir}" >> /var/tmp/lico-dep.repo
$ sudo echo "gpgkey=http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-SLE12" >> /var/tmp/lico-dep.repo
# Distribute configuration
$ sudo xdcp all /var/tmp/lico-dep.repo /etc/zypp/repos.d
$ sudo psh all rpm --import http://${sms_name}/${lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-SLE12
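At this point every node should be able to reach both local repositories over HTTP. An optional sanity check, using the same psh/xcoll tools as above (the repository names printed come from the .repo files just distributed):

# CentOS: list enabled repositories on all nodes
$ sudo psh all "yum repolist enabled" | xcoll
# SLES: list enabled repositories on all nodes
$ sudo psh all "zypper repos -E" | xcoll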
Install Slurm
- CentOS:
$ sudo yum -y install lenovo-ohpc-base
$ sudo yum -y install ohpc-slurm-server
$ sudo psh all yum -y install ohpc-base-compute ohpc-slurm-client lmod-ohpc
Configure pam_slurm

Note

This module can prevent users from bypassing the scheduler and logging in to compute nodes directly. This step is optional and can be skipped.
$ sudo psh all echo "\""account required pam_slurm.so"\"" \>\> /etc/pam.d/sshd
- SLES:
$ sudo zypper install lenovo-ohpc-base
$ sudo zypper install ohpc-slurm-server
$ sudo psh all zypper install -y --force-resolution ohpc-base-compute ohpc-slurm-client lmod-ohpc
Configure pam_slurm

Note

This module can prevent users from bypassing the scheduler and logging in to compute nodes directly. This step is optional and can be skipped.
$ sudo psh all echo "\""account required pam_slurm.so"\"" \>\> /etc/pam.d/sshd
Configure NFS
- CentOS:

Note

Run the following commands to configure the cluster shared directories. Sharing /opt/ohpc/pub is required; if /opt/ohpc/pub is already configured as a shared directory in the cluster, skip this part.

# The management node shares the Lenovo OpenHPC directory
$ sudo yum -y install nfs-utils
$ sudo echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
$ sudo exportfs -a
# Install NFS on the cluster nodes
$ sudo psh all yum -y install nfs-utils
# Configure the shared directory on the cluster nodes
$ sudo psh all mkdir -p /opt/ohpc/pub
$ sudo psh all echo "\""${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"\"" \>\> /etc/fstab
# Mount the shared directory
$ sudo psh all mount /opt/ohpc/pub
Note

The following steps create the user shared directory. This document uses /home as an example; you can choose a different directory.

# The management node shares /home
$ sudo echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
$ sudo exportfs -a
# If /home is already mounted, unmount it first
$ sudo psh all "sed -i '/ \/home /d' /etc/fstab"
$ sudo psh all umount /home
# Configure the shared directory on the cluster nodes
$ sudo psh all echo "\""${sms_ip}:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0"\"" \>\> /etc/fstab
# Mount the shared directory
$ sudo psh all mount /home
- SLES:

Note

Run the following commands to configure the cluster shared directories. Sharing /opt/ohpc/pub is required; if /opt/ohpc/pub is already configured as a shared directory in the cluster, skip this part.

# The management node shares the Lenovo OpenHPC directory
$ sudo zypper install nfs-kernel-server
$ sudo echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
$ sudo exportfs -a
# Configure the shared directory on the cluster nodes
$ sudo psh all zypper install -y --force-resolution nfs-client
$ sudo psh all mkdir -p /opt/ohpc/pub
$ sudo psh all echo "\""${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0"\"" \>\> /etc/fstab
# Mount the shared directory
$ sudo psh all mount /opt/ohpc/pub
Note

The following steps create the user shared directory. This document uses /home as an example; you can choose a different directory.

# The management node shares /home
$ sudo echo "/home *(rw,no_subtree_check,fsid=10,no_root_squash)" >> /etc/exports
$ sudo exportfs -a
# If /home is already mounted, unmount it first
$ sudo psh all "sed -i '/ \/home /d' /etc/fstab"
$ sudo psh all umount /home
# Configure the shared directory on the cluster nodes
$ sudo psh all echo "\""${sms_ip}:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0"\"" \>\> /etc/fstab
# Mount the shared directory
$ sudo psh all mount /home
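As an optional check, verify that the exports are published and mounted everywhere before moving on. A minimal sketch using the NFS utilities installed above and the psh/xcoll tools already used in this guide:

# On the management node: both exports should be listed
$ sudo showmount -e localhost
# On all nodes: both NFS mounts should be active
$ sudo psh all "df -h /opt/ohpc/pub /home" | xcoll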
Configure NTP

Note

If the NTP service is already configured on all cluster nodes, skip this step.
$ sudo echo "server 127.127.1.0" >> /etc/ntp.conf
$ sudo echo "fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf
$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd
$ sudo psh all yum -y install ntp
$ sudo psh all echo "\""server ${sms_ip}"\"" \>\> /etc/ntp.conf
# Startup
$ sudo psh all systemctl enable ntpd
$ sudo psh all systemctl start ntpd
# check service
psh all "ntpq -p | tail -n 1"
$ sudo echo "server 127.127.1.0" >> /etc/ntp.conf
$ sudo echo "fudge 127.127.1.0 stratum 10" >> /etc/ntp.conf
$ sudo systemctl enable ntpd
$ sudo systemctl start ntpd
$ sudo psh all zypper install -y --force-resolution ntp
$ sudo psh all echo "\""server ${sms_ip}"\"" \>\> /etc/ntp.conf
# Startup
$ sudo psh all systemctl enable ntpd
$ sudo psh all systemctl start ntpd
# check service
psh all "ntpq -p | tail -n 1"
Install CUDA and cuDNN

Note

Run these steps only on compute nodes that have GPUs. The commands below install CUDA and cuDNN on all GPU compute nodes; if only some of your nodes have GPUs, replace the "compute" argument in the psh commands with the node range that corresponds to the GPU nodes.

Download CUDA

Download cuda_9.1.85_387.26_linux.run to the shared directory (this document uses /home as the shared directory). Download address: https://developer.nvidia.com/cuda-downloads
$ sudo psh compute systemctl set-default multi-user.target
$ sudo psh compute reboot
Install the NVIDIA driver

Download address:

Note

We recommend installing the kernel patch packages to fix known security vulnerabilities; the required patch packages can be found below.

Make sure the kernel-devel package matching the running kernel is installed. If it already is, the kernel-devel package can be omitted from the commands below; otherwise run the commands as shown:
- CentOS:

$ sudo psh compute rpm -ivh /home/nvidia-diag-driver-local-repo-rhel7-390.46-1.0-1.x86_64.rpm
$ sudo psh compute yum install -y cuda-drivers
- SLES:

$ sudo psh compute rpm -ivh /home/nvidia-diag-driver-local-repo-sles123-390.46-1.0-1.x86_64.rpm
$ sudo psh compute zypper --gpg-auto-import-keys install -y --force-resolution cuda-drivers
$ sudo psh compute perl -pi -e "s/NVreg_DeviceFileMode=0660/NVreg_DeviceFileMode=0666/" /etc/modprobe.d/50-nvidia-default.conf
$ sudo psh compute reboot
Install CUDA
- CentOS:

$ sudo psh compute yum install -y kernel-devel gcc gcc-c++
$ sudo psh compute /home/cuda_9.1.85_387.26_linux.run --silent --toolkit --samples --no-opengl-libs --verbose --override
- SLES:

$ sudo psh compute zypper install -y --force-resolution kernel-devel gcc gcc-c++
$ sudo psh compute /home/cuda_9.1.85_387.26_linux.run --silent --toolkit --samples --no-opengl-libs --verbose --override
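Optionally, confirm that the toolkit landed on every GPU node by querying the compiler version (a quick sketch; the reported release should match the installer you ran):

$ sudo psh compute "/usr/local/cuda/bin/nvcc --version" | xcoll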
Download cuDNN

Download cudnn-9.1-linux-x64-v7.1.tgz from the NVIDIA website to the /root directory. Website:

Install cuDNN
$ cd ~
$ tar -xvf cudnn-9.1-linux-x64-v7.1.tgz
$ sudo xdcp compute cuda/include/cudnn.h /usr/local/cuda/include
$ sudo xdcp compute cuda/lib64/libcudnn_static.a /usr/local/cuda/lib64
$ sudo xdcp compute cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64
$ sudo psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7.0.5 /usr/local/cuda/lib64/libcudnn.so.7"
$ sudo psh compute "ln -s /usr/local/cuda/lib64/libcudnn.so.7 /usr/local/cuda/lib64/libcudnn.so"
$ sudo psh compute chmod a+r /usr/local/cuda/include/cudnn.h
$ sudo psh compute chmod a+r /usr/local/cuda/lib64/libcudnn*
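To confirm that the header was distributed correctly, you can read the version macros it defines on each node (optional; the exact numbers depend on the cuDNN release you downloaded):

$ sudo psh compute "grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h" | xcoll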
Configure environment variables
$ sudo echo "/usr/local/cuda/lib64" >> /etc/ld.so.conf.d/cuda.conf $ sudo echo "export CUDA_HOME=/usr/local/cuda" >> /etc/profile.d/cuda.sh $ sudo echo "export PATH=/usr/local/cuda/bin:\$PATH" >> /etc/profile.d/cuda.sh
Distribute the configuration
$ sudo xdcp compute /etc/ld.so.conf.d/cuda.conf /etc/ld.so.conf.d/cuda.conf
$ sudo xdcp compute /etc/profile.d/cuda.sh /etc/profile.d/cuda.sh
Run the following commands to confirm that the GPUs are recognized:
$ sudo psh compute ldconfig
$ sudo psh compute nvidia-smi
$ sudo psh compute "cd /root/NVIDIA_CUDA-9.1_Samples/1_Utilities/deviceQuery; make; ./deviceQuery" | xcoll
Enable automatic startup for CUDA (nvidia-persistenced)
- CentOS:

# Configuration
$ sudo psh compute sed -i '/Wants=syslog.target/a\Before=slurmd.service' /usr/lib/systemd/system/nvidia-persistenced.service
$ sudo psh compute systemctl daemon-reload
$ sudo psh compute systemctl enable nvidia-persistenced
$ sudo psh compute systemctl start nvidia-persistenced
- SLES:

# Add the configuration file
$ cat << eof > /usr/lib/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Before=slurmd.service
Wants=syslog.target
[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user root
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
eof
# Distribute the configuration file
$ sudo xdcp compute /usr/lib/systemd/system/nvidia-persistenced.service /usr/lib/systemd/system/nvidia-persistenced.service
# Restart the service
$ sudo psh compute systemctl daemon-reload
$ sudo psh compute systemctl enable nvidia-persistenced
$ sudo psh compute systemctl start nvidia-persistenced
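A quick optional check that the persistence daemon is running on every GPU node:

$ sudo psh compute "systemctl is-active nvidia-persistenced" | xcoll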
Configure Slurm

Configure slurm.conf and gres.conf

Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/slurm.conf to /etc/slurm/ on the management node and adjust it for your environment with reference to the appendix.

Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/gres.conf to /etc/slurm/ on the management node and adjust it for your environment with reference to the appendix. This file is not needed if the cluster has no GPU nodes.
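For orientation only, the entries you typically need to edit are the node and partition definitions in slurm.conf and the GPU device lines in gres.conf. The fragment below is a sketch for a hypothetical two-node cluster; the host names, CPU counts, memory sizes, and GPU counts are assumptions and must be replaced with your real hardware values:

# slurm.conf (illustrative fragment)
NodeName=c[1-2] CPUs=32 RealMemory=128000 Gres=gpu:4 State=UNKNOWN
PartitionName=normal Nodes=c[1-2] Default=YES MaxTime=24:00:00 State=UP

# gres.conf on a node with four GPUs (illustrative)
Name=gpu File=/dev/nvidia[0-3]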
Distribute the configuration
$ sudo xdcp all /etc/slurm/slurm.conf /etc/slurm/slurm.conf
$ sudo xdcp all /etc/munge/munge.key /etc/munge/munge.key
Start the services
# Start services on the management node
$ sudo systemctl enable munge
$ sudo systemctl enable slurmctld
$ sudo systemctl restart munge
$ sudo systemctl restart slurmctld
# Start services on the other nodes
$ sudo psh all systemctl enable munge
$ sudo psh all systemctl restart munge
$ sudo psh all systemctl enable slurmd
$ sudo psh all systemctl restart slurmd
Note

If Slurm does not run properly, see "How to troubleshoot common Slurm problems".
Install Ganglia

Install gmond
- CentOS:

# Management node
$ sudo yum -y install ganglia-gmond-ohpc
# Other nodes
$ sudo psh all yum install -y ganglia-gmond-ohpc
- SLES:

# Management node
$ sudo zypper install ganglia-gmond-ohpc
# Other nodes
$ sudo psh all zypper install -y --force-resolution ganglia-gmond-ohpc
Configure gmond

Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/management/gmond.conf to /etc/ganglia/gmond.conf on the management node.

Download https://hpc.lenovo.com/lico/downloads/5.1/examples/conf/ganglia/gmond.conf to /var/tmp/gmond.conf.

Note

Set the host parameter in the udp_send_channel section to the hostname of the management node, according to your environment (see the sketch below).
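For reference, the relevant fragment of gmond.conf looks roughly like this ("head-node" is a placeholder for your management node's hostname; 8649 is Ganglia's default port):

udp_send_channel {
  host = head-node
  port = 8649
  ttl = 1
}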
Adjust kernel parameters
$ sudo echo net.core.rmem_max=10485760 > /usr/lib/sysctl.d/gmond.conf
$ sudo /usr/lib/systemd/systemd-sysctl gmond.conf
$ sudo sysctl -w net.core.rmem_max=10485760
Distribute the configuration
$ sudo xdcp all /var/tmp/gmond.conf /etc/ganglia/gmond.conf
Start the services
# Management node
$ sudo systemctl enable gmond
$ sudo systemctl start gmond
# Other nodes
$ sudo psh all systemctl enable gmond
$ sudo psh all systemctl start gmond
# Make sure all nodes are listed
$ sudo gstat -a
Install MPI

Install the MPI modules
$ sudo yum -y install openmpi3-gnu7-ohpc mpich-gnu7-ohpc mvapich2-gnu7-ohpc
$ sudo zypper install openmpi3-gnu7-ohpc mpich-gnu7-ohpc mvapich2-gnu7-ohpc
Note

The commands above install all three MPI stacks (OpenMPI, MPICH, and MVAPICH2). Users can select the MPI module to use with Lmod (see the example below); OpenHPC also provides packages that set the default module.
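For example, a user can inspect and switch the active MPI stack with the usual Lmod commands (a quick sketch; the module names correspond to the packages installed above):

$ module list                 # show currently loaded modules
$ module avail                # list available modules, including the three MPI stacks
$ module swap openmpi3 mpich  # switch from OpenMPI to MPICH in the current shell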
Set the default MPI module

Install one of the following packages, depending on which MPI stack you want as the cluster default:
$ sudo yum -y install lmod-defaults-gnu7-openmpi3-ohpc
$ sudo yum -y install lmod-defaults-gnu7-mpich-ohpc
$ sudo yum -y install lmod-defaults-gnu7-mvapich2-ohpc
$ sudo zypper install lmod-defaults-gnu7-openmpi3-ohpc
$ sudo zypper install lmod-defaults-gnu7-mpich-ohpc
$ sudo zypper install lmod-defaults-gnu7-mvapich2-ohpc
Interconnect support for each MPI type in OpenHPC:

| | Ethernet (TCP) | InfiniBand | Omni-Path |
|---|---|---|---|
| MPICH | X | | |
| MVAPICH2 | | X | |
| MVAPICH2 (psm2) | | | X |
| OpenMPI | X | X | X |
| OpenMPI (PMIx) | X | X | X |
Note

If you want to use MVAPICH2 (psm2), install mvapich2-psm2-gnu7-ohpc; if you want to use OpenMPI (PMIx), install openmpi3-pmix-slurm-gnu7-ohpc. Note that openmpi3-gnu7-ohpc and openmpi3-pmix-slurm-gnu7-ohpc are incompatible with each other, as are mvapich2-psm2-gnu7-ohpc and mvapich2-gnu7-ohpc.
Install Singularity

Singularity is a lightweight container framework for the HPC field.

Install the singularity-ohpc package
$ sudo yum -y install singularity-ohpc
$ sudo zypper install singularity-ohpc
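Optionally, verify the installation by loading the module and printing the version (a quick sketch):

$ module load singularity
$ singularity --version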
Add Singularity to the OpenHPC default environment

Edit /opt/ohpc/pub/modulefiles/ohpc and add the following lines:

# Add in the "module try-add" section
module try-add singularity
# Add in the "module del" section
module del singularity
Apply the configuration
$ source /etc/profile.d/lmod.sh
Note

When you install an lmod-defaults* package, its default configuration may overwrite the /opt/ohpc/pub/modulefiles/ohpc file and your changes will be lost.
In that case, either re-apply the changes to the /opt/ohpc/pub/modulefiles/ohpc file, or add "module try-add singularity" at the bottom of /etc/profile.d/lmod.sh.
Checkpoint B

Check Slurm
$ sudo sinfo
...
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 1-00:00:00 2 idle c[1-2]
...
Attention

The node state should be idle; idle* is an abnormal state.

Add a test account
$ sudo useradd -m test
$ sudo echo "MERGE:" > syncusers
$ sudo echo "/etc/passwd -> /etc/passwd" >> syncusers
$ sudo echo "/etc/group -> /etc/group" >> syncusers
$ sudo echo "/etc/shadow -> /etc/shadow" >> syncusers
$ sudo xdcp all -F syncusers
Run a test MPI program
$ su - test
$ mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c
$ srun -n 8 -N 1 -w compute --pty /bin/bash
$ prun ./a.out
...
Master compute host = c1
Resource manager = slurm
Launch cmd = mpiexec.hydra -bootstrap slurm ./a.out
Hello, world (8 procs total)
--> Process # 0 of 8 is alive. -> c1
--> Process # 4 of 8 is alive. -> c2
--> Process # 1 of 8 is alive. -> c1
--> Process # 5 of 8 is alive. -> c2
--> Process # 2 of 8 is alive. -> c1
--> Process # 6 of 8 is alive. -> c2
--> Process # 3 of 8 is alive. -> c1
--> Process # 7 of 8 is alive. -> c2
Note

After the test completes, remember to exit back to the root user on the management node.