Installing the Operating System
Install the OS on the management node
Note
If the operating system is already installed on every node in the cluster, you can skip this step.
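Install an official release of a supported operating system on the management node (the later steps in this guide use CentOS 7 build 1708 and SLES 12 SP3 images). A minimal installation is sufficient.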
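Deploy the OS on the other cluster nodes
Configure environment variables
Log in to the management node and run the following commands to set up the environment variables used throughout the installation: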
$ su root
$ cd ~
$ vi lico_env.local
Edit the file lico_env.local
Note
Modify the contents of lico_env.local as shown below, then save the file (the comment lines beginning with # can be omitted from the actual file). This guide assumes that all nodes share the same BMC user name and password. If they differ, change them later in the "Set xCAT node information" step.
# Management node hostname
sms_name="head"
# Set the domain name
domain_name="hpc.com"
# Set the OpenLDAP domain name
lico_ldap_domain_name="dc=hpc,dc=com"
# The IP address of the management node in the cluster intranet
sms_ip="192.168.0.1"
# The network card name of the management node's IP address
sms_eth_internal="eth0"
# Subnet mask of the intranet in the cluster.
# If the OS has been installed on all the nodes in the cluster,
# the default configuration is maintained.
internal_netmask="255.255.0.0"
# The user name and password of BMC
bmc_username="<BMC_USERNAME>"
bmc_password="<BMC_PASSWORD>"
# xCAT OS image path
iso_path="/isos"
# Lenovo OpenHPC's local source directory
ohpc_repo_dir="/install/custom/ohpc"
# OS's local source directory
os_repo_dir="/install/custom/server"
# SDK's local source directory (used for the SLES SDK image)
sdk_repo_dir="/install/custom/sdk"
# xCAT's local source directory
xcat_repo_dir="/install/custom/xcat"
# LiCO-DEP's local source directory
lico_dep_repo_dir="/install/custom/lico-dep"
# LiCO's local source directory
lico_repo_dir="/install/custom/lico"
# Total number of compute nodes
num_computes="2"
# Prefix of the compute node hostnames.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
compute_prefix="c"
# List of compute node hostnames.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
c_name[0]=c1
c_name[1]=c2
# Compute node IP list.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
c_ip[0]=192.168.0.6
c_ip[1]=192.168.0.16
# MAC address of the network adapter that carries each compute node IP.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
c_mac[0]=fa:16:3e:73:ec:50
c_mac[1]=fa:16:3e:27:32:c6
# Compute node BMC address list.
c_bmc[0]=192.168.1.6
c_bmc[1]=192.168.1.16
# Total number of login nodes
num_logins="1"
# Login node hostname list.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
l_name[0]=l1
# Login node IP list.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
l_ip[0]=192.168.0.15
# MAC address of the network adapter that carries each login node IP.
# If all nodes in the cluster have completed the OS installation, fill in the actual values.
l_mac[0]=fa:16:3e:2c:7a:47
# Login node BMC address list
l_bmc[0]=192.168.1.15
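Because a mistyped address in this file only surfaces much later, during node definition or network boot, it can help to sanity-check the values first. The following is a minimal sketch, not part of LiCO or xCAT: the helper names `is_ipv4` and `is_mac` are hypothetical, and the sample values are the ones from the example file.

```shell
#!/bin/sh
# Hypothetical helpers (not part of LiCO or xCAT) for sanity-checking values
# before they are written into lico_env.local.

# is_ipv4 ADDR: succeed if ADDR is a dotted quad with each octet in 0..255.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# is_mac ADDR: succeed if ADDR is six colon-separated hex byte pairs.
is_mac() {
    printf '%s\n' "$1" | grep -Eiq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'
}

# Sample values taken from the example file.
for ip in 192.168.0.6 192.168.0.16 192.168.1.6; do
    is_ipv4 "$ip" && echo "ok ip  $ip"
done
is_mac fa:16:3e:73:ec:50 && echo "ok mac fa:16:3e:73:ec:50"
is_ipv4 192.168.0.256 || echo "bad ip 192.168.0.256"
```

The checks are purely syntactic; they do not confirm that an address is reachable or belongs to the intended node.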
Apply the configuration
$ sudo chmod 600 lico_env.local
$ source lico_env.local
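The variables have to end up in the shell that runs the later commands, so the file must be read by that shell itself rather than by a child process. The sketch below illustrates this on a temporary stand-in file (the file and its two sample values are assumptions for the demonstration, not the real lico_env.local).

```shell
#!/bin/sh
# Sketch: load an env file in the current shell and confirm it took effect.
# The temporary file stands in for lico_env.local; its contents are sample values.
env_file=$(mktemp)
cat > "$env_file" << 'eof'
sms_name="head"
domain_name="hpc.com"
eof
chmod 600 "$env_file"

# "." is the POSIX spelling of "source"; it runs in the calling shell,
# so the variables remain set afterwards.
. "$env_file"

# The first ten characters of "ls -l" show the permission bits.
perms=$(ls -l "$env_file" | cut -c1-10)
echo "permissions: $perms"
echo "management node: ${sms_name}.${domain_name}"
rm -f "$env_file"
```

The variables are lost when the shell exits, so the file needs to be sourced again in every new login session before continuing with later steps.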
Note
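After the cluster environment is set up, logging in to the LiCO web portal from an external network requires a public IP address to be configured on the login node.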
Obtain the local repositories
Download the official images
- CentOS
$ sudo mkdir -p ${iso_path}
# Copy the downloaded ISO file into ${iso_path} before continuing.
# Verify the ISO file with the command below. The reference checksum is published at
# http://centos.unixheads.org/7/isos/x86_64/sha256sum.txt
# and must match the value that sha256sum prints.
$ sudo sha256sum ${iso_path}/CentOS-7-x86_64-Everything-1708.iso
# Mount the image
$ sudo mkdir -p ${os_repo_dir}
$ sudo mount -o loop ${iso_path}/CentOS-7-x86_64-Everything-1708.iso ${os_repo_dir}
# Configure the local repository
$ sudo tee ${iso_path}/EL7-OS.repo << eof > /dev/null
[EL7-OS]
name=el7-centos
enabled=1
gpgcheck=0
type=rpm-md
baseurl=file://${os_repo_dir}
eof
$ sudo cp -a ${iso_path}/EL7-OS.repo /etc/yum.repos.d/
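Comparing two 64-character digests by eye is error-prone, so the comparison can be scripted. This is a minimal sketch under stated assumptions: `verify_sha256` is a hypothetical helper, and the expected digest would be copied by hand from the published sha256sum.txt. The demonstration runs on a small temporary file rather than a real ISO.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED: succeed only if FILE's SHA-256 digest equals EXPECTED.
# Hypothetical helper; EXPECTED would come from the published sha256sum.txt.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ -n "$actual" ] && [ "$actual" = "$2" ]
}

# Demonstration on a small temporary file instead of a real ISO.
sample=$(mktemp)
printf 'hello\n' > "$sample"
expected=$(sha256sum "$sample" | awk '{print $1}')
if verify_sha256 "$sample" "$expected"; then
    echo "checksum matches"
fi
if ! verify_sha256 "$sample" "0000"; then
    echo "checksum mismatch detected"
fi
rm -f "$sample"
```

For the real image the call would look like `verify_sha256 ${iso_path}/CentOS-7-x86_64-Everything-1708.iso <digest>`, with the digest taken from the published list.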
- SLES
$ sudo mkdir -p ${iso_path}
# Copy the downloaded ISO files into ${iso_path} before continuing.
# Verify the ISO files with the commands below and check that each checksum
# matches the one published at the download link above.
$ sudo md5sum ${iso_path}/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso
$ sudo md5sum ${iso_path}/SLE-12-SP3-SDK-DVD-x86_64-GM-DVD1.iso
# Mount the images
$ sudo mkdir -p ${os_repo_dir}
$ sudo mkdir -p ${sdk_repo_dir}
$ sudo mount -o loop ${iso_path}/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso ${os_repo_dir}
$ sudo mount -o loop ${iso_path}/SLE-12-SP3-SDK-DVD-x86_64-GM-DVD1.iso ${sdk_repo_dir}
# Configure the local repositories
$ sudo tee ${iso_path}/SLES12-SP3-12.3.repo << eof > /dev/null
[SLES12-SP3-12.3-SERVER]
name=sle12-server
enabled=1
autorefresh=0
gpgcheck=0
baseurl=file://${os_repo_dir}
[SLES12-SP3-12.3-SDK]
name=sle12-sdk
enabled=1
autorefresh=0
gpgcheck=0
baseurl=file://${sdk_repo_dir}
eof
$ sudo zypper ar ${iso_path}/SLES12-SP3-12.3.repo
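When a repo file contains several sections, each section's baseurl should point at the directory where its own image is mounted. One way to keep that pairing explicit is to generate every section from the same template. The sketch below assumes a hypothetical helper `repo_stanza` and uses the default directories from lico_env.local; it only prints to standard output.

```shell
#!/bin/sh
# repo_stanza ID NAME DIR: print one repository section whose baseurl points at DIR.
# Hypothetical helper; the keys mirror the ones used in the SLES repo file.
repo_stanza() {
    printf '[%s]\nname=%s\nenabled=1\nautorefresh=0\ngpgcheck=0\nbaseurl=file://%s\n' \
        "$1" "$2" "$3"
}

# Sample directories matching the defaults in lico_env.local.
os_repo_dir="/install/custom/server"
sdk_repo_dir="/install/custom/sdk"

repo_stanza SLES12-SP3-12.3-SERVER sle12-server "$os_repo_dir"
repo_stanza SLES12-SP3-12.3-SDK sle12-sdk "$sdk_repo_dir"
```

Since DIR is an absolute path, `file://` followed by DIR yields the usual three-slash form, for example `file:///install/custom/sdk`.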
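Install Lenovo xCAT
Download xCAT
Configure and install
Upload the installation package to the management node, then configure and install xCAT with the commands below.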
# Create xCAT's local source (CentOS)
$ sudo yum -y install bzip2
$ sudo mkdir -p $xcat_repo_dir
$ sudo tar -xvf xcat-2.13.8.lenovo3_confluent-1.8.2_lenovo_confluent-0.8.1-el7.tar.bz2 -C $xcat_repo_dir
$ cd $xcat_repo_dir/lenovo-hpc-el7
$ sudo ./mklocalrepo.sh
$ cd ~
# Install xCAT
$ sudo yum -y install xCAT
$ sudo systemctl start xcatd
$ source /etc/profile.d/xcat.sh
# Create xCAT's local source (SLES)
$ sudo mkdir -p $xcat_repo_dir
$ sudo tar -xvf xcat-2.13.8.lenovo3_confluent-1.8.2_lenovo_confluent-0.8.1-sles12.tar.bz2 -C $xcat_repo_dir
$ cd $xcat_repo_dir/lenovo-hpc-sles12
$ sudo ./mklocalrepo.sh
$ cd ~
# Install xCAT
$ sudo zypper install xCAT
$ sudo systemctl start xcatd
$ source /etc/profile.d/xcat.sh
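Prepare the OS for the other nodes
Note
If the operating system is already installed on every node in the cluster, you can skip this step.
Import the images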
$ sudo copycds ${iso_path}/CentOS-7-x86_64-Everything-1708.iso
$ sudo copycds ${iso_path}/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso
Verify that the images are available
$ sudo lsdef -t osimage
...
centos7.4-x86_64-install-compute (osimage)
centos7.4-x86_64-netboot-compute (osimage)
centos7.4-x86_64-statelite-compute (osimage)
...
$ sudo lsdef -t osimage
...
sles12.3-x86_64-install-compute (osimage)
sles12.3-x86_64-install-service (osimage)
sles12.3-x86_64-netboot-compute (osimage)
sles12.3-x86_64-statelite-compute (osimage)
...
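The check for the expected image name in that listing can be scripted instead of done by eye. The following is a minimal sketch: `has_osimage` is a hypothetical helper that reads the listing from standard input, and the sample text is copied from the CentOS listing so that the example runs without xCAT installed.

```shell
#!/bin/sh
# has_osimage NAME: read "lsdef -t osimage" style output on stdin and succeed
# if a line of the form "NAME (osimage)" is present. Hypothetical helper.
has_osimage() {
    awk -v want="$1" '$1 == want && $2 == "(osimage)" { found = 1 } END { exit !found }'
}

# Sample output copied from the listing, so no xCAT installation is needed here.
sample_output='centos7.4-x86_64-install-compute (osimage)
centos7.4-x86_64-netboot-compute (osimage)
centos7.4-x86_64-statelite-compute (osimage)'

if printf '%s\n' "$sample_output" | has_osimage centos7.4-x86_64-install-compute; then
    echo "install image found"
fi
```

On the management node the same function could be fed directly, as in `lsdef -t osimage | has_osimage centos7.4-x86_64-install-compute`.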
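Set image parameters
Note
Nouveau is an open-source graphics driver for NVIDIA GPUs. In line with NVIDIA's official installation guide, this module should be disabled before the CUDA driver is installed.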
$ chdef -t osimage centos7.4-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"
$ chdef -t osimage sles12.3-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"
Set xCAT node information
for ((i=0; i<$num_computes; i++)); do
  sudo mkdef -t node ${c_name[$i]} groups=compute,all arch=x86_64 netboot=xnba mgt=ipmi \
    bmcusername=${bmc_username} bmcpassword=${bmc_password} \
    ip=${c_ip[$i]} mac=${c_mac[$i]} bmc=${c_bmc[$i]} serialport=0 serialspeed=115200
done
for ((i=0; i<$num_logins; i++)); do
  sudo mkdef -t node ${l_name[$i]} groups=login,all arch=x86_64 netboot=xnba mgt=ipmi \
    bmcusername=${bmc_username} bmcpassword=${bmc_password} \
    ip=${l_ip[$i]} mac=${l_mac[$i]} bmc=${l_bmc[$i]} serialport=0 serialspeed=115200
done
Note
If the BMC user names and passwords differ between nodes, run the following command and edit the username and password columns of the table.
$ sudo tabedit ipmi
Then set the root password that xCAT assigns to the deployed nodes:
$ sudo chtab key=system passwd.username=root passwd.password=<ROOT_PASSWORD>
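Before defining nodes in the xCAT database, it can be useful to preview the command that each loop iteration would run. The sketch below assumes a hypothetical helper `mkdef_preview` that only prints text; the BMC credentials are deliberately left out of the preview so that they do not end up in terminal logs, and the sample values come from the example lico_env.local.

```shell
#!/bin/sh
# mkdef_preview NAME GROUP IP MAC BMC: print the mkdef command that would define
# one node, without running it. Hypothetical helper; the bmcusername and
# bmcpassword attributes are omitted from the preview on purpose.
mkdef_preview() {
    printf 'mkdef -t node %s groups=%s,all arch=x86_64 netboot=xnba mgt=ipmi ip=%s mac=%s bmc=%s serialport=0 serialspeed=115200\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Sample values from the example lico_env.local.
mkdef_preview c1 compute 192.168.0.6 fa:16:3e:73:ec:50 192.168.1.6
mkdef_preview l1 login 192.168.0.15 fa:16:3e:2c:7a:47 192.168.1.15
```

Inside the loops of this step, the call would take the array elements instead, for example `mkdef_preview ${c_name[$i]} compute ${c_ip[$i]} ${c_mac[$i]} ${c_bmc[$i]}`.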
Add hosts resolution
Note
If the operating system is already installed on the cluster and hostnames already resolve to IP addresses, skip this step.
$ sudo chdef -t site domain=${domain_name}
$ sudo chdef -t site master=${sms_ip}
$ sudo chdef -t site nameservers=${sms_ip}
$ sudo sed -i "/^\s*${sms_ip}\s*.*$/d" /etc/hosts
$ sudo sed -i "/\s*${sms_name}\s*/d" /etc/hosts
$ echo "${sms_ip} ${sms_name} ${sms_name}.${domain_name}" | sudo tee -a /etc/hosts
$ sudo makehosts
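The delete-then-append pattern used for the management node entry is what makes the step safe to repeat. The sketch below shows a simplified variant that matches only on the IP address in the first column (the steps above also remove lines containing the hostname). `set_hosts_entry` is a hypothetical helper, and it takes the file as a parameter so that it can be tried on a copy instead of the real /etc/hosts.

```shell
#!/bin/sh
# set_hosts_entry FILE IP NAME DOMAIN: remove any line in FILE whose first field
# is IP, then append "IP NAME NAME.DOMAIN". Hypothetical helper; FILE is a
# parameter so the function can be tried on a copy first.
set_hosts_entry() {
    tmp=$(mktemp) || return 1
    awk -v ip="$2" '$1 != ip' "$1" > "$tmp" || return 1
    printf '%s %s %s.%s\n' "$2" "$3" "$3" "$4" >> "$tmp"
    cat "$tmp" > "$1"      # overwrite in place, keeping FILE's owner and mode
    rm -f "$tmp"
}

# Demonstration on a scratch file rather than the real /etc/hosts.
hosts_copy=$(mktemp)
printf '127.0.0.1 localhost\n192.168.0.1 oldname\n' > "$hosts_copy"
set_hosts_entry "$hosts_copy" 192.168.0.1 head hpc.com
set_hosts_entry "$hosts_copy" 192.168.0.1 head hpc.com   # second run adds no duplicate
cat "$hosts_copy"
rm -f "$hosts_copy"
```

Matching on the exact first field avoids deleting unrelated lines that merely contain the hostname as a substring.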
Configure the DHCP and DNS services
$ sudo makenetworks
$ sudo makedhcp -n
$ sudo makedns -n
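Install the operating system on the nodes over the network
Note
If the operating system is already installed on every node in the cluster, you can skip this step.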
$ sudo nodeset all osimage=centos7.4-x86_64-install-compute
$ sudo rsetboot all net -u
$ sudo rpower all reset
$ sudo nodeset all osimage=sles12.3-x86_64-install-compute
$ sudo rsetboot all net -u
$ sudo rpower all reset
Note
The installation takes a long time to complete. You can check its progress with the following command.
$ sudo nodestat all
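Rather than rerunning the status command by hand, its output can be tested in a script. The sketch below rests on an assumption that should be confirmed on your system: that a node which has finished installing and rebooted is reported with the status `sshd`. `all_nodes_ready` is a hypothetical helper, and both sample outputs are invented for the demonstration.

```shell
#!/bin/sh
# all_nodes_ready: read "nodestat" style output ("node: status") on stdin and
# succeed only if there is at least one line and every status is "sshd".
# Assumption: a node that has finished installing and booted reports "sshd".
all_nodes_ready() {
    awk -F': *' '{ total++ } $2 == "sshd" { ready++ } END { exit !(total > 0 && ready == total) }'
}

# Hypothetical sample outputs: one cluster still installing, one finished.
in_progress='c1: installing
c2: sshd
l1: noping'
finished='c1: sshd
c2: sshd
l1: sshd'

printf '%s\n' "$in_progress" | all_nodes_ready || echo "still installing"
printf '%s\n' "$finished" | all_nodes_ready && echo "all nodes ready"
```

A polling loop on the management node could then be written as `until nodestat all | all_nodes_ready; do sleep 60; done`.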
Checkpoint A
$ sudo psh all uptime
...
c1: 05:03am up 0:02, 0 users, load average: 0.20, 0.13, 0.05
c2: 05:03am up 0:02, 0 users, load average: 0.20, 0.14, 0.06
l1: 05:03am up 0:02, 0 users, load average: 0.17, 0.13, 0.05
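The checkpoint passes when every compute and login node answers. A node that is down simply produces no line, which is easy to overlook in a long listing, so the comparison against the expected hostnames can be scripted. This is a minimal sketch with a hypothetical helper `missing_nodes`; the sample output is taken from the listing with the `c2` line removed to simulate a failure.

```shell
#!/bin/sh
# missing_nodes NAME...: read "psh all uptime" style output ("node: text") on stdin
# and print each expected NAME that produced no line. Hypothetical helper.
missing_nodes() {
    seen=$(cut -d: -f1 | sort -u)
    for name in "$@"; do
        printf '%s\n' "$seen" | grep -qxF "$name" || echo "$name"
    done
}

# Sample output with the c2 line removed to simulate a node that did not answer.
sample='c1: 05:03am up 0:02, 0 users, load average: 0.20, 0.13, 0.05
l1: 05:03am up 0:02, 0 users, load average: 0.17, 0.13, 0.05'

printf '%s\n' "$sample" | missing_nodes c1 c2 l1
```

With the arrays from lico_env.local loaded in a bash session, the expected names could be passed as `"${c_name[@]}" "${l_name[@]}"`; empty output then means every node responded.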