Install the operating system

Install the OS on the management node

Note

Skip this step if the operating system is already installed on every node in the cluster.

Install an official el7 or sle12 release on the management node. A minimal installation is sufficient.

Deploy the OS on the other cluster nodes

Configure environment variables

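  • Log in to the management node and run the following commands to configure the environment variables for the whole installation process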

    $ su root
    $ cd ~
    $ vi lico_env.local
    
  • Edit the file lico_env.local

    Note

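    Edit lico_env.local using the content below as a reference, then save it (the comment lines starting with # can be left out of the actual file). This guide assumes that all nodes share the same BMC user name and password. If they differ, change them later in the step "Set xCAT node information".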

    # Management node hostname
    sms_name="head"
    # Set the domain name
    domain_name="hpc.com"
    # Set the OpenLDAP domain name
    lico_ldap_domain_name="dc=hpc,dc=com"
    # The IP address of the management node in the cluster intranet
    sms_ip="192.168.0.1"
    # Name of the network interface that carries the management node's intranet IP address
    sms_eth_internal="eth0"
    # Subnet mask of the intranet in the cluster.
    # If the OS has been installed on all the nodes in the cluster,
    # the default configuration is maintained.
    internal_netmask="255.255.0.0"
    
    # The user name and password of BMC
    bmc_username="<BMC_USERNAME>"
    bmc_password="<BMC_PASSWORD>"
    
    # xCAT OS Image Path
    iso_path="/isos"
    # Lenovo OpenHPC's local source directory
    ohpc_repo_dir="/install/custom/ohpc"
    
    # OS and SDK local source directories
    os_repo_dir="/install/custom/server"
    sdk_repo_dir="/install/custom/sdk"
    
    # xCAT's local source directory
    xcat_repo_dir="/install/custom/xcat"
    
    # LiCO-DEP's local source directory
    lico_dep_repo_dir="/install/custom/lico-dep"
    
    # LiCO's local source directory
    lico_repo_dir="/install/custom/lico"
    
    # Total number of computing nodes
    num_computes="2"
    # Prefix of the compute node hostnames.
    # If the OS is already installed on all nodes in the cluster, fill in the actual value.
    compute_prefix="c"
    # Compute node hostname list.
    # If the OS is already installed on all nodes in the cluster, fill in the actual values.
    c_name[0]=c1
    c_name[1]=c2
    # Compute node IP list.
    # If the OS is already installed on all nodes in the cluster, fill in the actual values.
    c_ip[0]=192.168.0.6
    c_ip[1]=192.168.0.16
    # MAC address of the network adapter that carries each compute node IP.
    # If the OS is already installed on all nodes in the cluster, fill in the actual values.
    c_mac[0]=fa:16:3e:73:ec:50
    c_mac[1]=fa:16:3e:27:32:c6
    # Compute node BMC address list.
    c_bmc[0]=192.168.1.6
    c_bmc[1]=192.168.1.16
    
    # Total number of login nodes
    num_logins="1"
    # Login node hostname list.
    # If the OS is already installed on all nodes in the cluster, fill in the actual values.
    l_name[0]=l1
    # Login node IP list.
    # If the OS is already installed on all nodes in the cluster, fill in the actual values.
    l_ip[0]=192.168.0.15
    # MAC address of the network adapter that carries each login node IP.
    # If the OS is already installed on all nodes in the cluster, fill in the actual values.
    l_mac[0]=fa:16:3e:2c:7a:47
    # Login Node BMC Address List
    l_bmc[0]=192.168.1.15
    
  • Apply the configuration

    $ sudo chmod 600 lico_env.local
    # source is a shell builtin and cannot be run through sudo; run it in the current shell
    $ source lico_env.local
    

    Note

    To log in to the LiCO web portal from an external network after the cluster is set up, configure a public IP address on the login node.
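
The node lists in lico_env.local are indexed by num_computes and num_logins in later steps, so a list with a missing entry only shows up as a broken node definition much later. The following is a minimal sketch of a consistency check, assuming lico_env.local has already been sourced in a bash session. check_node_arrays is a hypothetical helper written for this guide, not part of LiCO or xCAT.

```shell
# Hypothetical helper: verify that each named array has the expected number
# of entries. Arguments: label, expected count, then one or more array names.
check_node_arrays() {
    local label=$1 expected=$2
    shift 2
    local arr len status=0
    for arr in "$@"; do
        # eval is used instead of namerefs so that this also runs on bash 4.2
        eval "len=\${#${arr}[@]}"
        if [ "$len" -ne "$expected" ]; then
            echo "$label: $arr has $len entries, expected $expected" >&2
            status=1
        fi
    done
    return $status
}

# Example, after sourcing lico_env.local:
#   check_node_arrays compute "$num_computes" c_name c_ip c_mac c_bmc
#   check_node_arrays login   "$num_logins"   l_name l_ip l_mac l_bmc
```

The function returns a non-zero status and names the offending list on stderr when a count does not match.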

Obtain the local repositories

  • Download the official images

  • el7

    Download CentOS-7-x86_64-Everything-1708.iso, copy it to ${iso_path}, and run the following commands
    $ sudo mkdir -p ${iso_path}
    # Verify the ISO file: compute its checksum with the command below and compare it
    # with the value published at http://centos.unixheads.org/7/isos/x86_64/sha256sum.txt.
    # The two values must match.
    $ sudo sha256sum ${iso_path}/CentOS-7-x86_64-Everything-1708.iso
    
    # mount image
    $ sudo mkdir -p ${os_repo_dir}
    $ sudo mount -o loop ${iso_path}/CentOS-7-x86_64-Everything-1708.iso  ${os_repo_dir}
    
    # configure the local repository
    $ sudo tee ${iso_path}/EL7-OS.repo > /dev/null << eof
    [EL7-OS]
    name=el7-centos
    enabled=1
    gpgcheck=0
    type=rpm-md
    baseurl=file://${os_repo_dir}
    eof
    
    $ sudo cp -a ${iso_path}/EL7-OS.repo /etc/yum.repos.d/
    
  • sle12

    Download SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso and SLE-12-SP3-SDK-DVD-x86_64-GM-DVD1.iso, copy them to ${iso_path}, and run the following commands
    $ sudo mkdir -p ${iso_path}
    # Verify the ISO files: run the commands below and check that each checksum
    # matches the one published with the download.
    $ sudo md5sum ${iso_path}/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso
    $ sudo md5sum ${iso_path}/SLE-12-SP3-SDK-DVD-x86_64-GM-DVD1.iso
    
    # mount image
    $ sudo mkdir -p ${os_repo_dir}
    $ sudo mkdir -p ${sdk_repo_dir}
    $ sudo mount -o loop ${iso_path}/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso  ${os_repo_dir}
    $ sudo mount -o loop ${iso_path}/SLE-12-SP3-SDK-DVD-x86_64-GM-DVD1.iso  ${sdk_repo_dir}
    
    # configure the local repositories
    $ sudo tee ${iso_path}/SLES12-SP3-12.3.repo > /dev/null << eof
    [SLES12-SP3-12.3-SERVER]
    name=sle12-server
    enabled=1
    autorefresh=0
    gpgcheck=0
    baseurl=file://${os_repo_dir}

    [SLES12-SP3-12.3-SDK]
    name=sle12-sdk
    enabled=1
    autorefresh=0
    gpgcheck=0
    baseurl=file://${sdk_repo_dir}
    eof
    
    $ sudo zypper ar ${iso_path}/SLES12-SP3-12.3.repo
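
The checksum steps above rely on comparing two long hexadecimal strings by eye. A minimal sketch of an automated comparison follows, assuming bash and the sha256sum tool. verify_sha256 is a hypothetical helper, and the expected value still has to be copied from the published checksum list. For the sle12 images, which are checked with md5sum above, the same pattern would apply with md5sum substituted.

```shell
# Hypothetical helper: succeed only if the SHA-256 digest of a file matches
# the expected hexadecimal value. Arguments: file path, expected digest.
verify_sha256() {
    local file=$1 expected=$2 actual
    # Return 2 when the file cannot be read, to distinguish it from a mismatch
    [ -r "$file" ] || return 2
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    # Compare case-insensitively, since published lists may use upper case
    expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
    [ "$actual" = "$expected" ]
}

# Example (replace <EXPECTED_SHA256> with the published value):
#   verify_sha256 "${iso_path}/CentOS-7-x86_64-Everything-1708.iso" "<EXPECTED_SHA256>" \
#       && echo "checksum OK" || echo "checksum mismatch or file missing"
```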
    

Install Lenovo xCAT

Prepare the OS for the other nodes

Note

Skip this step if the operating system is already installed on every node in the cluster.

  • Import the images

    # el7
    $ sudo copycds ${iso_path}/CentOS-7-x86_64-Everything-1708.iso

    # sle12
    $ sudo copycds ${iso_path}/SLE-12-SP3-Server-DVD-x86_64-GM-DVD1.iso
    
  • Verify that the images are available

    # el7
    $ sudo lsdef -t osimage
    ...
    centos7.4-x86_64-install-compute (osimage)
    centos7.4-x86_64-netboot-compute (osimage)
    centos7.4-x86_64-statelite-compute (osimage)
    ...
    
    # sle12
    $ sudo lsdef -t osimage
    ...
    sles12.3-x86_64-install-compute  (osimage)
    sles12.3-x86_64-install-service  (osimage)
    sles12.3-x86_64-netboot-compute  (osimage)
    sles12.3-x86_64-statelite-compute  (osimage)
    ...
    
  • Set the image parameters

    Note

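    Nouveau is an open-source accelerated driver for NVIDIA graphics hardware. Following the official NVIDIA installation guide, disable this module before installing the CUDA driver.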

    # el7
    $ chdef -t osimage centos7.4-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"

    # sle12
    $ chdef -t osimage sles12.3-x86_64-install-compute addkcmdline="rdblacklist=nouveau nouveau.modeset=0 R::modprobe.blacklist=nouveau"
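
The image check above is done by reading the lsdef listing. A minimal sketch of a scripted check follows, assuming the listing keeps the "name (osimage)" layout shown in the sample output. osimage_listed is a hypothetical helper that only parses text on stdin and does not call xCAT itself.

```shell
# Hypothetical helper: succeed if the given osimage name appears as the first
# field of any line read from stdin (the output of `lsdef -t osimage`).
osimage_listed() {
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Example:
#   sudo lsdef -t osimage | osimage_listed centos7.4-x86_64-install-compute \
#       || echo "install image not found; check the copycds step" >&2
```

Matching on the whole first field avoids a false positive from a longer image name that merely contains the requested one.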
    

Set xCAT node information

Import compute node information
  for ((i=0; i<$num_computes; i++)); do
  sudo mkdef -t node ${c_name[$i]} groups=compute,all arch=x86_64 netboot=xnba mgt=ipmi bmcusername=${bmc_username} bmcpassword=${bmc_password} ip=${c_ip[$i]} mac=${c_mac[$i]} bmc=${c_bmc[$i]} serialport=0 serialspeed=115200;
  done
Import login node information
  for ((i=0; i<$num_logins; i++)); do
  sudo mkdef -t node ${l_name[$i]} groups=login,all  arch=x86_64 netboot=xnba mgt=ipmi bmcusername=${bmc_username} bmcpassword=${bmc_password} ip=${l_ip[$i]} mac=${l_mac[$i]} bmc=${l_bmc[$i]} serialport=0 serialspeed=115200;
  done
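
Because the loops above create node definitions directly, it can help to preview the generated commands first. The following is a minimal sketch of such a dry run, assuming lico_env.local has been sourced. print_mkdef is a hypothetical helper that only prints text. It deliberately leaves out bmcusername and bmcpassword so that credentials do not end up on the terminal.

```shell
# Hypothetical helper: print (not execute) the mkdef command for one node.
# Arguments: group, hostname, IP, MAC, BMC address.
print_mkdef() {
    local group=$1 name=$2 ip=$3 mac=$4 bmc=$5
    printf 'mkdef -t node %s groups=%s,all arch=x86_64 netboot=xnba mgt=ipmi ip=%s mac=%s bmc=%s serialport=0 serialspeed=115200\n' \
        "$name" "$group" "$ip" "$mac" "$bmc"
}

# Example, after sourcing lico_env.local:
#   for ((i=0; i<num_computes; i++)); do
#       print_mkdef compute "${c_name[$i]}" "${c_ip[$i]}" "${c_mac[$i]}" "${c_bmc[$i]}"
#   done
```

An empty field in the printed line (for example "mac=" with nothing after it) points to a missing entry in the corresponding list.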

Note

If the nodes do not share the same BMC user name and password, run the following command and edit the username and password columns in the table.

$ sudo tabedit ipmi
Set the root account password (replace <ROOT_PASSWORD> with the password you want to set)
  $ sudo chtab key=system passwd.username=root passwd.password=<ROOT_PASSWORD>

Add hosts resolution

Note

Skip this step if the operating system is already installed on the cluster and IP addresses can be resolved by hostname.

$ sudo chdef -t site domain=${domain_name}
$ sudo chdef -t site master=${sms_ip}
$ sudo chdef -t site nameservers=${sms_ip}
$ sudo sed -i "/^\s*${sms_ip}\s\+.*$/d" /etc/hosts
$ sudo sed -i "/\s*${sms_name}\s*/d" /etc/hosts
$ echo "${sms_ip}    ${sms_name}   ${sms_name}.${domain_name}" | sudo tee -a /etc/hosts > /dev/null
$ sudo makehosts
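
The hosts update above removes any old entries for the management node and then appends a fresh one. A minimal sketch of the same idea as a reusable function follows, assuming bash and awk. update_hosts_entry is a hypothetical helper and takes the file path as a parameter so that it can be tried on a copy before touching /etc/hosts. Unlike the sed commands, it matches the IP address and hostname as whole fields.

```shell
# Hypothetical helper: replace the entry for one host in a hosts-format file.
# Arguments: file path, IP address, short hostname, domain name.
update_hosts_entry() {
    local file=$1 ip=$2 name=$3 domain=$4 tmp
    tmp=$(mktemp)
    # Drop lines whose address equals the IP or whose names include the hostname
    awk -v ip="$ip" -v name="$name" '
        $1 == ip { next }
        { for (i = 2; i <= NF; i++) if ($i == name) next }
        { print }
    ' "$file" > "$tmp"
    printf '%s    %s   %s.%s\n' "$ip" "$name" "$name" "$domain" >> "$tmp"
    # Write back in place so that the ownership and mode of the file are kept
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example on a copy of the real file:
#   cp /etc/hosts /tmp/hosts.test
#   update_hosts_entry /tmp/hosts.test "${sms_ip}" "${sms_name}" "${domain_name}"
```

Running the function twice leaves a single entry for the host, so it can be repeated safely.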

Configure DHCP and DNS services

Run the following commands to configure them
$ sudo makenetworks
$ sudo makedhcp -n
$ sudo makedns -n
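
The node addresses in lico_env.local are expected to lie in the intranet described by sms_ip and internal_netmask. A minimal sketch of a subnet check in plain bash arithmetic follows. ip_to_int and same_subnet are hypothetical helpers, and the sketch assumes dotted IPv4 addresses without leading zeros.

```shell
# Hypothetical helper: convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Hypothetical helper: succeed if two addresses share a network under a mask.
# Arguments: first IP, second IP, netmask.
same_subnet() {
    local mask
    mask=$(ip_to_int "$3")
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Example, after sourcing lico_env.local:
#   for ip in "${c_ip[@]}" "${l_ip[@]}"; do
#       same_subnet "$sms_ip" "$ip" "$internal_netmask" || echo "$ip is outside the intranet" >&2
#   done
```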

Install the operating system on the nodes over the network

Note

Skip this step if the operating system is already installed on every node in the cluster.

  # el7
  $ sudo nodeset all osimage=centos7.4-x86_64-install-compute
  $ sudo rsetboot all net -u
  $ sudo rpower all reset

  # sle12
  $ sudo nodeset all osimage=sles12.3-x86_64-install-compute
  $ sudo rsetboot all net -u
  $ sudo rpower all reset

Note

The installation takes a long time to complete. Use the following command to check its progress.

$ sudo nodestat all
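
Instead of rerunning the status command by hand, the output can be counted in a loop. The following is a minimal sketch, assuming nodestat prints one "node: status" line per node and that a status containing "sshd" means the installed system is up. Both points are assumptions to check against the actual output on your cluster. count_not_ready is a hypothetical helper that only parses text on stdin.

```shell
# Hypothetical helper: read "node: status" lines on stdin and print how many
# nodes do not yet report a status containing "sshd".
count_not_ready() {
    awk -F': *' 'NF >= 2 && $2 !~ /sshd/ { n++ } END { print n + 0 }'
}

# Example: poll once a minute until every node reports sshd
#   while [ "$(sudo nodestat all | count_not_ready)" -gt 0 ]; do
#       sleep 60
#   done
```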

Checkpoint A

$ sudo psh all uptime
...
c1: 05:03am up 0:02, 0 users, load average: 0.20, 0.13, 0.05
c2: 05:03am up 0:02, 0 users, load average: 0.20, 0.14, 0.06
l1: 05:03am up 0:02, 0 users, load average: 0.17, 0.13, 0.05
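
The checkpoint passes when every node answers the uptime command. A minimal sketch of a scripted comparison follows, assuming the "host: text" output layout shown above. missing_nodes is a hypothetical helper that takes the expected hostnames as arguments and reads the command output on stdin.

```shell
# Hypothetical helper: print each expected hostname that does not appear as a
# line prefix ("host:") in the text read from stdin.
missing_nodes() {
    local seen n
    # Collect the part before the first colon of every "host: text" line
    seen=$(awk -F: 'NF >= 2 { print $1 }')
    for n in "$@"; do
        grep -qx -- "$n" <<< "$seen" || echo "$n"
    done
}

# Example, after sourcing lico_env.local:
#   sudo psh all uptime | missing_nodes "${c_name[@]}" "${l_name[@]}"
```

No output means that all expected nodes responded.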