Diskless installation guide

1: This document is based on the Confluent image workflow: an image is created on the cluster head node and pushed to the compute nodes to deploy the cluster. The head node therefore runs both httpd and NGINX. httpd must keep the default HTTPS port 443, so the HTTPS port of the NGINX service must be changed to another port.

2: In this scenario, the login node and the management node are the same node.
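
Note 1 means NGINX must not listen for HTTPS on port 443. The following is a minimal sketch of one way to move it, assuming the NGINX configuration uses "listen 443 ..." style directives. The function name, the configuration path, and the replacement port 8443 are illustrative assumptions, not values from the LiCO documentation.

```shell
# Sketch: move NGINX HTTPS listeners off port 443 so httpd can keep it.
# Assumptions: the config path and the replacement port are examples only.
set_nginx_https_port() (
    conf="$1"   # e.g. /etc/nginx/nginx.conf (adjust to your layout)
    port="$2"   # e.g. 8443 (any free port other than 443)
    # Rewrite "listen 443 ..." and "listen [::]:443 ..." and nothing else.
    sed -i -E "s/^([[:space:]]*listen[[:space:]]+(\[::\]:)?)443([[:space:];])/\1${port}\3/" "$conf"
)

# Typical use, followed by a syntax check and a reload:
#   set_nginx_https_port /etc/nginx/nginx.conf 8443
#   nginx -t && systemctl reload nginx
```

Listeners on other ports, such as 80, are left untouched.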

 

Head node deployment

Configuration and Preparation

Configure the memory:

Configuring environment variables

  1. Log in to the management node.

  2. Create a new lico_env.local file as described in section 2 of the LiCO installation documentation.

  3. Reload the file:
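
Later commands in this guide rely on variables from lico_env.local, such as sms_ip, sms_name and iso_path. A small sketch for failing early when one of them is missing is shown below. The helper name require_vars is hypothetical, and the list of variables to check is up to you.

```shell
# Sketch: fail early when variables expected from lico_env.local are unset.
# Runs in a subshell so the helper variables do not leak into your session.
require_vars() (
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing variable: $name" >&2
            missing=1
        fi
    done
    exit "$missing"
)

# Typical use after reloading the file:
#   source /root/lico_env.local
#   require_vars sms_ip sms_name iso_path || echo "fix lico_env.local first"
```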

Create OS local repository

  1. Create a directory for storing the ISO file:

  2. Download Rocky-8.6-x86_64-dvd1.iso and the CHECKSUM file from: https://rockylinux.org/download

  3. Copy the files to ${iso_path}.

  4. Verify that the checksum of the ISO file matches the value listed in CHECKSUM.

  5. Mount the ISO image:

  6. Configure the local repository and copy it to /etc/yum.repos.d/:

  7. Back up the repository:

  8. Enable the NGINX web server:
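
Steps 4 and 6 above can be sketched as two small functions. This is an illustration under stated assumptions: the function names are hypothetical, the repository section names and gpgcheck=0 are example choices, and the mount directory is whatever you used in step 5. The sketch writes only to the paths you pass in.

```shell
# Sketch of steps 4 and 6. Nothing here touches /etc directly.

# Succeeds only if the SHA-256 of the ISO appears in the CHECKSUM file.
verify_iso_checksum() (
    iso="$1"
    checksum_file="$2"
    hash=$(sha256sum "$iso" | awk '{print $1}')
    # An empty hash means sha256sum failed; never treat that as a match.
    [ -n "$hash" ] || exit 1
    grep -qi -- "$hash" "$checksum_file"
)

# Writes a repo file with BaseOS and AppStream sections for a mounted DVD.
write_os_repo() (
    mount_dir="$1"   # where the ISO is mounted (assumed by this sketch)
    repo_file="$2"   # e.g. ${iso_path}/EL8-OS.repo
    cat > "$repo_file" <<EOF
[EL8-OS-BaseOS]
name=el8-baseos
enabled=1
gpgcheck=0
baseurl=file://${mount_dir}/BaseOS

[EL8-OS-AppStream]
name=el8-appstream
enabled=1
gpgcheck=0
baseurl=file://${mount_dir}/AppStream
EOF
)

# Typical use:
#   verify_iso_checksum ${iso_path}/Rocky-8.6-x86_64-dvd1.iso ${iso_path}/CHECKSUM
#   write_os_repo /path/to/mounted/iso ${iso_path}/EL8-OS.repo
#   cp ${iso_path}/EL8-OS.repo /etc/yum.repos.d/
```

Searching the CHECKSUM file for the computed hash works regardless of how the file lays out its lines.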

Install Lenovo Confluent

  1. Download the following package: https://hpc.lenovo.com/downloads/22b/confluent-3.5.0-2-el8.tar.xz

  2. Upload the package to the /root directory.

  3. Create confluent local repository:

  4. Install Lenovo Confluent:

  5. Create confluent account:

  6. Disable SELinux:

Configure Confluent to prepare for compute node deployment

Please ensure that the BMC user name and password are consistent for every node.

The deployment.useinsecureprotocols=firmware setting enables PXE support (by default, HTTPS-only mode is the only mode allowed). The console.method=ipmi setting is optional; if specified, it instructs Confluent to use IPMI to access the text console, which enables the nodeconsole command.

Passwords and similar values can be specified the same way, but it is recommended to use the -p argument to prompt for them, which keeps them out of your command history. If no root password is specified, the default behavior is to disable password-based login:
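
The two settings and the -p prompt described above can be sketched as follows. The wrapper prints each command by default and runs it only when APPLY=1 is set, so the sketch can be reviewed before it changes anything. The helper names run and set_group_defaults are hypothetical; "everything" is the Confluent group that contains all nodes.

```shell
# Sketch: print the Confluent group defaults, or apply them when APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

set_group_defaults() {
    # Allow PXE boot and use IPMI for the text console.
    run nodegroupattrib everything deployment.useinsecureprotocols=firmware console.method=ipmi
    # -p prompts for each value so secrets stay out of the shell history.
    run nodegroupattrib everything -p bmcuser bmcpass crypted.rootpassword
}

# Review first, then apply:
#   set_group_defaults
#   APPLY=1 set_group_defaults
```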

Define nodes in confluent
  1. Define the management node from the lico_env.local file in Confluent:

  2. Define the compute node configuration in Confluent:

  3. Set the nodes to boot from the network (PXE) by default:

Prepare name resolution
  1. Append node information to /etc/hosts:

  2. Install and start dnsmasq, making /etc/hosts available through DNS:
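
Appending node entries by hand can create duplicates when the step is repeated. The sketch below adds an entry only if the node name is not already present. The function name add_host_entry is hypothetical, and the file is passed as an argument so the sketch can be tried on a copy before it is pointed at /etc/hosts.

```shell
# Sketch: append "IP name" to a hosts file unless the name is already listed.
add_host_entry() (
    hosts_file="$1"
    ip="$2"
    name="$3"
    # Compare whole fields so that "c1" does not match "c10" or "c1-bmc".
    if awk -v n="$name" \
        '{ for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' \
        "$hosts_file" 2>/dev/null; then
        exit 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
)

# Typical use, then install and start dnsmasq (it serves /etc/hosts by default):
#   add_host_entry /etc/hosts "${sms_ip}" "${sms_name}"
#   dnf install -y dnsmasq
#   systemctl enable dnsmasq --now
```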

Initialize confluent operating system deployment

Users can set up the requirements for operating system deployment with the initialize subcommand of the osdeploy command. The -i option prompts interactively for the available options:

Import install media:

Build image directory

If you do not have any GPU nodes in the cluster, build a single image and skip all GPU-related commands below.

If both GPU and non GPU nodes are present, you will need to build two separate images.
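
The choice between one and two images can be sketched as below. The scratch directories /tmp/scratchdir and /tmp/scratchdir-gpu are the ones used later in this guide. The function names are hypothetical, and the value passed to -s is assumed to be the distribution name created when the install media was imported, so check the name on your system before running the printed commands.

```shell
# Sketch: list the scratch directories to build, one per image.
# Pass "yes" when the cluster has GPU nodes, anything else otherwise.
image_scratch_dirs() {
    echo /tmp/scratchdir
    if [ "$1" = "yes" ]; then
        echo /tmp/scratchdir-gpu
    fi
}

# Print (not run) one build command per image.
# Assumption: "rocky-8.6-x86_64" is the imported distribution name.
print_build_commands() {
    for dir in $(image_scratch_dirs "$1"); do
        echo "imgutil build -s rocky-8.6-x86_64 $dir"
    done
}

# Typical use:
#   print_build_commands yes
```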

 

Install OHPC

Define share folder

Enable repositories for other nodes

Enable httpd services

Configure Lenovo OpenHPC repositories

  1. Download the following package: https://hpc.lenovo.com/lico/downloads/7.0/Lenovo-OpenHPC-2.5.EL8.x86_64.tar

  2. Upload the package to the /root directory.

  3. Configure the local Lenovo OpenHPC repository:

Configure the LiCO dependencies repositories

  1. Download the following package: https://hpc.lenovo.com/lico/downloads/7.0/lico-dep-7.0.0.el8.x86_64.tgz

  2. Upload the package to the /root directory.

  3. Configure the repository for the management node:

Obtain the LiCO installation package

  1. Obtain the LiCO 7.0.0 release package for EL8 lico-release-7.0.0.el8.tar.gz and the LiCO license file from: https://commercial.lenovo.com/cn/en/signin
  2. Upload the release package to the management node.

Configure the local repository for LiCO

  1. Configure the local repository for the management node:

Install Slurm

  1. Install the base package:

  2. Install Slurm:

Configure NFS

Configure user shared directory

The following steps describe how to create the user shared directory, taking /home as an example.

Share /home from the management node:

Configure shared directory for OpenHPC

Share /opt/ohpc/pub from the management node for OpenHPC:
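
Both shares come down to one line each in the NFS exports file. The sketch below adds an export line only if it is not already present. The function name add_export is hypothetical, and the export options in the usage comment are example values, not taken from this guide; choose options that match your site policy.

```shell
# Sketch: add "DIR *(OPTIONS)" to an exports file unless it is already there.
add_export() (
    exports_file="$1"
    dir="$2"
    opts="$3"
    line="$dir *($opts)"
    grep -qxF -- "$line" "$exports_file" 2>/dev/null || echo "$line" >> "$exports_file"
)

# Typical use (option values are assumptions), then re-export:
#   add_export /etc/exports /home "rw,async,no_subtree_check,no_root_squash"
#   add_export /etc/exports /opt/ohpc/pub "ro,no_subtree_check"
#   exportfs -a
```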

Configure Chrony

  1. Install Chrony:

  2. Configure Chrony as described in the documentation at: https://chrony.tuxfamily.org/documentation.html

  3. Enable chronyd:

Configure Slurm

  1. Download slurm.conf from the following website: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/

  2. Upload slurm.conf to /etc/slurm/ and modify it as described in the LiCO installation guide.

  3. Download cgroup.conf from the following website: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/

  4. Upload cgroup.conf to /etc/slurm/.

  5. Create /etc/slurm/gres.conf and list the GPU resources of all nodes in the following format:

  6. Start service:
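
Step 5 refers to a format that is not reproduced here. As a sketch, a standard Slurm gres.conf line names the node, the resource type, and the device files. The helper below generates such lines; the function name gres_line and the node names in the usage comment are hypothetical, and it assumes NVIDIA device files named /dev/nvidia0, /dev/nvidia1, and so on.

```shell
# Sketch: emit one gres.conf line for a GPU node.
# Arguments: node name and number of GPUs on that node (1 or more).
gres_line() {
    node="$1"
    count="$2"
    if [ "$count" -eq 1 ]; then
        echo "NodeName=$node Name=gpu File=/dev/nvidia0"
    else
        echo "NodeName=$node Name=gpu File=/dev/nvidia[0-$((count - 1))]"
    fi
}

# Typical use:
#   gres_line c1 4 >> /etc/slurm/gres.conf
#   gres_line c2 1 >> /etc/slurm/gres.conf
```

The GPU count per node should agree with the Gres= value declared for that node in slurm.conf.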

Install Icinga2

Install MPI

  1. Install three modules (OpenMPI, MPICH, and MVAPICH) on the system:

  2. Set the default module.

    Set the OpenMPI module as the default:

    Set the MPICH module as the default:

    Set the MVAPICH module as the default:

Install Singularity

  1. Install Singularity:

  2. Edit the file /opt/ohpc/pub/modulefiles/ohpc by adding the following content to the end of the module try-add block:

  3. In the module del block, add the following content as the first line:

  4. Run the following command:

Install the LiCO dependencies

Install RabbitMQ

  1. Install RabbitMQ:

  2. Start RabbitMQ service:

Install MariaDB

  1. Install MariaDB:

  2. Configure MariaDB for LiCO:

  3. Configure the MariaDB limits:

Install InfluxDB

Configure user authentication

Install OpenLDAP
  1. Install and configure OpenLDAP:

    a. Download the openldap-servers package from the Rocky official website: https://download.rockylinux.org/pub/rocky/8/PowerTools/x86_64/os/Packages/o/openldap-servers-2.4.46-18.el8.x86_64.rpm

    b. Upload the package to the LiCO management node.

    c. Install openldap-servers:

    d. Install slapd-ssl-config:

  2. Modify the configuration file:

  3. Obtain the OpenLDAP key:

  4. Edit /etc/openldap/slapd.conf to set the root password to the key that was obtained.

  5. Change the owner of the configuration file:

  6. Start the OpenLDAP service:

  7. Verify that the service has been started:
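
The step that edits /etc/openldap/slapd.conf can be scripted. The sketch below assumes the file already contains a line starting with "rootpw" and replaces its value with the hash printed by slappasswd. The function name set_ldap_rootpw is hypothetical.

```shell
# Sketch: replace the value of the rootpw line in a slapd.conf file.
# Fails if the file has no rootpw line afterwards.
set_ldap_rootpw() (
    conf="$1"
    hash="$2"
    # "|" is used as the sed delimiter because slappasswd hashes are
    # base64 text and can contain "/" but never "|".
    sed -i "s|^rootpw[[:space:]].*|rootpw ${hash}|" "$conf" || exit 1
    grep -q '^rootpw ' "$conf"
)

# Typical use (slappasswd prompts for the password and prints the hash):
#   hash=$(slappasswd)
#   set_ldap_rootpw /etc/openldap/slapd.conf "$hash"
```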

Install libuser

The libuser module is a recommended toolkit for OpenLDAP. The installation of this module is optional.

  1. Install libuser:

  2. Download libuser.conf from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/ to /etc on the management node, and modify it as described in the LiCO installation guide.

Install OpenLDAP-client
Install nss-pam-ldapd
  1. Install nss-pam-ldapd:

  2. Download nslcd.conf from: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/

  3. Upload the file to /etc and modify the configuration as described in the LiCO installation guide.

  4. Modify the file permissions:

  5. Start the nslcd service:

Configure authselect-nslcd-config
  1. Create the path for the configuration file:

  2. Download configuration files from: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/authselect/authselect.tar.gz

  3. Upload the configuration files to /root

  4. Extract the archive:

  5. Enable the configuration:
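
Steps 1, 4 and 5 match the commands that are run later inside the compute image: create /usr/share/authselect/vendor/nslcd, extract authselect.tar.gz into it, and select the nslcd profile. A sketch that wraps the first two follows. The function name install_authselect_profile is hypothetical, and the destination can be overridden to try the extraction in a scratch directory first.

```shell
# Sketch: extract the authselect profile archive into the vendor directory.
install_authselect_profile() (
    archive="$1"
    dest="${2:-/usr/share/authselect/vendor/nslcd}"
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
)

# Typical use on the management node:
#   install_authselect_profile /root/authselect.tar.gz
#   authselect select nslcd with-mkhomedir --force
```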

Install LiCO

  1. Install the LiCO module

  2. Configure shared directory for LiCO

  3. Install portal

  4. Install AI component

  5. (Optional) Provide e-mail, SMS, and WeChat services:

  6. (Optional) Install Icinga2 monitoring components

  7. Restart services:

Configure LiCO and start service

Note: The username and password of Icinga2 can be viewed and changed in /etc/icinga2/conf.d/api-users.conf.

Compute node deployment

Prepare the files that need to be put into the image

  1. Repos

  2. Configure automatic start for the GPU driver (for GPU image)

    Download NVIDIA-Linux-x86_64-520.61.07.run from https://us.download.nvidia.com/tesla/520.61.07/NVIDIA-Linux-x86_64-520.61.07.run and copy it to the shared directory $share_installer_dir.

cat << eof > $share_installer_dir/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
eof

cp /etc/slurm/slurm.conf $share_installer_dir/slurm.conf
cp /etc/slurm/cgroup.conf $share_installer_dir/cgroup.conf
cp /etc/slurm/gres.conf $share_installer_dir/gres.conf
cp /etc/munge/munge.key $share_installer_dir

cp /etc/openldap/ldap.conf $share_installer_dir
cp /etc/nslcd.conf $share_installer_dir/nslcd.conf

cp /root/authselect.tar.gz $share_installer_dir

\cp ~/lico_env.local /tmp/scratchdir/root/
\cp $share_installer_dir/hosts /tmp/scratchdir/etc/hosts
\cp $share_installer_dir/limits.conf /tmp/scratchdir/etc/security/limits.conf
\cp $share_installer_dir/EL8-OS.repo /tmp/scratchdir/etc/yum.repos.d/
\cp $share_installer_dir/Lenovo.OpenHPC.local.repo /tmp/scratchdir/etc/yum.repos.d/
echo -e %_excludedocs 1 >> /tmp/scratchdir/root/.rpmmacros
\cp $share_installer_dir/lico-dep.repo /tmp/scratchdir/etc/yum.repos.d/
\cp $share_installer_dir/lico-release.repo /tmp/scratchdir/etc/yum.repos.d/

cd /tmp/scratchdir/etc/yum.repos.d
mkdir rocky
mv Rocky* rocky/

\cp ~/lico_env.local /tmp/scratchdir-gpu/root/
\cp $share_installer_dir/hosts /tmp/scratchdir-gpu/etc/hosts
\cp $share_installer_dir/limits.conf /tmp/scratchdir-gpu/etc/security/limits.conf
\cp $share_installer_dir/EL8-OS.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
\cp $share_installer_dir/Lenovo.OpenHPC.local.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
echo -e %_excludedocs 1 >> /tmp/scratchdir-gpu/root/.rpmmacros
\cp $share_installer_dir/lico-dep.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
\cp $share_installer_dir/lico-release.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
\cp $share_installer_dir/nvidia-* /tmp/scratchdir-gpu/usr/lib/systemd/system/
\cp $share_installer_dir/blacklist-nouveau.conf /tmp/scratchdir-gpu/usr/lib/modprobe.d/blacklist-nouveau.conf

cd /tmp/scratchdir-gpu/etc/yum.repos.d
mkdir rocky
mv Rocky* rocky/

dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils \
    elfutils-libelf-devel libglvnd-devel

dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

chmod +x $share_installer_dir/NVIDIA-Linux-x86_64-520.61.07.run
cd $share_installer_dir

$share_installer_dir/NVIDIA-Linux-x86_64-520.61.07.run --add-this-kernel -s

imgutil exec -v /install/installer:- /tmp/scratchdir

imgutil exec -v /install/installer:- /tmp/scratchdir-gpu

source /root/lico_env.local
share_installer_dir="/install/installer"

dnf module reset nginx
dnf module enable -y nginx:1.20

dnf install -y chrony

systemctl enable chronyd

echo "${sms_ip}:/home /home nfs nfsvers=4.0,nodev,nosuid,noatime 0 0" >> /etc/fstab
mkdir -p /home

mkdir -p $share_installer_dir
echo "${sms_ip}:/install/installer /install/installer nfs nfsvers=4.0,nodev,nosuid,noatime 0 0" >> /etc/fstab

mount -a

cp $share_installer_dir/ldap.conf /etc/openldap/ldap.conf
dnf install -y nss-pam-ldapd
cp $share_installer_dir/nslcd.conf /etc/nslcd.conf
chmod 600 /etc/nslcd.conf
systemctl enable nslcd
mkdir -p /usr/share/authselect/vendor/nslcd
tar -xzvf $share_installer_dir/authselect.tar.gz -C /usr/share/authselect/vendor/nslcd/
dnf install -y authselect
authselect select nslcd with-mkhomedir --force

dnf install -y icinga2
icinga2 node setup --master --disable-confd
echo -e "LANG=en_US.UTF-8" >> /etc/sysconfig/icinga2

dnf install -y ohpc-base-compute ohpc-slurm-client lmod-ohpc
# Optional:
echo 'account required pam_slurm.so' >> /etc/pam.d/sshd

cp $share_installer_dir/munge.key /etc/munge/munge.key
cp $share_installer_dir/cgroup.conf /etc/slurm/cgroup.conf
cp $share_installer_dir/slurm.conf /etc/slurm/slurm.conf
cp $share_installer_dir/gres.conf /etc/slurm/gres.conf
systemctl enable munge
systemctl enable slurmd

dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils \
    elfutils-libelf-devel libglvnd-devel

dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

echo "${sms_ip}:/opt/lico/pub /opt/lico/pub nfs nfsvers=4.0,nodev,noatime 0 0" >> /etc/fstab

mkdir -p /opt/lico/pub

mount -a

echo "${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=4.0,nodev,noatime 0 0" >> /etc/fstab

mkdir -p /opt/ohpc/pub

mount -a
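
The fstab lines above are appended with echo, so running the same commands a second time inside the image duplicates the entries. A sketch of an idempotent alternative follows. The function name add_nfs_mount is hypothetical; the mount options in the usage comment are the ones used above.

```shell
# Sketch: add an NFS entry to an fstab file unless the same line exists.
add_nfs_mount() (
    fstab="$1"
    server="$2"
    dir="$3"
    opts="$4"
    entry="${server}:${dir} ${dir} nfs ${opts} 0 0"
    grep -qxF -- "$entry" "$fstab" 2>/dev/null || echo "$entry" >> "$fstab"
)

# Typical use inside the image:
#   add_nfs_mount /etc/fstab "${sms_ip}" /home "nfsvers=4.0,nodev,nosuid,noatime"
#   add_nfs_mount /etc/fstab "${sms_ip}" /opt/ohpc/pub "nfsvers=4.0,nodev,noatime"
#   mkdir -p /home /opt/ohpc/pub
#   mount -a
```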

exit
imgutil pack /tmp/scratchdir/ rocky-8.6-diskless-slurm

exit
imgutil pack /tmp/scratchdir-gpu/ rocky-8.6-diskless-slurm-gpu

sms_name=head
icinga_api_port=5665
icinga2 pki save-cert --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --host ${sms_name}
nodename=$(uname -a | awk '{print $2}')
ticket=$(ssh $sms_name icinga2 pki ticket --cn $nodename)
icinga2 node setup --ticket ${ticket} --cn $nodename --endpoint ${sms_name} --zone $nodename --parent_zone master --parent_host ${sms_name} --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --accept-commands --accept-config --disable-confd
modprobe ipmi_devintf
systemctl start icinga2
systemctl enable icinga2

systemctl stop slurmd
share_installer_dir="/install/installer"
$share_installer_dir/NVIDIA-Linux-x86_64-520.61.07-custom.run -s
mkdir -p /var/run/nvidia-persistenced
systemctl daemon-reload
systemctl enable nvidia-persistenced --now
systemctl enable nvidia-modprobe-loader.service --now
systemctl restart slurmd

nodedeploy compute -n rocky-8.6-diskless-slurm

nodedeploy gpu -n rocky-8.6-diskless-slurm-gpu

imgutil unpack rocky-8.6-diskless-slurm /tmp/scratchdir-v2/

imgutil exec /tmp/scratchdir-v2/

imgutil pack /tmp/scratchdir-v2/ rocky-8.6-diskless-slurm-v2

cd /var/lib/confluent/public/os
cp rocky-8.6-diskless-slurm/profile.yaml rocky-8.6-diskless-slurm-v2/profile.yaml
cp rocky-8.6-diskless-slurm/scripts/onboot.d/* rocky-8.6-diskless-slurm-v2/scripts/onboot.d/
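
The final copy step can be wrapped so that it also works for later image versions and does not fail when the old profile has no onboot.d scripts. This is a sketch; the function name copy_profile_settings is hypothetical, and the base directory is passed in so it can be tried on a copy first.

```shell
# Sketch: carry profile.yaml and onboot.d scripts from one profile to another.
# Arguments: base directory, old profile name, new profile name.
copy_profile_settings() (
    base="$1"
    old="$2"
    new="$3"
    cp "$base/$old/profile.yaml" "$base/$new/profile.yaml" || exit 1
    mkdir -p "$base/$new/scripts/onboot.d" || exit 1
    for f in "$base/$old"/scripts/onboot.d/*; do
        [ -e "$f" ] || continue   # onboot.d may be empty or missing
        cp "$f" "$base/$new/scripts/onboot.d/" || exit 1
    done
)

# Typical use:
#   copy_profile_settings /var/lib/confluent/public/os \
#       rocky-8.6-diskless-slurm rocky-8.6-diskless-slurm-v2
```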