Diskless Installation Guide

1: This document is based on the confluent image workflow: an image is built on the cluster head node and pushed to the compute nodes to deploy the cluster. The head node therefore needs to run both httpd and Nginx; httpd must keep the default HTTPS port 443, so the HTTPS port of the Nginx service must be changed to another port.

2: In this scenario, the login node and the management node are the same node.

Head node deployment

Configuration and Preparation

Configure the memory lock limits:

echo '* soft memlock unlimited' >> /etc/security/limits.conf
echo '* hard memlock unlimited' >> /etc/security/limits.conf
reboot

Configuring environment variables

  1. Log in to the management node

  2. Create a new lico_env.local file according to section 2 of the LiCO installation documentation

  3. Reload the file:

    chmod 600 lico_env.local
    source lico_env.local
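The exact contents of lico_env.local come from the LiCO installation documentation. As orientation only, the variables referenced throughout this guide look roughly like the sketch below; every value is a placeholder and must be replaced with your own cluster's data:

```shell
# Illustrative lico_env.local sketch -- all values are examples only
sms_name="head"                       # head/management node hostname
sms_ip="192.168.0.1"                  # head node IP on the cluster network
sms_mac="aa:bb:cc:dd:ee:01"           # head node MAC address
sms_bmc="10.0.0.1"                    # head node BMC address
num_computes=2                        # number of compute nodes
c_name=( "c1" "c2" )                  # compute node hostnames
c_ip=( "192.168.0.11" "192.168.0.12" )
c_mac=( "aa:bb:cc:dd:ee:11" "aa:bb:cc:dd:ee:12" )
c_bmc=( "10.0.0.11" "10.0.0.12" )
dns_server="192.168.0.1"
domain_name="hpc.example.com"
ipv4_gateway="192.168.0.254"
iso_path="/isos"                      # where the Rocky ISO is stored
os_repo_dir="/install/custom/rocky8.6"
repo_backup_dir="/install/custom/repo_backup"
confluent_repo_dir="/install/custom/confluent"
ohpc_repo_dir="/install/custom/ohpc"
link_ohpc_repo_dir="/install/custom/link_ohpc"
lico_dep_repo_dir="/install/custom/licodep"
link_lico_dep_repo_dir="/install/custom/link_licodep"
lico_repo_dir="/install/custom/lico"
link_lico_repo_dir="/install/custom/link_lico"
icinga_api_port=5665
lico_ldap_domain_name="dc=hpc,dc=com"
lico_ldap_domain_component="hpc"
```

The directory layout above is illustrative; what matters is that every variable used by the later commands is defined before sourcing the file.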

Create OS local repository

  1. Create a directory for storing the ISO:
mkdir -p ${iso_path}
  2. Download Rocky-8.6-x86_64-dvd1.iso and the CHECKSUM file from: https://rockylinux.org/download

  3. Copy both files to ${iso_path}.

  4. Validate that the checksum of the ISO file matches the value listed in CHECKSUM:

cd ${iso_path}
sha256sum Rocky-8.6-x86_64-dvd1.iso
sha256sum -c CHECKSUM --ignore-missing
cd ~
  5. Mount the ISO image:
mkdir -p ${os_repo_dir}
mount -o loop ${iso_path}/Rocky-8.6-x86_64-dvd1.iso ${os_repo_dir}
  6. Configure the local repository and copy it to /etc/yum.repos.d/:
cat << eof > ${iso_path}/EL8-OS.repo
[AppStream]
name=appstream
baseurl=file://${os_repo_dir}/AppStream/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rockyofficial
[BaseOS]
name=baseos
baseurl=file://${os_repo_dir}/BaseOS/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rockyofficial
eof

cp -a ${iso_path}/EL8-OS.repo /etc/yum.repos.d/
  7. Back up the original repository files:
mkdir -p ${repo_backup_dir}
mv /etc/yum.repos.d/Rocky* ${repo_backup_dir}
dnf clean all
dnf makecache
  8. Enable the NGINX web server:
dnf module reset nginx
dnf module enable -y nginx:1.20

Install Lenovo Confluent

  1. Download the following package: https://hpc.lenovo.com/downloads/22b/confluent-3.5.0-2-el8.tar.xz

  2. Upload the package to the /root directory.

  3. Create confluent local repository:

dnf install -y bzip2 tar
mkdir -p $confluent_repo_dir
cd /root
tar -xvf confluent-3.5.0-2-el8.tar.xz -C $confluent_repo_dir
cd $confluent_repo_dir/lenovo-hpc-el8
./mklocalrepo.sh
cd ~
  4. Install Lenovo Confluent:
dnf install -y lenovo-confluent tftp-server
systemctl enable confluent --now
systemctl enable tftp.socket --now
systemctl disable firewalld --now
systemctl enable httpd --now
  5. Create a confluent account:
source /etc/profile.d/confluent_env.sh
confetty create /users/<CONFLUENT_USERNAME> password=<CONFLUENT_PASSWORD> role=admin
  6. Disable SELinux:
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
setenforce 0

Configure confluent to prepare for compute node deployment

Ensure that the BMC user name and password are consistent across all nodes.

nodegroupattrib everything deployment.useinsecureprotocols=firmware \
console.method=ipmi dns.servers=$dns_server dns.domain=$domain_name \
net.ipv4_gateway=$ipv4_gateway net.ipv4_method="static"

Setting deployment.useinsecureprotocols=firmware enables PXE support (HTTPS-only is the default mode). console.method=ipmi may be skipped, but if specified it instructs confluent to use IPMI to access the text console, enabling the nodeconsole command.

While passwords and similar attributes may be specified the same way, it is recommended to use the -p argument to prompt for values, keeping them out of your command history. Note that if unspecified, the default root password behavior is to disable password-based login:

nodegroupattrib everything -p bmcuser bmcpass crypted.rootpassword

Define nodes in confluent

  1. Define the management node from the lico_env.local file to confluent:
nodegroupdefine all
nodegroupdefine compute
nodedefine $sms_name
nodeattrib $sms_name net.hwaddr=$sms_mac
nodeattrib $sms_name net.ipv4_address=$sms_ip
nodeattrib $sms_name hardwaremanagement.manager=$sms_bmc
  2. Define the compute node configuration to confluent:
for ((i=0; i<$num_computes; i++)); do
nodedefine ${c_name[$i]};
nodeattrib ${c_name[$i]} net.hwaddr=${c_mac[$i]};
nodeattrib ${c_name[$i]} net.ipv4_address=${c_ip[$i]};
nodeattrib ${c_name[$i]} hardwaremanagement.manager=${c_bmc[$i]};
nodedefine ${c_name[$i]} groups=all,compute;
done
  3. Set the nodes to boot from the network (PXE) by default:

    for ((i=0; i<$num_computes; i++)); do
    nodeconfig ${c_name[$i]} bootorder.bootorder=Network
    done

Prepare name resolution

  1. Append node information to /etc/hosts:
for node_name in $(nodelist); do
noderun -n $node_name echo {net.ipv4_address} {node} {node}.{dns.domain} >> /etc/hosts
done
  2. Install and start dnsmasq, making /etc/hosts available through DNS:
dnf install -y dnsmasq
systemctl enable dnsmasq --now

Initialize confluent operating system deployment

Users can set up requirements for operating system deployment through the initialize subcommand of the osdeploy command. The -i parameter interactively prompts for the available options:

ssh-keygen -t ed25519
chown confluent /var/lib/confluent
osdeploy initialize -i
systemctl restart sshd

Import the install media:

osdeploy import ${iso_path}/Rocky-8.6-x86_64-dvd1.iso
Build image directory

If you don't have any GPU nodes in the cluster, build a single image and skip any GPU-related commands below.

imgutil build -s rocky-8.6-x86_64 /tmp/scratchdir

If both GPU and non-GPU nodes are present, you will need to build two separate images:

imgutil build -s rocky-8.6-x86_64 /tmp/scratchdir
imgutil build -s rocky-8.6-x86_64 /tmp/scratchdir-gpu

Install OHPC

Define shared folders

dnf install -y nfs-utils
systemctl enable nfs-server --now

Enable repositories for other nodes

Enable httpd services

cat << eof > /etc/httpd/conf.d/installer.conf
Alias /install /install
<Directory /install>
AllowOverride None
Require all granted
Options +Indexes +FollowSymLinks
</Directory>
eof

systemctl restart httpd

Configure Lenovo OpenHPC repositories

  1. Download the following package: https://hpc.lenovo.com/lico/downloads/7.0/Lenovo-OpenHPC-2.5.EL8.x86_64.tar

  2. Upload the package to the /root directory.

  3. Configure the local Lenovo OpenHPC repository:

mkdir -p $ohpc_repo_dir
cd /root
tar xvf Lenovo-OpenHPC-2.5.EL8.x86_64.tar -C $ohpc_repo_dir
rm -rf $link_ohpc_repo_dir
ln -s $ohpc_repo_dir $link_ohpc_repo_dir
$link_ohpc_repo_dir/make_repo.sh

Configure the LiCO dependencies repositories

  1. Download the following package: https://hpc.lenovo.com/lico/downloads/7.0/lico-dep-7.0.0.el8.x86_64.tgz

  2. Upload the package to the /root directory.

  3. Configure the repository for the management node:

mkdir -p $lico_dep_repo_dir
cd /root
tar -xvf lico-dep-7.0.0.el8.x86_64.tgz -C $lico_dep_repo_dir
rm -rf $link_lico_dep_repo_dir
ln -s $lico_dep_repo_dir $link_lico_dep_repo_dir
$link_lico_dep_repo_dir/mklocalrepo.sh

Obtain the LiCO installation package

  1. Obtain the LiCO 7.0.0 release package for EL8 lico-release-7.0.0.el8.tar.gz and the LiCO license file from: https://commercial.lenovo.com/cn/en/signin

  2. Upload the release package to the management node.

Configure the local repository for LiCO

  1. Configure the local repository for the management node:

    mkdir -p $lico_repo_dir
    tar zxvf lico-release-7.0.0.el8.x86_64.tar.gz -C $lico_repo_dir --strip-components 1
    rm -rf $link_lico_repo_dir
    ln -s $lico_repo_dir $link_lico_repo_dir
    $link_lico_repo_dir/mklocalrepo.sh

Install Slurm

  1. Install the base package:

    dnf install -y lenovo-ohpc-base
  2. Install Slurm:

    dnf install -y ohpc-slurm-server

Configure NFS

Configure the user shared directory

The following steps describe how to create the user shared directory, taking /home as an example.

Share /home from the management node:

echo "/home *(rw,async,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -a

Configure shared directory for OpenHPC

Share /opt/ohpc/pub from the management node for OpenHPC:

echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
exportfs -a

Configure Chrony

  1. Install Chrony:

    dnf install -y chrony
  2. Configure Chrony as described in the documentation: https://chrony.tuxfamily.org/documentation.html

  3. Enable chronyd:

    systemctl enable chronyd --now
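As orientation for step 2, a minimal head-node /etc/chrony.conf might contain lines like the following; the pool and the allowed subnet are placeholders for your site's values:

```
# Illustrative /etc/chrony.conf fragment for the head node
pool 2.rocky.pool.ntp.org iburst   # upstream time source (a site NTP server is preferable)
allow 192.168.0.0/24               # let compute nodes on the cluster network sync from this node
local stratum 10                   # keep serving time even if upstream is unreachable
```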

Configure Slurm

  1. Download slurm.conf from the following web site: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/

  2. Upload slurm.conf to /etc/slurm/, and modify it as described in the installation guide.

  3. Download cgroup.conf from the following web site: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/

  4. Upload cgroup.conf to /etc/slurm.

  5. Create /etc/slurm/gres.conf and define the GPU resources of all nodes in the following format:

    NodeName=c1 Name=gpu File=/dev/nvidia[0-1]
    NodeName=c2 Name=gpu File=/dev/nvidia[0-2]
  6. Start service:

    systemctl enable munge
    systemctl enable slurmctld
    systemctl restart munge
    systemctl restart slurmctld
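For reference, the node and partition definitions added to slurm.conf in step 2 typically take the following form. Hostnames, CPU counts, and GPU counts below are placeholders that must match your hardware (the GPU counts here mirror the gres.conf example above):

```
# Illustrative slurm.conf fragment -- adjust names and counts to your cluster
GresTypes=gpu
NodeName=c1 CPUs=32 Gres=gpu:2 State=UNKNOWN
NodeName=c2 CPUs=32 Gres=gpu:3 State=UNKNOWN
PartitionName=compute Nodes=c1,c2 Default=YES MaxTime=INFINITE State=UP
```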

Install Icinga2

dnf install -y icinga2
dnf install -y nagios-plugins-ping
icinga2 api setup
icinga2 node setup --master --disable-confd
echo -e "LANG=en_US.UTF-8" >> /etc/sysconfig/icinga2
systemctl restart icinga2

Install MPI

  1. Install the three MPI modules (OpenMPI, MPICH, and MVAPICH):

    dnf install -y openmpi4-gnu9-ohpc mpich-ofi-gnu9-ohpc mvapich2-gnu9-ohpc ucx-ib-ohpc
  2. Set the default module.

    Set OpenMPI module as the default:

    dnf install -y lmod-defaults-gnu9-openmpi4-ohpc

    Set the MPICH module as the default:

    dnf install -y lmod-defaults-gnu9-mpich-ofi-ohpc

    Set the MVAPICH module as the default:

    dnf install -y lmod-defaults-gnu9-mvapich2-ohpc

Install Singularity

  1. Install Singularity:

    dnf install -y singularity-ohpc
  2. Edit the file /opt/ohpc/pub/modulefiles/ohpc by adding the following content to the end of the module try-add block:

    module try-add singularity
  3. In the module del block, add the following content as the first line:

    module del singularity
  4. Run the following command:

    source /etc/profile.d/lmod.sh

Install the LiCO dependencies

Install RabbitMQ

  1. Install RabbitMQ:

    dnf install -y rabbitmq-server
  2. Start RabbitMQ service:

    systemctl enable rabbitmq-server --now

Install MariaDB

  1. Install MariaDB:

    dnf install -y mariadb-server mariadb-devel
    systemctl enable mariadb --now
  2. Configure MariaDB for LiCO:

    mysql
    create database lico character set utf8 collate utf8_bin;
    create user '<USERNAME>'@'%' identified by '<PASSWORD>';
    grant ALL on lico.* to '<USERNAME>'@'%';
    exit
  3. Configure the MariaDB limits:

    sed -i "/\[mysqld\]/a\max-connections=1024" /etc/my.cnf.d/mariadb-server.cnf
    mkdir /usr/lib/systemd/system/mariadb.service.d
    cat << eof > /usr/lib/systemd/system/mariadb.service.d/limits.conf
    [Service]
    LimitNOFILE=10000
    eof
    systemctl daemon-reload
    systemctl restart mariadb

Install InfluxDB

dnf install -y influxdb
systemctl enable influxdb --now
influx
create database lico
use lico
create user <INFLUX_USERNAME> with password '<INFLUX_PASSWORD>' with all privileges
exit
sed -i '/# auth-enabled = false/a\ auth-enabled = true' /etc/influxdb/config.toml
systemctl restart influxdb

Configure user authentication

Install OpenLDAP
  1. Install and configure OpenLDAP

    1. Download the openldap-servers package from the official Rocky website:
      https://download.rockylinux.org/pub/rocky/8/PowerTools/x86_64/os/Packages/o/openldap-servers-2.4.46-18.el8.x86_64.rpm

    2. Upload the package to the LiCO management node

    3. Install openldap-servers:

    dnf install -y openldap-servers-2.4.46-18.el8.x86_64.rpm

    4. Install slapd-ssl-config:

    dnf install -y slapd-ssl-config
  2. Modify the configuration file:

    sed -i "s/dc=hpc,dc=com/${lico_ldap_domain_name}/" /usr/share/openldap-servers/lico.ldif
    sed -i "/dc:/s/hpc/${lico_ldap_domain_component}/" /usr/share/openldap-servers/lico.ldif
    sed -i "s/dc=hpc,dc=com/${lico_ldap_domain_name}/" /etc/openldap/slapd.conf
    slapadd -v -l /usr/share/openldap-servers/lico.ldif -f /etc/openldap/slapd.conf -b \
    ${lico_ldap_domain_name}
  3. Obtain the OpenLDAP key:

    slappasswd
  4. Edit /etc/openldap/slapd.conf to set the root password to the key that was obtained:

    rootpw <ENCRYPT_LDAP_PASSWORD>
  5. Change the owner of the configuration file:

    chown -R ldap:ldap /var/lib/ldap
    chown ldap:ldap /etc/openldap/slapd.conf
  6. Start the OpenLDAP service:

    systemctl enable slapd --now
  7. Verify that the service has been started:

    systemctl status slapd

Install libuser


The libuser module is a recommended toolkit for OpenLDAP. The installation of this module is optional.

  1. Install libuser:

    dnf install -y libuser python3-libuser
  2. Download libuser.conf from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/ to /etc on the management node, and modify it as described in the installation guide.
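As a rough orientation only (the authoritative settings are in the installation guide), the LDAP-related parts of /etc/libuser.conf usually end up looking like the fragment below, with the server and DNs replaced by your own values:

```
# Illustrative libuser.conf fragment -- placeholders, not site values
[defaults]
modules = ldap
create_modules = ldap

[ldap]
server = ldap://head
basedn = dc=hpc,dc=com
binddn = cn=Manager,dc=hpc,dc=com
```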

Install the OpenLDAP client

echo "TLS_REQCERT never" >> /etc/openldap/ldap.conf

Install nss-pam-ldapd

  1. Install nss-pam-ldapd:

    dnf install -y nss-pam-ldapd
  2. Download nslcd.conf from: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/

  3. Upload the file to /etc and modify the configuration as described in the installation guide.

  4. Modify file permissions:

    chmod 600 /etc/nslcd.conf
  5. Start the nslcd service:

    systemctl enable nslcd --now

Configure authselect-nslcd-config

  1. Create the path for the configuration file:

    mkdir -p /usr/share/authselect/vendor/nslcd
  2. Download configuration files from: https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/authselect/authselect.tar.gz

  3. Upload the configuration files to /root

  4. Extract the archive:

    tar -xzvf /root/authselect.tar.gz -C /usr/share/authselect/vendor/nslcd/
  5. Enable the configuration:

    authselect select nslcd with-mkhomedir --force

Install LiCO

  1. Install the LiCO modules:

    dnf install -y python3-cffi
    dnf install -y lico-core lico-file-manager lico-confluent-proxy \
    lico-vnc-proxy lico-icinga-mond lico-async-task lico-service-tool
  2. Configure the shared directory for LiCO:

    mkdir -p /opt/lico/pub
    touch /opt/lico/pub/DO_NOT_DELETE
    echo "The file is required by lico monitor." >> /opt/lico/pub/DO_NOT_DELETE
    echo "/opt/lico/pub *(ro,sync,no_subtree_check,no_root_squash)" >> /etc/exports
    exportfs -a
  3. Install the portal:

    dnf install -y lico-workspace-skeleton lico-portal
  4. Install the AI component:

    dnf install -y lico-ai-scripts
  5. (Optional) Provide e-mail, SMS, and WeChat services:

    dnf install -y lico-mail-agent
    dnf install -y lico-sms-agent
    dnf install -y lico-wechat-agent
  6. (Optional) Install Icinga2 monitoring components:

    dnf install -y lico-icinga-plugin-slurm
    
    mkdir -p /etc/icinga2/zones.d/global-templates
    
    echo -e "object CheckCommand \"lico_monitor\" {\n command = [ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
    lico-icinga-plugin\" ]\n}" > /etc/icinga2/zones.d/global-templates/commands.conf
    
    echo -e "object CheckCommand \"lico_job_monitor\" {\n command = [ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
    lico-job-icinga-plugin\" ]\n}" >> /etc/icinga2/zones.d/global-templates/commands.conf
    
    echo -e "object CheckCommand \"lico_check_procs\" {\n command =[ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
    lico-process-icinga-plugin\" ]\n}" >>/etc/icinga2/zones.d/global-templates/commands.conf
    
    echo -e "object CheckCommand \"lico_vnc_monitor\" {\n command =[ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
    lico-vnc-icinga-plugin\" ]\n}" >>/etc/icinga2/zones.d/global-templates/commands.conf
    
    mkdir -p /etc/icinga2/zones.d/master
    
    echo -e "object Host \"${sms_name}\" {\n check_command = \"hostalive\"\n \
    address = \"${sms_ip}\"\n vars.agent_endpoint = name\n}\n" >> \
    /etc/icinga2/zones.d/master/hosts.conf
    
    for ((i=0;i<$num_computes;i++));do
    echo -e "object Endpoint \"${c_name[${i}]}\" {\n host = \"${c_name[${i}]}\"\n \
    port = \"${icinga_api_port}\"\n log_duration = 0\n}\nobject \
    Zone \"${c_name[${i}]}\" {\n endpoints = [ \"${c_name[${i}]}\" ]\n \
    parent = \"master\"\n}\n" >> /etc/icinga2/zones.d/master/agent.conf
    echo -e "object Host \"${c_name[${i}]}\" {\n check_command = \"hostalive\"\n \
    address = \"${c_ip[${i}]}\"\n vars.agent_endpoint = name\n}\n" >> \
    /etc/icinga2/zones.d/master/hosts.conf
    done
    
    echo -e "apply Service \"lico\" {\n check_command = \"lico_monitor\"\n \
    max_check_attempts = 5\n check_interval = 1m\n retry_interval = 30s\n assign \
    where host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
    command_endpoint = host.vars.agent_endpoint\n}\n" > \
    /etc/icinga2/zones.d/master/service.conf
    
    echo -e "apply Service \"lico-procs-service\" {\n check_command = \"lico_\
    check_procs\"\n enable_active_checks = false\n assign where \
    host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
    command_endpoint = host.vars.agent_endpoint\n}\n" >> \
    /etc/icinga2/zones.d/master/service.conf
    
    echo -e "apply Service \"lico-job-service\" {\n check_command = \"lico_job_monitor\"\n \
    max_check_attempts = 5\n check_interval = 1m\n retry_interval = 30s\n assign \
    where host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
    command_endpoint = host.vars.agent_endpoint\n}\n" >> \
    /etc/icinga2/zones.d/master/service.conf
    
    echo -e "apply Service \"lico-vnc-service\" {\n check_command = \"lico_vnc_monitor\"\n \
    max_check_attempts = 5\n check_interval = 15s\n retry_interval = 30s\n assign \
    where host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
    command_endpoint = host.vars.agent_endpoint\n}\n" >> \
    /etc/icinga2/zones.d/master/service.conf
    
    chown -R icinga:icinga /etc/icinga2/zones.d/master
    systemctl restart icinga2
    
    modprobe ipmi_devintf
    systemctl enable icinga2
  7. Restart services:

    systemctl restart confluent

Configure LiCO and start service

Note: The Icinga2 username and password can be viewed and changed in /etc/icinga2/conf.d/api-users.conf (generated by icinga2 api setup).

cd /etc/lico
\cp gres.csv.example gres.csv
\cp nodes.csv.example nodes.csv
vim nodes.csv

lico-password-tool

mkdir -p /tmp/scratchdir/var/lib/lico/tool
cp /var/lib/lico/tool/.db /tmp/scratchdir/var/lib/lico/tool/

cd lico.ini.d/
sed -i s/false/true/ user.ini
lico init
sed -i s/80/8080/g /etc/nginx/nginx.conf
sed -i s/443/444/ /etc/nginx/conf.d/https.conf

luseradd hpcadmin -P Passw0rd@123
lico import_user -u hpcadmin -r admin

lico-service-tool start
lico-service-tool enable

Compute node deployment

Prepare the files that need to be put into the image

  1. Repos

    share_installer_dir="/install/installer"
    mkdir -p $share_installer_dir
    
    echo "/install/installer *(rw,async,no_subtree_check,no_root_squash)" >> /etc/exports
    exportfs -a
    
    cp /etc/hosts $share_installer_dir
    cp /etc/security/limits.conf $share_installer_dir
    
    cp /etc/yum.repos.d/EL8-OS.repo $share_installer_dir
    sed -i '/^baseurl=/d' $share_installer_dir/EL8-OS.repo
    
    sed -i "/name=appstream/a\baseurl=http://${sms_name}${os_repo_dir}/AppStream/" \
    $share_installer_dir/EL8-OS.repo
    
    sed -i "/name=baseos/a\baseurl=http://${sms_name}${os_repo_dir}/BaseOS/" \
    $share_installer_dir/EL8-OS.repo
    
    cp /etc/yum.repos.d/lenovo-hpc.repo $share_installer_dir
    sed -i '/^baseurl=/d' $share_installer_dir/lenovo-hpc.repo
    sed -i '/^gpgkey=/d' $share_installer_dir/lenovo-hpc.repo
    
    echo "baseurl=http://${sms_name}${confluent_repo_dir}/lenovo-hpc-el8" \
    >> $share_installer_dir/lenovo-hpc.repo
    
    echo "gpgkey=http://${sms_name}${confluent_repo_dir}/lenovo-hpc-el8\
    /lenovohpckey.pub" >> $share_installer_dir/lenovo-hpc.repo
    
    cp /etc/yum.repos.d/Lenovo.OpenHPC.local.repo $share_installer_dir
    sed -i '/^baseurl=/d' $share_installer_dir/Lenovo.OpenHPC.local.repo
    sed -i '/^gpgkey=/d' $share_installer_dir/Lenovo.OpenHPC.local.repo
    
    echo "baseurl=http://${sms_name}${link_ohpc_repo_dir}/EL_8" \
    >> $share_installer_dir/Lenovo.OpenHPC.local.repo
    
    echo "gpgkey=http://${sms_name}${link_ohpc_repo_dir}/EL_8\
    /repodata/repomd.xml.key" >> $share_installer_dir/Lenovo.OpenHPC.local.repo
    
    cp /etc/yum.repos.d/lico-dep.repo $share_installer_dir
    sed -i '/^baseurl=/d' $share_installer_dir/lico-dep.repo
    sed -i '/^gpgkey=/d' $share_installer_dir/lico-dep.repo
    
    sed -i "/name=lico-dep-local-library/a\baseurl=http://${sms_name}\
    ${link_lico_dep_repo_dir}/library/" $share_installer_dir/lico-dep.repo
    
    sed -i "/name=lico-dep-local-library/a\gpgkey=http://${sms_name}\
    ${link_lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL8" $share_installer_dir/lico-dep.repo
    
    sed -i "/name=lico-dep-local-standalone/a\baseurl=http://${sms_name}\
    ${link_lico_dep_repo_dir}/standalone/" $share_installer_dir/lico-dep.repo
    
    sed -i "/name=lico-dep-local-standalone/a\gpgkey=http://${sms_name}\
    ${link_lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL8" $share_installer_dir/lico-dep.repo
    
    cp /etc/yum.repos.d/lico-release.repo $share_installer_dir
    
    sed -i '/baseurl=/d' $share_installer_dir/lico-release.repo
    
    sed -i "/name=lico-release-host/a\baseurl=http://${sms_name}\
    ${link_lico_repo_dir}/host/" $share_installer_dir/lico-release.repo
    
    sed -i "/name=lico-release-public/a\baseurl=http://${sms_name}\
    ${link_lico_repo_dir}/public/" $share_installer_dir/lico-release.repo
  2. Configure automatic start for the GPU driver (for GPU image)

    Download NVIDIA-Linux-x86_64-520.61.07.run from https://us.download.nvidia.com/tesla/520.61.07/NVIDIA-Linux-x86_64-520.61.07.run and copy it to the shared directory $share_installer_dir:

    cat << eof > $share_installer_dir/nvidia-persistenced.service
    [Unit]
    Description=NVIDIA Persistence Daemon
    After=syslog.target
    [Service]
    Type=forking
    PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
    Restart=always
    ExecStart=/usr/bin/nvidia-persistenced --verbose
    ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/*
    TimeoutSec=300
    [Install]
    WantedBy=multi-user.target
    eof
    
    cat << eof > $share_installer_dir/nvidia-modprobe-loader.service
    [Unit]
    Description=NVIDIA ModProbe Service
    After=syslog.target
    Before=slurmd.service
    [Service]
    Type=oneshot
    ExecStart=/usr/bin/nvidia-modprobe -u -c=0
    RemainAfterExit=yes
    [Install]
    WantedBy=multi-user.target
    eof
    
    
    cat << eof > $share_installer_dir/blacklist-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0
    eof
    
  3. Slurm config file

    cp /etc/slurm/slurm.conf $share_installer_dir/slurm.conf
    cp /etc/slurm/cgroup.conf $share_installer_dir/cgroup.conf
    cp /etc/slurm/gres.conf $share_installer_dir/gres.conf
    cp /etc/munge/munge.key $share_installer_dir
  4. LDAP config file

    cp /etc/openldap/ldap.conf $share_installer_dir
    cp /etc/nslcd.conf $share_installer_dir/nslcd.conf
  5. Authselect

    cp /root/authselect.tar.gz $share_installer_dir
  6. Synchronize the files to the image and move the original repo files aside:

    \cp ~/lico_env.local /tmp/scratchdir/root/
    \cp $share_installer_dir/hosts /tmp/scratchdir/etc/hosts
    \cp $share_installer_dir/limits.conf /tmp/scratchdir/etc/security/limits.conf
    \cp $share_installer_dir/EL8-OS.repo /tmp/scratchdir/etc/yum.repos.d/
    \cp $share_installer_dir/Lenovo.OpenHPC.local.repo /tmp/scratchdir/etc/yum.repos.d/
    echo -e %_excludedocs 1 >> /tmp/scratchdir/root/.rpmmacros
    \cp $share_installer_dir/lico-dep.repo /tmp/scratchdir/etc/yum.repos.d/
    \cp $share_installer_dir/lico-release.repo /tmp/scratchdir/etc/yum.repos.d/
    
    cd /tmp/scratchdir/etc/yum.repos.d
    mkdir rocky
    mv Rocky* rocky/

    For the GPU image, do the following:

    \cp ~/lico_env.local /tmp/scratchdir-gpu/root/
    \cp $share_installer_dir/hosts /tmp/scratchdir-gpu/etc/hosts
    \cp $share_installer_dir/limits.conf /tmp/scratchdir-gpu/etc/security/limits.conf
    \cp $share_installer_dir/EL8-OS.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
    \cp $share_installer_dir/Lenovo.OpenHPC.local.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
    echo -e %_excludedocs 1 >> /tmp/scratchdir-gpu/root/.rpmmacros
    \cp $share_installer_dir/lico-dep.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
    \cp $share_installer_dir/lico-release.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
    \cp $share_installer_dir/nvidia-* /tmp/scratchdir-gpu/usr/lib/systemd/system/
    \cp $share_installer_dir/blacklist-nouveau.conf /tmp/scratchdir-gpu/usr/lib/modprobe.d/blacklist-nouveau.conf
    
    cd /tmp/scratchdir-gpu/etc/yum.repos.d
    mkdir rocky
    mv Rocky* rocky/

Prepare NVIDIA drivers

dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils \
elfutils-libelf-devel libglvnd-devel

dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

chmod +x $share_installer_dir/NVIDIA-Linux-x86_64-520.61.07.run
cd $share_installer_dir

$share_installer_dir/NVIDIA-Linux-x86_64-520.61.07.run --add-this-kernel -s

Enter the image

For the non-GPU image:

imgutil exec -v /install/installer:- /tmp/scratchdir

If you are building the GPU image:

imgutil exec -v /install/installer:- /tmp/scratchdir-gpu

Set local environment variables

source /root/lico_env.local
share_installer_dir="/install/installer"

Enable the Nginx module

dnf module reset nginx
dnf module enable -y nginx:1.20

Install Chrony

  1. Install Chrony:

    dnf install -y chrony
  2. Edit /etc/chrony.conf to configure Chrony.

  3. Set chronyd to start automatically at boot:

    systemctl enable chronyd
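Inside the image, chrony usually only needs to follow the head node. Assuming the head node hostname used elsewhere in this guide (a placeholder here), /etc/chrony.conf can be reduced to something like:

```
# Illustrative compute-node /etc/chrony.conf -- replace "head" with your ${sms_name}
server head iburst
```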

Configure NFS

echo "${sms_ip}:/home /home nfs nfsvers=4.0,nodev,nosuid,noatime 0 0" >> /etc/fstab
mkdir -p /home

mkdir -p $share_installer_dir
echo "${sms_ip}:/install/installer /install/installer nfs nfsvers=4.0,nodev,nosuid,noatime 0 0" >> /etc/fstab

mount -a

Install the OpenLDAP client

cp $share_installer_dir/ldap.conf /etc/openldap/ldap.conf
dnf install -y nss-pam-ldapd
cp $share_installer_dir/nslcd.conf /etc/nslcd.conf
chmod 600 /etc/nslcd.conf
systemctl enable nslcd
mkdir -p /usr/share/authselect/vendor/nslcd
tar -xzvf $share_installer_dir/authselect.tar.gz -C /usr/share/authselect/vendor/nslcd/
dnf install -y authselect
authselect select nslcd with-mkhomedir --force

Install Icinga2

dnf install -y icinga2
icinga2 node setup --master --disable-confd
echo -e "LANG=en_US.UTF-8" >> /etc/sysconfig/icinga2

Install and configure Slurm

dnf install -y ohpc-base-compute ohpc-slurm-client lmod-ohpc
# Optional: restrict SSH logins to users who have a running Slurm job on the node
echo 'account required pam_slurm.so' >> /etc/pam.d/sshd

cp $share_installer_dir/munge.key /etc/munge/munge.key
cp $share_installer_dir/cgroup.conf /etc/slurm/cgroup.conf
cp $share_installer_dir/slurm.conf /etc/slurm/slurm.conf
cp $share_installer_dir/gres.conf /etc/slurm/gres.conf
systemctl enable munge
systemctl enable slurmd

Add kernel headers (for GPU image)

dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils \
elfutils-libelf-devel libglvnd-devel

dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

Mount LiCO monitor components

echo "${sms_ip}:/opt/lico/pub /opt/lico/pub nfs nfsvers=4.0,nodev,noatime 0 0" >> /etc/fstab

mkdir -p /opt/lico/pub

mount -a

Mount OHPC directory

echo "${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=4.0,nodev,noatime 0 0" >> /etc/fstab

mkdir -p /opt/ohpc/pub

mount -a

Exit and pack the image

exit
imgutil pack /tmp/scratchdir/ rocky-8.6-diskless-slurm

or

exit
imgutil pack /tmp/scratchdir/ rocky-8.6-diskless-slurm-gpu

Add Startup Scripts

Install Icinga script

Create and edit: /var/lib/confluent/public/os/rocky-8.6-diskless-slurm/scripts/onboot.d/icinga.sh

or

/var/lib/confluent/public/os/rocky-8.6-diskless-slurm-gpu/scripts/onboot.d/icinga.sh

sms_name=head
icinga_api_port=5665
icinga2 pki save-cert --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --host ${sms_name}
nodename=`uname -a |awk '{print $2}'`
ticket=`ssh $sms_name icinga2 pki ticket --cn $nodename`
icinga2 node setup --ticket ${ticket} --cn $nodename  --endpoint ${sms_name} --zone $nodename --parent_zone master --parent_host ${sms_name} --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --accept-commands --accept-config --disable-confd
modprobe ipmi_devintf
systemctl start icinga2
systemctl enable icinga2

Note: The hostname in this script and the hostname defined in lico_env.local must be consistent.

Install GPU drivers script

For GPU image also add the following script: /var/lib/confluent/public/os/rocky-8.6-diskless-slurm-gpu/scripts/onboot.d/gpu-drivers.sh

systemctl stop slurmd
share_installer_dir="/install/installer"
$share_installer_dir/NVIDIA-Linux-x86_64-520.61.07-custom.run -s
mkdir -p /var/run/nvidia-persistenced
systemctl daemon-reload
systemctl enable nvidia-persistenced --now
systemctl enable nvidia-modprobe-loader.service --now
systemctl restart slurmd

Deploy the nodes

nodedeploy compute -n rocky-8.6-diskless-slurm

For GPU nodes:

nodedeploy gpu -n rocky-8.6-diskless-slurm-gpu

Monitor the deployment on the XCC until it is finished.

Modify an image that has been already packaged

  1. Unpack the image to a specified directory:

    imgutil unpack rocky-8.6-diskless-slurm /tmp/scratchdir-v2/
  2. Enter the image to make modifications:

    imgutil exec /tmp/scratchdir-v2/
  3. Pack the image:

    Note: the new image cannot have the same name as existing images

    imgutil pack /tmp/scratchdir-v2/ rocky-8.6-diskless-slurm-v2
  4. Copy the profile.yaml and onboot scripts of the previous image as required:

    cd /var/lib/confluent/public/os
    cp rocky-8.6-diskless-slurm/profile.yaml rocky-8.6-diskless-slurm-v2/profile.yaml
    cp rocky-8.6-diskless-slurm/scripts/onboot.d/* rocky-8.6-diskless-slurm-v2/scripts/onboot.d/