This document is based on the Confluent image workflow: an OS image is built on the cluster head node and pushed to the compute nodes to deploy the cluster. The head node therefore needs to run both httpd and Nginx. Because httpd must keep the default HTTPS port 443, the HTTPS port of the Nginx service must be changed to another port.
2: In this scenario the login node and the management node are the same node.
Configure the memory lock limits and reboot:

```shell
echo '* soft memlock unlimited' >> /etc/security/limits.conf
echo '* hard memlock unlimited' >> /etc/security/limits.conf
reboot
```
Log in to the management node.
Create a new lico_env.local file according to Section 2 of the LiCO installation documentation.
Load the file:

```shell
chmod 600 lico_env.local
source lico_env.local
```
Create a directory for storing the ISO:

```shell
mkdir -p ${iso_path}
```
Download Rocky-8.6-x86_64-dvd1.iso and the CHECKSUM file from https://rockylinux.org/download.
Copy the files to ${iso_path}.
Verify that the checksum of the ISO file matches the value listed in CHECKSUM:

```shell
cd ${iso_path}
sha256sum Rocky-8.6-x86_64-dvd1.iso
cd ~
```
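Alternatively, `sha256sum -c` can do the comparison itself when pointed at a checksum file. The snippet below demonstrates the mechanism on a throwaway file; in the real workflow the CHECKSUM file from rockylinux.org plays the role of `demo.sha256` and the ISO plays the role of `demo.bin` (both names here are illustrative):

```shell
# Demonstration of checksum verification with sha256sum -c.
cd "$(mktemp -d)"
printf 'hello' > demo.bin
sha256sum demo.bin > demo.sha256   # record the expected digest
sha256sum -c demo.sha256           # prints "demo.bin: OK" on success
```

A non-zero exit status (and a `FAILED` line) indicates a corrupted or tampered download.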
Mount the ISO image:

```shell
mkdir -p ${os_repo_dir}
mount -o loop ${iso_path}/Rocky-8.6-x86_64-dvd1.iso ${os_repo_dir}
```
Configure the local repository and copy it to /etc/yum.repos.d/:

```shell
cat << eof > ${iso_path}/EL8-OS.repo
[AppStream]
name=appstream
baseurl=file://${os_repo_dir}/AppStream/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rockyofficial

[BaseOS]
name=baseos
baseurl=file://${os_repo_dir}/BaseOS/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rockyofficial
eof
cp -a ${iso_path}/EL8-OS.repo /etc/yum.repos.d/
```
Back up the original repository files and refresh the cache:

```shell
mkdir -p ${repo_backup_dir}
mv /etc/yum.repos.d/Rocky* ${repo_backup_dir}
dnf clean all
dnf makecache
```
Enable the NGINX web server:

```shell
dnf module reset nginx
dnf module enable -y nginx:1.20
```
Download the following package: https://hpc.lenovo.com/downloads/22b/confluent-3.5.0-2-el8.tar.xz
Upload the package to the /root directory.
Create the Confluent local repository:

```shell
dnf install -y bzip2 tar
mkdir -p $confluent_repo_dir
cd /root
tar -xvf confluent-3.5.0-2-el8.tar.xz -C $confluent_repo_dir
cd $confluent_repo_dir/lenovo-hpc-el8
./mklocalrepo.sh
cd ~
```
Install Lenovo Confluent:

```shell
dnf install -y lenovo-confluent tftp-server
systemctl enable confluent --now
systemctl enable tftp.socket --now
systemctl disable firewalld --now
systemctl enable httpd --now
```
Create a Confluent account:

```shell
source /etc/profile.d/confluent_env.sh
confetty create /users/<CONFLUENT_USERNAME> password=<CONFLUENT_PASSWORD> role=admin
```
Disable SELinux:

```shell
sed -i 's/enforcing/disabled/' /etc/selinux/config
setenforce 0
```
Ensure that the BMC username and password are the same on every node, then set the group-wide deployment attributes:

```shell
nodegroupattrib everything deployment.useinsecureprotocols=firmware \
console.method=ipmi dns.servers=$dns_server dns.domain=$domain_name \
net.ipv4_gateway=$ipv4_gateway net.ipv4_method="static"
```
Setting deployment.useinsecureprotocols=firmware enables PXE support (by default, HTTPS-only mode is the only allowed mode). console.method=ipmi may be skipped, but if specified it instructs Confluent to access the text console over IPMI, which enables the nodeconsole command.
While passwords and similar attributes may be specified the same way, it is recommended to use the -p argument to prompt for values, keeping them out of your command history. Note that if unspecified, the default root password behavior is to disable password-based login:
```shell
nodegroupattrib everything -p bmcuser bmcpass crypted.rootpassword
```
Define the management node (from the lico_env.local file) in Confluent:

```shell
nodegroupdefine all
nodegroupdefine compute
nodedefine $sms_name
nodeattrib $sms_name net.hwaddr=$sms_mac
nodeattrib $sms_name net.ipv4_address=$sms_ip
nodeattrib $sms_name hardwaremanagement.manager=$sms_bmc
```
Define the compute node configuration in Confluent:

```shell
for ((i=0; i<$num_computes; i++)); do
  nodedefine ${c_name[$i]};
  nodeattrib ${c_name[$i]} net.hwaddr=${c_mac[$i]};
  nodeattrib ${c_name[$i]} net.ipv4_address=${c_ip[$i]};
  nodeattrib ${c_name[$i]} hardwaremanagement.manager=${c_bmc[$i]};
  nodedefine ${c_name[$i]} groups=all,compute;
done
```
Set the nodes to boot from the network (PXE) by default:

```shell
for ((i=0; i<$num_computes; i++)); do
  nodeconfig ${c_name[$i]} bootorder.bootorder=Network
done
```
Append node information to /etc/hosts:

```shell
for node_name in $(nodelist); do
  noderun -n $node_name echo {net.ipv4_address} {node} {node}.{dns.domain} >> /etc/hosts
done
```
Install and start dnsmasq to make /etc/hosts available through DNS:

```shell
dnf install -y dnsmasq
systemctl enable dnsmasq --now
```
Set up the requirements for operating system deployment through the initialize sub-command of the osdeploy command. The -i parameter interactively prompts for the available options:

```shell
ssh-keygen -t ed25519
chown confluent /var/lib/confluent
osdeploy initialize -i
systemctl restart sshd
```
Import the OS image:

```shell
osdeploy import ${iso_path}/Rocky-8.6-x86_64-dvd1.iso
```
If you do not have any GPU nodes in the cluster, build a single image and skip any GPU-related commands below:

```shell
imgutil build -s rocky-8.6-x86_64 /tmp/scratchdir
```

If both GPU and non-GPU nodes are present, build two separate images:

```shell
imgutil build -s rocky-8.6-x86_64 /tmp/scratchdir
imgutil build -s rocky-8.6-x86_64 /tmp/scratchdir-gpu
```
Install and enable the NFS server:

```shell
dnf install -y nfs-utils
systemctl enable nfs-server --now
```
Configure and restart the httpd service:

```shell
cat << eof > /etc/httpd/conf.d/installer.conf
Alias /install /install
<Directory /install>
    AllowOverride None
    Require all granted
    Options +Indexes +FollowSymLinks
</Directory>
eof
systemctl restart httpd
```
Download the following package: https://hpc.lenovo.com/lico/downloads/7.0/Lenovo-OpenHPC-2.5.EL8.x86_64.tar
Upload the package to the /root directory.
Configure the local Lenovo OpenHPC repository:

```shell
mkdir -p $ohpc_repo_dir
cd /root
tar xvf Lenovo-OpenHPC-2.5.EL8.x86_64.tar -C $ohpc_repo_dir
rm -rf $link_ohpc_repo_dir
ln -s $ohpc_repo_dir $link_ohpc_repo_dir
$link_ohpc_repo_dir/make_repo.sh
```
Download the following package: https://hpc.lenovo.com/lico/downloads/7.0/lico-dep-7.0.0.el8.x86_64.tgz
Upload the package to the /root directory.
Configure the LiCO dependencies repository for the management node:

```shell
mkdir -p $lico_dep_repo_dir
cd /root
tar -xvf lico-dep-7.0.0.el8.x86_64.tgz -C $lico_dep_repo_dir
rm -rf $link_lico_dep_repo_dir
ln -s $lico_dep_repo_dir $link_lico_dep_repo_dir
$link_lico_dep_repo_dir/mklocalrepo.sh
```
Configure the local LiCO release repository for the management node:

```shell
mkdir -p $lico_repo_dir
tar zxvf lico-release-7.0.0.el8.x86_64.tar.gz -C $lico_repo_dir --strip-components 1
rm -rf $link_lico_repo_dir
ln -s $lico_repo_dir $link_lico_repo_dir
$link_lico_repo_dir/mklocalrepo.sh
```
Install the base package:

```shell
dnf install -y lenovo-ohpc-base
```

Install Slurm:

```shell
dnf install -y ohpc-slurm-server
```
Configure the user shared directory. The following steps describe how to create the user shared directory, taking /home as an example.
Export /home from the management node:

```shell
echo "/home *(rw,async,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -a
```

Export /opt/ohpc/pub from the management node for OpenHPC:

```shell
echo "/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)" >> /etc/exports
exportfs -a
```
Install Chrony:

```shell
dnf install -y chrony
```

Configure Chrony as described at https://chrony.tuxfamily.org/documentation.html, then enable chronyd:

```shell
systemctl enable chronyd --now
```
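As a minimal illustration of what the Chrony configuration might look like on the management node, the fragment below uses placeholder values (pool.ntp.org and the 192.168.0.0/24 subnet are examples; substitute your site's NTP sources and cluster network):

```
# Hypothetical example /etc/chrony.conf for the management node.
pool pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
# Serve time to the cluster network (adjust the subnet).
allow 192.168.0.0/24
local stratum 10
```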
Download slurm.conf from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/
Upload slurm.conf to /etc/slurm/, and modify the file as described in the installation guide.
Download cgroup.conf from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/
Upload cgroup.conf to /etc/slurm.
Create /etc/slurm/gres.conf and list the GPU resources of all nodes in the following format:

```
NodeName=c1 Name=gpu File=/dev/nvidia[0-1]
NodeName=c2 Name=gpu File=/dev/nvidia[0-2]
```
Start the services:

```shell
systemctl enable munge
systemctl enable slurmctld
systemctl restart munge
systemctl restart slurmctld
```
Install and configure Icinga2:

```shell
dnf install -y icinga2
dnf install -y nagios-plugins-ping
icinga2 api setup
icinga2 node setup --master --disable-confd
echo -e "LANG=en_US.UTF-8" >> /etc/sysconfig/icinga2
systemctl restart icinga2
```
Install three MPI modules (OpenMPI, MPICH, and MVAPICH) to the system:

```shell
dnf install -y openmpi4-gnu9-ohpc mpich-ofi-gnu9-ohpc mvapich2-gnu9-ohpc ucx-ib-ohpc
```

Set the default module.
To set the OpenMPI module as the default:

```shell
dnf install -y lmod-defaults-gnu9-openmpi4-ohpc
```

To set the MPICH module as the default:

```shell
dnf install -y lmod-defaults-gnu9-mpich-ofi-ohpc
```

To set the MVAPICH module as the default:

```shell
dnf install -y lmod-defaults-gnu9-mvapich2-ohpc
```

Install Singularity:

```shell
dnf install -y singularity-ohpc
```
Edit the file /opt/ohpc/pub/modulefiles/ohpc by adding the following line to the end of the module try-add block:

```shell
module try-add singularity
```

In the module del block, add the following as the first line:

```shell
module del singularity
```

Run the following command:

```shell
source /etc/profile.d/lmod.sh
```
Install RabbitMQ:

```shell
dnf install -y rabbitmq-server
```

Start the RabbitMQ service:

```shell
systemctl enable rabbitmq-server --now
```
Install MariaDB:

```shell
dnf install -y mariadb-server mariadb-devel
systemctl enable mariadb --now
```

Configure MariaDB for LiCO:

```shell
mysql
create database lico character set utf8 collate utf8_bin;
create user '<USERNAME>'@'%' identified by '<PASSWORD>';
grant ALL on lico.* to '<USERNAME>'@'%';
exit
```
Configure the MariaDB limits:

```shell
sed -i "/\[mysqld\]/a\max-connections=1024" /etc/my.cnf.d/mariadb-server.cnf
mkdir /usr/lib/systemd/system/mariadb.service.d
cat << eof > /usr/lib/systemd/system/mariadb.service.d/limits.conf
[Service]
LimitNOFILE=10000
eof
systemctl daemon-reload
systemctl restart mariadb
```
Install and configure InfluxDB:

```shell
dnf install -y influxdb
systemctl enable influxdb --now
influx
create database lico
use lico
create user <INFLUX_USERNAME> with password '<INFLUX_PASSWORD>' with all privileges
exit
sed -i '/# auth-enabled = false/a\ auth-enabled = true' /etc/influxdb/config.toml
systemctl restart influxdb
```
Install and configure OpenLDAP:
a. Download the openldap-servers package from the Rocky official website: https://download.rockylinux.org/pub/rocky/8/PowerTools/x86_64/os/Packages/o/openldap-servers-2.4.46-18.el8.x86_64.rpm
b. Upload the package to the LiCO management node.
c. Install openldap-servers:

```shell
dnf install -y openldap-servers-2.4.46-18.el8.x86_64.rpm
```

d. Install slapd-ssl-config:

```shell
dnf install -y slapd-ssl-config
```
Modify the configuration file:

```shell
sed -i "s/dc=hpc,dc=com/${lico_ldap_domain_name}/" /usr/share/openldap-servers/lico.ldif
sed -i "/dc:/s/hpc/${lico_ldap_domain_component}/" /usr/share/openldap-servers/lico.ldif
sed -i "s/dc=hpc,dc=com/${lico_ldap_domain_name}/" /etc/openldap/slapd.conf
slapadd -v -l /usr/share/openldap-servers/lico.ldif -f /etc/openldap/slapd.conf -b \
${lico_ldap_domain_name}
```
Obtain the OpenLDAP key:

```shell
slappasswd
```

Edit /etc/openldap/slapd.conf to set the root password to the key that was obtained:

```
rootpw <ENCRYPT_LDAP_PASSWORD>
```

Change the owner of the configuration files:

```shell
chown -R ldap:ldap /var/lib/ldap
chown ldap:ldap /etc/openldap/slapd.conf
```

Start the OpenLDAP service:

```shell
systemctl enable slapd --now
```

Verify that the service has started:

```shell
systemctl status slapd
```
The libuser module is a recommended toolkit for OpenLDAP. Installing this module is optional.
Install libuser:

```shell
dnf install -y libuser python3-libuser
```

Download libuser.conf from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/ to /etc on the management node, and modify the file as described in the installation guide. Then configure the LDAP client:

```shell
echo "TLS_REQCERT never" >> /etc/openldap/ldap.conf
```
Install nss-pam-ldapd:

```shell
dnf install -y nss-pam-ldapd
```

Download nslcd.conf from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/ and upload the file to /etc. Modify the configuration as described in the installation guide.
Modify the file permissions:

```shell
chmod 600 /etc/nslcd.conf
```

Start the nslcd service:

```shell
systemctl enable nslcd --now
```
Create the path for the configuration files:

```shell
mkdir -p /usr/share/authselect/vendor/nslcd
```

Download the configuration archive from https://hpc.lenovo.com/lico/downloads/7.0/examples/conf/authselect/authselect.tar.gz and upload it to /root.
Extract the archive:

```shell
tar -xzvf /root/authselect.tar.gz -C /usr/share/authselect/vendor/nslcd/
```

Enable the configuration:

```shell
authselect select nslcd with-mkhomedir --force
```
Install the LiCO modules:

```shell
dnf install -y python3-cffi
dnf install -y lico-core lico-file-manager lico-confluent-proxy \
lico-vnc-proxy lico-icinga-mond lico-async-task lico-service-tool
```
Configure the shared directory for LiCO:

```shell
mkdir -p /opt/lico/pub
touch /opt/lico/pub/DO_NOT_DELETE
echo "The file is required by lico monitor." >> /opt/lico/pub/DO_NOT_DELETE
echo "/opt/lico/pub *(ro,sync,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -a
```
Install the portal:

```shell
dnf install -y lico-workspace-skeleton lico-portal
```

Install the AI component:

```shell
dnf install -y lico-ai-scripts
```

(Optional) Provide e-mail, SMS, and WeChat services:

```shell
dnf install -y lico-mail-agent
dnf install -y lico-sms-agent
dnf install -y lico-wechat-agent
```
(Optional) Install the Icinga2 monitoring components:

```shell
dnf install -y lico-icinga-plugin-slurm
mkdir -p /etc/icinga2/zones.d/global-templates
echo -e "object CheckCommand \"lico_monitor\" {\n command = [ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
lico-icinga-plugin\" ]\n}" > /etc/icinga2/zones.d/global-templates/commands.conf
echo -e "object CheckCommand \"lico_job_monitor\" {\n command = [ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
lico-job-icinga-plugin\" ]\n}" >> /etc/icinga2/zones.d/global-templates/commands.conf
echo -e "object CheckCommand \"lico_check_procs\" {\n command =[ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
lico-process-icinga-plugin\" ]\n}" >>/etc/icinga2/zones.d/global-templates/commands.conf
echo -e "object CheckCommand \"lico_vnc_monitor\" {\n command =[ \"/opt/lico/pub/monitor/lico_icinga_plugin/\
lico-vnc-icinga-plugin\" ]\n}" >>/etc/icinga2/zones.d/global-templates/commands.conf
mkdir -p /etc/icinga2/zones.d/master
echo -e "object Host \"${sms_name}\" {\n check_command = \"hostalive\"\n \
address = \"${sms_ip}\"\n vars.agent_endpoint = name\n}\n" >> \
/etc/icinga2/zones.d/master/hosts.conf
for ((i=0;i<$num_computes;i++));do
echo -e "object Endpoint \"${c_name[${i}]}\" {\n host = \"${c_name[${i}]}\"\n \
port = \"${icinga_api_port}\"\n log_duration = 0\n}\nobject \
Zone \"${c_name[${i}]}\" {\n endpoints = [ \"${c_name[${i}]}\" ]\n \
parent = \"master\"\n}\n" >> /etc/icinga2/zones.d/master/agent.conf
echo -e "object Host \"${c_name[${i}]}\" {\n check_command = \"hostalive\"\n \
address = \"${c_ip[${i}]}\"\n vars.agent_endpoint = name\n}\n" >> \
/etc/icinga2/zones.d/master/hosts.conf
done
echo -e "apply Service \"lico\" {\n check_command = \"lico_monitor\"\n \
max_check_attempts = 5\n check_interval = 1m\n retry_interval = 30s\n assign \
where host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
command_endpoint = host.vars.agent_endpoint\n}\n" > \
/etc/icinga2/zones.d/master/service.conf
echo -e "apply Service \"lico-procs-service\" {\n check_command = \"lico_\
check_procs\"\n enable_active_checks = false\n assign where \
host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
command_endpoint = host.vars.agent_endpoint\n}\n" >> \
/etc/icinga2/zones.d/master/service.conf
echo -e "apply Service \"lico-job-service\" {\n check_command = \"lico_job_monitor\"\n \
max_check_attempts = 5\n check_interval = 1m\n retry_interval = 30s\n assign \
where host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
command_endpoint = host.vars.agent_endpoint\n}\n" >> \
/etc/icinga2/zones.d/master/service.conf
echo -e "apply Service \"lico-vnc-service\" {\n check_command = \"lico_vnc_monitor\"\n \
max_check_attempts = 5\n check_interval = 15s\n retry_interval = 30s\n assign \
where host.name == \"${sms_name}\"\n assign where host.vars.agent_endpoint\n \
command_endpoint = host.vars.agent_endpoint\n}\n" >> \
/etc/icinga2/zones.d/master/service.conf
chown -R icinga:icinga /etc/icinga2/zones.d/master
systemctl restart icinga2
modprobe ipmi_devintf
systemctl enable icinga2
```
Restart services:

```shell
systemctl restart confluent
```
Note: The username and password of Icinga2 can be viewed and changed in /etc/icinga2/conf.d/api-users.conf.
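For reference, an entry in that file typically looks like the following; the password shown is a placeholder for the value generated by `icinga2 api setup`:

```
object ApiUser "root" {
  password = "<GENERATED_PASSWORD>"
  permissions = [ "*" ]
}
```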
Configure and initialize LiCO:

```shell
cd /etc/lico
\cp gres.csv.example gres.csv
\cp nodes.csv.example nodes.csv
vim nodes.csv
lico-password-tool
mkdir -p /tmp/scratchdir/var/lib/lico/tool
cp /var/lib/lico/tool/.db /tmp/scratchdir/var/lib/lico/tool/
cd lico.ini.d/
sed -i s/false/true/ user.ini
lico init
sed -i s/80/8080/g /etc/nginx/nginx.conf
sed -i s/443/444/ /etc/nginx/conf.d/https.conf
luseradd hpcadmin -P Passw0rd@123
lico import_user -u hpcadmin -r admin
lico-service-tool start
lico-service-tool enable
```
Prepare the repositories and the shared installer directory for the compute nodes:

```shell
share_installer_dir="/install/installer"
mkdir -p $share_installer_dir
echo "/install/installer *(rw,async,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -a
cp /etc/hosts $share_installer_dir
cp /etc/security/limits.conf $share_installer_dir
cp /etc/yum.repos.d/EL8-OS.repo $share_installer_dir
sed -i '/^baseurl=/d' $share_installer_dir/EL8-OS.repo
sed -i "/name=appstream/a\baseurl=http://${sms_name}${os_repo_dir}/AppStream/" \
$share_installer_dir/EL8-OS.repo
sed -i "/name=baseos/a\baseurl=http://${sms_name}${os_repo_dir}/BaseOS/" \
$share_installer_dir/EL8-OS.repo
cp /etc/yum.repos.d/lenovo-hpc.repo $share_installer_dir
sed -i '/^baseurl=/d' $share_installer_dir/lenovo-hpc.repo
sed -i '/^gpgkey=/d' $share_installer_dir/lenovo-hpc.repo
echo "baseurl=http://${sms_name}${confluent_repo_dir}/lenovo-hpc-el8" \
>> $share_installer_dir/lenovo-hpc.repo
echo "gpgkey=http://${sms_name}${confluent_repo_dir}/lenovo-hpc-el8\
/lenovohpckey.pub" >> $share_installer_dir/lenovo-hpc.repo
cp /etc/yum.repos.d/Lenovo.OpenHPC.local.repo $share_installer_dir
sed -i '/^baseurl=/d' $share_installer_dir/Lenovo.OpenHPC.local.repo
sed -i '/^gpgkey=/d' $share_installer_dir/Lenovo.OpenHPC.local.repo
echo "baseurl=http://${sms_name}${link_ohpc_repo_dir}/EL_8" \
>> $share_installer_dir/Lenovo.OpenHPC.local.repo
echo "gpgkey=http://${sms_name}${link_ohpc_repo_dir}/EL_8\
/repodata/repomd.xml.key" >> $share_installer_dir/Lenovo.OpenHPC.local.repo
cp /etc/yum.repos.d/lico-dep.repo $share_installer_dir
sed -i '/^baseurl=/d' $share_installer_dir/lico-dep.repo
sed -i '/^gpgkey=/d' $share_installer_dir/lico-dep.repo
sed -i "/name=lico-dep-local-library/a\baseurl=http://${sms_name}\
${link_lico_dep_repo_dir}/library/" $share_installer_dir/lico-dep.repo
sed -i "/name=lico-dep-local-library/a\gpgkey=http://${sms_name}\
${link_lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL8" $share_installer_dir/lico-dep.repo
sed -i "/name=lico-dep-local-standalone/a\baseurl=http://${sms_name}\
${link_lico_dep_repo_dir}/standalone/" $share_installer_dir/lico-dep.repo
sed -i "/name=lico-dep-local-standalone/a\gpgkey=http://${sms_name}\
${link_lico_dep_repo_dir}/RPM-GPG-KEY-LICO-DEP-EL8" $share_installer_dir/lico-dep.repo
cp /etc/yum.repos.d/lico-release.repo $share_installer_dir
sed -i '/baseurl=/d' $share_installer_dir/lico-release.repo
sed -i "/name=lico-release-host/a\baseurl=http://${sms_name}\
${link_lico_repo_dir}/host/" $share_installer_dir/lico-release.repo
sed -i "/name=lico-release-public/a\baseurl=http://${sms_name}\
${link_lico_repo_dir}/public/" $share_installer_dir/lico-release.repo
```
Configure automatic start for the GPU driver (for the GPU image).
Download NVIDIA-Linux-x86_64-520.61.07.run from https://us.download.nvidia.com/tesla/520.61.07/NVIDIA-Linux-x86_64-520.61.07.run and copy it to the shared directory $share_installer_dir.

```shell
cat << eof > $share_installer_dir/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
After=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced/*
TimeoutSec=300

[Install]
WantedBy=multi-user.target
eof
cat << eof > $share_installer_dir/nvidia-modprobe-loader.service
[Unit]
Description=NVIDIA ModProbe Service
After=syslog.target
Before=slurmd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-modprobe -u -c=0
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
eof
cat << eof > $share_installer_dir/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
eof
```
3. Slurm config files

```shell
cp /etc/slurm/slurm.conf $share_installer_dir/slurm.conf
cp /etc/slurm/cgroup.conf $share_installer_dir/cgroup.conf
cp /etc/slurm/gres.conf $share_installer_dir/gres.conf
cp /etc/munge/munge.key $share_installer_dir
```
4. LDAP config files

```shell
cp /etc/openldap/ldap.conf $share_installer_dir
cp /etc/nslcd.conf $share_installer_dir/nslcd.conf
```
5. Authselect

```shell
cp /root/authselect.tar.gz $share_installer_dir
```
6. Synchronize the files to the image and clean up the original files

```shell
\cp ~/lico_env.local /tmp/scratchdir/root/
\cp $share_installer_dir/hosts /tmp/scratchdir/etc/hosts
\cp $share_installer_dir/limits.conf /tmp/scratchdir/etc/security/limits.conf
\cp $share_installer_dir/EL8-OS.repo /tmp/scratchdir/etc/yum.repos.d/
\cp $share_installer_dir/Lenovo.OpenHPC.local.repo /tmp/scratchdir/etc/yum.repos.d/
echo -e "%_excludedocs 1" >> /tmp/scratchdir/root/.rpmmacros
\cp $share_installer_dir/lico-dep.repo /tmp/scratchdir/etc/yum.repos.d/
\cp $share_installer_dir/lico-release.repo /tmp/scratchdir/etc/yum.repos.d/
cd /tmp/scratchdir/etc/yum.repos.d
mkdir rocky
mv Rocky* rocky/
```
For the GPU image do the following:

```shell
\cp ~/lico_env.local /tmp/scratchdir-gpu/root/
\cp $share_installer_dir/hosts /tmp/scratchdir-gpu/etc/hosts
\cp $share_installer_dir/limits.conf /tmp/scratchdir-gpu/etc/security/limits.conf
\cp $share_installer_dir/EL8-OS.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
\cp $share_installer_dir/Lenovo.OpenHPC.local.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
echo -e "%_excludedocs 1" >> /tmp/scratchdir-gpu/root/.rpmmacros
\cp $share_installer_dir/lico-dep.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
\cp $share_installer_dir/lico-release.repo /tmp/scratchdir-gpu/etc/yum.repos.d/
\cp $share_installer_dir/nvidia-* /tmp/scratchdir-gpu/usr/lib/systemd/system/
\cp $share_installer_dir/blacklist-nouveau.conf /tmp/scratchdir-gpu/usr/lib/modprobe.d/blacklist-nouveau.conf
cd /tmp/scratchdir-gpu/etc/yum.repos.d
mkdir rocky
mv Rocky* rocky/
```
### Prepare NVIDIA drivers

```shell
dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils \
    elfutils-libelf-devel libglvnd-devel
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
chmod +x $share_installer_dir/NVIDIA-Linux-x86_64-520.61.07.run
cd $share_installer_dir
$share_installer_dir/NVIDIA-Linux-x86_64-520.61.07.run --add-this-kernel -s
```
### Enter the image

For the non-GPU image:

```shell
imgutil exec -v /install/installer:- /tmp/scratchdir
```

If you are building the GPU image:

```shell
imgutil exec -v /install/installer:- /tmp/scratchdir-gpu
```
### Set local environment variables

```shell
source /root/lico_env.local
share_installer_dir="/install/installer"
```
### Start nginx service

```shell
dnf module reset nginx
dnf module enable -y nginx:1.20
```
### Install Chrony

1. Install Chrony:

```shell
dnf install -y chrony
```

2. Edit /etc/chrony.conf to configure chrony.
3. Set the service to start automatically at boot:

```shell
systemctl enable chronyd
```
### Configure NFS

```shell
echo "${sms_ip}:/home /home nfs nfsvers=4.0,nodev,nosuid,noatime 0 0" >> /etc/fstab
mkdir -p /home
mkdir -p $share_installer_dir
echo "${sms_ip}:/install/installer /install/installer nfs nfsvers=4.0,nodev,nosuid,noatime 0 0" >> /etc/fstab
mount -a
```
### Install OpenLDAP

```shell
cp $share_installer_dir/ldap.conf /etc/openldap/ldap.conf
dnf install -y nss-pam-ldapd
cp $share_installer_dir/nslcd.conf /etc/nslcd.conf
chmod 600 /etc/nslcd.conf
systemctl enable nslcd
mkdir -p /usr/share/authselect/vendor/nslcd
tar -xzvf $share_installer_dir/authselect.tar.gz -C /usr/share/authselect/vendor/nslcd/
dnf install -y authselect
authselect select nslcd with-mkhomedir --force
```
### Install Icinga2

```shell
dnf install -y icinga2
icinga2 node setup --master --disable-confd
echo -e "LANG=en_US.UTF-8" >> /etc/sysconfig/icinga2
```
### Install and configure Slurm

```shell
dnf install -y ohpc-base-compute ohpc-slurm-client lmod-ohpc
echo 'account required pam_slurm.so' >> /etc/pam.d/sshd   # optional
cp $share_installer_dir/munge.key /etc/munge/munge.key
cp $share_installer_dir/cgroup.conf /etc/slurm/cgroup.conf
cp $share_installer_dir/slurm.conf /etc/slurm/slurm.conf
cp $share_installer_dir/gres.conf /etc/slurm/gres.conf
systemctl enable munge
systemctl enable slurmd
```
### Add kernel headers (for GPU image)

```shell
dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils \
    elfutils-libelf-devel libglvnd-devel
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
```
### Mount LiCO monitor components

```shell
echo "${sms_ip}:/opt/lico/pub /opt/lico/pub nfs nfsvers=4.0,nodev,noatime 0 0" >> /etc/fstab
mkdir -p /opt/lico/pub
mount -a
```
### Mount OHPC directory

```shell
echo "${sms_ip}:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=4.0,nodev,noatime 0 0" >> /etc/fstab
mkdir -p /opt/ohpc/pub
mount -a
```
### Exit and pack the image

```shell
exit
imgutil pack /tmp/scratchdir/ rocky-8.6-diskless-slurm
```

or, for the GPU image:

```shell
exit
imgutil pack /tmp/scratchdir-gpu/ rocky-8.6-diskless-slurm-gpu
```
### Add Startup Scripts

#### Install Icinga script

Create and edit:
/var/lib/confluent/public/os/rocky-8.6-diskless-slurm/scripts/onboot.d/icinga.sh
or
/var/lib/confluent/public/os/rocky-8.6-diskless-slurm-gpu/scripts/onboot.d/icinga.sh

```shell
sms_name=head
icinga_api_port=5665
icinga2 pki save-cert --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --host ${sms_name}
nodename=$(uname -a | awk '{print $2}')
ticket=$(ssh $sms_name icinga2 pki ticket --cn $nodename)
icinga2 node setup --ticket ${ticket} --cn $nodename --endpoint ${sms_name} --zone $nodename --parent_zone master --parent_host ${sms_name} --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --accept-commands --accept-config --disable-confd
modprobe ipmi_devintf
systemctl start icinga2
systemctl enable icinga2
```

Note: The hostname in this script and the hostname defined in lico_env.local must be consistent.
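To guard against that mismatch, a small check can compare the running hostname against the names defined for the cluster. The `check_hostname` helper below is illustrative only (not part of LiCO or Confluent); in practice you would pass `"$sms_name" "${c_name[@]}"` from lico_env.local as the candidate names:

```shell
# Illustrative helper: succeed only if the first argument (the actual
# hostname) matches one of the remaining arguments (the defined names).
check_hostname() {
  local actual="$1"; shift
  local n
  for n in "$@"; do
    if [ "$actual" = "$n" ]; then
      echo "ok: $actual is defined"
      return 0
    fi
  done
  echo "mismatch: $actual is not defined" >&2
  return 1
}

# Example usage with placeholder names:
check_hostname "c1" "head" "c1" "c2"   # prints "ok: c1 is defined"
```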
#### Install GPU drivers script

For the GPU image, also add the following script: /var/lib/confluent/public/os/rocky-8.6-diskless-slurm-gpu/scripts/onboot.d/gpu-drivers.sh

```shell
systemctl stop slurmd
share_installer_dir="/install/installer"
$share_installer_dir/NVIDIA-Linux-x86_64-520.61.07-custom.run -s
mkdir -p /var/run/nvidia-persistenced
systemctl daemon-reload
systemctl enable nvidia-persistenced --now
systemctl enable nvidia-modprobe-loader.service --now
systemctl restart slurmd
```
### Deploy the nodes

```shell
nodedeploy compute -n rocky-8.6-diskless-slurm
```

For GPU nodes:

```shell
nodedeploy gpu -n rocky-8.6-diskless-slurm-gpu
```

Monitor on the XCC until the deployment is finished.
## Modify an image that has been already packaged

1. Unpack the image to the specified directory:

```shell
imgutil unpack rocky-8.6-diskless-slurm /tmp/scratchdir-v2/
```

2. Enter the image to make modifications:

```shell
imgutil exec /tmp/scratchdir-v2/
```

3. Pack the image.
**Note: the new image cannot have the same name as existing images**

```shell
imgutil pack /tmp/scratchdir-v2/ rocky-8.6-diskless-slurm-v2
```

4. Copy the profile.yaml and onboot scripts of the previous image as required:

```shell
cd /var/lib/confluent/public/os
cp rocky-8.6-diskless-slurm/profile.yaml rocky-8.6-diskless-slurm-v2/profile.yaml
cp rocky-8.6-diskless-slurm/scripts/onboot.d/* rocky-8.6-diskless-slurm-v2/scripts/onboot.d/
```