Notes

The following procedure installs the NVIDIA GPU drivers on RHEL 9.5 on a single node. To scale out, run the steps through nodeshell, adding arguments as needed to make them unattended.

Register and subscribe a RHEL system to the Red Hat Customer Portal using Red Hat Subscription-Manager

Registration is required because some of the packages needed below are only available to registered systems.

 subscription-manager register --username <username> --password <password> 
 subscription-manager release --set=9.5
 subscription-manager repos --enable codeready-builder-for-rhel-9-x86_64-rpms
 dnf install -y wget
 wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
 rpm -ivh epel-release-latest-9.noarch.rpm
 dnf install dnf-plugin-config-manager
 crb enable
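
Before moving on, it can help to confirm that the CodeReady Builder repo really is enabled. The following is a minimal sketch, not part of the original procedure: repo_enabled is a hypothetical helper, and it assumes the listing prints one "Repo ID:   <id>" line per enabled repo.

```shell
# Sketch: succeed if the given repo ID appears in an enabled-repo listing.
# Assumption: the listing contains one "Repo ID:   <id>" line per repo.
repo_enabled() {
    # $1 = repo ID to look for; the listing is read from stdin
    awk -v id="$1" '$1 == "Repo" && $2 == "ID:" && $3 == id { found = 1 } END { exit !found }'
}

# Example (run as root on the registered system):
#   subscription-manager repos --list-enabled \
#       | repo_enabled codeready-builder-for-rhel-9-x86_64-rpms && echo "CRB enabled"
```

The helper only reads text from stdin, so it can be reused for the baseos and appstream repos later in this procedure.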

Prerequisites

DOCA must be installed before the GPU drivers so that the nvidia-peermem kernel module picks up the correct InfiniBand symbols.

Enable update repos

 subscription-manager repos --enable=rhel-9-for-x86_64-baseos-rpms
 subscription-manager repos --enable=rhel-9-for-x86_64-appstream-rpms

Install newer kernel packages

Note that these steps install a new version as opposed to replacing the older kernel version(s), so the system can be booted to the older kernel version(s) if needed.

 dnf install kernel kernel-core kernel-modules-core kernel-modules 

Install newer kernel devel packages

Note that these steps install a new version as opposed to replacing the older kernel version(s), so the system can be booted to the older kernel version(s) if needed.

 dnf install kernel-devel kernel-devel-matched kernel-headers kernel-modules-extra

Update kernel tools and kernel abi stablelists

Note that this step will update the existing kernel tools packages, as multiple versions of these kernel tools cannot be installed on the system at the same time.

 dnf install kernel-tools kernel-tools-libs kernel-abi-stablelists

Install dkms

 dnf install dkms

Disable update repos

 subscription-manager repos --disable=rhel-9-for-x86_64-appstream-rpms
 subscription-manager repos --disable=rhel-9-for-x86_64-baseos-rpms

Reboot the system using the newly installed kernel

 reboot
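
After the reboot, it is worth confirming that the system came up on the new kernel and that a kernel-devel package matching it is installed, since DKMS builds the driver module against the headers of the target kernel. A minimal sketch follows; kernel_devel_matches is a hypothetical helper, and the rpm query format in the example is an assumption about how the package versions line up with uname -r.

```shell
# Sketch: succeed if one of the installed kernel-devel versions equals the
# running kernel release.
# $1 = running kernel release (as printed by uname -r)
# stdin = installed kernel-devel versions, one VERSION-RELEASE.ARCH per line
kernel_devel_matches() {
    # -F fixed string, -x whole-line match, -q quiet (exit status only)
    grep -Fxq -- "$1"
}

# Example:
#   rpm -q kernel-devel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' \
#       | kernel_devel_matches "$(uname -r)" \
#       || echo "no kernel-devel installed for $(uname -r)"
```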

Install GPU driver local repo

 rpm -ivh <path to nvidia local repo RPM>/nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm

Clean DNF/YUM cache

 dnf clean all

Back up grub.cfg from ESP

When the nvidia-kmod-common package from the NVIDIA 570.124.06 local driver repo is installed on an EFI-booted RHEL 9.5 system, it overwrites /boot/efi/EFI/redhat/grub.cfg with a full grub.cfg and does not update the grub.cfg in /boot/grub2. The file in the EFI system partition (ESP) is normally a “stub” grub.cfg that only points to /boot/grub2 for the full grub.cfg. To work around this, back up the grub.cfg in the ESP (it will be restored, and the grub.cfg file in /boot/grub2 fixed, after installing the GPU drivers):

 cp /boot/efi/EFI/redhat/grub.cfg ~/grub.cfg.PRE-NVIDIA
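
To check that the file just backed up really is the stub (and not a full grub.cfg left over from an earlier driver install), a rough heuristic can be used. This is a sketch under stated assumptions: is_stub_grub_cfg is a hypothetical helper, and both the 10-line threshold and the presence of a "configfile" line in the stub are assumptions rather than documented properties.

```shell
# Heuristic sketch: treat a grub.cfg as the ESP "stub" if it is short and
# hands off to another config through a "configfile" line.
# Assumption: the 10-line threshold is arbitrary, not a documented limit.
is_stub_grub_cfg() {
    # $1 = path to a grub.cfg
    [ -f "$1" ] || return 1
    # Arithmetic expansion strips any padding that wc may emit
    lines=$(($(wc -l < "$1")))
    [ "$lines" -le 10 ] && grep -Eq '^[[:space:]]*configfile[[:space:]]' "$1"
}

# Example:
#   is_stub_grub_cfg ~/grub.cfg.PRE-NVIDIA || echo "backup does not look like a stub"
```

The same check can be run against the ESP copy after the restore step to confirm that the stub is back in place.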

Install GPU drivers

Note: The nvidia-fabric-manager package is only necessary on 8-GPU HGX configs.

 dnf install nvidia-driver-cuda kmod-nvidia-open-dkms nvidia-fabric-manager

Restore backed up stub grub.cfg and fix grub.cfg in /boot/grub2

 cp -f /boot/efi/EFI/redhat/grub.cfg /boot/grub2
 cp -f ~/grub.cfg.PRE-NVIDIA /boot/efi/EFI/redhat/grub.cfg

Reboot

 reboot

Check driver status

The following should show the correct driver version installed for the correct kernel version:

 dkms status
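
For unattended runs, the dkms status check can be turned into an exit status. The sketch below is illustrative only: nvidia_dkms_installed is a hypothetical helper, and it assumes each status line starts with the module name, contains the kernel release, and reports the state as ": installed". The separator after the module name differs between DKMS versions, so it is deliberately not matched.

```shell
# Sketch: succeed if the dkms status listing shows an nvidia module in the
# "installed" state for the given kernel release.
# Assumptions: the module name starts with "nvidia"; the line contains the
# kernel release and ": installed".
nvidia_dkms_installed() {
    # $1 = kernel release; dkms status output is read from stdin
    awk -v k="$1" '
        index($0, "nvidia") == 1 && index($0, k) > 0 && /: installed/ { ok = 1 }
        END { exit !ok }
    '
}

# Example:
#   dkms status | nvidia_dkms_installed "$(uname -r)" \
#       || echo "nvidia module not installed for $(uname -r)"
```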

The nvidia driver should be loaded and the nouveau driver should not be loaded:

 lsmod | grep -i nvidia
 lsmod | grep -i nouveau

Make sure the nouveau driver never loaded during the boot process:

 dmesg | grep -i nouveau
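
The grep checks above match any line containing the string, including related modules such as nvidia_uvm. When scripting these checks, an exact match on the module-name column is less ambiguous. A minimal sketch, with module_loaded as a hypothetical helper that assumes the module name is the first column of the lsmod output:

```shell
# Sketch: succeed if a module with exactly this name appears in lsmod output.
# Assumption: the module name is the first whitespace-separated column.
module_loaded() {
    # $1 = module name; lsmod output is read from stdin
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Example:
#   lsmod | module_loaded nvidia  || echo "nvidia is NOT loaded"
#   lsmod | module_loaded nouveau && echo "WARNING: nouveau is loaded"
```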

Make sure the version of the nvidia driver running is correct:

 cat /sys/module/nvidia/version
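
The version comparison can also be scripted. A small sketch, where driver_version_matches is a hypothetical helper and the expected version string is supplied by the caller:

```shell
# Sketch: succeed if the version recorded in a file equals the expected one.
driver_version_matches() {
    # $1 = expected driver version, $2 = file holding the running version
    [ -r "$2" ] && [ "$(cat "$2")" = "$1" ]
}

# Example:
#   driver_version_matches 570.124.06 /sys/module/nvidia/version \
#       || echo "unexpected driver version"
```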

Start the nvidia-persistenced service and make sure it's running OK:

 systemctl start nvidia-persistenced
 systemctl status nvidia-persistenced

Start the nvidia-fabricmanager service and make sure it's running OK (note: this is only necessary on, and should only be run on, 8-GPU HGX systems):

 systemctl start nvidia-fabricmanager
 systemctl status nvidia-fabricmanager

Check that nvidia-smi reports all expected GPUs, then check the NVLink status and the fabric state of GPU 0:

 nvidia-smi
 nvidia-smi nvlink -s
 nvidia-smi -q -i 0 | grep -i -A 2 Fabric

The output from the last command should look similar to the following (the GUID and image version will differ from system to system):

         GPU Fabric GUID                   : 0x7215545ecf79e88f
     Inforom Version
         Image Version                     : G525.0225.00.05
 --
     Fabric
         State                             : Completed
         Status                            : Success
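
The fabric check can be scripted by parsing the same output. The sketch below is illustrative only: fabric_ok is a hypothetical helper, and it assumes the layout shown above, that is, a line containing only "Fabric" followed by "State" and "Status" fields. Only the first State and Status after the Fabric header are read.

```shell
# Sketch: succeed if the Fabric section reports State "Completed" and
# Status "Success".
# Assumption: the section header is a line containing only "Fabric", and the
# first State/Status lines after it belong to that section.
fabric_ok() {
    # nvidia-smi -q output for one GPU is read from stdin
    awk '
        /^[[:space:]]*Fabric[[:space:]]*$/          { in_fabric = 1; next }
        in_fabric && state == ""  && $1 == "State"  { state = $NF }
        in_fabric && status == "" && $1 == "Status" { status = $NF }
        END { exit !(state == "Completed" && status == "Success") }
    '
}

# Example:
#   nvidia-smi -q -i 0 | fabric_ok || echo "GPU 0: fabric not ready"
```

To cover every GPU, the same call can be repeated with a different index passed to -i.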