
NVIDIA GPU Driver Install on RHEL 9.5
Notes
The following procedure installs the NVIDIA GPU drivers on RHEL 9.5 on a single node. To scale it out, run the steps through nodeshell and add arguments (such as -y for dnf) so that they run unattended.
Register and subscribe a RHEL system to the Red Hat Customer Portal using Red Hat Subscription-Manager
Registration is required because some of the packages needed are only available to registered systems.
subscription-manager register --username <username> --password <password>
subscription-manager release --set=9.5
subscription-manager repos --enable codeready-builder-for-rhel-9-x86_64-rpms
dnf install -y wget
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
rpm -ivh epel-release-latest-9.noarch.rpm
dnf install dnf-plugin-config-manager
crb enable
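Before moving on, it can help to confirm that a repo is actually enabled. The following is a minimal sketch, not part of the original procedure. It assumes that `subscription-manager repos --list-enabled` prints one `Repo ID:` line per enabled repo. The function name repo_enabled is hypothetical.

```shell
# Hypothetical helper: succeeds if the given repo ID appears on a
# "Repo ID:" line in the text read from stdin.
# Assumes the "Repo ID:   <id>" layout of `subscription-manager repos --list-enabled`.
repo_enabled() {
    awk -v id="$1" '
        $1 == "Repo" && $2 == "ID:" && $3 == id { found = 1 }
        END { exit !found }
    '
}

# Example (run as root on the registered system):
#   subscription-manager repos --list-enabled \
#       | repo_enabled codeready-builder-for-rhel-9-x86_64-rpms \
#       || echo "CodeReady Builder repo is not enabled" >&2
```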
Prerequisites
DOCA must be installed before the GPU drivers so that the nvidia-peermem kernel module gets the right InfiniBand symbols.
Enable update repos
subscription-manager repos --enable=rhel-9-for-x86_64-baseos-rpms
subscription-manager repos --enable=rhel-9-for-x86_64-appstream-rpms
Install newer kernel packages
Note that these steps install a new version as opposed to replacing the older kernel version(s), so the system can be booted to the older kernel version(s) if needed.
dnf install kernel kernel-core kernel-modules-core kernel-modules
Install newer kernel devel packages
Note that these steps install a new version as opposed to replacing the older kernel version(s), so the system can be booted to the older kernel version(s) if needed.
dnf install kernel-devel kernel-devel-matched kernel-headers kernel-modules-extra
Update kernel tools and kernel abi stablelists
Note that this step will update the existing kernel tools packages in place, as multiple versions of these kernel tools cannot be installed on the system at the same time.
dnf install kernel-tools kernel-tools-libs kernel-abi-stablelists
Install dkms
dnf install dkms
Disable update repos
subscription-manager repos --disable=rhel-9-for-x86_64-appstream-rpms
subscription-manager repos --disable=rhel-9-for-x86_64-baseos-rpms
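For unattended runs, the enable/install/disable sequence above can be wrapped so that the update repos are disabled again even if an install step fails. This is a sketch under assumptions: the function name with_update_repos and the SUBMAN override variable are hypothetical and exist only to make the wrapper easy to dry-run.

```shell
# Hypothetical wrapper: enable the update repos, run the given command,
# then disable the repos again regardless of the command's exit status.
# SUBMAN is an assumed override so the wrapper can be dry-run (e.g. SUBMAN=echo).
with_update_repos() {
    subman=${SUBMAN:-subscription-manager}
    "$subman" repos \
        --enable=rhel-9-for-x86_64-baseos-rpms \
        --enable=rhel-9-for-x86_64-appstream-rpms || return 1
    "$@"
    rc=$?
    "$subman" repos \
        --disable=rhel-9-for-x86_64-appstream-rpms \
        --disable=rhel-9-for-x86_64-baseos-rpms
    # Report the wrapped command's status, not the disable step's.
    return "$rc"
}

# Example:
#   with_update_repos dnf install -y dkms
```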
Reboot the system using the newly installed kernel
reboot
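The DKMS build later in this procedure needs the kernel-devel package that matches the running kernel, so it may be worth checking this after the reboot. The sketch below compares a kernel release string against a list of installed kernel-devel packages. The function name has_matching_devel is hypothetical.

```shell
# Hypothetical helper: succeeds if a kernel-devel package matching the
# given kernel release is present in the given package list.
#   $1  kernel release, as printed by `uname -r`
#   $2  installed packages, one per line, as printed by `rpm -q kernel-devel`
has_matching_devel() {
    running=$1
    installed=$2
    # rpm -q prints name-version-release.arch, and uname -r prints
    # version-release.arch, so an exact whole-line match is expected.
    printf '%s\n' "$installed" | grep -qxF -- "kernel-devel-$running"
}

# Example:
#   has_matching_devel "$(uname -r)" "$(rpm -q kernel-devel)" \
#       || echo "no kernel-devel for the running kernel" >&2
```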
Install GPU driver local repo
rpm -ivh <path to nvidia local repo RPM>/nvidia-driver-local-repo-rhel9-570.124.06-1.0-1.x86_64.rpm
Clean DNF/YUM cache
dnf clean all
Back up grub.cfg from ESP
On an EFI-booted RHEL 9.5 system, /boot/efi/EFI/redhat/grub.cfg is a “stub” grub.cfg in the EFI system partition (ESP) that only points to /boot/grub2 for the full grub.cfg. When the nvidia-kmod-common package from the NVIDIA 570.124.06 local driver repo is installed, this stub is overwritten with a full grub.cfg, and the grub.cfg in /boot/grub2 is not updated. To work around this, back up the grub.cfg in the ESP. After the GPU drivers are installed, the stub will be restored and the grub.cfg file in /boot/grub2 fixed:
cp /boot/efi/EFI/redhat/grub.cfg ~/grub.cfg.PRE-NVIDIA
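To confirm that the file being backed up really is the stub, and later to detect whether it has been overwritten, a rough heuristic can be used. This sketch assumes that the stub is only a few lines long and chains to the full config with GRUB's configfile command. The 10-line threshold and the function name is_stub_grub_cfg are assumptions.

```shell
# Hypothetical heuristic: a stub grub.cfg is short and hands off to
# another config via a "configfile" line. The 10-line limit is an
# assumption, not a documented property of the file.
is_stub_grub_cfg() {
    cfg=$1
    [ -f "$cfg" ] || return 1
    lines=$(grep -c '' "$cfg")   # count all lines in the file
    [ "$lines" -le 10 ] && grep -q '^[ 	]*configfile[ 	]' "$cfg"
}

# Example:
#   is_stub_grub_cfg ~/grub.cfg.PRE-NVIDIA \
#       || echo "backup does not look like a stub grub.cfg" >&2
```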
Install GPU drivers
Note: The nvidia-fabric-manager package is only necessary on 8-GPU HGX configs.
dnf install nvidia-driver-cuda kmod-nvidia-open-dkms nvidia-fabric-manager
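When the same script is run on both HGX and non-HGX nodes, the package list can be built conditionally so that nvidia-fabric-manager is only requested where it is needed. This is a small sketch. The function name driver_packages and its yes/no argument are hypothetical, and deciding whether a node is an 8-GPU HGX system is left to the caller.

```shell
# Hypothetical helper: print the driver packages to install.
#   $1  "yes" if the node is an 8-GPU HGX system, anything else otherwise
driver_packages() {
    pkgs="nvidia-driver-cuda kmod-nvidia-open-dkms"
    if [ "${1:-no}" = "yes" ]; then
        pkgs="$pkgs nvidia-fabric-manager"
    fi
    printf '%s\n' "$pkgs"
}

# Example (the unquoted substitution is intentional, so that the list
# is split into separate package arguments):
#   dnf install -y $(driver_packages yes)
```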
Restore backed up stub grub.cfg and fix grub.cfg in /boot/grub2
cp -f /boot/efi/EFI/redhat/grub.cfg /boot/grub2
cp -f ~/grub.cfg.PRE-NVIDIA /boot/efi/EFI/redhat/grub.cfg
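Because the second copy overwrites the ESP file, running these two commands without a valid backup would leave the system without a usable stub. A guarded version of the same two copies might look like the sketch below. The function name restore_grub_cfg is hypothetical and the paths are passed in as arguments.

```shell
# Hypothetical guarded restore: performs the same two copies as above,
# but refuses to run unless the backup exists and is non-empty.
#   $1  backup of the stub grub.cfg
#   $2  grub.cfg in the ESP (currently holding the full config)
#   $3  grub.cfg under /boot/grub2
restore_grub_cfg() {
    backup=$1
    esp_cfg=$2
    grub2_cfg=$3
    if [ ! -s "$backup" ]; then
        echo "missing or empty backup: $backup" >&2
        return 1
    fi
    # Keep the full config written by the driver install, then put the stub back.
    cp -f "$esp_cfg" "$grub2_cfg" && cp -f "$backup" "$esp_cfg"
}

# Example:
#   restore_grub_cfg ~/grub.cfg.PRE-NVIDIA \
#       /boot/efi/EFI/redhat/grub.cfg /boot/grub2/grub.cfg
```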
Reboot
reboot
Check driver status
The following should show the correct driver version installed for the correct kernel version:
dkms status
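For scripted checks, the dkms status output can be filtered for the expected driver version and kernel. The sketch below matches on substrings rather than exact columns, because the separator between module name and version differs between dkms releases. It assumes that a successfully installed module is reported with a trailing ": installed". The function name dkms_installed_for is hypothetical.

```shell
# Hypothetical helper: reads `dkms status` output on stdin and succeeds
# if some line mentions the given driver version and kernel release and
# is marked as installed.
dkms_installed_for() {
    version=$1
    kernel=$2
    grep -F -- "$version" | grep -F -- "$kernel" | grep -q ': installed'
}

# Example:
#   dkms status | dkms_installed_for 570.124.06 "$(uname -r)" \
#       || echo "driver not installed for the running kernel" >&2
```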
The nvidia driver should be loaded and the nouveau driver should not be loaded:
lsmod | grep -i nvidia
lsmod | grep -i nouveau
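The grep checks above also match related modules (for example, nvidia_uvm matches "nvidia"). For an exact-name check in a script, the first column of the lsmod output can be compared directly, as in this sketch. The function name module_loaded is hypothetical.

```shell
# Hypothetical helper: reads lsmod output on stdin and succeeds only if
# a module with exactly the given name is listed in the first column.
module_loaded() {
    awk -v m="$1" '
        $1 == m { found = 1 }
        END { exit !found }
    '
}

# Example:
#   lsmod | module_loaded nvidia  || echo "nvidia is not loaded" >&2
#   lsmod | module_loaded nouveau && echo "nouveau is loaded" >&2
```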
Make sure the nouveau driver never loaded during the boot process:
dmesg | grep -i nouveau
Make sure the version of the nvidia driver running is correct:
cat /sys/module/nvidia/version
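In an unattended run, the version check can be turned into a pass/fail test against the expected version string. This is a minimal sketch. The function name driver_version_is is hypothetical, and the optional second argument exists only so the check can be pointed at a different file.

```shell
# Hypothetical helper: succeeds if the loaded driver's version matches.
#   $1  expected version, e.g. 570.124.06
#   $2  version file (defaults to the sysfs path used above)
driver_version_is() {
    expected=$1
    version_file=${2:-/sys/module/nvidia/version}
    [ -r "$version_file" ] && [ "$(cat "$version_file")" = "$expected" ]
}

# Example:
#   driver_version_is 570.124.06 || echo "unexpected driver version" >&2
```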
Start the nvidia-persistenced service and make sure it is running OK:
systemctl start nvidia-persistenced
systemctl status nvidia-persistenced
Start the nvidia-fabricmanager service and make sure it is running OK (this is only necessary on, and should only be run on, 8-GPU HGX systems):
systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
Check that nvidia-smi reports all expected GPUs:
nvidia-smi
Check the NVLINKs where applicable (all HGX systems, and systems with PCIe GPUs that have NVLINK bridge cards installed). All expected links should show up at the expected bandwidth:
nvidia-smi nvlink -s
Check the NVLINK fabric status (applicable to HGX 8-way configurations):
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
The output from that command should appear as follows:
GPU Fabric GUID : 0x7215545ecf79e88f
Inforom Version
Image Version : G525.0225.00.05
--
Fabric
State : Completed
Status : Success
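To check this result in a script rather than by eye, the filtered output can be parsed for the two expected values. The sketch below assumes the "key : value" layout shown above and treats only State "Completed" together with Status "Success" as healthy. The function name fabric_ok is hypothetical.

```shell
# Hypothetical helper: reads the filtered nvidia-smi output on stdin and
# succeeds only if it contains both "State : Completed" and
# "Status : Success".
fabric_ok() {
    awk -F: '
        {
            key = $1; val = $2
            # Trim leading and trailing blanks from both fields.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "State"  && val == "Completed" { state = 1 }
        key == "Status" && val == "Success"   { status = 1 }
        END { exit !(state && status) }
    '
}

# Example:
#   nvidia-smi -q -i 0 | grep -i -A 2 Fabric | fabric_ok \
#       || echo "NVLINK fabric is not ready on GPU 0" >&2
```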