Installing and Upgrading CUDA Driver in an HPC System


Step 1: Check Current NVIDIA Driver Version

Before installing or upgrading the driver, check the existing version:

nvidia-smi

This will display the installed driver version and GPU details.
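For scripting, nvidia-smi can also emit just the version string (`nvidia-smi --query-gpu=driver_version --format=csv,noheader`). The snippet below is a minimal sketch of comparing that against a required minimum; the value of `cur` here is a sample stand-in for the live query output:

```shell
# Sketch: compare the reported driver version against a required minimum
# using version-aware sort. "cur" is a sample value; in a real script, set
# it from: nvidia-smi --query-gpu=driver_version --format=csv,noheader
min="550.54.15"
cur="550.90.07"
if [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]; then
    echo "driver $cur satisfies minimum $min"
else
    echo "driver $cur is older than $min; upgrade required"
fi
```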


Step 2: Download the Latest NVIDIA Driver

Get the latest driver from NVIDIA's official website.

Alternatively, use wget:

wget https://us.download.nvidia.com/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run

Step 3: Blacklist Nouveau (if needed)

If Nouveau is active, disable it:

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" > /etc/modprobe.d/nouveau.conf
dracut --force   # rebuild the initramfs so Nouveau stays out at boot
reboot

After the reboot, confirm that Nouveau is disabled (the command should produce no output):

lsmod | grep nouveau

Step 4: Stop Display Manager (If Applicable)

If running a GUI, stop the display manager:


systemctl stop gdm      # For GNOME
systemctl stop lightdm  # For LightDM
systemctl stop sddm     # For KDE

Step 5: Install the Driver

Run the installer:

chmod +x NVIDIA-Linux-x86_64-550.54.15.run
./NVIDIA-Linux-x86_64-550.54.15.run

Follow the on-screen prompts and accept the DKMS registration so the kernel module is rebuilt automatically after kernel updates.
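On compute nodes without a console, the same installer can be driven non-interactively; `--silent` and `--dkms` are standard flags of the NVIDIA .run installer (listed in its `--help` output). A guarded sketch:

```shell
# Non-interactive install sketch: --silent skips the interactive UI,
# --dkms registers the module with DKMS for automatic rebuilds.
installer="NVIDIA-Linux-x86_64-550.54.15.run"
if [ -x "./$installer" ]; then
    sudo "./$installer" --silent --dkms
else
    echo "installer $installer not found or not executable"
fi
```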


Step 6: Verify Installation

Reboot the system:

reboot

Check the driver:


nvidia-smi

If everything is fine, you should see the updated driver version.
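A quick scripted sanity check (a sketch; it reads /proc/modules, which exists on any Linux system) confirms the kernel module actually loaded before trusting nvidia-smi output:

```shell
# Check whether the main nvidia kernel module is loaded.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
    echo "nvidia kernel module loaded"
else
    echo "nvidia kernel module NOT loaded - check dmesg and reinstall"
fi
```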



Upgrading CUDA Driver


To upgrade, first remove the existing driver. If it was installed with the .run installer (as in Step 5), use the uninstaller that ships with it:

sudo nvidia-uninstall

If it was installed from the distribution repository, remove the package instead:

sudo dnf remove -y nvidia-driver

Then repeat Steps 2-6 with the newer .run file.

 
Alternate Method to Install & Upgrade CUDA Driver in an HPC System

Instead of using the .run installer, you can use the package manager (dnf, yum, or apt) for easier installation and upgrades.


Method 1: Installing CUDA Driver Using Package Manager (Preferred)

Step 1: Add the NVIDIA Repository

For RHEL / AlmaLinux / RockyLinux:


sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo

For Ubuntu/Debian:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"

Step 2: Install the Latest NVIDIA Driver

For RHEL / AlmaLinux / RockyLinux:

sudo dnf clean all
sudo dnf install -y nvidia-driver kernel-devel kernel-headers

For Ubuntu/Debian:

sudo apt update
sudo apt install -y nvidia-driver-550

Step 3: Install CUDA Toolkit (If Needed)

To install CUDA:

sudo dnf install -y cuda

or

sudo apt install -y cuda

Step 4: Load the NVIDIA Module & Verify

sudo modprobe nvidia
nvidia-smi


This should show the GPU details and driver version.


Method 2: Upgrading the CUDA Driver

  1. Check the installed driver version:

    nvidia-smi

  2. Remove the existing NVIDIA driver:

    sudo dnf remove -y nvidia-driver

    or

    sudo apt remove --purge -y nvidia-driver

  3. Install the new driver and reboot (repeat Step 2 of Method 1 above).
What Is GDCopy and How Does It Work?

GDCopy (GPUDirect Copy) Overview

GDCopy is an optimized data transfer library that enhances GPU memory copy performance. It allows high-bandwidth GPU-to-GPU copies with low CPU overhead.


How GDCopy Works

  1. Traditional GPU Memory Copy:

    • Normally, when copying data from one GPU to another:

      • The data is first copied from GPU1 → System RAM.

      • Then, it's copied from System RAM → GPU2.

      • This adds overhead and reduces performance.


  2. GDCopy Optimization:

    • GDCopy bypasses system memory by allowing direct GPU-to-GPU copies.

    • This is achieved using NVLink (for direct inter-GPU communication).

    • Results in faster memory movement, which is essential for HPC workloads.


Why Use GDCopy?

  • Reduces CPU involvement in GPU-GPU communication.

  • Higher memory bandwidth for multi-GPU setups.

  • Optimized for NVLink (faster than PCIe-based copies).

  • Useful for AI, HPC, and ML workloads where large memory copies are frequent.
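To see which path direct GPU-to-GPU copies would take on a given node (NVLink vs. PCIe vs. a hop through the CPU), nvidia-smi can print the interconnect topology. The guard below is only there so the sketch degrades gracefully on nodes without the tool:

```shell
# Show the GPU/NIC interconnect matrix: NV# entries indicate NVLink,
# PIX/PXB PCIe paths, SYS a hop through the CPU/system memory.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m
    nvidia-smi nvlink --status   # per-link NVLink state, if NVLink is present
else
    echo "nvidia-smi not available on this node"
fi
```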



GDR (GPUDirect RDMA)

  • GDR (GPUDirect RDMA) allows third-party devices (like NICs) to directly access GPU memory.

  • It eliminates the need for extra CPU memory copies, reducing latency and increasing bandwidth.

  • Used in high-speed interconnects like InfiniBand (Mellanox) for HPC workloads.
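Whether GPUDirect RDMA is active on a node can be checked from the loaded kernel modules: newer NVIDIA drivers ship the support as `nvidia_peermem`, while older Mellanox OFED stacks used `nv_peer_mem`. A hedged sketch:

```shell
# GPUDirect RDMA needs a peer-memory module so the NIC can target GPU memory.
if grep -Eq '^(nvidia_peermem|nv_peer_mem) ' /proc/modules 2>/dev/null; then
    echo "GPUDirect RDMA peer-memory module loaded"
else
    echo "no peer-memory module loaded - GPUDirect RDMA inactive"
fi
```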

