Installing and Upgrading CUDA Driver in an HPC System


Step 1: Check Current NVIDIA Driver Version

Before installing or upgrading the driver, check the existing version:

nvidia-smi

This will display the installed driver version and GPU details.
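For scripting, nvidia-smi can also emit just the version string (`nvidia-smi --query-gpu=driver_version --format=csv,noheader`). The snippet below is a minimal sketch of comparing that against a required minimum; the value of `cur` here is a sample stand-in for the live query output:

```shell
# Sketch: compare the reported driver version against a required minimum
# using version-aware sort. "cur" is a sample value; in a real script, set
# it from: nvidia-smi --query-gpu=driver_version --format=csv,noheader
min="550.54.15"
cur="550.90.07"
if [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]; then
    echo "driver $cur satisfies minimum $min"
else
    echo "driver $cur is older than $min; upgrade required"
fi
```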


Step 2: Download the Latest NVIDIA Driver

Get the latest driver from NVIDIA's official website.

Alternatively, use wget:

wget https://us.download.nvidia.com/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run

Step 3: Blacklist Nouveau (if needed)

If Nouveau is active, disable it:

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "options nouveau modeset=0" > /etc/modprobe.d/nouveau.conf
dracut --force   # rebuild the initramfs so Nouveau stays out at boot
reboot

After the reboot, confirm that Nouveau is disabled (the command should produce no output):

lsmod | grep nouveau

Step 4: Stop Display Manager (If Applicable)

If running a GUI, stop the display manager:


systemctl stop gdm      # For GNOME
systemctl stop lightdm  # For LightDM
systemctl stop sddm     # For KDE

Step 5: Install the Driver

Run the installer:

chmod +x NVIDIA-Linux-x86_64-550.54.15.run
./NVIDIA-Linux-x86_64-550.54.15.run

Follow the on-screen prompts and accept the DKMS registration so the kernel module is rebuilt automatically after kernel updates.
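On compute nodes without a console, the same installer can be driven non-interactively; `--silent` and `--dkms` are standard flags of the NVIDIA .run installer (listed in its `--help` output). A guarded sketch:

```shell
# Non-interactive install sketch: --silent skips the interactive UI,
# --dkms registers the module with DKMS for automatic rebuilds.
installer="NVIDIA-Linux-x86_64-550.54.15.run"
if [ -x "./$installer" ]; then
    sudo "./$installer" --silent --dkms
else
    echo "installer $installer not found or not executable"
fi
```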


Step 6: Verify Installation

Reboot the system:

reboot

Check the driver:


nvidia-smi

If everything is fine, you should see the updated driver version.
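A quick scripted sanity check (a sketch; it reads /proc/modules, which exists on any Linux system) confirms the kernel module actually loaded before trusting nvidia-smi output:

```shell
# Check whether the main nvidia kernel module is loaded.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
    echo "nvidia kernel module loaded"
else
    echo "nvidia kernel module NOT loaded - check dmesg and reinstall"
fi
```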



Upgrading CUDA Driver


To upgrade, first remove the existing driver. If it was installed with the .run installer (as in Step 5), use the uninstaller that ships with it:

sudo nvidia-uninstall

If it was installed from the distribution repository, remove the package instead:

sudo dnf remove -y nvidia-driver

Then repeat Steps 2-6 with the newer .run file.

 
Alternate Method to Install & Upgrade CUDA Driver in an HPC System

Instead of using the .run installer, you can use the package manager (dnf, yum, or apt) for easier installation and upgrades.


Method 1: Installing CUDA Driver Using Package Manager (Preferred)

Step 1: Add the NVIDIA Repository

For RHEL / AlmaLinux / RockyLinux:


sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo

For Ubuntu/Debian:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"

Step 2: Install the Latest NVIDIA Driver

For RHEL / AlmaLinux / RockyLinux:

sudo dnf clean all
sudo dnf install -y nvidia-driver kernel-devel kernel-headers

For Ubuntu/Debian:

sudo apt update
sudo apt install -y nvidia-driver-550

Step 3: Install CUDA Toolkit (If Needed)

To install CUDA:

sudo dnf install -y cuda

or

sudo apt install -y cuda

Step 4: Load the NVIDIA Module & Verify

sudo modprobe nvidia
nvidia-smi


This should show the GPU details and driver version.


Method 2: Upgrading the CUDA Driver

  1. Check the installed driver version:

    nvidia-smi

  2. Remove the existing NVIDIA driver:

    sudo dnf remove -y nvidia-driver

    or

    sudo apt remove --purge -y nvidia-driver

  3. Install the new driver and reboot (repeat Step 2 of Method 1 above).
What Is GDCopy and How Does It Work?

GDCopy (GPUDirect Copy) Overview

GDCopy is an optimized data transfer library that enhances GPU memory copy performance. It allows high-bandwidth GPU-to-GPU copies with low CPU overhead.


How GDCopy Works

  1. Traditional GPU Memory Copy:

    • Normally, when copying data from one GPU to another:

      • The data is first copied from GPU1 → System RAM.

      • Then, it's copied from System RAM → GPU2.

      • This adds overhead and reduces performance.


  2. GDCopy Optimization:

    • GDCopy bypasses system memory by allowing direct GPU-to-GPU copies.

    • This is achieved using NVLink (for direct inter-GPU communication).

    • Results in faster memory movement, which is essential for HPC workloads.


Why Use GDCopy?

  • Reduces CPU involvement in GPU-GPU communication.

  • Higher memory bandwidth for multi-GPU setups.

  • Optimized for NVLink (faster than PCIe-based copies).

  • Useful for AI, HPC, and ML workloads where large memory copies are frequent.
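To see which path direct GPU-to-GPU copies would take on a given node (NVLink vs. PCIe vs. a hop through the CPU), nvidia-smi can print the interconnect topology. The guard below is only there so the sketch degrades gracefully on nodes without the tool:

```shell
# Show the GPU/NIC interconnect matrix: NV# entries indicate NVLink,
# PIX/PXB PCIe paths, SYS a hop through the CPU/system memory.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m
    nvidia-smi nvlink --status   # per-link NVLink state, if NVLink is present
else
    echo "nvidia-smi not available on this node"
fi
```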



GDR (GPUDirect RDMA)

  • GDR (GPUDirect RDMA) allows third-party devices (like NICs) to directly access GPU memory.

  • It eliminates the need for extra CPU memory copies, reducing latency and increasing bandwidth.

  • Used in high-speed interconnects like InfiniBand (Mellanox) for HPC workloads.
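Whether GPUDirect RDMA is active on a node can be checked from the loaded kernel modules: newer NVIDIA drivers ship the support as `nvidia_peermem`, while older Mellanox OFED stacks used `nv_peer_mem`. A hedged sketch:

```shell
# GPUDirect RDMA needs a peer-memory module so the NIC can target GPU memory.
if grep -Eq '^(nvidia_peermem|nv_peer_mem) ' /proc/modules 2>/dev/null; then
    echo "GPUDirect RDMA peer-memory module loaded"
else
    echo "no peer-memory module loaded - GPUDirect RDMA inactive"
fi
```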

