Deploying a Slurm Cluster on Rocky Linux 8+ and Ubuntu 22.04/24.04

With GPU (GRES) Support, HA Architecture, and Best Practices

1. Introduction

Slurm (Simple Linux Utility for Resource Management) is a highly scalable, open-source workload manager widely used in HPC, AI/ML, and GPU clusters. It provides efficient job scheduling, resource allocation (CPU, memory, GPU), and accounting with minimal overhead.

In this guide, we walk through deploying a production-ready Slurm cluster with:

  • Rocky Linux 8+ and Ubuntu 22.04 / 24.04

  • Munge authentication

  • MariaDB-based accounting (slurmdbd)

  • NFS shared filesystem

  • NVIDIA GPU scheduling using GRES

  • High Availability (HA) Slurmctld architecture

  • Slurm tuning & operational best practices

This setup is suitable for learning, benchmarking, AI workloads, and small-to-medium production clusters.


2. Slurm Cluster Architecture

2.1 Core Components

Component      Description
slurmctld      Central scheduler and controller
slurmd         Compute node daemon
slurmdbd       Accounting daemon
Munge          Authentication service
MariaDB        Job accounting backend
NFS            Shared filesystem
NVIDIA GPUs    Scheduled via GRES

2.2 Logical Architecture (with GPUs)

The controller node runs slurmctld, slurmdbd, and MariaDB; the compute nodes run slurmd and carry the NVIDIA GPUs, which are exposed to the scheduler through GRES. All nodes authenticate with a shared Munge key and mount a common NFS filesystem.

2.3 HA Slurmctld Architecture

For production clusters, controller HA is strongly recommended.

Key settings:

SlurmctldHost=slurmctld1
SlurmctldHost=slurmctld2
SlurmctldPort=6817
SlurmdPort=6818

The first SlurmctldHost entry is the primary controller and the second is the backup; ControlMachine/BackupController are the legacy equivalents used by older Slurm releases.

HA requirements:

  • Shared /var/spool/slurmctld (via NFS or DRBD)

  • Identical slurm.conf on both controllers

  • Munge keys synchronized

  • Only one active controller at a time
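In slurm.conf this typically means pointing the controller state directory at the shared mount and giving the backup a takeover timeout. A minimal sketch, assuming the shared directory listed above:

StateSaveLocation=/var/spool/slurmctld
SlurmctldTimeout=120

The backup controller assumes control if the primary stays unreachable for SlurmctldTimeout seconds, and hands control back once the primary returns to service.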


3. Example Cluster Layout

Hostname   IP Address   Role               GPUs
master     10.0.1.5     Controller + DB    –
node1      10.0.1.6     Compute            2 × NVIDIA
node2      10.0.1.7     Compute            2 × NVIDIA

4. Preparation

4.1 Passwordless SSH

Ensure SSH access from controller to all nodes:

ssh root@node1
ssh root@node2
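If key-based access is not yet in place, a minimal sketch (assuming root login is permitted and the node names from the example layout):

# Generate a key on the controller and push it to each node
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
for h in node1 node2; do
  ssh-copy-id root@${h}
done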

4.2 GPU Prerequisites (Compute Nodes Only)

Ubuntu 22.04 / 24.04

sudo apt update
sudo apt install -y nvidia-driver-550 nvidia-utils-550 cuda
sudo reboot

(The cuda metapackage comes from NVIDIA's CUDA apt repository, which must be added first; the driver packages are available from Ubuntu's own restricted archive.)

Rocky Linux 8+

sudo dnf module install nvidia-driver:latest-dkms
sudo dnf install cuda
sudo reboot

(Both packages come from NVIDIA's CUDA repository for RHEL/Rocky, which must be enabled first.)

Verify:

nvidia-smi

5. Create Global Users (All Nodes)

UID/GID must match on all nodes.

groupadd -g 991 munge
useradd -u 991 -g munge -s /sbin/nologin munge
groupadd -g 992 slurm
useradd -u 992 -g slurm -s /bin/bash slurm
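Because Munge and Slurm compare numeric IDs across hosts, it is worth verifying that every node ended up with identical UIDs/GIDs. A quick check, assuming the node names from the example layout:

for h in node1 node2; do
  echo "== ${h}"; ssh ${h} "id munge; id slurm"
done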

6. Install and Configure Munge

Install Munge

Rocky Linux

dnf install epel-release -y
dnf install munge munge-libs munge-devel -y

Ubuntu

apt install -y munge libmunge-dev

Generate Munge Key (Controller Only)

dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

Copy the key to every compute node, keeping the same ownership and permissions.
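A minimal sketch, assuming root SSH access and the node names from the example layout:

for h in node1 node2; do
  scp -p /etc/munge/munge.key ${h}:/etc/munge/munge.key
  ssh ${h} "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done

Then enable and start Munge on every node: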

systemctl enable --now munge

Test:

munge -n | ssh node1 unmunge

7. Install Slurm

Recommendation (2026): Use Slurm 23.x or 25.x for better GPU auto-detection via NVML.

Rocky Linux (RPM build example)
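The build needs a handful of development packages and the release tarball from SchedMD. A hedged sketch of the preparation (the package list is approximate; adjust for your environment, and substitute the actual release for 25.x):

dnf install -y rpm-build gcc make munge-devel pam-devel readline-devel perl mariadb-devel
curl -O https://download.schedmd.com/slurm/slurm-25.x.tar.bz2

If you want NVML GPU auto-detection compiled in, build on a machine where the NVIDIA driver libraries are installed. rpmbuild drops the finished packages under ~/rpmbuild/RPMS/<arch>/.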

rpmbuild -ta slurm-25.x.tar.bz2
dnf localinstall slurm*.rpm -y

Ubuntu

apt install -y slurm-wlm
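On Debian/Ubuntu the daemons are also packaged separately, so each node only needs what it actually runs. A hedged sketch using the package names from the Ubuntu archive:

# Controller
apt install -y slurmctld slurmdbd
# Compute nodes
apt install -y slurmd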

8. Slurm Configuration

8.1 slurm.conf (Controller)

ClusterName=hpc-cluster
SlurmctldHost=master
SlurmUser=slurm
AuthType=auth/munge

# Scheduling (cons_tres enables per-core, per-memory, and per-GPU allocation)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=CPU,Mem,Node,gres/gpu

# GPU support
GresTypes=gpu

# Nodes
NodeName=node1 NodeAddr=10.0.1.6 CPUs=8 RealMemory=32000 Gres=gpu:2 State=UNKNOWN
NodeName=node2 NodeAddr=10.0.1.7 CPUs=8 RealMemory=32000 Gres=gpu:2 State=UNKNOWN

# Partition
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP

Distribute the same slurm.conf to every compute node; Slurm expects the file to be identical cluster-wide.
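A minimal sketch, assuming the configuration directory is /etc/slurm (adjust if your packages use a different path) and the node names from the example layout:

for h in node1 node2; do
  scp /etc/slurm/slurm.conf ${h}:/etc/slurm/slurm.conf
done
scontrol reconfigure

scontrol reconfigure tells the running daemons to re-read the file after any later change.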


8.2 gres.conf (ALL Nodes)

Recommended explicit mapping

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
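Before hard-coding device files, confirm they exist and match what the driver reports:

ls -l /dev/nvidia[0-9]*
nvidia-smi -L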

Auto-detect (Slurm 20.11+)

AutoDetect=nvml

(Auto-detection requires Slurm to have been built with NVML support.)

9. Required Directories

mkdir -p /var/spool/slurm /var/log/slurm
chown -R slurm: /var/spool/slurm /var/log/slurm

10. Slurm Accounting (MariaDB)

CREATE DATABASE slurm_acct_db;
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
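SchedMD's accounting documentation also suggests giving InnoDB a larger buffer pool and lock timeout before slurmdbd first creates its tables. A hedged example (values are illustrative; the file path follows the Rocky Linux convention, Ubuntu uses /etc/mysql/mariadb.conf.d/):

# /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

Restart MariaDB after changing these values.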

slurmdbd.conf

AuthType=auth/munge
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=password
StorageLoc=slurm_acct_db

Own this file by the slurm user and restrict it to mode 600; slurmdbd refuses to start if it is readable by group or others.

11. Start Services

On the controller:

systemctl enable --now slurmdbd
systemctl enable --now slurmctld

On the compute nodes:

systemctl enable --now slurmd
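Once slurmdbd is running, register the cluster in the accounting database (the name must match ClusterName in slurm.conf):

sacctmgr add cluster hpc-cluster
sacctmgr show cluster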

12. Validation

sinfo -o "%20N %10G %20C %10m %T"
scontrol show node node1

GPU Job Test

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --gres=gpu:1
#SBATCH --output=gpu_%j.out

nvidia-smi
sbatch gpu_test.sh
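For a quick interactive check without a batch script:

srun --gres=gpu:1 nvidia-smi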

13. Slurm Tuning & Best Practices

Performance & Stability

  • Enable CPU binding and task affinity

  • Use SelectType=select/cons_tres

  • Enable job accounting compression

  • Tune SlurmdTimeout and InactiveLimit (see the sketch after this list)
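A hedged slurm.conf sketch of how these items might look (values are examples, not prescriptions; cons_tres was already set in section 8.1):

# CPU binding and task affinity through cgroups (also requires a cgroup.conf)
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup
# Give slurmd time to recover before nodes are marked down; terminate
# allocations whose srun/salloc has stopped responding
SlurmdTimeout=300
InactiveLimit=120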

GPU Best Practices

  • Always use explicit GRES mapping

  • Track GPU usage via sacct -o JobID,User,AllocTRES

  • Separate GPU partitions for fairness (see the example after this list)
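As an illustration, a dedicated GPU partition next to the default debug partition could be defined like this in slurm.conf (names and limits are examples based on the sample nodes):

PartitionName=gpu Nodes=node[1-2] MaxTime=2-00:00:00 State=UP
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP

Pair this with QoS or TRES limits if GPU jobs need different priorities or quotas.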

Operational Tips

  • Monitor with sdiag, squeue, sacct

  • Regularly back up MariaDB (see the backup sketch after this list)

  • Use HA controllers for production

  • Keep Slurm versions consistent across nodes
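A minimal backup sketch using mysqldump from cron (paths and schedule are examples; assumes root can authenticate to MariaDB over the local socket):

# /etc/cron.d/slurm-acct-backup -- nightly dump at 02:00
0 2 * * * root mysqldump --single-transaction slurm_acct_db | gzip > /var/backups/slurm_acct_db_$(date +\%F).sql.gz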


14. Conclusion

This guide provides a complete, modern Slurm deployment with GPU scheduling, HA architecture, and tuning recommendations. It serves as a strong foundation for AI training, HPC workloads, and enterprise clusters.

If you want next-level topics like QoS policies, fairshare tuning, Kubernetes-Slurm integration, or GPU isolation strategies, this setup is ready to scale.
