Deploying a Slurm Cluster on Rocky Linux 8+ and Ubuntu 22.04/24.04

With GPU (GRES) Support, HA Architecture, and Best Practices

1. Introduction

Slurm (Simple Linux Utility for Resource Management) is a highly scalable, open-source workload manager widely used in HPC, AI/ML, and GPU clusters. It provides efficient job scheduling, resource allocation (CPU, memory, GPU), and accounting with minimal overhead.

In this guide, we walk through deploying a production-ready Slurm cluster with:

  • Rocky Linux 8+ and Ubuntu 22.04 / 24.04

  • Munge authentication

  • MariaDB-based accounting (slurmdbd)

  • NFS shared filesystem

  • NVIDIA GPU scheduling using GRES

  • High Availability (HA) Slurmctld architecture

  • Slurm tuning & operational best practices

This setup is suitable for learning, benchmarking, AI workloads, and small-to-medium production clusters.


2. Slurm Cluster Architecture

2.1 Core Components

Component      Description
slurmctld      Central scheduler and controller
slurmd         Compute node daemon
slurmdbd       Accounting daemon
Munge          Authentication service
MariaDB        Job accounting backend
NFS            Shared filesystem
NVIDIA GPUs    Scheduled via GRES

2.2 Logical Architecture (with GPUs)

The controller node runs slurmctld, slurmdbd, and MariaDB; the compute nodes run slurmd and carry the NVIDIA GPUs, which are exposed to the scheduler through GRES. All nodes authenticate with a shared Munge key and mount a common NFS filesystem.

2.3 HA Slurmctld Architecture

For production clusters, controller HA is strongly recommended.

Key settings:

SlurmctldHost=slurmctld1
SlurmctldHost=slurmctld2
SlurmctldPort=6817
SlurmdPort=6818

The first SlurmctldHost entry is the primary controller and the second is the backup; ControlMachine/BackupController are the legacy equivalents used by older Slurm releases.

HA requirements:

  • Shared /var/spool/slurmctld (via NFS or DRBD)

  • Identical slurm.conf on both controllers

  • Munge keys synchronized

  • Only one active controller at a time
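In slurm.conf this typically means pointing the controller state directory at the shared mount and giving the backup a takeover timeout. A minimal sketch, assuming the shared directory listed above:

StateSaveLocation=/var/spool/slurmctld
SlurmctldTimeout=120

The backup controller assumes control if the primary stays unreachable for SlurmctldTimeout seconds, and hands control back once the primary returns to service.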


3. Example Cluster Layout

Hostname   IP Address   Role               GPUs
master     10.0.1.5     Controller + DB    –
node1      10.0.1.6     Compute            2 × NVIDIA
node2      10.0.1.7     Compute            2 × NVIDIA

4. Preparation

4.1 Passwordless SSH

Ensure SSH access from controller to all nodes:

ssh root@node1
ssh root@node2
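If key-based access is not yet in place, a minimal sketch (assuming root login is permitted and the node names from the example layout):

# Generate a key on the controller and push it to each node
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
for h in node1 node2; do
  ssh-copy-id root@${h}
done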

4.2 GPU Prerequisites (Compute Nodes Only)

Ubuntu 22.04 / 24.04

sudo apt update
sudo apt install -y nvidia-driver-550 nvidia-utils-550 cuda
sudo reboot

(The cuda metapackage comes from NVIDIA's CUDA apt repository, which must be added first; the driver packages are available from Ubuntu's own restricted archive.)

Rocky Linux 8+

sudo dnf module install nvidia-driver:latest-dkms
sudo dnf install cuda
sudo reboot

(Both packages come from NVIDIA's CUDA repository for RHEL/Rocky, which must be enabled first.)

Verify:

nvidia-smi

5. Create Global Users (All Nodes)

UID/GID must match on all nodes.

groupadd -g 991 munge
useradd -u 991 -g munge -s /sbin/nologin munge
groupadd -g 992 slurm
useradd -u 992 -g slurm -s /bin/bash slurm
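Because Munge and Slurm compare numeric IDs across hosts, it is worth verifying that every node ended up with identical UIDs/GIDs. A quick check, assuming the node names from the example layout:

for h in node1 node2; do
  echo "== ${h}"; ssh ${h} "id munge; id slurm"
done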

6. Install and Configure Munge

Install Munge

Rocky Linux

dnf install epel-release -y
dnf install munge munge-libs munge-devel -y

Ubuntu

apt install -y munge libmunge-dev

Generate Munge Key (Controller Only)

dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

Copy the key to every compute node, keeping the same ownership and permissions.
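A minimal sketch, assuming root SSH access and the node names from the example layout:

for h in node1 node2; do
  scp -p /etc/munge/munge.key ${h}:/etc/munge/munge.key
  ssh ${h} "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"
done

Then enable and start Munge on every node: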

systemctl enable --now munge

Test:

munge -n | ssh node1 unmunge

7. Install Slurm

Recommendation (2026): Use Slurm 23.x or 25.x for better GPU auto-detection via NVML.

Rocky Linux (RPM build example)
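The build needs a handful of development packages and the release tarball from SchedMD. A hedged sketch of the preparation (the package list is approximate; adjust for your environment, and substitute the actual release for 25.x):

dnf install -y rpm-build gcc make munge-devel pam-devel readline-devel perl mariadb-devel
curl -O https://download.schedmd.com/slurm/slurm-25.x.tar.bz2

If you want NVML GPU auto-detection compiled in, build on a machine where the NVIDIA driver libraries are installed. rpmbuild drops the finished packages under ~/rpmbuild/RPMS/<arch>/.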

rpmbuild -ta slurm-25.x.tar.bz2
dnf localinstall slurm*.rpm -y

Ubuntu

apt install -y slurm-wlm
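On Debian/Ubuntu the daemons are also packaged separately, so each node only needs what it actually runs. A hedged sketch using the package names from the Ubuntu archive:

# Controller
apt install -y slurmctld slurmdbd
# Compute nodes
apt install -y slurmd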

8. Slurm Configuration

8.1 slurm.conf (Controller)

ClusterName=hpc-cluster
SlurmctldHost=master
SlurmUser=slurm
AuthType=auth/munge

# Scheduling (cons_tres enables per-core, per-memory, and per-GPU allocation)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=CPU,Mem,Node,gres/gpu

# GPU support
GresTypes=gpu

# Nodes
NodeName=node1 NodeAddr=10.0.1.6 CPUs=8 RealMemory=32000 Gres=gpu:2 State=UNKNOWN
NodeName=node2 NodeAddr=10.0.1.7 CPUs=8 RealMemory=32000 Gres=gpu:2 State=UNKNOWN

# Partition
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP

Distribute the same slurm.conf to every compute node; Slurm expects the file to be identical cluster-wide.
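A minimal sketch, assuming the configuration directory is /etc/slurm (adjust if your packages use a different path) and the node names from the example layout:

for h in node1 node2; do
  scp /etc/slurm/slurm.conf ${h}:/etc/slurm/slurm.conf
done
scontrol reconfigure

scontrol reconfigure tells the running daemons to re-read the file after any later change.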


8.2 gres.conf (ALL Nodes)

Recommended explicit mapping

Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
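Before hard-coding device files, confirm they exist and match what the driver reports:

ls -l /dev/nvidia[0-9]*
nvidia-smi -L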

Auto-detect (Slurm 20.11+)

AutoDetect=nvml

(Auto-detection requires Slurm to have been built with NVML support.)

9. Required Directories

mkdir -p /var/spool/slurm /var/log/slurm
chown -R slurm: /var/spool/slurm /var/log/slurm

10. Slurm Accounting (MariaDB)

CREATE DATABASE slurm_acct_db;
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
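SchedMD's accounting documentation also suggests giving InnoDB a larger buffer pool and lock timeout before slurmdbd first creates its tables. A hedged example (values are illustrative; the file path follows the Rocky Linux convention, Ubuntu uses /etc/mysql/mariadb.conf.d/):

# /etc/my.cnf.d/innodb.cnf
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

Restart MariaDB after changing these values.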

slurmdbd.conf

AuthType=auth/munge
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageUser=slurm
StoragePass=password
StorageLoc=slurm_acct_db

Own this file by the slurm user and restrict it to mode 600; slurmdbd refuses to start if it is readable by group or others.

11. Start Services

On the controller:

systemctl enable --now slurmdbd
systemctl enable --now slurmctld

On the compute nodes:

systemctl enable --now slurmd
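Once slurmdbd is running, register the cluster in the accounting database (the name must match ClusterName in slurm.conf):

sacctmgr add cluster hpc-cluster
sacctmgr show cluster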

12. Validation

sinfo -o "%20N %10G %20C %10m %T"
scontrol show node node1

GPU Job Test

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --gres=gpu:1
#SBATCH --output=gpu_%j.out

nvidia-smi
sbatch gpu_test.sh
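For a quick interactive check without a batch script:

srun --gres=gpu:1 nvidia-smi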

13. Slurm Tuning & Best Practices

Performance & Stability

  • Enable CPU binding and task affinity

  • Use SelectType=select/cons_tres

  • Enable job accounting compression

  • Tune SlurmdTimeout and InactiveLimit (see the sketch after this list)
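A hedged slurm.conf sketch of how these items might look (values are examples, not prescriptions; cons_tres was already set in section 8.1):

# CPU binding and task affinity through cgroups (also requires a cgroup.conf)
TaskPlugin=task/affinity,task/cgroup
ProctrackType=proctrack/cgroup
# Give slurmd time to recover before nodes are marked down; terminate
# allocations whose srun/salloc has stopped responding
SlurmdTimeout=300
InactiveLimit=120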

GPU Best Practices

  • Always use explicit GRES mapping

  • Track GPU usage via sacct -o JobID,User,AllocTRES

  • Separate GPU partitions for fairness (see the example after this list)
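As an illustration, a dedicated GPU partition next to the default debug partition could be defined like this in slurm.conf (names and limits are examples based on the sample nodes):

PartitionName=gpu Nodes=node[1-2] MaxTime=2-00:00:00 State=UP
PartitionName=debug Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP

Pair this with QoS or TRES limits if GPU jobs need different priorities or quotas.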

Operational Tips

  • Monitor with sdiag, squeue, sacct

  • Regularly back up MariaDB (see the backup sketch after this list)

  • Use HA controllers for production

  • Keep Slurm versions consistent across nodes
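A minimal backup sketch using mysqldump from cron (paths and schedule are examples; assumes root can authenticate to MariaDB over the local socket):

# /etc/cron.d/slurm-acct-backup -- nightly dump at 02:00
0 2 * * * root mysqldump --single-transaction slurm_acct_db | gzip > /var/backups/slurm_acct_db_$(date +\%F).sql.gz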


14. Conclusion

This guide provides a complete, modern Slurm deployment with GPU scheduling, HA architecture, and tuning recommendations. It serves as a strong foundation for AI training, HPC workloads, and enterprise clusters.

If you want next-level topics like QoS policies, fairshare tuning, Kubernetes-Slurm integration, or GPU isolation strategies, this setup is ready to scale.
