Deploying a Slurm Cluster on Rocky Linux 8+ and Ubuntu 22.04/24.04
With GPU (GRES) Support, HA Architecture, and Best Practices
1. Introduction
Slurm (Simple Linux Utility for Resource Management) is a highly scalable, open-source workload manager widely used in HPC, AI/ML, and GPU clusters. It provides efficient job scheduling, resource allocation (CPU, memory, GPU), and accounting with minimal overhead.
In this guide, we walk through deploying a production-ready Slurm cluster with:
- Rocky Linux 8+ and Ubuntu 22.04 / 24.04
- Munge authentication
- MariaDB-based accounting (`slurmdbd`)
- NFS shared filesystem
- NVIDIA GPU scheduling using GRES
- High Availability (HA) `slurmctld` architecture
- Slurm tuning & operational best practices
This setup is suitable for learning, benchmarking, AI workloads, and small-to-medium production clusters.
2. Slurm Cluster Architecture
2.1 Core Components
| Component | Description |
|---|---|
| `slurmctld` | Central scheduler and controller |
| `slurmd` | Compute node daemon |
| `slurmdbd` | Accounting daemon |
| Munge | Authentication service |
| MariaDB | Job accounting backend |
| NFS | Shared filesystem |
| NVIDIA GPUs | Scheduled via GRES |
2.2 Logical Architecture (with GPUs)
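At a high level, the controller hosts `slurmctld`, `slurmdbd`, and MariaDB and exports the shared filesystem over NFS, while each compute node runs `slurmd` alongside its local NVIDIA GPUs. A simplified sketch of the topology used in this guide:

```text
                +---------------------------+
                |          master           |
                |  slurmctld    slurmdbd    |
                |  MariaDB      NFS export  |
                +-------------+-------------+
                              |
              +---------------+---------------+
              |                               |
    +---------+---------+           +---------+---------+
    |       node1       |           |       node2       |
    |  slurmd           |           |  slurmd           |
    |  2 x NVIDIA GPUs  |           |  2 x NVIDIA GPUs  |
    +-------------------+           +-------------------+
```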
2.3 HA Slurmctld Architecture
For production clusters, controller HA is strongly recommended.
Key settings:
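A minimal sketch of the controller entries in `slurm.conf`, assuming a second controller host (the name `master2` is hypothetical) and shared state storage:

```conf
# slurm.conf -- identical on both controllers
SlurmctldHost=master     # primary controller
SlurmctldHost=master2    # backup controller (hypothetical second host)
StateSaveLocation=/var/spool/slurmctld   # must live on shared storage (NFS/DRBD)
```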
HA requirements:
- Shared `/var/spool/slurmctld` (via NFS or DRBD)
- Identical `slurm.conf` on both controllers
- Munge keys synchronized
- Only one active controller at a time
3. Example Cluster Layout
| Hostname | IP Address | Role | GPUs |
|---|---|---|---|
| master | 10.0.1.5 | Controller + DB | — |
| node1 | 10.0.1.6 | Compute | 2 × NVIDIA |
| node2 | 10.0.1.7 | Compute | 2 × NVIDIA |
4. Preparation
4.1 Passwordless SSH
Ensure SSH access from controller to all nodes:
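For example, from the controller (node names follow the cluster layout in section 3):

```bash
# Generate a key pair on the controller if one does not exist yet
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

# Push the public key to every compute node
for host in node1 node2; do
  ssh-copy-id "$host"
done

# Confirm passwordless access
ssh node1 hostname
```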
4.2 GPU Prerequisites (Compute Nodes Only)
Ubuntu 22.04 / 24.04
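One common approach is to let `ubuntu-drivers` pick the recommended driver; the pinned version in the comment is only an example:

```bash
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall   # or pin a release, e.g. sudo apt install -y nvidia-driver-550
sudo reboot
```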
Rocky Linux 8+
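On Rocky Linux, one route is NVIDIA's CUDA repository with DKMS kernel modules (the repo URL assumes an x86_64 EL8 system):

```bash
sudo dnf install -y epel-release kernel-devel kernel-headers dnf-plugins-core
sudo dnf config-manager --add-repo \
  https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf module install -y nvidia-driver:latest-dkms
sudo reboot
```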
Verify:
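After the reboot, the driver should list both GPUs on each compute node:

```bash
nvidia-smi
```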
5. Create Global Users (All Nodes)
UID/GID must match on all nodes.
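For example, run the following on every node. The UID/GID values 1001/1002 are arbitrary; any values work as long as they are identical cluster-wide:

```bash
# Munge service account
sudo groupadd -g 1001 munge
sudo useradd -m -d /var/lib/munge -u 1001 -g munge -s /sbin/nologin munge

# Slurm service account
sudo groupadd -g 1002 slurm
sudo useradd -m -d /var/lib/slurm -u 1002 -g slurm -s /bin/bash slurm
```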
6. Install and Configure Munge
Install Munge
Rocky Linux
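Munge is shipped in EPEL:

```bash
sudo dnf install -y epel-release
sudo dnf install -y munge munge-libs munge-devel
```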
Ubuntu
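From the standard Ubuntu archive:

```bash
sudo apt update
sudo apt install -y munge libmunge2 libmunge-dev
```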
Generate Munge Key (Controller Only)
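One portable way to create the key (Rocky also ships `/usr/sbin/create-munge-key` as an alternative):

```bash
sudo dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
```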
Copy to compute nodes and start:
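For example, as root on the controller (root-to-root SSH assumed, per section 4.1):

```bash
# Distribute the key to the compute nodes
for host in node1 node2; do
  scp /etc/munge/munge.key "$host":/etc/munge/munge.key
  ssh "$host" 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
done

# Enable and start Munge on every node, controller included
systemctl enable --now munge
```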
Test:
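A local and a cross-node check:

```bash
# Local: encode and decode a credential on the same host
munge -n | unmunge

# Remote: credential created on the controller, decoded on node1
munge -n | ssh node1 unmunge
```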
7. Install Slurm
Recommendation (2026): Use Slurm 23.x or 25.x for better GPU auto-detection via NVML.
Rocky Linux (RPM build example)
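The Slurm tarball ships an RPM spec, so `rpmbuild -ta` produces installable packages. The build dependencies and the version below are examples; pick a current release from download.schedmd.com:

```bash
# Typical build dependencies
sudo dnf install -y rpm-build gcc make wget munge-devel pam-devel \
  readline-devel mariadb-devel perl hwloc-devel

# Download and build (substitute the release you want)
VERSION=23.11.10
wget https://download.schedmd.com/slurm/slurm-${VERSION}.tar.bz2
rpmbuild -ta slurm-${VERSION}.tar.bz2

# Install the resulting RPMs (repeat on compute nodes, or host them in a local repo)
sudo dnf install -y ~/rpmbuild/RPMS/x86_64/slurm-*.rpm
```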
Ubuntu
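Ubuntu packages Slurm as `slurm-wlm`; the archive version may lag behind the recommendation above, in which case build from the same tarball as on Rocky:

```bash
# Controller
sudo apt update
sudo apt install -y slurm-wlm slurmdbd

# Compute nodes
sudo apt install -y slurmd slurm-client
```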
8. Slurm Configuration
8.1 slurm.conf (Controller)
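A minimal configuration matching the layout in section 3. The CPU and memory figures are placeholders; take the real values from `slurmd -C` on each compute node:

```conf
# /etc/slurm/slurm.conf
ClusterName=gpu-cluster
SlurmctldHost=master

# Authentication
AuthType=auth/munge
SlurmUser=slurm

# State and logging
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

# Scheduling: CPUs, memory and GPUs as trackable resources
# (cgroup plugins read the optional /etc/slurm/cgroup.conf; defaults are fine)
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Accounting (see section 10)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
JobAcctGatherType=jobacct_gather/cgroup

# GPU scheduling
GresTypes=gpu

# Nodes and partitions (CPUs/RealMemory are examples)
NodeName=node[1-2] CPUs=32 RealMemory=128000 Gres=gpu:2 State=UNKNOWN
PartitionName=gpu Nodes=node[1-2] Default=YES MaxTime=INFINITE State=UP
```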
Distribute to compute nodes.
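For example, as root, reusing the SSH access from section 4.1:

```bash
for host in node1 node2; do
  scp /etc/slurm/slurm.conf "$host":/etc/slurm/slurm.conf
done
```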
8.2 gres.conf (ALL Nodes)
Recommended explicit mapping
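For the two-GPU nodes in this layout (device files follow the standard NVIDIA driver naming):

```conf
# /etc/slurm/gres.conf
NodeName=node[1-2] Name=gpu File=/dev/nvidia0
NodeName=node[1-2] Name=gpu File=/dev/nvidia1
```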
Auto-detect (Slurm 20.11+)
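If Slurm was built with NVML support, each `slurmd` can discover its own GPUs instead:

```conf
# /etc/slurm/gres.conf
AutoDetect=nvml
```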
9. Required Directories
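Create the spool and log directories referenced in `slurm.conf` and hand the controller state to the `slurm` user:

```bash
# Controller
sudo mkdir -p /var/spool/slurmctld /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurmctld /var/log/slurm

# Compute nodes (slurmd itself runs as root)
sudo mkdir -p /var/spool/slurmd /var/log/slurm
sudo chown slurm:slurm /var/log/slurm
```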
10. Slurm Accounting (MariaDB)
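On the controller, a minimal database setup might look like this (replace the example password):

```bash
# Rocky:  sudo dnf install -y mariadb-server
# Ubuntu: sudo apt install -y mariadb-server
sudo systemctl enable --now mariadb

sudo mysql <<'SQL'
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'changeme';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
SQL
```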
slurmdbd.conf
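A matching configuration, assuming the database created above; the file must be owned by `slurm` with mode 600:

```conf
# /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdHost=master
SlurmUser=slurm

StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=changeme
StorageLoc=slurm_acct_db

LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
```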
11. Start Services
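Start the accounting daemon first so the controller can register against it (the cluster name matches `slurm.conf`):

```bash
# Controller
sudo systemctl enable --now slurmdbd
sudo systemctl enable --now slurmctld
sudo sacctmgr -i add cluster gpu-cluster

# Compute nodes
sudo systemctl enable --now slurmd
```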
12. Validation
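Basic checks from the controller:

```bash
sinfo                       # partition and node states
scontrol show node node1    # per-node details, including Gres=gpu:2
srun -N2 hostname           # trivial job across both compute nodes
```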
GPU Job Test
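For example, an interactive request for a single GPU, and a batch equivalent:

```bash
# Interactive: one GPU from the gpu partition
srun --partition=gpu --gres=gpu:1 nvidia-smi

# Batch script
cat > gpu_test.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
nvidia-smi
EOF
sbatch gpu_test.sh
```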
13. Slurm Tuning & Best Practices
Performance & Stability
- Enable CPU binding and task affinity
- Use `SelectType=select/cons_tres`
- Enable job accounting compression
- Tune `SlurmdTimeout` and `InactiveLimit`
GPU Best Practices
- Always use explicit GRES mapping
- Track GPU usage via `sacct -o JobID,User,AllocTRES` (see the example after this list)
- Use separate GPU partitions for fairness
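For instance, widening the TRES column makes the per-job GPU allocations easy to read:

```bash
sacct -o JobID,User,AllocTRES%40
```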
Operational Tips
- Monitor with `sdiag`, `squeue`, and `sacct`
- Back up the MariaDB accounting database regularly
- Use HA controllers for production
- Keep Slurm versions consistent across nodes
14. Conclusion
This guide provides a complete, modern Slurm deployment with GPU scheduling, HA architecture, and tuning recommendations. It serves as a strong foundation for AI training, HPC workloads, and enterprise clusters.
From here, the cluster is ready to scale into next-level topics such as QoS policies, fairshare tuning, Kubernetes-Slurm integration, and GPU isolation strategies.

