
Slurm Basics

Login nodes are meant only for lightweight, interactive tasks, such as:

  • changing directories
  • browsing files and directories
  • checking/editing file contents
  • copying/transferring files
  • monitoring jobs
  • other similarly lightweight operations

⚠️ Prohibited:

  • Running computationally intensive tasks or long-running jobs directly on the login node
  • Any process that consumes significant CPU, memory, or I/O

If we see a job running on the login node, we will send a warning to you and your supervisor.

⚠️ Danger:

Repeated violations will result in revocation of access to MONTAGE and deletion of the user account (including files).

Misuse of shared resources degrades the experience of all users. Consider this your warning.

Okay, we got that out of the way...

Slurm is the cluster's job scheduler. The script you submit is a job, and Slurm manages its execution once the requested resources become available.

To have Slurm execute your jobs, you need to specify a few key parameters in your job script header. These settings tell Slurm what resources you need and how your job should be scheduled. Typical decisions include:

  • Partition/queue: Which group of nodes your job should run on, often determined by job size, time limits, or user permissions. Read up on our QOS.
  • Number of nodes and tasks: How many compute nodes and tasks (e.g., MPI ranks or CPU cores) your job requires.
  • Memory and time limits: How much memory per node or per task, and the maximum runtime of your job.
  • Job name and output: Name your job and specify files for standard output and error. Logging your jobs this way is essential; otherwise, good luck explaining what happened when a job fails.

These Slurm directives are typically included at the top of your script as #SBATCH headers, guiding Slurm in allocating resources and scheduling your job efficiently. A few examples:

Minimal single-node CPU job

#!/bin/bash
#SBATCH --job-name=example_job # Job name
#SBATCH --output=example_job.out # Standard output file
#SBATCH --error=example_job.err # Standard error file. Remember, log your jobs
#SBATCH --time=01:00:00 # Maximum runtime (HH:MM:SS)
#SBATCH --partition=cpu # Partition/queue to submit to
#SBATCH --ntasks=1 # Number of tasks (processes)
#SBATCH --cpus-per-task=1 # Number of CPU cores per task
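Assuming the script above is saved as example_job.sh (the filename here is just an example), creating and submitting it looks like this:

```shell
# Save the minimal job script (example_job.sh is a hypothetical filename)
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --output=example_job.out
#SBATCH --error=example_job.err
#SBATCH --time=01:00:00
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

echo "Running on $(hostname)"
EOF

# On the cluster, submit the script and check its status:
#   sbatch example_job.sh    # prints e.g. "Submitted batch job <jobid>"
#   squeue -u $USER          # show your queued/running jobs
```

Once the job finishes, the output and error files named in the #SBATCH headers (example_job.out, example_job.err) hold everything the job printed.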

Multi-core single-node job

#!/bin/bash
#SBATCH --job-name=multi_core_test
#SBATCH --output=multi_core.out
#SBATCH --error=multi_core.err
#SBATCH --time=02:00:00
#SBATCH --partition=gpu # GPU partition!
#SBATCH --gres=gpu:1 # Request one GPU (GPU partitions typically require an explicit GPU request)
#SBATCH --ntasks=4 # 4 tasks (processes)
#SBATCH --cpus-per-task=2 # Each task uses 2 CPU cores
#SBATCH --mem=8G # Memory per node
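As a sanity check, the request above works out to 4 tasks × 2 cores = 8 CPU cores. Inside a running job, Slurm exports SLURM_NTASKS and SLURM_CPUS_PER_TASK, so the script can compute this itself; the values below are hard-coded to mirror the directives above, since this sketch is not running under Slurm:

```shell
# Hard-coded to match the #SBATCH directives above; inside a real job these
# would come from Slurm's environment (SLURM_NTASKS, SLURM_CPUS_PER_TASK).
SLURM_NTASKS=4
SLURM_CPUS_PER_TASK=2

# Total CPU cores allocated = tasks x cores per task
TOTAL_CORES=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
echo "Total cores requested: ${TOTAL_CORES}"   # prints "Total cores requested: 8"
```

This is the number to compare against a node's core count before deciding whether the job fits on a single node.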

Multi-node MPI job

#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --output=mpi_sim.out
#SBATCH --error=mpi_sim.err
#SBATCH --time=04:00:00
#SBATCH --partition=cpu
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks-per-node=8 # Tasks per node
#SBATCH --cpus-per-task=1 # CPU cores per task
#SBATCH --mem=32G # Memory per node
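The headers above request 2 nodes × 8 tasks per node = 16 MPI ranks in total. The script body then launches the program across all of them with srun; my_mpi_program below is a placeholder for your own binary:

```shell
# Values mirror the #SBATCH directives above; in a real job Slurm sets
# SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE in the environment.
SLURM_JOB_NUM_NODES=2
SLURM_NTASKS_PER_NODE=8

TOTAL_RANKS=$((SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE))
echo "Total MPI ranks: ${TOTAL_RANKS}"   # prints "Total MPI ranks: 16"

# In the actual job script body, the launch line would be:
#   srun ./my_mpi_program    # my_mpi_program is a hypothetical binary name
```

srun inherits the resource allocation from the #SBATCH headers, so no extra rank count needs to be passed on the launch line.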

But how many cores and how much RAM do we have?

Don't let us catch you skipping the Resources in Slurm page.

Commands

Command   Description
srun      Initiate an interactive Slurm session
sbatch    Submit a job to Slurm
squeue    Reports job status
scancel   Terminate queued/running jobs
sinfo     Reports system status
sacct     Provides info about running/completed jobs
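A few common invocations of the commands above, as a sketch; the job ID 12345 and the filename job.sh are placeholders:

```shell
sbatch job.sh      # submit job.sh; prints "Submitted batch job <jobid>"
squeue -u $USER    # list only your own queued/running jobs
scancel 12345      # cancel job 12345
sinfo              # show partitions and node states
sacct -j 12345     # accounting info for job 12345
srun --pty bash    # request an interactive shell on a compute node
```

Note that these only work on the cluster itself, where the Slurm client tools are installed.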