Slurm Workload Manager
The standard usage model for an HPC cluster is that you log into a front-end server or web portal and from there launch applications to run on one or more back-end servers. The software tool that manages this is called a workload manager or batch scheduler; the one used on Kay is the widely used Slurm workload manager.
The typical way a user will interact with compute resources managed by a workload manager is as follows:
- Write a job script which describes the resources required (e.g. how many CPUs and for how long), instructions such as where to write standard output and error, and the commands to run once the job starts.
- Submit the job to the workload manager which will then start the job once the requested resources are available.
- Once the job completes, the user will find all results, as well as any output that would normally appear on screen, in the previously specified files.
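The steps above can be sketched as a short shell session. The script contents, filename and project ID below are illustrative assumptions, not cluster-specific requirements:

```shell
# Step 1: write a job script describing the resources and the commands to run.
cat > myjob.sh <<'EOF'
#!/bin/sh
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH -A myproj_id
echo "hello from $(hostname)"
EOF

# Step 2: submit it to the workload manager (requires a Slurm cluster;
# prints "Submitted batch job <jobid>"):
#   sbatch myjob.sh
# Step 3: after completion, output that would have appeared on screen is in
# the job's output file (slurm-<jobid>.out by default):
#   cat slurm-<jobid>.out
```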
The two types of job commonly used are:
- Interactive : Request a set of nodes on which to run an interactive bash shell. This is useful for quick tests and development work. These types of jobs should only be used with the DevQ queue. For example, the following command submits an interactive job requesting 1 node for 1 hour, charged to myproj_id:
srun -p DevQ -N 1 -A myproj_id -t 1:00:00 --pty bash
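Once the allocation is granted, the shell runs on the compute node rather than the login node. The session below is a sketch only, since srun requires a Slurm cluster:

```shell
# Request an interactive shell (blocks until a DevQ node is free):
srun -p DevQ -N 1 -A myproj_id -t 1:00:00 --pty bash
# Inside the new shell:
hostname   # now prints the compute node's name, not the login node's
exit       # end the shell and release the allocation
```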
- Batch : A script is submitted for later execution whenever the requested resources become available. Constraints, required information and instructions are given both within this script and on the command line when submitting the job. The file must be a shell script (i.e. start with
#!/bin/sh) and Slurm directives must be preceded by
#SBATCH. The sample script below requests 4 nodes (each with 40 cores) for 20 minutes to run an MPI application, and can be submitted with the sbatch command:
#!/bin/sh
#SBATCH --time=00:20:00
#SBATCH --nodes=4
#SBATCH -A myproj_id

module load intel/2019
mpirun -np 80 ./a.out
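Submitting and monitoring the job might look like the following. The filename is an assumption, and these commands only work on a Slurm cluster, so this is a sketch rather than a runnable recipe:

```shell
sbatch mybatchjob.sh    # prints: Submitted batch job <jobid>
squeue -u $USER         # list your pending and running jobs
scancel <jobid>         # cancel the job if needed
# By default, standard output and error are written to slurm-<jobid>.out
# in the directory the job was submitted from.
```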
Further details are available elsewhere, including: