Kay User Guide

Hardware and Architecture

When you connect to Kay using ssh, your connection will automatically be routed to one of three login nodes (login1, login2, or login3). These nodes are intended for interactive tasks such as compiling code and editing and managing files. They are shared with other users, so they should not be used for compute-intensive workloads.

Apart from the login nodes, the vast majority of Kay is made up of compute nodes. These nodes are used for running compute-intensive jobs that are managed by a batch system. Users submit batch scripts to request node(s) for each job, which are then placed into a scheduling system before compute nodes are allocated to execute the job.

Further hardware and architectural details of Kay are described in our Infrastructure section.

Our firewall policy allows SSH access to Kay with password authentication only from an IP address within a member university's network. Users may configure key-authenticated SSH access, which enables access from the wider internet. Please note that, for security reasons, key-authenticated access requires you to generate your SSH public-private key pair on a computer other than Kay (e.g. with ssh-keygen on Linux/Mac, or PuTTYgen and Pageant on Windows); your private key should NOT be stored on Kay. See our tutorial, Setting up SSH Keys, for more details.

Data Storage

When using ICHEC systems your files will be stored in two locations:

Home: /ichec/home/users/username

Work: /ichec/work/projectname

Home has a relatively small storage quota (25GB) and should be used for personal files and source code related to your use of the system. It is not suited to storing large volumes of data such as simulation results.

Work is an area of common storage, with a much larger quota, for use by all the members of a project. In practice this is where the majority of files should be stored. Note that only home directories are backed up to tape; project directories under /ichec/work/projectname are NOT backed up. The backup of home directories is intended only for disaster recovery. We do not accommodate bespoke data recovery requests from users, e.g. for accidental file deletions.

While a job is running on the compute nodes, it also has access to two scratch directories (/scratch/global/ and /scratch/local/) for storing temporary files. While /scratch/global/ is just a temporary directory similar to others in your home or work directories, /scratch/local/ points to local SSD drives on individual compute nodes that can provide fast read/write access. All compute nodes have 400GB SSDs for local scratch storage, apart from the High Memory nodes, which have 1TB SSDs. Please keep in mind that files stored in scratch storage last only for the duration of the job: once the job ends, everything in the scratch directories is deleted.
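Because scratch contents are deleted when the job ends, a job script needs to copy anything worth keeping to the work area before it exits. The following is a minimal sketch of that final step. The function name save_scratch_results and the "results" subdirectory are illustrative assumptions, not part of Kay's tooling.

```shell
# save_scratch_results SRC DEST
# Copy everything under SRC (including hidden files) into DEST,
# creating DEST if needed. Hypothetical helper, for illustration only.
save_scratch_results() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}

# Typical use as the last step of a job script. The "results"
# subdirectory is an assumed name; substitute your own project path.
#   save_scratch_results /scratch/local /ichec/work/projectname/results
```

Copying once at the end of the job, rather than writing results directly to the work area, lets the application benefit from the fast local SSD while it runs.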

Environment Modules

We support a range of software packages on Kay - a detailed list is in our Software section. In order to make use of any specific software package, you must load its appropriate module(s).

Loading a module typically sets or modifies some environment variables, e.g. the PATH variable (so that the shell knows where to look for the relevant executable binaries, libraries, etc. for a particular software package).
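To illustrate the effect, the sketch below mimics by hand what loading a module typically does to PATH. The install prefix is hypothetical; real modules on Kay may use different locations and may set further variables.

```shell
# Hypothetical install prefix, used only for illustration.
prefix=/opt/example/intel/2019

# Prepend the package's bin directory so its executables are found first.
PATH="$prefix/bin:$PATH"
export PATH

# The first PATH entry is now the package's bin directory.
first_entry=${PATH%%:*}
echo "$first_entry"
```

In practice you would not edit PATH yourself; the module command makes these changes and can also undo them cleanly when the module is unloaded.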

You can load the appropriate modules (software, compilers, etc.) with:

module load modulename

# Load the software module
module load intel/2019

Note: The software-specific module load commands must be present in your job submission scripts, before the software-specific run command.

Some other useful module commands are:

# List the loaded modules 
module list

# Unload a loaded module
module unload intel/2019

Software Packages

The details of software applications available on Kay via modules can be found in our Software section.

Workload Manager (SLURM)

We use SLURM to allocate compute resources on Kay. In order to submit jobs to the compute nodes you should:

  • Write a job script which describes the resources required (e.g. how many CPUs and for how long), instructions such as where to redirect standard output and error, and the commands to be executed once the job starts.
  • Submit the job to the workload manager which will then start the job once the requested resources are available.
  • Once the job completes, you will find the results generated by the job on the filesystem (e.g. expected application output files, special files that contain the standard output/error generated by the job).

The jobs can be:

  • Interactive : With interactive jobs, you can request a set of nodes to run an interactive bash shell on. This is useful for quick tests and development work. For example, the following command will submit an interactive job requesting 1 node for 1 hour to be charged to myproj_id:
srun -p DevQ -N 1 -A myproj_id -t 1:00:00 --pty bash

Note: Interactive jobs should only be submitted to the DevQ queue.

  • Batch : For each batch job, a job script is submitted for execution whenever the requested resources are available. A sample script is displayed below which requests 2 nodes (each with 40 cores, i.e. 80 cores in total) for 20 minutes to run an MPI application:
#!/bin/sh 

#SBATCH --time=00:20:00
#SBATCH --nodes=2
#SBATCH -A myproj_id
#SBATCH -p DevQ

module load intel/2019

mpirun -n 80 mpi-benchmarks/src/IMB-MPI1

Note: The file must be a shell script (e.g. with #!/bin/sh as its first line, as above) with Slurm directives preceded by #SBATCH.

To submit the batch job, use the sbatch command:

sbatch mybatchjob.sh
  • Multiple Serial Jobs (Task Farming) : For running multiple serial jobs using Slurm, refer to our Task Farming tutorial.
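In the sample batch script above, the task count passed to mpirun (80) is the number of nodes multiplied by the 40 cores per node. The following sketch derives that count rather than hard-coding it, assuming Slurm sets SLURM_JOB_NUM_NODES inside the job. The fallback value of 2 is an assumption that only makes the snippet runnable outside a job.

```shell
cores_per_node=40                    # cores per node, as stated above
nodes=${SLURM_JOB_NUM_NODES:-2}      # set by Slurm in a job; fallback is an assumption
ntasks=$((nodes * cores_per_node))
echo "$ntasks"

# In the job script, the launch line would then read:
#   mpirun -n "$ntasks" mpi-benchmarks/src/IMB-MPI1
```

With this approach, changing the --nodes directive does not require editing the mpirun line as well.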

Job Reason Codes

These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

These codes are listed in the Slurm documentation.
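The reason code for a waiting job can be displayed with squeue. The sketch below is an illustration under stated assumptions: the helper job_id_from is hypothetical and relies on sbatch's usual confirmation line, "Submitted batch job <id>".

```shell
# Extract the numeric job ID from sbatch's confirmation message.
# Hypothetical helper, for illustration only.
job_id_from() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On Kay (needs Slurm, so shown as comments here):
#   out=$(sbatch mybatchjob.sh)
#   id=$(job_id_from "$out")
#   squeue -j "$id" -o "%i %T %r"   # job ID, state, reason
```

The %r field in the squeue format string prints the reason the job is in its current state.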

Backup Policy

As stated in our Acceptable Usage Policy, backups are only made of users' home directories. Project directories under /ichec/work/projectname are NOT backed up. Furthermore, backups are only carried out as part of our system failure recovery plan; the restoration of user files deleted accidentally is not provided as a service.

Support

The Helpdesk is the main entry point to ICHEC's support team for users. Here you can get help in using the service, find out more about ICHEC, or send us your comments. If the documentation on this site does not resolve your query, do not hesitate to contact ICHEC through the Helpdesk.