Ireland's High-Performance Computing Centre | ICHEC

Stokes and Stoney

Contents

  1. Getting Started
  2. Environment, Applications & Development
  3. Batch Processing
  4. Performance
  5. Misc

1. Getting Started

This documentation relates to the National Service systems Stokes and Stoney.

If you are not currently an ICHEC user then you should visit our Services section first to determine how you would like to become a new user. Possible options are: (a) submit a project application through the Full National Service, (b) join an existing project, or (c) gain access through your institution if you are affiliated with University College Dublin or National University of Ireland, Maynooth.

All use of the National Service systems is subject to the ICHEC Acceptable Usage Policy (AUP).

1.1 Logging In

When registration is complete you can log in using SSH. This will be installed by default on most Unix style systems. Windows users will need to download and install an SSH client such as OpenSSH or PuTTY. By default most users will use the Stokes system; however, users whose applications require large amounts of memory per node will also be able to request access to Stoney. From the command line you can log on using the commands:

  • Stokes: ssh username@stokes.ichec.ie
  • Stoney: ssh username@stoney.ichec.ie

If you wish to run X windows based graphical applications use the -X ssh flag.

Once you have an account you may join multiple projects subject to the approval of the project Principal Investigator.

1.2 Transferring Files

File Transfer is available via sftp

  • Stokes: sftp username@stokes.ichec.ie
  • Stoney: sftp username@stoney.ichec.ie

scp is also available

  • Stokes: scp text.tar username@stokes.ichec.ie:
  • Stoney: scp text.tar username@stoney.ichec.ie:

There are a number of graphical applications such as WinSCP or FileZilla which can also be used for file transfer.

1.3 The Helpdesk

The Helpdesk is the main entry point to ICHEC's support team for users. Here you can get help in using the service, find out more about ICHEC or send us your comments. If the documentation on this site does not resolve your query do not hesitate to use it to contact ICHEC.

1.4 Unix

As is generally the case ICHEC High Performance Computing (HPC) systems use the Unix style Linux operating system. An introduction to this system can be found here.

1.5 Directories

When using ICHEC systems your files will be stored in two locations:

  • Home: /ichec/home/users/username
  • Work: /ichec/work/projectname

Home has a relatively small storage quota and should be used for personal files and source code related to your use of the system. It is not suited to storing large volumes of simulation results, for example. Work is an area of common storage for use by all the members of a project, and it will normally have a much larger quota. In practice this is where the majority of files should be stored. Note that only home directories are backed up to tape; project directories under /ichec/work/projectname are NOT backed up.

1.6 Login and Compute Nodes

When you connect to the Stokes system using ssh as described above, your connection will automatically be routed to one of two login nodes, stokes1 or stokes2. Similarly, when you connect to Stoney you will be routed to stoney4. These nodes, sometimes called frontend nodes, are used for interactive tasks like compiling code, editing files and managing files. They are shared by users and should not be used for intensive computation.

In order to connect to Stokes or Stoney it is necessary to connect from a machine with an IP address which belongs to one of ICHEC's participant institutions. Thus if you wish to connect from home or while travelling you must first connect to a machine in your home institution and then connect to Stokes or Stoney.

The vast majority of the Stokes and Stoney systems are made up of compute nodes. These nodes are used for running jobs that are submitted to the system. They are sometimes referred to as backend nodes. They are dedicated to a single user at a given time and can be used for intensive long term workloads.

1.7 Backup Policy

As stated in our Acceptable Usage Policy, backups are only made of users' home directories. Project directories under /ichec/work/projectname are NOT backed up. Furthermore, backups are only carried out as part of our system failure recovery plan; the restoration of user files deleted accidentally is not provided as a service.

2. Environment, Applications & Development

2.1 Modules

The large array of software packages installed means that incompatibilities are inevitable. To minimise the problems this can cause, you must load the appropriate module(s) in order to use a software package that is not part of the base operating system. Loading a module generally sets environment variables such as your PATH. To see what modules are available type:

module avail

You can then load the appropriate modules as follows:

module load intel-cc
module load intel-fc

As well as being loaded at compile time, the necessary modules must also be present at runtime on the compute nodes. If these modules are not loaded the program is likely to crash because it cannot find the required libraries. They can be loaded in two ways: you can use the PBS directive #PBS -V to import your current environment settings at submission time, or you can add module load package_name commands to the submission script itself.

Other useful module commands:

  • module unload intel-cc (removes that module)
  • module list (lists the modules you are using at the moment)

For more information on modules see: Using Modules or type man module. Note: the default module for a given package points to the most recent version of that package. To use an older version specify the name explicitly.

2.2 Compilers

Both the GNU and Intel compiler suites are available on Stokes and Stoney.

The GNU suite is available by default; to use the Intel compilers you must load the relevant modules (intel-cc, intel-fc). In general the Intel compilers should give better performance and are recommended.

              Intel Compilers   MPI wrappers around Intel compilers
C             icc               mpicc
C++           icpc              mpicxx
Fortran 77    ifort             mpif77
Fortran 90    ifort             mpif90
OpenMP        yes               -
Intel compilers on Stokes and Stoney.

              GNU Compilers     MPI wrappers around GNU compilers
C             gcc               mpicc
C++           g++               mpicxx
Fortran 77    gfortran          mpif77
Fortran 90    gfortran          mpif90
OpenMP        no                -
GNU compilers on Stokes and Stoney.

2.3 OpenMP

When using OpenMP on Stokes you need to be aware that Hyperthreading is enabled by default. This means that each physical core can appear as two logical cores. Thus by default an OpenMP program will typically try to use 24 threads rather than 12 as one might expect. Typical HPC workloads will not benefit from oversubscribing the physical cores unless the code is constrained by I/O.

The environment variable OMP_NUM_THREADS is normally used to control how many threads an OpenMP program will use. It can be set in the PBS job script prior to launching the program as follows:

export OMP_NUM_THREADS=12

Hyperthreading is not supported on Stoney. Further information can be found in Intel's OpenMP documentation.

2.4 MPI

There are a number of MPI libraries available, and sometimes it is preferable to use one rather than another. However, unless you have a specific reason to do so, it is recommended to use the default libraries.

For the Stokes and Stoney systems there are two MPI modules to choose between:

  • module load mvapich2-gnu For use when using the GNU compiler suite.
  • module load mvapich2-intel For use when using the Intel compiler suite.

These modules provide support for MPI2 and the Infiniband based networking used in these machines. They also provide the compiler wrapper scripts listed in the tables in the Compilers section above, which greatly simplify compiling and linking MPI based codes.

To run an MPI job, the mpiexec command is used in a job submission script e.g. mpiexec ./my_prog my_args.

Again there is a man page for mpiexec for more details. See also Batch Processing below.

The Intel (intel-mpi) MPI libraries are also available, however the mvapich2 libraries are recommended. On Stokes the SGI (sgi-mpt) MPI libraries are available too.

2.5 Mixing MPI/OpenMP with MVAPICH2

By default, MVAPICH2 sequentially attaches MPI processes to cores when starting a program (by calling sched_setaffinity() during MPI_Init()). This is done sequentially, which means that on each node the MPI process of lowest rank will be attached to core number 0. Then the process with the next rank up will be attached to core number 1, and so on until there are no MPI processes left. The above core numbers are the logical numbers not the physical ones.

The following diagrams show the per node system architectures of Stokes and Stoney:

[Figure: Stokes system architecture]

[Figure: Stoney system architecture]

Hence, if you want to mix MPI and OpenMP, the first thing you have to do is change this MPI process to core attachment, because it is inherited by all the OpenMP threads spawned by a given MPI process. Left unchanged, all those threads would run on a single core, leading to extremely poor performance. The attachment is managed through an environment variable, MV2_CPU_MAPPING, which lists in rank order the cores to which each MPI process should be attached. The default behaviour corresponds to:

  • Stokes: MV2_CPU_MAPPING="0:1:2:3:4:5:6:7:8:9:10:11"
  • Stoney: MV2_CPU_MAPPING="0:1:2:3:4:5:6:7"

The core lists corresponding to each MPI process are separated by colons, and the cores within a list by commas. On Stokes, 4 typical MPI / OpenMP attachment policies may be explored, sorted by decreasing likelihood of efficiency:

  • OMP_NUM_THREADS=6 and MV2_CPU_MAPPING="0,1,2,3,4,5:6,7,8,9,10,11"
  • OMP_NUM_THREADS=3 and MV2_CPU_MAPPING="0,1,2:3,4,5:6,7,8:9,10,11"
  • OMP_NUM_THREADS=2 and MV2_CPU_MAPPING="0,1:2,3:4,5:6,7:8,9:10,11"
  • OMP_NUM_THREADS=12 and MV2_CPU_MAPPING="0,1,2,3,4,5,6,7,8,9,10,11"

On Stoney, 3 typical attachment policies may be explored, also sorted by decreasing likelihood of efficiency:

  • OMP_NUM_THREADS=4 and MV2_CPU_MAPPING="0,2,4,6:1,3,5,7"
  • OMP_NUM_THREADS=2 and MV2_CPU_MAPPING="0,2:1,3:4,6:5,7"
  • OMP_NUM_THREADS=8 and MV2_CPU_MAPPING="0,2,4,6,1,3,5,7"

Both environment variables have to be exported in the PBS job script, and the mpiexec command line should resemble the following, where [12|8] stands for the number of cores per node (12 on Stokes, 8 on Stoney):

mpiexec -npernode $(( [12|8] / $OMP_NUM_THREADS )) my_program my_arguments
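The arithmetic in that command line can be sketched as a small shell helper. This is only an illustration: the function name mpi_procs_per_node is hypothetical and is not part of MVAPICH2 or PBS. It divides the cores per node by the thread count and rejects combinations that do not divide evenly, since those would leave cores idle or shared.

```shell
#!/bin/bash
# Hypothetical helper: number of MPI processes to place on each node.
# $1 = cores per node (12 on Stokes, 8 on Stoney)
# $2 = OpenMP threads per MPI process
mpi_procs_per_node() {
    local cores=$1 threads=$2
    if [ "$threads" -le 0 ] || [ $(( cores % threads )) -ne 0 ]; then
        echo "threads ($threads) must evenly divide cores per node ($cores)" >&2
        return 1
    fi
    echo $(( cores / threads ))
}

# Example for Stokes: 2 MPI processes per node, 6 threads each.
export OMP_NUM_THREADS=6
export MV2_CPU_MAPPING="0,1,2,3,4,5:6,7,8,9,10,11"
echo "mpiexec -npernode $(mpi_procs_per_node 12 "$OMP_NUM_THREADS") ./my_program my_arguments"
```

The last line only prints the command that would be run; in a real job script you would call mpiexec directly.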

2.6 MKL

The Intel Math Kernel Library (MKL) is a very useful package. It provides optimised and documented versions of a large number of common mathematical routines. It supports both C and Fortran interfaces for most of these. It features the following routines:

  • Basic Linear Algebra Subprograms (BLAS); vector, matrix-vector, matrix-matrix operations.
  • Sparse BLAS Levels 1, 2, and 3.
  • LAPACK routines for linear equations, least squares, eigenvalue, singular value and Sylvester's equation problems.
  • ScaLAPACK routines.
  • PBLAS routines for distributed vector, matrix-vector and matrix-matrix operations.
  • Direct and iterative sparse solver routines.
  • Vector Mathematical Library (VML) for computing mathematical functions on vector arguments.
  • Vector Statistical Library (VSL) for generating pseudorandom numbers and for performing convolution and correlation.
  • General Fast Fourier Transform (FFT) functions for fast computation of Discrete FFTs.
  • Cluster FFT functions.
  • Basic Linear Algebra Communication Subprograms (BLACS)
  • GNU multiple precision arithmetic library.

If your code depends on standard libraries such as BLAS or LAPACK, it is recommended that you link against the MKL versions for optimal performance.

Parallelism in a program can be achieved at the process level, as in most MPI development, at the thread level, as in OpenMP development, or in some mix of these approaches, a so-called hybrid code. The most common mode of development on our systems is MPI based, as this allows you to write programs which can run across many nodes. Often such codes will want to call routines provided by MKL. However, many of these routines are themselves parallel, so at the node level one is left with two levels of parallelism contending with one another. To eliminate this the MKL module sets the environment variable MKL_NUM_THREADS=1. If you are writing hybrid code or pure OpenMP code that uses MKL you may need to override this setting. Chapter 6 of the MKL user guide explains in detail how this and other related environment variables can be used. Note that if you have used a version of MKL older than 10.0 you should be aware that MKL's method for controlling thread numbers has changed.

This issue can also be addressed by explicitly linking the sequential version of the libraries which can be found in the /ichec/packages/intel/mkl/mkl_version/lib/em64t directory and are identified by a _sequential in the name. Note you are also required to link the pthread library.

Extensive high quality MKL documentation can be found in the /ichec/packages/intel/mkl/mkl_version/doc/ directory. Remember that when a code is linked against MKL you will need to have the MKL module loaded via the submit script when running the code. Further ICHEC documentation on MKL can be found here.

3. Batch Processing

To try to utilise compute resources in a fair and efficient manner, all compute jobs must be run through the batch queueing system. The system supports three main classes of jobs:

  • Production jobs - These are day to day production jobs which potentially run for long periods over large numbers of cores.
  • Development jobs - Development jobs are generally of short duration over a limited number of cores and are typically used for testing and developing while modifying code.
  • Interactive development jobs - Such jobs have the same purpose as regular development jobs however when the submission takes place you are given a command prompt on one of the allocated backend nodes from where you can run commands interactively, much as you would were a queueing system not in place.

By specifying how many processor cores you need and for how long, the system can mix and match resource timeslots with jobs from multiple users. The most common operations you will need to perform with the batch system are submitting jobs, monitoring the queues or canceling your jobs.

As detailed in the next section it is straightforward to submit jobs to a specific queue. However, in general, allowing the system to decide which queue to use will give the best results. This decision is based on the requested walltime and the number of cores requested. Hence it is in your interest to try to provide a reasonably accurate walltime.

3.1 Sample PBS script for Stokes or Stoney

Before submitting a job you normally prepare a PBS script.

#!/bin/bash
#PBS -l nodes=4:ppn=12
#PBS -l walltime=1:00:00
#PBS -N my_job_name
#PBS -A project_name
#PBS -r n
#PBS -j oe
#PBS -m bea
#PBS -M me@my_email.ie
#PBS -V

cd $PBS_O_WORKDIR
mpiexec ./my_prog my_args

The # symbol is required at the start of each PBS directive. The line #PBS -l nodes=4:ppn=12 requests 48 processor cores in this case, i.e. 4 nodes each of which has 12 cores. As each Stokes node has 12 cores this ppn figure will be fixed. However, each Stoney node has 8 cores, so on Stoney this figure should be set to 8, and the resulting job request will be for 32 cores.
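As a rough sketch of how a core count maps onto this directive, the helper below prints the resource request line for either system. The function name pbs_nodes_line is hypothetical, and rounding up to whole nodes is an assumption based on the fixed ppn values described above.

```shell
#!/bin/bash
# Hypothetical helper: print the "#PBS -l nodes=..." line for a core count.
# $1 = system name (stokes or stoney), $2 = total cores wanted.
pbs_nodes_line() {
    local system=$1 cores=$2 ppn nodes
    case "$system" in
        stokes) ppn=12 ;;   # 12 cores per Stokes node
        stoney) ppn=8 ;;    # 8 cores per Stoney node
        *) echo "unknown system: $system" >&2; return 1 ;;
    esac
    # Round up to whole nodes, since ppn is fixed per system (assumption).
    nodes=$(( (cores + ppn - 1) / ppn ))
    echo "#PBS -l nodes=${nodes}:ppn=${ppn}"
}

pbs_nodes_line stokes 48   # 4 nodes of 12 cores
pbs_nodes_line stoney 32   # 4 nodes of 8 cores
```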

The line #PBS -l walltime=1:00:00 requests a walltime of 1 hour. If the job does not complete before this time the system will kill it. #PBS -N my_job_name sets the job name as it will appear in the queue.

The project_name is used to associate core hours used with a given project. You may only specify projects you are a member of. #PBS -r n indicates that the job should not automatically be rerun if it fails. #PBS -j oe joins the output and error streams into a single file. To receive a mail at the address specified with -M when a job begins, ends or aborts use #PBS -m bea.

The #PBS -V directive is very important if you do not explicitly load modules in the PBS script as it causes environment settings to be imported from the submission environment to the runtime environment. At this point we change to the working directory and start the job using mpiexec. If the job is solely based on OpenMP and so runs on one node you do not use mpiexec.

3.2 Submitting Jobs

You can choose to explicitly send your job to a given queue using the #PBS -q directive or the qsub -q command.

To see what queues are available use the qstat -q command. Note that not all queues listed by qstat -q are available to users and that the Walltime and Node columns list the maximum runtime and node count for jobs in that queue.

Note: the queue names used on Stokes differ from those on previous systems.

To submit a PBS script type qsub scriptname.pbs

Sometimes, for debugging purposes, it can be useful to launch a shell as a batch job and get an interactive session on compute nodes where you can see immediately what happens when launching a program. In these cases, an Interactive Job can be used. Note that interactive jobs will only run in the DevQ region. For example, to test an MPI program on 24 cores, you could request an interactive job for 30 minutes and then be given a shell on one of the 2 compute nodes:

username@stokes1> module load intel-fc
username@stokes1> module load mvapich2-intel
username@stokes1> mpif77 -o hello -freeform hello_mpi.f
username@stokes1> qsub -I -l nodes=2:ppn=12,walltime=0:30:00 -V -A sys_test
qsub: waiting for job 32374.stokes-svcs.ice.ichec.ie to start
qsub: job 32374.stokes-svcs.ice.ichec.ie ready
username@r4i2n15> mpiexec ./hello
node 2 :Hello, world
node 3 :Hello, world
node 4 :Hello, world
node 5 :Hello, world
node 6 :Hello, world
node 7 :Hello, world
node 1 :Hello, world
node 0 :Hello, world
node 8 :Hello, world
node 9 :Hello, world
node 14 :Hello, world
node 13 :Hello, world
node 10 :Hello, world
node 15 :Hello, world
node 11 :Hello, world
node 12 :Hello, world
node 16 :Hello, world
node 20 :Hello, world
node 21 :Hello, world
node 19 :Hello, world
node 18 :Hello, world
node 22 :Hello, world
node 17 :Hello, world
node 23 :Hello, world
username@r4i2n15>

Note however that this method should only be used for debugging and not for production runs as network breaks or timeouts will kill the job. Also, please exit the shell when you are no longer using the interactive session so that the resources can be released for other users.

3.3 Monitoring Jobs

The showq command displays information on the current status of jobs.

showq - status of jobs.

showq -w user=$USER - status of your own jobs only.

showq -w acct=myaccount - status of jobs running under specified account.

To cancel a job you should use the canceljob command.

canceljob JOBID - cancels a job

3.4 Frequently Used Batch System Commands

qsub SUBMIT_SCRIPT         submit jobscript to PBS
qsub -I                    submit an interactive-batch job
qsub -q queue_name         submit job directly to specified queue
qstat -q                   list all queues on system
qstat -Q                   list queue limits for all queues
showq                      list all running, queued and blocked jobs
showq -u userid            list all jobs owned by user userid
showq -w acct=myaccount    list all jobs using the specified project account
showq -r                   list all running jobs
mybalance                  list the balance in CPU core hours for each project of which you are a member
qstat -f JOBID             list all information known about specified job
canceljob JOBID            delete job JOBID
qalter JOBID               modify the attributes of the job or jobs specified by JOBID

3.5 OpenMP Job Submission

If you wish to run a multithreaded code on a single node which does not use MPI then you can simply call the program from the submission script without prefacing it with the mpiexec command. The job will then have access to the cores on the node. OpenMP based codes are the most common form of this type of job.

It is possible to write a so-called hybrid code which uses both OpenMP and MPI. This means that a job can use shared memory within a node and MPI between a number of nodes. In this case you generally wish to allocate just one MPI process to each node. This process can then create worker threads to exploit the available cores. To do this you request the required number of nodes in the normal fashion, #PBS -l nodes=n:ppn=12, ensuring ppn is set to 12, or to 8 in the case of Stoney. Then you launch the job with an additional argument, mpiexec -npernode 1 ./job my_args. With npernode set to 1, a single MPI process is allocated to each node, and it is up to this process to use the available cores.
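Putting these pieces together, a hybrid job script for Stokes might look like the sketch below. It is wrapped in a shell function (the name print_hybrid_job is hypothetical) that prints the script, so it can be redirected to a file. The job name, project name, node count and ./job are placeholders, and the thread count and core mapping are the one-process-per-node settings from the MVAPICH2 section above.

```shell
#!/bin/bash
# Sketch: print a hybrid MPI/OpenMP job script for Stokes.
# my_job_name, project_name and ./job are placeholders.
print_hybrid_job() {
    cat <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -l walltime=1:00:00
#PBS -N my_job_name
#PBS -A project_name
#PBS -j oe

# Load modules explicitly instead of relying on #PBS -V.
module load intel-cc
module load mvapich2-intel

# One MPI process per node, free to run 12 OpenMP threads on all 12 cores.
export OMP_NUM_THREADS=12
export MV2_CPU_MAPPING="0,1,2,3,4,5,6,7,8,9,10,11"

cd "$PBS_O_WORKDIR"
mpiexec -npernode 1 ./job my_args
EOF
}

print_hybrid_job
```

For example, print_hybrid_job > hybrid_job.pbs followed by qsub hybrid_job.pbs would submit it. On Stoney the same sketch would use ppn=8, OMP_NUM_THREADS=8 and the 8 core mapping.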

4. Performance

4.1 Formatted Fortran IO Performance

Stokes uses a high performance Panasas based IO subsystem. The Panasas filesystem, panfs, is a true parallel filesystem and so implements the semantics of file access slightly differently from other filesystems which you may be familiar with. A side effect of this is that you may get unexpected performance results. In general, for well written software using an appropriate approach to IO, the results will be very good. However, under certain circumstances a slowdown can be seen. One such case is VASP IO.

If VASP is writing small amounts of data sequentially it may do so very slowly. By default the Intel Fortran compiler assumes that IO is not buffered, so records are written to disk after each write. This results in a large number of small writes which must all be committed separately. There is a physical limit to how many operations a disk can handle at one time. While a file will be spread across a number of disks, the limit can still be hit. Buffered IO will result in data being written in 4k blocks, which can be handled much more efficiently. To enable this either:

  • "export FORT_BUFFERED=true" before the job
  • Compile with "-assume buffered_io"

It is advisable not to run tail -f on the file while it is being produced, as this also forces flushing of the file.

4.2 Minor page fault problems

Linux is by default configured to minimise the memory footprint of each process. This means that by default all freed memory is given back to the kernel for later use. The drawback of this is that in some cases it can lead to a great many allocations and deallocations of memory while entering and leaving functions. This is especially true for some automatically allocated and deallocated Fortran local variables. Such memory allocation and deallocation can increase the runtime of jobs. For example, VASP can sometimes exhibit such behaviour. To identify such problems, you can add to your submission command line the /usr/bin/time command, as follows:

mpiexec /usr/bin/time usual_command_line

Be careful not to use the built-in 'time' command, as it would be of no use here. At the end of the execution this should give some timing information, including the number of minor page faults:

13.37user 0.01system 0:13.43elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (4major+1740minor)pagefaults 0swaps
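To turn output like this into a rate, the following sketch divides the minor page fault count by the elapsed time. The function name minor_faults_per_second is hypothetical, and the parsing assumes the default /usr/bin/time output format shown above, with the elapsed field written as m:ss.ss or h:mm:ss.

```shell
#!/bin/bash
# Sketch: minor page faults per second from default /usr/bin/time output.
# Reads the timing text on standard input.
minor_faults_per_second() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /elapsed$/) {
                    t = $i
                    sub(/elapsed$/, "", t)
                    n = split(t, part, ":")
                    secs = 0
                    # Fold h:mm:ss or m:ss.ss into seconds.
                    for (j = 1; j <= n; j++) secs = secs * 60 + part[j]
                }
                if ($i ~ /minor\)pagefaults$/) {
                    f = $i
                    sub(/^.*\+/, "", f)      # drop the "(4major+" prefix
                    sub(/minor.*$/, "", f)   # drop the "minor)pagefaults" suffix
                    minor = f
                }
            }
        }
        END {
            if (secs > 0) printf "%.1f\n", minor / secs
            else          print "n/a"
        }'
}

echo "13.37user 0.01system 0:13.43elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (4major+1740minor)pagefaults 0swaps" \
    | minor_faults_per_second
```

For the sample line this prints 129.6, i.e. roughly 130 minor page faults per second.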

If the minor page fault number is excessive, say more than a few hundred per second, you could benefit greatly from setting the following environment variables in your batch script:

export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1

Furthermore, there are only a few cases where setting these variables could be counter-productive. You could therefore consider setting them globally in your .bashrc file. Should unexpected behaviour occur, you could unset them just for the problematic jobs by adding the following lines to your script:

unset MALLOC_MMAP_MAX_
unset MALLOC_TRIM_THRESHOLD_

For a more comprehensive explanation of this issue click here.

4.3 Compute Node Optimization

Following the Stokes upgrade (Aug. 2010) the compute nodes have significantly different processors from those found in the login nodes, which were not upgraded. If you are compiling code it is important to take full advantage of the capabilities of the newer compute node processors. They have more, faster cores and support additional instructions. Normally compilers will optimise for the processor they are being run on. The login nodes, where compilation normally takes place, have Harpertown processors and the compute nodes now have Westmere processors. Adding the following flag to your Intel compiler commands will instruct the compiler to compile for the Westmere generation processor.

-axsse4.2

Using the above flag means that the resultant executable will most likely not run on the login nodes. If for some reason you need to run the code on the login nodes as well, the following flags will result in an alternate code path being produced, which will be invoked automatically should the code be run on the older processors. In this case we are most concerned with performance on the compute nodes and merely compatibility on the login nodes, which should not be used for intensive computation.

-axsse4.2 -msse3

The Stoney system uses the same processors in both the login and compute nodes and so this issue does not arise. More details can be found in the compiler man pages or in the following tutorial.

4.4 Monitoring Job Efficiency and Memory Usage

Users are encouraged to use the command "qutil" to investigate the performance of their running jobs.

qutil [ -u username | -j jobid,... | -a | -h ] [ -s ] - usage

For instance:

qutil -a - shows all of your jobs

qutil -j 5675,5677 - lists the two jobs with IDs 5675 and 5677 (but only if they belong to you)

The output from qutil gives a number of useful pieces of information. For each compute node used by a job it lists the 1, 5 and 15 minute load figures. These figures are a rough measure of how high the compute load on each node is. Ideally this value should be roughly equal to the number of cores in the node, so 12 on Stokes and 8 on Stoney. Not all codes are able to fully utilise all cores all the time, but if the figure is consistently low we recommend you contact us to discuss the implications and what options are open to you to improve it. The efficiency figure listed is based on these values and is normalised such that a best case figure is in the region of 1.0.

The memory utilisation per node is also listed. While this too can vary far too rapidly to be accurately represented over time by a utility like qutil, many HPC codes allocate the bulk of their memory at startup and only release it when the job completes, so the figure can still be useful. On Stokes each node has 24GB of RAM and on Stoney each node has 48GB. qutil allows one to easily compare utilisation on each node in a job.

5. Misc.

5.1 Quotas

Users can check their user and project disk quotas with the quota command. Once the hard quota is exceeded no more data can be written. Note that for the moment the quota command is not available on Stoney.

username@stokes1:~> quota
Disk quotas for user username (/ichec/home/users/username):
Used: 6.47 GB Soft: 10.00 GB Hard: 11.00 GB
username@stokes1:~> quota -g myproject
Disk quotas for project myproject (/ichec/work/myproject):
Used: 7.45 GB Soft: 100 GB Hard: 105 GB

NOTE: The disk usage figures displayed by the quota system are based on actual disk usage not file size. There will be a minimum of 20% difference between these two figures. Where a lot of extremely small files are present this difference may be more than 100% due to partial disk block use and performance optimisation.
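To get a feel for the difference between file size and disk usage on your own files, a sketch along the following lines can help. It assumes GNU du, whose --apparent-size option reports file sizes rather than allocated blocks; the function name usage_report is hypothetical, and the figures reported by the quota command remain the authoritative ones.

```shell
#!/bin/bash
# Sketch: compare apparent file size with actual disk usage for a directory.
# Assumes GNU du; --apparent-size is a GNU extension.
usage_report() {
    local dir=${1:-.}
    echo "File size (apparent): $(du -sh --apparent-size "$dir" | cut -f1)"
    echo "Disk usage (blocks):  $(du -sh "$dir" | cut -f1)"
}

# Report on the current directory.
usage_report .
```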

In order to find out how many core hours are available to your project, use the mybalance command as follows:

username@stokes1:~> mybalance
Project Machines Balance
--------- --------------- -------
icphy001 ANY 0
icphy001c ANY 0

This command will return the number of core hours available to all your projects (in the above examples, icphy001 and icphy001c). So for instance, if you wish to run a 32 core job for 24 hours, you will need to ensure that you have a minimum of 24*32=768 core hours on your project's account.
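The same calculation can be sketched in the shell before submitting a job. The helper names core_hours and enough_balance are hypothetical, and the balance figure passed in is whatever mybalance reports for your project.

```shell
#!/bin/bash
# Hypothetical helper: core hours needed for a job.
# $1 = number of cores, $2 = walltime in whole hours.
core_hours() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: succeed if a balance covers the job.
# $1 = balance in core hours, $2 = cores, $3 = walltime in whole hours.
enough_balance() {
    [ "$1" -ge "$(core_hours "$2" "$3")" ]
}

echo "A 32 core job for 24 hours needs $(core_hours 32 24) core hours"
if enough_balance 1000 32 24; then
    echo "A balance of 1000 core hours is sufficient"
fi
```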

5.2 ICHEC Training

A number of online lectures and tutorials can be found on our website. Please check the Education & Training page for further training courses being organised by ICHEC.

5.3 References and Further Reading