This documentation relates to the National Service systems listed below:
If you are not currently an ICHEC user then you should visit our Services section first to determine how you would like to become a new user. Possible options are: (a) submit a project application through the Full National Service, (b) join an existing project, or (c) gain access through your institution if you are affiliated with University College Dublin or National University of Ireland, Maynooth.
All use of the National Service systems is subject to the ICHEC Acceptable Usage Policy (AUP).
When registration is complete you can log in using SSH, which is installed by default on most Unix-style systems. Windows users will need to download and install an SSH client such as OpenSSH or PuTTY. By default most users will use the Stokes system; however, users whose applications require large amounts of memory per node can also request access to Stoney. From the command line you can log on using the commands:
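For example (the hostnames below are illustrative assumptions; use the login addresses given in your registration details):

```shell
# Log in to Stokes (hostname is an assumed example)
ssh myusername@stokes.ichec.ie

# Log in to Stoney (hostname is an assumed example)
ssh myusername@stoney.ichec.ie
```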
If you wish to run X windows based graphical applications use the -X ssh flag.
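For instance (hostname again an assumed example):

```shell
# Enable X11 forwarding so graphical applications display locally
ssh -X myusername@stokes.ichec.ie
```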
Once you have an account you may join multiple projects subject to the approval of the project Principal Investigator.
File Transfer is available via sftp
scp is also available
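Typical usage of both tools might look like this (hostnames are assumed examples):

```shell
# Open an interactive sftp session to your home directory
sftp myusername@stokes.ichec.ie

# Copy a local file to the remote system with scp
scp input.dat myusername@stokes.ichec.ie:~/
```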
The Helpdesk is the main entry point to ICHEC's support team for users. Here you can get help in using the service, find out more about ICHEC or send us your comments. If the documentation on this site does not resolve your query, do not hesitate to contact ICHEC through the Helpdesk.
Like most HPC systems, ICHEC's High Performance Computing (HPC) systems run the Unix-style Linux operating system. An introduction to this system can be found here.
When using ICHEC systems your files will be stored in two locations:
When you connect to the Stokes system using ssh as described above, your connection will automatically be routed to one of two login nodes, stokes1 or stokes2. Similarly, when you connect to Stoney you will be routed to stoney4. These nodes, sometimes called frontend nodes, are used for interactive tasks such as compiling code, editing files and managing files. They are shared by all users and should not be used for intensive computation.
In order to connect to Stokes or Stoney you must connect from a machine with an IP address belonging to one of ICHEC's participant institutions. Thus if you wish to connect from home or while travelling you must first connect to a machine in your home institution and then connect to Stokes or Stoney.
The vast majority of the Stokes and Stoney systems are made up of compute nodes. These nodes are used for running jobs that are submitted to the system. They are sometimes referred to as backend nodes. They are dedicated to a single user at a given time and can be used for intensive long term workloads.
As stated in our Acceptable Usage Policy, backups are only made of users' home directories. Project directories under /ichec/work/projectname are NOT backed up. Furthermore, backups are only carried out as part of our system failure recovery plan; the restoration of user files deleted accidentally is not provided as a service.
The large array of software packages installed means that incompatibilities are inevitable. To minimise the problems this can cause, you must load the appropriate module(s) before using any software package that is not part of the base operating system. Loading a module generally sets environment variables such as your PATH. To see what modules are available type:
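For example:

```shell
# List all software packages available as modules
module avail
```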
You can then load the appropriate modules as follows:
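For instance, to load one of the Intel compiler modules mentioned below:

```shell
# Load a package's module (intel-cc shown as an example)
module load intel-cc
```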
As well as loading the necessary modules at compile time, they must also be present at runtime on the compute nodes. If these modules are not loaded the program is likely to crash because it cannot find the required libraries. They can be loaded in two ways: you can use the PBS directive #PBS -V to import your current environment settings at submission time, or you can add module load package_name commands to the submission script itself.
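A sketch of the two approaches in a submission script (the module names used are illustrative):

```shell
# Option 1: import your submission-time environment wholesale
#PBS -V

# Option 2: load the required modules explicitly in the script body
module load intel-cc
module load intel-fc
```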
Other useful module commands:
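A few commonly useful subcommands (see man module for the full set):

```shell
module list            # show modules currently loaded
module unload intel-cc # unload a specific module
module purge           # unload all loaded modules
module show intel-cc   # display the environment changes a module makes
```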
For more information on modules see: Using Modules or type man module. Note: the default module for a given package points to the most recent version of that package. To use an older version specify the name explicitly.
Both the GNU and Intel compiler suites are available on Stokes and Stoney. The GNU suite is available by default; to use the Intel compilers you must load the relevant modules (intel-cc, intel-fc). In general the Intel compilers give better performance and are recommended.
| Compiler suite | MPI wrappers |
| --- | --- |
| Intel Compilers | MPI wrappers around Intel compilers |
| GNU Compilers | MPI wrappers around GNU compilers |
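A compile sketch; the wrapper names used here (mpicc, mpif90) are the usual MVAPICH2 conventions and are assumptions, not taken from the tables above:

```shell
# Serial compile with the Intel Fortran compiler
ifort -O2 -o my_prog my_prog.f90

# MPI compile using the wrapper, which adds the MPI include and link flags
mpif90 -O2 -o my_mpi_prog my_mpi_prog.f90
```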
When using OpenMP on Stokes you need to be aware that Hyperthreading is enabled by default. This means that each physical core can appear as two logical cores. Thus by default an OpenMP program will typically try to use 24 threads rather than 12 as one might expect. Typical HPC workloads will not benefit from oversubscribing the physical cores unless the code is constrained by I/O.
The environment variable OMP_NUM_THREADS is normally used to control how many threads an OpenMP program will use. It can be set in the PBS job script prior to launching the program as follows: export OMP_NUM_THREADS=12
Hyperthreading is not supported on Stoney. Further information can be found in Intel's OpenMP documentation.
There are a number of MPI libraries available and sometimes it is preferable to use one rather than another; however, unless you have a specific reason to do otherwise it is recommended to use the default libraries.
For the Stokes and Stoney systems there are two MPI modules to choose between:
These modules provide support for MPI2 and the Infiniband-based networking used in these machines. They also provide the compiler wrapper scripts listed in the tables in the previous section, which greatly simplify compiling and linking MPI-based codes.
To run an MPI job, the mpiexec command is used in a job submission script, e.g. mpiexec ./my_prog my_args. Again, there is a man page for mpiexec with more details. See also Batch Processing below.
The Intel (intel-mpi) MPI libraries are also available, however the mvapich2 libraries are recommended. On Stokes the SGI (sgi-mpt) MPI libraries are available too.
By default, MVAPICH2 attaches MPI processes to cores sequentially when starting a program (by calling sched_setaffinity() during MPI_Init()): on each node the MPI process of lowest rank is attached to core number 0, the process with the next rank up to core number 1, and so on until there are no MPI processes left. These core numbers are the logical numbers, not the physical ones.
The following diagrams show the per node system architectures of Stokes and Stoney:
Stokes system architecture
Stoney system architecture
Hence, if you want to mix MPI and OpenMP, the first thing you have to do is change this MPI process to core attachment, as it would be inherited by all the OpenMP threads spawned by a given MPI process; all those threads would then run on a single core, leading to extremely poor performance. This process to core attachment is managed through an environment variable listing, sequentially, the cores to which each MPI process should be attached. The default behaviour corresponds to:
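With MVAPICH2 this attachment is typically controlled through the MV2_CPU_MAPPING variable (the variable name is an assumption here, not stated in the text above). On a 12-core Stokes node the sequential default described corresponds to:

```shell
# One core per MPI process, assigned sequentially by rank
# (logical core numbers, one list per rank, separated by colons)
export MV2_CPU_MAPPING=0:1:2:3:4:5:6:7:8:9:10:11
```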
The core lists corresponding to each MPI process are separated by colons. On Stokes, 4 typical MPI/OpenMP attachment policies may be explored, sorted by decreasing likelihood of efficiency:
On Stoney, 3 typical attachment policies may be explored, also sorted by decreasing likelihood of efficiency:
Both environment variables have to be exported in the PBS job script and the mpiexec command line should resemble the following: mpiexec -npernode $(( [12|8] / $OMP_NUM_THREADS )) my_program my_arguments
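As a concrete sketch for Stokes, with 3 MPI processes per node and 4 OpenMP threads each (using MV2_CPU_MAPPING as the mapping variable is an assumption on our part):

```shell
# 4 OpenMP threads per MPI process
export OMP_NUM_THREADS=4

# Each MPI process gets a disjoint list of 4 cores;
# per-process lists are separated by colons
export MV2_CPU_MAPPING=0,1,2,3:4,5,6,7:8,9,10,11

# 12 cores per Stokes node / 4 threads = 3 MPI processes per node
NPERNODE=$(( 12 / OMP_NUM_THREADS ))
echo "$NPERNODE"   # 3

# Then launch with:
# mpiexec -npernode $NPERNODE ./my_program my_arguments
```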
The Intel Math Kernel Library (MKL) is a very useful package. It provides optimised and documented versions of a large number of common mathematical routines. It supports both C and Fortran interfaces for most of these. It features the following routines:
If your code depends on standard libraries such as BLAS or LAPACK, it is recommended that you link against the MKL versions for optimal performance.
Parallelism in a program can be achieved at the process level as in most MPI development, at the thread level as in OpenMP development, or in some mix of these approaches, a so-called hybrid code. The most common mode of development on our systems is MPI based, as this allows you to write programs which can run across many nodes. Often such codes will want to call routines provided by MKL. However many of these routines are themselves parallel, so at the node level one is left with two levels of parallelism contending with one another. To eliminate this, the MKL module sets the environment variable MKL_NUM_THREADS=1. If you are writing hybrid code or pure OpenMP code that uses MKL you may need to override this setting. Chapter 6 of the MKL userguide explains in detail how this and other related environment variables can be used. Note that if you have used a version of MKL older than 10.0 you should be aware that MKL's method for controlling thread numbers has changed.
This issue can also be addressed by explicitly linking the sequential version of the libraries, which can be found in the /ichec/packages/intel/mkl/mkl_version/lib/em64t directory and are identified by _sequential in their names. Note that you must also link the pthread library.
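A link line sketch; the specific library names (mkl_intel_lp64, mkl_sequential, mkl_core) follow the standard MKL 10.x sequential layering but are assumptions relative to the text above, and mkl_version remains a placeholder:

```shell
# Link against the sequential MKL libraries plus pthread
ifort my_prog.f90 -o my_prog \
    -L/ichec/packages/intel/mkl/mkl_version/lib/em64t \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread
```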
Extensive high quality MKL documentation can be found in the /ichec/packages/intel/mkl/mkl_version/doc/. Remember that when a code is linked against MKL it will be necessary for you to have the MKL module loaded via the submit script when running the code. Further ICHEC documentation on MKL can be found here.
To try to utilise compute resources in a fair and efficient manner, all compute jobs must be run through the batch queueing system. The system supports three main classes of jobs:
By specifying how many processor cores you need and for how long, the system can mix and match resource timeslots with jobs from multiple users. The most common operations you will need to perform with the batch system are submitting jobs, monitoring the queues or cancelling your jobs.
As detailed in the next section it is straightforward to submit jobs to a specific queue. However, in general allowing the system to decide which queue to use will give the best results. This decision is based on the requested walltime and the number of cores requested. Hence it is in your interest to provide a reasonably accurate walltime.
Before submitting a job you normally prepare a PBS script.
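A sketch of such a script for Stokes, assembled from the directives described below (the project name, mail address and program name are placeholders, and the use of -A as the accounting directive is the standard PBS convention):

```shell
#!/bin/bash
#PBS -l nodes=4:ppn=12        # 4 nodes x 12 cores = 48 cores on Stokes
#PBS -l walltime=1:00:00      # 1 hour wall clock limit
#PBS -N my_job_name           # job name as it will appear in the queue
#PBS -A project_name          # project to charge core hours to (placeholder)
#PBS -r n                     # do not automatically rerun the job if it fails
#PBS -j oe                    # join the output and error streams
#PBS -M my.address@example.com
#PBS -m bea                   # mail on begin, end and abort
#PBS -V                       # import the submission environment

# Change to the working directory and start the job
cd $PBS_O_WORKDIR
mpiexec ./my_prog my_args
```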
The # symbol is required at the start of each PBS directive. The line #PBS -l nodes=4:ppn=12 requests 48 processor cores in this case, i.e. 4 nodes each of which has 12 cores. As each Stokes node has 12 cores this ppn figure is fixed. However each Stoney node has 8 cores, so there this figure should be set to 8, and the resulting job request will be for 32 cores.
The line #PBS -l walltime=1:00:00 requests a walltime of 1 hour. If the job does not complete before this time the system will kill it. #PBS -N my_job_name sets the job name as it will appear in the queue.
The project_name is used to associate core hours used with a given project. You may only specify projects you are a member of. #PBS -r n indicates that the job should not automatically be rerun if it fails. #PBS -j oe joins the output and error streams into a single file. To receive a mail at the address specified with -M when a job begins, ends or aborts use #PBS -m bea.
The #PBS -V directive is very important if you do not explicitly load modules in the PBS script as it causes environment settings to be imported from the submission environment to the runtime environment. At this point we change to the working directory and start the job using mpiexec. If the job is solely based on OpenMP and so runs on one node you do not use mpiexec.
You can choose to explicitly send your job to a given queue using the #PBS -q directive or the qsub -q command.
To see what queues are available use the qstat -q command. Note that not all queues listed by qstat -q are available to users and that the Walltime and Node columns list the maximum runtime and node count for jobs in that queue.
Note: the queue names used on Stokes differ from those on previous systems.
To submit a PBS script type qsub scriptname.pbs
Sometimes, for debugging purposes, it can be useful to launch a shell as a batch job and get an interactive session on compute nodes where you can see immediately what happens when launching a program. In these cases, an Interactive Job can be used. Note that interactive jobs will only run in the DevQ region. For example, if you wanted to test an MPI program on 24 cores, you could request an interactive job for 30 minutes and then be given a shell on one of the 2 compute nodes allocated.
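For instance (using the DevQ queue named above; the -A project flag is a placeholder):

```shell
# Request an interactive 30 minute session on 2 Stokes nodes (24 cores)
qsub -I -l nodes=2:ppn=12,walltime=00:30:00 -q DevQ -A project_name
```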
Note however that this method should only be used for debugging and not for production runs as network breaks or timeouts will kill the job. Also, please exit the shell when you are no longer using the interactive session so that the resources can be released for other users.
The showq command displays information on the current status of jobs.
showq - status of jobs.
showq -w user=$USER - status of your own jobs only.
showq -w acct=myaccount - status of jobs running under specified account.
To cancel a job you should use the canceljob command.
canceljob JOBID - cancels a job
| Command | Description |
| --- | --- |
| qsub SUBMIT_SCRIPT | submit jobscript to PBS |
| qsub -I | submit an interactive-batch job |
| qsub -q queue_name | submit job directly to specified queue |
| qstat -q | list all queues on system |
| qstat -Q | list queue limits for all queues |
| showq | list all running, queued and blocked jobs |
| showq -u userid | list all jobs owned by user userid |
| showq -w acct=myaccount | list all jobs using the specified project account |
| showq -r | list all running jobs |
| mybalance | list the balance in CPU core hours for each project you are a member of |
| qstat -f jobid | list all information known about specified job |
| canceljob JOBID | delete job JOBID |
| qalter JOBID | modify the attributes of the job or jobs specified by JOBID |
If you wish to run a multithreaded code on a single node which does not use MPI then you can simply call the program from the submission script without prefacing it with the mpiexec command. The job will then have access to the cores on the node. OpenMP based codes are the most common form of this type of job.
It is possible to write a so-called hybrid code which uses both OpenMP and MPI. This means that a job can use shared memory within a node and MPI between a number of nodes. In this case you generally wish to allocate just one MPI process to each node. This process can then create worker threads to exploit the available cores. To do this you request the required number of nodes in the normal fashion, #PBS -l nodes=n:ppn=12, ensuring ppn is set to 12 (or 8 in the case of Stoney). Then you launch the job with an additional argument: mpiexec -npernode 1 ./job my_args. With npernode set to 1, a single MPI process is allocated to each node and it is up to this process to use the available cores.
Stokes uses a high performance Panasas-based IO subsystem. The Panasas filesystem, panfs, is a true parallel filesystem and so implements the semantics of file access slightly differently from other filesystems you may be familiar with. A side effect of this is that you may get unexpected performance results. In general, for well written software using an appropriate approach to IO, the results will be very good. However, under certain circumstances a slowdown can be seen. One such case is VASP IO.
If VASP is writing small amounts of data sequentially it may do so very slowly. By default the Intel Fortran compiler assumes IO is unbuffered; as a result, records are written to disk after each write. This results in a large number of small writes which must all be committed separately. There is a physical limit to how many operations a disk can handle at one time; while a file will be spread across a number of disks, this limit can still be hit. Buffered IO results in data being written in 4k blocks, which can be handled much more efficiently. To enable this either:
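The two standard Intel Fortran mechanisms for this are a compile-time option and a runtime environment variable; the specific names below are assumptions, as the text above does not state them:

```shell
# Either recompile with buffered IO enabled, e.g.:
#   ifort -assume buffered_io ...

# Or enable buffering at runtime in the job script:
export FORT_BUFFERED=true
```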
Linux is by default configured to minimise the memory footprint of each process. This means that by default all freed memory is given back to the kernel for later use. The drawback of this is that in some cases it can lead to a great many allocations and deallocations of memory while entering and leaving functions. This is especially true for some automatically allocated and deallocated Fortran local variables. Such memory allocation and deallocation can increase the runtime of jobs. For example, VASP can sometimes exhibit such behaviour. To identify such problems, you can add to your submission command line the /usr/bin/time command, as follows:
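For example, in the job script:

```shell
# Use the external time binary, not the shell built-in 'time';
# its default output includes the minor page fault count
/usr/bin/time mpiexec ./my_prog my_args
```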
Be careful not to use the built-in 'time' command, as it would be of no use here. At the end of the execution this should give some timing information, including the number of minor page faults:
If the minor page fault number is excessive, say more than a few hundred per second, you could greatly benefit from setting the following environment variables in your batch script:
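The usual candidates are the glibc malloc tunables below (their names are assumptions, not given in the text): they stop freed memory being returned to the kernel, avoiding the repeated map/unmap cycle described above.

```shell
# Keep freed memory in the process rather than trimming it back to the kernel
export MALLOC_TRIM_THRESHOLD_=-1

# Never use mmap for allocations, so frees never unmap pages
export MALLOC_MMAP_MAX_=0
```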
Furthermore, there are only a few cases where setting these variables could be counter-productive. You could therefore consider setting them globally in your .bashrc file. Should unexpected behaviour occur, you could unset them just for the problematic jobs by adding the following lines to your script:
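Assuming the malloc tunables were set globally, a problematic job can revert to the default behaviour with:

```shell
# Revert to default glibc malloc behaviour for this job only
unset MALLOC_TRIM_THRESHOLD_
unset MALLOC_MMAP_MAX_
```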
For a more comprehensive explanation of this issue click here.
Following the Stokes upgrade (Aug. 2010) the compute nodes have significantly different processors from those found in the login nodes, which were not upgraded. If you are compiling code it is important to take full advantage of the capabilities of the newer compute node processors: they have more, faster cores and support additional instructions. Normally compilers optimise for the processor they are run on. The login nodes, where compilation normally takes place, have Harpertown processors, while the compute nodes now have Westmere processors. Adding the following flag to your Intel compiler compilation commands will instruct the compiler to compile for the Westmere generation processor.
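For Westmere this is typically the SSE4.2 target option; the exact flag name below is an assumption, as the text does not state it:

```shell
# Optimise for the Westmere (SSE4.2-capable) compute nodes
ifort -xSSE4.2 -O2 -o my_prog my_prog.f90
```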
Using the above flag means that the resultant executable will most likely not run on the login nodes. If for some reason you need to run the code on the login nodes as well, the following flags will result in an alternate code path being produced which will be invoked automatically should the code be run on the older processors. In this case we are most concerned with performance on the compute nodes and merely compatibility on the login nodes, which should not be used for intensive computation.
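With the Intel compilers the 'ax' form of the target option generates the optimised path plus a baseline fallback; the exact flag name below is an assumption:

```shell
# Optimised SSE4.2 path for the compute nodes, with an automatic
# fallback path that still runs on the older login-node processors
ifort -axSSE4.2 -O2 -o my_prog my_prog.f90
```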
The Stoney system uses the same processors in both the login and compute nodes and so this issue does not arise. More details can be found in the compiler man pages or in the following tutorial.
Users are encouraged to use the command "qutil" to investigate the performance of their running jobs.
qutil [ -u username | -j jobid,... | -a | -h ] [ -s ] - usage
For instance:
qutil -a - shows all of your jobs
qutil -j 5675,5677 - lists the two jobs with IDs 5675 and 5677 (but only if they belong to you)
The output from qutil gives a number of useful pieces of information. For each compute node used by a job it lists the 1, 5 and 15 minute load figures. These figures are a rough measure of how high the compute load on each node is. Ideally this value should be roughly equal to the number of cores in the node: 12 on Stokes and 8 on Stoney. Not all codes are able to fully utilise all cores all the time, but if the figure is consistently low we recommend you contact us to discuss the implications and what options are open to you to improve it. The efficiency figure listed is based on these values and is normalised such that a best case figure is in the region of 1.0.
The memory utilisation per node is also listed. While memory use can vary far too rapidly to be accurately represented over time by a utility like qutil, many HPC codes allocate the bulk of their memory requirements at startup and only release the memory when the job completes, so the figure can be useful. On Stokes each node has 24GB of RAM and on Stoney each node has 48GB. qutil allows one to easily compare utilisation on each node in a job.
Users can check their user and project disk quotas with the quota command. Once the hard quota is exceeded no more data can be written. Note that for the moment the quota command is not available on Stoney.
NOTE: The disk usage figures displayed by the quota system are based on actual disk usage not file size. There will be a minimum of 20% difference between these two figures. Where a lot of extremely small files are present this difference may be more than 100% due to partial disk block use and performance optimisation.
In order to find out how many resources (core hours) are available to your project, use the mybalance command as follows:
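The command takes no arguments; its output lists each of your projects (such as the icphy001 and icphy001c examples mentioned below) with the remaining core hour balance:

```shell
# Show the remaining core hour balance for each of your projects
mybalance
```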
This command will return the number of core hours available to all your projects (in the above examples, icphy001 and icphy001c). So for instance, if you wish to run a 32 core job for 24 hours, you will need to ensure that you have a minimum of 24*32=768 core hours on your project's account.
A number of online lectures and tutorials can be found on our website. Please check the Education & Training page for further training courses being organised by ICHEC.