Ireland's High-Performance Computing Centre | ICHEC

User Mailing

ICHEC mail #4

Posted: 2005-09-29

Dear ICHEC users,


1. Unscheduled maintenance session
2. Update on the scheduling system (walton)
3. Funding source(s) for your project (reminder to all PIs)

1 - Unscheduled maintenance session

We would like to inform you that the user service on the IBM cluster (walton) will be suspended on Thursday 29th September from 9:00 until 13:00.

This downtime is required following the failure of the GPFS tests carried out during the planned maintenance session of 26th September. IBM engineers will carry out a number of reconfigurations and tests, including:

- Set default gateway on management nodes
- Modification of MTU
- Networking tests to validate task #2 (the MTU modification)
- Filesystem testing (GPFS)

These changes will improve the overall reliability and performance of the cluster.

Note that you will be unable to log on to walton during this period. Any jobs still running on Thursday at 9:00 will have to be killed. Normal service may resume earlier than 13:00.

The shared memory system (hamilton) will *not* be affected by this downtime, so normal service will continue on this system. See http://www.ichec.ie/status

Apologies for the inconvenience.

2 - Update on the scheduling system (walton)

Users who have submitted jobs since last weekend will have noticed that their jobs failed to start, along with other related problems such as being unable to delete jobs they had themselves submitted to the queueing system. Our system administrators have since identified the source of the problem and are taking steps to address it.

This problem has been traced to an automated update which overwrote the maui user and group, and possibly to the installation of a new service pack (SP1) on our scheduler node. The knock-on effect was that communication between the torque batch system and the maui scheduler failed, resulting in no jobs running and jobs being left in an invalid state. We had to purge the torque queue by deleting these jobs to clear them from the system. When we restarted torque and maui after the purge, newly submitted jobs would not start, citing a lack of resources despite a pool of 900+ free CPUs being available.

We have since managed to fix these problems, although the scheduler crashed once more during the day. The current status is best described as "functioning but in need of constant monitoring". We are considering re-installing, and possibly completely rebuilding, torque and maui on the scheduler node to see whether this removes the unreliability that may have been introduced by the application of SP1.

3 - Funding source(s) for your project (reminder to all PIs)

PIs have recently been contacted and asked to indicate which funding body primarily funded the work described in their application. We would like to thank all those who have already supplied this information, and ask other PIs to contact us with this information as soon as possible.

As a reminder, the original request was as follows:


We have recently been asked by our funding agency to collate statistics on the primary sources of funding supporting researchers applying for HPC resources at ICHEC.

We would appreciate it if you could let us know which of the following funding bodies is primarily funding the work described in your project proposal:

- Science Foundation Ireland (SFI)
- Higher Education Authority (HEA)
- Enterprise Ireland (EI)
- Health Research Board (HRB)
- European Commission (EC)
- Other (please specify)

If funded by SFI, please specify the grant reference number.

We would like to confirm that this information will only be used for statistical purposes.
