Updates on recent service disruption on KAY

New advice for Kay users

22 February 2021 - Following the reopening of KAY to users last Friday, 19 February, we observed one brief metadata availability timeout, which suggests that not all of the underlying issues have been resolved. To reduce the load over the weekend we paused the main queue. No further issues were seen, so the main ProdQ was re-enabled this morning.

Normal user access is now available, but we are continuing to monitor KAY. ICHEC would like to remind users of the ongoing need to be vigilant about maintaining back-ups of critical data. Users may experience further disruption until this issue is fully resolved.

If you have any queries, please contact our helpdesk by opening a ticket in the normal way or emailing support@ichec.ie.

Normal access to KAY resumed

19 February 2021 - Normal access has now been restored on KAY for all users. Queues have been re-enabled so users can submit jobs and continue with their research. We will continue to monitor the stability of KAY.

We regret this disruption and the inconvenience it has caused to users.

//ends

Update 3 - Reintroducing access for users

18 February 2021 - An upgrade to the affected firmware, which has caused restricted access to KAY in recent days, was carried out this afternoon. We are now in a position to allow reasonable access to Kay in a phased manner, starting this evening. We will carry out further monitoring and testing once user access resumes.

As users were informed this morning, we are initially re-opening access to the login nodes so that users can reach their data. No new jobs will be allowed to run while we monitor filesystem performance. However, users can continue to submit jobs to the queue. If the filesystem remains stable, we will release these jobs to begin running as soon as it is safe to do so.

No user data has been affected by this issue. ICHEC would like to remind all users to be vigilant about backing up their data at all times, as there is no second copy maintained locally.

We regret this disruption and the inconvenience it has caused to users.

//ends

Update 2 - Users continuing to experience KAY service disruption due to unstable Lustre Filesystem

17 February 2021 - ICHEC is continuing to work with its supplier's engineers to resolve the stability issues first observed on KAY on Saturday, 13th February.

Since the issue arose, we have escalated it to the highest level of severity, and engineers from our storage supplier are currently logged in and working on the system to resolve it.

We apologise to all users for this inconvenience. However, our priority is maintaining the safety of the Lustre filesystem to secure user data. 

Unfortunately, we do not yet have an update on when this service will be restored, but please be aware that we are focused on bringing services back as soon as possible.

KAY Service Disruption - ICHEC engineers are working around the clock to resolve an issue which has disrupted our services on 'KAY'. We hope to be back online as quickly as possible. 

16 February 2021

The Irish Centre for High-End Computing confirms that service was interrupted on the national high-performance computer Kay on Saturday 13th February. 

This disruption was caused by the Lustre filesystem for Kay becoming unstable. ICHEC acted immediately to secure all data for users and attempted to restore service. 

ICHEC engineers have been working around the clock since Saturday, 13th February to resolve this outage as expeditiously as possible for all users. 

ICHEC has undertaken the following actions since first becoming aware of this issue:

  • Undertaken a full assessment of the issue to discover the source of the problem.
  • Issued a direct communication to all users on 13th February at 12:10.
  • Issued a follow-up communication to all users on 13th February at 18:57.
  • Engaged in ongoing contact with suppliers, supplying them with detailed information and logs to help determine the underlying fault.
  • Allocated key personnel to this issue until it is resolved.

As ICHEC acted quickly on notification of instability in the system, all data stored on Kay is currently available and consistent. However, we cannot currently enable user access, as the corresponding load would cause instability and endanger this data.

The Irish Centre for High-End Computing is working to restore KAY as soon as possible for all users and apologises for any disruption this has caused.

Further updates will be sent directly to all users using the normal user communication email. Updates will also be available on our website and social media platforms. For further information please contact: Marie-Therese Culligan, marie-therese.culligan@ichec.ie, or Goar Sanchez, goar.sanchez@ichec.ie.

//Ends