Ireland's High-Performance Computing Centre | ICHEC

User Mailing

ICHEC mail #11

Posted: 2005-11-25

Dear ICHEC users,

Further delays to the re-opening of the user service on Walton

We regret to inform you that we will be unable to re-open the user service on Monday 28th as initially planned, because the problems which prompted the suspension of the user service have not yet been resolved. As mentioned in our previous user mailing, IBM has escalated the situation to "critical situation", and teams from IBM UK and IBM US are now working on problem determination. Technical staff from Force10, Broadcom and AMD are also involved.

The most likely cause is a problem in the driver for the on-board Gigabit Ethernet controller. Under extreme network load, the congestion control algorithm fails to manage the flow of data, so the network interface drops for a few seconds on random nodes. These nodes are then removed from GPFS, which eventually crashes user applications. We have established a workaround (reducing the MTU size) which circumvents this problem, but it has the side effect of substantially degrading the overall performance of the cluster.

A second problem which remains to be addressed, but which would not in itself justify interrupting the service, is that of kernel panics. We have made some progress on that front as well, and suspect that the panics have been caused by faulty CPUs.

The service on Walton is therefore likely to be interrupted for one more week. For further updates, see http://www.ichec.ie/status. Note that the service on the Bull system "Hamilton" is unaffected.
