The Open University
IMPACT Cluster

A joint venture between the Science & MCT faculties

We had a major breakdown on the weekend of 22 June 2013. The air-conditioning system in the JLB server room developed a fault in the early morning and the backup chiller failed to come online correctly. The systems put in place to call out the A/C engineers also failed to react in a timely manner, and this led to a total file system crash as the NetApp unit overheated and shut down. The ZFS file system was damaged and had to be rebuilt from a backup.

We had a major system update planned for the mid-August “Scheduled downtime”, so we simply brought it forward. This included upgrading the system from RedHat 5.x 64-bit to CentOS 6.x 64-bit. The move to CentOS allows us to make more software packages, and bleeding-edge versions, available to users. Currently we build by hand most software that we do not have an rpm for, which takes quite a long time; the move should speed this up considerably.

We have also upgraded the network connections to each node and between the switches. We have gone from 1Gb connections throughout to 2Gb to each node and 4Gb between the switches.

We have updated the cluster queuing system to bring it in line with the current changes in Science and MCT, removing some of the projects and user groups that we had before in favour of a much simpler layout. We now have 4 groups and 4 projects:

  • IMPACT: default group and project for all OU users
  • Science: for all Science staff
  • MCT: for all Maths, Computing and Engineering staff
  • Commercial: for all external users (not collaborators)

  • New systems have also been put in place to protect the cluster: automated shutdown scripts and a direct text message to the A/C engineer if something goes wrong. These were tested 2 days after the first event, when an update to the building management system went wrong; the engineer was on site within 20 minutes.
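
    For anyone curious how this kind of protection can work, here is a minimal Python sketch of the shutdown-and-alert logic described above. It is an illustration only, not the scripts actually running on the cluster: the thresholds, the function names and the `read_temps`, `send_text` and `shutdown` callables are all assumptions made for the example.

```python
from typing import Callable, Iterable, List

# Hypothetical thresholds in degrees Celsius; not taken from the real setup.
WARN_C = 30.0
CRITICAL_C = 40.0


def decide_action(temps_c: Iterable[float],
                  warn_c: float = WARN_C,
                  critical_c: float = CRITICAL_C) -> str:
    """Return "ok", "alert" or "shutdown" for a set of room sensor readings."""
    readings = list(temps_c)
    if not readings:
        # Assumption: missing sensor data is itself worth a call-out.
        return "alert"
    hottest = max(readings)
    if hottest >= critical_c:
        return "shutdown"
    if hottest >= warn_c:
        return "alert"
    return "ok"


def run_once(read_temps: Callable[[], List[float]],
             send_text: Callable[[str], None],
             shutdown: Callable[[], None]) -> str:
    """Perform one check: text the engineer on any fault, shut down if critical."""
    action = decide_action(read_temps())
    if action in ("alert", "shutdown"):
        send_text(f"Server room temperature check: {action}")
    if action == "shutdown":
        shutdown()
    return action
```

    In practice something like `run_once` would be called on a timer, with `send_text` wired to an SMS gateway and `shutdown` to an orderly power-down of the nodes and storage. Passing these in as callables keeps the decision logic testable without touching real hardware.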

    We apologise to all our users for the downtime and for any problems caused by bringing the rebuild forward, but this was caused by the initial A/C problem, which was out of our control.

    Geoff / Allan