We had a major breakdown over the weekend of 22 June 2013.
The air-conditioning system in the JLB server room developed a fault in the early morning and the backup chiller failed to come online correctly. The systems put in place to call out A/C engineers also failed to react in a timely manner, and this led to a total file system crash as the NetApp unit overheated and shut down. The ZFS file system was damaged and had to be rebuilt from a backup.
We had a major system update scheduled for the mid-August “Scheduled downtime”, so we simply brought it forward. This included upgrading the system from Red Hat 5.x 64-bit to CentOS 6.x 64-bit. The move to CentOS gives us access to more software packages and bleeding-edge versions for users. Currently we build by hand most software that we do not have an rpm for, which takes quite a long time; the move should speed this up considerably.
We have also upgraded the network connections to each node and between switches. We have gone from 1Gb connections throughout to 2Gb to each node and 4Gb between the switches.
We have updated the cluster queuing system to bring it in line with the current changes in Science and MCT, removing some of the projects and user groups that we had before in favour of a much simpler layout. We now have 4 groups and 4 projects.
We have moved the site away from a very old version of Drupal and gone back to flat text, as Drupal was far more than we needed.
We now have the latest version of the Intel Cluster Studio for Linux installed system wide. This will be a replacement for version 9 and will sit alongside version 11 (which is only available on the cluster).
We now have a dedicated AMD compiler available system wide. We have chosen the Portland Group PGI Fortran/C/C++ Server (combined Fortran and C/C++), version 11.10-0.