Update on recent IT Critical Incidents

On the 24th and 25th of October we had two critical incidents in IT.

24th – Network Issue

The incident on Wednesday 24th October affected both the wired and Wi-Fi networks and meant that many services were not available.  The incident started at about 13:30.  The severe impact of the incident was picked up quickly and a critical incident was called within IT.

The first meeting of the Critical Incident Team was held at 13:45.  Some staff were able to continue working but many key services were unavailable (web pages, RISIS, Trent, Agresso etc.).  Email remained available.

The nature of the incident meant that we could not use many of our standard communications channels (mail lists, status page, IT blog) to update University staff and students.  Information was emailed out individually to key contacts and Tweeted at 13:53.

Resolution:

Our Networks and Infrastructure Services teams, along with our network supplier, investigated the issue as a priority and identified what looked to be a faulty network device in our Earley Gate data centre.  The network device was disabled at about 16:30 as soon as the cause was identified.

The diagnosis was especially difficult, which is why it took about 2.5 hours.  Whilst some services were available again quite quickly after this, our staff worked into the evening to restore others, including eduroam, Skype for Business, MyID, Apps Anywhere and Managed Print.

Further work took place over the following week to determine the exact fault before the device could be re-connected to the network.

25th – Data Storage Issue

On Thursday 25th October we had another critical incident that affected our Research Data Storage service.

All storage on the Gold tier was affected and about half of the storage on the Basic tier was unavailable.  This outage was logged with our supplier at approximately 10:00.

It was flagged as a Critical Incident at 11:12. We held four Critical Incident Team meetings during that day and worked closely with our supplier on a resolution.  Following investigation by our supplier, the incident was found to have been caused by the file system manager (ZFS) locking up on one of the two nodes and the system not automatically switching over to the other node.

Resolution:

The failover was forced by our supplier and all services were restored before 16:00.  We continue to work with our supplier on determining the root cause to reduce the likelihood of this recurring.

Next Time

Following these two critical incidents, we are reviewing our Critical Incident Plan and our Communications Plan to further improve our incident response.