IT Services would like to apologise for the loss of network service on the evening of 20 October and the morning of 21 October, which affected both internal and external customers. Initially this incident appeared to affect only offsite access to resources, but the true extent became clear on the Friday morning.
Throughout the incident we were in close consultation with our suppliers, including Data Integration, our network infrastructure support company, and JANET, our internet service provider.
We endeavoured to keep staff informed via internal emails and the IT Services Status page, although we accept that these services may not have always been accessible.
We made extensive use of the IT Services Twitter feed, as this was the only way to receive IT Services communications about this incident from off campus. For those who do not have a Twitter account, our feed can be read at https://twitter.com/#!/UniRdg_ITS; for those who wish to follow us, we are @UniRdg_ITS.
As part of our commitment to continuous improvement and development of our services and infrastructure, we are holding an Incident Review Meeting on Friday 28 October and would welcome customer comments for discussion in this review. Please tell us, in your opinion:
- What went well?
- What went badly?
- What could we have done better?
Lessons learned will be added to this newsfeed article early next week.
The loss of networking at this level is regrettable and I am sure some people will have been inconvenienced, but in general the impact was minor as far as I can see. I noted some postgraduates even cleaning their offices while awaiting service restoration. In general we suffer much more when local systems go down, such as email, Blackboard, directory/file services and the like.
Overall, I thank ITS for getting the service back up quickly and restoring resilience. Outages of this kind are something we should not find unexpected from time to time.
Ian Bland
Technical Support Manager
School of Systems Engineering
I couldn’t agree more, Ian. Although this caused particular issues for one of our programmes with students who were being enrolled, it did give staff a moment to reflect and tidy/organise themselves in the physical world.
The service was back up quickly and ITS kept people informed; perhaps an hourly update of the service status during office hours could help to further instil confidence in the communication methods being used?
Personally I’m always interested in the technical fault behind problems such as these; is that kind of information made available anywhere?
David Jones
IT Manager
International Study and Language Centre
Just one point – even after the ‘status update’ webpage suggested that the data archive was accessible again, I could not fully access my N Drive. I could log in on Network Connect, but I could not connect directly to the server in a way that would allow me to see the N Drive on my Mac as if it were a normal folder.
Thanks,
Oisin
We started experiencing problems on Thursday evening in the School, but the network break only caused minor inconvenience. I was able to email the school about the problems on Friday, and the service was back by one o’clock.
For people like me who are not on Twitter, the ITS status page kept me informed. We are not all on Twitter.
What a load of sycophantic nonsense. The service “broke” just after 7pm. Not every postgraduate, other student or member of staff sits in the library or their office. Somebody in IT Services should have been alert to this; it was obvious the web system had seriously crashed. There should be enough experience in the support team to see this. The next day seemed more panic-driven than practical. The status pages were claiming the service would be out from 7pm to 7pm; this was obvious nonsense, because the fact that internal email was connecting indicated where to start looking for what was NOT connecting. Ergo I told staff at 1pm that the service would return. If it was that quick, then of course it could have been done the night before.
If the system is just automatic and no one was there, there should have been a plan for whom to contact immediately and where to start looking for the fault. What was done is very reminiscent of the ad-hoc way the service was run in the 1990s. If there was someone there that evening, what exactly were they doing?
Thanks for all the comments. These will be fed back into the incident review meeting that is being held today, 28 October 2011.
We thought it would be useful to clarify the timeline a bit more precisely as this may assist in understanding how and why we responded in particular ways.
At 18.30 on the Thursday we lost connectivity on the Earley Gate internet connection. This should not have been a problem, as the system has redundancy built in and traffic was re-routed via the Whiteknights Datacentre; however, at 18.55 this switch also failed.
After the service went down, a member of IT Services staff returned to work specifically to fix this problem, and services were restored at around 21.00. A status update was placed on Twitter, but at this time an update could not be placed on the status page as it was not accessible from offsite – this is something we are looking at separately as part of our overall service provision.
The network then went down again just after 01.00 on Friday morning. On Friday morning we started updating the status page and Twitter feed. Initially we posted a 24-hour time-to-fix, as we felt it was better to overestimate rather than underestimate, given that we were dealing with a multiple system failure.
Just after 09.00 service was restored to business critical parts of the university network using a backup route. At this time email started to flow on/off site and University websites were available off campus.
At approximately 09.30 a hardware fault was identified in one piece of network hardware and a replacement part was requested. This was delivered just before midday and installed at 12.30. At this point full internet connectivity was restored; testing continued until 14.00 when the root cause was finally identified and fixed.
On a personal note, being around IT Services on the Friday morning, there was no sense of panic in the fixing of the problem, but rather a methodical investigation of a serious fault and the proposing of remedies. There were some initial problems with accessing the status page and Twitter feed, but these were soon remedied.