Last night we upgraded our storage to fix a known bug with moving NFS services between storage nodes.

The NFS service was not running on the storage which manages the exports for mail and web server configuration.

We recovered service by restarting the NFS service and are currently working with our vendors to identify the root cause.

On 5th October 2018 we experienced a loss of network connectivity to the RUSU building, which also affected the connection providing Wi-Fi connectivity to the Open Day dome. Working on the Friday night and Saturday morning, IT and the Estates team were able to restore connections and full service.

At approximately 6:30pm on 5th October, there was an issue that caused an interruption to service between network cabinets in Black Horse House – we know this is a fault with the cabling within the building (“structured cabling”), but the root cause is still being identified.  

This caused an outage at RUSU that affected all network services, including Wi-Fi and the tills. Because it happened on the evening of the Freshers’ Ball, the impact was substantial: RUSU was unable to accept card payments. The IT networks team responded but were unable to implement a fix. They left at 11:30pm.

On Saturday (University Open Day) it was identified that the fault would also affect the Open Day Dome. Some connectivity was possible via Wi-Fi signal leaking from buildings, and IT were able to improve this. Working closely with Estates, we brought a new connection direct from the IT datacentre to RUSU into service. This new connection had only been completed on the Friday and was due to be made live shortly. This restored service to RUSU at about 1:30pm and boosted the signal to the dome.

Thanks to members of IT and Estates for giving up their Friday nights and Saturdays to fix this.

IT

UPDATE:

On Thursday 4th October at 19:00, a fix was applied to the ACT system to resolve the underlying problem with the insights database.

Certain commands run against the storage were causing high load to the insights database and consuming memory.

Prior to the fix we were consuming on average 9-10GB of memory. This was hitting the limits of memory for the service.

Post fix, we are now consuming 40MB of the 10GB memory limit.

This has been achieved by creating a cache of the database with static data, rather than accessing the dynamically changing database.
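As a rough illustration of the idea only (this is not the supplier's actual fix), the sketch below serves reads from a static copy of the data and refreshes that copy only when it is older than a chosen age. The names `SnapshotCache`, `load_snapshot` and `max_age_seconds` are hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class SnapshotCache:
    """Serve reads from a periodically refreshed static copy of a data source."""

    def __init__(
        self,
        load_snapshot: Callable[[], Dict[str, int]],  # hypothetical: reads the live database
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_snapshot = load_snapshot
        self._max_age = max_age_seconds
        self._clock = clock
        self._snapshot: Dict[str, int] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str) -> Optional[int]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            # Only this branch touches the live, dynamically changing source.
            self._snapshot = dict(self._load_snapshot())
            self._loaded_at = now
        # Every other request is answered from the static copy in memory.
        return self._snapshot.get(key)
```

The trade-off in a design like this is that answers can be up to `max_age_seconds` out of date, in exchange for far fewer queries against the live database.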

We receive around 50 requests a second and these were taking 1.5 seconds to respond. These requests are now being completed in milliseconds as expected.
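To put those figures in context, Little's law says the average number of requests being worked on at once equals the arrival rate multiplied by the response time. The short calculation below uses the 50 requests a second and 1.5 seconds quoted above; the 5 millisecond figure is an assumed example of "milliseconds", not a measured value.

```python
def requests_in_flight(arrival_rate_per_s: float, response_time_s: float) -> float:
    """Little's law: average concurrent requests = arrival rate x response time."""
    return arrival_rate_per_s * response_time_s


# Figures quoted in the update above.
before_fix = requests_in_flight(50, 1.5)

# Assumption for illustration only: a 5 ms response time after the fix.
after_fix = requests_in_flight(50, 0.005)

print(f"Before the fix: about {before_fix:.0f} requests in progress at any moment")
print(f"After the fix (assumed 5 ms): about {after_fix:.2f}")
```

On these numbers, roughly 75 requests were being held open at any moment before the fix, compared with well under one afterwards.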

We will continue to monitor with engineers from our supplier.


The service used for Research Data Storage consists of two key elements: the underlying storage system itself, ‘ADFS’ (i.e. your data), and an insights database which contains the metadata associated with this data.

The insights database polls the underlying storage system at regular intervals to identify changes to data since the last poll. This includes new, updated and removed files. The database then records where on the underlying storage system the data is held and the usage against any quotas that are in place.
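The polling described above can be sketched as a comparison between two snapshots of the file system. This is a simplified illustration under our own assumptions, not the supplier's implementation; `FileState`, `diff_snapshots` and `usage_bytes` are hypothetical names.

```python
from typing import Dict, NamedTuple, Set


class FileState(NamedTuple):
    size_bytes: int
    modified_at: float  # modification time as a Unix timestamp


class Changes(NamedTuple):
    added: Set[str]
    updated: Set[str]
    removed: Set[str]


def diff_snapshots(previous: Dict[str, FileState],
                   current: Dict[str, FileState]) -> Changes:
    """Compare two polls and report new, updated and removed file paths."""
    previous_paths = set(previous)
    current_paths = set(current)
    added = current_paths - previous_paths
    removed = previous_paths - current_paths
    # A file counts as updated if it exists in both polls but its state differs.
    updated = {p for p in previous_paths & current_paths
               if previous[p] != current[p]}
    return Changes(added, updated, removed)


def usage_bytes(snapshot: Dict[str, FileState]) -> int:
    """Total space used, for comparison against a quota."""
    return sum(state.size_bytes for state in snapshot.values())
```

In a scheme like this, every added, updated or removed file is a change the database has to process, so a very large deletion between two polls produces a correspondingly large batch of work.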

Prior to the issue yesterday we saw a large number of files (around 4TB) deleted from the underlying file system, which caused high load on the insights database. The database then ran out of memory and crashed. It recovered immediately but continued to underperform given the volume of changes it had to process.

We are working with the supplier of the storage system to identify why this change caused the database to run out of memory. We have increased the amount of memory the insights database can consume to mitigate this issue until a permanent fix has been put in place. Our supplier is working on this issue as a matter of priority.

We are actively monitoring the system along with our supplier who are monitoring remotely.

 

UPDATE:

We have now resolved the issue affecting some users being unable to access their N drive or web pages.

A fix was implemented on the file storage system that was causing this problem, and we will continue to monitor these services.


We have reports of some users being unable to access the N:/ Drive and we are working on implementing a fix.

This is due to an issue with the file storage system which is also causing some issues with web pages being unavailable.

We are working on this as a priority and will update as soon as we have any further information.

IT

UPDATE:

The following services have now been restored and should be accessible from 2pm this afternoon. Please consider the services to be at risk as we continue to monitor them.

  • NX Linux Desktops
  • Met-Cluster
  • Free Cluster
  • Select VMs on the Research Cloud
  • Met webserver
  • Select Research Data Storage Silver Shares
  • Computer Science Linux Desktops
  • SMPCS X-Drive
  • Unix home directories

We will send a further, more detailed update message tomorrow morning at 10am.

Once again we would like to apologise for the inconvenience this has caused.


UPDATE:

Engineers in America are continuing to work on the issue and have diagnosed a likely cause which they are working to resolve. We expect this remedial work to be complete by 2:30pm, but we will monitor the situation and restore the service sooner if we can. Once the service has been restored it should be considered to be at risk until we have a full diagnosis of the cause.

We will provide a further update by 2:30pm or sooner.


There is a system issue affecting the storage systems. Supplier support is currently diagnosing the cause and as a precautionary measure we are preventing further access to these systems.

Affected Services include:

  • NX Linux Desktops
  • Met-Cluster
  • Free Cluster
  • Select VMs on the Research Cloud
  • Met webserver
  • Select Research Data Storage Silver Shares
  • Computer Science Linux Desktops
  • SMPCS X-Drive
  • Unix home directories

We are continuing to work with the supplier and will provide a further update at 12:30.

We would like to apologise for the inconvenience this incident has caused. We are working to resolve this as quickly as possible.

 

N Drives were inaccessible this morning due to an IT system issue that has now been resolved.
N Drives and all other services are now back up and working fully. The issue occurred over the weekend and was resolved by 9:30 this morning.
IT is working with the supplier of the affected system to make sure services remain fully operational.
If you need further help or assistance please contact IT.

IT is aware that N Drives (Windows home directories) were inaccessible this morning but are now back up. We are investigating the cause and will update shortly.

Unix and collab shares were not affected by this outage. 

We are also investigating reported issues when trying to print. 

IT

After the failure of the CloudFS server yesterday the Academic Computing Team has worked through the night with our supplier to ensure the system is stable.

Access to all systems was restored for users in the small hours of this morning. We have monitored the NFS mounts for home directories during the night to ensure they have been restored. If you have any issues logging into NX due to a stuck session, instructions on how to terminate your session can be found here:

https://research.reading.ac.uk/act/knowledgebase/nx-terminate-session/

We will continue to monitor all systems very closely for the rest of the week, but we are confident that we have rectified the root cause of the issue, which was a known bug in the CentOS operating system.

We apologise again for the inconvenience this outage caused yesterday. The Academic Computing Team worked throughout the day and night to ensure full service was restored today.

IT

Due to an infrastructure problem, most of the University’s web pages and web services were unavailable from 13:18 to 14:24 on 30 August 2018. All services are now back up and running as normal.

We apologise for any inconvenience caused.

IT

UPDATE: This particular email and the links within do not pose a threat. However, other phishing attempts may pose a threat.


We have reports of a new phishing email claiming to come from IT (see below).

Note the sender address is not ours: “it@reading-ac.uk”

Note that there is a ‘dash’ instead of the usual ‘dot’. This is not an official University of Reading email address.

There is also a link in the email with the same fake domain.

Please always check links in emails by hovering the cursor over them.

Never respond to unsolicited emails, click on the links in them or open their attachments. Further help can be found on the IT website.
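For anyone who wants to automate the "dash instead of dot" check described above, here is a minimal sketch. It assumes the official domain is `reading.ac.uk` and that its subdomains (such as `research.reading.ac.uk`) should also be accepted; the function names are hypothetical, and a check like this does not replace inspecting the email yourself.

```python
from urllib.parse import urlsplit

OFFICIAL_DOMAIN = "reading.ac.uk"  # assumption: the University's official domain


def _domain_matches(domain: str, official_domain: str = OFFICIAL_DOMAIN) -> bool:
    """True if domain is the official domain or one of its subdomains."""
    domain = domain.lower().rstrip(".")
    return domain == official_domain or domain.endswith("." + official_domain)


def is_official_address(address: str) -> bool:
    """Check the part of an email address after the last '@'."""
    _, separator, domain = address.strip().rpartition("@")
    return bool(separator) and _domain_matches(domain)


def is_official_link(url: str) -> bool:
    """Check the host name a link actually points to."""
    host = urlsplit(url).hostname
    return host is not None and _domain_matches(host)


print(is_official_address("it@reading-ac.uk"))  # the fake address from the phishing email
print(is_official_address("it@reading.ac.uk"))
```

Comparing the whole domain, rather than searching for "reading" anywhere in the address, is what catches look-alikes such as `reading-ac.uk` or `reading.ac.uk.example.com`.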
