ACT systems issues – conclusion


On Thursday 4th October at 19.00, a fix was applied to the ACT system to resolve the underlying problem with the insights database.

Certain commands run against the storage were placing high load on the insights database and consuming memory.

Prior to the fix, we were consuming 9-10GB of memory on average, which was hitting the memory limit for the service.

Post fix, we are now consuming 40MB of the 10GB memory limit.

This has been achieved by creating a cache of the database containing static data, so that requests are served from the cache rather than from the dynamically changing database.
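
As an illustration only, the general idea of answering requests from a static snapshot instead of the live database can be sketched as below. This is not the supplier's implementation, which we have not described here; the class name SnapshotCache, the load function and the refresh interval are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class SnapshotCache:
    """Serve reads from a static copy of the data (hypothetical sketch).

    The live source is only consulted when the copy is older than
    max_age_seconds, so most requests never touch it.
    """

    def __init__(
        self,
        load: Callable[[], Dict[str, int]],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load                # assumed: returns the full data set
        self._max_age = max_age_seconds
        self._clock = clock
        self._snapshot: Dict[str, int] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str, default: Optional[int] = None) -> Optional[int]:
        now = self._clock()
        # Refresh on first use, or once the snapshot has expired.
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._snapshot = dict(self._load())
            self._loaded_at = now
        return self._snapshot.get(key, default)
```

The trade-off in a design like this is that answers can be up to max_age_seconds out of date, in exchange for keeping request load away from the live database.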

We receive around 50 requests a second, and these were taking 1.5 seconds to respond. These requests are now being completed in milliseconds, as expected.

We will continue to monitor with engineers from our supplier.

The service used for Research Data Storage consists of two key elements: the underlying storage system itself, ‘ADFS’ (i.e. your data), and an insights database, which contains the metadata associated with this data.

The insights database polls the underlying storage system at regular intervals to identify changes to data since the last poll. This includes new, updated and removed files. The database then records where on the underlying storage system the data is held and the usage against any quotas that are in place.
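
The polling step described above can be sketched roughly as follows. This is a simplified illustration under our own assumptions, not the actual insights database logic: the names FileInfo, diff_listings and usage_bytes are hypothetical, and each poll is assumed to produce a complete listing of paths with sizes and modification times.

```python
from typing import Dict, NamedTuple, Set, Tuple


class FileInfo(NamedTuple):
    """Metadata recorded per file (hypothetical fields)."""
    size_bytes: int
    modified_at: float


def diff_listings(
    previous: Dict[str, FileInfo],
    current: Dict[str, FileInfo],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare two polls and return (new, updated, removed) paths."""
    prev_paths = set(previous)
    curr_paths = set(current)
    new = curr_paths - prev_paths
    removed = prev_paths - curr_paths
    # A file counts as updated if it exists in both polls but its metadata differs.
    updated = {p for p in prev_paths & curr_paths if previous[p] != current[p]}
    return new, updated, removed


def usage_bytes(listing: Dict[str, FileInfo]) -> int:
    """Total space used by a listing, for comparison against a quota."""
    return sum(info.size_bytes for info in listing.values())
```

In a scheme like this, the work per poll grows with the number of changes since the last poll, which is why a very large deletion produces a correspondingly large batch of changes to process.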

Prior to the issue yesterday, we saw a large number of files deleted from the underlying file system (around 4TB). This caused high load on the insights database, which then ran out of memory and crashed. The database recovered immediately but continued to underperform because of the volume of changes it had to process.

We are working with the supplier of the storage system to identify why this change caused the database to run out of memory. We have increased the amount of memory the insights database can consume to mitigate this issue until a permanent fix has been put in place. Our supplier is working on this issue as a matter of priority.

We are actively monitoring the system along with our supplier, who is monitoring remotely.