By: Grenville Lister
High performance computing (HPC) is changing – there will be a new UK national service in early 2020 (and a period with no national service while the new platform is installed) – and the medium- to longer-term future is more uncertain than at any time in the last few decades. Much of the community is planning for exascale computing, with associated challenges in both storage utilisation and programmability. However, for all the changes ahead, a key issue is managing the resources we have, and will have. Here I take the opportunity to discuss this issue, drawing on my experiences with NERC HPC, but with a take-home message that should apply to other busy resource pools (e.g. departmental or institutional computing).
We usually think of compute resource in terms of node-hours – you generally pay for use of whole nodes, even if whole nodes aren’t actually being used (the unused part of your node isn’t accessible to others, so you foot the cost). On day one of a new machine, it is capable of delivering a fixed number of these over its projected lifetime; for ARCHER (the UK National HPC service), that number was approximately 212 million node-hours (4920 nodes for 24 hours per day, 360 days per year, for 5 years). On day two and each subsequent day, that number went down by 118,080 – as of July 11th 2019, ARCHER had only 25 million left. Unfortunately, node-hours disappear whether or not they are used for computation (though the energy bill is lower if they’re not computing). The same goes for resource allocations – we effectively have a NERC-ARCHER a year at a time, since resources are allocated yearly with the reset switch thrown on March 31st; a block of ARCHER node-hours allocated to a project starts to evaporate on April 1st. Obvious really, but sometimes overlooked by those of us running numerical simulations under the typical yearly resource allocation cycle. This argument is a little oversimplified; nevertheless, expecting to use a large part of an allocation at the last minute may be unrealistic or simply impossible – ultimately, the node-hours just won’t be there.
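The arithmetic above is easy to check with a few lines of Python (the node count, service-year length, and lifetime are the figures quoted for ARCHER; this is a sketch of the calculation, not anything taken from the service itself):

```python
# Node-hour budget arithmetic for a machine like ARCHER,
# using the figures quoted in the text.
NODES = 4920
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 360      # the service-year figure used above
LIFETIME_YEARS = 5

# Node-hours that evaporate every day, used or not:
burn_per_day = NODES * HOURS_PER_DAY

# Total deliverable over the machine's projected lifetime:
total_budget = burn_per_day * DAYS_PER_YEAR * LIFETIME_YEARS

print(f"Daily burn:   {burn_per_day:,} node-hours")
print(f"Total budget: {total_budget:,} node-hours")
```

Running this gives a daily burn of 118,080 node-hours and a lifetime total of 212,544,000 – the "approximately 212 million" quoted above.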
Different HPC systems try to ensure an even spread of usage over time, avoiding a mad rush at the end of an allocation period, either by imposing a use-it-or-lose-it policy in conjunction with periodic (quarterly or semi-annual) node-hour sub-distributions, or by using a clever job scheduler. None of us likes having restrictions placed upon us by HPC service providers or administrators, especially when circumstances beyond our control cause delays or otherwise prevent HPC usage as intended, but managing an even burn rate of nodes ensures that users are able to consume their full resource quota.
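The effect of a quarterly use-it-or-lose-it scheme can be sketched in a few lines. This is an illustrative model only – not any provider's actual policy – in which an annual allocation is split into four equal tranches and anything unspent at the end of a quarter is forfeited:

```python
# Illustrative sketch (hypothetical policy, not a real provider's rules):
# the annual allocation is split into four equal quarterly tranches,
# and node-hours unspent at a quarter's end are forfeited.
def consumable(annual_allocation, used_per_quarter):
    """Node-hours actually consumed over the year, given the usage
    attempted in each of the four quarters."""
    tranche = annual_allocation / 4
    # In each quarter you can spend at most that quarter's tranche.
    return sum(min(used, tranche) for used in used_per_quarter)

# A project that defers everything to the last quarter captures
# only one tranche of its million node-hours:
print(consumable(1_000_000, [0, 0, 0, 1_000_000]))   # 250000.0

# An even burn rate captures the full allocation:
print(consumable(1_000_000, [250_000] * 4))          # 1000000.0
```

The point the model makes is the one above: under such a policy, last-minute consumption simply cannot recover node-hours that earlier quarters let lapse.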
Efficient use of storage space raises, in some sense, orthogonal concerns. Space doesn’t disappear over time. It fills up, of course, but the user generally has the option to recover it; and whereas node-hours are available to all until used, storage space is reserved at the moment of allocation and can (and does) sit empty for significant lengths of time. This is less of a problem on a system such as ARCHER, where there is an understanding that data held on disc is only ever ephemeral and managing space is easy. On JASMIN (a super-data-cluster based at the Rutherford Appleton Laboratory), by contrast, where group workspaces are relatively long-lived, the challenge is to request and manage an appropriate volume, bearing in mind that several storage media may be available to support data storage on different time scales, with particular emphasis on the use of Elastic Tape for the medium term.
We in NERC do a pretty good job of consuming HPC resources, both node-hours and petabytes. I am confident that, with a community cognizant of resourcing challenges and committed to using resources efficiently, we shall continue to do so as new technologies emerge. Speaking of new technologies: the major event for ARCHER in February 2020 will be its withdrawal from service, and in May 2020 ARCHER’s successor will commence operation. We shall have a whole lot more node-hours to play with and will generate a whole lot more data – a scenario in which we anticipate that management of resources will be increasingly important.