By: Annette Osprey
High resolution modelling
Running very detailed, fine-scale (“high resolution”) simulations of the Earth’s atmosphere is vital for understanding changes to the Earth’s climate, particularly extreme events and high-impact weather [1]. However, each simulation is 1) time-consuming to set up – scientists spend a lot of time designing the experiments and perfecting the underlying science – and 2) expensive to run – it may take many months to complete a multi-decade simulation on thousands of CPUs. But the data from each simulation may be used many times for many different purposes.
Under the hood
A lot of technical work is done “under the hood” to make sure the simulations run as seamlessly and efficiently as possible, and that the results are safely moved to a data archive where they can be made available to others. This is the work that we do in NCAS-CMS (the National Centre for Atmospheric Science’s Computational Modelling Services group), alongside our colleagues at CEDA (the Centre for Environmental Data Analysis) and the UK Met Office. My role is to work with the HRCM (High Resolution Climate Modelling) team, helping scientists to set up and manage these very large-scale simulations.
CMS is responsible for making sure the simulation code, the Met Office Unified Model (UM), runs on the national supercomputer, Archer2, for academic researchers around the UK. As well as building, testing and debugging different versions of the code, we need to install the supporting software that is required to actually run the UM (we call this the “software infrastructure”). This includes code libraries, experiment and workflow management tools [2], and software for processing input and output data. This is all specialist code that we need to configure for our particular systems and the needs of our users, and sometimes we need to supplement this with our own code.
Robust workflows
We call the end-to-end process of running a simulation the “workflow”. This involves 1) setting up the experiment (selecting the code version, scientific settings, and input data), 2) running the simulation on the supercomputer, 3) processing the output data, and 4) archiving the data to the national data centre, Jasmin, where we can look at the results and share them with other scientists. When running very high resolution and/or long-running simulations we need this process to be as seamless as possible. We don’t want to have to keep manually restarting the experiment or troubleshooting technical issues.
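To give a flavour of how these stages fit together, here is a minimal Python sketch of one pass through the workflow. The function names and bodies are hypothetical placeholders; in practice the stages are driven by dedicated workflow management tools rather than a hand-written script.

```python
# Illustrative only: one pass through the four workflow stages.
# All functions are hypothetical stand-ins for the real tasks.

def set_up(experiment: str) -> None:
    # 1) choose the code version, scientific settings and input data
    print(f"configuring {experiment}")

def run_simulation(experiment: str) -> str:
    # 2) run the model on the supercomputer, producing raw output
    print(f"running {experiment}")
    return f"{experiment}_raw_output.nc"

def process_output(raw_file: str) -> str:
    # 3) post-process the raw output into archive-ready files
    print(f"processing {raw_file}")
    return raw_file.replace("raw_", "")

def archive_to_jasmin(processed_file: str) -> None:
    # 4) copy the processed data to the Jasmin archive
    print(f"archiving {processed_file}")

def run_workflow(experiment: str) -> None:
    set_up(experiment)
    archive_to_jasmin(process_output(run_simulation(experiment)))

run_workflow("my-experiment")
```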
Furthermore, the volume of data generated by these high resolution simulations is incredibly large. There is too much to store it all on the supercomputer, and it can sometimes take as long as the simulation itself to move the data to the archive. The solution, therefore, is to process and archive the data as the simulation is running. We build this into the workflow so that it can be done automatically, and we have as many of the tasks running at the same time as possible (this is known as “concurrency”).
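As a rough illustration of this concurrency, the sketch below overlaps the archiving of one month’s data with the simulation of the next. The month labels and function names are placeholders; in the real workflow this overlap is managed by the workflow management tools rather than a Python script.

```python
from concurrent.futures import ThreadPoolExecutor

def run_month(month: str) -> str:
    # Stand-in for running one month of the simulation on the supercomputer.
    print(f"simulating {month}")
    return f"output_{month}.nc"

def process_and_archive(output_file: str) -> None:
    # Stand-in for post-processing the output and copying it to the archive.
    print(f"processing and archiving {output_file}")

months = ["1950-01", "1950-02", "1950-03"]

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = None
    for month in months:
        output = run_month(month)
        if pending is not None:
            pending.result()   # make sure the previous month's archiving finished
        # Archive this month's data in the background while the next month runs.
        pending = pool.submit(process_and_archive, output)
    if pending is not None:
        pending.result()       # wait for the final month's archiving to finish
```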
The HRCM workflow
Figure 1: An example workflow for a UM simulation with data archiving to Jasmin, showing several tasks running concurrently.
Figure 1 shows the workflow we have set up for our latest high resolution simulations. We split the simulation into chunks, running one month at a time. Once one month has completed, we set the next month running and begin processing the data we just produced. The workflow design means that the processing can be done at the same time as the next simulation month is running. First we perform any transformations on the data, then we begin copying it to Jasmin. We generate unique hashes (checksums) to verify that the copy is identical to the original, so that we can safely delete the data on the supercomputer, clearing space for forthcoming output. Then we upload the data to the Jasmin long-term tape archive, and we may put some files in a workspace where scientists can review the progress of the simulation.
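As a rough sketch of the checksum step, the Python below copies a file and only deletes the original once the hashes match. The SHA-256 algorithm, paths and helper names are illustrative choices for this example; the actual transfers between Archer2 and Jasmin use dedicated transfer tools.

```python
import hashlib
import shutil
from pathlib import Path

def sha256sum(path: Path, block_size: int = 1 << 20) -> str:
    # Hash the file in blocks so large model output doesn't fill memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def copy_and_verify(src: Path, dest_dir: Path) -> Path:
    # Copy src into dest_dir and check the copy against the original's hash.
    dest = dest_dir / src.name
    original_hash = sha256sum(src)
    shutil.copy2(src, dest)
    if sha256sum(dest) != original_hash:
        raise IOError(f"checksum mismatch for {src}")
    # Only once the copy is verified is it safe to delete the original,
    # freeing space on the supercomputer for the next month's output.
    src.unlink()
    return dest
```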
Helping climate scientists get on with science
The advances that we make for the high resolution simulations are made available to our other users, whatever the size of the run. Ideally, the workflow design means that the only user involvement is to start the run. In reality, of course, sometimes the machine goes down, connections are lost, or the model crashes (or the experiment wasn’t set up correctly!). We have therefore built a level of resilience into our workflow that means we can deal with failures effectively. So scientists can focus on setting up the experiment and analysing the results, without worrying too much about how the simulation runs.
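As a simple illustration of the kind of resilience we mean, the sketch below retries a failed task a few times before giving up. The task and its retry policy here are hypothetical; in the real workflow this sort of behaviour typically comes from the workflow engine’s own task retry settings rather than hand-written code.

```python
import time

def run_with_retries(task, *args, attempts: int = 3, delay: float = 60.0):
    # Run a task, resubmitting it a few times if it fails with an error.
    for attempt in range(1, attempts + 1):
        try:
            return task(*args)
        except Exception as err:
            print(f"attempt {attempt} failed: {err}")
            if attempt == attempts:
                raise          # give up and flag the failure for a human to look at
            time.sleep(delay)  # wait a little before resubmitting the task
```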