By: David Case
In principle, simulating the Earth should be a doddle. We know that it's made of such things as atoms, molecules and crystals; the forces between these derive from charged particles, which do little more than move around and interact via Coulomb's law (plus a little symmetry). So how hard could it be to start from this and scale up?
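To make that picture concrete, here is a minimal Python sketch of it: the pairwise Coulomb energy of a handful of point charges. The charges, positions and units are invented purely for illustration; note that every pair interacts, which already hints at why the cost grows quickly as we scale up.

```python
import numpy as np

def coulomb_energy(charges: np.ndarray, positions: np.ndarray) -> float:
    """Sum q_i * q_j / r_ij over all pairs of point charges (with k = 1)."""
    energy = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r_ij = np.linalg.norm(positions[i] - positions[j])
            energy += charges[i] * charges[j] / r_ij
    return energy

# Three made-up charges at made-up positions: every pair contributes,
# so even this classical picture gets expensive as the particle count grows.
q = np.array([1.0, -1.0, 0.5])
r = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(coulomb_energy(q, r))
```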
Unfortunately, if one consults the ancient tomes (such as my PhD thesis), one realises that all of this has been known for a while, and we aren't there yet. For a calculation of molecular forces as accurate as an experiment, the cost typically scales as roughly the seventh power of the size of the basis set; so as we double the size of our simulated system, we need to perform 2⁷ = 128 times as many calculations, and this barrier is impossibly steep. Whilst the approach has been known since before WWII, progress in these kinds of simulation really took off when people decided to just use parameterised models and call them ab initio. And even by cheating we can only get so far.
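As a back-of-the-envelope illustration of why that exponent is so punishing, here is a tiny Python sketch (the function is mine, just for illustration):

```python
def cost_ratio(scale_factor: float, exponent: float = 7.0) -> float:
    """Factor by which the work grows when the basis set grows by scale_factor,
    assuming the cost scales as (basis set size) ** exponent."""
    return scale_factor ** exponent

print(cost_ratio(2))    # 128.0      -- doubling the system costs 2**7 times as much
print(cost_ratio(10))   # 10000000.0 -- ten times bigger is ten million times dearer
```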
More recently, I’ve moved to Meteorology, and the approach here is to start from the big (the atmosphere/ocean) and move towards the small. The system is mapped onto a grid of points, and more points are added until the resolution is sufficient to describe the physical phenomena of interest. Encouraged by a manageable scaling in the number of computations required (although new to the game, I’m yet to see seven nested loops in a meteorology code), we throw more processors at it. One of the first things that I did when I joined the NCAS-CMS (Computational Modelling Services) team was to graph the scaling of the Met Office Unified Model for the atmosphere (below), so as to advise researchers on their resource allocations. When we double the number of processing elements, we don’t double the rate at which we perform calculations, because the communication between them starts to hit bottlenecks. Further profiling, especially for bigger models, reveals that the code spends ever-increasing amounts of time calling things with names like ‘barrier’ or ‘waitall’, i.e. it’s stuck waiting.
Figure 1: The amount of actual simulation achievable (y) for a typical UM job shows diminishing returns with the number of cores (x).
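For anyone wanting to produce a similar curve for their own jobs, the arithmetic is straightforward. Below is a small Python sketch that turns wall-clock times at different core counts into speed-up and parallel efficiency; the core counts and timings are invented for illustration and are not the benchmark data plotted above.

```python
# Strong-scaling sums: the same model run on increasing core counts.
cores = [144, 288, 576, 1152]            # hypothetical decompositions
wallclock_hours = [10.0, 5.6, 3.4, 2.5]  # hypothetical timings for the same run

base_cores, base_time = cores[0], wallclock_hours[0]
for n, t in zip(cores, wallclock_hours):
    speedup = base_time / t     # how much faster than the smallest run
    ideal = n / base_cores      # what perfect scaling would give
    efficiency = speedup / ideal  # fraction of the extra cores doing useful work
    print(f"{n:5d} cores: speed-up {speedup:4.1f}x "
          f"(ideal {ideal:4.1f}x), efficiency {efficiency:.0%}")
```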
When scaling up the number of processors working on a problem, there is a step which appears trivial but can become the major bottleneck: reading and writing the data (and in Meteorology there is a lot of data). As we parallelise the calculation, we must also try to parallelise the reading and writing of the data, which can be hard because the disks themselves impose a physical bottleneck. The Met Office and NERC Cloud model (MONC) previously wrote the large 3D fields in parallel, but an optimisation that I implemented applied this to the (far smaller) 2D fields too. The message from the profiles below is that the number of times you write data may be as important as the amount which is written.
Figure 2: Darshan profiles of IO for processors (y-axis) vs time (x) for MONC. Blue lines indicate that the processors are writing. In the bottom profile, 2D fields are written in parallel, and both writing and overall runtimes are shorter.
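MONC itself is Fortran, so the following is only a Python sketch of the general idea, using h5py's MPI driver (the file name, field name and decomposition are invented for the example): every rank writes its own slab of a 2D field into a shared file, rather than funnelling everything through a single process.

```python
# Requires h5py built against parallel HDF5, plus mpi4py.
# Run with e.g.: mpiexec -n 4 python write_2d_field.py
from mpi4py import MPI
import numpy as np
import h5py

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

nx = 256
ny_local = 64                    # rows owned by this rank
ny_global = ny_local * nprocs    # global extent along the decomposed axis

# Each rank holds its own slab of the 2D field (e.g. a surface diagnostic).
local_field = np.full((ny_local, nx), float(rank))

# All ranks open the same file and create the dataset collectively.
with h5py.File("diagnostics_2d.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("surface_field", (ny_global, nx), dtype="f8")
    # Each rank then writes only its own rows, in parallel.
    start = rank * ny_local
    dset[start:start + ny_local, :] = local_field
```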
Following the logic that the biggest calculations hit the most trivial problems, we note that a major consideration in huge calculations is the electricity bill. In fact, for this and other reasons, people are turning to a wide range of technologies when designing the current generation of supercomputers, some using graphics processing units (GPUs) or other accelerators. A practical problem with this is that you need to write code with different instructions for these different machines, which can take many hours to learn the tricks of and to port successfully. A collaboration that I have recently started with the Science and Technology Facilities Council seeks to apply their tool for parsing and rewriting code, PSyclone, to target GPUs, starting with the NEMOVAR data assimilation code.
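PSyclone works on Fortran, for example by inserting directives so that kernels can run on GPUs. Purely as an illustration of the idea of writing a kernel once and retargeting it, here is a Python sketch that dispatches the same array computation to NumPy on the CPU or CuPy on a GPU; the kernel and the backend choice are my own invention and have nothing to do with NEMOVAR or PSyclone's actual output.

```python
import numpy as np

try:
    import cupy as cp   # needs a CUDA-capable GPU and CuPy installed
    xp = cp
except ImportError:
    xp = np             # fall back to the CPU

def horizontal_laplacian(field):
    """Five-point Laplacian of a 2D field on a periodic grid,
    written once against whichever array backend 'xp' points at."""
    return (xp.roll(field, 1, axis=0) + xp.roll(field, -1, axis=0) +
            xp.roll(field, 1, axis=1) + xp.roll(field, -1, axis=1) -
            4.0 * field)

field = xp.asarray(np.random.rand(512, 512))
lap = horizontal_laplacian(field)
print(type(lap))   # numpy.ndarray on the CPU, cupy.ndarray on the GPU
```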
Figure 3: Fancy GPU from a well-known company
In the above, I have touched on a few of the practical problems that we face in big simulations, and mentioned my own career path to get here. One last thing that I have noted since moving to Meteorology, and working within the structure of NCAS, is that there is a lot of teamwork applied to solving these problems. I was lying when I said that these problems were trivial, but between us we can keep pushing through them.