Machine learning enhanced gap filling in global land surface temperature analysis

By: Dr. Shaerdan Shataer

Land Surface Temperature (LST) data, an essential component of climate change indicators (CCI), often suffer from gaps caused by cloud cover, sensor limitations, or data processing issues. These gaps can hinder accurate monitoring of climate change and environmental trends, and in particular of their impact on human lives, vegetation, and agriculture.

To address this, cloud gap-filling of LST data plays a crucial role. Cloud gap-filling uses algorithms to estimate and fill in the missing LST values, producing a continuous and complete dataset. One of the primary families of methods is statistical interpolation, which includes techniques such as Kriging and Inverse Distance Weighting (IDW). Empirical Orthogonal Function (EOF) analysis is another popular method in this category. These methods estimate the missing data from the spatial and temporal relationships in the available data. Another approach is the application of machine learning algorithms, which learn from the patterns in the existing data to predict the missing values. These algorithms might include neural networks, decision trees, or support vector machines, tailored to the specific characteristics of LST data. Additionally, satellite data from different sources or times can be merged to fill in the gaps. This method, known as data fusion, leverages the strengths of multiple datasets to create a more comprehensive and robust dataset. For instance, if one satellite fails to capture certain data due to cloud cover, data from another satellite or from a different time frame can be used to compensate for the missing information.
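As a concrete illustration of the simplest of these techniques, IDW, here is a minimal sketch in Python with NumPy. The function name `idw_fill`, the `power` parameter, and the tiny 3x3 field are assumptions made for illustration; it is not the code used in our study, and it weights all observed pixels rather than a local neighbourhood.

```python
import numpy as np

def idw_fill(values, power=2.0):
    """Fill NaNs in a 2-D field with inverse-distance-weighted
    averages of all observed pixels (illustrative sketch only)."""
    filled = values.copy()
    observed = ~np.isnan(values)
    obs_rows, obs_cols = np.nonzero(observed)
    obs_vals = values[observed]  # same row-major order as np.nonzero
    for r, c in zip(*np.nonzero(~observed)):
        # Distance from the missing pixel to every observed pixel;
        # never zero, because (r, c) itself is not observed.
        dist = np.hypot(obs_rows - r, obs_cols - c)
        weights = 1.0 / dist**power
        filled[r, c] = np.sum(weights * obs_vals) / np.sum(weights)
    return filled

# Hypothetical 3x3 LST patch in Kelvin with one cloudy pixel in the centre.
lst = np.array([[280.0, 281.0, 282.0],
                [281.0, np.nan, 283.0],
                [282.0, 283.0, 284.0]])
filled = idw_fill(lst)
print(filled[1, 1])
```

Because the weights depend only on distance, IDW cannot use temporal structure or learn patterns from the data, which is one motivation for the EOF-based and machine learning methods discussed in this post.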

The importance of cloud gap-filling in LST data for climate change indicators cannot be overstated. Accurate and complete LST datasets are vital for monitoring the Earth’s surface temperature, assessing environmental changes, and developing strategies to mitigate the impacts of climate change. By ensuring the integrity and continuity of LST data, researchers and policymakers can make more informed decisions and better understand the dynamics of our changing planet. This is particularly crucial in the context of global efforts to track climate change and its effects on ecosystems, weather patterns, and long-term environmental shifts. 

In our recent work, we have focused on addressing the challenge of cloud gap-filling for Land Surface Temperature (LST) datasets, specifically targeting three distinct areas in the United Kingdom: Reading, the Lake District, and Bristol. Our approach has been to implement and analyze two innovative methods: DINEOF (Data Interpolating Empirical Orthogonal Functions) and DINCAE (Data-Interpolating Convolutional Auto-Encoder). The DINEOF method is grounded in Singular Value Decomposition (SVD) which decomposes a given data matrix into three constituent matrices: U, Σ, and V. In this decomposition, U and V are orthogonal matrices containing the left and right singular vectors, respectively, while Σ is a diagonal matrix of singular values. The singular vectors in U and V encapsulate the spatial and temporal patterns within the dataset, respectively. Specifically, the columns of U represent the spatial patterns (EOFs), and the columns of V represent the temporal patterns. This separation of spatial and temporal components is a defining characteristic of DINEOF. 
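The decomposition can be sketched with NumPy's `np.linalg.svd`. The synthetic matrix below (rows as pixels, columns as days) and all variable names are assumptions for illustration, not our LST data. Note that with `full_matrices=False` NumPy returns the reduced form, in which U has orthonormal columns rather than being square.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_days = 50, 30

# Synthetic space-time matrix: two spatial patterns modulated in time,
# plus a little noise (purely illustrative).
spatial_patterns = rng.normal(size=(n_pixels, 2))
temporal_patterns = rng.normal(size=(2, n_days))
X = spatial_patterns @ temporal_patterns + 0.01 * rng.normal(size=(n_pixels, n_days))

# Reduced SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Columns of U are the spatial patterns (EOFs); rows of Vt
# (i.e. columns of V) are the matching temporal patterns.
X_rebuilt = U @ np.diag(s) @ Vt

print(U.shape, s.shape, Vt.shape)
print(np.allclose(X, X_rebuilt))
```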

The strength of DINEOF lies in its ability to identify and retain the most significant modes (EOFs) from the data. This selection is based on the singular values in Σ, where higher values indicate modes that capture more variance in the dataset. By focusing on these principal modes, DINEOF effectively filters out noise, leading to a regularization effect that reduces the likelihood of overfitting. This aspect is particularly beneficial in environmental datasets, where the presence of noise and the risk of overfitting are common concerns. 
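To make the link between singular values and retained variance concrete, the sketch below keeps the smallest number of modes whose squared singular values reach a chosen share of the total. The function name `truncate_modes` and the 95% target are assumptions for illustration; in practice the number of modes retained by DINEOF is typically chosen by cross-validation against withheld observations rather than by a fixed variance threshold.

```python
import numpy as np

def truncate_modes(X, variance_target=0.95):
    """Keep the leading SVD modes that together explain at least
    `variance_target` of the total variance (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # First index at which the cumulative share reaches the target.
    k = int(np.searchsorted(explained, variance_target)) + 1
    k = min(k, len(s))
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return X_k, k, explained

# Synthetic example: a rank-2 signal plus weak noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 25))
X = signal + 0.01 * rng.normal(size=(40, 25))

X_k, k, explained = truncate_modes(X, variance_target=0.95)
print(k, explained[:3])
```

The discarded modes here carry almost only noise, which is the filtering effect described above.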

Moreover, DINEOF’s iterative approach to filling missing data adds to its robustness. Starting with an initial guess for missing values, the method iteratively updates these estimates by projecting the data onto the retained EOFs and back. This iterative cycle continues until convergence, ensuring that the reconstructed data align well with the dominant spatial and temporal patterns identified by the EOFs.  
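The iterative cycle can be sketched as follows. This is a simplified, DINEOF-like loop under stated assumptions: the number of modes is fixed in advance, the initial guess is the mean of the observed values, and the function name `dineof_like_fill`, the synthetic field, and the "cloud" mask are all hypothetical. The published DINEOF method additionally works on anomalies and selects the number of modes by cross-validation.

```python
import numpy as np

def dineof_like_fill(X, n_modes=1, max_iter=500, tol=1e-8):
    """Iteratively fill NaNs in X (space x time) with a truncated SVD."""
    missing = np.isnan(X)
    if not missing.any():
        return X.copy()
    # Initial guess for the gaps: mean of the observed values.
    filled = np.where(missing, X[~missing].mean(), X)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Project onto the retained modes and back.
        recon = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes, :]
        change = np.sqrt(np.mean((recon[missing] - filled[missing]) ** 2))
        # Only the gaps are updated; observed values stay untouched.
        filled[missing] = recon[missing]
        if change < tol:
            break
    return filled

# Synthetic rank-1 field in Kelvin: one spatial pattern times one temporal pattern.
spatial = 280.0 + 5.0 * np.sin(np.linspace(0.0, np.pi, 20))
temporal = 1.0 + 0.02 * np.cos(np.linspace(0.0, 2.0 * np.pi, 15))
truth = np.outer(spatial, temporal)

# Deterministic mask hiding about 10% of the entries.
rows, cols = np.indices(truth.shape)
cloud = (7 * rows + 3 * cols) % 10 == 0
gappy = np.where(cloud, np.nan, truth)

filled = dineof_like_fill(gappy, n_modes=1)
max_error = np.max(np.abs(filled[cloud] - truth[cloud]))
print(max_error)
```

The example is noise-free and exactly low rank, so the gaps are recovered almost perfectly; real LST data are noisier and need more modes.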

On the other hand, DINCAE leverages the power of Deep Neural Networks (DNNs), specifically utilizing an autoencoder architecture, to reconstruct the missing data points. Its application in gap filling is an example of the broader capabilities of DNNs in environmental data analysis. A DNN consists of layers of interconnected nodes, or ‘neurons’, each capable of performing simple computations. By passing data through these layers and minimizing a loss function computed from the output of the final layer, a DNN can learn complex patterns and relationships within the data. DINCAE uses a specific DNN architecture known as a convolutional autoencoder, which is trained to recognize and predict the spatial and temporal patterns in environmental datasets such as Sea Surface Temperature (SST) or LST. What makes DINCAE and similar DNN models particularly effective for this task is their ability to handle the high variability and complexity often present in environmental data. Traditional methods might struggle with such variability, especially in the presence of non-linear relationships or when the data contain a significant amount of noise. DNNs, however, can adapt to these complexities, offering more nuanced and accurate gap filling.

A schematic of DINCAE by Yan et al. (2023)

The DNN within DINCAE is trained on sections of data that are complete (sometimes referred to as the observations), allowing it to extract spatial and temporal patterns. The weights of the network are adjusted by minimizing a loss function, which tells the network what its goal is. In the case of DINCAE, the network should maximize the Gaussian likelihood of the complete data/observations, with the likelihood conditioned on the missing part. When dealing with incomplete or missing data segments, the network applies the learned weights to reconstruct the missing values, a process which is more sophisticated than traditional interpolation methods.
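Maximizing a Gaussian likelihood is equivalent to minimizing the negative log-likelihood, which can be sketched in NumPy as below. The function name `gaussian_nll`, its arguments, and the toy arrays are assumptions for illustration. DINCAE's actual parameterization of the predicted mean and error variance differs in detail, so this shows the idea rather than its implementation.

```python
import numpy as np

def gaussian_nll(y_obs, mean_pred, sigma_pred, observed_mask):
    """Mean Gaussian negative log-likelihood over observed pixels only."""
    resid = (y_obs - mean_pred)[observed_mask]
    sigma = sigma_pred[observed_mask]
    nll = 0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5 * (resid / sigma) ** 2
    return float(nll.mean())

# Toy 2x2 field in Kelvin; NaN marks a cloudy pixel that is excluded from the loss.
y = np.array([[285.0, np.nan],
              [283.0, 284.0]])
mask = ~np.isnan(y)
sigma = np.ones_like(y)  # predicted standard deviation of 1 K everywhere

mean_exact = np.array([[285.0, 280.0],
                       [283.0, 284.0]])
mean_off = mean_exact + 2.0  # predictions that are 2 K too warm

loss_exact = gaussian_nll(y, mean_exact, sigma, mask)
loss_off = gaussian_nll(y, mean_off, sigma, mask)
print(loss_exact, loss_off)
```

Because the network predicts a standard deviation as well as a mean, a loss of this form rewards both accurate values and realistic error estimates.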

The efficacy of DINCAE in handling environmental data lies in its ability to adapt to the inherent variability and non-linear characteristics of these datasets. Conventional gap-filling techniques often falter in such complex scenarios, particularly when dealing with irregularities or noise. However, DNNs, with their capacity for high-dimensional data processing and pattern recognition, offer nuanced and accurate predictions, even in data-rich environments. 

The convolutional auto-encoder architecture of DINCAE is essential to its effectiveness. The convolutional layers specialize in extracting spatial features, crucial for geospatial data analysis. These layers systematically identify localized patterns within the data, which is integral for spatially coherent gap filling. The auto-encoder component of DINCAE aids in compressing the dataset into an efficient representation, highlighting essential features, and subsequently reconstructing the data with an emphasis on accuracy and detail.

One notable drawback is the intensive tuning required during the training process. The effectiveness of DINCAE is contingent upon the careful calibration of numerous hyperparameters, including the number of layers, the number of neurons in each layer, learning rates, and regularization techniques. This tuning process is critical to ensure that the model accurately captures the underlying patterns in the data without overfitting or underfitting. Furthermore, training a DNN model like DINCAE demands a considerable level of expertise and understanding of machine learning principles. The complexity of these models requires a nuanced approach to training, where the data scientist must have a deep understanding of both the algorithmic intricacies of DNNs and the specific characteristics of the environmental data being analyzed.

A significant challenge underlying our work is the notably low data availability, a direct consequence of the frequent and extensive cloud cover that characterizes UK weather. This makes the UK a demanding test bed for our methodologies, pushing the boundaries of LST data recovery in environments where traditional satellite-based monitoring faces substantial limitations.

Applying the DINCAE and DINEOF methods to data from these three UK regions, our initial findings have been promising, indicating that both methods can produce reliable, cloud gap-filled LST datasets. However, a comparative analysis suggests that DINEOF, with its SVD-based framework, exhibits a higher degree of robustness in this context. We find that DINCAE does perform better than DINEOF on a short-range dataset, e.g., one covering one year's worth of daily temperatures. But this advantage is reduced and, in some cases, reversed as the time range of the data increases. We are currently looking into the cause of this transition.

An example of LST gap infilling using DINEOF over the Lake District. The reconstruction captures the general pattern of the true data effectively, with an average RMS error of less than 1 Kelvin.
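For readers who want to reproduce this kind of check, one common way to obtain such an RMS error is to hide a subset of pixels with known values, reconstruct them, and compare against the truth. The sketch below assumes that setup; the function name `rmse_on_withheld` and the toy arrays are hypothetical, and this post does not describe the exact validation procedure used for the figure.

```python
import numpy as np

def rmse_on_withheld(truth, reconstruction, withheld_mask):
    """Root-mean-square error over pixels that were hidden from the method."""
    diff = (reconstruction - truth)[withheld_mask]
    return float(np.sqrt(np.mean(diff**2)))

# Toy example in Kelvin: two withheld pixels, each reconstructed 0.5 K off.
truth = np.array([[285.0, 286.0],
                  [287.0, 288.0]])
reconstruction = truth + np.array([[0.5, 0.0],
                                   [0.0, -0.5]])
withheld = np.array([[True, False],
                     [False, True]])

rmse = rmse_on_withheld(truth, reconstruction, withheld)
print(rmse)
```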

Further reading:

Alvera-Azcárate, Aïda, et al. “Reconstruction of incomplete oceanographic data sets using empirical orthogonal functions: application to the Adriatic Sea surface temperature.” Ocean Modelling 9.4 (2005): 325-346. 

Barth, Alexander, et al. “DINCAE 2.0: multivariate convolutional neural network with error estimates to reconstruct sea surface temperature satellite and altimetry observations.” Geoscientific Model Development 15.5 (2022): 2183-2196. 

Beckers, J-M., Alexander Barth, and Aïda Alvera-Azcárate. “DINEOF reconstruction of clouded images including error maps–application to the Sea-Surface Temperature around Corsican Island.” Ocean Science 2.2 (2006): 183-199. 

Yan, Xiting, et al. “Application of Synthetic DINCAE–BME Spatiotemporal Interpolation Framework to Reconstruct Chlorophyll–a from Satellite Observations in the Arabian Sea.” Journal of Marine Science and Engineering 11.4 (2023): 743. 
