SAMA: Spatially-Aware Model-Agnostic Machine Learning Framework for Geophysical Data

Geophysical data is a form of spatial data that suffers from various limitations when applying conventional machine learning algorithms and evaluation techniques. A key limitation facing models trained on geophysical data is their inability to generalize well when deployed to predict from new unseen data. We address the problem of inaccurate performance assessments of machine learning models, that stems from violating independence assumptions during the feature selection and evaluation phases of the learning process. Our proposed spatially-aware and model-agnostic (SAMA) framework provides a suite of spatially-aware feature generation, feature selection, and model validation algorithms that account for spatial characteristics of geophysical data. The framework is model agnostic, as it tackles data-related challenges that are not affected by the specific machine learning algorithm used to fit the data. To demonstrate the effectiveness of the proposed approach, it is applied to the water saturation mapping problem using a novel geophysical dataset to train a prediction model. The proposed spatially-aware models obtains an <inline-formula> <tex-math notation="LaTeX">$R^{2}$ </tex-math></inline-formula> of 0.620, an <inline-formula> <tex-math notation="LaTeX">$RMSE$ </tex-math></inline-formula> of 0.220 for predicting water saturation for the Whole Region of the reservoir model box and an <inline-formula> <tex-math notation="LaTeX">$R^{2}$ </tex-math></inline-formula> of 0.161, an <inline-formula> <tex-math notation="LaTeX">$RMSE$ </tex-math></inline-formula> of 0.263 for the Interwell Region. Extensive experiments on 5 additional unseen datasets show that the model maintains stable performance across different datasets, which demonstrates the ability of the SAMA framework to produce robust models that are transferable to new datasets.


I. INTRODUCTION
The geophysical domain involves a wide range of problems that can benefit from machine learning based approaches. Geophysical data is typically represented in the form of spatial data. It provides information on the physical properties of the Earth's surface and subsurface that is essential in analytical approaches for groundwater, mineral, and hydrocarbon discovery [1]. One of the applications that benefit from geophysical data is water saturation distribution mapping, which The associate editor coordinating the review of this manuscript and approving it for publication was Chao Tong .
holds key information for reservoir engineers to maximize oil recovery and reduce costs [2]. Due to it's unique properties inherited from spatial data, geophysical data introduces several challenges for Machine Learning (ML) modeling using conventional methods and evaluation techniques [3].
Developing a machine learning based prediction model for a given problem seeks to search the hypothesis space, using the given dataset, to find a function (hypothesis) with highest prediction power compared to other models, i.e. lowest bias, while performing equally well when used to predict from new unseen datasets, i.e. lowest variance. Bias occurs when the model is underfitted to the training dataset due to inaccurate assumptions regarding the data or the model during the learning process. Variance occurs when a model is overfitted to the training data which is commonly caused by an inaccurate assessment of the model performance during the validation process. Failing to address model bias results in a model that produces highly erroneous predictions, which makes a machine learning solution to the problem entirely futile. On the other hand, not accounting for variance leads to a model that cannot generalize, in other words, one that is not transferable to datasets different than the one used for training. Therefore, the objective of the learning process is to find a model that balances the bias-variance trade-off. Although this study aims to address overfitting in spatial data, the proposed approach implicitly addresses underfitting, as we demonstrate in following sections. Generalizabilty of a model requires addressing the problem of overfitting through data-based methods, model-based methods, or both. Model-based methods include techniques such as regularization if the model is overly complex, or in the case of neural networks overfitting can be reduced using dropout techniques. However, if the overfitting is caused fully, or even partly, by data-related sources, the data-based approaches are applied to address inconsistencies such as noise or missing data using data augmentation and denoising techniques. Spatial data has certain characteristics that can cause overfitting, such as spatial dependency, spatial heterogeneity, and scale [3].
Spatial dependency causes spatial auto-correlations in the data, restraining the application of ML and statistical models that are designed with the independence and identical distribution assumption for features, i.e. the i.i.d assumption. It also leads to over-promising performance when following conventional evaluation methods. Another property, spatial heterogeneity, introduces obstacles related to the evaluation of the model, which can severely degrade the generalizability of ML models [4], [5]. Spatial evaluation strategies in the literature [5], [6] focus on cross-validation methods, which can be fairly expensive when hyperparameter tuning ML models or performing spatial feature selection, as spatial evaluation is performed instead of a random train-test split [7]. Also, a majority of these studies focus only on interpolation tasks.
In addition to the challenges resulting from spatial data properties described above, the reservoir water saturation mapping problem is challenging in itself. A main challenge is that the data required for the mapping is not always available, especially the area between two well locations, referred to as the interwell region, since information about the different characteristics is mostly available only in near-well locations. Crosswell electromagnetic (EM) surveys are introduced with the goal to bridge this gap of information in the interwell area and provide resistivity mapping to regions extending to over 1km between the emitting and receiving boreholes [8].
Although Machine learning techniques have been used as early as 2002 [9] to predict water saturation for nearwell locations, recent studies addressing the reservoir water saturation mapping problem mostly follow data assimilation and modeling approaches. These methods require knowledge of either extra parameters, such as permeability, or the presence of a physical or numerical model. Other studies use history matching techniques that require several snapshots in time to make the predictions [10], [11]. To the best of our knowledge, there are no studies that rely only on machine learning techniques for water saturation distribution mapping from a single snapshot of Crosswell EM surveys and porosity distribution.
This work aims to provide an approach that will provide a more accurate evaluation of geophysical models built using conventional machine learning approaches. The main contributions of this work are summarized as follows: 1) Spatially-aware model-agnostic (SAMA) learning framework to develop robust ML models for geophysical data that generalize well. 2) Spatial masking algorithm for the evaluation of ML models built on 3D geophysical data cubes to obtain a realistic performance estimate of the model, and this avoiding over-promising models. 3) Spatial Masking-Random Forest model for performing water saturation mapping utilizing only the resistivity and porosity of a reservoir data cube. 4) Evaluation of the proposed approach based on the water saturation mapping problem requirements.
The rest of this paper is structured as follows: In section II, we introduce background related to the problem and discusses some related work. In section III, we describe the components comprised the proposed Spatially-aware modelagnostic (SAMA) framework. In Section IV, we present our experimental setup and results. In section V, significant findings and observations are discussed. Finally, in section VI, we conclude.

A. PROBLEM FORMULATION
Water Saturation (S w ) per Schlumberger oilfield's glossary is defined as ''The fraction of water in a given pore space'', and is measured in percent of saturation units [12]. Water saturation is the most significant parameter to compute hydrocarbon volume in oil-water fields to optimize oil production. It it is imperative for the oil production process to assess the availability of hydrocarbon reserves [2]. Ideally, core analysis is performed in order to determine water saturation in the subsurface. However, core data is not always available for most wells in a given field due to the borehole condition or the high cost of obtaining cores. Core data also requires lab analysis, which is typically time-consuming and expensive [13].
To overcome challenges with obtaining core data, Archie's equation [14] for computing water saturation was proposed. To calculate water saturation levels, Archie's equation relies on the resistivity and rock porosity, whose values are obtained from resistivity well-logs and porosity logs, respectively. In ideal situations, specifically clean sandstone formations, this calculation yields accurate results. However, it tends to VOLUME 11, 2023 make simplifications that can cause erroneous results when dealing with shaly and heterogeneous formations where clay and other minerals can be the source of the conductivity instead of the hydrocarbons [15].
Several machine learning techniques have been used since 2002 [9] to predict water saturation for near-well locations. Methods including Artificial Neural Networks (ANNs) [9], [16], [17], [18], [19], [20], Support Vector Machine (SVM) [18], [19], and Decision Trees (DT) [19] have been used to predict saturation from well log data. As predictions based on well-logs can extend only a few meters surrounding the borehole, seismic data [21], [22], [23], and crosswell electromagnetic surveys [10], [11], [23] are used for the estimates to extend to a few kilometers surrounding the borehole. These approaches allow the engineers to obtain information about the interwell region and provide estimates for the water saturation. At the same time, they are either proposed in the context of history matching or data assimilation, which requires a physical model. Therefore, in this work we aim to develop a water saturation mapping model that does not require any physical model and relies only on the resistivity and porosity at a single snapshot in time.

B. RELATED WORK 1) REGRESSION FOR GEOPHYSICAL DATA
Machine learning algorithms such as support vector machines, decision trees, and random forests have gained considerable attention in the field of spatial data applications [3]. While recent studies focus on evaluation strategies for ML models built for spatial data, early studies focused on modifying ML algorithms to be used for spatial data. In 2005, a study proposed an extension to Support Vector Machine (SVM), in an algorithm named Support Vector Random Fields (SVRF) [24]. Unlike SVMs, that make the ''ndependent and identically distributed'' assumption, SVRFs allow for spatial dependencies to be modeled using Conditional Random Fields (CRF). CRF is a statistical model that leverages the contextual information in the data. To capture the relationship between the features and the class's label, the SVRF model makes use of an observation-matching function. As for capturing the relationships between the neighboring data points, and a local-consistency function, respectively [24]. Forms of decision trees have also been developed to account for spatial dependency and autocorrelation, namely, spatially aware Predictive Clustering Trees (PCT). This method follows the same approach of decision trees, where the test criteria at the nodes is the main difference, in PCT it selects the split that maximizes the inter-cluster variance reduction [25]. In order to leverage spatial heterogeneity a study [26] proposes Geographical Random Forest (GRF), which converts the global process of building trees into a decomposition of multiple local sub-models.

2) WATER SATURATION MAPPING
Predicting water saturation using machine learning has been extensively studied in the literature through various approaches that are guided by the data available for the location.
Well-log data such as sonic, density, neutron porosity, and resistivity logs are used to predict water saturation in reservoirs. Several models exists that are based on neural networks on their own [9], or in combination with other techniques such as fuzzy logic [16], [27], ensemble structures [28], Mutual Information (MI) [20], least-squares support vector machine (LS-SVM) [29], and subtractive clustering [2]. Traditional machine learning models such as multilayer perceptron (MLP), Support Vector Machine, Decision Tree Forest, and Tree Boost methods were used, in another study, to train models for predicting water saturation in tight gas reservoirs [19].
Core data provides a different set of features that can be used to predict water saturation. A study using core porosity, deep resistivity log, neutron porosity, density log, sonic, and gamma-ray logs as input parameters [17], shows to improve prediction accuracies over the dual water model [30]. In another study, ANN is also used to predict the water saturation using the porosity, permeability of sample extracted from the core of the well, and height above the free water level [31].
In addition to well-log data and core data, seismic data contain useful information that can be used to predict porosity and water saturation. Studies adapting machine learning approaches for learning from seismic data use models such as support vector Regression (SVR) [21], ANNs [22] and adaptive neuro-fuzzy inference system optimized by a genetic algorithm [18].
To overcome some challenges of seismic data such as low resolution, crosswell electro-magnetic surveys are proposed to better understand the fluid types and saturation in the inter-well region of a reservoir [32]. Several studies are dedicated to mapping water saturation from these surveys using machine learning models [10], [11], [23], [33].

C. MODEL VALIDATION 1) K-FOLD CROSS VALIDATION
In K-fold cross validation, random 1/K of the points of the data cube are reserved for testing and the rest for training. This process is repeated K times and can be with or without replacement.

2) RELATED WORKS ON SPATIALLY-AWARE VALIDATION
Multiple methods have been proposed to give a more realistic estimation of ML models built for spatial data by having spatially dependent data in train and test sets. One study [34] proposed Spatial leave-one-out cross-validation. It considers a single data point for validation, and leaves out all spatially dependent surrounding data points within a certain threshold and trains the ML model using the remaining points. It repeats this step and then computes the average performance over all repetitions. Although the approach addresses the issue of spatial autocorrelation, it remains computationally costly.
Other works discuss the idea of cross-validation in spatial dependent data or data with spatial autocorrelation. One approach proposes spatial k-fold cross-validation (SKCV) [6] that introduces a dead zone around the testing data points, in which data points to be removed and the final model is trained on the remaining points. Another study investigates cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic dependencies [5]. The study introduces the concept of spatial blocking for the different data structures. It presents a guide on how to perform the blocking, starting with analyzing the type of dependency, then assessing the objective of the final model, for both interpolation and extrapolation. The next step is to block the data according to the objective and structure, perform the cross-validation, and finally make the final prediction. As most spatial analysis studies focus on interpolation, this study points out that blocking induces extrapolating and is key in building ML models that extrapolate [5]. In addition, it mentions that in the case where the approach aims to avoid inducing extrapolation in the model, the size of the blocks should be minimized. Geographical covariant maps and buffer distances can be used to ensure that data points with high spatial dependency are not co-located in the same train or test split [35]. Although the resulting model is unbiased, the process is computationally expensive and storage demanding. Therefore, it is not recommended for data sets with greater than 1000 points. Moreover, with the covariant maps added as features, the resulting model is optimal for interpolating only and not to extrapolate or explain the structural dependencies. Whereas in this study we focus on generalization.
In the remote sensing field, studies emphasize that in addition to spatial validation strategies, spatial dependency should be considered in different aspects of ML model development, including spatial feature selection [7]. Using spatial feature selection ensures that only features that generalize beyond the training data are included. The study also concludes that statistical evaluation for the models is not enough to evaluate ML models built for remote sensing problems and that a visual assessment must be performed [7].
Recently, a study pointed out the issue of over-promising RF models, and RF models that perform well in random cross-validation but have poor transferability [4]. As an example, the study evaluated an RF model that was trained to predict total volatile organic compounds from different borehole geophysical logs. It demonstrates that RF models built with random cross-validation might be suitable for interpolating missing well-logs, however, they may not be suitable for generalization [4].

D. RANDOM FOREST REGRESSION
Random Forests (RF), formally introduced by Leo Breiman [36], is an ensemble of tree predictors in which each tree is fit to an independent bootstrap sample drawn from the data. The results of those trees are aggregated by unweighted voting in the case of classification, whereas in regression, it is performed by calculating the mean of the individual predictions [36], [37].
Assuming an unknown joint distribution P XY (X , Y ), where X is a p-dimensional random vector X = (X 1 , . . . , X p , ) T representing the predictor variables and Y is the independent variable, the goal of the algorithm is to find the prediction function f (X ) for predicting Y . The prediction function is set to minimize the expected value of the loss function [37].
where E XY is the expectation with respect to the joint distribution of X and Y . In regression, usually, L is the squared error loss.
By that, minimizing E XY [L(Y , f (X ))] for squared error loss gives the conditional expectation.
The ensemble predictor f (x) in regression is a result of averaging the results of the base learners h 1 where J is the number of base learner trees in the ensemble.
The tree base learners leverage an independent bootstrap sample from the data given at random and make the fitting using binary recursive partitioning. In binary recursive partitioning, all the training data points are put in a single node. Then, until the stopping criteria is met, the node(s) are split into two descendant nodes based on the value of the predictor variables, and this is done recursively. In order to determine the split in regression, each possible split is considered, and the selected split point is the one that minimizes the mean squared residuals at the nodes (Q), that are defined as: where n is the number of training data points at a node and y = 1 n n i=1 y i , which is the predicted value at the node [37]. When training a Random Forests Regressor, some of the most common hyperparameters to tune are (using sklearn library parameter names): n_estimators, which determines the number of decision trees building the forest, max_depth, which sets the maximum number of levels in each decision tree, and max_features, which sets the maximum number of features to consider when splitting a node. There are also other parameters that are less common to tune, such as bootstrap to determine whether sampling the data point is with or without replacement, min_samples_split which sets the minimum number of points to be placed in a node before the node is split, and min_samples_leaf which sets the minimum number of points to be allowed in a leaf node. There are several advantages of using Random Forests such as its robustness to outliers and noise, simplicity, and that it can be VOLUME 11, 2023  implemented with parallelization easily. Also, that it provides variable importance and internal estimates of errors [36].

III. APPROACH
In this section we describe the components of the proposed SAMA machine learning framework. The framework is designed to tackle overfitting, and implicitly under-fitting, of regression models that arise from data-related inconsistencies. Therefore, the framework is considered model-agnostic, as any regression algorithm can be used for fitting the data.

A. SPATIAL FEATURE ENGINEERING 1) SPATIAL FEATURE GENERATION
In order to learn from nearby points, features are derived based on the grid to which the points belong in order to help the model learn from its region. These features are the mean of the resistivity or porosity of a 2D plane or a 3D area which a point belongs to. A description of the 19 feature IDs considered in this study is shown in table 1. For the 2D features, the featureID naming consists of the number of data points on the two axes that the grid is constructed on. For the 3D features, the Feature_id naming consists of 3D followed by one or two digits on the size of the 3D block for the x and y-axis, followed by one digit, which is the height of the block across the z-axis. For each feature ID, the resistivity res_featureID and porosity poro_featureID are calculated, amounting to a total of 24 derived features.

2) SPATIAL FEATURE SELECTION
Feature selection is performed to reduce the complexity of the model, to improve the accuracy of the prediction, and to better interpret the model. In our case study, it is performed after adding the derived features that were a result of averaging the feature from the grid in which the data point belongs, Table 1. This experiment aims to compare two approaches considered for spatial feature selection and feature selection approaches that use evaluation methods that less enforce spatial generalization: • Spatial Forward Feature Selection(SFFS-6L) The features are added one at a time, and after that the prediction's performance is evaluated using the mask testing layer following the selected Mask during the mask selection step. After evaluation, if the performance increases, the feature is added. This process is to be repeated multiple times until the performance of the model stops increasing.
• Spatial Backward Feature Selection(SBFS-6L) All features are added at first. Then they are removed one at a time until the performance stops increasing while validating on the mask testing layer following the selected Mask during the mask selection step.
• Spatial Forward Feature Selection(SFFS-0L) Features are added one at a time, then the prediction's performance is evaluated using the mask testing layer with the mask 0000010000, in which there are no leave-out layers. After evaluation, if the performance increases, the feature is added. This process is to be run multiple times until the performance of the model stops increasing.
• Random Forward Feature Selection (RFFS) The features are added one at a time, validating randomly on 10% of the data points. After evaluation, if the performance increases, the feature is added. This process is to be run multiple times until the performance of the model stops increasing. The random state is set to 42, and the R 2 is used for the comparison during the feature selection process. For the testing of the feature selection approaches, a Random Forests model is trained and tested by shifting the Mask by one each time (e.g., 00xxx1xxx0, 000xxx1xxx, x000xxx1xx, etc.). For each mask shift, the model is run 5 times without setting the random state. This results in 50 model runs.

B. SPATIALLY-AWARE VALIDATION STRATEGIES 1) SPATIAL BLOCKING
Building on the spatial blocking works in the literature [5] and the analogy of spatial k-fold cross-validation [6] and CV-block [38], a similar idea is applied to spatial data cubes.

a: K-FOLD RANDOM CROSS-SECTION CROSS-VALIDATION
In Random cross-sections cross-validation, 1/K random cross-sections across the x, y, or z-axis of the data cube are reserved for validation, while the rest for training. This process is repeated K times and can be with or without replacement. No neighboring points on the cross-section used for validation are included in the training data.

b: K-FOLD CONSECUTIVE CROSS-SECTION CROSS-VALIDATION
In K-fold Consecutive cross-sections cross validation 1/K layers of consecutive x, y, or z-axis cross-sections are used for validation, and the remaining data is used for training. This process is repeated K times by changing the validation layers until the data cube is exhausted. This approach provides a much stricter evaluation approach as 1/K data cubes of the original spatial data cube is held out during testing. No neighboring points on the cross-section used for validation are included in the training data, nor are the points of the neighboring cross-sections except for two layers.

2) SPATIAL MASKING
Although K-fold Consecutive cross-sections cross validation seems strict in enforcing spatial generalization, in datasets with spatial heterogeneity or in small datasets, important information may be lost in removing consecutive layers that represent 1/K of the data. Therefore, the need for an approach that takes into account the spatial heterogeneity of the data arises. In addition, as a previous study concluded, when dealing with spatial data, spatial feature selection should be performed to avoid the selection of features that do not generalize when extrapolating [7], which results in high computation cost when using the existing cross-validation spatial evaluation techniques for tuning the model. Therefore we propose Spatial Masking as an approach to train-test split spatial data. In Spatial Masking, conforming to a certain pattern, cross-sectionals(layers) are either included in the training subset, left out, or used for testing. The aim of this approach is to provide a more accurate estimate of the model performance and to reduce the spatial auto-correlation in the dataset from the spatial dependencies while accounting for spatial heterogeneity.
In order to select the Mask, K-fold Consecutive cross-sections cross validation is performed. The length of the fold is set to the length of the mask pattern, in our case, K = 10. K is the length axis from which the cross-sectionals are taken over the length of the Mask, in our case, ten, too. Then the nine training folds are trained and evaluated using Spatial Masking, and the tenth fold is used for testing. The average RMSE of the layers resulting from the K-fold Consecutive cross-sections cross validation after removing the first and last layers (as guards) in the K-fold Consecutive cross-sections cross validation are compared against the RMSE resulting from the Mask. The Mask that had an RMSE equivalent to the K-fold Consecutive cross-sections cross validation's RMSE is selected. An illustration of the mask selection approach is shown in Figure 9. The implementation details are in Algorithm 1, where LCount is the total number of layers, LNumber is the layer's sequential number, and MaskLen is the Mask's length.

C. TRAINING WATER SATURATION MAPPING MODEL
In this section we describe our proposed water saturation mapping model that is developed using the SAMA framework, illustrated in Figure 1. Given the above components, the SAMA framework now combines them using the following pipeline: 1) The machine learning algorithm(s) that are going to be used for regression are selected. 2) The mask selection algorithm is used to generate a validation mask using the raw features.
3) The feature derivation algorithm is executed. This is performed by applying the algorithm proposed in the feature generation section above.

4)
The selected mask and model are used to perform feature selection using one of the feature selection algorithms. 5) Hyper-parameter tuning is performed at this step for model-specific parameters, if any. 6) The prediction model is built using the selected features, selected validation mask, and selected hyperparameters.

A. DATASET 1) RESERVOIR BOX DATASETS
This work introduces one real-world dataset and five synthetic datasets to evaluate the proposed approach. The real-world dataset is a 3D data cube containing features of a realistic reservoir box model, specifying porosity, resistivity, and water saturation values. The reservoir box model dimensions are (122 x 100 x 20) in terms of (x,y,z), representing an area with dimensions 2km x 2km x 70ft (0.21 km) in depth. We refer to this dataset as RBMD. Figure 3 displays the water saturation, resistivity, and porosity for the 3D data cube. Figure 4 displays the water saturation, resistivity, and porosity taken along the z-axis. Figure 5 displays the water saturation, resistivity, and porosity taken along the y-axis. Five synthetic datasets (RBMD-02,RBMD-05, RBMD-10, RBMD-20, RBMD-50) are generated for evaluation by adding white Gaussian noise at different rates: 2%, 5%, 10%, 20%, and 50% of the standard deviation of the original distribution of the feature, respectively. Visualization of the resistivity and porosity with the added noise for three of the synthetic datasets are presented in Table 5.
The distribution of the features and target in the dataset along the z-axis is shown in Figure 6. We can notice that the water saturation increases along the z-axis. The least water-saturated layers have a mean of 0.4 water saturation; 17 out of 20 layers along the z-axis have a mean of water saturation higher than 0.6. When performing Shapiro-Wilk Test on the target 'swat' water saturation, it fails the test with a p-test score of 0.00, indicating the water saturation data is highly skewed. Figure 7 shows the correlation between the features 'poro',' res' and the target 'swat'. It is observed that

2) PRODUCTION AND INJECTION WELLS
The reservoir model box Whole Region contains one horizontal injection well, located between 35 and 42 on the x-axis and 23 and 70 on the y-axis. It also contains one horizontal production well, located between 83 and 90 on the x-axis and 27 and 80 on the y-axis. Locations are illustrated in Figure 8. The EM transmitters and receivers are embedded within the wells.
There are two regions that we evaluate separately due to their importance for the problem: we refer to the first one as the Focus Region and is located between layers 30 and 95 with respect to the x-axis, layers 23 and 80 with respect to the yaxis, and layers 3 and 13 with respect to the z-axis. The other region is the Interwell Region, and is located between layers 43 and 82 with respect to the x-axis, layers 23 and 80 with respect to the y-axis, and layers 3 and 13 with respect to the z-axis.

B. EXPERIMENT SETTINGS 1) EXPERIMENT DESIGN
The experiment procedure first starts by applying the mask selection algorithm. Then using the selected mask evaluate Spatial Masking as a train test split against other cross-validation strategies in terms of providing a better estimate for the extrapolation error provided byK-Consecutive Cross-section Cross-validation. Then different feature selection spatial and aspatial feature selection approaches are evaluated. Finally, two approaches for fitting the data are evaluated. For all the aforementioned experiments, Random Forests are used as they have been used in multiple studies concerning the evaluation strategies of ML for spatial data [5], [7]. Random Forests do not make the i.d.d. assumption, making them suitable for spatially dependent data. However, they would easily over-fit in the case of spatial data with high auto-correlation since it allows for distributing nearby data points with high auto-correlation within the training and testing subsets, which causes over-estimating the model's performance [5]. Random Forests do not require an extensive amount of hyperparameter tuning to get models with high  performance, which will direct our focus to the proposed approach for evaluation, as opposed to hyperparameter optimization.

2) EVALUATION METRICS
In order to evaluate and report the performance of regression models, several metrics can be used. Here, we will present the evaluation metrics that are used for evaluation; namely, the root mean square error (RMSE), and the coefficient of determination R 2 . Those metrics are the top two metrics used for evaluating the S w prediction in the surveyed studies are the RMSE and R 2 , where 69% of the surveyed studies for water saturation prediction use RMSE, and 50% of the studies used R 2 for evaluation.   (y j −ŷ j ) 2 (6) where n is the number of points, y j is the actual value, and y j is the predicted value for a point j. Squaring the difference penalizes more for errors with a larger magnitude, and it is commonly used in engineering problems.

b: COEFFICIENT OF DETERMINATION (R 2 )
The coefficient of determination, also known as the goodness of fit, measures the extent to which the output of the regression model is better than a straight line and is given by 7444 VOLUME 11, 2023 the formula: where n is the number of points, y j is the actual value, and y j is the predicted value for point j, and y is the mean of the observed values.

1) SPATIAL FEATURE ENGINEERING
The validation results of applying different feature selection approaches to our case study are presented in Table 6. As shown in the table, the models built on the selected features of the two approaches that did not use spatial feature selection or used a less strict mask as with the Mask 111110111, had a high R 2 score and a fewer number of features, mostly of the larger block sizes. However, when evaluating them using the selected Mask (00xxx1xxx0), which more accurately estimates the extrapolation error and using the evaluation strategy, a drop in R 2 score and the rise in RMSE error is noticed. This is accompanied by very low performance for the Focus Region and the Interwell Region, which indicates an overpromising and overfitting models. Evaluation results are shown in Table 7.
On the other hand, when applying spatial feature selection while using the selected Mask (00xxx1xxx0), whether by using SFFS or SBFS, validation scores, in Table 6, and testing scores, in Table 7, are very close. In addition, a higher R 2 score is observed, especially in the Focus Region and Interwell Region, than when not applying spatial feature selection. Although SFFS and SBFS have almost equal performance, the SFFS-L6 feature set contains 15 features while the SBFS-L6 feature set contains 21 features. Therefore, the SFFS-L6 feature set is used in building the final model. These results corroborate results from previous studies such in [7] that when dealing with spatial data, spatial feature selection should be performed to avoid the selection features that do not generalize when extrapolating. Also, that SFFS is more suitable for avoiding outfitting in spatial data than SBFF.
This experiment aims to select the Mask to be used for Spatial Masking for this RBMD dataset, following the Algorithm 1. The masks used for the experiments are 0000010000, 0000 × 1x000, 000xx1xx00, x00xx1xx00, 00xxx1xxx0, 0xxxx1xxxx, x0xxx1xxx0, where 1: training layer, x: leave out layer, 0: training layer. The masks are combined with adding a mesh-like structure to include only (50%, 25%, 12.5%, and 6.25%) of the data points in the training layers in training. All cross-sectionals are taken across the y-axis, and the optimal hyperparameters in Table 2 are used. The random state is set to 42, and the RMSE is used for the comparison.

b: SPATIAL MASKING STRATEGY EVALUATION
This experiment aims to compare K-fold cross-validation [39], spatial blocking for 3D data cubes, and Spatial Masking. To compare the approaches, Random Forests models are built. The percentage of the training data is 90%, and for the testing, the data percentage is 10% for the crossvalidation strategies. As for Spatial Masking, since the Mask 00xxx1xxx0 is used, based on the results from the Mask Selection Experiment, 30% of the data is utilized for training, and 10% is used for testing. The random state is set to 42, and the R 2 and RMSE are used for the comparison.

c: SPATIAL MASKING AS A MODEL FITTING APPROACH
This experiment aims to evaluateSpatial Maskingas a modelfitting approach, in which the model is only fit on the 0 layers of the Mask instead of the whole data for the final build. K-Consecutive Cross-section Cross-validation was used for the evaluation, and the model is fit once using the Mask and once all the data points reserved for training are fit.

3) WATER SATURATION MAPPING MODEL
To build SM-RF water saturation (S w ) mapping model, the hyperparameters are tuned via a grid search on the different hyperparameter values in Table 2. The derived features selected by the SFFS-L06 approach are utilized, and the model is built and evaluated using the Mask 00xxx1xxx0. The model has a R 2 of 0.620 and a 0.220 RMSE when evaluating the Whole Region. As for the Focus Region and the Interwell Region, an improvement of 33% and 32.6% in terms of RMSE difference is observed compared to using only the resistivity and porosity without the derived features, which we consider here as the baseline model, shown in Table 8. Visualization of some of the predictions are in Table 9.

V. DISCUSSION AND ANALYSIS
A. MASK SELECTION Figure 9 illustrates how as leave-out layers surrounding the testing layers are added, the difference between the RMSE of using K-Consecutive cross-sections crossvalidation and Spatial Masking diminishes. At masks 00xxx1xxx0, 0xxxx1xxxx, and x0xxx1xxx0 the RMSE is    very close. However, at Mask 00xxx1xxx0, the RMSE of the K-Consecutive cross-sections cross-validation is at its lowest value. The difference in RMSE between using the Mask 00xxx1xxx0 while utilizing 50% of the training layers data and the K-Consecutive cross-sections cross-validation RMSE is at 0.006. Therefore, the Mask 00xxx1xxx0 and the  utilization ratio of 50% are selected to build the Spatially Masked Random Forests (SM-RF) models for predicting water saturation.
To investigate that further, we look at the errors per layer when running a K-Consecutive cross-sections crossvalidation. Errors are illustrated in Figure 10. We can notice the errors stabilize at their highest around layers 4-7 with a sharp decline in RMSE error in the layers before and after for every ten layers. This indicates that layers 4-7 are more representative of the actual performance of the model built, and the high performance of the outer three layers from each side is due to overfitting from having nearby points in training and validation data. This validates the mask choice by the algorithm as it selected a mask with three leave-out layers from each side of the testing layer to avoid this sort of contamination to the testing data. Applying the mesh-like structure further treats the overfitting problem, as we can notice from the illustration and by the reduction in the standard deviation of the RMSE error across the layers from 0.05 when not using the mesh structure to 0.043 and 0.3 when using 50% and 12.5% of the data respectively. However, when combining the mesh structure with the Spatial Masking method and a Mask that has a total of 6 leave-out layers reduces the training data significantly and throws out too much of the data, which reduces the overall performance. Therefore, the elimination of the data to create the mesh-like structure was capped at 50%. Table 11 shows that using Spatial Masking as an evaluation approach gives the closest error estimate in terms of RMSE to the baseline of 10-consecutive cross-section cross-validation layer. Thus providing a more accurate estimate for the extrapolation error.

C. WATER SATURATION MAPPING MODEL
The SM-RF model robustness when testing on the 5 synthetic datasets, the model's performance is steady to a great extent. From Table 10, there is a 0.008 increase in RMSE for the RBMD-20 dataset, where 20% of noise is added. This equates to a decrease of 3.6% in R 2 for the SM-RF model as opposed to a 59% decrease in R 2 for the Baseline Model. As the error difference percentage is bound by the percentage of noise added to generate the synthetic datasets presented in Table 10, mathematically, the model is considered stable [40]. VOLUME 11, 2023

VI. CONCLUSION
In this work, we tackled the issue of over-promising evaluation and over-fitting Forests models when built on spatially auto-correlated data. We proposed Spatial Masking as a train-test split approach to build and evaluate Random Forests models while minimizing overfitting and delivering a more realistic evaluation of the models. Through a series of experiments on a case study of water saturation (S w ) mapping in oil/water reservoirs, Spatial Masking did provide a more accurate error estimate of the extrapolation error than random K-Fold cross validation and Random K-Fold crosssection cross validation. Combining Spatial Masking with a Mesh-like structure did reduce the extrapolation error further when fitting the model. An SM-RF (S w ) mapping model was developed by fitting the data on the Mask 00xxx1xx00 with a 50% utilization rate of the training data and the features selected by the spatial forward selection approach using the same mask. In the future, we will investigate the different sources of the errors found in the mapping and work on further improvement in the mapping related to the Interwell region.