Introduction
In recent years, deep learning has had a significant impact on medical imaging analysis in different fields [1], [2], [3], [4], including brain tumor analysis. The brain is considered the most complex and important organ in the human body, as it is responsible for main functions such as thinking, sensing, and memorization. Damage to brain cells affects the entire central nervous system, leading to disability in most body organs. A brain tumor is defined as a mass or group of abnormal cells in the brain and is considered one of the most dangerous diseases that damage brain cells [5], [6], [7]. There exist different types of brain tumors. Some tumors are considered benign (non-cancerous), whereas others are considered malignant (cancerous). The difference between them is that benign tumors grow more slowly than malignant tumors and are less likely to return after treatment [8], [9]. In addition, brain tumor severity and treatment depend on the type, location, and size of the tumor. It is also worth mentioning that brain tumors are more common in children than in adults, and their treatment in children is more challenging because the child’s brain is still developing [10]. Moreover, the growth rate of a tumor in the brain determines its severity and how it affects brain functionality and the central nervous system [11]. Since the central nervous system controls most body organs, and brain tumors are among the most common diseases that negatively affect it, brain tumors can damage most body organs. According to the World Health Organization (WHO), brain tumors are considered the 10th leading cause of death worldwide [8], [12]. Consequently, early classification and detection of brain tumors can greatly help the treatment process, lessen their damage, and reduce the overall death rate. The manual process of identifying tumors in MRI images is time-consuming and difficult in some cases. Thus, automating this process can ease the identification step and help detect the disease at its earliest stages.
In general, deep learning models require diverse and large amounts of data to guarantee a well-generalized model. However, most medical datasets are limited and suffer from data imbalance problems. On the other hand, the collection of new medical data is a significant challenge, a time- and cost-intensive process, and requires the collaboration of medical domain experts. Different approaches are commonly used to handle such problems (i.e., limited and imbalanced data), including data augmentation, which is the most common approach used in computer vision tasks; it generates synthetic samples from existing ones, thereby increasing the diversity of the training data and enhancing the generalization and performance of classification models. In the literature, most studies considered using basic data augmentation techniques such as rotation, flipping, shearing, translation, and brightness and contrast adjustment [4], [13], [14], [15]. Other studies considered more effective augmentation techniques based on the idea of Generative Adversarial Networks (GANs) [16], [17], [18], [19], [20], which capture the distribution of dataset samples and accordingly generate artificial samples that look realistic based on the learned distribution [21].
In this study, we introduce novel augmentation techniques called RegionInpaint augmentation, Cutoff augmentation, and RegionMix augmentation to significantly improve the generalization ability of the classification model by augmenting the training samples. In addition to these techniques, basic augmentation techniques (e.g., flipping, rotation, etc.) are also evaluated. The proposed augmentation techniques depend mainly on segmenting tumors from the MRI images. Thus, segmentation is a crucial step in our study: it generates segmentation masks corresponding to the input MRI images, based on which the proposed augmentation techniques can be applied. To obtain high-quality segmentation masks, a U-Net-like architecture named VGGUNET is utilized, where we take advantage of combining a pre-trained encoder (VGG16 [22]) with the U-Net architecture [23] to achieve promising results compared to the other models used for segmentation (i.e., U-Net [23], SegNet [24], and ResUNet [25]).
Two public datasets are used in this study: the SPMRI [26] and Br35H [27] datasets. The SPMRI dataset is a small dataset used for training and validation purposes, whereas the Br35H dataset is used for testing purposes. Both datasets consist of two classes (tumor and non-tumor). The main focus of this research is to use a small set for training in order to investigate the generalization ability of the classification model on the unseen validation and testing sets when using the proposed augmentation techniques. In addition, different pre-trained classifiers are evaluated on the original small training set without augmentation. Accordingly, the classifier that attains the best results is selected for experimentation along with the proposed augmentation techniques.
To analyze the efficiency of the proposed augmentation techniques, the classification results are compared before and after extending the training set with the newly generated samples. Moreover, each proposed augmentation technique is first applied on its own to investigate its individual effect on the classification model and to select the best augmentation technique. Subsequently, combinations of these augmentation techniques are used to generate more diverse synthetic samples. The proposed data augmentation techniques showed better results compared to related studies that used other popular augmentation techniques, which demonstrates the superiority of the proposed novel augmentation techniques. It is also noteworthy that the proposed augmentation techniques can be applied to a wide range of tasks where a segmentation step is applicable.
In summary, we make the following contributions: 1) we introduce different novel augmentation techniques named RegionInpaint and Cutoff, in addition to RegionMix augmentation; 2) to the best of our knowledge, we are the first to introduce new effective augmentation techniques suitable for medical tasks that go beyond the existing popular augmentation techniques and GAN-based augmentation techniques; 3) we demonstrate that using the VGGUNET network for segmentation achieves promising results compared to the other segmentation networks used; 4) extensive experiments show the efficiency of the proposed augmentation techniques and their ability to significantly enhance the generalization performance on unseen samples.
The rest of the paper is organized as follows: Section II briefly discusses related studies to our work. The full methodology and details of each model are discussed in detail in Section III, which can be divided into four main sections (Preprocessing, Segmentation, Data augmentation, and Classification). Section IV introduces the results and analysis for each method. Finally, Section V presents the conclusions of our study and future work.
Related Work
Brain tumors are among the most critical and fatal diseases. They affect and damage the central nervous system, which is responsible for most body functions. Many recent studies in the machine learning, image processing, and deep learning fields have introduced different state-of-the-art techniques and methods, especially for classification, segmentation, and detection tasks, to identify brain tumors in MRI images at early stages. Most studies consider extending the training data by applying different data augmentation techniques due to the problem of limited and imbalanced data. Thus, in this paper, we focus on studies that have utilized different augmentation techniques. Furthermore, this section briefly discusses recent related research papers that focus on the identification of brain tumors in MRI images.
Asif et al. [14] proposed different transfer learning-based deep learning models for brain tumor detection in MRI images. First, they applied different preprocessing techniques, such as resizing images to a fixed size (
Younis et al. [28] proposed a deep learning method for brain tumor analysis using the VGG-16 ensemble learning approaches. They applied different preprocessing techniques, such as data normalization and thresholding, followed by a series of erosions and dilations to remove any existing small patch of noise. Moreover, they cropped the brain regions from MRI images. Finally, they applied different data augmentation techniques such as shearing, rotation and shifting. Three different approaches including Custom CNN, VGG16, and Ensemble model, have been tested on the SPMRI dataset and achieved accuracies of 96%, 98.5%, and 98.14% respectively.
Ramtekkar et al. [29] proposed an optimized feature selection method for accurate brain tumor detection using deep learning techniques. They worked on the small SPMRI dataset in their study. For preprocessing, they used a compound filter, which is a combination of Gaussian, median, and mean filters. Image segmentation was then applied using thresholding and histogram techniques, and a gray-level co-occurrence matrix (GLCM) was used in the feature extraction step. Subsequently, they used different optimization algorithms, such as particle swarm optimization, genetic optimization, whale optimization, and wolf optimization, for feature selection. Furthermore, the small dataset was augmented to reach 2318 samples. The best testing accuracy of 98.9% was achieved when using the whale optimization algorithm along with a custom CNN for classification purposes.
Kang et al. [15] presented an approach for the classification of brain tumors in MRI images using an ensemble of deep features and machine learning classifiers. Their experiments were conducted on three different public datasets: BT-small-2c, BT-large-2c, and BT-large-4c. They cropped the brain region from the MRI images as a pre-processing step. Thereafter, they extended the datasets using flipping and rotation augmentation strategies. For classification purposes, many different pre-trained CNN models were used for feature extraction with different ML classifiers. In most cases, the SVM classifier with the RBF kernel outperformed the other ML classifiers. However, for feature extraction, the DenseNet-161 deep feature alone achieved the best results on the BT-small-2c dataset and the ensemble of InceptionV3, DenseNet-169, and ResNeXt-50 deep features achieved the best results on the BT-large-2c dataset. Finally, the ensemble of ShuffleNetV2, DenseNet-169, and MnasNet deep features achieved the best results on the BT-large-4c dataset.
Sakib et al. [30] developed a deep CNN network for brain tumor detection using MRI images. The SPMRI dataset was used in the study. The brain area was cropped from the MRI images, and normalization was applied to narrow the intensity values to a stable range. Different augmentation techniques were applied to extend the limited data, including rotation, shifting, flipping, shearing, and brightness and darkness adjustment. Finally, a pre-trained VGG-16 network was used and achieved an accuracy of 96%.
Salama et al. [16] proposed a novel approach for brain tumor detection based on convolutional variational generative models. The experiments were conducted using the SPMRI dataset. The MRI images were resized to a fixed resolution of
Alsaif et al. [13] focused on developing a novel data augmentation-based brain tumor detection method using CNNs. They utilized the SPMRI dataset in their experiments. First, they applied different data augmentation techniques such as flipping, rotation, and translation. Subsequently, the images were fed to different pre-trained networks, including ResNet-50, ResNet-150, VGG16, VGG19, InceptionV3, and DenseNet121. The pre-trained VGG-16 network achieved the best results compared to the others, with an accuracy of 96%.
Rai et al. [31] developed a novel LU-Net deep neural CNN model to detect brain abnormalities in MRI images. They used the SPMRI dataset in their experiments. Different preprocessing techniques were applied, such as converting the images to grayscale, cropping the brain region from the MRI image, and resizing the images into a fixed resolution of
Methodology and Proposed Work
In this section, we illustrate the entire proposed method for classifying brain MRI images, as shown in Figure 1. The full methodology consists of four main steps: preprocessing the raw input MRI images, image segmentation, applying the different novel augmentation techniques, and finally the classification step. These steps are discussed in detail in the following sub-sections.
A. Preprocessing
Most brain MRI images contain unnecessary black pixels. Thus, the brain area is cropped from the MRI images, as the remaining black area does not contain any relevant features that can help with classification. In addition, this helps the CNN network converge faster during training. The cropping process follows the steps shown in Figure 2. First, each input RGB image is smoothed by applying a
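To make this preprocessing step concrete, the following is a minimal sketch of how such brain-region cropping is commonly implemented with OpenCV: Gaussian smoothing, binary thresholding, morphological cleanup, and extraction of the extreme points of the largest contour. The kernel size, threshold value, and function name are illustrative assumptions and do not reproduce the paper's exact configuration.

```python
import cv2
import numpy as np

def crop_brain_region(image: np.ndarray, margin: int = 0) -> np.ndarray:
    """Crop the brain area from an MRI image by thresholding and contour extraction.

    The kernel size and threshold below are illustrative; the paper's exact
    values are not reproduced here.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)            # smooth to suppress noise
    _, thresh = cv2.threshold(blurred, 45, 255, cv2.THRESH_BINARY)
    thresh = cv2.erode(thresh, None, iterations=2)          # remove small artifacts
    thresh = cv2.dilate(thresh, None, iterations=2)

    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return image                                         # nothing to crop
    c = max(contours, key=cv2.contourArea)                   # largest contour = brain

    # Extreme points of the brain contour define the crop box.
    left, right = tuple(c[c[:, :, 0].argmin()][0]), tuple(c[c[:, :, 0].argmax()][0])
    top, bottom = tuple(c[c[:, :, 1].argmin()][0]), tuple(c[c[:, :, 1].argmax()][0])

    return image[max(top[1] - margin, 0):bottom[1] + margin,
                 max(left[0] - margin, 0):right[0] + margin]
```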
B. Segmentation
Image segmentation is a crucial part of our research study. It is considered a preprocessing step needed for applying the proposed augmentation techniques. The main objective of segmentation is to produce masks for the input training images; these masks are used as an initial step for applying the proposed augmentation techniques. In this section, we discuss the segmentation models and the loss functions used.
U-Net [23] is one of the most popular and powerful convolutional neural networks; it was first introduced in 2015 for the segmentation of biomedical images. Since then, it has been used in many different segmentation tasks such as medical image segmentation, self-driving cars, and satellite image segmentation. The network consists of two parts: an encoder followed by a decoder. The encoder is responsible for generating high-level semantic features from the input image through a sequence of encoder blocks, each consisting of a set of convolution and max-pooling operations. The decoder is responsible for mapping the dense high-level features generated by the encoder into the desired segmentation mask through a sequence of decoder blocks, each consisting of a set of up-sampling and convolution operations. Skip connections are utilized in the U-Net architecture to combine the features from the encoder with the corresponding feature resolutions in the decoder to refine the segmentation results. In addition, this allows the network to recover the spatial information lost during downsampling and to enhance the fine-grained details learned by the encoder.
VGGUNET is similar to U-Net; however, it uses a VGG16 network pre-trained on the ImageNet dataset [33], without its fully connected layers, as its encoder. The main advantage of replacing the default encoder of U-Net with a pre-trained network is that it enables the segmentation model to produce better features, thereby enhancing the overall segmentation results. It also helps the segmentation model converge faster. Some recent studies also replaced the default encoder of the segmentation model with different versions of the VGG architecture [34], [35]. Figure 3 shows the full network of VGGUNET. The VGGUNET encoder consists of 13 convolutional layers and 5 max-pooling layers. Each convolutional layer uses a
Finally, a
Each pixel in the prediction mask indicates whether it belongs to a tumor or not.
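The sketch below illustrates how such a VGGUNET could be assembled in Keras: the ImageNet pre-trained VGG16 convolutional layers serve as the encoder, intermediate VGG feature maps are reused as skip connections, and a lightweight decoder upsamples back to the input resolution with a final 1x1 sigmoid convolution. The decoder filter counts and layer choices are assumptions for illustration, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_vggunet(input_shape=(224, 224, 3)):
    """U-Net-style segmentation model with an ImageNet pre-trained VGG16 encoder.

    Decoder filter counts are illustrative assumptions; the paper's exact
    configuration may differ.
    """
    vgg = VGG16(include_top=False, weights="imagenet", input_shape=input_shape)

    # Encoder feature maps used as skip connections (one per resolution level).
    skips = [vgg.get_layer(name).output for name in
             ("block1_conv2", "block2_conv2", "block3_conv3",
              "block4_conv3", "block5_conv3")]
    x = vgg.get_layer("block5_pool").output                  # bottleneck at 1/32 resolution

    for skip, filters in zip(reversed(skips), (512, 512, 256, 128, 64)):
        x = layers.UpSampling2D(2)(x)                        # restore spatial resolution
        x = layers.Concatenate()([x, skip])                  # fuse encoder features
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # 1x1 convolution with sigmoid: per-pixel tumor probability.
    output = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(vgg.input, output, name="VGGUNET")
```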
The loss function used to optimize the VGGUNET segmentation model is a combination of two losses (the binary cross-entropy and Dice loss functions), defined as follows:\begin{equation*} \text {Total loss} = \text {Dice loss} + 0.1\times \text {BCE loss} \tag{1}\end{equation*}
\begin{equation*}\text {BCE loss}=-\frac {1}{m}\sum \nolimits _{i=1}^{m} \left ({y_{i}\log \hat {y}_{i} +(1-y_{i})\log (1-\hat {y}_{i}) }\right) \tag{2}\end{equation*}
The Dice loss function is primarily based on the Dice coefficient. The Dice coefficient is one of the most common metrics that is widely used to assess the performance of segmentation models. It is considered an overlapping index measure and is very similar to the IoU metric, where both are responsible for calculating the similarity between the ground truth mask and the predicted one. It was then slightly adapted to be used as a loss function [36]. Both the Dice loss and Dice coefficient are defined using Equations (3) and (4), respectively.\begin{align*} \text {Dice loss}&=1-\text {Dice coefficient} \tag{3}\\ \text {Dice coefficient}&= \frac {2\, y\, \hat {y}+\varepsilon }{y+\hat {y}+\varepsilon } \tag{4}\end{align*}
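As a concrete illustration, Equations (1)-(4) could be implemented in TensorFlow/Keras as shown below, with the soft Dice coefficient computed over all pixels in the batch. The epsilon value and function names are illustrative assumptions.

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, epsilon=1e-6):
    """Soft Dice coefficient (Eq. 4), computed over all pixels in the batch."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + epsilon) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + epsilon)

def total_segmentation_loss(y_true, y_pred):
    """Combined loss of Eq. (1): Dice loss plus 0.1 * binary cross-entropy."""
    dice_loss = 1.0 - dice_coefficient(y_true, y_pred)             # Eq. (3)
    bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)      # Eq. (2), per pixel
    return dice_loss + 0.1 * tf.reduce_mean(bce)
```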
C. Data Augmentation
Data augmentation techniques are used to expand the training data by generating new synthetic samples from existing ones. This helps reduce overfitting and increases the generalization performance of the classification model on unseen samples. In this study, we introduce new effective augmentation techniques called RegionInpaint augmentation, Cutoff augmentation, and RegionMix augmentation. In addition, basic data augmentation techniques are used. These techniques are discussed in detail in the following sub-sections.
1) RegionInpaint Augmentation
Image inpainting is the task of reconstructing missing pixels in an image. The missing pixels are filled in a realistic-looking way to produce a complete image. Many recent studies [37], [38], [39] have introduced novel image inpainting techniques that showed promising results. This study introduces a novel augmentation technique based on the idea of image inpainting. Figure 4 illustrates the steps involved in applying our proposed augmentation technique. First, segmentation is applied to the training images, as discussed in the previous section.
Proposed RegionInpaint augmentation technique (the upper part is responsible for generating the binary mask which is then fed along with the Tumor image to the inpainting network to generate the augmented image as shown in the lower part).
The goal of applying segmentation is to generate binary masks for our training images where the white pixels in the binary mask correspond to the tumor area and the black pixels correspond to the non-tumor area. Thereafter, a set of dilation and erosion operations is applied to the generated masks to fill any holes that may result from segmentation. These binary masks are then inverted such that the black pixels correspond to the tumor area. Finally, each inverted mask is fed to the inpainting network along with its corresponding original image. The inverted mask is used as the input mask for the inpainting network.
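A minimal sketch of this mask-preparation step is given below: the predicted tumor mask is binarized, morphologically cleaned, and inverted so that black pixels mark the region the inpainting network must fill. The kernel size, iteration counts, and function name are illustrative assumptions.

```python
import cv2
import numpy as np

def prepare_inpainting_mask(pred_mask: np.ndarray, kernel_size: int = 5,
                            iterations: int = 2) -> np.ndarray:
    """Clean a predicted tumor mask and invert it for the inpainting network.

    Kernel size and iteration counts are illustrative assumptions.
    """
    binary = (pred_mask > 0.5).astype(np.uint8) * 255           # binarize prediction
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    closed = cv2.dilate(binary, kernel, iterations=iterations)  # fill small holes
    closed = cv2.erode(closed, kernel, iterations=iterations)
    # Invert: black (0) marks the tumor pixels the inpainting network must fill.
    return cv2.bitwise_not(closed)
```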
The goal of the inpainting network is to fill in the pixels in the input image that correspond to the black pixels in the given mask. This corresponds to filling the tumor area with a realistic non-tumor area. In this manner, the data are augmented by transferring the tumor images to non-tumor images, and thus increasing the number of samples in the “non-tumor class”. The inpainting method used in the proposed augmentation is primarily based on the method introduced by Liu et al. [37]. The image inpainting network utilized is depicted in Figure 5.
Similar to the U-Net architecture, the inpainting network is designed in an encoder-decoder fashion, where skip connections exist between the encoder and decoder layers. The main difference is that the inpainting network uses partial convolution operations instead of standard convolution operations. In addition, the input image is stacked with its corresponding input mask before being fed into the network. The encoder part comprises a set of partial convolution layers, each using a stride of
The partial convolutional layer performs two steps. The first step is applying convolution using only the valid (non-missing) pixels, which is called partial convolution. The second step is the mask update mechanism. Standard convolution considers all the pixels in the sliding window, including the missing ones; it is therefore replaced with partial convolution, where only valid pixels are considered, resulting in much better image quality. The partial convolution operation is expressed by the following equation:\begin{align*}X^{\prime }=\begin{cases} W^{T}\left ({X\odot M }\right)\dfrac {\mathrm {sum}\left ({\mathbf {1} }\right)}{\mathrm {sum}\left ({M }\right)}+b, &\text {if } \mathrm {sum}\left ({M }\right)>0 \\ 0, & \text {otherwise} \\ \end{cases} \tag{5}\end{align*}
The second step is the mask update mechanism, which is expressed by the following equation:\begin{align*}m^{\prime }=\begin{cases} 1, & \text {if } \mathrm {sum}\left ({M }\right)>0 \\ 0, &\text {otherwise} \\ \end{cases} \tag{6}\end{align*}
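The sketch below shows one way Equations (5) and (6) could be realized as a Keras layer: the image is convolved after masking, the result is re-weighted by the ratio of window size to the number of valid pixels, and the mask is updated to mark windows that contained at least one valid pixel. It assumes the mask has the same shape as the image (e.g., repeated across channels) and is a simplified sketch rather than the exact layer used by Liu et al. [37].

```python
import tensorflow as tf
from tensorflow.keras import layers

class PartialConv2D(layers.Layer):
    """Partial convolution (Eqs. 5-6): convolve valid pixels only, then update the mask."""

    def __init__(self, filters, kernel_size, strides=2, **kwargs):
        super().__init__(**kwargs)
        self.conv = layers.Conv2D(filters, kernel_size, strides=strides,
                                  padding="same", use_bias=True)
        # Fixed all-ones kernel that counts the valid pixels inside each window.
        self.mask_conv = layers.Conv2D(filters, kernel_size, strides=strides,
                                       padding="same", use_bias=False,
                                       kernel_initializer="ones", trainable=False)

    def call(self, image, mask):
        masked = self.conv(image * mask)                        # W^T (X ⊙ M) + b
        valid_counts = self.mask_conv(mask)                     # sum(M) per window
        k = self.conv.kernel.shape
        window_size = float(k[0] * k[1] * k[2])                 # sum(1): ones in the window
        ratio = window_size / tf.maximum(valid_counts, 1.0)     # sum(1) / sum(M)
        bias = self.conv.bias
        output = (masked - bias) * ratio + bias                 # re-weight, keep bias
        updated_mask = tf.cast(valid_counts > 0, tf.float32)    # Eq. (6)
        return output * updated_mask, updated_mask              # zero out fully invalid windows
```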
A combination of different loss functions is used in order to optimize the inpainting network parameters. These loss functions are given below.
Pixel-wise loss: This loss is used to improve the pixel-wise reconstruction. The pixel-wise loss (L1 loss) is computed using two equations: Equation (7) computes the pixel-wise loss for the reconstructed valid pixels (non-hole pixels), and Equation (8) computes the pixel-wise loss for the reconstructed missing pixels.\begin{align*}L_{valid}&=\frac {1}{N}{\vert \vert M\odot (I_{out}-I_{gt})\vert \vert }_{1} \tag{7}\\ L_{hole}&=\frac {1}{N}{\vert \vert (1-M)\odot (I_{out}-I_{gt})\vert \vert }_{1} \tag{8}\end{align*}
Perceptual loss: The goal of the perceptual loss [40], [41] is to measure the perceptual similarity by passing each of the original ground truth images and the reconstructed images through a pre-trained deep neural network (i.e., VGG16). Therefore, instead of minimizing a pixel-wise loss, it minimizes the L1 distance between the corresponding high-level feature maps:\begin{align*} L_{perceptual}&= \sum \limits _{n=0}^{N} \frac {\left |{ \left |{ \psi _{n}\left ({I_{out} }\right)-\psi _{n}\left ({I_{gt} }\right) }\right | }\right |_{1}}{N_{\psi _{n}}} \\ &\quad + \sum \limits _{n=0}^{N} \frac {\left |{ \left |{ \psi _{n}\left ({I_{comp} }\right)-\psi _{n}\left ({I_{gt} }\right) }\right | }\right |_{1}}{N_{\psi _{n}}} \tag{9}\end{align*}
Style loss: It is similar to the perceptual loss [40], [41], [42], as it is also computed using the feature maps generated by a pre-trained model (VGG16). The difference is that the style loss applies an autocorrelation (Gram matrix) to each feature map before computing the L1 distance:\begin{equation*}{GM\left ({X }\right)=K_{n}\ast (\psi _{n}\left ({X }\right)}^{T}\psi _{n}(X)) \tag{10}\end{equation*}
\begin{align*} L_{style(out)}&=\sum \limits _{n=0}^{N-1} {\frac {1}{C_{n}^{2}}{\vert \vert GM(I_{out})-GM(I_{gt})\vert \vert }_{1}} \tag{11}\\ L_{style(comp)}&=\sum \limits _{n=0}^{N-1} {\frac {1}{C_{n}^{2}}{\vert \vert GM(I_{comp})-GM(I_{gt})\vert \vert }_{1}} \tag{12}\end{align*}
Total variation loss (TV loss): The total variation loss encourages the network to reduce the noise in the resulting image [41]. It computes the summation of the absolute differences between pixels and their corresponding neighbors in order to ensure the smoothness of the reconstructed missing pixels obtained by the inpainting network. It is defined by the following equation:\begin{align*} L_{tv}&=\sum \limits _{\left ({i,j }\right)\in P,\left ({i,j+1 }\right)\in P} \frac {\left |{ \left |{ I_{comp}^{i,j+1}-I_{comp}^{i,j} }\right | }\right |_{1}}{N_{I_{comp}}} \\ &\quad +\sum \limits _{\left ({i,j }\right)\in P,\left ({i+1,j }\right)\in P} \frac {\left |{ \left |{ I_{comp}^{i+1,j}-I_{comp}^{i,j} }\right | }\right |_{1}}{N_{I_{comp}}} \tag{13}\end{align*}
The total loss is defined as a combination of all the discussed losses with different weights, as given by Liu et al. [37].\begin{align*} L_{total}&= L_{valid}+6L_{hole} +0.05 L_{perceptual}+120(L_{style\left ({out }\right)} \\ &\quad + L_{style\left ({comp }\right)})+0.1 L_{tv} \tag{14}\end{align*}
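The following sketch shows how the weighted combination in Equation (14) could be assembled in TensorFlow, using a VGG16 feature extractor for the perceptual and style terms. Several simplifications are assumptions: the choice of pooling layers, the omission of VGG input preprocessing, the Gram-matrix normalization folding in the size factor, and the TV term being computed over the whole image rather than only over the dilated hole region of Equation (13).

```python
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16

# Feature extractor for the perceptual/style terms; the pooling layers used
# here (pool1-pool3) are an assumption, not necessarily the paper's exact choice.
_vgg = VGG16(include_top=False, weights="imagenet")
_feature_extractor = Model(_vgg.input, [_vgg.get_layer(n).output for n in
                                        ("block1_pool", "block2_pool", "block3_pool")])

def _gram_matrix(feat):
    """Autocorrelation of a feature map (Eq. 10), normalized by its size."""
    b, h, w, c = tf.unstack(tf.shape(feat))
    flat = tf.reshape(feat, [b, h * w, c])
    return tf.matmul(flat, flat, transpose_a=True) / tf.cast(h * w * c, tf.float32)

def _l1(a, b):
    return tf.reduce_mean(tf.abs(a - b))

def inpainting_total_loss(i_out, i_gt, mask):
    """Weighted combination of Eq. (14); I_comp is I_out with valid pixels restored."""
    i_comp = mask * i_gt + (1.0 - mask) * i_out
    l_valid = _l1(mask * i_out, mask * i_gt)                     # Eq. (7)
    l_hole = _l1((1.0 - mask) * i_out, (1.0 - mask) * i_gt)      # Eq. (8)

    f_out, f_gt, f_comp = (_feature_extractor(x) for x in (i_out, i_gt, i_comp))
    l_perc = tf.add_n([_l1(a, b) for a, b in zip(f_out, f_gt)]) \
           + tf.add_n([_l1(a, b) for a, b in zip(f_comp, f_gt)])               # Eq. (9)
    l_style = tf.add_n([_l1(_gram_matrix(a), _gram_matrix(b)) for a, b in zip(f_out, f_gt)]) \
            + tf.add_n([_l1(_gram_matrix(a), _gram_matrix(b)) for a, b in zip(f_comp, f_gt)])  # Eqs. (11)-(12)
    l_tv = tf.reduce_mean(tf.image.total_variation(i_comp))      # Eq. (13), whole image here

    return l_valid + 6.0 * l_hole + 0.05 * l_perc + 120.0 * l_style + 0.1 * l_tv
```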
2) Cutoff Augmentation
The proposed Cutoff augmentation approach is based on randomly selecting two images, one from the “tumor class” and the other from the “non-tumor class”. Figure 6 illustrates the steps for applying this augmentation technique. The first step is applying segmentation to the tumor image to obtain its corresponding mask, where the white pixels in the predicted mask represent the tumor region. Afterwards, a set of dilations and erosions is applied to the predicted mask to fill any small holes that may result from the segmentation step. The predicted mask is then superimposed on the original image to obtain the segmented tumor. This segmented tumor is copied onto the non-tumor image to obtain a new augmented image; hence, this augmentation approach increases the number of images in the “tumor class”. A Gaussian blur filter is then applied to the resulting image, enabling the copied tumor to blend with the background. Finally, different transformations are applied, such as rotation, flipping, and brightness and contrast adjustment, to make the tumor in the resulting image look different from the tumor in the original image, thereby increasing the variety of the training samples.
Proposed Cutoff augmentation technique (the upper part is responsible for generating the segmented tumor which is then copied to the non-tumor image to generate the augmented image as shown in the lower part).
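A minimal sketch of the copy-and-blend step of Cutoff augmentation is given below. The blur kernel size and the choice of pasting the tumor at its original location are illustrative assumptions; the subsequent random flips, rotations, and brightness/contrast changes described above are omitted for brevity.

```python
import cv2
import numpy as np

def cutoff_augment(tumor_img, tumor_mask, non_tumor_img, blur_ksize=5):
    """Copy a segmented tumor onto a non-tumor image to create a new 'tumor' sample.

    Blur kernel size and paste position (same location as in the source image)
    are illustrative assumptions.
    """
    mask = (tumor_mask > 0).astype(np.uint8)
    if mask.ndim == 2:
        mask = mask[..., None]                                   # broadcast over channels

    # Paste tumor pixels over the non-tumor image, keep the rest unchanged.
    augmented = non_tumor_img * (1 - mask) + tumor_img * mask

    # Light Gaussian blur helps the copied tumor blend with its new background.
    return cv2.GaussianBlur(augmented.astype(np.uint8), (blur_ksize, blur_ksize), 0)
```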
3) RegionMix Augmentation
We introduce a new effective augmentation technique called RegionMix. It is based mainly on segmentation and the Mixup approach [43], where the region of interest (i.e., the tumor region) is first extracted from a tumor image through a segmentation network and then mixed with a non-tumor image. The Mixup approach was introduced by Zhang et al. [43]. Since then, it has been widely used in different deep learning tasks such as segmentation, image recognition, natural language processing, and speech recognition. Mixup also has the great advantage of being data-agnostic, as it works with different types of data, such as images, text, speech, or any other source of data.
The main idea of Mixup is to extend the training samples in the training distribution by generating new ones that act as a linear interpolation of existing training samples and their corresponding labels. In addition, it acts as a regularization technique because it helps to reduce overfitting and increases the generalization and robustness of the model. In short, mixup generates new samples as a weighted linear combination of random image pairs from the training set as shown in Figure 7. The Mixup process is simply defined by the following equation:\begin{align*} \overline x &=\lambda x_{i}+\left ({1-\lambda }\right)x_{j} \tag{15}\\ \overline y& =\lambda y_{i}+\left ({1-\lambda }\right)y_{j} \tag{16}\end{align*}
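Equations (15) and (16) translate directly into a few lines of NumPy, sketched below; sampling the mixing weight from a Beta distribution follows the original Mixup formulation, and the default alpha value is an illustrative assumption.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Standard Mixup (Eqs. 15-16): blend two samples and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)          # mixing weight lambda
    x = lam * x1 + (1.0 - lam) * x2             # Eq. (15)
    y = lam * y1 + (1.0 - lam) * y2             # Eq. (16)
    return x, y
```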
Figure 8 illustrates the steps for applying the proposed RegionMix augmentation technique. Similar to the Cutoff augmentation technique, two random images are selected where the first image belongs to the “tumor class” and the second one belongs to the “non-tumor class”. The segmented tumor from the first image is mixed with the pixels that share the same tumor location in the second non-tumor image. The newly generated image is considered in-between both classes. Finally, different transformations are applied to the resulting image.
Proposed RegionMix augmentation technique (the upper part is responsible for generating the segmented tumor. The Mixup approach is then applied between the segmented tumor region and the non-tumor image as shown in the lower part).
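Building on the generic Mixup sketch above, the following illustrates how the RegionMix idea could be implemented: only the pixels inside the tumor mask are interpolated with the non-tumor image. The way the resulting in-between class membership is encoded as a soft label is an assumption for illustration, not a detail taken from the paper.

```python
import numpy as np

def regionmix_augment(tumor_img, tumor_mask, non_tumor_img, lam=0.5):
    """RegionMix sketch: mix the segmented tumor region into a non-tumor image.

    Only pixels inside the tumor mask are interpolated; the soft label assigned
    to the new sample is an assumption about encoding the in-between class.
    """
    mask = (tumor_mask > 0).astype(np.float32)
    if mask.ndim == 2:
        mask = mask[..., None]

    background = non_tumor_img.astype(np.float32)
    tumor = tumor_img.astype(np.float32)
    # Inside the tumor region: lambda * tumor pixels + (1 - lambda) * background pixels.
    mixed = mask * (lam * tumor + (1.0 - lam) * background) + (1.0 - mask) * background
    soft_label = np.array([1.0 - lam, lam])     # [non-tumor, tumor] mixture
    return mixed.astype(np.uint8), soft_label
```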
4) Basic Data Augmentation
In this work, basic data augmentation techniques are also used to obtain more images in each class. This is simply done by applying different transformations to the input images to generate new ones. Different transformations that are applicable to medical image classification tasks are applied, including horizontal and vertical flipping, image rotation, and adjusting the image brightness and contrast. Figure 9 shows a sample of the augmented images obtained using these different basic transformations.
Sample of brain tumor MRI image and its corresponding augmentation results: (A) Original MRI, (B) Vertical flipping, (C) Adjusting brightness and (D) Rotation.
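For completeness, these basic transformations could be applied on the fly with TensorFlow image ops as sketched below; the parameter ranges are illustrative assumptions, rotation is restricted to 90-degree multiples here, and the image is assumed to be a float tensor in [0, 1].

```python
import tensorflow as tf

def basic_augment(image):
    """Basic augmentations: flips, rotation, and brightness/contrast adjustment.

    Parameter ranges are illustrative assumptions; image is a float tensor in [0, 1].
    """
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)
```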
D. Classification
For the classification of brain MRI images, different pre-trained CNNs are evaluated, including VGG16 [22], VGG19 [22], ResNet50 [44], InceptionV3 [45], DenseNet121 [46], Xception [47], and MobileNetV2 [48]. First, these models are trained on the original training set only, without augmentation. The model with the best performance on the validation and testing sets is selected. This model is considered the main classification model and is further used along with the proposed augmentation techniques. Figure 10 shows the architecture of the VGG19 network, which achieved the best performance. The input image size for the network is
Experiments & Results
All the experiments in this study are conducted on an NVIDIA Tesla P100 GPU using the TensorFlow and Keras frameworks. The next sub-sections provide all the details and discussions about the datasets used, the considered evaluation metrics, the segmentation results, the inpainting results, and the classification results.
A. Datasets
We performed our experiments using two different brain MRI datasets. The first dataset is a small public dataset that was released on Kaggle in 2020 [26]. For simplicity, we refer to this dataset as the SPMRI dataset. It consists of 253 images distributed among two classes. The first class corresponds to the “tumor” class and contains 98 images.
The second class corresponds to the “non-tumor” class and contains 155 images.
The second dataset (the Br35H dataset) is also a public dataset released on Kaggle in 2020 [27]. It has the same two classes as those described for the SPMRI dataset. It consists of 3000 images, which are distributed equally between both classes; therefore, each class has 1500 images. This dataset is more diverse and has many more images than the SPMRI dataset.
To ensure that the proposed augmentations enable the classification model to generalize much better on unseen samples, we used two sets for evaluating the classification models (i.e., the validation and testing sets). The training set represents 80% of the SPMRI dataset, whereas the validation set represents the remaining 20% of the SPMRI dataset samples. It should be noted that the validation set used in our experiments was completely unseen during training. The testing set consists of the samples of the Br35H dataset. We aimed to use a tiny training set to examine how well our classification model generalizes to both the validation and testing sets when the small training set is extended with the new synthetic samples generated by the different proposed augmentation techniques. In addition, we considered using the Br35H dataset for evaluation besides the validation set in order to evaluate the model on a larger dataset with a distribution different from that of the SPMRI dataset; thus, we can ensure that the proposed augmentation techniques enable the classification model to generalize well even on new samples with different distributions. Table 1 shows the distributions of the training, validation, and testing sets.
B. Evaluation Metrics
To assess the effectiveness of the classification and segmentation models, four metrics including precision, recall, F1-score, and overall accuracy are used. These metrics are derived from the normalized confusion matrices, which are constructed using the four elements of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) according to the following equations:\begin{align*} Precision&=\frac {TP}{TP+FP} \tag{17}\\ Recall&=\frac {TP}{TP+FN} \tag{18}\\ F1-score&=\frac {2\ast Precision\ast Recall}{Precision+Recall} \tag{19}\\ Overall accuracy&=\frac {TP+TN}{TP+FP+TN+FN} \tag{20}\end{align*}
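These four metrics (Equations (17)-(20)) can be computed directly with scikit-learn, as in the short sketch below; treating “tumor” as the positive class is an assumption made for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_report(y_true, y_pred):
    """Eqs. (17)-(20) via scikit-learn, treating 'tumor' (label 1) as the positive class."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```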
For evaluating the image inpainting model performance, two different metrics are considered: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM). PSNR is a widely used measure for image reconstruction tasks. It is derived from the MSE, which compares the images pixel by pixel, and represents the ratio between the maximum possible power of the image signal and the power of the noise that distorts its representation. A higher PSNR value indicates a better reconstruction of the original image. The PSNR metric is calculated using the following equation:\begin{equation*} PSNR=20\cdot \log _{10}\left ({\frac {MAX_{I}}{\sqrt {MSE}} }\right) \tag{21}\end{equation*}
The SSIM metric [50] is widely used for determining the similarity between two images. It is considered a perceptual image metric, as it correlates with the perception of the human visual system (HVS). When the SSIM value approaches 1, the reconstructed image is almost identical to the original image. The SSIM is calculated using the following equation:\begin{equation*} SSIM\left ({x,y }\right)=\frac {\left ({2\mu _{x}\mu _{y}+c_{1} }\right)\left ({2\sigma _{xy}+c_{2} }\right)}{(\mu _{x}^{2}+ \mu _{y}^{2}+ c_{1})(\sigma _{x}^{2}+\sigma _{y}^{2}+c_{2})} \tag{22}\end{equation*}
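Both metrics are available as TensorFlow image ops, as sketched below for a single image pair; the value range (max_val) must match how the images are scaled, and the function name is illustrative.

```python
import tensorflow as tf

def inpainting_quality(reconstructed, original, max_val=1.0):
    """PSNR (Eq. 21) and SSIM (Eq. 22) for one reconstructed image and its ground truth.

    max_val should match the image value range (1.0 for [0, 1] images, 255 for uint8).
    """
    psnr = tf.image.psnr(reconstructed, original, max_val=max_val)
    ssim = tf.image.ssim(reconstructed, original, max_val=max_val)
    return float(psnr), float(ssim)
```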
C. Segmentation Results
This section presents the configurations and the results obtained from the experimental segmentation models. As mentioned previously, the main purpose of using segmentation is to obtain segmentation masks for the training images. These masks are needed as an initial step for applying the proposed augmentation techniques. Different segmentation models are evaluated, namely U-Net [23], VGGUNET, SegNet [24], and ResUNet [25]. In the Br35H dataset, only 800 images out of the total 3000 are annotated with their corresponding segmentation masks. These 800 annotated samples are used to train and evaluate the segmentation models: 500 of the annotated images are used for training, 200 for validation, and the remaining 100 for testing the models. Table 2 lists the configurations used to train each segmentation model. The hyperparameters listed in Table 2 for each segmentation model are tuned and selected based on the results achieved on the validation set. Table 3 shows the validation and testing results for each model using different evaluation metrics, including accuracy, precision, recall, and the Dice coefficient score (DSC).
The Dice coefficient follows the rule depicted in Equation (4), while precision and recall are calculated using Equations (17) and (18), respectively. As shown in Table 3, U-Net achieved results comparable to VGGUNET; however, the VGGUNET model outperformed all the other models by achieving the best results for all evaluation metrics. This is due to the use of a pre-trained VGG-16 as an encoder, which enabled the model to converge much faster and learn better feature representations. Since VGGUNET achieved the best results among the segmentation models, it is used to predict the segmentation masks on the training set of the SPMRI dataset as an initial step for applying the proposed novel augmentation techniques. Figure 11 shows random sample images from the Br35H dataset along with their corresponding annotation masks. Figure 12 shows the prediction masks of the VGGUNET model for some sample images in the SPMRI dataset.
D. Inpainting Results
1) Dataset Preparation and Training
RegionInpaint augmentation is one of the proposed augmentation techniques used to augment the training set of the SPMRI dataset to make it ready for classification (see Section III-C.1).
Prior to augmentation, the inpainting network is first trained using the 155 images in the “non-tumor class”. These 155 images are split as follows: 100 images are used for training and the remaining 55 are used for validation. For training purposes, five different random masks are generated for each training image, which enables the model to learn to fill different missing areas based on each random mask. Hence, a total of 500 images are used to train our inpainting model. Moreover, while training the inpainting network, each image has a 50% chance of being flipped to increase the variety of the images fed to the network. Two different techniques are used to generate random masks.
The first technique generates masks with random small circles, ellipses, and lines of different sizes. This can be seen in the examples in Figure 13 (3rd and 4th rows). The second technique generates masks using random circles only, with varying sizes. This can be seen in the examples in Figure 13 (1st and 2nd rows). The first technique enables the model to learn how to fill in small missing parts of different shapes with appropriate pixels based on the known surrounding context. Although this technique is used in most deep learning-based approaches for image inpainting, the second technique is considered more relevant to our research study, as brain tumor areas tend to look like circles of varying radius. Thus, it helps the model to be more robust when filling in missing parts that look like tumors of varying sizes.
Another important observation is that inpainting models can easily achieve remarkable results in restoring small missing holes in an image with appropriate pixels. However, when the holes become larger, the filled contents begin to suffer from blurry textures and distortion due to the large gap between the known and unknown pixels. In such cases, the model tends to replace the large hole with a blurry area instead of an appropriate visual area. Accordingly, combining both random mask generation techniques allows the model to converge faster and learn a more natural filling of the missing areas/holes regardless of their size.
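As an illustration of the second (circle-only) technique, a random mask generator could look like the sketch below; the mask resolution, circle count, and radius ranges are assumptions, and the convention that 0 marks the missing pixels matches the inpainting formulation above.

```python
import cv2
import numpy as np

def random_circle_mask(height=512, width=512, max_circles=5,
                       min_radius=10, max_radius=80):
    """Generate a random binary mask of circular holes (second mask technique).

    All size ranges are illustrative assumptions; 0 marks the missing pixels
    that the inpainting network must fill.
    """
    mask = np.ones((height, width), dtype=np.uint8)
    for _ in range(np.random.randint(1, max_circles + 1)):
        center = (np.random.randint(0, width), np.random.randint(0, height))
        radius = np.random.randint(min_radius, max_radius)
        cv2.circle(mask, center, radius, 0, thickness=-1)   # filled circle = hole
    return mask
```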
2) Results of Image Inpainting
This section presents the results and the configurations of the image inpainting model. The inpainting network is trained using a combination of different losses (see Section III-C.1). The training setup of the inpainting network is shown in Table 4. The Adam optimizer [51] is used to update the network weights with a learning rate of 1e-5 and a batch size of 4. The model is pre-trained on the ImageNet dataset [33] and fine-tuned for 100 epochs on the training set. The PSNR and SSIM metrics are used to evaluate the model performance. The model achieved a PSNR of 28.9 and an SSIM of 0.875 on the validation set. Figures 14 and 15 show the values of PSNR and SSIM over the number of epochs, respectively. After training the inpainting network, it is used to generate the augmentation images for the SPMRI train split (i.e., the non-tumor class) using the masks generated in the segmentation task. Both the images and their generated binary masks with size
Sample input images with their corresponding segmentation masks and their inpainting results.
E. Classification Results
1) Experiment I - Comparing Different Classification Models
For classification, different pre-trained classification models are used, including VGG16 [22], VGG19 [22], ResNet50 [44], DenseNet121 [46], InceptionV3 [45], Xception [47], and MobileNetV2 [48]. The Adam optimizer [51] is utilized for tuning the model parameters with a mini-batch size of 32. Moreover, all these models are pre-trained on the ImageNet dataset [33], trained and validated on the SPMRI dataset, and finally tested on the Br35H dataset. The number of samples in the training, validation, and testing sets for each class is listed in Table 1. The training and validation sets are randomly selected from the SPMRI dataset with ratios of 80% and 20%, respectively. Since the SPMRI dataset samples are not equally distributed between the two classes, the number of samples in the training and validation sets is also not equally distributed between the two classes.
To fairly compare the experimental models, all of them are trained on the same training set and evaluated on the same validation and testing sets. In addition, to illustrate the efficiency of our proposed augmentation techniques, these classification models are first trained on the original training set without augmentation. The model that achieves the best results is selected for further training on the extended dataset after applying the different proposed augmentation techniques, in order to investigate its generalization ability on the validation and testing data. Different metrics are used to evaluate our models, including accuracy, precision, recall, F1-score, and AUC-score. It can be observed from Table 5 that VGG19 outperformed the remaining models in terms of the considered evaluation metrics. It achieved the best overall accuracy, F1-score, and AUC-score on the unseen samples of the validation and testing sets. VGG16 and InceptionV3 also achieved results comparable to VGG19.
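A transfer-learning classifier of this kind could be set up as in the sketch below, using the ImageNet pre-trained VGG19 as a feature extractor with a small binary head; the head layers, dropout rate, and learning rate are illustrative assumptions, since the paper's exact classification head is not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_classifier(input_shape=(224, 224, 3)):
    """Binary tumor/non-tumor classifier on top of an ImageNet pre-trained VGG19.

    The classification head (pooling and dense layer sizes) and learning rate
    are illustrative assumptions.
    """
    base = VGG19(include_top=False, weights="imagenet", input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    output = layers.Dense(1, activation="sigmoid")(x)        # tumor probability

    model = Model(base.input, output)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model
```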
2) Experiment II - Effect of Balancing the Training Set
Since the training set samples are not balanced between the two classes, the second experiment aims to investigate the effect of balancing the training set using the different proposed augmentation techniques. Referring to the unbalanced training sample sizes of each class (see Table 1), the minor class (i.e., the non-tumor class) is extended with 40 samples using the basic and RegionInpaint augmentation techniques, which are the valid augmentation techniques for extending the minor class. This allows both classes to be almost equally distributed. Consequently, VGG19 is used along with both augmentation techniques, as depicted in Table 6, with the same hyperparameters used for training before augmentation. It is observed from Table 6 that the proposed RegionInpaint augmentation showed significantly favorable performance compared to the cases of no augmentation and basic augmentation on both the validation and testing sets.
This demonstrates that the proposed technique (RegionInpaint augmentation) creates new synthetic samples capable of significantly improving the generalization ability of the model. Furthermore, the proposed RegionInpaint augmentation gives an advantage of enabling the classification model to consider the same brain image twice; once with the tumor area and once without the tumor area. This helps the model focus more on distinguishing between images through the presence or absence of the discriminative features (i.e., tumor region) regardless of the other non-informative features in the image.
3) Experiment III - Comparison of the Different Proposed Augmentation Techniques
In this sub-section, we investigate the effect of all the proposed augmentation techniques. The VGG19 model is used in this experiment along with the proposed augmentation techniques. First, each augmentation technique is used individually to observe its effect on the classification model. Thereafter, two or more techniques are combined to generate more diverse synthetic samples while trying to maintain the balance of the dataset between both classes. As mentioned previously, the RegionInpaint augmentation technique is used to increase the number of samples in the “non-tumor class”. In contrast, the Cutoff augmentation technique is used to increase the number of samples in the “tumor class”. Thus, both techniques are used together to increase the number of samples in both classes.
RegionMix augmentation technique is used to generate new synthetic samples that are considered in-between both classes as illustrated in Figure 8. The weighting parameter (
To fairly observe the effect of each augmentation technique, 150 samples are initially generated by each augmentation technique. In this context, the combination of using RegionInpaint and Cutoff techniques achieved the best results on the validation and testing sets compared to the basic and RegionMix augmentation techniques. Moreover, the RegionMix augmentation outperformed the basic data augmentation on the testing set.
Generally, the best overall validation accuracy (100%) is achieved when using the RegionInpaint and Cutoff augmentation techniques together, as well as when using the RegionInpaint, Cutoff, and basic augmentation techniques. On the other hand, the best overall testing accuracy (96.8%) is achieved when using all augmentation techniques together (i.e., RegionInpaint, Cutoff, RegionMix, and basic augmentations). Finally, it is worth mentioning that when using RegionMix augmentation in training the network, the convergence is slower compared to the other augmentation techniques because it explores new samples with different distributions in the data space.
4) Experiment IV - Comparison of the Proposed Work Results With Related Works
In this sub-section, our quantitative results are compared with those of other studies from the literature that work on the SPMRI dataset. Most studies have only considered extending the SPMRI dataset using traditional augmentation techniques. Moreover, all these studies evaluated their experimental models on a validation set that belongs to the same SPMRI dataset according to their train/test split.
As depicted in Table 8, our results on the validation set (i.e., when using VGG19 along with the Cutoff, RegionInpaint, and basic augmentation techniques) outperform those of the remaining studies, which demonstrates the efficiency and robustness of the proposed work. Table 8 also provides other details needed for comparison, including the model used, the applied augmentation techniques, and the overall accuracy.
Conclusion
In this study, we investigate the impact of training a deep CNN network using a small dataset of MRI brain tumor images and how this adversely affects the generalization of the model. To address this issue, we introduce several novel augmentation techniques named RegionInpaint augmentation, Cutoff augmentation, and RegionMix augmentation. Traditional augmentation methods are also used in addition to the proposed ones. By using the proposed augmentation techniques to generate synthetic samples, the performance of the classifier improved significantly.
The full proposed approach can be described as follows. First, the brain area is cropped from the input MRI images to remove the irrelevant background. Segmentation is then applied to generate the corresponding segmentation masks that highlight the tumor area. Segmentation is applied because the proposed augmentation techniques depend mainly on it; thus, it is a crucial step in this study for generating the segmentation masks, based on which the proposed augmentation techniques can be applied. A U-Net-like architecture called VGGUNET is used, which takes advantage of a pre-trained VGG16 network instead of the default encoder of U-Net. It achieved Dice coefficients of 85.82% and 82% on the validation and testing sets, respectively. In addition, it outperformed the other segmentation models used, including U-Net, SegNet, and ResUNet. Thereafter, VGGUNET is used to obtain the segmentation masks for the input training MRI images.
Finally, different pre-trained classifiers are used, including VGG16, VGG19, DenseNet121, ResNet50, InceptionV3, Xception, and MobileNetV2. They are trained and validated on the train split (80%) and validation split (20%) of the SPMRI dataset, respectively, and tested on the Br35H dataset. VGG19 achieved the best results among them and thus is further selected to be used along with the proposed augmentation techniques. Initially, each augmentation technique is applied individually to observe its effect on the classification model. Afterwards, more augmentation techniques are used together to generate more diverse samples. The best validation accuracy obtained is 100% when using the Cutoff and RegionInpaint augmentation techniques, while the best testing accuracy achieved is 96.88% when using all the augmentation techniques together. The obtained results reveal that the proposed augmentation techniques guarantee a well-generalized model with superior performance, surpassing other studies that have applied other common augmentation techniques.
Although our proposed augmentation techniques along with the used models achieved promising results, there is still a number of challenges and future directions to consider. First, we aim to apply the proposed novel augmentation techniques to other applicable medical datasets. In addition, we will investigate more deep learning models for the segmentation and classification steps with reduced time complexity to ease their deployment in a real-time environment. Moreover, one limitation of the proposed RegionInpaint augmentation is that we can only generate a limited number of images, bounded by the number of samples in the class from which we want to remove the region of interest (i.e., the tumor area in our case). Another constraint of this study is that the proposed augmentation techniques depend mainly on the segmentation step. Thus, if annotated segmentation masks are not available for a specific task, the proposed augmentation techniques cannot be applied.