PMED-Net: Pyramid Based Multi-Scale Encoder-Decoder Network for Medical Image Segmentation

A pyramidal multi-scale encoder-decoder network, named PMED-Net, is proposed for medical image segmentation. Several variants of encoder-decoder networks are in use for segmenting medical images, U-Net being the most widely used. However, existing architectures for medical image segmentation have millions of parameters that require enormous computation, which makes them memory- and cost-inefficient. To overcome these limitations, we propose training small networks in a cascaded form for coarse-to-fine prediction. The proposed adaptive network is extended up to six pyramid levels, and at each level, features are extracted at a different scale of the input image. Each lightweight encoder-decoder network is trained independently to minimize the loss, and each succeeding network further refines the previous predictions. Evaluation and comparison of our architecture were performed on four publicly available medical image segmentation datasets: the International Skin Imaging Collaboration (ISIC) 2018 challenge dataset, a brain tumor dataset, a nuclei dataset, and an X-ray dataset. The experimental results of PMED-Net are either better than or on par with other state-of-the-art networks in terms of IoU, F1-Score, and sensitivity. Moreover, PMED-Net is efficient in terms of parameterized complexity, having 21.3, 21.1, 14.0, 11.6, 11.2, 6.64, and 4.95 times fewer parameters than SegNet, U-Net, BCDU-Net, CU-Net, FCN-8s, ORED-Net, and MultiResUNet, respectively. The pre-trained models, dataset information, and implementation details are available at https://github.com/kabbas570/Pyramid-Based-Encoder-Decoder.

has resulted in their application to all fields of computer vision, from self-driving cars [11] to facial recognition [12], bioinformatics [13], [14], stereo vision [15], 3D scene reconstruction [16], and healthcare [17]. In medical image processing, different imaging technologies such as magnetic resonance imaging (MRI), microscopy, ultrasound, dermoscopy, X-ray, and computed tomography (CT) are used to capture images of the human body [18]. The goal of a computer-aided diagnosis (CAD) system is to analyze these images and produce accurate and quick diagnostic reports for medical specialists so that patients can receive immediate and effective treatment.
In recent years, deep neural networks (DNNs) have replaced classical recognition and segmentation methods based on hand-engineered features [19]. However, supervised deep learning models are data-hungry, requiring extensive amounts of training data with well-defined ground truths [20]. Procuring large amounts of training data is often impractical and infeasible, especially for rarely occurring diseases [21]. Furthermore, obtaining medical data faces challenges related to logistics approvals regarding patient privacy, storage problems, extracting data from proprietary legacy raw files, and ground-truth generation. Data augmentation strategies can provide an alternative approach to meet this data requirement; however, they compromise training performance due to the presence of similar textures, shapes, and correlated features [22].
Image segmentation refers to the process of classifying an image at the pixel level [23]. For medical images, segmentation is crucial in many applications for extracting the region of interest (ROI). It can divide an image into different ROIs to give a clear interpretation of a diseased organ, tissue, or cells [24]. For illustration, Figure 1 shows examples of the four publicly available medical image segmentation datasets used for the experiments conducted in this paper. In this study, we propose a small and efficient pyramid-based multi-scale encoder-decoder network called PMED-Net for medical image segmentation. The main contributions of this work can be summarized as follows.
• An architecture that employs small pyramid-based encoder-decoder networks in a cascaded fashion is proposed for extracting complex lesions and biomarkers contained within medical images by leveraging their multi-scale feature representations.
• An adaptive technique for choosing the network size is presented to achieve an optimal trade-off between performance and computation.
• Features of different scales are extracted with the use of pyramid-based encoder-decoder networks.

II. RELATED WORK
Medical image segmentation was investigated even before the advent of deep learning. The graph-cut method [32], thresholding based on histograms [33], and edge- and region-based techniques [34] were among the popular schemes. To extract coherent regions, clustering algorithms were implemented [35], and for cases in which images had irregular patterns and boundaries, the fuzzy c-means (FCM) algorithm was introduced [36]. However, these cluster-based methods were limited in their application due to their dependence on prior information about the number of clusters. A region-growing method was proposed in [37], which grouped pixels with the same intensities into one region. However, the method is semi-automated, requiring human supervision for selecting the initial seed region. In deep learning, most of the networks used for segmentation follow an encoder-decoder topology [38]. These networks share the same strategy of increasing the depth and decreasing the spatial dimension of the feature maps in the encoder, while the decoder does the opposite [39]. The fully convolutional network (FCN) [29] was the first model to extend the power of contemporary classification networks such as AlexNet [10], VGG [40], and GoogleNet [41] to the segmentation task, and it performed much better than patch-based methods [42]. Furthermore, FCN offers variable stride rates to generate coarser-to-finer predictions (FCN-8s, FCN-16s, and FCN-32s). The encoder part is the same for all FCN versions, while the decoders differ in their up-sampling stride. In FCN-8s and FCN-16s, the predictions are added to earlier pooled layers to make finer final predictions. SegNet is another popular encoder-decoder architecture widely used for semantic segmentation [25]. Its decoder up-samples low-resolution feature maps using the pooling indices from the encoder to create sparse feature maps.
One of the most famous networks for medical image segmentation is U-Net [26]. The network is an encoder-decoder architecture with skip connections from the encoder to the decoder. In the encoder, after two consecutive convolutions, a 2 × 2 max-pooling is performed to reduce the feature map size. In the decoder, up-sampling with stride 2 is used to recover the resolution [26].
A variety of modifications to the basic structure of U-Net have been proposed with the goal of improving its performance. By introducing a cascaded deep framework for brain tumor segmentation, CU-Net [28] outperformed the original U-Net architecture. However, with the addition of auxiliary supervision, branch supervision, and two cascaded U-Nets, the overall CU-Net architecture becomes very large and slow. A deep neural network called Bi-directional ConvLSTM U-Net with Densely connected convolutions (BCDU-Net) was proposed in [27] to combine Bi-directional ConvLSTM (BConvLSTM) and dense convolutions [43] with U-Net. BConvLSTM replaced the skip connections of U-Net, and densely connected convolutions were implemented in the encoding path for better feature reuse.
The ORED-Net architecture was proposed in [30] to segment eye regions into multiple classes. The network is based on SegNet [25], with non-identity residual connections from the encoder to the decoder to reduce information loss. Ibtehaz and Rahman designed an enhanced version of U-Net named MultiResUNet, in which each pair of convolutional layers of U-Net was replaced with Inception-like blocks [39]. The authors claim that this strategy iteratively reuses spatial features across various scales. For multi-resolution analysis, 3 × 3, 5 × 5, and 7 × 7 kernels are used in parallel, but this increases memory requirements. To address this, they factorized the larger convolution filters into a series of 3 × 3 convolutions. Keeping in mind the limitations and excessive parameter counts of these networks, we developed PMED-Net for medical image segmentation. The pyramid architecture enables the network to extract features at different scales, and cascaded models are employed for coarse-to-fine prediction. Furthermore, we achieved superior performance compared to other state-of-the-art models in terms of intersection over union (IoU), F1-Score, and sensitivity on four publicly available medical image segmentation datasets.
The rest of the paper is organized as follows: Section III discusses the proposed framework, and the evaluation metrics are listed in Section IV. The dataset details and ablation studies are included in Sections V and VI, respectively. Finally, Section VII presents the evaluation results, followed by concluding remarks in Section VIII.

III. PROPOSED ARCHITECTURE
The PMED-Net architecture, shown in Figure 2, consists of six small encoder-decoder networks, where each network generates coarse predictions that are further refined at the next level. Predictions made by the k-th level encoder-decoder network (N_k) are up-sampled with stride 2, concatenated with the input image, and used as input for the N_{k+1} network. The proposed cascaded methodology enables the network to reuse information iteratively and extract features at different resolutions.

A. PYRAMID LEVELS
The proposed PMED-Net architecture has six pyramid levels, which enable the model to extract details of the input image at different scales. If the level-1 input size is H × W, then the corresponding input and ground-truth sizes for the six pyramid levels (level-1 through level-6) are 2^(k-1)·H × 2^(k-1)·W, where k = 1, 2, ..., 6 for each corresponding level. We used bilinear interpolation for up-sampling to match the dimensions. This pyramid strategy increases the network's ability to extract details of smaller regions of interest at different scales from the images.
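As a minimal sketch, the input resolution at each pyramid level can be computed as follows; the 48 × 48 base resolution is the one used for level-1 during training (Section III-D), and the function name is illustrative:

```python
def level_size(k, base_h=48, base_w=48):
    """Input/ground-truth resolution at pyramid level k (1-indexed).

    Each level doubles the resolution of the previous one,
    i.e. 2**(k-1) times the base size in each dimension.
    """
    return (2 ** (k - 1) * base_h, 2 ** (k - 1) * base_w)

# Levels 1..6 for the 48 x 48 base used during training:
sizes = [level_size(k) for k in range(1, 7)]
# [(48, 48), (96, 96), (192, 192), (384, 384), (768, 768), (1536, 1536)]
```

Note that level-4 corresponds to 384 × 384, which matches the image size used in the 'without pyramid' ablation of Section VI.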

B. NETWORK STRUCTURE
At each pyramid level k, a small encoder-decoder network is trained independently to minimize the loss function. The predictions of this network are then up-sampled using bilinear interpolation to match the dimensions of the next pyramid level, since the input size of the next network is double that of the preceding one. The up-sampled predictions are concatenated with the input image and used by the next-level network. The exception is the level-1 network, where no coarse estimate is available and the network uses only the image as input. The reuse of input images at different scales improves the flow of information and finer details while generating the latent feature representations.
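The cascade step above can be sketched with NumPy as follows. This is a simplified illustration: the paper uses bilinear interpolation for up-sampling, whereas nearest-neighbour (via `np.kron`) is shown here for brevity, and all names are illustrative:

```python
import numpy as np

def upsample2x_nearest(pred):
    """2x up-sampling of an (H, W) prediction map.
    (The paper uses bilinear interpolation; nearest-neighbour
    is shown here as a simplification.)"""
    return np.kron(pred, np.ones((2, 2), dtype=pred.dtype))

def next_level_input(image_next, pred_k):
    """Concatenate the up-sampled level-k prediction with the
    level-(k+1) image along the channel axis."""
    up = upsample2x_nearest(pred_k)[..., None]       # (2H, 2W, 1)
    return np.concatenate([image_next, up], axis=-1)  # (2H, 2W, C+1)

img2 = np.zeros((96, 96, 3), dtype=np.float32)  # level-2 image
p1 = np.zeros((48, 48), dtype=np.float32)       # level-1 prediction
x2 = next_level_input(img2, p1)                 # input for network N_2
```

The concatenated tensor has one extra channel, which is why later-level networks have slightly more parameters than the level-1 network.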

C. ENCODER-DECODER NETWORK
At each level k of the proposed scheme, a three-stage encoder-decoder network is trained independently to estimate the segmentation map. The detailed architecture of a single lightweight encoder-decoder network is shown in Figure 3. The number of feature maps in the three encoder stages is increased as 16, 32, and 64. At each stage, we used two consecutive 3 × 3 convolutions with the Rectified Linear Unit (ReLU) activation function [44], followed by max-pooling with stride 2 and a 2 × 2 window to decrease the spatial dimension. Starting from 16, the number of feature maps is doubled after each stage, and the maximum number of feature maps in the encoder was limited to 64 to minimize the number of trainable parameters of each encoder-decoder network.
After the third stage, the feature maps are up-sampled with stride 2 and fed directly to the decoder. Before concatenation with the corresponding encoder feature maps, a 2 × 2 convolution halves the number of feature maps. Then two 3 × 3 convolutions with the ReLU activation function [44] are applied in the decoder.
This sequence of 2 × 2 up-sampling, a 2 × 2 convolution, concatenation, and two 3 × 3 convolutions is repeated at each stage to match the dimensions of the corresponding encoder stage. The final segmentation map is generated by a 1 × 1 convolution; this last convolution layer uses the sigmoid activation function.
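The spatial behaviour of one such three-stage encoder-decoder can be traced with a short sketch: 'same'-padded convolutions preserve size, each 2 × 2 pooling halves it, and each up-sampling doubles it, so the output segmentation map matches the input resolution. The helper below is illustrative, not the authors' code:

```python
def spatial_trace(h, w, stages=3):
    """Feature-map spatial sizes through a 'stages'-deep encoder-decoder
    with 2x2 pooling and matching 2x up-sampling.  A sketch assuming
    'same' padding; h and w must be divisible by 2**stages."""
    # Encoder: input size plus the size after each of the three poolings
    enc = [(h // 2**s, w // 2**s) for s in range(stages + 1)]
    # Decoder: each up-sampling doubles the size back toward (h, w)
    dec = [(h // 2**s, w // 2**s) for s in range(stages - 1, -1, -1)]
    return enc, dec

enc, dec = spatial_trace(48, 48)
# enc: [(48, 48), (24, 24), (12, 12), (6, 6)]
# dec: [(12, 12), (24, 24), (48, 48)]
```

The final 1 × 1 sigmoid convolution then maps the last (48, 48) feature maps to a single-channel segmentation map of the same size as the input.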
The proposed encoder-decoder network is lightweight and relatively shallow (compared to a conventional encoder-decoder) in terms of feature maps and computational depth. Employed alone, it would therefore produce coarser segmentation results. To overcome this problem, we stacked the networks in a cascade (as shown in Figure 2). The predictions obtained from the preceding model instance are refined by concatenating finer-scale feature representations, resulting in superior performance compared to existing frameworks while drastically reducing the computational requirements.
Overall, the PMED-Net architecture is quite small and has far lower parameterized complexity than other segmentation networks, as shown in Figure 4. Each instance of the proposed network has only three stages with far fewer feature maps; the level-1 model (i.e., N_1) has 244,209 parameters, and each remaining N_k (k = 2, 3, 4, 5, 6) has 244,353 parameters. This slight increase occurs because the networks at these levels also take the previous coarse prediction as input, in addition to the input image, to refine it further. In total, the proposed architecture comprises 1,465,974 parameters across its six pyramid levels.
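The reported parameter counts are mutually consistent, as the short check below shows. The 144-parameter difference between N_1 and the later networks equals the weights added to the first 3 × 3 convolution (16 filters) by the one extra input channel carrying the coarse prediction (the interpretation of the difference is our reading of the architecture):

```python
# Extra weights from one additional input channel in the first
# 3x3 convolution with 16 filters: kernel 3*3, 1 channel, 16 filters.
extra = 3 * 3 * 1 * 16
assert extra == 244_353 - 244_209  # 144

# Total over six pyramid levels, as reported in the paper:
total = 244_209 + 5 * 244_353
assert total == 1_465_974
```

The same arithmetic gives the four-level total of 977,268 parameters (244,209 + 3 × 244,353) reported for the brain tumor dataset.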

D. NETWORK TRAINING
We trained each encoder-decoder network N_k (k = 1, 2, ..., 6) independently and computed the coarse prediction p_k for the given input I_k ⊕ P_k, where P_k is the up-sampled prediction from the previous level, I_k is the input image, and ⊕ denotes concatenation. Each N_k aims to minimize the dice loss at a different scale of the input. The network N_1 (shown in Figure 2) was trained with 48 × 48 images, and for each subsequent pyramid level we doubled the resolution. This process was iterated up to level-6. Each network was trained using the prediction of the previous network as an initialization. All networks were trained using Adam optimization [45] with β1 = 0.9 and β2 = 0.99. The learning rate was set to 1e-4 with a batch size of 2. The loss function is defined in terms of the dice coefficient [46] as follows,
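A minimal NumPy sketch of the standard soft-Dice loss is given below; the exact smoothing constant and variant used by the authors are assumptions here:

```python
import numpy as np

def dice_loss(pred, gt, smooth=1.0):
    """Soft Dice loss: 1 - (2*|P.G| + s) / (|P| + |G| + s).
    Standard formulation; the smoothing constant 's' is an assumption."""
    p, g = pred.ravel(), gt.ravel()
    intersection = np.sum(p * g)
    dice = (2.0 * intersection + smooth) / (np.sum(p) + np.sum(g) + smooth)
    return 1.0 - dice

gt = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_perfect = dice_loss(gt, gt)   # perfect overlap -> zero loss
```

In training, this loss is minimized independently at each pyramid level over the network's sigmoid output and the ground-truth mask resized to that level.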

IV. EVALUATION METRICS
We used several evaluation metrics to evaluate and compare the performance of the PMED-Net architecture. First, we computed the confusion matrix between the prediction and the ground truth by counting the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These counts are used to measure the performance of the network in terms of intersection over union (IoU), F1-Score, and recall/sensitivity. IoU is the ratio of the area of overlap to the area of union between the prediction and the ground truth; in terms of the confusion-matrix variables, it is defined as IoU = TP / (TP + FP + FN). Precision reflects the ability of the model to locate only relevant objects, and recall evaluates true positive detections relative to all ground truths: Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The F1-Score is the harmonic mean of precision and recall, expressed as F1 = 2 · Precision · Recall / (Precision + Recall).
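These metrics can be computed from binary masks as in the following sketch (names are illustrative; a zero-division guard would be needed for empty masks):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """IoU, precision, recall, and F1 from binary (0/1) prediction and
    ground-truth masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # predicted positive, actually positive
    fp = np.sum(pred & ~gt)   # predicted positive, actually negative
    fn = np.sum(~pred & gt)   # predicted negative, actually positive
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
iou, precision, recall, f1 = segmentation_metrics(pred, gt)
```

For this toy example, TP = 1, FP = 1, and FN = 1, giving IoU = 1/3 and precision = recall = F1 = 0.5.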

V. DATASETS
We used four publicly available medical image segmentation datasets for the experiments conducted in this study. For each dataset, a pixel-wise prediction was performed. The details of each dataset and the distribution of data into training, validation, and testing sets are described in this section.

A. ISIC 2018 (SKIN LESION ANALYSIS TOWARDS MELANOMA DETECTION) DATASET
This dataset was released by the International Skin Imaging Collaboration (ISIC) in 2018 [47], [48]. It contains 2594 dermoscopy images that are available at: https://challenge2018.isic-archive.com/. The dataset consists of different challenging tasks such as boundary segmentation, attribute detection, and disease classification. For all experiments conducted in this paper, we used 1816 images for training, 258 for validation, and 520 for testing, taken from task 1 (boundary segmentation).

B. BRAIN TUMOR DATASET
This dataset was obtained from The Cancer Imaging Archive (TCIA) and contains 110 cases of lower-grade glioma patients. The data comprises MR images along with FLAIR abnormality segmentation masks. For the proposed experiments, we deleted the images without labeled pixels; after this filtering, we were left with 880 images along with their ground truths. These images were split into training (600), validation (100), and holdout test (180) sets. The dataset is available at the following link: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation/version/1.

C. X-RAY DATASET
The X-ray dataset used in this paper is composed of four different datasets: the Montgomery County chest X-ray set, the Japanese Society of Radiological Technology (JSRT) dataset [49], the Shenzhen chest X-ray set [50]-[52], and the National Institutes of Health (NIH) Chest X-ray dataset [53]. The Montgomery County X-ray set was obtained from the Department of Health and Human Services of Montgomery County, MD, USA. It contains 138 posterior-anterior X-rays from their tuberculosis control program: 80 normal and 58 abnormal scans, together with their corresponding ground-truth masks, available at: http://openi.nlm.nih.gov/imgs/collections/NLM-MontgomeryCXRSet.zip.
The JSRT dataset was created by the JSRT and the Japanese Radiological Society (JRS) for different tasks such as computer-aided diagnosis, image compression, and picture archiving. It consists of 247 images, 154 with and 93 without lung nodules. Pixel-wise lung annotation masks for 246 of the images are also provided for segmentation tasks at the following link: http://db.jsrt.or.jp/eng.php.
The Shenzhen dataset contains 662 X-ray images, among which 326 are normal and 336 show symptoms of tuberculosis. Pixel-wise annotation masks for 566 instances are available at: https://www.kaggle.com/yoctoman/shcxr-lung-mask.
Overall, by combining the above three datasets, we had a total of 950 images, divided into a training set (850) and a validation set (100). For testing, we used a different dataset, the NIH dataset. One hundred samples with various lung diseases were taken from the NIH Chest X-ray dataset and manually annotated by [54]. These images are available at: https://nihcc.app.box.com/s/r8kf5xcthjvvvf6r7l1an99e1nj4080m. This NIH dataset includes several severities of lung disease, which allows a more effective evaluation of network performance and generalization capability.

D. NUCLEI DATASET
This dataset contains 670 segmented nuclei images and was provided by the Data Science Bowl 2018 Segmentation Challenge, available at: https://www.kaggle.com/gangadhar/nuclei-segmentation-in-microscope-cell-images. The images were captured under different conditions, magnifications, and modalities (brightfield vs. fluorescence) and are provided with a mask for each nucleus. As a pre-processing step, all the nuclei of a single input image were combined into one ground-truth mask. Images were randomly assigned to a training set (510), a validation set (60), and a testing set (100).

VI. ABLATION STUDIES
The effectiveness of the proposed PMED-Net architecture was also evaluated by comparing it with ablated variants. We investigated two modifications of PMED-Net in our ablation study: (1) rather than using pyramids of different scales, we used same-size images at all six levels of the network; (2) we increased or decreased the number of pyramid levels (i.e., encoder-decoder networks) in the architecture. This strategy is used for each dataset to experimentally determine the optimal trade-off between performance and computation.
For case (1), we used the NIH X-ray segmentation dataset, for which the optimal performance is obtained at the fourth level of PMED-Net. When images of the same size are used at all four levels, the network is unable to extract features at different scales, so the performance of PMED-Net is lower than when a pyramid of different scales is used across the four levels. The quantitative results of this experiment are listed in Table 2. In the implementation of the 'without pyramid' method, all images are of the same size (384 × 384).
In the proposed method, we employed six pyramid levels to develop PMED-Net; the number of levels was determined empirically. For some datasets the optimal performance is obtained at the fourth or fifth level (as shown in Figure 5), while for other datasets performance improves up to the sixth level. We also extended the pyramid beyond six levels (i.e., up to seven and eight levels), but the performance gain was statistically insignificant. Accordingly, in this study, we set the maximum number of levels to six, while the minimum number of levels depends on the dataset itself.

TABLE 2. Performance comparison using different scales of images versus same-size images at four levels of the proposed architecture for NIH X-ray dataset segmentation.
We analyzed the performance of PMED-Net by changing the number of encoder-decoder networks in the architecture. Different numbers of encoder-decoder networks were cascaded, ranging from one to six, for all four datasets. As the number of levels increased, improvement in performance could be observed, as shown in Figure 5.
The optimal number of levels depends upon the complexity of the dataset, and the boost in performance across the six pyramid levels is different for each dataset. For all four datasets, there was a significant improvement in IoU from level-1 to level-4; further increasing the number of levels yields only a slow increase in IoU. Thus, by considering the complexity of the dataset and the trade-off between performance and computation, we can adaptively change the network size. However, using more levels requires longer training and testing times.
The PMED-Net architecture performs coarse-to-fine prediction in a cascaded manner, as shown in Figure 6. At level-1, the network can identify the area of interest to be segmented; however, it still cannot distinguish between different nuclei. Higher-level networks further refine these coarse predictions and segment each nucleus more clearly. For the sake of visualization, we scaled all predictions in Figure 6 to the same size.

VII. RESULTS AND DISCUSSIONS
As introduced in Section V, we used four publicly available medical image datasets to evaluate and compare the performance of the proposed PMED-Net. All experiments of the proposed study were conducted on a PC equipped with an NVIDIA Titan XP GPU, using the Keras framework with a TensorFlow backend.

A. ISIC SEGMENTATION
The quantitative analysis on the ISIC dataset between PMED-Net and the other networks is listed in Table 3. For each evaluation index, the proposed PMED-Net outperforms the other networks. The closest results to our network were produced by CU-Net; however, they are 3%, 1.82%, and 1.2% less accurate than the proposed network in terms of the IoU, F1-Score, and sensitivity metrics, respectively. The results of FCN-8s were the lowest, which stems from under-segmentation of the area of interest. For visualization purposes, the qualitative results are shown in Figure 7. The first column is the input, the second is the ground truth, and the succeeding columns are the segmentation maps generated by U-Net, FCN-8s, SegNet, BCDU-Net, CU-Net, ORED-Net, MultiResUNet, and PMED-Net, respectively.

B. NUCLEI SEGMENTATION
The quantitative results for the nuclei segmentation task are listed in Table 4. The performance of the proposed architecture was better than that of SegNet, FCN-8s, CU-Net, ORED-Net, MultiResUNet, and U-Net in terms of IoU and F1-Score, and on par with BCDU-Net. BCDU-Net performs marginally better than PMED-Net (by 0.21%, 0.13%, and 1.3% in terms of IoU, F1-Score, and sensitivity, respectively) while utilizing 14 times more parameters. The PMED-Net architecture was extended to six pyramid levels for this dataset, and the performance improvement contributed by each level is shown in Figure 5. As can be seen, each extra pyramid level further refined the previous predictions. PMED-Net comprises only around 1.3 million parameters, compared to the 20.66 million parameters of BCDU-Net.
The visual results of PMED-Net and the other comparative models are shown in Figure 8. PMED-Net gave satisfactory performance in segmenting small nuclei and clearly distinguishing the boundaries of each nucleus, whereas U-Net, FCN-8s, SegNet, ORED-Net, MultiResUNet, and CU-Net were unable to distinctively differentiate the regions of interest.

C. BRAIN TUMOR SEGMENTATION
Table 5 illustrates the quantitative performance of the proposed architecture on the brain tumor dataset compared to the other networks. For this dataset, PMED-Net was extended to four pyramid levels; beyond the fourth level there was insignificant improvement in performance, as shown in Figure 5. PMED-Net outperforms SegNet, FCN-8s, CU-Net, ORED-Net, and MultiResUNet in terms of IoU and F1-Score, whereas it slightly underperforms U-Net, BCDU-Net, CU-Net, and ORED-Net in terms of sensitivity. The visual results for the brain tumor dataset are shown in Figure 9. The PMED-Net architecture for this dataset had fewer than one million parameters (977,268) and was capable of producing on-par or better results compared to the other methods.

D. X-RAY SEGMENTATION
The quantitative and qualitative analyses for the X-ray segmentation task were performed on the NIH database. Table 6 summarizes the segmentation performance of the PMED-Net architecture against all other networks for each evaluation metric. The proposed network performs significantly better than all other networks in terms of the IoU, F1-Score, and sensitivity metrics.
The qualitative results of the X-ray segmentation are shown in Figure 10. PMED-Net performs well in segmenting small regions and boundaries, which is evident from row 2 of Figure 10. This performance is obtained with the four-level PMED-Net architecture; the six-level PMED-Net enhances performance by only 0.83% while costing 1.5 times more parameters.
Moreover, in this study, we also include a few poor segmentation examples of PMED-Net, shown in Figure 11, where the proposed network's performance is reduced as it either over-segments or under-segments the region of interest.

VIII. CONCLUSION
In summary, we have presented a pyramid-based multi-scale encoder-decoder network, PMED-Net, for medical image segmentation. The proposed PMED-Net has far fewer parameters and shorter training and inference times, making it more efficient and applicable for embedded healthcare applications. The PMED-Net architecture uses a coarse-to-fine prediction approach in which small encoder-decoder networks extract features at a different scale at each pyramid level. We have extended the architecture up to six pyramid levels, with the optimal number of levels determined empirically. At each level, a lightweight encoder-decoder network is trained independently; its predictions are then up-sampled, concatenated with the next pyramid level's images, and used as input to the next-level encoder-decoder network. We have evaluated and compared PMED-Net on four publicly available medical image datasets. The results show that the proposed PMED-Net improves computer-aided diagnosis of medical images compared to other state-of-the-art networks while having much lower parameterized complexity.