An Evolutionary DenseRes Deep Convolutional Neural Network for Medical Image Segmentation

The performance of a Convolutional Neural Network (CNN) depends strongly on its architecture and corresponding parameters. Manually designing a CNN is a time-consuming process, given the variety of layers it can contain and the number of parameters that must be configured. Increasing the complexity of the network structure by employing various types of connections makes designing a network even more challenging. Evolutionary computation, as an optimisation technique, can be applied to arrange the CNN layers and/or initialise its parameters automatically or semi-automatically. Dense networks and Residual networks are two popular network structures that were introduced to facilitate the training of deep networks. In this paper, leveraging the potential of Dense and Residual blocks and the capability of evolutionary computation, we propose an automatic evolutionary model to discover an optimal, accurate network structure and its parameters for medical image segmentation. The proposed evolutionary DenseRes model is employed for the segmentation of six publicly available MRI and CT medical datasets. The proposed model obtained high accuracy while employing networks with minimal parameters, and outperformed manually and automatically designed networks, including U-Net, Residual U-Net, Dense U-Net, Non-Bypass Dense, NAS U-Net, AdaResU-Net, and EvoU-Net.


I. INTRODUCTION
A Deep Convolutional Neural Network (DCNN) is a deep network constructed from several layers, such as convolution and pooling layers. In some cases, it can also include short and long connections with various connection patterns. Hand-designing a Deep Neural Network (DNN) is a complicated task that requires tremendous effort and in-depth knowledge of artificial neural networks. Using more parameters and operations in a network makes this task even more complicated. However, an automatic or semi-automatic method can simplify the process of developing a DCNN. Evolutionary Algorithms [1] and Reinforcement Learning (RL) [2] are two popular methods to establish a neural network and/or initialise its parameters automatically or semi-automatically.
The associate editor coordinating the review of this manuscript and approving it for publication was Hengyong Yu.

Neuroevolution is the use of an evolutionary algorithm to generate a network and/or set up its parameters [3]. Several papers have investigated the application of Neuroevolution to establish different types of neural networks [4]-[9]. A very early paper in this area was published by Montana et al. [10], where a Genetic Algorithm (GA) [11] was utilised for network weight initialisation. Gradually, other parameters have also been included in the search space to uncover their optimum values [12]-[14]. The first application of evolutionary computation to the evolution of a deep network was introduced by Koutnik et al. [15] in 2014. During the last five years, several papers have been published on using Neuroevolution to design deep feed-forward networks and deep CNNs [7].
EvoDeep [16] is a graph-based evolutionary model that was developed to create a deep network structure along with its parameters for image classification. EvoCNN [17] is another GA-based evolutionary model, equipped with a new encoding strategy and an original crossover method for image classification. Since network encoding is a critical issue that can affect a network's evolution, several techniques have been introduced for network representation. For instance, two different encoding models, a binary representation and a grammatical representation, have been applied for the evolution of a network for handwriting recognition [18]. Also, a fixed-length IP-based encoding model and a corresponding variable-length version were introduced for image classification [19], [20].
The next issue in a network's evolution is the network construction method, which can be layer by layer or block by block. The evolutionary encoding models mentioned above develop a network layer by layer. Neural Architecture Search (NAS) [21] is a cell (block) based model that evolves blocks using a Recurrent Neural Network (RNN) [22]. After finding the block structures, the corresponding network is established by stacking the obtained blocks on top of each other. Further, AmoebaNet [23], an evolutionary version of NAS, has been developed to construct a network from the obtained blocks. Another block-based evolutionary model was introduced by Fielding et al. [24] to develop a VGG [25] based architecture for image classification. In this model, most of the parameters are kept fixed, such as the number of output feature maps and the filter size, although the number of layers in each block is specified using an enhanced version of the Particle Swarm Optimisation (PSO) algorithm [26].
Furthermore, Sun et al. [27] introduced another block-based evolutionary model for image classification using a combination of Dense blocks [28] and Residual blocks [29]. In their model, three different units are established: the Dense Block Unit (DBU), the Residual Block Unit (RBU), and the Pooling Unit (PU), such that DB and RB units can contain several blocks. Some parameters, like the filter size and the number of convolution layers, are kept fixed. Since a variable-length encoding strategy is employed for network representation, new crossover and mutation operations are also proposed. AdaResU-Net [30] is another evolutionary model, in which a fixed network structure is utilised, but a number of parameters, including the learning rate, dropout probability, number of filters, activation function, and filter size of each convolution layer, are specified using a multi-objective evolutionary algorithm. In AdaResU-Net, all blocks are equipped with residual connections and three convolution layers. Finally, EvoU-Net [31] is another block-based evolutionary method proposed for medical image segmentation. EvoU-Net utilised a GA to determine a network structure and its parameters automatically, where the blocks' internal structures and the patterns of shortcut and long connections were also specified during evolution [31].
As discussed above, in most previous works, part of the parameters or network structure is kept fixed, and an evolutionary or reinforcement technique is then applied to set up some aspects of the network. In this paper, by contrast, we propose a new block-based evolutionary technique to build a network and set up its parameters automatically for medical image segmentation. Leveraging the capability of Dense blocks [28] and Residual blocks [29] for feature extraction, our evolutionary DenseRes model is the first automatic U-Net [32] based model that develops a deep convolutional neural network for image segmentation using a combination of Dense and Residual blocks. A Genetic Algorithm (GA) [11] is used to discover the appropriate network structure and the other parameters required to create and train a network automatically. In the proposed model, 14 parameters must be specified by the GA: the number of blocks (block status), the number of convolution layers in each block (layer status), the filter size of each convolution layer, the number of output feature maps, dropout [33] probability, pooling layer, type of activation function, type of block, long connection, Batch Normalisation (BN) [34], optimiser, learning rate, batch size, and augmentation size.
In the proposed model, the GA is employed to identify all the parameter values necessary to develop a U-Net-based network using a combination of Dense and Residual blocks. Therefore, an optimal network structure along with its parameters can be developed for medical image segmentation without in-depth knowledge of deep learning. Using limited computation and a relatively short search time, the proposed model can find small, accurate networks for medical image segmentation. The results obtained for MRI prostate segmentation (two datasets), CT liver segmentation (two datasets), CT spleen segmentation, and MRI brain segmentation show the capability of our proposed model to develop appropriate networks for various medical image segmentation tasks, outperforming U-Net [32], Residual U-Net [35], Dense U-Net [36], Non-Bypass Dense [37], NAS U-Net [38], AdaResU-Net [30], and EvoU-Net [31].
The rest of this paper is organised as follows. Section 2 provides background on the network structures and the genetic algorithm utilised in our proposed model. Section 3 presents the proposed model. The datasets and experimental results are discussed in Section 4. Section 5 provides the discussion and conclusion.

II. BACKGROUND
In our proposed model, a combination of three different network structures, namely U-Net, Dense Network, and Residual Network, along with a Genetic Algorithm, is employed to create an evolutionary DenseRes network. This section reviews these algorithms. Since the proposed model is applied to medical image segmentation, a review of the various types of image analysis techniques is also provided.

A. U-Net
A Fully Convolutional Neural Network (FCNN) is a version of CNN that is designed for image segmentation [39]. An FCNN is constructed from two parts: a down-sampling (encoding, convolution) path for feature extraction and an up-sampling (decoding, deconvolution) path for reconstructing the segmentation. In some networks, a specific block named the Bottleneck (Bridge) connects these two parts. In the down-sampling section, the network extracts features as it moves from higher to lower resolution by applying pooling layers [40], while the up-sampling part reconstructs the coarse-to-fine segmentation with transposed convolutions [41]. An FCNN uses an end-to-end method for learning: image-to-image supervised learning for 2D images, and volume-to-volume supervised learning for 3D volumes.
One of the first models that utilised an FCNN with 2D convolutions for 2D medical semantic image segmentation is U-Net [32]. The original version of U-Net is constructed from three parts. The down-sampling part contains four blocks, such that between each pair of blocks there is a max-pooling layer to select the maximum value of each cluster and halve the size of the feature maps [40]. In the up-sampling section, between each pair of blocks there is a transposed convolution (up-convolution) layer with a 2 × 2 kernel to double the size of the output feature maps. Additionally, the Bottleneck block connects the two parts. All blocks are constructed from two convolution layers, each followed by a non-linearity, and finally a 1 × 1 convolution layer is employed as the last layer. Also, to improve the results, long connections are used to crop and copy part of the extracted feature maps from the down-sampling part and concatenate them with the feature maps obtained in the up-sampling section.
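The resolution schedule described above can be traced with a small sketch (an illustrative helper, not the paper's implementation): 2 × 2 max-pooling halves the spatial size after each encoder block, and 2 × 2 up-convolution doubles it after each decoder block.

```python
# Illustrative sketch: feature-map sizes through a U-Net-style
# encoder/decoder. The helper name is our own, for illustration only.
def unet_resolutions(input_size, n_blocks):
    """Return spatial sizes after each 2x2 max-pooling (encoder)
    and each 2x2 transposed convolution (decoder)."""
    encoder = [input_size]
    for _ in range(n_blocks):
        encoder.append(encoder[-1] // 2)   # max-pooling halves the size
    decoder = [encoder[-1]]
    for _ in range(n_blocks):
        decoder.append(decoder[-1] * 2)    # up-convolution doubles it
    return encoder, decoder

enc, dec = unet_resolutions(128, 4)        # original U-Net uses four blocks
# enc: [128, 64, 32, 16, 8]; dec: [8, 16, 32, 64, 128]
```

The symmetric schedule is why the decoder can mirror the encoder block-for-block.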

B. DENSE NETWORK
A Dense Convolutional Neural Network [28] is a deep network composed of several Dense blocks, each consisting of several layers. In this network, to reuse extracted feature maps, the output feature maps of each layer are concatenated (stacked) with the outputs of all previous layers in that block, and the concatenated feature maps are then sent to the next layer (see Figure 1). Stacking several feature maps of varying quality obtained through these shortcut connections helps the network reuse features, improves the quality of propagation, and alleviates the vanishing-gradient [42] problem [28]. Although this network was introduced for natural image classification, a fully convolutional version of it was also introduced in [43] for colour image segmentation.
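The dense-connectivity pattern can be sketched as follows (a minimal NumPy illustration; a "layer" is abstracted as any function producing new channels, standing in for a convolution):

```python
import numpy as np

# Sketch of dense connectivity: each layer receives the concatenation of
# the block input and the outputs of all previous layers (channel axis 0).
def dense_block(x, layers):
    features = [x]
    for layer in layers:
        out = layer(np.concatenate(features, axis=0))
        features.append(out)
    # the block output stacks the input and every layer's output
    return np.concatenate(features, axis=0)

# Toy "layers": each emits 2 new channels regardless of how many it receives.
layers = [lambda f: f[:2] * 0 + 1, lambda f: f[:2] * 0 + 2]
x = np.zeros((4, 8, 8))        # 4 input channels of 8x8 feature maps
y = dense_block(x, layers)     # channels grow: 4 + 2 + 2 = 8
```

The channel count growing with every layer is the signature of a Dense block, in contrast to the shape-preserving sum of a Residual block below.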

C. RESIDUAL NETWORK
Since training deep networks is difficult [29], [44], shortcut and long connections can help ease the training process. A Residual Neural Network [29] is a kind of neural network that applies skip (shortcut) connections to jump over layers. A skip connection transfers the input feature maps of a block to its output, where the transferred feature maps are element-wise summed with the output feature maps; this requires the number and size of the feature maps to match (see Figure 1b). Using skip connections can reduce the probability of vanishing gradients [42] during backpropagation [45].
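The residual pattern reduces to an element-wise sum, as in this minimal sketch (the transform is a toy stand-in for the block's convolution layers):

```python
import numpy as np

# Sketch of a residual (skip) connection: the block input is element-wise
# summed with the block output; shapes must match for the sum to be valid.
def residual_block(x, transform):
    out = transform(x)
    assert out.shape == x.shape, "identity shortcut needs matching shapes"
    return x + out                      # element-wise sum

x = np.ones((16, 8, 8))                 # 16 feature maps of 8x8
y = residual_block(x, lambda t: 2 * t)  # toy transform; y = x + 2x = 3x
```

When the shapes do not match, a projection is needed, as discussed for the proposed model in Section III.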

D. GENETIC ALGORITHM
The Genetic Algorithm (GA) [11] is a meta-heuristic inspired by natural selection and belongs to the category of evolutionary algorithms. As an optimisation technique, GA relies on three critical operations: selection, crossover, and mutation. Generally, the evolution process commences with an initialisation stage, where a number of chromosomes are created randomly as candidate solutions; each chromosome is then evaluated to determine the quality of the solution it encodes. To create a new generation, some chromosomes are chosen in the selection stage; then, using crossover, new solutions are produced by combining the selected parents. Finally, a random alteration (mutation) is applied to a number of chromosomes to increase the exploration ability of the algorithm. Part or all of the current population is then replaced with the new chromosomes to form the next generation, and this process is repeated until a specified stopping criterion is met.
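These stages can be illustrated with a toy GA on a trivial objective (maximising the number of ones in a bit string); all parameter choices here are illustrative, not those of the proposed model:

```python
import random
random.seed(0)

# Toy GA sketch: initialisation, evaluation, selection, single-point
# crossover, mutation, and replacement with elitism.
def fitness(bits):                 # toy objective: count of ones
    return sum(bits)

def evolve(pop_size=20, length=12, generations=30):
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        new_pop = [p[:] for p in pop[:2]]          # elitism: keep best two
        while len(new_pop) < pop_size:
            p1, p2 = random.sample(pop[:10], 2)    # select among the fittest
            cut = random.randrange(1, length)      # single-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.2:              # occasional mutation
                i = random.randrange(length)
                child[i] = 1 - child[i]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

best = evolve()
```

Section III describes how the same loop is instantiated with network genotypes and segmentation accuracy as the fitness.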

E. IMAGE SEGMENTATION
Medical image segmentation is one of the critical medical image analysis techniques and can be used to find organs, cancers, tumours, or other abnormalities in medical images. Established semi-automated and automated medical image segmentation models can mainly be categorised into four groups: Atlas-based, Shape-based, Image-based, and Superpixel-based segmentation. Recently, deep-learning-based approaches have achieved state-of-the-art results in image processing and specifically in image segmentation [46]-[49].
The capability of Neural Networks (NNs), and specifically Convolutional Neural Networks (CNNs), for image analysis convinced researchers to develop new CNN-based structures for medical image analysis [50], [51]. Most CNN-based networks use a combination of an image processing technique and a CNN for image analysis, such as the combination of an Atlas-based model and a CNN [52]-[54]. A combination of a proposal-based technique and a CNN has also been successful for image segmentation [55]. More recently, several CNN-based architectures have been developed [56] such that, in some cases, a CNN alone suffices for feature extraction and image analysis, such as image segmentation [37], [57] and classification [58].
For example, U-Net [32] is an FCNN-based network that was developed for medical image segmentation without any pre-processing or post-processing. In some studies, a trial-and-error approach is applied to find a good network structure for medical image segmentation. For instance, Hassanzadeh et al. [37] tried several types of shortcut connections to find the best one for prostate MRI segmentation. However, for various and changing datasets, this approach is very time-consuming and computationally expensive.

F. RESEARCH CONTRIBUTION
• In this paper, we propose the first evolutionary U-Net-based model to construct a network using a combination of Dense blocks and Residual blocks for medical image segmentation.
• To develop a precise network structure, the Genetic algorithm is utilised to uncover the best combination of the Dense blocks and Residual blocks, as well as the internal arrangement of the blocks.
• All the parameters that are needed to create a U-Net-based network, along with network training parameters, are specified using the GA.
• A new block-based encoding model is developed to create blocks and, consequently, the network. We propose a fixed-length encoding model to create variable-depth networks that are compatible with standard GA operations.
• Users without in-depth knowledge in the area of artificial neural networks will be capable of establishing an optimum network for medical image segmentation utilising our proposed model.
• The proposed framework is general enough to be applied to the different types of medical images.

III. EVOLUTIONARY DenseRes NETWORK
In this paper, we propose an evolutionary DenseRes model to create a U-Net-based deep convolutional neural network for medical image segmentation. Given the difficulty of training deep networks, Dense blocks and Residual blocks, both developed to ease the training process, can be combined to alleviate the complexity of network training. However, the number of possible combinations of Dense and Residual blocks in a network, together with the possible arrangements inside each block, makes finding a network structure suited to a specific application very challenging. The manual design of such a complicated structure is very time-consuming, involves large elements of trial and error, and is in some cases even impossible [37].
To address this problem, we propose a new block-based encoding model to represent the possible solutions and develop a network automatically. The aim of the proposed model is to uncover the best combination of Dense and Residual blocks, the arrangement of layers inside the blocks, and the training parameters of the network using GA. First, to create a block-based network structure, the number of blocks in the down-sampling section and the up-sampling section, and whether there is a bridging block, must be determined. As mentioned above, the number of blocks and the block structures in the encoding and decoding sections are the same in a U-Net-based network; therefore, the structures of the encoding blocks and the bridging block are sufficient to create the whole network. Second, as discussed above, in a U-Net-based structure each block in the encoding section is followed by a pooling layer that halves the size of the feature maps, with a corresponding deconvolution layer that doubles the size of the feature maps in the decoding section. Consequently, the maximum number of blocks that can be utilised in each part of the network depends on the size of the input image. For example, a U-Net-based network trained on 128 × 128 input images can use at most six blocks in each of the down-sampling and up-sampling sections, plus a bridging block. Table 1 lists the 14 parameters and their corresponding ranges used to create and train a network in our proposed model. As shown in Table 1, the maximum number of blocks in the encoding section plus the bridging block is seven. Each block can contain one to three convolution layers, and each convolution layer has its own filter size; consequently, the convolution layers in a block can employ similar or completely different filter sizes. In a DCNN, the size of the receptive field largely determines how much context information is utilised [59].
Therefore, using filters with different sizes allows features to be detected at various scales. The number of filters, dropout probability, pooling, activation function, type of block, long connection, and batch normalisation are set up in a block-based manner; in other words, each of these parameters applies to the whole block. For instance, if Batch Normalisation (BN) is active for a block, BN will be applied after each convolution layer. Also, when converting genotype to phenotype, the long-connection parameter is ignored for the bridging block and the decoding blocks, because it is meaningless for them.
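The bound on the number of blocks stated above (six encoder blocks plus a bridge for 128 × 128 inputs) follows from the repeated halving; a small sketch makes the relation explicit (the helper name is illustrative, not from the paper):

```python
import math

# Each encoder block is followed by 2x2 pooling, so the input resolution
# must be halvable once per block (down to a 2x2 bottleneck). This bounds
# the usable encoder depth by the input size.
def max_encoder_blocks(input_size):
    return int(math.log2(input_size)) - 1   # 128 -> 6 blocks (+ bridge)

assert max_encoder_blocks(128) == 6
```

Doubling the input resolution to 256 × 256 would permit one more encoder block under the same rule.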
In our proposed encoding model, in the initialisation stage, each chromosome corresponds to seven blocks, the maximum number of blocks that can be utilised in the encoding and bridging sections. To create networks with various depths, an additional parameter is employed to determine the activation status of each block. Similarly, a status parameter is used for each convolution layer. Since each convolution layer has its own status parameter, a block is considered inactive if either its own status parameter is inactive or all three convolution layers inside it are inactive. Therefore, one status parameter plus 13 other parameters are required to construct a block. Since 14 parameters are involved in constructing each block, 98 genes are necessary to represent seven blocks. Four more parameters are then appended at the end of each chromosome to represent the optimiser, learning rate, batch size, and augmentation size for training the network. Overall, each chromosome has a fixed size of 102 genes, yet the output network can have various depths, because some of the blocks and the convolution layers inside them might be deactivated in the initialisation stage or during the evolution process. Utilising variable-depth networks can increase the chance of finding more precise networks, because network depth significantly affects a network's performance [17], [25], [60], [61].
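The gene counting above (7 blocks × 14 genes + 4 training genes = 102) can be sketched as follows; the specific gene names and value ranges here are illustrative stand-ins loosely following Table 1, not the paper's exact scheme:

```python
import random
random.seed(1)

BLOCK_GENES = 14   # 1 status + 3 layer statuses + 3 filter sizes + 7 scalars
N_BLOCKS = 7       # six encoder blocks + one bridging block
TRAIN_GENES = 4    # optimiser, learning rate, batch size, augmentation size
assert N_BLOCKS * BLOCK_GENES + TRAIN_GENES == 102

def random_block():
    # Illustrative ranges only; see Table 1 for the actual parameter ranges.
    return {
        "active": random.randint(0, 1),
        "layer_active": [random.randint(0, 1) for _ in range(3)],
        "filter_size": [random.choice([3, 5, 7]) for _ in range(3)],
        "n_filters": random.choice([16, 32, 64]),
        "dropout": round(random.uniform(0.0, 0.5), 2),
        "pooling": random.choice(["max", "avg"]),
        "activation": random.choice(["relu", "elu"]),
        "block_type": random.choice(["dense", "residual"]),
        "long_connection": random.randint(0, 1),
        "batch_norm": random.randint(0, 1),
    }

def random_chromosome():
    return {
        "blocks": [random_block() for _ in range(N_BLOCKS)],
        "training": {"optimiser": random.choice(["adam", "sgd"]),
                     "learning_rate": random.choice([1e-2, 1e-3, 1e-4]),
                     "batch_size": random.choice([4, 8, 16]),
                     "augmentation_size": random.choice([1000, 2000, 4000])},
    }

chromosome = random_chromosome()
```

Because every chromosome has the same layout, standard single-point crossover and per-gene mutation apply directly, while block and layer status genes still yield variable-depth phenotypes.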
An example genotype of a block (see Figure 2a) and its corresponding phenotype (see Figure 2b) are provided in Figure 2. As shown in Figure 2a, the first parameter indicates whether the block is active (1) or not (0). A block is excluded from its network if either its status parameter is inactive or all three layers inside the block are inactive. The active blocks among the first six blocks are considered the encoding and decoding blocks, and the last block (when active) serves as the bridging block; the proposed encoding model is therefore capable of establishing networks with or without a bridging block. Furthermore, if only one block is activated, the output network contains two blocks (one encoding and one decoding block). In the provided example (see Figure 2a), since the block is active and two of its layers are active, the block is included in the network structure. Moreover, it is a Dense block; therefore, the output of each layer must be concatenated with the output feature maps of the previous layers. There is also a long connection between this block and its corresponding block in the decoding section. The filter size of each convolution layer is stated, and the other parameters are set for the block. The phenotype of this block is presented in Figure 2b. All active blocks are created and stacked on top of each other as in the provided example. The encoding blocks are mirrored to develop the decoding blocks, except that the pooling layers are replaced with deconvolution layers and the long connections are ignored in the decoding section.
It should be noted that the element-wise sum in the Residual blocks can only be applied to an equal number of feature maps. Therefore, to convert a Residual block's genotype to its corresponding phenotype, an additional 1 × 1 convolution layer is applied to resize the feature maps whenever their numbers differ. For instance, if a block has 32 input feature maps but only 16 output feature maps, a 1 × 1 convolution is used to increase the number of output feature maps to 32.
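The role of the 1 × 1 convolution can be sketched as a per-pixel linear map over channels (a minimal NumPy illustration with random weights, not the paper's Keras code):

```python
import numpy as np

# Sketch of the projection shortcut: a 1x1 convolution mixes channels at
# each pixel without touching the spatial dimensions, so it can resize the
# channel count to make the element-wise sum valid.
def project_channels(x, out_channels):
    rng = np.random.default_rng(0)
    w = rng.standard_normal((out_channels, x.shape[0]))  # 1x1 conv weights
    return np.tensordot(w, x, axes=([1], [0]))           # channels only

block_out = np.ones((16, 8, 8))               # 16 output feature maps
resized = project_channels(block_out, 32)     # now summable with 32 maps
```

The spatial shape (8 × 8) is untouched; only the channel dimension changes, which is exactly what the sum requires.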
After converting the genotypes to their corresponding phenotypes, each network is trained for a predefined number of epochs. The best network, i.e., the one that achieves the maximum segmentation accuracy, is transferred to the next generation directly. It should be noted that our proposed model is single-objective: each individual is evaluated based on its accuracy alone, the aim being to use the evolutionary process to find networks with the highest segmentation accuracy. To create the rest of the population, the Roulette Wheel selection [62] method is applied to select two parents; to decrease the variance of the selection process, a normalised fitness rather than the absolute fitness is used for the population members. The combination of the two selected networks by single-point crossover generates two new networks. Lastly, a random number ranging from zero to three determines the number of mutations applied to each genotype, each mutation being a random change within the gene's valid range. After G generations, the best generated networks are selected and trained for more epochs to uncover the best network for image segmentation.
The pseudocode of the proposed model is presented in Algorithm 1. The evolutionary process starts with the initialisation stage, in which a random population is created using the Generate_Random_Population function. To create a chromosome, first the blocks' parameters are initialised randomly, and then the network's training parameters (optimiser, learning_rate, batch_size, and augmentation_size) are appended at the end. This process is repeated to create the whole population (I). In the evaluation stage, using the evaluate_population function, each genotype is converted to its corresponding phenotype and the network is trained for the specified number of epochs. Finally, to reproduce a new population (using the Reproduce function), the best network (the network with the highest segmentation accuracy) is sent to the next generation directly, and selection, crossover, and mutation are employed to create the rest of the population.
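A high-level sketch of this loop follows; chromosomes are abstracted as 102-gene bit strings, and evaluate() stands in for decoding a genotype and training the network for a few epochs to obtain its Dice score (all names and numbers here are illustrative):

```python
import random
random.seed(2)

def evaluate(chrom):
    # Stand-in for: decode genotype -> train network -> return DSC.
    return sum(chrom) / len(chrom)

def roulette_select(pop, scores):
    total = sum(scores)
    probs = [s / total for s in scores]        # normalised fitness
    return random.choices(pop, weights=probs, k=1)[0]

def reproduce(pop, scores, n_mut_max=3):
    best = pop[max(range(len(pop)), key=lambda i: scores[i])]
    new_pop = [best[:]]                        # elitism: keep the best
    while len(new_pop) < len(pop):
        p1 = roulette_select(pop, scores)
        p2 = roulette_select(pop, scores)
        cut = random.randrange(1, len(p1))     # single-point crossover
        child = p1[:cut] + p2[cut:]
        for _ in range(random.randint(0, n_mut_max)):   # 0-3 mutations
            i = random.randrange(len(child))
            child[i] = random.randint(0, 1)    # random value in gene range
        new_pop.append(child)
    return new_pop

pop = [[random.randint(0, 1) for _ in range(102)] for _ in range(10)]
for _ in range(5):                             # G generations
    scores = [evaluate(c) for c in pop]
    pop = reproduce(pop, scores)
```

In the actual model, each evaluation trains a network for up to five epochs, so the population and generation counts dominate the search cost.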

A. DATASET
The Promise12 dataset [48] for prostate MRI segmentation, the Combined Healthy Abdominal Organ Segmentation (CHAOS) dataset [63] and the Segmentation of the Liver Competition 2007 (SLIVER07) dataset [64] for liver CT segmentation, the MRI prostate and CT spleen segmentation datasets from the Decathlon challenge [65], and the BraTS2019 dataset [66]-[68] for brain tumour (whole tumour) segmentation were selected for evaluating the proposed model. Detailed information about the six datasets is provided in Table 2. Since Promise12 was collected from various centres with different image qualities, Z-score [69] normalisation was applied to its images; the rest of the datasets were used without any pre-processing. All images were resized to 128 × 128 for training and evaluation. Since the number of images is limited, for all six datasets the number of training images was augmented to a specified number using various augmentation techniques, including rotation, zooming, vertical and horizontal flips, and elastic transformation [70]. In the proposed model, all image slices, regardless of whether they included a Region Of Interest (ROI), were used for training, validation, and testing of the obtained networks.

B. IMPLEMENTATION
The proposed model was implemented using the Keras python package [71]. All experiments were carried out on one Nvidia GPU. For each of the six datasets used for evaluation, the proposed model was trained separately according to the information provided in Table 3. The number of generations and the population size were set based on preliminary experiments. In the initialisation stage, 60 networks were established randomly as possible solutions, a relatively large population size for this application; a larger population can increase the diversity of solutions and, consequently, the chance of discovering a precise network. The population size is then halved in the second generation, and evolution continues for nine generations. Networks were trained for up to five epochs in the training stage to save time. Also, to show the consistency of the proposed model, each model was run ten times. For evaluating the evolved networks, the Dice Coefficient (DSC) [72] is employed, and DSC is also used as the fitness function during evolution:

DSC = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|)    (1)

where Y is the label image, Ŷ is the predicted segmented image, and |Y| and |Ŷ| indicate the cardinality of Y and Ŷ, respectively. Moreover, DSC was also exploited as the loss function for training the networks.
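Equation (1) can be computed on binary masks as follows (an illustrative NumPy version, not the paper's Keras implementation):

```python
import numpy as np

# Dice coefficient per Eq. (1), for binary segmentation masks.
# Note: the empty-vs-empty case (both masks all zero) is not handled here.
def dice(y_true, y_pred):
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    return 2.0 * intersection / (y_true.sum() + y_pred.sum())

a = np.array([[1, 1], [0, 0]])   # label: 2 ROI pixels
b = np.array([[1, 0], [0, 0]])   # prediction: 1 ROI pixel, 1 overlapping
# dice(a, b) = 2*1 / (2 + 1) = 2/3
```

For use as a training loss, a differentiable "soft" variant over predicted probabilities would be needed rather than this hard-mask form.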
Finally, to compare the proposed model to previous works, we also employ the Hausdorff distance [64] and Average Boundary

C. EXPERIMENTAL RESULTS
As mentioned above, for each dataset we repeated the proposed evolutionary process ten times and, based on the results obtained after nine generations, selected the best run. The ten best networks from that run were then selected as the best networks for final segmentation. However, these networks had been trained for only five epochs and therefore needed further training. Since a network's weight initialisation is random and affects the final results, each network was trained five times for 25 epochs to find the best network structure, and the best results were selected for each network. The results obtained after five and then 25 epochs for each dataset are reported in Table 4. As can be seen from Table 4, our proposed model found ten high-accuracy networks for each dataset: for MRI prostate segmentation, the best network obtained 0.902 DSC, and for CT liver segmentation 0.921 and 0.943 for the CHAOS and SLIVER07 datasets, respectively. For the Decathlon spleen and prostate datasets, the obtained DSCs are 0.95 and 0.92, respectively. Finally, our proposed model obtained 0.767 DSC for whole brain tumour segmentation. These results show the capability of the proposed framework to find high-accuracy networks for datasets with various characteristics.

D. BEST NETWORK'S STRUCTURES
In this section, the genotypes of the best six network structures, one per dataset, are presented in Table 5. As can be seen from Table 5, the evolution process finds a unique network structure along with its training parameters for each dataset. Each genotype shows the block structures in the down-sampling and bridging sections. For example, the genotype of the first network indicates six active blocks: five active blocks in the down-sampling section (mirrored by five in the up-sampling section) and one bridging block. The third network (SLIVER07), in contrast, has no bridging block and contains three blocks on each side of the network. There is one exception in the best Spleen network: its third block is active, but all three convolution layers inside the block are deactivated; therefore, this block is excluded from the network's phenotype. Also, each network has its own training parameters; the type of optimiser, learning rate, augmentation size, and batch size are specified during the evolution of each network.

E. CROSS-VALIDATION
In this section, to further demonstrate the capability of our proposed model, we applied four-fold cross-validation on the SLIVER07 dataset. The data was partitioned into four folds according to Table 6. As can be seen from Table 6, for each fold the image slices corresponding to four volumes were used for testing, four volumes for validation, and the rest for training. We then ran the proposed evolutionary model on each fold and obtained the ten best networks for that fold. The training and validation sets were used for training and evolving the networks, such that networks with the highest validation accuracy had a higher chance of being selected for the next generation. In the end, the test set was used for the evaluation of the best networks. The DSCs obtained on the test set for the ten best networks of each fold are provided in Table 7. As can be seen from Table 7, our proposed model found ten high-accuracy networks for each fold. This shows that, despite the limited amount of data for training each fold, our proposed model can still find networks with high segmentation accuracy.

F. COMPARISON WITH PRIOR WORK
The comparison of the obtained results with previous works is reported in Table 8. We compare against U-Net [32], Dense U-Net [36], Res U-Net [35], and Non-Bypass Dense [37], which are all manually designed networks; AdaResU-Net [30] and EvoU-Net [31], which are evolutionary networks; and NAS U-Net [38], which was developed using reinforcement learning. As shown in Table 8, on all six datasets our proposed DenseRes evolutionary models obtained the best segmentation accuracy. Since each network was developed for a specific dataset, the segmentation performance improved considerably compared to prior networks. We also compared the networks in terms of size (the number of trainable parameters). The results show that the obtained networks are usually far smaller than the others while achieving high accuracy. Although the DenseRes model for segmentation of the prostate MRI dataset used more trainable parameters than AdaResU-Net and EvoU-Net, it obtained better segmentation accuracy.
We also compared the obtained results with previous work using AUC [74] (see Table 9). The AUC was calculated for image slices that contained the ROI, based on the following definitions.
• FP: The number of background pixels predicted as ROI pixels.
• FN: The number of ROI pixels predicted as background pixels.
• TN: The number of background pixels recognised as background.
• TP: The number of ROI pixels recognised as ROI pixels.
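The four counts above can be computed directly from a pair of binary masks, as in the sketch below (a minimal illustration, not the paper's evaluation code). From these counts, the sensitivity and specificity underlying the AUC, as well as the Dice score, follow directly.

```python
import numpy as np

# Pixel-level confusion counts for a binary segmentation, following the
# TP/FP/TN/FN definitions above (1 = ROI, 0 = background).
def confusion_counts(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)       # ROI pixels recognised as ROI
    fp = np.sum(pred & ~truth)      # background predicted as ROI
    fn = np.sum(~pred & truth)      # ROI predicted as background
    tn = np.sum(~pred & ~truth)     # background recognised as background
    return int(tp), int(fp), int(fn), int(tn)

# Toy 2x3 masks, purely for illustration.
pred  = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
tp, fp, fn, tn = confusion_counts(pred, truth)
dice = 2 * tp / (2 * tp + fp + fn)  # Dice score from the same counts
```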
Table 9 reports the minimum, maximum, average, and standard deviation of the AUC obtained on the test images for each dataset. As shown, the proposed model again outperformed previous works, demonstrating the capability of the proposed framework for medical image segmentation. As a subjective comparison, Figure 3 provides one sample image from each dataset together with the segmentations predicted by the various models. As can be seen from Figure 3, segmentation accuracy differs between models, and some suffer from over-segmentation or under-segmentation; in other cases the predicted segmentations are similar because some methods obtained competitive results. Figure 4 provides five further sample images of various shapes and sizes from each dataset; each column shows five segmented samples from one dataset. On all six datasets, the proposed DenseRes evolutionary model segments the Region Of Interest (ROI) precisely. This is especially visible for liver and spleen segmentation, where the colour and texture of the organ are similar to the surrounding organs: previous work suffered from over- or under-segmentation, whereas our proposed models locate the ROI with high accuracy and are much more precise than the others.

G. TIME ANALYSIS
In this section, we compare the training time of our proposed model against existing evolutionary and non-evolutionary CNN-based models. Table 10 lists a number of existing automatic models that use Reinforcement Learning (RL) or evolutionary computation for image classification or segmentation; we compare our proposed model against eight different automatic search algorithms in terms of training time. NAS [21], NASNet [75], AmoebaNet [23], and Automatically Designing CNN [76] need tens of GPU days to find a network for image classification: NAS [21] utilised 800 GPUs for 28 days of training, NASNet [75] used 500 GPUs for four days, AmoebaNet [23] is an evolutionary model that ran on 450 GPUs for seven days, and Automatically Designing CNN [76] used 15 GPUs for 10 days. The reported GPU days are taken from the source papers; since these papers were published in different years, the time comparisons may not be exact due to changes in hardware over time. Adversarial NAS [77] is another RL-based technique, which applies RL only to find the structure of the discriminator and takes just six GPU days. NAS U-Net [38] needs two days to find a network; however, in this method natural images are used to find the network, and medical images are then used only for network evaluation. AdaResU-Net [30] and EvoU-Net [31] are two evolutionary models for finding a network, such that AdaResU-Net uses a fixed network structure (the number of blocks and of layers inside the blocks is kept fixed) and applies an evolutionary technique only to find some hyper-parameters. Finally, our proposed model needs two and a half to four days of training on one GPU, depending on the dataset.
Despite developing the whole network along with its training parameters using an evolutionary model, our proposed model is still among the models using minimum time for training.

H. PROPOSED DenseRes MODEL FOR 3D SEGMENTATION
The proposed DenseRes framework is a 2D evolutionary segmentation model. Developing a 3D evolutionary model is computationally expensive, because evolving a population of 3D networks over the specified number of generations requires a massive amount of computation and time. The advantage of 3D convolution for 3D image analysis is that contextual information across slices can be used for segmentation, which can increase accuracy. In this section, to show the capability of the proposed model, we also evaluate the obtained 2D models for 3D image segmentation. To do this, we stack all the segmented 2D image slices on top of each other to create a 3D volume. We then calculate the volumetric DSC of the obtained volumes and compare the results with three previous manually designed 3D CNNs: 3D U-Net [78], ConvNet [79], and 3D Dense U-Net [36] (see Table 11). We used the same data for training both the 2D and 3D models; however, the 2D models were trained on image slices, while the 3D models were trained on the corresponding 3D volumes.

FIGURE 3. One sample segmented image from each dataset using various methods. The red contour is the ground truth, orange is U-Net, olive is Non-Bypass Dense, purple is AdaResU-Net, pink is EvoU-Net, cyan is Res U-Net, green is Dense U-Net, light coral is NAS U-Net, and yellow is the proposed DenseRes model.

FIGURE 4. Five sample segmented images from each dataset using various models. The red contour is the ground truth, orange is U-Net, olive is Non-Bypass Dense, purple is AdaResU-Net, pink is EvoU-Net, cyan is Res U-Net, green is Dense U-Net, light coral is NAS U-Net, and yellow is the proposed DenseRes model.
As can be seen from Table 11, we applied three previous 3D models to the segmentation of the six datasets. To train the previous works, the number of training images was augmented up to 32000, with a batch size of 16; their training parameters were set according to their source papers. Since all the previous models are manually designed for a specific segmentation application, in some cases they obtained high accuracy, while in other instances, such as the CHAOS dataset, ConvNet and 3D Dense U-Net performed poorly. Interestingly, except for the Prostate dataset, our proposed DenseRes model achieved high accuracy for 3D image segmentation. The reason is that we developed a network for each dataset independently using an evolutionary technique. Therefore, even when we stack the image slices and evaluate the results in 3D, the results are comparable and competitive with native 3D models for medical image segmentation, and in most cases outperform them. It should also be noted that the 2D networks use a considerably smaller number of parameters than the 3D networks and require less time for training.
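The 3D evaluation described above, stacking per-slice 2D predictions into a volume and computing the volumetric Dice score against the stacked ground truth, can be sketched as follows (an illustrative minimal version, not the paper's evaluation code):

```python
import numpy as np

# Stack per-slice 2D masks into a volume and compute the volumetric
# Dice Similarity Coefficient (DSC) against the stacked ground truth.
def volumetric_dice(pred_slices, truth_slices):
    pred = np.stack(pred_slices).astype(bool)    # shape (n_slices, H, W)
    truth = np.stack(truth_slices).astype(bool)
    inter = np.sum(pred & truth)
    return 2.0 * inter / (pred.sum() + truth.sum())

# Two toy 2x2 slices per volume (illustrative only).
pred  = [np.array([[1, 0], [0, 1]]), np.array([[1, 1], [0, 0]])]
truth = [np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]])]
dsc = volumetric_dice(pred, truth)
```

Because the intersection and the volume sums are taken over the whole stack at once, thin structures that appear in only a few slices are weighted by their true volume rather than per slice.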
This section also provides an example of a 3D segmented image for each dataset as a subjective evaluation (see Figure 5). As can be seen from Figure 5, the proposed model performed well in terms of 3D image segmentation (the red volume is the ground truth, and the cyan one is the obtained segmentation). The output segmentation demonstrates the capability of a well-designed 2D network.

I. PROPOSED EVOLUTIONARY MODEL ANALYSIS
In this section, we analyse the evolution, structures, and training parameters of the networks obtained by the proposed evolutionary DenseRes model. All information provided in this section is based on the ten best networks of the best run for each dataset (60 networks in total).

FIGURE 5. One sample 3D segmented image from each dataset. The red contour is the ground truth; cyan is the output segmentation.

1) NETWORK'S TRAINING PARAMETERS
In the proposed DenseRes evolutionary technique, the type of optimiser and its corresponding learning rate are selected automatically during evolution. Figure 6 shows that, across the six datasets, the best networks mainly utilised rmsprop [80] and adagrad [81] among the four available optimisers. Figure 6 also shows the distribution of learning rates: most of the best networks used the smallest learning rate (0.001) for training.
The next training parameters found by the evolutionary model are the number of augmentations and the batch size. The appropriate amount of training data depends on the size of the network and its parameters; an inappropriate number of training images can cause over- or under-fitting. Notably, because our proposed model is constructed from Dense and Residual blocks, the size of the constructed networks is variable, so finding the number of training images automatically helps increase the accuracy of the proposed model. As shown in Figure 7, most of the best networks were trained using 32000 images, followed by 16000. Figure 7 also shows the distribution of batch sizes.
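Expanding a slice set to a gene-specified target count could look like the sketch below. The specific transform set (flips and a 90-degree rotation) is purely illustrative; the paper does not prescribe these particular operations here.

```python
import numpy as np

# Sketch: cycle over the source slices, applying a different transform
# on each pass, until the gene-specified number of training images
# is reached.
def augment_to_target(slices, target):
    ops = [lambda s: s,                 # identity (original slice)
           np.fliplr,                   # horizontal flip
           np.flipud,                   # vertical flip
           lambda s: np.rot90(s, 1)]    # 90-degree rotation
    out = []
    i = 0
    while len(out) < target:
        base = slices[i % len(slices)]
        # Advance to the next transform after each full pass over the set.
        out.append(ops[(i // len(slices)) % len(ops)](base))
        i += 1
    return out

augmented = augment_to_target([np.eye(3), np.ones((3, 3))], target=6)
```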

2) NETWORK'S STRUCTURE PARAMETERS
In this section, we analyse the components used to create the best networks' structures. All numbers reported here refer to the down-sampling and bridging sections of the found networks. First, the contributions of Dense blocks, Res blocks, convolution layers inside the blocks, batch normalisation, and long connections are shown in Figure 8. As can be seen, convolution plays the most prominent role in building the networks for all six datasets. In total, 108 Residual blocks and 136 Dense blocks were used to create the best networks (see Figure 9). The ten best networks found for the PROMISE dataset mostly utilised Dense blocks, whereas the ten best networks for the Prostate dataset used the minimum number of Residual blocks. This diversity shows the complexity of finding the best combination of blocks to create a network. Besides the number of convolution layers, the filter size also plays a significant role in feature extraction; in the proposed model, the filter size is found for each convolution layer independently. Among the best-obtained networks, 190 convolution layers used 3 × 3 filters, 164 used 7 × 7, and just 91 used 5 × 5. The distribution of these parameters per dataset is provided in Figure 10.
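The structural difference the search exploits can be shown in miniature: a Residual block adds its input to its output, preserving the channel count, while a Dense block concatenates inputs and outputs along the channel axis, so channels accumulate. The sketch below uses identity "layers" in place of real convolutions so the combination rule itself is visible; it is an illustration of the general block mechanics, not the paper's block implementation.

```python
import numpy as np

# Residual combination: output has the SAME shape as the input.
def residual_combine(x, fx):
    return x + fx

# Dense combination: channels of input and output are CONCATENATED,
# so the feature map grows with every dense layer.
def dense_combine(x, fx):
    return np.concatenate([x, fx], axis=-1)

x  = np.zeros((8, 8, 16))   # H x W x C feature map
fx = np.ones((8, 8, 16))    # stand-in for a convolution's output
res_out = residual_combine(x, fx)     # channel count preserved
dense_out = dense_combine(x, fx)      # channel count doubled
```

This is why mixing the two block types changes network size so strongly: stacking Dense blocks grows the feature maps (and parameter count of subsequent layers), whereas stacking Residual blocks keeps them constant.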
Pooling layers also play an essential role in discarding unimportant features from the extracted feature maps. In U-Net-based networks, a pooling layer follows each block in the down-sampling section to reduce the size of the feature maps and thereby remove less informative features. Two popular pooling layers, max-pooling and average-pooling, could be chosen by the proposed evolutionary model. As shown in Figure 11, 167 average-pooling and 78 max-pooling layers were used to create the ten best networks for the six datasets. For the Prostate and CHAOS datasets, the best networks used an almost equal number of max- and average-pooling layers, whereas for the Spleen dataset just two max-pooling layers were used. Again, the obtained results show the variety of the network structures and the importance of creating the networks automatically.
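The difference between the two pooling choices is easy to see on a toy feature map: max-pooling keeps only the strongest response per window, while average-pooling smooths responses across the window. The 2 × 2-window version below is an illustrative sketch, not the framework's implementation.

```python
import numpy as np

# Non-overlapping 2x2 pooling of a 2D feature map (H and W must be even).
def pool2x2(x, mode="max"):
    h, w = x.shape
    # Group pixels into 2x2 blocks: element (i, a, j, b) = x[2i+a, 2j+b].
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # strongest response per window
    return blocks.mean(axis=(1, 3))       # average response per window

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 8.],
              [1., 1., 1., 1.],
              [1., 1., 1., 1.]])
```

On the top-right window, max-pooling keeps the isolated strong activation (8) while average-pooling dilutes it to 2, which is precisely the trade-off the evolutionary search resolves per position in the network.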
Activation functions specify the output of each node with respect to its input and play an essential role in the convergence of a neural network and its convergence speed; an inappropriate choice can even prevent the network from converging. It is therefore necessary to specify an appropriate activation function for each convolution layer, which in the proposed model is specified per block. As shown in Figure 12, the networks for each dataset applied various numbers and types of activation functions. In total, 145 ReLU and 99 Sigmoid activation functions were assigned to the blocks.
Finally, Figure 13 shows the distribution of dropout probabilities. In the proposed model, a dropout layer may appear at the end of each block to control overfitting; a dropout probability of zero means no dropout, which occurs several times in PROMISE's best networks. Figure 13 also reveals the patterns of dropout usage across the different datasets.

3) TIME RELATED PARAMETERS
One of the critical parameters of a network is its number of trainable parameters, which plays a significant role in both training and accuracy. Moreover, the training time of a network is related to its size (and to the amount of training data). Figure 14 shows the sizes of the ten best networks for each dataset. Notably, networks with significantly different sizes usually obtained high accuracy, showing that increasing the depth and parameter count of a network is not always the way to create a more precise network.
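For reference, the trainable-parameter count of a single 2D convolution layer follows a standard formula: one weight per filter position per input channel for each filter, plus one bias per filter. The helper below is a generic illustration for interpreting the size comparisons above, not the paper's counting code.

```python
# Trainable parameters of a 2D convolution layer with square filters:
# (filter_h * filter_w * in_channels + 1 bias) per output filter.
def conv2d_params(filter_size, in_channels, out_channels):
    return (filter_size * filter_size * in_channels + 1) * out_channels

# e.g. a 3x3 convolution from 64 to 128 channels:
n = conv2d_params(3, 64, 128)   # (3*3*64 + 1) * 128 = 73856
```

This also makes the cost of the filter-size gene concrete: switching the same layer from 3 × 3 to 7 × 7 filters multiplies its weight count by roughly 49/9.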
Finally, the wall-time required to find a set of networks for each dataset using one Nvidia GPU is provided in Figure 15. Because the sizes of the networks and the numbers of augmentations are variable, the training times differ: for example, training the proposed model on the CHAOS dataset takes less than 60 hours, whereas the Spleen dataset takes about 100 hours. Overall, Figure 15 shows that the proposed model is not computationally expensive.

4) NETWORK'S EVOLUTION
As mentioned above, the proposed evolutionary model was repeated ten times for each dataset. Figure 16 shows the evolutionary process over the nine generations for each dataset: the orange line is the training accuracy, blue is the test accuracy, and green is the validation accuracy during evolution. The solid line shows the accuracy of the best run, and the dashed line shows the average accuracy over the ten runs. It should be noted that the random selection of the test, validation, and training sets can affect the evolution process. For example, if unique and challenging samples happen to fall in the test set while the easy samples fall in the validation set, training will yield high validation accuracy but low test accuracy; this problem can be mitigated to some extent by an appropriate augmentation technique. Before analysing each model separately, it should be noted that the evolutionary process starts with a population of 60 and then continues training with 30 networks, which is a large population size compared to previous evolutionary models [30]. As shown in Figure 16, on the PROMISE dataset the training accuracy starts at about 0.65, increases to 0.73, and then decreases, so the best results were obtained in the seventh generation. The CHAOS model also shows inconsistency, with the training accuracy both increasing and decreasing during evolution. For the SLIVER07 dataset, evolution starts with a set of high-accuracy networks (the best networks averaged 0.94 accuracy), which improved by about one percent during training. Evolution on the Spleen dataset started at 0.77 accuracy in the best run and ended with a ten percent improvement in training-set segmentation accuracy.

FIGURE 16. The evolution of the ten best networks for each dataset.
For the Prostate dataset, the training accuracy of the best run starts at 0.77 and increases to 0.90 by the ninth generation. Lastly, for brain cancer segmentation, the training accuracy of the best run starts at 0.47 and increases to 0.56 in the last generation.
Overall, despite the limited number of generations, we can see an improvement in segmentation accuracy during evolution.

V. DISCUSSION AND CONCLUSION
In this paper, we proposed the first evolutionary DenseRes model for medical image segmentation, leveraging U-Net [32], Res U-Net [35], Dense U-Net [36], and the GA [11], such that our proposed automatic model can uncover an accurate network structure, along with its training parameters, for each dataset precisely and quickly. Experimental results on six different datasets show that the proposed DenseRes evolutionary technique can serve as a general model for automatically developing a network for different segmentation purposes. This is one of the most significant achievements of the proposed model: a precise network structure and its parameters can differ completely from one dataset or application to another, and there is no general rule to address this. A general technique like our proposed model can address this problem, especially as the obtained networks are relatively small and do not need extensive computation.
The proposed model is not computationally expensive: the maximum training time to find the best network using one GPU was about 100 hours. This compares favourably with other automatic techniques [21], [23], which in some cases used hundreds of GPUs, so our model is effective and computationally reasonable. Another essential aspect of our proposed model is that, unlike most previous automatic techniques that included only a subset of parameters in their search space, we include all necessary parameters in the search space, and by utilising a relatively large population size, our proposed model finds good parameters. As a result, the end-user does not even need to set the batch size and number of augmentations manually, both of which have a strong effect on whether a network over- or under-fits.
We evaluated the proposed model for medical image segmentation, and the model is flexible enough to be applied to other medical segmentation datasets beyond those we used. Moreover, by modifying the proposed encoding model and changing the number of input channels from one to three, it is possible to develop networks for natural image segmentation. Depending on the size of the input image, the number of blocks can also be increased to find deeper networks. In addition, based on our proposed encoding model, other evolutionary techniques can be applied instead of the GA. Other block structures (such as attention modules) can also be included in the search space to find more complicated network structures. Finally, in this paper we showed that combining existing network structures with an evolutionary technique is an effective way to find a 2D network automatically; the results show that our proposed model also works well for 3D segmentation.