Eso-Net: A Novel 2.5D Segmentation Network With the Multi-Structure Response Filter for the Cancerous Esophagus

Automatic segmentation of the cancerous esophagus in computed tomography (CT) images is a computer-assisted method that can improve the efficiency of diagnosis and treatment. Because the cancer stage and location differ between patients, the anatomical structure of the cancerous esophagus varies widely. Moreover, the low contrast against surrounding tissues leads to a blurry boundary of the cancerous esophagus. As a result, existing segmentation networks do not achieve satisfactory results on this task. In this article, we propose a novel 2.5D segmentation network named Eso-Net for the cancerous esophagus based on an encoder-decoder architecture. A 3D enhancement filter called the Multi-Structure Response Filter (MSRF) is designed to extract 3D structural information as prior knowledge. Furthermore, dilated convolutions and residual connections are employed in the convolutional blocks of Eso-Net for multi-scale feature learning. With the 3D structural priors, Prior Attention Modules (PAM) are incorporated into the network to facilitate the transmission of relevant spatial information. The experiments are conducted on a dataset from 30 esophageal cancer patients, and we report a dice similarity coefficient of 84.839%, a precision of 85.955%, a sensitivity of 83.752%, and a Hausdorff distance of 2.583 mm. The experimental results demonstrate that the proposed method outperforms other existing segmentation networks in this task and can effectively assist doctors in the diagnosis and treatment of esophageal cancer.


I. INTRODUCTION
Esophageal cancer is a common cancer with high mortality [1], and it has become a major public health problem worldwide because its incidence has been increasing in recent years [2]. Therefore, the diagnosis and treatment of esophageal cancer are particularly critical. Medical imaging is a technique for the diagnosis and clinical analysis of diseases, since it can reveal the internal structures of a body. Among the many imaging methods, Computed Tomography (CT) is widely used for the diagnosis of esophageal cancer, since it creates visual representations of body cross-sections that help doctors observe and evaluate esophageal lesions [3]. However, manual segmentation of the cancerous esophagus is tedious and time-consuming due to the large number of CT slices. Because manual segmentation depends on the experience of the individual doctor, some segmentation results may be subjective or even inaccurate, which can have serious consequences. Therefore, automatic methods are developed for the accurate and effective segmentation of the cancerous esophagus, which can assist doctors in diagnosing esophageal cancer. However, automatic segmentation of the cancerous esophagus in CT images is still a difficult task. As shown in Fig. 1, the size and shape of the cancerous esophagus are complex and variable due to the diversity of the cancer stage and location. Moreover, the boundary of the esophagus is irregular and blurry in CT images, since the esophagus is a non-rigid structure with low contrast against surrounding tissues. Furthermore, air bubbles randomly appear in the esophagus, which further increases the difficulty of segmentation.
The associate editor coordinating the review of this manuscript and approving it for publication was K. C. Santosh.
Using traditional image processing, many methods have been proposed for this task. Rousson et al. described a probabilistic shortest path approach combined with spatial prior knowledge [4]. This method models spatial dependency between the esophagus and the aorta and left atrium to obtain the esophagus outer boundary. However, this method requires the two extreme points of the esophagus centerline and a segmentation of the aorta and left atrium as input. Feulner et al. proposed a multi-step method to extract the esophagus from CT images [5]. A classifier for discriminating appearances is combined with an explicit model of the respiratory and esophageal air distribution, followed by a Markov chain model to estimate the approximate esophagus shape. Then the approximate surface performs non-rigid deformation to obtain a better fitting boundary. In [6], Damien et al. introduced a skeleton-shaped model to guide the segmentation. This method performs a 3D segmentation with the prior knowledge of the skeleton. Then over-segmented slices are detected and a 2D propagation by graph cut is used to improve segmentation results. Yang et al. proposed an online atlas selection approach for multi-atlas segmentation [7]. Based on local anatomical similarity, the atlases of the esophagus are ranked and the optimal atlases are selected. The final segmentation is obtained by fusing the deformed contours of the optimal atlases. Since most methods using traditional image processing are proposed for specific scenarios with pre-defined hypotheses, the generalization capabilities of the algorithms are restricted. Moreover, the feature extractors of the esophagus are artificially designed with complex parameters, which reduces the robustness and requires tedious parameter tuning.
Recently, deep learning has been widely used in image segmentation. Shelhamer et al. proposed a pixels-to-pixels model called Fully Convolutional Networks (FCN) [8], based on an encoder-decoder architecture without any fully connected layers. Fechter et al. proposed a random walker approach driven by a 3D FCN [9], which is believed to be the first work to apply deep learning to the segmentation of the esophagus. This method is not end-to-end, since the 3D FCN is only used to generate a rough probability map of the esophagus. Trullo et al. utilized two improved FCNs to perform and refine the segmentation of the esophagus [10]. However, a multi-organ segmentation is required to locate the esophagus, which complicates model training and data labeling. Furthermore, several segmentation networks have been developed based on FCN. U-Net combines a U-shaped network with skip connections to fuse features and recover lost spatial information [11]. To reduce memory requirements and inference time, SegNet uses max-pooling indices to perform non-linear up-sampling [12]. LinkNet connects each encoder input to the corresponding decoder output to recover lost spatial information [13]. Due to its compact but efficient network topology, U-Net has become the focus of biomedical image segmentation. Çiçek et al. employed 3D convolutions to construct 3D U-Net, whose large number of trainable parameters can easily lead to overfitting without a large amount of data [14]. Chen et al. proposed U-Net plus to segment the cancerous esophagus on a single CT slice [15]. However, this method requires a manually placed point to start segmenting the entire esophagus.
Even with deep learning, automatic segmentation of the cancerous esophagus remains a challenging problem. The networks mentioned above fail to achieve satisfactory results for three reasons. First, limited by GPU memory, 3D segmentation can process only a small patch of the CT scan, which is not conducive to learning the complete esophagus structure, whereas 2D segmentation cannot utilize 3D structural information. Second, various tumor sizes lead to various scales of the cancerous esophagus, and it is difficult to achieve multi-scale feature learning in this task by relying only on down-sampling to let convolutions extract multi-scale features. Third, the loss of spatial information in the encoding phase is one of the major factors that limit segmentation accuracy. The skip connections of U-Net address this by simply concatenating the feature maps from decoder layers and encoder layers. However, this is not sufficient to recover and fuse the lost spatial information of the cancerous esophagus, which has an irregular and blurry boundary. To solve the above problems, we propose a novel segmentation network named Eso-Net that consists of different convolutional blocks. Eso-Net performs channel-wise 2.5D segmentation with 3D structural priors extracted by a 3D enhancement filter, which is a trade-off between 2D and 3D segmentation. In medical image tasks, the term ''2.5D'' means that the network utilizes only 2D convolutions to extract features from multiple images in different planes or different views. In the convolutional blocks, dilated convolutions and residual connections are applied to extract and combine features in different receptive fields. Furthermore, the encoder blocks and decoder blocks are connected, and an attention mechanism is introduced to make the skip connections more effective.
The contributions of our work are summarized as follows:
[23]. These methods, which directly enhance the target tissues, are not applicable to the cancerous esophagus since it has various anatomical structures. Anatomical structure is an important property for distinguishing different tissues and organs. Inspired by [16], we design a filter to enhance the surrounding tissues and organs of the esophagus, and the enhanced images are used to improve the segmentation accuracy.

B. MULTI-SCALE FEATURE EXTRACTION AND FUSION FOR IMAGE SEGMENTATION
The features extracted from different layers contain different information. The low-level features contain more detail information, whereas the high-level features contain more semantic information. The reason is that the down-sampling operations in the network gradually reduce the size of the feature maps, so the convolution kernels scan at larger scales. In image segmentation, some attempts at multi-scale feature learning have achieved effective results. Yu et al. first employed dilated convolutions to aggregate multi-scale features without losing resolution [24]. PSPNet uses the pyramid pooling module to extract and concatenate features of different sizes [25]. Fu et al. constructed multi-scale input layers to combine information from multiple scales for accurate segmentation of retinal vessels [26]. Wang et al. proposed a hybrid dilated convolution (HDC) framework to alleviate the ''gridding issue'' caused by dilated convolutions [27]. Chen et al. proposed Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale features with dilated convolutions at multiple sampling rates [28]. Zhao et al. developed the image cascade network for real-time semantic segmentation by efficiently utilizing information from images with different resolutions [29]. U-Net++ integrates U-Nets of varying depths to extract and aggregate features of varying scales with skip connections [30].
The varying scale of the cancerous esophagus requires an effective method for capturing contextual information. Because the cancerous esophagus dataset is small, a model with a large number of trainable parameters can easily overfit, so stacking many different network components for feature extraction and fusion is not suitable for this task. Therefore, we combine dilated convolutions and residual connections in the convolutional blocks of Eso-Net to achieve multi-scale feature learning.

C. ATTENTION MECHANISMS IN MEDICAL IMAGE TASKS
Attention mechanisms were initially proposed for machine translation and are widely used in the computer vision field [31]-[33]. Attention mechanisms make the model focus more on key information relevant to the foreground, and they have recently found applications in medical image tasks. Attention U-Net combines attention gates with U-Net to emphasize useful features with minimal computational overhead [34]. Roy et al. proposed three attention modules modified from the squeeze-and-excitation module and embedded them in different segmentation networks to perform multi-organ segmentation [35]. Wang et al. proposed global aggregation blocks with a spatial attention mechanism to extract global information from feature maps [36].
All the above are self-attention mechanisms without supervision by prior knowledge. In medical image segmentation, 3D structural information can be used as prior knowledge. Therefore, we utilize 3D structural priors to explicitly guide the execution of a spatial attention mechanism in Prior Attention Modules (PAM), which contributes to the recovery and fusion of key spatial information. Our PAMs can efficiently model the importance of different regions on a CT image instead of self-learning from feature maps, which facilitates the transmission of features relevant to the esophagus.

III. PROPOSED METHOD

A. OVERVIEW
The proposed method comprises two stages, which are illustrated in Fig. 2. In the first stage, pre-processing is performed on the input CT slices. First, the window level (WL) and window width (WW) of the CT slices are set to 40 and 400, respectively, to increase the contrast between the esophagus and other tissues and organs. This window covers intensity values from −160 to 240, which are linearly normalized into [0,1] to accelerate model convergence during training; intensities less than −160 are set to 0 and those greater than 240 are set to 1. Then, the Multi-Structure Response Filter (MSRF) is used to enhance tissues and organs with specific geometrical structures at multiple scales on the CT images, which provides the segmentation network with additional prior knowledge. Finally, the 144×144 pixel regions in the centers of the original images and the enhanced images are cropped to obtain Regions of Interest (ROI) that are large enough to contain the entire esophagus. The cropping operation also decreases the interference from irrelevant tissues and organs and speeds up model training and inference. In the second stage, channel-wise 2.5D segmentation is performed by Eso-Net. The CT image to be segmented is concatenated channel-wise with its two adjacent images to form one of the network inputs, which makes it possible to use z-axis information efficiently without a significant increase in the number of parameters. If the image to be segmented is the first or last image of the CT scan, the unavailable adjacent image is replaced by the image to be segmented itself. The other input consists of the corresponding enhanced images, which are concatenated in the same way. Finally, the network outputs the segmentation map for the CT image in the middle channel.
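The intensity normalization and the channel-wise stacking described above can be illustrated with a minimal sketch in plain Python. The function names `window_normalize` and `stack_adjacent` are hypothetical, and the default window is chosen here so that it spans −160 to 240, the range stated above; this is an illustration under those assumptions, not the authors' implementation.

```python
def window_normalize(hu, level=40.0, width=400.0):
    """Clip an intensity value to the window and rescale it linearly to [0, 1]."""
    low = level - width / 2.0    # -160 for the assumed defaults
    high = level + width / 2.0   # 240 for the assumed defaults
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


def stack_adjacent(volume, index):
    """Return [previous, current, next] slices for channel-wise 2.5D input.

    At the first or last slice of the scan, the missing neighbour is
    replaced by the current slice itself.
    """
    current = volume[index]
    previous = volume[index - 1] if index > 0 else current
    following = volume[index + 1] if index < len(volume) - 1 else current
    return [previous, current, following]
```

The same stacking would be applied to the enhanced images so that both inputs stay aligned slice by slice.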
In the following, the proposed method is described in detail. In Section III.B., we illustrate the Multi-Structure Response Filter (MSRF). In Section III.C., we introduce Eso-Net that comprises different convolutional blocks and the Prior Attention Module (PAM).

B. MULTI-STRUCTURE RESPONSE FILTER
In medical image segmentation, image enhancement helps improve segmentation performance. A common approach to enhancing CT images is to utilize 3D structural information. Since the size and shape of esophageal tumors vary, the anatomical structure of the cancerous esophagus is irregular, so direct enhancement of the cancerous esophagus cannot achieve a satisfactory result. However, other tissues and organs surrounding the esophagus have regular geometrical structures. Inspired by this insight, we propose the Multi-Structure Response Filter (MSRF) based on the Hessian matrix. In contrast to traditional approaches, MSRF is designed to enhance other tissues and organs with specific geometrical structures instead of the esophagus. Thus, the enhanced regions in a CT image should be predicted as background by the segmentation network. The enhanced images help the segmentation network distinguish the cancerous esophagus from similar tissues and organs in CT images.
Since the Hessian matrix is related to local geometrical structures [16], we can detect specific structures by using the eigenvalues of the Hessian matrix. First, we combine consecutive CT images into a 3D volume. The Hessian matrix of each voxel in the 3D volume comprises second-order derivatives in different directions. Let $I(x)$ denote the intensity of the 3D volume at coordinate $x = [x_1, x_2, x_3]^T$. To analyze structures at multiple scales, differentiation is performed in a Gaussian scale space. Therefore, using the linear scale-space theory [37], the elements of the Hessian matrix at $x$ and scale $\sigma$ are defined as:

$$H_{ij}(x, \sigma) = \sigma^{\gamma} \, I(x) * \frac{\partial^2}{\partial x_i \partial x_j} G(x, \sigma)$$

where $i, j = 1, 2, 3$ indicate the positions of the elements and $*$ denotes the convolution operation. The parameter $\gamma$ is introduced to rescale the response of the differential operation at multiple scales [38] and is set to 2. The Gaussian function $G(x, \sigma)$ is defined as:

$$G(x, \sigma) = \frac{1}{\left(\sqrt{2\pi\sigma^2}\right)^3} \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right)$$

The second derivative of a Gaussian kernel at scale $\sigma$ can be considered a probe kernel that captures the difference between the regions inside and outside the range $(-\sigma, \sigma)$ in the direction of the derivative. Moreover, the half-width of the kernel is set to the integer closest to $3\sigma$ when performing the image convolution. Each voxel in the 3D volume corresponds to a 3 × 3 Hessian matrix. The eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_3$ are obtained by eigenvalue decomposition of the Hessian matrix and are sorted by absolute value, so that $|\lambda_1| \le |\lambda_2| \le |\lambda_3|$. Their magnitudes represent the curvature in the directions of the corresponding eigenvectors, since the Hessian matrix describes second-order structural characteristics. For a local structure at a specific scale, eigenvalue decomposition extracts three representative and orthogonal eigenvectors. Therefore, local geometrical structures can be interpreted by analyzing the signs and magnitudes of $\lambda_1$, $\lambda_2$, and $\lambda_3$, as shown in Table 1.
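To make the probe-kernel interpretation concrete, the following sketch samples the scale-normalized second derivative of a Gaussian along a single axis. It is a 1D illustration only (the filter described above operates on a 3D volume), the function name is hypothetical, and the normalization by $\sigma^{\gamma}$ with $\gamma = 2$ follows the description in the text.

```python
import math


def gaussian_second_derivative_kernel(sigma, gamma=2.0):
    """Sample the 1D second derivative of a Gaussian, scaled by sigma**gamma.

    The half-width is the integer closest to 3 * sigma, as stated in the text.
    """
    half_width = int(round(3 * sigma))
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    kernel = []
    for x in range(-half_width, half_width + 1):
        g = norm * math.exp(-(x * x) / (2.0 * sigma * sigma))
        # d^2/dx^2 of the Gaussian: (x^2 / sigma^4 - 1 / sigma^2) * G(x)
        d2 = (x * x / sigma ** 4 - 1.0 / sigma ** 2) * g
        kernel.append(sigma ** gamma * d2)
    return kernel
```

The sampled kernel is negative for positions inside $(-\sigma, \sigma)$ and positive outside, which is why it responds to the intensity difference between the two regions.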
For example, in a tube-like structure, the eigenvector corresponding to $\lambda_1$ points in the axial direction, which is the direction of minimal curvature ($|\lambda_1| \approx 0$), and the other two eigenvectors point in radial directions, which are directions of larger curvature ($|\lambda_2| \approx |\lambda_3| \gg 0$). In a blob-like structure, by contrast, the curvature is large in all directions ($|\lambda_1|, |\lambda_2|, |\lambda_3| \gg 0$). For our purpose, not all evident structures need to be enhanced. MSRF should enhance surrounding tissues and organs with plate-like, tube-like, and blob-like structures, and it should avoid enhancing the esophagus and the background. Hence, the enhancement function $E_\sigma(x)$ at scale $\sigma$ is defined as: Since the structures expected to be enhanced are brighter than the background, which is black in CT images, the filter response in all darker structures is set to 0. $R_t$, $R_p$, and $R_b$ denote the similarity measures of the tube-like, plate-like, and blob-like structures, respectively. They are defined as follows: where their values are proportional to the similarity. $R_s$ denotes the similarity measure of evident structures, which is defined as: where $R_s$ is small in the background and large in an evident structure, since at least one of the eigenvalues is large. Moreover, $\alpha$, $\beta$, $\gamma$, and $\delta$ control the sensitivity of MSRF to $R_t$, $R_p$, $R_b$, and $R_s$, respectively. $S_t$ and $S_o$ denote different scale ranges. $S_t$ matches the size of tube-like structures other than the esophagus, and $S_o$ matches the size of the other structures (plate-like and blob-like structures). Therefore, by setting appropriate $S_t$ and $S_o$, MSRF can enhance other tissues and organs with evident structures at multiple scales while avoiding enhancement of the esophagus. Based on anatomical information about the esophagus and extensive experiments, the best-performing parameter combination of MSRF was selected.
$\alpha$, $\beta$, $\gamma$, and $\delta$ are set to 0.8, 1, 1.5, and 200, respectively. $S_t$ and $S_o$ are set to [5, 10] and [11, 15], respectively, and the step size is set to 1 when choosing scales. The MSRF response of each voxel lies between 0 and 1. For each voxel, we integrate the filter responses at multiple scales and take the maximum as the output response:

$$E(x) = \max_{\sigma \in S_t \cup S_o} E_\sigma(x)$$

Finally, the enhanced 3D volume is decomposed into consecutive 2D images, which are cropped and used as input to the segmentation network. MSRF enhances the surrounding tissues and organs to extract prior knowledge from anatomical structures, which assists the segmentation network in channel-wise 2.5D segmentation and strengthens the robustness of the network. Some examples of enhancement results are shown in Fig. 3.
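The multi-scale integration step can be sketched as a voxel-wise maximum over the per-scale responses. The sketch below assumes the scale ranges are the intervals [5, 10] and [11, 15] sampled with step 1, and it takes the per-scale responses as given (flattened lists of per-voxel values); computing those responses is outside the scope of this illustration, and all names are hypothetical.

```python
# Assumed scale sets: S_t = [5, 10] and S_o = [11, 15], step size 1.
SCALES_TUBE = list(range(5, 11))
SCALES_OTHER = list(range(11, 16))


def multiscale_response(responses_by_scale):
    """Take the voxel-wise maximum over all scales.

    responses_by_scale maps each scale to a flat list of per-voxel
    responses; all lists must have the same length.
    """
    per_scale = list(responses_by_scale.values())
    return [max(values) for values in zip(*per_scale)]
```

For example, `multiscale_response({5: [0.1, 0.9], 11: [0.4, 0.2]})` keeps the stronger response at each voxel.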

C. NETWORK ARCHITECTURE OF ESO-NET
The basic structure of our 2.5D segmentation network is derived from U-Net [11], which is widely used in medical image segmentation. The symmetrical encoder-decoder architecture with skip connections can extract and combine low-level detail information and high-level semantic information. As shown in Fig. 4, Eso-Net comprises different convolutional blocks, and a Prior Attention Module (PAM) is embedded in each connection path between an encoder block and the corresponding decoder block. The concatenated CT images are the input of the network, and the enhanced images are utilized in the PAMs. In the last decoder block, a 1 × 1 convolution is employed at the end to reduce the number of channels to 2, followed by a softmax function to produce two probability maps for the foreground and the background. The regions with a higher foreground probability are predicted as the cancerous esophagus. Finally, the segmentation map is generated by the network.
VOLUME 8, 2020
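The final prediction step (two-channel softmax followed by choosing the more probable class) can be illustrated with a small plain-Python sketch. The function names and the nested-list representation of the logit maps are assumptions made for this example only.

```python
import math


def softmax2(background_logit, foreground_logit):
    """Numerically stable softmax over two logits."""
    m = max(background_logit, foreground_logit)
    eb = math.exp(background_logit - m)
    ef = math.exp(foreground_logit - m)
    total = eb + ef
    return eb / total, ef / total


def predict_mask(background_logits, foreground_logits):
    """Label a pixel 1 (esophagus) where the foreground probability is higher."""
    mask = []
    for row_b, row_f in zip(background_logits, foreground_logits):
        row = []
        for b, f in zip(row_b, row_f):
            p_background, p_foreground = softmax2(b, f)
            row.append(1 if p_foreground > p_background else 0)
        mask.append(row)
    return mask
```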

1) CONVOLUTIONAL BLOCKS FOR MULTI-SCALE FEATURE LEARNING
Accurate segmentation of the cancerous esophagus at various scales requires effective multi-scale feature learning. In the encoding phase, a common approach to extracting features of different scales is to apply standard convolutions to gradually down-sampled feature maps. By contrast, we employ extra dilated convolutions in each encoder block to capture sufficient contextual information. As shown in Fig. 5(a), a 3 × 3 convolution is used in the left branch of the encoder block to perform normal feature extraction. In the right branch, three 3 × 3 dilated convolutions with dilation rates of 2, 3, and 4 are employed in parallel to extract features in larger receptive fields. Then, the produced feature maps are concatenated, followed by a 1 × 1 convolution that learns the weights of these extracted features and reduces the number of channels. Next, the feature maps from the two branches are added for feature fusion. Finally, the summed feature maps are transmitted to the PAM of the same level. Meanwhile, a subsequent max-pooling reduces the size of the summed feature maps by half. The bottom block is designed for further feature extraction and is shown in Fig. 5(b). We employ a 3 × 3 convolution and four 3 × 3 dilated convolutions with dilation rates of 2, 3, 4, and 5 to capture multi-scale information, and the produced feature maps are added to aggregate information. In contrast to deepening the network with more down-sampling operations, combining dilated convolutions is an efficient way to achieve multi-scale feature learning with fewer parameters, and it avoids reconstructing small objects from excessively small feature maps.
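The effect of the dilation rates on the receptive field of a single layer can be checked with the standard formula for the effective kernel size of a dilated convolution, $k_{\text{eff}} = k + (k - 1)(d - 1)$. The sketch below applies it to the rates mentioned above, reading the standard convolution as dilation rate 1 and the stated ranges as the integer rates 2 to 4 and 2 to 5 (an assumption about the intended values).

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Rate 1 is the standard 3 x 3 convolution; the others are the dilated branches.
ENCODER_DILATIONS = [1, 2, 3, 4]
BOTTOM_DILATIONS = [1, 2, 3, 4, 5]

encoder_extents = [effective_kernel_size(3, d) for d in ENCODER_DILATIONS]
bottom_extents = [effective_kernel_size(3, d) for d in BOTTOM_DILATIONS]
```

Each branch keeps the nine weights of a 3 × 3 kernel per channel pair while covering a progressively wider neighbourhood, which is consistent with the parameter-efficiency argument made above.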
In the decoding phase, the feature maps are gradually restored to the size of the input images. The structure of the decoder blocks is shown in Fig. 5(c). Using deconvolution, the input feature maps are up-sampled by a factor of 2. Then, they are concatenated with the feature maps transmitted from the corresponding PAM, and two consecutive convolutional layers follow. Furthermore, the transmitted feature maps are added to the output of the second convolutional layer. Since the transmitted feature maps contain a wealth of contextual information, it is important to make full use of them. The residual connection helps the network retain this information and passes it on to the next decoder block or the last 1 × 1 convolutional layer. Note that each convolutional layer in the convolutional blocks is followed by batch normalization and a rectified linear unit (ReLU) as the activation function unless otherwise stated.

2) PRIOR ATTENTION MODULE
In this task, the segmentation accuracy is limited by the loss of spatial information in the encoding phase. Skip connections can facilitate the recovery of spatial information by concatenating the feature maps from the encoder and decoder blocks. However, not all the activations of the feature maps in skip connections are relevant to the esophagus. To emphasize the esophagus and suppress irrelevant regions, we propose PAM, which utilizes the enhanced images to recalibrate the feature maps passed through skip connections. The inputs of PAM include two parts: the enhanced images and the feature maps produced by the encoder block of the same level. As shown in Fig. 5(d), the enhanced images are first down-sampled to the same size as the input feature maps. Since they cannot be used directly as the attention map, we employ a 3 × 3 convolution to capture implicit information from the enhanced images. Then, a 1 × 1 convolution is employed to reduce the number of channels to 1, followed by a sigmoid function to rescale the activations into [0,1]. Hence, the attention map $q \in \mathbb{R}^{H \times W}$ is obtained from the enhanced images by two consecutive convolutional layers. Let $M = [m_1, m_2, \cdots, m_i, \cdots, m_N]$ represent the input feature maps, where $m_i \in \mathbb{R}^{H \times W}$ denotes the feature map in channel $i$, and $N$ is the number of channels. The output feature maps $\hat{M}$ are given by:

$$\hat{M} = [q \odot m_1, q \odot m_2, \cdots, q \odot m_i, \cdots, q \odot m_N]$$

where $\odot$ denotes element-wise multiplication. Finally, the recalibrated feature maps are transmitted to the decoder block of the same level. With this spatial attention mechanism, the activation at each spatial location is recalibrated by the corresponding scale factor of the attention map.
Figure 4. Note that there is a 1 × 1 convolution followed by a softmax function at the end of the last decoder block, which is not shown in the illustration for the sake of simplicity.
PAMs can learn to emphasize relevant activations and filter irrelevant and noisy activations to make the concatenation operation fuse only useful features. Furthermore, the attention mechanism also works in the back-propagation during training, which allows model parameters in the encoder blocks to be updated mostly based on relevant regions [34].
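The recalibration step of PAM, in which one spatial attention map rescales every channel of the skip-connection features, can be sketched in plain Python. The example assumes the pre-sigmoid attention values are already available (in the network they would come from the two convolutional layers applied to the down-sampled enhanced images); the function names and the nested-list layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(feature_maps, attention_logits):
    """Multiply every channel element-wise by one shared attention map.

    feature_maps: list of N channels, each an H x W nested list.
    attention_logits: H x W nested list of pre-sigmoid attention values.
    """
    q = [[sigmoid(v) for v in row] for row in attention_logits]
    return [
        [[q[r][c] * channel[r][c] for c in range(len(channel[r]))]
         for r in range(len(channel))]
        for channel in feature_maps
    ]
```

Because the same map is applied to all channels, a location with an attention value close to 0 is suppressed in every channel before the features reach the decoder.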

IV. EXPERIMENTS

A. DATASET DESCRIPTION
The dataset is provided by Sun Yat-sen University Cancer Center and is from 30 esophageal cancer patients with different genders, ages, and cancer stages. The patients include males and females between the ages of 44 and 85. In the dataset, most of the patients are at stage II and III. 14 patients are at stage II and 12 patients are at stage III. Moreover, 1 patient with stage I cancer and 3 patients with stage IV cancer are also contained in the dataset to increase the data diversity. CT scan of the chest is performed for each patient and then a total of 6362 CT slices are collected. The CT slices have 512×512 pixels in-plane size with 2 mm slice thickness and the pixel spacing varies from 0.625 mm to 0.923 mm in the axial plane. Moreover, all the CT slices are labeled manually by professional doctors.
For the dataset division, the CT slices of 5 randomly selected patients are used as the test set, and the rest are used for training and validation. Five-fold cross-validation [39] is performed to evaluate model training, so 5 models are trained on different training sets. The CT slices of the remaining 25 patients are randomly split into 5 equal groups, so that each group includes 5 patients. In the training of each model, one of the groups is selected in turn as the validation set and the remaining groups serve as the training set. We tune the hyper-parameters based on the validation results of the 5 models. Finally, the model is trained on all the CT slices of the 5 groups with the best hyper-parameter configuration, and its performance is evaluated on the test set.
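The patient-level split described above can be sketched as follows. Splitting by patient rather than by slice keeps all slices of one patient in the same subset. The function names, the integer patient identifiers, and the fixed random seed are assumptions made for this illustration.

```python
import random


def split_patients(patient_ids, n_test=5, n_folds=5, seed=0):
    """Hold out n_test patients, then split the rest into equal folds."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    rest = shuffled[n_test:]
    fold_size = len(rest) // n_folds
    folds = [rest[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    return test, folds


def cross_validation_rounds(folds):
    """Yield (training patients, validation patients) for each fold in turn."""
    for k, validation in enumerate(folds):
        training = [p for i, fold in enumerate(folds) if i != k for p in fold]
        yield training, validation
```

With 30 patients this gives 5 test patients and five rounds of 20 training and 5 validation patients each.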

B. IMPLEMENTATION DETAILS

1) DATA AUGMENTATION
Data augmentation is a common approach to enhancing model robustness. During training, we perform online data augmentation to increase the diversity of the data available for training and to reduce storage requirements. The process is shown in Fig. 6. First, the 160 × 160 pixel region in the center of the original 512 × 512 CT image is cropped. Then, the cropped image is randomly rotated by an angle between −5° and 5°, and its size is kept unchanged by zero-padding. Next, we randomly crop the rotated image to obtain a random-size image. The aspect ratio of the random-size image is randomly chosen within [0.8, 1.2], and the ratio of its area to that of the rotated image is randomly chosen within [0.85, 0.95]. Finally, using bilinear interpolation, the random-size image is resized to 144 × 144 pixels, the input size of the segmentation network. We apply the same processing to the enhanced images and the label images.
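The random-size cropping step can be sketched as sampling an aspect ratio and an area ratio and deriving a crop box from them. The text does not say how combinations that exceed the 160 × 160 image are handled; the sketch below assumes rejection sampling (redraw until the box fits) and defines the aspect ratio as width over height, both of which are assumptions. The function name is hypothetical.

```python
import math
import random


def sample_crop_box(size=160, aspect_range=(0.8, 1.2),
                    area_range=(0.85, 0.95), rng=random):
    """Sample (left, top, width, height) of a random crop inside a size x size image."""
    while True:
        aspect = rng.uniform(*aspect_range)          # assumed: width / height
        area = rng.uniform(*area_range) * size * size
        width = math.sqrt(area * aspect)
        height = math.sqrt(area / aspect)
        if width <= size and height <= size:         # assumed: redraw if it does not fit
            break
    width, height = int(round(width)), int(round(height))
    left = rng.randint(0, size - width)
    top = rng.randint(0, size - height)
    return left, top, width, height
```

The same box (and the same rotation angle) would have to be reused for the CT image, the enhanced image, and the label image so that they stay aligned.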

2) MODEL TRAINING
We use deep learning to train the models and thereby determine the parameters of the segmentation networks. The adaptive moment estimation (Adam) optimizer [40] with a weight decay of 0.002 is employed for gradient descent, and the exponential decay rates of the first and second moment estimates are set to 0.9 and 0.999, respectively. We use an initial learning rate of 0.00001 and employ a step-based decay schedule that drops the learning rate by 40% every 20 epochs. The batch size is set to 16, and the model parameters are initialized using the method introduced in [11]. Moreover, we select Dice loss [41] as the loss function. Using five-fold cross-validation, we train the models for 100 epochs, and the performance is evaluated on the validation set after each epoch. The best models of the different networks are selected for final evaluation on the test set. All the models are implemented with Python 3.7 and PyTorch 1.3.1 and are trained on an NVIDIA GTX 1080 Ti GPU.
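The step-based schedule can be written down directly: a 40% drop every 20 epochs multiplies the learning rate by 0.6 at each step. The sketch below also includes one common form of the soft Dice loss for a flattened probability map; the exact formulation used in [41] or by the authors may differ (for example in the denominator or the smoothing term), so that part is an assumption. All names are hypothetical.

```python
def step_decay_lr(epoch, initial_lr=1e-5, drop=0.4, step=20):
    """Learning rate after dropping by `drop` every `step` epochs."""
    return initial_lr * (1.0 - drop) ** (epoch // step)


def soft_dice_loss(probabilities, targets, eps=1e-6):
    """One common soft Dice loss variant on flattened foreground probabilities."""
    intersection = sum(p * t for p, t in zip(probabilities, targets))
    denominator = sum(probabilities) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)
```

Over the 100 training epochs the schedule takes five values, ending at the initial learning rate multiplied by 0.6 to the fourth power.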

C. EVALUATION METRICS
To measure the performance of the proposed method and other existing methods, the dice similarity coefficient (DSC), precision (PRE), sensitivity (SEN), and Hausdorff distance (HD) are employed in our experiments. All of them are common pixel-level evaluation metrics for biomedical image segmentation. DSC, PRE, and SEN can be considered similarity metrics of two sets, which are defined as follows:

$$\mathrm{DSC} = \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}, \qquad \mathrm{PRE} = \frac{|Y \cap \hat{Y}|}{|\hat{Y}|}, \qquad \mathrm{SEN} = \frac{|Y \cap \hat{Y}|}{|Y|}$$

where $Y$ denotes the pixel set of the target region (cancerous esophagus) and $\hat{Y}$ denotes the pixel set of the predicted region segmented by the network. $|\cdot|$ denotes the number of pixels in a set. DSC, PRE, and SEN all lie in [0,1], and greater values mean better segmentation performance. HD measures the degree of deviation between two sets in a metric space, which is defined as:

$$\mathrm{HD}(Y, \hat{Y}) = \max\left\{ d(Y, \hat{Y}),\ d(\hat{Y}, Y) \right\}$$

where $d(Y, \hat{Y})$ and $d(\hat{Y}, Y)$ are defined as follows:

$$d(Y, \hat{Y}) = \max_{y \in Y} \min_{\hat{y} \in \hat{Y}} \|y - \hat{y}\|, \qquad d(\hat{Y}, Y) = \max_{\hat{y} \in \hat{Y}} \min_{y \in Y} \|\hat{y} - y\|$$

where $y$ and $\hat{y}$ denote pixels in $Y$ and $\hat{Y}$, respectively, and $\|\cdot\|$ denotes the Euclidean distance. HD ranges between 0 and $+\infty$, and a lower HD means better segmentation performance.
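The standard set-based forms of these four metrics can be sketched on sets of pixel coordinates as follows. The sketch assumes both sets are non-empty and computes distances in pixel units; converting HD to millimetres would additionally require the pixel spacing, which is not modelled here. The function names are hypothetical.

```python
import math


def overlap_metrics(target, predicted):
    """Return (DSC, PRE, SEN) for two non-empty sets of (row, col) pixels."""
    intersection = len(target & predicted)
    dsc = 2.0 * intersection / (len(target) + len(predicted))
    pre = intersection / len(predicted)
    sen = intersection / len(target)
    return dsc, pre, sen


def directed_distance(a, b):
    """Largest distance from a pixel in a to its nearest pixel in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(target, predicted):
    """Symmetric Hausdorff distance in pixel units."""
    return max(directed_distance(target, predicted),
               directed_distance(predicted, target))
```

For example, a 2 × 2 target square and a prediction shifted by one pixel to the right overlap in two of four pixels, giving 0.5 for DSC, PRE, and SEN and an HD of one pixel.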
From the experimental results, LinkNet and SegNet achieve poor performance in DSC and HD. One of the reasons is that LinkNet and SegNet were proposed for scene segmentation and are not suited to segmenting an object with a blurry boundary. U-Net and its variants, except U-Net plus, obtain better segmentation results, and this improvement is attributed to the use of skip connections. Among them, 3D U-Net achieves the best overall performance with 83.693% DSC, 84.274% PRE, 83.120% SEN, and 2.591 mm HD, which reflects the strength of 3D convolutional networks in medical image segmentation. Two 2D convolutional networks also achieve competitive performance. U-Net++, which integrates U-Nets of different depths, obtains 79.025% DSC, 74.992% PRE, 83.516% SEN, and 2.875 mm HD. Attention U-Net, which uses a spatial attention mechanism in skip connections, obtains a higher DSC (82.425%) and PRE (82.494%) and a lower HD (2.751 mm) than U-Net++. However, neither of them outperforms 3D U-Net in DSC, PRE, or HD. The experimental results show that no existing segmentation network achieves the best performance in all evaluation metrics, since they do not take into account the particular characteristics of the cancerous esophagus. By contrast, the proposed method handles this task well. From Table 2, it obtains the highest DSC (84.839%), PRE (85.955%), and SEN (83.752%), and the lowest HD (2.583 mm), which demonstrates that it outperforms the other existing segmentation networks in this task. Moreover, the experiments on our dataset also support the generalization capacity of the proposed method. As shown in Fig. 7, we draw box plots of the performance in the two most important metrics (DSC and HD) for the samples in the test set. We can observe that the proposed method obtains better performance and relatively stable segmentation accuracy.
Visual examples of segmentation results from the test set are shown in Fig. 8, where each row shows the same CT image. From Fig. 8, we can observe that the size and shape of the esophagus vary considerably. Some segmentation networks, such as FCN-8s, SegNet, LinkNet, and U-Net plus, cannot obtain satisfactory segmentation results. For these networks, some similar tissues and organs are wrongly predicted as the cancerous esophagus. Since the region of the cancerous esophagus in CT images is the key diagnostic basis, inaccurate segmentation results can have serious consequences. By contrast, the proposed method obtains highly accurate segmentation results. The last row in Fig. 8 shows a challenging case for segmentation, where the esophagus with a huge tumor has a blurry boundary and low contrast against surrounding tissues. In this case, the proposed method achieves better segmentation performance than the other networks.
These visual examples show that the proposed method can provide reliable segmentation results for the diagnosis and treatment of esophageal cancer.

2) ABLATION STUDY
We conduct an ablation study to show the effectiveness of each improvement in the proposed method. The experimental results are shown in Table 3. For simplicity, we use Models 1-4 to represent the models in Table 3 from top to bottom. Model 1, which is indicated by ''BS'' in Table 3, accepts a single CT image as input. It outperforms the original U-Net in DSC, SEN, and HD. By contrast, Model 2 requires three CT images as input to perform channel-wise 2.5D segmentation, which brings a 1.828% DSC and 0.048mm HD improvement.
However, Model 2 still performs worse than some existing networks such as Attention U-Net and 3D U-Net. Model 3 embeds PAMs in skip connections and utilizes the original CT images instead of the enhanced images to generate attention maps. From Table 3, PAMs bring a 1.121% DSC improvement due to the use of a spatial attention mechanism in skip connections. Finally, Model 4 represents the proposed method. The experimental result of Model 4 shows that MSRF brings a relatively large improvement in PRE and HD. This is because the enhanced images produced by MSRF can guide the network to distinguish irrelevant tissues and organs, which helps eliminate most false positives.
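The channel-wise 2.5D input used from Model 2 onward can be sketched as stacking three CT slices along the channel axis. The sketch below assumes that the three images are the target slice and its two immediate neighbours, and that edge slices are repeated at the ends of the volume; neither detail is stated here, and the function name is hypothetical.

```python
def stack_adjacent_slices(volume, index):
    """Return [previous, current, next] slices as a 3-channel input.

    Assumption: at the first and last slice, the missing neighbour is
    replaced by the edge slice itself.
    """
    last = len(volume) - 1
    neighbours = (max(index - 1, 0), index, min(index + 1, last))
    return [volume[i] for i in neighbours]


# Toy volume: 4 slices of 2x2 pixels, each filled with its slice number.
volume = [[[z, z], [z, z]] for z in range(4)]

channels = stack_adjacent_slices(volume, 2)
print([channel[0][0] for channel in channels])
```

Because the neighbouring slices enter as extra input channels of a 2D network, only the first convolution sees more parameters, which is how a 2.5D design can use z-axis context at a much lower cost than full 3D convolutions.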
The convergence curves of Models 1-4 are shown in Fig. 9. Even with more trainable parameters, the models with PAMs converge more efficiently than the models without PAMs. The major reason is that PAMs make the encoder blocks of the network focus mostly on the emphasized region during back-propagation. Moreover, MSRF further facilitates model convergence, since the enhanced images it produces provide additional prior knowledge for the network.

V. DISCUSSION
The varied anatomical structure and the blurry boundary of the cancerous esophagus are the key problems limiting the improvement of segmentation accuracy. Currently, the performance of existing segmentation networks is not satisfactory in this task. In this study, we have proposed a novel segmentation network named Eso-Net and a 3D enhancement filter called Multi-Structure Response Filter (MSRF) to address these problems. The experiments demonstrate that the proposed method outperforms existing segmentation networks in this task.
The experimental results show that channel-wise 2.5D segmentation helps improve the performance, since it can efficiently utilize z-axis information with fewer parameters. Moreover, our experiments demonstrate that using only standard convolutions for multi-scale feature extraction is not enough to achieve highly accurate segmentation results in this task. By contrast, we apply multiple dilated convolutions in parallel to the same feature maps, and employ residual connections for feature fusion and to facilitate gradient propagation. The validity of this approach is verified by our experiments. In addition, MSRF was designed to enhance the regions of irrelevant tissues and organs, and the enhanced images are utilized in PAMs. The experimental results demonstrate that the use of MSRF and PAMs has a positive effect on the segmentation performance.
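The idea of parallel dilated convolutions with a residual connection can be illustrated with a simplified 1D sketch. It is not the Eso-Net block itself: the network operates on 2D feature maps with learned weights, whereas the dilation rates (1, 2, 4), the fixed kernel, and the fusion by summation used here are assumptions chosen only to make the multi-scale receptive fields visible.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-size 1D convolution with zero padding and dilation `rate`."""
    half = len(kernel) // 2
    output = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * rate  # taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                total += weight * signal[j]
        output.append(total)
    return output


def multi_scale_block(signal, kernel, rates=(1, 2, 4)):
    """Sum parallel dilated branches, then add the input back (residual)."""
    branches = [dilated_conv1d(signal, kernel, rate) for rate in rates]
    fused = [sum(values) for values in zip(*branches)]
    return [x + f for x, f in zip(signal, fused)]


# A unit impulse shows which positions each branch reaches.
signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]
output = multi_scale_block(signal, kernel=[1, 1, 1])
print(output)
```

With the impulse input, the three branches respond at distances of 1, 2, and 4 samples from the centre, so a single block covers several scales with the same three-tap kernel, and the residual term passes the input through unchanged.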
Generally, doctors observe CT slices one by one to diagnose the stage and location of esophageal cancer. However, this is inconvenient for diagnosis and clinical analysis because of the heavy workload and tiresome procedures. Automatic segmentation based on deep learning can effectively assist doctors. The proposed method can take over this tedious and time-consuming work from doctors, since it is automatic and achieves state-of-the-art performance in this task. As verified by professional doctors, the segmentation results of the proposed method achieve the accuracy needed in practical clinical applications. As shown in Fig. 10, the output segmentation maps can be used to generate a 3D model that presents the entire esophagus of a patient. Doctors can conveniently observe the shape and structure of the tumor in the esophagus for further diagnosis and treatment. Moreover, it is easier to measure the size of the esophageal tumor on a 3D model than on 2D CT images. However, the proposed method also has its limitations. Firstly, the generalization capacity of Eso-Net is limited by the small size of the cancerous esophagus dataset. In this case, Eso-Net may obtain poor results when segmenting some extremely unusual samples. Secondly, MSRF is an untrainable component independent of the network. When applying the proposed method to other medical image tasks, its parameters need to be tuned manually before model training. Hence, our future goals are optimizing the architecture of Eso-Net on a larger dataset and merging MSRF with the deep learning model, which can further facilitate the development of medical image segmentation.
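As one simple illustration of measuring size from the stacked segmentation maps, the segmented volume can be estimated by counting foreground voxels and multiplying by the voxel size. This is a hedged sketch rather than the procedure behind Fig. 10: the function name is hypothetical, and the voxel spacing values are placeholders that would in practice come from the CT scan metadata.

```python
def segmented_volume_mm3(masks, spacing_mm):
    """Estimate the segmented volume from stacked binary masks.

    masks: list of 2D lists of 0/1 values, one mask per CT slice.
    spacing_mm: (slice thickness, row spacing, column spacing) in mm;
        placeholder values here, normally read from the scan metadata.
    """
    voxel_count = sum(sum(sum(row) for row in mask) for mask in masks)
    dz, dy, dx = spacing_mm
    return voxel_count * dz * dy * dx


# Two toy 2x2 masks with 3 and 2 foreground pixels respectively.
masks = [
    [[0, 1], [1, 1]],
    [[0, 0], [1, 1]],
]
volume = segmented_volume_mm3(masks, spacing_mm=(5.0, 0.5, 0.5))
print(volume)
```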

VI. CONCLUSION
In this article, we have proposed a novel 2.5D segmentation network named Eso-Net for automatic segmentation of the cancerous esophagus and designed a 3D enhancement filter called Multi-Structure Response Filter (MSRF) to extract 3D structural priors. Eso-Net is based on an encoder-decoder architecture and consists of different convolutional blocks. Dilated convolutions and residual connections are employed in the convolutional blocks to facilitate multi-scale feature extraction and fusion. Furthermore, the proposed Prior Attention Modules (PAM) are embedded in skip connections to recalibrate the activations of feature maps with the assistance of the enhanced images. In the experiments, the proposed method reports the highest DSC (84.839%), PRE (85.955%), and SEN (83.752%), and the lowest HD (2.583mm), which demonstrates that the proposed method achieves the best performance in automatic segmentation of the cancerous esophagus. Moreover, the ablation study shows that each improvement of the proposed method contributes to better segmentation performance. In the future, we are interested in optimizing the architecture of Eso-Net on a larger dataset and merging MSRF with the deep learning model.