A Symmetric Fully Convolutional Residual Network With DCRF for Accurate Tooth Segmentation

Accurate tooth segmentation from CBCT images is a crucial step for specialists to perform quantitative analysis, clinical diagnosis and surgery. In this paper, we present a symmetric fully convolutional network with residual blocks and a Dense Conditional Random Field (DCRF), which achieves accurate, automatic segmentation of tooth images. The proposed method not only strengthens feature propagation but also boosts feature reuse, which can be credited to the contracting and expanding paths that extract and recover pixel cues sufficiently. To this end, we apply special deep bottleneck architectures (DBAs) and summation-based skip connections in our network to ensure accurate segmentation with a much deeper neural network. Compared with previous methods that apply a conditional random field to the original image intensity, our approach applies the DCRF to the posterior probability map generated by the proposed network. To avoid interference from noise around the tooth, we exploit the pixel-level structured prediction capability of the DCRF, which further enhances segmentation performance. In the experiments, we verify the capability of our method with four evaluation indicators, which demonstrate its superiority.


I. INTRODUCTION
With a number of virtues such as a low radiation dose, Cone Beam Computed Tomography (CBCT) [1] can produce high-resolution 3D images of teeth in a single scan, which makes it one of the most important modalities for the diagnosis of dental diseases. By mapping the tooth data produced by CBCT into a 3D model, details of the tissue and structure inside the tooth can be revealed. However, tooth segmentation remains one of the most challenging tasks in dental diagnosis.
Different from traditional CT, CBCT replaces the 1D linear projection data of conventional CT with 2D surface projection data, i.e., 3D cone-beam scanning takes the place of traditional 2D fan-beam scanning. 3D images can be formed directly after CBCT image reconstruction. Besides, CBCT not only generates clear, high-resolution images, but also produces images within a reasonable field of view, requires only a short period of time for data acquisition and employs standard data formats. This makes significant contributions to segmentation and reconstruction based on medical images.
The visualization of human teeth has become an essential tool for computer-aided diagnosis. Xu et al. [39] proposed a 3D tooth segmentation model with a two-level hierarchy, teeth-gingiva labeling and inter-teeth labeling, which is directly applicable to orthodontic CAD systems. To solve oral problems in orthodontics and prosthodontics, doctors must manually grasp the overall anatomical information of the root and crown from a segmentation map generated from CBCT images of the target tooth. With the help of accurate segmentation, they can perform clinical diagnosis and plan surgery and restoration more efficiently. Hosntalab et al. [11] proposed a level set model with a 3D region as an auxiliary to extract the tooth surface. Considering the huge advantages brought by segmentation, tooth segmentation based on CBCT images has become a fundamental task.
The existing segmentation methods for medical images can be classified into two categories. (1) Traditional methods, including the template-based fitting method [3] and the level-set method [5], [6]. This type of method requires substantial participation of professional experts and suffers from low self-learning ability and high sensitivity to noise and vague cases. (2) Methods based on deep learning [2], [4], [7], [9]. Recently, due to the strong robustness and versatility of convolutional neural networks (CNNs), research on medical image processing using deep learning algorithms has become a contemporary hot spot, and such methods outperform traditional methods significantly [8], [10]. Miki et al. [40] used the AlexNet [44] architecture and image data augmentation to classify teeth, which can be used for the automatic filing of dental records for forensic identification. Tian et al. [35] presented a two-level hierarchical feature learning model for 3D tooth segmentation and classification by combining sparse voxel octrees and 3D CNNs, which solved the problem of misclassifying highly similar tooth categories. Cui et al. [41] first obtained the edge map of CBCT images, and then proposed a new learned similarity matrix based on Mask R-CNN [45] to realize automatic segmentation and recognition of teeth from CBCT images. Zhou et al. [42] proposed an asymmetric architecture called LEDNet, whose encoder adopts the downsampling idea of FCN and whose decoder adopts the attention idea of a pyramid network to realize precise semantic segmentation. Gou et al. [37] annotated tooth images as a dataset with an improved level-set method, and then trained a U-Net network [31] on the dataset. This method can use unlabeled data in the experiment to achieve semi-supervised learning.
However, deep learning based segmentation is limited by the high diversity and cost of medical images. Compared with other kinds of images, tooth images pose three challenges. Firstly, there is significant variation between conventional and unconventional cases, limiting the networks' generalization ability, as illustrated in Figure 1 (bottom left). Secondly, an image often contains multiple objects [19], requiring more manual annotation, as shown in Figure 1. Thirdly, similar gray values between teeth and nearby similar structures make the tooth margin vague [11], [13], as shown in the bottom row of Figure 1. Therefore, an accurate and efficient automatic segmentation method without labeling is urgently demanded in clinical practice to reduce the workload of experts. Based on these observations, this paper rebuilds and optimizes deep bottleneck architectures such as those in ResNet on top of the U-Net network structure. Moreover, we introduce a novel deep neural network, the symmetric fully convolutional residual network, with dense conditional random fields to refine the posterior probability map. The overall architecture is shown in Figure 2. To summarize, our main contributions are listed as follows.
(1) We design a novel DBA to replace the general convolutional layer, which not only increases the number of layers in the network, improving feature extraction, but also reduces the number of network parameters.
(2) We employ our modified DBAs in every convolutional layer of U-Net and introduce a skip connection structure to enhance the propagation and reuse of features.
(3) Instead of directly applying a DCRF to the original intensity information, we apply the DCRF to the segmentation probability map produced by the proposed network.
(4) Based on the probability map produced by our network, we make full use of the DCRF for overall structured prediction to remove noise in the images and locate the tooth contour precisely. The refinement of the tooth boundary further boosts segmentation performance.

II. RELATED WORK
Because the level set method has advantages in dealing with topological changes and contour propagation, researchers have applied it to biomedical images. Broadly speaking, among existing image segmentation methods, the level set method [13], [15] holds a pivotal status. Gan et al. [5] proposed a hybrid level set method based on global energy, local energy, tooth shape constraint energy and tooth edge detection energy to segment teeth from CT images. An edge-based active contour model was proposed by Li and Xu [16], whose main contribution is the use of a distance regularized level set. Gao and Oksam [6] proposed a level set model based on shape and intensity priors of the tooth to segment the crown and root of an individual tooth. Xia et al. [14] proposed a method to segment contacting maxillary and mandibular teeth. Although the results of traditional methods are acceptable, these methods need tedious expert annotations and repeated iterations to obtain a single usable image segmentation. In addition, it is hard to produce satisfactory results when the maxillary and mandibular teeth are in occlusion [13].

FIGURE 2. The overview of our framework. Given CBCT tooth images, we first feed the preprocessed images into the symmetric fully convolutional residual network to obtain the segmentation probability map. Then the DCRF is applied directly to the probability map, and the final tooth segmentation result is obtained through iteration and optimization.
Thanks to the development of computer hardware, deep learning has achieved outstanding performance in the field of image processing. By training end-to-end, deep neural networks perform superiorly in several image processing tasks such as feature extraction, image classification [18] and semantic segmentation [8]. Among many excellent models, ResNet [20], [24] proposes a unique shortcut structure called the residual unit. Employing a bottleneck structure with residual units enhances its performance by a meaningful margin and achieves a significant breakthrough in the field of deep neural networks. With its strong performance in image classification, localization and object detection, ResNet won all three of these tracks of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. Besides, ResNet also demonstrates dominant performance in the field of semantic segmentation [25]-[27].
For segmentation tasks, fully convolutional networks (FCNs) [28] were proposed and then applied to medical image segmentation [29], [30]. In order to achieve better performance, some researchers introduced U-Net [31], which uses a contracting path and an expansive path to extract features and produce a pixel-wise feature map. The unique structure of U-Net contributes to its outstanding performance over FCNs, making U-Net popular in the field of medical image segmentation [32]-[34]. However, with just a few layers, U-Net lacks the ability to extract deep features along the contracting path, while along the expansive path it cannot supplement the feature maps with enough information, limiting segmentation accuracy. Increasing the number of layers in U-Net to improve its feature extraction ability is a promising direction [36], but the structure and optimization of those models still need to be improved. Besides, medical images always carry a large amount of noise, and the gray-scale information they provide forms a low-quality feature space in which different kinds of tissue present the same gray level, bringing about low segmentation precision.
Context cues represent the spatial relationships between category labels and play an important role in structured prediction tasks. Context cues, or higher-level information, are key to semantic image segmentation. In recent years, the conditional random field (CRF) has acted as an effective way to optimize medical image segmentation results [21]-[23]. Combining a DNN's ability for feature extraction with a CRF's ability for structured modeling can yield better performance in image segmentation. Zhou et al. [43] designed a multi-scale deep context convolutional network named MDCCNet for semantic segmentation, which utilizes convolution to combine feature maps from different levels of the network in a holistic manner. On this basis, the authors used densely connected conditional random fields to optimize the output of MDCCNet and achieved efficient experimental results on the PASCAL VOC 2012 and SIFT Flow datasets. Based on traditional multi-scale input and a sliding pyramid network, Lin et al. [38] constructed a deep learning model combined with a CRF to obtain background context information. However, many methods for training on medical images still rely on the adjustment of limited parameters and use the original intensity information as the main feature space. The original intensity always provides a low-quality feature space for the CRF. Also, with a large amount of noise, different tissues may display the same gray level, which makes precise segmentation of medical images based on gray level a challenging task.

III. METHODS
In this section, we introduce our proposed method for accurate tooth segmentation from CBCT images in detail. We start by extracting and organizing the information of the CBCT images. Then, a novel symmetric network architecture designed with DBAs is introduced to perform the training task. Finally, a DCRF-based post-processing procedure is applied to further optimize the contours of the tooth segmentation results.

A. SYMMETRIC FULLY CONVOLUTIONAL RESIDUAL NETWORK
U-Net [31], composed of an expansive path and a contracting path, is used as our basic network structure. The contracting path helps extract contextual features, while the expansive path achieves precise pixel-wise localization and prediction. It is true that increasing the number of hidden layers can benefit the performance of neural networks. However, increasing the number of layers will not only inflate the number of parameters, but also require more computing resources, which may lead to overfitting. To overcome this defect, we apply modified deep bottleneck architectures to deepen the network while decreasing the number of parameters.
In detail, we design three DBAs to replace the convolutional layers employed in the original U-Net structure, as shown in Figure 3, where the left side is the original structure and the right side is our proposed DBA. The original structure has only one path. In our proposed structure, by contrast, each DBA is divided into two branches. The left branch contains only a 1×1 convolutional layer with stride 1, while the right branch is composed of three connected convolutional layers, 1×1, 3×3 and 1×1, each with stride 1. In particular, we add an identity mapping path to DBA1 to ensure that features are effectively propagated, alleviating the problem of vanishing or exploding gradients.
After each convolution operation, the ReLU activation function is applied to introduce non-linear features. Before applying ReLU, Batch Normalization (BN) is used. In detail, we first let k = k^{(1)}, k^{(2)}, \cdots, k^{(n)} represent the outputs of the n nodes within a layer, and calculate the average of these outputs, avg_k, defined as

avg_k = \frac{1}{n} \sum_{i=1}^{n} k^{(i)}.

The variance is also required, in terms of var_k^2, i.e.,

var_k^2 = \frac{1}{n} \sum_{i=1}^{n} \left( k^{(i)} - avg_k \right)^2.

We employ the learnable parameters \gamma and \beta, and then, based on the deviation calculated from the outputs of the n nodes, we normalize each output and obtain the final result:

\hat{k}^{(i)} = \frac{k^{(i)} - avg_k}{\sqrt{var_k^2 + \epsilon}}, \qquad y^{(i)} = \gamma \hat{k}^{(i)} + \beta.

Here, \epsilon is a small constant. The advantage of employing BN is that it boosts the speed of training, accelerates convergence and alleviates gradient vanishing.
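The normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the BN computation, not the paper's actual implementation (the network uses standard framework BN layers); the input vector and parameter values are placeholders.

```python
import numpy as np

def batch_norm(k, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize a vector of n node outputs.

    gamma and beta are the learnable scale/shift parameters;
    eps is the small constant added for numerical stability.
    """
    avg_k = k.mean()                            # mean over the n outputs
    var_k = ((k - avg_k) ** 2).mean()           # variance over the n outputs
    k_hat = (k - avg_k) / np.sqrt(var_k + eps)  # normalized outputs
    return gamma * k_hat + beta                 # scale and shift

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x)  # roughly zero mean, unit variance
```

After this transform the layer outputs have approximately zero mean and unit variance, which is what stabilizes and speeds up training.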
We define the input of a DBA as x(w, h, n), where w, h, n represent the width, height and number of channels respectively. The weight of the filter in the left branch is denoted as \omega_{L1}, and \omega_{R1}, \omega_{R2}, \omega_{R3} denote the weights of the filters in the right branch. The composition of BN and ReLU is denoted uniformly as G(\cdot), so each convolutional layer of the network computes G(\omega * x), where * denotes convolution. As a result, the output of the DBA can be represented as

y(w, h, n) = f(x, \omega)_L + f(x, \omega)_R + x,

where x denotes the input data x(w, h, n) and corresponds to the identity mapping in DBA1, and n denotes the number of channels of the DBA output; each kind of DBA produces a different n. f(x, \omega)_L is the output of the left branch,

f(x, \omega)_L = G(\omega_{L1} * x),

and f(x, \omega)_R is the output of the right branch, i.e.,

f(x, \omega)_R = G(\omega_{R3} * G(\omega_{R2} * G(\omega_{R1} * x))).

Among the three types of DBAs, DBA1 keeps the number of channels while performing a series of convolutions, replacing the plain convolutions in U-Net.
DBA2 doubles the number of channels after a series of convolutions, replacing the channel-doubling convolutions in U-Net. There is no identity mapping x(w, h, n) in DBA2, so the term x is removed from the output. DBA3 halves the number of channels after convolution, replacing the channel-halving convolutions that follow multi-scale contextual feature fusion in U-Net. As with DBA2, there is no identity mapping in DBA3, so the term x is removed.
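The two-branch topology of DBA1 can be sketched in plain NumPy. This is a minimal illustration of the structure described above, not the paper's Keras implementation: BN is folded into G(·) and approximated here by ReLU alone, and the weight shapes and random values are placeholders.

```python
import numpy as np

def conv1x1(x, w):
    # x: (H, W, Cin), w: (Cin, Cout) -> pointwise (1x1) convolution
    return x @ w

def conv3x3(x, w):
    # x: (H, W, Cin), w: (3, 3, Cin, Cout), stride 1, 'same' padding
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W] @ w[i, j]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def dba1(x, wl1, wr1, wr2, wr3):
    """DBA1 sketch: left 1x1 branch + right 1x1-3x3-1x1 branch + identity."""
    left = relu(conv1x1(x, wl1))
    right = relu(conv1x1(relu(conv3x3(relu(conv1x1(x, wr1)), wr2)), wr3))
    return left + right + x  # identity mapping exists only in DBA1

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal((16, 16, n))
y = dba1(x,
         rng.standard_normal((n, n)) * 0.1,
         rng.standard_normal((n, n)) * 0.1,
         rng.standard_normal((3, 3, n, n)) * 0.1,
         rng.standard_normal((n, n)) * 0.1)
```

DBA2 and DBA3 would use the same two branches with output channel counts 2n and n/2 respectively, and without the `+ x` identity term.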
Replacing the ordinary convolutional layers in U-Net with the three types of DBAs, we build a network with 54 layers, which effectively deepens the network, optimizes feature extraction and improves localization accuracy. The detailed network structure is shown in Figure 4. Table 1 compares the number of parameters of the original structure with the three different DBAs. It is worth noting that DBA1 and DBA2 have only half the parameters of the original structure. The results show that although the number of layers in our network increases, the total number of parameters decreases, which indicates the efficiency of our strategy.

B. DCRF FOR POST PROCESSING PROCEDURE
The conditional random field model [12], [17] is a graph model made up of unary potentials over pixels and pairwise potentials over adjacent pixels, which ignores the information of the whole space. In this paper, we apply the DCRF directly to the posterior probability map instead of the original tooth intensities. The segmentation probability map obtained from the symmetric fully convolutional residual network is used as the input of the DCRF model. The DCRF can not only exploit the relationship between adjacent pixels, but also use the pixel information of the whole space to judge and predict local pixels. At the same time, the model is built according to the relationships among pixel positions and distances in the space to fully grasp the context of the whole space. The Gibbs energy function is defined as

E(a) = \sum_{i=1}^{N} P_i(a_i) + \sum_{i<j} P_{ij}(a_i, a_j),

which is composed of the unary potential function P_i(a_i) and the binary potential function P_{ij}(a_i, a_j), where N denotes the number of pixels in the whole space. The unary potential function is a state feature function defined at position i of the observed sequence:

P_i(a_i) = -\log \varphi(a_i),

where \varphi(a_i) is the probability that the i-th pixel in the input image belongs to class a_i; in this paper, it is read from the segmentation probability map produced by the symmetric fully convolutional residual network. The binary potential function is a transfer feature function defined over pairs of observation positions, which characterizes the correlation between variables and the effect of the observation sequence on them:

P_{ij}(a_i, a_j) = \theta(a_i, a_j) \sum_{m} \omega_m k_m(f_i, f_j),

where \theta(a_i, a_j) = 1 if a_i \neq a_j and 0 otherwise, so that a penalty applies only when the two labels disagree. Since all pixels in the DCRF model are fully connected, there is a pairwise term between every pair of pixels i and j in the image, no matter where they are relative to each other. Here, f_i and f_j are the feature vectors of pixels i and j, respectively.
k_m is a Gaussian kernel that depends on the feature vectors of pixels i and j, and \omega_m is its corresponding weight. The bilateral relationship is a popular pairwise relationship in image processing, which roughly states that pixels with similar colors or positions are likely to belong to the same class. With reference to the bilateral relationship and the gray intensity, the kernels are defined as

k(f_i, f_j) = \omega_1 \exp\left( -\frac{\|\delta_i - \delta_j\|^2}{2\alpha^2} - \frac{|I_i - I_j|^2}{2\beta^2} \right) + \omega_2 \exp\left( -\frac{\|\delta_i - \delta_j\|^2}{2\gamma^2} \right),

where the first kernel depends on both the pixel position \delta and the pixel gray intensity I, while the second kernel depends only on the pixel position, and \alpha, \beta and \gamma are hyperparameters controlling the widths of the Gaussian kernels. The binary potential function describes the relationship between pixels: it encourages similar pixels to take the same label, while pixels with larger differences are assigned different labels. The "relationship" between pixels is defined by gray intensity and actual relative distance, so that segmentation at the boundary can be achieved, the boundary is refined, and an accurate segmentation map is finally obtained.
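The two-kernel pairwise term above can be illustrated with a small NumPy sketch that evaluates the appearance (bilateral) and smoothness kernels for one pixel pair, together with the Potts compatibility. The hyperparameter values are illustrative placeholders, not the paper's tuned settings, and real DCRF inference would of course evaluate this over all pairs via efficient filtering.

```python
import numpy as np

def pairwise_kernel(pos_i, pos_j, I_i, I_j, w1=1.0, w2=1.0,
                    alpha=80.0, beta=13.0, gamma=3.0):
    """Appearance (position + intensity) kernel plus smoothness
    (position-only) kernel between pixels i and j."""
    d2 = np.sum((pos_i - pos_j) ** 2)   # squared spatial distance
    c2 = np.sum((I_i - I_j) ** 2)       # squared intensity difference
    appearance = w1 * np.exp(-d2 / (2 * alpha**2) - c2 / (2 * beta**2))
    smoothness = w2 * np.exp(-d2 / (2 * gamma**2))
    return appearance + smoothness

def potts(a_i, a_j):
    # Potts compatibility: a penalty applies only when labels differ
    return 1.0 if a_i != a_j else 0.0

# two adjacent pixels with identical gray intensity -> strong coupling
k = pairwise_kernel(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                    np.array([0.5]), np.array([0.5]))
```

Because the kernel value is large for nearby, similar-intensity pixels, assigning them different labels incurs a large pairwise energy, which is exactly what pulls the segmentation boundary onto real intensity edges.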

C. DATASET AND PRE-PROCESSING
The original dataset of CBCT tooth images is provided by West China Hospital, Sichuan University. The images of the dataset are all in DCM format, as shown in Figure 5(a). In order to transform these images into data that can be passed to the neural network, we extract and reorganize the information of these DCM files. The overall processing procedure is shown in Figure 5.
Firstly, the useful parts of the CBCT tooth images are extracted and transformed into trainable images. In detail, we first calculate the windowing parameters of each CBCT file, and then use the window to map the gray values of the image to 0-255, obtaining the imgMap. Next we use the minimum value of the imgMap to window the image data and convert the data to uint8 format. The generated images are stored in PNG format with a resolution of 401 × 401. We use these images as the training dataset, as shown in Figure 5(b). Algorithm 1 summarizes the procedure for extracting trainable images from the CBCT tooth files.
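The windowing step can be sketched as follows. This is a minimal illustration of mapping raw CBCT gray values into 0-255 and casting to uint8, under the assumption of a simple linear window; the `center` and `width` arguments are hypothetical here, since in practice they are derived from each DCM file's metadata.

```python
import numpy as np

def window_to_uint8(raw, center, width):
    """Clip raw gray values to [center - width/2, center + width/2],
    map the window linearly to 0-255, and cast to uint8."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    img = np.clip(raw, lo, hi)             # discard values outside the window
    img = (img - lo) / (hi - lo) * 255.0   # linear map into 0-255
    return img.astype(np.uint8)

# hypothetical raw values and window, for illustration only
raw = np.array([[-500.0, 0.0], [1000.0, 3000.0]])
img = window_to_uint8(raw, center=1000.0, width=4000.0)
```

The resulting uint8 arrays can then be written out as PNG training images.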
Secondly, we use the annotation tool LabelMe to annotate these images, as shown in Figure 5(c). Thirdly, considering that the gray scale of the annotated images produced by LabelMe ranges from 0 to 1 while the network requires images in the range 0 to 255, we rescale the gray values. Finally, we use these images as the labels, as shown in Figure 5(d).
After preprocessing the CBCT tooth images, the training dataset is composed of 86 images (conventional / unconventional = 51/35) with ground truth annotations. The testing set contains 24 images. In order to achieve better performance, data augmentation strategies are applied during training: we perform rotation, horizontal translation, vertical translation, scaling and normalization to enrich the training dataset.

IV. EXPERIMENTS

A. NETWORK TRAINING
We build a novel symmetric fully convolutional residual network which is composed of nine convolution groups.
Each group contains two DBA structures, except that we apply only one DBA1 structure in the first block. At the downsampling stage, DBA2 and DBA1 are used in the second to fifth blocks, each followed by a single max-pooling layer. To alleviate concerns about overfitting, a dropout layer is employed before the max-pooling layer in the fourth block. To preserve the result of downsampling, we directly employ a dropout layer after the fifth block instead of adding a max-pooling layer. To achieve multi-scale feature fusion, we employ DBA3 and DBA1 in the sixth to ninth blocks, after each of which we perform deconvolution and concatenate channels. A convolutional layer with a 1 × 1 kernel (stride 1) is added at the end of the ninth block. We apply the sigmoid activation function to this 1 × 1 layer and obtain the final segmentation probability map. Adam is used as the optimizer with a fixed learning rate of 0.01. Binary cross entropy is employed as the loss function of the network. The whole network model is stored as an HDF5 file after training. Thus, the well-trained symmetric fully convolutional residual network model is obtained.
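The binary cross entropy loss used for training can be written out explicitly. This is a minimal NumPy sketch of the pixel-wise loss between the ground-truth mask and the sigmoid probability map (in practice the Keras built-in loss is used); the example arrays are placeholders.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean pixel-wise binary cross entropy between a 0/1 ground-truth
    mask and a sigmoid probability map."""
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth pixel labels
y_pred = np.array([0.9, 0.1, 0.8, 0.2])   # predicted probabilities
loss = binary_cross_entropy(y_true, y_pred)
```

The loss approaches zero as the predicted probabilities approach the ground-truth labels, which is what the Adam optimizer minimizes here.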

B. IMPLEMENTATION DETAILS
After the training procedure, we input the test dataset to our trained network model. In the symmetric fully convolutional residual network, fully connected layers are removed, which eliminates the restriction on the size of the input images. The tooth segmentation probability map is generated by the network, and this probability map is then used as the input of the DCRF model. After DCRF optimization, the final accurate tooth segmentation result is generated. All the work is implemented in Keras and run on a server with an Nvidia GeForce 1070 GPU. During the experiments, the training batch size is set to 100 and the network is trained for 10 epochs. Specifically, Figure 6 shows the training accuracy and loss on our tooth dataset for every 100 iterations. It can be seen that as the number of iterations increases, the segmentation accuracy gradually improves and converges.

C. QUALITATIVE EVALUATION
To evaluate the effectiveness of our proposed method qualitatively, the segmentation results of part of the test data are shown in Figure 7 (conventional cases) and Figure 8 (unconventional cases). In order to demonstrate that the DCRF refines the outline information, we also show the segmentation probability map generated by the symmetric fully convolutional residual network; these qualitative results are shown in the middle row of Figure 7 and Figure 8. As can be seen from the segmentation results, our method can accurately segment tooth objects by using a special and efficient neural network architecture to fuse multi-level contextual features and refining tooth boundaries with the DCRF. Notably, our method accurately segments tooth objects in both conventional and unconventional cases. In the probability maps alone, some teeth are not completely separated due to excessive contact between teeth and a large amount of noise from similar structures around them. In comparison, our full method is capable of separating these touching tooth objects clearly. Under a unified multitask framework, the symmetric fully convolutional residual network and the DCRF can qualitatively recover and predict the missing dental information, which demonstrates their superiority.

D. QUANTITATIVE EVALUATION AND COMPARISON
In order to evaluate tooth segmentation accuracy, we employ four widely used performance metrics, i.e., Volume Difference (VD), Dice Similarity Coefficient (DSC), Average Symmetric Surface Distance (ASSD) and Maximum Symmetric Surface Distance (MSSD). For convenience, the ground truth and predicted regions of a tooth are denoted as A and B respectively. VD is the volume difference between the segmentation result and the ground truth; a value of 0 means perfect segmentation, and it is formulated as

VD = \frac{\left| |A| - |B| \right|}{|A|}.

DSC is the most widely used evaluation criterion for segmentation, ranging from 0 to 1; a value of 1 means perfect segmentation, and it is defined as

DSC = \frac{2 |A \cap B|}{|A| + |B|}.

ASSD is based on surface voxels; a value of 0 means perfect segmentation, and it is defined as

ASSD = \frac{\sum_{v \in S(A)} dist(v, S(B)) + \sum_{v \in S(B)} dist(v, S(A))}{|S(A)| + |S(B)|}.

MSSD is also known as the Hausdorff distance; a value of 0 means perfect segmentation, and it is defined as

MSSD = \max\left\{ \max_{v \in S(A)} dist(v, S(B)), \; \max_{v \in S(B)} dist(v, S(A)) \right\},

where S(\cdot) denotes the set of surface voxels, and dist(v, S(B)) = \min_{s \in S(B)} \|v - s\| is the shortest distance from a voxel v to the whole set S(B), with \|\cdot\| the Euclidean norm.
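The two overlap-based metrics can be computed directly from binary masks, as the short NumPy sketch below shows (the surface-based ASSD and MSSD additionally require extracting surface voxels and nearest-neighbor distances, and are omitted here; the toy masks are placeholders, and the exact normalization of VD is an assumption consistent with "0 means perfect").

```python
import numpy as np

def vd(A, B):
    """Volume difference between ground truth A and prediction B,
    normalized by the ground-truth volume; 0 means perfect."""
    return abs(int(A.sum()) - int(B.sum())) / int(A.sum())

def dsc(A, B):
    """Dice similarity coefficient in [0, 1]; 1 means perfect overlap."""
    inter = np.logical_and(A, B).sum()
    return 2.0 * inter / (A.sum() + B.sum())

# toy example: a 4x4 square mask and the same square shifted by one pixel
A = np.zeros((8, 8), dtype=bool); A[2:6, 2:6] = True
B = np.zeros((8, 8), dtype=bool); B[2:6, 3:7] = True
```

Here the shifted mask has the same volume as the ground truth (VD = 0) but only partial overlap, which is exactly why volume and overlap metrics are reported together.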
To validate the effect of the DCRF model, we conduct an ablation study on the base network with and without the DCRF. We take the segmentation probability map output by the base network as the result without the DCRF (see the third row of Figure 8), and compare it with the segmentation result with the DCRF (the fourth row of Figure 8); the quantitative comparison is shown in Table 2.
The quantitative analysis of our proposed tooth segmentation method is compared with the traditional methods of Gan et al. [5], Xia et al. [14], Gao et al. [6] and Hosntalab et al. [11]. The quantitative tooth segmentation accuracy of our method and the four traditional methods on the test dataset is shown in Figure 9. It is worth noting that the traditional methods are not necessarily inferior to deep learning based methods: Xia et al. [14] used the level set method to obtain good results, only slightly worse than our proposed method. In order to further verify the superiority of our method, we modified the original U-Net [31] for CBCT tooth segmentation, and we also compared with several recent learning-based tooth segmentation methods [35], [37], [40]. The accuracy comparison between our proposed method and these algorithms is shown in Table 3. Compared with other recent learning-based methods, our method achieves higher accuracy on all four performance metrics. In particular, the value of ASSD is close to the size of one pixel, which means that our method achieves high performance at pixel-level segmentation, and on the most authoritative indicator, DSC, our method also has the highest accuracy. By testing some unconventional tooth cases, we discovered that when there are large-scale similar confusing structures around the tooth, those structures may be wrongly discriminated as segmentation objects, as shown in Figure 8 (fourth column). In summary, our method for accurate tooth segmentation from CBCT achieved the best segmentation results on all testing data, which consistently proves its effectiveness.

V. CONCLUSION
In this paper, we have presented a symmetric fully convolutional residual network with a DCRF to accurately segment teeth from CBCT images. We apply the symmetric fully convolutional residual network to generate a segmentation probability map from the input tooth images and use the DCRF as a post-processing step, which addresses the problem of over-smoothed boundaries in deep learning based segmentation. In the symmetric fully convolutional residual network, special DBAs and BN layers are applied to enhance the propagation and reuse of features, improving the network's ability to extract features and perform pixel-wise prediction. Rather than the original intensity information, we apply the DCRF to the segmentation probability map produced by our network, making full use of it to locate the tooth outline and refine its boundary, which improves the accuracy of tooth image segmentation. According to the experimental results, our method performs well compared with existing methods.