A Fast and Accurate Algorithm for Nuclei Instance Segmentation in Microscopy Images

Nuclei instance segmentation in microscopy images is a fundamental task in the pathology workflow: it allows meaningful nuclear features to be extracted and a range of biological analyses to be performed. However, the task remains challenging because of the large variability among different types of nuclei. Although deep learning (DL) based methods have achieved state-of-the-art results in nuclei instance segmentation, they usually focus on improving accuracy and require powerful computing resources. In this paper, we perform detection and segmentation jointly and propose a fast and accurate box-based nuclei instance segmentation method. Specifically, we employ a fusion module based on the feature pyramid network (FPN) to combine the complementary information of the shallow and deep layers, and use it to locate nuclei with bounding boxes. We then crop the feature maps according to the bounding boxes and feed the cropped patches into a U-Net architecture as a guide for separating clustered nuclei. Experiments show that the proposed approach outperforms prior state-of-the-art methods in both accuracy and speed. The source code will be released at: https://github.com/QUAPNH/Nucleiseg.


I. INTRODUCTION
Nuclei instance segmentation in microscopy images is a fundamental task in the pathology workflow. It allows nuclear morphometric and appearance features such as average size, density and pleomorphism to be extracted, and supports biological analyses such as classifying phenotypes and profiling treatments [1]. For instance, breast cancer is the most common malignant tumor in women worldwide [2]. The gold standard for its diagnosis, grading and prognosis prediction remains the examination of H&E stained tissue under a microscope [3]. Nuclear shapes and spatial arrangements, which are key factors of the Nottingham grading system, are obtained by pathologists who manually examine the H&E stained tissue sections [4]. With the development of high-throughput technologies, manual analysis of millions of tissue biopsies has become a bottleneck [5]. Therefore, automatic techniques that accurately and quickly segment nuclei instances in histology images spanning diverse patients, organs and disease states contribute significantly to the development of computer-aided systems for clinical and medical applications [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Kumaradevan Punithakumar.
Nuclei instance segmentation aims to detect and segment every nucleus in a microscopy image simultaneously. Accurate automatic segmentation of nuclei at the instance level remains challenging for several reasons [7]-[9]. Firstly, nuclei are frequently occluded, adherent or clustered, which easily causes over- or under-segmentation and affects morphological feature measurements. Secondly, blurred borders and inconsistent staining make it difficult to distinguish each nucleus, and hence introduce mislabelling and incorrect annotations even when the annotation is done by a specialized pathologist, which makes objective and robust results hard to obtain. Thirdly, the images are acquired under a variety of pathological conditions. Nuclei enlarge and exhibit margination and prominent nucleoli, so nuclear appearance, magnitude and density vary among cancer subtypes. A method therefore needs to generalize well across these variations. The large differences in nuclear morphology and the key challenges for nuclei segmentation in microscopy images are illustrated in Fig. 1.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
With the rapid development of microscopic imaging techniques in recent years, many studies have developed algorithms for nuclei instance segmentation [1], [10]. Algorithms based on conventional image processing techniques, such as thresholding, marker-controlled watershed and morphological operations, are still employed for this complex and difficult task. They typically need shape and spatial priors to guide the segmentation pipeline.
A drawback of these classic algorithms is that prior knowledge is needed to adjust their parameters. As a result, they usually work well on one dataset but perform poorly when applied to new datasets or when experimental conditions change, so the parameters need to be retuned. Recently, deep learning methods have come to dominate computer vision and medical image analysis [11], and many variants of deep convolutional neural networks (DCNNs) have been applied to nuclei instance segmentation [12], [13]. Most algorithms focus exclusively on improving segmentation accuracy [14]. Consequently, many algorithms, some of them complicated combinations of different techniques, have been proposed in the hope of achieving state-of-the-art segmentation accuracy. In clinical settings, however, computing resources are usually limited. It is therefore important to develop efficient and effective algorithms that can perform fast and accurate instance segmentation under restricted resources.
Recently, in order to capture nuclei instances accurately, Yi et al. [15] developed an attentive instance segmentation model called ANCIS, built on a joint network that combines the single shot multi-box detector (SSD) [16] and U-Net [17]. Attention mechanisms are used in both the detection module and the segmentation module of ANCIS to improve detection and segmentation accuracy, but these mechanisms require more computation, which complicates the model and restricts its execution speed. To address this drawback, we propose a new architecture for the encoding module and the segmentation module. Inspired by the classical structure of U-Net, we design the encoding branch and use SSD to detect the bounding boxes, and then use these boxes to crop the RoI patches that guide the segmentation, particularly for adherent and clustered nuclei. The contributions of this work are:
• We perform detection and segmentation jointly and propose a simple box-based nuclei instance segmentation method, which uses the detection branch to guide the segmentation.
• The fusion module based on FPN in the detection branch aggregates the complementary information of the shallow layers with that of the deep layers. Instance normalization is used in the skip connection to remove the neighbor statistics of each cropped patch and focus on the target.
• Experimental results demonstrate that the proposed approach achieves remarkable performance in both accuracy and speed.
The rest of the paper is organized as follows: Section II reviews the relevant work on nuclei instance segmentation. Section III presents the proposed nuclei instance segmentation approach. In Section IV, we describe the datasets, experiments and results. Finally, we present the discussion and conclusion in Section V.

II. RELATED WORK
In the prior literature, methods based on energy models, such as the watershed algorithm, have been widely used for nuclear instance segmentation. Yang et al. [18] presented a marker-controlled watershed based on mathematical morphology to extract nuclear instances with less over-segmentation. However, it often produces unreliable results because its morphological operator relies on a consistent intensity difference between the foreground and background, which does not hold in more complex situations. Many approaches have attempted to obtain improved markers for the marker-controlled watershed, for example with active contours [19] or with more complicated morphological operations that generate the energy landscape [20]. The results depend heavily on the predefined nuclear geometry that determines the marker generation. Because obtaining sufficiently strong markers to guide the watershed algorithm is challenging, some methods have departed from the energy-based approach. Kwak et al. [21] computed the concavity of nuclear clusters. Liao et al. [22] assumed that the nuclear shape is an ellipse and used ellipse fitting to separate the clusters, which does not capture the natural diversity of nuclei. In addition, the segmentation results of these methods are sensitive to parameters that are usually selected manually.
Recently, deep learning methods have outperformed the non-deep state-of-the-art methods and have come to dominate computer vision tasks. Many researchers have investigated the potential of deep learning for nuclei instance segmentation. One representative class of algorithms, called box-free instance segmentation, segments nuclei instances directly without the aid of bounding boxes. For example, Chen et al. [23] presented a deep contour-aware network (DCAN) that explicitly harnesses nuclear appearance and contour information. They first use the complementary appearance and contour information to perform semantic segmentation and extract contours under a unified multi-task learning framework. They then overlay the contours on the semantic segmentation results in order to separate touching and clustered nuclei. DCAN ranked first in the MICCAI nuclei segmentation challenge. However, it tends to produce over-segmentation when the boundaries between touching nuclei are unclear. Oda et al. [24] proposed a Boundary-Enhanced Segmentation Network (BESNet) that has an encoding path similar to U-Net and two decoding paths to restore the resolution. One decoding path is designed to enhance the nuclear boundaries, which are then used to improve the quality of the whole-nuclei segmentation produced by the other decoding path. BESNet has two limitations: it learns complementary information in the nuclei branch but ignores the potential benefit flowing from nuclei to contour, and it is computationally complex. Zhou et al. [25] proposed a Contour-aware Informative Aggregation Network (CIA-Net) with a multilevel information aggregation module that exploits the spatial and texture dependencies between nuclei and contours by bi-directionally aggregating task-specific features. Graham et al. [26] presented Hover-Net for nuclear segmentation and classification, which leverages the instance-rich information encoded in the vertical and horizontal distances of nuclear pixels to their centers of mass. The spatial-aware network (Spa-Net) [27] was proposed to predict the position information of each nucleus by combining a multi-scale dense unit. The final nuclei instances are separated by clustering the predicted location coordinates.
In the field of instance segmentation, many studies have sought to combine object detection and segmentation in order to overcome the weakness of box-free methods, which lack a global understanding of the objects. These box-based methods locate the objects' bounding boxes from a global perspective and perform instance segmentation within the regions defined by the bounding boxes. Mask RCNN [28], which extends Faster RCNN [29], is perhaps the most common and successful two-stage method. Two-stage methods first detect the object bounding boxes from initial candidate regions, then crop patches according to the bounding boxes and perform segmentation within the cropped patches to achieve more accurate results. However, these methods are computationally expensive and require powerful computing resources. Several approaches, exemplified by SSD [16] and RetinaNet [30], eschew the second refinement stage of two-stage methods and focus on direct sliding-window prediction, and have shown promising results. For nuclei instance segmentation, Wang et al. [31] designed a dilated residual network to address the loss of information about small objects in deep neural networks, and showed good recognition and segmentation capability for nuclei in microscopic images. In order to improve the precision of neural cell instance segmentation, Yi et al. [15] constructed a joint network based on SSD and U-Net. They employed attention mechanisms in both the detection and segmentation modules to refine the feature maps and make the model attend to the useful features. However, the attention modules complicate the model and require more computation. To deal with adjacent neural cells, Yi et al. [32] proposed a context-refined instance segmentation method that focuses on the targets within the cropped RoIs and suppresses the background information. Yi et al. [33] proposed a keypoint-based object detector that detects five pre-defined points of a nucleus, together with a separate segmentation branch that segments the nuclei instances, but it fails to localize small objects. Chen et al. [34] presented Boundary-assisted Region Proposal Networks (BRP-Net) for nuclei instance segmentation. A task-specific feature encoding network is used to obtain the instance proposals in the first stage, and an instance segmentation network is used to segment the proposals in the second stage.

III. METHODS
The flow chart of the proposed network is illustrated in Fig. 2. The proposed model is a unified, end-to-end trainable instance segmentation network that includes a detection branch to locate the bounding boxes of nuclei and a segmentation branch to segment the nuclei. The segmentation is performed on RoI patches that are cropped from the feature maps using the detected bounding boxes. We describe the network in detail below.
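As an illustration of the cropping step, the following minimal sketch crops a single-channel feature map with a bounding box given in input-image pixels. The function name `crop_roi`, the `(x_min, y_min, x_max, y_max)` box format, the `stride` argument and the outward rounding are assumptions made for this example; they are not taken from the released implementation.

```python
def crop_roi(feature_map, box, stride):
    """Crop a 2D feature map (list of rows) with a box in image pixels.

    box    -- (x_min, y_min, x_max, y_max) in input-image coordinates (assumed format)
    stride -- ratio between input-image size and feature-map size (assumed known)
    """
    height, width = len(feature_map), len(feature_map[0])
    x_min, y_min, x_max, y_max = box

    # Scale the box to feature-map coordinates, rounding outwards so the
    # whole nucleus stays inside the patch, then clamp to the map borders.
    fx0 = max(0, int(x_min // stride))
    fy0 = max(0, int(y_min // stride))
    fx1 = min(width, int(-(-x_max // stride)))   # ceiling division
    fy1 = min(height, int(-(-y_max // stride)))  # ceiling division

    return [row[fx0:fx1] for row in feature_map[fy0:fy1]]


# Hypothetical 8 x 8 feature map of a 32 x 32 image (stride 4).
demo_map = [[r * 8 + c for c in range(8)] for r in range(8)]
demo_patch = crop_roi(demo_map, (8, 4, 20, 12), stride=4)
```

In a multi-level encoder the same box would be cropped from every level with that level's own stride, so shallow levels yield larger patches than deep ones.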

A. NUCLEI DETECTION BRANCH
Inspired by ANCIS, we redesign the detection network as shown in the top half of Fig. 2. We design a fusion module that fuses the shallow layers and the deeper layers to build high-level semantic feature maps. The input image is resized to 512 × 512 before being fed into the fusion module. We use ResNet50 Conv1-6 as the backbone network [35], with the same ResNet50 parameters as Yi et al. [15].
The feature pyramid network (FPN) [36] is a well-known structure in object detection that fuses a shallow feature map with an upsampled deep feature map in order to detect targets at different scales. To leverage the rich semantic information of small nuclei, we use FPN to fuse the shallow and deep feature maps of the backbone network. The details of the fusion module are shown in Fig. 3. Firstly, we convert the feature maps c1, c3, c4 and c5 of the backbone network to 256 channels using a 1 × 1 convolutional layer. Secondly, we resize the feature map c6 to 16 × 16, the size of c5, using bilinear interpolation. We add the feature maps from c5 and c6 and pass the sum through a 3 × 3 convolutional layer to obtain the output feature map p5. Repeating the strategy of the second step, we obtain the output feature maps p1-p4 of the fusion module.
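The resize-and-add step of the fusion module can be sketched in a few lines. This is a simplified illustration on single-channel maps stored as nested lists: the 1 × 1 and 3 × 3 convolutions are omitted, and the pixel-centre sampling convention in `bilinear_resize` is an assumption, since the text does not specify it.

```python
def bilinear_resize(fmap, out_h, out_w):
    """Resize a 2D map (list of rows) with bilinear interpolation."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        # Map the centre of the output pixel back to input coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def fuse(shallow, deep):
    """Upsample the deep map to the shallow map's size and add element-wise."""
    up = bilinear_resize(deep, len(shallow), len(shallow[0]))
    return [[s + u for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(shallow, up)]


# Toy stand-ins for c5 (4 x 4) and c6 (2 x 2); real maps have 256 channels.
c5 = [[1.0] * 4 for _ in range(4)]
c6 = [[2.0, 2.0], [2.0, 2.0]]
p5_before_conv = fuse(c5, c6)
```

Applying `fuse` repeatedly from the deepest level to the shallowest gives the top-down pathway; in the full module each sum would additionally pass through a 3 × 3 convolution.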
Let $L_{locs}$ and $L_{conf}$ be the bounding box loss and the confidence loss, respectively. The total detection loss is a weighted combination of the bounding box loss and the confidence loss, in which $N_{pos}$ is the number of positive anchors and $\theta$ is a weight factor. $L_{locs}$ is a smooth $L_1$ loss:

$$L_{locs} = \sum_{i} \sum_{m} \mathrm{smooth}_{L_1}\left(l_i^m - g_i^m\right),$$

where $l_i^m$ and $g_i^m$ are the predicted and encoded anchor boxes and $m$ indexes the box coordinates. $\mathrm{smooth}_{L_1}(x)$ is the piecewise function

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$
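For illustration, the smooth L1 term can be written in a few lines. The sketch below uses the common definition of smooth L1 (quadratic for |x| < 1, linear otherwise). The helper `localization_loss`, which sums over the four box coordinates of each positive anchor, is a hypothetical name, and the unweighted, unnormalized sum is an assumption made for the example.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large residuals."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(pred_boxes, target_boxes):
    """Sum smooth L1 over all coordinates of all positive anchors.

    pred_boxes, target_boxes -- lists of equal length; each entry holds the
    encoded coordinates of one positive anchor (assumed layout).
    """
    return sum(
        smooth_l1(l - g)
        for pred, target in zip(pred_boxes, target_boxes)
        for l, g in zip(pred, target)
    )


# One positive anchor with a small error on x and a large error on w.
demo_loss = localization_loss([[0.0, 0.0, 1.0, 1.0]], [[0.5, 0.0, 3.0, 1.0]])
```

The linear branch keeps the gradient bounded for badly localized boxes, which is the usual motivation for preferring smooth L1 over a plain squared error in box regression.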
$L_{conf}$ is defined as a binary cross entropy loss between the ground-truth confidence and the predicted box confidence:

$$L_{conf} = -\sum_{i} \left[ x_i \log p_i + (1 - x_i) \log (1 - p_i) \right],$$

where $x_i$ is the labeled confidence, $p_i$ is the predicted confidence, and $i$ refers to each position in a segmentation mask.
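A binary cross entropy of this kind can be sketched as follows. The clamping constant `eps` and the averaging over entries are assumptions added for numerical safety and readability; the text does not state how the loss is normalized.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for x, p in zip(labels, probs):
        # Clamp the probability so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total -= x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total / len(labels)


# An uninformative prediction of 0.5 costs log(2) per entry.
demo_bce = binary_cross_entropy([1, 0], [0.5, 0.5])
```

The same function applies to the pixel-wise segmentation loss if the labels and probabilities of a cropped patch are flattened into lists.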

B. NUCLEI SEGMENTATION BRANCH
After the detection step, we obtain the bounding boxes of all nuclei in the input image. We then crop the nuclei instance regions from the encoder layers C0-C6 and perform segmentation for each nucleus instance within these cropped regions. The nuclei segmentation module is shown in the bottom half of Fig. 2. Inspired by the classical structure of U-Net, we design a skip connection that combines the feature maps from the shallow layers with those from the deep layers, so that both high-level semantics and low-level image details are used. The skip connection is illustrated in the bottom-left corner of Fig. 2. A deep feature map $x_d \in \mathbb{R}^{C_d \times W_d \times H_d}$ and a shallow feature map $x_s \in \mathbb{R}^{C_s \times W_s \times H_s}$ are passed into the skip connection, where $C$, $W$ and $H$ are the channel size, width and height of a feature map. Firstly, the feature map $x_d$ is up-sampled to the same size as $x_s$ through bilinear interpolation. After a 3 × 3 convolutional layer, instance normalization and a Rectified Linear Unit (ReLU) layer, the feature map is passed to the concatenation operation. It should be noted that instance normalization proves to be well suited to our task. Instance normalization can be used to remove the style statistics of an instance in image generation [37].
For our segmentation task, each cropped patch is supposed to contain mainly one nucleus. We can therefore formalize our problem as removing the neighbor statistics of each cropped patch, which makes instance normalization a feasible choice. Let $x \in \mathbb{R}^{H \times W}$ be a cropped RoI patch. The instance normalization is defined as

$$\mathrm{IN}(x) = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\mu$ and $\sigma^2$ are the mean and variance of the cropped patch, respectively, $\gamma$ and $\beta$ are scaling factors that control the extent of the normalization, and $\epsilon$ is a small constant for numerical stability. Secondly, we concatenate the feature map from the previous operation with the shallow feature map $x_s$. Finally, we pass the concatenated feature map through a 1 × 1 convolutional layer, instance normalization and a ReLU layer to obtain the output feature map. We use the binary cross entropy loss as the objective of the segmentation:

$$L_{seg} = -\frac{1}{N} \sum_{j=1}^{N} \frac{1}{M} \sum_{i=1}^{M} \left[ t_{ij} \log p_{ij} + (1 - t_{ij}) \log (1 - p_{ij}) \right],$$

where $p$ and $t$ represent the predicted and ground-truth segmentation probability maps, respectively, $i$ indexes the pixel positions, $j$ indexes the positive predicted bounding boxes, $M$ is the number of pixels in the map, and $N$ is the number of positive predicted bounding boxes.
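The effect of instance normalization on a single cropped patch can be sketched as follows. The function operates on one channel stored as a nested list; the stabilizing constant `eps` and its value are assumptions, and `gamma` and `beta` default to the initial values reported in the implementation details.

```python
import math


def instance_norm(patch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one single-channel patch by its own mean and variance."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = gamma / math.sqrt(var + eps)
    return [[scale * (v - mean) + beta for v in row] for row in patch]


# Two patches that differ only by a constant intensity offset.
dark = [[1.0, 2.0], [3.0, 4.0]]
bright = [[101.0, 102.0], [103.0, 104.0]]
dark_norm = instance_norm(dark)
bright_norm = instance_norm(bright)
```

Because the statistics are computed per patch, the two patches above normalize to the same values: the offset contributed by the surroundings is removed and only the relative structure inside the crop remains.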

IV. EXPERIMENTS AND RESULTS

A. DATA SETS
We employ two public data sets, Data Science Bowl 2018 (DSB2018) and MoNuSeg, to verify the effectiveness of the proposed method. The two datasets are described in detail as follows:

1) DATA SCIENCE BOWL 2018
DSB2018 was released for a Kaggle competition that aims to develop nuclei segmentation methods for a variety of microscopy images without manual interaction or adjustment [38]. The dataset contains a large number of segmented nuclei images. The images were acquired under a variety of microscopic imaging conditions, microscopy instruments, operators and staining protocols, so they vary in cell type, magnification and imaging modality. The nuclei sizes range from 21 to 1037 pixels, which makes nuclear detection and segmentation very difficult in some images. The dataset is commonly used to measure an algorithm's ability to generalize across these variations.

2) MoNuSeg DATA SET
MoNuSeg [6] was released to encourage the development of nucleus segmentation algorithms on a diverse set of Hematoxylin-Eosin (H&E) stained tissue images. The dataset contains 30 images with 21623 manually annotated nuclear boundaries in total. Each image was downloaded from The Cancer Genome Atlas (TCGA) and comes from an individual patient. Since a whole slide image is too large, each image is cropped to 1000 × 1000 to reduce the computational requirement. The dataset was collected from 18 different hospitals. Since the staining process and image scanning equipment are not exactly the same across labs, there are large appearance variations among the images. Furthermore, to ensure a rich variety of nuclear types, 7 different organs, viz., breast, liver, kidney, prostate, bladder, colon and stomach, are represented, including both benign and diseased tissue samples.

B. EVALUATION METRICS
We use the mask-level average precision (AP) at an IoU (intersection over union) threshold of α (denoted AP@α) to evaluate the nuclear instance segmentation accuracy [15]. AP is formulated as

$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \cdots, 1\}} p(r),$$

where $p(r)$ is the precision at recall level $r$, so that AP is the mean precision over 11 recall levels. Following [15], we also use AIoU@α, the average IoU at threshold α, to measure the quality of the segmentation:

$$AIoU@\alpha = \frac{1}{N_\alpha} \sum_{k=1}^{N_\alpha} IoU_k,$$

where $N_\alpha$ is the total number of predictions that satisfy IoU ≥ α (α = 0.5 and 0.7) and $IoU_k$ is the IoU of the $k$-th such prediction. For each prediction, IoU is calculated between the predicted segmentation mask and all the ground-truth masks:

$$IoU(x, y_i) = \frac{\sum_{w=1}^{W} \sum_{h=1}^{H} x_{wh} \wedge y_{i,wh}}{\sum_{w=1}^{W} \sum_{h=1}^{H} x_{wh} \vee y_{i,wh}},$$

where $x$ is the predicted mask, $y_i$ is the $i$-th ground-truth mask, and $W$ and $H$ are the width and height of the image, respectively.
In addition to accuracy, speed is another important evaluation index for detection and segmentation algorithms. Frames per second (FPS), the number of images that can be processed per second, is a common measure of speed. We therefore use FPS, computed from both the inference time and the post-processing time, to measure the speed of our method.
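The three accuracy metrics can be sketched compactly for binary masks stored as nested lists. The function names are hypothetical, and `ap_11_point` uses the interpolated precision (the maximum precision at any recall not smaller than r), which is the PASCAL VOC convention and an assumption here, since the text does not say how the precision at each recall level is obtained.

```python
def mask_iou(x, y):
    """IoU between two binary masks of identical shape."""
    inter = 0
    union = 0
    for x_row, y_row in zip(x, y):
        for a, b in zip(x_row, y_row):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def aiou(ious, alpha):
    """Average IoU over the predictions whose IoU reaches the threshold alpha."""
    kept = [v for v in ious if v >= alpha]
    return sum(kept) / len(kept) if kept else 0.0


def ap_11_point(recalls, precisions):
    """Mean interpolated precision at recall levels 0, 0.1, ..., 1."""
    total = 0.0
    for k in range(11):
        r = k / 10
        candidates = [p for rc, p in zip(recalls, precisions) if rc >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Toy example: a prediction covering two pixels, one of which is correct.
demo_iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
demo_aiou = aiou([0.9, 0.6, 0.4], alpha=0.5)
```

Here `recalls` and `precisions` are the points of the precision-recall curve obtained by sweeping the detection confidence threshold, with a prediction counted as correct when its IoU reaches α.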

C. IMPLEMENTATION DETAILS
We implement the model in the open-source library PyTorch and run it on a high-performance computer with a 35.4816 Tflops CPU and an 18.8 Tflops GPU. To avoid over-fitting and improve generalization, we use several data augmentation methods: random expanding, cropping, flipping, contrast distortion and brightness distortion. We initialize the parameters of the backbone network from a ResNet50 pretrained on ImageNet, and initialize the remaining parameters from a standard Gaussian distribution. Since the detection branch is used to crop the RoI patches that guide the segmentation branch, we first train the detection network and then freeze its parameters while training the segmentation network. We initialize the learned scaling factors β = 0 and γ = 1 in Formula (1). We set the mini-batch size to 100 and the learning rate to 0.001. We use stochastic gradient descent (SGD) to solve the optimization problem in our method.

D. ABLATION STUDY
To verify that the fusion module improves the accuracy of SSD's bounding box detection, we compare the results of our detection module with those of SSD on the Kaggle 2018 Data Science Bowl and MoNuSeg datasets. We randomly select some images from the two datasets and display the detailed results in Fig. 4 and Fig. 5, respectively. The results show that the detection module used in our method improves the detection accuracy of the bounding boxes.
To verify that the fusion and instance normalization mechanisms improve the model's performance, we compare the quantitative results of our method with different modules on the Kaggle 2018 Data Science Bowl and MoNuSeg datasets. The results are shown in Table 1 and Table 2. Baseline denotes the method without the fusion module and without instance normalization; its segmentation branch is a standard U-Net structure. Fusion denotes the method with the fusion mechanism. InstanceNorm denotes the skip connection with the instance normalization operation. Fusion+InstanceNorm denotes the proposed method with both the fusion mechanism and instance normalization. The results in Table 1 and Table 2 suggest that the fusion and instance normalization mechanisms each improve segmentation performance, and that the combination of both mechanisms achieves the best performance.

E. RESULTS
Recently, many works have developed nuclei instance segmentation methods that achieve state-of-the-art performance on microscopy images, such as Mask RCNN [28], DCAN [23], BESNet [24], CIA-Net [25], ANCIS [15], CRN-CIS [32], SPA-Net [27] and BRP-Net [34]. Since the source code of Mask RCNN, DCAN, CRN-CIS and ANCIS is publicly available, we compare our method only with these methods for fairness. It should be noted that we use exactly the same network architecture of the proposed method for both datasets.
Firstly, we test the performance of the proposed method on the DSB dataset. We split the original annotated training data into three subsets: a training set with 402 images, a validation set with 134 images and a testing set with 134 images. Table 3 shows the quantitative results of nuclei instance segmentation on the testing set. The proposed method achieves the best performance, with 71.38 AP@0.5 and 60.80 AP@0.7, which indicates that it improves instance segmentation performance. To illustrate the segmentation, we also display in Fig. 6 the detailed results for six images randomly selected from the testing set. Mask RCNN can segment the tiny structures of small nuclei, but it fails to capture the correct structure when the nuclei instance is large or when cells adhere to each other. When the boundaries are clear, DCAN obtains satisfactory results, but it is less effective at separating overlapping and clustered nuclei because of the unclear nuclear boundaries. ANCIS can segment the tiny and slender structures, but it tends to produce some over-segmentation. Compared with these methods, our model exhibits remarkable performance in segmenting both tiny and adherent structures.
Secondly, we test the proposed method on the MoNuSeg dataset. We use the same strategy described in [6] to split the dataset into three subsets: a training set, a validation set and a testing set. Compared with the DSB dataset, images in the MoNuSeg dataset contain a larger number of nuclei, more adhesion and heavy background noise. It is therefore much more difficult to segment nuclear instances in these histopathological images. The quantitative results on the testing set are reported in Table 4. The results demonstrate that the proposed method again obtains the best performance, with 77.65 AP@0.5 and 55.05 AP@0.7, and a comparable AIoU among these methods. We also display in Fig. 7 the detailed results for 7 randomly selected images of different organs. The results illustrate that Mask RCNN fails to segment large instances and DCAN fails to segment adherent instances. ANCIS can capture the tiny and slender structures, but misses several nuclei that are similar to the background. Our method greatly improves AP@0.5 and AP@0.7.
Regarding the processing time, the average FPS of the proposed method is 7.4786 and 8.8527 on the two datasets, respectively, which means that the proposed method takes on average only 0.1337 and 0.1130 seconds to segment an image from the two datasets, respectively. From Table 3 and Table 4, the average processing times are 0.1970 and 0.1857 seconds for ANCIS and 0.5271 and 0.5142 seconds for Mask RCNN, respectively. Therefore, the proposed method is faster than the other methods.
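The relation between FPS and per-image time used above is a simple reciprocal, and the relative speed-up follows from multiplying a competitor's per-image time by our FPS. The short sketch below reproduces this arithmetic from the figures quoted in the text; the function names are hypothetical.

```python
def seconds_per_image(fps):
    """Average processing time of one image, given frames per second."""
    return 1.0 / fps


def speedup(baseline_seconds, fps):
    """How many times faster a method running at `fps` is than a baseline
    that needs `baseline_seconds` per image."""
    return baseline_seconds * fps


# Figures quoted in the text for the two datasets.
ours_fps = (7.4786, 8.8527)
ancis_seconds = (0.1970, 0.1857)
mask_rcnn_seconds = (0.5271, 0.5142)

ours_seconds = [seconds_per_image(f) for f in ours_fps]
vs_ancis = [speedup(s, f) for s, f in zip(ancis_seconds, ours_fps)]
vs_mask_rcnn = [speedup(s, f) for s, f in zip(mask_rcnn_seconds, ours_fps)]
```

On the first dataset this gives a speed-up of roughly 1.5 times over ANCIS and roughly 3.9 times over Mask RCNN.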

V. DISCUSSION AND CONCLUSION
Analysis of nuclei in microscopic images is the first step towards developing automated computer-aided methods for the diagnosis and prognosis of cancer. Many studies have proposed algorithms for accurate nuclei instance segmentation, which enables the subsequent exploration of nuclear features to predict clinical outcome. For example, DCAN [23] first used the complementary appearance and contour information to improve the discriminative capability of intermediate features for separating adherent and clustered nuclei. However, its performance depends heavily on the quality of the nuclear boundaries. When the boundaries are unclear, as for adherent and clustered nuclei, the segmentation results are unsatisfactory. Besides, it tends to segment adherent and clustered nuclei as one large individual nucleus, which explains its good AIoU. Mask RCNN [28] fails to capture nuclei instances that are large or adherent, because it adopts RoI align to restrict the object feature patch to 7 × 7, which causes the high-resolution details in the shallow layers to be ignored. ANCIS [15] uses attention mechanisms in both the segmentation branch and the detection branch. It performs well on tiny and slender neural cells, but it tends to make mistakes when segmenting touching instances that have the same textures and classes. In addition, the attention mechanism makes it computationally complex.
In this paper, we have designed a fusion module that uses the multi-scale features of the backbone network to detect the bounding boxes of nuclei instances, and have used a skip connection mechanism to combine the feature maps from the shallow layers with those from the deep layers of a standard U-Net. We have shown that the proposed method obtains state-of-the-art nuclei segmentation results compared with several recently published methods on two well-known competition datasets. Regarding the processing time, Mask RCNN uses very large arrays in memory to store the nuclear instances, because it inherently stores a single instance per channel and there are many nuclei in an image, and therefore requires a much longer processing time. ANCIS uses attention mechanisms in both the detection and segmentation branches, which makes the computation more complex. In our method, we abandon the attention modules and use some simple mechanisms to optimize the method. In addition, the fusion module improves the detection accuracy of the bounding boxes, and the instance segmentation is performed within the patches that are cropped according to the bounding boxes. These are the main reasons why the proposed method improves nuclei instance segmentation accuracy.
A major bottleneck for the development of satisfactory nuclei instance segmentation methods is the limited amount of training data, particularly data with accurate annotations. In this paper, we use the datasets from two well-known competitions, which can be seen as an 'optimized' version of real data. Real data in a clinical environment may contain artificial damage, noise, non-uniform staining, artifacts, etc. Accurate annotation of real data is time-consuming and requires professional prior knowledge. Sometimes annotations differ considerably even among professional annotators. In addition, the datasets used in this manuscript include thousands of nuclei, but they are obtained from a limited number of patients. These competition datasets are therefore insufficient to reflect the characteristics of real clinical data. We will test our algorithm on real clinical data in our future work.