Deep Learning Methods for Lung Cancer Segmentation in Whole-slide Histopathology Images - the ACDC@LungHP Challenge 2019

Accurate segmentation of lung cancer in pathology slides is a critical step in improving patient care. We proposed the ACDC@LungHP (Automatic Cancer Detection and Classification in Whole-slide Lung Histopathology) challenge for evaluating different computer-aided diagnosis (CAD) methods on the automatic diagnosis of lung cancer. ACDC@LungHP 2019 focused on segmentation (pixel-wise detection) of cancer tissue in whole-slide images (WSIs), using an annotated dataset of 150 training images and 50 test images from 200 patients. This paper reviews the challenge and summarizes the top 10 submitted methods for lung cancer segmentation. All methods were evaluated using the false positive rate, false negative rate, and Dice coefficient (DC). The DC ranged from 0.7354±0.1149 to 0.8372±0.0858. The DC of the best method was close to the inter-observer agreement (0.8398±0.0890). All methods were based on deep learning and were categorized into two groups: multi-model methods and single model methods. In general, multi-model methods were significantly better (p<0.01) than single model methods, with mean DC of 0.7966 and 0.7544, respectively. Deep learning based methods could potentially help pathologists find suspicious regions for further analysis of lung cancer in WSI.

I. INTRODUCTION
Lung cancer is the leading cause of cancer-related death in the world. According to the 2009-2013 SEER (Surveillance, Epidemiology, and End Results) database, the 5-year survival rate of lung cancer patients is approximately 18% [1]. For patients with early-stage, resectable cancer, the 5-year survival rate is about 34%, but for unresectable cancer, it is less than 10%. Therefore, early detection and diagnosis of lung cancer are key steps in improving patient treatment outcomes. According to the National Comprehensive Cancer Network (NCCN) guidelines, for image-suspected tumors, histopathological assessment of biopsies obtained via fiberoptic bronchoscopy should be performed for the diagnosis [2], [3]. Assessment of biopsy tissue by a pathologist is the gold standard for lung cancer diagnosis. However, the diagnostic accuracy was less than 80% [4]. The major histological subtypes of malignant lung disease are squamous carcinoma, adenocarcinoma, small cell carcinoma, and undifferentiated carcinoma. Correctly assessing these subtypes on biopsy is paramount for correct treatment decisions. However, the number of qualified pathologists is too small to meet the substantial clinical demand, especially in countries such as China, with a significant population of lung cancer patients. Recently, the results from the largest randomized controlled lung screening trial, the National Lung Screening Trial (NLST), led to the implementation of lung cancer screening with low-dose computed tomography in the United States in 2015. Moreover, the results from the second-largest randomized controlled trial, the Dutch-Belgian lung cancer screening trial (NELSON), also show the benefits of implementing lung cancer screening. The implementation in the U.S. and the possible implementation of lung cancer screening in Europe will likely lead to a substantial number of whole-slide histopathology images of biopsies and resected tumors.
At the same time, the workload and the shortage of pathologists are severe. An artificial intelligence (AI) system might efficiently solve the problems mentioned above by an automatic assessment of lung biopsies.
Digital pathology has been gradually introduced in clinical pathology practice. Digital pathology scanners can generate high-resolution WSIs (up to 160 nm per pixel). This facilitates the development of automatic analysis algorithms for reducing the burden and improving the performance of pathologists. Most recently, a large number of deep learning (DL) methods have been proposed for automatic image analysis of WSIs from the cell level to the image level [5]-[11].
arXiv:2008.09352v1 [eess.IV] 21 Aug 2020
At the image level, a three-layer CNN was first introduced to detect invasive ductal breast carcinoma and showed a comparable result (65.40% accuracy) to classifiers relying on specific handcrafted features [27]. CNNs were also used in the detection of prostate cancer [28], pancreatic cancer [29], and kidney cancer [30]. Furthermore, a CNN could be used as a feature extractor for colon cancer classification and colon cancer prediction [31].
Deeper CNNs, such as GoogLeNet [32], AlexNet [33], VGG [34], and ResNet [35], were transferred to breast cancer classification [36] and prostate cancer prediction [37]. In the CAMELYON16 challenge [38], the first-ranked team ensembled two GoogLeNets to raise the AUC for classifying lymph node metastases to 99.4%. Several challenges in medical imaging also significantly advanced the pathology image analysis community, such as the mitosis detection challenge at ICPR 2012, and CAMELYON16 and CAMELYON17 for identifying breast cancer metastases. In particular, CAMELYON16 was the first challenge to offer WSIs with a large number of annotations, which is essential for training larger CNNs such as ResNet.
With the breakthrough of DL methods in medical image analysis and the increasing number of public WSIs available for developing task-specific CNNs, we believe that CNNs could be leveraged to give pathologists more reliable, objective results or even help pathologists improve the level of cancer diagnosis. However, after assessing recent review papers [10], [11], we found very few articles discussing the application of CNNs to histopathological images of lung cancer. Furthermore, no public WSI datasets were available to evaluate such algorithms. A recent paper that used CNNs for lung cancer detection considered only cytological images [39]. The size of each image was limited (only around 1k × 1k pixels), and the appearance of these images was quite different from the hematoxylin and eosin (H&E) stained images that we used in this paper. Recent research [40] suggested that image features automatically extracted from WSIs and fed to machine learning classifiers can predict the prognosis of lung cancer patients and thereby contribute to precision oncology.
To further explore the potential application of DL on WSI for lung cancer diagnosis, we proposed the ACDC@LungHP challenge, which is, to the best of our knowledge, the first one addressing lung cancer detection and classification using WSI [36]. This manuscript is a summary of the first stage of ACDC@LungHP (in conjunction with ISBI 2019), which focused on the segmentation of cancer tissue in WSI. A sample pathological WSI with annotations is shown in Fig.1.

B. Data preparation
Histological slides were stained with H&E and scanned by a digital slide scanner (3DHISTECH Pannoramic 250) at an objective magnification of 20x. A close look at different tissues in the slides can be seen in Fig.2. The patch colors differed considerably, even among patches from normal tissue, because of staining variability. The appearance of the cancer regions also varied because of the different cancer types. For instance,
In total, 200 H&E stained slides were scanned and digitized. We randomly split those 200 slides into training and test sets. 150 slides with annotations were released as the training set, and 50 slides were held out as the test set. The main types of cancer were included in our data: squamous cell carcinoma, small cell carcinoma, and adenocarcinoma. Their ratio was approximately 6:3:1. One pathologist with 30 years of experience (the director of the pathology department) annotated the cancer regions for all 200 slides (see Fig.1). We also asked a second pathologist (with 20 years of experience) to annotate the test set only. The annotation of the second pathologist was used only for assessing the inter-observer variability. Participants were allowed to use their own training data for pre-training. All data were uploaded to Microsoft OneDrive, Google Drive, and Baidu Pan for participants from different regions. Whole-slide images were released in TIFF format. Manual annotations were in XML format.
In clinical practice, more than one sample from the same biopsy is often scanned. If samples had a similar shape, the pathologist annotated only one sample in the WSI. Participants were advised to use ASAP 4 to draw a bounding box themselves to exclude the unused samples.

A. Challenge overview
The first stage of the ACDC@LungHP challenge focused on detecting and segmenting lung carcinoma in WSI. Such segmentation could help pathologists quickly identify suspicious regions. At this stage, 495 participants submitted challenge applications, and 391 of them were confirmed as valid participants (with the required registration information). Each team was allowed to submit results three times per day. 25 participants successfully submitted their results before the closing time. The distribution of the participants is shown in Fig.3. The Dice coefficient (DC) was computed to evaluate the agreement between the automatic segmentation and the manual annotation by the pathologist. The DC was defined as:

$$\mathrm{DC} = \frac{2\,|GT \cap RES|}{|GT| + |RES|}$$

where GT and RES are the ground truth from the pathologist and the result of the automatic segmentation, respectively. The top 10 teams were selected from the final participants. The overall comparison can be seen in Table I. The DC ranged from 0.7354 to 0.8372. Based on the model ensembling strategy (see the following sections), the methods from the top 10 teams could be categorized into two groups: multi-model methods and single model methods. Other criteria, such as label refinement, pre-processing, and pre-training strategy, are also summarized in Table I.

4 https://github.com/GeertLitjens/ASAP

B. The methods based on single model
The single model methods used one individual model as their architecture. The mean DC for single model methods was 0.7544. An overall comparison of single model methods can be seen in Table I.
The rank #4 team combined the advantages of a CNN and a fully convolutional network (FCN) [41] to improve the accuracy of segmentation. First, a bounding box was manually annotated to locate the tissue regions. The CNN was based on the DenseNet-121 structure [42] with two output neurons, and the FCN was based on a DenseNet structure consisting of three dense blocks. The first dense block had five convolutional layers, and the other two had eight convolutional layers each. The architecture of their model is shown in Fig.4. An Intel Core i7-7700K CPU and an NVIDIA GTX 1080 Ti GPU were used for training. For the CNN, they used cross-entropy with softmax output as the loss function and Adam as the optimizer. For the FCN, they used the dice loss and focal loss as the loss function and SGD with momentum as the optimizer.
The rank #5 team integrated an atrous fusing module and a CNN feature extractor to build their network (see Fig.5). Their feature extractor (IncRes) combined ResNet and Inception V2 by replacing the eight middle blocks of ResNet18 with Inception modules. In the data preprocessing step, the WSI was split into big patches (of size 768 × 768 × 3), and nine small patches were extracted uniformly from each big patch. After feature extraction using IncRes, multi-atrous convolution was used for feature fusion [43]. The big patches were assigned to TUMOR, NORMAL, and MIX according to the annotation. They mixed MIX patches into TUMOR and NORMAL to keep the training data balanced. In their experiments, four parallel atrous convolution modules with different dilation rates were used to fuse all features. A convolutional conditional random field (CRF) [44] was connected after the concatenation layer. The CRF was not involved in the training stage but was used to refine the output results. They used four NVIDIA GTX 1080Ti 12GB GPUs and set the learning rate to 1e-3 for the first 40 epochs and 7e-4 for the last 20 epochs. The loss function was BCEWithLogitsLoss. They showed that the model combining IncRes, the atrous convolution module, and the CRF gave the best segmentation performance.
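The patch preparation described for the rank #5 team can be sketched as follows. This is a minimal illustration, not the team's code: the text only states that nine small patches were extracted uniformly from each 768 × 768 big patch, so the 3 × 3 grid of 256 × 256 patches is an assumption, as is the rule that a big patch is TUMOR only when fully annotated and NORMAL only when it contains no annotated pixels. The function names are hypothetical.

```python
BIG = 768             # big-patch side length, taken from the text
GRID = 3              # assumption: the nine small patches form a 3 x 3 grid
SMALL = BIG // GRID   # 256, a consequence of the grid assumption


def small_patch_boxes(x0, y0):
    """Return (left, top, right, bottom) boxes of the nine small patches
    inside the big patch whose top-left corner is (x0, y0)."""
    boxes = []
    for row in range(GRID):
        for col in range(GRID):
            left = x0 + col * SMALL
            top = y0 + row * SMALL
            boxes.append((left, top, left + SMALL, top + SMALL))
    return boxes


def label_big_patch(mask):
    """Assign TUMOR / NORMAL / MIX to a big patch.

    mask is a 2D list of 0/1 annotation values (1 = annotated cancer).
    The exact assignment rule is an assumption made for illustration.
    """
    total = sum(len(row) for row in mask)
    tumor = sum(sum(row) for row in mask)
    if tumor == 0:
        return "NORMAL"
    if tumor == total:
        return "TUMOR"
    return "MIX"
```

With this layout, `small_patch_boxes(0, 0)` yields boxes from `(0, 0, 256, 256)` to `(512, 512, 768, 768)`, so the nine small patches tile the big patch without overlap.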
The rank #8 team used a fast deep learning-based model. They fed the entire training set into an FCN [41] on the AI Explore platform [45]. After training, the test set was processed by the AI Explore platform for real-time lung whole-slide segmentation. They used an NVIDIA GeForce GTX 1080 Ti to train the model and SGD with a large momentum to avoid getting trapped in local minima. The learning rate was set to 1e-10.
The rank #9 team used a classification method that labeled large regions instead of individual pixels. They trained a ResNet18 model with multiple data augmentation methods. An adaptive threshold was used for cell detection. Training ran on a single NVIDIA GeForce GTX 1060, and Adam was used as the optimizer with a learning rate of 1e-4.
The rank #10 team processed WSIs in large, non-overlapping patches to capture more context. They evaluated three alternative networks: Small-FCN-16, Small-FCN-32, and Small-FCN-512. To locate the cancer region rather than its exact boundaries, they used 4×4 convolutional filters to increase the receptive field at different levels. They also used Imagenet-FCN to train their model. Training ran on an NVIDIA Pascal GPU. The Adam optimizer, with a decaying learning rate that started at 1e-4, was used to optimize the weights of these networks. Cross-entropy was used as the loss function. They compared the different small networks and selected Small-FCN-32 with Imagenet-FCN as their final model.

C. The methods based on multi-model
In general, a single model is not flexible enough to solve complex problems [46], such as the segmentation of lung cancer regions. Furthermore, training multiple models can significantly improve generalization performance compared with using only a single model [46].
The rank #1 team combined DenseNets and a dilation block with U-Net. DenseNet [42] connects each layer to every other layer in a feed-forward fashion (see Fig.6(a)). U-Net has an encoder-decoder structure with skip connections that enable efficient information flow [47]. In the dilation block, different dilation rates could be used with the same convolution kernel size to obtain multi-scale features and more context information. Features from 3×3 kernels with dilation rates of 1, 3, and 5 were concatenated as the input of the convolution. The dense block was constructed from four layers. They trained different models by changing the weights in the loss function and chose the best-performing models to ensemble. This model was more sensitive to tiny lesions and able to capture more context information and multi-scale features. They used four Tesla M60 GPUs and Adam optimization with default parameters β1 = 0.9 and β2 = 0.999 for training, set the initial learning rate to 2 × 10^-4, and then divided the learning rate by 20 every 20 epochs. The loss function was a combination of the dice function and cross-entropy.
The rank #2 team refined labels by removing the background within the tumor area and performed data augmentation at the training step. ResNet50 was used as an encoder network to extract semantic information. DeepLab V3+ was used for upsampling. They also modified the ResNet architecture to adapt it to the task as follows: 1) the down-sampling step in stage4 was eliminated by changing the stride of the first convolution layer from 2 to 1; 2) all convolution layers in stage4 were altered to change the atrous rate from 1 to 2; 3) the global average pooling layer was removed and the DeepLab V3+ decoder was attached; 4) all convolution layers in the DeepLab V3+ decoder used separable convolution. The model is shown in Fig.6(b). In the experiments, they used ImageNet pre-trained weights for the encoder and Adam as the optimizer, with an initial learning rate of 1e-4. The loss function was a combination of the cross-entropy loss and the soft dice loss. They trained CNN models with five-fold cross-validation and ensembled the five models from cross-validation training.
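A combined cross-entropy and soft dice loss, together with a simple ensemble of fold models, can be sketched as below. This is an illustration under stated assumptions rather than any team's implementation: the text does not say how the two terms were weighted or how the five fold models were merged, so the `dice_weight` parameter, the `smooth` constant, and the averaging of predicted probabilities are all hypothetical choices. Predictions and targets are flat lists of per-pixel values.

```python
import math


def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels (probs in [0, 1], targets 0/1)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - soft Dice, computed on probabilities instead of hard labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, dice_weight=0.5):
    """Weighted sum of both terms; the weighting is an assumption."""
    ce = cross_entropy(probs, targets)
    dice = soft_dice_loss(probs, targets)
    return (1.0 - dice_weight) * ce + dice_weight * dice


def ensemble_mean(prob_maps):
    """Average per-pixel probabilities from several models (assumed rule)."""
    return [sum(values) / len(prob_maps) for values in zip(*prob_maps)]
```

The dice term responds to region overlap and is less affected by class imbalance, while the cross-entropy term gives a per-pixel gradient; combining them is a common way to get both properties.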
The rank #3 team proposed a multi-scale U-net fusion model with a CRF [44] (see Fig.6(c)). The framework fused networks in two ways: multi-scale fusion and sub-dataset fusion. In multi-scale fusion, three models were trained on the whole training set at three resolutions (576, 1152, and 2048 pixels). The network structure was a modified U-net called SU-net (shallow U-net), which focused on the local details of tumor cells. They removed one downsampling step and one upsampling step from the original U-net and added a fully connected layer before every remaining downsampling and upsampling step. The SU-net included three downsampling and three upsampling steps, with 24 layers in total. In sub-dataset fusion, the dataset was divided into three sub-datasets by the k-means algorithm [50]. Each sub-dataset used the same image resolution of 512 pixels and was used to train a DU-net (deep U-net). The DU-net added one additional downsampling stage and one additional upsampling stage, with 28 layers in total. In the experiments, they used softmax combined with the cross-entropy loss as the loss function.
The rank #6 team made use of existing classical models, including ResNet101, ResNet152, DenseNet201, DenseNet264, and Mdrn80 (a short version of the network that DeepMind [48]). They used a tile labeling strategy to label the cancer regions (see Fig.7). A tile that overlapped more than 75% with the annotated cancer region was defined as a positive cancer tile, and a tile that did not overlap the cancer region was defined as a negative tile. Other tiles were not used for training. They trained and ensembled 16 models to conduct the experiments. They used three NVIDIA RTX Titan GPUs and Adam optimization with a learning rate of 1e-4 for training. Cross-entropy was used as the loss function.
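The tile labeling rule stated above (more than 75% overlap is positive, no overlap is negative, everything else is discarded) can be written as a short function. This is a minimal sketch; the function name, the nested-list mask representation, and the use of `None` for discarded tiles are illustrative assumptions.

```python
def label_tile(tile_mask, positive_threshold=0.75):
    """Label one tile from its annotation mask.

    tile_mask is a 2D list of 0/1 values (1 = annotated cancer pixel).
    Returns "positive", "negative", or None when the tile is not used
    for training.
    """
    total = sum(len(row) for row in tile_mask)
    cancer = sum(sum(row) for row in tile_mask)
    fraction = cancer / total
    if fraction > positive_threshold:
        return "positive"      # overlaps the cancer region by more than 75%
    if cancer == 0:
        return "negative"      # no overlap with the cancer region
    return None                # ambiguous tile, excluded from training
```

Discarding the partially overlapping tiles keeps boundary regions, where the rough annotation is least reliable, out of the training set.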
The rank #7 team used Co-teaching to train their networks (see Fig.8). Co-teaching [49] aims to clean noisy labels. The proposed method trained two networks simultaneously. In each mini-batch of data, each network viewed its small-loss instances as useful knowledge and taught such instances to its peer network for updating the parameters. Compared with the original Co-teaching algorithm, the main difference was the dynamic drop rate (T), which controlled the number of clean instances selected for training. It was used to prevent the training error of a network from being directly transferred back to itself (see Algorithm 1). They used the fully convolutional DenseNet (FC-DenseNet) 103 network as a backbone. The two FC-DenseNet 103 networks were trained from scratch simultaneously using the same data. In their experiment, the networks were trained using four NVIDIA GeForce GTX 1080 Ti GPUs, and an Adam optimizer was used with an initial learning rate of 1.5e-4.
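The small-loss selection and cross-update at the core of Co-teaching can be sketched as follows. This is a simplified illustration on per-instance loss values, not the team's Algorithm 1: the helper names are hypothetical, and because the team's dynamic drop-rate rule is not given in the text, `drop_rate_schedule` uses a linear ramp purely as an assumed placeholder.

```python
def small_loss_indices(losses, drop_rate):
    """Indices of the instances a network considers clean: the fraction
    (1 - drop_rate) of the mini-batch with the smallest loss."""
    keep = max(1, int(round((1.0 - drop_rate) * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:keep])


def co_teaching_step(losses_a, losses_b, drop_rate):
    """Return (indices used to update A, indices used to update B).

    Each network selects its own small-loss instances, and those
    instances are used to update the peer network, not itself.
    """
    picked_by_a = small_loss_indices(losses_a, drop_rate)
    picked_by_b = small_loss_indices(losses_b, drop_rate)
    return picked_by_b, picked_by_a


def drop_rate_schedule(epoch, max_drop=0.2, warmup_epochs=10):
    """Assumed schedule: drop rate grows linearly, then stays constant."""
    return max_drop * min(1.0, epoch / warmup_epochs)
```

For example, with a drop rate of 0.5 and losses `[0.1, 0.9, 0.2, 0.8]` for network A, instances 0 and 2 are selected by A and passed to network B for its parameter update.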

A. Comparisons of Top 10 Methods
The box plot of the DC on the test set for the top 10 teams is shown in Fig.9. The inter-observer variability between the two pathologists was also assessed using the mean DC, which was 0.8398 (see Fig.9). The mean DC of multi-model methods and single model methods for each test image is shown in Fig.10.
All teams obtained a relatively high DC on test image No. 27. The DC ranged from 0.8653 to 0.9435. Like most of the training data, this sample was well prepared during H&E staining, and the cancer tissue was clearly visible in the image. Typical results from two different teams are shown in Fig.11(c) and (d). The tissue is shown in Fig.11(a) and (b).
In contrast, most of the teams obtained a relatively low DC on test image No. 41 (between 0.2 and 0.5, see Fig.9). Only the rank #1 team obtained a high DC (0.9458). A visual comparison can be found in Fig.12(c) and (d). No. 41 was an example of highly differentiated squamous cell carcinoma, in which the abnormal cells were similar to normal cells. In addition, the color appearance of this slide was not consistent with the other slides because of an off-standard H&E staining process. The models from most of the teams might not generalize well enough to deal with this problem.
To better evaluate the performance of the top 10 methods on the test set, we listed the mean DC on the images of squamous cell carcinoma (SCC), small cell carcinoma (SCLC), and adenocarcinoma (ADC) (see Table II). The results illustrated that the accuracy of the segmentation depends on how the cancer cells grow. There were no other components in the squamous cell carcinoma nest, so the segmentation accuracy was higher than for the other two types. However, small cell carcinoma spread along the sparse fibrous interstitium and gaps, and its cytoplasm was minimal. The adhesion between the cells was poor, so the cells easily loosened, implanted, and transferred. Also, the cells were squeezed and deformed during the biopsy, resulting in unclear boundaries. High performance was therefore hard to achieve for SCLC. ADC grew along the alveolar wall, and the many vascular interstitial components may have affected the segmentation accuracy.
The signed-rank test was used to evaluate differences in DC between multi-model and single model methods (based on Fig.10). The multi-model methods gave significantly better results (p=1.0872e-09) than single model methods. Besides comparing the DC, we also calculated the accuracy, FPR, and FNR of detection for the top 10 methods (see Table III).
Table III shows that the FNR of multi-model methods was generally lower than that of single model methods. Different types of cancer tissue have different appearances, and the current challenge may not provide enough data for all types of cancer. Using a single model might not be sufficient for identifying specific types of cancer. Through model fusion, the strengths of multiple models can be combined to reduce the probability of missed detections.
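The agreement and error measures used in this comparison (DC, FPR, and FNR) can be computed from binary masks as sketched below. This is a minimal illustration, assuming pixel-wise counting on masks stored as nested lists; it is not the challenge's evaluation code, and the function names are hypothetical.

```python
def confusion_counts(gt, res):
    """Count pixel-wise TP, FP, FN, TN between two binary masks.

    gt is the ground-truth mask and res the predicted mask, both given
    as 2D lists of 0/1 values with the same shape.
    """
    tp = fp = fn = tn = 0
    for gt_row, res_row in zip(gt, res):
        for g, r in zip(gt_row, res_row):
            if g and r:
                tp += 1
            elif r:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice_coefficient(gt, res):
    """DC = 2|GT n RES| / (|GT| + |RES|); defined as 1.0 for two empty masks."""
    tp, fp, fn, _ = confusion_counts(gt, res)
    denominator = 2 * tp + fp + fn
    return 1.0 if denominator == 0 else 2 * tp / denominator


def false_positive_rate(gt, res):
    """FP / (FP + TN): fraction of normal pixels flagged as cancer."""
    _, fp, _, tn = confusion_counts(gt, res)
    return fp / (fp + tn) if (fp + tn) else 0.0


def false_negative_rate(gt, res):
    """FN / (FN + TP): fraction of cancer pixels that were missed."""
    tp, _, fn, _ = confusion_counts(gt, res)
    return fn / (fn + tp) if (fn + tp) else 0.0
```

A lower FNR corresponds to fewer missed cancer pixels, which is the quantity discussed above for multi-model methods.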
C. Pre-trained model vs. no pre-trained model
Transfer learning is a commonly used method in the AI community. Using a pre-trained model for fine-tuning can reduce training time and achieve better results in several applications. Three teams used ImageNet pre-trained weights to initialize their models. In the challenge, the methods using pre-trained models did not outperform the methods that learned from scratch. This might be because the
The CAMELYON16, TUPAC, and CAMELYON17 challenges aimed at detecting micro- and macro-metastases in lymph nodes in H&E stained WSIs (CAMELYON16/17) and at assessing tumor proliferation in breast cancer (TUPAC). Using these data for pre-training might give good results. However, none of the teams used a model pre-trained on those data.

D. Label Refine
An experienced pathologist annotated the cancer regions using the ASAP software. We intentionally made relatively rough labels for the training set (e.g., labels that contain background regions, as shown in Fig.13) to evaluate the robustness of the methods in dealing with label noise. All background and unlabeled tissue were kept in the training set as well. This makes tumor tissue and normal areas extremely unbalanced in the training set. Therefore, label refinement is one of the significant issues that should be taken into consideration in this challenge to keep the data balanced.
Several teams used different methods to refine the labels. Three teams (rank #1, #2, #5) used the Otsu algorithm to remove the background area within the labeled tumor tissue and obtain a tighter boundary of the cancer region. The rank #4 team located the tissue region with a bounding box and filtered the blank areas using a threshold. The rank #6 team used a tile labeling strategy and removed background based on the percentage of pixel values above 200 in grayscale space. The rank #7 team used the "Co-teaching" algorithm to refine noisy annotations. The rank #10 team increased the receptive field at different levels and tried to label regions of WSIs rather than find the exact boundaries.
We found that the teams that used the Otsu algorithm to remove background obtained relatively higher DC. Pre-processing to remove label noise (such as the background inside the labeled area) is essential for model training in this challenge, regardless of the network design.
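Otsu-based label refinement of the kind described in this section can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not any team's pipeline: it assumes an 8-bit grayscale image in which background is brighter than stained tissue, computes the threshold over the whole image, and keeps a label only where the pixel is at or below the threshold. The function names are hypothetical.

```python
def otsu_threshold(gray_values):
    """Return the threshold t (0..255) that maximizes the between-class
    variance when pixels are split into values <= t and values > t."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue                      # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break                         # all pixels are in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def refine_label(gray, label):
    """Remove bright background pixels from a rough cancer label.

    gray and label are 2D lists of the same shape; label holds 0/1.
    Assumes background is brighter than tissue in the grayscale image.
    """
    threshold = otsu_threshold([v for row in gray for v in row])
    return [
        [1 if (lab and g <= threshold) else 0 for g, lab in zip(g_row, l_row)]
        for g_row, l_row in zip(gray, label)
    ]
```

Applied to a rough annotation that covers both tissue and blank glass, `refine_label` keeps only the darker tissue pixels inside the annotated region, which tightens the boundary of the cancer label.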

V. CONCLUSION
In this paper, the ACDC@LungHP challenge was summarized. The current stage of the challenge focused on lung cancer segmentation. 200 slides were used for this challenge, and methods from the top 10 teams were selected for comparison. In general, multi-model methods performed better than single model methods. The results showed the potential of using deep learning for accurate lung cancer diagnosis on WSI.
All submitted methods were based on deep learning, but the networks were quite different. Multi-model methods outperformed single model methods (the mean DC was 0.7544±0.0991 for single model methods and 0.7966±0.0898 for multi-model methods). Unlike fine-tuning for other computer vision tasks, the submitted methods did not benefit much from ImageNet pre-trained models. Pre-processing for label noise during the training stage is crucial, since our training data were labeled less accurately than the test set.
In the coming second stage of this challenge, we will focus on classifying the primary lung cancer subtypes (e.g., squamous carcinoma, adenocarcinoma) using WSIs of biopsies. At least 4000 slides collected from multiple medical centers will be released in mid-2020. We believe that the experience from the first stage will greatly help the digital pathology community achieve better performance in the second stage.