A Joint Detection and Recognition Approach to Lung Cancer Diagnosis From CT Images With Label Uncertainty

Automatic lung cancer diagnosis from computed tomography (CT) images requires the detection of nodule location as well as nodule malignancy prediction. This article proposes a joint lung nodule detection and classification network for simultaneous lung nodule detection, segmentation and classification subject to possible label uncertainty in the training set. It operates in an end-to-end manner and provides detection and classification of nodules simultaneously, together with a segmentation of the detected nodules. Both the nodule detection and classification subnetworks of the proposed joint network adopt a 3-D encoder-decoder architecture for better exploration of the 3-D data. Moreover, the classification subnetwork utilizes the features extracted from the detection subnetwork and multiscale nodule-specific features for boosting the classification performance. The former serves as valuable prior information for optimizing the more complicated 3-D classification network directly to better distinguish suspicious nodules from other tissues compared with direct backpropagation from the decoder. Experimental results show that this co-training yields better performance on both tasks. The framework is validated on the LUNA16 and LIDC-IDRI datasets, and a pseudo-label approach is proposed for addressing the label uncertainty problem due to inconsistent annotations/labels. Experimental results show that the proposed nodule detector outperforms the state-of-the-art algorithms and yields performance comparable to state-of-the-art nodule classification algorithms when classification alone is considered. Since our joint detection/recognition approach can directly detect nodules and classify their malignancy instead of performing the tasks separately, it is more practical for automatic cancer and nodule detection.


I. INTRODUCTION
Lung cancer is the primary cause of cancer deaths worldwide. The 2018 Global Cancer Statistics [1] show that approximately 1.8 million deaths and 2.1 million new cancer cases were caused by lung cancer, which ranks first among all cancers. Early diagnosis of a small tumor can prevent metastasis of cancer and substantially improve the prognosis and survival rate [2]. Therefore, the development of an intelligent computer-aided diagnosis system (CADS) can be beneficial to the early treatment of lung cancer.
Volumetric thoracic computed tomography (CT) is the most commonly used imaging technique for lung scans [3], and it can be used to detect lesions in the lung called pulmonary nodules. Such nodules can be benign or malignant, and the detection of the latter is of great importance. One difficulty in detecting the nodules from these CT scans is that the nodules absorb the same level of X-ray as normal body tissues. Thus, there is no apparent intensity discrepancy. The distinctive features of pulmonary nodules are primarily related to shape and location. Figure 1 shows an example 2-D slice from such a volumetric (3-D) CT scan. It can be seen from Figure 1 (c) that the tiny pulmonary nodule has no distinctive feature compared with vessels in the 2-D image. However, the vessels have a continuous structure, while nodules are isolated. This motivates us to develop a network for detecting nodules and malignancy using 3-D volumetric data instead of fusing results from multiple 2-D slices. On the other hand, humans are more proficient in extracting information from 2-D images than from 3-D volumetric images. Therefore, a thorough analysis of CT scans by clinicians can be expected to take much time, increasing the cost of such a check. Compared with checking by doctors, a CADS has the potential advantage of taking the three-dimensional image data into account and quickly outputting potential nodule candidates for reference or confirmation. More importantly, a CADS can even learn and accumulate the experience of radiologists via continuous training. Hence, it may provide very stable predictions that are comparable to, or even better than, those of a single experienced radiologist [4]. It is therefore helpful to develop an efficient CADS for the diagnosis of lung cancer from CT images.
The associate editor coordinating the review of this manuscript and approving it for publication was Essam A. Rashed.
In the literature, such automatic diagnosis usually consists of two steps: nodule detection and nodule classification [5]. With the success of deep learning in natural image processing, most recent studies on these two tasks are based on the convolutional neural network (CNN) [6]-[9]. Methods for nodule detection usually rely on networks for object detection problems, including Faster R-CNN [10] and YOLO [11], which output region proposals of the target objects. The nodule classification problem, on the other hand, is usually regarded as a 3-D image recognition problem using the data at the detected regions as inputs. 3-D extensions of well-known image classification networks such as ResNet [12] are widely used.
Despite these advances, a fully automatic CADS for lung nodule detection and cancer classification still presents several major challenges. First of all, separating the detection and classification tasks usually reduces the overall classification rate, as a considerable number of detected nodules are, in fact, false positives. Introducing a simple classification stage to refine the detected nodules after the detection task can considerably reduce the false-positive results [13], which would otherwise mislead the subsequent classification task. Therefore, it is desirable to develop a methodology for joint nodule detection and malignancy classification.
Secondly, most pulmonary nodules are small and isolated in the raw CT scans. The shape of the nodule thus serves as an informative feature for distinguishing it from other body tissues. Therefore, it is desirable to exploit the 3D nature of the data for better classification. However, due to significantly increased parameters of 3D neural networks, most conventional approaches are still based on multiple 2D networks [14]- [16]. The primary obstacle of applying the 3-D model in nodule classification is the overfitting problem arising from the increased number of parameters and the limited number of training samples. For instance, while ImageNet [17] uses millions of images for training, there are only 1018 scans in the LIDC-IDRI [18]- [20] lung cancer CT dataset.
Finally, in some cases, the labels from the radiologists may be inconsistent or missing (say, a nodule may be labelled by only one or two radiologists rather than by all of them). This arises because labeling nodules as benign or malignant using CT images depends mostly on the experience of radiologists and the limitations in the data collection process. Unless a single consistent label can be agreed on (as in some datasets), such uncertain labels, which we shall also refer to as marginal labels, will arise for some nodules. In fact, they are commonly found in the LIDC-IDRI dataset. If the network is forced to fit these marginal samples, the performance usually deteriorates, as reported in [15], [16]. This problem is usually referred to as the label uncertainty problem. Though a precise probabilistic model to describe such variations can be difficult to obtain, it is desirable that their adverse effect on the overall performance of the network be mitigated. All these motivate us to develop a joint detection and recognition approach to lung cancer diagnosis and segmentation from CT images with possibly marginal or uncertain labels.
An important advantage of the proposed joint detection/recognition approach is that it can directly detect nodules and classify their malignancy instead of performing the two tasks separately. Therefore, our approach is more practical, as it can be applied in an end-to-end manner to automatic cancer and nodule detection. Moreover, the proposed joint nodule segmentation/recognition (JNSC) network is capable of exploring the semantic segmentation information [21] to yield a more detailed segmentation of the nodules and their malignancy instead of the conventional simple region proposal. It is known that nodule malignancy is highly related to its morphology. The segmentation information offered by the JNSC network can provide a valuable morphology description of the detected nodules, which can be useful in differentiating malignant tumors from scars or other complications.
FIGURE 2. System overview of the proposed framework. The detection phase outputs multiple potential nodules. The recognition phase uses features of the detection phase to build an additional classifier to discriminate them into three classes: benign, cancer and non-nodule. Only the benign and cancer nodules are then evaluated for the nodule detection task and the classification task. Importantly, in the classification task, the undetected nodules are directly labeled as benign to report the result. The architecture of the joint nodule detection and classification network is shown in Figure 3.
From the neural network training point of view, the encoded features and initial segmentation obtained in our nodule detection network serve as valuable prior information for the subsequent classification process. This not only helps the classification network to extract more discriminative features but also makes it possible to train our 3-D neural network for classification, and to further refine the segmentation map, without suffering from excessive overfitting. Figure 2 shows the system overview of the proposed network, where the input CT image is passed through the proposed joint nodule detection and recognition network to provide a segmentation map of the nodule as well as its malignancy prediction. Our JNSC network is a 3-D network that adopts the encoder-decoder architecture with multiscale feature extraction, which has the advantage of encoding the desired location information as well as the shape information of the nodules. Moreover, instead of simply cascading the detection and classification networks, a path for extracting discriminative features from the output of the encoder of the nodule detection module to the classification network is proposed. These features are jointly trained from the two networks and provide valuable additional information for improving the classification performance.
Thanks to this additional information provided by the nodule detection network, the proposed 3D JNSC can be trained from scratch despite the limited number of training samples. Moreover, the encoder in our JNSC is trained on the whole CT image, which can also distinguish other body tissues for nodule detection. Experiment results to be presented later show that the joint detection and classification framework is superior to the sole classification approach with an improvement of 1.25% in terms of accuracy. This is in accordance with previous studies in scene geometry and semantics research [22], [23] where it has been demonstrated that multi-task learning can effectively boost the overall performance.
Finally, to address the label uncertainty problem, we treat the problem as a training problem with label noise [24], where the noisy label will be corrected during the training phase. In the lung nodule diagnosis problem, samples with inconsistent or missing annotations are commonly encountered and they may be less reliably annotated. Here, we introduce the concept of pseudo-label to alleviate the adverse effect of these possibly less reliable annotations. More precisely, the unreliable annotations are detected and their labels are re-estimated as "pseudo-labels" by minimizing a variant of the cross-entropy loss function, which is capable of seeking a better tradeoff between network prediction and fitting errors. While the true model of these less reliable labels is difficult to obtain in practice, the use of the more robust cross-entropy loss function effectively prevents the network from overfitting those less reliable marginal samples. Experimental results show that training with the proposed pseudo-labels can improve the accuracy by 2.44% compared with the hard-label assignment and by 1.31% compared with the soft-label assignment.
The proposed approach has been evaluated and compared with state-of-the-art algorithms on the publicly available LIDC-IDRI dataset. In particular, the nodule detection phase is validated on the LUNA16 [13] competition dataset, which is a subset of LIDC-IDRI. The results show that our proposed nodule detection network outperforms state-of-the-art algorithms while achieving results comparable with state-of-the-art nodule classification algorithms. Since our joint detection/recognition approach can directly detect nodules and classify their malignancy in an end-to-end manner instead of performing the two tasks separately, it is more practical for automatic cancer and nodule detection.
Moreover, the segmentation map of the nodules and their malignancy are available from the network output, which provides valuable information on the morphology of the tumor.
The rest of the paper is organized as follows. Section II briefly reviews the literature of related works. The information of the dataset under study is given in Section III. The proposed network architecture, feature extraction, and joint optimization methods are presented in Section IV. The experimental results, analysis, and comparisons are presented in Section V. Section VI summarizes the major findings/contributions and possible limitations of the work. Finally, conclusions are drawn in Section VII.

II. RELATED WORKS

A. NODULE DETECTION
Nodule detection from CT images usually involves two steps: i) nodule candidate proposal and ii) false-positive reduction [30]. The goal of nodule detection is to identify potential nodule candidates from the remaining lung tissues, whereas the false-positive reduction aims to suppress potential false positives due to interference from tissues such as blood vessels. TABLE 1 summarizes some recent works on nodule detection and their performance.
Traditional detection methods usually rely on hand-crafted features and classic image segmentation methods [31]. Recently, the more extensive LIDC-IDRI dataset has been made publicly available. Hence, more sophisticated deep learning-based methods can be applied, and significantly better performance over traditional approaches on the larger dataset has been demonstrated [13], [32].
In Ding et al. [33], a 2-D region proposal network, which is transferred from the general image detection framework [10], was proposed and an impressive sensitivity of 94.6% under 15 candidates per scan is achieved. Though a 2-D network generally has fewer parameters than a 3-D network, it cannot fully utilize the 3-D shape information simultaneously. Therefore, more recent studies [9], [34], [35] tend to adopt 3-D CNN to solve the problem directly. For instance, Khosravan and Bagci [35] propose a 3-D densely connected region proposal network to acquire the region proposals. This densely connected network connects every two layers in the network, while the typical network only connects two successive layers. Therefore, it usually improves the overall performance over normal layer-by-layer connected network, while requiring much fewer parameters than many conventional 3-D networks. Besides the region proposal network, Pezeshk et al. [8] proposed to segment the nodules from the CT scans directly. Similar pixel-wise segmentation has been widely applied to biomedical-related applications, in which the 3-D U-net [36] and V-Net [37] are prevalent network architectures. While segmentation can provide more accurate information than detection only, it is also more involved as more detailed annotation will be required. Since LIDC-IDRI has released the pixel-wise segmentation label recently, training deep networks for nodule segmentation is now feasible, and it can potentially provide more information to the joint detection (segmentation) and classification of lung nodules.
The false-positive reduction is another essential step after nodule detection to eliminate false-positive candidates, and 3-D CNNs are usually preferred [4], [8], [32], [33] because of their excellent performance. The network usually undertakes a classical classification task, i.e., distinguishing nodules from non-nodules. Furthermore, there is no need to develop an independent network, as features can simply be transferred from the detection stage for performing classification. In Qin et al. [4], the feature from the nodule detection network is directly cropped. As the LUNA16 competition provides an additional false-positive reduction (FPR) task which labels many possible false-positive nodules, better performance is achieved if an FPR network is trained to refine the detection result. Moreover, it is observed that even if the false-positive samples in the detection task are collected without additional labels from the FPR task, training a dedicated FPR network can also improve the result [25], [35].

B. NODULE CLASSIFICATION
Currently, nodule classification is performed either on the patient level or on the nodule level. On the patient level, only a binary label for each patient is available, regardless of the number of nodules of the patient. Liao et al. [34] proposed an end-to-end CADS and won the competition for patient-level lung cancer classification. The nodule-level evaluation is popular because it has an accurate label for each nodule and avoids the variance arising from the multiple instance problem. Indeed, the frameworks of both levels are quite similar, except for the training strategy.
Some classical image processing descriptors, including Local Binary Pattern (LBP) [38], Histogram of Oriented Gradients (HOG) [39], and Fourier shape descriptor [40], were first exploited in nodule classification. Nevertheless, deep learning-based approaches usually outperform these hand-crafted features [15]. Zhao et al. [41] propose a hybrid approach using the well-known AlexNet and LeNet to classify the nodule slice; its performance is superior to single-model methods. Moreover, in order to alleviate the overfitting problem, the 3-D nodules can be decomposed into multiple views [32], so that the 3-D network is simplified to multiple 2-D networks. Recently, Xie et al. [16] adopted, in total, 27 ResNets for classifying the 3-D nodules from 9 viewpoints. Similarly, Hussein et al. [42] adopt a slice-by-slice approach by fusing the results from all the slices. Although many studies [9], [15] have focused on 3-D architectures, the performance is usually inferior to these 2-D ensemble methods.
Liao et al. [34] were the first to incorporate nodule classification into the nodule detection network, training the detection and classification networks alternately. Zhang et al. [43] fine-tune the classification network from the detection network and show that classification performance can benefit from information of the detection stage. Moreover, Xie et al. [44] show that joint training can boost segmentation and classification of skin lesions. While the choice of 2-D or 3-D networks in nodule classification remains controversial, we shall focus on 3-D networks, as they are more promising in exploring the morphology information of pulmonary nodules. Notably, we extend the co-training method in [34] for training our 3-D network, to be described in Section IV.

C. LABEL NOISE
Estimating the malignancy level of nodules from morphology depends mainly on the experience of the clinicians, and there are inevitably variations and perhaps errors for difficult cases. Therefore, labels may not always be consistent, especially when only a few annotations are available. Although up to 4 radiologists label the data in the LIDC-IDRI database [18]-[20], many samples are labeled by only one radiologist. Such uncertainty in the labels is usually referred to as label noise. Frenay and Verleysen [45] give a comprehensive review on tackling label noise. Manwani and Sastry [46] studied the noise tolerance of various loss functions and found that the 0-1 loss has the best noise tolerance. Zhang et al. [47] developed a probabilistic model to deal with potential misclassification, where the noisy label is used as prior information for updating the posterior probability. These algorithms mainly focus on loss functions and label correction. Other improvements proposed include data cleansing [48], [49] and model-based methods [50], [51].
Since training a neural network is time-consuming, it is hard to train a neural network several times until the noise correction converges. Patrini et al. [52] recently proposed a two-stage training method which adapts the loss function at the first stage and re-trains the network at the second stage. Adjusting the loss function is preferred on the neural network-based model because it can be easily integrated into the current framework if the loss function is differentiable.

III. DATASET
In this study, the LIDC-IDRI [18]- [20] dataset from The Cancer Imaging Archive (TCIA) is used to evaluate the performance of our proposed network. There are 1018 scans obtained from seven institutions in the dataset, and four experienced thoracic radiologists annotate each scan with detailed nodule location as well as malignancy level. However, the radiologists sometimes cannot reach a consensus for some lesions, and therefore, some nodules are annotated by one to three radiologists.
The diameter of nodules ranges from 3 mm to 30 mm, and the malignancy level is evaluated on a 5-point scale where 1 represents 'Highly unlikely', 3 represents 'Indeterminate' and 5 represents 'Highly suspicious'. Following the settings in previous studies [15], [16], [53], we calculate the malignancy score (MS) by taking the median of the malignancy levels from different annotations and label the nodules whose MS<3 as benign, MS=3 as uncertain, and MS>3 as malignant. Note that uncertain nodules are excluded in the testing phase. Moreover, we observe that a considerable number of nodules are marginally classified as benign or malignant, and some nodules are only annotated by one radiologist, which may introduce label uncertainty. Thus, we further categorize the benign and malignant nodules as certain and marginal nodules. Marginal nodules are defined as the nodules which are labelled by only one or two radiologists and whose median malignancy levels are between 2 and 4, inclusive. We list the precise number of nodules in each class in TABLE 2. For nodule detection, we adopt the Lung Nodule Analysis 2016 (LUNA16) [13] dataset to evaluate the performance of the nodule detection algorithms. The LUNA16 dataset is a subset of the LIDC-IDRI dataset. To better evaluate the nodule detection algorithms, the scans with a slice thickness greater than 2.5 mm are excluded from the LIDC-IDRI dataset. LUNA16 only consists of nodules whose diameters are larger than 3 mm and which are annotated by at least three radiologists. Therefore, there are in total 888 scans with 1186 nodules in the challenge. Due to the large image size, most works tend to train the detection and classification networks on small voxel cubes, such as 64 × 64 × 64, randomly sampled from the entire image. Afterward, to obtain the final detection/classification for a particular subject, one needs to apply the network to the many sub-voxels of the entire image and aggregate the respective outputs.
For instance, in the LUNA challenge, the results obtained by applying the detection to 64 × 64 × 64 voxels with a shift of multiples of 32 voxels in any of the three directions are averaged to form the performance metric. In this study, we use the official 10-fold split in LUNA16 to report the detection performance, and we randomly split the scans in LIDC-IDRI into 10 folds five times to report the nodule classification performance.
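The labelling rule described in this section (median malignancy score, then the certain/marginal split) can be sketched as follows. This is only an illustration: the function name `label_nodule` and the returned tuple are hypothetical, and taking the mean of the two middle ratings when the number of annotations is even is an assumption, since the text does not say how ties in the median are resolved.

```python
from statistics import median


def label_nodule(ratings):
    """Return (MS, label, is_marginal) for one nodule.

    ratings: malignancy levels on the 1-5 scale, one per annotating radiologist.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    # Malignancy score (MS): median over the available annotations.
    # Assumption: for an even count, statistics.median averages the two
    # middle values (e.g. [2, 3] -> 2.5).
    ms = median(ratings)
    if ms < 3:
        label = "benign"
    elif ms > 3:
        label = "malignant"
    else:
        label = "uncertain"
    # Marginal: a benign or malignant nodule labelled by only one or two
    # radiologists whose median level lies in [2, 4].
    is_marginal = label != "uncertain" and len(ratings) <= 2 and 2 <= ms <= 4
    return ms, label, is_marginal
```

For example, a nodule rated `[4]` by a single radiologist is malignant but marginal, whereas `[5, 5]` is malignant and certain because its median lies outside the 2 to 4 range.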

IV. PROPOSED METHOD
We now present our joint nodule segmentation and recognition network (JNSC) and its construction, which consists of the following steps: 1) data pre-processing and data augmentation (DPA), 2) multiscale voxel-based feature extraction and nodule size estimation (MVFNSE), 3) pseudo-label assignment for marginal samples (PSA), and 4) jointly-optimized nodule segmentation and classification (JNSC). In the DPA step, the training samples are generated from the CT scan data after a standard processing procedure. Moreover, additional training samples are generated using data augmentation to improve the robustness of the neural networks against variations such as rotation of the input. The input voxel is assumed to be a voxel cube with size 64 × 64 × 64. Next, we shall introduce the network architecture, and then the details of the above four steps will be presented.
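Because the network consumes 64 × 64 × 64 cubes, a whole scan has to be tiled when the network is applied to it; Section III mentions a shift of multiples of 32 voxels with the outputs averaged. The sketch below shows one way to do this. It is an illustration only: `window_starts`, `aggregate_windows` and the `predict` callable are hypothetical names, averaging per-voxel outputs over overlapping cubes is an assumption about what "averaged" means, and the extra window placed flush with the border is also an assumption.

```python
import numpy as np


def window_starts(length, size=64, stride=32):
    """Start indices of cubes along one axis (stride of 32 voxels)."""
    if length <= size:
        return [0]
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        # Assumption: add a final window flush with the border so that
        # every voxel is covered at least once.
        starts.append(length - size)
    return starts


def aggregate_windows(volume, predict, size=64, stride=32):
    """Average the per-voxel outputs of `predict` over overlapping cubes.

    predict: hypothetical callable mapping a cube to an array of the
    same shape (e.g. a nodule probability map).
    """
    acc = np.zeros(volume.shape, dtype=np.float64)
    count = np.zeros(volume.shape, dtype=np.float64)
    for z in window_starts(volume.shape[0], size, stride):
        for y in window_starts(volume.shape[1], size, stride):
            for x in window_starts(volume.shape[2], size, stride):
                sl = (slice(z, z + size), slice(y, y + size), slice(x, x + size))
                acc[sl] += predict(volume[sl])
                count[sl] += 1.0
    # Every voxel is covered at least once, so count is never zero.
    return acc / count
```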

A. NETWORK ARCHITECTURE
The proposed joint nodule segmentation and recognition network (JNSC) is shown in Figure 3. It adopts the V-Net [37] as the backbone as the V-Net adopts a multiscale encoder-decoder architecture, and it can perform pixelwise segmentation. The upper and lower branches form the encoder and decoder in a V-Net architecture where the input voxels are segmented to yield the segmented output at the left lower corner. The encoder and decoder are arranged in a multiscale manner where features are extracted at each scale via the voxel-based feature extraction layer (see also Figure 4). The multiscale features and the nodule size are also estimated in the MVFNSE step which are then concatenated (denoted by the block CC in Figure 3) for predicting whether the current block is a nodule, and whether they are benign or malignant (the middle path in Figure 3).
In the MVFNSE step of the nodule detection subnetwork, the possible locations of the nodule at each scale are estimated from the initial segmentation outputs to form the nodule location map (NLM), which consists of bounding boxes containing potential nodules (as shown in Figure 4).
FIGURE 4. [...] volumetric image is shown here for the sake of presentation. The nodule-specific region (NSR) is obtained by applying a threshold to the nodule segmentation map. The nodule location map (NLM) is generated as 3-D bounding boxes encapsulating the NSR, which is introduced to tolerate the irregular shape of the potential nodules. Nodule-specific features are extracted at the location of the NLM and are fed into the voxel-based feature extraction layer. Finally, the flattened feature vectors from multiple scales as defined in Figure 3 are concatenated (block CC in Figure 3) for classification using the softmax criterion. Note that in the above example, it is assumed that three possible nodules are detected, each with a multiscale feature vector. Each of these candidate nodules will pass through the linear layer and the softmax unit, as shown in the middle of Figure 3, to yield the classification output for all these nodule candidates. The number of nodules detected (i.e., the number of NLMs) can vary from one input voxel cube to another.
For classification, the multiscale features of each nodule candidate in each NLM and the nodule size will be fed to the linear layer and softmax layer for classification, as shown in the middle path of Figure 3. The PSA step will adjust the label for the marginal nodules to avoid possible overfitting of the marginal samples. The feature vector, together with the segmentation outputs, enables us to jointly optimize the segmentation and classification in a single network at the JNSC step. Training and other details of the above operations will now be discussed.
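The classification head described above (a linear layer followed by a softmax, applied to a variable number of candidates per input cube) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the function name, the class ordering in `CLASSES`, and the weight and bias shapes are hypothetical, and the feature dimension is left free because the text does not give channel counts.

```python
import numpy as np

# Hypothetical class ordering; the three classes follow the Figure 2 caption.
CLASSES = ("benign", "malignant", "non-nodule")


def classify_candidates(features, weight, bias):
    """Linear layer + softmax over all candidates found in one input cube.

    features: (n_candidates, d) array, one multiscale feature vector per
              candidate; n_candidates may be zero.
    weight:   (d, 3) array and bias: (3,) array (hypothetical parameters).
    """
    logits = features @ weight + bias
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    labels = [CLASSES[i] for i in probs.argmax(axis=1)]
    return probs, labels
```

Treating the candidates as rows of one matrix lets the same parameters score zero, one, or many candidates without any special casing.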

B. DATA PRE-PROCESSING AND AUGMENTATION (DPA)
The LIDC-IDRI dataset consists of CT scans from seven institutions. Therefore, the pixel spacing and slice thickness may vary between scans. To reduce the variation from inconsistent resolution, we simply normalize all scans to a resolution of 1.0 mm × 1.0 mm × 1.0 mm by spline interpolation. Besides, the raw CT images are clipped to between −1000 and 400 Hounsfield units (HU), which can reduce the effect of air and bone in the images. The last step is normalizing the CT images to zero mean and unit variance, as commonly done in training neural networks. In each epoch, we extract two voxels from each scan. One of the voxels contains a nodule, and if a scan has multiple nodules, we randomly pick one of the nodules every time. The other voxel is extracted from a normal region, which does not include any nodule. The motivation for sampling voxels from nodules is to increase the occurrence of nodules in the training data, while sampling other positions encourages the network to better distinguish other body tissues.
It should be noted that there may be more than one nodule candidate (or none) detected inside each voxel volume, each with its own multiscale feature vector, and each of these feature vectors will pass through the linear layer and the softmax unit to yield the classification output for all the nodule candidates inside the voxel volume (please refer to Figure 4 for more details).
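The intensity pre-processing described above can be sketched with NumPy as follows. The helper names are hypothetical, and computing the mean and variance per volume is an assumption, since the text does not state whether per-scan or dataset-wide statistics are used. The resampling itself is not performed here; the sketch only computes the per-axis zoom factors that a spline-interpolation routine would need.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 400.0  # clipping range stated in the text


def zoom_factors(spacing_mm, target_mm=1.0):
    """Per-axis scale factors that bring a scan to 1.0 mm isotropic spacing.

    These would be passed to a spline-interpolation routine (for example,
    scipy.ndimage.zoom with order=3 is one option; not used here).
    """
    return tuple(s / target_mm for s in spacing_mm)


def clip_and_standardize(volume_hu):
    """Clip to [-1000, 400] HU, then scale to zero mean and unit variance.

    Assumption: the statistics are computed per volume.
    """
    clipped = np.clip(volume_hu.astype(np.float32), HU_MIN, HU_MAX)
    std = clipped.std()
    if std == 0:
        # Constant volume: avoid division by zero.
        return np.zeros_like(clipped)
    return (clipped - clipped.mean()) / std
```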
Different from many studies [15], [16], which mainly consider nodule classification, we do not require the nodules to be located in the center of the voxels. To reduce overfitting and improve the generalization ability of the network, we further adopt data augmentation by randomly rotating the extracted voxels. Each time, the rotation is done in one of the x-y, x-z, and y-z planes with equal probability. To avoid the blank regions caused by rotation, we only rotate the image by one of the angles 0°, 90°, 180°, and 270°, chosen with equal probability.
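The rotation augmentation above maps directly onto `numpy.rot90`, which rotates by multiples of 90° in a chosen plane and therefore never introduces blank regions. In this sketch the function name and the mapping of array axes to x, y and z are hypothetical, and applying the same rotation to the segmentation mask is an assumption made so that the voxel-wise labels stay aligned with the image.

```python
import numpy as np

# The three rotation planes, as pairs of array axes (axis naming is an assumption).
PLANES = ((0, 1), (0, 2), (1, 2))


def random_rotate(cube, mask, rng):
    """Rotate a cube and its segmentation mask by the same random rotation.

    The plane is drawn uniformly from the three planes and the angle
    uniformly from {0, 90, 180, 270} degrees.
    """
    axes = PLANES[int(rng.integers(len(PLANES)))]
    k = int(rng.integers(4))  # number of 90-degree turns
    return np.rot90(cube, k=k, axes=axes), np.rot90(mask, k=k, axes=axes)
```

A generator such as `np.random.default_rng(seed)` can be passed as `rng` to make the augmentation reproducible.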

C. MULTISCALE VOXEL-BASED FEATURE EXTRACTION AND NODULE SIZE ESTIMATION (MVFNSE)
As mentioned, we choose the V-Net [37] as the backbone of our JNSC because it adopts a multiscale encoder-decoder architecture and can perform pixel-wise segmentation. The multiscale voxel-based feature extraction has three steps: i) generation of the nodule location map (NLM), ii) extraction of the multiscale features, and iii) concatenation of the nodule size information to the feature vector. We summarize these procedures in Figure 4.
FIGURE 5. [...] The yellow contours denote the ground truth nodule boundary annotated by at least three radiologists. The final segmentation result is a binary map obtained by thresholding the network output, which has a value from 0 to 1. Thus, the segmentation map will depend on the applied threshold. In the above illustration, a conservative threshold of 0.4 is used. For the best performance, it can be further optimized via cross validation. It should be noted that the CT images (nodules) are 3-D volumetric images and the 2-D images (nodules) shown above are their x-y, y-z and x-z cross sections.
To generate the nodule location map, the network is trained on the pixel-wise segmentation from radiologists. Therefore, we can acquire the corresponding nodule probability map from the output of the detection network. The nodule probability map contains the probability of each pixel being classified as a nodule. Note that the dimension of the map is identical to that of the input voxel, which is 64 × 64 × 64. Afterward, we empirically use a detection threshold of 0.4 (40%) to include more suspicious regions for detection. The probability map is then transformed into a binary segmentation map, where 1 represents nodules and 0 represents non-nodules. Because the shape of the detected nodule is irregular at this stage, as shown in Figure 5, we propose to draw a bounding box encapsulating each nodule to tolerate the irregular shape and reduce the variance in extracting nodule-specific features. The region inside the box is then called a nodule-specific region (NSR). The NSR is found based on its voxel connectivity in the binary map [54]. It should be noted that the segmentation results at this stage may contain errors; for example, a single voxel or a small patch of voxels may be detected, which is likely to be a false positive. Therefore, some of the extracted NSRs may be false positives. Fortunately, these false positives are not numerous, and their labels are available. Therefore, they are also extracted and labelled as non-nodule, as opposed to benign and malignant nodules, and this preliminary decision information can then be corrected at the classification stage. To this end, we pre-train the detection network at initialization so as to simplify its joint training with the classification network.
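The thresholding and bounding-box step described above can be sketched as follows. This is an illustration rather than the authors' implementation: the function name is hypothetical, the use of 6-connectivity and of `>=` at the 0.4 threshold are assumptions, and a plain breadth-first search stands in for the connected-component method cited as [54].

```python
from collections import deque

import numpy as np

# Assumption: 6-connectivity (face neighbours only).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def nodule_boxes(prob_map, threshold=0.4):
    """Binarize a probability map and return one 3-D bounding box per component.

    Returns the binary map and a list of boxes, each a tuple of three slices.
    """
    binary = prob_map >= threshold
    visited = np.zeros(binary.shape, dtype=bool)
    boxes = []
    for seed in zip(*np.nonzero(binary)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue = deque([seed])
        lo, hi = list(seed), list(seed)
        while queue:  # breadth-first search over one connected component
            voxel = queue.popleft()
            for axis in range(3):
                lo[axis] = min(lo[axis], voxel[axis])
                hi[axis] = max(hi[axis], voxel[axis])
            for step in NEIGHBOURS:
                nxt = tuple(v + s for v, s in zip(voxel, step))
                inside = all(0 <= n < d for n, d in zip(nxt, binary.shape))
                if inside and binary[nxt] and not visited[nxt]:
                    visited[nxt] = True
                    queue.append(nxt)
        boxes.append(tuple(slice(int(a), int(b) + 1) for a, b in zip(lo, hi)))
    return binary, boxes
```

Each returned box can be used directly to crop an array, for example `prob_map[boxes[0]]`.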
Compared with pixel-wise NSR, using the NSR for feature extraction the following benefits. Firstly, accurate morphology information is prone to segmentation errors. Secondly, it allows information/features surrounding the nodules to be 10 Note, the bounding box is used for feature extraction. The final segmentation output will be derived from these features as shown in the lower branch of the joint network in Figure 3  extracted for performing the classification at the final stage. Finally, even if the segmentation is extremely accurate, it may be smeared by the subsequent convolution layers. Therefore, more emphasis should be paid on the features of the nodule voxels as well as its neighborhood. Hence, the final nodule location map (NLM) is then generated based on NSR to tolerate the mentioned effect.
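The NLM generation described above (thresholding, grouping voxels by connectivity, and drawing a box around each group) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the volume is a plain nested list indexed `[z][y][x]`, and both the use of 6-connectivity and the `>=` comparison at the 0.4 threshold are assumptions, since the text does not specify them.

```python
from collections import deque


def threshold_map(prob, thr=0.4):
    """Binarize a 3-D probability map stored as nested lists [z][y][x]."""
    return [[[1 if v >= thr else 0 for v in row] for row in plane] for plane in prob]


def connected_regions(binary):
    """Group foreground voxels into 6-connected regions (assumed connectivity)."""
    nz, ny, nx = len(binary), len(binary[0]), len(binary[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, regions = set(), []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if binary[z][y][x] == 0 or (z, y, x) in seen:
                    continue
                # Breadth-first search over the face-adjacent foreground voxels.
                queue, region = deque([(z, y, x)]), []
                seen.add((z, y, x))
                while queue:
                    cz, cy, cx = queue.popleft()
                    region.append((cz, cy, cx))
                    for dz, dy, dx in steps:
                        n = (cz + dz, cy + dy, cx + dx)
                        inside = 0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                        if inside and binary[n[0]][n[1]][n[2]] == 1 and n not in seen:
                            seen.add(n)
                            queue.append(n)
                regions.append(region)
    return regions


def bounding_box(region):
    """Inclusive (min corner, max corner) of a region: the candidate NSR."""
    zs, ys, xs = zip(*region)
    return (min(zs), min(ys), min(xs)), (max(zs), max(ys), max(xs))
```

Every voxel inside a returned box would belong to the NSR, including background voxels around an irregular nodule, which is what lets the features tolerate small segmentation errors.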
For the extraction of the multiscale features, the size of the input voxel volume is 64 × 64 × 64, which is down-sampled 4 times in the encoder network. Therefore, we have feature maps of size 64, 32, 16, 8 and 4, as shown in Figure 3. The NLM is also down-sampled to the size of each feature map, as shown in Figure 4. For each feature map, we crop the features from the corresponding location in the NLM. Following the feature cropping, we further add 1 × 1 convolution layers to aggregate inter-channel information. An adaptive max-pooling operation is then performed on the features, where the features from the first two voxel-based feature extraction layers $V_1$, $V_2$ are pooled into a uniform spatial size of 2, while those at the third to fifth layers $V_3$, $V_4$, $V_5$ are pooled into a spatial size of 1. Because of the adaptive max-pooling layer, the length of the final feature vector is invariant to the size of the NSR, and the pooled features can be flattened and concatenated across different scales.
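To see why the adaptive max-pooling makes the feature length independent of the NSR size, consider the following minimal sketch. It is an illustration under stated assumptions: each level is reduced to a single channel (the real channel counts are given in TABLE 3 and are not reproduced here), the bin boundaries follow the common floor/ceil rule for adaptive pooling, and the names `adaptive_max_pool3d` and `multiscale_vector` are hypothetical.

```python
def adaptive_max_pool3d(vol, out):
    """Max-pool a [z][y][x] nested-list volume into out x out x out bins."""
    dims = (len(vol), len(vol[0]), len(vol[0][0]))

    def bins(n):
        # Bin i covers indices floor(i*n/out) .. ceil((i+1)*n/out) - 1.
        return [(i * n // out, -(-(i + 1) * n // out)) for i in range(out)]

    zb, yb, xb = (bins(n) for n in dims)
    return [[[max(vol[z][y][x]
                  for z in range(z0, z1)
                  for y in range(y0, y1)
                  for x in range(x0, x1))
              for (x0, x1) in xb]
             for (y0, y1) in yb]
            for (z0, z1) in zb]


def flatten(vol):
    """Flatten a nested-list volume into a 1-D list."""
    return [v for plane in vol for row in plane for v in row]


def multiscale_vector(crops):
    """crops: one cropped single-channel feature volume per level V1..V5."""
    vec = []
    for level, crop in enumerate(crops, start=1):
        out = 2 if level <= 2 else 1  # V1, V2 -> 2x2x2; V3..V5 -> 1x1x1
        vec.extend(flatten(adaptive_max_pool3d(crop, out)))
    return vec
```

With one channel per level the vector always has 8 + 8 + 1 + 1 + 1 = 19 entries, whatever the crop sizes, so vectors from differently sized NSRs can be fed to the same linear layer.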
The last step of the MVFNSE is to concatenate the nodule size information to the feature vector. It is widely recognized that nodule size is highly related to the malignancy level, and the probability of malignancy usually increases with nodule size. The pooling operation in step 2 is invariant to nodule size, and therefore, we can directly add the information to the concatenated features. The nodule size is estimated as: where V is the estimated nodule size and P is the number of pixels for the given nodule in the NSR. The nodule diameters vary from 3 mm to 30 mm and the resolution of the segmentation result is 1.0 mm × 1.0 mm × 1.0 mm. Since large values in the features may dominate the classification performance, the estimated size is scaled by a factor of 0.1, which is determined empirically. It was found that the performance is relatively insensitive to this choice. The final feature used for classification consists of the concatenated multiscale features from step 2 and one additional dimension holding the estimated nodule size. Each vector passes through the linear layer and the softmax unit to yield the classification output for all the nodule candidates detected inside the voxel volume (please refer to Figure 4 for more details).

D. PSEUDO-LABEL ASSIGNMENT FOR MARGINAL SAMPLES (PLA)
In nodule classification, some nodules are labelled by only one or two radiologists. Moreover, radiologists are likely to be inconsistent on the malignancy level, especially for nodules with a marginal level of malignancy. To address this issue in training our network, we propose a pseudo-label approach for those marginal nodules to alleviate the effect caused by label uncertainty. More precisely, the cross-entropy loss on which our training is based is given by: where $T_i$ and $p_i$ are the malignancy score and the probability predicted by the network, respectively. Here, the labels "0" and "1" represent the benign and malignant nodules, respectively. However, due to label uncertainty, $T_i$ is usually not chosen as either 0 or 1 and the following soft-label is preferred: where $M_i$ is the MS for the i-th nodule.
Here, we re-estimate the underlying label, called the pseudo-label $\tilde{p}_i$, for addressing those marginal nodule samples, and continuously adapt it based on the network prediction obtained as well as the MS. Specifically, by initializing the pseudo-label with the soft-label in (3), the resultant loss function using the pseudo-label is given by: where α is a regularization parameter that balances the influence of the MS and the network prediction on the pseudo-label. If α is large, the pseudo-label will mainly depend on the MS and $\tilde{L}_{ce}$ will approach the cross-entropy loss. On the contrary, if α is small, the pseudo-label is dominated by the network output, which is not desirable because the training information $T_i$ cannot guide the learning process. The influence of α on the classification result will be further studied in the experiment section. By introducing the regularization in $\tilde{L}_{ce}$, the pseudo-label becomes adjustable. The gradient of $\tilde{L}_{ce}$, which is required for performing the optimization, is given by:

We now briefly explain the advantage of the proposed pseudo-label approach. Firstly, if the network prediction result is consistent with the MS, the first term in (6) will increase the certainty of the pseudo-label, which will implicitly increase the weight on this sample. For example, if the network prediction value $p_i$ is 0.7, the first term in (6) is negative and the corresponding $\tilde{p}_i$ will become larger during optimization. This larger $\tilde{p}_i$ will increase the absolute value of the gradient in (5), which in turn will encourage learning from the sample. On the other hand, if the network prediction contradicts the MS, forcing the network to fit the sample may harm the generalization ability of the network due to the MS noise. Thus, for such samples, the first term in (6) will drive $\tilde{p}_i$ towards $p_i$, which will implicitly lower the weight of learning from such samples.
Besides, the second term in (6) is used to penalize the pseudo-label for large deviations from $T_i$, which avoids large fluctuations in the pseudo-label. Thus, the pseudo-label can be regarded as a weight reflecting our confidence in the marginal label given the original annotation as well as the current network knowledge.
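The pseudo-label can be updated using gradient descent:

$$\tilde{p}_i \leftarrow \tilde{p}_i - r_2 \frac{\partial \tilde{L}_{ce}}{\partial \tilde{p}_i} \tag{7}$$

where $r_2$ is the learning rate for the pseudo-labels. Since the pseudo-label represents the probability of malignancy, it should be bounded between 0 and 1. Therefore, the update in (7) is further projected onto these bound constraints as:

$$\tilde{p}_i \leftarrow \min\big(1, \max(0, \tilde{p}_i)\big).$$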
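The pseudo-label update and its projection onto [0, 1] can be sketched as follows. The exact loss is not reproduced in this text, so the sketch assumes a per-sample form that matches the verbal description: a cross-entropy term with the pseudo-label as target plus a quadratic penalty α(p̃ − T)² on the deviation from the soft label. The quadratic form of the penalty, the function names, and all numeric values in the usage are assumptions made for illustration only.

```python
import math


def pseudo_label_grad(p_tilde, p, t, alpha, eps=1e-7):
    """Gradient of the ASSUMED loss w.r.t. the pseudo-label p_tilde.

    Assumed loss: -[p_tilde*log(p) + (1 - p_tilde)*log(1 - p)] + alpha*(p_tilde - t)**2
    """
    p = min(max(p, eps), 1.0 - eps)     # keep the logarithm finite
    fit = -math.log(p / (1.0 - p))      # first term: driven by the network prediction
    reg = 2.0 * alpha * (p_tilde - t)   # second term: pull back towards the soft label
    return fit + reg


def update_pseudo_label(p_tilde, p, t, alpha, lr):
    """One plain gradient-descent step followed by projection onto [0, 1]."""
    step = p_tilde - lr * pseudo_label_grad(p_tilde, p, t, alpha)
    return min(1.0, max(0.0, step))


# Prediction agrees with a malignant-leaning soft label: pseudo-label grows.
agree = update_pseudo_label(p_tilde=0.75, p=0.7, t=0.75, alpha=10.0, lr=0.01)
# Prediction contradicts the soft label: pseudo-label moves towards the prediction.
disagree = update_pseudo_label(p_tilde=0.75, p=0.2, t=0.75, alpha=10.0, lr=0.01)
```

Under this assumed form, a prediction of 0.7 gives a negative first term, so the pseudo-label increases, which is the behaviour described for (6).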

E. JOINTLY-OPTIMIZED OF NODULE SEGMENTATION AND CLASSIFICATION (JNSC)
The proposed JNSC network comprises a nodule detection module and a nodule classification module with a shared structure for information exchange. The features for nodule classification can be extracted from the encoder of the nodule detection module, which provides additional information for feature extraction. For training this joint network, we first train the nodule detection (segmentation) network for 100 epochs using the pixel-wise cross-entropy loss: where $S_i$ denotes the probability of the pixel belonging to the nodule. After this initialization, the segmentation output may still contain many false-positive nodules. To overcome this problem, we extract features not only for true-positive nodules, but also for those false-positive nodules for classification. Moreover, the false-positive nodules are labelled as non-nodule with probability 1. The network is then trained jointly. For the following 100 epochs, we do not update the pseudo-label because the network prediction is unstable at these early stages. Finally, once the segmentation and classification modules are properly initialized, the network can be optimized using the following cost function: Different from [34], where the segmentation and classification networks are trained iteratively, the parameters in both the detection and classification modules of the proposed JNSC can be updated simultaneously.
Additionally, because the cost function is differentiable with respect to the network parameters, they can be optimized by an efficient optimizer such as Adam. Each epoch consists of a number of iterations, and the network parameters are updated at each iteration. Since the parameters are likely to be sufficiently trained after each epoch, each pseudo-label is updated once after each epoch. To reduce the effect of previous gradients, the pseudo-labels are directly updated by gradient descent without momentum.
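The staged training described in this subsection can be sketched as a simple schedule. This is an illustration only: the phase names and the callables `step_detection`, `step_joint` and `step_pseudo_labels` are hypothetical placeholders for one optimizer iteration or one pseudo-label update, and zero-based epoch counting is an assumption.

```python
def training_phase(epoch, pretrain_epochs=100, frozen_epochs=100):
    """Map a zero-based epoch index to the training stage described in the text."""
    if epoch < pretrain_epochs:
        return "detection_only"                 # pixel-wise loss only
    if epoch < pretrain_epochs + frozen_epochs:
        return "joint_fixed_pseudo_labels"      # joint loss, pseudo-labels frozen
    return "joint_with_pseudo_label_update"     # joint loss, pseudo-labels adapted


def run_schedule(num_epochs, iters_per_epoch, step_detection, step_joint,
                 step_pseudo_labels):
    """Call the supplied (hypothetical) update functions in the described order."""
    for epoch in range(num_epochs):
        phase = training_phase(epoch)
        for _ in range(iters_per_epoch):
            # Network parameters are updated at every iteration (e.g. with Adam).
            if phase == "detection_only":
                step_detection()
            else:
                step_joint()
        if phase == "joint_with_pseudo_label_update":
            # Pseudo-labels change once per epoch, by plain gradient descent.
            step_pseudo_labels()
```

Updating the pseudo-labels only at epoch boundaries keeps them fixed while the network parameters settle, which mirrors the reasoning given above.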

F. IMPLEMENTATION DETAILS
Our proposed network mainly consists of three convolution layers, and the parameters of the convolution layers are listed in TABLE 3. Each convolution layer is followed by an instance normalization [55] layer and a ReLU layer. The parameters of the network are optimized with the Adam optimizer using the default settings in PyTorch. The initial learning rate is 0.001, and it is decreased every 250 epochs with a factor of 0.2. The maximum training epoch is set to 1000 and the batch size is 12. The spatial dropout strategy is applied to the 3-D convolutions with a dropout rate of 0.1. For the sake of stability, we also employ gradient clipping during the optimization by clipping the gradient to 1 if its $L_2$ norm is larger than 1.
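The learning-rate schedule and gradient clipping can be illustrated with a dependency-free sketch. Two interpretations are assumed here and are not confirmed by the text: that "a factor of 0.2" means the learning rate is multiplied by 0.2 every 250 epochs, and that clipping rescales the whole gradient vector so its $L_2$ norm becomes 1. The function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-3, step=250, gamma=0.2):
    """Step decay: multiply the rate by gamma every `step` epochs (assumed)."""
    return base_lr * gamma ** (epoch // step)


def clip_by_l2_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Rates at the start of each 250-epoch stage of a 1000-epoch run.
stage_rates = [learning_rate(e) for e in (0, 250, 500, 750)]
clipped = clip_by_l2_norm([3.0, 4.0])  # norm 5 is rescaled to norm 1
```

Under these assumptions the rate falls from 0.001 to 0.0002, 0.00004 and 0.000008 over the four stages.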
Since the number of benign nodules is almost twice that of the cancer nodules, a class-imbalance problem arises. Specifically, the non-nodule pixels in $L_{seg}$ and the benign nodules in $L_{ce}$ will dominate the training phase if no balancing mechanism is used. To alleviate this problem, we adopt different weights in the cross-entropy loss. Specifically, in the nodule detection module, they are chosen as 0.01 and 0.99 for nodule and non-nodule pixels, respectively. In the nodule classification module, the weights for the malignant, benign, and non-nodule classes are set to 0.35, 0.55 and 0.1, respectively. In principle, the weights are chosen as the ratio of samples in the two classes. Of course, one can increase the weight to allow the network to focus more on the cancer samples. The weights in the nodule segmentation also adopt a similar criterion, where the weight of non-nodule pixels is about 100 times that of the nodule pixels. The performance does not depend critically on these weights as long as they reflect the difference in the sample number between classes.
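A class-weighted cross-entropy over the three classification outputs can be sketched as below, using the weights quoted in the text. The normalization by the sum of the applied weights is an assumption (a common convention for weighted means), as are the function name and the dictionary-based data layout.

```python
import math

# Class weights as stated in the text for the classification module.
CLASS_WEIGHTS = {"malignant": 0.35, "benign": 0.55, "non-nodule": 0.10}


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -log p[y] over samples.

    probs:  list of dicts mapping class name -> predicted probability
    labels: list of true class names, one per sample
    """
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(max(p[y], 1e-12))  # guard against log(0)
        norm += weights[y]
    return total / norm  # assumed normalization by the sum of applied weights
```

With this convention, a misclassified non-nodule sample contributes less to the loss than an equally misclassified benign or malignant sample.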

A. NODULE DETECTION
We first evaluate the nodule detection performance of our JNSC and other state-of-the-art algorithms on the LUNA16 dataset. The standard ten-fold cross-validation of the LUNA16 competition is adopted, and the standard evaluation script is used to compute the free-response receiver operating characteristic (FROC) curve.
To extract the nodule candidates from the 3-D nodule detection probability maps, we first set the detection threshold to 0.4 and label the connected regions in the segmentation map based on their voxel connectivity [54]. Then, the region proposals are extracted from the labelled map, and the center of each proposal is computed as the center of mass of the proposed region. Lastly, we apply non-maximum suppression [56] to the proposed regions and exclude those with a diameter of less than 3 mm. Figure 5 shows five examples of our nodule detection results with a wide range of nodule diameters. To visualize the 3-D segmentation result in a 2D figure, we present the cross sections of the nodules as well as the corresponding segmentation maps along the x-y, y-z and x-z planes. It can be seen from Figure 5 that the detected regions are relatively larger than the ground truth.
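The last two steps of the candidate extraction (center of mass and the 3 mm size filter) can be sketched as follows, taking each connected region as a list of `(z, y, x)` voxel coordinates. The text does not say how the diameter of a region is measured, so the equivalent-sphere diameter used here is an assumption, as are the function names. Non-maximum suppression is omitted from the sketch.

```python
import math


def centre_of_mass(region):
    """Mean (z, y, x) coordinate of the voxels in a region."""
    n = len(region)
    return tuple(sum(c[i] for c in region) / n for i in range(3))


def equivalent_diameter_mm(region, voxel_volume_mm3=1.0):
    """Diameter of the sphere with the same volume as the region (assumed measure)."""
    volume = len(region) * voxel_volume_mm3
    return (6.0 * volume / math.pi) ** (1.0 / 3.0)


def keep_candidates(regions, min_diameter_mm=3.0):
    """Return (centre, diameter) for regions at least min_diameter_mm across."""
    kept = []
    for region in regions:
        d = equivalent_diameter_mm(region)
        if d >= min_diameter_mm:
            kept.append((centre_of_mass(region), d))
    return kept
```

At 1 mm isotropic resolution this measure discards regions of 14 voxels or fewer, so single voxels and tiny patches never reach the classifier as candidates.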
Moreover, as shown in Figure 5 (b), our network can detect tiny nodules while distinguishing them from other body tissues such as vessels. The resolution of CT scans along the z-axis is much lower than the resolution along the x- and y-axes. For instance, the resolution along the x- and y-axes is usually 0.7 mm per pixel, but the resolution along the z-axis can vary from 1.25 to 3 mm per pixel. To ensure similar accuracy in the three dimensions, we employ interpolation to convert the resolution along the three dimensions to 1 mm per pixel. The result shows that our network can tolerate the problem of different resolutions and achieve similar performance in the three dimensions.
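The resampling to 1 mm per pixel can be illustrated along a single axis. The text only says "interpolation", so the linear interpolation below is an assumption, and the function names, the `(z, y, x)` axis order, and the example scan dimensions are hypothetical.

```python
def resampled_shape(shape, spacing_mm, target_mm=1.0):
    """Voxel count per axis after resampling to an isotropic target spacing."""
    return tuple(max(1, round(n * s / target_mm)) for n, s in zip(shape, spacing_mm))


def resample_line(values, spacing_mm, target_mm=1.0):
    """Linearly interpolate a 1-D intensity profile onto the target spacing."""
    n_out = max(1, round(len(values) * spacing_mm / target_mm))
    out = []
    for i in range(n_out):
        # Position of the new sample in units of the original sample index.
        pos = min(i * target_mm / spacing_mm, len(values) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


# A hypothetical scan of 120 slices at 2.5 mm with 512 x 512 pixels at 0.7 mm.
new_shape = resampled_shape((120, 512, 512), (2.5, 0.7, 0.7))
```

Applying the same one-dimensional interpolation along each axis in turn would bring the whole volume to 1 mm spacing, so that a 3 mm nodule spans roughly three voxels in every direction.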

1) PERFORMANCE OF JOINTLY OPTIMIZED NODULE DETECTION
To verify the effectiveness of the structure, we compare the performance of our proposed approach on nodule detection under standard settings with and without the classification phase. In the detection-without-classification case, the outputs from the detection phase are directly evaluated using the standard evaluation script. In the joint training case, the outputs are further classified as non-nodule, benign or malignant, and the benign and malignant nodules in the final result are then evaluated. The FROC curves under the two settings are shown in Figure 6. The jointly-optimized approach significantly outperforms the detection-only case, which shows that the classification stage can significantly reduce false-positive nodules. More specifically, the sensitivity of JNSC with classification at 0.125 false positives per scan is 0.776, while that of the detection-only case is 0.630. Because the undetected nodules at low false-positive levels are primarily small ones, the joint optimization approach is capable of significantly improving the detection performance on such tiny nodules, which is essential to the early detection of the disease.
The detection module of the JNSC is trained using the pixel-wise cross-entropy cost function. Since large nodules have more pixels, they dominate the training phase, as the gradients are mainly backpropagated from these large nodules. Consequently, the small but important nodules can easily be neglected. Moreover, despite the shortcut path, the gradient backpropagated from the decoder may be less sensitive to the small nodules. On the other hand, the direct path of our JNSC to every encoder helps to propagate the gradient from the classification network to train the encoder, so that undetected small nodules can be distinguished from the non-nodule region during the training phase. It can facilitate the detection of those small nodule regions which are not detected by the detection network alone. It is also observed that the information backpropagated through the direct path is much more direct and effective than that backpropagated from the gradient of the classification network, due to the large separation between the classification output and the detection encoder. Moreover, the classification phase simultaneously performs false-positive reduction, which further improves the detection rate.
Additionally, from Figure 6, the proposed JNSC with and without classification achieves an impressive sensitivity of 0.953 and 0.942, respectively, at 8 false positives per scan, which further demonstrates the effectiveness of joint optimization.

ZNET [13] and Aidence [13] were participants in the competition and won the first and second places. ZNET uses a 2-D U-Net [36] architecture and computes the nodule probability map slice by slice. Though the 2-D network cannot fully utilize the 3-D structure of the nodules, it has far fewer parameters to train than a 3-D network. ZNET achieves a CPM of 0.811 and a sensitivity of 0.915 at 8 false positives per scan. The detailed method of Aidence is unavailable because of commercial confidentiality. Aidence also achieves a CPM of 0.807 in the competition.
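For readers unfamiliar with the CPM figure quoted above: in the LUNA16 evaluation, the competition performance metric is the average sensitivity at seven operating points of the FROC curve, namely 1/8, 1/4, 1/2, 1, 2, 4 and 8 false positives per scan. A minimal sketch follows; the function name is hypothetical and the sensitivities in the usage are made-up values for illustration, not results from this work.

```python
# The seven false-positive rates (per scan) averaged by the CPM.
FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)


def cpm(sensitivity_by_fp_rate):
    """Average sensitivity over the seven predefined false-positive rates."""
    missing = [r for r in FP_RATES if r not in sensitivity_by_fp_rate]
    if missing:
        raise ValueError(f"missing operating points: {missing}")
    return sum(sensitivity_by_fp_rate[r] for r in FP_RATES) / len(FP_RATES)


# Hypothetical FROC operating points, for illustration only.
example = {0.125: 0.60, 0.25: 0.70, 0.5: 0.80, 1: 0.85, 2: 0.90, 4: 0.90, 8: 0.95}
example_cpm = cpm(example)
```

Because the three lowest operating points carry the same weight as the three highest, gains at 0.125 false positives per scan affect the CPM as much as gains at 8.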
Despite the advantage of having fewer parameters in 2-D networks, 3-D neural networks have recently been preferred due to their ability to detect 3-D patterns and the increased availability of computational power. DeepMed [8] was extended to a 3-D architecture, but the network is relatively shallow. Also, an independent false-positive network is trained to distinguish the detected candidates. Our JNSC is deeper than [8], which allows it to capture more complicated structures, and the false-positive reduction stage is implicitly incorporated into the JNSC. SDFPR [4] and DeepLung [9] adopt the faster R-CNN structure, which performs regression of the nodule location as well as the probability, but not pixel-wise segmentation as in our JNSC. Their encoder-decoder architecture is similar to our network, but our network has an additional shortcut path to the encoder. Hence, our network can be more sensitive at a low false-positive level. For example, our JNSC obtains 0.776 sensitivity at 0.125 false positives per scan, while SDFPR [4] obtains approximately 0.62.
The 3D-CNN in [25] uses a combination of 2-D and 3-D networks where the 2-D network is used for candidate detection while the 3-D network is used to classify false positives. The candidate detection network can benefit from the pre-trained VGG network while the 3-D network can only be trained from scratch. The conditional non-maximum suppression in [25] is superior to normal NMS. However, the two networks are still independent of each other while our network adopts a joint optimization approach. The result shows that the CPM of our JNSC outperforms [25] by 2.4%.
The S4ND [35] employs a single end-to-end network and replaces convolution blocks with densely connected convolution blocks. The results from [35] show that the densely connected block outperforms the regular residual connection.
However, S4ND does not perform false positive reduction after detection, while a considerable number of tiny nodules are, in fact, body tissues. Our JNSC jointly achieves false positive reduction with the help of the classification network and outperforms state-of-the-art algorithms.

B. NODULE CLASSIFICATION
We now evaluate the nodule classification performance using the LIDC-IDRI dataset. As described in section VI, the uncertain nodules are excluded from evaluation. We randomly split the 1018 scans into ten subsets and adopt 10-fold cross-validation to report the result. Additionally, each fold is trained five times to reduce the effect of network initialization. Note that the uncertain nodules in the testing set are excluded from calculating the accuracy.
The classification network in our proposed JNSC requires the segmentation result from the nodule detection network to perform multiscale voxel-based feature extraction. In order to compare with other classification only algorithms, those undetected nodules are directly labelled as benign. We have also neglected the false positives in the nodule detection process.

1) COMPARISON WITH THE STATE-OF-THE-ART ALGORITHMS ON NODULE CLASSIFICATION
To our knowledge, few studies report the end-to-end result, and therefore, the comparisons can hardly be absolutely fair. Following common practice, nodules with MS = 3 are excluded, since these nodules are uncertain as to being benign or cancerous. We therefore report algorithms using the same MS and CT scans as ours. It should be noted that our system is end-to-end, which is more challenging than classification of the nodules alone, as the nodule detection process may itself be error-prone. On the other hand, our framework is closer to a realistic operating environment.
The accuracy, sensitivity, and specificity of the proposed approach on nodule classification are reported and compared with state-of-the-art algorithms. Moreover, as there are more negative samples than positive samples in the dataset, the network is likely to perform better on the negative samples (thus, the specificity is usually higher than the sensitivity). Hence, the negative samples will have more influence on the accuracy. To illustrate the overall performance of the algorithms despite these effects, we also report the balanced accuracy to better reveal and compare the performance. More precisely, the definition of the balanced accuracy is

$$\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}.$$

Furthermore, to verify the effectiveness of the segmentation information in the proposed joint-optimization approach, the NSR is replaced by the ground truth region and the JNSC is trained without segmentation, i.e. it is operated in classification-only mode. In particular, we do not backpropagate the gradient from the segmentation module, so that the encoder is trained only by the classification network. The results show that the joint training performs better than the classification-only mode.
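The four reported metrics can be computed from confusion-matrix counts as in the short sketch below, which assumes the usual definition of balanced accuracy as the mean of sensitivity and specificity. The function name is hypothetical and the counts in the usage are invented to illustrate an imbalanced test set; they are not results from this work.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and balanced accuracy from counts.

    tp/fn: malignant nodules classified correctly / incorrectly
    tn/fp: benign nodules classified correctly / incorrectly
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = 0.5 * (sensitivity + specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts with three times as many benign as malignant nodules.
metrics = classification_metrics(tp=80, fn=20, tn=270, fp=30)
```

In this invented example the accuracy (0.875) sits closer to the specificity (0.9) than to the sensitivity (0.8), whereas the balanced accuracy (0.85) weights both classes equally.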
As shown in TABLE 5, our proposed JNSC achieves the highest balanced accuracy and sensitivity among the algorithms. Although the 2D-MV-KBC [16] has the best accuracy, the higher accuracy results from the imbalanced classes where specificity can contribute more to the overall accuracy. Moreover, 2D-MV-KBC only considers the classification on the extracted nodule patches while our algorithm does not require the nodule location to be known in the training phase. Although the 2-D U-Net is adopted for labelling the nodule from the patch, training the network on the extracted patches is still much easier than for the entire CT scans because the extracted regions will be free from the interference of many other body tissues. Moreover, it is required to train 27 independent networks in 2D-MV-KBC so that their results can be aggregated. Its complexity will be significantly increased. In [16], a three-dimension network with 3 independent networks based on ResNet-50 is also proposed. Experimental results show the 2-D network outperforms the 3-D network, which is likely due to the fact that the 2-D network can benefit from the pre-trained ResNet-50 network.
On the other hand, the proposed 3D JNSC can be trained from scratch since the nodule detection network can provide additional information in the form of regularization to alleviate the overfitting problem caused by insufficient training samples. Moreover, the encoder in our JNSC is trained on the whole CT image which can also distinguish other body tissues for nodule detection. The experiment results show that the joint detection and classification framework is superior to the classification only approach with an improvement of 1.25% accuracy. Overall, our approach is more practical for automatic cancer and nodules detection.
The MC-CNN [15] is the first to introduce the approach of cropping nodule-specific feature, which is similar to our multiscale feature extraction method. However, our algorithm differs from [15] in that: i) our extraction is based on the nodule detection while MC-CNN uniformly extracts multiscale feature by using successive max-pooling on each feature, ii) MC-CNN requires nodule-centric inputs (i.e. the first identification of the location of the nodules to be classified by the network) while our JNSC is more flexible in that the nodule can occur anywhere in the voxels and our feature extraction is invariant to the nodule location. Moreover, MC-CNN employs 2-D convolution given the 3-D inputs (i.e. as multiple 2D channels), and hence the information among slices may not be efficiently exploited.
In conclusion, our JNSC is at least comparable to the state-of-the-art nodule classification algorithms with respect to accuracy, sensitivity, and specificity on the classification-alone task. On the other hand, the JNSC is fully automatic and does not require pre-selected inputs of the detected nodules; it can be operated in an end-to-end manner.

2) ANALYSIS OF THE EFFECT OF PSEUDO-LABEL
To examine the effect of labels on the classification performance of our approach, experiments are performed on the following three cases: 1) assigning a hard label to nodules, by which each nodule is labelled either "0" or "1"; 2) substituting the hard label by a soft label, by which each nodule is labelled based on the MS in (3); and 3) replacing the soft label by our pseudo-label for the marginal nodules. The results are shown in TABLE 6. The performance of using the hard label is clearly the worst among the three methods. This phenomenon reveals that classification in the biomedical area is different from natural image recognition because the ground truth is not absolutely correct. Inconsistent labels may arise in the biomedical area due to human errors. It is noted that we are not proposing a physical model to accurately model the probability that the label is uncertain. Instead, we empirically estimate the reliability of the marginal samples and their associated labels via the cross-entropy loss function so as to prevent the network from overfitting these less reliable samples, which would affect the overall performance. Consequently, assigning soft labels in classification can significantly improve classification accuracy. However, the soft label requires the estimation of a probability, which may also introduce additional noise when only a few annotations are available. In this study, we assume that the nodules annotated by at least three radiologists are reliable, while nodules annotated by fewer than three radiologists and not highly confident are marginal. We then estimate and update a soft label in the form of pseudo-labels for the marginal nodules based on the annotation and network prediction to reduce the noise mentioned above.
To visualize and validate the effectiveness of the proposed pseudo-label during the training phase, the histograms of pseudo-labels before and after training under different values of the regularization parameter α are shown in Figure 7. Figure 7 (a) plots the initial distribution of the pseudo-labels. Note that the data is acquired on a randomly selected fold. We then examine the effect of α on the pseudo-labels. As shown in Figure 7 (b), a lower α pushes the pseudo-labels towards the boundary, where the pseudo-labels are similar to hard labels on the marginal samples. This can be explained by the fact that the network prediction results dominate the pseudo-label update. However, this is undesired because little information can be learned from the marginal nodules. The result shows that the network tends to fit the benign nodules: the highest specificity is achieved, but the overall performance is inferior to that of the soft label. When α grows larger, we observe from Figure 7 (d) that the pseudo-labels are still pushed towards the boundary, but the changes are less severe than before.
Moreover, the network increases the malignancy probability of some benign nodules, revealing that the network treats such nodules as malignant. The network is not trained to mine the marginal samples. Instead, it relies more on the certain data for classification, as the marginal samples may not be absolutely correct due to label uncertainty. The problem is commonly encountered in biomedical applications where the ground truth may not be precisely gauged from limited human labels. This is in great contrast to natural image classification and language understanding, where such labels are usually correct, except for occasional human errors. In summary, the pseudo-label approach addresses the label uncertainty by incorporating the network prediction results or knowledge in addition to the label provided.
Next, we observe that the regularization power does not grow linearly with increasing α. Figure 7 (h) shows that α = 20 performs similarly to α = 10. When α is set to 10, the majority of pseudo-labels only vary in a small range, as shown in Figure 7 (g). It is reasonable that the network prediction and ground truth annotation are balanced under α = 10, thus achieving the best overall performance. Theoretically, when α grows to infinity, the annotation governs the pseudo-label, which then becomes identical to the soft label. We do not explore larger α, and 10 is selected as the default value in this study.

3) ANALYSIS ON MULTISCALE FEATURE EXTRACTION
Our proposed JNSC relies on the features from several encoders to perform nodule classification. Hence, it is important to evaluate the effect of the number of multiscale features on the classification performance. The experiment is designed to observe the classification performance when concatenating features from the first encoder $V_1$ up to the deepest level $V_5$. Note that the nodule size is still concatenated to the feature.
As shown in TABLE 7, the classification performance generally improves as deeper features are added. Although using the features only up to $V_4$ yields higher accuracy and specificity, the performance is comparable to that of concatenating all features once the balanced accuracy and sensitivity are considered. Therefore, to maintain the consistency of the structure, we do not discard the feature from $V_5$. The reason for such behaviour can be explained as follows. As features of different scales are extracted from the corresponding location in the multiscale feature map, the convolution operation expands the receptive field, which means that the extracted features usually represent a larger region in the input CT images. For the features from $V_1$ and $V_2$, the effect is negligible. For the feature from $V_5$, this effect can somewhat affect the classification, especially on small nodules, because the feature may encode information from other body tissues. Meanwhile, the small nodules are likely to be benign, and thus the specificity decreases after adding the $V_5$ features.

VI. DISCUSSION AND FUTURE WORK
A deep-learning-based approach for joint detection, segmentation and classification of nodules from 3-D CT scans has been proposed. Moreover, the concept of pseudo-label has been proposed to tackle the problem of label uncertainty, which is commonly encountered in biomedical data. While most proposed algorithms focus on either detection or classification, the proposed algorithm operates in an end-to-end manner, which provides detection and classification of nodules simultaneously together with a segmentation of the detected nodules. Experimental results show that it outperforms the state-of-the-art nodule detection algorithms, and yields performance comparable to state-of-the-art nodule classification algorithms when classification alone is considered.
While natural images are often two-dimensional, biomedical images, such as CT and MRI, are often three-dimensional. Since it is usually difficult for humans to efficiently visualize such three-dimensional data for detection, detailed segmentation and classification of regions of interest, the proposed algorithm offers a promising approach for developing similar computer-aided diagnosis systems.
In this work, we have employed a multi-task framework, which combines detection and classification in a single network. Such an integrated approach allows essential information to be exchanged between the individual subnetworks and leads to higher performance in both tasks. Moreover, in many practical applications, it is necessary to provide users with the detailed location or morphology of the objects of interest, in addition to the final decision. In this work, we further extend the nodule detection to pixel-wise nodule segmentation, where a more accurate shape or morphology description of nodules can be obtained. Therefore, the present framework may also be useful in related applications.
Some limitations do exist in our study. Firstly, the patient-level prediction is not studied in this work. Secondly, the slice thickness of various CT scans can vary dramatically. The nodule detection competition (LUNA16) manually excludes the scans whose slice thicknesses are larger than 2.5 mm. The diameter of the small nodules is around 3 mm, which is very close to the slice thickness. Therefore, the low and variable resolution along the z-axis is another difficulty in nodule detection, especially for the small nodules. Many studies [57][58][59][60] have adopted deep-learning-based super-resolution approaches to address this problem in CT and MRI images. It would be interesting to incorporate super-resolution into the proposed nodule detection and classification framework.

VII. CONCLUSION
A joint lung nodule detection and classification network for end-to-end lung nodule detection, segmentation and classification subject to possible label uncertainty in the training set has been presented. It operates in an end-to-end manner, which provides detection and classification of nodules simultaneously together with a segmentation of the detected nodules. A 3D encoder-decoder architecture is adopted for better exploration of the 3D nature of the data. The nodule classification subnetwork of the joint network utilizes the features from the encoder output of the detection subnetwork and the multiscale nodule-specific features for boosting the classification performance. This valuable prior information also allows the more complicated 3D nodule classification encoder network to be optimized directly with improved performance on both tasks. Evaluation using the LUNA16 and LIDC-IDRI datasets shows that the proposed nodule detector outperforms the state-of-the-art algorithms and yields performance comparable to state-of-the-art nodule classification algorithms when classification alone is considered. Finally, since our joint detection/recognition approach can directly detect nodules and classify their malignancy instead of performing the tasks separately, our approach is more practical for automatic cancer and nodule detection.