CXNet-m3: A Deep Quintuplet Network for Multi-Lesion Classification in Chest X-Ray Images via Multi-Label Supervision

Medical image analysis has been driven by the success of deep learning, but the annotations it requires are usually expensive and hard to obtain. In this paper, we propose a deep quintuplet network, CXNet-m3, in which the classification of lesion types in chest X-ray images (CXRs) benefits from easily accessible annotations such as patient age, gender, identity and view position. To improve classification performance, a novel loss function combining deep metric learning and deep learning is first designed based on multiple labels. Then, a deep model based on transfer learning is built to optimize the loss function. To address slow convergence, a quintuplet mining algorithm is presented to provide valuable training samples for the proposed classification model. Experimental results on the ChestX-ray14 database show that our classification method outperforms several state-of-the-art models in terms of the Area Under Curve (AUC) score, reaching 0.824 on average. In addition, our proposal achieves AUC values above 0.9 for Infiltration, Atelectasis, Cardiomegaly and Nodule.


I. INTRODUCTION
Many chest lesions such as nodules and emphysema are early manifestations of lung cancer, the leading cause of death in the world [1]- [3]. Some lesions shown on chest X-ray images (CXRs) are also useful biomarkers associated with severe heart failure and respiratory diseases [4]- [6]. Therefore, diagnosing chest lesions is essential for reducing morbidity and mortality from lung, heart and respiratory diseases.
Chest X-ray is the most commonly used radiology exam for screening and diagnosing chest lesions. With a growing population and increasing health awareness, demand for chest readings is growing. In the United States of America (USA) alone, over 35 million CXRs are taken every year, and radiologists have to read more than 100 CXRs a day [7]. Meanwhile, manual reading struggles to provide expert readings and correct diagnoses for all CXRs. According to a report, within 12 months, up to 23,000 CXRs were not formally reviewed by radiologists at Queen Alexandra Hospital alone [8]. Therefore, advanced technologies are urgently needed to assist radiologists, improve work efficiency and enhance diagnostic accuracy.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhiwei Gao.
With the growth of computing power and the advent of the era of big data, deep learning (DL) technology based on artificial neural networks has achieved great success in many fields, including image processing [9]- [12]. Compared with traditional machine learning methods such as support vector machines, K-nearest neighbors and random forests, deep learning methods do not require manually extracted image features such as Local Binary Pattern, Histogram of Oriented Gradient and Haar-like features [13]- [17]. Instead, deep convolutional neural networks (CNNs) obtain multiple levels of image features automatically through end-to-end training [18]- [20]. Local connectivity and shared weights make CNNs the leading computational intelligence approach for image processing and classification. However, the training of CNNs must be supervised by expert annotations, which are generally expensive and hard to obtain in the field of medical imaging. Therefore, the value of labels such as patient gender and identity, which are easy to obtain but usually ignored by researchers, should also be explored.
In this paper, a classification model, CXNet-m3, is proposed based on a convolutional neural network to provide auxiliary diagnosis for CXRs in the ChestX-ray14 database. In CXNet-m3, the classification of lesion types benefits from easily accessible annotations such as patient age, gender, identity and view position. Taking advantage of transfer learning, CXNet-m3 is built with quintuplet inputs and trained by both classification losses and embedding distances between quadruplets of CXRs. These quintuplets are generated using the quintuplet mining (QM) algorithm, where quintuplets are also filtered based on information from multiple labels.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The main contributions of this paper are as follows:
1) A novel idea of using easily available label information is proposed to improve the classification performance of a CNN model for lesion types.
2) A novel loss function with the help of deep metric learning for classification is proposed to make use of multiple label information.
3) A quintuplet mining algorithm based on multiple labels is presented to provide valuable training samples for the proposed classification model.

II. RELATED WORK
The recent success of deep learning in image processing tasks has led to rich applications in the medical imaging field, including the classification of chest X-ray images. Using 5,232 CXRs as the training set, Yadav et al. trained a linear support vector machine, a fine-tuned convolutional neural network, and a capsule neural network to classify CXRs into bacterial pneumonia, viral pneumonia and disease-free CXRs [21], [22]. Their experimental results indicate that deep learning algorithms are superior to the traditional machine learning algorithm, and that fine-tuning is better than training from scratch. Lakhani et al. took a dataset containing 1,007 chest X-ray images as the research object and classified tuberculosis using convolutional neural networks including AlexNet and GoogLeNet [23]- [25]. AlexNet won the ImageNet image classification competition in 2012, far ahead of the second place. GoogLeNet introduced the inception module to improve the expressive ability of CNNs without increasing the amount of computation. Shin et al. used a recurrent neural network (RNN) model to read chest X-ray images and conducted experiments on the open database OpenI containing 3,955 radiological reports [26]. Kieu et al. proposed a multi-CNN model combined with fusion rules to detect abnormal chest radiographs [27]. Anavi et al. used age and gender to visualize patients and improve deep learning frameworks for chest X-ray image retrieval [28]. Although the above studies have achieved good results, the study of deep learning methods in the field of chest X-ray imaging is still restricted by the limited scale of the datasets.
In 2017, the National Institutes of Health (NIH) of the USA released one of the world's largest accessible labeled chest X-ray image archives, ChestX-ray14. ChestX-ray14 contains 112,120 chest X-ray images from 30,805 patients [29]. Due to its large scale, it has attracted considerable attention in the deep learning community. Xu et al. trained a two-class deep model from scratch to detect abnormal chest radiographs in ChestX-ray14 [30]. Yao et al. and Xu et al. made use of image features and dependencies between labels by combining a CNN and an RNN to detect multiple lesions in a single image [31], [32]. Most scholars have classified the CXRs in ChestX-ray14 into 14 kinds of lesion types, and most of them used only the lesion type label to supervise the training of the convolutional neural network. X. Wang et al. fine-tuned four standard CNN architectures: AlexNet, VGGNet, GoogLeNet and ResNet [24], [25], [29], [33], [34]. Compared to AlexNet, small convolutional kernels and stacked convolutional layers are the two improvements of VGGNet. Such a design improves the ability to extract features, reduces network complexity, and facilitates training convergence. To solve the vanishing gradient problem caused by deep layers, ResNet adds a skip connection between the input and output of two stacked convolutional layers, which also reduces the time complexity of training. Among these four classic CNN models, ResNet achieved the best result for multi-class classification in the work of X. Wang et al. Li et al. presented a model for ChestX-ray14 that simultaneously performed lesion type classification and lesion localization based on ResNet and a simple recognition network [35]. P. Rajpurkar et al. utilized a 121-layer DenseNet architecture with little modification to detect pneumonia using ChestX-ray14 [36], [37].
Compared with ResNet, DenseNet further establishes skip connections not only between the residual blocks, but also between each layer. Yao et al. introduced an architecture that learned at multiple resolutions and used a learnable lower bound adaptability to parameterize the pooling function. They achieved satisfactory classification and recognition results for up to 9 lesion types while generating high-resolution saliency maps [38]. Aviles et al. proposed a graph-based semi-supervised learning method for chest X-ray image classification. They introduced a new loss function to strengthen the synergy between a limited number of labels and a large amount of unlabeled data. They obtained good results on the ChestX-ray14 database while greatly reducing the need for annotated data [39]. Different from their method, Baltruschat et al. tried to use as much label information as possible. Their research is currently the only one that uses view position, patient gender, and patient age information besides image information to train a classification model for ChestX-ray14 [13]. They abstracted the three labels into a 3-dimensional feature vector and concatenated it with the 2048-dimensional feature vector of the training image. Although this is excellent work, low-dimensional non-image features may struggle to play a powerful role after being concatenated with high-dimensional image features. Nevertheless, the success of their method reveals that patient information and view position information, usually ignored by researchers, are also very useful. Inspired by their work, we propose a multi-label supervision method that makes full use of non-image information in this paper. Different from the work of Li et al., our model is trained without the supervision of lesion location, which is expensive and difficult to obtain. In the ChestX-ray14 database, only 0.8% of chest X-ray images are labeled with lesion location.
Instead, we are committed to extracting more value from other labels, which are easily accessible but usually ignored by researchers. Rather than transforming non-image information into features as Baltruschat et al. did, we take advantage of it by combining deep learning and deep metric learning to implement the supervision of multiple labels.
Compared with classic metric learning, deep metric learning can apply a non-linear mapping to the input features, and has been widely used in computer vision tasks such as image clustering and image retrieval [40]- [45]. Deep metric learning learns the mapping from samples to features through a loss function. Under this mapping, the metric between features reflects the degree of similarity between samples. Taking advantage of the feature extraction of deep learning, contrastive loss maps the original input space to a Euclidean space, directly constraining the feature distance of samples [46]. Triplet loss further considers the relative relationship between intra-class pairs and inter-class pairs [47]. By optimizing the triplet loss, the distance between inter-class features (anchor-negative) becomes longer than that between intra-class features (anchor-positive). Triplet loss performs well for extreme classification tasks such as face recognition and person retrieval [48], [49]. In this paper, the idea behind triplet loss of constructing positive and negative sample pairs is transferred and improved to classify CXRs based on multi-label information.
Most of the above medical image classification studies make use of CNNs by fine-tuning existing deep learning models such as AlexNet, VGGNet, GoogLeNet, ResNet and DenseNet. Fine-tuning is a transfer learning method, proposed to overcome problems caused by training models directly on relatively small-scale datasets, such as over-fitting and poor robustness [50]- [53]. In the field of medical images, datasets as large as ImageNet are very difficult to obtain because of expensive expert annotations [24]. Therefore, despite the success in natural image processing, the performance of deep CNNs trained directly on medical images is limited. Transfer learning resolves, to some extent, the conflict between the use of deep learning methods and limited-scale medical datasets. Except for Xu et al., who trained a two-class classifier from scratch, the above research teams trained classification models for ChestX-ray14 by transferring parameters trained on ImageNet [13], [29], [30], [35]- [37]. These studies demonstrate the effectiveness of deep transfer learning from the natural image domain to CXRs. Different from them, this paper involves not only transfer between different domains, but also transfer within the same domain.

III. PROPOSED CXNet-m3
The outline of our overall method for the classification of CXR lesion types is shown in Fig. 1. Taking advantage of multiple labels, including lesion type, patient identity (ID), gender, age and view position, the quintuplet mining algorithm is first presented to mine quintuplets from the ChestX-ray14 database. After features are extracted by the parameter-shared feature extractors, four feature vectors are sent to the measurer and the remaining one is sent to the classifier. The initial parameters of the feature extractor are transferred from ImageNet, and the parameters of the classifier are pre-trained before the formal training. The whole model is optimised by the proposed quintuplet loss during training. The proposed loss function and mining algorithm are discussed in the first subsection, and the model architecture is described in the second subsection.

A. PROBLEM FORMULATION
1) CLASSIFICATION LOSS
To aid diagnosis, deep learning can be used to train an end-to-end multi-lesion classification model. Each input of the model is a chest X-ray image I, and the output is a K-dimensional prediction vector, where K is the number of lesion types. Each predicted value ranges from 0 to 1, and the position of the largest value gives the lesion type predicted by the model. In order to avoid getting trapped in local optima, the softmax-based cross-entropy loss function is used to optimize the model parameters, as shown in (1), where n is the number of training images, y_i is the lesion type label, and p_i ∈ [0, 1] is defined as in (2), where z_i ∈ Z is the input of the softmax layer. We use M to denote the whole non-linear model and θ_f to denote the parameter vector of M. The aim of training M is to find the best parameter combination in the parameter space θ_F by optimising C_cla, as shown in (3), where argmin denotes the argument that minimizes the objective, n is the number of training chest X-ray images, I denotes a training image, y_i is the label of image I, and C_i is the loss of image I, with C_cla = C_i(M(I|θ_f), y_i).
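For reference, a standard softmax cross-entropy formulation that is consistent with the symbols defined above can be written as follows. This is a sketch rather than the paper's exact equations: the class index k, the one-hot encoding y_{i,k}, the per-image notation I_i, and the optimum θ_f^* are assumptions introduced here for illustration.

```latex
% Cross-entropy over n training images and K lesion types (assumed one-hot labels y_{i,k})
C_{cla} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \log p_{i,k}

% Softmax over the inputs z_{i,k} of the softmax layer
p_{i,k} = \frac{\exp(z_{i,k})}{\sum_{j=1}^{K} \exp(z_{i,j})}

% Parameter search over the parameter space \theta_F
\theta_f^{*} = \operatorname*{argmin}_{\theta_f \in \theta_F} \; \frac{1}{n} \sum_{i=1}^{n} C_i\bigl(M(I_i \mid \theta_f),\, y_i\bigr)
```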

2) DML-BASED LOSS
a: LESION-WISE LOSS
Deep metric learning (DML) implements classification by optimizing the distance between features in the embedding layer. Among DML losses, triplet loss is widely used in the field of face recognition, for example in FaceNet [47]. FaceNet takes two face pictures belonging to the same person as the anchor sample anc and the positive sample pos, and takes a picture not belonging to this person as the negative sample neg. The idea of triplet loss is that d(anc, pos), the distance between anc and pos, should be less than d(anc, neg), the distance between anc and neg. Guided by this idea, a deep model can be trained by optimizing the triplet loss function shown in (4).
The idea of triplet loss can be transferred to classification task of chest radiographs. Assume that there is an ideal model that can correctly classify all the chest radiographs. This model should be able to accurately capture the discriminative features of each lesion. Therefore, the extracted features between CXRs belonging to the same lesion should be highly similar, while the extracted features between CXRs belonging to different lesion types should be highly different. It can be set as an optimization goal to train the parameters of the multi-lesion classification model we build, just like (4). The only difference is that anc, neg and pos are selected based on the type of lesion rather than the person's ID.
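For reference, the triplet objective popularized by FaceNet is usually written in the hinge form below. Here f(·) denotes the embedding network and m > 0 a margin; these symbols, and the choice of squared Euclidean distance for d, are assumptions for illustration and may differ from the exact form of (4).

```latex
% Hinge-form triplet loss: push d(anc, neg) beyond d(anc, pos) by at least the margin m
\mathcal{L}_{\text{triplet}} = \sum_{i} \max\bigl( d(anc_i, pos_i) - d(anc_i, neg_i) + m,\; 0 \bigr),
\qquad d(a, b) = \lVert f(a) - f(b) \rVert_2^2
```

In the lesion-wise setting described above, anc and pos would share a lesion type while neg would carry a different one.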

b: PATIENT-WISE LOSS
In the above subsections, the model is optimized under the supervision of lesion labels. In fact, the patient ID is also an important label, because medical images from the same patient are likely to be more similar. Therefore, a patient-wise split is often required when constructing training, validation and test sets. Such a split prevents the measured performance of the model from being biased by images of the same patient appearing in different subsets.
However, instead of merely avoiding the problems caused by shared patient IDs, it is better to utilise the patient ID for optimization. Ideally, suppose that there are some chest radiographs containing lesion type x and a 100% accurate multi-lesion classification model. The model extracts powerful features and accurately classifies the radiographs into the same class x. In this case, patient-wise differences are no longer obvious, which can be set as part of the optimization goal for training the classification model. We set up a similar sample sim with the same lesion type and the same patient ID as the anchor sample anc, and a positive sample pos with the same lesion type as anc but a different patient ID. The distance between anc and pos, d(anc, pos), should be close to the distance between anc and sim, d(anc, sim). Guided by this idea, the loss function shown in (5) is formulated.
Taking advantage of the lesion-wise loss and the patient-wise loss, a DML-based loss is constructed as (6). As shown in Figure 2, this DML-based loss minimizes the distance between an anchor and a positive CXR, and maximizes the distance between the anchor and a negative CXR. At the same time, it makes the distance between the anchor and the positive CXR as close as possible to the distance between the anchor and the similar CXR.

3) QUINTUPLET LOSS
In summary, C_cla is a direct classification loss function, while C_dml provides an optimization target for the multi-lesion model based on lesion type and patient ID. Our strategy is to take a weighted sum of C_dml and C_cla. Although metric learning is not the main task, optimizing C_dml guides the model to learn discriminative features for classification. Taking advantage of the classification loss and the DML-based loss, the quintuplet loss is formulated as (7). According to this loss function, five CXRs are fed into the model at each training step: the anchor sample twice, one positive sample, one negative sample and one similar sample. This is why formula (7) is called the quintuplet loss.
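The way the three terms combine can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the hinge form of the lesion-wise term, the absolute-difference form of the patient-wise term, the Euclidean distance, and the `margin` value are all assumptions, and the function names are hypothetical. The defaults `alpha=0.8` and `beta=0.5` follow the weights reported in the hyperparameter setting.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each embedding (row) to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def euclidean(a, b):
    """Row-wise Euclidean distance between two batches of embeddings."""
    return np.linalg.norm(a - b, axis=-1)


def cross_entropy(probs, labels, eps=1e-12):
    """Mean softmax cross-entropy given class probabilities and integer labels."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + eps)))


def quintuplet_loss(probs, labels, anc, sim, pos, neg,
                    alpha=0.8, beta=0.5, margin=0.2):
    """Weighted sum of classification loss and DML-based loss (assumed forms).

    probs/labels come from the classifier branch (fed with the anchor image);
    anc/sim/pos/neg are embeddings from the measurer branch.
    """
    anc, sim, pos, neg = (l2_normalize(v) for v in (anc, sim, pos, neg))
    d_ap = euclidean(anc, pos)
    d_an = euclidean(anc, neg)
    d_as = euclidean(anc, sim)

    # Lesion-wise term: anchor-positive should be closer than anchor-negative.
    lesion = np.maximum(d_ap - d_an + margin, 0.0)
    # Patient-wise term: d(anc, pos) should be close to d(anc, sim).
    patient = np.abs(d_ap - d_as)

    c_dml = float(np.mean(beta * lesion + (1.0 - beta) * patient))
    c_cla = cross_entropy(probs, labels)
    return alpha * c_cla + (1.0 - alpha) * c_dml
```

With a perfectly classified anchor and a negative that lies far from the anchor, both terms vanish and the loss is approximately zero; when the negative coincides with the anchor, only the margin contributes.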

4) QUINTUPLET MINING ALGORITHM
In addition to lesion type and patient ID, each CXR in the ChestX-ray14 database also carries a patient gender label ∈ {0, 1}, a view position label ∈ {0, 1}, and patient age information. One strategy for making use of them is to treat them as three-dimensional features concatenated with the high-dimensional image features [13]. However, next to high-dimensional image features, these low-dimensional features struggle to play an important role. Another strategy is to integrate them into the quintuplet loss function and then provide online supervision for the model along with the lesion type. However, this involved more hyper-parameters and slower convergence in our experiments. Finally, we decided to use this information to mine the quintuplets, achieving off-line selection to accelerate model convergence. Generating all possible quintuplets would result in a very large number of data pairs that easily fulfill the constraints in formula (4) and formula (5). These quintuplets would not contribute to the training but would slow down the convergence of the model. It is therefore crucial to select relatively hard quintuplets according to the multi-view labels.
First, symbols are used to define some relationships, as shown in Table 1. According to these relationships, a total of three constraints are added to the pairs that appear in formula (4) and formula (5). As a hard pair, the anchor sample and the positive sample should be as far apart as possible in every respect except the type of lesion. The relationship between the anchor sample and the positive sample should satisfy the constraint shown in formula (8). In contrast, the anchor sample and the negative sample should be as close as possible: although the type of lesion is different, gender and view position should be the same. The relationship between the anchor sample and the negative sample should satisfy the constraint shown in formula (9). The relationship between the anchor sample and the similar sample should satisfy the constraint shown in formula (10). Based on these constraints, the quintuplet mining algorithm is proposed for quintuplet selection, as shown in Algorithm 1.
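The selection logic described above can be sketched in plain Python as follows. This is a hedged illustration only: the `CXR` record, the predicate names, and `mine_quintuplets` are hypothetical, and the concrete conditions (positives differ from the anchor in gender, view position and by an age gap above 10; negatives match the anchor in gender, view position and age) are assumptions drawn from the prose and the hyperparameter setting rather than from the exact constraints (8) to (10). Random sampling of `per_anchor` quintuplets is a simplification of the Q-N-A setting.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CXR:
    """Hypothetical record holding the labels used for mining."""
    image_id: str
    lesion: str
    patient_id: int
    gender: int  # 0 or 1
    view: int    # 0 or 1
    age: int


def is_similar(anc, c):
    """Same lesion and same patient, but a different image."""
    return (c.image_id != anc.image_id and c.lesion == anc.lesion
            and c.patient_id == anc.patient_id)


def is_hard_positive(anc, c, min_age_gap=10):
    """Same lesion, but as different as possible otherwise (assumed rule)."""
    return (c.lesion == anc.lesion and c.patient_id != anc.patient_id
            and c.gender != anc.gender and c.view != anc.view
            and abs(c.age - anc.age) > min_age_gap)


def is_hard_negative(anc, c):
    """Different lesion, but as similar as possible otherwise (assumed rule)."""
    return (c.lesion != anc.lesion and c.patient_id != anc.patient_id
            and c.gender == anc.gender and c.view == anc.view
            and c.age == anc.age)


def mine_quintuplets(records, per_anchor=4, seed=0):
    """Return (anc, anc, sim, pos, neg) tuples for every eligible anchor.

    Quadratic in the number of records; fine for a sketch, too slow for
    the full dataset without indexing by lesion and patient.
    """
    rng = random.Random(seed)
    quintuplets = []
    for anc in records:
        sims = [c for c in records if is_similar(anc, c)]
        poss = [c for c in records if is_hard_positive(anc, c)]
        negs = [c for c in records if is_hard_negative(anc, c)]
        if not (sims and poss and negs):
            continue  # anchor is not eligible
        for _ in range(per_anchor):
            quintuplets.append((anc, anc, rng.choice(sims),
                                rng.choice(poss), rng.choice(negs)))
    return quintuplets
```

The anchor appears twice in each tuple because one copy feeds the classifier while the other feeds the measurer together with the similar, positive and negative samples.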

B. MODEL ARCHITECTURE
In the problem formulation subsection, a method is proposed to jointly supervise the deep model using the classification loss and the embedding layer distance. Accordingly, an end-to-end deep model with five inputs is designed, as shown in Figure 1.
After comparing the current multi-lesion classification methods for ChestX-ray14, we find that transfer learning is widely used and more effective than training from scratch. Further, Baltruschat et al. compared some existing deep convolutional network models that could be used for transfer learning on ChestX-ray14. They found that ResNet had the best transferability [13]. We therefore choose ResNet-50 as the backbone of our model. As shown in Figure 1, the model is expanded into three parts based on ResNet-50: a feature extractor, a classifier, and a measurer. The network structure of the feature extractor is the same as that of ResNet-50 except for the classification layers, as shown in parts (1) and (4) of Figure 1. Such a network structure can benefit from transfer learning by directly loading the parameters trained on ImageNet. The classifier consists of a BatchNorm layer, a Flatten layer, and a Dense layer, as shown in part (3) of Figure 1. The BatchNorm layer normalizes the learned features, and the Dense layer outputs the predicted classification results. The measurer first uses a Concatenate layer to connect the down-sampled features of the anchor sample, similar sample, positive sample and negative sample. After L2 regularization, the output feature vectors are measured, as shown in part (2) of Figure 1. The classifier and the measurer are not connected to each other, but both are connected to the last layer of the parameter-shared feature extractors.

Algorithm 1 Quintuplet Mining Algorithm
Input: Chest X-ray14 dataset;
Output: Quintuplets generated from Chest X-ray14 dataset;
1: Sort CXRs according to patient ID;
2: for each img1, img2, img3, img4 in CXR dataset do
3: ...
     Select anc & sim from Obj, pos from Pos, neg from Neg randomly;
     Put anc*2, sim, pos, neg together into a quintuplet;
10: end while
11: end for
The parameters of classifier and the measurer are separately optimized by two different loss functions from quintuplet loss, while the design of parameter sharing allows them to jointly optimize feature extractors. Although classification performance is our only concern, the measurer is used to further enhance the capabilities of the feature extractors.

A. DATASET
The dataset used in this study is ChestX-ray14, established by researchers from the National Library of Medicine and the Clinical Center of the NIH. ChestX-ray14 is one of the largest publicly accessible chest X-ray databases and can be downloaded from https://nihcc.app.box.com/v/ChestXray-NIHCC. Its release triggered research on chest radiographs by the deep learning community. ChestX-ray14 contains 112,120 chest X-ray images from more than 30,000 patients, labeled with 14 kinds of lesion types: Infiltration, Effusion, Atelectasis, Nodule, Mass, Pneumothorax, Consolidation, Pleural Thickening (PT), Cardiomegaly, Emphysema, Edema, Fibrosis, Pneumonia and Hernia. In ChestX-ray14, 60,361 CXRs are lesion-free, while the other 51,759 CXRs are abnormal. Despite the large scale, Table 2 shows that the lesion distribution in ChestX-ray14 is imbalanced.
Each CXR has a resolution of 1024 × 1024 and a unique image identity. In addition to the type of lesion, CXRs are also labeled with other basic information such as patient ID (30,805 patients), gender (female or male), view position (posteroanterior (PA) or anteroposterior (AP)), and age. The vast majority of chest radiographs are concentrated in the 20 to 80 age group, with the largest number in the 50 to 60 age group. Table 6 shows that the gender-wise and view position-wise distributions are relatively uniform. Therefore, the impact of gender and view position on classification accuracy cannot be ignored.

B. METRICS
AUC, accuracy, recall, precision, and F-value are important and commonly used metrics for classification tasks in the field of machine learning. Four quantities are first defined:
• true positive as TP: The prediction is positive and the prediction is true.
• true negative as TN: The prediction is negative and the prediction is true.
• false positive as FP: The prediction is positive and the prediction is false.
• false negative as FN: The prediction is negative and the prediction is false.

1) ACCURACY [54]
The classification accuracy rate refers to the proportion of correctly classified samples among all samples. The accuracy rate A is defined as (11):
$$A = \frac{TP + TN}{TP + TN + FP + FN} \quad (11)$$

2) PRECISION [55]
The classification precision rate is the ratio of the number of positive samples classified correctly to the number of samples determined by the classifier to be positive. The precision rate P is defined as (12):
$$P = \frac{TP}{TP + FP} \quad (12)$$

3) RECALL [56]
The classification recall rate refers to the ratio of the number of positive samples that are correctly classified to the number of samples that are truly positive. The recall rate R is defined as (13):
$$R = \frac{TP}{TP + FN} \quad (13)$$

4) F-VALUE [57]
In most cases, the higher the recall rate, the lower the precision rate, and vice versa. Therefore, neither P nor R alone can fully measure the performance of a classification model. The F-measure value is defined to take both P and R into consideration, as in (14), where α² is a weight factor; when α² = 1, P and R are equally weighted.
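The four count-based metrics above can be computed as in the following sketch. The function name `classification_metrics` is hypothetical, and the F-value uses the common weighted harmonic-mean form (1 + α²)·P·R / (α²·P + R), which is an assumption that may differ from the exact form of (14); both agree when α² = 1.

```python
def classification_metrics(tp, tn, fp, fn, alpha_sq=1.0):
    """Compute accuracy, precision, recall and F-value from confusion counts.

    alpha_sq plays the role of the weight factor alpha^2; the weighted
    F form used here is the standard one and is an assumption.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = alpha_sq * precision + recall
    f_value = (1.0 + alpha_sq) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_value": f_value,
    }
```

For example, with 8 true positives, 80 true negatives, 2 false positives and 10 false negatives, accuracy is 0.88 while recall is only about 0.44, which illustrates why accuracy alone can be misleading on imbalanced classes.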

5) AUC [58]
The AUC is defined as the area under the receiver operating characteristic (ROC) curve, which typically has the False Positive Rate on the horizontal axis and the True Positive Rate on the vertical axis. The True Positive Rate (Sensitivity) is computed as TP/(TP + FN) and the False Positive Rate is defined as FP/(TN + FP). Using the AUC value as the evaluation standard is clearer and more direct than using the ROC curve. A larger AUC means better classification performance.
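As a small illustration, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC through this pairwise view; the function name `roc_auc` is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """AUC for binary labels (1 = positive, 0 = negative) via pairwise ranking."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

For the multi-lesion setting, this would be evaluated once per lesion type, treating that lesion as the positive class, and the per-class values averaged.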

C. EXPERIMENTAL SETUP
1) EXPERIMENT ENVIRONMENT
The experiments are conducted on an Ubuntu Linux server with 32 GB of random access memory (RAM) and a 16-core central processing unit (CPU). Both the quintuplet mining procedure and the model architecture are developed with Python and deep learning libraries (e.g., Keras and TensorFlow). The whole model is trained using two GeForce GTX 1080 Ti graphics processing units (GPUs).

2) TRAINING DETAILS
In our experiments, the CXRs in the ChestX-ray14 database are divided into a training set, a validation set and a test set at a ratio of 8:1:1. During training, the parameters of the ResNet-50 feature extractor, pre-trained on ImageNet, are first transferred to our model. Then the parameters of the classifier are trained by freezing the feature extractor and optimizing formula (1), using all lesion-labeled CXRs in ChestX-ray14 except those in the test set and validation set. After that, the feature extractor parameters trained on ImageNet and the classifier parameters trained on ChestX-ray14 are loaded into the whole model. In the last step, the classifier, the measurer, and the high-level layers of the feature extractor in CXNet-m3 are trained by parameter transfer within the same domain and by optimising formula (7).
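One way to realise an 8:1:1 division while keeping all images of a patient in the same subset, as the patient-wise loss discussion recommends, is sketched below. Whether the split in these experiments was performed per patient is not stated, so this is an assumption; the function name `patient_wise_split` is hypothetical. Note that the ratio is applied to patients, so the image-level ratio is only approximately 8:1:1.

```python
import random


def patient_wise_split(patient_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split image indices into train/val/test without sharing patients.

    patient_ids[i] is the patient ID of image i.
    """
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)

    n_train = int(round(len(unique) * ratios[0]))
    n_val = int(round(len(unique) * ratios[1]))
    train_patients = set(unique[:n_train])
    val_patients = set(unique[n_train:n_train + n_val])

    split = {"train": [], "val": [], "test": []}
    for idx, pid in enumerate(patient_ids):
        if pid in train_patients:
            split["train"].append(idx)
        elif pid in val_patients:
            split["val"].append(idx)
        else:
            split["test"].append(idx)  # remaining patients form the test set
    return split
```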

3) HYPERPARAMETER SETTING
In the quintuplet mining algorithm, A c in Constraint ap is set as larger than 10, while A c in Constraint ap is set as equal to 0.
In the training procedure, the weight α of the classification loss in formula (7) is set to 0.8. In formula (6), the proportions of the patient-wise distance and the lesion-wise distance are each set to 0.5. Based on experience and the validation results, we choose different initial learning rates for the two losses and manually decay them by a factor of 10 while monitoring the loss curve.

D. EXPERIMENTAL RESULTS AND DISCUSSION
Ling et al. discussed AUC in their paper and concluded that AUC is a better measure than accuracy based on formal definitions of discriminancy and consistency [58]. The implicit goal of AUC is to deal with situations where the sample distribution is skewed and over-fitting to a single class should be avoided. That paper recommends using AUC as a ''single number'' measure over accuracy when evaluating and comparing classifiers. As shown in Table 2, the sample distribution in the ChestX-ray14 dataset is extremely uneven. For example, the category with the most samples is Infiltration, with 19,894 CXRs, while the category with the fewest samples is Hernia, with only 227 CXRs. Therefore, AUC is used as the measure for each class in our paper, and the values are compared with the experimental results of state-of-the-art algorithms for multi-lesion classification on this dataset.
As shown in Figure 3, the average AUC values are compared in our experiments under different parameter choices. The ratio of C_dml in formula (7) is set to 0.1, 0.2, 0.3, 0.4 and 0.5, respectively. The difference between line 0, line 1, line 2, and line 3 lies in the quintuplet sample selection strategy, as shown in Table 5. QM indicates whether the proposed quintuplet mining algorithm is used. S-N, P-N, and N-N represent the number of similar samples, the number of positive samples, and the number of negative samples, respectively. Q-N-A denotes the number of quintuplets for each anchor sample. Since the similar samples are very similar to one another, the number of similar samples is always set to 1. Since C_dml and C_cla converge at different rates, the total number of samples should not be too large. Otherwise, C_dml will not have recognized any pattern by the time C_cla has converged. In this case, if the training process is continued, over-fitting will be exacerbated; if the training process is stopped, C_dml will not contribute to the model training. Therefore, Q-N-A is set to three values, 4, 8, and 16, where Q-N-A is the product of S-N, P-N, and N-N. Compared with positive samples, the distribution of negative samples in the feature space is more scattered. Therefore, when Q-N-A is limited, the number of negative samples N-N is set to a larger value.
In Figure 3, the AUC values of line 0 are much lower than those of the other lines. This means that the quintuplet mining algorithm is effective for improving model performance. The use of the quintuplet mining algorithm also speeds up the convergence of the model during training. It takes about a day to converge without the quintuplet mining algorithm, whereas with training data selection the loss value no longer changes significantly after only about 5 hours. As the ratio increases, the AUC values of line 1 and line 2 tend to increase first and then decrease. However, the AUC values of line 3 decrease as the ratio becomes larger. Compared with line 1 and line 2, the Q-N-A value of line 3 is higher, at 16. This means that the quintuplet mining algorithm allocates 16 quintuplets to each eligible anchor sample, so the model converges more slowly during training. The larger the weight of C_dml, the slower the model converges. Line 1 and line 2 reach their peaks at ratios of 0.2 and 0.4, respectively, and both obtain high AUC values. It can be inferred from Figure 3 that the selection of samples and the proportion of the two losses are very important influencing factors. When selecting samples, the quintuplet mining algorithm is effective, and the total sample size should not be too large. In addition, 0.2 to 0.4 is a suitable range for the weight factor of C_dml in our task.
As shown in Table 4, our work is compared with the work of Wang, Z. Li, A. I. Aviles-Rivero, Y. L, and I. M. Baltruschat [13], [29], [35], [38], [39]. Among them, Wang et al. first released the Chest X-ray14 dataset in 2017 and used deep learning models to classify 14 lesions; their work was published at the IEEE Conference on Computer Vision and Pattern Recognition [29]. Li et al. used a simple recognition network to assist the training of ResNet. Although only a few hundred chest radiographs were labeled with lesion location, the auxiliary task greatly improved the classification performance. Their research was published at the IEEE Conference on Computer Vision and Pattern Recognition in 2018 [35]. Although it was only published in 2019, the paper of I. M. Baltruschat et al. already has a relatively high citation rate [13]. They used view position, gender and age information as features to train a deep model and obtained good results. The work of Reference [39] and Reference [38] is also very recent, posted on arXiv in 2019 and 2018, respectively. Table 4 shows that our method achieves an average AUC value of 0.824, which is higher than that of the other research work. Compared with the initial work of Reference [29], the average AUC values of all later research work are higher. Besides our work, the average AUC values of Reference [35] and Reference [13] are also relatively high. Unlike the other studies, both our proposal and these two works use auxiliary information, such as lesion location, patient ID, gender and view position. It can be inferred that the use of auxiliary information contributes to classification to some extent.
In addition to average performance, our model shows better performance for about half of the lesions, such as Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass and Nodule. Although not the best, our results for the other lesions, such as Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, PT and Hernia, are still competitive. However, in terms of AUC values, the model performs worst on Pneumonia. One possible reason is that our model cannot learn strong features of Pneumonia from its small sample set. Table 4 also illustrates that our method has outstanding classification performance for Atelectasis, Cardiomegaly, Infiltration and Nodule, with AUC values higher than 0.9. Comparing Table 2 and Table 4, we find that the lesions with high AUC values have relatively large sample sizes. For Infiltration, the AUC value is improved by nearly 30% compared with the earliest research work in Reference [29]. Deep learning benefits from a large number of training samples, which help it learn more robust features. Although Infiltration has the largest sample size, its AUC value had not been consistently high before our work. One possible reason is that the classification features of Infiltration are difficult to learn. Our method is specifically designed to enhance the learning of classification features, thus improving the classification performance of large sample classes.
For Chest X-ray14, whose sample distribution is uneven, all state-of-the-art methods compared in Table 4 use only AUC as the evaluation metric. As auxiliary measures, we also verify the overall model performance in terms of accuracy and F-score, which takes both precision and recall into account. When the accuracy and F-score on the training set reach 0.925 and 0.950, respectively, the accuracy and F-score on the validation set reach their maximum. At this point, the accuracy and F-score on the test set are 0.749 and 0.853 on average, respectively. Generally, as a robust evaluation metric based on all cutoff values, AUC takes a smaller value than accuracy. In our experiments, the accuracy value is relatively small, which may be caused by a large cutoff threshold.
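To illustrate why accuracy and F-score depend on the cutoff threshold while AUC does not, the following minimal sketch computes all three on a toy binary problem. The labels and scores are invented for illustration and are not taken from our experiments; the function names `auc_score` and `accuracy_and_f1` are hypothetical helpers, and AUC is computed with the standard pairwise-ranking definition.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_and_f1(labels, scores, threshold):
    """Accuracy and F-score after binarizing scores at the given cutoff."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Toy data (illustrative only): 3 positives and 5 negatives.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

print("AUC:", auc_score(labels, scores))
for cutoff in (0.35, 0.8):
    print(cutoff, accuracy_and_f1(labels, scores, cutoff))
```

On this toy data the AUC is the same whichever cutoff is chosen, whereas raising the cutoff from 0.35 to 0.8 lowers both accuracy and F-score because two positives are no longer detected.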
In addition to the classification performance, the complexity of the CXNet-m3 model is also analyzed and compared with existing results. In neural networks, the spatial complexity is determined by the number of parameters, which is related to the number of convolution kernels, the number of output channels and the number of layers. The larger the number of parameters, the higher the spatial complexity of the model. The number of floating-point operations (FLOPs) is used to measure the time complexity of the model, which is related to the number of feature maps, the number of convolution kernels, the number of output channels and the number of layers [59]. FLOPs refers to the number of additions and multiplications in the model. The larger the FLOPs value, the higher the time complexity of the model. Table 7 compares the number of parameters (Params) and the FLOPs of CXNet-m3 with those of other methods. Among them, Reference [29]-1 to Reference [29]-4 are improved versions of AlexNet, GoogLeNet, VGGNet and ResNet, respectively. Correspondingly, their model sizes and FLOPs values are similar to those of the respective base models. Although all based on ResNet, References [35], [13] and our method have different spatial and time complexities due to different improvement strategies. The model size and FLOPs value of Reference [13] and our method are similar to those of the standard ResNet, while the method in Reference [35] needs more parameters and larger FLOPs because of the changed image scale and the added recognition network. Classification in Reference [38] is designed based on both the standard ResNet and the standard DenseNet, which means that its parameter number and FLOPs should be close to the sum of those of ResNet and DenseNet. It can be seen from Table 7 that the number of parameters in our model is the second smallest, which means relatively low space complexity [...] [35] and Reference [38]. FLOPs is related to the size of the input image. Except that Reference [35] sets the input image size to 299 × 299, the other FLOPs in Table 7 are calculated based on a 224 × 224 input image. During training, the CXNet-m3 model is more time-consuming than single-input models because it needs to process 5 images at a time. However, only one input is open during the inference process of the CXNet-m3 model, which saves considerable time according to Table 7.
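The dependence of FLOPs, but not of the parameter count, on the input image size can be checked for a single convolutional layer. The sketch below is an illustration under stated assumptions rather than the procedure used for Table 7: the helper name `conv2d_cost` is hypothetical, FLOPs are counted as one multiplication plus one addition per multiply-accumulate (counting conventions differ between papers), and the example layer is a ResNet-style stem (3 input channels, 64 output channels, 7 × 7 kernel, stride 2, padding 3, no bias).

```python
def conv2d_cost(in_ch, out_ch, kernel, in_size, stride=1, padding=0, bias=True):
    """Parameter count, FLOPs and output size of one square 2-D convolution.

    Assumption: FLOPs = 2 * multiply-accumulates (one multiply, one add).
    """
    out_size = (in_size + 2 * padding - kernel) // stride + 1
    params = out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)
    macs = out_ch * out_size * out_size * in_ch * kernel * kernel
    return params, 2 * macs, out_size


# Same layer evaluated at the two input sizes mentioned in the text.
for size in (224, 299):
    params, flops, out_size = conv2d_cost(
        3, 64, 7, size, stride=2, padding=3, bias=False)
    print(f"input {size}: params={params}, FLOPs={flops}, output={out_size}")
```

The parameter count stays at 9408 for both inputs, while the FLOPs grow with the output feature map (112 × 112 versus 150 × 150). By the same reasoning, processing five images per training step multiplies the forward cost by roughly five, whereas inference through a single branch keeps the cost of one backbone pass.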

V. CONCLUSION
Chest X-rays are the most common imaging examination tool used to detect lesions related to the heart, lungs and respiratory system. In this paper, a deep multi-lesion classification model, CXNet-m3, is proposed for the CXRs in Chest X-ray14 to aid diagnosis. In CXNet-m3, easily accessible labels are explored to assist the classification of lesion types. To enhance the classification performance, a DML-based loss function is first constructed using the labels of lesion type and patient ID. Then, a deep model taking advantage of transfer learning is built with quintuplet inputs to optimize both the DML-based loss function and the classification loss function. To overcome the problem of slow convergence, a quintuplet mining algorithm for the selection of training samples is proposed based on the labels of lesion type, view position, patient ID, patient age and patient gender. The experimental results show that our method achieves better AUC values than some state-of-the-art methods for the classification of multiple lesions in Chest X-ray14. The analysis of the experimental results also shows that our method is particularly effective at improving the classification performance of large sample categories. The disadvantage of this method is that CXNet-m3 involves many hyper-parameters. How to select the best hyper-parameters in a reasonable or adaptive way to make further improvements will be the focus of our future work.

She is currently a Professor with the College of Information Science and Technology, Beijing Normal University. She is the author or coauthor of more than 100 articles. Her current research interests include knowledge representation and acquisition for the Internet of Things, dynamic spectrum allocation, and big data analysis and application.