Breast Cancer Diagnosis in Two-View Mammography Using End-to-End Trained EfficientNet-Based Convolutional Network

Some recent studies have described deep convolutional neural networks to diagnose breast cancer in mammograms with similar or even superior performance to that of human experts. One of the best techniques does two transfer learnings: the first uses a model trained on natural images to create a "patch classifier" that categorizes small subimages; the second uses the patch classifier to scan the whole mammogram and create the "single-view whole-image classifier". We propose to make a third transfer learning to obtain a "two-view classifier" that uses the two mammographic views: bilateral craniocaudal and mediolateral oblique. We use EfficientNet as the basis of our model. We "end-to-end" train the entire system using the CBIS-DDSM dataset. To ensure statistical robustness, we test our system twice using: (a) 5-fold cross validation; and (b) the original training/test division of the dataset. Our technique reached an AUC of 0.9344 using 5-fold cross validation (accuracy, sensitivity and specificity are 85.13% at the equal error rate point of ROC). Using the original dataset division, our technique achieved an AUC of 0.8483, which is, as far as we know, the highest reported AUC for this problem, although the subtle differences in the testing conditions of each work do not allow for an accurate comparison. The inference code and model are available at https://github.com/dpetrini/two-views-classifier


I. INTRODUCTION
Major medical and governmental health agencies endorse mammography screening programs because they reduce breast cancer-specific mortality, and nowadays more and more women adhere to this recommendation. As a consequence, the number of mammograms that must be analyzed is increasing day after day. Mammograms must be interpreted by experienced radiologists to achieve a low error rate.
The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.
To help radiologists, CAD (Computer-Aided Detection and Diagnosis) systems have been and are being developed.
Recently, there has been a revolution in artificial intelligence (AI) and computer vision with the introduction of the deep convolutional neural network (CNN) [1]-[3]. Some recent works have proposed using CNNs to diagnose cancer in mammograms. However, there are important differences between classifying natural images and mammograms. In natural images, the target that defines the image category occupies a large area. This does not happen in mammograms, where the cancer tissue may occupy only a tiny area. Consequently, directly training a CNN or making a conventional transfer learning to classify mammograms usually does not work well.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Shen et al. [4] present a good idea to overcome this challenge, which consists of performing two transfer learnings. The first uses a model trained on the ImageNet [5] natural images to initialize the ''patch classifier'' that classifies small mammogram patches into five categories: background, benign calcification, malignant calcification, benign mass, and malignant mass. The second uses the patch classifier to initialize the ''single-view whole-image classifier'' that is end-to-end trained using whole mammograms with cancer status. In other words, they first build the patch classifier because it is easier than building a whole-image classifier. Subsequently, the patch classifier scans the entire mammogram, generating attribute maps that describe the likelihood of having different types of lesions in each region of the mammogram. The whole-image classifier uses these maps to make the final classification and is end-to-end trained.
In this paper, we propose some improvements to Shen et al.'s method to increase its performance:
(1) The original technique used ResNet [6] and VGG [7] as the base models. We replaced them with the more recent EfficientNet [8].
(2) Standard mammography consists of two views for each breast: bilateral craniocaudal (CC) and mediolateral oblique (MLO). The original algorithm processes only one view at a time and, to take the two views into account, it simply averages the scores of the two views processed independently. Our technique performs a third transfer learning, in addition to the original two, to take into account the two views. We use the single-view classifier to initialize the ''two-view classifier'' and then the entire system (patch, single-view and two-view classifiers) is end-to-end trained, using two-view mammograms with cancer status.
With the above improvements, together with test-time augmentation (TTA) and an ensemble of four models with the same architecture, we achieved an AUC (Area Under ROC Curve) of 0.9344 ± 0.0341 in 5-fold cross-validation using the CBIS-DDSM dataset (accuracy, sensitivity and specificity are 85.13% at the equal error rate point of the ROC). It is known that a substantially smaller AUC is obtained using the original CBIS-DDSM training/test division [9]. In this condition, we obtained an AUC of 0.8483 ± 0.0253 (with TTA). As far as we know, this is the largest AUC reported for this problem.
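To make the phrase "equal error rate point" concrete: it is the ROC operating point where sensitivity equals specificity, and at that point accuracy takes the same value regardless of class balance, which is why a single figure is quoted for all three. The following is a minimal sketch on made-up toy scores, not the authors' evaluation code; the function names `eer_point` and `accuracy` are hypothetical.

```python
def eer_point(scores, labels):
    """Return (threshold, sensitivity, specificity) at the threshold
    where |sensitivity - specificity| is smallest (toy EER search)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = None
    for t in sorted(set(scores)):
        sens = sum(s >= t for s in pos) / len(pos)  # true positive rate
        spec = sum(s < t for s in neg) / len(neg)   # true negative rate
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, t, sens, spec)
    _, t, sens, spec = best
    return t, sens, spec


def accuracy(scores, labels, threshold):
    """Fraction of samples classified correctly at the given threshold."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(labels)


# Toy data (illustrative only, not taken from the paper).
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
t, sens, spec = eer_point(toy_scores, toy_labels)
acc = accuracy(toy_scores, toy_labels, t)
print(t, sens, spec, acc)
```

When sensitivity and specificity are equal, accuracy is their weighted average over the positive and negative classes and therefore coincides with both.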
A previous version of this work was presented at the 2021 AACR annual meeting and published as an abstract [10], [11].

II. LITERATURE REVIEW
A. CNN-BASED BREAST CANCER DIAGNOSIS
Recently, the deep convolutional neural network (CNN) has been applied with remarkable success in different areas. Some recent CNN-based breast cancer diagnostic systems show similar or even better performance than human specialists.
Kooi et al. [12] compared the classification of mammography ROIs by a state-of-the-art classic method, a CNN-based method and radiologists. They concluded that the CNN has a performance comparable to that of radiologists and superior to that of the classic method.
Rodriguez et al. [13] compared a CNN-based commercial system (Transpara 1.4.0) with 101 radiologists, using 9 datasets from different institutions in the US and Europe. The AUC of the AI system was 0.840 while the mean AUC of the radiologists was 0.814. Therefore, the AI was better than the average of radiologists but its performance was inferior to that of the best radiologist.
Schaffter et al. [14] describe the ''DM DREAM Challenge'' fostered to develop AI algorithms for interpreting mammograms, held between September 2016 and November 2017. The top-performing single algorithm achieved an AUC of 0.858 (in the US dataset) and 0.903 (in the Swedish dataset). No single or ensemble algorithm outperformed radiologists.
McKinney et al. [15] present an AI system that surpasses human experts in breast cancer prediction. This system consists of an ensemble of three deep learning models that were tested on private UK and US datasets and achieved AUCs of 0.889 and 0.8107, respectively.
Wu et al. [16] designed a four-view deep learning model. They achieved an AUC of 0.895 in predicting cancer using 4 views, which is higher than the radiologists' average AUC of 0.778. Although both Wu et al.'s work and ours use multiple views to classify cancer, there are fundamental differences that we explain in Section IV-C4.

B. PUBLIC MAMMOGRAM DATASETS
Currently, the DDSM [17] is the largest public mammogram dataset, with 2,620 exams; it contains normal, benign and malignant cases with verified pathological information. CBIS-DDSM [18] is an updated and curated version of DDSM, organized to make it easier to use. It consists of 3,103 mammography images. We use this dataset to train and test our system. Table 1 summarizes the number of mammograms in this dataset. The InBreast public dataset contains only 115 cases with 410 images [19], which is too small to be used in deep learning. The recently published public dataset CSAW-M [20] does not classify lesions into normal/benign/malignant classes and therefore cannot be used in our study. Other recent public datasets such as KAU-BCMD [21] or VinDr-Mammo [22] lack verified pathological information and are not fully available at the time of this writing.

C. COMPARING CAD PERFORMANCES
It is difficult to compare the techniques described in different papers even when all systems use the same dataset (e.g. CBIS-DDSM). Many works in the literature randomly divide this dataset into training and test sets [4], [9]. This procedure can generate biased results, as there is the possibility of randomly choosing a test set that is easy (or difficult) to classify. This phenomenon can be seen in our own results. When the CBIS-DDSM dataset is randomly divided into 5 subsets and our two-view classifier is trained using 4 subsets and tested on the remaining set, the 5 obtained AUCs vary from 0.90 to 0.99 (4 models with TTA, see Table 5). Thus, if we were lucky in the random division, our two-view classifier would reach an astonishing AUC of 0.99 and, if we were unlucky, it would reach only 0.90. Neither of the two values reflects the true performance of our system. Consequently, the results obtained using a single random training/test division are unreliable.
In particular, a remarkably small AUC is obtained when the official training/test division of CBIS-DDSM is used. Shen et al. [4] obtained an AUC of 0.87 using a random training/test division but, using the official division, their system achieves only an estimated AUC of ∼0.75 [24], or 0.7522 ± 0.0105 in our own tests emulating their experiments (single runs). Similarly, Wei et al. [9] obtained an AUC of 0.9182 using a random division but only 0.7964 using the official division (single runs). According to Wei et al., this happens because the testing data is another holdout set acquired at a different time.
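The dependence of the AUC on the chosen test set can be illustrated with a small sketch. AUC equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen non-malignant one (ties counted as one half), so it can be computed by counting pairs. The toy scores below are invented for illustration and the function name `auc` is hypothetical; this is not the evaluation code of any of the cited works.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two hypothetical test subsets scored by the same (imaginary) classifier.
easy_scores, easy_labels = [0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]
hard_scores, hard_labels = [0.3, 0.6, 0.4, 0.7], [0, 0, 1, 1]

print(auc(easy_scores, easy_labels))  # AUC on the "easy" subset
print(auc(hard_scores, hard_labels))  # AUC on the "hard" subset
print(auc(easy_scores + hard_scores, easy_labels + hard_labels))  # pooled
```

The same classifier looks perfect on one subset and much weaker on the other, which is the effect that a single random split can hide.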

D. RECENT WORKS THAT USE CBIS-DDSM
Our work is based on Shen et al.'s [4]. Making a random division of the CBIS-DDSM dataset, they obtained AUCs of 0.87, 0.88, and 0.91 for, respectively, a single model without TTA, a single model with TTA and an ensemble of four models with TTA but, as we argued before, these results are unreliable. Besides Shen et al., there are more recent works that use CBIS-DDSM to train and test their convolutional models.
Shu et al. [25] proposed two new pooling techniques and used them instead of the traditional average-pooling or max-pooling layers. The largest AUC obtained by their method was 0.838. It is unclear whether the authors used the original training/test split because they say they used 85/15% of the images for training/testing, while the original dataset is divided into 80/20% (Table 1).
Wei et al. [9] proposed to use neural net morphing instead of the traditional transfer learning. Using the original training/test division, they reported AUCs of 0.796, 0.822 and 0.831 for, respectively, a single model without TTA, a single model with TTA and an ensemble of four models with TTA. They reported an AUC of 0.9427 (with TTA) using a random training/test division but, as we argued before, this result is unreliable.
Almeida et al. [23] compared the performance of classic XGBoost and convolutional VGG16 on CBIS-DDSM images resized to 224 × 224 pixels using the original dataset division and obtained AUCs of 0.6849 and 0.6822 respectively, concluding that the two techniques have similar prediction accuracy when used in low-resolution images.
Panceri et al. [26] selected a small subset of 503 craniocaudal mammograms from CBIS-DDSM with calcification lesions and trained a CNN to distinguish between cancerous and normal patches. The classification of the mammogram is obtained by simply thresholding the patches.

III. METHODOLOGY
In this section, we describe the two test methods used to evaluate the algorithm, the preprocessing steps and data augmentation, and then the implemented CNN architectures for the patch classifier, single-view classifier, and the new two-view classifier.

A. TWO TESTING METHODS
In order to get unbiased results, we did not randomly split CBIS-DDSM into fixed training/test sets. Instead, we repeated the experiments using two different methodologies. The techniques used in both tests are similar, but we introduced some minor improvements in the second test.
1) Cross validation (CV) test: At first, we did the ''CV test'', in which we randomly divided the dataset into 5 subsets and trained and tested our system five times, each time using one of the subsets as the test set and the remaining four as the training set (5-fold cross validation). Then, we computed the mean and standard deviation of the five results. Of the 3,103 original mammograms, we discarded those with only one view as we are proposing a two-view system. We also discarded those classified as '
As preprocessing, we resized all mammograms to 1152 × 896 pixels due to insufficient GPU memory. We subtracted the mean of all training images from training, validation and test sets. In all trainings in this work, we used data augmentation with parameters: rotation ±25°, zoom ±20%, shear ±12%, intensity shift ±20% and horizontal/vertical flips. We used border reflection to fill out the area outside of the image domain. We developed our code in PyTorch.
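The 5-fold protocol of the ''CV test'' can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each item is assumed to be one two-view exam (so that the CC and MLO views of a breast never end up in different folds), the helper names `k_fold_indices` and `summarize` are hypothetical, and the sample standard deviation is used because the text does not say which estimator was applied.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.
    Items are shuffled once, then dealt into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def summarize(values):
    """Mean and sample standard deviation of per-fold results."""
    return mean(values), stdev(values)


# Example with 23 hypothetical exams: every exam is tested exactly once.
for fold, (train, test) in enumerate(k_fold_indices(23, k=5)):
    print(fold, len(train), len(test))
```

Reporting the mean and standard deviation over the five folds, rather than the result of one split, is what makes the figure less sensitive to a lucky or unlucky division.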

C. PATCH CLASSIFIER
We created a ''patch classifier'' similar to the one described by Shen et al. [4], but based on the modern EfficientNet [8] instead of VGG [7] or ResNet [6]. From 3,103 images, we selected 3,568 ROIs (some images have more than one ROI). From each ROI, we selected 20 patches sized 224 × 224: 10 around the ROI and another 10 in the background (Figure 1). To select patches around the ROI, we calculated its center of mass from the corresponding mask and selected an area with 224 × 224 pixels around the center with a random displacement of ±10% of the height/width (inside the white rectangle in Figure 1). Next, we sampled 10 background patches from anywhere in the image except the ROIs. We further divided the patches containing the lesions into 4 subcategories according to their labels in CBIS-DDSM: benign calcification, malignant calcification, benign mass and malignant mass. So a patch can be of 5 types, with the background making up 50% and the remaining categories making up respectively 9.5%, 17.5%, 11.1% and 11.9% for the ''OD test'' and 11.5%, 11.5%, 13.5% and 13.5% for the ''CV test''. We did not use any technique to compensate for this imbalance.
There are 8 models of EfficientNet, numbered from B0 to B7 [8]. EfficientNet-B0 is the smallest model and was designed automatically by neural architecture search. Then, this base model was scaled up in width, depth and resolution of the input image to obtain the remaining seven models. We took EfficientNets pre-trained on ImageNet [5] images and performed transfer learning to classify mammogram patches into 5 categories. As mammograms have only one channel, the same grayscale image feeds EfficientNet's red, green and blue inputs. When an EfficientNet without the top layers is fed with a 224 × 224 patch, it yields a model-dependent number of maps with 7 × 7 attributes. For example, EfficientNet-B0, B4 and B7 generate respectively 1280, 1792 and 2560 maps with 7 × 7 attributes. These maps are average-pooled and pass through a fully-connected layer with five outputs to make the classification into 5 categories.
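The patch sampling described above can be sketched in plain Python. This is an illustration only, not the authors' implementation: the function names are hypothetical, the ±10% displacement is assumed to be relative to the patch size, patch boxes are assumed to be clamped to the image, and a background patch is assumed to be valid only if it contains no ROI pixel.

```python
import random


def center_of_mass(mask):
    """Centroid (row, col) of the non-zero pixels of a binary ROI mask."""
    count = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        raise ValueError("mask has no ROI pixels")
    return row_sum / count, col_sum / count


def clamp(value, low, high):
    return max(low, min(value, high))


def roi_patch_box(mask, patch=224, jitter=0.10, rng=random):
    """Box (top, left, bottom, right) around the ROI centroid, shifted by a
    random displacement of up to +/- jitter * patch in each direction."""
    h, w = len(mask), len(mask[0])
    cy, cx = center_of_mass(mask)
    dy = rng.uniform(-jitter, jitter) * patch
    dx = rng.uniform(-jitter, jitter) * patch
    top = clamp(int(round(cy + dy - patch / 2)), 0, h - patch)
    left = clamp(int(round(cx + dx - patch / 2)), 0, w - patch)
    return top, left, top + patch, left + patch


def background_patch_box(mask, patch=224, rng=random, max_tries=1000):
    """Random box that contains no ROI pixel (rejection sampling)."""
    h, w = len(mask), len(mask[0])
    for _ in range(max_tries):
        top = rng.randint(0, h - patch)
        left = rng.randint(0, w - patch)
        touches_roi = any(
            mask[r][c]
            for r in range(top, top + patch)
            for c in range(left, left + patch)
        )
        if not touches_roi:
            return top, left, top + patch, left + patch
    raise RuntimeError("no background patch found")


# Small toy mask (40 x 40) with a 10 x 10 ROI; patch size reduced to 16.
toy_mask = [[1 if 10 <= r < 20 and 20 <= c < 30 else 0 for c in range(40)]
            for r in range(40)]
rng = random.Random(0)
print(roi_patch_box(toy_mask, patch=16, rng=rng))
print(background_patch_box(toy_mask, patch=16, rng=rng))
```

Calling `roi_patch_box` ten times and `background_patch_box` ten times per ROI would give the 20 patches per ROI mentioned in the text.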

D. SINGLE-VIEW CLASSIFIER
The ''single-view whole-image classifier'' is created from the patch classifier by first removing the fully connected layer with 5 outputs. If this model is fed with a mammogram with 1152 × 896 pixels (instead of a 224 × 224 patch), it will yield 1792 (''CV test'') or 1280 (''OD test'') maps with 36 × 28 attributes that represent the likelihoods of presence of different types of lesions in each region (Figure 2). We added additional layers on top of this model to extract high-level features and classify full mammograms into malignant or non-malignant. We tested many different combinations of EfficientNet base blocks (i.e., MBConv blocks [8], [28]) using: (a) one, two or three MBConv blocks; (b) MBConv blocks with stride 1 or 2. After testing the combinations of these two hyperparameters, we concluded that the best model is obtained using one MBConv block with stride 1 in the ''CV test'' and two MBConv blocks with stride 2 in the ''OD test''.
The output of the last MBConv block is followed by global average pooling and a dense layer with two output categories.
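The map sizes quoted above follow from simple arithmetic. The text states that a 224 × 224 patch yields 7 × 7 maps, which implies an overall downsampling factor of 32 in the convolutional backbone; the same factor applied to a 1152 × 896 mammogram gives 36 × 28. A minimal sketch (the function name `feature_map_size` is hypothetical, and the factor of 32 is inferred from the sizes given in the text):

```python
def feature_map_size(height, width, total_stride=32):
    """Spatial size of the backbone output for a given input size,
    assuming 'same'-style padding (ceiling division by the total stride)."""
    return -(-height // total_stride), -(-width // total_stride)


print(feature_map_size(224, 224))    # patch input
print(feature_map_size(1152, 896))   # whole-mammogram input
```

Because the backbone is fully convolutional, the weights learned on patches can be reused unchanged on whole mammograms; only the spatial size of the output maps changes.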

E. TWO-VIEW CLASSIFIER
In standard mammography, each breast is radiographed twice, in CC and MLO views, and thus an abnormality appears in both views. We propose a convolutional network that simultaneously takes into account the two views of the same side of the mammography, making a third transfer learning. We use the weights of the single-view classifier to obtain the two-view classifier and end-to-end train the whole system. Here, too, we evaluated different combinations of number of blocks and strides to choose the best network architectures.
In the ''CV test'', we take a pair of single-view classifiers and discard the upper layers (the MBConv blocks onwards). This operation results in a network that takes the two views of a mammography exam (CC and MLO, 1152 × 896 pixels each) and generates a pair of 1792 maps with 36 × 28 attributes (Figure 3). Then we concatenate these maps, obtaining 3584 maps with 36 × 28 attributes that are processed by two new MBConv blocks with stride equal to 1. The output of the last MBConv block passes through average pooling followed by a dense layer to make the final classification. The network architecture of the ''OD test'' is similar. Discarding the top layers, we get a network that takes the two views and generates a pair of 1280 maps with 36 × 28 attributes (Figure 4). We concatenate these maps, obtaining 2560 maps with 36 × 28 attributes. These maps are processed by two new MBConv blocks with stride 2 that reduce dimensionality, producing 2560 maps with 9 × 7 attributes. The final classification is obtained by average pooling these maps followed by a dense layer.
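The tensor shapes of the two architectures can be traced with a short sketch. It assumes that each new MBConv block keeps the number of channels unchanged (the text confirms this for the ''OD test'' output of 2560 maps; for the ''CV test'' it is an assumption) and that a stride-2 block halves each spatial dimension, rounding up. The function name `two_view_shapes` is hypothetical.

```python
def two_view_shapes(channels_per_view, height=36, width=28,
                    n_blocks=2, stride=1):
    """List of (channels, height, width) after concatenating the CC and MLO
    maps and after each of the n_blocks MBConv blocks."""
    channels = 2 * channels_per_view          # channel-wise concatenation
    shapes = [(channels, height, width)]
    for _ in range(n_blocks):
        height = -(-height // stride)         # ceiling division
        width = -(-width // stride)
        shapes.append((channels, height, width))
    return shapes


print(two_view_shapes(1792, stride=1))  # ''CV test'' (EfficientNet-B4)
print(two_view_shapes(1280, stride=2))  # ''OD test'' (EfficientNet-B0)
```

Under these assumptions, the ''OD test'' trace ends at 2560 maps of 9 × 7 attributes, matching the size stated in the text.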

IV. EXPERIMENT AND RESULTS
With the aforementioned architectures, we performed many tests to find the optimal parameters and obtained the results described below.

A. PATCH CLASSIFIER
1) TRAINING PATCH CLASSIFIER
In the ''CV test'', we simply used the Adam optimizer with a fixed learning rate of 10⁻⁴ for 20 epochs and batch size of 40 to adapt the ImageNet-trained EfficientNet to classify patches. In the ''OD test'', we used the Adam optimizer with learning rate determined by the ''warm-up and cyclic cosine'' [29] with 30 epochs, period of 3 (the cyclic repetition in number of epochs), delta of 2 × 10⁻⁴ (the amplitude of learning rate changing), and warm-up delay of 4 epochs (the linear rise until the initial learning rate of 10⁻⁴). Table 2 shows the accuracies of the patch classifiers using different EfficientNet models. These values are for reference only, as the selection of the best network is decided by the performance of the single-view classifier. In the ''CV test'', the patch classifier based on EfficientNet-B4 presents the lowest accuracy (0.7644) but it presents the largest AUC (0.8757) when it is converted into a single-image classifier. Consequently, we use EfficientNet-B4 as the basis in this test. In the ''OD test'', surprisingly the opposite happens: the patch classifier based on EfficientNet-B0 presents the lowest accuracy (0.7554) but its corresponding single-image classifier presents the largest AUC (0.8033). Consequently, we use EfficientNet-B0 as the base model in this test. As we anticipated, the accuracies and AUCs of the ''OD tests'' are considerably smaller than those of the ''CV tests''.
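The exact formula of the ''warm-up and cyclic cosine'' schedule is given in [29] and is not reproduced in this section, so the following is only one plausible reading of the parameters listed above, not the authors' schedule. It assumes a linear rise to the initial learning rate during the warm-up epochs, followed by a cosine oscillation between the initial learning rate and the initial learning rate plus delta, repeating every `period` epochs. The function name and this interpretation are assumptions.

```python
import math


def warmup_cyclic_cosine(epoch, base_lr=1e-4, delta=2e-4,
                         warmup=4, period=3):
    """Hypothetical learning rate for a given epoch (0-indexed).

    Assumed behaviour (not confirmed by the source):
      - epochs 0 .. warmup-1: linear rise up to base_lr;
      - afterwards: cosine cycle between base_lr and base_lr + delta,
        restarting every `period` epochs.
    """
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    phase = ((epoch - warmup) % period) / period      # in [0, 1)
    return base_lr + 0.5 * delta * (1.0 - math.cos(2.0 * math.pi * phase))


# Learning rates for the 30 epochs of the ''OD test'' patch training.
schedule = [warmup_cyclic_cosine(e) for e in range(30)]
print(["%.2e" % lr for lr in schedule[:10]])
```

In PyTorch, a function of this kind could be wrapped in a per-epoch multiplier and applied to the optimizer, but the precise shape of the curve should be taken from [29] and Figure 5 rather than from this sketch.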

B. SINGLE-VIEW CLASSIFIER
1) THE TRAINING AND THE RESULTS OF ''CV TEST''
To train the single-view classifier, we fed the network with sample mammograms with the cancer status. Backpropagation adjusts the network parameters to better classify samples.
In the ''CV test'', we used a fixed learning rate of 10⁻⁵, batch size of 3 (to fit in GPU memory) and 50 epochs. The obtained results are summarized in Table 3. As we have already explained, the results obtained by different works by making a random division are unreliable and cannot be compared with our cross-validation results. We also tested a ResNet-based network, getting an average AUC of 0.8512, considerably less than the 0.8757 obtained with the EfficientNet-based network (single runs).

2) THE TRAINING AND THE RESULTS OF ''OD TEST''
In the ''OD test'', we used the Adam optimizer with learning rate determined by the ''warm-up and cyclic cosine'' with 50 epochs, warm-up in 4 epochs, period of 5 epochs, delta of 2 × 10⁻⁵, initial learning rate of 10⁻⁵ and batch size of 4 (to fit in GPU memory). The obtained results (single runs) are summarized in Table 4. The performance of our best single-view classifier is better than Shen et al.'s [4] and similar to Wei et al.'s [9].

C. TWO-VIEW CLASSIFIER
1) TRAINING TWO-VIEW CLASSIFIER
To train the two-view classifier, we fed the network with two-view mammography samples with the cancer status. Backpropagation adjusts the network parameters to better classify samples.
In the ''CV test'', we use the Adam optimizer with batch size of 2 and learning rates:
• 10⁻⁵ during 8 epochs with all layers unfrozen.
In the ''OD test'', we use the Adam optimizer with learning rate calculated by ''warm-up and cyclic cosine'', 100 epochs, warm-up in 5 epochs, period of 20 epochs, delta 2 × 10⁻⁶, and initial learning rate 2 × 10⁻⁶. Here, all layers are unfrozen and we use batch size 6 because EfficientNet-B0 is smaller than B4. Figure 5 shows the profile of the learning rate used.
2) RESULTS OF ''CV TEST''
Table 5 summarizes the results obtained in the ''CV tests'' and Figure 6 depicts the obtained ROCs. In single run, single model, the AUC has increased from 0.8757 ± 0.0310 (single view, Table 3) to 0.9298 ± 0.0379 (two views, Table 5). With TTA and 4 models, the AUC has increased from 0.8907 ± 0.0238 (single view, Table 3) to 0.9344 ± 0.0341 (two views, Table 5). Therefore, we can conclude that taking into account CC and MLO images simultaneously actually improves cancer detection. Note that the AUC we obtained using our two-view classifier with EfficientNet (0.9298) is greater than the best AUC reported by Shen et al. (0.85 + 0.048 = 0.898) [4], obtained using a VGG+ResNet combination, independently processing the two views and averaging them. They measured AUC without TTA or model ensemble, so we compared our two-view approach under the same conditions and concluded that our result is seemingly substantially better than simply averaging the results for both views taken separately, although these authors used a random training/test partition and we used cross-validation.
We also tested the ResNet50-based two-view classifier to find that replacing ResNet with EfficientNet seems to slightly increase system performance, from 0.9255 to 0.9298 (Table 5). However, as the difference is small, we cannot state statistically that the latter is better than the former.
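The test-time augmentation (TTA) and model ensemble used in the results above can be sketched as a plain average of predicted probabilities. This is a minimal illustration, not the authors' inference code: the set of transforms applied at test time is not stated in this section, so horizontal and vertical flips are assumed here, and the toy "models" are stand-in functions rather than trained networks.

```python
def identity(image):
    return image


def hflip(image):
    """Mirror each row (horizontal flip) of an image given as list of rows."""
    return [row[::-1] for row in image]


def vflip(image):
    """Reverse the row order (vertical flip)."""
    return image[::-1]


def predict_tta_ensemble(models, image, transforms=(identity, hflip, vflip)):
    """Average the malignancy probability over all models and transforms."""
    probs = [model(transform(image))
             for model in models
             for transform in transforms]
    return sum(probs) / len(probs)


# Stand-in "models": any callable mapping an image to a probability.
def toy_model_a(image):
    first_column = [row[0] for row in image]
    return sum(first_column) / len(first_column)


def toy_model_b(image):
    return sum(image[0]) / len(image[0])


toy_image = [[1.0, 0.0],
             [0.0, 0.0]]
print(predict_tta_ensemble([toy_model_a, toy_model_b], toy_image))
```

With four trained networks of the same architecture in place of the toy callables, the same averaging would give the "4 models with TTA" setting.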

3) RESULTS OF ''OD TEST''
Table 6 summarizes the AUCs obtained with the official CBIS-DDSM division by different methods from the literature and by our two-view classifier. The data in this table should be interpreted with reservations, because there are many subtleties that do not allow direct comparison of the numbers. For example, the CBIS-DDSM set grew over time, so not all works used the same data. Also, unlike many other works, our work had to discard images that did not have two views in order to test the two-view classifier. Nevertheless, our method seems to be at least as good as the best methods reported in the literature. Figure 7 represents the ROC curve of our two-view classifier in the ''OD test''.
The AUC in single run has increased from 0.8033 ± 0.0183 (single view, Table 4) to 0.8418 ± 0.0258 (two views, Table 6). Using TTA, it has increased from 0.8153 ± 0.0178 (single view, Table 4) to 0.8483 ± 0.0253 (two views, Table 6). This confirms that the use of the two views indeed improves the system. With TTA, we achieved our best mark of 0.8483 ± 0.0253. This is the largest reported AUC using the CBIS-DDSM original division, as far as we know. We tried using an ensemble of four independently trained models (with the same architecture) but the AUC did not increase. We hypothesize that the AUC would increase if we used an ensemble of models with different architectures. In Table 6 we also compare with other works that use the original division of the CBIS-DDSM.

4) MULTI-VIEW TECHNIQUE BY WU et al.
Wu et al. [16] also use multiple views to improve their breast cancer CAD performance. However, there are some important differences between their four-view classifier and ours. First, they do not make transfer learning from patch classifier to whole-image classifier in end-to-end fashion, an idea proposed by Shen et al. [4] and essential to obtain high performance. Second, Wu et al. independently process each view with ResNet-22 and concatenate the maps obtained after the four average poolings. Meanwhile, our classifier processes each view with EfficientNet-B4 and concatenates the attribute maps before doing average poolings. We tested both ideas (concatenating the attribute maps after or before the average poolings), always using EfficientNet-B4 as the base model, and the results seem to show that slightly better results are obtained when concatenating the maps before the average poolings (Table 7). This is not surprising, as information about the spatial locations of lesions is lost by average pooling.
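The loss of spatial information caused by pooling before concatenation can be shown with a toy example. The sketch below uses tiny 2 × 2 single-channel maps and hypothetical function names; it illustrates the general argument only and does not reproduce either Wu et al.'s network or ours.

```python
def global_avg_pool(feature_map):
    """Average of all values of one H x W map (list of rows)."""
    n_values = len(feature_map) * len(feature_map[0])
    return sum(sum(row) for row in feature_map) / n_values


def fuse_after_pooling(cc_maps, mlo_maps):
    """Pool every map to a single number, then concatenate the vectors."""
    return ([global_avg_pool(m) for m in cc_maps] +
            [global_avg_pool(m) for m in mlo_maps])


def fuse_before_pooling(cc_maps, mlo_maps):
    """Concatenate along the channel axis; the H x W grid is preserved,
    so later convolutional blocks can still see where activations lie."""
    return cc_maps + mlo_maps


# One channel per view; a single strong activation marks a suspicious region.
cc = [[[1.0, 0.0],
       [0.0, 0.0]]]
mlo_top_left = [[[1.0, 0.0],
                 [0.0, 0.0]]]
mlo_bottom_right = [[[0.0, 0.0],
                     [0.0, 1.0]]]

# After pooling, both MLO variants give the same fused vector ...
print(fuse_after_pooling(cc, mlo_top_left))
print(fuse_after_pooling(cc, mlo_bottom_right))
# ... whereas the concatenated maps remain distinguishable.
print(fuse_before_pooling(cc, mlo_top_left) ==
      fuse_before_pooling(cc, mlo_bottom_right))
```

Two inputs that differ only in where the activation occurs become indistinguishable once each view has been average-pooled, while concatenating first keeps that difference available to the subsequent MBConv blocks.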

V. CONCLUSION
In this paper, we have presented a new high performance breast cancer CAD (Computer-Aided Detection and Diagnosis) system. We have proposed a deep convolutional network that simultaneously takes into account the two views of the same side of the mammography and is end-to-end trained by making three transfer learnings:
1) First, we use the weights of EfficientNet trained on natural images to train the patch classifier.
2) Second, we use the patch classifier weights to train the single-view classifier.
3) Third, we use the single-view classifier weights to train the two-view classifier.
Using 5-fold cross validation, our system has achieved an AUC of 0.9344 ± 0.0341 in classifying CBIS-DDSM mammograms with two views (accuracy, sensitivity and specificity are 85.13% at the equal error rate point of the ROC). Using the original CBIS-DDSM division into training/testing sets, our technique achieved an AUC of 0.8483 ± 0.0253, the highest ever achieved, as far as we know (although a direct comparison of the different methods is not possible due to the subtle differences in test conditions). In both tests, AUCs increased significantly from single-view classifiers to two-view ones: from 0.8907 to 0.9344 in the ''CV test'' and from 0.8033 to 0.8483 in the ''OD test''. Furthermore, the AUC obtained by our technique (0.9255) is substantially higher than that obtained by averaging the two views processed independently by Shen
Since then, he has been working with breast radiology in academic and private medical practice. He is currently a Coordinator of breast radiology with the Hospital das Clínicas and Cancer Institute of USP and has artificial intelligence applied to the breast image as the main line of research.
He is currently an Associate Professor with the Department of Electronic Systems Engineering, USP. He is the author of more than 100 articles and holds three patents. His research interests include image processing, machine learning, medical image processing, and computer security.
Dr. Kim and colleagues received the 6th edition of the Petrobras Technology Award in the ''Refining and petrochemical technology'' category, in 2013; the ''Best Paper in Image Analysis'' Award at the Pacific-Rim Symposium on Image and Video Technology, in 2007; and the Thomson ISI Essential Science Indicators ''Hot Paper'' Award, for writing one of the top 0.1% of the most cited computer science papers, in 2005.