An Efficient Deep Neural Network to Classify Large 3D Images With Small Objects

3D imaging enables accurate diagnosis by providing spatial information about organ anatomy. However, using 3D images to train AI models is computationally challenging because they consist of 10x or 100x more pixels than their 2D counterparts. To be trained with high-resolution 3D images, convolutional neural networks resort to downsampling them or projecting them to 2D. We propose an effective alternative, a neural network that enables efficient classification of full-resolution 3D medical images. Compared to off-the-shelf convolutional neural networks, our network, 3D Globally-Aware Multiple Instance Classifier (3D-GMIC), uses 77.98%-90.05% less GPU memory and 91.23%-96.02% less computation. While it is trained only with image-level labels, without segmentation labels, it explains its predictions by providing pixel-level saliency maps. On a dataset collected at NYU Langone Health, including 85,526 patients with full-field 2D mammography (FFDM), synthetic 2D mammography, and 3D mammography, 3D-GMIC achieves an AUC of 0.831 (95% CI: 0.769-0.887) in classifying breasts with malignant findings using 3D mammography. This is comparable to the performance of GMIC on FFDM (0.816, 95% CI: 0.737-0.878) and synthetic 2D (0.826, 95% CI: 0.754-0.884), which demonstrates that 3D-GMIC successfully classified large 3D images despite focusing computation on a smaller percentage of its input compared to GMIC. Therefore, 3D-GMIC identifies and utilizes extremely small regions of interest from 3D images consisting of hundreds of millions of pixels, dramatically reducing associated computational challenges. 3D-GMIC generalizes well to BCS-DBT, an external dataset from Duke University Hospital, achieving an AUC of 0.848 (95% CI: 0.798-0.896).


I. INTRODUCTION
M AMMOGRAPHY is the standard medical imaging modality in breast cancer screening programs, which are crucial in the early detection of the leading cause of cancer-related death among women worldwide [1].Digital breast tomosynthesis (DBT) or "3D mammography", has been introduced to improve quality of diagnosis in breast cancer screening.Full-field digital mammography (FFDM) projects the three-dimensional information of breasts to twodimensional planes, which can obscure lesions with other breast tissues.Because DBT images are three-dimensional, they more accurately capture the three-dimensional information than FFDM and enable better identification of suspicious findings.Thus, DBT leads to more cancers being found and a reduction in recall rates [2], [3], [4], [5], [6], [7].However, as DBT images contain 70 slices on average (about 340 million pixels on average), it almost doubles its interpretation time for radiologists compared to FFDM [8].Therefore, there is a need for tools for computer-aided detection and diagnosis to reduce interpretation time for DBT images.
One cannot naively apply large-capacity neural networks to high-resolution multi-slice DBT images with modern GPUs without running out of memory.Attempts to downsample 3D images or project each 3D image into one 2D image lead to losing the benefit of DBT: separation of lesions from nearby structures.If pixel-level or lesion-level ground truth labels (segmentations of regions of interest) were available, it is possible to train neural networks by utilizing small subset of large 3D images at a time.However, collecting these annotations is a time-consuming and mostly manual process, which requires experts (e.g., radiologists) to annotate suspicious areas on dozens of slices.
To address these limitations, we propose 3D Globally-Aware Multiple Instance Classifier (3D-GMIC), a deep neural network which predicts malignancy of lesions and localizes them in 3D images.Our architecture extends the Globally-Aware Multiple Instance Classifier (GMIC) [9], a highly effective neural network architecture for large 2D images, to 3D data.3D-GMIC acts by first determining the most important regions utilizing a low-capacity network, to which a high-capacity network is applied.3D-GMIC learns to highlight the regions important to predicting image-wise classification labels.This enables applying 3D-GMIC to learn from data without segmentation labels.To implement 3D-GMIC, we introduce a number of technical advances.Importantly, among similar regions in nearby slices, 3D-GMIC selects the region in the most important slice and avoids processing duplicate information.This enables training deep neural networks with entire DBT images without downsampling or performance degradation.In addition, we improve the stability of learning in low-data regimes when using backbone networks with ReLU nonlinearity.Furthermore, we devise a novel approach to adeptly handle the varying number of slices in DBT images.Lastly, we have designed 3D-GMIC to learn effectively with mini-batch size of 1 DBT image per GPU using 4 GPUs in parallel by using group normalization [10] and synchronized batch normalization.A detailed discussion of these technical advances is provided in section V-B.
To demonstrate the effectiveness of 3D-GMIC, we compare its performance to models trained on equivalent 2D datasets.We train and evaluate the models on three different imaging modalities used in breast cancer screening; (1) FFDM: Full-Field Digital Mammography, (2) DBT: Digital Breast Tomosynthesis and (3) synthetic 2D: 2D image generated by combining information from different slices of DBT images using the proprietary C-View™ algorithm [11].We show that our 3D-GMIC model enables end-to-end training with whole DBT images by using 77.98%-90.05%less GPU memory and 91.23%-96.02%less computation compared to off-the-shelf deep convolutional neural networks.In addition, we show that 3D-GMIC trained on our internal dataset generalizes well to an external dataset from Duke University Hospital [12], [13], [14].Hence 3D-GMIC may allow other researchers to perform classification and weakly-supervised semantic segmentation with other types of 3D data.Our model code and model weights are released at https://github.com/nyukat/3D_GMIC.

II. RELATED WORK
Prior works on building deep neural networks for DBT images have at least one of the following four disadvantages.
First, many of them work with synthesized 2D images rather than the entire 3D image.These images are generated from the DBT images by proprietary algorithms bundled with the scanner [15] or by third-party methods aggregating information from all slices such as dynamic feature image [16] or maximum intensity projection [17].In addition, there exist approaches which create attention-weighted summaries of small subsets of DBT slices, resulting in multiple "slabs" that resemble 2D mammography [18].The resulting images may suffer from the same disadvantage as FFDM: the lesion could be hidden by nearby structures located at different depths.Therefore, the performances of the model on synthesized 2D images could suffer compared to model on DBT images.In addition, models trained on synthesized 2D images will not be able to predict which slices in DBT images contain the suspicious findings, limiting the usefulness in assisting the radiologists in reading DBT images.
Second, the studies that use DBT instead of synthetic 2D images still do not utilize the entire image at once in training.Some utilize a subset of DBT slices [19] whereas others process small region of interest (ROI) patches pre-extracted from lesions [17], [20], [21].A model utilizing a subset of slices risks missing those potentially most informative to the diagnosis.A model processing image patches does not learn to utilize the global context of the entire breast.
Third, some of the works require more detailed labels such as bounding-boxes or pixel-level segmentations [12], [17], [20], [21], [22], [23], [24].While training the model with pixel or lesion-level labels may improve the performance compared to just using image-level labels, the former are more labor-intensive and time-consuming to collect, especially for 3D data.Collecting segmentation or bounding box labels requires experts to perform manual work, while study-level labels can be reliably extracted from hospital information systems.
Lastly, most of the prior works were evaluated on datasets with small sample sizes [23], [25], [26], [27] or patient cohorts with specific inclusion criteria.In addition, they do not perform any external validation of their models using datasets from other hospitals.This limits drawing reliable conclusions about the potential clinical utility of such models.
More details on the related works on AI for DBT can be found in Bai et al. [28].For example, Singh et al. [17] achieved an AUC of 0.847, Tardy and Mateus [18] achieved an AUC of 0.73.Please note that the reported performances of the related works are not directly comparable because they are not evaluated on the same dataset.
In addition, there exist general-purpose computation-and memory-saving techniques such as gradient checkpointing [29] and swapping CPU and GPU memory [30], [31] which reduce the maximum GPU memory usage to enable training high-capacity neural networks with large 3D images without downsampling.However, these techniques do not reduce the amount of required computation.In fact, they add additional overhead for recomputing forward passes, saving and loading gradients, and/or swapping memories between devices.

A. Internal Dataset
Our retrospective study was approved by the NYU Langone Health Institutional Review Board (study ID# i18-00712_CR3) and was compliant with the Health Insurance Portability and Accountability Act.Informed consent was waived.To perform this study, we created a dataset consisting of exams containing both FFDM and DBT images.We refer to this dataset as "NYU Combo v1".The NYU Combo v1 dataset consists of 99,862 exams from 85,526 patients screened between January 2016 and June 2018 at NYU Langone Health.Each exam contains the four standard views used in screening mammography (R-CC, L-CC, R-MLO, and L-MLO) for all three imaging modalities (FFDM, DBT, synthetic 2D).The manufacturer and model name of all images in this dataset are 'HOLOGIC, Inc.' and 'Selenia Dimensions'.Since 99% of DBT images have 96 slices or less, we discarded exams that contain DBT images with more than 96 slices.This is to keep the efficiency of data loading and confine the VRAM usage to a predictable amount during our experiments.Note that it is possible to utilize more than 96 slices for both training and inference as long as sufficient VRAM is available.
Only a small number of breasts are associated with pathology labels.Among the 199,724 breasts in the NYU Combo v1 dataset, 2,579 breasts (1.29%) had benign findings and 635 breasts (0.32%) had malignant findings.150 (0.08%) breasts had both benign and malignant findings.All benign and malignant findings were pathology-proven.Exams not associated with any pathology report were assigned a negative label, indicating the absence of any biopsied findings.Note that NYU Combo v1 dataset is about half the size of the dataset used in Wu et al. [32] and Shen et al. [9].
We divided NYU Combo v1 into training, validation, and test sets.First, we sorted patients in the chronological order of their most recent exams.We designated the first (i.e., the ones who had their last exam earlier than other patients) 80% of the patients (68,412) to the training set, the next 10% of the patients (8,543) to the validation set, and the remaining (most recent) 10% (8,571)  To acquire lesion annotations on DBT images in the test set, we asked a group of 16 radiologists from NYU Langone Health to annotate the location of biopsied lesions by segmenting them with ITK-SNAP [33].They annotated the location, shape and size of the biopsied lesions by drawing over them.Each image was annotated only once.Thus, there could be some interobserver variability affecting the evaluation, which we do not attempt to solve here.While it is possible to annotate the same lesion multiple times by different radiologists and take the average to minimize interobserver variability, this would limit the number of images we would be able annotate.We used these labels only for evaluating semantic segmentation performance in the test set.They were not used during training and validation.
In addition, as only a small number of exams have benign or malignant findings, we utilize transfer learning to improve the performance of our models as described by Wu et al. and Shen et al. [9], [32].In this pretraining, we utilized BI-RADS labels extracted from radiology reports, which are the radiologist's assessment of a patient's risk of having breast cancer based on the screening mammography.We use all BI-RADS scores but group them into 3 classes: (a) incomplete or high suspicion (BI-RADS category 0, 4, 5), (b) normal (BI-RADS category 1), and (c) benign (BI-RADS category 2 and 3).We utilized both the NYU Breast Cancer Screening Dataset v1.0 (BCSDv1) [34] and NYU Combo v1 in pretraining a part of our model.When combining the two datasets, we only selected patients from BCSDv1 who do not overlap with the patients in the validation and test sets of NYU Combo v1 to avoid an information leak.As a result, we were able to utilize additional 210,389 2D screening mammography exams for pretraining our models with BI-RADS labels.More details of pretraining and transfer learning are in section V-D.1.

B. External Dataset
In an additional evaluation of our models, we utilize BCS-DBT, an external dataset from Duke University Hospital [12], [13], [14], specifically the subset that was released as the training dataset of the DBTex challenge1 [35].This dataset consists of 19,148 DBT images from 4,838 studies that belong to 4,362 patients.We utilize all images, including the ones with 96 slices or more.The manufacturer and model name of all images in this dataset are 'HOLOGIC, Inc.' and 'Selenia Dimensions'.It contains 87 bounding-box labels for malignant lesions, and 137 bounding-box labels for benign lesions.Even though this dataset uses the same manufacturer and model as NYU Combo V1 dataset, BCS-DBT still serves as an effective dataset for external validation.This is because there are variations in data collection and processing between datasets collected from different hospitals, which could degrade crossinstitutional generalization [36].

IV. TASK DEFINITION
We formulate the task of predicting the probability of presence of benign and malignant lesions as a multi-label classification problem.That is, given an image x ∈ R H,W,D , our models make probability predictions p corresponding to the labels y = y b y m for each image, where y b , y m ∈ {0, 1} indicate the presence of at least one biopsy-confirmed benign or malignant lesion in x, respectively.

V. AI ARCHITECTURE
The proposed architecture, 3D-GMIC, consists of two subnetworks: the global module and the local module (Fig. 1).
A. High-Level Overview of 3D-GMIC 1) The Global Module: The proposed model, 3D-GMIC (Fig. 1), extends GMIC [9] to 3D data.The low-capacity global network f g is a convolutional neural network which processes 2D input images.As in the original GMIC, the global network f g is parameterized as ResNet-22 [32] which is narrower and has larger strides compared to the canonical ResNet architectures [37].3D-GMIC applies the global network separately to each slice of a 3D image in parallel.For each slice x d in the input 3D image x with D slices, the global network first extracts the hidden representation h g,d ∈ R h,w,c where h,w,c are the hidden dimensions.The hidden representation h g,d is turned into saliency maps A d using a semantic segmentation layer, which is a combination of a 1 × 1 convolution layer with a nonlinear function f n .Concretely, The saliency maps for all slices of x are then aggregated to form the 3D saliency map A ∈ R h,w,D,2 .These saliency maps identify the most important regions and slices of the input image x for the benign and malignant categories.
Then, for each class c ∈ {b, m}, we use an aggregation function f agg (A c ) : R h,w,D → [0, 1] to transform the saliency map A c into image-level class prediction p c global : We define the aggregation function as where H + denotes the set containing locations of top t% values in A c as described in section V-B.3.
2) The Local Module: In the local module of 3D-GMIC, we select the most important regions of x according to the saliency maps A. Concretely, we greedily select K square patches which corresponds to the high values in the saliency maps as in the retrieve_roi algorithm from Shen et al. [9].We adapt and modify this algorithm for 3D images to prohibit selecting a patch if one of the previously selected patches were at the same xy-location and from ±ζ neighboring slices.This prevents cropping multiple patches with duplicate information from nearby slices.We name this algorithm retrieve_roi _from_3d_image (Algorithm 1).The width and height of the patches are fixed to 256 in all experiments.This algorithm identifies K image patches {x k } to maximize the criterion in line 7 at each selection as follows: We can then apply the high-capacity local network f l to utilize fine-grained details from the K selected image patches {x k } by computing hk = f l (x k ).The extracted feature vectors { hk } are then aggregated using a gated attention mechanism [38].Attention scores, α k ∈ [0, 1], indicating the relevance of each patch are calculated as follows: where ⊙ denotes an element-wise multiplication, w ∈ R L×1 , V ∈ R L×S and U ∈ R L×S are learnable parameters.In all experiments, we set L = 128 and S = 512.This process yields an attention-weighted representation We then apply a fully connected layer with sigmoid nonlinearity to z to generate the prediction p local = sigm(w local T z), where w local ∈ R S×2 are learnable parameters.3) 3D-GMIC Output: We average the predictions from the global and local modules of 3D-GMIC to produce the final class predictions as 4) The Loss Function Used in the Training of 3D-GMIC: 3D-GMIC is trained end-to-end by minimizing binary crossentropy (BCE) losses for the predictions from the two stages p c global and p c local .In addition, we encourage sparsity on the saliency maps A by imposing L1 regularization L reg (A): This enables successful semantic segmentation of important regions in the saliency map.In summary, the training loss has the following form: where β is a hyperparameter.We calculate the two BCE losses separately and sum them up rather than calculating one BCE loss with the arithmetic average between the two predictions y c global and y c local .This is because the latter leads to underutilization of one of the modules [39].In other words, optimizing the latter BCE loss leads to one of the modules learning to predict the cancer accurately and the other module just predicting the same probability for all images.

B. Technical Advances Enabling Learning From Datasets With Small Number of Large Images
DBT is a relatively recent imaging modality and therefore researchers might have a limited number of DBT exams available.In addition, DBT is an imaging modality consisting of hundreds of millions of pixels and training AI models on such large images is technically challenging.In this section, we describe the components which enable processing DBT images without compromising performance.
1) Group Normalization: For the global network, we replace batch normalization [40] with group normalization [10] with 8 groups to retain performance with small batch size and avoid treating each slice of image as different examples in batch normalization.Since group normalization is mostly robust to the choice of the number of groups, we choose 8 as it showed good performance in the original paper and all channel sizes are divisible by 8.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.2) Patch Sampling in 3D: In DBT, a lesion appears across multiple slices with varying focus.Cropping patches from all of these slices is unnecessary as they are redundant.Therefore, the retrieve_roi_from_3d_image algorithm (Algorithm 1) is designed to avoid cropping redundant patches.Specifically, when cropping each patch, we remove saliency from nearby slices at the same xy-location as the best candidate patch to prevent cropping redundant patches in the following steps.In addition, during training, we uniformly sample one of the patches from nearby slices (±ζ slices) at the same xy-location as a part of input data augmentation.
3) Saliency Aggregation Independent of Image Slices: For the benign and malignant classes, the aggregation function f agg in the global module of standard GMIC pools the top t% of the values from the saliency maps A b and A m .It can be viewed as a balance between global average pooling and global max pooling.In the standard GMIC, this top t% pooling leads to superior performances in comparison to either of the extremes.
The size of the saliency maps A b and A m for a DBT image is h × w × D where h is the height, w is the width, and D is the number of slices in the input image (therefore, also in the saliency map).If the aggregation function f agg pools t% of the pixels with respect to the entire A b and A m , then the number of pooled pixels in each saliency map is t/100 × h × w × D. Since this formula is dependent on D, the aggregation function f agg pools different numbers of pixels from the saliency maps from images with different numbers of slices.However, the sizes of lesions in DBT images do not depend on the number of slices. 2 Since the sizes of the lesions are independent of the sizes of the breasts, the formula's dependence on D can cause contradictory training signals and outputs.

Algorithm 1 retrieve_roi_from_3d_image
for each class c ∈ {benign, malignant} do [i, j] = 0 17: end for 18: return O For example, suppose there exist two DBT images that contain an identical malignant lesion but consist of different numbers of slices.Since the lesions show the same characteristics such as shape and size, it will highlight the same number of pixels in the saliency maps corresponding to these two images.Nonetheless, the aggregation function f agg will pool different number of pixels from these saliency maps because they have different number of slices.This can lead to the predictions of malignancy of the global module p m global to greatly differ between these two images even though the lesions are identical.This inconsistency could make the learning more difficult than necessary and degrade the performance.
Ideally, p m global should not depend on the size of the breast and only depend on the detected lesions.To do so, 3D-GMIC applies f agg to the saliency maps A to pool the values in a way that does not depend on the number of slices D in the input image x.Specifically, we define the pooling percentage t% with respect to a single slice of a 3D image.For example, from saliency maps of size h × w × 50 and size h × w × 80, setting t = 200% will pool 2 × h × w values from both saliency maps which accounts for 4% and 2.5% of the entire 3D saliency map, respectively.This is still a small subset of the entire 3D saliency maps, which allows 3D-GMIC to localize the important regions.At the same time, the number of pooled values from the saliency maps no longer changes across images with different number of slices.

4) The Initialization of the Semantic Segmentation Layer:
In the the global module of standard GMIC, the weights of the semantic segmentation layer are randomly initialized.However, we find that this often leads to the failure of weakly-supervised semantic segmentation in low-data regimes such as in this work, 3 as shown in Fig. 2.
The problem with randomly initialized weights in the convolutional layer occurs when performing transfer learning from a pretrained ReLU network with top t% pooling, as shown in Fig. 2a∼c.The pretrained CNN with ReLU nonlinearity tends to output high positive logit values in some channels for locations where small, important features are.Ideally, this value should be multiplied with some positive weight in the semantic segmentation layer such that we can preserve this information as shown in Fig. 2a.With random initialization, however, negative weights often get assigned to these channels which detect important features as shown in Fig. 2b.The outputs from such important features will then be lower than the output of the unimportant regions shown in Fig. 2c.
In such a case, the important regions will not be pooled to p c global as shown in Fig. 3. Instead, f agg will pool from the background and normal regions without any pathology since the values in the saliency maps corresponding to these regions will be higher than those where malignant or benign lesions are.As a result, the saliency pooled from normal regions from malignant breasts will be encouraged to output 1, teaching the model to highlight the wrong portions of the images.This leads to the failure of weakly-supervised semantic segmentation.
To address this issue, we initialize the conv 1×1 layer in the semantic segmentation layer with constant, positive weights ω (Fig. 2d∼e).No matter which channel of the pretrained model reacts to important regions in the image, it will be correctly captured as high value in the saliency maps A since the model weighs all channels equally at the beginning.In this configuration, f agg will pool from important regions in positive exams and encourage the pooled values to be closer to 1.
5) The Choice of Nonlinearity f n : When using constant initialization of the semantic segmentation layer, sigmoid function is no longer an ideal choice for nonlinearity f n .For the sigmoid function to suppress the probability predictions corresponding to background regions, the 1 × 1 convolution layer in the semantic segmentation layer must output highly negative values for uninteresting regions.When using the constant weight initialization and sigmoid nonlinearity, the low logits from the background regions are mapped close to 0.5 probability in the saliency map, and the high logits from the suspicious regions are mapped to probability values that are somewhat higher than 0.5 at the beginning of training.To decrease the 0.5 probability predictions from the background regions, the backbone network f g must learn to output non-zero logits for some channels from all such background regions and then the 1 × 1 convolution layer must associate negative weights for such logits.However, if the model had not already learned to do so during the pretraining phase, it could be hard to learn when fine-tuning with the downstream task.To learn this behavior during the fine-tuning,  the pretrained backbone network must change significantly, which could risk catastrophic forgetting.
To mitigate this issue, an ideal nonlinearity f n should naturally map the near-zero outputs of the 1 × 1 convolution layer into zero probabilities in the saliency maps.One might think that shifting the sigmoid function horizontally could achieve this.However, the input interval corresponding to the highest rate of change in nonlinearity also shifts away from 0, which hinders learning.The ideal nonlinearity for constant weight initialization would map near-zero values to 0 and would have the highest rate of change near zero.
To satisfy these criteria, we propose ReLU(tanh(x)) for f n as shown in Fig. 4. Like sigmoid, ReLU(tanh(x)) also maps all input values to the [0, 1] interval.Unlike sigmoid, ReLU(tanh(x)) turns the near-zero outputs of the 1 × 1 convolution layer into zero values in the saliency map.At the same time, the highest rate of change occurs with the input values around zero.This simplifies the learning and no longer requires a change of behavior in the backbone network f g .We observe that ReLU(tanh(x)), combined with the constant initialization of the semantic segmentation layer, leads to consistent success of weakly-supervised semantic segmentation in the saliency maps A with the NYU Combo v1 dataset which is smaller than the dataset used in Shen et al. [9].

C. Fusion Module
The standard GMIC includes a fusion module to combine the hidden representations of the global and local modules to output the final prediction.In the preliminary experiments, however, we found that the fusion module does not provide benefit with DBT and ReLU(tanh(x)) nonlinearity, and thus we exclude it.We hypothesize that this is because more information is lost in max pooling from 3D representations than max pooling from 2D, which might prevent the learning.

D. Training Procedure
To train the model architectures on different modalities we take the following steps, as shown in Fig. 5: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

1) Pretrain the global network on BI-RADS labels with
the BCSDv1 + NYU Combo v1 combined dataset using FFDM modality only described in section V-D.1).2) BI-RADS fine-tuning: transfer the model from ( 1) and further train on the BI-RADS classification task to fine-tune the pretrained weights to each modality.a) BI-RADS fine-tuning is performed on the synthetic 2D images from NYU Combo v1 dataset, using the weight from step 1. b) BI-RADS fine-tuning is performed on the DBT images from NYU Combo v1 dataset, using the weight from step 1.

3) Training GMIC and 3D-GMIC.
a) GMIC is trained on FFDM images from NYU Combo v1 dataset using the weight from step 1. b) GMIC is trained on synthetic 2D images from NYU Combo v1 dataset using the weight from step 2a.c) 3D-GMIC is trained on DBT images from NYU Combo v1 dataset using the weight from step 2b.Note that the resulting models from the steps 3a, 3b, 3c are different models trained on different imaging modalities.For example, to train 3D-GMIC, only steps 1, 2b, 3c are required.We compare performances from models created in steps 3a, 3b, 3c in Table II For the global network of our models, we apply transfer learning from networks pretrained with BI-RADS labels as described by Wu et al. [32] and Geras et al. [41].We modify the BI-RADS pretraining procedure so that instead of predicting BI-RADS labels utilizing all of the four views in an exam at once, we predict BI-RADS labels for each image separately.We first pretrain a model parameterized with ResNet-22 architecture [32] using the FFDM images, and transfer to further BI-RADS pretraining on other modalities and training GMIC models as described above.For the pretraining phase for the models with DBT images, we use the same ResNet-22 model but apply it to all slices in each image in parallel.And then, we aggregate the average-pooled representation from all slices using a gated attention mechanism [38] before applying logistic regression.
While BI-RADS labels are more noisy than labels derived from pathology reports, we have 46,680 exams with BI-RADS 0 or BI-RADS 2 labels in the training set of NYU Combo V1 dataset.In addition, in the training set obtained by combining BCSDv1 and NYU Combo v1 datasets as discussed in section III-A, we have 159,471 exams with BI-RADS 0 or BI-RADS 2 labels.In comparison, 2,391 exams are associated with positive pathology-driven labels in the training set of NYU Combo V1 dataset.Therefore, BI-RADS is a good target to pretrain our models so that they can learn to recognize important visual features of mammograms, which will be useful for the downstream training with pathology-derived labels.In addition, this step is of vital importance to 3D-GMIC whose contributions on top of GMIC assumes that the global network is pretrained.BIRADS fine-tuning is designed to learn visual features that are specific to synthetic 2D and DBT images and thus to improve performances compared to just using the weights pretrained with FFDM images only.
2) Training 2D GMIC: We use GMIC for training models with FFDM and synthetic 2D data.For a fair comparison, we apply the same modifications made to 3D-GMIC when we can.Specifically, we apply the same ReLU(tanh(x)) nonlinearity for the generation of the saliency map, replace the fusion module with the average prediction between global and local modules, and replace batch normalization with group normalization in the global module.While these changes do not necessarily improve the best possible performances with 2D-GMIC, our initial experiments indicated that applying them makes learning more stable when learning with fewer data.
3) Image Augmentation: FFDM images have dimensions (4096, 3328) or (3328, 2560).DBT and synthetic 2D image sizes are either (2457,1996) or (2457, 1890).The corresponding FFDM, DBT, and synthetic 2D images have the same field of view.We crop each image to a predefined input size using the procedure described by Wu et al. [32].FFDM images are cropped to the size of 2866 × 1814 pixels, whereas synthetic 2D images and DBT images are cropped to the size of 2116 × 1339 pixels to capture the equivalent field of view.These window sizes were selected as they yield an equivalent field of view for all modalities, considering that the original image resolutions differ between modalities.We also apply random shifting and resizing during training and testing phases for a maximum of 100 pixels in any direction.At test time, we apply 10 random augmentations to get 10 predictions for each image and average them to make a final prediction.
For 3D-GMIC, we choose a different range of some hyperparameters compared to 2D to mitigate the differences between the imaging modalities.For example, the lesions in the DBT contain 10.97 times more pixels on average compared to the lesions in synthetic 2D images, regardless of the number of slices in DBT images.As this suggests that more salient pixels will be highlighted in the saliency maps for DBT images than the corresponding synthetic 2D images, we increase the minimum and maximum value of the pooling percentage t 10.97 times to capture a comparable amount of highlighted area in the saliency maps A. In addition, we decrease the minimum and maximum value of regularization weight β 10.97 times so that the regularization term has a comparable magnitude to the remaining terms in the final loss function used to train 3D-GMIC.
Finally, we adjust the number of sampled patches K for 3D-GMIC to account for the depth in DBT data.Even though there are about 70 slices on average, we do not increase K by 70 times compared to GMIC on 2D images.This is because our method is more efficient than applying GMIC on all slices in parallel.Since our retrieve_roi_from_3d_image algorithm avoids cropping duplicate information from nearby patches, 3D-GMIC can focus its computation on much smaller proportion of the input image compared to GMIC.Nonetheless, as a precaution, we increase the number of patches K by a conservative factor of 2 compared to GMIC on 2D images as some patches might still be located within the same xy coordinates but at different slices.In such a case, the total number of xy-coordinates from which patches are sampled in 3D-GMIC could be decreased compared to GMIC on 2D mammography, and we want to mitigate this.In Table VIII, we show that this increase was not strictly necessary, as both GMIC and 3D-GMIC utilizing the same number of patches achieve comparable performances.
As a result, we randomly sample the pooling percentage t from uniform distribution t ∼ U(10.97%, 274.25%) with respect to one slice of an image 5 from uniform distribution, number of patches in image K ∈ {8, 12, 16} with equal probabilities, and the regularization weight β from log-uniform distribution log 10 (ω) ∼ U(−6.54, −4.54).
For all models described in this paper, we optimize their hyperparameters with 40 random search trials.All models were trained for a maximum of 40 epochs but the ones which do not show any improvement in validation performances for 15 consecutive epochs are early stopped.We selected 5 models with the highest AUC scores in identifying images with malignant findings on the validation set.
Training 3D-GMIC for 40 epochs takes about three days when utilizing four Tesla V100 GPUs with 32GB GPU memory in parallel.Training GMIC with FFDM and synthetic 5 Even though the pooling percentage can be larger than 100% of a slice of a saliency map, it is still a small subset of the entire 3D saliency map.2D data for 40 epochs on a Tesla V100 GPU with 16GB GPU memory takes about 18 hours and about 12 hours, respectively.

A. Computational Efficiency
To compare 3D-GMIC and other architectures, we measure the maximum GPU memory usage during training and the number of computations in Giga Multiply-ACcumulate operations (GMACs) by benchmarking with a single DBT image with 96 slices.Since it is not possible to run off-the-shelf models with all of 96 slices, we estimate their metrics using linear extrapolation with DBT images with 4, 8, 16 slices.

B. Classification
To evaluate the classification performance, we calculate area under the receiver operating characteristic curve (AUC) on two levels of granularity: 1) Image-wise AUC, where the performance of the model is evaluated using the predictions for each image.2) Breast-wise AUC, where the performance of the model is evaluated using the averaged prediction between CC and MLO views of each breast.This is the main evaluation metric of this work.To address the class imbalance of the dataset, we additionally report specificity at 90% sensitivity, Matthew's correlation coefficient [44] at 90% sensitivity, sensitivity at 95% specificity, and sensitivity at 99% specificity. 6

C. Semantic Segmentation
We calculate the Dice similarity coefficient (DSC) and pixel average precision (PxAP) [45] to evaluate the ability of 3D-GMIC to perform weakly-supervised semantic segmentation based on the saliency maps.DSC and PxAP are computed as an average over the images with visible biopsy-confirmed findings.It is worth noting that while comparing these values to different models learning from the same imaging modality is fair, comparing them between models learning from different imaging modalities is not necessarily so.FFDM images have higher resolutions than both DBT and synthetic 2D images,   and therefore superpixel in the saliency maps corresponds to a smaller physical region.This results in finer saliency maps for FFDM images.In addition, DBT has a lower ratio of important regions compared to the entire image because the lesions only appear in a small fraction of slices in each image.Therefore, DSC or PxAP values will be lower for DBT even if the model similarly highlights important regions in the slices where they are visible.To partially mitigate this issue and make these values more comparable between 2D and 3D imaging modalities, we do not report the DSC or PxAP on DBT images and instead calculate these metrics after max-projecting both saliency maps and the segmentation labels in the depth dimension.This way, we can get a better sense of how 3D-GMIC is predicting the location of lesions in the xy-dimension by comparing its score to the 2D models.

VII. RESULTS
We evaluated the classification and semantic segmentation performances of GMIC and 3D-GMIC architectures on the internal NYU Combo v1 dataset as well as an external DBT dataset from Duke University hospital.

A. Computational Efficiency
Compared to off-the-shelf deep convolutional neural networks, 3D-GMIC uses 77.98%-90.05%less GPU memory and 91.23%-96.02%less computation as shown in Table I.

B. Classification
For each imaging modality, we trained the GMIC or 3D-GMIC architecture as well as a variant with only the global module (global-only).This measures the benefit of adding the local module in the GMIC architecture.The results are in Table II   For both global-only and full architectures, we find that 3D-GMIC performs comparably with other models trained with 2D imaging modalities.Ensembling models trained with different modalities does help, but which modalities lead to the best performance differs between models and target tasks.
In addition, we demonstrate our models' classification performances on the BCS-DBT dataset from Duke University Hospital [12], [13], [14].3D-GMIC achieved image-wise AUC of 0.848 (95% CI: 0.798-0.896) in identifying images with malignant findings and image-wise AUC of 0.741 (95% CI: 0.697-0.785) in identifying images with benign findings.In comparison, on the DBT images in the test set of NYU Combo v1 dataset, the same 3D-GMIC models achieve image-wise AUC of 0.809 (95% CI: 0.761-0.852) in identifying images with malignant findings and image-wise AUC of 0.704 (95% CI: 0.683-0.728) in identifying images with benign findings.Even though the model did not use any images from Duke during training, it generalizes well and even shows higher performances than on our own internal data set.

C. Semantic Segmentation
Table VII shows DSC and PxAP scores for the three modalities.Even though semantic segmentation in 3D data can be more difficult, 3D-GMIC reaches comparable performances.A sample set of model visualizations, produced for the same breast, is shown in Fig. 6.

D. Ablation Studies
We performed several ablation studies to assess the effects of different hyperparameters on 3D-GMIC.Table VIII shows the comparison of model performances when using the same number of patches.We observe that 3D-GMIC on DBT still shows comparable performances to GMIC on 2D images.
2) The Half-Width of Slice-Wise Sampling ζ : The default value of ζ = 10 was set in initial experiments before Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Fig. 6.Visualization of results for an example for DBT, synthetic 2D and FFDM modalities.From left to right: input images, patch maps indicating the locations of cropped ROI patches as blue squares, saliency maps for benign category, saliency maps for malignant category, and the cropped patches with the two highest attention scores.This example contains an irregular mass in the central right breast at mid depth was diagnosed as malignant on ultrasound-guided core biopsy.
systematically tuning the hyperparameters and was used in our experiments.In this ablation study, we investigate the effect of the half-width of slice-wise sampling ζ ∈ {0, 5, 10, ∞} during training by keeping the rest of the hyperparameters the same.The purpose of this ablation study is twofold.First, we aim to observe if slice-wise patch sampling provides an unfair advantage to 3D-GMIC compared to other GMIC models trained on 2D modalities.Second, we want to verify if the default value of ζ = 10 was optimal.The results are in Table IX.We show that the performance of ζ = 0 and ζ = 10 are indistinguishable, which means that the slice-wise sampling of the image patches did not give an unfair advantage to 3D-GMIC compared to GMIC with 2D imaging modalities.In addition, we observe that ζ = 5 leads to an improved performance.
3) The Amount of DBT Data in Training: In Table X, we report the breast-level test AUCs of 3D-GMIC models trained with  in section V-D and Fig. 5, which is the default approach in our work, is 0.8.
When following steps 1-3c, the performance of the model decreases as we decrease the amount of training data, which is expected.Moreover, the drop from 100% to 50% of training data is similar to the drop observed from 50% to 25%.This suggests that the model performance has not yet saturated, and our models could still benefit from additional data.
When following steps 1-3a-3c, the drop in the performance by reducing the number of DBT training images is lower compared to when following steps 1-3c.This is to be expected as the network already utilized 100% of pathology-derived labels in the NYU Combo V1 dataset, even though it was the FFDM images for which the model used these labels.This suggests that, if starting from a model that is trained on a lot of FFDM images with pathology-derived labels, 3D-GMIC might only require a relatively small number of DBT images for saturating performance.
Note that the training paths of the models in this ablation study (1-3c and 1-3a-3c) diverge from the standard training path of 3D-GMIC, which is 1-2b-3c.We opted for the path 1-3c as a substitute for 1-2b-3c because it is time-consuming to repeat step 2b with varying amount of DBT data in the standard path of 1-2b-3c.Furthermore, we selected the path 1-3a-3c to represent a use case where there is a surplus of 2D data but a limited quantity of 3D data.

VIII. DISCUSSION AND CONCLUSION
In this work, we propose 3D-GMIC, a novel deep neural network capable of learning and predicting from highresolution 3D medical images in a computationally efficient manner.3D-GMIC effectively focuses its computation to the small subset of important regions by first identifying the regions of interest with a low-capacity sub-network and selectively applying a high-capacity sub-network to the regions of interest while avoiding processing duplicate information from nearby slices.3D-GMIC enables training deep neural networks for high-resolution 3D images with small regions of interests without requiring expensive annotation labels or compromising performance.Our model is efficient enough to train with a batch size of 4 images using 4 GPUs with 32GB of GPU memory.
3D-GMIC focuses its computation on a lower proportion of the input image compared to GMIC.This is done by avoiding cropping patches with duplicate information as explained in section V-A.The reason for processing slices independently with the global network f g and the local network f l rather than applying a 3D convolution to process them together with nearby slices is that DBT is not a true 3D image.Instead, DBT is a collection of the reconstructed slices for the goal of increasing lesion conspicuity.Each structure is visible from most slices with varying focus.Therefore, our architecture aims to maximally utilize the information from individual slices, thus maximally benefiting from the advantage of DBT which is in enabling distinguishing lesions from other structures from nearby slices.
We demonstrate the performance of the proposed architecture with a large dataset where each exam contains the three imaging modalities of screening mammography: FFDM, synthetic 2D, and DBT.3D-GMIC achieved comparable performances compared to GMIC on FFDM or synthetic 2D.This demonstrates that 3D-GMIC successfully classified large 3D images despite focusing its computation on a smaller percentage of its input compared to GMIC.Furthermore, the performance is improved when we ensemble predictions from multiple imaging modalities.This suggests that the model might have learned different behaviors for each modality depending on their relative strengths and weaknesses.This accentuates the benefit of training AI systems which can handle the DBT images even when the 2D imaging modalities with equivalent information are available.
The reported semantic segmentation and classification performances of 2D modalities are lower than the values reported in the original GMIC paper.The reason for this is twofold.First, the NYU Combo v1 dataset is smaller than the BCSDv1 dataset used in the original GMIC paper.Our work utilizes only half the number of exams with biopsy labels and this leads to worse generalization.Second, the two papers have different test sets and two AUCs calculated from different datasets are not directly comparable.As a result, the method presented in this paper demonstrates potential for future development but has yet to reach the level of clinical applicability.As it stands, the performance requires further refinement for practical use.For example, Wu et al. [32] showed that their method for FFDM images with an AUC of 0.895 is comparable with radiologists and can improve radiologists' performance when ensembled with their predictions.In addition, Shen et al. [9] showed that their method for FFDM images with an AUC of 0.930 outperforms radiologists.

IX. LIMITATIONS
We note that 3D-GMIC shares some limitations with the standard GMIC model.Namely, training 3D-GMIC is more complicated than training 3D ResNet models because there are two separate networks to optimize.The learning speeds for the global and local modules could be different and they could start overfitting at different epochs.This could prevent 3D-GMIC from reaching the best possible performance.To mitigate this, separate learning rates could be used for global and local modules as done in Wu et al. [39].
We ackowledge that our DSC and PxAP evaluations have certain limitations in accounting for false positive findings of our methods, which we elaborate below.One one hand, all of the images with biopsied findings were used in calculating Dice and PxAP metrics for both benign and malignant classes, regardless of whether the said biopsied finding was benign or malignant.This means that images with benign findings were also used in evaluating Dice for the malignant category and vice versa.Therefore, it accounts for some of the false positive findings of the algorithm.On the other hand, this metric is not able to measure the false positive semantic segmentation from the images that are not associated with biopsied findings.However, since this method is applied consistently across different models, we measure the performance on images with lesions fairly across different models.Finally, it is common for papers on medical semantic segmentation to use datasets which consist only of images with annotated objects, and thus report segmentation performances calculated only with positive images [48], [49].
We acknowledge that our ground-truth annotation labels contain a degree of subjectivity.Variability in annotation could stem from different radiologists' training, experience, and personal interpretative skills, potentially leading to inconsistencies in the location, size, and shape of the biopsied lesions.This could also introduce noise into the evaluation of our semantic segmentation performance.Ideally, having multiple radiologists annotate the same lesion and averaging out their input could help minimize this interobserver variability.However, due to resource constraints, this was not feasible within the scope of this study.

Fig. 1 .
Fig. 1.Architecture of 3D-GMIC.3D-GMIC first applies the same low-capacity global network to all slices of a DBT image to generate the saliency map.The retrieve_roi_from_3d_image algorithm crops ROI patches from the regions corresponding to the highest values in the saliency map.The model then applies the high-capacity local network to the cropped patches to focus the computation in the sampled regions.Finally, the model combines the predictions from the global and local modules to create the overall prediction for the input image.

Fig. 2 .
Fig. 2. The problem with randomly initialized weights of the 1 × 1 convolution filter in the semantic segmentation layer.Randomly initialized weights (a-c) can map high values of the hidden representation h g to negative outputs (b).This leads potentially useful features to have lower values in the saliency maps than those from the background.In comparison, weights initialized with a constant ω (d-e) always map high hidden representation values to positive output.This ensures that the values in the saliency maps from interesting regions will be higher than those from the background, which is a more reasonable starting point for model training and leads to consistent success of semantic segmentation.

Fig. 3 .
Fig. 3.Failure of weakly-supervised semantic segmentation.As explained in section V-B.4, the areas in the saliency map (right) corresponding to the biopsied lesion (left) have lower values than the unimportant regions and the background after BI-RADS pretraining.In this case, it is difficult for model to learn to highlight lesions correctly with top-t% pooling.

Fig. 4 .
Fig. 4. The effect of different nonlinearity functions and the weight initialization constant ω on a global module pretrained with BI-RADS labels.The visualizations of saliency maps are made with different initialization constants ω before any further training on the pathology labels.Sigmoid (top) leads to initially predicting 0.5 everywhere in the background, which requires the model to learn to suppress saliency in the background in the downstream training.On the other hand, ReLU(tanh(x)) (bottom) maps all the unimportant near-zero hidden representations to near-zero in the saliency maps from the beginning, which is an easier starting point that ensures the success of semantic segmentation.The effect of three different values of ω for each nonlinearity is shown in the right panel.With the pretrained global module, we observe that ω between 0.001 and 0.1 results in saliency maps that sparsely highlight the important regions.

Fig. 5 .
Fig. 5.The training pipeline for the three imaging modalities.

5 )
Distributed Training: To maximize the training speed and increase batch size, we parallelize the training over 4 Nvidia v100 GPUs when training models on the 3D modality.The effective batch size per update is 4 images.For the local module, we use a synchronized batch normalization layer to share batch statistics between GPUs.
, Table III, Table IV, Table V, and Table VI.

TABLE I COMPUTATIONAL
EFFICIENCY OF PROCESSING A DBT IMAGE WITH 96 SLICES.3D-GMIC USES 77.98%-90.05%LESS GPU MEMORY AND 91.23%-96.02%LESS COMPUTATION THAN OTHER ARCHITECTURES

TABLE II BREAST
-WISE TEST PERFORMANCES (AUC) FOR THE GLOBAL-ONLY AND FULL ARCHITECTURES.FOR EACH SINGLE-MODALITY MODEL, WE INDICATE ITS TRAINING STEPS AS DEFINED IN SECTION V-D AND FIG.5.FOR BOTH GLOBAL-ONLY AND FULL ARCHITECTURES, WE FIND THAT 3D-GMIC PERFORMS COMPARABLY WITH OTHER MODELS TRAINED WITH 2D IMAGING MODALITIES.IN ADDITION, A MULTI-MODAL ENSEMBLE INCLUDING ALL THREE MODALITIES LEADS TO THE BEST PERFORMANCE IN PREDICTING THE PRESENCE OF MALIGNANT LESIONS.WE COMPUTE 95% CONFIDENCE INTERVALS USING 1,000 ITERATIONS OF THE BOOTSTRAP METHOD [47]

TABLE III BREAST
-WISE TEST PERFORMANCES (SPECIFICITY AT 90% SENSITIVITY) FOR THE GLOBAL-ONLY AND FULL ARCHITECTURES.FOR EACH SINGLE-MODALITY MODEL, WE INDICATE ITS TRAINING STEPS AS DEFINED IN V-D AND FIG.5.WE COMPUTE 95% CONFIDENCE INTERVALS USING 1,000 ITERATIONS OF THE BOOTSTRAP METHOD [47]

TABLE IV BREAST
-WISE TEST PERFORMANCES (MATTHEW'S CORRELATION COEFFICIENT AT 90% SENSITIVITY) FOR THE GLOBAL-ONLY AND FULL ARCHITECTURES.FOR EACH SINGLE-MODALITY MODEL, WE INDICATE ITS TRAINING STEPS AS DEFINED IN SECTION V-D AND FIG.5.WE COMPUTE 95% CONFIDENCE INTERVALS USING 1,000 ITERATIONS OF THE BOOTSTRAP METHOD [47] , Table III, Table IV, Table V, and Table VI.We also show the ensemble performances where the predictions of different models, trained with different modalities and/or for a different number of epochs are averaged.

TABLE V BREAST
[47]E TEST PERFORMANCES (SENSITIVITY AT 95% SPECIFICITY) FOR THE GLOBAL-ONLY AND FULL ARCHITECTURES.FOR EACH SINGLE-MODALITY MODEL, WE INDICATE ITS TRAINING STEPS AS DEFINED IN SECTION V-D AND FIG.5.WE COMPUTE 95% CONFIDENCEINTERVALS USING 1,000 ITERATIONS OF THE BOOTSTRAP METHOD[47]

TABLE VI BREAST
-WISE TEST PERFORMANCES (SENSITIVITY AT 99% SPECIFICITY) FOR THE GLOBAL-ONLY FULL ARCHITECTURES.FOR EACH SINGLE-MODALITY MODEL, WE INDICATE ITS TRAINING STEPS AS DEFINED IN SECTION V-D AND FIG.5.WE COMPUTE 95% CONFIDENCE INTERVALS USING 1,000 ITERATIONS OF THE BOOTSTRAP METHOD [47] THE BREAST-WISE TEST PERFORMANCES OF MODELS TRAINED WITH EACH MODALITY WHEN CROPPING 8 ROI PATCHES IN retrieve_roi_from_3d_image ALGORITHM.WE TRAIN THREE MODELS FOR EACH SETTING AND SHOW ENSEMBLE PERFORMANCES WITH TEN RANDOM AUGMENTATIONS, ALONG WITH THE 95% CONFIDENCE INTERVALS.3D-GMICON DBT IS STILL SHOWING COMPARABLE PERFORMANCES TO GMIC ON 2D IMAGES, EVEN WHEN THE MODELS USE THE SAME NUMBER OF PATCHES 1) Comparing Modalities Using the Same Number of Patches:

TABLE IX ABLATION
STUDY: EFFECT OF A CHOICE OF ζ ON THE AUC OF IDENTIFYING BREASTS WITH MALIGNANT FINDINGS, ALONG WITH THE 95% CONFIDENCE INTERVALS the same seed and hyperparameters, using different pretrained weights as starting points.In addition, we also vary the amount of DBT data in training step 3c.For quick results, we only perform inference over the test dataset once without random augmentation.As a reference, the breast-level test AUC of a 3D-GMIC model trained by following steps 1-2b-3c as defined Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE X ABLATION
STUDY: THE EFFECT OF TRAINING DATA ON THE BREAST-LEVEL TEST AUCS OF 3D-GMIC MODELS.WE VARY AMOUNTS OF DBT EXAMS WE USED FOR THE STEP 3C (100%, 50%, AND 25%).THE MODELS ARE TRAINED FOLLOWING STEPS 1-3C AND 1-3A-3C (BOTH OF WHICH ARE NON-STANDARD PATHS) AS DEFINED IN SECTION V-D AND FIG. 5. WE OBSERVE THE DECLINE IN MODEL PERFORMANCE AS THE AMOUNT OF TRAINING DATA IS REDUCED 2. For example, consider GMIC on synthetic 2D image of size (2116 × 1339) and 3D-GMIC on DBT image of size (2116 × 1339 x 70), both cropping 8 two-dimensional patches of size 256 × 256 as shown in Table VIII.Then the local networks of either model will utilize 256*256*8=524,288 pixels.While this corresponds to 18.50% of the synthetic 2D image, it corresponds to only 0.26% of the DBT image.