Deep Radiomic Analysis to Predict Gleason Score in Prostate Cancer

Convolutional neural networks (CNNs) require large amounts of data for training, beyond what can be acquired for current radiomics models. We hypothesize that deep entropy features (DEFs) derived from existing CNNs can be applied to MRI images of prostate cancer (PCa) to reliably predict the Gleason score (GS) of PCa lesions. In this study, we analyzed 112 lesions acquired from 99 PCa patients, either pre-biopsy or pre-treatment, their associated GS, and multi-parametric MRI (mpMRI) sequences. Our approach is based on the extraction of DEFs produced in individual layers of 9 pre-trained CNN models. We first analyze DEFs from separate CNNs using the Wilcoxon test and Spearman correlation to find significant features associated with GS. In a multivariate analysis, we then use the combined DEFs of all CNNs as input to a random forest (RF) classifier for predicting the Gleason grade group of patients. Among the 9 pre-trained CNNs, the NASNet-mobile architecture offered the features most correlated to GS ($\rho = 0.47$; p < 0.05). From the 7,857 combined features, 11 DEFs could differentiate GS < 8 from GS ≥ 8 (corrected p < 0.05). Moreover, the RF classifier discerned GS of 6, 3+4, 4+3, 8 and ≥ 9 with an AUC (%) of 80.08, 85.77, 97.30, 98.20, and 86.51, respectively. Our results suggest that DEFs can be used to differentiate the GS of PCa lesions based on mpMRI, with the highest accuracy obtained for GS ≥ 8. DEFs could improve diagnostic accuracy, reduce the risk of misclassification, help to better assess prognosis, and individualize patient care approaches.


I. INTRODUCTION
Radiomics is a technique that extracts a large number of features from medical images to build prediction models. However, it is prone to overfitting when a large number of features are used directly to train and test predictive models [1]. Meanwhile, convolutional neural networks (CNNs) have shown an outstanding ability to identify complex associations in high-dimensional data for disease diagnosis and treatment planning [2]. To benefit from the representational capacity of well-known deep CNN designs (e.g., ResNet, GoogleNet, etc.) while overcoming the issues of overfitting and limited datasets, we propose to encode CNN features for grading prostate cancer (PCa), which is a key challenge in personalized medicine.
For devising a personalized approach to patients with PCa, the diagnosis and management depend on the assessment of the biological aggressiveness of the malignancy, for which the gold standard is prostate biopsy [3], [4]. The biopsy specimen is evaluated in a standardized fashion by specialized physicians, i.e., pathologists, who assign a Gleason Score (GS) to the malignancy [5]. However, this procedure can lead to complications [6], incurs a significant cost [7] and may need to be repeated if the sampled tissue is inadequate for analysis [8]. Additionally, significant discrepancies can arise between the biopsy-evaluated GS and what is found during surgery (e.g., radical prostatectomy [9]). Important inter-observer variability may also be found in biopsy reports [10]. Hence, there is a critical need to develop non-invasive methods that can predict PCa grades to improve the delivery of high-precision care for these patients.
Evidence supports the role of multiparametric magnetic resonance imaging (mpMRI) performed before biopsy as a guide for PCa assessment [3], [4]. Recent studies have shown that MRI offers advantages over transrectal ultrasound (TRUS) guided biopsies in ruling out clinically-significant disease, and that MRI followed by targeted biopsies improves the detection rate compared to systematic biopsies [11], [12]. Likewise, the MRI-FIRST study [13] found that combining an MRI-targeted approach with a systematic biopsy provided substantial added value. The standardized method for reporting prostate mpMRI, known as the Prostate Imaging Reporting and Data System (PI-RADS), stratifies prostatic lesions by their potential for malignancy [14], [15].
The associate editor coordinating the review of this manuscript and approving it for publication was Sotirios Goudos.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
PI-RADS has a reported sensitivity of 74-82% for detecting PCa and a negative predictive value of 65-94% [12]. Recently, important efforts have been invested to improve PCa screening, risk stratification and individualized patient management. Radiomics and CNNs offer an effective and non-invasive way to predict oncological outcomes [16]-[20]. For PCa, multiple studies have identified imaging features that correlate with GS [21]-[23]. However, a common limitation of radiomics approaches is the requirement of having enough high-quality data to both train and validate the model.
Our work proposes a novel radiomics method based on deep entropy features (DEFs) to predict the GS of PCa lesions from mpMRI. In contrast to traditional imaging features, DEFs are learned from a convolutional neural network (CNN) and thus have the potential of capturing more informative characteristics of an image. In a previous study, DEFs obtained from a three-dimensional CNN were shown to be capable of describing differences between brain MRI of patients with Alzheimer's and healthy control subjects [24]. Expanding on this study, the current work evaluates the potential of DEFs, extracted from all layers of multiple network architectures, to offer a more reliable prediction of GS in prostate mpMRI. To overcome the challenge of limited training data, the proposed approach exploits a transfer learning strategy where pre-trained CNNs are used to extract generic imaging features, which are then summarized into a small set of DEFs. These radiomic descriptors offer a highly-compact representation of image texture which captures the heterogeneity of imaged tissues. To ensure reproducibility, our study employs a publicly-accessible and verifiable database of prostate cancer images.
The major contributions of our paper are as follows:
• This is the first comprehensive work to encode the feature maps of well-known CNNs with a quantifier function (Shannon entropy) for predicting the GS of patients with PCa.
• We demonstrate the effectiveness of deep entropy features as input to a random forest (RF) classifier for predicting the GS.
• We propose a small set of DEFs that encode multi-scale (e.g., deep feature maps) PCa-related information.
• We present the classification performance for different prostatic zones and its relationship with GS.
The rest of this article is structured as follows. Section II describes the data used in this study as well as the proposed pipeline. We then present the experimental results in Section III and discuss our main findings in Section IV. Finally, Section V concludes with a summary of our work's main contributions and results.

II. MATERIALS AND METHODS
This section describes the public PCa dataset and explains the data acquisition and preprocessing steps, the proposed pipeline, and the performance metrics.

A. PATIENTS AND DATA ACQUISITION
The Cancer Imaging Archive (TCIA) was accessed to acquire patient data and mpMRI images for this study. TCIA hosts a publicly-accessible repository of labelled imaging data sponsored by the International Society for Optics and Photonics (SPIE), National Cancer Institute/National Institutes of Health (NCI/NIH), and the American Association of Physicists in Medicine (AAPM) [25]. Our study uses the labeled training data of the SPIE-AAPM-NCI Prostate MR Gleason Grade Group Challenge (PROSTATEx-2), comprising a total of 112 PCa lesions from 99 different subjects [26]. The testing data of the PROSTATEx-2 dataset was not considered in our work since it does not contain labels. The GS of each tumor was determined via MRI-localization. Specifically, MR studies were read and reported by an expert radiologist (>20 years of experience in prostate MR) who indicated areas of suspicion with a score per modality using a point marker. A biopsy was then performed for areas considered as cancer. The biopsy process was performed under MR-guidance and confirmation scans of the biopsy needle in situ were done to achieve the highest localization accuracy [25]. At each stage, a physician with relevant expertise in the procedure was involved. Based on the biopsy specimens and the mpMRI report, tumors were partitioned into different Epstein grade groups [27], as per their GS: G1 (GS ≤ 6), G2 (GS 3 + 4 = 7), G3 (GS 4 + 3 = 7), G4 (GS = 8), or G5 (GS ≥ 9) ( Table 1). As all patient data were accessed through an anonymized public resource, no institutional review board or Health Insurance Portability and Accountability Act approval was required.
Images were acquired by either a Siemens 3T MAGNETOM Trio or Skyra MRI [27]. Pixel spacing, slice thickness, and contrast varied within the included cohort. Image heterogeneity was corrected by resampling all images to a common voxel resolution of 1 mm³, for a total size of 320 × 320 × 19 voxels. Grey-level intensities were then normalized to the [0, 255] range by a linear transformation that maps the minimum value to 0 and the maximum value to 255. We identified an 84 × 84 pixel 2D ROI from each MRI sequence. The selected ROI included the abnormal area expertly identified by the PROSTATEx-2 Challenge. Using these ROIs, we derived DEFs from the T2-WI, ADC, and DCE images.
FIGURE 1 (caption partially recovered). 2) Upsampling of the ROI (e.g., 224 × 224 pixels) for processing by the 9 pre-trained CNNs. The texture of the CNN layer-blocks (e.g., convolution layers, max pooling layers, ReLU, normalization and fully-connected layer) is quantified using entropy. 3) The capacity of the features to predict the GS is evaluated via uni- and multivariate analyses.
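The intensity normalization, ROI extraction, and upsampling steps described in this subsection can be illustrated with a minimal sketch. It is not the authors' implementation (the study used Matlab): the function names are hypothetical, the ROI is assumed to be centered on a given lesion coordinate and clamped to the image border, the image is assumed to be at least as large as the ROI, and nearest-neighbour interpolation is assumed for the upsampling because the interpolation method is not stated.

```python
def rescale_to_0_255(image):
    """Linearly map the intensities of a 2D image (list of rows) to [0, 255]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast; map everything to 0.
        return [[0.0 for _ in row] for row in image]
    scale = 255.0 / (hi - lo)
    return [[(v - lo) * scale for v in row] for row in image]


def crop_roi(image, center_row, center_col, size=84):
    """Cut a size x size window around a point, shifted to stay inside the image."""
    half = size // 2
    top = min(max(center_row - half, 0), len(image) - size)
    left = min(max(center_col - half, 0), len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def upsample_nearest(roi, out_size=224):
    """Resize a 2D ROI to out_size x out_size by nearest-neighbour lookup (assumed)."""
    n_rows, n_cols = len(roi), len(roi[0])
    return [
        [roi[r * n_rows // out_size][c * n_cols // out_size] for c in range(out_size)]
        for r in range(out_size)
    ]
```

For a 320 × 320 slice, `upsample_nearest(crop_roi(rescale_to_0_255(slice_2d), row, col))` would yield a 224 × 224 input of the kind expected by the pre-trained networks.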

B. DEEP ENTROPY FEATURE EXTRACTION
The deep entropy features (DEFs) employed in the proposed radiomics pipeline (Figure 1) measure the spatial heterogeneity (i.e., texture) of feature maps computed by a pre-trained deep CNN. In this study, we considered 9 well-known 2D CNN architectures that were pre-trained on natural images from the ImageNet database: Xception [28], AlexNet [29], Inception ResNet-v2, GoogleNet, Inception-v3 [30], SqueezeNet [31], ResNet101 [32], NASNet-mobile [33] and NASNet-large [33]. Given the 2D ROIs extracted from mpMRI, each network was applied separately on the T2-WI, ADC, and DCE MRI series to obtain imaging descriptors corresponding to the feature maps of convolutional blocks. In most CNNs, a convolutional block is composed of the following sequence of operations: convolution, pooling, normalization, and rectified linear unit (ReLU) activation. For extracting DEFs, we computed the entropy of each feature map of the 9 CNNs. Toward this goal, the values at each position of a feature map are aggregated into a discrete probability distribution (i.e., histogram) by grouping them into 256 equal-sized bins. Let $p_i$ be the ratio of values of a feature map falling into bin $i$; the entropy is computed as
$$E = -\sum_{i=1}^{256} p_i \log p_i .$$
Feature maps with high entropy correspond to textures having more pronounced heterogeneity. The number of obtained DEFs can vary for each CNN, according to the number of convolution blocks in the network.
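As an illustration of the entropy computation described above, the following sketch bins the values of one feature map into 256 equal-width bins and evaluates the Shannon entropy of the resulting histogram. Several details are assumptions rather than facts from the study: the function name is hypothetical, the bins are taken to span the minimum-to-maximum range of the map, and a base-2 logarithm is used (the text does not state the base).

```python
import math


def feature_map_entropy(values, n_bins=256):
    """Shannon entropy (in bits, base 2 assumed) of a flattened feature map."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values fall in a single bin, so the distribution has no uncertainty.
        return 0.0
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins; clip it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    # Empty bins contribute nothing (p * log p tends to 0 as p tends to 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

Under these assumptions the value ranges from 0 for a constant map to 8 bits when the values are spread uniformly over all 256 bins, which matches the interpretation that higher entropy reflects more heterogeneous texture.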

C. DEEP ENTROPY FEATURE EVALUATION AND MODELING
Uni- and multivariate analyses were performed to assess the relationship between DEFs and GS. First, we used Spearman correlation to identify the features most correlated to GS [34]. The Wilcoxon rank-sum test [35] was then employed to compare the distribution of features in lesion groups defined based on GS. For this second analysis, we considered five different partitions of lesions into two separate groups: G1 vs all (G2-5); G2 vs all (G1+G3+G4-5); G3 vs all (G1-G2+G4-5); G4G5 vs all (G1-3); G1G2 vs all (G3-5). For each of these binary partitions, we performed a Wilcoxon rank-sum test on individual features to identify those having a significantly different distribution across the two lesion groups. To account for multiple comparisons, the p-values of all Wilcoxon tests and Spearman correlation estimates were adjusted using the Holm-Bonferroni correction. Statistical significance was defined as corrected p < 0.05 [36].
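The Holm-Bonferroni adjustment mentioned above follows a standard step-down procedure, sketched below for a list of raw p-values. The function name is hypothetical and this is a generic illustration, not the code used in the study.

```python
def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values, in the original order."""
    m = len(p_values)
    # Visit the hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A feature would then be declared significant when its adjusted value is below 0.05. With thousands of DEFs tested at once, the multiplier for the smallest p-values is close to the number of tests, which makes this threshold hard to reach.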
In a multivariate analysis, we used the DEFs as input to a random forest (RF) classifier for predicting different combinations of Gleason grade groups. While different classifiers could be used for the same tasks, we chose the RF classifier because it performs well when training data is limited and has an optimized selection mechanism that allows interpretability [37]. By integrating decision tree bagging with random subspace search, it decreases errors due to the heterogeneity of training data and offers strong generalization to new samples [38]. RF classifiers also have a relatively small number of hyper-parameters to tune compared to more complex models such as neural networks, the major ones being the number of trees, the maximum tree depth, and the minimum number of samples in a node. In our experiments, these hyper-parameters were selected using grid search on a validation set. The selected values were 500, 15 and 4, respectively, for the number of trees, the maximum tree depth, and the minimum number of samples in a node.
In this analysis, we considered the same partitions as before, i.e. G1 vs all, G2 vs all, G3 vs all, G4G5 vs all, and G1G2 vs all to define five binary classification problems. A 5-fold cross-validation (CV) was performed to obtain performance measures. In this internal validation technique, data samples are randomly divided into five folds. Each of these folds is then used, in turn, to calculate the area under the ROC curve (AUC) of an RF model trained with remaining samples (those in the 4 other folds). To generate a quantifiable performance metric, we then computed the average of AUC values across all five folds. The out-of-bag sample permutation error of the RF classifier was used to measure the relative importance of each feature for predicting the Gleason grade group. Importance values were computed for every RF tree and then averaged over the entire ensemble. To obtain normalized values, we divided them by the standard deviation of the ensemble. Features are considered to be predictive of the grade group if they have a positive importance value [39].
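The cross-validation protocol described above can be sketched as follows. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), and the fold-wise values are averaged. All names are hypothetical, and `fit_predict` is a placeholder callable standing in for the training and scoring of an RF model, which is not reproduced here.

```python
import random


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_auc(features, labels, fit_predict, k=5, seed=0):
    """Average test AUC over k folds.

    fit_predict(train_x, train_y, test_x) is a user-supplied placeholder that
    trains a model and returns one score per test sample.
    """
    aucs = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores = fit_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in test_idx],
        )
        aucs.append(auc_from_scores([labels[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)
```

Because the folds are drawn purely at random in this sketch, a fold could in principle contain a single class for a rare group such as G4G5, in which case `auc_from_scores` raises an error rather than returning a misleading value.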
To further validate results, for each classification task (G1 vs all, G2 vs all, etc.), we randomly divided the dataset into a training (70%) and a testing (30%) cohort using balanced populations of each grade group in training. The performance of predictive models was measured based on the AUC and the confusion matrix obtained on the test samples. Moreover, we analyzed the localized relationship between DEFs and GS by considering separately the lesions located in three different anatomical zones of the prostate, i.e., peripheral zone (PZ), transitional zone (TZ), and anterior zone (AZ). The zone labels of PCa lesions were provided by TCIA in the dataset. Among the 112 lesions, 50 were located in the PZ, 17 in the TZ, and 45 in the AZ. For each zone, we measured the Spearman correlation between DEFs and GS and used the Kruskal-Wallis test to establish significant differences between the feature distributions of distinct Gleason grade groups. Once again, Holm-Bonferroni correction of p-values was used to account for multiple comparisons. All our processing/analysis steps were performed using the Matlab Statistics and Machine Learning Toolbox.
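One possible reading of the 70%/30% hold-out split with balanced grade groups is a stratified split, in which each Gleason grade group contributes about 70% of its lesions to training. The sketch below follows that interpretation only; the exact balancing scheme used in the study is not specified, and the function name and rounding rule are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(groups, train_fraction=0.7, seed=0):
    """Split sample indices so each group keeps roughly the same train/test ratio."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for group in sorted(by_group):
        members = by_group[group][:]
        rng.shuffle(members)
        # Rounding per group is an assumption; it keeps small groups represented.
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)
```

Splitting per group matters most for the small G4 and G5 groups (8 lesions each), which a purely random split could leave almost absent from the test cohort.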

III. RESULTS

A. CHARACTERISTICS OF THE STUDY POPULATION
Histopathological data were available to confirm the GS of the 112 malignant lesions identified in mpMRI. All mpMRI images had the same three series available, i.e., T2-weighted imaging (T2-WI), apparent diffusion coefficient (ADC), and dynamic contrast enhancement (DCE) series. Among these 112 lesions/findings, there were 36, 41, 19, 8, and 8 tumors with GS ≤ 6 (G1), GS = 7 (3+4; G2), GS = 7 (4+3; G3), GS = 8 (4+4, 3+5, or 5+3; G4), and GS ≥ 9 (G5), respectively (Table 1). In the cohort of 99 patients (average age 65 years, range 42-78 years), 87 patients had one lesion, 11 patients had two, and a single patient had three [40].
Table 2 reports the number of layers in each of the 9 pre-trained CNN architectures and their corresponding number of unique DEFs. The layer names of these architectures are reported in Supplementary Table S1. Combining features of all 9 networks yields a total of 7,857 unique DEFs. The Spearman rank correlation (ρ) between GS and all significant DEFs (determined by input modality and layer name) is given in Table 3 and Figure 2. A DEF is significant if it has a correlation p-value < 0.05 after correction. The correlation ρ and corrected p-values of all layers can be found in Supplementary Table S2. It can be seen that the NASNet-mobile architecture yields the most correlated DEFs and the feature with the highest absolute correlation of ρ = 0.47. After Holm-Bonferroni correction, a total of 5, 4, 3, 2, 6, 3, 5, 16 and 2 DEFs extracted from the Xception, AlexNet, InceptionResNet-v2, GoogleNet, Inception-v3, SqueezeNet, ResNet101, NASNet-mobile and NASNet-large architectures, respectively, were statistically correlated with Gleason grade groups. Statistically-correlated DEFs are found for both T2-WI and ADC modalities in all 9 pre-trained CNNs.

B. ANALYSIS OF DEEP ENTROPY FEATURES
Results of the Wilcoxon rank-sum test comparing the distribution of DEF values across Gleason grade groups are shown in Figure 3. We find 11 DEFs with statistically-significant differences for the GS ≥ 8 (i.e., G4G5) vs GS < 8 lesion partition, with p-value < 0.05 following correction. All these significant DEFs were derived from T2-WI images. No significant DEFs were found when comparing other lesion partitions, owing to the p-value correction over a large number of comparisons. The full set of p-values is provided in Supplementary Table S3.
FIGURE 3 (caption partially recovered). Color-coded from 0 (dark blue: least significance) to 6 (dark red: greatest significance). The x-axis represents the compared Gleason grade groups and the y-axis corresponds to the most significant DEFs per Gleason grade group comparison, for each CNN architecture.
Table 4 summarizes the results of the 5-fold CV analysis evaluating the RF classifier's ability to predict the Gleason grade group of the 112 lesions. When considering DEFs of each network architecture separately, NASNet-mobile yields the best prediction in all but one case (i.e., G3 vs all), a result which is consistent with the previous correlation analysis. The highest accuracy is obtained when discriminating between G4G5 and other Gleason grade groups (G4G5 vs all), with an AUC of 92.68%. Furthermore, combining DEFs derived from all 9 CNNs (7,857 features) into the same RF model boosts performance in all but one classification task compared to NASNet-mobile, with relative AUC improvements of 0.38, 4.80, 3.90, 0.35 and -1.67, for G1 vs all, G2 vs all, G3 vs all, G4G5 vs all, and G1G2 vs all, respectively. Figure 4 compares the DEFs with the greatest importance values obtained by applying the pre-trained CNNs to the T2-WI, ADC, and DCE series of the mpMRI. Specifically, Figure 4 shows the 10 DEFs with the highest importance value (i.e., permutation error on out-of-bag samples) for each classification task.
Overall, the image modality (i.e., T2-WI, ADC or DCE) and network architecture leading to the most predictive features [...]
Figure 5 gives the confusion matrix and ROC curves of the RF model for the five tasks. The model achieves an accuracy of 80.95%, 83.33%, 100.00%, 100.00% and 47.62%, respectively, for G1 vs all, G2 vs all, G3 vs all, G4G5 vs all and G1G2 vs all. Correspondingly, the highest AUC is obtained for discriminating between GS < 8 and GS ≥ 8 (G4G5 vs all), with an AUC of 98.20%. In addition, Table 5 illustrates the performance [...]
TABLE 4. Area under the ROC curve (%) of the RF classifier for discriminating between different Gleason grade group partitions, when using DEFs from 9 pre-trained CNNs.

F. ZONE-SPECIFIC RELATIONSHIP BETWEEN DEFs AND GLEASON SCORE
It is unknown whether peripheral zones (PZ) are biologically different from transitional zones (TZ). As radiomic analysis of mpMRI is applied to understand the mechanisms of PCa, the impact of tumor location on biological behavior may have significant implications for the choice of optimal treatment modality [40]. Figure 6A shows the DEFs most correlated with GS, for lesions located in the three anatomical zones of the prostate (i.e., PZ, TZ, and anterior zone, AZ). We see that features with moderate (0.3 ≤ |ρ| ≤ 0.7) or high correlation (|ρ| > 0.7) are found in all three zones. However, after p-value correction, statistical significance can only be established for DEFs in peripheral lesions, with absolute correlation in the 0.6-0.63 range. Although more pronounced correlation is found for transitional lesions, statistical significance could not be confirmed due to the smaller number of lesions in this zone (i.e., 17 compared to 50 for PZ). Similarly, Figure 6B displays the results of the Kruskal-Wallis test comparing the DEFs across Gleason grade groups by anatomical zone. Following Holm-Bonferroni correction, 39 DEFs derived from PZ were statistically significant, with the highest significance obtained for ADC-NASNet-mobile features. The complete set of corrected p-values can be found in Supplementary Table S6. All correlation coefficients and corrected p-values are reported in Supplementary Table S5.

IV. DISCUSSIONS
We implemented a novel approach using deep entropy features (DEFs) derived from all layers of 9 different pre-trained CNNs to analyze mpMRI images of PCa lesions. This contrasts with our previous study on brain MRI, where features extracted only from the most superficial layers of a single CNN were used to predict Alzheimer's or mild cognitive impairment [45]. Experiments in the current study identified 46 DEFs derived from the pre-trained CNNs that were significantly correlated to GS. Furthermore, when given as input to an RF classifier, these combined features led to a highly accurate prediction of the Gleason grade group.
There is currently a clinical need for radiomic tools that can predict the aggressiveness of PCa lesions with high reliability. Even with modern advances in targeting, the existing diagnostic gold standard of TRUS biopsy is not as reliable as definitive resection [46], while also carrying risks. Reported morbidity includes pain, bleeding, lower urinary tract symptoms, erectile dysfunction, and infection, which can be life-threatening in some cases [6], [47]. Cost is another limiting factor of biopsy. Indeed, a single biopsy requires a medical specialist to acquire a sample and then a separate specialized pathologist to evaluate the resulting specimen with reasonable accuracy [48]. In many centers, TRUS biopsy will already be preceded by an MRI for guidance or to determine the necessity of additional biopsy [49]. Therefore, the implementation of an MRI-derived radiomics approach would not add significant cost, and could replace two expensive steps for the diagnosis and/or re-evaluation of prostate cancer.
The proposed method based on DEFs compares favorably with previous approaches for predicting the Gleason grade group of PCa lesions. In the PROSTATEx Challenge [26], the best-performing method among 32 submissions achieved an AUC of 87% for discriminating between clinically significant and non-significant lesions. Since almost all clinically-significant lesions (i.e., 71 out of 73) have a Gleason grade group > 1, we can compare this value with the AUC of 88.8% obtained by our method for the G1 vs all task.
Our experiments also confirm results of previous works showing the efficacy of radiomics for analyzing PCa images [40], [50]-[55]. In [56], entropy-based texture features extracted from gray-level co-occurrence matrices (GLCMs) were found to be related to GS; more specifically, a higher GS was associated with a higher ADC entropy and a lower ADC energy. Likewise, the average of ADC maps is thought to be a biomarker for GS: the combination of the ADC volume and average achieved an AUC of 74.9% for discriminating biologically low-risk PCa (GS = 6) from higher-risk malignancies (GS ≥ 7) [21]. Other methods have similarly discriminated between low and high GS, for example by combining T2-WI and spectroscopy images [57]. Strategies that compensated for unbalanced samples via imputation have found that texture features could reliably discriminate intermediate-risk prostate cancers (GS 6, 3+4, and 4+3) from each other [58]. When implemented as combined radiomic feature models (joint intensity matrices and GLCM), AUC values were 78.40% (GS = 6), 82.35% (GS = 3+4), and 64.76% (GS ≥ 4+3) [42].
In the same vein, a model which combined 45 radiomic features achieved AUC values of 83.40% (GS = 6), 72.71% (GS = 3+4), and 77.35% (GS ≥ 4+3) [41]. Moreover, volume derived from mpMRI has shown a moderate correlation with GS [59], while the combined volume and mean ADC features achieved an AUC of 70.4% for classifying GS 6 from GS ≥ 7 tumors [21]. Compared to features based on texture, shape features are more sensitive to the manual segmentation of ROIs. The current study overcomes this problem by applying a fixed ROI size of 84 × 84 pixels.
Our experiments showed that certain CNN architectures provide more discriminative features for analyzing PCa lesions. In particular, the NASNet-mobile network yielded the features most correlated to GS, with the highest correlation of 0.47 (p=0.003) obtained for ADC image in layer Separable-Conv-1-Normal-right2-1-point-wise. This architecture, as well as the NASNet-large network, also showed prominent differences when comparing lesions grouped based on Gleason score. Notably, a total of 9 DEFs computed by these two networks from T2-WI images gave statistically-significant differences when comparing lesions with GS < 8 vs GS ≥ 8, with corrected p < 0.05. Furthermore, among the 9 pre-trained CNNs, the NASNet-mobile network gave the most predictive DEFs when used as input to a RF classifier (Figure 4 and Figure 5). Hence, the features of this model achieved the highest AUC for identifying G1, G2, G4G5, and G1G2 lesions.
Compared to applications involving natural images, deep learning models like CNNs have had a more limited success for classifying medical images. This is largely due to the much smaller amount of training data in clinical applications, but also to the particularity of medical images which often have poor contrast and low resolution. A recent survey on deep learning for Alzheimer's prediction [60] found that most studies reporting high accuracy suffered from some form of data leakage (e.g., using images from the same subject in both training and testing).
In the current work, we alleviate the problem of overfitting when training with a small dataset via a transfer learning strategy that computes a compact set of informative features from pre-trained CNNs. The proposed DEFs are based on entropy, a well-known concept of information theory to measure uncertainty of random variables. In our radiomics model, entropy is used to assess the heterogeneity of CNN feature maps considered as image textures. Information theory has been explored for various applications in computational biology [61], for example, in a maximal information transduction estimation approach to reduce uncertainty in transcriptome analyses [62]. To our knowledge, this is the first work proposing DEFs from different pre-trained CNNs for PCa analysis.
This work has notable limitations, the foremost being the limited number of retrospectively evaluated PCa lesions (n = 112) and patients (n = 99). Validation on a larger scale and with a prospective design is required before broader clinical application. Furthermore, evaluating images from multiple medical centers would be an essential step in demonstrating the generalizability of DEFs for predicting PCa lesion aggressiveness. A larger patient cohort would also enable a better quantification of the variability between CNN features and their relationship to GS. Another key limitation is our reliance on biopsy data, which has an understandable potential for sampling error despite modern targeting approaches [46]. This limitation is also present in other radiomics works for PCa [53] and is likely due to the logistical difficulties of acquiring pre-operative mpMRI alongside an anatomically-correlated intact pathological specimen. To bridge this gap in the literature, a study with complete prostatectomy specimens would likely have a limited sample size, but the application of our pre-established CNN methodology could minimize this limitation while validating the approach in an external data set. Future work will focus on using DEFs to assess GS before therapy and as a post-diagnostic prognostic test.

V. CONCLUSION
In this study, we generated and evaluated novel radiomic features based on the entropy of feature maps in 9 pre-trained CNNs fed with mpMRI data of 112 PCa lesions. Used as input to an RF classifier, these features discriminated lesions with GS of 6, 3+4, 4+3, 8, and GS ≥ 9 with an AUC of 80.08, 85.77, 97.30, 98.20, and 86.51%, respectively. Our results surpass, via an indirect assessment, the published performance of the clinically-implemented PI-RADS as well as recent radiomics models in the literature. We conclude that the use of pre-trained CNNs to generate DEFs is an efficient method to empower radiomics analysis for PCa. The potential clinical yield of this work is a tool that can not only limit misclassification but could also be refined to optimize non-invasive evaluations of a PCa's malignant potential. Next steps will include combining DEFs with other novel imaging features and prospective assessments that can quantify their clinical applicability.
SAIMA RATHORE received the Ph.D. degree in computer science from the Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan, in 2015. She is currently a Research Fellow with the Center for Biomedical Image Computing and Analytics (CBICA), Radiology Department, University of Pennsylvania. She has industry software design and development experience of 11 years. At CBICA, she is the Lead Scientific Developer of Cancer Imaging Phenomics Toolkit. Her research interests include medical image analysis, segmentation, classification and evolutionary algorithms. She has published her work in leading scientific journals and presented it at various conferences and universities around the world.
PAUL SARGOS is currently a Radiation-Oncologist working with the Institut Bergonié, Comprehensive Cancer Care Center, Bordeaux, France. His clinical activity is mainly oriented in the management of sarcomas and genito-urinary tumors, where he develops technical innovation. He coordinates several prospective studies in soft tissue sarcomas, prostate, bladder, and testis cancers. He is the author or a coauthor of more than 50 referenced publications.