Accurate Detection of Non-Proliferative Diabetic Retinopathy in Optical Coherence Tomography Images Using Convolutional Neural Networks

Diabetic retinopathy (DR) is a disease that forms as a complication of diabetes. It is particularly dangerous since it often goes unnoticed and can lead to blindness if not detected early. Despite the clear importance and urgency of such an illness, there is no precise system for the early detection of DR so far. Fortunately, such a system could be achieved using deep learning, including convolutional neural networks (CNNs), which have gained momentum in the field of medical imaging because they can be integrated effectively into various systems in a manner that significantly improves performance. This paper proposes a computer aided diagnostic (CAD) system for the early detection of non-proliferative DR (NPDR) using CNNs. The proposed system is developed for the optical coherence tomography (OCT) imaging modality. Throughout this paper, all aspects of deployment of the proposed system are studied, starting from the preprocessing stage required to extract input retina patches to train the CNN without resizing the image, to the use of transfer learning principles and how to effectively combine features in order to optimize performance. This is done by investigating several scenarios for the system setup and then selecting the best one, which the results revealed to be a system based on two pre-trained CNNs, in which one CNN is independently fed by nasal retina patches and the other by temporal retina patches. The proposed transfer learning based CAD system achieves a promising accuracy of 94%.


I. INTRODUCTION
Ophthalmologists today are capable of leveraging computer assisted diagnostic (CAD) systems to inform their opinions, in contrast to the traditional methods of visual interpretation and observation. CAD systems are still a new technology in the field of medicine, and there is continuous interest in the development of such systems due to their capability of improving the medical services provided to the community in terms of accuracy and reliability in the diagnosis of diseases. Meanwhile, machine learning is paving the way for breakthroughs in different areas of medical imaging such as classification [1], segmentation [2], disease detection [3], and image registration [4]. The application of deep learning [5]-[8], a subset of machine learning algorithms, has made a tremendous impact in the area of medical image processing research [9], [10]. Deep learning is the leading machine learning paradigm in the computer vision and image processing domains, particularly as regards convolutional neural networks (CNNs) [11]. CNNs are especially powerful in solving problems that are computationally difficult or have a high error rate, such as medical image recognition, with outstanding performance results [12]. In this context, we decided to use CNNs for the early detection of one of the most serious ophthalmological concerns, which is diabetic retinopathy (DR).
The associate editor coordinating the review of this manuscript and approving it for publication was Nikhil Padhi.
Blindness resulting from diabetes, which is becoming an increasingly alarming issue, is a consequence of the associated eye disease: DR. This disease, which develops as a complication of diabetes, particularly type II [13], [14], arises specifically from chronically high levels of sugar in the blood, associated with swelling and damage of the tiny retinal blood vessels in the eye [15]-[17]. This leads to distortion of the vision followed by scarring of the retina in advanced stages, and finally blindness [16]. It is worth mentioning that DR is one of the leading causes of blindness in adults [18], [19]. This problem is further exacerbated by the fact that 75% of diabetic patients are not aware of the eye complications that they may be experiencing [13]. To prevent such an outcome, it is paramount to diagnose DR as soon as possible. Early detection and intervention can slow down the process or even halt it completely [20], [21], which in turn protects the vision of the patient [22]. However, despite the significance of this matter and the notable rise in the prevalence of diabetes, a precise procedure to detect early retinal changes for DR prevention is absent [17], [21]. DR can be broadly classified as proliferative (PDR) or non-proliferative (NPDR) [10]. NPDR is characterized by the presence of damaged blood vessels in the retina in addition to fluid leakage, which results in retinal swelling and wetness. In the case of PDR, multiple regions of the retina are affected by the appearance of new abnormal blood vessels, which makes it a severe and advanced DR stage. The work presented in this paper is limited to NPDR.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
One of the ophthalmic imaging modalities used for early DR detection is funduscopy [23], [24]. This modality is an area of active research for the development of CAD systems, because while funduscopy is well understood in terms of its imaging principles, interpretation requires a highly trained ophthalmologist; hence it is expensive [25]. Related work on fundus imaging for early detection of DR includes a trainable system for automated micro-aneurysm detection [26], performing with 65% sensitivity and averaging 27 false positives per image by supposition testing. Another system presented in [27] is able to detect hard and soft exudates based on combining fine and coarse segmentation; however, it sometimes does not discriminate between exudate and non-exudate regions if their features are similar. On the other hand, in [28], Pachiyappan et al. use a combination of filtering, morphological processing, and thresholding for detecting DR macular abnormalities, while in [29], automatic extraction of the retinal vasculature was performed in order to obtain the blood vessel network. Similar feature-based algorithms for fundus image interpretation have also been reviewed in [30]. Recently, CNNs have also been used for exudate detection in fundus images from diabetic patients [31]. For optical coherence tomography angiography (OCTA), Eladawi et al. [32] proposed a CAD system for the early detection of DR based on vascular segmentation of various layers of the retina using spatial statistical modeling.
Another medical imaging modality that may be employed in the early detection of DR is optical coherence tomography (OCT) [33], which is useful because it facilitates evaluation of retinal morphology at microscopic resolution [34]. In turn, various retinal abnormalities including glaucoma, macular degeneration, and diabetic macular edema may be diagnosed in a non-invasive manner. Compared to fundus imaging, OCT is more favorable because it supports quantitative evaluation, as it is capable of capturing depth, in addition to its lower cost and its ability to allow monitoring of changes free of human bias [35]. However, OCT is relatively unexplored in comparison to fundus imaging in terms of early detection of DR. Related to this, Roychowdhury et al. [36] presented an automatic OCT system for localizing cysts in diabetic macular edema (DME) using 6 layers of the retina. Correia et al. [37] used Monte Carlo simulation to find the changes in the OCT images of DME patients from the cellular perspective. Trabelsi et al. [38] presented a technique to detect cystoid macular edema from OCT images.
This paper investigates the early detection of DR in OCT images, which is principally performed using CNNs.
The proposed system arrives at an optimal CNN architecture for early detection of DR through the exploration of various CNN configurations and parameters. In contrast to conventional feature extraction methods, CNNs classify normal and DR images effectively without the need for manually extracted features. The system starts with roughly segmenting the 12 layers of the retina and localizing the fovea using unsupervised learning [39]. Patches are then extracted from both the nasal and temporal sides of the fovea. These patches, which are extracted from different subjects, are aligned in both the x and y directions, where the x-direction alignment depends on the location of the fovea, while the y-direction alignment is based on the position of a certain retina layer. Different retina layers have been investigated for the y-direction alignment. The aligned nasal patches and the aligned temporal patches are then used for training CNNs, from which features are extracted and then fused using a support vector machine (SVM) that is responsible for giving the final classification (DR vs normal). Throughout the work, different scenarios for training the CNNs have been tested, among which transfer learning is used. The following items are studied: 1) the effect of transfer learning on improving the performance of the proposed CAD system, given the scarcity of the data; 2) the effect of fusing CNNs retrained with different datasets on the overall system performance; 3) the depth of the CNN layers required to extract features to train the final classifier used for data fusion; and 4) the OCT layer that must be segmented in order to be used for the y-coordinate axis alignment of the extracted patches for optimum results. The rest of this paper is organized in three sections as follows: a section that presents the materials and methods, followed by a section that discusses the experimental results, and finally a section that gives the conclusion.

II. MATERIALS AND METHODS
A simplified block diagram of the proposed CAD system for early detection of DR in OCT images using CNNs is shown in Fig. 1. The proposed system is composed of (1) a preprocessing stage which includes rough segmentation of retina layers, and fovea detection, in addition to patch extraction and alignment; (2) a CNN-based feature extraction stage; and (3) a classification stage. The details of these stages as well as the used dataset and the used validation technique are given below.

A. OCT DATASET
Patients with and without DR were enrolled at the Kentucky Lions Eye Center at the University of Louisville between June 2015 and December 2015 (University of Louisville IRB protocol 18.0010). Informed consent (or assent) was provided by each participant. Exclusion criteria included history of retinal pathology, including diabetes-related, and severe myopia, defined as refractive error ≤ −6.0 diopters. In all, 52 subjects were enrolled, 26 of whom had DR, ranging in age from 40 to 79 years.
Data used for training and testing of the CAD system were obtained using a clinical OCT scanner, Cirrus HD-OCT 5000 (Carl Zeiss Meditec, Dublin, California). B-scans were obtained over a 21-line raster across the macula of both eyes. For each eye, a single B-scan, passing through the fovea, was selected for analysis. Images were 1024 × 1024 pixels, 8-bit grayscale, capturing an optical slice 2 mm deep and 9 mm from side to side (nasal-temporal).

B. PREPROCESSING
The preprocessing stage, in which we extract the input patches to be fed to the CNN, is illustrated in Fig. 2. It starts with rough segmentation of the retina proper from the rest of the image, and identification of the fovea. The results of these two processes are then used for the positioning and extraction of the appropriate patches as per the schema of the proposed system. The segmentation of the original OCT scan into twelve different layers is performed by the application of an unsupervised parametric mixture model and Markov Gibbs Random Fields [39]. This is inspired by the appearance of the retina in an OCT B-scan, which reveals approximately 12 bands of greater or lesser reflectivity as shown in Fig. 3. Histological studies have correlated these bands with the layers of the retina, proceeding from the vitreous body to the choroid: 1. nerve fiber layer (NFL), 2. ganglion cell layer (GCL), 3. inner plexiform layer (IPL), 4. inner nuclear layer (INL), 5. outer plexiform layer (OPL), 6. outer nuclear layer (ONL), 7. external limiting membrane (ELM), 8. myoid zone (MZ), 9. ellipsoid zone (EZ), 10. outer segments of the photoreceptors (OPR), 11. interdigitation zone (IZ), and 12. the retinal pigment epithelium (RPE).
As shown in the OCT image in Fig. 3, the retina has 12 layers, with the fovea in the middle of the image. The thickness of these 12 layers is not constant across the image. At the fovea, the vitreous body is nearly adjacent to the ONL (layer 6), while layers 1-5 almost vanish. Also, layers 1-5 are thickest in the foveal rim, which surrounds the fovea. This structure is common among retinas of different subjects, so it can be used to roughly detect the fovea location and guide the patch extraction procedure for consistent representation of the retina across different OCT images. Patches were extracted from both sides of the fovea (the temporal and nasal sides) and oriented to align with the retinal layers. Consequently, the amount of extracted background (vitreous or choroid) is minimized, and feature extraction should be independent of any peculiarities (e.g. slight tilt or off-center position) of a given OCT scan.
The fovea coordinate localization starts off with applying a median filter in order to remove any impulsive noise. This is then followed by the "à trous" algorithm [40] that decomposes each scan into scale-space components of coarser and finer detail by undecimated wavelet transform as per the aforementioned details. Edge detection in scale space allows for easy identification of high contrast boundaries in the OCT image: vitreous-NFL, MZ-EZ, and RPE-choroid. Contours are first detected as local gradient maxima in the appropriate wavelet component, then smoothed using adaptive spline smoothing. The typical retina structure described above leads to identifying the fovea as the point on the vitreous-NFL boundary at minimum distance from the MZ-EZ boundary. When computing these distances, it is important to correct for the non-square pixel aspect ratio of typical OCT scanners. The preprocessing algorithm is based on the work presented in [39].
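The minimum-distance rule for the fovea can be illustrated with a short sketch. This is not the implementation of [39]; it assumes the two boundaries are already available as one row index per image column, and it derives the pixel spacing from the scan geometry given in Section II-A (1024 × 1024 pixels covering 9 mm by 2 mm). The function name `locate_fovea` and the brute-force search are illustrative choices only.

```python
import math

# Assumed pixel spacing, derived from Section II-A:
# 1024 x 1024 pixels covering 9 mm laterally and 2 mm in depth.
DX_MM = 9.0 / 1024  # lateral spacing per column
DY_MM = 2.0 / 1024  # axial spacing per row


def locate_fovea(vitreous_nfl, mz_ez, dx=DX_MM, dy=DY_MM):
    """Return (column, row, distance_mm) of the vitreous-NFL boundary point
    that lies closest to the MZ-EZ boundary.

    Each boundary is a list holding the contour's row index for every image
    column. Distances are measured in millimetres, which corrects for the
    non-square pixel aspect ratio.
    """
    best = None
    for xa, ya in enumerate(vitreous_nfl):
        for xb, yb in enumerate(mz_ez):
            d = math.hypot((xa - xb) * dx, (ya - yb) * dy)
            if best is None or d < best[2]:
                best = (xa, ya, d)
    return best
```

With the assumed spacing, one column step is 4.5 times longer than one row step, so a search in raw pixel units would weight lateral offsets far too lightly.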
Upon the detection of the fovea, the required patches' locations along the x-coordinate axis are computed with the origin at the fovea. These calculated points are used for extracting the corresponding vertical slice from the segmentation mask of the layer at which the patch extraction is centered along the y-coordinate axis. The pixel values of each extracted slice are then summed along the x-axis to account for any tilt or skew of the OCT scan. This sum acts as an efficient measure: if it is greater than zero, the corresponding row contains a significant part of the layer. The result of the preprocessing stage, as shown in Fig. 4, is a representative patch for each of the temporal and nasal sides, i.e. an image extracted from the original OCT scan, whose size matches the input layer of the CNN. Similar to the nasal and temporal patch extraction procedure discussed above, the preprocessing module could be tuned to generate center patches with the fovea in the middle as well as nasal and temporal patches that are distal from the fovea. These patches will be used in addition to the nasal and temporal patches to investigate several scenarios for the system setup, in order to select the best one.
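The row-sum test and the patch cropping described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names, the choice of the midpoint between the first and last layer rows as the vertical center, and the clamping at the image border are all assumptions. Images and masks are plain nested lists indexed as `[row][column]`.

```python
PATCH = 227  # patch size matching the AlexNet input stated in Section II-C


def layer_center_row(mask, x_start, width=PATCH):
    """Find the vertical center of the alignment layer inside a slice.

    A row counts as containing the layer when the sum of its mask values
    over columns x_start .. x_start + width - 1 is greater than zero.
    The center is taken (by assumption) as the midpoint between the first
    and last such rows.
    """
    rows = [r for r, row in enumerate(mask)
            if sum(row[x_start:x_start + width]) > 0]
    if not rows:
        raise ValueError("alignment layer not found in this slice")
    return (rows[0] + rows[-1]) // 2


def extract_patch(image, mask, x_start, size=PATCH):
    """Crop a size x size patch whose rows are centered on the layer."""
    center = layer_center_row(mask, x_start, size)
    top = center - size // 2
    # Keep the patch inside the image (an assumption, not from the paper).
    top = max(0, min(top, len(image) - size))
    return [row[x_start:x_start + size] for row in image[top:top + size]]
```

Here `x_start` stands for the left column of a patch computed relative to the fovea column; the offsets that define the nasal, temporal, central, and distal patches are not given in this section and would have to be supplied.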
The preprocessing stage inherently compensates for the mismatch between the dimensions of the original OCT scans, which are the input image data, and those of the input layer of the pre-trained CNNs, which will be used for feature extraction as discussed below. The preprocessing stage also eliminates unimportant information, thereby improving the speed and efficiency of the system.

C. FEATURE EXTRACTION AND CLASSIFICATION
Generally, the patches provided at the output of the preprocessing stage are fed into one or more CNNs, and features are extracted from certain layers of these networks. Each CNN involved may be pre-trained (i.e. based on transfer learning) or not. Throughout the work, the AlexNet CNN shown in Fig. 5 is used with an input patch of size 227 × 227 × 3. The CNNs with no transfer learning are randomly initialized.
In order to find the optimum parameters for the proposed system, various scenarios were investigated throughout this work to determine: 1) whether the application of transfer learning improves the accuracy of the algorithm compared with training the network from scratch, taking into consideration the fact that the dataset size is relatively small; 2) the patches to be used for best performance (distal nasal patches, nasal patches, central patches, temporal patches and/or distal temporal patches); 3) the OCT layer to be used for the y-direction alignment; and 4) the CNN layer to be used for feature extraction. A generic framework that illustrates the various scenarios is given in Fig. 6. The figure has 7 CNNs, where the features extracted from each CNNi are stored in a corresponding feature vector fvi. Each scenario investigates the use of 1 or more of these 7 CNNs, but the branches shown in the figure are never all used at the same time. They are included in a single figure for illustration purposes only. The extracted feature vector(s) is/are fed into a linear SVM, a machine learning architecture used in classification problems [41], with a fast stochastic gradient descent solver for the final classification. If more than 1 CNN is involved in the scenario, the corresponding feature vectors are fused at the beginning of the SVM module. This is carried out by concatenating the features that are extracted by default from the bottleneck stages of the CNNs. The CNNs involved in the various scenarios are 1) CNN1: Distal Nasal patch pre-trained CNN, 2) CNN2: Nasal patch pre-trained CNN, 3) CNN3: Nasal patch CNN (no transfer learning), 4) CNN4: Central patch pre-trained CNN, 5) CNN5: Temporal patch pre-trained CNN, 6) CNN6: Temporal patch CNN (no transfer learning) and 7) CNN7: Distal Temporal patch pre-trained CNN.
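The fusion-by-concatenation step followed by a linear SVM trained with stochastic gradient descent can be sketched in plain Python. This is an illustrative toy, not the solver used in the paper: the function names, the learning rate, the regularization strength, and the hinge-loss update schedule are all assumptions, and real feature vectors would have thousands of entries rather than two.

```python
import random


def fuse(*feature_vectors):
    """Concatenate the per-CNN feature vectors (e.g. fv2 and fv5) into one."""
    fused = []
    for fv in feature_vectors:
        fused.extend(fv)
    return fused


def train_linear_svm(samples, labels, epochs=200, lr=0.01, reg=0.01, seed=0):
    """Fit a linear SVM by SGD on the L2-regularized hinge loss.

    labels must be +1 (DR) or -1 (normal). Hyperparameters are
    illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Gradient step on the regularizer shrinks the weights.
            w = [wj * (1.0 - lr * reg) for wj in w]
            # Gradient step on the hinge loss, active only inside the margin.
            if margin < 1.0:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (DR) or -1 (normal) for a fused feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

For a two-CNN scenario, each training sample would be built as `fuse(fv_nasal, fv_temporal)` before being passed to `train_linear_svm`.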
Scenario 1 and Scenario 2 are concerned only with the nasal patches and are used to investigate whether transfer learning enhances the performance or not. In Scenario 1, CNN2 is used alone, while in Scenario 2, CNN3 is used alone. The training of CNN2 and CNN3 is done using the normal/DR dataset described above after being preprocessed. The pre-training of the CNN to be used for transfer learning was performed on a subset of the large-scale ImageNet database [42], containing 1.2 million real-life images and 1000 object categories.
The pre-training is used only as far as layer pool5 (Fig. 5) for the activations of the hidden layers, while subsequent layers of the AlexNet are treated identically for the comparison between the use and non-use of transfer learning. Similarly, in Scenario 3 and Scenario 4, patches that are extracted from the temporal side of the fovea are likewise input to two AlexNet CNNs, one of which was pre-trained (CNN5 in Scenario 3), and the other was not (CNN6 in Scenario 4), for the purpose of carrying out the same comparison. Note that each grayscale patch is input to all three different channels of the AlexNet, which was designed to operate on color images. Scenario 5 investigates using only the pre-trained CNN4 which is fed by central patches with the fovea in the center.
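Feeding a grayscale patch to all three input channels amounts to replicating each pixel value three times. A minimal sketch, assuming the patch is a nested list indexed as `[row][column]` and that a height × width × 3 layout is wanted (the function name is illustrative):

```python
def gray_to_three_channels(patch):
    """Copy each grayscale value into three identical channels.

    A 227 x 227 grayscale patch becomes a 227 x 227 x 3 array-like
    structure, matching the input size of a network designed for
    color images.
    """
    return [[(value, value, value) for value in row] for row in patch]
```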
It was also tested whether improved accuracy would result from fusion of various input data, as in Scenario 6 and Scenario 7, where fusion of features extracted from pre-trained CNNs is involved. It is worth mentioning that our hypothesis is that transfer learning improves the results, particularly with small datasets. In Scenario 6, both nasal and temporal patches are used with the pre-trained CNN2 and CNN5 respectively. Scenario 5 and Scenario 6 incorporate information from both the nasal and temporal sides of the fovea. However, information from the two sides is used independently in Scenario 6, as patches from each side are fed into an independent CNN, while in Scenario 5, the information from both sides is used with a single CNN (CNN4). Scenario 7 investigates using 4 types of patches to train 4 pre-trained CNNs independently and then feeding the corresponding extracted features into the SVM module to perform classification. The patches are distal nasal patches, nasal patches, temporal patches and distal temporal patches, which correspond to CNN1, CNN2, CNN5 and CNN7 respectively. The distal nasal and distal temporal patches are extracted farther away from the fovea than the nasal and temporal patches respectively.
For the scenario with the best performance, an investigation is carried out to find the optimum layer at which to carry out transfer learning. This is done by varying the CNN layer used for extracting the features. The following layers are used for comparison of the performance results: (i) one layer above the bottleneck features, which is the relu6 layer, {fc6}; (ii) the layer with the bottleneck features, which is the pool5 layer, {P5}; (iii) one layer before the bottleneck features, which is the relu5 layer, {C5}; and (iv) two layers before the bottleneck features, which is the relu4 layer, {C4}. In this investigation, both the accuracy and the computational expense are taken into consideration for an overall performance evaluation.
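The computational cost of each candidate layer is tied to the length of the feature vector it produces. The sketch below estimates these lengths for a 227 × 227 input, assuming the standard AlexNet layer hyperparameters (kernel sizes, strides, padding, and channel counts); these are not listed in this section, so they should be checked against Fig. 5 before relying on the numbers.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1


def alexnet_feature_lengths(input_size=227):
    """Feature-vector length per CNN at each candidate extraction layer.

    Layer hyperparameters follow the standard AlexNet design (assumed).
    """
    n = out_size(input_size, 11, stride=4)  # conv1: 11x11, stride 4
    n = out_size(n, 3, stride=2)            # pool1: 3x3, stride 2
    n = out_size(n, 5, pad=2)               # conv2: 5x5, pad 2
    n = out_size(n, 3, stride=2)            # pool2: 3x3, stride 2
    n = out_size(n, 3, pad=1)               # conv3: 3x3, pad 1
    n = out_size(n, 3, pad=1)               # conv4: 3x3, pad 1, 384 maps
    c4 = n * n * 384                        # relu4 output
    n = out_size(n, 3, pad=1)               # conv5: 3x3, pad 1, 256 maps
    c5 = n * n * 256                        # relu5 output
    p = out_size(n, 3, stride=2)            # pool5: 3x3, stride 2
    p5 = p * p * 256                        # bottleneck features
    return {"C4": c4, "C5": c5, "P5": p5, "fc6": 4096}
```

Under these assumptions the lengths are 64896 (C4), 43264 (C5), 9216 (P5), and 4096 (fc6) per CNN, and they double when two CNNs are fused by concatenation.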
Furthermore, for each of the previous scenarios, the default preprocessing pipeline for each of the OCT scans includes fovea detection. This is required to extract patches at standardized locations relative to the center of the fovea. This preprocessing step uses the ONL for y-direction alignment. However, it is worth noting that different experiments were carried out for the scenario with the best performance, where all other parameters are fixed except for the OCT layer used for the y-direction alignment, in order to find the layer that achieves the optimum results and hence further improves the algorithm. The layers that were considered in this sub-experiment are: (i) layer 5 or the OPL; (ii) layer 6 or the ONL; (iii) layer 7 or the ELM; (iv) layer 8 or the MZ; and (v) layer 9 or the EZ. Samples of such extracted patches are shown in Fig. 7. Scenario 8 (not included in Fig. 7) explores using resized whole retina images to directly train a pre-trained CNN, as opposed to using only patches from the retina in Scenarios 1 through 7.

D. VALIDATION
Five-fold cross validation is used for the evaluation of each of the investigations of the CNN CAD system for early detection of DR. This is a particular case of k-fold cross validation, where k = 5 training runs are performed, each time leaving out a fraction 1/k of the data for subsequent testing. In this way every available observation (OCT scan) is used both for training and (exactly once) for testing. This is in contrast to the traditional hold-out method, where a fixed proportion of the data is set aside from the beginning of the experiment to be used only for validation. Cross validation has the advantage of making the most of a limited amount of data, but is known to produce biased estimates of system performance.
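The k-fold procedure can be sketched as a generator of train/test index splits. The shuffling, the fixed seed, and the interleaved fold assignment are assumptions made for illustration; the paper does not state how the folds were formed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Every sample index appears in exactly one test fold and in the
    training set of the other k - 1 runs.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test
```

Each run would train the CNN and SVM on `train`, evaluate on `test`, and the metrics would then be aggregated over the k runs.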
The performance metrics used are accuracy (α), error rate (β), specificity, precision (PPV), and recall (TPR). If TP is the number of correctly classified DR cases in a particular run of cross validation, TN is the number of correctly classified normal cases, FP is the number of normal cases misclassified as DR, FN is the number of DR cases misclassified as normal, and P = TP + FN and N = TN + FP are the total numbers of DR and normal cases in the test data, then these metrics are defined as:
$$\alpha = \frac{TP + TN}{P + N}, \qquad \beta = \frac{FP + FN}{P + N} = 1 - \alpha, \qquad \text{specificity} = \frac{TN}{N}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{TPR} = \frac{TP}{P}.$$

III. EXPERIMENTAL RESULTS
Second, we compared the performance results in the case of training the proposed system with a single patch with the fovea centered (Scenario 5) versus training the system with independent nasal and temporal patches (Scenario 6 and Scenario 7). Scenario 6 uses two independent patch types for training, which are nasal patches and temporal patches, while Scenario 7 uses two additional independent patch types, which are distal nasal patches and distal temporal patches. It was found that using independent patch training to combine two or four patches results in the highest performance metrics across the board. However, it can also be observed that the training time taken for the four-patch scenario is almost four times that of the two-patch scenario, with no improvement in any of the metrics. As such, only the two-patch approach is considered in the next investigation, which seeks the optimum CNN layer from which to extract the features. It can be seen from the summary in the table that the bottleneck features represent the best choice in terms of balance between the run time required, i.e. computational complexity, and the accuracy. Finally, in choosing the OCT layer at which to align the extraction of the input patches, we find that layer 6 is the best given that it reaches the highest performance metrics. Although other layers achieve similar levels of accuracy, error rate, specificity, precision, and recall, they all require more time to train.
As such, the final design choice for the proposed system is the CNN retrained with two patches, one on each side of the fovea, with the ONL centered vertically within each patch. The features are extracted from the pool5 CNN layer and transfer learning is applied in the deployment of the proposed system. Additional testing was carried out in order to confirm that the CNN architecture is not biased due to color space, since the ImageNet dataset is RGB while our dataset is intrinsically grayscale. Hence, we retrained the network with 200 grayscale images and repeated our investigation. The result was conclusive that the network is independent of the color space, as the results obtained exactly matched those of directly using the ImageNet pre-trained CNN.
Significant degradation of performance resulted upon downsampling, as shown in Table 1 (Scenario 8). After training with downsampled images, accuracy was at most 78% (71% specificity, 78% precision, and 100% recall). Every metric, except for the recall, was lower compared to that of the CNN trained at full resolution. Based on this, resampling of the input should be avoided, and extraction of patches as per the proposed methodology is recommended.
Moreover, the confusion matrix of the chosen CNN for the proposed early DR detection CAD system is shown in Fig. 8. Finally, a comparison of our proposed technique against other machine learning techniques shows its superiority, as can be observed in Table 2. It is worth mentioning that we used the same methodology described in [35] to implement the comparison framework using Matlab-ready K-Star, K-Nearest Neighbor (kNN), Random Forest, and Random Tree classifiers.
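The metrics reported in this section can all be derived from the four counts of a confusion matrix. The following sketch takes P as the number of DR cases (TP + FN) and N as the number of normal cases (TN + FP); the function name and dictionary keys are illustrative, and any counts passed to it in an example are hypothetical rather than values from Fig. 8.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, error rate, specificity, precision, and recall.

    tp / fn: DR cases classified correctly / incorrectly.
    tn / fp: normal cases classified correctly / incorrectly.
    """
    p = tp + fn  # total DR cases in the test data
    n = tn + fp  # total normal cases in the test data
    accuracy = (tp + tn) / (p + n)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "specificity": tn / n,
        "precision": tp / (tp + fp),
        "recall": tp / p,
    }
```

Note that a classifier can reach 100% recall while its specificity and precision stay low, which is the pattern reported for the downsampled images.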

IV. CONCLUSION
Early intervention is essential to delay or prevent complications of DR, including blindness. As such, in this paper, a novel CAD system for early detection of DR-related changes in OCT images using CNNs was presented. The system was developed for use with patients with almost clinically normal retina appearances.
Upon investigation of the various scenarios for the proposed CAD system, the optimal conditions for its deployment were found. Foremost, transfer learning should be used to achieve high accuracy given the scarcity of the data. Second, the best results are seen when combining the output features of two independently trained CNNs, which operate on the two sides of the fovea. Features extracted at the pool5 layers of these CNNs provided the highest accuracy with the least computational complexity. Finally, in order to reach the highest accuracy, which our results found to be 94%, the patches extracted for training and testing should be aligned along the y-axis using the presented patch extraction algorithm with segmented OCT layer number 6, or the ONL. This paper recommends that further research be directed towards this relatively uncharted topic, especially with OCT images, since the results were observed to be high even given the scarcity of the data and the relative complexity of the problem.