Discrimination of Diabetic Retinopathy From Optical Coherence Tomography Angiography Images Using Machine Learning Methods

The goal was to discriminate between diabetic retinopathy (DR) and healthy controls (HC) by evaluating Optical coherence tomography angiography (OCTA) images from $3\times 3$ mm scans with the assistance of different machine learning models. The OCTA angiography dataset of superficial vascular plexus (SVP), deep vascular plexus (DVP), and retinal vascular network (RVN) were acquired from 19 DR (38 eyes) patients and 25 HC (44 eyes). A discrete wavelet transform was applied to extract texture features from each image. Four machine learning models, including logistic regression (LR), logistic regression regularized with the elastic net penalty (LR-EN), support vector machine (SVM), and the gradient boosting tree named XGBoost, were used to classify wavelet features between groups. The area under the receiver operating characteristics curve (AUC), sensitivity, specificity, and diagnostic accuracy of the classifiers were obtained. The OCTA image dataset included 114 and 132 images from DR and HC subjects, respectively. LR-EN and LR using all three images, SVP, DVP, and RVN, provided the highest sensitivity of 0.84 and specificity of 0.80, the best diagnostic accuracy of 0.82, and an AUC of 0.83 and 0.84, respectively, which were slightly lower than that of LR using one image SVP (0.85) or two images DVP and SVP (0.85). The LR-EN and LR classification algorithms had the high sensitivity, specificity, and diagnostic accuracy in identifying DR, which may be promising in facilitating the early diagnosis of DR.


I. INTRODUCTION
Diabetic retinopathy (DR) is one of the main causes of blindness among the working-age population [1]. Since it becomes incurable in its late stages, early diagnosis of DR is important. However, it is difficult to achieve early diagnosis because of the paucity of experienced retinal doctors and the increased number of DR patients [2]. Recently, several automated diagnosis systems have been developed to assist in the early diagnosis of DR, which are objective and reliable tools used The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano .
to support clinical decision-making process without inter and intra-expert variability.
Artificial intelligence (AI) can analyze a large amount of data simultaneously and can correct automatically and learn continuously to improve upon its sensitivity and specificity as a diagnostic tool and predict disease progression in medicine [3]- [5]. An AI system is designed to mimic human brain perception for data processing and decisions making. Recently, there has been an increasing interest in using machine learning models for medical applications, which are branches of AI that can learn from data, identify patterns, and make decisions automatically [6]. VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ In ophthalmic research, the application of AI systems has led to robust diagnostic accuracy for several ocular conditions such as DR [7], age-related macular degeneration (AMD) [8], cataract [9], glaucoma [10], and keratoconus [11].
Machine learning models using fundus images have been developed by researchers for retinopathy staging as well as identifying multiple ocular diseases. However, it is difficult to detect micro-vasculature alterations in different retinal layers and the area around the foveal avascular zone using fundus images. Additionally, a large and well-documented database of fundus images is needed to train and optimize convolutional neural networks. It is also extremely difficult to provide strong accuracy metrics because of the database variances from different imaging centers [7], [12]. A deep learning system (DLS) is a branch of machine learning using AI and representation learning methods to process large data and extract meaningful patterns that contribute to the excellent performance for discovering the intricate structures of high-dimensional data. Currently, DLS for the automatic classification and segmentation of optical coherence tomography angiography (OCTA) images in ophthalmology affords excellent results. In DLS, convolutional neural network (CNN), and multi-scaled encoder-decoder neural network (MEDnet) have been utilized to classify referable DR on optical coherence tomography angiography datasets with high predictive accuracy [13]- [18].
Abramoff and associates conducted a pivotal trial in which an autonomous AI-based diagnostic system detected DR, and that system has now been authorized by the Food and Drug Administration for use by health care providers to detect mild DR and diabetic macular edema [19]. Alam et al. presented the feasibility of a supervised machine learning method to detect DR with excellent diagnostic accuracy using 6 × 6 mm OCTA images from small datasets [20]. Recently, Sandhu et al. introduced a new automated system for detecting DR using OCTA. The system demonstrated a high degree of accuracy, sensitivity, and specificity in analyzing vessel density, vessel caliber, and FAZ area [21].
In the present study, we applied different machine learning models to discriminate between healthy individuals and DR by evaluating OCTA images from the 3 × 3 mm scans of superficial vascular plexus (SVP), deep vascular plexus (DVP), and retinal vascular network (RVN). We use sensitivity, specificity, and accuracy as metrics to compare the performance of different methods. Sensitivity is the ability of a method to correctly identify those with the disease (true positive), whereas specificity is the ability of the method to identify those without the disease (true negative) correctly. Accuracy is the proportion of true results, either true positive or true negative, identified by a method.

A. STUDY DESIGN, SETTING, AND POPULATION
This study was approved by the institutional review board for human research of the University of Miami. All participants were treated according to the Declaration of Helsinki. Written informed consent was obtained.
Nineteen patients with DR were recruited from Bascom Palmer Eye Institute, University of Miami, from August 2017 to June 2019. These patients underwent a complete fundus examination and were diagnosed by a retinal specialist (JT). Twenty-five age and sex-matched healthy control (HC) subjects were recruited.

B. OCTA, IMAGE SEGMENTATION, AND DATA ACQUISITION
The angiograph dataset of the 3 × 3 mm scans was acquired using the Optovue OCTA device (Optovue, Fremont, CA, USA), which consists of 70,000 Hz A-scans rate, an 840 nm central wavelength, a 5 µm resolution, and a 22 µm beamwidth. Each scan of retinal angiography included a raster scan with 304 (A-scan) x 304 (B-scan). The OCTA image quality was quantified with the scan quality score metric provided in the Angiovue's software interface. Any OCTA image with a scan quality score of less than 5 was excluded. The SVP was segmented from the inner limited membrane layer (ILM) to the inner plexus layer (IPL); the DVP was segmented from the IPL to the outer plexus layer (OPL), and RVN was segmented from the ILM to the OPL.

C. DATA PRE-PROCESSING AND OCTA FEATURE EXTRACTION
The first step of applying a machine learning model to predict DR from images is to extract useful features from the images. It has been shown that wavelet transform can capture texture features of an image at multiple resolutions and that such features can be used to classify images with high accuracy [22], [23]. Wavelet transform has also been applied to fundus images for the diagnosis of glaucoma [24], [25].
We first applied the two-dimensional discrete wavelet transform (DWT) [26] to images to extract features in different frequency bands. Specifically, we employed the wavedec2 function in Matlab to implement the 2-dimensional DWT with the biorthogonal wavelet bior1.1. Of note, DWT had also been employed to extract features for the classification of fundus images of glaucoma [24], [27]. Since images were of size 304 × 304 pixels, we performed the 8-level ( log 2 (304) = 8 ) DWT. As shown in Figure 1, at the ith level (i = 1, 3,. . . , 8), we used biorthogonal wavelet bior1.1 to decompose the image in the LL i−1 band into four images in four frequency bands: HH i, HL i , LH i , and LL i , where LL 0 was the original image. Therefore, each image yielded 25 images of different resolutions with a total of 25 frequency bands. For each image, the energy in each of 25 frequency bands was calculated and standardized using the z-score, which generated 25 features that would be further used to classify images. Specifically, the image of the ith individual in the jth frequency band was represented as a p×q matrix D ij . The energy of the band was E ij = 1 and the kth column. Suppose that there were n individuals in the dataset,definedĒ j = 1 25 ] T , which was then used by a machine learning model for image classification. In our dataset, each eye has three images, including DVP, RVN, and SVP. We can use any one, two, or three images of all the three images to predict DR. If we use k (k = 2 or 3) images, we concatenate the feature vectors of these k images into a 25k × 1 vector as the input to a machine learning model. Of note, prior studies have shown that the energy distribution in different frequency bands provided discriminative power for image classification [22], [23].

D. DATA ANALYSIS
We employed four machine learning models to classify images based on wavelet features. These four models are logistic regression (LR), support vector machine (SVM), logistic regression regularized with the elastic net penalty (LR-EN), and the gradient boosting tree named XGBoost. LR and LR-EN were implemented with software glmnet [28]; SVM and XGBoost were implemented with the SVM function in R package e1071 [29] and the software XGBoost [30], respectively. Of note, Boosting trees are particularly powerful in supervised learning and have frequently won competitions [30], [31], sometimes outperforming deep neural networks. The mathematical formulation of LR and SVM can be found in a machine learning textbook [33], [34]; a detailed description of LR-EN and XGBoost is in [28] and [30], respectively. Next, we will describe these four methods briefly.
As mentioned earlier, the feature vector in the ith individual is denoted as x i . Let y i = −1 or 1 be the class label of the ith individual. The logistic regression model assumes the following probability p(y i |x i ) = 1/(1 + e −y i (β 0 +x T i β) ), where the vector β and the scalar β 0 contains model parameters. LR-EN finds (β 0 , β) by minimizing the negative log-likelihood function of the data regularized by the elastic net penalty [24]. More specifically, the following optimization problem is solved to find (β 0 , β) where β k , k = 1, 2, is the l k -norm of β, and λ > 0 and 0 ≤ α ≤ 1 are two hyperparameters that can be determined with cross-validation (CV). If we set λ = 0 in (1), the solution of (1) gives the parameters of LR. Since the number of data samples is relatively small, and with the issue of quasi complete separation [35] in the datasets under consideration, the solution of (1) with λ = 0 is not stable. To overcome this problem, we implemented LR using (1) with α = 0 and λ >0. The hyperparameter λ was determined using cross validation [36]. SVM finds the decision boundary β 0 +x T β = 0 by solving the following optimization problem where C > 0 is a hyperparameter. The SVM in (2) finds a linear decision boundary in the feature space. We can also use the kernel trick to find a nonlinear decision boundary in the feature space. In this paper, we used the Gaussian kernel, which is defined as K 2 ), where γ > 0 is another hyperparameter of the model.
We used the nested CV procedure [37] to train the four models and estimate the classification errors. The nested CV procedure has two loops, both of which employed a leaveone-out (LOO) approach. The inner loop was used to tune the hyperparameters and train the model, while the outer loop was used to estimate the classification error. Since the samples used in the outer loop for error estimation were never used by the training process in the inner loop, the nested CV provided an unbiased estimate of the generalization error. First, one of the 82 samples was held out. The remaining 81 samples were used by a LOO procedure, which was referred to as the inner CV loop to be described in detail later, to train the model and select the optimal values of the hyperparameters. After the values of the hyperparameters were selected, the 81 samples were used to fit the model, which was then used to find the error on the one sample that was held out. This process, referred to as the outer loop, was repeated 82 times until all samples were held out and used to calculate the error. This error is the estimated classification error. Now, let us describe the inner CV loop. One of the 81 samples was left out, and the remaining 80 samples were used to train the model for a set of values within the hyperparameters, and the trained model was used to predict the label of the sample left out. This process was repeated 81 times until every sample was left out, and its label was predicted. The predicted labels of the 81 samples were used to calculate the CV error, and the value of the hyperparameter(s) that yielded the smallest CV error was selected as the optimal value [37].
We use sensitivity, specificity, and accuracy as performance measures. Sensitivity measures the proportion of true positives (TP) that are correctly identified with false negatives (FN), and is defined as sensitivity = TP/ (TP+FN). Specificity measures the proportion of true negatives that are correctly identified with false positives (FP), and can be computed as specificity=TN/ (TN+FP). Accuracy is the degree of closeness of measurements of a quantity to that quantity's true value and can be calculated as accuracy=(TP+TN)/(TP+FN+TN+FP) [38].

III. RESULTS
The OCTA image dataset included 114 images from 38 eyes of 19 DR patients and 132 images from 44 eyes of 25 HC subjects. Tables 1 and 2 show the performance of the four machine learning models in predicting DR from one, two, or three images. As shown in Table 1, LR-EN and LR using the three images (DVP+RVN+SVP) offered the highest sensitivity (0.84) and specificity (0.80). As shown in Table 2, the AUCs of LR-EN and LR using the three images are 0.83 and 0.84, respectively, which are slightly lower than the AUCs of LR using one image SVP (0.85) or two images DVP and SVP (0.85), but are higher than or equal to the AUCs of LR-EN and LR using other images, and the AUCs of SVM and XGBoost. Table 3 shows that LR-EN and LR using the three images provided the best diagnostic accuracy (0.82) among all four classifiers. Figure 2 depicted the receiver operating characteristic (ROC) curves and AUCs of LR-EN using different images. It was seen that the LR-EN using the three images offers almost the same performance as using one image, the SVP, and better performance than LR-EN using other images. Figure 3 showed the ROC curves and AUCs of LR-EN, LR, SVM, and XGBoost using the three images. It is clearly seen from Figure 2 that LR-EN and LR have almost the same performance but outperform SVM and XGBoost.

IV. DISCUSSION
We herein clarified the applicability of machine learning algorithms to predict diabetic retinopathy (DR) using  3 × 3 mm scans of optical coherence tomography angiography (OCTA) images. Subtle retinal microvascular alterations due to DR can be analyzed quantitatively based on OCTA images. The LR-EN and LR classifiers demonstrated the best diagnostic performance in all classification tasks in our current study. The LR-EN or LR algorithm would provide an effective strategy for clinicians to diagnose early stages of DR.
Generally, in our current dataset, using all three images improved the performance in terms of diagnostic accuracy and sensitivity when compared to the cases using only one or two images. However, using three images did not necessarily improve the performance in terms of the AUC when compared to the cases using the one image of SVP, or two images of DVP and SVP. This implies that the SVP image  contains the most information that can help discriminate DR from normal tissue, and can explain why adding the other two images did not significantly change the performance of the classifier.
Sensitivity is an important criterion for any screening and diagnostic prediction system [15]. Studies have shown that the sensitivity of automatic DR screening ranges from 75% to 97.5%, and the accuracy is comparable [7], [12], [19], [20], [39], [40]. The 84% sensitivity of our system represented the capability to identify individual eyes with DR from a healthy control. Similarly, specificity is also an important factor because it will represent the capability of detecting subjects that do not require a referral to an ophthalmologist. Our study indicated that LR-EN and LR algorithm could achieve the best performances for maximum diagnostic accuracy in all classification tasks. As supported by the results from Table 3, we can observe that, in all performance metrics, the LR-EN and LR algorithms demonstrated better diagnostic proficiency than SVM and XGBoost algorithms.
Machine learning models for analyzing angiography images did not require alignment since the wavelet features were directly processed from the raw angiography images. The machine learning models, particularly LR-EN and LR, provided better performance than the traditional vessel density analysis (Dbox). The traditional vessel density analysis provided a sensitivity of 0.70 and specificity of 0.65 and used the vessel density of DVP (VDd), to discriminate DR from HC [41]. The improvement of machine learning models was mainly due to the following two factors. First, machine learning models exploited wavelet features of the angiography images of each eye, whereas the VDd analysis used the average density. Wavelet features can capture patterns in an image at multiple resolutions, which provides more information and improves discriminative power. Second, the machine learning models were designed based on certain optimality criteria to discriminate images in the vector space of features, which can exploit the discriminative information appropriately [23], [24], [42]. XGBoost is a very efficient implementation of the gradient boosting tree [29]. Instead of fitting a tree to the gradient of the loss function at each iteration, as normally done by gradient boosting, XGBoost uses the second-order approximation of the loss function to build trees. The hyperparameters of XGBoost include the maximum depth of each tree, the number of trees, and the learning rate. In this work, the LR-EN and LR have almost the same performance, and they outperform SVM and XGBoost.
Although our findings indicated the LR-EN and LR classification algorithms had high sensitivity, specificity, and diagnostic accuracy in identifying DR, there are several limitations in this study. Firstly, the sample size is relatively small for each cohort. In future studies, we plan to include multiple imaging centers and a much more extensive OCTA database to test our AI screening algorithm for practical performance in retinal clinics. Additionally, we did not use retinal images to test the availability of our algorithm for detecting DR. There are many important components that should be included in the grading system for detecting DR using retinal images, such as microaneurysms, hemorrhages, hard and soft exudations, as well as blood vessel morphology [43]- [45]. Thirdly, we only analyzed the 3 × 3mm OCTA images from mild DR patients. Patients with moderate and severe DR were not included. Since mild DR is hard to distinguish from normal individuals, the sensitivity and accuracy for detecting DR may have been found to be relatively lower. Future studies should enlarge the sample size and include moderate and severe DR patients. Finally, we did not use other data such as thickness maps of intra-retinal layers and retinal angiography, which, if incorporated into the machine learning process, may further improve the diagnostic accuracy and applicability.

V. CONCLUSION
In summary, in this evaluation of the 3 × 3 mm scans of OCTA angiography images from DR patients and healthy individuals, the LR-EN algorithm had high sensitivity and specificity in identifying DR, which may be promising in facilitating the early diagnosis of DR. Further large scale and multicenter studies are necessary to assess the applicability of LR-EN algorithm in DR and related eye diseases to improve vision outcomes.