A Statistical Robust Glaucoma Detection Framework Combining Retinex, CNN, and DOE Using Fundus Images

Motivated by the challenge that manual glaucoma detection is costly and time consuming, and that existing automated glaucoma detection processes lack either good performance or any statistical robustness testing procedures, we proposed an effective, robust, and automated framework for glaucoma detection based on fundus images. The proposed framework using 1450 color fundus images provided by Kaohsiung Chang Gung (KCG) Memorial Hospital in Taiwan. The proposed framework combines the use of convolutional neural networks (CNN) with the proposed generalized loss function, robust design of experiment (DOE), and Retinex theory to improve the results of fundus photography flash by restoring the original colors via removing the light effect. The proposed framework outperformed most archival automatic glaucoma detection approaches in its effectiveness and simplicity. The effectiveness was demonstrated via the estimated sensitivity 0.95, specificity 0.98, and accuracy 0.97. The simplicity was shown via the adopted basic CNN model compared to deep CNNs such as GoogleLeNet and ResNet152. Further, the proposed framework outperformed all relevant archival work in terms of its robustness, illustrated in the associated standard errors (all less than 0.03). This paper demonstrated the proposed framework via intuitive graphs and clear mathematical notations to make it easy for others to reproduce our results. The proposed framework and demonstration have the potential to become the standard automated glaucoma detection approaches in practice.


I. INTRODUCTION
Glaucoma is an eye disease that is notoriously known to be incurable, treatable with medical and surgical procedures that can only delay the aggravation of glaucoma but not restore eye health. Glaucoma is considered to be one of major causes of irreversible-blindness worldwide.
At present, existing glaucoma detection processes do not consistently provide satisfactory results. Manual glaucoma detection (i.e., assessment of the optic nerve head) is subjective, costly, time consuming, and that the variations among physician performances are high.
The associate editor coordinating the review of this manuscript and approving it for publication was Amjad Ali.
All of the above-cited automated glaucoma detection method studies use color fundus images as input data and adopt a common framework that includes two major steps: (1) data transformation, which converts the input data (color fundus images) into so-called transformed data; and (2) data classification, which categorizes the binary output (glaucoma or healthy eye) as a function of the transformed data.
We proposed an effective automatic classification framework that integrates Retinex transformation and  (14), was proposed. b) Retinex theory (to improve the results of fundus photography flash by restoring the original colors via removing the light effect) was adopted. c) A robust design of experiments (DOE) to generate optimal hyper-parameters (including input image size, convolution layers, and hidden layers) was adopted. d) All reported digits of the estimated performance were determined through the leading digit rule (LDR [33]). 4) Simplicity: The basic CNN model (with 4 convolutional layers) was adopted instead of choosing deep CNNs such as GoogleLeNet (with 21 convolutional layers) or ResNet152 (with 152 convolutional layers). 5) Clarity: Many intuitive graphs and clear mathematical notations were proposed to make it easy for others to reproduce our results. The remainder of this paper is organized as follows. Section II provides a literature review with expansions. Section III presents the methodology combining Retinex theory and CNN with the optimal hyper-parameters using DOE. A discussion of the associated performance is provided in Section IV. The summary and conclusion are given in Section V.

II. LITERATURE REVIEW
Twenty archived publications addressing automated glaucoma classification are summarized in Table 1, with the top row indicating the column (C) number. Significant factors (such as Retinex, ROI*, robust DOE, and the proposed new loss function for the CNN) proposed in this study are marked in red in Table 1. C1-C6 are defined as follows: C1 is the paper number. C2 identifies the paper's referenced index, with the publication year shown in C3. C4 shows the original-data figures indicating glaucoma (G) and normal (N), while C5 indicating whether augmented data are adopted. Note that augmented data (rather then original data) are finally used to be classified. C6 lists the adapted data-ratio for the training, validation, and test sets. Figures appear as a hyphen in C5 and C6 if the associated information are not clearly shown in the cited paper. All images, except those used in [9], are independent. Images adopted in [9] are sequential images The results of our current research are shown at the bottom of Table 1. The explanation of corresponding C4-C6 are given below. This study investigated original data for 899 glaucoma (G) cases and 551 normal (N) cases. After augmentation (see Step 4 of Section III-B), the data size was enlarged twice. C-6 shows a data-ratio of 5:2:3.
Data transformation methods are listed in C7. To conserve space, only an abbreviation is listed for each associated transformation method, including ROI, PCA, EF, IC, HF, GCM, EWT, and ROI*. Specifically, ROI (region of interest) indicates the area covering around optic nerve cup and optic nerve disc. ROI* adopted in the proposed method covers the optic nerve cup, optic nerve disc, and macula. Other transformation methods are briefly described below. PCA (principal component analysis [1] and [17]) transforms the data via the logic of reducing the dimensionality of a set of variables while retaining the maximum variability in terms of the variance-covariance structure. Texture feature [2] provides measures of properties, such as smoothness, coarseness, and regularity of the images. Higher order spectra (HOS) [2] elicited both amplitude and phase information of a given signal. EF (ellipse fitting [6]) fits an elliptical model of data into the area identified in ROI. IC (illumination correction [5]) uses anisotropic diffusion filtering to remove noise, further subtracting the estimated background from the original color image to achieve homogeneity. HF (Haralick features [13]) here refers to 28 statistical features (including the first and second moment, correlation, and entropy). EWT (empirical wavelet transform [11]) transforms time-domain data into frequency domain data. GCM (grid color moment) transforms pixels into statistical moments such mean, variance, and skewness. Retinex [21], used to remove the effects of fundus photography flash and to restore the original colors of the fundus image, will be further explained in Section III-B.
Data classifiers, shown in C8, include ANFIS (adaptive neuro-fuzzy inference systems), ANN with the associated optimizer BP (back propagation [2]), SVM (supporting vector machine [3]), k-means clustering [16], RF (denotes for random forest [23]), RT (denotes for random tree [22]), CNN (denotes for convolutional neural network [20]) which integrates both feature extraction and classification. The loss function, such as mean squared error (mse) or cross entropy (CrossE), adopted in CNN is listed inside the parenthesis. Authors [7] claimed that the advantage of CNN is that it does not need any pre-processing or handcrafted feature extraction by other methods.
The above-mentioned deep CNN models have been commonly adopted to use in automatic glaucoma detection since 2019. For example, [29] proposed deep CNN architectures including GoogleLeNet (with 21 convolutional layers) and ResNet152 (with 152 convolutional layers). [29] proposed to train and test 5 different data sets by employing four of them during training and the other during testing.
For example, the first row corresponding to No 16 (in Column 1) used no. of glaucoma and healthy images (G=27, H=18) and total train data 1662 (= 70 + 31 + 194 + 261 + 101 + 300 + 396 + 309, the sum of the rest four data sets) and testing data 45 (= 27 + 18). All of the proposed 5 cases corresponding to No 16 in Table 1 are inferior than the model proposed in this paper.
The last 4 columns (C9, . . . ,C12) are categories designated as follows. With regard to C9, none of the studies in the 20 cited publications appear to have used statistical robust design of experiments (DOE) (see Section III-F) via blocking (see Section III-A) to determine the optimal hyper-parameters. C10, C11, and C12 indicate data for three performances: sensitivity, specificity, and accuracy, respectively. The average sensitivity, average specificity, and average accuracy and the associated standard error for the 10 block settings adopted in this paper are listed in the last two lines. The average performances are all above 0.95 and the corresponding standard errors are all below 0.03. Further explanation about the proposed method is given in Section III. Again, figures appear as a hyphen if the associated information was not clearly shown in the cited paper.
Finally, the comparison of the performance with respect to sensitivity (shown in C10), specificity (shown in C11), and accuracy (shown in C12) is summarized below.
• Results show that the proposed framework provided better or comparable (< 5% worse) results compared with 20 archival work listed in Table 1. Specifically, results in bold indicated that our model produced worse results, but the differences are less than 5%, 2%, 1% with respect to sensitivity, specificity, and accuracy, respectively. Results in bold are listed below.
• The proposed estimated performances have standard error less than 0.03, while the standard errors for archival work listed in Table 1 were not stated. Overall, the proposed framework, combining a CNN (with 4 convolutional layers) with a pre-processing via the Retinex with the proposed parameters (see Eq. (1) to Eq. (10)) presented promising results for automatic glaucoma detection.

III. PROPOSED METHODOLOGY
The framework of the proposed method is illustrated in Fig. 2, in which there are three main parts. We first explain Part 1 (marked in light purple) and 3 (marked in blue). Part 1 confirms that 1450 fundus images (899 showing glaucoma and 551 healthy) were provided by Kaohsiung Chang Gung Memorial (KCGM) Hospital, Taiwan. The percentage adopted for the training, validation, and test sets were 50%, 20%, and 30%. Part 3 indicates that the performance measures adopted for testing data were accuracy, sensitivity, and specificity, defined below.
• sensitivity: TP (true positive) rate, the ratio of cases correctly identified as positive to all glaucoma cases.
• specificity: TN (true negative) rate, the ratio of cases correctly identified as negative to all normal cases.
• accuracy: the ratio of correct classification to the total cases. Next, we explain Part 2 (mark in gray) with regard to four issues: (1) the relationship among block, epoch, and batch, (2) the pre-CNN process, (3) the CNN process, and (4) the robust design of experiment (DOE). Further explanation is given in Subsections III-A through III-F.

A. BLOCK, EPOCH, AND BATCH
First, we discuss here the relationship between blocks and epochs. The proposed framework contains l blocks, where l = 10. The term ''block'' was adopted from the experimental design point of view in that the data set (in terms of the training, validation, and test sets) is identical in each block. where we chose h = 50 to be the maximum value of the total epochs in each block. If the termination rule was not reached (see Section III-D), then h i = h. After the final epoch was completed for each block, we constructed a block-model to be tested using the test set to generate the performance measures, shown in Part 3 of Fig. 2.
Next, we assessed the relationship between epochs and batches. Each epoch passed through b = 40 batches, each with batch size (images) m = 40. The m = 40 images in each batch were randomly selected from the training set in the associated block. Each image in each batch passed through the pre-CNN-process (Steps 0 to 4, as shown in Fig. 3. and CNN (Steps 5 to 8, as shown in Fig. 4. At the end of each batch, the loss function and therefore the associated parameters were updated. A pre-validation model was derived from 40 batches of 40 images. Pre-CNN-process and CNN-process are further discussed in Subsections III-B and III-C, respectively.

B. PRE-CNN-PROCESS
This subsection describes the pre-CNN-process, including Steps 0 to 4, illustrated in Fig. 3. The values followed by ''input and output'' are the number of images with respect to R-G-B, the number of row pixels, and the number of column pixels. For example, the values 3@500 × 750 in step 0 indicates that there are 3 images regarding R-G-B, 500 row pixels, and 750 column pixels, Further descriptions of Steps 0 to 4 are given below. • Step 0. Input Images: All input fundus images include text about patient information and the date that the associated images were collected.
• Step 1. Clean Images: All text on each image was replaced by pixel 0, which is the color black.
• Step 2. Retinex: The word Retinex combines retina and cortex, where the retina is the part of the eye that detects color, and the visual cortex is the part of a brain that processes the information it receives from the retina.
This step was used to remove the light effect in a flash image. The logic of removing the light effect is explained below. Suppose that an original image recorded by a camera is denoted as I, which is a function of the actual color (denoted as R), light conditions (denoted as L), and other camera settings. Assuming that we can ignore the camera settings, we can therefore write the relationship among I, R and L as follows: To prevent the effect of a small value in the denominator L(x, y) leading to a huge R (i) (x, y), we performed a logarithm function on both sides of Eq. (2). Thus, Let the estimated image of R (i) (x, y) be denoted as R where we chose ω = 1/3, The image I (i) A (x, y) with dim 502 × 752 is the augmentation of I (i) (x, y). The Gaussian blur (smoothing) function, used to reduce image noise and unimportant detail, is defined as  (7) where and bias, α, and β are the color-control parameters. We chose bias = 125, α = 125, β = 230. -Function R (i) III is used to increase the variability of pixel in R (i) II , provided that all pixels are still between (0, 255).
where Max R (i) and Min R (i) are the maximum and minimum value for the associated matrix R (i) , respectively.
• Step 3. Region of Interest (ROI*): Based on the knowledge of this paper's second author, a well-respected ophthalmologist in Taiwan, we recognized that the region of interest should be near the opticnerve, cup, and macular portion. Therefore, we selected the (ROI*) as follows.
where ROI* covers the optic nerve cup, optic disc, and macula.
• Step 4. Image augmentation and dimension reduction: For each input image, we randomly randomly twisting the image among 30, 60, 90, 120, or 180 counterclockwise. Thus, each image was augmented into twice images. Finally, we reduced each image, shifting the row and column sizes 500 × 750 into smaller sizes 78 × 116. Let D(i) denote the image processed at the end of this step, where i = 1, 2, 3 regarding R-G-B.

C. CNN PROCESS
This subsection describes the CNN process, including Steps 5 through 8, which were used to train the optimal parameters. In order to implement a data transformation and data classification method, one needs to first determine the associated values of parameters and hyper-parameters. Using CNN as an example, it is necessary to begin by determining the parameter values (such as weights in each hidden layer) and hyper-parameter values (such as convolution layer, hidden layer numbers, and numbers of nodes in each hidden layer). The rules and the values of hyper-parameters used in this paper are summarized below.
Rules determined in advance: 1) loss function: cross-entropy 2) padding method: same (instead of valid) 3) initiation weights: Glorot_normal 4) optimizer: Adam. 5) activation functions for convolution and ANN hidden layers: Relu 6) activation function for ANN output layer: sigmoid Hyper-parameters determined in advance: 1) termination rule: if overfitting more than 15 times, stop the replication for epoch. delete (rather than filling with black) 7) G. parameters α, β used in the proposed loss function: 1, 1.5 After all hyper-parameters were set, we initiated the CNN architecture, illustrated in Steps 5 through 8, below.  Let h (0) be the vector containing the flattening data with size 3 × 5 × 32 = 480. • Step 7. ANN, used to train all parameters for hidden layers and output layer (see Fig. 5.) The jth hidden layer h (j) is computed via three transformations: convolution (C), normalization (N), and Relu (R), where the notation C-N-R is similar to that used in Step 5. The true response y and its estimatesp andŷ, (used for the train and test set data, respectively) are shown at the top right of Fig. 5. Notations used in Step 7 will be further explained below.
-Convolution (C) used to extract image feature: Let c (j) k denotes the linear sum (i.e, convolution) of h (j−1) , W (j−1) , and b (j−1) , where h (j−1) is a column vector containing h (j−1) k ; W (j−1) is a matrix with the kth column vector, denoted as w where K 0 = 480 (the data size after flatten in Step 6), hidden layers J = 4, number of nodes in each hidden layer K j = 16, j = 1, 2, . . . , J , and K J +1 = 1 indicating one node in the output layer.
-Normalization (N) used to balance the scale: where c -Relu (R) used to avoid the gradient vanishing: Step 8. Back-propagation implemented with optimizer Adam (see [12]) and the proposed loss, defined in Eq. (14) and Eq.(16) below. Recall that optimal weights, obtained via the back-propagation methods, are referred to as parameters rather than hyper-parameters.

D. THE PROPOSED LOSS FUNCTION
The loss function, L, used for each image in the train set is the proposed generalized cross-focal entropy, defined below.
where a sigmoid function is the estimate of y for the train set data (0 ≤p ≤ 1) and the response y = 1 or y = 0 indicates the truth being glaucoma or normal, respectively. Note that two commonly used loss functions: cross entropy (α = 0, β = 1) and focal loss (β = 1) are special cases of Eq. (14). Consequently, the average batch-loss function L B for a batch of images used is the average of the associated m loss functions. That is, where the batch size m = 40 used in this study. The average batch-loss function L B is then used in the back-propagation (Step 8) to update all parameters. In summary, the proposed loss function defined in Eq. (14) is for the train set. Regarding the test set, the estimate of the y for the test set data is:ŷ = 1 ifp > 0.5, otherwiseŷ = 0.
Here, we discuss the selection of the parameters α and β in Eq. (14). Based on the experiments studied in this paper, we conclude that the optimal parameters in the proposed loss function are α = 1, β = 1.5 for the glaucoma detection. Fig. 6 shows the plot of four loss functions: mse (y−p) 2 , cross entropy (α = 0, β = 1), focal loss (α = 1, β = 1), and the proposed generalized cross-focal loss (α = 1, β = 1.5) for the case that the true response y = 1 andp = 10 −2 . Consider an examplep = 10 −2 (close to 0), which can be treated as a hard sample since it estimates the true response y = 1. The proposed loss (marked in green) is −1 × log (10 −3 ) − 0 = 3, which is 1.5 times larger than the cross-entropy (marked in red) −1 × log (10 −2 ) − 0 = 2, which is 2 times larger than the mse loss (marked in block) (p−y) 2 = (10 −4 − 1) 2 1. That is, the proposed loss focuses on the hard sample (sayp = 10 −2 , y = 1). For normal samples (sayp > 0.5, the loss between the mse (marked in black) and the proposed one (marked in green) are negligible. We further compare the performance for policy (ROI* + Retinex) using various loss functions: the proposed loss (α = 1, β = 1.5), focal loss (α = 1, β = 1) cross-entropy (α = 0, β = 1), and mse. Table 2 showed that the proposed loss with values α = 1, β = 1.5 performs the best among 4 loss functions.The difference performances for the rest of three loss functions are negligible. Note that the standard errors (se, denoted as σ * ) listed in the parenthesis, to the right of the associated estimates, are further explained in Subsection III-E. Table 2 are estimated performances (estimated accuracy, estimated sensitivity, and estimated specificity). The robustness of the estimated performances depends on the associated standard errors, which are used as the quality measures of the estimated performances.  Letˆ denote the estimated performance such as estimated sensitivity. And the associated standard error is denoted as σ * , where

All values listed in
assuming that all samples adopted to calculateˆ are independent.
Motivated by the question of which point-estimator digits to report in a statistical experiment, [33] and [34] proposed a leading digit rule (LDR) which is used to decide the position of the right-most reported digit. LDR rule is equivalent to discard digits that the practitioner considers to be meaningless (because of sampling error). LDR rule reports the point estimate through the leading digit of the standard error, where the leading digit of the standard error is the first left-most non zero digit of the standard error. All reported digits via LDR are meaningful and unreported digits are meaningless. See [33] and [34] for further explanation of the meaningless digits in terms of the probability sense.
Below, we demonstrate two examples to show how LDR suggests the reported estimated performances. Ex 1. Let the estimated sensitivity is 0.954 and its standard error σ * = 0.03. LDR suggests to report the estimated sensitivity 0.95 (i.e., up to the hundredth digit). Ex 2. Let the estimated sensitivity is 0.954 and its standard error σ * = 0.1. LDR suggests to report the estimated sensitivity 0.9 (i.e., up to the tenth digit). That is, in Ex. 2, LDR does not report digits 5 and 4 because those digits are meaningless.

F. ROBUST AND OPTIMAL DESIGN
Here, we describe the optimal values of the proposed hyper-parameters in three steps.
Step 1. We first adopted a 2 7−3 IV fractional factorial design with resolution IV, where there were eight factors (denoted as A through H in subsection III-C, each with two levels. Results of the 16 combinations (runs) in the 2 7−3 IV fractional factorial design are listed in Table 3, where the mean accuracy and the standard error (σ * ) ) are shown in the last two associated columns. The listed data in the right most three columns are explained as follows: g (i) j denotes a vector containing g (i) j (k), the accuracy at the ith replication, jth combination,     Table 4 shows that with respect to mean    Table 4), we continue to investigate the DOE for factor B and D in relation to the layer numbers (6, 6), (5,5), (4,4), (3,3), where the first figure in each parenthesis indicating the number of convolutional layers each with 4 transformations (C-N-R-P), and the second figure in each parenthesis indicating the number of NNhidden layers each with 3 transformations (C-N-R). The average accuracy continues to increase up to (4,4), but decreases at (3,3). We conclude that the optimal design centers on (4,4). That is, we chose B = 4 each with 4 transformations VOLUME 9, 2021  (C-N-R-P), and D = 4 each with 3 transformations (C-N-R). Fig. 7 illustrates the plot of the average accuracy as a function of factors B and D, where the x-axis is the layer no. for B (each layer with C-N-R-P) and D (each layer with C-N-R).

IV. PERFORMANCE
This section compared four policies, illustrated in Fig. 8, in terms of the sensitivity, specificity, and accuracy. Policies were provided for the complete cross-section range, in which Policy 1 adopted no ROI and no Retinex, while Policy 4 adopted Retinex and ROI. Table 5 listed the associated mean and standard error (σ * ), and Table 6 listed the associated optimal combination. All statistical data (mean and is standard errorσ * ) were computed based on 10 blocks setting, each with 2 replications, described in Section III-A. Note that all policies used CNN with the proposed loss function (α = 1, = 1.5).
Results shown in Tables 5 and 6 are summarized below. 1) Both Retinex and ROI* were significant factors in terms of the estimated performance and its standard error (σ * ) for accuracy, sensitivity, and specificity. However, Retinex showed greater significance than ROI*. 2) Policies involving Retinex (Policies 3 and 4) performed significantly better than those without Retinex (Policies 1 and 2) in terms of both estimated performance and its standard error. Specifically, results in Table 5 showed that the estimated performance were all above 0.90 for Policies 3 and 4 involving Retinex, while the estimated performance were all below 0.80 for Policies 1 and 2 without Retinex. The associated standard errors were all below 0.05 for Policies 3 and 4 involving Retinex, while the associated standard errors were all above 0.1 for Policies 1 and 2 without Retinex. 3) Policies involving ROI* (Policies 2 and 4) performed better than those without ROI* under the same conditions. As such, Policy 2 performed better than Policy 1, and Policy 4 performed better than Policy 3. Table 5 showed that the associated estimated performances for Policy 4 were all above 0.95, and the associated standard errors were all below 0.03. Results in Table 6 showed that the optimal performance for Policy 4 was (0.97, 0.98, 0.98).

V. SUMMARY AND CONCLUSION
This study proposed an effective and robust framework for detection of glaucoma that integrates deep learning technologies and statistical methodologies. These technologies include (a) Retinex, a color enhancement algorithm that removes the effects of fundus photography flash and restores the original colors of the fundus image; (b) the extraction of critical areas for analysis (including the optic nerve cup, optic disc, macula, and minimal area of optic nerve fiber layer); and (c) a basic CNN with 4 convolutional layers using the proposed loss function L = −y(1 −p) α logp β − (1 − y)(p) α log(1 −p) β with the optimal parameters α = 1, β = 1.5 and with the optimal hyper-parameters (such as input image size, convolution layer, and hidden layer) obtained via a robust design of experiment.
The proposed framework was illustrated in Fig. 2 (a simple form) and Fig. 9 (a thorough form) using 1450 color fundus images provided by KCGM Hospital in Taiwan. The proposed framework outperformed other approaches with respect to effectiveness, robustness, simplicity, and clarity. The effectiveness was demonstrated via the estimated sensitivity 0.95, specificity 0.98, and accuracy 0.97 in different training, validation, and test set settings. The robustness was demonstrated via he associated standard error all below 0.03. the simplicity was shown via the adopted basic CNN model (with 4 convolutional layers) compared to deep CNNs such as GoogleLeNet (with 21 convolutional layers) or ResNet152 (with 152 convolutional layers). The clarity is demonstrated via intuitive graphs and clear mathematical notations which make it easy for others to reproduce our results. Overall, the proposed framework advances the field of glaucoma detection.