Effective Deep Learning Approach to Denoise Optical Coherence Tomography Images Using BM3D-Based Preprocessing of the Training Data Including Both Healthy and Pathological Cases

Retinal diseases are significant cause of visual impairment globally. In the worst case they may lead to severe vision loss or blindness. Accurate diagnosis is a key factor in the right treatment planning that can stop or slow the disease. The examination that can aid in the right diagnosis is Optical Coherence Tomography (OCT). OCT scans are susceptible to various noise effects which deteriorate their quality and as a result may impede the analysis of their content. In this paper, we propose a novel and effective method for OCT image denoising using a deep learning model trained on pairs of noisy and clean scans obtained by BM3D filtering. A comprehensive dataset of 21926 OCT scans, collected from 869 patients (1639 eyes), covering both healthy and pathological cases, was used for training and testing of the proposed scheme. The method was validated taking into account quantitative metrics concerning image quality. In addition, the proposed denoising scheme was evaluated by analyzing the impact of applying it in the eye disease classification based on Convolutional Neural Networks (CNNs) where we obtained the improvement of around 1–3 pp (percentage point). A separate dataset of 25697 scans collected from 1910 patients (2953 eyes) was used for this purpose. The conducted experiments have proved that the method can be applied as a preprocessing step in order to provide better disease classification results and can be useful in other OCT image analysis tasks. The proposed solution is much faster and perform better than the classical BM3D filter (over ninetyfold speed-up) and other related methods, especially when a big set of images needs to be processed at once. Furthermore, the use of the diverse dataset show the benefit over methods which are based on using only healthy scans for the training of the neural network.


I. INTRODUCTION
Optical Coherence Tomography (OCT) has become an indispensable clinical tool in ophthalmology [1], [2], neurology [3], cardiology [4], [5] and dermatology [6]. It is a well-established medical imaging technique that generates high-resolution, cross-sectional images of anatomical structure. The procedure enables three-dimensional, non-invasive, The associate editor coordinating the review of this manuscript and approving it for publication was Ikramullah Lali. in vivo and real time visualization of architectural morphology of the biological tissue. For instance, in ophthalmology, by observing retina, retinal nerve fiber layer (RNFL) and optic nerve head (ONH), it is possible to enhance early diagnosis and staging of ocular diseases and monitor the efficacy of a treatment [1], [2], [7].
The OCT images are obtained by measuring the interference between the backscattered infrared light from the biological sample and the light reflected from a reference mirror [1], [8]. That is why, the method is susceptible to the presence of coherent noise (speckle noise) which has a granular pattern [9]. Another type of noise which is present in OCT scans is a system noise. It is a variation of the measured signal caused by some properties of the light detection mechanism, such as shot, relative intensity or thermal noise from the detector [10]. These phenomena are inevitable and result in degraded quality of OCT scans, reduced image contrast and impaired diagnostic interpretation of OCT data [11], [12], [13], [14]. This, in turn, may lead to the limitation of accurate layer segmentation due to imprecise delineation of the intra-retinal layers [15], [16] and deterioration of retinal thickness measurements [17], [18]. For instance, wrong estimation of RNFL thickness can affect the correct diagnosis of glaucoma [19].
In case of OCT image denoising, deep learning methods usually require a proper training dataset that contains pairs of clean and noisy images. There are various approaches to generate such pairs. For example, frame averaging or adding artificial noise to the clean samples might be performed. However, the former method requires a large number of repeated input images to compute the ground truth. In turn, an inherent limitation of the latter approach is that the generated images may not accurately represent the statistical properties of noise in real images. Therefore, the trained models may not be effective when applied to real scans.
Another important point is the selection and number of patients involved in the study. Obviously, the greater the quantity and diversity of the scans are, especially in training phase, the better the representation and generalization of the dataset are. Below, we discuss these aspects in some related works.
For instance, in [58] the clean images are obtained by performing multi-frame averaging and the noisy images are created by adding an artificial noise with the right distribution. A total of 20 healthy subjects were recruited for the study and in total 3880 scans were obtained.
Another option for the creation of clean/noisy pairs, presented in [59], is to select one frame from a series and treat it as a noisy sample and then perform multi-frame averaging on the whole series of scans and make the resulting image a corresponding clean sample. For data acquisition, 2350 scans from 47 healthy eyes were obtained.
In [60] except for signal averaging also contrast enhancement is performed in order to generate clean scans. For training 512 scans of healthy eyes were used while for testing 36 scans coming from both healthy and pathological eyes were selected.
The method described in [61] is also based on frame averaging. Six OCT volumes were acquired from both eyes of 38 healthy patients. These scans were then registered and averaged to create the ground truth denoised image. Due to inaccurate registration only 69 volumes (each consisting of 180 B-scans) from healthy eyes were used for further training and analysis. As a result, 74520 clean/noisy scan pairs were produced. In addition, 1080 B-scans from patients with glaucoma were acquired but they were only used to evaluate the method.
In [63] volumetric OCT datasets of healthy human retina centered on the fovea and optic nerve head were acquired to train and test the self-fusion neural network. Each volume contained 2500 raw B-scans (500 sets of 5-repeated frames). The repeated frames were averaged and the resulting frames were self-fused together with 6 adjacent frames (3 before and 3 after the current frame) in order to achieve ground truth images of high quality for the neural network training. The model was trained on 9 ONH volumes and validated on 3 fovea volumes. For testing, 3 ONH and 3 fovea volumes were used. Additionally, OCT images from external datasets were added as test images. This group also contained the samples with pathological features present. Another approach was presented in [64] where the authors do not need the pairs of clean and noisy samples but the pairs of scans from the same location but with a different noise level. In total, images from 16 subjects, resulting in 3895 pairs of B-scans at different locations of the retina, were included for training. For the evaluation 3 pairs of single and 128x averaged scans acquired at ONH from distinct subjects were used.
Similarly as in the previous work, in [65] the authors propose a deep learning method for noise reduction in OCT images without the need for obtaining noise-free ground truth as labels. In contrast, they need pairs of B-scans from the same sample location. Then, one noisy image is taken as the input while another noisy image is treated as the label.
The dataset consists of 15 subsets of OCT images of various sample materials, including finger nails, hand palms, tomato, sample tooth, plastic tubes and thin films. In total, 3750 OCT scans were collected in this study.
It is also relevant to select the right method for introducing the artificial noise to the images. Some approaches assume that the noise has additive form [58], [66] and other apply a multiplicative noise model [59], [67]. Another key factor here is the distribution of the noise. Some methods in the literature approximate the speckle noise distribution to Gaussian [58], [66], [67], Rayleigh [68], [69]/Generalized Gamma [70] or Poisson [46]. However, each model type has its own pros and cons and it is hard to find a universal solution.
The denoising scheme presented in this paper assumes using two approaches for the generation of pairs of clean and noisy scans. The first method is the standard one that simply adds a Gaussian noise to the original clean scans and it is used for comparison purpose. The second approach applies the BM3D filter to the noisy images selected from the original dataset. The main characteristic of this approach is the fact that the noisy images represent the actual noise which is present in OCT scans and not the one which is artificially added. Another aspect is that the clean images are obtained by using BM3D algorithm which is commonly used for image denoising. This way, we will try to train the network so that it tries to reproduce the results obtained by using the BM3D filter. Similar attempt was proposed in [71] where the structure of the convolutional neural network is designed in such a way that it mimics some steps from the computational pipeline of BM3D method, e.g. its block matching stage, by adding extraction and aggregation layers in the model. In contrast, our method is based on the preprocessing of the training dataset so that the network can try to replicate the filtering of BM3D algorithm.
One of the main advantages of the proposed method in contrast to most of methods described in the literature consists in using the dataset that contains both the scans from healthy patients and the ones with pathological features present in the image. Moreover, it consists of 21926 samples acquired from 869 patients (1639 eyes) which guarantees a good representation and diversity of potential cases we may face during OCT examinations. Besides, the proposed method is validated taking into account not only quantitative metrics concerning image quality but, in addition, it is evaluated by analyzing the impact of applying it in the eye disease classification. For this purpose, a separate dataset of 25697 scans collected from 1910 patients (2953 eyes) was used. Although the quantitative metrics can give an overview about the general effect on image quality improvement, they may be easily tricked. Therefore, it is of great importance to assess the performance of the given method by examining how it behaves in some specific OCT image analysis tasks.
The paper is organized as follows. The following section is dedicated to the methodology of the proposed denoising scheme. Section III contains experimental results and their discussion. Finally, section IV presents conclusions and future work.

II. METHODOLOGY A. USED DATASET
The dataset used in this study consists of 21926 scans. They were collected during the realization of the INDOK project [72] that aims to develop a system for automatic diagnosis of retinal diseases based on OCT images with the use of artificial intelligence. The dataset contains the samples from both healthy and sick eyes. In total, the scans from 869 patients were collected. Among the pathological changes present in the 'sick' scans in total 19 features were identified, e.g. cyst, drusen, hyperreflective foci (HRF), epiretinal membrane (ERM), retinal detachment (RD), macular hole (MH) and macular pseudo hole (MpH), serous and fibrovascular pigment epithelial detachment (sPED/fvPED). The collected scans are of various quality -some are clean and other have significant amount of speckle and/or system noise visible. All scans have been precisely tagged with proper annotations by professionals from the ophthalmology field. The method for splitting the dataset into two subsets regarding the quality of the scans was described in detail in the following subsection. The exact representations for healthy/sick and clean/noisy scans are presented in Table 1.

B. SPLIT INTO CLEAN AND NOISY SAMPLES
In order to divide the samples into either a clean or a noisy subset, it was noticed that it is much easier to detect system noise in the upper part of the scan which is located above the retinal layers. Therefore, a few metrics have been selected in order to describe this region. Additionally, a whole image is also analyzed in order to estimate the level of speckle noise. We selected these metrics based on our observation concerning the appearance and a character of noise which is present in OCT scans. All these metrics are used to group the scans into one of the mentioned categories. These metrics include:   The upper regions in the scans were detected through Gaussian blurring, image thresholding, some morphological operations and analysis of the location and area of the identified blobs. The sample difference between clean and noisy images with outlined upper regions of scans is shown in Fig. 1.
The distribution of values for each measure is depicted in corresponding histograms in Figs. 2a to 2e and in Table 2.
As can be seen in the presented results, none of the metrics allows for easy separation of data and it is difficult to define a specific threshold for any of them. That is why, it was necessary to apply a more complex solution. Hence, it was decided to select 500 random samples for both classes (clean and noisy) which were consulted with an ophthalmologist and then train an SVM model based on these annotations and aforementioned metrics. 80% of samples from each group were used for training and the remaining 20% for test. The training of the model was realized in Python with the use of scikit-learn package [74]. The trained SVM model achieved the accuracy of 95%. As a result, 9361 clean samples and 12565 noisy samples have been obtained.

C. NETWORK ARCHITECTURE
The network architecture selected for the task of OCT scans denoising is a modified U-Net architecture introduced for the first time in 2015 in [75]. It was originally used for biomedical image segmentation. However, due to its characteristics and after slight modifications it can be also applied for OCT image denoising. The U-Net network consists of two main blocks, i.e. downsampling and upsampling ones. The downsampling block (or encoder) performs a specific number of convolutions so that the input image is downsampled to a feature map. By contrast, the upsampling block (or decoder) through transposed convolution operations upsamples this feature map back to the original input image size. In addition, the so-called skip connections help to avoid losing relevant information from the encoder part and preserve the spatial relationships. It is done by transferring information from every downsampling layer in the encoder to their corresponding upsampling layer in the decoder. In this way, the network learns how to retain the knowledge about both local (textures) and contextual (spatial arrangement) characteristics of a tissue [58].
In our study, the network takes as the input the images of size 256 × 192 pixels. It performs four downsampling and corresponding upsampling operations. It has over 8 million trainable parameters. Adam optimizer is used for training with default learning rate equal to 0.001 and the mean squared error (MSE) is selected as a loss function. The number of epochs is 40 and the batch size is set to 32. The trainings were held on NVIDIA GeForce RTX 2800 Ti with CUDA v10.1 and cuDNN v7.6.5 acceleration.

D. TRAINING AND TESTING OF THE NETWORK
In this study, two approaches for the generation of pairs of clean and noisy scans have been selected. The first approach involves adding a Gaussian noise to the original clean samples. 80% of the pairs created this way are used for training. For testing, 20% of the remaining pairs were assigned. In contrast, the second approach aims at emulating the BM3D filter behavior. BM3D filtering is a very common method used for OCT image denoising [38], [70]. In order to let the network reduce the noise in a similar way, the noisy dataset was subjected to BM3D filtering implemented in Python [76]. This time, similarly as in the first approach, the resulting pairs of noisy and clean samples were split into training and testing subsets using the same ratio as previously (80:20).
Another option that was examined consisted in taking this whole noisy subset, together with the corresponding scans subjected to BM3D filtering, for training. In turn, the pairs of clean samples together with the samples with added Gaussian noise were utilized for testing. Although this leads to the situation when the training and testing samples come from different data distributions and have a slightly different nature, it was still worth verifying how the network could behave. When applying BM3D filtering, the algorithm takes as the input the value of sigma parameter which is related to the expected noise level. In our experiments three variants have been taken into account: 1) σ parameter is assigned based on the value calculated following the method for noise variance estimation presented in [73] [SIGMA_REAL] 2) σ parameter is the previous value incremented by 10 [SIGMA_PLUS] 3) σ parameter is equal to 20 [SIGMA_20] In case of adding a Gaussian noise to the samples from the clean subset, the following ranges for σ parameter are taken into account: The limits for these three ranges were selected based on potential values of noise variance in OCT scans and in order to examine various scopes.
All the above-mentioned variants are presented in Fig. 3.

E. QUANTITATIVE MODEL EVALUATION
In order to assess the performance of all denoising approaches, the following image quality metrics have been selected: • Mean Squared Error (MSE) -it represents the cumulative squared error between two images. It can be calculated using the below formula (1): where: M -the width of the image N -the height of the image • Peak Signal to Noise Ratio (PSNR) -it measures the peak error between two images. It can be expressed by the following formula (2):  where: MAX -the maximum possible pixel value of the image • Structural Similarity (SSIM) -it is a quality assessment metric based on the visual degradation of structural information (changes in local structure and contrast between two images). It can provide a good approximation to human visual perception [77]. The parameter value can be obtained using the below formula (3): where: µ x , µ y -mean intensities of two images being compared σ x , σ y -standard deviations of two images σ xy -cross-covariance between both images C 1 , C 2 -two constants to stabilize the division • Multi-Scale Structural Similarity (MS-SSIM) -it is based on multiple SSIM measurements at different image scales [78]. It is calculated with the use of the following formula (4): where: x i , y i -the images at i th scale M -the number of scales A small MSE indicates minor error. The larger a PSNR value is, the better quality an image has. Both SSIM and MS-SSIM return values from the range < 0, 1 >, where 0 indicates no similarity and 1 implies that both images are identical.

A. DENOISING PERFORMANCE
At the beginning, we use cross-validation to evaluate the efficiency of the deep learning approaches. Therefore, both datasets with pairs of clean and noisy scans are divided into training and testing subsets with the ratio 80:20. The results for the networks trained on both datasets described in the subsection II-D are presented in Tables 3 and 4. As might be observed, both denoising schemes seem to work and in most cases provide improved results taking into account particular metrics. It is worth noting that in case of SIGMA_REAL approach the metrics for the noisy scans seem to be better. However, this is caused by the little effect introduced by denoising with sigma parameter which is equal to the value obtained by using the noise variance estimation formula [73]. As a result, noisy and clean (after BM3D denoising) images are very similar and the results seem to be very satisfying but in this case they should be ignored. That is why, it seems reasonable to check other variants for the sigma parameter in BM3D algorithm.
In the next step, we focus on the tests performed when the noisy subset and the corresponding samples denoised with the use of BM3D method are used for training. Then, the full clean dataset with added Guassian noise will be treated as a testing set. Although the noise nature is different across these two datasets, such experiments can help to notice some dependencies and point out which method could be useful in various tasks.
The results for this scheme are shown in Table 5. Each test set (ADDNOISE*) is evaluated with the use of three models -UNET_REAL, UNET_PLUS and UNET_20 -trained on different training sets (SIGMA_REAL, SIGMA_PLUS and SIGMA_20, respectively). In addition to the approaches based on U-Net architecture, we also evaluate the original BM3D algorithm and BM3D-Net [71] algorithms. Moreover, we also verified how a different architecture model could behave when trained on the same two datasets (SIGMA_PLUS and SIGMA_20) as in case of U-Net. This way we are able to determine if the proposed approach is valid no matter which model is used. For this purpose RED-Net model [79] was selected. The first row is common for all seven methods since it contains the results for the metrics calculated for all the pairs of corresponding clean and noisy scans. Then, the following seven rows present the results for denoised scans from particular denoising variants.
The results obtained for the models that are trained on noisy scans and their corresponding BM3D-denoised scans (especially SIGMA_PLUS and SIGMA_20 datasets), that try to emulate the outcomes of BM3D filter, seem to be satisfying since they provide better results for the denoised scans for all the metrics in comparison to referenced noisy scans. These two schemes are the most promising in terms of applying them in other OCT image analysis tasks which is why they will be evaluated more deeply in the following subsection.
It is worth mentioning that the difference between particular approches presented in Table 5 is statistically significant. It was verified by performing ANOVA test for all three testing scenarios and all four quality metrics. Each time the P-value measured within and between particular groups (denoising approaches) was equal to zero. This satisfies the condition that P-value is lower than 0.05 which is usually the value selected for the significance level. 65400 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.    Apart from quantitative metrics analysis, it is always recommended to visually assess the outcomes of image processing algorithms. The sample groups of clean, noisy and denoised scans for UNET_PLUS and UNET_20 approaches are presented in Figs. 4a and 4b. One can see that in both noisy images there is a noticeable granular pattern present. In both denoised images it is much less visible. Besides, the level of details is higher in denoised images and they resemble the original structures from the clean scans. For instance, the difference between particular layers or some little features (like hyperreflective foci or shadows) can be much more easily perceived. This might be especially noticed in case of UNET_20 denoising.
Since quantitative metrics can be easily tricked and the visual outcomes do not guarantee how the denoising will influence, for example, disease classification accuracy, the more elaborated evaluation is required. Such an attempt is made in the next subsection.

B. IMPACT ON DISEASE CLASSIFICATION ACCURACY
In order to evaluate the usefulness of the proposed denoising schemes, their impact on disease classification accuracy has been assessed. Therefore, three CNN-based classifiers that aim to detect the presence of three eye diseases in OCT scans were trained. These classifiers perform binary classification to one of two classes -selected disease or 'rest' which covers all other diseases and/or healthy scans. The diseases selected for the evaluation are Age-Related Macular Degeneration (AMD), Epiretinal Membrane (ERM) and Diabetic Retinopathy (DR). The dataset used for training and testing contains VOLUME 11, 2023  5539 scans with AMD present, 4704 with ERM and 711 with DR. In total 25697 scans obtained from 1910 patients and covering 2953 single eyes were used. It is worth noting that the dataset used for training and testing in disease classification task is independent of the one used for the creation of the denoising models described above. It is guaranteed by selecting the scans acquired in different time periods for both datasets. For each classifier the class balance was assured.
The exact statistics regarding the classification dataset is presented in Table 6. The datasets were split into training, validation and test subsets using a ratio 70:10:20. This split was performed in such a way that the scans of the same eye can be selected only to one subset. Some preliminary investigations indicated that the most satisfying results for AMD, ERM and DR recognition are obtained by VGG16, Resnet152 and VGG19 model architectures, respectively. The image classifiers were implemented in Python with the use of Tensorflow library [80].
For each disease, six approaches were investigated: no denoising (used as a reference for the remaining solutions), classical BM3D filtering, BM3D-Net [71] and two different architectures trained using two proposed schemes (SIGMA_PLUS and SIGMA_20), i.e. REDNET_PLUS, REDNET_20, UNET_PLUS and UNET_20, with the latter variant additionally evaluated when trained only on the healthy scans (UNET_20_HS). This way it will be possible to determine whether denoising as a preprocessing step, in general, improves the disease classification accuracy and if the proposed methods can offer a benefit over classical and other already existing approaches. What is more, in contrast to most of other similar attempts, the dataset used for training of the proposed denoising models contains both healthy and pathological samples. Therefore, the impact of using such a diverse dataset is also studied by comparing last two variants (UNET_20 and UNET_20_HS).
The results for accuracy, F1 score and AUC of image classification for all three diseases are presented in Table 7. A sample ROC curve (of UNET_20 model for AMD) was shown in Fig. 5. The assignment to one of two classes (DIS-EASE or REST) was based on the higher probability of the given class (threshold equal to 0.5).
It can be seen from the Table 7 that the proposed denoising methods guarantee similar (in case of AMD) or better (for ERM and DR) classification performance with respect to the classical BM3D filtering. Similarly, at least one of the proposed methods may outperform the BM3D-Net. This is especially noticeable in case of ERM and DR classifier. The classifiers which were trained on the diverse datasets always prevail over their corresponding versions trained only on the healthy scans. This clearly justifies the benefit of using the extended dataset covering both healthy and nonhealthy scans. Besides, all the classifiers which operate on  the denoised images in the majority of cases can provide improved classification results. What is more, the results are better for both metrics in case of the proposed solutions in comparison to the classifiers which do not utilize denoised images. Although the improvement is of around 1-3 pp (percentage point), it is worth highlighting that the basic classifiers, with no denoising performed, could already offer the classification accuracy (or F1 score) of around 90%. In such a case making progress of every 1 pp is significant while at the same time being much more difficult to achieve. In other words, we can imagine that using one of the presented schemes can effectively lead to the situation when additional 1-3 in 100 sick patients are given a correct diagnosis and as a result can get adequate treatment on time. Eventually, the enhanced classifiers can save a considerable amount of patients by stopping or delaying their disease progression or even preventing them from getting blind.

C. DENOISING TIME
The denoising of the whole noisy subset (12565 scans) with the use of the proposed method (UNET_20 variant) took 1104.9207 s and in case of BM3D_NET it took 23161.0645 s. Due to the fact that the denoising via BM3D filtering takes much longer, only 5903 scans were processed which took 46981.46 s. The processing time for a single scan is shown in Table 8. As might be observed, the proposed scheme offers much faster processing time, namely a ninetyfold reduction in comparison with standard BM3D approach, over twentyfold reduction in comparison with BM3D-Net and over eightfold reduction in comparison with RED-Net. All three methods were tested on the images of the same size, i.e. 256 × 192 pixel. Therefore, due to the fastest denoising time, the proposed method is suitable for all the applications in which the high number of scans needs to be processed simultaneously.

IV. CONCLUSION AND FUTURE WORK
In this paper, new deep learning schemes for OCT image denoising based on U-Net architecture are proposed. The novel idea consists in specific preprocessing of the input dataset in order to generate pairs of clean and noisy images for VOLUME 11, 2023 the training of the neural network models. The proposed solutions were evaluated taking into account quantitative metrics concerning image quality as well as the impact of applying the presented denoising scheme in the eye disease classification. The conducted experiments prove that the new methodology can be useful in various OCT image analysis tasks. An important feature of this method is that the processing time was decreased significantly. In our study, we achieved a solution that is 90 times faster in comparison to standard BM3D algorithm while at the same time provides enhanced final image quality. We used a comprehensive dataset of 21926 OCT scans covering both healthy and pathological cases that were collected from 869 patients (1639 eyes) for the creation of denoising models. In turn, the dataset used for disease classification consists of 25697 scans collected from 1910 patients (2953 eyes) that are independent of the previous dataset.
The main limitation of the presented approach consists in the need for making additional preprocessing step (BM3D filtering) before the training of the model might be performed. This might be time-consuming depending on the denoising method that was selected. However, this process is performed only once and the benefits obtained later outweigh these limitations -the processing time of the already trained model is constantly low and satisfying.
Further work will include the application of the proposed methodology in other problems related to OCT image analysis, e.g. in OCT layer segmentation task. Additionally, other referenced image filtering methods and neural network architectures could be tested.

APPENDIX A LIST OF ABBREVIATIONS
See Table 9.