Optimizing CNN Hyperparameters for Blastocyst Quality Assessment in Small Datasets

Morphological assessment of blastocyst quality is one of the most significant challenges in the in vitro fertilization (IVF) process because the current assessment relies on evaluation by an embryologist; it is therefore manual, subjective, and imprecise. Artificial intelligence (AI) can help overcome the limitations of manual assessment, and its use is expected to increase implantation rates in IVF. This study aims to optimize a convolutional neural network (CNN) model using the grid search method and to evaluate the effectiveness of different machine learning models in classifying blastocyst quality in a small dataset. The proposed model is compared with other machine learning methods, namely logistic regression (LR), support vector machine (SVM), k-nearest neighbors (KNN), and a boosting algorithm, each combined with the Canny operator as a segmentation step and principal component analysis (PCA) as a feature extraction approach. We evaluated the results using several performance measures: precision, recall, F1-measure, accuracy, and the area under the receiver operating characteristic curve (AUC-ROC). The final results showed that our proposed CNN model achieves a validation accuracy of 84.00%, a test accuracy of 83.33%, and an AUC of 0.844. McNemar’s statistical test results support that our CNN model outperforms the other classifiers.


I. INTRODUCTION
One procedure performed to overcome infertility problems is in vitro fertilization (IVF). The process is reserved for cases in which other methods, such as fertility drugs, surgery, and artificial insemination, have not worked. Blastocysts have a higher implantation potential than embryos at the cleavage stage (embryonic day 3) [1]. Research has shown that continuing embryo culture up to day 5 results in a higher chance of successful delivery [2]. Therefore, grading the embryo on day 5 is crucial. Grading of embryos on day 5 (blastocyst) has been based on the Gardner system, in which the grade is determined by the quality of the inner cell mass (ICM) and the trophectoderm (TE) [3]. Blastocysts with good grades are transferred to the uterus so that pregnancy can be expected, thereby avoiding repeated IVF cycles that incur additional costs. The IVF process may produce more than one embryo, but not all embryos have implantation potential. Transferring more than one embryo can increase the chance of pregnancy but also increases the likelihood of pregnancy complications for both the mother and baby. One solution to minimize multiple pregnancies is to transfer only one embryo, although this will reduce the probability of pregnancy [4]; thus, accurate embryo grading is necessary. Morphological grading of embryos is one of the challenges associated with IVF; it is currently still based on the embryologist's assessment with a microscope and is therefore manual, subjective, and imprecise. Artificial intelligence (AI) can help overcome the limitations of the manual scoring system, and its use is expected to increase the implantation rate in IVF. The inaccuracy of manual assessment is caused by the texture of the blastocyst image, in which it is difficult to distinguish between the ICM and TE textures, and by blurring of the blastocyst edge due to high noise levels.
The associate editor coordinating the review of this manuscript and approving it for publication was Mouloud Denai.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The blastocyst images used in this study were taken with Hoffman modulation contrast (HMC) [5] imaging, which is a light microscopy or optical microscopy technique. HMC imaging is routinely used in IVF clinics to capture images of developing embryos. Pattern recognition based on blastocyst images has used segmentation with SVM [6], [7] as the classifier. However, determining the optimal classifier is still a problem, especially for small datasets, which remain a challenge in pattern recognition studies. The use of a pattern learning feature is necessary during the machine learning preprocessing stage. One way to address this problem is to implement a deep learning technique using a CNN model. Deep learning methods, especially CNNs, are currently being used in the IVF field to evaluate embryo morphology, embryo quality, implantation potential, and system quality control. Dimitriadis et al. [8] proposed a CNN model that was trained and tested with a dataset of 3,469 embryos to classify between 2PN embryos and non-2PN embryos. Their model classified embryos using a test dataset of 947 images with an accuracy of 91.86%. Using the Inception v3 architecture, Dimitriadis et al. [9] classified two types of embryos: blastocysts and nonblastocysts. After training on a dataset of 1,100 embryos, the model could classify 182 test embryos with an accuracy of 89.01%. Hariton et al. [10] developed a CNN model combined with genetic algorithms that can select the best-quality blastocyst. The CNN model was trained using a dataset of 3,469 images, and the resulting test accuracy was 75.3%. Hariton et al. and Khosravi et al. [11] proposed a framework based on a deep neural network with a dataset of 50,000 time-lapse embryos to select high-quality embryos. Based on the Inception model, the framework can predict blastocyst quality with an AUC > 0.98. Thirumalaraju et al. [12] proposed an AI system to evaluate fertilization and blastocyst development and used it on 947 images of embryos. The resulting coefficient of variation in measuring the best grade of blastocyst quality was 10.9%. To predict fetal heart pregnancy, the authors of [13] proposed a deep learning model that makes this prediction automatically without assessing the blastocyst morphology. The resulting model can predict fetal heart pregnancy with an AUC of 0.93. Thirumalaraju et al. [14] proposed a multilayered CNN model to differentiate embryos based on their morphological qualities. Using data from 2,440 embryos, the developed model distinguished between blastocyst and nonblastocyst embryos with a validation accuracy of 49.17%. Bormann et al. [15] proposed a CNN model trained with 742 embryos; an accuracy of 90% in selecting the best-quality embryos was achieved. To automatically predict the ICM and TE grades of blastocysts, the authors of [16] proposed a deep learning model that can assess blastocyst quality. Chen et al. [17] proposed an automatic scoring system for embryo assessment using a dataset of 171,239 embryo images and the ResNet50 model for training; the average predictive accuracy was 75.36% for the three blastocyst assessment categories. Dirvanauskas et al. [18] used 7,002 embryo images to develop a CNN combined with a discriminant classifier for evaluating and predicting embryo quality. The proposed model can predict the embryo quality with 97.62% accuracy. Bori et al. [19] developed an ANN-based AI model to predict live births using the blastocyst morphology. Using data from 186 embryo images, the total accuracy in predicting live births was 72.7%.
However, these studies used large datasets from time-lapse microscopy and provided limited information about the neural network itself, and the effect of hyperparameters on the assessment of embryo morphology with a small dataset remains unclear. Several researchers have also investigated deep learning methods on small medical datasets, but not on blastocyst image datasets. The authors of [20] used a transfer learning method based on a CNN to evaluate a limited number of magnetic resonance imaging (MRI) datasets. In the current work, the aim is to increase the efficiency and effectiveness in image recognition. Our previous paper [21], which used the transfer learning method in the blastocyst quality classification task, achieved a test accuracy of only 64.29%; thus, the accuracy of the method needs to be further improved. Table 1 presents the related work in detail, along with the methods, datasets used, and results.
A key problem when applying deep learning or machine learning techniques with little training data is overfitting. Overfitting occurs when the trained model fits the training data but does not generalize well to unseen data. Adding dropout and regularization strategies [22], [23], [24] to a deep learning network can reduce overfitting. This hyperparameter evaluation study on a CNN used images of blastocysts because of their importance in the IVF process. Some researchers have focused their work on embryonic development [14], [15], [16], [17]. Therefore, the specific purpose of this study was to evaluate the effects of different hyperparameters on the CNN when using a small dataset of blastocyst images to classify two quality classes (good and poor) based on their morphology and to decide which hyperparameter values are best for our dataset.

A. DATASET
We used a publicly available blastocyst dataset to evaluate the performance of our proposed CNN model. The dataset contains 249 images of blastocysts from HMC microscopy and is accessible via https://vault.sfu.ca/index.php/login upon request [25]. The dataset includes two grades of images, good-quality and poor-quality [11], and an expert embryologist at the Pacific Centre for Reproductive Medicine (PCRM) graded each blastocyst. The blastocyst images in the dataset have various pixel sizes, and we resized them to 224 × 224 pixels. The blastocyst images were obtained using an Olympus IX71 inverted microscope with Nomarski optics (DIC). As the embryo develops from day 1 to day 5, it undergoes many cell divisions. The essential structure of the blastocyst on day 5 is shown in Fig. 1, which also shows the two quality levels of blastocyst images we used.

B. TRAINING, VALIDATION, AND TESTING
To train and evaluate the proposed CNN model, we used 249 blastocyst images. The dataset was divided into three subsets for training (70%), validation (20%), and testing (10%). We performed an augmentation process at the training stage to avoid overfitting due to a lack of training data. In the augmentation process, we sheared, zoomed and flipped the training data. In all experiments on the proposed CNN model, we used Python programming language with Jupyter Notebook as an IDE (Integrated Development Environment). We also used Keras [28] as a framework with the TensorFlow backend.
The performance of our CNN model was compared with that of conventional machine learning. In previous studies, the classification task was generally applied using SVM [29], KNN [30], LR [31], and gradient boosting [32].
The frameworks commonly used in machine learning include image acquisition, preprocessing, feature extraction, and classification. In this study, for the classification that used conventional machine learning, we performed preprocessing and feature extraction steps.
In a previous study [33], we used the Canny operator, which achieved the best detection for blastocyst images. For feature extraction, we used principal component analysis (PCA) [34], an appearance-based approach that attempts to identify blastocyst components using global representations based on the whole image instead of only on local features of the blastocyst. The same blastocyst images used as input for the CNN were also used for the other classifiers. Fig. 2 shows our proposed framework for human blastocyst quality classification.
We classified blastocyst quality using conventional machine learning methods on the same dataset as for the proposed CNN model. We split the dataset into 80% for training and 20% for testing. In the first step, we converted each blastocyst image from RGB to grayscale and resized it from the average dimensions of 424 × 378 to 224 × 224 so that our machine learning models could be trained faster on smaller images. The second step consisted of processing the blastocyst image with Canny edge detection to quickly determine the boundaries of the objects in the image. Compared to other operators, the Canny operator has the advantage of better detection, especially under noisy conditions. In the third step, PCA was used to reduce the dimensions of the feature vector and remove the less essential features. The resulting feature vector was used as input to the blastocyst quality classification step. Finally, we applied four conventional classifiers to the classification task, SVM, KNN, LR, and gradient boosting, using the Keras library in Python. We optimized the parameters of the conventional classifiers using the grid search method; the list of optimized parameters is shown in Table 2. We evaluated the classification results and compared them with the results obtained using the proposed CNN model.

C. HYPERPARAMETER OPTIMIZATION
Hyperparameter optimization is the process of determining the hyperparameter combination that produces the best-performing model. One way to determine the best combination of hyperparameters is to use a grid search method. Grid search is the strategy most frequently used to optimize hyperparameters [35] because it can be easily parallelized [36], and hyperparameter optimization can improve the model accuracy. Grid search enumerates all combinations of the candidate hyperparameter values supplied for the model, evaluates each combination, and chooses the one with the highest score. This work used grid search on the CNN, SVM, KNN, LR, and gradient boosting models. Scikit-learn was used to perform the grid search, where GridSearchCV performs a search across all parameter sets in the grid. The tuned hyperparameters of the conventional classification models are given in Table 2.
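The exhaustive search described above can be sketched in a few lines of library-free Python. This is only an illustration of the enumeration step, not the authors' implementation: the names `grid_search` and `toy_score` are hypothetical, the grid reuses the filter counts and kernel sizes examined later in the paper, and the scoring function is an arbitrary placeholder standing in for a cross-validated accuracy.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product() yields the Cartesian product of all candidate value lists.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values taken from the text; the grid itself is illustrative.
grid = {"filters": [32, 64, 128, 256], "kernel_size": [3, 5, 7, 9]}


def toy_score(params):
    # Placeholder objective (assumption): NOT a real accuracy measurement.
    return -abs(params["filters"] - 64) / 64 - abs(params["kernel_size"] - 5)


best, score = grid_search(grid, toy_score)
print(best, score)
```

In practice the scoring function would train and validate a model for each combination, which is why the cost grows with the product of the list lengths (16 combinations here).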

D. BLASTOCYST QUALITY STATE CLASSIFICATION
The performance of deep learning is affected by the configuration of the CNN model. In this work, we configure the CNN model to obtain better performance. The CNN is built with multiple two-dimensional (2D) convolutions, maximum pooling, fully connected neurons, and dropout [26]. In each convolution layer, the output size can be calculated according to equation (1) [27]:

$$o = \left\lfloor \frac{i - f + 2p}{s} \right\rfloor + 1 \qquad (1)$$

where o is the output dimension, i is the input dimension of the image, f indicates the size of the filter or kernel in the 2D convolution layer, p is the padding provided as additional data outside the input, and s is the stride, which is a parameter that determines how far the filter shifts. We use a stride value of 1; thus, the filter shifts 1 pixel at a time, horizontally and then vertically. This optimization process uses the following network structure hyperparameters: number of filters per layer, kernel size in each layer, dropout rate and L2 regularization. We use the most common regularization technique, known as L2 regularization, which penalizes the sum of the squared weights. Because our image data are too complex to be modeled accurately, L2 regularization is a better choice, as it still allows the model to learn the patterns inherent in the data. A weight regularizer is added to each layer in the Keras model with a value of 0.01.
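As a quick check of the output-size relation, the following sketch evaluates its standard form, floor((i - f + 2p) / s) + 1, for one spatial dimension. The function name `conv_output_size` is hypothetical; the 224-pixel input and the stride of 1 come from the text, while the padding values are assumptions chosen for illustration.

```python
def conv_output_size(i: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output size along one dimension of a 2D convolution.

    i: input size, f: kernel size, p: padding per side, s: stride.
    """
    if s <= 0:
        raise ValueError("stride must be positive")
    return (i - f + 2 * p) // s + 1


# 224-pixel input, 3x3 kernel, stride 1, no padding (padding is an assumption).
print(conv_output_size(224, 3, p=0, s=1))  # 222
# One pixel of padding per side keeps the size unchanged for a 3x3 kernel.
print(conv_output_size(224, 3, p=1, s=1))  # 224
```

With stride 1 and no padding, each convolution with a kernel of size f shrinks the activation map by f - 1 pixels per dimension, which is one reason larger kernels reduce the spatial resolution faster.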
The details of the hyperparameters of our proposed method are shown in Table 2. There is nothing to learn in the input layer or layer 0; the input image is given and reshaped to dimensions of 224×224. We perform the augmentation process using ImageDataGenerator. The stride value defines the number of pixels by which the kernel shifts as it convolves the blastocyst image. In this model, we choose a stride value of 1, and the convolution produces an output that is usually called an activation map. The smaller the stride value is, the more detailed the information we obtain from the input, but a small stride value requires more computation than a large stride. However, the use of a small stride will not always result in good performance. Producing the activation map resembles a handcrafted feature extraction process. The kernel weights are randomly initialized, and the output of the convolution operation has a separate activation map for each filter. In the first layer, the input is convolved with the kernel filters, leading to a new activation map. The resulting activation map is again convolved with a kernel, and the process is repeated. The second layer is the pooling layer used to reduce the dimensions of the activation map; the operations can use maximum or average values. In this case, we use the maximum value, which is calculated for each activation map patch. This reduces the number of parameters to be learned and the amount of computation performed in the network. The convolution process is continued up to the next-to-last layer in our CNN architecture. The activation map generated from the feature extraction layers is still in the form of a multidimensional array, so we have to flatten or reshape the activation map into a vector to use it as input to the fully connected layer. In the last layer, which is preceded by dropout, the number of output values is one: each image has one output value, namely, its label, 0 or 1.
Because we have a small dataset, to limit overfitting and speed up the learning process, we use dropout with a rate of 0.5. In the CNN model, we use the rectified linear unit (ReLU) activation function with a sigmoid classifier because, for binary image classification, this combination provides better accuracy than other combinations of activation functions and classifiers [28]. We also optimize the network training hyperparameters, such as the optimizer, learning rate, and momentum. Table 3 shows the training hyperparameters to be optimized; in each case, the range of values is shown in square brackets.

E. DATA AUGMENTATION
To obtain optimal performance, deep learning requires more data than other machine learning algorithms. We have only 249 blastocyst images; 164 of these are good-quality blastocyst images, and 85 are poor-quality blastocyst images. This amount of data is insufficient to obtain optimal performance. Therefore, we need to perform a data augmentation process.
Data augmentation creates additional training data, artificially expanding the training set with label-preserving transformations [37]. Data augmentation aims to generate virtual data samples that can be used to improve the training dataset and reduce overfitting. In this study, we add data only to the training dataset and not to the validation or test datasets; this is different from data preparation such as image resizing, which must be applied consistently across all datasets that interact with the model. Using ImageDataGenerator, a class of Keras, we perform random transformations on the training dataset using shear, zoom, rotation and flip augmentation techniques by setting the corresponding parameters. The parameters we declare in ImageDataGenerator are shear_range=0.2, which applies a random shear with an intensity of up to 0.2; zoom_range=0.2, which zooms in and out by up to 20%; rotation_range=45, which rotates the image by up to 45 degrees; and horizontal_flip=True, which randomly flips the image horizontally. After declaring these parameters, we create an iterator that fetches the images and loops over them in batches by streaming the images through the ImageDataGenerator object. To stream images, we use the flow_from_directory method, which takes the directory path and generates batches of augmented data. When the iterator has been created, it can be used to train our CNN model by calling the fit_generator() function.
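To make the idea of a label-preserving transformation concrete, the sketch below applies a horizontal flip to the training samples only. It is a deliberately simplified, library-free stand-in and does not reproduce the behavior of Keras's ImageDataGenerator, which draws random shear, zoom, rotation, and flip transformations on the fly. The function names and the tiny 2×3 "images" are hypothetical.

```python
def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment_training_set(samples):
    """Return the original (image, label) pairs plus one flipped copy each.

    The label is copied unchanged: flipping does not alter the class.
    """
    augmented = list(samples)
    for image, label in samples:
        augmented.append((horizontal_flip(image), label))
    return augmented


# Hypothetical toy data; 1 = good quality, 0 = poor quality.
train = [([[1, 2, 3], [4, 5, 6]], 1), ([[7, 8, 9], [0, 1, 2]], 0)]
train_aug = augment_training_set(train)
print(len(train), len(train_aug))  # 2 4
```

Only the training list is passed through `augment_training_set`; validation and test samples would be left untouched so that the reported accuracy reflects unmodified images.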

F. EVALUATION OF THE PROPOSED CNN MODEL
Evaluating the trained model is essential in developing good deep learning and machine learning models. In this section, we discuss model evaluation specifically with respect to the classification of blastocyst quality. The evaluation is conducted after model training is completed and uses evaluation data that must not be the same as the data used to train the model. Testing with these evaluation data provides the actual accuracy of the trained model. However, accuracy (3) is not the only measure considered in the evaluation because high accuracy values can be deceptive when the dataset is imbalanced [38]. Therefore, more comprehensive evaluation metrics such as the confusion matrix, precision (4), recall (5), and F1-score (6) are needed.
In binary classification, four parameters can be considered in evaluating the prediction results of a model: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). In terms of these parameters, the metrics take their standard forms:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (6)$$

In this work, we compare the accuracy obtained under different settings, i.e., different numbers of filters, kernel sizes, optimizers, and machine learning methods.
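The four confusion-matrix counts are enough to compute all of the metrics named above. The following sketch uses the standard definitions of accuracy, precision, recall, and F1-score; the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
m = classification_metrics(tp=8, fp=2, tn=7, fn=3)
print(m)
```

Because the counts are class-specific, swapping which class is treated as "positive" changes precision, recall, and F1-score but not accuracy, which is why per-class values are usually reported for imbalanced data.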

A. IMPACT OF FILTERS ON ACCURACY
The accuracy of the CNN model can depend on the number of filters (kernels) in each convolution layer. A larger number of filters in each convolution can cause overfitting when the dataset is small. We conducted experiments to determine the impact of the number of filters on the results obtained using our proposed CNN model. To determine the optimum number of filters, we explored and evaluated four filter counts, viz., 32, 64, 128, and 256. We found that the use of 32 filters yielded the highest accuracy (0.759 ± 0.128), as shown in Fig. 3.

B. IMPACT OF KERNEL SIZE ON ACCURACY
The kernel size in the CNN determines the receptive field, that is, the number of input image pixels that an activation in the network can see. Additionally, the use of a small kernel rather than a fully connected network benefits from weight sharing and reduced computational costs. In the experiment with kernel sizes, we determined whether the use of a small or a large kernel size affects the accuracy of blastocyst image classification. Based on the experimental results, we found that a kernel size of 3 gives the highest accuracy (0.743 ± 0.119), followed by kernel sizes of 9 (0.694 ± 0.024), 7 (0.681 ± 0.022), and 5 (0.679 ± 0.047). From these results, it can be concluded that kernel size does not significantly affect the accuracy. Nevertheless, the kernel size affects the time required for training and testing. More detailed information about the distribution of accuracy values with respect to the kernel size is presented in Fig. 4.

C. EVALUATION OF THE CNN ARCHITECTURE
The best model fit can be achieved by using optimization strategies or algorithms called optimizers. The optimization algorithm is responsible for reducing the loss and providing the most accurate results possible. We trained our proposed CNN model with several optimizers, such as adaptive moment estimation (Adam) [39], stochastic gradient descent (SGD) [40] and RMSprop, and evaluated the results. Table 2 shows how the choice of optimizer affects the accuracy of our CNN model. Table 2 shows that the best validation accuracy of 86.00% is achieved using the Adam optimizer with a learning rate of 0.001. We used the early stopping function to obtain the number of epochs. The early stopping function can halt the training of neural networks at the optimal time. The early stopping callback function can monitor the loss or accuracy value. When the loss value is being monitored, the training will stop if the loss value increases. If the accuracy is being monitored, then training is stopped when there is a decrease in the accuracy. The best number of epochs obtained in this way was 36. After obtaining the best CNN model, we tested it on a dataset consisting of 24 blastocyst images. The test results on these testing data yielded an accuracy of 83.33% with a loss of 0.6141.
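The stopping rule for a monitored loss can be sketched as follows. This is a simplified illustration rather than the Keras callback itself: the function name is hypothetical, the `patience` and `min_delta` settings are assumptions (the text does not state the values used), and the loss sequence is made up for the example.

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the monitored loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The loss kept improving (or patience never ran out) until the end.
    return len(val_losses), best_epoch


# Hypothetical validation losses, not values from the experiments.
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61]
print(early_stopping_epoch(losses, patience=2))  # (5, 3)
```

Monitoring accuracy works the same way with the comparison reversed, since a higher accuracy counts as an improvement.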
We evaluated the accuracy of the blastocyst quality selection model and determined the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The resulting AUC in predicting the blastocyst quality in the test dataset is 0.844. Fig. 5 shows the confusion matrix for the test set and the ROC curve with its AUC.
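One way to see what the AUC measures is its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way; the function name, labels, and scores are hypothetical, and the example is not meant to reproduce the reported value of 0.844.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for six test images.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, y_score)
print(round(auc, 3))  # 0.889
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few dozen images; larger sets would normally use a sorting-based computation.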

D. COMPARISON TO THE OTHER MACHINE LEARNING METHODS
This paper evaluated binary classification of blastocyst quality using the proposed CNN, SVM, KNN, LR, and gradient boosting on a dataset of 249 human blastocysts. We analyzed and evaluated the best performance of our CNN model. Fig. 6 shows the variation in the accuracy of each method. The confusion matrix in binary classification compares the predictions of the trained model with the actual conditions in the data. The confusion matrix uses the parameters TP, FP, TN, and FN in its representation.
For validation, we used 50 blastocyst images. Based on the confusion matrix results, we can see that our CNN model produces a maximum accuracy of 84.00%. The SVM method produces a maximum accuracy of 82.00%, the KNN method produces a maximum accuracy of 74.00%, LR has a top accuracy of 82.00%, and gradient boosting has a maximum accuracy of 64.00%. The CNN model has the highest accuracy, and gradient boosting produces the lowest accuracy.
Testing of all methods on the blastocyst test images showed, based on the confusion matrix, that the models did not perform well enough in classifying "poor"-quality blastocyst images. This problem is due to the very small amount of input data and to the imbalanced dataset, in which the number of "good"-quality blastocyst images is far greater than the number of "poor"-quality blastocyst images.

E. MODEL PERFORMANCE MEASUREMENT
Based on the confusion matrix, we can calculate the recall, precision, and F1-scores. CNN has the best recall value for classifying good-quality and poor-quality blastocysts, with average values of 0.89 for good-quality blastocysts and 0.76 for poor-quality blastocysts. In second place is SVM with a recall value of 0.97 for good-quality blastocysts and 0.51 for poor-quality blastocysts. The recall scores and the F1-score of LR outperform those of other techniques. The KNN method has the lowest precision, with a value between 0.50 and 0.66. Based on the recall value, the gradient boosting classifier obtains the lowest score, and KNN has the lowest F1-score. In addition to evaluating the model through the accuracy, the precision, recall, and F1-scores can show the classification performance, as shown in Fig. 8, which shows that all of the classifier machines offer different performances.

F. STATISTICAL SIGNIFICANCE TEST
To measure the statistical significance of our results, we used the McNemar test, which is still widely used. The results in Table 6 include the p-value of the CNN classification model, which rejected H0, indicating that the model has a different error proportion. The other classification models (SVM, KNN, LR, and gradient boosting) produce the same proportion of errors, i.e., they fail to reject H0. Table 6 presents the p-values and T-statistics obtained using McNemar's test to test whether the performance of the CNN model significantly differs from that of the other classification models.
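For reference, McNemar's test compares two classifiers on the same samples using only the discordant pairs: b, the number of samples the first model classifies correctly and the second does not, and c, the reverse. A minimal sketch of the chi-squared form follows. Whether the authors applied the continuity correction is not stated, so it is exposed here as an optional flag; the function name and the counts are hypothetical.

```python
import math


def mcnemar_test(b: int, c: int, correction: bool = True):
    """McNemar's chi-squared statistic and p-value (1 degree of freedom).

    b: samples model A gets right and model B gets wrong.
    c: samples model A gets wrong and model B gets right.
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    diff = abs(b - c) - (1.0 if correction else 0.0)
    diff = max(diff, 0.0)
    stat = diff * diff / (b + c)
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical discordant counts, not values from Table 6.
stat, p = mcnemar_test(b=10, c=2)
print(stat, p)
```

H0 states that the two models have the same error proportion; it is rejected when the p-value falls below the chosen significance level. For very small discordant counts, an exact binomial version of the test is often preferred over the chi-squared approximation.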

IV. DISCUSSION
In this study, we optimized a CNN to assess the quality of blastocysts produced during the IVF process. The optimization procedure was conducted using a grid search due to its good performance in general and its ease of implementation. We also propose using weight decay and dropout regularization techniques to reduce overfitting. The input data for this research are raw images of human blastocysts from HMC microscopy. We compared our CNN model with four conventional classifiers using performance measures such as the accuracy, precision, recall, and F1-score. The essential part of a CNN is the convolution layer, which applies a set of filters. In general, the use of a large number of filters affects the accuracy. Our experimental results show a significant difference in the results obtained using different numbers of filters. The use of 32 filters has the greatest influence on the accuracy compared to the use of 64, 128 or 256 filters. Although this research contributes to knowledge, particularly for small blastocyst image datasets, it has some limitations. One limitation of the present study is that the blastocyst images produced by the inverted microscope are affected by artifacts and noise; another limitation is the imbalanced number of blastocyst images in the "good" and "poor" quality classes. These limitations prevent the optimized CNN from providing satisfactory results, and the class imbalance means that the model sees too few examples of the minority class. The classification performance can be further improved by building models based on larger datasets and using techniques to address imbalanced classes in medical images [41]. A recent study [42] extends research on blastocysts by combining a convolutional neural network applied to blastocyst images with an elemental layer for maternal age. With an accuracy of 75%, this study shows potential in determining the probability of live birth. Vaidya et al. [43] used a combination of CNN and LSTM models to automatically assess embryos in time-lapse images. This study obtained a validation accuracy of 100% but did not report a test accuracy. Based on this recent research, our proposed CNN model has the potential to be improved for predicting the probability of a live birth, whether or not time-lapse images are used.

V. CONCLUSION
Our study examined the performance of the CNN model in assessing human blastocyst quality in the case of small datasets. The optimization process yielded good results: the highest validation accuracy was 84.00%, and the test accuracy was 83.33%, with an AUC value of 0.844. Based on the accuracy and the AUC results, the classification of blastocyst quality described in this paper has excellent potential to assist embryologists in assessing and selecting the best-quality blastocysts. McNemar's statistical significance test supports that the CNN model performs better in prediction than the other classifiers. The confusion matrix results show that the accuracy of the proposed model enables classification of blastocysts as "good" or "poor" quality.
In future work, we will explore grid search methods and expand the boundaries of the CNN hyperparameter space used for optimization [44]. Finally, we would also like to extend our case study to time-lapse images of blastocysts. More test subjects are needed to ensure that the results obtained are statistically significant and that the proposed approach can be applied as a general tool for assessing blastocyst quality.