A Novel Approach for Polycystic Ovary Syndrome Prediction Using Machine Learning in Bioinformatics

Polycystic ovary syndrome (PCOS) is a critical disorder affecting women during their reproductive years. PCOS is commonly caused by excess male hormone (androgen) levels. Follicles, the collections of fluid developed by the ovaries, may fail to release eggs regularly. PCOS can result in miscarriage, infertility, and complications during pregnancy. According to a recent report, PCOS is diagnosed in 31.3% of women in Asia. Studies show that 69% to 70% of affected women have not availed of detection and treatment facilities for PCOS. Research is therefore needed to save women from critical complications by identifying PCOS early. The main aim of our research is to predict PCOS using advanced machine learning techniques. A dataset based on the clinical and physical parameters of women is utilized for building the study models. A novel feature selection approach is proposed based on an optimized chi-squared (CS-PCOS) mechanism. Ten hyper-parametrized machine learning models are applied in comparison. Using the novel CS-PCOS approach, the gaussian naive bayes (GNB) model outperformed the other machine learning models and state-of-the-art studies, achieving 100% accuracy, precision, recall, and f1-score with a minimal computation time of 0.002 seconds. The k-fold cross-validation of GNB also achieved a 100% accuracy score. The proposed GNB model achieved accurate results for critical PCOS prediction. Our study reveals that the dataset features prolactin (PRL), blood pressure systolic, blood pressure diastolic, thyroid stimulating hormone (TSH), respiratory rate (RR-breaths), and pregnancy are the prominent factors with high involvement in PCOS prediction. Our research helps the medical community reduce the miscarriage rate and provide timely care to women through the early detection of PCOS.

Ten folds of research data are used during the k-fold analysis. The machine learning models are generalized and give accurate performance scores for unseen test data.

The remainder of the research study is organized as follows: Section II is based on the related literature analysis of PCOS. Our research methodology analysis is conducted in Section III. The employed machine learning models for PCOS prediction are examined in Section IV. The scientific results validation and evaluations of our research approaches are analyzed in Section V. The research study's concluding remarks are described in Section VI.

The literature related to our proposed research study is examined in this section. The past applied state-of-the-art studies for PCOS prediction are analyzed, and the related research findings and proposed techniques are examined.

One of the most common health problems [19] in young women is PCOS. PCOS is a complicated health dilemma distressing women of childbearing age, which can be identified based on different medical indicators and signs. Accurate identification and detection of PCOS is the essential baseline for appropriate treatment. For this purpose, researchers applied different machine learning approaches such as SVM, random forest, CART, logistic regression, and naive bayes classification to identify PCOS patients. After comparing the results, the random forest algorithm gave the highest performance, with 96% accuracy in PCOS diagnostics on the given dataset [20].

Machine learning algorithms were implemented on a dataset of 541 patients, of whom 177 have PCOS. The dataset consists of 43 features. As all features did not have equal importance, researchers used a feature selection model, called the univariate feature selection model, to rank them according to their value.
This model was implemented to obtain the ten high-ranked features that can be used to predict PCOS. After splitting the dataset into train and test portions, different algorithms were implemented: gradient boosting classifiers [21], logistic regression classifiers, random forest classifiers, and RFLR, an abbreviation of random forest and logistic regression combined. As a result, the proposed RFLR algorithm achieved a 90.01% accuracy score in classifying the PCOS patients with the ten highly ranked features [22].

A new technique was proposed for the early detection and identification of PCOS in 2021. The proposed model was based on XGBRF and CatBoost. After preprocessing the data, the top 10 attributes were selected by the univariate feature selection method. The classifiers implemented to compare the accuracy results were MLP, decision tree, SVM, HRFLR, random forest, logistic regression, and gradient boosting. Results showed that XGBRF performed with an 89% accuracy score while CatBoost outperformed it with a 95% accuracy score. The accuracy scores of the other classifiers lay between 76% and 85% [3]. The ABC-based modified metaheuristics optimization technique was applied for the classification task.

The identification of PCOS using novel immune infiltration and candidate biomarkers was proposed in this study [30]. The proposed approach was based on the machine learning logistic regression and support vector machine models.

Five datasets were utilized for training and testing the models. The proposed model achieved a 91% accuracy score for PCOS identification. The study contributes a novel framework for analysis. A mutational landscape screening-based analysis of modified PCOS-related genes was proposed in this study [31]. The 27 nsSNPs of the PCOS-related gene data were selected for analysis.

Our research study uses the PCOS-related clinical and physical features dataset for machine learning model building. The dataset feature engineering is done by using the novel proposed CS-PCOS approach. The PCOS exploratory data analysis (PEDA) is applied to figure out the data patterns and factors that are the primary cause of PCOS disease. The dataset is fully preprocessed during feature engineering. The preprocessed dataset is split into two portions, train and test. The split ratio used is 80% for training and 20% for the models' evaluation on unseen test data. The hyper-parametrized model is completely trained and tested. The proposed model is then ready to predict PCOS disease in deployment. The research methodology workflow is examined in Figure 1.

The feature engineering techniques are applied to transform the dataset features into the best fit for a predictive model with high accuracy. A novel CS-PCOS feature selection approach is proposed based on the optimized chi-squared mechanism. The operational flow of feature selection by the CS-PCOS approach is visualized in Figure 2.

No PCOS occurs when the TSH(mmHg) value is less than 50 and the Bp_Systolic value is above 80. Figure 5(b) demonstrates that when the TSH(mmHg) value is above 50 and the Bp_Systolic value is less than 80, PCOS occurs.
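The chi-squared selection step described above can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data, not the exact CS-PCOS pipeline; the choice of k=10 features is an assumption for the example.

```python
# Minimal sketch of chi-squared feature ranking with scikit-learn.
# Synthetic data stands in for the clinical/physical PCOS features;
# k=10 is an illustrative assumption, not the paper's configuration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# chi2 requires non-negative inputs, so scale features to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X_scaled, y)

print(X_selected.shape)  # (200, 10)
ranking = np.argsort(selector.scores_)[::-1]  # features ordered by chi2 score
```

The chi-squared score measures the dependence between each non-negative feature and the class label, so the top-ranked features are those most associated with the PCOS target.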

The lmplot is drawn on the dataset's high-value features to represent the PCOS regression described in Figure 6.

The lmplot is a two-dimensional plot that combines regplot and FacetGrid. The FacetGrid class helps visualize the distribution of one variable and the relationship between multiple variables separately within subsets of a dataset using numerous panels. The lmplot is more computationally intensive and is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
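A minimal sketch of such an lmplot, assuming column names that mirror the dataset fields discussed in the text (Hip(inch), Waist(inch), PCOS (Y/N)) and synthetic values in place of the real data:

```python
# Illustrative seaborn lmplot: one regression fit per hue subset.
# The column names and values are assumptions for this sketch.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Hip(inch)": rng.normal(40, 4, 120),
    "Waist(inch)": rng.normal(34, 3, 120),
    "PCOS (Y/N)": rng.integers(0, 2, 120),
})

# lmplot combines regplot and FacetGrid, fitting a separate
# regression line for each PCOS class.
grid = sns.lmplot(data=df, x="Waist(inch)", y="Hip(inch)", hue="PCOS (Y/N)")
grid.savefig("pcos_lmplot.png")
```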

In Figure 6(A), an lmplot is drawn between Hip(inch) and Waist(inch) to visualize the PCOS regression. As the waist and hip sizes increase, the chance of PCOS increases. In Figure 6(B), the Waist:Hip Ratio and Hb(g/dl) subset is used to analyze the PCOS regression. When the value of Hb(g/dl) is greater than 14 or less than 9, there is more chance of PCOS.

The histogram is plotted to analyze the frequency distribution of the dataset features.

The data splitting is applied to prevent model overfitting and to evaluate the trained model on the unseen test portion of the dataset. The PCOS dataset is split into two portions for training and testing the employed machine learning models. An 80:20 ratio is used for dataset splitting: 80% of the dataset is used for model training, and 20% is used for the employed models' results evaluation on unseen data. Our research models are trained and evaluated with high accuracy results.
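The 80:20 split described above can be sketched with scikit-learn's train_test_split; synthetic data stands in for the PCOS dataset here.

```python
# Sketch of the 80:20 train/test split; synthetic data replaces
# the PCOS dataset for this self-contained example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# stratify=y keeps the PCOS/non-PCOS class ratio equal in both portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1
)
print(len(X_train), len(X_test))  # 400 100
```

Fixing random_state makes the split reproducible, so the reported test-set results can be regenerated exactly.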

The employed machine learning techniques are examined for PCOS prediction in this section. The working mechanism of each technique is described below.

The random forest (RF) [36] is a supervised classification model that creates a forest of multiple decision trees. The decision trees are created randomly based on the data samples. Decision nodes represent the features, and tree leaf nodes represent the target output. The majority voting prediction of the decision trees is selected as the final prediction. The gini index and entropy are used for data splitting in tree nodes, as expressed in equations 3 and 4.

The bayesian ridge (BR) [37] algorithm uses probability computations for the classification task. The BR model is suitable for real-world problems where the data is insufficient and poorly distributed. The BR model formulates a linear regression model by using probability distributions.

VOLUME 10, 2022

The k-neighbors classifier (KNC) [39] is one of the simplest supervised classification models. The logistic regression (LOR) [42], [43] is a supervised machine learning model for binary classification.

The hyperparameter tuning [49] of our research models is analyzed in Table 3. The analysis demonstrates the parameters utilized to achieve the high performance metrics scores.

The precision score of a learning model is also known as the positive predictive value. The precision is measured by the proportion of positively predicted labels that are actually positive. Precision, in general, calculates the employed model's accuracy in predicting a data sample as positive. The precision score of our proposed model is 100%. The mathematical notation to express the precision score is as follows:

Precision = TP / (TP + FP)

The recall score of the employed models is the measure of how many of the TP were recalled (found) correctly. The recall is also called the sensitivity of a learning model. The recall score of our proposed model is 100%.
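As a sketch of the model-building step, the following trains a gaussian naive bayes classifier, the study's top-performing model, on synthetic stand-in data rather than the actual PCOS dataset.

```python
# Hedged sketch: training and scoring a Gaussian naive Bayes
# classifier; synthetic data stands in for the PCOS dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7
)

# GNB fits one Gaussian likelihood per feature per class, which is
# why training completes in milliseconds.
gnb = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, gnb.predict(X_test))
print(round(acc, 3))
```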
The mathematical notation to define recall is as follows:

Recall = TP / (TP + FN)

The f1 score is the statistical measure that sums up a predictive model's performance by combining the precision and recall values. The f1 measure is the harmonic mean of recall and precision. The f1 score of our proposed model is 100%. The mathematical equation to calculate the f1 score is expressed as:

F1 = 2 x (Precision x Recall) / (Precision + Recall)

The comparative performance metrics analysis of the applied learning models is conducted in Table 4. The time complexity computations and performance metrics results are calculated without using our proposed approach. The analysis demonstrates that all applied learning models achieved average scores in predicting PCOS. From the analysis and Figure 8, the highest accuracy, precision, recall, and f1 score is 89%, achieved by the RF and GBC techniques. The minimum accuracy score is 70%, the precision score is 68%, the recall score is 70%, and the f1 score is 68%, achieved by the KNC technique. The time complexity analysis shows that KNC has the lowest training time, 0.002 seconds; however, it also has low performance metrics scores.
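The three metrics can be computed directly from the confusion counts, following the standard definitions of precision, recall, and f1; the TP/FP/FN counts below are illustrative, not taken from the study.

```python
# Precision, recall, and F1 from illustrative confusion counts.
tp, fp, fn = 40, 2, 3  # example true positives, false positives, false negatives

precision = tp / (tp + fp)          # positive predictive value
recall = tp / (tp + fn)             # sensitivity
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
```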

The performance metrics comparative analysis of the applied learning models is conducted in Table 5. The performance metrics results and time complexity computations are calculated using our proposed approach. The analysis demonstrates that all applied learning models achieved the highest performance metrics scores in predicting PCOS.

FIGURE 8. The accuracy scores comparative evaluation of employed machine learning models for unseen test data without using the proposed technique.

From the analysis and Figure 9, the highest accuracy, precision, recall, and f1 score is 100%, achieved by the LIR, RF, BR, SVM, LOR, GNB, and GBC techniques. The minimum accuracy score is 56%, the precision score is 53%, the recall score is 56%, and the f1 score is 54%, achieved by the KNC technique.

The classification report analysis by individual target class for each employed learning model is examined in Table 6.
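The comparative evaluation in Tables 4 and 5 can be sketched as a loop that trains several of the named classifiers on one split and records accuracy and fit time. The example below uses a subset of the listed models on synthetic data; it illustrates the procedure, not the paper's reported results.

```python
# Sketch of the comparative evaluation loop: train several of the
# named classifiers on one split, recording accuracy and fit time.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

models = {
    "RF": RandomForestClassifier(random_state=3),
    "GBC": GradientBoostingClassifier(random_state=3),
    "GNB": GaussianNB(),
    "KNC": KNeighborsClassifier(),
}
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    results[name] = (accuracy_score(y_te, model.predict(X_te)), elapsed)
    print(f"{name}: acc={results[name][0]:.2f}, fit={elapsed:.3f}s")
```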

The classification report values are calculated for the models using the proposed approach. The analysis demonstrates that the KNC and SGD models have low accuracy scores in the class-wise metrics evaluations. The outperforming GNB model achieved 100% scores in the classification report analysis.

To validate the overfitting of the employed machine learning models, we have applied the k-fold cross-validation technique, as analyzed in Table 7. Ten folds of the dataset are used for validation. The analysis demonstrates that the techniques that achieved 100% scores using our proposed approach also achieved 100% accuracy using the k-fold technique. Figure 10 shows the comparative accuracy analysis of the employed models using k-fold validation. The visualized analysis demonstrates that the MLP model achieved 99% accuracy, and 98% accuracy using k-fold. The SGD and KNC models achieve the lowest accuracy scores in this analysis. In conclusion, all employed models are validated using the k-fold technique. The k-fold analysis demonstrates that our employed machine learning models are not overfitted; the models are generalized and give accurate results on unseen test data.

The confusion matrix analysis is conducted to validate our performance metrics scores, as analyzed in Figure 11. The analyzed confusion matrix is for the outperforming GNB model.

The proposed model's overfitting is validated using a ten-fold cross-validation technique. Our research study concludes that the dataset features prolactin (PRL), blood pressure systolic, blood pressure diastolic, thyroid stimulating hormone (TSH), respiratory rate (RR-breaths), and pregnancy are the most prominent factors with high involvement in PCOS prediction. As for the study's limitations and future work, we will enhance the dataset by collecting more data on PCOS-related patients and applying data balancing techniques.
Also, deep learning-based models will be applied for PCOS prediction.