Mining and Interpretation of Critical Aspects of Infant Health Status Using Multi-Objective Evolutionary Feature Selection Approaches

The infant mortality rate (IMR), the rate of deaths among children under one year of age, is a sensitive marker of a community's overall physical health. Protecting the lives of newborns has become a challenging issue in public health, development programs, and humanitarian initiatives. Almost 10.1% of infants died in the United States of America (USA) in 2021. Therefore, this paper aims to extract and understand the various influential factors causing infant deaths in the USA. A crowding distance-based multi-objective ant lion optimization (MOALO-CD) is proposed here, with statistical evidence, for feature selection. The proposed technique is compared with competitive approaches such as a multi-objective genetic algorithm based on crowding distance (MOGA-CD), multi-objective filter approaches, and recursive feature elimination. Various machine learning classifiers are applied to the feature subset selected by MOALO-CD on the USA's infant dataset. Extensive experimental results indicate that the proposed model outperforms the existing metaheuristic approaches in terms of Generational Distance, Inverted Generational Distance, Spread, and Hypervolume. The comparative analysis of various machine learning models also reveals that random forest achieves significantly better performance on the feature subset obtained from MOALO-CD.


I. INTRODUCTION
Infants are human offspring from birth to one year of age, and deaths of children under the age of one year are called infant deaths. The infant mortality rate is expressed as the number of such deaths per 1,000 live births. Infant mortality can be categorised into three classes: perinatal mortality, neonatal mortality, and post-neonatal mortality. Perinatal mortality covers fetal deaths from 22 weeks of gestation up to birth and deaths within one week after birth. Neonatal deaths occur within 28 days after birth, and post-neonatal deaths occur from 29 days to one year after birth. Malnutrition, respiratory disease, maternity complications, sudden unexpected infant death, and home complications are the main causes of post-neonatal mortality. India recorded an IMR of 44/1000 live births¹ in 2016. In 2017, 802,000 infant deaths occurred due to a lack of access to water, sanitation, appropriate nutrition, and necessary health services. The US infant mortality rate is 5.8² which, in the face of extreme poverty, poor health outcomes, and a fragmented healthcare system, is substantially higher than that of other comparable countries. Further, there are significant variations in infant death rates between racial groups, with the most striking disparities between babies born to black women and white women. In the past 50 years, substantial progress has been made in reducing infant deaths in the USA, but more needs to be done.
Infant mortality relies heavily on factors such as the mother's health, safety, and access to healthcare facilities, social and economic circumstances, and public healthcare policies. It is therefore a significant measure of a country's well-being. Mani et al. [1] applied data mining techniques to routinely collected data from 299 infants admitted to the Monroe Carell Jr. Children's Hospital at Vanderbilt. The data were tested with nine ML-based models for the early diagnosis of late-onset sepsis. Wilson et al. [2] focused on identifying a complex Bayesian network of causes that contribute to neonatal mortality, including preterm birth, maternal death, movement, and breathing at birth. The authors also aimed to estimate causal effects on neonatal mortality from the Global Network Maternal and Child Health Registry using logistic regression models. Rittenhouse et al. [3] focused on the development of algorithms that integrate maternal factors linked to SGA (small for gestational age). The data used in that research were assembled from an ongoing obstetric cohort with early pregnancy ultrasounds for the evaluation of gestational age (GA) in Lusaka, Zambia. The study identified a combination of six parameters, such as baby weight, twin birth, maternal height, labor hypertension, and HIV serostatus, that correctly classified more than 94% of newborns and reached an AUC of 0.9796. Saravanou et al. [4] investigated how various socioeconomic factors, such as race or the mother's education, can improve the prediction of infant mortality. They used a highly imbalanced dataset with complete birth certificate information from 2000-2002 in the USA to solve a binary infant mortality classification problem. Kefi et al. [5] applied ML methodologies to predict short-term mortality in the NICU so that doctors can make better clinical decisions. The data in that analysis were obtained from the Medical Information Mart for Intensive Care III (MIMIC-III), which was developed by the MIT Laboratory for Computational Physiology. Unfortunately, the prognosis of infant health using features selected by evolutionary techniques is rarely addressed.
The number of infants in the USA born prematurely and with low birth weight continues to rise, and as infants born too early or too small have higher mortality rates, this ongoing increase has a major impact on infant mortality. Moreover, substantial disparities in infant mortality rates between racial and ethnic groups continue to widen, and not all groups have benefited equally from social and health gains. Therefore, in this work, we analyze the U.S. Territories Birth Data (2018) (https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm) recorded by the Centers for Disease Control (CDC) and the National Center for Health Statistics (NCHS) to mine sensitive causes of infant death, as we found from the literature that there is no or very limited feature selection (FS) work on this dataset. In particular, FS based on evolutionary multi-objective techniques has not been explored on this dataset. This is the motivation behind the selection of the U.S. Territories Birth Data to extract the critical factors related to infant health.
Mothers, infants, and children are critical public health priorities in the United States. Predicting public health concerns for families, communities, and the healthcare system depends on how well children are doing now. The main objective is to mine the most influential factors for predicting whether an infant will survive or not, which requires removing unimportant factors from the dataset. The field of metaheuristic algorithms has grown significantly in the last two decades as an alternative for solving real-world optimization problems. Metaheuristics perform well in circumstances where exact optimization methods fail to produce a good result, and they can provide high-quality solutions to challenging optimization problems in much less time than conventional optimizers. Metaheuristics have been applied in a variety of areas, including finance, planning, scheduling, and engineering design. In recent years, these population-based strategies have been used as wrappers for FS tasks [6] and have proven their strengths. Various population-based approaches have been used to solve FS problems, including several recent algorithms such as MOABC [7] and many variants of PSO [8,9]. This motivated us to propose a crowding distance-based MOALO (MOALO-CD) for selecting important features from the infant dataset.
The major contributions of this paper are:
1) A crowding distance-based multi-objective ant lion optimization (MOALO-CD) is proposed for the feature selection (FS) task.
2) The concept of crowding distance is used to maintain diversity and to select the most unique feature subset for mining the factors affecting infant health status.
3) Various multi-objective performance metrics are used to evaluate the quality of the Pareto fronts obtained from the proposed model.
4) To evaluate the role of the selected features in classifying infants as survived or not, various state-of-the-art classifiers are trained on the infant dataset using only the selected features.
The remainder of this paper is organized as follows: Section II discusses the materials and methods. Section III demonstrates the performance analysis. Section IV presents the discussion of the obtained results. Section V concludes the proposed work and also presents future work.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

II. MATERIALS AND METHODS
The suggested system is intended to predict infant mortality by using the most influential characteristics of the 2018 birth data file for the US Territories. The architecture of the proposed model is given in FIGURE. 1. This section is split into six subsections. First, the details of the 2018 birth data file for the US Territories are explained. The second subsection introduces the notion of multi-objective optimization. The third subsection explains the proposed feature selection methodology. A short description of the FS methods used for comparison purposes is presented in the fourth subsection. The fifth subsection describes the basic classifiers used to decide whether or not the infant survives, depending on the selected features. Finally, the setups for conducting all the experiments are specified.

A. DATASET
The 2018 birth data file of the US Territories recorded by the CDC and the NCHS is the dataset used in this research. There are a total of 25,919 records, each having a length of 1330 including filler. The dataset covers four groups of characteristics: i) Demographic Characteristics, ii) Medical and Public Services Utilization, iii) Maternal Behavior and Health Characteristics, and iv) Infant Health Characteristics. After removing fillers, recode variables, flag variables, and some irrelevant columns such as Paternity_Acknowledged, Mother's_Marital_Status_Imputed, etc., 99 relevant columns, including a class column, are retained for further research. The class column "Infant_living" has three values: Yes, No, and Unknown. We removed the records having an unknown Infant_living value, leaving a preprocessed dataset of 25868 rows and 99 columns. The number of records with Infant_living = Yes (25799) is far greater than the number of records with Infant_living = No (69), so the dataset is highly imbalanced. In such cases, a classification model built with traditional learning algorithms can be misleading, because ML algorithms are usually designed to reduce the overall error (https://www.analyticsvidhya.com/blog/2017/03/imbalanced-data-classification/) and therefore do not take the class distribution into account. There are several approaches [10] for handling imbalanced datasets, such as over-sampling, under-sampling, bagging-based methods, etc. We have used SMOTE (Synthetic Minority Oversampling Technique) [11], a very popular solution to the imbalance problem, to balance the dataset, and the number of samples after SMOTE is 51598. SMOTE balances the class distribution by generating synthetic instances of the minority class rather than simply replicating existing ones.
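As a minimal sketch of the interpolation idea behind SMOTE, the following pure-Python function creates synthetic minority samples between a minority instance and one of its nearest minority neighbours. The function name `smote`, the parameters `n_per_sample`, `k`, and `seed`, and the default k = 5 are assumptions made for illustration; this is not the implementation used in the experiments.

```python
import random


def _euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def smote(minority, n_per_sample, k=5, seed=0):
    """Return synthetic minority samples built by linear interpolation.

    minority: list of numeric feature vectors (needs at least 2 rows).
    n_per_sample: synthetic samples generated per minority instance (assumed name).
    k: number of nearest minority neighbours considered (assumed default).
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x, excluding x itself
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: _euclidean(x, m))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # interpolation weight in [0, 1)
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Each synthetic point lies on the segment joining a minority instance and one of its neighbours, so no new sample falls outside the region already spanned by the minority class.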
SMOTE synthesizes new minority instances, as shown in FIGURE. 2, between existing minority instances. It produces virtual training records for the minority class by linear interpolation: each synthetic record is generated from a minority instance and one or more randomly chosen members of its k nearest minority-class neighbours. After the oversampling procedure, any classifier may be used to process the data.
Algorithm: SMOTE
for each x ∈ A do
    Calculate the k nearest neighbors of x by measuring the Euclidean distance between x and every other instance in set A.
end for
Fix N (the sampling rate as per the imbalance ratio).
for each x ∈ A do
    Create a set A1 by choosing N instances (i.e. x1, x2, ..., xN) randomly from its k nearest neighbors.
    for each example xk ∈ A1 (k = 1, 2, ..., N) do
        Generate a new example by linear interpolation between x and xk.
    end for
end for

B. MULTI-OBJECTIVE OPTIMIZATION (MOO)
MOO [12] means that more than one objective must be optimized concurrently in a multiple-criteria decision-making environment. For a nontrivial multi-objective problem, no single solution optimizes every objective simultaneously, because the objective functions conflict. So a set of non-dominated solutions is obtained as the outcome. There are two approaches to solving MOO problems. 1) Several non-dominated solutions are discovered first, and higher-level information is then used to choose among them. 2) A composite function is formulated using a chosen preference vector, and any single-objective optimization algorithm is then applied to the composite function to find the best solution.
The definition of Pareto dominance is used as follows when finding optimal solutions in multi-objective optimization, so that all objectives are taken into consideration at the same time. A solution c1 dominates another solution c2 if and only if:
• c1 is not worse than c2 in every objective.
• c1 is strictly better than c2 in at least one objective.
Pareto optimal solutions are those not dominated by any other member of the search space. Suppose the hypothetical problem given in FIGURE. 3 has two objectives, f1 and f2, both to be minimized. Solution P has a small f1 but a large f2, while solution Q has a large f1 and a small f2. When all objectives must be minimized, we cannot conclude that solution P is better than solution Q or vice versa. So the two solutions, P and Q, are Pareto optimal or non-dominated solutions because neither of them is dominated by any other solution. A Pareto front is obtained by joining all the Pareto optimal solutions in objective space.
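The dominance test above can be sketched in a few lines of Python. The helper names `dominates` and `non_dominated` are chosen here for illustration, and both objectives are assumed to be minimized, as in the P and Q example.

```python
def dominates(c1, c2):
    """True if c1 Pareto-dominates c2 when every objective is minimized."""
    no_worse = all(a <= b for a, b in zip(c1, c2))
    strictly_better = any(a < b for a, b in zip(c1, c2))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, with P = (1, 5), Q = (5, 1), and a third point R = (6, 6), neither P nor Q dominates the other, while both dominate R, so the non-dominated set is {P, Q}.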

C. PROPOSED METHOD FOR FEATURE SELECTION (FS)
Mirjalili proposed the ant lion optimizer (ALO) [13], which is based on the hunting behaviour of ant lions in nature. Ant lions hunt as larvae: the larva excavates a cone-shaped pit in the sand by moving along a circular path. The larva hides at the bottom and waits until ants are trapped in the pit. While the prey is in the pit, the ant lion throws sand towards the edge of the pit so that the prey slides down. The victim is pulled under the sand, where it is then devoured. The ant lion then throws the remnants out of the pit and repairs it to continue hunting.
In this study, a binary version of the ant lion optimization [14] is used because in the FS problem the solutions are restricted to {0,1}. ALO is well known for avoiding local optima and for its balance of exploration, exploitation, and convergence. Its adaptive boundary shrinking mechanism and elitism lead to fast exploitation and convergence, while the random walk and roulette wheel selection maintain population diversity. The multi-objective version of ALO is used here to find the best feature subsets of the dataset, with as few attributes and as high a classification accuracy as possible. ALO has proved itself an excellent optimizer in several problems, including Optimal Power Flow, Load Dispatch, Feature Selection, Image Processing, Computer Vision, Neural Networks, and Medical Applications. ALO offers substantial benefits, including easy conception, simplicity of implementation, and a small number of controlling variables [15]. This motivated us to explore a multi-objective variant of ALO in the area of FS tasks. The working principle of MOALO-CD includes the following functions:

1) Encoding
Each ant and ant lion location is encoded as an L-bit pattern, where L is the number of features present in the original dataset. The presence or absence of a particular feature in the candidate solution is determined by the bit at its position (1: present and 0: absent). The pictorial representation of the encoding scheme is shown in FIGURE 1.

2) Fitness Evaluation
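Each ant and ant lion is evaluated by using two fitness criteria/objectives: the number of selected features (Fitness1) and the accuracy of the wrapper classifier (Fitness2).
$$\mathrm{Fitness1}(X) = \sum_{i=1}^{L} x_i$$
where $X = (x_1, x_2, \ldots, x_L)$ is a pattern of L bit length.
$$\mathrm{Fitness2}(X) = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, FN, TN, and FP denote true positive, false negative, true negative, and false positive, respectively.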
To determine Fitness2 for each ant and ant lion, the following steps are adopted:
• Initially, a reduced dataset is created by extracting the features whose cells in the position vector contain 1.
• Fitness2 is then determined using the KNN algorithm with a 10-fold CV. The samples are randomly separated into ten batches for training and testing.
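The two fitness computations can be illustrated with a small, dependency-free sketch. The function names, the neighbour count k = 5, the fold assignment, and the score of 0.0 for an empty feature subset are all assumptions made here; the paper does not specify them, and a library classifier would normally replace the hand-written KNN.

```python
import random


def fitness1(mask):
    """Number of selected features (to be minimized)."""
    return sum(mask)


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def fitness2(mask, X, y, k=5, folds=10, seed=0):
    """Mean cross-validated KNN accuracy on the columns where mask == 1.

    Requires at least `folds` rows so that no test fold is empty.
    """
    cols = [j for j, bit in enumerate(mask) if bit == 1]
    if not cols:
        return 0.0  # assumption: an empty subset receives the worst score
    Xr = [[row[j] for j in cols] for row in X]  # reduced dataset
    idx = list(range(len(Xr)))
    random.Random(seed).shuffle(idx)
    accuracies = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X = [Xr[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(knn_predict(train_X, train_y, Xr[i], k) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds
```

A candidate bit pattern is then scored as the pair (fitness1(mask), fitness2(mask, X, y)), the first to be minimized and the second to be maximized.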

3) Repository Maintenance
The archive is updated with the best ant lions after every population change. An ant lion can enter the repository if it is not dominated by the ant lions already present in the archive, and archive members that are dominated by the newcomer are removed. The external repository may overflow, as its size is fixed by the user. If overflow occurs, the members are sorted in decreasing order [16] of their CD values, and one of the most crowded members [17], chosen from the bottom 5% of the sorted archive, is removed.
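A minimal sketch of archive maintenance with crowding distance is given below. It assumes that both objectives are expressed as minimization (for example, number of features and 1 - accuracy) and, as a simplification of the bottom-5% rule described above, it always removes the single most crowded member on overflow. The function names are illustrative.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def crowding_distance(front):
    """Crowding distance of each objective vector in `front`."""
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions get infinite distance so they are never removed first
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            dist[order[pos]] += gap / (hi - lo)
    return dist


def update_archive(archive, candidate, max_size):
    """Insert a non-dominated candidate; on overflow drop the most crowded member."""
    if candidate in archive or any(dominates(a, candidate) for a in archive):
        return archive
    archive = [a for a in archive if not dominates(candidate, a)] + [candidate]
    while len(archive) > max_size:
        cd = crowding_distance(archive)
        archive.pop(cd.index(min(cd)))  # smallest CD = most crowded region
    return archive
```

Members with a small crowding distance sit in dense regions of the front, so removing them first preserves the spread of the archive.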

4) Position Update of Ants
To carry on with the next iteration, each ant selects an ant lion by roulette wheel selection and mutates it using equation 4, giving RW1 [18].
In equation 4, r1 and r2 are two random numbers generated in [0, 1], $x^{d}_{out}$ is the value at dimension d of the output vector from mutation, $x^{d}_{in}$ is the value at dimension d of the input vector for mutation (shown in FIGURE 4), and r is the mutation rate computed by equation 5.
In equation 5, i is the current generation and GEN is the total number of generations. Next, the best solution (ant lion) from the repository is picked by the roulette wheel, and mutation using equation 4 is carried out on it, resulting in RW2. The new position of the ant is then set to the outcome of the crossover between RW1 and RW2, which is done using equation 6.
In equation 6, $x^{d}$ is the crossover result between vectors $x^{d}_{1}$ and $x^{d}_{2}$ at dimension d.

FIGURE 4. Mutation
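The following sketch illustrates one plausible form of the bitwise mutation and crossover operators used in binary ant lion variants. The exact equations 4 to 6 are not reproduced here: the linearly decreasing schedule in `mutation_rate`, the value `r_max = 0.9`, the 0.5 thresholds, and all function names are assumptions made for illustration only.

```python
import random


def mutation_rate(i, gen, r_max=0.9):
    """Assumed linearly decreasing rate for generation i of gen (placeholder for equation 5)."""
    if gen <= 1:
        return r_max
    return r_max * (1 - (i - 1) / (gen - 1))


def mutate(x_in, r, rng=random):
    """Re-draw each bit at random with probability r, otherwise copy it from x_in."""
    x_out = []
    for bit in x_in:
        r1, r2 = rng.random(), rng.random()
        if r1 < r:
            x_out.append(1 if r2 >= 0.5 else 0)  # bit is re-drawn
        else:
            x_out.append(bit)  # bit is kept
    return x_out


def crossover(x1, x2, rng=random):
    """Uniform crossover: each dimension is taken from x1 or x2 with equal probability."""
    return [a if rng.random() >= 0.5 else b for a, b in zip(x1, x2)]
```

Under these assumptions, an ant's new position would be obtained as `crossover(mutate(rw, r), mutate(elite, r))`, where `rw` is the ant lion chosen by the roulette wheel and `elite` is the solution picked from the repository.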

5) Catching Prey/Replacement of Ant lion with Fitter Ant
After the position of each ant is updated, an ant lion is replaced by its respective ant if the ant is fitter. The details of feature selection using binary multi-objective ant lion optimization are given in Algorithm 2.

D. FS METHODS USED FOR COMPARISON
Algorithm 2: Feature selection using binary multi-objective ant lion optimization (recovered steps)
[...] into external archive A.
for COUNT ← 1 to GEN do
    Calculate mutation rate r using equation 5.
    for each ant j do
        Select an ant_lion (RW) at random using Roulette Wheel selection.
        Apply mutation on RW with mutation rate r, resulting in RW1.
        Select a fittest ant_lion from A by roulette wheel and mutate it to get RW2.
        Set new position of ant j to the output of CROSSOVER(RW1, RW2).
    end for
    Evaluate all ants (calculate Fitness1 and Fitness2).
    Replace an ant_lion with its respective ant if the ant is fitter (catching prey).
    Update the repository A with the newly found fitter ant_lions.
end for
Use the CD measure to obtain the best of the best from the repository.
The following FS approaches are used here to compare the efficiency of the proposed FS mechanism:
1) MOGA-CD [19]
GA is an important evolutionary search algorithm inspired by natural selection. GA creates a group of individuals in which each individual is a candidate solution to a specific problem. Every individual is assessed by a fitness function to determine the quality of the solution. In each generation, the fittest individuals are selected and recombined to create offspring, so that the population progressively finds better and better alternatives. The initialization of chromosomes in MOGA is structurally the same as the encoding of ants and ant lions in MOALO. We used the same fitness functions to evaluate each chromosome. At each iteration, the best individuals are stored in an external repository, and the repository is updated as given in section II-C. A detailed algorithm for MOGA-CD based FS is given below:

2) MOGA-CD and MOALO-CD Filter Version
To compare the performance of the proposed wrapper-based FS mechanism MOALO-CD, we have implemented the filter versions of the two multi-objective evolutionary algorithms, MOGA and MOALO, by taking the following two fitness functions:
$$\mathrm{Fitness1}(X) = \sum_{i=1}^{L} x_i \qquad (7)$$
where X is a pattern of L bit length.
$$\mathrm{Fitness2}(X) = \mathrm{Corr}(B, y) \qquad (8)$$
where $\mathrm{Corr}(B, y)$ is the average correlation between the selected feature subset B (consisting of all those features of X where $x_i = 1$) and the corresponding class attribute y [20].
Algorithm: MOGA-CD based FS (recovered steps)
[...]
    Calculate Fitness1 and Fitness2
end for
Store the non-dominated solutions found in P into external archive A.
for COUNT ← 1 to GEN do
    Perform Roulette Wheel selection to choose the fittest individuals to recombine.
    Perform two-point crossover.
    Perform mutation.
    Evaluate the updated population P (calculate Fitness1 and Fitness2).
    Update the repository A.
end for
Use the CD measure to obtain the best of the best from the repository.
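As an illustration of the correlation-based filter objective, the sketch below averages the correlation between each selected feature and the class attribute. Using Pearson correlation, taking absolute values, and scoring an empty subset as 0.0 are assumptions made here, since the source does not state which correlation measure or sign convention is used; the function names are also illustrative.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return cov / (su * sv)


def filter_fitness2(mask, X, y):
    """Average absolute correlation between the selected columns of X and class y."""
    cols = [j for j, bit in enumerate(mask) if bit == 1]
    if not cols:
        return 0.0  # assumption: an empty subset receives the worst score
    total = sum(abs(pearson([row[j] for row in X], y)) for j in cols)
    return total / len(cols)
```

Unlike the wrapper objective, this score needs no classifier training, which is why filter versions are much cheaper to evaluate per candidate subset.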

3) Recursive Feature Elimination (RFE)
RFE [21] is fundamentally a recursive process that ranks features according to their importance. At each iteration, the importance of each feature is computed and the least relevant one is discarded. Another option, not used here, is to delete a group of features at each iteration to speed up the operation. The recursion is required because the relative importance of each feature can change substantially when it is measured on a different subset of features during the elimination process. The final ranking is built from the (inverse) order in which the features are eliminated.
The feature selection method then keeps only the first n attributes of this ranking.
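The elimination loop can be sketched as follows. The callable `importance_fn` is a hypothetical stand-in for whatever estimator supplies the importance scores (the source does not name it); it is called again on the remaining features at every round, which is the point of the recursion.

```python
def rfe(feature_names, importance_fn, n_keep):
    """Recursive feature elimination, removing one feature per round.

    importance_fn(features) must return a dict mapping each remaining
    feature to its importance on that subset (hypothetical interface).
    Returns (selected features, eliminated features in inverse elimination order).
    """
    remaining = list(feature_names)
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)  # re-scored on the current subset
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)
        eliminated.append(worst)
    # The last feature eliminated ranks just below the selected ones
    return remaining, eliminated[::-1]
```

With n_keep = 44, such a loop would return the 44 surviving features together with the rest of the ranking.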

E. CLASSIFICATION
The next task, after obtaining the important features from the infant dataset, is to build classification models that predict whether an infant will survive or not, depending upon the selected features. In this research, we have used six state-of-the-art classification techniques, namely Logistic Regression [22], KNN [22], Decision Tree [23], Random Forest (RF) [24], Gaussian Naive Bayes³, and Support Vector Machine [5,24], to classify the reduced datasets. During the analysis of the infant dataset, every algorithm was checked with ten-fold cross-validation. Algorithm performance was recorded and subsequently tabulated.

F. EXPERIMENTAL SETUPS
In this study, we used the 2018 birth data file of the US Territories recorded by the CDC and the NCHS. For more details about the dataset, refer to section II-A. To reduce the dimensionality of the dataset and to get the most influential factors causing infant death, we used MOALO-CD as a feature selection mechanism. All the experiments are executed on Python 3.7 using a PC with an Intel Core i3-7020U CPU @ 2.30 GHz and 4.00 GB RAM. The parameter settings of feature selection techniques are given in Table 1.
To compare the proposed method for selecting features from infant data with the benchmark methods discussed in section II-D, the following multi-objective performance indicators [25] are used.

1) Generational Distance (GD)
Here, A is the content of the external archive after the 100th generation. F(X) is a point in the objective space whose coordinates are (Fitness1(X), Fitness2(X)), where X is any solution r lying in the true Pareto front (PF) or any solution a lying in the calculated Pareto front (A). As the true PF is unknown in most practical problems, an approximation of the true PF is obtained here by following [26].

2) Inverted Generational Distance (IGD)
For both GD and IGD calculation, p = 2 is taken.
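For illustration, GD and IGD can be sketched as below. The normalization used here, the p-th root of the summed p-th powers of nearest-neighbour distances divided by the number of points, is one common definition and is an assumption; the source formulas may differ in how the sum is normalized. Both fronts are lists of objective vectors, and the function names are illustrative.

```python
import math


def _nearest_distance(point, front):
    """Euclidean distance from `point` to its closest member of `front`."""
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(point, q))) for q in front)


def generational_distance(A, PF, p=2):
    """Average closeness of the calculated front A to the reference front PF."""
    return sum(_nearest_distance(a, PF) ** p for a in A) ** (1 / p) / len(A)


def inverted_generational_distance(A, PF, p=2):
    """Same measure taken from each reference point in PF to the calculated front A."""
    return generational_distance(PF, A, p)
```

GD rewards convergence only, whereas IGD also penalizes a calculated front that leaves parts of the reference front uncovered, which is why the two are usually reported together.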

3) HyperVolume (HV)
The HyperVolume measure can be explained as the volume of the region of the objective space that is dominated by the calculated PF, A, and bounded from above by a reference point $rp \in \mathbb{R}^m$ such that every $a \in A$ dominates rp.
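For two objectives, the hypervolume can be computed with a simple sweep, sketched below. The sketch assumes that both objectives are minimized (so accuracy would first be converted, for example to 1 - accuracy) and that the reference point `ref` is supplied by the user; the function name is illustrative.

```python
def hypervolume_2d(front, ref):
    """Area dominated by `front` and bounded above by `ref` (two minimized objectives)."""
    # Only points that dominate the reference point contribute
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv = 0.0
    prev_f2 = ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:  # points dominated by an earlier one add nothing
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv
```

For the front {(1, 3), (2, 2), (3, 1)} and reference point (4, 4), the sweep adds strips of area 3, 2, and 1, giving a hypervolume of 6. A larger value indicates a front that is both closer to the ideal point and more widely spread.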

4) Spread
Spread (S) quantifies how widely and how evenly the obtained non-dominated solutions are distributed along the front.
In this article, we have used the classification performance assessment measures [27]⁴ listed in Table 2 to evaluate the performance of all classifiers.

III. PERFORMANCE ANALYSIS
First, wrapper-based MOALO-CD and MOGA-CD optimization algorithms are executed to find the best feature subsets by using the 10-fold CV with the KNN method to get the second fitness function (Accuracy) of each candidate feature subset. During the optimization process, cross-validation is used to avoid the over-fitting problem.
The true Pareto front, along with the fronts obtained from wrapper-based MOALO-CD and MOGA-CD, is shown in the corresponding figure. The performance of these two methods is evaluated using the four multi-objective performance indicators explained in section II-F, and the values are given in Table 3. The non-dominated set produced by MOALO-CD is better, as its GD and IGD values are lower and its HV value is higher than those obtained from MOGA-CD. However, the portion of the objective space covered by the MOGA-CD Pareto front is larger than that covered by the MOALO-CD Pareto front.
Secondly, filter based MOALO-CD and MOGA-CD methods are implemented using the fitness functions described by equations 7 and 8. The resulting Pareto fronts, along with the true PFs are shown in FIGURE. 6. The True PF has only one solution and that is far away from the fronts obtained from the above-mentioned filter methods. The cardinality of the solutions present in the Pareto fronts is less (2) in MOALO-CD as compared to MOGA-CD (4). The values of the multi-objective performance indicators for the filter methods are given in Table 4. The MOALO-CD Filter version also outperforms in terms of GD, IGD, and HV values.
Here, we have used the Student t-test to examine the IGD values of the four approaches (MOALO-CD Wrapper vs MOGA-CD Wrapper and MOALO-CD Filter vs MOGA-CD Filter), acquired from 30 different trials, to identify the better wrapper and filter method and to confirm the convergence behaviour. Table 5 shows the statistical outcome of a two-tailed t-test at a significance level of 0.05. The outcomes of 'Technique1 vs Technique2' are expressed as "»", "«", or "==", indicating whether Technique 1 is significantly superior to, significantly inferior to, or statistically equivalent to Technique 2. According to the findings, MOALO-CD outperforms MOGA-CD in both the filter and wrapper versions. Hence, we may say that the suggested MOALO-CD has superior optimization ability for the FS problem on the US Territories Birth data. Using the crowding distance measure, the number of features in the best solutions picked by the MOALO-CD and MOGA-CD wrapper versions is 44 and 44, respectively. So, while implementing the non-evolutionary Recursive Feature Elimination (RFE) algorithm, we set the required number of features to 44, which is the average number of features resulting from the above two methods.

TABLE 2 (excerpt). Classification performance measures
Specificity: Specificity = TN / (TN + FP)
ROC: The ROC curve is a graphical representation of sensitivity against (1 - specificity).
Precision: Precision = TP / (TP + FP)
FPR (False Positive Rate): FPR = FP / (FP + TN)
MCC (Matthews Correlation Coefficient): MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
After obtaining the five feature sets from the MOGA-CD WRAPPER, MOALO-CD WRAPPER, MOGA-CD FILTER, MOALO-CD FILTER, and RFE methods, we performed our next experiment, classification, with these five reduced datasets along with the original dataset. The six datasets (including the original) were subjected to six different state-of-the-art classifiers. The experimental results, along with the performance measures after repeating the 10-fold cross-validation 3 times, are outlined in Tables 6, 7, 8, 9, 10, and 11. By observing Tables 9 and 10, we can infer that most of the classification models (LR, RF, DT, and KNN) performed well on the reduced dataset produced by the MOGA-CD Filter approach. Although the best solution of the MOALO-CD Filter approach contains fewer features (16) than that of the MOGA-CD Filter method (36), it is missing some very important features such as breastfeed_at_discharge, mother's_edu, cigaratte_before_pregnency, mother's_height, and pre_preg_diabetes. The feature set selected by the MOGA-CD Filter contains these features, and they influence the infant's health status.
Similarly, a performance comparison is made by considering the datasets obtained from the MOALO-CD Wrapper and MOGA-CD Wrapper using Table 7. Various plots of the performance metric values of the above three datasets after applying the said classifiers are given in FIGURES 7 to 14. As can be observed, the accuracy values for RDS3 are better than those for RDS2 and RDS1 across all the classifiers. Only the SVM classifier gives better accuracy on RDS1 than on the other datasets. RF, DT, and KNN outperformed the others in classifying infant states, producing higher classification accuracy, sensitivity, specificity, precision, MCC, and F1-score, and a lower error rate. As the sensitivity and specificity values of RF, DT, and KNN are very high and the gap between them is small, these three classifiers can be treated as the best models for predicting whether an infant will survive or not. The precision value of RF and DT for all the datasets is 0.99, which indicates that nearly all of the positive outcomes forecasted by these two classifiers are truly positive.

FIGURE 8. Precision Comparison
ROC is a probability curve, and AUC is a measure of the degree of separability. It shows how well a model can differentiate between classes. From FIGURE 15, FIGURE 16, and FIGURE 17, it can be concluded that, as the AUC of RF and DT is equal to 1 for all three datasets, these two models are the best at differentiating between alive and dead infants. For RDS3, the AUC value of RF, DT, and KNN is equal to 1, and that of GNB is also very close to 1, so it can be concluded that the feature subset produced by the MOALO-CD Wrapper is more relevant in classifying the infant's health status.
The proportion of samples that have been wrongly classified (ERR) is higher in SVM and LR as compared to others.
As the Gmean is very close to 1, the combined efficiency of two classes (alive or dead) is higher when applying RF, DT, and KNN to classify the infant state for RDS1, RDS2, and RDS3.
The MCC value of 0.99 for RF, DT, and KNN on RDS3 indicates that these are nearly perfect binary classifiers for RDS3. The FPR value of the random forest model is much smaller than that of the other models, so the number of negative events wrongly categorized as positive (false positives), relative to the total number of actual negative events, is much lower.
The combined performance of both the alive and dead classes is also very attractive and eye-catching when the dataset is trained and tested under RF, DT, and KNN. It is found that RF proves itself as a very fast and accurate classifier in classifying infants as alive or dead. The performances of DT and KNN are also remarkable as they are very close to RF. After comparing the performance results of all the six classifiers with respect to the three chosen datasets, it is observed that all the classification models proved their efficiency in classifying whether an infant is alive or not when applied to RDS3 (the dataset obtained from MOALO-CD Wrapper). So, at the end of 100 iterations, a set of features selected by the MOALO-CD Wrapper, MOGA-CD Filter, and RFE approach are listed in Table 12 for easy comparison.
The P-values (with a 0.05 significance level) and absolute correlation values (with the target attribute) of all the features selected by the above three methods are listed in Table 12. The P-value is the probability of obtaining a result at least as extreme as the one observed, under the assumption that the null hypothesis is true. It is used to decide whether to reject the null hypothesis or fail to reject it. In feature selection, each feature's p-value indicates how strongly the data support an association between that feature and the output. If a particular feature has a p-value above the threshold (here 0.05), there is no statistically significant evidence that it changes the output, and it can be removed with little consequence. A feature with a p-value below the threshold, however, is significantly associated with the output and should not be removed.
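The thresholding rule described above can be sketched as follows. The statistical test behind the reported P-values is not specified here, so this example assumes a Pearson correlation with a large-sample two-sided p-value obtained through the Fisher z-transform; the function names, feature names, and toy values are all hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation (Fisher z, n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def screen_features(features, target, alpha=0.05):
    """Keep features whose p-value is below alpha; map name -> (|r|, p)."""
    kept = {}
    for name, values in features.items():
        r = pearson_r(values, target)
        p = approx_p_value(r, len(target))
        if p < alpha:
            kept[name] = (abs(r), p)
    return kept


# Toy data: 10 alive (0) and 10 dead (1) infants; values are made up.
target = [0] * 10 + [1] * 10
features = {
    "birth_weight": [3.2, 3.4, 3.1, 3.5, 3.3, 3.0, 3.6, 3.2, 3.4, 3.3,
                     1.5, 1.8, 2.0, 1.2, 1.6, 2.2, 1.4, 1.9, 1.7, 2.1],
    "unrelated": [1, 2] * 10,  # identical distribution in both classes
}
kept = screen_features(features, target)
print(sorted(kept))
```

In this toy case only `birth_weight` passes the 0.05 threshold, while the `unrelated` feature has zero correlation with the target and is discarded.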

IV. DISCUSSION
Many studies have been performed over the last few years to accurately classify infants' health. Singha et al. [28] analyzed different attributes of pregnant women to identify risky maternal factors responsible for neonatal mortality. Based on statistically significant patterns, they suggested an ML model to predict the key factors and identify high-risk mothers for proper medical care. The birth data of the U.S. Territories (2013) were used to train and test LR, NB, and LSVM models. Rinta-Koski et al. [29] proposed Gaussian process classification to predict in-hospital mortality of preterm infants, where, in addition to clinical ratings, gestational age at birth, and birth weight, the features are derived from direct cardiac, arterial, and oximeter sensor measurements. Thompson and Steele [30] used ML-based analysis of a healthcare project's cost and usage dataset to develop models for long-stay newborns. That research used the 2014 SID (State Inpatient Databases) for Florida, which contains 230 attributes per patient admission record. The results showed that RF was the best-performing model, with a ROC score of 0.877. Podda et al. [22] estimated preterm neonatal mortality risk by developing ML-based predictive models. The models use Italian neonatal data from 2008-2014 for training and from 2015-2016 for testing. Out of six ML predictors, the study chose ANN and compared it with LR; ANN had better discrimination than LR, with a p-value < 0.002. Preterm birth and neonatal death are strongly associated with each other, so there is a need for early and accurate prediction of preterm infants for appropriate clinical intervention.
Li et al. [31] compared five feature selection methods, namely the chi-square test, information gain, minimum redundancy maximum relevance (MRMR), stepwise logistic regression, and the Gini index in RF, to identify the risk factors for SGA infants. They evaluated their work by applying four classifiers (LR, NB, SVM, and RF), taking precision and AUC as evaluation criteria. To construct powerful SGA prediction models, they further proposed an ensemble method based on feature subsets that performed better than the individual methods. Gismondi et al. [32] used an ANN to analyze infant mortality, while Semenova et al. [33] proposed a boosted decision tree method to explore the association between heart rate variability (HRV) and blood pressure (BP) in preterm babies. The study by Lee et al. [24] included 275 infants born before 32 weeks, with an average birth weight of 929 g; 49% of them were female. They found that the RF model had excellent performance (sensitivity of 88%, AUC of 0.93) for the prediction of preterm newborn mortality.
However, the prediction of infant health status using factors selected by evolutionary approaches is less explored. Extracting the crucial factors causing infant deaths from the USA birth data file is also still an unexplored area. So, in this article, an evolutionary wrapper-based multi-objective FS technique called MOALO-CD is used to mine the critical factors behind infant death. By observing the selected features, their p-values, and their correlation values, we found that some of the most interesting and influential causes of infant death in the USA are: mothers' education, the number of prenatal visits, pre-pregnancy diabetes, smoking during pregnancy, BMI, birth weight, spina bifida, 5-min and 10-min APGAR scores, antibiotics for newborns, breastfeeding, ventilation requirement, admission to NICU, steroids, plurality, etc. Studies of maternal education have shown that increased education provides mothers with more links to children's health services and a better sense of healthy behaviours (including exercise and not smoking). Education could also improve the skills needed to access the health care system and make effective use of it. In general, the mortality rates of infants born to more educated mothers are lower. Low birth weight is the result of insufficient growth of the fetus, and the lower the birth weight, the greater the immaturity and the risk of death. Numerous factors lead to low birth weight: low socioeconomic status, low rates of employment, very late or very early puberty, inadequate diet, health problems, and substance misuse. Prenatal visits play a significant role in reducing risk factors and optimizing pregnancy outcomes, particularly if care is sufficient and begins early, by delivering effective healthcare and helping pregnant women improve their overall health. Infant mortality risk was lowest among infants of normal-weight women and increased with each higher BMI category.
Extremes of maternal weight and weight gain can contribute to the high infant mortality rates [34] in the US. Babies born to women who are obese during pregnancy (Body Mass Index (BMI) of approximately 30 kg/m²) are nearly 40% more likely to die than infants born to mothers of average weight. Preconception weight loss therapy may not be a realistic preventive technique, however, as half of all pregnancies are unplanned and significant reductions in body weight during pregnancy (20%-30%) may be needed to substantially reduce the risk of infant mortality. Spina bifida [35] occurs when the spinal cord of a developing baby fails to develop or close properly while it is in the womb. The risk of infant mortality in children with spina bifida is 4.4%. Maternal smoking increases the risk of premature delivery, low birth weight, and sudden infant death syndrome, all of which contribute to infant mortality. Studies have shown higher mortality among infants of diabetic mothers (DMIs) [36] compared with controls. Their rate of neonatal mortality is more than five times that of nondiabetic mothers' babies and is higher across all gestational age (GA) and birth weight groups. These risks can be reduced by optimal glycaemic regulation before and during the entire pregnancy. The APGAR score [37] is a quick way of summarizing a newborn's health and is related to infant mortality. For babies with an extremely low 5-minute APGAR score (1-3), neonatal and post-neonatal death rates have remained high (37 weeks). Only in babies with high APGAR ratings (almost 7) did neonatal and post-neonatal deaths gradually decrease with gestational age. Breastfeeding shows a dose-response relationship with mortality: the risk was lowest among exclusively breastfed infants, intermediate among partially breastfed infants, and highest among non-breastfed infants.
Infant mortality occurs most commonly in infants without maternal antibiotic use (11.9%) and is lower for all prenatally infected babies, independent of the maternal indication [38]. Antenatal steroid [39] intervention is very successful in reducing neonatal mortality and morbidity, but it tends to have poor uptake in low- and middle-income countries. If this intervention were greatly extended, up to 500,000 neonatal lives might be saved annually. Respiratory distress syndrome (RDS), sepsis, and birth asphyxia are the most common indications for ventilation in newborns [40]. Relevant mortality predictors for ventilated neonates were birth weight <2500 g, gestation <34 weeks, initial pH <7.1, pain, pulmonary hemorrhage, apnea, hypoglycemia, neutropenia, and thrombocytopenia. Because the risk of accident-induced infant mortality is associated with low birth weight, and multiples weigh less than singletons, plurality [41] has an indirect impact on infant death. Direct influences on infant mortality from external causes may include heightened workload and uneven parental care when parents look after more than one child at a time. As a family grows in size, parental care per child decreases; twins may represent a severe form of family growth. Additionally, in the first year of a child's life, parents of twins face more fear, tension, and depression than parents of singletons do. Pregnancy-induced hypertension (PIH) is a type of hypertensive disorder during pregnancy that includes gestational hypertension, pre-eclampsia, and eclampsia [42]. The effect of gestational hypertension on the neonatal outcomes of twins is even more positive than its effect on those of singleton pregnancies. The correlation of PIH with child mortality depends on disparities in birth weight, pregnancy, and age at death.

V. CONCLUSION AND FUTURE WORK
The concept of infant mortality is examined in this article. Because of the high imbalance in the data, predicting infant mortality is a difficult binary classification task. So, SMOTE is applied as a preprocessing technique to balance the infant mortality dataset discussed above. An evolutionary wrapper-based multi-objective FS technique called MOALO-CD is then used to mine the critical factors behind infant death. Smoking during pregnancy, birth weight, breastfeeding, BMI, prenatal visits, use of steroids, and APGAR score are some of the important factors affecting infant health selected by this FS mechanism. The proposed methodology is compared with one non-evolutionary FS method (RFE), two filter-based multi-objective approaches (MOALO-CD Filter and MOGA-CD Filter), and one wrapper-based multi-objective algorithm (MOGA-CD Wrapper) using four multi-objective performance indicators. After obtaining five reduced datasets, six standard classifiers are applied to them to verify their classification performance in predicting the infant's health state (alive or dead). All the classifiers showed their best performance on the reduced dataset obtained from the MOALO-CD Wrapper approach. Experimental findings reveal that RF can easily and reliably identify an infant as dead or alive. The performance of DT and KNN is also excellent, as it is very similar to that of RF. In the future, the proposed model could be tried on infant mortality datasets from different countries and different years. Improved evolutionary feature selection algorithms could also be attempted to mine more relevant factors causing infant death in our society.