Filter-Based Multi-Objective Feature Selection Using NSGA III and Cuckoo Optimization Algorithm

Feature selection aims to remove irrelevant features while improving classification performance. These two aims conflict with one another, and a choice must be made in the presence of the trade-off between them. Numerous studies deal with the feature selection problem, but they are mostly single-objective. Multi-objective optimisation approaches are increasingly regarded as the most suitable way to deal with feature selection problems, because they can balance the number of selected features against classification accuracy or error rate. Evolutionary computation techniques have been applied to multi-objective feature selection, and the cuckoo optimisation algorithm is among the most popular techniques for solving feature selection problems. Based on the binary cuckoo optimisation algorithm, two multi-objective filter-based feature selection frameworks are presented, built on the nondominated sorting genetic algorithms NSGAIII (BCNSG3) and NSGAII (BCNSG2). Four multi-objective filter-based feature selection approaches are then proposed by employing mutual information and gain ratio based entropy as the respective filter evaluation measures in each framework. The results are examined and analysed against existing methods and a single-objective scheme on fourteen (14) datasets of varying degrees of difficulty. The experiments show that the proposed multi-objective algorithms successfully derive a set of nondominated solutions that use fewer features and attain a lower error rate than the full-length feature set. In general, BCNSG2 obtained better results than the existing methods and the single-objective algorithm, whereas BCNSG3 outperformed all other approaches.


I. INTRODUCTION
We are now in the era of big data, where data has become ubiquitous in domains ranging from bioinformatics, social media and healthcare to manufacturing and online education. The rapid expansion of data poses a severe challenge for handling information effectively. Hence the need to apply data mining together with machine learning approaches to discover hidden knowledge in large stored data [1], [53]. Classification is one of the data mining techniques; it is employed to categorise each row in a dataset into a set of groups according to its class label. It is well known that feature size is a key problem that degrades the performance of all classifiers [2]. If prior information about the useful and most relevant features is available, the task is not challenging; otherwise, it is hard to discover the most valuable and relevant features, particularly when the number of features is large [3]. Feature selection (FS) is introduced to select the most appropriate features from these enormous volumes of data.

The associate editor coordinating the review of this manuscript and approving it for publication was Shangce Gao.
FS is also one of the data mining processes used to pick the appropriate features from a dataset. The main issues in FS are how to search for the best subsets and how to evaluate the generated subsets [4].
Existing search techniques cannot efficiently explore the large search space of FS without becoming stuck in local optima [5]. Recently, evolutionary computation (EC) techniques have been employed to explore the large search space of an FS problem. However, most of them suffer from premature convergence. The cuckoo optimisation algorithm (COA) is among the EC techniques reported in [6] to have effective exploration operators that cover the whole promising area of the search space, and to converge earlier than many other EC-based techniques. Based on that, binary COA (BCOA) is employed as a search method to explore the most appropriate feature subsets automatically.
Evaluation of the generated feature subsets depends on the type of FS, which can be either filter or wrapper. Wrappers use a classifier to measure the accuracy of each selected feature subset. This procedure is, however, computationally expensive, especially on datasets with a large number of features [7]. Filter-based approaches, in contrast, are computationally cheap and perform well on big datasets. The main drawback of filter-based FS is that it does not account for dependency or interaction among the selected features [1], [5]. Information theory is among the theories used to estimate both the relevance and the redundancy between two or more features and their target class [8].
Using the concepts of information theory to measure the redundancy and relevance of selected features within different EC techniques is becoming popular. For example, Cervante et al. in [9] and [10] used ideas from information theory, especially mutual information (MI) and entropy, as the fitness function in a binary particle swarm optimisation algorithm (BPSO). Different weight values were employed to increase relevance and reduce redundancy on the datasets, and better results were achieved. Recently, Hancer et al. [11] used differential evolution (DE) for feature ranking with the assistance of information theory ideas including ReliefF, MI and Fisher score. The results surpassed the single-objective and multi-objective methods they were compared with. The literature shows that EC techniques are gaining popularity for filter-based FS, predominantly with the ideas of information theory [11]. However, there are other EC techniques, such as COA, that have shown encouraging results and yet have not been considered for FS, specifically with the ideas of information theory.
FS aims to minimise the error rate and, at the same time, reduce the number of features; it is therefore considered a multi-objective optimisation problem [12], [40]. These aims conflict with one another, and the optimal choice must be made in the presence of the trade-off between them; however, very little work has been conducted on multi-objective FS. References [10] and [12] use the idea of the nondominated sorting genetic algorithm II (NSGAII) with BPSO and PSO respectively. NSGAII has commonly been used as a multi-objective optimiser to find optimal solutions for various objective functions [13]. Although NSGAII is reported to be somewhat computationally expensive and dated, it can successfully evolve a set of nondominated solutions [14]. More recently, NSGAIII was proposed in [15]. It is less computationally expensive and can successfully evolve a set of nondominated solutions for many-objective functions. Since its inception, NSGAIII has neither been used for FS nor been combined with other EC techniques to solve a feature selection problem. To our knowledge, no work to date has used BCOA for multi-objective FS.
In a nutshell, COA can solve only continuous optimisation problems, whereas FS is best solved as a binary discrete optimisation problem; thus, BCOA is used in this study. A ''1'' means a feature is selected, while a ''0'' means it is not. Moreover, feature selection is now considered a multi-objective optimisation problem that aims at reducing the number of selected features while improving classification performance, and is hence treated as a two-objective optimisation problem. In this study, two multi-objective optimisation algorithms, NSGAII and NSGAIII, are employed to tackle multi-objective feature selection and obtain a set of nondominated solutions. NSGAII can solve two-objective optimisation problems such as feature selection, whereas NSGAIII is primarily designed for many-objective optimisation problems. In this study, both NSGAIII and NSGAII are used for the first time together with BCOA to solve the feature selection problem.
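The binary encoding described above can be illustrated with a minimal sketch. The feature names and the helper names `decode_mask` and `selected_ratio` are hypothetical and serve only to show how a 0/1 habitat maps to a feature subset and to the feature-count objective.

```python
from typing import List, Sequence


def decode_mask(mask: Sequence[int], feature_names: Sequence[str]) -> List[str]:
    """Return the names of the features whose bit is 1 (selected)."""
    if len(mask) != len(feature_names):
        raise ValueError("mask and feature_names must have the same length")
    return [name for bit, name in zip(mask, feature_names) if bit == 1]


def selected_ratio(mask: Sequence[int]) -> float:
    """Fraction of features kept; one objective to be minimised in this sketch."""
    return sum(mask) / len(mask)


features = ["age", "bp", "chol", "ecg", "hr"]  # hypothetical feature names
habitat = [1, 0, 1, 0, 0]                      # one binary habitat (candidate subset)
print(decode_mask(habitat, features))          # ['age', 'chol']
print(selected_ratio(habitat))                 # 0.4
```

The second objective (error rate or a filter measure) would be computed on the decoded subset only.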
The overall aim of this study is to adopt BCOA [16] with entropy (gain ratio based entropy) and MI as the evaluation measures, together with the nondominated sorting genetic algorithms NSGAII (BCNSG2MI and BCNSG2E) and NSGAIII (BCNSG3MI and BCNSG3E), to find, within a short time, a set of nondominated solutions with fewer features and comparable or better classification performance than using the full-length features.
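To make the notion of a set of nondominated solutions concrete, the following sketch filters candidate solutions described by two minimised objectives, the number of selected features and the error rate. The candidate values are invented for illustration, and this is only the basic dominance test, not the full NSGAII or NSGAIII procedure.

```python
from typing import List, Tuple

Objectives = Tuple[float, float]  # (number of features, error rate), both minimised


def dominates(a: Objectives, b: Objectives) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points: List[Objectives]) -> List[Objectives]:
    """Return the points that no other point dominates (the first Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented candidates: (features used, classification error rate)
candidates = [(3, 0.20), (5, 0.12), (5, 0.15), (8, 0.12), (10, 0.10)]
print(nondominated(candidates))  # [(3, 0.2), (5, 0.12), (10, 0.1)]
```

Here (5, 0.15) and (8, 0.12) are both dominated by (5, 0.12), so only three trade-off solutions remain.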
The proposed FS algorithms were investigated on UCI standard benchmark datasets of varying degrees of difficulty. Precisely, this study investigates whether
• the filter-based single-objective approaches with gain ratio based entropy (BCOA-E) and MI (BCOA-MI) as the evaluation measures can select fewer features and achieve better classification accuracy than using the full-length features;
• the proposed multi-objective BCNSG2MI and BCNSG2E FS algorithms can evolve a set of nondominated solutions that perform better than the filter-based single-objective approaches and other existing methods; and
• the multi-objective BCNSG3MI and BCNSG3E can evolve a set of nondominated solutions that perform better than the approaches above as well as other existing methods.
The rest of the paper is organised as follows. Section 2 provides background information, including COA, BCOA, multi-objective optimisation, information theory concepts and related work. Section 3 presents the proposed filter-based multi-objective BCOA approaches, which use the information theory concepts with NSGAII and NSGAIII respectively. Section 4 describes the experimental design, and Section 5 presents the results and discussion. Finally, Section 6 gives the conclusions and directions for further research.

A. TRADITIONAL FILTER-BASED FEATURE SELECTION
Most filter-based approaches rank each feature with respect to its target class using a suitable evaluation measure. The goal is to remove irrelevant and redundant features, and the challenge is how to search for the best subset of features with the standard evaluation measures. For example, Kira and Rendell [17] presented a classical filter-based FS algorithm known as the Relief algorithm. It uses statistical methods and hence avoids heuristic search. It assigns a weight to each feature to indicate how statistically relevant the feature is to its target class. Nevertheless, the Relief algorithm does not handle redundant features, since it concentrates on finding all statistically relevant features irrespective of the redundancy among them. Also, a decision tree (DT) algorithm was proposed in [18] to enhance the classification performance of case-based learning. The results indicate that the features selected by the DT help to reduce the error rate of the DT classifier.
Another filter-based algorithm, FOCUS, was presented by Almuallim and Dietterich [19]. FOCUS is an exhaustive search algorithm that explores all possible feature subsets and then selects the smallest subset. The exhaustive search makes it computationally expensive, especially on high-dimensional datasets. From another perspective, [20] developed an MI feature selector (MIFS) method for supervised neural networks and categorised features as relevant or redundant. Redundant features are features with low information content or high redundancy. A heuristic function was employed to balance relevance against redundancy. Lastly, features are selected greedily, as in the standard greedy algorithm apart from its fourth step. Then, [21] addressed the limitation of MIFS by introducing another greedy search, the uniformly improved MI feature selector (MIFS-U), which chooses the useful features and halts as soon as it reaches the required number of features. MIFS-U considers the MI between the input features and the output classes, in contrast to MIFS, which does not perform well on nonlinear problems.
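The greedy trade-off between relevance and redundancy can be sketched as follows. This assumes the commonly cited MIFS-style criterion, in which a candidate feature f is scored by I(C; f) minus β times the summed MI between f and the features already selected; the parameter `beta`, the function names and the toy data are illustrative assumptions, not taken from [20] or [21].

```python
import math
from collections import Counter
from typing import List, Sequence


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """I(X;Y) in bits for two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def mifs(features: List[Sequence[int]], target: Sequence[int],
         k: int, beta: float = 0.5) -> List[int]:
    """Greedily pick k feature indices by relevance minus beta * redundancy."""
    selected: List[int] = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        def score(j: int) -> float:
            relevance = mutual_information(features[j], target)
            redundancy = sum(mutual_information(features[j], features[s])
                             for s in selected)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: f1 is an exact copy of f0; f2 is weaker but independent of f0.
y = [0, 0, 0, 0, 0, 1, 1, 1]
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = list(f0)
f2 = [0, 0, 1, 1, 0, 0, 1, 1]
print(mifs([f0, f1, f2], y, k=2, beta=1.0))  # [0, 2]
```

The duplicate feature f1 is as relevant as f0 but is penalised for redundancy once f0 is chosen, so the weaker, non-redundant f2 is picked second.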
Bishop and Bishop in [22] proposed a supervised filter-based FS algorithm called Fisher score. It ranks features by their discriminative ability, evaluating each feature individually. The limitation of this algorithm is that redundancy remains among the chosen features, since correlation among them is not considered. Similarly, [23] developed a fast correlation-based feature selection (CFS) method that works for both continuous and discrete data. The results showed that it outperformed naïve Bayes, instance-based learning, ReliefF and DT. The CFS algorithm uses heuristic techniques for FS; it finds features that are highly correlated with the target class but not correlated with each other. Although symmetrical uncertainty was applied to measure the level of correlation, the measure does not work well when relationships involve several features. Reference [24] presented ReliefF, a variant of the Relief algorithm for feature ranking, which assigns a score to each feature separately based on the KNN algorithm. Despite being amongst the best filter-based FS methods, it still selects some redundant features. On the other hand, [25] proposed an alternative way of selecting features that have maximum relevance to the target class, so that the selected features individually have the largest mutual information with the target class. The proposed technique is a two-stage FS that combines minimal redundancy maximal relevance (mRMR) with wrapper-based FS techniques. After selecting the best features at low cost, the results showed that mRMR achieved better results in both accuracy and the number of selected features.
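As an illustration of ranking features individually, the sketch below computes a Fisher score per feature. It assumes the common definition, the between-class scatter of the feature divided by its within-class scatter; the function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance
from typing import Dict, List, Sequence


def fisher_score(values: Sequence[float], labels: Sequence[int]) -> float:
    """Between-class scatter divided by within-class scatter for one feature."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    overall = fmean(values)
    between = sum(len(g) * (fmean(g) - overall) ** 2 for g in groups.values())
    within = sum(len(g) * pvariance(g) for g in groups.values())
    return between / within if within > 0 else float("inf")


labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # classes are well separated
mixed = [1.0, 8.0, 3.0, 7.0, 2.0, 9.0]       # classes overlap heavily
print(round(fisher_score(separating, labels), 3))  # 13.5
print(round(fisher_score(mixed, labels), 3))       # 0.115
```

Because each feature is scored on its own, two highly correlated features would both receive high scores, which is the redundancy limitation noted above.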
Later, Ling and Tang [26] introduced a class relevance and redundancy framework based on information theory, together with a novel algorithm named conditional informative feature extraction, which improves the information carried by the entire set of features by explicitly minimising the class redundancies. In addition, the computational cost, one of the major issues of information theory, is drastically reduced by coupling discrete approximation with the 1D Parzen window method and the local active region method. A Pearson correlation coefficient was introduced in [27] to rank features in descending order of mean and standard deviation with respect to their class label. A predictor was then used on the M nested subsets, and the algorithm chooses the subset with the smallest validation error. Although it is computationally inexpensive and simple to implement, it treats features independently, since it can only recognise a linear relationship between a feature and its target class.
Also, Huawen et al. [28] developed a dynamic MI feature selection method, in which the MI of each feature is re-estimated on the unlabelled instances rather than on the entire sampling space. The method performed well on 16 UCI datasets with four standard classifiers.
Furthermore, Estevez et al. [29] presented normalised MI feature selection (NMIFS), an improvement over the MIFS, MIFS-U and mRMR approaches. The average normalised MI was applied to estimate the redundancy among the selected features. The experimental results showed that it outperformed the three other MI methods on several benchmark datasets without requiring any user-defined parameter.
On the other hand, [30] proposed an extension of the Shannon MI between a feature and the class label, and used this extension to naturally derive a space of possible filter criteria. This was achieved by adding a class-conditional correlation term to the main equation of mutual information, denoted the first-order utility. Further mathematical background and theoretical concepts of mutual information are also presented.
Laplacian Score (LS) is among the favoured filter-based ranking techniques and is used for both supervised and unsupervised FS. It retains useful features and rejects the features with high variance. Reference [31] proposed LS with an entropy measure; the idea is to select effective features by replacing the standard k-means in LS with an information distance measure. Better results were achieved compared with LS in terms of efficiency, stability and scalability. An iterative LS based on a neighbourhood graph was proposed in [32], and the results showed that better features were chosen according to the structure of the graph.
Still, Foithong, Pinngern and Attachoo [33] developed another FS approach based on an MI measure that does not require a user-defined parameter for choosing the candidate feature set. In addition, [34] established a comprehensive library for FS that provides other measures, such as MI and Fisher score, to compute correlations among features. Recently, [11] introduced a new filter criterion inspired by the concepts of MI, ReliefF and Fisher score. Instead of using mutual redundancy, the proposed criterion attempts to select the top-ranked features determined by ReliefF and Fisher score while considering the mutual relevance between the features and the target class labels.

B. EVOLUTIONARY COMPUTATION FOR FILTER-BASED FEATURE SELECTION
Moghadasian and Hosseini [36] developed a filter-based cuckoo search algorithm (CSA) with MI and entropy as evaluation criteria on some high-dimensional datasets. The classification results using an ANN showed that almost 90% of the original features were removed. Compared with using the full feature set, CSA with entropy performed better in classification performance, whereas CSA with MI performed better in the number of selected features.
Similarly, Cervante et al. [9] presented a BPSO algorithm with entropy and MI as evaluation measures. The results on four datasets showed that BPSO with mutual information could evolve feature subsets with fewer features, whereas BPSO with entropy achieved higher classification accuracy using a DT than BPSO with MI. The work was extended in [37], where a multi-objective filter-based FS using BPSO and the nondominated sorting genetic algorithm, with information measures as the evaluation criteria, was presented. The approach was tested on six datasets, with a DT used to measure the classification error rate. Moreover, [38] developed another multi-objective filter-based FS. First, a GA fitness function with MI and entropy as evaluation measures was used for single-objective FS: GA with MI chose the smallest yet appropriate number of features, while GA with entropy performed well in terms of classification performance. Furthermore, the strength Pareto evolutionary algorithm (SPEA2) and NSGAII were combined with MI and entropy. The results showed that both SPEA2 and NSGAII outperformed the single-objective algorithm, and that NSGAII outperformed SPEA2, specifically in the number of selected features.
Xue, Zhang and Browne [12] developed a crowding, mutation and dominance PSO (CMDPSOFS) for multi-objective FS by improving the performance and defining suitable operators. Similarly, a cost-based multi-objective PSO for FS named hybrid mutation PSO (HMPSOFS) was presented in [39]. The proposed HMPSOFS uses a hybrid mutation and updates the acceleration coefficients with an adaptive mechanism, whereas CMDPSOFS enhances the diversity of the search by applying both uniform and non-uniform mutation operators together with the proposed mutation mechanism. However, the proposed approaches can be used only to solve feature selection problems, and other approaches might yield better results.
Nguyen et al. [40] introduced insert, swap and remove PSO feature selection (ISRPSOFS), a local search in which a sequential forward or backward search is performed using inserting, removing and swapping operators. However, the proposed method is computationally expensive, particularly on larger data with many redundant and irrelevant features.
A filter-based FS based on differential evolution (DE) was developed in [11]. The criterion uses the MI of the features ranked highest by ReliefF and Fisher score. Based on that, two filter-based DE approaches are proposed. The first combines the objectives into a single weighted objective, whereas the second is based on multi-objective optimisation. The proposed method was compared with mutual information feature selection (MIFS), also implemented with DE in both single-objective and multi-objective form. It performed better than MIFS with DE in both the single-objective and multi-objective settings on all the datasets, with reduced feature size and better classification accuracy.
In the same vein, [41] presented another multi-objective filter-based FS using artificial bee colony (ABC). Both the numerical ABC and its binary counterpart are examined using a nondominated sorting method and genetic operators. The binary ABC outperformed its numerical counterpart in both accuracy and the number of selected features.
A data envelopment analysis (DEA) method combined with COA for dealing with multi-objective optimisation problems was presented in [42]. The profit function of COA is replaced by the efficiency value obtained from DEA. Later, COA was hybridised with simple additive weighting (SAW) [43]; the resulting COAW algorithm finds the Pareto frontiers quickly and locates their start and end points appropriately. However, all the COA-based multi-objective methods presented are hybrid approaches rather than genuinely multi-objective ones.
On the other hand, unlike the other EC-based techniques mentioned earlier, no COA-based multi-objective optimisation has been proposed in the literature. Recently, [4] developed a filter-based COA using the general filter algorithm as the fitness function of the COA. Some heart disease datasets were used to validate the efficacy of the proposed method. However, the results favoured the filter-based CSA, especially on the small datasets, because most of the datasets have few features. If it were demonstrated on high-dimensional data, or enhanced to avoid redundancy among the selected subsets, it might provide better results, as argued by [6].
Most of the existing studies show that COA has had limited application, especially in FS, compared with other evolutionary computation based techniques such as PSO, GA, ACO and ABC.
On the other hand, NSGA is the most common multi-objective optimisation algorithm, since it has shown promising results in solving different kinds of multi-objective optimisation problems in various domains.
With the introduction of NSGAII in [13], it became more powerful in handling multi-objective problems. Hamdani et al. [49] proposed the first multi-objective FS framework using NSGAII. Based on that, [10] applied the framework and developed a multi-objective filter-based FS using BPSO. In addition, PSO with MI and entropy as evaluation criteria was used within NSGAII in [38]. However, these methods are limited to the application of NSGAII with PSO and BPSO alone, whereas other EC techniques such as COA, which have a proven record, have not yet been used in that regard.
In an attempt to reduce the computational cost of wrapper-based FS without jeopardising the results of the FS, [55] presented a faster multi-objective FS by incorporating an improved ABC based on particle update model into the framework. In the framework, k-means clustering, along with ladder-like sample utilisation, are employed to minimise the cost of the evolutionary process. The experimental results showed that it has promising results and performed better than NSGAII-FS, among others.
To achieve a trade-off between local exploitation and global exploration, [56] proposed a binary DE with a self-learning strategy (MOFS-BDE) to solve multi-objective FS problems. Three operators are employed to achieve better results: (1) a new binary mutation operator that speeds up locating the most promising regions; (2) a new one-bit purifying search operator that aids the self-learning of elite individuals; and (3) a nondominated sorting operator with crowding distance that reduces the time consumed by selection. The proposed MOFS-BDE performed well on public datasets and was competitive in comparison with DEMOFS, NSGAFS, MOPSOFS, B-MOABCFS and a new MOEA/D method (MOEA/D-2TMFI). However, the results were not compared and analysed against NSGAII and NSGAIII.
On the other hand, [60] introduced a novel swarm intelligence algorithm, known as Rc-BBFA, and used it effectively to solve FS problems. The proposed algorithm extends the idea of FFA by introducing binary variables. Three new strategies, namely the return-cost attractiveness, the Pareto dominance-based selection and the binary movement with adaptive jump, are employed in the algorithm, which make it effective in handling FS problems. Experiments on ten well-known datasets were conducted, and promising results were obtained compared with the other methods considered in that paper. However, the results were not compared with the most recent multi-objective evolutionary algorithms, such as NSGAIII and MOEA/D.
Recently, [57] proposed a new PSO-based unsupervised FS approach, known as the filter-based bare-bone particle swarm optimisation algorithm (FBPSO). A local filter-based search strategy based on feature redundancy is employed to enhance the exploitation ability of the swarm, while a space reduction strategy using the mean mutual information is employed to eliminate irrelevant and redundant features faster.
Most existing multi-objective evolutionary algorithms experience difficulties in solving many-objective optimisation problems, owing to their inability to balance convergence and diversity in a high-dimensional space. To address this, [58] proposed a new many-objective evolutionary algorithm using a one-by-one selection strategy. Once an individual is selected, its neighbours are de-emphasised using a niche technique to guarantee the diversity of the population, in which the similarity between individuals is evaluated using a distribution indicator. The comparative results show the merit of the proposed method. However, this method was not examined on multi-objective FS problems.
Similarly, [59] proposed another multi-objective evolutionary optimisation algorithm based on reference points (RPEA), which exploits the potential of reference points in handling many-objective optimisation problems. The main steps of RPEA are: (1) adaptively generating a series of reference points with good convergence and distribution based on the evolution of the population; and (2) greatly increasing the selection pressure toward the Pareto front by calculating the distances between the reference points and the individuals in the environmental selection process. The proposed method was applied to seven benchmark many-objective optimisation problems and compared with four other state-of-the-art methods to evaluate its performance. The results reveal that RPEA is very competitive with the others in finding a solution set with good approximation and distribution in many-objective optimisation. This work, too, was not tested on multi-objective FS.
In the same vein, [61] proposed an improved MOPSO, termed BMOPSOFS, to solve FS problems with unreliable data. To achieve that, a probability-based encoding strategy, a reinforced memory and a hybrid mutation are employed together with several established techniques, such as the external archive and the crowding distance. These make BMOPSOFS more effective in dealing with multi-objective FS problems, and it performs well on various benchmark datasets.
Recently, [62] presented an unsupervised FS approach that combines the discriminative information of class labels with subspace learning. Nonnegative Laplacian embedding is first employed to produce pseudo labels, in order to enhance the classification accuracy. Then, an optimal feature subset is chosen by subspace learning guided by the discriminative information of class labels, on the premise of maintaining the local structure of the data. Based on that, an iterative strategy for updating the similarity matrix and pseudo labels was developed, which yields more accurate pseudo labels and ensures the convergence of the proposed strategy. The results on six real-world datasets show the advantage of the proposed method over seven other state-of-the-art methods.
To enhance the convergence and exploitation ability of ABC, [54] presented a two-archive guided multi-objective ABC called TMABC-FS. The two archives, the external archive and the leader archive, are employed to improve the search ability of the different kinds of bees. Two new operators, a convergence-guiding search for employed bees and a diversity-guiding search for onlooker bees, are proposed to obtain a group of nondominated feature subsets with better distribution and convergence. The proposed TMABC-FS was validated on different UCI benchmark datasets and compared with two traditional algorithms and three multi-objective approaches. The results show that TMABC-FS is an effective and robust optimisation method for solving cost-sensitive FS problems.
The concept of NSGAIII was introduced in [50], and it has recorded numerous achievements since its introduction [14]. However, the use of NSGAIII, particularly for filter-based FS, is limited in the literature. Therefore, in this study, the frameworks of both NSGAII and NSGAIII are adopted with BCOA, with MI and the combined entropy as evaluation measures.

C. CUCKOO OPTIMISATION ALGORITHM
An EC-based technique called the cuckoo optimisation algorithm (COA) was developed by [16]. COA has the following rules:

1) The variables are stored in an array named ''habitat'' of size $1 \times N_{var}$:

$$habitat = [x_1, x_2, \ldots, x_{N_{var}}] \tag{1}$$

2) The lower and upper limits on the number of eggs laid by each cuckoo in an iteration are 5 and 20 respectively.

3) The maximum range distance for egg laying, the egg laying radius (ELR), is

$$ELR = \alpha \times \frac{\text{number of the current cuckoo's eggs}}{\text{total number of eggs}} \times (v_{hi} - v_{low}) \tag{2}$$

where α is an integer, and $v_{hi}$ and $v_{low}$ are the upper and lower limits of the variables. In Eq. (2), α is set to 1. The search space is the interval (-55, 55), and there are twenty cuckoos in the population. By nature, a cuckoo lays 5 to 20 eggs. In this study, the same concept is used: the five cuckoos with the lowest profit lay five eggs each, and the other fifteen cuckoos lay between 6 and 20 eggs, in proportion to their profit. Thus, the total number of eggs is 220. For the cuckoo whose profit ranks fifth, which lays 16 eggs, the ELR is

$$ELR = 1 \times \frac{16}{220} \times (55 - (-55)) = 8$$

which signifies that a cuckoo laying 16 eggs can lay them within a circle of radius 8.

4) Only p% of the eggs (here 10%), those with the lowest profit and highest cost, are killed.

5) For clustering, a k-means with k of 3 to 5 is sufficient in most simulations.

6) Each cuckoo flies only λ% of the distance towards the goal habitat, with a deviation of φ radians:

$$\lambda \sim U(0, 1), \qquad \varphi \sim U(-\omega, \omega) \tag{3}$$

Here λ ∼ U(0, 1) means that λ is a uniformly distributed random number between 0 and 1, and ω is a parameter that limits the deviation from the goal habitat. An ω of π/6 rad is suitable in most cases. The detailed algorithm of the typical COA is shown in Algorithm 1.
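The egg allocation and the radius example in rule 3 can be checked numerically. The sketch assumes the standard COA form of the egg laying radius, ELR = α × (eggs of this cuckoo / total eggs) × (v_hi − v_low); the function and variable names are illustrative.

```python
from typing import List


def egg_laying_radius(eggs: int, total_eggs: int,
                      v_hi: float, v_low: float, alpha: float = 1.0) -> float:
    """Assumed standard COA radius: alpha * (eggs / total_eggs) * (v_hi - v_low)."""
    return alpha * eggs * (v_hi - v_low) / total_eggs


# Five low-profit cuckoos lay 5 eggs each; the other fifteen lay 6, 7, ..., 20 eggs.
eggs_per_cuckoo: List[int] = [5] * 5 + list(range(6, 21))
total_eggs = sum(eggs_per_cuckoo)
print(len(eggs_per_cuckoo), total_eggs)  # 20 220

# The fifth-best cuckoo lays 16 eggs in the search interval (-55, 55).
print(egg_laying_radius(16, total_eggs, v_hi=55.0, v_low=-55.0))  # 8.0
```

The total of 220 eggs and the radius of 8 agree with the figures quoted in the text.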

Algorithm 1
The Typical COA Pseudocode
1: Begin
2: Initialise the cuckoo habitats with random points on the objective function
3: Dedicate some eggs to each cuckoo
4: Compute the ELR for each cuckoo
5: Let the cuckoos lay their eggs inside their corresponding ELR
6: Destroy the eggs that are recognised by the host birds
7: Let the eggs hatch and the chicks grow
8: Evaluate the habitat of each newly grown cuckoo
9: Limit the maximum number of cuckoos in the environment and destroy those living in the worst habitats
10: Cluster the cuckoos, find the best group and select the goal habitat
11: Let the new cuckoo population migrate towards the goal habitat
12: If the stop condition is satisfied, stop; otherwise go to 3
13: End

COA was originally meant to solve only continuous optimisation problems. Mahmoudi and Rajabioun [45] later introduced the BCOA, which is capable of dealing with binary discrete optimisation problems. In BCOA, the new habitat X_Nhabitat is computed from the goal habitat X_goal and the current position X_Curpos. To make X_Nhabitat suitable for discrete binary problems, the sigmoid function in Eq. (5) is employed to map X_Nhabitat into the range [0,1]. Lastly, Eq. (6) sets each value in the habitat to 0 or 1, where rand in Eq. (6) is a randomly generated number.
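A minimal sketch of the binary habitat update may help here. The paper's own equations are not reproduced in this text, so the sketch assumes forms common in binary swarm algorithms: a move towards the goal of the form x_new = x_cur + rand × (x_goal − x_cur), a logistic sigmoid, and a bit set to 1 when a random number falls below the sigmoid value. The function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_habitat_update(current, goal, rng):
    """Move a binary habitat towards the goal habitat and re-binarise it.

    Assumed (not confirmed by the source) update rule:
        x_new = x_cur + rand * (x_goal - x_cur)
        bit   = 1 if rand < sigmoid(x_new) else 0
    """
    new_bits = []
    for x_cur, x_goal in zip(current, goal):
        x_new = x_cur + rng.random() * (x_goal - x_cur)
        new_bits.append(1 if rng.random() < sigmoid(x_new) else 0)
    return new_bits


rng = random.Random(0)          # fixed seed so the run is repeatable
current = [1, 0, 0, 1, 1, 0]    # 1 = feature selected, 0 = feature dropped
goal = [1, 1, 0, 0, 1, 0]
new_habitat = binary_habitat_update(current, goal, rng)
print(new_habitat)
```

Each position stays a 0/1 feature flag after the update, so the habitat can be read directly as a feature subset.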
Entropy H(X) measures the uncertainty of a random variable in terms of the probability of occurrence of its events. The definition of entropy is shown in Eq. (7). The higher the entropy, the more uncertain the outcome of the variable.

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) \quad (7)$$

Here, X is a random variable and p(x) is the probability of the value x. The joint and conditional entropy of X and Y are:

$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y) \quad (8)$$

$$H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x \mid y) \quad (9)$$

where X = x_1, x_2, ..., x_i, ..., x_n and Y = y_1, y_2, ..., y_j, ..., y_m.

Mutual information (MI) is employed to measure the relationship between two random variables and to evaluate the relevance of the feature subset [46]. The MI between features X and Y can be defined as

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} \quad (10)$$

Eq. (10) means that I(X; Y) is large if X and Y are closely related. If they are independent, I(X; Y) is zero.
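To make the entropy and mutual information definitions concrete, the sketch below estimates both from discrete samples using relative frequencies. It relies on the standard identity I(X;Y) = H(X) + H(Y) − H(X,Y) and assumes base-2 logarithms; the helper names and toy data are illustrative only.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


feature = [0, 0, 1, 1, 0, 0, 1, 1]
copy_of_feature = list(feature)            # fully dependent on `feature`
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]       # independent of `feature` in this sample

print(mutual_information(feature, copy_of_feature))
print(mutual_information(feature, unrelated))
```

In this toy sample a variable shares all of its one bit of entropy with its own copy, and nothing with the unrelated variable, which is the behaviour Eq. (10) is meant to capture.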

E. MULTI-OBJECTIVE OPTIMISATION
By nature, feature selection is a multi-objective optimisation problem (MOP). MOPs usually occur when optimal decisions must be made in the presence of trade-offs between different objectives [47].
An MOP involves minimising or maximising several conflicting objective functions. Its solution is normally a set of solutions that define the best trade-off between the competing objectives. Mathematically, it can be written as follows:

$$\min / \max \; f(x) = [f_1(x), f_2(x), \ldots, f_M(x)] \quad (11)$$

$$\text{subject to } g_j(x) \le 0, \; j = 1, \ldots, J; \quad h_k(x) = 0, \; k = 1, \ldots, K; \quad x_i^{L} \le x_i \le x_i^{U}, \; i = 1, \ldots, n \quad (12)$$

where x is the candidate solution vector, f_m(x) is the mth objective function to be minimised or maximised, f(x) is the objective function, h_k(x) and g_j(x) are the constraint functions, J and K are integers, and x_i^L and x_i^U are the lower and upper bounds, respectively.
In a single-objective optimisation problem, the superiority of one solution over another is readily determined by comparing their objective function values. In a multi-objective optimisation problem, enhancing one objective may worsen another. A balanced trade-off is reached when no objective can be improved without degrading one or more of the others; a change that improves at least one objective without degrading any other is called a Pareto improvement [48].
Dominance determines the quality of a solution. For instance, let y and z be two candidate solution vectors, and let f_m(x) be the objectives to be minimised. If the criteria in Eq. (13) are satisfied, then y dominates z, that is, y is better than z, or z is dominated by y:

$$\forall m: f_m(y) \le f_m(z) \quad \text{and} \quad \exists m: f_m(y) < f_m(z) \quad (13)$$

For maximisation, the inequalities are reversed. When a solution is not dominated by any other solution, or no further Pareto improvement can be made, it is referred to as a Pareto-optimal or nondominated solution. The set of all Pareto-optimal solutions forms the trade-off surface in the search space and is known as the Pareto front [12], [47].
FS has two conflicting objectives: reducing the feature size and reducing the error rate of a classifier. It is thus considered a multi-objective minimisation problem.
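The dominance relation for the two FS objectives can be illustrated with a small sketch. Each candidate is a hypothetical (feature size, error rate) pair, both to be minimised; the numbers are made up for illustration and are not results from the paper.

```python
def dominates(y, z):
    """True if y is no worse than z in every objective and strictly better in one.

    All objectives are assumed to be minimised.
    """
    no_worse = all(a <= b for a, b in zip(y, z))
    strictly_better = any(a < b for a, b in zip(y, z))
    return no_worse and strictly_better


def nondominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical (feature size, error rate) pairs.
candidates = [(10, 0.20), (12, 0.15), (12, 0.25), (30, 0.15), (8, 0.30)]
front = nondominated(candidates)
print(front)
```

Here (12, 0.25) is dropped because (10, 0.20) is better in both objectives, and (30, 0.15) is dropped because (12, 0.15) reaches the same error with fewer features.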

III. THE PROPOSED BCOA FILTER-BASED APPROACHES
This section presents the proposed filter-based approaches. The first is a single-objective filter-based approach that uses gain ratio based-entropy and MI as the fitness evaluation measures. The second is a multi-objective approach based on the NSGAII and NSGAIII frameworks.

A. BCOA FILTER-BASED SINGLE-OBJECTIVE APPROACH
Two filter-based BCOA algorithms, BCOA-MI and BCOA-E, which use MI and gain ratio based-entropy as the respective evaluation criteria, are proposed in this section. The details of both BCOA-MI and BCOA-E are depicted in Algorithm 2.

1) BCOA-MI
MI measures the relationship between a pair of features, as well as between each feature and the target class. The aim is to choose highly relevant features and eliminate the most redundant ones. The majority of studies that address feature interaction between pairs of features use MI, as in Eq. (14). In Eq. (14), X and Y stand for the distinct binary feature subsets, M is the feature size, and C is the target class label. Rel_mi applies a pairwise method to compute the MI relevance between every feature and the class label, and Red_mi measures the redundancy that remains in each pair of the chosen features. As such, Fit_mi in Eq. (14) is a maximisation function that maximises the relevance Rel_mi and simultaneously decreases the redundancy Red_mi of the selected features.
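A relevance-minus-redundancy fitness of this kind can be sketched as follows. Since Eq. (14) is not reproduced in this text, the weighted form Fit = β × Rel − (1 − β) × Red, with Rel summed over feature-class MI and Red summed over feature-pair MI, is an assumption modelled on similar MI-based filters; `fit_mi` and the toy columns are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Empirical Shannon entropy (base 2)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def fit_mi(columns, mask, labels, beta):
    """Assumed fitness: beta * relevance - (1 - beta) * pairwise redundancy."""
    selected = [col for col, bit in zip(columns, mask) if bit]
    relevance = sum(mutual_information(col, labels) for col in selected)
    redundancy = sum(mutual_information(a, b)
                     for a, b in combinations(selected, 2))
    return beta * relevance - (1 - beta) * redundancy


labels = [0, 0, 1, 1, 0, 0, 1, 1]
columns = [
    [0, 0, 1, 1, 0, 0, 1, 1],   # informative feature
    [0, 0, 1, 1, 0, 0, 1, 1],   # exact duplicate of the first (redundant)
    [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated to the class in this sample
]

single = fit_mi(columns, [1, 0, 0], labels, beta=0.9)
with_duplicate = fit_mi(columns, [1, 1, 0], labels, beta=0.9)
print(single, with_duplicate)
```

Under this assumed form, a smaller β weights the redundancy term more heavily, so duplicated features are penalised more strongly as β decreases.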

2) BCOA-E
In contrast to Fit_mi, Fit_E is employed to compute the relevance and the redundancy among a group of features, not only between pairs of features. The fitness function is given in Eq. (15), where Rel_E estimates the gain ratio of the features in X. Fit_E is also a maximisation function that maximises the relevance Rel_E and simultaneously reduces the redundancy Red_E of the selected subset of features.
From Algorithm 2, one can observe that Eq. (1) is used to initialise the habitats for each dataset. Unwanted features, identified through the fitness functions in Eq. (14) and Eq. (15), are removed. This happens mainly when the population in the worst area is killed because its profit is below the maximum value; otherwise, it retains some profit value. The nest with the best survival rate (feature subset) can then move to the best environment using Eq. (5) and Eq. (6). The ELR is calculated using Eq. (2). The steps above are repeated until the best solution with the highest-ranked features is returned. A classifier is then employed to compute the error rate.
The time complexity of the relevance and redundancy seen in Eq. 14
This section presents two different but related multi-objective optimisation algorithms that apply the ideas of the NSGAII and NSGAIII frameworks in BCOA, which leads to the BCNSG2 and BCNSG3 methods. In each of the proposed methods, MI and gain ratio based-entropy are used as the evaluation measures, giving a total of four multi-objective filter-based algorithms (BCNSG2MI, BCNSG3MI, BCNSG2E and BCNSG3E). The details of the algorithms are presented in the following subsections.

1) BCNSG2MI AND BCNSG2E
The experiments on BCOA-MI and BCOA-E showed that both MI and gain ratio based-entropy are effective evaluation measures for filter-based FS. However, the weights employed in their fitness functions need to be pre-defined. Therefore, based on BCOA, we developed filter-based multi-objective FS algorithms using NSGAII with MI (BCNSG2MI) and with entropy (BCNSG2E), with the aim of minimising the feature size and maximising the relevance of the features to their class label, in order to discover the Pareto front of the FS problem. The pseudocode for BCNSG2MI and BCNSG2E is depicted in Algorithm 3.
COA and its binary version were initially meant to deal with single-objective optimisation problems. The most significant task in extending COA to multi-objective optimisation is to determine a good goal habitat for each cuckoo from the group of possible nondominated solutions. Reference [13] introduced a popular multi-objective optimisation technique known as NSGAII, which researchers have since used widely to solve multi-objective optimisation problems. For instance, [51] used the concept of NSGAII with PSO to develop a multi-objective PSO. Then, [10] and [12] used that idea to solve filter-based multi-objective FS problems using BPSO. Similarly, the work of [38] employed GA for single-objective FS and PSO for multi-objective FS. However, other EC-based techniques such as COA have been reported to converge faster and perform better than many other EC techniques, yet their potential for multi-objective optimisation and feature selection has not been fully investigated.
Therefore, in this study, a BCOA multi-objective framework for FS based on NSGAII is presented. Two filter-based multi-objective FS algorithms are developed: BCNSG2MI and BCNSG2E. BCNSG2MI uses Rel_mi, whereas BCNSG2E uses Rel_E, to assess the relevance between a pair of features and their target class.
How the multi-objective filter-based algorithms (BCNSG2MI and BCNSG2E) work is depicted in Fig. 1. The core idea is to use the nondominated sorting of Phase VII to choose the best cuckoo for each habitat and to update the nonDomCOAList during the evolutionary process. As displayed in Fig. 1, in every iteration, the algorithms start by identifying the nondominated feature subsets in the nonDomCOAList and computing the crowding distance; all the nondominated feature subsets are then sorted by crowding distance in Phase II. In Phase III, a random cuckoo is chosen from the least crowded solutions, which form the highest-ranked part of the sorted nondominated solutions. All the habitats in the nonDomCOAList are copied to a union in Phase IV. After determining the best habitat, a new position for each cuckoo's next habitat is calculated according to Eq. (5) and Eq. (6) and added to the union in Phase V. In Phase VI, the two objective functions of each habitat are evaluated, where the relevance is assessed by Rel_mi in BCNSG2MI and Rel_E in BCNSG2E.
The nondominated sorting procedure is shown in Phase VII. Specifically, the nondominated solutions in the union are called the first nondominated front and are then removed from the union. Next, the nondominated solutions in the remaining union are called the second nondominated front, and so on. The subsequent nondominated fronts are identified by repeating this process. Finally, Phase VIII updates the nonDomCOAList for the next iteration. Specifically, habitats are chosen from the top nondominated fronts, beginning with the first front. If the number of solutions required exceeds the number of solutions in the current nondominated front, all the solutions of that front are carried into the next iteration. Phases II to VIII are repeated until the stopping condition is satisfied. The algorithm then returns the first nondominated Pareto front in the union.
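The front-peeling procedure of Phase VII and the crowding distance of Phase II can be sketched as below. This is a simple quadratic-time illustration of the NSGAII ideas, not the authors' implementation; the union of (feature size, error rate) pairs is hypothetical and both objectives are minimised.

```python
def dominates(y, z):
    """y dominates z when it is no worse everywhere and better somewhere."""
    return (all(a <= b for a, b in zip(y, z))
            and any(a < b for a, b in zip(y, z)))


def sort_into_fronts(points):
    """Peel off successive nondominated fronts; returns lists of indices."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


def crowding_distance(points, front):
    """NSGAII-style crowding distance for the indices in one front."""
    distance = {i: 0.0 for i in front}
    for m in range(len(points[0])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        # Boundary solutions are always kept, so they get infinite distance.
        distance[ordered[0]] = distance[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            distance[cur] += (points[nxt][m] - points[prev][m]) / (hi - lo)
    return distance


union = [(10, 0.20), (12, 0.15), (12, 0.25), (30, 0.15), (8, 0.30), (20, 0.10)]
fronts = sort_into_fronts(union)
distances = crowding_distance(union, fronts[0])
print(fronts)
print(distances)
```

A larger crowding distance marks a solution in a less crowded region of the front, which is the kind of solution Phase III prefers when picking a cuckoo.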

2) BCNSG3MI AND BCNSG3E
In the previous subsection, NSGAII was used with filter-based BCOA for multi-objective FS. Although NSGAII performs well with PSO, GA and BCOA, it does not use reference points; instead, it relies on the crowding distance and mutation operators. Moreover, the crowded comparison can restrict the convergence of NSGAII. Considering these limitations, [50] proposed the more robust NSGAIII.
In contrast to NSGAII, NSGAIII maintains diversity among population members by supplying, and adaptively updating, a number of well-spread reference points. As such, another multi-objective BCOA for filter-based FS that uses the nondominated sorting concepts of NSGAIII is also presented. On this basis, two further filter-based multi-objective FS algorithms are developed, BCNSG3MI and BCNSG3E, which use Eq. (14) and Eq. (15), respectively.
As shown in Fig. 2, the proposed multi-objective BCNSG3MI and BCNSG3E consist of eleven related steps. The main idea is to use the nondominated sorting of NSGAIII in BCOA to select the best cuckoo habitat for feature selection. At the end of each iteration, the proposed algorithm performs Steps I to IV.
Step I initialises each habitat with some features from a dataset and computes the reference points. In Step II, the fitness functions of the proposed approach are evaluated for MI and gain ratio based-entropy using Eq. (14) and Eq. (15), respectively. The relevance is evaluated using Rel_mi and Rel_E.
Step III generates the initial population using the ideas of COA and BCOA. In Step IV, the nondominated population sorting mechanism is employed to identify the various levels of the Pareto fronts in the union. If the maximum number of iterations has not been reached, the algorithm continues to the next stage. Since the iteration counter always starts at zero, the algorithm always proceeds to the next stage at the beginning.
Step V applies tournament selection and two-parent crossover with a given probability. The sorting of Step IV is then repeated in Step VI, while Step VII finds the reference points and associates the solutions with them.
Step VIII applies niche preservation, and Step IX stores the solutions obtained through niching for the next generation using the BCOA concept in Eq. (5) and Eq. (6).
Step X returns the optimum solution of the selected features. Finally, in Step XI, a classifier is employed to measure the error rate of the chosen features.
Steps V to VIII are repeated until the maximum number of iterations is reached. The detailed pseudocode is shown in Algorithm 4.
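The reference-point and association steps can be illustrated for the two-objective case with the sketch below. It is a simplification under stated assumptions: reference points are spread evenly on the line f1 + f2 = 1, objectives are rescaled by plain min-max normalisation (NSGAIII itself uses an ideal point and extreme-point intercepts), and each solution is attached to the reference direction with the smallest perpendicular distance. The `divisions` parameter, the function names and the sample front are hypothetical.

```python
import math


def reference_points_2d(divisions):
    """Evenly spaced points on the line f1 + f2 = 1 (two-objective case)."""
    return [(i / divisions, 1 - i / divisions) for i in range(divisions + 1)]


def normalise(points):
    """Min-max rescaling of each objective to [0, 1] (a simplification)."""
    mins = [min(p[m] for p in points) for m in range(2)]
    maxs = [max(p[m] for p in points) for m in range(2)]
    return [tuple((p[m] - mins[m]) / (maxs[m] - mins[m])
                  if maxs[m] > mins[m] else 0.0
                  for m in range(2))
            for p in points]


def perpendicular_distance(point, ref):
    """Distance from point to the line through the origin and ref."""
    scale = (point[0] * ref[0] + point[1] * ref[1]) / (ref[0] ** 2 + ref[1] ** 2)
    return math.hypot(point[0] - scale * ref[0], point[1] - scale * ref[1])


def associate(points, refs):
    """Index of the closest reference direction for each point."""
    return [min(range(len(refs)),
                key=lambda r: perpendicular_distance(p, refs[r]))
            for p in points]


# Hypothetical nondominated (feature size, error rate) pairs.
front = [(8, 0.30), (10, 0.20), (12, 0.15), (20, 0.10)]
refs = reference_points_2d(divisions=4)
niches = associate(normalise(front), refs)
print(refs)
print(niches)
```

Niche preservation would then favour solutions attached to reference directions that currently have few members; in this example the direction with index 3 has none.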

Algorithm 4 Proposed BCNSG3MI and BCNSG3E
1: Begin
2: Divide the dataset into training set and test set;
3: Initialise the habitat;
4: Evaluate the two fitness values of each cuckoo, i.e., the feature size and the relevance (Rel_mi of Eq. 14 in BCNSG3MI and Rel_E of Eq. 15 in BCNSG3E), on the training set*
5: Allow the cuckoos to lay their eggs in their corresponding ELR;
6: Identify the cuckoos in the nondominated solutions;
7: Compute the reference points of the cuckoos and generate the initial population;
8: Apply the nondominated population sorting mechanism;
9: WHILE maximum iteration is not reached DO
10: Apply tournament selection and two-parent crossover with a given probability;
11: Again, apply the nondominated population sorting mechanism on the cuckoos;
12: Apply normalization on the population;
13: Find the reference points and the solutions associated with them based on the associate procedure;
14: Apply the niche preservation (niche procedure);
15: Keep the niche obtained solutions for the next generations;
16
O(log 2 n). Also, the use of NSGAII and NSGAIII makes the computation more complex because of the nondominated sorting and the external archive. However, NSGAIII is computationally less demanding than NSGAII, since it has a more concise way to renew or select individuals and uses reference points.

A. DATASETS
The standard datasets employed in the experiments are displayed in Table 1. The datasets are obtained from the popular repository in [52]. They have different numbers of features, instances and classes, and varying degrees of difficulty. For example, the Lymphography dataset has the smallest number of both features and instances, whereas Madelon has the largest number of features and Coil2000 the largest number of instances.
In the experiments, the instances of each dataset are split randomly into a training set and a test set, with 70% of the instances used for training and 30% for testing. The proposed algorithms first run on the training set to choose the feature subsets; the error rate of the chosen features is then computed on the test set using a classification algorithm. There is a wide variety of classification algorithms, such as SVM, KNN, GNB and DT, among others. In this paper, SVM is chosen because of its popularity and proven record in computing classification accuracy in different studies.
The SVM computes the error rate of the selected features in the multi-objective approach using Eq. (16):

$$\text{ErrorRate} = \frac{FP + FN}{TP + TN + FP + FN} \quad (16)$$

where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively.
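The evaluation protocol can be sketched in a few lines: a random 70/30 split of instance indices, and the error rate taken as the share of misclassified test instances, (FP + FN) / (TP + TN + FP + FN). The dataset size, the seed and the confusion counts below are hypothetical, and the classifier itself is left out.

```python
import random


def split_indices(n_instances, train_fraction=0.7, seed=0):
    """Randomly split instance indices into training and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n_instances))
    return indices[:cut], indices[cut:]


def error_rate(tp, tn, fp, fn):
    """Share of misclassified instances: (FP + FN) / (TP + TN + FP + FN)."""
    return (fp + fn) / (tp + tn + fp + fn)


train_idx, test_idx = split_indices(100)

# Hypothetical confusion counts for a classifier evaluated on a test set.
rate = error_rate(tp=40, tn=45, fp=5, fn=10)
print(len(train_idx), len(test_idx), rate)
```

Feature selection would run only on the rows in `train_idx`, with the rows in `test_idx` held back for the final error-rate estimate.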

1) EXPERIMENTAL PARAMETER SETTINGS
The parameter settings used for the proposed BCOA-MI, BCOA-E, BCNSG2MI, BCNSG2E, BCNSG3MI and BCNSG3E algorithms are chosen based on the work of [45] and [16], where the initial and maximum population sizes are set to five and twenty, respectively. In addition, all the proposed algorithms are run 40 independent times on each dataset.
In the single-objective filter-based approaches, BCOA-MI and BCOA-E use five different values of β1 and β2 (0.9, 0.8, 0.75, 0.6 and 0.5) in the experiments for each dataset, where β1 is the weight for BCOA-MI and β2 the weight for the gain ratio based-entropy (BCOA-E). In addition, the Wilcoxon rank-sum test was conducted on BCOA-MI and BCOA-E, with 0.05 as the level of significance, to check whether the methods with different values of β differ significantly from the full-length features. If the p-value is less than 0.05, the difference between the proposed method and the full-length features is significant at the 95% confidence level.
Following [10], [49] and [12], both BCNSG2 (BCNSG2MI and BCNSG2E) and BCNSG3 (BCNSG3MI and BCNSG3E) use a mutation rate of 1/n, where n is the total number of features in each dataset. The crossover probability is set to 0.5. The number of reference points for BCNSG3 is set to 15, following [50].
The multi-objective algorithms (BCNSG2 and BCNSG3) obtain a set of nondominated solutions in each run. The 40 sets of solutions obtained by each multi-objective algorithm are merged into a single union set. The union set contains the feature subsets, described by their feature size and respective error rate. The average solution set (named the average Pareto front) is obtained by averaging the classification error of the solutions that share the same number of features. Apart from the average Pareto front, the nondominated solutions in the union set are also presented in the following section.
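The aggregation over runs can be sketched as follows, under the reading that the average front groups the union set by feature size and averages the error rates within each group, while the second front keeps only the nondominated members of the union. The union below is a made-up handful of (feature size, error rate) pairs rather than 40 real runs.

```python
from collections import defaultdict


def average_front(union):
    """Mean error rate for each distinct feature size in the union set."""
    errors_by_size = defaultdict(list)
    for size, error in union:
        errors_by_size[size].append(error)
    return sorted((size, sum(errs) / len(errs))
                  for size, errs in errors_by_size.items())


def dominates(y, z):
    """Both objectives (size, error) are minimised."""
    return (all(a <= b for a, b in zip(y, z))
            and any(a < b for a, b in zip(y, z)))


def nondominated_front(union):
    """Distinct solutions of the union set that no other solution dominates."""
    unique = sorted(set(union))
    return [s for s in unique if not any(dominates(o, s) for o in unique)]


# Hypothetical solutions pooled from several runs.
union = [(5, 0.20), (5, 0.30), (8, 0.10), (8, 0.20), (12, 0.15)]
avg = average_front(union)
best = nondominated_front(union)
print(avg)
print(best)
```

In this toy union the averaged point for size 8 dominates the one for size 12, which shows how members of an average front can dominate each other even though every run returned only nondominated solutions.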

V. RESULTS AND DISCUSSION
This section presents the outcomes of the experiments. Tables 2 and 3 display the results of the filter-based single-objective algorithms (BCOA-MI and BCOA-E) with varying weights in their respective fitness functions. Similarly, Figures 3 and 4 display the results of the multi-objective filter-based methods, as well as the comparison between the NSGAII-based (BCNSG2MI and BCNSG2E) and NSGAIII-based (BCNSG3MI and BCNSG3E) algorithms.

A. RESULTS OF THE SINGLE-OBJECTIVE FILTER-BASED APPROACH BCOA-MI AND BCOA-E
The experimental outcomes are presented in Tables 2 and 3. In the tables, ''Ave Size'' denotes the mean number of features selected by each algorithm over the 40 independent runs. ''Ave-Acc'' and ''Best-Acc'' denote the mean accuracy and best accuracy, respectively. ''Std Dev'' is the standard deviation of the 40 error rates. The outcome of the Wilcoxon rank-sum test is denoted ''Sig Test'', where ''+'' or ''−'' indicates that the classification performance of BCOA-MI or BCOA-E is significantly better or worse than that of the full-length features, while ''='' indicates no significant difference.
Generally, the outcomes show that BCOA-MI performed considerably well on the average size of selected features across all datasets, with nearly 75% of the features removed, whereas BCOA-E did well on accuracy. This indicates that both BCOA-MI and BCOA-E can substantially reduce the feature size and achieve the same or better classification performance compared to the full-length features.
Tables 2 and 3 show that the higher the values of β1 and β2, the better the accuracy on each dataset. With a larger β, relevance is weighted more heavily than redundancy, which leads to higher accuracy. Nevertheless, the difference is sometimes negligible. For example, on the Lymphography dataset in Table 2, when β1 = 0.9 and β1 = 0.8, the best classification error rates are 14.00% and 16.00%, respectively, whereas on the Dermatology dataset in Table 3, the best classification error rate remains 2.20% for both β2 = 0.9 and β2 = 0.8. In either case, the number of features is reduced by around 60-70%.
On the other hand, the results in Tables 2 and 3 also show that the higher the values of β1 and β2, the larger the number of chosen features. The decrease in feature size is around 40% of the full feature set. Moreover, the accuracy improves as β1 and β2 increase on all the datasets. The outcomes indicate that suitable weight values can improve the performance of the classifier, particularly with smaller feature subsets than the full-length features.
Comparing Tables 2 and 3, it is clear that β2 (BCOA-E) performed better than β1 (BCOA-MI) in terms of error rate on each dataset, but worse in terms of the number of chosen features, and possibly required the longest computation time. In contrast, β1 did well on the number of chosen features and has a shorter computation time, since it works only with pairs of features.
In either case, using β1 and β2 with suitable values in the fitness functions can yield fewer features with improved classification performance compared to the full-length features. Therefore, BCOA-MI and BCOA-E with β1 and β2 values of 0.5 and 0.9 are used for comparison in the following section to assess the performance of the proposed filter-based multi-objective FS algorithms.

B. RESULTS OF THE MULTI-OBJECTIVE FILTER-BASED APPROACH
The experimental results of the filter-based BCOA-MI and BCOA-E with different weights show that MI and entropy are useful criteria for filter-based FS. Nonetheless, the weights in the fitness functions of BCOA-MI and BCOA-E need to be pre-defined. In this section, filter-based multi-objective FS using the same concepts of MI and entropy is considered. The main objectives are to minimise the feature size and to maximise the relevance between the features and their target class label. By doing so, the algorithms are expected to discover the Pareto front of the FS problem.
The results obtained from the experiments with BCNSG2, BCNSG3 and BCOA are depicted in Figs. 3 and 4. At the top of each chart is the name of the dataset, followed in brackets by the total number of features and the error rate of the SVM classifier using all the features. In each chart, the x-axis displays the feature size, whereas the y-axis displays the error rate of the SVM classifier. The legend of each chart contains three kinds of entries: those ending with ''−1'' and ''−2'' represent the average nondominated solutions and the Pareto front over the 40 independent runs, respectively, and the last is BCOA-MI (with βmi = 0.5 or βmi = 0.9) or BCOA-E (with βE = 0.5 or βE = 0.9), which represents the 40 solutions obtained by the single-objective filter-based feature selection algorithms with MI and entropy.
On some datasets, BCOA-MI and BCOA-E evolve the same feature subset in different runs, which appears as a single point in the chart. Therefore, although 40 independent runs were performed, fewer than 40 distinct points are shown in each chart. Similarly, the set of nondominated solutions (''−2'') may contain identical feature subsets that are plotted at the same point.

1) RESULTS OF BCNSG3MI AND BCNSG2MI
The Pareto front solutions obtained by BCNSG2MI and BCNSG3MI in the objective space, where MI is employed as the evaluation measure, are shown in Fig. 3. In any multi-objective filter-based FS method, the quality of the Pareto front is assessed by its error rate on unseen test data, and the same applies to these experiments. Thus, the solutions used in Fig. 3 are the Pareto solutions obtained in the MI space, whereas the error rates displayed in the charts were evaluated using SVM on the test data.
Fig. 3 also compares the results obtained by BCNSG2MI, BCNSG3MI and BCOA-MI with βmi = 0.5 and βmi = 0.9, which use MI to assess the relevance and the redundancy among pairs of features. Both BCOA-MI and BCOA-E may evolve similar feature subsets in different runs on some datasets, which are then plotted at the same point. Hence, even though 40 results are presented, fewer than 40 distinct points are likely to appear in each chart. Likewise, the BCNSG2MI and BCNSG3MI nondominated solutions may contain identical feature subsets that are displayed at the same point. The same applies to the charts in Fig. 4.

a: RESULTS OF BCNSG3MI
The results displayed in Fig. 3 indicate that BCOA-MI was able to remove about 70% of the features on almost all the datasets. The classification error is mostly moderate or low on the Splice, Leddisplay, Chess (KrvskpEW), Optic, Audiology, Dermatology and Madelon datasets, while it is quite high on the Connect4, Promoter and Spect datasets. BCNSG3MI-2 denotes the average Pareto front, whereas BCNSG3MI-1 represents the nondominated solutions obtained from the 40 independent runs mentioned earlier.
For BCNSG3MI-1, the charts show that the nondominated solutions contain one or more feature subsets that select almost half of the total features and yet achieve a lower error rate than the full-length features. A typical example is the DNA dataset, where a single nondominated solution selected 58 of the 180 features and reduced the error rate from 17.22% to 10.75%. Similar behaviour can be seen in the charts of the other datasets.
For BCNSG3MI-2, the charts show that there are two or more solutions that select fewer features and yet attain a lower error rate than the full-length features. In most cases, for the same feature size there exist various combinations of features with different error rates, so the feature subsets obtained in different runs may have different error rates for the same feature size. Thus, some of the solutions in the average Pareto front may dominate others, even though the solutions obtained in each run are nondominated.
The results indicate that BCNSG3MI is a multi-objective algorithm that can automatically derive feature subsets that decrease the feature size and improve the performance of the classifier.
BCNSG3MI achieved a lower classification error rate than BCOA-MI on the majority of the datasets. Although BCOA-MI with βmi = 0.5 and βmi = 0.9 selected fewer features than BCNSG3MI-1 on a few datasets, the majority of the solutions in BCNSG3MI-2 selected fewer features and still achieved improved performance. Therefore, the comparisons indicate that, using MI as the evaluation measure, the proposed filter-based multi-objective FS algorithm (BCNSG3MI) outperformed the filter-based single-objective feature selection algorithm BCOA-MI with both βmi = 0.5 and βmi = 0.9.

b: RESULTS OF BCNSG2MI
In Fig. 3, the average Pareto fronts of BCNSG2MI (BCNSG2MI-1) contain two or more solutions that select fewer features and attain comparable or improved performance compared to the full-length features on all the datasets. For example, similar performance was recorded for BCNSG2MI-1 and BCNSG2MI-2 at a few data points on the Chess dataset. On the majority of the datasets, BCNSG2MI-2 selects feature subsets containing almost half of the total features and obtains a lower error rate than the full-length features. For instance, on the Splice dataset, BCNSG2MI-2 selects 17 of the 60 features, and the error rate is reduced from 30.25% to 20.00%. Similar results are achieved on the other datasets.
This shows that BCNSG2MI, as a filter-based multi-objective optimisation algorithm, can automatically discover the Pareto front of an FS problem and minimise both the error rate and the feature size required for classification.

c: COMPARISONS AMONG BCNSG3MI, BCNSG2MI AND BCOA-MI
Comparing the results of BCNSG2MI with those of BCOA-MI, in most cases BCNSG2MI (BCNSG2MI-2) obtained better classification performance than BCOA-MI. For example, the charts in Fig. 3 show that BCNSG2MI-2 outperformed BCOA-MI with βmi = 0.5 and βmi = 0.9 on all the datasets except the Leddisplay, Madelon and Optic datasets. Moreover, BCOA-MI with βmi = 0.5 performed better than BCNSG2MI-2 on the Soyabeans Large dataset. In most cases, BCNSG2MI-2 also selected fewer features than BCOA-MI with βmi = 0.5 and βmi = 0.9, except on the Connect4 and Promoter datasets, where BCOA-MI selected fewer features but with an unpromising error rate.
The comparison suggests that, with MI in the fitness function, obtaining the best classification performance usually requires more features. However, there are occasionally feature subsets with fewer features that attain better classification performance. Also, both BCNSG2MI and BCNSG3MI can obtain a set of nondominated solutions that use fewer features and achieve the best results. Thus, BCNSG2MI and BCNSG3MI, as filter-based multi-objective optimisation algorithms, can search the solution space better than the single-objective algorithm BCOA-MI.
Fig. 4 displays the Pareto front solutions achieved by BCNSG2E and BCNSG3E in the entropy space. Their error rates shown in the charts, however, were assessed by SVM on the test data. Fig. 4 compares the results obtained by BCNSG2E, BCNSG3E and BCOA-E with βE = 0.5 and βE = 0.9, which use entropy to estimate the relevance and redundancy of a set of features, in contrast to MI, which evaluates pairs of features.

a: RESULTS OF BCNSG3E
From Fig. 4, BCOA-E with βE = 0.5 and βE = 0.9 removed almost 70% of the total features on most of the datasets and yet attained equivalent or higher classification performance than using the full-length features.
From the results, one can observe that BCNSG3E-2 performs well on all the datasets. It includes one or more solutions that select fewer features and yet attain better performance than the full feature set. BCNSG3E-1, in turn, attained a better classification error rate on the majority of the datasets. Correspondingly, the feature size is reduced drastically, by almost 50%, on all the datasets. For instance, on the Soyabeans Large dataset, the feature size is reduced from 35 to 17.5 (exactly a 50% reduction) and the error rate from 9.05% to 5.00%. The other datasets in Fig. 4 show the same trend.
This result suggests that the proposed BCNSG3E algorithm can derive feature subsets that improve the performance of the classifier while decreasing the feature size.
Comparing BCNSG3E with BCOA-E with βE = 0.5 and βE = 0.9, one can observe that BCNSG3E achieved a lower classification error rate on all the datasets except Connect4, where BCOA-E with βE = 0.5 performed better than BCNSG3E. Moreover, similar results were observed on the Dermatology, Leddisplay and Chess datasets. BCNSG3E also selected fewer features than BCOA-E on all the datasets.
Therefore, using gain ratio based-entropy as the evaluation measure, the proposed filter-based multi-objective FS algorithm (BCNSG3E) can attain better solutions than the filter-based single-objective algorithm (BCOA-E with both βE = 0.5 and βE = 0.9).

b: RESULTS OF BCNSG2E
According to the results in Fig. 4, the average Pareto fronts of BCNSG2E (BCNSG2E-1) contain many solutions that selected fewer features and achieved a better error rate than the full feature set. On most of the datasets, BCNSG2E-2 was able to minimise the error rate while choosing nearly half of the total features. On the Promoter dataset, for example, BCNSG2E decreased the error rate from 8.65% to 0.00% by picking just 22 of the 57 features.
Moreover, the results indicate that the proposed BCNSG2E, with entropy as the evaluation criterion, can successfully choose a subset of features that concurrently decreases the feature size and improves the classification performance relative to the full-length features.

c: COMPARISONS AMONG BCNSG3E, BCNSG2E AND BCOA-E
Comparing BCNSG2E with BCOA-E, in most cases BCNSG2E (BCNSG2E-2) attained better results than BCOA-E with both βE = 0.5 and βE = 0.9, although its feature size is slightly larger in a few cases. BCNSG2E still outperformed BCOA-E, since improving the error rate is considered more important than reducing the feature size.
Furthermore, comparing BCNSG3E with BCOA-E, it can be observed that, on the majority of the datasets, BCNSG3E selected fewer features and obtained a better result than BCOA-E with both βE = 0.5 and βE = 0.9. A nearly identical result was achieved on the Chess dataset, where BCNSG3E matched the results of BCOA-E with both βE = 0.5 and βE = 0.9.
These comparisons show that, using entropy as the evaluation measure, the proposed filter-based multi-objective FS algorithms (BCNSG2E and BCNSG3E) explore the search space well and reach better solutions than the single-objective FS algorithm (BCOA-E).

3) COMPARISONS BETWEEN PROPOSED MULTI-OBJECTIVE APPROACHES
To keep the comparison fair, this study follows the pattern of the existing works in [10], [11], in which a comparison is first made with the single-objective approach, then among the proposed methods, and lastly with the state-of-the-art approaches (if any exist). Accordingly, this study first compared the proposed multi-objective filter-based approaches BCNSG2 (BCNSG2MI and BCNSG2E) and BCNSG3 (BCNSG3MI and BCNSG3E) with the single-objective approaches (BCOA-MI and BCOA-E). The proposed multi-objective approaches are then compared with one another according to their evaluation measures: BCNSG2MI versus BCNSG3MI, since both use MI as the filter evaluation measure, and BCNSG2E versus BCNSG3E, since both use gain ratio based entropy. It would not be strictly fair to compare the MI-based with the entropy-based approaches. Nevertheless, relating the MI-based and entropy-based algorithms in Figs. 3 and 4, respectively, shows that BCOA-E, BCNSG2E and BCNSG3E mostly attained better classification performance, with a lower error rate, than BCOA-MI, BCNSG2MI and BCNSG3MI.
On the other hand, BCOA-MI chose fewer features than BCOA-E, because MI deals with pairs of features, whereas gain ratio based entropy deals with a group of features when computing both relevance and redundancy. This is why BCOA-E selects more features than BCOA-MI. The proposed multi-objective optimisation algorithms, in turn, select considerably fewer features than the single-objective algorithms. Thus, BCNSG2E and BCNSG3E, with entropy as the evaluation criterion, attain better results than BCNSG2MI and BCNSG3MI, with MI as the evaluation criterion, because the multi-way relevance and redundancy they use can improve both the classification performance and the number of selected features.
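The difference between a pairwise and a group-based measure can be illustrated with a minimal sketch on discrete data. It assumes plain Shannon entropy estimated from counts; `group_information_gain` is a simplified stand-in for a group-based relevance term and is not the gain ratio based entropy measure defined in the paper, and all function names are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a discrete sample, estimated from counts."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def joint(*columns: Sequence[Hashable]) -> List[tuple]:
    """Combine several columns into one column of tuples (their joint variable)."""
    return list(zip(*columns))


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Pairwise measure: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(joint(x, y))


def group_information_gain(features: Sequence[Sequence[Hashable]],
                           labels: Sequence[Hashable]) -> float:
    """Group measure (assumed form): H(c) + H(F) - H(F,c) for the feature group F."""
    group = joint(*features)
    return entropy(labels) + entropy(group) - entropy(joint(group, labels))


# Toy XOR-style data: each feature alone is uninformative, the pair is decisive.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
y = [0, 1, 1, 0]

print(mutual_information(f1, y))           # 0.0 bits
print(mutual_information(f2, y))           # 0.0 bits
print(group_information_gain([f1, f2], y)) # 1.0 bit
```

On this toy data a pairwise measure sees no relevance in either feature, while the group measure recognises that the two features together determine the class, which is the kind of feature interaction a group-based measure can capture at a higher computational cost.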
Comparing the algorithms in Figs. 3 and 4, one can observe that BCNSG3MI and BCNSG3E, based on the NSGAIII framework, outperformed BCNSG2MI and BCNSG2E, based on the NSGAII framework, on all the datasets in terms of both error rate and number of selected features. The results of BCNSG3MI and BCNSG3E are 10-20% better than those of BCNSG2MI and BCNSG2E on the majority of the datasets.
The results are not surprising, because NSGAII is reported to lack reference points; instead, it relies on the crowding distance and mutation operators. Also, a full crowded comparison can restrict the convergence of NSGAII [50]. In NSGAIII, by contrast, the maintenance of diversity among population members is supported by supplying and adaptively updating a number of well-spread reference points. This is why BCNSG3MI and BCNSG3E outperformed both BCNSG2MI and BCNSG2E: they search a better region of the solution space and attain the best classification performance using fewer features than all the other methods.
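The two diversity mechanisms contrasted here can be sketched as follows. The sketch assumes two minimised objectives that have already been normalised to [0, 1]; `crowding_distance` follows the usual NSGA-II definition and `reference_points_2d` generates evenly spaced points on the line f1 + f2 = 1, which is the two-objective case of the systematic placement commonly used with NSGA-III. Both function names are hypothetical, and the adaptive updating of reference points is not shown.

```python
from math import inf
from typing import List, Sequence, Tuple


def crowding_distance(front: Sequence[Sequence[float]]) -> List[float]:
    """NSGA-II style crowding distance for the solutions of one front."""
    n = len(front)
    distance = [0.0] * n
    if n == 0:
        return distance
    for k in range(len(front[0])):
        # Sort solution indices by the k-th objective.
        order = sorted(range(n), key=lambda i: front[i][k])
        low, high = front[order[0]][k], front[order[-1]][k]
        # Boundary solutions are always kept.
        distance[order[0]] = distance[order[-1]] = inf
        if high == low:
            continue
        for j in range(1, n - 1):
            gap = front[order[j + 1]][k] - front[order[j - 1]][k]
            distance[order[j]] += gap / (high - low)
    return distance


def reference_points_2d(divisions: int) -> List[Tuple[float, float]]:
    """Evenly spaced reference points on f1 + f2 = 1 (two-objective case)."""
    return [(i / divisions, 1 - i / divisions) for i in range(divisions + 1)]


front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
print(crowding_distance(front))   # [inf, 2.0, inf]
print(reference_points_2d(4))
```

Crowding distance only measures how isolated a solution is relative to its neighbours in the current front, whereas reference points fix target directions in advance, so each region of the objective space keeps a representative even when the front is unevenly populated.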

4) COMPARISON AMONG PROPOSED APPROACHES BASED ON TIME
Table 4 reports the average time, in seconds, spent by all the proposed algorithms. The four filter-based multi-objective algorithms are compared with the two filter-based single-objective methods, BCOA-MI (with βMI = 0.5 and βMI = 0.9) and BCOA-E (with βE = 0.5 and βE = 0.9).
Table 4 shows that the pairwise multi-objective algorithms, BCNSG2MI and BCNSG3MI, complete their evolutionary training process in less than four seconds on most of the datasets, the exceptions being Connect4, DNA and Madelon. The Madelon dataset took much longer than the other datasets because it has the largest number of features, and Connect4 took longer because of its large number of instances.
On the other hand, when applying the gain ratio based entropy (a group-based measure), the multi-objective algorithms BCNSG2E and BCNSG3E completed the evolutionary training process in around one minute on all the datasets except Madelon. There is only a small variation in the time spent by the multi-objective algorithms, with BCNSG3E the fastest. The single-objective algorithm BCOA-E, with βE = 0.5 and βE = 0.9, took almost ten times longer than the multi-objective algorithms on all the datasets. The reason is that, in the multi-objective algorithms, the feature size is computed as an objective in its own right, which takes less time than evaluating the redundancy measure RedB E in the fitness function of BCOA. In general, the entropy-based algorithms took longer than their MI-based counterparts.

5) DISCUSSIONS
In Figs. 3 and 4, the solutions plotted are the Pareto front solutions obtained under the filter-based evaluation measure, whereas the classification performance shown in the graphs was assessed using SVM on the test sets. Fig. 3 displays the Pareto fronts attained by BCNSG2MI and BCNSG3MI with MI as the evaluation criterion, and Fig. 4 displays the Pareto fronts attained by BCNSG2E and BCNSG3E with entropy as the evaluation measure.
It can be observed from Figs. 3 and 4 that some of the solutions in the average Pareto front (denoted by '−1') dominate others, although they are all nondominated solutions under the filter-based evaluation criterion. This confirms that the Pareto front in the filter-based evaluation space on the training set does not contain the same subsets as the Pareto front in the SVM-based evaluation on the test set, because the quality of a feature subset as assessed by MI or entropy on the training set does not necessarily reflect its actual quality on the test set.
Furthermore, even the true Pareto front found by an exhaustive search in the objective space of the two filter-based evaluation measures need not be the true Pareto front of the SVM-based evaluation on the test set. Feature subsets with similar filter-based results do not necessarily achieve a similar (good or poor) error rate on the unseen test set as assessed by SVM. Take two feature subsets as an example: both may have the same size but different combinations of features. The two subsets may have similar quality under the filter-based evaluation criterion on the training set, and are therefore mutually nondominated. However, if SVM or any other classifier is applied to assess their error rate on the unseen test set, their error rates may differ, and the subset with the better classification performance will dominate the other. The same holds for other filter-based criteria and other classifiers. As such, the Pareto front in the filter evaluation space is generally not the same as the Pareto front in the SVM-based evaluation.
In an ideal world, the algorithms would identify the true Pareto front in the filter-based evaluation space. However, it is not possible to carry out an exhaustive search on datasets with a huge feature size to determine the true Pareto fronts. The proposed multi-objective algorithms BCNSG3MI and BCNSG3E can identify the true Pareto fronts that an exhaustive search would obtain, whereas BCNSG2MI and BCNSG2E, to some extent, cannot. The main reason is that BCNSG2MI and BCNSG2E lack reference points and instead rely on the crowding distance and mutation operators; moreover, the full crowded comparison used in NSGAII restricts their convergence. BCNSG3MI and BCNSG3E attained the right Pareto front for the datasets with huge feature size.

6) COMPARISONS WITH OTHER EXISTING APPROACHES
To keep the comparison fair, only filter-based multi-objective FS approaches that use the concepts of nondominated sorting and information theory and that share datasets with this study are considered. For example, the work in [10] has eight datasets in common with this study and uses nondominated sorting and crowding distance, together with MI and entropy, all embedded in PSO. Similarly, the work in [11] has eleven datasets in common with this study and uses MI and ReliefF as filter evaluation measures in multi-objective DE.
The detailed comparison with the existing works is shown in the subsequent sections.

a: COMPARISONS WITH BPSO
To further investigate the performance of the proposed BCNSG3MI, BCNSG2MI, BCNSG3E and BCNSG2E algorithms, we first compared them with four multi-objective PSO filter-based feature selection methods: NSfsMI and NSfsE in [10], which are based on the nondominated sorting multi-objective PSO in [51], and CMDfsMI and CMDfsE, which are based on multi-objective PSO in [10]. All the eight datasets used in [10], except the Mushroom dataset, are compared with the results obtained in this study. The proposed approaches performed better than CMDfsMI and CMDfsE as well as NSfsMI and NSfsE on all the datasets, by around 5-15% and 15-20% in terms of both selected features and classification performance for BCNSG2MI and BCNSG2E, respectively. The proposed BCNSG3MI and BCNSG3E performed even better, with an almost 20-35% reduction in both error rate and selected features.
These comparisons indicate that multi-objective BCOA with both NSGAII and NSGAIII has more advanced search mechanisms and the potential to achieve even better performance.

b: COMPARISONS WITH DE
In addition, the results are compared with MODEmi and MODEmirf in [11] on the eleven datasets that are common to this study. MODEmirf performed much better than MODEmi on all the datasets. The proposed BCNSG2MI and BCNSG2E outperformed both MODEmi and MODEmirf on most of the datasets, except on the Leddisplay dataset, where they attained the same performance in both selected features and error rate. The proposed BCNSG3MI and BCNSG3E, in turn, outperformed both MODEmi and MODEmirf on all the datasets. Therefore, the proposed multi-objective approaches have the potential to evolve Pareto front feature subsets automatically, simultaneously selecting the smallest and most relevant feature sets and attaining better results than the existing methods.

c: COMPARISON WITH EXISTING METHODS BASED ON TIME
The existing methods are also compared based on CPU execution time (in seconds), as shown in Table 5.

7) LIMITATION OF THE PROPOSED METHODS
This paper presents the first study of filter-based multi-objective FS using the concepts of NSGAII and NSGAIII with BCOA, along with MI and gain ratio based entropy as the filter-based evaluation measures. Although the results obtained are competitive with other existing works, the proposed methods have the following limitations:
1) The crowding-distance strategy of BCNSG2MI and BCNSG2E is restricted to solutions in the same front and cannot exhibit the real superiority within that front. Hence, there is a need to improve the dummy fitness strategy while considering the crowding across different fronts.
2) Although the standard NSGA-II algorithm uses the crowding distance-based method to maintain solution diversity, a limitation of this approach, according to [63], is that it selects two nearby solutions from the Pareto front for mating. As such, the proposed BCNSG2MI and BCNSG2E may sometimes fail to preserve extreme solutions in the Pareto front.
3) Using reference points in NSGA-III makes it difficult to maintain the diversity of the solutions in discrete multi-objective optimisation problems [64]. Therefore, the proposed BCNSG3MI and BCNSG3E were likely unable to link all reference points, especially the best reference points in each objective.
4) The use of gain ratio based entropy as the filter-based evaluation measure provides better solutions than its MI counterpart. However, gain ratio based entropy is computationally expensive. Using other, faster filter-based measures that can evaluate a group of features at a time may solve this problem.
5) Rajabioun [16] stated that ''it should be noted that the higher performance of COA in reaching better results for these five benchmark functions and a real case study does not necessarily mean that COA is the ever best evolutionary method developed. It just can be considered as a successful mimicking of nature; suitable for some sort of optimisation problems.'' As such, other EAs may be fine-tuned and used for filter-based multi-objective FS.

C. RESULTS ANALYSIS
The results show that information theory concepts can be used successfully as filter-based evaluation measures with BCOA to select fewer features and achieve better classification performance. Relevance measures how informative the selected features are about the class labels and serves as the proxy for classification performance. Redundancy among the chosen features, on the other hand, serves as the proxy for the number of selected features. On this basis, two different relevance and redundancy measures are established: a pairwise measure based on MI and a group-based measure using gain ratio based entropy.
With the pairwise measure, BCOA-MI is faster than its BCOA-E counterpart and its best solutions comprise fewer features, whereas the classification performance favours BCOA-E. The reason is that BCOA-MI uses pairwise evaluation to measure the relationship between two features, which involves no complex computation of relevance and redundancy. It therefore does not capture complex interactions among a group of features, which are considered a challenge in FS problems.
BCOA-E, using the group-based measure, is slower but records better classification performance and chooses more features than BCOA-MI. The reason is that it deals with subsets, or groups, of features when computing relevance and redundancy. It considers the selected features as a whole, which captures feature interaction better and consequently improves classification performance.
A weight β was employed to balance relevance (accuracy) and redundancy (selected features) in the fitness functions of both BCOA-MI and BCOA-E. It is challenging to determine the best value of β in advance, because a large weight on redundancy in BCOA-MI or BCOA-E may reduce the feature size at the cost of classification performance, whereas a large weight on relevance may improve classification performance at the cost of a larger number of selected features. To avoid this problem, relevance and redundancy are treated as two separate objectives in a multi-objective FS. It is hypothesised that this solves the task better and yields a set of nondominated solutions instead of a single solution, where the resulting Pareto front can help users choose the solutions that meet their requirements.
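The sensitivity to β can be seen in a small sketch. It assumes a weighted-sum fitness of the form β·Rel − (1 − β)·Red, which is a common way of combining relevance and redundancy and is used here only as an illustration, since the exact fitness function is defined elsewhere in the paper. The relevance and redundancy scores of the two candidate subsets are hypothetical.

```python
from typing import Dict, Tuple


def weighted_fitness(relevance: float, redundancy: float, beta: float) -> float:
    """Assumed single-objective fitness: reward relevance, penalise redundancy."""
    return beta * relevance - (1.0 - beta) * redundancy


# Hypothetical (relevance, redundancy) scores of two candidate feature subsets.
candidates: Dict[str, Tuple[float, float]] = {
    "A": (0.90, 0.60),  # more relevant, but larger and more redundant
    "B": (0.70, 0.10),  # less relevant, but compact
}


def best(beta: float) -> str:
    """Name of the subset with the highest weighted fitness for a given beta."""
    return max(candidates, key=lambda name: weighted_fitness(*candidates[name], beta))


print(best(0.5))  # B
print(best(0.9))  # A
```

With β = 0.5 the compact subset wins and with β = 0.9 the more relevant subset wins, so a single-objective run commits to one trade-off in advance. Treating the two terms as separate objectives keeps both subsets, since neither dominates the other.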
On that basis, the concepts of NSGAII and NSGAIII are embedded in BCOA-MI and BCOA-E to form BCNSG2 (BCNSG2MI and BCNSG2E) and BCNSG3 (BCNSG3MI and BCNSG3E). Both BCNSG3MI and BCNSG3E achieved better performance than BCNSG2MI and BCNSG2E with regard to both selected features and classification error rate on most of the datasets. This is because FS tasks are complicated problems with many local optima, and BCNSG3MI and BCNSG3E use multiple mechanisms to maintain diversity among population members, supported by supplying and adaptively updating a number of well-spread reference points. Specifically, they select and filter out crowded leaders and apply various mutation operators to preserve the diversity of the population and avoid stagnation in local optima.
BCNSG2MI and BCNSG2E are not as good as BCNSG3MI and BCNSG3E at avoiding stagnation in local optima. They maintain several levels of Pareto fronts to keep the previously found nondominated solutions, so all nondominated solutions are carried over in the habitat from one iteration to the next. The nondominated solutions may be duplicated, and the habitat may lose diversity more quickly, which can cause premature convergence. This is why both BCNSG3MI and BCNSG3E are faster than their BCNSG2MI and BCNSG2E counterparts.

VI. CONCLUSION
This study aimed to examine the use of FS, specifically filter-based FS utilising BCOA and information theory concepts, for both single-objective and multi-objective FS. The aims have been accomplished by developing two new filter-based single-objective FS algorithms using MI and entropy, namely BCOA-MI and BCOA-E. BCOA-MI applies MI to every pair of features to assess the relevance and redundancy of the chosen features, whereas BCOA-E applies entropy to the whole set of features to assess the relevance and redundancy of the chosen feature subsets. In addition, different weight values are assigned to balance relevance and redundancy.
The results of the filter-based single-objective algorithms showed that, with a suitable weight value, BCOA-MI and BCOA-E can decrease the feature size and still attain comparable classification performance. BCOA-MI selected the smaller feature subsets, while BCOA-E obtained the better classification performance. However, neither BCOA-MI nor BCOA-E balances the error rate against the feature size. As a result, a multi-objective filter-based BCOA is also proposed to find a set of nondominated solutions.
The aim of developing a filter-based multi-objective FS has also been achieved, in which the ideas of NSGAIII and NSGAII are employed to search for fewer features with the best error rate. Four filter-based multi-objective FS algorithms, BCNSG2MI, BCNSG2E, BCNSG3MI and BCNSG3E, were developed and evaluated, also based on MI and entropy. The algorithms are first compared with BCOA-MI and BCOA-E on fourteen benchmark datasets of varying complexity. The multi-objective algorithms outperformed the single-objective algorithms on most of the datasets and can evolve a set of nondominated solutions with fewer features and improved performance. In addition, the presented multi-objective algorithms are compared with filter-based multi-objective BPSO (NSfsMI and NSfsE), with MI and entropy as evaluation criteria, and with the filter-based multi-objective DE approaches (MODEmi and MODEmirf). The proposed multi-objective approaches outperformed all the existing approaches and can evolve a Pareto set of feature subsets with the smallest feature size while attaining improved classification performance.
Even though the proposed multi-objective approaches derive good subsets of features, it is not clear whether the Pareto front, that is, the set of nondominated solutions, can be improved further. Thus, future work on filter-based multi-objective FS will address this question and compare with other popular evolutionary algorithms in search of better solutions. Wrappers have better classification performance than filter-based methods, but most of them are single-objective and work by combining the aims of FS into one fitness function. Balancing those conflicting aims using wrapper-based methods together with NSGAIII has not been fully studied and is left for future work.
Moreover, filter and wrapper approaches have recently been combined to benefit from the advantages of both. Filters are fast and scalable to large datasets but lack good classification performance, whereas wrappers achieve good classification performance but are slow. Although filter-wrapper approaches have been proposed in the literature to offset the weaknesses of each approach and benefit from their advantages, multi-objective filter-wrapper FS is still an open issue, since the problems of each approach still exist in the single-objective setting.
UMI KALSOM YUSOF received the B.Sc. degree from Western Illinois, Macomb, IL, USA, the M.Sc. degree from Universiti Sains Malaysia (USM), Penang, and the Ph.D. degree in computer science from Universiti Teknologi Malaysia (UTM) Skudai, Johor.
She is currently an Associate Professor and a Lecturer with the School of Computer Sciences, USM. Her research interests are related to data mining, Web engineering, computational intelligence, artificial intelligence, multiobjective optimization, evolutionary computing, computer security, and grid computing. She has published research articles at national and international journals, conference proceedings, as well as chapters of books.
SYIBRAH NAIM (Member, IEEE) received the B.Sc. degree in financial mathematics and the M.Sc. degree in applied mathematics from Universiti Malaysia Terengganu, in 2007 and 2010, respectively, and the Ph.D. degree in computer science from the School of Computer Science and Electronic Engineering, University of Essex, Colchester, U.K., in 2014. She is currently a Lecturer with the Technology Department, Endicott College of International Studies (ECIS), Woosong University, South Korea. She has published research articles at national and international journals, conference proceedings, as well as chapters of books. Her research interests are related to optimization, computational intelligence, artificial intelligence, multiobjective optimization, evolutionary computation, soft computing fuzzy clustering, fuzzy logic, fuzzy set theory, and fuzzy theory.