An Optimized Feature Selection Technique in Diversified Natural Scene Text for Classification Using Genetic Algorithm

Natural scene text classification is a challenging task because of the diversity of image contents, the presence of degradations such as noise and low contrast or resolution, and the random appearance of foreground (font, style, size and orientation) and background properties. In addition, the high dimension of the input image's feature space is a major problem in such tasks. This work aims to tackle these problems by removing redundant and irrelevant features to improve the generalization properties of the classifier. In other words, selecting a discriminative set of features reduces dimensionality and helps to achieve successful pattern classification. We use a biologically inspired genetic algorithm (GA) because the crossover employed in such algorithms significantly improves the quality of the multimodal discriminative feature set and hence the classification accuracy for diversified natural scene text images. The Support Vector Machine (SVM) algorithm is used for classification, and the average F-score is used as the fitness function and target condition. First, after preprocessing the input images, the whole feature space (population) is built using a multimodal feature representation technique. Second, a feature-level fusion approach is used to combine the features. Third, to improve the average F-score of the classifier, we apply a meta-heuristic optimization technique using a GA for feature selection. The proposed algorithm is tested on five publicly available datasets and the results are compared with various state-of-the-art methods. The results show that the proposed algorithm classifies textual and non-textual regions with better accuracy than benchmark state-of-the-art algorithms.


I. INTRODUCTION
Features are discriminative elements that help to differentiate between types of objects in an image. It has been observed that pattern recognition classifiers have difficulty achieving good performance when the feature space has a high dimension [1]. Therefore, to design a better classifier and achieve good accuracy, one possible strategy is to reduce the complexity of the model by reducing the number of features, discarding non-informative and redundant features [2], [3] obtained from a diverse set of images.
The associate editor coordinating the review of this manuscript and approving it for publication was Qingli Li. VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In statistical machine learning, feature selection, also known as attribute selection, variable selection or variable subset selection, is a method used to select a subset of optimal features that are considered more pertinent to the application [4]. There are multiple approaches to gathering the best subset of features, including principal component analysis (PCA) [5], [6], ant colony optimization [7], particle swarm optimization (PSO) [8]-[10], firefly [11] and genetic algorithm (GA) [4], [12]. It is worth pointing out that GAs are powerful, stochastic, biologically inspired techniques that can be used in several image processing applications, including image enhancement, image segmentation, image classification, and (naturally) feature selection.
In this work, our goal is to classify text and non-text regions in natural scene images using genetic algorithms. Most of the classification works available in the literature consider text documents rather than cropped scene texts. Moreover, in some scenarios, images of business places and logos can also be classified as text areas [13]. More specifically, we use a GA technique to select the most appropriate features for the text classification problem. Before extracting the image features, we perform a preprocessing operation, which consists of histogram equalization for normal contrast images and a Fourier transform for contrast enhancement in low resolution images. Then, a feature space (population) is obtained by extracting Histogram of Oriented Gradients (HOG) [14], Local Binary Pattern (LBP) [15], color features [16] and contour features [17].
Finally, to select the best features, a GA framework is used to classify text and non-text regions in natural scene images using an integrated SVM [18]. The GA is robust and unbiased to a variety of texts, fonts, sizes, variations, angles, distortions, and skews [19]. Furthermore, using biological evolution concepts (e.g. survival of the fittest) in optimization problems has been proven to be promising [20].
In summary, to obtain the best features for the problem of text classification in natural scenes, an unbiased GA technique searches the whole feature space (population). The GA selects the best features among all individuals (features), taking into consideration a fitness function. The GA shrinks the length of the feature vector to reduce the running and training time of the classification system and attain an optimal accuracy, even in the presence of noise and other degrading factors [21]. Since GA acts as a powerful tool for feature optimization and classification, we selected it to find optimal and near-optimal solutions to our problem, considering the very large search space of the application and the limited amount of time.
This work has the following key contributions:
• The design of a multimodal feature system for optimization, which includes the preprocessing, text localization and feature extraction stages for a diversified set of images with noise, low contrast/resolution and random appearance of foreground (font, style, sizes, orientations) and background properties.
• The design of a GA framework with an integrated/implicit SVM, which is able to reduce the dimension of the feature vector and classify natural text images as ''text'' or ''non-text'' areas.
• The use of the average F-score as target and fitness function to optimize the performance of the proposed classifier.
The rest of the paper is organized as follows. Section II highlights the related work, while Section III describes the proposed methodology in detail. Section IV discusses the experimental results. In the last section, we provide our conclusions and discuss future work.

II. RELATED WORK
Natural scene text classification is an increasingly important task as a way to enhance model learning. The image feature extraction and optimization stages are essential parts of a good classifier. One of the best performing techniques is the genetic algorithm (GA), a feature optimization and classification method based on evolutionary theory. In this section, we describe the state of the art of GA techniques.
Vafaie and Jong [21] used a GA to pick the best features for a rule induction system. The feature selection policy described in their work has two main classes. The first class selects features for classification independently, regardless of their effect on performance. The second class selects a subset of the best optimized features (from the whole feature space) in such a way that it does not degrade the performance of the classifier system.
Raymer et al. [22] used a GA technique to extract and select features and train a classifier. Their method reduces the dimension of a weighted feature vector to scale the features, either linearly or nonlinearly. The authors also employed a masking factor to act as a fitness function that helps perform the selection of the features.
Sun et al. [23] stated that GAs can select the best subset of features by encoding gender information (e.g. face length and width, mouth and eye size, angles, distances and areas). For this, they first used principal component analysis (PCA) [24] to represent every image eigen-feature in a low-dimensional space. A GA is then used to select the best features and reject eigenvectors that are irrelevant to gender information. After this, the selected features are fed to Neural Network, SVM, Bayes Classifier, and Linear Discriminant Analysis stages for classification. The classifier fitness is evaluated using the accuracy (computed from the validation samples) and by varying the number of eigenvectors from 10 to 150.
Mohamad [25] used a GA technique to identify which feature combinations can be considered for classification using the classification accuracy as the fitness (or objective metric).
Uguz [5] analyzed the problem of text categorization and concluded that a large number of features increases the amount of noise in the process, misleading the classifier and reducing accuracy and performance. The first stage of their algorithm uses an information gain (IG) measure to rate each feature within the document [26]. To reduce the dimensions, in the next stage of their algorithm, a combinatorial approach (a PCA [27]) and a GA are applied to the features, with the goal of ranking their order of importance. In this way, features of less importance are eliminated to decrease the computational time and complexity of the classification process. Other authors have used the K-nearest neighbor [28] or the C4.5 decision-tree [29] classifiers.
Jaberi and Madiafi [30] introduced a method to reduce the selection pressure, i.e. the tendency to select only the best members of the in-progress generation for propagation to the next generation, and to increase the genetic diversity of the population of a GA using different selection procedures. Using an elitist method that transferred the best individuals of the in-progress generation to the next generation, they were able to reduce the complexity of the algorithm for different selection methods. This method also decreases the convergence time, but enhances the effect of selection pressure, which often causes convergence to a local minimum.
Tsai et al. [31] proposed classification models for different domain datasets (including small-scale and large-scale) based on the selection of the best features followed by an instance selection (discarding faulty data) using a GA. The Bayesian network learning algorithm is used as the fitness function and the chosen classifiers are the SVM and the K-nearest neighbor (K-NN) techniques.
Catak [12] employed classification models for text datasets with best feature selection using a GA technique. They introduced an objective function to maximize the sum of the feature ratio and of the F-score of the best chromosome. The authors used this objective function with three different classifiers (SVM, maximum entropy, and stochastic gradient descent) to check its efficiency.
Li et al. [32] proposed a classification method for electrocardiogram (ECG) signals using a GA and Back Propagation Neural Network (GA-BPNN) technique. The features are extracted using a wavelet packet decomposition (WPD) transform. The best features are selected using a sum of square error (SSE) fitness function, with the help of the roulette wheel method.
Sharif et al. [33] used a GA to find the appropriate features for an offline signature verification system. The selected features are then given to an SVM algorithm for verification, using an Artificial Neural Network (ANN) as a cost function. A few representative classification methods [34]-[47] are also popular for scene character classification, and all of these are useful for text/character recognition.
Similarly, other evolutionary algorithms, including PSO and firefly, are considered prominent tools among researchers for selecting the best classification features. In [9], multi-objective particle swarm optimization is used as a remedy for cost-based feature selection. Recently, an extended work was presented by Song et al. [10], in which a divide-and-conquer idea is applied to variable-sized cooperative co-evolutionary particle swarm optimization (VS-CCPSO) for feature selection. Finally, the works in [11], [48], [49] are worth reading, as they reflect the challenge of dealing with the curse of dimensionality.
In summary, the literature reveals that GAs can be successfully used to select the relevant features for classification purposes. These techniques show promising results, since feature optimization is the key to successful model learning.
In this paper, we propose a GA methodology for classifying text and non-text regions in natural images. Above all, a GA can be used to reduce the number of features to an optimal set. Our proposed GA framework is specially designed to extract discriminative features from a diversified set of images with different degrading factors and foreground/background properties. The input features are converted into a binary chromosome to obtain more informative supervised information. This makes the framework robust against a variety of foreground and background properties. To the best of our knowledge, there is currently no similar work in the literature. Our results show that the proposed algorithm performs better than state-of-the-art feature selection methods.

III. PROPOSED METHODOLOGY
In this methodology, the goal is to classify text and non-text regions in natural scene images using a GA, considering a diversified set of images with noise, low contrast/resolution and random appearance of foreground (font, style, sizes, orientations) and background properties. The block diagram of the proposed methodology is shown in Fig. 1.
The major focus of this study is classification with reduced false positives along with good precision, recall, F-score and accuracy, keeping the aforementioned diversity in mind. The diagram of the proposed methodology is depicted in two figures: Fig. 2 shows image preprocessing for low/high contrast images, feature extraction using text localization and feature fusion, while Fig. 3 shows the optimal feature vector extraction stage, which uses SVM classification with an iterative GA framework.

A. PREPROCESSING AND MULTIMODAL FEATURE PREPARATION FOR CLASSIFICATION OF NATURAL SCENE TEXT USING GA FRAMEWORK
The proposed pipeline, shown in Fig. 2, builds the multimodal feature vector after the necessary preprocessing, which depends on the image condition.

1) PREPROCESSING
The first step consists of resizing the input images to 100 × 100 pixels with the goal of maintaining the symmetry among all positive and negative text images. These samples are manually cropped from the benchmark datasets, as illustrated in [40]. Next, we convert the color images into grayscale images. Then, we perform histogram equalization to enhance the contrast of good resolution images, which helps to differentiate text and non-text images. Sample outputs of these steps are shown in Fig. 3.
For images with low resolution or contrast, the Fast Fourier Transform (FFT) is applied, with the inverse FFT (IFFT) as the inverse technique. In this work, the logarithmic transformation acts as a filtering function H(x, y) that maps a narrow range of low input gray levels onto a wider range of output values. The filtering function H(x, y) = C · log(1 + g(i, j)) is used to process image pixel g(i, j), where C is a constant, normally taken as 1 for enhancement. It is equally important that the FFT be completely reversible, i.e. that the image can be restored from the frequency domain to the spatial domain, so we use the standard IFFT equation for this purpose. Fig. 4 shows original low contrast and resolution images and their corresponding grayscale and contrast-enhanced versions. A comparison of Fig. 4(b) and Fig. 4(c) shows a significant improvement in contrast. This technique helps to achieve good results in the later stages.
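The logarithmic mapping above can be illustrated with a minimal Python sketch. It covers only the pixel-wise function H = C · log(1 + g); the FFT/IFFT round trip is omitted, and the final rescaling to the 0..255 range (`stretch_to_8bit`) is an assumption added here for display purposes, not a step stated in the text.

```python
import math

def log_filter(gray, c=1.0):
    """Apply H = c * log(1 + g) to every pixel of a 2-D list of gray levels."""
    return [[c * math.log(1.0 + g) for g in row] for row in gray]

def stretch_to_8bit(values):
    """Rescale filtered values to 0..255 (an assumed display step)."""
    peak = max(max(row) for row in values)
    if peak == 0:
        return [[0 for _ in row] for row in values]
    return [[round(255.0 * v / peak) for v in row] for row in values]

# A tiny, dark "image": all gray levels sit in the low end of 0..255.
dark = [[0, 5, 10],
        [20, 40, 80]]
enhanced = stretch_to_8bit(log_filter(dark))
```

Low gray levels are spread over a much wider output range, which is the contrast-enhancing effect the paragraph describes.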

2) FEATURE EXTRACTION AND FUSION
For a classification problem, we need to extract multiple features from an image to build a multimodal feature representation. The multimodal approach is generally able to represent most image properties, so that we can obtain suitable supervised information for a classification problem. The obtained supervised information is susceptible to noise, non-universality, inter- and intra-class variations, and spoofing attacks.
Popular image feature descriptors for multimodal representations include appearance features (HOG), color, contour and texture features (LBP). Common characteristics of text in images are: (1) the text is generally visible on backgrounds with different textures; (2) the contour image reflects the salient edges from which geometric features can be detected; (3) color is an important feature descriptor because text generally appears in different colors and on differently colored backgrounds to avoid readability issues.
Thus, a multimodal feature representation based on such characteristics can reveal significant content variations and effectively enhance the classification accuracy [50]. Given these features, it is clear that the text classification problem has a high-dimensional feature vector. Therefore, using a GA to obtain a discriminative optimal set of features enhances the learning capability of the classifier. The steps for feature extraction and fusion are shown in Algorithm I. Next, we describe the set of features, which are extracted in a novel way to handle the diversity of the different types of images used in this work.

a: APPEARANCE FEATURES
The Histogram of Oriented Gradients (HOG) operator was proposed by Dalal and Triggs [14]. In this work, we use the HOG operator to extract image appearance features, resulting in a vector A = [A_1, A_2, A_3, ..., A_n]. To calculate the HOG of an image, we set a fixed number of bins and divide the image into blocks (arrays) of A_x × A_y. Then, we compute the HOG for each image block, i.e. each sub-image k. The HOG of an image block k is computed by calculating the image gradient ∇k using the difference schema of eq. (1), where p and q are the pixel coordinates (see Fig. 5).

b: CONTOUR FEATURES
Assume that an image G(x, y): Ω → R has several objects in the background and foreground. Then, curves and edges can be used to localize the boundaries of the Object of Interest (OoI) in images using the following steps [51], [52]:
• Choose an initial OoI.
• Use some criteria to move forward and find stable curves and edges of the OoI.
• Stop when the stop criterion is met.
Based on these criteria, a region-based model is the best choice to obtain the open contours. This can be achieved using the popular partial differential equation for curve evolution in open contours. Hence, this model partitions the image G(x, y) into background and OoI on the basis of pixel intensity similarity by minimizing the function given in eq. (2).
Assume that the predicted object curve O is parameterized. The OoI can then be determined if and only if G(x, y) ≈ const o_1 inside the region, whereas the background is determined by G(x, y) ≈ const o_2. In (2), the open contours are defined by the first two terms, followed by a regularization term that acts as a penalty on the length of the curve. Further, (2) is extended to an implicit representation in eq. (3), where T is the threshold value, in the range [1, 3] depending upon image quality, for generating open/active contours, while ∇W is a regularization term for handling outliers so as to speed up the process. The values of o_1 and o_2 in (3) can be computed by minimizing the function in eqs. (4) and (5), which are based on externality conditions.
The parameter λ is computed using gradient descent incorporating the loss function given in (6), where t is the slope and δ is the intercept term, updated as t = t + Δt and δ = δ + Δδ, respectively.
The above model localizes text (OoI) on the basis of intensity similarity and produces the region boundary through edges and curves, which do not enclose the object. It can also handle random orientations and performs better with low contrast, unclear and complex backgrounds to determine stable regions. The final feature vector B has a length of 1 × 6.

c: TEXTURE FEATURES
Local Binary Patterns (LBP) is a popular feature descriptor proposed by Ojala et al. [15] to extract texture features in the image. It is computed for each pixel p = (x, y) of the image. Let λ_i be the intensity level of a pixel in the 8-connected region of the pixel p, and λ_c the intensity of the central pixel p. The LBP [53], [54] is computed using the following equation:
\[ \mathrm{LBP}(x, y) = \sum_{i=0}^{7} e(\lambda_i - \lambda_c)\, 2^{i} \tag{7} \]
The threshold function e(x) for the LBP is calculated as:
\[ e(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \]
In this work, we consider a random arrangement of adjacent pixels. Hence, the LBP descriptor converts every pixel to an 8-connected region. As a result, the feature set C = [C_1, C_2, C_3, ..., C_n] contains histogram values corresponding to each image region. Fig. 5(e) depicts encoded local texture features, which are very discriminative for the classification task. The feature vector C has a length of 1 × 9.

d: COLOR FEATURES
The Color Distribution Entropy (CDE) descriptor, which was introduced by Sun et al. [16], not only gives information on the image colors, but also provides the spatial distribution of the different pixels in the image. The CDE descriptor uses a normalized spatial distribution histogram (NSDH) algorithm, which is built on the annular color histogram function [55]. Considering that A_i is the pixel set of an image, |A_ij| is the pixel count for color bin i within the circle of radius r_i. Using the NSDH algorithm, the normalized color histogram of a particular color i is given by eq. (8), and the color distribution entropy is determined using eq. (9). The length of the feature vector D at this stage is 1 × 256. Fig. 5 shows examples of all extracted features, including the gradient orientation, the region boundary and the LBP histogram.
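To make the texture descriptor concrete, the following minimal Python sketch computes the LBP code of the central pixel of a 3 × 3 patch by thresholding each 8-connected neighbour against the centre. It assumes the standard LBP formulation (bit set when the neighbour is at least as bright as the centre); the clockwise neighbour ordering and the helper name `lbp_code` are choices made for illustration, not details given in the text.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbour order (clockwise from the top-left) is an assumption of this sketch.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for i, (r, c) in enumerate(offsets):
        # Threshold: 1 if neighbour intensity >= centre intensity, else 0.
        bit = 1 if patch[r][c] - center >= 0 else 0
        code += bit << i  # weight the bit by 2**i
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)
```

A histogram of such codes over an image region would then give the texture feature values; the binning that yields a 1 × 9 vector is not specified here and is not reproduced by the sketch.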

e: FUSING EXTRACTED FEATURES
Features extracted from multiple sources can be pooled or fused at distinct levels, namely the decision, score, and feature levels. Among these, feature-level fusion is considered the most effective and powerful strategy, because it combines different information from the same data to provide better recognition results [56]. The proposed feature extraction and fusion steps are presented in Algorithm I. In our case we have four feature spaces A, B, C and D, which correspond to HOG, contour, LBP, and color features respectively. The feature vector A has n dimensions, while the feature vectors B, C, and D have m dimensions. To equalize the dimension of all vectors, the shorter vectors are padded with zeroes. Let the image feature space be (A, B, C, D)^T. Then, for a random sample τ drawn from this space, the feature vector X includes the samples a ∈ A, b ∈ B, c ∈ C and d ∈ D. Therefore, all the selected feature vectors can be serially concatenated and defined as X = (a, b, c, d)^T. The output of Algorithm I is the final feature vector X, which has a length of 1 × 4051, a high dimension.
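The serial concatenation step can be sketched in a few lines of Python. The contour, LBP and color lengths (6, 9 and 256) are taken from the text; the HOG length of 3780 is an assumption chosen only so that the total matches the stated 1 × 4051, and the zero-padding step mentioned above is not reproduced here.

```python
def fuse(a, b, c, d):
    """Serially concatenate four descriptor vectors into one feature vector."""
    return list(a) + list(b) + list(c) + list(d)

# Placeholder descriptors: only the lengths matter for this illustration.
hog = [0.0] * 3780     # assumed length, not stated in the text
contour = [1.0] * 6    # length of B from the text
lbp = [2.0] * 9        # length of C from the text
color = [3.0] * 256    # length of D from the text

x = fuse(hog, contour, lbp, color)
```

Because the vectors are concatenated in a fixed order, each descriptor occupies a known slice of `x`, which is what lets a binary chromosome later switch individual features on or off.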

B. GA FRAMEWORK FOR FEATURE OPTIMIZATION AND CLASSIFICATION
Next, we complete our methodology by employing a GA framework to reduce the feature vector and obtain an optimal feature set as output.

1) GA-BASED FEATURE SELECTION AND CLASSIFICATION
The proposed framework is shown in Fig. 6, which shows the optimal feature vector extraction stage that uses SVM classification within an iterative GA framework. The parents in the population P are in the form of binary chromosomes. Each binary chromosome is a composition of feature genes, where each gene g has a bit value of 0 (feature not selected) or 1 (feature selected). In the beginning, the chromosomes from the population are selected randomly to initiate the process. Moreover, elitism is used to carry a few of the fittest individuals directly into the next generation. Table 1 lists the parameters used for the proposed methodology, obtained after a number of experiments.

Algorithm 1 Proposed Feature Extraction and Fusion
Begin
Input: Query image
Output: Fused feature vector
Step 1: Resize the input image to 100 × 100 pixels
Step 2: Convert the input image to a grayscale image
Step 3: Perform histogram equalization
Step 4: A = Extract HOG features using eq. (1)
Step 5: B = Extract contour features after Step 1 using (2-6)
Step 6: C = Extract LBP features after Step 1 using (7)
Step 7: D = Extract color features after Step 1 using (8-9)
Step 8: Concatenate the extracted features serially to build the fused feature space (A, B, C, D)^T
Step 9: Select the features X according to the selected binary chromosome, X = (a, b, c, d)^T, where a ∈ A, b ∈ B, c ∈ C, and d ∈ D
End
Referring to Fig. 6, the following steps are performed. An Initial Chromosome (1) is generated using random binary values from the Combined Features Vector to act as the Selected Features of Image. The Selected Feature Vector (2) becomes part of the Training Set (3), which is updated on each generation of the Training of SVM Classifier (5). The Testing Set (4) is also obtained from the Training Set (for validation purposes). Both the Training Set (3) and the Testing Set (4) are fed into the Training of SVM Classifier (5) stage to obtain a Trained SVM Classifier (6), which iteratively enhances the model learning as training and testing samples are added. During the training process, the Fitness Function Evaluation (7) operation is performed to calculate the Classification Accuracy. If the Check Condition >= 92.0 (8) is met, the optimization process ends and the Optimal Feature Vector is obtained. Otherwise, a crossover (9) is performed between the 1st parent (current chromosome) and the 2nd parent (random binary values), which is selected using the roulette wheel method. To complete a single iteration/generation, a Mutation (10) is performed to get a new chromosome (after mutation). A new feature vector is then selected and the process iterates between steps (3) and (11) until the target condition is achieved and the final Optimal Feature Vector is obtained.
The threshold values listed in Table 1 were obtained after a number of (training and testing) experiments. To achieve the target, we distributed the data samples into different (training-testing) ratios, including (75-25), (80-20), (85-15) and (90-10). We found that the (80-20) ratio does a better job of achieving the target condition. At the same time, the average F-score remains constant for the (85-15) and (90-10) ratios, which shows that the target condition of 92.0% is optimal at (80-20).
The other parameters in Table 1, such as mutation rate, crossover rate, population size and number of generations, are also adjusted. They become optimal as the simulations reach a target condition greater than or equal to 92.0%. Fig. 7 depicts a graph showing the average crossover, mutation and F-score values for the different distribution samples. From this graph, we are able to determine the optimal parameters that achieve the target condition and high classification accuracy.
To get an optimized feature subset, we set our chromosome to be comprised of bits corresponding to all features. The length of the binary chromosome is 1 × 16. Fig. 8 shows the design of our chromosome.
Here, f_i represents the bit value for the ith feature and n is the total number of features. If the bit value is 1 the feature is selected, while if the bit value is 0 the feature is not selected. For example, if a chromosome is given by chromosome = 01000110, the second, sixth, and seventh bits are selected as features and the rest are not. The chromosome is built using the formulation of eq. (10).

FIGURE 6. Illustration of GA framework to determine optimal feature vector. The process iteratively works until the target condition is met.

The initial value of the chromosome is set by randomly choosing the bits. Then, the feature vectors for all images are trained and classified with text or non-text labels. The process to generate an initial population is therefore simple and straightforward, as shown in Algorithm II. A random_initialize() function generates random bits, i.e. values in {0, 1}, that gradually initialize the chromosome [57]. The number of selected features f is thus considered an arbitrary initial solution.

end for 8: end while End
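The bit-string encoding and random initialization described above can be sketched as follows. This is a minimal illustration, not the listing of Algorithm II: the helper names `random_chromosome`, `selected_indices` and `apply_mask` are hypothetical, and positions are zero-based, so the "second, sixth and seventh" bits of the text appear as indices 1, 5 and 6.

```python
import random

def random_chromosome(n_features, rng):
    """Draw one random bit per feature (1 = selected, 0 = not selected)."""
    return [rng.randint(0, 1) for _ in range(n_features)]

def selected_indices(chromosome):
    """Zero-based positions of the features whose bit is 1."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]

def apply_mask(features, chromosome):
    """Keep only the feature values whose chromosome bit is 1."""
    return [f for f, bit in zip(features, chromosome) if bit == 1]

# The example chromosome from the text.
example = [int(ch) for ch in "01000110"]
picked = selected_indices(example)
reduced = apply_mask(["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"], example)

# A random initial chromosome of length 16, as stated for this work.
initial = random_chromosome(16, random.Random(0))
```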
After population initialization, the SVM is trained on the selected features for binary classification. In this work, the F-score is not calculated in isolation; it is calculated for each generation. The average F-score is then calculated through eq. (11), where F_i = 2 P_i R_i / (P_i + R_i), n is the generation count, f_i is the selected feature count, and F_i is the F-score from the previously calculated generations.
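As a small worked example of the fitness measure, the sketch below computes the per-generation F-score from precision and recall and then averages it over generations. Taking the average as a plain arithmetic mean over the generation count n is an assumption made here; the precision and recall values are invented for illustration only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def average_f_score(per_generation):
    """Mean F-score over a list of (precision, recall) pairs, one per generation.

    Assumption: a simple arithmetic mean over the generations seen so far.
    """
    scores = [f_score(p, r) for p, r in per_generation]
    return sum(scores) / len(scores)

# Hypothetical precision/recall values for two generations.
history = [(0.90, 0.90), (0.95, 0.95)]
avg = average_f_score(history)

TARGET = 0.92  # target condition from the text (92.0%)
target_met = avg >= TARGET
```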
If the average score does not meet our predefined threshold, the next generation of chromosomes is selected. The selection process ensures that the fitter chromosome has a higher probability of survival. In this study, the roulette wheel selection steps are given in Algorithm III. Algorithm III first accumulates the fitness shares of all chromosomes in the population, computing the cumulative probabilities prob_i = Σ_{j=1}^{i} P(j) with prob_0 = 0. It then draws a random number rvalue in the interval [0, prob_n], where n is the population size, and goes through the population, selecting the ith chromosome such that prob_{i−1} < rvalue < prob_i. In other words, it stops at the first position where the cumulative sum exceeds rvalue, which favors chromosomes with a greater probability share.

FIGURE 9. Snapshot demonstration of a probabilistic share of chromosomes in proposed methodology. GA in this paper uses a roulette wheel selection process.

A snapshot of this process is shown in Fig. 9, where the clockwise roulette wheel shows the probability of five different chromosomes from the population. Chromosome 3 has the highest probability share (40%), while chromosome 5 has the second highest probability share (30%). Therefore, chromosome 3 covers the largest part of the wheel and is considered the fittest. As a consequence, it can be selected multiple times in the next operations.
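The roulette wheel step can be sketched in Python as below. The shares of 40% for chromosome 3 and 30% for chromosome 5 come from the text; the remaining three shares (10%, 15%, 5%) are hypothetical values chosen only so that the wheel sums to 100%. Indices are zero-based, so chromosome 3 is index 2.

```python
import random

def cumulative_shares(fitness):
    """Running sum of each chromosome's share of the total fitness."""
    total = sum(fitness)
    running, cumulative = 0.0, []
    for f in fitness:
        running += f / total
        cumulative.append(running)
    return cumulative

def roulette_pick(fitness, rvalue):
    """Index of the first chromosome whose cumulative share reaches rvalue."""
    for i, edge in enumerate(cumulative_shares(fitness)):
        if rvalue <= edge:
            return i
    return len(fitness) - 1  # guard against floating-point round-off

# Shares for chromosomes 1..5; only the 40 and 30 come from the text.
shares = [10, 15, 40, 5, 30]

fixed_pick = roulette_pick(shares, 0.5)                       # lands in chromosome 3's slice
random_pick = roulette_pick(shares, random.Random(7).random())
```

Because chromosome 3 owns the widest slice (cumulative range 0.25 to 0.65), a uniformly drawn `rvalue` falls into it most often, which is why the fittest chromosome can be selected several times.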
Hence, Algorithm III helps to choose a second parent with a high probability share for the next generation's crossover and mutation. The values for both operations are listed in Table 1. If the mutated chromosome is superior to both parents, it replaces them. If it lies between the two parents, it replaces the inferior parent; otherwise, the most inferior chromosome in the population is replaced. The GA terminates when the generation count reaches the maximum number or the target average F-score is met.
When the target condition is achieved, the evolution process stops. Otherwise, the next generation of the chromosome needs to be produced. The crossover operation is performed with a rate of 0.7% between the currently selected chromosome and another randomly initialized chromosome. Thereafter, the mutation process is applied to the child with a rate of 0.05%, flipping gene values in the child from 0 to 1 or from 1 to 0 with the goal of producing new chromosomes. All image features are then chosen according to the genes of this new-generation chromosome. The two-point crossover and mutation process is represented in Fig. 10. The process continues until the target condition is achieved.
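The two genetic operators can be illustrated with a short Python sketch. The cut points are passed in explicitly so the result is reproducible; the function names are hypothetical, and the `rate` argument is treated as a per-gene flip probability, which is an interpretation of the mutation rate rather than something the text specifies.

```python
import random

def two_point_crossover(parent_a, parent_b, cut1, cut2):
    """Child takes parent_a outside [lo, hi) and parent_b inside it."""
    lo, hi = sorted((cut1, cut2))
    return parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]

def mutate(chromosome, rate, rng):
    """Flip each gene (0 -> 1 or 1 -> 0) independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

first_parent = [0] * 8
second_parent = [1] * 8
child = two_point_crossover(first_parent, second_parent, 2, 5)

# rate=0.05 is an assumed per-gene probability used only for this demo.
mutated = mutate(child, 0.05, random.Random(1))
```

The middle segment of the child comes from the second parent and the two outer segments from the first, so with the parents above the child is `[0, 0, 1, 1, 1, 0, 0, 0]`.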
In Algorithm IV, we check whether the average F-score (target condition) is achieved after the SVM training on the selected features. If so, further evolution is stopped and the selected features are taken as the optimal feature set. If the target condition is not met, the iterations continue from step 5 back to step 1 in Algorithm IV.
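The overall stop-or-iterate logic can be sketched as a small loop. This is an illustration under loud assumptions, not Algorithm IV itself: `toy_fitness` is a stand-in for the SVM-based F-score (it scores a chromosome against a hypothetical set of "useful" feature indices), the second parent is drawn at random instead of by roulette wheel, the 0.05 flip probability and the cap of 200 generations are invented, and the child simply replaces the current chromosome when it is not worse.

```python
import random

TARGET = 0.92          # target F-score from the text
MAX_GENERATIONS = 200  # hypothetical cap on the number of generations

def toy_fitness(chromosome, useful):
    """Stand-in for the SVM F-score: F-score of the selected bits vs. `useful`."""
    tp = sum(1 for i, bit in enumerate(chromosome) if bit and i in useful)
    if tp == 0:
        return 0.0
    precision = tp / sum(chromosome)
    recall = tp / len(useful)
    return 2 * precision * recall / (precision + recall)

def evolve(n_bits, useful, rng):
    """Iterate crossover and mutation until the target or the cap is reached."""
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    start = best = toy_fitness(current, useful)
    generation = 0
    while best < TARGET and generation < MAX_GENERATIONS:
        generation += 1
        mate = [rng.randint(0, 1) for _ in range(n_bits)]  # random second parent
        lo, hi = sorted(rng.sample(range(n_bits + 1), 2))  # two cut points
        child = current[:lo] + mate[lo:hi] + current[hi:]
        child = [1 - b if rng.random() < 0.05 else b for b in child]
        score = toy_fitness(child, useful)
        if score >= best:  # keep the child only if it is not worse
            current, best = child, score
    return current, start, best, generation

useful = {1, 5, 6, 9}  # hypothetical "informative" feature indices
chromosome, start, best, generations = evolve(16, useful, random.Random(0))
```

The loop ends either because the fitness reached the target or because the generation cap was hit, mirroring the two termination conditions described in the text.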

IV. EXPERIMENTAL EVALUATION
In this work, the precision, recall and F-score are not evaluated separately. Instead, for every generation, we calculate the average values of precision and recall to determine the average F-score, choosing the setting with the best target average F-score value [5]. The F-score is defined in eq. (11), while the average precision, recall, and accuracy are computed using the following equations. The average precision (P) is defined as:

$$P = \frac{1}{n}\sum_{i=1}^{n} P_i \tag{12}$$

The average recall (R) as:

$$R = \frac{1}{n}\sum_{i=1}^{n} R_i \tag{13}$$

And the average accuracy (A) as:

$$A = \frac{1}{n}\sum_{i=1}^{n} A_i \tag{14}$$

where $n$ is the number of generations. The values $P_i$, $R_i$ and $A_i$ calculated from the previous generations are used to compute P, R, and A for the current generation.

Before describing the datasets, it is important to discuss and resolve the class imbalance within them, since it is equally necessary to avoid classifier bias towards the majority classes. With this issue in mind, a resampling technique is adopted, which helps to lessen the discrepancy between the sizes of the classes [58]. Taking this idea further, we use SMOTE (Synthetic Minority Oversampling Technique) [59] with the SVM and call the combination SMOTE-SVM. The SVM is first trained using a linear kernel to generate support vectors (samples), and these samples are then oversampled with SMOTE. This approach enforces the balance between the two classes (text and non-text) at the border level instead of equalizing the number of samples, for all of the following datasets.
• ICDAR 2003: The ICDAR 2003 dataset [60] is a collection of 251 testing and 258 training character patches and word patches annotated with bounding boxes and their text contents.
• SVT: The Street View Text dataset (SVT) [61] was specifically used for word spotting problems. The dataset is a collection of 647 words from 250 testing images (video frames), with bounding box locations and ground truth labels available, along with 100 training images (video frames). Each image is taken from Google Street View.
• IIIT5K: The IIIT5K [62] is the largest and most challenging dataset reported to date due to variations in font, color, layout and size, and the inclusion of noise, distortion, blur and varying illumination. The IIIT5K Word dataset is a collection of 5K words cropped from images found on the Internet, of which 3K and 2K words are used as the testing and training subsets, respectively.
• MSRA-TD500: This dataset [63] is a collection of 500 natural scene images with multi-oriented (slanted and skewed) text. The dataset is split into 300/200 training/testing samples. Both English and Chinese texts of various orientations are part of this dataset.
• KAIST: The KAIST dataset [64] consists of 3000 indoor and outdoor natural scene images with text under unusual lighting conditions. It is also a benchmark multilingual dataset with English and Korean text, which makes it very challenging for classification purposes.
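The SMOTE step of the SMOTE-SVM resampling described before the dataset list can be illustrated as follows. This Python sketch covers only the interpolation at the core of SMOTE; the linear-kernel SVM training is omitted, so `minority` is simply assumed to hold the minority-class support vectors. The function name and the default `k=5` are hypothetical choices, not values from the paper.

```python
import math
import random

def smote_oversample(minority, n_new, k=5, rng=None):
    """Create `n_new` synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours.

    minority: list of feature vectors (lists of floats), at least two samples
    """
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()               # position along the connecting segment
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing samples, oversampling support vectors in this way adds samples near the class border rather than throughout the feature space.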

3) DISCUSSION
To test the proposed methodology, we compare it with benchmark wrapper-based feature selection approaches such as SFS, SBF, SFFS, SBFS and Plus-L-Minus-R. These methods have proven successful at searching for the best feature subset so as to improve the performance of the learning algorithm [65]. In wrapper methods, the selection of features is based on the performance of the predictor: the predictor is wrapped in a search algorithm that finds the subset of features giving the highest predictor performance [66]. Wrapper methods therefore work interactively with the predictor and are structurally close to the proposed methodology, in which the GA is also looped around the predictor until the target condition is achieved in order to enhance the predictor performance. Wrapper methods are discussed in detail in [67]. Initially, the parameters of Algorithm IV are set to the values given in Table 1, which were obtained after a series of experiments and gave the best results. Primarily, the average F-score needs to be maximized up to 92.0% (target condition). The selected criteria for the tests are: (1) with or without preprocessing; (2) using benchmark wrapper approaches; (3) with or without GA; (5) scene character classification.
The performed tests are Test 1 (without preprocessing, feature fusion, and proposed methodology), Test 2 (with preprocessing and without proposed methodology), Test 3 (without preprocessing and with benchmark wrapper feature selection methods), Test 4 (with preprocessing and with benchmark wrapper feature selection methods), Test 5 (without preprocessing and with proposed methodology), Test 6 (with preprocessing and with proposed methodology), and Test 7 (scene character classification). Test 7 is applied to compare the performance of the proposed method, considered as a binary classifier, with the existing benchmark techniques.
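To make the wrapper principle behind the baselines concrete, the sketch below shows sequential forward selection (SFS), the simplest of the methods named above, in Python. It is an illustration only: `score` is a placeholder callable standing in for the predictor performance on a candidate feature subset, and the function name and stopping rule (stop when no candidate improves the score) are assumptions.

```python
def sequential_forward_selection(n_features, score, max_features=None):
    """Greedy wrapper search: add one feature at a time while the score improves.

    score(subset) -> performance of the predictor trained on `subset`
                     (placeholder for, e.g., a cross-validated F-score)
    Returns (selected feature indices, best score).
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    limit = max_features or n_features
    while remaining and len(selected) < limit:
        # evaluate every remaining feature added to the current subset
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break                        # no candidate improves the predictor
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

Floating variants such as SFFS extend this loop with conditional backward steps that can remove a previously added feature.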
For performing the aforementioned tests (except Test 1), we used the parameters listed in Table 1. In Table 1, the main target condition is the average F-score, which is set to be greater than or equal to 92.0%. The fusion method in Algorithm I is also employed to monitor the classifier performance, while the SVM is trained implicitly rather than explicitly. The graphs in Fig. 11 show the relationship between the average F-scores and the generations on the ICDAR 2003, SVT, IIIT5K, MSRA-TD500 and KAIST datasets. The algorithm was implemented in MATLAB and evaluated using 7-fold cross-validation for training and testing the classification algorithm. All tests were performed on a PC with a 3.4 GHz processor, 8 GB of RAM, and a GTX 1070 GPU.
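The 7-fold protocol can be illustrated by the index split below. This is a Python sketch under stated assumptions: the function name is hypothetical, folds are contiguous and unshuffled, and the paper does not specify how its folds were constructed.

```python
def k_fold_indices(n_samples, k=7):
    """Split sample indices into k (train, test) pairs with near-equal test folds."""
    # the first (n_samples % k) folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds
```

Each sample appears in exactly one test fold, so the classifier is trained seven times and every sample is used once for testing.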
Table 2 shows the results obtained for two tests (Test 1 and Test 2). These results are then compared with the rest of the aforementioned tests to monitor the improvement in F-score and accuracy. Notice that Test 1 is performed on the raw data without feature fusion. In this test, the results were not satisfactory: a weak learning model is obtained, with poor performance on unseen data. Test 2 used a preprocessing step and incorporated the fused feature vector. When compared to the results of Test 1, the average F-score for ICDAR 2003 increased from 68.1% to 75.9%, for SVT from 64.4% to 72.6%, for IIIT5K from 64.8% to 70.5%, for MSRA-TD500 from 62.5% to 69.2% and for KAIST from 59.8% to 69.8%. It is worth pointing out that the results in Table 2 are used as the benchmark for the rest of the aforementioned tests. Although Test 2 presented better results, the classification accuracy is not yet acceptable for real applications.

Table 3 shows the results for Test 3 and Test 4, which incorporate the benchmark wrapper-based feature selection approaches. The average F-scores in Table 3 show that SFFS performs better than the other wrapper methods in both Test 3 and Test 4. This result is probably due to the additional steps included in the SFFS backtracking technique. Notice that for Test 3, which is performed on the raw data, we found an improvement when compared to Test 2. In Test 4, however, we observed a considerable improvement. When compared to the Test 3 results for SFFS, the average F-score for ICDAR 2003 increased from 77.1% to 77.3%, for SVT from 73.7% to 77.6%, for IIIT5K from 73.2% to 79.7%, for MSRA-TD500 from 72.9% to 77.8% and for KAIST from 73.8% to 78.3%. It is worth pointing out that the values of the evaluation parameters in Table 3 improve gradually from Test 3 to Test 4. Hence, Table 3 shows better results when compared to Table 2.
This confirms that the fused feature vector plays a key role in enhancing the average F-score and maximizing the model learning. Furthermore, the accuracy of the classifier also improves gradually. However, the performance figures are still not acceptable. Table 4 finally shows the results of Test 5 and Test 6, which incorporate the proposed methodology. In Test 5, the proposed algorithm outperforms all benchmark feature selection techniques, but it is unable to reach the target condition defined in Table 1. We attribute this lower performance to the fact that the images were not preprocessed. Nevertheless, the performance of the classifier is acceptable, which can probably be attributed to the feature fusion approach. In Test 5, the average F-score is 88.2% for ICDAR 2003, 86.5% for SVT, 84.8% for IIIT5K, 84.2% for MSRA-TD500 and 85.5% for KAIST. The average F-score and accuracy results are significantly better than the results in Table 3.
The average results for Test 6 also show that the proposed algorithm (with preprocessing and feature fusion) has a superior performance. We are able to achieve the target condition for all datasets along with better classifier accuracy. More specifically, the accuracy increased from 87.   Table 4 are better than results in Table 3.
It is also pertinent to mention that, although all datasets have significantly challenging characteristics, MSRA-TD500 and KAIST are the most challenging among them. Both datasets contain different types of natural scenes (indoor and outdoor), specifically with random orientations and complex backgrounds, small distant text, low contrast/resolution images, sign boards, hoardings, fences, and different fonts, styles and sizes. Achieving this level of accuracy on them reflects the worth of the proposed methodology.
Further, we believe these results show considerable improvements for the following reasons: (1) the feature fusion strategy described in Algorithm I; (2) the evolutionary nature of the GA, which selects features and reduces the feature space dimension gradually at each generation; (3) the implicit use of the SVM, which enhances the accuracy rate.
Finally, Fig. 11 shows the performance across datasets, where the generation indices are shown on the x-axis and the average F-score is shown on the y-axis. From the graph, it is clear that the proposed method performs well for all selected datasets, reaching the target condition. Among all datasets, ICDAR 2003 has the best performance for all selected parameters.
In Test 7, the proposed methodology is considered first as a binary classification problem and then as a character recognition problem with 62 classes, comprising 10 digits and 52 English letters (both upper and lower case). Table 5 shows that most character classification methods were tested on the ICDAR 2003 dataset; only a few methods were tested on the SVT and IIIT5K datasets. We believe the reason for this is that SVT and IIIT5K have diverse content, composed of outdoor scene text images, which is very challenging because of the variations in font size, layout and color and the presence of distortions such as noise, varying illumination, and blur. Notice from the results in Table 5 that the proposed method works better than various state-of-the-art CNN-based methods, for example [40]-[44].
The case is similar for MSRA-TD500 and KAIST, because both datasets have diverse scene characteristics. To the best of our knowledge, no method has been reported that uses these datasets for scene character classification. Achieving this level of classification accuracy on them reflects the worth of the proposed methodology.
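The 62-class label space used in Test 7 can be written down explicitly, as in the short Python sketch below. The ordering of the classes (digits, then upper case, then lower case) and all names are assumptions made for illustration; the paper does not state how labels were indexed.

```python
import string

# 10 digits + 26 upper-case + 26 lower-case letters = 62 classes.
# The ordering is an assumed convention, not taken from the paper.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase
LABEL_OF = {char: index for index, char in enumerate(CLASSES)}

def encode(text):
    """Map each character of a word to its class index."""
    return [LABEL_OF[char] for char in text]

def decode(labels):
    """Map class indices back to characters."""
    return "".join(CLASSES[index] for index in labels)
```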

V. CONCLUSION AND FUTURE WORK
The main purpose of our proposed method is to design a text classification method using a feature selection procedure and an optimization algorithm. The proposed methodology achieves high classification accuracy when tested on natural scene text images. A novel average F-score threshold (>= 92%) is defined as the target condition for a robust model, which increases the performance on unseen data. The proposed method is tested on five selected datasets. Experimental results have shown that the proposed methodology works well when compared to benchmark feature selection/optimization approaches and existing methods in terms of binary classification. To the best of our knowledge, this work is the first attempt to classify text and non-text regions in natural scene images considering diversity in all aspects.
In the future, we plan to incorporate instance selection as the basis for the feature selection, or to perform both in parallel (instance and feature selection), for natural scene text classification. We also plan to implement a dynamic GA rather than a static version for high-quality model learning. This type of scheme tunes the GA parameters dynamically, including varying the population size, the gene arrangement in the chromosome, and the genetic operators. Finally, we plan to test the use of the GA with neural networks, which is known to considerably improve the classification of the learning model.

ACKNOWLEDGMENT
The findings achieved herein are solely the responsibility of the authors.