Toward a Fair Evaluation and Analysis of Feature Selection for Music Tag Classification

Automatic tag classification of music clips has attracted great attention recently because it allows various music excerpts, including unpopular ones, to be recommended to users based on acoustic similarity. Given a user's preferred music, acoustic features are extracted and fed into a classifier, which outputs the relevant tags used to recommend new music. The accuracy of tag classifiers can be further improved by selecting the best feature subset for the domain to which the tags belong. However, recent studies have struggled to establish the superiority of individual classifiers because they rely on different feature extractors. In this study, to enable a direct comparison of existing classification methods, we create 20 music datasets that share the same acoustic feature structure. In addition, we propose an effective evolutionary feature selection algorithm to evaluate the effectiveness of feature selection for tag classification. Our experiments demonstrate that the proposed method improves the accuracy of tag classification, and the analysis across multiple datasets provides valuable insights, such as which features are important for general music tag classification in the target domains.


I. INTRODUCTION
Automatic music tag classification (MTC) is used to find a relevant music tag, such as an emotion or genre, for a music excerpt based on its music signal or extracted acoustic features [1]-[3]. This task can be achieved by training a classifier on music excerpts whose relevant tags were annotated by a human. After training, relevant tags for undiscovered or newly released music excerpts can be identified without human intervention by feeding the excerpts as input to the trained classifier. Since automatic MTC has high learning potential, improving the accuracy of the classifier is an essential task.
In the past decades, various researchers have considered acoustic feature selection (FS) a key pre-processing step for improving the learning accuracy of MTC. The detailed procedure is as follows: 1) A collection of digital music clips or excerpts and the relevant tags is gathered. Depending on application constraints such as response time, the music excerpts can be divided into clips of a specific duration. 2) An acoustic feature extractor is selected to transform the signal information of each music excerpt into a series of statistical values that form the formal dataset for training the classifier. 3) Since the relevant acoustic features may vary according to the domain of the tags, an FS algorithm can be employed to improve classification accuracy. 4) The classifier learns the music excerpts based on the selected acoustic features with improved accuracy, because noisy features have been removed through the FS process.
MTC performance varies according to the domain of the music collection, the acoustic feature extractor, the feature selector, and the classifier. However, recent studies report the performance of MTC based on different settings, making it impossible to identify and compare the different aspects of performance [4]-[6]. Therefore, we cannot judge the impact of FS on improving MTC performance.
In this work, we report the performance of automatic music classification in a fair setting. The strategy and contributions of this study can be summarized as follows:
• We gathered as many music collections as possible to prevent a biased conclusion; 20 datasets were included in our experiments.
• To guarantee a fair basis for comparison, we selected MIRtoolbox [7], one of the most popular toolkits in the field, as the acoustic feature extractor.
• To examine the true potential of FS, we devised a novel evolutionary FS algorithm based on an evolutionary search process enhanced by a feature filter.
• Our experiments, based on the same acoustic feature structure, allow a feature analysis over multiple datasets to identify important acoustic features across various MTC settings.
Our experimental results indicate that an effective FS method can improve the performance of MTC consistently, regardless of the underlying setting.

II. RELATED WORK
Using different combinations of the music collection, acoustic feature extractor, FS algorithm, and classifier, various MTC systems can be built with different performance levels. In this section, we review some notable studies related to MTC based on these concepts.
First, regarding the music collection, several datasets have been used in MTC studies. One well-known dataset is the Latin Music Database [6], [14], [24], [25]. It contains 3,227 music pieces belonging to 10 different Latin musical genres: Axe, Bachata, Bolero, Forro, Gaucha, Merengue, Pagode, Salsa, Sertaneja, and Tango. The ISMIR'2004 dataset consists of 1,458 music pieces assigned to six Western genres: classical, electronic, jazz and blues, metal and punk, rock and pop, and world music [6]. Another popular dataset is GTZAN [3], [18], [22]. It is widely used in studies on music classification, and it consists of 1,000 songs from 10 popular genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. Each song is 30 s long with a sample rate of 22,050 Hz. The Seyerlehner:1517-Artists dataset contains 3,180 original tracks from 23 music genres; from these, only tracks from different artists in each category were selected in [1]. MagnaTagAtune comprises 4,476 songs belonging to 24 musical genres, making it the most diverse collection [29]. The MIR-1K dataset, a collection of Tollywood and Bollywood songs from the Indian film industry, has been used for the segmentation of vocal and non-vocal clips [11]. Other well-known datasets are EMO-DB, eNTERFACE05, EMVO, SAVEE [9], Traditional Malay Music [26], AllMusicGuide [21], and a Thai music collection [2]. Many studies have formed a novel dataset by combining several datasets or even created a new one. For example, the Carnatic, Hindustani, and Bollywood musical datasets were joined to create a single dataset [12]; in another study, 574 instrument sounds were selected from three instrumental datasets: the McGill University collection, the RWC database, and the University of Iowa instrument samples [15], [20]. The datasets in the works of [10], [13], [16], [17], [23], [28], [31] were newly collected for the respective studies.
Some were collected from music CDs and the Internet [31], while others were provided by individuals who were asked to classify them. These datasets were used to validate the performance of music classification algorithms. For example, in the work of [4], 12 datasets were used: Audionautix, Bugs2664, BugsEmo, CAL500, China3004, Emotions, Genre3, Highlight, KOCCA40, MusicEmo-A, MusicEmo-B, and Style812.
Third, regarding FS algorithms, previous studies frequently use a genetic algorithm (GA) to select the acoustic features relevant to the given tags. This method can be combined with other methods, and variants of the GA or other FS algorithms were used as appropriate in each study. For instance, the One-Against-All (OAA) and Round Robin (RR) space decomposition approaches have been combined with the GA [6], [24], [25]. S-Metric Selection Evolutionary Multiobjective Algorithms (SMS-EMOA) [1], [5], [20], [21] are often selected as an optimization heuristic in MTC studies; they sort the population into several solution fronts through fast non-dominated sorting [20]. The interactive genetic algorithm (IGA) is used in both [23] and [19]. In addition, a GA with the m-features operator was introduced in [28]. Sequential forward selection (SFS) has been used both individually [29], [30] and in combination with ReliefF [17]. Particle swarm optimization [2], the self-adaptive harmony search (SAHS) algorithm [18], the interactive feature selection (IFS) method [30], and improved binary global harmony search (IBGHS) [8] are the other FS algorithms used in related research. [30] and [26] conducted studies using more than two algorithms: three algorithms in [30] and eight algorithms in [26].
Fourth, regarding the classifiers, we mainly discuss five popular ones used in MTC studies. The decision tree (DT) is often employed as a classifier to decide the tag for a given music excerpt. Each node in a DT has two child nodes, except for the leaf nodes, which are used to decide the relevant tag for a given music excerpt [2], [5], [6], [12], [20], [21], [24], [25]. Closely related to the DT, the random forest (RF) is an ensemble learning method that randomly selects acoustic features from several different sets [1], [2], [11]. The k-nearest neighbor (kNN) classifier calculates the distance between the test music pieces and the training music pieces in terms of the acoustic features [1], [8], [9]. The naïve Bayes (NB) classifier uses Bayesian theory to decide the relevant tag of music pieces [2], [5], [20]; it estimates the output tag as the one with the highest probability based on the feature distribution. The support vector machine (SVM) is a supervised learning model used for MTC [2], [9], [11]. In addition to the classifiers introduced thus far, many others, such as the multi-layer perceptron (MLP) [6], [9], [24], neural network (NN) [11]-[13], stacking ensemble [2], neuro-fuzzy classifier (NFC) [11], and back-propagation neural network (BPNN) [16], have been used in MTC studies. One study developed new kinds of classifiers [3], while [9] and [26] each employed 18 classifiers. Table 1 summarizes the number of music datasets considered in each corresponding study, the chosen acoustic feature extractor, the FS algorithm employed, and the classifier used for identifying the tags; Table 8 lists the abbreviations used in Table 1. Our brief review shows that the choices made in each study varied, indicating that it is difficult to directly compare the impact of FS on MTC.

A. PREPROCESSING
We used the MIRtoolbox to extract the acoustic features from the music excerpts. The basic operators of MIRtoolbox are related to the management of audio waveforms, frame-based analysis, periodicity estimation, auditory modeling, peak selection, and sonification of results. During preprocessing, we first input the music excerpts in the collection to the mirfeatures function in MIRtoolbox. Then, a total of 888 numeric features were computed and returned, which were further categorized into dynamics (6), rhythm (37), timbre (739), pitch (0), and tonal (106). The outputs were then exported to standard CSV files via the mirexport function of MIRtoolbox, which exports the statistical information of each music excerpt.
Next, due to unexpected processes such as divide-by-zero errors, the statistical information may include Not-a-Number (NaN) values. In this study, we used kNN imputation [52], with k = 3, to deal with the NaN values. Thus, all the NaN values were replaced with a value estimated by referencing the nearest three music excerpts. However, if an acoustic feature contains too many NaN values, kNN imputation cannot be applied because the algorithm is unable to identify the nearest neighbors; in such cases, we removed those features. We encoded the tag information using the one-hot encoding scheme [53] because this allows multiple tags to be assigned to a single music excerpt.
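The NaN handling described above can be sketched as follows. This is an illustrative implementation assuming a NumPy feature matrix; the 0.5 drop threshold for features with too many NaN values is our own placeholder, since the exact cutoff is not stated here.

```python
import numpy as np

def knn_impute(X, k=3, max_nan_ratio=0.5):
    """kNN imputation (k = 3, as in the paper) of NaN values.

    Features whose NaN ratio exceeds `max_nan_ratio` (an assumed
    threshold) are dropped first, mirroring the removal of features
    with too many NaN values.
    """
    X = np.asarray(X, dtype=float)
    keep = np.isnan(X).mean(axis=0) <= max_nan_ratio
    X = X[:, keep]
    out = X.copy()
    # Rough mean-fill so distances can be computed despite missing entries.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(np.isnan(X), col_means, X)
    for i in range(len(X)):
        missing = np.where(np.isnan(X[i]))[0]
        if missing.size == 0:
            continue
        d = np.linalg.norm(X_filled - X_filled[i], axis=1)
        d[i] = np.inf                      # an excerpt is not its own neighbor
        for j in missing:
            donors = np.where(~np.isnan(X[:, j]))[0]
            donors = donors[donors != i]
            nearest = donors[np.argsort(d[donors])[:k]]
            out[i, j] = X[nearest, j].mean()
    return out, keep
```

The returned boolean mask `keep` records which features survived, so the same columns can be dropped from any held-out data.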
Finally, for classifiers known to be effective with categorical features, we used the LAIM discretization scheme [54] to discretize the original numeric features into categorical ones.

B. PROPOSED FEATURE SELECTION ALGORITHM
1) Initialization and Evaluation
To search for the optimal feature subset, the proposed method first initializes a population composed of chromosomes. Each chromosome is represented as a binary vector consisting of ones and zeros [51], depending on whether each feature is selected. The chromosomes are initialized randomly such that they are evenly distributed across the entire search space. To verify that removing noisy features can improve accuracy, the initial population includes one chromosome that selects all features. After that, each chromosome is assigned a fitness value by the classifier. Specifically, the classifier is trained with the feature subset represented by each chromosome and then predicts the tag for each test music excerpt. Given the correct tags and predicted tags, a fitness value is obtained by calculating the tagging accuracy. Therefore, a better feature subset has a higher fitness value.
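A minimal sketch of this initialization and fitness evaluation, assuming a scikit-learn-style classifier interface; `make_classifier` is a hypothetical factory hook, not part of the described method:

```python
import numpy as np

def init_population(pop_size, n_features, rng):
    """Random binary chromosomes, plus one all-ones chromosome so the
    full feature set is always evaluated, as described in the text."""
    pop = rng.integers(0, 2, size=(pop_size - 1, n_features), dtype=np.int8)
    all_ones = np.ones((1, n_features), dtype=np.int8)
    return np.vstack([pop, all_ones])

def fitness(chromosome, X_train, y_train, X_val, y_val, make_classifier):
    """Train on the selected feature subset and return tagging accuracy."""
    cols = chromosome.astype(bool)
    if not cols.any():
        return 0.0                      # an empty subset cannot classify
    clf = make_classifier()
    clf.fit(X_train[:, cols], y_train)
    return float((clf.predict(X_val[:, cols]) == y_val).mean())
```

Any model exposing `fit`/`predict` (e.g. a naïve Bayes classifier, as used later in the experiments) can be plugged in via `make_classifier`.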

2) Parent Selection
Parent selection is the process of selecting the parents that generate the offspring for the next generation. To pass the relevant features on to the offspring, parents should be chosen from the population based on their fitness values. To this end, we adopted a tournament selection algorithm [55]. Given chromosome groups sampled at random from the population, tournaments are conducted by comparing fitness values, and the winner of each tournament is selected as a parent. In an MTC problem, a feature subset with a high fitness value does not ensure high accuracy for each individual tag because the dependencies between the tags differ. For example, two parents with high fitness values may consist of similar features related to the same tags. To generate superior offspring, a pair of parents should have a complementary relationship, with each containing features related to different tags. If one parent has features that depend on one part of the tags, another parent with features relevant to the remaining tags can complement it.
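Tournament selection can be sketched as follows; the tournament size of 3 is illustrative, as the text does not report the value used:

```python
import random

def tournament_select(population, fitnesses, tournament_size=3, rng=None):
    """Sample `tournament_size` chromosomes at random; the fittest wins
    and is returned as a parent."""
    rng = rng or random.Random()
    contestants = rng.sample(range(len(population)), tournament_size)
    winner = max(contestants, key=lambda i: fitnesses[i])
    return population[winner]
```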
To deal with this issue, we introduce a new parent-matching process. Let c be a chromosome chosen by a tournament and T be the set of tags. A tag-specific accuracy vector a_c = [a_1, ..., a_{|T|}] is computed by the trained classifiers, each used to predict one tag; here, a_i is the accuracy of the i-th tag predicted by c. To divide T into a strong tag subset T_s and a weak tag subset T_w, the elements of a_c are first sorted in ascending order. Then, because the optimal split point is unknown, T_s and T_w are separated at the point where the difference between two consecutive values in the sorted a_c is largest. Given T_w on c, its spouse c′ is then chosen so that it complements c on the weak tags.
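Before the spouse is matched, the strong/weak tag split at the largest gap in the sorted per-tag accuracies can be sketched as:

```python
def split_tags(tag_accuracies):
    """Split tag indices into (strong, weak) sets at the largest gap
    between consecutive per-tag accuracies sorted in ascending order."""
    order = sorted(range(len(tag_accuracies)), key=lambda t: tag_accuracies[t])
    accs = [tag_accuracies[t] for t in order]
    # Position after which the jump between consecutive accuracies is largest.
    gaps = [accs[i + 1] - accs[i] for i in range(len(accs) - 1)]
    cut = gaps.index(max(gaps))
    weak = set(order[:cut + 1])       # tags below the cut: weakly predicted
    strong = set(order[cut + 1:])     # tags above the cut: strongly predicted
    return strong, weak
```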

3) Offspring Generation via Feature Filter
Traditionally, offspring are generated by randomly recombining pairs of parents because it is difficult to assess the importance of each feature based only on the fitness value of a specific chromosome. If such information can be estimated, further improved offspring can be generated by exploiting the benefits of our parent-matching process. Given a pair of parents c and c′, the offspring must inherit the features closely related to T_s and T_w from c and c′, respectively. To this end, we employ a feature filter that enables us to compute the relationship between features and tags based on information theory. Let S_c and S_c′ be the feature subsets represented by c and c′. Given an offspring S_n = ∅, S_c can pass a feature f⁺ ∈ {S_c \ S_n} sequentially to S_n. To maximize the dependency between the features and T_s, f⁺ is selected as follows.
f⁺ = argmax_{f ∈ S_c \ S_n} I(S_n ∪ {f}; T_s),   (2)
where I(X; Y) is Shannon's mutual information between the variable sets X and Y. To calculate H(T_s) in Eq. (2), we transform the categorical variable T into a binary one-hot encoding. Because T_s then becomes a set of binary variables rather than a single variable, we adopt a generalized information-theoretic criterion [56], a recent feature filter, which calculates f⁺ as follows.
f⁺ = argmax_{f ∈ S_c \ S_n} J(f; S_n, T_s),   (3)
where J denotes the generalized information-theoretic criterion of [56]; S_c′ can likewise pass features to S_n by replacing S_c and T_s with S_c′ and T_w, respectively. To explore feature subsets of various sizes, we randomly set the number of features that S_c and S_c′ pass to S_n at each iteration. The offspring are generated as a combination of only those features present in the population. Therefore, some features in the original set are neglected, which can lead to local optima. A simple way to solve this issue is to extend the offspring generation process to the population level. Let F and S_p be the original feature set and the union of the feature subsets within the population, respectively. Given T_s and T_w on S_p, new features to add to the population are selected as follows.
f⁺ = argmax_{f ∈ F \ S_p} J(f; S_p, T_w),   (4)
where J denotes the generalized information-theoretic criterion of [56]. Similarly, features to delete from the population are selected by replacing F, S_p, and T_w with S_p, ∅, and T_s, respectively. After that, the modified S_p is added to the population as a new chromosome. Finally, the proposed method repeats these processes until the termination condition is met.
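A simplified sketch of the filter-guided feature passing described above, using plain mutual information between a single feature and the strong tags in place of the generalized criterion of [56], whose exact form is not reproduced here; feature columns and tag labels are assumed to be discrete:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Shannon mutual information I(X; Y) for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def pass_features(source, offspring, features, strong_tags, n_pass):
    """Greedily move up to n_pass features from a parent's subset into the
    offspring, each time taking the candidate with maximal MI to the strong
    tags. `features[f]` is the (discrete) column of feature f; this simple
    MI score stands in for the generalized criterion used in the paper."""
    offspring = set(offspring)
    candidates = set(source) - offspring
    for _ in range(min(n_pass, len(candidates))):
        best = max(candidates,
                   key=lambda f: mutual_information(features[f], strong_tags))
        offspring.add(best)
        candidates.discard(best)
    return offspring
```

The weak-tag pass from the spouse's subset is symmetric: the same routine with the spouse's features and the weak-tag labels.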

4) Termination
The termination condition is based on the number of fitness function calls (FFCs), that is, the number of evaluations of individuals by the classifier. The algorithm terminates its search process when the number of remaining FFCs reaches zero. The number of allowed FFCs is a user-defined parameter.
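The FFC-budget-driven outer loop can be sketched as follows; `evaluate` and `step` are placeholder hooks standing in for the fitness evaluation and offspring-generation steps described above:

```python
def evolve(init_pop, evaluate, step, max_ffcs=300):
    """Run the search until the fitness-function-call budget is spent.

    `evaluate(chromosome)` scores one chromosome (one FFC); `step(scored)`
    produces the next generation's new chromosomes from the scored
    (chromosome, fitness) pairs seen so far.
    """
    ffcs = 0
    scored = []
    pending = list(init_pop)
    while pending and ffcs < max_ffcs:
        c = pending.pop()
        scored.append((c, evaluate(c)))   # one fitness function call
        ffcs += 1
        if not pending:                   # generation fully evaluated
            pending = list(step(scored))
    best, best_fit = max(scored, key=lambda cf: cf[1])
    return best, best_fit, ffcs
```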

IV. EXPERIMENTAL RESULTS
We created 20 music datasets from different domains to validate the effect of FS on MTC. Table 2 provides short descriptions of the music collections employed, where each music collection consists of audio files and music tags annotated by users or through a specific system. As shown in Table 2, the domains of the collections cover 12 genres, three emotions, three instruments, one mood, one groove, and one key. Since the main objective of this study was to conduct a fair experiment on FS, the acoustic features were extracted using the same feature extractor, MIRtoolbox [7], which provides a set of integrated features for each music excerpt. The extracted features can be organized along five main musical dimensions: dynamics, rhythm, timbre, pitch, and tonality. Table 3 provides the statistics of the datasets created by applying the acoustic feature extractor to the music excerpts in the corresponding collection. Although in most cases the creator of a collection suggests a single domain to which its music tags belong, a few collections, such as emoMusic, Emotify, GiantStepsKey, and GMD, may cover multiple domains. For these collections, because the tags originally suggested by the creator are real-valued and consequently unsuitable for the classification task, we created corresponding music datasets by using the genre tags shown in Table 4; the resulting datasets are denoted emoMusic-G, Emotify-G, GiantStepsKey-G, and GMD-G. In Table 3, the terms |W|, Avg. Length, |F|, Suggested Domain, Used Domain, |C|, and Avg. Size of Class indicate the number of patterns, the average length of the music excerpts in the collection, the number of extracted acoustic features, the domain originally suggested by the creator of the collection, the domain used for creating the corresponding dataset, the number of classes or tags, and the average number of patterns per class with its standard deviation, respectively.
In this study, we used the NB classifier to compare the performance of the proposed and conventional FS methods because of its popularity and effectiveness. Specifically, the NB classifier was used to obtain the accuracy values of the candidate feature subsets; for example, CBFS uses it to validate the performance of 300 candidate feature subsets created by adding high-ranked features one by one to the feature subset. Among the 300 candidate feature subsets, the one that yields the best accuracy in the training phase is chosen as the final feature subset.
For the parameter settings of the evolutionary algorithms, 20% of the training patterns are used for evaluating the fitness of each chromosome; in short, 64%, 16%, and 20% of the patterns are used for training, validation, and testing, respectively. We set the population size to 50, and the maximum number of FFCs was limited to 300. To measure the performance of the four FS methods, we employed an accuracy measure in the range [0, 1], where the total number of classifications is |W| × 0.2 because we employed 8:2 hold-out validation. All experiments were repeated 10 times, and the average of the measured values was used to compare the performance of the four FS methods. In the Bonferroni-Dunn test, if the difference between the proposed method and a conventional method in terms of average rank is larger than the critical difference (CD), their performance is confirmed to be different and the superior method can be determined. In our experiments, we set the significance level α to 0.05, and thus CD = 0.9773. Table 5 shows the comparison results of the four methods in terms of average accuracy with standard deviation over the ten repeated experiments. The best accuracy value among the four methods for each dataset is in bold. In the table, / indicates that the corresponding method is significantly worse than the proposed method at the 95% confidence level. The experimental results indicate that the proposed method outperforms the comparison methods on 13 datasets. For example, the proposed method yields accuracies of 83.67%, 63.49%, 75.70%, and 45.42% for the Ballroom, Ballroom-Extended, GTZAN, and Soundtracks datasets, respectively, whereas the accuracies of the second-best method on those datasets are 73.31%, 51.82%, 66.55%, and 34.44%, respectively.
Thus, the differences between the two methods are 10.36%, 11.67%, 9.15%, and 10.98%, respectively, indicating that the proposed method significantly outperforms the second-best method on these four datasets. A similar tendency can be observed in the experiments on the other datasets, resulting in an average rank of 1.10. Based on the experimental results in Table 5, we conducted the Bonferroni-Dunn test, as shown in Figure 1. Since the proposed method achieved the best average rank, it is placed at the rightmost side of the figure. Above the axis, a line of length CD extends from the position of the proposed method. Since no comparison method lies within this interval, the superiority of the proposed method is confirmed.
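The critical difference used in the Bonferroni-Dunn test above can be reproduced from the experimental setup (four methods compared against the proposed one over 20 datasets, α = 0.05):

```python
from math import sqrt
from statistics import NormalDist

def bonferroni_dunn_cd(k, n, alpha=0.05):
    """Critical difference for the Bonferroni-Dunn post-hoc test:
    k methods compared over n datasets. q is the two-tailed normal
    quantile at the Bonferroni-adjusted level alpha / (k - 1)."""
    q = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    return q * sqrt(k * (k + 1) / (6 * n))

# Four FS methods over 20 datasets, as in the experiments.
cd = bonferroni_dunn_cd(k=4, n=20)
```

With these settings the computed value is approximately 0.9773, matching the CD reported in the text.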
To validate the effect of FS on MTC, we compared the classification performance based on the feature subset selected by the proposed method against that based on the original set of about 900 features, as shown in Table 6. The last column provides the improvement in accuracy of the selected feature subset over the original feature set. Although the improvement varies with the dataset, Table 6 indicates that accuracy generally improved as a result of FS with the proposed method. Specifically, the proposed method improves the accuracy for the Ballroom, Ballroom-Extended, and GoodSounds datasets by about 10% on average. As shown in the last row of Table 6, the FS process yields a 4.36% improvement on average. Interestingly, the feature subsets selected by the other methods may yield classification performance worse than that of the original feature set. For example, the classification accuracy for the Ballroom dataset with the original feature set is 75.25%, whereas that with the feature subset selected by CBFS is 64.82%. Thus, our experiments indicate that the FS process can improve the performance of MTC, but the choice of FS method is important.

V. ANALYSIS AND REMARKS
In this study, we created music datasets using the same acoustic feature extractor and identified the important features for the given music tags. Since all the datasets share a common feature structure, a subsequent analysis was conducted over multiple datasets to uncover insights such as the features important for general MTC tasks, for specific domains, and for specific nationalities. In this section, we analyze the selected features over multiple music datasets.

A. OVERALL DESCRIPTION OF TOP 50 FEATURES SELECTED
Of the features in the dynamics, rhythm, timbre, pitch, and tonality fields given by MIRtoolbox, the features selected by the proposed method differed across datasets. Figure 2 shows the 50 most frequently selected features across the entire experiment; their full names are detailed in Table 9 in the Appendix. In Figure 2, each bar represents a different MIRtoolbox feature field. Since we ran ten iterations for each of the 20 datasets, each feature can be selected at most 200 times. The dynamics and pitch features are excluded from the figure because they were not selected for any dataset, meaning the top 50 selected features all come from the rhythm, timbre, and tonality fields. Timbre in particular, with 27 features among the top 50, is clearly important for MTC. Among the 50 features, 16 are related to mfcc, which is widely used in traditional audio signal processing for voice and music, accounting for the largest proportion. These 16 features consist of five mfcc features, six delta-mfcc (dmfcc) features representing the rate of change of the audio, and five delta-delta-mfcc (ddmfcc) features representing its acceleration. The next most popular feature set within the timbre field is that of spectrum-related features. Seven spectrum-related features are extracted based on the Fast Fourier Transform, representing spectral distance, the coefficient of skewness, entropy, and so on. The timbre field also includes rolloff and brightness, which estimate the amount of high-frequency content, and spectral_irregularity, which captures fluctuations in the successive peaks of the spectrum. For the tonality field, 15 features were selected. The keystrength-related features, which distinguish major and minor keys, account for the largest proportion, with six features. The chromagram, which contributes four features, shows the distribution of energy along the pitches or pitch classes.
The rest are three mode-related features, one HarmonicChangeDetectionFunction feature, and one keyclarity feature. Finally, in the rhythm field, a total of eight features were selected, including six tempo-related features and one each for peak and attack. The six tempo-related features estimate the tempo by detecting periodicities in the event detection curve.

B. ANALYSIS BASED ON CORRELATION COEFFICIENT FOR EACH DOMAIN
For a more detailed analysis, we calculated the correlation coefficient (CC) between the selection frequencies of the features in each pair of datasets, where the selection frequency of a feature for a given dataset is the number of times it was selected during the ten iterative experiments. An array representing the selection frequency of each feature can thus be obtained for each dataset, and the CC between two such arrays represents the similarity of the FS results for the two datasets. Since the tendency may differ according to the domain of the datasets, we report the CCs at the overall level (considering all 20 datasets) and by genre, instrument, and emotion (mood). Figure 3 shows the CC matrices, while Table 7 lists the features commonly selected by the dataset pairs with the highest CCs. The (Ballroom-Extended, Seyerlehner:Unique) pair shares 23 commonly selected features, the highest ratio among all dataset pairs. The 23 features consist of two in the rhythm field, 15 in the timbre field, and six in the tonality field. Of the 15 timbre features, 12 are mfcc-related (two mfcc, three dmfcc, and seven ddmfcc). The six tonality features consist of one Mode_PeriodEntropy and five keystrength-related features. The five overlapping keystrength-related features indicate that the two datasets contain songs with similar major and minor key distributions. Next, the CC for the (Ballroom, Ballroom-Extended) pair is 0.39, the second highest in the experiment. However, only 11 features (three rhythm, seven timbre, and one tonality-related) are common to both datasets. This pair shares one each of mfcc, dmfcc, and ddmfcc. Only Keystrength_Slope, the most popular feature in the tonality field, was included among the commonly selected features. The (FMA-SMALL, HOMBURG) pair comprises unrelated datasets from the viewpoint of the music excerpts considered, yet it ranked high with a CC of 0.39.
In this pair, 20 features are commonly selected (two rhythm, 14 timbre, and four tonality-related), and 11 of the 14 timbre features are mfcc-related (three mfcc, three dmfcc, and five ddmfcc). The tonality features include Keystrength_Slope and Mode_PeriodEntropy. The (Ballroom-Extended, FMA-SMALL) pair had a CC of 0.38, with 12 commonly selected features (one rhythm, six timbre, and five tonality-related). In addition to Rolloff_PeriodFreq and Entropy_PeriodFreq, which were selected in all dataset pairs, this pair contains two spectral_irregularity and two mfcc features. Regarding the tonality field, this pair contains the most diverse features among the pairs in the genre domain, including chromagram, mode, and keystrength. Finally, the (Ballroom, Seyerlehner:Unique) pair had the fifth-highest CC, 0.37, with the second-largest overlap of 20 features: four rhythm, 14 timbre, and two tonality-related. Two mfcc, three dmfcc, and four ddmfcc features in the timbre field are commonly selected.
In the genre domain, three dataset pairs, (Ballroom-Extended, Seyerlehner:Unique), (Ballroom, Ballroom-Extended), and (Ballroom, Seyerlehner:Unique), yield the largest CC values. Since these three pairs are a subset of the overall dataset pairs already discussed, we omit a separate analysis from the viewpoint of the genre domain. Next, in the instrument domain, we compared all three datasets simultaneously. The GoodSounds dataset selected only 27 of the 888 features, so we analyzed the overlapping features of the Medley-solos-DB and PCMIR datasets. The (Medley-solos-DB, PCMIR) pair had a CC of 0.27 with only eight commonly selected features (one rhythm and seven timbre). Apart from one spectral_irregularity and four mfcc-related features (two mfcc and two dmfcc), the commonly selected features differed from those in the genre domain, indicating a difference in the tendency of the selected features between the genre and instrument domains. Finally, we examined the datasets in the emotion (mood) domain. The CC values for (MIREX-like_mood, Soundtracks), (Soundtracks, MER500), and (MER500, MIREX-like_mood) are 0.24, 0.11, and 0.07, respectively. Since the (MIREX-like_mood, Soundtracks) pair yields the best CC value, with the remaining two pairs having very small values, we restricted our analysis to the (MIREX-like_mood, Soundtracks) pair. In these two datasets, seven timbre features are commonly selected, all mfcc-related (one mfcc, one dmfcc, and five ddmfcc). Two of the three tonality features are keystrength-related.
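The selection-frequency correlation analysis described above can be sketched as follows, with feature subsets represented as sets of feature indices:

```python
def selection_frequencies(runs, n_features):
    """Count how often each feature index appears across the repeated runs
    (ten runs per dataset in the experiments)."""
    freq = [0] * n_features
    for subset in runs:
        for f in subset:
            freq[f] += 1
    return freq

def pearson_cc(a, b):
    """Pearson correlation between two selection-frequency arrays; this is
    the CC compared between dataset pairs in the analysis."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5
```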

C. REMARKS
In our experiments with multiple datasets, we observed which features were frequently selected in each domain and what musical characteristics those features represent. First, spectral_irregularity, mfcc, and keystrength-related features are important not only in the genre domain but also in the instrument and emotion (mood) domains. Features such as Rolloff_PeriodFreq, Entropy_PeriodFreq, spectral_irregularity, mfcc, and keystrength were commonly selected across all datasets. The recurrence of these features suggests that they play an essential role in FS, and they should be included when only a limited number of features can be selected. Second, the top five correlation coefficients all come from the genre domain. This implies that genre-domain datasets share more overlapping features than the other two domains do, making it easier to identify which features are important for genre classification.
Third, the features selected in the instrument and emotion (mood) domains rarely include the rhythm or tonality fields, and the overall number of overlapping features is small. Whereas in the genre domain many similar features were selected in addition to the frequently overlapping essential ones, in the instrument and emotion (mood) domains the selected features, apart from the essential ones, were largely unrelated across datasets. Accordingly, the correlation coefficients are lower than in the genre domain: although the datasets share the essential features, sharing other similar features also matters. Lastly, the six features in the dynamics field are all related to root-mean-square energy; they appear unhelpful for classification, even though they are extracted by MIRtoolbox.

VI. CONCLUSIONS
In this study, to resolve the existing difficulties in comparing the performance of various MTC methods, we created and evaluated 20 music datasets using the same feature extractor; as a result, all 20 datasets share the same acoustic feature structure. To verify the true effect of FS on MTC tasks, we devised a novel evolutionary FS algorithm. Owing to the shared acoustic feature structure of the 20 datasets, we were able to conduct additional experiments over multiple datasets, leading to valuable insights.
In the future, we plan to conduct experiments based on convolutional neural networks (CNNs), which have been in the spotlight in music classification, and we intend to utilize the findings of this study in that research. In previous studies using CNNs, spatial information obtained through an analysis of music signals, such as a mel-spectrogram, was used as input to the CNN. By providing higher-quality input information informed by the important features found in this study, a performance improvement can be expected. Table 8 lists all of the abbreviations used in Table 1, and Table 9 provides a detailed description of the y-axis labels in Figure 2.

APPENDIX
Abbreviation: Description
PeakPos: The abscissa position of the detected peaks, in sample index
PeakMag: The magnitude associated with each bin
Mean: The average along frames
Std: The standard deviation along frames
Slope: The linear slope of the trend along frames (the derivative of the line that would best fit the curve)
PeriodFreq: The frequency of the maximal periodicity detected in the frame-by-frame evolution of the values, estimated through the computation of the autocorrelation sequence. The first descending slope starting from zero lag is removed from this analysis, as it is not related to the periodicity of the curve.
PeriodAmp: The normalized amplitude of that main periodicity