Multi-Level Local Feature Coding Fusion for Music Genre Recognition

Music genre recognition (MGR) plays a fundamental role in music indexing and retrieval. Unlike images, music genres consist of intrinsic characteristics that are highly diversified, with abstractions at different levels. However, most representation learning methods for MGR focus on global features and make decisions from features at the same level. To remedy these defects, we integrate a convolutional neural network (CNN) with NetVLAD and self-attention to capture local information across levels and learn long-term dependencies. A meta classifier makes the final MGR classification by learning from aggregated high-level features from different local feature coding networks. Experimental results show that the proposed approach yields higher accuracies than other state-of-the-art models on the GTZAN, ISMIR2004, and Extended Ballroom datasets.


I. INTRODUCTION
With the growth of online music databases and user-interactive applications, developing effective automatic tools for music classification and retrieval has become an essential issue. Music information retrieval (MIR) aims to retrieve useful information from music and classify it into different categories. For MIR problems, music genre is significantly important because it facilitates both the search for music and the organization of music collections. Furthermore, music genre also reveals the interplay of cultures. Extracting influential features contributes to the automatic classification of music genres. Many approaches exist to extract descriptive features for music genre recognition (MGR) [1], [2]. The scatter transform and transfer learning have been widely used for image and audio classification [3]-[6] but rarely used in combination for MGR. Various methods for building music genre classifiers have been studied, including support vector machines (SVM) [7]-[9], Gaussian processes [10], convolutional neural networks (CNN) [11]-[13], recurrent neural networks (RNN) [14], and long short-term memory (LSTM) [15]. It has been shown that CNNs learning representation features yield better performance in comparison to LSTMs and traditional methods that extract handcrafted features [16]. In contrast, many handcrafted features have strong complementarity, yet MGR classification tasks usually use only one of them. In the field of MGR, most current representation learning methods focus on global features and make decisions from features at the same level. But music genres consist of intrinsic characteristics which are highly diversified and have different levels of abstraction [1], [11]. Moreover, the fusion in the final ensemble happens at decision level, which may ignore the internal relationship among features at early stages [8], [9].
(The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.)
As a local representation method, NetVLAD has been widely studied in recent years [17], [18]. Music streams with highly diversified characteristics have abstractions at different levels, which have different influences on the understanding of genres and are often distributed locally in repeated audio clips. The self-attention mechanism determines the importance of different features by focusing on the dependencies of all positions in the signal [19]. In this work, we design a feature encoding network for music streams by combining NetVLAD and the self-attention mechanism to fully explore local features at different levels of abstraction and learn their long-term dependencies. The code for the proposed approach is available on GitHub. Major contributions of this research are summarized as follows: (i) A multi-level feature coding network using a CNN with NetVLAD and self-attention is proposed to capture local information across different levels and learn their dependencies. Genre cues are locally positioned at different levels and time scales in a music stream. The NetVLAD captures local information from each layer and aggregates it to provide an integrated description. Then, self-attention learns long-term dependencies across these aggregated features from different levels. (ii) The complementary nature of the scatter transform feature and the transfer feature for MGR is explored, which enriches the diversity of features and makes the representations learned by the proposed model more useful. (iii) A meta classifier is used to learn the implicit relationships among high-level features to obtain more useful information. It is retrained using aggregated heterogeneous high-level features learned by the feature coding networks to improve the classification performance of the proposed ensemble model for MGR. Experimental results also show that the proposed model yields better accuracies in comparison to other methods.
The rest of this paper is organized as follows: Section II presents related works on feature extraction and classification methods for MGR tasks. Section III describes the proposed model. Section IV shows experimental results and discussions on datasets for MGR tasks. Lastly, Section V concludes this work.

II. RELATED WORK
Feature extraction and classification methods are major focuses of MGR tasks. This section reviews feature extraction and classification methods for MGR tasks in Sections II-A and II-B, respectively.

A. FEATURE EXTRACTION METHODS
Most MGR models extract audio features to achieve satisfactory performance. In general, audio features can be divided into handcrafted and non-handcrafted features. In some cases, handcrafted features capture statistical characteristics with bag-of-frames analysis to observe amplitude characteristics over time frames (e.g. timbre [20], pitch, and rhythm [21]). In other cases, handcrafted features extract spectrogram textures and their temporal variations with time-frequency analysis to describe the temporal change of energy distribution over frequency bins (e.g. Mel-spectrogram [22], [23], harmonic and percussive spectrogram [24], constant-Q transform spectrogram [25], [26], and scatter transform spectrogram [27]). The local binary patterns (LBP), local phase quantization (LPQ), Gabor filter feature extraction (GF), binarized statistical image features (BSIF), locally encoded transform feature histogram (LEN), and the codebookless model (CLM) are also used to analyze the similarity of spectrograms [8], [9], [28].
Representation learning techniques such as CNN, LSTM, and transfer learning are widely applied to obtain non-handcrafted features for MGR tasks [16], [29]. Transfer feature learning utilizes existing knowledge to process the available data from diversified feature spaces. In [6], a transfer feature is expressed as a concatenated feature vector using activations of multi-layer feature maps in a convolutional network that is pretrained on a very large dataset (i.e. the Million Song Dataset) [30].
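The concatenation of multi-layer activations described above can be sketched as follows; this is a minimal NumPy illustration assuming globally average-pooled activations and hypothetical layer sizes, not the exact construction of [6].

```python
import numpy as np

def transfer_feature(feature_maps):
    """Concatenate globally average-pooled activations from several layers.

    feature_maps: list of arrays shaped (channels, height, width) taken
    from different depths of a pretrained CNN (hypothetical sizes here).
    Returns a single 1-D feature vector.
    """
    pooled = [fm.mean(axis=(1, 2)) for fm in feature_maps]  # one vector per layer
    return np.concatenate(pooled)

# Example: three layers with 32, 64, and 128 channels.
maps = [np.random.rand(c, 8, 8) for c in (32, 64, 128)]
vec = transfer_feature(maps)
print(vec.shape)  # (224,)
```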

B. CLASSIFICATION METHODS
Classification methods for MGR tasks can be broadly divided into supervised classification and unsupervised clustering methods [1], [2]. As an instance of supervised methods for MGR, a local feature selection strategy uses a self-adaptive harmony search algorithm [7]. In recent years, many MGR solutions have shifted towards deep learning, which outperforms traditional machine learning approaches on MGR tasks. CNN representation learning outperforms handcrafted features while remaining complementary to them [29]. Based on the fact that most of a spectrogram's energy is distributed over only a few temporal steps, an attention mechanism is incorporated into a bidirectional recurrent neural network (BRNN) so that all temporal steps are taken into account, with larger weights assigned to the more important ones [15]. The BRNN, the GRU, and the BRNN-based model with parallelized and serial attentions are compared to show the effectiveness of attention mechanisms. Given the different levels of abstraction in music genres, a CNN-based architecture with multi-level and multi-scale features is exploited to better leverage the distributed properties of genres [11]. Furthermore, input signals are downsampled and transfer learning is utilized with the previous multi-level and multi-scale techniques [12].
The non-negative matrix factorization (NMF) [31], the non-negative tensor factorization (NTF) [32], and sparse coding [33] are instances of unsupervised learning methods for MGR. In addition to sparse coding, many feature coding methods are used in image classification, including hard coding [34], soft coding [35], low-rank sparse coding [36], [37], the vector of locally aggregated descriptors (VLAD) coding [18], and Fisher vector (FV) coding [38]. However, traditional feature coding methods are unsupervised clustering-based approaches, which may not be suitable for classification tasks. The traditional VLAD coding model is extended to the end-to-end model NetVLAD by using a learned softmax for local feature assignments [17]. The parameters of the network are trained by back-propagation. Some studies [39], [40] also explore neural networks involving clustering techniques for MGR. A bootstrapped k-means algorithm is proposed to cluster network responses [39]. A multi-layer perceptron combined with the spherical k-means algorithm is introduced to enhance the performance of transfer learning [40].

III. PROPOSED METHOD
The overall scheme for MGR is presented in Section III-A. Descriptions of feature extraction are given in Section III-B. The feature coding network focusing on local features across different levels is illustrated in Section III-C. The design of loss function and the fusion strategy from decision-level to feature-level for learning internal relations among high-level features at an early stage are given in Sections III-D and III-E, respectively.

A. OVERALL SCHEME
The procedure of the whole approach is presented in Algorithm 1. In Algorithm 1, F_agg, R_F, and F_c denote the aggregation of local features produced by the 5 NetVLAD layers, the high-level features extracted by a feature coding network, and the classification result of a feature coding network, respectively. F_agg, R_F, and F_c are also shown in Figure 2. F_6 denotes the best 6 feature combinations generated by the sum rule based on R_F of the eight types of low-level features. F_high and R_M are the aggregation of learned high-level features input to the meta classifier and the classification result of the meta classifier, respectively. R_best is the best classification result of the meta classifier selected from F_6. Firstly, the original audio signal is divided into multiple segments. The duration of each segment is 30 seconds for ISMIR2004 and 10 seconds for both GTZAN and Extended Ballroom. Then, eight types of low-level features are extracted from these segments: timbre [20], rhythm histogram features (RH) [21] together with statistical spectrum descriptors (SSD), transfer features [6], Mel-spectrogram [22], harmonic spectrogram [24], percussive spectrogram, constant-Q spectrogram [25], and scatter transform spectrogram [27]. Timbre, RH, and SSD capture statistical characteristics from time-frequency representations, while the Mel-spectrogram, harmonic spectrogram, percussive spectrogram, constant-Q spectrogram, and scatter transform spectrogram are time-frequency representations using different filter technologies. The transfer feature is a learned feature extracted by a CNN with transfer learning. Figure 1 shows five types of time-frequency representations. As in other works, we sample at 22,050 samples per second for feature extraction. After the extraction of the eight types of low-level features, high-level features are obtained by representation learning using the feature coding networks.
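The segmentation step above can be sketched as follows; this is a minimal NumPy sketch assuming non-overlapping segments, since the paper does not state its windowing or overlap.

```python
import numpy as np

SR = 22050  # sampling rate used in the paper

def split_segments(signal, seconds):
    """Split a 1-D audio signal into non-overlapping fixed-length segments,
    dropping any trailing remainder (one plausible reading of the setup)."""
    seg_len = int(seconds * SR)
    n = len(signal) // seg_len
    return signal[: n * seg_len].reshape(n, seg_len)

audio = np.zeros(SR * 30)             # a 30-second clip
segments = split_segments(audio, 10)  # GTZAN / Extended Ballroom use 10 s
print(segments.shape)  # (3, 220500)
```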
In the representation learning phase, each feature coding network uses a feature as inputs. Specifically, the feature coding network uses a VGG as the backbone network which consists of 5 cascading blocks. VOLUME 8, 2020 Each block consists of a convolutional layer, a batch normalization layer, an ELU activation layer, and a pooling layer. The feature map produced by each block represents features at a particular level of abstraction. Then, these 5 feature maps at different levels of abstractions are fed into 5 NetVLAD layers to extract their local features. Self-attention is applied to learn long-term dependencies among these local features across different levels. In our method, an ensemble of feature coding networks is trained where each network uses a selected low-level feature as an input. Then, a meta-CNN is trained using the learned high-level features extracted by feature coding networks. There are 247 combinations (combinations of 2 to 8 features) over 8 low-level features. A simple sum rule is applied to select the best 6 feature combinations (i.e. F 6 ) yielding the highest testing accuracies. For each feature combination in F 6 , the learned high-level features are aggregated and then fed to a meta-CNN. Finally, the best classification result R best is obtained by a meta-CNN using the learned high-level features aggregation of the best feature combination as inputs.

B. FEATURE EXTRACTION
In our work, the eight features are timbre, rhythm, Mel-spectrogram, harmonic spectrogram, percussive spectrogram, constant-Q transform spectrogram, scatter transform spectrogram, and the transfer feature.
Components of the timbre feature include the mean and variance of the spectral centroid, spectral flux, time-domain zero crossings, the low-energy component, and 13 Mel-frequency cepstral coefficients [20]. In this work, we used the Marsyas toolbox to generate the timbre feature. Rhythm consists of the rhythm histogram feature (RH) and statistical spectrum descriptors (SSD) [21]. RH extracts a set of sixty acoustic characteristics, each of which corresponds to the aggregated amplitude modulation of one modulation-frequency bin computed from the rhythm patterns of twenty-four critical bands. SSD is a set of statistical measures describing amplitude fluctuations in the twenty-four critical bands defined according to the Bark scale, capturing some timbre information. Code for RH and SSD is provided on GitHub. Inspired by measured responses of the human auditory system, the Mel-spectrogram is computed by mapping the spectral magnitudes of the short-time Fourier transform onto the perceptually motivated Mel scale using a filterbank [22], [23]. The concept of anisotropic spectrogram diffusion is used to split the music signal into two portions: a harmonic spectrogram and a percussive spectrogram [24]. The constant-Q transform provides a logarithmically spaced frequency basis and a frequency resolution that depends on the geometrically spaced center frequencies of the analysis windows [25], [26]. In this work, we use the Librosa Python package to generate the Mel-spectrogram, harmonic spectrogram, percussive spectrogram, and constant-Q transform spectrogram.
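For reference, one common Hz-to-Mel mapping underlying such filterbanks is the HTK-style formula sketched below; note that Librosa's Mel filterbank defaults to the slightly different Slaney variant, so this is purely illustrative.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Hz-to-Mel mapping used by many Mel-spectrogram implementations."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of the HTK-style mapping."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 1 kHz sits very close to 1000 mel by design of the scale.
print(hz_to_mel(1000.0))
```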
A scattering transform defines a locally translation-invariant representation which is stable with respect to time-warping deformation [27]. It is defined as a convolutional network whose filters are fixed to be wavelet and lowpass averaging filters coupled with modulus nonlinearities. We extract the scatter transform spectrogram using the Kymatio Python package.
Transfer learning is designed to leverage a model already trained in a related field. The transfer feature used in this paper comes from the transfer learning method proposed in [6]. A convolutional neural network is designed and trained for the source task, and the trained network is used as a feature extractor for the target MGR task on the GTZAN, ISMIR2004, and Extended Ballroom datasets. The code for the transfer feature is available on GitHub.

C. MULTI-LEVEL FEATURE CODING NETWORK
The structure of the multi-level feature coding network is shown in Figure 2. The feature coding network is mainly divided into four parts: the backbone network, NetVLAD layers, a self-attention layer, and fully connected layers. The backbone network consists of 5 cascaded blocks, with each block containing four basic units: a convolutional layer, a batch normalization (BN) layer, an ELU activation layer, and a pooling layer. Different spectral bands have different distributions, so features should be learned from different bands. Therefore, 2D convolutions are used to learn both temporal and spectral structures [41]. In our method, 2D convolutions with a kernel size of 3 × 3, a stride of 1, and a padding of 1 are used to build the multi-level feature coding network. The ELU is used as the activation function in every convolutional layer, while BN is applied after convolution and before activation. The number of feature maps is set to 32 in the first three blocks and 64 in the last two blocks; it is increased in the final two blocks so as to provide sufficient resolution of the learned features.
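With a 3 × 3 kernel, stride 1, and padding 1, each convolution preserves spatial size, so only the pooling layers shrink the feature maps across the 5 blocks. This can be checked with the standard output-size formula (the input height of 128 below is hypothetical):

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output size of a (assumed) 2x2 max-pooling layer with stride 2."""
    return (size - kernel) // stride + 1

h = 128  # hypothetical input height
for _ in range(5):  # 5 backbone blocks: conv preserves size, pooling halves it
    h = pool_out(conv_out(h))
print(h)  # 4
```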
Extracted features from the convolutional layers in the 5 blocks can be expressed as F ∈ R^(H×W×D), where D, H, and W denote the number of convolutional filters and the height and width of the feature maps, respectively. F can be considered as a descriptor set containing H×W deep descriptors of dimension D. The NetVLAD takes F as input. K cluster centers (visual words) in NetVLAD are used to construct a VLAD dictionary. Each descriptor f_j is softly assigned to the clusters; the soft assignment coding is shown in Eq. (4),

a_k(f_j) = exp(w_k^T f_j + b_k) / Σ_{k'} exp(w_{k'}^T f_j + b_{k'}),   (4)

where w_k and b_k are trainable parameters for each cluster.
With Eqs. (1)-(4) combined, the final form of the NetVLAD layer is expressed as Eq. (5),

V(k, d) = Σ_j a_k(f_j) (f_j(d) − c_k(d)),   (5)

where f_j(d) and c_k(d) are the d-th elements of f_j and c_k, respectively. Thus, a NetVLAD coding with a size of K × D is generated by summation over the residuals between features and their corresponding centers. The dictionary size K has an important influence on the discriminative power, and it determines the size of the coding. Because the subsequent self-attention layer consumes a large amount of GPU memory [19], the dictionary size K is set to 20 rather than a larger number. The learning procedure of NetVLAD is shown in Algorithm 2.
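Eqs. (4)-(5), together with the normalization steps of Algorithm 2, can be sketched in NumPy as follows; the descriptor counts and dimensions are illustrative, and a real NetVLAD layer would learn W, b, and the centers C by back-propagation rather than sample them randomly.

```python
import numpy as np

def netvlad(F, W, b, C):
    """NetVLAD coding of a descriptor set (NumPy sketch of Eqs. (4)-(5)).

    F: (N, D) descriptors (the H*W deep descriptors, flattened)
    W: (K, D) and b: (K,) soft-assignment parameters; C: (K, D) cluster centers.
    Returns a K*D vector after intra-normalization and L2 normalization.
    """
    logits = F @ W.T + b                        # (N, K)
    logits -= logits.max(axis=1, keepdims=True) # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)           # soft assignment, Eq. (4)
    # V[k, d] = sum_j a_k(f_j) * (f_j[d] - c_k[d]), Eq. (5)
    V = (a.T @ F) - a.sum(axis=0)[:, None] * C  # (K, D)
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)      # final L2 normalization

rng = np.random.default_rng(0)
N, D, K = 64, 16, 20  # K = 20 matches the paper's dictionary size
v = netvlad(rng.normal(size=(N, D)), rng.normal(size=(K, D)),
            rng.normal(size=K), rng.normal(size=(K, D)))
print(v.shape)  # (320,)
```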
A music stream has complex intrinsic patterns that are highly diversified and have different levels of abstraction. Combining multi-layer audio features helps to find genre cues positioned at different time scales. Therefore, all 5 NetVLAD codings of the 5 blocks are aggregated and fed to the self-attention layer to capture long-term dependencies across different levels. The self-attention maps are derived from the aggregation X ∈ R^(M×K) of the 5 NetVLAD codings, where K and M denote the dictionary size and the number of rows of the matrix X, respectively. Let X = {X_1, X_2, ..., X_M} be the aggregation of NetVLAD codings, where X_1, X_2, ..., X_M are column vectors. The attention value at position i is obtained by Eq. (6),

Y_i = (1 / C(X_i)) Σ_j φ(X_i, X_j) g(X_j).   (6)
g(X_j) = W_g X_j,   (8)

φ(X_i, X_j) = exp((W_q X_i)^T (W_k X_j)),   (9)

where W_g, W_k, and W_q denote weight matrices implemented by 2D convolutions with a kernel size of 1 × 1. g is the linear embedding function shown in Eq. (8), and φ denotes the embedded Gaussian function that calculates the dependency between X_i and X_j, as shown in Eq. (9). In Eq. (6), the function φ calculates a scalar revealing the relevance between the signal intensities of X_i and X_j, while the function g computes the feature embedding of X at position j. The contribution of each position is determined by both the relevance and the signal intensity. The response is subsequently normalized by a factor C(X_i), which is defined in Eq. (7),

C(X_i) = Σ_j φ(X_i, X_j).   (7)

Then the output of the self-attention layer with residual learning can be expressed by Eq. (10),

Z_i = α Y_i + X_i,   (10)

where the scale parameter α balances the contributions of local and non-local sources during training. Furthermore, we add multi-head attention to the self-attention layer by concatenating the outputs of self-attention N times, which enables the model to learn relevant information in different representation subspaces. The learning procedure of self-attention is shown in Algorithm 3.

(Algorithm 2, concluding steps: 8. End for. 9. The NetVLAD coding is obtained by applying intra-normalization and L2 normalization to V(F).)

Algorithm 3 The Procedure of Self-Attention Learning
Input: The aggregation X of NetVLAD codings V(F)
Output: Self-attention features Z
1. Let Z = ∅, and initialize the weights W_k, W_g, and W_q in Eq. (8) and Eq. (9).
2. For n = 1, 2, ..., N do:
3. For X_i in X do:
4. Calculate the dependency between X at any position j and X at the current position i by Eq. (9).
5. Calculate the intensities of X at any position j by Eq. (8).
6. Calculate the attention value of X at the current position i by Eq. (6) and Eq. (7).
7. Employ residual learning in the attention layer by combining the contributions of local and non-local sources by Eq. (10).
8. Let Z = Z ∪ Z_i.
9. End for
10. End for
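Eqs. (6)-(10) can be sketched in NumPy as follows (single-head, with the 1 × 1 convolutions reduced to plain matrix multiplications; all dimensions and weight scales are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wg, alpha):
    """Embedded-Gaussian self-attention with a residual connection
    (NumPy sketch of Eqs. (6)-(10)).

    X: (M, K) aggregation of NetVLAD codings, one row per position.
    """
    Q, Km, G = X @ Wq.T, X @ Wk.T, X @ Wg.T     # 1x1-conv embeddings
    scores = Q @ Km.T                           # log phi(X_i, X_j), Eq. (9)
    scores -= scores.max(axis=1, keepdims=True) # numerical stability
    phi = np.exp(scores)
    # Y_i = (1 / C(X_i)) * sum_j phi(X_i, X_j) g(X_j), Eqs. (6)-(7)
    Y = (phi / phi.sum(axis=1, keepdims=True)) @ G
    return alpha * Y + X                        # residual learning, Eq. (10)

rng = np.random.default_rng(1)
M, K = 100, 20
X = rng.normal(size=(M, K))
W = [rng.normal(size=(K, K)) * 0.1 for _ in range(3)]
Z = self_attention(X, *W, alpha=0.5)
print(Z.shape)  # (100, 20)
```

With alpha = 0 the layer reduces to the identity, which is why initializing the scale parameter near zero lets the residual branch be learned gradually.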
Finally, output features of the self-attention layer are flattened and fed into two fully connected layers for inference. A dropout layer and a batch normalization layer are applied between two fully connected layers. Overall, the feature coding network has 23.03 million parameters and 0.35 billion floating point operations (FLOPs). Among them, the backbone network, the 5 NetVLAD layers, the self-attention layer, and the last two dense layers have 0.3169 billion FLOPs, 0.0093 billion FLOPs, 0.0007 billion FLOPs, and 0.0231 billion FLOPs, respectively. When using the NVIDIA TitanX GPU, average computation time for each batch during the training phase and the test phase is 34.20 ms and 11.41 ms, respectively.

D. LOSS FUNCTION
The softmax loss makes features spread in narrow strips. Although classes are separated from each other, the feature distribution within a class is not compact. Owing to the large variance within a class, the robustness of the model may deteriorate if only the softmax loss is used. The center loss learns a center for the features of each class and penalizes the distances between features and their corresponding centers [42], thus making the distribution of features within a class more compact. The combined objective is ζ = ζ_s + λ ζ_c (Eq. (13)), where ζ_s and ζ_c denote the softmax loss and the center loss, respectively. As shown in Figure 3, features extracted by the feature coding network with both softmax loss and center loss are more compact and better distinguishable than features extracted with the softmax loss alone.
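A minimal sketch of the joint objective, assuming fixed class centers for illustration (in training, the centers are learned alongside the network):

```python
import numpy as np

def center_loss(features, labels, centers):
    """Center loss: half the mean squared distance between each feature
    and its class center."""
    return 0.5 * np.mean(np.sum((features - centers[labels]) ** 2, axis=1))

def total_loss(softmax_loss, features, labels, centers, lam=0.001):
    """Joint objective: softmax loss plus lambda-weighted center loss
    (lambda = 0.001 is the value selected in the experiments)."""
    return softmax_loss + lam * center_loss(features, labels, centers)

centers = np.zeros((3, 4))          # 3 classes, 4-D features (toy values)
feats = np.ones((6, 4))
labels = np.array([0, 1, 2, 0, 1, 2])
print(center_loss(feats, labels, centers))  # 2.0
```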

E. FUSION STRATEGY
Before building the ensemble, the best combination of features is selected using a simple sum rule over the feature coding networks. Then, the high-level features of the selected combination are aggregated and fed to a meta-CNN, which consists of three blocks, a self-attention layer, and two fully connected layers. Each block in the meta-CNN consists of a 1D convolution layer, a batch normalization layer, an ELU activation layer, and a pooling layer. The numbers of convolutional kernels in the three blocks are 32, 64, and 32, respectively. Then, a self-attention layer is used to learn the dependencies among the heterogeneous high-level features. Finally, the output features of the self-attention layer are flattened and fed into two fully connected layers for final inference. A dropout layer and a batch normalization layer are applied between the two fully connected layers. The meta-CNN has 20.19 million parameters and 0.02 billion FLOPs. The average computation time per batch during the training and test phases is 44.65 ms and 14.33 ms, respectively, using an NVIDIA TitanX GPU.
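One meta-CNN block (1D convolution, ELU, max pooling) can be sketched in NumPy as follows; the single-channel input, kernel width, and pooling size are illustrative simplifications of the architecture described above.

```python
import numpy as np

def elu(x):
    """ELU activation with alpha = 1."""
    return np.where(x > 0, x, np.exp(x) - 1.0)

def conv_block(x, kernels, pool=2):
    """One meta-CNN block (NumPy sketch): a bank of 1D convolutions,
    ELU activation, and max pooling. x: (length,), kernels: (n_out, k)."""
    k = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, k)  # (L-k+1, k)
    y = elu(windows @ kernels.T)                              # (L-k+1, n_out)
    n = y.shape[0] // pool
    return y[: n * pool].reshape(n, pool, -1).max(axis=1)     # (n, n_out)

rng = np.random.default_rng(3)
feat = rng.normal(size=256)  # aggregated high-level feature vector (toy size)
out = conv_block(feat, rng.normal(size=(32, 3)))  # 32 kernels, as in block 1
print(out.shape)  # (127, 32)
```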

IV. EXPERIMENTS
Details of three benchmark datasets and experimental settings are given in Sections IV-A and IV-B, respectively. Then, Section IV-C provides experimental results and discussions on these three datasets.

A. DESCRIPTION OF DATASETS
Datasets used for the evaluation of models are GTZAN, ISMIR2004, and Extended Ballroom. The GTZAN dataset consists of 10 genre classes: Blues, Classical, Country, Disco, Hip Hop, Jazz, Metal, Pop, Reggae, and Rock [20]. Each genre class consists of 100 audio recordings of around 30 seconds each, 1000 music excerpts in total. These excerpts are taken from radio, compact disks, and MP3 files. Each item is stored as a mono audio file sampled at 22,050 Hz with 16-bit resolution. ISMIR2004 is a genre classification dataset proposed for the music information retrieval contest organized by the Music Technology Group of Pompeu Fabra University. It consists of 1458 samples of six different genres: classical (640), electronic (229), jazz/blues (52), metal/punk (90), rock/pop (203), and world (245). The 1458 music pieces are divided into a training set (50%) and a test set (50%) in the contest. The Extended Ballroom dataset is an extension of the Ballroom dataset. It contains 4180 music excerpts, about six times as many as the Ballroom dataset [43].

B. EXPERIMENTAL SETTINGS
For the GTZAN dataset, both 10-fold cross-validation and the manual split of [45] are applied to evaluate the proposed method. Training and test sets are given for the ISMIR2004 dataset, so they are used directly. For the Extended Ballroom dataset, 10-fold cross-validation is also applied. The RMSProp optimizer with a smoothing constant of 0.9 and an L2 penalty of 4e-5 is used to train the feature coding networks. The learning rate is set to 0.001 and decays every two epochs with an exponential rate of 0.94. Before training the meta-CNN for final classification, the extracted high-level features are normalized to zero mean and unit standard deviation. The RMSProp optimizer with a fixed learning rate of 1e-5 is used to optimize the meta-CNN. Experiments are carried out on the deep learning platform PyTorch with an NVIDIA TitanX GPU.
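The feature-coding-network learning-rate schedule described above amounts to a step-wise exponential decay:

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.94, step=2):
    """Exponential decay applied every `step` epochs, matching the
    schedule stated in the experimental settings."""
    return base_lr * gamma ** (epoch // step)

assert learning_rate(0) == 0.001          # initial learning rate
assert learning_rate(1) == 0.001          # unchanged within the first 2 epochs
print(learning_rate(10))                  # after five decay steps
```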

C. RESULTS AND DISCUSSIONS
Firstly, we present the results of hyperparameter selection and ablation experiments on NetVLAD and self-attention. Then, tests on fusion strategies and visualizations of both the training process and the extracted features are reported. The artist filter is applied to the GTZAN dataset to mitigate its known flaws. Finally, the results of the proposed model are compared with other state-of-the-art models.

1) HYPERPARAMETERS SELECTION
In the proposed model, there are four key hyperparameters: the dictionary size K, the number of heads N in the self-attention mechanism, the λ in the loss function ζ as shown in Eq. (13), and the choice of activation function. We test the cases where K = 5, 10, 15, 20, 25, 30, 35, 40 and N = 5, 10, 15, 20, 25, 30, and 35. We take the Mel-spectrogram of the ISMIR2004 dataset as an example to select appropriate K and N for the feature coding network. The accuracies obtained for different K and N are shown in Figure 4 and Figure 5, respectively. Figure 4 shows that accuracy increases with K. However, accuracies do not change much when K > 20. A smaller K implies faster training, so K = 20 is selected. In Figure 5, N = 10 provides the highest accuracy and is therefore selected as the optimal N for further analysis. Accuracies obtained by the feature coding network with different loss functions are shown in Table 1. Table 1 shows that the proposed network with the combined loss function (Eq. (13)) yields an improvement of 0.69% in test accuracy compared with that employing the softmax loss alone (i.e. ζ_s). In addition, ζ with λ = 0.001 yields the best test accuracy. Center loss may be beneficial for enlarging distances among classes and enhancing the robustness of the model. Therefore, λ = 0.001 is used in our experiments. Figure 6 shows the convergence of feature coding networks using different activation functions. When using the ELU function, the feature coding network converges faster with fewer fluctuations compared to those using the ReLU and LeakyReLU activation functions.

2) ABLATION EXPERIMENTS
Figure 7 shows the results of ablation experiments on self-attention and NetVLAD using the Mel-spectrogram of the ISMIR2004 dataset as an example. The model with both self-attention and NetVLAD yields the best performance, while the model with neither self-attention nor NetVLAD yields the worst result.
This result suggests that global features alone are not sufficient to improve accuracy compared with the local features obtained by NetVLAD. Besides, the self-attention mechanism improves global features by learning their dependencies. However, Figure 7 shows that the attention mechanism only marginally improves the model with NetVLAD. This can be explained by the fact that local features across different levels already provide discriminative information for MGR, which may largely overlap with the additional information obtained from the self-attention mechanism. Nonetheless, self-attention learns the dependencies of these local features across different levels, which helps to improve the performance of the model.

3) FEATURE COMBINATIONS
To study the performance of the feature coding network on individual features, accuracies obtained by feature coding networks with different low-level features are shown in Table 2. No single low-level feature yields the best performance on all datasets, while the rhythm feature yields the worst performance on all three datasets. Table 2 shows that the feature coding network using the scatter transform spectrogram yields the best test accuracies of 89.50% for GTZAN and 93.80% for Extended Ballroom. For the ISMIR2004 dataset, the feature coding network using the transfer feature yields the best test accuracy of 87.38%. These results are expected because the scatter transform spectrogram is a locally translation-invariant representation that is stable to time-warping deformation, while the transfer feature was trained on the very large Million Song Dataset with rich labels for various aspects of music, including mood, genre, and instrumentation.
Then we evaluate performances yielded with different combinations of low-level features. In our method, an ensemble of feature coding networks is trained and each base network uses a type of low-level features as inputs.
Given 247 combinations (all combinations of 2 to 8 features) over the 8 low-level features, the simple sum rule is applied to select the top 6 feature combinations yielding the highest test accuracies for each dataset. Tables 3, 4, and 5 show test accuracies on the chosen datasets yielded by the networks with both a simple sum rule and a meta-CNN using the top 6 feature combinations. In these tables, c, h, m, p, s, t, and T denote the constant-Q transform feature, harmonic spectrogram, Mel-spectrogram, percussive spectrogram, scatter spectrogram, timbre feature, and transfer feature, respectively. The 6 best feature combinations for the three datasets are different, which shows that neither a single combination nor a single feature provides the best results for all datasets. Experimental results also show that fusion by a meta-CNN is more competitive than fusion by a sum rule in some scenarios. The best results on all three datasets are yielded by fusion using a meta-CNN, which verifies that learning the internal relationship among different features at early stages improves the results. m, s, and T are useful low-level features for the best performing meta-CNNs. The low-level feature combination yielding the best test accuracy is used for each dataset in our model. Therefore, the proposed method uses {c, h, m, s, t, T} for ISMIR2004, {m, p, s, T} for GTZAN, and {s, T} for Extended Ballroom in the experiments of the next subsection.
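The combination count and the sum rule can be sketched as follows; the per-network class probabilities below are synthetic, and 247 = C(8,2) + ... + C(8,8).

```python
from itertools import combinations

import numpy as np

FEATURES = ["timbre", "rhythm", "mel", "harmonic",
            "percussive", "cqt", "scatter", "transfer"]

# All combinations of 2 to 8 features out of the 8 low-level features.
combos = [c for r in range(2, 9) for c in combinations(FEATURES, r)]
print(len(combos))  # 247, matching the paper

def sum_rule(prob_maps, combo):
    """Sum-rule fusion: add the per-class probability vectors of the
    selected feature coding networks and take the argmax."""
    return int(np.argmax(sum(prob_maps[f] for f in combo)))

rng = np.random.default_rng(2)
probs = {f: rng.dirichlet(np.ones(10)) for f in FEATURES}  # toy 10-class outputs
print(sum_rule(probs, combos[0]))
```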

4) VISUALIZATION
Visualizations of the training process and extracted features are provided to show that the proposed model does not overfit on small datasets. t-SNE (t-distributed stochastic neighbor embedding) is used to visualize the best feature combination before feeding it into the meta-CNN. Figures 8, 9, and 10 show the 2-D visualizations of the best feature combination extracted by the feature coding network for the ISMIR2004, GTZAN, and Extended Ballroom datasets, respectively. In Figure 8, the distribution of ''jazz'' is the most compact on the ISMIR2004 dataset, followed by ''metal/punk''. The genre ''world'' is the least compact. As shown in Figure 9, the distributions of ''blues'' and ''classical'' are the most compact for the GTZAN dataset, followed by ''jazz'' and ''metal''. ''Rock'' is the most loosely distributed. As shown in Figure 10, the distributions of ''Wcswing'', ''Pasodoble'', and ''Quickstep'' are the most compact for the Extended Ballroom dataset, while the distributions of ''Salsa'' and ''Slowwaltz'' are the loosest. Overall, the overlap between classes is insignificant for all three datasets. As shown in Figures 8, 9, and 10, the distribution of each genre is concentrated at a single center instead of multiple centers. The boundaries of most genres are clear and can be separated by a smooth hypersurface. These observations indicate that our deep model does not overfit on these small datasets. Furthermore, we feed the best feature combination to the meta classifier and visualize the training process of the model on the three datasets in Figure 11. Figure 11 shows that the proposed model does not overfit and yields good test accuracies (i.e. good generalization capability).

5) GTZAN WITH ARTIST FILTER
Following the evaluation of the GTZAN dataset in [2], we conduct experiments with 3-fold cross-validation and the artist-filtered split of [45]. In these experiments, all duplicate songs and unrecognizably distorted songs are removed. An artist filter (AF) is applied to ensure that no song from the same artist appears in both the training and test sets of a fold. As shown in Table 4, the Mel-spectrogram, percussive spectrogram, scatter spectrogram, and transfer feature form the best feature combination for GTZAN without AF, so they are used to evaluate the performance on GTZAN with AF. The accuracies of these features and their combination through meta-CNN fusion are shown in Table 9. Table 10 shows the confusion matrix for GTZAN with AF using the feature combination {m, p, s, T}. The proposed method yields a lower test accuracy for GTZAN with AF (shown in Table 9) than without AF (shown in Table 2). Although the test accuracies drop for all features when AF is used, the transfer feature suffers the smallest drop because it transfers knowledge learned from a large dataset. Our results with AF in Table 10 are highly similar to those in [44], where AF was proposed to evaluate GTZAN. For example, ''rock'' and ''disco'' are the most confused, while ''classical'' and ''metal'' have the most correctly classified samples. The fusion accuracy with AF in Table 10 drops to 86.92%, but it is still 31.10% higher than the accuracy with AF in [44]. Table 11 shows that the test accuracies yielded by the proposed model outperform other state-of-the-art models [28], [45] on the ISMIR2004, GTZAN, and Extended Ballroom datasets. For a fair comparison, none of the methods in Table 11 use AF. Our proposed model is called FusionNet. For the ISMIR2004 dataset, FusionNet yields a test accuracy of 92.46%, which is 0.36% and 1.56% higher than the second-best and third-best models in comparison, respectively.
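The artist-filter constraint (no artist shared between the training and test set of a fold) maps directly onto grouped cross-validation. The paper does not specify the tooling, but a minimal sketch with scikit-learn's `GroupKFold`, using hypothetical per-track artist labels as groups, could look like this:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def artist_filtered_folds(n_tracks, artists, n_splits=3):
    """Yield (train, test) index arrays such that no artist's songs
    appear in both the training and test set of any fold."""
    artists = np.asarray(artists)
    X = np.zeros((n_tracks, 1))  # placeholder features; only indices matter
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, groups=artists):
        yield train_idx, test_idx
```

Each yielded pair of index arrays is guaranteed to keep every artist on exactly one side of the split, which is precisely what the AF evaluation requires.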
For the GTZAN dataset, FusionNet yields a test accuracy of 96.50%, which is 0.8% and 5.9% higher than the second-best and third-best models in comparison, respectively. For the Extended Ballroom dataset, the test accuracy of FusionNet is 95.50%, which is 0.60% and 2.80% higher than the second-best and third-best models, respectively. Although the improvement in test accuracy over the second-best method is minor for the ISMIR2004 dataset, the proposed method yields satisfactory or significant improvements in the other cases.

6) COMPARISONS WITH STATE-OF-THE-ART METHODS
For methods based on improved handcrafted features for MGR tasks, a novel feature set derived from long-term modulation spectral analysis of spectral and cepstral features is proposed to characterize the temporal evolution of an audio signal [46]. Then, an information fusion approach that integrates both feature-level and decision-level fusion is employed to improve the classification accuracy. A classifier combined with the joint sparse low-rank representation is proposed to identify subspaces onto which the samples are projected [48]. A music genre classification model is proposed to capture the temporal evolution of the spectral characteristics of the music signal and reduce the computational complexity [49]. It contains spectro-temporal features based on timbre features, an SVM ranker for feature selection, and an RBF kernel estimation for SVM classification. Two novel scale- and shift-invariant time-frequency representations of the audio content are proposed in [47] to model the inter-relationship between the various frequency bands.
For methods based on representation learning, a bidirectional recurrent neural network with serial attention and parallelized attention is proposed to focus on details of the target area [15]. A CNN-RNN-cascaded deep learning model that uses almost no handcrafted features is proposed in [14]. In [12], a CNN-based architecture with multi-level and multi-scale features [11] is extended by transfer learning. A transfer feature is proposed in [6], and the deep network trained on the Million Song Dataset is used as a feature extractor for small datasets.
For methods based on ensemble learning, a combination of weighted classifiers is used to enhance the performance obtained by sum-rule fusion [8]. In its follow-up work, the authors employ and evaluate more novel representations and texture descriptors [9]. In the extended version of this work [28], the authors conduct tests on different texture descriptors and a CNN-based model. The complementarity between handcrafted features and CNN features is investigated for the first time on music classification tasks [29].
Before our work, the best performances on the ISMIR2004 and GTZAN datasets were reported in [28], which fuses the results from an ensemble of handcrafted texture descriptors and a CNN-based model. The approach proposed in [28] achieves test accuracies of 92.10% and 95.70% on the ISMIR2004 and GTZAN datasets, respectively. The highest test accuracy of 94.90% on the Extended Ballroom dataset is reported in [47]; this method is based on scale- and shift-invariant time-frequency representations. Our proposed FusionNet yields test accuracies of 92.46%, 96.50%, and 95.50% on the ISMIR2004, GTZAN, and Extended Ballroom datasets, respectively. There are several reasons for the weaker performances of other state-of-the-art methods. Methods based on improved handcrafted features merely use a single feature and thus fail to provide enough discriminative information. Although methods based on representation learning extract salient features directly from the audio signals, they are designed around global features and thus fail to capture more valuable local features across different levels.

V. CONCLUSION
In this work, we propose an ensemble approach for music genre recognition based on the fusion of high-level feature sets learned from different types of low-level features. A multi-level feature coding network uses a CNN with self-attention and NetVLAD to learn high-level features for each low-level feature. The NetVLAD layer extracts more dominant features by capturing local information from different feature levels, while the self-attention mechanism learns long-term dependencies across levels. The proposed model is effective in capturing discriminative features, thus yielding the best test accuracies on the GTZAN, ISMIR2004, and Extended Ballroom datasets.
In the future, we intend to train the network in a multi-task learning manner by optimizing the local CNNs and the global aggregated networks simultaneously to provide better performance. Furthermore, we will investigate different filter visualization techniques to interpret the learned filters and apply the proposed method to other tasks such as audio event classification, emotion prediction, and music tagging.