A Novel Three Stage Framework for Person Identification From Audio Aesthetic

Social behavioral biometrics investigates social interactions to determine a person’s identity. Within the discipline of social behavioral biometrics, recognition of individuals based on their aesthetic preferences is an emerging direction of research. Human aesthetic is a soft, behavioral biometric trait that refers to a person’s attitudes towards a particular subject material. Recent developments in aesthetic-based biometric systems have shown that an individual’s visual and audio aesthetic preferences hold considerable distinctive features. This paper introduces a novel three-stage audio-aesthetic system that can uniquely identify a user from the set of their favorite songs. The system utilizes a Residual Network (ResNet) for high-level feature extraction. A hybrid meta-heuristic feature selection algorithm based on Cuckoo Search and Whale Optimization is proposed to optimize the extracted features, which results in a low-dimensional feature set. The selected subset of features is fed into the XGBoost classifier to establish a person’s identity. The proposed method outperformed the handcrafted feature-based method by achieving 99.54% accuracy on a proprietary dataset (Free Music Archive) and 99.79% accuracy on a publicly available dataset (Million Playlist Dataset).


I. INTRODUCTION
Biometric systems establish the identity of an individual based on unique physical or behavioral attributes. Physiological biometric systems encode an individual's physical characteristics to create a template unique to that person. The most commonly used physiological biometrics are fingerprints, iris, palm, face, and hand geometry [1]. On the other hand, behavioral biometric modalities focus on a person's actions, such as voice, signature, or gait, rather than their physical characteristics. Within this domain, social behavioral biometrics investigates a person's identity using their social interactions and communication patterns [2]. Over the last few decades, with the advance of technology, the number of social media users has exploded. As a result, social network platforms have become a widespread source of data that can be utilized to verify a user remotely and covertly.
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Sharif.
Exploiting personal aesthetic properties for biometric identification is an emerging research direction in social behavioral biometrics. Aesthetic features refer to an individual's preference toward a particular subject material. Several studies have shown that an individual's aesthetic attributes can be utilized to differentiate them from others [3], [4]. With the exponential growth of online social media users, aesthetic data is becoming more ubiquitous in the form of text, photographs, videos, and music. Additionally, people can publicly express their preferences and opinions on different social media platforms, making aesthetic data more readily accessible. Furthermore, studies in cognitive neuroscience have emphasized that there is a significant link between human perceptions of aesthetics and personality, making personal aesthetics one of the prominent and desirable traits for biometric authentication [4]. Aesthetic biometric systems have the potential to transform many existing applications, such as multi-factor authentication, human behavioral analysis, and recommender systems [5].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the domain of social behavioral biometric research, aesthetic systems have shown great potential. Several visual aesthetic systems have been recently developed for person identification [6] and gender prediction [7]. These systems have demonstrated that a user's preferred set of images can hold discriminatory features. Aesthetic data is easily accessible and retrievable from online social media platforms. Moreover, combination of aesthetic features with other adaptive biometric traits has a wide range of applications. Such systems can be extended to understand consumer behavior and experience, optimize the positioning of an e-commerce platform, increase the acquisition of new products, and enhance collaborative learning experience [8].
Recent studies have demonstrated the potential applications of audio systems in security and surveillance. Related works include sound-based detection of drones to protect security-sensitive institutions [9], prediction of crime by combining the audio modality with text sentiment [10], and street surveillance based on atypical sounds [11].
The very first audio aesthetic system was developed in 2021 and achieved significant accuracy by extracting handcrafted audio features from users' liked songs [5]. This research demonstrated that, similar to visual aesthetics, audio aesthetic features have unique and distinguishing characteristics for biometric identification.
Previous audio aesthetic-based person identification research only explored traditional machine learning approaches, trained with handcrafted features. However, extracting and optimizing handcrafted features requires extensive feature engineering, leading to increased computational complexity. In addition, when the size of the dataset increases, the system might face issues such as scalability, reliability, and robustness. On the other hand, Deep Learning (DL) approaches have proven to be highly useful in various domains, including audio analysis, text analysis, and image processing [12]. Deep learning methods can extract features automatically from a vast amount of data [13]. This property can be leveraged to extract high-dimensional features without extensive feature engineering. As a result, this will resolve the scalability and reliability issues of handcrafted feature based systems.
In this work, the following research questions will be addressed:
1) Which deep learning based pre-trained architecture is most suitable for user recognition using audio aesthetic features?
2) Will the high-level features extracted from a pre-trained Convolutional Neural Network (CNN) be more discriminative than handcrafted features?
3) Which hybrid meta-heuristic feature selection algorithm can select the most discriminating audio aesthetic feature subset?
4) Can a hybrid meta-heuristic feature selection algorithm choose a reduced subset of high-level features to achieve a high classification accuracy?
5) Which machine learning classifier can identify users from their preferred set of music with the highest accuracy?
This paper proposes a novel three-stage deep learning based audio aesthetic system dedicated to person identification. To the best of our knowledge, this is the first audio aesthetic system that utilizes deep learning to extract aesthetic features. Initially, the proposed method will extract high-level features using a pre-trained deep learning architecture. Later, a hybrid meta-heuristic feature selection algorithm will be employed to select the most discriminative feature set, which will then be used to train machine learning algorithms for classification. It will be demonstrated that the proposed architecture surpasses the existing audio aesthetic method on two benchmark datasets.
This paper makes the following contributions to answer the research questions:
• A novel three-stage framework based on deep learning and classical machine learning is proposed for identifying individuals from their audio aesthetic.
• To extract audio features, mel-spectrograms are used instead of generic spectrograms. As a result, the extracted feature set becomes more discriminating for person identification.
• A residual network is employed for extracting high-level features from users' preferred set of music.
• A novel hybrid meta-heuristic algorithm, the Cuckoo Search based Whale Optimization Algorithm (CSWOA), is proposed to retrieve the most discriminative feature subset from the high-level features.
• The reduced feature set is used to train an eXtreme Gradient Boosting (XGBoost) classifier for identification in order to reduce computational time.
• The high-level feature extractor ResNet is compared with InceptionNet and VGG16 to demonstrate its superiority. Extensive comparison of the proposed hybrid meta-heuristic feature selection algorithm with other standalone and hybrid wrapper-based feature selection methods demonstrates its superiority for selecting audio-based aesthetic features.
The findings show that the proposed architecture can extract a more discriminating feature set than the previously used handcrafted features. In addition, the experiments show that the proposed approach is more efficient, more rigorous, and less computationally costly than the most recent method for person identification using audio aesthetics. The proposed system attains an overall recognition accuracy of 99.54% on the Free Music Archive (FMA) dataset and 99.79% on the Million Playlist Dataset (MPD), outperforming the state-of-the-art method for audio aesthetic-based person identification.
The remainder of the paper is organized as follows: a detailed overview of existing research on aesthetic-based systems will be discussed in Section II. The proposed method will be thoroughly described in Section III. Experimental results and comparison of the proposed method with the previous works will be discussed in Section IV. Finally, future research directions will be outlined in Section V.

II. LITERATURE REVIEW
Recent progress in social behavioral biometric research has introduced various avenues to investigate human behavior based on social media activities, communications, and interactions. Social media became a platform for sharing personal interests, lifestyles, and ideas with acquaintances. As a result, online social media platforms have emerged as a prominent source of information about an individual's behavioral traits. Initially, analysis of this information was limited to linguistic authorship recognition [14], authentication via social media interactions, and spatio-temporal information mining. However, authentication based on human aesthetic preferences remained largely unexplored.
The intuition behind aesthetic-based identification is that every person has distinctive tastes and preferences when it comes to photographs, arts, music, and so on. Such idiosyncratic attributes can be exploited to distinguish individuals from others. The first concept of person identification from their visual preferences was introduced by Lovato et al. [15]. In this work, the dataset was sampled from the extensive database of the Flickr website containing 200 users, where each user was asked to select 200 favorite images. A machine learning method named Least Absolute Shrinkage and Selection Operator (LASSO) regression was exploited to learn the most distinguishing aesthetic features, yielding rank 1 identification accuracy of 14%. Despite the low recognition accuracy, this work unveiled a promising research scope for utilizing human preferences as a unique identifier. Later, the same group of researchers carried out another experiment on the Flickr dataset [3], this time integrating new features. This work achieved a rank 1 accuracy of 76%. At the same time, Segalin et al. [4] adopted the counting grid model and support vector machine as a learning method instead of LASSO regression utilizing the same database as [15]. In this work, a generative embedding strategy was followed by considering Bags of Features with 111 features. As a generative step, a multi-resolution counting grid was utilized for generating an ensemble of embedding maps. The SVM classifier was trained in a one-versus-all modality achieving rank 1 accuracy of 73%. The major limitation of this approach was that the same image could not be chosen by more than one user. In 2016, Azam and Gavrilova [7] introduced the proof of concept of gender identification of an individual using the perceptual aesthetic features of their preferred images. In this work, the authors utilized a bag of 56 perceptual image aesthetic features. 
For final classification, the decisions of three traditional binary classifiers were combined using feature selection and ensemble weight adjustment methods. The model was trained and evaluated on a database of 24,000 images from 120 Flickr users, achieving 77% rank 1 accuracy in aesthetic based gender prediction.
The most recent work exploited the original deep learning approach [16] for the first time on visual aesthetic-based identification [6]. The above-mentioned techniques were highly dependent on manual feature engineering. However, deep learning models are capable of executing feature engineering on their own [13]. In addition, recent breakthroughs of deep learning models in the domains of computer vision [17] and biometric systems [18] led to the deployment of a deep learning approach for visual aesthetic systems. An original framework, AestheticNet, was developed, yielding 97.7% rank 1 identification accuracy. A pre-trained VGG16 network was used for extracting high-dimensional feature maps, which were further reduced to low-dimensional feature vectors with high variance using Principal Component Analysis (PCA). Subsequently, a residual learning-based CNN was trained and validated on the obtained low-dimensional feature vectors.
Audio preferences have become a ubiquitous phenomenon, which has been explored in multiple areas ranging from psychological research to medical therapy [19]. Apart from visual aesthetics, personal audio preferences have also shown great potential in analyzing an individual's behavioral and psychological traits [20]. Several music recommendation systems have been developed to exhibit the relationship between a user's music preferences and personality type [21], [22]. In addition, the development of cognition-based authentication systems that operate while the user listens to music is another prominent research domain. In 2020, Patel and Husain [23] proposed an Electroencephalogram (EEG)-based person authentication system that measured the user's neurophysiological responses while listening to their preferred music. This study aimed at creating a user-authentication system to identify a user uniquely based on their EEG response to music. Another authentication system, MusicID, was developed by utilizing users' preferred set of music [24]. In this work, the authors exploited human brainwave patterns recorded while users listened to their favorite songs.
Prior works have established a correlation between human personality and audio preferences. The first proof-of-concept research in this domain was published in 2021 [5]. The authors developed the first audio aesthetic-based person identification system, which achieved 95% user recognition accuracy on the FMA and MPD datasets. This work utilized intra-song and inter-song features of users' favorite songs. An ensemble classifier was used for final authentication, where each classifier's decision weight was optimized using a genetic algorithm.
According to the prior research, it is evident that, similar to visual aesthetics, a person's audio preferences also hold discriminatory features to identify a user. However, prior works in this domain rely heavily on feature engineering. This can create challenges, including dataset bias, scalability, and reliability issues. Furthermore, the recent success of convolutional neural networks in biometric identification has demonstrated the efficacy of deep learning architecture in this domain. The above-mentioned points motivate us to develop the first audio aesthetic-based person authentication system based on deep learning architecture. In this paper, the advantages of deep learning and machine learning architectures are leveraged for automatic music aesthetic feature extraction and making predictions with reduced computational cost.

A. OVERVIEW
This research proposes an audio aesthetic system that uniquely identifies users based on their preferred music set. Initially, the raw mp3 data was converted into audio waveforms. The generated audio waveforms were further converted into mel-spectrograms, a logarithmic transformation of an audio signal's frequency by utilizing the mel scale [25]. Mel-spectrograms were chosen because sounds that are equally spaced on the mel scale are perceived by humans as equally spaced [26]. Later, a pre-trained convolutional neural network, Residual Network (ResNet), was used to extract a high-dimensional feature vector from each mel-spectrogram. The skip connection property of the residual network amplifies the feature maps extracted from the mel-spectrograms, thus extracting more discriminative features. Following that, a novel hybrid meta-heuristic approach called Cuckoo Search based Whale Optimization Algorithm (CSWOA) was proposed to select the most discriminatory and effective feature set from the high-dimensional feature vector. Combinations of three songs were generated from each user's chosen music set to produce unique templates for each user. Then, the obtained data samples for each user were fed into XGBoost classifiers for final prediction. The overall architecture of the proposed system is depicted in Fig. 1.

B. GENERATION OF AUDIO WAVEFORM AND MEL-SPECTROGRAM
Initially, the audio file was converted into a digital representation of an audio signal sampled at regular time intervals and by considering the amplitude at each sample. The default sampling rate for audio datasets of 22050 Hertz was used. After that, each audio waveform was trimmed to eliminate the silent portions of each audio clip. To ensure a static input dimension, all the audio clips were zero-padded to an equal length of 30 seconds. As deep learning architectures do not take raw audio waveform directly as input, each processed audio waveform was transformed into its mel-spectrogram representation.
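The fixed-length step described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `pad_or_truncate` is hypothetical, and truncating clips longer than 30 seconds is an assumption, since the text only mentions zero-padding.

```python
SAMPLE_RATE = 22050   # Hz, as stated in the text
CLIP_SECONDS = 30     # target clip length, as stated in the text
TARGET_LEN = SAMPLE_RATE * CLIP_SECONDS  # number of samples per clip


def pad_or_truncate(samples, target_len=TARGET_LEN):
    """Return exactly target_len samples, zero-padded at the end.

    Truncation of longer clips is an assumption of this sketch.
    """
    samples = list(samples)[:target_len]
    return samples + [0.0] * (target_len - len(samples))
```

With these values every clip ends up with 22050 × 30 = 661500 samples, which gives the network a static input dimension.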
The mel-spectrogram is a modified version of the spectrogram [26]. A spectrogram is a visual representation of the signal strength, or loudness, of a waveform over time at different frequencies. The third dimension of the spectrogram represents amplitude with varying brightness. In a spectrogram, loud events appear as bright colours, and quiet events appear as dark colours, as illustrated in Fig. 4. The vertical and horizontal axes of a spectrogram plot represent frequency and time, respectively. Spectrograms are generated using the Fourier transform of audio signals [26]. Most of the time, spectrograms use a linear scale to measure frequency. Humans, on the other hand, perceive frequency on a logarithmic scale [26]. Consequently, linear-frequency spectrograms do not retain sufficient information for training deep learning models [26]. To resolve this issue, mel-spectrograms are considered in this research for extracting more discriminating features. Unlike regular spectrograms, mel-spectrograms use the mel scale and the decibel scale to measure frequency and amplitude, respectively. The mel scale is a logarithmic conversion of an audio signal's frequency [26]. The conversion between hertz and the mel scale is given by [27]:

$$m = C \log\left(1 + \frac{f}{f_c}\right) \tag{1}$$

Here, $m$ denotes frequency in the mel scale, $f$ denotes frequency in the linear scale, $f_c$ is the corner frequency where the scale changes from linear to logarithmic, and $C$ is a constant chosen such that 1000 Hz maps to 1000 mel. Usually, the corner frequency $f_c$ is fixed at 700 Hz. If the natural logarithm is used, the value of the constant $C$ is 1127; if the base-10 logarithm is used, the value of $C$ is 2595.
From Fig. 2, it can be observed that lower frequencies in hertz correspond to a larger distance between mels. On the other hand, higher frequencies in hertz have a smaller distance between mels. This property reinforces the resemblance of the mel scale to human perception.
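As a quick numerical check of the constants quoted above, the hertz-to-mel conversion can be sketched as follows. The function names are hypothetical; the formula used is m = C · log(1 + f / f_c) with f_c = 700 Hz, and C = 1127 (natural logarithm) or C = 2595 (base-10 logarithm), as given in the text.

```python
import math


def hz_to_mel(f_hz, corner_hz=700.0, natural_log=True):
    """Convert a frequency in hertz to mels using m = C * log(1 + f / f_c)."""
    if natural_log:
        return 1127.0 * math.log(1.0 + f_hz / corner_hz)
    return 2595.0 * math.log10(1.0 + f_hz / corner_hz)


def mel_to_hz(m, corner_hz=700.0):
    """Inverse of hz_to_mel for the natural-logarithm form."""
    return corner_hz * (math.exp(m / 1127.0) - 1.0)


if __name__ == "__main__":
    for f in (100, 500, 1000, 4000, 8000):
        print(f, round(hz_to_mel(f), 1))
```

Both choices of C map 1000 Hz to approximately 1000 mel, and equal steps in hertz shrink in mels as frequency grows, which is the compression the paragraph describes.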
Mel-spectrograms provide the deep learning architecture with information similar to what a human would perceive. The raw audio waveforms are passed through mel filter banks to obtain the mel-spectrogram. Mel filter banks map frequency bins from the Short-Time Fourier Transform (STFT) to mel bins. After this step, each sample had a shape of 500 × 500, indicating 500 mel filter banks and 500 time steps per clip. Fig. 3 and Fig. 4 depict a sample audio waveform and its corresponding mel-spectrogram.
Fig. 5 depicts an overall diagram of the sub-optimal feature set selection obtained by hybridizing the Cuckoo Search and Whale Optimization algorithms. The high-level features extracted from the pre-trained residual network are passed separately to the Cuckoo Search and Whale Optimization algorithms. After that, each algorithm produces its own best population. The hybridization method contains three steps. The first step is combining the best populations by computing the significance of the features contributing most to the population sets. The Average Weighted Combination Method (AWCM) is employed for computing the importance of the feature subset. The second step is to compute a threshold value, the AWCM cutoff, to obtain the new optimized feature vector. The third step is to apply the Sequential One-Point Flipping (SOPF) algorithm to eliminate redundant and unnecessary features from the newly produced feature vector. After removing the redundant features, the sub-optimal feature set is obtained.

1) DESIGN OF HIGH-LEVEL FEATURE EXTRACTION
In this phase of the proposed system, high-level features were extracted from mel-spectrograms using a pre-trained CNN architecture. With traditional feature engineering methods, generating a significant feature vector might be challenging when the associated dataset is large. Residual Network, a pre-trained CNN architecture, was adopted in this study to resolve this issue. He et al. [28] were the first to examine the vanishing gradient problem in extremely deep convolutional neural networks. Vanishing gradient occurs during the backpropagation phase of neural network training. During each iteration of training, the weights are updated proportionally to the loss function's partial derivative with respect to the current weights. However, during the training of a very deep neural network, the gradient becomes vanishingly small, preventing the weights from changing. As a result, it may stop the neural network from further training [29]. The authors showed that a CNN architecture comprising many layers can fail to generalize during the optimization process. They suggested including residual blocks, or skip connections, in the CNN architecture to address this issue.
The idea of a residual connection is to propagate a layer's output feature maps not only to the immediately following layer but also to later layers. This helps the architecture aggregate information across the network, hence mitigating the vanishing gradient problem. Residual networks have been used in several applications, including image segmentation, medical image analysis, and robotics. This CNN architecture employs pre-trained weights to extract high-level image features from inputs of dimension 224 × 224 × 3. Since the mel-spectrograms created in the previous phase had dimensions of 500 × 500, they must be converted into a dimension compatible with the pre-trained CNN architecture.
For this purpose, the mel-spectrograms were downsized to 224 × 224 pixels. As the residual network requires RGB images as input, the single-channel mel-spectrogram images must be transformed into three-channel images. To achieve this, each mel-spectrogram was stacked three times, resulting in an RGB representation of the mel-spectrogram. Passing the mel-spectrograms through the residual network generates updated feature maps, which are then flattened and fed into fully connected layers. This residual network comprises three fully connected layers with sizes of 2048, 2048, and 1000, respectively [28]. The second-to-last fully connected layer of the residual network was employed as the intermediate embedding feature vector of the mel-spectrograms, which is their high-level feature representation. Therefore, the dimension of the features extracted from the residual network was 2048. The pre-trained residual network thus represents each three-channel mel-spectrogram image by a lower-dimensional, high-level feature vector.
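The resizing and channel-stacking steps can be illustrated with a minimal pure-Python sketch. The helper names are hypothetical, and nearest-neighbour interpolation is an assumption made for brevity; the text does not state which interpolation method was used for downsizing.

```python
def resize_nearest(image, out_h, out_w):
    """Downsize a 2-D list-of-lists with nearest-neighbour sampling (assumed method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def stack_channels(image, channels=3):
    """Replicate a single-channel image along a new last axis (H x W -> H x W x 3)."""
    return [[[value] * channels for value in row] for row in image]


if __name__ == "__main__":
    # Dummy 500 x 500 "mel-spectrogram" with arbitrary values.
    mel = [[float(r + c) for c in range(500)] for r in range(500)]
    rgb = stack_channels(resize_nearest(mel, 224, 224))
    print(len(rgb), len(rgb[0]), len(rgb[0][0]))
```

Because the three channels are identical copies, no new information is added; the stacking only satisfies the 224 × 224 × 3 input shape expected by the pre-trained network.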

2) LOW-DIMENSIONAL REPRESENTATION USING HYBRID META-HEURISTIC FEATURE SELECTION ALGORITHM
The high-dimensional feature vector generated by ResNet may contain redundant features and thus induce overfitting during classification. In order to overcome this issue, a hybrid meta-heuristic feature selection algorithm was proposed. Meta-heuristic feature selection algorithms are generic search-based optimization algorithms capable of finding a near-optimal feature subset in a large feature space. The purpose of these algorithms is to eliminate inconsistent and redundant features to increase classification accuracy and reduce inference time and memory consumption. The most commonly used feature selection algorithms are the Genetic Algorithm (GA) [30], Particle Swarm Optimization (PSO) [31], the Cuckoo Search Algorithm (CS) [32], and the Whale Optimization Algorithm (WOA) [33]. In this work, a hybridization of the Cuckoo Search (CS) and Whale Optimization (WOA) algorithms was proposed for selecting an optimal feature subset.

a: CUCKOO SEARCH ALGORITHM
The Cuckoo Search (CS) algorithm is one of the most popular meta-heuristic algorithms, inspired by the behavior of some cuckoo species [32]. Cuckoo birds breed in the nests of other bird species, called host species. The algorithm aims to increase the survival and productivity of the cuckoo birds by helping their eggs avoid discovery by the host birds. The three major criteria for CS implementation are as follows [32]:
• Each cuckoo lays only one egg at a time and puts it in an arbitrarily selected nest.
• The best nests, with high-quality eggs, carry over to the next generation.
• The host nest count is constant, and the host bird's probability of identifying a cuckoo's egg is $p_{hb} \in (0, 1)$. The host bird can either throw the egg away or abandon the nest and build a new one.
In this analogy, the eggs in a nest represent a set of solutions, while a new cuckoo egg represents a new solution. High-quality eggs represent the best solutions. This means that eggs that resemble those of the host birds can hatch and mature without being discovered. Thus, a less fit solution will be replaced by a new and better one. The number of host nests represents the population. The fraction of cuckoo eggs discovered by the host bird is discarded, while the remaining eggs are retained as solutions for the following generation.
Another significant parameter of this algorithm is the Levy Flight, the random walk by which a cuckoo generates a new solution (egg) in the host nest. After a new egg is laid at the position of an arbitrarily picked egg, a Levy Flight is initiated. If the new location is fitter than that of another randomly chosen egg, that egg is relocated to the new location. Using the Levy Flight search, the CS algorithm may concurrently obtain all optima in a solution space. This behavior has been applied to optimization and optimum search problems, and preliminary findings indicate its high potential [34]. To prevent becoming trapped in a local optimum, far-field randomization introduces a significant proportion of new solutions whose locations are sufficiently distant from the existing best solution. The equation for the Levy Flight is the following [32]:

$$s_i^{n+1} = s_i^{n} + \alpha \oplus \mathrm{Levy}(\lambda) \tag{2}$$

In equation 2, $s_i^{n+1}$, $s_i^{n}$, $\alpha$, $\oplus$, and $\mathrm{Levy}(\lambda)$ represent the new solution, the current location, the step size, entry-wise multiplication during the walk, and the Levy flight with exponent $\lambda$, respectively. The Levy flight is essentially a random walk whose random step length is drawn from a Levy distribution with infinite variance and mean. The formula of $\mathrm{Levy}(\lambda)$ is the following [32]:

$$\mathrm{Levy} \sim u = n^{-\lambda}, \quad 1 < \lambda \le 3 \tag{3}$$

In equation 3, $u$ is a normal stochastic variable and $n$ represents the current iteration. In this work, the number of features selected by the Cuckoo Search algorithm was 618.
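To make the Levy Flight move concrete, the following minimal sketch adds a heavy-tailed random step to each coordinate of a solution. Two things here are assumptions rather than details from the paper: the steps are drawn with Mantegna's algorithm, a commonly used way to approximate Levy-stable steps, and its index `beta` plays the role of the Levy exponent (with λ = 1 + β). The names `levy_step`, `levy_move`, and the default `alpha` are hypothetical.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step using Mantegna's algorithm (assumed sampler)."""
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def levy_move(position, alpha=0.01, beta=1.5, rng=random):
    """New solution = current solution + alpha * Levy step, per coordinate."""
    return [x + alpha * levy_step(beta, rng) for x in position]


if __name__ == "__main__":
    rng = random.Random(0)
    print(levy_move([0.0, 0.0, 0.0], rng=rng))
```

Most steps are small, but occasional very large jumps occur; those jumps are what the text calls far-field randomization. For feature selection the continuous positions would still need to be mapped to 0/1 feature flags, a step not covered by this sketch.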

b: WHALE OPTIMIZATION ALGORITHM
This algorithm is inspired by the bubble-net feeding strategy of humpback whales searching for food [33]. WOA is a widely used meta-heuristic algorithm that has been demonstrated to be effective, straightforward to implement, and capable of generating robust and relevant feature subsets [35]. The whale behavior modeled in this optimization algorithm consists of searching for prey, encircling it, and attacking it. The search for prey is referred to as the exploration phase. Once the target is discovered, the whales begin their attack by encircling it.
Since the optimal solution in the search space is unknown at the beginning of the exploration phase, the agent (the whale) chooses a random prey as the current best option. Once an agent identifies the best solution, the other agents update their locations by moving toward it. Equations 4 and 5 represent the prey localization and encircling method of the whale [33]:

$$d = \left| c \cdot y^{*}(n) - y(n) \right| \tag{4}$$

$$y(n+1) = y^{*}(n) - Z \cdot d \tag{5}$$

Here, $Z$ and $c$ are coefficients, $y^{*}$ is the position vector of the best solution obtained so far, and $y$ is the current position vector at the $n$-th iteration.
$Z$ and $c$ are updated using the following equations [33]:

$$Z = 2a \cdot r - a \tag{6}$$

$$c = 2 \cdot r \tag{7}$$

$$a = 2 - \frac{2n}{n_{max}} \tag{8}$$

Here, $a$ is a convergence factor that decreases from 2 to 0 over the iterations, $r$ is a random value in $[0, 1]$, and $n_{max}$ denotes the maximum number of iterations.
The shrinking encircling mechanism and the spiral updating position are the two prey hunting mechanisms used to update the whale's position while searching for an optimal solution. The choice between them depends on a probability factor $p$. In the shrinking encircling mechanism, the optimal solution is approached by reducing the value of $a$, while in the spiral updating position, a whale travels along a spiral to reach the destination. In the spiral updating mechanism, the distance between the whale at $(X, Y)$ and the prey at $(X^{*}, Y^{*})$ is calculated first. The position is then updated using the following equation [33]:

$$y(n+1) = \begin{cases} y^{*}(n) - Z \cdot d, & p < 0.5 \\ d' \cdot e^{bu} \cdot \cos(2\pi u) + y^{*}(n), & p \ge 0.5 \end{cases} \tag{9}$$

Here, $p$ is a random number in $[0, 1]$, $b$ is a constant that defines the shape of the spiral, $u$ is a random number in $[-1, 1]$, and $d' = |y^{*}(n) - y(n)|$. During prey exploration, if $|Z| > 1$, a random search agent ($y_{rand}$) is chosen from the entire population and the location of the current search agent is updated by equations 10 and 11. On the other hand, if $|Z| < 1$, the whales move towards the global best solution and the position is updated by equations 4 and 5. The equations of prey search are the following [33]:

$$d = \left| c \cdot y_{rand}(n) - y(n) \right| \tag{10}$$

$$y(n+1) = y_{rand}(n) - Z \cdot d \tag{11}$$

The exploitation phase depends on the distance between a search agent and the best search agent. If some random search agents are far away from the global solution, then the convergence time of the algorithm increases slightly [36]. In this work, the number of features selected by the Whale Optimization algorithm was 495.
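The three WOA moves described above (encircling the best solution, searching around a random agent, and the spiral update) can be combined into one position-update sketch. This follows the standard formulation in [33]; the function name `woa_update`, the use of scalar rather than per-dimension random coefficients, the 0.5 switching probability taken from the original WOA formulation, and the default `b = 1.0` are assumptions of this illustration and not details from the paper.

```python
import math
import random


def woa_update(y, y_best, population, n, n_max, b=1.0, rng=random):
    """Return the next position of one whale (list of floats)."""
    a = 2.0 - 2.0 * n / n_max          # convergence factor, decreases from 2 to 0
    Z = 2.0 * a * rng.random() - a     # coefficient Z
    c = 2.0 * rng.random()             # coefficient c
    p = rng.random()                   # selects encircling/search vs. spiral

    if p < 0.5:
        # |Z| < 1: move toward the best solution; otherwise explore around a random agent.
        target = y_best if abs(Z) < 1 else rng.choice(population)
        return [t - Z * abs(c * t - yi) for t, yi in zip(target, y)]

    # Spiral update around the best solution.
    u = rng.uniform(-1.0, 1.0)
    return [
        abs(t - yi) * math.exp(b * u) * math.cos(2 * math.pi * u) + t
        for t, yi in zip(y_best, y)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    whales = [[rng.random() for _ in range(4)] for _ in range(5)]
    print(woa_update(whales[0], whales[1], whales, n=3, n_max=50, rng=rng))
```

Early iterations (large a) make |Z| ≥ 1 likely, which favours exploration around random agents; as a shrinks toward 0, the update increasingly contracts toward the best solution.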

c: CSWOA: HYBRIDIZATION OF CS AND WOA
Cuckoo Search and Whale Optimization Algorithms are two distinct meta-heuristic feature selection algorithms. Because it has few tuning parameters, the CS algorithm is simple to implement and converges quickly [37]. The WOA algorithm, on the other hand, does not get trapped in local optima and thus rapidly converges to the global optimum. These factors enable the method to tackle a wide range of real-world problems without significant structural modifications [36]. The merits of both methods have been leveraged to generate a feature vector that is better than either approach could obtain individually.
In the proposed hybridization approach, the best feature vector from each algorithm was initially obtained by running the two algorithms separately. Then, the best populations were combined by evaluating the significance of all features pertaining to each of the two population sets. The features were represented in binary form (0 or 1). The importance of the feature subset was evaluated by the Average Weighted Combination Method (AWCM) [38]. A threshold value, the AWCM cutoff, was computed to obtain the new optimized feature vector. Finally, a non-greedy local search algorithm called Sequential One-Point Flipping (SOPF) [38] eliminated the redundancy in the newly generated feature vector.
In the AWCM algorithm, the accuracies of all the feature vectors generated by each algorithm were summed for every feature. The resulting sum for a feature is referred to as that feature's importance. If the importance of a particular feature was greater than the AWCM cutoff, then that feature was selected. An example of a final optimized feature vector obtained from feature importance using the AWCM algorithm is shown in Table 1.
In addition, the optimal feature set obtained from AWCM could still contain redundant features, which may reduce the classification accuracy. To address this problem, a local search algorithm, Sequential One-Point Flipping (SOPF), was used to eliminate the redundant features [38]. This algorithm sequentially traverses the optimal feature set and inspects the effect of neighboring feature sets on the feature under consideration. It successively flips the state of each feature and calculates the fitness of the resulting feature set. A flip is accepted if the intermediate feature set exhibits higher accuracy than the current solution. SOPF thus yields an efficient and scalable feature vector by removing redundancy. The size of the final feature vector generated by the hybrid CSWOA algorithm was 532. These feature vectors were considered the most discriminating and were passed to the machine learning classifiers for the final prediction.
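The combination and refinement steps described above can be sketched as follows. This is an illustrative reading of AWCM and SOPF rather than the implementation from [38]: the function names, the use of the mean importance as the default AWCM cutoff, and the caller-supplied `fitness` function (standing in for classifier accuracy) are all assumptions.

```python
def awcm_combine(candidates, cutoff=None):
    """Combine binary feature masks weighted by their accuracy.

    candidates: list of (mask, accuracy) pairs, where mask is a list of 0/1.
    cutoff: AWCM threshold; defaults to the mean importance (an assumption).
    Returns (combined_mask, importance_per_feature).
    """
    n_features = len(candidates[0][0])
    importance = [0.0] * n_features
    for mask, accuracy in candidates:
        for j, bit in enumerate(mask):
            if bit:
                # A feature accumulates the accuracy of every mask selecting it.
                importance[j] += accuracy
    if cutoff is None:
        cutoff = sum(importance) / n_features
    combined = [1 if imp > cutoff else 0 for imp in importance]
    return combined, importance


def sopf(mask, fitness):
    """Sequential One-Point Flipping: flip each bit once, keep improvements."""
    best = list(mask)
    best_fit = fitness(best)
    for j in range(len(best)):
        trial = list(best)
        trial[j] = 1 - trial[j]          # flip the state of feature j
        trial_fit = fitness(trial)
        if trial_fit > best_fit:         # accept only if fitness improves
            best, best_fit = trial, trial_fit
    return best, best_fit
```

In practice, `fitness` would train and evaluate a classifier on the features selected by the mask, which makes each flip comparatively expensive; the sketch leaves that choice to the caller.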

D. CLASSIFICATION BLOCK
The classification block of the proposed method identifies users based on their audio-aesthetic preferences. After generating a feature vector for each user, training and testing sets were generated for classification. In this work, unique 3-song sets were generated from each user's ten liked songs using the combination $\binom{10}{3} = 120$, while retaining the user's aesthetic preferences. The combination method ensured that there were no duplicate data points within the population and that sets containing the same songs in a different order were discarded. After performing the combination, the total number of data points was 4080 for the FMA dataset and 24000 for the MPD dataset. In addition, the size of each data point's feature vector became 532 × 3. The transformation of the dimensionality of the original audio file for every user is depicted in Fig. 6. First, the raw audio song is transformed into an audio signal of dimension 22050 × 30. Next, the audio signal is converted into an RGB mel-spectrogram of dimension 224 × 224 × 3. This transformation is part of the pre-processing required to pass the mel-spectrogram into the residual network. The pre-trained residual network extracts a 2048-dimensional feature vector, which is further reduced to 532 dimensions by the feature selection algorithm.
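The data point counts quoted above follow directly from the number of unordered 3-song sets per user. A short sketch, assuming ten songs per user and using the hypothetical helper name `song_triplets`, reproduces the arithmetic:

```python
from itertools import combinations

def song_triplets(song_ids, k=3):
    """Return all unordered k-song sets; order-only duplicates never occur."""
    return list(combinations(sorted(song_ids), k))

# Ten favorite songs per user, identified here by the placeholder ids 0..9.
triplets = song_triplets(range(10))
per_user = len(triplets)

fma_points = 34 * per_user    # 34 users in the FMA-based dataset
mpd_points = 200 * per_user   # 200 users in the MPD-based dataset
```

Because `combinations` emits each set once in sorted order, sets that differ only in song order are never generated, which matches the de-duplication described in the text.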
For classification, XGBoost was used to identify the users with the highest precision. XGBoost stands for eXtreme Gradient Boosting [39] and utilizes a parallel tree gradient boosting mechanism. This algorithm supports three types of gradient boosting methods [39]:
• Gradient Boosting: Also referred to as the gradient boosting machine, which includes the learning rate.
• Stochastic Gradient Boosting: Sub-samples at the row, column, and column per split levels.
• Regularized Gradient Boosting: Applies both L1 and L2 regularization.
The XGBoost classifier yields superior results by using memory resources efficiently and requiring less computational time. It is sparsity-aware, which allows it to handle missing values automatically. In addition, it uses a block structure that facilitates parallelization during tree construction. Furthermore, it supports continued training, which can improve a previously fitted model's performance on unknown data points [39].
The XGBoost classifier boosts gradients using the Gradient Boosting Decision Tree technique. It is a type of ensemble learning that enables sequential model embedding until performance hits convergence. These characteristics make this algorithm more robust. The advantages of this algorithm over others are its fast execution time, scalable kernel, and extensive selection of adjustable hyperparameters. Moreover, in this work, it outperformed other conventional classifiers, as demonstrated by the experiments described in the following section.

A. DATASET DESCRIPTION
The performance of the proposed three-stage audio aesthetic framework was evaluated on two datasets: one is proprietary [5], and the other is publicly available [40]. The proprietary dataset consists of 34 users and their corresponding ten favorite songs. The users chose the songs from a set of 224 songs collected from the Free Music Archive (FMA). The original FMA dataset consists of 917 gigabytes of Creative Commons-licensed tracks spanning 161 genres [41]. The collected songs had a balanced mix of different music genres such as Pop, Rock, Folk, Hip-Hop, Jazz, Country, Classical, and Disco. The second dataset was constructed from the publicly available Million Playlist Dataset (MPD) from Spotify. It contains the playlists of 200 anonymous Spotify users sampled from the first 1000 MPD playlists. Each playlist consists of 10 songs, each with a 30-second song clip. The data were divided 70:30 into training and testing sets.
To evaluate the model's performance, different evaluation metrics such as accuracy, precision, recall, and F1-score were utilized [42]. These evaluation metrics depend on the parameters of the confusion matrix. A confusion matrix measures the performance of a machine learning classifier by tabulating the actual results against the results predicted by the classifier [42]. The parameters associated with it are:
• True Positive (t_p): The percentage of positive predictions that were actually positive.
• True Negative (t_n): The percentage of negative predictions that were actually negative.
• False Positive (f_p): The percentage of positive predictions that were actually negative.
• False Negative (f_n): The percentage of negative predictions that were actually positive.
The above parameters are generally used for binary classification, but they can be derived for multi-class classification as well. The equations for the evaluation metrics are given below [42]:

$$\text{Accuracy} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$$

$$\text{Precision} = \frac{t_p}{t_p + f_p}$$

$$\text{Recall} = \frac{t_p}{t_p + f_n}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
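As an illustration of how these parameters extend to the multi-class case, the sketch below derives them for one class in a one-vs-rest manner from a confusion matrix and then applies the standard definitions of accuracy, precision, recall, and F1-score. The function names and the row-equals-actual, column-equals-predicted layout are assumptions made for this example, and counts are used instead of percentages.

```python
def per_class_metrics(cm, k):
    """One-vs-rest metrics for class k.

    cm[i][j] is the number of samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                   # actual k, predicted as another class
    fp = sum(row[k] for row in cm) - tp    # predicted k, actually another class
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal of the matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)
```

Per-class values can then be averaged across users to obtain a single precision, recall, or F1 figure; which averaging scheme is appropriate depends on the class balance.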

B. EXPERIMENTAL SETUP
The proposed method was implemented in the Python programming language. For the machine learning algorithms, the scikit-learn library was employed. The machine learning models were trained on a Core i5 CPU with an NVIDIA GTX 1080 GPU backend. For the residual network, the default architecture was used. For the cuckoo search algorithm, the number of nests was set to 20, the step length was set to 0.01, and the Lévy distribution parameter was set to 1.5 [43]. For the whale optimization algorithm, the population size was set to 100 with a random search ability of 0.1 [44]. For the XGBoost classifier, alpha was set to 0.2, the maximum depth of the tree was set to 5, and the number of estimators for boosting was set to 1000 [45]. These parameter values were chosen to obtain the best performance from each algorithm.

C. EFFICACY OF DIFFERENT COMPONENTS OF THE PROPOSED METHOD
The first set of experiments demonstrated the efficiency of the different components of the proposed method. The first experiment was conducted to show the strength of the residual network that was employed as the feature extractor of the proposed method. The second experiment established the importance of the hybrid meta-heuristic feature selection algorithm. The final experiment was performed to show the efficiency of the XGBoost classifier for final classification.

1) EFFICACY OF THE RESIDUAL NETWORK
The first step of the proposed architecture was to extract high-level features using a pre-trained CNN architecture. Several different pre-trained architectures were considered. Among them, ResNet [28], VGG-16 [46], and InceptionNet [47] were selected as the most widely used pre-trained architectures. Table 2 illustrates the performance of the architecture with ResNet. The proposed method with ResNet as the feature extractor attained 2.25% higher accuracy than InceptionNet and 3.51% higher accuracy than VGG-16 on the FMA dataset. For the MPD dataset, it achieved 2.34% higher accuracy than InceptionNet and 2.41% higher accuracy than VGG-16. There are several reasons for the superior performance of ResNet.
The key advantage is that ResNet addresses the vanishing gradient problem through skip connections. A skip connection bypasses a few layers during training and connects their input directly to their output. Regularization allows any layer to be skipped if it degrades the architecture's performance. Therefore, in ResNet, a very deep neural network can be trained without the vanishing gradient issue.
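The effect of a skip connection can be seen in a toy example. The sketch below is a simplified fully connected residual block written in plain Python, not the convolutional block used in ResNet [28]; the helper names and the two-layer form of the residual branch are assumptions for illustration. When the branch weights are zero, the block reduces to the identity followed by a ReLU, which is why a layer that does not help can effectively be skipped.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]

def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F is a small two-layer branch."""
    hidden = relu(linear(W1, x))
    branch = linear(W2, hidden)
    # The skip connection adds the unchanged input to the branch output.
    return relu([b + xi for b, xi in zip(branch, x)])
```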

2) EFFICACY OF THE PROPOSED HYBRID META-HEURISTIC FEATURE SELECTION ALGORITHM
In this work, a hybrid meta-heuristic feature selection algorithm (CSWOA) is proposed to select the optimal feature subset from the entire feature set generated by the pre-trained ResNet. To show the efficiency of the proposed feature selection method, an extensive comparison was performed with several standalone feature selection algorithms. The proposed hybrid meta-heuristic feature selection algorithm was compared with Particle Swarm Optimization (PSO) [31], Equilibrium Optimization (EO) [48], Cuckoo Search (CS) [32], and Whale Optimization (WOA) [33]. In addition, a hybridization of the Particle Swarm Optimization and Equilibrium Optimization algorithms (PSEO) was also implemented for comparison. All the above-mentioned feature selection algorithms are wrapper-based methods. From Table 3, it is observed that the proposed method with the CSWOA feature selection algorithm attained 2.47% higher accuracy than PSO, 3.33% higher accuracy than EO, 0.92% higher accuracy than CS, 1.09% higher accuracy than WOA, and 1.65% higher accuracy than the PSEO algorithm on the FMA dataset. For the MPD dataset, the proposed CSWOA algorithm achieved 1.51% higher accuracy than PSO, 1.54% higher accuracy than EO, 1.31% higher accuracy than CS, 1.09% higher accuracy than WOA, and 1.17% higher accuracy than the PSEO algorithm. The CS algorithm uses very few parameters, leading to its easy implementation and fast convergence. The WOA algorithm, on the other hand, can rapidly find the global optimal solution, as it does not get stuck in local optima. The proposed hybrid meta-heuristic feature selection algorithm combines the most optimal feature subsets of these two wrapper-based algorithms and produces the optimal subset by using the AWCM and SOPF methods. Therefore, the proposed hybrid meta-heuristic feature selection algorithm achieves higher accuracy than the standalone feature selection algorithms. Fig. 7 and Fig. 8 demonstrate the results of different feature selection algorithms on the FMA and MPD datasets, respectively.

3) EFFICACY OF THE XGBoost CLASSIFIER
For the final classification, the XGBoost classifier was employed. In this experiment, the XGBoost classifier was compared with Naive Bayes [49], Random Forest [50], Support Vector Machine [51], and K-Nearest Neighbour [52]. Table 4 demonstrates the superiority of XGBoost over the other classifiers. XGBoost attained 2.25% higher accuracy than Naive Bayes, 3.64% higher accuracy than Random Forest, 8.16% higher accuracy than Support Vector Machine, and 8.54% higher accuracy than K-Nearest Neighbour on the FMA dataset. For the MPD dataset, the XGBoost classifier attained 2.16% higher accuracy than Naive Bayes, 2.01% higher accuracy than Random Forest, 6.51% higher accuracy than Support Vector Machine, and 5.58% higher accuracy than the K-Nearest Neighbour classifier. The XGBoost classifier utilizes the Gradient Boosting Decision Tree approach to enhance gradients. It is a form of ensemble learning that allows successive model embedding until the result converges. Because of this advantage, the XGBoost classifier attained higher accuracy than the other algorithms. Fig. 9 and Fig. 10 illustrate the results obtained from different classifiers on the FMA and MPD datasets, respectively.

D. ABLATION STUDY
For the ablation study, three experiments were conducted. The first experiment was performed with a simple neural network architecture without any hybrid meta-heuristic feature selection algorithm or machine learning classifier. The second experiment was performed without any hybrid meta-heuristic feature selection algorithm. In this experiment, a pre-trained residual network was used to extract features, and those features were then passed into the XGBoost classifier. The final experiment was conducted by removing the XGBoost classifier from the proposed method while retaining the residual network and the hybrid meta-heuristic feature selection algorithm. In that experiment, the final classification was performed using a simple Multi-Layer Perceptron.

1) PROPOSED METHOD WITHOUT HYBRID META-HEURISTIC FEATURE SELECTION AND XGBoost CLASSIFIER
The first ablation experiment on both datasets used ResNet alone for identification. This configuration is shown in the first row of Table 5 and Table 6. This approach obtained 96.23% and 96.88% accuracy on the FMA and MPD datasets, respectively. Compared to the proposed architecture, the accuracy of this method drops by 3.31% on the FMA dataset and 2.91% on the MPD dataset. The reason for the performance drop with only the residual network is that no feature selection method was used in this experiment.

2) PROPOSED METHOD WITHOUT HYBRID META-HEURISTIC FEATURE SELECTION
The second ablation study eliminated the hybrid feature selection algorithm while retaining ResNet for feature extraction and the XGBoost classifier for classification. From the second row of Table 5 and Table 6, it is observed that the accuracy of this approach was reduced by 9.06% and 7.9% for the FMA and MPD datasets, respectively. In this experiment, the high-level features were extracted by a pre-trained convolutional neural network and classification was done by the XGBoost classifier. The high-level features extracted from the pre-trained CNN contain a complex feature representation, which a classical machine learning algorithm is unable to handle accurately without feature selection. This explains the performance drop observed on both datasets in this experiment.

3) PROPOSED METHOD WITHOUT XGBoost CLASSIFIER
In the third ablation study, the experiment was conducted without the XGBoost classifier. Features were extracted using ResNet, the optimal feature set was selected using CSWOA, and a Multi-Layer Perceptron (MLP) was used instead of XGBoost for classification. This configuration is shown in the third row of Table 5 and Table 6. This experiment also illustrates the dominance of the proposed architecture. The accuracy of this approach is reduced by 2.16% for the FMA dataset and 1.5% for the MPD dataset when compared to the original results. The MLP achieves higher accuracy than a classical machine learning algorithm applied without feature selection, but its performance remains below that of the proposed architecture on both datasets.

E. PERFORMANCE COMPARISON WITH STATE-OF-THE ART RESULTS
Table 7 illustrates a comparative study of the previously developed audio aesthetic model with the proposed method. This comparison reveals that the proposed method has outperformed the previous method in terms of accuracy and inference time. The proposed system is able to achieve state-of-the-art results on both the FMA and MPD datasets by attaining 99.54% and 99.79% rank 1 accuracy, respectively. The inference time is 1.67s for the FMA dataset and 7.59s for the MPD dataset. The songs in the FMA dataset are collected from a pool of 224 songs, but songs in the MPD dataset are not restricted. Thus, the improvement in accuracy for the MPD dataset compared to the FMA dataset is attributed to greater song diversity, resulting in more distinctive extracted features. The user count in the FMA dataset is 34, while for the MPD dataset it is 200. The audio aesthetic system introduced by Sieu and Gavrilova [5] achieved a rank 1 accuracy of 95.74% on the FMA dataset and 99.70% on the MPD dataset. Furthermore, in their work, the inference time for the FMA and MPD datasets was 1.85s and 8.12s, respectively.
From the above results, it is observed that the proposed approach obtained higher accuracy than the previous audio aesthetic system. In terms of inference time, the proposed method also attained superior results by obtaining lower inference time than the previous work. The higher accuracy and lower inference time indicate that the proposed method is able to identify users with the highest accuracy and infer unseen data faster than the previous method. Sieu and Gavrilova [5] adopted extensive feature engineering, which increased accuracy and inference time. In the proposed method, on the other hand, a pre-trained transfer learning model, ResNet, was utilized for feature extraction and a hybrid meta-heuristic feature selection algorithm for optimized feature selection. Finally, classification was performed with the XGBoost classifier. The advantage of the proposed method is the use of ResNet and CSWOA for feature extraction and feature selection, respectively. ResNet automatically generated the most significant features, while the feature selection algorithm eliminated the redundant features and generated the most contributing feature subset. The feature extraction and selection approach increased the accuracy by generating an optimal feature subset. In addition, the XGBoost classifier helped reduce the inference time, since it was trained on a smaller, optimized feature subset. From the above discussion, it can be stated that the proposed method is computationally inexpensive and provides a higher identification accuracy than the state-of-the-art method. Thus, the proposed approach is feasible to deploy in real-world applications.

F. RESULTS AND DISCUSSION
To establish the superiority of the proposed model, several experiments were performed on both datasets. The optimized feature vector obtained from the hybrid feature selection model was fed into different machine learning classifiers for identification. XGBoost attained the highest identification accuracy for both datasets among all the classifiers. The detailed results for the FMA and MPD datasets are tabulated in Table 8. From the table, it is observed that for the FMA dataset, the accuracy, precision, recall, and F1-score values are 99.54%, 97%, 98%, and 99%, respectively. For the MPD dataset, these values are 99.79%, 98%, 99%, and 99%, respectively. These results show that the XGBoost classifier identifies users with high accuracy, precision, recall, and F1-score. Fig. 11 depicts the Cumulative Matching Characteristic (CMC) curve, which demonstrates the system's rank 1 to rank 5 recognition rates. The CMC curve in a person identification system reflects the system's accuracy in identifying users within a specified number of predictions. The rank 1 identification rate is the rate at which the system correctly identifies a user in a single prediction. The rank 5 identification rate, on the other hand, is the rate at which the correct user appears within the top five results. A CMC curve's normalized Area-Under-Curve (nAUC) measures its overall accuracy, with an ideal nAUC of 1 corresponding to a thoroughly reliable system. The proposed system obtains an nAUC of 0.9994 across all 34 user classes, with 99.54% rank 1 recognition and 100% rank 5 recognition for the FMA dataset. Similarly, the attained nAUC value for the MPD dataset is 0.9998 across all 200 users, with a rank 1 recognition rate of 99.79% and a rank 5 recognition rate of 100%. The Receiver Operating Characteristics (ROC) curve of the proposed model is depicted in Fig. 12 and Fig. 13 for the FMA and MPD datasets, respectively.
The ROC curve is a measure of the performance of a classifier at different thresholds. It is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR). A system with a high TPR and a low FPR has lower verification errors, indicating its reliability. In the ROC curve, the accuracy of a system is measured by the Area Under the Curve (AUC) score. An AUC value near 1 means the model is highly capable of distinguishing between different classes. Fig. 12 shows the AUC value of the system on the FMA dataset, which is 0.995, and Fig. 13 shows the AUC value of the system on the MPD dataset, which is 0.998. The AUC values show that the proposed system is highly capable of identifying users accurately.
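The rank-based recognition rates of the CMC curve discussed above can be computed from per-class classifier scores. The following sketch is an assumption-laden illustration rather than the evaluation code used in this work: the function names are hypothetical, ties are resolved in favor of the true class, and the nAUC is approximated as the mean of the CMC values over the evaluated ranks.

```python
def cmc_curve(scores, labels, max_rank):
    """Return recognition rates for ranks 1..max_rank.

    scores[i][c] is the score of class c for sample i (higher is better);
    labels[i] is the index of the true class of sample i.
    """
    hits = [0] * max_rank
    for row, true_class in zip(scores, labels):
        # Rank of the true class: 1 + number of classes scored strictly higher.
        rank = 1 + sum(1 for s in row if s > row[true_class])
        for r in range(rank, max_rank + 1):
            hits[r - 1] += 1   # a hit at rank r counts for all larger ranks too
    return [h / len(labels) for h in hits]


def normalized_auc(cmc):
    """Approximate nAUC as the mean recognition rate over the ranks."""
    return sum(cmc) / len(cmc)
```

With this definition, a system that always places the true user first has a CMC curve of all ones and an nAUC of exactly 1.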

V. CONCLUSION AND FUTURE WORKS
In this work, a novel three-stage audio aesthetic-based biometric system is proposed. To the best of our knowledge, this is the first audio aesthetic system that utilizes features extracted automatically by a deep learning approach. In the first stage of this system, deep features were extracted using the pre-trained ResNet architecture. In the second stage, an optimized feature subset was chosen from the generated high-level feature set by utilizing a hybrid feature selection algorithm. In the final stage, the XGBoost classifier was used for user identification based on personal audio preferences. The proposed approach achieved rank 1 accuracies of 99.54% and 99.79% on the FMA and MPD datasets, respectively, surpassing other methods. Furthermore, the nAUC scores ranging from rank 1 to rank 5 on the CMC curve further validate the system's reliability and efficacy. Experimental results demonstrate that the proposed framework can extract more discriminating features from a set of user-preferred songs. This three-stage framework demonstrates the efficacy of intermediary features automatically extracted from a deep learning architecture without extensive feature engineering.
In the future, additional system performance optimization may be achieved through substantial parallelism. Assessing the impact of various fusion techniques on the overall performance of an audio-visual system is another potential research direction. In addition, analyzing the relationship between an individual's preferred music genre and preferred image category might provide a new direction of research in the aesthetic-based biometric domain.