Classification of Indian Classical Music With Time-Series Matching Deep Learning Approach

Music is a heavenly way of expressing feelings about the world, and its language has vast diversity. For centuries, people have debated the distinction between Western and Indian Classical Music. Through this paper, an understanding can be built by differentiating the types of Indian Classical Music. Classical music is one of the essential characteristics of Indian Cultural Heritage. Indian Classical Music is divided into two major parts, i.e. Hindustani and Carnatic. Models have been sculptured and trained to classify between Hindustani and Carnatic Music. In this paper, two approaches are used to implement the classification models. In one approach, MFCCs are used as features to train models such as DNN (1, 2 and 3 layers), CNN (1, 2 and 3 layers), RNN-LSTM, and SVM (Sigmoid, Polynomial and Gaussian kernels). In the other approach, a 3 channel input is created by merging MFCC, Spectrogram and Scalogram features to train models such as VGG-16, CNN (1, 2 and 3 layers), and ResNet-50. The 3 layer CNN and RNN-LSTM models performed best among all the approaches.


I. INTRODUCTION
India has the most extensive Intangible Cultural Heritage globally, and music is one of the most crucial aspects of this Cultural Heritage. Indian Classical Music is divided into two major parts, i.e. Hindustani and Carnatic [1]. The Classical Music tradition followed in India's Northern region is known as Hindustani, while the tradition followed in the Southern region is Carnatic [1]. This distinction was observed around the 16th century; both aspects evolved from a common ancestor during the Bhakti Movement. Raga-based classification of Indian Classical Music is an upcoming area under music information retrieval. Such studies are conceivable due to the availability of a considerable amount of musical data on the Internet. Significant work has been done on multimedia such as text and video, but audio processing, which involves the processing of music and speech, is still in the developing phase. This paper discusses speech processing, which could be used as the basis of classical music classification using features such as MFCCs, Spectrogram, and Scalogram. The sound and music features are studied and extracted to perform classification on these categories of music. The initial phases, which use the music signals, are discussed. Pitch class profile based features and statistical measures based on their acoustic characteristics are also considered, along with the algorithms applied. The promising results are depicted in this study along with a performance comparison.
Studying music is an upcoming area that involves various computational techniques for investigating different forms of music. The computational analysis of music is used to understand the heritage and culture of the society from which the music evolved. It also draws out the science behind the music and helps in developing scientific models. Usually, research focuses on Western Music, while some studies have explored Indian Classical Music [4]. Indian Classical Music is mainly classified into Carnatic music and Hindustani music. The frameworks of these two music forms are similar, with some stylistic differentiation: Hindustani Music hinges on Raga-based composition, while Carnatic Music is built around the Kriti. However, they have grown differently under diverse cultural inspirations [3].
Indian Classical Music is broadly categorized into Carnatic Music and Hindustani Music. Both have a wide following in their own way, but Carnatic Music's complexity is much higher in the way the notes are rendered and arranged [4]. Indian Classical Music is generally based on Raga and Talam. Raga can be considered equivalent to melody in Western Music, and the complexity of ragas is greater than in Western Music in the context of melody and scale. The notes of a raga are sequentially arranged so that they invoke the emotion of the song. A note is called a Swara in Carnatic Music, and every note has a set frequency associated with it [5]. Every Carnatic composition has an associated Talam, and the time duration of a song in Carnatic Music is an integral multiple of the Talam. Talam is just like a beat in Western Music: it signifies the placement of the syllables and the tempo of the music in the composition. In Carnatic Music, Talam is indicated by the singer's hand gestures. Hence, in this paper, Raga patterns are considered to differentiate Hindustani Classical Music from Carnatic Classical Music.
The musical note is considered to be the atomic unit of Indian Classical Music. In a real sense, a musical note is an identifiable fundamental frequency component (also known as pitch) produced by a singer for an appropriate duration [5]. The ratio of the fundamental frequencies of two notes is referred to as an interval [5].
A Raga is the melodic audio generated when notes are combined and played together, similar to a melody in Western Music [9]. Arohana-avarohana patterns play a crucial role in producing different melodies from the same set of notes. The progression of notes, i.e. ascending and descending, is known as the Arohana-avarohana pattern. It gives knowledge about the transformations the notes may go through in a raga [7].

II. LITERATURE REVIEW
The available literature on Carnatic and Hindustani Music is minimal compared to Western Music. Very few studies have been conducted on singer identification and Swara pattern recognition in Carnatic Music [5], [6]. Simultaneously, some work has been done to identify the Ragas in Hindustani Music [7]. In [7], the authors constructed an HMM-based model which recognized two Ragas of Hindustani Classical Music. In [8], the authors suggested a primary difference between the Raga patterns of Hindustani and Carnatic Music: Hindustani music has R1 and R2 raga patterns compared to R1, R2 and R3 in Carnatic. Likewise, G, D and N each have three different frequencies in Carnatic Classical Music against two in Hindustani Classical Music, enhancing frequency identification. The input signal used was a monophonic, voice-only music signal. The signal's fundamental frequency was also examined, and based on these features, the raga identification process was conducted for two Hindustani ragas. For Western Music, researchers have investigated melody retrieval. In this paper, we have taken a song dataset consisting of Hindustani and Carnatic Classical Music, applied speech signal processing algorithms to extract Mel frequency cepstral coefficient (MFCC) features for each song, and applied different classification algorithms to classify Hindustani and Carnatic Classical Music.

III. METHODOLOGY
The proposed methodology uses a dataset comprising audio files of Carnatic and Hindustani Music. The dataset consists of 28 Carnatic and 36 Hindustani files, each with a track duration of 160 seconds. Label '0' denotes Carnatic and '1' denotes Hindustani in this study. The study consists of several layers of comparison, as shown in Fig-1. The first layer consists of extracting 3 features, i.e. MFCC, Spectrogram, and Scalogram, while the second layer consists of integrating the features. The third layer compares performance across models, while the fourth layer compares performance within models by modifying parameters.
Here, 2 types of approaches are used. In one approach, the MFCC, Spectrogram and Scalogram features are merged into a 3 channel configuration and trained through VGG-16, ResNet-50 and CNN models. In the other approach, the MFCCs are used as a 1 channel configuration and trained through DNN, SVM, RNN-LSTM and CNN models.
This study has been carried out on the Google Colaboratory environment, with 13 GB of RAM and an Intel(R) Xeon(R) CPU @ 2.20GHz. Training of the 3 channel input data was carried out on the TPU provided by Google Colaboratory. In this study, the sample rate is assumed to be 22050 Hz. The dataset is divided into training and test data in the ratio of 7:3. Cross-validation is used to check the reliability of the dataset.

IV. FEATURE EXTRACTION
A. MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCCs)
The primary feature extraction technique used in this study is MFCC. MFCC extraction from the audio files is carried out in 5 significant steps: Pre-Emphasis, Frame Blocking and Windowing, Discrete Fourier Transform, Mel Spectrum, and Discrete Cosine Transform [21]. The balancing of sound is performed in Pre-Emphasis, which amplifies the higher frequencies. Eq-1 shows the pre-emphasis filter executed in this study.

y(n) = x(n) − b x(n − 1) (1)
where b is the slope of the filter. Segmentation of the audio is executed in the second step, i.e. Frame Blocking and Windowing, to obtain windowed sections. This technique is effective against the edge effect usually observed in the Fourier Transform [21]. In this study, the audio is segmented into 10 subparts with a window length of 512. Any sampled signal can be represented as a finite series of sinusoids; this transformation is known as the Fourier Transform X(k), shown in Eq-2. In the third step, the spectrum is obtained through the Discrete Fourier Transform of each windowed frame [21].

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N}, k = 0, 1, ..., N − 1 (2)
where N is the number of points; the FFT size is assumed to be 2048 in this study. Usually, after computing the DFT, the spectrum is observed to span a very extensive range. Hence, to make the frequency range perceptually linear, the spectrum is passed through Mel-filter banks, a set of bandpass filters. The Mel scale is shown in Eq-3.

f_Mel = 2595 log10(1 + f / 700) (3)
where f_Mel denotes the perceived frequency and f represents the physical frequency in Hz [25]. The triangular Mel weighting filter is multiplied by the spectrum to evaluate the Mel spectrum s(m), as shown in Eq-4.

s(m) = Σ_{k=0}^{N−1} |X(k)|² H_m(k), 0 ≤ m ≤ M − 1 (4)
where H_m(k) is the weight given to the k-th energy spectrum bin contributing to the m-th output band, and M is the total number of triangular Mel weighting filters.
In the last step, the Mel spectrum is represented on a logarithmic scale, followed by the Discrete Cosine Transform (DCT), through which the cepstral coefficients are produced [21], as shown in Eq-5.

c(n) = Σ_{m=0}^{M−1} log10(s(m)) cos(πn(m + 0.5) / M) (5)

where C is the number of MFCCs, n = 0, 1, 2, ..., C − 1, and c(n) are the cepstral coefficients. The value of C is assumed to be 13 in this study.
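The five steps above can be sketched end-to-end in numpy. This is a minimal illustration, not the paper's exact implementation: the frame hop of 256, the 26-filter Mel bank, and the function name are assumptions, while the pre-emphasis, 512-sample windows, 2048-point FFT, and 13 coefficients follow the text.

```python
import numpy as np

def mfcc(signal, sr=22050, b=0.97, frame_len=512, hop=256,
         n_fft=2048, n_mels=26, n_mfcc=13):
    """Sketch of the five MFCC steps described in the text."""
    # 1. Pre-emphasis: y(n) = x(n) - b*x(n-1)  (Eq-1)
    y = np.append(signal[0], signal[1:] - b * signal[:-1])
    # 2. Frame blocking and windowing (Hamming window, length 512)
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)
    # 3. DFT -> power spectrum (Eq-2), FFT size 2048
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 4. Triangular Mel filter bank (Eq-3, Eq-4)
    f_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_f = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, f_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_f(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log10(np.maximum(power @ fbank.T, 1e-10))
    # 5. DCT-II over the log Mel spectrum -> keep 13 coefficients (Eq-5)
    m_idx = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (m_idx + 0.5) / n_mels)
    return log_mel @ dct.T  # shape: (n_frames, n_mfcc)
```

Each row of the result is the 13-dimensional MFCC vector of one windowed frame.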

B. SCALOGRAM
The portrayal of a signal's time-frequency domain through the wavelet transform is known as a Scalogram. It is used to identify coefficient estimates at respective time-frequency positions [22], and it provides an in-depth visualization of the signal. The Scalogram is the modulus of the multiscale wavelet transform, which elucidates the time-frequency visualization. The spectro-temporal nature of scalograms makes them desirable for neural networks due to the signal's mapping properties with minimal information loss [23]. The Scalogram gives insight into the frequency and energy over time, expressed as the function I_x(t, λ) in Eq-6.

I_x(t, λ) = | (1/√λ) ∫ x(u) ψ*((u − t)/λ) du | (6)

where ψ is the mother wavelet and λ is the scale.
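Eq-6 can be sketched in numpy under the assumption of a Morlet mother wavelet (the paper does not name its wavelet, so this choice, the wavelet length cap, and the function names are illustrative):

```python
import numpy as np

def scalogram(signal, scales, sr=22050, w0=6.0):
    """Modulus of a Morlet continuous wavelet transform at each
    scale/time position (a sketch of Eq-6)."""
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Sampled Morlet wavelet at scale s, length capped for speed
        m = min(int(10 * s), n)
        u = (np.arange(m) - m // 2) / s           # (u - t)/lambda, in samples
        psi = np.exp(1j * w0 * u) * np.exp(-0.5 * u ** 2) / np.sqrt(s)
        # Correlating with the conjugate wavelet = convolving with its reverse
        coef = np.convolve(signal, np.conj(psi)[::-1], mode="same")
        out[i] = np.abs(coef)                     # modulus, as in Eq-6
    return out  # shape: (n_scales, n_samples)
```

Larger scales respond to lower frequencies, so stacking the rows gives the time-frequency image fed to the 3 channel models.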

C. SPECTROGRAM
The visualization of a signal's strength at several frequencies over time in a waveform is known as a Spectrogram; in other words, it is a representation of the signal's loudness. It is the intensity plot of the Short-Time Fourier Transform (STFT) magnitude. The STFT is the succession of Fast Fourier Transforms (FFT) of data segments. Spectrogram extraction is carried out in 8 broad steps, i.e. Pre-emphasis, Frame Blocking, Windowing, Discrete Fourier Transform (DFT), Power Spectral Density (PSD), Mapping and Normalization, Short-time Spectrogram, and Linear Superposition [26]. The Power Spectral Density S_X(f) of a signal X(t) is computed as the Fourier Transform of the autocorrelation function R_X(τ), as shown in Eq-7.

S_X(f) = ∫ R_X(τ) e^{−j2πfτ} dτ (7)
where j = √−1.
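An STFT-based spectrogram can be computed with scipy. The 440 Hz test tone, the 512-sample segments, and the 256-sample overlap below are illustrative assumptions, not the study's settings:

```python
import numpy as np
from scipy.signal import spectrogram

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)  # 1 s synthetic 440 Hz tone

# Spectrogram: power at each (frequency bin, time segment) pair
f, seg_t, Sxx = spectrogram(x, fs=sr, nperseg=512, noverlap=256)
peak = f[np.argmax(Sxx.mean(axis=1))]  # frequency bin with most energy
```

For a pure tone, the row of `Sxx` with the most energy sits at the frequency bin nearest 440 Hz, which is a quick sanity check on the extraction.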

V. MODELS
A. DNN
The DNN models are sculptured and trained for 1 channel input, i.e. the MFCC features. In the 3 layer DNN model, the total parameters are observed to be 3,210,698, all of which are trainable. In the 2 layer DNN model, the total parameters are observed to be 3,196,170, all of which are trainable.

B. RNN-LSTM
In this study, a 2 layer LSTM is used with one hidden layer. The RNN-LSTM model is sculptured and trained for 1 channel input, i.e. the MFCC features. In this model, the total parameters are observed to be 57,802, all of which are trainable. The model is trained through 100 epochs with numerous configurations of parameters like Batch Size, Adam Learning Rate, etc., as shown in the results section of this paper. Dropout layers with a rate of 0.3 are adopted to prevent overfitting of the model. The ReLU activation function is used in the single hidden layer, which follows the 2 LSTM layers, while the Softmax activation function is used at the output layer. Adam is used as the optimizer for this model, with Sparse Categorical Cross Entropy as the loss function.

C. CNN
A layer of the CNN model consists of a 2D Convolution layer with the ReLU activation function, followed by a 2D Max Pooling layer with 'same' padding and strides of (2, 2), followed by a Batch Normalization layer. Joined together, these aspects form a single convolution layer. After passing through the layers of the convolutional network, a Flatten layer changes the output to a single vector, which is further trained in a Dense layer. A Dropout layer is used with the dense layer to prevent overfitting, followed by an output layer with the Softmax activation function. Adam is used as the optimizer for this model, with Sparse Categorical Cross Entropy as the loss function.
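The single convolution layer described above (convolution + ReLU, (2, 2)-stride max pooling, batch normalization) can be sketched in numpy. This sketch uses 'valid' convolution and per-feature-map normalization for brevity; both are simplifying assumptions relative to the described Keras-style layer:

```python
import numpy as np

def conv_layer(x, kernels, pool=2):
    """One CNN 'layer': 2D conv + ReLU, 2x2 max pool (stride 2), batch norm.
    x: (H, W) single-channel input; kernels: (n_k, kh, kw)."""
    n_k, kh, kw = kernels.shape
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    out = np.empty((n_k, oh, ow))
    for k in range(n_k):                       # valid cross-correlation
        for i in range(oh):
            for j in range(ow):
                out[k, i, j] = np.sum(x[i:i+kh, j:j+kw] * kernels[k])
    out = np.maximum(out, 0.0)                 # ReLU
    ph, pw = oh // pool, ow // pool            # 2x2 max pooling, stride (2, 2)
    pooled = out[:, :ph*pool, :pw*pool].reshape(n_k, ph, pool, pw, pool)
    pooled = pooled.max(axis=(2, 4))
    mu = pooled.mean(axis=(1, 2), keepdims=True)   # normalize each map
    sd = pooled.std(axis=(1, 2), keepdims=True) + 1e-5
    return (pooled - mu) / sd
```

Stacking one, two, or three such layers before the flatten/dense stage gives the 1, 2, and 3 layer CNN variants compared in the results.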
CNN Model-A is sculptured as a sequential model and trained for 1 channel input, i.e. the MFCC features. In the 3 layer CNN model-A, the total parameters are observed to be 131,530, out of which 131,338 are trainable and 192 non-trainable.

D. VGG-16
VGG-16 has been sculptured and trained for 3 channel input, i.e. MFCC, Spectrogram & Scalogram. VGG-16 is a combination of the A to E ConvNet configurations: ConvNet A consists of 8 convolution layers and 3 fully connected layers, resulting in 11 weight layers, whereas E consists of 16 convolution layers and 3 fully connected layers, resulting in 19 weight layers. The model is implemented through the TensorFlow library. A preconfigured VGG-16 functional layer has been used, followed by a dropout layer to avoid overfitting and batch normalization. Dense layers are used with the ReLU activation function, and the Softmax activation function is used in the output layer. The model is trained through 30 epochs with numerous configurations of parameters like Batch Size, Adam Learning Rate, etc., as shown in the results section of this paper. Adam is used as the optimizer for this model, with Binary Cross-Entropy as the loss function. In this model, the total parameters are observed to be 58,407,745, out of which 43,606,017 are trainable and 14,801,728 non-trainable.

E. ResNet-50
A special kind of Convolutional Neural Network proposed by K. He et al. is used in this paper [28]. ResNet-50 is sculptured and trained for 3 channel input, i.e. MFCC, Spectrogram & Scalogram. The model consists of a 50 layer deep convolutional neural network followed by an output layer. It is implemented through the TensorFlow library and trained through 20 epochs with numerous configurations of parameters like Batch Size, Adam Learning Rate, etc., as shown in the results section of this paper. Adam is used as the optimizer for this model, with Binary Cross-Entropy as the loss function. In this model, the total parameters are observed to be 23,589,761, out of which 23,536,641 are trainable and 53,120 non-trainable.
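The core idea that lets ResNet-50 train at 50 layers deep is the identity shortcut. A toy numpy residual block illustrates it; the dense transforms here are stand-ins for the convolutions of the real architecture:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Identity-shortcut residual block: out = ReLU(F(x) + x),
    where F is two transforms (a stand-in for ResNet's conv stack)."""
    h = np.maximum(x @ w1, 0.0)    # first transform + ReLU
    f = h @ w2                     # second transform, no activation yet
    return np.maximum(f + x, 0.0)  # add the skip connection, then ReLU
```

Because the input is added back after the transforms, the block only needs to learn a residual correction, which eases gradient flow through very deep stacks.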

F. SVM
SVM can be used to solve classification problems irrespective of linearity or non-linearity. A non-linear transformation is used to lift the training data into a higher dimension through non-linear mapping. The decision function of an SVM with a non-linear kernel is shown in Eq-8.

d(X) = Σ_{i=1}^{N} α_i y_i K(X_i, X) + b (8)
where K(X, Y) is the kernel [29]. Decision boundaries, also known as hyperplanes, are used in SVM to classify the data points. The equation of the hyperplane is shown in Eq-9.

w^T x + b = 0 (9)
where w^T is the transposed weight vector and b is a scalar. In this study, 3 separate kernels are used to classify between Carnatic and Hindustani Music, i.e. Polynomial, Sigmoid & Gaussian.
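The three kernels can be written directly in numpy; the `degree`, `gamma`, and `c0` defaults below are illustrative placeholders, not the hyperparameters used in the study:

```python
import numpy as np

def poly_kernel(x, y, degree=3, c0=1.0):
    """Polynomial kernel: (x . y + c0)^degree."""
    return (np.dot(x, y) + c0) ** degree

def sigmoid_kernel(x, y, gamma=0.1, c0=0.0):
    """Sigmoid kernel: tanh(gamma * x . y + c0)."""
    return np.tanh(gamma * np.dot(x, y) + c0)

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))
```

Swapping one of these K(X, Y) functions into the SVM decision function of Eq-8 yields the three classifiers compared in the results.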

G. ADAM OPTIMIZER
In this study, the Adam optimizer is used in every model because it combines the properties of AdaGrad and RMSProp. Other advantages of Adam over traditional optimizers are that it is memory efficient, easy to implement, and handles highly noisy or sparse gradients easily. The Adam learning rate is varied from 0.0001 to 0.001 in this study.
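A single Adam update can be sketched in numpy to show how the two moment estimates combine; the quadratic objective below is only a toy example, and the hyperparameter defaults are the common textbook values rather than anything stated in the paper:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: decayed averages of the gradient (m, RMSProp-style)
    and its square (v, AdaGrad-style), bias-corrected, then a scaled step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)       # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy run: minimize f(theta) = theta^2 (gradient 2*theta) from theta = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

After the loop, theta has converged to a small neighborhood of the minimum at 0.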

H. LOSS FUNCTION
Sparse categorical cross-entropy and binary cross-entropy are used in this study. Both loss functions are special cases of the cross-entropy loss function and therefore use the same computational relationship, shown in Eq-10.

L(W) = − Σ_{i=1}^{N} y_i log(ŷ_i) (10)
where W is the weight of the neural network, y_i is the true label, and ŷ_i is the predicted label.
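Sparse categorical cross-entropy, as used with the integer labels '0' (Carnatic) and '1' (Hindustani), can be sketched as (the clipping constant is a numerical-safety assumption, not part of Eq-10):

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean cross-entropy where y_true holds integer class indices
    and y_pred holds per-class probabilities, one row per sample."""
    y_pred = np.clip(y_pred, 1e-12, 1.0)             # avoid log(0)
    picked = y_pred[np.arange(len(y_true)), y_true]  # prob of the true class
    return float(-np.mean(np.log(picked)))
```

A perfect prediction gives a loss of 0, while a maximally uncertain two-class prediction gives log(2).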

I. ACTIVATION FUNCTION
ReLU and Softmax activation functions are used in this study. The Rectified Linear Unit (ReLU) activation function is a piecewise linear function whose output ranges from zero to infinity, as shown in Eq-11.

f(z) = max(0, z) (11)
where the output equals z when z is above or equal to 0, and 0 otherwise. The Softmax activation function is used at the output layer for finding the most probable answer to the classification problem, as shown in Eq-12.
σ(z_i) = e^{z_i} / Σ_{j=1}^{K} e^{z_j} (12)

where σ is the softmax function, z is the input vector, and K is the number of classes.
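Both activation functions are one-liners in numpy; the max-subtraction in softmax is a standard numerical-stability trick, not part of Eq-12:

```python
import numpy as np

def relu(z):
    """Eq-11: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def softmax(z):
    """Eq-12: exponentials normalized to sum to 1 (shifted for stability)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

The softmax output always sums to 1, so its entries can be read as class probabilities for the Carnatic/Hindustani decision.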

J. PRECISION
Precision is used as one of the evaluation metrics in this study. Precision gives insight into the proportion of positive identifications that are actually correct, as shown in Eq-13.

Precision = TP / (TP + FP) (13)

where TP is True Positive and FP is False Positive.

K. RECALL
Recall is used as one of the evaluation metrics in this study.
Recall gives insight into the proportion of actual positives that are correctly identified, as shown in Eq-14.

Recall = TP / (TP + FN) (14)

where TP is True Positive and FN is False Negative.

L. F1 SCORE
F1 Score is used as one of the evaluation metrics in this study. The F1 score strikes a balance between precision and recall, as shown in Eq-15.

F1 = 2PR / (P + R) (15)
where P is precision and R is recall.

M. RECEIVER OPERATING CHARACTERISTIC (ROC)
Receiver Operating Characteristic (ROC) is used as one of the evaluation metrics in this study. ROC is an evaluation metric that plots the curve between the True Positive Rate (TPR) and the False Positive Rate (FPR), as shown in Eq-16 and Eq-17.

TPR = TP / (TP + FN) (16)

FPR = FP / (TN + FP) (17)

where TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative.
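Eq-13 through Eq-17 can all be computed from one confusion matrix. Treating label '1' (Hindustani) as the positive class is an assumption made here for illustration:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and FPR from binary labels
    (1 treated as the positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp)                          # Eq-13
    recall = tp / (tp + fn)                             # Eq-14 (= TPR, Eq-16)
    f1 = 2 * precision * recall / (precision + recall)  # Eq-15
    fpr = fp / (tn + fp)                                # Eq-17
    return precision, recall, f1, fpr
```

Sweeping the classifier's decision threshold and plotting (FPR, TPR) pairs traces out the ROC curve used in the evaluation.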

VII. CONCLUSION
This study primarily focused on classification between Carnatic and Hindustani Music from audio files using two broad feature extraction approaches. The 1 channel input, i.e. the MFCC features, was observed to be more effective than the 3 channel input, outperforming it with 96.08% validation accuracy. RNN-LSTM and the 1 layer CNN performed best, yielding the same validation accuracy of 96.08% but validation losses of 0.1356 and 0.1111 respectively for the 1 channel input. This type of classification can open many insights for the Indian music industry. Raga motifs are the foundation of melody in Indian classical music, and the same raga can be redeveloped using different compositions and improvisations. This classification technique can also be used to check the similarity of a raga across different formations. The representation of the melody and its similarity characteristics are very essential for Indian classical music. Future works can be based on humdrum-based sequences. Reinforcement learning models can be generated to predict music segments by associating and considering the Raga, using an action-reward approach with intensive agent-based learning to classify better. Further works will also focus on the melodic shape over the tempo range, which covers the wide performing tempo range of Indian classical concerts.

She has visited the TLI-AP, National University of Singapore; Nanyang Technological University, Singapore; Lincoln University College, Malaysia; and the Asian Institute of Technology, Bangkok, on several academic assignments. She has several publications, books, 27 granted international patents, and two granted copyrights to her credit. She is also a national merit scholarship holder in both 10th and 12th grade. She is a fellow of the International Society for Development and Sustainability, Japan, and an Honorary Fellow of the Iranian Neuroscience Society, Iran; the Royal Society of Arts, London; and Nikhil Bharat Shiksha Parisad.

JEMAL H. ABAWAJY (Senior Member, IEEE)
is currently a Full Professor with the Faculty of Science, Engineering, and Built Environment, Deakin University, Australia. His leadership is extensive, spanning industrial, academic, and professional areas. He is also the Director of Distribution System Security (DSS). He has been actively involved in the organization of more than 200 national and international conferences, serving as chair, general co-chair, vice-chair, best paper award chair, publication chair, session chair, and program committee member. He is also actively involved in funded research, supervising a large number of Ph.D. students, postdoctoral researchers, research assistants, and visiting scholars in the areas of cloud computing, big data, network and system security, and e-health. He is the author/coauthor of five books, ten conference volumes, and more than 250 refereed articles in conferences, book chapters, and journals.

He has coauthored six books, co-edited 76 books, and authored or coauthored more than 300 research publications in international journals and conference proceedings. He holds four patents. His research interests include soft computing, pattern recognition, multimedia data processing, hybrid intelligence, social networks, and quantum computing. He is also a full Foreign Member of the Russian Academy of Natural Sciences.

ANIRBAN DAS is currently associated with the University of Engineering and Management, Kolkata, as a Full Professor in computer science and engineering. He is also a Visiting Scientist with the University of Malaya, Malaysia; the honorary Vice President of the Scientific and Technical Research Association, Eurasia Research, USA; and a Visiting Faculty with the Central University of Jharkhand, Ranchi. He is featured in the Wall of Fame of myGov as ''The Confederation of Elite''. He has authored 11 books, filed 47 patents, and produced more than 50 research publications, mostly in journals and conferences of international repute.
His research interests include machine learning optimization techniques and blockchain. He is a fellow of the Royal Society, U.K.; Nikhil Bharat Siksha Parishad; IETE, India; and RSA, U.K. He is also invited as a delegate from academia to several national bodies, such as CII, NASSCOM, FICCI, and EDCN. He was nominated to the Wall of Academicians of IICDC, powered by DST, AICTE, myGov, and Texas Instruments, in 2019. He is also an Innovation Ambassador certified by the MHRD Innovation Cell, Government of India.
HAIRULNIZAM MAHDIN (Member, IEEE) is currently an Associate Professor with the Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia. He has been actively involved in many international conferences, serving in various capacities, including chair, general co-chair, vice-chair, best paper award chair, publication chair, session chair, and program committee member. His current research interests include data management, the IoT, and blockchain. He is a member of the Malaysia Board of Technologists (MBOT). He has also guest edited many special issue journals.