Motion recognition in Bharatanatyam Dance

This paper presents a method to recognize and classify the motions of Bharatanatyam dance from their underlying motion patterns. Each dance performance is audio-driven and spans space and time. Capturing and analyzing the dance supports cultural heritage preservation and tutoring systems that assist novice learners. This paper addresses a fundamental problem: recognizing the motions in a dance performance based on their motion patterns. The dataset consists of video recordings of Bharatanatyam, an Indian Classical Dance form. The different Adavus (the basic units of Bharatanatyam) are captured using Kinect. Of the captured data streams (RGB, depth, and skeleton), we use RGB. Motion History Image (MHI) and Histogram of Gradient of MHI (HoGMHI) are computed for each motion and used as input to the Machine Learning (ML) algorithms that recognize the motion. The paper explores two ML techniques: Support Vector Machine (SVM) and Convolutional Neural Network (CNN). The overall accuracy of both classifiers is more than 90%. The novelties of the work are (a) analysing all the involved motions based on their motion patterns rather than joint velocities or pose, (b) exploring the impact of the training data and the different features on the classifiers' accuracy, and (c) not restricting the number of frames in a motion during recognition, and formulating a method to deal with the variable number of frames in the motions.


II. MOTIVATION & RELATED WORK
Motion classification has been a well-studied problem in recent years. Researchers have explored different machine learning and deep learning techniques to solve it. Several works report on ICDs (Indian Classical Dances) and non-ICDs (Salsa, Samba, Ballet, and many more). We limit our discussion to the more relevant and recent works.
An article [2] describes the laws of movement in ICDs based on Natyashastra. In [3], a model for complex motion classification of Malaysian dance is proposed. To extract motion information, the authors use Histogram of Optical Flow (HOOF) and Space-Time Interest Points (STIP) as features. KNN, neural network, and TreeBagger classifiers are explored as classification models, of which the TreeBagger classifier gives the maximum accuracy of 92.6% with N = 150 trees. The work in [4] addresses salsa dance steps. Motion features are extracted from the three-dimensional sub-trajectories of the dancers using PCA. Two classifiers, Gaussian mixture model (GMM) and Hidden Markov model (HMM), are used for motion classification. The HMM with three hidden states achieves an F-measure of 74%.
In [5], an approach is proposed to classify dancers' motion in an Indian Classical Dance performance. The velocity of the skeleton joints is used as the feature of the Bharatanatyam dancer's motion. Dynamic time warping (DTW) and the kNN algorithm are explored as classification methods. The proposed approach achieves an accuracy of 85% over eight variations of Natta Adavu. The authors in [6] use the eight joint velocities of each frame involved in a motion to evaluate a Bharatanatyam dance performance. The evaluation is done against the expert's performance. The videos of the expert and the learner have the same length to avoid a mismatch in the feature dimension. In [7], the authors try to differentiate the motions in Bharatanatyam and Kathak based on the position and tension of the body limbs and the hand postures. It is for visual analysis only.
Other works [8]-[16] address human motion recognition using skeletal data. However, these are not related to dance. The skeletons used in these works are derived ones, so an ill-formed skeleton may affect the classifier in all these approaches, as reported in [17]. RGB data is largely free from this limitation because there is little chance of getting corrupt RGB data.
Two works [18], [19] try to recognize the dance video actions based on postures. These are hardware-based embedded systems that use Field Programmable Gate Array (FPGA).
Advances in deep learning have made it possible to solve computer vision problems proficiently. In [20], the authors construct a Gaussian Mixture based Hidden Markov Model to recognize the artistic motion videos of Peking Opera captured by an OptiTrack 3D vision device. In [21], the authors train an ensemble AdaBoost multi-class classifier on motion video data of various ICD forms by extracting features such as shape signature, Hu & Zernike moments, HAAR, and LBP features. It identifies the dance postures and then the dance type based on these postures. Nriyantar [22] is a deep learning-based approach where an SSTN (Symmetric Spatial Transformer Network) is trained to recognize the sequence of ICD dance forms using the postures. It uses a 3D CNN as the classifier and pose signatures based on the skeletal joints as the feature. In [23], the authors apply a CNN to classify 200 hand gestures (Mudras)/poses from different ICD forms using data collected from YouTube. These methods mostly do not consider the relation between adjacent frames at the initial layers of the architecture.
Deep networks are not restricted to dance motion classification. They are also used in human motion classification, new motion generation, and many other applications. In [24], a generative model is proposed using a deep learning framework capable of generating new unseen motions by learning from a large set of human motion capture data. Research on gesture recognition [25] is another active area of motion recognition. In that work, the researchers propose a Recurrent Neural Network (RNN) based on gestures' kinematics, which achieves an accuracy of over 98% in 16 experiments. A deep reinforcement learning-based algorithm is proposed in [26] to recognize human arm movements using a wearable device. The algorithm learns patterns from the acceleration data of human arm motion. The proposed architecture does not need any feature extraction technique to train the model. It achieves accuracy similar to a deep neural network framework but uses less data.
VOLUME 4, 2016. This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
As reported in [27], MHI has been a simple and robust approach to recognizing human motions or actions. That article summarises the related works based on MHI. The method was first proposed in [28]. The same research group also presents an updated version of MHI, called hierarchical MHI [29], to recognize motion. Instead of using the MHI (showing the motion pattern) in its raw form, a few articles [30]-[32] apply various histogram techniques to MHI for motion recognition using ML. Among these, [30], [31], and [32] extract features from MHI using Histogram of Optical Flow (HoF), Histogram of Gradient (HoG), and Motion History Histogram (MHH), respectively, for the SVM. They achieve good recognition accuracy with these methods.
As discussed, several recent works [5]-[7], [18]-[23] address motion/action recognition in ICDs, but except for [5] and [6], the rest are based on pose/posture-based action recognition. Although [5] and [6] try to solve motion recognition, they do not consider the motion as a pattern that unfolds across a set of frames. They use skeletal joint velocities to distinguish the motions rather than their pattern. This work addresses this gap and solves motion recognition based on the motions' pattern recorded in the form of MHI.
The findings of the literature review are summarised below as gaps, which lead to the contributions of the paper.
Identified Gaps:
(a) Motions are recognised based on skeletal joint velocities, as reported in [5], [6], rather than on the pattern that varies across a set of frames. Moreover, an ill-formed skeleton may affect the classifiers [17].
(b) A motion, when performed in different instances, does not always contain the same number of frames. However, the earlier approaches [23], [33] assume that all the motion videos contain the same number of frames so that the feature vectors have the same dimension across the motions.
(c) Most deep learning approaches use transfer learning techniques to solve such problems, which may not always help.
(d) Raw frames are the primary input to deep networks; pre-extracted features as input are yet to be explored.
(e) The few reported works address only specific Adavus and do not include all the Adavus in the analysis.
Contributions:
(a) Recognising the motions based on their pattern that varies across a set of frames rather than using the skeletal joint velocities as a feature [5], [6].
(b) A solution approach that generates features of the same dimension even though the number of frames differs from motion to motion.
(c) A scheme to select frames for the 3D CNN implementation so that the feature length is uniform.
(d) Instead of a pre-trained network, we use our own CNN design, which is simple and proves to be effective.
(e) Exploring both 3D and 2D CNN models, where the former uses raw frames and the latter uses MHI as input.
(f) This work includes most of the Adavus in the motion classification.

III. DATASETS
The basic unit of Bharatanatyam is known as Adavu. Combinations of these Adavus generate a dance sequence in Bharatanatyam. The entire Nritta (based on Natyashastra [2]) rests on these Adavus. An Adavu includes different gestures of the hands, feet, arms, and body. The dancer rubs, stamps, slides, and touches the ground in various ways to synchronize with the different syllables (bols or Sollukattu) used. The Adavus are differentiated based on various rhythmic syllables. These syllables are based on the type of the dancer's footwork. There are fifteen Adavus, according to the Kalakshetra School of Training. Most of the Adavus have two or more variants. Table 1 shows this information.
We record the Adavu videos in a controlled environment [34] using the Microsoft Kinect 1.0 sensor [35]. The software Nuicapture [36] helps capture and extract the various data streams: skeleton, depth, RGB, and audio. Table 1 shows the Adavu variants, the number of dancers who perform these dances, and the recordings. We drop the Pakka and Sarikkal due to the unavailability of motion annotations. Each video contains approximately 700-1000 frames. A sample of this data is available at [1].
Experts annotate the motions in each video. The annotation gives the duration (start frame # to end frame #) of a particular motion and its type. The type denotes the motion class. Table 2 shows a sample annotation. The ID in the annotation encodes the Adavu type, Dancer #, and motion class. For example, the ID J3D1M2 means: J3, the third variation of the Joining Adavu; D1, the performer is Dancer-1; and M2, the 2nd motion class.
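As an illustration of the ID convention described above, the following minimal sketch splits an ID such as J3D1M2 into its parts. The regular expression, the `MotionId` tuple, and the assumption that every ID consists of Adavu letters and a variant number followed by `D<dancer>` and `M<motion>` are ours; only the J3D1M2 example comes from the text.

```python
import re
from typing import NamedTuple


class MotionId(NamedTuple):
    adavu: str         # Adavu code letters, e.g. "J" for Joining (assumed encoding)
    variant: int       # Adavu variation number
    dancer: int        # dancer number
    motion_class: int  # motion class number


# Hypothetical pattern inferred from the single example "J3D1M2".
_ID_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)D(\d+)M(\d+)$")


def parse_motion_id(text: str) -> MotionId:
    """Split an annotation ID into Adavu code, variant, dancer, and motion class."""
    match = _ID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a motion ID: {text!r}")
    adavu, variant, dancer, motion = match.groups()
    return MotionId(adavu, int(variant), int(dancer), int(motion))
```

For instance, `parse_motion_id("J3D1M2")` yields the Adavu code `J`, variant 3, dancer 1, and motion class 2.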
The motion recognition is performed mainly on those motions with sufficient examples for training any classifier. Table 3 shows the different Adavus, total number of motions, and unique motions involved in each. There exist 334 unique motions in total (Motion IDs, M1-M334). This motion information in Table 3 says that only a few motions can be classified using the ML models due to insufficient data. The motion statistics for training and testing in the SVM and CNN are shown in Section V.

A. PRE-PROCESSING
The input video frames are in the form of RGB images. These RGB frames are first converted into grayscale representation to avoid color differences. We then eliminate the background as reported in [37] using the depth information captured by Kinect to retrieve only the dancer information. Fig.3 (a) shows the background eliminated gray frame.

FIGURE 3: Background subtracted gray frame and Contour
The data now consists of gray video frames with their backgrounds subtracted. The gray frames may have sharp transitions due to random noise. We use an average filter with a 3 × 3 kernel to reduce this noise. After this, we create binary frames using a threshold approach as in (1),
where (x, y) represents the pixel location, $I_i$ represents the intensity of the i-th frame, and $T_h$ represents the threshold intensity.
FIGURE 4: Identifying proper threshold ($T_h$) for image thresholding
To find the required threshold ($T_h$), we analyze the presence of the dancer (# of pixels contributing to the dancer) in a given background-subtracted frame and its action across all the Adavus. We find that the dancer and the background occupy 4%-7% and 96%-93%, respectively, of a frame of size 480 × 640. This variation is due to the dance type. For example, in Natta Adavus, the dancer stretches the legs and hands. Hence, during this dance, the dancer occupies more area in the frame than in the Tatta Adavus, where the dancer stands in a particular place and performs foot tapping. The dancer's pixels in the gray frame are non-zero, and the background has zero intensity (black). We draw a graph representing the variation of the non-zero pixel values with the intensity values, as shown in Fig. 4. We identify that the curve becomes flat when the intensity exceeds the value 30. That means all these pixels belong to the dancer. So, we set $T_h$ as 32.
After converting the gray frames to binary, a list of boundary pixels is extracted from the binary image as contour points. The contour is then detected by joining these pixel points. Contouring helps in identifying the structural outline of the dancer (as shown in Fig. 3 (b)). This contour helps in generating the MHI pattern.
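The smoothing and thresholding steps can be sketched as follows. This is a plain-Python illustration on nested lists, not the authors' implementation (a real pipeline would more likely use NumPy or OpenCV). The function names, the strict `>` comparison, the 0/1 output values, and the choice to leave border pixels unsmoothed are assumptions; only the 3 × 3 average filter and the threshold value 32 come from the text.

```python
from typing import List

Frame = List[List[int]]  # gray frame as rows of 0-255 intensities


def average_filter_3x3(frame: Frame) -> Frame:
    """Replace each interior pixel by the mean of its 3x3 neighbourhood.

    Border pixels are copied unchanged (an assumption made for brevity).
    """
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = sum(frame[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total // 9
    return out


def binarize(frame: Frame, threshold: int = 32) -> Frame:
    """Mark pixels brighter than the threshold as foreground (1), others as 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in frame]


def foreground_fraction(binary: Frame) -> float:
    """Fraction of the frame covered by foreground (dancer) pixels."""
    total_pixels = sum(len(row) for row in binary)
    return sum(sum(row) for row in binary) / total_pixels
```

On a real 480 × 640 frame, `foreground_fraction` would be expected to fall in the 4%-7% range reported above if the threshold separates the dancer from the background well.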

B. FEATURE EXTRACTION
The initial step of the feature extraction is generating the MHI for a given motion. This static image representation of the motion stores the path across a set of frames as it progresses. We use the following steps to get the MHI, as reported in [28]. Here, the MHI is a 2D matrix of size 480 × 640 initialized to zero.
• Find the contour of the dancer in each frame associated with a given motion.
• Compute the frame difference between two contours.
• Apply a threshold to create a binary image as in (1).
• Assign the differential binary frame to the MHI to update its values.
• Continue the process till the end of the frames to obtain the MHI for the given motion.
In the MHI, the motion frames are merged into a single frame (image) where motion recency is represented as the intensity of the pixels. Fig. 5 shows the generation of the MHI for a given motion. In contrast, the article [30] suggests differencing of frames instead of contours. However, this process leaves some background information and may not be effective.
In the MHI representation, pixel intensity shows the motion history at a particular location. It can be seen as a function of temporal volume. It depends on two parameters: the pixel value and the time. Brighter (larger) intensity values represent a more recent motion, and vice versa. The MHI of a motion generates the motion template as an image where each pixel stores the information of its motion history. A motion consists of a set of frames. The MHI is updated according to the pixel values across the frames using (2) and creates a static template.

$$MHI_t(x, y) = \begin{cases} \tau, & \text{if } |F_t(x, y) - F_{t-1}(x, y)| > T_{hr} \\ \max\big(0,\; MHI_{t-1}(x, y) - 1\big), & \text{otherwise} \end{cases} \qquad (2)$$

where (x, y) represents the pixel location, $F_t$ represents the frame at time t, $T_{hr}$ is the threshold, $MHI_t$ represents the updated MHI up to time t, and $\tau$ represents the maximum value. Fig. 5 shows the MHI of a given motion as it progresses with the frames. With this method, a static image (a 2D array of pixels) is generated for a single motion, which is further used as a feature for recognizing the motion using image classification. Fig. 6 shows the MHI representation of a few Natta motions.
Suppose there is a movement at a particular pixel, i.e., the absolute difference of the pixel value at times t and t − 1 is greater than $T_{hr}$. In that case, the MHI value at that pixel is set to the maximum value $\tau$. Otherwise, the MHI value at that pixel is reduced by 1 unit. So, if motion at a particular pixel occurred, say, 15 time intervals (15 frames) ago, the MHI value at that location is $\tau$ − 15, which results in low intensity at that pixel. This method records more recent motion with high intensity and less recent motion with low intensity.
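The update rule described above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions rather than the authors' code: frames are nested lists of pixel values, values are kept non-negative by clamping at zero as in the standard MHI formulation, and all function and parameter names are ours.

```python
from typing import List, Sequence

Image = List[List[int]]


def update_mhi(mhi: Image, prev: Image, curr: Image, tau: int, thr: int) -> Image:
    """One MHI step: set moving pixels to tau, decay all others by one unit."""
    return [
        [tau if abs(c - p) > thr else max(0, m - 1)
         for m, p, c in zip(mhi_row, prev_row, curr_row)]
        for mhi_row, prev_row, curr_row in zip(mhi, prev, curr)
    ]


def motion_history_image(frames: Sequence[Image], tau: int, thr: int) -> Image:
    """Fold a whole motion (any number of frames) into a single MHI template."""
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]  # MHI starts at zero
    for prev, curr in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, prev, curr, tau, thr)
    return mhi
```

Because the loop consumes every consecutive frame pair, the output has the same size regardless of how many frames the motion contains, which is the property the later sections rely on.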

1) Histogram of Oriented Gradients (HOG)
A 2D vector represents a motion feature. These 2D features are converted to one-dimensional and fed into a machine learning classifier.
The HOG [40] is an effective feature descriptor used for human detection in real-life image processing scenarios. It is computed over an image divided into equally sized cells and applies overlapping local contrast normalization. Various works [41]-[45] use HOG as a descriptor to successfully recognize human actions. So, we compute the HoG of the MHI (HoGMHI) associated with the motions. It is used as a feature for the classifier. The paper [30] also uses HoG of MHI, but to compute the MHI, it uses frame differencing, unlike our approach where we use contours to record the MHI.
The dimension of HoGMHI is much higher than the number of data samples. So, SVM [46] is suitable for this scenario. Hence, we choose SVM as our classifier for this feature. The SVM is trained on the HoGMHI of the motions. Fig. 2 shows the flow diagram. Here, we apply a one-vs-one SVM [47] with hyperparameter c = 1 and a linear kernel. It predicts the class of an unlabelled motion. Each binary classifier is trained on one pair of classes, taking one class as positive and the other as negative. Hence, if there are n classes, this SVM generates n * (n − 1)/2 classifiers to distinguish these n classes.
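To make the n * (n − 1)/2 count concrete, the sketch below enumerates the class pairs of a one-vs-one scheme and combines pairwise decisions by majority vote. The voting rule is the usual one-vs-one decision rule and is assumed here, since the text does not state how the pairwise outputs are combined; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple


def one_vs_one_pairs(classes: Sequence[str]) -> List[Tuple[str, str]]:
    """All unordered class pairs; one binary classifier is trained per pair."""
    return list(combinations(classes, 2))


def majority_vote(pair_winners: Iterable[str]) -> str:
    """Predicted class = the label that wins the most pairwise decisions."""
    return Counter(pair_winners).most_common(1)[0][0]


# 54 motion classes (SVM experiment) and 11 classes (CNN experiment).
pairs_54 = one_vs_one_pairs([f"M{i}" for i in range(1, 55)])
pairs_11 = one_vs_one_pairs([f"C{i}" for i in range(1, 12)])
```

With 54 classes this gives 54 × 53 / 2 = 1431 binary classifiers, and with 11 classes it gives 55.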
Some motions occur in several Adavus, whereas some are confined to a particular Adavu. Again, there are only three performers for each Adavu. So, some motions have insufficient data samples. To build a good SVM classifier, we ignore the motions that suffer from data shortage and consider only the motions with at least 12 samples. This restriction leaves only 54 motions out of 334, for which we build our SVM classifier. We split the available dataset in a 3:1 ratio, where 75% is used for training and 25% for testing. The train-test split for each motion is shown in Table 4.
1) Result Analysis in SVM
Table 4 shows the individual accuracy of the 54 motions, and the column named Misspredicted As gives the misclassification details in the pattern < MissPredicted Motion ID>(<# of Misspredicted samples>). For example, in M21, out of three test samples, one sample is predicted as M19. So, the column Misspredicted As represents it as M19 (1). Similarly, in M72, out of 10 test samples, three samples are predicted as M73. Hence, the misclassification is represented as M73 (3).
As we can observe in Table 4, most of the motions are misclassified as another motion for only one or two samples. The overall accuracy is 84.68%. However, out of 54 motion classes, 27 perform well with more than 80% accuracy, and most of them achieve 100% or nearly 100%.
Some of the misclassifications are expected. For example, three samples of M72 are misclassified as M73 due to low inter-class variance. In M72, the dancer raises the body on the toes and raises the left hand, then comes down on the feet with a slight movement of the hand. In M73, the dancer raises the body on the right foot while spreading the left hand and brings the body down by tapping the foot. The MHIs of these two motion classes are very similar, as the motion of two feet may not be distinguishable from a single-foot movement in some instances. So, it leads to misclassification. Fig. 7 shows the MHI images as evidence. In this type of motion, the dancer's stability matters. A similar misclassification is observed between M74 and M75, which are the mirror replications of M72 and M73. It can be visualized using Fig. 7.
The motions achieving 100% accuracy or more than 80%
We also test the SVM by considering the motion classes having samples ≥ 25. This approach can give an idea regarding the behavior of the classifier with the increase of the data samples. The classifier performs better with higher training samples, giving 95.58% accuracy. The result analysis is illustrated in detail in Section V. Table 6 shows the result.

D. CNN AS CLASSIFIER
Convolutional networks are widely used and preferred in computer vision tasks. Here, we train two CNNs on two different types of input data. One CNN (3D-CNN) takes the raw sequence of motion frames as input, and the other (2D-CNN) takes an extracted MHI. The motion samples are extracted from the original RGB videos using the motion frame annotation files. The frames belonging to a single row in the annotation file give us one unique motion data sample. As part of pre-processing, we reduce the size of each frame in a given motion by half and convert the reduced frame to grayscale. After that, we collect the motion classes that have at least 25 samples. That way, we get 11 classes of motions to train and test our CNN models. Fig. 2 shows the workflow of motion classification in CNN.
Here, we follow two approaches for the motion classification task.
• Approach-1: A 3D CNN model. It uses the raw motion frames as input.
• Approach-2: A 2D CNN. It uses the extracted MHI template as input.
A particular motion does not necessarily contain the same number of frames, even when performed by the same dancer in different instances. However, in Approach-1, the 3D CNN expects every input to have the same dimensions. For the 11 motion classes we consider, the number of frames used for each motion is 16. So, we standardize each sample to fit in 16 frames. To drop the excess frames from the motion samples with more than 16 frames, we formulate an equation as reported in (3), where 1 ≤ i ≤ 16, $Frame_i$ is a vector of the frame indices, and n is the number of frames in a given motion. For example, if there are 20 frames in a given motion, we consider the frame indices ($Frame_i$) [1, 2, 3, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20].
We propose an alternative method to avoid dropping/discarding frames. We compute an MHI (motion history image) for each motion sample. It gives a single output image irrespective of the number of frames in the original video. So, it is a template for the sequence of frames in a video. MHI can be computed in different ways, but in our case, we use a contour-based MHI approach where we consider the boundaries of the dancer in each frame of the motion video to compute the MHI. Since we compute the MHI using all the frames in the original motion video, the MHI model is agnostic to the number of frames in a motion video.
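For the frame-dropping step of Approach-1, one selection rule that reproduces the 20-frame worked example exactly is $Frame_i = \lfloor i \cdot n / 16 \rfloor$ for 1 ≤ i ≤ 16. The sketch below implements that rule; it is inferred from the example only and may differ from the form of (3) used by the authors, and the function name and error handling are ours.

```python
from typing import List


def select_frame_indices(n: int, target: int = 16) -> List[int]:
    """Pick `target` 1-based frame indices out of n frames, evenly spread.

    Uses floor(i * n / target) for i = 1..target, a rule inferred from the
    20-frame example in the text (assumption, not the published equation).
    """
    if n < target:
        raise ValueError("motion has fewer frames than the target length")
    return [(i * n) // target for i in range(1, target + 1)]
```

Since n / 16 ≥ 1 whenever n ≥ 16, consecutive indices always differ by at least one, so no frame is selected twice and the last frame of the motion is always kept.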

1) Approach 1: non-MHI
A 3D motion video sample concatenates sequential frames, where one frame transitions to the next. So, our model learns the features between adjacent frames and the dependencies between the pixels in each frame. The proposed ConvNet is 3D in nature: it takes the entire 3D sample in one pass, instead of one frame at a time, for classification. Fig. 8 shows the 3D CNN architecture.
FIGURE 8: CNN architecture (3D) for non-MHI Approach
The 3D CNN contains eight convolutional layers and two fully connected layers. The eight convolutional layers extract the features of the 3D motion sample. At each convolution layer, four operations, i.e., convolution, non-linear ReLU, batch normalization, and max-pooling, are applied in order. Each convolution layer is either a spatial layer or a temporal layer. All operations except pooling are the same at both spatial and temporal layers. We apply a 3D convolution on the input with 16 kernels of size 3 × 3 × 3 with stride one and extract 16 channels at each layer. The convolution is followed by a non-linear ReLU and batch normalization. A spatial layer processes the input and produces an output with the number of features reduced by half along the width and height dimensions. A temporal layer emits an output with the number of frames reduced by half. Both of these tasks are accomplished by max-pooling at each layer. For the spatial layer, the 3D pooling kernel is of size 1 × 2 × 2, which extracts the maximum value in each non-overlapping window of size 2 × 2 in each frame. For the temporal layer, the pooling kernel is of size 2 × 1 × 1, which extracts the maximum of 2 features of adjacent frames. We get output features of size 16 × 1 × 15 × 20 when the input is passed through the eight convolutional layers. We convert these features into a single array of 4800 features and pass it to the FC layers. The final 2 FC layers map the extracted features to the individual motion classes.
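The reported output size can be checked with simple shape arithmetic. The sketch below assumes an input of 16 gray frames of size 240 × 320 (480 × 640 halved, as described in the pre-processing), convolutions padded so that only pooling changes the size, and four spatial plus four temporal layers in alternating order. The split into four and four follows from the sizes (240 → 15 and 16 → 1 each need four halvings); the padding and the ordering are assumptions, and all names are illustrative.

```python
from math import prod
from typing import List, Sequence, Tuple

Shape = Tuple[int, int, int, int]  # (channels, frames, height, width)


def layer_output(shape: Shape, kind: str, channels: int = 16) -> Shape:
    """Output shape of one conv layer followed by its max-pooling."""
    _, frames, height, width = shape
    if kind == "spatial":    # pooling kernel 1 x 2 x 2 halves height and width
        return (channels, frames, height // 2, width // 2)
    if kind == "temporal":   # pooling kernel 2 x 1 x 1 halves the frame count
        return (channels, frames // 2, height, width)
    raise ValueError(f"unknown layer kind: {kind}")


def trace_shapes(input_shape: Shape, layers: Sequence[str]) -> List[Shape]:
    """Shapes after each layer, starting with the input shape."""
    shapes = [input_shape]
    for kind in layers:
        shapes.append(layer_output(shapes[-1], kind))
    return shapes


# Assumed layer order: spatial and temporal layers alternate.
shapes = trace_shapes((1, 16, 240, 320), ["spatial", "temporal"] * 4)
flattened = prod(shapes[-1])
```

Under these assumptions the final shape is 16 × 1 × 15 × 20, which flattens to the 4800 features passed to the FC layers.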

2) Approach 2 : MHI
This section discusses the methodology of extracting the MHI and its relevance to our use case. While performing a motion, a dancer's posture transitions from one frame to another. MHI can efficiently capture these transitions in a motion video in a single template image. This motion template applies to our problem since the motions are short and non-overlapping, i.e., the dancer does not enter the same state twice in a single motion video, thus preventing features from being overwritten in the MHI. We first find the boundaries of the dancer's contours in every single frame. We apply image processing techniques offered by OpenCV to find the outline of the dancer. Initially, the MHI is set to zero. The contour image frames are stacked to get the final MHI. Fig. 5 depicts the computation of a sample MHI. Fig. 6 shows the MHIs of some motions involved in Natta Adavus.
Before we look into the architecture of the CNN, we investigate the MHI feature. The MHIs are sparsely activated; that is, the number of non-zero pixels is very small compared to the entire MHI image space. So, we design a classifier that can adequately learn the MHI features. We include only two convolutional layers and three fully connected layers in the 2D CNN. Fig. 9 shows the proposed CNN. At each of the two convolutional layers in the CNN, we apply the following operations in order: i) convolution, ii) non-linear leaky ReLU, iii) layer normalization, and iv) max pooling. During convolution, the input is convolved with a 3 × 3 kernel with stride two along both input directions. We extract 16 and 32 channels at Conv1 and Conv2, respectively. Necessary padding is appended at the edges of the input to keep the output size consistent with the input. Instead of the normal ReLU, which completely nullifies the negative features, we apply leaky ReLU, which scales the negative features down by a factor of 100. To reduce the effect of the initialized weights on the convergence of the model, we apply layer normalization, which normalizes the features of a single input. Finally, we apply max-pooling, which extracts the maximum features in a 2 × 2 window moving with stride two along both directions.
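The two element-wise operations named above can be illustrated on a flat list of features. In this sketch the negative slope of 0.01 corresponds to scaling negative features down by a factor of 100; the `eps` term and the omission of learnable gain and bias parameters in the normalization are simplifying assumptions, and the function names are ours.

```python
import math
from typing import List


def leaky_relu(values: List[float], negative_slope: float = 0.01) -> List[float]:
    """Keep positive features; shrink negative ones instead of zeroing them."""
    return [v if v >= 0 else negative_slope * v for v in values]


def layer_norm(values: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize the features of a single input to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = math.sqrt(variance + eps)
    return [(v - mean) / scale for v in values]
```

Unlike batch normalization, `layer_norm` uses statistics from one sample only, so its output does not depend on which other samples share the batch.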
The outputs of the Conv layers are passed through 3 FC layers. At the first 2 FC layers, a series of operations, i.e., linear, leaky ReLU, dropout, and batch normalization, are applied in order. The linear operation maps an input feature array to another size. The dropout operation drops 10% of the output features randomly. We also restrict the number of epochs to 14, based on an analysis of the train and test accuracy (Fig. 10), to prevent the model from overfitting. Then we apply batch normalization, which normalizes the features over a batch of samples. Finally, the third FC layer maps the output features from FC2 into an array of 13 features. These features are converted into class scores using the softmax layer. We use the standard cross-entropy loss function [48] and the Adam algorithm [49], [50] to optimize our network.
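As a reminder of how the final scores and the loss relate, the following minimal sketch computes softmax probabilities and the cross-entropy loss for one sample. It is a generic textbook formulation, not the authors' training code; the function names and the max-subtraction trick for numerical stability are our choices.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw FC outputs into class probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: List[float], target: int) -> float:
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(scores)[target])
```

An untrained network that scores all k classes equally yields a loss of ln(k), which is a convenient sanity check at the start of training.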

3) Result Analysis in CNN
The CNN may overfit due to the small number of data samples. So, we lower the network capacity and ignore the motions having fewer than 25 samples. With this constraint, we get only 11 motions for the two different CNN classifiers. Fig. 8 and 9 show the architectures.
Between the MHI and non-MHI (raw data) features, we see that the MHI feature performs slightly better. Fig. 11 and Table 6 show this comparison. Moreover, we cannot ignore the fact that it is faster, since it operates on a 2D image, unlike the Non-MHI model. We computed the execution time for both approaches and report it in Table 5. In the MHI model, the execution time also includes the time required to compute the MHI in each step (on train and test data). The time taken by the Non-MHI CNN model to predict a dance sequence is found to be 12.89 times that of the MHI-CNN model. Again, the MHI model can be modified easily to classify motion samples containing any number of frames, whereas in Non-MHI, we need to scale the number of frames down or up to make the feature vectors the same length. MHI may not capture motion correctly if there are overlapping objects in the video. However, this does not happen in our dataset because only one dancer appears in the scene during the performance. In the CNN approaches (MHI/Non-MHI), most of the motion classes achieve more than 90% accuracy; where misclassifications occur, they involve only one or two samples. In CNN MHI, except for M72, M74, and M75, the accuracy of the classes is either equal to or better than that of CNN non-MHI. Fig. 11 shows this. However, on analysis, we find that the performance of both CNN models is very close (MHI: 93.92% and Non-MHI: 93.37%).
TABLE 5 (notes): The unit of time is second (s). $T_{rT}$ and $T_{sT}$ are for m and n samples, respectively, with m = 515 and n = 181 (refer Table 6). $AvgT_{sT} = T_{sT}/n$. Surviving entries: 0.0685 (s), 0.005 (s).
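A short arithmetic check helps to read the two overall CNN accuracies. With n = 181 test samples, 93.92% and 93.37% are consistent with 170 and 169 correctly classified samples, respectively, i.e., a difference of a single test sample. These counts are our inference from the reported percentages and n, not figures taken from the tables; the helper name is illustrative.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the text."""
    return round(100.0 * correct / total, 2)


N_TEST = 181  # number of test samples reported for the CNN experiments

# Inferred counts of correctly classified test samples (assumption).
mhi_accuracy = accuracy_percent(170, N_TEST)
non_mhi_accuracy = accuracy_percent(169, N_TEST)
```

Seen this way, the gap between the two CNN models is within one sample, which supports the statement that their performance is very close.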
We analyze the trade-off between the number of training samples and accuracy. We find that both CNN models behave alike, which is evident in Fig. 12. However, in M82, CNN MHI performs exceptionally well with fewer samples. This may be due to the pre-extracted MHI input, in which the motion pattern is already visible to the CNN, unlike the Non-MHI model, where the CNN extracts the features from the raw frames.
In CNN MHI (refer Fig. 12), except for M82, when the number of training samples lies between 20 and 32, the accuracy varies between 60% and 90%. However, when the number of samples exceeds 37, the accuracy of all the classes reaches 100% except for M257 (97.37%). Similarly, in CNN non-MHI (refer Fig. 12), except for M74, when the number of training samples lies between 20 and 32, the accuracy varies between 57% and 80%. However, when the number of samples exceeds 37, the accuracy becomes 100% or very close to 100%.

V. RESULT ANALYSIS-CNN VS SVM
This section compares the results of the two classifiers, CNN and SVM. As discussed in Section IV-D3, CNN MHI performs slightly better than CNN Non-MHI. Comparing CNN MHI and SVM, we find that both approaches follow a common trend in performance across the motion classes. Fig. 13 shows these similarities in the variation of the accuracies. For example, in both of these models, M72 shows a decreasing trend, whereas M74 shows an increasing trend. Between SVM and CNN Non-MHI, this common pattern of accuracy variation is missing, as can be seen in Fig. 13.
While comparing the overall accuracy, SVM performs slightly better (≈1.5%) than the CNN Models. However, in CNN MHI, besides M72, M74 and M75, accuracy of the rest of the motions is either 100% or close to 100%. Now, looking at the data and accuracy trade-off in SVM (Fig. 14), we find that except the motions M165 and M167 the rest of the motions follow the same trend as CNN models (Fig. 12).

VI. CONCLUSIONS
We propose a method to recognize the Bharatanatyam motions based on their pattern that varies across frames rather than using the joint velocities as a feature [5], [6]. Again, the works presented in [5], [6] do not consider most of the Adavu variations in their analysis. The current work addresses those issues. So, the paper contributes substantially to the state of the art in recognizing the motions in Indian Classical Dance.
We use two classifiers, CNN and SVM. The paper uses HoGMHI, MHI, and raw frames to build the SVM, CNN MHI, and CNN Non-MHI models, respectively. The overall accuracy of all these classifiers is above 90%. However, SVM and CNN MHI perform better than CNN non-MHI. Between SVM and CNN MHI, SVM's accuracy is slightly better (≈ 1.5%). However, in CNN MHI, most motions achieve 100% or close to 100% accuracy. The paper explores the impact of the feature on the classifiers and highlights the trade-off between data and accuracy. However, the models' accuracy and the impact of the features are yet to be tested with a more extensive dataset.
This work handles the uneven number of frames associated with each motion class by generating features of the same length, which can then be applied to the machine learning models. The proposed work designs a simple yet very effective 2D CNN model capable of dealing with a small dataset. We take care to overcome the overfitting issue. Again, instead of providing the raw data to train the CNN model, we use a manually extracted feature (MHI) in the CNN, which performs well. Moreover, the MHI model is also faster than the non-MHI model, as shown by comparing the execution times.
Future work can use skeletal coordinates to generate the trajectory as a feature for motion classification.
Citation information: DOI 10.1109/ACCESS.2022.3184735