Video Activity Recognition With Varying Rhythms

Recognizing normal and anomalous events in long and complex videos with multiple sub-activities has received considerable attention in recent years. This task is more challenging than traditional action recognition in short and relatively homogeneous video clips. Beyond the difficulty of recognizing activities in long videos, another challenge is varying activity rhythms. The rhythm of sub-actions in an activity can differ in nature, which poses additional challenges that affect the performance of activity recognition methods. In this article, five video activity recognition methods were evaluated using two publicly available video datasets, Breakfast and VIRAT, which consist of long and complex videos. Extensive experiments and analyses showed that, among these methods, VideoGraph performed distinctly better than the others and maintained high accuracy even when the test videos were exposed to severe rhythm changes. The results indicated that VideoGraph is less sensitive to varying rhythms than the other investigated methods. By changing some of the architecture parameters, we also observed performance improvements in VideoGraph.


I. INTRODUCTION
There is an emerging interest in automating human activity recognition using intelligent systems. This growing field has a wide range of applications such as human-computer interaction and identity detection [1], [2], surveillance and home monitoring [3]-[5], healthcare [6], [7], elderly care [8], [9], traffic monitoring [10], and video summarization [11], [12]. Color (RGB) videos captured by cameras are among the most easily acquired input data for activity recognition. Recognizing activities in videos has thus received significant attention in recent years. Work in this emerging field mostly consists of recognizing human actions using datasets such as UCF101 [13], KTH [14], HMDB51 [15], and Kinetics [16]. These datasets consist of relatively short and homogeneous video clips, which are generally well segmented and contain only one action event in which human actions take a few seconds to unfold [17].
As an example, in [18], the authors used the UCF101 and HMDB51 datasets to demonstrate their two-stream 3-D-convNet fusion pipeline, which can recognize human actions in videos of arbitrary size and length using multiple features. In [19], the UCF101 and HMDB51 datasets were used and a saliency-aware three-dimensional (3-D) CNN with LSTM was introduced for video action recognition. However, it is highly likely that some of these methods, which rely on datasets that consist only of short, homogeneous video clips, could face challenges when recognizing normal and anomalous events in datasets that consist of long and complex videos with multiple sub-actions, such as Breakfast [20] and VIRAT [21].
The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.
Graph-based methods have also been used for video activity recognition. In [22], the authors proposed a semisupervised annotation approach by learning an optimized graph from multi-cues (i.e., partial tags and multiple features). Other graph-based methods utilize sub-action level annotations for human activity recognition in long and complex video datasets [23]-[30]. However, finding datasets with sub-action annotations is not easy, and relying on such annotations is not very practical. Beyond the difficulty of recognizing activities in long videos, another essential challenge is varying activity rhythms. The rhythm of sub-actions in an activity can differ in nature. As an example, considering the "getting into a car" activity in the VIRAT dataset, one can open the door and get in the car immediately, or open the door and then take some time before getting in the car. Even though these two sets of actions are both categorized with the same label, their temporal rhythms differ considerably. Varying rhythm of actions in real videos may arise from at least two sources. First, the rhythm of sub-activities in an event can differ in nature, such as the different rhythms of getting in a car. Second, the rhythm issue may occur due to non-uniform or different sampling rates between the training and testing stages of the applied recognition method. Ignoring varying rhythms may seriously affect the activity recognition performance. It is quite likely that an event recognition algorithm may fail to accurately classify the activity when trained with one rhythm but tested with another rhythm.
The objective of this article is to investigate the performance of video recognition methods which do not use any sub-action level annotations for long duration and complex videos that are captured with stationary cameras and also to examine the recognition sensitivity of these methods to varying rhythms. Five video activity recognition methods were evaluated using the RGB color videos of two challenging public domain video datasets. These are Breakfast [20] and VIRAT 2.0 dataset [21], which are prepared by Brown University and DARPA, respectively. To simulate varying rhythms in these videos, we manipulated the original test videos in these datasets in three different ways and examined the sensitivity of the trained models with these methods (which were trained using the original rhythm videos in the training set) on the manipulated varying rhythm test videos.
Two of the investigated video activity recognition methods are Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM), for which the source code was obtained from [31], and Long Term Recurrent Convolutional Networks (LRCN) [32]. These two methods are considered benchmark methods. The third method is the CNN-IndRNN method [33], or IndRNN for short, which consists of a two-stage, end-to-end framework and is inspired in part by how humans identify events with varying rhythms. In the first stage, the most significant frames are selected, while the second stage recognizes the event using the selected frames. The fourth method is called CNN-SkipRNN+ [33], or SkipRNN+ for short, which uses the same framework as IndRNN. However, SkipRNN+ has an advantage over IndRNN in that it alleviates the gradient vanishing problem that occurs because of the many RNN (Recurrent Neural Network) layers used in the frame selection phase of the framework. VideoGraph [34] is the fifth and last method. VideoGraph is a graph-based method in which the graph nodes are fully inferred from data, and it is extensible to datasets without node-level annotations. Like SkipRNN+ and the other investigated methods, it does not need sub-action level annotations to train a model. VideoGraph learns an undirected graph from the video dataset. The nodes in the formed graph represent the key latent concepts (the so-called sub-actions) that the human activity is composed of. The edges in the graph represent the temporal relationships between the latent concepts. VideoGraph is noted to model human activities in videos of up to thirty minutes [34]. It learns not only the graph nodes, without any need for node-level annotation, but also the relationships between graph nodes.
The temporal structure of long-range human activities is represented via the constructed graph, which is another interesting attribute of VideoGraph that can be utilized for visualization and video understanding. IndRNN, SkipRNN+ and VideoGraph are included in this work since these three methods were used with long and complex videos in past works [33], [34].
In our experiments, the recognition results of VideoGraph were superior to those of the other investigated methods, reaching close to 60% on the Breakfast dataset (Split-4), 92% on the Breakfast 3-grouped class dataset, 92.5% on the VIRAT 4-event dataset, and over 62% on the VIRAT 6-event dataset. Among the five investigated methods, the varying rhythm sensitivity analysis was conducted for the IndRNN, SkipRNN+ and VideoGraph methods. The two conventional methods, CNN-LSTM and LRCN, were applied to the original rhythm (R0) videos only. Since these two conventional methods had relatively lower recognition performance in the original rhythm (R0) case on the investigated datasets, no further investigation was conducted for the three varying rhythms. The sensitivity results indicated that VideoGraph maintained its high recognition accuracy with varying rhythms. Additional investigations with VideoGraph on the Breakfast dataset, in which some of the design parameters in its architecture were varied, also showed slight performance improvements. Beyond the superior recognition results, VideoGraph's representation of activities via constructed graphs is demonstrated to bring significant value to the overall video understanding and activity recognition analyses.
The most significant novelty of this article is a comprehensive evaluation of five video recognition algorithms with respect to their sensitivity to varying rhythms when long and complex videos are used. We believe that, in the evaluation of activity recognition methods, robustness to varying rhythms is an important measure that needs to be taken into account. The contributions of this article are as follows:
• We provided a comprehensive evaluation of five video activity recognition methods using two highly challenging activity recognition datasets with long and complex videos.
• We assessed the sensitivity of three of these methods to varying rhythms.
• We demonstrated that if similar activities are grouped in the Breakfast dataset, the recognition performance can be improved for the grouped activity classes.
• We showed that by varying some of VideoGraph's design parameters, some performance improvements can be observed.
Our paper is organized as follows. Section 2 provides technical information about the investigated video activity recognition methods and the datasets used in our experiments. Section 3 contains the performance evaluations for the original videos and the sensitivity to varying rhythm results. Section 4 contains some discussions about the results. Finally, Section 5 concludes the paper with some remarks.

A. DATASETS
In the analyses conducted with the five video activity recognition methods, we used the RGB color images of the Breakfast and VIRAT datasets. Information about these datasets and the data subsets formed from them is provided in the following.

1) BREAKFAST DATASET
The Breakfast dataset [20] was assembled by Serre Lab of Brown University. The videos in this dataset capture participants preparing breakfast food in many different kitchens at varying camera angles. There are 52 participants where each participant is denoted by P. Each participant was filmed in one of 18 different kitchens and with up to five different cameras from different angles and lighting conditions. The videos from these cameras film up to 10 different activities including making coffee, pouring orange juice, making chocolate milk, making tea, preparing a bowl of cereal, frying eggs, cooking pancakes, preparing a fruit salad, making a sandwich, and cooking scrambled eggs. Each video in the dataset is down sampled to 320 × 240 with a frame rate of 15 fps. This dataset was designed to be challenging in that it captured real world conditions with diverse range of lighting and environment. Table 1 shows the Breakfast dataset main events and the number of videos for each event. Fig. 1 shows sample image frames from the Breakfast dataset.  There are four different splits in the Breakfast dataset for forming the training and testing datasets [20]. Table 2 shows the distributions of the videos which belong to the 52 participants (P) (P03-P54) in these four splits.
In the Breakfast dataset investigations, we considered Split-4. One interesting observation from the resultant confusion matrices was that the breakfast activities that are similar to each other like {coffee} and {milk}, or {friedegg} and {scrambled egg} were considerably confused among each other by the classifiers. We considered grouping these 10 breakfast activities into three major classes and formed a three-class version of the breakfast dataset. In addition to using the original 10-event Breakfast dataset, we also used this three-class Breakfast dataset version, and trained models with the five activity recognition methods to examine the recognition accuracy after grouping of similar activities. Among the three groups, the set of five activities, {coffee, milk, tea, juice, cereals} forms the first group. The second group consists of {friedegg, pancake, scrambledegg}. Finally, the third group consists of {salad, sandwich}. Table 3 shows the number of events in the three-group Breakfast dataset. In both Breakfast datasets (10-class and 3-grouped class) we used 65% of videos for training and 35% of videos for testing.
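The grouping of the 10 breakfast activities into three classes can be sketched as a simple label mapping. This is a minimal illustration only: the activity names follow the sets listed above, while the group names (`drinks_and_cereals`, `cooked_dishes`, `cold_dishes`) and the helper `to_group` are hypothetical and are not taken from the dataset or from the evaluated code.

```python
# Hypothetical group names; the member activities follow the three sets in the text.
BREAKFAST_GROUPS = {
    "drinks_and_cereals": ["coffee", "milk", "tea", "juice", "cereals"],
    "cooked_dishes": ["friedegg", "pancake", "scrambledegg"],
    "cold_dishes": ["salad", "sandwich"],
}

# Invert the mapping once so that each activity label points to its group.
ACTIVITY_TO_GROUP = {
    activity: group
    for group, activities in BREAKFAST_GROUPS.items()
    for activity in activities
}


def to_group(activity_label):
    """Map one of the 10 original activity labels to its 3-class group label."""
    return ACTIVITY_TO_GROUP[activity_label]


if __name__ == "__main__":
    for label in ["coffee", "friedegg", "sandwich"]:
        print(label, "->", to_group(label))
```

With such a mapping, confusions between similar activities (for example, coffee and milk) fall inside a single group and no longer count as errors in the three-class setting.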

2) VIRAT DATASET
The VIRAT 2.0 dataset [21] is a publicly available video dataset supported by DARPA. The videos in this dataset consist of surveillance footage capturing public areas such as parking lots and college campuses. The VIRAT 2.0 dataset consists of high-definition videos, and the original size of the image frames in these videos is 1920 × 1080. Each video contains multiple activities with accompanying labels and bounding boxes. The classified activities in this dataset include: Loading an object, Unloading an object, Opening trunk, Closing trunk, Getting into vehicle, Getting out of a vehicle, Person gesturing, Person carrying an object, Person running, Person entering facility, and Person exiting a facility. Table 4 shows these events and the number of videos for each event. Some of these events, such as a person loading an object into a vehicle, have very few videos, indicating a data imbalance problem that poses challenges to the applied activity recognition methods. A few image frames for the first six events in the VIRAT dataset can be seen in Fig. 2.
We did not use all 13 events of the VIRAT dataset in this work and instead used subsets of it. The reason is that the number of videos per event varies significantly in the VIRAT dataset, and some events do not have enough videos for effective model training, as can be seen from Table 4. Because including all 13 events would have introduced a significant data imbalance problem, we formed two smaller subsets of the original VIRAT dataset for our investigations. The first subset contains four events with a similar number of videos for each included event.
The four classes in this four-event subset can be seen in Table 5. In the VIRAT dataset annotation files, for all the videos, event ids, event types, start and end frames of the events are provided together with the bounding box locations of the event within these annotation files. Videos in the VIRAT 2.0 database are cropped with respect to the event annotation files. Using the start and end frames in the event annotation files for the four events of interest, these image frames are considered as videos and used for activity recognition. For each of the four VIRAT events, 10 videos are randomly selected for validation purposes while the remaining videos for that event are used for training a model. That is, in the formed subset, there are 40 videos in the validation dataset (10 videos for each of the four events) and there are 457 videos for the four events in the training set (total 497 videos for four events). The high-resolution video image frames are cropped with respect to the bounding box regions.
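The per-event hold-out described above (10 randomly selected validation videos per event, the rest for training) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `split_per_event`, the fixed seed, and the per-event video counts in the usage example are hypothetical placeholders chosen only so that the totals match the 497 videos mentioned in the text; they are not the actual counts of Table 5.

```python
import random


def split_per_event(videos_by_event, n_val=10, seed=0):
    """Hold out n_val randomly chosen videos per event; the rest go to training."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable splits
    train, val = [], []
    for event, videos in videos_by_event.items():
        shuffled = list(videos)
        rng.shuffle(shuffled)
        val.extend((event, v) for v in shuffled[:n_val])
        train.extend((event, v) for v in shuffled[n_val:])
    return train, val


if __name__ == "__main__":
    # Placeholder counts (hypothetical) that sum to 497 videos over four events.
    counts = {"event_a": 120, "event_b": 125, "event_c": 126, "event_d": 126}
    videos = {e: [f"{e}_{i:03d}" for i in range(n)] for e, n in counts.items()}
    train_set, val_set = split_per_event(videos)
    print(len(train_set), len(val_set))
```

Selecting the same number of validation videos per event keeps the validation set balanced even when the training counts differ slightly between events.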
The second VIRAT subset used in the investigations consists of six events that cover all six human-vehicle interactions, as can be seen in Table 6. This 6-event VIRAT data subset is more challenging than the 4-event subset because it is imbalanced and some activities (such as Event-1, Event-3 and Event-4) do not have enough videos. This poses additional challenges for the video activity recognition methods. A random 90% (train) / 10% (test) split is used with this data subset in our investigations.

1) CNN-LSTM
The CNN-LSTM approach [33] first extracts features from the image frames of the video with a Convolutional Neural Network (CNN) and forms feature sequences. These feature sequences are passed to a separate LSTM, which is a type of Recurrent Neural Network (RNN) with some additional units [35].

2) LRCN
LRCN makes use of a pretrained CNN in conjunction with an LSTM unit [32]. During the training of an LRCN model, each training frame in a video is individually passed through the CNN, which produces a feature vector. These features are then passed on to the LSTM unit. A prediction is generated from the LSTM unit, and its state is passed to the LSTM unit at the next frame until all frames in that video are processed. The predictions across all frames are averaged to obtain the final prediction for that particular video.
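The final averaging step can be illustrated with a short sketch. It covers only the aggregation of per-frame predictions, not the CNN or the LSTM; the function name `average_frame_predictions` and the toy probabilities are hypothetical and serve illustration only.

```python
def average_frame_predictions(frame_probs):
    """Average per-frame class probabilities and return (mean_probs, predicted_class).

    frame_probs: list of per-frame probability lists, one list per frame,
    all with the same number of classes.
    """
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    mean_probs = [
        sum(frame[c] for frame in frame_probs) / n_frames for c in range(n_classes)
    ]
    # The video-level label is the class with the highest averaged probability.
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    return mean_probs, predicted


if __name__ == "__main__":
    # Toy per-frame outputs for a 3-frame video and 3 classes (hypothetical values).
    probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]]
    print(average_frame_predictions(probs))
```

Averaging gives every frame equal weight, so a few confidently misclassified frames can be outvoted by the remaining frames of the video.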

3) CNN-INDEPENDENT RNN (INDRNN)
IndRNN [33] is inspired in part by how humans identify events with varying rhythms by quickly catching the frames that contribute most to a specific event. The CNN part consists of a VGG16 network and is used to extract visual features per frame. The RNN part consists of two layers. The most significant frames are selected in the first RNN layer through a regularization term that is included when computing the final loss of the model. The second RNN layer recognizes the event using the selected frames, and a cross-entropy based loss is utilized in the recognition part. The final loss of the model is the sum of the cross-entropy loss and the regularization term, where the latter is weighted by a control parameter. For the classification RNN, Gated Recurrent Units (GRU) [36] are used. In this framework, only activity-level labels are needed in the training stage, with no need for sub-action labels.
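The combined objective described above can be written compactly. The symbols are our own notation, introduced here only for illustration: the cross-entropy loss of the recognition layer, the regularization term of the frame selection layer, and a scalar weight λ that controls the contribution of the regularization term. The exact form of the regularization term is given in [33] and is not reproduced here.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \, \mathcal{L}_{\text{reg}}
```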

4) CNN-SKIP RNN (SKIPRNN+)
The details of SkipRNN+ can be found in [33]. In the IndRNN method, because the input dimension to the IndRNN layer is high (4096), the output values in the stacked IndRNN layers increase by orders of magnitude, resulting in the gradient vanishing problem [33]. To mitigate this problem, the SkipRNN+ method uses an improved IndRNN structure that skips state updates to shorten the computation. This idea is originally inspired by [37], which implements the skip operation on a conventional RNN. Unlike [16], the SkipRNN+ structure uses the Hadamard product [38] when computing the gate value. In this way, the gradient of SkipRNN+ depends on the weight value instead of the weight matrix product, which alleviates the gradient vanishing problem. An illustration of SkipRNN+'s architecture is shown in Fig. 3.

5) VIDEOGRAPH
Graph methods, which learn structured representations from videos, have been investigated for human activity recognition in the past [23]-[25]. Even though these graph-based methods learn structured representations from videos, they require the graph nodes and/or edges to be known in advance, which limits their practical use since they cannot be applied when node or frame-level annotations are not available. In contrast, VideoGraph [34] is a graph-based method in which the graph nodes are fully inferred from data, and it is extensible to datasets without node-level annotations. The block diagram of VideoGraph can be seen in Fig. 4. The video is first sampled into T segments, and each segment, s_i, contains 8 consecutive frames. Using the Two-Stream Inflated 3D ConvNet (I3D), which is a 3D CNN model [39], features are extracted from each s_i and denoted by x_i. The N nodes of an undirected graph correspond to key unit actions in the video, whereas the edges of the graph provide the temporal relationships between these N nodes. The node attention block in VideoGraph learns the latent concept representation. To initialize these latent features, the feature maps of the last convolutional layer of the I3D backbone are clustered and the resultant centroids are used. The graph embedding layer learns the graph edges and finalizes the graph structure. VideoGraph extracts two types of relationships and represents them via graph edges: timewise edges, which indicate how the nodes transition over time, and nodewise edges, which provide information about the relationships between nodes. The activation output of the first graph embedding layer is used to construct the final graph. Of the two graph embedding layers in VideoGraph, the second is used for activity prediction. Following a set of pooling operations over both time and nodes on the output of the second graph embedding layer, the resultant output feature is fed forward to a classifier to arrive at the activity prediction of the video.
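The segment sampling step (T segments of 8 consecutive frames each) can be sketched as below. The text does not state how segment positions are chosen, so evenly spaced segment start positions are an assumption made here for illustration; the function name `segment_indices` is likewise hypothetical and does not come from the VideoGraph code.

```python
def segment_indices(num_frames, num_segments=64, frames_per_segment=8):
    """Return frame indices for num_segments segments of consecutive frames.

    Segment start positions are spread evenly over the video (an assumption).
    Segments may overlap when the video is short.
    """
    if num_frames < frames_per_segment:
        raise ValueError("video is shorter than one segment")
    last_start = num_frames - frames_per_segment
    if num_segments == 1:
        starts = [0]
    else:
        starts = [
            round(i * last_start / (num_segments - 1)) for i in range(num_segments)
        ]
    return [list(range(s, s + frames_per_segment)) for s in starts]


if __name__ == "__main__":
    segments = segment_indices(num_frames=1000)
    print(len(segments), segments[0], segments[-1])
```

Under this sketch, the default setting of 64 segments with 8 frames each passes 512 frames to the feature extractor regardless of the video length.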

III. RESULTS
In addition to applying the investigated methods to the videos with the original rhythm (R0), we also examined the impact of varying rhythm via three other rhythms (R1, R2 and R3) [33]. In the original rhythm (R0), the testing video sequences have the same sampling rate as the training inputs. The other three varying rhythm scenarios are designed with different sampling rates. To prepare the three varying rhythms, the frames of each testing video are first divided into three equal intervals, and different sampling rates are applied to each interval to form a new testing sequence. To generate the first rhythm (R1), the first and third intervals are subsampled every two and every five frames, respectively, to make those two intervals sparser, while the rhythm of the middle interval is kept intact. The testing inputs of the second rhythm (R2) are similar to R1, except that the first and third intervals are subsampled every five and every two frames, respectively; that is, R2 is the reverse of R1. For the last rhythm (R3), half the length of the testing video is randomly sampled. All five methods were applied to the original rhythm (R0) videos, whereas the varying rhythm sensitivity investigation was conducted only for three methods: IndRNN, SkipRNN+ and VideoGraph. This is because the other two methods, CNN-LSTM and LRCN, had relatively lower overall recognition performance in the original rhythm (R0) case, so no further investigation was considered for the three varying rhythms.
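The three rhythm manipulations can be sketched as index selections over the frames of a test video. This is a minimal sketch under stated assumptions: "subsampled every k frames" is read as keeping every k-th frame, R3 is read as keeping a random half of the frames in their original temporal order, and the function names and the fixed seed are hypothetical.

```python
import random


def _three_intervals(num_frames):
    """Split frame indices into three (nearly) equal consecutive intervals."""
    a = num_frames // 3
    b = 2 * num_frames // 3
    return range(0, a), range(a, b), range(b, num_frames)


def rhythm_r1(num_frames):
    """R1: keep every 2nd frame in the first interval, every 5th in the third."""
    first, middle, last = _three_intervals(num_frames)
    return list(first[::2]) + list(middle) + list(last[::5])


def rhythm_r2(num_frames):
    """R2: the reverse of R1 (every 5th frame first, every 2nd frame last)."""
    first, middle, last = _three_intervals(num_frames)
    return list(first[::5]) + list(middle) + list(last[::2])


def rhythm_r3(num_frames, seed=0):
    """R3: keep a random half of the frames, preserving temporal order."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return sorted(rng.sample(range(num_frames), num_frames // 2))


if __name__ == "__main__":
    print(rhythm_r1(30))
    print(rhythm_r2(30))
    print(rhythm_r3(30))
```

In all three cases the middle or retained frames stay in temporal order, so only the pace of the activity changes and not the order of its sub-actions.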
For performance comparison of the video activity recognition methods, we used the overall accuracy (OA) and Kappa metric [40] measures. Other than these, confusion matrices are also generated to examine which of the activities are generally confused with each other.
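Both measures can be computed directly from a confusion matrix, as sketched below. The sketch assumes that the Kappa metric refers to Cohen's kappa, with rows as true classes and columns as predicted classes; the function names and the toy matrix are hypothetical and are not results from this article.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement beyond what is expected by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement from the marginal distributions of true and predicted labels.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total ** 2)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    toy_cm = [[8, 2], [1, 9]]  # toy 2-class confusion matrix (hypothetical)
    print(overall_accuracy(toy_cm), cohen_kappa(toy_cm))
```

Kappa is useful alongside OA on imbalanced subsets because it discounts the agreement a classifier would reach simply by favoring the majority classes.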
In the following, for each dataset and their subsets, we first provide a table that shows the overall accuracy (OA) and Kappa values for the five methods with four rhythms. A bar plot showing the overall accuracies of these methods with four rhythms is provided next. The resultant confusion matrices that belong to the highest overall accuracy for each dataset are also included. The constructed graphs with VideoGraph for the activities in the Breakfast-10 event, VIRAT 4-event and VIRAT 6-event datasets are presented with some brief discussion as well.

A. BREAKFAST 10-EVENT RESULTS
Table 7 shows the 10-event Breakfast dataset results (Split-4) for the original rhythm (R0) and three different rhythms (R1, R2, R3) with five activity recognition methods. For VideoGraph, we used the default '64 segments/8 frames' parameter setting. Figure 5 shows the overall accuracy values for the five methods in a bar plot. From these results, it can be seen that VideoGraph significantly outperforms all other methods, and the performance gap between VideoGraph and the next best method is quite wide. VideoGraph also performs well with varying rhythms: it maintains its original rhythm recognition performance for the varying rhythms, and the variation in its overall accuracy is smaller than that of the other methods. The confusion matrix of the best performing case of VideoGraph is shown in Table 8. From the confusion matrix, it can be observed that breakfast events similar to each other, like {cereals} and {milk}, or {fried egg} and {scrambled egg}, were confused with each other. Figure 6 shows the graphs constructed with VideoGraph for three of the 10 events in the Breakfast dataset. In each constructed graph for an event, the nodes correspond to the latent concepts learned by VideoGraph's graph-attention block. A large node indicates that the corresponding latent concept is dominant. The edges in the graph emphasize the relationships between the latent concepts represented as nodes. It can be noticed that the node sizes and edge formations are similar in the fried egg and scrambled egg events, whereas the graphs of these two events are quite different from the graphs of cereals and milk. Yet, the graphs of cereals and milk also show similarities to each other. We can see more confusions among events when their graphs are similar to each other. The graph representations in VideoGraph can thus add significant value to the recognition and video interpretation analyses.
Table 9 corresponds to the three-class Breakfast dataset results (Split-4) for the original rhythm (R0) and three different rhythms (R1, R2, R3). The default setting of '64 segments/8 frames' is used in VideoGraph. A similar performance trend is observed, and VideoGraph performs significantly better, reaching an overall accuracy of ∼92% in the original rhythm (R0). We also included the confusion matrix for VideoGraph with the original rhythm (R0) in Table 10. Recognition with VideoGraph is also found to be extremely good for the three varying rhythms. Although this can be considered an imbalanced dataset, VideoGraph's overall accuracy of ∼92% is quite significant.
VOLUME 8, 2020
Table 11 corresponds to the VIRAT 4-event dataset results. The default parameter setting (64 segments/8 frames) is used in VideoGraph. Similarly, VideoGraph outperforms the other methods, reaching an overall accuracy of 92.5% in the original rhythm. The corresponding confusion matrix for the best VideoGraph case is shown in Table 12. Recognition with VideoGraph is also found to be considerably good for the three varying rhythms. The graphs constructed with VideoGraph for the VIRAT 4-event dataset can be seen in Figure 9. From Figure 9, it is interesting to observe that the nodes in the graphs for the 'Getting in vehicle' and 'Getting out vehicle' events differ significantly from the graphs of the two other events, which are 'Getting in facility' and 'Getting out facility'. That is, the differences between the graphs of human-car and human-facility interaction events can be clearly observed.

D. VIRAT 6-EVENT RESULTS
Figure 10 and Table 13 correspond to the six-event VIRAT dataset results. The default parameter setting of '64 segments/8 frames' is used for VideoGraph. This dataset is not only highly imbalanced but also contains a very small number of videos for some of the events. From the results, we can see that VideoGraph performs better than the others, especially in the original rhythm, and reaches an overall accuracy of ∼63%. The confusion matrix of the best performing case (VideoGraph) is shown in Table 14. However, the performance gap between VideoGraph and the other methods is not as wide as that observed in the former three datasets. The imbalance of the dataset and the insufficient number of videos for some of the activities could be contributing to this result. In any event, VideoGraph still performs considerably better overall than the others, especially in the original rhythm. Figure 11 shows the constructed graphs for the six events in the VIRAT 6-event dataset.

E. CHANGING SEGMENT AND FRAME NUMBER PARAMETERS IN VIDEOGRAPH
For VideoGraph, in addition to the default '64 segments/8 frames' parameter setting, two other segment/frame combinations are considered. This investigation was conducted using the Breakfast 10-event dataset. Table 15 shows the resultant performance metrics for the three parameter combinations of VideoGraph, including the default setting of '64 segments/8 frames'. It can be noticed that with '16 segments/32 frames', a relatively higher recognition accuracy is achieved in the original rhythm and also in two of the three simulated varying rhythms. Table 16 shows the confusion matrix for the '16 segments/32 frames' case, which provided the highest overall accuracy in the original rhythm.

F. COMPUTATION TIME COMPARISON
The computation time comparison of the five investigated methods using the Split-4 of the Breakfast 10-event dataset can be seen in Table 17. The comparisons are with respect to feature extraction time, training time and test times. The computer platforms used for retrieving these times are also provided.

IV. DISCUSSIONS
The investigations with the Breakfast and VIRAT datasets, which contain long and complex videos, clearly showed that among the five investigated activity recognition methods, VideoGraph performs significantly better than the others. Especially in the 10-event Breakfast dataset, VideoGraph's classification performance is distinctly better than that of SkipRNN+ (VideoGraph: 59.21% vs SkipRNN+: 24.8%). Similarly, in the VIRAT 4-event dataset, the performance gap between VideoGraph and SkipRNN+ is quite wide (VideoGraph: 92.5% vs SkipRNN+: 55.0%). The same performance trend can also be observed in the other two datasets. VideoGraph is also found to be less sensitive to varying rhythms because, for all three varying rhythms, it provided accuracy values close to the accuracy obtained with the original rhythm. One other analysis with VideoGraph on the 10-event Breakfast dataset examined the recognition performance when the segment and frame number parameters are varied. We observed that some parameter combinations provided better results than VideoGraph's default parameter setting. This suggests that there could be room to further improve the accuracy by varying these parameters and other parameters, such as the kernel sizes used in the graph embedding layer of VideoGraph's architecture. The graphs constructed with VideoGraph demonstrated the potential to add significant value to the overall video understanding and activity recognition analyses, which could be further exploited.
The results for the Breakfast-3-grouped class dataset also provided some potential future investigation ideas with VideoGraph and other classifiers in the sense that if a set of additional classifiers trained specifically for the activities within each of the three groups are applied, this second layer of classifiers could perhaps further boost VideoGraph's performance for the 10-event case via a potential two-step activity recognition framework (three-grouped class classification followed by individual classifications for each group). Another future investigation idea is to examine VideoGraph's recognition performance on video datasets that consist of videos with varying image resolutions and various actors in the scene that are captured with moving cameras such as the UCF-Crime dataset [41].

V. CONCLUSION
Robustness to varying rhythms can be a discerning measure when comparing the performance of activity recognition methods, since the rhythm of sub-actions in an activity can differ in nature and pose challenges for these methods because not all rhythm variations can be included in the training dataset for model learning. This article presented comprehensive investigations of five video activity recognition methods with two datasets that consist of long and complex videos, in consideration of varying rhythms. The results showed that, among them, VideoGraph performs significantly better than the others and is less sensitive to varying rhythms, since it provided accuracy values for varying rhythms close to the accuracy values observed with the original rhythm. The performance improvements observed after varying some of VideoGraph's parameters also indicated that there could be more room for improvement in VideoGraph by searching for optimal hyperparameters.
BULENT AYHAN (Member, IEEE) received the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, in 2006. He is currently working as a Principal Research Engineer with Applied Research LLC. He has more than 100 journal and conference papers in prestigious journals and conferences. His research interests include machine learning, deep learning, artificial intelligence, signal processing, image processing, computer vision, pattern recognition, remote sensing, and condition monitoring.