Video Processing using Deep learning Techniques: A Systematic Literature Review

Deep learning research has advanced considerably for data types such as images, speech, and text, and video processing is now an emerging field of computer vision. Several surveys cover video processing using deep learning techniques, but each targets a specific functionality such as anomaly detection, crowd analysis, or activity monitoring; a combined study is still missing. This paper presents a Systematic Literature Review (SLR) on video processing using deep learning that investigates the applications, functionalities, techniques, datasets, issues, and challenges by formulating relevant research questions (RQs). This systematic mapping includes 93 research articles from reputed databases published between 2011 and 2020. We categorize the deep learning techniques for video processing as CNN, DNN, and RNN based. We observe significant advancements in video processing between 2017 and 2020, primarily due to the advent of AlexNet, ResNet, and LSTM based deep learning techniques. The prominent fields of video processing research are human action recognition, crowd anomaly detection, and behavior analysis. This SLR is a helpful guide for researchers exploring the recent literature, available datasets, and existing deep learning techniques for video processing.


I. INTRODUCTION
Many deep learning (DL) research works have shown successful results, primarily focusing on three data types: images, speech, and text. In addition, DL has been successfully applied to communication signals and packets, e.g., [1], [2]. Widely used applications in these data domains are image classification, speech recognition, regression problems, pattern recognition, and text sentiment classification. Apart from these, one more fascinating data modality is video, which is interesting for research because of its large size and dimensionality. Millions of videos are uploaded to YouTube every day; it has thus become a rich repository that empowers artificial intelligence (AI) research. However, despite its richness, video data is challenging to analyze and process because of its large file sizes and complexity. Research on video processing using AI gained popularity after many AI algorithms were developed for image processing for various applications, particularly in the past ten years.
Video data is one of the popular choices of users of platforms such as Twitter, YouTube, and Facebook, and it is also the fastest-growing data type nowadays.

FIGURE 1. A concept of video data
We first clarify the exact meaning of video data in the computer-vision research field, as considered in our study. Video data (or digital video data) is any sequence of time-varying images. In video data, the picture information is digitized both spatially and temporally, and the resulting pixel intensities are quantized; the temporal sampling is described as a number of frames per second. Figure 1 depicts the concept of video data. Most fundamental computer vision research today focuses on images and less on sequences of images, i.e., video frames. However, video data provides deeper situational understanding because a series of images gives more information about the subject. For example, we can track an object through the optical flow of the image sequence and predict its next action [3].
Having defined video data, we now state the interpretation of video processing considered in our study. In the context of computer vision, video processing or digital video processing is the ability to automatically analyze video, frame by frame, to detect and determine temporal and spatial features.
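As a minimal illustration of the definition above, a digital video can be held as a four-dimensional array of quantized pixel intensities indexed by time, height, width, and color channel. The sketch below uses NumPy with a synthetic clip; the frame rate, resolution, and variable names are illustrative assumptions and are not taken from any of the reviewed studies.

```python
import numpy as np

# Assumed, illustrative clip parameters (not from any reviewed dataset).
FPS = 25             # frames per second (temporal sampling rate)
NUM_FRAMES = 50      # number of frames in the clip
HEIGHT, WIDTH = 48, 64

rng = np.random.default_rng(0)
# Continuous intensities in [0, 1), standing in for the unquantized signal.
analog = rng.random((NUM_FRAMES, HEIGHT, WIDTH, 3))

# Quantize to 8 bits: each pixel intensity becomes an integer in 0..255.
video = np.floor(analog * 256).astype(np.uint8)

duration_seconds = NUM_FRAMES / FPS     # temporal extent of the clip
first_frame = video[0]                  # one spatial sample (a single image)
pixel_over_time = video[:, 10, 20, 0]   # one pixel's first channel across time
```

Indexing along the first axis yields individual frames, while fixing a pixel position and slicing over time yields the temporal signal that frame-by-frame analysis operates on.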

A. LITERATURE SURVEY
While much research has been carried out on video processing [4], very few studies systematically analyze video processing using deep learning techniques. Instead, most surveys target only a specific functionality. A study conducted by Nayak et al. [5] shows the advancement in video anomaly detection using deep learning techniques. The authors present various deep learning techniques for video processing to detect anomalies such as abnormal activities (fighting, riots, traffic rule violations, stampedes) and strange entities (weapons, abandoned luggage, etc.). In another survey [6], researchers reported deep learning methods for abnormal human activity recognition from video. Borja-Borja et al. [7] surveyed state-of-the-art deep learning methods for video processing to list group and crowd activities. The main deep learning techniques are grouped into Convolutional Neural Network (CNN), Autoencoders (AEs), and Recurrent Neural Network (RNN). Another survey on anomaly detection from video data [8] focuses on deep learning approaches and lists generative adversarial networks (GANs) along with the other deep learning approaches mentioned in [5]. A significant application of video processing in computer-vision research is pedestrian detection. Brunetti et al. [9] present a review of deep learning video processing methods for pedestrian detection, focusing on CNN, Deep Neural Network (DNN), Restricted Boltzmann Machine (RBM), and Gaussian Mixture Model. In a survey by Ciaparrone et al. [10], the authors report deep learning methods for Multiple Object Tracking (MOT) from video data. They explored the Faster R-CNN, Mask R-CNN, and SSD methods of deep learning for multi-object tracking. Apart from that, the authors also listed the YOLO series of detectors (YOLOv2). Yan et al. [11] reported a review on deep multi-view learning from videos focusing on representational deep learning methods such as convolutional neural networks, deep belief networks, and multi-view auto-encoders.
Taskiran et al. [12] present a taxonomy that divides face recognition into image-based and video-based methods. For video-based face recognition, various recent deep learning methods are discussed and grouped into set-based and sequence-based methods. The authors of [13] present a review of video scene parsing, an application of video processing using deep learning techniques. They highlight the 2D CNN, 3D CNN, Clockwork FCN, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Spatiotemporal transformer GRU (STGRU), and GAN methods of deep learning. In [14], Wang et al. surveyed salient object detection from video data using deep-learning-based methods. They mentioned methods specially designed for salient object detection such as the fully convolutional network, spatio-temporal cascade neural network, attentive feedback network, etc.
Tong et al. [15] investigate deep learning algorithms for video processing, specifically small object detection. The authors show the gradual improvement in CNN-based methods such as R-CNN, Fast R-CNN, Deconv-R-CNN, Improved Faster R-CNN, etc., along with other deep learning approaches. Sánchez et al. [16] present a study on crowd behavior analysis or crowd anomaly detection by video processing using deep learning techniques. The authors list convolutional RBM, Fast R-CNN, 3D CNN, PCANet, deep Gaussian Mixture Model, Convolutional AutoEncoder (CAE) with LSTM (CAE-LSTM), spatio-temporal CNN, and GAN based approaches. Table 1 shows the summary of related surveys included in this study.

B. MOTIVATION
We observed that the related surveys lack a combined review of up-to-date video processing functionalities using deep learning techniques. The surveys presented above focus only on a specific functionality such as video anomaly detection [5], abnormal human activity recognition [6], multi-object tracking [10], or behavior analysis [16]. None of them collates the research done on various functionalities in one survey. Therefore, we are motivated to present the recent advancement of deep learning-based video processing methods for multiple functionalities such as motion detection, object detection, human action recognition, object tracking, and video classification, together with the deep learning techniques used to perform them. Another motivation is that, in the past ten years, many review papers have appeared on deep learning for image processing [17], [18]; similar reviews of deep learning techniques for video processing are needed.

C. CONTRIBUTION
Since no single survey provides an inside-out study covering all these aspects, our contributions in this paper are as follows:
− A systematic literature review to investigate the up-to-date research in video processing using deep learning techniques.
− We include 93 research papers from journals and conferences listed in top databases that show the development pattern of advanced deep learning algorithms for video processing.
− This paper can help researchers better understand the advancement of deep learning techniques for video processing after the massive success of image processing.
− It shows open challenges and future research in this field.
− To the best of our knowledge, this is the first SLR to present the various functionalities of video processing using deep learning in one paper.
− Our observation shows that video processing advancements using deep learning techniques occurred mainly between 2017 and 2020 due to the advent of very deep networks based on AlexNet, ResNet, and LSTM.
− Significant research works are found on human action recognition, crowd anomaly detection, and behavior analysis from a computer-vision perspective.
The organization of our paper is as follows. Section 2 shows the research methodology; Section 3 shows the result of our study. The result section answers the RQs formed in section 2 based on evidence reported in the literature included in this SLR. Section 4 discusses the results of our analysis. Finally, section 5 summarizes the conclusion and section 6 shows the future research direction in this field.
In the end, Appendix 'A' contains the glossary, and Appendix 'B' has the list of abbreviations for the most frequent terms used in this study.

II. RESEARCH METHODOLOGY
We conduct an SLR to assess the deep learning techniques in video processing. In this paper, we follow the standard SLR procedures described by Okoli and Schabram [19]. This method identifies, specifies, and analyzes all the publications on deep learning for video processing to answer each research question (RQ) and reveal the gaps. This methodology of literature study provides new insights into deep learning research on video data.

A. RESEARCH QUESTION
The vital part of the systematic review is determining the research questions (RQs). We prepare the RQs at the beginning of the study so that the review process stays focused. This approach allows the listed RQs to be investigated in depth. While forming the RQs, the following points are considered:
− The search stage must identify the significant studies that address the RQs.
− The data-extraction stage must extract the data items needed to answer the RQs.
− The data-analysis stage must synthesize the data so that the RQs can be answered.
Table 2 shows the list of RQs that arise in this paper.

RQ3. Motivation: To explore the numerous successful research works on video processing using deep learning techniques.
RQ4. What datasets have been used by the researchers for video processing? Motivation: To explore the publicly available video datasets.
RQ5. What are the issues and challenges in video processing? Motivation: To present the challenges of implementing AI algorithms on video data, which must be handled both spatially and temporally.
Here, RQ1 investigates in depth the application areas where video processing research is most needed, followed by RQ2, which categorizes video processing research by the specific functionalities that can be performed on video data. Then, RQ3 tries to find up-to-date deep learning algorithms for the various functionalities. RQ4 investigates the publicly available datasets for video processing so that researchers can discover suitable datasets for their experiments. Finally, RQ5 focuses on the issues that arise when researching and implementing deep learning on video data and on future directions for research in this area.

B. SEARCH METHOD
We use the following search methods in a step-by-step manner: Table 3 shows the list of databases with respective focus subject areas we chose to perform this SLR. Table 4 lists the keywords we used to search the papers from online databases.

3) CHOOSE TIME RANGE
We initially extract a total of 593 articles published from 2011 to 2020. Afterward, 276 articles were excluded because their work did not match our objective. In the next step, we also excluded 204 master's theses, doctoral theses, and unpublished articles. Finally, we include 93 papers for the final study, all of which are conference and journal papers. Figure 2 shows the year-wise distribution of publications in terms of percentage.

4) INCLUSION AND EXCLUSION CRITERIA
We include only conference and journal papers published in the English language in this study. Therefore, we have not considered papers under review and book chapters. Table 5 shows the criteria on which we include and exclude the searched papers.

5) QUALITY ASSESSMENT
We also assessed the quality of the selected 93 papers. The list of quality assessment questions is below.
− Is the purpose of the research met?
− Since we exclude unpublished papers, was the paper published in a peer-reviewed journal?
− Is the video processing using deep learning techniques clearly explained?
− Are the accuracy and results of the research acceptable?
− Does the work meet the standard of novel deep learning techniques for video processing research?
− Is there any systematic literature review conducted on video processing using deep learning before?
− Does the author clearly explain the purpose of using deep learning techniques on the specific dataset?
− Is there any comparative study conducted on deep learning methods for video processing?
− Does the study have an adequate average citation count per year?
− Has the author given open access to the data and code so that researchers can apply these techniques further?
Finally, we reach the stage of presenting the results of our study as answers to the RQs defined in section 2.1. We point out that the publications we considered purely to answer the RQs fall within the time range 2011-2020; the few papers outside this range are used only for background study.

Inclusion criteria
− only conference and journal publications
− the methodology of the work shows deep learning for video processing for any functionality
− a single publication is chosen if the same deep learning method for the same functionality is found in multiple publications
− only digital video data processing is considered in the context of computer-vision research

Exclusion criteria
− articles in a language other than English
− unpublished papers
− studies without proper validation and robust experimental research
− studies whose content is not video data processing research or which did not use deep learning techniques

III. RESULTS
A total of 93 peer-reviewed research papers on video processing using deep learning techniques were studied. This section observes the characteristics, methods, threats, solutions, and deep learning algorithms mentioned in the selected papers. After an intensive study, we present the results based on the research questions listed in Table 2:

A. RQ 1: WHAT ARE THE APPLICATIONS OF VIDEO PROCESSING?
The current era is driven by technology and the ongoing digital revolution. Video processing has a wide range of applications in entertainment [20], healthcare [21], retail [22], [23], traffic management [24], transport [25], home automation [26], [27], flame and smoke detection [28], [29], [30], safety [31], [32], and security [33]. These applications automatically generate captions for the actions happening in the video sequence. The video captions are further used to identify persons, vehicles, and other objects in the video sequence [34] and their appearance and actions [35]. The most desired application for video processing is producing actionable intelligence to help policymakers understand and respond to current situations [36], [37]. Video processing applications also include education [38] and smart city video surveillance [39]. One of the popular applications of video analysis is crowd management [40]. This application helps to count the people at exit and entry points within a premise in real time or at periodic intervals.

B. RQ 2: WHAT ARE THE VARIOUS FUNCTIONALITIES OF VIDEO PROCESSING IN THE COMPUTER VISION CONTEXT?
Researchers perform many kinds of processing on video data. For example, once an object is detected in a video, properties can be identified through attribute estimation, human pose estimation, person identification, and motion detection. Similarly, human action recognition can be performed on the video. Furthermore, some video analysis applications process video offline and some online, but many applications require situational awareness. Therefore, the various applications of video analysis suggest the following functionalities, which are performed on video data:

1) HUMAN ACTION RECOGNITION (HAR)
Human Action Recognition (HAR) is the task of identifying actions from a video sequence [6], [41], [42]. HAR is applicable in monitoring daily activities such as walking, bending, falling, climbing, sitting, etc., which is essential for activity analysis. HAR aims to identify the actions of one or more persons in the scene and gives helpful information about types of activities. HAR systems are also a part of human behavior monitoring in applications like injury detection during sports, elderly and child care, students' classroom behavior analysis, student-teacher classroom action recognition, and surveillance.

2) MOTION DETECTION
Motion detection is used to determine the presence of relevant motion in the observed scene. The objectives of motion analysis are to detect movements within frames of the video sequence, track an object's motion over time, group objects that move together, and identify the direction of motion. Specific techniques for implementing motion or movement analysis include background segmentation and differential equation models [43].
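As a rough illustration of detecting movement within the frames of a video sequence, the sketch below uses plain frame differencing, which is a simpler technique than the background segmentation and differential equation models cited above. The function name `detect_motion` and both thresholds are assumptions chosen for this example.

```python
import numpy as np

def detect_motion(prev_frame, curr_frame, pixel_threshold=25,
                  min_changed_fraction=0.01):
    """Return (motion_present, mask) for two grayscale uint8 frames."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    mask = diff > pixel_threshold                  # pixels that changed noticeably
    motion_present = mask.mean() > min_changed_fraction
    return bool(motion_present), mask

# Synthetic example: a bright 8x8 square shifts 4 pixels to the right.
prev_frame = np.zeros((64, 64), dtype=np.uint8)
curr_frame = np.zeros((64, 64), dtype=np.uint8)
prev_frame[20:28, 10:18] = 200
curr_frame[20:28, 14:22] = 200

moved, mask = detect_motion(prev_frame, curr_frame)
still, _ = detect_motion(prev_frame, prev_frame)
```

The mask marks only the pixels uncovered and newly covered by the moving square, so tracking the mask over time gives a crude indication of the direction of motion.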

3) OBJECT DETECTION
Object detection is a technique to identify an object or entity, for example, a truck or a human, in the video. In object detection tasks, visually observable objects in the images of a video are detected, localized, and recognized by computers [44], [45]. Detecting moving objects in video data has various real-life applications, for example, determining whether a goal was scored (in football), whether a ball is in or out of the court (in tennis), or which athlete finished first (in speed races).

4) OBJECT RECOGNITION
Object recognition is a way of identifying the type of objects in the video sequence. When people observe something in a video, they can easily recognize the objects, scenes, and visual details. The driverless car is a prominent modern application of object recognition. Object recognition helps driverless cars distinguish a pedestrian from a street light [46], recognize road signs, etc. It is also helpful in various applications such as robotics, industrial inspection, safety, smart city surveillance, and medical imaging.

5) OBJECT TRACKING
Object tracking (also called visual object tracking or video object tracking) is the process of following an object as it moves through space in a video. Object tracking is divided into three stages: initial object detection, assignment of unique IDs, and tracking of the objects across frames. Video object tracking is used for various applications like tracking faces and eyes for human-computer interaction, traffic control, video editing, surveillance, and security [10].
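The three stages named above (detection, ID assignment, and tracking across frames) can be sketched with a greedy matcher based on intersection over union (IoU). This is a deliberately simple illustration, not a method from the reviewed papers: detections are assumed to be supplied by some external detector, and the class name `GreedyIouTracker` and its threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIouTracker:
    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}     # track ID -> last known box
        self.next_id = 0

    def update(self, detections):
        """Match this frame's boxes to existing tracks; return {track_id: box}."""
        assigned = {}
        unmatched = dict(self.tracks)
        for box in detections:
            # Pick the unmatched track that overlaps this detection the most.
            best_id, best_iou = None, self.iou_threshold
            for track_id, prev_box in unmatched.items():
                score = iou(box, prev_box)
                if score > best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:          # no sufficient overlap: start a new track
                best_id = self.next_id
                self.next_id += 1
            else:
                del unmatched[best_id]
            assigned[best_id] = box
        self.tracks = assigned           # tracks without a match are dropped
        return assigned


tracker = GreedyIouTracker()
frame1 = tracker.update([(10, 10, 50, 50), (100, 100, 140, 140)])
frame2 = tracker.update([(102, 101, 142, 141), (12, 11, 52, 51)])
frame3 = tracker.update([(14, 12, 54, 52), (200, 200, 240, 240)])
```

In this toy run, both objects keep their IDs in the second frame even though the detections arrive in a different order, and in the third frame the object that disappears loses its track while a new object receives a fresh ID.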

6) VIDEO CLASSIFICATION
Video classification focuses on automatically labeling videos based on video contents and frames. It is similar to image classification, in which images are classified based on the features belonging to a particular class. In the video classification task, the video is divided into a number of frames (images) per second, and image classification is then performed on these frames [47].
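One simple way to turn per-frame image classification into a video-level label is to average the class probabilities over the frames. The sketch below assumes this averaging rule and uses a hypothetical stand-in classifier (`toy_frame_classifier`); neither is claimed to be the method of [47].

```python
import numpy as np

def classify_video(frames, frame_classifier):
    """Average per-frame class probabilities; return (label_index, mean_probs)."""
    per_frame = np.stack([frame_classifier(f) for f in frames])  # (T, num_classes)
    mean_probs = per_frame.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

def toy_frame_classifier(frame):
    """Hypothetical stand-in for a trained image classifier (dark vs. bright)."""
    p_bright = float(frame.mean()) / 255.0
    return np.array([1.0 - p_bright, p_bright])    # [P(dark), P(bright)]

# Four synthetic 8x8 grayscale frames, three of them bright.
frames = [np.full((8, 8), v, dtype=np.uint8) for v in (200, 220, 40, 240)]
label, probs = classify_video(frames, toy_frame_classifier)
```

Averaging makes the video-level decision robust to a single atypical frame, at the cost of ignoring the temporal order of the frames.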

7) BEHAVIOR ANALYSIS
Intelligent video processing, along with automatically detecting, recognizing, and tracking particular objects from image sequences, also aims to understand and describe object behaviors, detect abnormal behavior [48], hostile intent, etc. Behavior detection increases the speed and accuracy of suspicious detection and improves surveillance while reducing staff and equipment costs. The behavior detection system automatically detects suspicious behavior such as intrusion, loitering, and object abandonment based on user-defined time and location parameters. It can distinguish between humans, shadows, and moving objects.

8) GAIT ANALYSIS
Gait is the motion of human walking, whose movements can be faithfully reflected by the acceleration of the body sections. For every individual, human gait exhibits a unique motion pattern. Gait analysis is the study of locomotion in both humans and animals. The coordination of several parts of the human body, such as the brain, spinal cord, nerves, muscles, bones, and joints, is observed for gait analysis. The study of gait analysis is widely applicable in healthcare, biometrics, sports, and many other fields. Gait dynamics [49] are captured using accelerometers and gyroscopes.

9) BACKGROUND SUBTRACTION
Sometimes, the interesting portion of a video scene is not the background but the objects present in the foreground. These objects can be anything, such as animals, humans, or cars. Detecting and processing foreground objects from a video is also known as background subtraction. Background subtraction techniques are also a standard method for search-space reduction and focus-of-attention modeling in video analysis [50]. Foreground objects are easy to detect if the background of a scene remains unchanged.
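For the unchanged-background case mentioned above, a minimal sketch is to estimate the background as the per-pixel median over the frames and mark pixels that deviate from it as foreground. The median model, the threshold, and the function name `foreground_masks` are assumptions made for illustration; the techniques surveyed in [50] are more elaborate.

```python
import numpy as np

def foreground_masks(frames, threshold=30):
    """Estimate a static background as the per-pixel median, then threshold."""
    stack = np.stack(frames).astype(np.int16)         # (T, H, W), signed for subtraction
    background = np.median(stack, axis=0)
    masks = np.abs(stack - background) > threshold    # True where a pixel is foreground
    return background, masks

# Synthetic scene: constant gray background, a bright 4x4 object moving right.
frames = []
for t in range(5):
    frame = np.full((32, 32), 50, dtype=np.uint8)
    frame[10:14, 4 * t:4 * t + 4] = 255
    frames.append(frame)

background, masks = foreground_masks(frames)
```

Because the object covers each pixel in at most one of the five frames, the median recovers the empty background exactly, and each mask isolates the object in its frame.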

10) EVENT RECOGNITION
Event recognition is the technique of automatically analyzing video clips and recognizing the matching events. Some events come from procedural videos, such as "baking a cake" or "starting a vehicle," while others are social activities such as "birthday celebration," "prayer," or "street dance." Many practical applications require identifying events, such as web video search, consumer video management, and intelligent advertising [51].

11) ACTION SEGMENTATION
Video segmentation is a technique of dividing a video sequence into sets of continuous frames that are similar according to specific criteria. We observe that performing action segmentation before action recognition gives better recognition performance [52]. A challenging problem in human action understanding is recognizing a sequence of continuous actions, each of which generally forms a segment. Action segmentation recognizes primary actions such as jogging, jumping, and sitting from a video sequence in which a person's actions can be segmented into various categories. It can thus segment the different movements in the input video and recognize the action types simultaneously.
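To make the idea of dividing a video into sets of continuous frames concrete, the sketch below collapses a list of per-frame action labels into contiguous segments. The per-frame labels are assumed to come from some recognizer and are made up for this example; the helper name `segment_actions` is hypothetical.

```python
from itertools import groupby

def segment_actions(frame_labels):
    """Collapse per-frame labels into (label, start_frame, end_frame) segments."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))  # end index is inclusive
        start += length
    return segments

# Hypothetical per-frame predictions for a 9-frame clip.
frame_labels = ["jogging"] * 4 + ["jumping"] * 2 + ["sitting"] * 3
segments = segment_actions(frame_labels)
```

Each resulting tuple names one action together with the first and last frame index of the run in which it occurs.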

12) SCENE UNDERSTANDING
Scene understanding is the study of scene structure (e.g., pedestrian road crossings, market areas, traffic on the road, queues waiting at an entrance), scene status (e.g., a traffic light changing color), and scene motion patterns (e.g., cars taking U-turns). Unusual activity recognition improves with the understanding of scene patterns, tracking, and motion patterns. The increasing surveillance of massive crowds at sporting events, concerts, amusement parks, airports, and other venues motivates a growing desire to process and analyze crowd scenes, i.e., scene understanding [53].

C. RQ 3: WHAT ARE THE VARIOUS DEEP LEARNING TECHNIQUES USED BY COMPUTER-VISION RESEARCHERS FOR VIDEO PROCESSING?
AI algorithms have achieved excellent success in video processing research. However, the spatial and temporal diversity of video data makes recognition in video sequences a challenging task. To answer this question, we group the deep learning techniques for video processing into Convolutional Neural Network (CNN), Deep Neural Network (DNN), Recurrent Neural Network (RNN), and hybrid approaches.

1) CNN BASED APPROACH
In an early work, a large-scale collection of YouTube videos covering 487 sports classes was used to train a CNN model [54]. This CNN model includes a multi-resolution architecture that utilizes the local motion information in videos. It consists of a context stream for low-resolution image modeling and a fovea stream for high-resolution image processing, which together are used to classify videos. The authors also explain three broad connectivity patterns (early fusion, late fusion, and slow fusion) that extend the network's connectivity along the time dimension to learn spatio-temporal features of the video data. Another work presents event detection from videos using CNN [55]. The authors proposed an encoding method for spatial and temporal information using CNN and a frame descriptor to enhance the visual information. Similarly, for the event detection task, the authors of [56] benefit from a model pre-trained on ImageNet to classify unusual events from surveillance cameras. This practice reduces the computational cost of training a large CNN model for video processing. A general deep learning approach is the two-stream CNN proposed by Simonyan et al. [57]. This architecture has two separate streams: one captures spatial information from a single frame, and the other captures temporal information from optical flow. The two-stream CNN thus takes regular images and optical flow images as input. To achieve high throughput, these two separate networks were combined with a late fusion technique. This video processing method has been evaluated on human activity recognition tasks. To overcome the limitation of spatio-temporal stream fusion at the softmax layer [57], another HAR approach [58] fuses the spatial and temporal networks at a convolution layer without losing performance. Other CNN-based methods developed for video processing for HAR tasks are [54], [59].
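Late fusion of the two streams at the softmax layer can be sketched as a weighted average of the class scores produced by the spatial and temporal networks. Averaging is only one possible fusion rule; the logits, the weight, and the function names below are illustrative assumptions rather than values from [57].

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def late_fusion(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the two streams' softmax scores."""
    spatial = softmax(np.asarray(spatial_logits, dtype=float))
    temporal = softmax(np.asarray(temporal_logits, dtype=float))
    return (1.0 - temporal_weight) * spatial + temporal_weight * temporal

# Hypothetical class scores for three actions from each stream.
spatial_logits = [2.0, 1.0, 0.1]     # appearance stream (single RGB frame)
temporal_logits = [0.2, 3.0, 0.1]    # motion stream (optical flow)
fused = late_fusion(spatial_logits, temporal_logits)
predicted_class = int(fused.argmax())
```

In this toy case the spatial stream alone favors class 0, but the confident temporal stream shifts the fused prediction to class 1, which is the kind of complementary evidence that motivates combining the streams.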
In a novel approach to video processing [60], the authors propose a MultiD-CNN framework for multimodal gesture recognition. This framework combines two models: a 3D Color-Depth Convolutional Network (3D-CDCN) and a 2D Motion Representation Convolutional Network (2D-MRCN). These models mimic the architecture of deep residual networks (ResNets). The study shows the advantages of these two networks: the convolutional layers in the ResNets reduce the number of trainable parameters through weight sharing, and the ResNets also have a few shortcut connections that perform identity mapping and directly add the output of a particular layer to the output of later layers.
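The identity mapping in a residual network can be written as y = ReLU(F(x) + x), where F is the learned residual. The sketch below illustrates this with two dense layers in NumPy; using dense layers instead of the convolutions of [60], and all shapes and weight scales, are simplifying assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    residual = relu(x @ w1) @ w2     # F(x): the learned residual mapping
    return relu(residual + x)        # the identity connection adds the input back

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 16))       # 4 samples, 16 features each
w1 = rng.standard_normal((16, 16)) * 0.1
w2 = rng.standard_normal((16, 16)) * 0.1
output = residual_block(features, w1, w2)

# With zero weights the residual vanishes and the block reduces to relu(x).
zero = np.zeros((16, 16))
identity_output = residual_block(features, zero, zero)
```

The zero-weight case shows why such blocks are easy to stack: a block that has learned nothing still passes its input through largely unchanged.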
The authors of [61] develop a deep learning algorithm to deal with large displacements in videos. They first create a matching algorithm, DeepMatch, and design it to address the 2D warping problem. They liken it to a deformable SIFT descriptor grid in which all four quadrants can move independently of each other up to a certain distance. A non-negative cosine similarity function is used for overlapping pixels to score possible warpings. Finally, the responses are max-pooled and subsampled, generating a pyramid of SIFT-like features. DeepFlow combines DeepMatch with an energy minimization approach to generate the final optical flow for efficient video processing.
Another CNN-based method, the Multi-Domain Network (MDNet), was proposed by Nam et al. [62] for visual object tracking in online video data. It is designed to learn a shared representation of targets from multiple annotated video sequences for tracking, where each video is regarded as a separate domain. At the end of the network, the method uses separate branches of domain-specific layers for binary classification, while the preceding layers share the generic information captured from all video sequences for generic representation learning. Each domain in MDNet is trained individually, and the shared layers are updated in each iteration. In this way, the authors separate domain-independent information from domain-specific information, so that the model learns a generic feature representation for visual tracking.
DeepSORT [63] is one of the most widely used CNN-based object tracking frameworks. The authors use the Simple Online and Realtime Tracking (SORT) method for multiple object tracking, going beyond [62], with a focus on simple, practical algorithms. They adopt a conventional single-hypothesis tracking methodology with recursive Kalman filtering (https://www.kalmanfilter.net/default.aspx) and frame-by-frame data association. The tracking scenario is defined on an eight-dimensional state space containing the bounding box center position, aspect ratio, height, and their respective velocities in image coordinates. A standard Kalman filter with a constant-velocity motion model and a linear observation model is applied, and the bounding box coordinates are used as observations of the object position.
To demonstrate efficient video processing for human action recognition, ActionFlowNet [64] was proposed as a multitask learning model. It trains a single-stream network directly from raw pixels to jointly estimate optical flow and recognize actions through a CNN. The authors trained this model on motion information from unlabeled video clips. It also improves action recognition accuracy by a large margin of 23.6% compared with the state-of-the-art CNN-based unsupervised representation learning [65]. A CNN model for 6D object pose estimation, PoseCNN, was proposed by Xiang et al. [66]. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regression to a quaternion representation. The method is evaluated on 21 objects from a large-scale 6D video dataset (https://rse-lab.cs.washington.edu/projects/posecnn/) designed exclusively for the 6D object pose estimation task.
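The constant-velocity Kalman filter on the eight-dimensional state described above for DeepSORT [63] can be sketched in a few lines of NumPy. The state layout follows the text (center position, aspect ratio, height, and their velocities); the time step, noise covariances, initial values, and synthetic measurements are illustrative assumptions and not the settings used in [63].

```python
import numpy as np

DT = 1.0  # one frame per step (assumed)

# State: [u, v, aspect_ratio, height, du, dv, d_aspect_ratio, d_height].
F = np.eye(8)
F[:4, 4:] = DT * np.eye(4)                      # position += velocity * dt
H = np.hstack([np.eye(4), np.zeros((4, 4))])    # only the box itself is observed

def predict(x, P, Q):
    """Propagate the state and covariance with the constant-velocity model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Correct the prediction with a bounding box measurement z."""
    innovation = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)              # Kalman gain
    return x + K @ innovation, (np.eye(8) - K @ H) @ P

# Illustrative noise levels and initial state (assumptions).
Q = 1e-2 * np.eye(8)
R = 1e-1 * np.eye(4)
x = np.array([100.0, 50.0, 0.5, 80.0, 0.0, 0.0, 0.0, 0.0])
P = 10.0 * np.eye(8)

# A box whose center moves 2 pixels right per frame; its shape stays fixed.
for step in range(1, 21):
    z = np.array([100.0 + 2.0 * step, 50.0, 0.5, 80.0])
    x, P = predict(x, P, Q)
    x, P = update(x, P, z, R)
```

After a few frames the filter's horizontal velocity estimate `x[4]` approaches the true 2 pixels per frame even though velocity is never measured directly, which is what lets the tracker predict where a box should appear in the next frame before data association.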

FIGURE 3. A typical two-stream CNN for video processing
Inspired by the great successes of CNNs in image recognition tasks, the authors of [67] use the gait energy image (GEI), a popular image-based gait representation, as the input to GEINet. GEINet consists of two sequential triplets of convolution, pooling, and normalization layers, followed by two fully connected layers. From the cross-view gait recognition perspective, the authors demonstrate that GEINet performs very well on the OU-ISIR large population dataset. For vehicle recognition, a popular video processing application, the authors of [68] propose a nine-layer CNN. For other video processing tasks such as background subtraction [69], object detection [70], [71], obstacle detection for self-driving cars [46], anomaly detection in crowded scenes [72], [73], lane marking [74], and monitoring wild animals [75], the CNN-based approach has proven to be a good choice. Figure 3 shows the concept of a two-stream convolutional neural network for video processing in a spatio-temporal manner.

2) DNN BASED APPROACH
A deep neural network (DNN) is considered an advanced form of neural network with a higher level of complexity, i.e., a neural network with more than two layers. DNN-based algorithms developed for video processing are capable of handling extensive, high-dimensional data. In [76], object detection in video is performed using a DNN with SIFT [77] and tensor, which shows higher accuracy than previously developed methods. A robust DNN-based Multivariate Gaussian Fully Convolution Adversarial Autoencoder (MGFC-AAE) model was proposed in [78] to meet the demand for video anomaly detection and localization. In this model, the latent space representations of normal samples are trained to follow a specific prior distribution.
Pashchenko et al. [79] use a DNN-based model in a transport system to recognize critical situations. Amosov et al. [80] developed a DNN-based video processing method that computes classification probabilities for each video fragment in order to detect and recognize normal and abnormal situations. Similarly, a DNN is applied to road sign detection and lane detection tasks [81] for road analysis in automated driving. Luo et al. [82] perform DNN-based video processing for anomaly detection in combination with sparse coding. This method aims to learn a dictionary that encodes various regular events with small reconstruction errors.

3) RNN BASED APPROACH
A recurrent neural network (RNN) is an artificial neural network designed for sequential or time-series data. This deep learning algorithm is generally used for temporal information. Since video data is spatio-temporal in nature, many RNN-based video processing methods have been developed in recent years. Long short-term memory (LSTM), gated recurrent units (GRU), and Neural Turing Machines (NTM) are a few popular variants of RNN. Among these, LSTM is one of the most widely used algorithms in video processing research.
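To make the recurrence concrete, the sketch below implements one step of a standard LSTM cell with a single scalar hidden unit in plain Python. It is a minimal illustration of the usual gate equations, not the implementation of any reviewed paper; the parameter layout (`params` mapping a gate name to an input weight, a recurrent weight, and a bias) and all numeric values are assumptions chosen for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each gate name to a tuple (w_x, w_h, b) of input weight,
    recurrent weight, and bias. Returns the new hidden and cell states."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))         # how much new content to write
    f = sigmoid(pre_activation("forget"))        # how much old memory to keep
    o = sigmoid(pre_activation("output"))        # how much memory to expose
    g = math.tanh(pre_activation("candidate"))   # proposed new content
    c = f * c_prev + i * g                       # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# Hypothetical weights; in practice these are learned from data.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in [0.5, -0.1, 0.8]:  # e.g. one scalar feature per video frame
    h, c = lstm_step(x, h, c, params)
```

In a video model, `x` would be a per-frame feature vector and the weights would be matrices, but the gating structure is the same.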
The authors of [83] proposed an LSTM-based two-stage deep temporal model for group activity recognition. First, they design an LSTM model to analyze the action dynamics of individual people in a video sequence. A second LSTM model then aggregates the person-level information to understand the activity in the entire video. In [84], Guo et al. propose a GRU-based approach to video processing for facial expression recognition. The authors use facial landmark points and facial action units as input features in the training phase to effectively identify the facial regions and their components. Another GRU-based method was implemented for video segmentation in [85] by fully utilizing the temporal data in online videos. The GRU preserves the temporal information and maintains the spatial connectivity across the sequential frames of video data. A novel approach in [86] introduces a multilayer perceptron recurrent neural network (MLP-RNN), which is suitable for differentiating multiple objects of varying sizes. It uses a reference outline template for foreground analysis to determine the presence or absence of anomalies. For efficient target tracking, the authors of [87] implement an LSTM-based method that accounts for target motion uncertainty. Target tracking estimates the target states from the measurements; this method therefore estimates the true states of the object in a strictly sequential manner.
A major hurdle in video processing is the low quality of video data. To perform face recognition on low-quality video frames, the authors of [88] propose a multi-mode aggregation recurrent network, which obtains discriminative features by aggregating information across video frames. Video reconstruction is another difficult task: reconstructing each video frame independently results in a loss of temporal redundancy. Unlike this traditional approach, the authors of [82] devise a fast deep-learning GRU reconstructor that utilizes the spatio-temporal features in a video. An important feature pointed out by the authors is that the GRU requires little memory.
Going a step further, the authors of [89] use video reconstruction techniques to estimate the scene background in videos. They exploit semantic segmentation to extract foreground objects, for example, a person or a moving vehicle, and sum the background regions to reconstruct the background. They use Conditional Random Fields as Recurrent Neural Networks (CRF as RNN) for the semantic segmentation to detect the areas of essential objects in each frame and map them to foreground and background. In [90], the authors introduce an RNN-based video manipulation detection method, which represents a unique advancement in video processing. This method reveals alterations in a new video by comparing it frame by frame with a preexisting video. The network is a fusion of ResNet and LSTM. An approach to object segmentation in video data based on a convolutional RNN and optical flow is proposed in [91]. This method separates the object from the background in a video and masks the previous frames.

4) HYBRID APPROACH
The hybrid approach combines multiple deep learning methods for video processing. We found a large body of research using this approach. As explained by the authors of [92], LSTM networks operate on frame-level CNN activations and combine information over time, similar to temporal feature pooling. Compared to the vanilla recurrent neural network, LSTM has an uninterrupted gradient flow, which makes back-propagation easier. LSTM is also more stable with respect to exploding or vanishing gradients. The authors of [93] introduce temporal segment networks for human action recognition. The idea is to divide the video into equal parts, extract a small snippet from each part, classify each snippet using a multistream CNN, and then form a consensus over the classifications to output a final softmax score for the whole video. They justify this as a way of better modeling global temporal dynamics. They find that RGB, optical flow, and warped optical flow together give the best results. In [94], the authors of [93] extend their work on temporal segment networks to untrimmed videos and combine the classification scores from different snippets using weighting schemes instead of simple averaging. They experimented with different weights, including attention-based weights measured using activations from the last layer of the RGB CNN.
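The segment-and-consensus idea of temporal segment networks can be sketched in a few lines. The example below divides a video into equal segments, samples one frame index per segment, averages per-snippet class scores, and applies a softmax to obtain a video-level prediction. It is a simplified illustration under stated assumptions: the per-snippet scores are hypothetical stand-ins for the output of a multistream CNN, and simple averaging is used as the consensus function.

```python
import math
import random


def sample_snippet_indices(num_frames, num_segments, rng):
    """Split the video into equal segments and pick one frame index in each.
    Frames left over after the last full segment are ignored in this sketch."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    segment_length = num_frames // num_segments
    return [k * segment_length + rng.randrange(segment_length)
            for k in range(num_segments)]


def segmental_consensus(snippet_scores):
    """Average the class scores over all snippets (simple averaging)."""
    count = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [sum(scores[c] for scores in snippet_scores) / count
            for c in range(num_classes)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
indices = sample_snippet_indices(num_frames=90, num_segments=3, rng=rng)

# Hypothetical per-snippet class scores, standing in for CNN outputs.
snippet_scores = [
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [2.5, 0.0, 0.3],
]
video_probs = softmax(segmental_consensus(snippet_scores))
predicted_class = video_probs.index(max(video_probs))
```

Replacing the plain average in `segmental_consensus` with a weighted sum is the natural place to plug in the alternative weighting schemes explored for untrimmed videos.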
Real-time yoga pose recognition from video using deep learning has been addressed in [95], which proposes a hybrid deep learning model using CNN and LSTM. In this method, the CNN extracts various pose features, and the LSTM then uses these features for the actual temporal predictions. In addition, a technique called Recurrent Convolution Networks (RCN) [96] was introduced for video processing in recent years. It applies CNNs to video frames for visual understanding and then feeds the result to RNNs for analyzing temporal information in videos. To leverage the advantage of RNNs for modeling high-level temporal feature sequences, the authors of [97] propose a novel model based on Inflated 3D [98] and LSTM for human action recognition (HAR). First, the authors pretrain a 3D CNN model on Kinetics, a vast video action recognition dataset, which improves the model's generality. Thereafter, long short-term memory (LSTM) is used to learn the high-level temporal features produced by the Kinetics-pretrained 3D CNN model. Another contribution to HAR [96] employs the long-term recurrent convolutional network to overcome the issue of variable-length input sequences. It combines convolutional layers (for visual recognition) with an LSTM network (for time-varying sequence learning). This network is also helpful for image caption generation and video description tasks. Another hybrid approach to video processing for HAR is [99], which uses a 3D CNN and LSTM. In contrast, an older deep learning-based HAR method [100] experimented with a hybrid classifier that fuses weights generated by homogeneous models arranged in a parallel architecture. More recently, the transformer network, which adopts the attention mechanism in deep learning, has performed strongly when combined with spatio-temporal models such as CNNs for human activity recognition. Girdhar et al. [101] propose a deep learning Transformer-based model combined with an I3D network for HAR to jointly identify the spatio-temporal features of the person whose action is to be recognized and of that person's surroundings. The network learns its attention without explicit supervision, primarily focusing on hands and faces, which contribute most to accurate human action recognition.
Hu et al. [102] presented a deep learning probabilistic model for an autonomous vehicle's video scene understanding of real-world urban scenes. This model learns features from a spatio-temporal convolutional network to jointly predict future scene representations by encoding the future state into a low-dimensional future distribution.
In [103], the authors use optical flow, CNN, LSTM, and a support vector machine (SVM) for gesture recognition from video data. This approach is highly applicable to interpreting the news for the deaf-mute community. The optical flow method is used to detect and process moving target objects in the video. In [104], the authors propose a novel approach for adult content detection in videos, namely ACORDE (Adult Content Recognition with Deep Neural Networks). This method combines a CNN as a feature extractor with an LSTM for classification. The authors of [105] perform anomaly detection in real-time videos using an optical-flow convolutional autoencoder and a convolutional LSTM. It shows better performance than vanilla CNN-based or DNN-based approaches for anomaly detection. Automatic event detection from video data using CNN and RNN is presented in [106]. This network efficiently recognizes soccer events from live video streams by leveraging feature learning through the CNN and deriving temporal relations through the RNN.
The need to detect the most relevant objects in dynamic video is addressed in [86], where salient object detection is performed using a hybrid Convolutional Recurrent Neural Network (CRNN). Salient objects in the moving scene are detected by capturing temporal, spatial, and local constraint features with the CNN- and RNN-based CRNN model.
A very challenging application of video processing is healthcare, where a novel approach proposed in [107] enables intelligent monitoring of the tools used during surgery in the operating theatre. The authors note that the tools are recorded either through a microscope or through an endoscope. This state-of-the-art video processing technique processes each frame of the video with CNNs, and the learned outputs are sent to RNNs to fully utilize the temporal relationships between frames. Another video processing approach in the medical domain is proposed in [21]. The authors developed a multitask recurrent convolutional network with correlation loss (MTRCNet-CL) that addresses both surgical tool presence detection and surgical phase recognition.
Furthermore, [108] performed video captioning to generate text descriptions of video frames using CNN and transformers by introducing a video encoder, proposal decoder, and captioning decoder. Table 7 shows the summary of deep learning approaches for video processing.

D. RQ 4: WHAT DATASETS HAVE BEEN USED BY THE RESEARCHERS FOR VIDEO PROCESSING?
We found various video datasets in the literature on which researchers have evaluated deep learning for video processing. Table 8 lists these video datasets together with the applications for which researchers used them.

E. RQ 5: WHAT ARE THE ISSUES AND CHALLENGES OF VIDEO PROCESSING USING DEEP LEARNING?
Although there has been significant progress over the past few years, there are still many challenges in applying deep learning techniques to video processing and developing models for real-life applications. Various challenges remain as hurdles in this research area, such as:

1) POOR QUALITY OF VIDEOS
Videos captured by live cameras installed at a long distance are of poor quality and suffer from severe occlusions, which occur in many video surveillance scenarios. Public gatherings and overcrowded places, such as religious events and airport arrival and departure terminals, are places where occlusions happen frequently. Moreover, surveillance cameras installed at elevated positions cannot capture videos of the quality found in existing video datasets, in which the target person is clearly visible. Because the cameras are far away, the subject appears relatively small, which makes processing challenging. The relatively low quality of these long-distance videos further increases the difficulty.

2) TRACKING AND LOCATING MULTIPLE SUBJECTS
In real-world settings, tracking any single object among multiple moving objects in a video is complex. The main challenge of tracking is target motion uncertainty, which arises because an accurate dynamic model is unavailable to the tracker [87]. In particular, the transition function is not known with certainty, and computing the densities across time steps is complex.

3) DYNAMIC BACKGROUNDS
In most real-world applications, surveillance cameras capture complex and evolving backgrounds. As a result, these videos are recorded against a variety of dynamic backgrounds. Real-time video scenes also exhibit illumination variance, occlusions, and changing viewpoints, which make video processing very difficult in such complex and dynamic situations.

4) LACK OF DATASET
A large number of video datasets is also desirable for experimenting with various DL techniques. For example, the action recognition task on JHMDB proved too challenging because of its data annotation method, and past research reported poor performance on it. So apart from data collection, proper annotation is also vital for video datasets. The lack of properly formed video datasets is still a challenge in video processing research. Video datasets from various domains are also highly desirable. Studies have highlighted that the availability of video data is also a major issue: while a few datasets are publicly available, many are still not available for open research (training, testing, and inferencing).

5) LACK OF COMPUTATION POWER
Besides methodological breakthroughs and the availability of big training data, the recent success of video processing is also due to advances in hardware. Researchers have faced the challenge of insufficient computing resources for large-scale video data processing. Since deep learning algorithms rely on specialized processing hardware such as GPUs [134], highly data-intensive and compute-intensive computing machines are required.

A. SUMMARY OF REVIEWED STUDIES
We observe a rapid advancement in video processing using deep learning techniques between 2017 and 2020 compared to earlier research between 2011 and 2017. According to the reviewed studies, CNNs perform best in most video processing functionalities, such as video classification, scene labeling, and scene understanding, whereas RNN-based approaches have proven best for visual object tracking and for modeling long-term temporal relationships. Temporal Segment Networks combined with LSTMs perform much better than vanilla CNNs or RNNs. When comparing LSTM with CNN, most researchers have concluded that both algorithms perform well and are appropriate for the video classification task. In addition, CNNs have also performed well for scene labeling, using a parametric model to learn discriminative features and classifiers. DNN-based approaches improve action recognition accuracy by a large margin over CNN-based unsupervised representation learning methods trained without large-scale external data and additional optical flow input. Models trained with large labeled datasets such as ImageNet and Sports-1M achieve higher accuracy than models without pretraining on large external labeled datasets.
In contrast, the vanilla RNN suffers from a short-term memory problem caused by vanishing gradients; therefore, LSTM has proven to perform better given the spatio-temporal nature of video processing. Furthermore, research shows that a deep-learning GRU reconstructor is fast and requires little memory, unlike traditional approaches. The various hybrid approaches, in which a combination of CNN and LSTM is employed, have shown tremendous improvement in network architecture, in the handling of spatio-temporal features, and in the long-term learning of patterns. We observed that most of the research aimed to design algorithms with significant speed-up without loss of accuracy.
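The short-term memory problem of the vanilla RNN noted above can be illustrated numerically. For a scalar recurrence h_t = tanh(w * h_{t-1} + x_t), the chain rule gives the gradient of the final state with respect to the initial state as a product of per-step factors w * (1 - h_t^2). The sketch below computes that product for a short and a long sequence; the weight `w = 0.9` and the constant inputs are arbitrary illustrative values, and the scalar model is a deliberate simplification of a real network.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: d h_t / d h_{t-1}
    return grad


# Same recurrence, unrolled over 5 and over 50 time steps.
short_grad = gradient_through_time(0.9, [0.1] * 5)
long_grad = gradient_through_time(0.9, [0.1] * 50)
```

Because every factor has magnitude at most |w| < 1, the gradient shrinks at least geometrically with sequence length, so early frames barely influence learning. The LSTM's additive cell-state update is the usual remedy, since it provides a path along which the gradient is not repeatedly squashed.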
Deep learning techniques for video processing are also advancing due to the advent of various video datasets in multiple domains, such as UCF 101, UMN, UCSD, and Avenue, listed in Table 9. However, given the diversity and nature of these datasets, the algorithms learn features that approximate real-time conditions only within a controlled environment.
Researchers have also faced various challenges in developing deep learning algorithms for video processing. A major hurdle is the lack of openly available data and the costly hardware required for training, testing, and inferencing. Apart from occlusion, poor camera video quality in real-time environments makes it difficult for deep learning models to perform well.

B. THREATS TO VALIDITY
This section discusses the possible threats that might have affected our systematic literature review and how we alleviated them. Validity is the degree to which the results measure what they are supposed to measure.

1) THREATS TO INTERNAL VALIDITY
The fundamental threat to internal validity is the literature we collected for our study. We found a limited number of research papers, 93 out of 593, in which researchers use actual deep learning techniques for video processing in the context of computer vision. Furthermore, for the few studies that present a novel approach with an experimental dataset, either the methods or the datasets are available only on arXiv. In contrast, our study was bound to include only peer-reviewed articles published in journals or conferences indexed in reputed databases.

2) THREATS TO EXTERNAL VALIDITY
External validity concerns the ability to generalize the results beyond our study. We report the accuracy of deep learning methods on datasets as stated by the authors of the original papers. We did not perform any experimental research to recalculate the results. Hence, the generalization of deep learning techniques for video processing is presented as reported in the original research papers.

3) THREATS TO CONCLUSION VALIDITY
Conclusion validity is the degree to which the relationship between the data and the conclusions is reasonable. Access to other reputed databases such as Scopus, Springer, and Wiley was restricted and beyond the scope of our research work; therefore, we could not retrieve the literature published in these databases. Undoubtedly, the inaccessibility of literature listed in these databases limits our ability to draw final conclusions on the advancement of video processing using deep learning techniques. There may be many more deep learning methods for video processing than we include in this study. Hence, conclusions based on only the 93 articles extracted from the WoS, ACM, and IEEE databases may not adequately represent the advancement of deep learning techniques for video processing between 2011 and 2020. This limitation also impacts our study.

V. CONCLUSION
In this paper, we have presented a systematic literature review of deep learning techniques for video processing in the context of computer vision. We included 93 research papers published in peer-reviewed journals and conferences indexed in WoS, ACM, and IEEE Xplore between 2011 and 2020. We present the SLR by forming the RQs and systematically answering them in terms of the applications and functionalities of video processing, deep learning techniques, datasets, and challenges. Finally, we conclude with the main points of our study:
⎯ Deep learning techniques can now boost video understanding, video classification, video analysis, action recognition, and pose recognition.
⎯ The advent of pivotal research on AlexNet and ImageNet for image processing gave a clear direction for performing analysis in video processing. Therefore, more literature was found on this topic between 2017 and 2020.
⎯ We found significant work on video processing using deep learning for human action recognition, behavior analysis, and crowd anomaly detection.

VI. FUTURE RESEARCH
Since deep learning techniques are suitable for handling large-scale video data, they can process and analyze millions of data items captured from distributed sensors. Many research topics on such data remain active as future directions, such as threat identification, multi-person identification, multi-object tracking, and scene labeling. We observed that some topics, such as action recognition, video classification, and object tracking, have received considerable research attention. Surprisingly, we did not find many research articles on scene labeling, scene understanding, video analysis from moving cameras, and cluttered backgrounds. Despite remarkable progress, the advances achieved so far do not meet high accuracy standards or achieve a correct realization of video processing in some areas, such as video surveillance in low light, partially captured areas, and gait recognition. A large number of video datasets should also be made freely available in the future. Much in-depth research is still required, along with large computation power, such as tensor-core-based GPUs, for training huge neural networks.

APPENDIX A
• Artificial Intelligence: Artificial Intelligence is a field of computer science where algorithms are designed to make the machine capable of performing tasks intelligently without being explicitly instructed.
• Computer Vision: Computer vision is a field of artificial intelligence (AI) that enables machines to derive meaningful information from digital images, videos, and other visual inputs and take actions based on that information.
• Deep Learning: Deep learning is an AI algorithm designed by using neural network architectures that contain many layers, sometimes called deep layers.
• Video Processing: In the context of computer vision, video processing or digital video processing is the ability to automatically analyze video, frame by frame, to detect and determine temporal and spatial features.
• Deep Neural Network (DNN): A neural network with some level of complexity, usually at least two layers, qualifies as a deep neural network or, say, the deeper form of neural network.
• Recurrent Neural Network (RNN): A recurrent neural network is a type of neural network where the output from the previous step is fed as input to the current step.
• Long Short Term Memory (LSTM): LSTM is a type of RNN where the information flows through a mechanism known as cell states. This way, LSTMs can selectively remember or forget things.
• Transformers: A transformer is a deep learning model that adopts the attention mechanism, thus differentially weighing the significance of each part of the input data.
• Probabilistic Model: A probabilistic model predicts the probability distribution over a set of classes, rather than only outputting the most likely class.
• Benchmark Video Datasets: Benchmark video datasets are adequately prepared, annotated, validated, and proven to be accurate compared with other datasets.
• Peer-Reviewed Articles: Articles reviewed and critiqued by the authors' peers, who are experts in the same subject area.
• High-Performance Computers or Supercomputers: One of the best-known types of HPC solutions is the supercomputer. A supercomputer contains thousands of compute nodes that work together to complete one or more tasks. This is called parallel processing. It is similar to having thousands of PCs networked together, combining compute power to complete tasks faster.
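As an illustration of the Transformers entry in the glossary above, the sketch below computes scaled dot-product attention for a single query in plain Python. The softmax weights show how an attention mechanism assigns a different significance to each part of the input. This is a minimal sketch of the standard formulation, not code from any reviewed paper; the query, key, and value vectors are made-up toy values.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of the values and the attention weights."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    weights = softmax(scores)
    value_dim = len(values[0])
    output = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(value_dim)]
    return output, weights


# Toy inputs: three input positions, each with a 2D key and value.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention(query, keys, values)
```

The key most aligned with the query receives the largest weight, so its value dominates the output; the weights themselves form a probability distribution over input positions.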

APPENDIX B
• DL: Deep Learning