Deep Learning for Object Detection and Segmentation in Videos: Toward an Integration With Domain Knowledge

Deep learning has enabled the rapid expansion of computer vision tasks from image frames to video segments. This paper reviews the latest research on computer vision tasks in general and, in particular, on localizing objects and identifying their associated pixels in video frames. After a systematic analysis of the existing methods, the challenges related to these computer vision tasks are presented. To address these challenges, a hybrid framework is proposed, in which deep learning methods are coupled with domain knowledge. An additional feature of this survey is a review of the currently existing approaches that integrate domain knowledge with deep learning techniques. Finally, some conclusions on the implementation of hybrid architectures for computer vision tasks are discussed.


I. INTRODUCTION
Just as motion perception is essential to our visual system, allowing us to interpret the world, to detect the presence of creatures [25], and to avoid danger [34], video computer vision helps artificial intelligence agents to decipher their surrounding environment and to synthesize actionable information. Inspired by the human visual system and enabled by the latest advancements in deep learning (DL), novel video processing methods are emerging that achieve remarkable results and that seek to revolutionize how computer vision tasks are implemented. Yet, similarly to human perception, computer vision is quite prone to illusions.
The fast pace of DL breakthroughs, in combination with improvements in hardware capabilities in terms of computation power, memory capacity, and sensor resolution, has accelerated the spread of data-driven methods over conventional computer vision techniques. Contrary to classical techniques, DL reaches human-level accuracy, requires less expert analysis, and provides superior flexibility, including the possibility of re-training whenever new data become available [115].
The objective of this work is to investigate the advancements of deep learning techniques for computer vision tasks in videos, as well as the research perspectives for addressing their current weaknesses. More specifically, the contributions of our study are threefold:
• We present an analysis of the existing DL techniques for detection and segmentation of objects in videos.
• We present an overview of the challenges with the existing data-driven approaches.
• We outline new directions for research in video processing.
The paper is organized in seven sections. Section II presents an overview of necessary preliminary knowledge. Section III gives a comprehensive overview of DL-based video computer vision methods. In Section IV the current challenges are presented and analyzed. To address these challenges, Section V presents an overview of approaches that couple DL methods with domain knowledge. Section VI highlights the most prominent topics that are expected to draw major interest from the research community in the following years, and Section VII gives concluding remarks.
A list of abbreviations mentioned in this paper and their definitions are presented in Table 1.

II. PRELIMINARIES
In this section, we introduce the most typical tasks of computer vision and we present a brief, comparative analysis between deep learning and conventional techniques in the domain of computer vision, as well as an overview of basic deep learning methods such as convolutional neural networks, restricted Boltzmann machines, and auto-encoders, which constitute the core for DL architectures in computer vision.

A. COMPUTER VISION TASKS
Computer vision tasks can be categorised into four major fields: (1) semantic segmentation, (2) classification & localization, (3) object detection, and (4) instance segmentation. The task of semantic segmentation refers to the process of assigning a class label to every pixel in an image [72]. One shortcoming of this task is that semantic segmentation does not differentiate between instances of the same class. The classification & localization task, on the other hand, aims to predict the class of a specific object in an image and to draw a bounding box around the region of the classified object [126]. This task refers to a single object. However, most images in real-world settings contain multiple objects of different shapes and sizes. Therefore, object detection [37] refers to a more general approach, where a varying number of predicted objects can be extracted for every input image, since it is unknown how many objects are expected to be detected in each image.
Object detection systems strive to find every instance of an object and estimate the spatial extent of each one. Nevertheless, the detected objects are localized only with bounding boxes.
The task of instance segmentation refers to the problem of detecting all the instances of a category in an image and marking the pixels that belong to each one of them [39]. Extending this task to the video domain results in simultaneous detection, segmentation, and tracking of the instances [121]. The instance segmentation task combines object detection, where individual objects are classified and localized with a bounding box, and semantic segmentation, where each pixel is classified into the given classes.
The task of object classification & localization is included in object detection. At the same time, in semantic segmentation, each pixel of an image is associated with a class label such as road, tree, pedestrian, etc. In other words, all objects of an image that belong to the same class are treated as a single entity. In instance segmentation, on the other hand, each object of the same class is treated as a distinct individual instance. Hence, instance segmentation can be considered a more elaborate implementation of semantic segmentation. Since these computer vision tasks are closely related, this work mainly examines object detection and instance segmentation techniques, as they are the dominant techniques required in extensive applications such as autonomous driving [69], video surveillance [100], face recognition [108], and robot navigation [120].

B. DEEP LEARNING VS. TRADITIONAL COMPUTER VISION TECHNIQUES
Traditional computer vision methods are based on hardcoded, rigid-rule algorithms that apply feature extraction to images [80]. Several algorithms have been developed to extract properties such as corners, edges, and regions of interest from images [2], [12], [40], [74], [88]. These algorithms showcase advantages such as transparency, in the sense that every step of how a decision was made can be traced back, and performance that is independent of a training dataset. At the same time, however, they have been criticised for being inflexible, difficult to improve or adapt, and highly time-consuming to develop manually for each additional object to be detected [83]. Moreover, the performance of these methods significantly deteriorates when the number of classes to be detected increases. By contrast, DL utilizes massive datasets and numerous training cycles to learn how an object looks, following a process during which the relevant features of an object of interest are extracted automatically. The DL architecture can then be applied to previously unseen images and make accurate predictions. DL-based methods perform remarkably better than traditional methods, albeit with trade-offs regarding computational requirements and training time [83]. As a result, they have vastly replaced traditional computer vision techniques, thanks to their ability to be easily adjusted, to extract complex features in much more detail, and to be much more efficient in terms of accuracy and versatility [83]. Tremendous research advancements have taken place in this domain, resulting in the development of numerous methods. The fundamental DL methods implemented in image computer vision applications are discussed in Section II-C.

1) CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) have been widely used in image processing applications over the past decades [62], [66], [133]. Their structure consists of a number of convolutional and pooling layers, stacked one after another [5]. The convolutional layer can be visualized as a square matrix $W$ of weights, called a kernel [87]. The kernel slides over the image looking for patterns; when it distinguishes a part of the image that is similar to its pattern, it returns a large positive value, otherwise it returns a small value. The input image is represented as a pixel matrix of size length × width × number of color channels (e.g. an RGB image has 3 color channels).
The convolutional layer is utilized for feature extraction and the pooling layer to downsample the resolution of the convolutional layer output. In this way, a dimension reduction is accomplished, which reduces the number of necessary parameters in the next layer, resulting in a less complex architecture. During the training process, the training samples are fed through the CNN and the error with respect to the desired output is calculated. The error and its gradient are then backpropagated through the network layers and the weights are updated.
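To make the stacked convolution-pooling structure and the training loop concrete, the following is a minimal PyTorch sketch; the layer sizes and data are illustrative placeholders, not an architecture from the cited works:

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """A minimal CNN: two convolution + pooling stages followed by a classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # kernel W slides over the RGB input
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: halves the spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                             # feature extraction
        return self.classifier(x.flatten(1))

# One training step: forward pass, error computation, backpropagation, weight update.
model = TinyConvNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(4, 3, 32, 32)      # batch of 4 RGB images (length x width x channels)
labels = torch.randint(0, 10, (4,))
optimizer.zero_grad()
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()                          # backpropagate the error gradient
optimizer.step()                         # update the weights
```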
CNN-based image object detectors can be separated into two main categories [105], [127]: • Two-stage approach: In the two-stage method, the first stage extracts region proposals and the second stage classifies those region proposals and determines the bounding boxes of the classified objects. In the region proposal part, sliding window techniques such as Deformable Part Models [20] are adopted. An additional region proposal technique, employed in region-based convolutional neural networks (R-CNNs) [27], is selective search [111]. R-CNNs extract around 2000 region proposals from each input image, which is a significantly reduced number of regions compared to other sliding window methods. At the second stage of this architecture, a CNN is used for object detection over the region proposals. The size of the proposed regions is arbitrary, while the CNN requires a fixed-size input. Hence, a major drawback of R-CNNs is that images need to be cropped or resized to meet the requirement for a fixed-size input. Spatial pyramid pooling [31], [42], [64] is a method used in order to achieve a fixed-size output irrespective of the input image size. Hence, spatial pyramid pooling networks can be trained and tested on images of varying sizes, which reduces overfitting of the model. Both R-CNNs and spatial pyramid pooling networks are particularly slow during training. Fast R-CNN [27] tries to solve this drawback by passing the original image through the CNN instead of using the region proposals. As a result, fast R-CNN is faster than R-CNN because the convolutional operation is implemented only once on the original image instead of 2000 times on the region proposals. Fast R-CNNs can train detection networks whose architecture involves multiple layers like VGG-16 [99], as they are 9 times faster than R-CNNs and 3 times faster than spatial pyramid pooling networks [105]. The drawback of the high time cost has been further addressed by faster R-CNNs [92]. In faster R-CNNs the time-consuming selective-search algorithm is replaced with a fully convolutional network that learns the region proposals of an image with arbitrary size. A major further development of the R-CNN family is achieved by Mask R-CNNs [41], which extend the previous architectures by labeling the pixels corresponding to each object instance. The Mask R-CNN inherits the region proposal network from faster R-CNNs and employs an additional branch that outputs a binary mask classifying whether or not a given pixel is part of an object. Two-stage approaches yield a high accuracy since each stage performs one specific task. However, in terms of real-time applications, two-stage approaches show weaknesses in computational time.
• One-stage approach: One-stage approaches skip the first stage of region proposal and run detection directly on the input image. This simpler architecture allows for faster inference; some networks achieve a processing speed of up to 150 frames per second (fps). There is a trade-off, however, in terms of accuracy. Notable one-stage methods are the ''you only look once'' (YOLO) network [91], which extracts class and bounding-box predictions directly from an input image using a CNN, and the single-shot detector (SSD) [71], which takes an input image and passes it through multiple convolutional layers with different filter sizes. A usage sketch of a pretrained two-stage detector follows this list.
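As a usage sketch of the two-stage family, the following loads a pretrained Faster R-CNN from torchvision and runs detection on a single frame, returning a variable number of boxes, labels, and scores per image (assuming a recent torchvision version; the `weights` argument name varies across releases):

```python
import torch
import torchvision

# Two-stage detector: Faster R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# torchvision detectors take a list of CHW float tensors in [0, 1].
frame = torch.rand(3, 480, 640)                # stand-in for a decoded video frame
with torch.no_grad():
    predictions = model([frame])[0]

# Each prediction holds a variable number of detected objects per image.
print(predictions["boxes"].shape,              # (N, 4) bounding boxes
      predictions["labels"].shape,             # (N,) class indices
      predictions["scores"].shape)             # (N,) confidence scores
```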

2) RESTRICTED BOLTZMANN MACHINES
The Restricted Boltzmann Machine (RBM) is a two-layer undirected graphical model [6] that was introduced in 1986 [46]. It consists of a set of visible nodes and a set of hidden nodes. RBMs are in essence a variant of Boltzmann machines, but without intralayer connections (i.e. no visible node is connected to any other visible node, and no hidden node is connected to any other hidden node). This restriction makes RBMs easier to implement and more efficient to train than general Boltzmann machines. The visible nodes receive the input, combine it with weights and a bias, and pass it to the hidden nodes. The value generated at the hidden nodes is in turn combined with weights and a bias, and the result is passed back to the visible nodes to reconstruct the input. Given the visible vector $V$, the hidden vector $H$, and the parameters $a_i$, $b_j$, $w_{ij}$, an RBM configuration is assigned an energy $E$ given by [24]:
$$E(V, H) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j.$$
Given this energy function, a probability $P$ is assigned to every pair $(V, H)$:
$$P(V, H) = \frac{e^{-E(V, H)}}{Z}, \qquad Z = \sum_{V, H} e^{-E(V, H)},$$
where the partition function $Z$ sums $e^{-E(V, H)}$ over all pairs of visible and hidden vectors.
For a given visible vector $V$, the probability assigned to hidden node $h_j$ is
$$P(h_j = 1 \mid V) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big),$$
where $\sigma(\cdot)$ is the logistic sigmoid function [38]. For a hidden vector $H$, the probability assigned to visible node $v_i$ is, respectively,
$$P(v_i = 1 \mid H) = \sigma\Big(a_i + \sum_j h_j w_{ij}\Big).$$
The weight parameters are optimized to maximize the likelihood of the visible and hidden vectors $(V, H)$. The intuition behind RBMs is based on associating a scalar energy to each combination of the variables of interest; learning therefore amounts to finding the configuration with the lowest energy.
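As an illustrative NumPy sketch of these conditional probabilities in action, the following performs one step of the standard contrastive-divergence (CD-1) training approximation on toy data (all sizes are arbitrary; this is not drawn from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # weights w_ij
a = np.zeros(n_visible)                                # visible biases a_i
b = np.zeros(n_hidden)                                 # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, size=n_visible).astype(float)  # binary input

# P(h_j = 1 | V) = sigma(b_j + sum_i v_i w_ij)
p_h = sigmoid(b + v0 @ W)
h0 = (rng.random(n_hidden) < p_h).astype(float)

# P(v_i = 1 | H) = sigma(a_i + sum_j h_j w_ij): reconstruct the input
p_v = sigmoid(a + h0 @ W.T)
v1 = (rng.random(n_visible) < p_v).astype(float)
p_h1 = sigmoid(b + v1 @ W)

# CD-1 update: lower the energy of the data, raise it for reconstructions
lr = 0.1
W += lr * (np.outer(v0, p_h) - np.outer(v1, p_h1))
a += lr * (v0 - v1)
b += lr * (p_h - p_h1)
```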
RBMs are useful for dimensionality reduction, classification, regression, and feature learning. However, because RBMs consist of only two layers, the complexity of the data representation that they can achieve is limited [24]. For this reason, a number of extended architectures has been developed. An example of such an architecture is the Deep Belief Network [44], which consists of multiple stacked RBMs. Deep Belief Networks are used for feature extraction in many computer vision applications. Besides Deep Belief Networks, another RBM-based architecture is the Deep Boltzmann Machine [95], [96]. Deep Boltzmann Machines are similar to Deep Belief Networks, although the former have only undirected connections between their layers, which makes them more robust to noisy observations, while the latter have bidirectional connections only in the last layer [104].

3) AUTO-ENCODERS
Auto-encoders [8], [45] refer to a specific type of neural network that aims to compress the input image data into a lower-dimensional (latent) representation and then reconstruct the original image from this representation. Their architecture consists of two main parts, namely the encoder and the decoder. The encoder maps an input vector of images $X$ into a compressed, lower-dimensional vector $Z$, while the decoder maps the latent variable $Z$ to a reconstruction of the input image. The encoder and decoder mappings $\phi: X \to Z$ and $\psi: Z \to X$ are given by:
$$\phi, \psi = \arg\min_{\phi, \psi} \left\| X - (\psi \circ \phi)(X) \right\|^2,$$
where the operator $\circ$ refers to function composition, i.e. $(\psi \circ \phi)(X) = \psi(\phi(X))$. The auto-encoder is trained with the objective of selecting the optimal encoder and decoder functions, so that the minimum amount of information is required to encode the image in order for it to be regenerated on the decoder side.
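A minimal PyTorch sketch of this encoder-decoder structure, trained on the reconstruction objective above (the dimensions are illustrative, assuming flattened 28x28 grayscale images):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder phi: X -> Z and decoder psi: Z -> X for flattened images."""
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))   # (psi o phi)(X)

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                        # batch of flattened 28x28 images
optimizer.zero_grad()
loss = nn.MSELoss()(model(x), x)               # || X - (psi o phi)(X) ||^2
loss.backward()
optimizer.step()
```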

III. DEEP LEARNING METHODS FOR DETECTION AND SEGMENTATION OF OBJECTS IN VIDEOS
Due to the similarity between video detection and image detection, methods from image detection are often applied to videos. The methods described above can be extended to the video domain by running detection on each image in a sequence of frames [7]. In this way, however, the temporal correlation between frames is not taken into account. In addition, running a detection algorithm on each frame is computationally inefficient, since there may be feature extraction redundancies between sequential frames. Furthermore, a video sequence may contain poor-quality frames, which can lead to low inference accuracy. One reason this extension is not trivial is that a video sequence introduces an additional dimension: the temporal one. In other words, instead of being considered a mere sequence of frames, a video should rather be regarded as a sequence of related frames. Due to the complexity of video data and the computational cost of training, research in this field has been limited. However, more and more video-related research works have surfaced lately, due to the release of ImageNet VID [93] and other massive video datasets. Depending on the architecture, DL-based techniques for video object detection can be broadly divided into six categories, namely (1) optical flow, (2) tracking, (3) long short-term memory, (4) gated recurrent unit, (5) self-attention mechanism, and (6) generative learning. In the following subsections, a critical appraisal of these architecture paradigms is presented.

A. OPTICAL FLOW
One of the most fundamental concepts in video processing is optical flow. Optical flow was originally introduced in [25] referring to human perception and the changing pattern of light that reaches our eyes. In computer vision applications, optical flow refers to the problem of estimating the displacement vector for each pixel in subsequent image frames [48].
A key assumption in optical flow is brightness constancy. In practice, this means that a pixel at position $(x, y)$ of an image at time $t$ moves to position $(x + \Delta x, y + \Delta y)$ at time $t + \Delta t$ while its brightness $I(x, y, t)$ remains constant:
$$I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t). \qquad (7)$$
The first-order Taylor series expansion of the left-hand side of (7) is
$$I(x + \Delta x, y + \Delta y, t + \Delta t) \approx I(x, y, t) + I_x \Delta x + I_y \Delta y + I_t \Delta t, \qquad (8)$$
where $I_x$, $I_y$, $I_t$ are the partial derivatives of the intensity function $I$ with respect to $x$, $y$, and $t$ respectively. Hence, substituting (8) into (7) and dividing by $\Delta t$, we can derive:
$$\nabla I \cdot v + I_t = 0, \qquad (9)$$
where $\nabla I = (I_x, I_y)$ and $v = \left(\frac{\Delta x}{\Delta t}, \frac{\Delta y}{\Delta t}\right)$ are the components of the optical flow, and $I_t$ is the temporal gradient of the intensity function.
Optical flow can be applied to estimate the motion of detected objects in video segments by assigning an optical flow vector to the pixels corresponding to the detected object.
Optical flow can be either ''sparse'' or ''dense''. Sparse optical flow estimates the flow vectors of some specific features, such as corners or edges of an object within an image frame. Dense optical flow, on the other hand, includes the flow vectors of all the pixels in an image frame. The latter method achieves higher accuracy than the former, although at the cost of increased computational requirements.
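As a usage sketch, dense optical flow can be computed with OpenCV's classical Farneback method, which returns one displacement vector per pixel (the frames here are synthetic stand-ins for decoded video frames):

```python
import cv2
import numpy as np

# Two consecutive grayscale frames.
prev_frame = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
next_frame = np.roll(prev_frame, shift=3, axis=1)      # simulate horizontal motion

# Dense optical flow: one (dx, dy) displacement vector per pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev_frame, next_frame, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(flow.shape, magnitude.mean())                    # (480, 640, 2)
```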
Recently, modern CNN architectures have been successfully used for optical flow estimation applications [18]. CNNs can be trained to run on pairs of images and to predict the optical flow field. These flow networks are employed in computer vision tasks for videos according to two different approaches. In the first approach, one neural network is responsible for the task of object detection and it is applied on sparse key frames. The extracted feature maps from these key frames are then propagated to the next frames with a flow network. This technique is called Deep Feature Flow (DFF) [132] and it achieves great computational efficiency due to the fact that it implements the object detection task only on key frames.
The second approach involving flow networks is known as flow-guided feature aggregation (FGFA) [131]. In FGFA, a feature extraction network is run on all individual frames to create the respective per-frame feature maps. The inference at a reference frame is enhanced with an optical flow network that predicts the motion between the reference frame and its adjacent frames. The propagated feature maps from neighboring frames are aggregated with the feature map of the reference frame using an adaptive weighting method. FGFA achieves higher inference accuracy, but at a higher computation time compared to DFF. For this reason, the impression network [43] was proposed as an architecture that combines the two abovementioned techniques, with the objective of taking advantage of both methods. Sparse key frame feature maps are aggregated with other key frame feature maps and, at the same time, propagated to non-key frames. The impression network outperforms DFF both in terms of accuracy and inference speed. It is also faster than FGFA, although it achieves a slightly lower accuracy level. An alternative architecture, which outperforms FGFA, is proposed in [17], where a two-stream feature aggregation approach is integrated into a one-stage detector to achieve video object detection. In particular, the first stream applies optical flow to estimate the motion and to aggregate the features along the motion path, while the second stream predicts the features of the frame of interest by spatio-temporal sampling and aggregation of features from the adjacent frames. The final predictions result from blending the outcomes of the two streams.

B. TRACKING
Visual tracking can be described as the problem of estimating an unknown target trajectory over a sequence of image frames [78]. Traditional methods employ a variety of tracking algorithms, such as mean shift algorithm [14], particle filtering [30], and Kalman filtering [54]. With the advancements in data science in recent years, novel DL-based visual trackers have been developed.
Object tracking outperforms optical flow in accuracy [129]. This can be explained by the fact that tracking uses shared networks to achieve feature extraction for detection and tracking. Hence, the requirements in terms of computational power are limited and at the same time, the fusion between the two tasks is performed in a more straightforward way, which achieves higher accuracy compared to optical flow based models.
CNNs were the first architecture adopted for DL-based visual tracking. In [19], a region-based fully convolutional neural network [15] is used for jointly performing detection and tracking in an integrated framework. The model is fed with a set of two consecutive image frames, from which the convolutional feature maps are computed. Object detection is run on each frame and a regressor is employed to compute the box transformation from one frame to the other. CNN-based object tracking models showcase some weaknesses in performance though, due to the scarcity of labeled data containing sets of two consecutive frames, which are necessary for their training, as well as their speed limitations with respect to real-time applications [79].
A baseline approach presented in [121] extends the Mask R-CNN to include an additional tracking branch with an external memory for tracking object instances across frames. The proposed architecture extracts the classification, the bounding boxes, and the segmentation predictions of Mask R-CNN, and it takes into account the past frame information only for tracking. In this way, the task of instance segmentation is extended to videos. CrossVIS [122] presents a novel, cross-frame learning approach that uses the features of an instance in the current frame to segment the same instance in other frames. Crossover learning is integrated with the instance segmentation loss as an objective to obtain cross-frame instance segmentation consistency, achieving a low computational cost. CrossVIS outperforms MaskTrack R-CNN [121] in terms of both accuracy and speed [122].
An additional DL-based method for tracking arbitrary objects involves Siamese Neural Networks (SNNs) [109]. SNNs have been extensively implemented in visual tracking applications in the past years [4]. An SNN is basically a two-stream network that takes as input pairs of target and search images and outputs a similarity map. In other words, SNNs learn a function $f: (z, x) \to f(z, x)$ which compares an image $z$ with a candidate image $x$, returning a high score when the two images are similar to each other. The position of a tracked object can thus be determined by checking all possible locations and selecting the one with the highest similarity to the previous frame. SNNs can learn the function $f$ from a training video dataset with labeled object trajectories, and they are one of the most promising methods for object tracking due to their performance and efficiency.
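A minimal sketch of the SNN similarity computation in the style of fully convolutional Siamese trackers [4]: a shared embedding is cross-correlated over the search region, and the peak of the similarity map marks the most likely target location (the embedding network here is a toy placeholder):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared embedding network phi (the same weights process both inputs).
phi = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)

z = torch.rand(1, 3, 32, 32)          # exemplar: the target to track
x = torch.rand(1, 3, 128, 128)        # search region in the next frame

phi_z, phi_x = phi(z), phi(x)

# f(z, x): cross-correlate the exemplar embedding over the search embedding;
# high responses mark candidate locations similar to the target.
similarity_map = F.conv2d(phi_x, phi_z)
peak = similarity_map.flatten().argmax()
print(similarity_map.shape, peak)     # e.g. (1, 1, 97, 97)
```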
Recurrent neural networks (RNNs) [28] are an alternative architecture employed in visual object tracking applications. RNNs operate on a sequence of vectors $x(t)$, where each vector can describe, e.g., an image frame from a video at time step $t$. In other words, an RNN is a neural network specialized for processing a sequence of values $x(1), \ldots, x(n)$, where $n$ is the length of the sequence, in a similar way as a convolutional network is specialized for processing a tensor representing an image. The same update rule is applied to each part of the output, resulting in the sharing of parameters through a deep computational graph. RNN-based methods are well suited to visual object tracking, since they take into account both spatial and temporal features of video frames [124]. They aim to improve tracking performance by utilizing temporal information such as past states of the target's position. However, their implementation is limited because their complex architecture involves a significant number of parameters that need to be determined [68].

C. LONG SHORT-TERM MEMORY
Although RNNs are naturally suited to time-series data, like videos, their implementation suffers from various weaknesses. First, while they take into consideration information from the previous time step, their performance deteriorates when information must be stored over a longer time period [60]. In some cases, information stored at a long-past time step is required to accurately predict the current output; RNNs are incapable of utilizing such ''long-term'' dependencies. In addition, RNNs have no mechanism to keep part of the past information and discard the rest. A further challenge in RNNs is that gradients propagated through the network tend to either vanish or explode, because the same weight matrix is repeated over all recurrent units. At the same time, optical flow techniques make use of temporal information only from two adjacent frames, without using temporal information from earlier frames. Long short-term memory (LSTM) [47] is an improved type of RNN that is capable of utilizing long-term dependencies.
The architecture of an LSTM cell is depicted in Figure 1. LSTMs are cells consisting of three parts which are known as gates. The first gate determines what part of the information coming from past time steps needs to be ''remembered'' or can be ''forgotten''. The second gate inputs information of the current time step to the cell. Finally, the third gate passes the updated information from the current time step to the next one. The first gate is called forget gate while the second and the third ones are called input and output gates respectively.
In the following equations f (t), i(t), o(t) represent the forget, input and output gate vectors respectively, σ is the sigmoid function, W (j) and b (j) refer to the weights and biases corresponding to the j-th gate's neurons, h(t − 1) refers to the output of the previous cell at time stamp t − 1, and x(t) represents the input at time t [49].
• Forget gate: $f(t) = \sigma\big(W^{(f)}[h(t-1), x(t)] + b^{(f)}\big)$
• Input gate: $i(t) = \sigma\big(W^{(i)}[h(t-1), x(t)] + b^{(i)}\big)$
• Output gate: $o(t) = \sigma\big(W^{(o)}[h(t-1), x(t)] + b^{(o)}\big)$
Moreover, an additional candidate vector $\tilde{C}(t) = \tanh\big(W^{(c)}[h(t-1), x(t)] + b^{(c)}\big)$ is used to modify the cell state $C$:
$$C(t) = f(t) \odot C(t-1) + i(t) \odot \tilde{C}(t),$$
where the operator $\odot$ corresponds to element-wise multiplication. The hidden state is equal to:
$$h(t) = o(t) \odot \tanh\big(C(t)\big).$$
LSTMs can maintain important information over a long sequence of data. [33] presents an extensive analysis of variants of LSTM as well as a review of the impact of the involved hyperparameters. In [75] an LSTM framework is developed as an extension of an SSD architecture, in order to associate detected object instances across consecutive frames. The proposed method outperforms other RNN architectures [110] and can be applied online. However, the weakness of this approach is that the SSD architecture involved is pre-trained in advance, and thus the SSD features do not get updated in response to the output of the LSTMs. In [70], an approach is suggested where LSTM is used in combination with interleaving conventional feature extractors with extremely lightweight ones. The main advantage of this approach is that minimal computation is required to produce accurate detections. In other words, an interleaved model framework is proposed, where multiple feature extractors are run sequentially or concurrently. A memory mechanism is then proposed to aggregate these frame-level features. A modified LSTM cell is used in [130] to achieve faster results with low computational requirements. The proposed architecture connects fast single-image object detection frameworks in series with convolutional LSTM layers, in order to propagate frame-level information over time. This architecture inputs a single frame of the video at a time and is quite simple. Hence, it achieves reduced computational cost as well as enhanced inference speed.
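The gate equations above can be implemented directly; the following NumPy sketch runs one LSTM cell over a toy sequence of frame-level feature vectors (all sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the gate equations above.

    W: dict of weight matrices, each of shape (hidden, hidden + input);
    b: dict of bias vectors of shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])          # [h(t-1), x(t)]
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i = sigmoid(W["i"] @ z + b["i"])           # input gate
    o = sigmoid(W["o"] @ z + b["o"])           # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = f * c_prev + i * c_tilde             # element-wise cell state update
    h_t = o * np.tanh(c_t)                     # hidden state
    return h_t, c_t

hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in "fioc"}
b = {k: np.zeros(hidden) for k in "fioc"}
h, c = np.zeros(hidden), np.zeros(hidden)
for t in range(5):                             # unroll over a toy feature sequence
    h, c = lstm_cell(rng.normal(size=inp), h, c, W, b)
```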

D. GATED RECURRENT UNIT
Similarly to LSTMs, gated recurrent units (GRUs) [13] are another type of RNN. However, GRUs have fewer parameters than LSTMs, since they only have two gates: the update gate and the reset gate. As seen in Figure 2, a GRU cell has no output gate, and it combines the input and forget gates of the LSTM into a single update gate. Due to their simplicity, GRUs are significantly faster than LSTMs.
The update and reset gates in a GRU cell are defined as in equations (16) and (17) respectively. In the following equations z(t), r(t) represent the update and reset gate vectors respectively, and W (j) , b (j) refer to the weights and biases corresponding to the j-th gate's neurons [49].
• Update gate: $z(t) = \sigma\big(W^{(z)}[h(t-1), x(t)] + b^{(z)}\big)$ (16)
• Reset gate: $r(t) = \sigma\big(W^{(r)}[h(t-1), x(t)] + b^{(r)}\big)$ (17)
The update gate determines the amount of previous time-step information that is passed along to the next state, while the reset gate is responsible for deciding what part of the past information is neglected. After multiplying the input vector and the hidden state with the weights of the reset gate, as presented in (17), the element-wise product between the reset gate and the previous time-step hidden state is calculated. A non-linear activation function is then applied to the result, leading to the candidate hidden state:
$$\tilde{h}(t) = \tanh\big(W[r(t) \odot h(t-1), x(t)] + b\big) \qquad (18)$$
The hidden state then reads as:
$$h(t) = \big(1 - z(t)\big) \odot h(t-1) + z(t) \odot \tilde{h}(t) \qquad (19)$$
In [9] an SSD-based architecture is extended to multi-frame data. Convolutional GRUs are employed in order to fuse features across multiple frames and to enhance the accuracy of object detection. From a mathematical perspective, this architecture replaces the dot product operator in the standard gated recurrent unit definition in (16)-(18) with the convolution operator. As reported in [23], this approach improves the existing SSD architecture by 2.7% in terms of mean average precision on the KITTI dataset [22]. An additional example is provided in [110], where first a pseudo-labeler is trained on individual labeled frames. The pseudo-labeler assigns labels to all video frames, and then a recurrent architecture with GRUs is trained, which takes sequences of pseudo-labeled frames as input. The standard cost function used for the training of the RNN is augmented with an additional term to ensure consistency across consecutive frames. In [112] a human activity recognition technique is proposed, where skip connections are introduced among GRU layers to ensure that, even in a deep architecture with multiple layers, there is no vanishing-gradient impact on the performance.
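For comparison with the LSTM sketch above, a GRU cell implementing (16)-(19) needs only two gates and no separate cell state (again with illustrative sizes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, b):
    """One GRU step implementing (16)-(19): two gates instead of the LSTM's three."""
    zin = np.concatenate([h_prev, x_t])
    z = sigmoid(W["z"] @ zin + b["z"])                 # update gate (16)
    r = sigmoid(W["r"] @ zin + b["r"])                 # reset gate (17)
    h_tilde = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])  # (18)
    return (1 - z) * h_prev + z * h_tilde              # hidden state (19)

hidden, inp = 4, 3
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in "zrh"}
b = {k: np.zeros(hidden) for k in "zrh"}
h = np.zeros(hidden)
for t in range(5):
    h = gru_cell(rng.normal(size=inp), h, W, b)
```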
Both LSTM and GRU can ensure that important information is maintained along long time-series data. GRU is faster than LSTM in terms of training speed [123]. Their performance is comparable, although in small datasets, GRU slightly outperforms LSTM.

E. SELF-ATTENTION MECHANISM
RNNs, LSTMs, and GRUs have been widely adopted in sequence modeling applications. However, because they process data sequentially, they do not allow for parallel computation; this becomes critical for long sequences of frames, as memory constraints limit the batch size of samples during training.
The self-attention mechanism [58] relates different elements of a sequence to generate a representation of that sequence. Contrary to the architectures mentioned above, it supports parallel processing of sequential data. It was originally proposed for machine translation [113], and its application was later extended to video data [26].
Three vectors are involved in the self-attention mechanism, representing the features (key vector), the values (value vector), and the values to be determined (query vector). Assume a sequence of $n$ elements $(x_1, x_2, \ldots, x_n)$, stacked as $X \in \mathbb{R}^{n \times d}$, with $d$ being the embedding dimension of each element [57]. Three learnable weight matrices transform the queries ($W^q \in \mathbb{R}^{d \times d_q}$), keys ($W^k \in \mathbb{R}^{d \times d_k}$), and values ($W^v \in \mathbb{R}^{d \times d_v}$). The input $X$ is first transformed with the weight matrices and projected onto $Q = XW^q$, $K = XW^k$, and $V = XW^v$. A similarity function is used to calculate the similarity between the query and the key vectors, and the self-attention layer outputs $Z \in \mathbb{R}^{n \times d_v}$:
$$Z = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,$$
where the softmax function is defined by
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{k} e^{x_j}}$$
for $i = 1, \ldots, k$ and $x = (x_1, \ldots, x_k) \in \mathbb{R}^k$. Self-attention thus determines the similarity between the keys and the queries by computing their dot product, which is then normalized using the softmax so that the sum of all the scores becomes equal to 1. Each output element is then given by the weighted sum of all elements in the sequence, with the attention scores as weights. The most well-known self-attention architecture is the transformer [113].
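A compact NumPy sketch of this scaled dot-product self-attention computation (toy dimensions; a transformer stacks many such layers with multiple heads):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Z = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # pairwise query-key similarities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax: rows sum to 1
    return weights @ V                              # weighted sum of all elements

n, d, dk, dv = 5, 8, 4, 4                           # toy sequence of n frame features
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Z = self_attention(X,
                   rng.normal(size=(d, dk)),
                   rng.normal(size=(d, dk)),
                   rng.normal(size=(d, dv)))
print(Z.shape)                                      # (5, 4): one output per element
```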
In [26] a transformer framework is developed to recognize and localize human actions in a video. A person feature is represented as the query (Q) and the features from adjacent video frames correspond to the key (K) and the values (V). A video instance segmentation architecture built upon transformers is proposed in [116]. Four modules are included in the developed architecture: a backbone CNN to extract features over the video frames, an encoder-decoder transformer that determines the similarity of features on pixel and instance level, an instance-sequence matching, and a segmentation module. The overall performance of this framework is competitive compared to the single-model approaches tested on the YouTube-VIS dataset [121], although it is somewhat lower in comparison to other complex CNN-based models [3].
In [35] a constrained self-attention architecture is proposed for video object detection, capturing motion cues under the assumption that moving objects follow a continuous trajectory. An additional self-attention-based architecture is proposed in [36], which is applied in the temporal-spatial domain to align two feature maps of consecutive frames. The proposed method features a small number of parameters, while it achieves higher accuracy in comparison to optical flow-based methods such as DFF and FGFA. A related, efficient, and simplified architecture for video object detection via aggregating semantic features across frames is presented in [118]. Cosine similarity is implemented to compute the semantic similarities of the extracted proposals across frames, which are then aggregated accordingly. In [16] an object relation module is employed as part of a multi-stage architecture, in order to extract object relations in both spatial and temporal context. The relations are then further distilled with refined supportive object proposals and propagated across frames. Finally, in [98] an attention-based module is developed to learn long-range temporal relations between objects, in order to propagate the extracted features. The proposed architectures in [16], [118], and [98] outperform optical flow-based approaches in accuracy.

F. GENERATIVE LEARNING
The objective of generative learning is to approximate the complex, high-dimensional probability distribution that generates a class of data, in order to generate similar data. Developing generative architectures to understand complicated data distributions has been a long-standing research problem [84]. Recent works in this area [29], [59] have provided a new set of generative algorithms that can efficiently generate video segments or extract features from them. The most prominent generative algorithms are variational auto-encoders (VAEs) and generative adversarial networks (GANs).
• Variational auto-encoders: Their architecture resembles an auto-encoder, with the difference that their latent variable distribution is regularised during training. VAEs stemmed from the limitation of auto-encoders in generating new, unseen data, due to the fact that the distribution of the latent variable is unknown. To alleviate this issue, VAEs are trained to learn the distribution of the latent variable, assuming that it follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ [50]. One example of a VAE-based architecture for video object detection is presented in [67], where a modified VAE architecture, built on top of a Mask R-CNN, is proposed in order to detect and segment multiple instances in diverse videos. The proposed architecture outperforms MaskTrack R-CNN [121], because the MaskTrack R-CNN architecture depends entirely on the Mask R-CNN to perform predictions, resulting in difficulties in handling false negative proposals of the Mask R-CNN in highly diverse videos with occlusions, deformations, and pose variations of objects. By contrast, the architecture proposed in [67] merges a VAE with a Mask R-CNN network in a topology consisting of one encoder and three decoders. This results in three parallel branches that provide strong complements for predictions about bounding boxes and mask features, and significantly reduces the number of false negatives in the Mask R-CNN module.
• Generative adversarial networks: Generative adversarial networks are built on the basis of a two-player, min-max game. The generator network $G$ and the discriminator network $D$ correspond to the first and the second player respectively. The generator's objective is to mislead the discriminator by generating natural-looking data (e.g. images, videos, etc.) from a random latent vector $z$. The discriminator, on the other hand, tries to distinguish whether the data are real or fake (generated). The game is modeled as the following optimization problem:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \qquad (22)$$
A generative adversarial approach is developed in [102] to randomly generate masks that correspond to object appearance variations in time. The masks are then applied to reduce overfitting via adaptively dropping out input features. The developed architecture identifies the mask that maintains the most robust features of the target objects over a long period of time. In [106] a GAN is trained on color and depth information in order to generate backgrounds similar to the test samples. The generated background samples are then subtracted from the given test samples to detect foreground moving objects. Finally, in [11] the encoder-decoder architecture of [82], which is limited to processing information between only two adjacent frames, is extended with a GAN, to enforce temporal and spatial coherence of the generated object masks and to exploit information within a longer temporal window. The developed architecture exhibits similar accuracy to other state-of-the-art computer vision methods, while it is almost four times faster.
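A minimal PyTorch sketch of one adversarial training step under objective (22), using the common non-saturating generator loss; the generator and discriminator here are toy multilayer perceptrons rather than video architectures:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 16
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, data_dim)               # stand-in for real data samples
z = torch.randn(32, latent_dim)                # random latent vectors

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# Generator step: mislead the discriminator (non-saturating form of (22))
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_g.step()
```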

IV. CHALLENGES IN DEEP-LEARNING-BASED COMPUTER VISION
Despite the tremendous advances in deep learning and the fast pace of its breakthroughs over the last years, there are still challenges that prevent it from reaching its full potential. This section illustrates a set of major challenges related to computer vision tasks on video analysis with DL techniques. DL-based methods have succeeded in achieving even human-level performance in complex, computer vision tasks. However, this is possible only when massive datasets are available for training. Data are the core of any DL-based process and hence their shortage is often responsible for poor performance. Large-scale amounts of data are not available for all video applications though.
The impact of data scarcity is further escalated by the stand-alone approach of DL. A typical workflow for developing a DL module consists of creating a training set of inputs associated with outputs and learning the relations between them. In this way, however, the architecture becomes free-standing and isolated from prior, useful knowledge. Hence, the DL performance is highly determined by the existence of big-volume datasets while at the same time, applications that are more related to common sense reasoning and less to categorization, cannot be sufficiently targeted with purely DL methods [76].
Generalizability is an additional major challenge, concerning the performance of a data-driven model trained on one dataset when applied to other datasets. When training deep neural networks with high complexity and numerous parameters, the cost function might have multiple minima which minimize the training error but do not generalize well to unseen data. The presence of noise and outliers in the training dataset is an additional reason for poor generalizability. Generalizability further deteriorates due to the weakness of DL methods in dealing with hierarchical structures, since DL modules tend to fail when generalization depends on compositional processes [63].
At the same time, although correlation does not imply causation, they do not seem to be distinguishable for DL. Numerous neural network architectures have surfaced over the last decades that are highly capable of discovering complex correlations in data, yet they lack in reasoning about cause-effect relations or environment changes.
Finally, deep learning has delivered new, highly performing approaches in computer vision tasks, whose dominance, however, remains inversely proportional to their explanatory power. Rationalizing the output of data-driven techniques is a critical issue since more and more data-driven systems are adopted in safety-critical and high impact applications.

V. INTEGRATING DEEP LEARNING WITH DOMAIN KNOWLEDGE

A. MOTIVATION
A prudent approach to address the abovementioned challenges is to expand the current methods and to merge them with principles that govern the dynamic behavior of systems over time, enabling an adaptation to new, unseen scenarios. Combining DL-based techniques with equation-based dynamic models (DMs) in a complementary way, or in other words, integrating common-sense understanding into artificial intelligence, constitutes a particularly interesting challenge for computer vision systems.
Enabling data-driven vision systems to understand the principles that govern the behavior of objects is essential for developing autonomous systems that understand observed scenarios and can apply these principles to never-seen situations. Leveraging domain knowledge to identify equation-based models that describe how the properties of objects and entities change over time, and embedding them into DL techniques, can lead to novel, highly robust, and high-performing architectures. Such models could be developed, for instance, from well-known first principles describing how an object moves, and could be coupled with DL methods to form a hybrid computer vision architecture. Hybrid architectures can be more efficient than purely data-driven or purely model-based techniques, as they harness the benefits of both disciplines. Hybrid methods that combine scientific domain knowledge with data-driven models allow for accurate inference even with imperfect models and limited amounts of data.
The integration of the two disciplines in a hybrid architecture can be realized either by infusing mathematical rules to a DL architecture or by combining the operation of the two separate modules in a complementary manner. An advantage of this second version of a hybrid architecture is the fact that an easy and straightforward recalibration of the DM module is feasible if a bidirectional interaction between the two modules is enabled. More specifically, the DL module, which can be re-trained incrementally when new data become available, can also enable the recalibration of the DM module. This results in a hybrid architecture which is highly flexible and easily adaptable to different scenarios.
Hybrid architectures merging data-driven techniques with domain knowledge, such as physics, have recently been developed, introducing a novel research field which is still in its infancy [55], [90]. As a result, their applications are limited mainly to topics related to climate science and geology. Their expansion to other disciplines, like computer vision, remains a challenging research topic but would undoubtedly contribute towards addressing the abovementioned impediments of purely data-driven methods.

B. HYBRID ARCHITECTURES
A taxonomy of four general classes for integrated data-driven and model-based techniques can be derived. This classification is based on the level at which the integration takes place [55], [90]. More specifically, the four classes are: (1) preprocessing level, (2) initialization, (3) design of architecture, and (4) regularization. This section presents an analysis of these different methodologies.

1) DATA PREPROCESSING LEVEL
Data preprocessing is essential in all data-driven techniques before the data are passed through the DL module. The reason is straightforward: the quality of the data determines the information that can be extracted and hence directly influences the learning process of the DL algorithm.
The concept of data preprocessing is a major area in the field of deep learning. There are three main steps involved in data preprocessing: (1) data cleaning, (2) data transformation, and (3) data reduction. Data cleaning refers to the handling of missing data as well as to noise removal.
Data transformation may include normalization of the data, band-pass filtering, downsampling, and feature selection. When the input involves time-series signals, the data can be converted to the frequency domain via the fast Fourier transform (FFT). This transformation can be applied in anomaly detection, e.g. in the bearings of a rotating machine [94]. Finally, reducing the dimension of the feature set is another technique widely applied when preprocessing the data. A thorough analysis of data preprocessing techniques is presented in [21].
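As a brief sketch of the FFT-based transformation step mentioned above, using a synthetic vibration-like signal (the sampling rate and frequencies are illustrative):

```python
import numpy as np

fs = 1000                                    # sampling rate in Hz (illustrative)
t = np.arange(0, 1.0, 1.0 / fs)
# Toy vibration signal: 50 Hz fundamental plus a weak 120 Hz fault component.
signal = np.sin(2 * np.pi * 50 * t) + 0.3 * np.sin(2 * np.pi * 120 * t)

# Convert the time series to the frequency domain with the FFT.
spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

# The dominant peaks reveal the characteristic frequencies of the machine.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))                         # [50.0, 120.0]
```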

2) INITIALIZATION
One important design choice when building a neural network architecture is the parameter initialization [117]. Iterative optimization algorithms such as gradient descent are used during training to estimate the network's parameters, and they require an initial value for the parameters as a first step. Quite often the parameters are initialized from a random distribution. Random initialization, however, can make the optimization algorithm employed for the calculation of the network weights converge to local minima or saddle points.
An approach towards this issue would be to use a technique called transfer learning [85]. The basic idea of transfer learning is to pretrain a neural network on a simpler, related problem, under the assumption that a large quantity of data is available for it. This pretrained neural network can then serve as the initial state for the training of the original problem, as it is closer to the optimal parameter values than random initialization. Transfer learning is a widely used technique in complex DL applications such as natural language processing and computer vision. However, the performance of this technique is highly dependent on the availability of big-scale data. An alternative approach is to employ domain-specific knowledge to assist the selection of the initial values of the parameters involved [55]. In this way, first-principle models can be used to generate approximate simulations for the initialization of the parameters of the neural network. Domain knowledge can ensure a reliable initialization of the parameters, which can assist in achieving generalizable, interpretable, and physically consistent architectures.
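A short sketch of transfer-learning-based initialization with torchvision (assuming a recent torchvision version; the `weights` argument name varies across releases): the pretrained backbone provides the initial parameters, and only a new task-specific head is trained:

```python
import torch
import torch.nn as nn
import torchvision

# Pretrained backbone: parameters start close to a good optimum rather than random.
model = torchvision.models.resnet18(weights="DEFAULT")

# Freeze the pretrained feature extractor; only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

num_classes = 5                                 # illustrative target task
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, num_classes, (4,))
optimizer.zero_grad()
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
optimizer.step()
```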

3) DESIGN OF ARCHITECTURE
Data-driven techniques have made a major impact in realizing high-performing systems that solve hard problems related to pattern recognition, prediction, etc. However, a major impediment to their wide adoption in critical applications is their ''black box'' nature, since our understanding of their complexity is limited. Hence, domain knowledge can be infused into a DL architecture to ensure its interpretability.
One possible approach for this integration is to feed the output of the equation-based model $f_{DM}$ as input to the DL module $f_{DL}$, i.e. $f_{hybrid}: (X, P_{DM}, P_{DL}) \to Y$, where $X$ is the input, $Y$ is the output, $P_{DM}$ and $P_{DL}$ are the parameters of the dynamic model and the DL model respectively, and $f_{hybrid}$ is the composition of the two functions, $f_{hybrid} = f_{DL} \circ f_{DM}$ [90].
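A minimal sketch of this composition, with a toy free-fall model standing in for the equation-based module $f_{DM}$ and a small network as $f_{DL}$ (all names, dynamics, and dimensions are illustrative assumptions, not taken from the cited works):

```python
import torch
import torch.nn as nn

def f_dm(x: torch.Tensor) -> torch.Tensor:
    """Equation-based dynamic model (toy example: free fall over one step dt).

    Input and output are batches of [position, velocity] states.
    """
    dt, g = 0.1, 9.81
    pos, vel = x[:, 0], x[:, 1]
    return torch.stack([pos + vel * dt - 0.5 * g * dt**2, vel - g * dt], dim=1)

# DL module f_DL: learns a correction on top of the DM prediction.
f_dl = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))

def f_hybrid(x: torch.Tensor) -> torch.Tensor:
    """f_hybrid = f_DL o f_DM: the DM output is the input of the DL module."""
    return f_dl(f_dm(x))

states = torch.randn(8, 2)            # batch of observed [position, velocity]
print(f_hybrid(states).shape)         # (8, 2): corrected next-state prediction
```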
Two main categories of architectures can result from merging DL with dynamic models, founded on prior domain knowledge. In the first category, the output of the model is fed through the DL module at the first or at an additional layer. In the second category, the model is embedded into the DL module. Many architectures with respect to the first class have surfaced lately in the field of climate and geology applications. In [52], [56], the output of a physics-based model is provided as an additional input feature to the DL module in an application related to predicting the temperature of a lake based on the depth. In [86], a physics-based neural network architecture is used in order to simulate broadband earthquake ground motions. The DL module is used to predict the ground motion in the short term, including transient effects, which are particularly complex to model mathematically. The DM module is then used to simulate the response in a long-term period.
In the second class, the DM module is embedded into the DL module architecture. An example of this class is a physics-based model with an RNN including LSTMs [101] where the sensor data as well as the DM generated output are ingested as input to the RNN architecture.

4) REGULARIZATION
Deep neural networks can involve numerous parameters. However, when no large amounts of data are available, deep neural networks tend to overfit or, in other words, they fail to discover the underlying relationship described by the training data and hence cannot extrapolate to observations outside the training set. One way to handle this issue is to apply physical constraints on the loss function of the neural network. Several regularization techniques have been developed in this way to prevent neural networks from overfitting. This is achieved by applying penalties to layer parameters, and by integrating these penalties into the loss function that is minimized during training. The loss function in that case takes the following form [117]:
$$\mathcal{L} = f_{Trn}(\hat{Y}, Y) + \lambda R(W) + \gamma f_{Phy}(\hat{Y}) \qquad (23)$$
where $f_{Trn}$ corresponds to a function that represents the error between the predicted value $\hat{Y}$ and the true value $Y$; this function can be, for example, the mean squared error or the cross entropy. In addition, $\lambda$ represents a hyperparameter determining the weight of the regularization term $R(W)$. The first two terms of (23) describe the standard loss function used when training a neural network. The additional term $f_{Phy}$ corresponds to the physics-based constraint, and it aims to ensure the consistency of the trained system with first-principle laws or dynamic models; its weight is represented by the hyperparameter $\gamma$. Given the true value $Y$, the following is considered as the general optimization problem to solve for (23):
$$\min_W \; f_{Trn}(\hat{Y}, Y) + \lambda R(W) + \gamma f_{Phy}(\hat{Y}) \qquad (24)$$
By introducing model-based constraints in the loss function for the training of DL modules, scientific consistency is achieved, which is essential for training generalizable models. In addition, the physics-based loss function $f_{Phy}$ requires no labeled data, which allows the training of the DL module to be extended to non-labeled data. A plethora of implementations imposing physics-based constraints on the training of DL models has surfaced recently [81], [103], [107]. In [56] a physics-based loss function is used for the training of a lake temperature predictor. The loss function encompasses a constraint resulting from the relationship between the temperature, the density, and the depth of the lake water. In this way, the trained predictor achieves enhanced generalizability while consistency of the results with first-principle laws is ensured. In [51], the application of lake temperature prediction is extended to include temporal physical processes. More specifically, a physics-based RNN is developed that involves energy conservation constraints. Standard LSTM models store specific information at each time step, which feeds into the next time step. However, when the models are trained on data from specific seasons or from multiple years, it is difficult to generalize to data from different time periods, since the time profiles vary significantly from each other. By including the energy flux changes, however, which determine the temperature changes, the architecture can successfully predict the lake temperature even on unseen data. Another example is given in [53], where the data-driven model is penalized with the equation describing the time evolution of waves, in order to identify the location of underwater obstacles from acoustic measurements. In this way, the accuracy of the model outside the training dataset is enhanced. Finally, [10] presents a case where multiple physics-based terms are present in a loss function.
These might be competing loss terms with multiple local minima and correspond to different physics equations that need to be minimized together. Hence, an approach is presented where the contribution of each term is adaptively tuned during the training phase in order to improve the generalizability of the developed architecture.
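A minimal sketch of a physics-regularized training step in the form of (23); the governing equation used for $f_{Phy}$ here (an assumed derivative constraint $y' = \cos x$) is a toy stand-in for the first-principle constraints discussed above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lam, gamma = 1e-4, 1.0                        # lambda and gamma in (23)

x_lab = torch.rand(16, 1)                     # small labeled set
y_lab = torch.sin(x_lab)                      # stand-in ground truth
x_unlab = torch.rand(64, 1, requires_grad=True)   # unlabeled points: f_Phy needs no labels

optimizer.zero_grad()

# f_Trn: standard supervised error term
f_trn = nn.MSELoss()(model(x_lab), y_lab)

# R(W): L2 weight regularization
r_w = sum((p ** 2).sum() for p in model.parameters())

# f_Phy: penalize violation of the assumed governing equation y' = cos(x)
y_hat = model(x_unlab)
dy_dx = torch.autograd.grad(y_hat.sum(), x_unlab, create_graph=True)[0]
f_phy = ((dy_dx - torch.cos(x_unlab)) ** 2).mean()

loss = f_trn + lam * r_w + gamma * f_phy      # the combined loss of equation (23)
loss.backward()
optimizer.step()
```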

C. HYBRID ARCHITECTURE IMPLEMENTATION IN COMPUTER VISION
Integrating useful domain knowledge into DL-based computer vision tasks is essential to build robust, generalizable systems and to compensate for the lack of large-volume training data. An example of such a hybrid architecture is proposed in [103], where the height of a free-falling object is estimated in each frame of a video by training a CNN to detect and track objects obeying the laws of free fall. The training of this CNN is based on a loss function in which first-principle laws are encoded. In [1], physics is blended with DL in the framework of a two-stage encoder with the aim of recovering the shape of an object from polarized photos. In [61] an LSTM architecture is combined with a dynamics model in order to acquire a proposal distribution over an object's state. Finally, in [119], a generative vision system is proposed for estimating physical features of objects by integrating the output of a multi-physics simulation engine in the loop. Integrating DL techniques with domain knowledge is a recently introduced research topic [55], [90]. As a result, using domain knowledge to derive first-principle models or, from a broader perspective, any dynamic mathematical or computer model [73] that describes how the properties of objects and entities change over time (Figure 3), and merging them with existing DL architectures, constitutes an especially promising research task to address the challenges of DL in computer vision.

VI. OUTLOOK: FUTURE DIRECTIONS IN DEEP LEARNING FOR OBJECT DETECTION AND SEGMENTATION IN VIDEOS
Deep learning has brought a catalytic effect in the field of computer vision for video analysis. Although nobody knows with certainty how DL will evolve over the coming decades, it is expected that much of the future research will revolve around the following critical areas [32], [77], [114]:
• Out-of-distribution generalization: Future computer vision systems should be able to make accurate predictions not only in a known context but also for data with different distributions than the ones learned from the training samples. The main difficulty of DL systems in accurately generalizing and predicting on unseen data is caused by the fundamental assumption that training and test data are independent and identically distributed (IID) [97], [128]. In many real-life cases, however, the IID assumption is hardly satisfied. The ability to generalize under distribution shifts is of critical significance, and hence the investigation of out-of-distribution generalization is expected to attract enormous research interest in the academic field.
• Deep learning systems with causal structures: Causality is expected to be a central strand of DL research in the coming years [89]. Developing DL systems that can represent causal relationships can increase their safety and reliability, and introducing a causal understanding of basic concepts in DL methods could certainly be the key to achieve robustness in complex real-world environments.
• Effective representation learning with few or no labeled data: While techniques for representation learning when massive labeled datasets are available have become remarkably powerful, various challenges remain in the case of limited labeled data. Developing approaches for addressing the issue of labeled data scarcity is an emerging popular direction of research.
• Adaptation in time-varying environments: Adapting to time-varying environments and other dynamicbehavior-related problems has been under examination for many years and it is expected to gain massive attention by the DL research community over the coming years. Allowing integration of new knowledge online and at the same time being capable of preserving the knowledge learned during previous interactions are only a few of the desirable features of future vision mechanisms.
• Multi-modal learning: Ultimately, major emphasis in research is expected to be placed upon developing methods that can process and link information combining modalities from various architectures [65], [76], since unimodal DL methods seem to fail to fulfill all the desirable future DL capabilities. In particular, combined architectures that integrate DL modules with domain knowledge could provide a suitable answer to most research questions arising from the DL directions listed above.

VII. CONCLUSION
In this paper, a study of the detection and segmentation of objects in video segments is presented. A review of the currently existing techniques has been given, as well as of the major challenges that data-driven techniques face.
Then, an extension of the data-driven techniques to a hybrid architecture that fuses data-driven techniques with equation-based models describing the dynamic behavior of objects and entities over time has been proposed, in order to address issues like data scarcity, generalizability, and interpretability of purely data-driven architectures. Finally, a survey of the current developments in hybrid architectures has been presented. We hope that this work will assist in better understanding the current status of DL in computer vision for video analysis, as well as in presenting interesting directions as guidelines for future work.