A reinforcement learning-based adaptive ROI generation for video object segmentation

The primary goal of video object segmentation is to automatically extract the principal foreground object(s) from the background in videos. Current deep learning-based models focus primarily on learning discriminative foreground representations over motion and appearance within short-term temporal segments. During the segmentation process, it is difficult to handle challenges such as deformation, scale variation, motion blur, and occlusion. Furthermore, relocating the segmentation target in the next frame is difficult if it is lost in the current frame. This work aims to solve the zero-shot video object segmentation problem in a holistic fashion. We take advantage of the inherent correlations between video frames by incorporating a global co-attention mechanism to overcome these limitations. We propose a novel reinforcement learning framework that provides efficient and fast stages for gathering scene context and global correlations. The agent concurrently calculates and adds the co-attention responses in the joint feature space. To capture the different aspects of the common feature space, the agent can generate multiple co-attention versions. Our framework is trained using pairs (or groups) of video frames, which augments the training data and thus increases the learning capacity. Our approach encodes the important information during the segmentation phase by simultaneously processing various reference frames, which are subsequently utilized to predict the persistent and conspicuous foreground objects. The proposed method has been validated on four commonly used video object segmentation datasets: SegTrack V2, DAVIS 2016, CdNet 2014, and the Youtube-Object dataset. The results reveal that the proposed method improves on state-of-the-art techniques by 4% in F1 measure on DAVIS 2016, by 12.03% in Jaccard index on SegTrack V2, and by 13.11% in Jaccard index on Youtube-Object.
Meanwhile, our algorithm improves accuracy by 8%, F1 measure by 12.25%, and precision by 14% on CdNet 2014, thus ranking higher than the current state-of-the-art methods.


I. INTRODUCTION
Video object segmentation (VOS) is a technique for automatically distinguishing the foreground object(s) from the background in videos. Zero-shot video object segmentation (ZVOS) is very helpful for both applications and research since it requires no manual interaction during the inference phase. In addition to the common challenges in video processing (e.g., occlusion, object deformation, and backdrop clutter), ZVOS is confronted with a new challenge: how can the primary objects be correctly distinguished from a complex background when no prior knowledge of the object is available? Two qualities are required for primary video object recognition: the objects in ZVOS should be recognizable in a single frame (locally prominent) and must appear throughout the video sequence (globally consistent). Although the primary objects are highly correlated at the macro level (the whole video), camera movements, articulated body motions, out-of-view movements, occlusions, and ambient changes often create discontinuities at the micro level (individual frames or shorter video snippets). Consequently, when dealing with problems caused by micro-level changes, it is better to rely on data from other frames (i.e., the global consistency property).
Looking at ZVOS from a global perspective reduces local ambiguity and helps identify the primary objects. Although this view inspired most conventional heuristic models of video segmentation [1], [2], it is not favored by current deep learning (DL) based techniques. The best-performing deep ZVOS models currently focus mainly on discriminative intra-frame features, learning the salient objects from motion or appearance while ignoring their global occurrence consistency across multiple frames. Optical flow is typically calculated across a few consecutive frames [3]-[7] and is thus limited to a narrow temporal receptive window. Even though recurrent neural networks (RNNs) [8], [9] were designed to retain data from previous frames, this sequential processing does not effectively explore the intricate relationships between distant frames.
Most works in the machine learning (ML) domain solve the VOS problem at the pixel level, using fully convolutional networks (FCNs) to conduct dense pixel-wise classification for each image. Some studies focused on the coherent classification of objects based on object proposals. Despite extensive research in both the image and video domains, several recent techniques still have significant drawbacks. Supervised learning is a kind of ML algorithm that trains models to predict outputs using labeled training data: a portion of the input data is already annotated with the desired result, and this guidance serves as a kind of teacher, handling individual structures or group decisions sequentially.
Abdulhussain et al. [10] proposed a temporal video segmentation (TVS) approach that reliably recognizes various types of video transitions with low computation cost and high recall. Orthogonal moments are used as features to detect transitions. To improve the accuracy and speed of the TVS technique, embedded orthogonal polynomial algorithms and fast block processing are used to extract features. Li et al. [11] proposed an Attention-Guided Network (AGNet) that adaptively strengthens inter-frame and intra-frame features for more accurate segmentation predictions. They appended a spatial attention module (SAM) and an adjacent attention module (AAM) on top of a dilated fully convolutional network (FCN) to model the feature correlations in the spatial and temporal dimensions, respectively.
Nakamura et al. [12] proposed a semi-supervised strategy that assumes a set of weakly labeled videos with sparsely annotated frames. The frames are supplied as input, with the annotated frames used to train a feature extractor. The proposed method divides the input videos into small chunks of fixed length, known as primitive segments, which are then grouped using the visual characteristics collected by the aforementioned feature extractor. Wang et al. [13] introduced Noisy-LSTM, a novel model for capturing temporal coherence in video frames that can be trained end-to-end using ConvLSTMs, together with a simple but effective training method that substitutes a frame in a video sequence with noise. Chakrobarty et al. [14] proposed a novel shot boundary recognition method based on color and gradient information. Luminance distortion and gradient similarity are used to calculate the structural and contrast changes of each frame. By detecting probable transitions with an adaptive threshold, the proposed approach accounts for the impacts of changes in brightness and contrast-structure. Chakrobarty et al. [15] also proposed a video segmentation method based on mean luminance patterns and the CIEDE2000 color difference. CIEDE2000 uses the Lab color space, which is efficient and trustworthy; its main benefit is that it can accurately reproduce all available colors as perceived by the human eye. The highlights and limitations of other TVS algorithms are summarized in Table 1. We formalize VOS as a conditional decision-making process to tackle this problem: two RL agents are employed to calculate the attention summaries between the two feature embeddings.
If groups of frames are given, the agent calculates the enhanced feature based on the pair-wise co-attention between the original frame and the correlation information from the other N frames. The VOS model uses this information for the update process, and it is further fed into the segmentation network to find accurate segmentation masks. Figure 1 shows the segmentation of three frames of the DAVIS 2016 dataset by our reinforcement learning (RL) algorithm. As seen in the results in the 2nd and 4th rows, the boundaries of our segmentation masks are clearly visible.
To select the optimal segmentation mask for the frames, the features are fed into an RL model. The RL model then determines the best action and chooses the most suitable mask for the current frame; thus, an accurate segmentation result is obtained by the segmentation model. The RL agent learns to capture the complex correlations between a pair or group of frames from the same video. This is achieved via a gated, differentiable co-attention approach that enables the network to pay more attention to informative correlated regions while generating more discriminative foreground features. Our RL model gives more accurate results for a testing frame when several reference frames and their correlations with the testing frame are utilized; when only the data from a single testing frame is used, the results are poor. Another advantage of our RL model is that it can supplement the training data: it enables a large number of random frame pairings to be utilized within a single video. The proposed model also removes the need for time-consuming and computationally expensive optical flow calculations, because the connections between video frames are explicitly specified. Finally, our RL model offers a single framework for collecting rich contextual data from video sequences from start to finish. To summarize, this paper provides four significant contributions: • We propose a single, end-to-end RL framework where two RL agents are employed to calculate rich features between the video frames using a differentiable co-attention mechanism. This helps in recognizing the primary foreground objects.
• The correlations are learned by the pair-wise co-attention mechanism between the frame pairs, which are further fed into the segmentation network to obtain the optimal segmentation mask. We adopt the Deep Deterministic Policy Gradient (DDPG) algorithm to train the agent to produce the correct object segmentation masks.
• The RL agent calculates the correlations among the video frames using the group co-attention mechanism, resulting in a powerful moving-object pattern modeling framework. The remainder of the paper is organized as follows. Section II describes the current state-of-the-art methods in video segmentation. Section III presents our proposed RL technique for segmenting video objects; it involves modeling the RL actions, states, and rewards to enhance VOS performance using a pair-wise co-attention mechanism. Section IV describes the results and discussion in detail. We conclude in Section V and give our future directions.

II. RELATED WORK
The VOS problem is addressed in either a zero-shot (unsupervised) or one-shot (semi-supervised) setting, depending on the degree of supervision given at test time. This research focuses on the problem of ZVOS, which extracts the primary object(s) without requiring any human intervention at test time. ZVOS arose from the long-studied problem of automated VOS in computer vision. Automated video segmentation algorithms usually emphasize compact, spatiotemporally connected groupings of video pixels (consistent in motion and appearance). Motion analysis is one of the early solutions [21], [22], based on assessing background-induced motion patterns and geometry constraints. A wide class of trajectory-based models is used to exploit long-term motion information. Popular techniques include super-voxels [24], temporal superpixels [25], and hierarchical segmentation [26]. After low-level video over-segmentation, researchers shifted their focus towards video object pattern modeling. Object-related signals such as object proposals [23], [27], [28] and saliency information [1], [29]-[31] are utilized to infer the primary video objects. The works described above made substantial improvements in VOS; still, the limited representation capacity of hand-crafted features failed when the heuristic assumptions did not hold. Inspired by the success of DL, many methods [32]-[34] using DL features have recently started to tackle the ZVOS problem. These models improve performance with large, fully connected network topologies [32], [34], but are limited by the lack of end-to-end learning capabilities. Later research focused on ZVOS models constructed entirely on convolutional neural networks. To differentiate between independent object motion and camera motion, Tokmakov et al. [9] proposed a learnable motion pattern network.
Li et al. [6] trained an instance embedding network on static images and then recognized the background by incorporating motion-based bilateral networks [7].
FCNs are another popular method for combining appearance and motion data for object inference [3]-[5], [20], [35], [36]. Other research explored ZVOS via robust network topologies [9], [20], [37] and the teacher-student learning paradigm [38]. Wang et al. [39] created an attention-guided ZVOS model by viewing the task as a dynamic attention prediction problem, demonstrating the substantial relationship between moving object patterns and human attention. These deep ZVOS models often provide excellent results, showing the advantages of neural networks for this task. On the other hand, they focus exclusively on short-term temporal information and the sequential nature of ZVOS, failing to exploit the beneficial cross-frame correlations within videos or to take a global view. During the one-shot testing phase, the target object mask(s) are often provided in the first frame or a few frames and are propagated to future frames [40]-[43] automatically. Several prior approaches have been proposed, including super-trajectories [44], object recommendations [45], graphical models [33], and so on. The results of DL algorithms are promising and have dominated the field.
Test-time supervision is exploited in several ways. Online learning-based methods fine-tune the model at test time. Propagation-based methods [42], [46], [47] rely on previous frame segments and function on a frame-by-frame basis, propagating the mask from frame to frame. Matching-based methods are another common stream, in which each frame is segmented based on its matching connection/correspondence to the preceding frame [48]. Although many matching-based OVOS models use a Siamese network architecture, our approach differs substantially. Aside from the task parameters, our RL method captures global and rich correspondences by training the Siamese network on groups or pairs of video frames; the main aim is to utilize cross-frame correlations to assist automatic segmentation and primary object recognition. In contrast, matching-based OVOS models only capture the connections between the first frame and subsequent frames.
Differentiable attention [51], [52], inspired by human vision, has been widely investigated in deep neural networks [49], [50]. With end-to-end training, such networks can use neuronal attention to focus on a subset of informative inputs. Beginning with neural machine translation [53] and moving to a broad variety of NLP-related tasks [49], a continual development of attention mechanisms has been witnessed in the field of natural language processing (NLP). Later, a wide range of computer vision applications utilized neural attention, including object identification [54], image captioning [55], video processing [56], visual recognition [57], and visual dialogue [58], to name a few. Differentiable attention has been shown to capture correlations/dependencies between input components.
In particular, [59] proposed the use of channel- and spatial-wise attention to dynamically choose an image area while reducing feature-channel redundancy. According to Vaswani et al. [49], self-attention, which computes the response at a position by attending to all positions, outperforms LSTMs and traditional RNNs in sequence-to-sequence tasks. A non-local operation, which may be viewed as a broader form of self-attention in a self-supervised environment, was proposed by Wang et al. [60]. Sun et al. [61] trained a mixed visual-linguistic model using self-attention-based BERT. Co-attention mechanisms, a kind of differentiable attention, have recently been successful in language and vision tasks [62], [63].

TABLE 1. Highlights and limitations of TVS algorithms.

1. Discrete Orthogonal Moments [10]
   Highlights: Fast block processing and embedded orthogonal polynomial algorithms are utilized to extract features.
   Limitations: Detection of soft transitions needs to be addressed to guarantee the algorithm's use for detecting various transition types.

2. Attention-Guided Network [11]
   Highlights: Appends a spatial attention module (SAM) and an adjacent attention module (AAM) on top of a dilated FCN, modeling the feature correlations in the spatial and temporal dimensions, respectively.
   Limitations: More powerful backbones should be selected and the upsample operation replaced with a complex decoder to optimize the final boundary results.

3. Noisy-LSTM [13]
   Highlights: Leverages the temporal coherence in video frames using convolutional LSTMs and replaces a given frame in a video sequence with noise.
   Limitations: The way noise is injected during model training and the noise types should be explored further.

4. Hierarchical tree [12]
   Highlights: Assumes a set of weakly labeled videos whose frames only sparsely have a category label; uses a hierarchical tree of the category labels and operates recursively at each tree branch.
   Limitations: Unknown category labels cannot be handled.

5. Motion Guided Attention [20]
   Highlights: Transfers information inherent in image-based instance embedding networks.
   Limitations: Incorrect foreground seeds are often discovered on static objects when errors occur in "objectness only" mode.

6. SBD-Duo [14]
   Highlights: A novel shot boundary detection technique using colour and gradient information.
   Limitations: The proposed system should be extended to address non-uniform illumination effects and to eliminate the effect of camera obstruction.

7. Visual Colour Information [15]
   Highlights: Utilizes the mean luminance pattern and the CIEDE2000 colour difference.
   Limitations: The proposed method should be extended to detect wipe transitions using the mean luminance pattern.

8. Limitations: The structure suffers from a basic limitation when we examine the overall frequency response of each channel.

9. Color histogram [22]
   Highlights: A motion-area extraction algorithm combining local descriptors and image color features.
   Limitations: SURF matching is not performed for all adjacent frames of each candidate segment frame by frame.

10. Deep CNN [23]
    Highlights: Three stages are used for abrupt detection, candidate boundary detection, and gradual transition detection, respectively.
    Limitations: A variety of constraints for filtering non-boundaries is needed to improve the model's processing speed.

In this research, co-attention techniques are used to efficiently mine the underlying relationships and project different modalities into a single feature space. The RL agent captures coherence across different frames by utilizing a co-attention module, resulting in an elegant and unified VOS network framework that focuses on the significance of identifying video object information globally. The RL algorithm learns from experience. Actor-Critic is a well-known RL framework that inherits several previous value- and policy-based RL constructs, such as policy gradient and deep Q-learning. In computer vision applications, RL techniques have been used to detect objects at the bounding-box level. Yun et al. [64] used RL to move the bounding box from the object's original position in the previous frame to its precise location; in other words, the predicted action allows the sensor to move away from its current site, and the next event is measured using the new location. Zhang et al. [65] proposed a new RL-driven model that can choose deep convolutional layers based on the complexity of the current image, reducing run time while maintaining precision.
The basic principle is that features from fewer convolutional layers are used to process simpler frames, whereas the costly, invariant deep features from the fully convolutional layers are used to process more complicated frames. To find the best relation between the filter hyperparameters, Dong et al. [66] used an RL technique; since conventional continuous deep Q-learning algorithms are challenging to implement, this may help speed up the convergence phase. To our knowledge, only one attempt has been made to integrate RL into VOS.
Chen et al. [67] created a new RL architecture that chooses the bounding object box and the background box. Context and object boxes are distinguished during exploration, resulting in distinct segmentation masks for an analogous segmentation model. As a result, using an RL technique to choose the best object-context box pair for the best segmentation result is usually appropriate. Unlike Chen et al. [67], who used RL to determine the size of the search area fed into the segmentation network, our agent is capable of generating multiple co-attention versions to capture the different aspects of the learned joint feature space.

III. METHODOLOGY
Our RL method can recognize the primary video objects that appear throughout the video sequence and are distinguishable in each frame. It formulates the ZVOS problem as a co-attention approach, and a novel co-attention Siamese network is constructed to represent it from a global perspective. During training, our method learns to capture the complex correlations between a pair (or group) of frames from the same video. This is achieved by using a gated, differentiable co-attention mechanism, enabling the network to focus on informative, correlated regions while also producing more discriminative foreground features. From a global perspective, our RL method provides more accurate results for a testing frame, i.e., it considers the correlations between the testing frame and many reference frames. To be more explicit, we first explore the co-attention between paired frames; consequently, during the testing phase, the pair-wise co-attention features of many reference frames are concatenated to form a global representation. Based on the pair-wise co-attention module, we describe a group co-attention module constructed directly across several frames, which captures global information more naturally and elegantly. Additionally, our RL model uses a large number of arbitrary frame pairs, which helps augment the training data. The proposed model avoids the need for computationally expensive and time-consuming optical flow calculations since the video frame interactions are fully specified. Finally, our model offers an end-to-end trainable framework for collecting rich contextual information from video sequences from start to finish. We demonstrate that our co-attention strategy improves performance and highlights the importance of global information and its usability for ZVOS.
Our model can capture the rich relations between video frames due to the differentiable co-attention mechanism, which is essential for distinguishing the key foreground objects. Our method also employs group co-attention to extract rich, high-order correlations between video frames, resulting in a more powerful moving-object pattern modeling framework. The experimental results in Section IV show that our method can suppress similar target distractions and capture the common objects even when no annotation is supplied during the segmentation task. Our approach can handle sequential learning of data and can readily be applied to various video analysis applications, such as optical flow prediction and video saliency detection. We use a co-attention strategy in our RL framework: the RL agent encodes the correlations between video frames directly. This enables our model to focus its attention on regions that are often coherent, helping to identify foreground objects and providing acceptable segmentation results. Our method is able to detect irregular objects in the standard datasets; the coherent regions are finely extracted, which helps produce good-quality segmentation masks. We give an overview of our RL method in Section A. In Section B, the agent action is described, which calculates the pair-wise co-attention summaries for the features obtained from the feature embedding module. In Section C, the state and reward of the RL agent are given, while training in the actor-critic framework is described in Section D. Section IV presents the results and discussion, and Section V concludes our work.

A. OVERVIEW
The main objective of the proposed work is to utilize RL for zero-shot VOS. Unlike current DL-based methods, which primarily focus on learning foreground representations, our approach uses RL to find the correlations between video frames. This helps our model attend to the frequently coherent regions and discover the foreground object(s), thus producing good segmentation results. The RL agent evaluates the correlation learning between any frame pairs from the same video. It also helps assess the group-wise co-attention mechanism, which addresses the high-order relationship among a group of video frames. The framework of our RL model is shown in Figure 2.
The features E_a and E_b are fed into the RL model to obtain the optimal X_a and X_b for each frame. Two RL models are built to choose the most suitable X_a and X_b for each frame and to choose the appropriate group co-attention. The problem is formulated with a state s ∈ S, a co-attention computation action a_x ∈ A_x that determines the value of X_a, a computation action a_y ∈ A_y that determines the value of X_b, a state transition function denoted by s' = T(s, a_x, a_y), and a reward function g denoted by g(s, a_x, a_y).
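The decision-process tuple above can be sketched as a minimal Python interface; all names here (CoAttentionMDP, the toy transition and reward) are illustrative stand-ins for the paper's s, a_x, a_y, T, and g, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = tuple   # placeholder state representation
Action = int    # index into a discrete set of co-attention variants

@dataclass
class CoAttentionMDP:
    # s' = T(s, a_x, a_y) and r = g(s, a_x, a_y) from the formulation above.
    transition: Callable[[State, Action, Action], State]
    reward: Callable[[State, Action, Action], float]

    def step(self, s: State, a_x: Action, a_y: Action) -> Tuple[State, float]:
        return self.transition(s, a_x, a_y), self.reward(s, a_x, a_y)

# Toy instantiation: the state counts steps; the reward favours agreeing actions.
mdp = CoAttentionMDP(
    transition=lambda s, ax, ay: (s[0] + 1,),
    reward=lambda s, ax, ay: 1.0 if ax == ay else 0.0,
)
s1, r = mdp.step((0,), 2, 2)
assert s1 == (1,) and r == 1.0
```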
The provided frames F_a and F_b are fed into a feature embedding module to obtain the features E_a and E_b. The two RL agents then run the pair-wise co-attention module and calculate attention summaries that encapsulate the correlations between E_a and E_b. E and Z are then concatenated and passed to a segmentation module, which produces the final segmentation predictions; i.e., once the group co-attention enhanced embedding is fed into the segmentation network, the optimal segmentation masks Y_a and Y_b are obtained. To mine the correlations between F_a and F_b in their respective feature embedding spaces, the co-attention mechanism [68], [69] is employed. To begin with, the affinity matrix A_f between E_a and E_b is calculated in Equation 1 as follows:

A_f = E_b^T W E_a ∈ R^{WH×WH},    (1)

where W ∈ R^{C×C} is a learnable weight matrix and each column {1, ..., WH} in E_a represents a C-dimensional feature vector. Therefore, every entry of A_f gives the similarity between a row of E_b^T and a column of E_a. Because the weight matrix W is a square matrix, its diagonalization is calculated in Equation 2 as follows:

W = P D P^{-1},    (2)

where D is a diagonal matrix and P is an invertible matrix. Then, Equation 1 can be rewritten as follows in Equation 3:

A_f = E_b^T P D P^{-1} E_a.    (3)

In Equation 3, the features of each frame are linearly transformed before the distance between any of their locations is computed. The projection matrix P becomes an orthogonal matrix when the weight matrix is also symmetric: P^T P = I, where I represents the C×C identity matrix. To compute the symmetric co-attention, we use Equation 4:

A_f = (P^T E_b)^T D (P^T E_a).    (4)

According to Equation 4, we project the features E_a and E_b into a common orthogonal space while preserving their norms. This property has been shown to assist in removing the correlation across the channels (i.e., the C dimension) [70], as well as enhancing the network's generalization ability [71], and each channel has its own degree of co-attention. Furthermore, the projection matrix P can be reduced to the identity matrix I, and the weight matrix W can be reduced to a diagonal matrix; W (i.e., D) can then be split into two diagonal matrices, D_a and D_b. As a result, Equation 4 can be rewritten as channel-wise co-attention in Equation 5 as follows:

A_f = E_b^T D_a D_b E_a = (D_a E_b)^T (D_b E_a).    (5)

This method can be compared to applying a channel-wise weight to E_a and E_b before computing the similarity; it lowers channel redundancy in a similar way to the Squeeze-and-Excite (SE) [72] method.

FIGURE 2. Overview of our RL-framework-based pair-wise co-attention in the training phase. The frame pair F_a and F_b is fed as input to the feature embedding module to obtain the two features E_a and E_b. Then, our RL agent calculates the co-attention summaries and finds the correlations between the two embedded features E_a and E_b. We adopt two actor-critic model pairs to calculate the attention summaries Z_a and Z_b. Two roles are performed by the "actor-critic" framework: an "actor" role that generates an action and a "critic" role that measures how good the action is. A gating function is then used to allocate a co-attention confidence to each attention summary. Finally, the attention summary Z is concatenated with the features E obtained from the feature embedding module and handed over to the segmentation module to produce accurate segmentation masks.
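To make the linear-algebra steps of the affinity computation concrete, the following NumPy sketch reproduces the generic, symmetric, and channel-wise forms on random toy embeddings; the dimensions and weight matrices are hypothetical stand-ins for the learned quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
C, WH = 8, 16                       # toy channel depth and flattened spatial size
E_a = rng.standard_normal((C, WH))  # embedding of frame F_a: one C-dim vector per location
E_b = rng.standard_normal((C, WH))  # embedding of frame F_b

# Generic affinity A_f = E_b^T W E_a: entry (i, j) relates location i of F_b
# to location j of F_a.
W = rng.standard_normal((C, C))
A_f = E_b.T @ W @ E_a               # shape (WH, WH)

# Symmetric case: W = P D P^T with P orthogonal, so both embeddings are
# projected into a common orthogonal space before comparison.
Q, _ = np.linalg.qr(rng.standard_normal((C, C)))   # an orthogonal P
D = np.diag(rng.standard_normal(C))
W_sym = Q @ D @ Q.T
A_sym = (Q.T @ E_b).T @ D @ (Q.T @ E_a)
assert np.allclose(A_sym, E_b.T @ W_sym @ E_a)     # identical to the generic form

# Channel-wise case: W collapses to diagonal weights D_a, D_b, i.e. a
# per-channel rescaling of the embeddings before the similarity.
d_a, d_b = rng.standard_normal(C), rng.standard_normal(C)
A_chan = (d_a[:, None] * E_b).T @ (d_b[:, None] * E_a)
assert A_chan.shape == (WH, WH)
```

The assertion on A_sym checks the claim that the orthogonal-projection form and the plain weighted form agree when W is symmetric.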

B. AGENT ACTION
The architecture of the proposed method comprises two RL models that choose the best group co-attention enhanced embedding. The action set A_x is utilized to select the pair-wise co-attention X_a, and the action set A_y is utilized to select X_b. The i-th column of A_f contains a vector of length WH; this vector depicts the relationship between the i-th feature in E_b and each feature (1, ..., WH) in E_a. Using a softmax function, the column- and row-wise normalized symmetric co-attention is given in Equation 6:

A_f^c = softmax(A_f),    A_f^r = softmax(A_f^T),    (6)

where the softmax normalizes each column. Next, the attention summaries for the feature embedding E_a w.r.t. E_b (and vice versa) can be computed as in Equation 7:

Z_a = E_b A_f^c,    Z_b = E_a A_f^r,    (7)

where Z_a, Z_b ∈ R^{C×WH}. Given the subtle variations in appearance across input pairs, occlusions, and background noise, it is better to weigh the information from different input frames rather than treating all co-attention data identically. To accomplish this, the network uses a self-gating technique to assign a co-attention confidence to every attention summary. The gate is defined in Equation 8:

f_g = σ(w_f ∗ Z + b_f),    (8)

where σ is the logistic sigmoid activation function, and b_f and w_f are the bias and convolution kernel, respectively. The gate f_g regulates the amount of information from the reference frame and can be learned automatically. The attention summaries are updated once the gate confidences have been computed, as described in Equation 9:

Z_a ← f_g^a ⋆ Z_a,    Z_b ← f_g^b ⋆ Z_b,    (9)

where the channel-wise Hadamard product is represented by '⋆'. As a consequence of these actions, a gated co-attention structure is formed. The resulting co-attention representation Z is then concatenated with the original feature E by the two RL agents to calculate the values of X_a and X_b, respectively, in Equation 10:

X_a = [Z_a, E_a],    X_b = [Z_b, E_b],    (10)

where '[ ]' denotes the concatenation operation.
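The normalization, summary, gating, and concatenation steps above can be sketched as follows; the learned 1x1 convolution gate is approximated by a per-channel linear map, and all parameters are random stand-ins rather than trained values.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
C, WH = 8, 16
E_a, E_b = rng.standard_normal((C, WH)), rng.standard_normal((C, WH))
A_f = E_b.T @ E_a                        # affinity (weight matrix omitted for brevity)

# Column- and row-wise softmax normalisation of the affinity matrix.
A_col = softmax(A_f, axis=0)             # attends over F_b locations per F_a location
A_row = softmax(A_f, axis=1)             # attends over F_a locations per F_b location

# Attention summaries: each location becomes a weighted sum of the other
# frame's feature vectors.
Z_a = E_b @ A_col                        # summary for E_a w.r.t. E_b, shape (C, WH)
Z_b = E_a @ A_row.T                      # summary for E_b w.r.t. E_a

# Self-gating: sigmoid of a per-channel linear map (stand-in for the conv),
# then a channel-wise Hadamard product with the summary.
w_f, b_f = rng.standard_normal(C), 0.1   # hypothetical gate parameters
f_g = 1.0 / (1.0 + np.exp(-(w_f[:, None] * Z_a + b_f)))
Z_a_gated = f_g * Z_a

# Concatenate the gated summary with the original embedding.
X_a = np.concatenate([Z_a_gated, E_a], axis=0)   # shape (2C, WH)
assert X_a.shape == (2 * C, WH)
```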
The pair-wise co-attention enhanced feature X calculated by the RL agent is fed into a segmentation network to produce the final output Y. For the group case, the attention summary of the current frame is computed against the embeddings of N reference frames in Equation 11:

Z_n = [E_1, ..., E_N] softmax([A_f^(1); ...; A_f^(N)]),    (11)

where A_f^(i) denotes the affinity between the current frame and the i-th reference frame. Each column in Equation 11 represents a linear combination of the reference frame embeddings, so the resulting Z_n contains the whole information of the reference group. Using Equation 8 and Equation 9, the gated co-attention is calculated, and the RL agent concatenates the co-attention summary with the original features obtained from the feature embedding, as in Equation 12:

X_n = [Z_n, E_n],    (12)

where X_n is the enhanced group co-attention embedding, calculated by considering the feature of the original frame and the correlation information of the other N frames. By focusing on the group as a whole, we improve the features {X_n}_{n=1}^{N+1} for all the frames {F_n}_{n=1}^{N+1}.
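A NumPy sketch of the group co-attention step, under the assumption that the reference embeddings and their affinities are simply stacked before normalization; sizes and embeddings are toy values.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
C, WH, N = 8, 16, 3                        # toy channels, locations, reference frames
E_n = rng.standard_normal((C, WH))         # embedding of the current frame
refs = [rng.standard_normal((C, WH)) for _ in range(N)]

# Stack the reference embeddings and their affinities with the current frame:
# each column of the normalised stacked affinity mixes features from ALL N
# reference frames, so the summary carries group-level information.
E_ref = np.concatenate(refs, axis=1)       # (C, N*WH)
A_ref = E_ref.T @ E_n                      # (N*WH, WH) stacked affinities
Z_n = E_ref @ softmax(A_ref, axis=0)       # (C, WH) group attention summary

# Group co-attention enhanced embedding for the current frame.
X_n = np.concatenate([Z_n, E_n], axis=0)   # (2C, WH)
assert X_n.shape == (2 * C, WH)
```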

C. STATE AND REWARD
Since this method consists of two RL-based models, each model employs a separate collection of states. The input to each RL model is its state s. The state s is the feature map obtained after extracting the features through the feature embedding module. The attention summaries are computed using the group-wise co-attention modules, which encode the correlations between E_a and E_b. Finally, E and Z are concatenated and passed to the segmentation module to produce the final segmentation predictions.
The features E_a and E_b are obtained from the feature embedding module for the input frames F_a and F_b. Following that, the attention summaries for the feature embedding E_a with respect to E_b are computed. A co-attention confidence is assigned to each attention summary using the self-gate technique. Concatenating the initial features E and Z yields the final co-attention representation. For frame F_a, the state is defined in Equation 13 as:

state_{f_a} = [E_a, Z_a]   (13)

For the second frame F_b, E_a is used for calculating the affinity matrix, and the state is given by Equation 14 as:

state_{f_b} = [E_b, Z_b]   (14)

Finally, the states state_{f_a} and state_{f_b} are fed into the corresponding RL models, which output the actions that choose the optimum group co-attention enhanced embeddings. The reward function is defined as r_t = g(s_t, a_x, a_y), which reflects the final segmentation performance on each frame of the video sequence, and is given in Equation 15 by:

r_t = △, where △ = IoU(m_{t,k+1}, y_t) − IoU(m_{t,k}, y_t)   (15)

where IoU represents the Intersection-over-Union between the ground truth and the predicted ROI, thus indicating the accuracy of the predicted segmentation mask.
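The reward of Equation 15 is driven by the change in IoU between successive mask estimates. A minimal sketch, assuming binary NumPy masks and taking the reward to be exactly the IoU improvement △:

```python
import numpy as np

def iou(mask, gt):
    """Intersection-over-Union of two binary masks."""
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    return inter / union if union else 1.0  # empty-vs-empty counts as a match

def reward(prev_mask, new_mask, gt):
    """Reward for one RL step: the change in IoU after the agent's action.
    Assumes r_t equals the delta of Eq. 15; the exact shaping may differ."""
    return iou(new_mask, gt) - iou(prev_mask, gt)
```

A positive reward means the chosen co-attention enhanced embedding improved the mask relative to the ground truth; a negative reward penalizes actions that degrade it.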

D. TRAINING IN ACTOR-CRITIC FRAMEWORK
DDPG is an off-policy, model-free technique for learning continuous actions [91]. It combines DPG (Deterministic Policy Gradient) with DQN (Deep Q-Network): it is based on DPG, which can operate over continuous action spaces, and employs DQN's experience replay and slowly-updated target networks. The architecture of our actor-critic framework is shown in Figure 3. For RL training, this work uses the "actor-critic" framework, which consists of two parts: an "actor" that performs an action and a "critic" that evaluates the actor's performance. The proposed RL algorithm selects the optimum group co-attention enhanced embedding for each frame. We use two "actor-critic" pairs, i.e., four individual networks, to choose the segmentation masks for the frames. When the current frame F_a is provided, the first step is to feed the state into the "actor" network, which selects the optimal enhanced feature to produce the segmentation mask. Once this phase is completed, the corresponding reward r_t is returned; r_t is calculated from the IoU of the final segmentation result. During training, the "critic" network is generally updated first, in a value-based fashion via a temporal-difference rule (Equations 16 and 17), which is quite similar to the average-cost temporal-difference method of [73]. In Equation 16 and Equation 17, λ_k and λ_{k+1} represent the weights of the critic model before and after the update, and γ is the learning rate of the critic model. g(X_k, U_k) denotes the accumulated reward of state s_t that the critic predicts before the model update. (X_k, U_k) is the current state-action pair, θ_k is the actor's parameter vector at time k, and X_{k+1} is the new state obtained after performing action U_k.
where ψ(s, a) denotes the advantage function, and θ_k and θ_{k+1} in Equation 18 denote the weights of the "actor" model before and after the update, respectively. The action-value function Q(X_{k+1}, U_{k+1}) is a network that takes the state s and a specific action as inputs and outputs the value of taking that action in state s. As a result, the "actor-critic" framework avoids the shortcomings of purely policy-based and purely value-based approaches while training the RL models. The RL models are updated during each pass instead of waiting for the end of an episode, significantly reducing the training time while preserving the stability of RL training.
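The value-based critic update and the advantage-weighted actor step can be illustrated with a linear function approximator. This is a generic TD-style sketch consistent in spirit with Equations 16-18, not the authors' exact update rule; the feature vectors `phi_k`, `phi_k1` and the `discount` factor are assumptions of this sketch.

```python
import numpy as np

def critic_td_update(lam, phi_k, phi_k1, r_k, gamma_lr, discount=0.99):
    """One value-based critic update (cf. Eqs. 16-17).

    lam          : critic weight vector of a linear value function V(s) = lam @ phi.
    phi_k, phi_k1: feature vectors of the state before/after action U_k.
    r_k          : reward g(X_k, U_k); gamma_lr : critic learning rate γ.
    """
    td_error = r_k + discount * (lam @ phi_k1) - (lam @ phi_k)
    lam_new = lam + gamma_lr * td_error * phi_k  # gradient step on the TD error
    return lam_new, td_error

def actor_update(theta, grad_action, advantage, lr=1e-3):
    """Advantage-weighted actor step (cf. Eq. 18): move the policy weights
    theta in the direction that increases the advantage psi(s, a)."""
    return theta + lr * advantage * grad_action
```

Because both updates are applied on every pass, the critic's TD error immediately shapes the next actor step, which is what allows per-frame updates without waiting for the episode to end.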

E. IMPLEMENTATION DETAILS
The backbone network of our RL framework is DeepLabv3 [74]: the ASPP (atrous spatial pyramid pooling) module and the first five convolutional blocks of ResNet [75] form the backbone. The weight matrix W of the co-attention modules is implemented as a fully connected layer with 256×256 parameters. An SE-like module is used to realize the channel-wise co-attention: one branch uses a fully connected layer with 256 nodes to generate the channel weights, which are then applied to the embedded feature of the other branch. Equation 8 is implemented using a 1×1 convolutional layer with a sigmoid activation function. The same settings apply to the different co-attention variations as well as to group co-attention. The segmentation module consists of 3×3 convolutional layers (with batch norm and 256 filters) and a 1×1 convolutional layer for the final segmentation prediction (with one filter and a sigmoid). Our RL model training process is split into two parts. First, the feature embedding module based on DeepLabv3 [74] is fine-tuned on static frames from the YouTube-Object [18] and SegTrack V2 [17] datasets together with image saliency data.
The model is then fine-tuned using the DAVIS16 [10] training videos. At this point, two frames are chosen randomly from the same sequence and given to our RL model as a training pair. For training group co-attention, we choose six frames at random from the same sequence. To conserve GPU memory, we resize the input frames to 384×384×3, yielding a (56, 56, 256)-d tensor as the frame's initial feature embedding V. The whole network is built with TensorForce, and training uses the SGD optimizer with an initial learning rate of 1.5×10^−4. The batch size and the total number of epochs are set to 4 and 30, respectively. Data augmentation (e.g., flipping, resizing, and cropping) is applied to both static images and video data. All training and analyses are performed on an Intel(R) Xeon 2vCPU @ 2.2 GHz and an NVIDIA GeForce 1080 Ti GPU; training takes around 3 days. This network design has produced outstanding results for ImageNet classification and PASCAL segmentation tasks [30]. Our RL model uses a batch of four frames, consisting of three reference frames and one inference frame; a three-frame batch, i.e., two reference frames with one inference frame, is also sufficient for generating promising results.
Algorithm 1 RL-based video object segmentation.
Input: ground-truth mask of the first frame img(1); the length of the sequence M; the threshold T; a pretrained ResNet network; the RL model choosing the pair-wise co-attention X_a; the RL model choosing the pair-wise co-attention X_b.
Output: segmentation result Y_t.
1: Fine-tune the segmentation network on the extracted frames F_a and F_b.
2: Extract the features E_a and E_b from the feature embedding space.
3: for t = 2 to L do
4:    Obtain the two RL model states using (3) and (4), respectively.
5:    Feed the states into the RL models and calculate the optimal X_a and X_b values.
6:    Obtain the segmentation mask Y_t for the two frames F_a and F_b.
7:    Update the segmentation networks on frame F_t using E_a and Z_a.
8:    K_t = Update(Seg_Algo, F_t)
9:    r(t) = m(t)
10: end for
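Under assumed interfaces for the embedding, RL, and segmentation modules, the main loop of Algorithm 1 can be sketched as follows; every name here is illustrative, not the authors' code.

```python
def segment_sequence(frames, gt_first, rl_a, rl_b, seg_net, feat):
    """Sketch of the Algorithm 1 loop under assumed interfaces:
    `feat` embeds a frame, `rl_a`/`rl_b` select the co-attention enhanced
    features from their states, and `seg_net` predicts a mask from them.
    """
    masks = [gt_first]                         # the first-frame mask is given
    for t in range(1, len(frames)):
        Ea, Eb = feat(frames[t - 1]), feat(frames[t])
        state_a, state_b = (Ea, Eb), (Eb, Ea)  # states of the two RL models
        Xa, Xb = rl_a(state_a), rl_b(state_b)  # optimal enhanced features
        masks.append(seg_net(Xa, Xb))          # segmentation mask Y_t
    return masks
```

The per-frame update steps (lines 7-9 of Algorithm 1) would slot into the loop body; they are omitted here to keep the control flow visible.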

IV. ANALYSIS OF RESULTS AND DISCUSSION
FIGURE 3. Actor-Critic network. When the current frame F_a is provided, the state is first fed into the "actor" network, which selects the optimal features for producing the segmentation masks. Once this phase is completed, the corresponding reward r_t is returned; r_t is calculated from the IoU of the final segmentation result.

To evaluate the performance of the proposed model, the basic statistical parameters used in other works in the literature have been studied. Sensitivity (Sen) is calculated in Equation 19 as follows:

Sen = TP / (TP + FN)   (19)

Sensitivity measures the proportion of object pixels that are correctly identified. Similarly, Specificity (Spe) determines the proportion of non-object pixels correctly assigned, and is given in Equation 20 as follows:

Spe = TN / (TN + FP)   (20)

The rate of correct pixel classification, referred to as Acc, is determined in Equation 21 as follows:

Acc = (TP + TN) / (TN + TP + FN + FP)   (21)

The spatial overlap between the assigned binary mask and the segmented image is defined by the Dice coefficient (Dice), measured in Equation 22 as follows:

Dice = 2 TP / (2 TP + FP + FN)   (22)

The Jaccard Index (J_m) measures the agreement between the binary labels and the pixel values analyzed for the input image, and is determined in Equation 23 as follows:

Jaccard Index = TP / (TP + FN + FP)   (23)

Here, true positives (TP) are correctly identified object pixels, false positives (FP) are non-object pixels incorrectly labeled as objects, true negatives (TN) are correctly labeled non-object pixels, and false negatives (FN) are object pixels that were missed.
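Equations 19 through 23, together with the balanced F-measure, all derive from the same four pixel counts and can be computed as:

```python
def pixel_metrics(tp, fp, tn, fn):
    """Standard segmentation metrics from pixel counts (Eqs. 19-23 plus F1)."""
    sen = tp / (tp + fn)                      # sensitivity / recall (Eq. 19)
    spe = tn / (tn + fp)                      # specificity (Eq. 20)
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy (Eq. 21)
    dice = 2 * tp / (2 * tp + fp + fn)        # Dice coefficient (Eq. 22)
    jaccard = tp / (tp + fn + fp)             # Jaccard index J_m (Eq. 23)
    pre = tp / (tp + fp)                      # precision
    fm = 2 * pre * sen / (pre + sen)          # balanced F-measure
    return {"Sen": sen, "Spe": spe, "Acc": acc,
            "Dice": dice, "J": jaccard, "Pre": pre, "F": fm}
```

Note that for binary masks the balanced F-measure equals the Dice coefficient, which is why the two are often reported interchangeably.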
The F_m is a measure of a test's accuracy in binary classification. Precision (Pre) is the number of true positive results divided by the total number of positive predictions (including those that were erroneously recognized), and recall (Rec) is the number of true positive results divided by the total number of samples that should have been detected as positive. In diagnostic binary classification, Pre is also known as the positive predictive value, while Rec is also known as Sen. The F_m score is the harmonic mean of Pre and Rec; weighted variants favor either precision or recall over the other. The conventional F_m, often known as the balanced F_m, is given in Equation 24 as follows:

F_m = 2 · Pre · Rec / (Pre + Rec)   (24)

The SegTrack V2 dataset [17], the DAVIS 2016 dataset [16], and the Youtube-Object dataset [18] are all used to test the proposed RL segmentation method. The DAVIS 2016 dataset contains 50 high-quality video sequences and 3455 frames that address a wide range of VOS issues, including transitions, occlusions, and motion blur. Thirty DAVIS 2016 video sequences are used for training, and 20 video sequences are used for analysis. Each video sequence in DAVIS 2016 has only one annotated object instance; in contrast, the DAVIS 2017 dataset expands on DAVIS 2016 by annotating several objects in a single frame. The proposed approach is limited to the DAVIS 2016 dataset and only considers single-instance segmentation. The Youtube-Object dataset contains 155 video sequences, with a total of 570,000 frames.
The YouTube-Objects dataset contains videos retrieved from YouTube by searching for the names of ten distinct object classes. There are around 9 to 24 videos for each class, and each video is between 30 seconds and 3 minutes long. While the videos are only weakly annotated, each is guaranteed to include at least one object from the relevant class. The video sequences are divided into ten categories. The Youtube-Object dataset is commonly used solely for evaluation, since the training and test sets are not separated. Compared with the SegTrack V2 dataset, occlusions outnumber appearance changes in around 14 video frames of the Youtube-Object dataset. The region similarity J_m and contour accuracy F_m are the evaluation metrics. The proposed study is tested against other state-of-the-art semi-supervised and unsupervised VOS strategies, including Lucid-Tracker [76], STV [77], MSK [78], ObjectFlow [17], TRC [31], CVOS [86], KEY [25], MSG [87], and NLC [1]; our quantitative results outperform these methods in terms of J_m and F_m. On the CDnet 2014 night-video dataset, the segmentation generated by the proposed approach is compared with the ground-truth segmentation for each frame of the night videos, which allows us to evaluate our method against the current state-of-the-art approaches. The results reflect the error flow over time and highlight the points of failure [79]. In each method, deep neural networks estimate the necessary parameter values and learn the best features from the training data. These algorithms are trained and tested on two partitions of the same sequence. Deep learning networks typically generalize poorly to unseen scenes due to a lack of annotated datasets; our system was trained using CDnet videos and demonstrated strong generalization performance.
Although the current methods are mostly supervised, our method outperforms many of them. Based on Rec and Pre values, our RL system is inherently more robust to relevant shifts in the frames than methods such as COLBMOG [80], EFIC [81], and C-EFIC [82]. The Rec and Acc values for our method are 0.8799 and 0.9401, respectively, which means our method correctly obtains the segmentation masks and performs well compared to the other methods. Our RL agent is capable of making autonomous decisions with strong Acc and Pre. Figures 4, 5, 6, and 7 depict the comparison of the foreground masks generated by our proposed RL method, EFIC [81], C-EFIC [82], and COLBMOG [80]. In Figure 4, from top to bottom, the images display the initial frame (Input), the ground truth (GT), and the foreground segmentation masks of our proposed RL method, EFIC [81], C-EFIC [82], and COLBMOG [80] for frames 1056 and 1220 of the winterStreet video. Mis-classified pixels are highlighted in red. The segmentation masks produced by our approach and the other three methods are of similar quality for frame 1056, which is a "simple" frame. In contrast, our RL method performs significantly better than the other approaches on frame 1220, which is a "hard" frame. These qualitative results demonstrate our method's superior performance under strong reflections and with objects camouflaged by the street's headlights.
From left to right, Figure 4 contains frame 1662 of bridgeEntry, frame 820 of busyBoulvard, frame 443 of fluidHighway, frame 2665 of streetCornerAtNight, frame 1636 of the tramStation video, and frame 1278 of the winterStreet video. The methods evaluated on these frames must determine the shape of the cars in the videos. Our method can accurately determine the car's shape and flashlight (the front lightbox that emits light), demonstrating the better performance of our RL method at extracting the vehicle's shape. Pixels that have been incorrectly labeled are highlighted in red. Because the CDnet Night Videos dataset contains a wide range of obstacles, it is also essential to test our RL method in other difficult scenarios, such as shadows and complex backgrounds. Figure 5 compares our method's results with COLBMOG [80] for the input image; as can be seen, our segmentation mask is more accurate than that of the COLBMOG [80] method. In the second row of Figure 5, our RL algorithm performs significantly better than the COLBMOG [80] method: even the hands are clearly visible in the extracted segmentation mask. Figure 6 depicts the qualitative segmentation masks of various VOS approaches. Our RL agent can accurately retrieve the primary objects by calculating co-attention summaries that take the global temporal information into account. The agent can deal with fast-motion scenarios (e.g., parkour) and cluttered backgrounds (e.g., dance-twirl).
Our RL approach emphasizes the primary subject while minimizing distractions from comparable objects, using both videos and static saliency images. The technique works well in video clips with a lot of variation in the appearance of the target entity, such as the camel and breakdance videos. Our method successfully separates the target object from many other similar objects, even when they are close together, as in the camel sequence. Since the SegTrack V2 and Youtube-Object datasets do not distinguish between training and test sets, both of these video collections can be included in the evaluation of VOS approaches. Figure 7 shows the segmentation masks of COLBMOG [80], EFIC [81], and C-EFIC [82]; as can be seen from the figure, our method performs much better than the other methods, giving a clear Region of Interest (ROI). Table 2 gives the average metrics for each video and across the overall set of videos for our RL learning method.

FIGURE 4. Example segmentation masks for COLBMOG [80], C-EFIC [82], EFIC [81], and the corresponding ground truth (GT), for the night videos from the CDnet 2014 dataset. Our method performs better than the existing methods: its segmentation masks are clear, the objects are segmented more cleanly, even the headlights of the car are visible, and the detected shape closely matches the ground truth.

As seen in Table 2, the average F_m on streetCornerAtNight for the first half of the daytime videos is 0.9251, which is higher than COLBMOG [80] (0.7853), C-EFIC [82] (0.7223), and EFIC [81] (0.6704). The values of the other statistical measures are also significantly higher than those of the other methods. The Night Videos category is devoid of dynamic surroundings, which are significantly challenging to work with. With values of 0.9251 (bridgeEntry), 0.9401 (busyBoulvard), and 0.8378 (fluidHighway), we achieved higher average F_m for all video categories of the Dynamic Background (DB) type, putting us well ahead of the methods evaluated on these categories. Where an error measure is reported instead of the F_m, a lower value means better Acc. In the most challenging conditions, all algorithms fail in the same frames; in these frames, our RL method performs significantly better than the other state-of-the-art methods evaluated on the dataset. As seen in Table 3, our approach outperforms the others on the DAVIS 2016, SegTrack V2, and Youtube-Object datasets. Our method is compared with eight deep learning models [3]-[5], [8], [9], [90], [37] and eight conventional methods [1], [87]-[89], [25], [28], [86], all evaluated on the DAVIS16 benchmark [17]. This shows that our method is more effective at integrating the shared information for mask inference; we infer that our RL method considers more cross-frame relational information. Compared to previous state-of-the-art methods, this technique raises the mean region similarity J_m on the SegTrack V2 dataset by 12.03%, demonstrating the usefulness of RL models for selecting online adaptation ROIs. The proposed method also improves J_m by 13.11% on the Youtube-Object dataset. Rigid objects (such as trains and aircraft) and non-rigid objects (e.g., cat, bird) are the two categories of objects in Youtube-Objects.
Despite the fact that the objects in the latter class often undergo fast appearance change and shape deformation, our RL method maintains and captures long-term dependency better than any other method evaluated on this dataset.
The proposed approach outperforms the other methods on the DAVIS 2016 dataset with an F_m of 91.85%. The values of J_m on the three datasets are 90.68%, 89.63%, and 92.61% for DAVIS 2016, SegTrack V2, and Youtube-Object, respectively. Our approach can handle significant appearance variations caused by interacting objects, size differences, appearance change, and background clutter. All of the techniques investigated, including the optical flow fusion in FSEG [4], the multi-task estimation in SFL [5], and the ConvLSTM in PDB [8], use temporal information to estimate the segmentation mask. Our RL agent has the additional benefit of leveraging temporal correlations through the co-attention mechanism, which become apparent from a global perspective. Our method can capture temporal coherence and differentiate between foreground and background objects. The better performance of our RL model is due to the group co-attention calculated by our agent, which learns to fuse and capture the correlation information from multiple reference frames.

FIGURE 5. Our method's results compared with the COLBMOG [80] method for the input image. As can be seen, our method's segmentation mask is more accurate than COLBMOG's in both the first and the second row.
Table 4 demonstrates the validity of the proposed approach: our algorithm performs better than the other approaches. The quantitative average metrics used to evaluate the night-video segmentation results are compared with the other state-of-the-art methods. We assess the foreground segmentation results using several well-known metrics based on the number of correctly and incorrectly classified pixels: Spe, Rec, False Positive Rate (FPR), False Negative Rate (FNR), Sen, and F_m. We calculate and evaluate the overall success of our method using these parameters. The metrics for each video group are then measured to ensure that our RL approach performs well. In addition to the per-video statistical metrics, we also calculate the average metrics; the average F_m for each video and across the entire set of videos is shown in Table 4, along with the standard deviation over each group of videos. When we apply a significance test such as the Friedman test to the experimental outcomes, the total rank values of the respective columns sum to 34, 25, 50, and 52. The computed p-value is 0.00042, which is less than 0.05, suggesting that our results are significant.
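The Friedman test operates on per-video rank sums such as the 34, 25, 50, and 52 reported above. A small NumPy sketch of the rank-sum and chi-square computation, using hypothetical scores (the figures below are illustrative, not the paper's data):

```python
import numpy as np

def friedman_rank_sums(scores):
    """scores: (n_videos, n_methods) matrix of per-video F-measures.
    Returns the per-method rank sums used by the Friedman test (rank 1 =
    best within each video; ties not handled) and the chi-square statistic.
    """
    n, k = scores.shape
    # rank within each row: higher score -> better (lower) rank
    ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1
    R = ranks.sum(axis=0)  # column rank sums, one per method
    # Friedman chi-square statistic
    chi2 = 12.0 / (n * k * (k + 1)) * np.sum(R.astype(float) ** 2) - 3 * n * (k + 1)
    return R, chi2
```

The statistic is then compared against a chi-square distribution with k−1 degrees of freedom to obtain the p-value (0.00042 in the paper's experiment).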
Our system consistently outperforms the COLBMOG [80], C-EFIC [82], and EFIC [81] approaches in terms of F_m ranking, with relative overall F-Measure improvements of 16.8% over COLBMOG [80], 25.74% over C-EFIC [82], and 27.03% over EFIC [81]. Our RL method outperforms the other approaches on all video types, including bridgeEntry, busyBoulvard, and fluidHighway. The few red-marked regions in Figure 4 show that our algorithm can extract the shapes with great precision. COLBMOG is based on BMOG, a low-complexity color-based classification algorithm; with low-quality video containing visible compression noise, it produces inaccurate textures, which significantly affects its texture representation. The statistical measures show that our system outperforms the other approaches on these uneven datasets. The proposed system's Acc and Pre scores also raise the F_m value, which is considerably better than that of the other methods. Compared to the other algorithms, our RL algorithm has a lower standard deviation of the F_m across the entire range of images (0.0552), suggesting better consistency. The contribution of the added value of COLBMOG to overall performance is apparent when contrasting the F_m obtained by BMOG for the Night Videos segment (0.4982) with the one obtained by COLBMOG [80] (0.7564).

FIGURE 6. The segmentation masks of the various methods. As can be seen from the figure, our method performs much better than the other methods, giving very clear ROIs and enabling easy detection of the segmentation masks for the ROIs under consideration.
For video datasets, our method outperforms the other methods in both qualitative and quantitative terms, with promising results in pedestrian segmentation; it accurately detects the dark region at the bottom right of the images. Our results for the fluidHighway video, a low-quality video with visible compression noise that produces false textures and directly affects texture representation, are excellent. Our system outperforms the other approaches by a broad margin in all the challenging cases. Our method has a significantly lower standard deviation of the F_m over the entire set of videos, implying more consistent Pre across various obstacles. These statistical measures serve as key evidence of the better performance of our method. We also evaluate our system's Acc in more complex cases, such as shadows and complicated backgrounds. According to the ablation studies, the proposed method outperforms current methods because it uses the co-attention mechanism. We propose a co-attention-mechanism-based RL framework using Siamese networks. A Siamese neural network (sometimes referred to as a twin neural network) is a kind of artificial neural network that uses the same weights to produce comparable output vectors while processing two different input vectors. Table 5 gives the average performance metrics for the different categories of the DAVIS 2016 video dataset; our method performs relatively well for all categories of videos. Our VOS approach can achieve long-term results that are otherwise difficult to achieve. The model fixes mistakes made during the training phase; once the model has resolved an error, the likelihood of the same error happening again is very low. Our RL method strikes a balance between exploration and exploitation.

FIGURE 7. Comparison of our segmentation masks with COLBMOG, C-EFIC, and EFIC. The segmentation masks calculated by our method are better than those of the other state-of-the-art methods.
Exploration is the practice of searching for new samples, while exploitation focuses on the promising areas already explored. Most machine learning algorithms do not maintain this balance. Furthermore, the mentioned issue is a general problem in various video-related tasks, and our proposed RL approach can be applied to other video-related tasks. The stronger the match between our predicted objects and the ground truth, the higher the value of Pre. The main advantage of an encoder-based Siamese network over a regular encoder network is the ability to quickly detect similar target objects and propagate foreground information. The Siamese network performs well for ZVOS; as a consequence, it can entirely replace online fine-tuning while substantially speeding up the segmentation process. The segmentation accuracy is higher than with online fine-tuning, and the Siamese network achieves a good speed-accuracy trade-off, reducing the amount of error produced during the segmentation process. The error minimization function has been extended to various other video-related functions, enabling the current frame's output to guide the output of subsequent frames. In several other VOS methods, the current frame's segmentation results are paired with the information from the next frame; our technique benefits from the use of the error minimization process.
RL algorithms have their state space, action space, transfer process, and reward defined. From this vantage point, our state space comprises the assignment network's fixed inputs, including common inputs (image, flow, etc.) and inputs unique to the current proposal (current mask, appearance, etc.). As can be seen from the results, our method needs much less training data (especially video data) than other methods such as LVO [3], FSEG [4], OBN [7], LSMO [9], and MOT [37], while still obtaining better results. DDPG is composed of two models: an actor and a critic [91]. Rather than outputting a probability distribution over the actions, the actor (policy network) takes the state as input and outputs the exact (continuous) action. The critic takes the state and action as input and produces a Q-value as output. The term "deterministic" in DDPG refers to the fact that the actions are computed directly by the actor rather than drawn from a probability distribution over actions. We also compare the computational time and hardware resources required by various deep architectures with our RL method. The methods taken for comparison are SegNet [92], VNet [93], UNet [94], and Autoencoders [95]. Our RL algorithm performs significantly better than the other methods in terms of GPU training memory, GPU inference memory, forward-pass time, and backward-pass time, with values of 4069 MB, 2700 MB, 102.22 ms, and 144.49 ms, respectively. From the results, we infer that our RL algorithm can solve the complex problem of VOS on the night-video dataset. Our method does not need a large number of scene flow images (such as those used in LVO [3] and LSMO [9]) to train an optical flow module, since it takes the video frames directly as input. Our method benefits from the natural data augmentation property of our Siamese-network-based learning method. It also outperforms FSEG [4], LSMO [9], and MOT [37], which all need extra video datasets.
Our method also outperforms PDB [8] with the same training data, showing the importance of global knowledge for ZVOS tasks.

V. CONCLUSIONS
Our RL model automatically recognizes and isolates the major object regions in each frame of a video. Unlike conventional methods, which focus on sequential and local data, this research emphasizes the significance of the global co-attention mechanism. Our RL model captures temporal coherence by gathering correlation information between groups (or pairs) of frames via a differentiable co-attention mechanism. The proposed method identifies the significant foreground objects in each frame, capturing the temporal correlation across frames. Our model can capture similar objects and minimize comparable target distraction even when no annotation is given during segmentation. Our RL model can be extended to other video analysis applications such as video saliency detection and optical flow estimation. Our RL method ranks first on the CDnet "Night Videos" category with an F_m score of 0.9251, making it the best-performing method for the segmentation of irregular objects in night-video datasets. The results reveal that the proposed method improves on the state-of-the-art techniques in the F1 measure on the DAVIS 2016 dataset by 2%, on SegTrack V2 by a J_m of 12.03%, and on the Youtube-Object dataset by a J_m of 13.11%. Meanwhile, our algorithm achieves an accuracy of 87.99%, a precision of 94.01%, and an F_m of 92.51% on the DAVIS 2016 dataset, ranking higher than the current state-of-the-art methods on the video segmentation datasets.

VI. DECLARATION OF COMPETING INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

VII. ACKNOWLEDGEMENTS
IZZATDIN ABDUL AZIZ is a researcher at the High Performance Cloud Computing Centre (HPC3) at Universiti Teknologi PETRONAS (UTP), where he focuses on solving complex upstream Oil and Gas (O&G) industry problems from the viewpoint of computer science. Dr. Izzatdin currently serves as the deputy head of the Computer and Information Sciences Department at UTP. He obtained his Ph.D. in Information Technology from Deakin University, Australia, working in the domain of hydrocarbon exploration and cloud computing. He works closely with O&G companies to deliver solutions for complex problems such as offshore O&G pipeline corrosion rate prediction, O&G pipeline corrosion detection, securing data on clouds, designing and implementing Metocean prediction systems, and bridging upstream and downstream oil and gas businesses through data analytics. Additionally, he works on Big Data transmission, security, and optimization problems on High Performance Clouds.
ARUNAVA ROY obtained his Ph.D. from the Department of Applied Mathematics, Indian School of Mines, Dhanbad, and presently works as a Researcher in the Department of Industrial and Systems Engineering, National University of Singapore, Singapore 117576. Previously, he was a Post-Doctoral Fellow in the Department of Computer Science, The University of Memphis, TN, USA 38111. His areas of interest are web software reliability, software reliability, cyber security, algorithm design and analysis, data structures, and statistical and mathematical modeling.