BSUV-Net 2.0: Spatio-Temporal Data Augmentations for Video-Agnostic Supervised Background Subtraction

Background subtraction (BGS) is a fundamental video processing task and a key component of many applications. Deep learning-based supervised algorithms achieve very promising results in BGS; however, most of these algorithms are optimized for either a specific video or a group of videos, and their performance decreases significantly when applied to unseen videos. Recently, several papers addressed this problem and proposed video-agnostic supervised BGS algorithms. However, nearly all of the data augmentations used in these works are limited to the spatial domain and do not account for temporal variations naturally occurring in video data. In this work, we introduce spatio-temporal data augmentations and apply them to one of the leading video-agnostic BGS algorithms, BSUV-Net. Our new model trained using the proposed data augmentations, named BSUV-Net 2.0, significantly outperforms the state-of-the-art algorithms evaluated on unseen videos. We also develop a real-time variant of our model, named Fast BSUV-Net 2.0, with performance close to the state of the art. Furthermore, we introduce a new cross-validation training and evaluation strategy for the CDNet-2014 dataset that makes it possible to fairly and easily compare the performance of various video-agnostic supervised BGS algorithms. The source code of BSUV-Net 2.0 will be published.


Introduction
Background subtraction (BGS) is one of the fundamental video processing blocks frequently used in applications such as advanced video surveillance, human activity recognition, autonomous navigation, etc. [4,9]. BGS can be defined as binary video segmentation aiming to extract foreground and background regions in each video frame.
End-to-end BGS algorithms can be loosely grouped into three categories: (i) unsupervised algorithms, (ii) video- or video-group-optimized supervised algorithms and (iii) video-agnostic supervised algorithms. Unsupervised algorithms attempt to mathematically model the background and extract the foreground pixels accordingly. Several model-based approaches in this category, such as PAWCS [26] and WisenetMD [17], achieve very competitive performance; however, they are currently outperformed by deep learning-based supervised algorithms.
Several recent papers introduced video-optimized [32,2,19], video-group-optimized [1,24] and video-agnostic [30,21] supervised BGS algorithms. The first two categories report results for methods that have been trained and tested on the same set of videos and their performance on unseen videos is not reported. On the other hand, video-agnostic algorithms report results on unseen videos by training and testing on disjoint sets of videos. One of the most successful video-agnostic BGS algorithms, BSUV-Net, uses spatial and semantic information from different time scales to improve performance on unseen videos. However, due to the limited amount of labeled BGS data, BSUV-Net's performance on challenging scenarios is still insufficient for real-world applications.
In the computer vision literature, one of the most successful approaches for increasing the generalization capacity of algorithms trained with limited data is data augmentation. Spatial data augmentations, such as random crops, rotations and color changes, have proved very successful in image-related tasks [29,25]. In BSUV-Net [30], a limited type of spatio-temporal data augmentation was introduced to handle illumination differences between videos, and this provided some performance improvement. However, to the best of our knowledge, besides this work there is no other data augmentation attempt tailored to BGS that uses both spatial and temporal information in a comprehensive manner. In this paper, we introduce a comprehensive suite of spatio-temporal data augmentation methods and adapt them to BSUV-Net. The proposed augmentations address some key BGS challenges, such as PTZ operation, camera jitter and the presence of intermittently-static objects. We also conduct a video-agnostic performance analysis and show that these data augmentations significantly increase the performance on targeted categories, while not causing a significant performance drop in other categories. Furthermore, we show that a network trained on a combination of multiple spatio-temporal data augmentations performs ∼5% better than the state of the art (SOTA) for unseen videos. Our main contributions are as follows:

1. Spatio-temporal data augmentation: We introduce spatio-temporal data augmentation methods for BSUV-Net to mimic challenging BGS scenarios, such as PTZ operation, camera jitter and the presence of intermittently-static objects (e.g., cars stopped at a streetlight). Our experimental results show that these augmentations significantly improve the performance on unseen videos of the related categories.

2. Fair evaluation strategy for CDNet-2014: Although CDNet-2014 is an extensive BGS dataset, it lacks a training/testing split for use by supervised learning approaches. We introduce a split of CDNet-2014 videos into 4 groups to be used for cross-validation. In this way, any supervised BGS algorithm can be easily evaluated on all CDNet-2014 videos in a video-agnostic manner. This will simplify algorithm performance comparisons in the future.

3. State-of-the-art and real-time results: Our proposed algorithm outperforms SOTA on CDNet-2014 for unseen videos by ∼5%. We also introduce a real-time variant of BSUV-Net 2.0 which runs at ∼29 FPS and performs on par with SOTA. Upon publication, we will publicly share our training and validation scripts as well as the trained models.

Related Work
Unsupervised BGS algorithms: The early attempts at BGS relied on probabilistic background models such as the Gaussian Mixture Model (GMM) [28] and Kernel Density Estimation (KDE) [8]. Following the idea of BGS based on background modeling, more sophisticated algorithms were introduced as well (e.g., SubSENSE [27], PAWCS [26], SWCD [15] and WisenetMD [17]). Recently, VTD-FastICA [14] has applied independent component analysis to multiple frames, whereas Giraldo and Bouwmans have introduced a graph-based algorithm that treats the instances in a video as nodes of a graph and computes BGS predictions by minimizing the total variation [10,11]. Finally, RT-SBS [6] and RTSS [31] have combined unsupervised BGS algorithms with deep learning-based semantic segmentation algorithms, such as PSPNet [33], to improve BGS predictions.
Video- or video-group-optimized supervised BGS algorithms: The early deep learning attempts at BGS focused on video- or video-group-optimized algorithms, which are tested on the same videos that they are trained on; usually, their performance on unseen videos is not reported. Video-optimized algorithms train a new set of weights for each video using some of the labeled frames from this very test video, while video-group-optimized ones train a single network for the whole dataset using some labeled frames from the whole dataset. Both achieve near-perfect results [2,19,1]. Although these algorithms might be very useful for speeding up the labeling process of new videos, their performance drops significantly when they are applied to unseen videos [30]. Clearly, they are of limited use in real-world applications.
Video-agnostic supervised BGS algorithms: Recently, several supervised BGS algorithms for unseen videos have been introduced. ChangeDet [22] and 3DFR [21] proposed end-to-end convolutional neural networks for BGS that use both spatial and temporal information, based on previous frames and a simple median-based background model. Similarly, Kim et al. [16] introduced a U-Net-based [23] neural network that takes as input a concatenation of the current frame and several background models generated at different time scales. During evaluation, these methods divided the videos of a popular BGS dataset, CDNet-2014 [13], into a training set and a testing set, and reported results for the test videos, unseen by the algorithm during training. Although all three algorithms outperform unsupervised algorithms on their own test sets, their true performance is unknown since no results were reported for the full dataset. Furthermore, these algorithms cannot be compared with each other since each used a different train/test split. Recently, BSUV-Net [30] also used a U-Net architecture but added a novel augmentation step to handle illumination variations, thus improving performance on unseen videos. The authors proposed an evaluation approach covering the whole CDNet-2014 dataset by using 18 training/test sets and showed promising results on the dataset's evaluation server. However, their approach with 18 training/test sets is complicated and makes any comparison of future algorithms against BSUV-Net difficult.
In this paper, we improve the performance of BSUV-Net by introducing multiple spatio-temporal data augmentations designed to attack the most common challenges in BGS. We name our improved algorithm BSUV-Net 2.0 and show that it significantly outperforms state-of-the-art BGS algorithms on unseen videos. We also introduce a real-time version of BSUV-Net 2.0 and call it Fast BSUV-Net 2.0. Finally, we propose a 4-fold cross-validation strategy to facilitate fair and streamlined comparison of unsupervised and video-agnostic supervised algorithms, which should prove useful for future BGS algorithm comparisons on CDNet-2014.

Summary of BSUV-Net
BSUV-Net is among the top-performing BGS algorithms designed for unseen videos. We briefly summarize it below.
BSUV-Net is a U-Net-based [23] CNN which takes a concatenation of 3 images as input and produces a probabilistic foreground estimate. The input consists of two background models captured at different time scales and the current frame. One background model, called "empty", is a manually-selected static frame void of moving objects, whereas the other model, called "recent", is the median of the previous 100 frames. All three input images to BSUV-Net consist of 4 channels: the R, G, B color channels and a foreground probability map (FPM). The FPM is an initial foreground estimate for each input image computed by PSPNet [33], a semantic segmentation algorithm that does not use any temporal information. For more details on the network architecture and the FPM channel of BSUV-Net, please refer to the original paper [30]. BSUV-Net uses a relaxed Jaccard index as the loss function:

L(Y, Ŷ) = 1 − (T + Σ_{m,n} Y_{m,n} Ŷ_{m,n}) / (T + Σ_{m,n} (Y_{m,n} + Ŷ_{m,n} − Y_{m,n} Ŷ_{m,n}))

where Ŷ ∈ [0, 1]^{w×h} is the predicted foreground probability map, Y ∈ {0, 1}^{w×h} is the ground-truth foreground label, T is a smoothing parameter and m, n are spatial locations.
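For reference, the relaxed Jaccard loss can be sketched in a few lines of NumPy (a minimal sketch; the function name and the default value of T are our illustrative choices):

```python
import numpy as np

def relaxed_jaccard_loss(y_pred, y_true, T=1.0):
    """Relaxed Jaccard loss (sketch).

    y_pred: predicted foreground probabilities in [0, 1], shape (h, w).
    y_true: binary ground-truth foreground labels {0, 1}, shape (h, w).
    T: smoothing parameter that stabilizes the ratio for empty masks.
    """
    intersection = np.sum(y_pred * y_true)
    union = np.sum(y_pred + y_true - y_pred * y_true)
    jaccard = (T + intersection) / (T + union)
    return 1.0 - jaccard
```

Minimizing 1 − J_R drives the prediction toward the label; a perfect binary prediction yields zero loss.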
The authors of BSUV-Net also proposed a novel data-augmentation method for video that addresses illumination differences (ID) often present between video frames. In an ablation study, they demonstrated a significant impact of this augmentation on the overall performance. In this paper, we expand on this idea and introduce a new category of data augmentations designed specifically for spatio-temporal video data.

Spatio-Temporal Data Augmentations
In this section, we first introduce mathematical notation, then describe currently-used augmentations and propose new spatio-temporal augmentations. Fig. 1 shows one example of each of the proposed augmentations.

Notation
Let us consider an input-label pair of BSUV-Net. The input consists of I_E, I_R, I_C ∈ R^{w×h×4}: an empty background, a recent background and the current frame, respectively, where w, h are the width and height of each image. Each image has 4 channels: three colors (R, G, B) plus the FPM discussed above. Similarly, let I_FG ∈ {0, 1}^{w×h} be the corresponding foreground label, where 0 represents the background and 1 the foreground.
Although the resolution of input images varies from video to video, it is beneficial to use a single resolution during training in order to leverage parallel processing of GPUs. Therefore, the first augmentation step we propose is spatio-temporal cropping that maps each video to the same spatial resolution. In the second step, we propose two additional augmentations that modify video content but not size.
In our two-step process, we first use different cropping functions to compute Ĩ_E, Ĩ_R, Ĩ_C ∈ R^{w̃×h̃×4} and Ĩ_FG ∈ {0, 1}^{w̃×h̃} from I_E, I_R, I_C and I_FG, where w̃, h̃ are the desired width and height after cropping. In the second step, we apply post-crop augmentations to compute Î_E, Î_R, Î_C ∈ R^{w̃×h̃×4} and Î_FG ∈ {0, 1}^{w̃×h̃} from Ĩ_E, Ĩ_R, Ĩ_C and Ĩ_FG. Below, we explain these two steps in detail.

Spatio-Temporal Crop
In this section, we define 3 augmentation techniques to compute Ĩ_E, Ĩ_R, Ĩ_C, Ĩ_FG from I_E, I_R, I_C, I_FG, each addressing a different BGS challenge. Let us also define a cropping function, to be used in this section, as follows:

C(I, i, j, h, w) = I[ i − ⌈h/2⌉ : i + ⌈h/2⌉, j − ⌈w/2⌉ : j + ⌈w/2⌉ ]

where i, j are the center coordinates, h, w are the height and width of the crop, ⌈·⌉ denotes the ceiling function and a : b denotes the range of integer indices a, a + 1, . . . , b − 1.
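The cropping function C can be sketched in NumPy as follows (a minimal sketch; we anchor the window so that the output always has exactly the requested size, which is an implementation assumption):

```python
import numpy as np

def crop(img, i, j, h, w):
    """Fixed-size crop of img centered at (i, j), a sketch of C(I, i, j, h, w).

    Uses half-open integer ranges, matching the a:b convention in the text;
    the caller is responsible for keeping the window inside the image.
    """
    h2, w2 = int(np.ceil(h / 2)), int(np.ceil(w / 2))
    return img[i - h2:i - h2 + h, j - w2:j - w2 + w]
```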
Spatially-Aligned Crop: This is an extension of the widely-used spatial cropping for individual images. Although this is straightforward, we provide a precise definition in order to better understand the subsequent sections.
The output of a spatially-aligned crop is defined as follows:

Ĩ_k = C(I_k, i, j, h̃, w̃),  k ∈ {E, R, C, FG}

where i, j are randomly-selected spatial indices of the center of the crop. This formulation allows us to obtain a fixed-size, spatially-aligned crop from the input-label pair.
Randomly-Shifted Crop: One of the most challenging scenarios for BGS algorithms is camera jitter, which causes random spatial shifts between consecutive video frames. However, since the variety of such videos in public datasets is limited, it is not trivial to learn the behavior of camera jitter using a data-driven algorithm. To address this, we introduce a new data augmentation method that simulates camera jitter; in consequence, spatially-aligned inputs look randomly shifted. This is formulated as follows:

Ĩ_k = C(I_k, i_k, j_k, h̃, w̃),  k ∈ {E, R, C, FG}

where i_k, j_k are randomly selected, but such that i_C = i_FG and j_C = j_FG to make sure that the current frame and the foreground label are aligned. By using different crop-center indices for the background images and the current frame, we create a spatial shift in the input.
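A possible implementation of the randomly-shifted crop is sketched below (our assumptions: a ±5-pixel jitter range and helper names of our choosing; the current frame and the label share the same crop center, while each background gets its own):

```python
import numpy as np

def crop(img, i, j, h, w):
    """Fixed-size crop centered at (i, j)."""
    h2, w2 = int(np.ceil(h / 2)), int(np.ceil(w / 2))
    return img[i - h2:i - h2 + h, j - w2:j - w2 + w]

def randomly_shifted_crop(empty, recent, current, label, h, w, max_shift=5, rng=None):
    """Simulate camera jitter: the background crops are shifted relative to the
    current frame, while the current frame and the label stay aligned (sketch)."""
    rng = rng or np.random.default_rng()
    H, W = current.shape[:2]
    # base center chosen so every shifted crop stays inside the frame
    i = rng.integers(h // 2 + max_shift + 1, H - h // 2 - max_shift)
    j = rng.integers(w // 2 + max_shift + 1, W - w // 2 - max_shift)
    def jitter():
        di, dj = rng.integers(-max_shift, max_shift + 1, size=2)
        return i + di, j + dj
    (ie, je), (ir, jr) = jitter(), jitter()
    return (crop(empty, ie, je, h, w), crop(recent, ir, jr, h, w),
            crop(current, i, j, h, w), crop(label, i, j, h, w))
```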
PTZ Camera Crop: Another challenging BGS scenario is PTZ camera movement. While such videos are very common in surveillance, they form only a small fraction of public datasets. Therefore, we introduce another data augmentation technique specific to this challenge.
Since PTZ videos do not have a static empty background frame, BSUV-Net [30] handles them differently than other categories. Instead of the empty and recent backgrounds, the authors suggest using a recent and a more recent background, where the recent background is computed as the median of the 100 preceding frames and the more recent background as the median of the 30 preceding frames. To simulate this kind of behavior, we introduce two types of PTZ camera crops: (i) the zooming camera crop and (ii) the moving camera crop.
The zooming camera crop averages N_z crops of gradually changing size, each resized back to (w̃, h̃), where z_E, z_R represent the zoom factors for the empty and recent backgrounds and N_z represents the number of zoomed-in/out frames to use in averaging. In our experiments, we use −0.1 < z_E, z_R < 0.1 and 5 < N_z < 15 to simulate real-world camera zooming. R(I, h̃, w̃) is an image resizing function that changes the resolution of I to (w̃, h̃) using bilinear interpolation. Note that using positive values for z_k simulates zooming in, whereas negative values simulate zooming out. Fig. 1(d) shows an example of zoom-in. Similarly, the moving camera crop averages crops whose centers shift from frame to frame, where p, q are the vertical and horizontal shift amounts per frame and N^m_E, N^m_R represent the numbers of empty and recent moving background crops to use for averaging. This simulates camera pan and tilt. In our experiments, we use −5 < p, q < 5 and 5 < N^m_E, N^m_R < 15 to simulate real-world camera movement.
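The zooming camera crop might be sketched as follows (our assumptions: a linear per-frame zoom schedule and a nearest-neighbor stand-in for the paper's bilinear resize R; all names are illustrative):

```python
import numpy as np

def resize_nn(img, h, w):
    """Minimal nearest-neighbor resize standing in for the bilinear R(I, h, w)."""
    H, W = img.shape[:2]
    rows = np.arange(h) * H // h
    cols = np.arange(w) * W // w
    return img[rows][:, cols]

def zooming_crop(img, i, j, h, w, z, n_z):
    """Sketch of the zooming camera crop: average n_z crops whose size shrinks
    (z > 0, zoom in) or grows (z < 0, zoom out) frame by frame, each resized
    back to (h, w). The exact interpolation schedule is our assumption."""
    out = np.zeros((h, w), dtype=float)
    for n in range(1, n_z + 1):
        s = 1.0 - z * n / n_z                       # per-frame scale factor
        hh = max(2, int(round(h * s)))
        ww = max(2, int(round(w * s)))
        h2, w2 = hh // 2, ww // 2
        patch = img[i - h2:i - h2 + hh, j - w2:j - w2 + ww]
        out += resize_nn(patch, h, w)
    return out / n_z
```

The moving camera crop would follow the same averaging pattern, shifting the crop center by (p, q) per frame instead of rescaling.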

Post-Crop Augmentations
In this section, we define several augmentation techniques to compute Î_E, Î_R, Î_C, Î_FG from Ĩ_E, Ĩ_R, Ĩ_C, Ĩ_FG. These augmentations can be applied after any one of the spatio-temporal crop augmentations.
Illumination Difference: Illumination variations are quite common, especially in long videos, for example due to changes in natural light or lights being turned on or off. The authors of BSUV-Net [30] introduced a temporal data augmentation technique for handling illumination changes, with the goal of increasing the network's generalization capacity on unseen videos. In our notation, this augmentation can be formulated as follows:

Î_k = Ĩ_k + d_k,  k ∈ {E, R, C}

where d_E, d_R, d_C ∈ R^3 represent illumination offsets applied to the RGB channels of the input images.
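A sketch of this augmentation, assuming offsets drawn uniformly from a small range, pixel values in [0, 1] and an untouched FPM channel (the offset range and channel handling are our assumptions):

```python
import numpy as np

def illumination_augment(empty, recent, current, rng=None, max_offset=0.1):
    """Illumination-difference augmentation (sketch): add an independent random
    RGB offset to each input image and clip the result to [0, 1]."""
    rng = rng or np.random.default_rng()
    outs = []
    for img in (empty, recent, current):
        d = rng.uniform(-max_offset, max_offset, size=3)   # one offset per color channel
        out = img.copy()
        out[..., :3] = np.clip(out[..., :3] + d, 0.0, 1.0)  # leave the FPM channel untouched
        outs.append(out)
    return outs
```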
Intermittent-Object Addition: Another challenge for BGS arises when objects enter a scene but then stop and remain static for a long time. After some time, even very successful BGS algorithms predict these objects as part of the background, since they rely on recent frames to estimate the background model. BSUV-Net overcomes this challenge by using inputs from multiple time scales; however, it still underperforms on videos with intermittently-static objects. To address this, we introduce another spatio-temporal data augmentation specific to this challenge.
We use a masking-based approach for intermittently-static objects as follows. In addition to the cropped inputs Ĩ_E, Ĩ_R, Ĩ_C, Ĩ_FG, we also use cropped inputs from videos with intermittently-static objects, defined as Ĩ^IO_E, Ĩ^IO_R, Ĩ^IO_C ∈ R^{w̃×h̃×4} and Ĩ^IO_FG ∈ {0, 1}^{w̃×h̃}. We copy foreground pixels from the intermittently-static input and paste them into the original input to synthetically create an intermittent object. This can be formulated as follows:

Î_C = (1 − Ĩ^IO_FG) ⊙ Ĩ_C + Ĩ^IO_FG ⊙ Ĩ^IO_C,  Î_FG = max(Ĩ_FG, Ĩ^IO_FG)

where ⊙ denotes the Hadamard (element-wise) product. Fig. 1(f) shows an example of intermittent object addition. Note that this augmentation requires prior knowledge of examples with intermittently-static objects, which can be found in some public datasets.
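The masking-based paste can be sketched as follows (our assumptions: the paste is applied to the recent background and the current frame so that the object appears static, and the label becomes the union of the two foreground masks; which images receive the paste is not specified above):

```python
import numpy as np

def add_intermittent_object(recent, current, label, io_current, io_label):
    """Intermittent-object addition (sketch): copy foreground pixels of a donor
    input into the recent background and the current frame, and mark the pasted
    pixels as foreground in the label."""
    m = io_label.astype(bool)                    # Hadamard mask from the donor label
    recent_out, current_out = recent.copy(), current.copy()
    recent_out[m] = io_current[m]
    current_out[m] = io_current[m]
    label_out = np.maximum(label, io_label)      # pasted object is foreground
    return recent_out, current_out, label_out
```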

Combining Spatio-Temporal Augmentations
While each of the augmentations defined above can be used by itself to improve BGS performance on related categories, combining several or even all of them may yield a better algorithm for a general unseen video whose category is unknown. However, combining the crop augmentations is not trivial, since it is not practical to apply more than one crop function to a single input. Thus, we use online augmentation, where we randomly augment every input while forming the mini-batches. The augmentation steps are as follows: (i) we randomly select one of the spatio-temporal crop augmentations and apply it to the input; (ii) we apply the illumination-change augmentation using randomized illumination values; (iii) we apply intermittent-object addition to p% of the inputs. Note that a different combination of augmentations will be applied to the same input in different epochs, which we expect to significantly increase the generalization capacity of the network.
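The three steps above can be sketched as a generic online-augmentation pipeline (all function arguments are placeholders standing in for the augmentations described in the text):

```python
import random

def augment(sample, crop_fns, illumination_fn, add_object_fn, p_object=0.1, rng=None):
    """Online augmentation pipeline (sketch): pick one crop, always apply the
    illumination change, and add an intermittent object with probability p_object."""
    rng = rng or random.Random()
    sample = rng.choice(crop_fns)(sample)   # (i) one randomly-chosen spatio-temporal crop
    sample = illumination_fn(sample)        # (ii) randomized illumination change
    if rng.random() < p_object:             # (iii) intermittent-object addition for p% of inputs
        sample = add_object_fn(sample)
    return sample
```

Because the pipeline is applied on the fly while mini-batches are formed, the same input receives a different combination of augmentations in different epochs.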

Video-Agnostic Evaluation Strategy for Supervised Algorithms
The most commonly used BGS datasets with a variety of scenarios and pixel-wise ground-truth annotations are CDNet-2014 [13], LASIESTA [7] and SBMI2015 [20]. Among these 3 datasets, only CDNet-2014 has a well-maintained evaluation server that keeps a cumulative performance record of the uploaded algorithms. Moreover, it has been the most widely-used dataset for BGS in recent years, with publicly-available evaluation results for nearly all of the published BGS algorithms. Since one of our aims is to compare the performance of BSUV-Net 2.0 on unseen videos with SOTA video-agnostic BGS algorithms, the availability of public results for these algorithms is important for this work. Thus, we use CDNet-2014 as our evaluation dataset.

CDNet-2014 is a comprehensive dataset that provides some ground-truth frames from all 53 videos to the public, but keeps others internally for algorithm comparison. Since it does not include any videos with no labeled frames, it is not directly suitable for testing video-agnostic supervised algorithms. In consequence, most of the leading algorithms are either video- or video-group-optimized and achieve near-perfect results by over-fitting the training data. However, these results do not generalize to unseen videos [30,22]. Several researchers addressed this problem by designing generalizable networks and evaluating their algorithms on unseen videos, using different videos in training and testing [30,21,22,16]. However, there is no common strategy for testing the performance of supervised BGS algorithms on CDNet-2014 for unseen videos. Some recent papers divide the dataset into two folds, train their algorithm on one fold and test on the other. Since they only report results on the test videos that they selected, their results might be biased towards their test set and are not directly comparable with unsupervised algorithms.
On the other hand, BSUV-Net [30] provides a video-agnostic evaluation strategy for the full CDNet-2014 dataset by using 18 training/testing video sets. This strategy might also be biased due to task-specialized sets. Moreover, it is computationally expensive for other researchers to replicate.
In this paper, we introduce a simple and intuitive 4-fold cross-validation strategy for CDNet-2014. Table 1 shows these 4 folds. We grouped the videos of the dataset, and of each category, into 4 folds as evenly as possible. The proposed video-agnostic evaluation strategy is to train any supervised BGS algorithm on three of the folds, test on the remaining one, and repeat the same process for all 4 combinations. This approach provides results on the full CDNet-2014 dataset, which can be uploaded to the evaluation server for comparison against SOTA. We believe this cross-validation strategy will be very beneficial for the evaluation of future BGS algorithms.
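The proposed protocol amounts to a standard cross-validation loop; a sketch, where train_fn and test_fn are placeholders for any supervised BGS algorithm:

```python
def cross_validate(videos_by_fold, train_fn, test_fn):
    """4-fold video-agnostic evaluation (sketch): for each fold, train on the
    other three folds and predict the held-out one, yielding predictions for
    every video in the dataset from a model that never saw it in training."""
    predictions = {}
    for held_out, test_videos in enumerate(videos_by_fold):
        train_videos = [v for f, fold in enumerate(videos_by_fold)
                        if f != held_out for v in fold]
        model = train_fn(train_videos)
        for v in test_videos:
            predictions[v] = test_fn(model, v)
    return predictions
```

The merged predictions cover the full dataset and can be submitted to the evaluation server as a single result.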

Dataset and Evaluation Details
We evaluate the performance of our algorithm on CDNet-2014 [13] using the evaluation strategy described in Section 5. In CDNet-2014, the spatial resolution of videos varies from 320 × 240 to 720 × 526 pixels. The videos are labeled pixel-wise as follows: 1) foreground, 2) background, 3) hard shadow or 4) unknown motion. As suggested by [13], we ignored pixels with unknown motion label and considered hard-shadow pixels as background during evaluation.
For comparison of our algorithm with SOTA, we use the metrics reported on CDNet-2014, namely recall (Re), specificity (Sp), false positive rate (FPR), false negative rate (FNR), percentage of wrong classifications (PWC), precision (Pr) and F-measure (F1). We also report two ranking-based metrics, "average ranking" (R) and "average ranking across categories" (R_cat), which combine all 7 metrics into ranking scores. The details of these rankings can be found in [12].
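For completeness, the seven metrics follow directly from pixel-level confusion counts; a sketch (the dictionary keys are our shorthand):

```python
def bgs_metrics(tp, fp, tn, fn):
    """The seven CDNet-2014 metrics from pixel-level confusion counts (sketch)."""
    re = tp / (tp + fn)                              # recall
    sp = tn / (tn + fp)                              # specificity
    fpr = fp / (fp + tn)                             # false positive rate
    fnr = fn / (tp + fn)                             # false negative rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)    # percentage of wrong classifications
    pr = tp / (tp + fp)                              # precision
    f1 = 2 * pr * re / (pr + re)                     # F-measure
    return {"Re": re, "Sp": sp, "FPR": fpr, "FNR": fnr, "PWC": pwc, "Pr": pr, "F1": f1}
```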

Training Details
In order to train BSUV-Net 2.0, we use parameters similar to those used for BSUV-Net. The same parameters are used for each of the four cross-validation folds. We used the ADAM optimizer with a learning rate of 10^−4, β_1 = 0.9 and β_2 = 0.99. The mini-batch size was 8 and the number of epochs was 200. As the empty background frame, we used the manually-selected frames introduced in [30]. We used the median of the preceding 100 frames as the recent background.
In terms of spatio-temporal data augmentations, we use an online approach that randomly changes the parameters under the following constraints. The random pixel shift between inputs is sampled from U(0, 5), where U(a, b) denotes a uniform random variable between a and b. The zooming-in ratios are sampled from U(0, 0.02) and U(0, 0.04) for the recent and empty backgrounds, respectively, while the zooming-out ratios are sampled from U(−0.02, 0) and U(−0.04, 0). We use N_z = 10. The horizontal pixel shift for the moving camera augmentation is sampled from U(0, 5), with N^m_E = 20 and N^m_R = 10. We perform no vertical-shift augmentation since CDNet-2014 does not include any videos with vertical camera movement. For the illumination change, pixel values are assumed to lie in [0, 1] when sampling the illumination offsets d_E, d_R, d_C. Lastly, for intermittent object addition, we always use the "intermittent object motion" inputs from the current training set and apply this augmentation to only 10% of the inputs. The details of the training and evaluation implementation with all of the defined augmentations will be made publicly available upon publication. During inference, binary maps are obtained by thresholding the network output at θ = 0.5.

Ablation Study
We now analyze the effect of each spatio-temporal data augmentation technique defined in Section 4. As the baseline network, we used BSUV-Net with only the spatially-aligned crop augmentation and random Gaussian noise sampled from N(0, 0.01^2). We tested all of the other augmentations by using them together with the spatially-aligned crop. For the PTZ camera crop, for each input we randomly select one of the following: zooming in, zooming out, moving right or moving left. The combination of data augmentations is performed as described in Section 4.4, with the addition of random Gaussian noise applied after the last step. Table 2 shows the category-wise F-score results for CDNet-2014. All results were calculated locally on the CDNet-2014 frames with available ground truth, and we report the median of the results for every 5th epoch between the 150th and 200th epochs to disregard small fluctuations in the learning process. Fig. 2 shows visual results of these algorithms for 5 videos. It can be observed that each augmentation type significantly improves the performance on its related categories (randomly-shifted crop on "Camera jitter", PTZ camera crop on "PTZ", illumination difference on "Shadow", intermittent object addition on "Intermittent object motion"), but combining all augmentations significantly decreases the performance on some categories (e.g., "Night" and "Intermittent object motion"). We believe this is due to trade-offs between the effects of different augmentations. For example, when a static background object starts moving, it should be labeled as foreground, but a network trained with the randomly-shifted crop augmentation can confuse this input with an input from the "Camera jitter" category and continue labeling the object as background. Still, the overall performance of the algorithm that uses all augmentations, BSUV-Net 2.0 (last column in Table 2), handily outperforms each individual augmentation.
Since BGS is often applied as a pre-processing step in real-time video processing applications, computation speed is critical. One of the main bottlenecks of BSUV-Net is the computation of the FPM for each input image, which decreases the overall computation speed significantly. Therefore, we also implemented a fast version of BSUV-Net 2.0, called Fast BSUV-Net 2.0, by removing the FPM channels and using a 9-channel input instead of a 12-channel one. Table 3 shows a speed and performance comparison of the two versions. Clearly, while Fast BSUV-Net 2.0 has lower performance, it can be used in real-time applications at 320 × 240 spatial resolution, which is very similar to the resolution used in training. For higher-resolution videos, one can easily feed decimated frames into Fast BSUV-Net 2.0 and interpolate the resulting BGS predictions to the original resolution.

Table 4 shows the performance of BSUV-Net 2.0 and Fast BSUV-Net 2.0 compared to state-of-the-art BGS algorithms that are designed for, and tested on, unseen videos. We did not include the results of video- or video-group-optimized algorithms since it is not fair to compare them against video-agnostic algorithms. This table shows official results computed by the CDNet-2014 evaluation server, so the results of our models differ from those in Tables 2 and 3 (different ground-truth frames). We compare BSUV-Net 2.0 with some of the top-performing video-agnostic algorithms reported by this server. RTSS [31] and 3DFR [21] are not included in this table since their results are not reported on the server. Video-agnostic results of FgSegNet v2 are taken from [30]. Both BSUV-Net 2.0 and Fast BSUV-Net 2.0 clearly outperform all state-of-the-art algorithms. Table 5 shows the comparison of F1 results for each category. This table includes RTSS, using results reported in the paper [31].
In 7 out of 11 categories, either BSUV-Net 2.0 or Fast BSUV-Net 2.0 achieves the best performance, including most of the categories that we designed the augmentations for (the exception being the "Night" category). However, note that the best-performing algorithm in the "Night" category is BSUV-Net, which uses only the illumination-difference augmentation and thus focuses on videos with illumination differences, such as night videos. As discussed in Section 2, 3DFR [21], ChangeDet [22] and Kim et al. [16] are also among the best video-agnostic supervised algorithms; however, each reports performance on a different subset of CDNet-2014, with the algorithm trained on the remaining videos. Table 6 shows the comparison of BSUV-Net 2.0 with these algorithms, using in each column the training/testing split provided in the respective paper. BSUV-Net 2.0 clearly outperforms all three competitors, while Fast BSUV-Net 2.0 beats 2 out of 3, and does so in real time.

Conclusions
While background subtraction algorithms achieve remarkable performance today, they still often fail in challenging scenarios such as shaking or panning/tilting/zooming cameras, or when moving objects stop for an extended time. In the case of supervised algorithms, this is largely due to the limited availability of labeled videos covering such scenarios, which makes it difficult to train end-to-end deep-learning algorithms for unseen videos. To address this, we introduced several spatio-temporal data augmentation methods to synthetically increase the number of inputs in such scenarios. Specifically, we introduced new augmentations for the PTZ, camera jitter and intermittent object motion scenarios, and achieved a significant performance increase in these categories and, consequently, better overall performance. We also introduced a real-time version of BSUV-Net 2.0 which still performs better than state-of-the-art methods. Furthermore, we designed a 4-fold cross-validation setting for CDNet-2014 for easier comparison of future algorithms with the state of the art.