Foreground Object Detection in Visual Surveillance With Spatio-Temporal Fusion Network

Object detection generally shows promising results using only spatial information, but foreground object detection in visual surveillance requires proper use of temporal information in addition to spatial information. Recently, deep learning-based visual surveillance algorithms have shown results superior to traditional background subtraction (BGS) algorithms in environments similar to their training environment. In unseen environments, however, they perform worse than BGS algorithms. This paper proposes an algorithm that improves performance in unseen environments by integrating spatial and temporal information. We propose a spatio-temporal fusion network (STFN) that extracts temporal and spatial information with 3D and 2D sub-networks. We also propose a method for stable training of the proposed STFN using a semi-foreground map. STFN can generate a compliant background model image and operates in real time on a desktop with a GPU. Experiments on various public datasets demonstrate that the proposed algorithm performs well in environments different from the training one.


I. INTRODUCTION
Visual surveillance aims to detect foreground objects stably in various environments. Unlike general object detection, which usually considers a single image, foreground object detection in visual surveillance requires a proper assessment of both spatial and temporal information. Traditional BGS algorithms use background model images generated by statistical analysis and feedback, usually detecting foreground objects by comparing a current image with the background model image. Temporal information is reflected in the background model image through the update process, but the use of spatial information is limited to a local area due to the fixed window size.
Recently, deep learning-based algorithms for visual surveillance have shown superior performance compared to BGS algorithms. FgSegNet_v2 [1] currently records the highest performance in the ranking of the CDnet2014 dataset [2]. It achieves an average false detection rate of 0.2% over the entire image sequence while using only 1% of the images for training. However, since FgSegNet_v2 [1] is a spatial network that processes only a single image without considering temporal information, its performance drops in unseen environments. (The associate editor coordinating the review of this manuscript and approving it for publication was Hossein Rahmani.)
This paper proposes a spatio-temporal fusion network (STFN) to perform better than BGS algorithms in unseen environments. We use the latest deep learning models, 3DCD [3] and FgSegNet_v2 [1], as the temporal and spatial networks in the configuration of the proposed STFN. We also propose a method that uses a semi-foreground map for training, providing improved performance. In experiments, we follow the procedure proposed by Mandal and Vipparthi [4] to show performance improvement in unseen environments. They propose scene-independent data division (SIE) and scene-dependent data division (SDE). The SIE setup uses different scenes for training and evaluation, while the SDE setup divides the same scene into training and evaluation sets. We evaluate generalization ability through the SIE setup in our experiments.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The proposed algorithm has the following contributions.
1) It has excellent foreground object detection ability in new environments not observed during training. This is possible by integrating a temporal network that extracts spatio-temporal information from multiple images with a spatial network that focuses on extracting foreground objects. The temporal network produces a background model by processing temporal information among successive images. The spatial network detects foreground objects by processing the difference image produced by subtracting the background model image from the current image.
2) We propose a method that uses a loss derived from the semi-foreground map (SFM) for stable training and performance improvement of the STFN model.

II. RELATED WORKS

A. TRADITIONAL APPROACHES
Stauffer and Grimson [5] proposed the mixture of Gaussians (MOG) algorithm, which models the brightness change of pixels as a combination of multiple Gaussian distributions. Elgammal et al. [6] proposed an algorithm that uses kernel density estimation, a probabilistic non-parametric method. Kim et al. [7] proposed an algorithm that uses a codebook: intensity, color, and temporal features form codewords, which are later used to build the codebook. Barnich and Van Droogenbroeck [8] used a sample-based method for background modeling, determining foreground and background by comparing samples from previous frames to the current pixel. St-Charles and Bilodeau [9] improved the algorithm of Barnich et al. [8] by using local binary similarity patterns [10] as additional features alongside pixel intensities and by modifying the update mechanism of the thresholds and the background model. Haines and Xiang [11] proposed a new BGS algorithm based on Dirichlet process Gaussian mixture models that estimate background distributions per pixel; this non-parametric Bayesian method enables stable per-pixel foreground detection. St-Charles et al. [12] proposed the SuBSENSE algorithm, which provides universal pixel-level segmentation using spatiotemporal binary features and color information. They introduced pixel-level feedback loops for the automatic dynamic adjustment of internal parameters. St-Charles et al. [13] proposed a visual surveillance algorithm based on non-parametric, pixel-level background modeling using word dictionaries drawn from traditional codebook and sample consensus approaches.
Laugraud et al. [14] proposed LaBGen, a background modeling algorithm that includes a pixel-wise temporal median filter and a patch selection mechanism based on motion detection. Berjón et al. [15] proposed a non-parametric, real-time, high-quality moving object detection strategy. They used both foreground and background models to cope with environments in which the shapes of background and foreground objects are similar. Ortego et al. [16] proposed a post-processing framework that improves the foreground object detection performance of existing BGS algorithms. Yang et al. [17] proposed spatiotemporally scalable matrix recovery (SSMR) for separating foreground objects in environments with complex foreground objects and camera movements. Garg et al. [18] suggested a new block-based feature suitable for background modeling in road environments and detecting moving vehicles. Hossain et al. [19] proposed Fast-D, a BGS-based foreground object detection algorithm that uses a non-smoothing color feature to make detection robust. Roy and Bouwmans [20] proposed a new pixel-based object detection algorithm using dual-type pixel-level information to generate a background model. Li et al. [21] proposed a generalized shrinkage thresholding operator (GSTO) that integrates three familiar shrinkage operators.
Traditional BGS algorithms have the advantage of not requiring training with ground-truth labels, but their performance is inferior to deep learning-based algorithms. This is because the range of situations covered by statistical modeling is much smaller than that covered by deep learning-based algorithms.

B. DEEP LEARNING APPROACHES
Braham and Van Droogenbroeck [22] proposed ConvNet, which produces a foreground probability map with a LeNet-5-type network that combines a background model image and a current image. This method performed better than other BGS methods on the CDnet2014 dataset [2]. Zhao et al. [23] proposed a deep neural network consisting of two modules: background reconstruction and foreground segmentation. The background reconstruction part was trained in a supervised manner. Like the method of Braham and Van Droogenbroeck [22], this method has weaknesses in new environments.
Wang et al. [24] proposed multi-scale CNN (MSCNN) and MSCNN+Cascade models using multiple CNN models; the cascade model uses two CNN models. However, they did not demonstrate generalization performance in new environments. Zeng and Zhu [25] proposed a new multi-scale fully convolutional network (MFCN) based on VGG-16 [26]. They used RGB images as input without creating separate background model images. Babaee et al. [27] proposed a background pixel library (BL) that performs background modeling using the segmentation map of SuBSENSE [12]. They stacked the generated background model image and the current image depth-wise and used the result as the CNN input. Lim and Keles [28] proposed the multi-scale segmentation architectures FgSegNet_M and FgSegNet_S. FgSegNet_M downsamples an input image to two different scales; the original image and the two downsampled images are then passed through the CNN. FgSegNet_S extracts features at multiple scales with a feature pooling module (FPM) that uses dilated convolutions. Lin et al. [29] proposed an algorithm that generates a foreground map using a deep neural network consisting of 20 convolutional layers and three deconvolutional layers. Zeng et al. [30] proposed an algorithm that uses the outputs of a semantic segmenter and a BGS segmenter. They obtain an improved foreground map with post-processing and feed it back to the BGS algorithm. ICNet [31] and PSPNet [32] were used as semantic segmenters, and ViBe [8] and SuBSENSE [12] were used as BGS segmenters, showing superior performance compared to BGS algorithms under various conditions. However, this approach is unsatisfactory compared to the latest deep learning methods.
Qiu and Li [33] proposed a fully convolutional encoder-decoder spatial-temporal network (FCESNet) that uses multiple images as inputs. FCESNet consists of an encoder, a spatial-temporal information transmission (STIT) module, and a decoder, with multiple input/output structures. Gomaa et al. [34] proposed an algorithm that detects and tracks moving vehicles through background subtraction using a CNN. Patil and Murala [35] proposed a motion saliency foreground network (MSFgNet). The model receives 50 images and generates a background model image through a background estimation network (BENet). Lim et al. [1] proposed FgSegNet_v2 using a feature pooling module (FPM). They used VGG-16 [26] as an encoder and fed the compressed information into a modified FPM (M-FPM).
Tezcan et al. [36] proposed a background subtraction algorithm for unseen videos called BSUV-Net, which shows excellent performance in the SIE environment. However, its empty reference frame is built from the intermediate values of the initial frames, so it cannot immediately respond to a modified background in which background objects are added or disappear. Akilan et al. [37] proposed a model composed of a 3D CNN-LSTM encoder and a 3D CNN-LSTM decoder. They proposed a double-encoding technique using an autoencoder-type micro-module and used 3D convolutions to capture short temporal motions. Patil et al. [38] proposed an edge extraction mechanism (EEM) and a dense residual block (DRB) for foreground object detection. Zheng et al. [39] proposed a new BGS algorithm based on parallel vision and Bayesian generative adversarial networks.
Rezaei et al. [40] proposed a generative low-dimensional background model (G-LBM) consisting of convolutional layers, fully connected layers, and deconvolutional layers. Mandal et al. [4] proposed ChangeDet, a lightweight and fast network suitable for visual surveillance. ChangeDet consists of depth reductionist background estimation (DRBE), contrasting feature assimilation (CFA), and contrasting feature-based encoder-decoder (CfE-CfD) blocks. This method shows generalization ability superior to other deep learning methods in SIE and SDE environments.
Mandal et al. [3] proposed a fast and lightweight end-to-end 3D-CNN-based change detection network called 3DCD with excellent performance in the SIE environment. 3DCD consists of gradual reduction background estimation (GRBE), foreground saliency reinforcement (FSR), multi-schematic encoder-decoder (MScE-D), and compact foreground detection (CFD) blocks. Tezcan et al. [41] proposed a spatio-temporal data augmentation method for their previous work [36], including spatio-temporal crops, data amplification using illumination differences, and intermittent-object addition.
Li et al. [47] proposed an algorithm for object detection in video surveillance. They adapt a generic CNN-based classifier to visual surveillance through transfer learning and a module for effective learning of the local and global contexts of surveillance scenes. Nguyen et al. [48] dealt with change detection in visual surveillance by integrating a motion feature network with traditional BGS algorithms; the motion feature network uses features trained by a triplet network. Zhu et al. [49] proposed N2DGAN to improve object detection in nighttime scenes. It generates a virtual daytime background model image corresponding to a nighttime scene, improving object detection in nighttime images.
Deep learning-based visual surveillance algorithms [1], [35] have shown dramatic improvement over traditional BGS algorithms [11], [12], [15], [19]. However, they perform well only in scenes similar to the training ones; in unseen environments, their performance is worse than that of conventional BGS algorithms. Some recent research [3], [4] has shown good performance in unseen environments, but significant margins for improvement remain.

III. PROPOSED METHOD

A. SPATIO-TEMPORAL FUSION NETWORK (STFN)
The proposed algorithm uses multiple current and past images as network input to extract spatio-temporal information from image sequences. A temporal network is responsible for extracting information in the temporal domain and generates a background model image. A 2D network generates a foreground map by processing the difference image between the background model image and the current image. In this paper, the gradual reduction background estimation (GRBE) module of 3DCD [3] and FgSegNet_v2 [1] are adopted as the temporal and spatial networks in configuring the proposed STFN. The proposed algorithm effectively integrates these two networks and offers a method to train the combined network. Figure 1 shows the structure of the proposed STFN. We use the GRBE module to generate a background model image from 50 gray images. The generated background model image (1 × 224 × 224 × 1) is reshaped to a 3D tensor (224 × 224 × 1) and then subtracted from the current image. The subtracted image is used as input to FgSegNet_v2 to find foreground objects. The GRBE module summarizes the spatio-temporal data in successive images into a background model image. The original FgSegNet_v2 [1] uses a 3-channel RGB image as input and generates a 1-channel foreground probability map. In the proposed model, we use a 1-channel gray image as input.
For this reason, we do not use the VGG-16 [26] pretrained weights in FgSegNet_v2 [1]. We also introduce a loss term using a semi-foreground map (SFM) to improve the training process and performance. Figure 2(a) shows the GRBE module of 3DCD [3], and Figure 2(b) shows the structure of FgSegNet_v2. More details about the GRBE [3] and FgSegNet_v2 [1] can be found in the original papers. Although the quality of the background model image produced by the GRBE module is inferior to that of SuBSENSE [12], we show that performance improvement is possible by integrating it with FgSegNet_v2 in an end-to-end configuration. The proposed STFN model, which uses the GRBE and FgSegNet_v2 as sub-modules, overcomes their individual limitations.
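The data flow described above can be sketched in a few lines. This is a minimal NumPy sketch under assumed interfaces — `grbe` and `fgsegnet` are placeholders for the trained temporal and spatial networks, not the actual implementations from [1] and [3]:

```python
import numpy as np

def stfn_forward(frames, grbe, fgsegnet):
    """Sketch of the STFN forward pass (hypothetical module interfaces).

    frames:    (50, 224, 224, 1) stack of gray input images, current frame last.
    grbe:      temporal network mapping the stack to a background model image.
    fgsegnet:  spatial network mapping a difference image to a foreground map.
    """
    # Temporal network: summarize the 50 frames into one background model image.
    bg = grbe(frames[np.newaxis])      # (1, 224, 224, 1)
    bg = bg.reshape(224, 224, 1)       # reshape to a 3D tensor (224 x 224 x 1)

    # Difference image: subtract the background model from the current frame.
    diff = frames[-1] - bg             # (224, 224, 1)

    # Spatial network: detect foreground objects in the difference image.
    return fgsegnet(diff[np.newaxis])  # (1, 224, 224, 1) probability map
```

When experimenting with the plumbing, a temporal mean can stand in for `grbe` and a sigmoid for `fgsegnet`; the real modules are described in the original papers.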

B. LOSS WITH SEMI-FOREGROUND MAP
The proposed STFN contains two networks, both of which must be trained equally well. If we train the proposed STFN model using only the final output, as most deep learning-based visual surveillance algorithms do, a training imbalance between the GRBE and FgSegNet_v2 modules may occur. Even if the GRBE module outputs a background model image with all pixel values equal to 0, the FgSegNet_v2 module can still be trained because the difference image is then identical to the current image. We must train the GRBE and FgSegNet_v2 modules equally well for performance improvement. We solve this problem by adding a cost term that leverages the output generated by the GRBE module.
The output of the GRBE module is a background model image, so a ground-truth background model image would be required to compute a loss on it directly. However, most visual surveillance datasets do not provide ground-truth background model images since they are difficult to obtain. Fortunately, visual surveillance datasets do provide ground-truth foreground maps. We therefore propose a new loss computation that leverages the background model image estimated by the GRBE module and the ground-truth foreground map. Since the background model image and the foreground map belong to different domains, it is difficult to compute a loss by comparing them directly. For this reason, we produce a semi-foreground map (SFM) by a difference operation between the estimated background model image and the original image. We use SFMs to compute an additional loss that trains the GRBE module.
The additional loss computation for the GRBE module is based on the following assumptions. (1) A relatively large brightness difference between a current image and a background model image exists at foreground objects. (2) A relatively slight brightness difference exists in background regions between a current image and a background model image. (3) One background model image can serve as a representative background model over a short time interval. Under these assumptions, we obtain SFMs by computing difference images between the estimated background model image and multiple input images, and we compute a loss using the SFMs and the ground-truth foreground maps. Since the GRBE module generates one background model image from successive original images, that image is used for all input images when obtaining SFMs. Although the presented cost computation has limitations with dynamic backgrounds, static foreground objects, and illumination changes, our experimental results show that using SFMs helps train the STFN model effectively.
We configure the SFM generation process with no trainable parameters, using only layer differences and nonlinear functions without adding convolutional layers. If the SFM generation had trainable parameters, high-quality SFMs could be generated even when the GRBE module produces a poor background model image, which would undermine the purpose of the additional loss.
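The parameter-free SFM generation can be written as pure tensor operations. Below is a minimal NumPy sketch, assuming 8-bit gray inputs and the value K = 127.5 that is chosen experimentally later in the paper; the exact tensor layout used in the implementation may differ:

```python
import numpy as np

K = 127.5  # pre-processing divisor (set experimentally in Section IV-A)

def semi_foreground_maps(frames, bg_model):
    """Generate N semi-foreground maps with no trainable parameters.

    frames:   (N, H, W) stack of 8-bit gray input images.
    bg_model: (H, W) background model image estimated by the temporal network.
    """
    scaled = frames.astype(np.float32) / K   # pre-processing as in Eq. (1)
    bg = bg_model.astype(np.float32) / K
    diff = np.abs(scaled - bg[np.newaxis])   # one bg model for all N frames
    return np.tanh(diff)                     # nonnegative input -> [0, 1)
```

Note that the whole pipeline is differentiable but has nothing to learn, so the gradient of the SFM loss flows back only into the background estimation network.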
The nonlinear function at the output stage of the SFM is selected based on the following considerations. In general, a foreground map has values between 0 and 1. Therefore, we can choose either tanh or sigmoid as candidates, excluding ReLU, which has an unbounded output range. Since we take the absolute value of the difference images, the nonlinear function receives only nonnegative inputs. Under this condition, the sigmoid outputs values between 0.5 and 1, whereas tanh maps nonnegative inputs to [0, 1). Therefore, we choose tanh as the nonlinear function for SFM generation.
The derivative of tanh decreases sharply as its input increases from 0. In an 8-bit image, brightness values lie between 0 and 255, so a difference image after the absolute operation also has values between 0 and 255. If we used this input range directly, the gradient of tanh would be close to 0 over most of the range, which may cause problems during training. Therefore, it is necessary to adjust the input range to a region where the gradient of tanh is clearly larger than zero. We use a pre-processing step that divides the pixel values of the input images by a value K larger than one, which keeps the gradient of tanh away from zero. Through this pre-processing step, we cope with the problem of the gradient approaching zero during back-propagation.
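The saturation argument above can be checked numerically. The small script below is illustrative only and uses the derivative identity tanh′(x) = 1 − tanh²(x):

```python
import math

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# Raw 8-bit difference values saturate tanh: almost no gradient flows back.
print(tanh_grad(255.0))           # ~0

# Dividing by K = 127.5 maps differences into [0, 2], where the gradient
# stays usefully far from zero.
print(tanh_grad(255.0 / 127.5))   # ~0.07

# Sigmoid of a nonnegative input never falls below 0.5, which is why tanh
# is preferred at the SFM output stage.
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
print(sigmoid(0.0))               # 0.5
```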
We apply the following pre-processing to all pixels of the current and past images used as inputs to the proposed STFN model:

Ĩ(x, y, t) = I(x, y, t) / K,    (1)

where I(x, y, t) represents the brightness value at location (x, y) at time t in an image, K is a real number larger than one, and Ĩ(x, y, t) represents the brightness value at location (x, y) at time t after pre-processing. Next, we take the per-pixel difference between the background model image generated by the GRBE module and the input images.
I_d(x, y, t) = |Ĩ(x, y, t) − I_GRBE(x, y)|,    (2)

where I_GRBE and I_d represent the background model image generated by the GRBE module and a difference image, respectively. The GRBE module estimates a background model image using the previous N images, so we obtain N difference images by Eq. (2). We convert the difference images of Eq. (2) to foreground maps through the following process:

I_f^SFM(x, y, t) = tanh(I_d(x, y, t)).    (3)

Finally, we get N semi-foreground maps.
I_f^SFM(x, y, t) in Eq. (3) represents a semi-foreground map. The loss for training the proposed STFN consists of the following terms:

L_SFM = (1/N) Σ_t BCE(I_f^GT(x, y, t), I_f^SFM(x, y, t)),    (4)

L_FFM = BCE(I_f^GT(x, y, t_c), I_f^FFM(x, y, t_c)),    (5)
where I_f^GT(x, y, t) and I_f^FFM(x, y, t) represent a ground-truth foreground map and the foreground map produced by FgSegNet_v2, respectively, FFM denotes the final foreground map, and BCE(a, b) is the binary cross-entropy of a and b. The final loss is as follows:

L = α · L_SFM + (1 − α) · L_FFM.    (6)
α represents the weight of the two loss terms and has a value between 0 and 1. K and α are hyperparameters, and we determine their values through experiments. After training, the output of the FgSegNet_v2 module is used as the final output; the output of the GRBE module is used only when training the proposed STFN. Since incomplete assumptions are used for SFM generation, the SFM loss cannot be lowered below a certain level. For this reason, it is difficult to check training progress by investigating the SFM loss. Therefore, we investigate the FFM loss to determine whether training is proceeding well.
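Under the definitions above, the combined loss can be sketched as follows. This is a NumPy sketch with assumed helper names (`bce`, `stfn_loss`); the paper's implementation uses Keras:

```python
import numpy as np

def bce(target, pred, eps=1e-7):
    # Binary cross-entropy averaged over all pixels.
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def stfn_loss(gt_maps, sfm_maps, gt_current, ffm, alpha=0.1):
    """Total loss: alpha * L_SFM + (1 - alpha) * L_FFM.

    gt_maps, sfm_maps:  N ground-truth foreground maps and N SFMs.
    gt_current, ffm:    ground truth and final foreground map for the
                        current frame.
    """
    loss_sfm = float(np.mean([bce(g, s) for g, s in zip(gt_maps, sfm_maps)]))
    loss_ffm = bce(gt_current, ffm)
    return alpha * loss_sfm + (1.0 - alpha) * loss_ffm
```

With α = 0 the SFM term vanishes and only the final foreground map is supervised, which corresponds to the ablation reported in Table 2.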

IV. EXPERIMENTAL RESULTS
In visual surveillance, two methods are widely used for performance evaluation. The first uses different scenes for training and evaluation; the second divides images from the same sequence into training and evaluation sets. Mandal et al. [4] denote these two evaluation methods as the scene-independent data division (SIE) environment and the scene-dependent data division (SDE) environment. Visual surveillance systems deployed in various environments should be able to detect foreground objects stably without additional training. Therefore, this paper uses quantitative and qualitative evaluation in the SIE environment to investigate generalization power.
In the proposed STFN model, we use 50 images selected from the 196 most recent frames at four-frame intervals as inputs to the GRBE module. We configure the SFM to have 50 maps, matching the number of input images. The FFM consists of one foreground probability map that matches the current image. We use the Adam optimizer [42] with a learning rate of 0.001; if the validation loss does not decrease for five epochs, we halve the learning rate. We use Keras [43] for the implementation.
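The frame sampling above (50 frames at four-frame intervals spanning the 196 preceding frames) can be expressed as a small helper. The function name and the exact offset convention are assumptions for illustration:

```python
def select_input_frames(t, num=50, stride=4):
    """Indices of the frames fed to the GRBE module for current frame t.

    Picks `num` frames at `stride`-frame intervals ending at t, spanning
    the (num - 1) * stride = 196 preceding frames.
    """
    return list(range(t - (num - 1) * stride, t + 1, stride))
```

For example, for current frame 200 the helper returns indices 4, 8, ..., 200.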
Three datasets are used in the experiments: CDnet2014 [2], LASIESTA [44], and SBI [45]. Evaluation is done in two ways. First, evaluation is done on the entire SBI and LASIESTA datasets after training the model on the CDnet2014 dataset. Second, evaluation is done after creating an SIE environment by splitting the LASIESTA and SBI datasets.

A. DETERMINATION OF HYPERPARAMETER
The hyperparameters used in the proposed method are K in Eq. (1) and α in Eq. (6). We determine their values by investigating experimental results. In the LASIESTA dataset [44], we used scene 1 for training and scene 2 for evaluation across the 10 categories. Table 1 shows the FM scores according to the variation of the K value used for pre-processing. Scenes used for training provide high FM scores regardless of the K value, but scenes not used for training can show significantly lower FM scores depending on the K value. This indicates that the proposed pre-processing is necessary for robust object detection in environments different from the training environment. Table 2 shows FM scores according to the variation of the SFM loss weight α. When α is 0, the proposed STFN model is trained without SFM. In this case, the FM score on scenes used for training is 0.9924, showing excellent performance, but on scenes not used for training the FM score is 0.2921, a dramatic performance drop. On the other hand, when we train the proposed STFN model using the cost term by SFM, we obtain excellent performance even on scenes not used for training. This shows the necessity of the SFM cost term for stable training of the proposed STFN model. Based on Tables 1 and 2, K and α were set to 127.5 and 0.1, respectively.

B. EXPERIMENTAL RESULTS USING THE CDNET2014 DATASET
After training the proposed STFN model using the CDnet2014 dataset, we evaluate the performance of the proposed method on the LASIESTA and SBI datasets. Among the 11 categories of the CDnet2014 dataset, we use scenes from five categories for training; Table 3 lists them. We used 42,345 images from 23 scenes, divided into 80% for training and 20% for validation per scene. In total, we used 33,868 images for training and 8,477 images for validation.

1) EVALUATION OF THE LASIESTA DATASET
We use 20 scenes across 10 categories of the LASIESTA dataset for evaluation. Table 4 shows the quantitative evaluation results of the proposed algorithm, and Table 5 compares FM scores with other algorithms. We evaluate 3DCD [3] and FgSegNet_v2 [1] after training them on the same dataset as the proposed algorithm. SuBSENSE [12] and PAWCS [13] are evaluated using BGSLibrary [46]. For the remaining algorithms, we use the evaluation statistics reported in their papers. Without SFMs, the performance of the proposed STFN model degrades significantly; the proposed algorithm performs best when the STFN model is trained with the SFM loss.
Compared to the 3DCD [3] and FgSegNet_v2 [1], the proposed algorithm improves by 13.1% and 104.5%, respectively. FgSegNet_v2 provides excellent results in the SDE setup, but its performance drops significantly in the SIE setup. The proposed algorithm shows a 5.5% improvement over MSFgNet [35], which shows the best performance among the comparison algorithms. Figure 3 compares background model images and foreground maps produced by the proposed algorithm and other algorithms. The quantitative and qualitative evaluation results show that the proposed algorithm outperforms the 3DCD [3] and FgSegNet_v2 [1]. However, the proposed algorithm performs poorly in 'I_CA', an environment where foreground objects have stopped for a long time.

2) EVALUATION OF THE SBI DATASET
Table 6 shows the results of various quantitative assessments of the proposed method, and Table 7 shows the FM comparison with other algorithms. The proposed algorithm performs best when the STFN model is trained using the SFM loss, showing improvements of 9.1% and 34.1% over the 3DCD and FgSegNet_v2, respectively. In the SBI dataset, FgSegNet_v2 provides excellent results in the SDE environment but, as with the LASIESTA dataset, shows large performance drops in the SIE environment. The SBI dataset shows a relatively small performance improvement compared to the LASIESTA dataset: the proposed algorithm shows a 1.7% improvement over Yang et al. [17], which offers the best performance among the comparison algorithms. Figure 4 compares background model images and foreground maps produced by the proposed and comparison algorithms. In the 'CaVignal' scene, a bootstrap environment, ghost objects denoted with a red rectangle occur due to an incorrect background model image.

In Table 7, the proposed algorithm gives low performance in some categories of the SBI dataset. This is caused by wrong background model images, as shown in Figure 4. We think that improving the quality of the background model image would partially solve this problem.

C. EXPERIMENTAL RESULTS USING LASIESTA AND SBI DATASET
The previous experiment was done as follows: the model was first trained using images from the CDnet2014 dataset, and the trained model was then evaluated on the LASIESTA and SBI datasets. In the following experiments, the LASIESTA and SBI datasets are each divided into training and evaluation data, considering an environment where only a small amount of training data is available.

1) EVALUATION BY DIVIDING LASIESTA DATASET
We divide the LASIESTA dataset into training and evaluation sets, following the division used in 3DCD [3]: among the ten categories of the LASIESTA dataset, scene 1 is used for training and scene 2 for evaluation. We use 4,300 images in the experiments, with 3,440 images for training and 860 for validation. Table 8 compares FM scores with other algorithms. For FgSegNet_v2 [1], we use the results reported in the 3DCD paper [3]. The proposed algorithm provides an 11.3% improvement over 3DCD [3] and ChangeDet [4], which show the best results among the comparison algorithms.

2) EVALUATION BY DIVIDING SBI DATASET
An evaluation was done by dividing the SBI dataset into training and evaluation sets, following the same protocol as 3DCD [3] and ChangeDet [4]. We use the 'Candela', 'CAVIAR2', 'CaVignal', and 'Highway2' scenes of the SBI dataset for evaluation and the remaining scenes for training. We used 3,455 images in the experiments, with 2,761 images for training and 694 for validation. Table 9 shows comparison results against other algorithms; for 3DCD [3] and ChangeDet [4], we use the results reported in their papers. Although performance degrades compared to the LASIESTA dataset because there is less training data, the proposed algorithm shows the best result among the compared algorithms. In addition, the generalization ability becomes insufficient when we do not use the SFM loss. The proposed algorithm provides an 18.3% improvement over 3DCD [3], the best performer among the comparison algorithms.

D. COMPARISON OF BACKGROUND MODEL IMAGE
In the proposed STFN model, we obtain a background model image by processing multiple images along the temporal dimension from the present to the past. Classical BGS algorithms and the modern deep learning-based 3DCD [3] also use background model images. The SBI dataset provides ground-truth background model images, while the LASIESTA dataset does not, so we manually create background model images using the ground-truth foreground maps of the LASIESTA dataset. Unlike other methods that provide color background model images, the proposed method and 3DCD [3] generate gray background model images; therefore, we present only qualitative evaluation results. Figure 5 shows background model images by the proposed and other algorithms, with red rectangles marking incorrect areas. BGS algorithms show unstable background model images at the initialization stage. In particular, they take a long time to generate correct background model images when foreground objects are present in the initial frames.
The proposed method copes with these problems and creates a background model image that complies with the general situation. However, the ghost-object phenomenon seen in BGS algorithms in bootstrap environments also occurs in the proposed algorithm, as shown in the 'CaVignal' scene in Figure 5. The proposed algorithm also gives a less accurate background model image when foreground objects remain stationary for a long time, as shown in the 'I_CA_2' scene in Figure 5. In addition, the proposed algorithm is limited in that it generates a grayscale background model image, while traditional BGS algorithms provide a three-channel RGB background model image. Although the background model image generated by the proposed algorithm has margins for improvement, we obtain improved performance by integrating it with the 2D spatial network.

E. COMPARISON OF COMPUTATION TIME
In this study, experiments were done on a PC with an AMD R9 3900X CPU and an NVIDIA RTX 2080Ti 11GB GPU. Table 10 compares the number of parameters and the fps of the proposed algorithm, FgSegNet_v2 [1], and 3DCD [3]. The proposed algorithm shows higher generalization performance and requires a lower computational cost than 3DCD [3], processing frames 20% faster. FgSegNet_v2 [1] computes three times faster than the proposed algorithm, but its generalization performance in the SIE environment is lower. The proposed algorithm operates at 30 fps, which is sufficient for real-time processing in visual surveillance.

V. CONCLUSION
In this paper, we have proposed a spatio-temporal fusion network (STFN) that integrates a 3D temporal network and a 2D spatial network to enhance object detection in visual surveillance. The temporal network provides a background model image by summarizing spatio-temporal information among consecutive images, and the spatial network detects foreground objects by processing the difference image between a current image and the background model image. We have also proposed additional cost terms derived from a semi-foreground map for training the proposed STFN, and we have shown that the proposed method can generate a stable background model image through this additional loss computation. The proposed STFN model performs better than the 3DCD [3] and FgSegNet_v2 [1], which we adopted for its configuration. The proposed algorithm depends heavily on the quality of the background model image. Currently, the generated background model image is gray; a color background model image would offer more potential for detecting foreground objects. The current algorithm is also weak when foreground objects stop for a long time. In future studies, we will try to use an RNN as the temporal network to cope with these problems.