A Fast Lightweight 3D Separable Convolutional Neural Network with Multi-Input Multi-Output for Moving Object Detection

Advances in moving object detection have been driven by the active application of deep learning methods. However, many existing models achieve superior detection accuracy at the cost of high computational complexity and slow inference speed, which has hindered their use in mobile and embedded vision tasks that must be carried out in a timely fashion on computationally limited platforms. In this paper, we propose a super-fast (inference speed of 154 fps) and lightweight (model size of 1.45 MB) end-to-end 3D separable convolutional neural network with a multi-input multi-output (MIMO) strategy, named "3DS_MM", for moving object detection. To improve detection accuracy, the proposed model adopts 3D convolution, which is better suited than 2D convolution to extracting both spatial and temporal information from video data. To reduce model size and computational complexity, the standard 3D convolution is decomposed into depthwise and pointwise convolutions. In addition, we propose a MIMO strategy to increase inference speed: the network takes multiple frames as input and outputs multiple frames of detection results. Further, we conduct scene dependent evaluation (SDE) and scene independent evaluation (SIE) on the benchmark CDnet2014 and DAVIS2016 datasets. Compared to state-of-the-art approaches, our proposed method significantly increases the inference speed and reduces the model size, while achieving the highest detection accuracy in the SDE setup and maintaining a competitive detection accuracy in the SIE setup.


I. INTRODUCTION
With the increasing number of network cameras, the volume of produced visual data, and the number of Internet users, processing large amounts of video data at high speed has become both challenging and crucial. Moving object detection (MOD) is the process of extracting dynamic foreground content, such as moving vehicles or pedestrians, from video frames while discarding the non-moving background. It plays an essential role in many real-world applications [1], such as intelligent video surveillance [2], medical diagnostics [3], anomaly detection [4], human tracking, and action recognition [5], [6].
Traditional methods [7]-[29] are unsupervised and do not require labeled ground truth for algorithm development. They usually include two steps: background modeling and pixel classification. However, these traditional methods encounter difficulties when applied to complex scenarios, such as videos with illumination changes, shadows, night scenes, and dynamic backgrounds.
With the availability of a huge amount of data and the development of powerful computational infrastructure, deep neural networks (DNNs) [30]-[32] have shown remarkable improvements on MOD problems. They have been developed to replace either the background modeling or the pixel classification step of traditional methods, or to combine both steps into an end-to-end network. Existing DNN models are mostly supervised approaches based on 2D convolutional neural networks (CNNs) [33]-[50], 3D CNNs [51]-[56], 2D separable CNNs [57], or generative adversarial networks (GANs) [58]-[63]. Besides, unsupervised GANs [64], [65] and semi-supervised networks [66]-[73] have also been proposed. These works demonstrate that DNNs can automatically extract spatial low-, mid-, and high-level features as well as temporal features, which turn out to be very helpful for MOD problems.
While existing DNN models offer superior moving object detection accuracy, they are computationally expensive and memory-intensive. In particular, the architectural change in 3D CNNs leads to a huge increase in model size and computational complexity compared to 2D CNNs, making it challenging to apply such models to real-world scenarios such as robotics, self-driving cars, and augmented reality. These tasks are usually deployed on mobile and embedded devices, which have limited memory and computing resources. Besides, these tasks are delay-sensitive and need to be carried out in a timely manner, which cannot be achieved by high-complexity deep learning models. Thus, we aim to design a deep moving object detection model suitable for mobile and embedded environments that achieves faster inference speed and smaller model size while maintaining high detection accuracy.
In this paper, we propose an efficient 3D separable convolutional neural network with a multi-input multi-output strategy called "3DS_MM". This model is tailored for computation-resource-limited and delay-sensitive applications. Compared to state-of-the-art models, it significantly increases inference speed and reduces model size, while increasing detection accuracy or maintaining a competitive detection accuracy. Our key contributions are as follows:
• We propose a new 3D separable CNN for moving object detection. The proposed network adopts 3D convolution to explore spatio-temporal information in the video data and to improve detection accuracy. To reduce computational complexity and model size, the 3D convolution is decomposed into a depthwise convolution and a pointwise convolution. While existing 3D separable CNN schemes all addressed other problems such as gesture recognition, force prediction, and 3D object classification or reconstruction, our work applies it to the moving object detection task for the first time in the literature.
• We propose a multi-input multi-output (MIMO) strategy. While existing networks are single-input single-output, multi-input single-output, or two-input two-output, our MIMO network can take multiple input frames and output multiple binary masks along the temporal dimension of each sample. This MIMO strategy, embedded in the 3D separable CNN, further increases model inference speed significantly while maintaining high detection accuracy. To the best of our knowledge, this is the first time in the literature that such a MIMO scheme is used for the MOD task.
• We demonstrate that the proposed 3DS_MM offers an overwhelmingly high inference speed (154 fps) and an extremely small model size (1.45 MB), while achieving the best detection accuracy in terms of F-measure, S-measure, E-measure, and MAE among all models in the scene dependent evaluation (SDE) setup, and the best detection accuracy among the models with inference speeds exceeding 65 fps in the scene independent evaluation (SIE) setup. The SDE setup is widely used to tune and test a model on a specific video, as the training and test sets come from the same video. The SIE setup, originally raised in [50], is specifically designed to assess the generalization capability of a model on completely unseen videos.
The rest of the paper is organized as follows. In Section II, we introduce existing algorithms for moving object detection. In Section III, we explain the principles of the 3D separable convolution which lays the foundation for our proposed 3DS_MM. In Section IV, we elaborate on our proposed network in detail. Section V explains the training and evaluation setup of the experiments. Section VI describes our experimental results compared to the state-of-the-art models. Section VII concludes the paper.

II. RELATED WORKS
The methods for MOD problems have been extensively studied and improved over the years. These methods can be broadly categorized into: (1) traditional methods (unsupervised learning), and (2) deep learning methods (supervised and semi-supervised learning).
Traditional methods [7]-[29] are unsupervised and do not require labeled ground truth. They basically consist of two components: (1) background modeling, which initializes the background scene and updates it over time, and (2) classification, which classifies each pixel as foreground or background. There are many background modeling schemes, such as temporal or adaptive filters applied to build the background, including running average background [10], temporal median filtering [11], and Kalman filtering [12]. Another way to model the background is to represent it statistically using parametric probability density functions such as a single Gaussian or a mixture of Gaussians [13]. On the other hand, non-parametric methods directly rely on observed data to model the background, such as IUTIS-5 [14], WeSamBE [15], SemanticBGS [16], and kernel density estimation [17]. Sample consensus is another non-parametric strategy, used in PAWCS [18], ViBe [19], and SuBSENSE [20]. In particular, SuBSENSE uses a feedback system to automatically adjust the background model based on local binary similarity pattern (LBSP) features and pixel intensities [21]. Eigen-background based on principal component analysis (PCA) [22]-[24] is also used in background modeling. Further, background subtraction based on robust principal component analysis (RPCA) [25]-[29] handles camera motion and reduces the curse of dimensionality and scale. However, it is quite difficult for traditional methods to perform object detection in complex scenarios, such as videos with illumination changes, shadows, night scenes, and dynamic backgrounds.
Deep learning-based methods are mostly supervised and have recently been proposed for MOD problems [30]-[32], [42], [44]. The first CNN-based work is ConvNet-GT [33], which replaces the pixel classification component with a well-defined network structure. The background is estimated by a temporal median filter, and the estimated backgrounds are stacked with the original video frames to form the input of the CNN, which outputs the binary masks of detected objects. DeepBS [40] utilizes the SuBSENSE [20] algorithm to generate the background image and a multi-layer CNN for segmentation; a spatial median filter is used for post-processing to perform smoothing. Wang et al. [34] proposed a multi-scale patch-wise method with a cascade CNN architecture called MSCNN+Cascade [34]. Although it achieves good detection performance, the patch-wise processing is very time-consuming. Other multi-scale feature learning-based models such as Guided Multi-scale CNN [35], MCSCNN [36], MsEDNet [37], and the VGG-16 [74] based networks FgSegNet_M [38] and FgSegNet_v2 [39] were also proposed. FgSegNet_S [38] is a 2D CNN that takes each video frame at its original resolution scale as the input, while its extended version FgSegNet_M [38] takes each video frame at three different resolution scales in parallel as the input of the encoding network. FgSegNet_v2 is the best-performing FgSegNet model in the CDnet2014 [75] challenge. Another example, MSFgNet [41], has a motion-saliency network (MSNet) that estimates the background and subtracts it from the original frames, followed by a foreground extraction network (FgNet) that detects the moving objects.
3D convolution has been applied to MOD problems to utilize spatio-temporal information in visual data. In [52], a 3D CNN and a fully connected layer are adopted in a patch-wise method. 3D-CNN-BGS [53] uses 3D convolution to track temporal changes in video sequences. This approach performs 3D convolution on 10 consecutive frames of the video and upsamples the low-, mid-, and high-level feature layers of the network in a multi-scale approach to enhance segmentation accuracy. 3DAtrous [54] captures long-term temporal information in the video data. It is trained based on a long short-term memory (LSTM) network with focal loss to tackle the class imbalance problem commonly seen in background subtraction. Another LSTM-based example is the autoencoder-based 3D CNN-LSTM [55], which combines 3D CNNs and LSTM networks. In this work, time-varying video sequences are handled by 3D convolution to capture short temporal motions, while longer-term temporal motions are captured by 2D LSTMs. Although these 3D convolution-based methods offer accurate detection results, they have high computational complexity.
Recently, the concept of generative adversarial networks (GANs) has been adopted for MOD problems, for example in BScGAN [58], BSGAN [59], BSPVGAN [60], FgGAN [61], BSlsGAN [62], and RMS-GAN [63]. BScGAN is based on a conditional generative adversarial network (cGAN) that consists of two networks: a generator and a discriminator. BSGAN [59] and BSPVGAN [60] are based on Bayesian GANs. They use a median filter for background modeling and Bayesian GANs for pixel classification. The use of Bayesian GANs can address the issues of sudden and slow illumination changes, non-stationary backgrounds, and ghosting. In addition, BSPVGAN [60] exploits parallel vision to improve results in complex scenes. In [64], [65], adversarial learning is proposed to generate dynamic background information in an unsupervised manner.
However, the performance of all the aforementioned deep learning-based moving object detection methods comes at a high computational cost and a slow inference speed due to complex network structures and intensive convolution operations. To reduce the amount of computation, our previous work [57] proposed a 2D separable CNN which splits the standard 2D convolution into a depthwise convolution and a pointwise convolution. It dramatically increases the inference speed and maintains high detection accuracy. However, this 2D separable CNN-based network does not exploit the temporal information in the video input.
In this work, we extend the 2D separable CNN to a 3D separable CNN, which reduces the computational complexity compared to a standard 3D CNN. Although some existing works [76]-[79] adopt 3D separable CNNs to extract high-dimensional features, none of them applied them to the problem of moving object detection. For example, the 3D separable CNN in [76] is for hand-gesture recognition, in which the last two layers of the network are fully connected layers that output class labels. The 3D separable CNN in [77] is used for two tasks: 3D object classification and reconstruction. Neither task utilizes temporal data, hence no temporal convolution is involved. The 3D separable CNN in [78] predicts the interactive force between two objects, hence its network output is a scalar representing the predicted force value; this is essentially a regression problem. Besides, the way the 3D convolution is separated in [78], [79] differs from our proposed method: it first conducts channel-wise 2D convolution for each independent frame and channel, and then conducts joint temporal-channel-wise convolution. In contrast, our proposed 3D separable CNN performs spatial-temporal convolution first, and then performs pointwise convolution along the channel direction.
Another factor that limits the inference speed is the input-output relationship. The input-output relationship of existing moving object detection networks is of two types: (1) single-input single-output (SISO), which is widely exploited in 2D CNNs such as FgSegNet_S [38] and the 2D separable CNN [57]; and (2) multi-input single-output (MISO), which can be found in 3D CNNs such as 3D-CNN-BGS [53], 3DAtrous [54], and DMFC3D [51]. The disadvantage of SISO and MISO is that they result in a slow inference speed because only one output frame is predicted in every forward pass. Recently, X-Net [80] adopted a two-input two-output network structure, which takes two adjacent video frames as the network input and generates the corresponding two binary masks. Although it can track temporal changes, the network structure is inflexible and the temporal correlation it utilizes is limited. In this work, we propose a multi-input multi-output (MIMO) strategy, which can take multiple input frames and output multiple frames of binary masks in each sample. It explores temporal correlations over a larger time span and significantly increases the inference speed when embedded in the 3D separable CNN.
Another issue for supervised methods is the generalization capability of the trained models to completely unseen videos. Several moving object detection models were designed and evaluated on completely unseen videos, such as BMN-BSN [47], BSUV-Net [48], BSUV-Net 2.0 [49], BSUV-Net+SemBGS [48], ChangeDet [50], and 3DCD [56]. Besides, semi-supervised networks were also designed to extend to unseen videos. For example, GraphBGS [66] and GraphBGS-TV [67] are based on the reconstruction of graph signals and a semi-supervised learning algorithm, MSK [68] is based on a combination of offline and online learning strategies, and HEGNet [71] combines propagation-based and matching-based methods for semi-supervised video moving object detection.
In this paper, we devise a new lightweight 3D separable CNN specifically for moving object detection in computation-resource-limited and delay-sensitive scenarios. It has an efficient end-to-end encoder-decoder structure with a multi-input multi-output (MIMO) strategy, and is named "3DS_MM". The proposed 3DS_MM does not require explicit background modeling. We evaluate the model on the CDnet2014 [75] dataset in an SDE framework against other state-of-the-art models, and we also assess its generalization capability on the CDnet2014 and DAVIS2016 [81] datasets in SIE setups over completely unseen videos.
The proposed 3DS_MM significantly increases the inference speed, reduces the trainable parameters, computational complexity and model size, meanwhile achieving the highest detection accuracy in SDE setup and maintaining a competitive detection accuracy in SIE setup.

III. 3D SEPARABLE CONVOLUTION
In this section, we elaborate on the rationale of the 3D separable convolution operation, which is the building block of our proposed 3DS_MM. In the following sections, we use TensorFlow's default data format "NLHWC" to represent data, which denotes the batch size N, the temporal length L, the image height H, the image width W, and the number of channels C.
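As a concrete illustration of this data format, a minimal NumPy sketch (our own example, not from the paper; the frame size 240 × 320 is an arbitrary placeholder) is:

```python
import numpy as np

# One training sample: a clip of L = 9 RGB frames of size 240 x 320,
# stored in "NLHWC" order (batch, temporal length, height, width, channels).
N, L, H, W, C = 1, 9, 240, 320, 3
clip = np.zeros((N, L, H, W, C), dtype=np.float32)
print(clip.shape)  # (1, 9, 240, 320, 3)
```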

A. 2D CONVOLUTION VS. 3D CONVOLUTION
As shown in Fig. 1(a) [82], an ordinary 2D convolution takes a 3D tensor of size H × W × C_i as the input, where H and W are the height and width of the feature maps, and C_i is the number of input channels. In this case, the filter is a 3D filter of shape K × K × C_i moving in two directions (y, x) to calculate a 2D convolution. The output is a 2D matrix of size H_o × W_o. The mathematical expression of such a 2D convolution is given by

Out(h, w) = \sum_{c=0}^{C_i-1} \sum_{j=0}^{K-1} \sum_{i=0}^{K-1} In(h+j, w+i, c) \cdot f(j, i, c),    (1)

where In represents the 3D input to be convolved with the 3D filter f to produce a 2D output feature map Out. Here, h, w, and c are the height, width, and channel coordinates of the 3D input, while j, i, and c are those of the 3D filter.
However, for video signals the 2D convolution in Fig. 1(a) does not leverage the temporal information among adjacent frames. 3D convolution addresses this issue using 4D convolutional filters with a 3D convolution operation, as illustrated in Fig. 1(b). In a 3D convolution, the input becomes C_i channels of 3D tensors of size L × H × W, where L is the temporal length (i.e., the number of successive video frames). Hence, the input is 4D and is of size L × H × W × C_i. The filter is a 4D filter moving in three directions (z, y, x) to calculate convolutions, where z, y, and x align with the temporal length, height, and width axes of the 4D input. The output shape is L_o × H_o × W_o. The mathematical expression of the 3D convolution with a 4D input is given by

Out(l, h, w) = \sum_{c=0}^{C_i-1} \sum_{k=0}^{K-1} \sum_{j=0}^{K-1} \sum_{i=0}^{K-1} In(l+k, h+j, w+i, c) \cdot f(k, j, i, c),    (2)

where In represents the 4D input to be convolved with the 4D filter f to produce a 3D output Out. Here, l, h, w, and c are the temporal length, height, width, and channel coordinates of the 4D input, while k, j, i, and c are those of the 4D filter. If the size of the filter is K × K × K × C_i, then the indices k, j, i range from 0 to K − 1, and c ranges from 0 to C_i − 1.

The ability to leverage the temporal context improves moving object detection accuracy. However, 3D CNNs are rarely used in practice because they suffer from a high computational cost due to the increased amount of computation in 3D convolutions, especially when the dataset scale grows larger and the neural network model goes deeper. Thus, in order to make use of the temporal features, a low-complexity 3D CNN must be developed.
In order to utilize temporal features in video data, the idea of separable convolution can be applied to the standard 3D convolution. As shown in Fig. 2(a), the standard 3D convolution adopts C_o filters of size K × K × K × C_i. The filters calculate the 3D convolution by moving in the directions of length, height, and width, as shown by the red arrows. The computational complexity of such a standard 3D convolution is

K × K × K × C_i × C_o × L_o × H_o × W_o.

To simplify the 3D convolution, we decompose it into a 3D depthwise convolution and a 1D pointwise convolution. As shown in Fig. 2(b) Step 1, the 3D depthwise convolution adopts C_i independent filters of size K × K × K × 1 to perform a 3D convolution on each input channel. This procedure is described in (3). The number of multiplications required by such a 3D depthwise convolution is

K × K × K × C_i × L_o × H_o × W_o.

Afterwards, the output of Fig. 2(b) Step 1 is used as the input of Fig. 2(b) Step 2, where the pointwise convolution adopts a filter of size 1 × 1 × 1 × C_i, performs a linear projection along the channel axis as shown by the red arrow, and outputs a 3D tensor of size L_o × H_o × W_o. This procedure is described in (4). Using C_o such filters outputs C_o 3D tensors. The number of multiplications required by such a 1D pointwise convolution is

C_i × C_o × L_o × H_o × W_o.

The combination of the 3D depthwise convolution and the 1D pointwise convolution, called 3D separable convolution, achieves a reduction in computational complexity of

(K × K × K × C_i × L_o × H_o × W_o + C_i × C_o × L_o × H_o × W_o) / (K × K × K × C_i × C_o × L_o × H_o × W_o) = 1/C_o + 1/K^3.

With K = 3 and a large C_o, the computational complexity can be reduced by roughly 27 times compared to the standard 3D convolution.
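The reduction factor above can be checked numerically. The short Python sketch below (our own illustration; the layer sizes are hypothetical, not taken from the paper) counts the multiplications of a standard 3D convolution against the depthwise-plus-pointwise decomposition:

```python
def standard_3d_mults(K, Ci, Co, Lo, Ho, Wo):
    # Co filters of size K x K x K x Ci applied at every output location.
    return K**3 * Ci * Co * Lo * Ho * Wo

def separable_3d_mults(K, Ci, Co, Lo, Ho, Wo):
    depthwise = K**3 * Ci * Lo * Ho * Wo   # one K x K x K filter per input channel
    pointwise = Ci * Co * Lo * Ho * Wo     # Co filters of size 1 x 1 x 1 x Ci
    return depthwise + pointwise

# Hypothetical layer: K = 3, Ci = 64, Co = 128, output volume 9 x 120 x 160.
args = (3, 64, 128, 9, 120, 160)
ratio = standard_3d_mults(*args) / separable_3d_mults(*args)
print(round(ratio, 1))  # ~22.3, approaching 27 (= K^3) as Co grows
```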
This work adopts such 3D separable convolution in a moving object detection network for the first time. It substantially reduces the amount of computation, meanwhile extracting temporal features in the video sequence.

IV. PROPOSED 3DS_MM NETWORK
The proposed deep moving object detection network shown in Fig. 3 is based on two major designs: (1) the encoderdecoder-based 3D separable CNN and (2) the multi-input multi-output (MIMO) strategy. This section describes the proposed approach in detail.

A. ENCODER-DECODER-BASED 3D SEPARABLE CNN
As shown in Fig. 3, the proposed network is an encoder-decoder-based CNN utilizing the 3D separable convolution described in Section III. The network comprises six blocks in the encoder network and three blocks in the decoder network. These block numbers are selected empirically to provide a good trade-off between the inference speed and the detection accuracy. Table 1 shows the details of the network and the shape of the input and output in each layer. In Table 1, the output shape is given in the data format "LHWC", where L is the temporal length, H is the height, W is the width, and C is the number of channels; "dw" denotes depthwise convolution, "pw" denotes pointwise convolution, and "s" denotes the strides in temporal length, height, and width.

1) The Encoder Network
For each training sample, the input to the encoder network is a set of video frames in a 4D shape of 9 × H × W × 3, with no background frame needed, where 9 is the number of video frames, H and W are the height and width of the video frames, and 3 is the number of RGB color channels. In Fig. 3, t_0, t_1, t_2, t_3, t_4, ... represent different time slots. In the first step, the standard 3D convolution described in Fig. 2(a) is adopted with 32 filters of size 3 × 3 × 3 × 3 to calculate the convolution on the nine input frames. The input video frames are transformed into 32 feature maps of shape 9 × H × W × 32 at the output. In the following blocks, each output feature map of each layer is convolved with an independent filter of size 3 × 3 × 3 × 1 with strides [1, 2, 2] (in the directions of temporal length, height, and width) for the depthwise convolution, and then convolved with C_o filters of size 1 × 1 × 1 × C_i with strides [1, 1, 1] for the pointwise convolution.
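As an illustration, a single encoder block of this kind could be sketched in PyTorch as below. This is our own hedged reconstruction rather than the authors' code (the paper uses the TensorFlow "NLHWC" layout, whereas PyTorch Conv3d expects channels-first tensors), and the channel counts are example values, not the exact entries of Table 1.

```python
import torch
import torch.nn as nn

class SeparableConv3dBlock(nn.Module):
    """One encoder block: 3D depthwise conv (strides 1,2,2) + 1x1x1 pointwise conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # Depthwise: one 3x3x3 filter per input channel (groups = c_in),
        # stride 1 in time, 2 in height and width.
        self.depthwise = nn.Conv3d(c_in, c_in, kernel_size=3,
                                   stride=(1, 2, 2), padding=1, groups=c_in)
        # Pointwise: c_out filters of size 1x1x1xc_in, stride 1 everywhere.
        self.pointwise = nn.Conv3d(c_in, c_out, kernel_size=1, stride=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (N, C, L, H, W)
        return self.act(self.pointwise(self.depthwise(x)))

# Example: 9 frames of 240x320 feature maps, 32 -> 64 channels, H and W halved.
x = torch.randn(1, 32, 9, 240, 320)
print(SeparableConv3dBlock(32, 64)(x).shape)  # torch.Size([1, 64, 9, 120, 160])
```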
2) The Decoder Network
The output of the encoder network is fed to the decoder network to produce the binary masks of the moving objects. Each layer of the decoder network adopts a transposed convolution, which spatially upsamples the encoded features and finally generates the binary masks at the same resolution as the input video frames.
The standard transposed convolution is split into a 1D pointwise transposed convolution and a 3D depthwise transposed convolution. These operations are defined similarly to the 1D pointwise convolution and the 3D depthwise convolution in the encoder network. In block 6 shown in Table 1, the encoder output of size 2 × H/4 × W/4 × 512 is converted to a tensor of size 6 × H/2 × W/2 × 256 using the 1D pointwise transposed convolution with 256 filters of size 1 × 1 × 1 × 512. By setting the strides to [3, 2, 2] for the temporal length, height, and width in the pointwise transposed convolution, the feature maps are up-scaled by 3 times (from 2 to 6) in the temporal length and enlarged by 2 times in height and width. A 3D depthwise transposed convolution with 256 filters of size 3 × 3 × 3 × 1 and strides [1, 1, 1] then projects the feature maps to a tensor of size 6 × H/2 × W/2 × 256 at the output of block 6. Block 7 is defined similarly. In the final block, the feature maps are projected to a 4D output of size 6 × H × W × 1, and a sigmoid activation function is appended to generate the probability masks for 6 successive frames. A threshold of 0.5 is applied to convert the probability masks to binary masks that indicate the detected moving objects.
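The corresponding decoder block can be sketched in the same spirit. The snippet below is a hedged illustration of block 6 only (pointwise transposed convolution with strides [3, 2, 2] followed by a depthwise transposed convolution with strides [1, 1, 1]); the padding and output_padding values are our own choices that reproduce the 2 × H/4 × W/4 × 512 to 6 × H/2 × W/2 × 256 shape change described above, not values reported in the paper.

```python
import torch
import torch.nn as nn

class SeparableDeconv3dBlock(nn.Module):
    """Decoder block 6: 1x1x1 pointwise transposed conv (strides 3,2,2)
    followed by a 3x3x3 depthwise transposed conv (strides 1,1,1)."""
    def __init__(self, c_in=512, c_out=256):
        super().__init__()
        # Pointwise transposed conv: temporal length x3, height and width x2.
        self.pointwise_t = nn.ConvTranspose3d(c_in, c_out, kernel_size=1,
                                              stride=(3, 2, 2),
                                              output_padding=(2, 1, 1))
        # Depthwise transposed conv: one 3x3x3 filter per channel, shape preserved.
        self.depthwise_t = nn.ConvTranspose3d(c_out, c_out, kernel_size=3,
                                              stride=1, padding=1, groups=c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (N, C, L, H, W)
        return self.act(self.depthwise_t(self.pointwise_t(x)))

# Encoder output 2 x 60 x 80 with 512 channels (for H = 240, W = 320 input frames).
x = torch.randn(1, 512, 2, 60, 80)
print(SeparableDeconv3dBlock()(x).shape)  # torch.Size([1, 256, 6, 120, 160])
```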

B. MIMO STRATEGY
Fig. 4 illustrates our proposed MIMO strategy and how it differs from SISO and MISO. The temporal dimension L of a 4D input or output of size L × H × W × C is redefined as the number of input frames L_i and the number of output masks L_o. By applying different padding and stride values in the convolutions of the network, a different number of output masks L_o can be predicted. In our study, we set L_i to 9 and L_o to 6. As shown in Fig. 4 (right), in the inference process, two groups of 9 input frames with 3 overlapping frames output two successive groups of 6 binary masks.
We also analyze how the computational complexity is reduced from MISO to this MIMO scheme. Let us consider our proposed network in Table 1. With the proposed MIMO scheme, the output layer in block 8 is of size L_o × H_o × W_o × (C_o = 1). Since block 8 mainly requires a pointwise convolution, the number of multiplications required to generate this output layer is

M_8 = C_i × L_o × H_o × W_o.    (5)

Denote the total number of multiplications from block 0 to block 7 as M_{0-7}; then the overall complexity of generating L_o binary masks is

M_{0-7} + M_8.    (6)

With the same network structure, if we adopt a MISO scheme, then the output layer is of size 1 × H_o × W_o × (C_o = 1). To generate L_o output binary masks, L_o forward passes are needed, so the overall complexity is

L_o × M_{0-7} + M_8.    (7)

Therefore, to output the same number of binary masks, MISO requires (7) − (6) = (L_o − 1) × M_{0-7} more multiplications than MIMO.
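To make the saving concrete, the following sketch (our own illustration with a hypothetical value for M_{0-7}; the true count depends on the frame resolution) plugs the paper's setting L_o = 6 into (6) and (7):

```python
# Hypothetical multiplication counts (units: billions of multiplications).
M_0_7 = 14.0          # assumed cost of blocks 0-7 for one forward pass
M_8   = 0.1           # cost of the pointwise output layer in block 8
L_o   = 6             # number of output masks per sample

mimo = M_0_7 + M_8                 # Eq. (6): one pass yields 6 masks
miso = L_o * M_0_7 + M_8           # Eq. (7): 6 passes, each yields 1 mask
print(miso - mimo)                 # (L_o - 1) * M_0_7 = 70.0
print(round(miso / mimo, 2))       # ~5.96x more multiplications for MISO
```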

V. TRAINING AND EVALUATION OF THE PROPOSED MODEL
To analyze how the proposed model performs, we conducted three experiments illustrated in Table 2: (1) video-optimized SDE setup on CDnet2014 dataset, (2) category-wise SIE setup on CDnet2014 dataset, and (3) complete-wise SIE setup on DAVIS2016 dataset. In SDE [50], frames in training and test sets were from the same video, whereas, in SIE [50], completely unseen videos were used for testing. Further, in category-wise SIE, the training and testing were done per category over CDnet2014, whereas, in complete-wise SIE, training and testing were done over the complete DAVIS2016 dataset.
All the experiments were carried out on an Intel Xeon with an 8-core 3GHz CPU and an Nvidia Titan RTX 24G GPU. The following sections present the details of the training and evaluation processes and performance evaluation metrics.

A. VIDEO-OPTIMIZED SDE SETUP ON CDNET2014 DATASET
The CDnet2014 dataset [75] was used in this experiment. It contains 11 video categories: baseline, badWeather, shadow, and so on. Each category has four to six videos, resulting in a total of 53 videos (e.g., the baseline category contains the sequences highway, office, pedestrians, and PETS2006). A video contains 900 to 7,000 frames. The spatial resolution of the video frames varies from 240 × 320 to 576 × 720 pixels. In our experiments, we excluded the PTZ (pan-tilt-zoom) category since its camera motion is excessive.
From each video, we selected the first 50% of frames as the training set and the last 50% as the test set. The SISO-based networks and the proposed MIMO-based 3DS_MM used exactly the same frames for training. Suppose that a video contains 100 frames; then for the SISO-based networks, the first 50 frames t_0∼t_49 were used for training, and the last 50 frames t_50∼t_99 were used for testing. For our proposed 3DS_MM, a 9-frame window slid over the same first 50% of frames, producing samples t_0∼t_8, t_1∼t_9, t_2∼t_10, ..., t_41∼t_49 to form the training set when the stride was 1, and frames t_50∼t_99 were used for testing. In this way, all the deep-learning-based models used the same frames for training; the only difference is that for the proposed network, the first 50% of frames were repeatedly utilized through the sliding operation. The traditional unsupervised methods WeSamBE [15], SemanticBGS [16], PAWCS [18], and SuBSENSE [20] were also tested on the same last 50% of frames for performance comparison.
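A minimal sketch of this sliding-window sample generation (our own illustration; the function and variable names are hypothetical) is given below for a 100-frame video:

```python
def sliding_windows(num_frames, window=9, stride=1, train_ratio=0.5):
    """Return the (start, end) frame-index windows used as training samples."""
    train_end = int(num_frames * train_ratio)          # first 50% of frames
    return [(t, t + window) for t in range(0, train_end - window + 1, stride)]

# 100-frame video: windows t0-t8, t1-t9, ..., t41-t49 (42 training samples).
windows = sliding_windows(100)
print(len(windows), windows[0], windows[-1])  # 42 (0, 9) (41, 50)
```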
We used the RMSprop optimizer with the binary cross-entropy loss function and trained each model for 30 epochs with batch size 1. The learning rate was initialized at 1 × 10^{-3} and was reduced by a factor of 10 if the validation loss did not decrease for 5 successive epochs.
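A hedged PyTorch equivalent of this training configuration (the dummy model and data are placeholders standing in for 3DS_MM and the CDnet2014 clips; this illustrates the stated hyper-parameters rather than the authors' implementation) could look as follows:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the 3DS_MM network (hypothetical).
model = nn.Sequential(nn.Conv3d(3, 1, kernel_size=3, padding=1), nn.Sigmoid())
criterion = nn.BCELoss()                                   # binary cross-entropy
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
# Reduce the learning rate by a factor of 10 when the validation loss
# has not decreased for 5 successive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5)

frames = torch.rand(1, 3, 9, 64, 64)                       # dummy clip, batch size 1
masks = torch.randint(0, 2, (1, 1, 9, 64, 64)).float()     # dummy ground-truth masks

for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(model(frames), masks)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())   # the training loss stands in for the validation loss here
```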

B. CATEGORY-WISE SIE SETUP ON CDNET2014 DATASET
In order to evaluate the generalization capability of the proposed 3DS_MM, we also ran experiments in the SIE setup. Compared to SDE, in SIE the training and test sets contain completely different sets of videos. In the category-wise SIE setup, the training and evaluation were conducted per category. A leave-one-video-out (LOVO) strategy, originally raised in [50], was applied to divide the videos in each category into training and test sets for the CDnet2014 dataset. For example, the baseline category contains four videos; three videos (highway, office, PETS2006) were used for training, and the fourth video (pedestrians) was used for testing. This SIE setup was carried out on seven categories, so for each method in comparison, seven models were trained entirely from scratch.
We used the RMSprop optimizer with binary cross-entropy loss function and trained the model for 30 epochs with batch size 5. The learning rate was initialized at 1 × 10 −3 and was reduced by a factor of 10 if the validation loss did not decrease for five successive epochs.

C. COMPLETE-WISE SIE SETUP ON DAVIS2016 DATASET
We also conducted an experiment in the complete-wise SIE setup on the DAVIS2016 dataset. Different from the category-wise setup on CDnet2014, the complete-wise setup on DAVIS2016 refers to training and evaluation on the whole dataset. In our experiment, 30 videos of the DAVIS2016 dataset were used for training, and 10 completely unseen videos were used for testing. For each method in comparison, only one unified model was trained from scratch without using any pre-trained model data.

D. EVALUATION METRICS
1) Efficiency
To evaluate the efficiency of our proposed model, the inference speed is measured in frames per second (fps), the model size is measured in megabytes (MB), the number of trainable parameters is measured in millions (M), and the computational complexity is measured in floating point operations (FLOPs).
2) Detection Accuracy
The F-measure is defined as

F-measure = (2 × precision × recall) / (precision + recall),

where precision = TP / (TP + FP) and recall = TP / (TP + FN), with TP, FP, and FN denoting the numbers of true positive, false positive, and false negative pixels, respectively.

The S-measure [86] combines the region-aware structural similarity S_r and the object-aware structural similarity S_o, and is more sensitive to structures in scenes:

S = α × S_o + (1 − α) × S_r,

where α = 0.5 is the balance parameter.

The E-measure, recently proposed in [87] based on cognitive vision studies, combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information.
We also evaluate the MAE [88] between the predicted output and the binary ground-truth mask as

MAE = (1/N) × \sum_{i=1}^{N} |Pred_i − GT_i|,

where Pred_i is the predicted value of the i-th pixel, GT_i is the ground-truth binary label of the i-th pixel, and N is the total number of pixels.
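For completeness, a small NumPy sketch of the F-measure and MAE computations on binary masks (our own illustration; it does not cover the region/object decomposition used by the S-measure or the E-measure):

```python
import numpy as np

def f_measure(pred, gt):
    """F-measure between a binary prediction mask and a binary ground-truth mask."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return 2 * precision * recall / (precision + recall + 1e-8)

def mae(pred, gt):
    """Mean absolute error between predicted values and binary labels."""
    return np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64)))

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
print(round(f_measure(pred, gt), 3), round(mae(pred, gt), 3))  # 0.667 0.333
```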

VI. EXPERIMENTAL RESULTS

A. ABLATION STUDY
We first investigated the influence of different components of our proposed 3DS_MM through ablation experiments. In order to quantify the effect of the two components, the 3D separable CNN and the MIMO strategy, we conducted four experiments over 10 categories of the CDnet2014 dataset in the SDE setup.
The results are shown in Table 3. We began with the standard 3D CNN and a MISO strategy, namely "3D CNN + MISO". It has an F-measure of 0.9532, a very low inference speed of 26 fps, approximately 9.13 M trainable parameters, and a computational complexity of 693.31 GFLOPs, which generates 1 output binary mask. To generate 6 output masks, the GFLOPs need to be multiplied by 6 (×6). We then replaced the standard 3D CNN by the 3D separable CNN, while the MISO strategy was retained. For a fair comparison, the 3D CNN and the 3D separable CNN structures adopted the same number of network layers, and their intermediate layers have the same output sizes. The resultant "3D separable CNN + MISO" method has a slightly reduced F-measure, but the inference speed increased from 26 fps to 31 fps. More importantly, the parameters and FLOPs were drastically reduced, due to the separable convolution operations. On the other hand, we retained the standard 3D CNN but replaced MISO by MIMO.
In particular, we kept the front part of the network the same and only modified the last layer to output 6 binary masks instead of a single mask. The resultant "3D CNN + MIMO" method significantly increased the inference speed (144 fps) compared to "3D CNN + MISO". Finally, the proposed "3D separable CNN + MIMO" method has a superior inference speed (154 fps) due to the MIMO strategy, as well as the fewest trainable parameters (∼0.36 M) and FLOPs (∼28.43 G) due to the 3D separable convolutions. The above results justify the effectiveness of our proposed model design.

B. OBJECTIVE PERFORMANCE EVALUATION
1) Objective Results in Video-Optimized SDE Setup on CDnet2014
The accuracy comparison of various methods in the SDE setup in each video category is shown in Table 4 (unSV: unsupervised learning, SV: supervised learning, SISO: single-input single-output, MISO: multi-input single-output, MIMO: multi-input multi-output). In Table 4, we highlight the best value in each column in bold; ↑ means a larger value of the metric denotes better performance, and ↓ means a smaller value denotes better performance. We observe that our proposed 3DS_MM model achieves the highest inference speed at 154 fps and performs best in the BDW-badWeather, DBG-dynamicBackground, IOM-intermittentObjectMotion, LFR-lowFramerate, and turbulence categories in F-measure. It improves the average F-measure by 1.1% and 1.4% compared to the methods with the second and third highest average F-measure values in Table 4. It also offers the highest average S-measure and E-measure and the lowest average MAE among all methods.

2) Objective Results in Category-Wise SIE Setup on CDnet2014
The accuracy comparison in the category-wise SIE setup uses the same notation, with the best value in each column highlighted in bold and the second best average accuracy values also highlighted. Although ChangeDet [50] offers better detection accuracy than our model, the inference speed of our model is 2.6 times that of ChangeDet.

3) Objective Results in Complete-Wise SIE Setup on DAVIS2016
All the models listed in Table 6 were trained and evaluated in the same complete-wise SIE setup described in Section V-C. It is more challenging for a model to perform well in this SIE setup on the DAVIS2016 dataset because (1) the complete-wise SIE setup mixes 30 different kinds of real-world videos together for training, and (2) the content complexity of the DAVIS2016 dataset is high. We compared our proposed 3DS_MM (with an inference speed of 154 fps and an average F-measure of 0.7317, S-measure of 0.7492, E-measure of 0.8024, and MAE of 0.2089 over the 10 test videos) to the state-of-the-art semi-supervised deep learning-based models MSK [68], CTN [69], SIAMMASK [70], HEGNet [71], and PLM [73]. Our proposed model is superior to these models in inference speed. Besides, our model improves the F-measure by 2.5%, 9.6%, and 6.5% compared to CTN, PLM, and SIAMMASK, respectively, and its F-measure is on par with that of HEGNet. Although MSK offers a 1.5% higher F-measure than ours, its inference speed is extremely low. Our proposed model also outperforms the supervised learning-based models FgSegNet_S [38], FgSegNet_M [38], FgSegNet_v2 [39], and 2D_Separable CNN [57] in F-measure by 10.3%, 11.7%, 10.6%, and 16.5%, respectively. Our proposed method demonstrates a similar superiority in S-measure, E-measure, and MAE. Although there are other models on the DAVIS Challenge website with higher detection accuracy than our proposed model, those models are far less efficient and their inference speeds are too slow for delay-sensitive scenarios.

Fig. 5 displays the detection accuracy metrics (F-measure, S-measure, E-measure, and MAE) versus the inference speed of all the compared models in the SDE setup, the category-wise SIE setup, and the complete-wise SIE setup. Since we aim at delay-sensitive applications, we expect our proposed 3DS_MM to offer an overwhelmingly high inference speed and a superior detection accuracy among models with high inference speeds. In Fig. 5, we observe that our proposed 3DS_MM surpasses all the other schemes in inference speed in all three experiment setups. In terms of the F-measure, S-measure, E-measure, and MAE, in the SDE setup our method is the best among all models, while in both the category-wise and complete-wise SIE setups our method is the best among all models with an inference speed above 65 fps.

In Table 7, we summarize the overall performance, including inference speed, trainable parameters, computational complexity, model size, and detection accuracy, of our proposed 3DS_MM and the other methods. The table is sorted in ascending order of inference speed. It is evident that the proposed 3DS_MM outperforms all the other listed methods with the highest inference speed at 154 fps, which is increased by 1.7 times and 1.8 times, respectively, compared to the second and third fastest methods in Table 7. The computational complexity and the model size of our proposed method are 28.43 GFLOPs and 1.45 MB, smaller than those of all the other models in Table 7, thanks to our proposed 3D separable convolution.

C. ACCURACY, SPEED, MEMORY, AND COMPUTATIONAL COMPLEXITY ANALYSIS
In terms of detection accuracy (F-measure, S-measure, E-measure, and MAE), our proposed model outperforms all other models in the SDE setup. In the category-wise SIE setup, our proposed method offers the second best accuracy scores. Although it is slightly worse than ChangeDet [50], its inference speed (154 fps) is 2.6 times that of ChangeDet (58.8 fps). In the complete-wise SIE setup, although our model offers slightly worse accuracy scores than MSK [68], it offers an overwhelming superiority in inference speed. The extremely low inference speed of MSK (0.5 fps) hinders the practical use of this model for delay-sensitive applications.
The number of trainable parameters of our proposed model (∼0.36 million) is much smaller than that of most of the models in comparison. The reason that ChangeDet [50] (∼0.13 million) and MSFgNet [41] (∼0.29 million) have fewer trainable parameters than ours is that they use 2D filters and are shallower networks with fewer convolutional layers, while our proposed 3DS_MM uses 3D filters and a deeper network. Nevertheless, the inference speeds of ChangeDet and MSFgNet are much slower than ours since they are both MISO networks. In contrast, our 3DS_MM significantly increases the inference speed due to the proposed MIMO strategy and 3D separable convolution.

D. SUBJECTIVE PERFORMANCE EVALUATION
In addition to the objective performance, we also provide a visual quality comparison, as shown in Fig. 6, Fig. 7, and Fig. 8.

1) Subjective Results in Video-Optimized SDE setup on CDnet2014
In Fig. 6, we randomly picked a sample test frame from categories BSL-baseline, BDW-badWeather, NVD-nightVideos, and IOM-intermittentObjectMotion. We observe that (1) the proposed 3DS_MM provides more details and clearer edges in the detected foreground objects, such as the car mirrors in "BSL" and "BDW", and (2) the proposed method detects more contiguous objects such as the bus in "NVD" and the walking man in "IOM". In contrast, the detected binary masks of other methods in comparison have either blurry edges or missing parts.

2) Subjective Results in Category-Wise SIE setup on CDnet2014
In Fig. 7, we randomly select a sample frame from each of the four categories (BSL-baseline, BDW-badWeather, LFR-lowFramerate, SHD-shadow) of the CDnet2014 test results to show the visual quality of the models in the category-wise SIE setup. Our proposed model has a better generalization capability compared to the other models. It detects clearer shapes of the persons in BSL and SHD and more details of the persons' legs in SHD. The results of the other methods, however, are either noisy, blurry, or have missing parts. In addition, the proposed model performs better in the BDW and LFR categories with clear and correct shapes, while other models detect excessive or non-contiguous content.

3) Subjective Results in Complete-Wise SIE setup on DAVIS2016
In Fig. 8, we randomly select four videos (camel, horsejump-high, paragliding-launch, and kite-surf) from the DAVIS2016 results. Our proposed model detects the shapes of objects consistently well across all four videos, while the detection results of 2D_Separable [57], FgSegNet_S [38], FgSegNet_v2 [39], and SIAMMASK [70] are either noisy or incomplete. Besides, the detection results of CTN [69], MSK [68], and PLM [73] for the kite-surf video are less accurate than those of the proposed model.

VII. CONCLUSION
In this paper, we propose the 3DS_MM model for moving object detection. Our model is designed specifically for memory- and computation-resource-limited environments and for delay-sensitive tasks. It utilizes spatial-temporal information in video data via 3D convolution. The proposed 3D depthwise and pointwise convolutions, together with the MIMO strategy, effectively reduce computational complexity and significantly enhance the inference speed. In addition, the 3D separable convolution leads to very few trainable parameters and a small model size. Finally, the defined SDE and SIE experiments demonstrate that our proposed model achieves superior detection accuracy among all compared models with high inference speeds, making it suitable for low-latency vision applications.
In terms of future work, we plan to use data-augmentation techniques to improve the robustness of the proposed model and to further improve its generalization capability on unseen videos. We will also investigate the potential of feature fusion to improve moving object detection accuracy without reducing efficiency. Further, we plan to extend this work to semantic segmentation tasks.