A Multi-Scale Recurrent Framework for Motion Segmentation With Event Camera

Motion segmentation is a challenging computer vision task that aims to segment moving targets from a dynamic scene. In this paper, we introduce an additional sensing modality to improve robustness. The event camera is a bio-inspired sensor that captures intensity changes with exceptional temporal resolution and dynamic range, which makes it well suited to motion segmentation. We therefore present a novel framework for event-based motion segmentation and propose the Multi-Scale Recurrent Neural Network (MSRNN) to fuse temporal information efficiently. To the best of our knowledge, this is the first time that a multi-scale recurrent architecture has been applied to event-based motion segmentation. The proposed framework is evaluated through experiments conducted on the EV-IMO dataset. Our method achieves a mean Intersection-over-Union (mIoU) of 82.0%, which sets a new state-of-the-art in motion segmentation. To further validate our approach in difficult real-world scenarios, we introduce the Event Challenging Motion dataset, consisting of 350 images and corresponding events, on which our method outperforms the other methods by 1.5% in Intersection-over-Union (IoU).


I. INTRODUCTION
Motion segmentation aims to predict motion masks to understand scene dynamics. It allows applications such as robotics to focus on moving objects or ignore them, depending on task requirements. While image-based motion segmentation has advanced rapidly in recent years [1], [2], it is still hindered by the drawbacks of frame-based cameras, which inevitably introduce motion blur and image degradation, as well as loss of image information in high-dynamic-range scenes. Event cameras [3], [4], [5], which offer high temporal resolution and dynamic range, are widely applied in optical flow estimation [6], [7], [8], [9], [10], [11], [12], image deblurring [13], [14], [15], [16], [17], [18], frame reconstruction [19], video frame interpolation [20], human pose estimation [21], fast auto-focus [22], and other computer vision tasks. Distinct from traditional image sensors such as CMOS and CCD, event cameras function as bionic technology by capturing asynchronous temporal intensity changes of a scene as a continuous event stream. Thus, event cameras can detect both the moving objects in a dynamic scene and the background motion caused by the inevitable movement of the camera. Moreover, they exhibit exceptional temporal resolution and dynamic range, making them suitable for handling complex and challenging scenarios in motion segmentation.
The associate editor coordinating the review of this manuscript and approving it for publication was Angel F. García-Fernández.
Given a dynamic scene, the goal of motion segmentation is to distinguish the moving objects from the background. It can be deployed on a drone or a walking robot, machines that need to accurately perceive and respond to rapidly moving objects in the scene, especially in extreme conditions. We therefore aim to design a more robust framework for motion segmentation that can adapt to fast-moving or low-illumination scenes. Temporal information plays an important role in this task. Most learning-based methods take multiple image frames or additional data for motion segmentation [1], [23], [24], [25]. Meunier et al. utilize only optical flow information for motion segmentation [26]. However, the drawbacks of images and low-quality additional information reduce the accuracy. Instead, we utilize events for motion segmentation, owing to the favorable properties of event cameras.
Recent learning-based motion segmentation networks such as [25], [26], and [27] are simply based on UNet [28], which was first introduced for biomedical segmentation and has since proven useful for general segmentation. In the UNet, the encoder extracts the overall and local information from the event frames to estimate the pose of the moving object, and the decoder fuses the features from the encoder and reconstructs the contour and position of the moving object. The multi-scale features from each stage of the encoder contain information about the image at different levels. Large-scale features provide more edge and contour information, while small-scale features contain richer semantic information about the moving object and background. However, this framework does not utilize long-range temporal information for motion segmentation, which reduces temporal consistency.
Building on this previous work, our method, the Multi-Scale Recurrent Neural Network (MSRNN), focuses on the temporal consistency of a dynamic scene. MSRNN fuses multi-scale features from the previous time step to obtain more accurate motion estimation. Besides, the iterative recurrent architecture introduces no extra learnable parameters and is easy to train.
In this work, we explore the potential of events for motion segmentation and propose the Multi-Scale Recurrent Neural Network (MSRNN), which effectively fuses long-range temporal information from previous time steps. To the best of our knowledge, this is the first time that a recurrent architecture has been applied to event-based motion segmentation. To improve the robustness of motion prediction on both large and small scales, we propose a multi-scale recurrent architecture that incorporates a recurrent block at every encoder stage. Specifically, the spatial size of the feature maps is halved after each block, so that different stages help the prediction of large and small motion, respectively. We conduct experiments comparing our method with state-of-the-art motion segmentation methods on the EV-IMO dataset [27], and demonstrate its effectiveness through a detailed ablation study. Next, we collect a new event motion segmentation dataset named Event Challenging Motion (ECMotion) in a laboratory setting with a SEEM1 event camera. Our dataset contains 350 frames in total, recorded under varying light conditions, of which 150 frames are annotated with ground truth. Furthermore, we perform extensive comparisons against an image-based framework and other competitive methods on the ECMotion dataset, demonstrating the superiority of our event-based motion segmentation framework.
In summary, our contributions are as follows:
1) We utilize events to improve motion segmentation and propose the Multi-Scale Recurrent Neural Network (MSRNN) for event-based motion segmentation.
2) Our motion segmentation model achieves a new state-of-the-art for motion segmentation on the EV-IMO dataset.
3) A novel dataset for evaluating real-world high-speed motion segmentation is proposed, and several methods are evaluated on it.

II. RELATED WORK
A. IMAGE-BASED MOTION SEGMENTATION
Motion segmentation is a fundamental computer vision task. Hand-crafted algorithms such as [29], [30], and [31] separate the optical flow into 'layers' modeled by an affine motion, so their robustness depends on the performance of the optical flow algorithms. Subsequent work utilizes a Bayesian treatment [32] to enhance multi-body factorization [33]. Brox et al. propose to integrate motion segmentation into the variational formulation of optical flow estimation with level sets [34]. Before the advent of deep learning, several more notable methods appeared, most of them building upon previous ideas, including advanced trajectory-based methods [35] and a Conditional Random Field (CRF) based approach [36]. Nevertheless, the speed, robustness, and performance of all the above algorithms cannot compete with modern learning approaches.
In recent years, motion segmentation has made significant progress through the use of Convolutional Neural Networks (CNNs). Tokmakov et al. extract the feature map of each frame and then incorporate the features of adjacent frames to establish motion masks [23]. Shen et al. predict motion masks using a lightweight UNet [28] in their pipeline [25]. The community has witnessed several innovative components and methods, including multi-fusion architectures [1], [2], [24], [37], partially supervised networks [38], a fully unsupervised network [26], and a Recurrent Neural Network [39]. Despite their strong results on motion segmentation, image-based methods still struggle in real-world scenarios, particularly in extreme conditions such as high-speed motion and low illumination.

B. EVENT-BASED MOTION SEGMENTATION
Recently, a number of event-based motion segmentation algorithms have appeared. Lagoree et al. [40] propose a kernel function-based method for segmenting moving objects. In another work, Mitrokhin et al. [41] propose a motion detection and tracking algorithm using time images, compensating for camera motion while tracking moving objects. Moreover, they propose a challenging event-based dataset called EED, which contains five different scenes and bounding boxes of moving objects. Zhou et al. create a space-time event graph and pass it to an iterative clustering algorithm to predict scene motion [42]. Chen et al. propose a mutually reinforced framework for both motion estimation and event denoising [43]. However, these conventional approaches operate in event space and can be severely affected by event noise.
FIGURE 1. The proposed framework for motion segmentation using the voxel grid representation. The raw events are first converted into a voxel grid. A pack of adjacent frames is then input into the MSRNN to generate motion masks for each time step. We compare the masks with their respective ground truths. Notably, the current step $t_k$ is used for inference.
Mitrokhin et al. [27] introduce the first event-based motion segmentation dataset, called EV-IMO, which contains depth maps, motion masks, and camera and object motion information. They also present a deep convolutional neural network based on UNet [28] to predict motion masks for applications with limited scenes, such as robotics. This method uses early fusion by concatenating the input feature maps of adjacent frames. However, it does not take full advantage of long-range temporal information, because the events it uses lie near the target timestamp and earlier and later events are discarded. In another work, they propose a method using an event surface and a Graph Neural Network (GNN) [44]. Although this approach, which treats each event as a node in the GNN, improves training and inference times, GNN-based methods still struggle with training instability.
Most recent image-based methods, such as the multi-fusion approaches [1], [23], [24], [25], take additional information such as optical flow or depth as input. The motion segmentation pipeline therefore takes more effort to build, and low-quality optical flow can degrade the segmentation result. The RNN-based method [39] does not utilize multi-scale information and still takes additional optical flow as input. Our method is most similar to UNet [28] or SfM-Net [38], which are image-based networks. However, we use the voxel grid [6] as input, which is fully compatible with these image-based methods. Moreover, we design a novel multi-scale recurrent architecture that fuses long-range temporal information from adjacent time steps, analyzing the relative pose change to learn the motion mask. The multi-scale architecture extracts local and global features, helping to alleviate the influence of event camera noise.

III. METHOD
A. FRAMEWORK
Event cameras [5] respond only to changes in brightness in the log domain of the photocurrent intensity, i.e., $L = \log(I)$. An event $e_i = (x_i, t_i, p_i)$ is triggered when the brightness change at pixel $x_i$ since the previous event at that pixel,
$$\Delta L(x_i, t_i) = L(x_i, t_i) - L(x_i, t_i - \Delta t_i), \tag{1}$$
reaches a contrast threshold $C_{\mathrm{th}}$:
$$\Delta L(x_i, t_i) = p_i \, C_{\mathrm{th}}, \tag{2}$$
where $\Delta t_i$ denotes the time gap since the previous event at the same pixel $x_i$, and $p_i \in \{+1, -1\}$ indicates the polarity of the brightness change. The inherent characteristics of event cameras make them suitable for capturing dynamic and fast scenes, particularly under challenging light conditions. In our work, we explore the potential of events for motion segmentation. Due to their high temporal resolution without motion blur, event cameras are ideal for motion segmentation in dynamic scenes. Fig. 1 shows the proposed event-based motion segmentation framework. We first convert an event stream with a time interval of $T$ into a voxel grid [6] with $C$ channels. Each channel consists of the events accumulated within a time window of $T/C$, thus partially maintaining the temporal information of the raw data. Subsequently, we process this voxel grid using a CNN-based network to predict motion masks. Each prediction can be used to predict the consecutive frame.
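The event-to-voxel-grid conversion described above can be illustrated with a minimal NumPy sketch. It assumes hard temporal binning (each event falls into exactly one of the C windows of length T/C) and signed polarity accumulation; the function name `events_to_voxel_grid` and its arguments are hypothetical and not taken from the paper, whose exact accumulation rule may differ.

```python
import numpy as np

def events_to_voxel_grid(events, t_start, duration, num_bins, height, width):
    """Accumulate events into a (num_bins, height, width) voxel grid.

    events: array of shape (N, 4) with columns (x, y, t, p).
    Assumption: every channel sums the signed polarities (+1 / -1) of the
    events whose timestamps fall into its window of length duration / num_bins.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = np.where(events[:, 3] > 0, 1.0, -1.0).astype(np.float32)

    bin_width = duration / num_bins
    bins = np.floor((t - t_start) / bin_width).astype(np.int64)
    bins = np.clip(bins, 0, num_bins - 1)  # keep boundary events inside the grid

    # np.add.at accumulates correctly even when several events hit one cell.
    np.add.at(grid, (bins, y, x), p)
    return grid

# Toy example: T = 0.03 s split into C = 3 channels of 0.01 s each.
events = np.array([
    [2, 1, 0.001, 1],
    [2, 1, 0.004, 1],
    [5, 3, 0.015, -1],
    [0, 0, 0.029, 1],
])
grid = events_to_voxel_grid(events, t_start=0.0, duration=0.03,
                            num_bins=3, height=4, width=6)
```

In this toy case the two positive events at pixel (x=2, y=1) land in the first channel, while the negative event lands in the second channel, so polarity and coarse timing are both retained.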
Our pipeline can be expressed as (3). Here, $\sum_{t=t_k}^{T} e_t$ indicates the events for time $t_k$, $V$ represents the voxel grid transformation, and $C_{k-1}$ and $H_{k-1}$ represent the multi-scale features from the previous frames. Furthermore, $F_e$ represents the encoder's transformation, with $\theta_1$ representing its learnable parameters, and $F_d$ is the decoder's transformation, with $\theta_2$ representing its learnable parameters.
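Under the definitions above, one plausible way to write the pipeline is sketched below. The symbols $M_k$ (predicted mask at step $k$) and $f_k$ (encoder features passed to the decoder) are assumptions introduced here for illustration only, as is the choice to write the encoder as also returning the updated states $C_k$ and $H_k$; $\theta_1$ and $\theta_2$ denote the encoder and decoder parameters.

```latex
\begin{aligned}
(f_k,\; C_k,\; H_k) &= F_e\!\left( V\!\Big( \sum_{t=t_k}^{T} e_t \Big),\; C_{k-1},\; H_{k-1};\; \theta_1 \right), \\
M_k &= F_d\!\left( f_k;\; \theta_2 \right).
\end{aligned}
```

Read this way, the only quantities carried from one time step to the next are the multi-scale states $(C_k, H_k)$, which is what makes the architecture recurrent.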
In our proposed framework, we utilize two paradigms to encode the time information: the voxel grid representation and the recurrent architecture. The raw events $(x, y, t, p)$ of a specific time period are converted into a voxel grid in $\mathbb{R}^{H \times W \times C}$, where $C$ indexes the different time periods. Raw events are not compatible with CNN-based methods because of their asynchronous nature. The voxel grid is a dense event representation that can be fed to a CNN-based framework and is widely used in event-based optical flow estimation [6], [10], [11], [12] to learn the scene motion. Compared to other dense event representations such as the event frame [45], the motion-compensated event image [46], and the time surface [47], the voxel grid has multiple channels with finer time information and preserves the polarity information within the time period. Compared to the raw event stream, however, the time information is discretized, so some temporal accuracy is lost. The other paradigm is the recurrent architecture. We design a novel recurrent network based on UNet [28] that efficiently fuses the features of previous adjacent frames to predict the motion masks. Based on these two paradigms, our proposed recurrent network for event cameras achieves an 82.0% mIoU score on motion segmentation, demonstrating our pipeline's feasibility.
VOLUME 11, 2023

B. MODEL ARCHITECTURE
To efficiently feed the features of previous frames to our network, we employ the UNet as the foundation and design a Multi-Scale Recurrent Neural Network (MSRNN) to improve the accuracy of motion segmentation. Each UNet unit of our architecture mainly comprises an encoder and a decoder, as depicted in Fig. 2. The decoder contains four successive decoder blocks and ends with a Sigmoid output layer. Table 1 presents the input and output sizes and the number of channels of each network layer. Each encoder block is composed of two consecutive convolutional layers, a Channel-Wise Attention (CA) block [48], and a Long Short-Term Memory (LSTM) [49] block, as shown in Fig. 3.
As illustrated in Fig. 4 (a), the CA block includes a branch to learn the channel weight and a shortcut to connect the input feature. This module multiplies the weight with the output of the network layer. We unfold the high-level features $f^h \in \mathbb{R}^{W \times H \times C}$ as $f^h = [f^h_1, f^h_2, \ldots, f^h_C]$, where $f^h_i \in \mathbb{R}^{W \times H}$ represents the feature of the $i$-th channel and $C$ is the total channel number. The CA weight $\omega^h \in \mathbb{R}^C$ is extracted from $f^h$ through the CA network layer. First, each $f^h_i$ is reduced by average pooling, which yields a channel-wise feature vector $\nu^h \in \mathbb{R}^C$. Then $\omega^h$ is obtained through a Fully Connected (FC) layer followed by a ReLU activation layer, another FC layer, and a Sigmoid activation layer, so that the weight of each channel is mapped to [0, 1]. Finally, the module's output $\tilde{f}^h$ is obtained by weighting the original input feature with $\omega^h$. Through the CA block, the temporal information of the multi-scale feature of each encoder block can be enhanced, which improves the accuracy of predicting fine-grained mask edges.
In our MSRNN model, every encoder block contains an LSTM block. Fig. 4 (b) illustrates the fundamental components of the LSTM, including a memory cell, an input gate, an output gate, and a forget gate. The memory cell stores the previous values of the cell and its states, while the three gates regulate how much of the previous cell state to "forget" or "remember" when processing a new input. Through the LSTM, each stage of the encoder can transmit information from adjacent frames effectively. The LSTM block at scale $m$ takes the previous frame's state $(c^m_{k-1}, h^m_{k-1})$ as input and generates an output state $(c^m_k, h^m_k)$ to assist with predicting the following frame. This spatial design of the model realizes a multi-scale recurrent architecture.
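The channel-wise attention computation described above (average pooling, FC, ReLU, FC, Sigmoid, then channel re-weighting) can be sketched in NumPy as follows. This is only an illustrative forward pass: the bottleneck width `R`, the random weights, and the function name `channel_attention` are assumptions, and the trained layer sizes of the actual network are not given here.

```python
import numpy as np

def channel_attention(f, w1, b1, w2, b2):
    """Forward pass of a channel-wise attention block on a (C, H, W) feature.

    Returns the re-weighted feature and the per-channel weights in [0, 1].
    """
    v = f.mean(axis=(1, 2))                              # average pooling -> (C,)
    hidden = np.maximum(w1 @ v + b1, 0.0)                # first FC layer + ReLU
    omega = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))    # second FC layer + Sigmoid
    return f * omega[:, None, None], omega               # weight every channel

rng = np.random.default_rng(0)
C, H, W, R = 8, 4, 6, 2   # R is a hypothetical bottleneck width
f = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((R, C)) * 0.1, np.zeros(R)
w2, b2 = rng.standard_normal((C, R)) * 0.1, np.zeros(C)

out, omega = channel_attention(f, w1, b1, w2, b2)

# With all-zero parameters the Sigmoid outputs 0.5 for every channel.
_, omega_zero = channel_attention(f, np.zeros((R, C)), np.zeros(R),
                                  np.zeros((C, R)), np.zeros(C))
```

The output keeps the spatial layout of the input; only the relative magnitude of each channel changes.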

C. MULTI-SCALE RECURRENT ARCHITECTURE
To utilize long-range temporal information from events and make full use of multi-scale features, we design MSRNN with a multi-scale recurrent architecture, as illustrated in Fig. 2. Compared with a traditional RNN, MSRNN transmits multi-scale features, which carry richer information from the preceding time step, and implementing the recurrence with LSTM blocks makes the network easier to train. The encoder includes four stages that receive features $(c_{k-1}, h_{k-1})$ from the previous frame and forward new features $(c_k, h_k)$ at scales of [1/2, 1/4, 1/8, 1/16], respectively, to the subsequent frame. The scales follow from the shape of the features output by each UNet encoder block. The input size is 256 × 336, and the 1/16 scale is sufficient for extracting low-level features. Thanks to the previous features being supplied at multiple scales, our network enables more accurate prediction of both large and small objects. We utilize LSTM units to incorporate long-range information, which enhances the prediction of the present frame and improves temporal consistency. As shown in Fig. 3, each stage of the encoder contains an LSTM block that links adjacent time steps. Adjacent frames are highly correlated in motion segmentation, so the LSTM block can improve the accuracy of prediction and maintain temporal consistency. Our proposed multi-scale architecture preserves multi-scale and multi-level features from preceding steps, which serve as prior knowledge for the current stage; this concentrated information from previous frames improves overall robustness.
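The state passing across scales and time steps can be sketched as below. This is a simplified illustration, not the network itself: the LSTM update is written as a pointwise (1 × 1) operation, the channel widths in `CHANNELS` are hypothetical and kept small, and random arrays stand in for the convolution and CA features of each stage. Only the scales [1/2, 1/4, 1/8, 1/16], the 256 × 336 input size, and the per-stage $(c, h)$ states come from the text.

```python
import numpy as np

INPUT_HW = (256, 336)
SCALES = (2, 4, 8, 16)      # stage m works at 1 / SCALES[m] of the input size
CHANNELS = (2, 4, 8, 16)    # hypothetical channel widths, kept small here

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c_prev, h_prev, weight, bias):
    """Pointwise LSTM update on (C, H, W) maps (assumed 1x1 gate convolutions)."""
    ch = x.shape[0]
    z = np.concatenate([x, h_prev], axis=0)                        # (2C, H, W)
    gates = np.einsum("oc,chw->ohw", weight, z) + bias[:, None, None]
    i, f, o, g = (gates[:ch], gates[ch:2 * ch],
                  gates[2 * ch:3 * ch], gates[3 * ch:])
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)              # memory cell
    h = sigmoid(o) * np.tanh(c)                                    # hidden state
    return c, h

rng = np.random.default_rng(0)
params = [(rng.standard_normal((4 * ch, 2 * ch)) * 0.1, np.zeros(4 * ch))
          for ch in CHANNELS]

# Zero-initialised (c, h) state for every encoder stage before the first frame.
states = [(np.zeros((ch, INPUT_HW[0] // s, INPUT_HW[1] // s)),
           np.zeros((ch, INPUT_HW[0] // s, INPUT_HW[1] // s)))
          for ch, s in zip(CHANNELS, SCALES)]

for k in range(4):                                  # four adjacent frames
    for m, (ch, s) in enumerate(zip(CHANNELS, SCALES)):
        # Stand-in for the conv + CA features of stage m at frame k.
        x = rng.standard_normal((ch, INPUT_HW[0] // s, INPUT_HW[1] // s))
        c_prev, h_prev = states[m]
        states[m] = lstm_step(x, c_prev, h_prev, *params[m])

state_shapes = [h.shape[1:] for _, h in states]
```

For a 256 × 336 input, the four recurrent states have spatial sizes 128 × 168, 64 × 84, 32 × 42, and 16 × 21, one per encoder stage.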

IV. EXPERIMENTS
A. DATASET
The EV-IMO dataset [27] serves as the main dataset of our research, representing the first event-based dataset to encompass both camera motion and multiple moving objects. The data is collected from specific scenes captured by the DAVIS-346C event camera with a resolution of 260 × 346 and a 70° field of view, for around 30 minutes in total.
Each recorded sequence presents no more than three objects, with a true mask provided for each object at a rate exceeding 200 frames per second.
The EED dataset [41] contains a limited number of samples with annotated bounding boxes but no motion masks, and it has no training set, which makes it unsuitable for our learning framework. Compared to the synthetic dataset MOD++, EV-IMO contains only real-world data, which is the focus of our work. Therefore, we only use the EV-IMO dataset for our research.
The EV-IMO dataset includes 34 high-quality sequences for training, featuring the main scenes of boxes, floor, table, tabletop, and wall. It also includes 21 sequences for validation, covering the main scenes of boxes, fast, floor, table, tabletop, and wall. The original sequences of the EV-IMO dataset last seconds or minutes, which makes them unsuitable as training samples directly. Consequently, we divide the data into multiple time slices, each with its corresponding ground truth, and use these slices as training samples. With image-based ground truth generated at 40 frames per second, the interval between two adjacent ground-truth timestamps is roughly 25 ms. The true mask (ground truth) of each moving object is saved in image form, marked with its corresponding timestamp. We take each timestamp corresponding to a ground-truth image as the slice center, with a slice length of 0.03 s. Each slice is represented as an event matrix of size N × 4, where N is the number of events within the 0.03 s. Then, the matrix is converted into the voxel grid format, creating an image-like matrix of size 3 × 260 × 346. We further adjust the shape to 3 × 256 × 336 by center cropping. The 0-th dimension holds the total event integral within a given time interval. Given a time span of T = 0.03 s and 3 channels, each channel covers a timespan of T/3 = 0.01 s. With this representation, we can extract features from events and treat each slice as an image to train a convolutional neural network and learn the motion mask.
In the end, we obtain around 15,000 samples for training and 5,300 samples for validation.
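The slicing and cropping steps above can be sketched in a few lines of NumPy. The helper names `slice_events` and `center_crop` are hypothetical, and the half-open window convention is an assumption; the numeric values (40 frames per second, 0.03 s slices, 260 × 346 cropped to 256 × 336) come from the text.

```python
import numpy as np

def slice_events(events, t_center, length=0.03):
    """Keep events (N x 4: x, y, t, p) whose timestamps fall in the half-open
    window [t_center - length / 2, t_center + length / 2)."""
    t = events[:, 2]
    keep = (t >= t_center - length / 2) & (t < t_center + length / 2)
    return events[keep]

def center_crop(voxel, out_h=256, out_w=336):
    """Crop a (C, H, W) voxel grid symmetrically around its center."""
    _, h, w = voxel.shape
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return voxel[:, top:top + out_h, left:left + out_w]

gt_rate_hz = 40
gt_interval_s = 1.0 / gt_rate_hz          # 0.025 s between ground-truth frames

# Toy event list: only the three events near t = 0.10 s survive the slicing.
events = np.array([
    [10, 20, 0.00, 1],
    [11, 20, 0.09, -1],
    [12, 21, 0.10, 1],
    [13, 22, 0.11, 1],
    [14, 23, 0.20, -1],
])
sliced = slice_events(events, t_center=0.10)

voxel = np.zeros((3, 260, 346), dtype=np.float32)
cropped = center_crop(voxel)
```

Because the 0.03 s slice is slightly longer than the 0.025 s spacing between ground-truth frames, neighbouring slices overlap by a few milliseconds under this windowing.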

B. IMPLEMENTATION DETAIL
TABLE 2. [...] [27]. The results of the EV-IMO method and GConv are from [44]. "mIoU": mean IoU value of moving objects and the background. "Swin-UNet+": an updated version of Swin-UNet with a multi-scale recurrent architecture.
To address the data imbalance in the dataset, we employ a hybrid loss function that combines Focal Loss [50] and Dice Loss [51]. It is defined in terms of $l = \ell_b(x_i, y_i)$, the BCE loss, where $x_i$ and $y_i$ denote the prediction and ground truth, respectively. Based on the spatial architecture design (Fig. 2), we propose a multi-frame loss that sums the losses of four adjacent frames into a batch loss to supervise the training. We train our network for 15 epochs on an NVIDIA Titan XP with a batch size of 4, which takes approximately 10 hours. We use the Root Mean Square Propagation (RMSprop) optimizer with an initial learning rate of $1 \times 10^{-5}$ and a Warm-Up scheduler for 10 epochs.
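As an illustration of how a Focal plus Dice hybrid and a multi-frame sum can be put together, the sketch below uses common textbook forms of both losses. The focusing parameter `gamma = 2`, the Dice smoothing term, and the unweighted sum of the two terms are assumptions made for this example and are not confirmed by the text; all function names are hypothetical.

```python
import numpy as np

def bce(x, y, eps=1e-7):
    """Element-wise binary cross-entropy l between prediction x and target y."""
    x = np.clip(x, eps, 1.0 - eps)
    return -(y * np.log(x) + (1.0 - y) * np.log(1.0 - x))

def focal_loss(x, y, gamma=2.0):
    """Focal loss written in terms of the BCE loss l (assumed form)."""
    l = bce(x, y)
    p_t = np.exp(-l)                       # probability assigned to the true class
    return np.mean((1.0 - p_t) ** gamma * l)

def dice_loss(x, y, smooth=1.0):
    """Soft Dice loss with a smoothing term (assumed form)."""
    inter = np.sum(x * y)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(x) + np.sum(y) + smooth)

def hybrid_loss(x, y):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return focal_loss(x, y) + dice_loss(x, y)

def multi_frame_loss(preds, targets):
    """Sum the per-frame hybrid losses over a pack of adjacent frames."""
    return sum(hybrid_loss(x, y) for x, y in zip(preds, targets))

# Toy 8 x 8 mask with a small moving object (imbalanced foreground).
y = np.zeros((8, 8))
y[2:5, 2:5] = 1.0
x_good = 0.98 * y + 0.01                   # close to the ground truth
x_bad = 1.0 - x_good                       # inverted prediction

loss_good = hybrid_loss(x_good, y)
loss_bad = hybrid_loss(x_bad, y)
loss_perfect = hybrid_loss(y, y)
loss_pack = multi_frame_loss([x_good] * 4, [y] * 4)
```

The focal term down-weights the many easy background pixels, while the Dice term measures overlap on the small foreground region, which is why such hybrids are commonly used for imbalanced masks.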

C. RESULTS
For each validation dataset scenario, we evaluate the accuracy of the samples by measuring the Intersection over Union (IoU) and the mean IoU (mIoU) for segmentation.
In this subsection, we compare our MSRNN with state-of-the-art models and network architectures, namely the pyramid feature attention network (SODModel) [48], SwiftNet [52], Swin-UNet [53], and P2T [54], on the EV-IMO dataset. We also compare our method with the results of the EV-IMO method [27] and GConv [44], both as reported in [44]. We do not provide the results of [41] and [42], because they operate in event space and report success rates, a metric different from the ones we use in this paper, so we cannot obtain quantitative results in the same form as for our pipeline. We present motion segmentation results in Table 2. The results show that our MSRNN outperforms the other methods in most of the EV-IMO scenarios, with a mean improvement of 0.9% in mIoU over SODModel. In addition, we extend Swin-UNet with a multi-scale recurrent architecture and name the result Swin-UNet+. The result shows that the multi-scale recurrent architecture can also be applied to a transformer and brings a considerable improvement. Additionally, we evaluate the models by the proportion of samples with IoU values greater than 50% and 60%. This metric denotes the rate of successful prediction, where the predicted mask overlaps the ground truth by at least 50% and 60%, respectively. The quantitative results, including runtime, are given in Table 3, indicating that our method performs better than the others. However, we are more concerned with accuracy than speed in this vision task.
TABLE 4. [...] [27]. We halve the channel numbers of each convolution layer to conduct the ablation study. "IoU": IoU value of moving objects.
We further provide a visual comparison of our MSRNN and the other models, illustrated in Fig. 5, demonstrating that MSRNN exhibits the best segmentation results. For instance, for the segmentation of a drone wing contour, our model yields astounding results, accurately segmenting the intricate details of the object, while the other models struggle.
Moreover, our proposed model shows better robustness than the other models when dealing with multiple objects.
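For reference, the evaluation metrics used in this subsection can be computed as in the following sketch. It assumes binary masks, takes mIoU as the mean of the moving-object IoU and the background IoU, and treats the IoU of two empty masks as 1; the function names are hypothetical.

```python
import numpy as np

def iou(pred, gt):
    """IoU of two boolean masks; defined as 1.0 when both masks are empty."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pred, gt):
    """Mean of the moving-object IoU and the background IoU."""
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))

def success_rate(ious, threshold):
    """Proportion of samples whose IoU exceeds the threshold."""
    return float(np.mean(np.asarray(ious) > threshold))

gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:2] = True                        # 4 object pixels
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True                      # 6 predicted pixels, 4 of them correct

object_iou = iou(pred, gt)                 # 4 / 6
sample_miou = mean_iou(pred, gt)           # (4/6 + 10/12) / 2 = 0.75

per_sample_ious = [0.40, 0.55, 0.70, 0.90]
rate_50 = success_rate(per_sample_ious, 0.5)   # 3 of 4 samples
rate_60 = success_rate(per_sample_ious, 0.6)   # 2 of 4 samples
```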

D. ABLATION STUDY
We conduct additional experiments to investigate the impact of various factors on our network's performance, including the model input, the CA module, the presence of the recurrent architecture, the multi-scale architecture, and the type of recurrent unit. As shown in Table 4, simply introducing a multi-scale recurrent architecture for fusing adjacent frames yields an improvement of 2.2% in IoU over the base model, confirming the critical role of this structure. Without multi-scale features, that is, when the same scale is used for every layer of MSRNN, the performance degrades, as also shown in Table 4. Furthermore, encoders with integrated CA modules improve the robustness of our model. Moreover, we find that the "LSTM" recurrent block outperforms the "GRU" block [55].
The model trained on images in the dataset outperforms that trained on event data. The primary reason behind the improved performance is that frames contain more comprehensive texture information of objects. To address this issue, we propose a novel dataset for event-based motion segmentation of various moving objects under challenging conditions.

E. EXPERIMENTS ON ECMotion DATASET
We introduce a novel dataset with ground-truth annotations for motion segmentation, named Event Challenging Motion (ECMotion), to validate the effectiveness of our event-based framework. Specifically, we collect data in real-world settings with challenging indoor scenarios using a SEEM1 event camera with a resolution of 262 × 320.
These scenarios consisted of low light, high dynamic range (HDR), and proper light conditions, including moving objects different from EV-IMO, to broadly validate our method's robustness. Our dataset includes 350 frames with 150 annotated with ground-truth, and the distribution of different scene categories is detailed in Fig. 7.
Distinct from the EV-IMO dataset, the ECMotion dataset contains a limited number of samples, mainly captured under challenging scenarios. As our model has a considerable number of parameters, the dataset is too small to train or fine-tune it reliably. If we pre-train a model on the EV-IMO dataset and fine-tune it on a split of ECMotion, any model, including the image-based one, may appear to work or perform better on ECMotion simply because of overfitting. Thus, we prefer to use this dataset only for testing. All the models are trained on the EV-IMO dataset and evaluated on our ECMotion dataset.
The evaluation results of different methods on ECMotion are given in Table 5. Our event-based MSRNN model outperforms the others, with an IoU improvement of 1.5% over SwiftNet. Notably, the image-based MSRNN achieves an IoU of 0 and fails in almost every prediction, indicating that this image-based method is difficult to adapt to challenging scenes. The performance of MSRNN− drops by 7.1% in IoU compared to the original model, indicating the effectiveness of the RNN architecture in utilizing long-range information.
The visualized results obtained from various methods are shown in Fig. 6. Compared with the image-based approach, event-based methods show greater robustness, which suggests that event cameras are more effective for motion segmentation, especially under challenging conditions. Moreover, our MSRNN shows better segmentation performance than MSRNN−, indicating that the RNN component is crucial for utilizing long-range temporal information when predicting the current frame, resulting in higher accuracy. Our model also achieves better segmentation results than SODModel, SwiftNet, Swin-UNet+, and Swin-UNet, as evidenced by the IoU scores. The P2T model performs well on the ECMotion dataset by the IoU metric. However, it learns to segment the wrong objects in most of the scenes, as illustrated in Fig. 6, while MSRNN is more robust owing to its multi-scale recurrent architecture, which sustains temporal consistency.
Furthermore, we split ECMotion 3:2 into training and validation sets and fine-tune our MSRNN model on it. Table 6 compares the image-based and event-based frameworks. The results indicate that the event-based MSRNN still performs better than the image-based MSRNN after fine-tuning, demonstrating that events perform better than images in extreme conditions. The image-based method is limited by the poor quality of the images, because objects become blurry in extreme scenes.

V. CONCLUSION
In summary, we investigate the potential of events for motion segmentation through comprehensive experiments. Specifically, we have introduced MSRNN, a novel motion segmentation network with a multi-scale recurrent architecture that effectively fuses features of adjacent frames, achieving considerable improvement. In addition, we introduce a real-scene dataset, ECMotion, which contains several instances of challenging conditions. Our method notably outperforms other existing methods for motion segmentation on both the EV-IMO and our ECMotion datasets. We believe that our work will inspire further research into the intrinsic properties of events, and we intend to investigate the applicability of events to other visual tasks in dynamic scenes.