Detection of Abandoned and Stolen Objects Based on Dual Background Model and Mask R-CNN

Dual background models have been widely used for detecting stationary objects in video surveillance systems. However, both abandoned and stolen objects are equally detected as stationary objects, making it difficult to distinguish them. Another problem is the ghost region created by shadow shift or light changes, which complicates the discrimination issue further. In this paper, we present an efficient method to distinguish abandoned objects, stolen objects, and ghost regions in surveillance video. The method combines two main strategies: the first is a dual background model for extracting candidate stationary objects, and the second is object segmentation based on mask regions with CNN features (Mask R-CNN) for providing object mask information. The basic idea is as follows: given a candidate stationary object from the background model, we check whether a corresponding segmented object exists in the current video frame or the previous background frame, thereby taking both the current and past situations into account. The final state of the candidate stationary object is then determined by considering various situations through the comparative analysis technique presented in this paper. The proposed algorithm was qualitatively evaluated on our own dataset focusing on the discrimination issue and produced satisfactory results. Therefore, it is expected to be widely applied to the automatic detection of stolen as well as abandoned objects in open environments such as exhibition halls and public parks, where existing intrusion detection-based security services are difficult to deploy.


I. INTRODUCTION
In recent years, research on intelligent video surveillance systems that analyze video automatically, without continuous observation by humans, has been actively conducted to provide methods for detecting and reporting specific events such as intrusion, loitering, abandonment, crime, and fire. Intelligent surveillance systems can reduce human error by lowering the dependency on human operators. They can also improve response time by generating alarms as soon as events occur [1], [2]. Although systems based only on traditional computer vision technology have suffered limitations in accuracy or performance, recent advances in artificial intelligence have opened practical ways for improvement in various applications. However, neither technique is perfect on its own; the question is how to use them properly for specific applications.

The associate editor coordinating the review of this manuscript and approving it for publication was Gangyi Jiang.
Therefore, we propose a novel approach for detecting abandoned and stolen objects based on conventional background subtraction and an artificial intelligence technology, namely Mask R-CNN. The background subtraction technique creates dynamic backgrounds and foregrounds in real time [3]. The primary way to detect abandoned and stolen objects is to analyze the foreground and select stationary objects. In this paper, an abandoned object is an object newly placed on the background, and a stolen object represents an uncovered background from which an existing object has been removed. The problem is that both are technically similar in that they equally appear as stationary objects in the foreground. Besides, the ghost region issue caused by, for example, shadow shift or light changes is another problem of concern. Ghost regions, or ghosts, are meaningless areas that also appear as stationary objects in the foreground [4] even though nothing has been abandoned or stolen. Therefore, it is difficult to distinguish abandoned objects, stolen objects, and ghost regions by simply analyzing the foreground. For this reason, we solve this problem using artificial intelligence technology.
The methodology we propose is as follows. First, we use a dual background model to detect stationary objects [5]-[11]. The dual background model consists of two sub-models with different learning rates. The short-term model, which has a high learning rate, quickly absorbs moving pixels into the background when they stop. On the other hand, the long-term model updates the background slowly, so stationary pixels can exist for a long time in its foreground. Therefore, the pixels existing as foreground in the long-term model but not in the short-term model are extracted as stationary pixels. Adjacent stationary pixels are grouped into a blob and considered as an object. Blobs below a certain size are regarded as noise and removed. Simply determining whether an object is stationary based on a single frame could be premature when we use the dual background model: the object may stop temporarily and move again immediately. Therefore, temporal transition information is necessary to identify the stationary foreground based on the sequence pattern of each blob [8]. We track the object to check whether it remains stationary for more than a certain amount of time at the same spot. If so, it is considered 'stable' and designated as a 'candidate stationary object'.
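To make the mechanics concrete, the following is a minimal sketch of a dual background model built from two running-average sub-models. This is illustrative only: the running-average update, the learning rates, and the threshold are our own simplifying assumptions, not the KNN-based subtraction actually used in the experiments.

```python
import numpy as np

class DualBackgroundModel:
    """Dual background model sketch: two running-average sub-models with
    different learning rates. The short-term model absorbs stopped pixels
    quickly; the long-term model keeps them in its foreground, so their
    difference (DF) isolates stationary pixels."""

    def __init__(self, shape, alpha_short=0.5, alpha_long=0.01, thresh=0.5):
        self.bg_short = np.zeros(shape)   # SB: short-term background
        self.bg_long = np.zeros(shape)    # LB: long-term background
        self.alpha_short = alpha_short
        self.alpha_long = alpha_long
        self.thresh = thresh

    def update(self, frame):
        # Foreground masks: pixels deviating from each background model.
        sf = np.abs(frame - self.bg_short) > self.thresh  # SF
        lf = np.abs(frame - self.bg_long) > self.thresh   # LF
        # Update each sub-model with its own learning rate.
        self.bg_short = (1 - self.alpha_short) * self.bg_short + self.alpha_short * frame
        self.bg_long = (1 - self.alpha_long) * self.bg_long + self.alpha_long * frame
        # DF: stationary pixels appear in the long-term foreground only.
        return lf & ~sf
```

After an object stops, the short-term model absorbs it within a few frames (so it leaves SF) while it persists in LF, and the surviving DF pixels mark it as stationary.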
The candidate stationary object can be finally determined as an abandoned object, a stolen object, or a ghost region when certain conditions are satisfied. In this paper, Mask R-CNN [12], which provides object mask data, is used to verify whether there is a segmented object in the area of the candidate stationary object. If a segmented object is detected in the current frame, its pixels and the pixels corresponding to the candidate stationary object are compared one by one to determine their equality. If the match score is greater than a certain threshold, the candidate is determined to be an abandoned object. Then, how do we handle the case where a segmented object exists in the current frame but does not match the candidate? And what if there is no segmented object in the area of the candidate at all? To answer these questions, we need to take into account the past situation as well as the present. We trace back to a previous frame (precisely, the long-term background frame) to see whether any object existed at that time. There could be a segmented object in that frame; however, it is also possible that the segmented object does not match the candidate, or that no object exists at all. Therefore, we propose a method to determine the final state of the candidate stationary object by considering these various clues.
In summary, we use the dual background model based on the traditional background subtraction technique to extract the pixel region estimated to be a candidate stationary object. We also use Mask R-CNN, one of the widely used artificial intelligence technologies in the field of object detection, to check whether the area corresponds to a trained object.
Furthermore, we present a method for determining whether the candidate stationary object is abandoned, stolen, or a ghost region, taking into account various situations that may occur in the present and the past. This paper is organized as follows. In Chapter 2, we analyze several related studies and their problems with the detection of abandoned and stolen objects. In Chapter 3, we explain the proposed framework and algorithm for detecting abandoned objects, stolen objects, and ghost regions in detail. In Chapter 4, we validate our algorithm through various experiments using our own dataset. Finally, we conclude in Chapter 5.

II. RELATED WORKS
As mentioned above, abandoned objects, stolen objects, and ghost regions are equally detected as stationary objects by foreground analysis. Most of the relevant studies have focused mainly on detecting abandoned objects; they did not consider stolen objects and ghost regions, or left them as open challenges.
Lin et al. [8], [9] suggested ways to increase the accuracy of abandoned object detection using the dual background model. They also use a pixel-based finite state machine (PFSM) to track the state of the object and back-tracing verification to find its owner. However, their research did not solve the problems of tracking long-term abandonment and of illumination change due to the limitations of background subtraction. Park et al. [10], [11] identify the position and area of the candidate stationary object through the dual background model. They register a template of the candidate, which is used for comparison in the presence authentication process that determines the final state. By reducing the dependency on foreground information, they solved long-term abandoned object tracking, occlusion, and illumination change. Wahyono et al. [13] and Filonenko et al. [14] detect stationary objects using the difference between a reference background and the current background. The triple background model presented by Cuevas et al. [15] solves the problem of long-term abandoned object detection and occlusion by adding a long-term model with no background absorption. However, all the studies based on multiple background models focus on the detection of abandoned objects and do not address the problem of distinguishing whether a candidate stationary object is abandoned or stolen.
Smeureanu and Ionescu [16] perform stationary object detection based on background subtraction and motion estimation. They then tried to improve the detection accuracy of abandoned objects by applying a cascade of convolutional neural networks (CNN). However, their approach suffers from overfitting because they trained on sub-areas of the experimental video background as negative classes. Shyam et al. [17] construct a dual background model using the sViBe [18] modeling method. They also use the PFSM to track the states of foreground pixels. In addition, they classify stationary objects into suspected objects using the Single Shot Multibox Detector (SSD) [19]. The problem to consider is that they unconditionally regard an object as abandoned if it exists in the candidate area. However, when an object disappears, it can create a foreground area; if there is another object behind it, the system can wrongly determine that it is an abandoned object rather than a stolen object.
There have also been attempts to distinguish between abandoned and stolen objects. Connell et al. [20] and Venetianer et al. [21] store the background just before a candidate stationary object is generated. If the edge energy of the corresponding area in the current frame is higher than the edge energy of the stored background, the candidate is classified as an abandoned object; otherwise, it is determined to be a stolen object. This approach is applicable in general situations, but does not take into account the case where the edge energy of the background is higher than that of the abandoned object. Ghost regions, caused by shadow shifts or light changes, are difficult to distinguish because it is hard to predict how much they affect the edge energy of the background. Tian et al. [22], [23] proposed a region growing method that examines color similarity by extending the area from the inside of the candidate stationary object to the outside. The problem with this technique is that detection can fail if the color of the object is similar to the background.

III. PROPOSED ALGORITHM FOR DETECTING ABANDONED OBJECT AND STOLEN OBJECT

A. DUAL BACKGROUND MODEL FOR EXTRACTING STATIONARY OBJECTS
In this paper, the proposed algorithm extracts a stationary object that can be finally determined as an abandoned or a stolen object using a dual background model, as shown in Figure 1. The dual background model is composed of two sub-models (a short-term model and a long-term model) with different learning rates. In the short-term model, pixels of an abandoned bag are quickly absorbed into the background (SB) and simultaneously disappear from the foreground (SF). On the other hand, they remain for a relatively long time in the long-term foreground (LF); accordingly, there is no bag in the long-term background (LB). In summary, the SF contains only moving objects, while the LF includes moving objects as well as stationary objects (to be precise, this holds only until the long-term model eventually absorbs the stationary objects as well). We calculate the approximate position, size, and shape of a stationary object using these characteristics. The difference foreground (DF) is the result of subtracting the SF from the LF. Active pixels adjacent to each other are grouped into a blob and considered as a stationary object in the DF frame. If its size is smaller than a certain threshold, it is considered noise and filtered out. As a result, only stationary objects remain in the DF frame. However, simply determining whether an object is stationary based on a single frame could be premature, because the object may stop temporarily and move again immediately. We must use temporal transition information to identify a stationary object based on the sequence pattern of each blob in consecutive frames [8]. So we track the object to check whether it remains stationary for more than a certain amount of time at the same spot. If so, it is considered ''stable'' and designated as a ''candidate stationary object''. The candidate stationary object can be finally determined as an abandoned object, a stolen object, or a ghost region when certain conditions are satisfied.
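The grouping-and-filtering step over the DF mask can be sketched as a plain connected-component pass (4-connectivity, pure NumPy for illustration; the function name and the `min_size` threshold are our own placeholders):

```python
import numpy as np

def extract_stationary_blobs(df, min_size=5):
    """Group adjacent active DF pixels into blobs (4-connectivity) and
    discard blobs smaller than min_size, which are treated as noise.
    Returns a list of blobs, each a list of (row, col) pixel coordinates."""
    h, w = df.shape
    visited = np.zeros((h, w), dtype=bool)
    blobs = []
    for y in range(h):
        for x in range(w):
            if df[y, x] and not visited[y, x]:
                # Flood-fill one connected component with an explicit stack.
                stack = [(y, x)]
                visited[y, x] = True
                pixels = []
                while stack:
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and df[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                if len(pixels) >= min_size:   # size filter removes noise blobs
                    blobs.append(pixels)
    return blobs
```

In practice a library routine such as OpenCV's connected-components analysis would replace this loop, but the effect is the same: only sufficiently large stationary blobs survive.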

B. ISSUE OF DISCRIMINATION BETWEEN ABANDONED AND STOLEN OBJECTS
Although the dual background model makes it easy to extract stationary objects, it is difficult to distinguish between an abandoned object and a stolen object by simply analyzing the foreground. Figure 2 shows the results of the dual background model after someone has stolen a suitcase that existed as a part of the background. While the suitcase has already disappeared from the SB, it is still present in the LB because it has not yet been absorbed into the background. Using this feature, the back-tracing verification method presented in this paper searches the previous LB frame to look into the past. Details are described in Section D. Analyzing only the DF would lead to the conclusion that a stationary object exists now, but this is wrong. Since this issue increases the false detection rate of the surveillance system, there must be a way to distinguish the two cases.

FIGURE 2. When a suitcase that has existed as part of the background is stolen, a stationary object is generated in the DF frame even though it does not actually exist. Therefore, there should be a way to distinguish between abandoned and stolen objects.
To solve the problem, Venetianer et al. [21] presumed that the edge energy of the current frame is higher for abandoned objects and lower for stolen objects. As shown in Figure 3-(1), (2), and (3), the edge energy of an abandoned object is higher than that of the background, as they expected. But what if the edge energy of the background is higher than that of an abandoned object, as shown in Figure 3-(4), (5), and (6)? The object will be detected as stolen, which it is not. Therefore, this method is error-prone.
Meanwhile, Tian et al. [22] noted that there is a difference between the color of an abandoned object and its background. They applied the region growing method from the inside of the foreground area to the outside. If the colors of an edge pixel and its adjacent pixel are comparable to each other, the pixel is included in the same region. If the final region is similar to the stationary object of the DF, they regard the object as abandoned; otherwise, it is considered stolen. However, the color of an object and its background can be similar, as we can see in Figure 3 (where the Canny edge technique [24] was used to calculate the edges). In this case, the division can be difficult. Even the authors mentioned in their paper that the region growing method failed when the colors of objects and backgrounds were similar.

Figure 4 shows the situation where a ghost region is created. As the shadow of a building moves to the right, a parking line is revealed and falsely determined as a stationary object in the DF frame. Ghost regions can be generated by, for example, shadow movement and light changes. Since these also increase the false detection rate of the surveillance system, they must be clearly distinguished from both abandoned and stolen objects. As shown in Figure 4-(4), our algorithm identifies the object as a ghost region (the red bounding box). The next section describes the method to solve the discrimination issue, including ghost regions.

D. THE PROPOSED ALGORITHM
The purpose of this paper is to accurately detect abandoned and stolen objects. Given that there is a stationary object in the DF, we must be able to distinguish whether it is an abandoned object, a stolen object, or a ghost region. Figure 5 shows the framework of the proposed algorithm. The dual background model extracts the binary mask of a candidate stationary object (CSO_DF) after a specific filtering process. At the same time, object segmentation using Mask R-CNN is performed on the current frame to obtain object masks. The comparison module receives the masks generated by the upper modules. As a result, a segmented object (SO_VF) corresponding to the CSO_DF is determined. Given no SO_VF matching the CSO_DF, another object segmentation process is performed on one of the previous LB frames, the one in which the CSO_DF first appeared in the DF frame. In this paper, this is called back-tracing verification, and the segmented object obtained from the LB frame is named SO_LB. After that, we compare the states of the SO_VF and SO_LB. Depending on the combination of comparison results, the final state of the CSO_DF is determined as one of three categories: abandoned, stolen, or ghost.
To see how equal the two given masks are, we compare the pixels of each region one by one. The method for calculating the match score is as follows. The top, left, width, and height constituting the bounding box of a CSO_DF are defined as 't', 'l', 'w', and 'h', respectively. The pixel configuration in the bounding box of the CSO_DF is expressed by a w × h matrix A, whose elements are

a_ij = 1 if pixel (i, j) belongs to the CSO_DF, and a_ij = 0 otherwise.

We define the total number of elements a_ij with the value of 1 as 'n', which is used to calculate the final match score. A w × h matrix B represents the segmented object within the same bounding box:

b_ij = 1 if pixel (i, j) belongs to the segmented object, and b_ij = 0 otherwise.

We compare all the corresponding elements of the two matrices to compute matched pixels. To express this result, we define a w × h matrix R whose element r_ij has the following values:

r_ij = 1 if a_ij = b_ij = 1, and r_ij = 0 otherwise.

If r_ij has a value of 1, it represents a matched pixel. The total number of matched pixels is defined as 'm'. Finally, the match score is defined as the ratio of the number of matched pixels (m) to the number of pixels of the CSO_DF (n):

match score = m / n.

The match result depends on whether the match score is over a certain threshold. This method allows us to compare the CSO_DF and SO_VF. The comparison of the SO_VF and SO_LB is performed in the same way.
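In NumPy, this whole computation collapses to an intersection count over boolean masks; a small sketch (the function and argument names are our own):

```python
import numpy as np

def match_score(cso_mask, so_mask):
    """Match score between a candidate stationary object mask (matrix A)
    and a segmented object mask (matrix B), both boolean arrays cropped
    to the same w x h bounding box. Returns m / n, i.e. the fraction of
    CSO pixels also covered by the segmented object."""
    n = int(cso_mask.sum())                           # n: pixels of the CSO_DF
    if n == 0:
        return 0.0                                    # empty candidate: no match
    m = int(np.logical_and(cso_mask, so_mask).sum())  # m: matched pixels (matrix R)
    return m / n
```

The decision is then a threshold test, e.g. `match_score(a, b) >= 0.9` with the 90% threshold used in the experiments.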
If two masks match each other, it means that they are identical. Otherwise, there are two cases. First, their sizes are very different. This is particularly relevant to the CSO_DF, because background subtraction cannot represent only the object itself as the foreground: an additional region such as an object's shadow may appear in the foreground, which may create a CSO_DF larger than the object. Second, there is no object at all. Since there can be various states in the past and the present, the match results must be combined to determine the final state of the CSO_DF. Figure 6 is the flowchart of the proposed algorithm. First, we check whether there are stationary blobs in the DF frame. Among them, blobs smaller than a certain size are considered noise and discarded. The remaining ones may be only temporarily stationary. Therefore, we need to make sure that the blobs appear repeatedly at the same location in consecutive frames. In this paper, this process is called 'checking the stability of a blob'. If the number of appearances of a stationary blob satisfies a condition, it is regarded as stable.
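The 'checking the stability of a blob' step can be sketched as a nearest-centroid tracker that counts consecutive appearances; the function name and the distance/frame thresholds below are illustrative placeholders, not values from the paper:

```python
import math

def update_tracks(tracks, centroids, next_id, min_frames=3, max_dist=5.0):
    """One frame of blob-stability tracking.
    tracks: dict of track_id -> (centroid, consecutive_hit_count).
    centroids: blob centroids detected in the current DF frame.
    A track not matched this frame is dropped (the blob moved or vanished).
    Returns (updated_tracks, next_id, ids_of_stable_tracks)."""
    updated = {}
    for c in centroids:
        best, best_d = None, max_dist
        for tid, (tc, _count) in tracks.items():
            d = math.hypot(c[0] - tc[0], c[1] - tc[1])
            if d <= best_d and tid not in updated:
                best, best_d = tid, d
        if best is not None:
            updated[best] = (c, tracks[best][1] + 1)  # same spot again: count up
        else:
            updated[next_id] = (c, 1)                 # new blob: start a track
            next_id += 1
    stable = [tid for tid, (_, n) in updated.items() if n >= min_frames]
    return updated, next_id, stable
```

A blob whose track reaches `min_frames` consecutive hits is promoted to a candidate stationary object (CSO_DF).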
The basic condition to be satisfied for detecting an abandoned object is as follows. First, there is a stationary object. Second, it did not exist before. When a stable blob is selected, it is designated as a CSO_DF. To check whether an object exists in the CSO_DF area, object segmentation is performed on the current frame using Mask R-CNN. If there is an SO_VF in the CSO_DF area, a match score is calculated to verify their equality. If they match each other, we consider the candidate an abandoned object. Figure 7-(1) corresponds to this case.
Given that an SO_VF exists but does not match the CSO_DF, we need to examine the previous situation through the back-tracing verification, which has two steps. First, we look for the previous LB frame corresponding to the moment the CSO_DF was first created. The reason for selecting this specific LB frame is that an object existing as foreground in the DF frame has not yet been absorbed into the background. Second, we perform object segmentation on that LB frame to get the mask of an object. If there is no SO_LB in the LB frame, the candidate is determined to be an abandoned object, because there is now a new stationary object in a place where there was nothing. Then why does the SO_VF not match the CSO_DF in the first place? This is because a shadow has been created together with the object, as shown in Figure 7-(2); i.e., there is an SO_VF smaller than the CSO_DF in the absence of an SO_LB.
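Back-tracing requires that recent LB frames be retained so the frame from the moment the CSO_DF first appeared can be retrieved for a second segmentation pass. A minimal ring buffer indexed by frame number might look like this (the class, method names, and buffer length are our own assumptions):

```python
from collections import deque

class BackTracer:
    """Keeps the most recent long-term background (LB) frames so that the
    LB frame at (or just before) the moment a CSO_DF first appeared can be
    retrieved and passed to Mask R-CNN for back-tracing verification."""

    def __init__(self, maxlen=500):
        self.buf = deque(maxlen=maxlen)   # oldest frames are evicted first

    def push(self, frame_idx, lb_frame):
        self.buf.append((frame_idx, lb_frame))

    def frame_at(self, first_seen_idx):
        """Return the newest stored LB frame not later than first_seen_idx,
        or None if that moment has already been evicted."""
        best = None
        for idx, frame in self.buf:       # deque is ordered by frame_idx
            if idx <= first_seen_idx:
                best = frame
        return best
```

The buffer length bounds memory use; choosing it at least as large as the long-term model's effective history ensures the relevant LB frame is still available when a blob becomes stable.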
The problem to consider occurs when an SO_LB is present. In this case, it should be compared to the SO_VF. If the SO_LB is equal to the SO_VF, as shown in Figure 7-(4), it is assumed that the object has existed from the past and only a large ghost region, e.g., a shadow, covering its area has been created as a CSO_DF. Therefore, it is determined to be a ghost region. What if the SO_LB is smaller than the SO_VF, as in Figure 7-(3)? This indicates that the current object SO_VF obscures another small object SO_LB and, at the same time, creates an additional shadow so that the combined region equals the size of the CSO_DF. Therefore, the CSO_DF is determined to be an abandoned object. If the SO_LB is greater than the SO_VF, as shown in Figure 7-(5) and (6) (regardless of the match score of the CSO_DF and SO_LB), it means that an uncovered background appears as foreground after the object has disappeared, and at the same time another hidden object has been revealed in the uncovered background. Therefore, it is determined to be a stolen object.
In the absence of both the SO_VF and SO_LB, as shown in Figure 7-(7), the object has never existed from the beginning. Then why is the CSO_DF created? The reason is that a ghost region has been created. If an SO_LB is present while there is no SO_VF, the object is considered stolen, regardless of its size, as shown in Figure 7-(8) and (9).
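The decision logic above can be condensed into a single function. This is our reading of the nine cases of Figure 7, a sketch rather than the authors' implementation: inputs are boolean masks cropped to the CSO bounding box, with None meaning no segmented object was found, and the 90% threshold follows the experimental setup.

```python
import numpy as np

def classify(cso, so_vf, so_lb, thresh=0.9):
    """Final-state decision for a candidate stationary object (CSO_DF).
    cso, so_vf, so_lb: boolean masks in the CSO bounding box; None means
    no segmented object was found in that frame.
    Returns 'abandoned', 'stolen', or 'ghost'."""
    def score(ref, other):
        # Fraction of ref pixels covered by other (the m/n match score).
        n = ref.sum()
        return np.logical_and(ref, other).sum() / n if n else 0.0

    if so_vf is not None:
        if score(cso, so_vf) >= thresh:
            return "abandoned"      # Fig. 7-(1): SO_VF matches the CSO
        if so_lb is None:
            return "abandoned"      # Fig. 7-(2): CSO enlarged by a shadow
        if so_lb.sum() > so_vf.sum():
            return "stolen"         # Fig. 7-(5),(6): a larger object vanished
        if score(so_vf, so_lb) >= thresh:
            return "ghost"          # Fig. 7-(4): same object, ghost on top
        return "abandoned"          # Fig. 7-(3): SO_VF covers a smaller SO_LB
    if so_lb is None:
        return "ghost"              # Fig. 7-(7): nothing now, nothing before
    return "stolen"                 # Fig. 7-(8),(9): object existed, now gone
```

The ordering of the branches encodes the comparative analysis: present evidence (SO_VF) is checked first, and the past (SO_LB) is consulted only when the present is ambiguous.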
In summary, in this section we first covered how to select a candidate stationary object from several foreground blobs generated by a dual background model. To determine the status of a given candidate stationary object, we used Mask R-CNN, one of the well-known deep learning-based object segmentation techniques. The object segmentation process is performed on the current frame and the previous background frame to consider not only the present but also the past. The candidate stationary object is compared with the corresponding segmented objects in each frame, and depending on the result, its final state is determined.
The processes after the object state has been determined are as follows. If the object is abandoned, additional processes such as searching for the owner and template registration can be performed for further tracking, and the final state is confirmed through the presence authentication process at the end of the timer. If the object is stolen, a warning alarm can be triggered immediately. If it is a ghost region, nothing happens.

IV. EXPERIMENTS

A. EXPERIMENTAL ENVIRONMENT
We experimented with the PETS2006 [25], ABODA [26], and our own video datasets to demonstrate that the algorithm proposed in this paper can accurately identify and detect abandoned objects, stolen objects, and ghost regions. Since the PETS2006 and ABODA datasets are specialized for detecting abandoned objects, this chapter discusses experiments with our own dataset. The system's core equipment used in the experiments is an Intel Xeon W-2123 CPU at 3.60 GHz, 16.0 GB RAM, and an NVIDIA TITAN RTX GPU. The ratio of the learning rates of the short-term and long-term models is set to 50:500. For each model, the KNN background subtraction technique [3] is used. The threshold value for the match score is set to 90%. For object segmentation, we used the PyTorch-based Mask R-CNN [27] distributed by Facebook Research on GitHub.

Figure 8 shows a basic scenario for detecting an abandoned object. At first, the foreground of the stationary object has not yet been created in the DF frame. Then the person abandons the suitcase and disappears. As a result, the foreground blob of the object begins to be created in the DF frame, and the complete blob appears as shown in Figure 8-(3). Then the stability of the blob is verified and it is designated as a CSO_DF. There is an SO_VF matching the CSO_DF; therefore, the CSO_DF is finally considered an abandoned object (colored in blue). This is the basic case for detecting an abandoned object, because the match score of the SO_VF and CSO_DF exceeds the threshold. Figure 9 handles a basic scenario for detecting a stolen object, which corresponds to the case of Figure 7-(8). It shows a situation where someone takes an object that has been a part of the background. A suitcase exists as a part of the background in Figure 9-(1). Someone approaches the suitcase, takes it, and disappears. In Figure 9-(4), the foreground blob of the stationary object is created in the DF frame. Figure 9-(5) shows that the stability of the blob is verified and it is designated as a CSO_DF.
Because there is no SO_VF, the surveillance system back-traces to the LB frame corresponding to the moment the complete stationary blob was created. An SO_LB is detected in the LB of Figure 9-(4) (colored in red). Therefore, the CSO_DF is considered a stolen object and is marked with a green bounding box. Figure 10 shows a similar situation where a CSO_DF is larger than an SO_LB because of the shadow of a missing object. A suitcase exists as a part of the background in Figure 10-(1). Figure 10-(2) shows a person taking the suitcase. At the time of Figure 10-(3), the foreground blob of the stationary object is generated in the DF frame. Note that the shadow of the suitcase is created as foreground. Even though the CSO_DF does not match the SO_LB, it is eventually determined to be a stolen object in Figure 10-(4) because there is no SO_VF, as shown in Figure 7-(8) and (9). Figure 11 is similar to Figure 10, but shows that the surveillance system can detect stolen objects and ghost regions simultaneously using the proposed algorithm. In Figure 11-(1), a suitcase exists as a part of the background and someone is approaching it. The person takes the suitcase and leaves the area. As we can see, some ghost regions are created because of the light coming through the branches in Figure 11-(2). Figure 11-(3) shows that the foreground blob of the stationary object is generated in the DF frame. Note that the shadow of the suitcase is also created as foreground. The object's shadow is divided by the exterior pattern of the building, so a part of the shadow is detected as a ghost region. Although the CSO_DF does not match the SO_LB, it is simply determined to be a stolen object in Figure 11-(4) because there is no SO_VF. The proposed algorithm also determines that a CSO_DF is a stolen object when it does not match the SO_VF and the SO_VF is smaller than the SO_LB, as shown in Figure 7-(5) and (6). We show that our algorithm can handle this situation in Figure 12.
Because the CSO_DF is larger than the SO_VF in Figure 12-(5), they do not match each other. According to the algorithm, the SO_LB is detected in Figure 12-(4) through the back-tracing verification, which shows that the SO_LB is larger than the SO_VF. In other words, the SO_VF was placed behind the SO_LB. Therefore, the CSO_DF is determined to be a stolen object.

FIGURE 13. Experiment to demonstrate the ability to detect abandoned objects, stolen objects, and ghost regions simultaneously.

Figure 13 shows an experiment to verify that the proposed algorithm distinguishes multiple CSO_DFs correctly. As the shadow of a building moves to the right, a CSO_DF is generated. It is determined to be a ghost region in Figure 13-(1) because there is no SO_VF and no SO_LB. After someone leaves a suitcase, another CSO_DF is created in Figure 13-(4). It is determined to be an abandoned object in Figure 13-(5) because there is a matching SO_VF. After the parked car disappears, a CSO_DF is created in Figure 13-(7). Since there is no SO_VF, the algorithm performs the back-tracing verification and finds an SO_LB in Figure 13-(7). So the third CSO_DF is determined to be a stolen object in Figure 13-(8). The newly parked vehicle is determined to be an abandoned object in Figure 13-(12). Of course, parking a car is different from abandoning an object, but we wanted to show that the proposed algorithm can also be applied to the detection of illegally parked vehicles.

V. CONCLUSION
Existing intelligent video surveillance systems that rely on the foreground analysis generated by background subtraction have the problem that abandoned objects, stolen objects, and ghost regions all look alike. It is important to accurately classify these three cases because misclassification increases the false detection rate of the system. In this paper, we presented a novel algorithm based on traditional image processing techniques and artificial intelligence technology to precisely distinguish abandoned objects, stolen objects, and ghost regions. The proposed algorithm first uses the dual background model to specify the position and area of a candidate stationary object (CSO_DF). Then, segmented objects are extracted by performing object segmentation (Mask R-CNN) on the current video frame (SO_VF) and on the previous LB frame (SO_LB) selected via back-tracing verification, which returns the pixel areas constituting several types of objects. Depending on the result of the comparative analysis of the CSO_DF, SO_VF, and SO_LB, the candidate stationary object is accurately classified as an abandoned object, a stolen object, or a ghost region. We showed that the proposed algorithm accurately detects the state of an object in various situations through experiments on our own dataset. Therefore, it is expected to be widely used for real-time detection of not only abandoned and stolen objects in open environments such as exhibition halls and parks, but also illegally parked vehicles. Due to the native characteristics of the deep learning model, there is a limitation that only trained objects can be detected. Also, if the video has a low resolution, it can be difficult to identify objects. However, since the release of Mask R-CNN in 2017, more training datasets have been released, and new technologies such as Detectron2 [28] and YOLACT [29] have been continuously developed.
Therefore, we expect that the methodology presented in this paper will become more robust over time.