Spatiotemporal Activity Semantics Understanding Based on Foreground Object Segmentation: iCounter Scenario

Foreground object segmentation that captures the spatial and temporal information of moving objects in video is the most fundamental task for activity understanding in many intelligent applications, such as smart stores. Recently, several methods have been proposed for the detection and recognition of activity based on object segmentation. However, these methods are often inaccurate because they do not maintain the temporal consistency of object segments across frames. In this work, we propose a hierarchical approach for foreground object segmentation and activity semantics understanding from sequential video that preserves spatial and temporal connectivity across frames. The proposed system consists of two main modules: (a) a concatenated deep learning network containing PSPNet and a convolutional GRU to segment the foreground objects of interest; and (b) an activity mining framework that incorporates three sub-modules: (i) a RetinaNet-based frame classifier to detect and count objects of interest; (ii) a time-domain activity and event detection algorithm; and (iii) an image-based item query engine to recognize the shopping items. To evaluate the proposed approach, we designed a smart checkout box called iCounter and used it to collect a shopping activity dataset named "NOL-41", which is used in extensive experiments. The results show that the accuracy of the foreground object segmentation is 90.6%, the accuracy of the frame classification is 93.4%, the accuracy of activity event detection is 98.4%, and the accuracy of item query is 94.3%. Finally, the overall accuracy of the shopping list is 95.2%.

Machine Learning Models) based on traditional handcrafted features depend on prior knowledge and human ingenuity to extract discriminating features [14,15]. These methods can be broken down into three major steps: (1) foreground object detection, which corresponds to activity segmentation; (2) feature extraction and selection by domain knowledge or an expert; and (3) detection and recognition for activity understanding [16,17,18]. However, these methods (e.g., [19,20]) are not sufficient for smart store spatiotemporal activity understanding based on foreground object segmentation, particularly when different objects share a pixel region over time in a sequential video. For example, when a shopper places an item in the checkout box by hand, the item shares part of its region with the hand. Foreground object segmentation applied to such objects does not work properly, which leads to misdetection of items in a real-time smart store.
These shared-pixel challenges have been explored by many classical and modern approaches [21,22,23]. Optical flow was used in previous methods to maintain the temporal associations of object segments so that pixels remain consistent and smooth over time, but the resulting temporal associations are inaccurate [24]. Long et al. [25] proposed a fully convolutional network (FCN) that yields a coarse heat map to improve single-image segmentation, but it does not apply to video segmentation. Pavel et al. [26] and Siam et al. [27] proposed networks for video segmentation that combine an FCN with recurrent neural networks. Beyond these methods, there is a need for a method that performs foreground segmentation frame by frame while keeping the spatiotemporal information consistent for large inputs in real time [28,29]. More recently, deep learning methods have attracted a lot of interest from the computer vision community because they can automatically learn representations from raw data and preserve the spatial and temporal information of objects in a sequential manner.
In this work, we propose a novel hierarchical approach for real-time foreground object segmentation and activity semantics understanding from sequential video that preserves spatial and temporal shared-pixel connectivity, such as an item in a hand, across frames [28]. The proposed approach consists of two main modules: (a) a concatenated deep learning network based on the Pyramid Scene Parsing Network (PSPNet) [29] to segment the foreground objects of interest, and a convolutional GRU [27] to preserve the spatial and temporal connectivity of foreground objects across frames; and (b) an activity mining framework that includes three sub-modules: (i) a RetinaNet-based [30] frame classifier to detect and count objects of interest; (ii) a time-domain activity and event detection algorithm; and (iii) an image-based item query engine to recognize the shopping items. We also propose a design of a self-checkout box called iCounter, with a camera mounted at a fixed angle to capture the spatiotemporal shopping activity of the shopper, as shown in figure 1. The iCounter is used to analyze the proposed hierarchical approach. We also contribute the NoL-41 video dataset of checkout shopping activity. The proposed approach, including the moving object segmentation and the activity mining framework, is evaluated by the intersection-over-union (IoU) and accuracy metrics of each module in a real-time retail application scenario. The foreground object segmentation network was evaluated by three-fold cross-validation on 4216 frames that include different glove colors, light intensity levels, and background patterns. The foreground object segmentation module achieves an accuracy of 90.6%. The RetinaNet-based frame classification module has an accuracy of 93.4%. The accuracy of the shopping event module is 98.4% without considering the commodity category, and the average error of the checkout event time is 0.018 seconds. The accuracy of the image query module is 95.2% considering the commodity category.
The primary contributions and novelty of the proposed framework are: (i) proposing an activity semantics understanding framework based on foreground object segmentation and Conv-GRU that preserves spatial and temporal connectivity across frames to analyze shopping data; (ii) proposing an activity mining framework that includes three sub-modules to detect, count, and recognize the objects of interest of an activity in the time domain, which streamlines the shopping checkout process and reduces the retailer's checkout time; and (iii) creating the NoL-41 video dataset of checkout shopping activity, which will help the computer vision community to do further research on the checkout process. These contributions address the shared-pixel challenges in the smart store, especially when the shopper adds an item into the iCounter for self-checkout. The foreground network keeps the information of each hand and item until the item is added to the shopping list. This is also beneficial for retail stores, as it helps to avoid misdetection and protects the retailers' investment from large losses in the future. The rest of the paper is organized as follows: Section II describes the related work on segmentation, object detection, and recognition. Section III presents the methodology and prototype design. Section IV describes the performance evaluation and implementation environment, and Section V concludes this research.
VOLUME 4, 2016

II. RELATED WORK
Retail organizations have a major concern about item billing in retail stores. If customers go through the checkout process and find a misdetected, missing, or extra item in their total bill, they will be driven away from the store, which worsens the store's reputation. This hurts business and can cause large losses for retail stores [31]. To avoid such incidents, different technologies have been proposed, including machine learning, deep learning, and computer vision. Moreover, the works in [32,33,34] explore salient object detection in the field of the Internet of Things, which helps to deploy smart store solutions.

A. CLASSICAL MACHINE LEARNING APPROACHES
Traditional foreground object segmentation approaches often use models such as the Gaussian Mixture Model (GMM) [35] and the Gaussian distribution [36,37]. The Gaussian distribution is used to model the background pixels in video analysis: the model learns the environment from each frame, stores the previous frames, and compares different frames. This time-lapse method improves the results of motion analysis. However, noise is often generated by changes in lighting, and the performance parameters also affect the result of the object segmentation.

B. DEEP LEARNING APPROACHES
In recent years, deep learning has achieved great success in semantic segmentation. The earliest deep learning method applied to image segmentation was patch classification, which slices images into patches that are fed to the deep model, because the fully connected layer requires a fixed-size input. The fully convolutional network (FCN) replaced the fully connected layer with convolution [38,39], so the input size can vary and inference is faster. However, semantic segmentation still suffers from a problem caused by the down-sampling operations [40].
The down-sampling problem of pooling was addressed by two different families of models. The first is the FCN-based encoder-decoder architecture, in which the encoder reduces the spatial dimensions through pooling and the decoder gradually restores them. There are usually cross-layer connections from the encoder to the decoder. Networks with the encoder-decoder architecture include U-Net [41] and SegNet [42]. The second is the dilated convolution architecture, which replaces pooling and maintains spatial resolution [43]. It also integrates semantic information well because it can expand the receptive field.
The DeepLab series [44,45,46,47] and the Pyramid Scene Parsing Network (PSPNet) [29] belong to the dilated convolution family. However, the aforementioned methods suffer from drawbacks such as noise, down-sampling, and difficulty in maintaining the time sequence pattern. These drawbacks reduce accuracy; therefore, there is a need for methods that analyze streaming data and understand the activity semantics.

C. PROPOSED METHOD BACKGROUND
PSPNet is a foreground segmentation network that combines the concepts of global average pooling and feature fusion to achieve semantic segmentation. The feature fusion is a pyramid structure, also known as the Pyramid Pooling Module [48]. The Pyramid Pooling Module fuses features at four different pyramid scales [49]. Like PSPNet, other foreground object segmentation networks can be used in different applications for analyzing videos and understanding the activities in them. Although smart stores require an understanding of shopping activities, foreground object segmentation has not been fully utilized in this domain. For example, Amazon Go eliminates the distress of customer queues using self-checkout technology that includes deep learning and computer vision modules to analyze the shopping activities during the shopping process [50]. The iStore smart store [6,7] is based on computer vision and deep learning for smart shopping, with YOLOv2 used for item recognition. Because the accuracy and performance of smart store systems can still be improved, the foreground segmentation technique (PSPNet) still needs an approach that maintains the frame-by-frame activity stream to enable the understanding of shopping activity semantics.
To preserve the spatiotemporal connectivity, a GRU analyzes the foreground object frame by frame. The Gated Recurrent Unit (GRU) is a gated recurrent architecture related to the LSTM that is commonly used to process time-series data and is suitable for the analysis of sequential frames. The GRU works on the same principle as the LSTM but with a simpler architecture. The GRU uses two gates, the reset gate $r_t$ and the update gate $z_t$, to capture the temporal relation of the signal within the cell. Equation 1 describes the mathematical model of the GRU, where h is the hidden state, t is the current time step, x is the input, σ is the activation function, and W is the weight.
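Equation 1 is not reproduced in this text. For reference, a commonly used formulation of the GRU is sketched below, written with the symbols defined above and with the update convention that matches the description of equation 4d in section III-B. The bias-free form, the subscripted weight names ($W_r$, $W_z$, $W_h$), and the concatenation $[h_{t-1}, x_t]$ are assumptions of this sketch and may differ from the authors' exact notation.

```latex
\begin{aligned}
r_t &= \sigma\left(W_r\,[h_{t-1},\, x_t]\right) \\
z_t &= \sigma\left(W_z\,[h_{t-1},\, x_t]\right) \\
\hat{h}_t &= \tanh\left(W_h\,[r_t \odot h_{t-1},\, x_t]\right) \\
h_t &= z_t \odot \hat{h}_t + (1 - z_t) \odot h_{t-1}
\end{aligned}
```

Here $\odot$ denotes element-wise multiplication. In a convolutional GRU, the matrix products are replaced by convolutions over the feature maps.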
In our previous papers, we examined the self-checkout process with the help of iCart [6,7] and iShelf [8]. iCart is a lightweight smart self-checkout solution that utilizes a smartphone mounted on a shopping cart and connected to a back-end deep learning server. The deep learning network is used to analyze the context of each frame for classification and action recognition. iShelf is an on-shelf item tracking solution that combines load cells with sensor fusion and deep learning techniques. The tracking process determines the position and the weight of an item on the shelf. We generated datasets for both iCart and iShelf and tested the proposed approaches on them, achieving state-of-the-art performance. Similarly, iCounter is part of the self-checkout series and uses the methodology described below.

III. THE PROPOSED APPROACH
This section presents the proposed hierarchical approach for real-time foreground object segmentation and activity semantics understanding. It first describes the context, a smart store, in which the proposed approach can be used and evaluated. It then describes the composition of the proposed approach, including the foreground object segmentation module and the activity mining framework.

A. ICOUNTER SMART STORE PROTOTYPE
The prototype design of the iCounter system is shown in figure 1. As illustrated in the figure, a camera is mounted at the top of the checkout box to capture the checkout activities of the shopper at the checkout counter. The intelligent engine is mainly divided into two modules, which are explained step by step in figure 2. [...] indicates one hand and one item in the frame. Figure 2(b) shows the foreground object segmentation module. The main purpose of the network is to cut out the foreground object from the image and distinguish the hands and items in the foreground block. Figures 2(c) to 2(e) illustrate the activity mining framework, which includes three sub-modules. Figure 2(c) shows the frame classification, which is mainly used to analyze the action state at each time instance and to apply smoothing that corrects misjudged frame labels. Figure 2(d) shows the action segmentation and event detection module. The detection is based on the change of the frame state and the relative position of the detected hand. For example, at t=4, an event of adding an item is detected because of the state transition of the frame. Figure 2(e) takes the add event detected at t=4 and then queries the category of the added item.

B. FOREGROUND OBJECT SEGMENTATION MODULE
The foreground object segmentation module has three parts: (i) a semantic segmentation network that segments the foreground objects in the sequential video; (ii) a convolutional gated recurrent unit that preserves the spatial and temporal connectivity of the frames in the sequential video; and (iii) a classifier that classifies the segmented objects, including hand and item.

1) Semantic Segmentation Network
The segmentation network is the first part of the foreground object segmentation module, as shown in figure 3. The network uses the Pyramid Scene Parsing Network (PSPNet) as its backbone to obtain semantic features from the input. PSPNet combines the concepts of global average pooling and feature fusion to achieve semantic segmentation. The feature fusion is a pyramid structure, also known as the Pyramid Pooling Module. The segmentation network accepts as input $I_{w,h,d}$ the original RGB image with a resolution of 320×256, as given in equation 2. In equation 2, $I_{t-1(w,h,d)}$ denotes the previous frame, $I_{t(w,h,d)}$ the current frame, and $I_{t+1(w,h,d)}$ the next frame, where w, h, d, and n denote the width, height, dimension, and total number of frames, respectively.
The network generates the feature map $x_{t(w,h,d)}$ as output, as given in equation 3. The length and width of the feature map are 1/8 of those of the original image, and the depth is 128; this is the data block marked as a yellow square in figure 3. It is necessary to detect the foreground object using the information of the previous frame, because the earlier and later scenes must be compared over time to find the changing area.
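The feature map dimensions stated above follow directly from the input resolution. The short Python sketch below works through that arithmetic; the function name and the `stride` parameter are hypothetical and only illustrate the 1/8 down-sampling factor and the depth of 128 given in the text.

```python
def feature_map_shape(width, height, stride=8, depth=128):
    """Return (width, height, depth) of the backbone feature map.

    Assumes the spatial size is reduced by an integer factor `stride`
    (1/8 in the text) and that the channel depth is fixed (128 in the text).
    """
    if width % stride or height % stride:
        raise ValueError("input size must be divisible by the stride")
    return (width // stride, height // stride, depth)


# A 320x256 RGB frame yields a 40x32 feature map with 128 channels.
print(feature_map_shape(320, 256))  # (40, 32, 128)
```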

2) Convolutional Gated Recurrent Unit
The Convolutional Gated Recurrent Unit (Conv-GRU) is the second part of the foreground module. It is used to process time-series data and is suitable for the analysis of continuous images. The Conv-GRU is embedded in the semantic segmentation network and combined with a convolutional network. The full architecture of the Conv-GRU network is shown in figure 4. The Conv-GRU is a pixel-level network. The Conv-GRU takes two inputs. The first is the feature map output $x_{t(w,h,d)}$ of the PSPNet. The length and width of the feature map $x_{t(w,h,d)}$ are 1/8 of those of the original image, and its depth is 128 (yellow square in figure 4). The second is the historical feature map, which is the output for the previous image after passing through the Conv-GRU network. The length and width of the historical feature map are 1/8 of those of the original image, and its depth is 64 (green square in figure 4). These two inputs are merged by concatenation and passed through two different 5×5 convolutions in sequence. The Conv-GRU has a gating mechanism that regulates the flow of information, such as remembering the context of the previous and current frames. The update gate (z) uses the same inputs and applies the sigmoid activation function, as defined in equation 4b. The reset gate determines the relevant content $\hat{h}_t$ from the past, which is calculated as in equation 4c. The reset gate and the feature map $h_{t-1}$ are multiplied point by point, and the result is concatenated with the semantic feature map. After a 5×5 convolution, the model applies the tanh activation and outputs the current feature map $h_t$ (blue square in figure 4), whose length and width are 1/8 of those of the original image and whose depth is 64. The current feature map $h_t$ is the product of z and the memory content $\hat{h}_t$ plus the product of 1−z and the previous feature map $h_{t-1}$; it also serves as the previous feature map for the next frame, as shown in equation 4d.
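The final blending step described for equation 4d can be illustrated with a few lines of Python. This is a minimal sketch that operates on flat lists of values rather than on real feature maps, and it omits the 5×5 convolutions and the computation of the gates; the function name `gru_blend` and the constant names are hypothetical.

```python
SEMANTIC_DEPTH = 128   # depth of the PSPNet feature map (from the text)
HISTORY_DEPTH = 64     # depth of the historical feature map (from the text)
# Depth after concatenating the two inputs along the channel axis.
CONCAT_DEPTH = SEMANTIC_DEPTH + HISTORY_DEPTH  # 192


def gru_blend(z, h_candidate, h_prev):
    """Element-wise update h_t = z * h_candidate + (1 - z) * h_prev.

    z, h_candidate and h_prev are equally long lists of floats; each z value
    is expected to lie in [0, 1] because it comes from a sigmoid.
    """
    if not (len(z) == len(h_candidate) == len(h_prev)):
        raise ValueError("all inputs must have the same length")
    return [zi * ci + (1.0 - zi) * pi
            for zi, ci, pi in zip(z, h_candidate, h_prev)]


# z = 1 keeps only the new content, z = 0 keeps only the history.
print(gru_blend([1.0, 0.0, 0.5], [2.0, 2.0, 2.0], [4.0, 4.0, 4.0]))
```

With an update gate close to 0, a pixel keeps its value from the previous frame, which is how the history can carry hand and item information across a frame in which the segmentation is unreliable.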
The classifier is the third part of the foreground module and uses the softmax activation. The classifier takes the current GRU feature map $h_t$ as input and generates a two-dimensional matrix of 320×256 as output, as shown in figure 3. The size of the matrix is equal to the original image size, and the value at each position in the matrix is the predicted category for that location. The classifier classifies the input feature map into three categories of foreground semantic objects, as defined in equation 5.
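A per-pixel softmax classifier of this kind can be sketched as follows. The example assumes that each pixel already carries three class scores and ignores how the 1/8-resolution feature map is brought back to 320×256; the class names (background, hand, item) follow the pixel categories used in the dataset description, and the function names are hypothetical.

```python
import math

CLASSES = ("background", "hand", "item")


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pixel(scores):
    """Return the class name with the highest softmax probability."""
    probs = softmax(scores)
    return CLASSES[probs.index(max(probs))]


def classify_map(score_map):
    """Classify every pixel of a 2-D grid of per-class score lists."""
    return [[classify_pixel(pixel) for pixel in row] for row in score_map]


# One row with two pixels: the first favors background, the second hand.
print(classify_map([[[2.0, 0.1, 0.1], [0.0, 3.0, 1.0]]]))
```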
The proposed design for spatiotemporal activity semantics understanding based on foreground object segmentation allows the network to preserve the key information. For example, if the foreground network misdetects a hand or an item in the frame sequence, the Conv-GRU uses the hand and item information preserved in the history unit (GRU) to compare the object information of the current and previous frames and respond appropriately. This mechanism allows the activity mining framework, including the action detection and item recognition modules, to maintain state-of-the-art performance for iCounter shopping activity.

C. ACTIVITY MINING FRAMEWORK
The activity mining framework consists of the frame classification, action and event detection, and item recognition modules. The frame classification module categorizes each frame with respect to hands and items, the action and event detection module detects events in the time domain, and the item recognition module recognizes the item identity.

1) Frame Classification
The frame classification module builds on the output of section III-B. This module divides the frames into (m, n) categories based on the foreground objects, as defined in equation 6. Each

2) Action and Event Detection
The action and event detection module builds on section III-C1. The frame classification contains the spatial information of the hand and the hand-held items, which helps to detect the checkout event in the time domain. Events fall into two categories: added event and no event. An added event means that a hand puts an item into the checkout counter, and no event means that an empty hand is put into the checkout box. The event between two consecutive no-hand time intervals is judged by the action, as shown in figure 6. Figure 7 shows, from left to right, events such as (0, 0), which means no event, and (1, 1), which means that one item was added to the checkout box; $\Delta hand_t$ and $\Delta item_t$ denote the change in the number of hands and items. Similarly, (2, 2)

Algorithm 1 Event Detection Algorithm
Require: Frame Classification Output
Ensure: Event Categories
Initialization
if ∆item_t > 0 and ∆hand_t >= ∆item_t then
    if closest entrance hand with item then
        query images according to distance between hand and entrance
    end if
end if
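Algorithm 1 can be read as a simple decision rule. The following Python function is a minimal sketch of that rule, assuming the changes ∆hand_t and ∆item_t have already been computed from the frame classification output. The function name, the boolean `hand_with_item_near_entrance` (a stand-in for the closest-entrance check), and the returned labels are hypothetical, and the image query itself is not performed here.

```python
def detect_event(delta_hand, delta_item, hand_with_item_near_entrance):
    """Classify a checkout action as an "add" event or "no_event".

    delta_hand and delta_item are the changes in the number of hands and
    items; an "add" result is the point at which an item image query would
    be issued in the full system.
    """
    if delta_item > 0 and delta_hand >= delta_item:
        if hand_with_item_near_entrance:
            return "add"
    return "no_event"


print(detect_event(1, 1, True))   # one hand brings one item
print(detect_event(0, 0, False))  # nothing changed
```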

3) Item Recognition
The item recognition [51] module recognizes the shopping item for the shopping list based on the detected event. The item recognition network uses the foreground object image as input, as shown in figure 8. First, we create an image database containing various item images with different perspectives, occlusion, and reflection. The database has 400 item images. The VGG16 deep learning model is used as the feature extractor and produces a feature vector in the form of a histogram. The red histogram represents the feature vector of the item query image, and the blue histograms represent the database feature vectors. The Euclidean distance measures the similarity between the database feature vectors and the query image feature vector to obtain the final result.
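The query step amounts to a nearest-neighbor search under the Euclidean distance. The sketch below assumes the feature vectors have already been extracted (for example by VGG16) and uses tiny made-up vectors and item names purely for illustration; `query_item` and the database layout are hypothetical.

```python
import math


def query_item(query_vec, database):
    """Return (item_name, distance) of the closest database entry.

    database is a list of (item_name, feature_vector) pairs; several
    entries may share a name, one per stored image of that item.
    """
    best_name, best_dist = None, math.inf
    for name, vec in database:
        dist = math.dist(query_vec, vec)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Toy 2-D "feature vectors"; real features would be much longer.
toy_database = [("cola", [0.9, 0.1]), ("chips", [0.0, 1.0])]
name, dist = query_item([1.0, 0.0], toy_database)
print(name)  # cola
```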

IV. PERFORMANCE EVALUATION AND IMPLEMENTATION ENVIRONMENT
In this section, the proposed approach, including the moving object segmentation and the activity mining framework, is evaluated. The moving object segmentation is evaluated by the intersection-over-union (IoU) metric, and the activity mining framework is evaluated by the accuracy metric of each module.
In addition, if there is no corresponding foreground object, it means that the image contains just an empty background. The remaining videos, V19∼V25, consisting of 3384 frames in total, were labeled with 198 checkout actions and 108 checkout events to evaluate the performance of the checkout event detection algorithm. Each video is labeled with the checkout actions, including action type, action start time, action end time, and action category. The action category depends on the hand action in the checkout box, which is either empty hand or hand-held item. Similarly, videos are labeled with checkout events, including event start time, event category, and item name. The event has two categories: add and no event. Along with the labeling, the dataset also provides a pixel distribution ratio for background, hand, and hand-held items over all frames. The background occupies 92.3% of the area, while hands and hand-held items occupy 4.9% and 2.8%, respectively.

B. EVALUATION OF FOREGROUND OBJECT SEGMENTATION NETWORK
The foreground object segmentation network was evaluated by three-fold cross-validation. A total of 4216 frames, including different glove colors, light intensity levels, and background patterns, are used for segmentation network training and evaluation. The IoU metric is used to evaluate the network performance. To check for overfitting, we take glove color, light intensity, and background pattern as performance parameters for network evaluation.
The evaluation results for each parameter are shown in table 2. Table 2(a) shows that the model is sensitive to changes in glove color, especially when the model is evaluated using images of all glove colors. Therefore, more glove color data is needed for training to improve the stability of the model. [...] intensity has a more accurate result for model evaluation. The accuracy under full light is mainly affected by shiny item packaging and colors, which cause blurred images. After the parameter evaluation, the overall performance of the segmentation network was evaluated by three-fold cross-validation. Table 3 shows that the overall accuracy of the segmentation network is 90.6%. The average evaluation result shows that the segmentation network performs best for background and hand segmentation. However, more labeled data of hand-held items is needed to increase the average accuracy and improve model stability. Finally, we compared the proposed methodology with state-of-the-art (SOTA) foreground object segmentation methods, as shown in table 4. DeepLab+GRU comes close to the accuracy of the proposed methodology but runs at a lower fps. The proposed methodology achieves the highest accuracy for foreground object segmentation, 90.6%, with a better fps rate.
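The IoU metric used in this subsection can be computed per class from a predicted and a ground-truth label map. The sketch below is a minimal illustration on flattened label lists, assuming integer class ids 0, 1, and 2 for background, hand, and hand-held item; the function name and this encoding are hypothetical.

```python
def iou_per_class(pred, truth, num_classes=3):
    """Return a list with the IoU of each class, or None if a class is absent.

    pred and truth are equally long flat lists of integer class ids,
    one entry per pixel.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    scores = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
        scores.append(inter / union if union else None)
    return scores


# Six pixels: class 0 IoU = 1/3, class 1 IoU = 2/3, class 2 IoU = 1/2.
print(iou_per_class([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]))
```

Averaging the per-class values gives a mean IoU; because the background covers most of each frame, reporting the classes separately makes weak hand-held item segmentation easier to spot.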

C. EVALUATION OF FRAME CLASSIFICATION
Frame classification produces some incorrect frame label predictions, which can be smoothed in the time domain before the shopping event detection. The smoothing effect is limited to one frame before and one frame after the reference frame. The majority vote algorithm is used to smooth the category of the current frame. It decides the category of the current frame based on the categories of the previous and next frames, but it changes only the current frame within the smoothing range. The evaluation of frame classification smoothing depends on the RetinaNet hand detection network and the item search algorithm. RetinaNet was evaluated with the mAP metric by three-fold cross-validation on 1168 hand-labeled foreground images chosen from videos V01∼V18. Table 5 shows that the average accuracy of the RetinaNet network is 97.1%. After this evaluation, the majority vote algorithm smooths the incorrect frame labels. Figure 9 shows an example of how the algorithm smooths an incorrect frame label. There are two rows of frames, from left to right, named "original" and "major vote". The "original" row shows the predicted frame labels, and the "major vote" row shows the frame labels after the incorrect ones have been smoothed. The accuracy of frame classification is 0.931 before smoothing and 0.932 after smoothing.
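The smoothing described above can be sketched as a three-frame majority vote. The implementation below is one plausible reading of the description, assuming that a frame label is replaced only when at least two of the three labels in its window agree and that the first and last frames are left unchanged; the function name and the (hands, items) tuple encoding of labels are hypothetical.

```python
from collections import Counter


def smooth_labels(labels):
    """Smooth frame labels with a majority vote over (previous, current, next).

    Votes are taken on the original labels, so a correction at one frame
    does not influence the vote for its neighbors.
    """
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        window = labels[i - 1:i + 2]
        label, count = Counter(window).most_common(1)[0]
        if count >= 2:  # a strict majority of the three frames
            smoothed[i] = label
    return smoothed


# The isolated (1, 0) label in the middle of a hand-with-item run is corrected.
frames = [(0, 0), (1, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
print(smooth_labels(frames))
```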

D. EVALUATION OF EVENT DETECTION AND ITEM RECOGNITION
The event detection module was evaluated by accuracy and time error metrics on approximately 3384 frames that contain 108 checkout events, including 60 added events and 48 no events. The checkout events are detected by algorithm 2. The event accuracy (EventAcc) is measured based on the pairing sequence between the ground truth and the predicted events in the time domain. The calculation formula of EventAcc is defined in equation 7. Table 6 shows that the event detection accuracy is 96.9% before smoothing and without recognizing the item category. After smoothing, the event detection accuracy increases slightly to 98.4%.
The item query evaluation is based on the recognition of [...]
The prototype system consists of a deep learning server and a checkout box. The main hardware architecture is shown in figure 1. The checkout box has a length and width of 60 cm and a height of 70 cm. A webcam (Logitech C930e) is mounted on the top to capture video of the checkout activities and upload it to the server. Two lamps are placed on the left and right sides of the checkout box to provide illumination. The deep learning server runs Ubuntu 18.04 LTS and uses an NVIDIA GeForce GTX 1080 Ti GPU to accelerate the execution of the deep learning algorithms. Finally, the results are displayed on the screen as web pages.

V. CONCLUSION
This paper has proposed a hierarchical approach for foreground object segmentation and activity semantics understanding from sequential video that preserves spatial and temporal connectivity across frames. The proposed system handles large inputs for real-time recognition of products and activities and responds in real time to avoid misdetection in segmentation or inconsistency in the spatial and temporal domain. We conclude that spatiotemporal activity semantics understanding based on foreground object segmentation achieves state-of-the-art accuracy in detecting and recognizing retail products in real-time applications. In addition, different modules are implemented in the proposed system to maintain state-of-the-art accuracy for foreground objects in the spatiotemporal domain. Our future work will include testing the system in different store environments.
TZU-WEI YU is a Master's student at National Yang Ming Chiao Tung University, Tai