Efficient Two-Stream Network for Online Video Action Segmentation

Temporal action segmentation is the task of predicting frame-level classes for long untrimmed videos. It can be widely used in various applications including customer analysis, data collection, video indexing, and surveillance. Learning good representations from videos, such as motion and spatial representations, is a critical factor for model performance. Furthermore, for practical use in a resource-constrained environment, the model should make reliable decisions with low latency. However, many works in action segmentation consider only accuracy and rely on representations pre-calculated by heavy 3-dimensional (3D) networks. In this paper, we propose a two-stream action segmentation pipeline that can learn motion and spatial information efficiently and operate online. While the temporal stream combines frame-grouping and the temporal shift module (TSM) to capture short-term dynamics and long-term temporal information at the same time, the spatial stream captures information on color and appearance complementary to the representations from the temporal stream. In addition, the results of both streams are combined by a cross-attention module to provide the desired classification result for the task. Since the pipeline operates without heavy 3D convolutional neural networks (CNNs), it takes much less memory and computation than conventional 3D-CNN-based methods. Using a non-overlapping sliding window, our proposed network achieved segmentation performance on two action segmentation datasets comparable to many recent works that require the full temporal resolution and pre-calculated features from 3D CNNs.


I. INTRODUCTION
Temporal action segmentation is a challenging task of identifying action segments in long untrimmed videos by classifying every frame. It is a demanding technique in a variety of fields such as crime detection, video indexing, customer analysis, etc. Although many works have achieved impressive improvements over the last few years [1], [2], [3], [4], [5], most of them develop heavy networks to improve performance and are therefore not suitable for real-world applications. For real-world applications, the model should be sufficiently lightweight and able to make reliable decisions with low latency.
2-dimensional (2D) convolutional neural networks (CNNs) have shown astonishing results on image recognition by extracting components of a single multi-channel image correlated with kernels [6], [7], [8]. However, 2D CNNs are unfit for video-based deep learning since they lack temporal encoding ability. To enable temporal modeling with 2D CNNs, many works have applied recurrent neural networks (RNNs) to the outputs of CNN layers [9], [10], [11], [12]. Nevertheless, those approaches have shown shortcomings in performance since they do not perform any convolutions across multiple frames in low-level features.
To mitigate this issue, 3-dimensional (3D) CNNs were introduced to model spatio-temporal information from RGB frames. Convolutional 3D (C3D) [13] and inflated 3D (I3D) [14] were introduced by expanding 2D filters to the time axis to learn spatio-temporal representations. By modeling both spatial and temporal representations in multiple frames with 3D filters, they have shown substantial performance improvements in video action recognition. However, since the 3D CNNs require a large amount of computation and memory budget, they cannot support real-time processing or mobile applications.
In addition to RGB frames, optical flow features were used in many works to improve the performance. The optical flow represents a field of dense motion vectors to provide motion representation complementary to the static spatial information of RGB frames. Two-stream models introduced in [15] and [16] combined RGB frames with optical flow features. The networks encoded spatial representation from RGB frames and motion representation from optical flow features with 2D CNNs in each stream, and achieved impressive improvements in many video-based deep learning studies. Two-stream I3D achieved substantial performance improvement by replacing the 2D CNNs of the previous work with 3D CNNs [14]. However, both networks relied on pre-calculated optical flow features, since the cost of computing optical flow is too high to run in real time or online.
In addition to learning the spatio-temporal representations mentioned above, temporal segmentation has been achieved by sliding window approaches [17], [18] or Markov models [19], [20], [21], while recent works have employed models with high-level modeling ability over the time axis for classifying frame-wise features. Many action segmentation methods were built upon the temporal convolutional network (TCN) [22] because of its long temporal modeling ability. Multi-stage TCN (MS-TCN) [3] stacks TCNs to perform hierarchical dilated temporal convolutions and refinement. It is capable of learning large temporal relationships by stacking multiple dilated convolutions over the time axis. It used I3D features extracted before training so that it could use the full temporal resolution without memory overflows. Many works employed this TCN-based model as a baseline and improved the performance by introducing a new loss function or adding a boundary regression branch [4], [5]. Although some works introduced novel networks, they still required the full temporal resolution and pre-calculated input features [1], [2].
There have been many studies on video-based deep learning models considering the requirements of practical use such as memory footprint, model size, and latency. The temporal shift module (TSM) [23] was introduced to make conventional 2D CNNs encode spatio-temporal information by exchanging information with neighboring frames via a shift operation. The shift operation in the residual block shifts a small portion of the channels in feature maps along the time axis to capture spatio-temporal information. With the same number of parameters and FLOPs as 2D CNNs, it achieved performance comparable to 3D CNNs. Jiang et al. modeled spatio-temporal and motion features without using any optical flow features or 3D convolutions to lighten the existing heavyweight models [24]. They replaced residual blocks in ResNet [6] architectures with spatio-temporal and motion (STM) blocks that encode spatio-temporal and motion features together. Many recent architectures have leveraged neural architecture search (NAS) [25], typically driven by RNN controllers trained with reinforcement learning, to find a model that achieves a good accuracy-latency trade-off [26], [27], [28]. For instance, mobile video networks (MoViNets) have achieved higher performance than previous works in video action recognition with less memory consumption and fewer operations [29] by using NAS and by introducing the stream buffer. The idea of the stream buffer is to cache a portion of the feature maps at multiple blocks in the model and concatenate the cached activations with the features of the next subclip. As it expands the temporal receptive field across subclips, it can support online inference on mobile devices.
Although color information is useful for many video-based tasks such as action recognition, motion information can be more important than the richness of color information for some events. For example, to distinguish between pushing and pulling something, the direction of hands or objects can be the critical factor. Accordingly, we use frame-grouping [30], a method that feeds three channel-averaged images into a 2D CNN to extract motion information. Although frame-grouping is designed to encode short-term dynamics like optical flow inputs [15], [31], [32], it does not require any computational overhead for feature extraction. However, frame-grouping is not suitable for capturing long-term dependencies since its temporal receptive field is too short. To mitigate this issue, we apply the TSM to every residual branch in the CNN to widen the temporal receptive field. By combining frame-grouping with the TSM, we can detect actions in videos very efficiently. Our work shares a similar motivation with works [14], [33], [34], [35], [36] that widen the temporal receptive field of optical flow features with 3D CNNs.
Our work arose from the question of whether whole video data are necessary for action segmentation. Most previous works on video action segmentation require the full temporal resolution of a video, which usually spans more than a minute. In this work, on the other hand, temporal action segmentation is performed online with a sliding window approach.
As a step towards online action segmentation, our contributions are three-fold: 1) We introduce a temporal module that can encode motion information efficiently by extending the previous work [30] from violence recognition to video action segmentation. The previous work used a frame-grouping module to model short-term dynamics by averaging channels and grouping three frames as the network input. In this work, we combine the TSM with frame-grouping to learn motion and temporal representations efficiently with an enlarged temporal receptive field. We apply our method to action segmentation tasks to demonstrate the effectiveness of the proposed network. 2) We introduce a two-stream network followed by cross-attention to address action classes that can be confused without color information, such as 'adding salt' and 'adding pepper'. In addition to the temporal stream focusing on action as mentioned above, we add a spatial stream that captures information on color and appearance complementary to the representations from the temporal stream, with a small additional computation, to improve the performance. The results of both streams are combined by a cross-attention module to provide the desired classification result for the task. Furthermore, we introduce a strategy to penalize the spatial and temporal streams separately. We divide the original classes into action (verb) and material (object) parts to form action classes and material classes. Then, we apply separate losses to the two streams, referred to as the action and material losses. For example, if a predicted action (or material) differs from the ground truth action (or material), the action (or material) loss penalizes only the temporal (or spatial) stream. 3) We conduct experiments to compare our model with recent works, including the most popular baseline for action segmentation, MS-TCN [3]. Although our online model performs temporal action segmentation with low latency based on a limited number of frames, we demonstrate that our network achieves results comparable to many recent works.

II. RELATED WORKS
A. EXISTING METHODS OF SPATIO-TEMPORAL MODELING
Methods based on two-stream neural networks consist of a spatial stream and a flow stream. The spatial stream models only the appearance features without considering temporal relationships. Since the conventional flow stream learns motion features between neighboring frames, it lacks the ability to learn long-range temporal information. Two-stream I3D [14] combined optical flow features, which represent short-term dynamics, with 3D convolutions, whose temporal receptive field widens with the depth of the network, and showed significant performance improvement by expanding the temporal receptive field in each stream. However, the optical flow calculation and 3D convolutions require large computational costs. Our proposed network is derived from the idea of two-stream networks, especially the two-stream I3D. We apply frame-grouping and the TSM instead of optical flow and 3D convolutions to lighten the model. Short-term motion features and long-term spatio-temporal features are both important for learning video representations. However, applying only the TSM with frame-grouping to general action classification tasks may not be suitable for classes where color information is important, since the efficiency of frame-grouping is obtained at the cost of the color information of the inputs. Therefore, we apply an additional stream that is intended to capture the colorful appearance of the inputs. By combining the two streams, we can efficiently model spatio-temporal representations without using any heavy 3D CNNs.

B. RECENT ACTION SEGMENTATION METHODS
In temporal action segmentation, which identifies every action segment in long untrimmed videos, long-term context is the key to recognizing the action class of every frame in a video clip. Therefore, many recent works employed MS-TCN, which consists of multi-stage TCNs, to perform long-range temporal modeling. Li et al. [4] improved MS-TCN by introducing a dual dilated layer composed of two dilated convolutional layers for multi-scale temporal resolution. Ishikawa et al. [5] introduced a boundary regression branch that regresses the boundary probability of every frame to refine the predictions of the segmentation branch. They also proposed a new loss function to mitigate fluctuating action predictions. Singhania et al. [1] introduced an encoder-decoder architecture to aggregate multi-resolution temporal features and a loss function to penalize predictions of action classes that are irrelevant to the set of action classes present in the video. Although TCN-based models have achieved remarkable performance improvements recently, they still depend on the full temporal resolution of video clips with the pre-extracted I3D features provided in [3]. In this work, we focus on developing an efficient pipeline for action segmentation that can be processed online. To this end, we make use of an efficient backbone for feature extraction and follow the traditional approach that slides a window and performs classification for the window at every frame.

III. PROPOSED APPROACH
In this work, we propose a network that consists of two streams that learn motion and spatial representations, respectively. Fig. 1 illustrates the overall procedure of our proposed action segmentation pipeline. In the temporal stream, we combine frame-grouping to capture short-term dynamics and TSM to consider long-term temporal relationships. This work is inspired by a two-stream network that processed RGB frames and optical flow features with 3D convolutions [14]. Our architecture, however, contains no 3D convolutions and does not require computationally expensive features such as optical flow. Since the model loses the color information of the inputs through the frame-grouping module, it has limitations in distinguishing objects for which color information is important (e.g. 'pepper' and 'salt'). To mitigate this issue, a conventional 2D CNN is added as the spatial stream to model relevant spatial representations from a still image. In this section, we explain the details of each stream and the whole pipeline for action segmentation. Then, we describe the temporal receptive field of our model and the proposed loss functions.

FIGURE 1. The proposed pipeline consists of three modules. The temporal stream consists of frame-grouping and TSM. It processes three single-channel frames as an input and enlarges the temporal receptive field with multiple shift operations. The spatial stream is a conventional 2D CNN, which processes RGB frames to provide fine spatial representations to the entire model. The cross-attention module processes the outputs of the two encoders to fuse motion and spatial information.

A. TEMPORAL STREAM
1) FRAME-GROUPING FOR SHORT-TERM DYNAMICS
In order to build an efficient video action segmentation system, we leverage 2D CNNs instead of heavy 3D CNNs. 2D convolution, however, has no ability to model temporal information since it performs cross-correlation on a single frame by applying a 2D kernel to each channel and summing the results across the channel axis. 3D convolution, on the other hand, performs cross-correlation on multiple multi-channel images with 3D-expanded kernels striding along the spatial and temporal axes to encode spatio-temporal information. It requires more parameters and FLOPs than 2D convolution since the expanded kernels stride in time as well as space. Frame-grouping was proposed in our previous work [30] to give 2D CNNs the ability to learn spatio-temporal representations in videos. We turn each three-channel frame into a single-channel frame $z_t \in \mathbb{R}^{H \times W}$ by averaging across the channel axis and group three consecutive frames to learn spatio-temporal representations in a video with conventional 2D-CNN backbones as follows:

$$z_t = \frac{1}{3} \sum_{c=1}^{3} X_t^{c}, \qquad (1)$$

where $t$ is the time index and the superscript $c$ of an input image $X_t \in \mathbb{R}^{3 \times H \times W}$ represents the channel index. After averaging the channels, we group three consecutive frames as an input to the 2D CNN to map the single-channel images $[z_1, z_2, \ldots, z_T]$ to feature maps $U = [u_1, u_2, \ldots, u_{T/3}]$, which is expressed as:

$$u_n^{c} = W^{c} * [z_{3n-2}, z_{3n-1}, z_{3n}], \qquad (2)$$

where $W^{c}$ is a kernel of the first layer of the 2D-CNN model, $n$ is the temporal index in the feature maps $U$ with $1 \le n \le T/3$, and $T$ is the total number of input frames, which should be divisible by three since we replace a single three-channel image with three channel-averaged images as an input of the 2D CNN. $c$ is the output channel index of the first CNN layer, and $*$ denotes the convolution operator. The conventional 2D convolution takes a single three-channel frame as an input, while the 2D convolution with frame-grouping takes three consecutive single-channel frames. Since our proposed module combines three temporally consecutive frames, it inherently encodes short-term dynamics. Fig. 2 shows an example comparing our method (2D convolution with frame-grouping) with the conventional 2D convolution. With frame-grouping, the number of time steps in the output is reduced to a third of that of the input. As the size of the tensors is scaled down, the memory demand of the network decreases. We leverage this module to achieve an online video action segmentation network.
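The following is a minimal PyTorch sketch of the frame-grouping step described above; the (N, T, 3, H, W) tensor layout, the helper name frame_grouping, and the single-convolution stem are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def frame_grouping(clip: torch.Tensor) -> torch.Tensor:
    """Average the RGB channels of each frame and group three consecutive
    gray frames into the channel axis (T must be divisible by 3)."""
    n, t, c, h, w = clip.shape                   # clip: (N, T, 3, H, W)
    gray = clip.mean(dim=2)                      # (N, T, H, W)
    return gray.view(n, t // 3, 3, h, w).reshape(n * (t // 3), 3, h, w)

# A grouped tensor can be fed to an unmodified 2D-CNN stem, since the first
# convolution still sees a 3-channel input (three gray frames instead of RGB).
stem = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
clip = torch.randn(2, 30, 3, 224, 224)           # a 30-frame window
features = stem(frame_grouping(clip))            # (2 * 10, 32, 112, 112)
```

Note that the batch and grouped-time axes are folded together, so every downstream 2D layer processes the T/3 time steps independently until the TSM exchanges information between them.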

2) TSM FOR LONG-TERM TEMPORAL REPRESENTATIONS
The drawback of frame-grouping is that the temporal receptive field cannot be enlarged beyond three frames, since it processes only three frames without striding kernels along the time axis. The temporal receptive field is therefore too short to classify some classes well, such as picking something up and pushing something from left to right. In this work, we additionally apply the TSM to encode long durations without any additional operations. The TSM shifts a portion of the channel dimension along the temporal dimension to exchange information with neighboring frames. It costs no additional computation but achieves performance comparable to 3D CNNs by modeling spatio-temporal representations. It is inserted in the residual branches of residual blocks to enlarge the temporal receptive field of the model by two frames for each bi-directional shift. An example of the TSM is illustrated in Fig. 3.
After frame-grouping, the TSM is inserted in every residual block to efficiently widen the temporal receptive field with the shift operation. In the TSM, a portion of the channels is shifted by −1 along the time axis while another portion is shifted by +1. Lin et al. reported that a naive shift may bring significant performance degradation since it can harm the spatial modeling ability of the 2D-CNN backbone. The authors chose 1/4 of the channels for shifting (1/8 for each direction), and we follow the same hyperparameters as reported by the authors [23].
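A short sketch of the bi-directional shift, following the published TSM formulation with a 1/8 fold in each direction; the function name and tensor layout are illustrative.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segment: int, fold_div: int = 8) -> torch.Tensor:
    """Bi-directional temporal shift applied inside a residual branch.
    x: (N * T', C, H, W), where T' = n_segment is the number of time steps
    per clip; 1/fold_div of the channels moves one step in each direction."""
    nt, c, h, w = x.shape
    x = x.view(nt // n_segment, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one group of channels backward
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another group forward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # the rest stays in place
    return out.view(nt, c, h, w)

# Example: 10 time steps per clip after frame-grouping, 64 channels.
y = temporal_shift(torch.randn(2 * 10, 64, 56, 56), n_segment=10)
```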

B. SPATIAL STREAM
The spatial stream, which is a conventional 2D CNN, can be added when color information is essential. The spatial stream takes only every third frame as its input so as not to burden the network too much. We select one frame out of three, denoted as $\ddot{X} = \{X_{3i} \mid 1 \le i \le T/3\}$, and map the selected frames $\ddot{X}$ to feature maps $U = [u_1, u_2, \ldots, u_{T/3}]$, expressed as:

$$u_n^{c} = W^{c} * X_{3n}, \qquad (3)$$

where $W^{c}$ is a kernel of the first layer of the 2D-CNN model, $n$ is the temporal index in the feature maps $U$ with $1 \le n \le T/3$, and $c$ is the output channel index of the first CNN layer. As it requires only one frame out of three, it does not cost much memory or latency overhead. With the temporal stream alone, the model shows poor performance in discriminating some confusing classes such as ''add salt'' and ''add pepper'' or ''add vinegar'' and ''add oil'', since frame-grouping discards color information to efficiently obtain motion information from the inputs. In this case, we can add the spatial stream to enable our model to encode effective spatial representations. The temporal stream efficiently recognizes activities while the spatial stream provides fine spatial representations to support the whole pipeline. Incorporating both streams, our proposed two-stream network achieves performance comparable to a heavy 3D-CNN-based method.

C. FRAME-WISE CLASSIFICATION
Denoting the whole processing of the temporal stream as $F_T$, we have

$$f_m = F_T([X_1, X_2, \ldots, X_T]), \qquad (4)$$

where $f_m \in \mathbb{R}^{C \times T/3}$, with the number of channels denoted by $C$, is the output of the temporal stream. Similarly, the spatial stream $F_S$ learns color information from still images. Since every third frame of the original video is selected as its input,

$$f_s = F_S(\ddot{X}), \qquad (5)$$

where $f_s \in \mathbb{R}^{C \times T/3}$ represents the output of the spatial stream. The outputs of the two streams, $f_m$ and $f_s$, are then processed by a simple cross-attention module to fuse motion and spatial information:

$$f_d = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{C}}\right) V, \qquad (6)$$

where the query $Q$, key $K$, and value $V$ are obtained by applying the projection matrices $W_q$, $W_k$, and $W_v$ to the stream outputs. The resulting features $f_d$ are then fed to a final classifier $F$, which is a single fully-connected layer producing class scores for each of the $T/3$ time steps:

$$d = F(f_d). \qquad (7)$$

As illustrated in Fig. 4, an action classifier $F_A$ and a material classifier $F_M$ can also be introduced, each as a single fully-connected layer, to produce the stream-wise predictions $m$ and $s$ used for the training losses described below. These two classifiers can be used to classify action and material classes since the classes of the 50Salads dataset can be divided into verbs and objects.
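A minimal PyTorch sketch of the fusion and classification step; treating the spatial output as the query and the temporal output as key/value, as well as the channel width of 1280 and 19 output classes, are illustrative assumptions (the paper only specifies the $W_q$, $W_k$, $W_v$ projections).

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Single-head cross-attention over the T/3 time steps of the two streams."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Linear(channels, channels, bias=False)
        self.w_k = nn.Linear(channels, channels, bias=False)
        self.w_v = nn.Linear(channels, channels, bias=False)

    def forward(self, f_m: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        # f_m, f_s: (N, C, T/3) -> tokens of shape (N, T/3, C)
        q = self.w_q(f_s.transpose(1, 2))
        k = self.w_k(f_m.transpose(1, 2))
        v = self.w_v(f_m.transpose(1, 2))
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2)        # fused features f_d: (N, C, T/3)

fusion, classifier = CrossAttentionFusion(1280), nn.Linear(1280, 19)
f_m, f_s = torch.randn(2, 1280, 10), torch.randn(2, 1280, 10)
d = classifier(fusion(f_m, f_s).transpose(1, 2)) # (N, T/3, 19): scores per time step
```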

D. TEMPORAL RECEPTIVE FIELD OF THE PROPOSED MODEL
We apply frame-grouping in the first convolutional layer of the network to obtain a three-frame temporal receptive field. As discussed in [30], frame-grouping is effective for modeling short-term dynamics. It can efficiently deal with actions that can be captured within a short duration. To extend the previous work to capture general actions from videos, we apply the TSM to enlarge the temporal receptive field. The temporal receptive field is enlarged by six frames for every bi-directional shift since each time step already contains the information of three frames due to frame-grouping. As the TSM is inserted in every residual branch of the residual blocks, the temporal receptive field of our proposed model is sufficiently large to represent spatio-temporal information. For example, the temporal receptive field of the proposed model can cover 60 frames for MobileNetV2 [37] containing 10 inverted residual blocks. The process of the temporal stream is illustrated in Fig. 5.

FIGURE 5. Illustration of frame-grouping and TSM operations. Each single-colored block and double arrow represent a feature map containing the information of three frames and the shift operation of the TSM in a residual block, respectively. As shift operations proceed, the temporal receptive field gets wider by exchanging information with neighboring frames.

E. TRAINING LOSSES
The temporal stream can efficiently capture motion information with a long-range temporal receptive field, while the spatial stream can capture static information with the ability to encode color. In this work, we benchmark our method on two datasets, Georgia Tech Egocentric Activities (GTEA) [38] and 50Salads [39]. The classes of the GTEA and 50Salads datasets are illustrated in Fig. 6, and the mapping from the original classes of the 50Salads dataset to action and material classes is shown in Fig. 7. As shown in Fig. 6, the GTEA dataset only contains classes of actions, while the 50Salads dataset contains classes that are confusing without any color information. For example, 'add salt' and 'add pepper' have the same action class with different material classes. If a model predicts 'add salt' for the ground truth 'add pepper', only the spatial stream should be penalized. To achieve the intended goals of action segmentation with our two-stream model, we apply two additional losses on action and material classes only for the 50Salads dataset, while the conventional loss, corresponding to the loss on action classes, is applied for the GTEA dataset, which does not contain any overlapping classes with the same action. The original cross-entropy loss and the two additional cross-entropy losses we use are illustrated in Fig. 8.
The first step to penalize the temporal and spatial streams separately is to parse the transcripts of classes into verbs and objects and map them into action and material classes, respectively. We define original classes C := {1, . . . , C}, action classes A := {1, . . . , C 1 }, and material classes M := {1, . . . , C 2 }, where C, C 1 , and C 2 are the numbers of the original, action, and material classes, respectively. The classes named ''action start'' and ''action end'' are mapped to ''Nothing'' for both action and material classes.
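To make the class mapping concrete, the snippet below shows how a few 50Salads labels decompose into (action, material) pairs; the exact label strings are illustrative, and the full mapping follows Fig. 7.

```python
# Illustrative subset of the 50Salads class mapping (full mapping per Fig. 7).
CLASS_TO_ACTION_MATERIAL = {
    "add_salt":     ("add", "salt"),
    "add_pepper":   ("add", "pepper"),
    "cut_cucumber": ("cut", "cucumber"),
    "cut_cheese":   ("cut", "cheese"),
    "action_start": ("Nothing", "Nothing"),   # start/end markers carry no verb or object
    "action_end":   ("Nothing", "Nothing"),
}
```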
As the number of time steps is reduced to a third for both the outputs of the temporal and spatial streams through (2) and (3), the ground truths, given for every frame, are also sampled at every third frame and used for training, like the input of the spatial stream where every third frame is taken. The sampled ground truths of the original, action, and material classes at time step $n$ are denoted as $y_n \in C$, $y_n^{ACT} \in A$, and $y_n^{MAT} \in M$, respectively, and the final predicted labels can be expressed as follows:

$$\hat{y}_n = \arg\max_{k \in C} d_n(k), \quad \hat{y}_n^{ACT} = \arg\max_{k_1 \in A} \hat{m}_n(k_1), \quad \hat{y}_n^{MAT} = \arg\max_{k_2 \in M} \hat{s}_n(k_2),$$

where $k$, $k_1$, and $k_2$ are indices of the original, action, and material classes, respectively. $d_n$ denotes the $C$-dimensional vector corresponding to $d$ at time step $n$, while $\hat{m}_n$ and $\hat{s}_n$ are obtained from $m_n$ and $s_n$, corresponding to $m$ and $s$ at time step $n$, respectively. An element of $\hat{m}_n$ (or $\hat{s}_n$) for an action (or material) class is given by adding the elements of $m_n$ (or $s_n$) corresponding to the original classes merged into that action (or material) class by the class mapping. Then, we apply two separate cross-entropy losses, one for each stream, named the action loss $L^{ACT}$ and the material loss $L^{MAT}$. They can be expressed as

$$L^{ACT} = -\sum_{n} \sum_{k_1 \in A} I(y_n^{ACT} = k_1) \log P_n(k_1), \qquad L^{MAT} = -\sum_{n} \sum_{k_2 \in M} I(y_n^{MAT} = k_2) \log P_n(k_2),$$

where $I$ is the indicator function of whether the ground truth class is $k_1$ (or $k_2$) at time step $n$, and $P_n$ is the probability assigned to class $k_1$ (or $k_2$) at time step $n$. By penalizing the two streams separately, we can force each stream to learn spatio-temporal and fine spatial representations, respectively. Lastly, we use a cross-entropy loss to penalize the merged output $d$ as

$$L^{CE} = -\sum_{n} \sum_{k \in C} I(y_n = k) \log P_n(k).$$

Our final loss function is the sum of the three cross-entropy losses:

$$L = L^{CE} + L^{ACT} + L^{MAT}.$$
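A minimal PyTorch sketch of the combined objective, assuming per-stream score tensors m and s over the original classes and precomputed index maps from original to coarse classes; the helper and argument names are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def merge_to_coarse(scores: torch.Tensor, mapping: torch.Tensor, n_coarse: int) -> torch.Tensor:
    """Sum the probabilities of original classes sharing the same coarse class.
    scores: (N, T', C) stream output; mapping: (C,) original -> coarse index."""
    probs = scores.softmax(dim=-1)
    coarse = probs.new_zeros(*probs.shape[:-1], n_coarse)
    return coarse.index_add(probs.dim() - 1, mapping, probs)

def total_loss(d, m, s, y, y_act, y_mat, act_map, mat_map, n_act, n_mat):
    """L = L_CE(merged output d) + L_ACT(temporal stream) + L_MAT(spatial stream)."""
    l_ce = F.cross_entropy(d.reshape(-1, d.shape[-1]), y.reshape(-1))
    p_act = merge_to_coarse(m, act_map, n_act)
    p_mat = merge_to_coarse(s, mat_map, n_mat)
    l_act = F.nll_loss(torch.log(p_act + 1e-8).reshape(-1, n_act), y_act.reshape(-1))
    l_mat = F.nll_loss(torch.log(p_mat + 1e-8).reshape(-1, n_mat), y_mat.reshape(-1))
    return l_ce + l_act + l_mat
```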

F. POST-PROCESSING FOR REFINEMENT
For temporal action segmentation, previous works conventionally required the full temporal resolution of a long untrimmed video, while ours makes a decision for a small portion of frames (30 frames) with low latency for practical use in a resource-constrained environment. Our work stemmed from the question of whether the full temporal resolution of the video is necessary for action segmentation. We found that our model can make decisions from a short section of a video and still provide plausible results for action segmentation. However, it tends to give poor results on some metrics, such as the F1 score and segmental edit distance, because of over-segmentation errors. To overcome this problem, we try two types of post-processing methods, operating online without and with latency (Method 1 and Method 2, respectively). Before the post-processing, the number of time steps in the final output $d$ is tripled by repeating each $d_n$ three times, i.e., $\bar{d}_{3n-2} = \bar{d}_{3n-1} = \bar{d}_{3n} = d_n$, to match the number of frames with the ground truth.

1) METHOD 1 (RECURSIVE AVERAGING ON THE OUTPUTS)
We smooth the results within a window by recursive averaging as follows:

$$\tilde{d}_{t_c} = \frac{\sum_{i=0}^{T_w - 1} \alpha^{i}\, \bar{d}_{t_c - i}}{\sum_{i=0}^{T_w - 1} \alpha^{i}},$$

where $\alpha$ is the forgetting factor for the recursive averaging, and $t_c$ and $T_w$ are the current frame index and the window size, respectively. We set $\alpha = 0.6$ and $T_w = 6$. Then, the final prediction is given by $\tilde{y}_{t_c} = \arg\max_{k} \tilde{d}_{t_c}(k)$.
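One plausible NumPy realization of Method 1, under the forgetting-factor-weighted window reading above; the exact weighting scheme of the original implementation may differ, and the array shapes are illustrative.

```python
import numpy as np

def smooth_recursive(scores: np.ndarray, alpha: float = 0.6, t_w: int = 6) -> np.ndarray:
    """scores: (T, C) frame-wise class scores already upsampled to the full
    frame rate. Each frame becomes a forgetting-factor-weighted average of
    the most recent t_w frames (the newest frame has weight 1)."""
    weights = alpha ** np.arange(t_w)
    out = np.empty_like(scores)
    for i in range(len(scores)):
        window = scores[max(0, i - t_w + 1):i + 1][::-1]   # most recent first
        w = weights[:len(window)]
        out[i] = (w[:, None] * window).sum(axis=0) / w.sum()
    return out

pred = smooth_recursive(np.random.rand(120, 19)).argmax(axis=1)  # smoothed frame-wise labels
```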

2) METHOD 2 (SMOOTHING WITH THE MODE VALUE)
An alternative to Method 1 is a look-ahead post-processing technique that uses the class appearing most frequently among the predictions from $t_c - T_w$ to $t_c + T_w$. If the most frequent (mode) class in the window of length $2T_w + 1$ is in the list of top-$K$ classes of the current frame (the $t_c$-th frame), we choose the mode class instead of the class corresponding to the maximum prediction value at the current frame. We can express our second post-processing method as

$$\tilde{y}_{t_c} = \begin{cases} j, & \text{if } j \in \mathcal{T}_{t_c}, \\ \arg\max_{k} \bar{d}_{t_c}(k), & \text{otherwise}, \end{cases}$$

where $j$ is the mode class among the predictions from $t_c - T_w$ to $t_c + T_w$, and $\mathcal{T}_{t_c}$ is the set of top-$K$ predicted classes at frame $t_c$. In this case, we set $T_w = 10$ and $K = 3$. Compared to Method 1, it has a stronger smoothing effect at the cost of a $T_w$-frame latency. As illustrated in Fig. 9, the two simple smoothing methods improve the quality of the predictions on the 50Salads dataset by reducing over-segmentation errors. Since the GTEA dataset contains fewer classes with less confusing actions, its results show much fewer over-segmentation errors, and we can obtain sufficiently accurate results without any post-processing for the GTEA dataset.
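A short sketch of Method 2 following the description above; array shapes and the helper name are illustrative.

```python
import numpy as np

def smooth_mode(scores: np.ndarray, t_w: int = 10, top_k: int = 3) -> np.ndarray:
    """Look-ahead smoothing (t_w frames of latency). For each frame, the mode
    class within [t - t_w, t + t_w] replaces the argmax prediction whenever it
    also appears among that frame's top-K classes."""
    t, c = scores.shape
    argmax = scores.argmax(axis=1)
    out = argmax.copy()
    for i in range(t):
        window = argmax[max(0, i - t_w):min(t, i + t_w + 1)]
        mode = np.bincount(window, minlength=c).argmax()
        if mode in np.argsort(scores[i])[-top_k:]:
            out[i] = mode
    return out
```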

IV. EXPERIMENTS
A. DATASETS
We evaluate the performance of our proposed pipeline on two public action segmentation datasets: 50Salads [39] and GTEA [38]. The 50Salads dataset contains confusing classes such as ''add salt'' and ''add pepper''; the spatial stream is therefore essential to discriminate ''salt'' from ''pepper''. The GTEA dataset contains 28 videos of four subjects preparing coffee, tea, or a sandwich. There are 11 classes of daily activities such as ''take'', ''open'', ''pour'', ''close'', ''shake'', ''scoop'', ''stir'', etc. Each video is captured by a camera mounted on the subject's head. We performed cross-validation by leaving one subject out and report the average over all experiments, using the same data splits as MS-TCN [3] for fair comparison. The 50Salads dataset contains 50 videos of 25 people preparing two different salads, annotated with 17 classes. Contrary to the GTEA dataset, classes with the same action on different objects exist, such as ''cut cucumber'' and ''cut cheese''. Similar to the evaluation on the GTEA dataset, we conducted five-fold cross-validation and report the average over all experiments, using the same data splits as MS-TCN [3].

B. IMPLEMENTATION DETAILS
Our networks were implemented in PyTorch. Input images were resized to a fixed size of 224 × 224. We trained our models using the Adam optimizer with a learning rate of 0.001 and a batch size of 16 on an NVIDIA Titan RTX throughout the experiments. MobileNetV2 with a width multiplier of 1.0 was used as the backbone network. Although many video action segmentation models run at 15 fps, sufficient time gaps between adjacent frames are required for frame-grouping to secure an adequate temporal receptive field with three consecutive frames, as reported in our previous work [30]. Therefore, the 50Salads and GTEA datasets were downsampled to 7.5 fps, and the window size was set to 30 frames so that the number of input frames is a multiple of three. Then, 30 frames (the window size) and 3 frames (the frame-grouping unit) at 7.5 fps correspond to 4 seconds and 0.4 seconds, respectively. When evaluating against the ground truth, we repeated every final output twice to match the frame rate of MS-TCN, which runs at 15 fps, for fair comparison. The GTEA dataset includes 11 classes (C = 11) while the 50Salads dataset contains 19 classes (C = 19), mapped to 7 action and 12 material classes so that C1 = 7 and C2 = 12.
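For illustration, a tiny helper that mimics the windowing described above (15 fps input downsampled to 7.5 fps, then split into non-overlapping 30-frame windows); the function name and return format are assumptions, not the authors' data pipeline.

```python
def make_windows(n_frames: int, fps_in: float = 15.0, fps_out: float = 7.5, win: int = 30):
    """Return non-overlapping lists of frame indices, each covering 4 s at 7.5 fps."""
    step = int(round(fps_in / fps_out))              # keep every 2nd frame
    kept = list(range(0, n_frames, step))
    return [kept[i:i + win] for i in range(0, len(kept) - win + 1, win)]

windows = make_windows(n_frames=900)                 # 1 min of 15 fps video -> 15 windows
```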

1) BENCHMARK MODEL
In this work, we compared our model with MS-TCN. MS-TCN consists of multiple stages, and each stage is composed of dilated convolutional layers that do not change the temporal dimension. The multiple dilated convolutions allow the temporal receptive field to cover a whole untrimmed video. A softmax is applied at the end of each stage so that each stage produces its own prediction. The purpose of the multiple stages is an incremental refinement of the predictions from the previous stages to overcome over-segmentation errors. Since its features are pre-calculated by the I3D network rather than extracted on-the-fly, it can process the full temporal resolution without memory overflow. Although MS-TCN is a powerful model for action segmentation, it cannot be used for practical applications with low latency in a resource-constrained environment, since it requires future frames and extracting features at the current frame incurs heavy computational complexity. Additionally, we include results from recent works to demonstrate that our model shows decent performance compared to TCN-based models requiring pre-calculated I3D features.
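For reference, a minimal sketch of the dilated residual layer at the core of each MS-TCN stage (dropout and the multi-stage refinement are omitted); this is a simplified rendering of the published design, not the authors' exact code.

```python
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """One dilated temporal convolution block; the temporal length is preserved."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, channels, T), full video
        return x + self.conv_1x1(torch.relu(self.conv_dilated(x)))

# Stacking layers with dilation 1, 2, 4, ... lets the receptive field cover a whole video.
stage = nn.Sequential(*[DilatedResidualLayer(64, 2 ** i) for i in range(10)])
out = stage(torch.randn(1, 64, 4000))                     # (1, 64, 4000)
```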

2) EVALUATION METRICS
In this paper, the segmental F1 score [22], segmental edit distance [20], [40], and frame-wise accuracy were selected as evaluation metrics. The segmental edit distance (also called the Levenshtein distance) is commonly used to calculate the similarity between two strings. It calculates the edit distance from one string to the other using three operations (insertion, deletion, and replacement). Here, it is used to calculate the similarity between the sequence of predicted action segments and the ground truth, and it is computed efficiently with dynamic programming. It successfully penalizes over-segmentation errors. The segmental F1 score at an overlap threshold of k%, denoted by F1@k, is the harmonic mean of the precision and recall, taking both detection metrics into account. The overlap threshold is determined based on the intersection-over-union (IoU) ratio. It successfully penalizes over-segmentation errors while ignoring minor temporal shifts. Frame-wise accuracy is the most intuitive performance metric for action segmentation tasks, but it has two limitations. First, the performance can be affected by annotator variability: different annotators can have different interpretations of the start and end times of actions. Second, long actions have a higher impact than short actions on frame-wise accuracy, so over-segmentation errors have a relatively low impact on this metric.
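A small self-contained sketch of the segmental edit score computation described above (commonly reported as a percentage); the helper names are illustrative.

```python
def segments(labels):
    """Collapse a frame-wise label sequence into its ordered segment labels."""
    return [l for i, l in enumerate(labels) if i == 0 or l != labels[i - 1]]

def edit_score(pred, gt):
    """1 - normalized Levenshtein distance between segment sequences (x100 when reported)."""
    p, g = segments(pred), segments(gt)
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(g) + 1)] for i in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[-1][-1] / max(len(p), len(g), 1)

print(edit_score(["cut"] * 5 + ["add"] * 5, ["cut"] * 6 + ["add"] * 4))  # 1.0: same segment order
```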

3) PERFORMANCE COMPARISONS
In Table 1, we show the results of our models and other works, including MS-TCN, on the 50Salads and GTEA datasets. While our results were obtained by averaging over the cross-validation folds as mentioned in Subsection IV-A, the results for the other works were obtained from the authors' reports. Although our approaches show results comparable to the other works, it is worth noting that ours can be performed online with a non-overlapping sliding window. Additionally, ours requires neither a 3D CNN nor the full temporal resolution of input videos. Considering the performance-cost trade-off, ours has clear advantages over the other, offline networks.

TABLE 1. Results evaluated on the 50Salads and GTEA datasets. M1 and M2 denote Method 1 and Method 2 for refinement, respectively. The column ''I3D Feat.'' indicates whether a model uses the pre-calculated I3D features provided in [3].

D. ABLATION STUDY
1) EFFECT OF THE SPATIAL STREAM
Although the temporal stream is essential for discriminating action classes (e.g., ''shake'' and ''pour''), the spatial stream may be required for tasks that need to identify objects (e.g., ''salt'' and ''pepper''). To evaluate the contribution of the spatial stream, we conducted an ablation study on the 50Salads and GTEA datasets. As shown in Table 2, although the temporal stream alone shows good enough performance on the GTEA dataset, performance degrades on the 50Salads dataset, which contains confusing classes, without the spatial stream that learns rich color information. This demonstrates that color information is more important for segmenting the 50Salads dataset than the GTEA dataset. Furthermore, the experimental results clearly show that the spatial stream successfully complements the temporal stream by encoding fine spatial representations.

2) EFFECT OF THE MODULES IN THE TEMPORAL STREAM
An ablation study of the two modules in the temporal stream was performed on the GTEA dataset. For this study, we excluded the spatial stream to focus on the effectiveness of the two modules in the temporal stream. Although decent performance was obtained when excluding one of the two modules, the combined version showed the best performance, as shown in Table 3. This demonstrates that the TSM and frame-grouping benefit each other, similar to existing works that use optical flow to encode motion and 3D CNNs to widen temporal relations.

3) EFFECT OF ADDITIONAL LOSS FUNCTIONS
We conducted an ablation study for the two additional loss functions L ACT + L MAT to demonstrate their effectiveness. For this study, we just experimented on the 50Salads dataset since the GTEA dataset only contains action classes. As demonstrated in Table 4, the two additional loss terms showed a performance improvement. The results show that the two streams can be effectively trained with the proposed losses by penalizing both the streams with action and material parts of the original classes separately.

4) EFFECT OF THE NUMBER OF INPUT FRAMES
We evaluated results for two different window sizes (or the number of input frames) of the proposed model on the 50Salads and GTEA datasets. In the sliding window approach, a segment of the designated input frames is processed by the whole pipeline of the proposed model to get frame-wise inferences. As demonstrated in Table 5, 30 frames provided better performance with a larger temporal receptive field although 15 frames still showed decent results.

V. CONCLUSION
In this paper, we presented an efficient action segmentation pipeline based on temporal and spatial streams followed by cross-attention to combine the results of both streams. We introduced the temporal stream, which combines frame-grouping and the TSM to capture short-term dynamics and long-term temporal information at the same time. In addition, we applied the spatial stream to capture information on color and appearance complementary to the representations from the temporal stream. Taken together, the pipeline successfully encodes spatio-temporal representations with lightweight 2D CNNs and achieves results comparable to methods using heavy 3D CNNs and dilated convolutions on action segmentation tasks. Since our method can operate online with a non-overlapping sliding window, it can be used for practical applications with low latency in resource-constrained environments. In the future, we will extend our work to recognize various gestures and improve the performance using a variety of data augmentation techniques.