Temporal Action Localization With Coarse-to-Fine Network

Precisely localizing the temporal interval of each action segment in long raw videos is an essential challenge in practical video content analysis (e.g., activity detection or video caption generation). Most previous works neglect the hierarchical granularity of actions and consequently fail to identify precise action boundaries (e.g., embracing, approaching, or turning a screw in mechanical maintenance). In this paper, we introduce a simple yet efficient coarse-to-fine network (CFNet) to address the challenging problem of temporal action localization by progressively refining action boundaries at multiple action granularities. The proposed CFNet is mainly composed of three components: a coarse proposal module (CPM) that generates coarse action candidates, a fusion block (FB) that enhances the feature representation by fusing the coarse candidate features with the corresponding features of the raw input frames, and a boundary transformer module (BTM) that further refines action boundaries. Specifically, CPM exploits framewise, matching and gated actionness curves that complement each other for coarse candidate generation at different levels, while FB enriches the feature representation by fusing the last feature map of CPM with the corresponding raw frame input. Finally, BTM learns long-term temporal dependencies with a transformer structure to further refine action boundaries at a finer granularity. Thus, fine-grained action intervals can be obtained incrementally. Compared with previous state-of-the-art techniques, the proposed coarse-to-fine network can asymptotically approach fine-grained action boundaries. Comprehensive experiments conducted on the publicly available THUMOS14 and ActivityNet-v1.3 datasets show the marked improvements of our method over prior methods on various video action parsing tasks.

Massive volumes of video data are produced, stored and consumed each hour. How to efficiently and effectively parse video data and decode its semantic content therefore becomes especially important [1], [2], [3]. In general, parsing human actions in videos includes action recognition and temporal action localization. Currently, these active topics have been widely studied due to many practical applications, such as human-computer interaction, smart surveillance and online video retrieval.

In recent years, impressive progress on action recognition has been reported using deep learning networks [8]. However, compared with action recognition, temporal action localization still leaves large room for improvement in both technique and performance due to its fundamental challenge: it aims not only to recognize actions with a set of category labels but also to locate the exact time stamps of the starting and ending boundaries of different action instances in long raw videos. Inspired by object detection, which is required to generate spatial bounding boxes to precisely locate object instances in images, temporal action localization attempts to determine the temporal intervals of action instances in videos. Considering the similarity between the two tasks, many works [9], [10] on action detection inherit the merits of the pipelines for image object detection [11], [12]. Previous detectors can be roughly classified into two types: one-stage methods [13], [14] and two-stage methods [15], [16]. The latter is generally considered to achieve better accuracy due to the default anchor mechanism.

In this paper, we devise a novel coarse-to-fine network (CFNet) to enhance the precision and speed of temporal action localization. A coarse proposal module (CPM) first generates coarse action candidates from complementary actionness curves, and a fusion block (FB) then enriches the candidate features with the corresponding features of the raw input frames. Finally, a boundary transformer module (BTM) is introduced to refine action boundaries by modeling long-term temporal dependency at a finer granularity. Thus, fine-grained action intervals can be identified. Comprehensive experiments are carried out on THUMOS14 and ActivityNet-v1.3, and display the outstanding improvements of the proposed detector over prior approaches on action detection.

Overall, the major contributions are summarized as follows:

• We devise an effective framework to enhance the performance of temporal action localization in a coarse-to-fine fashion, which is composed of three components: a coarse proposal module, a fusion block and a boundary transformer module.

• To improve coarse proposal generation, multiple actionness curves are combined to provide complementary information to each other. Then, an efficient watershed actionness grouping algorithm is applied to produce coarse action candidates.

• A boundary transformer module is employed to model long-term temporal dependency at frame granularity; thus, the accuracy of action boundaries can be significantly improved.

• Comprehensive experiments are carried out on THUMOS14 and ActivityNet-v1.3, and exhibit outstanding performance over prior methods on the challenging task of temporal action localization.

In this section, we briefly review the most relevant works: two-stage and one-stage pipelines for temporal action localization, and transformers used in the computer vision community. The two-stage pipeline usually adopts a proposal-then-classification scheme, which has been widely studied in prior works [22].

Exhaustively enumerating candidate windows tends to generate a large number of redundant or incomplete candidates, resulting in performance degradation. Another practical method is to identify the temporal intervals with high evaluation scores, and we adopt this paradigm in this work. Furthermore, in view of the complexity and variety of the training samples used in the actionness learning process, we exploit three kinds of actionness learning (i.e., framewise, matching and gated) to discriminate actions at different levels. The three actionness measurements provide complementary information to each other and work together to achieve the final actionness, from which coarse candidates are grouped as sketched below.
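As a concrete illustration of this interval-scoring paradigm (not the paper's exact procedure), the following Python sketch implements one plausible watershed-style grouping: it floods a fused per-frame actionness curve at several thresholds and merges contiguous high-score frames into coarse candidates. The threshold set, minimum segment length, and mean-score ranking are illustrative assumptions.

```python
import numpy as np

def watershed_grouping(actionness, thresholds=(0.3, 0.5, 0.7), min_len=3):
    """Group consecutive high-actionness frames into coarse candidates.

    `actionness` is a 1-D array of fused per-frame scores in [0, 1].
    All hyperparameters here are illustrative, not the paper's values.
    """
    candidates = set()
    for tau in thresholds:
        above = actionness >= tau
        start = None
        for t, flag in enumerate(above):
            if flag and start is None:
                start = t                       # a segment opens here
            elif not flag and start is not None:
                if t - start >= min_len:
                    candidates.add((start, t))  # segment [start, t) closes
                start = None
        if start is not None and len(above) - start >= min_len:
            candidates.add((start, len(above)))
    # Rank candidates by mean actionness as a simple confidence proxy.
    return sorted(((s, e, float(actionness[s:e].mean())) for s, e in candidates),
                  key=lambda c: -c[2])
```

For instance, a curve [0.1, 0.8, 0.9, 0.85, 0.2] flooded at threshold 0.5 yields the single candidate [1, 4) with mean actionness 0.85.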
The framewise actionness prediction (Figure 2(a)) essentially learns a binary classifier that deals with every frame separately. Considering n consecutive frames, we can extract a feature set F = {fᵢ | fᵢ ∈ ℝᵈ}, which is then divided into a positive subset F⁺ and a negative subset F⁻ according to the ground truth. The binary classifier φ_b is obtained by optimizing a binary classification loss over the two subsets.

In the coarse proposal process, we also perform a relative evaluation by exploiting the actionness degree within a video, which can be considered a supervised sequence problem. Assuming a pair of frames fetched from the positive subset and the negative subset respectively, we seek to train a matching classifier (Figure 2(b)) that assigns a higher score to the positive feature than to the negative one. Technically, given feature pairs where each pair (f⁺, f⁻) contains a positive and a negative feature extracted from the same video, the matching actionness classifier φ_p is optimized with a pairwise ranking loss that encourages φ_p(f⁺) > φ_p(f⁻).

It is noted that there is a high correlation between the action labels of consecutive frames, so recurrent information is exploited to predict actionness. To capture the recurrent information from a sequence of frames, we devise a single-layer GRU model and formulate the prediction as a recurrent classification task. The gated classifier φ_r (Figure 2(c)) is trained to output action categories.
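Since the original loss equations are not reproduced here, the following PyTorch sketch shows one plausible reading of the three classifiers: binary cross-entropy for the framewise classifier, a margin ranking loss enforcing φ_p(f⁺) > φ_p(f⁻) for the matching classifier, and a single-layer GRU followed by per-frame classification for the gated classifier. All layer sizes, the ranking margin, and the binary formulation of the gated head are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionnessHeads(nn.Module):
    """Minimal sketch of the three actionness classifiers in CPM."""

    def __init__(self, dim: int):
        super().__init__()
        self.phi_b = nn.Linear(dim, 1)        # framewise binary classifier
        self.phi_p = nn.Linear(dim, 1)        # matching classifier
        self.gru = nn.GRU(dim, dim, batch_first=True)  # single-layer GRU
        self.phi_r = nn.Linear(dim, 1)        # gated (recurrent) head

    def framewise_loss(self, feats, labels):
        # Binary cross-entropy over F+ (label 1) and F- (label 0).
        return F.binary_cross_entropy_with_logits(
            self.phi_b(feats).squeeze(-1), labels.float())

    def matching_loss(self, f_pos, f_neg, margin=0.2):
        # Pairwise ranking: a positive frame should outscore the negative
        # frame drawn from the same video (margin is an assumed value).
        s_pos = self.phi_p(f_pos).squeeze(-1)
        s_neg = self.phi_p(f_neg).squeeze(-1)
        return F.margin_ranking_loss(
            s_pos, s_neg, torch.ones_like(s_pos), margin=margin)

    def gated_loss(self, seq_feats, seq_labels):
        # The GRU captures recurrent context across consecutive frames
        # before per-frame classification.
        hidden, _ = self.gru(seq_feats)            # (B, T, dim)
        logits = self.phi_r(hidden).squeeze(-1)    # (B, T)
        return F.binary_cross_entropy_with_logits(logits, seq_labels.float())
```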

FIGURE 1. Overview of the proposed coarse-to-fine network (CFNet), which mainly consists of three stages: coarse proposal module (CPM), fusion block (FB) and boundary transformer module (BTM). First, the CPM generates coarse action candidates by exploiting three complementary actionness curves (i.e., framewise, matching and gated actionness). Then, the FB is devised to generate frame-level features by fusing the coarse proposal features and the corresponding features of the raw frame input. Afterward, the action boundaries are predicted with a transformer structure by modeling long-term temporal dependency. Finally, a matching strategy is used to update the confidence scores and further refine the boundary of each action candidate. Thus, each action interval can be achieved at a finer granularity. The three stages are sequentially cascaded to progressively refine action boundaries by incrementally improving the feature representation.

After obtaining the three kinds of actionness classifiers φ_b, φ_p and φ_r, the corresponding actionness scores S_b, S_p and S_r can be generated by each classifier. We employ a sigmoid function to normalize the scores to [0, 1], and the final confidence score is computed by linearly combining the three actionness scores as follows:

S = λ₁S_b + λ₂S_p + λ₃S_r,

where λ₁, λ₂ and λ₃ are weight parameters learned by the network.

To facilitate subsequent boundary refinement, we design the fusion block (FB) to generate frame-level features. The detailed structure of FB is illustrated in Figure 3; as can be seen, the FB is composed of two branches.

The boundary transformer module builds upon the transformer architecture [32]. Specifically, we stack M transformer layers to predict boundary probabilities at the frame level, and the encoder-decoder structure in each layer is configured as follows:

The transformer encoder is used to estimate the relevance between frames and consists of a multi-head self-attention layer and a feed-forward network (FFN). First, the frame-level feature map F ∈ ℝ^{T×C} generated by FB is fed into the encoder, which outputs an enhanced feature sequence F̃ ∈ ℝ^{T×C} incorporating global temporal context.

Different from the encoder, in addition to the self-attention mechanism and feed-forward network, the decoder also adopts an encoder-decoder attention strategy. The input to its multi-head self-attention is identical to that of the encoder, namely the feature sequence F ∈ ℝ^{T×C}, and the output of the decoder is a feature sequence F̂. The encoder output F̃ is transformed into the keys K and values V, while F̂ is transformed into the queries Q. Thus, the relevant features can be significantly enhanced while the over-smoothing effect is substantially suppressed. On top of the stacked layers, the network outputs two confidence scores p₁ and p₂ for boundary classification and regression, respectively.

The overall training objective is

L = αL_coarse + βL_fine,

where L_coarse and L_fine denote the loss functions for CPM and BTM respectively, and α, β are weighting parameters, both empirically set to 1.

Concretely, L_coarse and L_fine are both averaged over the N training samples, with each prediction supervised by the corresponding ground truth.
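To make the stacked encoder-decoder configuration concrete, here is a shape-level PyTorch sketch of a single BTM layer under our assumptions (head count, FFN width); residual normalization details and the heads that produce p₁ and p₂ are omitted for brevity, so this is a sketch rather than the paper's exact design.

```python
import torch.nn as nn

class BTMLayer(nn.Module):
    """One encoder-decoder layer of the boundary transformer (a sketch).

    Q comes from the decoder stream while K, V come from the encoder
    output, as described above; all layer sizes are illustrative.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.enc_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                     nn.Linear(4 * dim, dim))
        self.dec_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                     nn.Linear(4 * dim, dim))

    def forward(self, frames):                      # frames: (B, T, C)
        # Encoder: self-attention injects global temporal context.
        enc, _ = self.enc_attn(frames, frames, frames)
        enc = enc + self.enc_ffn(enc)               # enhanced sequence F~
        # Decoder: self-attention over the same input, then
        # encoder-decoder attention (query = decoder, key/value = encoder).
        dec, _ = self.dec_attn(frames, frames, frames)
        dec, _ = self.cross_attn(dec, enc, enc)
        return dec + self.dec_ffn(dec)              # decoder output F^
```

Stacking M such layers and attaching two small heads for boundary classification (p₁) and regression (p₂) would complete the module under this reading.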

B. PARAMETER SETTINGS OF THE COMPARED METHODS
The following state-of-the-art methods are compared for the task of temporal action localization. The parameter settings of these algorithms are briefly described as follows:

(1) The LEAR submission at THUMOS14 [43] adopts Fisher vectors to encode dense trajectory features. The authors slid a window over the video with a stride of 10 frames, with window sizes of 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 and 150 frames. After scoring the windows, they applied a non-maximum suppression algorithm to filter out overlapping proposals.

The performance comparison of the proposed CFNet with prior solutions on THUMOS'14 is presented in Table 1, where the experimental results are obtained after fusing the two-stream scores. As can be seen, CFNet is superior to the prior solutions with a significant mAP improvement at all IoU thresholds. In particular, CFNet obtains a remarkable improvement of 2.1% at tIoU = 0.5 over the suboptimal TCANet (from 44.6% to 46.7%), which is an advanced pipeline on THUMOS'14. It is observed that earlier works (such as [43], [44] and [45]) show relatively weak performance because they neglect the hierarchical action granularity. In contrast, the proposed CFNet takes full advantage of three cascaded steps to encode discriminative features. Specifically, the CPM first explores three different measures to learn actionness. Then, the features of these ordinary proposals are strengthened by the FB module, which highlights salient action characteristics. Next, based on these enhanced features, the BTM is exploited to learn long-term temporal dependency at a fine granularity. Finally, a matching strategy is utilized to refine the temporal action boundaries as well as update the confidence scores of the corresponding proposals. As expected, the experimental results reflect that our model can predict proposals with high scores, and the boundaries of action instances are more precise.

The detection results on ActivityNet-v1.3 are presented in Table 2. Although the frame rate on ActivityNet is set to only 3 fps, CFNet similarly outperforms the prior solutions at all preset IoU thresholds. Figure 4 shows some qualitative examples of the ground-truth temporal action intervals and the predicted ones.

The runtime comparison is presented in Table 3. Note that CFNet is evaluated on an Ubuntu 16.04 server with a GeForce RTX 2080 Ti GPU. The proposed coarse-to-fine network runs at a high speed of 1260 fps and outperforms the other approaches by a large margin.

A further ablation is conducted on THUMOS'14 at threshold 0.5, with results reported for the input of RGB images and optical flow separately. It is observed that different thresholds possess a consistent performance trend.

Considering that the features are progressively enhanced through the three modules, we explore how the network performance improves as the module configuration changes. At the same time, the confidence score of each proposal is progressively updated by multiplying the scores from the different modules. The experimental results are exhibited in Table 4. They clearly show that the sequentially embedded modules substantially enhance the accuracy of the generated action proposals, which demonstrates the effectiveness of the different components.

2) EFFECTIVENESS OF TRANSFORMER LAYERS

Generally, utilizing multiple transformer layers achieves better performance, as demonstrated in several studies, e.g., object detection [49] and NLP [32]. We carry out ablation experiments to investigate the effect of the number of layers in BTM; the corresponding experimental results are presented in the subsequent table.