A Novel Few-Shot Action Recognition Method: Temporal Relational CrossTransformers Based on Image Difference Pyramid

Most current few-shot action recognition methods model temporal relationships on the basis of few-shot image classification and achieve satisfactory results. However, while they exploit the extra temporal information that video carries compared to images and match the query video through frame-tuple embedding representations, they ignore an important cue in action recognition: the features of the action's change itself. To exploit this cue, we propose the Temporal Relational CrossTransformers Based on Image Difference Pyramid (TRX-IDP) method for few-shot action recognition. Building on TRX, we apply high-order image differencing, sigmoid enhancement, and resizing to the frame tuples used directly for the query, and use the same frame tuples to compute the Motion History Image (MHI). Combining the two, we construct an Image Difference Pyramid that carries motion-feature information. We also develop a CrossTransformers query representation for the IDP and restructure the linear mapping functions of the model. We evaluate our model on four commonly used few-shot action recognition benchmark datasets. TRX-IDP achieves state-of-the-art performance on partial SSv2, HMDB51, and UCF101, while slightly lagging behind the current best models on Kinetics and full SSv2. In addition, we perform detailed ablation experiments on TRX-IDP to demonstrate the importance of each part of the model and to determine its best hyperparameters.

The video datasets needed for deep learning are too large to collect, and the cost of labeling is very expensive [5]. To solve the problem of insufficient labeled samples, few-shot learning has been applied to the field of action recognition [10] and has achieved satisfactory results.

Before few-shot action recognition, few-shot image classification methods had achieved significant success, and these methods inspired Zhang et al. [8] to implement action recognition using a matching approach that searches single samples of the support set. Similarly, there are methods [7], [9] that search the average representation of support classes to realize action recognition. However, these methods ignore the temporal information between frames when using multiple frames to represent a video for matching, i.e., they do not use the temporal information of the video when modeling. In addition, a complete action requires two, three, or more frames to represent, so matching individual frames of the video one by one is not the best method. Further, an action may occur anywhere in a video sample, i.e., the effect of temporal offset needs to be canceled in the matching process. Moreover, the same type of action may consume different lengths of time in different videos, so compensating for the degree of stretching of the action during matching is also necessary. These requirements give few-shot action recognition the character of fine-grained classification.

In this paper, we give full consideration to the problems above and propose TRX-IDP [...]; we also perform detailed ablation experiments to prove the importance of each part of the model and to give its best hyperparameters.

II. RELATED WORK
In this section, we introduce three areas of research relevant to this paper: few-shot learning, few-shot image classification, and few-shot action recognition.

A. FEW-SHOT LEARNING
Few-shot learning was created in order to quickly build cognitive ability for new concepts from just one or a few examples. Few-shot learning has since become increasingly mature, and according to the realization method we broadly classify it into three categories. Munkhdalai et al. [...], while Snell et al. [20] proposed metric-based methods. Among the three categories mentioned above, metric-based learning outperforms the other two in the classification of few-shot videos. A metric-based method aims to find a feature representation of the sample, calculates the distance between the query sample and the support set, and classifies the query sample to its nearest support class. The metric-based method is the most relevant to this paper.
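To make the metric-based idea concrete, the following is a minimal PyTorch sketch of nearest-prototype matching. It illustrates the general principle described above rather than any specific published model, and all names in it are ours.

```python
import torch

def metric_classify(query_emb: torch.Tensor,
                    support_embs: torch.Tensor,
                    support_labels: torch.Tensor) -> torch.Tensor:
    """Nearest-prototype classification: the core metric-based idea.

    query_emb:      (D,) embedding of the query sample
    support_embs:   (N, D) embeddings of the support samples
    support_labels: (N,) integer class labels of the support samples
    """
    classes = support_labels.unique()
    # Average the support embeddings of each class into a prototype.
    prototypes = torch.stack([support_embs[support_labels == c].mean(dim=0)
                              for c in classes])
    # Euclidean distance from the query to every class prototype.
    dists = torch.cdist(query_emb.unsqueeze(0), prototypes).squeeze(0)
    # The query is assigned to its nearest support class.
    return classes[dists.argmin()]
```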

B. FEW-SHOT IMAGE CLASSIFICATION
In recent years, many methods for few-shot image classification have been developed on the basis of few-shot learning. Like few-shot learning itself, few-shot image classification methods can be divided into three categories: data-enhanced, optimization-based, and metric-based. Data augmentation is a method of expanding the sample data using spatial deformation [...].

C. FEW-SHOT ACTION RECOGNITION

Unlike few-shot image classification, the difficulty of few-shot action recognition is that it must deal with 3D video data. As discussed above, the metric-based method is currently the best approach, so most few-shot action recognition methods adopt it [...]. In the above few-shot action recognition methods, we note that the changing features of the action itself are ignored [...].

III. METHOD

We start with the special case of a triplet and proceed to build the general model from it. We consider a video $V$ and perform a keyframe extraction operation on $V$. We then use the obtained keyframe sequence to represent $V$, i.e., $V : \{v_1, \ldots, v_F\}$, where $v_i$ is a keyframe extracted from $V$, for $i < j$ the keyframe $v_i$ appears earlier in $V$ than $v_j$, and $F$ is the number of keyframes extracted from $V$. We define a triplet of three frames selected from $V$ as $P = \{v_{01}, v_{02}, v_{03}\}$.

We perform a difference operation on $v_{0i}$ and $v_{0(i+1)}$ to get the first-order difference image $\mathrm{diff}_{1i} = |v_{0i} - v_{0(i+1)}|$. Then we use the TemperatureSigmoid (TS) function to enhance the contrast of the difference image and finally rescale the enhanced image $TS(\mathrm{diff}_{1i})$; the purpose of rescaling is not only to reduce the complexity to linear, but also to reduce the number of invalid features. The rescaling operation behaves like $2 \times 2$ average pooling with stride 2, and the TS function is

$TS(x) = \dfrac{1}{1 + e^{-(x - c)/T}}, \qquad (1)$

where the hyperparameters $c$ (center) and $T$ (temperature) are selected based on common image contrast enhancement functions.
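As an illustration of one first-order IDP step described above, the following PyTorch sketch computes the absolute frame difference, applies a temperature sigmoid, and halves the resolution with 2 × 2 average pooling. The center and temperature values are placeholder assumptions, since the exact hyperparameters are not reproduced here.

```python
import torch
import torch.nn.functional as F

def temperature_sigmoid(x: torch.Tensor,
                        center: float = 0.5,
                        temperature: float = 0.1) -> torch.Tensor:
    """Sigmoid contrast enhancement; center/temperature are illustrative."""
    return torch.sigmoid((x - center) / temperature)

def first_order_diff(frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
    """One first-order IDP step on two (C, H, W) frames with values in [0, 1].

    Computes diff_1i = |v_0i - v_0(i+1)|, enhances it with TS, then rescales
    with 2x2 average pooling at stride 2 (halving the spatial resolution).
    """
    diff = (frame_a - frame_b).abs()              # diff_1i
    enhanced = temperature_sigmoid(diff)          # TS(diff_1i)
    return F.avg_pool2d(enhanced, kernel_size=2, stride=2)  # rescaled image
```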

Thus we get the $i$-th image of the first-order difference layer of the pyramid:

$v_{1i} = \mathrm{Rescale}\big(TS(|v_{0i} - v_{0(i+1)}|)\big). \qquad (2)$

For a difference of order $k$, $1 < k < F$, $k < K$, where $K$ is the highest layer number of the pyramid, we have:

$v_{ki} = \mathrm{Rescale}\big(TS(|v_{(k-1)i} - v_{(k-1)(i+1)}|)\big). \qquad (3)$

The Motion History Image (MHI) represents the target motion as image brightness by calculating the pixel changes at the same location during a time period. We creatively treat the sequence of a few frames as a video and calculate its MHI. Let $H$ be the intensity value of a motion history pixel; $H(x, y, t)$ can be calculated from the update function as:

$H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } \Psi(x, y, t) = 1 \\ \max\big(0,\, H_\tau(x, y, t-1) - \delta\big), & \text{otherwise,} \end{cases} \qquad (4)$

where $(x, y)$ and $t$ are the position and time of a pixel point, $t \ge 1$, and $H_\tau(x, y, 0) = 0$; $\tau$ is the duration, which determines the time range of the motion in terms of the number of frames, and here $\tau = 250$; $\delta$ is the decay parameter, and here $\delta = 100$. $\Psi(x, y, t)$ is the update function, defined using the inter-frame difference method:

$\Psi(x, y, t) = \begin{cases} 1, & \text{if } D(x, y, t) \ge \xi \\ 0, & \text{otherwise,} \end{cases} \qquad (5)$

where $D(x, y, t) = |I(x, y, t) - I(x, y, t-1)|$ is the inter-frame difference of the pixel intensity $I$ and $\xi$ is the difference threshold.

Each few-shot action recognition task is composed of labeled "support sets" and unlabeled "query sets". In this paper, we draw few-shot action recognition tasks from the training set, and for each task we focus on its $C$-way, $N$-shot classification problem.

We consider three frames sampled from the query video, indexed by $p = (p_1, p_2, p_3)$ with $1 \le p_1 < p_2 < p_3 \le F$, and define the query representation of the 0-th order difference layer as:

$Q_p^0 = \big[\Phi(q_{01}) + PE(p_1),\; \Phi(q_{02}) + PE(p_2),\; \Phi(q_{03}) + PE(p_3)\big], \qquad (6)$

where $\Phi : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^D$ is a convolutional neural network that transforms an input frame into a $D$-dimensional embedding, $PE(\cdot)$ is a position encoding based on the index of the frame, and $q_{0i}$ is the $i$-th image of the 0-th order difference layer (the first layer is the 0-th order difference layer) of the Image Difference Pyramid formed by the extracted frame tuple $(q_{p_1}, q_{p_2}, q_{p_3})$.
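The MHI update in (4)-(5) can be sketched directly in PyTorch; $\tau = 250$ and $\delta = 100$ follow the text, while the difference threshold `xi` is an illustrative assumption.

```python
import torch

def motion_history_image(frames: torch.Tensor,
                         tau: float = 250.0,
                         delta: float = 100.0,
                         xi: float = 0.05) -> torch.Tensor:
    """Compute a Motion History Image over a short frame sequence.

    frames: (T, H, W) grayscale tensor with values in [0, 1]; tau and delta
    follow the text (tau = 250, delta = 100), xi is an assumed threshold.
    """
    H = torch.zeros_like(frames[0])  # H_tau(x, y, 0) = 0
    for t in range(1, len(frames)):
        # Inter-frame difference thresholding: the update function Psi.
        moving = (frames[t] - frames[t - 1]).abs() >= xi
        # Standard MHI update: set to tau where motion occurs,
        # otherwise decay by delta (clamped at 0).
        H = torch.where(moving, torch.full_like(H, tau),
                        (H - delta).clamp(min=0.0))
    return H
```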

The TRX method pioneered the use of ordered frame tuples to represent actions, but it ignores the changing features of the actions themselves. Our proposed Image Difference Pyramid highlights the action features that are missed during frame-tuple matching, and we define the query representation of the $k$-th layer of the pyramid, analogously to (6), as:

$Q_p^k = \big[\Phi(q_{k1}) + PE(p_1),\; \ldots,\; \Phi(q_{k(3-k)}) + PE(p_{3-k})\big],$

where $q_{ki}$ is the $i$-th image of the $k$-th order difference layer, which contains $3 - k$ images for a frame triplet.
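A minimal sketch of building one layer's query representation follows, under the assumption that the backbone Φ (`phi`) and the position encoding PE (`pos_enc`) are given and that a k-th order layer simply uses the first 3 − k tuple indices (an assumption on our part).

```python
import torch

def layer_query_rep(layer_images, frame_indices, phi, pos_enc):
    """Embed each image of one IDP layer and add frame position encodings.

    layer_images:  list of (C, H, W) tensors from one difference layer
    frame_indices: tuple indices (p_1, ..., p_w); a k-th order layer has
                   fewer images than indices, so we truncate (an assumption)
    phi:           backbone mapping a (B, C, H, W) batch to (B, D) embeddings
    pos_enc:       callable mapping a frame index to a (D,) encoding
    """
    embeddings = phi(torch.stack(layer_images))                    # (n, D)
    encodings = torch.stack([pos_enc(i)
                             for i in frame_indices[:len(layer_images)]])
    return embeddings + encodings                                  # (n, D)
```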

In summary, we define the query representation $Q_p$ as follows:

$Q_p = \big[\,Q_p^0,\; Q_p^1,\; \ldots,\; Q_p^K,\; Q_p^{MHI}\,\big], \qquad (7)$

where $Q_p^{MHI}$ is the representation obtained from the MHI. We compare the query representation $Q_p$ with all the triplet representations in the support set, which allows matching actions that proceed at different speeds or appear at different locations in the video. We define the set of all triples as:

$\Pi = \big\{(n_1, n_2, n_3) \in \mathbb{N}^3 : 1 \le n_1 < n_2 < n_3 \le F\big\}. \qquad (8)$

Using the same method as for (7), we obtain the representation $S^c_{nm}$ of the $m$-th triplet of the $n$-th support video of class $c$ [...]. We apply the query representation generated using the [...]. Following TRX, the correspondence between $Q_p$ and $S^c_{nm}$ can be expressed through cross-attention as:

$a^c_{nm,p} = \mathrm{softmax}\Big(L(\Gamma \cdot S^c_{nm}) \cdot L(\Gamma \cdot Q_p) \,/\, \sqrt{d_k}\Big), \qquad (9)$

where $\Gamma$ maps tuple representations to $d_k$-dimensional keys and queries and $L$ is layer normalization. Then we can calculate the distance between the query representation $Q_p$ and the support set $S^c$:

$T(Q_p, S^c) = \Big\|\, \textstyle\sum_{n,m} a^c_{nm,p} \,(\Lambda \cdot S^c_{nm}) \,-\, \Lambda \cdot Q_p \,\Big\|, \qquad (10)$

where $\Lambda$ maps tuple representations to $d_v$-dimensional values.

Considering that a frame triplet may not be the best representation of an action, using higher-order tuples is necessary. We use $\omega$ to represent the length of the tuple, as TRX does, and rewrite (10) accordingly. We generalize the query representation $Q_p$ with index $p = (p_1, \ldots, p_\omega) \in \Pi^\omega$:

$Q_p = \big[\Phi(q_{p_1}) + PE(p_1),\; \ldots,\; \Phi(q_{p_\omega}) + PE(p_\omega)\big], \qquad (11)$

where $\Pi^\omega = \{(p_1, \ldots, p_\omega) \in \mathbb{N}^\omega : 1 \le p_1 < \cdots < p_\omega \le F\}$. We define the set of tuple lengths as $\Omega$. For instance, [...] the gradient of the summed distance. TRX-IDP is trained end-to-end using all $\omega \in \Omega$, different difference orders $k$, and backbone parameters shared across all tuples. Fig. 2 [...].

IV. EXPERIMENTS

We randomly initialize the parameters of the model (the linear maps $\Lambda_k$ of the difference layers, $k \neq 0$, are initialized to 0) and set $D = 2048$ and $d_v = d_k = 1152$. We extract 8 keyframes from each video, i.e., $F = 8$, and then resize the resulting keyframes to $224 \times 224$. In addition, TRX-IDP uses SGD as the optimizer with a learning rate of $10^{-3}$ ($10^{-4}$ when the dataset is partial SSv2).

We compare TRX-IDP with [...], which were introduced in Section II. The TRX-IDP method inherits the characteristics of the TRX method and performs better on the few-shot task than on the one-shot task; to facilitate comparison with the other methods mentioned above, we evaluate our method on the standard 5-way 5-shot benchmark.

We present the results of TRX-IDP and the other models in Table 1. The models in Table 1 all use ResNet-50 as the backbone to extract features. On Kinetics, OTAM already reached an accuracy of 85.8%, and TRX improved on this by 0.1%; the best current model, HyRSM, reaches 86.1%, while our TRX-IDP achieves 86.0%. When Kinetics is used as a few-shot benchmark, it is similar to some [...] when the frame tuple length is 1. The MHI feature map has a positive impact when the tuple length reaches 3 and a negative impact when the tuple length is less than or equal to 2, and the performance gain grows as the tuple length increases; when the tuple length equals 4, the improvement reaches 0.4%. Therefore, when introducing the MHI, the tuple length should be greater than 2, and the longer the tuple, the greater the amount of information contained in the MHI. However, when the tuple length is larger than 4, the performance of the model decreases instead; therefore, in Section IV-C3 we do not discuss the case where the tuple length is greater than 4.

[...] has been improved, and the combination of pair frames and triplet frames achieves the best result: 59.8%. Combining pair frames and quadruples is not a good choice, because its performance (58.9%) is even inferior to that of a single CrossTransformer (triplets or quadruples alone). When using three CrossTransformers, $\Omega = \{2, 3, 4\}$, the performance is reduced (−0.1%).
When Kinetics is used for evaluation, the overall differences are small, but the same conclusion can be drawn as with SSv2: the combination of pair frames and triplet frames, i.e., $\Omega = \{2, 3\}$, is the best choice. Compared with TRX, IDP improves the overall performance of TRX.
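As a rough illustration of how the distances produced by multiple CrossTransformers (one per tuple length ω ∈ Ω and difference order k) could be combined into a single per-class score, the simple averaging below is an assumed combination rule, not necessarily the paper's exact formulation.

```python
import torch

def combined_distance(per_transformer_dists: dict) -> torch.Tensor:
    """Combine query-to-class distances from several CrossTransformers.

    per_transformer_dists: dict mapping (omega, k) -> (C,) tensor of
    query-to-class distances for tuple length omega and difference order k.
    Plain averaging is an illustrative choice; the ablation above suggests
    Omega = {2, 3} as the best set of tuple lengths.
    """
    stacked = torch.stack(list(per_transformer_dists.values()))  # (T, C)
    return stacked.mean(dim=0)                                   # (C,)
```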

V. CONCLUSION
In this paper, we proposed the Temporal Relational CrossTransformers Based on Image Difference Pyramid (TRX-IDP) method for few-shot action recognition. Our method builds on TRX: the frame tuples used for the query are subjected to high-order image differencing, sigmoid enhancement, and resizing, and, combined with the Motion History Image (MHI), the Image Difference Pyramid (IDP) containing motion-feature information is constructed. We also developed the CrossTransformers query representation for the IDP and rewrote and optimized the linear mapping functions of the model. TRX-IDP outperforms TRX on the few-shot benchmarks of all four datasets and achieves state-of-the-art performance on partial SSv2, HMDB51, and UCF101, while slightly lagging behind HyRSM on Kinetics-400 and full SSv2. In the future, we will try to combine the IDP module with other metric-based few-shot action recognition methods and explore them further.