Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition

Gesture recognition has attracted considerable attention owing to its great potential in applications. Although great progress has been made recently in multi-modal learning, existing methods still lack an effective way to integrate and fully exploit the synergies among spatio-temporal modalities for gesture recognition. The problem is partially due to the fact that existing manually designed network architectures are inefficient at jointly learning from multiple modalities. In this paper, we propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition. The proposed method includes two key components: 1) enhanced temporal representation via the proposed 3D Central Difference Convolution (3D-CDC) family, which is able to capture rich temporal context via aggregating temporal difference information; and 2) optimized backbones for multi-sampling-rate branches and lateral connections among varied modalities. The resultant multi-modal multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics. Comprehensive experiments are performed on three benchmark datasets (IsoGD, NvGesture, and EgoGesture), demonstrating state-of-the-art performance in both single- and multi-modality settings. The code is available at https://github.com/ZitongYu/3DCDC-NAS.

As gestures have various temporal ranges, modeling such visual tempos would benefit gesture recognition. Previous methods [15], [16], [17] attempt to construct a frame pyramid for this purpose, with each branch of the frame pyramid sampling the input frames at a different rate. However, the architecture (i.e., network structure) of each branch and the relations (i.e., lateral connections) among the multi-rate branches are usually shared and hand-designed, which is sub-optimal for message propagation. Hence, discovering better-suited architectures and lateral connections among multi-rate branches is crucial.
For RGB-D based gesture recognition, complementary feature learning from different data modalities is beneficial. For example, depth data makes it easy to distinguish foregrounds (i.e., face, hands, and arms) from backgrounds, while RGB data provides detailed texture/color appearances. However, most existing methods [7], [8], [18], [19], [20] conduct multi-modal fusion via coarse strategies (e.g., score fusion or last-layer concatenation), which may exploit the multi-modal information insufficiently. Thus, designing a more reasonable multi-modal fusion strategy is not trivial.
Motivated by the above observations, we propose a novel spatio-temporal convolution family called 3D Central Difference Convolution (3D-CDC), to exploit the rich local motion and enhance the spatio-temporal representation. Moreover, over the 3D-CDC-based enhanced search space, Neural Architecture Search (NAS) is adopted to automatically discover the optimized multi-rate and multi-modal networks for RGB-D gesture recognition. Our contributions include:
• A novel spatio-temporal convolution family, 3D-CDC, is proposed, intending to capture rich temporal context via aggregating temporal difference information. Without introducing extra parameters, 3D-CDC can replace the vanilla 3D convolution, and plug and play in existing 3DCNNs for various modalities with enhanced temporal modeling capacity.
• We propose a two-stage NAS method to automatically discover well-suited backbones and lateral connections for the multi-rate and multi-modal networks, which effectively explores RGB-D-temporal synergies and represents global dynamics.
• To the best of our knowledge, this is the first approach that searches multi-rate and multi-modal architectures for RGB-D gesture recognition. Our searched architecture provides a new perspective to understand the relationship among multi-rate branches as well as modalities.
• Our proposed method achieves state-of-the-art (SOTA) performance on three benchmark datasets on both single- and multi-modality testing.
In the rest of the paper, Sec. II provides the related work. Sec. III formulates the 3D-CDC family and introduces the two-stage multi-rate and multi-modal NAS algorithm. Sec. IV provides rigorous ablation studies and evaluates the performance of the proposed models on three benchmark datasets. Sec. V shows the visualization results and discusses transferability to the action recognition task. Finally, a conclusion with future directions is given in Sec. VI.

II. RELATED WORK
In this section, we first introduce some recent progress in multi-modal gesture recognition. Then, previous video-based NAS methods are reviewed.
Multi-Modal Gesture Recognition. For video-based gesture recognition, it is challenging to track the motion of hands and arms owing to their large degree of freedom. Many handcrafted feature based [21], [22], [23], [24] and deep learning-based methods [25], [26], [11], [27], [28] have been proposed to tackle this issue. As for learning-based gesture recognition, on one hand, 3DCNNs including C3D [29], Res3D [30], I3D [31] and SlowFast [15] are utilized as gesture feature extractors. On the other hand, LSTM variants such as Atten-ConvLSTM [32], [32] and PreRNN [33] are introduced for temporal memory propagation. Based on the CNNs, several extended modules [12], [13], [14], [34] and convolution operators [35], [36] have been developed for enhancing the spatio-temporal representation. However, most of them need extra structures and learnable parameters to modulate the original spatio-temporal features. In this paper, we propose 3D-CDC for modeling a rich temporal context, which is vital for describing fine-grained hand/arm motion. Among these methods, Lee et al.'s [13] motion feature network (MFNet) and Yu et al.'s [36], [37] central difference convolution (CDC) are the most similar to our work. Instead of the fixed motion filter used in MFNet and the purely spatial context in CDC, our 3D-CDC learns the temporal gradient (motion) filters automatically in a unified 3D convolution operator.
In terms of the multi-modal fusion strategy, decision-level fusion [44], [9], [45] and feature-level fusion [7], [8], [38], [20] methods have been developed for integrating mutual knowledge from varied modalities. Despite achieving SOTA performance, the existing multi-modal fusion strategies for gesture recognition are designed manually and coarsely, which might be sub-optimal for message propagation between modalities. In this paper, we instead discover well-suited multi-modal fusion strategies automatically via NAS.
In terms of the search space design, the cell-based NASNet [50] space is favored due to its flexibility and rich capacity. Besides, some novel operators, such as extended convolution [36], [61] and attention [58], [62], have been introduced into the search space, which has proved beneficial for finding more powerful architectures. However, there are still no operators specially designed for temporal enhancement.
To the best of our knowledge, no NAS-based method has been proposed for RGB-D gesture recognition. To fill this gap, we search multi-rate and multi-modal networks over the temporal enhanced search space for RGB-D gesture recognition.

III. METHODOLOGY
In this section, we first introduce the 3D-CDC family in Sec. III-A. Over the 3D-CDC based space, we then propose the two-stage multi-rate and multi-modal NAS in Sec. III-B.

A. Temporal Enhancement via 3D-CDC
In classical 3DCNNs, 3D convolution is the most fundamental operator for spatio-temporal feature representation. In this subsection, for simplicity, the 3D convolutions are described in 3D (without channel) while an extension to 4D is straightforward. There are two main steps in the vanilla 3D convolution: 1) sampling the local receptive field cube $C$ over the input feature map $x$; 2) aggregation of the sampled values via weighted summation with learnable weights $w$. Hence, the output feature map $y$ of the vanilla convolution can be formulated as
$$y(p_0) = \sum_{p_n \in C} w(p_n) \cdot x(p_0 + p_n), \tag{1}$$
where $p_0$ denotes the current location on both input and output feature maps while $p_n$ enumerates the locations in $C$. In this subsection, 3D convolution with kernel size 3×3×3 and dilation 1 is used for demonstration, and the other configurations are analogous. The local receptive field cube for the 3D convolution is $C = \{(-1, -1, -1), (-1, -1, 0), \cdots, (0, 1, 1), (1, 1, 1)\}$.
Spatio-Temporal Central Difference Convolution (3D-CDC-ST). Inspired by the CDC [36], which introduces spatial gradient cues into representation learning, we integrate spatio-temporal gradient information into a unified 3D convolution operator. It is worth noting that such spatial gradient and temporal difference designs are widely used in dense optical flow [63] calculation. Whereas optical flow is computed only at the RGB sequence level, networks with stacked spatio-temporal CDC are regularized to learn more local motion context at both the RGB sequence level and the deep feature level, which makes it possible to model fine-grained temporal dynamics for gesture recognition.
Similarly, spatio-temporal CDC also consists of two steps, i.e., sampling and aggregation. The sampling step is similar to vanilla 3D convolution but the aggregation step is different: as illustrated in Fig. 1(a), spatio-temporal CDC aggregates the center-oriented spatio-temporal gradient of the sampled values. Eq. (1) becomes
$$y(p_0) = \sum_{p_n \in C} w(p_n) \cdot \big(x(p_0 + p_n) - x(p_0)\big). \tag{2}$$
When $p_n = (0, 0, 0)$, the gradient value always equals zero with respect to the central location $p_0$ itself. For the gesture recognition task, both spatio-temporal intensity-level semantic information and gradient-level difference messages are crucial and complementary. The former is good at global modeling and robust to sensor-based noise, while the latter focuses more on local appearance and motion details and might be influenced by noise. As a result, combining vanilla 3D convolution with 3D-CDC might be a feasible way to provide more robust and discriminative modeling capacity. Therefore we generalize spatio-temporal CDC as
$$y(p_0) = \theta \cdot \sum_{p_n \in C} w(p_n) \cdot \big(x(p_0 + p_n) - x(p_0)\big) + (1 - \theta) \cdot \sum_{p_n \in C} w(p_n) \cdot x(p_0 + p_n) = \sum_{p_n \in C} w(p_n) \cdot x(p_0 + p_n) + \theta \cdot \Big(-x(p_0) \cdot \sum_{p_n \in C} w(p_n)\Big), \tag{3}$$
where the hyperparameter $\theta \in [0, 1]$ trades off the contribution between intensity-level and gradient-level information. Please note that $w(p_n)$ is shared between the vanilla 3D convolution and the spatio-temporal central difference (CD) term, thus no extra parameters are added.
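The generalized spatio-temporal CDC can be checked numerically at a single output location. The following is a minimal sketch in plain Python, not the authors' implementation: the dictionary-based patch and kernel, their toy values, and the function names are assumptions made for illustration. It evaluates the blended form (difference term weighted by θ, intensity term weighted by 1 − θ) and the equivalent form "vanilla output minus θ times the center value times the sum of the weights".

```python
from itertools import product

# Offsets of the 3x3x3 receptive field cube C, ordered (t, h, w).
CUBE = list(product((-1, 0, 1), repeat=3))


def vanilla_conv(patch, weight):
    """Vanilla 3D convolution at one location: weighted sum over the cube."""
    return sum(weight[p] * patch[p] for p in CUBE)


def cdc_st(patch, weight, theta):
    """Generalized 3D-CDC-ST at one location.

    Blends the central-difference term (factor theta) with the vanilla
    intensity term (factor 1 - theta); both terms share the same weights.
    """
    center = patch[(0, 0, 0)]
    difference = sum(weight[p] * (patch[p] - center) for p in CUBE)
    return theta * difference + (1.0 - theta) * vanilla_conv(patch, weight)


def cdc_st_fast(patch, weight, theta):
    """Equivalent form: vanilla output minus theta * center * sum(weights)."""
    center = patch[(0, 0, 0)]
    return vanilla_conv(patch, weight) - theta * center * sum(weight.values())


# Toy patch and kernel indexed by offset (hypothetical values).
patch = {p: float(i % 5) for i, p in enumerate(CUBE)}
weight = {p: 0.1 * ((i % 3) - 1) for i, p in enumerate(CUBE)}
```

The second form is the convenient one in practice, because it only needs the ordinary convolution output plus one extra term built from the summed kernel, which is why no additional parameters are required.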
Temporal Central Difference Convolution (3D-CDC-T). Unlike the aforementioned spatio-temporal CDC, which considers both spatial and temporal gradient cues, we propose a version with only temporal central differences. As shown in Fig. 1(b), the sampled local receptive field cube $C$ is separated into two kinds of regions: 1) the region in the current time step $R'$, and 2) the regions in the adjacent time steps $R''$. In the setting of a temporal CDC, the central difference term is only calculated from $R''$. Thus the generalized temporal CDC can be formulated by modifying Eq. (3) as
$$y(p_0) = \sum_{p_n \in C} w(p_n) \cdot x(p_0 + p_n) + \theta \cdot \Big(-x(p_0) \cdot \sum_{p_n \in R''} w(p_n)\Big). \tag{4}$$
Temporal Robust Central Difference Convolution (3D-CDC-TR). In consideration of the sensor noise, especially in the depth modality, we also propose a version with the temporal robust central difference. Similarly, the temporal robust CDC only calculates the difference term from the regions in the adjacent time steps $R''$. As shown in Fig. 1(c), the robust temporal center is represented by averaging the spatial centers of all time steps (i.e., $p_0^{t-1}$, $p_0$ and $p_0^{t+1}$) within $C$. The robust temporal center-oriented gradient might be less sensitive to pixel jitter from the adjacent time steps. The generalized temporal robust CDC can also be formulated by modifying Eq. (3) as
$$y(p_0) = \sum_{p_n \in C} w(p_n) \cdot x(p_0 + p_n) + \theta \cdot \Big(-\tfrac{1}{3}\big(x(p_0^{t-1}) + x(p_0) + x(p_0^{t+1})\big) \cdot \sum_{p_n \in R''} w(p_n)\Big). \tag{5}$$
We will henceforth refer to these three generalized versions (i.e., Eq. (3), (4) and (5)) as 3D-CDC-ST, 3D-CDC-T and 3D-CDC-TR, respectively. The ablation studies on the 3D-CDC family and the hyperparameter θ are conducted in Sec. IV-C.
Fig. 2: Architecture search space in the first stage. Single-modal (RGB or depth) multi-rate frames are adopted as inputs. Here we utilize 3 branches with different frame rates (e.g., uniformly sampling the original videos into 8, 16, and 32 frames, respectively), and it can also be extended to more branches according to the actual situation. In this search stage, inspired by [15], the architectures of all lateral connections are fixed with temporal convolutions, and the outputs of the lateral connections are concatenated with the features from the target branch. The channel numbers are doubled after each MaxPool layer. The architecture of the cells from multi-rate branches to be searched can be shared or unshared (see Sec. IV-C for ablation study).
Relation between 3D-CDC and 3D Vanilla Convolution. Compared to 3D vanilla convolution, 1) 3D-CDC-ST regularizes the spatio-temporal representation with more detailed spatial cues and temporal dynamics, which might be suitable for scene-aware video understanding tasks; 2) 3D-CDC-T enhances the spatio-temporal representation with only rich temporal context, which might suit video temporal reasoning tasks; and 3) 3D-CDC-TR introduces robust but subtler temporal evolution cues for spatio-temporal representation, which might perform robustly even in noisy scenarios. In particular, 3D-CDC-ST, 3D-CDC-T, and 3D-CDC-TR reduce to vanilla 3D convolution when θ=0. As illustrated in Fig. 1(d), the 3D-CDC family provides more details about the trajectory of the left arm, and such local motion context is crucial to gesture recognition. More visualizations are shown in Sec. V-C.
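The difference between the two temporal variants, and their reduction to vanilla convolution at θ=0, can be illustrated at a single location. This is a hedged sketch in plain Python rather than the released code; the dictionary representation and the helper names are assumptions. It takes the difference term only over the adjacent time steps, using either the current center (3D-CDC-T) or the mean of the three spatial centers (3D-CDC-TR).

```python
from itertools import product

CUBE = list(product((-1, 0, 1), repeat=3))   # offsets (t, h, w)
ADJACENT = [p for p in CUBE if p[0] != 0]    # adjacent time steps t-1 and t+1


def vanilla_conv(patch, weight):
    """Vanilla 3D convolution at one location."""
    return sum(weight[p] * patch[p] for p in CUBE)


def cdc_t(patch, weight, theta):
    """3D-CDC-T: difference term over adjacent time steps, current center."""
    center = patch[(0, 0, 0)]
    w_adjacent = sum(weight[p] for p in ADJACENT)
    return vanilla_conv(patch, weight) - theta * center * w_adjacent


def cdc_tr(patch, weight, theta):
    """3D-CDC-TR: same as cdc_t, but with a temporally averaged center."""
    robust_center = sum(patch[(t, 0, 0)] for t in (-1, 0, 1)) / 3.0
    w_adjacent = sum(weight[p] for p in ADJACENT)
    return vanilla_conv(patch, weight) - theta * robust_center * w_adjacent


# A single bright center pixel at time t (hypothetical toy input).
ones = {p: 1.0 for p in CUBE}
spike = {p: 0.0 for p in CUBE}
spike[(0, 0, 0)] = 3.0
```

With the isolated spike above, the averaged center is one third of the spike value, so `cdc_tr` reacts less strongly than `cdc_t` to a single-frame outlier, which matches the robustness argument in the text.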

B. Multi-Rate and Multi-Modal NAS
In order to seek the best-suited backbones and lateral connections for multi-rate and multi-modal networks, we propose a two-stage NAS method that 1) first searches backbones for multi-rate single-modal networks, and then 2) searches lateral connections for multi-modal networks based on the searched backbones. The iterative procedure is outlined in Algorithm 1. More technical details can be found in two gradient-based NAS methods [47], [51].
1) Stage 1: Searching Backbones for Multi-Rate Single-Modal Networks: In SlowFast Networks [15], low- and high-rate branches are utilized for complementarily capturing dynamic visual tempos. However, the coarse hand-defined architecture with vanilla convolutions limits its representation capacity. Here we search optimal rate-aware backbones over the temporal enhanced search space.
As illustrated in Fig. 2, in the first stage, our goal is to search for cells to form multi-rate backbones for single-modal gesture recognition. As for the cell-level structure, similar to [47], each cell is represented as a directed acyclic graph (DAG) of $K$ nodes $\{n_i\}_{i=0}^{K-1}$, where each node represents a network layer. We denote the operation space as $O_b$, which consists of seven designed candidate operations: 'Zero', 'Identity', 'Conv 1x3x3', 'CDC-T-06 3x1x1', 'CDC-T-06 3x3x3', 'CDC-TR-03 3x1x1' and 'CDC-TR-03 3x3x3'. To be specific, 'CDC-T-06' and 'CDC-TR-03' denote the 3D-CDC-T with θ=0.6 and 3D-CDC-TR with θ=0.3, respectively. These settings of θ are based on the ablation study results in Sec. IV-C. We also consider a vanilla operation space with vanilla 3D convolutions instead of 3D-CDC for comparison.
Fig. 3: Architecture search space in the second stage. Multi-modal and multi-rate frames are adopted as inputs. Here we utilize 3 branches with different frame rates for two modalities (RGB and depth), respectively. In this search stage, all cells are initialized with the searched structures in the first stage and then fixed. The architecture of the low-, mid- and high-level lateral connections to be searched can be shared or unshared.
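The multi-rate inputs of the first stage come from sampling the same clip at several rates (8, 16, and 32 frames in the example of Fig. 2). The text only states that the sampling is uniform, so the sketch below is one possible reading: it assumes that the middle frame of each equal-length segment is taken, and the function names are hypothetical.

```python
def uniform_sample_indices(num_video_frames, num_samples):
    """Pick num_samples frame indices spread evenly over a clip.

    Assumption: the clip is cut into num_samples equal segments and the
    middle frame of each segment is taken. Short clips repeat frames.
    """
    if num_video_frames <= 0 or num_samples <= 0:
        raise ValueError("frame counts must be positive")
    segment = num_video_frames / num_samples
    return [
        min(num_video_frames - 1, int(segment * (i + 0.5)))
        for i in range(num_samples)
    ]


def multi_rate_indices(num_video_frames, rates=(8, 16, 32)):
    """Frame indices for every branch, keyed by the number of frames."""
    return {rate: uniform_sample_indices(num_video_frames, rate) for rate in rates}


# Example: a 64-frame clip feeding the three branches.
branches = multi_rate_indices(64)
```

Each branch then receives a tensor with a different temporal length, which is why the lateral connections between branches need temporal strides.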
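The multi-rate backbones to be searched for each of the modalities $M$ have the architecture parameters $\alpha_b^{(i,j)}$. Each edge $(i, j)$ of the DAG represents the information flow from node $n_i$ to node $n_j$, which consists of the candidate operations weighted by the architecture parameter $\alpha_b^{(i,j)}$. Specifically, each edge $(i, j)$ can be formulated by a function $\tilde{o}_b^{(i,j)}$ where
$$\tilde{o}_b^{(i,j)}(n_i) = \sum_{o_b \in O_b} \eta_{o_b}^{(i,j)} \cdot o_b(n_i).$$
The softmax function $\eta_{o_b}^{(i,j)} = \exp(\alpha_{o_b}^{(i,j)}) / \sum_{o'_b \in O_b} \exp(\alpha_{o'_b}^{(i,j)})$ is utilized to relax the architecture parameter $\alpha_b^{(i,j)}$ into operation weights. The intermediate node can be denoted as $n_j = \sum_{i<j} \tilde{o}_b^{(i,j)}(n_i)$. The output node $n_{K-1}$ is the depth-wise concatenation of all the intermediate nodes excluding the input nodes.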
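The continuous relaxation of one edge can be sketched with scalars standing in for feature maps. This follows the softmax-weighted mixed operation that is standard in gradient-based NAS; the scalar candidate operations and all names below are illustrative assumptions, not the paper's operation space.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    shift = max(alphas)
    exps = [math.exp(a - shift) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, ops, alphas):
    """Relaxed edge: softmax-weighted sum of all candidate operations."""
    etas = softmax(alphas)
    return sum(eta * op(x) for eta, op in zip(etas, ops))


# Scalar stand-ins for candidate operations (hypothetical).
CANDIDATE_OPS = [
    lambda x: 0.0,       # 'Zero'
    lambda x: x,         # 'Identity'
    lambda x: 2.0 * x,   # stand-in for a learnable convolution
]


def intermediate_node(inputs, edge_alphas):
    """n_j: sum of the mixed operations applied to every predecessor n_i."""
    return sum(
        mixed_op(n_i, CANDIDATE_OPS, alphas)
        for n_i, alphas in zip(inputs, edge_alphas)
    )
```

With equal architecture parameters every candidate contributes equally; as one parameter grows, the edge approaches the corresponding single operation, which is what makes the final discrete choice meaningful.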
In the searching stage, cross-entropy loss is utilized for the training loss $L_{train}$ and the validation loss $L_{val}$. Network parameters $\Phi_b$ and architecture parameters $\alpha_b$ are learned via solving the bi-level optimization problem:
$$\min_{\alpha_b} L_{val}\big(\Phi_b^*(\alpha_b), \alpha_b\big), \quad \text{s.t. } \Phi_b^*(\alpha_b) = \arg\min_{\Phi_b} L_{train}(\Phi_b, \alpha_b).$$
When the search has converged, we derive the final architectures via: 1) setting $o_b^{(i,j)} = \arg\max_{o_b \in O_b,\, o_b \neq \text{zero}} \eta_{o_b}^{(i,j)}$, and 2) for each intermediate node, choosing the two incoming edges with the two largest values of $\max_{o_b \in O_b,\, o_b \neq \text{zero}} \eta_{o_b}^{(i,j)}$.
2) Stage 2: Searching Lateral Connections for Multi-Rate Multi-Modal Networks: The lateral connections in most existing multi-rate [15] or multi-modal [9], [7], [20] spatio-temporal networks are designed manually, which might be sub-optimal for information exchange. Here we propose to search best-suited lateral connections among rate-aware and modality-aware branches, intending to effectively explore RGB-D-temporal synergies. In the second stage, our goal is to search for lateral connections among the multi-rate and multi-modal branches. As most of the definitions and the search procedure are similar to those in the first stage, here we only describe the two main differences from the first stage.
On one hand, the composition of the architecture search space is different. As shown in Fig. 3 (see the low-level connections for example), the lateral connection search space can be represented as a bidirectional graph of K'=6 nodes (branches) within the modalities M. In particular, the lateral connections from the lower frame rate branches to the higher ones are not considered because we assume that the lower frame rate branches always carry less information than the higher ones. Thus, there are 18 edges (lateral connections) in total inside the bidirectional graph, and each edge consists of the candidate operations weighted by its corresponding architecture parameter $\alpha_c$. The final output of each node is the depth-wise concatenation of all outputs of the incoming edges.
On the other hand, the design of the operation space is different. The operation space for lateral connections is denoted as $O_c$, which consists of two parts: 1) 'Zero', 'Conv 5x1x1', 'CDC-T-06 5x1x1', 'CDC-TR-03 5x1x1' with stride=(4,1,1) for the edges from the high frame rate branches to the low frame rate branches; and 2) 'Zero', 'Conv 3x1x1', 'CDC-T-06 3x1x1', 'CDC-TR-03 3x1x1' with stride=(2,1,1) for the others. In particular, stride=1 is utilized for edges between the branches of different modalities with the same frame rate. When the search has converged, for each edge (i, j), only the operation in $O_c$ with the largest $\alpha_c^{(i,j)}$ is adopted. With the two-stage multi-rate and multi-modal NAS in Algorithm 1, both the final backbones and lateral connections are derived.
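The edge count and the stride rules of the second-stage search space can be reproduced by enumeration. The sketch below is an illustration under stated assumptions: branch names, frame counts (8/16/32, taken from the example in Fig. 2), and the reading that rate-reducing connections exist both within and across modalities are inferred from the stated total of 18 edges rather than quoted from the paper.

```python
from itertools import product

MODALITIES = ("rgb", "depth")
RATES = {"low": 8, "mid": 16, "high": 32}   # frames per branch (example values)


def lateral_edges():
    """List (source, target, temporal_stride) for every candidate connection."""
    nodes = list(product(MODALITIES, RATES))
    edges = []
    for src, dst in product(nodes, nodes):
        (src_mod, src_rate), (dst_mod, dst_rate) = src, dst
        src_frames, dst_frames = RATES[src_rate], RATES[dst_rate]
        if src_frames > dst_frames:
            # Higher rate feeds lower rate; the stride matches the lengths.
            edges.append((src, dst, src_frames // dst_frames))
        elif src_frames == dst_frames and src_mod != dst_mod:
            # Same rate, different modality: no temporal down-sampling.
            edges.append((src, dst, 1))
    return edges


def candidate_ops(stride):
    """Operation names per edge: 5x1x1 kernels for stride 4, else 3x1x1."""
    kernel = 5 if stride == 4 else 3
    names = ("Conv", "CDC-T-06", "CDC-TR-03")
    return ["Zero"] + [f"{name} {kernel}x1x1" for name in names]


EDGES = lateral_edges()
```

Under these assumptions the enumeration yields 4 high-to-low edges with stride 4, 8 edges with stride 2, and 6 same-rate cross-modal edges with stride 1, which adds up to the 18 edges mentioned in the text.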

IV. EXPERIMENTS
In this part, we first give details of the benchmark datasets and experimental setup. Then, we thoroughly evaluate the impacts of the 3D-CDC family, the multi-rate configuration, and the two-stage NAS. Finally, we evaluate and compare our results with state-of-the-art methods on three benchmark datasets.

B. Implementation Details
Our proposed method is implemented with PyTorch. Cell nodes K=4 and K'=6 are used as the default setting. The optical flow is extracted by pyflow [65], a Python wrapper for dense optical flow [63]. In the search phase, partial channel connection and edge normalization [51] are adopted. The initial channel numbers for low, mid, and high frame rate branches are 24, 16, and 8, respectively, which double after searching. There are 8 cells for each branch in the search stage, which increases to 12 cells after searching. The SGD optimizer with learning rate lr=1e-2 and weight decay wd=5e-5 is utilized when training the network weights. The architecture parameters are trained with the Adam optimizer with lr=6e-4 and wd=1e-3. The lr decays with factor 0.5 in the 20th epoch. We search for 30 epochs on the training set of the IsoGD dataset [2] with batch size 20, while the architecture parameters are not updated in the first 10 epochs. Specifically, $L_{train}$ is calculated on the first half of the training set while $L_{val}$ is calculated on the latter half. The whole two-stage NAS costs 12 days on four P100s. In the training phase, models are trained with the SGD optimizer with initial lr=1e-2 and wd=5e-5. The lr decays with factor 0.1 when the validation accuracy has not improved within 3 epochs. Random horizontal flip and spatial crops are utilized for data augmentation. We train models with batch size 12 for a maximum of 80 epochs on four RTX-2080Ti GPUs.
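The search-phase schedule described above can be written down as a few small helpers. This is only a sketch of the stated hyperparameters: the 0-indexed epoch convention, the interpretation that the decay applies from epoch index 20 onward, and all names are assumptions.

```python
SEARCH_EPOCHS = 30       # total search epochs
WARMUP_EPOCHS = 10       # architecture parameters frozen during these epochs
WEIGHT_LR = 1e-2         # SGD learning rate for network weights
LR_DECAY_EPOCH = 20      # assumed 0-indexed epoch at which the lr is halved
LR_DECAY_FACTOR = 0.5


def weight_lr(epoch):
    """Learning rate of the network weights at a 0-indexed search epoch."""
    return WEIGHT_LR * (LR_DECAY_FACTOR if epoch >= LR_DECAY_EPOCH else 1.0)


def updates_architecture(epoch):
    """Architecture parameters are only updated after the warm-up epochs."""
    return epoch >= WARMUP_EPOCHS


def split_training_set(samples):
    """First half is used for the training loss, second half for validation."""
    half = len(samples) // 2
    return samples[:half], samples[half:]
```

Freezing the architecture parameters for the first epochs gives the shared weights time to become informative before they are used to rank candidate operations.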

C. Ablation Study
All ablation studies are trained from scratch and evaluated on the validation set of the IsoGD dataset.
Impact of 3D-CDC for Modalities. In these experiments, we use C3D [29] as the backbone and sequences of size 3×16×112×112 as the inputs. According to Eq. (3), (4) and (5), θ controls the contribution of the temporal difference cues. As illustrated in Fig. 4, 3D-CDC-T improves the accuracy of the RGB modality dramatically. Compared with vanilla 3D convolution (i.e., θ=0), 3D-CDC-T gains 7.3% when θ=0.6, which indicates the advantages of temporal difference context. One highlight is that, assembled with 3D-CDC-T, the RGB modality is able to obtain accuracy comparable to that of the optical flow modality (44.06% vs. 47.02%), indicating an excellent dynamic modeling capacity of 3D-CDC-T. In terms of the depth and optical flow modalities, the best performance (42.86% and 52%) is achieved when using 3D-CDC-TR with θ=0.3 and θ=0.4, respectively. Compared with 3D-CDC-T, 3D-CDC-TR is more robust for the depth and optical flow modalities because it alleviates sensor noise and pre-processing artifacts between frames in these two modalities. We observe that 3D-CDC-ST performs relatively poorly. The reason might be that the gesture recognition task relies more on temporal reasoning context than on spatial gradient cues and appearance details. Given their enhanced temporal representation abilities, 3D-CDC-T with θ=0.6 and 3D-CDC-TR with θ=0.3 are included in our NAS operation space.
Impact of Multi-Rate Branches. As gestures have various temporal scales, modeling such visual tempos of different gestures facilitates their recognition. Here we conduct the ablation study with the C3D [29] network to explore how branches with different frame rates influence the gesture recognition task. As illustrated in Fig. 5, for the single-rate network, the higher its frame rate, the better its performance. This is because a higher frame rate usually loses less temporal information during sampling and thus retains richer fine-grained temporal cues for gesture recognition. Furthermore, the performance can be further improved by combining multi-rate branches (e.g., '16 + 32 frames' and '8 + 16 + 32 frames'). In terms of the impact of multi-rate branches for different modalities, multi-rate branches clearly contribute more to the RGB than to the depth modality. When assembled with 3D-CDC-T for RGB or 3D-CDC-TR for depth, the trends of the multi-rate networks are analogous to the vanilla cases but with holistic performance gains (due to the excellent representation capacity of 3D-CDC).
We also reimplement SlowFast [15] Networks (trained from scratch) with '8+32 frames' multi-rate setting on the IsoGD dataset.However, it only achieves respective 22.28% and 40.69% accuracy on RGB and depth inputs, which indicates the importance of suitable architecture design for multi-rate networks in the gesture recognition task.
Effectiveness of the Two-stage NAS. Based on the best multi-rate setting ('8+16+32 frames'), we study the two-stage NAS for both single and multiple modalities. The first stage NAS (NAS1) intends to find well-suited multi-rate single-modal networks. As illustrated in Table I, when searching over the vanilla search space w/o CDC, the architectures found by NAS1 improve accuracy by 0.79% and 0.47% (compared with multi-rate C3D) for RGB and depth inputs, respectively. Moreover, the gains consistently occur when searching over the 3D-CDC based search space for both RGB (+1.12%) and depth (+0.8%) modalities.
Based on the searched multi-rate networks for RGB and depth modalities, 'NAS2 Fixed' utilizes late fusion directly without searching the lateral connections between the two modalities. Unexpectedly, it performs even worse than the single-modal NAS1 searched models. This means that simple late fusion suffers from insufficient information exchange at the feature level. With the second stage NAS (NAS2), 'NAS2 Unshared' achieves more than 50% accuracy, which indicates the advantage of using NAS to mine an efficient integration of multi-rate and multi-modal branches. Furthermore, compared with 'NAS2 Shared', which searches shared lateral connections for the low-, mid- and high-levels, 'NAS2 Unshared' performs better (+3.74%), which implies the importance of a specific design for the interactions at each level.

D. Comparison with State-of-the-art Methods
After studying the components in Sec. IV-C, we evaluate our models on three benchmark datasets. Note that in this subsection, our models are first pre-trained on the Jester [68] gesture dataset, similar to [45], [32], [69].
Results on IsoGD. As shown in Table II, although the existing methods [8], [45], [11] adopt 3DCNNs to learn from a single RGB or depth modality, it is still challenging to represent discriminative and robust spatio-temporal features with vanilla 3D convolutions and coarsely designed architectures. With the enhanced temporal representation capacity via 3D-CDC and multi-rate collaboration, our proposed multi-rate single-modal NAS method 'NAS1' obtains the best accuracy on every single modality. This demonstrates the superiority of the searched architecture. In terms of RGB-D gesture recognition, our searched architecture with two-stage NAS (NAS2) obtains more than 1% and 4% accuracy gains compared with 'NAS1' on the RGB and depth modality, respectively. This demonstrates the effectiveness of RGB-D-temporal synergies at earlier stages. Similar to [32], which ensembles the results from varied modalities, our 'NAS1+NAS2' boosts the accuracy to 65.54%.
To evaluate the modality generalization of the architecture searched from RGB-D, we retrain the same model 'NAS2' with RGB-Flow and Flow-Depth modalities separately and also achieve comparable performance (61.22% and 62.47%, respectively). To the best of our knowledge, this is the first work to explore the modality generalization issue for multi-modal NAS. Finally, the best accuracy (66.23%) is achieved when ensembling the scores from all three 'NAS2' models among RGB-D-Flow modalities. Although FOANet [38] reports the best performance (80.96%) on IsoGD, the high accuracy is achieved by fusing 12 channels (i.e., global/left/right channels for four modalities) with manual hand detection. Note that without hand detection preprocessing, our 'NAS2 all' outperforms FOANet (66.23% vs. 61.4%) by a large margin using only RGB-D-Flow modalities. This again demonstrates the superiority of our searched architectures.
Results on NVGesture. Table III compares the performance of our method with SOTA methods on the NVGesture dataset. It can be seen that our approach performs the best for both single-modal and multi-modal testing, which indicates that 1) our searched architecture is able to represent discriminative spatio-temporal features for single/multi-modal gesture recognition; and 2) the architecture searched on the source dataset (IsoGD) via the proposed two-stage NAS transfers well to the target dataset (NVGesture), demonstrating the excellent generalization ability of the proposed NAS method.
Fig. 6 evaluates the coherence between the predicted labels from the searched 'NAS1' and 'NAS2' architectures and the ground truths on the NVGesture dataset. The coherence is calculated from their confusion matrices. We observe that with RGB-D-temporal synergies, 'NAS2' has less confusion between the input classes and provides a generally more diagonalized confusion matrix. This improvement is most visible in the first three and last six classes.
Results on EgoGesture. We also evaluate the robust- [69] needs an extra detector to capture the key segments for preprocessing. As our proposed multi-rate and multi-modal network recognizes the gesture on original video clips directly, the performance might be further boosted with the gesture detector.

V. DISCUSSION AND VISUALIZATION
In this section, we first discuss the transferability of the proposed two-stage NAS and 3D-CDC to the action recognition task, which is interesting and necessary because there exist task-oriented biases (gesture recognition relies less on the scene and more on fine-grained temporal cues when compared with action recognition). Then, we analyze the visualization results of the searched architecture and feature response, which are shown in https://github.com/ZitongYu/3DCDC-NAS.
TABLE V: Results on the RGB-D action recognition dataset THU-READ [73] with the cross-subject protocol. The architectures of 'NAS1' and 'NAS2' are searched on IsoGD and then retrained/evaluated on THU-READ.
A. Task Transferability
Generalization to RGB-D Action Recognition. In order to validate the generalization ability of our 3D-CDC based two-stage NAS, we transfer the searched architectures ('NAS1' and 'NAS2') to another multi-modal video understanding task, i.e., RGB-D action recognition. Here one of the largest RGB-D egocentric datasets, THU-READ [73], is used for experiments. It consists of 40 different actions and 1920 videos. We adopt the released leave-one-split-out protocol. For fair comparison, C3D [29], SlowFast [15], 'NAS1', and 'NAS2' are pretrained on the Jester gesture dataset. Table V compares the performance of our method with SOTA methods on THU-READ. It can be seen that our approach outperforms the mainstream 3DCNNs (e.g., C3D [29] and SlowFast [15]) by a convincing margin, indicating that the architecture searched on the source task (gesture recognition) generalizes well to the target video understanding task (e.g., action recognition).
Impact of 3D-CDC for RGB Action Recognition. Here we explore the effectiveness of 3D-CDC for scene-based action recognition tasks. Fig. 7 illustrates the results on two classical scene-related action datasets (UCF101 [77] and HMDB51 [78]). Compared with the 3D vanilla convolution (θ = 0), 3D-CDC-ST improves the performance dramatically on both datasets (+3% for UCF101 when θ = 0.6 and +2.1% for HMDB51 when θ = 0.4, 0.6, 0.8). The reason might be twofold. On one hand, an enhanced spatio-temporal context is helpful to represent scene-aware appearance and motions. On the other hand, the spatio-temporal difference term can be regarded as a regularization term that alleviates overfitting. Another highlight is that, without extra parameters, 3D-ResNet18 assembled with 3D-CDC-ST outperforms that with the Spatio-Temporal Channel Correlation (STC) Block [34] by +2.1% on UCF101 split 1. In contrast, 3D-CDC-T performs the worst because of its weak spatial context representation capacity and its vulnerability to scene noise, both of which matter in these two scene-aware datasets.

B. Visualization of the Searched Architecture
Here we examine the searched cells and lateral connections obtained with the proposed two-stage NAS. It can be seen that there are more 'CDC-T-06' based operators in all three RGB branches, while there are more 'CDC-TR-03' based operators in the depth branches. This is consistent with the observations in our ablation study 'Impact of 3D-CDC for Modalities' in Sec. IV-C. It is interesting to find that the lower-level lateral connections are sparser (i.e., with more 'Zero' operators) and the high-level lateral connections have more learnable operators (i.e., convolution operators). This might inspire the video understanding community to design more reasonable multi-modal networks in the future.

C. Feature Visualization
The neural activations (before Pool3 in C3D) are visualized. It can be seen that the proposed 3D-CDC-ST, 3D-CDC-T, and 3D-CDC-TR enhance the spatio-temporal representation and force the model to focus more on the trajectories of arms and hands. As for the depth modality, all the convolutions are able to attend accurately to the movements of arms and hands, thanks to the foreground cues provided by the depth inputs. Despite the more robust representation achieved by the 3D-CDC family, interference from sensor-based noise and undesirable movements (e.g., head and hair) still occurs. Thus, it is necessary to explore RGB-D-temporal synergies to overcome this limitation.

Fig. 1: Our proposed 3D-CDC family with kernel size 3, which could be used as novel operators for NAS. (a) 3D-CDC-ST considers the central difference information in the whole local spatio-temporal regions. (b) 3D-CDC-T only calculates the central difference clues from the local spatio-temporal regions of the adjacent frames. (c) 3D-CDC-TR is similar to 3D-CDC-T but adopts the temporal central mean pooling before calculating the central difference term, which is more robust to temporal noise. The symbols and ⊕ denote element-wise subtraction and mean operations, respectively. (d) Feature response of various convolutions in RGB modality. Compared with vanilla 3D convolution, the 3D-CDC family enhances the temporal context obviously.

Algorithm 1: Two-Stage Multi-Rate and Multi-Modal NAS
Stage 1: Fix lateral connections, and search backbones
For multi-rate backbones, create a mixed operation $\tilde{o}_b^{(i,j)}$ parametrized by $\alpha_b^{(i,j)}$ for each edge (i, j)
1: for each of modalities M do
2:   Fix the lateral connections among multi-rate branches
3:   while not converged do
4:     Update architecture $\alpha_b$ by descending $\nabla_{\alpha_b} L_{val}(\Phi_b, \alpha_b)$
5:     Update weights $\Phi_b$ by descending $\nabla_{\Phi_b} L_{train}(\Phi_b, \alpha_b)$
6:   end
7:   Derive the final backbone of the current modality based on the learned $\alpha_b$
8: end
Stage 2: Fix backbones, and search lateral connections
For lateral connections, create a mixed operation $\tilde{o}_c^{(i,j)}$ parametrized by $\alpha_c^{(i,j)}$ for each edge (i, j)
9: Initialize and fix the multi-rate and multi-modal backbones searched in Stage 1
10: while not converged do
11:   Update architecture $\alpha_c$ by descending $\nabla_{\alpha_c} L_{val}(\Phi_c, \alpha_c)$
12:   Update weights $\Phi_c$ by descending $\nabla_{\Phi_c} L_{train}(\Phi_c, \alpha_c)$
13: end
14: Derive the final lateral connections based on the learned $\alpha_c$
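The control flow of Algorithm 1 can be sketched independently of any deep learning framework. In the sketch below the gradient steps are passed in as callbacks, and the "while not converged" loops are replaced by a fixed number of steps; both choices, as well as every function name, are assumptions made to keep the example self-contained.

```python
def search(update_architecture, update_weights, num_steps):
    """Alternate architecture and weight updates for a fixed number of steps."""
    for _ in range(num_steps):
        update_architecture()   # one step on the validation loss w.r.t. alpha
        update_weights()        # one step on the training loss w.r.t. Phi


def two_stage_nas(modalities, make_stage1_steps, make_stage2_steps, derive,
                  num_steps=3):
    """Stage 1 searches one backbone per modality; Stage 2 searches the
    lateral connections on top of the fixed, searched backbones."""
    backbones = {}
    for modality in modalities:
        step_alpha, step_phi = make_stage1_steps(modality)
        search(step_alpha, step_phi, num_steps)
        backbones[modality] = derive("backbone", modality)

    step_alpha, step_phi = make_stage2_steps(backbones)
    search(step_alpha, step_phi, num_steps)
    lateral = derive("lateral", tuple(modalities))
    return backbones, lateral
```

In a real search, the callbacks would wrap optimizer steps on the supernet, and `derive` would discretize the learned architecture parameters.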

$o_b^{(i,j)} = \arg\max_{o_b \in O_b,\, o_b \neq \text{zero}} \eta_{o_b}^{(i,j)}$, and 2) for each intermediate node, choosing two incoming edges with the two largest values of $\max_{o_b \in O_b,\, o_b \neq \text{zero}} \eta_{o_b}^{(i,j)}$.
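The derivation rule stated here (keep the strongest non-zero operation on every edge, then keep the two strongest incoming edges of every intermediate node) can be sketched as follows. The data layout, a dictionary from edge (i, j) to a list of architecture parameters, and the function names are assumptions for illustration.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    shift = max(alphas)
    exps = [math.exp(a - shift) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def derive_cell(alpha, op_names, edges_per_node=2):
    """Discretize a relaxed cell.

    alpha maps each edge (i, j) to one parameter per candidate operation.
    Returns {j: [(i, op_name), ...]} with the retained incoming edges.
    """
    zero = op_names.index("Zero")
    scored = {}
    for (i, j), params in alpha.items():
        eta = softmax(params)
        candidates = [k for k in range(len(eta)) if k != zero]
        best = max(candidates, key=lambda k: eta[k])   # strongest non-zero op
        scored.setdefault(j, []).append((eta[best], i, op_names[best]))

    cell = {}
    for j, entries in scored.items():
        top = sorted(entries, reverse=True)[:edges_per_node]
        cell[j] = sorted((i, name) for _, i, name in top)
    return cell


OPS = ["Zero", "Identity", "CDC-T-06 3x3x3"]
ALPHA = {
    (0, 2): [5.0, 0.0, 1.0],
    (1, 2): [0.0, 2.0, 0.0],
    (0, 3): [0.0, 0.0, 3.0],
    (1, 3): [0.0, 1.0, 0.0],
    (2, 3): [0.0, 0.0, 0.5],
}
CELL = derive_cell(ALPHA, OPS)
```

Note that 'Zero' is excluded when choosing the operation of an edge, but its softmax mass still lowers the score of that edge, so edges dominated by 'Zero' tend to be dropped in the second step.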
focuses on touchless driver controlling. It contains 1532 dynamic gestures falling into 25 classes. It includes 1050 samples for training and 482 for testing. The videos are recorded with three modalities (RGB, depth, and infrared). For fair evaluations with SOTA methods, the infrared modality is not used in our experiments. The EgoGesture dataset [4] is a large multi-modal egocentric hand gesture dataset. It contains 24,161 hand gesture clips of 83 classes of gestures, performed by 50 subjects. Videos in this dataset are captured with an Intel RealSense SR300 device in RGB-D modalities across multiple indoor and outdoor scenes.

Fig. 6: The confusion matrices obtained by comparing the ground-truth labels and the predicted labels from the NAS1 and NAS2 networks trained on the NVGesture dataset. Best seen on the computer, in color and zoomed in.

TABLE I: Comparison among various configurations of the two-stage NAS for varied modalities. The upper part is about the first stage NAS1 while the bottom part is about the second stage NAS2. The evaluation metric is accuracy (%).