AutoGCN-Toward Generic Human Activity Recognition With Neural Architecture Search

This paper introduces AutoGCN, a generic Neural Architecture Search (NAS) algorithm for Human Activity Recognition (HAR) using Graph Convolution Networks (GCNs). HAR has enjoyed increased attention due to advances in deep learning, increased data availability, and enhanced computational capabilities. Concurrently, GCNs have shown promising abilities in modeling relationships between body key points in a skeletal graph. Typically, domain experts develop dataset-specific GCN-based methods, which limits their applicability beyond the specific context. AutoGCN seeks to address this limitation by simultaneously searching for the ideal hyperparameters and architecture combination within a versatile search space using a reinforcement controller while balancing optimal exploration and exploitation behavior with a knowledge reservoir during the search process. We conduct extensive experiments on two large datasets focused on skeleton-based action recognition to assess the proposed algorithm’s performance. Our experimental results demonstrate the effectiveness of AutoGCN in constructing optimal GCN architectures for HAR, outperforming conventional NAS and GCN methods, as well as random search. These findings highlight the significance of a diverse search space and an expressive input representation to achieve good model performance and generalizability.


Introduction
Human Activity Recognition (HAR), the process of identifying and categorizing human activities, has witnessed a surge in interest in recent years. This momentum is attributed to advancements in deep learning techniques, the increased availability of data, and enhanced computational capabilities [1]. HAR has applications in a wide range of domains, including video surveillance, human-computer interaction, and healthcare, highlighting its significance and broad impact [1,2,3,4]. Within the realm of HAR, the models aim to capture representations of humans using a skeleton graph built of body key points [1]. In the context of HAR, the traditional feature extraction method relies on manually crafted descriptors to capture fundamental attributes [5,6]. Another approach involves formulating the problem as a deep learning task in Euclidean space. This approach uses the capabilities of Convolutional Neural Networks and Recurrent Neural Networks associated with recognition tasks. However, approaches involving these architectures have a significant shortcoming in modeling the inherent structures in skeletons, as the input data is serialized [1]. Recently, Graph Neural Networks (GNNs), specifically Graph Convolution Networks (GCNs), have emerged as a cutting-edge solution in HAR [7]. GCNs have achieved remarkable performance by leveraging the relationships between joints and body segments, treating joints as nodes and segments as edges encoded in an adjacency matrix, which makes convolutional operations possible [1,7].
While GCN-based methods have shown promising results, they are primarily developed by domain experts and often tailored to specific datasets through a trial-and-error process. This limits their potential impact, as applying them to a broader range of problems, or using them when domain knowledge is unavailable, can be challenging. One approach to overcoming those challenges is automatically building task-specific deep learning models using Neural Architecture Search (NAS) [8]. By employing NAS, researchers can discover optimal architectures from a defined search space through a search procedure, enhancing the overall performance and generalizability of the resulting models.
In this work, we propose a generic NAS algorithm, AutoGCN, that dynamically learns both hyper- and architecture parameters to construct a framework that maximally fits the HAR task. We present the following contributions:
• A generic approach to construct a GCN architecture for HAR with NAS, which can concurrently optimize hyperparameters and the model architecture. Additionally, the approach can be applied to different datasets within HAR and performs better than random search and other NAS procedures for GCNs.
• An update mechanism incorporating a knowledge reservoir to balance exploration and exploitation during the search process, leading to an accelerated and efficient search procedure while ensuring optimal exploration behavior.
• A diverse search space construction that enables the usage of different input representations to enhance the network performance and generalizability further.
Related Work

Skeleton-based HAR
HAR within the domain of skeletal data attracts increasing interest due to its robustness against background noise, occlusion, and scale changes when compared to conventional RGB methods [3]. Since the skeleton data modality is non-Euclidean, current approaches commonly use GCNs, as these can operate on unstructured data. The first work in this field stems from [7], introducing ST-GCN to learn spatial and temporal patterns from the input data with a tailored GCN layer. This foundation makes handcrafted feature engineering obsolete while achieving superior performance compared to previous methods [7]. Several subsequent studies have adopted this technique and introduced multi-stream architectures to incorporate different kinematic variables from which the model can learn. A two-stream adaptive graph convolutional network named 2s-AGCN was introduced in [9] to consider the skeleton's second-order information and learn the graph's topology. In a later study [10], Efficient-GCN was introduced, built upon three input branches and incorporating techniques from separable convolution to make the network more efficient. While these approaches yield good performance, they are all designed by human experts, limiting their generalizability to scenarios beyond the tested databases or usage by non-domain experts. This leaves the problem of the automatic construction of generic models unresolved.

Neural Architecture Search
For most tasks, deep neural networks must have a customized architecture to accomplish their objectives effectively. Initially, deep neural networks were used to automate the manual feature engineering aspect of machine learning. The field has since progressed further with the introduction of NAS and automated machine learning (AutoML), which automates the construction of neural architectures and accompanying tasks like hyperparameter optimization, e.g., batch size, learning rate, or algorithm selection [11]. These approaches have already given rise to architectures in the area of image classification that outperform models designed by human experts, as highlighted in studies such as [8,12].
Zoph et al. made significant contributions to the field of NAS with their pioneering work [8]. Their approach introduced a reinforcement learning algorithm searching for optimal architectures, yielding competitive performance on the CIFAR-10 and Penn Treebank benchmarks [13,14]. However, it is worth noting that this method employed a substantial amount of computational resources, utilizing 800 GPUs over two weeks [12]. Significant research has focused on improving the efficiency of NAS approaches to address the first approach's resource-intensive aspect. For example, in the work that developed the Differentiable Architecture Search (DARTS) technique [15], a gradient-based optimization methodology was presented, enabling efficient exploration of designs by utilizing a continuous relaxation of the search space. Another influential study, [16], introduced the Efficient Neural Architecture Search (ENAS) approach, which addressed the computational inefficiency of NAS through parameter sharing across child models, yielding greater efficiency and scalability. The Progressive Neural Architecture Search (PNAS) [17] concept established a step-by-step expansion strategy to shorten the search process while maintaining competitive performance. Furthermore, the NASNet technique [18] focused on identifying transferable architectural building blocks that perform well across various tasks and datasets.
A new approach named AutoHAS, incorporating hyperparameters and architecture components in the search, was introduced in [19]. This comprehensive method aims to optimize the architectural design and the associated hyperparameters, like the chosen optimizer and its learning rate, for increased performance. To this end, the shared network weights and a reinforcement learning controller are updated alternately to find the best probability distribution in the combined search space, resulting in the best-performing architecture and hyperparameter configuration. By extending the framework to incorporate hyperparameter optimization, AutoHAS went beyond the traditional scope of NAS and encompassed a broader spectrum of optimization possibilities. The inclusion of hyperparameter search is a significant enhancement, as it allows for the tuning of factors such as learning rates, regularization parameters, and batch sizes, which strongly affect model performance [12]. However, AutoHAS relies on the straightforward Reinforce algorithm [20], which overlooks previously sampled and trained candidate architectures during the update procedure, resulting in an imbalance between exploration and exploitation behaviors. Additionally, the architectural search is confined to basic building blocks, and the exploration of hyperparameters is restricted to selecting the most suitable learning rate and optimizer.
To date, NAS research has primarily focused on and been applied to image classification tasks. In the domain of HAR GCNs, NAS research has been limited to [21,22], to the best of our knowledge. In the work of [21], NAS was employed to discover the most effective architecture by approximating spatiotemporal cues and leveraging Chebyshev polynomials of varying orders. This approach involved constructing a search space consisting of three dynamic graph substructures: spatial, temporal, and spatiotemporal. Additionally, the authors searched for eight function modules that were subsequently applied to each network layer. Expanding upon these concepts, SNAS-GCN represents an extension of this research, optimizing the search space and implementing a single-path one-shot approach for improved search efficiency [22]. The experimental findings of SNAS-GCN demonstrate a reduction in search time compared to the previous work [21], albeit with a diminished level of accuracy. Despite the significant contributions of [21,22] to the field of skeleton GCNs through utilizing NAS, their work has certain limitations. One limitation lies in their reliance on approximating spatiotemporal cues and utilizing Chebyshev polynomials, which can lead to information loss and impact the derived architectures' accuracy and robustness. Furthermore, the search for a modest set of eight function modules applied to the network layers may restrict the expressiveness and flexibility of the discovered architectures. The lack of flexibility in the network architecture is a further notable limitation, as it is fixed to ten layers with uniform channel sizes and composition. This rigidity restricts the expressiveness of the architecture, potentially hindering its ability to capture complex patterns in the data effectively. In addition, the authors search for the spatial and temporal filters solely on the joint data while sharing them with the other four dataset modalities within NTU RGB+D [23] and, as a result, overlook the potential variance present in these modalities. Moreover, the final performance of the architecture relies on an ensemble score derived from two distinct data modalities, making it necessary to use more resources to search and train those models. Considering these limitations, with a more extensive and diverse search space combined with an ample data representation, the model could leverage a richer set of operations and adapt better to the specific task requirements, potentially leading to improved performance.

AutoGCN
AutoGCN is a novel NAS framework that predicts action classes for a given sequence of skeletons. In this section, we provide AutoGCN's technical details. First, we introduce some preliminaries and the data representation as the foundational input features. Second, we discuss the defined search space and its implications. Third and finally, we explain the search algorithm in detail. Fig. (1) illustrates the workflow in the AutoGCN framework.

Preliminaries
The human skeleton is modeled as a spatiotemporal graph by connecting the spatial graph of each time frame along a defined temporal dimension d, enabling the network to process spatial and temporal information effectively. Therefore, the skeleton sequence is transformed into an undirected graph described as $G = (V, E, A)$, where $n = |V|$ denotes the number of nodes, which are connected by $|E|$ edges. These connections are represented in an adjacency matrix $A \in \mathbb{R}^{n \times n}$.
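As a minimal sketch of this graph construction, the snippet below builds the symmetric adjacency matrix of an undirected graph from an edge list. The five-joint chain and the name `build_adjacency` are hypothetical and only serve as illustration; they are not the skeleton layout of the datasets used later.

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """Return the symmetric n x n adjacency matrix of an undirected graph."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # undirected graph: mirror every edge
    return A

# Hypothetical toy skeleton: five joints connected as a chain 0-1-2-3-4.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = build_adjacency(5, toy_edges)
```

Each of the |E| edges contributes two symmetric entries, so the matrix of this toy graph has eight non-zero elements.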

Data representation
Prior research on skeleton-based action recognition demonstrated the paramount importance of data preprocessing [24,25,26,27,9,28,29]. In this study, the input features are categorized into four distinct groups: 1) position $P$, 2) velocity $V$, 3) bone $B \in \{L, \beta\}$, and 4) acceleration $A$, corresponding to the fundamental kinematic properties of the human body [30]. The input $X \in \mathbb{R}^{C_{in} \times T_{in} \times V_{in} \times M_{in}}$ represents a sequence of skeleton data, where $C_{in}$ denotes the number of channels indicating the {x, y, z} orientation, $T_{in}$ represents the number of frames in the sequence, $V_{in}$ specifies the number of vertices representing spatial points, and $M_{in}$ is the number of skeletons present in the sequence. This input sequence is subsequently partitioned into the mentioned categories. The relative position $P$ is calculated via

$$P = \{\, x[:, :, i] - x[:, :, c] \mid i = 1, 2, \dots, V_{in} \,\}, \tag{1}$$

where $c$ is the chosen central joint. The velocity $V$ is determined via

$$V = \{\, x[:, t + d, :] - x[:, t, :] \mid t = 1, 2, \dots, T_{in} \,\}. \tag{2}$$

The parameter $d$ determines the timeframes over which the velocity is measured, with different values yielding different temporal spans. In this study, $d$ takes values in {1, 2}. To calculate the acceleration $A$, the velocity from (2) is differentiated with respect to time. Since the equation involves discrete time steps, finite differences are used to approximate the derivative. With $v$ denoting the velocity from (2), the acceleration $A$ is then calculated as follows:

$$A = \{\, v[:, t + d, :] - v[:, t, :] \mid t = 1, 2, \dots, T_{in} \,\}. \tag{3}$$

An 8th-order Butterworth filter is also applied to the acceleration feature to compensate for the noise introduced by the derivative [31]. This filtering step is critical in enhancing the signal quality, enabling a more accurate and reliable representation of the underlying data. The bone features $B$ are divided into bone length $L$ and respective angles $\beta$, which have been introduced in [10]:

$$L = \{\, x[:, :, i] - x[:, :, i_{adj}] \mid i = 1, 2, \dots, V_{in} \,\},$$

$$\beta = \left\{ \arccos\!\left( \frac{L_{i,w}}{\sqrt{L_{i,x}^2 + L_{i,y}^2 + L_{i,z}^2}} \right) \;\middle|\; w \in \{x, y, z\} \right\}.$$

Here $i_{adj}$ is the adjacent joint, while $w \in \{x, y, z\}$ stands for the three orientation coordinates of the skeleton.
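The four feature groups can be illustrated with a short NumPy sketch. It assumes an input array of shape (C, T, V, M) and treats velocity and acceleration as finite differences over d frames; the zero-padding of the last d frames, the `parents` list describing adjacent joints, and all function names are assumptions made for this example. The Butterworth filtering step is omitted.

```python
import numpy as np

def relative_position(x, c):
    """Position of every joint relative to the central joint c."""
    return x - x[:, :, c:c + 1, :]

def temporal_difference(x, d):
    """Finite difference over d frames; the last d frames are zero-padded (assumption)."""
    out = np.zeros_like(x)
    out[:, :-d] = x[:, d:] - x[:, :-d]
    return out

def bone_features(x, parents, eps=1e-8):
    """Bone vectors to the adjacent joint and their angles to the x, y, z axes."""
    length = x - x[:, :, parents, :]
    norm = np.sqrt((length ** 2).sum(axis=0, keepdims=True)) + eps
    angle = np.arccos(np.clip(length / norm, -1.0, 1.0))
    return length, angle

# Hypothetical input: 3 channels, 10 frames, 5 joints, 1 skeleton.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10, 5, 1))
parents = [0, 0, 1, 2, 3]  # hypothetical adjacent joint of each joint

P = relative_position(x, c=0)
V = temporal_difference(x, d=1)
Acc = temporal_difference(V, d=1)  # acceleration as the difference of velocities
L, beta = bone_features(x, parents)
```

All features keep the shape of the input, so they can be stacked along the channel axis or fed to separate input branches.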

Spatial Graph Convolution
Yan et al. [7] pioneered the application of graph convolution in the domain of skeleton action recognition. The following relation encapsulates the essence of their approach:

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \cdot w(l_{ti}(v_{tj})). \tag{7}$$

Here, $v_{ti}$ represents the i-th joint at the t-th frame, and $f_{out}(\cdot)$ and $f_{in}(\cdot)$ correspond to the output and input features of the respective joints. Furthermore, $B(v_{ti})$ denotes the set of neighboring joints of $v_{ti}$ and thus defines the convolutional sampling area, specifically encompassing vertices that are 1-distance neighbors (i.e., immediate neighboring joints). The normalization factor $Z_{ti}(v_{tj})$ represents the size of the corresponding subset, and its purpose is to equalize the influence of various subsets on the final output. The weighting function $w(\cdot)$ incorporates the mapping function $l_{ti}(\cdot)$, which serves to establish diverse partitioning strategies, such as uni-labeling, distance, and spatial configuration partitioning [7]. To further refine the formulation, Equation (7) is written as

$$f_{out} = \sum_{j} W_j \, f_{in} \left( \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \odot M_j \right).$$

Here, $f_{out}$ denotes the output features, while $W_j$, $\Lambda_j$, $A_j$, and $M_j$ are the weighting, diagonal degree, adjacency, and normalization matrices, respectively. This transformation enables an enhanced representation of the skeleton sequences using graph convolution techniques [7].
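The matrix form of the spatial graph convolution can be sketched in NumPy as follows. The sketch assumes the ST-GCN-style formulation [7], in which each partition j has a symmetrically normalised adjacency matrix that is multiplied element-wise by M_j and combined with a channel-mixing weight W_j; the tensor layout (C, T, V), the small epsilon, and the function names are assumptions for illustration.

```python
import numpy as np

def normalize_adjacency(A, eps=1e-6):
    """Symmetric normalisation: Lambda^(-1/2) A Lambda^(-1/2)."""
    degree = A.sum(axis=1) + eps  # eps avoids division by zero for isolated nodes
    d_inv_sqrt = np.diag(degree ** -0.5)
    return d_inv_sqrt @ A @ d_inv_sqrt

def spatial_graph_conv(f_in, adjacencies, weights, masks):
    """Sum over partitions j of W_j f_in (normalised A_j * M_j)."""
    c_out = weights[0].shape[0]
    _, T, V = f_in.shape
    f_out = np.zeros((c_out, T, V))
    for A_j, W_j, M_j in zip(adjacencies, weights, masks):
        A_hat = normalize_adjacency(A_j) * M_j               # element-wise mask
        aggregated = np.einsum("ctv,vw->ctw", f_in, A_hat)   # aggregate over neighbours
        f_out += np.einsum("oc,ctw->otw", W_j, aggregated)   # mix channels
    return f_out

# Hypothetical example: 3 joints in a chain, two partitions (self-links, neighbours).
rng = np.random.default_rng(0)
V = 3
chain = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
adjacencies = [np.eye(V), chain]
weights = [rng.normal(size=(4, 2)) for _ in adjacencies]
masks = [np.ones((V, V)) for _ in adjacencies]
f_in = rng.normal(size=(2, 5, V))
f_out = spatial_graph_conv(f_in, adjacencies, weights, masks)
```

With a single identity partition, identity weights, and an all-ones mask, the layer reduces to (approximately) the identity mapping, which is a convenient sanity check.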

Search Space
Diverging from conventional image-based NAS algorithms, which typically incorporate a limited set of elementary building blocks, such as convolutional layers with varying kernel sizes, skip connections, or max pooling layers [32,33], AutoGCN adopts a distinctive approach. Recognizing that these simplistic blocks cannot adequately capture the data representation of a skeletal sequence, AutoGCN embraces the integration of prior knowledge. The algorithm aims to exploit the full potential of skeletal sequence information by defining the search space using architectural components known to be suitable for skeletal data [24,34,21,16,9,7]. This departure from traditional methodologies acknowledges the need for a more nuanced and informed exploration of possible architectures in this domain [12,15]. In addition to incorporating prior knowledge in the search space, AutoGCN also encompasses hyperparameters like optimizer type, momentum, batch size, L2 regularisation, and learning rate. Building upon [19], the hyperparameter search space can be formulated as

$$h = \sum_{i} C_i^h B_i \quad \text{s.t.} \quad \sum_{i} C_i^h = 1, \quad C_i^h \in [0, 1] \ \text{or} \ C_i^h \in \{0, 1\},$$

where the hyperparameter $h$ is a sum of weighted components $C_i^h$ of the predefined values $B_i$, satisfying two conditions: first, the weights sum up to 1, which ensures that they are normalized and that the combined influence of all components adds up to the final hyperparameter value; second, $C_i^h$ behaves either as a continuous value in [0, 1] or as a binary choice {0, 1}, depending on whether it represents a continuous or categorical hyperparameter.
The architectural search space can be formulated analogously [19], where n defines the chosen architecture type $C^{\alpha}_{i,j}$ and k the corresponding value out of the predefined set of architectural parameters $\Delta_{i,j}$. Just as with the hyperparameters, the architecture choices can be categorical or continuous. By considering both the architectural components and the hyperparameters, AutoGCN ensures a comprehensive search process, allowing for the fine-tuning of multiple parts to achieve an optimal outcome. Table (1) shows the entire search space and its variables.
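To make the weighted-sum formulation concrete, the sketch below samples one configuration from a small categorical search space. Each parameter has a policy (a probability per candidate value), and a sample corresponds to one-hot coefficients that sum to 1. The candidate values, the dictionary layout, and the function names are hypothetical; the actual search space is the one given in Table (1).

```python
import random

# Hypothetical candidates; the real ones are listed in Table (1).
SEARCH_SPACE = {
    "optimizer": ["SGD", "Adam", "AdamW"],
    "learning_rate": [0.1, 0.05, 0.01],
    "batch_size": [8, 16, 24],
}

def uniform_policy(space):
    """Start with equal probability for every candidate value."""
    return {name: [1.0 / len(vals)] * len(vals) for name, vals in space.items()}

def sample_configuration(space, policy, rng):
    """Sample one value per parameter; return the values and their one-hot coefficients."""
    config, choice = {}, {}
    for name, values in space.items():
        idx = rng.choices(range(len(values)), weights=policy[name], k=1)[0]
        choice[name] = [1 if i == idx else 0 for i in range(len(values))]  # sums to 1
        config[name] = values[idx]
    return config, choice

def argmax_configuration(space, policy):
    """Pick the most probable candidate for every parameter."""
    config = {}
    for name, values in space.items():
        best = max(range(len(values)), key=lambda i: policy[name][i])
        config[name] = values[best]
    return config

rng = random.Random(0)
policy = uniform_policy(SEARCH_SPACE)
config, choice = sample_configuration(SEARCH_SPACE, policy, rng)
```

For a numeric parameter, the sampled value equals the sum of the one-hot coefficients multiplied by the candidate values, which mirrors the categorical case of the formulation above.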

Architecture
As previously discussed, the architecture construction necessitates careful consideration of the unique data representation intrinsic to the skeleton sequence. To address this requirement effectively, the architecture is partitioned into separate building blocks, building upon prior research [10,29,37]. The fundamental elements comprise 1) the initialization layer, 2) the input stream search block, and 3) the main stream search block, followed by 4) the classifier block, as depicted in Fig. (2). Each block integrates various parameters derived from the comprehensive search space in Table (1). This modular approach ensures that the architecture can deal with the intricate nuances and complexities inherent in the skeleton sequence data representation.

Search Algorithm
AutoGCN seeks optimization on two fronts: the hyperparameters, denoted as $h$, and the architecture, represented as $\alpha$. This approach aligns with prior research in this field [19,12]. The overarching objective is to minimize the loss function $\mathcal{L}$ on the validation dataset $D_{val}$ while ensuring that the optimal parameters $\omega^*_{\alpha,h}$ are derived from the minimized loss for the architecture $\alpha$ and the hyperparameters $h$ on the training dataset $D_{train}$. This is articulated as follows:

$$\min_{\alpha, h} \; \mathcal{L}(\alpha, h, \omega^*_{\alpha,h}, D_{val}) \quad \text{s.t.} \quad \omega^*_{\alpha,h} = \arg\min_{\omega} \mathcal{L}(\alpha, h, \omega, D_{train}).$$

To address the potential inefficiencies of the Reinforce algorithm's sampling behavior, AutoGCN incorporates a replay memory $\mathcal{N}$ that stores a collection of N previously generated student architectures. This memory is used to enhance the sampling procedure for updating the controller. The addition of a replay memory helps mitigate the limitations of the Reinforce algorithm by utilizing a broader range of student architectures that have been saved over time. This has the effect of expanding the effective sampling size and improving the controller's decision-making process. Refer to Algorithm (1) for further details.
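The interplay between a Reinforce-style controller and a replay memory can be sketched for a single categorical search parameter. This is an illustrative simplification under stated assumptions and not a reproduction of Algorithm (1): the reward function stands in for the validation accuracy of a trained student, the mean-reward baseline, the learning rate, and the memory and batch sizes are hypothetical, and replayed samples are reused without any off-policy correction.

```python
import math
import random
from collections import deque

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, batch, baseline, lr):
    """One policy-gradient step averaged over (action, reward) pairs."""
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for action, reward in batch:
        advantage = reward - baseline
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            grads[k] += advantage * (indicator - probs[k])  # gradient of log-probability
    return [l + lr * g / len(batch) for l, g in zip(logits, grads)]

def search(reward_fn, num_choices, rollouts=20, updates=3,
           memory_size=50, replay_batch=10, lr=0.5, seed=1234):
    rng = random.Random(seed)
    logits = [0.0] * num_choices
    memory = deque(maxlen=memory_size)  # replay memory of earlier students
    for _ in range(updates):
        fresh = []
        for _ in range(rollouts):
            action = rng.choices(range(num_choices), weights=softmax(logits), k=1)[0]
            fresh.append((action, reward_fn(action)))  # "train" a student, observe reward
        replayed = rng.sample(list(memory), min(replay_batch, len(memory)))
        batch = fresh + replayed  # fresh rollouts plus remembered students
        baseline = sum(r for _, r in batch) / len(batch)
        logits = reinforce_update(logits, batch, baseline, lr)
        memory.extend(fresh)
    return softmax(logits)

# Hypothetical rewards: candidate 1 yields the best student accuracy.
final_policy = search(lambda a: [0.1, 0.9, 0.2][a], num_choices=3)
```

Reusing remembered students enlarges the batch behind every controller update without training additional networks, which is the motivation given above for the replay memory.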

Experiments
This section outlines our experimental setup of the proposed AutoGCN algorithm on two datasets: NTU RGB+D 60 and NTU RGB+D 120 [23,38]. We compare current state-of-the-art (SOTA) models and the two introduced NAS procedures on these datasets. Furthermore, we investigate the impact of controller hyperparameters on the model's performance, including the number of rollouts and update cycles. Additionally, we perform experiments to assess the importance of the acceleration feature in our data representation and the influence of the searched network size.

Datasets
NTU RGB+D is a substantial 3D human activity dataset for action recognition comprising two versions: NTU RGB+D 60 [23] and NTU RGB+D 120 [38]. NTU RGB+D 60 contains 56,880 videos, each representing one of the 60 action classes. NTU RGB+D 120 is an extension of NTU RGB+D 60, adding 60 action classes for a total of 114,480 videos. To evaluate the classification performance, we follow the cross-subject and cross-view settings for NTU RGB+D 60 and the cross-subject and cross-setup settings for NTU RGB+D 120, as suggested respectively in [23,38].
The NTU RGB+D 60 dataset is partitioned and used for evaluation as follows:
(1) Cross-Subject (X-Sub) evaluation divides the dataset based on subjects. Specifically, 20 subjects are assigned to the training set, while the remaining 20 form the test set.
(2) Cross-View (X-View) evaluation involves partitioning the dataset based on camera views. For this evaluation, camera views two and three are utilized to create the training data, while camera view one is reserved for testing.
Correspondingly, the following split protocols are suggested for NTU RGB+D 120 by the authors:
(1) Cross-Subject (X-Sub120): The training set comprises samples from 53 subjects, while the test set includes samples from the remaining 53 subjects.
(2) Cross-Setup (X-Setup120): In this protocol, samples with even setup IDs are designated for training, while samples with odd setup IDs are reserved for testing.

Implementation
In the experimental setup, the student architecture is trained for a maximum of 25 epochs, while the argmax architecture is trained for 80 epochs to ensure complete convergence. The learning rate used in training is sampled from the search space and undergoes a warm-up for the first ten epochs with a decay factor of 0.5. Subsequently, the learning rate is reduced by 0.25 at epochs 30, 50, 60, 65, and 70. An early stopping mechanism is employed to identify underperforming student architectures.
The training is performed on a single NVIDIA V100 with 32 GB GPU RAM using the PyTorch framework (version 2.0.1) [39] with the global seed set to 1234. The code and the experiment results are publicly available at https://github.com/DeepInMotion/AutoGCN.
Table 2: Comparison with SOTA models and NAS approaches on the NTU RGB+D 60 dataset with the Top-1 accuracy in (%). The square brackets indicate the 95% confidence intervals, determined via bootstrapping.

Comparison with other SOTA
To evaluate the performance of AutoGCN, it is compared with the baseline approach from Peng et al. [21] and other state-of-the-art (SOTA) HAR approaches. The best-performing models' results on the NTU RGB+D 60 and NTU RGB+D 120 datasets are shown in Tables (2) and (3), respectively. The values of $P_{max}(\alpha, h)$ for the best-performing model are listed in Table (4). Moreover, the values of the policies P for all search space parameters are listed in Appendix (A) Fig. (4) and Appendix (B) Fig. (5), accompanied by an analysis of these values.
In contrast to other SOTA methodologies, we present the achieved point estimate for the NTU RGB+D 60 dataset and the associated confidence interval, calculated using bootstrap resampling [40]. Specifically, we perform 1000 resamples of the complete test dataset with replacement while maintaining a constant training set and model configuration. Our reporting is based on the boundaries encompassing the [2.5, 97.5] percentiles, establishing a 95% confidence interval for the test set.
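The bootstrap procedure can be sketched as follows, assuming the per-sample correctness of a fixed model on the test set is available as a list of 0/1 values. The function name, the nearest-rank percentile rule, and the toy data are assumptions for illustration.

```python
import math
import random

def bootstrap_accuracy_ci(correct, n_resamples=1000, lower=2.5, upper=97.5, seed=1234):
    """Point estimate and percentile interval of the Top-1 accuracy in percent."""
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        resample = rng.choices(correct, k=n)  # draw n test samples with replacement
        accuracies.append(100.0 * sum(resample) / n)
    accuracies.sort()

    def percentile(p):
        # nearest-rank percentile of the sorted resample accuracies
        rank = max(1, math.ceil(p / 100.0 * len(accuracies)))
        return accuracies[rank - 1]

    point_estimate = 100.0 * sum(correct) / n
    return point_estimate, percentile(lower), percentile(upper)

# Hypothetical test set: 90 correct and 10 incorrect predictions.
toy_correct = [1] * 90 + [0] * 10
point, ci_low, ci_high = bootstrap_accuracy_ci(toy_correct)
```

Only the test set is resampled; the trained model and its predictions stay fixed, matching the description above.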
As shown in Table (2), our approach surpasses the NAS-GCN baseline with the Joint configuration by 0.8% for the X-View and the X-Sub dataset. In contrast to SNAS-GCN, our model demonstrates a notable increase in accuracy, with a substantial gain of 1.2% observed for both the X-Sub and X-View datasets. Moreover, it is essential to emphasize that AutoGCN achieves similar results to the baseline method NAS-GCN without the need for ensemble techniques, achieving a competitive accuracy of 95.5% on the X-View dataset. These results show the effectiveness and efficiency of AutoGCN in the context of skeleton-based human action recognition by delivering high performance without the added complexity of ensemble methods.
Table (3) demonstrates that our approach yields results comparable to the SOTA models for the NTU RGB+D 120 dataset previously reported.
As delineated in Table (4), a discernible disparity in the identified architectural components between the X-Sub and X-View datasets emerges. This disparity supports our contention from the introduction that these two datasets have inherent characteristics that cannot be effectively captured by a single searched architectural framework alone.
Table 3: Comparison with SOTA models on the NTU RGB+D 120 dataset with the Top-1 accuracy in (%).
Model                   X-Sub120    X-Setup120
ST-GCN [7]              70.7        73.2
AS-GCN [41]             77.9        78.5
2S-AGCN [9]             82.5¹       84.2¹
SGN [29]                79.2        81.5
EfficientGCN-B0 [10]

Table 6: Top-1 accuracy percentages w.r.t. the rollout hyperparameter of the controller and the update states, along with the average accuracies of the student models and Argmax accuracies for the update cycles in (%). The table showcases how the choice of rollouts impacts the accuracy of two distinct approaches, X-Sub and X-View, and the effects of the first, second, and third updates on their respective performance.

Comparison to Random Search
Random search [45] as an optimization technique for tuning hyperparameters is widely used in NAS procedures, which makes it an essential component of our experiments [46]. To ensure the fairest comparison against our proposed method, we perform the random search following the same steps as our algorithm AutoGCN: an initial cohort of 20 student architectures is trained for 25 epochs, and subsequently, the highest-performing architecture is chosen and trained for an extended period of 80 epochs. Finally, we compare the iterations taken by each approach to obtain the best-performing architectures, displayed in Table (5). It can be observed that the AutoGCN algorithm achieves a higher-performing architecture than random search after just 20 iterations for both the X-View and X-Sub datasets. To achieve point estimates comparable to those of AutoGCN, random search has to undergo another 20 iterations.
The results indicate that AutoGCN can obtain an optimal architectural configuration without being subject to the stochasticity of random search.

Influence of the Controller hyperparameters
Since the controller has fixed hyperparameters, we investigate the ideal number of rollouts for training our model, aiming to balance computational efficiency and the final model performance. The found values are shown in Table (6), in which the Top-1 accuracy percentages in relation to the rollout hyperparameter of the controller and the update stages are depicted. It also displays the average accuracies of the student models and Top-1 accuracies for the first, second, and third updates in percentage values. The table provides an overview of how the choice of rollouts impacts the accuracy of the two datasets, X-Sub and X-View. The arrow symbols describe the trend in the accuracy and average accuracy compared to the previous update: '↑' for an increase, '↓' for a decrease, '→' for no significant change. Furthermore, those results are visualized in Fig. (3).
The X-View dataset attains its highest point estimate following 30 rollouts after the second update cycle. Notably, for 30 rollouts, the controller identifies the best-performing architecture after the second update, requiring significantly more time than the experiments with only ten or 20 samples between updates while not achieving a notably higher point estimate.
Conversely, the X-Sub dataset achieves its highest point estimate after 20 rollouts, which also transpires following the first update cycle. The X-Sub dataset necessitates only one update cycle to achieve the highest point estimate for 20 and 30 rollouts. With ten rollouts, the controller achieves the highest accuracy after the second iteration.
Given that the student architectures are trained for a shorter duration of 25 epochs, the optimizer's hyperparameters significantly impact the final accuracy used as a reward for the controller. With a "fast learning" optimizer, suboptimal student architectures, which converge more rapidly, may enjoy an advantage over "slower" trained architectures in the early stages of training since those are only trained for 25 epochs.
Considering that the selection of the student architecture is stochastic, the controller may be updated with less performant models between each update cycle. This becomes evident when comparing the average accuracy for the X-Sub dataset with ten rollouts, where a notable rise in the average accuracy correlates with an increase in the highest point estimate.
On the other hand, the second update with 30 rollouts decreases the point estimate for the X-Sub dataset, suggesting that the controller becomes trapped in a suboptimal configuration after this number of updates. Furthermore, it has to be recognized that the search space contains multiple optima that can be explored, resulting in the alteration of optimal architecture builds. Consequently, different architecture configurations can achieve similar performances after optimization, which becomes evident when investigating the different sampled architecture configurations across the sampled rollouts.

Influence of the model size
We conduct experiments to investigate the optimal network size by altering the search space, as outlined in Table (1). These focus on exploring larger architectures, where we increased layers, blocks, and depth sizes in the experimental design. The results in Table (7) indicate that the X-View dataset's accuracy remains consistent for the larger and smaller sampled architecture after 20 rollouts. The small model requires far fewer parameters but slightly more FLOPs than the larger model. The larger number of FLOPs is due to the more complex Bottleneck convolutional layer chosen by the controller.
This outcome shows the effectiveness of the proposed search space, as defined in Table (1), and emphasizes the usage of smaller models that require significantly less training. The results advocate utilizing these smaller models for their computational efficiency while achieving the same performance as the large model, requiring fewer computational resources.

Table 8: Influence of the acceleration feature A on the Top-1 accuracy in (%) with the 95% confidence interval.
Input         X-Sub         X-View
P, V, B       87.5 ± 1.4    93.7 ± 1.1
P, V, B, A    88.3 ± 0.9    95.6 ± 0.8

Data representation
The newly introduced acceleration feature is analyzed to gauge the strength of the data representation. To this end, the highest-performing model identified through our search is evaluated both with and without this feature. The impact of omitting this feature from the model becomes evident when observing the results presented in Table (8), where the Top-1 accuracy demonstrates a notable decrease. When conducting a two-sided z-test with a 5% significance level, the null hypothesis of no significant difference between those data modalities can be rejected.
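One common way to carry out such a test treats each Top-1 accuracy as a proportion of correctly classified test samples and applies a pooled two-proportion z-test. The text does not state which variant of the z-test was used, so the sketch below is an assumption, and the test-set size of 10,000 samples is hypothetical; only the two accuracies are taken from the X-View column of Table (8).

```python
import math

def two_proportion_z_test(acc_a, acc_b, n_a, n_b):
    """Two-sided pooled z-test for the difference between two accuracies (as fractions)."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (acc_a - acc_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p_value

# X-View accuracies with and without acceleration; sample sizes are hypothetical.
z, p_value = two_proportion_z_test(0.956, 0.937, 10_000, 10_000)
reject_null = p_value < 0.05
```

Under these assumed sample sizes the p-value falls below the 5% level, so the null hypothesis of equal accuracies would be rejected.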

Limitations & Future Work
The proposed AutoGCN algorithm uses a reinforcement learning approach, requiring every student network to be trained from scratch at each iteration. While this strategy enables dynamic adaptation and refinement of the individual architecture and hyperparameter components, it leads to an increased computational burden in computing the reward for the controller guiding the search and updating the policies P.
To avoid training every student architecture from scratch, the search procedure could adopt so-called one-shot techniques like DARTS, in which an over-parameterized supernetwork is trained to contain every potential architecture and to enable weight sharing among students [12,15]. To implement this approach, reusable building blocks and operation types that can be recycled effectively among different student compositions would have to be defined. While such an adaptation could speed up the search procedure, it could also lower the versatility of the search space.
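To illustrate the idea behind such one-shot methods, the sketch below shows the continuous relaxation used by DARTS-style approaches: each edge of the supernetwork outputs a softmax-weighted sum over all candidate operations, and the final architecture keeps the operation with the largest architecture weight. The candidate operations here are hypothetical scalar toy functions chosen for readability; in an actual supernetwork they would be trainable layers whose weights are shared among all students. This is not part of AutoGCN.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations acting on a scalar feature.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, arch_logits):
    """Softmax-weighted sum of all candidate operations (continuous relaxation)."""
    weights = softmax(arch_logits)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(arch_logits):
    """Keep only the operation with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(names)), key=lambda k: arch_logits[k])]


# With uniform logits every operation contributes equally to the output.
uniform_output = mixed_op(3.0, [0.0, 0.0, 0.0])
chosen = discretize([0.1, 2.0, -1.0])
```

Because all candidates are evaluated inside a single network, no student has to be trained separately, but every candidate must fit the same interface, which is the loss of versatility noted above.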
In our work, speed-up techniques for the search process are limited to leveraging the knowledge reservoir N and implementing an early-stopping mechanism. Possible enhancements to accelerate the search could involve the inclusion of performance predictors [12]. Such predictors could assess the potential of a given student architecture early, allowing for a more efficient allocation of computational resources by prioritizing promising candidates and discarding unpromising ones. With such techniques, convergence towards optimal solutions could be accelerated, significantly reducing the overall search time.
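The exact early-stopping criterion is not specified in this section. A common patience-based variant is sketched below as an assumption: training of a student stops once the validation accuracy has not improved for a fixed number of consecutive epochs. The class name `EarlyStopping`, its parameters, and the validation curve are hypothetical.

```python
class EarlyStopping:
    """Patience-based early stopping on a validation metric (higher is better)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # tolerated epochs without improvement
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best = float("-inf")
        self.stale_epochs = 0

    def should_stop(self, val_accuracy):
        """Record one epoch and report whether training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Hypothetical validation curve of a single student architecture.
curve = [0.40, 0.55, 0.61, 0.60, 0.61, 0.59, 0.60, 0.62]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, accuracy in enumerate(curve, start=1):
    if stopper.should_stop(accuracy):
        stopped_at = epoch
        break
```

A performance predictor would go one step further and estimate the final accuracy from the first few epochs instead of waiting for the curve to flatten.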

Conclusion
In this study, we have developed and presented a GCN NAS algorithm named AutoGCN, tailored for the task of skeleton-based HAR. The intricate dependencies between hyperparameters and architectural configurations are formulated in an expressive search space encompassing a broad range of building blocks from which the controller can sample, which allows the algorithm to find a versatile and high-performing network architecture. AutoGCN identifies and optimizes these fundamental building blocks of the network concurrently with a reinforcement algorithm, giving every search space parameter a policy that is updated based on the anticipated performance of the sampled student architectures. The exploration and exploitation behavior of the search is refined by incorporating a replay memory, which enables the method to strike a promising balance between the two and enhances the overall search performance.
Through extensive experiments, we provide a rigorous performance analysis, comparing our method against the baseline NAS procedure and SOTA approaches in the domain of skeleton-based HAR. Finally, AutoGCN also proves effective compared to random search. Future work will investigate the influence of performance predictors on AutoGCN to decrease the search time of the algorithm, as good-performing architectures could be recognized earlier. Additionally, weight-sharing techniques could further improve the efficiency of the search.

search space parameters for the highest values. In the Main stream group, most search areas demonstrate clear trends, barring the values associated with Depth main. The probability values for the depths of one or three layers remain closely clustered following the first and second controller updates. Within the Optimizer group, variability among the policy values is notable, particularly for the Learning rate and Batch size parameters. Given the intertwined nature of these hyperparameters and their impact on the incompletely trained student architecture, fluctuations are expected.
It is essential to highlight that multiple local optima are possible within this search space configuration. These optima depend upon the statistical variance from the random sampling process of the student architectures. Furthermore, the average Top-1 accuracy of the students is significantly lower in the second update cycle of this experiment than in the other rollout experiments from Table (6).

B Policy values on the X-View dataset
Fig. (5) shows the policy values for the model's hyperparameter and architecture search space with 30 rollouts between controller updates on the X-View dataset. The values are grouped as defined in Table (1). The best Top-1 accuracy in this experiment is achieved with the second controller update, where the P_max(α, h) values are presented in Table (4).

Figure 1: Overview of the proposed AutoGCN algorithm.

Figure 3: Influence of the rollout parameter on the model performance. The star indicates the highest accuracy achieved in the rollout experiments.

Figure 4: Policy values for the best-performing model on the X-Sub dataset. One update cycle contains 20 student architectures.

Table 1: Search space comprising the parameters and the corresponding value ranges, grouped into the effective areas.
Split the data into: D_train and D_val
Output: P_max(α, h)
Initialize controller's policies P, iterations i, and rollouts r
while not converged do
    for i do
        Sample (α, h) from controller's search space
        Build student architecture
        for 25 epochs do
            Train student architecture
        Append validation accuracy to r
    Sample N student architectures from reservoir N
    Update controller by REINFORCE with r
    Sample P_max(α, h) and train architecture
Get final architecture from P_max(α, h)
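The search loop listed above can be sketched in Python under strong simplifications. The search space, the reward function `toy_reward` (a stand-in for training a student and reading its validation accuracy), the mean-reward baseline, and the unweighted reuse of replayed samples are all assumptions made for illustration; the actual search space is the one in Table (1), and the way the reservoir enters the update is not detailed in this section.

```python
import math
import random

# Hypothetical toy search space; the real one is defined in Table (1).
SEARCH_SPACE = {
    "depth": [1, 2, 3],
    "learning_rate": [0.1, 0.05, 0.01],
}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(policies, rng):
    """Draw one value index per parameter from its categorical policy."""
    return {name: rng.choices(range(len(logits)), weights=softmax(logits))[0]
            for name, logits in policies.items()}


def toy_reward(choice):
    """Stand-in for the validation accuracy of a briefly trained student."""
    depth = SEARCH_SPACE["depth"][choice["depth"]]
    lr = SEARCH_SPACE["learning_rate"][choice["learning_rate"]]
    return 0.5 + 0.1 * depth + (0.2 if lr == 0.05 else 0.0)


def reinforce_update(policies, batch, step_size=0.5):
    """REINFORCE step on the logits, using the batch mean reward as baseline."""
    baseline = sum(reward for _, reward in batch) / len(batch)
    for choice, reward in batch:
        advantage = reward - baseline
        for name, logits in policies.items():
            probs = softmax(logits)
            for k in range(len(logits)):
                # Gradient of log softmax w.r.t. logit k: 1[k chosen] - p_k
                grad = (1.0 if k == choice[name] else 0.0) - probs[k]
                logits[k] += step_size * advantage * grad


def search(updates=30, rollouts=20, replay=5, seed=0):
    rng = random.Random(seed)
    policies = {name: [0.0] * len(values)
                for name, values in SEARCH_SPACE.items()}
    reservoir = []  # past (choice, reward) pairs, playing the role of N
    for _ in range(updates):
        batch = []
        for _ in range(rollouts):
            choice = sample(policies, rng)
            batch.append((choice, toy_reward(choice)))
        # Reuse a few earlier students in addition to the fresh rollouts.
        replayed = rng.sample(reservoir, min(replay, len(reservoir)))
        reinforce_update(policies, batch + replayed)
        reservoir.extend(batch)
    # P_max: the most probable value of every search space parameter.
    return {name: SEARCH_SPACE[name][max(range(len(logits)),
                                         key=logits.__getitem__)]
            for name, logits in policies.items()}


best = search()
print(best)
```

Each search space parameter keeps its own vector of logits, so the update cost grows with the number of parameters rather than with the number of possible architectures.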

Table 4: Comparison of the found architecture and hyperparameter components from the best-performing models.

Table 5: Comparison of random search and AutoGCN with respect to the search time and the Top-1 accuracy in (%).

Table 7: Changed values of the search space and the resulting point estimate in (%) on the X-View dataset.