Learning a Unified Latent Space for NAS: Toward Leveraging Structural and Symbolic Information

Automatically designing neural architectures, i.e., Neural Architecture Search (NAS), is a promising direction in machine learning. The main challenge for NAS algorithms, however, is the considerable time required to evaluate each proposed network. A recent strategy that has attracted much attention is the use of surrogate predictive models, which attempt to forecast the performance of a neural model ahead of training by exploiting only its architectural features. However, preparing the training data for predictive models is laborious and resource-demanding, so improving the model's sample efficiency is of high value. For the best performance, the predictive model should be given a representative encoding of the network architecture. Still, the potential of a proper architecture encoding for pruning and filtering out unwanted architectures has often been overlooked in previous studies. Here, we discuss how to build a proper representation of a network architecture that preserves the explicit and implicit information inside the architecture. To perform the experiments, two standard NAS benchmarks, NAS-Bench-101 and NAS-Bench-201, are used. Extensive experiments on these spaces demonstrate the effectiveness of the proposed method compared with state-of-the-art predictors.

Designing a neural model for a specific problem is time- and labor-consuming. Besides, relying on expert experience often results in subjective, sub-optimal solutions. This has led to an emerging branch of machine learning that attempts to automatically find the optimal neural model for a specific task, namely Neural Architecture Search (NAS).

The optimal answer there consists of three parts: the optimal structure of the model, the optimal parameters, and the optimal hyperparameters. Traditional optimization methods address the parameter optimization problem. The challenge, however, lies in the simultaneous optimization of the structure and the hyperparameters of a neural model, due to the interdependency between the hyperparameters and the structure of a network. Thus, separate optimization of the architecture or the hyperparameters may yield a sub-optimal solution [1]. Moreover, for each deep structure a vast selection of hyperparameters exists, which forms the pool of candidate models. The combination of possible structures and their corresponding hyperparameters, for a certain task, builds an enormous space of options, called the NAS search space. Such a space would contain at least a few hundred thousand different models, and the challenge becomes far more severe in real-world tasks. Conventional search methods [2], [3], [4] often fully train and evaluate a large part of the space before converging, which is very costly. Thus, efficient methods for searching these huge search spaces are highly desirable. Previous studies have proposed useful strategies for addressing such a large search space.

Certain sequences of (network) operations lead to superior performance. We try to discover decisive patterns inside a network and use them as a measure to estimate its performance. In this study, we propose a method to encode and integrate the structural and non-structural features of the network as a whole by extracting substructures as local attributes. To extract structural features we use tree kernels, and for non-structural features we propose an operation coding method. The proposed method can model nested and multi-connection networks in polynomial time.

The basic idea of the method is to convert a deep convolutional model into tree kernels using a small set of samples and then run kernel-level matching. It therefore relies on two concepts: finding the kernels and effective sampling of the search space.

To perform the experiments, we use two standard benchmarks, NAS-Bench-101 and NAS-Bench-201, trained on the CIFAR-10 and ImageNet datasets. For each architecture, the benchmarks also include the respective training and evaluation statistics. This information has been collected under standard and reliable conditions and now serves as a basis for fair comparison of neural architecture search algorithms. Experiments with the proposed method show a significant improvement in the sample efficiency of the predictor and its performance estimation. The experiments also show that the knowledge of the predictor can be transferred to other similar problems to find strong architectures.
The main contributions of this paper are as follows:
• The importance of structure in NAS is investigated, and we introduce a method to embed the structure in the network encoding.

• For the first time in NAS, we propose to use tree kernels to find structural similarities between two networks.
• We investigate the effect of a lack of sufficient data on the performance of the predictive model. At present, our method has the best sample efficiency compared to a wide range of methods that achieved the best performance in 2020 and 2021.

In different steps of the architecture search process (sampling, modifying, and training), a proper encoding of the neural architectures is needed. Although there are few studies on proper encoding, a general classification of network encoding methods is as follows.

A. DIRECT ENCODING

Direct encoding is the common sequential or layer-by-layer network encoding for flat feed-forward networks with no extra branches. The primary network can be reproduced directly from the network code. Topology-related parameters such as the number of hidden layers, the number of neurons, the layers' types, and kernel sizes, as well as a set of general characteristics such as learning rates and biases, can be encoded this way [13], [14], [16].
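To make this concrete, the sketch below shows one possible direct encoding of a branch-free network in Python; the field names (layer type, units, kernel size, activation) and the inclusion of the learning rate are illustrative assumptions rather than the exact schemes of [13], [14], [16].

```python
# A minimal sketch of direct (layer-by-layer) encoding for a flat
# feed-forward network. Field names and layout are illustrative
# assumptions, not the encoding of any specific prior work.
from dataclasses import dataclass
from typing import List

@dataclass
class LayerSpec:
    layer_type: int   # e.g. 0 = dense, 1 = conv, 2 = pooling
    units: int        # neurons or output channels
    kernel_size: int  # 0 for non-convolutional layers
    activation: int   # e.g. 0 = none, 1 = ReLU, 2 = sigmoid

def encode_direct(layers: List[LayerSpec], learning_rate: float) -> List[float]:
    """Flatten a branch-free network into one numeric vector.

    Because the layer order is preserved, the original network can be
    reproduced directly from this code.
    """
    vec: List[float] = [learning_rate]
    for layer in layers:
        vec.extend([layer.layer_type, layer.units,
                    layer.kernel_size, layer.activation])
    return vec

# Example: a small conv -> pool -> dense classifier.
net = [LayerSpec(1, 32, 3, 1), LayerSpec(2, 32, 2, 0), LayerSpec(0, 10, 0, 2)]
print(encode_direct(net, learning_rate=1e-3))
```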

The task at hand is to predict the performance of a neural architecture before training. This can be modelled as a regression problem in machine learning, where we aim at learning a regressor $P$ that takes the representation of an architecture $N_i \in \mathcal{N}$ as $e(N_i)$ and returns its estimated performance as $y_i = P(W_p, e(N_i))$, where $W_p$ denotes the trainable function parameters. The function $e(\cdot)$ takes an architecture and embeds it into a real-valued vector. The function $P$ is trained to minimize the discrepancy between its predictions and the true performances, e.g.,
$$W_p^* = \arg\min_{W_p} \sum_{i} \ell\big(P(W_p, e(N_i)),\, y_i\big),$$
where $\ell$ is a regression loss.

A neural architecture can be viewed as a graph $G = (V_G, E_G)$, where $V_G$ denotes its set of nodes and $E_G \subset V_G \times V_G$ is the set of edges. To learn and find similarities in graph-structured data, kernels can be applied. Although kernels differ in the way they are calculated and the types of features they extract, the idea of using kernels to compare two graphs is to break them down into substructures (nodes or sub-graphs) and calculate the kernel for each pair of substructures. Let $\mathcal{G}$ be a graph space; a kernel compares $G, H \in \mathcal{G}$ by calculating the difference between their components.

Assume $\mathcal{F} = F_1 \times \ldots \times F_k$ is the space of components that make up $G \in \mathcal{G}$. Also suppose $F : \mathcal{F} \to \mathcal{G}$ is a mapping of components to graphs, so that $F(g) = G$ if and only if the components $g \in \mathcal{F}$ build the graph $G$.
The graph kernel then decomposes into kernels on the components,
$$k(G, H) = \sum_{g \in F^{-1}(G)} \; \sum_{h \in F^{-1}(H)} \; \prod_{i=1}^{k} k_i(g_i, h_i).$$
Here the function $k_i$ is the kernel on $F_i$, and $F^{-1}(G)$ is the inverse mapping, i.e., the set of all components of the graph $G$ [25]. Each pattern represents a set of homogeneous, identical graphs. The function $\phi(G)_{\sigma_i}$ $(1 \le i \le N)$ counts the occurrences of sub-graphs of type $\sigma_i$ in graph $G$, and the kernel counts the simultaneous occurrence of sub-graph patterns in the respective graphs. Thus, we have
$$k(G, H) = \sum_{i=1}^{N} \phi(G)_{\sigma_i}\, \phi(H)_{\sigma_i} = \langle \phi(G), \phi(H) \rangle .$$
Each $\phi_i(G)_{\sigma_j^r}$ counts the occurrences of nodes labeled $\sigma_j^r \in \Sigma$, where $\Sigma$ is the set of allowed labels. The time required to calculate such kernels increases exponentially as the size of the sub-graphs increases; for graphs of size $n$, the time required to find patterns of size $k$ is of order $O(n^k)$. We therefore propose parsing the networks into tree sub-graphs, whose cost is polynomial in $d$, the depth of the longest sub-tree, and $m$, the maximum number of layers. The sub-trees can be compared using a special kind of structured kernel, called a tree kernel [26]. Next, we discuss tree kernels and how to exploit them in the NAS problem.
A sub-tree $t$ rooted at a node $v$ of $G$ is called a tree kernel of $G$; the tree kernel $t$ is a member of the set $T$ of all possible tree kernels of the graph $G$. The function $\phi_t(G)$ counts the number of times the tree kernel $t$, rooted at node $v$, occurs in $G$. Structural learning then models the problem as the identification of the common tree kernels between two distinct graphs:
$$k(G, H) = \sum_{t \in T} w(t)\, \phi_t(G)\, \phi_t(H).$$
Here, $T$ is a set of trees, $\phi_t$ is the counting function, and $w : T \to \mathbb{R}$ weights the tree kernels. When $w(t)$ tends to 0, only small linear (single-node) sub-trees are preserved, and vice versa.
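The following sketch illustrates this weighted similarity on pre-computed sub-tree pattern counts; representing each graph as a dictionary of pattern counts and weighting a pattern by λ raised to its size are simplifying assumptions made for illustration.

```python
# Minimal sketch of the weighted tree-kernel similarity
#   k(G, H) = sum_t w(t) * phi_t(G) * phi_t(H).
# Graphs are assumed to be pre-parsed into dictionaries that map a
# canonical sub-tree pattern (here: a string) to its occurrence count.
from typing import Dict

def subtree_kernel(phi_g: Dict[str, int],
                   phi_h: Dict[str, int],
                   lam: float = 0.5) -> float:
    """Weighted count of sub-tree patterns shared by two graphs.

    The weight w(t) = lam ** size(t) is an illustrative choice: as lam
    tends to 0, only the smallest (single-node) patterns contribute.
    """
    score = 0.0
    for pattern, count_g in phi_g.items():
        count_h = phi_h.get(pattern, 0)
        if count_h:
            size_t = pattern.count("(") + 1      # crude pattern-size proxy
            score += (lam ** size_t) * count_g * count_h
    return score

# Toy example: two small operation trees sharing a conv -> relu fragment.
g = {"conv": 2, "relu": 2, "(conv relu)": 1}
h = {"conv": 1, "relu": 1, "(conv relu)": 1, "pool": 1}
print(subtree_kernel(g, h))
```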

In this section, we first introduce our proposed framework, namely S²i, which embeds both Structural and Symbolic information (S²i) into a latent space, as shown in Figure 1.

The proposed framework is comprised of three components.

Instead of entering the feature space directly, tree kernels calculate the degree of similarity between two trees in terms of the number of common fragments or sub-trees between them. The similarity score is affected by the length of the alignment and probable gaps along the matched sub-trees.

Let $\mathcal{F} = \{f_1, f_2, \ldots\}$ be the set of tree fragments, and let $I_i(n)$ be an indicator that equals 1 if fragment $f_i$ is rooted at node $n$ and 0 otherwise. The kernel function is defined as follows:
$$K(T_1, T_2) = \sum_{n_1 \in N_{T_1}} \; \sum_{n_2 \in N_{T_2}} \Delta(n_1, n_2).$$
Here $N_{T_1}$ and $N_{T_2}$ are the sets of nodes of the two trees $T_1$ and $T_2$, respectively, and the delta function $\Delta(n_1, n_2) = \sum_{i=1}^{|\mathcal{F}|} I_i(n_1)\, I_i(n_2)$ computes the number of fragments that are common to $T_1$ and $T_2$ and have their roots in $n_1$ and $n_2$. The delta function can be rewritten recursively as a sum of delta functions over child subsequences, where $s_k[1\!:\!j]$ is the subsequence of the children from 1 to $j$ in the $k$-th string. We treat each node of a tree as an entity and assign the operations to the nodes. Our goal is to discover the existence of a one-way connection between two nodes $E_1$ and $E_2$ and therefore to find the pair $\langle E_1, E_2 \rangle$ in the sequences. Matching the two sequences $s_1$ and $s_2$ is of order $O(p\,|s_1|\,|s_2|)$. In the worst case, the computational complexity of calculating the kernels is $O(p\,\rho^2\,|s_1|\,|s_2|)$, where $\rho$ is the maximum branching factor of the two trees.
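A minimal sketch of the node-pair formulation above is given below; the tree representation and the simplified delta (which requires identical child counts and ignores partial subsequences and gaps) are illustrative assumptions, not the full matching procedure.

```python
# Sketch of K(T1, T2) = sum over all node pairs (n1, n2) of Delta(n1, n2).
# The simplified Delta only matches nodes with equal labels and equal
# numbers of children; the full kernel also scores partial child
# subsequences with gaps.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def nodes(t: Node) -> List[Node]:
    out = [t]
    for c in t.children:
        out.extend(nodes(c))
    return out

def delta(n1: Node, n2: Node, lam: float = 0.4) -> float:
    """Down-weighted count of common fragments rooted at n1 and n2."""
    if n1.label != n2.label or len(n1.children) != len(n2.children):
        return 0.0
    prod = 1.0
    for c1, c2 in zip(n1.children, n2.children):
        prod *= 1.0 + delta(c1, c2, lam)
    return lam * prod

def tree_kernel(t1: Node, t2: Node) -> float:
    return sum(delta(a, b) for a in nodes(t1) for b in nodes(t2))

# Toy example: two small operation trees.
t1 = Node("conv3x3", [Node("bn"), Node("relu")])
t2 = Node("conv3x3", [Node("bn"), Node("relu", [Node("pool")])])
print(tree_kernel(t1, t2))
```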

So far, we have solely modelled the structure of a neural model. To encode the non-structural information, we represent each operation by a four-part numeric vector, as shown in Figure 3. For instance, one such operation is an n × n convolution followed by a BatchNorm and a ReLU operation.
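As a concrete illustration, the following sketch encodes one operation as a four-part numeric vector. The exact four fields of Figure 3 are not reproduced here; the choice of (operation type, kernel size, normalization, activation) is a hypothetical instantiation consistent with an n × n convolution followed by BatchNorm and ReLU.

```python
# Hypothetical four-part numeric encoding of a single operation:
# [operation type, kernel size, normalization flag, activation flag].
# The field choice is an assumption for illustration only.
OP_TYPES = {"conv": 0, "pool": 1, "identity": 2}
NORMS = {"none": 0, "batchnorm": 1}
ACTS = {"none": 0, "relu": 1}

def encode_operation(op: str, kernel: int, norm: str, act: str) -> list:
    return [OP_TYPES[op], kernel, NORMS[norm], ACTS[act]]

# An n x n convolution followed by BatchNorm and ReLU (here n = 3).
print(encode_operation("conv", 3, "batchnorm", "relu"))   # [0, 3, 1, 1]
```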

A key step in modeling network structures is to form a representative set of tree kernels. The tree kernels are built from a proxy set, i.e., a set of initial trees that hold most of the structures existing in the problem. The proxy set is a set of k architectures sampled from the search space. As discussed in Section V, the tree kernels are computed from the proxy set and form the dimensions of the latent space. Previous studies utilized various strategies for building the proxy set [30]. In our experiments, we used three major strategies for building the proxy set: a random subset, a top-k subset, and a diverse subset.
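The sketch below shows one way these three strategies might be realized over pre-computed encodings and validation accuracies; the greedy farthest-point heuristic for the diverse subset is an illustrative assumption rather than the exact procedure used here.

```python
# Sketch of three proxy-set sampling strategies over a search space whose
# architectures already have latent encodings and validation accuracies.
import numpy as np

def random_subset(n_total: int, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.choice(n_total, size=k, replace=False)

def topk_subset(val_acc: np.ndarray, k: int) -> np.ndarray:
    # Indices of the k architectures with the highest validation accuracy.
    return np.argsort(val_acc)[::-1][:k]

def diverse_subset(enc: np.ndarray, k: int) -> np.ndarray:
    """Greedy farthest-point selection in the encoding space."""
    chosen = [0]
    dists = np.linalg.norm(enc - enc[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(enc - enc[nxt], axis=1))
    return np.array(chosen)
```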

Our objective is to build a predictor P that estimates the performance of an architecture before training. Our model takes a network architecture a and an epoch index t, and produces a scalar value P(a, t) as the prediction of the performance after exactly t epochs. The hypothesis behind this is that the validation accuracy generally changes as training proceeds, so we have to be specific about the time point of the prediction. This also helps us to better model the possible correlations between training samples. We used a three-step predictor to select promising models.
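As a rough illustration of the (encoding, epoch) → accuracy setup, the sketch below trains a gradient boosting regressor as a stand-in predictor; it is not the three-step predictor used in this work, and the training data are random placeholders rather than benchmark statistics.

```python
# Stand-in for a performance predictor P(a, t): a gradient boosting
# regressor over the concatenation of an architecture's latent encoding
# and the epoch index. Placeholder data, not benchmark statistics.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def make_features(encodings: np.ndarray, epochs: np.ndarray) -> np.ndarray:
    # encodings: (n, d) latent vectors, epochs: (n,) epoch indices
    return np.hstack([encodings, epochs.reshape(-1, 1)])

rng = np.random.default_rng(0)
enc = rng.normal(size=(200, 16))            # latent encodings
t = rng.integers(4, 108, size=200)          # epoch indices
acc = rng.uniform(0.80, 0.95, size=200)     # observed accuracies

predictor = GradientBoostingRegressor().fit(make_features(enc, t), acc)
pred = predictor.predict(make_features(enc[:5], t[:5]))
print(pred)
```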

The validation and testing results are given in Table 2. As shown in Table 2, the proposed method has the highest accuracy on the validation data; the WeakPredictor method reaches the same validation accuracy, while the other baseline methods trail by significant margins on the validation data. Table 3 shows the number of training samples used by the methods with the highest test accuracy.

TABLE 2. Validation and test results on NAS-Bench-101. Entries with '-' correspond to methods whose validation accuracy was not reported in the original paper.
As can clearly be seen in the results, our method uses the lowest number of training samples.

A major challenge of using the predictive method is pessimistic estimation in some areas of the search space. Thus, ranking architectures relative to each other is more important than estimating their exact performance (numerical accuracy). To evaluate this, we compare the performance of our predictive model with the GCN, LSTM, MLP, and GATES predictors by calculating Kendall-Tau correlation coefficients. The values for these methods are derived from [23] and reported in Table 4. To evaluate the robustness of the proposed method to the amount of available data, all predictors are trained with different ratios of the training data, and the Kendall-Tau coefficient on the accuracy obtained from testing the architectures is reported. We initially train our predictor with 100 architectures out of the mentioned set of 381,262. What remains of the training budget of each round is divided into 10 parts to update the predictor.
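For reference, the Kendall-Tau coefficient between predicted and true accuracies can be computed with SciPy; the accuracy arrays below are placeholder values, not results from the benchmarks.

```python
# Kendall-Tau rank correlation between predicted and true accuracies.
from scipy.stats import kendalltau

true_acc = [0.91, 0.87, 0.94, 0.89, 0.92]   # placeholder ground-truth values
pred_acc = [0.90, 0.86, 0.95, 0.90, 0.91]   # placeholder predictions

tau, p_value = kendalltau(pred_acc, true_acc)
print(f"Kendall-Tau = {tau:.3f}")
```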

As Table 4 shows, the generalization is especially evident when training samples are scarce. For example, when only 190 architectures (0.05% of the architectures) are observed by the predictive model, the ranking produced by our method clearly results in a higher correlation than the other encoders.

To observe how the encoding can improve sample efficiency, we conduct a series of experiments and sample the proxy architectures under different constraints. In each round, we evaluate the resulting model on 10,000 architectures, but our experiments show that these results also hold on the whole dataset. First, we sample only those architectures whose validation accuracy is above a certain threshold. The results of this experiment are reported in Table 5. For each specific threshold, the test is run 3 times and the results are averaged. The results confirm that the quality of the proxy trees affects the final problem. Generally, as the accuracy threshold of the proxy architectures increases, the average test accuracy of the predictor and its sample efficiency increase.

It is clearly seen that this sampling is much more efficient. However, it is costly to gather architectures for the high-accuracy proxy set; therefore, we might select architectures with slightly lower performance instead. We run this test on 10,000 samples, 5 times in total, and average the results. We can infer from Table 6 that the second test shows a more monotonic behavior. Thus, the better approach for sampling proxy architectures is to keep diversity among them, even if we cast out some highly accurate networks.

To visualize the learned space, architectures are plotted according to their rank with respect to their true performance; darker points represent more accurate architectures. The visualization is not exact, as the real encoding space is multi-dimensional while it is mapped to a two-dimensional space. Nevertheless, it can be seen that architectures with similar performance are mapped to close areas and form small groups, and we move gradually from strong regions to weak regions and vice versa.

The Kendall-Tau coefficient was also measured on NAS-Bench-201; the results are shown in Table 7.

G. CROSS META-LEARNING DATASET

We further demonstrate the usefulness of our predictor trained on CIFAR-10 by applying it to ImageNet classification. It has been shown that CIFAR accuracy and ImageNet accuracy are strongly correlated [7]. We transfer the best architectures found on CIFAR-10 as proxy trees to the ImageNet problem. We use the Kendall-Tau coefficient to assess the quality of the predictor's ranking. To compare the findings to the results in other papers, we conduct experiments under the following settings: in each round, the predictor is initially trained with 100 samples and updated with the rest of the training budget of the specific round (batch size is 100). The results are summarized in Figure 7. Our predictor achieves slightly better performance than BRP-NAS and significantly surpasses NPENAS and BANANAS.

Surrogate predictive models have shown their effectiveness in addressing the Neural Architecture Search problem, where a model attempts to forecast the performance of a neural network ahead of training.

In this paper, we proposed to leverage both the structural features of a deep network and per-layer information to learn an encoding for a surrogate predictor. We proposed to represent each neural network via a tree structure and then extract features from the given tree. In particular, we adopt tree kernels for structural feature learning and an operation coding scheme for non-structural features. Extensive results on two state-of-the-art NAS benchmarks demonstrate the effectiveness of the proposed method.

The matching of two nodes takes account of both the type of the assigned operation and its hyperparameters. We define a Delta function as below to consider both:

$$\Delta(E_1, E_2) = \mathbb{1}[E_1^1 = E_2^1]\,\big(0.75 + 0.25 \times \mathbb{1}[E_1^2 = E_2^2]\big) \qquad (13)$$
Here $\mathbb{1}[\cdot]$ equals 1 if the condition holds and 0 otherwise, $E_i^1$ is the operation type, which is decisive in matching, and $E_i^2$ is its most important hyperparameter (such as the kernel size). Thus, similar operations with different hyperparameters get a proportional matching score.
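Eq. (13) can be transcribed directly, assuming each operation node is stored as a (type, main hyperparameter) pair:

```python
# Direct transcription of the node-matching Delta of Eq. (13).
# An operation node is assumed to be a (type, main_hyperparameter) pair,
# e.g. ("conv", 3) for a 3x3 convolution.
def delta(e1: tuple, e2: tuple) -> float:
    type_match = float(e1[0] == e2[0])    # operation type is decisive
    hyper_match = float(e1[1] == e2[1])   # e.g. kernel size
    return type_match * (0.75 + 0.25 * hyper_match)

print(delta(("conv", 3), ("conv", 3)))   # 1.0  (same op, same kernel)
print(delta(("conv", 3), ("conv", 5)))   # 0.75 (same op, different kernel)
print(delta(("conv", 3), ("pool", 3)))   # 0.0  (different op)
```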