Proxyless Neural Architecture Adaptation at Once

Recently, many Neural Architecture Search (NAS) methods have been introduced and show impressive performance on many benchmarks. Among these NAS studies, the Neural Architecture Transformer (NAT) aims to adapt a given neural architecture to improve performance while maintaining computational costs. In the architecture adaptation task, we can utilize known high-performance architectures, and the architecture adaptation results of NAT showed performance improvements on various architectures in their experiments. However, through multiple trials of experiments we verified that NAT lacks reproducibility. Moreover, it requires an additional architecture adaptation process before network weight training. In this paper, we propose a proxyless neural architecture adaptation method that is reproducible and efficient. The proposed method does not need a proxy task for architecture adaptation; it directly improves the architecture during the conventional training process, so the trained neural network can be used directly. Moreover, the proposed method can be applied to both supervised learning and self-supervised learning. The proposed method shows stable performance improvements on various architectures and various datasets. Extensive experiments on two benchmark datasets, i.e., CIFAR-10 and Tiny ImageNet, show that the proposed method clearly outperforms NAT and is applicable to various models and datasets.


I. INTRODUCTION
Traditionally, neural architectures are manually designed by human experts, e.g., VGG [1], ResNet [2], and DenseNet [3]. Choosing a proper neural architecture for a given dataset and task therefore requires considerable expertise and trial and error. Neural Architecture Search (NAS) automates this design process, but conventional NAS is computationally expensive. To overcome this limitation, many recent works have been proposed to reduce the computational costs of NAS while maintaining the advantages of NAS approaches. Most of these works focus on the search space [4], the search strategy [7], [8], and network performance evaluation [5], [8]. Conventional NAS methods search neural architectures from scratch in the given search space. However, if we utilize known superior neural architectures and reduce the search space, we can find high-performing neural architectures at a fraction of the cost of searching from scratch. The Neural Architecture Transformer (NAT) [9] is a work that utilizes known superior neural architectures and adapts them to the given dataset and task. The authors attempted to improve the given architecture so that the performance of the network increases while the computational costs are maintained or reduced. However, NAT requires a proxy task to adapt the given architecture in addition to the conventional network weight training stage. As claimed in [10], using a proxy task in the architecture searching phase can degrade the performance of the final trained model. In contrast, the proposed method directly adapts the given architecture and trains the weights of the network in a single training process.

Meanwhile, NAT [9] proposed architecture adaptation, which optimizes a given neural architecture, whereas typical NAS methods search for an architecture without any pre-defined base architecture. NAT aims to adapt the given architecture to the dataset and task so that the resulting architecture achieves better performance with fewer parameters than the original architecture. To achieve this goal, NAT only changes the original operations into none or identity operations. Although NAT showed impressive results in the paper, there are several limitations of NAT, as we claimed in Section I.

The proposed method can be applied not only to supervised learning but also to self-supervised representation learning. Traditionally, self-supervised learning focuses on learning representations from unlabeled data with pre-defined hand-crafted tasks such as image colorization [30], solving jigsaw puzzles [31], and rotation prediction [32]. Recently proposed contrastive learning enables learning representations without defining such specific tasks. In contrastive learning, the neural network learns representations based on different views of the given inputs that are produced by a set of augmentations. In particular, SimCLR [33] achieved remarkable performance on various benchmarks by utilizing effective augmentations with a contrastive loss.

We summarize the main differences between existing methods and the proposed method in this subsection.

The proposed method enhances the given architecture during the conventional network training process.

To adapt the given neural architecture, we use differentiable architecture parameters and gradient-based learning. The architecture parameters θ are defined on the network architecture graph. Each edge in the network architecture graph contains the original operation, an identity operation, and a none operation. The computation of each edge is carried out as a θ-weighted combination of these candidate operations. We set the initial θ_none and θ_id to zero, and θ_same to one. Therefore, the initialized network behaves exactly the same as the original architecture.
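For illustration, a minimal PyTorch-style sketch of such an edge is given below. This listing is our own illustrative code, not the original implementation; the module name AdaptableEdge and the exact form of the weighted combination are assumptions consistent with the description above.

import torch
import torch.nn as nn

class AdaptableEdge(nn.Module):
    """One edge of the architecture graph with three candidate operations:
    the original operation, identity, and none (zero output)."""
    def __init__(self, original_op):
        super().__init__()
        self.original_op = original_op
        # theta = [theta_same, theta_id, theta_none]; this initialization makes
        # the edge behave exactly like the original operation.
        self.theta = nn.Parameter(torch.tensor([1.0, 0.0, 0.0]))

    def forward(self, x):
        # Raw (non-softmaxed) weighted sum of the candidates; the "none"
        # candidate contributes zero, so it needs no explicit term.
        return self.theta[0] * self.original_op(x) + self.theta[1] * x

Note that the identity branch assumes the original operation preserves the input shape; edges where the shapes differ would need an appropriate projection.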

The proposed method can be used for two different objectives: one is improving the cell architecture and the other is improving the entire network architecture. If we aim to adapt the cell architecture, a single θ is shared among the cells, whereas each layer has its own θ when the entire network architecture is adapted.

FIGURE 2. Two types of architecture parameters θ. The proposed method can be used to adapt the cell architecture or to adapt the entire network architecture. θ is shared when the proposed method is used to adapt the cell architecture; in the adaptation of the entire network architecture, each layer has its own θ.

The network weights and the architecture parameters are trained jointly by minimizing the training loss, min_{w, θ} L(w, θ), where θ denotes the architecture parameters, w denotes the weights of the network, and L(·) is the loss function. Unlike w, θ is not normalized by a softmax when calculating the loss. The reason is that the softmax normalization diminishes the effect of each update: with the normalized value, the difference between the updated θ and its original value can be too small to change the architecture when θ is trained with a small learning rate using an SGD optimizer with a cosine scheduler.
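As a sketch only, assuming the AdaptableEdge module from the previous listing, a single joint update of the network weights and the architecture parameters could look as follows; the toy model and batch shapes are hypothetical and stand in for the real networks used in the experiments.

toy = nn.Sequential(
    AdaptableEdge(nn.Conv2d(3, 3, 3, padding=1)),   # one adapted edge
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 10))
# A single optimizer updates the network weights w and theta together,
# so no separate architecture-search (proxy) phase is required.
opt = torch.optim.SGD(toy.parameters(), lr=0.025)

x = torch.randn(8, 3, 32, 32)            # dummy CIFAR-10-sized batch
y = torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(toy(x), y)
loss.backward()                           # gradients flow into both w and theta
opt.step()

Because θ is kept in its raw form rather than being passed through a softmax, the small per-step updates accumulate directly in the parameter values.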

In the case of supervised learning, we used the cross-entropy (CE) loss, which uses the data and labels to calculate the loss. In the self-supervised learning experiments, we utilized the SimCLR [33] objective, which uses two differently augmented views of each image to calculate the loss. For a positive pair (z_i, z_j), i.e., embeddings of the same image under two different augmentations, the SimCLR objective can be formulated as

ℓ(z_i, z_j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k≠i} exp(sim(z_i, z_k)/τ) ],

where the sum runs over all augmented samples in the mini-batch except i, sim(·) is the cosine similarity, and τ is the temperature hyperparameter.
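A compact sketch of this contrastive objective is given below; it is our own simplified implementation of an NT-Xent-style loss, not the authors' code, and the function name and batch layout are assumptions.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """SimCLR-style loss for N positive pairs: z1[k] and z2[k] are the
    embeddings of two differently augmented views of the same image."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # 2N x d, unit norm
    sim = (z @ z.t()) / tau                                 # cosine similarity / temperature
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # exclude self-similarity (k != i)
    # for row i, the positive example sits n positions away in the concatenation
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, pos)

Here z1 and z2 would be the projection-head outputs for the two augmented views of one mini-batch; the cross-entropy over each similarity row is equivalent to the negative log-ratio in the equation above.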

Although we used the CE loss and the SimCLR objective for supervised learning and self-supervised learning, respectively, the proposed method is not restricted to these specific objectives. We used the settings of [36] to evaluate various NAS methods in a fair environment.

In this paper, we aim to enhance the given architecture while maintaining or reducing its computational costs.

We experimented with various models on the CIFAR-10 and Tiny ImageNet datasets. These models include ResNet20, MobileNet V2, DARTS, and ProxylessNAS; the former two are manually designed, and the latter two are NAS-based models. To compare the performance of NAT and our algorithm, we trained the NAT controller on each dataset and then trained the resulting architecture inferred from the controller. For our algorithm, we tested both cell-based architecture adaptation and entire network architecture adaptation. We used a learning rate of 0.025, 600 epochs, and the Stochastic Gradient Descent (SGD) optimizer with a cosine scheduler as the common hyper-parameters for all models and methods; a minimal sketch of this setup is given below. As exceptions, we trained MobileNet V2 and DARTS for 300 epochs on the Tiny ImageNet dataset, and utilized Cutout [37] for NAS-based models such as DARTS and ProxylessNAS.

TABLE 1 reports the average accuracy, standard deviation, and total cost of the various methods with different models on the CIFAR-10 dataset. We trained and evaluated each configuration five times to obtain the average accuracy and standard deviation. As shown in TABLE 1, the results of NAT are unstable in the case of the manually designed models. The results of our algorithm have better average accuracy and standard deviation than the original architectures and NAT in all cases. Moreover, the total computational cost is lower than that of NAT. Note that NAT cannot transform the architecture of ProxylessNAS, since it has various cell architectures in the network, whereas the proposed method successfully improves the performance of the ProxylessNAS architecture.
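The sketch below only encodes the common hyper-parameters stated above (optimizer, learning rate, scheduler, and epoch budget); momentum and weight decay are not specified in this section and are therefore deliberately left out.

import torch

def make_optimizer(model, epochs=600, lr=0.025):
    # SGD with a cosine learning-rate schedule, shared by all models;
    # epochs is set to 300 for MobileNet V2 and DARTS on Tiny ImageNet.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched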

TABLE 2 shows the reproducibility of the various methods with different random seeds on the CIFAR-10 dataset. In NAT [9], there is only one result for each experiment; therefore, we ran each experiment five times to obtain reliable evaluation results for all comparison methods.

In the case of the ResNet20 experiments, the NAT results for seeds 1 and 4 are at the level of random guessing. These results are caused by transforming identity edges into the none operation, which removes the corresponding skip connections. The transformed ResNet20 architectures of seed 1 are presented in FIGURE 3, where the changed edges are marked in red. As shown in FIGURE 3(b), NAT changed identity edges into none operations.

The objective of the proposed method is to enhance the performance of the given neural architecture while maintaining or reducing its computational costs through architecture adaptation. To analyze the computational costs of the resulting architectures, the number of parameters and FLOPs are presented in TABLE 5. The proposed method maintains or reduces the FLOPs in the case of ResNet20 and DARTS. The FLOPs increase slightly for MobileNet V2; however, the increment is only 0.05% of the original FLOPs. The accuracy of the resulting architectures of the proposed method is enhanced for all three architectures, while NAT failed to enhance the accuracy because it changed essential operations in the cell architecture.

The number of architecture parameters θ in the proposed method is small in the case of cell-based architecture adaptation. As presented in the results, optimizing θ with only first-order information shows consistent performance improvements across various seed architectures, various datasets, and both supervised and self-supervised learning tasks. Nevertheless, optimizing θ with second-order information [7] or neuroevolution [40] could further improve the results of the proposed method. We plan to apply these optimization techniques to the proposed method in future work.

We proposed a novel gradient-based neural architecture adaptation algorithm that is reproducible and effective for architecture improvement. Thanks to the differentiable architecture parameters, our algorithm can train both the architecture parameters and the network weight parameters at once. Thus, the proposed method can easily be combined with the conventional neural network training process. Rather than using an RL-based controller with stochastic policies, the proposed method uses architecture parameters for each operation and avoids the performance degradation caused by changing important operations into none operations.

The results of the experiments demonstrate that the proposed algorithm has high reproducibility and stably improves the performance of various models on various datasets. The proposed method can improve the performance of both manually designed architectures and complex NAS-based architectures. Moreover, the proposed method can be applied to both supervised learning and self-supervised learning and achieves performance improvements under both learning schemes.