BNNAS++: Towards Unbiased Neural Architecture Search With Batch Normalization

Neural Architecture Search (NAS) has achieved significant progress in many computer vision tasks, yet training and searching high-performance architectures over a large search space is time-consuming. NAS with BatchNorm (BNNAS) is an efficient NAS algorithm that speeds up supernet training ten times over conventional schemes by fixing the convolutional layers and training only the BatchNorm. In this paper, we systematically examine the limitations of the BNNAS and present improved techniques. We observe that the BNNAS is prone to selecting convolutions and fails to find plausible architectures with small FLOPs, especially on search spaces with operations other than convolutions. This is because the number of learnable BatchNorm parameters is imbalanced across operations, i.e., BatchNorm is attached only to the convolutional layers. To fix this issue, we present the unbiased BNNAS for training. However, our empirical evidence shows that the performance indicator of the BNNAS is ineffective on the unbiased BNNAS. To this end, we propose a novel performance indicator, which decomposes model performance into model expressivity, trainability, and uncertainty, to preserve rank consistency on the unbiased BNNAS. Our proposed NAS method, named BNNAS++, is robust on various search spaces and can efficiently search high-performance architectures with small FLOPs. We validate the effectiveness of BNNAS++ on NAS-Bench-101, NAS-Bench-201, the DARTS search space, and the MobileNet search space, showing superior performance over existing methods.


I. INTRODUCTION
Automatically finding the optimal neural architecture for a given task and dataset has been an active research field in recent years. The goal is to relieve humans from tedious model design and to avoid numerous trial-and-error experiments. Pioneering works search architectures via reinforcement learning [1] or evolutionary algorithms [2], which involve unbearable search cost and resource consumption. Many NAS algorithms manage to use reduced computational settings, such as weight-sharing NAS [3], [4], [5], [6] and cheap proxies [7], but they still suffer from significant memory and computational costs.
To address this issue, neural architecture search with batch normalization (BNNAS) [8] was recently introduced to train only the BatchNorm layers of the weight-sharing supernet while fixing the convolutional layers. In the search stage, the BNNAS compares the value of γ, the scale parameter in BatchNorm, to indicate model performance. It speeds up supernet training up to ten times over conventional supernet training methods such as SPOS [6] and FairNAS [9], and achieves comparable results. In this paper, we systematically study the limitations of the BNNAS and present improved techniques for both training and searching. We observe that the original BNNAS is prone to selecting convolutions and unable to find plausible architectures in search spaces containing operations other than convolutional layers, such as pooling or identity layers. This is because only the convolutions are stacked with BatchNorm layers, which gives architectures with more convolutions more learnable BatchNorm parameters than those with pooling or identity layers. Such an imbalance in the number of learnable parameters causes the BNNAS to become biased towards convolutional layers. To this end, we present the unbiased BNNAS, which maintains the same number of learnable BatchNorm parameters for all architectures; thus, architectures can be compared fairly during the search stage. However, we empirically find that the γ-indicator, the performance indicator previously proposed in BNNAS for architecture search, is ineffective on the unbiased BNNAS. Therefore, we introduce a novel performance indicator combining model expressivity, trainability, and uncertainty measurements. Through extensive experiments, we verify that the composite indicator strongly correlates with model test accuracy in the early training stage, so the supernet training time can be further reduced.

(The associate editor coordinating the review of this manuscript and approving it for publication was Frederico Guimarães.)
We validate our BNNAS++ on NAS-Bench-101, NAS-Bench-201, DARTS search space, and MobileNet search space, showing promising performance and superior efficiency over existing methods. The comparison between the conventional weight-sharing NAS, the BNNAS, and the BNNAS++ can be found in Figure 1.
In summary, our contributions are threefold:
• We empirically show that the original BNNAS, an efficient NAS algorithm, is unfair in training the supernet, and we present the unbiased BNNAS to fix the unfairness issue.
• We empirically show that the γ-indicator of the original BNNAS does not apply to the unbiased BNNAS. Therefore, we introduce a novel performance indicator for the search procedure that combines measurements of model expressivity, trainability, and uncertainty.
• We demonstrate our approach's effectiveness and efficiency on NAS-Bench-101, NAS-Bench-201, DARTS search space, and MobileNet search space.

II. ANALYSIS OF NEURAL ARCHITECTURE SEARCH WITH BATCH NORMALIZATION
A. REVISIT THE BNNAS
First of all, we revisit the BNNAS approach. Given an input x, for any operation OP(·), it is common to map the activations along the channel dimension to a zero-mean Gaussian distribution using the batch-wise statistics µ and σ, and then rescale them with the learnable parameters γ and β. Formally, batch normalization can be described as

y = γ · (OP(x) − µ) / √(σ² + ε) + β,

where y is the output after BatchNorm and ε is an extremely small number that prevents a zero denominator. The BNNAS limits the choice of operations to OP(·) ∈ {Conv(·)}; nevertheless, a typical NAS search space has many types of operations, OP(·) ∈ {3 × 3_conv, avg_pool, · · · , identity}. Conventional NAS methods train both OP(·) and the BatchNorm together and then use the validation accuracy to rank architectures. Instead, the BNNAS [8] proposes to update the BatchNorm only and fix the parameters of OP(·). In the search stage, it uses the average γ value over all cells, rather than the validation accuracy, as the indicator of model rank.
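The two ingredients above can be sketched in a few lines of numpy; this is a minimal illustration of the BatchNorm transform and the γ-indicator, not the paper's implementation (the function names are ours):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis.

    x: array of shape (batch, channels). The batch statistics mu and
    sigma^2 are computed per channel; the normalized activations are
    then rescaled by the learnable parameters gamma and beta.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

def gamma_indicator(gammas):
    """BNNAS search score: the average gamma magnitude over all
    BatchNorm layers (the paper averages gamma over all cells)."""
    return float(np.mean([np.abs(g).mean() for g in gammas]))
```

Since only γ and β are trained while the convolutions stay frozen, each training step touches only these per-channel vectors, which is where the claimed ten-fold speed-up comes from.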
B. WHY IS THE BNNAS BIASED?
Figure 2 shows the scatter plot of trained model test accuracy versus the value of gamma in BatchNorm. We randomly sample 1,000 models from NAS-Bench-201 and train only the BatchNorm of each model for one epoch on CIFAR10. We note that the original BNNAS trains the supernet for ten epochs in a weight-sharing setting, which produces a worse performance for each subnetwork than training a stand-alone model for one epoch. We can observe that there is indeed a positive correlation between test accuracy and gamma value. However, the architectures with a high gamma value (over 85) mostly have a significantly larger number of parameters (we show the corresponding plot measured by FLOPs in the Appendix). Meanwhile, the test accuracy of these architectures is mostly comparable to, or even worse than, that of many smaller architectures with low gamma values. In Table 1 we count the number of operations in the architectures with the top ten highest gamma values in the BNNAS. We find that the BNNAS favors convolutional layers: none of the architectures with high gamma values contain operations other than 1 × 1 and 3 × 3 convolutions.
We argue that the BNNAS is biased towards convolutional layers because BatchNorm is only attached to the convolutions. In other words, since only the BatchNorm is trained, the capacity of a network is largely determined by the learning process on its learnable BatchNorm parameters. Even though we calculate the average value of gamma over all operation nodes, placing BatchNorm only after convolutional layers has a negative impact on searching for good architectures with small sizes. In conclusion, the BNNAS is biased, causing unfairness in the supernet search.

C. UNBIASED BNNAS
It is necessary to fix the unfairness in the BNNAS; thus we propose the unbiased BNNAS. The method is simple: we attach a BatchNorm layer to every operation, including the pooling layer, identity layer, etc. As such, a fair evaluation of each architecture can be made when the training is finished. This is extremely important for weight-sharing NAS [9], [10], [11], [12]. The top part of Figure 1 compares the BNNAS and the unbiased BNNAS in the training stage. We also show the number of operations in the top ten architectures searched by the unbiased BNNAS in Table 1. Evidently, our approach can find networks consisting of diverse operations. While obtaining slightly better accuracy than the BNNAS, our approach consistently finds smaller architectures (measured by both FLOPs and number of parameters).
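The core idea, attaching one BatchNorm to every candidate operation so that all paths carry the same number of learnable parameters, can be sketched as follows; `BNOp` is a hypothetical wrapper of our own, not the paper's code:

```python
import numpy as np

class BNOp:
    """A candidate operation followed by its own trainable BatchNorm.

    In the unbiased supernet every choice (convolution, pooling,
    identity, ...) is wrapped this way, so each path carries exactly
    the same number of learnable parameters (2 * channels, for gamma
    and beta), while the frozen operation itself contributes none.
    """

    def __init__(self, op, channels):
        self.op = op                    # frozen, randomly initialized
        self.gamma = np.ones(channels)  # learnable scale
        self.beta = np.zeros(channels)  # learnable shift

    def num_learnable(self):
        return self.gamma.size + self.beta.size

# In the biased BNNAS only the convolutions get a BatchNorm, so an
# identity or pooling path trains zero parameters; wrapping every
# choice equalizes the count (op bodies elided for brevity).
choices = {
    "conv3x3": BNOp(lambda x: x, channels=16),
    "avg_pool": BNOp(lambda x: x, channels=16),
    "identity": BNOp(lambda x: x, channels=16),
}
```

With equal parameter counts, the amount of trainable capacity no longer depends on how many convolutions an architecture happens to contain.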

D. THE INEFFECTIVENESS OF γ -INDICATOR
The most straightforward way to search architectures is to reuse the γ-indicator for the unbiased BNNAS. However, our empirical study shows that the correlation between test accuracy and gamma value no longer exists once the training becomes fair across network architectures. Figure 3 shows the results of stand-alone training on CIFAR10, where only the BatchNorm is trained, for one epoch (left) and for two hundred epochs (right). The 1,000 architectures are randomly sampled from NAS-Bench-101. We can observe that when the networks are trained for only one epoch, the relationship between test accuracy and gamma value is essentially random: architectures with high gamma values have accuracies ranging from 30% to 93%. When we train the networks for 200 epochs, the situation becomes worse. In the right panel of Figure 3, the architecture that obtains the highest gamma value (over 200) achieves only around 60% test accuracy. As we attach BatchNorm after pooling layers and identity layers, the gamma values of architectures with those operations can also reach very high values, which makes the γ-indicator no longer an effective performance indicator for the unbiased BNNAS.

III. BNNAS++: COMPOSITE INDICATOR FOR UNBIASED BNNAS
In the previous section, we have shown the necessity of the unbiased BNNAS and how the γ-indicator fails on it. In this section, we introduce a surrogate performance indicator for the unbiased BNNAS. Our goal is to design an indicator that can 1) effectively predict the rank of the architectures at the early training stage of the unbiased BNNAS, and 2) be efficiently calculated during the search stage. We disentangle model generalization into model expressivity, trainability, and uncertainty. We provide a detailed description of each individual indicator and empirically demonstrate its correlation with model performance. We also conduct ablations on essential components that might impact the results of our method, such as the batch size and initialization method, in the Appendix.
Expressivity: The ability of deep networks to compactly express highly complex functions over the input space is viewed as one of the most critical factors in their success. The expressivity of deep networks has been widely studied theoretically [13], [14] and has been used in NAS to score networks with random features [15], [16]. However, both [16] and [15] count the number of linear regions of randomly initialized networks, limiting their search space to networks with ReLU activations [17], whereas current SOTA backbones rely heavily on non-linear activations other than ReLU, such as Swish [18], [19].
This paper discusses the expressivity of random features for various layer choices, such as convolutional layers, pooling layers, identity, etc., with trainable batch normalization. Batch normalization can be considered a reparameterization trick on the parameter space: learning to scale and shift the initialized parameters does not increase the expressivity of the random features themselves. Therefore, we take the validation accuracy as the expressivity of the deep network. The original BNNAS fails to measure the expressivity of random features due to its unfair supernet training. In our unbiased BNNAS, since the number of learnable BatchNorm parameters is the same for each operation, we can adopt expressivity as a performance indicator.
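Under this view, the expressivity indicator reduces to the validation accuracy of the frozen-weight, BN-only-trained network; a one-line sketch (the function name is ours):

```python
import numpy as np

def expressivity_score(logits, labels):
    """Expressivity indicator: validation accuracy of a network whose
    convolutions stay at their random initialization and whose
    BatchNorm layers alone were trained.

    logits: (num_samples, num_classes) predictions on the validation set
    labels: (num_samples,) ground-truth class indices
    """
    return float((logits.argmax(axis=1) == labels).mean())
```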
Trainability: Besides expressivity, whether a deep network can achieve good performance is determined by how effectively the optimizer can optimize it. Recent work [20] shows the effect of BatchNorm on the Lipschitzness of the loss, which plays a crucial role in optimization by controlling the amount by which the loss can change when taking a training step [21], [22]. Their empirical and theoretical results indicate that BatchNorm smooths the loss landscape, and that the re-parametrization in BatchNorm makes the gradient of the loss more Lipschitz.
Specifically, let us denote the loss of a network with BatchNorm as L_BN and the loss of the same network without BatchNorm as L_\BN. Given the activation ŷ_j and the gradient ∇_{ŷ_j}L, the Lipschitzness of the loss is characterized by the gradient norm ||∇_{y_j}L||. Reference [20] proves that the effect of BatchNorm on this Lipschitzness is bounded as

||∇_{y_j}L_BN||² ≤ (γ²/σ²) ( ||∇_{ŷ_j}L_\BN||² − (1/m)⟨1, ∇_{ŷ_j}L_\BN⟩² − (1/√m)⟨∇_{ŷ_j}L_\BN, ŷ_j⟩² ),

where m is the batch size. As a result, the scale term γ/σ in the inequality can be used to measure the flatness of the loss landscape, which we define as the trainability score in our context.
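Given a trained unbiased supernet, this score can be read directly off the BatchNorm layers; a numpy sketch, where the `(gamma, running_var)` layer representation is our assumption about how the statistics are stored:

```python
import numpy as np

def trainability_score(bn_layers, eps=1e-5):
    """Trainability indicator: the scale term gamma / sigma from the
    Lipschitz bound of [20], averaged over channels and layers.

    bn_layers: list of (gamma, running_var) pairs, one per BatchNorm
    layer; sigma is recovered from the running variance.
    """
    per_layer = [
        np.mean(np.abs(gamma) / np.sqrt(running_var + eps))
        for gamma, running_var in bn_layers
    ]
    return float(np.mean(per_layer))
```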
Uncertainty: Uncertainty is a measure of knowing what the model does not know. Deep neural networks are known to be overconfident in their predictions. Instead of a point estimate, a model can offer confidence bounds for each decision from the probabilistic view to avoid overconfidence. This is useful for solving core issues of deep networks such as poor calibration and data inefficiency [23], [24]. Few studies discuss the relationship between uncertainty and model generalization. In this work, we obtain practical uncertainty estimates for each network architecture via trainable batch normalization [25]. We discuss the correlation between uncertainty and performance, and we use it as one of the performance indicators to find plausible architectures.
Specifically, given a network architecture X, we are interested in approximate inference in Bayesian modeling. A common approach is to learn a parameterized approximating distribution q_θ(ω) that minimizes KL(q_θ(ω) || p(ω|D)), the Kullback-Leibler divergence of the approximation from the true posterior, where D is the training set, ω denotes the model parameters, and p(ω|D) is the posterior of the probabilistic model.
For models built upon batch normalization, we can obtain an approximate posterior by modeling µ_B and σ_B over a mini-batch B as stochastic variables. Following [25], we use the approximate posterior to express an approximate predictive distribution p(y|x, D) ≈ ∫ f^ω(x, y) q_θ(ω) dω, and we can obtain the covariance of the predictive distribution empirically as

Cov_q[y] ≈ (1/T) Σ_{t=1}^{T} f^{ω_t}(x) f^{ω_t}(x)^T − E_q[y] E_q[y]^T,

where each ω_t is drawn by normalizing with the statistics of a different random mini-batch. The calculation of this variance is straightforward in practice: we simply record the value of the variance at each iteration during evaluation and then average the total variance to obtain the final uncertainty score.
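In code, the final averaging step reduces to the variance of predictions across T stochastic forward passes, each pass using a different mini-batch's normalization statistics; a sketch in the spirit of [25], where the input shape is our assumption:

```python
import numpy as np

def uncertainty_score(predictive_samples):
    """Uncertainty indicator: evaluate the network T times, each pass
    normalizing with the statistics of a different random mini-batch,
    then average the resulting per-class predictive variance into a
    single scalar score.

    predictive_samples: (T, num_inputs, num_classes) stacked predictions.
    """
    var = predictive_samples.var(axis=0)  # variance across the T passes
    return float(var.mean())
```

A network whose predictions barely move when the normalization statistics are resampled (low score) is confident; a network whose predictions swing between passes (high score) is uncertain.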

A. EMPIRICAL STUDY ON CORRELATION ANALYSIS
We verify each indicator on NAS-Bench-201 with CIFAR10, CIFAR100, and ImageNet16-120. We randomly sample 2,300 architectures and plot the correlation between each indicator and the test accuracy of the trained networks; the test accuracy is taken directly from [26]. All models are trained for one epoch on the target dataset with the same default settings as [26], updating only the BatchNorm layers; we train for only one epoch because we want to verify that our proposed indicators are effective in the early training stage of the networks. The results are shown in Figure 4. We provide the Kendall Tau correlation [27] for each indicator on every dataset; all p-values for the tau correlations are below 0.005, ensuring the results are statistically significant.
We observe 1) a negative correlation between initial expressivity and test accuracy, 2) a positive correlation between trainability and test accuracy, and 3) a negative correlation between uncertainty and test accuracy. These results indicate that, for networks with fixed convolutional layers trained with BatchNorm only, a better architecture in the search space at the initial stage 1) performs poorly, 2) is easy to train, and 3) is certain about its predictions. The correlation for the trainability term is plausible: a network architecture with good trainability converges more easily to a wide optimum [28] with better generalization. It is also not surprising that deep networks that are more confident (less uncertain) in their predictions tend to generalize better, which parallels studies on human investors [29], [30] and athletes [31].
Nevertheless, it is counter-intuitive that good architectures tend to perform worse at the early stage of BatchNorm-only training. A possible explanation is that architectures trained with BatchNorm only tend to over-fit the training data first, as also observed by [32], and then generalize to the test data. Highly expressive networks are more likely to over-fit initially due to their tendency to make the feature representations orthogonal [33], and become more generalizable than less expressive networks given more training time. We provide more evidence on this topic in the Appendix.

B. COMPOSITE PERFORMANCE INDICATOR
We have presented and verified the strong correlation between a network's test accuracy and its expressivity of random features, trainability, and uncertainty, respectively. The key to building an effective NAS algorithm is how to combine these indicators. As shown in Figure 4, the numerical range of each indicator varies. A naive approach would be to normalize each indicator's value and then add them into a single score. However, this requires knowing the estimated range of the values prior to the search, which is impossible in most cases. Therefore, we instead rank the architectures according to the value of each indicator. Combining the rank results of different indicators can be treated as an ensemble learning problem. In practice, for simplicity, we give equal importance to each indicator and use the average rank across the three indicators as the final rank. We summarize the correlation study on NAS-Bench-201 in Table 3, which demonstrates a strong correlation between test accuracy and our composite indicator.
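The average-rank combination takes only a few lines; a numpy sketch, where the sign conventions follow the correlations reported above and the function names are ours:

```python
import numpy as np

def rank(values, descending=True):
    """Rank architectures by one indicator (rank 0 = best)."""
    order = np.argsort(values)
    if descending:
        order = order[::-1]
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(len(values))
    return ranks

def composite_rank(expressivity, trainability, uncertainty):
    """Equal-weight average of the per-indicator ranks.

    Lower initial expressivity, higher trainability, and lower
    uncertainty all correlate with better final accuracy, hence the
    ranking directions below. Lower composite rank = better.
    """
    r = (rank(expressivity, descending=False)    # negative correlation
         + rank(trainability, descending=True)   # positive correlation
         + rank(uncertainty, descending=False))  # negative correlation
    return r / 3.0
```

Because only rank orderings enter the score, no prior knowledge of each indicator's numerical range is needed.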
Furthermore, our proposed score can be straightforwardly incorporated into existing NAS algorithms. To demonstrate this, we implement a variant of REA, which we call Assisted-REA (AREA). REA starts with a randomly selected population (10 in our experiments). AREA instead randomly samples a larger population (in our experiments we double the population size to 20) and uses our score (Equation 2) to select the initial population (of size 10) for the REA algorithm. Algorithm 1 gives a brief description of combining BNNAS++ with evolution search.
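The AREA initialization can be sketched in a few lines; names here are ours, and `score_fn` stands in for the composite-rank score of Equation 2:

```python
def area_init_population(sample_arch, score_fn, pool_size=20, pop_size=10):
    """Assisted-REA (AREA) initialization: over-sample pool_size random
    architectures, then keep the pop_size best ones (lowest composite
    rank) as the starting population handed to the standard REA loop."""
    pool = [sample_arch() for _ in range(pool_size)]
    pool.sort(key=score_fn)  # lower composite rank = better architecture
    return pool[:pop_size]
```

The rest of REA (mutation and aging) is unchanged; only the initial population is filtered by the indicator.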

IV. EXPERIMENTS
In this section, we present experiments on NAS-Bench-101, NAS-Bench-201, the DARTS search space, and the MobileNet search space. All search costs are measured on an Nvidia GTX 1080-Ti. More detailed experimental settings and search space descriptions can be found in the Appendix.

Algorithm 1 BNNAS++ With Evolution Search
  Define the random architecture sampler RandArchSampler() and batch_size = 256
  ep_rank, t_rank, uc_rank = list(), list(), list()
  for i = 1, . . . , T do
    arch = RandArchSampler(), or mutate a TopK parent network
    Get data x and target y of a randomly sampled mini-batch
    ep_rank.append(accuracy(arch(x), y))
    t_rank.append(trainability(arch)); uc_rank.append(uncertainty(arch))
    Update the TopK architectures as parents according to the indicators
  end for

1) NAS-BENCH-101
BNNAS++ outperforms its BNNAS counterpart by a large margin, with an average accuracy of 94.16% versus 91.04%. Our approach also achieves slightly higher accuracy than other methods at a much lower search cost, for instance, 6.8% of the cost of RLNAS and 5.8% of that of non-weight-sharing REA.

2) NAS-BENCH-201
NAS-Bench-201 was proposed by [26] and benchmarks a number of NAS algorithms. It contains 15,625 candidate architectures and provides the accuracy of each. We compare our proposed method with non-weight-sharing NAS (Random Search, REA [2], REINFORCE [36], BOHB [37]) and weight-sharing NAS (Random Search [38], DARTS [3], GDAS [39], SETN [40], ENAS [4], FairNAS [9], BNNAS [8]). We report the experimental results of BNNAS++ in Table 2. The results indicate that BNNAS++ is extremely fast: 20 to 30 times faster than the weight-sharing approaches, while obtaining comparable or slightly better performance. Compared to the BNNAS, our approach further saves 70% of the search time and achieves better performance on all three datasets.

B. ImageNet EXPERIMENTS
We conduct experiments on ImageNet [41] with two commonly used search spaces, the MobileNet search space and the DARTS search space, to compare BNNAS++ with state-of-the-art NAS methods.

1) MobileNet SEARCH SPACE
The MobileNet-like search space consists of 21 to-be-searched layers; each layer leverages the MobileNetV2 inverted bottleneck [42], optionally with an SE module [43]. At each layer, the search method can choose between convolutional layers with kernel size {3, 5, 7} and expansion ratio {3, 6}, and an identity layer. Note that the original BNNAS does not have the identity-layer option. We compare with several weight-sharing NAS methods that use the same search space setting [5], [6], [8], [9], [44]. Specifically, we adopt the training strategy of SPOS and FairNAS: the supernet is trained via the unbiased BNNAS for five and ten epochs, and we then use our proposed indicator to search for the architecture. As shown in Table 5, with the same search space, our searched architectures achieve 74.28% with 325 MFLOPs and 75.81% with 465 MFLOPs, slightly outperforming the SPOS and FairNAS counterparts, while the search cost is 24 times and 20 times lower than that of the original methods, respectively. Furthermore, there is a clear advantage of BNNAS++ over the BNNAS: it outperforms the BNNAS by 0.16% and 0.14% in accuracy with 1.6 times and 1.5 times lower search cost. This indicates that even on a search space mainly composed of convolutions, our method is still superior to the BNNAS. By searching the SE module, we obtain an architecture with 1.05% higher accuracy than the BNNAS with only 69% of the FLOPs. Our unbiased training strategy and performance indicator help us find small architectures with plausible accuracy.

2) DARTS SEARCH SPACE
We compare our algorithm with various NAS algorithms [3], [9], [12], [35], [39] on the same DARTS search space. The bottom of Table 5 provides the experimental results on ImageNet. We can observe that the BNNAS does not achieve satisfactory results: its test accuracy is relatively lower than that of state-of-the-art methods, yet its FLOPs are similar or higher. On the other hand, BNNAS++ obtains 1.0% higher accuracy with 64 fewer MFLOPs than the BNNAS. This demonstrates that our approach is robust on search spaces with diverse operations, and further backs up our initial assumption that the biased BNNAS hurts search performance.

V. RELATED WORKS
A. BATCH NORMALIZATION
Though many normalization methods [45], [46] have been proposed in the last few years, batch normalization [47] still dominates CNNs. It was initially considered a technique to reduce internal covariate shift [47]; however, recent works show that it smooths the loss landscape [20], stabilizes the training process [48], [49], and preserves rank stability under certain assumptions [50]. BatchNorm can also be used to measure model uncertainty [25]. Reference [32] found that training BatchNorm only, while fixing the convolutional layers at their initial state, can achieve performance much better than random guessing.

B. BATCH NORMALIZATION IN NEURAL ARCHITECTURE SEARCH
This section mainly discusses the usage of batch normalization in the NAS field; we refer readers to the Appendix for a more detailed description of general NAS methods. From the perspective of batch normalization, some studies show that BatchNorm layers are problematic when training a supernet, since the batch statistics and scaling parameters are shared across various sub-networks. As a result, the BatchNorm layers need to be fine-tuned after supernet training [6], [9], have their learnable parameters fixed [3], be replaced with ghost batch normalization [51], [52], or even be removed entirely from training [53]. Moreover, supernet training can suffer if the batch size is too small due to limited GPU memory; thus, some works replace standard BatchNorm with synchronized batch normalization [54] or group normalization [45], [55] to alleviate the small-batch problem [56]. The BNNAS [8] proposes to speed up supernet training by freezing the convolution parameters and updating the BatchNorm only; the scale parameters in BatchNorm are then used to rank architectures. It reduces the search cost of weight-sharing NAS [6], [9] on the MobileNet search space by up to 10 times while achieving comparable performance.

VI. CONCLUSION
In this work, we examine the limitations of the BNNAS, an efficient NAS algorithm. We observe that the BNNAS brings unfairness into supernet training, and that its search criterion becomes ineffective once the unfairness issue is fixed. To this end, we propose the unbiased BNNAS and a novel composite performance indicator to search for good architectures efficiently, effectively, and robustly. Our proposed method, BNNAS++, is evaluated on NAS-Bench-101, NAS-Bench-201, the DARTS search space, and the MobileNet search space, showing superior search efficiency and effectiveness. These results indicate that our approach makes successful use of BatchNorm in NAS. Overall, we believe our approach provides an interesting and practical new perspective on BatchNorm in NAS.