μDARTS: Model Uncertainty-Aware Differentiable Architecture Search

We present a Model Uncertainty-aware Differentiable ARchiTecture Search (µDARTS) that optimizes neural networks to simultaneously achieve high accuracy and low uncertainty. We introduce concrete dropout within DARTS cells and include a Monte-Carlo regularizer within the training loss to optimize the concrete dropout probabilities. A predictive variance term is introduced in the validation loss to enable searching for architectures with minimal model uncertainty. Experiments on CIFAR10, CIFAR100, SVHN, and ImageNet verify the effectiveness of µDARTS in improving accuracy and reducing uncertainty compared to existing DARTS methods. Moreover, the final architecture obtained from µDARTS shows higher robustness to noise at the input image and model parameters compared to the architectures obtained from existing DARTS methods.


I. INTRODUCTION
Uncertainty estimation of neural networks is a critical challenge for many practical applications [1], [8], [9]. We can approximate the uncertainty of a neural network by incorporating Monte-Carlo dropout [2]. Gal et al. used concrete dropout as a continuous relaxation of dropout's discrete masks to improve accuracy and provide better uncertainty calibration [3]. For a given training dataset, the model uncertainty of a network depends on its architecture. However, no prior work aims to optimize network architectures to reduce model uncertainty.
We approach this problem through neural architecture search (NAS) [4], [5], [11]-[13]. Liu et al. proposed differentiable architecture search (DARTS) [6], which uses a continuous relaxation of the architecture representation to enable an efficient gradient-descent-based search, thereby improving the efficiency of NAS. DARTS solves a bi-level optimization problem in which an outer loop searches over architectures using a validation loss ($L_{valid}$) and an inner loop optimizes the parameters (weights) of each architecture using a training loss ($L_{train}$).
We advance the baseline DARTS framework by introducing: (a) concrete dropout [3] layers within the DARTS cell to enable well-calibrated uncertainty estimation; (b) a Monte-Carlo dropout based regularizer [2] within $L_{train}$ to optimize the dropout probabilities; and (c) a predictive variance term in $L_{valid}$ to search for architectures with minimal model uncertainty. We refer to the proposed architecture search as Model Uncertainty-aware Differentiable ARchiTecture Search (µDARTS). This paper makes the following key contributions:
• We develop the µDARTS framework and training process to improve accuracy and simultaneously reduce the model uncertainty of neural networks.
• We show that the architecture search process in µDARTS converges to a flatter minima compared to the standard DARTS method.
• We show that the final architecture obtained via µDARTS converges to a flatter minima during model training compared to the model obtained using the standard DARTS method.
• We test the final DNN models obtained from the architecture search methods on the CIFAR10, CIFAR100, SVHN, and ImageNet datasets, and show that µDARTS improves the accuracy and uncertainty of the final DNN model found by the search.
• We show that µDARTS performs better when subjected to input noise and generalizes well when tested with parameter noise.
This paper aims to find the architecture that not only maximizes accuracy but also minimizes the predictive uncertainty of the model. We do so by using the predictive variance as a regularizer in the bi-level objective function of the DARTS architecture search method. The resulting search finds architectures with lower model uncertainty and better generalizability, as it induces an implicit regularization on the Hessian of the loss function, leading to more generalizable solutions for the architecture search process.
The rest of the paper is organized as follows: Section II discusses the background and related work, Section III presents the theoretical description of the proposed Model Uncertainty-Aware DARTS methodology, and Section IV describes the implementation of the architecture search baselines used for comparison with the proposed µDARTS method. Section V presents the experiments and the results obtained, and Section VI summarizes the conclusions drawn from those experiments.

II. BACKGROUND AND RELATED WORKS

A. UNCERTAINTY ESTIMATION USING CONCRETE DROPOUT.
Considering the likelihood function of the model to be a multivariate normal distribution, $p(y^* \mid f^{\omega}(x^*)) = \mathcal{N}(y^*; f^{\omega}(x^*), \Sigma)$, we can approximate the expected value of the variational predictive distribution by sampling $T$ sets of weights $\hat{\omega}_t$ ($t = 1, \ldots, T$) from the variational dropout distribution [2], [3]:

$$\mathbb{E}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*),$$

where $f^{\hat{\omega}_t}(x^*)$ denotes a forward pass through the model with the sampled weights $\hat{\omega}_t$. This effectively performs $T$ forward passes through the network $f$ with dropout, a process known as Monte Carlo (MC) dropout.
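As a concrete illustration, the MC-dropout estimator above can be sketched in a few lines of NumPy. The toy single-layer network, its weights, and the dropout probability are assumptions for illustration only, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-layer "network" with MC dropout kept on at inference time.
# W, b, and p are illustrative assumptions, not values from the paper.
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)
p = 0.2  # dropout probability

def mc_forward(x, T=500):
    """T stochastic forward passes f^{omega_t}(x) with dropout enabled."""
    outs = []
    for _ in range(T):
        # Inverted-dropout mask: sampled weights correspond to dropped inputs
        mask = rng.binomial(1, 1 - p, size=x.shape) / (1 - p)
        outs.append((x * mask) @ W + b)
    return np.stack(outs)

x = rng.normal(size=4)
samples = mc_forward(x)
pred_mean = samples.mean(axis=0)   # approximates E[y*]
pred_var = samples.var(axis=0)     # per-output predictive variance
```

Averaging the T samples approximates the predictive mean, and their spread is the basis of the predictive variance term used later in the paper.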

B. DIFFERENTIABLE ARCHITECTURE SEARCH.
Liu et al. presented DARTS to perform a one-shot neural architecture search. The DARTS method [6] is based on a continuous relaxation of the architecture representation, allowing an efficient search over the architecture space using gradient descent. DARTS formulates the architecture search as a differentiable problem, thus overcoming the scalability challenges faced by RL-based NAS methods. The DARTS optimization procedure is defined as a bi-level optimization problem with $L_{val}$ as the outer objective and $L_{train}$ as the inner objective:

$$\min_{\alpha} \; L_{val}(\alpha, w^*(\alpha)) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \; L_{train}(w, \alpha),$$

where the validation loss $L_{val}$ determines the architecture parameters $\alpha$ (outer variables) and the training loss $L_{train}$ optimizes the network weights $w$ (inner variables).
In the DARTS-based architecture search process, the computational graph is learned in one shot. At the end of the search phase, the connections and their associated operations are pruned, keeping those whose associated architecture-weight multipliers have the highest magnitude. Noy et al. [34] showed that the harsh pruning in DARTS, which occurs only once at the end of the search phase, is sub-optimal, and that gradually pruning connections can improve both search efficiency and accuracy. Bi et al. [35], on the other hand, address the difficulty of searching a complex search space by starting with a complete super-network and gradually pruning out weak operators. Though these methods help find efficient architectures, they neither estimate nor attempt to minimize the uncertainty of the model.

1) Improving DARTS Search Space
Several works have addressed shortcomings of the DARTS methodology. For example, the authors of [31] pointed out that, owing to the large gap between architecture depths in the search and evaluation scenarios, DARTS reports lower accuracy when evaluating the searched architecture or when transferring it to another task. The authors gradually increased the depth of the searched architectures to address this issue.
Alternatively, the authors of [32] addressed the large memory and compute overheads of jointly training a super-network and searching for an optimal architecture. They proposed sampling a small part of the super-network to reduce redundancy in exploring the network space, thereby performing a more efficient search without compromising performance. Dong et al. [33] represented the search space as a DAG to reduce the search time of the architecture search process.
This paper uses the DARTS search space and methodology as the baseline. However, our method can readily be extended to the methodologies and search spaces mentioned above whenever the aim is to find an architecture that improves accuracy while minimizing uncertainty and increasing robustness to noise.

2) Improving Robustness and Generalizability of DARTS:
Zela et al. [7] and Chen et al. [30] proposed more stable neural architecture search methods. SmoothDARTS (SDARTS) [30] uses a perturbation-based regularization to smooth the loss landscape and improve generalizability. Empirical results have shown that the generalization performance of architectures found by DARTS improves with a lower eigenvalue of the Hessian of the validation loss with respect to the architectural parameters ($\nabla^2_\alpha L^{DARTS}_{val}$). RobustDARTS [7] improves robustness by (i) computing the Hessian and stopping DARTS early to limit its eigenvalues (converging to a flatter minima) and (ii) adding an L2 regularization term to the training loss ($L_{train}$).

C. NEURAL ARCHITECTURE DISTRIBUTION SEARCH:
Ardywibowo et al. showed that searching for a distribution of architectures that performs well on a given task allows identification of common building blocks among uncertainty-aware architectures [29]. With this formulation, the authors optimized a stochastic out-of-distribution (OoD) detection objective and constructed an ensemble of models to perform OoD detection. However, that work concentrates on detecting out-of-distribution inputs by optimizing the Widely Applicable Information Criterion (WAIC), a penalized likelihood score used as the OoD detection criterion; it does not address the model uncertainty that arises from generalization error, focusing mainly on out-of-distribution uncertainty.

D. CONTRIBUTION OF THIS WORK.
The prior works on NAS and DARTS do not focus on minimizing uncertainty. Therefore, the key contribution of this paper is an architecture search method that can simultaneously maximize accuracy and reduce model uncertainty.

III. MODEL UNCERTAINTY AWARE DARTS
We propose µDARTS, a neural architecture search method that finds the optimal architecture which simultaneously improves accuracy and reduces model uncertainty, while also giving a tighter estimate of the model uncertainty through concrete dropout layers. We formulate the µDARTS bi-level optimization problem as:

$$\min_{\alpha} \; L^{DARTS}_{val}(\alpha, w^*(\alpha)) + \mathrm{Var}^{model}_{p(y|x)}(\alpha, w^*(\alpha)) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \; L^{DARTS}_{train}(w, \alpha) + L^{DARTS}_{MC}(\theta),$$

where $L^{DARTS}_{MC}(\theta)$ is the Monte Carlo dropout loss and $\mathrm{Var}^{model}_{p(y|x)}(\alpha, w^*(\alpha))$ is the predictive variance. As pointed out by Zela et al. [7], increasing the inner-objective regularization helps control the largest eigenvalue and drives the search toward solutions with a smaller Hessian spectrum and better generalization properties. We also observe an implicit regularization effect on the outer objective, which reduces overfitting of the architectural parameters; this is further discussed in Appendix A. Compared to the optimization problem for DARTS (see Eq. 3), we make the following key updates. The validation loss $L_{val}$ drives the architecture search, and we want the search to favor the architecture with the least predictive uncertainty along with high accuracy. Hence, we add the predictive variance term to the validation loss, giving the new validation objective $L^{DARTS}_{val}(\alpha, w^*(\alpha)) + \mathrm{Var}^{model}_{p(y|x)}(\alpha, w^*(\alpha))$, where $w^*$ is the optimal set of weights found by optimizing the training loss in the inner level of the bi-level problem. The predictive variance is estimated as the variance of $T$ Monte Carlo samples of the network output:

$$\mathrm{Var}^{model}_{p(y|x)} \approx \frac{1}{T} \sum_{t=1}^{T} \left(y_t - \bar{y}\right)^2,$$

where $\{y_t\}_{t=1}^{T}$ is a set of $T$ sampled outputs for weight instances $\omega_t \sim q(\omega; \Phi)$ and $\bar{y} = \frac{1}{T}\sum_t y_t$.
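The outer objective can be sketched in a few lines. The stand-in logits and label below are illustrative assumptions; in practice the MC outputs come from stochastic forward passes of the super-network:

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def predictive_variance(mc_outputs):
    """Variance of T Monte-Carlo outputs {y_t}, averaged over output dims."""
    return mc_outputs.var(axis=0).mean()

# Stand-in for T = 20 stochastic forward passes on one validation sample.
mc_logits = rng.normal(size=(20, 10))
label = 3

l_val = cross_entropy(mc_logits.mean(axis=0), label)
# Sketch of the mu-DARTS outer objective: accuracy term plus variance term.
loss = l_val + predictive_variance(mc_logits)
```

The variance term penalizes architectures whose MC samples disagree, which is exactly the pressure toward low model uncertainty described above.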
We also add a Monte-Carlo loss term to the training loss, since the training loss determines the weights of a particular architecture. To get well-calibrated uncertainty estimates, the dropout probability must be adapted to the data at hand as a variational parameter. We therefore use concrete dropout layers and add the Monte Carlo loss to calibrate the dropout probabilities. As shown in [3], the optimization objective that follows from the variational interpretation can be written as

$$\hat{L}_{MC}(\theta) = -\frac{1}{M} \sum_{i \in S} \log p\left(y_i \mid f^{\omega}(x_i)\right) + \frac{1}{N} KL\left(q_\theta(\omega) \,\|\, p(\omega)\right),$$

where $\theta$ is the set of parameters to optimize, $N$ is the number of data points, $S$ is a random set of $M$ data points, $f^{\omega}(x_i)$ is the neural network's output on input $x_i$ when evaluated with the weight-matrix realization $\omega$, and $p(y_i \mid f^{\omega}(x_i))$ is the model's likelihood, e.g., a Gaussian with mean $f^{\omega}(x_i)$.
The term $KL(q_\theta(\omega) \,\|\, p(\omega))$ is a "regularization" term which ensures that the approximate posterior $q_\theta(\omega)$ does not deviate too far from the prior distribution $p(\omega)$.
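A minimal sketch of the two ingredients, assuming the standard concrete-dropout form from Gal et al. [3]; the exact scaling constants below are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)

def concrete_dropout_mask(p, shape, temperature=0.1):
    """Continuous relaxation of a Bernoulli keep/drop mask (concrete dropout).

    Returns values in [0, 1] that concentrate near {0, 1} for small
    temperatures, so gradients can flow through the drop probability p.
    """
    u = rng.uniform(1e-7, 1 - 1e-7, size=shape)
    drop_logit = np.log(p) - np.log(1 - p) + np.log(u) - np.log(1 - u)
    return 1.0 - 1.0 / (1.0 + np.exp(-drop_logit / temperature))

def dropout_regularizer(p, n_units, N, weight_norm_sq, length_scale=1e-2):
    """KL-derived regularizer, sketched: an L2 term scaled by 1/(1-p) plus
    the negative Bernoulli entropy of p, both scaled by 1/N (constants are
    assumptions for illustration)."""
    weights_term = length_scale**2 * weight_norm_sq / (2.0 * N * (1.0 - p))
    entropy_term = (p * np.log(p) + (1 - p) * np.log(1 - p)) * n_units / N
    return weights_term + entropy_term

mask = concrete_dropout_mask(0.2, shape=(1000,))
reg = dropout_regularizer(0.2, n_units=128, N=50000, weight_norm_sq=10.0)
```

Because the mask is a smooth function of p, the dropout probabilities in each layer can be updated by gradient descent alongside the weights, which is what calibrates the uncertainty estimate.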
The total derivative of $L_{val}$ with respect to $\alpha$, evaluated at $(\alpha, w^*(\alpha))$, is

$$\nabla_\alpha L_{val}(\alpha, w^*(\alpha)) = \nabla_\alpha L_{val} + \nabla_\alpha w^*(\alpha) \, \nabla_{w} L_{val}.$$

In general, computing the inverse of the Hessian is not possible given the high dimensionality of the model parameters $w$; thus, we use gradient-based iterative algorithms to find the optimal $w^*$. However, to avoid the computationally expensive repeated training of each architecture, we approximate $w^*(\alpha)$ by updating the current model parameters $w$ using a single gradient-descent step, similar to the approximation in [6]:

$$w^*(\alpha) \approx w' = w - \xi \nabla_w L_{train}(w, \alpha),$$

where $\xi$ is the learning rate of the virtual gradient step DARTS takes with respect to the model weights $w$. Therefore, we obtain:

$$\nabla_\alpha L_{val}(\alpha, w^*(\alpha)) \approx \nabla_\alpha L_{val}(\alpha, w') - \xi \, \nabla^2_{\alpha, w} L_{train}(w, \alpha) \, \nabla_{w'} L_{val}(\alpha, w'),$$

where the inverse Hessian $[\nabla^2_w L_{train}]^{-1}$ is replaced by the learning rate $\xi$.
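The virtual gradient step and the finite-difference treatment of the mixed second-order term can be checked on a toy problem. The scalar quadratic losses below are assumptions chosen so the unrolled gradient has a closed form; they are not the paper's losses:

```python
import numpy as np

# Toy scalar losses in (w, alpha), chosen so the unrolled gradient is exact.
def dL_train_dw(w, a): return w - a
def dL_train_da(w, a): return -(w - a)
def L_val(w, a): return 0.5 * (w - 2 * a) ** 2
def dL_val_dw(w, a): return w - 2 * a
def dL_val_da(w, a): return -2 * (w - 2 * a)

def alpha_grad(w, a, xi=0.1, eps=1e-3):
    """Gradient of L_val w.r.t. alpha after one virtual SGD step on w, with
    the mixed second-order term estimated by finite differences, as in the
    DARTS approximation [6]."""
    w_virt = w - xi * dL_train_dw(w, a)      # single virtual step w' = w - xi*grad
    g_a = dL_val_da(w_virt, a)               # first-order term at (w', a)
    g_w = dL_val_dw(w_virt, a)
    # Finite-difference estimate of (d^2 L_train / da dw) @ g_w
    mixed = (dL_train_da(w + eps * g_w, a)
             - dL_train_da(w - eps * g_w, a)) / (2 * eps)
    return g_a - xi * mixed

g = alpha_grad(w=1.0, a=0.5)
```

Because the toy losses are quadratic, the finite-difference estimate of the mixed term is exact and `alpha_grad` matches the true derivative of the unrolled objective.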
The final output of the µDARTS method is the optimal architecture, with the maximum accuracy and the minimum uncertainty in the architecture search space. Beyond this, including the predictive variance term in the validation loss and the Monte Carlo dropout loss in the training loss brings additional benefits; we hypothesize a two-fold benefit from this modified loss structure.

Firstly, the predictive variance term acts as a regularizer on the validation loss, making the neural architecture search more robust and resilient to input or parameter noise. We verify this hypothesis empirically by computing the largest eigenvalue of the Hessian of the validation loss, which indicates the flatness of the loss minima reached by the architecture search process; an analytical proof for a simplified linear model is given in Appendix B.

Similarly, the Monte Carlo dropout loss added to the training loss acts as another regularizer, so the final architecture obtained from the search also performs better under noise perturbation. We verify this empirically by computing the largest eigenvalue of the Hessian of the training loss, showing that the final architecture obtained from the search converges to a flat minima.

DARTS:
We implemented the standard DARTS method [6] with a search space O consisting of the following operations: 3×3 and 5×5 separable convolutions (sep conv), 3×3 and 5×5 dilated separable convolutions (dil conv), 3×3 max pooling, 3×3 average pooling, identity, and zero (Fig. 1). All operations use unit stride, wherever applicable, and padding is added to the convolved feature maps to preserve their spatial resolution. For the convolution operations, we use the order ReLU/ConcDrop(ReLU)-Conv-BN, as done in [4], [5]. The output node of a convolutional cell is the depthwise concatenation of all intermediate nodes, excluding the input nodes. The network is formed by stacking multiple cells, following the same principle as [6]. We used the DARTS implementation from [6] for our results and ran it for 50 epochs for both the search and training phases.
DARTS with Concrete Dropout (DARTS-CD). The final architecture obtained from DARTS cannot be directly used for uncertainty estimation, so we modify the optimal architecture from standard DARTS to enable it. We include a Concrete dropout [3] layer in every layer of the final architecture, and obtain DARTS-CD by adding a Monte Carlo dropout loss to the training loss $L_{train}$ to learn the optimal dropout probabilities [3]. The re-trained final architecture is then used to estimate accuracy and uncertainty over multiple MC samples.
RobustDARTS with Concrete Dropout (RDARTS-CD). We implemented the RobustDARTS (RDARTS) method using the L2 regularization in the training loss and the early-stopping mechanism of [7]. As with DARTS-CD, concrete dropout layers are included in each layer of the final architecture generated by RDARTS, and the modified architecture is re-trained with the Monte-Carlo dropout loss added to the loss function. The modified final architecture is then used to estimate model uncertainty and accuracy.
Progressive Differentiable Architecture Search with Concrete Dropout (P-DARTS-CD). We also implemented Progressive Differentiable Architecture Search (P-DARTS) [31], an efficient algorithm that gradually increases the depth of the searched architectures during the search, mitigating heavy computational overheads and weak search stability. Concrete Dropout layers are added to its final architecture as in the other baselines.
PC-DARTS-CD. Partially-Connected DARTS (PC-DARTS) [32] performs a more efficient search without compromising performance by sampling a small part of the super-network to reduce redundancy in exploring the network space. Similar to the previously mentioned baselines, we add Concrete Dropout layers to the final architecture from PC-DARTS and, using the Monte Carlo dropout loss in the loss function, estimate the uncertainty of the model.
GOLD-NAS-CD. Gradual One-Level Differentiable Neural Architecture Search (GOLD-NAS) introduces a variable resource constraint to one-level optimization so that weak operators are gradually pruned out of the super-network. As with the other methods, we add Concrete Dropout layers after each layer in the final architecture to obtain the model's uncertainty estimate.
ASAP-CD. Architecture Search, Anneal and Prune (ASAP) [34] uses a differentiable search space that allows annealing of the architecture weights while gradually pruning inferior operations; in this way, the search converges continuously to a single output network. As with the other methods, we add Concrete Dropout layers after each layer in the final architecture to obtain the model's uncertainty estimate.
µDARTS. Fig. 1 shows the overall architecture of the proposed µDARTS method, including the internal details of a cell. We include the following operations in O: 3×3 and 5×5 separable convolutions (sep conv), 3×3 and 5×5 dilated separable convolutions (dil conv), 3×3 max pooling, 3×3 average pooling, identity, and zero. The key difference between a µDARTS cell and a standard DARTS cell is the inclusion of concrete dropout within the cell's operations. The comprehensive details of all the methods described above are shown in Fig. 2.

A. TRAINING CONDITIONS
All experiments in this paper were performed on a single NVIDIA GeForce GTX 1080 Ti graphics card. For a fair comparison between the models, we keep the training parameters constant across all models. We trained all methods for 50 epochs for the architecture search with a batch size of 32 and a learning rate of 0.05. After obtaining the final architecture, we train the final model for 50 epochs with a batch size of 64 and a learning rate of 0.025 to get the best results from the model obtained. Table 1 summarizes the hyperparameters used in the architecture search.
The experimental results of the paper are divided into the following subsections:
• Comparative analysis of the architecture search methods, including ablation studies, robustness, convergence, and run-time.
• Comparative analysis of the final architecture, including flatness of the loss surface and testing errors.
• Comparative analysis of the final DNN models, including accuracy and uncertainty comparisons and tolerance to input and parameter noise.
In the rest of this paper, we train all models under the same training conditions described above for uniformity and evaluate them to obtain the accuracy and uncertainty estimates. For reference, we also compared the vanilla models (without concrete dropout layers), as implemented in the original papers, against µDARTS under the accuracy and training conditions used in this paper. The results are shown in Appendix C.

B. ANALYSIS OF THE ARCHITECTURE SEARCH METHODS
The architecture found by the µDARTS search is shown in Fig. 3. Including $L_{MC}(\theta)$ results in calibrated uncertainty values in the inner loop of the DARTS optimization. However, if the predictive variance term is removed, the search is no longer optimized for uncertainty; removing the predictive variance term while keeping the MC loss is equivalent to the DARTS-CD architecture search method. We perform this ablation study on the CIFAR10 dataset and compare against the case where the predictive variance term is included in the validation loss. Fig. 4 plots the predictive uncertainty against training epochs for µDARTS and DARTS-CD.
In this paper, we search for an architecture that not only maximizes accuracy but simultaneously minimizes model uncertainty, using the predictive variance as a regularizer of the loss function as given in Eq. 4. To verify that the predictive variance term helps find architectures with lower predictive uncertainty, we plotted the evolution of the predictive uncertainty of the different architecture search methods against µDARTS. The results are shown in Fig. 5. The model uncertainty (measured using the predictive variance) remains almost constant for methods like DARTS-CD and RDARTS-CD, whereas it steadily decreases for µDARTS, which uses the predictive variance as a regularizer of the cross-entropy loss function.
Role of $L_{MC}(\theta)$: When the Monte Carlo dropout loss is removed from the training loss $L^{DARTS}_{train}$, the dropout probabilities in each layer are not updated and the uncertainty is therefore not calibrated, yielding a misleading estimate of the model's uncertainty. In Fig. 6 we show this by repeating the same experiment with different fixed dropout probabilities: each dropout probability gives rise to a different uncertainty estimate, as discussed in [3]. We also note that removing the Monte Carlo loss from the search while keeping the predictive variance term would yield an architecture optimized for an uncalibrated uncertainty estimate; the architecture found this way may not be the best one for minimizing uncertainty.
We empirically evaluate the robustness of the architecture search methods by estimating the largest eigenvalue of the Hessian of the validation loss ($L_{valid}$) for each method. Table 2 shows that µDARTS has a smaller largest eigenvalue than the other methods, making it more robust. Fig. 7 plots the evolution of this eigenvalue: for the standard DARTS method it keeps increasing with epochs, whereas for µDARTS it increases very slowly and remains much lower than for the other methods. RDARTS-CD can achieve similar robustness but requires early stopping. This empirical analysis verifies that including the predictive variance $\mathrm{Var}^{model}_{p(y|x)}(\alpha, w^*(\alpha))$ in the validation loss ($L_{valid}$) improves the robustness of the architecture search method (an analytical proof of this property, assuming a linear model, is given in the Appendix).
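The largest Hessian eigenvalue in such comparisons is typically estimated by power iteration on Hessian-vector products rather than by materializing the Hessian. A sketch, with a small diagonal matrix standing in for the true Hessian (in practice the products would come from a double-backward pass through the validation loss):

```python
import numpy as np

rng = np.random.default_rng(3)

def largest_eigenvalue(hvp, dim, iters=200):
    """Power iteration using only Hessian-vector products, the standard way
    to estimate the dominant eigenvalue when the Hessian w.r.t. the
    architecture parameters is too large to form explicitly."""
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        lam = float(v @ hv)          # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv)
    return lam

# Toy symmetric positive-definite stand-in for the Hessian (an assumption).
H = np.diag([3.0, 1.0, 0.5])
lam_max = largest_eigenvalue(lambda v: H @ v, dim=3)
```

A smaller dominant eigenvalue indicates a flatter minimum of the loss surface, which is the quantity compared across search methods here.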

3) Convergence and Runtime Analysis
In this section, we discuss the run times and convergence of the architecture search methods. Fig. 8 shows the validation error of the search models at each epoch for the CIFAR10, CIFAR100, SVHN, and ImageNet datasets. µDARTS shows faster convergence (lower validation loss) than the other architectures. Table 3 shows that the runtime of µDARTS is similar to that of DARTS, DARTS-CD, and RDARTS-CD.

C. COMPARATIVE ANALYSIS OF THE FINAL ARCHITECTURE
Performance of Final Architectures. We compare the final architectures obtained from the architecture search methods by computing the largest eigenvalue of the Hessian of the training loss $L_{train}$. Table 4 shows that µDARTS has a lower mean and standard deviation of testing error than DARTS, DARTS-CD, and RDARTS-CD. Hence, we empirically verify that using the Monte-Carlo dropout loss as a regularizer in the training loss, instead of the L2 regularizer used in RobustDARTS [7], improves the robustness of the final architecture. These results indicate that the µDARTS search converges to a flatter minima at each iteration of the bi-level optimization.
Importance of MC Dropout Loss within Bi-level Optimization. Note that in the DARTS-CD method, the Concrete Dropout layers are added manually after the architecture search. The comparison between DARTS-CD and µDARTS therefore shows that simply adding concrete dropout, without solving the bi-level optimization problem, does not yield the optimal solution.

1) Accuracy and Uncertainty
We compare the performance of µDARTS on the CIFAR10, CIFAR100 [19], SVHN [20], and ImageNet [21] datasets. All experiments are performed using PyTorch [18] based models. For each dataset, we compare the accuracy and uncertainty of the final architectures obtained from µDARTS with: (1) architectures obtained from standard DARTS, (2) architectures obtained from the other baseline architecture search methods, and (3) existing deep networks for image classification with Concrete dropout. We calculate the model uncertainty by running multiple Monte Carlo passes of the network and computing the predictive variance (see equation (5)).
As examples of image classification networks, we consider MobileNetv2 [14], VGG16 [15], ResNet20 [16], and EfficientNet [17]. All networks were implemented from source code: we used the standard model for each and implemented it in PyTorch, replacing the ReLU layers of the original implementations with Concrete dropout-based ReLU layers. An MC dropout loss is also added to the loss function to learn the optimal dropout probabilities and obtain a tighter bound on the uncertainty estimate of the model. The modified models were re-trained, and uncertainty estimation was performed over multiple MC samples.
We obtained confidence intervals for the performance of the searched architectures by calculating the mean and standard deviation of model accuracy over 10 re-trainings and re-evaluations of the final searched model with different initializations. The observed mean and variance of model accuracy, and the mean of the uncertainty, are shown in Table 5. The table shows that µDARTS outperforms the standard differentiable architecture search baselines DARTS-CD, RDARTS-CD, P-DARTS-CD, PC-DARTS-CD, GOLD-NAS-CD, and ASAP-CD. The architecture obtained with µDARTS also outperforms standard non-NAS architectures like MobileNetv2, VGG16, ResNet20, and EfficientNet-B0. Note that all the baseline models are trained under the same training conditions as the µDARTS model. Under these conditions, the CIFAR10 accuracy of µDARTS (96.22%) is much higher than that of standard DARTS (94.32%); a similar trend holds for ImageNet (74.64% for µDARTS compared to 70.63% for DARTS). This shows that the proposed µDARTS method can achieve good accuracy and uncertainty scores with minimal training (50 epochs) compared to the other baselines. The better relative performance on ImageNet compared to CIFAR10 may be attributed to the regularization of the loss functions: the added predictive variance and Monte Carlo loss terms enable a more efficient architecture search and help stabilize the µDARTS search by implicitly regularizing the Hessian of the loss functions. This implicit regularization leads to a smaller dominant eigenvalue of the Hessian, which serves as a proxy for sharpness, and thus to a more generalizable architecture.

2) Input Noise Tolerance of Final DNN Models
We study the accuracy and uncertainty of the final DNN models obtained from the various architecture search methods under Gaussian noise added to the input images. We used the CIFAR10, CIFAR100, and SVHN datasets and compared performance across varying Signal-to-Noise Ratios (SNR). The models are not re-trained with noisy images; noisy images are applied only during inference [Table 6]. The added predictive variance and Monte-Carlo loss terms act as regularizers, improving performance over the standard methods: the µDARTS model has higher accuracy and lower uncertainty under noise than the DNN models from the other architecture search methods.
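One common way to realize such an SNR sweep is to scale zero-mean Gaussian noise to the measured signal power; the helper below is a sketch under that assumption (the paper's exact noise protocol may differ):

```python
import numpy as np

rng = np.random.default_rng(4)

def add_noise_at_snr(image, snr_db):
    """Add zero-mean Gaussian noise so the signal-to-noise ratio is snr_db."""
    signal_power = float(np.mean(image ** 2))
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=image.shape)
    return image + noise

img = rng.uniform(size=(32, 32, 3))   # stand-in for a CIFAR-sized image
noisy = add_noise_at_snr(img, snr_db=10.0)
```

Sweeping `snr_db` from high to low values reproduces the inference-time noise conditions compared in Table 6.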

3) Tolerance of Final DNN Models to Noisy Parameters
We test each architecture's performance when the model parameters are perturbed by a small Gaussian noise ∼ N(0, 1).
We hypothesize that a more stable neural architecture will show a lower variance of the testing error and uncertainty. Table 7 shows that the DNN model generated by the µDARTS method gives the least variance; hence, the µDARTS final architecture is stable and handles noise perturbations well. Comparing the results of Tables 5 and 7, we see that although the accuracy of all models falls and their uncertainty increases under parameter noise, the change for the µDARTS model [e.g., CIFAR10: ∆Mean Accuracy = 2.61, ∆Variance = 0.045] is smaller than for the other networks. Hence, we conclude that the µDARTS model is more resilient to parameter noise.
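The perturb-and-evaluate protocol can be sketched as below; the toy parameter list and the accuracy proxy are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def perturb_params(params, sigma=1.0):
    """Return a copy of the parameter list with N(0, sigma^2) noise added
    to every tensor (the original parameters are left untouched)."""
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

def eval_under_noise(evaluate, params, trials=10):
    """Repeat the perturb-and-evaluate loop and summarize the scores by
    their mean and variance, matching how stability is reported here."""
    scores = [evaluate(perturb_params(params)) for _ in range(trials)]
    return float(np.mean(scores)), float(np.var(scores))

# Toy "model": the accuracy proxy is just the norm of the first tensor.
params = [rng.normal(size=(4, 3)), rng.normal(size=3)]
mean_score, var_score = eval_under_noise(lambda ps: np.linalg.norm(ps[0]),
                                         params)
```

A lower `var_score` across trials corresponds to the stability criterion used to rank the architectures in Table 7.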

VI. CONCLUSIONS
In this paper, we proposed a novel method, referred to as µDARTS, to search for a neural architecture that simultaneously improves the accuracy and the uncertainty of the final neural network architecture. µDARTS uses concrete dropout layers within the cells of a neural architecture search framework, and adds predictive variance and Monte-Carlo dropout losses as regularizers in the validation and training losses, respectively. We experimentally demonstrate that µDARTS improves the performance of both the neural architecture search and the final architecture it finds, by showing that the optimization problem converges to a flat minima. We also empirically show that the final architecture is stable when perturbed with input and parameter noise.

The update of $w$ in µDARTS can thus implicitly control the trace norm of the Hessian of the loss function. If the matrix is close to positive semi-definite, this approximately regularizes the (positive) eigenvalues of $\nabla^2_\alpha L^{\mu DARTS}_{val}(w, \alpha)$. Therefore, µDARTS reduces the Hessian norm through its training procedure.

B. COMPARISON OF STABILITY USING LARGEST EIGENVALUES
In [7], the authors empirically showed that the instability of the DARTS method is related to the norm of the Hessian $\nabla^2_\alpha L_{valid}$. Chen et al. [30] verified this by plotting the validation accuracy landscape of DARTS, showing that it is extremely sharp: even a small perturbation of $\alpha$ can drastically change the validation accuracy.
Here we prove that the largest eigenvalue of the Hessian of the validation loss in µDARTS ($\nabla^2_\alpha L^{\mu\mathrm{DARTS}}_{valid}$) is lower than that of DARTS ($\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid}$).

Lemma 6.1: The largest eigenvalue of the Hessian of the validation loss of DARTS is bounded as
$$\lambda_{max}\left(\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid}\right) \le \sigma_d^{max}\,\lambda_{max}(M_x),$$
where $\sigma_d^{max} = \max_i \sigma\!\left(x_i^T\alpha\right)\left(1-\sigma\!\left(x_i^T\alpha\right)\right)$ and $M_x = \sum_{i=1}^{N} x_i x_i^T$.

Proof. Let $x_i$ be the input vector, $\alpha$ the weight matrix of architectural parameters, and $y_i$ the vector denoting the class labels. For simplicity, we consider a linear model $x_i^T\alpha$ for this proof. Taking the cross-entropy loss as the validation loss ($L_{valid}$), we obtain
$$\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid} = \sum_{i=1}^{N} \sigma_d\, x_i x_i^T,$$
where $\sigma(\cdot)$ is the sigmoid function and $\sigma_d = \sigma\!\left(x_i^T\alpha\right)\left(1-\sigma\!\left(x_i^T\alpha\right)\right)$ for each $i$. The Rayleigh quotient of $\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid}$ for any unit-length vector $z$ is
$$R^{\mathrm{DARTS}}(z) = z^T\left(\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid}\right) z = \sum_{i=1}^{N} \sigma_d \left(x_i^T z\right)^2.$$
Denoting the maximum of $\sigma_d$ by $\sigma_d^{max}$, we obtain
$$R^{\mathrm{DARTS}}(z) \le \sigma_d^{max} \sum_{i=1}^{N} \left(x_i^T z\right)^2 = \sigma_d^{max} R_M(z),$$
where $R_M(z) = z^T M_x z$ is the Rayleigh quotient of $M_x$. Since the maximum eigenvalue of a symmetric matrix $A$ equals the maximum of its Rayleigh quotient ($\lambda_{max}(A) = R_A^{max}$), the claimed bound follows. □

Corollary: The validation loss function of the standard DARTS method ($L^{\mathrm{DARTS}}_{valid}$) is convex. Proof. The smallest eigenvalue of a square symmetric matrix is the minimum of its Rayleigh quotient. Since $\sigma_d \ge 0$, the expressions above give $R^{\mathrm{DARTS}}(z) \ge 0$. Hence the smallest possible eigenvalue of $\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid}$ is zero or positive, which implies that the validation loss is convex. □

Lemma 6.2: The largest eigenvalue of the Hessian of the validation loss of µDARTS is bounded as
$$\lambda_{max}\left(\nabla^2_\alpha L^{\mu\mathrm{DARTS}}_{valid}\right) \le \sigma_{ud}^{max}\,\lambda_{max}(M_x).$$

Proof. To estimate $\nabla^2_\alpha L^{\mu\mathrm{DARTS}}_{valid}$, we add the predictive variance term to the DARTS validation loss. Without loss of generality, we consider the case $T = N$. The Rayleigh quotient of $\nabla^2_\alpha L^{\mu\mathrm{DARTS}}_{valid}$ then decomposes into the DARTS term and a variance term with coefficient $\sigma_j$, where $\sigma_j$ is a polynomial in $\sigma_x$, $\sigma_{x^2}$, and $\alpha$; since $N$ is large, the remaining cross terms vanish. Moreover, since $\sigma$ is convex on the relevant domain, Jensen's inequality gives $\sigma_x^2 \le \sigma_{x^2}$. Collecting the coefficients into $\sigma_{ud}$ and denoting the maximum values of $\sigma_{ud}$ and $\sigma_j$ by $\sigma_{ud}^{max}$ and $\sigma_j^{max}$, respectively, the relation between the maximum eigenvalue and the Rayleigh quotient, as in Lemma 6.1, yields the claimed bound. □

Lemma 6.3: If $\lambda_{max}(\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid})$ and $\lambda_{max}(\nabla^2_\alpha L^{\mu\mathrm{DARTS}}_{valid})$ are the maximum eigenvalues of the Hessians of the validation loss for DARTS and µDARTS, respectively, then
$$\lambda_{max}\left(\nabla^2_\alpha L^{\mu\mathrm{DARTS}}_{valid}\right) \le \lambda_{max}\left(\nabla^2_\alpha L^{\mathrm{DARTS}}_{valid}\right).$$

Proof. Write $\sigma_d = p(1-p)$, where $p = \sigma(x_i^T\alpha) \in [0,1]$ by the properties of the sigmoid function. Maximizing $\sigma_d$ with respect to $p$ gives $\sigma_d^{max} = 0.25$. Next, write
$$\sigma_{ud} = \sqrt{q}\left(1-\sqrt{q}\right) + 4\alpha^T\alpha\left[q(1-q) - 2q^2(1-q)\right] + 2q(1-q) - 2\left(\sqrt{q}\left(1-\sqrt{q}\right) + \sqrt{q}\right),$$
where $q = \sigma_{x^2} = \sigma\!\left(\alpha x_i x_i^T \alpha^T\right) \in [0,1]$. Since $\alpha$ holds the weights of the neural network, we have $\|\alpha\| < 1$, which implies $\alpha^T\alpha < 1$. Maximizing $\sigma_{ud}$ with respect to $q$ and noting $\alpha^T\alpha < 1$, we observe that $\sigma_{ud} \le 0$. Thus $\sigma_{ud}^{max} \le 0 < 0.25 = \sigma_d^{max}$. From the definition of $R_M$ we note that $R_M^{max} \ge 0$. Therefore, using the bounds of Lemmas 6.1 and 6.2 together with the bound on $\sigma_{ud}^{max}$, we obtain the claim. □
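The bound of Lemma 6.1 can be checked numerically for the linear logistic model used in the proof. Below is a small NumPy sketch; the data and α are synthetic, not from the paper's experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))          # inputs x_i
alpha = rng.normal(size=d) * 0.1     # architectural parameters (linear model)

# Hessian of the cross-entropy loss of the linear model x_i^T alpha:
#   H = sum_i sigma_d(i) * x_i x_i^T, with sigma_d(i) = s_i (1 - s_i).
s = sigmoid(X @ alpha)
sigma_d = s * (1 - s)
H = (X * sigma_d[:, None]).T @ X

M = X.T @ X                          # M_x = sum_i x_i x_i^T
lam_H = np.linalg.eigvalsh(H).max()
bound = sigma_d.max() * np.linalg.eigvalsh(M).max()
# Lemma 6.1: lam_H <= sigma_d^max * lam_max(M_x); also sigma_d^max <= 0.25.
```

The same check with the predictive-variance term added would compare the two Hessian spectra directly, as in Lemma 6.3.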

C. COMPARISON OF ACCURACY OF UNMODIFIED MODELS
We compare the performance of the µDARTS method with models searched using other architecture-search methods, with each model trained under the conditions reported in its original paper. Note that the original NAS and DARTS papers report only the accuracy of the final models (not uncertainty). The results are shown in Table C.1. We see that µDARTS, trained for only 50 epochs, gives performance comparable to that of other networks trained for a much higher number of epochs and with a much larger batch size. This result further showcases the robustness and easy trainability of µDARTS.

Fig. 2 .
Fig. 2. Summary of baselines and µDARTS. XDARTS-CD refers to the baseline architecture-search models, viz., P-DARTS-CD, PC-DARTS-CD, RobustDARTS-CD, GOLD-NAS-CD, and ASAP-CD, described in Section IV.

1) Ablation Studies: In this section, we perform an ablation study for two cases: (1) removing the Monte-Carlo loss function L_MC(θ) from the training loss L_train; (2) removing the predictive variance term Var^model_p(y|x)(α, w*(α)) from the validation loss L_valid.

Role of Var^model_p(y|x)(α, w*(α)):

Fig. 3. Normal and reduction cell representations searched using the µDARTS algorithm.
Fig. 4.

Fig. 5 .
Fig. 5. Variation of the predictive variance for different architecture-search methods with increasing epochs on the CIFAR-10 dataset (the y-axis is on a logarithmic scale).

TABLE 1 .
Hyperparameters of the architecture search process.

Each cell has the option to include a Concrete dropout layer. The Concrete dropout layer, if included, enables computation of the model's uncertainty values. Wherever applicable, the operations have unit stride, and we use padding in the convolved feature maps. µDARTS also includes a Concrete dropout in the final softmax layer of the model.

TABLE 2 .
Robustness of Architecture Search Methods (Largest Eigenvalues of ∇²_w L_valid)

TABLE 3 .
Run-time in GPU hours for different architecture search methods

TABLE 4 .
Performance of Final Architectures. The models in Table 5 are equipped with concrete dropout to calculate the uncertainty of the final models.

TABLE 5 .
Comparison of Accuracy and Uncertainty of Final Models from Different Architecture Search Processes (all models are trained for 50 epochs with a batch size of 64)

TABLE 6 .
Accuracy and Uncertainty of Different DNN Models under Input Noise

TABLE 7 .
Performance of Final DNNs under Parameter Noise

TABLE C.1.
Comparison of Reported Test Errors of Models for Different Architecture Search Processes (the test errors are the values reported in the original papers)