RARTS: An Efficient First-Order Relaxed Architecture Search Method

Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still concerns about the accuracy of first-order DARTS and the efficiency of second-order DARTS. In this paper, we formulate a single level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving the mixed second derivatives of the corresponding loss functions that DARTS requires. In our formulation of network splitting, two networks with different but related weights cooperate in search of a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown in our experimental results. For the task of searching topological architecture, i.e., the edges and the operations, RARTS obtains higher accuracy and a 60% reduction in computational cost compared with second-order DARTS on CIFAR-10. RARTS continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS, even though our innovation is purely on the training algorithm without modifying the search space. For the task of searching width, i.e., the number of channels in convolutional layers, RARTS also outperforms the traditional network pruning benchmarks. Further experiments on a public architecture search benchmark, NATS-Bench, also support the advantage of RARTS.


I. INTRODUCTION
Neural Architecture Search (NAS) is an automated machine learning technique that designs an optimal neural network architecture by searching for the building blocks of deep neural networks in a collection of candidate structures and operations. Although NAS has achieved many successes in several computer vision tasks [1]-[6], the search process demands huge computational resources. Search times have come down considerably from as many as 2000 GPU days in early NAS [2], thanks to subsequent studies [7]-[13] among others. Differentiable Architecture Search (DARTS) [14] is an appealing method that avoids searching over all possible combinations by relaxing the categorical architecture indicators to continuous parameters. The higher level architecture can be learned along with lower level weights via stochastic gradient descent by approximately solving a bilevel optimization problem. DARTS is further divided into first-order DARTS and second-order DARTS, according to whether an estimate of the mixed second derivative of the loss function is omitted or used.
Despite the search efficiency obtained from continuous relaxation, DARTS still has problems, both experimentally and theoretically. These are the efficiency problem of second-order DARTS, the convergence problem of first-order DARTS, and the architecture collapse problem (i.e., the selected architecture contains too many skip-connections) of both variants. Second-order DARTS takes much longer to search than first-order DARTS, as it involves the mixed second derivatives. It has also been pointed out that second-order DARTS can have a superposition effect [15], which means that the approximation of the gradient of α is based on the approximation of the weight w one step ahead. This is believed to cause gradient errors and failures in finding optimal architectures. Therefore, it is used less often in practice than first-order DARTS [15], [16]. However, first-order DARTS learns the architecture using only half of the data. Evidence has been provided that this can result in incorrect limits and worse performance [14]. The experimental results also show that first-order DARTS (3.00% error) is less accurate than second-order DARTS (2.76% error) on the CIFAR-10 dataset [14], [17]. As for the architecture collapse problem, such a bias in operation selection typically degrades the model performance. This problem has been observed by a few researchers [18], [19], who have tried to solve it by replacing some operations of the architecture.
In addition to the search for topological architectures, i.e., edges and operations of cells (building blocks) in some early NAS works and DARTS [2], [14], many NAS style methods have been developed to search for the width of a model, i.e., the number of channels in convolutional layers [16], [20]. Searching for width can be viewed as a form of channel pruning, which is a common tool for network compression, i.e., constructing slim networks from redundant ones [21]. Specifically, channel pruning can be formulated as an architecture search problem by setting up learnable channel scoring parameters [21]-[23] as architecture parameters. This is an elegant approach to compression that does not rely on channel magnitude (group ℓ1 norm), which is used in previous regularization methods [24]. The previous way of setting up channel scoring parameters [21] utilizes the scale parameters of the batch normalization layers, yet these layers are not contained in many modern networks [25], [26]. Another challenge that remains to be solved is to replace its plain gradient descent with the more accurate DARTS style algorithms.
Apart from the bilevel formulation of DARTS, a single level approach (SNAS) based on a differentiable loss and sampling has been proposed [27]. On CIFAR-10, SNAS is more accurate than the first-order DARTS yet with 50% more search time than the second-order DARTS. This inspires us to formulate a new single level method which is more efficient and accurate. Our main contribution is to introduce a novel Relaxed Architecture Search (RARTS) method based on single level optimization, and the computation of only the first-order partial derivatives of loss functions, for both topology and width search of architectures. Through both data and network splitting, the training objective (a relaxed Lagrangian function) of RARTS allows two networks with different but related weights to cooperate in the search of a shared architecture.
We have carried out both analytical and experimental studies to show that RARTS consistently achieves better performance than first- and second-order DARTS, with higher search efficiency than second-order DARTS:
• We compare RARTS with DARTS directly on an analytical model with quadratic loss functions, where the RARTS iterations robustly approach the true global minimal point missed by first-order DARTS. A convergence theorem is proved for RARTS based on the descent of its Lagrangian function, and equilibrium equations are discovered for the limits.
• On the CIFAR-10 based search of topological architecture, the model found by RARTS obtains smaller size and higher test accuracy than that found by second-order DARTS, with 65% search time saving. A hardware-aware search option via a latency penalty in the Lagrangian function helps control the model size. Upon transfer to ImageNet [28], [29], the model found by RARTS also achieves better performance compared with DARTS and its variants. Apart from the standard search space used in the DARTS paper, RARTS also beats DARTS on the public NAS benchmark search space NATS-Bench [30].
• For channel pruning of ResNet-164 [31] on CIFAR-10 and CIFAR-100 [17] with fixed pruning ratio (percentage of pruned channels), RARTS outperforms the differentiable pruning benchmarks Network Slimming [21] and TAS [20]. Comparisons between DARTS and RARTS have also been made in an ℓ1 regularized (unfixed ratio) pruning task, where RARTS achieves a high sparsity of 70% and exceeds DARTS in accuracy.

II. RELATED WORK
A. DIFFERENTIABLE ARCHITECTURE SEARCH
DARTS training relies on an iterative algorithm to solve a bilevel optimization problem [14], [32] which involves two loss functions computed via data splitting (splitting the dataset into two halves, i.e., training data and validation data):

$$\min_{\alpha} \; L_{val}(w^*(\alpha), \alpha), \quad \text{where } w^*(\alpha) = \arg\min_{w} L_{train}(w, \alpha). \tag{1}$$

Here w denotes the network weights, α is the architecture parameter, and L_train and L_val are the loss functions computed on the training data D_train and the validation data D_val. Since many common datasets like CIFAR do not include validation data, D_train and D_val are usually two non-overlapping halves of the original training data. We denote L_train and L_val by L_t and L_v to avoid any confusion with the meaning of the subscripts. D_t and D_v are defined similarly. DARTS has adopted data splitting because it is believed that joint training of both α and w via gradient descent on the whole dataset, by minimizing the overall single level loss function of Eq. (2), can lead to overfitting [14], [15]. Therefore, DARTS searches for the architectures through a two-step differentiable algorithm which updates the network weights and the architecture parameters in an alternating way:
• update the weight w by descending along ∇_w L_t(w, α);
• update the architecture parameter α by descending along

$$\nabla_{\alpha} L_v\big(w - \xi \nabla_w L_t(w, \alpha), \, \alpha\big),$$

where ξ = 0 gives the first-order approximation and ξ > 0 gives the second-order approximation.
The bilevel optimization problem also arises in hyperparameter optimization and meta-learning, where a second-order algorithm and a convergence theorem on minimizers have been proposed in previous work [33] (Theorem 3.2), under the assumption that the α-minimization is solved exactly and w_t(α) converges uniformly to w(α). However, the α-minimization of DARTS is only approximated by gradient methods, and hence the convergence of the DARTS algorithm remains theoretically unknown. Note that first-order DARTS updates the architecture parameters on D_v by descending along ∇_α L_v(w, α), which means that it uses merely half of the data to train α and might cause convergence issues (see Fig. 1).
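The difference between the two DARTS update directions can be illustrated numerically. The following is a minimal sketch under stated assumptions: the scalar quadratic losses `loss_t` and `loss_v`, the helper `grad` and the function `darts_alpha_grad` are hypothetical and chosen only for illustration (they are not the analytical model of this paper), and derivatives are taken by central finite differences rather than by the mixed-derivative formula used in DARTS.

```python
# Hypothetical toy losses with scalar weight w and architecture parameter a.
def loss_t(w, a):
    return 0.5 * (w - a) ** 2

def loss_v(w, a):
    return 0.5 * (w - 1.0) ** 2 + 0.5 * a ** 2

def grad(f, x, eps=1e-5):
    """Central finite-difference derivative of a scalar function f at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def darts_alpha_grad(w, a, xi):
    """Derivative in alpha of L_v(w - xi * grad_w L_t(w, alpha), alpha).

    xi = 0 keeps w fixed (first-order direction); xi > 0 differentiates
    through one virtual weight step (second-order direction).
    """
    def unrolled(a_):
        w_step = w - xi * grad(lambda w_: loss_t(w_, a_), w)
        return loss_v(w_step, a_)
    return grad(unrolled, a)

w, a = 0.0, 0.5
g_first = darts_alpha_grad(w, a, xi=0.0)   # uses only the explicit alpha-dependence of L_v
g_second = darts_alpha_grad(w, a, xi=0.1)  # also accounts for how alpha moves w
```

For these toy losses the two directions differ because the virtual weight step depends on α through L_t, which is exactly the dependence that the first-order approximation ignores.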
MiLeNAS has developed a mixed-level solution, where the architecture parameters can be learned on D = D_t ∪ D_v via a first-order descent algorithm [15]. We shall see that MiLeNAS is actually a constrained case of RARTS in which our two network splits become identical. However, we show in a later example (Section III-D) that computing L_v using an identical network makes MiLeNAS still suffer from the same convergence issue. Second-order DARTS is observed to approximate the optimum better than first-order DARTS in a solvable model and through experiments, yet it requires computing the mixed derivative ∇²_{α,w} L_t, at a considerable overhead. Searching by DARTS can also lead to the architecture collapse issue, meaning that the selected architecture contains too many skip-connections. Such a bias in operation selection typically degrades the model performance. SNAS [27], FBNet [12], and GDAS [18] use differentiable Gumbel-Softmax to mimic one-hot encoding, which implies exclusive competition and risks of unfair advantages [19]. This unfair dominance of skip-connections in DARTS has also been noted by FairDARTS [19], which has proposed a collaborative competition approach that makes the architecture parameters independent by replacing the softmax with a sigmoid. FairDARTS further penalizes the operations in the search space whose probability is close to 1/2, i.e., a neutral and ambiguous selection. As these methods focus on replacing some operations or the loss function, it is worthwhile to explore other solutions, such as replacing the gradient-based DARTS search algorithm.
In addition to DARTS, many other differentiable methods for architecture search have been proposed, considering various aspects such as the search space, selection criterion, and training tricks. SNAS [27] has discussed it from a statistical perspective with however a minor to moderate performance improvement. The search efficiency has also been improved by sampling a portion of the search space during each update in training. A perturbation-based selection scheme has been proposed in [34], as the magnitude of architecture parameters is believed to be inadequate as a selection criterion. P-DARTS [35] has adopted operation dropout and regularization on skip-connections. From the procedure side to delay a quick short cut aggregation, it has also divided the search stage into multiple stages and progressively adds more depth than DARTS. PC-DARTS [36] samples a proportion of channels to reduce the bias of operation selection and enlarge the batch size as well. GDAS [18] searches the architecture with one operation sampled at a time. Other approaches apply differentiable methods on much larger search spaces with sampling techniques to save memory and avoid model transfer [7], [12]. We will see that these variants of the differentiable architecture search method are actually complementary to our approach that advances DARTS on the purely algorithmic side by mobilizing weights. Moreover, many works [7], [12], [16], [37]- [39] manage to balance latency with the performance of the model to enhance the efficiency of the model. Despite the broad use of differentiable methods in the works we have mentioned, one may wonder how DARTS and its variants beat random search. A detailed comparison in [40] has elaborated the advantage of DARTS in accuracy and efficiency compared with random search.

B. SEARCH FOR WIDTH AND CHANNEL PRUNING
Differentiable search methods have contributed to a wide range of tasks other than topological architecture search. TAS [20] searches for the width of each layer, i.e., the number of channels, by learning the optimal one from the aggregation of several candidate feature maps via a differentiable method and sampling. FasterSeg [16] searches for the cell operations and layer width, as well as the multi-resolution network path, in the semantic segmentation task. These width search methods are closely related to channel pruning, which means pruning redundant channels from the convolutional layers. Among numerous methods to prune redundant channels [24], [41]-[45], a classical approach is to apply group LASSO [46] on the weights to identify unimportant channels. The weights in each channel form one group, and the magnitude of each group is measured by the ℓ2 norm of its weights. The network is trained by minimizing a loss function penalized by the ℓ1 norm of these magnitudes from all groups. The channels are pruned by thresholding their norms. Selecting good thresholds as hyperparameters for different channels can be laborious for deep networks. On the other hand, channel selection is intrinsically a network architecture issue. It is debatable whether thresholding by weight magnitudes is always meaningful [47].
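The group LASSO criterion described above can be sketched as follows. This is an illustration under assumptions, not the implementation of any cited method: the helper names `group_lasso_penalty` and `channels_to_keep` are hypothetical, each group is taken to be the filter producing one output channel, and the threshold value is arbitrary.

```python
import numpy as np

def group_lasso_penalty(weight):
    """Per-channel l2 magnitudes and their l1 sum for a conv kernel.

    weight has shape (out_channels, in_channels, kh, kw); each output
    channel's filter is treated as one group (an assumption of this sketch).
    """
    flat = weight.reshape(weight.shape[0], -1)
    mags = np.sqrt((flat ** 2).sum(axis=1))   # l2 norm of each group
    return mags, mags.sum()                   # l1 norm of the magnitudes

def channels_to_keep(mags, threshold):
    """Indices of channels whose group magnitude exceeds the threshold."""
    return np.flatnonzero(mags > threshold)

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 3, 3, 3))
weight[1] *= 1e-3                      # make channel 1 nearly inactive
mags, penalty = group_lasso_penalty(weight)
keep = channels_to_keep(mags, threshold=0.1)   # hypothetical threshold
```

The need to choose `threshold` per layer is the laborious step mentioned above, and it is what learnable channel scoring parameters are meant to avoid.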
Another approach to channel pruning [21], [22] involves assigning a channel scaling (scoring) factor to each channel, which is a learnable parameter independent of the weights. In the training process, the factors and the weights are learned jointly, and the channels with low scaling factors are pruned. After that, the optimal weights of the pruned network are adjusted by one more stage of fine-tuning. In terms of channel scaling factors, the channel pruning problem becomes a special case of neural architecture search. Besides this formulation, there are several pruning methods based on NAS. AMC [37] has defined a reward function and pruned the channels via reinforcement learning. MetaPruning [48] generates the best pruned model and weights from a meta network.
VOLUME 4, 2016

III. METHODOLOGY
In this section, we introduce the RARTS formulation, its iterative algorithm and convergence properties. RARTS is different from all the differentiable algorithms we have mentioned, in that it puts forward a relaxed formulation of a single level problem which benefits from both data splitting and network splitting.

A. DATA SPLITTING AND NETWORK SPLITTING
As pointed out in DARTS [14] and MiLeNAS [15], when learning the architecture parameter α, splitting training and validation data should be taken into account to avoid overfitting. However, we have discussed that the bilevel formulation (1) and training algorithm of DARTS may lead to several issues: unknown convergence, low efficiency and the unfair selection of operations. Therefore, we follow the routine of train-validation data splitting, but want to formulate a single level problem, in contrast to DARTS and MiLeNAS. First, if we use (w, α), the pair of weight and architecture parameters in Eq. (2) to represent a network, what we propose to do is to further relax the network weights w via splitting a network copy denoted by (y, α). We call (y, α) and (w, α) the primary and auxiliary networks, which share the same architecture α and the same dimensions as weight tensors, but can have different weight initialization.
Next, a primary loss L_v(y, α) is computed with parameters (y, α) fed on data D_v, while an auxiliary loss L_t(w, α) is computed with parameters (w, α) fed on data D_t. Note that the computation of the auxiliary loss L_t(w, α) is the same as that of DARTS. The difference is that the primary loss is computed on the primary network (y, α), instead of (w, α). Now we present the single level objective of our relaxed architecture search (RARTS) framework. With an ℓ2 penalty on the distance between w and y, the two loss functions are combined through the following relaxed Lagrangian L = L(y, w, α) of Eq. (2):

$$L(y, w, \alpha) := L_v(y, \alpha) + \lambda L_t(w, \alpha) + \frac{\beta}{2} \| y - w \|_2^2, \tag{4}$$

where λ and β are hyperparameters controlling the penalty scale and the learning process. We will see in the search algorithm that the penalty term enables the two networks to exchange information and cooperate in searching the architecture which they share. This technique of splitting w and y is called network splitting, and is inspired by previous work [49], in which splitting of variables is able to approximate a non-smooth minimization problem via an algorithm combining closed-form solutions and gradient descent.
Since various NAS approaches discover architectures of inconsistent sizes or FLOPS, comparisons across methods become unfair, because larger models are likely to have better performance but lower efficiency. Many NAS methods have adopted latency as a model constraint [7], [16]. To control model size, we follow the technique of approximating the model latency by the sum of the latency of all the operations [16], and add the approximated latency to the loss function as a penalty. Since each component of the latency tensor (denoted by Lat) is the latency associated with a candidate operation, the dimension of Lat is the same as that of α. Therefore, we provide an alternative objective which is penalized by the latency of the model:

$$L(y, w, \alpha) := L_v(y, \alpha) + \lambda L_t(w, \alpha) + \frac{\beta}{2} \| y - w \|_2^2 + \langle \mathrm{Softmax}(\alpha), Lat \rangle, \tag{5}$$

where the bracket is the inner product.
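As a small numerical illustration of the relaxed objective with and without the latency penalty, consider the sketch below. It assumes that λ weights the auxiliary loss, that the ℓ2 penalty enters as (β/2)‖y − w‖², and that the softmax is taken over the candidate operations of each edge (the last axis of α); these placements, and the function names `softmax` and `relaxed_lagrangian`, are assumptions of the sketch rather than a definitive implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relaxed_lagrangian(loss_v, loss_t, y, w, lam, beta, alpha=None, lat=None):
    """Combine the two losses with an l2 penalty on y - w.

    loss_v, loss_t: scalar losses already evaluated on D_v and D_t.
    alpha, lat: optional arrays of identical shape (edges, operations);
    when both are given, the latency penalty <Softmax(alpha), Lat> is added.
    """
    value = loss_v + lam * loss_t + 0.5 * beta * np.sum((y - w) ** 2)
    if alpha is not None and lat is not None:
        value += np.sum(softmax(alpha, axis=-1) * lat)
    return value

y = np.array([1.0, 2.0])
w = np.array([1.0, 0.0])
plain = relaxed_lagrangian(0.3, 0.5, y, w, lam=1.0, beta=2.0)

alpha = np.zeros((2, 3))                            # uniform operation weights
lat = np.array([[1.0, 2.0, 3.0], [3.0, 3.0, 3.0]])  # hypothetical latencies
with_latency = relaxed_lagrangian(0.3, 0.5, y, w, 1.0, 2.0, alpha, lat)
```

Because the latency term depends only on α, it leaves the gradients with respect to w and y unchanged.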

B. RARTS ALGORITHM
We minimize the relaxed Lagrangian L(y, w, α) in (4) by iterating on the three variables in an alternating way, which allows individual and flexible learning schedules for the three variables. Similar to the Gauss-Seidel method in numerical linear algebra [50], we use updated variables immediately in each step and obtain the following three-step iteration:

$$\begin{aligned} w^{t+1} &= w^t - \eta_w^t \, \nabla_w L(y^t, w^t, \alpha^t), \\ y^{t+1} &= y^t - \eta_y^t \, \nabla_y L(y^t, w^{t+1}, \alpha^t), \\ \alpha^{t+1} &= \alpha^t - \eta_\alpha^t \, \nabla_\alpha L(y^{t+1}, w^{t+1}, \alpha^t). \end{aligned} \tag{6}$$

With the explicit gradient ∇_{w,y} ‖y − w‖²_2, we have:

$$\begin{aligned} w^{t+1} &= w^t - \lambda \eta_w^t \, \nabla_w L_t(w^t, \alpha^t) - \beta \eta_w^t \, (w^t - y^t), \\ y^{t+1} &= y^t - \eta_y^t \, \nabla_y L_v(y^t, \alpha^t) - \beta \eta_y^t \, (y^t - w^{t+1}), \\ \alpha^{t+1} &= \alpha^t - \lambda \eta_\alpha^t \, \nabla_\alpha L_t(w^{t+1}, \alpha^t) - \eta_\alpha^t \, \nabla_\alpha L_v(y^{t+1}, \alpha^t). \end{aligned} \tag{7}$$

To minimize the Lagrangian (5), the first two steps are the same as in Eq. (7), since the latency only depends on α. The third step becomes:

$$\alpha^{t+1} = \alpha^t - \lambda \eta_\alpha^t \, \nabla_\alpha L_t(w^{t+1}, \alpha^t) - \eta_\alpha^t \, \nabla_\alpha L_v(y^{t+1}, \alpha^t) - \eta_\alpha^t \, \nabla_\alpha \langle \mathrm{Softmax}(\alpha^t), Lat \rangle. \tag{8}$$

Note that the update of α in Eq. (7) involves both L_t and L_v, which is similar to second-order DARTS but without the mixed second derivatives. First-order DARTS uses only ∇_α L_v in this step. In the previous section, we have discussed the architecture collapse issue of DARTS, i.e., selecting too many skip-connections. A possible reason why DARTS may lead to architecture collapse is that its architecture parameters converge more quickly than the weights in the convolutional layers. That is, when DARTS selects architecture parameters, it tends to select skip-connection operations, since the convolutional layers are not yet trained well. The fact that first-order DARTS uses only one of the two data splits to train the weights makes the training of convolutional layers worse. For RARTS, we make use of both L_t and L_v to update the weight parameters w and y in the first two steps of Eq. (7). In the third step of Eq. (7), both L_t and L_v are also used to update the shared architecture α. In this way, the architecture is learned better, as more data are involved during training. If y = w is enforced in Eq. (7), e.g., through a multiplier, RARTS essentially reduces to first-order MiLeNAS [15]. However, relaxing to y ≠ w has the advantages of more generality and robustness, as the objective is optimized on two networks with different but related weights. In contrast, MiLeNAS trains the network weights on the training data D_t only, and suffers from the same convergence issue as first-order DARTS (Section III-D). We summarize the RARTS algorithm in Algorithm 1.

Algorithm 1 Relaxed Architecture Search (RARTS)
Input: the number of iterations N, the hyperparameters λ and β, a learning rate schedule (η_w^t, η_y^t, η_α^t), initialization of the weight parameters w^0, y^0 and the architecture parameters α^0.
Output: α*, the architecture we want.
Split the dataset D into two subsets D_v and D_t.
for t = 0, 1, ..., N do
    Compute L_v and L_t on D_v and D_t, respectively, and then compute L using Eq. (4).
    Update the parameters (w, y, α) via gradient descent, following the three-step iteration in Eq. (7).
end for
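The loop of Algorithm 1 can be sketched on a toy problem. The following is an illustration only: it assumes the relaxed Lagrangian has the form L_v(y, α) + λ L_t(w, α) + (β/2)(y − w)², that w is updated before y, and it uses hypothetical scalar quadratic losses with hand-written gradients instead of networks trained on data splits. All function names are hypothetical.

```python
# Hypothetical toy losses (not from the paper):
#   L_t(w, a) = 0.5 * (w - a)^2,   L_v(y, a) = 0.5 * (y - 1)^2 + 0.5 * a^2
def grad_w_Lt(w, a):
    return w - a

def grad_a_Lt(w, a):
    return a - w

def grad_y_Lv(y, a):
    return y - 1.0

def grad_a_Lv(y, a):
    return a

def lagrangian(y, w, a, lam, beta):
    """Assumed relaxed Lagrangian: L_v + lam * L_t + (beta / 2) * (y - w)^2."""
    loss_v = 0.5 * (y - 1.0) ** 2 + 0.5 * a ** 2
    loss_t = 0.5 * (w - a) ** 2
    return loss_v + lam * loss_t + 0.5 * beta * (y - w) ** 2

def rarts_toy(n_iters=2000, lam=1.0, beta=1.0, lr=0.1):
    """Three-step Gauss-Seidel style iteration with a constant learning rate."""
    y, w, a = 2.0, -1.0, 1.0
    history = [lagrangian(y, w, a, lam, beta)]
    for _ in range(n_iters):
        # Step 1: auxiliary weights w, using the current y and alpha.
        w = w - lr * (lam * grad_w_Lt(w, a) + beta * (w - y))
        # Step 2: primary weights y, using the freshly updated w.
        y = y - lr * (grad_y_Lv(y, a) + beta * (y - w))
        # Step 3: shared architecture alpha, using both updated weights.
        a = a - lr * (lam * grad_a_Lt(w, a) + grad_a_Lv(y, a))
        history.append(lagrangian(y, w, a, lam, beta))
    return y, w, a, history

y_bar, w_bar, a_bar, history = rarts_toy()
```

For these toy losses with λ = β = 1, setting the three partial gradients of the assumed Lagrangian to zero gives (y, w, α) = (0.75, 0.5, 0.25), and the recorded Lagrangian values do not increase along the iterations, in line with the descent property stated in Theorem 1.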
Theorem 1. Suppose that the loss functions L_t and L_v satisfy the Lipschitz gradient property. If the learning rates η_w^t, η_y^t and η_α^t are small enough, depending only on the Lipschitz constants and on (λ, β), and approach nonzero limits at large t, then the Lagrangian function L(y, w, α) is descending along the iterations of (7). If additionally the Lagrangian L is lower bounded and coercive (its boundedness implies that of its variables), the sequence (y^t, w^t, α^t) converges sub-sequentially to a critical point (ȳ, w̄, ᾱ) of L(y, w, α) obeying the equilibrium equations:

$$\begin{aligned} \nabla_y L_v(\bar{y}, \bar{\alpha}) + \beta (\bar{y} - \bar{w}) &= 0, \\ \lambda \nabla_w L_t(\bar{w}, \bar{\alpha}) + \beta (\bar{w} - \bar{y}) &= 0, \\ \lambda \nabla_\alpha L_t(\bar{w}, \bar{\alpha}) + \nabla_\alpha L_v(\bar{y}, \bar{\alpha}) &= 0. \end{aligned} \tag{9}$$

If the loss is penalized by latency as in (5), the last equilibrium equation becomes:

$$\lambda \nabla_\alpha L_t(\bar{w}, \bar{\alpha}) + \nabla_\alpha L_v(\bar{y}, \bar{\alpha}) + \nabla_\alpha \langle \mathrm{Softmax}(\bar{\alpha}), Lat \rangle = 0. \tag{10}$$

Proof. We only need to prove the result for the loss (5) and the iterations (8), as the loss (4) is its special case when Lat = 0. We notice that the latency penalty function ⟨Softmax(α^t), Lat⟩ also satisfies the Lipschitz gradient property. This is because

$$\frac{\partial \, \mathrm{Softmax}(\alpha)_i}{\partial \alpha_j} = \mathrm{Softmax}(\alpha)_i \, \big(\delta_{ij} - \mathrm{Softmax}(\alpha)_j\big),$$

and hence all the first and second derivatives of ⟨Softmax(α^t), Lat⟩ are bounded uniformly regardless of α^t. Applying the Lipschitz gradient inequalities to L_v and L_t and substituting for the (w, y)-gradients from the iterations (8) yields the inequality (11). Upon substitution of an identity into the right hand side of (11), the β-terms cancel out. After substituting for the α-gradient from the iterations (8), the last two inner product terms are upper bounded in terms of the positive constant L_4 := max(L_1, L_2), which leads to (12). Under the resulting smallness conditions on the learning rates, L is descending along the sequence (y^t, w^t, α^t), and a further bound follows from (12) with c_4 = (1/2) min{c_1^{-1}, c_2^{-1}, c_3^{-1}}. Since L is lower bounded and coercive, (y^t, w^t, α^t) are uniformly bounded in t. Let (η_w^t, η_y^t, η_α^t) tend to a non-zero limit at large t. Then (y^t, w^t, α^t) sub-sequentially converges to a limit point (ȳ, w̄, ᾱ) satisfying the equilibrium system (9) or (10).

IV. EXPERIMENTS
We show by a series of experiments how RARTS works efficiently for different tasks: the search for topology and the search for width, on various datasets and search spaces.

A. SEARCH FOR TOPOLOGY
For the hyperparameters and settings like learning rate schedules, number of epochs for CIFAR-10 and the transfer learning technique for ImageNet, we follow those of DARTS [14]. We also consider the results on CIFAR-10 and CIFAR-100 for NATS-Bench [30], which is another benchmark search space.
Comparisons on CIFAR-10. The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images [17]. These 3-channel images of 32 × 32 resolution are allocated evenly to 10 object classes. For the architecture search task on CIFAR-10, the D_t and D_v data we have used are random non-overlapping halves of the original training data, the same as in DARTS. The settings for searching topology with RARTS follow those of DARTS. That is, batch size = 64, initial weight learning rate = 0.025, momentum = 0.9, weight decay = 0.0003, initial alpha learning rate = 0.0003, alpha weight decay = 0.001, epochs = 50. For the training stage, batch size = 96, learning rate = 0.025, momentum = 0.9, weight decay = 0.0003 [14]. For each cell (either normal or reduction), 8 edges are selected, with 1 out of 8 candidate operations selected for each edge (see Fig. 2). Besides the standard ℓ2 regularization of the weights, we also adopt the latency penalty. The latency regularization loss is weighted so that it is balanced with the other loss terms. Typically, if we increase the latency weight, the model we find will be smaller in size. The latency term Lat for each operation is measured via PyTorch/TensorRT [16], and thus it depends on the devices we use. For the current search, the latency weight is 0.002 so that the model size is comparable to those in prior works. The final latency loss is the weighted sum of the latency of each operation, where the weights are the architecture parameters.
As shown in Table 1, the search cost of RARTS is 1.1 GPU days, far less than that of the second-order DARTS. The test error of RARTS is 2.65%, outperforming the 3.00% of the first-order DARTS and the 2.76% of the second-order DARTS. It should also be pointed out that the model found by RARTS has 3.2M parameters, which is smaller than the 3.3M model found by DARTS. Moreover, RARTS outperforms other recent differentiable methods in accuracy and search cost at comparable model size. We also notice that the variance of RARTS performance is lower than that of DARTS. RARTS has also arrested architecture collapse and only selected one skip-connection, as shown in Fig. 2.

TABLE 1. Comparison of DARTS, RARTS and other methods on CIFAR-10 based network search. DARTS-1/2 stands for DARTS 1st/2nd-order, SNAS-Mi/Mo stands for SNAS plus mild/moderate constraints. Note that faster search times also depend on speed and memory capacity of local machines used. The V100 column indicates whether the model is trained on high-end Tesla V100 GPUs or not. Each run of our experiment is conducted on a single GTX 1080 Ti GPU. The numbers in the parentheses indicate the search GPU days of DARTS on our machine. Average of 5 runs. These runs are conducted on our machine.

Method              Test error (%)   Params (M)   Search cost (GPU days)
AmoebaNet-B [10]    2.55 ± 0.05      2.8          3150
SNAS-Mi [27]        2.98             2.9          1.5
SNAS-Mo [27]        2.85 ± 0.02      2.8          1.5
DARTS-1 [14]        3.00 ± 0.14      3.3          1.5 (0.7)
DARTS-2 [14]        2…

We are aware that different values of hyperparameters in the RARTS search stage may impact the latency of the models found by RARTS. Table 2 lists the latency of several models with different hyperparameters. Here we use the baseline setting of latency weight = 2 × 10^-3, batch size = 64, learning rate = 3 × 10^-4, weight decay = 1 × 10^-3. We change the value of one hyperparameter and keep the others the same during each experiment, so that we can see how sensitive the resulting latency is to a specific hyperparameter. First, the result shows that a small batch size of 16 can impact the model's latency, whereas a batch size of 32 or 64 leads to similar latency. This is a favorable outcome, since we prefer a larger batch size as it requires less training time.
Among the other hyperparameters, it is clear that the only factor that could cause a significant difference is the latency weight. A latency weight of 2 × 10^-2 is so large that its model has only 60% of the latency of the baseline. The model's latency is not sensitive to the other hyperparameters, as the latency is around 22.0 and varies within 10% only. This finding is beneficial, since we can fix the latency level by fixing the latency weight, and then find the model with the best accuracy among the models of similar latency level by tuning the other hyperparameters.
Comparisons on ImageNet. ImageNet [28], [29] is composed of over 1.2 million training images and 50,000 test images from 1,000 object classes. The architecture built from the cells learned on CIFAR-10 is transferred to be learned on ImageNet-1000, producing the results in Table 3. Even though our experiments are performed on a GTX 1080 Ti whose maximum memory allows only a batch size of 128, our 25.9% error rate outperforms those of DARTS and SNAS (batch size 128), and is also comparable to those of GDAS (batch size 128) and MiLeNAS. MiLeNAS, among some other algorithms in Table 3, has been implemented on Tesla V100 with batch size 1024, much higher end hardware than that in our experiments. This partly explains its lower accuracy (2.80) on CIFAR-10 but higher accuracy after transfer to ImageNet. Typically, ImageNet models are trained better on larger GPUs because of the larger batch size. ProxylessNAS has obtained high accuracy on both CIFAR-10 and ImageNet, but its models are much larger than those of the other methods. It has avoided transfer learning, as the training cost is reduced via path sampling. Inheriting the building blocks from DARTS and ProxylessNAS, FairDARTS has penalized the neutral (close to 0.5) architecture parameters, but its high accuracy also benefits from the relaxation on the search space. Its normal cells contain fewer than 8 operations, since the operations with architecture parameters lower than a preset threshold are eliminated. This explains its smaller model size and comparable accuracy. P-DARTS has devised a progressive method to increase the depth of search. That work shows that deeper cells have better capability of representation, which is also an improvement on the search space. PC-DARTS, as a sampling method, has achieved the lowest search cost and can be trained directly on ImageNet. These methods are complementary to our work, which is purely on the differentiable search algorithm without modifying the search space of DARTS.

TABLE 3. Transfer to ImageNet: test error comparison of DARTS, RARTS and other methods on local machines resp. The V100 column indicates whether the model is trained on high-end Tesla V100 GPUs or not. The larger GPU memory can support larger batch size, which leads to better accuracy and training efficiency on ImageNet. The Direct column indicates if the model is searched directly on ImageNet without transfer-learning. The direct search tends to be more accurate but costs more computational resources.
Comparisons on NATS-Bench. For NATS-Bench, one has to search a block of 6 nodes from the search space of 5 different operations, including zero, skip-connection, 3 × 3 average pooling, 1 × 1 convolution or 3 × 3 convolution [30]. Therefore, it includes 15,625 different candidate architectures, and any DARTS style method can be adapted easily to its search space. NATS-Bench has measured each architecture's performance under the same training settings, and hence fair comparisons can be made between the discovered architectures, since no further evaluation is needed on the local machines. In our experiments, we set batch size = 64, initial weight learning rate = 0.025, momentum = 0.9, weight decay = 0.0005, initial alpha learning rate = 0.0003, alpha weight decay = 0.001, number of epochs = 100. Table 4 presents the search results of DARTS vs. RARTS on NATS-Bench. RARTS has surpassed both DARTS-1 and DARTS-2 in accuracy by more than 20% on CIFAR-10 and 6% on CIFAR-100. Besides its success in accuracy, RARTS has completely escaped the architecture collapse issue, i.e., the architectures found by RARTS from NATS-Bench contain no skip-connections. In contrast, the architectures found by DARTS-1 and DARTS-2 contain 100% and 38.9% skip-connections (average of 3 runs) on CIFAR-10 and CIFAR-100, respectively. It is clear that too many skip-connections, resulting in architecture collapse, greatly degrade the performance of the models.

B. SEARCH FOR WIDTH
To search the width of the architecture (the number of channels in convolutional layers), we follow the settings of Network Slimming [21] and introduce scoring parameters α to measure channel importance. Denote the original feature map by $F_{i,j}$ and define the new feature map $\hat{F}_{i,j} = \alpha_{i,j} F_{i,j}$, where $(i, j)$ are the layer and channel indices. Multiplying a channel of the output feature map by α is equivalent to multiplying the convolutional kernels that produce this output feature map by the same α. We prune a channel if the corresponding α is 0 or very small. The $\alpha_{i,j}$'s are learnable architecture parameters independent of the channel weights, and hence play a role similar to that of the architecture parameters in the topology search. Although this treatment of scoring parameters closely resembles that in Network Slimming [21], the single-level formulation of RARTS and the training algorithm used to learn the scoring parameters are novel. The first difference is that Network Slimming trains both the weights and the architecture parameters on the whole (training and validation) data, using neither dataset splitting nor network splitting, unlike DARTS or RARTS. Another key difference between RARTS pruning and Network Slimming lies in the search algorithm: Network Slimming trains the weights and the architecture jointly in one step, while RARTS trains them in a three-step iteration. Moreover, Network Slimming uses the batch normalization weights as the scoring parameters. We point out that such a set of learnable architecture parameters α can still be defined even if the architecture contains no batch normalization operation.
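The equivalence between scaling an output channel and scaling its kernel follows from the linearity of convolution. The sketch below checks it numerically in plain Python on a single-channel toy example; the helper names `conv2d_valid` and `scale` and the numerical values are illustrative assumptions, not part of the original method.

```python
def conv2d_valid(x, k):
    """Valid 2D cross-correlation of a single-channel map x with kernel k."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[r + a][c + b] * k[a][b] for a in range(kh) for b in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def scale(m, s):
    """Multiply every entry of the 2D list m by the scalar s."""
    return [[s * v for v in row] for row in m]


x = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]  # toy input map
kernel = [[1.0, -1.0], [0.5, 2.0]]                        # toy kernel
alpha = 0.25                                              # channel score

scaled_output = scale(conv2d_valid(x, kernel), alpha)  # alpha * F
scaled_kernel = conv2d_valid(x, scale(kernel, alpha))  # conv with alpha * kernel

# Both routes give the same feature map (up to floating-point error).
assert all(
    abs(u - v) < 1e-12
    for row_u, row_v in zip(scaled_output, scaled_kernel)
    for u, v in zip(row_u, row_v)
)
```

In particular, setting `alpha = 0.0` zeroes the whole output channel, which is why a channel with a vanishing score can be removed without changing the network output.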
We also compare RARTS with TAS [20], another width search method based on differentiable NAS, which relies on both continuous relaxation via feature maps of various sizes and model distillation. The first difference lies in how the channel scoring parameters are applied to the feature maps. In TAS, the channel parameters are treated as probabilities of candidate feature maps, smoothed by Gumbel-Softmax, and a subset of feature maps is then sampled to alleviate the high memory cost. RARTS is much simpler in its formulation, as it is a dot product of the channel parameters with the filter to be pruned. The second key difference is that TAS uses a training technique called Knowledge Distillation (KD) [51] to improve accuracy. There are other NAS-based methods for width search, or channel pruning [37], [48], mentioned in Section II-B. Since our formulation of the problem and our criterion for evaluating results are different, we emphasize that our progress lies in the fusion of a new search algorithm with the width search task.
When using RARTS to search for width, we also follow the hyperparameters and settings of Network Slimming, that is, learning rate = 0.1, weight decay = 0.0001, and epochs = 160 [21]. In Table 5, RARTS outperforms the unpruned baseline, Network Slimming (NS), and TAS [20] by over 10% error reduction on CIFAR-10. While TAS does not offer an option to specify the pruning ratio of channels (PRC), the pruning ratio of FLOPs is around 30% for NS (40% PRC), RARTS (40% PRC), and TAS, so the comparison is fair. On CIFAR-100, RARTS still leads NS at the same PRC. The gap is smaller because the baseline network is less redundant.
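A fixed PRC can be turned into per-layer channel masks by thresholding the learned scores. The sketch below assumes a single global magnitude threshold over all layers, in the spirit of Network Slimming; the function names `prune_masks` and `pruned_ratio` and the toy scores are hypothetical and serve only to illustrate the 40% PRC setting.

```python
def prune_masks(alphas, prc):
    """Return per-layer keep-masks that prune a fraction `prc` of all channels.

    alphas: list of per-layer lists of channel scores.
    Channels with the smallest |alpha| across all layers are pruned.
    Ties at the threshold are pruned too, so the ratio may slightly exceed prc.
    """
    flat = sorted(abs(a) for layer in alphas for a in layer)
    num_pruned = int(len(flat) * prc)
    if num_pruned == 0:
        return [[True] * len(layer) for layer in alphas]
    threshold = flat[num_pruned - 1]  # largest score that still gets pruned
    return [[abs(a) > threshold for a in layer] for layer in alphas]


def pruned_ratio(masks):
    """Fraction of channels removed by the masks."""
    total = sum(len(m) for m in masks)
    pruned = sum(not keep for m in masks for keep in m)
    return pruned / total


# Toy scores for two layers with five channels each (illustrative values).
alphas = [[0.9, 0.01, 0.4, 0.0, 0.7], [0.05, 0.8, 0.3, 0.6, 0.02]]
masks = prune_masks(alphas, prc=0.4)
print(pruned_ratio(masks))  # 0.4
```

Because the threshold is global, layers with many small scores lose more channels than others, so the FLOPs reduction generally differs from the channel pruning ratio.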
Our experimental results reveal that the accuracy of TAS with KD is lower than (on CIFAR-10) or similar to (on CIFAR-100) that of RARTS, while TAS without a training technique like KD is 2% worse [20]. This supports the claim that RARTS works better as a differentiable method for width search, independently of any other training tricks. Apart from the comparisons with the above methods, we also consider a pruning task for comparing DARTS and RARTS, which can be viewed as an ablation study of RARTS on the width search task. For this task, we prune MobileNetV2 [52] on a randomly sampled 20-class subset of ImageNet-1000, with $\ell_1$ regularization but without a fixed pruning ratio. The pruning ratio is learned automatically through the strong regularization term, as many of the architecture parameters become exactly zero. Table 6 shows that RARTS also beats both random pruning and DARTS in accuracy. Even though second-order DARTS obtains a higher sparsity, it does so at the expense of accuracy.
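To illustrate how an ℓ1 penalty can determine the pruning ratio by itself, the sketch below applies a soft-thresholding (proximal) step to a set of scores and counts the exact zeros. This is an assumption made for illustration: the text does not specify how the regularization is implemented, and the names `soft_threshold` and `learned_pruning_ratio` as well as the numerical values are hypothetical.

```python
def soft_threshold(a, t):
    """Proximal operator of t * |a|: shrink a toward zero by t."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def learned_pruning_ratio(alphas):
    """Fraction of scores that are exactly zero, i.e., prunable channels."""
    return sum(a == 0.0 for a in alphas) / len(alphas)


# Toy channel scores and shrinkage amount (illustrative values only).
alphas = [0.8, -0.03, 0.02, 0.5, -0.6, 0.01]
shrunk = [soft_threshold(a, 0.05) for a in alphas]
print(learned_pruning_ratio(shrunk))  # 0.5
```

A larger shrinkage amount sets more scores to exactly zero, which mirrors the observation above that a strong regularization term yields a higher sparsity, possibly at the cost of accuracy.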

V. CONCLUSION
We have developed RARTS, a novel relaxed differentiable method for neural architecture search. We have proved a convergence theorem for it and compared it with DARTS on an analytically solvable model. Thanks to the design of data and network splitting, RARTS achieves higher accuracy and search efficiency than the state-of-the-art differentiable methods, especially DARTS, across a wide range of experiments, including both topology search and width search. These results support RARTS as a more reliable and robust differentiable neural architecture search tool for various datasets and search spaces. In future work, we plan to incorporate search space sampling and regularization techniques to accelerate RARTS (as seen in several recent variants of DARTS) for broader applications in deep learning.

JACK XIN received the Ph.D. degree in mathematics from New York University's Courant Institute of Mathematical Sciences, in 1990. He was a faculty member at the University of Arizona, from 1991 to 1999, and the University of Texas at Austin, from 1999 to 2005. He is currently a Chancellor's Professor of mathematics at UC Irvine. His research interests include applied analysis and computational methods, and their applications in multiscale problems and data science. He is a fellow of the Guggenheim Foundation, the American Mathematical Society, the American Association for the Advancement of Science, and the Society for Industrial and Applied Mathematics. He was a recipient of the Qualcomm Faculty Award (2019-2022).