DistilNAS: Neural Architecture Search With Distilled Data

Can we perform Neural Architecture Search (NAS) with a smaller subset of the target dataset and still fare better in terms of performance, with a significant reduction in search cost? In this work, we propose a method, called DistilNAS, which utilizes a curriculum learning based approach to distill the target dataset into a very efficient smaller dataset on which to perform NAS. We hypothesize that only the data samples containing features highly relevant to a given class should be used in the search phase of NAS. We perform NAS with the distilled version of the dataset, and the searched model achieves better performance at a much reduced search cost in comparison with various baselines. For instance, on the Imagenet dataset, DistilNAS uses only 10% of the training data and produces a model in ≈1 GPU-day (including the time needed for clustering) that achieves a near-SOTA accuracy of 75.75% (PC-DARTS had achieved SOTA with an accuracy of 75.8% but needed 3.8 GPU-days for architecture search). We also demonstrate and discuss the efficacy of DistilNAS on several other publicly available datasets.


I. INTRODUCTION
Image classification is one of the fundamental tasks in the field of computer vision. Since the success of Alexnet [1], the deep learning models that achieved state-of-the-art performance in image classification were all manually designed.
Recently, with the advent of Neural Architecture Search (NAS) [2], [3], the focus has shifted to automating the design of neural networks. NAS has received considerable attention due to its remarkable success in image classification [4] and other related tasks.
Due to the success of deep neural network architectures in various application domains [5], [6], [7], [8], specifically in the field of vision with models such as Resnet [9], NAS in the recent past has focused on finding optimal building blocks called cells. The idea is to first find an efficient cell and then stack these cells together in an appropriate manner tailored for a specific task. Some of the earliest works in this regard employed reinforcement learning [4], [10] and evolutionary algorithms [11], [12] to design the cells. As these approaches had to evaluate architectures over a vast search space, both of them suffer from an inordinate amount of required computational resources (measured in terms of GPU-days). As a result, architecture search is generally performed on a simpler proxy dataset, and evaluation is then done with the target dataset. (The associate editor coordinating the review of this manuscript and approving it for publication was Davide Patti.)
In order to alleviate some of the aforementioned problems with RL and EA based NAS methods, Liu et al., came up with a more efficient strategy of doing NAS called Differentiable Architecture Search (DARTS) [13]. DARTS followed the one-shot approach where a super network contains every candidate architecture in the search space as a sub-network and in each step, a sub-network or combination of sub-networks is trained and evaluated. In DARTS, instead of searching over a discrete set of candidate architectures, the search space was relaxed to be continuous. This enabled them to perform gradient descent with respect to architectural parameters based on validation loss. Through this approach, DARTS achieved a comparable performance with the RL and EA based methods at a small fraction of the search cost needed by them.
In spite of this, DARTS is limited by its severe memory requirement. Due to this, DARTS is forced to search the architecture in a shallow network, whereas the architecture being evaluated later is deep. This is referred to as the 'depth gap'. To alleviate this problem, an improved version of DARTS called Progressive DARTS (P-DARTS) was proposed in the literature [14]. In P-DARTS, deeper architectures are evaluated in the search phase via a careful pruning of the search space. Despite the advances that DARTS and P-DARTS achieve, it must be noted that both of these methods still perform a proxy search on a simpler task (CIFAR) before being evaluated on a more complex one. This characteristic of these models gives rise to the following question: Given an arbitrary dataset, can we perform architecture search on it within a reasonable amount of time?
This question is important in order to address the performance gap that may arise due to transfer learning from a model searched on simple datasets. In this work, we propose a data preprocessing method based on curriculum learning that enables us to sample a smaller effective subset of the dataset, which we call ''Distilled Data''. Using this distilled data, we demonstrate that NAS can be performed on large datasets in a reasonable amount of time without compromising on the accuracy of the model. The intuition behind our approach is that, since the training budget in the search phase is limited, we must focus only on finding a reasonably adequate model that can detect features of interest for a given class. Hence, we propose that only ''informative'' images, i.e., those images that contain only the relevant features corresponding to the class of interest, must be used in the search phase of NAS. The ''uninformative'' images are typically used for increasing the generalization ability of the model and hence do not add value in the model search phase. The generalization ability of the model can be addressed in the evaluation phase, where we train the searched model on the full dataset. An overview of our method is as follows: Given a dataset, we perform density clustering on each class in the feature space and split it into three clusters: 'C REL ' (RELEVANT, very informative about the relevant features), 'C LREL ' (LESS-RELEVANT, less informative about the relevant features) and 'C IRREL ' (IRRELEVANT, uninformative about the relevant features). The criterion we use for this clustering is proposed in [15]. After performing the clustering process, we sample a small fraction of the data from the 'C REL ' clusters and denote these samples as our 'distilled data'. For instance, on Imagenet [16], which contains 1000 classes, we randomly sample only 10% of the training data from the 'C REL ' clusters corresponding to the 1000 classes.
Using this distilled data for Imagenet, we perform NAS using the P-DARTS method (P-DARTS will be our standard baseline for all the experiments except for those in the ablation study section).
The search process takes up only about 0.6 GPU-days on a single Nvidia Tesla V100 GPU with 16GB graphics RAM (≈1 GPU-day if we include the time for clustering) and achieves a top-1 error of 24.25% under the mobile setting in the evaluation phase. To the best of our knowledge, SOTA on Imagenet was achieved by PC-DARTS [17] with a top-1 error of 24.2%, but it needed 3.8 GPU-days and 12.5% of randomly sampled images from the training data for the search process. So our distilled data with the P-DARTS method achieves near-SOTA performance at a much lower computational cost.
To summarize, our contributions are as follows:
1) We propose a novel method called DistilNAS, which uses a data distillation method based on curriculum learning to accelerate the neural architecture search process.
2) DistilNAS achieves near-SOTA performance on Imagenet at a much lower search cost in experiments.
3) We perform extensive experiments to establish the efficacy of the proposed model DistilNAS on various community-standard computer vision datasets such as CIFAR10 [18], CIFAR100 [18], Food-101 [19], Oxford-IIIT Pets [20], Stanford Cars [21], and Caltech-101 [22].
4) We conduct an ablation study to show the significance of selecting samples from informative vs. uninformative clusters and the effect of selecting different architectures for feature extraction.
The remainder of this paper is organized as follows. In Section II, we list some of the relevant related works along with their limitations and discuss how DistilNAS alleviates them. Section III is dedicated to the necessary background and details of the proposed techniques. In Section IV, we present the evaluation of the proposed model by comparing the results with adequate baseline and state-of-the-art methods. Finally, we conclude with major takeaways, potential limitations, and future research directions in Section V.

II. RELATED WORKS
The purpose of NAS is to automatically learn the best possible neural architectures and reduce the human drudgery involved in this process [23], [24]. Neural Architecture Search (NAS) has significantly impacted various important research areas such as image classification [4], [10], object detection [25], [26], semantic segmentation [27], and speech recognition [28]. In particular, NAS has gained a great deal of traction in the research community because of its breakthrough in image classification [4]. Some of the popular initial works in NAS represent the neural architectures in the form of a variable-length string and use a combination of a recurrent neural network and reinforcement learning to optimize and control it [29]. In a typical neural network, similar modules are stacked together to build a more intricate and effective architecture; in NAS, cells are used to achieve the same effect [30]. The efficacy of stacking similar smaller cells, and of combining normal cells and reduction cells, is also reported in other important works [31]. A line of recent works [32] proposes and advocates stacking diversified cell blocks instead of identical blocks and shows the effectiveness of such an arrangement compared to stacking homogeneous cell blocks.
It is a well-analysed fact in the literature that NAS methods suffer from very high computational cost [33]. Several solutions have been proposed to alleviate this limitation. For instance, the authors in [34] transform the neural architecture search space into a differentiable form and use gradient optimization. The limitation of such models is that they rely on fine-tuning a specific structure of the network. Some recent models such as [13] utilize a cell-based search strategy that represents each cell as a directed acyclic graph and relaxes the discrete search space to be continuous. The limitation of [13] is its very high memory requirement. A recent work [17] improves over [13] by reducing memory usage during search and by improving the search efficiency.
In this work, we propose a data distillation method based on a curriculum learning approach that deals with the limitations mentioned earlier, i.e., the large memory requirement and the humongous search cost of the architecture search phase. We extract a very small, efficient, and coherent dataset from the original larger dataset, containing only samples that are very informative about the corresponding class; we call this the distilled dataset. Using this distilled data, we demonstrate that NAS can be performed on large datasets in a reasonable amount of time without compromising on the accuracy of the model. The intuition behind using the distilled dataset during the search phase is that it is wise to focus only on samples that are very informative (containing relevant features) with respect to the underlying class.

III. BACKGROUND AND PROPOSED MODEL
In this section, we first describe the necessary background pertaining to the intuition and cues behind the proposed approach. Subsequently, we explain the overall methodology and the underlying steps involved in DistilNAS.

A. BACKGROUND
In this work, we leverage concepts from curriculum learning [35] in order to accelerate the search stage of NAS. In curriculum learning, a curriculum is devised so that the network starts by learning on a simple task and proceeds progressively to learn from more complex ones. In [15], one such learning curriculum was devised to deal with the problem of noisy labels as follows: Given a dataset, a deep neural network is used to project the images in a given category onto a deep feature space. This is done via the features of the fully connected layer. Naturally, images that are visually similar will be close to each other in the feature space. Then, by formulating a density metric based on the Euclidean distance between the images, density clustering is performed in this feature space.
This clustering step partitions the image dataset into clusters named 'C REL ', 'C LREL ' and 'C IRREL '. The cluster labelled 'C REL ' has the highest density whereas the one labelled 'C IRREL ' has the lowest density in the feature space. Intuitively, a 'C REL ' cluster corresponds to those images that are visually similar whereas the 'C IRREL ' cluster corresponds to those images that are visually dissimilar. After creating these clusters for every category, the network is trained sequentially by starting with the 'C REL ' images and then proceeding to the 'C LREL ' and 'C IRREL ' ones.
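As a rough illustration, this density-based split can be sketched as follows. This is a minimal numpy sketch, not the exact CurriculumNet procedure: the percentile-based cutoff radius and the equal-width density bands are simplifying assumptions.

```python
import numpy as np

def split_by_density(features, bands=3):
    """Split one class's deep features into `bands` clusters by local
    density, densest first (stand-ins for C_REL, C_LREL, C_IRREL).

    features: (n, d) array of deep features for one class.
    Returns a list of index arrays, ordered from densest to sparsest.
    """
    # Pairwise Euclidean distances in the feature space.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))

    # Local density: number of neighbours within a cutoff radius
    # (here a fixed percentile of all distances; a tunable, assumed choice).
    d_c = np.percentile(dist, 40)
    density = (dist < d_c).sum(axis=1) - 1  # exclude the sample itself

    # Partition the samples into equal-width density bands.
    edges = np.linspace(density.min(), density.max(), bands + 1)
    band = np.digitize(density, edges[1:-1])  # 0 .. bands-1, higher = denser
    return [np.flatnonzero(band == b) for b in range(bands - 1, -1, -1)]
```

Visually similar images produce nearby features and hence high local density, so they land in the first ('C REL '-like) cluster.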
For doing NAS, we use P-DARTS as the baseline method, so we briefly describe the DARTS framework on which P-DARTS is built. As in most NAS approaches, in DARTS we search for a robust and efficient building block called a cell, and then stack several cells together to complete the network architecture for the given task. A cell is a directed acyclic graph (DAG) consisting of n nodes (x_0, x_1, . . . , x_{n-1}) where x_0 and x_1 are the input nodes. Every node represents a network layer and the edges between the nodes within a cell are labelled by a fixed operation (e.g., 3 × 3 convolution, identity, maxpool, etc.) chosen from a predefined space of operations, O. The main idea of DARTS is to formulate the information flow from node i to node j as a weighted sum of the |O| operations. This can be mathematically expressed as

f_{i,j}(x_i) = Σ_{o∈O} [ exp(α^o_{i,j}) / Σ_{o'∈O} exp(α^{o'}_{i,j}) ] · o(x_i),

where x_i is the output of the i-th node and α^o_{i,j} is the weight parameter for operation o(x_i). The output of node j is given by x_j = Σ_{i<j} f_{i,j}(x_i), and the output of a cell is obtained by concatenating the outputs of all the nodes x_2, x_3, . . . , x_{n-1}. Due to the differentiable nature of the function above, we can do gradient descent with respect to both the layer weights and the architectural parameters α^o_{i,j}. Once the search phase ends, the operation with the highest value of α^o_{i,j} on the edge (i, j) is kept and the rest of the operations on (i, j) are discarded. Due to some limitations of DARTS, such as its severe memory requirement and the 'depth gap' between the search and evaluation phases, an improved method called P-DARTS was proposed in the literature [14]. In P-DARTS, techniques such as 'search space approximation' and 'search space regularization' are introduced. The aim of these techniques is to progressively prune the operation search space, thereby addressing the issue of the depth gap between the search and evaluation phases. For more details, please refer to [14].
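The softmax-weighted mixed operation at the heart of DARTS can be illustrated with a toy numpy sketch, where scalar 'operations' stand in for the convolution, identity and pooling operations of the real search space:

```python
import numpy as np

def mixed_op(x, alphas, ops):
    """DARTS continuous relaxation: f_ij(x) = sum over o of
    softmax(alpha)_o * o(x), differentiable in the alphas."""
    w = np.exp(alphas - alphas.max())
    w = w / w.sum()                      # softmax over architecture weights
    return sum(w_o * op(x) for w_o, op in zip(w, ops))

# Toy operation space: identity, doubling, and a "zero" operation.
ops = [lambda x: x, lambda x: 2 * x, lambda x: 0 * x]

# Uniform alphas -> plain average of the three candidate outputs.
print(mixed_op(np.array([3.0]), np.zeros(3), ops))   # [3.]

# After the search, only the argmax-alpha operation is kept; a large
# alpha for the second op makes the mixture approach o(x) = 2x.
print(mixed_op(np.array([3.0]), np.array([0.0, 10.0, 0.0]), ops))
```

Because the mixture is a smooth function of the alphas, both the layer weights and the architecture weights can be updated by gradient descent, which is exactly what makes the one-shot search tractable.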

B. PROPOSED METHOD
Given an arbitrary dataset, we hypothesize that: • Only 'informative' images, i.e., images that contain only the features corresponding to the relevant class, must be used in the search phase of NAS.
• Furthermore, 'uninformative' images, i.e., those images that also contain features corresponding to the objects that are irrelevant for classification, might have a detrimental effect on the searched architecture. An intuitive explanation for the first part of the hypothesis is as follows: Firstly, the 'informative' images within a class by and large correspond to those images that belong to the C REL cluster created via CurriculumNet [15]. The images that belong to the C REL cluster are visually very similar. Also, those images in the C REL cluster are very important and useful for the corresponding class as they contain only the relevant features that are necessary for the classification task. Moreover, since the clustering criterion ensures that C REL clusters are sufficiently dense, this means that there are enough samples of a similar type from which the network can learn. For all these reasons, we expect that the samples from the C REL cluster should be given more weightage when doing NAS.
As for the second part of the hypothesis, 'uninformative' images are the samples that belong to the less relevant clusters (i.e., C LREL and C IRREL ). The images belonging to these clusters are visually dissimilar. In most cases, these samples also contain a lot of 'irrelevant' information; for instance, they contain features of the concerned class along with features pertaining to objects that are not of interest for classification. So, we conclude that these samples are not of much use in the search phase and in some cases might affect the search process adversely. This is because we must update the parameters of the architecture primarily based on the ability of the model to detect features of the class of interest. Moreover, the uninformative images are typically used to increase the ability of the model to generalize well. But in the search phase our training budget is limited, and hence we must focus only on detecting those features that are of interest for the classification task. The generalization ability of the model can be dealt with during the evaluation phase, where we train it with the full dataset.
We use the CurriculumNet algorithm to create intra-category clusters for a given dataset, i.e., we create the three clusters C REL , C LREL and C IRREL for every class. Here, it is important to note that only the samples in the training set are clustered; the test data is left untouched. In using CurriculumNet, we employ a pre-trained Resnet-152 for projecting the samples onto a deep feature space (FC-layer features) in the first stage. The reason for choosing Resnets is that they have an architecture similar to the NAS models: a basic block stacked together appropriately to form the full network. In the ablation study section, we will explain the effect of choosing different architectures for feature extraction. Once the clusters are created (C REL , C LREL , and C IRREL ), we sample a fraction of the images from the C REL cluster and call this our ''distilled data'' for the given dataset. The size of the distilled set is constrained by the size of the original dataset. For instance, on Imagenet, we only need 10% of the training set (all belonging to the C REL clusters) to construct our distilled dataset. In the next section, we will provide experimental justification for our hypothesis.
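Putting the steps above together, the construction of the distilled data can be sketched as follows. This is a simplified, self-contained numpy sketch: `densest_cluster` is a stand-in for the CurriculumNet clustering, and the Resnet-152 feature extraction is assumed to have already produced the per-class feature arrays.

```python
import numpy as np

def densest_cluster(feats, bands=3):
    """Indices of the densest of `bands` density bands (a C_REL stand-in)."""
    dist = np.sqrt(((feats[:, None] - feats[None, :]) ** 2).sum(-1))
    density = (dist < np.percentile(dist, 40)).sum(1) - 1
    edges = np.linspace(density.min(), density.max(), bands + 1)
    return np.flatnonzero(np.digitize(density, edges[1:-1]) == bands - 1)

def build_distilled_data(features_by_class, fraction=0.10, seed=0):
    """For every class, keep a random `fraction` of its training samples,
    drawn only from the densest (most class-informative) cluster."""
    rng = np.random.default_rng(seed)
    distilled = {}
    for cls, feats in features_by_class.items():
        c_rel = densest_cluster(feats)
        # Keep at most the whole C_REL cluster if it is below the budget.
        n_keep = min(max(1, int(fraction * len(feats))), len(c_rel))
        distilled[cls] = rng.choice(c_rel, size=n_keep, replace=False)
    return distilled
```

Only the training split is clustered and sampled; the test set is left untouched, matching the protocol described above.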

IV. EXPERIMENTAL EVALUATION

A. RESEARCH QUESTIONS
We design and conduct the experiments to find the answers to the following research questions.
• RQ1 Are class-informative samples (containing relevant features for the corresponding class) specifically useful during the search phase of NAS?
• RQ2 Can we perform NAS with a smaller, highly efficient distilled dataset extracted from the original dataset and still do better in terms of search cost and performance?
• RQ3 Does the performance degrade with the introduction of uninformative samples?
• RQ4 What is the effect of different architectures used for feature extraction before clustering on the overall performance of the model?

B. DATASETS
In this section, we show the efficacy of the distilled dataset on arbitrary datasets. In order to do so in the best manner possible, we use the following criteria while choosing the datasets: • Size -We evaluate our model with datasets of varying sizes, ranging from a few thousand images to a million or more.
• Resolution -We experiment with both low as well as high resolution images.
• Class labels -To evaluate the robustness of our model, we use datasets with numbers of labels ranging from only ten to hundreds or even thousands. Based on the criteria above, we report experimental results on the following datasets: Imagenet, Food-101, Stanford Cars, Oxford-IIIT Pets, Caltech-101, CIFAR-10 and CIFAR-100. All the experiments were performed on a single Nvidia Tesla V100 GPU. Before we present the results, we introduce some notation for our baseline models which will be used from now on: • DIST -denotes the model that is obtained by doing model search on the distilled data of the given dataset.
• RAND -represents the model that is obtained by doing model search on a random sample of the given dataset.
The size of the sample is the same as the size of the distilled data.
• MIX -refers to the model that is obtained by doing model search on data which consists of samples from all the clusters, i.e., C REL , C LREL and C IRREL . For instance, if the distilled data consists of 10% of the total training data (all of which is sampled from the C REL cluster), then we sample 10% from each of the three clusters for this baseline.
• COMP -denotes the model that is obtained by doing model search on the complete dataset. The models obtained from the search phase are then evaluated on the full dataset. The evaluation is repeated five times and the average top-1 accuracies are reported. As previously noted, we will be using 'Progressive DARTS (P-DARTS)' for doing NAS.
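For concreteness, the per-class construction of the three searched-on subsets (DIST, RAND, MIX) can be sketched as follows, assuming the cluster index lists are already available and that C REL is large enough to cover the sampling budget; COMP is simply the full dataset:

```python
import numpy as np

def make_search_subsets(clusters, frac=0.10, seed=0):
    """clusters: [C_REL, C_LREL, C_IRREL] index arrays for one class.
    Returns (DIST, RAND, MIX) index arrays of (roughly) equal size."""
    rng = np.random.default_rng(seed)
    all_idx = np.concatenate(clusters)
    k = int(frac * len(all_idx))

    dist_set = rng.choice(clusters[0], size=k, replace=False)  # only C_REL
    rand_set = rng.choice(all_idx, size=k, replace=False)      # same size, any cluster
    mix_set = np.concatenate([                                 # frac of each cluster
        rng.choice(c, size=int(frac * len(c)), replace=False) for c in clusters
    ])
    return dist_set, rand_set, mix_set
```

Keeping the three subsets the same size is what makes the later DIST vs. RAND vs. MIX comparisons a test of sample quality rather than sample quantity.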

1) IMAGENET RESULTS
The Imagenet dataset is amongst the most popular datasets for image classification in computer vision. It consists of nearly 1.3 million images for training and 50k images for testing, distributed over 1000 classes. Most NAS approaches, with the exception of PC-DARTS [17], Proxyless-NAS [36] and TE-NAS [37], perform architecture search on the CIFAR datasets and then do transfer learning on Imagenet. In this work, we perform model search directly via the distilled dataset extracted from the original Imagenet. As is the norm in the literature on Imagenet, we work in the mobile setting where the resolution of the input images is 224 × 224 and the FLOPs are limited to 600 million.
The distilled data for Imagenet is constructed by randomly sampling 10% of the training data (roughly 130,000 images distributed equally over the 1000 classes) from the C REL clusters. We run the P-DARTS search on this distilled data. The architecture for search is modified slightly by adding three convolutional layers of stride 2 to reduce the resolution of the image from 224 × 224 to 28 × 28. The search process of P-DARTS is done in three stages. In the first stage, the architecture consists of five cells, which is increased to eleven and seventeen in the second and third stages respectively. The Dropout [38] rates for skip-connect are set to 0.1, 0.2 and 0.3 for the three stages respectively. The rest of the search settings remain the same as in P-DARTS, where in every stage the network is trained with a batch size of 128 for 25 epochs. An Adam optimizer with a learning rate of 6 × 10^−4, a weight decay of 0.001 and momentum parameters set to 0.5 and 0.999 is used for architecture search. The search phase takes about 1 GPU-day including the time needed for clustering the data (0.6 GPU-days for just the search). To achieve a reasonable trade-off between computational complexity and accuracy, we choose the model with one skip-connection and label this model DIST.
In the evaluation phase, we follow the same approach as in the DARTS framework, where a network of 14 cells and 48 initial channels is evaluated. We train the network for 350 epochs with a batch size of 128 on a single Nvidia Tesla V100 GPU. The training process takes about 13 days with the P-DARTS implementation provided by the authors. The network is optimized via SGD with a learning rate of 0.08 which is decayed linearly after every epoch, a momentum of 0.9 and a weight decay of 3 × 10^−5. As in the P-DARTS evaluation phase, we use label smoothing and an auxiliary tower in the training process. The model achieves a top-1 accuracy of 75.75%, averaged over five runs. This is very close to the SOTA achieved by PC-DARTS, with an accuracy of 75.8%, which, however, needed 3.8 GPU-days for architecture search and 12.5% of the training data. The performance of the various NAS algorithms on Imagenet is compared with our proposed method in table 2.
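Since "decayed linearly after every epoch" admits several readings, the schedule can be sketched as follows (a hypothetical helper reflecting one natural reading: linear decay from the base rate to zero over the full run):

```python
def linear_lr(epoch, base_lr=0.08, total_epochs=350):
    """Learning rate for a given epoch under per-epoch linear decay
    from base_lr down to 0 at the end of training."""
    return base_lr * (1.0 - epoch / total_epochs)

# First, middle, and last epochs of the Imagenet evaluation run.
for e in (0, 175, 349):
    print(f"epoch {e:3d}: lr = {linear_lr(e):.5f}")
```

In a PyTorch training loop, the same schedule could be obtained by stepping a per-epoch lambda scheduler over the SGD optimizer.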

2) FOOD-101 RESULTS
The Food-101 dataset for image classification comprises 101 classes, with 75,750 images for training and 25,250 images for testing. To the best of our knowledge, NAS has not been performed directly on Food-101 in any of the previous works. As in the case of Imagenet, we reduce the images from a resolution of 224 × 224 to 28 × 28 by adding three convolutional layers of stride 2. For Food-101, the distilled data is constructed by randomly sampling 20% of the training data from the C REL clusters. This increase in the size of the distilled data as compared to that of Imagenet can be attributed to the fact that Food-101 is much smaller than Imagenet. Using only 10% of the training data as before led to unstable architectures with lower accuracy in our experiments. Therefore, we had to increase the size of the distilled data to 20%. The search process is the same as the one described for Imagenet in the previous section and is performed with a single NVIDIA Tesla V100 GPU. The search process takes about 0.25 GPU-days (0.1 GPU-days for just the model search).
In the evaluation phase, we again consider a network comprising 14 cells and 48 initial channels. The network is trained for 400 epochs with a batch size of 128 on a single Nvidia Tesla V100 GPU. Again, we use the SGD optimizer with a learning rate of 0.1 decayed linearly after every epoch, a momentum of 0.9 and a weight decay of 6 × 10^−5. We also apply the auxiliary tower and label smoothing in the training process. The model achieves a top-1 accuracy of 84.05%, averaged over five runs. The results of our method along with the baselines are given in table 3.

3) STANFORD CARS RESULTS
Stanford Cars is another useful smaller dataset in computer vision. It consists of 196 categories, with 8,144 images for training and 8,041 images for testing. The input images of resolution 224 × 224 are reduced to 28 × 28. Since the dataset is small, we randomly sample 50% of the training data from the C REL clusters and use it for architecture search. The search settings are the same as for Imagenet, except for the batch size, which is reduced to 64. The search process takes 0.15 GPU-days including the time for clustering.
The network configuration for the evaluation phase is the same as that for Imagenet. We train it for 600 epochs with a batch size of 128. The optimizer used is SGD with a learning rate of 0.15 decayed linearly every epoch, a momentum of 0.9 and a weight decay of 6 × 10^−5. As before, we apply the auxiliary tower and label smoothing while training. The results from our approach are documented in table 4.

4) CALTECH-101 RESULTS
Caltech-101 is a publicly available dataset that is well studied in computer vision. It comprises 102 categories with 3,060 images for training and 6,084 images for testing. Here also, we randomly sample 50% of the training images from the C REL clusters and run the P-DARTS search algorithm on them. The search settings are the same as those of Imagenet, with the batch size set to 96. The search phase takes 0.2 GPU-days including the time for clustering.
The searched architecture is then evaluated with the same network configuration as that of Imagenet. We train it for 400 epochs with the SGD optimizer whose parameters are the same as the ones used in Imagenet evaluation. Since the Caltech-101 dataset contains class imbalances, we report the mean per-class accuracy. We refer to table 5 for results.

5) OXFORD-IIIT PETS RESULTS
The Oxford-IIIT Pets dataset consists of 37 classes, with 3,680 images for training and 3,369 images for testing. The distilled data for Oxford-IIIT Pets is obtained by sampling 50% of the training images from the C REL clusters. Then, we run the P-DARTS algorithm with the batch size set to 64. The search phase takes about 0.2 GPU-days.
In the evaluation phase, we train the network for 500 epochs with a batch size of 128. The network is optimized via SGD with a learning rate of 0.1 decayed linearly after every epoch, a momentum of 0.9 and a weight decay of 8 × 10^−5. The performance of our approach is given in table 6.

6) CIFAR-10 AND CIFAR-100 RESULTS
CIFAR-10 is one of the most well known datasets for image classification. It consists of 50,000 images for training and 10,000 images for testing purposes, distributed equally over 10 classes. The resolution of the images is 32 × 32. For CIFAR-10, we randomly sample 20% of the training images from the C REL clusters as the distilled data. Then, we perform NAS on the distilled data using the P-DARTS method with the search parameters set to the same values as mentioned in the P-DARTS paper. The search process takes about 0.2 GPU-days including the time needed for clustering. For evaluation, we follow the same process as mentioned in the P-DARTS paper.
CIFAR-100 is a similar dataset with the same size as that of CIFAR-10, but comprising 100 categories. Here also, we randomly sample 20% of the training images from the C REL clusters and use them as the distilled data for doing NAS. The search settings and the parameters are the same as mentioned in the P-DARTS paper. Here we note that the reported top-1 accuracies on CIFAR-10 and CIFAR-100 in the P-DARTS paper were 97.5% and 84.08% respectively. However, we were not able to achieve the same performance when we ran P-DARTS on the full dataset with the same settings on a Nvidia Tesla V100 GPU. Our results on CIFAR-10 and CIFAR-100 are shown in table 7.

C. ABLATION STUDY

1) EFFECT OF NOISY SAMPLES
In this section, we provide experimental evidence and justification for our hypothesis that uninformative samples can be detrimental in the architecture search phase. We create two subsets of data, one for Imagenet and one for Food-101, and perform an ablation study on them. The reason for choosing these datasets is that they contain a substantial number of uninformative samples for each class. The size of the subsets used here is the same as the size of the distilled data used for Imagenet (10%) and Food-101 (20%). The proportion of samples from the C LREL and the C IRREL clusters is kept in the ratio of 1:1 for each class. The model search is done via P-DARTS with the same settings as those of Imagenet and Food-101 respectively.
The model evaluation is also done in the same settings as that of Imagenet and Food-101. We start by using the same hyperparameter values for learning rate, weight decay etc. Then, we also do a small grid search around the hyperparameter values previously used and repeat the evaluation process. This is done to ensure that our evaluation of these models is fair. The accuracy reported is the best amongst all the repeated evaluations (more precisely, we repeat the evaluation phase six times on each dataset). We refer to table 8 for results.
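The small grid search around the previously used hyperparameters can be sketched as follows; the candidate values shown are illustrative, not the exact grids we used:

```python
from itertools import product

def neighborhood_grid(spans):
    """Enumerate every combination of hyperparameter candidates,
    e.g. a few learning rates x a few weight decays around the
    values used for the distilled-data models."""
    keys = sorted(spans)
    return [dict(zip(keys, combo)) for combo in product(*(spans[k] for k in keys))]

# Illustrative grid around lr = 0.08 and weight decay = 3e-5.
grid = neighborhood_grid({"lr": [0.06, 0.08, 0.10],
                          "weight_decay": [1e-5, 3e-5, 6e-5]})
print(len(grid))  # 9 configurations; the best evaluation result is reported
```

Including the original values in each span guarantees that the baseline configuration is among the settings evaluated, so the reported best can only match or exceed it.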

2) DIFFERENT ARCHITECTURES AS FEATURE EXTRACTORS
We now discuss the effect of using different architectures for feature extraction before clustering. The architectures we consider here are VGG [44] and Densenet [45]. Specifically, we have used VGG-19 with batch normalization and Densenet-169. Using these two models, we create representative subsets for Food-101 and Caltech-101 (four subsets in total, two corresponding to each model). The experiments are conducted with P-DARTS and the settings used are the same as the ones used in Sections IV-B2 and IV-B4. As seen from table 9, our hypothesis that informative samples lead to better architectures still holds. We also note that the individual accuracies are lower than those of the model searched on the distilled data created via Resnet-152.

3) EFFECT OF RANDOM SAMPLING
Since we randomly sample from the C REL clusters to create the distilled data, it is important to verify that the results are not biased by or dependent on the sampling itself. To do so, we repeat the sampling process (with replacement) five times and create five different sets of distilled data on Food-101 and Stanford Cars. Then, we perform the search and the evaluation process of P-DARTS on all of them. The results are documented in table 10. As seen from the table, the model complexity and accuracy have negligible variance across the five different subsets.

V. CONCLUSION
Speeding up the search phase and avoiding the need to search on a simple proxy dataset (which is not a true representation of the target dataset) are the two major challenges in neural architecture search. In this work, we introduce a very intuitive and effective solution to address these challenges in the context of NAS for image classification. The crux of our approach is to curtail the time/cost incurred in the search phase of NAS by focusing only on detecting the features relevant for classification; we do so by performing NAS on a small, carefully distilled dataset extracted from the original target dataset.
A possible limitation of the proposed approach is that the model may suffer from overfitting in some cases due to over-optimization; this can be addressed in the evaluation phase, where we train with the full dataset. We demonstrate the effectiveness of this approach with a series of well-known benchmark datasets, ranging in size from a few thousand images to a million or more. We also show experimentally that it is futile to utilize uninformative samples at the time of architecture search. Since we focus extensively on image classification, in the future it is worth extending this approach to problems such as object detection, segmentation, etc.

SWAROOP N. PRABHAKAR received the Ph.D. degree in theoretical computer science from The Institute of Mathematical Sciences, Chennai. Currently, he is working on industrial applications at the intersection of NLP and computer vision. His research interest includes algorithmic problems with a mathematical flavor to them.
ANKUR DESHWAL received the master's degree in computer science and engineering from IIT Kanpur, in 2010. Currently, he is a Staff Engineer with the AI-Computing Group, Samsung Semiconductor India Research, Bangalore. He has authored publications in conferences, such as ISCAS, ICML, and SIGGraph, and granted with several patents in area of compiler, deep learning, and AutoML. His research interests include intersection of deep learning and compilers.
RAHUL MISHRA received the master's degree in computer science and engineering with specialization in data engineering from IIIT-Delhi, in 2013, and the Ph.D. degree in computer science and engineering with specialization in NLP from the University of Stavanger, Norway, in 2021. Currently, he is working as a Senior Staff Engineer with the AI-Computing Group, Samsung Semiconductor India Research, Bangalore. He is a Frequent Invited Reviewer for various top-notch conferences, such as CIKM, CoNLL, IJCAI, and KDD, and journals such as, Expert Systems with Applications, ACM Computing Surveys, and IEEE ACCESS. His research interests include intersection of deep learning, and natural language processing with specific interest in representation learning, self-supervision, and multi-modal learning.
HYEONSU KIM received the bachelor's degree in computer science and engineering from the Hankuk University of Foreign Studies, Seoul, South Korea. He is a Staff Researcher with Samsung Electronics, Hwaseong, South Korea. His research interests include AI compiler and in-memory processor programming model.