Building High-throughput Neural Architecture Search Workflows via a Decoupled Fitness Prediction Engine

Neural networks (NN) are used in high-performance computing and high-throughput analysis to extract knowledge from datasets. Neural architecture search (NAS) automates NN design by generating, training, and analyzing thousands of NNs. However, NAS requires massive computational power for NN training. To address challenges of efficiency and scalability, we propose PENGUIN, a decoupled fitness prediction engine that informs the search without interfering in it. PENGUIN uses parametric modeling to predict the fitness of NNs. Existing NAS methods and parametric modeling functions can be plugged into PENGUIN to build flexible NAS workflows. Through this decoupling and flexible parametric modeling, PENGUIN reduces training costs: it predicts the fitness of NNs, enabling NAS to terminate training NNs early. Early termination increases the number of NNs that fixed compute resources can evaluate, thus giving NAS additional opportunity to find better NNs. We assess the effectiveness of our engine on 6,000 NNs across three diverse benchmark datasets and three state-of-the-art NAS implementations using the Summit supercomputer. Augmenting these NAS implementations with PENGUIN can increase throughput by a factor of 1.6 to 7.1. Furthermore, walltime tests indicate that PENGUIN can reduce training time by a factor of 2.5 to 5.3.


INTRODUCTION
Neural networks (NN) are powerful models that are increasingly used in traditional high-performance computing (HPC) scientific simulations and new research areas, such as high-performance artificial intelligence and high-throughput data analytics, to solve problems in physics [1], materials science [2], neuroscience [3], and medical imaging [4], among other domains. Finding suitable NNs is a time-consuming process involving several rounds of hyperparameter selection, training, validation, and manual inspection. Neural architecture search (NAS) automates the process of finding near-optimal models for a given dataset, but it comes at a high training cost involving thousands of NNs on a large number of HPC resources. For instance, conventional NAS algorithms exhibit prohibitive computational demand, where training each NN to convergence is the main bottleneck [5], [6]. A single NAS run on a small dataset of one or two petabytes can require HPC systems with thousands of accelerators [1]. A training run of a large language model such as BERT takes more than 80 hours on 16 TPUv3 accelerators [7], and training a visual transformer requires decades of compute time on a TPUv3 [8]. For even larger datasets, scientists must allocate significant time on the largest compute resources available (in the range of tens of thousands of GPU hours [1]) to conduct a single search of NN models. Furthermore, in the early stages of NAS, up to 88% of NNs fail to learn [9], wasting expensive compute resources.
In the current literature there are multiple strategies to conduct NAS, including random searches, grid searches, hyperparameter sweeps, reinforcement learning, evolutionary optimization, gradient-based optimization, and Bayesian optimization [10], [11]. Many NAS implementations rely on built-in "truncated training", a fixed termination criterion where each NN is trained for a set number of epochs; fixed termination criteria result in wasting expensive HPC resources [12], [13], [14], [15]. Advanced NAS implementations use a fitness prediction strategy and dynamically terminate training each NN once the fitness prediction is calculated [16], [17], [18]. NAS implementations that employ fitness prediction embed their prediction strategy in their search process, resulting in a tightly coupled search and prediction solution. Tight coupling between search and prediction strategies means that NAS cannot be optimized without heavily interfering in a given NAS implementation. By decoupling search and prediction and creating a flexible fitness prediction method, we make NAS optimizations portable across problems and datasets, and simultaneously increase NAS implementations' throughput and efficiency. We implement this decoupling approach and its flexible fitness prediction via our engine PENGUIN, which plugs into existing NAS implementations, informing each search without interfering in it.
Our solution decouples search and prediction strategies by augmenting NAS implementations with our PENGUIN engine to build flexible and composable NAS workflows. PENGUIN uses a parametric modeling approach to predict NN fitness early in the training phase; it functions independently of the NN architecture, the particular NAS implementation, and the target dataset. NAS can leverage PENGUIN's fitness predictions to terminate training NNs early, thus increasing throughput and exploring a larger space of candidate models in a scalable way. PENGUIN can plug into existing NAS implementations to augment their searches. Any NAS whose search strategy includes training NNs and making decisions that depend on NN fitness can use PENGUIN (see Table 1 for examples).

[Table 1: example NAS implementations that can plug in PENGUIN, with their search strategies: MENNDL [19] (evolutionary), NAS for image reconstruction [20] (evolutionary), psoCNN [13] (particle swarm), hierarchical representation search [14] (evolutionary), and Auto-Keras [24] (Bayesian).]
Not only can PENGUIN be plugged into a variety of different NASes, but PENGUIN itself can use any parametric function for modeling NN fitness. By tailoring the function parameter values to the NN's fitness data, PENGUIN constructs a fitness model; it then extrapolates from the model to predict future fitness of the NN. Table 2, adapted from Viering and Loog [25], lists a variety of parametric functions, including several exponential and power functions, that have been successfully used for modeling learning curves in a variety of machine learning problems. Any of these functions can be plugged into PENGUIN's fitness modeling method, as can custom user-defined parametric functions.
We perform a case study in which we evaluate the accuracy of PENGUIN's predictions, and its throughput gain, as compared to three different unaugmented NAS implementations (i.e., MENNDL [19], EvoCNN [12], and NSGA-Net [15]). By augmenting each of these NAS implementations with PENGUIN and a parametric function from Table 2, we can save 39% to 86% of training epochs, resulting in an increase in throughput by a factor of 1.6 to 7.1. Furthermore, we measure the actual training walltime speedup that can be attained by augmenting MENNDL with PENGUIN and observe a decrease in training time by a factor of 2.5 to 5.3 compared to the unaugmented NAS.

BUILDING A DECOUPLED NAS WORKFLOW
Designing an NN manually is time-consuming and prone to human bias, often resulting in sub-optimal models. NAS exploits supercomputing resources to automate NN architecture design, alleviating the architecture tuning barrier. NAS selects NN models from a search space, trains them on the target dataset, evaluates their fitness (e.g., validation accuracy or loss), and uses the fitness information to generate new NNs. NAS implementations can be supported by strategies that provide predictions for NN fitness at a given epoch in the future, enabling a NAS to truncate the training process, compare the fitness predictions for different NNs, and steer the search towards NNs that yield better results faster. NAS implementations found in the literature either do not use a prediction strategy, opting instead for fixed truncated training, where NNs are trained for a statically defined, fixed number of epochs and then compared; or their design involves a tight coupling between the NAS search strategy and a particular prediction strategy, as depicted in Figure 1. The tight coupling imposes restrictions on the generalization of predictions across NAS implementations, datasets, and problems, requiring non-trivial, ad-hoc tuning of the NAS implementation. In contrast to current tightly coupled NAS solutions, we envision a workflow-oriented approach to the NAS problem, as depicted in Fig. 2. To this end, PENGUIN is designed to be completely decoupled from any NAS, and its prediction strategy is not tailored to a specific search strategy or dataset, allowing portability across NAS implementations and scientific domains. PENGUIN can be utilized even by NAS implementations that already employ fixed truncated training, allowing them to truncate training as soon as NN fitness predictions are calculated by our engine.
Besides being NAS-agnostic, PENGUIN integrates a framework for parametric modeling that adapts to different NN fitness metrics. Parametric modeling of an NN's fitness involves fitting a parametric function (i.e., a function with parameters to determine, for example, F(x) = ax^b + c) to the NN's historical fitness data. This fitting process produces a function (e.g., f(x) = 10x^2 + 5) that models the fitness curve. The fitness model can then be used to extrapolate a prediction for the fitness value at a given epoch in the future, e_pred, by calculating the value of the model at that epoch, f(e_pred). As shown in Fig. 2, PENGUIN allows users to plug in any desired parametric function (some examples appear in Table 2).

[Fig. 2: Overview of our vision of a decoupled NAS workflow supported by PENGUIN. PENGUIN constitutes the core prediction engine that, given an input NN provided by a NAS and a parametric function, outputs a fitness prediction that can be leveraged by a NAS.]
PENGUIN's design and interfaces provide users with the building blocks to flexibly build NAS workflows from tools and methods already available. The decoupled nature of PENGUIN constitutes a crucial first step towards scalable high-performance NAS workflows, since this decoupling enables optimization of the resource allocation for each component of the workflow, the component orchestration and placement, and the mechanisms to exchange data efficiently amongst them.

THE PENGUIN ENGINE
PENGUIN augments NAS to enable a modular NAS workflow in which the search strategy is fully decoupled from the prediction strategy. PENGUIN is designed to plug into any NAS method and run on dedicated resources concurrently with the NAS, informing the search without interfering in it. PENGUIN dynamically predicts the fitness each NN could attain during the NN training phase and informs the NAS of each NN's predicted fitness.
The NAS selects NNs from the search space to explore and trains those NNs. When a NAS plugs in PENGUIN, periodically during training of each NN, the NAS passes partially trained NNs to PENGUIN. PENGUIN then iteratively executes a three-step process depicted in Fig. 2. In each iteration of the process, PENGUIN first validates the NN and calculates its fitness (NN validation); then executes a parametric modeling method in order to construct a model for the NN's fitness curve, and uses the parametric model to extrapolate a prediction for the fitness the NN is expected to attain at a given epoch in the future (Parametric modeling); and finally determines whether the prediction has converged, deciding whether to output the NN fitness prediction to the NAS or continue the iterative process (Prediction analyzer).

NN Validation
When classifying a scientific dataset, the dataset is divided into a training set and a validation set. Both PENGUIN and the NAS know the location of the training and validation sets and can access these data. The NAS trains the NNs on the training set and passes partially trained NNs to PENGUIN every E epochs throughout training, where an epoch means one cycle through the training set. The value of the parameter E is user-defined. In our tests, we let E = 0.5, meaning that each NN is passed to PENGUIN every 0.5 epochs throughout its training. At each iteration, PENGUIN validates the NN M and calculates its fitness fit_V on the validation set at the current epoch e. We build an ordered list of these (epoch, fitness) datapoints for the NN across the iterations of PENGUIN. In the first iteration, we initialize the list with the starting epoch and fitness values; in subsequent iterations, we append the current (epoch, fitness) datapoint to the ordered list.
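The bookkeeping above can be sketched as follows (the function and variable names are ours for illustration, not PENGUIN's actual API):

```python
# Minimal sketch of the NN-validation bookkeeping: every E epochs the
# partially trained NN is validated, and an (epoch, fitness) datapoint
# is appended to an ordered history list for that NN.

E = 0.5  # validation interval in epochs; the paper's tests use E = 0.5

def record_fitness(history, epoch, fitness):
    """Append the current (epoch, fitness) datapoint to the ordered history."""
    assert not history or epoch > history[-1][0], "epochs must be increasing"
    history.append((epoch, fitness))
    return history

# Example: three validation iterations at half-epoch granularity.
h = []
for step, acc in [(0.5, 0.11), (1.0, 0.19), (1.5, 0.26)]:
    record_fitness(h, step, acc)
# h now holds the ordered (epoch, fitness) datapoints for this NN
```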

Parametric Modeling
Given a parametric function, we attain a model for NN fitness by determining values for the function parameters using least squares regression. Specifically, we utilize SciPy's optimize.curve_fit (Curve Fit) method to find the parameter values that best fit the ordered list of the NN's epoch and fitness data. The result is a function that models fitness in terms of epochs. This function is then used to extrapolate a fitness prediction for the NN at a given epoch in the future.
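A minimal sketch of this fitting-and-extrapolation step with SciPy's optimize.curve_fit, using the parametric function adopted later in our case study; the synthetic data, initial guesses, and parameter bounds here are illustrative assumptions, not values from the paper:

```python
# Fit the parametric function f(x) = a - b^(c-x) to an NN's (epoch, fitness)
# history, then extrapolate a fitness prediction at a future epoch e_pred.
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    # Concave-down, increasing curve approaching the asymptote y = a (for b > 1).
    return a - b ** (c - x)

# Synthetic (epoch, fitness) history sampled every 0.5 epochs (noiseless demo data).
epochs = np.arange(0.5, 4.5, 0.5)
fitness = f(epochs, 0.8, 1.6, 0.0)

# Least-squares fit of the three parameters; the bounds keep b >= 1 so the
# exponential stays well-defined during optimization.
(a, b, c), _ = curve_fit(f, epochs, fitness, p0=(0.5, 1.5, 0.0),
                         bounds=([0.0, 1.0, -10.0], [1.0, 10.0, 10.0]))

e_pred = 20                 # future epoch for which to predict fitness
fit_p = f(e_pred, a, b, c)  # extrapolated fitness prediction
```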

Fitting the parametric function to predict fitness
Algorithm 1 shows the internal parametric modeling procedure. Given an NN M, the procedure takes as input the current training epoch (e); the ordered list of (epoch, fitness) values and predictions from previous iterations (H), also called the fitness history; the fitness value at the current epoch e (fit_V); and a parametric function (F), along with optional bounds and initial values for the function's parameters.
The first step in our parametric modeling method is the creation of a new tuple of fitness metrics for the NN M at the specified epoch (h_e). This tuple is inserted into the fitness history H for M (Line 3) so that it is taken into consideration as a new datapoint for the Curve Fit method.
The rest of the steps required to construct the fitness model f only take place if there are enough datapoints to conduct the Curve Fit method. The number of datapoints needed in H to proceed with curve fitting, denoted C_min, is equal to the degrees of freedom in the parametric function; this is the same as the number of parameters to determine. For example, the parametric function F(x) = a − b^(c−x) has three parameters and three degrees of freedom; thus it requires C_min = 3 datapoints. Once |H| ≥ C_min (Line 4), we feed H to SciPy's optimize.curve_fit method to find the function f that best fits the historic fitness data for the NN (Line 6). The function f is used to extrapolate a prediction for the NN's fitness at a given epoch in the future (e_pred). The fitness the NN is predicted to attain at epoch e_pred is given by f(e_pred · r) = fit_P, where r is the rescaling factor that was used for the epoch data. We append the value of fit_P to the fitness tuple for the current epoch (h_e) so that it will be reflected in the fitness history (Line 9). As a result of our modeling method, we output the updated history of fitness calculations and predictions (H) for analysis by the prediction analyzer.
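The C_min guard can be illustrated as follows (a sketch; the helper name and the use of inspect to count parameters are our own):

```python
# C_min equals the number of free parameters of the parametric function F;
# curve fitting only proceeds once the history holds at least C_min datapoints.
import inspect

def min_datapoints(F):
    """Number of free parameters of F, excluding the independent variable x."""
    return len(inspect.signature(F).parameters) - 1

def F(x, a, b, c):
    return a - b ** (c - x)

c_min = min_datapoints(F)   # three parameters -> need at least three datapoints

history = [(0.5, 0.01), (1.0, 0.17)]
ready = len(history) >= c_min   # not yet: wait for one more validation iteration
```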

Prediction Analyzer
Each time the parametric modeling method receives a new fitness measurement for an NN M at epoch e, it generates a new fitness model f from the parametric function F, and it uses the fitness model to compute a new prediction (fit_P) for the fitness the NN will attain at epoch e_pred. This fitness prediction fit_P is appended to the history of measured and predicted fitness values H. The fitness model and its associated fitness prediction fit_P vary from one iteration to the next because new fitness measurements are calculated and appended to H at each iteration; hence the parametric modeling method has one more datapoint with which to construct the fitness model in each iteration. The goal of the prediction analyzer is to determine whether the iterative fitness predictions for M have converged to a stable value, which we denote by FIT_P. If so, the analyzer outputs the final fitness prediction FIT_P, and PENGUIN's iterative process terminates. If not, the next iteration begins, and the NN resumes training.

Parameterizing the analysis
To guide the behavior of the prediction analyzer and its convergence criteria, we rely on four parameters:
• N, the number of most recent fitness predictions to consider;
• t, the threshold describing how much variability is tolerated to establish convergence and report a final fitness prediction;
• E, the number of epochs per iteration;
• e_pred, the epoch in the future for which PENGUIN predicts NN fitness; also, the maximum number of epochs an NN will train.
These engine parameters, summarized in Tab. 4, constitute part of the input arguments for the prediction analyzer procedure and can be adjusted in PENGUIN to tune the overall iterative process. Figure 3 shows the trade-off for different choices of the parameters N and t on sample subsets of each of the benchmark datasets introduced in Section 4. This figure motivates our choice of N = 3 and t = 0.5, because these points are Pareto-optimal or near-Pareto-optimal on all the trade-off curves.
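For illustration, the four parameters can be collected in a small configuration object (a sketch; PENGUIN's actual configuration interface may differ). Defaults follow the paper's choices of N = 3, t = 0.5, and E = 0.5, with e_pred matching the augmented NAS's truncated-training epoch:

```python
# The four prediction-analyzer parameters in one place.
from dataclasses import dataclass

@dataclass
class AnalyzerConfig:
    N: int = 3            # number of most recent fitness predictions to consider
    t: float = 0.5        # tolerated variability to establish convergence
    E: float = 0.5        # epochs per PENGUIN iteration
    e_pred: float = 20.0  # future epoch whose fitness is predicted (20 for MENNDL)

cfg = AnalyzerConfig(e_pred=10.0)   # e.g., when augmenting EvoCNN
```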
The parameter E affects the ability of the parametric modeling to successfully approximate the NN accuracy curve. E impacts the granularity of the set of datapoints used for fitting the parametric function; the smaller E is, the more datapoints we will have. Therefore, E must meet a trade-off between two constraints: it must be as small as possible to maximize the number of datapoints to use for constructing the parametric model, and it must be large enough to avoid extrapolating local behavior to the whole curve. This motivates our selection of E = 0.5.
The parameter e_pred corresponds to the future epoch for which PENGUIN predicts NN fitness. One can set e_pred to any arbitrary value to match the epoch at which the user would like to assess NN fitness. We set the parameter to match the epoch of truncated training of the NAS that PENGUIN is augmenting. By default, each of the NAS methods evaluated in this paper (i.e., MENNDL [19], EvoCNN [12], and NSGA-Net [15]) incorporates truncated training, terminating training at 20, 10, and 25 epochs, respectively. If epoch e_pred is reached during PENGUIN's iterative process, the process terminates by outputting the actual fitness corresponding to that epoch.

Analyzing the convergence of the predictions
Algorithm 2 shows the internal procedure in our prediction analyzer. Given an NN M, the procedure takes as input the current training epoch (e), the history of fitness values and predictions (H), and the four engine configuration parameters described above. We first check whether we have reached the minimum number of epochs to assess convergence (Line 3). If the answer is no, the number of tuples in H is not sufficient to analyze whether the predictions are stable; we exit the validation and continue with the next iteration. If the answer is yes, we check for convergence of the iterative fitness predictions. If convergence is achieved prior to reaching epoch e_pred in training, then the final output fitness prediction (FIT_P) corresponds to the fitness predicted by the tailored function in the current epoch (fit_P) (Lines 5, 8). We define three conditions for reaching convergence:
• Condition 1: NN training must not have already reached the epoch, e_pred, for which to predict fitness.
• Condition 2: The most recent fitness prediction (FIT_P) must be within the range of valid fitness values. For example, if fitness is measured by validation accuracy, the accuracy prediction must be less than or equal to 100%. Otherwise, the prediction is not valid to establish convergence (Line 6). This condition ensures that we have a realistic value for FIT_P.
• Condition 3: The N most recent fitness predictions are all within the threshold t of their mean (Line 7). In other words, for each of the N most recent predictions p_i, we check whether mean − t ≤ p_i ≤ mean + t.
There are three possible outcomes based on these conditions. First, if all three conditions are true, the prediction has converged. In this case, the prediction analyzer returns the final prediction FIT_P with convergence status true (Line 8). Second, if the first condition is not satisfied, then the NN has already trained for e_pred epochs. In this case, there is no need for a fitness prediction from PENGUIN. Instead, we return as our final fitness prediction the actual fitness value at epoch e_pred, fit_V, with convergence status false, indicating that PENGUIN's prediction did not converge prior to training for e_pred epochs (Line 4). Last, if the first condition is satisfied but either the second or the third condition is not met, then the prediction analyzer procedure ends without returning a fitness prediction and with convergence status false (Line 9).
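A compact sketch of the three-condition check described above (function names are ours; fitness is measured as accuracy in percent, and the caller supplies the actual fitness fit_v for the reached-e_pred case):

```python
# Sketch of the prediction analyzer: given the history of fitness predictions,
# decide whether to emit a final prediction, the actual fitness, or nothing.

def predictions_converged(predictions, N=3, t=0.5):
    """Condition 3: the N most recent predictions lie within t of their mean."""
    if len(predictions) < N:
        return False
    recent = predictions[-N:]
    mean = sum(recent) / N
    return all(mean - t <= p <= mean + t for p in recent)

def analyze(e, e_pred, fit_v, predictions, max_fitness=100.0, N=3, t=0.5):
    """Return (value, converged): the value is None while training continues."""
    if e >= e_pred:                          # condition 1 violated:
        return fit_v, False                  # report the actual fitness instead
    latest = predictions[-1]
    if not (0.0 <= latest <= max_fitness):   # condition 2: realistic prediction
        return None, False
    if predictions_converged(predictions, N, t):   # condition 3
        return latest, True                  # converged: emit FIT_P
    return None, False                       # keep training; next iteration
```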
Based on the output of the prediction analyzer, we can establish the termination criteria for PENGUIN. If a fitness prediction is returned, PENGUIN terminates and outputs this value to the NAS. This can occur in two scenarios: either we reached convergence and the output is the fitness prediction FIT_P, or we reached e_pred and the output is fit_V, the actual fitness attained at epoch e_pred. We can distinguish these cases by inspecting the value of the convergence status. If no value is returned, then PENGUIN begins the next iteration and resumes training the NN. As an example, Fig. 4 depicts the end result of this process for one of the NNs we trained on the CIFAR-100 benchmark dataset [26]. The x-axis indicates the number of training epochs, and the y-axis indicates the fitness of the NN (accuracy in this example). The dotted red line indicates the point at which the prediction converges; this is when our engine outputs the fitness prediction and terminates the iterative training process. In this example, our engine terminates the iterative process and outputs the fitness prediction after 4 epochs. The fitness model from the final iteration is graphed, along with the fitness datapoints (validation accuracy in this example) for the NN across e_pred epochs. Note that only the fitness values from the first 4 epochs were used to construct the fitness model.

CASE STUDY
We present a case study in which we use PENGUIN to answer two questions from the perspective of a NAS plugging in PENGUIN: (i) "How accurate are the predictions?" and (ii) "What savings are gained?"

Experimental Setup
We apply PENGUIN to a set of approximately 6,000 NNs (2,000 each trained on one of three widely used benchmark datasets: CIFAR-100, Fashion MNIST, and SVHN) using a parametric function of the form f(x) = a − b^(c−x). Using the Summit supercomputer at the Oak Ridge National Laboratory, we train our set of diverse NNs across these benchmark datasets.

Benchmark Dataset Characterization
To the best of our knowledge, there is no systematic characterization of datasets used in evaluating ML methods in HPC. This makes the assessment of the generality and applicability of a method in the area of NN research a case-driven discussion [27], [28], [29], [30], [31], [32] that is also found in works tackling the problem of evolutionary NN design [33], [34]. We contribute to the discussion by characterizing three widely used datasets (i.e., CIFAR-100, Fashion MNIST, and SVHN) in terms of the diversity of their attributes. Based on this diversity, we use these benchmark datasets to evaluate our engine's capability to predict the accuracy of NNs. We identify five key attributes that demonstrate the diverse nature of these benchmark datasets: the number of samples, the number of classes, the number of image color channels, the sample size [35], and the internal balancing of the sample distribution per class, which is known to affect the accuracy of NNs [36]. Table 5 summarizes these attributes and shows a sample from each benchmark dataset. [Table 5 excerpt: sample sizes are 32×32 pixels for CIFAR-100, 28×28 for Fashion MNIST, and 32×32 for SVHN.] †SVHN also contains 531,131 samples that can be used as additional, easier training data, which we did not use in our evaluation.

CIFAR-100 (C-100): This dataset was introduced in 2009 [26] as a subset of the 80 Million Tiny Images dataset [37], aimed towards improving tasks of unsupervised training of deep generative models. It remains one of the most popular benchmark datasets in the field of computer vision due to the manageable size of the dataset, the resolution of its images, and its challenges for NN models [38], [39], [40].
Fashion MNIST (F-MNIST): This benchmark dataset [41] serves as a replacement for the original MNIST dataset comprising ten classes of handwritten digits [42]. It shares the same image and dataset size, data format, and structure of training and testing splits with MNIST, making it a popular benchmark dataset for NN models targeting computer vision problems [43], [44], [45].
SVHN: Like MNIST, SVHN [46] contains digits, but in this case they are obtained from real-world house numbers, and thus contain color information, various natural backgrounds, overlapping digits, and other distracting features. These characteristics make SVHN a more difficult benchmark dataset than MNIST, and as a result, SVHN is a very popular benchmark dataset for NNs [47].

NN Generation
In the first stage of the search, many NAS implementations generate the initial set of NNs randomly. We generate each NN with uniformly randomized parameter values from the intervals defined in Tab. 6. Our set of NNs represents the NNs such a NAS would select from the search space to explore. Each NN begins with a number of convolutional layers. We randomize kernel, stride, and padding values for each convolutional layer, as well as the number of filters. This is followed by randomized non-linear layers of the pooling, ReLU (rectified linear unit activation function), or dropout types; we add fully connected layers at the end of the network. For each NN, we generate a random boolean that determines whether or not to include any dropout layers between the fully connected layers. If the boolean is true, then we generate a value for the dropout rate for these layers. After each fully connected layer except the final one, we randomly decide whether or not to add a dropout layer. Finally, we randomize the learning rate, momentum, dampening, and weight decay. In addition, we randomize the batch size to use for training: we choose an integer uniformly between 25 and 250 and truncate the training benchmark dataset to be divisible by the batch size. Because we validate every half epoch, we need the number of samples in the truncated dataset to be divisible by twice the batch size. If this divisibility condition is not met, we re-randomize the batch size until it is. This process assures the diversity of our NN sets. As a result, we generate and train a set of approximately 2,000 NNs for each of the three benchmark datasets (i.e., C-100, F-MNIST, and SVHN).
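The batch-size randomization with the half-epoch divisibility constraint can be sketched as follows (the helper name is ours):

```python
# Draw a batch size uniformly in [25, 250], truncate the training set to a
# multiple of the batch size, and re-draw until the truncated size is also
# divisible by 2 * batch_size (required because validation happens every
# half epoch).
import random

def pick_batch_size(n_samples, lo=25, hi=250, rng=random):
    while True:
        batch = rng.randint(lo, hi)
        truncated = (n_samples // batch) * batch   # multiple of batch size
        if truncated % (2 * batch) == 0:           # divisible by twice the batch
            return batch, truncated

# Example with a seeded generator and a CIFAR-100-sized training set.
batch, truncated = pick_batch_size(50_000, rng=random.Random(7))
```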
We produce a publicly available NN dataset containing architecture descriptions of these random NNs and metadata describing each NN's accuracy and loss at each half epoch of training [48].

NN Classification
When searching the space of possible NNs for a given dataset, we deal with a wide range of parameter values that define those NNs and that affect their capability to learn the benchmark dataset. We classify our randomly generated NNs based on their learning capability. A large number of randomly generated NNs may be unable to learn; we call such NNs never-learns. Generally, never-learns are able to classify a single class from the benchmark dataset on which they are trained, and they never learn to classify more than one class. As a result, the accuracy value of never-learns depends on the number of classes in the benchmark dataset and whether or not those classes are balanced. We observe that never-learns on C-100 have final accuracy values of about 1% because C-100 contains 100 balanced classes; never-learns on F-MNIST have final accuracy values of about 10% because F-MNIST contains 10 balanced classes; and never-learns on SVHN have varying final accuracy values up to about 20% because SVHN contains unbalanced classes, and the largest class contains about 1/5 of the data (in both the training and testing sets). Such behavior is consistent with early stages of many NAS methods, when a NAS generates a large set of random NNs for exploration and many of these NNs have very low accuracy. We also observe that many NNs do not learn for several epochs but eventually do begin to learn, in some cases attaining quite high final accuracy. Because the accuracy of these NNs does not increase for many epochs, they can be incorrectly predicted to never learn. We call these NNs, which are incorrectly predicted to never learn because of a long initial learning delay, anomalies. Table 7 breaks down the generated random NNs into these categories. As noted earlier, the full set of trained NNs with many never-learns reflects the type of networks expected in early stages of NAS. As a NAS progresses, we expect to see fewer never-learns and more better-performing NNs.
To simulate later generations of NNs in a NAS, we create a subset of NNs from this full NN set, keeping all learning NNs and removing all the never-learns. Using Summit, we evaluate our engine on both of these sets (i.e., with and without never-learns) in order to see results that reflect how our engine would perform in different stages of a NAS (e.g., at the very beginning of the search and in subsequent iterations).

Parameterized Function
In our case study, NN fitness is measured by validation accuracy. We selected the parameterized function f(x) = a − b^(c−x) to use in PENGUIN's parametric modeling method. Our empirical observation indicates that during the initial phase of training, NN accuracy curves tend to be concave down and increasing, with the accuracy values approaching a horizontal asymptote; the parametric function f(x) = a − b^(c−x), with b ≥ 1, shares these properties. We choose this function because a preliminary comparison with other functions showed that this specific function provides good results across the spectrum of datasets we are considering, when fitness is measured by accuracy. Future work will study function selection for more datasets and types of fitness measurements (e.g., loss).
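The stated shape properties of this function can be verified directly; a short derivation (our own, not from the paper), for b > 1:

```latex
f(x)   = a - b^{\,c-x}
f'(x)  = (\ln b)\, b^{\,c-x} > 0          % strictly increasing
f''(x) = -(\ln b)^{2}\, b^{\,c-x} < 0     % concave down
\lim_{x \to \infty} f(x) = a              % horizontal asymptote: the fitted
                                          % parameter a is the predicted ceiling
```

Note that the fitted parameter a directly plays the role of the accuracy value the curve approaches, which is what makes extrapolation to a future epoch straightforward.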

Accuracy
When addressing the first question of our case study, "How accurate are the predictions?", we must recall that a NAS process involves generating NN models, evaluating their accuracy, selecting the best models at each step, and using these best models to inform the next generation of models (Sec. 1).
PENGUIN reports fitness predictions for all the generated NNs (Sec. 2), and a NAS can use these predictions to select the best models. In other words, it is important for PENGUIN to accurately identify the best NNs from among the set generated by a NAS, as these are used to create the next generation of NNs. Thus, in evaluating the accuracy of our predictions, we compare the ground truth best x NNs from all our generated NNs with the predicted best x NNs as identified by PENGUIN for x = 50, 100, and 150.
We use PENGUIN's fitness predictions for all the NNs to create a set of our engine's predicted best x NNs. Recall that PENGUIN predicts the fitness each NN is expected to attain in the future, at epoch e_pred (Section 3.3.1). Thus, the ground truth to compare with PENGUIN's predictions is the actual fitness of each NN at epoch e_pred. In our tests, in order to identify the ground truth best x NNs, we allow all the NNs to continue training and validating for e_pred epochs, even after PENGUIN has reported fitness predictions for them. The NNs that achieve the best validation fitness at epoch e_pred of training are the ground truth best NNs. These are also the NNs that the NAS PENGUIN is augmenting would select as the best.
In this case study, we measure the accuracy of PENGUIN's predictions for the selected parametric function using the sensitivity and specificity metrics, which measure respectively the true positive rate and true negative rate of the predicted best x NNs.

Sensitivity and Specificity
We measure the sensitivity (i.e., true positive rate) and the specificity (i.e., true negative rate) of PENGUIN's predicted best x NNs for different values of x, for both the full NN set and the subset of learning NNs only. Table 8 shows the sensitivity values of PENGUIN's predicted best x NNs; the sensitivity values are similar for both NN sets. Across all our experiments, for x values ranging from 50 to 150, the sensitivity of PENGUIN's predictions ranges between 0.56 and 0.87. In almost all of our experiments, the sensitivity values are greater than or equal to 0.70; the exception is F-MNIST when x = 50, which has sensitivity values of 0.56 and 0.58 for the full NN set and the set of learning NNs only, respectively. Table 9 shows the specificity of PENGUIN's predicted best x NNs for the full NN set and the subset of learning NNs only. The specificity values show more variance between these two NN sets because their sizes differ. Across all our experiments on the full NN set, for x values ranging from 50 to 150, the specificity of PENGUIN's predictions ranges between 0.98 and 0.99. For the subset of learning NNs only, the specificity ranges between 0.66 and 0.97.

Figure 5 complements Table 8 and the second row of Table 9 for the full NN set, depicting the accuracy predictions and actual accuracy values of all NN models for each benchmark dataset. Specifically, the figure indicates the true positives, true negatives, false positives, and false negatives resulting from our prediction of the best 100 models. PENGUIN's accuracy predictions, obtained via our parametric modeling, are on the y-axis; actual accuracy is on the x-axis. Note that PENGUIN's output prediction is the accuracy each NN is expected to attain at epoch e_pred, and the actual accuracy is given by the observed accuracy value of each NN at epoch e_pred.
In this case, we set e_pred = 20 because the value is representative of the truncated training used by cutting-edge NAS implementations like MENNDL. Throughout our case study, actual and predicted accuracy values are always compared at the same epoch; if e_pred = 10, then PENGUIN would predict the accuracy each NN is expected to attain at epoch 10, and we would compare these predictions with the actual values of the NNs at epoch 10.
In Fig. 5, the purple triangles depict true positives (i.e., NNs that are correctly predicted to be in the top 100). The grey plus signs depict true negatives (i.e., NNs that are correctly predicted not to be in the top 100). The blue circles depict false positives (i.e., NNs that are incorrectly predicted to be in the top 100). The red squares depict false negatives (i.e., NNs that are incorrectly predicted not to be in the top 100). Because the x-axis indicates actual accuracy, the most accurate NNs are furthest to the right of the figure, and the least accurate NNs are furthest to the left. We observe that the most accurate NNs from the set of ground truth best NNs are true positives, and the ground truth best NNs that are false negatives tend to be the least accurate ones, with the exception of a few anomalies. In the lower left corner of the figure, we see NNs with very low actual and predicted accuracy; these are the never-learns (i.e., NNs that never learn to classify more than a single class). Parallel to the x-axis we see the anomalies; as discussed in Section 4.1.2, these are NNs that have varying levels of actual accuracy but are predicted to have very low accuracy. They are NNs whose accuracy does not begin to increase for many epochs, and they are therefore incorrectly predicted to never learn. Part of our ongoing work involves studying ways to distinguish between NNs that never learn and these anomalies.
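Concretely, the sensitivity and specificity values reported above can be computed directly from the predicted and ground-truth top-x sets. The sketch below is our illustration with made-up fitness values, not PENGUIN's implementation.

```python
def top_x_metrics(ground_truth, predicted, x):
    """Return (sensitivity, specificity) of a predicted top-x selection.

    ground_truth, predicted: dicts mapping NN id -> fitness at e_pred
    (actual and predicted, respectively).
    """
    actual_best = set(sorted(ground_truth, key=ground_truth.get, reverse=True)[:x])
    predicted_best = set(sorted(predicted, key=predicted.get, reverse=True)[:x])

    tp = len(actual_best & predicted_best)        # correctly in top x
    fn = len(actual_best - predicted_best)        # missed top-x NNs
    fp = len(predicted_best - actual_best)        # wrongly promoted NNs
    tn = len(set(ground_truth) - actual_best) - fp

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity

# Toy example: six NNs, select the top 2.
actual = {"n1": 0.90, "n2": 0.85, "n3": 0.70, "n4": 0.60, "n5": 0.40, "n6": 0.10}
pred = {"n1": 0.88, "n2": 0.65, "n3": 0.80, "n4": 0.55, "n5": 0.35, "n6": 0.05}
sens, spec = top_x_metrics(actual, pred, x=2)  # predicted top 2 = {n1, n3}; actual = {n1, n2}
```

Here one of the two actual best NNs is recovered (sensitivity 0.5) and three of the four non-best NNs are correctly excluded (specificity 0.75).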

Gain
We answer the second question of our case study, "What savings are gained?", using three metrics (number of training epochs saved, throughput gain, and walltime speedup) for each of the three benchmark datasets. We calculate our gain using these metrics as compared to the training required to assess NN accuracy by three different state-of-the-art NAS methods (i.e., MENNDL [19], EvoCNN [12], and NSGA-Net [15]). MENNDL, EvoCNN, and NSGA-Net all use built-in truncated training for early termination of NN training: MENNDL always terminates training at 20 epochs, EvoCNN at 10 epochs, and NSGA-Net at 25 epochs. Additionally, MENNDL incorporates its own dynamic early termination criterion on top of its truncated training, terminating training earlier if the NN's minimum training loss has not decreased for at least 10 epochs. We use the built-in training termination methods of these NAS implementations as a baseline to compare against the same NAS implementations when augmented by PENGUIN. Any of the three NAS implementations, when augmented by our engine, terminates training of each NN either when PENGUIN's fitness prediction converges (as in Figure 4), avoiding the full training, or, in the worst case, at e_pred epochs, where we set e_pred equal to the number of epochs at which each NAS's existing built-in truncated training ends (i.e., 20, 10, and 25 for MENNDL, EvoCNN, and NSGA-Net, respectively).

Figure 6 depicts the distribution of NNs according to the percentage of training epochs saved by augmenting MENNDL with PENGUIN, both for the full NN set and for the subset of learning NNs only. The percentage of epochs saved is denoted on the x-axis and the percent of total NN samples on the y-axis.
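The baseline termination rules just described can be sketched as follows. The helper names are ours, and the MENNDL plateau rule is paraphrased from the description above; this is an illustrative sketch, not the actual MENNDL code.

```python
def menndl_stop_epoch(train_losses, max_epochs=20, patience=10):
    """Epoch at which MENNDL's built-in criterion would stop training:
    at max_epochs, or earlier if the minimum training loss has not
    decreased for `patience` consecutive epochs."""
    best = float("inf")
    since_improvement = 0
    for epoch, loss in enumerate(train_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return epoch
    return min(len(train_losses), max_epochs)

def truncated_stop_epoch(num_epochs_trained, budget):
    """EvoCNN and NSGA-Net use plain truncated training:
    budget = 10 (EvoCNN) or 25 (NSGA-Net)."""
    return min(num_epochs_trained, budget)
```

A PENGUIN-augmented run replaces these stop decisions with the earlier of prediction convergence or the same e_pred budget.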
The height of a rectangle denotes the percent of total samples that save the indicated portion of training epochs: taller rectangles to the right indicate that more samples save a larger percentage of epochs (i.e., there is a larger region of avoidable training for these samples). The dashed line indicates the mean percentage of training epochs saved. We observe that for the full NN set (including never-learns), across all three benchmark datasets, 60% to 75% of models save more than 80% of the training epochs that would be required by MENNDL's early termination criterion. In fact, for all three benchmark datasets, we cut the mean number of training epochs needed to evaluate the networks by more than 75%. We demonstrate respective mean savings of 82%, 78%, and 78% of the training epochs on C-100, F-MNIST, and SVHN (indicated by the dashed lines in Figs. 6a-c), as compared to MENNDL unaugmented by PENGUIN. For the set of learning NNs only, across all three benchmark datasets, more than 40% of models save over 80% of the training epochs that would be required by MENNDL's early termination criterion. Overall, for all three benchmark datasets, C-100, F-MNIST, and SVHN, we cut the mean number of training epochs needed to evaluate the networks by 63%-67%.

Similarly, we calculate the percentage of training epochs saved by augmenting EvoCNN and NSGA-Net with PENGUIN, for the same NN models and benchmark datasets as above, and for both our full NN set and the set of learning NNs only. For the full NN set (including never-learns), across all three benchmark datasets, augmenting EvoCNN with PENGUIN would save 64% to 70% of the mean training epochs needed, and augmenting NSGA-Net with PENGUIN would save 82% to 86% of the mean training epochs needed.
For the set of learning NNs only, across all three benchmark datasets, augmenting EvoCNN with PENGUIN would save 39% to 40% of the mean training epochs needed, and augmenting NSGA-Net with PENGUIN would save 70% to 74% of the mean training epochs needed. These outcomes are not presented in a figure because of space constraints.
Across both NN sets (i.e., the full NN set and the subset of learning NNs only), all three benchmark datasets, and all three NAS implementations, we measure average savings of 39% or more as compared to the training epochs needed without the use of PENGUIN. We expect to observe larger savings in the earliest stages of NAS because we encounter many never-learns and PENGUIN's predictions stabilize quickly. This is reflected in the average savings ranging between 64% and 86% across all benchmark datasets and NAS implementations on the full NN set. In the set of learning NNs only, which better represents later stages of NAS, we see average savings ranging between 39% and 74%.
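The per-NN savings behind these averages reduce to simple arithmetic over the epoch at which each NN's training stopped. The sketch below, with invented stop epochs, shows the calculation we describe.

```python
def percent_epochs_saved(stop_epochs, e_pred):
    """Per-NN percentage of training epochs saved relative to an
    e_pred-epoch budget, plus the mean across all NNs."""
    saved = [100.0 * (e_pred - s) / e_pred for s in stop_epochs]
    return saved, sum(saved) / len(saved)

# Illustrative: four NNs whose predictions converged at these epochs
# (the last never converged, so it trained the full budget),
# measured against MENNDL's 20-epoch budget.
per_nn, mean_saved = percent_epochs_saved([2, 3, 5, 20], e_pred=20)
```

An NN whose prediction stabilizes at epoch 2 saves 90% of its 20-epoch budget; one that never converges saves nothing, which is how the never-converging cases appear at 0% in Fig. 6.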

Throughput Gain
The ability of PENGUIN to predict NN model fitness early in the training process has immediate implications in NAS. As summarized in Section 4.3.1, we can anticipate reducing required training epochs by about 64%-86% in early stages of NAS and by about 39%-74% in later stages of NAS, depending on the benchmark dataset and the NAS method. The reduction in training epochs for individual NNs increases the number of networks that can be explored and evaluated using the same amount of wall time and compute resources. Exploring more network structures gives a NAS additional opportunity to find better NN models. Alternatively, if a NAS has a fixed problem size (i.e., a set number of models to explore), then one could evaluate those models using fewer computational resources.
For the full NN set on C-100 in Fig. 6a, the mean savings is about 80%, or 4/5; this means PENGUIN evaluates the models in 1/5 of the computation needed by MENNDL. This corresponds to a throughput gain of about 5×, allowing 5× the number of networks to be explored with the same computational resources. In Fig. 6d, for the learning NNs on C-100, the mean savings is about 63%; this means PENGUIN evaluates the models in 37/100 of the computation needed by MENNDL. This corresponds to a throughput gain of about 100/37, or 2.7×, allowing 2.7× the number of networks to be explored with the same computational resources. Thus, for C-100, augmenting MENNDL with PENGUIN yields a throughput gain of between 2.7× and 5×. Similarly, augmenting MENNDL with PENGUIN yields a throughput gain of between 2.7× and 4× on F-MNIST and SVHN.
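The arithmetic above is simply the reciprocal of the remaining training fraction; a short sketch (the function name is ours):

```python
def throughput_gain(mean_fraction_saved):
    """Throughput gain implied by a mean fraction of epochs saved:
    if only (1 - saved) of the epochs are needed per NN, then
    1 / (1 - saved) times as many NNs fit in the same compute budget."""
    return 1.0 / (1.0 - mean_fraction_saved)

gain_full = throughput_gain(0.80)      # full NN set on C-100: ~5x
gain_learning = throughput_gain(0.63)  # learning NNs on C-100: ~2.7x
```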
The same manner of computation demonstrates a throughput gain of between 1.6× and 3.3× for EvoCNN, across all benchmark datasets and both NN datasets, and a throughput gain of between 3.3× and 7.1× for NSGA-Net, across all benchmark datasets and both NN datasets.

Walltime Speedup
MENNDL's built-in training termination leverages truncated training in conjunction with a dynamic early termination criterion: it ends training at 20 epochs, or earlier if the loss has been stable over the past 10 epochs, making this NAS implementation one of the most effective on HPC systems [1], [4]. We put PENGUIN to the test by comparing the actual walltime of training our 6,000 NNs using MENNDL under two termination scenarios: (i) with MENNDL's built-in training termination; and (ii) with PENGUIN augmenting the termination decision. Figure 7 shows the training walltime in hours for the full NN set (Fig. 7a) and the set of learning NNs only (Fig. 7b) for the two scenarios listed above and the three benchmark datasets. When augmented by our engine, MENNDL gains speedups of up to 5.3× for the full NN set and up to 3.1× for the learning NN set compared to its dynamic built-in termination. As noted in Section 4.3.1, we expect to observe larger savings in the earliest stages of NAS, represented by the full NN set, and smaller savings in later stages of NAS, represented by the set of learning NNs only. This is consistent with our observations in Fig. 7: we see speedups of 4.3× to 5.3× on the full NN set and speedups of 2.5× to 3.1× on the set of learning NNs only.

Table 11 summarizes the gain from augmenting MENNDL with PENGUIN. The first column gives the percentage of mean epochs saved. The second column shows the theoretical speedup (i.e., throughput gain) calculated from the percentage of epochs saved; note that these estimates are conservative because of rounding down. The third column shows the actual wallclock time speedup we measured; in all cases the measured speedup is very close to the theoretical speedup. To sum up, for the full NN set, we observe theoretical speedups of 5× on C-100, 4× on F-MNIST, and 4× on SVHN, and measured speedups of 5.3× on C-100, 4.5× on F-MNIST, and 4.3× on SVHN.
For the set of learning NNs only, we observe theoretical speedup of 2.7× on C-100, F-MNIST, and SVHN, and measured speedup of 2.5× on C-100, 2.7× on F-MNIST, and 3.1× on SVHN. Our measurements empirically confirm the effectiveness of PENGUIN in drastically cutting the NN training time and consequently increasing the training throughput on HPC systems.

Applications of PENGUIN
The decoupling of search and prediction, and the flexible fitness prediction method in PENGUIN, allow us to apply this work in two directions.
First, we can plug different parametric functions into PENGUIN to handle other types of fitness measurements beyond accuracy. For example, a function based on exponential decay rather than exponential growth could be used when fitness is measured by loss. Second, we can tune PENGUIN's parametric modeling for different problems with different datasets and fitness measurements. No single pairing of NAS method and parametric function works for every dataset and fitness measurement. PENGUIN enables a user to test different combinations of NAS methods and parametric functions at small scale in order to find the best pairing for a given dataset and fitness measurement; the search can then be scaled up using that best pairing.
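One way to realize this pluggability is a registry mapping a fitness measurement to a candidate parametric function. The exponential-decay form for loss below is our illustrative assumption; only the accuracy-growth function is evaluated in our case study.

```python
import math

# Candidate parametric learning-curve models.
# Each maps (epoch, *params) -> predicted fitness.
def accuracy_growth(x, a, b, c):
    # a - b**(c - x): increases toward asymptote a (b > 1).
    return a - b ** (c - x)

def loss_decay(x, a, b, c):
    # a + b * exp(-c * x): decreases toward asymptote a (b, c > 0).
    # Illustrative form for loss-based fitness, not evaluated in the paper.
    return a + b * math.exp(-c * x)

PARAMETRIC_FUNCTIONS = {
    ("accuracy", "growth"): accuracy_growth,
    ("loss", "decay"): loss_decay,
}

def select_function(fitness_measure, shape):
    """Look up the parametric model to plug into the engine."""
    return PARAMETRIC_FUNCTIONS[(fitness_measure, shape)]
```

Because the engine only requires a callable of this shape, swapping the model for a new dataset or fitness measurement does not touch the search itself.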
Furthermore, by extending PENGUIN, we can capture edge cases (i.e., models that never learn, models with anomalies, models that are false negatives, and models whose predictions never converge). Figure 8 presents NN accuracy curves for four edge cases, together with the predictions given by PENGUIN. Specifically, Figure 8a depicts a never-learn on C-100. In this example, there is noise in the data, causing the accuracy to alternate between values slightly larger than 1 and values slightly smaller than 1, but at no point does the accuracy begin to increase steadily. This NN is correctly classified by PENGUIN as a never-learn. Figure 8b depicts an anomaly on the F-MNIST benchmark dataset. This NN is incorrectly classified by PENGUIN as a never-learn because its accuracy does not begin to increase until after 9.5 epochs, and PENGUIN stabilizes earlier (at 2 epochs). Figure 8c is an example of a false negative on C-100 (indicated by the red squares in Fig. 5a). Figure 8d shows an NN on F-MNIST for which PENGUIN's prediction never converges. In this case, the NN is not terminated early by PENGUIN and instead trains for e_pred epochs (here 20 epochs, as used for MENNDL), and the actual observed maximum accuracy is taken as the accuracy estimate for the NN. The cases where PENGUIN's predictions do not converge are the samples in Fig. 6 with 0 percent of training epochs saved. The frequency of each of these cases may be observed in Tab. 7, Fig. 6, and Fig. 5. These are edge cases that we will target in future work.
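The convergence fallback described here can be sketched as a tolerance check on the trailing predictions. The window and tolerance values below are illustrative, not PENGUIN's actual thresholds.

```python
def prediction_converged(predictions, tolerance=0.01, window=3):
    """Heuristic convergence check on successive fitness predictions:
    converged once the last `window` predictions all lie within
    `tolerance` of each other."""
    if len(predictions) < window:
        return False
    recent = predictions[-window:]
    return max(recent) - min(recent) <= tolerance

def final_fitness(predictions, observed_accuracies):
    """If the predictions converged, report the last prediction;
    otherwise fall back to the best accuracy actually observed
    over the full e_pred epochs of training."""
    if prediction_converged(predictions):
        return predictions[-1]
    return max(observed_accuracies)
```

An NN like the one in Fig. 8d would take the fallback path: its predictions never settle, so the observed maximum accuracy stands in for the prediction.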

RELATED WORKS
Recent works have proposed different prediction strategies. The two most common approaches for truncating training are to train for a fixed but significantly reduced number of epochs (e.g., 20 instead of 100s) [21], [49] or to train until the maximum accuracy or minimum loss has not improved for a user-specified amount of time (e.g., if the maximum accuracy does not improve for 5 epochs). Such methods are commonly included in NN software like Keras [50], and are built-in to many NAS implementations, including the three NAS implementations in our case study.
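Such a patience criterion, common in NN software and built into the NAS implementations in our case study, takes only a few lines. The sketch below is a generic illustration; the function name and threshold are ours.

```python
def patience_stop(val_accuracies, patience=5):
    """Return the 1-based epoch at which patience-style early stopping
    would halt: the first epoch after which the maximum validation
    accuracy has failed to improve for `patience` consecutive epochs.
    Returns None if training runs to the end of the curve."""
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best = acc
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

Unlike the parametric approach, this rule must actually observe the plateau before stopping, so it always pays for the `patience` extra epochs.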
There are a variety of works in the area of performance prediction and extrapolating NN learning curves [51], [52], [53], [54]. Some methods target hyperparameter search, which involves finding the best hyperparameter configuration for a given human-designed network but does not explore different architectures [51]. Domhan et al. [51] and Klein et al. [52] use probabilistic methods involving computationally expensive Markov Chain Monte Carlo sampling. The approach of Swersky et al. [54] automatically pauses and restarts training of models based on the predicted trajectory of the loss curves. Baker et al. [53] extract features from NNs to use in training a series of regression models to estimate fitness. There are also a variety of works that use parametric modeling of learning curves for machine learning problems.
The solution presented in this paper amplifies these approaches with the design of a prediction engine, PENGUIN, that allows the use of any parametric function to model learning curves. Furthermore, PENGUIN is fully decoupled from any NAS and can be plugged into existing NAS implementations to provide fitness models and predictions, regardless of the NAS method or target dataset.

CONCLUSIONS
This paper introduces PENGUIN, an engine that decouples NAS from fitness prediction strategies. PENGUIN increases the computational efficiency of high-performance and high-throughput NAS workflows and enables portability of fitness prediction across NAS implementations and scientific domains. We present a case study using three diverse datasets (i.e., CIFAR-100, Fashion-MNIST, and SVHN) and three NAS implementations (i.e., MENNDL, EvoCNN, and NSGA-Net). We compared the number of epochs, throughput, and walltime when these NAS methods terminate training of the NNs using their built-in truncated training vs. when augmented by PENGUIN. PENGUIN enabled a reduction in needed training epochs of 39% to 86%, depending on the NN set and the benchmark dataset used. PENGUIN increases the throughput of explored NNs by a factor of 1.6× to 7.1×. Compared to MENNDL, a cutting-edge NAS implementation on HPC systems such as the Summit supercomputer, PENGUIN achieves a walltime speedup of between 2.5× and 5.3×.
In future work we will further quantify the impact of PENGUIN for a broader range of NAS implementations, a more diverse set of parametric functions, and a larger suite of datasets.