AIPerf: Automated machine learning as an AI-HPC benchmark

The plethora of complex artificial intelligence (AI) algorithms and available high performance computing (HPC) power stimulates the convergence of AI and HPC. The expeditious development of AI components, in both the hardware and software domains, increases system heterogeneity, which raises the challenge of fair and comprehensive benchmarking. Existing HPC and AI benchmarks fail to cover the variety of heterogeneous systems while providing a simple quantitative measurement that reflects the overall performance of large clusters on AI tasks. To address these challenges, we specify the requirements of an AI-HPC system considering future scenarios and propose an end-to-end benchmark suite utilizing automated machine learning (AutoML) as a representative AI application. The extremely high computational cost and high scalability make AutoML a desirable workload candidate for an AI-HPC benchmark. We implement the algorithms in a highly efficient and parallel way to ensure automatic adaptation to various systems regarding AI accelerator memory and quantity. The benchmark is customizable in its back-end training framework and hyperparameters so as to achieve optimal performance on diverse systems. The major metric to quantify machine performance is floating-point operations per second (FLOPS), which is measured in a systematic and analytical approach. We also provide a regulated score as a complementary result to reflect hardware and software co-performance. We verify the benchmark's linear scalability on up to 16 nodes equipped with 128 GPUs in total, and evaluate its stability as well as reproducibility at discrete timestamps. The source code, specifications, and detailed procedures are publicly accessible on GitHub: https://github.com/AI-HPC-Research-Team/AIPerf.


1 Introduction

AI & HPC Convergence
AI, machine learning, and deep learning have drawn tremendous attention in recent years. Deep learning, as a sub-field of AI, was originally used to describe artificial neural networks with many hidden layers, which act like universal approximators Hartman et al. (1990). To produce the intended output, a training process LeCun et al. (2015) is required to automatically adjust the weights (parameters) that guide the network's internal numerical values. Training is essentially a multi-dimensional fitting process over a complex space, therefore computationally expensive, and can be treated as a supercomputing problem. While HPC is conventionally bound to scientific simulation and modeling problems with long-established middleware like MPI, it now involves other forms of distributed computing, especially deep learning methods, to harness the massive amount of data that grows at an unprecedented rate Yi and Loia (2019). Algorithmic advances in AI, the available computing power of HPC and growing data collections, as well as the demand for scalable and data-driven solutions, stimulate the convergence of AI and HPC machines Fox (2019). HPC systems Womble et al. (2019) need to incorporate support for diverse AI workloads on top of inconsistent accelerators and software frameworks to become AI-HPC systems with the following characteristics:
• Effective and parallel implementation of algorithms on large scale clusters.
• High bandwidth and low latency communications between distributed workers.
• Sufficient memory on AI accelerators for local data access.
• Efficient and high-speed interconnections to the network file system.
The convergence still faces new challenges; e.g., deep learning applications often repeatedly and iteratively access data across multiple nodes. Also, AI practitioners need to leverage the available parallel computing resources, with the provided software, to deliver satisfactory results on large scale clusters Thompson et al. (2020). Various hardware-accelerator innovations, from both tech giants and startups, have bridged the gap between AI and HPC. The need for real-time and large scale computation calls for a unified fashion with less heterogeneity to deal with all aspects of the system simultaneously. Unfortunately, the evolving and diverse architecture designs, mainly regarding the computing unit, storage design, and data precision, are the opposite of a mono-culture, in order to keep up with the relentless pace of the market Huerta et al. (2020). Consequently, the need for an open and reliable benchmark suite to evaluate the performance of different AI hardware & software stacks with diverse workloads emerges rapidly, as shown in Figure 1.
Figure 1: The convergence of AI and HPC with the growth of models, machines, and data, as well as potential tasks. The benchmark design should consider the whole system's diversity and comprehensively measure all critical aspects.

AI-HPC Benchmark
One approach to correct possible inconsistent performance and identify bottlenecks of AI-HPC systems is to design a robust and comprehensive benchmark that emphasizes the needs of future workloads. Hybrid HPC models with AI surrogates reveal a collection of unique and novel opportunities for scientific breakthroughs and unforeseeable discoveries, as well as business innovations and other societal benefits. HPC-enabled AI has already taken the form of neural network surrogates for computational functions, providing a rich computing environment for potentially multiple orders of magnitude of improvement to further the paradigms of scientific discovery Kurth et al. (2017); Stevens et al. (2020). Among various AI tasks, computer vision (CV) takes a major role, not only in science, like astronomy, but also in industry, like face recognition and self-driving. As AI model complexity has increased exponentially in recent years, it is critical to develop an automated and unbiased method to tackle the model searching and optimization problem. Unfortunately, current HPC and AI benchmarks do not address these trends regarding workload type, comprehensiveness, and scalability. For example, popular HPC benchmarks like HPL focus on floating-point operations (FLOPs), while I/O measurement, which is critical since deep learning workloads are typically data-intensive, is often less considered. Therefore, we choose AutoML (for CV) as the AI workload with uniform implementations and evaluate the performance on different scales of machines to verify scalability. In addition, we provide two approaches to measure AI computing ability: FLOPS, conventionally used in the HPC community, as our major score, and a complementary regulated score to reflect the hardware and software co-performance. In summary, our benchmark suite fits the following requirements of an AI-HPC benchmark:
• AutoML is a representative AI task with complex AI models and big data.
• The workload scales with systems, from small clusters to HPC.
• The benchmark is customizable and configurable for various architectures and system scales.
• The benchmark results in a simple, comprehensive, and reproducible measurement, based on open rules, to reflect both AI performance and HPC ability.
Our main contributions are the following:
• We clarify the requirements for AI-HPC benchmarks and briefly explain why existing HPC and AI benchmarks are unsuitable.
• As far as we know, we are the first to use AutoML as a benchmark for AI-HPC. We modify the algorithms in various aspects to fit the requirements.
• We establish an end-to-end benchmark suite with a highly efficient and parallel implementation.
• We propose an analytical method to calculate the FLOPS, associated with various neural networks.
• We propose a composite factor with a novel design to quantify the overall deep learning performance of heterogeneous systems.
• We evaluate our benchmark on different scales of machines with big data and validate the benchmark's linear scalability and reproducibility.
The rest of this paper is organized as follows. In Section 2, we review the existing HPC and AI benchmarks and point out their downsides. In Section 3, we explain AutoML and popular frameworks for deep learning. In Section 4, we explain why AutoML is a decent candidate for an AI-HPC benchmark workload, with the details of algorithms, implementations, and measurements. In Section 5, we evaluate our benchmark on different scales of clusters with multiple AI accelerators based on a large dataset. We summarize our work in Section 6.
2 Related Work

HPC Benchmarks
Popular HPC benchmarks include high performance LINPACK (HPL, used for the "Top500" ranking), the HPC Challenge Benchmark, and the NAS Parallel Benchmarks, etc. LINPACK stands for Linear Algebra Package, a library that solves a dense system of linear equations, at the heart of many computational science problems, using Gaussian elimination.
HPL is a version of LINPACK that is scalable and optimizable for given machines. Though it is the de facto HPC benchmark, it provides little information about I/O ability, which is a key factor for modern cluster computers; hence it is complemented by other benchmarks such as the HPC Challenge benchmark. Another popular HPC benchmark suite is the NAS Parallel Benchmarks (NPB), originally motivated by the insufficient problem sizes and impeding restrictions of traditional benchmarks like LINPACK. The HPL-AI Mixed-Precision benchmark Carson and Higham (2018) was developed to highlight the convergence of AI workloads on HPC. The motivation is the demand for a mixed-precision benchmark, as traditional HPC benchmarks require 64-bit accuracy while AI workloads typically need 32-bit or lower. The main drawback of HPL-AI is its lack of representation of AI-related calculations. Computing performance is critical to AI-HPC; however, we cannot simply use existing HPC benchmarks like HPL for several reasons:
• Conventional HPC benchmarks tend to exercise one or a few components (e.g. HPL on CPU but not I/O), while extensive testing over all relevant components is needed for AI applications to provide the benchmark as a whole.
• AI applications, especially deep learning, are often based on complex models. Simple matrix calculations are not representative enough to cover the complete ecosystem.
• Traditional HPC benchmarks do not provide AI performance measurements like accuracy.
• HPC benchmarks focus on 64-bit precision, while most AI applications achieve desired results in 32-bit or 16-bit precision.
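For contrast with the analytical approach we adopt later, HPL's score is computed in the same spirit: a fixed analytical operation count divided by wall-clock time. A minimal sketch (function names are ours, not part of any benchmark code):

```python
def hpl_flop_count(n: int) -> float:
    """Conventional operation count credited for solving a dense n x n
    linear system by Gaussian elimination (LU factorization), as used
    by HPL/Top500: 2/3 * n^3 + 2 * n^2."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2

def hpl_flops(n: int, seconds: float) -> float:
    """HPL reports FLOPS as the fixed analytical count over measured
    time, so faster execution (not a smaller count) raises the score."""
    return hpl_flop_count(n) / seconds
```

Any internal optimization is credited only through a shorter run time, since the operation count for a given problem size is fixed.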

AI Benchmarks
Fair and inclusive comparison of machine computing power for AI is not trivial. The challenges include system heterogeneity, the variety of deep learning workloads, the stochastic nature of the approaches, and the difficulty of designing simple yet comprehensive measurements. Previous AI benchmarks attempt to address these challenges by incorporating different hardware systems Dong and Kaeli (2017); Table 1 summarizes recent popular AI benchmarks. MLPerf Mattson et al. (2019), arguably the most accepted AI benchmark so far, uses time-to-accuracy (TOA) to measure the co-performance of software and hardware together. TOA is an indirect quantification of machine computing ability compared to FLOPS, which is what we choose in our benchmark. Another related work Jiang et al. (2018) only focuses on 3 scientific AI applications and has insufficient scalability. Overall, there are several specific reasons why the existing AI benchmarks are not suitable as AI-HPC benchmarks:
• Most existing AI benchmarks are composed of a variety of smaller AI tasks and frameworks, and therefore lack uniformity.
• Most existing AI benchmarks have not been evaluated on different scales of clusters with a large dataset, which is required for AI-HPC benchmarks.
• Most existing AI benchmarks do not provide simple yet informative measurements that reflect the machine computing ability on AI tasks.

3 AutoML

Developing machine learning solutions mostly relies on complex model designs, involving human expertise heavily, which can be extremely time-consuming. Inevitably, AutoML has emerged in recent years as a sub-field of machine learning Hutter et al. (2019) to generate models automatically and speed up development.
As shown in Figure 2, the AutoML pipeline contains various steps He et al. (2019). The first step is data preparation, which involves data collection and data cleaning. The second step is feature engineering, including feature selection, feature construction, and feature extraction. Although data and features lay the foundations and set the limit of AI performance, we focus on the remaining parts of the AutoML pipeline in this paper, since they do not depend on the application scenarios.
Figure 2: An overview of the AutoML pipeline. We limit our attention to model generation in this paper.
The third step is to generate machine learning models and the corresponding configurations (normally referred to as hyperparameters), which can have a significant impact on performance. The two main approaches for algorithm selection are conventional design by human experts and automated neural architecture search (NAS). Without human intervention, NAS Elsken et al. (2018) has the potential to generate novel architectures beyond imagination and has boosted performance significantly. Hyperparameter optimization (HPO) is the problem of optimizing a loss function over a complex configuration space. Besides the architecture, modern deep learning model performance is also sensitive to hyperparameters Bergstra and Bengio (2012a) like the batch size. While the two problems are preferably solved together in a uniform manner, we treat them separately to fully utilize the computing resources in a parallel manner. Finally, model evaluation measures the model performance once a candidate model is generated. The simplest method is to run inference on the test dataset. Although this is accurate, it is too expensive, since each model typically has numerous configurations. So the main motivation to use an evaluator is to provide the measurement quickly enough, to a certain degree of accuracy. In this paper, we use an early stopping strategy Pedregosa et al. (2011) that stops the training when the validation loss plateaus over epochs.
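The early stopping evaluator mentioned above can be sketched as a simple patience rule on the validation loss (an illustration; the parameter names and defaults are ours, not the benchmark's exact settings):

```python
def early_stop_index(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop: the first
    epoch where the validation loss has failed to improve by more than
    min_delta for `patience` consecutive epochs (or the last epoch)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # plateau: count epochs without progress
            if wait >= patience:
                return epoch
    return len(val_losses) - 1
```

For example, a loss curve that improves for three epochs and then flattens triggers the stop `patience` epochs after the last improvement.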
Overall, as a booming field of AI research, AutoML is an inherently computing-intensive endeavor, which makes it an excellent candidate workload for an AI-HPC benchmark. In addition, AutoML is already mature enough to rival human experts, make a real impact, and be useful for AI researchers Hutter et al. (2019). Compared to LINPACK, AutoML is a representative and end-to-end AI task containing nearly all AI-related components, regarding the primary computing operations (e.g. convolution and activation functions), precision (typically 32-bit), and workflow. Compared to MLPerf, AutoML is much more computationally expensive than individual AI tasks and is more scalable on clusters and HPC. As a result, we choose AutoML as our benchmark workload considering all these unique advantages.

Frameworks
Deep Learning Frameworks
Deep learning frameworks provide user-friendly APIs and transform programs in high-level languages into an internal representation of certain functionalities. Low-level efficient libraries, e.g. cuDNN, are invoked to execute primary operations like matrix multiplication. As stochastic as deep learning training is, multiple solutions with the desired performance exist Goodfellow et al. (2016); therefore implementations and customized setups vary while maintaining similar results. The difference is critical, as the training process is intrinsically an approximation.
An open-source framework with enough community support would be a decent candidate for building the benchmark.
TensorFlow is an open-source library for low-level numerical computation with graphs, where the nodes represent mathematical operations and the edges represent tensors (data arrays). The actual operations are written as high-performance C++ binaries with high-level Python abstractions. Keras is a high-level library wrapper that is built on top of frameworks like TensorFlow. Though Keras provides off-the-shelf models with simple architectures, it is relatively slow, inflexible, and difficult to debug. PyTorch supports complex architectures and uses dynamic computation graphs, meaning the network behavior can be modified at run-time, which facilitates training convergence. This feature is not particularly beneficial for benchmarking purposes. The "Pythonic" nature of PyTorch makes it convenient to code but less efficient to run. After carefully comparing different frameworks Wang et al. (2019b); Nguyen et al. (2019), we choose TensorFlow in our benchmark evaluation for the following reasons:
• TensorFlow is the most popular open-source deep learning framework so far, with a large and active community supported by Google for quick updates and frequent releases.
• TensorFlow is efficient, user-friendly, and easy-to-debug (with TensorBoard) regarding the numerical computations for both research and deployment.
• TensorFlow supports various systems with high performance and scalability.

AutoML Frameworks
We build on NNI (Neural Network Intelligence), an open-source AutoML toolkit. One key feature is its rich collection of algorithms to generate neural architectures and optimize hyperparameters, as well as a simple interface for more user-defined algorithms. Another key feature is its parallel design to efficiently leverage large scale clusters with training platforms like Kubernetes. NNI is also user-friendly, as it provides a command-line tool as well as a Web-UI to manage training processes. Most other frameworks like TPOT focus on AutoML pipeline optimization, especially data pre-processing and feature engineering. In this paper, we focus on model generation among all parts of the AutoML pipeline; therefore we choose to build our own benchmark suite on top of NNI.

Neural Architecture Search
Jin et al. (2019) propose an open-source framework, called Auto-Keras, to perform network morphing guided by Bayesian optimization. Though every method has its own advantages, we choose the network morphism implementation in Wei et al. (2016) as our baseline for developing the benchmark workload. We choose a convolutional neural network based architecture (a residual network He et al. (2016) in our case) as the initial model to address the ubiquitous CV tasks in AI, as well as the most promising workload genre on AI-HPC. We modify the morphism so that each morphing step adds a whole block (convolutional layer, batch normalization Ioffe and Szegedy (2015), and activation function together) instead of just one layer. In addition, we adapt this implementation of network morphism for NAS to suit benchmarking in a parallel and distributed way. The details are explained later in this paper.
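The block-wise morphing step described above can be sketched with architectures represented as plain lists of layer specs (an illustration of the idea, not the actual NNI/network-morphism code; the spec format is ours):

```python
import copy
import random

def morph_add_block(arch, position=None, kernel=3, rng=None):
    """One morphing step as adapted in our benchmark: insert a whole
    block (Conv -> BatchNorm -> ReLU) after an existing convolutional
    layer, rather than a single layer. The channel count is copied from
    the preceding conv so the tensor shapes still match."""
    rng = rng or random.Random(0)
    arch = copy.deepcopy(arch)  # never mutate the stored parent model
    convs = [i for i, layer in enumerate(arch) if layer["type"] == "conv"]
    i = position if position is not None else rng.choice(convs)
    ch = arch[i]["channels"]
    block = [
        {"type": "conv", "channels": ch, "kernel": kernel},
        {"type": "batchnorm"},
        {"type": "relu"},
    ]
    return arch[: i + 1] + block + arch[i + 1 :]
```

Each call produces a deeper child architecture while leaving the parent spec (kept in the historical model list) untouched.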

Hyperparameter Optimization
HPO problems can be viewed as the identification of optimal model configurations from all possible candidates. Similar to NAS, HPO has three elements: search space, search approach, and evaluation method. In our case, the search space is defined by the adjustable hyperparameters, which are categorized into two groups. The first group contains the ones that are influential on final accuracy and GPU FLOPS, including batch size and kernel size. The second group contains the ones that are generally chosen based on an understanding of performance, like the optimizer. In our benchmark design, we only search over the first group and choose heuristically better options for the second group to reduce the search cost. We use the stochastic gradient descent method with momentum Qian (1999) as the optimizer, since it requires less memory and is more efficient. We optimize the batch size (see Appendix A) based on the AI accelerator utilization, memory, and validation accuracy. Various search approaches can be applied to select the best hyperparameter combinations, including grid search Larochelle et al. (2007), random search Bergstra and Bengio (2012a), Bayesian optimization Bergstra et al. (2011), and heuristic search like evolutionary methods Real et al. (2017a), etc. We performed tests on the different optimization methods and compared the validation accuracy on the test dataset. The results of multiple experiments on CIFAR10 (see Appendix A) show that Bayesian optimization (TPE) outperforms the other methods; therefore we use it to optimize the combination of batch size and kernel size. Due to the parallel implementation design, HPO is performed separately from NAS in our benchmark.
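To make the three HPO elements concrete, the sketch below shows a two-hyperparameter search space and an exhaustive grid-search baseline over it. The benchmark itself uses TPE, and the space values here are illustrative, not our exact configuration:

```python
import itertools

# Hypothetical search space mirroring the two hyperparameters we tune.
SEARCH_SPACE = {"batch_size": [64, 128, 256, 512], "kernel_size": [3, 5, 7]}

def grid_search(objective, space=SEARCH_SPACE):
    """Exhaustive baseline: evaluate every (batch_size, kernel_size)
    combination with a user-supplied objective (e.g. validation accuracy
    from a short training run) and return the best configuration.
    Bayesian optimization (TPE) samples the same space more economically."""
    keys = sorted(space)
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The evaluation method (the `objective`) is deliberately left abstract: in the benchmark it is the early-stopped validation accuracy of a candidate model.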

Framework
Not all operations in the AutoML pipeline run on AI accelerators, e.g. data preparation and disk I/O. With the improvement of GPU performance, the CPU and disk gradually become bottlenecks. Consequently, the GPU idles, which results in lower FLOPS. To balance the workloads in the pipeline and fully utilize all computing resources, we need to effectively parallelize the most computationally expensive parts Peteiro-Barral and Guijarro-Berdiñas (2013).
As mentioned, we choose NNI (V1.5) as a baseline to adapt into our benchmark suite. The original NNI framework is implemented on a "master-slave" architecture and performs the NAS and HPO on the master server, which becomes the bottleneck for large clusters. We modify the framework in various aspects, shown in Figure 3, including utilizing slave node CPUs efficiently and in parallel to generate new architectures (which are then stored in a buffer), performing training of candidate architectures from the buffer on slave nodes separately and asynchronously, and implementing parallel training on all slave nodes to speed up the process. We utilize data parallelism Abadi et al. (2016) with the NVIDIA Collective Communications Library (NCCL) so that each worker uses one partition of the batch for training. The synchronous strategy lets all workers train on different partitions of data and produce individual gradients, which are then aggregated together at each step. We summarize the workload procedure as follows:
• The user accesses the master node through Secure Shell (SSH), collects information about slave nodes, and creates a SLURM configuration script.
• The master dispatches benchmark workloads with SLURM, corresponding to the requested and available resources, to slave nodes in a parallel and asynchronous way.
• The slave nodes receive the workloads and perform searching and training in parallel.
• The CPUs on slave nodes search for new neural architectures based on the rank of models in the historical model list, which contains detailed model information and accuracy on the test dataset, then store the architectures in the buffer (in a network file system, NFS) for later training.
• The GPUs on slave nodes load the candidate architecture and data from NFS, utilize data parallelism to train with hyperparameter tuning, and then store the results in the historical model list.
• The run terminates once a condition is satisfied (e.g. reaching a user-defined time or accuracy). The final results (score, achieved error, and regulated score) are automatically calculated based on the recorded metrics and then reported.
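The synchronous data-parallel training used on the slave nodes can be illustrated with a scalar model: each worker computes a gradient on its own partition, the gradients are averaged (as an all-reduce would do), and every replica applies the identical update. This is a toy sketch, not the NCCL-based implementation:

```python
def local_gradient(w, xs, ys):
    """Gradient of the mean squared error 0.5 * (w*x - y)^2 over one
    data partition, for a scalar linear model y = w*x."""
    n = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n

def synchronous_step(w, partitions, lr=0.1):
    """One synchronous data-parallel step: every worker computes a
    gradient on its own partition, the gradients are aggregated by
    averaging (the role of the all-reduce), and all replicas apply the
    same update, so the weights stay identical across workers."""
    grads = [local_gradient(w, xs, ys) for xs, ys in partitions]
    g = sum(grads) / len(grads)
    return w - lr * g
```

With equal-sized partitions the averaged gradient equals the full-batch gradient, which is why the synchronous strategy preserves the single-worker training trajectory.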

Measurement
FLOPS is the most cited performance metric to reflect the overall computing ability for benchmarking HPC as yet.
Our benchmark utilizes FLOPS as the major score to directly describe the computing power of AI accelerators. Toolkits like the NVIDIA profiling tools (nvprof) can record the executed operations by kernel replay, which is exceptionally slow. This method is also hardware-specific and cannot count the operations needed in a uniform way across various platforms. Inspired by LINPACK, we treat the operation counting as a mathematical problem and calculate the counts needed, without optimization, to finish the complex computation in the training and validation processes. For given data and a model with specific hyperparameters, the theoretical number of operations needed is predetermined. Since we require all computation to be conducted with floating points of at least 16-bit precision, we use operation counts and FLOPs interchangeably in this paper. If the hardware or software has any special optimization, the actual operation count is reduced or the execution is faster; since the analytical count is fixed, this eventually results in a higher FLOPS.
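The resulting major score is then simply the predetermined analytical operation counts of all completed training and validation workloads, divided by the measured wall-clock time (a sketch; the function name is ours):

```python
def benchmark_flops(op_counts, elapsed_seconds):
    """Major score: sum of the analytically predetermined operation
    counts of every workload completed during the run, divided by the
    measured wall-clock time. Optimizations shrink the time, not the
    fixed counts, so they surface as a higher FLOPS."""
    return sum(op_counts) / elapsed_seconds
```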
To calculate the operations analytically and precisely, we need to understand the training and validation processes. Deep learning libraries like TensorFlow and PyTorch use computational graphs to represent the computations of the training or validation process of neural networks. A computational graph is a directed acyclic graph where nodes represent variables or operations, and edges represent function arguments (data dependencies). Each computation is essentially a node, so that variables feed values into operations and operations feed their outputs into other operations. Computational graphs can compose complex models from simple functions and enable automatic differentiation to train neural networks. Backpropagation Hecht-Nielsen (1992) is a reverse-mode automatic differentiation Baydin et al. (2017) which applies the chain rule efficiently and recursively to compute the gradients of inputs, parameters, and other intermediates along computational graphs. As shown in Algorithm 1, backpropagation has two parts: the forward pass (FP), which computes the results of operations and saves the intermediate values needed for gradient computation in memory, and the backward pass (BP), which applies the chain rule to compute the gradients of the loss function with respect to the inputs (multiplying Jacobian matrices by gradients).
Algorithm 1 Backpropagation Hecht-Nielsen (1992)
Forward Pass:
1. Define the computational graph where each node represents a variable (parameters and intermediates).
2. Visit each node in topological order to compute the variables with the corresponding operations, and store the values at the nodes.
Backward Pass:
3. Initialize the loss gradient dL/dy and all local partial derivatives dy/dx_i.
4. Visit each node in reverse topological order to compute the loss gradients w.r.t. the local variables with the chain rule: dL/dx_i = dL/dy × dy/dx_i.
Return: dL/dx_i for all variables.
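Algorithm 1 can be realized in a few lines of scalar reverse-mode automatic differentiation; the node class and the two-operator set below are a minimal illustration, not a production autodiff engine:

```python
class Node:
    """A node in a computational graph: holds its value, its parents,
    and the local partial derivative w.r.t. each parent (stored during
    the forward pass, as Algorithm 1 prescribes)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)  # pairs (parent, d(self)/d(parent))
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(output):
    """Backward pass: visit nodes in reverse topological order and apply
    the chain rule dL/dx_i = dL/dy * dy/dx_i, accumulating gradients."""
    order, seen = [], set()
    def topo(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for parent, _ in n.parents:
            topo(parent)
        order.append(n)
    topo(output)
    output.grad = 1.0  # initialize dL/dL = 1
    for n in reversed(order):
        for parent, local in n.parents:
            parent.grad += n.grad * local
```

For L = x·w + b with x = 3, w = 2, b = 1, the backward pass yields dL/dx = 2, dL/dw = 3, and dL/db = 1.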
The total computing operation count is the sum of those in the FP and BP, which includes the operations to calculate the gradients and the operations to update the parameters (with gradient descent). Most computations in neural networks are matrix multiplications, i.e. dot products; a dot product of length n takes n multiply-accumulate (MACC) operations, corresponding to roughly 2n operations. The gradient descent procedure can be described as repeating θ_i := θ_i − α × ∂L/∂θ_i until convergence, so the operations needed are equivalent to one MACC per parameter in one BP. We break down the original and morphed models into several components (layers) and analytically compute the operation count of each layer in the FP, as listed in Table 2. The detailed descriptions of each layer are in He et al. (2016); Ioffe and Szegedy (2015).
Table 2: The analytical operation counts of each layer (per image) in the FP. For the convolutional layer, the input feature map dimension is H × W × C_i and the kernel (filter) size is K × K. For the dense layer, the input size is C_i and the output size is C_o. Following the convention in Huss and Pennline (1987), the operation weight of a MACC is 2, the weight of add/subtract/multiply/comparison is 1, the weight of divide/sqrt is 4, and the weight of a special operation like exponential is 8. The operation count is only an approximation.
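Under Table 2's convention (a MACC weighted as 2 operations), the per-image FP counts for the two dominant layer types can be sketched as follows, assuming stride 1 and an H × W output map (the helper names are ours):

```python
def conv_fp_ops(h, w, c_in, c_out, k):
    """FP operations of a KxK convolution producing C_out maps of size
    HxW from a C_in-channel input: one dot product of length K*K*C_in
    per output element, each MACC weighted as 2 operations."""
    return 2 * k * k * c_in * c_out * h * w

def dense_fp_ops(c_in, c_out):
    """FP operations of a dense layer: C_out dot products of length C_in."""
    return 2 * c_in * c_out
```

Summing these per-layer counts over a concrete architecture and the number of processed images gives the predetermined FP operation total for a workload.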
The analytical computation of operation counts is more complicated in the BP process. The convolution in the FP can be described as

O(i, j) = Σ_m Σ_n X(i − m, j − n) × F(m, n),

where O(i, j) is the output, X(i − m, j − n) is the input, and F(m, n) is the filter (kernel). The partial derivatives with respect to the local parameters (∂O/∂F) and local intermediates (∂O/∂X) can be easily derived and are used in the gradient calculation. Applying the chain rule, we obtain the parameters' gradients and the intermediates' gradients by multiplying the loss gradients with the local gradients, as in Equation 1:

∂L/∂F = ∂L/∂O × ∂O/∂F,  ∂L/∂X = ∂L/∂O × ∂O/∂X.  (1)

By substituting the derivatives (∂O/∂F and ∂O/∂X), we can express the backpropagation as Equation 2:

∂L/∂F(m, n) = Σ_i Σ_j ∂L/∂O(i, j) × X(i − m, j − n),  ∂L/∂X(i, j) = Σ_m Σ_n ∂L/∂O(i + m, j + n) × F(m, n).  (2)

Therefore, the total number of operations needed to calculate all gradients is roughly twice that in the FP. The total parameter count of a convolutional layer (without bias) is K × K × C_i × C_o, so the operations needed to update all parameters with the gradient descent method are 2 × K × K × C_i × C_o. Considering all steps, we obtain the total operations in the BP shown in Table 3. Since K, C_i, and C_o are typically small values in convolutional layers, the total operations in the BP are roughly twice those in the FP.
For a dense layer Y = W^T X + B, the intermediates' gradients can be obtained by multiplying the loss gradients (∂L/∂Y) with the Jacobian matrix of the intermediates (∂Y/∂X). Similarly, the weights' gradients are ∂L/∂W = ∂L/∂Y × ∂Y/∂W. In both cases, the operations needed are the same as those in the FP. The bias gradient is ∂L/∂B = ∂L/∂Y × ∂Y/∂B = ∂L/∂Y since ∂Y/∂B = 1, therefore requiring no extra operations. The total parameter count of a dense layer (with bias) is (C_i + 1) × C_o, and the total operations needed in the BP of the dense layer are shown in Table 3. Unlike the convolutional layer, the operations of the dense layer in the BP are more than triple those in the FP. The operations in the BP of the remaining layers, including batch normalization, the activation function (ReLU), the element-wise add layer, max-pooling, global-pooling, and the softmax layer, are all negligible for practical purposes. We confirmed our analytical method by comparing the results for ResNet-50 on ImageNet with the TensorFlow profiler (which only counts operations in the FP) and the NVIDIA profiling tools (which count operations in both the FP and BP). In our analytical method, we do not consider any hardware or software optimization that would reduce these counts. The operation count from this analytical approach is only related to the neural architecture, the hyperparameter configuration, and the data (like image resolution). Optimizations that result in fewer actual operations will speed up the training or validation processes and therefore yield a higher final FLOPS. The details of other FLOPS measurement approaches and comparisons are elaborated in Appendix B.
Table 3: The analytical operation counts of each layer (per image) in the BP. The meanings of the symbols are the same as in Table 2. The total operations needed for calculating the gradients and for updating the parameters are summed.
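Combining the derivation above, the per-image BP counts of Table 3 can be sketched as roughly twice the FP count for the gradient computation plus one MACC (weight 2) per parameter for the gradient-descent update; the output spatial size is again assumed to be H × W and the helper names are ours:

```python
def conv_bp_ops(h, w, c_in, c_out, k):
    """BP operations of a convolutional layer: roughly twice the FP
    count for the parameter and intermediate gradients, plus
    2*K*K*C_in*C_out for updating the K*K*C_in*C_out parameters."""
    fp = 2 * k * k * c_in * c_out * h * w
    return 2 * fp + 2 * k * k * c_in * c_out

def dense_bp_ops(c_in, c_out):
    """BP operations of a dense layer: same-as-FP counts for both the
    intermediate and the weight gradients, plus the update of the
    (C_in + 1)*C_out parameters, giving more than triple the FP count."""
    fp = 2 * c_in * c_out
    return 2 * fp + 2 * (c_in + 1) * c_out
```

Adding the FP and BP counts of every layer yields the total predetermined operation count of one training step per image.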
Measuring AI-HPC quantitatively is not trivial due to the diversity, and sometimes even conflict, of workloads and metrics. One single metric like FLOPS alone is not sufficient to reflect the AI-HPC computation capabilities considering hardware as well as software. For example, the data parallelism algorithm that is frequently applied in large scale AI tasks and distributed machine learning speeds up the whole process at the cost of lower AI accelerator utilization and FLOPS. Though appearing different, some metrics do reflect common characteristics of the system; e.g. an efficient AI-HPC would perform more computation and achieve higher accuracy in less time. While one can present all relevant metrics informatively, we provide a regulated score considering the general model performance (error on the validation dataset) as a complementary result, besides the major score, to quantitatively measure both AI performance and HPC computing power simultaneously. The empirical results Goodfellow et al. (2016) show that accuracy increases monotonically and then plateaus over time. In other words, the error (1 − accuracy) decreases ever more slowly over time. We would like to compensate for this effect with an increasing rate of change of the regulated score. Therefore the absolute value of the partial derivative of the regulated score with respect to the error should increase with decreasing error. On the other hand, the partial derivative of the regulated score with respect to FLOPS should be independent of FLOPS, so that computation contributes to the regulated score uniformly. According to the above conditions, we design our regulated score as Equation 3:

Score_regulated = −FLOPS × ln(Error),  (3)

where Error ∈ (0, 1) and the negative sign keeps −ln(Error) positive. Consequently, the regulated score increases faster with lower error and increases linearly with FLOPS.
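Taking Equation 3 in the form Score = −FLOPS × ln(Error), which is our reading of the stated conditions (∂Score/∂FLOPS = −ln(Error) is independent of FLOPS, and |∂Score/∂Error| = FLOPS/Error grows as the error shrinks), the regulated score can be computed as:

```python
import math

def regulated_score(flops, error):
    """Complementary regulated score: linear in FLOPS, and since
    Error is in (0, 1), the negative sign makes -ln(Error) positive,
    so the score grows ever faster as the validation error shrinks."""
    assert 0.0 < error < 1.0
    return -flops * math.log(error)
```

For a fixed FLOPS, halving the error adds a fixed increment (FLOPS × ln 2) to the score, compensating for the diminishing returns of accuracy over time.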

Fixed and Customizable Configuration
There are several rules and customizable configurations in our benchmark for a fair comparison. First, the benchmark should run on a "master-slave" architecture. The master is deployed on a capable server without any AI accelerator and collects all results from slave nodes. A slave node is composed of one or multiple servers with AI accelerators and can be deployed with or without a container environment. Both scale-up (multiple AI accelerators on each slave node) and scale-out (one AI accelerator on each slave node) configurations are supported. Second, the algorithms and search space used for AutoML are fixed, i.e. network morphism for NAS and Bayesian optimization for HPO, with the aforementioned operations and hyperparameters. The HPO only starts at the fifth round of training on each slave node since the earlier rounds are trained insufficiently, which is also referred to as the warm-up process in this paper. We start with 10 epochs for the first round of model training, then add 20 epochs for each subsequent round until reaching 90 epochs in the fifth round. A predicted accuracy (see Appendix C), instead of the actual one, is used for these earlier models. In the warm-up process, there is a maximum allowed training epoch and a patience, which is the number of epochs to wait before early stopping if no progress is made on the validation dataset. Third, the dataset is fixed to be ImageNet Deng et al. (2009) as it is a widely accepted large dataset for benchmarking CV models. The training and validation datasets have 1281167 and 50000 224*224 RGB images, respectively. The data can be formatted in the way most suited to the framework. For example, data loading is more efficient for TensorFlow if the data is in TFRecord format. The back-end deep learning framework is not limited, though we suggest TensorFlow as mentioned. Other hyperparameter configurations, including batch size, optimizer, learning rate, and loss function, are optimized by separate experiments but also customizable. Fourth, the termination rule is reaching the user-defined running time. Though customizable, we suggest a running time longer than 6 hours when using NVIDIA V100 GPUs so that the results converge. The precision used in the calculation must be 16-bit floating point or higher, with a maximum model error of 35%, for the results to be valid.
Once the benchmarking process finishes, the data analysis toolkit runs automatically to calculate the score and the regulated score from all the recorded information and log files, and then creates a report. The summarized configurations are shown in Table 5, where the suggested setups or values are determined either empirically or experimentally.
Table 5: Fixed and customizable configurations. Similar to the "pencil-and-paper" manner Bailey et al. (1991), we keep the deep learning framework and most hyperparameters open to further optimization. This partially relieves the performance dependency on manual designs and makes the benchmark more independent of the software part of the system.
The evaluation results demonstrate the robust performance on various scales of machines (GPUs), as shown in Figure 4, Figure 5, and Figure 6. The benchmark score scales linearly with machines to reflect the general AI computing ability. The achievable error of the automatically generated models is limited by the GPU time consumed in our evaluation. Previous AutoML work like Zoph and Le (2016) searches 12800 neural architectures before finding the optimal one, compared to 96 architectures in our largest scale of evaluation with 16 nodes in 12 hours. The potential of AutoML can only be appreciated with longer search time on more machines, as it is tremendously compute-hungry. The regulated score is essentially FLOPS multiplied by the model performance as a coefficient. For AI systems of the same machine scale but with different software optimizations, the regulated score can reflect the hardware and software co-performance, therefore we also provide it as a complementary result. More details of the evaluation performance are explained in Appendix D.

Conclusion
The rise of the convergence of AI and HPC reveals new challenges in benchmarking state-of-the-art and future large scale clusters. We review current popular HPC and AI benchmarks and qualitatively analyze why they do not address all the challenges. Considering a variety of AI-HPC workloads, we specify the benchmark requirements and choose AutoML as the representative AI application. We build an end-to-end benchmark suite with a highly efficient and parallel implementation. We also propose an analytical method, independent of software optimizations, to estimate the FLOPS needed to train and validate an AI model. We provide PFLOPS as the benchmark's major score, complemented by a regulated score considering the achievable error of generated models, to quantitatively measure the machine computing power on AI workloads. We evaluate the benchmark on different scales of machines with a large dataset and verify its linear scalability and stability. Our future work includes developing more efficient algorithms and implementations for parallel training and achieving higher machine utilization and performance.

Acknowledgment
We thank the Peng Cheng Lab AI research center for providing the evaluation resources. We also appreciate the valuable advice from Yong Dou and Jun Long, and the technical support from Hongji Chen, Haiyu Fu, and Ya Ouyang.

A Appendix A. HPO Specifications
The batch size, which is the number of samples used in each internal updating procedure, impacts the convergence speed and stability. The choice of batch size is often a compromise among GPU memory, computational speed-up, and model performance. We evaluate different batch sizes regarding GPU utilization, GPU memory consumption, and validation accuracy on ImageNet. The results in Figure 7a show 448 to be a slightly better choice considering all three factors.
For HPO method selection, we design a toy experiment based on the CIFAR10 dataset. We use one GPU and run our benchmark workload for 48 hours. The candidate methods include: the evolutionary method Real et al. (2017a), grid search Larochelle et al. (2007), random search Bergstra and Bengio (2012a), and the tree-structured Parzen estimator (TPE) Bergstra et al. (2011). The search range is consistent for all methods, while grid search uses discrete search values and the other methods use continuous search values. We define the search space of adjustable hyperparameters empirically based on the convergence speed and final accuracy. To be specific, the ranges of batch size and kernel size are [384, 512] (for V100) and [2, 5], respectively. The HPO is only applied after the warm-up process, where the epoch limit goes from 10 to 60 with a step of 10 for each round. The performance comparison in Figure 7b shows that the TPE method results in slightly better accuracy.
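As a point of reference for the baseline methods, a random-search loop over the stated ranges can be sketched as follows; the mock objective below is an illustrative stand-in for an actual training run, and TPE would replace the uniform sampler with a tree-structured density model:

```python
import random

def random_search_hpo(objective, n_trials=20, seed=0):
    """Random-search baseline over the toy experiment's search space:
    batch size in [384, 512] and kernel size in [2, 5].

    `objective` maps a configuration to validation accuracy (higher is
    better); in the real experiment it is one training run per trial.
    """
    rng = random.Random(seed)
    best_cfg, best_acc = None, float("-inf")
    for _ in range(n_trials):
        cfg = {"batch_size": rng.randint(384, 512),
               "kernel_size": rng.randint(2, 5)}
        acc = objective(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Mock objective standing in for an actual training run (illustrative only).
def mock_objective(cfg):
    return 1.0 - abs(cfg["batch_size"] - 448) / 448 - abs(cfg["kernel_size"] - 3) / 10

best_cfg, best_acc = random_search_hpo(mock_objective)
```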

B Appendix B. FLOPS Comparison
We compare our analytical approach of operation counting with the TensorFlow profiler (tf.profiler) and the NVIDIA profiling tool (nvprof). The tf.profiler can only count operations in the FP. The nvprof tool can trace GPU activity and collect metrics and event counts during the benchmark run (mainly the training and validation processes). It uses kernel replay to gather all requested profile data, which in our case is the operation counts of adds, multiplies, multiply-accumulates, and special operations in 32-bit precision. The overhead of nvprof is extremely high, therefore we need an approach to speed up the process. Fortunately, the iterative nature of deep learning computation means that we can use a small data sample for profiling, see Equation 4. This is only an approximation since the operation counts vary with hyperparameters like kernel size, which change dynamically in the HPO procedure. We keep the entire hyperparameter configuration, as well as other specifications like the data parallelism technique, consistent to generate accurate measurements. We use a partition of data (16 images) since the profiling process is prohibitively expensive on the entire dataset. A critical detail is the FLOPs from the initialization, which do not scale with the data. An example of how we calculate the total FLOPs for training and validation is given below. For the operations in the FP, the results from the three approaches are consistent in training as well as validation. For the operations in the BP, the results from nvprof and the analytical method are also consistent. Table 9 shows that the counting results from nvprof vary with batch size. The operation count from nvprof does not increase linearly with the batch size, and the acceleration ratio plateaus once the batch size is larger than 16, in both FP and BP. As mentioned in Section 4.4, this optimization on NVIDIA GPUs results in higher final FLOPS since the actual number of operations performed in the process is smaller than the operation count from the analytical approach.

C Appendix C. Accuracy Prediction
In the warm-up process (the initial 4 rounds of training), we limit the epochs from 20 for the first round to 50 for the fourth round. Since training on the ImageNet dataset typically converges after 60 epochs, we need to predict the achievable accuracy of the insufficiently trained model. We use a logarithmic function for an ordinary least squares regression and the Root Mean Square Error (RMSE) to estimate the standard deviation of the data and the goodness of fit. The estimated accuracy, shown in Figure 8, is the fitted value at 60 epochs minus twice the RMSE, for a conservative prediction.
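A minimal version of this fit, assuming a model of the form acc ≈ a·ln(epoch) + b (the function name is ours; the benchmark's implementation may differ in detail):

```python
import math

def predict_accuracy(epochs, accs, target_epoch=60):
    """Conservative accuracy prediction for an insufficiently trained model.

    Fits acc ~ a * ln(epoch) + b by ordinary least squares, then returns
    the fitted value at `target_epoch` minus twice the RMSE of the fit.
    """
    xs = [math.log(e) for e in epochs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(accs) / n
    # Closed-form OLS slope and intercept for a single regressor.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, accs))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    rmse = math.sqrt(sum((y - (a * x + b)) ** 2
                         for x, y in zip(xs, accs)) / n)
    return a * math.log(target_epoch) + b - 2 * rmse
```

Feeding it the per-epoch validation accuracies observed during a warm-up round yields the conservative estimate used in place of the actual accuracy.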

D Appendix D. Evaluations Performance
We use the NVIDIA System Management Interface (nvidia-smi) to track GPU utilization. A toolkit is developed to extract relevant up-to-date information and perform the calculation based on a user-defined sampling interval. To monitor the running processes and ensure stability and robustness, we compare the GPU utilization, which reflects the percentage of time during which one or more kernels are active, and the corresponding standard deviation across GPUs, shown in Figure 9. Similar results are shown in Figure 10 for the GPU memory. We also show CPU utilization and main memory consumption, each with the corresponding standard deviation among nodes, in Figure 11 and Figure 12. The benchmark workload (AutoML) is GPU intensive, therefore the CPU and memory usage are moderate.
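A sketch of one such sampling step, using the standard nvidia-smi query flags (our toolkit's exact fields and interval handling may differ):

```python
import statistics
import subprocess

# Standard nvidia-smi query flags; the exact fields queried by our
# toolkit may differ.
NVSMI_CMD = ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Turn one nvidia-smi CSV sample (one line per GPU) into the mean
    and population standard deviation of per-GPU utilization."""
    utils = []
    for line in csv_text.strip().splitlines():
        util, _mem_mib = (field.strip() for field in line.split(","))
        utils.append(float(util))
    return statistics.mean(utils), statistics.pstdev(utils)

def sample_once():
    """One sampling step, to be run on every node at the chosen interval."""
    out = subprocess.run(NVSMI_CMD, capture_output=True,
                         text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Repeating `sample_once` at a fixed interval and aggregating per node yields the curves and standard deviations plotted in Figures 9 and 10.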
Various work has been done to develop AutoML frameworks and tools Truong et al. (2019); Zöller and Huber (2019) that are user-friendly and easy to deploy, like Neural Network Intelligence (NNI), the Tree-based Pipeline Optimization Tool (TPOT), and auto-sklearn. NNI is a popular open-source toolkit in the AutoML community, which automates the deep learning model design process including neural architecture generation and hyperparameter optimization.

Figure 3 :
Figure 3: Schematic diagram of the benchmark workload. The workflow details are explained in the main text.
Figure 4: The benchmark score (in PFLOPS) over time with a one-hour sampling interval. We report the average values, which are calculated from results between 6 hours and 12 hours (after the initial warm-up phase). The score is stable and scales linearly with the number of machines (GPUs).

Figure 5 :
Figure 5: The achievable error of generated models over time with a one-hour sampling interval. The sluggishness of the achievable validation error originates from the limitation of the generated neural architectures and the insufficient HPO process (we only perform one round of HPO for each architecture).

Figure 6 :
Figure 6: The regulated score (in PFLOPS) over time with a one-hour sampling interval. It is essentially stable after the initial warm-up phase and we report the average value in the stable phase. The regulated score scales linearly with the number of GPUs and reflects the co-performance of hardware and software in the system.
Figure 7 :
Figure 7: The comparison of (a) batch size and (b) HPO method performance on CIFAR10.

FLOPS = Total FLOPs / Total time = (FLOPs from sample data × Total data size) / (Sample data size × Total time) (4)

• Profile FLOPs with one partition of training and validation: x(FLOPs) = init.(FLOPs) + (training(FLOPs) + validation(FLOPs))
• Profile FLOPs with two partitions of training and validation: y(FLOPs) = init.(FLOPs) + 2 × (training(FLOPs) + validation(FLOPs))
• Profile FLOPs with one partition of training only: z(FLOPs) = init.(FLOPs) + training(FLOPs)
• Solve for init.(FLOPs), training(FLOPs), and validation(FLOPs).
• Scale the results to the entire dataset of training and validation: Total FLOPs = init.(FLOPs) + [(training(FLOPs) × training batch num.) + (validation(FLOPs) × validation batch num.)] × epochs

We compare the operation counts of ResNet-50 layers on ImageNet with different approaches, as shown in Table 8.
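The three profiling runs above form a small linear system in init.(FLOPs), training(FLOPs), and validation(FLOPs); a sketch of the recovery and scaling steps (function names are ours):

```python
def split_profiled_flops(x, y, z):
    """Recover the three unknowns from the three profiling runs:
        x = init + (training + validation)      # one partition, train + val
        y = init + 2 * (training + validation)  # two partitions, train + val
        z = init + training                     # one partition, train only
    """
    init = 2 * x - y                 # since y - x = training + validation
    training = z - init
    validation = (y - x) - training
    return init, training, validation

def total_flops(init, training, validation,
                train_batches, val_batches, epochs):
    """Scale one-partition FLOPs to the full run; init is counted once."""
    return init + (training * train_batches
                   + validation * val_batches) * epochs
```

Dividing the scaled total by the measured running time then gives the FLOPS of Equation 4.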

Figure 8 :
Figure 8: An example of the model accuracy prediction. The fitted logarithmic function and the predicted accuracy are shown in the label.

Figure 9 :
Figure 9: GPU utilization and the corresponding standard deviation over time with an 18-minute sampling interval. The average values are calculated from results between 6 hours and 12 hours (after the initial warm-up phase). The dip in GPU utilization indicates the inter-phase between training stages. The corresponding standard deviations indicate the performance variation of different GPUs, and the low values show the utilization uniformity of machines at different scales.

Figure 10 :
Figure 10: GPU memory utilization and the corresponding standard deviation over time with an 18-minute sampling interval. The average values are calculated from results between 6 hours and 12 hours (after the initial warm-up phase). The dip in GPU memory utilization indicates the inter-phase between training stages. The corresponding standard deviations indicate the performance variation of different GPUs, and the low values show the memory usage uniformity of machines at different scales.

Table 1 :
Summary of existing AI benchmarks. The second column contains three aspects of the benchmark workloads.

Manual neural architecture designs Simonyan and Zisserman (2014); Szegedy et al. (2015); He et al. (2016); Huang et al. (2017); Howard et al. (2017); Hu et al. (2018) in the past few years have drawn enormous attention in the deep learning research community. These designs require tremendous human effort, sometimes even domain knowledge, commonly in an ad-hoc fashion. In contrast, a neural architecture is automatically generated by selecting and combining primary operations (e.g. convolution) with NAS approaches. According to Elsken et al. (2018), NAS methods can be categorized into three abstraction levels: search space, search strategy, and performance estimation strategy. The major search strategies (algorithms) He et al. (2019) include random search Bergstra and Bengio (2012b), reinforcement learning Zoph and Le (2016), evolutionary methods Real et al. (2017b), Bayesian optimization Mendoza et al. (2016), and gradient-based methods Liu et al. (2018). Research around NAS typically explores the three dimensions of abstraction simultaneously, using various algorithms to search different combinations of building blocks. Most of the existing NAS algorithms, especially reinforcement learning and evolutionary ones, require a large number of iterations to converge to the optimal architecture with the proper hyperparameter configuration. One approach to reduce the computational cost is to adapt pre-trained neural architectures instead of searching from scratch. In the spirit of transfer learning and knowledge inheritance, various tricks have been suggested to transform neural architectures. Chen et al. (2015) proposes network transformation, called "Net2Net", which transforms a pre-trained parent network to a child network and then continues training. A function preserving transformation is used to keep the child network's consistency of input and output while making the network wider or deeper compared to the original parent network.
The knowledge represented by the neural architecture is transformed, at least partially, from the parent network to the child network. Wei et al. (2016) first dubs "network morphism" as an upgraded version of "Net2Net", allowing multiple morphing operations, including width, depth, kernel size, and subnet. Successive works like Elsken et al. (2017); Jin et al. (2019); Tan and Le (2019) expand the network morphism related research.

Table 4 :
The analytical operation counts of each layer (per image) in the FP and BP together. Most operations in ResNet-50 happen in the convolutional layers.

Table 8 :
The comparison of operation counts of each layer of ResNet-50 on ImageNet (per epoch, batch size = 1) with different approaches. The difference in FP counts between the training and validation stages comes from the data size.

Table 9 :
The operation counts of FP and BP from nvprof with different batch sizes and the corresponding acceleration ratio compared to the results when batch size is 1. The operation ratio is the