DeepQGHO: Quantized Greedy Hyperparameter Optimization in Deep Neural Networks for on-the-fly Learning

Hyperparameter optimization, or tuning, plays a significant role in the performance and reliability of deep learning (DL). Many hyperparameter optimization algorithms have been developed to obtain better validation accuracy in DL training. Most state-of-the-art hyperparameter optimization algorithms are computationally expensive due to their focus on validation accuracy. Therefore, they are unsuitable for online or on-the-fly training applications, which require computational efficiency. In this paper, we develop a novel greedy-approach-based hyperparameter optimization (GHO) algorithm for faster training applications, e.g., on-the-fly training. We perform an empirical study to measure performance metrics such as computation time and energy consumption of GHO and compare it with two state-of-the-art hyperparameter optimization algorithms. We also deploy the GHO algorithm on an edge device to validate its performance. Finally, we apply post-training quantization to reduce inference time and latency.


I. INTRODUCTION
Deep Neural Networks (DNNs) have stretched the boundaries of Artificial Intelligence (AI) across a variety of tasks, including object recognition from images [3] [4], machine vision [5], natural language processing [6], and speech recognition [7]. They have been used in real-world applications such as estimation of driving energy for planetary rovers [8], SAR target recognition, terrain classification [9], underwater visual odometry estimation [10], interactive medical image segmentation [11], and self-driving vehicles [12]. In this study, we focus on greedy-approach-based hyperparameter optimization for on-the-fly training applications in deep neural networks.
There are two types of training for deep learning applications: 1) offline training, and 2) online training. Offline training requires the entire ground-truth dataset before the training process starts, and the pre-trained model is deployed for a specific task (e.g., path planning for asteroid hopping rovers [13] or classification and detection of Martian rocks [14]). In contrast, online or on-the-fly training generates the ground-truth data and trains the DNNs during deployment. In on-the-fly training, data arrive sequentially in a stream to train the DNNs [15]. On-the-fly training is inevitable for applications that require training the neural network immediately or in which transmitting data to a central system for training is costly. Some scenarios in which on-the-fly training can be very effective are smart agriculture, space exploration, and deep-sea applications.
Smart Agriculture: Consider an example of a UAV used on agricultural farms to estimate the total nitrogen of the soil using multispectral images [16]. It is hard to apply supervised learning in this scenario since the UAV operates on a variety of crop fields, which both require and generate different datasets. Applying on-the-fly training can be beneficial here since on-the-fly training can deal with unlabeled data. Space exploration: A lack of labeled data is a challenge for space exploration. Consider the Mars Curiosity Rover [17], which is designed to determine whether Mars is suitable for microbial life. The Mars Perseverance Rover [18] was developed to search for signs of ancient life and collect rock samples. In addition, the goal of the Europa mission [19] is to find signs of life on Jupiter's moon, Europa. Sending the data to the ground station entails a high data transmission cost. Also, manually labeling data requires human labor and time. Deploying DNNs with on-the-fly training can resolve these challenges.
Deep sea: The ability to recognize and estimate the behavior of deep-sea organisms is critical for underwater robotics. However, the automatic tracking of deep-sea organisms is a challenge due to a lack of labeled data [20]. In addition, sending the data over a network to the ground station is costly due to high data transmission costs; these costs can be reduced by applying on-the-fly training.
When it comes to training DNNs, there are several challenges, such as choosing hyperparameters, an efficient regularizer [15], a vanishing gradient, diminishing feature reuse [21], and internal covariate shift during training [22]. Although many offline training methods have been developed to address them, the challenges of on-the-fly training remain unexplored [21]- [23]. One of the noteworthy challenges is optimizing hyperparameters for training DNNs in edge devices with limited computational capability and operational time.
In this study, we focus on the computation time and energy efficiency of hyperparameter optimization (HPO) in DNNs for on-the-fly training. Several HPO techniques have been used for different applications, such as grid search (GS) [1], random search (RS) [24], Bayesian optimization (BO) [25], genetic algorithm (GA) [26], particle swarm optimization (PSO) [27], and Hyperband [28]. These algorithms are not suitable for on-the-fly training due to their high computation time (CT) (see Table 1). The authors of [29] developed hyperparameter optimization for on-the-fly training, but limited their hyperparameter search space to the learning rate only. Many other crucial hyperparameters should also be considered in DNNs, including neuron size, dense-layer neuron size, regularizer, and dropout. The literature lacks detailed research on hyperparameter optimization in DNNs for on-the-fly training. To address this knowledge gap, we propose a novel greedy hyperparameter optimization (GHO) algorithm for on-the-fly training based on the greedy approach.
Greedy algorithms have been studied extensively for decades and are well known as among the fastest optimization techniques. The authors of [30] proved that a greedy algorithm is theoretically optimal for DNA sequence alignment. They extended the greedy algorithm, showing that their extended version (the greedy alignment algorithm) computes DNA sequence alignments more than 10 times faster than certain dynamic programming algorithms [31], [32]. The authors of [33] and [34] introduced greedy algorithms to deep neural network training for fast learning. To reduce the complexity of a deep belief network (DBN), the authors of [34] empirically studied layer-wise deep neural networks based on the idea presented in [33]. Although computation is slightly more expensive on the upper layers, the greedy approach saves hyperparameter optimization time. This property of the greedy approach is very effective for saving computation time, especially on large datasets. We chose to incorporate the greedy method in hyperparameter optimization because total computation time is the most critical factor for on-the-fly training. Additionally, the literature lacks a suitable greedy-approach-based hyperparameter optimization for on-the-fly training.
The authors of [1] compare GS, RS, and GA for neural architecture search (Table 1) and tune the hyperparameters using HPO algorithms on NVIDIA Tesla K80 (24GB of GPU memory and 4992-Cuda cores) [35]. We implement our GHO algorithm, replicate their experimental setup and tune the hyperparameters using GHO on Nvidia GTX 1060 (6GB of GPU memory and 1280-Cuda cores). We found that GHO is 4x faster than RS and 5x faster than GA with a 6% decrease in prediction accuracy in both cases (see Table 1). This evidence supports that our proposed algorithm is also suitable for limited computational resources. The authors of [2] implemented BO to find the optimal hyperparameters. The BO algorithm is 99% accurate after 50 iterations. We also replicate the experimental setup presented in [2]. In this case, the GHO is more than 5x faster than the BO tuned on the same GPU. We also implement BO, RS, and GHO in an edge device to validate the performance of our algorithm. Specifically, we deploy the algorithms in NVIDIA Jetson Nano and compare the energy efficiency of each algorithm under different scenarios.
Moreover, due to the limitations of computational resources in edge devices, deploying a large DNN model (e.g., ResNet50) is impractical and computationally burdensome. Researchers developed quantization techniques [36] to reduce model size significantly while maintaining similar performance. Specifically, quantization allows a full-precision model to be replaced by a fixed-precision or mixed-precision model, which reduces latency and energy consumption during inference. We perform post-training quantization to reduce inference latency significantly.
The state-of-the-art HPO algorithms presented in Table 1 are not suitable for on-the-fly training due to their high computation time and low energy efficiency. The authors of [37] proposed a greedy approach for hyperparameter optimization. However, they did not focus on on-the-fly training, and their study lacks a performance evaluation of the greedy approach on different datasets and DNN architectures. We propose a fast greedy hyperparameter optimization algorithm for on-the-fly training in deep neural networks to address these shortcomings.
The contributions of this paper are as follows:
• We develop a novel greedy hyperparameter optimization (GHO) algorithm for on-the-fly training in DNNs.
• We perform a numerical study to evaluate the performance of the GHO algorithm and compare it with state-of-the-art HPO algorithms.
• We deploy GHO and some existing HPO algorithms on an edge device to validate the performance of our algorithm.
• We perform post-training quantization on the HPO techniques to further improve performance.

II. RESEARCH QUESTIONS
RQ1: Why on-the-fly learning? What is the impact of GHO on on-the-fly learning?
RQ1 covers circumstances in which offline training is infeasible and on-the-fly training is required, such as extraterrestrial rovers, unmanned underwater vehicles (UUVs), or unmanned aerial vehicles (UAVs). Due to the limited computational capabilities and operational time of these systems, they require a fast optimization algorithm. A brief result of the GHO algorithm is presented and compared to the existing HPO algorithms in Section IV-A.
RQ2: How does the performance of GHO differ from existing HPO techniques? Does the performance behavior stay consistent across different DNNs and datasets when applying quantization?
The performance of the GHO algorithm is studied empirically and compared to the other state-of-the-art HPO algorithms on different DNNs and datasets. RQ2 also accounts for the performance consistency of the GHO algorithm under the different quantization options described in Section IV-A.

RQ3: What is the impact of the GHO algorithm and other existing HPO algorithms on energy consumption during training and inference?
RQ3 investigates the impact of the GHO algorithm as well as other HPO algorithms on energy consumption while training the DNNs and performing inference. RQ3 also explores the energy consumption when applying different quantization options to the HPO techniques. To measure the energy consumption of the GHO algorithm, we conducted an empirical study on an NVIDIA Jetson Nano (Section IV-C).

III. METHODOLOGY
Despite the emergence of modern computational technology and parallel processing, many machine learning (ML) problems cannot be solved to optimality in reasonable computational time due to the inner architecture of ML models. Furthermore, in many realistic cases, reaching optimal solutions is pointless; we are always working with crude simplifications of reality and imperfect data. The aim of approximate algorithms (also known as heuristics) is to generate good approximate solutions quickly without guaranteeing solution quality. A greedy algorithm makes the decision that appears correct at the time. This implies that a locally optimal decision is made in the expectation that this choice will lead to a globally optimal solution [38], [39]. Huffman encoding [40] (used to compress data) and Dijkstra's algorithm [41] (used to find the shortest path through a graph) are examples of greedy algorithms that are very efficient for certain problems. In hyperparameter optimization, a greedy algorithm makes greedy choices to select the hyperparameters at each step in such a way that the objective function is optimized (either maximized or minimized). The greedy algorithm has only one chance to find the best solution; it never reverses a decision [42].

A. PROPOSED ALGORITHM
We use the greedy approach, implementing a greedy search algorithm for hyperparameter optimization. In greedy search, we obtain the validation accuracy locally for each hyperparameter. The proposed GHO algorithm optimizes each hyperparameter, keeping others constant. In this way, it obtains a locally optimal solution to the hyperparameter. The process of optimizing the local solution for each hyperparameter continues iteratively until all hyperparameters are optimized.
Let X = (X_N, X_H) be a DNN configuration, where X_N represents the network topology and X_H the hyperparameters of the DNN, which belong to a decision set D. We assume w is the set of all weights of the DNN. Let T and V be the training and validation datasets, respectively. The hyperparameter optimization problem is divided into two parts. For a given network topology X_N and set of hyperparameters X_H, the DNN training problem is given by

    w* = argmin_w f_T(X, w, T),    (1)

where f_T(·) is the function to obtain the training loss using the DNN configuration X and dataset T. Our goal is to find the optimal X_H that maximizes the validation accuracy V_a. The hyperparameter optimization problem is as follows:

    X_H* = argmax_{X_H ∈ D} f_V(X_N, X_H, w*, V),    (2)

where f_V(·) represents the function to evaluate the validation accuracy using the optimal weights w* obtained from Equation 1. Equation 2 optimizes the hyperparameters X_H jointly using the validation accuracy function f_V. The computational cost of the optimization in Equation 2 increases exponentially with the number of hyperparameters. The greedy algorithm reduces this exponential cost: it optimizes each hyperparameter while keeping the others constant. Let n be the number of hyperparameters in X_H = (X_H^1, X_H^2, . . . , X_H^n), where X_H^i ∈ D^i is the i-th hyperparameter in X_H. Let the search space for the i-th hyperparameter X_H^i have m elements, given by

    D^i = (D^i_1, D^i_2, . . . , D^i_m),    (3)

where D^i_j is the j-th element in the search space of X_H^i, and D^i ∈ R^m is the decision set (search space) for the hyperparameter X_H^i. The greedy approach allows us to rewrite Equation 2 as follows:

    X_H^{i*} = argmax_{X_H^i ∈ D^i} f_V(X_N, X_H^i, w*, V),  i = 1, . . . , n.    (4)

Algorithm 1 presents the pseudo-code of the GHO algorithm, where the validation accuracy V_a = (V_a^1, V_a^2, . . . , V_a^n) and V_a^i represents the validation accuracy obtained from V_a^i = f_V(X_N, X_H^i, w*, V).

Algorithm 1: Greedy hyperparameter optimization (GHO)
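The greedy loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; `train_and_validate` is a hypothetical stand-in for training the DNN with a candidate configuration and returning its validation accuracy f_V.

```python
def greedy_hpo(search_space, defaults, train_and_validate):
    """Greedy hyperparameter optimization (GHO) sketch.

    search_space: dict mapping each hyperparameter name to its decision set D^i.
    defaults: dict of initial values for every hyperparameter.
    train_and_validate: callable(config) -> validation accuracy.
    """
    best = dict(defaults)
    for name, candidates in search_space.items():  # optimize one hyperparameter at a time
        best_acc, best_val = float("-inf"), best[name]
        for value in candidates:                   # try each element D^i_j of the search space
            config = dict(best)
            config[name] = value
            acc = train_and_validate(config)
            if acc > best_acc:
                best_acc, best_val = acc, value
        best[name] = best_val                      # greedy: commit and never revisit
    return best
```

Note that the number of trials is the sum of the search-space sizes (Σ|D^i|) rather than their product, which is what keeps the computation cost linear in the number of hyperparameters.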

B. EXPERIMENTAL SETUP
We performed eight experiments on two different platforms (PC and edge), each based on a different DNN architecture [43] and dataset (CIFAR10 [44] and Intel Image Classification [45]). We specified the same hyperparameter configuration space to fairly compare RS, BO, and GHO (see Table 2). The optimal hyperparameter configuration was determined by each of the HPO algorithms based on the highest validation accuracy. To limit the search time, we fixed the activation function (ReLU), optimizer (Adam), epochs (100), kernel size (3, 3), pool size (2, 2), and stride size (2, 2). We also used the early stopping callback function to prevent overfitting of the model, with monitor: 'val_loss' and patience: 3. The hyperparameter configuration space is summarized in Table 2. We tuned the DNNs on a machine with a 32-core AMD Ryzen Threadripper TR4 processor, an NVIDIA RTX A6000 GPU card, and 48 GB of memory.
[Figure 2: Step-by-step process of HP tuning.]
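The fixed (non-tuned) settings and the early-stopping callback described above might be configured in Keras roughly as follows. This is a sketch under our assumptions; the paper does not show its model-building code, and the `FIXED` dictionary is only an illustrative grouping of the stated values.

```python
import tensorflow as tf

# Fixed (non-tuned) settings from the experimental setup
FIXED = dict(activation="relu", optimizer="adam",
             epochs=100, kernel_size=(3, 3),
             pool_size=(2, 2), strides=(2, 2))

# Early stopping to prevent overfitting, as described above
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

# Assumed usage, with x_train/y_train and validation data defined elsewhere:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=FIXED["epochs"], callbacks=[early_stop])
```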
RS, BO, and GHO are executed within their respective hyperparameter search spaces to train the DNNs. After obtaining the validation scores, we select the hyperparameters that yield the highest validation score. Finally, we train the DNNs for each HPO technique using the best set of hyperparameters. Figure 2 shows the step-by-step process of HPO, training on the dataset, and prediction on test data. Based on the search space defined in Table 2, GHO generates 87 trials for VGG16 and 94 trials for ResNet50. To fairly compare the HPO algorithms, we set the same number for max_trials for both RS and BO.
We use NVIDIA Jetson Nano [46] as our edge platform. The Jetson Nano is equipped with a quad-core A57 processor, 2 GB of LPDDR4 CPU memory, and a 128-core NVIDIA Maxwell GPU. Jetson Nano is not compatible with VGG16 and ResNet50 due to limited computational resources [47].
We followed the instructions provided by the NVIDIA developer forum [47], deploying VGG11 [48] and ResNet18 [49] on the edge platform. While designing the experiments, we simulated real-world scenarios. For example, the Mars Curiosity and Perseverance rovers are equipped with a RAD750 architecture (2 GB of flash memory and 256 MB of RAM) [50], [51]. Both rovers' camera modules can capture photos at a resolution of 1024 × 1024 pixels. Due to the limited on-board computational and storage capacities of both rovers, many images cannot be stored while performing on-the-fly training. Therefore, we fixed 10000 images for CIFAR-10 and 6000 images for the Intel Image Classification dataset on the edge platform. In addition, we reduced the hyperparameter configuration space (see Table 2), with epochs: 10, max_trials: 27 (for VGG11), and max_trials: 44 (for ResNet18).
We performed a numerical study to compare the GHO algorithm with BO and RS. To evaluate the HPO methods, we used validation accuracy as the performance metric and computation time (CT) as the model efficiency metric. CT is the total time required to complete an HPO process. Additionally, we measured the energy consumption of the edge platform. We recorded data points at 15-minute intervals over each hour and computed the average of the data points for each of the HPO algorithms. The operating voltage of the Jetson Nano was fixed at 5 V.
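Given the fixed 5 V operating voltage, each hourly energy figure follows from the averaged current samples as E = V · Ī · t. The exact averaging procedure is our assumption (the paper only states the sampling interval); a minimal sketch:

```python
def average_energy_wh(current_samples_amps, voltage=5.0, duration_hours=1.0):
    """Average the periodic current samples and convert to energy in watt-hours."""
    avg_current = sum(current_samples_amps) / len(current_samples_amps)
    avg_power_watts = voltage * avg_current   # P = V * I
    return avg_power_watts * duration_hours   # E = P * t

# e.g. four 15-minute samples (in amps) over one hour at 5 V:
# average_energy_wh([1.2, 1.4, 1.3, 1.1]) -> 6.25 Wh
```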
We selected the VGG and ResNet architectures due to their widespread use. Both DNN architectures use many hyperparameters, such as epochs, number of hidden layers, number of units per layer, activation, dropout, batch size, and learning rate. Optimizing all of them is computationally expensive; therefore, it is not suitable for applications that require on-the-fly training. Table 2 summarizes the hyperparameters considered in this study. The selection of appropriate hyperparameters is application-specific; our algorithm allows flexibility in hyperparameter selection. Finally, we combine TensorFlow post-training quantization options, such as FP16, dynamic range, and INT8, with the HPO techniques and evaluate the inference latency, model size, accuracy, and energy consumption for the different configuration options. We deploy the quantized models on the Jetson Nano using the TensorFlow Lite converter.
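The three post-training quantization options can be applied with the TensorFlow Lite converter roughly as follows. This is a sketch of the standard TFLite converter API, not the authors' exact code; `model` is the trained Keras model, and `representative_data` is an assumed calibration-data generator required only for full INT8 quantization.

```python
import tensorflow as tf

def quantize(model, mode, representative_data=None):
    """Post-training quantization of a Keras model via the TFLite converter.

    mode: "fp16", "dynamic", or "int8".
    Returns the serialized TFLite flatbuffer (bytes).
    """
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if mode == "fp16":
        # FP16 quantization: weights stored as float16
        converter.target_spec.supported_types = [tf.float16]
    elif mode == "int8":
        # Full integer quantization: needs representative data for calibration
        converter.representative_dataset = representative_data
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8
    # "dynamic" range quantization needs only Optimize.DEFAULT
    return converter.convert()
```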

IV. RESULTS
A. THE IMPACT OF THE GHO ALGORITHM (RQ1)
Autonomous systems (e.g., UAVs, UUVs) operated in remote areas often do not have sufficient computational resources. They often cannot communicate with a central system, either to save battery life or because communication is simply not possible. Therefore, on-the-fly training is necessary for systems that have no other way to send large amounts of data to a central station for training. Since training time is a crucial factor in these scenarios, fast training is a necessity. Several HPO algorithms developed in the past are not sufficiently fast to meet this requirement. The GHO algorithm is significantly faster than the state-of-the-art HPO algorithms, as presented in Figure 3.
Insight: The GHO algorithm saves significant computation time and energy consumption compared to BO and RS.

B. PERFORMANCE CONSISTENCY OF ALGORITHMS (RQ2)
We compare the results of the GHO algorithm with the standard HPO algorithms under different scenarios. We evaluate the performance of the algorithms on two different datasets (CIFAR10 and Intel). For each dataset, the algorithms are tested with two well-known deep neural network architectures (VGG16 and ResNet50). The associated computation time (CT) and validation accuracy (V_a) are presented in Table 3. It shows that GHO outperforms BO and RS on both DNNs and datasets in exchange for only a few percentage points of accuracy. Since GHO achieves the lowest computation time in all scenarios presented in Table 3, its performance is consistent across different datasets and architectures. We combine BO, RS, and GHO with FP16, dynamic range, and INT8 quantization and evaluate the inference latency, model size, accuracy, and energy consumption for the different configurations, as shown in Figure 4 and Figure 6. Overall, the HPO techniques achieve reduced latency, model size, and energy consumption when combined with quantization, at the cost of lower accuracy, especially when quantizing with INT8 precision. Across all DNNs and datasets, GHO outperformed BO and RS. Among the quantization options, INT8 has the lowest latency and the smallest model size, but its accuracy drops significantly. Among the HPO techniques, BO has the largest model size compared to RS and GHO. We observe a similar pattern for both VGG11 and ResNet18.
Insight: The GHO algorithm outperforms BO and RS in terms of computation time during training and latency during inference. The performance behavior is consistent across both DNNs (VGG16 and ResNet50) and datasets (CIFAR10 and Intel), as well as across the quantization options.

C. IMPACT ON ENERGY CONSUMPTION (RQ3)
We perform BO, RS, and GHO on VGG11 and ResNet18 and deploy them on the NVIDIA Jetson Nano to evaluate energy consumption. Notably, the MAXN (10 W) power mode was selected on the Jetson Nano board, and no external IO or peripherals were connected during the energy measurements. In addition, 1.1 GB of GPU memory was pre-allocated using TensorFlow to avoid the "CUDA error: out of memory" error [52]. Since the Jetson Nano (2 GB version) is not equipped with an INA3221 power monitoring interface, the jetson-stats tool was unable to show the power usage of the board [53]. We therefore measured the energy consumption of the Jetson Nano using a digital multimeter. Table 4 shows that, for both VGG11 and ResNet18, GHO consumes the least energy. Higher energy consumption is one of the major drawbacks of on-the-fly training because of the limited operational time. For instance, consider an example of autonomous UAV path planning and estimation [54]. These UAVs have a maximum of 10 hours of flight time using fuel cells [55]. Our results validate that RS and BO can significantly reduce the operational time compared to GHO in these scenarios. When applying quantization to the HPO techniques, BO has the highest energy consumption during inference compared to RS and GHO, as shown in Figure 6.
Insight: The GHO algorithm outperformed BO and RS in terms of energy consumption for both training and inference, while achieving comparable prediction accuracy.

V. CONCLUSION
In this study, we demonstrated a greedy hyperparameter optimization algorithm for on-the-fly training applications in deep neural networks. We deployed our greedy hyperparameter optimization algorithm and some existing hyperparameter optimization algorithms, such as Bayesian optimization and random search, on the NVIDIA Jetson Nano. Then, we compared their computation times and energy consumption.
Overall, our numerical studies confirm that our greedy hyperparameter optimization algorithm outperforms Bayesian optimization and random search in terms of energy consumption. Additionally, it is faster for all datasets and deep neural networks studied in this paper while achieving comparable prediction accuracy. We also performed post-training quantization on the HPO techniques to further improve performance.
MD HAFIZUR RAHMAN completed his M.S. in Electrical Engineering at South Dakota Mines, USA, and his B.Sc. in Electrical and Electronic Engineering at Pabna University of Science and Technology, Bangladesh. During his M.S., he worked on data-driven models for biofilm phenotype prediction on metal surfaces modified with 2D coatings. His current research areas include deep learning, DL model quantization, and computer vision. He is currently working as an Embedded Engineer at General Motors as a contractor. His primary responsibilities are debugging and implementing test software and procedures. He is a keen researcher who continues his studies with collaborators from academia and industry outside of his professional employment.
VOLUME 4, 2016