ETM: Effective Tuning Method Based on Multi-Objective and Knowledge Transfer in Image Recognition

With the widespread application of machine learning and deep learning, image recognition has developed continuously. However, using machine learning and deep learning in practice remains challenging: the tuning process of an algorithm is critical to its performance and difficult to carry out. Although many previous works improve the final accuracy of recognition algorithms through tuning, they do not consider other indicators that are also important in real environments, such as latency and central processing unit (CPU) utilization. In this paper, we propose an effective tuning method based on multi-objective optimization and knowledge transfer that addresses these limitations in image recognition. Specifically, we first use an agent to automatically tune the recognition algorithms, and combine the prediction accuracy and the running latency of each episode into a multi-objective reward signal that guides the update of the agent's internal parameters. In this way, the agent continuously selects better algorithm configurations to improve prediction performance. In addition, we improve the efficiency of this tuning process by transferring knowledge: we learn meta-parameters from other small-scale tasks and use them to initialize the agent. In the experiments, we apply the proposed method to tune eXtreme Gradient Boosting and random forest on 57 image recognition tasks and a convolutional neural network on 2 tasks. The experimental results show that the proposed method achieves average accuracy rankings of 1.92, 1.42 and 1.71 on the three algorithms to be optimized, respectively. In terms of latency, the proposed method performs best on all the tasks (57 data sets) on the three algorithms to be optimized. In addition, we verify the components of the proposed method through ablation experiments.


I. INTRODUCTION
So far, machine learning and deep learning have made great progress in many works in the image recognition field [1]-[3]. However, machine learning and deep learning still require many tedious processes in practical applications, including data processing, feature engineering, algorithm selection, hyperparameter optimization and data analysis. Among them, hyperparameter tuning is particularly important for the performance of predictive algorithms, where hyperparameters refer to the parameters set manually before training the model. In this paper, we mainly solve the hyperparameter optimization problem (HPO problem) to tune the prediction algorithm, so as to improve its prediction performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Ioannis Schizas.
For complex algorithms, tuning is often a time-consuming and tedious process, which prevents researchers from focusing on the problem that needs to be solved. To address this limitation, automatic HPO methods have been proposed and used in various fields. These methods select hyperparameter configurations with as little human intervention as possible, and gradually find the optimal hyperparameter configuration by trial and error within the preset ranges [4]. Subsequently, the idea of automation was extended to the problem of algorithm selection combined with hyperparameter tuning [5]. In the field of image recognition, an efficient hyperparameter tuning method can achieve the following goals:
• it greatly lowers the threshold for using machine learning and deep learning models, which makes the application of these technologies more widespread;
• it allows researchers to pay more attention to modeling the problem in specific scenarios, rather than to the model tuning process;
• compared with traditional manual tuning methods, it can greatly improve optimization efficiency and the prediction performance of the model.
The hyperparameter tuning problem of an algorithm is essentially an optimization problem, whose objective is to make the algorithm achieve the best prediction performance by selecting the hyperparameter configuration. However, this optimization problem cannot be solved directly and efficiently for the following reasons:
• First, the functional relationship between the selection of hyperparameters and the performance of the prediction algorithm in different scenarios is currently unknown, so it is not possible to directly perform gradient descent on the optimization objective to obtain the optimal solution.
• Second, the tuning of each algorithm is a process of constant trial and error, which means that the tuning process needs to explore the preset range of each hyperparameter. The search space is high-dimensional and grows exponentially with the number of hyperparameters, which makes the entire tuning process very complicated and inefficient.
• Finally, in order to improve the prediction performance of the model, the structure of the model tends to become very complicated, which is very unfavorable for deploying the model on an actual application platform.
To address the above limitations, many advanced works have been proposed. In the algorithm tuning community in the field of image recognition, these works mainly fall into two categories: tuning algorithms and tuning tools. Tuning algorithms can be roughly divided into basic search methods and sampling-based methods. The typical representatives of basic search methods are grid search and random search, while sampling-based methods mainly include Bayesian optimization methods, evolutionary optimization methods, and optimization methods based on reinforcement learning. Tuning tools usually focus on the actual user experience (convenience and flexibility). However, although previous works have shown that these tuning algorithms and tuning tools can perform well in image recognition tasks, they often consider only the predictive performance of the model and do not pay attention to indicators of the model in the actual environment, such as latency. Moreover, most previous works cannot transfer experience, which wastes a wealth of tuning knowledge.
In this paper, we propose an effective tuning method (ETM) based on multi-objective optimization and knowledge transfer. This method employs an agent to automatically tune the hyperparameters of the recognition algorithms in preset ranges, and combines the prediction accuracy and the running latency of each episode into a multi-objective reward signal that guides the update of the agent's internal parameters (as shown in Figure 1). In this way, the recognition algorithms can achieve high prediction performance and low actual running latency. In addition, the proposed method can also be used for hyperparameter optimization of traditional machine learning models in tasks such as prediction and classification.
For the algorithms in the image recognition field, we consider both accuracy and latency to achieve multi-objective optimization. This idea is inspired by the observation that a model with higher predictive performance may also have higher latency. Therefore, we should optimize both the prediction performance and the latency of the algorithms through hyperparameter tuning. In addition, since image recognition algorithms or models are often deployed in actual environments with resource constraints, they need to meet specific indicators such as response time (RT). Multi-objective optimization that considers both predictive performance and latency is therefore feasible and practical.
To further improve the efficiency of the above tuning, this paper uses previous optimization experience to transfer knowledge. Specifically, we apply a meta-learning algorithm (model-agnostic meta-learning, MAML [6]) to a number of small-scale tasks to obtain the agent's optimization experience, which is represented by the agent's meta-parameters and is used to initialize the agent's internal parameters. In this way, an agent can quickly adapt to new tasks.
VOLUME 9, 2021
In the experiments, the proposed method is employed to optimize the hyperparameters of eXtreme Gradient Boosting (XGBoost) [7] and random forest on 57 datasets and a convolutional neural network on 2 datasets. In this paper, we focus on the HPO problem in the algorithm tuning process. Our main contributions are as follows:
• To address the problem that the optimization objective function is not explicitly known, we use an agent to automatically select each hyperparameter, obtain the reward signal through training, and update the agent with a reinforcement learning algorithm. In this way, we obtain an agent with good decision-making ability.
• Compared with traditional tuning methods, the proposed method takes both the prediction accuracy and the running latency as tuning objectives. Importantly, we design an aggregation function that combines multi-objective optimization with agent updating, so that the agent's decisions can trade off accuracy and latency.
• To improve the tuning efficiency, we extend the idea of knowledge transfer to the process of hyperparameter tuning. Specifically, we obtain the agent's optimization experience (i.e. meta-parameters) by performing meta-learning on multiple small tuning tasks, and the agent is initialized with these meta-parameters.
• The proposed method is compared with other tuning methods on multiple tuning tasks in the image recognition field. The experimental results show that the proposed method is feasible and efficient. Moreover, the effectiveness of each component is verified by ablation experiments.
The remainder of this paper describes in detail the related work, the specific design and process of the multi-objective tuning method, how meta-learning is used to transfer knowledge, the experimental results and a conclusion.

II. RELATED WORK

A. MULTI-OBJECTIVE OPTIMIZATION
Multi-objective optimization builds on single-objective optimization. Most single-objective optimization methods are based on reinforcement learning and optimize the objective continuously by taking the feedback value of the objective as the reward signal. Single-objective optimization based on reinforcement learning is modeled by a single-objective Markov decision process (MDP). The MDP is formed by an agent interacting with the environment and is usually expressed as a 5-tuple, which includes a state set $S$, an action set $A$, a transition probability function $P$, a reward function $R$, and a discount coefficient $\gamma$. The state set $S$ includes all the states $s$ that the environment can be in; the action set $A$ includes all the actions $a$ that the agent can execute; the transition probability function gives the probability of the transition from one state to the next state; the reward function represents the feedback value of the agent's decision; the discount coefficient determines how strongly rewards at later time steps are weighted. During the MDP, the goal of the agent is to find a policy $\pi$ whose trajectories $\tau$ maximize the expected reward value, which is formally expressed as follows:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right]$$
where the trajectory $\tau$ is formed by the interaction between the agent and the environment and includes the action, state and reward value of multiple time steps, that is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$; the expected reward value represents the weighted sum of the reward values of each time step in the trajectory, i.e. $R(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t$. The methods for solving single-objective optimization based on reinforcement learning can be roughly divided into value-based optimization and policy-based optimization.
Value-based optimization methods first calculate the expected reward value of the trajectory and take it as the value of the state-action pair, that is
$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$$
To solve the single-objective optimization problem, the agent needs the optimal decision policy that maximizes the value of the state-action pair, that is $\pi^{*}(s) \in \arg\max_{a} Q^{*}(s, a)$, where $Q^{*}(s, a)$ denotes the optimal state-action value. Q-learning [8] is a classical value-based optimization method, which indirectly obtains the optimal decision policy by continuously maximizing the expected reward value of the trajectory. This method satisfies the Bellman equation, that is
$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')\right]$$
An important defect of value-based optimization methods such as Q-learning is the curse of dimensionality, which makes these methods very difficult or even ineffective for optimization problems with continuous values. Policy-based optimization methods can address this limitation. One well-known example is the policy gradient method, which does not require the agent to learn how to maximize the expected reward value but directly optimizes a parameterized policy $\pi_{\theta}$ to improve the probability of the optimal action, that is
$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R(\tau)\right]$$
Even so, policy-based optimization methods such as policy gradient suffer from training instability. In practice, some effective tricks are used to reduce the training variance, such as adding a baseline function and assigning suitable credit. More recently, some advanced research works [9], [10] have combined value-based and policy-based optimization methods to achieve complementary advantages.
Based on the single-objective optimization method, the multi-objective optimization method is usually modeled as a multi-objective MDP, which is also represented by a 5-tuple. Different from the single-objective optimization method, the reward signal is a vector composed of the feedback values of multiple objectives rather than a scalar reward, i.e. $\mathbf{r} \in \mathbb{R}^{n}$. In the actual optimization process, an important challenge of multi-objective optimization is to find a Pareto optimal solution that trades off the optimization objectives. Since a single solution that is optimal for every objective generally does not exist, we often need to customize an aggregation function that aggregates multiple reward signals into a scalar value. Aggregation functions can be roughly divided into linear and nonlinear types. The weighted sum is a typical linear method [11], while the exponential weighting method is a nonlinear one [12]. In this paper, we implement multi-objective optimization by customizing a nonlinear aggregation function.
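To make the notions of Pareto optimality and linear aggregation concrete, the following minimal sketch filters a set of (accuracy, latency) candidates down to its Pareto front and then ranks the front with a weighted sum. The candidate values, the weights and the function names are illustrative assumptions and are not taken from the experiments; the sketch shows the linear case only, not the nonlinear aggregation function used by the proposed method.

```python
from typing import List, Tuple

# Each candidate is (accuracy, latency): accuracy is maximized, latency is minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weighted_sum(point: Point, w_acc: float = 1.0, w_lat: float = 0.2) -> float:
    """Linear aggregation: reward accuracy, penalize latency (weights are illustrative)."""
    accuracy, latency = point
    return w_acc * accuracy - w_lat * latency


# Hypothetical (accuracy, latency in seconds) pairs.
candidates = [(0.90, 0.30), (0.92, 0.50), (0.85, 0.20), (0.88, 0.40)]
front = pareto_front(candidates)
best = max(front, key=weighted_sum)
```

Here (0.88, 0.40) is dominated by (0.90, 0.30) and drops out of the front; which of the remaining candidates is preferred depends entirely on the chosen weights.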
Up to now, multi-objective optimization methods have made great progress in algorithm tuning, where they are mainly used to solve the neural architecture search (NAS) problem in the image recognition field. In this paper, we focus on the hyperparameter optimization problem in algorithm tuning, and strive to improve the performance of the model in several aspects through hyperparameter tuning.

B. HYPERPARAMETER OPTIMIZATION
Hyperparameter optimization is a part of the algorithm tuning pipeline. Its purpose is to improve the predictive performance of the algorithm by tuning its hyperparameters. To make the description clearer, we first define the commonly used symbols in hyperparameter optimization:
• $A$ denotes the algorithm to be tuned;
• $\Lambda$ is the hyperparameter search space of the algorithm to be tuned, which is high-dimensional and needs to be preset;
• $n$ is the number of hyperparameters to be optimized;
• $\lambda$ denotes the hyperparameter configuration selected by the optimization method, which is represented by a vector composed of $n$ hyperparameter values;
• $\lambda^{*}$ denotes the optimal hyperparameter configuration;
• $A_{\lambda}$ represents the algorithm to be optimized with the selected hyperparameter configuration applied;
• $D_{train}$ and $D_{valid}$ represent the training set and the validation set of the target task, respectively.
When given a target task, the formal expression of hyperparameter optimization is:
$$\lambda^{*} = \arg\max_{\lambda \in \Lambda} L(A_{\lambda}, D_{train}, D_{valid})$$
where $L(A_{\lambda}, D_{train}, D_{valid})$ denotes the validation performance of $A_{\lambda}$ on the target task.
Generally, grid search [13] or random search [4] is widely used for optimization tasks with small search spaces. Grid search is the simplest hyperparameter tuning method, and its main idea is to search for the optimal solution by traversing all the combinations of hyperparameters. Grid search therefore suffers from the curse of dimensionality, so its optimization process consumes a lot of time when faced with complex tasks. Random search uses a random policy instead of traversing all combinations; its main idea is to perform hyperparameter tuning by sampling randomly from all possible combinations. Some experiments demonstrate that random search is better than grid search when some hyperparameters are much more crucial than others [4]. Moreover, random search has the advantages of parallelization and flexibility. However, random search cannot guarantee optimal results due to the lack of policy guidance.
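The difference between the two basic search methods can be sketched as follows. The scoring function `toy_score`, the hyperparameter names and the ranges are hypothetical stand-ins for training and validating a real model; only the search logic is the point of the example.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def grid_search(grid: Dict[str, List[float]],
                score: Callable[[Config], float]) -> Tuple[Config, float]:
    """Evaluate every combination in the Cartesian product of the listed values."""
    names = list(grid)
    best_cfg: Config = {}
    best_score = float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        current = score(cfg)
        if current > best_score:
            best_cfg, best_score = cfg, current
    return best_cfg, best_score


def random_search(bounds: Dict[str, Tuple[float, float]],
                  score: Callable[[Config], float],
                  n_trials: int, seed: int = 0) -> Tuple[Config, float]:
    """Draw each hyperparameter uniformly from its preset range, n_trials times."""
    rng = random.Random(seed)
    best_cfg: Config = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        current = score(cfg)
        if current > best_score:
            best_cfg, best_score = cfg, current
    return best_cfg, best_score


def toy_score(cfg: Config) -> float:
    """Hypothetical validation score that depends far more on learning_rate than on subsample."""
    return -(cfg["learning_rate"] - 0.1) ** 2 - 0.01 * (cfg["subsample"] - 0.8) ** 2


grid_cfg, grid_best = grid_search(
    {"learning_rate": [0.01, 0.1, 1.0], "subsample": [0.5, 0.8, 1.0]}, toy_score)
rand_cfg, rand_best = random_search(
    {"learning_rate": (0.01, 1.0), "subsample": (0.5, 1.0)}, toy_score, n_trials=9)
```

With the same budget of nine evaluations, the grid tries only three distinct values of the important hyperparameter, whereas random search tries nine, which is the intuition behind the comparison reported in [4].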
Bayesian optimization (BO) is a family of powerful hyperparameter tuning methods. The main idea of Bayesian optimization is to use a specific model to fit the functional relationship between a hyperparameter configuration and its performance, and to use an acquisition function to obtain the next promising hyperparameter configuration based on this relationship. Bayesian optimization methods therefore consist of a surrogate model and an acquisition function. The specific process is as follows: first, the most promising hyperparameter configuration is obtained by sampling with the acquisition function; then, its performance is evaluated on the target task; and finally, the functional relationship is fitted by training the surrogate model on all samples. After many iterations, the acquisition function can choose a better hyperparameter configuration. Since the efficiency and accuracy of the surrogate model are important, most previous works focus on how to select the surrogate model. At present, two popular surrogate models are the Gaussian process and the tree model. Spearmint [14] is a Bayesian optimization method using a Gaussian process as the surrogate model, which is an advanced method for low-dimensional search spaces. The two disadvantages of using a Gaussian process are its time consumption (cubic time complexity) and poor scalability. The sequential model-based algorithm configuration (SMAC) [15] and the tree Parzen estimators (TPE) [13] are Bayesian optimization methods using random forests and a tree of Parzen estimators as the surrogate models, respectively. Many studies have shown that Bayesian optimization methods can achieve better optimization results [13]. However, the tuning process is inefficient when solving large-scale optimization tasks.
Population-based tuning approaches are another competitive family of HPO methods, which can be roughly divided into genetic algorithms and evolutionary algorithms. The main idea of population-based optimization is to maintain a population of candidate configurations and make it evolve through crossover and mutation. The covariance matrix adaptation evolution strategy (CMA-ES [16]) is an improved evolutionary algorithm that samples configurations from a multivariate Gaussian distribution. More recently, CMA-ES has proved to be a powerful black-box optimization method and has been reported to be superior to advanced Bayesian methods [17].
Bandit-based methods, such as Hyperband [18] and BOHB [19], have recently been proposed to solve the HPO problem. Hyperband uses successive halving to allocate resources to each hyperparameter configuration. Research shows that Hyperband performs strongly when tuning deep learning models. However, because a random policy is used for sampling, the optimization efficiency of Hyperband is limited. To address this limitation, BOHB [19] combines Bayesian optimization with the bandit-based approach. This method is highly efficient at the beginning and performs well in the long run.
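The successive halving idea mentioned above can be illustrated with a short sketch. This is only the halving subroutine under simplifying assumptions (a non-empty candidate list, a single bracket, and a hypothetical `toy_evaluate` function standing in for partial training); it is not a full implementation of Hyperband or BOHB.

```python
from typing import Callable, List, TypeVar

C = TypeVar("C")


def successive_halving(configs: List[C], evaluate: Callable[[C, int], float],
                       min_budget: int = 1, eta: int = 2) -> C:
    """Evaluate all survivors, keep the top 1/eta, multiply the budget by eta, repeat."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_evaluate(quality: float, budget: int) -> float:
    """Hypothetical learning curve: the score approaches `quality` as the budget grows."""
    return quality * (1.0 - 1.0 / (budget + 1))


# Each candidate is represented only by its (normally unknown) final quality.
candidates = [0.2, 0.9, 0.5, 0.7, 0.1, 0.3, 0.8, 0.4]
winner = successive_halving(candidates, toy_evaluate)
```

Poor configurations are discarded after a small budget, so most of the total budget is spent on the few configurations that survive to the later rounds.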
The above mainly describes single-objective optimization methods. Thus far, multi-objective optimization methods have mainly focused on NAS problems [20]-[22]. The main reason is that the neural network architecture has a great impact on the actual running indicators. For example, due to the limitations of hardware devices and application scenarios, indicators such as computational complexity and resource consumption also need to be optimized. The progress of these multi-objective optimization methods in NAS enables the network architecture to adapt effectively to the actual environment. Therefore, we believe that machine learning models that are widely used in many fields should also be studied under multi-objective optimization (response time or resource consumption), so that they can better adapt to the actual environment in addition to achieving good predictive performance.

C. KNOWLEDGE TRANSFER
Knowledge transfer is an important area of research in the image recognition community. One method of knowledge transfer is transfer learning; for example, [23] uses pretrained weights and data to improve natural language processing (NLP) models. In algorithm tuning, [24] uses transfer learning to learn a generalizable framework that can speed up the search for new tasks. Another important method is meta-learning, or learning-to-learn, which has recently received interest [6], [25], [26]. The training of meta-learning is mainly divided into two steps: collecting meta-data of historical learning tasks or previously learned models, and extracting useful knowledge from the meta-data to guide the completion of new tasks. Meta-data includes hyperparameter configurations, neural network architectures, model evaluation results, model internal parameters, and task attributes (meta-features). Meta-learning can be divided into three categories: meta-representation, meta-objective and meta-optimizer. The meta-objective defines the goal of meta-learning by selecting meta-objectives and the associated data flow between inner loop events and external optimization. The meta-optimizer represents the choice of the outer optimizer during meta-training. The outer optimizer can take various forms such as gradient descent, reinforcement learning, and evolutionary search. The meta-representation specifies what is learned. Generally, representations include hyperparameters, network structure, and initial weights.
Meta-learning achieves the goal of fast adaptation to new tasks by learning from other tasks. Auto-sklearn is an advanced tuning tool that applies meta-learning to select a configuration that is likely to perform well on a new task.

III. HYPERPARAMETER TUNING BASED ON MULTI-OBJECTIVE OPTIMIZATION
In this section, we will describe in detail hyperparameter tuning based on multi-objective optimization. First, we illustrate the property of sequential decision making in the HPO problem. Then, the HPO is extended to the multi-objective Markov decision process. Finally, we will introduce the design of the agent and multi-objective optimization in detail.

A. SEQUENTIAL DECISION MAKING IN HPO
Traditional hyperparameter optimization methods directly choose a hyperparameter configuration in the preset high-dimensional search space. If the model to be optimized is very complex, the search space of the task becomes very large and grows exponentially with the number of hyperparameters. To solve this problem, we observe that the HPO problem has a natural sequential decision process. The intuition is that any complex high-dimensional action can be selected incrementally, component by component, where each component's probability also depends on the components already selected [27]. Specifically, the main idea of the sequential decision process in HPO is that hyperparameters are selected sequentially, and the selection of each hyperparameter depends on the selection of the previous hyperparameters.
To further illustrate the advantages of the sequential decision process in HPO, we compare it with the traditional method of directly selecting a hyperparameter configuration in a high-dimensional space. We assume that the model to be optimized has $n$ hyperparameters. In each iteration, the traditional optimization method selects a hyperparameter configuration in the search space $\Lambda = \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_n$ ($\times$ denotes the Cartesian product; $\Lambda_i$ denotes the search space of the $i$-th hyperparameter). In the case of sequential decision making, hyperparameters are selected sequentially to form the configuration. Each iteration contains $n$ selections, and each selection is made in the search space of the corresponding hyperparameter, so the search space is $\Lambda = \Lambda_1 \cup \Lambda_2 \cup \dots \cup \Lambda_n$. Sequential decision making therefore not only reduces the difficulty of tuning but also improves the efficiency of optimization.
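A small numerical example shows how large the gap between the two formulations can be. The per-hyperparameter sizes below are hypothetical and assume that every range has been discretized into a finite number of candidate values.

```python
import math

# Hypothetical numbers of candidate values for n = 4 hyperparameters.
sizes = [10, 20, 5, 8]

# Joint selection: one choice among all elements of the Cartesian product.
joint_size = math.prod(sizes)

# Sequential selection: n separate choices, each within one hyperparameter's own range.
sequential_size = sum(sizes)
```

In this example the joint formulation faces 8000 alternatives in a single decision, while the sequential formulation faces 43 alternatives spread over four decisions. The set of reachable configurations is the same in both cases; what shrinks is the number of alternatives the agent has to weigh at each decision.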
In the process of sequential decision making, in addition to selecting hyperparameters sequentially, we should also consider the interrelation between hyperparameter selections. In this paper, we use a network structure with memory for implicit association, and make the current hyperparameter selection depend on the previous hyperparameter selection for explicit association.

B. MULTI-OBJECTIVE MARKOV DECISION PROCESS
Based on the above formulation of sequential decision making, we further define the HPO problem as a multi-objective Markov decision process. With this definition, the workflow of using an agent to solve HPO problems can be clearly described. First of all, we define the 5-tuple $\langle S, A, P, R, \gamma \rangle$ of the multi-objective Markov decision process in the HPO problem as follows:
• $A$ is the set of all the actions that the agent can perform, that is, the set of all the hyperparameters that the algorithm needs to tune. At each time-step $t$, the action is $a_t = \lambda_t$, and the search space of the action is $\Lambda_t$. After $n$ time-steps, the agent has selected $n$ hyperparameters, i.e. $\lambda = a_{1:n}$.
• $S$ is a finite set of states, which includes all the states the environment can be in. For the HPO problem, the environment that interacts with an agent is composed of a dynamic part and a static part, where the algorithm to be optimized and the target task are the static part, and the hyperparameters are the dynamic part. In this paper, we only consider the dynamic part. Specifically, we take the hyperparameter distribution at time $t-1$ as the state of the environment at time $t$, i.e. $s_t = D(\lambda_{t-1})$, where the hyperparameter distribution is output by the agent.
• $R$ is the reward function. In a multi-objective MDP, the reward signal is composed of the feedback values of multiple objectives. In this paper, we take the accuracy and latency as optimization objectives. Therefore, the vector consisting of the accuracy and latency is used as the reward signal. Specifically, $r_t = [0, 0]$ for $t \in [1, n)$ and $r_n = [accuracy, latency]$, where $accuracy$ denotes the validation performance of $A_{\lambda = a_{1:n}}$ and $latency$ is the latency of $A_{\lambda = a_{1:n}}$.
• $P : S \times A \rightarrow \mathcal{P}(S)$ is the state transition probability function. We usually do not know the state transition of the environment; otherwise, a model-based method could easily solve the problem.
• $\gamma$ is a discount factor.
As shown in Figure 1, the overall framework consists of three components: an agent to select a hyperparameter configuration, a trainer to obtain the model accuracy and latency with the selected configuration, and multi-objective rewards including accuracy and latency. The multi-objective MDP proceeds as follows: for a given task, the agent selects $n$ hyperparameters one by one based on its previous decisions. Then, the machine learning model with the selected hyperparameters is trained on a training set $D_{train}$. The accuracy and latency on a validation set $D_{valid}$ are used as reward signals to update the parameters of the agent by a reinforcement learning algorithm. As a result, the agent learns how to tune hyperparameters over time.
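One episode of this multi-objective MDP can be sketched as below. The functions `toy_policy` and `toy_trainer` are hypothetical stand-ins for the agent and for model training and validation, the numeric ranges are made up, and the sampling step simply clips a Gaussian draw to the range; the sketch only illustrates the order of states, actions and rewards described above.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]                 # (mean, std) of the previous output distribution
Policy = Callable[[State, int], State]      # hypothetical agent: (s_t, t) -> (mu_t, sigma_t)
Trainer = Callable[[List[float]], Tuple[float, float]]  # config -> (accuracy, latency)


def run_episode(bounds: Sequence[Tuple[float, float]], policy: Policy,
                trainer: Trainer, seed: int = 0):
    """Select n hyperparameters one by one, then train once to obtain the vector reward."""
    rng = random.Random(seed)
    state: State = (0.0, 1.0)               # s_1 = N(0, 1)
    actions: List[float] = []
    rewards: List[Tuple[float, float]] = []
    for t, (low, high) in enumerate(bounds, start=1):
        mu, sigma = policy(state, t)
        value = min(high, max(low, rng.gauss(mu, sigma)))   # a_t, clipped to its range
        actions.append(value)
        state = (mu, sigma)                 # s_{t+1} is the distribution output at step t
        if t < len(bounds):
            rewards.append((0.0, 0.0))      # r_t = [0, 0] for t in [1, n)
    rewards.append(trainer(actions))        # r_n = [accuracy, latency]
    return actions, rewards


def toy_policy(state: State, t: int) -> State:
    """Stand-in for the agent: nudges the previous mean and keeps a fixed spread."""
    return 0.5 * state[0] + 0.25, 0.1


def toy_trainer(config: List[float]) -> Tuple[float, float]:
    """Stand-in for training and validation; returns made-up accuracy and latency."""
    accuracy = 1.0 - abs(config[0] - 0.5)
    latency = 0.01 * (1.0 + config[1])
    return accuracy, latency


actions, rewards = run_episode([(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)], toy_policy, toy_trainer)
```

Only the final step carries a non-zero reward, because accuracy and latency can be measured only after the full configuration has been chosen and the model has been trained.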

C. DESIGN OF THE AGENT
The agent consists of an input embedding layer, an output embedding layer and a long short-term memory (LSTM) network [28], which is the core part of the agent. Specifically, the input state $s_t$ is converted to a high-dimensional representation by the input embedding layer, which allows the agent to better observe the state representation. The output of the input embedding layer is then fed to the core network, a three-layer LSTM. Although the LSTM network is difficult to train, the LSTM cell has been shown to be a powerful structure for sequential problems. Finally, the output of the LSTM is converted to a low-dimensional representation by the output embedding layer. The output of the agent is not a hyperparameter value but rather a distribution over the possible values. Following [10], [29], we use the normal distribution to represent the distribution of a hyperparameter ($\lambda_t$). Thus, the output of the agent at time $t$ is $\mathcal{N}(\mu_t, \sigma_t)$, with $s_t = \mathcal{N}(\mu_{t-1}, \sigma_{t-1})$ and $s_1 = \mathcal{N}(0, 1)$. As described above, the design and workflow of the agent match the sequential decision process very well.

D. SAMPLING FOR A HYPERPARAMETER
From the above description, it can be seen that the output of the agent is a distribution over the possible values of a hyperparameter. Therefore, we need to obtain the actual hyperparameter value by sampling. A simple sampling method is random sampling. However, because the preset ranges of the hyperparameters differ significantly, random sampling within the preset range of each hyperparameter makes the training of the agent very unstable and even ineffective. To solve this problem, we customize a transformation method to scale the distribution of the hyperparameters. The transformation process is as follows:
• Scale the mean of the distribution $\mu$ to $\mu'$ by the tanh function, so that it lies in the range $(-1, 1)$;
• Sample a value $z$ from the new distribution $\mathcal{N}(\mu', \sigma)$;
• Scale $z$ into the range of the hyperparameter $[z_L, z_U]$, where $z_U$ and $z_L$ represent the upper and lower bounds, respectively.
The clip_and_convert function limits the sampled value $z$ to the preset range by clipping and makes the hyperparameter meet its type requirement by type conversion.
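The three transformation steps can be sketched as follows. The linear map from $[-1, 1]$ onto $[z_L, z_U]$ in step 3 and the rounding used for integer hyperparameters are assumptions made for illustration, since only the purpose of the scaling and of clip_and_convert is stated here; the function signatures are likewise hypothetical.

```python
import math
import random
from typing import Optional, Union

Number = Union[int, float]


def clip_and_convert(value: float, low: float, high: float, as_int: bool) -> Number:
    """Clip the value to [low, high], then convert it to the required type."""
    clipped = min(high, max(low, value))
    return int(round(clipped)) if as_int else clipped


def sample_hyperparameter(mu: float, sigma: float, low: float, high: float,
                          as_int: bool = False,
                          rng: Optional[random.Random] = None) -> Number:
    """Turn the agent's output N(mu, sigma) into one value inside [low, high]."""
    rng = rng or random.Random()
    mu_scaled = math.tanh(mu)               # step 1: squash the mean into (-1, 1)
    z = rng.gauss(mu_scaled, sigma)         # step 2: sample z from N(mu', sigma)
    # Step 3 (assumed form): map [-1, 1] linearly onto [low, high].
    value = low + (z + 1.0) / 2.0 * (high - low)
    return clip_and_convert(value, low, high, as_int)


rng = random.Random(0)
# Hypothetical example: an integer hyperparameter with preset range [1, 9].
samples = [sample_hyperparameter(0.3, 0.5, 1, 9, as_int=True, rng=rng) for _ in range(1000)]
```

Because every hyperparameter is sampled in the same normalized range before being mapped to its own bounds, the agent sees outputs of comparable scale regardless of how wide each preset range is.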

E. MULTI-OBJECTIVE OPTIMIZATION
Algorithm 1 Meta-Learning on HPO Tasks
Input: $\theta$: meta parameters; $\alpha$, $\beta$: step sizes.
Procedure:
1: randomly initialize $\theta$
2: while not done do
3:   Sample a batch of HPO tasks $T_i$ from source datasets
4:   for all $T_i$ do
5:     Sample a trajectory using $\pi_{\theta}$ in $T_i$: $\tau_i = (s_1, a_1, A_1, LAT_1, \dots, s_n, a_n, A_n, LAT_n)$
6:     Compute adapted parameters $\theta'_i = \theta - \alpha \nabla_{\theta} L_{T_i}(\pi_{\theta})$ using $\tau_i$ and $L_{T_i}$ defined in Equation (5)
7:     Sample a trajectory using $\pi_{\theta'_i}$ in $T_i$: $\tau'_i = (s_1, a_1, A_1, LAT_1, \dots, s_n, a_n, A_n, LAT_n)$
8:   end for
9:   Update $\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{T_i} L_{T_i}(\pi_{\theta'_i})$ using each $\tau'_i$ and $L_{T_i}$ defined in Equation (5)
10: end while

The internal parameters $\theta$ of the agent represent a policy $\pi$ that decides which action to choose based on the current state of the environment. Following previous works [30], [31], we use the PPO-clip method [10] to update $\theta$. Compared with the policy gradient method [32], the PPO-clip method uses importance sampling to reuse sampled data and clips the probability ratio between the new and old policies to constrain each update, so as to achieve good training efficiency and stability. The objective function of the PPO-clip method is defined as:
$$\theta_{k+1} = \arg\max_{\theta} \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right]$$
where $L$ is given by:
$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s, a),\ \mathrm{clip}\left(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)\right)$$
where $\epsilon$ is a hyperparameter that controls how far the new policy $\theta$ may move from the old policy $\theta_k$; we set $\epsilon = 0.2$. For single-objective RL, the advantage function is defined as $A^{\pi_{\theta_k}} = R(\tau_k) - b$, where the return $R(\tau_k) = \sum_{t=1}^{n} r_t$ is the cumulative reward over the $k$-th sample, and $b$ is an exponential moving average of the returns of the previous samples.
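The per-sample clipped term and the moving-average baseline described above can be sketched in a few lines. The decay rate of the moving average and the choice to start the baseline at the first return are assumptions for illustration, and the sketch covers only the single-objective case with a scalar advantage.

```python
from typing import Optional


def ppo_clip_term(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(1.0 + epsilon, max(1.0 - epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


class MovingAverageBaseline:
    """Exponential moving average of past returns, used as the baseline b."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay                    # assumed value; not specified in the text
        self.value: Optional[float] = None

    def advantage(self, episode_return: float) -> float:
        """Return A = R - b, then fold R into the running baseline."""
        if self.value is None:
            self.value = episode_return       # first sample: baseline starts at R, so A = 0
        adv = episode_return - self.value
        self.value = self.decay * self.value + (1.0 - self.decay) * episode_return
        return adv
```

For a positive advantage the term stops growing once the ratio exceeds $1 + \epsilon$, and for a negative advantage it stops improving once the ratio falls below $1 - \epsilon$, so a single sample cannot pull the new policy arbitrarily far from the old one.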
We design an aggregation function that combines accuracy and latency into a single reward signal to achieve multi-objective optimization. Let $L(s, a, \theta_k, \theta)$ incorporate the latency and be redefined as:

$$L(s, a, \theta_k, \theta) = \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} \times f_{scalar},\; \mathrm{clip}\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) \times f_{scalar} \right) \tag{10}$$

where $f_{scalar}$ is the output of the aggregation function, which adjusts the advantage value $A^{\pi_{\theta_k}}$ according to latency, $LAT_k$ denotes the inference latency of the $k$-th sample on the target task, and $T$ is the minimum latency of all configurations searched so far. We use a customized weighted product method to define the aggregation function. Here, $w$ is the weight factor, which takes the value $\alpha$ when $LAT_k \le T$ and the value $\beta$ when $LAT_k > T$, where $\alpha$ and $\beta$ are application-specific constants, with $\alpha \ge 0$ and $\beta < 0$. The accuracy-latency trade-off can be achieved by tuning these two constants. An empirical rule for determining the values of $\alpha$ and $\beta$ is to softly adjust the advantage value $A^{\pi_{\theta_k}}$ by considering its sign. If $A^{\pi_{\theta_k}}$ is positive and $LAT_k \le T$, which means that the selected configuration achieves high accuracy with low inference latency, $w$ is set to a negative value to increase the value of $f_{scalar}$; if $A^{\pi_{\theta_k}}$ is negative and $LAT_k \le T$, $w$ is set to a positive value to increase the value of $f_{scalar}$, since the latency constraint is satisfied even though $A^{\pi_{\theta_k}}$ is negative, so this configuration is not too bad. We further illustrate the motivation for setting the weight $w$ ($\alpha$ and $\beta$) by analyzing the following four cases ($A^{\pi_{\theta_k}}$ is referred to as $A$ for simplicity):
• Case 1: $A \ge 0$, $LAT_k \le T$ is the best case, that is, high accuracy and low latency. Therefore, we should set $w$ ($\alpha$) to a negative value to increase the positive advantage value of the action $a$.
• Case 2: $A \ge 0$, $LAT_k > T$ is a suboptimal case, that is, the accuracy objective is met but the latency objective is not. We should set $w$ ($\beta$) to a negative value to reduce the original advantage value, thereby reducing the positive effect of the action $a$.
• Case 3: $A < 0$, $LAT_k \le T$ is a suboptimal case, that is, the latency objective is met but the accuracy objective is not. We should set $w$ ($\alpha$) to a positive value to increase the original advantage value, thereby reducing the negative effect of the action $a$.
• Case 4: $A < 0$, $LAT_k > T$ is the worst case, that is, neither the accuracy objective nor the latency objective is met. We should set $w$ ($\beta$) to a positive value to reduce the original advantage value, thereby increasing the negative effect of the action $a$.

We consider two ways to set the values of $\alpha$ and $\beta$: a hard constraint and a soft constraint. If $\alpha = 0$ and $\beta = -1$, we obtain a hard constraint: when $A^{\pi_{\theta_k}}$ is positive and $LAT_k \le T$, we simply use $A^{\pi_{\theta_k}}$ as the advantage value; otherwise, we sharply penalize the advantage value to discourage models from violating the latency constraint. In our experiments, we use a soft constraint that smoothly adjusts the advantage value by setting $\alpha = -0.07$ and $\beta = -0.07$ if $A^{\pi_{\theta_k}}$ is positive, and $\alpha = 0.07$ and $\beta = 0.07$ otherwise.
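The four cases can be checked numerically. The displayed definition of the aggregation function is not reproduced in this text, so the sketch below assumes a weighted product of the form f_scalar = A · (LAT_k / T)^w, with w = α when LAT_k ≤ T and w = β otherwise. This form is an assumption chosen because it reproduces the behaviour described in all four cases; the function names are hypothetical.

```python
def choose_weight(advantage, latency, target_latency, soft=True):
    """Pick w from (alpha, beta) using the soft or hard constraint settings."""
    if soft:
        # Soft constraint: alpha = beta = -0.07 for a non-negative advantage,
        # and alpha = beta = 0.07 otherwise.
        alpha = beta = -0.07 if advantage >= 0 else 0.07
    else:
        # Hard constraint: alpha = 0, beta = -1.
        alpha, beta = 0.0, -1.0
    return alpha if latency <= target_latency else beta


def f_scalar(advantage, latency, target_latency, soft=True):
    """ASSUMED weighted-product aggregation: A * (LAT_k / T) ** w."""
    w = choose_weight(advantage, latency, target_latency, soft=soft)
    return advantage * (latency / target_latency) ** w
```

Under this assumed form, with T = 1 and the soft constraint, an advantage of +1 grows to about 1.05 at half the target latency (Case 1) and shrinks to about 0.95 at twice the target latency (Case 2), while an advantage of −1 moves to about −0.95 (Case 3) and −1.05 (Case 4), respectively.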

IV. KNOWLEDGE TRANSFER IN HPO
Traditional tuning methods ignore previous experience of optimizing tasks, which means that each new task is solved from scratch. Such methods are unnatural and inefficient. In fact, previous experience should be accumulated and used for further exploration, similar to the accumulation of knowledge in human experts [33]. To accelerate learning, we transfer knowledge from other tuning tasks, i.e., we train the agent on a variety of small-scale learning tasks to acquire prior experience and learn a common feature representation. In this way, the agent with prior experience learns faster. Importantly, many previous works have demonstrated the strong performance of meta-learning for knowledge transfer. In this paper, we use meta-learning to transfer knowledge in the HPO problem.
Specifically, we use the recently proposed model-agnostic meta-learning (MAML) algorithm to transfer knowledge. Algorithm 1 and Figure 3 give an overview and the workflow of the meta-learning training process on different HPO tasks, respectively. A hyperparameter tuning task $T_i$ is defined as the optimization of hyperparameters for a given model $A$ on a dataset $i$. Following [6], there are two optimization steps: the meta-training step (Step 6), in which a task-specific learner $\theta'_i$ is learned from the current parameter $\theta$, and the meta-test step (Step 9), in which the parameter $\theta$ is updated based on the evaluation of $\theta'_i$; $\alpha$ and $\beta$ are the respective learning rates. In this work, $\tau_i$ is sampled by $\pi_\theta$ and is used for the meta-training process; $\tau'_i$ is sampled by $\pi_{\theta'_i}$ and is used for the meta-test. After multiple episodes, the meta parameters $\theta$ obtained from this meta-learning procedure are not necessarily good for a new task by themselves. However, they serve as a good starting point from which a good model can be trained using only a few steps of learning.
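The two-level update can be illustrated on a toy problem. The sketch below replaces the HPO task loss of Equation (5) with a hypothetical quadratic loss L_i(θ) = ½‖θ − c_i‖², so that both the inner gradient and the exact meta-gradient have closed forms. The task centers c_i, the function name, and the step sizes are illustrative assumptions and are not the settings used in the paper.

```python
import numpy as np


def maml_quadratic(centers, alpha=0.1, beta=0.05, iterations=500,
                   batch_size=3, seed=0):
    """MAML on toy tasks with loss L_i(theta) = 0.5 * ||theta - c_i||^2."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)
    theta = rng.normal(size=centers.shape[1])        # randomly initialize theta
    for _ in range(iterations):
        # Sample a batch of tasks (here: a batch of task centers).
        chosen = rng.choice(len(centers), size=batch_size, replace=False)
        meta_grad = np.zeros_like(theta)
        for c in centers[chosen]:
            # Meta-training step: one inner gradient step from theta.
            theta_i = theta - alpha * (theta - c)
            # Meta-test step: gradient of L_i(theta_i) with respect to theta.
            # For this quadratic loss, d(theta_i)/d(theta) = (1 - alpha) * I.
            meta_grad += (1.0 - alpha) * (theta_i - c)
        theta = theta - beta * meta_grad
    return theta
```

On this toy family, the meta parameters settle near the mean of the task centers, which is the initialization from which a single inner step reaches every task most cheaply on average; this mirrors the "good starting point" argument above.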

V. SUMMARY OF THE OVERALL FRAMEWORK
To make the proposed approach clearer, we integrate all of the above details into a complete description of the tuning approach (see Alg. 2). First, the agent's meta-parameters $\theta$ are obtained by meta-learning on multiple small-scale tasks and are used to initialize the agent when solving a new task. Then, the distributions of the hyperparameters are output sequentially by the agent, and the actual hyperparameters are obtained by sampling and clipping. After $n$ time steps, a hyperparameter configuration $\lambda$ with $n$ hyperparameters is obtained. Then, the selected hyperparameter configuration is applied to the algorithm to be optimized $A$, and the multi-objective reward vector (accuracy and latency) is obtained by training $A_\lambda$ on the target task. Finally, a reinforcement learning algorithm is used to update the agent's internal parameters. In this way, the agent not only adapts quickly to a new task but also takes multiple objectives into account during hyperparameter tuning.

Algorithm 2 Tuning Method Based on Multi-Objective and Knowledge Transfer
Input: $s_1$: the initial state, $s_1 = \mathcal{N}(0, 1)$; $n$: the number of hyperparameters of the algorithm.
Procedure:
1: Initialize the agent with the meta-parameters obtained from Algorithm 1
2: while not done do
3:   for $t = 1$ to $n$ do
4:     The agent outputs $\mathcal{N}(\mu_t, \sigma_t)$ based on $s_t$
5:     Sample $a_t$ ($\lambda_t$) from $\mathcal{N}(\mu_t, \sigma_t)$
6:     Obtain the accuracy $A_t$ and the latency $LAT_t$ on the validation set after training $A_\lambda$
7:   end for
8:   Use the trajectory $\tau = (s_1, a_1, A_1, LAT_1, \ldots, A_n, LAT_n)$ to update the agent's parameters by the PPO-clip algorithm
9: end while

VI. EXPERIMENTS
In this section, we compare the proposed method with other advanced optimization methods on 57 image recognition datasets to illustrate its performance. The algorithms to be tuned include two tree-based models (random forest and extreme gradient boosting (XGBoost)) and a deep learning model (convolutional neural network). The experiments consist of two parts: comparison experiments and ablation experiments. Comparison experiments are performed to demonstrate the performance advantages of the proposed method, while ablation experiments are performed to show the feasibility and effectiveness of each component of the proposed method. In the following, we first describe the relevant details of the experiments and then present the comparison experiments and the ablation experiments in turn.

A. EXPERIMENTAL SETTINGS

1) DATASETS
In this paper, we focus on datasets in the image recognition field and use them as target tasks. Specifically, we collected a total of 77 datasets from two public repositories (UCI and OpenML). In order to transfer knowledge from previous tasks, we selected 20 datasets as the source datasets for meta-learning and the remaining 57 datasets as the target tasks. Importantly, in order to verify the robustness of the optimization methods, the datasets selected in the experiment include handwritten digits and letters, cars, animals, and other entities from specific scenes. Moreover, these datasets range in size from thousands to tens of thousands, which allows us to verify the ability of the optimization method to adapt to problems of different sizes (as shown in Table 1).

TABLE 1. The table shows the statistical results of the sizes of the datasets used in the experiment. The dataset sizes span a wide range, so we can verify the robustness of the proposed method to datasets of different sizes.

FIGURE 4. This figure shows the process in which the agent directly outputs all hyperparameters in one step. Since the agent makes decisions directly, the horizon of each episode unrolls one step. In each episode, the agent directly outputs $\mu^o_i$ and $\sigma^o_i$ ($i \in [1, n]$) of the $n$ hyperparameter distributions, and the output is then used as the input for the next time.

2) COMPARISON METHODS
In this paper, the proposed method is referred to as ETM (effective tuning method), which first initializes the agent through knowledge transfer, then uses the agent to select each hyperparameter sequentially, and optimizes accuracy and latency within a multi-objective optimization framework. In the comparison experiments, we compare the proposed method with the following advanced optimization methods: an evolutionary algorithm-based optimization method, CMA-ES [16]; three Bayesian optimization methods, TPE [13], Spearmint [14] and SMAC [15]; and a recent advanced optimization method, BOHB [19]. Furthermore, the default hyperparameter configuration of the algorithm to be optimized is used as the baseline.
In order to verify the effectiveness of each component of the proposed method, we propose two variants based on the ETM method: single-ETM and TM. The single-ETM method uses a single-objective optimization framework, which only considers the accuracy performance of the model to be optimized. The other settings are consistent with the ETM method. By comparing the ETM method and the single-ETM method, the effectiveness of multi-objective optimization can be verified, and the advantages of multi-objective optimization for algorithm tuning can also be illustrated. The TM method does not initialize the agent with the knowledge transfer method, while the other settings are the same as the ETM method. By comparing the ETM method and the TM method, the influence of knowledge transfer on optimization efficiency can be explained.

3 https://www.openml.org/

3) EVALUATION CRITERION AND EXPERIMENTAL DETAILS
Following the evaluation criteria of state-of-the-art papers [15], [30], three performance indicators, namely accuracy, time and latency, are calculated in each tuning experiment. All three indicators are measured on the test set for the hyperparameter configuration obtained during tuning. The accuracy indicates the impact of tuning on the predictive performance of the model being optimized. The time indicates the optimization efficiency of the tuning method. The latency indicates the effect of tuning on the actual running time of the model being optimized. In addition, we calculate the ranking of each optimization algorithm on the above three indicators, as well as the average ranking and standard deviation of each optimization algorithm over multiple tasks.
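The average ranking and its standard deviation can be computed as in the following NumPy sketch. The text does not state how ties are handled or which standard-deviation estimator is used, so the average-rank treatment of ties and the population standard deviation below are assumptions, as are the function names.

```python
import numpy as np


def rank_rows(scores, higher_is_better=True):
    """Rank the methods (columns) within each task (row); rank 1 is best.

    ASSUMPTION: tied methods share the average of the ranks they occupy.
    """
    s = np.asarray(scores, dtype=float)
    if not higher_is_better:
        s = -s
    # better[t, j]: number of methods strictly better than method j on task t.
    better = (s[:, None, :] > s[:, :, None]).sum(axis=2)
    # tied[t, j]: number of methods with the same score as j (including j).
    tied = (s[:, None, :] == s[:, :, None]).sum(axis=2)
    return 1.0 + better + 0.5 * (tied - 1)


def rank_summary(scores, higher_is_better=True):
    """Return (average rank, standard deviation of rank) per method."""
    ranks = rank_rows(scores, higher_is_better=higher_is_better)
    return ranks.mean(axis=0), ranks.std(axis=0)
```

Accuracy would be ranked with `higher_is_better=True`, whereas time and latency would use `higher_is_better=False`.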
Importantly, we run each tuning method 3 times independently and report the average performance to reduce the effect of chance. Each independent experiment runs for 300 iterations. In the experiment, we use the PPO-clip [10] method to update the agent and the Adam algorithm [34] to perform optimization, with the learning rate set to 0.008. Moreover, we evaluate each hyperparameter configuration using 5-fold cross validation. During meta-learning, we sample 5 batches of tasks, and each batch contains 3 different tasks. Afterwards, the Adam algorithm [34] is used to perform 30 meta-gradient updates on each batch of tasks, where α = 0.0007 and β = 0.001. Because the datasets differ considerably in size, the partition ratio is 8 (training set)/2 (test set) for small datasets (fewer than 10,000 samples) and 9 (training set)/1 (test set) for big datasets (more than 10,000 samples). For the trade-off weight w of multi-objective optimization, we use the soft constraint. The specific settings and analysis are described in the ablation experiments section. In particular, the proposed method does not introduce highly sensitive parameters.

The average rank "Rank" and standard deviation "Stdev" of accuracy, time and latency over 101 datasets. "*" and "+" denote that the difference from other values in the same line is statistically significant at p < 0.01 and p < 0.05, respectively. The best result is in bold font.
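The size-dependent partition rule can be written down directly. In the sketch below, the function names, the shuffling seed, and the treatment of a dataset with exactly 10,000 samples (counted as big) are assumptions, since the text only specifies the two cases on either side of that threshold.

```python
import random


def holdout_fraction(n_samples, threshold=10_000):
    """Test-set fraction: 0.2 for small datasets, 0.1 for big ones.

    ASSUMPTION: a dataset with exactly `threshold` samples counts as big.
    """
    return 0.2 if n_samples < threshold else 0.1


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * holdout_fraction(n_samples)))
    return indices[n_test:], indices[:n_test]
```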

B. COMPARE WITH OTHER METHODS
In this section, we verify the performance advantages of the proposed method through comparison experiments. The comparison experiments mainly take machine learning algorithms and deep learning models in the image recognition field as tuning objects. Specifically, we use the tuning algorithms to perform hyperparameter tuning for the random forest, XGBoost and convolutional neural network on 57 tasks. The experimental results and analysis are described below.

1) HYPERPARAMETER TUNING FOR MACHINE LEARNING ALGORITHMS AND DEEP LEARNING MODEL

a: SEARCH SPACE
In this experiment, we chose to optimize the hyperparameters of two advanced machine learning algorithms, the random forest and XGBoost algorithms, for the following reasons: the random forest algorithm is evaluated by [35] as the best of 179 classifiers arising from 17 families; the XGBoost algorithm contains many more hyperparameters and has recently been dominating Kaggle competitions; and the performance of both algorithms is sensitive to the hyperparameter configuration. The code of the two algorithms is based on scikit-learn [36]. Six hyperparameters (continuous) of the random forest algorithm and ten hyperparameters (continuous) of the XGBoost algorithm need to be optimized (Table 2). In recent years, the convolutional neural network has been widely used in the field of image recognition. Therefore, we choose the convolutional neural network as a more complex, deep learning optimization object. The architecture of the convolutional neural network is similar to the one proposed by [37], which includes two convolution layers, two pooling layers, and two fully connected layers. We choose 15 hyperparameters to be optimized, including the stride size, kernel size, and channel size in each convolutional layer; the pooling type, kernel size, and stride size in each pooling layer; the number of hidden nodes in each fully connected layer; and the learning rate. The specific information of the hyperparameters is shown in Table 2.

b: EXPERIMENTAL RESULTS AND ANALYSIS
Following the previous settings, we run the experiments and report the experimental data. Table 3 shows the optimization performance of each optimization method on the three algorithms to be tuned, including the average rankings of accuracy, time and latency, their standard deviations and significance levels. It is clear from Table 3 that our proposed approach performs strongly in the three optimization scenarios. Specifically, although the BOHB and SMAC methods are competitive in terms of accuracy, the proposed method achieves the best (i.e., lowest) average accuracy ranking in all three optimization scenarios. In terms of time performance, the proposed method also achieves the best performance in all three optimization scenarios. It is worth noting that the advantage is more pronounced in the XGBoost and convolutional neural network optimization scenarios. This indicates that the proposed method adapts well to complex optimization tasks. In terms of latency performance, the average ranking of the proposed method is first, that is, it achieved optimal performance on all 57 tasks, which also indicates that considering accuracy and latency simultaneously in multi-objective optimization can improve latency performance while maintaining accuracy performance. Importantly, we use the Friedman statistical test and the Wilcoxon post-hoc test to ensure that the experimental comparisons are statistically significant [38].
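A Friedman test followed by pairwise Wilcoxon signed-rank tests can be run with SciPy as sketched below. This is only an illustration of the two tests named above: the function name, the choice of comparing one reference method against each of the others, and the absence of any multiple-comparison correction are assumptions, not a description of the exact protocol used in the experiments.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon


def compare_methods(scores, reference=0):
    """Friedman test over all methods, then Wilcoxon tests against a reference.

    `scores` has one row per task and one column per method (at least three
    methods are required by the Friedman test).
    """
    scores = np.asarray(scores, dtype=float)
    columns = [scores[:, j] for j in range(scores.shape[1])]
    # Omnibus test: do the methods differ across tasks at all?
    _, friedman_p = friedmanchisquare(*columns)
    posthoc = {}
    for j in range(scores.shape[1]):
        if j != reference:
            # Paired post-hoc test of the reference method against method j.
            _, p_value = wilcoxon(scores[:, reference], scores[:, j])
            posthoc[j] = float(p_value)
    return float(friedman_p), posthoc
```

In practice the post-hoc p-values would typically be adjusted for the number of pairwise comparisons before being compared with the 0.01 and 0.05 thresholds.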
In addition, we compare the optimization methods from another perspective, namely the number of tasks, out of the 57, on which each optimization method achieves the best performance. These statistics illustrate the performance level of each optimization method on individual tasks. The experimental results are shown in Table 4. Our method is significantly better than the other methods, especially in latency.
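Counting the tasks on which each method is best is a one-line reduction over the same task-by-method score matrix. In this sketch the function name is hypothetical, and crediting every tied method with a win is an assumption, since the text does not say how ties are counted.

```python
import numpy as np


def count_wins(scores, higher_is_better=True):
    """Number of tasks (rows) on which each method (column) is best.

    ASSUMPTION: when several methods tie for best, each is credited a win.
    """
    s = np.asarray(scores, dtype=float)
    if not higher_is_better:
        s = -s
    best = s.max(axis=1, keepdims=True)
    return (s == best).sum(axis=0)
```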
To conclude, the proposed approach is superior to other methods in most optimization scenarios and optimization tasks, and the latency performance has obvious advantages.

C. ABLATION EXPERIMENTS
In this section, we demonstrate the effectiveness of each component of the proposed method by performing ablation experiments. In the ablation experiments, we use only a subset of the tuning scenarios and target datasets from the comparison experiments.

1) MULTI-OBJECTIVE OPTIMIZATION VS SINGLE-OBJECTIVE OPTIMIZATION.
In this section, we verify the impact of multi-objective optimization on tuning by comparing the ETM and single-ETM methods. This ablation experiment takes XGBoost as the algorithm to be optimized and 12 datasets as the target tasks. The experiment is executed independently 3 times, and each run consists of 300 iterations. The experimental results are shown in Table 6. It can be clearly seen that the hyperparameter configuration found by ETM improves the latency performance of the model to be optimized while preserving its accuracy performance.
To further study multi-objective optimization, we chose two sets of values of w to study the effects of the soft constraint and the hard constraint. The settings of w are as follows: for the hard constraint (w hard), α = 0 and β = −1; for the soft constraint (w soft), α = β = −0.07 if the advantage value is positive, and α = β = 0.07 otherwise. Figure 5 shows the accuracy and latency of the hyperparameter configurations under the soft constraint (w soft) and hard constraint (w hard) settings. When the weight coefficient w is set as the hard constraint, the agent is more inclined to choose a hyperparameter configuration that improves the latency performance, so as to avoid the severe penalty on the advantage value. However, the hard constraint setting causes the searched hyperparameter configuration to fall into a local optimum in terms of accuracy performance. In contrast, when w is set as w soft, the agent can better trade off accuracy and latency. As shown in Figure 5, although the latency under the hard constraint is lower than under the soft constraint on most samples, the test set accuracy under the soft constraint is significantly better than under the hard constraint on most samples. Importantly, the soft constraint lets the agent explore some configurations that achieve the best test set accuracy with less latency.

2) KNOWLEDGE TRANSFER VS TUNING FROM SCRATCH
We compared the performance of the proposed ETM method (which transfers knowledge through meta-learning) and the TM method (which does not transfer knowledge) on 12 target tasks (Table 7). In terms of test results and runtime, we found that ETM is superior to TM on all datasets, while the real-world latency is not affected. Although the TM method also employs an agent to sequentially select hyperparameters, it ignores the tuning experience of previous tasks. In contrast, the ETM method uses meta-learning to exploit the experience of previous optimization tasks and uses the meta-parameters to initialize the agent, which accelerates the agent's adaptation to new tasks. The experimental results demonstrate that the agent can adapt to new tasks quickly by using meta-parameters. In addition, it can find better configurations with only a few samples and is not limited by suboptimal values.

TABLE 7. The test performance, latency and runtime of the ETM and a variant TM. "acc", "latency" and "time" represent accuracy, latency and time performance respectively. We report the mean of the 3 test performances. The best result is in bold font.

TABLE 8. We set three orders, and the orders of selecting hyperparameters are random and different. "acc", "latency" and "time" represent accuracy, latency and time performance respectively. We report the mean of the 3 test performances. The best result is in bold font.

3) SEQUENTIAL DECISION MAKING VS DIRECTLY OUTPUTS CONFIGURATION
In this part, we tune XGBoost on 12 tasks and compare ETM-SDM (SDM: sequential decision making) and ETM-DOC (DOC: directly outputs configuration) to verify the feasibility of sequential decision making. The experimental results are presented in Figure 9, where each method ran for 200 episodes. We can see that tuning with ETM-DOC only works on a few tasks and even fails on some tasks. In contrast, the ETM-SDM method, which treats HPO as a sequential decision problem, achieved better test set performance on all target datasets. We believe that the reason for this result is as follows: if the search space is very large, it is difficult for ETM-DOC to explore a good policy in such a large space. The ETM-SDM method, however, makes decisions sequentially, so the search space is reduced at each time step, which makes the problem much easier to handle.
To further explore sequential decision making, we randomly set three different optimization orders for XGBoost (as shown in Table 5). By comparing the performance of the three different optimization orders, it is shown that the proposed method is insensitive to the optimization order. We can see from Table 8 that the random optimization order does not affect the final tuning performance.

The test performance, latency and runtime of two variants (ETM-SDM and ETM-DOC). "acc", "latency" and "time" represent accuracy, latency and time performance respectively. We report the mean of the 3 test performances. The best result is in bold font.

VII. CONCLUSION
In this paper, we focus on algorithm tuning in the field of image recognition and propose an efficient hyperparameter optimization method. This method uses an agent to select hyperparameters sequentially. Compared with traditional tuning methods, this method is based on a multi-objective optimization framework that simultaneously takes accuracy and latency as optimization objectives and customizes an aggregation function to trade off accuracy and latency. We use a reinforcement learning algorithm to update the policy. In order to improve the efficiency of tuning, we use meta-learning to obtain the meta-parameters of the agent from previous optimization tasks and use the meta-parameters to initialize the agent when solving a new task. In the experiments, we compared the proposed method with other advanced tuning methods on 57 image recognition datasets. The experimental results show that the proposed method performs strongly in time, accuracy, and latency. Specifically, the proposed method achieves average accuracy rankings of 1.92, 1.42 and 1.71 on the three algorithms to be optimized, respectively. In terms of latency performance in particular, the proposed method performs best on all the tasks (57 datasets) on the three algorithms to be optimized. In addition, we verify the effectiveness of the proposed components through ablation experiments.