Enhanced Deep Deterministic Policy Gradient Algorithm Using Grey Wolf Optimizer for Continuous Control Tasks

Deep Reinforcement Learning (DRL) allows agents to make decisions in a specific environment based on a reward function, without prior knowledge. Adapting hyperparameters significantly impacts the learning process and time. Precise estimation of hyperparameters during DRL training poses a major challenge. To tackle this problem, this study utilizes Grey Wolf Optimization (GWO), a metaheuristic algorithm, to optimize the hyperparameters of the Deep Deterministic Policy Gradient (DDPG) algorithm for achieving an optimal control strategy in two simulated Gymnasium environments provided by OpenAI. The ability to adapt hyperparameters accurately contributes to faster convergence and enhanced learning, ultimately leading to more efficient control strategies. The proposed DDPG-GWO algorithm is evaluated in the 2DRobot and MountainCarContinuous simulation environments, chosen for their ease of implementation. Our experimental results reveal that optimizing the hyperparameters of the DDPG using the GWO algorithm in the Gymnasium environments maximizes the total rewards during testing episodes while ensuring the stability of the learning policy. This is evident when comparing our proposed DDPG-GWO agent, with its optimized hyperparameters, against the original DDPG. In the 2DRobot environment, the original DDPG had rewards ranging from −150 to −50, whereas, in the proposed DDPG-GWO, they ranged from −100 to 100 with a running average between 1 and 800 across 892 episodes. In the MountainCarContinuous environment, the original DDPG struggled with negative rewards, while the proposed DDPG-GWO achieved rewards between 20 and 80 over 218 episodes with a total of 490 timesteps.


I. INTRODUCTION
The integration of RL with the Deep Q-Network (DQN) framework has been successfully achieved, utilizing Deep Neural Networks (DNN) as the foundation for Q-learning [1], [2]. DQN has shown remarkable performance across a variety of Atari games, thereby contributing to the advancement of diverse DRL systems [3], [4]. However, DQN's applicability was limited to tasks involving discrete and relatively small state and action spaces [5]. In contrast, many RL problems encompass continuous state and action spaces. Although DQN can handle continuous tasks by discretizing the state and action spaces, this approach increases the overall unpredictability of the control mechanism [6]. To overcome this challenge, the Deterministic Policy Gradient (DPG) algorithm [7] was introduced, which combined DNN techniques and proved to be suitable for addressing continuous tasks. Consequently, the Deep Deterministic Policy Gradient (DDPG) was created [8]. Nevertheless, the DDPG algorithm is susceptible to insufficient exploration and intermittent instability during training [9].
Within the framework of a continuous control problem, the DDPG algorithm necessitates the pre-definition of a set of parameters, commonly referred to as hyperparameters, to enable autonomous exploration and learning in a complex environment. These hyperparameters encompass aspects such as the batch size, the size of the network, the exploration strategies, the learning rates, time steps, and others [10]. During training, these parameters are not dynamically optimized; instead, researchers rely on their knowledge to manually select the most suitable values. Hyperparameter configuration significantly impacts the efficacy of learning processes, interactions with the environment, and the time required for learning [11]. Therefore, it is crucial to determine the most suitable hyperparameters carefully and accurately to enhance the performance of the model. Each environment possesses unique characteristics and complexities that require specific hyperparameters tailored to its requirements. Typically, a common approach to selecting these hyperparameters is manual search, which demands expertise to identify robust sets of hyperparameters [12]. However, finding the optimal hyperparameters is a challenging task [13].
Recently, optimization has emerged as a captivating area of study across various domains, and nature-inspired metaheuristic optimization algorithms have gained considerable attention as promising techniques for achieving optimal solutions [14]. These algorithms have found applications in conjunction with AI methods due to several advantageous characteristics: (i) their ease of implementation and reliance on straightforward concepts, (ii) their ability to operate without requiring gradient information, (iii) their capability to overcome local optima, and (iv) their applicability to a wide range of problems spanning diverse research fields. Metaheuristic algorithms inspired by nature tackle optimization problems by mimicking biological or physical phenomena. These techniques can be broadly classified into numerous main categories, such as swarm-based methods, evolution-based methods, and physics-based methods [15].
Ant Colony Optimization (ACO) [16] and Particle Swarm Optimization (PSO) [17] are widely recognized as popular swarm-based optimization algorithms. PSO draws inspiration from the collective behavior of bird flocks, while ACO is motivated by the cooperative behavior observed in ant colonies. Another notable algorithm is the Grey Wolf Optimizer (GWO) [18], which involves initializing a population of grey wolves to represent potential solutions. GWO has demonstrated efficient problem-solving capabilities in the literature [18].
A recent study by Faris et al. [19] assessed the scientific applications of the GWO and reported promising results across various optimization problems. The high success rate of GWO in the literature may stem from its remarkable characteristics relative to other swarm intelligence methods. This review underscores that GWO does not require prior knowledge of the search space and involves only a few parameters. Moreover, it is scalable, flexible, user-friendly, and straightforward, maintaining a fine balance between exploration and exploitation throughout the search process, resulting in excellent convergence.
In this research, we employed the GWO to enhance the DDPG algorithm's performance in continuous control problems. To accomplish this, we validated our proposed DDPG-GWO method on Gymnasium environments, as they possess highly complex state spaces and continuous action spaces, and necessitate precise fine control. Moreover, OpenAI's simulation environments present a wide array of virtual spaces, enabling the training and evaluation of AI models within realistic and interactive scenarios spanning diverse domains [20]. To effectively adapt the DDPG algorithm to Gymnasium environments, this study focuses on the optimization of seven key hyperparameters: the actor learning rate, critic learning rate, discount factor, exploration rate, batch size, Polyak averaging coefficient, and learning rate of the target networks. These hyperparameters play a crucial role in governing the performance of the overall system [8]. To achieve this optimization task, the GWO is employed as the chosen metaheuristic optimization technique. GWO is selected based on its promising outcomes and potential for achieving desirable results in this context.
Therefore, we conducted a comprehensive evaluation of our agent in training mode using OpenAI's simulation environments. The performance of our DDPG-GWO agent is compared against the original DDPG agent, which utilizes a fixed set of the DDPG algorithm's hyperparameters. The evaluation results indicate that our agent exhibited superior performance in maximizing the cumulative rewards throughout the training episodes and during the test episodes, with a significant margin of improvement compared to its competitors. This article provides the following contributions:
• This study surveys the latest and most remarkable advancements in DDPG, highlighting the research endeavors dedicated to enhancing the optimization of its hyperparameters.
• Narrowing the scope to a specific set of seven hyperparameters of the DDPG algorithm that are widely acknowledged as pivotal for improving the efficiency of the learning process.
• Utilizing the GWO to identify the optimal settings of the chosen hyperparameters, enabling our agent to effectively implement continuous control within the simulation environments provided by OpenAI.
• A performance comparison is conducted between the DDPG agent employing the hyperparameters optimized by the GWO and the original DDPG agent utilizing a fixed set of hyperparameters.
The rest of this paper is structured as follows: in Section II, we provide the fundamental background of this study. Section III presents the discussion on related work. The detailed explanation of the proposed approach is presented in Section IV. In Section V, the experimental results of the proposed method are discussed. Section VI provides the discussion and analysis. Finally, Section VII concludes the paper.

II. FUNDAMENTAL BACKGROUND
In this section, a fundamental overview is provided of DRL (Sub-section II-A), the DDPG algorithm (Sub-section II-B), the GWO (Sub-section II-C), and the OpenAI simulation environments (Sub-section II-D).

A. DEEP REINFORCEMENT LEARNING
DRL is an advanced approach that combines deep learning (DL), which involves training deep neural networks, with RL, a technique used to teach agents to make decisions in an environment through trial and error [21]. Two well-known algorithms in this field are the DQN algorithm and the DDPG algorithm. The DDPG algorithm builds upon the DQN algorithm, enhancing its capabilities. One key aspect of the DDPG algorithm is that it does not require a predefined model and can learn off-policy from the experience gathered while acting in the environment. This flexibility allows the algorithm to adapt and improve its decision-making abilities over time.

B. DDPG ALGORITHM
The DDPG algorithm, as introduced by Lillicrap et al. [22], represents a distinctive approach in DRL that combines the principles of being both off-policy and model-free. This fusion effectively integrates the strengths of the DQN algorithm and the Actor-Critic (AC) algorithm, melding DL and RL principles.
While the DDPG algorithm shares a structural framework with the AC algorithm, it distinguishes itself through a more refined neural network configuration. Notably, while the DQN algorithm excels in discrete problem domains, the DDPG algorithm builds upon DQN's experiences to tackle the complexities of continuous control tasks, achieving end-to-end learning. Illustrated in Figure 1, the DDPG's structure embodies an actor network that processes input states, selects actions, and generates action values. Concurrently, a critic network evaluates the chosen action's efficacy and computes the associated reward. The DDPG algorithm's procedural steps are elaborated as follows:
1) Initialize the neural network parameters. The actor, guided by the behavior policy, chooses an action.
To promote exploration, the action produced by the policy network is augmented with noise N_t. This modified action, denoted as a_t, is then sent to the environment for execution.
2) Once the environment performs the action a_t, the resulting outcome is observed, including the reward r_t received and the new state s_{t+1} of the system.
3) The actor archives the state transition (s_t, a_t, r_t, s_{t+1}) in the replay memory, which functions as the training dataset for the online networks.
4) DDPG employs two separate copies of both the policy network and the Q network: an online network and a target network.
The update procedures for the policy network and the Q network are outlined as follows: N transitions are selected randomly from the replay memory to serve as training data for the online policy network and the online Q network. Each individual transition within a mini-batch is denoted by (s_t, a_t, r_t, s_{t+1}).
5) During the critic step, compute the gradient of the Q values for the online Q network. The loss of the Q network is defined as
$$L(\theta^Q) = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2, \qquad y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right).$$
The derivative of the loss function L with respect to the parameters θ^Q can be computed as ∇_{θ^Q} L. This calculation involves utilizing the target policy network μ′ and the target Q network Q′.
6) Update the online Q network, i.e., update θ^Q using the Adam optimizer.
7) Within the actor component, the policy gradient is computed by calculating the gradient of the policy network:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i}\nabla_{a} Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}.$$
8) Update the online policy network, again using the Adam optimizer.
9) The parameters of the target networks are updated using the soft-update method:
$$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}.$$
Typically, the DDPG algorithm employs the Actor-Critic framework to iteratively train the policy and the Q network by facilitating interactions among the environment, actor, and critic components.
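To make the update rules in steps 5)–9) concrete, the following minimal PyTorch sketch performs one DDPG gradient step on a sampled mini-batch. It is an illustrative sketch rather than the exact implementation used in this study; the network classes, optimizer objects, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One DDPG gradient step on a mini-batch of transitions (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch  # tensors; r and done have shape [N, 1]

    # Step 5: critic loss L = mean((y - Q(s, a))^2), with
    # y = r + gamma * (1 - done) * Q'(s', mu'(s')) from the target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)

    # Step 6: update the online Q network with Adam.
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Steps 7-8: ascend the policy gradient of Q(s, mu(s)) by minimizing its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 9: soft (Polyak) update of the target networks.
    for target, online in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

In practice, the actor and critic would be small fully connected networks, and the batch would be drawn uniformly at random from the replay memory described above.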

C. GREY WOLF OPTIMIZATION
The GWO algorithm, proposed by Mirjalili et al. [18], draws inspiration from the social intelligence observed in grey wolves, which exhibit a preference for living in groups consisting of 5 to 12 individuals. To replicate the leadership hierarchy observed in grey wolf packs, the algorithm incorporates four levels: alpha (α), beta (β), delta (δ), and omega (ω).
In this hierarchy, the alphas, a male and a female, are the leaders of the pack. They hold the primary responsibility for making decisions such as hunting, selecting sleeping locations, and determining wake-up times. Beta wolves, on the other hand, assist the alphas in decision-making and focus mainly on providing feedback and suggestions. Delta wolves fulfill multiple roles within the group, serving as hunters, sentinels, caretakers, scouts, and elders; they dominate the omega wolves while following the commands of the alpha and beta wolves. The omega wolves, in turn, are required to obey all other wolves within the group [23].
In the context of the GWO algorithm, the hunting process is orchestrated by the α, β, and δ wolves, with the ω wolves obediently adhering to their guidance. The encircling behavior characteristic of GWO is modeled as
$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D},$$
where $\vec{A}$ and $\vec{C}$ denote coefficient vectors, $\vec{X}_p$ denotes the position vector of the prey, and $\vec{X}$ denotes the position of a wolf in a d-dimensional space, where d is the number of variables. The variable t denotes the current iteration, and $\vec{D}$ is defined as
$$\vec{D} = \left|\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)\right|,$$
with the coefficient vectors $\vec{A}$ and $\vec{C}$ given by
$$\vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\,\vec{r}_2,$$
where r_1 and r_2 are random vectors whose components are selected in the range of zero to one. Over the course of the iterations, the components of $\vec{a}$ are linearly decreased from 2 to 0. Using the encircling equation above, a grey wolf can update its position randomly around the prey.
In the hunting phase, the values of $\vec{X}_1$, $\vec{X}_2$, and $\vec{X}_3$ are defined and computed as follows:
$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha, \qquad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta, \qquad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta,$$
$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}.$$
At a given iteration t, the vectors $\vec{X}_1$, $\vec{X}_2$, and $\vec{X}_3$ are built around the three top-performing wolves (solutions) within the swarm. The coefficient vectors $\vec{A}_1$, $\vec{A}_2$, and $\vec{A}_3$ follow the same calculation used for $\vec{A}$ above, and the distance vectors $\vec{D}_\alpha$, $\vec{D}_\beta$, and $\vec{D}_\delta$ are computed as
$$\vec{D}_\alpha = \left|\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}\right|, \qquad \vec{D}_\beta = \left|\vec{C}_2 \cdot \vec{X}_\beta - \vec{X}\right|, \qquad \vec{D}_\delta = \left|\vec{C}_3 \cdot \vec{X}_\delta - \vec{X}\right|.$$
One of the key elements in the GWO algorithm for balancing exploitation and exploration is the vector $\vec{a}$. In the baseline paper of this algorithm, it is recommended to decrease the vector's value for each dimension in a linear manner, starting from 2 and ending at 0, as the number of iterations progresses. The equation used to update the vector is
$$a = 2 - t \cdot \frac{2}{T},$$
where t represents the current iteration number and T represents the total number of iterations of the optimization process.
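As an illustration of the update rules above, the short Python sketch below performs one GWO iteration over a population of candidate solutions; the function name and array layout are assumptions made for readability rather than part of the original algorithm description.

```python
import numpy as np

def gwo_step(positions, alpha, beta, delta, t, max_iter, rng):
    """Move every wolf toward the alpha, beta, and delta leaders (one GWO iteration)."""
    a = 2.0 - t * (2.0 / max_iter)               # a decreases linearly from 2 to 0
    new_positions = np.empty_like(positions)
    for i, x in enumerate(positions):
        candidates = []
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            A = 2.0 * a * r1 - a                  # A = 2a*r1 - a
            C = 2.0 * r2                          # C = 2*r2
            D = np.abs(C * leader - x)            # distance to this leader
            candidates.append(leader - A * D)     # X_1, X_2, X_3
        new_positions[i] = np.mean(candidates, axis=0)  # X(t+1) = (X_1 + X_2 + X_3) / 3
    return new_positions
```

The averaging over the three leader-guided candidates is what pulls the whole pack toward the current best solutions while the shrinking value of a gradually shifts the search from exploration to exploitation.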

D. OPENAI GYMNASIUM
The OpenAI Gymnasium has become a popular toolkit in the machine learning community for RL research [24]. This work follows the established structure used by researchers and builds upon it by creating the 2D Robot and Mountain Car Continuous simulation environments. OpenAI Gymnasium places a primary emphasis on the episodic setting of RL, aiming to maximize the expected total reward per episode and achieve satisfactory performance quickly. The toolkit also aims to integrate the Gymnasium API with physical robotic hardware, allowing for the validation of RL algorithms in real-world environments [24].
OpenAI Gymnasium [25] includes a collection of environments known as Partially Observable Markov Decision Processes (POMDPs), which will continue to expand over time. The initial beta release of Gymnasium featured various environments, including:
• Classic control and toy text: They encompass small-scale tasks that are commonly encountered in the RL literature.
• Algorithmic: Tasks that revolve around computation, such as performing operations like adding multi-digit numbers or reversing sequences. Memory is frequently essential for these tasks, and their level of challenge can be finely tuned by modifying the length of the sequence.
• Atari: This category involves classic Atari games, where input is derived from RAM or screen images, and the Arcade Learning Environment is leveraged [26].
• Board games: Initially featuring the game of Go on 9 × 9 and 19 × 19 boards, with the Pachi engine acting as the opponent [27].
• 2D and 3D robots: These entail simulated robot control tasks that utilize the MuJoCo physics engine, acclaimed for its speed and precision in simulating robots [28]. Some of these tasks were adapted from RLLab [29].
Since its initial release, the collection of environments has expanded to encompass additional options, including those grounded in Box2D, an open-source physics engine, or the Doom game engine through VizDoom [30].

1) OPENAI GYMNASIUM ENVIRONMENTS
In our study, we leverage two OpenAI Gymnasium environments: the 2D Robot Arm Environment and the Mountain Car Continuous Environment [25].
• 2D Robot Arm Environment: The presented OpenAI Gymnasium environment, named RobotArm-V0, simulates a two-link robot arm operating in a 2D space using PyGame. In RobotArm-V0, the robot comprises two 100-pixel-length links, and the objective is to reach a randomly generated red target point in each episode. The observation space includes the target positions in both the x and y directions, as well as the current positions of the two arm joints in radians. The action space consists of discrete actions to hold the current joint angles or increment/decrement them, with a default rate of 0.01 radians. The reward function penalizes the robot if the current tip-to-target distance is greater than or equal to the previous distance, and rewards it if the distance is within a tolerance of 10 pixels. The episode terminates when the reward reaches −10 or +10 [25].
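For reference, the snippet below shows the standard Gymnasium interaction loop on the built-in MountainCarContinuous-v0 environment; the custom RobotArm-V0 environment follows the same interface, and the random action here is only a placeholder for the DDPG actor's output.

```python
import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
obs, info = env.reset(seed=0)
episode_return = 0.0
while True:
    action = env.action_space.sample()        # placeholder for the trained actor's action
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:               # goal reached or time limit hit
        break
env.close()
print(f"Episode return: {episode_return:.2f}")
```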

III. RELATED WORK
This section presents the discussion on related work. First of all, Chen et al. [31] introduced a method for adapting hyperparameters in DRL based on the Bayesian approach.
Their study stands out as the most comprehensive investigation of RL hyperparameters to date, specifically focusing on configuring the AlphaGo algorithm. They achieved the automatic refinement of game-playing hyperparameters for AlphaGo, an achievement that conventional methods are unable to attain. The application of Bayesian optimization not only improved the winning probabilities of AlphaGo but also generated valuable data that can be used to develop enhanced versions of self-play agents incorporating Monte Carlo Tree Search (MCTS). Nonetheless, this approach requires a substantial number of experiments and relies on advanced information. Additionally, it is primarily effective for adapting individual hyperparameters rather than a range of hyperparameters. Liessner et al. [11] introduced a model-based approach for optimizing hyperparameters in the DDPG algorithm, which demonstrated effectiveness in real-world industrial applications. The authors addressed the challenge of limited training time by imposing strict constraints on the DDPG algorithm within the specific domain. By successfully optimizing the hyperparameters under these time limitations, they achieved improved performance. However, one limitation of the study is that the strict constraints on the DDPG algorithm are not applicable to other domains or applications.
Oktay et al. [32] suggested employing an artificial bee colony (ABC) algorithm to fine-tune the weights of an artificial neural network (ANN) that functions as the objective function in an optimization procedure. The ANN is trained using specific input and output datasets, and the objective function, which relies on the ANN, is enhanced using the ABC algorithm to achieve improved outcomes. This study demonstrates the application of metaheuristic optimization approaches alongside artificial intelligence methods.
Another independent study by Sehgal et al. [33] utilized a Genetic Algorithm (GA) to optimize hyperparameters in the HER (Hindsight Experience Replay)+DDPG algorithm. The GA-based approach effectively identified hyperparameters that required fewer training epochs while still achieving enhanced task performance. The research employed a range of robotics manipulation tasks encompassing actions such as push, slide, reach, fetch, place, pick, and open operations, serving to showcase the effectiveness of the proposed methodology.
Elfwing et al. [34] introduced a method that shares similarities with population-based training (PBT). They presented an alternative technique called OMPAC, which focuses on the evolutionary mechanism and represents the initial strategy for adapting multiple hyperparameters in DRL using a population-based approach. In a related study by Jaderberg et al. [35], the authors also employed PBT to optimize a group of models along with their hyperparameter configurations. This was accomplished within a predetermined computational budget, with the objective of achieving optimal performance. The proposed approach demonstrated promising results in various domains such as Machine Translation, DRL, and GANs. However, it should be noted that PBT relies on basic stochastic perturbations to adapt hyperparameters, which may not effectively track changes in potentially optimal hyperparameter configurations over time.
Another study by Zhou et al. [36] introduced an online method for hyperparameter optimization for DRL. This method enhanced the existing PBT procedure, resulting in efficient online adaptation of hyperparameters. The authors incorporated a recombination operation inspired by GA into the population optimization process. This recombination operation accelerated the convergence of the population towards the optimal hyperparameter configuration. The authors empirically validated the effectiveness of this approach and demonstrated improved results compared to the classical PBT method, which aligns with their research findings.
A recent study by Parker-Holder et al. [37] introduced a novel, provably efficient PBT-style approach called Population-Based Bandits (PB2). PB2 is a procedure that can identify exceptional hyperparameter configurations using a smaller number of agents compared to traditional PBT. Through multiple RL trials, the authors demonstrated that PB2 can achieve remarkable performance levels while adhering to a moderate computational budget. In another study by Moghanian et al. [38], a swarm-based metaheuristic algorithm is employed to minimize errors in intrusion detection. In that approach, the Grasshopper Optimization Algorithm (GOA) is harnessed to enhance the precision of artificial neural networks (ANNs) in order to decrease the rate of intrusion detection errors.
Additionally, the authors in [39] and [40] discussed the advantages and disadvantages of different deep architectures, as well as the different optimization methods that have been used. Alqushaibi et al. [41] proposed a new weight optimization method based on the sine cosine algorithm (SCA). Balogun et al. [42] evaluated a number of different methods on a real-world dataset of software defects and showed that they can significantly improve the performance of defect prediction models.
To sum up, optimization strategies like PBT, which aim to learn optimal schedules for hyperparameters rather than relying on fixed settings, have shown promising results. However, these strategies can be susceptible to sample inefficiency, which can impact their performance. Through the review of the existing literature, it is evident that many papers have focused on utilizing grid search [43], Bayesian methods, or GA [33] to optimize various hyperparameters in DRL. While these approaches have demonstrated some success, they also have noticeable limitations. To overcome these limitations and achieve remarkable results, this work proposes the utilization of a metaheuristic optimization algorithm known as GWO. The GWO algorithm is employed to optimize the hyperparameters of the DDPG algorithm in two simulated Gymnasium environments provided by OpenAI, namely the 2DRobot and MountainCarContinuous simulation environments.

IV. THE PROPOSED FRAMEWORK
This section introduces the paper's primary contribution, which involves the implementation of the GWO algorithm to explore the hyperparameter space of the DDPG algorithm. The objective is to identify the set of hyperparameters that maximizes the total rewards obtained in the 2DRobot and MountainCarContinuous simulation environments, and then to compare the optimized results with those of the original DDPG algorithm. Table 1 presents the hyperparameters of the original DDPG algorithm used for comparison with the optimized DDPG-GWO algorithm within the 2DRobot and MountainCarContinuous simulation environments shown in Figure 2. This same set of hyperparameters was then tuned by the GWO to maximize the total rewards achieved in the aforementioned simulation environments.
The rest of this section is divided into two sub-sections: the training of DDPG networks using GWO (IV-A), and the learning process (IV-B).

A. TRAINING OF DDPG NETWORKS USING GWO
As shown in Figure 3 and Algorithm 1, the training process of DDPG networks using GWO begins by initializing the DDPG networks (actor and critic networks) and the GWO algorithm. The DDPG networks learn the policy and value functions, while the GWO algorithm generates a group of grey wolves for hyperparameter search. Each grey wolf's chosen hyperparameter set is evaluated by executing the DDPG algorithm and measuring its performance. A reward function reflecting the task's objective is used to assess performance.
The GWO algorithm updates the best solutions found so far, representing superior hyperparameter configurations. Each grey wolf's hyperparameter settings are modified based on the position of the optimal solutions. This collective intelligence guides the search toward promising regions in the hyperparameter space. Once optimized, the DDPG networks are trained using the updated configuration.
The training phase involves iterative interactions with the environment, experience collection, and network parameter updates. Network performance is periodically evaluated to monitor progress. If satisfactory performance or convergence is achieved, the training process can be halted.
If performance is unsatisfactory, the GWO algorithm refines the hyperparameters, and the DDPG networks undergo further training. The iterative nature of training enables continuous exploration and exploitation of the hyperparameter space. This leads to the identification of hyperparameter configurations maximizing the DDPG algorithm's overall performance.
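A compact Python sketch of this outer optimization loop is given below. It assumes a helper train_and_evaluate_ddpg(hyperparams) that trains a DDPG agent with the given hyperparameters and returns the total reward used as the fitness value, and it reuses the gwo_step sketch from Section II-C; the search bounds are illustrative placeholders rather than the exact ranges used in this study.

```python
import numpy as np

# Illustrative bounds for the seven tuned hyperparameters (placeholders, not the paper's ranges):
# [actor lr, critic lr, discount factor, exploration rate, batch size, Polyak, target-net lr]
LOWER = np.array([1e-4, 1e-4, 0.90, 0.01, 16.0, 1e-3, 1e-4])
UPPER = np.array([1e-2, 1e-2, 0.999, 0.50, 256.0, 1e-1, 1e-2])

def gwo_hyperparameter_search(train_and_evaluate_ddpg, gwo_step,
                              n_wolves=20, max_iter=100, seed=0):
    """Search DDPG hyperparameters with GWO; each wolf's fitness is the total reward."""
    rng = np.random.default_rng(seed)
    positions = rng.uniform(LOWER, UPPER, size=(n_wolves, LOWER.size))
    best_position, best_fitness = None, -np.inf
    for t in range(max_iter):
        fitness = np.array([train_and_evaluate_ddpg(p) for p in positions])
        order = np.argsort(fitness)[::-1]                 # maximise total reward
        alpha, beta, delta = (positions[order[k]].copy() for k in range(3))
        if fitness[order[0]] > best_fitness:
            best_fitness, best_position = fitness[order[0]], alpha.copy()
        positions = gwo_step(positions, alpha, beta, delta, t, max_iter, rng)
        positions = np.clip(positions, LOWER, UPPER)      # keep wolves inside the bounds
    return best_position, best_fitness
```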

B. LEARNING PROCESS
In this section, we explicitly explain the learning process employed to optimize the hyperparameters of the DDPG algorithm using the GWO approach. The training process of each wolf in GWO to optimize the DDPG hyperparameters can be represented using equations, by denoting the hyperparameters to be optimized as h and the position of each wolf as x. In GWO, the three leading wolves, alpha, beta, and delta, are represented as x_α, x_β, and x_δ, respectively. The position update equation for each wolf in the GWO can be defined as
$$x_{new} = x - A \cdot D,$$
where x_new is the updated position, x is the current position, A is the updating amplitude, and D is the random distance vector. The updating amplitude A is calculated as
$$A = 2a \cdot r - a,$$
where a is a linearly decreasing parameter and r is a random number between 0 and 1. The random vector D is calculated using the positions of the alpha (α), beta (β), delta (δ), and omega (ω) wolves:
$$D = \left|C \cdot x_{p} - x\right|,$$
where x_p is the position of one of the leading wolves (x_α, x_β, or x_δ), x is the position of the current (omega) wolf, and C is a constant that controls the influence of each wolf's position on the update. Finally, the updated hyperparameters h_new can be obtained by applying the GWO update equation to each element of h:
$$h_{new}[i] = x_{new}[i],$$
where h[i] and x_new[i] denote the i-th element of h and x_new, respectively. By iteratively applying the above equations for each wolf in the GWO, the hyperparameters h can be optimized to enhance the performance of the DDPG algorithm.
The GWO algorithm refines the positions of its wolf population, and these refined positions are subsequently used to update the hyperparameters of the DDPG algorithm. This iterative process continues to optimize the hyperparameters, ultimately enhancing the performance of the DDPG algorithm.
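The mapping h_new[i] = x_new[i] simply reinterprets a wolf's updated position vector as a set of DDPG hyperparameters. A minimal sketch of this decoding step is shown below; the dimension ordering and key names are assumptions based on the seven hyperparameters listed in Section I.

```python
HYPERPARAM_NAMES = ["actor_lr", "critic_lr", "discount_factor", "exploration_rate",
                    "batch_size", "polyak", "target_lr"]

def position_to_hyperparams(x_new):
    """Decode a wolf's position vector into named DDPG hyperparameters (h_new[i] = x_new[i])."""
    h_new = dict(zip(HYPERPARAM_NAMES, x_new))
    h_new["batch_size"] = int(round(h_new["batch_size"]))  # integer-valued hyperparameters are rounded
    return h_new
```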

V. EXPERIMENTS
This section presents the experimental analysis conducted to compare the performance of the original DDPG algorithm with our proposed DDPG-GWO method in two distinct simulation environments: the 2DRobot and the MountainCarContinuous. By evaluating the results obtained from both algorithms, we aim to assess the effectiveness and improvements brought by the DDPG-GWO method. A comparison is made between these optimized hyperparameters and the original DDPG hyperparameters used in the same environments as those presented by Lillicrap et al. [8].
The rest of this section is structured as follows: the experimental setting (V-A), the 2D robot environment results (V-B), and the MountainCarContinuous environment results (V-C).

A. EXPERIMENTAL SETTINGS
Table 2 shows the experimental settings used to perform the experimental analysis, including the hardware setup we employed.
Algorithm 1 (excerpt): initialize the main critic network Q(s, a) and the main actor network µ(s) with random weights; initialize the target critic network Q′ and target actor network µ′ with weights from the main networks; set the hyperparameters (α_actor, α_critic, γ, batch size, τ) based on the wolf position; initialize the replay buffer R and the action exploration process.
For both the 2DRobot and MountainCarContinuous environments, we utilized the DDPG algorithm as the baseline. The DDPG algorithm was implemented with the following configurations: a learning rate of 0.001, a discount factor (gamma) of 0.99, a target network update rate of 0.001, a replay buffer size of 100,000, and a batch size of 64.
In contrast, our proposed approach, DDPG-GWO, introduced an optimization technique called GWO into the DDPG framework. This hybrid algorithm incorporated the following parameters: a GWO population size of 20, a maximum number of iterations set to 100, and exploration and exploitation factors set to 2 and 0.5, respectively.
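The settings above can be summarized as two plain configuration dictionaries; the key names are illustrative, while the values are those stated in this sub-section.

```python
# Baseline DDPG configuration (Section V-A).
BASELINE_DDPG_CONFIG = {
    "learning_rate": 1e-3,
    "discount_factor": 0.99,
    "target_update_rate": 1e-3,   # tau
    "replay_buffer_size": 100_000,
    "batch_size": 64,
}

# GWO settings used by the proposed DDPG-GWO.
GWO_CONFIG = {
    "population_size": 20,
    "max_iterations": 100,
    "exploration_factor": 2,
    "exploitation_factor": 0.5,
}
```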

B. RESULTS FOR THE 2DROBOT ENVIRONMENT
This sub-section presents the 2DRobot environment's experimental analysis conducted to compare the performance of the original DDPG algorithm (V-B1) with our proposed DDPG-GWO (V-B2).

1) ORIGINAL DDPG ALGORITHM'S RESULTS IN 2DROBOT ENVIRONMENT
The 2DRobot agent's performance was unstable during its training process, as shown in Figure 4. In the beginning, it showed poor performance, with negative rewards ranging from −150 to −50. However, there was some improvement, with positive rewards between episodes 400 and 800. Unfortunately, the agent's behavior regressed to a sub-optimal state after that. The inconsistency in its performance could be attributed to various factors, such as hyperparameters like a high exploration rate of 0.1 and relatively small learning rates of 0.0005. Additionally, the agent's architecture, with hidden layers of size (16, 16) for both the actor and the critic, may not have been adequate for the task at hand.

2) DDPG-GWO ALGORITHM'S RESULTS IN 2DROBOT ENVIRONMENT
Figure 5 illustrates promising outcomes for the optimized DDPG-GWO algorithm compared with the original DDPG algorithm (Figure 4). Throughout 892 episodes, the average episode reward ranged from −100 to 100, with a running average of the previous 100 episode rewards fluctuating between 1 and 800. The algorithm effectively explored the environment using a moderate exploration rate of 0.15. The critic and actor networks had learning rates of 0.002 and 0.001, respectively. A discount factor of 0.96 was applied to future rewards, enabling the agent to balance immediate and future rewards. Experience replay during training was facilitated by a memory size of 13,519, contributing to improved performance. The target networks' stable update mechanism was achieved through Polyak averaging with a value of 0.006. Exploration utilized an Ornstein-Uhlenbeck (OU) process as a noise source, with the noise standard deviation set to 0.08.

C. RESULTS FOR THE MOUNTAINCARCONTINUOUS ENVIRONMENT
This sub-section presents the MountainCarContinuous environment's experimental analysis conducted to compare the performance of the original DDPG algorithm (V-C1) with our proposed DDPG-GWO (V-C2).

1) ORIGINAL DDPG ALGORITHM'S RESULTS IN MOUNTAINCARCONTINUOUS ENVIRONMENT
As depicted in Figure 6, the DDPG results for the MountainCarContinuous task are not satisfactory. One major problem is the rather low average reward, which even goes negative at times, indicating frequent failures in reaching the goal. It is possible that the chosen actor and critic network structures are not expressive enough for the task, and the high noise standard deviation may be causing erratic policy updates. Moreover, the algorithm seems to be struggling to effectively utilize past experiences, pointing to potential issues with the memory or the learning process itself.

2) DDPG-GWO ALGORITHM'S RESULTS IN MOUNTAINCARCONTINUOUS ENVIRONMENT
The integration of GWO and DDPG in the MountainCarContinuous environment demonstrated remarkable improvements in hyperparameter optimization, as shown in Figure 7, leading to a significant boost in performance. The average episode reward increased substantially, reaching values between 20 and 80, indicating a marked enhancement in learning efficiency. The algorithm achieved a running average of the previous 100 episode rewards throughout 218 episodes, with a total of 490 time steps. Stability was ensured through the utilization of OU noise with a standard deviation of 0.184. These outcomes strongly suggest that the DDPG-GWO approach holds great potential to elevate the performance of the algorithm across a wide range of real-world applications.

VI. DISCUSSION AND ANALYSIS
The performance comparison analysis between the original DDPG and the proposed DDPG-GWO is illustrated in Figure 8. It indicates that the DDPG-GWO algorithm outperforms the original DDPG algorithm in both the 2DRobot and MountainCarContinuous environments. In the 2DRobot environment, the DDPG-GWO algorithm demonstrated promising outcomes with a moderate exploration rate of 0.15 and optimized hyperparameters. It effectively explored the environment, achieving a running average of 100-episode rewards between 1 and 800. On the other hand, DDPG alone exhibited unstable performance with negative rewards, indicating difficulties in learning an effective policy. This suggests that the integration of GWO in the DDPG algorithm aids in better hyperparameter optimization and learning efficiency, resulting in more stable and improved performance.
Furthermore, Table 3 provides a comparative analysis of the original DDPG algorithm and the proposed DDPG-GWO algorithm in the 2DRobot and MountainCarContinuous environments. It shows that our proposed DDPG-GWO algorithm outperforms the original DDPG algorithm, achieving an average episode reward between −25 and 85 compared to −150 to 50 for DDPG. Additionally, the running average of DDPG-GWO over its 892 episodes is significantly higher than that of the original DDPG over 1000 episodes. Similarly, in the MountainCarContinuous environment, the DDPG-GWO performs better, with an average episode reward between 20 and 80, while the original DDPG ranges from −15 to 15. Moreover, the running average of DDPG-GWO over 218 episodes is superior to that of DDPG over 1000 episodes.
The proposed DDPG-GWO algorithm in this study holds significant theoretical and practical implications. The theoretical significance lies in its effective hyperparameter optimization, addressing a critical challenge in DRL. By employing GWO as a metaheuristic algorithm, DDPG-GWO demonstrates improved learning performance with faster convergence rates, enhancing control strategies in the simulated Gymnasium environments. This research enriches the theoretical foundations of DRL algorithms and explores the application of nature-inspired optimization techniques, contributing to the advancement of DRL methodologies.
On the practical front, the DDPG-GWO algorithm offers real-world applicability. Its potential to optimize decision-making processes in industries like robotics, autonomous vehicles, and finance holds significant promise. The algorithm's ability to handle complex continuous action spaces showcases its relevance and adaptability to diverse scenarios. Additionally, by reducing training time and resource requirements, the DDPG-GWO presents cost-effective and efficient AI-driven systems, paving the way for more practical and impactful implementations in dynamic and uncertain environments. Overall, the successful integration of GWO into DDPG signifies the potential of metaheuristic algorithms to enhance learning performance and control strategy, with implications spanning various industries and research areas.
Table 4 shows the hyperparameters comparison between the original DDPG and the proposed DDPG-GWO algorithms.

VII. CONCLUSION AND FUTURE WORKS
To conclude, the DDPG-GWO resulted in significant improvements in hyperparameter optimization and learning efficiency in both the 2DRobot and MountainCarContinuous environments. In the 2DRobot environment, the optimized hyperparameters led to episode rewards ranging from −100 to 100 across 892 episodes, achieving a balanced trade-off between immediate and future rewards with a 0.96 discount factor. Stability during training was ensured through experience replay with a memory size of 13,519 and Polyak averaging of 0.006. Similarly, in the MountainCarContinuous environment, DDPG-GWO notably enhanced the average episode reward, which reached values between 20 and 80 over 218 episodes, indicating a marked improvement in learning efficiency. The stability was maintained with the OU noise process having a standard deviation of 0.184. Nevertheless, the original DDPG's performance in both environments fell short, likely due to the actor-critic networks' insufficient complexity, the high noise standard deviation, and ineffective utilization of past experiences. These findings highlight the value of DDPG-GWO in elevating performance and learning efficiency across diverse real-world applications, while also underscoring potential areas for future improvement.
The paper's future endeavors involve bolstering DDPG algorithms to achieve better sample efficiency, managing high-dimensional inputs more effectively, and tackling non-stationarity. Researchers have the opportunity to investigate hybrid methods, fusing DDPG with other algorithms, and pushing the boundaries of theoretical comprehension. As DDPG gains enhanced robustness and dependability, its real-world applications in robotics, finance, and healthcare are expected to witness increased adoption.

ABBREVIATIONS
Table 5 contains definitions for all abbreviations used in this study.

FIGURE 2 .
FIGURE 2. The experimental environments setup: (A) ''2DRobot'' comprises a 2D robotic arm setup, posing a challenging control task to the learning agents, and (B) ''MountainCarContinuous'' involves a continuous-action variant of the classic ''Mountain Car'' problem, where an agent must learn to navigate a car to surmount a steep hill using continuous acceleration.

FIGURE 3 .
FIGURE 3. The flowchart of the proposed DDPG-GWO hyperparameter optimization in this study.

FIGURE 8 .
FIGURE 8. Comparison between the original DDPG and the proposed DDPG-GWO in 2DRobot and MountainCarContinuous.

TABLE 4 .
Hyperparameters comparison between the original DDPG and the proposed DDPG-GWO algorithms.