Deep Reinforcement Learning Versus Evolution Strategies: A Comparative Survey

Deep Reinforcement Learning (DRL) has the potential to surpass human-level control in sequential decision-making problems. Evolution Strategies (ESs) have different characteristics than DRL, yet they are promoted as a scalable alternative. To gain insight into their strengths and weaknesses, this paper puts the two approaches side by side. After presenting the fundamental concepts and algorithms of each approach, we compare them from the perspectives of scalability, exploration, adaptation to dynamic environments, and multi-agent learning. The paper then discusses hybrid algorithms that combine aspects of DRL and ESs, and how they attempt to capitalize on the benefits of both techniques. Lastly, both approaches are compared based on the set of applications they support, showing their potential for tackling real-world problems. This paper aims to present an overview of how DRL and ESs can be used, either independently or in unison, to solve specific learning tasks. It is intended to guide researchers in selecting the method that suits them best, and it provides a bird's-eye view of the overall literature in the field. We also provide application scenarios and open challenges.


I. INTRODUCTION
In the biological world, the intellectual capabilities of humans and animals have developed through a combination of evolution and learning. On the one hand, evolution has allowed living beings to improve genetically over successive generations, such that higher forms of intelligence have appeared. On the other hand, the learning capability of animals and humans allows them to adapt rapidly to new situations.
In the race to develop artificial general intelligence, these two phenomena have motivated two distinct approaches that could both play an important role in the quest for intelligent machines. From the learning perspective, Reinforcement Learning (RL) shows many parallels with how humans and animals deal with new, unknown sequential decision-making tasks. Meanwhile, Evolution Strategies (ESs) are engineering methods inspired by the mechanism that let intelligence emerge in the biological world: repeatedly selecting the best-performing individuals. In this paper, we put RL and ESs side by side, analyze their strengths and weaknesses for sequential decision-making tasks, and shed light on their potential development directions.
The RL approach is formalized as an agent acting on an environment that seeks to optimize an expected sum of rewards over its trajectory [1]. Imagine playing a table tennis game (environment) with a robot (agent). The robot has not explicitly been programmed to play the game; instead, it can observe the score of the game (rewards). The robot's goal is to maximize its own perceived score. For that purpose, it tries different techniques of hitting the ball (actions), observes the outcome, and gradually enhances its playing strategy (policy). Despite the proven convergence of RL algorithms to optimal policies (the best solutions to the problems at hand), they face difficulties as the dimensionality of the data grows (such as images or time series). Deep RL (DRL) algorithms [2] attempt to resolve this by combining RL algorithms with deep neural networks (DNNs), allowing them to tackle sequential decision-making problems with high-dimensional inputs.
As a contrasting approach to RL, ESs are a set of derivative-free optimization algorithms that iteratively select individuals of a population based on their performance in terms of an objective function [3]. In recent years, ESs have seen an increase in popularity and have been successfully applied to several applications, including optimizing objective functions for many RL tasks [4,5]. The parallel development of DRL and ESs indicates that each has its advantages (and disadvantages), depending on the problem setup. To enable scientists and researchers to choose the best algorithm for the problem at hand, we summarize the pros and cons of these approaches in a comparative survey: we compare DRL and ESs from different learning aspects, such as scalability, exploration, and the ability to learn in dynamic environments, and from an application standpoint (Figure 1). We also discuss how combining DRL and ESs in hybrid systems can leverage the advantages of both approaches.
To date, there have been different papers summarizing different features of DRL and ESs. For example, derivative-free reinforcement learning (e.g., ESs) has been reviewed in [6], covering aspects such as scalability and exploration. A survey related to DRL for autonomous driving is provided in [7], and the challenges, solutions, and applications of multi-agent DRL systems are reviewed in [8]. However, contrasting with prior work, our paper surveys the literature with a bird's-eye view, focusing on the main developmental directions instead of individual algorithms.
The rest of the paper is organized as follows: Section II presents the fundamental architectural concepts behind RL and ESs; Section III summarizes fundamental algorithms of RL, DRL and ESs; Sections IV-A, IV-B, IV-C and IV-D compare the capabilities of DRL and ESs; Section V presents hybrid systems that combine DRL and ESs; Section VI compares them from an application point of view; and Section VII outlines open challenges and potential research directions. Finally, we conclude the paper in Section VIII. The main takeaways of each section are summarized in a concise subsection titled "Comparison".

II. FUNDAMENTALS
This section covers the fundamental elements of DRL and ESs, including formal definitions and the main algorithmic families.

A. Reinforcement Learning
Reinforcement Learning (RL) is a computational approach to understanding and automating goal-directed learning and decision making [1]. The goal of an RL agent is to maximize the total reward it receives when interacting with an environment (Figure 2a), which is generally modeled as a Markov Decision Process (MDP). An MDP is defined by the tuple $(S, A, T, R)$, where $S$ denotes the state space; $A$ is the action space; $T(s, a, s')$ is a transition function that defines the probability of transitioning from the current state $s$ to the next state $s'$ after an agent takes action $a$; and $R(s, a, s')$ is the reward function that defines the immediate reward $r$ that the agent observes after taking action $a$ and the environment transitions from $s$ to $s'$.
The total return starting from time $t$ until the end of the interaction between an agent and its environment is expressed as

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $R_t$ and $G_t$ are random variables modeling the immediate reward, $r$, and the return obtained at time $t$, respectively, and $\gamma \in [0, 1)$ is a discount factor that weights the immediate and future rewards. Value functions are the expected return of being in a state or taking a particular action. The state-value function $V^\pi(s)$ gives the expected return from state $s$ when following policy $\pi$,

$$V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r|s, a)\left[r + \gamma V^\pi(s')\right]. \tag{1}$$

The action-value function (or Q-function) $Q^\pi(s, a)$ is the expected return of taking action $a$ in state $s$ and following policy $\pi$ thereafter,

$$Q^\pi(s, a) = \sum_{s', r} p(s', r|s, a)\left[r + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a')\right]. \tag{2}$$

The action selection process of an agent is governed by its policy, which in the general stochastic case yields an action according to a probability distribution over the action space conditioned on a given state, $\pi(a|s)$. There are four main RL algorithmic families:

Policy-based Algorithms. A policy-based algorithm optimizes and memorizes a policy explicitly, that is, it directly searches the policy space for an (approximately) optimal policy, $\pi^*$. Examples of such algorithms are policy iteration [9], policy gradient [10] and REINFORCE [11]. Policy-based algorithms can be applied to any type of action space: continuous, discrete or a mixture (multi-actions). However, these algorithms generally have high variance and are sample-inefficient.
Value-based Algorithms. A value-based algorithm learns a value function, based either on the state, $V^\pi(s)$, or on the state and action, $Q^\pi(s, a)$. Then, a policy is extracted according to the learned value function. Examples of such algorithms are value iteration [12], SARSA [13], Q-learning and DQN [14]. Value-based algorithms are more sample-efficient than policy-based ones. However, under ordinary circumstances the convergence of these algorithms is not guaranteed.
Actor-critic-based Algorithms. Whereas the two algorithm classes mentioned above each have their own strong and weak points, the actor-critic approach tries to combine the strengths of both into a single algorithmic architecture [15]. The actor is a policy-based algorithm that tries to learn the optimal policy, whereas the critic is a value-based algorithm that evaluates the actions taken by the actor.
Model-based Algorithms. All of the algorithmic families mentioned previously concern model-free algorithms. In contrast, model-based algorithms learn or make use of a model of the transition dynamics of an environment. Once an agent has access to such a model, it can use it to "imagine" the consequences of taking a particular set of actions without acting on the environment. Such a capability enables an RL agent to evaluate the expected actions of an opponent in games [16,17] and to make better use of gathered data, which is very useful in tasks such as robot control [18]. However, for many problems, it is difficult to produce models that are close to reality.
Deep Reinforcement Learning (DRL) refers to the combination of Deep Learning (DL) and RL (Figure 2b) [2]. DRL uses deep neural networks (DNNs) to approximate one of the learnable functions of RL. Correspondingly, there are three main families of DRL algorithms: value-based, policy-based, and model-based [14,16,19]. For example, the DNN of a policy-based DRL agent takes the state of the environment as input and produces an action as output (Figure 2b). The action selection process is governed by the parameters $\theta$ of the DNN, which are optimized with a backpropagation algorithm during the training phase.
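To make the value functions of Equations (1) and (2) concrete, the following minimal Python sketch evaluates a fixed policy on a tiny tabular MDP by repeatedly applying Equation (1). The two-state MDP, the dictionary layout `P[s][a]`, the helper names and the tolerance are illustrative assumptions, not part of the surveyed work.

```python
# Hypothetical two-state, two-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def policy_evaluation(policy, P, gamma, tol=1e-10):
    """Apply Eq. (1) as an update rule until the value estimates stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def q_from_v(V, P, gamma):
    """Eq. (2), using the fact that the inner sum over a' equals V(s')."""
    return {
        s: {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}
        for s in P
    }


# A deterministic policy that always picks action 1, written as pi(a|s).
always_one = {s: {0: 0.0, 1: 1.0} for s in P}
V = policy_evaluation(always_one, P, GAMMA)
Q = q_from_v(V, P, GAMMA)
print(V, Q)
```

Under this policy the agent ends up collecting a reward of 2 per step in state 1, so the value of state 1 approaches 2 / (1 - 0.9) = 20 and the value of state 0 approaches 1 + 0.9 · 20 = 19.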

B. Evolution Strategies
Evolution Strategies (ESs) are a set of population-based black-box optimization algorithms, often applied to problems with continuous search spaces [20,21]. ESs do not require the problem to be modeled as an MDP, nor does the objective function $f(x)$ have to be differentiable or continuous. The latter explains why ESs are gradient-free optimization techniques. They do, however, require the objective function to assign a fitness value to (i.e., to evaluate) each input $x \in \mathbb{R}^n$, such that $f : \mathbb{R}^n \to \mathbb{R}$. The basic idea behind ESs is to bias the sampling of candidate solutions towards the best individuals found so far until a satisfactory solution is found. Samples are drawn from a (multivariate) normal distribution whose shape (i.e., the mean $m$ and the standard deviation $\sigma$) is described by what are called strategy parameters. These can be modified online to make the search process more efficient. The generic ES process is shown in Figure 2c and its elements are explained below:
1) Initialization: the algorithm generates an initial population $P$ consisting of $\mu$ individuals.
2) Parent selection: a subset of the population is selected to function as parents during the recombination step.
3) Reproduction consists of two steps:
   a) Recombination: two or more parents are combined to produce a mean for the new generation.
   b) Mutation: a small amount of noise is added to the recombination result. A common way of implementing mutation is to sample from a multivariate normal distribution centered around the mean obtained from the previous recombination step:
   $$x_i^{(g+1)} = m^{(g)} + \sigma^{(g)} \mathcal{N}(0, I), \quad i = 1, \dots, k,$$
   where $g$ is the generation index, $k$ is the number of offspring, and $I$ is the identity matrix.
4) Evaluation: a fitness value is assigned to each candidate solution using the objective function $f(x_i)$.
5) Survivor selection: the best $\mu$ individuals are selected to form the population for the next generation.
Generally, the algorithm iterates from step 2 to step 5 until a satisfactory solution is found.
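The five steps above can be sketched in a few lines of Python. This is only an illustrative instance under stated assumptions: the toy objective `sphere_fitness`, the fixed step size, the population sizes, and the choice to let all survivors act as parents and to select survivors from the offspring only are not prescribed by the text.

```python
import random


def sphere_fitness(x):
    """Toy objective (an assumption for illustration): higher is better, optimum at 0."""
    return -sum(v * v for v in x)


def simple_es(fitness, dim=5, mu=5, k=20, sigma=0.3, generations=200, seed=0):
    rng = random.Random(seed)
    # 1) Initialization: mu random individuals.
    population = [[rng.uniform(-3, 3) for _ in range(dim)] for _ in range(mu)]
    for _ in range(generations):
        # 2) Parent selection: here every survivor acts as a parent.
        parents = population
        # 3a) Recombination: the average of the parents becomes the new mean.
        mean = [sum(p[d] for p in parents) / len(parents) for d in range(dim)]
        # 3b) Mutation: k offspring drawn from N(mean, sigma^2 I).
        offspring = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean] for _ in range(k)]
        # 4) Evaluation and 5) survivor selection: keep the best mu offspring.
        offspring.sort(key=fitness, reverse=True)
        population = offspring[:mu]
    return max(population, key=fitness)


best = simple_es(sphere_fitness)
print(best, sphere_fitness(best))
```

Because the step size is kept fixed here, the search stalls at a distance from the optimum that is governed by `sigma`; adapting the strategy parameters online removes this limitation.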
The idea of employing ESs as an alternative to RL is not new [22,23,24,25], but recently it has seen a renewed interest (e.g. [4,26]).

C. Comparison
Our main takeaways from the above fundamental concepts are as follows:
• The objective of an RL algorithm is to maximize the sum of discounted rewards, whereas an ES algorithm does not require such a formulation. However, the objective of an RL setting can be converted to an ES setting with a terminal state that provides a reward equivalent to the fitness function.
• The problem setup differs between RL and ESs. An ES algorithm is a black-box optimization method that keeps a pool of multiple candidate solutions, while an RL method generally has a single agent that improves its policy by interacting with its environment.
• An ES algorithm aims at finding candidate solutions that optimize a fitness function, whereas the goal of DRL is to keep improving one or two function approximators, which in turn need to optimize the equivalent of the fitness function, usually defined by the discounted return.
• The ES approach is most similar to the policy-based DRL approach: both aim at finding parameters in a search space such that the resulting parameterized function optimizes certain objectives (expected return for DRL or fitness score for ESs). The main distinction is that ESs, unlike DRL, neither calculate gradients nor use backpropagation.
• Value-based RL methods usually operate in discrete action spaces, while the actor-critic architecture extends this ability to continuous action spaces. ESs can operate on discrete or continuous action spaces by default.

III. FUNDAMENTAL ALGORITHMS
Fundamental algorithms of (D)RL and ES are introduced in this section.

A. Reinforcement Learning Algorithms
SARSA is a model-free algorithm that leverages temporal differences for prediction [1]. It updates the Q-value, $Q(s_t, a_t)$, while following a policy. The interaction between the agent and the environment results in the sequence $\dots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \dots$: the agent takes an action $a_t$ while in state $s_t$; consequently, the environment transitions to state $s_{t+1}$ and the agent observes a reward $r_{t+1}$. For action selection, SARSA uses the ε-greedy algorithm, which selects the action with the maximum Q-value in $s_t$ with probability $1 - \varepsilon$ and otherwise draws an action uniformly from $A$. SARSA is an on-policy algorithm, that is, it evaluates and improves the same policy that selects the taken actions. SARSA's update equation is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right],$$

where $\alpha$ is the learning rate.

Q-Learning [1] is similar to SARSA with a key difference: it is an off-policy algorithm, which means that it learns an optimal Q-value function from data obtained via any policy (without introducing a bias). The update rule of Q-learning is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right].$$

Off-policy algorithms are more data-efficient than on-policy ones because they can use the collected data repeatedly.

REINFORCE [11] is a fundamental stochastic gradient algorithm in the policy-gradient family. It leverages a DNN to approximate the policy $\pi$ and updates its parameters $\theta$. The network receives an input from the environment and outputs a probability distribution over the action space, $A$. The steps involved in the implementation of REINFORCE are:
1) Initialize a random policy (i.e., the parameters of a DNN).
2) Use the policy $\pi_\theta$ to collect a trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, a_H, r_{H+1}, s_{H+1})$.
3) Estimate the return $R(\tau)$ for this trajectory.
4) Use the estimate of the return to calculate the policy gradient: $\nabla_\theta J(\theta) \approx \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t|s_t) \, R(\tau)$.
5) Adjust the weights $\theta$ of the policy: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
6) Repeat from step 2 until termination.
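The REINFORCE steps listed above can be illustrated on a deliberately small problem. The sketch below is a simplification under stated assumptions: the policy is a softmax over one logit per action rather than a DNN, each trajectory has a single step (a two-armed bandit with hypothetical fixed rewards), and the function names and hyperparameters are chosen for illustration only.

```python
import math
import random


def softmax(theta):
    """Turn logits into a probability distribution over actions."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(rewards=(0.0, 1.0), alpha=0.1, episodes=2000, seed=0):
    """REINFORCE on a one-step task; theta holds one logit per action."""
    rng = random.Random(seed)
    theta = [0.0 for _ in rewards]  # step 1: initial (uniform) policy
    for _ in range(episodes):
        probs = softmax(theta)
        # Step 2: collect a trajectory (here a single action).
        a = rng.choices(range(len(theta)), weights=probs)[0]
        # Step 3: the return of this one-step trajectory.
        ret = rewards[a]
        # Step 4: d/d theta_k of log pi(a) is 1[k == a] - pi(k) for a softmax policy.
        grad = [((1.0 if k == a else 0.0) - probs[k]) * ret for k in range(len(theta))]
        # Step 5: gradient ascent on the expected return.
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)
```

Since only the rewarded arm ever produces a non-zero gradient in this toy setup, the probability of selecting it grows over training.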
Deep Q-network (DQN) [28] combines Q-learning with a deep convolutional neural network (CNN) [33] to act in environments with high-dimensional input spaces (e.g., images of Atari games). It gets a state (e.g., a mini-batch of images) as input and produces the Q-values of all possible actions. The CNN is used to approximate the optimal action-value function (or Q-function). Such function approximation, however, can make the DRL agent unstable [34]. To counter that, DQN samples from an experience replay [35] dataset $D_t = \{(s_1, a_1, r_2, s_2), \dots, (s_t, a_t, r_{t+1}, s_{t+1})\}$ and uses a target network that is updated only after a certain number of iterations. To update the network parameters at iteration $i$, DQN uses the loss function

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right],$$

where $\theta_i$ and $\theta_i^-$ are the parameters of the Q-network and the target network, respectively, and the experiences $(s, a, r, s')$ are drawn uniformly from $D$.
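A minimal sketch of how the replay buffer and the target network enter DQN's squared temporal-difference loss is given below. It is not the original implementation: the two Q-functions are hypothetical lookup tables standing in for the CNNs, and the `done` flag marking terminal transitions is an added assumption that the text does not mention.

```python
import random
from collections import deque

GAMMA = 0.99


def dqn_loss(batch, q_online, q_target, gamma=GAMMA):
    """Mean squared TD error; the bootstrap term uses the frozen target parameters."""
    total = 0.0
    for s, a, r, s_next, done in batch:
        # No bootstrapping from terminal states (assumption: transitions carry a done flag).
        bootstrap = 0.0 if done else gamma * max(q_target[s_next])
        td_error = r + bootstrap - q_online[s][a]
        total += td_error ** 2
    return total / len(batch)


# Hypothetical tabular stand-ins for the Q-network and the target network
# (2 states x 2 actions); a real DQN would use neural networks here.
q_online = [[0.0, 0.0], [0.0, 0.0]]
q_target = [[0.0, 0.0], [1.0, 3.0]]

replay = deque(maxlen=1000)            # experience replay dataset D
replay.append((0, 1, 1.0, 1, False))   # (s, a, r, s', done)
replay.append((1, 0, 0.0, 1, True))

rng = random.Random(0)
batch = rng.sample(list(replay), k=2)  # uniform draw from D
loss = dqn_loss(batch, q_online, q_target)
print(loss)
```

In a full agent, the gradient of this loss would be taken with respect to the online parameters only, and the target table would be overwritten with a copy of the online one every fixed number of updates.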

B. Evolutionary Strategies Algorithms
The (1+1)-ES (one parent, one offspring) is the simplest ES, conceived by Rechenberg [36]. First, a parent candidate solution, $x_p$, is drawn according to a uniform random distribution from an initial set of solutions, $\{x_i, x_j\}$. The selected parent, $x_p$, together with its fitness value, enters the evolution loop. In each generation (or iteration), an offspring candidate solution, $x_o$, is created by adding a vector drawn from an uncorrelated multivariate normal distribution to $x_p$ as follows:
$$x_o = x_p + \sigma y, \quad y \sim \mathcal{N}(0, I).$$
If the offspring $x_o$ is found to be fitter than the parent $x_p$, it becomes the new parent for the next generation; otherwise it is discarded. This process is repeated until a termination condition is met. The amount of mutation (or perturbation) added to $x_p$ is controlled by the step-size parameter $\sigma$. The value of $\sigma$ is updated every predefined number of iterations according to the well-known 1/5th success rule [37,38]: if $x_o$ is fitter than $x_p$ in 1/5th of the cases, $\sigma$ stays the same; if $x_o$ is fitter in more than 1/5th of the cases, $\sigma$ is increased; otherwise, it is decreased.

The $(\mu/\rho \overset{+}{,} \lambda)$-ES was originally proposed by Schwefel [39] as an extension of the (1+1)-ES. Instead of using one parent to generate one offspring, it uses $\mu$ parents to generate $\lambda$ offspring using both recombination and mutation. In the comma variation of this algorithm (i.e., the $(\mu/\rho, \lambda)$-ES), the parents for the next generation are selected solely from the offspring, whereas in the plus variation they are selected from the union of the offspring and the old parents. The $\rho$ in the name of the algorithm refers to the number of parents used to generate each offspring.
An element (or individual) that the $(\mu/\rho \overset{+}{,} \lambda)$-ES evolves consists of $(x, s, f)$, where $x$ is the candidate solution, $s$ are the strategy parameters that control the magnitude of the mutation, and $f$ holds the fitness value of $x$. Consequently, the evolution process itself tunes the strategy parameters, which is known as self-adaptation. Thus, unlike the (1+1)-ES, the $(\mu/\rho \overset{+}{,} \lambda)$-ES does not need external control settings to adjust the strategy parameters.

Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is one of the most popular gradient-free optimization algorithms [40,41,42,43]. To search a solution space, it samples a population of $\lambda$ new search points (offspring) from a multivariate normal distribution:
$$\mathbf{x}_i^{(g+1)} = \mathbf{m}^{(g)} + \sigma^{(g)} \mathcal{N}(\mathbf{0}, \mathbf{C}^{(g)}) \quad \text{for } i = 1, \dots, \lambda,$$
where $g$ is the generation number (i.e., $g = 1, 2, 3, \dots$), $\mathbf{x}_i \in \mathbb{R}^n$ is the $i$-th offspring, $\mathbf{m}$ and $\sigma$ denote the mean and standard deviation of $\mathbf{x}$, $\mathbf{C}$ represents the covariance matrix, and $\mathcal{N}(\mathbf{0}, \mathbf{C})$ is a multivariate normal distribution. To compute the mean for the next generation, $\mathbf{m}^{(g+1)}$, CMA-ES computes a weighted average of the $\mu$ best candidate solutions (ranked by their fitness values), where $\mu < \lambda$ represents the parent population size. Through this selection and the assigned weights, CMA-ES biases the computed mean towards the best candidate solutions of the current population. It automatically adapts the step size $\sigma$ (the mutation strength) using the Cumulative Step-size Adaptation (CSA) algorithm [40] and an evolution path, $\mathbf{p}_\sigma$: if $\mathbf{p}_\sigma$ is longer than the expected length of the evolution path under random selection, $\mathbb{E}\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|$, the step size is increased; otherwise, it is decreased. To direct the search towards promising directions, CMA-ES updates the covariance matrix in each iteration.
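As an illustration of the (1+1)-ES and the 1/5th success rule described above, the sketch below maximizes a toy objective. The adaptation window, the multiplicative factor, the starting point and the objective `neg_sphere` are assumptions chosen for illustration; the text does not fix these values.

```python
import random


def one_plus_one_es(fitness, x0, sigma=1.0, iterations=600, window=20, factor=1.5, seed=0):
    """(1+1)-ES maximizing `fitness`; sigma is adapted with the 1/5th success rule."""
    rng = random.Random(seed)
    parent, parent_fit = list(x0), fitness(x0)
    successes = 0
    for it in range(1, iterations + 1):
        # Mutation: x_o = x_p + sigma * y with y ~ N(0, I).
        child = [v + sigma * rng.gauss(0.0, 1.0) for v in parent]
        child_fit = fitness(child)
        if child_fit > parent_fit:
            # The offspring replaces the parent only if it is fitter.
            parent, parent_fit = child, child_fit
            successes += 1
        if it % window == 0:
            # 1/5th rule: compare the success count with window / 5 (integer arithmetic).
            if 5 * successes > window:
                sigma *= factor
            elif 5 * successes < window:
                sigma /= factor
            successes = 0
    return parent, parent_fit, sigma


def neg_sphere(x):
    """Toy objective (an assumption): maximum of 0 at the origin."""
    return -sum(v * v for v in x)


start = [3.0, -2.0, 1.0]
best, best_fit, final_sigma = one_plus_one_es(neg_sphere, start)
print(best_fit, final_sigma)
```

Because the parent is only ever replaced by a fitter offspring, the best fitness never decreases, and the step size shrinks as the search closes in on the optimum.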
The update consists of two main parts: (i) the rank-1 update, which computes an evolution path for the mutation distribution means, similarly to the step-size evolution path; and (ii) the rank-µ update, which computes a covariance matrix as a weighted sum of the covariances of the best $\mu$ individuals. The results of these steps are used to update the covariance matrix $\mathbf{C}$ itself. The algorithm iterates until a satisfactory solution is found (we refer the interested reader to [43] for a more detailed explanation).

Natural Evolution Strategies (NES). While NES is similar in many ways to the previously defined ES algorithms, its core idea is to use gradients to update a search distribution [44]. NES consists of the following steps:
• Sampling: NES samples its individuals from a probability distribution (usually a Gaussian distribution) over the search space. The end goal of NES is to update the distribution parameters $\theta$ to maximize the average fitness $F(\mathbf{x})$ of the sampled individuals $\mathbf{x}$.
• Search gradient estimation: NES estimates a search gradient on the parameters by evaluating the previously sampled individuals. It then decides on the best direction to take to achieve a higher expected fitness.
• Gradient ascent: NES performs a gradient ascent step along the estimated gradient.
• Iteration: NES repeats the previous steps until a stopping criterion is met [44].
Salimans et al. [4] proposed a variant of NES for optimizing the policy parameters $\theta$. As gradients are unavailable, they are estimated via Gaussian smoothing of the objective function $F$, which represents the expected return.
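For reference, the Gaussian-smoothing estimator used in this variant is commonly written as follows. The symbols $n$ (number of sampled perturbations) and $\sigma$ (noise standard deviation) are notation assumed here for illustration.

```latex
\nabla_\theta \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ F(\theta + \sigma \epsilon) \right]
  = \frac{1}{\sigma} \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ F(\theta + \sigma \epsilon) \, \epsilon \right]
  \approx \frac{1}{n \sigma} \sum_{i=1}^{n} F(\theta + \sigma \epsilon_i) \, \epsilon_i ,
  \qquad \epsilon_i \sim \mathcal{N}(0, I).
```

The right-hand side needs only evaluations of $F$ at perturbed parameters, which is why no backpropagation through the policy is required.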

C. Comparison
Our main observations on the fundamental algorithms are:
• Both ES and on-policy RL algorithms are data-inefficient: on-policy algorithms make use of data generated by the current policy and discard older data; ES algorithms discard all but a subset of candidate solutions in each iteration.
• The computational requirements per iteration of ES are lower than those of DRL, as ES does not require backpropagating error values.
• Value-based DRL algorithms such as DQN can be data-efficient because they work with off-policy data. However, they can become unstable for long horizons and high discount factors [45].
• Policy-based RL and ES are similar in that they both search for good policies directly.
• Table I summarizes some of the important characteristics of the mentioned algorithms.

IV. DEEP REINFORCEMENT LEARNING VERSUS EVOLUTIONARY STRATEGIES
This section presents a comparison between DRL and ES concerning their ability to parallelize computations, explore an environment, and learn in multi-agent and dynamic settings.

A. Parallelism
Despite the success of DRL and ESs, they are still computationally intensive approaches to sequential decision-making problems. Parallel execution is thus an important way to speed up the computation [46]. Below, we review the literature for the DRL and ES algorithms most suitable for parallel RL tasks.

1) Parallelism in Deep Reinforcement Learning
In parallel RL, multiple RL agents (or actors) run in parallel to accelerate the learning process. Unlike in multi-agent RL (Section IV-D), the agents do not interact directly and run in separate instances of the chosen environment. Each agent gathers its own learning experience, which can be shared with other agents during the learning process to optimize a global network (Figure 3) [52,53]. In the remainder of this section, several parallel DRL algorithms are discussed.

Gorila. Nair et al. [46] presented Gorila, the first massively distributed architecture for DRL. It consists of four major components: actors, learners, a parameter server, and a replay buffer (Figure 3a). Each actor has its own Q-network. It interacts with an instance of the same environment and stores the generated experiences (i.e., tuples $(s, a, r, s')$) in the replay buffer. Learners sample from the experience replay buffer and use DQN to compute gradients. Sampling from a buffer reduces the correlation between data updates and the effect of non-stationarity in the data. These gradients are then sent asynchronously to the parameter server to update its Q-network. After that, the parameter server updates the actors' and learners' Q-networks to synchronize the learning process.

A3C & GA3C. While using a replay buffer helps to stabilize the learning process, it requires more memory and computational power and can only be used with off-policy algorithms. Motivated by these limitations, Mnih et al. [47] introduced the Asynchronous Advantage Actor-Critic (A3C) as an alternative to Gorila. A3C consists of a global network and multiple agents, each with its own network parameters (Figure 3b). The agents are implemented as CPU threads within a single machine, which reduces the communication cost imposed by Gorila. The agents interact in parallel with their independent copies of the environment. Each agent calculates the value and policy gradients, which are used to update the global network parameters.
This method of learning diversifies and decorrelates data updates, which stabilizes the learning process. GA3C [54] makes use of GPUs and shows better scalability and performance than A3C.

Batched A2C & DPPO. A downside of A3C is that asynchronous updates may lead to sub-optimal collective updates to the global network. To overcome this, Batched Advantage Actor-Critic (Batched A2C) employs a master entity (or coordinator) to synchronize the update process of the global network [48]. Batched A2C tries to capitalize on the advantages of both Gorila and A3C. Similar to Gorila, Batched A2C runs on GPUs and the number of actors is highly scalable, while it still runs on a single machine, akin to A3C and GA3C [54]. Figure 3c presents the Batched A2C architecture. At each time step, Batched A2C samples from the policy and generates a batch of actions for $n_w$ workers on $n_e$ environment instances. The resulting experiences are then stored and used by the master to update the policy. The batched approach allows for easy parallelization by synchronously updating a unique copy of the parameters, with the drawback of higher communication costs. Distributed Proximal Policy Optimization (DPPO) [55] features an architecture similar to that of A2C and uses the PPO [56] algorithm for learning.

Ape-X & R2D2. Ape-X [49] extends the prioritized experience buffer to the parallel RL setting and shows that this approach is highly scalable. The Ape-X architecture consists of many actors, a single learner, and a prioritized replay buffer (Figure 3d). Each actor interacts with its own instance of the environment, gathers data, and computes the initial priorities of the data. The generated experiences are stored in a shared prioritized buffer. The learner samples the buffer to update its network and the priorities of the experiences in the buffer. In addition, the learner periodically updates the network parameters of the actors.
Ape-X's distributed architecture can be coupled with different learning algorithms such as DQN [28] and DDPG [19]. R2D2 [58] has a similar architecture but outperforms Ape-X by using recurrent neural network (RNN)-based RL agents that incorporate value-function re-scaling and LSTMs [57].

IMPALA. GA3C [54] suffers from poor convergence (due to the use of an on-policy method in an off-policy setting). IMPALA [50] corrects this with the use of V-trace. Its architecture consists of multiple actors interacting with their environment instances (Figure 3e). The experiences are sent to a learner, in contrast to A3C, where gradients are communicated. The learner optimizes the policy and value functions based on these experiences. IMPALA can support a single learner or multiple synchronized learners over which the policy parameters are distributed. IMPALA separates acting and learning, leading to a higher throughput. Unlike with Batched A2C, the actors no longer need to wait for the learners to finish. Moreover, additional machines can be added to generate more trajectories per unit of time. The model parameters of the actors are periodically updated by the learner. The decoupling of acting and learning may, however, cause the policy used for generating a trajectory to fall behind that of the learner. This is solved by using the V-trace off-policy actor-critic algorithm.

SEED. SEED [51] improves on the IMPALA system by moving inference to the learner (Figure 3f). Consequently, trajectory collection becomes part of the learner, and only observations and actions are exchanged between the actors and the learner. SEED makes use of TPUs and GPUs and shows significant improvement over other approaches.

2) Parallelism in Evolution Strategies
ES algorithms require minimal bandwidth to scale to a large number of workers, as only scalar values are communicated after each episode, as opposed to the entire gradient vectors communicated in policy-gradient DRL. In addition, ES does not require value function approximation, whereas value-based DRL does. In an attempt to capitalize on these advantages, Salimans et al. [4] proposed OpenAI-ES, an algorithm derived from NES (see Section III) that directly optimizes the parameters $\theta$ of a policy. The main feature of OpenAI-ES is the idea of shared random seeds, which means that agents only have to share scalars, leading to a drastic reduction of the bandwidth requirements. The main steps of OpenAI-ES are displayed in Figure 5 and operate as follows: 1) sample a Gaussian noise vector, $\varepsilon_i \sim \mathcal{N}(0, I)$.
Several other researchers have leveraged OpenAI-ES [4] in their work. Conti et al. [32] proposed the Novelty Search Evolution Strategy (NS-ES) algorithm. NS-ES hybridizes OpenAI-ES [4] and novelty search (NS), a directed exploration algorithm. They also introduced two other algorithms that replace NS with quality diversity (QD). The results show that the NS and QD algorithms improve ES performance on RL tasks with sparse rewards, as they help avoid local optima. Chrabaszcz et al. [5] proposed a canonical ES algorithm parallelized with OpenAI-ES using a random noise table. Liu et al. [59] also followed the efficient communication strategy introduced by Salimans et al. [4]: random seeds are shared between workers, and each worker knows which search parameters the other workers have used. Consequently, their proposed algorithm, Trust Region Evolution Strategies (TRES), requires extremely low bandwidth. Finally, Fuks et al. [29] proposed Evolution Strategy with Progressive Episode Lengths, which leverages the same parallelization idea as OpenAI-ES [4]. Table II summarizes the main characteristics of the presented algorithms, and Figure 4 shows them on a timeline.
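The shared-seed communication scheme can be sketched as follows. This is a single-process simulation under stated assumptions, not the OpenAI-ES implementation: the helper names, the toy return function, the seed scheme, the hyperparameters and the mean-return baseline are all illustrative choices. Each simulated worker reports only a pair (integer seed, scalar return), and the update step regenerates the perturbations locally from the seeds.

```python
import random


def noise_from_seed(seed, dim):
    """Any worker can regenerate the same perturbation from its integer seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def worker_rollout(theta, seed, sigma, fitness):
    """Evaluate one perturbed parameter vector and report only (seed, scalar return)."""
    eps = noise_from_seed(seed, len(theta))
    perturbed = [t + sigma * e for t, e in zip(theta, eps)]
    return seed, fitness(perturbed)


def es_update(theta, results, sigma, alpha):
    """Rebuild each noise vector from its seed and take a gradient-ascent step."""
    n = len(results)
    grad = [0.0] * len(theta)
    for seed, ret in results:
        eps = noise_from_seed(seed, len(theta))
        for d, e in enumerate(eps):
            grad[d] += ret * e / (n * sigma)
    return [t + alpha * g for t, g in zip(theta, grad)]


def toy_return(x):
    """Stand-in for an episode return (an assumption): peak at x = (1, 1, 1)."""
    return -sum((v - 1.0) ** 2 for v in x)


theta = [0.0, 0.0, 0.0]
for generation in range(300):
    seeds = [generation * 1000 + w for w in range(50)]  # one seed per simulated worker
    results = [worker_rollout(theta, s, 0.1, toy_return) for s in seeds]
    # Subtracting the mean return is a simple variance-reduction baseline (assumption).
    mean_ret = sum(r for _, r in results) / len(results)
    results = [(s, r - mean_ret) for s, r in results]
    theta = es_update(theta, results, sigma=0.1, alpha=0.05)

print(theta)
```

Per worker and generation, the message size is one integer and one float regardless of the number of policy parameters, which is the source of the low bandwidth requirement.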

3) Comparison
Our observations about parallelizing DRL and ES are:
• Despite the additional complexity, parallelism accelerates the execution of DRL and ES algorithms.
• Parallel DRL usually communicates network parameters or entire gradient vectors between nodes, while parallel ES algorithms share only scalar values between workers.
• Parallel ES requires minimal bandwidth when compared to parallel DRL.

B. Exploration
One of the fundamental challenges that a learning agent faces when interacting with a partially known environment is the exploration-exploitation dilemma. That is, when should the agent try out suboptimal actions to improve its estimation of the optimal policy and when should it use its current optimal policy estimation to make useful progress? This dilemma has attracted ample attention. Below, we summarize the main exploration methods in DRL and ESs.

1) Exploration in (Deep) Reinforcement Learning
Simple exploration techniques balance exploration and exploitation by selecting the optimal action (according to the current estimation) most of the time and a random action on occasion. This is the case for the well-known $\epsilon$-greedy exploration algorithm [1], which acts greedily with probability $1 - \epsilon$ and selects a random action with probability $\epsilon$.
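As a minimal sketch of this rule (the function name and the tabular `q_values` array are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Note that the random branch can also return the greedy action, so with $|A|$ actions the greedy action is chosen with probability $1 - \epsilon + \epsilon / |A|$.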
More complex exploration strategies estimate the value of an exploratory action by making use of the environment-agent interaction history. Upper confidence bound (UCB) [60] does so by making the reward signal that the agent seeks to maximize equal to the estimated value of a Q-function plus a value that reflects the algorithm's uncertainty about this estimate, $r^{+}(s, a) = r(s, a) + B(N(s))$, where $N(s)$ represents the frequency of visiting state $s$, and $B(N(s))$ is a reward bonus that decreases with $N(s)$. In other words, UCB promotes the selection of actions with high $r(s, a)$ (good actions) or with high uncertainty (actions that are visited less frequently). The Thompson sampling method (TS) [61] maintains a distribution over the parameters of a model. To balance exploration and exploitation, TS samples from this distribution and acts greedily with respect to the samples. UCB and TS become more confident about the optimal policy over time. Consequently, they naturally reduce the probability of selecting exploratory actions, and therefore they are inherently more efficient than $\epsilon$-greedy.
From RL to DRL. DRL agents act on environments with continuous or high-dimensional state-action spaces (e.g., Montezuma's Revenge, StarCraft II). Such spaces render count-based algorithms (e.g., UCB) and those that require maintaining a distribution over state-action spaces (e.g., TS) useless in their original formulation. Many algorithms have been proposed to explore such challenging environments with sparse reward signals. Generally, these algorithms couple approximation techniques with exploration algorithms proposed for simple RL settings [62,63,64,65]. Below we outline some of the main DRL exploration algorithms.
Posterior sampling. Osband et al. [66] used randomized linearly-parameterized value functions to extend the TS technique to DRL settings without maintaining an intractable exact posterior update.
Bootstrapped DQN [67] then improved on this idea by using DNNs and ensemble Q-functions. It draws a sample at random from the ensemble Q-functions and acts greedily with respect to this sample. Chen et al. [68] integrated UCB with Bootstrapped DQN by calculating the mean and variance of a subset of the ensemble Q-functions. O'Donoghue et al. [69] combined TS with uncertainty Bellman equations to propagate the uncertainty in the Q-values over multiple timesteps.
Table excerpt, DRL exploration algorithms (algorithm [reference]: description. Result: reported performance):
ICM [73]: uses a forward dynamic model to predict states and measures information gain as the difference between the predicted and observed state. Result: outperforms TRPO-VIME in VizDoom (a sparse 3D environment).
Episodic curiosity [74]: uses episodic memory to form the novelty bonus. Result: outperforms ICM in visually rich 3D environments from VizDoom and DMLab.
Never Give Up [75]: combines both episodic and life-long novelties; encourages the agent to visit rarely visited states. Result: obtains a median human normalized score of 1344%; the first algorithm that achieves non-zero rewards in the game of Pitfall.
Agent57 [76]: uses a meta-controller for adaptively selecting the right policy, ranging from purely exploratory to purely exploitative. Result: first DRL agent that surpasses the standard human benchmark on all 57 Atari games.

Information gain. In exploration based on information gain, the algorithm provides a reward bonus proportional to the information obtained after taking an action. This reward bonus is then added to the reward provided by the environment to push the agent to explore novel (or less known) states [77]. Bellemare et al. [70] fitted a density model to all the states observed so far. After observing a new state, the model is updated. The obtained density models are used to derive a pseudo-count of state visitation. The pseudo-count is then used to derive a reward bonus. Houthooft et al. [72] proposed to learn a transition dynamic model with a Bayesian neural network.
The information gain is measured as the KL divergence between the current and updated parameter distribution after a new observation. Based on this information, the reward signal is augmented with a bonus. Pathak et al. [73] used a forward dynamic model to predict the next state. The reward bonus is then set to be proportional to the error between the predicted and observed next state. To make this method effective, the authors utilized an inverse model that removes state features irrelevant to the comparison. Burda et al. [78] defined the exploration bonus based on the error of a neural network in predicting features of the observations given by a fixed, randomly initialized neural network.
Savinov et al. [74] proposed a new curiosity method that uses episodic memory to form the novelty bonus. The bonus is computed by comparing the current observation with the observations in memory and a reward is given for observations that require some effort to be reached (effort is materialized by the number of environment steps taken to reach an observation). Tao et al. [79] estimated the intrinsic rewards based on the distance to the nearest neighbors in a meaningful low-dimensional representational space to gauge novelty while combining the value-based approach with a model-based approach. Badia et al. [75] proposed "Never give up" (NGU), an RL agent that aims to solve hard exploration games such as "Pitfall!". This method combines both episodic and life-long novelties. Episodic novelty inspires an agent to regularly return to familiar states (potentially not fully explored) in the span of various episodes (not in the same episode). Life-long novelty gradually down-modulates states that become progressively more familiar across many episodes. A Universal Value Function Approximator (UVFA) is then used to learn separate exploration and exploitation policies at the same time. Agent57 [76] aims to manage the tradeoff between exploration and exploitation using a "meta-controller" that adaptively selects a correct policy (ranging from very exploratory to purely exploitative) for the training phase.
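The two families of bonuses discussed in this subsection, count-based bonuses of the form $r^{+}(s, a) = r(s, a) + B(N(s))$ and prediction-error (curiosity) bonuses, can be illustrated with a minimal tabular sketch. The choice $B(N) = \beta / \sqrt{N}$, the squared-error bonus with scale `eta`, and all names are assumptions made for illustration; the cited methods replace the exact count and the raw state error with learned density models and learned feature spaces.

```python
from collections import defaultdict
import numpy as np

class CountBonus:
    """Count-based bonus: r+ = r + beta / sqrt(N(s)), with N(s) an exact visit count."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def augmented_reward(self, state, reward):
        self.counts[state] += 1  # state must be hashable in this tabular sketch
        return reward + self.beta / np.sqrt(self.counts[state])

def prediction_error_bonus(predicted_next, observed_next, eta=1.0):
    """Curiosity-style bonus: scaled squared error of a forward model's prediction."""
    diff = np.asarray(predicted_next, dtype=float) - np.asarray(observed_next, dtype=float)
    return 0.5 * eta * float(diff @ diff)
```

In both cases the bonus shrinks as a state becomes familiar: the count grows, or the forward model's error on that state falls.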
2) Exploration in Evolution Strategy
ES algorithms optimize the fitness score while exploring around the best solutions found so far. The exploration is realized through the recombination and mutation steps. Despite their effectiveness in exploration, ESs may still get trapped in local optima [59,80]. To overcome this limitation, many ES algorithms with enhanced exploration techniques have been proposed.
One way to extract approximate gradients from a non-smooth objective function, $F(\theta)$, is by adding noise to its parameter vector, $\theta$. This yields a new differentiable function, $F_{ES}(\theta)$. OpenAI-ES [4] exploits this idea by sampling noise from a Gaussian distribution and adding it to the parameter vector $\theta$. The algorithm then optimizes using stochastic gradient ascent. Additionally, OpenAI-ES relies on a few auxiliary techniques to enhance its performance: virtual batch normalization [31] for enhanced exploration, antithetic sampling [81] for reduced variance, and fitness shaping [44] for improving local optima avoidance.
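A minimal sketch of this estimator, assuming the Gaussian-smoothed objective $F_{ES}(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[F(\theta + \sigma \epsilon)]$, whose gradient is $\frac{1}{\sigma} \mathbb{E}[F(\theta + \sigma \epsilon)\, \epsilon]$. The code combines antithetic (mirrored) sampling with a centered-rank form of fitness shaping; the function names and the exact shaping transform are illustrative assumptions rather than the reference implementation of [4].

```python
import numpy as np

def centered_ranks(x):
    """Fitness shaping: map raw fitness values to ranks scaled into [-0.5, 0.5]."""
    ranks = np.empty(len(x), dtype=float)
    ranks[np.argsort(x)] = np.arange(len(x))
    return ranks / (len(x) - 1) - 0.5

def es_gradient(f, theta, sigma, n_pairs, rng):
    """Antithetic Monte Carlo estimate of the gradient of the smoothed objective."""
    eps = rng.standard_normal((n_pairs, theta.size))
    f_pos = np.array([f(theta + sigma * e) for e in eps])
    f_neg = np.array([f(theta - sigma * e) for e in eps])  # mirrored samples
    shaped = centered_ranks(np.concatenate([f_pos, f_neg]))
    weights = shaped[:n_pairs] - shaped[n_pairs:]
    return weights @ eps / (2 * n_pairs * sigma)
```

A stochastic gradient ascent step is then `theta = theta + lr * es_gradient(f, theta, sigma, n_pairs, rng)`. Because only ranks enter the estimate, its scale no longer matches the true gradient, which is one reason shaped ES is usually paired with an adaptive optimizer.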
Choromanski et al. [82] proposed two strategies to enhance the exploration of Derivative Free Optimization (DFO) methods such as OpenAI-ES [4]: (i) structured exploration, where the authors showed that random orthogonal and Quasi Monte Carlo finite difference directions are much more effective than random Gaussian directions for parameter exploration; and (ii) compact policies, where, by imposing a parameter sharing structure on the policy architecture, they were able to significantly reduce the dimensionality of the problem without losing accuracy, thus speeding up the learning process.
Maheswaranathan et al. [83] proposed Guided ES: a random search that is augmented using surrogate gradients which are correlated with the true gradient. The key idea is to track a low-dimensional subspace that is defined by the recent history of surrogate gradients. Sampling this subspace leads to a drastic reduction in the variance of the search direction. However, this approach has two shortcomings: (i) the bias of the surrogate gradients needs to be known; and (ii) when the bias is too small, Guided ES cannot find a better descent direction than the surrogate gradient. Meier et al. [84] drew inspiration from how momentum is used for optimizing DNNs to improve upon Guided ES [83]. The authors showed how to optimally combine the surrogate gradient directions with random search directions and how to iteratively approach the true gradient for linear functions. They assessed their algorithm against a standard ES algorithm on different tasks, showing its superiority.
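The subspace-guided sampling idea can be sketched as follows. The sketch assumes a search covariance of the form $\Sigma = \frac{\alpha}{n} I + \frac{1 - \alpha}{k} U U^{\top}$, where the columns of $U$ form an orthonormal basis of the subspace spanned by $k$ recent surrogate gradients and $\alpha$ trades off full-space against subspace exploration. The function name, argument names, and the scale factor `beta` are illustrative assumptions; this is not the authors' code.

```python
import numpy as np

def guided_es_gradient(f, theta, surrogate_grads, sigma, alpha, n_pairs, rng, beta=1.0):
    """Antithetic gradient estimate with perturbations biased toward a guiding subspace."""
    n = theta.size
    # Orthonormal basis U (n x k) of the span of the recent surrogate gradients.
    U, _ = np.linalg.qr(np.asarray(surrogate_grads, dtype=float).T)
    k = U.shape[1]
    grad = np.zeros(n)
    for _ in range(n_pairs):
        full_part = np.sqrt(alpha / n) * rng.standard_normal(n)
        sub_part = np.sqrt((1 - alpha) / k) * (U @ rng.standard_normal(k))
        eps = sigma * (full_part + sub_part)
        grad += eps * (f(theta + eps) - f(theta - eps))
    return beta * grad / (2 * sigma**2 * n_pairs)
```

With `alpha=1` this reduces to isotropic random search, while `alpha=0` searches only inside the guiding subspace, so the estimate can then never leave the span of the surrogate gradients.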
Choromanski et al. [85] noted that fixing the dimensionality of subspaces (as in Guided ES [83]) leads to suboptimal performance. Therefore, they proposed ASEBO: an algorithm that adaptively controls the dimensionality of subspaces based on gradient estimators from previous iterations. ASEBO was compared to several ESs and DRL algorithms and showed promising averaged performance.
Liu et al. [86] proposed Self-Guided Evolution Strategies (SGES). This work is inspired by both ASEBO [85] and Guided ES [83]. It is based on two main ideas: leveraging historical estimated gradients and building a guiding subspace from which search directions are sampled probabilistically. The results show that SGES outperforms OpenAI-ES [4], Guided ES [83], CMA-ES, and vanilla ES.
The aforementioned methods suffer from the curse of dimensionality due to the high variance of Monte Carlo gradient estimators. Motivated by this, Zhang et al. [87] proposed Directional Gaussian Smoothing Evolution Strategy (DGS-ES). It encourages non-local exploration and improves high-dimensional exploration. In contrast to regular Gaussian smoothing, directional Gaussian smoothing conducts 1D nonlocal explorations along d orthogonal directions. The Gauss-Hermite quadrature is then used for improving the convergence speed of the algorithm. Its superior performance is showcased by comparing it to many algorithms including OpenAI-ES [4] and ASEBO [85].
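To make the directional smoothing idea concrete, the sketch below estimates, for each of $d$ orthonormal directions $\xi_i$, the derivative at $y = 0$ of the one-dimensional smoothed function $G_i(y) = \mathbb{E}_{v \sim \mathcal{N}(0,1)}[F(x + (y + \sigma v)\,\xi_i)]$, which equals $\frac{1}{\sigma}\mathbb{E}[v\,F(x + \sigma v\,\xi_i)]$, and evaluates that expectation with Gauss-Hermite quadrature. It is a simplified illustration under these assumptions, with hypothetical names and the coordinate axes as default directions, not the implementation of [87].

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def dgs_gradient(f, x, sigma, n_quad=5, directions=None):
    """Directional Gaussian smoothing gradient via Gauss-Hermite quadrature."""
    d = x.size
    xi = np.eye(d) if directions is None else directions  # rows: orthonormal directions
    nodes, weights = hermgauss(n_quad)  # quadrature rule for the weight exp(-t^2)
    grad = np.zeros(d)
    for i in range(d):
        # Deterministic 1D exploration along direction xi[i]; no random sampling.
        vals = np.array([f(x + np.sqrt(2.0) * sigma * t * xi[i]) for t in nodes])
        deriv = np.sum(weights * np.sqrt(2.0) * nodes * vals) / (sigma * np.sqrt(np.pi))
        grad += deriv * xi[i]
    return grad
```

Each direction costs `n_quad` function evaluations, and a large `sigma` lets the estimator look past small local bumps, which is the non-local exploration referred to above.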
To encourage exploration in environments with sparse or deceptive reward signals, Conti et al. [32] proposed hybridizing ESs with directed exploration methods (i.e., Novelty Search (NS) [88] and Quality Diversity (QD) [89]). The combination resulted in three algorithms: NS-ES, NSR-ES, and NSRA-ES. NS-ES builds on the OpenAI-ES exploration strategy. OpenAI-ES approximates a gradient and takes a step in that direction.
In NS-ES, the gradient estimate is that of the expected novelty. It gives directions on how to change the current policy's parameters $\theta$ to increase the average novelty of the parameter distribution. NSR-ES is a variant of NS-ES. It combines both the reward and novelty signals to produce policies that are both novel and high-performing. NSRA-ES is an extension of NSR-ES that dynamically adapts the weights of the novelty and the reward gradients for better performance.
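The way these variants blend reward and novelty can be sketched as below. The sketch assumes that novelty is the mean distance from a policy's behavior characterization to its $k$ nearest neighbors in an archive, and that the update weights each perturbation by $w \cdot \text{reward} + (1 - w) \cdot \text{novelty}$: $w = 0$ gives an NS-ES-like update, a fixed intermediate $w$ an NSR-ES-like update, and a $w$ adapted during training an NSRA-ES-like update. The z-score normalization and all names are assumptions made here for illustration.

```python
import numpy as np

def novelty(behavior, archive, k=3):
    """Mean distance from a behavior characterization to its k nearest archive entries."""
    dists = np.sort(np.linalg.norm(np.asarray(archive, dtype=float) - behavior, axis=1))
    return float(dists[:k].mean())

def _standardize(values):
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + 1e-8)

def nsra_step(theta, eps, rewards, novelties, w, sigma, lr):
    """One update that blends reward and novelty signals with weight w in [0, 1]."""
    score = w * _standardize(rewards) + (1 - w) * _standardize(novelties)
    return theta + lr * (score @ eps) / (len(eps) * sigma)
```

Here `eps` holds the unit Gaussian perturbations (one row per offspring), and `rewards` and `novelties` are evaluated at `theta + sigma * eps[i]`.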

3) Comparison
Our observations of this section are summarized below.
• The exploration-exploitation dilemma is still an active field of research and there is no single algorithm that outperforms all others.
• Thanks to recombination and mutation, ES algorithms might suffer less from local optima than DRL ones. However, novel environments with sparse and deceptive reward signals demand more sophisticated and capable exploration algorithms.
• ESs still face some problems when exploring, as high-dimensional optimization tasks can give rise to high-variance gradient estimates.
• Table III and Table IV summarize some important characteristics of DRL and ES exploration algorithms.

C. Non-Markov settings
The Markov property denotes the situation where the future states of a process depend only on the current state and not on events or states from the past. The degree to which agents can observe (changes in) the environment has an impact on their decision behavior. In certain favorable scenarios the state of the agent in its environment might be fully observable (e.g., using sensors) to an extent such that the Markov assumption holds. In other cases, the state of the environment is only partially observable and/or the agent faces a distribution of environments (Meta-RL).

1) Partially Observable
In many real-world applications, agents can only partially observe the state of their environments and might only have access to their local observations. This means agents need to take into account the history of observations, actions, and rewards to produce a better estimation of the underlying hidden state [90,91,92]. These problems are usually modeled as a partially observable Markov decision process (POMDP). Researchers have addressed the POMDP problem setup through the proposal of many RL models. One possibility is to employ a recurrent structure to enable agents to consider past observations [93,94].

Table excerpt, ES exploration algorithms (algorithm [reference]: description. Result: reported performance):
ASEBO [85]: adapts the dimensionality of the subspaces on-the-fly for efficient exploration. Result: optimizes high-dimensional black-box functions and performs consistently well across several tasks compared to state-of-the-art algorithms.
DGS-ES [87]: uses directional Gaussian smoothing to explore along non-local orthogonal directions; it leverages Gauss-Hermite quadrature for fast convergence. Result: improves on state-of-the-art algorithms (e.g., OpenAI-ES and ASEBO) on some problems.
Iterative gradient estimation refinement: iteratively uses the last update direction as a surrogate gradient for the gradient estimator; over time this results in improved gradient estimates.
Algorithms of [32]: Result: avoid local optima encountered by ESs while achieving higher performance on Atari and simulated robot tasks.

2) Meta Reinforcement Learning
Meta-RL is concerned with learning a policy that can be quickly generalized across a distribution of tasks or environments (modeled as MDPs). Generally, a meta-learner achieves this through a two-stage optimization process: first, a meta-policy is trained on a distribution of similar tasks with the hope of learning the common dynamics across these tasks; then, the second stage fine-tunes the meta-policy while acting on a particular task sampled from a similar but unseen task distribution [95]. Examples of meta-RL tasks include navigating towards distinct goals [96], going through different mazes [97], dealing with component failures [98], and driving different cars [99].
Meta-RL can be subdivided into two categories [96]: RNN-based [100,101] and gradient-based learners [102,103].
Recurrent Models (RNN-based learners). Leveraging the agent-environment interaction history provides a stronger inductive bias, which leads to faster learning [99,104]. This idea can be implemented using Recurrent Neural Networks (RNNs) (or other recurrent models) [97,100,101,105]. The RNNs can be trained on a set of tasks to learn a hidden state (meta-policy); this hidden state can then be further adapted given new observations from an unseen task.
The general architecture of a meta-RL algorithm is illustrated in Figure 7 [106], where an agent is modeled as two loops, both implementing RL algorithms. The outer loop samples a new environment in every iteration and tunes the parameters of the inner loop. Consequently, the inner loop can adjust more rapidly to new tasks by interacting with the associated environments and optimizing for maximal rewards.
Duan et al. [101] and Wang et al. [100] proposed analogous recurrent Meta-RL agents: RL$^2$ and DRL-meta, respectively. They implemented a GRU and an LSTM architecture, respectively, in which the hidden states serve as a memory for tracking characteristics of interaction trajectories. The main difference between the two approaches relates to the set of environments. Environments in [100] are drawn from a parameterized distribution [107]. In contrast, those in [101] are relatively unrelated [107].
Such RNN-based methods have proven to be efficient on many RL tasks. However, their performance decreases as the complexity of the task increases, especially with long temporal dependencies. Additionally, RNNs struggle to retain information over long horizons because of the vanishing gradient problem. Furthermore, RNN-based meta-learners cannot pinpoint specific prior experiences [97,108].
To overcome these limitations, Mishra et al. [97] proposed the Simple Neural Attentive Learner (SNAIL). It combines temporal convolutions and attention mechanisms. The former aggregates information from past experiences and the latter pinpoints specific pieces of information. SNAIL's architecture consists of three main parts: the DenseBlock, TCBlock, and AttentionBlock. This general-purpose model has shown its efficacy on tasks ranging from supervised to reinforcement learning. Despite that, challenges persist, such as the long time needed to find the right architectures of TCBlocks and DenseBlocks [108].
Gradient-Based Models. Model Agnostic Meta-Learning (MAML) [102] realizes meta-learning principles by learning an initial set of parameters, $\theta_0$, of a model such that taking a few gradient steps is sufficient to tailor this model to a specific task. More precisely, MAML learns $\theta_0$ such that for any randomly sampled task, $T$, with a loss function, $L_T$, the agent will have a low loss after $n$ updates:
$$\min_{\theta_0} \; \mathbb{E}_{T}\left[ L_{T}\left( U_{T}^{n}(\theta_0) \right) \right],$$
where $U_{T}^{n}(\theta)$ refers to $n$ applications of an update rule, such as gradient descent, on task $T$. Nichol et al. [109] proposed Reptile, a first-order meta-learning framework that is considered to be an approximation of MAML. Similar to first-order MAML (FOMAML), Reptile does not calculate second derivatives, which makes it less computationally demanding. It starts by repeatedly sampling a task, then performing $N$ iterations of stochastic gradient descent (SGD) on each task to compute a new set of parameters. Then, it moves the model weights towards the new parameters.

Fig. 7: Schematic of Meta-reinforcement Learning, illustrating the inner and outer loops of training [106]

3) Meta Evolution Strategies
Gajewski et al. [110] introduced "Evolvability ES", an ES-based meta-learning algorithm for RL tasks. It combines concepts from evolvability search [111], ESs [4], and MAML [102] to encourage searching for individuals whose immediate offspring show signs of behavioral diversity (that is, it searches for parameter vectors whose perturbations lead to differing behaviors) [111]. Consequently, Evolvability ES facilitates adaptation and generalization while leveraging the scalability of ESs [110,112]. Evolvability ES performs competitively with gradient-based meta-learning algorithms. The authors of Quality Evolvability ES [112] noted that the original Evolvability ES [113] can only be used to solve problems where task performance and evolvability align. To eliminate this restriction, Quality Evolvability ES optimizes for both task performance and evolvability simultaneously.
Song et al. [114] argue that policy gradient-based Model Agnostic Meta Learning (MAML) algorithms [102] face significant difficulties when estimating second derivatives using backpropagation on stochastic policies. Therefore, they introduced ES-MAML, a meta-learner that leverages ES [4] for solving MAML problems without estimating second derivatives. The authors empirically showed that ES-MAML is competitive with other Meta-RL algorithms. Song et al. [115] combined Hill-Climbing adaptation with ES-MAML to develop a noise-tolerant meta-RL learner. The authors showcased the performance of their algorithm using a physical legged robot.
Wang et al. [116] incorporated an instance weighting mechanism with ESs to generate an adaptable and scalable meta-learner, Instance Weighted Incremental Evolution Strategies (IW-IES). During parameter updates, higher weights are assigned to offspring that contain more new knowledge. The weights are assigned based on one of the two proposed metrics: instance novelty and instance quality. Compared to ES-MAML, IW-IES proved competitive for robot navigation tasks.
The characteristics of meta-RL make it particularly suited for tackling the sim-to-real problem: simulation provides previous experiences that are used to learn a general policy, and the data obtained from operating in the real world fine-tunes that policy [117]. Examples of using Meta-RL to train physical robots include: Nagabandi et al. [98] built on top of MAML a model-based meta-RL agent to train a real legged millirobot; Arndt et al. [118] proposed a similar framework to MAML to train a robot on a task of hitting a hockey puck; and Song et al. [115] introduced a variant of ES-MAML to train and quickly adapt the policy commanding a legged robot.

4) Comparison
We list our observations of this section below:
• Meta-RL is generally a two-stage optimization process: the first stage optimizes on a task distribution level, and the second fine-tunes for a specific task.
• There are two main approaches for Meta-RL: gradient-based and recurrent models.
• There are many challenges in gradient-based Meta-RL methods, such as estimating first and second-order derivatives, high variance, and high computation needs.
• ES-based meta-RL attempts to address the limitations of gradient-based meta-RL; however, ES-based meta-RL itself faces a different set of challenges, such as sample efficiency.
• Meta-RL is particularly useful and advantageous in the sim-to-real paradigm. Simulation provides the experience needed to learn a general behavior, which allows the model to act properly in the real world.
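The two-stage structure in the first observation can be illustrated with a Reptile-style first-order update on a toy problem. The sketch assumes each task $T$ has the quadratic loss $L_T(\phi) = \frac{1}{2}\lVert \phi - c_T \rVert^2$ with a task-specific center $c_T$, so the inner gradient is available in closed form; the task family, the function names, and the hyperparameters are hypothetical and stand in for RL tasks and policy-gradient updates.

```python
import numpy as np

def inner_sgd(theta, task_center, n_steps, inner_lr):
    """Task-level adaptation: n_steps of gradient descent on 0.5 * ||phi - c||^2."""
    phi = theta.copy()
    for _ in range(n_steps):
        phi -= inner_lr * (phi - task_center)
    return phi

def reptile(task_centers, dim, meta_iters, n_steps, inner_lr, meta_lr, rng):
    """Meta-level loop: move the initialization toward each task's adapted weights."""
    theta = np.zeros(dim)
    for _ in range(meta_iters):
        center = task_centers[rng.integers(len(task_centers))]  # sample a task
        phi = inner_sgd(theta, center, n_steps, inner_lr)
        theta += meta_lr * (phi - theta)  # first-order update, no second derivatives
    return theta
```

On this symmetric toy family the learned initialization settles near the mean of the task centers, from which a few inner steps reach any single task quickly.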

D. Learning in a multiagent setting
A multi-agent system (MAS) is a distributed system of multiple cooperating or competing agents (physical or virtual), working towards maximizing their own objectives within a shared environment [120]. Each of the agents is equipped with a decision-making model which can be distributed either homogeneously or heterogeneously over the entire group. Currently, multi-agent systems form one of the leading research areas of Artificial Intelligence due to their wide applicability. Virtually any application that can be partitioned and parallelized can benefit from using multiple agents.

1) Multi-agent Reinforcement Learning
An MAS can be combined with DRL to form a Multiagent Deep Reinforcement Learning (MADRL) system which addresses sequential decision-making problems for multiple agents sharing a common environment (Figure 8). MADRL agents are trained to learn a certain behavior through interaction with the environment and optionally with other agents. In fact, Tampuu et al. [121] show that DRL agents can perform better in a gaming application when trained against other dynamic DRL agents instead of a static algorithm. These systems can become quite complex, as agents will observe not only the consequences of their actions, but also the behavior of other agents.

Table excerpt, meta-RL algorithms (algorithm [reference]: description. Result: reported performance):
MAML [102]: given a task distribution, it searches for optimal initial parameters such that a few gradient steps are sufficient to solve tasks drawn from that distribution. Result: outperforms classical methods such as random and pretrained methods.
Reptile [109]: similar to first-order MAML. Result: on par with the performance of MAML.
IW-IES [116]: uses NES for updating the RL policy network parameters in a dynamic environment. Result: outperforms ES-MAML on a set of robot navigation tasks.
Since the environment and the reward states are affected by the joint actions of all agents, the single-agent MDP model cannot be directly applied to MARL systems, as they do not adhere to the Markov property. The Markov (or Stochastic) games (MG) [122] framework is a generalization of the MDP that captures the entanglement of the multiple agents. Several important properties must be considered when designing MADRL systems. In the following, we discuss each of these properties and their resulting impact on the overall system.
Setup: Cooperative vs Competitive
In a cooperative game, also known as a team problem, the agents seek to maximize a common reward signal by taking actions that favor their outcome, while taking into account their effect on other agents. Most contemporary applications are based upon a cooperative setup. Examples of this scenario include foraging, exploration, and warehouse robots. The main challenge for the cooperative setup is termed the structural credit assignment problem [123]: which members of the team should receive credit for a favorable reward signal, and which members should be penalized. Due to the complex dynamics between agents and the action history, it can be non-trivial to determine whose actions were beneficial to the group reward.
In contrast, in a competitive game, agents receive different reward signals based on the overall outcome of the joint actions. In this setup, certain actions might be beneficial to one set of agents while being indifferent or disadvantageous for the other agents.
Control: Centralized vs Decentralized
Another important distinction to make for MARL systems is the centralized versus decentralized control approach. In the case of centralized control, there exists a single control entity that governs the decisions of all agents based on all available joint actions, joint rewards, and joint observations. While this approach enables optimal decisions, it quickly becomes computationally infeasible as the number of agents within a system grows. Additionally, this creates the risk of a single point of failure, since the whole system could fail if the central controller breaks.
The decentralized approach does not make use of a central controller and relies on agents to make decisions independently, based on the information available to them locally. Decentralized systems can be subdivided into two categories: "a decentralized setting with networked agents" and "a fully decentralized setting" [124]. The former setup involves agents which can communicate with other nearby agents and use the shared information to optimize their actions. In the latter scenario, agents make independent decisions without information exchange. While this means that no explicit messages can be sent, it is still possible to influence the behavior of other agents by affecting their reward, as seen in Sartoretti et al. [125]. While the decentralized approach can provide more scalability and robustness, it also significantly increases the complexity of the system, as there is no central entity that has knowledge of and can control the state of each agent. An interesting future research direction might be semi-centralized MADRL systems in which one or more central entities possess partial information about a set of agents. Alternatively, it is possible to alternate techniques between different phases of the design. Chen [126] proposed a system with centralized training and exploration and decentralized execution, which can increase inter-agent collaboration and sample efficiency.
Challenges in Multi-agent Reinforcement Learning
Moving from a single-agent to a multi-agent environment brings about new complex challenges with respect to learning and evaluating outcomes. This can be attributed to several factors, mainly the exponential growth of the search space and the non-stationarity of the environment [127]. In the following, the additional challenges for MADRL applications are discussed.
Non-stationarity:-In a MADRL system, agents are learning concurrently and their actions repeatedly reshape their shared surroundings, resulting in a highly dynamic environment. This highly dynamic aspect is often referred to as non-stationarity. Therefore, the entire setup becomes non-stationary from a single agent's perspective. In other words, $P(s' \mid s, a_1, \ldots, a_N, \pi_1, \ldots, \pi_N) \neq P(s' \mid s, a_1, \ldots, a_N, \pi'_1, \ldots, \pi'_N)$, where any $\pi_i \neq \pi'_i$. Consequently, the convergence of well-known algorithms such as Q-learning can no longer be guaranteed, as the Markov property assumption of the environment is violated [28,128,129]. Many papers in the literature attempt to address the non-stationarity problem. Castaneda (2016) proposed two algorithms: Deep loosely coupled Q-network (DLCQN) and deep repeated update Q-network (DRUQN). DLCQN modifies an independence degree for each agent based on the agent's negative rewards and observations. The agent then utilizes this independence degree to decide when to act independently or cooperatively. DRUQN tries to avoid policy bias by updating the value of the action inversely proportional to the probability of selecting that action. Diallo et al. [130] proposed a multi-agent concurrent DQN algorithm able to converge in a non-stationary environment. Lenient-DQN, conceived by Palmer et al. [131], utilizes leniency with decaying temperature values for adjusting the policy updates sampled from the experience replay memory to deal with the non-stationarity caused by concurrent learning.
Scalability:-One way to deal with the non-stationarity problem is to train the agents in a centralized fashion and let them act according to a joint policy. However, this paradigm gives rise to the scalability challenge of multi-agent systems. As the number of agents increases, the state and action spaces grow exponentially, a phenomenon known as "combinatorial complexity" [132,133,134]. Centralized training and decentralized execution is a paradigm for developing multi-agent systems that tries to find a balance between the challenges imposed by scalability and non-stationarity. Several value-based or actor-critic algorithms have been proposed, such as Value Decomposition Networks (VDN) [135], QMIX [136], MADDPG [137], COMA [138], and CTEDD [139].
Modeling Multi-agent Reinforcement Problems
The following summarizes the common approaches to modeling and solving multi-agent reinforcement learning problems.
independent-learning:-Under this approach each agent considers other agents as part of the environment and consequently each agent is trained independently [129,140,141]. This approach does not suffer from the scalability problem [141,142], but it makes the environment non-stationary from each agent's perspective [143]. Furthermore, it conflicts with the usage of experience replay that improves the DQN algorithm [28].
Foerster et al. [142] used importance sampling and fingerprints that identify the age of the samples in the replay memory to stabilize experience replay for DQN in MARL.
fully observable critic:-A way to deal with the non-stationarity of a MARL environment is by leveraging an actor-critic approach. Lowe et al. [137] proposed a multi-agent deep deterministic policy gradient (MADDPG) algorithm, where the actor policy accesses only the local observations whereas the critic has access to the actions, observations, and target policies of all agents during training. As the critic has global observability, the environment becomes stationary even though the policies of other agents change. A number of extensions to MADDPG have been proposed [144,145,146,147].
value function decomposition:-Learning the optimal action-value function in fully cooperative MARL settings is challenging. To coordinate the agents' actions, learning a centralized action-value function, $Q_{tot}$, is desirable. However, when the number of agents is large, learning such a function is challenging. Independent learning (where each agent learns its own action-value function, $Q_i$) does not face such a challenge, but it also neglects interactions between agents, which results in suboptimal collective performance. Value function decomposition methods try to capitalize on the advantages of these two approaches. They represent $Q_{tot}$ as a mixing of the $Q_i$, each conditioned only on local information. The Value-Decomposition Network (VDN) algorithm assumes that $Q_{tot}$ can be additively decomposed into $N$ functions $Q_i$ for $N$ agents. The QMIX [136] algorithm improves on VDN by relaxing some of the additivity constraints and enforcing positive weights on the mixer network.
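The practical consequence of both decompositions, namely that each agent can act greedily on its own $Q_i$ and still maximize $Q_{tot}$, can be checked on a small tabular example. The sketch below is a deliberate simplification: the QMIX-style mixer is reduced to a single linear layer whose weights are made non-negative with an absolute value, whereas the actual mixer in [136] is a state-conditioned nonlinear network. All names are illustrative.

```python
import itertools
import numpy as np

def vdn_q_tot(per_agent_q):
    """VDN: Q_tot is the plain sum of the chosen per-agent action values."""
    return float(np.sum(per_agent_q))

def monotonic_mix(per_agent_q, raw_weights, bias):
    """QMIX-style mixing, reduced to one linear layer with non-negative weights."""
    return float(np.abs(raw_weights) @ per_agent_q + bias)

def joint_argmax(q_tables, mixer):
    """Brute-force search over joint actions; exponential in the number of agents."""
    n_agents, n_actions = q_tables.shape
    agents = np.arange(n_agents)
    return max(
        itertools.product(range(n_actions), repeat=n_agents),
        key=lambda joint: mixer(q_tables[agents, list(joint)]),
    )
```

For any `q_tables` of shape `(n_agents, n_actions)`, the brute-force joint maximizer coincides with `q_tables.argmax(axis=1)` under either mixer (barring ties), which is what makes decentralized greedy execution consistent with the centralized value.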
consensus:-This refers to the problem setup where agents are allowed to communicate with their neighbors to reach an agreement. The information is shared locally, between neighboring agents only, which preserves scalability even as the number of agents increases [148,149].
learn to communicate:-Cooperative environments may allow agents to communicate. In such settings, the agents can learn a communication protocol to achieve their shared objective more effectively [150,151]. Foerster et al. [152] proposed two algorithms, Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL), that use deep networks to learn to communicate. RIAL is based on a Deep Recurrent Q-Network with independent Q-learning. It shares the parameters of a single neural network between the agents. In contrast, DIAL passes gradients directly via the communication channel during learning. While a discrete communication channel is used in realizing RIAL and DIAL, CommNet [153] utilizes a continuous vector channel. Over this channel, agents obtain the summed transmissions of other agents. Pesce and Montana [154] introduced the memory-driven multi-agent deep deterministic policy gradient (MD-MADDPG). In it, the agents use shared memory as a communication channel: upon taking an action, an agent reads the shared memory and writes a response to it.
Partial observability:-Foerster et al. [152] introduced a deep distributed recurrent Q-network (DDRQN) algorithm based on a long short-term memory network to deal with POMDP problems in the multi-agent setting. Gupta et al. [155] extended three types of single-agent RL algorithms, based on policy gradient, temporal-difference error, and actor-critic methods, to the multi-agent systems domain. Their work shows the importance of using DRL with curriculum learning to address the problem of learning cooperative policies in partially observable complex environments. The deep recurrent policy inference Q-network (DRPIQN) was conceived by [156] to address the problem of partial observability in multi-agent systems.
We refer the interested reader to the following survey papers for a more in-depth discussion on the topic of multi-agent reinforcement learning: Hernandez-Leal et al. [128] provide a comprehensive survey on the non-stationarity problem in MARL; OroojlooyJadid and Hajinezhad [143] review cooperative MADRL, scoping their survey to papers that study decentralized MARL models with a cooperative goal; Da Silva and Costa [157] focus on transfer learning for MARL systems; Althamary et al. [158] provide a survey on using multi-agent reinforcement learning methods for vehicular networks; Du and Ding [159] survey MARL from the perspective of challenges and applications; a selective overview of theories and algorithms is presented in [124]; and a survey and critique of MADRL is given in [133].
2) Multi-agent Evolution Strategies
As previously mentioned, many challenges are still faced when solving multi-agent DRL tasks. These make it difficult for such systems to always perform as expected [160]. In this section, we present the literature using ESs for multi-agent learning tasks. Many of the proposed methods hybridize DRL and ESs in such situations.
Hiraga et al. [161] developed robot controllers based on ESs for managing congestion in robotic swarm path formation using LEDs. The experiment involved a swarm of robots, each having seven distance sensors, a ground sensor, an omnidirectional camera, and RGB LEDs. A three-layered artificial neural network (ANN) represents the controller of the robot. Its inputs are the distance sensors, the ground sensor, and the camera, and its outputs control the motors and the LEDs. A (µ, λ)-ES is utilized to optimize the weights of the controller. A copy of the controller is deployed on N different robots and is then assessed according to the swarm's performance. A similar approach was proposed in [162] for building a swarm capable of cooperatively transporting food to a nest and collectively distinguishing between food and poison. Hiraga et al. [162] developed a controller for a robotic swarm using CMA-ES, aiming to automatically generate the behavior of the robots. The experiment involved a swarm of robots, each having eight distance sensors, an omnidirectional camera, a three-layered ANN controller, and two motors (left and right). The sensors constitute the inputs to the first layer and the ANN output controls the motors. CMA-ES is utilized to optimize the weights of the controller. The controller is then copied to the N robots and evaluated accordingly. The fitness function combines a positive reward (when transporting food) with a negative one (when transporting poison). Several experiments showed that the performance grows exponentially with the number of robots, and that the developed controllers are scalable.
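To make the (µ, λ) selection scheme concrete, the following is a minimal sketch of a (µ, λ)-ES acting on a flat weight vector. It is not the setup of [161]: the fixed mutation strength, the population sizes, and the toy fitness function (a stand-in for a swarm-level performance score) are all assumptions made for illustration.

```python
import numpy as np

def mu_lambda_es(fitness, dim, mu=5, lam=20, sigma=0.3, generations=60, seed=0):
    """Minimal (mu, lambda)-ES with a fixed mutation strength sigma."""
    rng = np.random.default_rng(seed)
    parents = rng.normal(0.0, 1.0, size=(mu, dim))
    for _ in range(generations):
        # Each offspring is a Gaussian perturbation of a randomly chosen parent.
        idx = rng.integers(0, mu, size=lam)
        offspring = parents[idx] + sigma * rng.normal(size=(lam, dim))
        scores = np.array([fitness(w) for w in offspring])
        # Comma selection: the next parents are drawn from the offspring only.
        parents = offspring[np.argsort(scores)[-mu:]]
    final_scores = np.array([fitness(w) for w in parents])
    return parents[np.argmax(final_scores)]

# Hypothetical stand-in for a swarm-level score: closeness to a target weight vector.
target = np.array([1.0, -2.0, 0.5, 3.0])

def toy_fitness(weights):
    return -float(np.sum((weights - target) ** 2))

best_weights = mu_lambda_es(toy_fitness, dim=target.size)
```

In a swarm setting, `toy_fitness` would be replaced by a rollout in which the same weight vector is copied to every robot and the collective outcome is scored.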
Shopov and Markova [163] combined ESs and multi-agent DRL (Deep Q-Networks) for sequential games and showcased the model's efficiency compared to classical multi-agent reinforcement training with ε-greedy exploration. Their experiment aims to optimize the behaviour of a group of autonomous agents (the pursuers) in a map. Tests were performed on two cases: one map with almost no obstacles and another with many obstacles (and thus an increased probability of falling into a local minimum). Using ESs on the latter yielded better performance.
Tang et al. [164] proposed an adversarial multi-agent training system, in which a quadruped robot (the protagonist) is trained to become more agile by hunting an ensemble of escaping robots (the adversaries) that follow different strategies. An ensemble of adversaries is used because each one proposes a different escape strategy, which improves agility (agility here refers to coordinated control of legs, balance control, etc.). Training is done using ESs, more specifically by extending CMA-ES to the multi-agent framework. Training involves two loops: an outer loop which iteratively trains the protagonist and the adversaries, and an inner loop which optimizes the policy of each. Policies are represented by feedforward neural networks and are optimized with CMA-ES. This method was compared to MADDPG and MATD3 (state-of-the-art actor-critic-based multi-agent RL algorithms) and proved to be more successful in learning agile locomotion behaviors. Additionally, leveraging an adversarial approach outperformed methods with no adversarial training.
Chen and Gao [165] proposed a predator-prey system which leverages ESs (OpenAI-ES, CMA-ES). Multiple predators are trained to catch prey within a certain time frame. The predator controllers are homogeneous and are represented by neural networks whose parameters are optimized with ESs (OpenAI-ES, CMA-ES) and Bayesian Optimization. The NN has three inputs (the inverse of the distance from the predator to the nearest other predator, the angle between the orientation of the predator and the direction of the prey relative to the predator, and the distance between the predator and the prey), one hidden layer, and two outputs for controlling the angular velocities of the two wheels. The prey's controller follows a simple fixed evasion strategy: having computed a danger zone map, the prey navigates towards the least dangerous locations. Across various experiments, the predators showcased a successful collective behavior: moving in formation and avoiding collisions. In the final experiments, CMA-ES outperformed Bayesian Optimization and OpenAI-ES, as it can better handle noise. The system seemed to be successful in real life as well (not just in simulation).
In a multi-agent setting, agents often receive a single reward shared by all agents, making it harder to learn proper cooperative behaviors. The authors of [160] thus proposed to use Parallelized ESs along with a Value Decomposition Network (useful for identifying each agent's contribution to the training process) for solving cooperative multi-agent tasks. Figure 9 gives an overview of the resulting PES-VD algorithm. PES-VD consists of two phases. First, the policy of each agent is represented by a neural network with parameters θ, optimized using Parallelized ES. Each agent thus chooses its actions independently, following its policy and interacting with its environment.
Second, since the reward is common to the whole team, a Value Decomposition Network is used to compute the fitness of each of the different policies. Finally, PES-VD is implemented in parallel on multiple cores: M workers evaluate the policies and compute the gradients of the Value Decomposition Network, and a master node collects the data and updates the policies and the Value Decomposition Network accordingly. PES-VD was compared with gradient-based methods (REINFORCE, Actor-Critic, DQN, VDN) and gradient-free methods (Random search [166] and OpenAI-ES) in two different multi-agent environments: the Multi-Agent Particle Environment (MAPE) [167] and the StarCraft Multi-Agent Challenge (SMAC) [168]. The proposed method seemed promising for both benchmarks and outperformed the gradient-based and gradient-free methods in the Cooperative Navigation task.
Rais Martínez and Aznar Gregori [169] assess the performance of ESs (CMA-ES, PEPG, SES, GA and OpenAI-ES) for multi-agent learning in the swarm aggregation task. Swarm aggregation was first proposed in [170], where it was solved using DRL. In this problem, each robot controller is represented by a neural network with 2 hidden layers. It takes 8 infrared sensors and 4 microphones as inputs and drives 2 wheels and a speaker as outputs. Each robot in the swarm runs the same network, thus maintaining collective behaviour. The results showed that ESs were successful in training the aggregation behavior. CMA-ES achieved the best solution for small swarms (5, 10 and 20 robots) and SES for larger ones (40 robots).

3) Comparison
Here we summarize our observations of this section:
• Training under a multi-agent setting is more challenging than training a single RL agent for a plethora of reasons. There are usually two types of agents in MARL: cooperative and competitive agents. Algorithms can make use of a centralized or decentralized framework and will act in a partially or fully observable environment.
• New algorithms such as PES-VD [160] propose a direct solution to some of the main challenges of MADRL. PES-VD uses a Value Decomposition Network for solving structural credit assignment problems.
• Using ESs for multi-agent learning is still a growing field with large potential, given the many advantages ESs can bring to concepts such as "collective robotic learning" and "cloud robotics" [173] through their improved approach to parallelism [4].

V. HYBRID DEEP REINFORCEMENT LEARNING AND EVOLUTION STRATEGIES ALGORITHMS
Although DRL and ES have the same objective (optimizing an objective function in a potentially unknown environment), they have different strengths and weaknesses [176,177]. For example, DRL can be sample efficient thanks to the combination of RL and deep learning, while ES have robust convergence properties and exploration strategies. The hybrid approach combines DRL and ES to get the best of both worlds. Although the idea is not new [178], hybridizing DRL and ES has gained momentum, driven by the recent success of DRL and ES [4,179]. In the following, we describe a few population-guided parallel learning schemes that enhance the performance of RL algorithms.
Pourchot and Sigaud [174] addressed the problem of policy search by proposing CEM-RL: a hybrid algorithm that combines the cross-entropy method (CEM) with either the Twin Delayed Deep Deterministic policy gradient (TD3) [175] or the Deep Deterministic Policy Gradient (DDPG) [19] algorithm (Figure 10). The CEM-RL architecture consists of a population of actors that are generated using CEM, and of a single DDPG or TD3 agent. The actors generate diversified training data for the DDPG/TD3 agent, and the gradients obtained from DDPG/TD3 are periodically inserted into the population of the CEM to optimize the search process. The authors showed that CEM-RL is superior to CEM, TD3 [175], and Evolution Reinforcement Learning (ERL) [180], a hybrid algorithm that combines a DDPG agent with an evolutionary algorithm. Shopov and Markova [163] merged Deep Q-Networks and ES to develop a hybrid agent that is able to discover high-performing reinforcement-learning policies in sequential games.
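The population side of such a hybrid can be sketched with a plain cross-entropy method. The sketch below covers only the CEM loop (sample, evaluate, refit to the elites); the DDPG/TD3 gradient step of CEM-RL is deliberately omitted and marked by a comment. The diagonal Gaussian, the elite fraction, the variance floor, and the toy score function are assumptions made for illustration.

```python
import numpy as np

def cem_step(mean, var, score, rng, pop_size=20, elite_frac=0.5, var_floor=1e-3):
    """One CEM iteration over actor parameters with a diagonal Gaussian."""
    population = mean + np.sqrt(var) * rng.normal(size=(pop_size, mean.size))
    # A hybrid such as CEM-RL would update part of the population with
    # DDPG/TD3 gradients at this point; that step is not modeled here.
    scores = np.array([score(p) for p in population])
    n_elite = max(1, int(pop_size * elite_frac))
    elites = population[np.argsort(scores)[-n_elite:]]
    # Refit the sampling distribution to the elite actors.
    return elites.mean(axis=0), elites.var(axis=0) + var_floor

# Hypothetical score: higher is better, with the optimum at `goal`.
goal = np.array([2.0, -1.0])

def toy_score(params):
    return -float(np.sum((params - goal) ** 2))

rng = np.random.default_rng(0)
mean, var = np.zeros(2), np.ones(2)
for _ in range(40):
    mean, var = cem_step(mean, var, toy_score, rng)
```

The variance floor keeps the sampling distribution from collapsing to a point, which is one simple way to preserve some exploration in later iterations.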
Houthooft et al. [181] devised a hybrid RL agent, Evolved Policy Gradients (EPG), that, in addition to the policy, optimizes a loss function. EPG consists of two optimization loops: the inner loop uses stochastic gradient descent to optimize the agent's policy, while the outer one utilizes ES to tune the parameters of a loss function that the inner loop minimizes. Thanks to this ability to fine tune the loss function according to the environment and agent history, EPG can learn faster than a standard RL agent.
Diqi Chen and Gao [182] proposed a hybrid agent to approximate the Pareto frontier uniformly in a multi-objective decision-making problem. The authors argued that despite the fast convergence of DRL, it cannot guarantee a uniformly approximated Pareto frontier. ES, on the other hand, achieve a well-distributed Pareto frontier, but they face difficulties optimizing a DNN. Therefore, Diqi Chen and Gao [182] proposed a two-stage multi-objective reinforcement learning (MORL) framework. In the first stage, a multi-policy soft actor-critic algorithm learns multiple policies collaboratively. In the second stage, a multi-objective covariance matrix adaptation evolution strategy (MO-CMA-ES) fine-tunes policy-independent parameters to approach a uniform Pareto frontier.
The last layer of a DNN is harder to train than the ones preceding it [183]. De Bruin et al. [184] used a hybrid approach to train and fine-tune a DNN control policy. Their approach consists of two main steps: (i) learning a state representation and initial policy from high-dimensional input data using gradient-based methods (i.e., DQN or DDPG); and (ii) fine-tuning the final action selection parameters of the DNN using CMA-ES. This architecture enables the policy to outperform its gradient-based counterpart while using fewer trials than a pure gradient-free policy.
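The two-step idea (freeze the learned representation, then search only over the final layer) can be sketched as follows. This is a toy illustration rather than the method of [184]: a simple elitist (1+λ)-ES stands in for CMA-ES, the frozen layer is random rather than trained with DQN or DDPG, and the "return" is a hypothetical surrogate (negative regression error).

```python
import numpy as np

rng = np.random.default_rng(0)

# Step (i), stand-in: a frozen representation layer. In the approach described
# above these weights would come from gradient-based training; here they are random.
W_frozen = rng.normal(size=(8, 4))
states = rng.normal(size=(64, 4))
features = np.tanh(states @ W_frozen.T)          # shape (64, 8)

# Hypothetical surrogate for an episode return: negative error to reference actions.
reference_actions = features @ rng.normal(size=8)

def surrogate_return(w_last):
    return -float(np.mean((features @ w_last - reference_actions) ** 2))

def finetune_last_layer(w_init, iterations=200, lam=16, sigma=0.1):
    """Step (ii), stand-in: elitist (1+lambda)-ES over the last-layer weights only."""
    w_best, f_best = w_init.copy(), surrogate_return(w_init)
    for _ in range(iterations):
        candidates = w_best + sigma * rng.normal(size=(lam, w_best.size))
        returns = np.array([surrogate_return(c) for c in candidates])
        if returns.max() > f_best:               # keep the parent unless beaten
            w_best = candidates[returns.argmax()].copy()
            f_best = float(returns.max())
    return w_best, f_best

w_init = np.zeros(8)
initial_return = surrogate_return(w_init)
w_tuned, tuned_return = finetune_last_layer(w_init)
```

Because only the eight last-layer weights are searched, the gradient-free stage operates in a far smaller space than the full network, which is consistent with the reduced number of trials reported above.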
Several other researchers have also proposed solutions hybridizing ES and DRL for various applications. For example, Song et al. [185] proposed ES-ENAS, a neural architecture search (NAS) algorithm for identifying RL policies using ES and Efficient NAS (ENAS); Ferreira et al. [186] used ES to learn agent-agnostic synthetic environments (SEs) for Reinforcement Learning.

A. Comparison
Here we summarize our observations of this section:
• DRL suffers from the temporal credit assignment problem and from sensitivity to hyperparameter selection, and its exploration may be more brittle because it relies on a single agent, while ES have low data efficiency and struggle with large optimization tasks.
• Combining both approaches can help address some of these identified challenges.
• Some hybrid methods proposed in the literature seem to outperform each method used on its own.
• Hybridizing DRL and ES is still a relatively new field of research.

VI. APPLICATIONS
The following section provides a comparison of DRL and ESs from an application perspective. While both have been applied to a wide array of applications, DRL still seems to receive more attention than ES within the scientific community (searching Google Scholar for "Deep Reinforcement Learning" results in 6380 hits, while only 1480 for "Evolution/Evolutionary Strategies"). Figure 11 showcases the most prevalent application domains for both DRL [190] and ES. The data for Figure 11 has been collected by querying Google Scholar for papers titled with "Deep Reinforcement Learning" or "Evolutionary Strategies" in conjunction with any of the key words listed in Table VIII.

A. Deep Reinforcement Learning applications
Gaming. Video games such as the Atari games [191] are excellent testbeds for DRL algorithms, given their well-defined problem setting and virtual environment. This makes evaluation safe and often faster than real-world experiments [192].
[Table fragment: hybrid DRL and ES algorithms]
Algorithm | Description | Reported result | Ref.
EPG | uses gradient descent and CMA-ES for policy and loss function optimization, respectively | achieves faster learning than policy gradient methods and provides qualitatively different behavior from other popular metalearning algorithms | [181]
MO-CMA-ES | integrates a multi-policy soft actor-critic algorithm with a multi-objective covariance matrix adaptation evolution strategy to approach a uniform Pareto frontier | exceeds other algorithms, such as the hypervolume-based [187], radial [188], Pareto following [188], and Deep Neuroevolution [189] algorithms, on computing the Pareto frontier | [182]
Fine-tuned DRL | combines CMA-ES and DQN or DDPG to train and fine-tune a DNN control policy | surpasses gradient-based methods while requiring fewer iterations than gradient-free ones | [184]

[TABLE VIII fragment: search keywords per application domain]
Robotics: robotics, "motion control", robots, "robot navigation", assembly
Finance: finance, financial, trading, portfolio, stock, price, liquidation, hedging, banking, trader, cryptocurrency, underpricing
Computer Vision: "image detection", "face recognition", "object detection", "object localization", "visual tracking", "object tracking", "target tracking", "face tracking", "trajectory tracking"
Communications: network, routing, communications, wireless, 5g, LTE, MAC, "access control", "network slicing", excluding: "neural network"
Healthcare: healthcare, treatment, cancer, medical, blood, patient, diagnostic, diagnosis, clinical, infection, disease
Energy: microgrid, "energy trading", "energy management", "power grid", "power control", "building energy", "electric power", "energy storage", "heating energy"
Transportation: transportation, transport, "vehicle routing", ride-hailing, "traffic signal", "car-following", fleet, "autonomous vehicle", "traffic light", "vehicle driving", "autonomous braking"
Edge Computing: "edge computing", "edge caching", "edge-assisted", MEC, offloading
Civil Engineering: structural, pavement, stormwater

There have been two important triumphs for DRL with respect to perfect information games. First, in 2015, Mnih et al. [193] developed an algorithm that could learn to play different Atari 2600 games at a superhuman level using only the image pixels as input. This work paved the way for DRL applications trained on high-dimensional data based only on the reward signal. Soon after, in 2016, Silver et al. [194] developed AlphaGo, the first program ever to beat a world champion in Go. Instead of the handcrafted rules often seen in chess programs, AlphaGo consisted of neural networks trained using a combination of Supervised Learning (SL) and RL.
Only a year later, this achievement was surpassed by Silver et al. [195], whose AlphaGo Zero program beat its predecessor AlphaGo. AlphaGo Zero was based solely on RL, omitting the need for human data. More recent works have also been successful in imperfect information games which, unlike Go and Atari games, only let agents observe part of the system. In OpenAI Five [196], agents were able to defeat the world's esports champions in the game of Dota 2, while AlphaStar [197] attained one of the highest rankings in the complex real-time strategy game of StarCraft II. Other works [46,50,70,75] further examined DRL algorithms' ability to scale, parallelize, and explore using Atari games. Lastly, an extensive survey on DRL in video games has been composed by Shao et al. [192].
Robotics. Robotics is another domain which forms a prominent testbed for DRL algorithms [198,199]. DRL can provide robots with navigation, obstacle avoidance and decision making capabilities by mapping sensory data directly to actual motor commands [200,201]. In some cases this has enabled robots to learn complex movements such as jumping or walking [202,203]. Tai et al. [204] proposed a mapless motion planner which relies on training in simulation, after which physical agents were able to navigate unknown environments without fine-tuning. While most works involved simulation, Gu et al. [173] showed that DRL can be used to learn complex 3D robotic manipulation skills from scratch on real-world robots, and further reduced training time by making agents pool their policy updates. Haarnoja et al. [203] demonstrated that, using DRL, one can also achieve stable quadrupedal locomotion on a physical robot within a reasonable time without prior training. For an in-depth review of the use of DRL for robot manipulation, we refer the interested reader to [198,199].
Finance. DRL also finds applications in trading [205,206] and investment management [207], including cryptocurrency [208]. Moody and Saffell [209] built a DRL agent for stock trading using raw financial data as the DNN input. Carapuço et al. [210] described a system for short-term speculation in the foreign exchange market, based on DRL. Wu et al. [211] proposed adaptive stock trading strategies leveraging DRL. A more recent DRL work by Lei et al. [212] adaptively selects between historical data and the changing trend of a stock, depending on the current state.
Computer Vision. Computer vision problems often involve high-dimensional data, which lends them to DRL solutions. For instance, DRL can greatly improve the efficiency of image classification and of object localization and recognition by focusing on the most promising regions, using a so-called 'glimpse window' as seen in Mnih et al. [213]. Caicedo and Lazebnik [214] proposed using a DQN for object localization by transforming a bounding box to identify the most specific location using a top-down analysis. Kong et al. [215] took things a step further by introducing a collaborative multi-agent DRL system with inter-agent communication to search for joint objects (e.g., a person riding a bike). Supancic III and Ramanan [216] used DRL to learn policies for motion planning, deciding where to look in the frame, when to reinitialize, and when to update the appearance model.
Communications. Upcoming networks such as the 5G network emphasize the need for efficient, dynamic and large-scale solutions [217]. DRL has been emerging as an effective tool to tackle various problems and challenges within the field of networking [218]. For example, Wang et al. [219] applied a DQN to automatically optimize data transmission and reception in a multi-wireless-channel access problem. Ye and Li [220] developed a similar system for vehicle-to-vehicle communication. The optimal transmission bitrate can change over time.
DRL can dynamically optimize the bitrate based on the quality of the last segment, the current buffer state [221,222], rebuffering times, and other channel statistics [223,224].
Proactive caching can greatly reduce the number of transmissions over the network. However, deciding which content to cache is not trivial. Researchers have used DQNs to determine which information to keep in a cache based on observations of the channel state [225], cache state [226], request history [227,228] and available base stations [229,230,231].
Offloading can improve performance and reduce battery consumption of edge devices. However, the response time of a busy server might be unacceptably long. Thus, the time-varying channel conditions of a server need to be taken into account when offloading. DQNs can observe the channel quality [232,233], task queues, remaining file size [234], and the servers' capacities [235] to offload optimally.
Healthcare. Recently, DRL has gained traction for applications such as personalized healthcare treatments [236]. Liu et al. [237] proposed the first DRL framework for estimating optimal dynamic treatment regimes from observational medical data, using DRL to estimate the long-term value function. Many applications are dedicated to medical image processing, extracting features or detecting anatomical objects from 2D/3D MRI or CT images [238,239,240,241]. Liu et al. [242] present different DRL models which can detect lung cancer using data collected from medical IoT devices. Furthermore, DRL has also been applied to query patients about symptoms and diagnose diseases based on clinical data [243,244,245].
Energy. Within the energy sector, smart grids make intelligent decisions with respect to electricity generation, transmission, distribution, consumption and control. DRL has been used in a variety of settings to tackle electric power system decision and control problems [246], such as in the context of microgrids [247] or building energy optimization [248,249].
Transportation. Congestion, safety and efficiency are important aspects of transportation. DRL is often used for adaptive traffic signal control to reduce waiting times [250,251,252]. Chen et al. [253] expanded upon this and conceived the first DRL control system which scales to thousands of traffic lights. Wang and Sun [254] developed a MADRL framework to prevent 'bus bunching' and streamline the flow of public transport. Manchella et al. [255] proposed a model-free DRL algorithm which packs ride-sharing passengers together with goods delivery to optimize fleet utilization and fuel efficiency.

B. Evolution Strategy applications
ES are also used for a myriad of applications as shown in Figure 11b. The main categories are discussed in the section below.
Gaming. Similarly to DRL, gaming represents one of the main testbeds for ES. Most of the ES literature reviewed in this survey tests its algorithms on Atari games [4,5,29,32,256,257,258,259]. These are considered to be challenging as they present the agents with high-dimensional visual inputs and a diverse and interesting set of tasks that were designed to be difficult for human players [14].
Robotics. The robotics field is also leveraged for testing ES. Simulations are mostly performed in the MuJoCo simulator [260] or in PyBullet [261], as in [4,59,80,86,256,262,263,264]. The different tasks in these simulators are used to benchmark algorithms in the continuous control domain. Several other researchers tested their algorithms on real robots: Song et al. [115] adapted their proposed method for meta-learning using ES to real robots. Additionally, Chen and Gao [165] adapted their predator-prey method to hardware as well: one phase assumed that the environment is fully observable, while another covered a partially observable environment.
Communications. Different methods that use ES for computer networks have been proposed in the literature. Pérez-Pérez et al. [265] used Evolution Strategy with NSGAII (ESN) to approximate the Pareto frontier of mobile ad hoc networks (MANETs). Krulikovska et al. [266] used ES for the routing of multipoint connections. Additionally, they proposed methods for improving ES. Nissen and Gold [267] used ES for designing a survivable network, keeping in mind economics and reliability. Shahhoseini and Torkzadeh [268] proposed a multi-constraint QoS routing algorithm. He et al. [269] analyzed the data characteristics of wireless sensor networks (WSNs) and proposed a method for fault diagnosis of WSNs based on a belief rule base (BRB) model. Srivastava and Singh [270] used ES for solving the total rotation minimization problem (TRMP) in directional sensor networks. Srivastava et al. [271] presented an ES method for solving the cover scheduling problem in wireless sensor networks (WSN-CSP). Gu and Potkonjak [272] proposed an ES method to search for a network configuration able to produce and stabilize the responses of a Physical Unclonable Function (PUF).
Edge Computing. Emadi et al. [273] proposed the use of CMA-ES for optimizing task scheduling in cloud computing. Mai et al. [274] proposed a hybrid RL and ES method for real-time task assignment among fog servers, aiming to minimize the computation latency. Shukla et al. [275] proposed a hybrid RL and ES approach for solving the problem of high latency between healthcare IoT devices, end-users, and cloud servers.
Finance. Korczak and Lipinski [276] presented a portfolio optimization algorithm using ES. Rimcharoen et al. [277] and Sutheebanjard and Premchaiswadi [278] proposed the Adaptive and (1+1) ES methods for predicting the movement of the Stock Exchange of Thailand index. Bonde and Khaled [279] predicted the changes (increase or decrease) of stock prices for different companies using ES and Genetic Algorithms. Pai and Michel [280] proposed ES with hall of fame (ES-HOF) for optimizing long-short portfolios with the 130-30-strategy-based constraint. Pai and Michel [281] used multi-objective ES for futures portfolio optimization. Yu [282] proposed an ES method for multi-asset multi-period portfolio optimization. Sable et al. [283] proposed an ES approach for predicting the short-term prices of stocks. Sorensen et al. [284] applied meta-learning algorithms to ES for stock trading.
Civil Engineering. ES have also often been used in civil engineering, where they meet the demands of structural design optimization tasks. Hasançebi [285] studied the computational performance of adaptive ES in large-scale structural optimization. Additionally, Mitropoulou et al. [286] showed that ES can be considered efficient tools for both single and multi-objective design optimization of structural problems. The authors of [287] proposed an ES-integrated parallel optimization algorithm meant to minimize the total member weight in each test steel frame. In [288], ES were used to adapt the battery recharge strategy to changing environments. Ogidan and Giacomoni [289] applied an enhanced nondominated sorting evolution strategy (eNSES) to sanitary sewer overflow (SSO) optimization problems. Hajebi et al. [290] proposed an iterative optimization technique with CMA-ES for the subsurface inverse profiling of a 2-D inhomogeneous buried dielectric target.

C. Comparison
Here we summarize our observations of this section.
• Both DRL and ES algorithms have found adoption in a substantial number of different applications.
• DRL-based solutions seem to excel in situations that require scalable and adaptive behavior.
• ES applications are less widespread and mostly centered around specific use cases such as structural optimization.
• Although DRL and ES are used quite extensively in robotics tasks, moving from simulation to reality is still a major gap. Most of the studies discussed simulations, but few actually implemented their methods on real robots.

VII. CHALLENGES AND FUTURE RESEARCH DIRECTIONS
Although DRL and ESs have proven their worth in many AI fields, there are still many challenges to be addressed. We briefly list some of them in the sequel.

A. Deep Reinforcement Learning
Sample Efficiency. DRL agents require a large number of samples (i.e., interactions with environments) to learn goodperforming policies. Collecting so many samples is not always feasible: image training a delivery robot to navigate its surrounding from scratch [7]. Although this problem has been tackled in different ways (e.g., transfer learning, metalearning) more innovation and research are still needed [291,292]. One promising research direction to tackle this problem is model-based RL [291]. However, getting an accurate model of the environment is usually hard. Exploration versus exploitation. The exploration versus exploitation dilemma is one of the most prominent problems in RL. Beyond classical balancing approaches such as εgreedy [1], Upper Confidence Bound (UCB) [1], and Thompson Sampling [1], recent breakthroughs enable the exploration of novel environments. For example, Osband et al. [67] observed the importance of temporal correlation and proposed the bootstrapped DQN; and Bellemare et al. [70] used density models to scale UCB to problems with high-dimensional input data. However, as DRL is proposed to tackle ever more complex environments, the exploration versus exploitation dilemma still poses a challenge that requires innovation. Sparse reward. A reward signal guides the learning process of an RL agent. When this signal is sparse learning becomes much harder. Although, different solution approaches have been proposed (e.g., reward shaping [1], curiosity-driven methods [73], curriculum learning [293], hierarchical learning [294] and inverse RL [295]), learning with sparse rewards still represents an open challenge. Simulation-to-reality gap. Despite the benefits of simulations, they give rise to the sim-to-real gap: policies that are learned in simulations often do not work as expected in the real world. Different techniques are being adapted to mitigate the effect of this gap. 
For example, the authors of [296,297] randomized the simulated environment to produce more generalized models. Rao et al. [298] noted that such randomization requires manually specifying which aspects of the simulator to randomize. Therefore, they used domain adaptation (i.e., training on many simulated examples and a few real ones) to train a robot on grasping tasks without manually instrumenting the simulator. Despite such efforts, the sim-to-real gap is still an open challenge to be addressed.
Optimizing complex systems. Optimizing the performance of complex systems such as 5G networks requires a versatile and advanced optimizer. DRL has the potential to optimize such systems (e.g., Zhao et al. [299] proposed optimizing user association and resource allocation using DDQN, and Li et al. [300] suggested optimizing energy consumption using DQN). However, we are still in the early stages of harnessing the power of DRL in optimizing systems such as 5G networks. For a more detailed discussion of the challenges of DRL in networking, we refer the reader to [301].
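To make the classical balancing approaches named under exploration versus exploitation concrete, the following minimal sketch compares ε-greedy and UCB1 on a toy multi-armed bandit. It is an illustration only and is not taken from the cited works: the function names, the Bernoulli arms, their success probabilities, and the values of ε and the step count are all assumptions.

```python
import math
import random


def epsilon_greedy(counts, values, t, rng, epsilon=0.1):
    """With probability epsilon pick a random arm, else the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: values[a])


def ucb1(counts, values, t, rng):
    """Pick the arm with the highest mean estimate plus confidence bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # pull every arm once before using the bonus
    return max(
        range(len(counts)),
        key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run_bandit(select, arm_probs, steps, seed=0):
    """Run a selection rule on Bernoulli arms (hypothetical toy problem)."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)    # number of pulls per arm
    values = [0.0] * len(arm_probs)  # running mean reward per arm
    total = 0.0
    for t in range(1, steps + 1):
        arm = select(counts, values, t, rng)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return counts, total


if __name__ == "__main__":
    arm_probs = [0.2, 0.5, 0.8]  # assumed success probabilities
    for name, rule in [("epsilon-greedy", epsilon_greedy), ("UCB1", ucb1)]:
        counts, total = run_bandit(rule, arm_probs, steps=5000)
        print(name, "pulls per arm:", counts, "total reward:", total)
```

In this sketch ε-greedy keeps exploring at a constant rate, whereas the UCB1 bonus shrinks for arms that have already been pulled often, so exploration concentrates on arms whose value is still uncertain.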

B. Evolution Strategies
Sample Efficiency. ESs can provide more robust policies than DRL; however, they are even less sample efficient, as they work with full-length episodes [302,303] and do not use any type of memory [302]. Approaches to improve sampling in ESs include Importance Mixing [302,304] and Sample Reuse. However, this line of research is still young and is attracting a lot of attention from the scientific community.
Noise handling. While ESs tolerate some noise due to their randomized nature, noise makes their computations more difficult and causes their performance to approach that of a random walk [21,305,306]. Several solutions have been proposed to improve noise handling in ESs, such as re-evaluation of points [305] and adapting the population size during fitness evaluation to improve the signal-to-noise ratio [306].
A detailed summary of the challenges of ESs and related methods, such as differential evolution and swarm optimization, is presented in [21].
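The idea of re-evaluating points to cope with noise can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure of [305]: it uses a basic (μ, λ)-ES with a fixed step size and intermediate recombination on a noisy sphere function, and every name and parameter value (noisy_sphere, averaged_fitness, simple_es, noise_std, n_evals, and so on) is hypothetical.

```python
import random


def true_sphere(x):
    """Noise-free sphere function; its minimum is 0 at the origin."""
    return sum(xi * xi for xi in x)


def noisy_sphere(x, rng, noise_std=0.5):
    """Sphere function observed through additive Gaussian noise."""
    return true_sphere(x) + rng.gauss(0.0, noise_std)


def averaged_fitness(x, rng, n_evals):
    """Re-evaluate the same point n_evals times and average the results."""
    return sum(noisy_sphere(x, rng) for _ in range(n_evals)) / n_evals


def simple_es(dim=5, mu=5, lam=20, sigma=0.3, generations=100,
              n_evals=5, seed=0):
    """Basic (mu, lambda)-ES with fixed step size (illustrative only)."""
    rng = random.Random(seed)
    start = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
    parent = list(start)
    for _ in range(generations):
        # Sample lam offspring around the current parent.
        offspring = [[p + sigma * rng.gauss(0.0, 1.0) for p in parent]
                     for _ in range(lam)]
        # Rank offspring by their averaged (less noisy) fitness, lowest first.
        ranked = sorted(offspring,
                        key=lambda x: averaged_fitness(x, rng, n_evals))
        best = ranked[:mu]
        # Intermediate recombination: the new parent is the mean of the best mu.
        parent = [sum(ind[i] for ind in best) / mu for i in range(dim)]
    return start, parent


if __name__ == "__main__":
    start, final = simple_es()
    print("true fitness at start:", true_sphere(start))
    print("true fitness at end:  ", true_sphere(final))
```

Averaging n_evals evaluations reduces the standard deviation of the noise by a factor of √n_evals, at the price of n_evals times more fitness evaluations per generation, which ties noise handling directly to the sample-efficiency concern above.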

VIII. CONCLUSION
Deep Reinforcement Learning (DRL) and Evolution Strategies (ESs) have the same objective but rely on fundamentally different mechanisms. Understanding their relative strengths and weaknesses may lead to an algorithmic family that is superior to either of them alone. Therefore, in this paper, we provided the necessary background on DRL and ESs and compared them. Instead of focusing on individual algorithms, we considered major learning aspects such as parallelism, exploration, meta-learning, and multi-agent learning. Further, we discussed the recent advances made in hybridizing DRL and ESs. Before discussing potential future research directions, we compared DRL and ESs from an application perspective to show how, and in which contexts, the two are used. Finally, we believe hybridizing DRL and ESs has a high potential to drive the development of agents that operate reliably and efficiently in the real world.