Modern Value Based Reinforcement Learning: A Chronological Review

Value based Reinforcement Learning returned to mainstream research prominence in 2015, following the demonstration of super-human performance on Atari 2600 games. Since then, significant media attention and hype have accompanied this area, and the field of Artificial Intelligence more generally. This review focuses exclusively on the progression of value based Reinforcement Learning over the last five years. We aim to distill the incremental improvements to stability and performance made in this period, highlighting how little the base algorithm has changed over this time. The one exception is the Recurrent Experience Replay in Distributed Reinforcement Learning algorithm, which represents a fundamental shift, and a marked increase in agent performance, through an advanced memory representation. We close by suggesting a new focus area for value based Reinforcement Learning research.


I. INTRODUCTION
Reinforcement learning is a distinct sub-topic of the broader field of Machine Learning [1], relating specifically to the notion of learning by experience, gained through the actions of an agent in its environment, in order to achieve maximum long term reward [2]. The promise of Reinforcement Learning is grand: a model free agent which can interpret any environment and learn a task to super-human ability, all with minimal user interaction. This has spurred significant interest in the field, shown most notably by the volume of papers published since its most notable successes. In practice, however, Reinforcement Learning algorithms can be very difficult to construct and train to an optimal solution, as is widely discussed, for example in [3] and [4].
The current enthusiasm for Reinforcement Learning research is broadly considered to have been seeded by the work of DeepMind on the Deep Q Network [5], [6], and its success at completing Atari 2600 video games through the Arcade Learning Environment (ALE) [7]. The promise of Reinforcement Learning was then further enhanced by its role in enabling machines to triumph over humans at the game of Go [8], [9] and at the complex real-time strategy game StarCraft [10].
The scope of this review is highlighted in Section I-B, and dictates the structure of the review as follows. Section II gives a brief history of the concepts in research that bore the field in its infancy, whilst Section III details the fundamental aspects of Value based Reinforcement Learning for consideration in the subsequent review, including concepts which are inherently part of value based Reinforcement Learning approaches. Section IV covers the early work and threads of research which formed the basis of what is considered here as modern Reinforcement Learning. Section V then details the critical approaches of Value based Reinforcement Learning in greater depth, highlighting key differences as the approaches evolved, leading into a discussion of observed results in Section VI and the conclusions of the review in Section VII.

A. THE NATURE OF REINFORCEMENT LEARNING
The Reinforcement Learning approach iteratively samples its available actions for high reward and, after some period of searching in which the correct answer is not known, refines a course of actions to achieve its objective. Hence, Reinforcement Learning applications require an exploratory stage, with instruction given based on an expected outcome that is honed over subsequent experiences.
The techniques applied in reinforcement problems, both for evaluation and instructing an algorithm, are often thought of as a conflict between exploration and exploitation [2], as discussed further in Section III-A.
The most important feature distinguishing Reinforcement Learning from other machine learning approaches is that the algorithm evaluates the actions taken to achieve some objective, and does not implicitly know the best solution. This is analogous to human learning mechanisms, where unforeseen events are progressively handled within the system with experience. As such, any successes in Reinforcement Learning for solving engineering problems are arguably a fundamental step towards true artificial intelligence, and hence the interest found here in this review.

B. SCOPE OF THIS REVIEW
In this review paper, we aim to highlight the current state of Reinforcement Learning research from the perspective of fundamental value-based approaches that have led to advancements in the field. With the breadth of research undertaken across the world, it is infeasible to address all permutations of the underlying Reinforcement Learning methodology, and as such focus will be given to the most adopted or successful approaches towards the current state of the art.
This review will briefly touch on the genesis of the value based Reinforcement Learning methodology and trace the evolution of Reinforcement Learning techniques from this important milestone, making comparisons back to the original, unaltered algorithm of Q learning.
The assertion made by [11], that a critical inhibitor to the success of Reinforcement Learning progress is the lack of reproducibility and the non-standardized approaches to assessment and reporting, is considered here to be paramount. This is particularly evident in comparison to supervised learning, whose progress is tracked through various benchmarks as shown in [12], listing the current state of the art on standard supervised learning tasks such as ImageNet [13] or CIFAR-10 [14]. The lack of standardization in reporting Reinforcement Learning results makes it much more difficult to assess the relative contribution of each work to the field, despite the presence of many suitable assessment approaches [15], [16], [17]. This matters for scope because it is intractable to critically consider approaches if they are not directly comparable in terms of performance.
This review is not an exhaustive list of all Reinforcement Learning research, nor of the applicability of each approach to the full gamut of possible use cases; rather it is a chronology of the major advancements to the fundamental approach of value based Reinforcement Learning, where the approach is assumed to be equally applicable across, for example, discrete or continuous action spaces. There are countless examples in the literature of stated gains in overall performance at a given task through minor adjustments to the application of the fundamental approach; for example, the division of the state space input into multiple policies [18], or the smoothing of the Bellman equation to reduce statistical variation in the learning phase [19]. Whilst these are valuable contributions to the overall field of research, they are not the focus of this review.
Standing out amongst recent work, we especially highlight one advancement of significant note, as it includes a previously unseen aspect: the inclusion of a specific memory term, which resulted in a dramatic increase in comparative performance. We propose that this contradicts a fundamental assumption of Reinforcement Learning, namely that the Markov Model sufficiently encapsulates the current and all previous state space information by virtue of its action-value mapping. As a corollary, there are likely additional performance gains to be made by incorporating higher level cognitive concepts into the agent's decision making. We suggest that research in the area of Reinforcement Learning will advance from this concept as a critical point in the field.

II. THE PRE-HISTORY OF MODERN REINFORCEMENT LEARNING
Modern Reinforcement Learning is characterised by Sutton and Barto [2] as a combination of two distinct historical research areas, trial-and-error search and optimal control (dynamic programming), with a third thread spanning these two fields referred to as temporal-difference methods. All threads came together in the 1980s to become what is now known as Reinforcement Learning. Optimal Control attempts to minimize system behaviour variation over time. The Bellman equation, taken from optimal control research, is fundamental to the field of Reinforcement Learning. It specifies a value function used to discern optimal actions or controls for a given objective [20], and is now the common term used to describe the action-reward pairing in Reinforcement Learning applications. Inspiration for the technique came from the work of Hamilton and Jacobi, which was subsequently extended to a discrete stochastic version referred to as Markovian decision processes [21]. An important attribute to note, referred to as the Markov property, holds where knowledge of the current state is all that is required to decide the best course of action, as illustrated graphically in Figure 1. All information contained in every previous state is said to be inconsequential, being inherently incorporated in the current state. Perhaps the most notable examples of optimal control are the works of [22] and [23].
The more common thread of trial-and-error search is more intuitive to humans due to the congruence with our own learning mechanisms, and has arguably received more attention in research as a result, where it is even claimed as an essential aspect of true artificial intelligence [24]. Early influential work in trial-and-error search techniques involved the playing of simple board games repeatedly to reward or punish moves post game [25]. This extended to more complex tasks such as the pole balancing task [26].
The temporal-difference method is the third thread of research contributing to the unified field of modern Reinforcement Learning, where temporal difference ideas were first applied to checkers in order to modify the function used for decision making online, which included what is known as secondary reinforcers [27]. Following this, research into Reinforcement Learning slowed until linkages were made to animal psychology by [28] and carried on by [29]. The unification of these threads came when optimal control and temporal difference were utilised in the Q-learning approach by [30] and [31].

III. VALUE-BASED REINFORCEMENT LEARNING CONCEPTS
Value-based Reinforcement Learning is the methodology of an artificial intelligence agent acting, within an environment and through trial-and-error, with priority towards high reward states, assigning current and expected future value to all intermediate state-action pairings through gradient based optimisation. The agent then seeks out high value states given an exploitation policy to maximise overall reward. The Reinforcement Learning model is described in [32] as any learning agent connected to the environment through perception of the state s of that environment and action a taken within it.
In essence, Reinforcement Learning is a process of improving an agent's ability to perform some task through experience, or trial and error. A fundamental process of evaluation feedback is the action-reward pairing, giving estimated reward at some future time, based on the learning agent and its interaction with the environment. An agent is any mathematical representation of the decision making process, acting on an environment [2]. Observations are made by the agent of the current state s_t, and a best action a_t to take is estimated according to some policy π_t(s, a) at any given time t, where s_t ∈ S, the set of all states, and a_t ∈ A, the set of all actions. Mathematically, the overall reward is R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯, where γ is a discount factor to promote short or long term actions [2], [30]. This is graphically illustrated in Figure 2, where the relationship to the Markov Property is observable. This gives the concept of a Value Function, in which the best course of action is calculated, based on the intermediate state values V and selection policy π, to maximise the overall reward R_t. The agent predicts randomly at first, given that no understanding of the pattern of inputs, and how they relate to outputs, is yet established. The system is then updated to minimise the error between the calculated outputs and the target values. All value based Reinforcement Learners are predicated on this fundamental principle: maximising the value function through maximum overall reward is optimal for the success of the agent at the given task.

A. EXPLORATION AND EXPLOITATION
A critical differentiator between Reinforcement Learning and other machine learning approaches is the requirement to explore the action space [32]. The reward which is obtained by a selected action for a given state is not known a-priori, and an agent must build this understanding within the model through trial and error. The difficulty is that exploration implies sub-optimal action selection in order to avoid local minima. There are many techniques applied to ensuring adequate exploration of the state space, the most common of which is ε-greedy, in which the exploitation action, being the action with highest estimated reward, is selected unless some small probability criterion is met, in which case a random action is taken instead. Numerous incremental enhancements to the task of exploring the state space have been made: using the accuracy of the reward estimate to signal an under sampled state-action pair [33]; controlling the number of available actions for a given state, thereby controlling the probability of exploration actions [34]; introducing noise into the parameter space to introduce stochastic variation into the state-action pairing [35], [36]; or simply tracking the relative frequency of action selection for a state and altering the probability for under sampled actions [37], [38], [39], [40]. It is considered here that these different approaches to ensuring the state space is adequately explored do not fundamentally change the Reinforcement Learning method, being an operation equally applicable across any Reinforcement Learning technique. As such, an in-depth review of each approach is out of scope of this review, noting that several research articles state a marked improvement to the overall success of the technique, and that exploration is an important area of research in its own right.
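As an illustrative sketch of the ε-greedy rule described above (standalone Python with hypothetical Q-value estimates; not code from any reviewed work):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select the greedy (exploitation) action with probability 1 - epsilon,
    otherwise take a uniformly random (exploration) action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Exploitation: the action with the highest estimated reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice ε is often annealed from a large value toward a small one, so that early training favours exploration and later training favours exploitation.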

B. MODEL FREE AND MODEL BASED REINFORCEMENT LEARNING
Model based learning involves the determination of a complete understanding of the environment and the relative reward of each action at each state. This was the standard approach historically, commonly referred to as dynamic programming. The critical failing of the approach is that it is intractable for all but the simplest of environments. The move from model-based to model-free Reinforcement Learning allowed application to complex environments, and is widely realised in modern techniques in the form of neural networks. The origins of neural networks lie in pattern recognition and perceptual learning inspired by the understanding of neural systems in mammals [41], [42], [43]. At the simplest level, a single neuron provides a binary decision, and neurons can be stacked to provide high level interpretation of information [44]. For a given input signal x, an output signal y is calculated having passed through some number of hidden nodes, as multiples of neuron link weights w passed through an activation function σ. Following on from the initial application of backpropagation in neural networks [45], the technique has been applied in a variety of forms, and has led to augmented algorithms which represent the current state of the art in numerous machine learning fields, Reinforcement Learning included amongst them. The fundamental approach used to achieve backpropagation is stochastic gradient descent [46], in which the easily measured error at the output (easy in supervised learning approaches) is propagated back through the layers by calculating derivatives, with the implementation simplified using the chain rule [47].
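The forward pass y = σ(w·x) and the corresponding chain-rule gradient step can be sketched as follows (a minimal single-neuron illustration assuming a sigmoid activation and squared-error loss; the function names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w):
    # Weighted sum of inputs passed through the activation σ.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_step(x, w, target, lr=0.5):
    """One stochastic-gradient-descent update for the squared error
    L = (y - target)^2, applying the chain rule by hand."""
    y = forward(x, w)
    dL_dy = 2.0 * (y - target)       # derivative of the loss w.r.t. output
    dy_dz = y * (1.0 - y)            # derivative of the sigmoid
    # dz/dw_i = x_i, so the full gradient for each weight is dL_dy * dy_dz * x_i.
    return [wi - lr * dL_dy * dy_dz * xi for wi, xi in zip(w, x)]
```

Repeating `sgd_step` over many samples drives the output error toward a (local) minimum, which is all that backpropagation does layer by layer in a deep network.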

C. DEEP AND SHALLOW REINFORCEMENT LEARNING
It is noted in [48] that the main focus at present in the field of machine learning is the area of supervised learning, which essentially applies a model to predict the known categories of labelled data. This is extended to Deep supervised learning through the application of many layers of convolutional filtering operations applied to more complex input states. Early work on deep networks used linear regression methods applied to polynomial activation functions [49], but the objective and benefit of augmenting the input information to emphasise the underlying data structure remain the same [48]. The first architecture to incorporate the visual cortex inspirations of what are now known as convolutional neural networks was titled the Neocognitron [50], [51], and incorporated a set of convolutional filters, as is common practice today, with the exception of the absence of backpropagation of errors into the filter weights. The modern field of Deep Belief Networks came when error backpropagation was combined with unsupervised pre-training [52], [53], where Restricted Boltzmann Machines [54], [55] were stacked to encode then reconstruct original data and hence provide low level discriminants for high level image data. Deep Learning has subsequently become a major field of research, with the ability to provide human level representation of data independent of in-depth intervention from the human writing the algorithm [56], a fundamental weakness in previous machine learning approaches [57]. Following the initial application of neural networks to Q-Learning algorithms in Reinforcement Learning [58], the technique of deep learning was incorporated into Reinforcement Learning to play Atari 2600 games as the Deep Q Learner [5], leading to the state of the art performance which could be credited with rejuvenating the field of research under review here: value based Reinforcement Learning.
Deep Reinforcement Learning is not fundamentally different to shallow Reinforcement Learning, except that it applies convolutional neural networks to the input state, which in the literature is generally imagery. Deep Learning convolutional neural networks are neural networks that include convolutional layers, which are known to improve performance of the system through enabling translation invariant predictions and enhanced representation of input features [59]. This has broadened the potential application space significantly, but again does not significantly alter the fundamental technique applied; the convolutional front-end is subsequently assumed in this review to be common to the underlying Reinforcement Learning algorithm, Deep or Shallow.

IV. GENESIS OF REINFORCEMENT LEARNING
The early research into Reinforcement Learning focused on tasks now considered simple: common board games such as backgammon, chess or checkers. At the time, however, it was revolutionary for a machine to beat a human at these games. Whilst these techniques have been dramatically superseded by modern algorithms, many of the fundamental operations remain. The following section follows the chronology of these major milestones in the application of Reinforcement Learning, and hence the growth of the field of research.

A. TEMPORAL DIFFERENCE LEARNING
Arguably the first high profile application of what is now the modern field of Reinforcement Learning is the TD-Gammon algorithm, utilising temporal difference learning to defeat human players at the game of backgammon [60], [61]. Temporal difference learning was then used to defeat human players at chess [62]. Note that IBM's Deep Blue was far more famous, defeating the world chess champion, but this approach was not Reinforcement Learning based, and arguably not even machine learning, as it was a simple alpha-beta search using brute force [63], [64].
Temporal difference learning is described in detail by [2]. The term comes from the target update, or error term for backpropagation, being essentially the difference between the measured future value and the estimated future value, moderated by a discount factor γ. The policy target in a temporal difference learning algorithm is then the sum of the experienced reward and all future rewards, under the assumption they are known, approximated in the one step case by bootstrapping on the current value estimate, mathematically given as r_t + γ V(s_{t+1}). Hence, the state's value estimate is updated according to

V(s_t) ← V(s_t) + α [ r_t + γ V(s_{t+1}) − V(s_t) ],     (4)

where α is the learning rate parameter. TD(n) and TD(λ) are extensions to the fundamental approach of temporal difference learning in which a number of the steps taken are used in each bootstrap update [2]. In the latter case the relative weights of each step are altered by the value of the λ term.
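The one step temporal difference update can be sketched as follows (an illustrative tabular implementation; the dictionary-based value table is an assumption for brevity):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) bootstrap update: move V(s) toward the target
    r + γ V(s'), moderated by the learning rate α."""
    td_error = r + gamma * V[s_next] - V[s]   # measured minus estimated value
    V[s] = V[s] + alpha * td_error
    return V
```

Note how the update needs only the current transition (s, r, s'), not the full return, which is what allows learning online during an episode.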
A limitation of the approach is that each state must be sampled sufficiently in order to converge to the true value at that state. This is intractable for modern applications as for most cases the environment is continuous, and hence there are an infinite number of states. As was stated previously, the approach was made famous by the work of [60] titled TD-Gammon by beating a human at backgammon, however it should be noted that backgammon does not suffer this limitation as there are only a relatively small number of states on the board. Since then, a number of minor alterations have been made to the technique for specific applications, without fundamentally altering the approach; [65], [66] for example incorporated tree search into the temporal difference technique.
A similar application of the temporal difference to the famous TD-Gammon learning approach was by [67] for the learning of the game draughts, or [62] for learning chess. As such, due to its relative utility in applications in early machine learning research and influence on successive implementations, it is considered here one of the fundamental steps toward the modern approach of Reinforcement Learning.

B. REINFORCE
A class of reinforcement algorithms called REINFORCE was proposed by [68] to maximise immediate reward, built on earlier work by the same researcher [69], [70]. The major contribution to modern Reinforcement Learning comes through the application of function approximators for the statistical connection layers, and the learning through backpropagation that this facilitates. Despite this, as noted by [71], the method fails to learn a value function and as a result learns much more slowly than other approaches, and for this reason it has received much less attention in subsequent research. It should be noted, however, that these methods have influenced a branch of Reinforcement Learning which is outside of scope here as it is not value based, referred to as Policy Gradient methods.

C. SARSA
Up until this point, Reinforcement Learning approaches had only addressed discrete state space problems, reliant on the ability to sample each state space sufficiently to determine all transitions [72]. Progressing the field of Reinforcement Learning to be capable of generalising to continuous state spaces was a major leap forward made by [72] in the development of the algorithm which would later be referred to as SARSA in the work of [2], an acronym for State, Action, Reward, State Prime, Action Prime, in reference to the inputs to the update phase. The approach presented by [72] extends directly on the work of [30] for the underlying update approach and of [73] in the use of neural networks. The fundamental principle of this leap forward in Reinforcement Learning research was the update of the state-action policy at each time step in an on-line methodology, with improved convergence towards an optimal policy compared to previous approaches.
In essence, the only difference between temporal difference learning and SARSA is the extension of the update detailed in Equation (4) from the mapping of states to reward values V(s_t) to the mapping of state-action pairs Q(s_t, a_t) to reward values,

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ].     (5)

This is an on-policy method in that only visited states and chosen action pairs are updated. In the event that each state-action pair is visited an infinite number of times, the approach has been shown to converge to an optimal solution.
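The on-policy character of the update is visible in a direct tabular sketch (illustrative only; a dictionary of per-state action-value lists stands in for the learned function):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: only the visited state-action pair (s, a) is
    changed, and the bootstrap target uses the action a' actually
    chosen by the current policy in s', not the best available one."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Because a' is supplied by the behaviour policy itself, exploratory (sub-optimal) actions directly shape the learned values, which distinguishes SARSA from the off-policy Q-learning update that follows.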
There have been numerous successful applications of the SARSA algorithm; for example, variations of the approach detailed in [72] and [74] were used by [75] to show that bootstrapping for online Reinforcement Learning is preferable to Monte Carlo methods. Even modern research applications still use this fundamental approach, e.g. [76] and [77], and in a swarming application [78].

D. Q-LEARNING
Two papers [30], [31] are credited with the development of the Q-learning approach through the application of the Bellman equation [20], [79] to Markov decision processes. The breakthrough was a form of learning through action and response, which has seen subsequent research apply the technique to some high profile applications, including defeating human world champions at numerous tasks, some of which are highly complex. This now forms the basis of modern Reinforcement Learning.
Built on the concept of Markov decision processes, Q Learning relies on the concept of each state s, having a probability to transition to another state through some action a gaining some reward r. The Markov model is graphically represented in Figure 1. If all rewards obtained for every action across all states are known, the optimal policy to achieve maximum reward is obvious.
Watkins [30] determined that the total discounted reward was the success criterion to optimise toward, based on the premise that reward now is worth more than some future reward. This was mathematically represented by the discount factor γ in the expression

r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ + γ^n r_{t+n} + ⋯ .
This leads to the concept of Value, where any given state has an expected value given by the highest reward achieving set of actions, articulated mathematically as the function

V^π(s) = E_π[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯ + γ^n r_{t+n} + ⋯ | s_t = s ].
This gave rise to the term Value based Reinforcement Learning, noting that the π term denotes the policy of selecting actions at each state. The Q function is the mapping of reward to each state-action pair, where the accumulated reward Q is a function of the observed state (s) and proposed action (a),

Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ].     (8)

The optimal policy is then found by maximising the reward, which is as simple as always choosing the action which yields the maximum total discounted reward, i.e. π^*(s) = argmax_a Q(s, a).
The one step update of the Q network improves the policy over consecutive updates, and is mathematically given as

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ].     (10)

The optimal return Q^* is the maximum reward possible given the rewards (r) at each future time step (t), discounted by γ.
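The update in Equation (10) can be sketched in tabular form (illustrative only; the dictionary-based Q table is an assumption for brevity). The only difference from the SARSA update is the max over next actions, which makes the method off-policy:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy one-step Q-learning update: the bootstrap target uses
    the best action available in s', regardless of which action the
    behaviour policy actually takes next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

The agent can therefore explore with any policy (e.g. ε-greedy) while the Q table still converges toward the values of the greedy policy.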
Of course, this is predicated on accurately knowing the reward expected at each state and action, which is the key shortcoming of the original research: a further leap forward was required to provide the critical mechanism to learn the reward, as opposed to expecting it to be known, ultimately leading to the sub-category of artificial intelligence referred to as Reinforcement Learning. This is the endeavour of the subsequent research within modern Reinforcement Learning. A number of successful applications of the concept in this fundamental form have nevertheless been found [80], [81].

V. MODERN REINFORCEMENT LEARNING
The renaissance of Reinforcement Learning in research today can be linked to a highly cited body of work by Google's DeepMind project [5], [6] identified as the Deep Q Network (DQN). In the context of this review paper, modern value based Reinforcement Learning is considered as approaches building on, or post publication of this work announcing super-human performance against Atari games and of course the DQN itself.

A. DEEP Q LEARNING
What is referred to here as the vanilla DQN method was introduced in the papers [5], [6] from the Google DeepMind group, following on from the underlying research into Reinforcement Learning of arcade games [82], [83]. This approach generated significant attention for the field of Reinforcement Learning, and Machine Learning in general, due to the reported super-human performance of the algorithm on a high proportion of tested Atari games, as summarised in Figure 3, which shows the stated performance results relative to human performance, normalised by the random score of each game. Two major additions by [6] within the DQN methodology allowed the standard Q Learning approach to converge given the extreme increase in state space complexity. The first was experience replay, where the network is trained periodically on a randomized historical sample batch to remove correlations and smooth over changes in the data distribution. This process was biologically inspired by the theory that mammalian brains re-enact past experiences during periods of low activity [85], [86]. The second technique is referred to as delayed update, where the target is held constant for a period of time by remembering the network weights between updates. The new or updated network's reward is calculated for a given state using the Q update function given previously as Equation (10), and this is compared to the reward which would have been experienced by the stored network state [6]. The update to the network is then based on the loss function

L_i(θ_i) = E_{(s,a,r,s') ∼ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i) )^2 ],

where θ_i represents the network weights and θ_i^− represents the stored network of the delayed update step. A selection of uniformly distributed samples (U(D)) of the state variables (s, a, r, s') is drawn from the stored sample pool and used for the assessment of the loss function. The stored network is then updated periodically; this effectively keeps a constant target for a period of time.
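A minimal sketch of the two mechanisms, uniform experience replay and a periodically frozen target network, might look as follows (tabular stand-ins for the networks; illustrative only, not DeepMind's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size pool of (s, a, r, s') transitions; uniform sampling
    breaks the temporal correlation between consecutive updates."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)   # oldest transitions fall out

    def add(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.pool), batch_size)

def dqn_targets(batch, q_target, gamma=0.99):
    """Bootstrap targets r + γ max_a' Q(s', a'; θ⁻), computed with the
    periodically frozen target parameters θ⁻ (here a plain dict)."""
    return [r + gamma * max(q_target[s_next]) for (_, _, r, s_next) in batch]
```

In the full algorithm the online network is regressed toward these targets by gradient descent, and `q_target` is replaced by a fresh copy of the online network only every fixed number of steps, keeping the regression target stationary in between.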
The platform used to test the Q learner was an emulation of the Atari 2600 game console, which supports the 49 games listed in Figure 3. The emulator utilises 210 × 160 color pixels at a 60 Hz frame rate with a 128-color palette, with the Q Learner only having access to the visual image and the number of available actions with which to make decisions [6]. The available actions form the final fully connected layer of the neural network. Comparisons were made against other approaches: (i) the Best Linear Learner and (ii) Contingency awareness (SARSA) [82], [83]. The DQN approach was largely very successful in comparison to these two methods, which in many cases could not surpass a human player [6]. This comparison shows that at the time of publication the approach represented the state of the art for Reinforcement Learning. The deep Q-learning method of Reinforcement Learning has subsequently been used in countless applications and studies since its inception, e.g. [87], [88], [89], [90], [91], and [92]. Furthermore, the successes made by the approach have spawned a number of advancements to the field, as will be discussed in subsequent sections.

B. DOUBLE Q-LEARNING
The first major alteration to the vanilla DQN approach came in the research of [93], carrying on from earlier work [94], referred to as Double Q-Learning, or Double DQN (DDQN). The research identified that the value function is commonly overestimated, causing biased selection of actions already sampled. It has been suggested that the instability is a result of correlations in the observation sequence and of policy changes due to small updates to the Q function, i.e. a non-stationary action-value function [95]. The DDQN approach extended the original research by decoupling the state-action function used for selecting actions from the target network used to evaluate them, with the former updated online at each step sample of the game. In effect, the approach holds in memory a copy of the agent's weights θ for assessment of the state-value pairing and selection of actions, whilst bootstrapping with the target network weights θ^−, periodically copying the agent's weights into the target. This does not alter the Q function described in Equation (8), but changes the target of the update function in Equation (10) as follows:

Y_t = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ_t^−).

At the time, this incremental improvement to value based Reinforcement Learning was shown to be the state of the art, outperforming DQN at the fundamental test of Atari 2600 games [96], with comparative results as shown in Figure 4. It is important to note here that the paper included updated performance values against the expanded set of games, and is hence used for comparison going forward. This variant of the value based Reinforcement Learning approach is a seemingly small and easily implemented change to the base approach; as a result it has been widely applied in the literature, e.g. [97], [98], [99], and [100], with a number of suggested incremental improvements [101], [102]. Importantly, the Double DQN methodology has been widely accepted and used in subsequent critically important research, e.g. [103].
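The decoupling of selection and evaluation is compact enough to sketch directly (tabular stand-ins for the two networks; illustrative only):

```python
def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99):
    """Decoupled bootstrap target: the online network θ selects the
    action, while the delayed target network θ⁻ evaluates it. This
    reduces the overestimation bias of taking a single max over noisy
    value estimates, as in the vanilla DQN target."""
    actions = range(len(q_online[s_next]))
    a_star = max(actions, key=lambda a: q_online[s_next][a])  # selection by θ
    return r + gamma * q_target[s_next][a_star]               # evaluation by θ⁻
```

When the two networks disagree about the best action, this target is lower than the vanilla `r + γ max_a Q_target(s', a)`, which is precisely the overestimation the method suppresses.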

C. PRIORITIZED EXPERIENCE REPLAY
Up until this point in the timeline of value based Reinforcement Learning, the concept of Experience Replay was well accepted and implemented, as it provided fundamental benefits to the technique: it reduced the high variance associated with individual updates of the Q score, and broke the correlation between samples by mixing the training data set in time, more appropriately approximating the independent and identically distributed sampling assumption [104]. The failing of the approach is simply that each sample in the training data set is given equal weight, which does not give proper appreciation to state-action pairings that result in significant improvements to the total reward in comparison to events which have little or no effect. Building on top of the Double DQN methodology discussed previously, the work of [103] introduced an incremental advancement to the field of value based Reinforcement Learning through a technique termed Prioritized Experience Replay, which, as the name suggests, gives priority in agent updates to samples with greater training effect when drawing samples from the experience replay buffer.

[Figure caption fragment: data supplied by [84]; any games for which the score increased or decreased by factors larger than 10 are truncated in the figures.]
VOLUME 10, 2022

The underlying concept of Prioritized Experience Replay stems from the idea that transitions, i.e. the selection of an action based on a state, yielding a predicted reward, can be surprising in that the prediction is wildly wrong, redundant in that they have been observed regularly, or highly relevant such as when they yield high reward [103]. The research suggests selecting training samples with a high expectation of learning progress, proportional to the temporal-difference (TD) error δ, a process first suggested by [105]. Recall from earlier that the TD-error is the difference between expected and experienced reward in an update, representing the update component in Equation (4).
The paper further states that this can cause a loss of diversity in training samples, requiring a normalisation process. Mathematically, the probability P(i) of drawing sample i is

P(i) = p_i^α / Σ_k p_k^α

The priority p_i of a sample is calculated either from the rank of its TD-error, i.e. the greater the error the higher the priority, or as the absolute value of the error itself; the rank-based form is more robust to outliers [103]. The α term is a hyper-parameter which controls the strength of the prioritized selection (α = 0 recovers uniform sampling). Changing the sampling distribution introduces bias into the estimation, which is removed by applying importance-sampling weights to the TD-error term in all updates of the agent, calculated according to

w_i = (1/N · 1/P(i))^β

Once again, the state of the art in value based Reinforcement Learning was progressed through this incremental improvement, combining Double DQN with Prioritized Experience Replay [103], demonstrated on the field's seemingly accepted metric of success, the Atari 2600 games, as shown in Figure 4b. There are numerous fields of application for Prioritized Experience Replay, such as online object recognition [106], robot navigation [107], or network and traffic routing as stated in [108]. A similar concept has also been proposed using strata of experiences and selecting amongst the strata to promote important samples [109]. Another related concept was proposed by [90], referred to as Hindsight Experience Replay; it likewise makes better use of the experience replay buffer, but differs in that it focuses on the case where multiple goals are inherent in the agent's task. This approach was later combined with subsequent research to again claim state-of-the-art status as the Prioritized Dueling Experience Replay approach [110].
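The two expressions above can be sketched directly; the function names and the small eps constant (added so zero-error samples retain non-zero probability) are our own choices, not notation from [103].

```python
import numpy as np

def priorities_to_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = |delta_i| + eps,
    P(i) = p_i^alpha / sum_k p_k^alpha."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

def importance_weights(probs, beta=0.4):
    """Importance-sampling correction w_i = (1/(N * P(i)))^beta,
    normalised by the maximum weight for update stability."""
    n = len(probs)
    w = (1.0 / (n * probs)) ** beta
    return w / w.max()
```

Transitions with large TD-errors are drawn more often, while the weights shrink the updates of exactly those over-sampled transitions, keeping the expected update unbiased.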

D. DUELING DEEP Q NETWORKS
Early work by [111] suggested that separating the advantage and value functions improved performance of the Q-learning Reinforcement Learning algorithm. An extension of the work to deep convolutional neural networks [84] was applied to the common Atari 2600 baseline, achieving state-of-the-art results at the time of publication (noting that this was achieved when combined with previous concepts such as DDQN and Prioritized Experience Replay), instantiating the method commonly referred to as Dueling Deep Q Networks; see Figure 4c for a relative performance assessment against the previous state of the art at the time, Prioritized Experience Replay.
The major contribution to the field comes in the separation of the Advantage term from the Value term in the agent's network. This concept is considered here to be one of the first proposed methods involving a derivative of the standard feed-forward convolutional network topology commonly applied in research. A clear comparison is made by [84] in Figure 5, where the scalar Value estimate V for a given state is determined separately from the vector of Advantage terms A. These terms are necessarily combined in selecting an optimal action for the given state. Most important in this approach is an understanding of what the Advantage term actually represents; mathematically it is simply the difference between the value of a given action and the expected value of the state under the policy π:

A^π(s, a) = Q^π(s, a) − V^π(s)

The Advantage nomenclature reflects the relative update contribution: the optimal action for a given state would estimate a value closest to the true value (even with discrepancies due to insufficient training samples to accurately estimate the true value at that state), providing the least Advantage in learning. Conversely, a state-action pairing whose estimate is significantly different from the true value results in a larger Advantage output from the network. This is stated to benefit training of the agent through better utilisation of these non-optimal state-action pairings [84], accelerating learning and reducing uncertainty in training. A critical aspect of the Dueling Deep Q Network approach is the combination of the Advantage and Value terms, which again is mathematically simple:

Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_a′ A(s, a′; θ, α))

where θ denotes the shared parameters and α and β those of the Advantage and Value streams respectively.
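The aggregation step can be sketched as follows; the function name and batch layout are illustrative only.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar V(s) with a vector of A(s, a) terms:
    Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')).
    Subtracting the mean advantage makes the decomposition identifiable,
    since a constant could otherwise shift freely between the two streams.

    value:      (batch,)          value stream output
    advantages: (batch, actions)  advantage stream output
    """
    return value[:, None] + advantages - advantages.mean(axis=1, keepdims=True)
```

Note that the greedy action is unchanged by the mean subtraction; only the relative scale of the updates to the two streams differs.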
The important aspects to note here are that the state-action updates depend not only on the Value predicted by the agent, but also on its perceived Advantage relative to the advantages across the entire set of actions. This is considered to add stability to updates, whilst not directly altering the selection of actions, which is still based on the expected value of current and future rewards. The technique of Dueling Deep Q Networks has been widely applied in the literature, in diverse fields of application such as traffic signalling [112], estimating algorithmic parameters [113], UAV control [114], and administering pain relief [115]. This is perhaps an indication of the stability of the approach resulting in widespread adoption. In addition, there are instances in the literature where the process has been extended [116], as well as incorporated into other approaches within the Reinforcement Learning field of research [35], [117], [118].

E. NOISY NETWORKS
Another highly successful approach to the task of value based Reinforcement Learning was proposed by [35] and self-labelled as NoisyNet. This work leveraged a well researched concept from supervised learning in which noise is added to the neural network to improve overall classification success [119], [120], but was the first to apply the concept to Reinforcement Learning, an arguably more complex subgroup of artificial intelligence. It was identified by [121] that the act of randomly perturbing the selected action is unlikely to access all states of the environment in any efficient manner. Whilst this seems relatively intuitive, and despite the large body of work to find more effective methods [122], [123], [124], a vast majority of applications still use the simple ε-greedy approach. The major contribution of the work by [35] is considered here to be the instantiation of a problem-specific exploration strategy without the need for random perturbation of action selection or for researcher input, replacing the fixed and often decaying exploration schedule. This major contribution was in addition to the reported state-of-the-art performance against the Atari 2600 games benchmark (as shown in Figure 4d), comparing the NoisyNet concept applied to both vanilla DQN and Dueling DQN respectively. It is interesting to note that the combination of NoisyNet with Dueling showed a larger net improvement over the Dueling network than the corresponding comparison against the vanilla DQN approach.
In essence the NoisyNet methodology adds parametric noise to the weights and biases of the fully connected layers of a deep or shallow neural network, replacing each fixed weight w with a learnable mean μ and a learnable noise scale σ applied to sampled noise ε. This causes variation in the resultant output of the layer and hence, in the value based Reinforcement Learning context, variation in the estimated value of the current and future states under a given policy, and therefore in the optimal action selection. The crucial element, however, is the progressive update of the noise contribution σ during learning. The standard representation of a fully connected layer with input x, bias b, layer weights w and output y is given mathematically as

y = f(wx + b)

where the function f represents any continuous differentiable function, such as the rectified linear unit. The NoisyNet approach extends this as

y = f((μ_w + σ_w ⊙ ε_w)x + (μ_b + σ_b ⊙ ε_b))

Comparing the two representations of the layer, it is observed that the NoisyNet variation is the addition of the noise components, where the μ and σ terms are progressively updated at each learning step. Like the methods discussed up until now, the NoisyNet Reinforcement Learning approach has wide applicability across multiple tasks, for example UAV navigation [125], autonomous ground vehicle control [126] or even conversational dialogue [127].
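A minimal sketch of the noisy layer above, with the activation omitted; the independent-Gaussian noise variant is assumed and the function name is our own.

```python
import numpy as np

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """Pre-activation output of a NoisyNet fully connected layer:
    y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b),
    with eps drawn fresh on each forward pass. mu and sigma are the
    learned parameters; gradients flow through them, not through eps."""
    eps_w = rng.standard_normal(mu_w.shape)
    eps_b = rng.standard_normal(mu_b.shape)
    return (mu_w + sigma_w * eps_w) @ x + (mu_b + sigma_b * eps_b)
```

As σ shrinks during training the layer collapses toward a standard linear layer; exploration is thus annealed by learning rather than by a hand-tuned ε schedule.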

F. DISTRIBUTIONAL REINFORCEMENT LEARNING
A subsequent methodology proposed by [128], identified here as the Distributional Reinforcement Learning method (often referred to as C51), raised the issue that the value estimate is not a singular value for a given state-action pairing. Instead it is a distribution of probable values, of which the expected value is what had been optimised against up until this point in the Reinforcement Learning field of research. This work by [128] acknowledges the work of [129], [130], and [131] in identifying the underlying relationship, but importantly extends the research by applying it to the value estimate, noting that a Gaussian distribution was applied previously by [132] and [133]. The result was state-of-the-art performance against the Atari 2600 benchmark, with relative performance against the highest results obtained by DQN, Double DQN, Dueling DQN or Prioritized Dueling Experience Replay cited in [110], as shown in Figure 6.
The fundamental variation proposed by the method of Distributional Reinforcement Learning is the change from a Q function representing a state-action estimate of expected current and future value to a value distribution, denoted Z, whose expectation is the Q score or value of previous methods [128]. This is expressed mathematically as

Q(s, a) = E[Z(s, a)]

It is considered here somewhat self-evident that the value for any state-action pairing would in fact be a distribution, as opposed to a single numeric value. However, what is interesting to note is that the relative improvement over the previously set benchmarks was minor (<5%) in most cases, suggesting that the value distribution is approximated reasonably well by its expectation, at least for the deterministic Atari 2600 games. The Distributional Reinforcement Learning approach was later extended to include other assistive techniques, namely Prioritized Experience Replay, forming the Distributed Prioritized Experience Replay approach [135], and quantile regression [136]. Additionally, there exist a number of examples of the algorithm's direct utilisation for real world learning tasks, such as resource management [137], robot control [138] and financial investing [139].

FIGURE 6. Second four relative performances of Reinforcement Learning research, data supplied by [118], [128], [134]. Any games for which the score increased or decreased by factors larger than 10 are truncated in the figures.
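In C51 the distribution Z is categorical over a fixed support of atoms, and action selection still reduces it to the expectation. A small sketch, in which the support bounds and function name are illustrative assumptions:

```python
import numpy as np

def greedy_action_from_distribution(probs, v_min=-10.0, v_max=10.0):
    """probs: (num_actions, num_atoms) categorical return distributions.
    Q(s, a) is recovered as the expectation over the fixed atom support,
    and the greedy action maximises that expectation."""
    num_atoms = probs.shape[1]
    support = np.linspace(v_min, v_max, num_atoms)  # the atoms z_1 ... z_N
    q_values = probs @ support                      # Q(s, a) = sum_i z_i p_i(s, a)
    return int(np.argmax(q_values)), q_values
```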

G. ASYNCHRONOUS & DISTRIBUTED REINFORCEMENT LEARNING
Building on work by [140] and [141], a class of Reinforcement Learning algorithms was first proposed by [142] (and similarly by [143]) involving the parallelisation of learning agents across multiple machines, acting independently from one another, labelled the General Reinforcement Learning Architecture, or Gorila, algorithm. Whilst the approach uses the fundamental Double DQN algorithm of value based Reinforcement Learning with no fundamental changes to speak of, it is notable primarily for the stated improvements in speed of learning, but also for improved performance over the base Double DQN agent on most of the Atari 2600 benchmark, as shown in Figure 6b.
The Gorila approach maintains an agent as well as a second target agent, as in the Double DQN approach, along with an experience replay buffer which is either local or global in scope. The difference from previous methods is the bank of N network parameter sets being maintained, each acting independently on an instance of the environment, and processed asynchronously across the machines and a common parameter server. Periodic updates of the target network are made from the bank of parameters being maintained. It was stated that the approach reached its results up to 20 times faster than the standard GPU based approach [117].
The concept of applying Reinforcement Learning across distributed processes was further developed by [117] in what was described as asynchronous Reinforcement Learning. The approach was applied across multiple variants of the Reinforcement Learning task, including a value based one-step DQN agent, although the most noted and successful application was the policy based variant labelled Asynchronous Advantage Actor-Critic (A3C), which is out of scope for this review given its focus on value based Reinforcement Learning. The contribution of the asynchronous method is the use of multiple CPU threads instead of multiple machines, drastically simplifying implementation. Also notable are the variation of the exploration policy across threads, implemented by varying the epsilon greedy parameter, and, most importantly, the absence of an experience replay buffer. This is of critical importance, as it is the presence of this buffer which is considered here a primary contributor to the computational load of the Reinforcement Learning problem, not the processing of the agent. What this showed was that sampling widely enough in a bootstrapping process could mitigate the need to maintain a large sample of older experiences, but it did not address whether there was a benefit to be had by doing so.

H. DISTRIBUTED PRIORITIZED EXPERIENCE REPLAY
Ape-X is an approach to Reinforcement Learning introduced by [135] which combined the concept of massively parallel processing of the agent's interaction with the environment (a processing bottleneck in Reinforcement Learning) with the concept of Prioritized Experience Replay, which determined that not all experiences of an agent are of equal importance for the value update in back-propagation. A critical aspect of the approach is that the parallel 'actor' agents build and maintain local experience replay buffers, which are drawn from globally by the common 'learner' agent. The individual 'actor' agents maintain their local buffers by updating priorities in accordance with the target network updates. The paper by [135] showed state-of-the-art performance on a number of the Atari benchmark games, and substantial improvements to learning rates, attributed to the increased exploration of the environment state space, although it is noted here that the prioritization of experiences is also a major contributor to performance over the asynchronous or Gorila methods. The Importance Weighted Actor-Learner Architecture (IMPALA) [144] extended this to also make the 'learner' agent asynchronous, using a V-trace method to map experiences to gradients in asynchronous updates of the agent. This was at the time only applied to a policy based Reinforcement Learning approach, and although it could be applied to value based approaches, it is again out of scope for this review.

I. RAINBOW REINFORCEMENT LEARNING
At the time of its publication, the Rainbow algorithm [118] was reportedly the state of the art in value based Reinforcement Learning. Its major advancement over previous incremental improvements was to combine several previous approaches into a single algorithm: double DQN, prioritized experience replay, dueling DQN, multi-step learning, distributional Q networks and noisy networks. Each of these approaches has been discussed at length here, and each contributes toward the overall success of the agent by correcting for different issues within the original DQN method. The only caveat to this is that [118] stated, based on ablation studies, i.e. simply removing that section of the code from the processing, that the contribution of the double DQN to overall performance was negligible. Most likely the effect of the double network was superseded by one or more of the other included approaches.
The most interesting observation of the Rainbow methodology is that it dramatically improves on a number of published benchmarks of performance across the individual components from which it is constructed, and notably the vanilla DQN method itself, whilst remaining in essence a variation on the original algorithm. Figure 6c shows the relative performance of the Rainbow Reinforcement Learning algorithm against the previous state-of-the-art results of the NoisyNet algorithm.
The only major variations of the individual subcomponents of the Rainbow algorithm from the original research are that the temporal-difference error is not used to determine the priority of experiences for replay, the Kullback-Leibler loss being used instead, and that both streams of the fully connected layers in the dueling network architecture have the distributional probabilities applied [118].
Despite the improved success of the Rainbow Reinforcement Learning approach over the widely applied DQN variants, there have been limited published real-world applications, though some examples of extensions to the technique are found in [134], [145], and [146]. This is most likely a consequence of how recently it was first published.

J. IMPLICIT QUANTILE NETWORKS
The work by [136] expanded the concept of the value function being a distribution which should be modelled or learned explicitly [128], as well as the purported benefits of parallel processing of learner and actor agents [117], [142], to create the Quantile Regression DQN (QR-DQN) approach. The most significant change over Distributional DQN is to separate the return distribution into N quantiles, taking advantage of the increases to speed and sampling distribution identified previously. The approach achieved state-of-the-art performance at the time of publication against the Atari 2600 benchmark [136], with results shown in Figure 6d.
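Where C51 fixes the support and learns the probabilities, QR-DQN fixes N uniform quantile fractions and learns the return values at those fractions; a sketch under our own naming:

```python
import numpy as np

def quantile_fractions(n):
    """Midpoint quantile targets tau_i = (2i + 1) / (2N) at which the N
    quantile estimates of the return distribution are trained."""
    return (2.0 * np.arange(n) + 1.0) / (2.0 * n)

def quantile_q_values(theta):
    """theta: (num_actions, N) estimated quantile values per action.
    With uniform fractions, Q(s, a) is simply the mean over quantiles."""
    return theta.mean(axis=1)
```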
Implicit Quantile Networks is stated by [134] to be an extension of the distributional Reinforcement Learning approach [128] and of the authors' own work [136], in which the quantile fractions themselves are sampled rather than fixed, returning an implicitly defined (hence the name) return distribution and enabling a risk-sensitive policy. Here risk is defined as the possibility of under-sampling, i.e. not sampling the entire breadth of possible state-action pairings. The benefit of the approach was stated as an increased sampling space with improved learning efficiency, giving an overall improvement within the same learning time as previous techniques. The reported state-of-the-art performance is shown in Figure 8a.
The collection of techniques presented in this section does represent a step forward in the application of Reinforcement Learning techniques in general, value based and policy based equally, making them worthy of note here. However, the approaches do not alter the underlying algorithm; the parallel processing of multiple agents is the fundamental step forward in the research detailed here, noting also the relatively modest improvements made over successive variations of these distributed multi-agent approaches.

K. RECURRENT EXPERIENCE REPLAY IN DISTRIBUTED REINFORCEMENT LEARNING
The approach of Recurrent Experience Replay in Distributed Reinforcement Learning by [147] has been coined the R2D2 algorithm. Fundamentally the algorithm utilises prioritized experience replay with a double DQN agent, and leverages the previously discussed benefits in sampling efficiency and depth provided by distributed concepts [134], [135], [136], [144], i.e. multiple actors operating on environment instances. The advancement made by the research of [147] is the application of Long Short Term Memory (LSTM) [148] to build a deeper memory representation, in excess of the commonly applied frame buffer of recent states (typically four frames). Although LSTM had been applied to Reinforcement Learning previously, as the Deep Recurrent Q-Network (DRQN) [149] and in other work [117], [144], [150], the key point of difference here is its use in conjunction with the other advances in Reinforcement Learning, i.e. experience replay. R2D2 also applied the principles of partially observable Markov decision processes (POMDP) [151], although in the context of the Atari benchmark this would be an insignificant contributor to success, as the task is fully observable [147]. What was profoundly noticeable from this alteration to the previous value based approaches described here is the dramatic increase in overall performance against the Atari 2600 benchmark, as shown in Figure 8b.

[Figure caption fragment: data supplied by [134] and [147]; any games for which the score increased or decreased by factors larger than 10 are truncated in the figures.]
The results of the R2D2 algorithm against the common Atari 2600 benchmark are a substantial improvement over previous efforts, achieved through a modest combination of previous concepts, a concerted effort to construct an effective combination, and the discovery of the need to 'burn in', i.e. avoid zero-state initialisation of, the Recurrent Neural Network.
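The burn-in idea can be sketched conceptually as follows; `rnn_step` is a placeholder for any recurrent cell, and the structure is our illustration rather than the exact procedure of [147].

```python
def replay_with_burn_in(rnn_step, h_stored, sequence, burn_in=4):
    """Rather than initialising the recurrent state to zeros when replaying a
    stored sequence, start from the (possibly stale) stored hidden state and
    run the first `burn_in` observations forward-only to warm the state up.
    Training would then proceed on the remainder of the sequence."""
    h = h_stored
    for obs in sequence[:burn_in]:
        h = rnn_step(h, obs)  # forward pass only; no gradients taken here
    return h, sequence[burn_in:]
```

With a toy accumulator cell `rnn_step = lambda h, o: h + o`, replaying `[1, 2, 3, 4, 5, 6]` with `burn_in=4` warms the state to 10 and leaves `[5, 6]` for training.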
The research also suggests that the impact of adding recurrency to the neural network is more significant than simply improving the memory of the agent, somehow improving the learning representation, although this is not elaborated further. What is apparent is that the common understanding that the Markov state contains sufficient information to encapsulate all previous history should be challenged.

L. MODEL BASED REINFORCEMENT LEARNING
Despite the well noted success of model based Reinforcement Learning algorithms such as the famous AlphaGo [8] or Checkers [152], model-based Reinforcement Learning approaches were considered out of scope for this review, as they are not a fundamental deviation from the standard value based approach. This position was taken as having access to a model of the environment or rule set was considered a significant advantage, and somewhat contrived given the absence of such knowledge in real world applications. The MuZero algorithm proposed by [153], however, demonstrated substantial improvements over the model-free approaches discussed throughout this review, and exceeds the previous state-of-the-art performance of the R2D2 algorithm by looking forward a number of steps through a learned model of the environment dynamics to choose the best course of action in a given state.
The MuZero approach conducts planning by estimating the policy, the value function and the reward for the current transition [153], extending work by [154], [155], and [156], which represented a change in approach to model-based Reinforcement Learning by predicting the value function. The focus on learning a model of the value function makes it relevant to the context of this review, as it represents a possible step change in value based Reinforcement Learning. While noting that MuZero itself is actually a policy based approach, it is worth comparing the performance of the MuZero approach to that of the R2D2 algorithm, as shown in Figure 8c.

FIGURE 11. Performance of value based reinforcement learning algorithms relative to the DQN benchmarked performance, and normalised by human performance, represented on a log scale.

VI. COMPARISON OF MAJOR ADVANCEMENTS IN VALUE BASED REINFORCEMENT LEARNING PERFORMANCE
What has been discussed throughout the sections of this review is a history of the important advancements in value based Reinforcement Learning methodology. But as has been highlighted in research into benchmarking performance, there is a lack of commonality and repeatability in the results shown against the Atari 2600 test set [11], [16]. This makes direct and fair comparison of the performance or efficacy of the various approaches in the literature difficult. Commonly the benchmark of success is the notion of superhuman performance, i.e. by how much the algorithm surpasses a human's ability to complete the task. This ratio of DQN performance against the level of a human player is shown in Figure 3, but it is important to note that what is or is not considered superhuman performance is not in question here; the interest lies in how much the field has advanced since its inception. The analysis of human-relative performance across the value based Reinforcement Learning approaches suggests that there are broad, sweeping advancements across the techniques and across the suite of games. Comparison of the various Reinforcement Learning techniques relative to the previous state of the art shows that in the vast majority of cases, significant gains were localised to a few select instances or games. Comparing all methodologies presented in this review relative to the scores presented in the original DQN research, as shown in Figure 9, highlights this concept. What can be observed is the clustering of results around each game, where most algorithms perform similarly well against any given game, with of course a limited number of exceptions such as Frostbite or Seaquest, where one particular algorithm performed exceedingly well in comparison to other approaches, but where this was not reflected across all games.
Figure 7 shows the mean score across the Atari 2600 games, normalised by the difference between the human score and a random agent, and displayed as a value relative to the original DQN results [6], ordered in time. What this shows is that over the many years since the groundbreaking research of the DQN agent, and the resultant insatiable interest in Reinforcement Learning and artificial intelligence in general from both academia and the media, the level of overall performance has not seen the dramatic improvements observed in other areas of artificial intelligence, image classification for example, or commensurate with the hyped claims of the field within mainstream media. That is, of course, until the recent inclusion of the R2D2 algorithm, which as discussed previously is considered to be one of the few major step changes in value based Reinforcement Learning, with its inclusion of a higher level representation of memory. That is not to say the other incremental improvements are not important, just that they focused on improving the algorithm through higher efficiency in sampling, stability of the value function representation, or depth of the sampling space. To further emphasise this point, consider these relative scores separated out into the individual games, firstly ignoring the R2D2 algorithm in Figure 9, and secondly including it in Figure 10. Here the axis limits have been clipped at 100 times the performance of DQN; above that, exactly how far the performance is exceeded is somewhat irrelevant for this discussion. The results shown in Figure 9 show that minimal improvement (<10%) has been made across the techniques, with the glaring exception of a handful of specific games, Asterix and Private Eye most notably. Compare that to the results of R2D2, as shown in Figure 10, where there are a number of clear examples, if not a majority, of performance well above that of the original DQN.
What is not observed is a general improvement across all games, due in part to some games being already solved by the original DQN algorithm. To make this issue clear, consider Figure 11, which shows the performance improvement of all algorithms relative to the DQN algorithm, normalised by human level performance. This shows which games were not easily solved originally, by either the Reinforcement Learning agent or human players. Figure 11 provides a measure by which an algorithm outperforms both the DQN algorithm and human level performance. This highlights the issue that the Reinforcement Learning agent is good at outperforming humans by incredible amounts when the task is simple, and hence a model of the required behaviour is shallow and easily constructed. The majority of improvements are incremental in nature, with some small improvement over the base algorithm (in overall performance, not considering learning efficiency), and with a handful of examples of considerable improvement in one or more particular games constituting the majority of the mean improvement observed across time, as shown in Figure 7. It is also interesting to note that the games showing the most improvement under R2D2, compared to the clusters of previous algorithms, are predominantly those which failed to exceed human performance initially, and which failed to gain notable improvement across subsequent iterations of the algorithm up until that point.

VII. CONCLUSION
A subtle observation made when analysing the performance of value based deep Reinforcement Learning approaches since the renaissance in this niche area of Artificial Intelligence is the small pool of researchers contributing to the field. This is not true in terms of broad application of the concept, where there is an impressive amount of effort across a broad and diverse number of researchers and research establishments. However, in terms of fundamental improvements to the techniques which are benchmarked sufficiently against a common standard and show some notable performance gain, there are relatively few researchers, who then recur across multiple examples of the fundamental research. This is exacerbated when considering the affiliations of these core researchers, where there is a strong centre of gravity around a particular centre of excellence.
Of particular interest within this review of the field of value based Reinforcement Learning was that the improvements observed across the majority of value based Reinforcement Learning techniques were quite modest in relative terms when compared to the baseline DQN algorithm. This is considered a function of the relatively minor changes applied, where the most significant change is considered to be the representation of the value as a distribution instead of a single number. This of course only holds true until considering the addition of an advanced memory representation in the R2D2 algorithm, which is considered here to be the major paradigm shift applied across deep value based Reinforcement Learning since its first and groundbreaking implementation. What this suggests is that the long held assumption that the Markov state contains a sufficient representation of all the history leading up to that state should be questioned. It is further asserted, as a key recommendation from this review, that the addition of other high level representations of the relationship between state-action pairings and reward will further improve the broad applicability of Reinforcement Learning to other, more complex tasks, as well as improve on these benchmark levels of the Atari 2600 games.