An Efficient Hardware Implementation of Reinforcement Learning: The Q-Learning Algorithm

In this paper we propose an efficient hardware architecture that implements the Q-Learning algorithm, suitable for real-time applications. Its main features are low power, high throughput and limited hardware resources. We also propose a technique based on approximated multipliers to reduce the hardware complexity of the algorithm. We implemented the design on a Xilinx Zynq UltraScale+ MPSoC ZCU106 Evaluation Kit. The implementation results are evaluated in terms of hardware resources, throughput and power consumption. The architecture is compared to the state of the art of Q-Learning hardware accelerators presented in the literature, obtaining better results in speed, power and hardware resources. Experiments using different sizes for the Q-Matrix and different wordlengths for the fixed-point arithmetic are presented. With a Q-Matrix of size $8\times4$ (8 bit data) we achieved a throughput of 222 MSPS (Mega Samples Per Second) and a dynamic power consumption of 37 mW, while with a Q-Matrix of size $256\times16$ (32 bit data) we achieved a throughput of 93 MSPS and a power consumption of 611 mW. Due to the small amount of hardware resources required by the accelerator, our system is suitable for multi-agent IoT applications. Moreover, the architecture can be used to implement the SARSA (State-Action-Reward-State-Action) Reinforcement Learning algorithm with minor modifications.


I. INTRODUCTION
Reinforcement Learning (RL) is a Machine Learning (ML) approach used to train an entity, called agent, to accomplish a certain task [1]. Unlike the classic supervised and unsupervised ML techniques [2], RL does not require two separate training and inference phases, being based on a trial-and-error approach. This concept is very close to human learning.
As depicted in Fig. 1, the agent ''lives'' in an environment where it performs some actions. These actions may affect the environment, which is time-variant and can be modelled as a Markovian Decision Process (MDP) [1]. An interpreter observes the scenario, returning to the agent the state of the environment and a reward. The reward (or reinforcement) is a quality figure for the last action performed by the agent and it is represented as a positive or negative number. Through this iterative process, the agent learns an optimal action-selection policy to accomplish its task. This policy indicates which is the best action the agent should perform when the environment is in a certain state. The interpreter may also be integrated into the agent, which then becomes self-critic.
The associate editor coordinating the review of this manuscript and approving it for publication was Ahmed M. Elmisery.
Thanks to this approach, RL represents a very powerful tool to solve problems where the operating scenario is unknown or changes over time.
Recently, the applications of RL have become increasingly popular in various fields such as robotics [3]-[5], Internet of Things (IoT) [6], power management [7], financial trading [8] and telecommunications [9], [10]. Another research area in RL is multi-agent and swarm systems [11]-[14]. These kinds of applications require powerful computing platforms able to process very large amounts of data as fast as possible and with limited power consumption. For these reasons, the performance of software-based implementations is now the main limitation to the further development of such systems, and hardware accelerators based on FPGAs or ASICs can represent an efficient solution for implementing RL algorithms.
The main contribution of this work is a flexible and efficient hardware accelerator for the Q-Learning algorithm. The system is not constrained to any specific application, RL policy or environment. Moreover, for IoT target devices, a low-power version of the architecture based on approximated multipliers is presented.
The paper is organized as follows.
• Section I is a brief survey on Reinforcement Learning and its applications. The Q-Learning algorithm and the related work in the literature are presented.
• Section II describes the proposed hardware architecture, detailing its functional blocks. A technique to reduce the hardware complexity of the arithmetic operations is also proposed.
• Section III presents the implementation results and the comparisons with the state of the art.
• Section IV gives final considerations and future developments.
• The Appendix shows how the architecture can be exploited to implement the SARSA (State-Action-Reward-State-Action) RL algorithm [15] with minor modifications.
A. Q-LEARNING ALGORITHM
Q-Learning [16] is one of the best-known and most widely employed RL algorithms [17]. It belongs to the class of off-policy methods, since its convergence is guaranteed for any agent's policy.
It is based on the concept of Quality Matrix, also known as Q-Matrix. The size of this matrix is $N \times Z$, where $N$ is the number of possible agent's states to sense the environment and $Z$ is the number of possible actions that the agent can perform. This means that Q-Learning operates in a discrete state-action space $S \times A$. Considering a row of the Q-Matrix that represents a particular state, the best action to be performed is selected by computing the maximum value in the row.
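At the beginning of the training process, the Q-Matrix is initialized with random or zero values, and it is updated by using (1).
$$Q_{new}(s_t, a_t) = (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right) \tag{1}$$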
The variables in (1) refer to:
• $s_t$ and $s_{t+1}$: current and next state of the environment.
• $a_t$ and $a_{t+1}$: current and next action chosen by the agent (according to its policy).
• $\gamma$: discount factor, $\gamma \in [0, 1]$. It defines how much the agent has to take into account long-run rewards instead of immediate ones.
• $\alpha$: learning rate, $\alpha \in [0, 1]$. It determines how much the newest piece of knowledge has to replace the older one.
• $r_t$: current reward value.
In [16] it is proved that the knowledge of the Q-Matrix suffices to extract the optimal action-selection policy for an RL agent.
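As an illustration of the update rule and of the row-maximum action selection described above, the following is a minimal software sketch. It assumes the standard tabular Q-Learning update in its three-multiplier form; the function names and the toy problem size are hypothetical and are not taken from the proposed architecture.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-Learning update to Q (N rows, Z columns) in place."""
    best_next = max(Q[s_next])  # maximum over the actions of the next state
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]


def greedy_action(Q, s):
    """Best action for state s: index of the maximum value in row s."""
    row = Q[s]
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical toy problem: N = 3 states, Z = 2 actions, zero-initialised.
Q = [[0.0, 0.0] for _ in range(3)]
q_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
print(Q[0], greedy_action(Q, 0))
```

The sketch uses floating-point values for readability, whereas the accelerator works with fixed-point data.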

B. RELATED WORK
Despite the growing interest in RL and the need for systems capable of processing large amounts of data in a very short time, just a few works can be found in the literature about the hardware implementation of RL algorithms. Moreover, the comparison is hard due to the lack of implementation details and homogeneous benchmarks. In this section we present the most prominent research in this field.
In 2005, Hwang et al. [18] proposed a hardware accelerator for the ''Flexible Adaptable Size Topology'' (FAST) algorithm [19]. The system was implemented on a Xilinx XCV800 FPGA and was validated using the cart-pole problem [20]. The architecture is well described but few details about the implementation are given.
In 2007, Shao et al. [21] proposed a smart power management application for embedded systems based on the SARSA algorithm [15]. The system was implemented on a Xilinx Spartan-II FPGA. Although the authors proved its functionality, neither the architecture nor the implementation details are given.
One of the most relevant works in the field is [22] by Gankidi et al., who, in 2017, proposed an RL accelerator for space rovers. The authors implemented the Deep Q-Learning technique [23] on a Xilinx Virtex-7 FPGA. They obtained a throughput of 2.34 MSPS (Mega Samples Per Second) for a $4 \times 2$ state-action space.
Also in 2017, Su et al. [24] proposed another Deep Q-Learning hardware implementation based on an Intel Arria-10 FPGA. The architecture was compared to an Intel i7-930 CPU and an Nvidia GTX-760 GPU implementation. They achieved a throughput of 25 KSPS with a 32 bit fixed-point representation for a $27 \times 5$ state-action space.
In 2018, Shao et al. [21] proposed a hardware accelerator for robotic applications based on ''Trust Region Policy Optimization'' (TRPO) [25]. The architecture was implemented on different devices: FPGA (Intel Stratix-V), CPU (Intel i7-5930K) and GPU (Nvidia Tesla-C2070). With respect to the CPU, the authors obtained a speed-up factor of 4.14× and 19.29× for the GPU and the FPGA implementation, respectively.
VOLUME 7, 2019
The most recent works (published in 2019) include Cho et al. [26]. They propose a hardware accelerator for the ''Asynchronous Advantage Actor-Critic'' (A3C) algorithm [27], describing an implementation based on a Xilinx VCU1525 FPGA. The system was validated using 6 Atari-2600 videogames.
In the work by Li et al. [28] another Deep Q-Learning network was implemented on a Digilent Pynq development board for the cart-pole problem. The system is meant only for inference mode and, consequently, cannot be used for real-time learning.
One of the most advanced hardware accelerators for Q-Learning was proposed by Da Silva et al. [29]. The authors presented an implementation based on a Xilinx Virtex-6 FPGA. Moreover, they performed a fixed-point analysis to confirm the convergence of the algorithm. Different comparisons with state-of-the-art implementations were made. Since this is one of the best performing Q-Learning accelerators to date, we provide an extensive comparison with our architecture (Sec. III-B).

II. PROPOSED ARCHITECTURE
The Q-Learning agent shown in Fig. 2 is composed of two main blocks: the Policy Generator (PG) and the Q-Learning accelerator. The agent receives the state $s_{t+1}$ and the reward $r_{t+1}$ from the observer, while the next action is generated by the PG according to the values of the Q-Matrix stored in the Q-Learning accelerator.
Note that $s_t$, $a_t$ and $r_t$ are obtained by delaying $s_{t+1}$, $a_{t+1}$ and $r_{t+1}$ by means of registers. $s_t$ and $a_t$ represent the indices of the rows and columns of the Q-Matrix, respectively. These delays do not affect the convergence of the Q-Learning algorithm, as proved in [30].
With the aim of designing a general-purpose hardware accelerator, we do not provide a particular implementation for the PG since it is application-defined. The PG has been included only in the experiments for the comparison with the state of the art (Sec. III-B).
Figure 3 shows the Q-Learning accelerator.
The Q-Matrix is stored in $Z$ Dual-Port RAMs, named Action RAMs. Consequently, we have one memory block per action. Each RAM contains an entire column of the Q-Matrix and the number of memory locations corresponds to the number of states $N$. The read address is the next state $s_{t+1}$, while the write address is the current state $s_t$. The enable signals for the Action RAMs, generated by a decoder driven by the current action $a_t$, select the value $Q(s_t, a_t)$ to be updated. The Action RAM outputs correspond to a row of the Q-Matrix, $Q(s_{t+1}, A)$.
The signal $Q(s_t, a_t)$ is obtained by delaying the output of the memory blocks and then selecting the Action RAM through a multiplexer driven by $a_t$. A MAX block fed by the output of the Action RAMs generates $\max_a Q(s_{t+1}, A)$. The Q-Updater (Q-Upd) block implements the Q-Matrix update equation (1), generating $Q_{new}(s_t, a_t)$ to be stored in the corresponding Action RAM.
The accelerator can also be used for Deep Q-Learning [23] applications if the Action RAMs are replaced with Neural Network-based approximators.
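The data flow through the Action RAMs, the MAX block and the Q-Updater can be mimicked by a purely behavioural software model. The sketch below is only an illustration under simplifying assumptions: it ignores clock cycles, register delays and fixed-point effects, and the class and function names are hypothetical.

```python
class ActionRams:
    """Behavioural model: Z column memories with N locations each."""

    def __init__(self, n_states, n_actions):
        self.rams = [[0] * n_states for _ in range(n_actions)]

    def read_row(self, s_next):
        # All RAMs are read at the same address, giving the row Q(s_next, A).
        return [ram[s_next] for ram in self.rams]

    def write(self, s, a, q_new):
        # The decoder driven by a enables exactly one RAM for writing.
        self.rams[a][s] = q_new


def accelerator_step(mem, s, a, r, s_next, alpha, gamma):
    """One update of Q(s, a); timing and register delays are not modelled."""
    q_sa = mem.rams[a][s]              # value selected by the multiplexer driven by a
    q_max = max(mem.read_row(s_next))  # MAX block
    q_new = q_sa + alpha * (r + gamma * q_max - q_sa)  # Q-Updater
    mem.write(s, a, q_new)
    return q_new


mem = ActionRams(n_states=8, n_actions=4)
first = accelerator_step(mem, s=0, a=2, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
second = accelerator_step(mem, s=1, a=0, r=0.0, s_next=0, alpha=0.5, gamma=0.5)
print(first, second)
```

Storing one column per memory means that a whole row is available in a single read, which is what allows the maximum over the actions to be computed without scanning the matrix.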

A. MAX BLOCK
An extensive study of this block has been presented in [30]. In that paper, the authors proved that the propagation delay of this block is the main limitation on the speed of Q-Learning accelerators when a large number of actions is required. Consequently, they propose an implementation based on a tree of binary comparators ($M$ stages) that is a good trade-off between area and speed [31].
This architecture is employed by the Q-Learning accelerators presented in [22], [29] and has also been used in our architecture (Fig. 4).
Moreover, in [30] it is proved that, when pipelining is used to speed up the MAX block, the latency does not affect the convergence of the Q-Learning algorithm. This means that, when an application requires a very high throughput, it is possible to use pipelining.
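The tree of binary comparators can be sketched in software as follows. This is an illustrative model only, assuming that the number of inputs is a power of two, in which case the tree has $\log_2 Z$ stages; the function name is hypothetical.

```python
def max_tree(values):
    """Maximum of `values` via a tree of two-input comparators.

    Returns (maximum, number_of_stages). Assumes len(values) is a power of two.
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        # One stage: each comparator keeps the larger of two neighbours.
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


print(max_tree([3, 9, 1, 7]))
```

In a pipelined version, a register would be placed after each stage, so the latency in clock cycles would equal the number of stages returned here.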

B. Q-UPDATER BLOCK
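Equation (1) can be rearranged as (2) to obtain an efficient implementation.
$$Q_{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) \tag{2}$$
Equation (2) is computed by using 2 multipliers, while (1) requires 3 multipliers.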
The Q-Updater block in Fig. 5 is used to compute (2), generating Q new (s t , a t ).
The critical path consists of 2 multipliers and 2 adders. In the next section (II-B1), a method to reduce the hardware complexity of the multipliers is illustrated.
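A fixed-point version of the two-multiplier update can be sketched as below. The number of fractional bits and the helper names are assumptions made for the example; overflow handling and saturation, which a real datapath would need, are not modelled.

```python
FRAC_BITS = 8  # hypothetical number of fractional bits


def to_fixed(x):
    """Convert a real number to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def q_updater(q, q_max, r, alpha, gamma):
    """Two-multiplier update; every argument is a fixed-point integer."""
    discounted = (gamma * q_max) >> FRAC_BITS  # multiplier 1: gamma * max
    delta = r + discounted - q                 # r + gamma * max - Q(s_t, a_t)
    return q + ((alpha * delta) >> FRAC_BITS)  # multiplier 2, then final adder


result = q_updater(
    q=to_fixed(1.0),
    q_max=to_fixed(2.0),
    r=to_fixed(0.5),
    alpha=to_fixed(0.5),
    gamma=to_fixed(0.5),
)
print(result / (1 << FRAC_BITS))
```

With these inputs the three-multiplier form gives the same value, $(1 - 0.5) \cdot 1 + 0.5 \cdot (0.5 + 0.5 \cdot 2) = 1.25$, which is a quick way to check that the rearrangement does not change the result.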

1) APPROXIMATED MULTIPLIERS
The main speed limitation in the updater block is the propagation delay of the multipliers. Using an approach similar to [32], it is possible to replace the full multipliers shown in Fig. 5 with approximated multipliers based on barrel shifters [33]. In this way, we approximate $\alpha$ and $\gamma$ with a number equal to their nearest power of two (single shifter), or to the nearest sum of powers of two (two or more shifters). Since $\alpha, \gamma \in [0, 1]$, only right shifts are used.
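Considering a number $x \le 1$, its binary representation using $M$ bits for the fractional part is:
$$x = x_0 2^{0} + x_{-1} 2^{-1} + \dots + x_{-M} 2^{-M} = \sum_{\ell=0}^{M} x_{-\ell}\, 2^{-\ell} \tag{3}$$
where $x_0, \dots, x_{-M}$ are the binary digits. Let $i$, $j$, $k$ be the positions of the first, second and third '1' in the binary representation of $x$, starting from the most significant bit. Moreover, we define $\langle x \rangle_{OP_n}$ as the approximation of $x$ with the $n$ most significant powers of two in the $M + 1$ bits representation. That is
$$\langle x \rangle_{OP_1} = 2^{-i}, \qquad \langle x \rangle_{OP_2} = 2^{-i} + 2^{-j}, \qquad \langle x \rangle_{OP_3} = 2^{-i} + 2^{-j} + 2^{-k} \tag{4}$$
for the approximation with one, two and three powers of two.
The concept can be extended to more power-of-two terms. For example, $x = 0.101101_{(2)} = 0.703125$ can be approximated as:
$$\langle x \rangle_{OP_1} = 2^{-1} = 0.5, \qquad \langle x \rangle_{OP_2} = 2^{-1} + 2^{-3} = 0.625, \qquad \langle x \rangle_{OP_3} = 2^{-1} + 2^{-3} + 2^{-4} = 0.6875 \tag{5}$$
Some examples of the approximated values for different powers of two are presented in Fig. 6 ($x \le 1$).
Consequently, the product $z = x \cdot y$ can be approximated as:
$$z \approx \langle x \rangle_{OP_n} \cdot y = 2^{-i} y + 2^{-j} y + \dots \tag{6}$$
where the sum contains $n$ terms, each of which is a right shift of $y$. The approximated multipliers are implemented by one or more barrel shifters in the Q-Updater block, depending on the approximation, as shown in Figs. 7 and 8.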
The positions of the leading ones $i$ and $j$ in the representation of $\alpha$ and $\gamma$ can be given as input if constant for the whole computation, or determined by Leading-One Detectors (LOD) [34] if the values are modified at run time. The error introduced by this approximation does not affect the convergence of the Q-Learning algorithm [16] and, as a side effect, we obtain a shorter critical path and lower power consumption (Sec. III-A). Moreover, we tested the system in different applications, which proved to be almost insensitive to the approximation error since the convergence conditions of Q-Learning are still satisfied ($\alpha, \gamma \le 1$).
By using approximated multipliers, we avoid the need for FPGAs with DSP blocks and we can implement the accelerator in small ultra-low-power FPGAs suitable for IoT applications [35], [36].
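The shift-based approximation can be illustrated with a short behavioural sketch. It assumes an unsigned fixed-point coefficient with one integer bit and `frac_bits` fractional bits, and it replaces the Leading-One Detector with a simple software loop; the function names are hypothetical.

```python
def leading_ones(x_fixed, frac_bits, n):
    """Positions i, j, ... of the first n '1' bits of x, MSB first.

    x_fixed is an unsigned fixed-point integer with `frac_bits` fractional
    bits and one integer bit, so position p carries the weight 2**-p.
    """
    positions = []
    for p in range(frac_bits + 1):
        if (x_fixed >> (frac_bits - p)) & 1:
            positions.append(p)
            if len(positions) == n:
                break
    return positions


def approx_multiply(y, positions):
    """Approximate x * y as a sum of right shifts of the integer y."""
    return sum(y >> p for p in positions)


# x = 0.101101 in binary = 0.703125, stored with 6 fractional bits.
x_fixed = 0b0101101
shifts = leading_ones(x_fixed, frac_bits=6, n=3)
print(shifts, approx_multiply(64, shifts))
```

For $y = 64$ the exact product is 45, while the one-, two- and three-shift approximations give 32, 40 and 44, matching the approximated coefficients 0.5, 0.625 and 0.6875.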

III. IMPLEMENTATION EXPERIMENTS
In order to validate the proposed architecture, we implemented different versions of the Q-Learning accelerator.
In the experiments, we used a Xilinx Zynq UltraScale+ MPSoC ZCU106 Evaluation Kit featuring the XCZU7EV-2FFVC1156 FPGA. All the results in this section were obtained using the Vivado 2019.1 EDA tool with default implementation parameters and setting a timing constraint of 2 ns. The system was coded in VHDL.
The design exploration covered a range of values for the number of states $N$, the number of actions $Z$ and the data width of the Q-Matrix values (see Tables 1 to 9).
We focused the implementation analysis on the following resources [37]:
• Look Up Tables (LUT);
• Look Up Tables used as RAM (LUTRAM);
• Flip-Flops (FF);
• Digital Signal Processing slices (DSP).
For every resource of the device, we also provide the percentage used with respect to the total available.
Performance was measured in terms of maximum clock frequency (CLK) and dynamic power consumption (PWR). The latter was evaluated using Vivado after Place & Route, considering the maximum clock frequency and a worst-case scenario with a 0.5 activity factor on the circuit nodes [38].
None of the implementation examples in this section makes use of pipelining in the MAX block (Sec. II-A). Unless otherwise stated, no approximated multipliers are used.
Tables 1 to 9 show the implementation results for different numbers of states and actions and different data widths for the Q-Matrix values (table header color: blue 8-bit, red 16-bit, green 32-bit data widths). The first consideration is related to the number of DSPs. Since only one Q-Matrix element is updated per clock cycle, the only parameter that affects the number of required DSPs is the bit-width. For a Q-Matrix with 8-bit data, we obtain the fastest implementations, which do not require any DSP slice. For 16-bit and 32-bit data, 3 DSPs and 5 DSPs are required, respectively.
Another consideration concerns the maximum clock frequency (which corresponds exactly to the throughput of the system). Given a certain data-path bit-width and number of actions, the clock frequency remains almost unaltered as the number of states changes. The small variations can be ascribed to the different solutions found by the routing tool. For this reason, in Fig. 9 we use the average clock frequencies. The frequency drop, when the number of actions increases, is greater for 8 bit data-paths than for the 16 and 32 bit cases. This behaviour can be justified by taking into account the major role of FPGA interconnections when a large number of bits is used.
As regards the hardware resources, the number of required LUT RAMs is related to the size of the Q-Matrix, $N \times Z$. From $N = 8$ to $N = 32$ the resources remain the same, while from $N = 64$ a higher number of LUT RAMs is required. As expected, the power consumption is proportional to the number of required LUTs (considering architectures with the same parameters). The trend can be observed in Figs. 10 and 11.
Even for the largest implementation considered ($N = 256$, $Z = 16$, 32-bit Q-Matrix values), the required FPGA resources are moderate. This suggests that the architecture can be easily employed in applications requiring a large number of states or actions and in applications where multiple agents must be implemented on the same device.
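To give a feeling for the memory sizes involved, the raw storage of the Q-Matrix can be computed as the product of the number of states, the number of actions and the data width. The sketch below is a back-of-the-envelope calculation only; it does not model how the tools map the memories onto LUT RAMs, and the function name is hypothetical.

```python
def q_matrix_bits(n_states, n_actions, width):
    """Raw Q-Matrix storage: Z memories x N locations x data width (bits)."""
    return n_states * n_actions * width


smallest = q_matrix_bits(8, 4, 8)     # 8 x 4 matrix, 8 bit data
largest = q_matrix_bits(256, 16, 32)  # 256 x 16 matrix, 32 bit data
print(smallest, largest, largest // 8 // 1024)
```

The largest configuration therefore needs 131072 bits, i.e. 16 KiB, of raw storage.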
The main result of the design exploration is that we can implement fast Q-Learning accelerators with a small amount of resources and low power consumption. The barrel shifter-based architectures do not require any DSP slice, use fewer hardware resources, and are faster and more power-efficient than their full multiplier-based counterparts, especially for the 16 and 32 bit implementations. For these reasons, they are suitable for Q-Learning applications on very small and low-power IoT devices, at the cost of a reduced set of possible $\alpha$ and $\gamma$ values.

B. STATE OF THE ART ARCHITECTURE COMPARISON
The architecture proposed in this paper has been compared with one of the best performing Q-Learning hardware accelerators to date [29].
In their paper, Da Silva et al. proposed a parallel implementation based on the number of states $N$, while in our work the parallelization is based on the number of actions $Z$. Since in most RL applications $Z \ll N$ (see the examples in Sec. I), our approach results in a smaller architecture.
Another important difference consists in the earlier selection of the Q-Matrix value to be updated. This allows a single block to be used for the computation of $Q_{new}(s_t, a_t)$, while in [29] $N \times Z$ blocks are required. Moreover, in the case of FPGA implementations, our architecture can employ distributed RAM or embedded block RAM. This gives an additional degree of freedom compared to [29], where only registers are considered for storing the Q-Matrix values.
To obtain a fair comparison:
• We implemented the same RL environment of [29] and stored the reward values in a Look-Up Table.
• We implemented a random PG as described in [29].
• We implemented the architectures on the same Virtex-6 FPGA ML605 Evaluation Kit (using the ISE 14.7 Xilinx suite).
The experimental results are shown in Tables 13, 14, 15 and 16. We can only make comparisons with $Z = 4$ and $Z = 8$ since they are the only values implemented in [29]. The implementation results are given in terms of [39]:
• Energy required to update one Q-Matrix element (Energy)
As expected, our architecture employs a constant number of DSP slices, while in [29] this number is proportional to $N \times Z$. The number of Slice Registers required by our implementations remains almost unaltered when the number of states increases, while in [29] it grows with $N$.
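An energy-per-update figure can be related to power and throughput with a simple calculation. The sketch below assumes that the energy is the dynamic power divided by the throughput, with one Q-Matrix element updated per clock cycle; this definition is an assumption made for illustration. The numbers used are the ZCU106 figures quoted in the abstract, not the Virtex-6 results of this comparison.

```python
def energy_per_update_nj(power_mw, throughput_msps):
    """Energy per Q-Matrix update in nJ, assuming one update per clock cycle."""
    # mW / MSPS = (1e-3 J/s) / (1e6 updates/s) = 1e-9 J per update = nJ
    return power_mw / throughput_msps


small = energy_per_update_nj(37, 222)   # 8 x 4 Q-Matrix, 8 bit data
large = energy_per_update_nj(611, 93)   # 256 x 16 Q-Matrix, 32 bit data
print(round(small, 3), round(large, 3))
```

Under this assumption the two configurations spend roughly 0.17 nJ and 6.57 nJ per update, respectively.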
Figure 12 compares the maximum clock frequency for different numbers of states and actions. Our system is more than 3 times faster and its speed is almost independent of the number of states of the Q-Matrix. It is important to highlight that the most evident difference between the proposed architecture and [29] is its independence from the environment and the agent's policy. The system in [29] cannot be used as a general-purpose hardware accelerator since the RL environment is mapped on the FPGA. Our system does not have such a limitation.

IV. CONCLUSION
In this paper we proposed an efficient hardware implementation of the Reinforcement Learning algorithm called Q-Learning. Our architecture exploits the learning formula by selecting a priori the element of the Q-Matrix to be updated. This approach made it possible to minimize the hardware resources.
We also presented an alternative method for reducing the computational complexity of the algorithm by employing approximated multipliers instead of full multipliers. This technique is an effective solution for implementing the accelerator on small ultra-low-power FPGAs for IoT applications.
Our architecture has been compared to the state of the art in the literature, showing that our solution requires a smaller amount of hardware resources, is faster and dissipates less power. Moreover, our system can be used as a general-purpose hardware accelerator for the Q-Learning algorithm, since it is not tied to a particular RL environment or agent's policy.
With little effort, the proposed approach can also be exploited to implement the on-policy version of the Q-Learning algorithm: SARSA. This aspect is further explored in the Appendix.
For the above reasons, our architecture is suitable for high-throughput and low-power applications. Due to the small amount of required resources, it also allows the implementation of multiple Q-Learning agents on the same device, on either FPGA or ASIC.

APPENDIX SARSA ACCELERATOR ARCHITECTURE
The proposed architecture for the acceleration of the Q-Learning algorithm can be easily exploited to implement the SARSA (State-Action-Reward-State-Action) [15] algorithm. Equation (7) shows the SARSA update formula for the Q-Matrix.
$$Q_{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right) \tag{7}$$
Comparing (2) to (7), it is straightforward to note the similarities between the two equations. Since the update of the Q-Matrix depends on the agent's next action $a_{t+1}$, the SARSA algorithm is the on-policy version of the Q-Learning algorithm (which is off-policy). The resulting architecture is presented in Fig. 14. The main difference with respect to the Q-Learning implementation in Fig. 3 consists in the replacement of the MAX block with a multiplexer driven by the next action $a_{t+1}$.
The analysis about the Q-Learning architecture can also be extended to the SARSA accelerator.
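The relation between the two algorithms can be made concrete with a small sketch in which only the selection of the next-state value changes. The function names are hypothetical and floating-point values are used for readability.

```python
def q_learning_target(row_next, a_next):
    """MAX block: largest value in the row Q(s_{t+1}, A); a_next is unused."""
    return max(row_next)


def sarsa_target(row_next, a_next):
    """Multiplexer driven by the next action a_{t+1}."""
    return row_next[a_next]


def update(q_sa, r, target, alpha, gamma):
    """Shared updater: Q + alpha * (r + gamma * target - Q)."""
    return q_sa + alpha * (r + gamma * target - q_sa)


row_next = [1.0, 4.0, 2.0]  # hypothetical row Q(s_{t+1}, A)
a_next = 2                  # hypothetical next action chosen by the policy
q_off = update(0.0, 1.0, q_learning_target(row_next, a_next), 0.5, 0.5)
q_on = update(0.0, 1.0, sarsa_target(row_next, a_next), 0.5, 0.5)
print(q_off, q_on)
```

The updater is identical in both cases, which is consistent with the statement that only the MAX block needs to be replaced.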

FIGURE 2. High-level architecture of the Q-Learning agent.

FIGURE 6. Approximated values for a 6-bit number using M = 5 bits for the fractional part. (a) 1 power of two, (b) 2 powers of two, (c) 3 powers of two.

FIGURE 7. Q-Matrix updater block with multipliers implemented by a single barrel shifter.

FIGURE 8. Q-Matrix updater block with multipliers implemented by two barrel shifters.

FIGURE 12. Clock frequency comparison between Da Silva et al. [29] and the proposed architecture, 16 bit Q-Matrix values.

Figure 13 compares the energy required to update a single Q-Matrix element for different numbers of states and actions. Also in this case, our architecture, except for the N = 6, Z = 4 case, presents a better energy efficiency, which remains almost unaltered as the number of states increases.

FIGURE 13. Energy required to update one Q-Matrix element: comparison between Da Silva et al. [29] and the proposed architecture, 16 bit values.

TABLE 1. Implementation results for Q-Matrices with 8 bit data and Z = 4.

TABLE 2. Implementation results for Q-Matrices with 8 bit data and Z = 8.

TABLE 3. Implementation results for Q-Matrices with 8 bit data and Z = 16.

TABLE 4. Implementation results for Q-Matrices with 16 bit data and Z = 4.

TABLE 5. Implementation results for Q-Matrices with 16 bit data and Z = 8.

TABLE 6. Implementation results for Q-Matrices with 16 bit data and Z = 16.

TABLE 7. Implementation results for Q-Matrices with 32 bit data and Z = 4.

TABLE 11. Implementation comparisons: approximated and full multipliers with 16 bit operands.

TABLE 12. Implementation comparisons: approximated and full multipliers with 32 bit operands.

TABLE 14. Proposed implementation results for 16 bit Q-Matrix values and Z = 4.

TABLE 16. Proposed implementation results for 16 bit Q-Matrix values and Z = 8.