A Reinforcement Learning-Based QAM/PSK Symbol Synchronizer

Machine Learning (ML) based on supervised and unsupervised learning models has recently been applied in the telecommunication field. However, such techniques rely on large application-specific datasets, and their performance deteriorates if the statistics of the inference data change over time. Reinforcement Learning (RL) is a solution to these issues because it is able to adapt its behavior to the changing statistics of the input data. In this work, we propose the design of an RL Agent able to learn the behavior of a Timing Recovery Loop (TRL) through the Q-Learning algorithm. The Agent is compatible with popular PSK and QAM formats. We validated the RL synchronizer by comparing it to the Mueller and Müller TRL in terms of Modulation Error Ratio (MER) in a noisy channel scenario. The results show a good trade-off in terms of MER performance. The RL-based synchronizer loses less than 1 dB of MER with respect to the conventional one, but it is able to adapt its behavior to different modulation formats without any tuning of the system parameters.


I. INTRODUCTION
Machine Learning (ML) is a field of Artificial Intelligence based on statistical methods to enhance the performance of algorithms in data pattern identification [1]. ML is applied in several fields, such as medicine [2], financial trading [3], big data management [4], imaging and image processing [5], security [6], [7], mobile apps [8] and more.
In recent years, advances in electronics, information sciences and Artificial Intelligence have supported research and development in the telecommunication (TLC) field. In modern TLC systems, important aspects are flexibility and compatibility with multiple standards, often addressed with ML approaches. Some examples are Software Defined Radio (SDR) [9], Cognitive Radio (CR) [10] and Intelligent Radio systems (IR). IRs are capable of autonomously estimating the optimal communication parameters when the system operates in a time-variant environment. This intelligent behavior can be obtained by using conventional adaptive signal processing techniques or by using ML approaches.
The associate editor coordinating the review of this manuscript and approving it for publication was Malik Jahan Khan.
In [11] the authors present a survey of different innovative ML techniques applied to telecommunications. In [12] the authors compare the modulation recognition performance of different ML techniques, such as Logistic Regression (LR), Artificial Neural Networks (ANN) and Support Vector Machines (SVM), over different datasets (FSK and PSK-QAM signals). The authors in [13] propose the use of Deep Belief Networks in a demodulator architecture. Other noteworthy research papers in this context are [14]-[18]. Another important research field is related to the development of ML hardware accelerators that allow time- and energy-efficient execution of ML algorithms [19]-[24].
ML techniques are usually classified in three main categories: Supervised, Unsupervised and Reinforcement Learning. The first two require a training phase to obtain an expert algorithm ready to be deployed in the field (inference phase). Supervised and unsupervised ML approaches rely on massive amounts of data, intensive offline training sessions and large parameter spaces [25]. Moreover, the inference performance degrades when the statistics of the input data differ from those of the training examples. In these cases, multiple training sessions are required to update the model [26]. In Reinforcement Learning (RL) the training and inference phases are not separated. The learner interacts with the environment to collect information and receives an immediate reward for each decision and action it takes. The reward is a numerical value that quantifies the quality of the action performed by the learner. The learner's aim is to maximize the reward while interacting with the environment through an iterative sequence of actions [27]. Consequently, in a scenario where the input data statistics evolve, a possible solution is the use of an RL approach. The main feature of RL is its capability to solve the task by interacting with the environment through trial and error. For this reason, this methodology brings the following advantages:
• The off-line training phase is not required, avoiding the need for cumbersome training datasets;
• Parameter space dimensions are generally smaller than in supervised and unsupervised approaches;
• The model is self-adapting using the data collected in the field.
RL became increasingly popular in robotics [28]-[31], Internet of Things (IoT) [32], financial trading [33] and industrial applications [34]. RL for multi-Agent systems is a growing research field as well. It is often bio-inspired [35] and the learners are organized in ensembles, i.e. swarms, to improve learning capabilities [36], [37].
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
However, the application of RL models to TLC is a fairly new and unexplored line of research. As detailed later, a conventional TLC receiver is a cascade of processing units based on tunable feedback loops and application-specific modules. For example, a Timing Recovery Loop (TRL) is a system that includes an error detector and a feedback loop filter to control the synchronization of the receiver resampler. Conventional approaches require tuning the filter parameters and the error detectors for the specific transmission parameters, such as the type of modulation scheme.
In this paper, we propose a TRL based on RL that is capable of operating independently of the modulation format, thanks to its adaptation capability. Moreover, a large training dataset is not required, unlike the cited research on the use of supervised and unsupervised ML approaches in TLC. Our solution is based on Q-Learning, an RL algorithm developed by Watkins and Dayan [38]. This model learns from the input data how to synthesize the behavior of a symbol synchronizer by using a trial-and-error iterative process. The system performance has been validated by comparing the proposed approach to a standard synchronization method, the Mueller and Müller gate [39]. With respect to [40], we extend the method to a wider set of modulation schemes by using a modified and enhanced ML model.
The main advantages of our approach are:
• It is capable of operating on different modulation schemes (static conditions);
• It is capable of self-adaptation to changes in the channel noise and modulation formats (changes in the environment);
• It avoids the loop filter parameter tuning phase;
• It avoids the need to allocate specific sub-channels for synchronization because the timing recovery is performed by using the input data.
In this paper, we prove that the use of a Reinforcement Learning based QAM/PSK symbol synchronizer is a valid timing recovery method by testing it in a generic telecommunication system. This paper is organized as follows. In Sect. II an overview of conventional symbol synchronization methods, RL and Q-Learning is provided. In Sect. III we illustrate the architecture and the sub-modules of the proposed RL-based TRL. In Sect. IV we provide the experimental results regarding the choice of the Q-Learning hyperparameters and we compare the performance of the system to conventional synchronization methods. In Sect. V the conclusions are drawn.

II. BACKGROUND
In this section, we provide insights about conventional timing recovery methods and Reinforcement Learning.

A. SYMBOL SYNCHRONIZATION METHODS
A generic receiver front-end consists of a receive filter followed by a carrier and timing recovery system. The carrier recovery system is implemented with first or second-order Phase-Locked Loops (PLL). The symbol synchronization is performed using Timing Recovery Loops (TRL) as shown in [41], [42]. Examples of recovery loops are the Costas Loop, for carrier recovery [43], the Early-Late [42], [44], the Mueller and Müller [39], and the Gardner Loop [45] for symbol synchronization. The timing recovery is often implemented as a first-order TRL. A TRL consists of a timing phase detector (TPD), a loop filter, and a Numerically Controlled Oscillator (NCO) for resampling.
In this paper we compare the RL-based synchronizer to the Mueller and Müller (M&M) method for two main reasons: the M&M recovery loop is a data-assisted timing synchronization algorithm and it uses only one sample per symbol. The M&M synchronizer (the BPSK version is shown in Fig. 1) [...]

B. REINFORCEMENT LEARNING
Reinforcement Learning is used to train an object, called Agent, to perform a certain task by interacting with the environment. As shown in Fig. 2, the RL Agent adjusts its behavior by a trial-and-error process through a rewarding mechanism called reinforcement, which is a measure of its overall task-solving capability [46]. The effects of the actions are gathered and estimated by an entity called Interpreter that computes the new state and rewards the Agent accordingly. The Interpreter can be external or internal to the Agent itself: in the latter case, the Agent and the Interpreter are merged into a more complex Agent able to sense the environment and capable of self-criticism. This is the type of Agent we use in this paper.
The proposed RL-based synchronizer interacts with the received signal generated by an evolving TLC environment and is capable of deciding the best timing compensation value. This process takes place without the need for extensive training phases and large training datasets.
In standard RL algorithms, the environment is often modeled as a Markov Decision Process (MDP) [46]. The Agent is characterized by the space $S \times A$, where $S$ is the state vector and $A$ is the set of the actions. The Agent evaluates the states $s_i \in S$ and decides to perform the action $a_i \in A$. The process by which the Agent senses the environment, collecting pieces of information and knowledge, is called exploration. At a certain time $t$, the Agent is in the state $s_t$ and takes the action $a_t$. The decision to perform the action $a_t$ takes place because the Agent iteratively learns and builds an optimal action-selection policy $\pi : S \times A \to [0, 1]$. Each $\pi(s, a)$ is a real number in the range $[0, 1]$: $\pi$ is a probability map showing the chances of the Agent to take one of the actions in $A$ given the current state $s_t$. Then, the Agent observes the effects generated by the action $a_t$ on the environment.
The information derived from the environment response is called reward, a real number (positive or negative). Its value depends on how much the action $a_t$ brings the system closer to the task solution. The Agent aims to accumulate positive rewards and to maximize its expected return by iterating actions over time. This process is modeled by the State-Value Function $V^{\pi}(s)$, which is related to the current policy map and is updated with the reward value at each iteration. Through $V^{\pi}(s)$ the Agent learns the quality of the new state $s_{t+1}$ reached starting from $s_t$ and executing $a_t$. The sequence of iterations taken by the Agent to complete the given task is called an episode. At the end of an episode, the Agent reaches a terminal state $s_{goal}$ and the updated State-Value Function is stored for further use. The subsequent episode starts with the updated State-Value Function available to the Agent. As $V^{\pi}(s)$ is updated, the Agent manages its set of actions in order to solve the task by maximizing the cumulative reward in fewer iterations. This process is called exploitation. The Agent explores and exploits the environment using the knowledge collected in the form of the $\pi$ map.
In general, the environment is dynamic and the effect of an action in a certain state can change over time. By alternating exploration and exploitation, the RL Agent adapts its State-Value Function $V^{\pi}(s)$ over time following the environment changes. There are several RL algorithms that approximate and estimate the State-Value function, both for single and multi-Agent systems, and the most prominent example is Q-learning by Watkins and Dayan [38].

1) WATKINS' Q-LEARNING
Q-learning is an algorithm able to approximate the State-Value function $V^{\pi}(s)$ of an Agent in order to find an optimal action-selection policy $\pi^*$. As mentioned before, the Agent's overall goal is to maximize its total accumulated future discounted reward derived from current and future actions. The State-Value function is defined as $V^{\pi}(s) = E[R]$, where $R$ is the Discounted Cumulative Future Reward defined as (1):
$$R = \sum_{t=0}^{\infty} \gamma^{t} r_t \qquad (1)$$
where $\gamma \in [0, 1]$ is the discount factor and $r_t$ is the reward at time $t$. In [38] the authors demonstrate that, to approximate the State-Value function $V^{\pi^*}(s)$ related to the optimal policy $\pi^*$, it is sufficient to compute a map called Action-Value Matrix $Q^{\pi^*} : S \times A$. The elements of $Q$ represent the quality of the related policy $\pi(s_t, a_t)$. The Q values are updated at every iteration (time $t$) using the Q-Learning update algorithm in the form of the Bellman equation [47], reported below (2):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (2)$$
where $\alpha$ is the learning rate, $\gamma$ is the discount factor and $r_t$ is the reward at time $t$. Some state-of-the-art action-selection policies widely used in Reinforcement Learning frameworks [46] are:
• Random: the Agent selects an action $a_i \in A$ with a random approach, ignoring the Q values;
• Greedy: the Agent selects the action with the highest corresponding Q value;
• ε-Greedy: the chance of taking a random action is a value $\varepsilon \in [0, 1]$ and the probability of a greedy choice is $1 - \varepsilon$;
• Boltzmann: the Q values are used to weight the action-selection probabilities according to the Boltzmann distribution based on the temperature parameter $\tau$.
In the case of an Agent required to perform multiple actions simultaneously, it is possible to apply the concept of Action Branching [48], which extends the dimensionality of the action space.
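As an illustration of the tabular Q-Learning update and of the ε-Greedy policy described in this subsection, the following minimal Python sketch may help. It is not the implementation used in this work: the function names, the list-of-lists Q table and the toy dimensions are assumptions made for the example, and the update is written in the standard Watkins form. The default values of `alpha` and `gamma` are the ones discussed in Sect. IV.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Greedy choice; ties resolve to the lowest action index.
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.01):
    """Apply one tabular Q-Learning update in place and return the new Q[s][a]."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

# Toy usage (hypothetical sizes): 4 states, 3 actions, Q initialised to zero.
Q = [[0.0] * 3 for _ in range(4)]
a = epsilon_greedy(Q[0], epsilon=0.0)    # purely greedy pick
q_update(Q, s=0, a=a, r=1.0, s_next=1)   # Q[0][a] moves a fraction alpha towards the target
```

With a small `gamma`, the update is dominated by the immediate reward, which is consistent with the low discount factor reported later in the experiments.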

III. TIMING SYNCHRONIZER BASED ON Q-LEARNING
In this section, we present the design of a TRL based on Q-Learning able to recover the timing of a raised-cosine-shaped signal when QAM and PSK modulation schemes are used. In our system, the input signal y[k] is downsampled by a factor M, obtaining the signal x[n] (as for the Mueller and Müller gate). The Agent task is to compensate for the symbol timing error by delaying or anticipating the resampling time.
In our experiments, the downsampling factor M is equal to the number of samples per symbol. The definition of the Agent state is related to a measure on the resampled symbol (detailed in Sect. III-A.2). The Agent action is defined as a double control (Sect. III-C): the first one is used to manage the resampling timing and the second one to modify the input signal amplitude.
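The resampling step can be pictured with a short Python sketch. It is only a simplified model under stated assumptions: the names `resample` and `shift_offset` are hypothetical, the timing control is represented as an integer sample offset inside one symbol period, and the wrap-around behavior is an assumption of the example rather than a property of the NCO described later.

```python
def resample(y, M, offset):
    """Downsample y by M starting at `offset`: x[n] = y[n * M + offset]."""
    if not 0 <= offset < M:
        raise ValueError("offset must lie in [0, M)")
    return y[offset::M]

def shift_offset(offset, step, M):
    """Move the sampling instant by `step` samples, wrapping within one symbol (assumption)."""
    return (offset + step) % M

M = 4                          # samples per symbol (toy value)
y = list(range(12))            # stand-in for the oversampled input y[k]
x = resample(y, M, offset=1)   # keeps y[1], y[5], y[9]
later = shift_offset(1, +1, M)
```

In this picture, the timing branch of the Agent action only changes `offset`, while the amplitude branch acts on a separate gain.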
With respect to Fig. 3, the Agent is required to fulfill two main objectives: 1) To minimize the variation in amplitude of consecutive resampled symbols x[n] with respect to an expected constellation (dark blue arrows); 2) To maximize the eye-opening (orange arrow). With respect to [40], the algorithm is extended to support the QAM synchronization.
The proposed RL Agent architecture is illustrated in Fig. 4 and it consists of four main sub-modules: 1) State and Reward Evaluator Block: it is the Interpreter in the RL framework; 2) Q-Learning Engine: it updates the Q matrix and implements the action-selection policy π; 3) Action Decoder: it carries out the Agent decisions to manage the signal sampling delay and its amplitude; 4) Numerically Controlled Oscillator: it provides the resampling timing. Additional details and accurate descriptions of the TRL blocks are provided in the following subsections.

A. STATE AND REWARD EVALUATOR
The basic idea behind the RL-based TRL is illustrated in Fig. 5. In this figure the input y[k] is a BPSK signal. The variation range of the resampled signal absolute value |(x[n])| is represented by the pink band (Fig. 5a) and the green band (Fig. 5b). The Agent goal is the minimization of this variation and, in addition to this, it minimizes the difference between the amplitude of the resampled signal x[n] (red crosses) and the expected symbol value $x^*$ (blue line, the value is 1 in the depicted example). We design an RL Interpreter, called State and Reward Evaluator (SRE), that processes x[n] to fulfill these objectives.
We design the Agent states $s_i \in S$ and reward $r$ depending on the input signal amplitude and its variation over time. As depicted in Fig. 6, the SRE module is built as two blocks in cascade called Constellation Conditioning and State & Reward Computing.

1) CONSTELLATION CONDITIONING
The main limitation of the approach proposed in [40] is the inability to synchronize QAM signals. To address this problem we use a coordinate transformation approach, implemented as shown in Fig. 7.
The following equation represents the constellation conditioning algorithm (3): [...] The constellation conditioning process is depicted in Fig. 8. In this example, we consider a 16QAM constellation where the coordinates of x[n] are projected on the real axis, aligned using the gain block, collapsed and merged on the expected symbol coordinate $x^*$.

2) STATE & REWARD COMPUTING
In Fig. 9 the State and Reward Computing block is shown. The state function (4) and the reward function (5) are: [...] where $N$ is the number of states and $q$ must be an odd integer to keep the sign. $\nabla^2 x[n]$ is the second-order finite difference of x[n], as shown in (6): [...] To compute the state s[n], the value of $\nabla^2 x[n]$ is scaled by $N/2$ to be compatible with the coordinates of the Agent Q-matrix. If the Agent retrieves the correct symbol phase, the differential $\nabla^2 x[n]$ tends to 0 and x[n] to $x^*$. For this reason, according to (4), the target state $s^*_i$ is placed in the neighborhood of $s_i = N/2 + x^*$ and the reward, in (5), is maximized.
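To make the role of the second-order difference and of the $N/2$ scaling more concrete, a small Python sketch is given here. It rests on explicit assumptions: the difference is taken in its backward form, and `state_index` is a hypothetical mapping chosen only so that a locked signal lands near $N/2 + x^*$. It is not a reproduction of (4), (5) or (6).

```python
def second_difference(x):
    """Backward second-order finite difference (assumed form): x[n] - 2*x[n-1] + x[n-2]."""
    return [x[n] - 2 * x[n - 1] + x[n - 2] for n in range(2, len(x))]

def state_index(d2, x_n, N):
    """Hypothetical state mapping: scale d2 by N/2, centre at N/2 + x_n, round and clip."""
    s = round(N / 2 * d2 + N / 2 + x_n)
    return min(max(s, 0), N - 1)   # keep the index inside the Q-matrix range

# A perfectly locked BPSK amplitude sequence has zero second difference.
locked = [1.0, 1.0, 1.0, 1.0]
d2 = second_difference(locked)
s = state_index(d2[-1], locked[-1], N=64)
```

With N = 64 and an expected symbol value of 1, the locked sequence maps to index 33 under this hypothetical mapping, that is, to the neighborhood of $N/2 + x^*$ mentioned above.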

B. Q-LEARNING ENGINE
The purpose of this module is to update the Action-Value elements of Q using equation (2). As shown in Fig. 4, the inputs are the Agent state s[n] and the reward r[n]. The output is the action a[n], determined using one of the policies described in Sect. II-B.1. In particular, the action is branched in two components:
• Branch $a^1 = [a^1_0, a^1_1, a^1_2]$ represents the action vector to delay, stop or anticipate the resampler;
• Branch $a^2 = [a^2_0, a^2_1, a^2_2]$ is an action vector used to increase, hold or decrease G[n].
The Agent action space $A = a^1 \times a^2$ is bi-dimensional and it is processed and actuated by the Action Decoder block, detailed in the next Section. The mapping of the state-action space $S \times A$ into Q is a tensor with one dimension for the states and two for the actions, as illustrated in Fig. 10.
There are 3 × 3 = 9 available actions for each state ($s_i$). The hyperparameter $N \in \mathbb{N}$ is the cardinality of the state space $S$ (hence $s_i \in \mathbb{N}$). For this reason, the state values s[n] calculated in (4) are rounded to address the Q-tensor.
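The structure of the Q-tensor with branched actions can be sketched in a few lines of Python. The nested-list layout and the helper names below are assumptions made for illustration; the sketch only shows how one state index addresses a 3 × 3 plane of action values and how a greedy pair of sub-actions could be read from it.

```python
def make_q_tensor(n_states, n_a1=3, n_a2=3):
    """Build an n_states x n_a1 x n_a2 tensor of zeros as nested lists."""
    return [[[0.0] * n_a2 for _ in range(n_a1)] for _ in range(n_states)]

def greedy_branched_action(Q, s):
    """Return the (i, j) index pair with the highest Q value in state s."""
    pairs = [(i, j) for i in range(len(Q[s])) for j in range(len(Q[s][i]))]
    return max(pairs, key=lambda p: Q[s][p[0]][p[1]])

Q = make_q_tensor(64)                 # 64 states, 9 branched actions each
Q[33][1][2] = 0.5                     # pretend this pair was rewarded in state 33
best = greedy_branched_action(Q, 33)  # index pair (timing branch, gain branch)
size = sum(len(row) for plane in Q for row in plane)
```

The rounded state value selects the first index, while the two action branches select the remaining two.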

C. ACTION DECODER
The Action Decoder implements the decisions taken by the Agent. The action $a = [a^1, a^2]$ is the input of this block. The outputs are used by the Agent to control the NCO timing and G[n], respectively. This block computes the increments to be assigned to the NCO timing and the gain G[n].
The action vector $a^1$ controls the resampler using three different actions:
• $a^1_0 = 0$: the Agent decides that the current timing is correct;
• $a^1_1 = +1$: the Agent anticipates the resampling;
• $a^1_2 = -1$: the Agent delays the resampling.
The purpose of the action $a^2$ is to modify the amplitude of the input signal, as shown in Fig. 8 and discussed in Sect. III-A. The elements of $a^2$ are the increments used by the Action Decoder to update the input gain G[n]:
• $a^2_0 = 0$: the current $x^*$ value is appropriate and the Agent decides to keep G[n] constant;
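The decoding step described in this subsection can be sketched as follows. The timing increments follow the values listed for $a^1$; the gain increments for the non-zero elements of $a^2$, their ordering and the step size `gain_step` are hypothetical placeholders introduced only for the example.

```python
TIMING_STEPS = (0, +1, -1)   # increments for a^1_0, a^1_1, a^1_2

def decode_action(i, j, timing, gain, gain_step=0.01):
    """Apply the branch indices (i, j) and return the updated (timing, gain)."""
    # Hypothetical gain increments: hold, increase, decrease.
    gain_steps = (0.0, +gain_step, -gain_step)
    return timing + TIMING_STEPS[i], gain + gain_steps[j]

# Example: change the timing by +1 sample and hold the gain.
timing, gain = decode_action(1, 0, timing=7, gain=1.0)
```

The two branches are decoded independently, so each of the nine index pairs maps to one timing increment and one gain increment.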

IV. EXPERIMENTS AND RESULTS
In this section, we present three experiments. In the first one we find suitable Q-Learning hyperparameters, in the second one we test the adaptivity properties of the Q-Learning Agent, and, in the last one, we compare our algorithm to the Mueller and Müller Loop when signals with different modulation formats in a generic telecommunication system are used. The experimental setup is common among the experiments and it is shown in Fig. 11.
The TX-Module is a base-band transmitter in which the modulation scheme and Root-Raised Cosine (RRC) shaping filter are parametric. The AWGN Channel Emulator is configurable in delay and noise power (measured in terms of $E_b/N_0$). The RX-Module includes an RRC receive filter and the Q-Learning Synchronizer. In these experiments, we assume a recovered carrier. The performance of the synchronizer is evaluated in terms of Modulation Error Ratio (MER), defined as in [49].
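For readers unfamiliar with the metric, the sketch below computes the MER with a commonly used definition, the ratio in dB between the power of the ideal symbols and the power of the error vectors. It is assumed here, not verified, that [49] follows this definition; the function name and the toy symbols are illustrative only.

```python
import math

def mer_db(received, ideal):
    """MER in dB: ideal symbol power divided by error-vector power."""
    signal = sum(abs(s) ** 2 for s in ideal)
    error = sum(abs(r - s) ** 2 for r, s in zip(received, ideal))
    return 10 * math.log10(signal / error)   # undefined when the error power is zero

# Toy BPSK example: every received symbol is off by 0.1 in amplitude.
ideal = [1 + 0j, -1 + 0j, 1 + 0j, -1 + 0j]
received = [1.1 + 0j, -0.9 + 0j, 0.9 + 0j, -1.1 + 0j]
value = mer_db(received, ideal)
```

A residual timing error increases the error-vector power and therefore lowers the MER, which is why the metric is suitable for comparing synchronizers.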
In the first experiment, we assess the optimal Q-Learning hyperparameters a posteriori. At the same time, we aim to reduce the Agent complexity, i.e. the dimensionality of the hyperparameter space. In the second experiment, we change the environment properties, i.e. the modulation format, to evaluate the Q-Learning Agent adaptation capability. In the last experiment, we compare the MER of the Q-Learning synchronizer with that obtained by the Mueller and Müller method for PSK and QAM modulation schemes. The experiments are characterized by the following Q-Learning hyperparameters (the values that were validated a posteriori) and TLC parameters:

A. HYPERPARAMETERS ANALYSIS
Considering equation (2) and the Agent features, the hyperparameters are:
• Action-selection policy $\pi$: it represents the action choice rule according to the Q-values. In our work we employed ε-Greedy with $\varepsilon = 10^{-5}$ to facilitate the state-space exploration. Moreover, it is important to avoid the local maxima of the reward function.
• Learning rate $\alpha$: it defines the convergence speed of the Q matrix towards the ideal State-Value function. In the case of a dynamic environment, it also defines the adaptation speed of the Agent. For $\alpha = 0.1$, we found a good trade-off between algorithm convergence speed and local maxima avoidance.
• Discount factor $\gamma$: it weights the outcome of the present actions with respect to the future ones. A suitable $\gamma$ value is 0.01.
• Number of states $N$: it defines the state-space quantization, hence the Q-tensor dimension (Fig. 10).
In the following analysis, different simulations have been implemented for PSK and QAM formats, and in the next Sections the QPSK and 16QAM cases are illustrated.

1) MER VARIATION AS A FUNCTION OF α AND γ
To study the range of variation of the Q-Learning TRL hyperparameters with respect to the modulation format and $E_b/N_0$, we set an initial constant timing delay $d = M/4$, where $M$ is the number of samples per symbol. The graphs in Fig. 12 show that the variations of $\alpha$ (Fig. 12a and Fig. 12b) and $\gamma$ (Fig. 12c and Fig. 12d) do not affect the MER performance significantly. Consequently, the default set $\alpha = 0.1$ and $\gamma = 0.01$ is a proper design choice for different modulation formats.

2) NUMBER OF STATES N
The number of states determines the environment observation precision of the Agent. To analyze the system MER performance, we simulate our system with $N$ in the range 8-1024 and $E_b/N_0$ in the range 5-30 dB for PSK and QAM modulation schemes. The MER results are evaluated after 1000 exploration symbols. Figure 13a shows that the MER is affected by $N$ only in the case of high SNR. In this case, the MER grows when $N$ increases (10 dB of spread at $E_b/N_0$ = 30 dB). In the 16QAM case in Fig. 13b, we observe the same dependence with respect to $N$ but with a smaller spread (2.5 dB at [...]). This experiment allows a proper selection of the number of states $N$. In the QPSK case, the MER curves are monotonically growing for $N \geq 64$; consequently, there is no advantage in using an Agent with more than 64 states. In the 16QAM tests, the curves show a similar trend. The best trade-off in terms of MER and number of states is $N = 64$. The resulting Q-Tensor consists of 64 × 3 × 3 = 576 elements.

B. Q-LEARNING TRL ADAPTIVITY EXPERIMENT
As mentioned in Sect. I and Sect. II-B, an RL Agent is able to adapt itself to the evolution of the environment. This property ensures high system availability in the field and no need for training sessions. With the following experiment we show the adaptivity properties that the RL approach provides and how the proposed TRL is able to operate independently of the modulation scheme.
In Fig. 14, the initial timing delay is set to d = 7 (in samples, horizontal dashed gray line) and the modulation format changes from QPSK to 16QAM (vertical red line). This change in the modulation scheme models an environment evolution and it is useful to observe the Agent behavior in terms of adaptivity.
The Agent manages its actions to carry out the best TRL controls, i.e. the timing compensation (blue curve) and the value of G (orange curve). The Agent experiences two exploration phases (green background) and two exploitation phases (white background). After the first exploration phase (∼500 iterations), the Agent decides that the best timing compensation and gain values are 7 and ∼1, in a QPSK environment. The values of the Q-matrix converge to the optimal ones and the Agent begins to exploit its knowledge by maintaining the optimal TRL controls.
At a certain point, the environment changes from QPSK to 16QAM. Because of that, the Agent is forced to adapt itself by entering the second exploration phase to update its Q-matrix. After about 2000 Q-learning iterations, the Agent enters the last exploitation phase and it decides to set the timing compensation correctly to 7 and the gain to ∼ 0.7. The modification of G[n] from ∼1 to ∼0.7 is related to the change of the real part of the coordinate of the received symbol from the QPSK to the QAM format.

C. MODULATION SCHEME EXPERIMENTS AND COMPARISON WITH CONVENTIONAL METHODS
In this experiment we test the Q-Learning synchronizer performance when different modulation schemes are used. Moreover, we compare its performance to conventional synchronization methods. The Q-Learning hyperparameters are $N = 64$, $\alpha = 0.1$, $\gamma = 0.01$, values suitable for BPSK, QPSK, 16QAM, 64QAM and 256QAM. The channel delay is constant and equal to $d = M/4$, where $M$ is the number of samples per symbol. The Q-Learning synchronizer is compared to the following methods:
• A Reference ideal resampler: a resampler with the exact timing compensation for the channel delay $d$;
• A Mueller and Müller timing recovery loop: the state-of-the-art TRL discussed in Sect. II-A.
The graphs in Fig. 15 are obtained with the same set of Q-Learning hyperparameters. The Mueller and Müller TRL filter parameters depend on the modulation method (one set for PSK modulations and one for QAM) and they are reported in Table 1. Figure 15 and Table 2 show the experimental results. The number of simulations is 200, the number of exploration symbols is 10000 and the MER of a single simulation is computed over 200 symbols. Each value in Table 2 is the average MER plus or minus its standard deviation. Figures 15a and 15b show the MER measurements for BPSK and QPSK, respectively. The MER values for BPSK and QPSK are similar. The only exception is in the QPSK configuration, as the Q-Learning synchronizer loses up to 0.7 dB at $E_b/N_0$ = 25 dB with respect to the reference model.
The QAM-related graphs (Fig. 15c, Fig. 15d and Fig. 15e) show an interesting performance. In the case of high noise ($E_b/N_0$ = 5-10 dB), the three systems perform similarly. In the mid-range (10-20 dB), the Q-Learning performance is similar to the Mueller and Müller loop in the 16QAM and 256QAM. In the same range, for 64QAM, we measured MER values lower (1.5 dB on average) than the Mueller and Müller loop. In the 20-25 dB range, the Q-Learning synchronizer is outperformed by about 2 dB due to a poorer efficacy of the exploration phase of the Agent. This leads to the conclusion that the presence of white noise in the RL environment forces the Agent to explore more states, thus converging to its optimal Q-matrix much more efficiently. Taking into account all MER measurements, the average difference between our method and the conventional TRL is lower than 1 dB.

V. CONCLUSION
In this paper, we presented an innovative method based on RL to implement a QAM/PSK symbol synchronizer. This approach employs an RL Agent able to synthesize a TRL behavior compatible with a number of modulation schemes.
The RL algorithm used to approximate the State-Value function is Q-Learning. The state and reward are computed from the sampled symbol amplitude x[n] and its differential $\nabla^2 x[n]$. The Agent action is branched in two sub-actions: the first one to manage the timing and the second one to adjust the input signal amplitude. The proposed Q-Learning TRL includes four modules: a State and Reward Evaluator (the RL Interpreter), a Q-Learning Engine, an Action Decoder, and an NCO. The compatibility with respect to the different modulation schemes is obtained through the Constellation Conditioning block. This solution extends the work presented in [40] to QAM formats.
The proposed method was validated through the simulation of a base-band transmitter, receiver, and an AWGN channel emulator, with $E_b/N_0$ in the range 5-30 dB, through three experiments. In the first one, we estimated a set of hyperparameters for BPSK, QPSK, 16QAM, 64QAM and 256QAM. The performance of the RL synchronizer is evaluated in terms of Modulation Error Ratio (MER).
In the second experiment, we proved the autonomous adaptation capability of our approach, suitable for the new intelligent communication systems. In the last experiment, we compared the RL approach to the Mueller and Müller TRL: for BPSK and QPSK modulations, the two synchronizers achieve similar MER values, close to a Reference resampler. The proposed TRL is slightly outperformed for the 64QAM and 256QAM in low-noise scenarios ($E_b/N_0 \geq 15$ dB). The RL synchronizer performance is the same as the Mueller and Müller TRL for 16QAM in the full noise range.
These results are a good trade-off between flexibility and performance. Moreover, unlike conventional methods, the Q-Learning based TRL is able to adapt itself to evolving scenarios, avoiding parameter tuning. We plan to expand our research to support additional modulation formats and to implement suitable hardware architectures.

[...] Professor with the Technical University of Denmark. He collaborates in many research projects with different companies in the field of DSP architectures and algorithms. He is the author of about 200 articles in international journals and international conferences. His current research interests include low power DSP algorithms architectures, hardware-software codesign, fuzzy logic and neural hardware architectures, low power digital implementations based on non-traditional number systems, and computer arithmetic and CAD tools for DSP. He is a member of the Audio Engineering Society (AES). He is a Director of a master in audio engineering with the Department of Electronic Engineering, University of Rome Tor Vergata.
SERGIO SPANÒ received the B.S. and M.S. degrees in electronic engineering from the University of Tor Vergata, Rome, Italy, in 2015 and 2018, respectively, where he is currently pursuing the Ph.D. degree in electronic engineering. He is a member of the DSPVLSI Research Group, University of Tor Vergata. His current research interests include digital signal processing, machine learning, telecommunications, and ASIC/FPGA hardware design. He had industrial experiences in the space and telecommunications field. His current research topics relate to machine learning hardware implementations for embedded and low-power systems.