Reinforcement Learning for Systematic FX Trading

We explore online inductive transfer learning, with a feature representation transfer from a radial basis function network formed of Gaussian mixture model hidden processing units to a direct, recurrent reinforcement learning agent. This agent is put to work in an experiment, trading the major spot market currency pairs, where we accurately account for transaction and funding costs. These sources of profit and loss, including the price trends that occur in the currency markets, are made available to the agent via a quadratic utility, and the agent learns to target a position directly. We improve upon earlier work by targeting a risk position in an online transfer learning context. Our agent achieves an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a 7-year test set; this is despite forcing the model to trade at the close of the trading day at 5 pm EST, when trading costs are statistically the most expensive.


I. INTRODUCTION
Forecasters of financial time series commonly make use of supervised learning. For example, [1] apply both parametric approaches such as nonlinear state-space models and nonparametric approaches such as local learning to nonlinear time series analysis. [2] applies learning algorithms to decision making with financial time series. He notes that the traditional approach in this domain is to train a model using a prediction criterion, such as minimising mean-square prediction error or maximising the likelihood of a conditional model of the dependent variable. He finds that with noisy time series, better results are obtained when the model is trained directly to maximise the financial criterion of interest, here gains and losses (including those due to transactions) incurred during trading.
In this spirit, we extend the earlier work of [3] and [4], where direct, recurrent reinforcement learning agents are put to work in financial trading strategies. Rather than optimising for an intermediate performance measure such as maximal forecast accuracy or minimal forecast error, which is still the traditional approach in this domain, we maximise a more direct performance measure such as quadratic economic utility. An advantage of the approach is that we can use the risk-adjusted returns of the trading strategy, execution cost and funding cost to influence the learning of the model and update model parameters accordingly.
Whereas the focus of [3] was on the use of the differential Sharpe ratio as a performance measure, we adopt the quadratic utility of [5]. This utility ameliorates an undesirable property of the Sharpe ratio, namely that the Sharpe ratio penalises a model that produces returns larger than $E[r_t^2]/E[r_t]$, that is, the ratio of the expectation of squared returns to the expectation of returns [6]. For this reason, along with the use of relatively weak features and shared backtest hyper-parameters, [4] obtained mixed results when experimenting with cash currency pairs. In contrast, our experiment with the major cash currency pairs sees our recurrent reinforcement learning trader achieve an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a seven-year test set. This return is achieved despite forcing the model to trade at the close of the trading day at 5 pm EST, when trading costs are statistically the most expensive.
Aside from the different utility functions, we put these improved experiment results down to a combination of several factors. Firstly, we use more powerful feature engineering in the shape of radial basis function networks. The hidden processing units of these networks have means, covariances and structures that are determined by an unsupervised learning procedure for finite Gaussian mixture models [7]. The approach is a form of continual learning, explicitly inductive, feature representation transfer learning [8], where the knowledge of the mixture model is transferred to upstream models. Secondly, when optimising our utility function with respect to the recurrent reinforcement learner's parameters, we do so sequentially online during the test set, using an extended Kalman filter optimisation procedure [9]. The earlier work uses less powerful offline batch gradient ascent methods. These methods cope less well with non-stationary financial time series.

arXiv:2110.04745v6 [q-fin.TR] 21 May 2022
[10] modelled the dynamics of financial assets as a jump-diffusion process, which is commonly used in financial econometrics. The jump-diffusion process implies that financial time series should exhibit small changes over time, so-called continuous changes, as well as occasional jumps. A sensible approach for coping with non-stationarity is to allow models to learn continuously.
We finish this section with a description of the layout of this paper. Section II provides preliminary introductions to transfer learning and reinforcement learning via policy gradients and ends with an overview of trading in the foreign exchange market. Section III introduces the experiment methods of this paper, including the targeting of financial risk positions with direct recurrent reinforcement and feature representation transfer via radial basis function networks. The section ends with a description of the baseline models used to compare the results of the marquis model. The marquis model is a feature representation transfer from a radial basis function network to a direct recurrent reinforcement learning agent and is shown visually in figure 3.
Section IV details the design of the experiment that we conduct on daily sampled foreign exchange pairs. The data is obtained from Refinitiv. We evaluate performance using the annualised information ratio, which is computed on daily returns that are net of transaction and funding costs. The section concludes with a brief description of the hyper-parameters set for the various models. The experiment results are described in section V and are discussed in section VI. Concluding remarks are given in section VII.

II. PRELIMINARIES
This section introduces the policy gradient form of reinforcement learning and how it has been put to work empirically in quantitative finance, particularly with automated trading strategies. Finally, we finish the section with a short review of more recent work.

A. TRANSFER LEARNING
Transfer learning refers to the machine learning paradigm in which an algorithm extracts knowledge from one or more application scenarios to help boost the learning performance in a target scenario [8]. Typically, traditional machine learning requires significant amounts of training data. Transfer learning copes better with data sparsity by looking at related learning domains where data is sufficient. Even in a big data scenario such as with streaming high-frequency data, transfer learning benefits by learning the adaptive statistical relationship of the predictors and the response. An increasing number of papers focus on online transfer learning [11,12,13]. Following Pan and Yang [14], we define transfer learning as:
Definition 1 (transfer learning). Given a source domain $D_S$ and learning task $T_S$, a target domain $D_T$ and learning task $T_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(.)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$, or $T_S \neq T_T$.
In the context of this paper, the source domain $D_S$ represents the feature space, which consists of the daily returns of the 36 currency pairs that are used in our experiment. The source learning task $T_S$ is the unsupervised compression of this feature space into a clustered form that learns its intrinsic nature. The clusters are formed via Gaussian mixture models, and we transfer their output via radial basis function networks to currency pairs that we wish to trade in the target domain $D_T$. The target learning task $T_T$ is to take financial risk positions in these currency pairs for economic utility maximisation via direct recurrent reinforcement learning.

B. POLICY GRADIENT REINFORCEMENT LEARNING
Williams [15] was one of the first to introduce policy gradient methods in a reinforcement learning context. Whereas most reinforcement learning algorithms focus on action-value estimation, learning the value of actions and selecting them based on their estimated values, policy gradient methods learn a parameterised policy that can select actions without using a value function. Williams also introduced his reinforce algorithm
$$\Delta\theta_{ij} = \eta_{ij}\,(r - b_{ij})\,\frac{\partial \ln \pi_i}{\partial \theta_{ij}},$$
where $\theta_{ij}$ is the model weight going from the $j$-th input to the $i$-th output, and $\theta_i$ is the weight vector for the $i$-th hidden processing unit of a network of such units, whose goal it is to adapt in such a way as to maximise the scalar reward $r$. For the moment, we exclude the dependence on the time of the weight update to make the notation clearer. Furthermore, $\eta_{ij}$ is a learning rate, typically applied with gradient ascent, and $b_{ij}$ is a reinforcement baseline, conditionally independent of the model outputs $y_i$, given the network parameters $\theta$ and inputs $x_i$. The term $\partial \ln \pi_i / \partial \theta_{ij}$ is known as the characteristic eligibility of $\theta_{ij}$, where $\pi_i(y_i = c, \theta_i, x_i)$ is a probability mass function determining the value of $y_i$ as a function of the parameters of the unit and its input. Baseline subtraction $r - b_{ij}$ plays a vital role in reducing the variance of gradient estimators. Sugiyama [16] shows that the optimal baseline is given as
$$b^* = \frac{E_{p(r|\theta)}\left[ r \left\| \sum_{t=1}^{T} \nabla \ln \pi(a_t|s_t,\theta) \right\|^2 \right]}{E_{p(r|\theta)}\left[ \left\| \sum_{t=1}^{T} \nabla \ln \pi(a_t|s_t,\theta) \right\|^2 \right]},$$
where the policy function $\pi(a_t|s_t,\theta)$ denotes the probability of taking action $a_t$ at time $t$ given state $s_t$, parameterised by $\theta$. The expectation $E_{p(r|\theta)}$ is taken over the probability of rewards given the model parameterisation. Williams [15] proves that the inner product of $\nabla_\theta E[r|\theta]$, the gradient of the expected reward, and $E[\Delta\theta|\theta]$, the average update vector in weight space, is non-negative. Thus for any reinforce algorithm, the average update vector in weight space lies in a direction for which this performance measure is increasing, and the quantity $(r - b_{ij})\,\partial \ln \pi_i / \partial \theta_{ij}$ represents an unbiased estimate of $\partial E[r|\theta]/\partial \theta_{ij}$.
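To make the reinforce update concrete, the following is a minimal sketch for a single Bernoulli-logistic unit, for which the characteristic eligibility has the closed form $(y - p)\,x$ with $p = \sigma(\theta^T x)$. The function names, the choice of a logistic unit and the numerical values are illustrative assumptions, not taken from the experiments of this paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, giving the probability that the unit outputs y = 1."""
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_update(theta, x, y, r, baseline, lr):
    """One reinforce step for a Bernoulli-logistic unit (hypothetical helper).

    The characteristic eligibility d ln(pi) / d theta equals (y - p) * x,
    so the update is lr * (r - baseline) * eligibility.
    """
    p = sigmoid(theta @ x)
    eligibility = (y - p) * x
    return theta + lr * (r - baseline) * eligibility

# Illustrative values only: zero weights give p = 0.5.
theta0 = np.zeros(2)
x = np.array([1.0, 2.0])
theta1 = reinforce_update(theta0, x, y=1, r=1.0, baseline=0.0, lr=0.1)
```

A rewarded output ($r > b$) moves the weights so that the sampled output becomes more likely; setting the baseline equal to the reward leaves the weights unchanged.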
Sutton and Barto [17] demonstrate an actor-critic version of a policy gradient model, where the actor references the learned policy and the critic refers to the learned value function, usually a state-value function. Denote the scalar performance measure as $J(\theta)$; the gradient ascent update takes the form
$$\theta_{t+1} = \theta_t + \eta\,\nabla J(\theta_t).$$
With the one-step actor-critic policy gradient algorithm, one inserts a differentiable policy parameterisation $\pi(a|s,\theta)$, a differentiable state-value function parameterisation $\hat{v}(s,w)$ and then one draws an action $a_t \sim \pi(.|s_t,\theta)$, taking action $a_t$ and observing a transition to state $s_{t+1}$ with reward $r_{t+1}$. Define
$$\delta_t = r_{t+1} + \gamma\,\hat{v}(s_{t+1},w) - \hat{v}(s_t,w),$$
where $0 < \gamma \le 1$ is a discount factor. The critic's weight vector is updated as follows
$$w_{t+1} = w_t + \eta_w\,\delta_t\,\nabla_w \hat{v}(s_t,w_t),$$
and finally, the actor's weight vector is updated as
$$\theta_{t+1} = \theta_t + \eta_\theta\,\delta_t\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t),$$
where $\eta_w$ and $\eta_\theta$ are learning rates. The actor-critic architecture uses temporal-difference learning combined with trial-and-error learning to improve the learned policy sequentially.
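As an illustration of the one-step actor-critic update, the sketch below assumes a linear critic $\hat{v}(s,w) = w^T x(s)$ and a softmax actor over a small discrete action set. Both parameterisations, the feature vectors and the learning rates are hypothetical choices made for the example, and the per-step discount accumulation used in episodic variants is omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over action preferences."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, x, a, r, x_next,
                      gamma=0.9, lr_actor=0.1, lr_critic=0.1):
    """One-step actor-critic update (illustrative sketch).

    theta: (n_actions, n_features) actor weights, softmax policy.
    w:     (n_features,) critic weights, linear state-value function.
    """
    # Temporal-difference error from the critic.
    delta = r + gamma * (w @ x_next) - (w @ x)
    # Critic update: the gradient of a linear value function is x.
    w_new = w + lr_critic * delta * x
    # Actor update: grad of ln pi(a|s) w.r.t. row b is (1[a == b] - pi_b) * x.
    probs = softmax(theta @ x)
    grad_log_pi = -np.outer(probs, x)
    grad_log_pi[a] += x
    theta_new = theta + lr_actor * delta * grad_log_pi
    return theta_new, w_new, delta

theta, w = np.zeros((2, 2)), np.zeros(2)
theta, w, delta = actor_critic_step(
    theta, w, x=np.array([1.0, 0.0]), a=0, r=1.0, x_next=np.array([0.0, 1.0]))
```

A positive temporal-difference error raises both the value estimate of the visited state and the preference for the action that was taken.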

1) Policy Gradient Methods in Financial Trading
Moody et al. [6] propose to train trading systems and portfolios by optimising objective functions that directly measure trading and investment performance. Rather than basing a trading system on forecasts or training via a supervised learning algorithm using labelled trading data, they train their systems using a direct, recurrent reinforcement learning algorithm, an example of the policy gradient method. The direct part refers to the fact that the model tries to target a position directly, and the model's weights are adapted such that the performance measure is maximised. The performance function that they primarily consider is the differential Sharpe ratio. Denote the annualised Sharpe ratio [18] as
$$sr_k = \sqrt{252} \times \frac{E[r_k] - r_f}{s_k},$$
where $r_k$ is the return of the $k$-th strategy, with standard deviation $s_k$, and $r_f$ is the risk-free rate. For ease of explanation, we now remove the strategy index $k$ and replace it with a time index $t$. The differential Sharpe ratio is defined as
$$dsr_t = \frac{b_{t-1}\,\Delta a_t - 0.5\,a_{t-1}\,\Delta b_t}{\left(b_{t-1} - a_{t-1}^2\right)^{3/2}}, \quad \Delta a_t = r_t - a_{t-1}, \quad \Delta b_t = r_t^2 - b_{t-1},$$
where the quantities $a_t$ and $b_t$ are exponentially weighted estimates of the first and second moments of the reward $r_t$. The exponential decay constant is $\tau \in (0, 1]$. They consider a batch gradient ascent update for the model parameters $\theta$. The reward $r_t = \Delta p_t f_{t-1} - \delta_t |\Delta f_t|$ depends on the change in reference price $p_t$, from which a gross profit and loss are computed, the transaction cost $\delta_t$ and a differentiable position function $f_t$ of the model inputs and parameters. Trading and portfolio management systems require prior decisions as input to properly consider the effect of transaction costs, market impact, and taxes. This temporal dependence on the system state requires reinforcement versions of standard recurrent learning algorithms. Moody et al. [6] present empirical results in controlled experiments that demonstrate the efficacy of some of their methods for optimising trading systems and portfolios. For a long/short trader, they find that maximising the differential Sharpe ratio yields more consistent results than maximising profits. 
Both methods outperform a trading system based on forecasts that minimise mean-square error. They find that portfolio trading agents trained to maximise the differential Sharpe ratio achieve better risk-adjusted returns than those trained to maximise profit. However, an undesirable property of the Sharpe ratio is that it penalises a model that produces returns larger than $\frac{E[r^2]}{E[r]} \approx \frac{b_t}{a_t}$, that is, the ratio of the expectation of squared returns to the expectation of returns, which is counter-intuitive to investors' notion of risk and reward.
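The penalisation of large returns can be checked numerically. The sketch below computes the differential Sharpe ratio of Moody et al. [6] for a single new return; it assumes their recursive moment updates with a small adaptation rate, here called `eta`, which is a hypothetical parameter name, and the moment values are illustrative. Since the numerator is quadratic in the new return, the measure peaks at $r_t = b_{t-1}/a_{t-1}$ and falls for larger returns.

```python
def differential_sharpe(r, a_prev, b_prev, eta=0.01):
    """Differential Sharpe ratio for a new return r (illustrative sketch).

    a_prev, b_prev: running estimates of the first and second moments.
    eta: assumed adaptation rate of the moving moment estimates.
    Returns the differential Sharpe ratio and the updated moments.
    """
    delta_a = r - a_prev
    delta_b = r * r - b_prev
    variance = b_prev - a_prev ** 2
    dsr = (b_prev * delta_a - 0.5 * a_prev * delta_b) / variance ** 1.5
    return dsr, a_prev + eta * delta_a, b_prev + eta * delta_b

# Illustrative moments: mean 1%, second moment 0.0002, so b/a = 2%.
a, b = 0.01, 0.0002
dsr_at_peak, _, _ = differential_sharpe(0.02, a, b)
dsr_above_peak, _, _ = differential_sharpe(0.03, a, b)
dsr_below_peak, _, _ = differential_sharpe(0.01, a, b)
```

With these values a 3% return scores lower than a 2% return, which is the counter-intuitive behaviour described above.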
Gold [4] extends Moody et al.'s [6] work and investigates high-frequency currency trading with neural networks trained via recurrent reinforcement learning. He compares the performance of linear networks with neural networks containing a single hidden layer and examines the impact of shared system hyper-parameters on performance. In general, he concludes that the trading systems may be effective but that the performance varies widely for different currency markets, and simple statistics of the markets cannot explain this variability.
He also finds that the linear recurrent reinforcement learners outperform the neural recurrent reinforcement learners in this application. Here, we suspect that the choice of inputs (past returns of the target) results in features with weak predictive power. As a result, the neural reinforcement learner struggles to make meaningful forecasts. In comparison, the linear recurrent reinforcement learner copes better with both noisy inputs and outputs, generating biased yet stable predictions. Gold also used shared hyper-parameters. Many of the currency pairs behave differently in terms of their price action. For example, US dollar crosses are usually momentum-driven. Cross-currencies, such as the Australian dollar versus the New Zealand dollar, tend to be mean-reverting in nature. Therefore, sharing hyper-parameters probably negatively impacts the ex-post performance here.

2) More Recent Work
In terms of more recent work involving policy gradient methods in finance, Tamar et al. [19] discuss risk-sensitive policy gradient methods that augment the standard expected cost minimisation problem with a measure of variability in cost. They consider static and time-consistent dynamic risk measures that combine a standard sampling approach with convex programming. Their approach is actor-critic for dynamic risk measures and involves explicit approximation of value functions.
Luo et al. [20] build a novel reinforcement learning framework trader. They adopt an actor-critic algorithm called deep deterministic policy gradient to find the optimal policy. Their proposed algorithm uses convolutional neural networks and outperforms some baseline methods when experimenting with stock index futures. They also discuss the generalisation and implications of the proposed method for finance.
Zhang et al. [21] use deep reinforcement learning algorithms such as deep q-learning networks [22], neural policy gradients [23] and advantage actor-critic [24] to design trading strategies for continuous futures contracts. They use long short-term memory neural networks [25] to train both the actor and critic networks. Both discrete and continuous action spaces are considered, and volatility scaling is incorporated to create reward functions that scale trade positions based on market volatility. They show that their method outperforms various baseline models, delivering positive profits despite high transaction costs. Their experiments show that the proposed algorithms can follow prominent market trends without changing positions and scale down or hold through consolidation periods.
Azhikodan et al. [26] propose automated trading systems that use deep reinforcement learning, specifically a deep deterministic policy gradient-based neural network model that trades stocks to maximise the gain in asset value. They determine the need for an additional system for trend-following to work alongside the reinforcement learning algorithm. Thus they implement a sentiment analysis model using a recurrent convolutional neural network to predict the stock trend from financial news.
Ye et al. [27] address an optimal trade execution problem that involves limit order books. Here, the model must learn how best to execute a given block of shares at minimal cost or maximal return. To this end, they propose a deep reinforcement learning-based solution that uses a deterministic policy gradient framework. Experiments on three real market datasets show that the proposed approach significantly outperforms other methods such as a submit and leave policy, a q-learning algorithm [28] and a hybrid method that combines the Almgren-Chriss model [29] with reinforcement learning.
Aboussalah and Lee [30] explore policy gradient techniques for continuous action and multi-dimensional state spaces, applying a stacked deep dynamic recurrent reinforcement learning architecture to construct an optimal real-time portfolio. The algorithm adopts the Sharpe ratio as a utility function to learn the market conditions and rebalance the portfolio accordingly.
Betancourt and Chen [31] propose a novel portfolio management method using deep reinforcement learning on markets with a dynamic number of assets. Their model endeavours to learn the optimal inventory to hold whilst minimising transaction costs.
Lei et al. [32] acknowledge that algorithmic trading is an ongoing decision making problem, where the environment requires agents to learn feature representation from highly non-stationary and noisy financial time series, and decision making requires that agents explore the environment and simultaneously make correct decisions in an online manner without any supervised information. Instead, they propose to tackle both problems via a time-driven feature-aware deep reinforcement learning model to improve the financial signal representation learning and decision making.

C. FOREIGN EXCHANGE TRADING
This section describes the foreign exchange market and the mechanics of the foreign exchange derivatives, which are central to the experimentation that we conduct in section IV. The global foreign exchange market sees more than 6 trillion US dollars traded daily. Figure 1 shows the breakdown of this turnover by instrument type and is extracted from the Bank for International Settlements Triennial Central Bank Survey, 2019.
FX transactions implicitly involve two currencies: the dominant or base currency is quoted conventionally on the left-hand side and the secondary or counter currency on the right-hand side. If foreign exchange positions are held overnight, the trader will earn the interest rate of the currency bought and pay the interest rate of the currency sold. The interest rates for specific maturities are determined in the inter-bank currency market and are heavily influenced by the base rates typically set by central banks. Foreign exchange trades settle two business days after the trade date by market convention unless otherwise specified.
Clients fund their positions by rolling them forward via tomorrow/next (tomnext) swaps. Tomnext is a short-term foreign exchange transaction where a currency pair is simultaneously bought and sold over two business days: tomorrow (in one business day) and the following day (two business days from today). The tomnext transaction allows traders to maintain their position without being forced to take physical delivery and is the convention applied by prime brokers to their clients on the inter-bank foreign exchange market. In order to determine this funding cost, one needs to compute the forward rates (prices). Forwards are agreements between two counterparties to exchange currencies at a predetermined rate on some future date.
Forward rates are calculated by adding forward points to a spot rate. These points reflect the interest rate differential between the two currencies being traded and the maturity of the trade. Forward points do not represent an expectation of the direction of a currency but rather the interest rate differential. Let $bid^{spot}_t$ denote the spot/cash currency pair rate at which price takers can sell at time $t$. Similarly, let $ask^{spot}_t$ denote the spot/cash currency pair rate at which price takers can buy at time $t$. The spot mid-rate is
$$mid^{spot}_t = 0.5 \times (bid^{spot}_t + ask^{spot}_t). \tag{2}$$
Forward points are computed as follows
$$\text{forward points}_t = mid^{spot}_t\,(e_2 - e_1)\,\frac{T}{360}\,\frac{1}{\phi},$$
where $e_2$ is the secondary interest rate, $e_1$ is the dominant interest rate, $T$ is the number of days till maturity, and $\phi$ is the tick size or pip value for the associated currency pair. Example forward points for GBPUSD are shown in figure 2. GBP= is the Refinitiv information code (ric) for cash GBPUSD and GBPTND= is the ric for tomnext GBPUSD forward points. Note that the forward points are quoted as a bid/ask pair, reflecting the appropriate interest differential applied to sellers and buyers and the additional cost (spread) quoted by the foreign exchange forwards market maker to compensate them for their quoting risk. The tomnext outrights are computed as
$$bid^{tn}_t = bid^{spot}_t + \phi \times bid^{fp}_t, \qquad ask^{tn}_t = ask^{spot}_t + \phi \times ask^{fp}_t,$$
where $bid^{fp}_t$ and $ask^{fp}_t$ denote the quoted bid and ask forward points. As an example of rolling a long GBPUSD position forward, the tomnext swap would involve selling GBPUSD at $bid^{spot}_t$ and repurchasing it at $ask^{tn}_t$. The cost of this roll is thus $notional \times (bid^{spot}_t - ask^{tn}_t)$, where $notional$ denotes the size of the position taken by the trader. If a trader is short GBPUSD, then to roll the position forward, she would buy at $ask^{spot}_t$ and sell forward at $bid^{tn}_t$, with the funding cost being $notional \times (bid^{tn}_t - ask^{spot}_t)$. This funding may be a loss but also a profit. In addition, many currency market participants hold foreign exchange deliberately to capture the favourable interest rate differential between two currencies. 
This approach is known as the carry trade and is extremely popular with the retail public in Japan, where the Yen interest rates have been historically low relative to other countries for quite some time.
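The roll mechanics above can be sketched in a few lines. The example assumes that a tomnext outright is the spot quote plus the quoted forward points scaled by the pip size; the function names and all quotes are hypothetical and chosen only to show the arithmetic of the long and short roll.

```python
def tomnext_outright(spot, forward_points, pip):
    """Outright rate, assuming outright = spot + forward points * pip size."""
    return spot + forward_points * pip

def roll_pnl(side, notional, bid_spot, ask_spot, bid_tn, ask_tn):
    """Profit or loss of rolling a position forward via a tomnext swap.

    Long:  sell at the spot bid and repurchase at the tomnext ask.
    Short: buy at the spot ask and sell forward at the tomnext bid.
    """
    if side == "long":
        return notional * (bid_spot - ask_tn)
    if side == "short":
        return notional * (bid_tn - ask_spot)
    raise ValueError("side must be 'long' or 'short'")

# Hypothetical GBPUSD quotes; forward points are quoted in pips.
pip = 0.0001
bid_spot, ask_spot = 1.3700, 1.3702
bid_tn = tomnext_outright(bid_spot, -0.4, pip)
ask_tn = tomnext_outright(ask_spot, -0.2, pip)

long_roll = roll_pnl("long", 1_000_000, bid_spot, ask_spot, bid_tn, ask_tn)
short_roll = roll_pnl("short", 1_000_000, bid_spot, ask_spot, bid_tn, ask_tn)
```

With these illustrative quotes both rolls are a cost, because the spot and forward-point spreads outweigh the small interest rate differential.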

III. EXPERIMENT METHODS
This section describes how our recurrent reinforcement learner targets a position directly. In addition, we also describe the baseline models that are used for comparison and contrast. Next, we explore online inductive transfer learning, with feature representation transfer from a radial basis function network to a direct, recurrent reinforcement learning agent. The radial basis function network consists of hidden processing units of the Gaussian mixture model. The recurrent reinforcement learning agent learns the desired risk position via the policy gradient paradigm. Finally, the agent is put to work trading the major spot market currency pairs.

A. TARGETING A POSITION WITH DIRECT RECURRENT REINFORCEMENT
Sharpe [5] discusses asset allocation as a function of expected utility maximisation, where the utility function may be more complex than that associated with mean-variance analysis. Denote the expected utility at time $t$ for a single portfolio constituent as
$$\upsilon_t = \mu_t - \frac{\lambda}{2}\sigma_t^2, \tag{3}$$
where the expected return $\mu_t = E[r_t]$ and variance of returns $\sigma_t^2 = E[r_t^2] - E[r_t]^2$ may be estimated in an online fashion with exponential decay, where as before $\tau$ is an exponential decay constant
$$\mu_t = \tau\mu_{t-1} + (1-\tau)\,r_t, \qquad \sigma_t^2 = \tau\sigma_{t-1}^2 + (1-\tau)(r_t - \mu_t)^2.$$
The risk appetite constant $\lambda > 0$ can be set as a function of an investor's desired risk-adjusted return, as demonstrated by Grinold and Kahn [33]. The information ratio is a risk-adjusted differential reward measure, where the difference is taken between the model or strategy being evaluated and a baseline or benchmark strategy with expected return $b_t$
$$ir_t = \frac{E[r_t - b_t]}{\sqrt{Var[r_t - b_t]}}.$$
The similarity to the Sharpe ratio is apparent. Setting $b_t = 0$ and substituting the non-annualised information ratio into the quadratic utility, so that $\mu_t = ir_t\,\sigma_t$, and differentiating with respect to the risk $\sigma_t$, we obtain a suitable value for the risk appetite parameter:
$$\lambda = \frac{ir_t}{\sigma_t}.$$
The net returns whose expectation and variance we seek to learn are decomposed as
$$r_t = \Delta p_t f_{t-1} - \delta_t |\Delta f_t| + \kappa_t,$$
where $\Delta p_t$ is the change in reference price, typically a mid-price $\Delta p_t = 0.5 \times (bid_t + ask_t - bid_{t-1} - ask_{t-1})$, $\delta_t$ represents the execution cost for a price taker, $\kappa_t$ is the profit or loss of rolling the overnight foreign exchange position, the so-called 'carry' (see section II-C), and $f_t$ is the desired position learnt by the recurrent reinforcement learner
$$f_t = \tanh(\theta_t^T x_t). \tag{9}$$
The model is maximally short when $f_t = -1$ and maximally long when $f_t = 1$. The recurrent nature of the model occurs in the input feature space, where the previous position is fed to the model input
$$x_t = [\phi_1(u_t), \ldots, \phi_m(u_t), f_{t-1}]^T,$$
and $\phi_j(.)$ denotes a radial basis function hidden processing unit, in a network of $m$ such units, which takes as input a feature vector $u_t$, see section III-B. The goal of our recurrent reinforcement learner is to maximise the utility in equation 3 by targeting a position in equation 9. 
To do this, one may apply an online stochastic gradient ascent update
$$\theta_t = \theta_{t-1} + \eta\,\nabla_\theta \upsilon_t.$$
Instead of a static learning rate $\eta$, one may consider the Adam optimiser of Kingma and Ba [34], where an adaptive learning rate is applied. This adaptive learning rate is a function of the gradient expectation and variance. The weight update then takes the form
$$\theta_t = \theta_{t-1} + \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla_\theta \upsilon_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla_\theta \upsilon_t)^2,$$
with $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$ denoting bias-corrected versions of the expected gradient and gradient variance, respectively, and $\epsilon$ a small constant. $\beta_1$ and $\beta_2$ are exponential decay constants. In earlier work, Bottou [35] had considered approximating the Hessian of the performance measure with respect to the model weights as a function of gradient-only information. In practice, we find that Adam takes many iterations of model fitting to get the weights large enough to take a meaningful position via function 9; this is not necessarily an Adam problem, but a result of the tanh position function taking a while to saturate. If the weights are too small, then the average position taken by the recurrent reinforcement learner will be small as well. Therefore, we settle on an extended Kalman filter [36,9] gradient-based weight update, albeit modified for reinforcement learning in this context. In algorithm 1, $P_t$ is an approximation to $[\nabla^2 \upsilon_t]^{-1}$, the inverse Hessian of the utility function $\upsilon_t$ with respect to the model weights $\theta_t$.
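The pieces of the agent's forward pass can be sketched as follows. This is an illustration under stated assumptions rather than the experiment code: the utility is taken to be the mean-variance form $\mu_t - \frac{\lambda}{2}\sigma_t^2$, the moments use a forgetting factor `tau` close to one, the carry is added to the reward as a separate profit-and-loss term, and all names and numbers are hypothetical.

```python
import numpy as np

def target_position(theta, rbf_outputs, prev_position):
    """Position in [-1, 1]: tanh of a linear map of the hidden-unit outputs
    and the previous position (the recurrent input)."""
    x = np.append(rbf_outputs, prev_position)
    return float(np.tanh(theta @ x))

def net_return(dp, f_prev, f_now, exec_cost, carry_pnl=0.0):
    """Price change earned on the old position, less the cost of changing the
    position, plus the carry (assumed additive here)."""
    return dp * f_prev - exec_cost * abs(f_now - f_prev) + carry_pnl

class OnlineQuadraticUtility:
    """Exponentially weighted mean and variance feeding a quadratic utility."""

    def __init__(self, tau=0.99, lam=1.0):
        self.tau, self.lam = tau, lam
        self.mu, self.var = 0.0, 0.0

    def update(self, r):
        self.mu = self.tau * self.mu + (1.0 - self.tau) * r
        self.var = self.tau * self.var + (1.0 - self.tau) * (r - self.mu) ** 2
        return self.mu - 0.5 * self.lam * self.var

# Illustrative step: two hidden units, previous position 0.3.
theta = np.array([0.5, -0.2, 0.1])
f_now = target_position(theta, np.array([1.0, 0.5]), prev_position=0.3)
reward = net_return(dp=0.01, f_prev=1.0, f_now=-1.0, exec_cost=0.001)
utility = OnlineQuadraticUtility(tau=0.5, lam=2.0).update(0.02)
```

Because the position is a tanh of a weighted sum, small weights imply small positions, which is the saturation issue noted above for slowly adapting optimisers.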
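We decompose the gradient of the utility function with respect to the recurrent reinforcement learner's parameters as follows:
$$\nabla_\theta \upsilon_t = \frac{d\upsilon_t}{dr_t}\left\{ \frac{dr_t}{df_t}\frac{df_t}{d\theta_t} + \frac{dr_t}{df_{t-1}}\frac{df_{t-1}}{d\theta_{t-1}} \right\}. \tag{11}$$
The constituent derivatives for the left half of equation 11 are: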

B. RADIAL BASIS FUNCTION NETWORKS
In Borrageiro et al. [37], the authors show that online transfer learning via radial basis function networks provides a residual benefit in forecasting non-stationary time series. The residual benefit stems from the feature representation transfer of clustering algorithms. These algorithms are adapted sequentially, as are the supervised learners, which map the clustered feature space to the targets. The feature engineering that we use in this paper uses clusters formed of Gaussian mixture models. The network size is determined by the unsupervised learning procedure of finite mixture models described by Figueiredo and Jain [7]. We briefly describe the key ingredients of this meta-algorithm here. The radial basis function network is a network of $m > 0$ Gaussian basis functions
$$\phi_j(u) = \exp\left(-\frac{1}{2}(u - \mu_j)^T \Sigma_j^{-1} (u - \mu_j)\right).$$
Here we learn the $j$-th mean $\mu_j$ and covariance $\Sigma_j$ through a Gaussian mixture model fitting procedure. Denote the probability density function of a $k$ component mixture as
$$p(u|\theta) = \sum_{j=1}^{k} \pi_j\,p(u|\theta_j),$$
where $p(u|\theta_j)$ is the Gaussian density with parameters $\theta_j = \{\mu_j, \Sigma_j\}$ and the mixing weights satisfy $0 \le \pi_j \le 1$, $\sum_{j=1}^{k} \pi_j = 1$. The maximum likelihood estimate $\theta_{ML} = \arg\max_\theta \log p(u|\theta)$ and the maximum a posteriori estimate $\theta_{MAP} = \arg\max_\theta \{\log p(u|\theta) + \log p(\theta)\}$ cannot be found analytically. The standard way of estimating $\theta_{ML}$ or $\theta_{MAP}$ is the expectation-maximisation algorithm [38]. This iterative procedure is based on the interpretation of $u$ as incomplete data. The missing part for finite mixtures is the set of labels $Z = z_0, ..., z_n$, which accompany the training data $u_0, ..., u_n$, indicating which component produced each training vector. Following Murphy [39], let us define the complete data log-likelihood to be
$$\ell_c(\theta) = \sum_{i=0}^{n} \log p(u_i, z_i|\theta),$$
which cannot be computed since $z_i$ is unknown. Thus, let us define an auxiliary function
$$Q(\theta, \theta_{t-1}) = E\left[\ell_c(\theta)\,|\,u, \theta_{t-1}\right],$$
where $t$ is the current time step. The expectation is taken with respect to the old parameters $\theta_{t-1}$ and the observed data $u$. Denote as $r_{ic} = p(z_i = c|u_i, \theta_{t-1})$, cluster $c$'s responsibility for datum $i$. The expectation step has the following form
$$r_{ic} = \frac{\pi_c\,p(u_i|\theta_{c,t-1})}{\sum_{c'} \pi_{c'}\,p(u_i|\theta_{c',t-1})}.$$
The maximisation step optimises the auxiliary function $Q$ with respect to $\theta$
$$\theta_t = \arg\max_\theta Q(\theta, \theta_{t-1}).$$
The $c$-th mixing weight is estimated as
$$\pi_c = \frac{\sum_i r_{ic}}{\sum_{c'} \sum_i r_{ic'}}.$$
The parameter set $\theta_c = \{\mu_c, \Sigma_c\}$ is then
$$\mu_c = \frac{\sum_i r_{ic}\,u_i}{\sum_i r_{ic}}, \qquad \Sigma_c = \frac{\sum_i r_{ic}\,(u_i - \mu_c)(u_i - \mu_c)^T}{\sum_i r_{ic}}.$$
As discussed by Figueiredo and Jain [7], expectation-maximisation is highly dependent on initialisation. They highlight several strategies to ameliorate this problem, such as multiple random starts, final selection based on the maximum likelihood of the mixture, or k-means based initialisation. However, the distinction between model-class selection and model estimation in mixture models is unclear. For example, a 3 component mixture in which one of the mixing probabilities is zero is indistinguishable from a 2 component mixture. They propose an unsupervised algorithm for learning a finite mixture model from multivariate data. Their approach is based on the philosophy of minimum message length encoding [40], where one aims to build a short code that facilitates a good data generation model. Their algorithm can select the number of components and, unlike the standard expectation-maximisation algorithm, does not require careful initialisation. The proposed method also avoids another drawback of expectation-maximisation for mixture fitting: the possibility of convergence toward a singular estimate at the boundary of the parameter space. Denote the optimal mixture parameter set
$$\theta^* = \arg\min_\theta \mathcal{L}(\theta, u),$$
where $\mathcal{L}(\theta, u)$ is the message length of the mixture. This leads to a modified maximisation step
$$\pi_c = \frac{\max\left\{0, \sum_i r_{ic} - \frac{N}{2}\right\}}{\sum_{j=1}^{k} \max\left\{0, \sum_i r_{ij} - \frac{N}{2}\right\}},$$
where $N$ is the number of parameters specifying each mixture component. The maximisation step is identical to expectation-maximisation, except that the $c$-th parameter set $\theta_c$ is only estimated when $\pi_c > 0$ and $\theta_c$ is discarded from $\theta^*$ when $\pi_c = 0$. A distinctive feature of the modified maximisation step is that it leads to component annihilation; this prevents the algorithm from approaching the boundary of the parameter space. In other words, if one of the mixtures is not supported by the data, it is annihilated. We finish the section by showing figure 3, which provides a visual representation of the feature representation transfer from the radial basis function network to the recurrent reinforcement learning agent. 
The external input to the transfer learner, represented by the left-most black circles, is a vector of daily returns of the 36 currency pairs used in the experiment, detailed in section IV-A. The grey circles represent the radial basis function network hidden processing unit layer. In addition, we have a blue circle that represents the previously estimated position of the recurrent reinforcement learning agent. The outputs of this hidden layer are stored
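The mixture fitting and feature transfer of this section can be illustrated with a small batch sketch. It performs one expectation step and one maximisation step with the annihilating mixing-weight rule, then evaluates the Gaussian basis functions. It is a simplification: the procedure of Figueiredo and Jain [7] updates components one at a time, whereas this sketch updates them together, and the function names and toy data are assumptions.

```python
import numpy as np

def mahalanobis_sq(U, mean, cov):
    """Squared Mahalanobis distance of each row of U from mean."""
    diff = U - mean
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

def gaussian_density(U, mean, cov):
    d = U.shape[1]
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis_sq(U, mean, cov)) / norm

def rbf_features(u, means, covs):
    """Hidden-unit outputs: one Gaussian basis function per mixture component."""
    u = np.atleast_2d(u)
    return np.array([np.exp(-0.5 * mahalanobis_sq(u, m, S))[0]
                     for m, S in zip(means, covs)])

def em_step(U, pis, means, covs, n_params=0.0):
    """One batch EM iteration; n_params > 0 enables component annihilation."""
    k = len(pis)
    # Expectation step: responsibilities r_ic.
    weighted = np.column_stack(
        [pis[c] * gaussian_density(U, means[c], covs[c]) for c in range(k)])
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    mass = resp.sum(axis=0)
    # Modified maximisation step for the mixing weights.
    support = np.maximum(0.0, mass - 0.5 * n_params)
    if support.sum() == 0.0:
        raise ValueError("every component was annihilated")
    new_pis = support / support.sum()
    keep = new_pis > 0
    new_means, new_covs = [], []
    for c in np.flatnonzero(keep):  # only surviving components are re-estimated
        mu = resp[:, c] @ U / mass[c]
        diff = U - mu
        new_means.append(mu)
        new_covs.append((resp[:, c, None] * diff).T @ diff / mass[c])
    return new_pis[keep], new_means, new_covs

# Toy data: two well-separated clusters of three points each.
U = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
init_means = [np.zeros(2), np.full(2, 5.0)]
init_covs = [np.eye(2), np.eye(2)]
pis, means, covs = em_step(U, np.array([0.5, 0.5]), init_means, init_covs)
features = rbf_features(means[0], means, covs)

# A component placed far from all data receives no support and is dropped.
far_means = [np.zeros(2), np.full(2, 100.0)]
pis_far, means_far, _ = em_step(U[:3], np.array([0.5, 0.5]), far_means, init_covs)
```

The feature vector returned by `rbf_features` is what would be passed, together with the previous position, to the downstream learner.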

C. BASELINE MODELS
In order to assess the comparative strength of the model of section III-A, we employ two baseline models. The first model is a momentum trader, which uses the sign of the next step ahead return forecast as a target position. This model is also a radial basis function network; except here, the feature representation transfer of the Gaussian mixture model cluster is made available to an exponentially weighted recursive least-squares supervised learner. A visual representation of the model is similar to figure 3, without any recurrent position unit as represented by the blue circle.
The exponentially weighted recursive least-squares fitting procedure is shown compactly in algorithm 2. The precision matrix $P_0$ may be initialised to the identity matrix scaled by the inverse of the Ridge penalty, $P_0 = \alpha^{-1} I_d$, and the initial weights $w_0$ are typically initialised to the zero vector. The discount factor $\tau$ is typically close to but less than 1. This particular model form is studied by Borrageiro et al. [37] in a multi-step horizon forecasting context.
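The recursion behind algorithm 2 can be sketched in Python as follows. This is a minimal illustration of a standard recursive least-squares update with an exponential forgetting factor, not a transcription of algorithm 2; the class name `EWRLS` and its method names are assumptions made for the example, while the initialisation follows the text ($P_0 = \alpha^{-1} I_d$, $w_0 = 0$).

```python
import numpy as np


class EWRLS:
    """Exponentially weighted recursive least-squares (illustrative sketch)."""

    def __init__(self, d, alpha=1.0, tau=0.99):
        self.tau = tau                  # exponential decay factor, 0 < tau <= 1
        self.w = np.zeros(d)            # w_0 is the zero vector
        self.P = np.eye(d) / alpha      # P_0 = I_d / alpha (alpha > 0 assumed here)

    def update(self, x, y):
        """Update the weights with one (features, target) observation."""
        Px = self.P @ x
        k = Px / (self.tau + x @ Px)              # gain vector
        self.w = self.w + k * (y - self.w @ x)    # correct by the forecast error
        self.P = (self.P - np.outer(k, Px)) / self.tau
        return self.w

    def predict(self, x):
        """Forecast the target for a new feature vector."""
        return float(self.w @ x)
```

Under this sketch, the momentum baseline would update with the previous feature vector and the newly observed return, and then take the sign of the forecast as its target position, for example `np.sign(model.predict(x_t))`.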
Algorithm 2: exponentially weighted recursive least-squares
Require: $\alpha$, $\tau$ // $\alpha \geq 0$ is a Ridge penalty; $0 < \tau \leq 1$ is an exponential decay factor.
Input: $x_{t-1}, x_t \in \mathbb{R}^d$, $y_t$ // $y_t$ is the daily sampled return of the target.
Output: $\hat{y}_t$

Our second baseline is the carry trader, which hopes to earn the positive overnight interest rate differential between the two currencies of a pair. The long and short carry are computed from the spot and tomorrow/next prices, where the superscript spot denotes the cash price and the superscript tn denotes the tomorrow/next price. The position of the carry trader follows from these two quantities. In other words, the carry trader goes long the base currency if the base currency has an overnight interest rate higher than the counter currency. Equally, the carry trader sells the base currency short if the base currency has an overnight interest rate that is lower than the counter currency. Long and short carry may be a cost rather than a profit because of the bid/ask spread at which traders make markets in tomnext swaps. We therefore allow the carry trader to abstain from trading completely in such circumstances.
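The decision rule of the carry trader can be illustrated with a short Python sketch. The function name `carry_position` and its two inputs are hypothetical: they stand for the long and short carry already computed net of the tomnext bid/ask spread, whose exact formulas are not reproduced here.

```python
def carry_position(long_carry: float, short_carry: float) -> int:
    """Return +1 (long base currency), -1 (short base currency) or 0 (abstain).

    long_carry and short_carry are assumed to be the overnight funding earned
    by holding a long or a short position, net of the tomnext bid/ask spread.
    """
    if long_carry > 0 and long_carry >= short_carry:
        return 1        # the base currency earns the higher overnight rate
    if short_carry > 0:
        return -1       # the counter currency earns the higher overnight rate
    return 0            # both directions cost money, so abstain


print(carry_position(0.0002, -0.0003))    # 1
print(carry_position(-0.0003, 0.0002))    # -1
print(carry_position(-0.0001, -0.0001))   # 0
```

The third call shows the abstention case, where the bid/ask spread turns the carry into a cost in both directions.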

IV. EXPERIMENT DESIGN
In this section, we establish the design of the experiment, beginning with a description of the data we use and finishing up with a description of the performance evaluation criteria.

A. THE DATA
We obtain our experiment data from Refinitiv. We extract daily sampled data for 36 major cash foreign exchange pairs with available tomnext forward points and outrights. These foreign exchange pairs are listed in table 1. Summary statistics of the distribution of the daily returns for these currency pairs are shown in table 2. The dataset begins on 2010-12-07 and ends on 2021-10-21, a total of 2,840 observations per pair. Daily spot mid-price returns are constructed for each of these currency pairs. These are used as the features for the recurrent reinforcement learning agent and the exponentially weighted recursive least-squares momentum trader. The mid-price is defined in equation 2, and the return for the $k$th pair is computed from consecutive mid-prices.

Pair | Spot code | Tomnext code
… | IDR= | IDRTN=
USDILS | ILS= | ILSTN=
USDINR | INR= | INRTN=
USDJPY | JPY= | JPYTN=
USDKRW | KRW= | KRWTN=
USDMXN | MXN= | MXNTN=
USDNOK | NOK= | NOKTN=
USDPLN | PLN= | PLNTN=
USDRUB | RUB= | RUBTN=
USDSEK | SEK= | SEKTN=
USDSGD | SGD= | SGDTN=
USDTHB | THB= | THBTN=
USDTRY | TRY= | TRYTN=
USDTWD | TWD= | TWDTN=
USDZAR | ZAR= | ZARTN=

One of the challenges that the models face in the experiment is that these daily data show the last known top-of-book spot and outright prices at the end of the trading day, 5 pm EST. The bid/ask spreads for these prices are statistically at their widest at this time, so the execution and funding costs are more expensive; this contrasts with a trader who can execute at a more liquid time, such as 2 pm GMT. If we try to use intraday data, say data sampled minutely, Refinitiv restricts us to 41 trading days, which is a small sample. Figure 4 illustrates the challenge succinctly. It shows relative intraday bid/ask spreads for the 36 currency pairs that we experiment with. The data are sampled minutely over two months ending mid-October 2021. The global maximum bid/ask spread occurs precisely when Refinitiv samples the daily data.
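The quantities used in this sub-section can be sketched in Python as below. The definitions are assumptions made for illustration, since equation 2 and the return formula are not reproduced here: the mid-price is taken as the average of bid and ask, the return as the simple return between consecutive mid-prices, and the relative spread as the bid/ask spread divided by the mid-price. The function names are hypothetical.

```python
import numpy as np


def mid_price(bid, ask):
    """Mid-price, assumed here to be the average of bid and ask."""
    return 0.5 * (np.asarray(bid, dtype=float) + np.asarray(ask, dtype=float))


def simple_returns(mid):
    """Simple returns between consecutive mid-prices (an assumed convention)."""
    mid = np.asarray(mid, dtype=float)
    return mid[1:] / mid[:-1] - 1.0


def relative_spread(bid, ask):
    """Bid/ask spread expressed as a fraction of the mid-price."""
    bid = np.asarray(bid, dtype=float)
    ask = np.asarray(ask, dtype=float)
    return (ask - bid) / mid_price(bid, ask)
```

A wider relative spread at the 5 pm EST sampling time translates directly into a higher cost for every position change made on these daily prices.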

B. PERFORMANCE EVALUATION METHODS
We have just under 11 years of daily data to use in our experiment. From these data, we construct daily returns for each of the 36 currency pairs, reserving the first third as a training set and the final two-thirds as a test set. The structure of the radial basis function networks of sub-section III-B is determined in the training set, with external input being the returns of the various currency pairs. The recurrent reinforcement learning agent is also fitted in the training set to each currency pair, explicitly learning the weights of the position function of equation 9, using the extended Kalman filter learning procedure of algorithm 1. Additionally, the momentum trader of sub-section III-C is fitted in the training set to each currency pair using algorithm 2. Both models continue to learn online during the test set. The carry trader baseline, however, does not require any model fitting.
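The evaluation protocol described above, a chronological split followed by continued online learning in the test set, can be sketched as follows. The names `chronological_split`, `walk_forward` and `RunningMean` are hypothetical, and the running-mean model is only a stand-in for any learner that exposes `predict` and `update`; `X[t]` is assumed to hold the features available before the target `y[t]` is realised.

```python
def chronological_split(n_obs: int):
    """First third of the observations for training, the rest for testing."""
    n_train = n_obs // 3
    return range(0, n_train), range(n_train, n_obs)


class RunningMean:
    """Toy online learner that forecasts the mean of the targets seen so far."""

    def __init__(self):
        self.n, self.mean = 0, 0.0

    def predict(self, x):
        return self.mean

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def walk_forward(model, X, y, n_train):
    """Fit sequentially on the training set, then forecast-then-update in the test set."""
    forecasts = []
    for t in range(len(y)):
        if t >= n_train:
            forecasts.append(model.predict(X[t]))   # forecast before y[t] is seen
        model.update(X[t], y[t])                    # online update continues in the test set
    return forecasts


train_idx, test_idx = chronological_split(2840)
print(len(train_idx), len(test_idx))   # 946 1894
```

With 2,840 observations over just under 11 years, that is roughly 261 observations per year, so the 1,894 test observations correspond to a little over 7 years.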
The test set evaluates performance for each currency pair using the net profit and loss of equation 8. This reward, net of transaction and funding cost, is in price difference space. We convert it to returns space by dividing by the mid-price computed using equation 2. These returns are accumulated to produce the results shown in figure 6 and the middle sub-plots of figures 8 and 9. In addition, the daily returns are described statistically in tables 3 and 4. In table 3, the information ratio (ir) is computed using equation 6, with the baseline return set to $b_t = 0$. In summary, we evaluate performance by considering the risk-adjusted daily returns generated by each model, net of transaction and funding costs.
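The evaluation steps above can be sketched in Python. This is an illustration under stated assumptions rather than the exact form of equations 6 and 8: the information ratio is taken as the mean of the residual daily returns divided by their standard deviation, annualised with the square root of the number of periods per year, and the default of 261 periods per year is an assumption based on the sample size. All function names are hypothetical.

```python
import numpy as np


def pnl_to_returns(net_pnl, mid):
    """Convert net profit and loss in price-difference space to returns space."""
    return np.asarray(net_pnl, dtype=float) / np.asarray(mid, dtype=float)


def information_ratio(returns, baseline=0.0, periods_per_year=261):
    """Annualised ratio of mean residual return to residual risk."""
    residual = np.asarray(returns, dtype=float) - baseline
    return np.sqrt(periods_per_year) * residual.mean() / residual.std(ddof=1)


def annual_compound_return(returns, periods_per_year=261):
    """Annualised compound return implied by a series of daily returns."""
    returns = np.asarray(returns, dtype=float)
    growth = np.prod(1.0 + returns)
    return growth ** (periods_per_year / len(returns)) - 1.0
```

Because the ratio divides a mean by a standard deviation, it is unchanged when every daily return is scaled by the same positive constant, which is why it is a useful measure for comparing strategies that run at different risk levels.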

C. HYPER-PARAMETERS
The following hyper-parameters are set in the experiment:
• $\tau = 0.99$; this is the exponential decay constant of the moving moment equations 4, 5, 8, the extended Kalman filter of algorithm 1 and the exponentially weighted recursive least-squares of algorithm 2.
• $\alpha = 1$; this is the Ridge penalty of the extended Kalman filter of algorithm 1 and the exponentially weighted recursive least-squares of algorithm 2.
• $\gamma$, the risk appetite parameter of equation 3, is initially set to 1 and then updated by passing through the training data once and setting it via the procedure of equation 7.

Figure 6 shows the accumulated returns for each strategy. The reinforcement learning agent is denoted as drl, the momentum trader as mom and the carry trader as carry. The carry baseline performs poorly, reflecting the environment of low interest rate differentials since the 2008 financial crisis: the funding that can be earned is small relative to the execution cost. Figure 5 shows the direction of travel in central bank interest rates over the past 20 years. Central bank rates halved on average during the 2008 global financial crisis and have declined further since.
In contrast, the momentum trader achieves the highest return, with an annual compound net return of 11.7% and an information ratio of 0.4. The recurrent reinforcement learner achieves an annual compound net return of 9.3%, with an information ratio of 0.52. Its information ratio is higher because its standard deviation of daily portfolio returns is two-thirds of the momentum trader's. Table 3 summarises net profit and loss returns statistics by strategy, and figure 7 shows the distribution of the daily returns. Table 4 shows the funding or carry in returns space for each strategy.
We can see that the carry baseline does indeed capture positive carry, although this return is not enough to offset the execution cost and the profit and loss associated with holding risk, which moves in a trend-following way and mainly in opposition to the funding profit and loss. That funding moves opposite to price trends is expected. Central banks invariably increase overnight rates when currencies depreciate considerably, to make their currency more attractive and stem the tide of depreciation. The Turkish Lira and Russian Ruble are two cases in point. We see evidence in table 4 that the recurrent reinforcement learner captures more carry than the momentum trader. This funding capture is expected as well, as the funding profit and loss enters equation 8 and is propagated through the derivative of the utility function with respect to the model weights, using equation 11.

VI. DISCUSSION
Both baselines make decisions using incomplete information. The momentum trader focuses on learning the foreign exchange trends but ignores the execution and funding costs, whereas the carry trader tries to earn funding but ignores execution costs and the price movements of the underlying currency pair. In contrast, the recurrent reinforcement learner optimises the desired position as a function of market moves and funding whilst minimising execution cost. To demonstrate that the recurrent reinforcement learner is indeed learning from these reward inputs, we compare the realised positions of a USDRUB trader in two cases: with transaction costs and carry removed (figure 8), and with transaction costs and carry included (figure 9). We see that without cost, the recurrent reinforcement learner broadly realises a long position (buying USD and selling RUB) as the Ruble depreciates over time. In contrast, when funding cost is accurately applied, the overnight interest rate differential is roughly 6%, and the recurrent reinforcement learner learns a short position (selling USD and buying RUB), capturing this positive carry. The positive carry is not enough to offset the rapid depreciation of the Ruble.
How significant are these results? Table 5 shows the empirical information ratios reported by Grinold and Kahn [33] for US fund managers over the five years from January 2003 through December 2007. The data relate to 338 equity mutual funds, 1,679 equity long-only institutional funds, 56 equity long-short institutional funds and 537 fixed-income mutual funds. Although now a bit dated, the results indicate that our recurrent reinforcement learner, which trades at statistically the worst time of day in the foreign exchange market, achieves an information ratio at the 75th percentile of those achieved empirically by various passive and active fund managers within fixed income and equities. The momentum trader achieves an information ratio between the 50th and 75th percentiles.
The information ratio is a measure of consistency and has a probabilistic interpretation: it measures the probability that a strategy will achieve positive residual returns in every period [33]. Equation 6 shows that the information ratio is the ratio of residual return to residual risk. Let us denote this residual return as the strategy's alpha. The probability of realising a positive residual return is $\Phi(\mathrm{ir})$, where $\Phi(\cdot)$ denotes the cumulative normal distribution function. In this respect, we find that the recurrent reinforcement learner has a probability of positive residual return of 70% and the momentum baseline has a probability of positive residual return of 66%.
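The two probabilities quoted above can be checked with a few lines of Python, assuming that the probability of a positive residual return is the standard normal cumulative distribution function evaluated at the information ratio. The helper name `normal_cdf` is chosen for the example.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Information ratios reported for the two learners.
for name, ir in [("drl", 0.52), ("mom", 0.40)]:
    print(f"{name}: {normal_cdf(ir):.2f}")
# drl: 0.70
# mom: 0.66
```

Under this assumption, the reported information ratios of 0.52 and 0.4 reproduce the 70% and 66% figures.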
In terms of future work, one might consider a multi-layer perceptron version of our recurrent reinforcement learner. One might also consider an echo state network [41] version of the model. In addition, one might be able to improve the results further by applying a portfolio overlay. The utility function of equation 3 is readily treated as a portfolio problem, where the optimal, unconstrained portfolio weights are obtained by differentiating the portfolio utility with respect to the weight vector. Another approach is to treat portfolio selection as a policy gradient problem, where the policy of picking actions, or in this case portfolio constituents, is estimated via function approximation techniques.
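The portfolio overlay mentioned above can be sketched in Python. The sketch assumes a textbook mean-variance form of the quadratic utility, $U(w) = w^\top \mu - \frac{\gamma}{2} w^\top \Sigma w$, which may differ in detail from equation 3; setting its gradient $\mu - \gamma \Sigma w$ to zero gives the unconstrained weights $w^* = \gamma^{-1} \Sigma^{-1} \mu$. The function names and the example inputs are hypothetical.

```python
import numpy as np


def quadratic_utility(w, mu, sigma, gamma=1.0):
    """Assumed mean-variance utility: expected return minus a risk penalty."""
    w = np.asarray(w, dtype=float)
    return float(w @ mu - 0.5 * gamma * w @ sigma @ w)


def optimal_weights(mu, sigma, gamma=1.0):
    """Unconstrained maximiser of the utility, solving gamma * sigma @ w = mu."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return np.linalg.solve(gamma * sigma, mu)


# Hypothetical expected returns and covariance for two strategies.
mu = np.array([0.02, 0.01])
sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
w_star = optimal_weights(mu, sigma, gamma=2.0)
print(np.allclose(mu - 2.0 * sigma @ w_star, 0.0))   # True: the gradient vanishes
```

Solving the linear system directly avoids forming the inverse covariance matrix explicitly, which is usually preferable when the covariance estimate is poorly conditioned.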

VII. CONCLUSION
We conduct a detailed experiment on major cash foreign exchange pairs, accurately accounting for transaction and funding costs. These sources of profit and loss, including the price trends that occur in the currency markets, are made available to our recurrent reinforcement learner via a quadratic utility, and the learner learns to target a position directly. We improve upon earlier work by casting the problem of learning a risk position in an online learning context. This online learning occurs sequentially in time but also via transfer learning. This transfer learning takes the form of radial basis function hidden processing units, whose means, covariances and overall size are determined by an unsupervised learning procedure for finite Gaussian mixture models. The intrinsic nature of the feature space is learnt and made available to the recurrent reinforcement learner and the baseline supervised-learning momentum trader. The recurrent reinforcement learning trader achieves an annualised portfolio information ratio of 0.52 with a compound return of 9.3%, net of execution and funding cost, over a 7-year test set, despite forcing the model to trade at the close of the trading day, 5 pm EST, when trading costs are statistically the most expensive. The momentum baseline trader achieves a somewhat higher compound return of 11.7% but a lower information ratio of 0.4. The recurrent reinforcement learner does maintain an essential advantage in that the model's weights can be adapted to reflect the different sources of profit and loss variation, including returns momentum, transaction costs and funding costs. We demonstrate this visually in figures 8 and 9, where a USDRUB trading agent learns to target different positions that reflect trading in the absence or presence of cost.