The Recurrent Reinforcement Learning Crypto Agent

We demonstrate a novel application of online transfer learning for a digital assets trading agent. This agent uses a powerful feature space representation in the form of an echo state network, the output of which is made available to a direct, recurrent reinforcement learning agent. The agent learns to trade the XBTUSD (Bitcoin versus US Dollar) perpetual swap derivatives contract on BitMEX on an intraday basis. By learning from the multiple sources of impact on the quadratic risk-adjusted utility that it seeks to maximise, the agent avoids excessive over-trading, captures a funding profit, and can predict the market's direction. Overall, our crypto agent realises a total return of 350%, net of transaction costs, over roughly five years, 71% of which is down to funding profit. The annualised information ratio that it achieves is 1.46.


I. INTRODUCTION
Financial time series provide many modelling challenges for both researchers and practitioners. In some circumstances data are sparse, while in others the datasets are vast; this affects the choice of model and the learning style. In addition, financial time series are typically both autocorrelated and nonstationary, requiring approaches such as integer or fractional differencing [1] to remove these effects and facilitate correct feature selection; Granger and Newbold [2] previously identified these effects as leading to spurious regressions if not mitigated. Another approach to coping with nonstationarity is to allow models to learn continuously.
Against this backdrop, we extend our earlier work [3], where we combine sequential learning with transfer learning [4] and reinforcement learning [5]. More concretely, in a novel step, we transfer the learning of an echo state network [6] to a direct, recurrent reinforcement learning agent [7] that must learn to trade digital asset futures, specifically the XBTUSD (Bitcoin versus US Dollar) perpetual swap on the BitMEX exchange. Our transfer learner benefits from an ample, dynamic reservoir feature space and can identify and learn from the different sources of impact on profit and loss, including execution costs, exchange fees, funding costs and price moves in the market.
Perhaps the main benefit of this paper will be for financial industry practitioners. In the numerous papers we researched on machine learning applications to financial trading, the researchers' emphasis tends to be on the novelty of the machine learning model, which inevitably has a high learning capacity. In practically all cases, good risk-adjusted returns are claimed, yet when one digs into the results in more detail, one invariably finds that trading costs are not fully accounted for. Such papers usually assume that trades execute with certainty at the closing prices of sub-sampled data. However, only a price taker can obtain certainty of fill, by crossing the bid/ask spread. The price taker then observes an immediate loss equal to half the bid/ask spread at the time of execution. We rarely see this cost accounted for, and even when it is, such as in the seminal work of Moody et al. [7], a fixed execution cost is assumed. A fixed execution cost is never observed in reality; see, for example, figure 4 of Borrageiro et al. [3], which shows that the bid/ask spread varies by time of day. For many assets, especially in traditional finance, the bid/ask spread varies by day of the week as well; Dacorogna et al. [8] demonstrate and discuss various examples of such stylised facts. Another approach commonly taken by academic financial trading researchers is to use supervised learning and sub-sample the high-frequency data into monthly time series. The main reason for doing this is to reduce excessive trading and thus high execution cost. Inevitably, this comes with the caveat of a lack of experimental data and its impact on the generalisation performance of the chosen model. Furthermore, the down-sampling of data into lower frequencies typically makes bearable the slow training times of many models, such as deep q-learning networks.
We complete this section with a summary of the main contributions of this paper. Our meta-model can process data as a stream and learn sequentially; this helps it cope with the nonstationarity of the high-frequency order book and trading data. Furthermore, by using the vast high-frequency data, our model, which has a high learning capacity, avoids the kind of overfitting on a lack of data points that occurs with down-sampled data. We escape the problem of over-trading that is typically seen with supervised learning models by learning the sensitivity of the change in risk-adjusted returns to the model's parameterisation. Stated another way, our model learns from the multiple sources of impact on profit and loss and targets the appropriate risk position. Finally, the scientific experiment that we conduct is representative of the conditions that would be observed in live trading; thus, we are confident that the resulting performance can realistically be transferred to industry use.
arXiv:2201.04699v4 [cs.LG] 21 May 2022

II. PRELIMINARIES
This section provides a brief overview of the related ideas that we use in our experimentation, namely transfer learning, reservoir computing with echo state networks, and a form of policy gradients reinforcement learning. Finally, we conclude the section with a literature review of recent publications that apply reinforcement learning in cryptocurrency trading.

A. TRANSFER LEARNING
Transfer learning refers to the machine learning paradigm in which an algorithm extracts knowledge from one or more application scenarios to help boost the learning performance in a target scenario [4]. Typically, traditional machine learning requires large amounts of training data. Transfer learning copes better with data sparsity by looking at related learning domains where data is sufficient. Even in a big data scenario such as streaming high-frequency data, transfer learning can benefit from learning immediately, where data is initially sparse, and a learner must begin providing forecasts when asked to do so. An increasing number of papers focus on online transfer learning [9,10,11]. Following Pan and Yang [12], we define transfer learning as: Definition 1 (transfer learning). Given a source domain $D_S$ and learning task $T_S$, a target domain $D_T$ and learning task $T_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(.)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$, or $T_S \neq T_T$.

B. ECHO STATE NETWORKS
Echo state networks are a form of recurrent neural network. They consist of a large, fixed, recurrent reservoir network, from which the desired output is obtained by training suitable output connection weights. Determination of the optimal output weights is solvable analytically, for example, sequentially with recursive least squares [6]. Echo state networks are an example of the reservoir computing paradigm of understanding and training recurrent neural networks, based on treating the recurrent part (the reservoir) differently than the readouts from it [13]. Following Jaeger [6], the echo state property states that: Definition 2 (echo state property). If a network is started from two arbitrary states $x_0, \tilde{x}_0$ and is run with the same input sequence in both cases, the resulting state sequences $x_T, \tilde{x}_T$ converge to each other. If this condition holds, the reservoir network state will asymptotically depend only on the input history, and the network is said to be an echo state network.
The echo state property is guaranteed if the dynamic reservoir weight matrix $W^{hidden}$ is scaled such that its spectral radius $\rho(W^{hidden})$, that is, its largest absolute eigenvalue, satisfies $\rho(W^{hidden}) < 1$. This ensures that $W^{hidden}$ is contractive. The mathematically correct connection between the spectral radius and the echo state property is that the latter is violated if $\rho(W^{hidden}) > 1$ in reservoirs using the tanh function as neuron nonlinearity and for zero input [14].
Murray [15] states that the criteria for plausible modelling of how the brain might perform challenging time-dependent computations are locality and causality. As the echo state network uses a fixed, dynamic reservoir of weights, whose update depends on local information of inputs and activations, with fixed and random feedback, the model offers a plausible model of biological function. These biological aspects are explored by Maass and Markram [16] concerning liquid state machines and more generally with spiking neural networks in Samadi et al. [17].
Numerous articles demonstrate echo state networks within a reinforcement learning context. For example, Szita et al. [18] propose a novel method that uses echo state networks as function approximators in reinforcement learning. They emphasise that echo state networks are promising candidates for partially observable problems where information about the past may improve performance, such as with k-order Markov processes. Since echo state networks are effectively linear function approximators acting on the internal state representation built from the previous observations, Gordon's [19] results about linear function approximators can be transferred to the echo state networks architecture. Building on this, they provide proof of convergence to a bounded region for echo state network training in the case of k-order Markov decision processes.
Shi et al. [20] seek to model the optimal energy management of an office. Time series inputs such as the real-time electricity rate, renewable energy and energy demand are made available to an echo state network q-learning model, which determines the optimal charging/discharging/idle strategies for the battery in the office so that the total cost of electricity from the grid can be reduced.
Chen et al. [21] develop a fault-tolerant adaptive tracking control method fused with an echo state network, driven by reinforcement learning for Euler-Lagrange systems subject to actuation faults. Specifically, the echo state network implements an associative search network, a control gain network and an adaptive critic network, resulting in enhanced learning capabilities and stronger robustness against external uncertainties or disturbances, and thus better control performance.

C. POLICY GRADIENT REINFORCEMENT LEARNING
In this section, we summarise our previous preliminary overview of the policy gradients method [3]. Williams [24] introduced policy gradient methods in a reinforcement learning context. Whereas the majority of reinforcement learning algorithms tend to focus on action value estimation, learning the value of actions and selecting actions based on their estimated values, policy gradient methods learn a parameterised policy that can select actions without the use of a value function. Williams also introduced his reinforce algorithm
$$\Delta w_{ij} = \eta_{ij} (r - b_{ij}) \frac{\partial \ln \pi_i}{\partial w_{ij}}, \qquad (1)$$
where $w_{ij}$ is the model weight going from the $j$th input to the $i$th output and $w_i$ is the weight vector for the $i$th hidden processing unit in a network of such units, whose goal it is to adapt in such a way as to maximise the scalar reward $r$. The learning rate is $\eta_{ij}$ and the weight update of equation 1 is typically applied with gradient ascent. The reinforcement baseline $b_{ij}$ is conditionally independent of the model outputs $y_i$, given the network parameters $w$ and inputs $x_i$, and $\pi_i$ is the probability mass function determining the value of $y_i$ as a function of the parameters of the unit and its input. Baseline subtraction $r - b_{ij}$ plays an important role in reducing the variance of gradient estimators and Sugiyama [25] shows that the optimal baseline is given as
$$b^{*} = \frac{E\left[ r \, \lVert \sum_{t} \nabla_w \ln \pi(a_t | s_t, w) \rVert^2 \right]}{E\left[ \lVert \sum_{t} \nabla_w \ln \pi(a_t | s_t, w) \rVert^2 \right]},$$
where the policy function $\pi(a_t | s_t, w)$ denotes the probability of taking action $a_t$ at time $t$ given state $s_t$, parameterised by $w$. The main result of Williams's paper is that, for any reinforce algorithm, the inner product of $E[\Delta w | w]$ and $\nabla E[r | w]$ is non-negative and, if the learning rate $\eta_{ij} = \eta$ is independent of $i$ and $j$,
$$E[\Delta w | w] = \eta \nabla E[r | w].$$
This result relates $\nabla E[r | w]$, the gradient in weight space of the performance measure $E[r | w]$, to $E[\Delta w | w]$, the average update vector in weight space. Thus for any reinforce algorithm, the average update vector in weight space lies in a direction for which this performance measure is increasing and the quantity $(r - b_{ij}) \, \partial \ln \pi_i / \partial w_{ij}$ represents an unbiased estimate of $\partial E[r | w] / \partial w_{ij}$.
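As an illustration of the reinforce update, the sketch below applies a single step to one Bernoulli-logistic unit, the case in which Williams shows that the characteristic eligibility $\partial \ln \pi_i / \partial w_{ij}$ reduces to $(y_i - p_i) x_j$, with $p_i$ the probability that the unit emits 1. The function name `reinforce_step`, the default learning rate and the toy inputs are assumptions made purely for illustration; they are not taken from the original work.

```python
import math

def reinforce_step(w, x, y, r, b=0.0, eta=0.1):
    """One reinforce update for a single Bernoulli-logistic unit.

    w, x : lists holding the unit's weights and inputs
    y    : the sampled binary output (0 or 1)
    r    : scalar reward; b is the reinforcement baseline
    """
    # Probability that the unit emits 1 under the current weights.
    p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
    # Characteristic eligibility d ln(pi) / d w_j = (y - p) * x_j.
    eligibility = [(y - p) * xj for xj in x]
    # Gradient ascent step scaled by the baseline-adjusted reward.
    return [wj + eta * (r - b) * ej for wj, ej in zip(w, eligibility)]

# Toy example: zero weights give p = 0.5, so a rewarded output y = 1
# pushes the weights in the direction of the inputs.
w_new = reinforce_step(w=[0.0, 0.0], x=[1.0, 2.0], y=1, r=1.0)
```

With a non-zero baseline `b`, the same step is scaled by `r - b` rather than `r`, which is how baseline subtraction enters the update.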

1) Policy Gradient Methods in Financial Trading
Moody et al. [7] propose to train trading systems and portfolios by optimising objective functions that directly measure trading and investment performance. Their model tries to target a position directly, and the model weights are adapted to maximise the performance measure. The performance function that they primarily consider is the differential Sharpe ratio. The annualised Sharpe ratio [26] is
$$sr = 252^{0.5} \times \frac{\mu - r_f}{\sigma},$$
where $\mu$ is the strategy's return, $\sigma$ is the standard deviation of returns, and $r_f$ is the risk-free rate. The differential Sharpe ratio is defined as
$$dsr_t = \frac{b_{t-1} \Delta a_t - 0.5 \, a_{t-1} \Delta b_t}{(b_{t-1} - a_{t-1}^2)^{3/2}},$$
where the quantities $a_t$ and $b_t$ are exponentially weighted estimates of the first and second moments of $r_t$
$$a_t = a_{t-1} + \eta \Delta a_t = a_{t-1} + \eta (r_t - a_{t-1}), \qquad b_t = b_{t-1} + \eta \Delta b_t = b_{t-1} + \eta (r_t^2 - b_{t-1}).$$
They consider a batch gradient ascent update, with the gradient taken over $T$ periods as
$$\frac{d \, sr_T}{d\theta} = \sum_{t=1}^{T} \frac{d \, sr_T}{d r_t} \left\{ \frac{\partial r_t}{\partial f_t} \frac{\partial f_t}{\partial \theta} + \frac{\partial r_t}{\partial f_{t-1}} \frac{\partial f_{t-1}}{\partial \theta} \right\}. \qquad (2)$$
The reward depends on the change in reference price $p_t$, previous position $f_{t-1}$ and transaction costs $\delta_t$, which are applied only if there is a change in position $|f_t - f_{t-1}| > 0$. The position function is typically differentiable and bounded, $-1 \le f_t \le 1$. This position function depends on the model inputs and parameters, $f_t = f(x_t; \theta_t)$. The right half of equation 2 shows the dependency of the model parameters on the past sequence of trades. To correctly compute and optimise these total derivatives requires the use of recurrent algorithms such as backpropagation through time [27,28] or real-time recurrent learning [29]. An undesirable property of the Sharpe ratio is that it penalises a model that produces returns larger than $b_{t-1} / a_{t-1}$, which is counter-intuitive relative to most investors' notions of risk and reward [7]. Gold [30] extends the work of Moody et al. [7] and investigates high-frequency currency trading with neural networks trained via recurrent reinforcement learning. He compares the performance of linear networks with neural networks containing a single hidden layer and examines the impact of shared system hyper-parameters on performance. In general, he concludes that the trading systems may be effective but that the performance varies widely for different currency markets, and simple statistics of the markets cannot explain this variability.
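To make the differential Sharpe ratio concrete, the sketch below implements the form given by Moody et al., $dsr_t = (b_{t-1} \Delta a_t - 0.5 \, a_{t-1} \Delta b_t) / (b_{t-1} - a_{t-1}^2)^{3/2}$ with $\Delta a_t = r_t - a_{t-1}$ and $\Delta b_t = r_t^2 - b_{t-1}$, and checks numerically that it peaks at $r_t = b_{t-1} / a_{t-1}$, so that larger returns are penalised. The function name, the adaptation rate and the toy moment values are illustrative assumptions.

```python
def differential_sharpe(r, a_prev, b_prev, eta=0.01):
    """Return (dsr, a, b) after observing the return r.

    a_prev, b_prev : exponentially weighted first and second moments
    eta            : adaptation rate of the moment estimates (assumed value)
    """
    delta_a = r - a_prev
    delta_b = r * r - b_prev
    variance = b_prev - a_prev ** 2
    dsr = (b_prev * delta_a - 0.5 * a_prev * delta_b) / variance ** 1.5
    # Update the moment estimates for the next step.
    a = a_prev + eta * delta_a
    b = b_prev + eta * delta_b
    return dsr, a, b

# Toy moments: mean 1%, second moment 0.0005, so the peak sits at r = 5%.
a0, b0 = 0.01, 0.0005
peak = b0 / a0
dsr_at_peak, _, _ = differential_sharpe(peak, a0, b0)
dsr_below, _, _ = differential_sharpe(0.04, a0, b0)
dsr_above, _, _ = differential_sharpe(0.06, a0, b0)
```

A return of 6% scores lower than a return of 5% here, which is the counter-intuitive penalty on large gains noted by Moody et al.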

D. REINFORCEMENT LEARNING WITHIN CRYPTOCURRENCY TRADING
This section provides a brief literature review of reinforcement learning applied to cryptocurrency trading. Before performing this review, it is helpful to describe how returns net of transaction costs are generated when trading on an exchange. For example, assume that a single instrument is traded, such as Bitcoin versus the US Dollar. This instrument may be a cash or futures instrument. A gross profit (loss) is generated when Bitcoin is sold higher (lower) than its initial purchase price. Similar logic is applied for short positions, albeit with directions swapped. A net profit is observed by deducting various costs from the gross profit. These costs vary depending on the execution style and are differentiated between price makers and price takers. A price maker inserts quotes into an exchange limit order book and executes when a price taker removes the price maker's liquidity. The price maker captures half the bid/ask spread at the time of execution; the execution price is evaluated against the prevailing transaction mid-price (0.5 × [bid + ask]), where bid < mid < ask. Bid/ask spreads in crypto tend to be as competitively priced as those of traditional financial instruments, although the exchange fees, usually a percentage of the notional traded, tend to be much higher. The price maker's fills are probabilistic, not inevitable. Capturing half the bid/ask spread compensates the price maker for the uncertainty incurred by providing liquidity and for the risk of adverse selection. The price taker obtains the certainty of immediate fill, subject to competing with other price takers for the same quoted liquidity. This certainty of fill comes at a cost, as the price taker must pay half the bid/ask spread and usually much higher exchange fees than price makers when trading crypto.
A final cost that must be considered is funding. If one buys cash crypto without leverage, then it is plausible to consider no funding cost, and one can consider the purchase as self-funded. However, if one wants to sell cash crypto short, one needs to borrow the inventory; this attracts a funding cost. Furthermore, transacting in cryptocurrency futures on an exchange or contracts for difference in the over-the-counter market attracts a funding cost similar to what brokers charge for traditional financial instruments. Even if one trades futures directly on an exchange without a broker, some crypto derivatives contracts attract funding like traditional foreign exchange instruments do. For example, holding a foreign exchange position overnight exposes one to the interest rate differential between the two currencies in the pair. Similarly, perpetual crypto swaps attract an intraday funding profit or loss, which ensures that the swap tracks the underlying cash instrument within tolerance. We now proceed with the literature review.
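The cost components described above can be combined in a short worked example. The sketch below computes the net profit per unit of a long round trip executed as a price taker: the buy fills at the ask, the sell fills at the bid, exchange fees are charged on the notional of both legs, and any funding paid is deducted. The function name, its arguments and the prices are hypothetical values chosen for illustration.

```python
def price_taker_round_trip(entry_mid, exit_mid, entry_spread, exit_spread,
                           fee_rate, funding_paid=0.0):
    """Net profit per unit of a long round trip executed as a price taker."""
    buy_price = entry_mid + 0.5 * entry_spread   # cross the spread: pay the ask
    sell_price = exit_mid - 0.5 * exit_spread    # cross the spread: hit the bid
    fees = fee_rate * (buy_price + sell_price)   # fee on the notional of both legs
    return sell_price - buy_price - fees - funding_paid

# Mid price moves from 100 to 101 with a spread of 0.5 on both legs
# and an exchange fee of 5 basis points.
gross = 101.0 - 100.0
net = price_taker_round_trip(100.0, 101.0, 0.5, 0.5, fee_rate=0.0005)
```

In this toy case a gross profit of 1.0 measured on mid prices shrinks to roughly 0.40 once the spread and fees are paid, which is why backtests that execute at closing or mid prices overstate returns.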
Jiang and Liang [31] supply a convolutional neural network with historical prices of a set of crypto assets as its input, outputting portfolio weights of the set of assets. The network is trained on less than a year of price data from the Poloniex cryptocurrency exchange. The training is done in a reinforcement learning manner, maximising the accumulated return as the reward function of the network. Using 30 minutely sampled closing prices, they conduct backtests which achieve ten-fold returns. In addition, they set exchange trading transaction fees of 25 basis points (25e-4) times the notional value traded of the base cryptocurrency. However, since they use closing prices to execute with a certainty of fill without applying half the observable bid/ask spread cost at the time of execution, the empirically observed backtest results do not represent the actual cost of trading and are thus more sanguine than reality.
Lee et al. [32] present a novel method to predict Bitcoin price movement using inverse reinforcement learning [33] and agent-based modelling. Their approach predicts the price by reproducing synthetic yet realistic behaviours of rational agents in a simulated market. Inverse reinforcement learning provides a systematic way to find the behavioural rules of each agent from Blockchain data by framing the trading behaviour estimation as a problem of recovering motivations from observed behaviour and generating rules consistent with these motivations. Once the rules are recovered, an agent-based model creates hypothetical interactions between the recovered behavioural rules, discovering equilibrium prices as emergent features through matching the supply and demand of Bitcoin. Their model is used to forecast the market's direction, and their results show that their proposed method can predict short-term market prices while outlining overall market trends. However, their experiments do not include the impact of holding risk, transaction or funding costs.
Lucarelli and Borrotti [34] apply deep reinforcement learning to trading Bitcoin. More precisely, double deep q-learning [35] and duelling double deep q-learning [36] networks are trained in batch mode using 80% of the four years of data that they have available. The remaining 20% of the data is used for out of sample testing. Two reward functions are also tested: Sharpe ratio and profit reward functions. They find that the double deep q-learning trading system based on the Sharpe ratio reward function is the most profitable approach for trading Bitcoin. We note that the authors collect their data from Kaggle rather than from an actual crypto exchange and that they use minutely sampled open-high-low-close prices rather than actual order book bids and asks. As such, accurate transaction costs cannot be used in their experiment. Furthermore, no indication is made in their paper that they use any form of funding or exchange trading fees in their returns calculations.
Zhang et al. [37] note that portfolio selection is difficult as the nonstationarity of financial time series and their complex correlations make the learning of feature representation challenging. They propose a cost-sensitive portfolio selection method with deep reinforcement learning. Specifically, a novel two-stream portfolio policy network is devised to extract price time series patterns and asset correlations, while a new cost-sensitive reward function is developed to maximise the accumulated return and constrain costs via reinforcement learning. They empirically evaluate their proposed method on real-world datasets from the Poloniex crypto exchange. Promising results demonstrate the effectiveness and superiority of the proposed method in terms of profitability, cost-sensitivity and representation abilities. Once more, however, transaction costs are not fully accounted for in their experiment. For example, they assume a fixed transaction cost of 25 basis points times the notional value traded of the base cryptocurrency; however, they also use open-high-low-close prices, sampled every 30 minutes. The closing prices they use assume that execution is immediate. In reality, however, immediate execution may only be achieved by crossing the spread; this additional cost must be modelled, which usually turns a theoretically profitable strategy that executes at the closing price or mid-price into a loss-making one.

III. THE RESEARCH EXPERIMENT
We begin with a discussion of the research data we use in our experiment, followed by an elucidation of the research methods and a description of the experiment results. As a high-level summary, our experiment aims to explore transfer learning using a source model (an echo state network) and a target model (a direct, recurrent reinforcement learning agent). The objective of this meta-model is to learn to trade digital asset futures, specifically perpetual contracts on the BitMEX crypto exchange. The dynamical reservoir of the echo state network acts as a powerful nonlinear feature space; this is fed into the downstream recurrent reinforcement learner, which is aware of the various sources of impact on profit or loss and learns to target the desired position.

A. THE BITMEX XBTUSD PERPETUAL SWAP
The data that we experiment with is from the BitMEX cryptocurrency derivatives exchange. In 2016 they launched the XBTUSD perpetual swap, where clients trade Bitcoin against the US Dollar. The perpetual swap is similar to a traditional futures contract, except there is no expiry or settlement. It mimics a margin-based spot market and trades close to the underlying reference index price. A funding mechanism is used to tether the contract to its underlying spot price. In contrast, a futures contract may trade at a significantly different price due to the basis. The basis means different things in different markets. For example, in the oil market, the demand for spot oil can outpace the demand for futures oil, especially if OPEC withholds supply, leading to a higher spot price; this results in a futures curve in a state of backwardation. Crypto futures normally trade in a state of contango, where the futures prices are higher than the spot prices. Backwardation or contango in crypto markets does not represent supply and demand shortages in an economic sense but rather reflects risk appetite for crypto. Similar effects happen in the equity markets. As with the equity markets, crypto market participants can take risks in the futures market more easily. The spot markets typically do not offer leverage, and the trader must have inventory in the exchange to trade. In contrast, futures allow traders to sell assets short with leverage and without borrowing the underlying asset. However, what is required is a margin or deposit to fund the position. Figure 2 shows the basis in relative terms for the XBTUSD perpetual swap during the bear market of 2018. The mid-price of the perpetual swap is compared against the underlying index it tracks, .BXBT. The relative basis is computed as the difference between the swap mid-price and the index price, expressed as a fraction of the index price. Before this bear market, Bitcoin hit a then all-time high of USD 20,000, and the 100-day exponentially weighted moving average of relative basis was very positive in late 2017. For most of 2018 and 2019, the basis was largely negative, reflecting the cash market sell-off from all-time highs to circa USD 3,000.
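A minimal sketch of how a relative basis series and its exponentially weighted moving average might be computed is given below. It assumes that the relative basis is the swap mid-price divided by the index price, minus one, and that the moving average uses the common span convention in which the smoothing weight is 2 / (span + 1); both choices, together with the function names and the toy prices, are assumptions for illustration.

```python
def relative_basis(swap_mid, index_price):
    """Swap mid-price relative to the index it tracks (assumed definition)."""
    return swap_mid / index_price - 1.0

def ewma(values, span):
    """Exponentially weighted moving average with weight 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    smoothed = values[0]          # seed the average with the first observation
    out = []
    for v in values:
        smoothed = alpha * v + (1.0 - alpha) * smoothed
        out.append(smoothed)
    return out

# Toy (swap mid, index) pairs: a 1% premium followed by a 0.5% discount.
quotes = [(10100.0, 10000.0), (9950.0, 10000.0)]
basis = [relative_basis(mid, index) for mid, index in quotes]
smooth = ewma(basis, span=3)
```

A positive value indicates that the swap trades at a premium to the index, and a negative value indicates a discount.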

1) Funding
The funding rate for the perpetual swap comprises two parts: an interest rate differential component and the premium or discount of the basis. We denote this funding rate as $\kappa_t$. The interest differential reflects the borrowing cost of each currency involved in the pair. Writing $interest_t$ for the interest differential and $premium_t$ for the premium or discount of the basis, the funding rate is
$$\kappa_t = premium_t + \max(\min(interest_t - premium_t, \zeta), -\zeta), \qquad (4)$$
where $\zeta$ is a basis cap, typically 5 basis points (0.05%). When the basis is positive, traders with long positions (buy XBT, sell USD) will pay those with short positions (sell XBT, buy USD). Reciprocally, shorts pay longs when the basis is negative.
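The funding mechanics can be sketched as follows, assuming the clamp form that BitMEX documents for its perpetual contracts: the funding rate equals the premium plus the difference between the interest component and the premium, with that difference clamped to plus or minus the cap. The function names, argument names and numerical values are illustrative assumptions.

```python
def funding_rate(premium, interest, cap=0.0005):
    """Premium plus the interest-premium difference, clamped to +/- cap."""
    clamped = max(min(interest - premium, cap), -cap)
    return premium + clamped

def funding_payment(position, notional, rate):
    """Amount paid by the position holder; negative means funding is received.

    position is +1 for long and -1 for short, so longs pay when rate > 0.
    """
    return position * notional * rate

# A 10 basis point premium with a 1 basis point interest component:
# the clamp limits the adjustment to -5 basis points.
rate_rich = funding_rate(premium=0.001, interest=0.0001)
# With no premium, the funding rate collapses to the interest component.
rate_flat = funding_rate(premium=0.0, interest=0.0001)
# A short position on 10,000 USD notional receives funding when the rate is positive.
short_cashflow = funding_payment(position=-1, notional=10_000.0, rate=rate_rich)
```

The sign convention matches the description above: with a positive funding rate, longs pay and shorts receive.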

B. THE RECURRENT REINFORCEMENT LEARNING CRYPTO AGENT
We begin with a description of the dynamic reservoir feature space, the resultant learning of which is transferred to the direct, recurrent reinforcement learner, which targets the desired risk position. It is worth highlighting where our approach deviates from the traditional use of echo state networks within a reinforcement learning context. Figure 1 demonstrates visually that the target labels, the so-called teacher signal, can be fed back into the dynamic reservoir. Equally, one could apply a regression layer to the echo state network and feed the resulting forecasts into the dynamic reservoir. Both approaches then make the echo state network recurrent. We value this recurrent nature in a trading context, as we wish to feed information about the current position back into the model. However, rather than treating this exercise as a value function estimation task as with Szita et al. [18], we feed the augmented, dynamic reservoir features of the echo state network to a direct recurrent reinforcement learner. By differentiating a quadratic utility function with respect to the recurrent reinforcement learner's parameters, with feedback connections from the agent's past positions fed back into the echo state network dynamic reservoir, the agent learns from the various sources of impact on profit and loss and targets the appropriate position that maximises the risk-adjusted reward.

1) The Dynamic Reservoir Feature Space
Denote as $u_t$ a vector of external inputs to the system, which is observed at time $t$. In the context of this experiment, such external input would include order book, transaction and funding information. These features may come from the instrument being traded, exogenous instruments, or both. Initialise the external input weight matrix $W^{input} \in \mathbb{R}^{n_{hidden} \times n_{input}}$, where the weights are drawn at random; a draw from a standard normal would suffice. Here, $n_{hidden}$ denotes the number of hidden processing units in the internal dynamical reservoir and $n_{input}$ is the number of external inputs, including a bias term. Next we initialise the hidden processing units weight matrix, $W^{hidden} \in \mathbb{R}^{n_{hidden} \times n_{hidden}}$. The procedure detailed by Yildiz et al. [38] is
• Initialise a random matrix $W^{hidden}$, all with nonnegative entries.
• Scale $W^{hidden}$ such that its spectral radius $\rho(W^{hidden}) < 1$.
• Change the signs of a desired number of entries of $W^{hidden}$ to get negative connection weights as well.
• Sparsify $W^{hidden}$ with probability $P(\alpha)$, $0 \ll \alpha < 1$, setting those elements to zero.
This procedure is guaranteed to ensure the echo state property for any input. Intuitively, a recurrent neural network has the echo state property concerning an input signal $u_t$, if any initial network state is forgotten or washed out when the network is driven by $u_t$ for long enough [39]. The model supports recurrent connections from either a teacher signal $y_t \in \mathbb{R}^{n_{back}}$ or model output $\hat{y}_t \in \mathbb{R}^{n_{back}}$. These are connected to the model via the weight matrix $W^{back} \in \mathbb{R}^{n_{hidden} \times n_{back}}$, whose weights are initialised at random from a standard normal. Note that $W^{input}$, $W^{hidden}$ and $W^{back}$ have weights that remain fixed.
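The four initialisation steps can be sketched with NumPy as follows. The target spectral radius of 0.9, the 50% sign-flip probability, the uniform draw and the seed are assumptions chosen for illustration; only the order of the steps follows the procedure described above. Because sign flips and sparsification can only shrink the entries in absolute value relative to the scaled nonnegative matrix, the spectral radius of the result cannot exceed the scaled value.

```python
import numpy as np

def init_reservoir(n_hidden, alpha=0.75, radius=0.9, flip_prob=0.5, seed=0):
    """Build a fixed reservoir matrix with spectral radius below one."""
    rng = np.random.default_rng(seed)
    # Step 1: random matrix with nonnegative entries.
    w = rng.uniform(0.0, 1.0, size=(n_hidden, n_hidden))
    # Step 2: rescale so that the spectral radius equals `radius` < 1.
    w *= radius / np.max(np.abs(np.linalg.eigvals(w)))
    # Step 3: flip the sign of a random subset of entries.
    signs = np.where(rng.random((n_hidden, n_hidden)) < flip_prob, -1.0, 1.0)
    w *= signs
    # Step 4: set each entry to zero independently with probability alpha.
    w[rng.random((n_hidden, n_hidden)) < alpha] = 0.0
    return w

w_hidden = init_reservoir(n_hidden=20)
rho = np.max(np.abs(np.linalg.eigvals(w_hidden)))
sparsity = np.mean(w_hidden == 0.0)
```

Printing `rho` and `sparsity` for a given seed is a quick way to confirm that the reservoir remains contractive and that roughly the intended fraction of connections has been removed.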
Finally, initialise the output weight vector $w^{out}_0 \in \mathbb{R}^{n_{input} + n_{hidden} + n_{back}}$. It is at this point that our procedure differs from the original echo state network formulation shown by Jaeger [6]. There, $w^{out}$ is a matrix $W^{out} \in \mathbb{R}^{n_{back} \times (n_{input} + n_{hidden} + n_{back})}$ and the performance measure of the model is the quadratic loss
$$\lVert y_t - W^{out} z_t \rVert_2^2.$$
The reasons will become apparent shortly when we detail the model's direct, recurrent reinforcement learning part. But first, we must describe how we create the augmented state of the system, $z_t \in \mathbb{R}^{n_{input} + n_{hidden} + n_{back}}$. Firstly, we initialise a zero-valued internal state vector $x_0 \in \mathbb{R}^{n_{hidden}}$. Then at time $t$, we compute the recurrent internal state
$$x_t = f^{hidden}(W^{input} u_t + W^{hidden} x_{t-1} + W^{back} \hat{y}_t),$$
where $f^{hidden}(.)$ is typically a squashing function such as the hyperbolic tangent. The augmented, recurrent system state is
$$z_t = [u_t, x_t, \hat{y}_t]. \qquad (5)$$
Equation 11 defines what $\hat{y}_t$ represents, namely the past desired positions of the direct, recurrent reinforcement learning agent.
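A single step of the reservoir can be sketched as below. The sketch assumes the standard echo state network update, in which the hyperbolic tangent is applied to the sum of the input, recurrent and feedback projections, and it concatenates the inputs, the internal state and the feedback vector into the augmented state. The function name, the toy dimensions and the randomly drawn matrices are assumptions; in particular, the small random recurrent matrix used here is a placeholder and is not scaled by its spectral radius.

```python
import numpy as np

def reservoir_step(u, x_prev, y_back, w_input, w_hidden, w_back):
    """Advance the reservoir one step and build the augmented state."""
    # Recurrent internal state: squash the combined projections with tanh.
    x = np.tanh(w_input @ u + w_hidden @ x_prev + w_back @ y_back)
    # Augmented state: external inputs, reservoir state and feedback.
    z = np.concatenate([u, x, y_back])
    return x, z

rng = np.random.default_rng(0)
n_input, n_hidden, n_back = 3, 5, 2
w_input = rng.standard_normal((n_hidden, n_input))
w_hidden = 0.1 * rng.standard_normal((n_hidden, n_hidden))  # toy placeholder
w_back = rng.standard_normal((n_hidden, n_back))

x0 = np.zeros(n_hidden)              # zero-valued initial internal state
u1 = np.array([1.0, 0.2, -0.3])      # first element plays the role of the bias
y1 = np.zeros(n_back)                # no past positions yet
x1, z1 = reservoir_step(u1, x0, y1, w_input, w_hidden, w_back)
```

The length of `z1` equals the sum of the three dimensions, which is also the length of the output weight vector.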

2) Direct Recurrent Reinforcement Learning
The augmented, internal feature state, $z_t$, is now fed into the downstream model, a direct, recurrent neural network, whose performance measure is a quadratic utility function of reward and risk. For the reader's benefit, and because we use the same target transfer learning model, we describe the dynamics of the direct, recurrent reinforcement learner in a manner similar to our earlier work [3]. Sharpe [40] discusses asset allocation as a function of expected utility maximisation, where the utility function may be more complex than that associated with mean-variance analysis. Denote the expected utility for a single asset portfolio as
$$\upsilon_t = \mu_t - \frac{\lambda}{2} \sigma_t^2, \qquad (6)$$
where the expected return $\mu_t$ and variance of returns $\sigma_t^2$ may be estimated in an online fashion with exponential decay
$$\mu_t = \tau \mu_{t-1} + (1 - \tau) r_t, \qquad (7)$$
$$\sigma_t^2 = \tau \sigma_{t-1}^2 + (1 - \tau)(r_t - \mu_t)^2.$$
The risk appetite constant $\lambda > 0$ can be set as a function of an investor's desired risk-adjusted return, as demonstrated by Kahn [41]. Define the annualised information ratio as the risk-adjusted differential reward measure, where the difference is taken against a benchmark or baseline strategy
$$ir = 252^{0.5} \times \frac{\mu_t - \mu_t^{benchmark}}{\sigma_t}.$$
Substituting the non-annualised information ratio into the quadratic utility and differentiating it against the risk, we obtain a suitable value for the risk appetite parameter
$$\upsilon_t = ir \times \sigma_t - \frac{\lambda}{2} \sigma_t^2, \qquad \frac{d\upsilon_t}{d\sigma_t} = ir - \lambda \sigma_t = 0, \qquad \lambda = \frac{ir}{\sigma_t}.$$
The net returns, whose expectation and variance we seek to learn, are decomposed as
$$r_t = \Delta p_t f_{t-1} - \delta_t |f_t - f_{t-1}| - \kappa_t f_t,$$
where $\Delta p_t$ is the change in reference price, typically a mid price
$$p_t = 0.5 \times (bid_t + ask_t),$$
$\delta_t$ represents the execution cost for a price taker
$$\delta_t = \max(0.5 \times [ask_t - bid_t], 0), \qquad (9)$$
$\kappa_t$ is the funding cost (see subsection III-A) and $f_t$ is the desired position learnt by the recurrent reinforcement learner
$$f_t = \tanh([w^{out}_t]^T z_t). \qquad (10)$$
The model is maximally short when $f_t = -1$ and maximally long when $f_t = 1$. The past positions of the model are used as the feedback connections for equation 5
$$\hat{y}_t = [f_{t - n_{back}}, \ldots, f_{t-1}]^T. \qquad (11)$$
The goal of our recurrent reinforcement learner is to maximise the utility in equation 6, by targeting a position in equation 10. To do this, we apply an online optimisation update of the form
$$w^{out}_t = w^{out}_{t-1} + \Delta w^{out}_t,$$
where the weight update $\Delta w^{out}_t$ is computed with an extended Kalman filter for neural networks [42,43], albeit modified for reinforcement learning in this context; sequential updates are applied as per algorithm 1.
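The pieces of the utility calculation can be sketched in a few lines. The sketch assumes that the net return is the price change earned on the previous position, less the execution cost on the change in position and the funding cost on the new position; that the mean and variance are updated with exponential decay $\tau$; and that the utility is the mean less $\lambda / 2$ times the variance. These functional forms, the function names and the toy numbers are assumptions for illustration rather than a statement of the exact implementation.

```python
import math

def net_return(dp, f, f_prev, delta, kappa):
    """Price move on the held position, less execution and funding costs."""
    return dp * f_prev - delta * abs(f - f_prev) - kappa * f

def update_utility(mu, var, r, tau=0.999, lam=1e-5):
    """Exponentially decayed mean and variance, then the quadratic utility."""
    mu = tau * mu + (1.0 - tau) * r
    var = tau * var + (1.0 - tau) * (r - mu) ** 2
    return mu, var, mu - 0.5 * lam * var

# Opening a long position from flat: no price gain is earned yet,
# but the agent pays the execution cost and the funding cost.
f_prev = 0.0
f = math.tanh(0.5)   # bounded position from a toy pre-activation of 0.5
r = net_return(dp=10.0, f=f, f_prev=f_prev, delta=0.25, kappa=0.01)
mu, var, utility = update_utility(mu=0.0, var=0.0, r=r, tau=0.5, lam=0.1)
```

Because the first return after opening a position is negative, an agent that changes position too often accumulates these costs, which is the mechanism that discourages over-trading.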
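Algorithm 1: extended Kalman filter weight update
// $0 \ll \tau \le 1$ is an exponential decay factor.
Initialise: $d = n_{input} + n_{hidden} + n_{back}$, $w^{out}_0 = 0_d$, $P_0 = I_d / \beta$
Input: $\nabla \upsilon_t$
Output: $w^{out}_t$
1 $q = 1 + \nabla \upsilon_t^T P_{t-1} \nabla \upsilon_t / \tau$
2 $k = P_{t-1} \nabla \upsilon_t / (q \tau)$
3 $w^{out}_t = w^{out}_{t-1} + k$
4 $P_t = P_{t-1} / \tau - k k^T q$
5 $P_t = P_t \tau$ // variance stabilisation
Above, $P_t$ is an approximation to $[\nabla^2 \upsilon_t]^{-1}$, the inverse Hessian of the utility function $\upsilon_t$ with respect to the model weights $w^{out}_t$. We decompose the gradient of the utility function with respect to the recurrent reinforcement learner's parameters as follows
$$\nabla \upsilon_t = \frac{d\upsilon_t}{dr_t} \left\{ \frac{\partial r_t}{\partial f_t} \frac{\partial f_t}{\partial w^{out}_t} + \frac{\partial r_t}{\partial f_{t-1}} \frac{\partial f_{t-1}}{\partial w^{out}_{t-1}} \right\}. \qquad (12)$$
The left half of equation 12 comprises the constituent derivatives $d\upsilon_t / dr_t$, $\partial r_t / \partial f_t$ and $\partial f_t / \partial w^{out}_t$, and the model weights are indexed from 0 to $n = n_{input} + n_{hidden} + n_{back} - 1$, using 0 as the starting index.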
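The five numbered steps of algorithm 1 translate directly into NumPy, as sketched below. The initial matrix is taken to be the identity divided by the ridge penalty, which is an assumption about the initialisation, and the gradient supplied in the example is a toy vector rather than one computed from the utility function. The function name is likewise hypothetical.

```python
import numpy as np

def ekf_update(w, p, grad, tau=0.999):
    """One sequential weight update following steps 1 to 5 of algorithm 1."""
    q = 1.0 + grad @ p @ grad / tau       # step 1: scalar normaliser
    k = p @ grad / (q * tau)              # step 2: gain vector
    w = w + k                             # step 3: ascent step on the weights
    p = p / tau - np.outer(k, k) * q      # step 4: update the matrix P
    p = p * tau                           # step 5: variance stabilisation
    return w, p

d = 2
beta = 1.0
w0 = np.zeros(d)
p0 = np.eye(d) / beta                     # assumed initialisation of P
grad = np.array([1.0, 0.0])               # toy gradient of the utility
w1, p1 = ekf_update(w0, p0, grad, tau=0.5)
```

In this toy case only the first weight moves, and the corresponding diagonal entry of the matrix shrinks from 1 to 1/3, reflecting the information gained along the direction of the gradient.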

C. EXPERIMENT DESIGN
We put a recurrent reinforcement learning crypto agent to work by trading the XBTUSD (Bitcoin vs US Dollar) perpetual swap on BitMEX. We transfer the output of the source model, the dynamic reservoir feature space of subsection III-B1, to the target model, the direct recurrent reinforcement learner of subsection III-B2, which learns to target a risk position directly. We use intraday data sampled at five-minute intervals. The choice of this sampling rate is driven by the throttle imposed by the vendor on retrieving historical data; if we could obtain the raw, asynchronously delivered tick data promptly, we would do so. Nevertheless, we still use 365 × 5 × 1440/5 = 525,600 observations in our experiment. Our performance evaluation procedure involves the following:
• Construct input features from the order book and trade information made available by BitMEX for XBTUSD.
• Feed these input features into an echo state network, with n_hidden = 100, n_back = 10 and the proportion of reservoir units W^hidden that are sparsified set to α = 0.75.
• Feed the output of the echo state network (equation 5) into a direct, recurrent reinforcement learner (subsection III-B2).
• Set the risk appetite constant λ = 0.00001 for the quadratic utility of equation 6.
• Set the ridge penalty β = 1 and the exponential decay factor τ = 0.999 for the extended Kalman filter of algorithm 1.
• Backtest the entire history as a test set.
• Learn sequentially online to target the desired position.
• Force the agent to trade as a price taker, who incurs an execution cost equal to equation 9 plus exchange fees, which are set to 5 basis points (0.05%).
• For non-zero risk positions, apply the appropriate funding profit or loss as per equation 4.
• Monitor equation 7, the expected net reward of the strategy. The crypto agent can trade freely if µ_t ≥ 0. Otherwise, close the position and wait for an opportunity to enter the market again.
Table 1 and figure 3 show the results of the experiment.
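The experimental settings listed above can be collected into a small configuration object, together with the rule that flattens the position when the expected net reward turns negative. This is a sketch only: the class, field and function names are hypothetical, while the numerical values are those stated in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    n_hidden: int = 100            # reservoir units
    n_back: int = 10               # feedback connections
    sparsity: float = 0.75         # alpha, proportion of sparsified reservoir units
    risk_appetite: float = 0.00001 # lambda in the quadratic utility
    ridge_penalty: float = 1.0     # beta
    decay: float = 0.999           # tau
    fee_rate: float = 0.0005       # exchange fees of 5 basis points
    years: int = 5
    sample_minutes: int = 5

    def n_observations(self) -> int:
        """Number of samples: days x samples per day (1440 minutes per day)."""
        return 365 * self.years * (1440 // self.sample_minutes)


def gated_position(desired_position: float, mu: float) -> float:
    """Trade freely while the expected net reward mu is non-negative;
    otherwise hold no position."""
    return desired_position if mu >= 0.0 else 0.0


config = ExperimentConfig()
print(config.n_observations())
```

Keeping the gate separate from the learner makes it clear that the forced flat periods are a risk control applied on top of the learnt position, not part of the position function itself.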
The crypto agent achieves a total return of just under 350% over a test set spanning less than five years. The associated annualised information ratio is 1.46. Denoting the maximally short position as −1, no position as 0 and the maximally long position as 1, we see that the agent averages a position of 0.41. Thus there is a bias toward the agent maintaining a long position, which is desirable, as Bitcoin has appreciated against the US Dollar over this period. We see visual evidence in figure 3 that, on occasion, the crypto agent abstains from trading, or rather is forced to take no position; this happens during periods when the predictive performance of the agent decreases relative to execution and funding costs. Our crypto agent also captures a 71% cumulative return from earning funding, which is expected, as the agent learns to target the position that maximises its quadratic utility, and funding is one of the drivers of this utility. The total execution cost and exchange fees that the agent pays out amount to −54%.

IV. DISCUSSION
The echo state network provides a robust and scalable feature space representation. We transfer this learning representation to a recurrent reinforcement learning agent that learns to target a position directly. It is possible to use the echo state network as a reinforcement learning agent itself, as shown by Szita et al. [18]. However, the approach may lead to undesired behaviour in a trading context. Specifically, they use an echo state network to estimate the state-action value function of a temporal difference learning Sarsa model [44,5]. This value function takes the form q_π(s, a) = E_π{r_t | s_t, a_t}, where the expected return r_t depends on the transition to state s_t having taken action a_t under policy π. Sarsa estimates this value function sequentially as
q(s_t, a_t) = (1 − η) q(s_t, a_t) + η [r_{t+1} + γ q(s_{t+1}, a_{t+1})], (13)
where 0 < γ ≤ 1 is a discount factor for multi-step rewards, and η > 0 is a learning rate. Equation 13 shows that the state transition reward is passed back to the starting state. In activities such as maze traversal or board games, being aware of the reward associated with multiple steps or decisions and passing that reward back to the current position is of great value. However, in the context of trading, where the value function q(s_t, a_t) represents the value of a position s_t, with the possibility of switching or remaining in the same position denoted by action a_t, we will on occasion find that a larger utility is assigned to the wrong state. For example, imagine the current state is s_t = 0, that is, the model has no position. Now we observe a large positive price jump leading to a large positive reward r_{t+1} ≫ 0. Value function estimators such as equation 13 would pass the state transition reward r_{t+1} to the initial state q(s_t = 0, a_t = 0). At the next iteration, with probability 1 − ε, s_{t+1} = 0, as q(s_{t+1} = 0, a_{t+1} = 0) is the highest value function.
However, if the position is zero, the model cannot hope to earn a profit. Even if one excludes the zero state, then there is still the possibility of observing this problem for a reversal strategy with possible states s = {−1, 1}. Direct reinforcement, as we describe it, does not incur these problems and we have, through transfer learning, improved upon the earlier work in direct reinforcement [7,30] and extended the work of Borrageiro et al. [3].
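The credit assignment issue described in this section can be reproduced with a few lines of Python implementing the Sarsa update of equation 13. This is an illustrative sketch: the function names, the tabular value function and the reward of 1.0 are placeholders rather than anything used in the source. The state is the current position and the action is the position targeted next.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r_next, s_next, a_next, eta=0.1, gamma=0.9):
    """Sarsa update: blend the old value with the bootstrapped target."""
    target = r_next + gamma * q[(s_next, a_next)]
    q[(s, a)] = (1.0 - eta) * q[(s, a)] + eta * target
    return q[(s, a)]


def greedy_action(q, s, actions=(-1, 0, 1)):
    """Action with the highest estimated value in state s."""
    return max(actions, key=lambda a: q[(s, a)])


# All values start at zero. The agent is flat (s = 0) and stays flat (a = 0)
# while a large positive reward is observed.
q = defaultdict(float)
sarsa_update(q, s=0, a=0, r_next=1.0, s_next=0, a_next=0, eta=0.5, gamma=0.9)

# The reward has been credited to the flat state-action pair, so the greedy
# policy keeps choosing to hold no position.
print(q[(0, 0)], greedy_action(q, 0))
```

In this toy setting the flat state-action pair is the only one with a positive value, which is the undesired outcome the discussion describes.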

V. LIMITATIONS OF THE WORK
As previously discussed in subsection III-B1, the various echo state network parameters are initialised at random. Figure 4 and table 2 measure the impact of this randomness on test set performance. We run a Monte Carlo simulation of 250 trials, where the network parameterisation is fixed to θ = {n_hidden = 100, n_back = 10}. The information ratios in this set of simulations vary from 0.219 to 1.763, with total returns between 65.4% and 502.1%. Whilst this shows reasonable variability of returns across individual trials, table 2 also shows that the 95% interval for the mean information ratio spans 1.1 to 1.2 and the 95% interval for the mean total return spans 289% to 307%. Thus the overall picture remains unchanged: the transfer learning crypto agent has successfully learnt how to trade this instrument during the test period. Other than this acceptable sensitivity to weight initialisation, we make no assumptions whose violation would invalidate our results.
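An interval for the mean of a performance statistic across Monte Carlo trials can be computed as in the sketch below. It assumes the interval reported in table 2 is a normal-approximation confidence interval for the mean, which the source does not state explicitly; the function name and the sample values are placeholders, not results from the experiment.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (mean, lower, upper) for a normal-approximation interval.

    z = 1.96 corresponds to a 95% interval; the standard error uses the
    sample standard deviation divided by sqrt(n).
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Placeholder values standing in for per-trial information ratios.
trial_information_ratios = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, lower, upper = mean_confidence_interval(trial_information_ratios)
```

Because the half-width shrinks with the square root of the number of trials, an interval for the mean can be narrow even when individual trials vary widely, as is the case for the range of information ratios reported above.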

VI. CONCLUSION
We demonstrate an application of online transfer learning as a digital assets trading agent. This agent uses a powerful feature space representation in the form of an echo state network, the output of which is made available to a direct, recurrent reinforcement learning agent. The agent learns to trade the XBTUSD (Bitcoin versus US Dollars) perpetual swap derivatives contract on BitMEX. It learns to trade intraday on data sampled at five-minute intervals, avoids excessive over-trading, captures a funding profit and is also able to predict the direction of the market. Overall, our crypto agent realises a total return of 350%, net of transaction costs, over roughly five years, 71% of which is down to funding profit. The annualised information ratio that it achieves is 1.46.