A Deep Neural Network Algorithm for Linear-Quadratic Portfolio Optimization with MGARCH and Small Transaction Costs

We analyze a fixed-point algorithm for reinforcement learning (RL) of optimal portfolio mean-variance preferences in the setting of multivariate generalized autoregressive conditional-heteroskedasticity (MGARCH) with a small penalty on trading. A numerical solution is obtained using a neural network (NN) architecture within a recursive RL loop. A fixed-point theorem proves that NN approximation error has a big-oh bound that we can reduce by increasing the number of NN parameters. The functional form of the trading penalty has a parameter $\epsilon>0$ that controls the magnitude of transaction costs. When $\epsilon$ is small, we can implement an NN algorithm based on the expansion of the solution in powers of $\epsilon$. This expansion has a base term equal to a myopic solution with an explicit form, and a first-order correction term that we compute in the RL loop. Our expansion-based algorithm is stable, allows for fast computation, and outputs a solution that shows positive testing performance.


Introduction
The problem we consider is one faced by a fund manager who has just taken in a large amount of new capital. This capital needs to be integrated into the portfolio but transaction costs caused by large bid/ask spreads make it extremely inefficient to directly invest the entire amount immediately (i.e., the typical buy-and-hold strategy is sub-optimal). A more efficient way is to invest the new funds according to a solution to a dynamic mean-variance optimization that includes a quadratic penalty on trade size. Optimal execution of large orders was formulated as a mean-variance optimization with penalization on trades in [1], and a multi-asset version of this problem was studied in [2]. However in practice, many assets have heteroskedasticity, and therefore it is interesting to consider mean-variance preferences in the setting of dynamic covariance matrices given by a multi-variate GARCH (MGARCH) model. In addition, when volatility spikes, there is an increase in price impact ( [3]). This negative correlation between price and volatility was also found by [4]. [5] modeled such volatility by fitting skewness in the ARCH model. This work assumes that the penalty on trading depends on the instantaneous value of the covariance matrix (e.g., the condition number of the covariance matrix) to describe such price volatility. The contrary movement between degrees of freedom in covariance matrices and overall market volatility is highlighted in [6] and also touched upon in [7]. This heteroskedastic problem is a linear-quadratic program, but with the added feature of non-constant coefficients that depend on the MGARCH process.
When solving this linear-quadratic program with non-constant coefficients in high dimensions, there are substantial computational challenges for non-ML approaches. There is no explicit solution because of non-constant coefficients; this is in contrast with a linear-quadratic program with constant coefficients for which the solution has a feedback form given by a matrix Riccati equation. For non-constant coefficients and high dimensions, the Riccati equation is not explicitly solvable, but a more tractable solution uses a neural network (NN) to obtain a policy approximation that is refined through reinforcement learning (RL). This estimated policy is the output of an iterated algorithm that is similar to the actor-critic approach used in the deep-deterministic policy gradient (DDPG [8]) and the deep Q Network (DQN [9]) for discrete action spaces.
The NN-based policy approximations have the capacity to manage high dimensions, but the implementation of the method requires some analysis of the problem. In particular, actor-critic RL algorithms are known to be unstable in many situations ( [10,11]). Stability and computation time for RL algorithms are discussed in [12] along with a boosting algorithm to mitigate these issues. In [13], a double-network method is used to stabilize the DQN method; a double network is used in [9] wherein it is referred to as the actor-critic approach; the normalized advantage function in [14] uses dueling networks.
In this paper, we resolve these stability and computation-time issues by constructing a solution that exploits the smallness of the transaction costs and uses the explicit form of the solution to a myopic problem. We call a solution myopic when its decision does not forecast the future and is purely based on historical observations, that is, the optimization has a discount parameter δ ∈ [0, 1) and for the case δ = 0 the solution is myopic. The optimization also has transaction costs with a small parameter > 0, and as tends toward zero the solution becomes more and more like the myopic solution.
In this paper, we will show that the explicit myopic formula is the base term in an expansion of the optimal solution in powers of . The principal effects of non-myopic RL are captured by the order-term in this expansion, which means that a good approximation is the base term plus the order-term.

Background and Review
Other machine learning applications in finance include [15] where stochastic gradient descent (SGD) with deep NN architecture is used for computing prices of American options on large baskets of stocks, and in [16] where an RL approach is used to numerically solve high-dimensional backward stochastic differential equations related to finance. In [17], the authors utilized an LSTM network for predicting the price movement with daily S&P500 data 1 . The performance of the LSTM is mixed during different periods. The short-term trend prediction in the price movement on NASDAQ by the deep network was studied by [18]. The authors utilized features from both fundamental and technical analysis as the network input. [19] used a graph network to predict the stock price movement on S&P500 data. In [20], they demonstrate how adversarial learning methods can be used to automate trading in stocks. General methods from control theory have been applied for optimal trading decisions in [21,22]. The effects of transaction costs and liquidity are well-studied ( [1,23,24]). In particular, the "aim portfolio" description given in [2] has been a key result for the management of large funds. The discussed works are based on supervised learning. Therefore, the non-supervised learning approaches should also be studied as they may address more complicated problems.
It was originally thought that combining NN with RL recursions would lead to instability ( [11]) due to the approximation errors caused by using the neural networks. However, hypothetically, the accuracy of NN approximations can be improved to within arbitrarily close bounds (see [25,26]). Additionally, the actor-critic approach has been shown in settings to have a stabilizing effect ( [27,28]). The actor-critic approach utilizes two independent deep networks to represent a policy function (the actor) that provides a group of possible actions for a given state and an evaluation function (the critic) that evaluates the action taken by the actor based on the current policy function. By alternatively updating the actor and critic with the given objective function, the two networks converge. The DDPG algorithm ( [9]) takes the actor-critic approach for solving RL problems with continuous control space and is based on the DPG algorithm in [29]. The DQN method in [8] is an RL algorithm that utilizes and trains a deep network to represent the Q-value function. Then at each instant, the DQN takes the action based on the Q-value network. The DQN does not use an actor-critic approach. and can only address discrete action problems, whereas the DDPG, which extends from DQN, can also address continuous action problems. But the DDPG is more difficult to train and sometimes unstable as it uses an actor-critic approach. Reinforcement learning was utilized in many topics, such as wireless ( [30]) and mobile edge computing networks ( [31,32]).
In portfolio management, [33] utilized the DDPG to optimize the cryptocurrency portfolio. [20] utilized both DDPG and proximal policy optimization (PPO) for portfolio management. [34] considers portfolio optimization for a transaction cost. However, the authors only considered a linear cost. In this work, we considered a quadratic transaction cost that is more practical. Similar to the problem we consider in this paper is the linear-quadratic regulator with uncertainty in the (constant) coefficients. There are provable bounds for identifying the minimum run-time required from an onpolicy approximation, after which the observed data is used to refine the policy to within a given tolerance of the optimal ( [35,36]). This approach to uncertainty in on-policy learning is analyzed as an actor-critic approach in [11].

Results in this Paper
The analyses in this paper show the effectiveness of RL and NN-based policy approximations for solving linear-quadratic programs with non-constant coefficients and small penalization on control. We consider an iterative scheme for the optimal controls, for which we can prove convergence to a fixed point under some reasonable conditions. We extend this fixed point argument to show that NN approximation errors will compound over time, but that their total will be of order big-oh in the magnitude of an error that is reduced as the number of NN parameters is increased. The problem is of practical interest because the MGARCH covariance process does not allow for an explicit solution, such as those seen in other linear-quadratic optimizations ( [37,38]).
For a faster implementation, we propose an algorithm that exploits the smallness of the transaction cost parameter. In particular, we write the solution as a series expansion in powers of the transaction costs parameter, which is small. Using only a few terms from this expansion we form a sub-optimal solution that tends toward the optimum as the parameter decreases toward zero. This approach is similar to a standard NN-based policy gradient but is more stable (i.e., the difference between our RL output and the ground-truth optimal output is bounded) because the series terms are fast and easy to compute, and because the NN approximation error is present only in the higher-order terms and therefore is an order of magnitude smaller than it would've been without the expansion. In our experiments, we take the first two terms in this series: the based term that is equal to an explicitly computable myopic solution and a first-order correction term that we compute using an RL loop and NN functional approximation. We implement this approximation using a single network that contains only fully-connected layers without requiring dueling networks or special machinery, and in our studies on market data, we can see that the correction term provides some improvement compared to using only the myopic control, which is defined in (22). The key difference between our RL strategy and the myopic strategy is that the myopic strategy makes decisions without forecasting the future, whereas the RL strategy forecasts the future to make decisions. Overall, the contribution of our paper includes the following: • We show the effectiveness of RL and NN-based policy approximations for solving linear-quadratic programs with non-constant coefficients and small penalization on control.
• We prove the convergence of an iterative scheme to a fixed point for the optimal controls under some reasonable conditions.
• We show that NN approximation errors will be of order big-oh bound.
• We propose an algorithm based on NN approximation that exploits the smallness of the transaction cost parameter for faster implementation.
• We evaluate our algorithm on both synthetic and historical market data, which shows positive testing results.
The remaining part of this paper is organized as follows: First, the problem is mathematically formulated. A solution with a two-step iteration scheme is proposed, and its convergence is analyzed. A practical method for implementing the solution using neural networks is proposed. A small-analysis regarding the neural network solution is shown. Lastly, the method is evaluated on synthetic market data and historical market data, and the results are discussed.

Model and Optimization Problem
Let Rt ∈ R n denote the vector of n-many assets' returns realized at time t, and let Σt−1 be the covariance of Rt given the information immediately prior at time t − 1. An MGARCH model ( [39,40]) is the following: (1) where µ is the conditional mean vector, Zt+1 ∼ iid(0, Σt), and where A, B and C are n × n matrices. The following condition ensures that Σt is always invertible: Condition 1. The matrix C given in (2) is full rank so that CC is invertible, that is, there is a positive lower bound Asset prices are calculated by compounding the returns (1). For 1 ≤ i ≤ n, the i th asset's price is where h is some known function. A typical choice of h is h(s, r) = s(1+r) as used in [39,40], but for technical reasons, we will need to impose the following condition: is finitely bounded away from zero. That is, h ∞ = sup s,r |h(s, r)| < ∞ and there exists constant s > 0 such that infs,r h(s, r) ≥ s.
Denote the R n vector of these prices as Next, define the covariance matrix of the dollar returns, Let Xt ∈ R n denote a manager's holdings in assets (in contract units). The returns (in dollar units) on this portfolio are i (S i t+1 − S i t )X i t , the expected value of these returns is µ StXt, and their variance is X t PtXt. The portfolio manager has a control {at, t = 1, 2, 3, . . . } that she selects at time t to change Xt. The manager's control should be optimal with respect to her mean-variance preferences, where 0 ≤ δ < 1 is a discount factor, > 0 is a (small) parameter, and the function q : R n × R n×n → R + is a transaction cost (or liquidity penalty) that is higher at times when it is harder to trade and lower at times when there is plenty of liquidity. This form of transaction cost penalty was introduced in [1] and the relationship with volatility was shown in [3].

Condition 3.
We assume q is bounded away from zero, where χ is a known constant.
As mentioned earlier, the penalty on trading should depend on the instantaneous value of the covariance matrix. Many works use principal component analysis (PCA) to study the relationship between degrees of freedom in the stock returns' covariance matrix and overall market volatility. For example, [6] observes that relatively few eigenvectors are needed to capture the majority of market variance during times of high market stress, thus resulting in wider bid/ask spreads and higher transaction costs; the relationship is reversed during times of low market stress. Based on this dynamic, we use the condition number to excite the transaction cost function when the market has reduced degrees of freedom.
In the examples, we take q(s, p) = cond(diag −1 (s)pdiag −1 (s)), which is based on the empirical observation that losses in the S&P500 market index occur when there is a large spike in the condition number of the covariance matrix, as shown in Fig. 1. Additionally, we take γ = 1 is the value of the new capital, i.e., we set risk aversion so that the goal is for W0 to be invested in the market. The problem formulated in (4) is similar to the mean-variance preferences problem in [2] but with the added nonconstant cost from q(σ), where σ 2 = Σ and Σ is the estimated covariance matrix. A similar type of financial control problem was considered in [41]. An effective way to analyze this system is to take a Hamiltonian approach and write it using a vector of Lagrange multipliers, where we have used the transversality condition ( [42]) First-order conditions in at and in Xt yield a forward-backward system where Et denotes expectation conditional on the information observed up to time t. However, in the real world, (6) and (7) cannot be directly solved. Therefore, in this paper, we propose an iteration scheme for (6) and (7): where Pt = γ q( S t ,P t ) Pt and X (k) 0 = x0 for all rounds of iteration k. Then, we implement RL using neural networks (NNs) to estimate the limiting fixed point from (8). The convergence of iterations in (8) depends on if the following condition holds: for all t.
If Condition 4 holds, then we can prove that (8) converges to a unique fixed point. Condition 4 would always hold if Pt and Ψt commute, but this is an unrealistic condition. For the data used in this paper, we check empirically that in fact Condition 4 holds for < 1. For theoretical purposes, because we are considering the small-parameterization, the following proposition is useful for confirming Condition 4: from which it follows that Condition 4 will hold for small enough.

Two-Step Iteration Scheme
The scheme in (8) is the basis for an iterative algorithm with two steps. The policy is given by λ (k) and is used to generate X (k+1) . Then, upon observation of X (k+1) , the updated Lagrange multiplier λ (k+1) is obtained via a fixedpoint iteration, for fixed k and for k → ∞. Algorithmically, this can be described as a 2-step iterative procedure for finding a fixed point: is given by (10).
If we consider a finite-time version of (6) and (7) with terminal condition λ for all t > T ), and given Condition 4, then we have a contraction mapping with λ (k ) t converging to a unique fixed point as

Convergence to a Fixed-Point
The two-step iteration described in Section 2.1 looks for a fixed point of λ (k ) for a given X (k+1) . In this section we present a theorem stating that the pair (X (k) , λ (k) ) given by (8) converges to a fixed point, thus confirming that the forward-backward system of (6) and (7) has a unique solution. We prove these results for the finite-time version of (6) and (7) with terminal condition λ  T +1 ≡ 0 for all k. If we assume Condition 1, Condition 2, Condition 3 and Condition 4, then the iterations of (8) will converge to a unique fixed point.

Proof. (see Appendix).
Remark 1. Accuracy of the approximation of the infinite-time problem by a finite-time problem can be proven with optimality bounds and a squeeze lemma as T → ∞.
The main idea in the proof of Theorem 1 is to show a contraction in E λ , thus confirming the existence of a unique fixed point. The initial step for setting up the proof is to write the following forward equation for the iteration difference, which is derived from (8) by differencing between iteration k + 1 and k. The following quantity is important for convergence, The quantity in (12) is important because from (11) we obtain the following inequality, An equivalent form of this inequality appears in Theorem 1 and is used to thus proving the existence of a unique fixed point of (8) for finite T .

Neural Network Policy Approximation
The machine learning approach to solve (8) is to implement RL using neural networks (NNs). It amounts to estimating the limiting fixed point λ * from (8) as λ * t ≈ λ(t, Xt−1, St, Pt; θ) where λ(·, ·, ·, ·; θ) is a policy approximation function with a feed-forward NN, and θ denotes the NN parameters to be estimated. The NN takes into account time t so that the solution can be adapted to the time remaining until terminal time T . An example of an NN that can accommodate time dependence is the Deep BSDE architecture in [16].
Given an initial estimate θ (0) , we proceed to iteratively look for θ that is close to a fixed point. The following iterative estimation scheme is the basis for the algorithm we'll implement: where the indicator 1t<T is used to enforce the finite-time problem's terminal condition of λT +1 = 0. The following theorem proves the NN scheme in (13) will in fact be a decent approximation for the fixed point λt.
Assume also that For each k let ε (k) denote the error from the NN approximation, Then, the error of the iteration scheme in (13) is for k large, where λ * is the fixed point of (8).
The bound in (15) is similar to the bounds derived in [43], wherein the approximation error was computed to be the sum of three terms: a sampling error term, an NN parameter error, and a value function estimation error. An additional similarity is an exponential growth in the bounding constants as T increases. In theory, the exponential growth in (15) is contained by continually increasing the hyperparameters so that sup ε ( ) + 2η tends to zero, but in practice increasing hyperparameters requires a growing number of training samples, leading to an infeasibly long computation time. However, the smallness of can be exploited for a faster algorithm, which does not eliminate exponential growth in T , but we are at least able to reduce bounding constants with lowered computational cost.

Small-Asymptotic Analysis & Implementation
Let's consider the parameterization with being small enough that effects of order 2 can be dropped or grouped in with round-off error. In this setting, we can construct a solution in a power series form and then truncate terms order 2 and higher. In other words, our expansion has a base term equal to the explicitly computable myopic solution (obtained for the case δ = 0), and an order-correction term. Computation of the order-correction is more involved, but computational costs and runtime are minimal.

Expansion of λ t
We write the formal expression for λt, λt = λt , which we insert into (6) and (7) to obtain the following equations, Rearranging (17) we have the following stabilized equation, for which it is straightforward to check that the convergence proof of Theorem 1 still applies. We write the following formal expansion, t + . . . , which we insert into (18) to obtain the following recursive expressions for the expansion's terms, for i = 1, 2, 3, . . . . This expansion is in powers of and can be truncated for a good approximation of the solution to (16).

Remark 2.
When we approximate λt with lower-order terms λ [0] t and λ [1] t , we can simplify the expressions in (19) so that they do not depend on (20) is different from (19) because it has reduced the base term to the naive policy that goes straight to the aim portfolio, thus leaving it to the correction terms to compensate for transaction costs.
Inserting the λt expansion into (16), we suspect the following lower-order expansion for the state process has a Big-Oh error as follows, These order O( 2 ) errors are in fact true, based on the following proposition: Proposition 2. Assume Condition 1, Condition 2 and Condition 3. The order-approximation λ t has an error, which is where λ * denotes the solution to (16) and (17).

Neural Network Algorithm
Let function ϕ : R d × R d × R d×d → R n be from a class of NN functionals with sigmoidal activation. The term λ [1] t is approximated as δ I + Note that now we are considering a NN architecture that remains constant through time, which is a considerable simplification from the NN architecture used in the proof of Theorem 2. The tradeoff in making this simplification is faster computation time. Algorithm 1 gives the implementation of the scheme in (13) using the lower-order expansion in (21) with this NN approximation of δEt λ t+1 . Fig. 2 shows the flow of Alg. 1 at each moment t. We empirically observed that after training, our NN policy can work in real-time (specifically, each NN inference takes around 1.5 ms on a commodity laptop).
In our analysis of Algorithm 1's portfolio, we will compare to the purely myopic strategy, The myopic strategy shares the same objective as the RL strategy but for δ = 0. Note that because our objective function contains a quadratic transaction cost term, the typical buy-and-hold portfolios will create a large transaction cost at the beginning, leading to a negative total wealth return. Therefore, we do not compare our method with the typical buy-andhold portfolios.

Sector ETFs: an 11-Dimensional Example
We verify the small-asymptotic analysis by allocating 11

Estimating Parameters
Given a sequence of historical price {S i t } for the i th ETF, the return rate R i t at each time t can be found by: andR the mean of Rt over t, the initial covariance matrix Σ0 is calculated as: where L is the length of the sequential historical data. A, B, and C in (2) were estimated using Broyden Fletcher Goldfarb Shanno (BFGS) algorithm ( [45]). Specifically, we first burned in the initial covariance matrix Σ0 into (2). We then used t≥0 ||Σt+1 − Σt|| 2 with matrix norm as the loss function and BFGS as the minimizer to find A, B, C, and Σ to minimize the loss function. The following equilibrium state is achieved when the loss function is the smallest: For the estimation of µ, we used the eigen-portfolio approach in [46]. The parameters (i.e., A, B, C, Σ0, and µ) were estimated for each fold. Fig. 3 shows an example of the historical ETF prices and the simulated ETF prices using the estimated parameters.

Architecture of Neural Network
The utilized neural network (NN) is composed of fully connected layers. The input contains the portfolio Xt−1 that has 11 elements, plus the covariance matrix of the dollar-returns Pt = ΨtΣtΨt whose dimension is 11 × 11 and also the expected value of returns µ Ψt which is also 11-dimensional. Therefore, the total dimension is 143. The hidden layer size was determined by considering both the training time and the NN performance. When using more complex NN architecture, we observed that there was no obvious improvement in performance while the training time increased considerably. On the other hand, when using even simpler NN architectures, we observed that the deep RL algorithm suffered from under-fitting problems. Therefore, we utilized four hidden layers, each of which contains 400 neurons. The output of the NN corresponds to ϕ(·, ·, ·; θ) in Algorithm 1, which is 11-dimensional. The activation function is Tanh. The architecture of the utilized neural network (NN) is shown in Table 5 in the appendix shows the detail of the architecture. The programming language is Python 3. Tensorflow and Keras were utilized as the deep-learning library. Keras is built on top of Tensorflow.

Simulation on Synthetic Data
At moment t + 1, a noise vector Zt+1 was generated from a Gaussian distribution N (0, Σt), where Σt is the covariance matrix at the previous moment, the return rate Rt+1 was determined by (1), and the covariance matrix Σt+1 was found by (2). For each fold, we generated one sequence of synthetic data to train the NN and the other 200 sequences to test the performance of the NN. We show how the total wealth grows with time for RL and myopic strategies. We also show how the fund manager should invest the money into the stock market with time using RL and myopic strategies, respectively. Specifically, wealth Wt at time t was calculated by with W0 = 100. We ran the trained NN and the myopic strategy (as given by (22)) on the test data for each fold and calculated the average. The training epoch was 100 for each fold. The optimizer was Adam. The loss function was mean-squared-error loss (MSE). In this case, we took = 0.003, for which we were able empirically to verify Condition 4. The results are shown in Fig. 4. Table 2 shows the value added of the two methods in each fold on the synthetic data. The RL strategy of Algorithm 1 consistently outperforms the myopic strategy given by (22) by showing an annual increase of 1.8% in total value. The results are scalable to any W0. Therefore, our RL approach will show significant outperformance in the absolute value of total assets added when the given capital (i.e., W0) is large. Note in the table that we are emphasizing the $-value of returns, as this is a better measure of fund performance as per the reasoning of [47].
The amount of money invested in the stock market is calculated by where I0 = 0 because all money is in cash at the beginning. Note that since the ETF prices vary with time, it is likely that at the end of the day, the total money invested in the stock market may exceed W0. Fig. 5 shows the evolution of It for both RL and myopic strategies. The two strategies show different investment speeds. The RL is investing faster than the myopic strategy. Fig. 6 shows the evolution of the RL and myopic portfolio allocations with time. In the beginning, the portfolio allocations are at 0 for both RL and myopic. As time passes, the portfolios increase, but the slope becomes smaller. Eventually, the portfolios will be attracted to a stationary state. It is worth mentioning that the investment portfolio charts are not for comparison purposes. Instead, they are used to illustrate how our RL approach and the myopic approach are applied and to ensure that the two approaches have reasonable behaviors.

Simulation on Historical Data
For historical out-of-sample data, Zt was found by   where Rt is the return rate of the real historical data at each time. For each fold, the NN was trained with the data of that fold and tested out-of-sample on the historical data of the following six months. The training epoch was 100. The optimizer was Adam. And the loss function was MSE. In this case, we took = 0.01, and again we were able empirically to verify Condition 4. The result is shown in Fig. 7. For most situations, the RL outperforms the myopic strategy, but not as significantly as it did on synthetic data. This is because the real historical market data contains considerable uncertainty that might cause the diminished performance of the RL strategy in testing. Table 3 shows annualized total asset value added by using the two approaches on historical market data. Among the cases, the RL strategy shows more added value than the myopic approach. The RL method is able to consistently outperform the myopic by about 11bps. This over-performance is not tremendous but certainly makes evident that RL can improve out-of-sample performance for this problem. The results are scalable: if the fund manager has 100 million dollars, using the RL strategy can help her gain an extra $120,000 compared to the gain realized from the myopic strategy. Fig. 8 shows how the money is invested with time. Fig. 9 shows how the portfolios change. It bears repeating that the practical purpose of this paper's optimization is to optimally move a large amount of new capital into the market, i.e., bring this new capital into the fund. By showing the cumulative amount of money invested we can see how much of the investment goal has been accomplished. Figures 5 and 8 illustrate how the two strategies behave as they are moving the new capital into stocks. Overall, the portfolios show similar behavior to what we observed in the synthetic data case, namely, that the RL strategy seeks to move the new capital into the ETFs faster than the myopic.
Note that , in this case, is larger than in Sec. 4.2, this is because we want to highlight the difference between RL and myopic strategies. If = 0.003, the difference between the two strategies is less easy to highlight. The reason Figure 5: Money invested in the stock market at a different time for the RL strategy (solid) and the myopic strategy (dot). The X-axis shows the day. Y-axis shows the money invested in the stock market.
for the weaker significance when testing out-of-sample is because of the following. First, Fig. 4 shows persistent overperformance by RL because it is an average of 200 trajectories. It is possible that if we had abundantly more out-of-sample data, we could see averages playing out, in which case we might see a stronger out-of-sample performance by RL. Second, there may be an out-of-sample model risk, i.e., that the model changes or is misspecified in the out-of-sample test. In this case, the training is aimed at learning an optimum that becomes sub-optimal when applied in the out-of-sample test. Figure 6: Portfolios of the RL strategy (solid) and the myopic strategy (dot) at different times. The X-axis shows the day. Y-axis shows the portfolio.
Ultimately, we cannot say much more than this, but the different folds of data that we have shown do give us some sense for out-of-sample variation for both RL and the myopic portfolios.

The Objective Value V on Historical Data
The main goal of this paper is to maximize the objective value function (4), we, therefore, ran the RL and myopic algorithms on the 10 folds historical market data and calculated the relevant terms in (4). Specifically, we show the average results of V , total transaction cost T t=0 δ t q( S t ,P t ) 2 a t Ψtat, and risk penalty T t=0 δ t γ 2 X t PtXt with different . We observed that when is relatively small (e.g., 0.00001), our RL algorithm considers an aggressive investment to maximize (4) (i.e., increasing the inventory Xt quicker and thus taking a relatively higher total transaction cost and risk penalty). When is relatively large (e.g., 0.01), our RL algorithm adopts a conservative investment to maximize (4) (i.e., increasing Xt slower and thus taking a relatively lower total transaction cost and risk penalty). Fig. 10 shows how the inventory of IYR ETF changes for different by using the RL and myopic strategies. The inventories of other ETFs show similar behavior. From the figure, when = 0.01, our RL tends to slowly increase Xt. When = 0.00001, our RL tends to increase Xt quickly. The explanation for such behavior is that when is small (meaning that the transaction cost can be negligible), f (at, Xt, St, Pt) in (4) can be approximated as: which has an equilibrium point at Xt = 1 γ P −1 t µ Ψt for every t. Therefore, to maximize (4) with the approximation (24), our RL algorithm needs to increase Xt to the equilibrium point within a short time and thus performs an aggressive investment. If is large (meaning that the transaction cost cannot be negligible), to avoid high transaction costs, our RL can only slowly increase Xt. We show the average results of final objective-function value V , total transaction costs, and total risk penalty over the 10 folds historical market data in Table 4 for different . The total transaction cost and risk penalty for = 0.00001 are higher than the total transaction cost and risk penalty for = 0.01, which confirms our explanation.
Our RL algorithm outperforms the myopic strategy for both aggressive and conservative investment cases by showing a higher V according to Table 4. For the aggressive investment case ( = 0.00001), our RL algorithm behaves less aggressively than the myopic strategy by having a lower total transaction cost and risk penalty. On the other hand, for the conservative investment case ( = 0.01), our RL algorithm behaves more aggressively than the myopic strategy by taking a higher total transaction cost and risk penalty. However, for both cases, our RL algorithm shows a higher value for the objective V than the myopic strategy, as shown in Table 4. Fig. 11 shows the average performance of our RL and the myopic strategies with time on the 10-fold historical market datasets. The average wealth returns using RL and myopic strategies are close. However, our RL approach attains a higher (i.e., better) value for the objective than the myopic strategy, meaning that our RL achieves a better trade-off between investment return, transaction cost, and risk penalty. Therefore, our RL outperforms the myopic strategy.

Discussion of Our RL Approach
From the experimental results, we have observed that our RL approach has a higher objective value V given in (4) (i.e., the main goal of this paper) than the myopic strategy on the historical data. In other words, our RL approach balances the wealth return, risk penalty, and transaction cost better than the myopic by showing a 3.85% better for = 0.01 and 11.64% better for = 0.00001. On synthetic data, the RL method returns an extra 1-2% additional annual percentage return over the myopic (out of a total of roughly 10% annual return). This demonstrates that our method outperforms the myopic, as this extra 1-2% annually is significant when measuring the growth of assets. For real data, we are able to show consistent over-performance of the RL method by about 11bps. This over-performance certainly makes it evident RL's improved out-of-sample performance. Our RL approach also shows a faster investment speed with working in real time. Therefore, our approach is a valid effective method.

Conclusion
We have implemented a reinforcement learning algorithm to solve the mean-variance preferences of (4) with an MGARCH model and transaction costs of order , where > 0 is a small parameter. Our method addresses issues of algorithm convergence and computation time by using an expansion to guide the network to the solution. While similar methods use double (dueling) networks to stabilize convergence, we only need to use one network with our expansion. Our method is stable and can work in real time. The resulting portfolios show good performance in simulated tests.

A Proofs
Proof of Proposition 1: We apply the Sherman-Woodbury-Morrison formula, The first term in (25) is bounded as follows, where we have bounded P −1 t in the following way, where c > 0 is the lower bound defined in Condition 1. The second term in (25) is bounded as follows, Now, taking the norm of (25), applying a triangle inequality to the right-hand side, and inserting these bounds, we have If χ h ∞ < γs 2 c, then we can re-arrange to obtain the bound in (9).
Proof of Theorem 1: The following lemma is useful to have when proving convergence in Theorem 1. Proof. By positive definiteness of each matrix, we have .

Thus, we have
which proves the second statement.
From inspection of (11), the usefulness of the following lemma should be evident: Lemma 2. If we assume Condition 1, Condition 2 and Condition 3, then in (11) there are the following bounds in coefficients, Proof. (26) is a clear consequence of Condition 2 and Condition 3.

2.
To prove the second bound in (26), we start as follows, where Pt is invertible because we have assumed Condition 1 putting a lower bound on the eigenvalues. However, we don't have a lower bound on P −1 t because the eigenvalues of Pt do not have a fixed upper bound, but from Condition 2 we have Ψt ≤ h ∞, and thus we use Lemma 1 to get γ q( St, Pt) γ q( St, Pt)P −1 which proves the second bound in (26).
for k < T where eT ∈ R T is the T th canonical basis vector, and this bound gives us the general bound We apply the bound in (30) to the right-hand side in (29), and we find that E (k+1) 1:T = 0 for all k ≥ T thus proving convergence to a unique fixed point, hence proving the statement of Theorem 1.
Proof of Theorem 2: Given the compact set K from (14), by the universal approximation theorem ( [25]), for each iteration in (13) there is a NN for which parameters can be chosen such that where ε (k) is the arbitrarily small error that can be decreased by increasing the NN's hyperparameters. In the same manner as the process given by (8), we define the iterated process to be λ (k) t = λ(t, X (k) t−1 , St, Pt, θ (k) ) .

Denote the error as
where λ * t is the limiting fixed-point of (8). Using the pair of iterative equations in (8), we can include the NN error to obtain the following recursive bound, which is the statement of Theorem 2. Proof of Proposition 2: Denote the state process from the order-approximation as for t = 1, 2, 3, . . . , T . Inserting the expansion into (17) results in the order-approximation error where the equality uses the expressions of for λ [0] t and λ [1] t given in (19). Using Lemma 2, we have Therefore the differential error given above is of order O ∆( )E sup t≤T λ [1] t . Thus, similar to Theorem 2, we have a big-oh bound on error, where λ * denotes the solution from (16) and (17).