Market Making Strategy Optimization via Deep Reinforcement Learning

Optimizing market making strategies is a vital issue for participants in security markets. Traditional strategies are mostly designed manually, and orders are mechanically issued according to rules based on predefined market conditions. On one hand, market conditions cannot be well represented by arbitrarily defined indicators; on the other hand, rule-based strategies cannot fully capture the relations between market conditions and strategies' actions. Therefore, it is worthwhile to investigate how to incorporate deep reinforcement learning models to address those issues. In this paper, we propose an end-to-end deep reinforcement learning market making model, i.e., Deep Reinforcement Learning Market Making. It exploits a long short-term memory network to extract temporal patterns of the market directly from limit order books, and it learns state-action relations via a reinforcement learning approach. In order to control inventory risk and information asymmetry, a deep Q-network is introduced to adaptively select different action subsets and train the market making agent according to the inventory states. Experiments are conducted on a six-month Level-2 data set, covering 10 stocks, from the Shanghai Stock Exchange in China. Our model is compared with a conventional market making baseline and a state-of-the-art market making model. Experimental results show that our approach outperforms the benchmarks over the 10 stocks by at least 10.63%.


I. INTRODUCTION
Market making (MM) strategy is a kind of buy-side high-frequency trading strategy in the stock market. It is usually used to enliven the market, stabilize market orders, improve the liquidity of a stock, and promote the development of the market. The profit of an MM agent is obtained by capturing the spread of the market, which basically reflects the volatility of the market and the differences between the best bid and ask prices. The agent frequently quotes its own bid and ask prices simultaneously, makes profit by waiting for both legs to be hit by other orders, and in some situations loses money due to price trending. Therefore, how to design an MM strategy and make it profitable becomes an important question.
Traditional MM strategies are designed by human experts. Trading rules are mechanically made based on experience, and the MM strategy takes its trading actions according to those rules. Issues with this approach are that 1) the traditional MM strategy cannot well describe or represent the market and strategy states, and 2) the manually designed rules cannot well capture the relations between states and proper trading actions. In recent years, deep reinforcement learning has been widely exploited in many research areas as well as industries. For example, AlphaStar, trained by a deep reinforcement learning algorithm in [1], exceeds 99.8% of human players. Deep Q-network (DQN), which combines reinforcement learning with deep neural networks [2], uses a deep neural network to extract data features, and realizes end-to-end optimization of complex decision problems through reinforcement learning. However, most existing deep reinforcement learning work relevant to the stock market focuses on long- or mid-term stock trading, e.g., Deng et al. [3] used a direct deep reinforcement learning method to represent real-time financial signals and trained an agent for financial asset trading, and little work has been done on buy-side market making strategies. Challenges in this task are threefold: 1) State representation. There are two categories of states that trading models need to capture, i.e., the internal state, including the strategy's inventory etc., and the market state, including whether the market is trending or stable. 2) Decision frequency. MM agents need to deal with high-frequency market data, including several levels of bid and ask prices and the related waiting orders at each level, and they need to make real-time decisions on how to place their own orders in the limit order book (LOB) and cancel orders when the market moves against the strategy.
3) State-action mapping. MM strategies need to learn the relations between massive states and different combinations of trading actions, which makes the learning process more difficult.
In this paper, we propose an end-to-end reinforcement learning MM trading model based on a recurrent deep neural network representation, termed the deep reinforcement learning MM model (DRLMM). The model is designed with three modules: 1) Feature capture. A deep recurrent Q-network (DRQN) architecture [4] is exploited and applied to MM agent learning, where the DRQN is based on DQN and modified by replacing the first post-convolutional fully connected layer with a recurrent long short-term memory (LSTM) layer. The DRQN with the LSTM units automatically learns temporal market states from the LOB without any hand-designed features. 2) Action selection. Unlike many previous works, the action space is designed to be adaptive to the strategy's internal state instead of being a fixed action set. The action set contains several subsets, and at each time step, the MM agent selects the appropriate subset as its action space according to its internal state, such as inventory. The action selection policy is then learned by DRLMM. 3) Cross engine. A near-to-real market cross engine is designed to simulate order execution in a stock exchange. Market data are fed into both the strategies and the engine, and the engine executes the strategies' orders based on predefined rules. Comparative experiments are conducted on a Level-2 data set of 10 stocks from the Shanghai Stock Exchange of China. The experimental results show that our model is effective and can make more profits.
The key contributions of this paper are as follows: • We design an end-to-end reinforcement learning MM trading model based on a recurrent deep neural network representation. The model uses LSTM units to capture temporal market information and exploits DRQN to optimize the MM strategy.
• We design an adaptive action selection policy, which selects a subset of actions from the whole action space based on the strategy's internal state. This mechanism makes the model training more efficient.
• We backtest the DRLMM model in the Chinese stock market using Level-2 limit order book data, and experimental results show that DRLMM is better than the baseline strategies in many metrics.
The rest of this paper is organized as follows. In Section II, we review relevant works on reinforcement learning and deep reinforcement learning in the financial field.
In Section III, we introduce the MM model framework based on deep reinforcement learning. In Section IV, the experiments are introduced, and the performances of the DRLMM model and the baselines are compared through market simulation. In Section V, we give our conclusions and future work directions.

II. RELATED WORK
Many research works are related to MM and reinforcement learning. In recent years, with the development of deep learning and the successful application of deep reinforcement learning, increasing attention has been given to this area by both researchers and practitioners. In this section, previous works on market making strategy design, reinforcement learning, and deep reinforcement learning are reviewed.

A. MARKET MAKING STRATEGY
MM strategy is of concern to many research areas, including both finance and machine learning. In finance and economics, MM is generally studied as an optimal control problem, and scholars use stochastic dynamic programming to study optimal quoting. For example, Avellaneda et al. [5] studied the pricing strategy in the LOB. Guilbaud and Pham [6] considered the influence of execution priority. A large body of literature has investigated the establishment of MM models. In the early stage, Ho and Stoll [7] proposed the classic single-dealer model. Since then, many researchers have extended the MM model. Das [8] expanded the scope of application of the model on the basis of Glosten and Milgrom's market maker model [9]. However, these methods based on market microstructure modeling rely heavily on conditional assumptions and are not suitable for application in complex real markets.

B. REINFORCEMENT LEARNING
In recent years, reinforcement learning has been used to solve many kinds of financial problems. Chan and Shelton [10] applied reinforcement learning to the MM model for the first time to endow the MM agent with learning ability. Then, researchers applied various reinforcement learning methods to market makers. Spooner et al. [11] designed a market making agent based on temporal-difference reinforcement learning, and applied a reward function to control inventory risk. Lim and Gorse [12] first proposed an optimized market making model based on reinforcement learning in high-frequency trading, and used constant absolute risk aversion (CARA) as the final optimization function of the model. In the multi-agent aspect, Patel [13] applied the multi-agent reinforcement learning framework to the market making strategy, and made trades with the help of macro and micro agents. Ganesh et al. [14] established a multi-agent simulation system of a dealer market, and proposed a reinforcement learning reward method based on yield. Zhong et al. [20] collaborated with a market making firm and developed a market making strategy based on Q-learning, implementing the strategy as a lookup table that is easier to deploy in real production. Spooner et al. [21] proposed an adversarial reinforcement learning based market making strategy and claimed that the agent can converge to a Nash equilibrium in several special cases.

C. DEEP REINFORCEMENT LEARNING
Deep reinforcement learning can solve the problem of high-dimensional and dynamic strategy optimization by combining the high-dimensional input of deep learning with reinforcement learning. Kumar [23], in his extended abstract, proposed a deep reinforcement learning Q-network based market making strategy, which is simulated and tested on his market simulator. Similarly, Gasperov et al. [22] proposed a deep learning market maker that utilizes a trading signal generator to help predict market trends, and they tested their strategy with one month of Bitcoin market tick data. Mnih et al. [2] and AlphaGo [15] have proved the practicability and effectiveness of DQN. Deep reinforcement learning's powerful representation ability and its end-to-end strategy optimization ability have also attracted the attention of researchers in the field of quantitative trading, who have tried to use DQN to solve the problem of investment decision-making under complex market conditions. Deng et al. [3] used direct deep reinforcement learning for financial asset trading. Ning et al. [16] established a fully connected neural network trained by experience replay and double DQN, and proved that its performance is better than the standard benchmark approach. Jia et al. [17] applied reinforcement learning based on LSTM to quantitative trading. Gueant and Manziuk [18] proposed a discrete-time model-based actor-critic algorithm and compared it with the classical finite difference method.

III. DEEP REINFORCEMENT LEARNING MARKET MAKING
Reinforcement learning can be modeled as a Markov decision process (MDP) and represented by a tuple of four elements, i.e., (S, A, π, r), where S is a state space, A is an action space, π is a policy, and r is a reward. Among reinforcement learning algorithms, Q-learning is a representative value-based algorithm. Its main goal is to build a two-dimensional Q-table to store the Q-value of each state-action pair, and to constantly update those values:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \tag{1}$$

where α is the learning rate, s' is the next state of s, a' is the action at state s', and γ is a discount factor.
Taking one step further, DQN uses a deep neural network instead of the Q-table in order to handle continuous state or action spaces. The neural network is used to approximate the value function. The loss function of the network is defined as

$$L_t(\theta_t) = \mathbb{E}\left[ \left( y_t - Q(s_t, a_t; \theta_t) \right)^2 \right], \quad y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_{t-1}), \tag{2}$$

where t is the current time step and y_t is an estimate of the expected return. DRQN [19] replaces the first post-convolutional fully connected layer with a recurrent LSTM layer on the basis of DQN. DRQN has been proved able to deal with partially observable information, and it handles loss of information better than DQN. In the stock trading environment, many unknown variables in the time dimension cannot be captured well by the DQN framework, so we use the LSTM module in the DRQN framework to construct the market information representation. In this way, we can define a state that is closer to a comprehensive observation of the trading environment. The structure of the network is designed as follows: 1) an LSTM layer, which processes the market data in order to generate the strategy's states; 2) a hidden layer, which is fully connected to the first layer and outputs the Q-values of four possible actions.
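To make the architecture concrete, the following is a minimal PyTorch sketch of such a network. The snapshot layout (40 features: 10 price levels per side, each with a price and a queue volume) is an assumption; the hidden size of 20 and the four-action output follow the setup described in Section IV.

```python
import torch
import torch.nn as nn

class DRQNNet(nn.Module):
    """Minimal sketch of the DRQN used in DRLMM; the LOB feature layout
    is an assumption, hidden_dim=20 and n_actions=4 follow the paper."""

    def __init__(self, lob_dim: int = 40, hidden_dim: int = 20, n_actions: int = 4):
        super().__init__()
        # lob_dim: features per LOB snapshot, e.g. 10 price levels per side
        # with price and queue volume (an assumed layout).
        self.lstm = nn.LSTM(input_size=lob_dim, hidden_size=hidden_dim,
                            batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, lob_seq: torch.Tensor) -> torch.Tensor:
        # lob_seq: (batch, 5, lob_dim) -- the five most recent LOB snapshots.
        _, (h_n, _) = self.lstm(lob_seq)
        return self.q_head(h_n[-1])  # (batch, n_actions) Q-value estimates

# Example: Q-values for one window of five 40-feature snapshots.
q_values = DRQNNet()(torch.randn(1, 5, 40))
```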

A. FORMULATION
The high-frequency market making strategy basically provides liquidity to the market by continuously quoting on both the bid and ask sides of the LOB. It processes incoming intra-day tick data, determines the external (market) state and its internal state, and takes instant actions simultaneously. It makes profits largely by capturing spreads in the LOB, and loses money to inside traders in situations such as price trending. In general, market making strategies can issue new orders and wait for executions in order to capture more spreads, and cancel old orders that are placed in risky positions in order to avoid losses. In this paper, the state determination and action making are formulated in the DRQN framework, and we design an MM trading strategy based on DRQN, as shown in Figure 1. Firstly, at each time step, the MM agent sends the state information as input to the neural network. Secondly, through several hidden fully connected layers, the Q-network is used to estimate the value of each state-action pair. Thirdly, through the space selection module, the corresponding action is selected, and the real reward is returned by the cross engine simulator. Finally, the gradient descent method is used to update the parameters θ.
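The gradient update in this loop can be sketched as follows, assuming a standard DQN-style replay buffer that yields batches of (state, action, reward, next state) tensors; the target-network form of y_t is our reading of the framework, not a detail stated in the paper.

```python
import torch
import torch.nn.functional as F

def drlmm_update(net, target_net, optimizer, batch, gamma=0.9):
    """One sketched gradient step on theta; `batch` is an assumed
    replay-buffer sample of (states, actions, rewards, next_states)."""
    states, actions, rewards, next_states = batch
    # Q(s_t, a_t; theta) for the actions that were actually taken
    # (actions is a LongTensor of action indices).
    q = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-), held fixed.
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q, y)  # "MSELoss", as in the experimental setup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()         # gradient descent update of theta
    return loss.item()
```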
Our definition of the state, action and reward are as follows:

1) STATE S
The state set in the model consists of two parts, an internal state and an external state, S = {S_in, S_out}. The internal state S_in = {m, I, O} includes money m, inventory I, and remaining order information O. The external state S_out mainly contains the market information; in contrast to much of the previous literature, which generally extracts data from the LOB, constructs features manually, and builds a multi-dimensional state space, we directly use the entire LOB and let the LSTM build states for the agent. The entire LOB at each time stamp is a set of multi-dimensional data, including 10 bid/ask price levels (on the sell side ask_10, ..., ask_1 and on the buy side bid_1, ..., bid_10; prices are highest at ask_10 and lowest at bid_10, ask_1 is the lowest sell price, a.k.a. the best ask, and bid_1 is the highest buy price, a.k.a. the best bid) and the waiting order queues corresponding to each level. For each time stamp t, we select the previous five snapshots of LOB data and input them to the LSTM iteratively to generate the external state at t. Therefore, the MM agent can retain its cognition of previous market transaction data while the external state is constantly updated.
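As an illustration, a minimal sketch of how the two state parts could be assembled is given below; the snapshot layout and field names are assumptions, since the paper simply feeds the raw Level-2 LOB to the LSTM.

```python
import numpy as np

def external_state(lob_history: list, t: int, window: int = 5) -> np.ndarray:
    """Sketch of S_out: the previous five raw LOB snapshots at time t,
    stacked so the LSTM can consume them step by step."""
    # Each snapshot: ask_10..ask_1 and bid_1..bid_10 prices plus queue volumes.
    snaps = lob_history[t - window + 1 : t + 1]
    return np.stack(snaps)  # shape (5, lob_dim), one row per time stamp

def internal_state(money: float, inventory: int, open_orders: list) -> dict:
    """Sketch of S_in = {m, I, O}."""
    return {"m": money, "I": inventory, "O": open_orders}
```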

2) ACTION A
In this paper, we divide the fixed finite action space into several subsets, i.e., A = {A_1, A_2, A_3, ..., A_n}. At each time step t, the space selection module selects the appropriate action subset A_t after analyzing the current state s_t and transmits it to the network. DRQN traverses every action a in A_t and updates the corresponding Q-value continuously. Actions not in A_t are excluded from consideration, and the model will not select such actions for the state s_t. Specifically, each action consists of three parts, i.e., a = (D, P, N), where D is the order side, P is the order price, and N is the number of shares. The order side can be either 'buy' or 'sell'; the price ranges over all prices that can be placed in the market, e.g., any level from ask_10 to bid_10; and the number of shares can be adjusted according to the demand of the order, with a minimum of one lot. The rules for action subset selection are as follows (a minimal sketch is given after this list):
• If there is no open order, the action space includes issuing two new orders, on the best bid and the best ask respectively.
• If there is only one open order, the action space includes: 1) waiting for the order to get executed; 2) canceling the order and issuing a new order with a new price.
• If there are two open orders (two legs) waiting in the LOB, the action space includes: 1) waiting for the orders to get executed; 2) canceling either of them and issuing a new order with a new price; 3) canceling both orders and issuing two new orders.
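The subset selection rules can be sketched as follows; the action names are ours. Note that the largest subset contains four actions, which we read as consistent with the four-action network output described above.

```python
def action_subset(open_orders: list) -> list:
    """Sketch of the space selection module; action names are assumptions."""
    if not open_orders:
        # No open order: quote both sides at the best bid and best ask.
        return ["quote_both_sides"]
    if len(open_orders) == 1:
        # One leg open: wait for the fill, or cancel and re-quote it.
        return ["wait", "replace_leg_0"]
    # Two legs open: wait, replace either leg, or replace both.
    return ["wait", "replace_leg_0", "replace_leg_1", "replace_both"]
```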

3) REWARD R
The reward function in reinforcement learning is generally defined as the cumulative reward, i.e., $U_T = \sum_{t=1}^{T} R_t$, where R_t is the reward at each step, which is also the value function in the classic deep reinforcement learning framework. In many normal trading models, R_t is simply defined by the profit earned. However, since the goal of the MM agent is different from normal trading models, R_t in DRLMM needs to be defined from two major perspectives: 1) market liquidity: DRLMM needs to quote on both sides of the LOB simultaneously in order to provide liquidity to the market; 2) inventory risk: DRLMM needs to consider the risk of inside traders that have extra information and make the market price trend. Therefore, the reward in DRLMM is defined as

$$R_t = \begin{cases} 1, & \text{if the 'buy' and 'sell' limit orders are both executed,} \\ -0.5, & \text{if a cross-spread order (market order) is executed,} \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

If the two legs ('buy' and 'sell' limit orders) are executed at the same time or within a very short time period, the MM agent captures the LOB spread (makes a profit) and also provides liquidity to the market; in this case, R_t is 1 (positive, two half-spreads). If one leg is executed first and, after waiting for a period of time, the MM agent finds its inventory at risk, it cancels the other leg and sends a new order that crosses the spread to get an immediate execution and rebalance its inventory; in this case, the MM agent loses money, thus R_t is −0.5 (negative, one half-spread). In all other cases, the MM agent can be considered as waiting, so R_t is 0.
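For clarity, this piecewise reward can be written directly as code:

```python
def step_reward(both_legs_filled: bool, crossed_spread: bool) -> float:
    """The piecewise reward R_t of DRLMM, written out as code."""
    if both_legs_filled:
        return 1.0   # spread captured: two half-spreads, liquidity provided
    if crossed_spread:
        return -0.5  # paid one half-spread to rebalance inventory
    return 0.0       # waiting: no execution worth rewarding
```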

B. CROSS ENGINE SIMULATOR
A near-to-real cross engine simulator is designed and implemented in order to handle order execution for our MM agents. The simulator is designed to accept historical Level-2 data and orders from the MM agents. Since there are only quote and trade data in the historical data, and no order queue data is provided, we cannot accurately know the waiting time of the limit orders generated by the MM agents. Therefore, in this simulator, we use a waiting time (a constant number suggested by practitioners) to simulate the waiting behavior of limit orders, i.e., when other conditions do not change, a limit order can be traded at its price level only after a constant number of trades have happened. As shown in Figure 2, the matching mechanism in this simulator is summarized as follows (a minimal sketch follows this list):
• When a buy order's price crosses the order book spread and touches the ask side, or a sell order's price crosses the order book and touches the bid side, 1) if the order's quantity is less than the shares in the waiting queue, it will be executed immediately; 2) if the order's quantity is more than the shares in the waiting queue, part of it will be executed immediately and the rest will wait in the queue.
• When a buy order's price is on the best bid or a sell order's price is on the best ask, the order will wait for a constant number n of trades, and then get fully executed if its quantity is less than the size of the next trade; otherwise, the remaining part of the order will stay in the queue.
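The two matching rules can be sketched as follows; the `order` and `book` field names are assumptions, since the simulator's internals are not fully specified.

```python
def match(order, book, trades_at_level: int, n: int) -> int:
    """Simplified sketch of the cross engine's two matching rules.
    Returns the quantity filled on the current snapshot."""
    crossing = ((order.side == "buy" and order.price >= book.best_ask) or
                (order.side == "sell" and order.price <= book.best_bid))
    if crossing:
        # Rule 1: crossing orders trade immediately against the waiting
        # queue; any excess quantity stays in the book.
        queue = book.ask_queue_qty if order.side == "buy" else book.bid_queue_qty
        return min(order.qty, queue)
    at_touch = ((order.side == "buy" and order.price == book.best_bid) or
                (order.side == "sell" and order.price == book.best_ask))
    if at_touch and trades_at_level >= n:
        # Rule 2: after n trades at the touch, fill up to the next trade's size.
        return min(order.qty, book.next_trade_qty)
    return 0  # otherwise, keep waiting in the queue
```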

IV. EXPERIMENTS AND DISCUSSIONS
In this section, the data set and setup used in the experiments are firstly introduced; secondly, the experimental results are presented and analyzed with discussions.

B. BASELINES
• Rule-based market making strategy (RMM). The first baseline is a traditional rule-based market making strategy. The strategy places buy and sell orders only on bid_1 and ask_1 respectively, and, as in DRLMM, the order quantity is set to one lot (100 shares) for all stocks. Figure 3 demonstrates the rules used in RMM, and the detailed trading logic of RMM is summarized as follows (a minimal sketch follows the baseline descriptions):
-- Place two orders (two legs) at the same time, one buy order on bid_1 and one sell order on ask_1;
-- If both orders are executed (or closed), two new orders will be issued at the next time point, and the prices will accordingly be updated to the new bid_1 and ask_1;
-- If only one of the two legs is executed, e.g., only the buy order, the other leg will wait for three trades to get executed; otherwise, the sell order will be canceled and a new sell order at the bid_1 price is issued (crossing the spread) and executed immediately;
-- If the strategy is about to end market making (e.g., the market is about to close), it cancels any open orders and closes its open inventory using market orders.
• Reinforcement learning based market making strategy (RLMM). A modified version of the reinforcement learning based market making model introduced by Lim and Gorse [12] is employed as the second baseline in our experiments. The inputs of the RLMM are snapshots of LOBs, which are vectors of prices and the corresponding shares at each price level. The LOB vectors are taken as states, and the RLMM uses Q-learning as its core algorithm to learn mappings between states and actions, where the action set is defined in the same way as in DRLMM.
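A minimal sketch of the RMM baseline's per-tick logic follows; the `agent` and `book` helpers are assumed, not part of the original description.

```python
def rmm_tick(agent, book, waited_trades: int, max_wait: int = 3, lot: int = 100):
    """Sketch of the rule-based baseline (RMM) at one time point."""
    if not agent.open_orders:
        # Quote both sides at the current best bid and ask, one lot each.
        agent.place("buy", book.bid_1, lot)
        agent.place("sell", book.ask_1, lot)
    elif len(agent.open_orders) == 1 and waited_trades >= max_wait:
        # One leg filled and the other has waited three trades without a fill:
        # cancel it and cross the spread for an immediate execution.
        leg = agent.open_orders[0]
        agent.cancel(leg)
        cross_price = book.bid_1 if leg.side == "sell" else book.ask_1
        agent.place(leg.side, cross_price, lot)
```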

C. SETUP
We choose the ε-greedy algorithm for the model's action selection during training, and gradually reduce the probability of exploration along with the learning process of the agent. The action set of the MM agent contains only the best bid and best ask, and the size of limit orders is set to one lot (100 shares). At the beginning of each tranche, the inventory I (I > 0, enough to buy and short stocks) is set to a constant. In the last minute of each tranche, if the inventory is not balanced, i.e., the agent did not close all of its live orders, the MM agent takes a series of done-for-day actions: 1) stop market making, 2) cancel all live orders, and 3) sell all holding positions, in order to keep the inventory balanced. In addition, without loss of generality, the transaction cost is 0, since most market makers have market making licenses and do not need to pay any transaction fee (in some markets they can even get rebates for market making). The detailed experiment setup is summarized as follows:
• Data processing: each trading day is equally divided into 8 parts, and each part is 30 minutes (a tranche). Models are then trained and tested on the set of tranches. All the data are divided into a training set (first 80 days) and a test set (last 20 days).
• Model training: ε in ε-greedy is set as 0.7, δ is set as 0.95, the discount factor γ is set as 0.9, the batch size is set as 128, the loss function is ''MSELoss'', the optimizer is ''Adam'', and the replay memory size is set as 10,000 (a sketch of the exploration schedule follows this list). There are two other parameters to be tuned, the learning rate (lr) and the number of output nodes in the LSTM (n). lr is selected from {1e−5, 5e−5, 1e−4, 5e−4, 1e−3, 5e−3}, and n is selected from {10, 15, 20, 25, 30, 35, 40, 45, 50}. There are in total 54 combinations of those two parameters, and after tuning, lr is set as 1e−3 and n is set as 20.
• Model testing: after training, the model is applied to the tranches in the test data set, and DRLMM is then compared with the baselines in terms of total PnL (Profit and Loss) and winning rate.
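A sketch of the ε-greedy schedule is given below; interpreting δ = 0.95 as a multiplicative decay of ε over training is our assumption, as the paper does not spell out the decay rule.

```python
import random

def epsilon_greedy(q_values, episode: int, eps0: float = 0.7, delta: float = 0.95):
    """Sketch of the exploration schedule: epsilon starts at 0.7 and shrinks
    as training proceeds (decay form is an assumption). q_values: list of floats."""
    eps = eps0 * (delta ** episode)
    if random.random() < eps:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```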

D. RESULTS AND DISCUSSIONS
In this section, we evaluate the performance of DRLMM by comparing it with the baselines. We use the average profit (AP) over trading days and its standard deviation (std.) to evaluate the profitability and stability of the trading models. In Table 1, the results for the 10 stocks are presented, and numbers marked in bold indicate that the model performs better than the other two models. It can be observed that DRLMM wins on all 10 stocks with smaller standard deviations. Since there are 8 tranches for each trading day, there are in total 160 tranches in the test data set, and for each trading period, e.g., 9:30-10:00, there are 20 tranches from the 20 trading days in the test data set. In addition to the AP comparison between the models, we compare the strategies' AP over the different trading tranches, i.e., for each trading period, we calculate the AP of that period over the 20 trading days and compare the performances. In Table 2, numbers in bold indicate that the model performs better than the other two models. It can be observed that DRLMM wins 67 out of the 80 trading tranches over the 10 stocks, which indicates that DRLMM performs better than the baselines across different trading periods. It can also be observed from the results that during the opening and closing hours of the market, strategies tend to make more profit than in the time periods near noon. This is because the market is usually actively traded during the opening and closing hours, which gives MM strategies more chances to get both legs of their orders executed and make a profit.
To take one step further, we use the winning number to compare the strategies in different trading periods. If a strategy gets more profit than the other strategy in one trading period, it gets 1 score; therefore, the highest score that one strategy can get in one trading period is 20 in the test data set. The results of the winning numbers in the test data set are shown in Table 3 and Table 4, where numbers in bold indicate that the strategy performs better than the other strategy. It can be observed from the results that DRLMM performed overwhelmingly better than the baselines in terms of winning number. In summary, the performance of DRLMM in our experiments is better than the baselines, and it can reliably generate more profits in a volatile market. From the experimental results, it can be observed that the LSTM can produce better market state representations than manual feature engineering. In addition, deep reinforcement learning can learn a better mapping between strategy states and actions, and take smarter actions to obtain more profits at lower risks.

V. CONCLUSION
Market making strategy optimization is an attractive topic for both researchers and practitioners. With the development and successful application of deep reinforcement learning models, how to apply deep reinforcement learning models to market making strategies becomes an interesting research problem. In this paper, we propose an end-to-end reinforcement learning market making strategy based on a recurrent deep network representation, DRLMM. It exploits an LSTM network to extract temporal patterns of the market directly from the LOBs, and it learns state-action relations via a reinforcement learning approach. In order to control inventory risk and information asymmetry, a deep Q-network is introduced to adaptively select different action subsets and train the market making agent according to the inventory states.
Experiments are conducted on a six-month Level-2 data set, covering 10 stocks, from the Shanghai Stock Exchange in China. Our model is compared with two baseline market making strategies. Experimental results show that: 1) DRLMM performs better than the benchmark MM strategies; 2) using DRQN to directly extract market information and construct market features makes the state representation in DRLMM better than manually made features; 3) the adaptive action space improves the training process of DRLMM as well as the profitability of the MM strategy.
In future work, DRLMM can be extended to a multi-agent setting, where many agents with different parameters are trained to learn market making, and a meta-learning mechanism could be further introduced to select agents in order to build a more profitable MM strategy.