Reinforcement Learning for Stock Prediction and High-Frequency Trading With T+1 Rules

A high-frequency trading framework that combines a price trend prediction model with a trading strategy has been a popular approach for T+0 trading in the stock market. The prediction model predicts price trends, and the trading strategy determines the price and volume of each order. Most trading strategies consist of multiple pieces of trading logic, each associated with tuning parameters, and these parameters significantly affect the profitability of the framework. The framework has two main disadvantages: 1) the price trend prediction model cannot adapt to the current market data distribution, and 2) the trading strategy cannot adapt to current market conditions automatically. Thus, the framework cannot always maintain positive revenue. To address this problem, we propose a novel dynamic parameter optimization algorithm based on reinforcement learning for stock prediction and trading, which yields an adaptive trading framework. First, we use a rolling model training method for stock price trend prediction. Second, we regard each set of strategy parameters as an action and devise an inverse reinforcement learning algorithm that learns a reward function to accurately estimate the reward of each action. Because of the T+1 trading rules of the Chinese stock market, we include the constraint of a limited short position in the reward function. Finally, a reward-enhanced upper confidence bound (UCB) selection algorithm is proposed to automatically optimize the parameters of the trading logic in real-time trading. The experimental results show that our method achieves competitive performance in the Chinese stock market.


I. INTRODUCTION
High-frequency trading (HFT) has matured in each of its components, including price indicators, machine learning models, and trading strategies. From an academic viewpoint, HFT is an online decision-making problem in which an action must be taken at each trading time. The action is either to send a trading order to the exchange or to do nothing, and an order is composed of three elements: direction (i.e., buy or sell), price, and volume. Many technical indicators address the order direction, such as moving average convergence and divergence (MACD), the K, D, and J lines (KDJ), Williams %R (WR), relative strength index (RSI), and stop and reversal system (SAR).
The associate editor coordinating the review of this manuscript and approving it for publication was Ze Ji .
With the popularity of machine learning algorithms, most quantitative traders have integrated machine learning models into their trading strategies, which substantially improves profitability thanks to the high accuracy with which these models predict the direction of price movement. Owing to this successful integration, the new generation of HFT frameworks consists of two modules at the production level: 1) a machine learning model, which aims to determine the order direction by predicting the future price trend, and 2) a trading strategy, which provides the order price and order volume and is usually designed by human experts, for example as an active policy or a market-making policy.
In the Chinese stock market, a stock bought today can be sold only on the next trading day, which is called the T+1 trading rule. The only way to run an HFT strategy in the Chinese stock market is to already hold stock positions. These stock positions are usually hedged with stock index futures, such as IC, IF, and IH, so that the entire portfolio is insulated from the fluctuation of the market. On such a stock portfolio, we can run the HFT (also called T+0) strategy on the stock positions. However, applying the T+0 strategy has some limitations: 1) the stock position that can be sold short is limited, and 2) the stock position at the end of the trading day must be the same as the initial position. As in [1], we also formulate the decision-making problem of the trading strategy in the Chinese stock market as a reinforcement learning problem. In the HFT stock framework, a supervised learning model is trained to predict the price trend of each stock, and all stocks use the same model. For each stock, a single parameter controls the threshold of the model prediction value: the smaller this parameter, the more signals the trading strategy generates, and the larger the parameter, the fewer the signals. The trading strategy determines the price and volume of the order based on the threshold of the model prediction value and the limit order book.
VOLUME 11, 2023 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In the Chinese stock trading system, we use the same trading policy for all stocks, whereas the threshold of the model prediction value differs from stock to stock. Different stocks have different market data distributions; thus, choosing the optimal threshold of the model prediction value for each stock can yield more profit.
Given the prediction value provided by the supervised learning model, a complete trading policy is determined by a set of thresholds. To this end, we propose a reinforcement learning (RL) framework for learning an automatic and intelligent high-frequency trading strategy.
In the RL framework, the action is the set of thresholds of all stocks in the portfolio, and the state is the current position and market data of each stock. Unlike in [1], the action space is large, because there are N stocks in the portfolio and N is usually approximately 200. We propose a dynamic action set approach that reduces the action space by using historical trading data at each trading tick. For the design of the reward function, we cannot consider only the long-period and short-period expectations of the trading order, because Chinese stocks are subject to the T+1 trading rules with the limits described above. The reward function must also account for the limits of the short position, particularly the limited position that can be sold short.
In the experimental part, we design extensive comparison experiments to evaluate our reinforcement learning framework on tick-level market data of the Chinese stock market. The experiments cover three parts: 1) the supervised learning model for predicting the price trend of a stock, 2) the reward function, and 3) the entire HFT framework for the trading strategy. Model expectation and accuracy are used to evaluate the price trend model, while profit and commission are the evaluation metrics for the reward function and the trading strategy. The results of these experiments demonstrate the competitive performance of our reinforcement learning framework for high-frequency trading in the Chinese stock market.
The main contributions of our work are as follows:
1) We formulate the high-frequency trading strategy optimization problem in the Chinese stock market under the T+1 rules as a reinforcement learning problem.
2) We propose a rolling training approach for the price trend prediction model, which can generate a model that adapts to new market conditions.
3) We develop a novel method to handle the limits of the T+1 trading rules, which incorporates the constraint of a limited short position into the design of the reward function.
4) We propose a novel approach to dynamically reduce the action space at each trading tick.
5) We deploy the high-frequency trading framework in a production-level system and greatly increase the profit.
The remainder of this paper is organized as follows. Section II describes the related work on high-frequency trading and reinforcement learning. Section III introduces the details of our reinforcement learning framework for high-frequency trading in the Chinese stock market, including the supervised learning model and its rolling training method, the learning algorithm for the reward function, and the dynamic action space reduction approach. The experimental setup and results are described in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
We review the related work in the following aspects.

A. HIGH-FREQUENCY TRADING
High-frequency trading (HFT) [2] is a method of using algorithms to send orders to an exchange automatically with a very short holding time. It can be used in any secondary market with T+0 trading rules, meaning that market players can buy and sell stocks or futures within one day. The limit order book (LOB), which represents a collection of buy and sell orders sorted by price and time, plays the most significant role in HFT. The price at which buyers are prepared to buy is called the bid price, and the highest bid price is called the best bid price. Similarly, the lowest ask price is called the best ask price. Two types of strategy are commonly used in HFT: the aggressive strategy and the market-making strategy. The aggressive strategy targets rapid upward and downward movements of security prices and relies on technical indicators or a supervised learning model to provide a prediction value for the price trend. Market-making strategies help to increase liquidity in the market and reduce price volatility, which leads to fair pricing of the asset.
Supervised learning methods are widely used in financial forecasting because of the high quality of limit order book and market data. Researchers have formulated the price trend prediction problem as a regression task and applied a set of classical machine learning algorithms to it, such as linear regression [3], LASSO [4], elastic net [5], random forest [6], decision tree [7], support vector machine (SVM) [8], and LightGBM [9]. The non-linear algorithms usually outperform linear models because they can learn non-linear relationships between different features. With the development of deep learning models for computer vision, natural language processing, and speech recognition, some researchers have attempted to learn hidden relationships from features using deep neural networks (DNN) [10], recurrent neural networks (RNN) [11], long short-term memory (LSTM) [12], and convolutional neural networks (CNN). At the production level of high-frequency trading, however, the fatal shortcoming of deep learning models is the time cost of prediction, which is usually on the order of milliseconds, whereas the time cost of the high-frequency trading strategy is only 10 microseconds.
[FIGURE 1 caption, beginning lost in extraction] ... stock exchange (SE). SM receives market data from SE, and DAS receives all stock rewards, which are computed by RF based on all traded orders. BL receives the prediction value from SM and the action set from DAS, then outputs the optimal action to TP using a hyperparameter optimization algorithm. TP outputs orders based on the optimal action to the stock exchange.

B. REINFORCEMENT LEARNING FOR HIGH-FREQUENCY TRADING
With the development of reinforcement learning (RL) in robot navigation and game playing [13], and more recently in chip design [14] and poster design [15], more and more researchers are trying to apply RL algorithms to different domains, especially high-frequency trading. The work in [16] applies RL to automated financial trading programs. Most previous work focuses on the stock market [17], [18], for example forecasting price fluctuations of the stock market [19] and devising optimal stock trading strategies [20]. Another set of methods is dedicated to financial portfolio optimization [21]. These works directly learn an automatic and intelligent trading strategy, which results in a large-scale action space (e.g., considering each price or volume as an action); this goal is hard to achieve because the financial market changes rapidly. In this paper, we instead simplify the task by optimizing the thresholds in the trading strategy, which reduces the action space.
Inverse reinforcement learning (IRL) learns to extract the reward function from the observed behavior of an expert. The initial work on IRL was done by [22], which solved the inverse reinforcement learning problem for moderate-sized discrete and continuous domains. Using a probabilistic model of a stochastic expert with a Gaussian process (GP) prior on reward values, the algorithm presented in [23] can recover both a reward function and the hyper-parameters of a kernel function that describes the structure of the reward.
The optimization of model hyper-parameters has gradually evolved into an important research direction [24]. In the field of financial trading specifically, trading programs contain many control parameters, which are regarded as hyper-parameters of the trading models. The work in [25] discusses in detail how parameterization choices are made according to the available historical data and how the parameters are tuned to achieve optimal performance.

III. THE PROPOSED APPROACH
The proposed reinforcement learning framework for high-frequency trading in the Chinese stock market is shown in Figure 1. It consists of six modules:
1) SM (Supervised Model): a supervised model that predicts the stock price trend.
2) BL (Bandit Learning): a bandit learning algorithm that selects the optimal action.
3) RF (Reward Function): a reward function of the action, learned by inverse reinforcement learning.
4) DAS (Dynamic Action Set): a dynamic algorithm for action set reduction.
5) TP (Trading Policy): the trading policy that sends orders to the stock exchange.
6) SE (Stock Exchange): the learning environment of the high-frequency trading framework; a back-testing system can stand in for the stock exchange so that we can learn on historical data.
Based on current market data, the SM module provides the prediction value for all stocks, the BL module determines the threshold of the model prediction value for each stock, and the TP module sends trading orders for all stocks to the stock exchange according to these thresholds. When orders are traded on the stock exchange, the RF module computes the reward of each order and then the reward of each action from the order rewards, and the DAS module reduces the action space based on the action rewards and market data. Finally, the BL module selects the optimal action from the reduced action space and determines the threshold of each stock. The six components are described in detail below.

A. PROBLEM DEFINITION
The aim of a high-frequency trading strategy is to send proper orders to the stock exchange. In this study, the problem is to determine the proper threshold of the prediction value for each stock in the portfolio. The prediction value is provided by the supervised model, which is trained on historical data. Once the threshold of the prediction value is determined, the trading strategy gives the price and volume of the order based on the prediction value and the current state of the stock. As mentioned above, the problem can be treated as a reinforcement learning problem, where the action consists of the threshold of each stock.
Suppose $p_i$ is the threshold of the $i$-th stock. At each trading tick, we must determine all values $p_i$, where $i \in \{1, \ldots, N\}$ and $N$ is the number of stocks in the portfolio, usually approximately 200. Thus the action $a_j$ is a threshold set $\{p_1, p_2, \ldots, p_N\}$, where $j \in \{1, \ldots, K\}$ and $K$ is the number of actions in the action space $A$. The problem is to select the optimal action $a^*$ from the action space at each trading tick. Table 1 shows the notation used in this paper.
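To make the role of the action concrete, the following minimal Python sketch treats one action as a mapping from stock code to threshold and emits a signal only when the magnitude of the prediction value reaches that stock's threshold. The function name `generate_signals`, the comparison rule, and the sample prediction values are assumptions for illustration and are not the production trading logic.

```python
from typing import Dict


def generate_signals(predictions: Dict[str, float],
                     action: Dict[str, float]) -> Dict[str, int]:
    """Return +1 (buy), -1 (sell), or 0 (no order) for each stock.

    Assumption: a signal fires when |prediction| >= threshold p_i.
    """
    signals = {}
    for code, pred in predictions.items():
        threshold = action[code]
        if abs(pred) >= threshold:
            signals[code] = 1 if pred > 0 else -1
        else:
            signals[code] = 0
    return signals


# Hypothetical prediction values and one action a_j = {p_1, ..., p_N}.
preds = {"SH600519": 1.7, "SZ000568": -1.3, "SZ300750": 0.5}
action = {"SH600519": 1.6, "SZ000568": 1.2, "SZ300750": 0.8}
signals = generate_signals(preds, action)
```

Lowering a stock's threshold can only add signals for that stock, which matches the behavior of the threshold parameter described in the introduction.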
To summarize, the inputs of our problem are 1) market data, including tick data, transaction data, and order data; 2) trading strategies, which are written from expert experience, including the active strategy and the market-making strategy; and 3) hyperparameters, which here are the thresholds of the prediction value of each stock. The output is the optimal hyper-parameter set at each tick. In addition, the problem is subject to two constraints: 1) the position that can be sold short is limited, and 2) the position must be kept at the fixed initial position. The objective is to select the action that maximizes the cumulative expectation, $\max_{a} E(a)$, accumulated over the orders $o \in O$ and the periods $1 < i < M$, where $E(a)$ is the cumulative expectation of action $a$, $N$ stands for the number of stocks in the portfolio, and $M$ is the total number of periods for the order expectation.

B. SUPERVISED MODEL FOR PRICE TREND PREDICTION
In this section, we propose a novel approach for price trend prediction. The label of the supervised learning model (Equation 2) is the change rate of the mid price, that is, of the average of the best ask price and the best bid price.
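The label described above can be sketched as follows. This is a minimal illustration that assumes the change rate is measured over a horizon of T ticks as (mid at t+T minus mid at t) divided by mid at t; the horizon argument and the function names are assumptions, since the exact form of the label equation is not reproduced here.

```python
from typing import List, Optional


def midprice(best_bid: float, best_ask: float) -> float:
    """Mid price: the average of the best bid price and the best ask price."""
    return (best_bid + best_ask) / 2.0


def midprice_change_labels(mids: List[float],
                           horizon: int) -> List[Optional[float]]:
    """Change rate of the mid price over `horizon` ticks (assumed label form).

    Ticks without a full look-ahead window receive None.
    """
    labels: List[Optional[float]] = []
    for t in range(len(mids)):
        if t + horizon < len(mids):
            labels.append((mids[t + horizon] - mids[t]) / mids[t])
        else:
            labels.append(None)
    return labels


# Hypothetical (best bid, best ask) quotes for three consecutive ticks.
quotes = [(99.0, 101.0), (100.0, 102.0), (101.0, 103.0)]
mids = [midprice(bid, ask) for bid, ask in quotes]
labels = midprice_change_labels(mids, horizon=1)
```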
In contrast to CNN or LSTM algorithms, we compress the historical information into the current tick at the feature (X) design stage. Table 2 shows detailed information about the feature set, which includes abundant historical information from the market data, such as the spread of the limit order book, the difference between the average price and the last price, the speed of price change, the speed of trade volume change, the acceleration of price change, the volatility of the last price, and the volatility of trade volume. We then use the Gradient Boost Regression Tree (GBRT) [26], [27] algorithm to train the price trend model. We use the Huber loss function (Equation 3) because it performs better over the whole data sample than the MSE loss function, which performs better on extreme data samples.
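The difference between the two loss functions can be illustrated with a short Python sketch of the standard Huber loss, which is quadratic for small residuals and linear for large ones. The transition point `delta=1.0` and the sample residuals are assumptions chosen for illustration, not values taken from our experiments.

```python
def huber_loss(y_true: float, y_pred: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic inside |r| <= delta, linear outside."""
    r = abs(y_true - y_pred)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def squared_loss(y_true: float, y_pred: float) -> float:
    """Half squared error, scaled to match the quadratic branch above."""
    return 0.5 * (y_true - y_pred) ** 2


# A small residual is penalized identically by both losses.
small = (huber_loss(0.0, 0.5), squared_loss(0.0, 0.5))
# An extreme residual dominates the squared loss but not the Huber loss.
large = (huber_loss(0.0, 10.0), squared_loss(0.0, 10.0))
```

With these inputs the small residual costs 0.125 under both losses, while the extreme residual costs 9.5 under the Huber loss and 50.0 under the squared loss, so a few extreme samples pull the fit far less under the Huber loss.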

C. ROLLING TRAINING FOR PRICE TREND PREDICTION MODEL
Even though we train the price trend model on a huge training data set, usually three months of tick data, the effectiveness of the model decreases over time because the distribution of market data changes over time. To address this problem, we deploy a rolling training approach for the price trend model: every month, the model is retrained on the newest three months of data. If the expectation and accuracy of the new model are better than those of the old model, we deploy the new model to the production system. If not, we ensemble the prediction values of the two models, compare the expectation and accuracy, and then select the better model.
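The monthly model selection rule above can be sketched as follows. The sketch assumes that "better" means strictly higher on both expectation and accuracy, and that the ensemble is compared against the old model; both are assumptions, as are the names `Metrics` and `choose_model`.

```python
from typing import NamedTuple


class Metrics(NamedTuple):
    expectation: float
    accuracy: float


def is_better(a: Metrics, b: Metrics) -> bool:
    """Assumed rule: a beats b only if both metrics are strictly higher."""
    return a.expectation > b.expectation and a.accuracy > b.accuracy


def choose_model(old: Metrics, new: Metrics, ensemble: Metrics) -> str:
    """Return which model to deploy after a monthly retraining."""
    if is_better(new, old):
        return "new"
    # Otherwise fall back to comparing the ensemble with the old model.
    return "ensemble" if is_better(ensemble, old) else "old"


# Hypothetical validation metrics for one monthly retraining.
old = Metrics(expectation=0.0010, accuracy=0.56)
new = Metrics(expectation=0.0012, accuracy=0.55)
ensemble = Metrics(expectation=0.0011, accuracy=0.57)
decision = choose_model(old, new, ensemble)
```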

D. EXPECTATION-BASED REWARD FUNCTION
A proper functional form for the reward makes it possible to evaluate actions accurately at each trading tick in our reinforcement learning framework. We could use an empirical reward function based on domain knowledge, such as the mean expectation of all traded orders in a fixed period. However, the distribution of market data changes rapidly in the stock market, and a reward function based on domain knowledge may become invalid after a short time. We propose a novel method based on inverse reinforcement learning to learn the optimal form of the reward function R(a), where a represents the action. We first introduce how we compute the expectation of one order. For example, our trading framework sends an order o(price, volume, direction) with a price of 100.12, a volume of 1000, and direction buy to the stock exchange.
The expectation of this order is measured over a period of T ticks, where T is the number of ticks from the current tick to the future tick. In general, we consider the short-period and long-period expectations together to evaluate the overall order signal. Here, we use a linear function to combine the expectations of the different periods into the reward function. Because T+1 trading rules are used in the Chinese stock market, we can apply T+0 trading only if we hold yesterday's position in the stock. This leads to stocks having a limited short position on the next day. A constraint term is added in Equation 5 to 1) control the speed of trading within one day, and 2) maintain the same position at the end of the trading day.
In this constraint term, CP is the current position of the stock and DP is the target position of the stock.
To use inverse reinforcement learning to learn the parameters of the reward function, we write the reward function as $E(o) = w \cdot u$, where $u$ is the feature expectation; for example, $E_{10}$ follows Equation 5 with $T = 10$. Algorithm 1 shows the detailed steps for learning the parameters of the reward function, which aims to find a reward function that minimizes the difference $E(w) - E(w^*)$ between the current reward function and the best reward function. The convergence of Algorithm 1 can be analyzed as follows. From Equation (8), we obtain the specific formulation of $error_j$ mentioned in Algorithm 1. The loss $error_j$ is a linear function of the parameter $w_j$, so stochastic gradient descent (SGD) can be used to reach the convergence point, that is, $error_j < \epsilon$.
Algorithm 1 Learning Algorithm for Reward Function
Input: Training set D, including LOB data and the volume and turnover of each tick;
Output: Best parameter $w^*$ of the reward function;
1: Analyze the back-testing result on training set D, find the best trading action at each tick, and obtain the expert policy $p^*$;
2: Initialize a random parameter $w_0$ for the reward function and compute the initial feature expectation $u_0$;
3: Optimize parameter $w_j$ to minimize the expectation difference $error_j$ between the expert policy $p^*$ and $p_j$;
4: If $error_j < \epsilon$, stop and return the best parameter $w^* = w_j$;
5: Based on reward function $R(w_j)$, obtain the current optimal policy $p_{j+1}$ by applying the parameter optimization algorithm described in the section on the action selection algorithm;
6: Compute the feature expectation $u_{j+1}$;
7: Set $j = j + 1$ and repeat from step 3.
For most high-frequency trading strategies, the short-period expectation has a more significant effect on the order's reward. Thus the overall reward of one order combines the expectations over the periods $T \in \{10, 30, 90, \ldots\}$, weighted by a factor $0 < \gamma < 1$ so that shorter periods count more.
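The time-priority weighting described above can be illustrated with a minimal Python sketch. It rests on several loudly stated assumptions: the expectation of an order over T ticks is taken as the signed relative difference between the mid price T ticks later and the order price, the k-th shortest period receives weight gamma to the power k, and the position constraint term is omitted. The function names and the future mid prices are hypothetical; the default 0.7 mirrors the value that the experiments report for the time-priority baseline.

```python
from typing import Dict


def order_expectation(price: float, direction: int, future_mid: float) -> float:
    """Assumed form: signed relative gain of the order against a future mid price.

    direction is +1 for a buy order and -1 for a sell order.
    """
    return direction * (future_mid - price) / price


def time_priority_reward(price: float, direction: int,
                         future_mids: Dict[int, float],
                         gamma: float = 0.7) -> float:
    """Combine period expectations, down-weighting longer periods by gamma**k."""
    reward = 0.0
    for k, period in enumerate(sorted(future_mids)):
        reward += (gamma ** k) * order_expectation(
            price, direction, future_mids[period])
    return reward


# The buy order from the text at price 100.12, with hypothetical mid prices
# observed 10, 30, and 90 ticks after the trade.
reward = time_priority_reward(100.12, 1, {10: 100.22, 30: 100.32, 90: 100.12})
```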

E. DYNAMIC ALGORITHM FOR ACTION SET GENERATION
Once we have an optimal reward function for evaluating actions, the action set from which the optimal action is selected at each tick must be determined. Usually, we have about three hundred stocks in the portfolio, and each stock has a specific parameter for its model prediction value, such as 1.6 for SH600519, 1.2 for SZ000568, and 0.8 for SZ300750. The possible values of each parameter are {1.0, 1.2, 1.4, 1.6, 1.8, 2.0}. Thus the size of the action space is $6^{200}$ (for N = 200 stocks), which is significantly larger than the action space of Go. We cannot compute the reward of every action one by one; otherwise the trading agent would never get to send any orders to the stock exchange.
Here we develop a dynamic algorithm that generates an action set at each tick based on the current stock position and market data. 1) Narrow the value range of the parameter for a specific stock. For example, for the most valuable liquor stock, SH600519, if the parameter is high, the prediction value can never exceed it and no orders are sent to the exchange; thus the possible values for SH600519 are {1.0, 1.2}. In addition, we can narrow the value range based on the volume and volatility on that trading day: higher volatility means we should use a large parameter, and lower volume means we should use a small parameter. 2) Cluster the stocks (e.g., into four clusters along two aspects, price and volatility) and use the same parameter for all stocks in a cluster, since stocks in the same cluster have a similar distribution of market data. In this way, we can narrow the action set to $2^4$ actions at each tick, and a different action set is generated at different ticks.
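The reduction to $2^4 = 16$ actions can be sketched in Python as a Cartesian product over per-cluster candidate values. The cluster names, the narrowed candidate values for each cluster, and the stock-to-cluster assignment below are hypothetical; in practice they would be recomputed at each tick from position and market data.

```python
from itertools import product
from typing import Dict, List


def build_action_set(cluster_values: Dict[str, List[float]]) -> List[Dict[str, float]]:
    """Enumerate every combination of one candidate threshold per cluster."""
    clusters = sorted(cluster_values)
    combos = product(*(cluster_values[c] for c in clusters))
    return [dict(zip(clusters, combo)) for combo in combos]


def expand_to_stocks(cluster_action: Dict[str, float],
                     stock_cluster: Dict[str, str]) -> Dict[str, float]:
    """Give every stock the threshold chosen for its cluster."""
    return {code: cluster_action[cluster] for code, cluster in stock_cluster.items()}


# Hypothetical narrowed ranges: two candidate values for each of four clusters.
cluster_values = {
    "high_price_high_vol": [1.8, 2.0],
    "high_price_low_vol": [1.0, 1.2],
    "low_price_high_vol": [1.6, 1.8],
    "low_price_low_vol": [1.2, 1.4],
}
stock_cluster = {"SH600519": "high_price_low_vol",
                 "SZ300750": "high_price_high_vol"}

actions = build_action_set(cluster_values)
thresholds = expand_to_stocks(actions[0], stock_cluster)
```

Each of the 16 cluster-level actions expands to a full per-stock threshold set, so the bandit only has to rank 16 candidates per tick instead of $6^{200}$.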

F. ACTION SELECTION ALGORITHM BASED ON UCB
Once the optimal reward function is given by the algorithm above, the expectation of each traded order can be computed accurately. The reward of each action in the dynamic action space can then be computed from its traded orders, where $M$ is the number of traded orders in a fixed trading period and $o_i$, with $i \in \{1, 2, \ldots, M\}$, denotes each traded order.
The multi-armed bandit learning algorithm provides an appropriate way to select the optimal action from the action space, that is, to select the best threshold of the prediction value for each stock in the portfolio.
Each action is scored with an upper confidence bound (UCB), where $S(a)$ stands for the action's score, $W(a)$ stands for the average reward, $n(a)$ records the number of times the action has been visited, and $C$ is a constant.
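For reference, a common form of the UCB score that is consistent with these symbols is the UCB1 rule shown below. This is stated as an assumption: the exact expression used in our framework may differ, and the total visit count $n$ is a symbol introduced here for illustration.

```latex
S(a) = W(a) + C \sqrt{\frac{\ln n}{n(a)}},
\qquad n = \sum_{a' \in A} n(a')
```

The first term favors actions with a high average reward, while the second term grows for actions that have been visited rarely, so the constant $C$ trades off exploitation against exploration.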
In general, an RL environment generates the reward of only one action, which is then used to update $W(a)$. In this section, by contrast, we develop a novel method that computes the rewards of all actions in the action space together at each trading tick. We build a precise back-testing system and integrate it into the online trading system. The back-testing system can estimate the expectation of each traded order and acts as a learning environment.
Algorithm 2 shows the running process of the action selection algorithm. At each tick in online trading, the algorithm runs once to obtain the optimal action $a_t^i$ based on the previous UCB scores, and then runs the back-testing system to update the reward scores of all other actions. After that, the algorithm waits for the next tick.

Algorithm 2 Action Selection Algorithm Based on Reward-Enhanced UCB
Input: Action set $A$, reward function $R(a)$;
Output: Optimal action $a_t$ at each trading tick $t$;
1: Initialize $W(a_i) = 0.0$, $n(a_i) = 0$;
2: Select the action $a_t^i$ with the highest UCB score $S(a)$;
3: Trade with action $a_t^i$, then record the reward of each order;
4: Back-test the action $a_t^i$ over different fixed periods (e.g., 30 minutes, 60 minutes) before the current tick, then record the reward $w(a_i) \cdot \lambda^{t}$, where $t$ is the period;
5: Update $W(a_i)$ and $n(a_i)$;
6: Repeat from step 2.
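The loop of Algorithm 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the score uses the standard UCB1 form, the decay over back-testing periods is omitted, and a single hypothetical callable `reward_fn` stands in for both live trading and the back-testing system. The class name `RewardEnhancedUCB` and the sample rewards are also assumptions.

```python
import math
from typing import Callable, Dict, Hashable, Iterable


class RewardEnhancedUCB:
    """UCB selection in which every action's reward is refreshed at each tick."""

    def __init__(self, actions: Iterable[Hashable], c: float = 1.0) -> None:
        self.actions = list(actions)
        self.c = c
        self.w: Dict[Hashable, float] = {a: 0.0 for a in self.actions}  # W(a)
        self.n: Dict[Hashable, int] = {a: 0 for a in self.actions}      # n(a)

    def score(self, a: Hashable) -> float:
        """Assumed UCB1 score; unvisited actions are tried first."""
        if self.n[a] == 0:
            return float("inf")
        total = sum(self.n.values())
        return self.w[a] + self.c * math.sqrt(math.log(total) / self.n[a])

    def select(self) -> Hashable:
        return max(self.actions, key=self.score)

    def update(self, a: Hashable, reward: float) -> None:
        """Incrementally update the visit count and the average reward."""
        self.n[a] += 1
        self.w[a] += (reward - self.w[a]) / self.n[a]

    def step(self, reward_fn: Callable[[Hashable], float]) -> Hashable:
        """One tick: pick an action, then refresh the rewards of all actions."""
        chosen = self.select()
        for a in self.actions:
            self.update(a, reward_fn(a))
        return chosen


# Hypothetical per-action rewards standing in for trading and back-testing.
rewards = {"a": 0.1, "b": 0.5, "c": 0.2}
bandit = RewardEnhancedUCB(["a", "b", "c"], c=0.1)
first = bandit.step(rewards.get)
second = bandit.step(rewards.get)
```

Because every action is refreshed at each tick, all visit counts stay equal in this simplified version, and the selection is driven mainly by the average rewards.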

IV. EXPERIMENTS
The dataset consists of Level-2 market data for the Chinese stock market. In the following sections, we introduce the dataset, the evaluation metrics, and the experimental results of the baselines and our algorithm.

A. DATASET
Level-2 depth market data, the finest-grained stock data, is used in our experiments. It is also called tick data and is published by the Chinese stock exchanges. The record format is TradingDay, UpdateTime, Volume, Turnover, BidPrice1-5, AskPrice1-5, BidVolume1-5, AskVolume1-5. To the best of our knowledge, we may be the first to present research results on such a high-quality dataset that is used in a real trading system.
We prepare the datasets from a subset of all stocks in the Chinese stock market, excluding ST stocks and stocks with low trading volume. This subset is selected carefully to cover most stock types, including high-priced and low-priced stocks, different industries, value stocks, and growth stocks. In the Chinese stock market, Level-2 depth market data are published as market data snapshots every three seconds. The final dataset has the following properties: 1) 200 stocks, 2) the training set covers 2020.03.01 - 2020.07.31, 3) the test set covers 2020.08.01 - 2020.10.30, and 4) the minimal time unit is 3 s. Table 5 shows some example stocks, including the minimum price unit, recent price, market value, and so on.

B. EVALUATION METRICS OF PRICE TREND MODEL
In a practical trading system, we typically act only on prediction values of large magnitude, which may be either positive or negative. Thus we focus on the evaluation results for large prediction values.
Two metrics are proposed to evaluate the price trend model: 1) Accuracy, which checks whether the sign of the prediction is the same as the sign of the label; and 2) Expectation, which measures the profit, given by the corresponding label y, that we obtain if we send a signal to the exchange using a large prediction value.
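Both metrics can be sketched in Python for the top k% of prediction values ranked by magnitude. The sketch assumes that the expectation is the mean of the label signed by the direction of the prediction; this form, the function names, and the sample data are assumptions for illustration.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (prediction value, label y)


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def top_k_percent(preds: List[float], labels: List[float], k: float) -> List[Pair]:
    """Keep the k% of samples with the largest absolute prediction value."""
    pairs = sorted(zip(preds, labels), key=lambda p: abs(p[0]), reverse=True)
    count = max(1, int(len(pairs) * k / 100.0))
    return pairs[:count]


def accuracy(pairs: List[Pair]) -> float:
    """Fraction of samples whose prediction sign matches the label sign."""
    return sum(1 for p, y in pairs if sign(p) == sign(y)) / len(pairs)


def expectation(pairs: List[Pair]) -> float:
    """Assumed form: mean label, signed by the direction of the prediction."""
    return sum(sign(p) * y for p, y in pairs) / len(pairs)


# Hypothetical prediction values and labels (mid price change rates).
preds = [2.0, -1.5, 0.3, -0.1]
labels = [0.004, -0.002, -0.001, 0.001]
top_half = top_k_percent(preds, labels, k=50)
everything = top_k_percent(preds, labels, k=100)
```

On this toy data the top 50% of signals are all correct, while the full sample is only half correct, which is the kind of monotonic behavior the evaluation looks for.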

C. EVALUATION METRICS OF TRADING FRAMEWORK
Trading profit and commission are used in the design of the evaluation metrics. In general, we evaluate the trading policy from two viewpoints: 1) the absolute profit $AP = p - c$, and 2) the multiple $MP = p / c$ of profit to commission, where $p$ is the trading profit and $c$ is the trading commission.
Most researchers use the absolute profit AP as the metric. However, when two trading strategies have the same absolute profit, the multiple MP offers another viewpoint for evaluation. The higher the MP, the better the trading strategy, and a higher MP also indicates that the strategy has more capacity.
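As a small worked example of the two metrics, the Python sketch below compares two strategies with the same absolute profit but different multiples. The profit and commission figures are hypothetical.

```python
def absolute_profit(p: float, c: float) -> float:
    """AP = p - c, with trading profit p and trading commission c."""
    return p - c


def profit_multiple(p: float, c: float) -> float:
    """MP = p / c; the commission must be positive for the ratio to be defined."""
    if c <= 0:
        raise ValueError("commission must be positive")
    return p / c


# Two hypothetical strategies with the same AP but different MP.
strategy_a = (250.0, 50.0)    # (profit, commission)
strategy_b = (400.0, 200.0)

ap_a, mp_a = absolute_profit(*strategy_a), profit_multiple(*strategy_a)
ap_b, mp_b = absolute_profit(*strategy_b), profit_multiple(*strategy_b)
```

Both strategies have AP = 200, but strategy A earns five times its commission while strategy B earns only twice its commission, so MP ranks strategy A higher.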

D. RESULT FOR PRICE TREND MODEL
We sort the prediction values by absolute value and evaluate the accuracy and expectation for the top k% of prediction values, where k takes values from {0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5, 10}. In addition, we evaluate the distribution over labels of different periods. Here T denotes the period, meaning that there are T ticks in each period, and T takes values from {1, 2, 4, 10, 20, 40, 80, 120, 240, 480}. An excellent model achieves high expectation on short periods as well as on long periods. Table 3 shows the expectation results for the top prediction values. From the expectation results, we observe the following. 1) The larger the prediction value, the higher the expectation. This implies that the model has high monotonicity; if we want a sharper strategy, we can trade only on the highest signals. 2) The expectation distribution over periods is peak-like, with the highest point at T = 20, which is also shown in Figure 2. From T = 1 to T = 20, the expectation increases very quickly, while from T = 20 to T = 480 it decreases slowly. In addition, the first tick alone captures nearly 33% of the 20-tick expectation, which means that we should send the signal as soon as possible. Table 4 shows the accuracy results of the price trend model. From the accuracy results, we observe the following. 1) The higher the prediction value, the higher the accuracy of the model prediction.
2) The accuracy distribution over periods is also peak-like, with the highest point at T = 4, as shown in Figure 3, which differs from the highest point (T = 20) of the expectation distribution. This means that the accuracy at T = 20 is lower than at T = 4, but some signals at T = 20 yield much higher values than those at T = 4, so the expectation is higher. 3) Even for the period T = 480, which corresponds to 24 minutes at 3 s per tick, the model still has an accuracy greater than 50%.

E. RESULT FOR OUR REINFORCEMENT LEARNING FRAMEWORK OF HFT
Existing reinforcement learning work on trading strategies focuses on predicting the action directly (determining whether to send a trading order to the exchange), and such work usually has little effect on real trading. Because this work was not developed with production-level high-frequency trading experience, we cannot run comparison experiments based on it; instead, we design some typical baselines.
We prepare several types of baselines:
1) The traditional technical indicator MACD [28]; many quantitative agents use technical indicators to send order signals.
2) A supervised learning model [29] with a simple taking strategy, which sends orders depending only on the prediction value and uses the market order type to trade.
3) The first public and complete trading strategy, Way of the Turtle, which brought huge profits to its author in real trading.
We also propose baselines that simplify one or more modules of our reinforcement learning framework:
4) FT: a fixed threshold for each stock.
5) F(Time): determine the threshold by time only; for example, set the threshold to 1.8 during the morning trading session and to 1.4 during the afternoon session.
6) F(Volatility): determine the threshold by volatility only; for example, the price of stock SZ300750 has high volatility, so we can set its threshold to 2.0, and the price of stock SH600519 has low volatility, so we can set its threshold to 1.4.
7) F(Time, Volatility): determine the threshold by both time and volatility.
Our proposed framework is called SL-IRL-UCBE, which refers to the supervised learning model, inverse reinforcement learning for learning the reward function, and the UCB-enhanced selection algorithm. We design one more baseline, called SL-TP-UCBE, in which the reward function uses the time priority (TP) method of Equation 9 with λ = 0.7, so that the expectation of the short period is more important.
When we tested the performance of these algorithms, the test set was set as the last one month dataset. Every day one result is generated, which includes profit and commission cost. In table 6, we show the average profit and average commission in the one month test set by day.
From Table 6, we find that our reinforcement learning framework for high-frequency trading in the Chinese stock market, SL-IRL-UCBE, achieves the most competitive performance among all methods. SL-TP-UCBE performs better than the other baselines, but the time priority method is not the optimal choice for the reward function; inverse reinforcement learning helps to obtain the optimal form of the reward function. Among the baselines, the technical indicator MACD performs worst, because a single indicator cannot adapt to the current stock market.

F. RESULTS FOR DIFFERENT REWARD FUNCTIONS OF AN ORDER
We propose several baselines to evaluate the reward function of a single order. 1) E(T=10): uses the expectation over only one period as the reward of the order, namely the 10 ticks following the trade. 2) E(New): also uses only one period, but the reward changes at each tick, since it is the expectation between the traded tick and the newest tick. 3) E(Linear): a linear combination of the expectations over different periods with fixed weights, here set to 0.5, 0.4, 0.3, . . . . As in the comparison experiment above, all methods are tested on the same test set, that is, the last month of market data. Table 7 presents the results for the different reward functions.
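The three reward baselines can be sketched as follows. This is only an illustration under stated assumptions: the "expectation" of a period is approximated here by the signed move from the trade price to the mid price at the end of the period, the horizons passed to the linear variant are hypothetical, and only the three weights stated above (0.5, 0.4, 0.3) are used, since the remainder of the weight sequence is not given.

```python
from typing import Sequence


def period_reward(mid: Sequence[float], trade_tick: int, trade_price: float,
                  direction: int, horizon: int) -> float:
    """Signed move from the trade price to the mid price `horizon` ticks later.

    `direction` is +1 for a buy and -1 for a sell (an assumed convention).
    """
    idx = min(trade_tick + horizon, len(mid) - 1)  # clamp to the last known tick
    return direction * (mid[idx] - trade_price)


def reward_fixed_horizon(mid: Sequence[float], trade_tick: int,
                         trade_price: float, direction: int,
                         horizon: int = 10) -> float:
    """E(T=10): a single fixed period after the trade."""
    return period_reward(mid, trade_tick, trade_price, direction, horizon)


def reward_newest(mid: Sequence[float], trade_price: float,
                  direction: int) -> float:
    """E(New): measured against the newest tick, so it changes as ticks arrive."""
    return direction * (mid[-1] - trade_price)


def reward_linear(mid: Sequence[float], trade_tick: int, trade_price: float,
                  direction: int, horizons: Sequence[int],
                  weights: Sequence[float]) -> float:
    """E(Linear): fixed-weight linear combination over several periods."""
    if len(horizons) != len(weights):
        raise ValueError("horizons and weights must have the same length")
    return sum(
        w * period_reward(mid, trade_tick, trade_price, direction, h)
        for h, w in zip(horizons, weights)
    )
```

In contrast to these fixed forms, the proposed method learns the form of the reward function through inverse reinforcement learning rather than fixing the weights by hand.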
From the comparison results, we observe that our proposed method for learning the form of the reward function outperforms all other methods: SL-IRL-UCBE achieves the highest profit, which indicates that its reward function evaluates actions more accurately. The E(Linear) baseline performs better than the other baselines, which demonstrates that the expectations over different periods are important for order evaluation.

G. STATISTICAL RESULTS OF REAL TRADING IN THE CHINESE STOCK MARKET
We run the T0-IRL-UCBS approach on a real stock account in the Chinese stock market from Feb 7th, 2022 to Mar 18th, 2022; the stock position value is about 50 billion RMB, and the portfolio contains 300 stocks. Figure 4 shows the absolute daily profit, that is, profit minus commission. Figure 5 shows the earning multiplier for each trading day, that is, profit divided by commission. The real trading results show competitive performance for the T0 strategy in the Chinese stock market. In addition, the results in real trading over this period are much better than those on the test set; in particular, on Feb 7th the profit is about five times the transaction commission. There are two reasons for this.
1) High volatility: The Chinese stock market fluctuated much more than before, and HFT strategies usually perform very well in highly volatile markets. Figures 6 and 7 show the transaction value and volume of the HS300 stock index, which indicate a strong relationship between volatility and transaction volume. In the trading period from Feb 7th, 2022 to Mar 18th, 2022, the transaction value is higher than in a low-volatility market, and it can reach about 70% of the value of the whole portfolio. Thus, the daily transaction value is about 35 billion RMB on a single side and 70 billion RMB on both sides (buy and sell). Because the single-side transaction fee rate in the Chinese stock market is 0.07%, the daily transaction commission is about 49 thousand.
2) Intelligent algorithm: Our RL framework has much more room for optimization in such volatile markets, and other market participants do not adapt as well.
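The commission arithmetic and the two plotted quantities can be checked with a short sketch. The function and parameter names are hypothetical, and the calculation follows the text in applying the 0.07% rate to the double-sided turnover. Under these assumptions, the quoted figure of about 49 thousand is reproduced for a portfolio of 50 million RMB, whereas a portfolio of 50 billion RMB would give about 49 million RMB, so the units of the quoted amounts should be read with care.

```python
def daily_commission(portfolio_value: float,
                     turnover_ratio: float = 0.70,
                     fee_rate: float = 0.0007) -> float:
    """Commission on the double-sided (buy plus sell) daily turnover."""
    single_side = portfolio_value * turnover_ratio  # e.g. 70% of the portfolio
    double_side = 2 * single_side                   # buys and sells together
    return double_side * fee_rate


def absolute_daily_profit(profit: float, commission: float) -> float:
    """Quantity described for Figure 4: profit minus commission."""
    return profit - commission


def earning_multiplier(profit: float, commission: float) -> float:
    """Quantity described for Figure 5: profit divided by commission."""
    return profit / commission


if __name__ == "__main__":
    for value in (50e6, 50e9):
        commission = daily_commission(value)
        print(f"portfolio {value:,.0f} RMB -> commission {commission:,.0f} RMB")
```

An earning multiplier of about 5, as reported for Feb 7th, means that the profit is roughly five times the commission paid that day.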

V. CONCLUSION
We have presented a novel and effective framework for price trend prediction and online parameter optimization of a high-frequency trading strategy in the Chinese stock market. A rich feature set is designed for training the price trend model, and a rolling training method is applied so that the model adapts itself to the market. An inverse reinforcement learning based algorithm is proposed for learning the parameters of the reward function, and a constraint term for the T+1 rules of the Chinese stock market is included in the reward function. In addition, a precise back-testing system is developed to evaluate the reward of each action during real-time trading. The experimental results on a subset of all stocks show that our proposed algorithm achieves competitive performance on Chinese stock market data. Finally, we run the proposed framework at the production level to evaluate its effectiveness in real trading, and the daily profit shows promising profitability. In the future, we will upgrade the framework to make it more suitable for use in any secondary market.

VI. OUTLOOK FOR FUTURE WORK
There are many potential directions for future work, as high-frequency trading with advanced machine learning is still at a relatively early stage, especially given the fast development of machine learning.
First, we aim to explore more advanced temporal models, both for time series learning, in terms of anomaly detection [30], [31] and forecasting [32], and for continuous-time event sequence modeling, especially the so-called temporal point process (TPP) [33]. Specifically, the TPP model can be used for relation mining [34] as well as for prediction [35], [36], rule mining [37], and clustering [38], [39].
Another promising direction is to incorporate more information into the decision-making pipeline, which can be encoded by graph neural networks (GNNs) [40] or other more efficient embedding methods [41], [42]. Meanwhile, machine learning for combinatorial optimization is also worth further study. One immediate way is to incorporate a knowledge graph [43].
Finally, viewing decision making from a multi-agent system perspective, it would also be interesting to consider the relations and constraints among trading agents, where graph learning [44], and especially graph matching [45], could be a potential tool to advance this topic.
WEIPENG ZHANG received the bachelor's and master's degrees from the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, in 2012 and 2015, respectively, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering. Before that, he worked as a hedge fund manager, integrating machine learning and reinforcement learning algorithms into high-frequency trading. He is currently focusing on the research and application of reinforcement learning algorithms in quantitative trading.
BING HAN is currently a Senior Staff Engineer and the Head of the Intelligent Engine Department, MYbank, Ant Group. Her research interests include machine learning and data intelligence, especially in recommender systems and financial technology.
HUANXI LIU was born in Hunan, China, in 1982. He received the master's and doctoral degrees in pattern recognition and intelligent systems from Shanghai Jiao Tong University, China, in 2007 and 2010, respectively. He is currently a Senior Engineer with Shanghai Jiao Tong University. His research interests include pattern recognition, computer vision, and machine learning.
VOLUME 11, 2023