Improving Pairs Trading Strategies Using Two-Stage Deep Learning Methods and Analyses of Time (In)variant Inputs for Trading Performance

A pairs trading strategy (PTS) constructs and monitors a stationary portfolio, shorting (longing) it when the portfolio is sufficiently over- (under-)priced, as measured by a predetermined open threshold. We close this position to earn the price difference when the portfolio's value reverts to its mean level. When the portfolio is significantly over- (under-)priced, as measured by another predetermined stop-loss threshold, we close the position to stop the loss. This paper develops a two-stage deep learning method to improve the investment performance of a PTS. Note that the literature executes a PTS by selecting the best trigger threshold (a combination of open and stop-loss thresholds) from a restricted, heuristically determined set of trigger thresholds. Such a design significantly degrades investment performance. However, selecting the best threshold from all possible thresholds yields a non-converged training problem. To resolve this dilemma, in the first stage of our method we propose a representative label mechanism that constructs a set of candidate trigger thresholds based on all possible thresholds and then trains a deep learning (DL) model to select the best one from the set. Experiments demonstrate that the proposed first-stage method avoids the non-converged training problem and outperforms most state-of-the-art methods. To further reduce trading risk, the second stage trains another DL model to remove unprofitable trades, where each trade is labeled with the profitability obtained by executing the PTS with the trigger thresholds recommended by the first-stage mechanism. Compared to models that indirectly judge profitability by price movement similarity without considering the quality of the recommended trigger thresholds, our model produces higher win rates and average profits.
Furthermore, we find that training with the PTS portfolio value process exhibiting time invariance clearly outperforms training with only time-varying stock/return processes, even though the latter training set contains more information. This is because unpredictable changes in market trends cause the model to learn time-varying patterns from the training set that may not apply to the testing set.


The associate editor coordinating the review of this manuscript and approving it for publication was Mingbo Zhao.

A pairs trading strategy (PTS) is a popular statistical arbitrage investment strategy that forms and trades market-neutral portfolios [1]. Rather than guessing hard-to-predict trends in financial markets, a PTS eliminates the risk of market tendency by longing (or shorting) several assets at the same time, according to specified investment weight ratios determined by various statistical methods [2]. The value of this portfolio, or ''spread,'' oscillates around a mean price level and has a mean-reverting property.

Existing risk-reduction mechanisms do not take into account the quality of the recommended trigger thresholds. Sarmento and Horta [8] group stocks according to the OPTICS algorithm, then remove pairs whose stocks come from different groups. Furthermore, Lu et al. [9] use long short-term memory (LSTM) and a wavelet convolutional neural network (CNN) to predict time-series anomaly properties. But these mechanisms do not necessarily determine unprofitability, so they can result in the removal of many profitable trades, which significantly reduces overall profits. Instead, this paper proposes a two-stage model to remove unprofitable trades without significantly sacrificing overall profits; the training data set comprises two parts. The first stage trains a ResNet model on the first part of the training data, labeled by the representative thresholds, to recommend open and stop-loss thresholds. Then, to remove unprofitable pairs, the second part of the training data is first inputted into the first-stage model to obtain the recommended thresholds. We then trade each stock pair with the recommended thresholds to obtain the profit/loss signal, which serves as the label for the stock pair to train the second-stage model. For each stock pair in the testing set, we also use the first model to recommend open and stop-loss thresholds and the second model to remove unprofitable pairs from trading. This two-stage model yields better win rates and higher Sharpe ratios across all our experiments.
Because frequent dramatic changes in financial markets alter patterns of stock price and return processes, it becomes difficult for machine learning algorithms to capture changing patterns, even when using many features and various data lengths [10]. Thus, researchers tend to train their models using a limited amount of the most recent historical market data; older data and the corresponding embedded information get discarded. But the cointegration test proposed by Johansen [3] guarantees that the statistical properties of the spread process do not vary with time. This feature effectively improves the performance of the PTS model if we prolong the training period, such that we do not need to tune the hyperparameter that controls the length of the training period. Even if the spread process contains less information than the price processes of stock pairs, models trained on spread process data still outperform those trained on stock pair data.

In Section II-A, we review prior PTS research that relies on quantitative and machine learning models. Section II-B outlines how we construct stock pairs that possess cointegration properties and provides the definitions of PTS reward functions. In Section III, we detail the construction of the optimal combination of the open and stop-loss thresholds (referred to as the ''trigger threshold'') and the representative labeling mechanism adopted to address the non-convergence training problem. Then, in Sections IV-A and IV-B, we describe how we incorporate the multi-scale ResNet into our PTS model, as well as the design of the two-stage model. The experimental results in Section V confirm the superiority of our two-stage models.
Some variations of the PTS include a double DQN proposed by Brim [21], with three actions (hold, buy, sell), that seeks to predict the trend of the spread, though a low win rate limits the model's applicability. Instead of using open and stop-loss thresholds, Xu and Tan [22] predict open and stop-loss timing for the PTS, which they use to form a return-maximized portfolio with a deterministic policy gradient method. In addition, Hsu et al. [23] take advantage of opinions from social media to predict spread price movements.

To reduce PTS risk, Sarmento and Horta [8] use the OPTICS algorithm and divide the stocks into groups according to their average return processes. They then remove PTS pairs with stocks from different groups. In our experiments, their approach slightly improves the win rate and reduces the maximum drawdown; however, it discards many profitable trades, such that it significantly reduces overall investment performance. When Lu et al. [9] use the time-series anomaly detection mechanism proposed by Huang et al. [24] to label anomalies of the price processes, they can combine LSTM and a continuous wavelet CNN to predict structural breaks, which they interpret as the loss of cointegration properties. But errors in labeling anomalies are difficult to avoid, which biases training efforts to detect structural breaks. Therefore, we propose a two-stage model that determines the optimal open and stop-loss thresholds in the first stage, then detects and removes unprofitable pairs in the second stage. With experiments, we show that this two-stage approach achieves a better win rate, more trading opportunities, and greater overall profits than filtering pairs with the OPTICS algorithm. Our approach also incurs fewer risks of negative returns than the structural break detection approach. Rather than using RL, we adopt DL with representative labeling mechanisms to find recommended open and stop-loss thresholds. To capture complex features or patterns in financial markets, we adopt the residual network (ResNet) model proposed by He et al. [25]; their extensive empirical data affirm that ResNets are simpler to optimize and achieve higher learning precision because they include more hidden layers. Li et al. [7] extend ResNet from a single scale to multiple scales by adding convolution kernels of various sizes to adaptively detect data features from different aspects. By combining representative labeling with multi-scale ResNet, our proposed method yields superior investment performance.

Financial markets constantly change with time, mainly due to black swan events such as the COVID-19 pandemic and quantitative easing, which caused stock markets to plummet and then soar during the first half of 2020. Such time-based heterogeneity causes trading patterns to vary over time and creates difficulties for DL algorithms, even with many features and long window sizes [10].

The spread P_i(t) is mean-reverting; that is, it oscillates around the mean level of the spread, E(P_i(t)). We can also measure the variation of P_i(t) by calculating its standard deviation σ_i. A sample cointegration calibration of two 0050.TW constituent stocks, Cheng Shin Tyre (2105) and Shin Kong Financial Holdings (2888), is illustrated in Figure 1. We use the stock trading data during the formation period to calibrate the VECM for determining the trigger threshold for each stock pair. Then we use the threshold to trade the stock pair during the trading period. The mean-reverting property of the spread defined in Equation (2) is illustrated by the blue curve moving around the mean level of −10.15. The magnitude of σ_i, denoted by the distance between the mean level and the green (or red) dashed line, is used to tune the open and stop-loss thresholds described as follows.
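As an illustration of the calibration quantities above, the following sketch builds a log-price spread and the corresponding open and stop-loss bands around its mean. This is a minimal toy example, not our implementation: the hedge ratio beta, the synthetic prices, and the multipliers xi_open and xi_stop are hypothetical placeholders rather than VECM-calibrated values.

```python
import math
import statistics

def spread_series(prices_a, prices_b, beta):
    """Spread P(t) = ln A(t) - beta * ln B(t); beta here is a
    hypothetical hedge ratio standing in for the VECM calibration."""
    return [math.log(a) - beta * math.log(b)
            for a, b in zip(prices_a, prices_b)]

def threshold_bands(spread, xi_open, xi_stop):
    """Mean level E(P(t)) plus open and stop-loss bands at
    mean +/- xi * sigma, mirroring the dashed lines in Figure 1."""
    mu = statistics.fmean(spread)
    sigma = statistics.stdev(spread)
    return {
        "mean": mu,
        "open": (mu - xi_open * sigma, mu + xi_open * sigma),
        "stop": (mu - xi_stop * sigma, mu + xi_stop * sigma),
    }

# Toy formation-period prices; all parameter values are placeholders.
a = [100 + 0.5 * math.sin(t / 3.0) for t in range(300)]
b = [80 + 0.4 * math.sin(t / 3.0 + 1.0) for t in range(300)]
spread = spread_series(a, b, beta=0.8)
bands = threshold_bands(spread, xi_open=1.0, xi_stop=2.0)
```

The stop-loss band encloses the open band by construction, since xi_stop > xi_open.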

The profit (or loss) from purchasing the aforementioned stock pair portfolio at time τ and selling it at time τ′ can be expressed as the product of the investment amount c and the difference of the (log-price) spread between the two dates, as in Equation (3). Due to the mean-reverting nature of Equation (2), we can short (long) the portfolio when the spread P_i(t) soars (falls).

FIGURE 2. Trading period scenarios. The red, purple, and black lines denote the mean of P_i(t), the thresholds for opening the portfolio, and the thresholds for stopping losses, respectively. The values are listed to the right of the lines. The orange and green curves denote the change of the spread processes over the trading period. We would long (short) the portfolio if the spread process beginning from A goes down (up) to reach E (B), as denoted by the green (orange) curves. Solid and dashed curves indicate actions that close the portfolio to gain profit or to stop loss, respectively. The dotted curve indicates that the portfolio is forced to close at the end of the trading period.

To reduce the number of candidate labels without increasing training difficulty too much, a representative labeling mechanism (RLM), as depicted in Figure 3, was first proposed in our previous conference work [6]. For clarity, the current paper describes the detailed implementation of the RLM, with our revisions, for generating representative trigger thresholds based on the statistics of the PTS profits obtained from the training set data. In Section III-A, we explain how we divide daily data into formation and trading periods and perform data preprocessing (step 1). To obtain information on eligible stock pairs and investment ratios, we apply the Johansen cointegration test.

Also, as much of the PTS literature uses easily obtained daily close price data to determine trading decisions, their models cannot consider day trading due to data limitations. We are not limited by this constraint, since we instead use intra-day tick data.
This intra-day design also prevents breaking events, which usually occur during the closing of the market, from eroding PTS profits.⁵
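The trading scenarios in Figure 2 can be sketched as a single pass over the spread path. This is a simplified toy simulator under stated assumptions: profit is approximated as c times the favorable spread change (a simplification of Equation (3)), and the mean and threshold values are supplied by the caller rather than calibrated.

```python
def simulate_trade(spread, mean, open_lo, open_hi, stop_lo, stop_hi, c=1.0):
    """One pass over the trading-period spread path, following the
    Figure-2 scenarios: open at the open thresholds, close at the mean
    (profit), close at the stop-loss thresholds (loss), or force-close
    at the end of the period."""
    position = 0          # +1 long the portfolio, -1 short
    entry = None
    for p in spread:
        if position == 0:
            if p <= open_lo:
                position, entry = +1, p      # long: spread fell to E
            elif p >= open_hi:
                position, entry = -1, p      # short: spread rose to B
        elif position == +1:
            if p >= mean:
                return c * (p - entry), "closed"      # mean reversion
            if p <= stop_lo:
                return c * (p - entry), "stop-loss"
        else:  # short position
            if p <= mean:
                return c * (entry - p), "closed"
            if p >= stop_hi:
                return c * (entry - p), "stop-loss"
    if position == 0:
        return 0.0, "no-trade"
    last = spread[-1]     # dotted-curve case: forced close at period end
    pnl = c * (last - entry) if position == +1 else c * (entry - last)
    return pnl, "forced-close"
```

For example, a path that dips below the lower open threshold and then reverts to the mean produces a profitable "closed" trade.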

Step 1 in Figure 3 describes the data preprocessing, in which we divide the 2013-2018 period into non-overlapping training and testing periods. The stock tick data for each business day D in the training period generates the labels and spread features required to train the RLM model, whose performance can then be verified on each business day of the testing period. Daily trading takes place from 9:00 a.m. to 1:30 p.m. each business day, divided into the formation period (the first 166 minutes, ignoring the first 16 minutes) and the trading period (the rest of the business day), as illustrated in Figure 1. We cut the first 16 minutes of trading data because the high volatility of the stock price influences the effectiveness of the cointegration test. As this test also requires sufficiently long time series to ensure its robustness, we follow [9] by setting the length of the formation period to 150 minutes for the cointegration test, and use the remaining time to execute the PTS. We use tick data from the formation period to calculate each half-minute's weighted average stock price. As described in Section II-B, we examine whether the resultant time series possesses the mean-reverting property using the Johansen cointegration test. We then derive the corresponding investment ratios β defined in Equation (2). During the trading period, we open the portfolio when the spread hits an open threshold and close it to earn profits when the spread reverts to E(P_i(t)). We impose a stop-loss when the process continuously plummets to E(P_i(t)) − ξ_S^i σ_i (or soars to E(P_i(t)) + ξ_S^i σ_i), as illustrated in Figure 2. The profit from executing the PTS with P_i^T during the trading period can be evaluated based on Equation (3). Note that the search space composed of all possible trigger thresholds is too large to serve directly as a label set.

⁵ The transaction cost also falls from 0.3% to 0.15% for day trading on the Taiwan Stock Exchange.
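The half-minute weighted average prices used in step 1 can be computed from tick data roughly as follows. The tick layout (timestamp in seconds, price, volume) is an assumed format for illustration, not the exchange's actual feed format.

```python
def half_minute_vwap(ticks, bucket_seconds=30):
    """Aggregate (timestamp_seconds, price, volume) ticks into
    volume-weighted average prices, one per half-minute bucket."""
    buckets = {}
    for ts, price, vol in ticks:
        key = int(ts // bucket_seconds)           # half-minute index
        pv, v = buckets.get(key, (0.0, 0))
        buckets[key] = (pv + price * vol, v + vol)
    # Volume-weighted average per bucket, in time order.
    return {k: pv / v for k, (pv, v) in sorted(buckets.items()) if v > 0}

# Three toy ticks: two in the first half-minute, one in the second.
vwap = half_minute_vwap([(0, 10.0, 2), (10, 11.0, 2), (35, 12.0, 1)])
```

Here the first bucket averages to (10.0·2 + 11.0·2) / 4 = 10.5 and the second contains the lone 12.0 tick.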
⁶ The statistical characteristics of training features used in many extant machine learning algorithms [e.g., 33] vary with market trends. Thus, training periods of heuristically selected lengths may significantly influence investment performance.

Note that many trigger thresholds enumerated by the procedure mentioned above are never chosen as the optimal threshold by any stock pair, probably because stock prices are quoted in integral multiples of basic units (i.e., ticks) rather than continuously. Many trigger thresholds do not fit the discrete changes of the spread process defined in Equation (2), due to discrete stock price quotes, and therefore will never be selected as optimal trigger thresholds. This rationale explains why heuristically selecting trigger thresholds (e.g., [5]) might significantly deteriorate investment performance. Deriving feasible trigger thresholds from the discrete changes in the spread process can be very hard, so we discretely enumerate many thresholds and use Equation (3) to calculate profits to filter out unprofitable thresholds.

Instead of adopting the regression-based method, we could select an optimal trigger threshold for each PTS-eligible stock pair from all possible trigger thresholds using classification-based approaches. Here we use cross-entropy as the loss function and train with different combinations of inputs and techniques to resolve non-convergence problems, as we did when examining the regression-based method. Part of our experimental results⁸ are illustrated in Figure 4. Figure 4(a) shows that the training losses oscillate significantly regardless of the changes in DL models and optimizers.
Figure 4(b) also illustrates that the highest training accuracy reaches only around 30%. In the next subsection, we address this non-convergent training problem with the RLM, which significantly reduces the number of labels without sacrificing the quality of the trigger thresholds.

We address the lack of training convergence by setting representative trigger thresholds according to either k-means or the thresholds with the top-k highest probabilities. With the former method, we partition all trigger thresholds into a reasonable number of clusters using the k-means algorithm; the cluster number, 25, is determined by the elbow approach. The set of representative trigger thresholds R is defined as the centers of the aforementioned clusters, a setting we call Kmeans(0). The optimal trigger threshold for the i-th spread process is relabeled by picking the representative threshold that maximizes profit, i.e., label(i) = argmax over r in R of Profit_i(r), as in Equation (5). Note that each representative threshold selected by Kmeans(0) (black nodes) basically does not coincide with any trigger threshold, because each cluster center is calculated as the average of the nearby trigger thresholds belonging to the same cluster. However, a slight shift in the threshold, like moving from the upper purple solid line to the dashed one in Figure 2, could significantly change the investment profit, as mentioned above. To prevent disturbances caused by low-probability trigger thresholds from degrading the quality of the representative labels, k-means can instead be applied only to trigger thresholds with probabilities larger than 0.1% and 0.5%, as illustrated in Figures 5(b) and 5(c), respectively. The resulting representative label settings are named Kmeans(1) and Kmeans(2), respectively.
Besides, to ensure that each representative label coincides with an actual trigger threshold, we could choose, as representative trigger thresholds, those trigger thresholds with the top-25 highest probabilities, as shown in Figure 5(d); we refer to this as the HighFreq label setting. Section V compares these RLMs to find the best one.

VOLUME 10, 2022

FIGURE 5. Pink, yellow, and green nodes denote the trigger thresholds selected with probabilities larger than 1%, 0.5%, and 0.1%, respectively. Blue and black nodes denote other low-probability trigger thresholds and representative thresholds, respectively. Here we illustrate the trigger threshold distribution for the year 2016. The distributions for other years are similar, so we omit them for simplicity.

The multi-scale ResNet adds size-5 and size-7 convolution kernels, as well as two corresponding chains of blocks.⁹ The features extracted by the three convolution kernels (i.e., the outputs from the three chains of residual blocks) are concatenated to form a feature vector, which is sent to a fully connected network.
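The Kmeans(0) construction and the relabeling rule of Equation (5) can be sketched as follows, with a plain Lloyd's-style k-means on (open, stop-loss) threshold pairs. The profit function passed to relabel is a hypothetical stand-in for the profit evaluation of Equation (3), and the toy data use two clusters rather than the paper's 25.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D (open, stop-loss) threshold pairs."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each threshold pair to its nearest center.
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            groups[j].append(p)
        # Recompute centers as cluster averages (keep old center if empty).
        centers = [(sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def relabel(profit_fn, representatives):
    """Equation (5): pick the representative trigger threshold that
    maximizes the pair's profit, label(i) = argmax_{r in R} Profit_i(r)."""
    return max(representatives, key=profit_fn)
```

A pair whose (hypothetical) profit peaks near an open threshold of 3 is relabeled with the representative closest to that value.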

To find the settings that achieve the best training results, we experimented with different optimizers and activation functions; the training accuracy and loss for part of our experiments with different optimizers are illustrated in Figure 6. Training accuracy refers to the percentage of correct predictions over all pairs in the training set; training loss is measured by cross-entropy. The training accuracy for the CNN model, denoted by the orange curve, increases slowly, and its training loss oscillates significantly. Thus we use a residual network, which employs more hidden layers to capture the various features embedded in complex financial markets. Although the three-scale ResNet with the AMSGrad, RMSprop, and SGD optimizers and the single-scale one achieve almost 100% accuracy and 0% loss after enough training epochs, we select the three-scale ResNet with AMSGrad since it converges the most smoothly and quickly. By repeating the above comparison, our later experiments choose the three-scale ResNet with the AMSGrad optimizer, the Leaky-ReLU activation function, and the three-channel input. Unless stated otherwise, the inputs are formed by the spread and the two stock return processes.

FIGURE 6. Training accuracies and losses of CNN and ResNet under different settings. We illustrate part of our experiments (denoted by the legends) for finding the optimized settings to achieve the best training results. The activation function used here is Leaky-ReLU. The input has three channels: the spread and the two stock return processes.

We divide the data into a training set and a validation set to determine the number of training epochs. We first train the model on the training set and then run the resulting model on the validation set to calculate the accuracy and loss. To retrieve useful information from the training data set without incurring overfitting, we halt the training process when the win rate on the validation set reaches a maximum. The two-stage design is illustrated in Figure 7. Here, we divide the training set data into set 1, used to train the first-stage mechanism, and set 2, used for the second-stage mechanism.
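The validation-based halting rule can be sketched as follows. The patience parameter is our illustrative addition for deciding when the maximum has been reached; the paper itself simply halts at the validation win-rate maximum.

```python
def best_epoch_by_win_rate(win_rates, patience=3):
    """Track the running maximum of the validation win rate and halt
    once `patience` consecutive epochs pass without improvement.
    Returns the epoch of the best win rate and its value."""
    best_epoch, best_rate, since_improved = 0, float("-inf"), 0
    for epoch, rate in enumerate(win_rates):
        if rate > best_rate:
            best_epoch, best_rate, since_improved = epoch, rate, 0
        else:
            since_improved += 1
            if since_improved >= patience:
                break   # validation win rate has peaked; stop training
    return best_epoch, best_rate
```

For a win-rate sequence that peaks at epoch 2 and then decays, the helper stops early and reports epoch 2.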

In the first stage, the trigger threshold for each pair is determined by the optimal threshold selection from Equation (5). In contrast with the single-stage model, the two-stage model trades only those pairs that are predicted to be profitable. Our experiments show that this design improves the win rate and reduces risk.

We conduct experiments on the constituent stocks of the Taiwan Top 50 ETF (0050.TW) from 2013 to 2018 to back-test improvements in PTS investment performance due to the proposed RLM and the two-stage model. To evaluate trading performance, we first extract intra-day trading information for each trading day D1 in the testing period, as illustrated in Figure 3. Then we retrieve stock pairs feasible for PTS by applying the Johansen cointegration test to the formation period data of day D1. Next, we predict each pair's optimal representative trigger threshold using the trained three-scale ResNet described in Section IV-A. With the retrieved stock pair and the corresponding trigger threshold, we execute tick-by-tick pairs trading in D1's trading period. We execute all trades one tick later than when the spread process hits the trigger threshold, to simulate price slippage effects.

We compare the investment performance of different PTSs by analyzing the following financial indicators: the (overall) profit, the win rate, the normal close rate, the number of trades, the Sharpe ratio (SR) calculated on a daily or pair basis, the maximum drawdown (MDD), the maximum required capital, and the average profit (per trade), as listed in the first column of Table 1. To facilitate the performance comparison in the following tables, we set in boldface the best performance for each indicator (except for the number of trades and the maximum required capital) to easily identify the best methods or settings.
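Three of the indicators above can be computed as follows. This is a generic sketch of the standard definitions, not necessarily the exact daily- or pair-based conventions used in Table 1.

```python
import statistics

def win_rate(profits):
    """Fraction of non-zero trades with positive profit."""
    trades = [p for p in profits if p != 0]
    return sum(p > 0 for p in trades) / len(trades) if trades else 0.0

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return over the standard deviation of all returns."""
    excess = [r - risk_free for r in returns]
    return statistics.fmean(excess) / statistics.stdev(excess)

def max_drawdown(equity):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    peak, mdd = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        mdd = max(mdd, peak - v)
    return mdd
```

For example, trade profits of [1, -1, 2, 3] give a 75% win rate, and the equity curve [0, 2, 1, 4, 3] has a maximum drawdown of 1.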
The (overall) profit sums the profits of all trades over the testing period.

Section V-A compares the various representative labeling methods discussed in Sections IV-A and III-D. Because combining the multi-scale ResNet with the settings described in Figure 6 and HighFreq (or KMeans(0)) yields the best performance, these settings are adopted in our subsequent experiments. Section V-B demonstrates that the proposed RLM outperforms existing trigger threshold selection mechanisms. Section V-C shows that training a machine learning model with the spread process defined in Equation (2), whose patterns are time-invariant, prevents changes in financial markets from harming the model's predictability. Finally, Section V-D illustrates how the two-stage model developed in Section IV-B reduces PTS risk more effectively than existing methods do.

To improve PTS investment performance, in this section we select the best DL model and corresponding settings, as in Section IV-A, together with a proper RLM, to ensure the efficiency of the training described in step 5 of Figure 3. Table 1 compares the different RLMs proposed in Section III-D. Row 4 lists KMeans(0), KMeans(1), and KMeans(2), the settings that apply k-means to all trigger thresholds (see Figure 5(a)), to trigger thresholds with probabilities greater than 0.1% (see Figure 5(b)), and to trigger thresholds with probabilities greater than 0.5% (see Figure 5(c)), respectively.

HighFreq picks the trigger thresholds with the top-25 highest probabilities, as illustrated in Figure 5(d).

Market trends vary with time, and the non-stable nature of stock price/return processes makes it difficult for machine learning models to capture and predict stable patterns in trading data [10]. Thus model performance varies significantly, depending on whether there are turning points during the training or testing periods. Therefore, a proper hyperparameter setting that controls the lengths of the training and testing periods can be challenging to identify [33]. However, spread processes (Equation (2)) generated by the cointegration test are stationary; their statistical properties do not change when shifted in time. This valuable property allows us to extend the training period to capture more patterns and improve PTS profitability, without exposing the model to changes in financial markets. In Table 3, lengthened training periods generally coincide with increases in win rate, SR, trading opportunities (i.e., number of trades), and overall profit. In contrast, using only non-stationary series, such as stock prices and stock returns, as training data yields unstable performance with each increment of the training period. The spread process can also be expressed as a non-invertible function of the pair's stock prices, as in Equation (2); the two stock price processes thus contain broader information than is available in the spread process. Even though the spread process contains less information, its stationarity makes it easier for machine learning algorithms to capture time-invariant patterns, rather than the time-variant patterns of the stock return and price processes.
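The contrast between time-invariant and time-varying inputs can be illustrated with a toy simulation: a mean-reverting (spread-like) process keeps a bounded, stable distribution, while a random-walk (price-like) process drifts, so statistics learned early no longer apply later. The process parameters below are arbitrary illustrative choices, not fitted to our data.

```python
import random
import statistics

def random_walk(n, sigma, rng):
    """Non-stationary, price-like path: variance grows with time."""
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, sigma)
        path.append(x)
    return path

def mean_reverting(n, theta, sigma, rng):
    """Stationary, spread-like path (Ornstein-Uhlenbeck style):
    x_{t+1} = (1 - theta) * x_t + noise, so the level stays bounded."""
    x, path = 0.0, []
    for _ in range(n):
        x += -theta * x + rng.gauss(0.0, sigma)
        path.append(x)
    return path

rng = random.Random(42)
rw = random_walk(2000, 1.0, rng)
mr = mean_reverting(2000, 0.2, 1.0, rng)
```

Over 2000 steps, the dispersion of the random-walk levels dwarfs that of the mean-reverting levels, mirroring why a prolonged training period helps only when the input is stationary.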

Next, we analyze the impact of combining different types of processes as inputs. Combining the stationary process with a non-stationary one (i.e., the ''S + R'' and ''S + P'' cases) provides stable investment performance that grows as the training period lengthens. In addition, their performance is better than that generated by training with just the spread process (i.e., the ''S'' case). But the investment performance generated by training with the non-stationary return and stock price processes (i.e., ''R + P'') is unstable. This result again confirms the value of the time-invariant property.

To strengthen our arguments, we extend the experiment in Table 3 to different testing periods across 2016-2018, as shown in Table 4. We observe the same phenomena.

TABLE 1. Comparing different representative labeling mechanisms. We list the training period, validation period, and testing period in the first, second, and third rows. The fourth row lists the RLMs used to generate representative trigger thresholds. The investment performance indicators are in the first column. For each indicator, we set in boldface the best of the four RLMs.

The same instability appears when training with a single non-stationary process (i.e., the ''R'' or ''P'' cases). Moreover, the distribution of the stock return process is more stable than that of the stock price, as the former is obtained by applying the difference operator to the logarithm of the latter. Specifically, the return for stock S_ij over the period [τ, τ′] can be evaluated by ln(S_ij(τ′)/S_ij(τ)), as in Equation (3).

Sarmento and Horta [8] remove pairs whose stocks belong to different groups. We combine their OPTICS-based risk reduction with our first-stage RLM mechanism, illustrated in Figure 3, and thereby compare their risk-reduction model with our second-stage model, which detects and removes unprofitable trades, as in Figure 7. Comparisons of the one-stage model (i.e., adopting only the RLM mechanism for trading) and the combinations of the RLM with different risk-reduction methods are listed in Table 5.

Detecting and removing unprofitable pairs may erroneously remove profitable transactions, which would reduce overall profits and the daily Sharpe ratio. But our removal mechanism also improves the win and normal close rates and significantly enhances the pair-based SR and the average profit (per trade), by up to 40%. In addition, the MDD falls significantly, attesting to the effectiveness of our second-stage approach in protecting investors from unexpected significant losses. The OPTICS-based approach [8] achieves similar effects, but our proposed two-stage model outperforms their model on almost all financial indicators, according to the direct comparisons.

TABLE 5. Comparison with the OPTICS-based risk-reduction algorithm. The training data set 1 and the validation data used to train the first-stage model in Figure 3 are listed in the first two rows. The training data set 2, used to train our risk-reduction method or that of Sarmento and Horta [8], is listed in the third row. ''O-S'' and ''T-S'' denote our proposed one-stage and two-stage models, respectively. ''5 min-M'' indicates OPTICS grouping based on the 5-minute moving average returns. The performance when we use Kmeans(0) or HighFreq to select representative thresholds appears in subsequent rows. For each indicator, we set in boldface the best of the three methods.

…Table 4 in Lu et al. [9]. The experimental settings also match their paper. For each financial indicator, we set in boldface the best of the methods listed in the first column.

The denominator of the Sharpe ratio (Equation (7)) is the standard deviation of all returns, regardless of their signs. In contrast, the denominator of the Sortino ratio involves only the standard deviation of negative returns. Therefore, we can deduce that the risk of negative returns in HighFreq is much smaller than in SAPT.
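The denominator difference between the two ratios can be sketched as follows. Note that downside-deviation conventions vary across sources; the variant below (root mean square of below-target returns over the full sample) is one common choice, not necessarily the exact formula of Equation (7).

```python
import math
import statistics

def sharpe(returns):
    """Denominator: standard deviation of ALL returns, both signs."""
    return statistics.fmean(returns) / statistics.stdev(returns)

def sortino(returns, target=0.0):
    """Denominator: downside deviation, built only from returns
    below the target (one common convention)."""
    downside = [min(0.0, r - target) for r in returns]
    dd = math.sqrt(statistics.fmean([d * d for d in downside]))
    return (statistics.fmean(returns) - target) / dd
```

For a return series whose losses are small relative to its overall volatility, the Sortino ratio exceeds the Sharpe ratio, which is exactly the pattern discussed above.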

This paper proposes a novel two-stage model to improve PTS investment performance and reduce trading risk by optimally selecting trigger thresholds and removing unprofitable stock pairs. In the first stage, we train a multi-scale ResNet with the proposed RLM to select optimal thresholds without incurring the non-convergence training problem. Our approach outperforms other approaches that heuristically generate a set of thresholds for selection. To remove unprofitable stock pairs in the second stage, we train another multi-scale ResNet with the profitability of each stock pair, obtained by executing the PTS with the trigger thresholds recommended in the first stage. As a result, our second-stage model outperforms models that indirectly predict stock pair profitability by the similarity of stock price processes [8] or the occurrence of structural breaks [9]. We also find that the time invariance of the spread process (i.e., the portfolio value process) makes it easier for machine learning algorithms to capture features and hence improve investment performance. Indeed, as in much of the financial-market-prediction literature, training with time-varying patterns such as stock prices or returns yields unstable investment performance, as the patterns learned from the training set may change in the testing set.