An Enriched Time-Series Forecasting Framework for Long-Short Portfolio Strategy

Stock return forecasting typically requires a large number of factors, and these factors usually exhibit nonlinear relations with each other. Conventional methods of stock return forecasting mainly fall into two categories: Technical Analysis and Fundamental Analysis. Technical Analysis focuses on time-series data, while Fundamental Analysis explores low-frequency fundamental variables. Although there is a substantial body of work on either time-series analysis or fundamental analysis, few studies have enriched time-series forecasting with fundamental variables, as the features differ in frequency, scale, and type. In this paper, we propose a Long Short-Term Memory and Deep Neural Network (LSTM-DNN) hybrid model to integrate fundamental information into time-series forecasting tasks. We demonstrate how investors can benefit from the superior performance of LSTM-DNN by constructing a long-short portfolio that takes long positions in the stocks with the highest forecast returns and short positions in the stocks that are expected to decline. Extensive experimental results on real data show that our framework improves the profitability of long-short portfolio strategies compared to state-of-the-art approaches. We also find evidence indicating that the outperformance of the LSTM-DNN model comes from its enhanced ability to extract information from the nonlinear relations among various features, rather than from bearing more market risk. Besides the novel framework, we propose a cross-section normalization method, which benefits the framework by providing enriched cross-section signals.


I. INTRODUCTION
Accurate stock return forecasting and clear elucidation of the underlying causal factors are of interest to various parties. However, this remains a notoriously challenging task owing to the complex, dynamic, and chaotic nature of the stock market. Despite the challenges, many empirical studies have shown that financial markets are predictable to some extent [1]-[3].
Within the past decades, we have witnessed a dramatic increase in research on stock return prediction. Technical Analysis (TA) and Fundamental Analysis (FA) are two popular methods in the stock return forecasting literature. TA focuses on modeling time-series price data and predicts future stock prices primarily based on historical price trends. The development of the TA method is mainly driven by the data mining and machine learning fields. Various statistical and machine learning methods have been explored and applied, such as Auto-Regression, Neural Networks, Support Vector Machines, and Random Forests. However, the complex and non-stationary nature of stock returns imposes a formidable challenge for most time-series modeling approaches. In contrast, FA is a classical topic in finance, especially in asset pricing, and is generally considered for developing long-term investment strategies. FA pays more attention to fundamental factors and variables extracted from firms' annual reports and other announcements. Both TA and FA have their unique advantages in stock return forecasting.
Recently, several attempts have been made to integrate TA and FA. They propose to first aggregate the time-series sequence into TA indicators (e.g., Moving Averages, Trend Lines, and Relative Strength Index) and then train predictive models on both TA indicators and FA variables. However, aggregating the raw time series into ad-hoc TA indicators inevitably discards much of the temporal dependency in the data. Integrating both types of analyses while preserving the strengths of the time-series model remains a challenging task for three main reasons: Firstly, the types of data to deal with are quite different. TA usually deals with time-series data, while FA focuses on annual or quarterly fundamental data. Secondly, TA and FA use different analytical tools. Classical TA models use moving averages and auto-regression, while sorting stocks into portfolios based on firm characteristics is the conventional method in FA. Finally, they differ in their trading strategies. TA attempts to find better trading signals on one or two assets and simply applies a pair trading or buy-and-hold strategy. In contrast, trading strategies developed by FA require a long-short portfolio, which longs the stocks with higher expected returns and shorts the stocks with lower expected returns. This paper proposes a novel framework that blends a Long Short-Term Memory network (LSTM) and a Deep Neural Network (DNN) to capture the temporal dependencies among variables and integrate stable fundamental information, respectively. We call our blended framework the LSTM-DNN model hereafter in this paper.
In recent years, advanced machine learning methods have received much more attention due to their advantage over linear models in capturing the nonlinear relations among various factors. Traditional econometric methods, such as ordinary least squares (OLS) and panel regressions, are popular in economics and finance studies, as their results can be easily interpreted and validated. In contrast, advanced machine learning techniques, e.g., Deep Neural Networks (DNNs), are capable of solving complex tasks, e.g., image recognition and natural language processing, but at the expense of interpretability. To obtain more precise predictions and address potential limitations, we study the improvement of LSTM-DNN by feeding it different sets of features and comparing it to the benchmarks.
To incorporate both time-series data and fundamental information, our blended framework models them separately and then combines the intermediate results to make final predictions. Specifically, the LSTM handles the time-series data and brings the time-series signal to the last layer of neurons of the model, while the fundamental data are handled by the DNN, which allows for more complex relations among the various features. From the perspective of strategy profitability, DNN and LSTM are able to extract richer information from features relative to ordinary econometric models, which are built on a set of restrictive assumptions. Regarding the trading strategy, we limit ourselves to a long-short portfolio strategy in this paper, while other strategies are also applicable.
Intuitively, the long-short portfolio strategy can achieve its best performance when we accurately locate a stock's expected return on the entire cross-section of expected returns, not necessarily its exact expected return. Thus, rather than using raw data, we apply a cross-section-aware preprocessing on the features and feed them to our LSTM-DNN framework. Specifically, we standardize all individual stock features within each month, except for categorical features, by subtracting the cross-section average value of a feature and dividing the resulting value by the cross-section standard deviation of the given feature. For example, if the lag 1-month return of a stock is 1% while the average stock return in that month is 2%, and the cross-section standard deviation is 2%, the lag 1-month return of the given stock after cross-section preprocessing is (1% − 2%)/2% = −0.5. Via such preprocessing, the factors encompass richer information in the sense that they contain the relative position of the firm characteristics across all firms within each month. It is also important to note that all factors we use are available at the time of portfolio construction.
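This per-month standardization can be sketched in a few lines of numpy (the function name is ours; the input is one feature's values across all active stocks in a given month):

```python
import numpy as np

def cross_section_standardize(values):
    """Standardize one feature across all active stocks within one month:
    subtract the cross-section mean and divide by the cross-section std."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()
```

Applied to the worked example above, a stock whose lag 1-month return sits one half of a standard deviation below the monthly cross-section mean maps to −0.5, regardless of the overall level of returns that month.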
The long-short portfolio strategy works as follows. We rank stocks based on their individual predicted returns. Then, given their ranks, we buy the top decile and sell short the bottom decile of securities once every rebalancing period. One advantage of the long-short portfolio strategy is that the movement of the whole market has little impact on its performance. For example, a hedge fund might sell short $1 million of Apple stock and buy $1 million of Microsoft stock. With this position, any event that causes all tech industry stocks to fall will yield a profit on the Apple position and a roughly matching loss on the Microsoft position. By exploiting a long-short portfolio, namely, buying the stocks with the highest predicted returns and selling short the stocks with the lowest predicted returns, an investor can profit from inferring the relative performance of stocks in the future period.
To conduct a robustness check on the long-short portfolio based on the LSTM-DNN model, we focus on the rate of return and/or Sharpe Ratio (excess return over the risk-free rate divided by its standard deviation) delivered by the strategy, which is consistent with most papers on trading strategy development [4], [5]. It should be noted that conventional metrics commonly used in data mining, such as mean squared error (MSE), may not be good indicators for performance evaluation.
The layout of the paper is as follows. Section 2 reviews the related literature on machine learning approaches to individual stock return prediction. Section 3 describes the problem formulation and predictive modeling. Section 4 introduces the data and experimental setting. Section 5 examines the performance of the proposed method and other machine learning methods. The robustness check and feature importance discussion are delivered in Section 6. Section 7 concludes and discusses areas for future work.

II. RELATED WORK
A. ADVANCEMENTS MADE IN FINANCE FIELDS
Trading strategies have been extensively explored in the finance literature. Many of them have proved effective in forecasting stock returns [6]. Value investing has become popular among professional financial advisers and practitioners since the famous book "The Intelligent Investor" [7] was published in 1965. After that, many fundamental factors have been studied. These factors span a wide range of information, such as market capitalization [8], price-to-earnings ratios [9], trading liquidity [10], and return momentum [11]. Many of these factors exhibit unique predictive value for forecasting a stock's future performance. In particular, momentum has received a tremendous amount of attention from practitioners [12]-[14]. Many follow-up studies have focused on optimizing the momentum trading strategy. For example, a two-pass ordinary least squares regression method has been proposed to extract more information from a stock's historical performance [15].

B. CLASSICAL MACHINE LEARNING METHODS
Forecasting is a classical machine learning problem. Quite a few conventional machine learning methods have been developed in the data mining and machine learning fields for predicting stock returns. For example, the Support Vector Machine (SVM) was used to forecast the weekly movement direction of the Nikkei 225 index [16] and to predict the trend of the Taiwan stock market [17]. Random forests were employed to predict the direction of stock market prices and were found to outperform SVM and Artificial Neural Networks (ANNs) in empirical experiments [18]. More recently, a combined SVM and K-nearest neighbor model with weighted features has been proposed to predict Chinese stock market indices [19].

C. DEEP LEARNING APPROACHES
Neural networks, which have achieved the best performance in multiple areas [20], have been applied to analyze stock markets. Shallow artificial neural networks (ANNs) were used to predict stock price movements for firms traded in China [21] and Canada [22]. These ANNs have shown more predictive power than linear models. The past few years have witnessed a surging number of applications of deep neural networks (DNNs) for forecasting financial markets. Abe and Nakayama [23] used DNNs to predict one-month-ahead cross-section stock returns. Various DNN structures, with differing numbers of layers and differing numbers of neurons in each layer, were investigated. They found that deeper networks achieved higher directional accuracy. DNNs, gradient-boosted trees, random forests, and different ensembles have been evaluated for predicting the one-day-ahead stock return on the S&P 500 [24]. The prediction target is a binary response indicating whether the stock return is larger than the corresponding cross-section median return computed over all stocks. Their results showed that a simple equal-weighted ensemble of neural networks, gradient-boosted trees, and random forests achieved the best performance. A recent study also compared different structures of DNNs and other advanced machine learning approaches, but it considered more extensive features, including 94 characteristics of each stock, the interactions of these characteristics, and eight aggregate time-series variables [5]. They found that shallow neural networks outperformed other competing methods. There are also studies that incorporate semantic or sentiment information into asset allocation optimization and volatility forecasting [25], [26].
Long Short-Term Memory Networks (LSTM) have been used to predict future trends of stock prices [27] and directional movements [28]. Both studies conclude that the LSTM, a better alternative for time-series problems, outperforms plain deep neural networks and classical machine learning methods (e.g., random forest).

D. HYBRID MODELS
Besides single-mode machine learning algorithms, there are some hybrid models proposed in recent years. For example, the wavelet technique is integrated with neural networks to forecast stock markets [29], [32]. A hybrid model of ANNs and autoregressive moving average models is proposed [30]. An ensemble of technical analysis and machine learning method is proposed in [34]. More hybrid models for stock market forecasting can be found in [31], [33], [35]- [37]. Recently, the hybridization of different kinds of neural networks, which can accommodate heterogeneous input features, has been applied to other areas successfully, e.g., success prediction on crowdfunding [38] and user intended action prediction [39]. These promising results motivate us to blend different kinds of neural networks for characterizing the heterogeneous input features to predict stock returns.

III. METHOD
A. PROBLEM FORMULATION
We aim to predict the monthly return of each stock based on historical data. Let subscripts t and t − τ denote the t-th month and a lag of τ months, respectively. In addition, we use r and r̂ to represent the actual stock return and the predicted stock return throughout the paper. We denote the operable stock set at time t as Ω_t. Note that stocks with missing values in the required features are not included in the set Ω_t. Let r^i_t be the monthly return of the i-th stock at time t, where i ∈ {1, ..., N_t} and N_t = |Ω_t| represents the number of active stocks at month t. Note that the number of active stocks, N_t, may vary at each time point due to missing data. For the stock fundamental information, f^i_t represents the latest available annual fiscal report statistics of stock i at time t. Assume that there are M fundamental features in total. We denote by f^{i,j}_t, where j ∈ {1, ..., M}, the j-th fundamental feature of stock i at time t.

B. CROSS-SECTION STANDARDIZATION
We propose to construct the long-short portfolio based on the predicted relative performance of each stock. Instead of using the raw historical stock returns and fundamental information, we normalize the dependent and independent variables across all active stocks at each month to encode relative rather than absolute information. Let v^i_t be a numerical variable of the i-th stock at time t. We normalize the raw variable v^i_t and make it comparable along time by removing the impacts of the group mean and standard deviation. Specifically, the normalized value ṽ^i_t is calculated as follows:

ṽ^i_t = (v^i_t − μ_t(v)) / σ_t(v),

where μ_t(v) and σ_t(v) are the mean and standard deviation of the variable over all active stocks at month t. Note that the normalization is also consistent with the essence of cross-section stock return prediction, which is to accurately forecast the relative performance of a stock [40]. As shown in Table 1, for each stock we calculate 27 features, including 18 monthly return variables and 9 fundamental variables. Except for the average market returns and the industry category (a classification code grouping companies with similar products or services), which are categorical variables, all other variables are cross-section standardized before being fed to the various machine learning models. In this paper, we focus on a compact set of features. A larger set of features may improve the overall performance, but at the expense of an increased cost to collect the required data, especially for inexperienced individual investors. Future work will explore more types of features and the impacts of different features.

C. LSTM-DNN MODEL
We apply a stacked Long Short-Term Memory network (LSTM) to model the monthly return time series. A compact form of the equations for the forward pass of an LSTM unit with a forget gate is:

f_t = σ_g(W_{f,r} r̃_t + W_{f,h} h_{t−1} + b_f)
i_t = σ_g(W_{i,r} r̃_t + W_{i,h} h_{t−1} + b_i)
o_t = σ_g(W_{o,r} r̃_t + W_{o,h} h_{t−1} + b_o)
c_t = f_t ◦ c_{t−1} + i_t ◦ tanh(W_{c,r} r̃_t + W_{c,h} h_{t−1} + b_c)
h_t = o_t ◦ tanh(c_t)

where ◦ denotes the element-wise Hadamard product; r̃_t ∈ R^τ is the normalized input vector at timestep t; σ_g is the sigmoid function; f_t is the forget gate activation vector; i_t and o_t are the input and output gate activation vectors, respectively; and h_t and c_t represent the hidden state (output vector) and the cell state. W_{∗,r} ∈ R^{h×τ}, W_{∗,h} ∈ R^{h×h}, and b_∗ ∈ R^h are weight matrices and bias vectors. In contrast, the fundamental variables are fed into a fully-connected neural network (FCNN). We concatenate the outputs from both the LSTM and the FCNN and add additional fully-connected layers to produce the final predictions. Mathematically, the output of hidden neuron j in a fully-connected layer is defined as

h_j = σ(w_j x + b_j),

where σ, w_j, and b_j represent the activation function, weight vector, and bias, respectively. Following Nair and Hinton [41], we use the rectified linear unit (ReLU) activation function for all hidden neurons. The ReLU activation is defined as follows:

ReLU(x) = max(0, x).

The loss function we use is the mean squared error (MSE). For illustration, the mean squared error at time t is defined as

MSE_t = (1/N_t) Σ_{i=1}^{N_t} (r̃^i_t − r̂^i_t)²,

where r̃^i_t is the real standardized return of stock i at time t and r̂^i_t is its estimate. Figure 1 shows the structure of the proposed LSTM-DNN model together with the number of hidden nodes at each layer. Techniques such as L2 regularization and Dropout [42] are applied to avoid overfitting. A linear activation function is used for the final output layer, while ReLU activation functions are used for all other layers. The Adam algorithm [43], with a learning rate of 0.001 and a decay rate of 0.001, is adopted to optimize the model.
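The forward pass of a single LSTM unit described above can be sketched in numpy. This is an illustrative single-step implementation (names and shapes are ours); the paper's actual model stacks such cells and trains them with Adam:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(r_t, h_prev, c_prev, W_r, W_h, b):
    """One forward step of an LSTM cell with a forget gate.
    W_r: input weights (h x tau), W_h: recurrent weights (h x h),
    b: biases (h,), each a dict keyed by gate name 'f', 'i', 'o', 'c'."""
    f = sigmoid(W_r['f'] @ r_t + W_h['f'] @ h_prev + b['f'])  # forget gate
    i = sigmoid(W_r['i'] @ r_t + W_h['i'] @ h_prev + b['i'])  # input gate
    o = sigmoid(W_r['o'] @ r_t + W_h['o'] @ h_prev + b['o'])  # output gate
    # new cell state: keep part of the old state, add new candidate content
    c = f * c_prev + i * np.tanh(W_r['c'] @ r_t + W_h['c'] @ h_prev + b['c'])
    h = o * np.tanh(c)  # hidden state (output vector)
    return h, c
```

Because the gates are sigmoid-bounded and the cell output passes through tanh, each component of the hidden state stays strictly inside (−1, 1), which keeps the recursion numerically stable across many timesteps.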

D. LONG-SHORT PORTFOLIO
The predicted return of stock i for month t is r̂^i_t = f(x^i_t), where f(·) denotes the trained LSTM-DNN model and x^i_t collects the input features of stock i. Based on the predicted returns for month t, we rank the stocks in ascending order and re-index them as {1, ..., N_t}. The long-short portfolio is constructed as follows:

Ω̂^short_t = {i : 1 ≤ i ≤ ⌈α N_t⌉},
Ω̂^long_t = {i : N_t − ⌈α N_t⌉ < i ≤ N_t},

where α ∈ (0, 0.5] is the cutting percentage and ⌈x⌉ denotes the smallest integer no less than x. To be consistent with existing methods in the literature, we use α = 0.1 in the experiments, and we also show in Section VI-C that the LSTM-DNN model is superior to the other methods for any α. Ω̂^long_t and Ω̂^short_t are the stock sets for longing (predicted to have higher returns) and shorting (predicted to have lower returns), respectively. There are at least two strategies for investors to profit from our model. First, a simple buy-and-hold strategy that only longs the predicted 'winner' stocks provides investors with expected monthly returns above the expected market returns. Second, by constructing a long-short portfolio, which longs the predicted 'winner' stocks and shorts the predicted 'loser' stocks, investors can exploit the return difference between the two sides. In this paper, we focus on the second strategy, which is more common in the literature.
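The portfolio construction above reduces to a rank-and-slice operation. A minimal sketch (the function name is ours; inputs are one month's predicted returns, indexed by position):

```python
import numpy as np

def long_short_sets(predicted_returns, alpha=0.1):
    """Rank stocks by predicted return and return the index sets to
    long (top alpha fraction) and short (bottom alpha fraction)."""
    n = len(predicted_returns)
    k = int(np.ceil(alpha * n))            # ceil(alpha * N_t) stocks per side
    order = np.argsort(predicted_returns)  # ascending rank order
    short_set = order[:k]                  # lowest predicted returns
    long_set = order[-k:]                  # highest predicted returns
    return long_set, short_set
```

With α = 0.1 this yields the top and bottom deciles used in the experiments; the spread between the realized returns of the two sets is the strategy's monthly profit.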

IV. EXPERIMENTS
A. DATA DESCRIPTION
The experimental data is obtained from the Center for Research in Security Prices (CRSP). Specifically, the data we downloaded from CRSP includes individual stock returns and prices, the S&P 500 index return, industry categories, the number of shares outstanding, share codes, exchange codes, and trading volumes. The prices of stocks are recorded at the end of each month and adjusted for stock splits and stock dividends. The stock return of month t is calculated as the adjusted close price of month t divided by the adjusted close price of month t − 1, minus one. The S&P 500 return data is used to derive the market index past{6, 24, 60}ave features in Table 1. Shares outstanding is the number of publicly held shares, recorded in thousands. We only study the returns of common shares (with CRSP share code 10 or 11). Daily trading volumes are aggregated to calculate the monthly trading volumes. The Standard Industrial Classification Code (SICCD) is used to group companies with similar products or services. In this paper, we only study stocks listed on the three major exchanges, NYSE, AMEX, and Nasdaq (with exchange codes 1, 2, and 3, respectively).
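The monthly return computation described above is a one-line formula; spelled out (the helper name is ours):

```python
def monthly_return(adj_close_t, adj_close_prev):
    """Monthly return from split/dividend-adjusted close prices:
    r_t = p_t / p_{t-1} - 1."""
    return adj_close_t / adj_close_prev - 1.0
```

Using adjusted rather than raw close prices is what makes the ratio meaningful across split and dividend events.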
The firm's book value data is obtained from COMPUSTAT (a database held by Standard & Poor's, which we access via Wharton Research Data Service (WRDS)), and the one-month Treasury bill rate is obtained from Kenneth French's data library.

B. EXPERIMENTAL SETTING
The data, ranging from July 1, 1977 to December 31, 2018, are collected from different sources. The total number of stocks in our data set is 6,700, with approximately 1,398 operable stocks per month. It should be noted that market structures and trading rules evolve as investors apply mutable trading strategies and regulators change the rules of the market. For example, the Dodd-Frank Act strengthened the regulation of the interbank lending system after the 2008 financial crisis, and the popularization of high-frequency trading enhances market liquidity during normal periods. These structural changes require us to treat the strategies in each period differently. We split the 30-year period from 1989 to 2018 into six 5-year testing periods. The model for each testing period is trained with the data of its most recent 10-year period. Within each 10-year period, the first 8 years of data are used to train the model, and the last 2 years are used for validation (for neural network related methods only). Thus, every two consecutive training periods overlap by 5 years. Figure 2 illustrates the data splitting procedure. The first training sample, Jan 1979, as shown in the figure, uses the historical data from Jul 1977 (lag 18) to Dec 1978 (lag 1) as input features.
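The rolling split described above can be enumerated programmatically. A sketch under the stated assumptions (calendar-year boundaries; six 5-year test windows, each preceded by 8 training years and 2 validation years):

```python
def rolling_splits(first_test_year=1989, last_test_year=2018,
                   test_len=5, train_len=8, val_len=2):
    """Enumerate (train, val, test) year ranges for the rolling
    evaluation: each 5-year test window is preceded by a 10-year
    window split into 8 training years and 2 validation years."""
    splits = []
    for test_start in range(first_test_year, last_test_year + 1, test_len):
        train_start = test_start - (train_len + val_len)
        val_start = test_start - val_len
        splits.append({
            'train': (train_start, val_start - 1),
            'val':   (val_start, test_start - 1),
            'test':  (test_start, test_start + test_len - 1),
        })
    return splits
```

The first split trains on 1979-1986, validates on 1987-1988, and tests on 1989-1993; consecutive training windows overlap by 5 years, matching Figure 2.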
We consider different scenarios by varying the input features and data preprocessing procedures. As shown in Table 2, 'S' indicates that the inputs go through the cross-section standardization preprocessing and 'US' indicates that the inputs are not standardized. 'Lag18' and 'FDMT' represent the lagged 18 returns and the fundamental variables, respectively. '*' can be replaced by the name of each method. For example, DNN_S_Lag18_FDMT denotes the Deep Neural Network whose inputs include the standardized lag 1 to lag 18 returns and the latest fundamental variables.

C. BASELINE METHOD
We compare our method with the following six competing baselines:
• Naive Momentum Strategy (Naive_MM) [12]
• Ordinary Least Squares (OLS)
• Support Vector Machine (SVM)
• Random Forest Regression (RF)
• Long Short-Term Memory (LSTM)
• Fully-Connected Deep Neural Networks (DNN)
The conventional momentum strategy (Naive_MM), proposed by Chan et al., ranks stocks by the average of the past t−2 to t−12 monthly returns, and then longs the stocks in the winner decile (top 10%) and shorts the stocks in the loser decile (bottom 10%) [12]. Fitting an OLS model is another popular technique in financial analysis. During the last decade, SVM and RF have exhibited state-of-the-art performance in both classification and regression tasks [44]-[48]. In addition, we also compare our model with traditional deep learning-based models: Long Short-Term Memory (LSTM) and the Deep Neural Network (DNN). The LSTM takes the historical lagged return time series as input, while the DNN uses both lagged returns and fundamental variables. In summary, we conduct experiments on both traditional methods (Naive_MM, OLS, SVM, and RF) and deep learning-based methods (LSTM and DNN) to comprehensively evaluate the competing methods.
We tune the hyper-parameters of each model based on the validation performance. Specifically, for SVM, we grid search C in {0.1, 1.0, 10} and γ in {0.0001, 0.001, ...}. Figure 1 shows the structure of the proposed LSTM-DNN model.

D. EVALUATION METRIC
Following previous studies, the monthly average return (MAR) is used to evaluate the performance of the competing models [5], [19], [22], [27], [32]. Mathematically,

MAR = (1/T) Σ_{t=1}^{T} R_{p,t},

where R_{p,t} is the return of the long-short portfolio in month t and T is the number of months in the testing period. In addition, the Sharpe Ratio (SR), which is popular in the finance literature, is considered as the second evaluation metric. Formally,

SR = (R_p − R_f) / σ_p,

where R_p is the average portfolio return, R_f is the 1-month Treasury bill return, and σ_p represents the standard deviation of the portfolio return.
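Both metrics are a few lines of numpy (function names are ours; inputs are monthly portfolio and risk-free return series over the test period):

```python
import numpy as np

def mar(portfolio_returns):
    """Monthly average return (MAR) of the strategy over the test period."""
    return float(np.mean(portfolio_returns))

def sharpe_ratio(portfolio_returns, rf_returns):
    """Sharpe Ratio: average excess return over the 1-month T-bill rate,
    divided by the standard deviation of the portfolio return."""
    r = np.asarray(portfolio_returns, dtype=float)
    rf = np.asarray(rf_returns, dtype=float)
    return float((r.mean() - rf.mean()) / r.std())
```

MAR rewards raw profitability, while the Sharpe Ratio penalizes volatile strategies, which is why the two metrics can rank cutting percentages differently in Section VI-C.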

V. RESULT
Following Gu et al. [5], we calculate one-month-ahead out-of-sample stock return predictions for each method at the end of each month. For each model, we sort the stocks in ascending order based on the predictions. The ordered stocks are equally divided into 10 buckets, denoted as Deciles 1-10 (see the first column of Table 3). Specifically, Decile 1 and Decile 10 represent the stock sets with the lowest and highest predicted returns, respectively. The equity long-short strategy takes long positions in the stocks that are expected to increase in value (Decile 10) and short positions in the stocks that are expected to decrease in value (Decile 1) to earn the spread. The profit of the equity long-short strategy is presented in Table 3. To clearly illustrate the results, we calculate the average of the actual monthly returns for the stocks in each bucket. Here we make a few remarks. Firstly, Decile 1 and Decile 10 obtained from the predictions of LSTM-DNN have the lowest and highest actual returns, respectively, which demonstrates that, compared with the other methods, LSTM-DNN delivers more accurate monthly return predictions. Secondly, LSTM-DNN also provides better results for the middle deciles (Deciles 2-9). As shown in Table 3, Deciles 2-9 obtained via LSTM-DNN have monotonically increasing returns in most cases. With standardized time-series data and fundamental variables (Table 3: Standardized: Lag 18 + FDMT), LSTM-DNN generally achieves increasing returns from Decile 1 to Decile 10. In contrast, the results obtained via the other methods are much worse: some predicted lower deciles have much larger actual returns than the predicted higher deciles (for example, Deciles 8 and 9 achieved by SVM have much lower actual returns than Deciles 1-2). Thirdly, the profit of the equity long-short strategy is maximized by using the proposed LSTM-DNN method.
With the standardized inputs (Section VI-B) and the proposed LSTM-DNN model, we could achieve the highest portfolio-level profit of 3.556%. Table 3 also empirically demonstrates the contribution of fundamental variables. When integrated with fundamental features, we have higher returns in Decile 10 and lower returns in Decile 1. The observation is consistent across different methods including OLS, DNN, and LSTM-DNN, strongly indicating that fundamental information could help the model make more accurate predictions and build better portfolios. Interestingly, the predicted 'winner' portfolio (Decile 10) constructed by LSTM-DNN framework achieves 3.329% monthly return, whereas the predicted 'loser' portfolio has a negative monthly return (i.e., −0.227%). These results suggest that the statistical inference of LSTM-DNN contributes to the final profit from both 'long' and 'short' sides, when we apply the equity long-short strategy.
Since we use a batch sliding window for evaluation, Table 4 presents the monthly average return (MAR) and Sharpe Ratio (SR) for the different methods with standardized/unstandardized inputs in the different testing periods. The LSTM-DNN model with cross-section standardized inputs demonstrates the best performance, measured by MAR and SR, in most batches and yields strongly comparable performance in the other batches. By comparing the upper panel and lower panel, we observe that the cross-section normalization procedure generally benefits all the models. Figure 3 visualizes the cumulative profit for each method over the entire out-of-sample test period. For each method, we start trading with 1 dollar on Jan 1st, 1989 and calculate the cumulative value of the portfolio for each month. To avoid a crowded figure, all methods presented in the figure use the standardized inputs. The period of the global financial crisis in 2008 is highlighted in gray in Figure 3. As shown in Figure 3, the LSTM-DNN model achieves the highest cumulative return over the whole testing period (1989-2018). In addition, adding fundamental features yields better predictions, as all the methods with 'Lag18_FDMT' are more profitable than the methods with 'Lag_18'. It should be noted that the investment experiment is conducted under simplified scenarios, and external factors (e.g., transaction fees) are not considered. In the future, we would like to build more sophisticated simulation frameworks to provide more accurate analytical results.

VI. DISCUSSION
In this section, we conduct several discussions related to the performance of the LSTM-DNN model. We first check the robustness of the performance by examining its profitability during the worst recession periods. Then, we examine whether the cross-section normalization procedure really enhances the model performance. We also show that the arbitrary 10% cutting point does not give our LSTM-DNN model any advantage. Finally, we use a Saliency map [49], [50] to illustrate the feature importance and compare it with the results from linear and nonlinear models.

A. RISK ANALYSIS
Although our LSTM-DNN model has larger investment returns than the competing models, it remains to be determined whether it is a good trading strategy for most investors. To answer this question clearly, we examine two further aspects of our strategy. First, investors have different expectations about portfolio performance in different periods. For example, a strategy is unacceptable to most investors if it delivers extremely bad performance in a financial recession, even though it outperforms other strategies in other periods. Second, we check whether the LSTM-DNN strategy bears higher systematic risks. According to the efficient market hypothesis, a strategy can earn much higher returns simply by leveraging its exposure to systematic risks. 4 For the first point, Table 5 shows the performance of the different methods during the worst 15 months, measured by the S&P 500 index, in our entire testing period. All three methods perform much better than the market in these months, while LSTM-DNN (LD) performs best in most of them. Our LSTM-DNN strategy delivers an average return of 3.56% per month, whereas the S&P 500 index suffers heavy losses during these worst periods of the market. The worst return of the LSTM-DNN strategy during these months is −2.7%. In contrast, DNN and OLS fare much worse, with returns of −13.63% and −9.57%, respectively, in their worst months. Therefore, the strategy based on the LSTM-DNN model should be more attractive to investors.
For the second aspect, we find that the risk-adjusted LSTM-DNN strategy has similar performance compared with its original returns across the whole testing period. Specifically, we regress the return of the LSTM-DNN model on the Fama-French three factors, and the resulting intercept of the regression is the risk-adjusted return of the strategy. Thus, it is very likely that the LSTM-DNN model obtains better performance by extracting more complex interactions among factors rather than simply leveraging the risk exposures to common risk factors.
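The risk adjustment described above is an OLS regression of the strategy's returns on the three Fama-French factors, with the intercept read off as the risk-adjusted (abnormal) return. A minimal sketch (the function name is ours; the factor matrix has one column each for MKT−RF, SMB, and HML):

```python
import numpy as np

def risk_adjusted_alpha(strategy_returns, factor_matrix):
    """Regress strategy returns on the Fama-French three factors via OLS;
    the intercept is the risk-adjusted return (alpha) of the strategy."""
    y = np.asarray(strategy_returns, dtype=float)
    X = np.column_stack([np.ones(len(y)), factor_matrix])  # prepend intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0])  # intercept = alpha
```

If the alpha stays close to the strategy's raw average return, as reported above, the profits cannot be attributed to passive loadings on the common risk factors.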

B. IMPACTS OF CROSS-SECTION STANDARDIZATION
To demonstrate the advantage that cross-section standardization brings to long-short portfolio construction, we compare the average time-series returns of our model and the baseline models with and without the standardization preprocessing. Figure 4 visualizes the results. Cross-section standardization improves every strategy. More interestingly, without standardization the LSTM-DNN model and the linear model achieve indistinguishable returns, while the DNN model performs much worse.
One possible reason is that the DNN treats the time-series and cross-section variables in the same manner and treats all input features as independent of each other. Consequently, the DNN has many more parameters, which yields a more complex optimization landscape in which a good local optimum is hard to find. In contrast, the LSTM part of the proposed model has far fewer parameters thanks to parameter sharing across time steps.

C. CHOICE OF CUTTING PERCENTAGE
To avoid any advantage that an ad hoc setting might bring to the LSTM-DNN model, we calculate both the average monthly return and the Sharpe Ratio of the best performing strategies over a wide range of cutoffs. The Sharpe Ratio is a more comprehensive metric, as it accounts for both return and volatility.
From Figure 5, we observe that LSTM-DNN consistently performs better than the other models. Note that all three models reveal a similar pattern in both return and Sharpe Ratio. All three methods achieve a better average return as the cutting percentage shrinks, earning an astonishing 8% average monthly return at a cutting percentage of 1%, i.e., shorting the bottom 1% and longing the top 1% of stocks. Such a concentrated portfolio can only achieve much higher returns if the predictions are accurate, and the drawback is that the variance of a smaller set of stocks is usually larger. In contrast, the Sharpe Ratio, which further accounts for the volatility of the portfolio return, peaks around a 10% cutting percentage for all three methods and decreases dramatically as the cutting percentage falls below 10%. Thus, Figure 5 justifies the fairness of using 10% as the cutting percentage in our study.
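The cutoff experiment can be sketched as follows. The forecasts are simulated with a fixed signal-to-noise ratio purely to reproduce the qualitative pattern (a 1% cutoff raises the average return, while a wider cutoff wins on Sharpe Ratio); the numbers are not our actual predictions.

```python
import numpy as np

def long_short_return(predicted, realized, cut_pct):
    """Equal-weight long-short return for one cross-section.

    Long the top `cut_pct` of stocks ranked by predicted return,
    short the bottom `cut_pct`; the spread is the monthly return.
    """
    n = max(int(len(predicted) * cut_pct), 1)
    order = np.argsort(predicted)
    return realized[order[-n:]].mean() - realized[order[:n]].mean()

def sharpe_ratio(monthly_returns, risk_free=0.0):
    """Annualized Sharpe Ratio of a monthly return series."""
    excess = np.asarray(monthly_returns) - risk_free
    return np.sqrt(12.0) * excess.mean() / excess.std(ddof=1)

# Simulated forecasts: informative but noisy (hypothetical numbers).
rng = np.random.default_rng(1)
rets_01, rets_10 = [], []
for _ in range(60):                                    # 60 test months
    pred = rng.normal(size=500)                        # 500 stocks
    real = 0.5 * pred + rng.normal(scale=0.5, size=500)
    rets_01.append(long_short_return(pred, real, 0.01))
    rets_10.append(long_short_return(pred, real, 0.10))
```

Under this simulation the 1% portfolio earns a higher average monthly return, but the 10% portfolio, averaging over many more stocks per leg, has lower volatility and therefore a higher Sharpe Ratio.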

D. FEATURE IMPORTANCE ANALYSIS
Although we have no direct way to demonstrate the nonlinear interactions among the time-series features and firm fundamental variables, we can measure the agreement and disagreement on relative feature importance between the LSTM-DNN model and the benchmark OLS model. Figure 6 shows the relative feature scores of the two methods across different time periods, computed with the Saliency Map technique [51]. Overall, we find that among the many factors, the 1-month lagged return, the book-to-market ratio, and a few industry categories outweigh the other factors in most periods. Clearly, even for a single method, the relative feature importance varies across periods. For example, a firm in the agriculture industry is expected to earn higher-than-average returns, except in the most recent half-decade when measured by the OLS model. These differences may reflect structural changes in the market caused by policymakers or investors.
The disagreement between LSTM-DNN and OLS, however, is of more interest to us. For example, the LSTM-DNN model places more weight on the lagged returns than the OLS model in most periods. Intuitively, this is because the LSTM has an advantage in capturing meaningful information from time-series data. The two models also disagree on the importance of the categorical variables. We further find a diminishing trend in the factors' importance over recent decades, which echoes the hypothesis that the market is becoming more efficient. It is important to note that we employ the same set of information and the same data preparation procedure for all models, which guarantees a consistent comparison of feature importance.
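Saliency-style importance is the gradient of the predicted return with respect to each input, averaged over the sample. A minimal sketch for a hypothetical one-hidden-layer tanh network (a stand-in for the trained LSTM-DNN, with made-up weights) is:

```python
import numpy as np

def saliency_scores(X, W1, b1, W2):
    """Gradient-based saliency for a network y = tanh(X @ W1 + b1) @ W2.

    A feature's score is the average absolute gradient of the output
    with respect to that input: large scores mean small changes in
    the feature move the predicted return a lot.
    """
    H = np.tanh(X @ W1 + b1)                 # hidden activations
    # Chain rule: dy/dX = ((1 - H**2) * W2.T) @ W1.T
    grad = ((1.0 - H ** 2) * W2.T) @ W1.T    # (n_samples, n_features)
    return np.abs(grad).mean(axis=0)

# Hypothetical weights: feature 0 drives the output, feature 1 is ignored.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
W1 = np.array([[1.0, 1.0],
               [0.0, 0.0],
               [0.3, 0.2]])
b1 = np.zeros(2)
W2 = np.array([[0.8], [0.5]])
scores = saliency_scores(X, W1, b1, W2)
```

In this toy setup the ignored feature receives a zero score and the dominant feature the largest one; recomputing the scores on data from different periods gives the period-by-period importance profiles compared in Figure 6.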

VII. CONCLUSION
In this paper, we propose a hybrid neural network framework that integrates both time-series data and stable fundamental features for stock return prediction. A cross-section data normalization method is introduced to enrich the signal of each stock's relative performance. We conduct various experiments on the CRSP data and show that the proposed method achieves better performance than existing methods for the long-short portfolio strategy. By comparing linear modeling with the proposed framework, we empirically demonstrate that the nonlinear interactions between the fundamentals and lagged returns have a positive impact on the prediction results. Finally, we conduct comprehensive examinations from multiple perspectives, such as the robustness of the strategy and the long-short cutting percentage, to demonstrate its overall advantages. In short, we empirically show that the higher return obtained by our method is not achieved by taking higher risks or by picking a favorable threshold.
Potential future work can focus on incorporating more fundamental features and longer historical time series. Note that intra-day momentum may behave differently from monthly momentum, and long-short equity strategies can differ in many ways in which the nonlinear relations among factors are nontrivial to capture; flexible deep neural networks may be a feasible tool for uncovering them.