How to Handle Data Imbalance and Feature Selection Problems in CNN-Based Stock Price Forecasting

Stock market forecasting is a time series problem that aims to predict the future price or direction of an index or stock. Stock data contain high uncertainty and are influenced by many factors; hence, this goal is not easy to achieve with traditional time series methods. In the literature, convolutional neural network (CNN) models have been used for stock market forecasting with successful results. However, these models suffer from data imbalance caused by labeling and from feature selection problems. Hence, this study proposes a new rule-based labeling algorithm and a new feature selection approach to address these issues. In addition, a CNN-based model that predicts the next day’s trade action of stocks in the Dow30 index was constructed to check the effectiveness of the labeling and feature selection approaches. Different image-based input variable sets were created from technical indicators, gold, and oil price data to feed the CNN model. The prediction performance of the CNN models was compared with other studies in the literature. The experimental results showed that the CNN prediction model using the proposed feature selection and labeling approaches achieves 3-22% higher accuracy than the CNN-based models in other studies. Also, the proposed labeling approach is more successful than Chen and Huang’s data weighting approach at solving the stock data imbalance problem: it reduced the ratio between the most and least frequent labels from 15 times to 1.8 times.


I. INTRODUCTION
Building a forecasting model that analyzes current data and predicts future behavior has become one of the most important issues in the academic and business world, especially in the money/financial markets, where accurate forecasts translate into returns. Money markets consist of financial assets such as stocks, indices, and futures contracts, and all these assets are treated as financial time series. Many political, social, and economic factors affect financial assets, such as uncertainty and volatility in markets, political events, the general economic situation, movements in other countries' markets, and investors' expectations. All these factors cause nonlinear behavior in the prices of financial assets. Therefore, it is very hard to predict future behavior in financial time series. In this direction, regression, decision trees, support vector machines, and artificial neural networks have been used frequently among artificial intelligence techniques, and deep learning methods have come to the forefront in recent years.
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.

A. LITERATURE REVIEW
Stock market forecasting is a time series problem that estimates a possible future direction or price value based on historical price data. However, stock market data are affected by many factors and are not linear, which makes it challenging to perform this estimation effectively with traditional time series methods. Nevertheless, the autoregressive integrated moving average (ARIMA) model is often selected from the time series methods for stock market/share forecasting, and its success is compared with other methods [1]-[4]. Selvin et al. estimated the prices of stocks belonging to three companies listed on the NSE stock market. Among deep learning methods, recurrent neural networks (RNN), long short-term memory (LSTM), and convolutional neural networks (CNN) were used to perform the prediction. In addition, ARIMA models were used as the time series analysis method. When they compared the performance of the four models, they found that the deep learning models were more successful than the ARIMA model [4]. In addition to time series methods, machine learning methods such as support vector machines, decision trees, and artificial neural networks are also applied to stock price prediction problems [1], [2], [5]-[8]. Vijh et al. estimated the next-day closing price of five stocks on the New York stock market using artificial neural networks (ANN) and random forests. When they evaluated the performance of the prediction models, they saw that the ANN model was more successful [5]. In another study, Parmar et al. estimated the future value of a company's stock using LSTM and regression methods. They found that LSTM was the more successful of the two [9].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In contrast to statistical and machine learning methods, deep learning methods have become prominent in stock market forecasting in recent years [10]-[13], and LSTM is frequently preferred among these methods [14]-[18]. Stock market data form a time series, and LSTM learns successfully from time series inputs. Hossain et al. proposed a hybrid model to estimate the next day's closing price of the S&P500 index. This hybrid model consists of LSTM and gated recurrent unit (GRU) methods [15]. In another study, Du et al. proposed an LSTM forecasting model to predict the next day's Apple stock closing price. They used two different sets of input variables for the proposed model. When comparing the performance of the models, they saw that the model created using multiple variables was more successful [16]. In another study, Ji et al. proposed a hybrid model (IPSO-LSTM) for estimating the Australian stock market (ASM) index. This hybrid model consists of a new particle swarm optimization (PSO) method and LSTM. They observed that the proposed model is more successful than support-vector regression, LSTM, and PSO-LSTM models [17].
In recent years, the CNN method, which is widely used in image recognition because of its strong pattern recognition ability, has also been applied to financial forecasting, and its scope has expanded [19]-[24]. Gunduz et al. predicted the daily movement direction of three stocks in BIST30 with the CNN method. They investigated the effect on the forecasting model of technical indicators calculated from gold and dollar prices. Their experiments showed that using dollar-gold attributes in addition to price attributes improves classification performance for all three stocks [19]. Alhazbi et al. investigated whether external factors such as the oil price affect the stock market direction and predicted the daily movement of the Qatar Stock Exchange using the CNN method. They stated that adding these external factors to the stock market data increases model performance [20]. In another study, Sezer and Ozbayoglu proposed a new prediction model called CNN-TA that uses a 2-D CNN based on image processing. They aimed to predict the next day's buy/sell/hold position for Dow30 stock prices and Exchange-Traded Fund (ETF) prices with the proposed model. The financial data were converted into 15 × 15 images as input variables of the CNN-TA model. In addition, a sliding window training-test approach was adopted. They evaluated the performance of the CNN-TA model on the last ten years of data [21]. In another CNN-based study using the labeling algorithm in [21], Chen and Huang [22] proposed two different labeling algorithms to predict the future S&P500 index. Technical indicators, gold prices, the gold price volatility index, crude oil prices, and the crude oil price volatility index were determined as input features of the prediction model. They compared the performance of their proposed models, called CNN8 and LSTM8, with the CNN-TA [21] model in terms of accuracy, precision, recall, and F1 score, and these models outperformed the CNN-TA model. They also concluded that considering oil prices, the oil volatility index, gold prices, and the gold volatility index as input features, instead of only technical indicators, positively affects stock market forecasting. Sim et al. [23] pointed out that using many technical indicators did not positively affect the prediction models. Another model, called TI-CNN and based on CNN and technical indicators, was proposed by Chandar [24] to estimate the stock trade action. He calculated ten technical indicators from daily stock prices and used the Gramian Angular Field (GAF) method to obtain images from these technical indicators. The performance of the proposed model was compared with three earlier studies [21], [25], [26] in the literature. As a result of the comparisons, he saw that the TI-CNN model was better than the other models.
The determination of the model's input variables in stock price estimation studies, in other words the feature selection process, directly affects model performance. Within this context, most studies use technical indicators and the stocks' historical data. However, recent studies argue that using these indicators does not improve the model, since technical indicators behave similarly to the closing prices of stocks [22], [23]. In this context, it has been proposed that using external factors such as gold and oil prices may positively affect the forecasting model's performance.
Another factor affecting the prediction model's performance is an imbalanced dataset. This problem is encountered in many fields, such as image classification [27], speech recognition [28], and natural language processing [29]. It also occurs in stock price prediction studies [22], [30], [31]. There are various approaches to handling the imbalance problem in the literature. Some studies use complex methods, while others use more straightforward approaches such as copying [32] and weighting data [22].
In this study, we proposed a CNN-based model to predict the next day's trade actions of the stocks in the Dow30 index. While developing this model, a rule-based labeling method was proposed to solve the stock data imbalance problem. We also analyzed and determined the input parameters that positively affect the performance of the stock price prediction model, which provides the basis for the feature selection approach.

B. MOTIVATION AND CONTRIBUTIONS
The prediction models in 2D-CNN-based stock price/market forecasting studies have achieved successful results in recent years. However, problems such as data imbalance and feature selection still need to be addressed. Therefore, the motivation of this study is to identify these problems and find solutions. The contributions of this research are as follows:
• Using only technical indicators for the prediction model was seen not to have a positive effect on model performance. Therefore, external factors such as gold and oil prices were used for the proposed model, and different sets of input variables were created to examine their effect on the model.
• Each trading day was labeled as buy/sell/hold using stock price data. Then, we created images of each trading day for the different sets of input variables. To solve the data imbalance problem with a simpler approach, we examined the labels of the stock's last five days and proposed a new rule-based labeling algorithm to predict the next day's trade action.
• We referenced the input variables used in [21] and investigated whether all of the period values (6-20 days) determined for the 15 technical indicators are necessary. Considering other studies in the literature, the experiments conducted, and practical applications, the use of 7, 14, and 21-day period values was observed to positively affect the model's success.
As a result of these contributions, we observed that the proposed stock price prediction model outperforms other studies [21], [22], [24] in the literature.

C. ORGANIZATION
The remaining parts of the study are organized as follows: In Section II, the proposed CNN model and how the input dataset for this model is generated are detailed. In this context, information about the dataset, labeling algorithm, feature selection, and image creation are discussed. The experimental results from the study are detailed in Section III. The last section consists of the conclusion and future works.

II. METHOD
In this study, we proposed two approaches to solve the data imbalance problem and determine more meaningful features for the model. Accordingly, we observed the performances of the approaches on a CNN-based model for forecasting the next day's trade action (buy, sell, hold) of Dow30 stocks.

A. DATASET
The daily price data of Dow30 stocks from 01/01/2008 to 15/12/2021 were obtained from finance.yahoo.com. This dataset includes the open, close, adjusted close, high, low, and volume values of these stocks. The adjusted closing price was used to label the images as buy, sell, or hold, and the other price data (close, high, low, volume) were used to calculate the technical indicator values. In addition, the daily closing prices of gold and oil in the same period were obtained from the same website. We used both the sliding window and the traditional (standard) approaches for the training and test processes. In the sliding window approach, we chose the first five-year period for training and the following year for testing, i.e., training period 2008-2012 and testing period 2013. Then we shifted the training and testing periods by one year, i.e., training period 2009-2013 and testing period 2014. In the traditional approach, 70% of the obtained stock data was used for training and the remainder for testing.
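The sliding window schedule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sliding_windows` and its parameters are hypothetical, and it assumes the shifting continues until the last calendar year in the dataset (2021, which is only covered up to 15/12/2021).

```python
def sliding_windows(first_year, last_year, train_years=5):
    """Return (training_years, test_year) pairs, shifted by one year per step.

    Assumption: every window uses `train_years` consecutive calendar years
    for training and the single following year for testing.
    """
    windows = []
    start = first_year
    while start + train_years <= last_year:
        train = list(range(start, start + train_years))
        test = start + train_years
        windows.append((train, test))
        start += 1
    return windows


windows = sliding_windows(2008, 2021)
for train, test in windows:
    print(f"train {train[0]}-{train[-1]} -> test {test}")
```

The first two windows reproduce the periods named in the text (2008-2012 with 2013, then 2009-2013 with 2014).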

B. LABELING
After creating the dataset for the model, each trading day was labeled as "Buy", "Sell", or "Hold" with the labeling algorithm proposed by Sezer and Ozbayoglu. This algorithm uses an 11-day sliding window. Each window is labeled according to its midpoint, in other words, the closing price of the 6th day. If the 6th day's closing price was the highest in the 11-day window, the day was labeled "Sell". If the 6th day's closing price was the lowest, it was labeled "Buy". Otherwise, it was labeled "Hold".
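The midpoint rule above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation from [21]: the function name `label_days` is hypothetical, days without a full 11-day window are left unlabeled (`None`), and a completely flat window is treated as "Hold"; the text does not specify either edge case.

```python
def label_days(closes, window=11):
    """Label each day Buy/Sell/Hold from the midpoint of a sliding window."""
    half = window // 2
    labels = [None] * len(closes)  # assumption: edge days stay unlabeled
    for mid in range(half, len(closes) - half):
        win = closes[mid - half: mid + half + 1]
        highest, lowest = max(win), min(win)
        if highest == lowest:
            labels[mid] = "Hold"   # assumption: flat window carries no signal
        elif closes[mid] == highest:
            labels[mid] = "Sell"   # midpoint is the local top
        elif closes[mid] == lowest:
            labels[mid] = "Buy"    # midpoint is the local bottom
        else:
            labels[mid] = "Hold"
    return labels


# Made-up prices: a peak and a valley, each centered in an 11-day window.
peak = label_days([1, 2, 3, 4, 5, 9, 5, 4, 3, 2, 1])
valley = label_days([9, 8, 7, 6, 5, 1, 5, 6, 7, 8, 9])
print(peak[5], valley[5])
```

Because only one day in an 11-day window can be the strict top or bottom, most days fall into "Hold", which is the source of the imbalance discussed next.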
We labeled the datasets used in this study for both approaches with the labeling algorithm proposed in [21]. Then, we examined the labels of the datasets to be used in the sliding window approach. We found that the number of data labeled as Hold was approximately 15 times the number labeled as Buy or Sell. To solve this data imbalance problem, we proposed a new rule-based labeling algorithm built on the labeling algorithm given in [21]. The proposed rule-based labeling algorithm is outlined below and given in Algorithm 1:
• Firstly, the labeling algorithm proposed in [21] was applied to the dataset. Thus, each trading day in the dataset is labeled as "Buy", "Sell", or "Hold".
• Then, these labeled data are combined according to the past five trading days. The label values obtained from this merging were evaluated, and the data for the past five days were relabeled according to the rules given in Algorithm 1.
The performance of the proposed labeling algorithm is discussed in Section III.

C. FEATURE SELECTION
It is essential to determine the input variables that the model will use to predict the next day's trade action. For this purpose, the factors affecting stock prices must be analyzed. Most of the proposed forecasting models use technical indicators in addition to stock prices. However, some studies in the literature [22], [23] state that these indicators do not positively affect the performance of the forecasting model because some technical indicators behave like the stock's closing prices. In addition, it has been stated that external factors such as gold and oil prices may have more effect on prediction performance. Within this scope, we performed various experiments and analyzed practical applications to create different input variable sets. They were created as follows:
• … [21], were handled. Then, gold and oil price data were added to these features.
• Fifteen different period values for the 15 technical indicators in CNN-TA were investigated to decide which periods are meaningful for the proposed forecasting model. As a result, only the 7, 14, and 21-day period values of these technical indicators were chosen. Then, gold and oil price data were added to these features.
• The correlation-based feature selection method was applied by considering 15 different periods of each technical indicator used in the CNN-TA model. As a result, 11 feature values were found to be more significant. Then, gold and oil price data were added to these features.
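The text does not say which correlation-based selection procedure was used, so the following is only one plausible sketch, assuming a simple Pearson filter: rank features by their absolute correlation with a target series and drop any feature that is nearly collinear with one already kept. The function names, the feature names (`rsi_7`, `rsi_8`, `wr_14`), the threshold, and the data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 0.0 if sd_x == 0 or sd_y == 0 else cov / (sd_x * sd_y)


def select_features(features, target, max_redundancy=0.9):
    """Greedy filter: most relevant first, skip near-duplicates of kept ones."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < max_redundancy
               for k in kept):
            kept.append(name)
    return kept


# Made-up indicator values: rsi_7 and rsi_8 move almost identically.
features = {
    "rsi_7": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rsi_8": [1.1, 2.0, 3.1, 4.0, 5.1],
    "wr_14": [5.0, 1.0, 4.0, 2.0, 3.0],
}
target = [1.0, 2.0, 3.0, 4.0, 6.0]
kept = select_features(features, target)
print(kept)
```

In this toy case only one of the two nearly identical RSI periods survives, which mirrors the idea that adjacent period values of the same indicator add little new information.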

D. IMAGE CREATION
Various experiments were carried out with the new labeling algorithm to determine the stock's next-day trade action. These experiments showed that considering the features of the last five days is more successful. These features consist of the technical indicators, with or without the daily closing prices of gold and oil obtained from Yahoo Finance. We constructed a matrix in which the columns represent features and the rows represent daily values for each selected period (Table 1). Then, images were created and labeled according to Algorithm 1 to determine the stock's next-day trade action. Figure 1 shows sample images with a dimension of 75 × 15 for each trade action label. The next day's trade action of the stocks was estimated using these images and the CNN model detailed in the following subsection.
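One way to read the 75 × 15 dimension is as five daily blocks of 15 period rows by 15 indicator columns stacked vertically. That reading is an assumption, as is everything named below (`build_image`, `day_blocks`, the dummy values); the sketch only illustrates how such a matrix could be assembled.

```python
def build_image(day_blocks, day_index, lookback=5):
    """Stack the per-day (periods x features) blocks of the last `lookback` days.

    Assumption: rows of each block are period values and columns are
    features, so 5 days x 15 periods gives 75 rows.
    """
    if day_index + 1 < lookback:
        raise ValueError("not enough history for the requested lookback")
    rows = []
    for day in range(day_index - lookback + 1, day_index + 1):
        rows.extend(list(row) for row in day_blocks[day])
    return rows


# Dummy data: 6 trading days, each with 15 periods x 15 indicators.
day_blocks = [[[float(day)] * 15 for _ in range(15)] for day in range(6)]
image = build_image(day_blocks, day_index=5)
print(len(image), len(image[0]))

# Appending gold and oil closes as two extra columns would give 75 x 17.
image_ext = [row + [1800.0, 70.0] for row in image]
print(len(image_ext), len(image_ext[0]))
```

The 75 × 17 variant matches the dimension mentioned in the results section for the feature set that includes gold and oil prices.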

E. CNN MODEL
In this study, the CNN architecture proposed by Sezer and Ozbayoglu was used to test the performance of the labeling and feature selection approaches. This architecture is given in Figure 2 and was adapted to the different input image sets we created. The CNN architecture consists of 8 layers: an input layer, two convolutional layers, a max-pooling layer, two dropout layers, a fully connected layer, and an output layer. For the two convolutional layers, the filter size was 3 × 3 and the number of filters was 32 and 64, respectively. The convolution operation on two-dimensional images is performed using Equation 1, where K and I denote the kernel and the input image, respectively. After these layers, a max-pooling layer with a 2 × 2 filter size was used to select the highest values in the convolutional matrix. In addition, two dropout layers (0.25, 0.5) were added to prevent overfitting. With these layers, a deep neural network architecture is built. Equation 2 gives the details of the neural network architecture, where W, x, and b denote the weights, input, and bias, respectively. The fully connected layer was used to flatten and connect the neurons to the next layer. Finally, the softmax function given in Equation 3 (y denotes the output) was selected as the activation function in the output layer to perform the classification [21].
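To make the layer sizes concrete, the sketch below traces the feature-map shape through the two 3 × 3 convolutions and the 2 × 2 max pooling, and implements the standard softmax. It assumes unpadded ("valid") convolutions with stride 1 and non-overlapping pooling, which the text does not state; the helper names are hypothetical and no deep learning library is used.

```python
from math import exp


def conv_out(h, w, k=3):
    """Output height/width of a stride-1 convolution without padding."""
    return h - k + 1, w - k + 1


def pool_out(h, w, p=2):
    """Output height/width of non-overlapping p x p max pooling."""
    return h // p, w // p


def trace_shapes(h, w, filters=(32, 64)):
    """List (layer, shape) pairs for conv -> conv -> maxpool -> flatten."""
    shapes = [("input", (h, w, 1))]
    for i, n_filters in enumerate(filters, start=1):
        h, w = conv_out(h, w)
        shapes.append((f"conv{i}", (h, w, n_filters)))
    h, w = pool_out(h, w)
    shapes.append(("maxpool", (h, w, filters[-1])))
    shapes.append(("flatten", (h * w * filters[-1],)))
    return shapes


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


for layer, shape in trace_shapes(75, 15):
    print(layer, shape)
print(softmax([2.0, 1.0, 0.1]))  # three scores, e.g. Buy / Sell / Hold
```

Under these assumptions a 75 × 15 image shrinks to 73 × 13, then 71 × 11, and to 35 × 5 after pooling, so the fully connected layer would see 35 × 5 × 64 values.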

F. PERFORMANCE MEASURE
The confusion matrix, an example of which is given in Table 2, is frequently used to evaluate model performance in classification problems. The information in this matrix is defined as follows [33]: … to compare models only in terms of the accuracy metric. For this reason, the F1 score, which is the harmonic mean of recall and precision, should be used.
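The metrics named here follow from the confusion matrix by their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The sketch below computes them per class for a three-class matrix; the function name and the counts are made up for illustration and are not results from this study. Rows are assumed to be actual classes and columns predicted classes.

```python
def per_class_metrics(cm, labels):
    """Return overall accuracy and per-class precision/recall/F1.

    cm[i][j] = number of samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(labels)))
    report = {}
    for i, name in enumerate(labels):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                  # actual i, predicted otherwise
        fp = sum(row[i] for row in cm) - tp   # predicted i, actually otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


# Hypothetical counts with a dominant Hold class.
cm = [
    [8, 0, 2],    # actual Buy
    [1, 7, 2],    # actual Sell
    [1, 3, 76],   # actual Hold
]
accuracy, report = per_class_metrics(cm, ["Buy", "Sell", "Hold"])
print(round(accuracy, 2))
for name, scores in report.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With a dominant Hold class, accuracy stays high even when Buy and Sell are recognized less reliably, which is why the per-class F1 score is the more informative comparison.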

III. EXPERIMENTAL RESULTS AND DISCUSSIONS
Datasets with six different image dimensions were created for the stocks in the Dow30 index with the determined features and the proposed labeling algorithm. Then, the next day's trade action of the stocks was estimated with the CNN model, using both the sliding window and the traditional training-test approaches. In addition, test datasets were created containing the last ten years of data of the stocks and the previous one-year data of ten selected stocks (AAPL, TRV, INTC, GS, WBA, CSCO, NKE, MMM, KO, AXP). The performance of the CNN model was evaluated using these datasets and compared with other studies in the literature [21], [22], [24]. The obtained results are discussed in the next two subsections. In the first subsection, the labeling algorithm is evaluated. In the second subsection, the performance of the CNN model using the feature selection method and the labeling algorithm is discussed.

A. EVALUATION OF THE LABELING ALGORITHM
When the labeling algorithm proposed in [21] is applied to the five-year training dataset of Dow30 stocks within the sliding window training-test approach, the number of data labeled as buy, sell, and hold is approximately 2200, 2180, and 30600, respectively. The number of data labeled as Hold is thus approximately 15 times the number labeled as Buy or Sell. When the rule-based labeling algorithm of this study was applied to the same dataset, the number of data labeled as buy, sell, and hold is approximately 9300, 9200, and 16400, respectively. These two results show that the proposed labeling algorithm and feature selection approaches reduced the data imbalance from 15 times to 1.8 times. Hence, the proposed approaches are very effective. In addition, the data weighting approach used in [22] to solve the data imbalance problem in [21] does not increase the amount of data but improves the classification performance.
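The imbalance ratios can be checked directly from the approximate counts quoted above. The helper name `imbalance_ratio` is hypothetical, and the ratio is assumed to mean the most frequent label count divided by the least frequent one.

```python
def imbalance_ratio(counts):
    """Ratio of the most frequent label count to the least frequent one."""
    return max(counts.values()) / min(counts.values())


# Approximate label counts reported in the text.
original = {"Buy": 2200, "Sell": 2180, "Hold": 30600}
rule_based = {"Buy": 9300, "Sell": 9200, "Hold": 16400}

print(round(imbalance_ratio(original), 1))
print(round(imbalance_ratio(rule_based), 1))
```

Because the quoted counts are rounded, the first ratio comes out near 14 rather than exactly 15, while the second rounds to the stated 1.8.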

B. EVALUATION OF THE CNN MODEL PERFORMANCE
The results of the CNN model with the sliding window training-test approach for datasets with six different image dimensions are given in Table 3. In terms of the F1 score and accuracy metrics for this approach, higher values were obtained on the 75 × 15 images. Therefore, the CNN model performed best on this image dataset. In addition, this model was compared with CNN-TA and CNN8, and the results are given in Table 4. The comparisons show that the proposed prediction model outperforms the CNN-TA model in terms of the performance metrics, owing to the proposed labeling algorithm and feature selection approach. While the accuracy of the CNN-TA model was 58%, the approaches suggested in this study improved it by approximately 22 percentage points, reaching an accuracy of 80.23%. In addition, the recall and F1 score metrics, especially for Buy and Sell, were improved by approximately 50%. The CNN8 model was also compared with the prediction model in this study. The comparisons showed that the rule-based approach proposed in this study improved model performance more than CNN8 in terms of accuracy, precision, recall, and F1 score. The proposed model was also compared with the TI-CNN model [24]. That study also used the training-test approach and the labeling algorithm in [21], and its model estimated the next day's trade action of some stocks in NASDAQ and NYSE. The comparison results are given in Table 5. The results showed that the proposed model was more successful than the TI-CNN model. In addition, the model outperformed TI-CNN in terms of accuracy and F1 score for AAPL stock and in terms of F1 score for GS. Table 6 shows the results of the traditional training-test approach for datasets with six different image dimensions. The CNN model made more successful predictions on the 75 × 15 and 75 × 17 images. In addition, it was observed that the gold and oil prices positively affected the prediction performance.
In our stock trading system, most of the "Buy", "Sell", and "Hold" points are captured correctly by the proposed model. The main reason is that the numbers of "Buy", "Sell", and "Hold" points are closer to each other; hence, the deep neural network catches the entry and exit points correctly. In other words, the model catches most of the "Buy" and "Sell" points (recall) while raising true alarms for existing entry and exit points (precision). As a result, the standard annualized returns of the stock trading algorithm can be expected to be higher, and better than those of the majority of the other stock trading models in the literature.

IV. CONCLUSION AND FUTURE WORKS
In this study, a new rule-based labeling algorithm was proposed to solve the imbalance problem in the dataset used for stock price prediction. In addition, a feature selection approach was proposed by investigating what should be considered when determining the input variables of a CNN-based stock prediction model. A 2D-CNN-based deep neural network was used to see the effect of the proposed labeling algorithm and feature selection approach on the prediction model, with both sliding window and traditional training-test approaches. When the obtained results were compared with the results of other prediction models in the literature, the proposed approaches were observed to have a positive effect on the model's prediction performance. These results indicate that the proposed approaches are applicable to stock price prediction.
As future work, hybrid methods may increase the success of CNN-based approaches to stock price prediction. In particular, attention-based deep learning and deep Q-learning approaches may be applied to this field.
ZİNNET DUYGU AKŞEHİR received the bachelor's degree in computer engineering from Karadeniz Technical University, Trabzon, in 2017, and the master's degree in computer engineering from Ondokuz Mayıs University, Samsun, in 2020. She is currently pursuing the Ph.D. degree in computational sciences. Her research interests include machine learning and data mining.
ERDAL KILIÇ received the bachelor's and master's degrees in electrical and electronic engineering from Karadeniz Technical University, Trabzon, in 1991 and 1996, respectively, and the Ph.D. degree in electrical and electronic engineering from Middle East Technical University, Ankara, in 2005. He is currently a Full Professor with the Department of Computer Engineering, Ondokuz Mayıs University. His research interests include neural networks, machine learning, and data mining.