Impact of Hyperparameter Tuning on Machine Learning Models in Stock Price Forecasting

Stock price forecasting has been reported as a challenging task in the scientific and financial communities due to the nonlinear and dynamic nature of stock prices. Machine learning models exhibit capabilities that allow them to handle nonlinear data, making them candidate tools for stock price forecasting. In this study, an empirical evaluation of eight conventional machine learning models is conducted to forecast the stock prices of eleven companies listed on the Saudi Stock Exchange. Moreover, the optimal hyperparameter configuration of each machine learning model is identified. Forecasting performance is evaluated by two well-known error metrics: Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). The Wilcoxon effect size is utilized to determine the impact of hyperparameter tuning by comparing the forecasting performance of tuned and un-tuned machine learning models. Empirical results indicate that hyperparameter tuning has varying impacts on the forecasting performance of machine learning models. After tuning the hyperparameters, Support Vector Regression outperforms the other forecasting models with a statistically significant difference. In contrast, Kernel Ridge Regression shows noteworthy forecasting performance without hyperparameter tuning relative to the other un-tuned forecasting models. However, Decision Tree and K-Nearest Neighbour are the poorest-performing models, demonstrating inadequate forecasting performance even after hyperparameter tuning.


I. INTRODUCTION
STOCK markets are among the most attractive places to invest in the financial markets due to their outstanding revenue opportunities, which also encompass a tremendous risk of losing colossal capital. This phenomenon is prompted by the volatile nature of daily stock prices, which depend on abundant factors such as a company's reputation and financial performance, global economic conditions, political stability, foreign policy, etc. [1]. For investors in stock markets, forecasting stock prices is a desirable and crucial task, yet it is also among the most challenging due to the frequently changing nature of stock prices. As a result, forecasting stock prices with reasonable accuracy can enable investors to boost their rewards and minimize their losses.
In Saudi Arabia, the Capital Market Authority (CMA) has regulated the Saudi Arabian capital market by enforcing capital market laws since its establishment [9]. The CMA was formed to safeguard investors and residents from fraudulent and illegal stock trading. Currently, the Saudi Stock Exchange, commonly known as Tadawul in the Kingdom, is regulated by the CMA [10]. Tadawul comprises eleven different sectors such as Energy, Materials, Information Technology, Financial organizations, etc. [11].
Time series data is formed by a set of observations with timestamps, which is the best form in which to visualize and analyze stock markets. Time series analysis involves the utilization of historical data to forecast the future. System analysts often use these forecasting models to manage and plan future events to minimize risks, maximize resource utilization, and increase profit. These forecasting techniques are commonly used to approximate stock prices and future trend directions. Nowadays, machine learning models are becoming more popular in stock price forecasting due to their time-series forecasting capabilities [12].
In general, machine learning models involve two kinds of parameters: model parameters and hyperparameters. The model parameters are those that the model learns during the training phase. In contrast, hyperparameters must be configured at the beginning of training; during training they may change (as with early stopping and learning rate decay) or remain constant. In this experiment, once configured, hyperparameters remain constant throughout the training phase. As a result, to develop a robust machine learning model, searching for the best hyperparameter setting may turn out to be crucial [13], [14]. Moreover, the default hyperparameter configuration of a machine learning model does not guarantee the best performance. Additionally, the optimal hyperparameter values of a machine learning model often depend on the dataset and problem domain [15]. Thus, a range of hyperparameter values is explored in developing an ideal machine learning model. This procedure of finding the best hyperparameter setting for a machine learning model is known as hyperparameter tuning [16]. Manually searching for the best hyperparameters is still widespread in research; however, it requires an adequate understanding of the hyperparameter settings of the corresponding machine learning models [17]. Manual tuning is sometimes impractical due to the large number of hyperparameters, time-inefficient model evaluation, and the complexity of specific problem domains. Accordingly, researchers have proposed several hyperparameter optimization techniques to automate or partially automate the hyperparameter tuning process [18].
In this study, we aim to illustrate the forecasting capabilities of eight machine learning models: Decision Tree (DT), Support Vector Regression (SVR), K-Nearest Neighbour (KNN), Gaussian Process Regression (GPR), Stochastic Gradient Descent (SGD), Partial Least Squares Regression (PLS), Kernel Ridge Regression (KRR), and Least Absolute Shrinkage and Selection Operator (LASSO). In addition, the grid search hyperparameter tuning method is implemented to identify the best hyperparameter configuration for each machine learning model. Finally, the impact of tuning the hyperparameters of these models is investigated in the context of the Saudi Stock Exchange.
The rest of this paper is organized as follows: Section II reviews the existing literature on stock price forecasting. Section III briefly discusses the principal ideas of the machine learning algorithms utilized in our empirical investigation. Section IV reports the overall configuration of our empirical study. Section V illustrates the results of this experiment with suitable analysis. Section VI points out the possible threats to validity in our experiment. Finally, in Section VII, we conclude our paper with directions for future research in this domain.

II. LITERATURE REVIEW
In this section, we review the related work conducted in stock price forecasting using machine learning models.
Forecasting models using time series analysis came into the spotlight with the widely used traditional statistical model named Autoregressive Integrated Moving Average (ARIMA) [19]. Moreover, ARIMA and its variants gained considerable interest due to their extensive usage in various forecasting activities, especially stock price forecasting [20]. For forecasting stock prices, previous studies utilized traditional models such as Linear Regression [21], ARIMA [22], Moving Average Convergence/Divergence (MACD), and Relative Strength Index (RSI) [23]. These statistical models are usually built with a linear functional structure, which makes them suffer from performance issues on real data with noise and nonlinearity [24]. On the other hand, machine learning models have shown promising performance in detecting underlying nonlinear relationships without strict distributional assumptions. In addition, machine learning models have illustrated an extraordinary capability in handling noisy real-world data [25].
Mehtab et al. [2] used various deep learning techniques to evaluate stock price forecasting performance based on execution time and RMSE with two different sliding window sizes. Their study suggests that CNN-based models performed better than LSTM-based models in forecasting stock prices in the context of India's National Stock Exchange (NSE). Azlan et al. [3] conducted a time series analysis using the clonal selection algorithm and found forecasting performance almost similar to that of ARIMA models on Yahoo stock prices. In time series forecasting, LSTM models have illustrated superior forecasting performance with a long-term confidence band [4]. Li and Bastos [26] conducted a systematic literature review that reported LSTM to be the most widely used deep learning technique in stock price forecasting. In addition, [7] conducted an empirical study using Support Vector Regression (SVR) to forecast the daily stock prices of selected companies from China, Brazil, and the USA. The study reports that using the linear kernel in SVR helps to minimize the forecasting error. Moreover, according to their research, SVR showed superior forecasting accuracy compared with a random-walk-based model. Chou and Nguyen [6] designed a time series forecasting system using Least Squares Support Vector Regression (LSSVR) with sliding-window meta-heuristic optimization to forecast stock prices. Moreover, they developed an application with a simple graphical user interface based on their proposed method to make stock price forecasting easy. Olatunji et al. [8] proposed an artificial neural network model and evaluated it using three Saudi Stock Exchange companies: STC, SABIC, and Al Rajhi Bank. They used the previous five days' stock prices to forecast the next day's stock price. They reported that their proposed model achieved a significantly low root mean square error. Shrivastav and Kumar [27] conducted an empirical study on stock price forecasting using SVR and ARIMA models.
The results illustrate the superiority of the SVR model's stock price forecasting performance compared with the ARIMA model. Table 1 presents a summary of the surveyed related studies that conducted a univariate time series analysis using different machine learning techniques to forecast stock prices. In most studies, researchers used deep neural network models like LSTM, ANN, etc., whereas only a few conventional models were investigated [26]. In this paper, we aim to fill this gap by empirically investigating the forecasting capabilities of several conventional machine learning models and performing a comprehensive empirical study within the context of the Saudi stock market (Tadawul).

III. BACKGROUND
A range of machine learning models is used in the literature to develop forecasting models nowadays. Moreover, we can conclude from the literature that the two dominant families of machine learning models are conventional models and neural networks. Conventional machine learning encompasses any core algorithmic framework used to find a solution through the use of data. These systems can learn automatically from data, with subject matter experts conducting preprocessing and selecting features to feed the algorithm. Usually, these trained machine learning models need all inputs to be structured data, such as numbers. Neural network algorithms are capable of learning key aspects from underlying data and determining which features to focus on without explicit expert identification. However, we observe that only a few conventional machine learning models have been used to forecast stock prices in the literature. Moreover, these models do not assure higher forecasting accuracy. We still investigate these models due to their straightforward implementation and the ease of explaining them to non-technical individuals. In our study, we focus on evaluating various conventional machine learning models rather than neural network models. Here we briefly discuss the machine learning models used in this study:

A. DECISION TREE (DT)
In machine learning, the concept of employing a decision tree was first introduced by Quinlan as the Induction of Decision Trees, commonly known as ID3 [28]. ID3 searches for the decision tree that can correctly classify the data by generating all possible combinations of decision stumps. In classification problems, the decision tree aims to maximize the information gain, whereas in regression problems it minimizes the standard deviation or mean square error. In this study, we use decision tree regression to develop one of the proposed forecasting models.

B. SUPPORT VECTOR REGRESSION (SVR)
To solve binary classification problems, Vladimir Vapnik of AT&T Bell Laboratories first introduced the Support Vector Machine (SVM), which formulated classification as a convex optimization problem [29]. For solving regression problems, SVM was remodeled into Support Vector Regression (SVR) by specifying the ε-insensitive region around the function, which can approximate continuous values with reasonable model complexity. Moreover, SVR formulates regression as an optimization problem that seeks the tightest path possible around the data by reducing the prediction error [30]. This condition can be represented as follows [31]:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad |y_i - (w \cdot x_i + b)| \le \varepsilon,$$

where the magnitude of the weight vector $w$ used to predict the real continuous values is minimized while the predictions stay within the ε tube around the targets.
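As an illustrative sketch (not the paper's implementation), an SVR with a linear kernel fitted to a noiseless linear series keeps its training residuals close to the ε tube; the data and hyperparameter values below are hypothetical:

```python
# Sketch: epsilon-insensitive fitting with a linear-kernel SVR on y = 2x + 1.
import numpy as np
from sklearn.svm import SVR

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

model = SVR(kernel="linear", C=100.0, epsilon=0.1)
model.fit(X, y)
# On this separable problem the training residuals stay near epsilon,
# since errors smaller than epsilon incur no loss.
residuals = np.abs(model.predict(X) - y)
```

With a large `C`, deviations beyond the ε tube are penalized heavily, so the fitted line tracks the true one closely.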

C. K NEAREST NEIGHBOUR (KNN)
The K-nearest neighbour algorithm [32] is one of the earliest machine learning algorithms. It is frequently used for classification and regression due to its clarity and configurability. KNN [33] is usually referred to as a lazy learning model since it does not build any model or function from the training set. Instead, for each test set element, it finds the k most similar records in the training set. The prediction is then performed by majority voting among those k nearest records. In KNN regression analysis [33], to forecast the values in the test set, the k most similar past patterns are identified in the training set and combined.
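The neighbour-averaging behaviour described above can be verified with a tiny example (the data are illustrative, not from the study): for regression, the forecast at a query point is simply the mean of its k nearest training targets.

```python
# Sketch: KNN regression forecast = mean of the k nearest training targets.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.0, 2.0, 3.0, 10.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X, y)
# The three nearest records to 2.5 are 1, 2, and 3,
# so the forecast is their mean: (1 + 2 + 3) / 3 = 2.0
pred = knn.predict([[2.5]])[0]
```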

D. GAUSSIAN PROCESS REGRESSION (GPR)
To solve nonlinear regression problems, Gaussian Process Regression (GPR) follows a non-parametric and probabilistic approach. According to GPR, the measurements of the output variable y are produced in the following way [34]:

$$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2), \tag{4}$$

where x is the input variable, f is a function with a Gaussian Process prior, m is the mean function, k is the covariance function, and ε is Gaussian noise with variance σ². Gaussian processes are entirely characterized by their mean and covariance functions, which encode our assumptions about the process. After selecting the mean and covariance functions, the output variable is predicted in the form of a predictive Gaussian distribution.

E. LINEAR REGRESSION WITH STOCHASTIC GRADIENT DESCENT (SGD)
At present, SGD plays an essential role in different areas of machine learning [35]. It is an optimization technique that uses a noisy gradient combined with a reduced step size. Usually, the slope of a function is referred to as its gradient; it quantifies the magnitude of change in the output variable resulting from a change in an input variable. Gradient descent minimizes a convex function by following its partial derivatives with respect to a series of input parameters. Finally, SGD is the technique that extends gradient descent by performing the optimization in a stochastic manner [36]. In our experiment, we utilized SGD as a basic stochastic gradient descent learning method for fitting linear regression models. It supports a range of learning rates and penalties, which are presented in Table 4. SGD has previously been used to solve classification problems such as detecting code smells [37] and categorizing text documents [38], as well as regression tasks such as biomass prediction [39] and healthcare analysis [40].

F. PARTIAL LEAST SQUARES REGRESSION (PLS)
Partial Least Squares regression encompasses the renowned principle of partial correlation [41]; it aims to forecast y using a number of input variables x. The predictors of a PLS regression model are developed by the linear combination of input variables x. These predictors are generally referred to as PLS components or latent vectors, which maximize the correlation between x and y. Finally, PLS regression analysis can be utilized for both linear and nonlinear data forecasting [42].

G. KERNEL RIDGE REGRESSION (KRR)
In 2000, Cristianini and Shawe-Taylor [43] introduced the expression "Kernel Ridge Regression" to refer to a specialized form of Support Vector Regression which was a variant of the previous "ridge regression in dual variables" [44]. The main difference between SVR and KRR is in the selection of loss function. Additionally, KRR enables model fitting to be performed more quickly than SVR. On the other hand, SVR performs forecasting more rapidly than KRR.
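The "ridge regression in dual variables" connection can be demonstrated directly: with a linear kernel and the same regularization strength, KRR's predictions coincide with those of ordinary ridge regression fit without an intercept. The sketch below assumes scikit-learn and synthetic data:

```python
# Sketch: linear-kernel KRR is ridge regression solved in dual variables.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ rng.normal(size=4)

krr = KernelRidge(kernel="linear", alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
# Both solve the same regularized least-squares problem,
# one in the dual (kernel) form, one in the primal form.
```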

H. LEAST ABSOLUTE SHRINKAGE AND SELECTION OPERATOR (LASSO)
For parameter estimation in regression problems, the lasso, introduced by Tibshirani [45], has become a widespread substitute for the simple least-squares method. Its success is attributed to the main function of this method, least absolute shrinkage, which shrinks the vector of regression coefficients, possibly setting certain coefficients exactly to zero. Thus, it performs continuous estimation and variable selection simultaneously.
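The shrinkage-to-zero property can be illustrated with synthetic data (hypothetical values, not from the study): when only some inputs drive the output, the lasso tends to zero out the irrelevant coefficients while keeping the relevant ones.

```python
# Sketch: L1 shrinkage sets irrelevant coefficients exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
# Only the first two inputs actually drive the output.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]

lasso = Lasso(alpha=0.5).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))  # irrelevant features dropped
```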

IV. EMPIRICAL STUDY
This section outlines our empirical research objective and formulates the research questions needed to achieve it. Then, we explain the dataset used, with a descriptive analysis of the underlying data distribution. In addition, we briefly describe the different forecasting error measures used in this empirical study to measure the forecasting errors of the machine learning models. Finally, we explain the statistical analysis performed to examine to what extent the performance differences between the machine learning models are significant. This empirical study is entirely implemented and conducted using Python.

A. GOAL
Using the GQM template [46], we state the goal of our empirical study as follows:
• Evaluate: the forecasting capabilities and the impact of hyperparameter tuning of conventional machine learning models
• Purpose: forecasting stock prices
• Respect: magnitude of relative error, mean absolute percentage error, and root mean square error
• Perspective: researcher, stock investor, financial data analyst
• Context: 11 stock prices in the Saudi Stock Market (Tadawul)
To achieve this goal, we have formulated the following research questions to guide our empirical investigation:
RQ 1. What are the capabilities of conventional machine learning models in forecasting stock prices in the Saudi Stock Market?
RQ 2. What are the capabilities of hyperparameter-tuned conventional machine learning models in forecasting stock prices in the Saudi Stock Market?
RQ 3. To what extent does hyperparameter tuning affect the performance of conventional machine learning models in forecasting stock prices in the Saudi Stock Market?

B. SAUDI STOCK MARKET DATASETS
In this research, we utilize the Saudi Stock Market (Tadawul) data to conduct our empirical study. There are eleven sectors in Tadawul where each sector consists of a large number of companies.

1) Dataset Description
In this study, we have selected one stock company from each sector to interpret our findings well. The list of the selected companies with relevant details is given in Table 2.
We collected the data from www.investing.com in January 2021, which provides the Saudi Stock Exchange's daily stock prices [6]. Additionally, the closing price is the stock's final price on a particular day, making it the most valuable to predict. Table 3 summarizes the statistical characteristics of each stock dataset used in this analysis. SAUDI ARAMCO and ALINMA have the lowest standard deviations, indicating that their stock prices stay close to their mean values and fluctuate very little. In contrast, the maximum difference between the minimum and maximum stock price, as well as the maximum standard deviation, is seen in ARAB SEA, STC, and GASCO, implying a higher degree of fluctuation. The different stock distributions within the chosen datasets will assist us in validating the outcome of our empirical study.

C. DATA PREPROCESSING
Each dataset contains seven features (open, close, high, low, volume, and change), including the time stamp for each sample. We have utilized only the closing stock price, removing all other features except the time stamp. There were no missing values in the closing stock price from the listing date to 31st December 2020 in any of the stock datasets we selected. Then, the data is converted into time-indexed closing stock prices. Later, this data is transformed into an appropriate format for supervised learning using the sliding window technique, often known as lagged variables. In this study, a sliding window of size five is implemented. Figure 1 illustrates the overall structure of our experiment design based on the Alinma dataset; we replicate the same procedure for the other selected datasets. A sliding window of size five is commonly used in forecasting stock prices, as mentioned in the surveyed literature [3], [4], [8]. As a result, the feature list contains the stock prices of days t−4, t−3, t−2, t−1, and t, where t corresponds to today. Furthermore, the output list includes the stock price of day t+1, which we forecast using the machine learning models. After that, we split the feature and output lists 80%-20% into training and testing sets without any randomization. As a result, in this experiment, the training data (the first 80% of the dataset) consists of historical stock prices, and the testing data consists of the future stock values of that company. For example, for the STC dataset, the training set is from 26-01-2003 to 11-04-2017, and the test set is from 12-04-2017 to 31-12-2020. On the training set, we first develop machine learning models with the default hyperparameter configurations of scikit-learn [47] and evaluate these models on the test set. We label these models as un-tuned machine learning models. Additionally, we create a hyperparameter space for each machine learning model.
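The sliding-window transformation described above can be sketched as follows; `sliding_window` is a hypothetical helper, not the authors' code:

```python
# Sketch: turn a closing-price series into lagged features (t-4..t) and target (t+1).
import numpy as np

def sliding_window(prices, window=5):
    X, y = [], []
    for i in range(len(prices) - window):
        X.append(prices[i:i + window])  # five lagged prices
        y.append(prices[i + window])    # next day's price
    return np.array(X), np.array(y)

prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
X, y = sliding_window(prices)
# X[0] = [10, 11, 12, 13, 14] and y[0] = 15
```

Splitting `X` and `y` 80%-20% without shuffling then preserves chronology: the test set is strictly in the future relative to the training set.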
We use time split cross-validation to split the training set into nine folds which is illustrated in Figure 2.

FIGURE 2. Time Split Cross-Validation fold size in Alinma Dataset
Then, we find the best hyperparameters from the hyperparameter space over the nine folds. Moreover, we use the grid search technique, which rigorously builds machine learning models for all possible combinations of hyperparameters in the hyperparameter space and finds the combination with the lowest average error across all nine splits.
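This combination of grid search and nine-fold time-split cross-validation corresponds to scikit-learn's `GridSearchCV` with `TimeSeriesSplit`. The grid below is a hypothetical toy example for SVR, not the study's Table 4 space:

```python
# Sketch: grid search over a time-split (chronological) 9-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.1 * rng.normal(size=100)

param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}  # toy grid
search = GridSearchCV(
    SVR(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=9),       # nine chronological folds
    scoring="neg_root_mean_squared_error",  # lowest average RMSE wins
)
search.fit(X, y)
best = search.best_params_
```

Unlike ordinary k-fold splitting, `TimeSeriesSplit` always validates on data that comes after the training portion, so no fold trains on the future.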
After finding the best hyperparameter configuration, we build each machine learning model according to that arrangement. Then, we evaluate our tuned machine learning models in the test set. Finally, we analyze and compare the tuned and un-tuned models.

E. HYPERPARAMETER TUNING TECHNIQUES
The classical method for hyperparameter tuning is grid search, which is basically an exhaustive search of a given subset of possible values in the hyperparameter space. A grid search algorithm is driven by some type of performance metric, commonly determined by cross-validation on the training set. Additionally, other tuning techniques such as random search [48], gradient search [49], and Bayesian optimization [50] have been proposed. Random search chooses hyperparameter combinations at random, substituting for the exhaustive search of all possible combinations performed by grid search; it is applicable to discrete as well as continuous and mixed-value hyperparameters. For some specific learning algorithms, gradient-based optimization may be used to calculate the gradient with respect to the hyperparameters and subsequently optimize them via gradient descent.
In the Bayesian hyperparameter tuning method, a probabilistic model is built of the function mapping hyperparameter values to performance. Bayesian optimization seeks to acquire information about this function and the position of its optimum by repeatedly assessing a potential hyperparameter configuration based on the current model and then updating the model. The optimal configuration is identified by evaluating the target on the validation set. Although various hyperparameter tuning techniques have been suggested, grid search remains the standard for several reasons. Firstly, the implementation of grid search is straightforward, and it supports parallelization. In general, it finds a better hyperparameter configuration than manual inspection in an equal amount of time. Finally, if the dataset and hyperparameter search space are not excessively large, grid search illustrates noteworthy performance in choosing the best hyperparameter configuration [51]. Table 4 outlines the hyperparameter space for each machine learning model; the bolded values indicate the default hyperparameter values. Our study considers three types of hyperparameter space: integer-valued, continuous-valued, and string-valued. To define a continuous-valued space, we raise and reduce the default value by multiples of 10. We attempt to consider all possible string values available in a string-valued space. For an integer-valued space, we select accepted values around the default with a constant interval, or sometimes we set particular values to cover the range, depending on the type of the hyperparameter space. For instance, for the KNN 'algorithm' hyperparameter, we consider all the possible algorithms KNN supports, which is a string-valued space.

F. HYPERPARAMETER SPACE
For the C hyperparameter in SVR, values from 0.01 to 100 are selected with multiplicative intervals of ten (0.01, 0.1, 1, 10, 100).

G. FORECASTING EVALUATION MEASURES
In time series analysis, we measure the forecasting performance by measuring the forecasting error, which represents the gap between the real value and the forecasted value. In our empirical study, the machine learning models are trained with hyperparameter tuning on the 80% training dataset. After we build the fine-tuned models, we report their forecasting error by testing the models' forecasting performance on the unseen 20% test dataset, which is a common practice in the existing literature [52], [53]. This study employs two primary statistical loss functions, Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE), to evaluate the forecasting performance of the investigated models. These loss functions are frequently used in recent time series forecasting studies [54], [55]. The error functions are discussed below:

1) Root Mean Square Error (RMSE)
First, the square of the difference between the actual and forecasted values is calculated. Then, the mean is taken over all the instances. Finally, RMSE is calculated by taking the square root of the resulting mean. RMSE is one of the most preferred scale-dependent measures due to its suitability for comparing various models built on the same dataset. Moreover, RMSE's theoretical resemblance to statistical models has made it ubiquitous in assessing forecasting model performance [56].
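The computation just described can be sketched as follows (illustrative code and values, not the study's):

```python
# Sketch: RMSE = sqrt(mean of squared differences).
import numpy as np

def rmse(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

err = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])  # sqrt((4 + 4 + 9) / 3)
```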

2) Mean Absolute Percentage Error (MAPE)
First, the absolute difference between the actual and forecasted value is divided by the actual value. Then, MAPE is calculated by taking the mean over all instances and expressing it as a percentage. This measure is also commonly known as the Mean Magnitude of Relative Error (MMRE) in some literature [57]. Additionally, MAPE is the most widely used measure to evaluate forecasting models [58].
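This measure can likewise be sketched in a few lines (illustrative code and values, not the study's):

```python
# Sketch: MAPE = mean of |actual - forecast| / actual, as a percentage.
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs(actual - forecast) / actual)

err = mape([100.0, 200.0], [110.0, 190.0])  # (10/100 + 10/200) / 2 * 100 = 7.5
```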

3) Magnitude of Relative Error (MRE)
The Magnitude of Relative Error (MRE) is a per-instance form of the MMRE mentioned above, without taking the mean over all the instances. We use MRE in our statistical analysis to construct and evaluate our null and alternative hypotheses. The formula of MRE is presented below:

$$MRE_t = \frac{|x_t - \hat{x}_t|}{x_t}, \quad t = 1, \ldots, m,$$

where m is the number of samples in our test set, x_t is the actual value from the test set, and x̂_t is the corresponding value forecasted by the machine learning models.

H. STATISTICAL ANALYSIS
In this study, we perform two different statistical tests. We first examine the significance of the difference in forecasting performance between the conventional machine learning models using the non-parametric Wilcoxon signed-rank test [59] in terms of the magnitude of relative error (MRE) at a significance level of 0.05 with Bonferroni correction [60]. The Wilcoxon test's main advantage is that, being non-parametric, it does not require the data to follow any particular distribution. Additionally, it is robust to outliers since it considers the signs and ranks of the values rather than their magnitudes. Lastly, this test is widely used in the literature to compare forecasting models [52]. Then, the impact of hyperparameter tuning between un-tuned and tuned models is measured using the Wilcoxon effect size. It gives a clear illustration of how much forecasting performance improves after tuning the hyperparameters of each machine learning model.

1) Wilcoxon Signed-Rank Statistical Test with Bonferroni Correction
In the Wilcoxon test, the absolute difference between each pair of observations is first measured, and the absolute differences are then ranked in ascending order. After that, each rank is given the sign of the corresponding difference [61]. T⁺ is the sum of the ranks with a plus sign, and T⁻ is the sum of the ranks with a minus sign. The distribution of T can be approximated by a normal distribution if the sample size n is more than 25, with mean

$$\mu_T = \frac{n(n+1)}{4} \tag{8}$$

and standard deviation

$$\sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}}.$$

Finally, the test statistic Z is formulated as

$$Z = \frac{T - \mu_T}{\sigma_T}.$$

To reject the null hypothesis and accept the alternative hypothesis, the calculated value of Z must be greater than or equal to the critical value Z_α, where α is the level of statistical significance. The appropriate value of α depends on the domain of study.
A particular alpha value might be appropriate for a single hypothesis test, but it is not suitable for many hypothesis tests conducted concurrently on the same data. The Bonferroni correction method is used to correct the alpha value when performing several dependent or independent statistical tests. To prevent a large number of misleading type 1 errors, the alpha value must be decreased to reflect the number of comparisons conducted. In the Bonferroni correction, the alpha value is divided by the number of comparisons n, so the corrected alpha value equals α/n. In the price forecasting literature, the typical value of α is 0.05, but we conduct pairwise comparisons among the forecasting performance of the machine learning models, with a total of 28 comparisons per dataset. After the Bonferroni correction, the adjusted value of α is 0.05/28, or 0.0017857. The null and alternative hypotheses of this empirical analysis are as follows: H₀: MRE_X = MRE_Y (there is no difference in forecasting performance between the two machine learning models in terms of MRE).
Hₐ: MRE_X ≠ MRE_Y (there is a difference in forecasting performance between the two machine learning models in terms of MRE).
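The testing procedure can be sketched with SciPy's `wilcoxon` on hypothetical per-instance MRE values (the numbers below are illustrative, not from the study):

```python
# Sketch: Wilcoxon signed-rank test on paired MRE values, with Bonferroni
# correction for 28 pairwise model comparisons per dataset.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-instance MRE values of two models on the same test set.
mre_x = np.array([0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.02, 0.05])
mre_y = np.array([0.04, 0.05, 0.06, 0.05, 0.08, 0.05, 0.04, 0.07])

stat, p = wilcoxon(mre_x, mre_y)
alpha_corrected = 0.05 / 28  # Bonferroni-adjusted significance level
reject_null = p < alpha_corrected
```

In the study's setting, a rejected null hypothesis would then be followed by comparing the two models' MAPE values to assign a win and a loss.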

2) Wilcoxon Effect Size Test
The effect size test measures the magnitude of a treatment outcome. The Wilcoxon effect size [62] is measured as the simple difference between the proportions of favorable and unfavorable rank sums in the Wilcoxon signed-rank test:

$$r = f - u,$$

where r is the effect size, f is the favorable proportion, and u is the unfavorable proportion.
In a statistical significance test, the p-value shows whether the difference between groups is significant or not. The calculated p-value depends on the standard error (SE). Moreover, the sample size has an impact on the standard error and, accordingly, on the p-value: as the sample size rises, the standard error drops, and the p-value drops [63]. This sample-size dependency makes p-values confounded; a statistically significant finding often just means that a huge sample size was used [64], [65]. As a result, the p-value does not indicate the magnitude of the difference in the groups' mean scores or the strength of the relationship between the examined variables. Therefore, we conduct the Wilcoxon effect size test to demonstrate the magnitude of the difference between the forecasting performance of the machine learning models.
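Under the rank-sum formulation described above, the effect size can be computed as sketched below; `wilcoxon_effect_size` is a hypothetical helper written to that formulation, not the authors' code:

```python
# Sketch: Wilcoxon effect size r = favorable minus unfavorable rank-sum proportion.
import numpy as np
from scipy.stats import rankdata

def wilcoxon_effect_size(a, b):
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0]                    # drop zero differences, as in the signed-rank test
    ranks = rankdata(np.abs(d))      # rank the absolute differences
    total = ranks.sum()
    f = ranks[d > 0].sum() / total   # favorable proportion
    u = ranks[d < 0].sum() / total   # unfavorable proportion
    return f - u

# Hypothetical MRE values where model A beats model B on every instance:
r = wilcoxon_effect_size([0.02, 0.03, 0.04], [0.05, 0.06, 0.07])  # r = -1.0
```

The resulting r lies in [−1, 1]: ±1 means one model wins on every paired instance, and 0 means the favorable and unfavorable rank sums balance out.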

V. EMPIRICAL RESULTS
In this section, we examine the forecasting abilities of conventional machine learning models with default hyperparameter settings. Then, we illustrate the best hyperparameter configuration for each machine learning model across all the datasets. After that, we review the forecasting performance of the conventional machine learning models with the best hyperparameter settings. Finally, we analyze the effect of hyperparameter tuning on the selected machine learning models in the Saudi Stock Exchange context. Table 5 reports the investigated forecasting performance in terms of RMSE and MAPE values over all the stock company datasets. The best forecasting performance (i.e., minimum error) is bolded and underlined for each dataset. The un-tuned Stochastic Gradient Descent (SGD) model consistently performed poorly, with very high errors in most datasets. One noticeable phenomenon is observed in the Arab Sea dataset, where most of the models performed poorly although SGD performed comparatively well. On the other hand, the un-tuned Kernel Ridge Regression (KRR) model performs considerably well, scoring the lowest RMSE and MAPE in nine datasets. On the remaining datasets, it performs very close to the best-performing model.
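For reference, the two error metrics reported in Table 5 can be computed as follows (a plain-Python sketch; actual prices are assumed non-zero for MAPE):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```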

A. FORECASTING PERFORMANCE OF UN-TUNED MACHINE LEARNING MODELS
To answer the research questions in our study, we report each model's performance in a win/loss scenario according to the statistical testing and forecasting performance error [66]. In each paired model comparison, if the null hypothesis is rejected using the Wilcoxon statistical test with Bonferroni p-value correction, we look into the corresponding machine learning models' performance scores. We then assign a win to the machine learning model with the lower MAPE and a loss to the model with the higher MAPE value. If the null hypothesis is accepted, we give neither a win nor a loss to either machine learning model, as there is no statistical difference between the compared models.
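The win/loss assignment described above can be sketched as follows (a hypothetical structure: `is_significant` stands in for the Bonferroni-corrected Wilcoxon test, and `model_mape` for the per-dataset scores):

```python
from itertools import combinations

def tally_wins_losses(model_mape, is_significant):
    """Pairwise win/loss tally on one dataset. `model_mape` maps model name
    to its MAPE; `is_significant(a, b)` decides whether the pair differs
    significantly. The lower-MAPE model wins; otherwise neither scores."""
    wins = {m: 0 for m in model_mape}
    losses = {m: 0 for m in model_mape}
    for a, b in combinations(model_mape, 2):
        if not is_significant(a, b):
            continue  # no statistical difference: neither a win nor a loss
        winner, loser = (a, b) if model_mape[a] < model_mape[b] else (b, a)
        wins[winner] += 1
        losses[loser] += 1
    return wins, losses
```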
According to the win/loss scenario, Table 6 represents the outcome of the Wilcoxon statistical testing, illustrating the significant differences in forecasting performance between each pair of un-tuned machine learning models. For illustration, consider Kernel Ridge Regression's (KRR) forecasting performance in the context of all datasets except GASCO and TAIBA: KRR achieved the highest number of wins without any loss against any other model. In GASCO and TAIBA, however, LASSO earned the highest wins without any loss. Overall, un-tuned KRR and LASSO can be labeled as the best two models in this scenario, with 61 and 55 wins, respectively. On the contrary, un-tuned SGD and GPR are the least-performing forecasting models, with 72 and 67 losses. Furthermore, we extend the analysis of the statistical test results in Table 7 by illustrating the paired comparison between each model. Each row reflects the number of wins for the row model against the column model and, likewise, the column model's losses against the row model. For example, SVR has won eight times against KNN, and KNN has lost eight times against SVR. The final column computes the percentage of each model's overall wins out of all possible paired comparisons, with each model having 77 paired comparisons (7 paired comparisons per model multiplied by 11 stock datasets). Likewise, the loss percentage is computed in the final row. It is worth mentioning that a paired comparison might not result in a win or a loss at all, if there is no significant difference in performance between the two models.
In Table 7, the paired comparison exhibits the dominance of the un-tuned KRR model in forecasting stock prices over the other models.
RQ1 Answer: Un-tuned machine learning models (KRR and LASSO) show significantly superior or at least similar forecasting performance compared to other un-tuned models in forecasting the selected 11 Saudi companies' stock prices. On the contrary, the un-tuned SGD and GPR models are the least-performing models in stock price forecasting.

B. BEST HYPERPARAMETERS CONFIGURATION
In this study, we use grid search to find the best hyperparameters for each investigated machine learning model. Grid search is applied to the training set using time-series split cross-validation with the MAPE score as the optimizing scorer function. The grid search technique attempts to find the best hyperparameter values by trying all possible combinations in the hyperparameter space. Though it is computationally heavy, it effectively finds the best hyperparameters of a machine learning model [55]. In our empirical study, the best hyperparameter combination is selected based on the minimum average MAPE score over nine folds of the training set. The total number of model fits is the product of the nine folds and the number of value combinations in each hyperparameter space mentioned in Table 4.
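A simplified stand-in for this procedure is sketched below in pure Python (the study itself would typically use a library implementation such as scikit-learn's GridSearchCV with TimeSeriesSplit; the helper names and toy forecaster in the usage note are illustrative):

```python
from itertools import product

def time_series_splits(n_samples, n_folds=9):
    """Expanding-window splits: each fold trains on an initial segment and
    validates on the segment that immediately follows it, preserving time order."""
    fold = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield list(range(0, k * fold)), list(range(k * fold, (k + 1) * fold))

def grid_search(fit, forecast, grid, series, n_folds=9):
    """Try every hyperparameter combination in `grid` (dict of lists) and
    return the one with the lowest average MAPE across the folds."""
    def mape(actual, predicted):
        return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

    keys = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        fold_scores = []
        for train_idx, val_idx in time_series_splits(len(series), n_folds):
            model = fit([series[i] for i in train_idx], **params)
            preds = forecast(model, len(val_idx))
            fold_scores.append(mape([series[i] for i in val_idx], preds))
        avg = sum(fold_scores) / len(fold_scores)
        if avg < best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

For an upward-trending toy series, a moving-average forecaster tuned this way selects the shortest window, since the most recent value tracks the trend best.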
Finally, Table 8 gives a comprehensive overview of the best hyperparameter values of the investigated machine learning models for forecasting stock prices in the context of the selected stock datasets. Table 9 presents the forecasting performance of the hyperparameter-optimized machine learning models across all datasets. After hyperparameter tuning, we observe a significant improvement in the forecasting performance of the SGD and GPR models, which were the least-performing models in their un-tuned state. However, the tuned SGD model still performed poorly, scoring a massive error in forecasting the SAUDI ARAMCO stock price. After hyperparameter tuning, SVR becomes one of the top forecasting models, scoring the minimum RMSE and MAPE values in 8 datasets out of 11. On the contrary, little or no significant error reduction is seen in the KRR and LASSO models, the two best-performing models in their un-tuned state. However, KRR scored the minimum RMSE and MAPE in forecasting the SAUDI ARAMCO and ALINMA stock prices. Table 10 demonstrates the Wilcoxon statistical testing results in the win/loss scenario, showing the substantial gap in forecasting efficiency between each pair of hyperparameter-tuned machine learning models. For instance, consider SVR's forecasting performance in the context of all the datasets: SVR has the highest number of wins without losing to any other model, indicating the supremacy of the tuned SVR model's forecasting performance.

C. FORECASTING PERFORMANCE OF TUNED MACHINE LEARNING MODELS
On the other hand, tuned DT is ranked as the lowest-performing forecasting model, with 69 losses. This also suggests that hyperparameter tuning might impact other models more extensively than the DT model, because DT won 26 instances in the un-tuned comparison but only two after hyperparameter tuning. In contrast, GPR is one of the worst-performing models in its un-tuned version, but after hyperparameter tuning we notice a dramatic improvement in forecasting performance: it achieves 36 wins and outperforms the tuned KRR and LASSO models, which were the two top-performing models before hyperparameter tuning. Furthermore, we extend the investigation of the statistical test results in Table 10 by showing the paired comparison between each hyperparameter-tuned model.
In Table 11, the paired comparison unveils the supremacy of the tuned SVR model in forecasting stock prices over the other models. Moreover, tuned SVR has achieved a 0% loss ratio, meaning that not a single tuned machine learning model outperformed SVR in the context of all the exploited stock price datasets. Furthermore, tuned GPR has achieved the second-highest win ratio, having lost only twice, against the best-performing tuned SVR model. In contrast, tuned DT and KNN have the highest loss ratios (89.6% and 75.3%, respectively), signifying that all the other tuned models have performed significantly better than these two models in almost all the stock datasets. Finally, figures of the test set forecasting performance for all the tuned models (88 in total) are available for download as supplementary documents.
RQ2 Answer: Tuned SVR and GPR models demonstrate significantly superior or at least similar forecasting performance compared to other tuned models in forecasting the selected 11 Saudi companies' stock prices. On the contrary, the tuned DT and KNN models are the least-performing models in stock price forecasting.

D. EFFECTS OF HYPERPARAMETER TUNING IN MACHINE LEARNING MODELS
At first, we visualize the effect of hyperparameter tuning by comparing graphs of a model's tuned and un-tuned forecasting performance on the test set. Then we illustrate the impact of hyperparameter tuning using the Wilcoxon effect size test.
1) Visualizing the Impact of Hyperparameter Tuning
In Table 14, the positive large effect in the SVR model illustrates remarkable forecasting improvement after hyperparameter tuning, whereas the negative large effect in the DT model demonstrates poor forecasting performance in both the un-tuned and tuned versions of the machine learning algorithm. When a machine learning model is performing extremely poorly, even a slight or moderate degradation can cause a large effect on forecasting performance.

2) Wilcoxon Effect Size Test
Table 15 illustrates the impact of hyperparameter tuning of each machine learning model in all datasets. In each column, an un-tuned and a tuned model's forecasting performance are compared and reported according to the Wilcoxon effect size. For example, in the second column, the forecasting performance of un-tuned SVR and tuned SVR is evaluated using the Wilcoxon effect size across all stock datasets. The range of the Wilcoxon effect size is [-1, 1]. A positive effect size indicates that hyperparameter tuning enhances the forecasting performance, while a negative value denotes that hyperparameter tuning degrades it. The standard interpretation of the Wilcoxon effect size value is as follows:
• Negligible Effect (N): |r| < 0.1
• Small Effect (S): 0.1 ≤ |r| < 0.3
• Moderate Effect (M): 0.3 ≤ |r| < 0.5
• Large Effect (L): |r| ≥ 0.5
Table 16 represents the Wilcoxon effect size of hyperparameter tuning with the standard interpretation mentioned above.
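These qualitative labels (negligible below 0.1, small below 0.3, moderate below 0.5, large otherwise — the conventional |r| cut-offs) can be encoded as a small helper:

```python
def interpret_effect_size(r):
    """Map a Wilcoxon effect size r in [-1, 1] to the standard qualitative
    label, using the conventional |r| thresholds."""
    magnitude = abs(r)
    if magnitude < 0.1:
        return "Negligible"
    if magnitude < 0.3:
        return "Small"
    if magnitude < 0.5:
        return "Moderate"
    return "Large"
```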
Previously, in Section V-C, we assumed that the hyperparameter tuning of GPR and SGD might have a significant effect, which is now empirically confirmed by the table. We can therefore say that hyperparameter tuning of the GPR and SGD models has a major impact on improving forecasting performance over all the stock datasets utilized in our experiment. However, in the ARAB SEA stock dataset, we notice a negligible effect of hyperparameter tuning on the SGD model's performance. The reason behind this phenomenon lies in Section V-A, where we mention that the un-tuned SGD model performed remarkably well only in the ARAB SEA dataset. Moreover, tuning SVR has consistently improved the forecasting performance with a small to large impact, supporting the results in Section V-C, where we show that tuned SVR is the best-performing model. Additionally, tuning the SVR, GPR, and SGD models has never degraded the forecasting performance across the stock datasets used in this study. On the contrary, we can visualize and comprehend the reason behind the performance degradation of the tuned DT, KRR, LASSO, and PLS models in Section V-C using Table 12. For example, after hyperparameter tuning, DT's forecasting performance deteriorates in four stock datasets, and in most of the datasets the effect of hyperparameter tuning is negligible to small. Furthermore, in KRR, the impact of hyperparameter tuning is found to be negligible across eight stock datasets. This reveals the null or negative effect of hyperparameter tuning on these models.
RQ3 Answer: Hyperparameter tuning of SVR, GPR, and SGD significantly improves forecasting performance or at least has a positive impact, whereas hyperparameter tuning of the other models may reduce the forecasting performance in the context of forecasting the selected 11 Saudi companies' stock prices.

VI. THREATS TO VALIDITY
Our findings are restricted to the 11 selected Saudi Stock Exchange (Tadawul) companies used in this study and cannot be generalized to other stock companies. As a result, replication of this empirical study is required using additional stock datasets from around the world. The random state variable in machine learning models can pose a threat to validity by generating non-deterministic results. Throughout this experiment, we set the random state to a specific constant value for each machine learning model; consequently, our experimental results are consistent and deterministic, minimizing the risk of this particular threat to validity. Moreover, an additional threat to validity is associated with the different evaluation metrics used for measuring forecasting capabilities. Nevertheless, the performance measures and statistical tests in our empirical study are widely applied and mathematically accepted in the research community.

VII. CONCLUSION
This paper aimed to investigate conventional machine learning models' forecasting capabilities in the Saudi Stock Exchange (Tadawul) context. Moreover, we empirically investigated the impact of hyperparameter tuning on a machine learning model's forecasting performance. A thorough investigation of eight conventional machine learning models' forecasting performance was conducted using 11 stock datasets from different sectors of Tadawul. This paper's primary contributions can be summarized in four folds: First, we compared the applicability of un-tuned conventional machine learning models in forecasting stock prices. Second, we searched the hyperparameter space of each conventional machine learning model and reported the best hyperparameter combination for each stock dataset. Third, we compared the forecasting performance of the machine learning models after tuning. Fourth, we analyzed to what extent hyperparameter tuning affects the performance of conventional machine learning models in forecasting stock prices.
Our study's findings demonstrate the applicability of conventional machine learning models for forecasting stock prices to data analysts, stock investors, and machine learning practitioners. Conventional machine learning models like SVR and KRR can be employed to precisely forecast stock prices, which might be an excellent tool for a stock investor. Our empirical results align with the existing literature suggesting superior forecasting performance of SVR and KRR in time series forecasting [67]-[69]. The ability to efficiently address nonlinear regression is the primary reason for their noteworthy performance, which is also supported by existing studies [70], [71]. Our empirical findings revealed that hyperparameter tuning of machine learning models can extensively impact forecasting performance. For instance, after hyperparameter tuning, SVR became one of the best models in forecasting stock prices. On the contrary, the best-performing un-tuned models, such as KRR, LASSO, and PLS, showed a negligible or negative impact of hyperparameter tuning. In these scenarios, a negligible impact implies that hyperparameter tuning results in a model similar to the default un-tuned model, while a negative impact denotes forecasting performance degradation after hyperparameter tuning, typically because the tuning overfits the model. Nevertheless, hyperparameter tuning can be a good choice for improving a machine learning model's forecasting performance, provided overfitting is avoided.
Our work in this paper can be expanded in several directions. This experiment can be replicated using additional stock datasets of varying sizes from other global stock markets, and the findings can be compared with those reported in this paper. In addition, a sensitivity analysis can be performed to identify which hyperparameter has the most crucial impact on enhancing the forecasting performance of each machine learning model. Finally, investigating deep neural networks and ensemble models is an interesting future direction for researchers to evaluate whether these can achieve additional forecasting performance gains over the conventional models.