Exploring Predictive Variables Affecting the Sales of Companies Listed With Korean Stock Indices Through Machine Learning Analysis

This study uses machine learning algorithms to explore predictor variables that determine whether the national statistical indices managed and announced by the Korean government influence the sales of companies listed on the Korea Composite Stock Price Index (KOSPI) and Korean Securities Dealers Automated Quotation (KOSDAQ). Further, it proposes a machine learning algorithm suitable for forecasting the sales of these companies. The sales of 1,470 companies listed on KOSPI and KOSDAQ with more than 20 years of history and 58 national statistical indices were analyzed. The predictor variables and performance were explored using the analysis data from 2000 to 2021 and the following machine learning algorithms: random forest, gradient boost, extreme gradient boosting, adaptive boosting, and categorical boosting. The analysis result confirmed that the national statistical indices contain different variables that affect the sales of listed companies by industry. The primary variable that showed the greatest influence in each industry was the industrial accident rate for manufacturing, finance and insurance, gold for construction, number of automobiles produced for wholesale and retail, and foreign exchange reserves for information and communication. The regression performance evaluation indicators—mean absolute error, mean squared error, and root mean squared error—were used to determine the optimal machine learning algorithm. The results showed that gradient boost achieved the best performance. Consequently, this study proposes using national statistical indices for companies to establish management strategies based on machine learning results.


I. INTRODUCTION
Since the fourth industrial revolution was mentioned at the World Economic Forum in 2016, technological innovations have progressed in various fields through the fusion of artificial intelligence (AI), big-data analysis, and the Internet of Things with information and communication technology. Following this trend, the Korean government emphasizes data, networks, and AI. It has created an industrial ecosystem by introducing the Data Industry Act, amending three data laws, and establishing AI ethics standards.
As a part of creating the industrial ecosystem, the Korean government has constructed and operated web-based statistical information systems, the Statistics Korea e-Nara Index and Bank of Korea Economic Statistical System. These systems help people understand the social and economic situation in one place using 743 national statistical indices created based on indicators managed by 41 central administrative agencies and through private statistical data, such as price-, employment-, and production-related indices.
The associate editor coordinating the review of this manuscript and approving it for publication was Cheng Chin.
Price is the overall level that averages the prices of individual products traded in the market, considering their importance in economic life. The consumer price index is a representative index that refers to a price index that measures the average cost of living of urban households or changes in the purchasing power of currency by examining the price fluctuations of goods and services that consumers purchase in their daily lives [1], [2], [3].
Employment refers to a state in which one party provides labor to the other party, and the other party pays remuneration for it. The employment rate is a representative index that refers to the employed proportion of the population at a specific time among the working-age population (ages 15-64) [3]. Production refers to the process of converting raw materials into products or services through the input of people and equipment. The service index is a representative index that comprehensively represents the production activities of the entire service industry and individual industries. This index is determined by applying a weight representing the relative importance of each industry [3].
The Korean government has attempted to represent social and economic situations using national statistical indices. However, government press releases and news announced that the social and economic fields were directly or indirectly affected by COVID-19 after its outbreak in the second half of 2019. Consequently, problems with the national statistical indices provided through the Statistics Korea e-Nara Index and Bank of Korea Economic Statistical System have emerged.
Because the national statistical indicators representing social and economic conditions provide only statistical figures, the indicators reveal only the changes in figures due to the influence of COVID-19; information on the related industries and companies affected by the national statistical indicators is not provided. This is the biggest problem with respect to the fundamental purpose of using the national statistical index to reflect the social and economic situation. The limitations of the national statistical indices provided by the Statistics Korea e-Nara Index and Bank of Korea Economic Statistical System are as follows: First, national statistical indices only provide monthly, quarterly, and yearly statistics; therefore, we can only know the changes from the previous month, quarter, and year. Second, a comparative analysis cannot be performed on the same statistical cycle for every national statistical index because not all national statistical indices provide monthly, quarterly, and annual statistical values. Third, most national statistical indices do not provide related indices except for the representative national statistical indices. Hence, when the statistical value of a national statistical index increases or decreases, the related indices that will be affected cannot be determined.
As such, national statistical indicators are at the level of providing only one-dimensional information about indicators. Now it is time to improve national statistical indicators. Rather than simply presenting national statistical indicators, information on the impact of national statistical indicators on corporate sales according to corporate characteristics should be presented. However, because research on this is insufficient, knowing what kind of relationship exists between national statistical indicators and corporate characteristics is not possible; therefore, there will be limitations in presenting them.
National statistical indicators are believed to affect different types of industries. However, the research related to this is insufficient, or no specific research data exist yet.
A previous study analyzed the relationship between national statistical indicators and one company; however, to date, no study has targeted multiple companies or classified companies by industry. In terms of research methodology, related studies use machine learning algorithms to suggest the optimal algorithm suited to the data characteristics. Because the predictive performance of machine learning algorithms can differ with the analysis data, dependent variable type, and number of predictors, results from studies whose subject is not a company, or that target only one company, cannot be generalized. Thus, I hypothesize that this also contributes to the lack of related studies examining the relationship between national statistical indicators and companies [12].
The Korea Composite Stock Price Index (KOSPI) and Korean Securities Dealers Automated Quotation (KOSDAQ) are representative stock indices that represent the economic situation of Korea and best reflect political, social, and economic factors and corporate management performance. KOSPI represents the flow of all stocks (excluding KOSDAQ) listed on the Korea Stock Exchange. Companies with an equity capital of more than 30 billion won are listed as part of corporate size requirements. KOSDAQ is a stock market operated by the KOSDAQ committee for small- and medium-sized enterprises and venture companies with a function similar to that of the US NASDAQ. Among the company size requirements, companies with an equity capital of 1.5 billion won or more or market capitalization of 9 billion won or more are listed. As such, companies registered with the KOSPI and KOSDAQ markets reflect the Korean economic situation as a stock index.
This study selected KOSPI- and KOSDAQ-listed companies as companies that are affected by social and economic conditions to understand whether national statistical indicators affect companies.
In addition, KOSPI- and KOSDAQ-listed companies were classified by type of business, which is a characteristic of the company, to determine whether the national statistical indices affecting a company's sales vary with the type of business. The research methodology also used machine learning algorithms to analyze national statistical indicators and real data produced by companies without using traditional statistical techniques such as surveys and focus group interviews (FGIs). The performance of each machine learning algorithm was evaluated and a machine learning algorithm suitable for predicting sales of KOSPI- and KOSDAQ-listed companies was presented. In addition, by studying companies that have not yet been specifically studied, national statistical indicators suitable for corporate characteristics were presented to enhance corporate management activities.
This study contributes to the field in three respects by deriving the factors that affect the sales of KOSPI- and KOSDAQ-listed companies from the national statistical indicators of the Korean government.
VOLUME 11, 2023
First, the subject of research using macroeconomic indicators was expanded to industries that are characteristic of companies. Most of the related studies analyzed the correlation between macroeconomic indicators, and studies analyzing the correlation between macroeconomic indicators and one company are insufficient or have not been studied in detail yet. Therefore, there were limitations in presenting the relationship between macroeconomic indicators and companies. To improve this, the research target was expanded to the industry, which is the characteristic of a company, and the cornerstone was laid for a study that analyzes the relationship between macroeconomic indicators and companies.
Second, the national statistical indicators were suggested to have different indicators affecting the sales of listed companies depending on the type of business, which is the characteristic of the company. Related studies have a limitation in presenting the macroeconomic indicators affecting each industry, because no data exist specifically on the relationship between macroeconomic indicators and industry, which is a characteristic of companies. To improve this, for the first time, an empirical analysis was conducted using machine learning algorithms for national statistical indicators and business characteristics, and standards for which national statistical indicators should be used as basic data for companies to establish management strategies were established.
Third, the optimal machine learning algorithm was shown to vary depending on whether the sales of all listed companies or the sales of listed companies by industry are analyzed, which are data characteristics. Related studies have suggested optimal machine learning algorithms suitable for their analysis data, but they are limited in showing that the optimal algorithm varies with data characteristics when the same analysis data are analyzed comprehensively. To address this, the sales of listed companies and the sales of listed companies by industry were analyzed according to the characteristics of the analysis data, and the optimal machine learning algorithm was found to vary. These results show that differences in analysis data, dependent variable types, and number of predictors change not only the predictive performance of machine learning but also the optimal machine learning algorithm.
The remainder of this paper is organized as follows: Section II discusses the relevant literature. Section III introduces the variable explanation and analysis used in the study, and Section IV presents the machine learning analysis and comparison results. Section V concludes the study and presents directions for future research.
According to Sova and Lukianenko [4], stock indices in developed countries are affected by interest rates and monetary policies. However, stock indices in developing countries are affected from a long-term rather than short-term perspective [4]. Fernandez and Li [5] published a report, which suggested that consumer activity and monetary policy affect the Philippine stock market; it is affected by all macroeconomic variables except interest rates.
Guven et al. [6] suggested that the number of daily deaths and daily increase in patients infected with COVID-19 had a negative effect on the stock index and that the response policy of the government had a positive impact on the stock index.
Loang and Ahmad [7] suggested that before the COVID-19 pandemic, company information (company size, return on assets (ROA), return on equity (ROE), earnings per share (EPS)) and macroeconomic variables (export rate, import rate, real GDP, nominal GDP, FDI, IPI, and unemployment rate) affected stock indices in the US and China. However, after the COVID-19 pandemic, company information had no impact, and only macroeconomic variables affected stock indices.
Liu et al. [8] examined the relationships between the consumer and producer price indices and the futures prices of the China CSI 300 stock index. They suggested that the consumer and producer price indices affected the futures prices of the CSI 300 stock index. Table 1 summarizes the studies in which macroeconomic indicators affect the stock index.
I reviewed previous studies that applied optimal machine and deep learning algorithms to predict the future using macroeconomic indices. Singh [9] analyzed data spanning 25 years from April 22, 1996, to April 16, 2021, using adaptive boosting (AdaBoost), k-nearest neighbors, linear regression, artificial neural network, random forest, stochastic gradient descent, support vector machine, and decision tree algorithms to derive optimal machine and deep learning algorithms to predict the Nifty 50 index of the Indian market. Consequently, he suggested stochastic gradient descent as the optimal algorithm. Bhardwaj et al. [10] used an artificial deep neural network, random forest, gradient boost, ridge regression, and k-nearest neighbors algorithms to predict per capita GDP using 262 growth indicators, development, health, energy, and finance in 33 OECD countries. After analyzing data spanning 22 years from 1996 to 2017, they determined the artificial deep neural network as the optimal algorithm for predicting per capita GDP.
Chatterjee et al. [11] analyzed stock price data from January 2004 to December 2019 using autoregressive integrated moving average (ARIMA), random forest, multivariate adaptive regression splines (MARS), recurrent neural network, and long short-term memory algorithms to determine the optimal algorithm for predicting the future prices of three major stocks on the National Stock Exchange of India. Consequently, they recommended the MARS algorithm as the optimal algorithm for predicting future prices. Lee [12] examined the variable that most significantly affects the sales of printing-related small and medium enterprises (SMEs) among 22 economic statistical indices related to prices, growth, employment, and interest rates. To determine the optimal algorithm, they analyzed the sales of printing-related SMEs from August 2013 to November 2021 using the random forest, extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) algorithms. They presented the consumer price index and the cost-of-living index for living necessities, which are price indices, as essential variables that affect the sales of printing-related SMEs. They also found random forest (RF) as the optimal algorithm. Moreover, Lee [13] explored the index that most influenced drugstore sales among the 28 government statistical indices related to price, economy, employment, and interest rate and analyzed the sales of drugstores from January 2016 to December 2021 using RF, extreme gradient boosting, light gradient boosting machine, and categorical boosting (CatBoost) algorithms to determine the optimal algorithm. They found the economic sentiment index, cyclical component of the coincidence index, and consumer sentiment index as the variables influencing drugstore sales. They found RF as the optimal algorithm.
Gaspareniene et al. [14] used the decision tree, RF, linear regression, extreme gradient boosting, feedforward neural network, recurrent neural network, and long short-term memory algorithms to predict the S&P 500 index using 27 indicators related to macroeconomic, labor market, real estate market, credit market, and money supply. Results showed that the treasury bill, crude oil price, and personal savings were the important variables affecting the S&P 500 index, and RF was presented as the optimal algorithm.
Bharat et al. [15] used the linear regression, classification and regression trees (CART), generalized linear model (GLM), ARIMA, autoregressive moving average with exogenous (ARMAX), and vector autoregressive (VAR) algorithms to predict crude oil prices using macroeconomic indicators such as GDP, interest rates, investment, unemployment rate, US dollar exchange rate, and consumer preference index. The results of the analysis showed that the US dollar exchange rate and consumer preference index were the important variables affecting the price of crude oil, and ARMAX was the optimal algorithm. Table 2 summarizes the studies that used machine learning algorithms and macroeconomic indicators.
Studies in which macroeconomic indicators affect the stock index commonly present economic indicators that affect the stock index the most among macroeconomic indicators. Studies predicting economic indicators through machine learning analysis commonly present economic indicators that affect dependent variables among macroeconomic indicators, and studies comparing and analyzing multi-machine learning algorithms present optimal machine learning algorithms suitable for analysis data characteristics.
As such, related studies use macroeconomic indicators to present those that most affect stock index, GDP per capita, sales of printing companies, pharmacy sales, and crude oil prices and explain a significant relationship with dependent variables. However, after analyzing related studies, limitations of the study were found. Macroeconomic variables that affect dependent variables are all different. This may be a natural result because the independent and dependent variables used in the study are different. However, according to the studies of Lee [12] and Lee [13], independent variables commonly used national statistical indicators; additionally, only dependent variables were analyzed by dividing them into printing companies and pharmacies among small businesses, and macroeconomic variables that affected them were different. Printing companies and pharmacies were classified into manufacturing, wholesale, and retail industries, and the research results were presented differently depending on the industry.
In other words, related studies only explain the relationship between macroeconomic indicators and dependent variables, but there is a limit to the detailed analysis on how macroeconomic variables affect dependent variables according to related industries or business characteristics. Owing to these limitations, it is not known whether macroeconomic indicators that represent social and economic conditions affect companies and related industries, whether they affect companies in common depending on corporate characteristics, or whether macroeconomic indicators that affect them vary depending on corporate characteristics.
Accordingly, this study analyzes whether macroeconomic indicators affect corporate characteristics, identifies which indicators affect corporate sales the most according to corporate characteristics, and proposes an algorithm suitable for predicting corporate sales.

III. RESEARCH METHOD

A. DATASET
To explore whether national statistical indices representing social and economic conditions affect the sales of KOSPI- and KOSDAQ-listed companies, 58 statistical indices that were compiled annually among 743 national statistical indices were selected as features. The 58 statistical indicators consisted of 6 traffic infrastructure-related, 14 growth-related, 4 leisure-related, 5 price-related, 4 customs-related, 6 production-related, 4 employment-related, 3 interest-rate-related, 3 population-related, and 6 other indicators. Table 3 shows the variable and basic statistics for the feature dataset by category for the 58 statistical indices.
As the dependent variable, sales of KOSPI- and KOSDAQ-listed companies were selected. Specifically, the annual sales of 1,470 KOSPI- and KOSDAQ-listed companies with more than 20 years of business among those registered with the Korea Listed Companies Association were selected. The 1,470 companies consisted of 998 manufacturing, 59 construction, 125 wholesale and retail, 159 information and communication, and 129 finance and insurance companies. Table 4 shows the number and basic statistics of the 1,470 companies by industry.

B. MACHINE LEARNING ANALYSIS
Python (version 3.9.7) was used for the machine learning analysis. In addition, machine learning algorithms such as RF, gradient boost, XGBoost, AdaBoost, and CatBoost were used for regression analysis.
Ensemble learning refers to a technique that produces multiple classifiers. It combines their predictions to derive accurate final predictions, and the types of ensemble learning are divided into voting, bagging, and boosting. In voting and bagging, multiple classifiers determine the final prediction result through voting. Voting refers to combining classifiers with different algorithms; with bagging, all classifiers are based on the same type of algorithm, but learning is performed with different data sampling. Boosting is a learning method that reduces errors by weighting incorrectly predicted data while sequentially learning and predicting weak learners.
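The three ensemble types described above can be sketched with scikit-learn regressors. This is only an illustration on synthetic data: the data, the base learners, and all parameter values are assumptions and do not reflect the configuration used in this study. For regression, voting averages the predictions of the combined learners instead of taking a majority vote.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): 200 samples, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Voting: combine learners built with different algorithms.
voting = VotingRegressor([
    ("lr", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
])

# Bagging: one algorithm, each learner trained on a different bootstrap sample.
bagging = BaggingRegressor(
    DecisionTreeRegressor(max_depth=4), n_estimators=20, random_state=0
)

# Boosting: weak learners trained sequentially, with poorly predicted
# samples receiving more weight in later rounds.
boosting = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=2), n_estimators=20, random_state=0
)

models = {"voting": voting, "bagging": bagging, "boosting": boosting}
predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```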
A representative bagging method, RF, achieves high accuracy by maximizing randomness in sample and variable selection, which improves predictive power and reduces overfitting, the main problems with decision trees.
Boosting methods include AdaBoost, gradient boost, XGBoost, and CatBoost. AdaBoost is a boosting method that weights error data. Gradient boost is similar to AdaBoost but uses gradient descent to update weights.
XGBoost alleviates the problems of gradient boost such as slow execution time and lack of regularization. It reduces the execution time with parallel processing support, has a strong durability against overfitting owing to the addition of the regularization function, and has an optimized number of repetitions with its built-in cross-validation function, suggesting excellent prediction performance [16].
CatBoost is based on gradient boost. However, gradient-boost-based algorithms have the problem of one-hot encoding, which rapidly increases the number of variables and slows the learning speed of the model when using categorical variables. CatBoost was created to address this issue, exhibiting excellent performance when analyzing datasets composed of categorical variables. Table 5 lists the packages according to the machine learning algorithms.
The analysis procedure involved data preprocessing, model learning, and model evaluation, as shown in Fig. 1. A detailed examination of each step is as follows.
Data preprocessing was performed before model learning in the sequence of data matching, cleaning, and transformation, as follows: First, in the data matching step, the data layouts of the 58 statistical indicators used as features and of the sales of the 1,470 listed companies used as targets were analyzed. As shown in Fig. 2, features are composed of years and indicator values, and targets are composed of years and sales. Because features and targets contain no unique identification information and the year is the only common variable, data were combined based on the year using nonparametric matching, a statistical matching method among the data matching methods [17], [18], [19].
Data matching is a technique for combining different datasets. It is divided into exact and statistical matching depending on the existence or absence of data combinations using unique identifiers such as resident registration, business registration, passport, and license numbers. Exact matching combines data having the same identifier value for datasets having unique identifiers. Statistical matching combines data when a dataset does not have unique identifiers [13], [17], [19], [20].
Statistical matching is divided into non-parametric and parametric estimation matching, depending on whether data are combined using the common and unique variables in the datasets. Non-parametric estimation matching combines data using only the common variables in each dataset. In contrast, parametric estimation matching combines data using both the common and unique variables in each dataset [13], [18], [19].
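Combining features and targets on the year alone can be illustrated with a simple join. The sketch below uses a pandas inner merge as a stand-in for the matching step; the column names, company labels, and all values are hypothetical, and the exact matching procedure used in this study may differ.

```python
import pandas as pd

# Hypothetical layouts: features hold year + indicator values,
# targets hold year + sales. All numbers are made up for illustration.
features = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "consumer_price_index": [63.2, 65.7, 67.5],
    "employment_rate": [58.5, 59.0, 60.1],
})
targets = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2003],
    "company": ["A", "B", "A", "B", "A", "A"],
    "sales": [120.0, 80.0, 130.0, 85.0, 140.0, 150.0],
})

# No unique identifier links the two tables, so rows are combined on the
# only common variable (year). Years absent from either table are dropped.
merged = targets.merge(features, on="year", how="inner")
```

Each company-year row receives the indicator values of the same year, so every company in a given year shares one set of feature values.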
Second, in the data cleaning step, features had different management and presentation timings according to indicators, and indicators that had been manually managed before the establishment of the information system did not have values for every year. The targets also had different start-up dates, so the years for which sales were reported differed with each company's business history. For accurate data analysis, features that did not have yearly values were deleted, and the target data were refined to the reporting years common to the companies. Based on the refined features and target data, the data analysis period was set from 2000 to 2021, the period for which years existed in common.
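A minimal sketch of this cleaning step is shown below, assuming a table with one row per year and one column per indicator. The indicator names and values are hypothetical. The sketch first restricts the data to the 2000 to 2021 window and then drops any indicator that lacks a value in one of those years.

```python
import numpy as np
import pandas as pd

years = list(range(1998, 2023))
# Hypothetical indicators: one complete, one with no values before 2005.
features = pd.DataFrame({
    "year": years,
    "indicator_complete": np.linspace(1.0, 2.0, len(years)),
    "indicator_partial": [np.nan if y < 2005 else 1.0 for y in years],
})

# Keep only the common analysis window (both bounds inclusive).
window = features[features["year"].between(2000, 2021)]

# Drop indicators that are missing a yearly value inside the window.
cleaned = window.dropna(axis="columns")
```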
Third, in the data transformation step, features and targets showed different units and ranges of indicator values according to indicators. In particular, the influence of the features on the target must be identified with the units and ranges of the indicator values on the same basis so that variable importance and performance can be compared between features. To solve this problem, feature scaling, an operation that adjusts the values and ranges of features and targets to a common level, was performed by converting all variables into log values to bring them closer to a normal distribution.
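The log conversion can be sketched as follows. The column names and values are hypothetical, and the use of the natural logarithm on strictly positive values is an assumption, since the base of the logarithm and the handling of zero or negative values are not specified here.

```python
import numpy as np
import pandas as pd

# Hypothetical values on very different scales (all strictly positive).
df = pd.DataFrame({
    "foreign_exchange_reserves": [96_198.0, 102_821.0, 121_413.0],
    "industrial_accident_rate": [0.73, 0.77, 0.77],
    "sales": [1.2e9, 1.5e9, 1.9e9],
})

# The natural log is only defined for positive inputs, so check first.
assert (df > 0).all().all()

# Element-wise log of every feature and target column.
log_df = np.log(df)
```

After the transformation, the spread between the smallest and largest values across columns is far narrower than in the raw data, which is the purpose of the scaling step.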
Model learning is the process that determines the best performance evaluation index based on the analysis data, and it is performed in the following sequence: First, the analysis data are divided into training and testing data. The reason is that testing data are needed to evaluate how well the model has learned from the training data. In addition, 80% is set as training data and 20% as testing data through random functions to prevent sampling bias of training and testing data. Generally, the performance for time-series data is evaluated by using past data as training data and the latest data as testing data. However, there is a problem of overfitting the testing data if verifying and correcting the model performance is repeated using fixed testing data. Cross-validation is performed to prevent this problem, which consists of training and evaluation after composing training and validation datasets of several different sets to remove data bias [12], [13].
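The random 80/20 split followed by cross-validation on the training portion can be sketched as below. The synthetic data, the choice of gradient boost as the example model, the shuffled five-fold splitter, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data (assumption): 300 samples, 8 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

# Random split: 80% training data, 20% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validation uses only the training data, so the testing data stay
# untouched until the final evaluation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(
    model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
)
cv_mse = -scores.mean()  # scikit-learn reports MSE as a negative score
```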
In this study, 5 folds were used for cross-validation, the hyperparameter setting range for each machine learning algorithm was set as shown in Table 6, and the optimal hyperparameters were designated through GridSearchCV provided by Scikit-Learn within the specified range. Second, regression analysis was performed to calculate the optimal performance evaluation indicators, namely mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), by applying machine learning algorithms such as RF, gradient boost, XGBoost, AdaBoost, and CatBoost.
G. Lee: Exploring Predictive Variables Affecting the Sales of Companies
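A grid search with five-fold cross-validation can be sketched with scikit-learn's GridSearchCV as follows. The parameter grid here is hypothetical and deliberately small; the ranges actually searched in this study are those of Table 6. The synthetic data and the choice of gradient boost as the example estimator are also assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption): 200 samples, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Hypothetical search ranges, not the values from Table 6.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Every combination in the grid is scored with 5-fold cross-validation.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # flip the sign of the negative-MSE score
```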
MAE was obtained by averaging the absolute values of the differences between the target value $y_i$ and predicted value $\hat{y}_i$ over the $n$ observations as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

The MSE was obtained by averaging the squares of the differences between the target value $y_i$ and predicted value $\hat{y}_i$ as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

The RMSE is the square root of MSE as follows:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
The closer MAE, MSE, and RMSE are to zero, the more accurate the model and the better is the applied machine learning algorithm [9], [12], [13], [16].
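As a small worked example of these three indicators, the sketch below computes them directly from their definitions and checks the result against scikit-learn's metric functions. The target and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up target and predicted values.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Direct computation from the definitions.
errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean of |y_i - y_hat_i|
mse = np.mean(errors ** 2)      # mean of (y_i - y_hat_i)^2
rmse = np.sqrt(mse)             # square root of the MSE

# The same quantities from scikit-learn, for comparison.
mae_sklearn = mean_absolute_error(y_true, y_pred)
mse_sklearn = mean_squared_error(y_true, y_pred)
```

Here the absolute errors are 0.5, 0, 1.5, and 1, so the MAE is 0.75, and the squared errors sum to 3.5, so the MSE is 0.875.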
The model evaluation stage involves comparatively analyzing the performance evaluation indices according to machine learning algorithms and deriving the importance of features that influence the target. This was performed sequentially as follows: First, the optimal machine learning algorithm was determined by analyzing MAE, MSE, and RMSE, which are the metrics for evaluating the regression performance of the machine learning algorithm.
Second, the importance of the features influencing the target variable was derived using the optimal machine learning algorithm. The machine learning algorithms used in this study are based on decision trees. Decision trees are machine learning models that predict target variables by repeatedly branching the data on specific features, like a tree. When nodes are divided during training, Gini impurity is used as the criterion for classification and the MSE for regression, and each branch is chosen so that the information gain is the highest. The information gain is a general term for the performance gain obtained by branching a decision tree on a specific feature, and the feature importance is calculated based on this.
The feature importance is a number indicating how much the variable affects the target for each machine learning algorithm, and the greater the number, the higher the importance [12], [13], [21]. Specifically, through feature importance, the statistical index with the highest importance among the national statistical indices is taken to have the greatest predictive power for the sales of KOSPI- and KOSDAQ-listed companies.
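Ranking features by importance and keeping the top 10% can be sketched as follows. The 58 feature names and the synthetic target are hypothetical, and the impurity-based `feature_importances_` attribute of scikit-learn's gradient boosting regressor is used here as one common way to obtain such scores, which may not match the exact procedure of this study.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumption): 58 hypothetical indices, 400 samples.
rng = np.random.default_rng(1)
n_features = 58
names = [f"index_{i:02d}" for i in range(n_features)]
X = pd.DataFrame(rng.normal(size=(400, n_features)), columns=names)

# The target depends strongly on index_00 and weakly on index_01.
y = 5.0 * X["index_00"] + 1.0 * X["index_01"] + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importance = pd.Series(model.feature_importances_, index=names)
importance = importance.sort_values(ascending=False)

# Top 10% of 58 features, rounded up: 6 features.
top_k = int(np.ceil(0.10 * n_features))
top_features = importance.head(top_k)
```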
Third, the performance evaluation index of machine learning algorithms was compared and analyzed by industry. The difference was obtained by deriving feature importance to understand whether the optimal machine learning algorithm and variables affecting sales by industry vary depending on the industry of the listed companies.
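The by-industry comparison can be sketched as a loop that fits each candidate algorithm per industry, selects the one with the lowest test RMSE, and reads the most important feature from it. Everything here is an assumption made for illustration: the data are synthetic, the industry labels and feature names are hypothetical, and only three scikit-learn algorithms are compared (XGBoost and CatBoost are omitted because they require separate packages).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (
    AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor,
)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
feature_cols = ["f0", "f1", "f2", "f3"]  # hypothetical feature names
industries = ["manufacturing", "construction", "wholesale_retail"]

# Synthetic data in which a different feature drives sales in each industry.
frames = []
for k, industry in enumerate(industries):
    X_ind = rng.normal(size=(150, 4))
    frame = pd.DataFrame(X_ind, columns=feature_cols)
    frame["sales"] = 2.0 * X_ind[:, k] + rng.normal(scale=0.2, size=150)
    frame["industry"] = industry
    frames.append(frame)
data = pd.concat(frames, ignore_index=True)

candidates = {
    "RF": lambda: RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoost": lambda: GradientBoostingRegressor(n_estimators=50, random_state=0),
    "AdaBoost": lambda: AdaBoostRegressor(n_estimators=50, random_state=0),
}

results = {}
for industry, group in data.groupby("industry"):
    X_tr, X_te, y_tr, y_te = train_test_split(
        group[feature_cols], group["sales"], test_size=0.2, random_state=0
    )
    fitted, rmse = {}, {}
    for algo, make in candidates.items():
        fitted[algo] = make().fit(X_tr, y_tr)
        pred = fitted[algo].predict(X_te)
        rmse[algo] = float(np.sqrt(mean_squared_error(y_te, pred)))
    best = min(rmse, key=rmse.get)  # lowest test RMSE wins
    top = feature_cols[int(np.argmax(fitted[best].feature_importances_))]
    results[industry] = {"best_algorithm": best, "rmse": rmse[best], "top_feature": top}
```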

A. RESULTS OF MACHINE LEARNING ANALYSIS OF THE SALES OF LISTED COMPANIES
MAE, MSE, and RMSE of the regression performance evaluation indices were analyzed for each machine learning algorithm using the data of 1,470 companies to determine the national statistical indices with the largest influence on the sales of KOSPI- and KOSDAQ-listed companies. The results are summarized in Table 7.
From the regression analysis results in Table 7, gradient boost showed a better regression performance than that of RF, XGBoost, AdaBoost, and CatBoost according to the evaluation criteria of the regression performance evaluation indicators, MAE, MSE, and RMSE. In other words, in predicting whether national statistical indicators affect the sales of KOSPI- and KOSDAQ-listed companies, gradient boost can be considered the optimal machine learning algorithm.
The top six features by importance, corresponding to the top 10% of the 58 independent variables, were extracted using gradient boost, the optimal machine learning algorithm. They are shown in Fig. 3 and Table 8.
In descending order of feature importance, Fig. 3 shows the industrial accident rate, foreign exchange reserves, economically active population, external bonds, manufacturing production index, and gold. Among them, the industrial accident rate had the largest value, indicating that it had the highest predictive power for the sales of listed companies. Furthermore, the feature importance results in Table 8 were examined against the feature categories in Table 3. The sales of KOSPI- and KOSDAQ-listed companies were affected more by working conditions, growth, employment, stability, production, and raw materials than by prices, leisure, population, and transportation infrastructure.
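The selection of the top 10% can be sketched as follows. Rounding 5.8 up to six features is an assumption that matches the count reported here; the feature names and scores are placeholders, not the study's values.

```python
import math


def top_fraction(importance, fraction=0.10):
    """Return the highest-importance (name, score) pairs, rounding the count up."""
    k = math.ceil(len(importance) * fraction)
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# 58 placeholder features with made-up, strictly decreasing scores.
importance = {f"index_{i:02d}": 1.0 / (i + 1) for i in range(58)}
top = top_fraction(importance)
```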

B. MACHINE-LEARNING-ANALYSIS RESULTS FOR THE SALES OF LISTED COMPANIES BY INDUSTRY
Listed companies were classified by industry into manufacturing, construction, wholesale and retail, information and communication, and finance and insurance to determine whether the optimal machine learning algorithm and the variables affecting sales varied with industry. Subsequently, the regression performance evaluation indices MAE, MSE, and RMSE were examined for each machine learning algorithm. The results are summarized in Table 9.
According to the regression analysis results in Table 9, Gradient Boost showed the best regression performance in the manufacturing, construction, and information and communication industries on the regression performance evaluation indicators MAE, MSE, and RMSE. CatBoost showed the best regression performance in the wholesale and retail industry, whereas RF showed the best regression performance in the finance and insurance industry. In other words, the optimal machine learning algorithm for predicting sales from national statistical indicators varied with the industry among KOSPI- and KOSDAQ-listed companies.
Using the optimal machine learning algorithm for each industry, the top six features by importance, corresponding to the top 10% of the 58 independent variables, were extracted for each industry, as shown in Figs. 4-8.
Fig. 4 shows the feature importance for the manufacturing industry. The descending order of feature importance is as follows: industrial accident rate, economically active population, wholesale and retail index, foreign exchange reserves, occupational disaster death toll, and export volume index. Among these, the industrial accident rate shows the highest predictive power for manufacturing company sales.
Fig. 5 shows the feature importance for the construction industry. The descending order of feature importance is as follows: gold, cost-of-living index, base rate, industrial accident rate, import price index, and occupational disaster death toll. Among these, gold shows the highest predictive power for construction company sales.
Fig. 6 shows the feature importance for the wholesale and retail industry. The descending order of feature importance is as follows: number of automobiles produced, cost-of-living index, export amount index, domestic cargo volume, export price index, and international cargo volume. Among these, the number of automobiles produced shows the highest predictive power for wholesale and retail company sales.
Fig. 7 shows the feature importance for the information and communication industry. The descending order of feature importance is as follows: foreign exchange reserves, gold, industrial accident rate, economically active population, external bonds, and automobile sales index. Among these, foreign exchange reserves show the highest predictive power for information and communication company sales.
Fig. 8 shows the feature importance for the finance and insurance industry. The descending order of feature importance is as follows: industrial accident rate, number of automobiles produced, number of international passengers, number of foreign tourists visiting Korea, number of domestic passengers, and automobile sales index. Among these, the industrial accident rate shows the highest predictive power for finance and insurance company sales.
Table 10 summarizes the top six features by importance for each industry.
Comparing the feature importance by industry in Table 10 with the feature categories in Table 3 revealed differences. The manufacturing industry and the finance and insurance industry were highly affected by working conditions. In contrast, the wholesale and retail industry and the information and communication industry were highly affected by growth, and the construction industry was highly affected by raw materials.
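The per-industry choice of an optimal algorithm in this subsection can be sketched as a simple selection over a table of scores. The rule used here, lowest RMSE with ties broken on MAE, is an assumption, since the study reports MAE, MSE, and RMSE together without stating a tie-breaking rule. All numbers are invented placeholders and do not reproduce Table 9.

```python
def best_algorithm(scores):
    """Return the algorithm with the lowest RMSE, breaking ties on MAE."""
    return min(scores, key=lambda name: (scores[name]["RMSE"], scores[name]["MAE"]))


# Hypothetical scores for two industries and three algorithms.
results = {
    "manufacturing": {
        "Gradient Boost": {"MAE": 0.8, "RMSE": 1.0},
        "CatBoost": {"MAE": 0.9, "RMSE": 1.2},
        "RF": {"MAE": 1.0, "RMSE": 1.3},
    },
    "wholesale and retail": {
        "Gradient Boost": {"MAE": 0.7, "RMSE": 1.1},
        "CatBoost": {"MAE": 0.6, "RMSE": 0.9},
        "RF": {"MAE": 0.9, "RMSE": 1.4},
    },
}
best_by_industry = {
    industry: best_algorithm(scores) for industry, scores in results.items()
}
```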

C. COMPARISON OF REGRESSION ANALYSIS RESULTS BY THE MACHINE LEARNING ALGORITHMS
The performance of the machine learning algorithms was compared to determine the most suitable algorithm for predicting the sales of KOSPI- and KOSDAQ-listed companies. Using the optimal machine learning algorithm, feature importance was then compared to identify the variables with the largest influence on the sales of listed companies and to determine whether those variables varied with the industry. From the results, the following observations were made.
First, the results of the machine learning analysis of listed companies are shown in Fig. 9. Gradient Boost was the optimal machine learning algorithm according to the regression performance evaluation indices MAE, MSE, and RMSE.
Second, the machine learning analysis results of listed companies by industry are shown in Figs. 10-12. According to the regression performance evaluation indices MAE, MSE, and RMSE, the optimal machine learning algorithm was Gradient Boost for manufacturing, construction, and information and communication, CatBoost for wholesale and retail, and Random Forest for finance and insurance.
Third, the results of checking the top six features with the highest importance among the listed companies showed that the variable with the greatest influence was the industrial accident rate in the working conditions category, as shown in Table 8.
Fourth, the results of checking the top six features with the highest importance among the listed companies by industry showed that the variables with the highest influence varied by industry (Table 10). The most influential variable was the industrial accident rate, related to working conditions, in the manufacturing industry and the finance and insurance industry; gold, related to raw materials, in the construction industry; the number of automobiles produced, related to growth, in the wholesale and retail industry; and foreign exchange reserves, related to growth, in the information and communication industry. Table 11 summarizes the national statistical indicators that significantly affect the sales of listed companies, both overall and by industry, based on the results of the machine learning regression analysis.
Consequently, Gradient Boost was confirmed to be the optimal machine learning algorithm for listed companies as a whole when national statistical indicators were used to predict the sales of KOSPI- and KOSDAQ-listed companies, whereas the optimal machine learning algorithm differed by industry. In addition, the most influential variables for listed companies were confirmed to vary with the industry.
These results empirically address a limitation of related studies, which did not identify the macroeconomic indicators that affect each industry. Because they were obtained using national statistical indicators and the sales of KOSPI- and KOSDAQ-listed companies, they suggest which national statistical indicators should be used as basic data to enhance management activities.

V. CONCLUSION
This study aimed to investigate whether national statistical indices, created to understand social and economic conditions, affect company sales. Furthermore, it aimed to determine the optimal machine learning algorithm by exploring the variables that most affected the sales of KOSPI- and KOSDAQ-listed companies and by comparing the performance of several machine learning algorithms.
For this purpose, 58 national statistical indices provided by the Statistics Korea e-Nara Index and Bank of Korea Economic Statistical System and the annual sales of 1,470 KOSPI- and KOSDAQ-listed companies registered in the Korea Listed Companies Association with more than 20 years of history were used as analysis data.
Model training and evaluation were performed using the machine learning algorithms RF, Gradient Boost, XGBoost, AdaBoost, and CatBoost. The importance of the features that influence the sales of KOSPI- and KOSDAQ-listed companies, both overall and by industry, was derived and compared across the machine learning algorithms. The regression performance evaluation metrics MAE, MSE, and RMSE were employed. The analysis showed that Gradient Boost was the optimal machine learning algorithm for listed companies as a whole. By industry, Gradient Boost was optimal for the manufacturing, construction, and information and communication industries, CatBoost for the wholesale and retail industry, and Random Forest for the finance and insurance industry.
In addition, upon examining feature importance to explore the variables that most affected the sales of listed companies using the optimal machine learning algorithm, the industrial accident rate was found to have the greatest predictive power for the sales of listed companies among national statistical indicators. By industry, the variable with the highest predictive power was the industrial accident rate for the manufacturing industry and the finance and insurance industry, gold for the construction industry, the number of automobiles produced for the wholesale and retail industry, and foreign exchange reserves for the information and communication industry. This result confirmed that the national statistical indices that affected the sales of listed companies varied by industry.
In other words, when a company establishes a management strategy, it should use as basic data the national statistical indicators that are most influential for its industry, which is a characteristic of the company.
This study determined the optimal machine learning algorithm for the variables and data characteristics of the national statistical indicators affecting the sales of KOSPI- and KOSDAQ-listed companies. However, the study has limitations, which are discussed below.
First, in terms of research subjects, KOSPI- and KOSDAQ-listed companies were compared and analyzed by industry. The analysis therefore covers companies that have equity capital of more than 1.5 billion won, one of the corporate size requirements, and that meet the KOSPI and KOSDAQ listing conditions. Companies are classified by size into large, medium-sized, small and medium-sized, and small companies. Only some large, medium-sized, and small and medium-sized companies are listed on KOSPI and KOSDAQ, so small and medium-sized companies with equity capital of less than 1.5 billion won were not included in the study. Owing to this limitation of the collected research subjects, the national statistical indicators were analyzed by industry, which is a characteristic of a company, for KOSPI- and KOSDAQ-listed companies, but how the effect of the national statistical indicators differs with company size could not be analyzed. Therefore, the research subjects should be expanded to cover companies of different sizes. The research results can be further generalized if the national statistical indicators that affect corporate sales are compared and analyzed by company size using corporate sales data from corporate credit rating agencies.
Second, in terms of variables, 58 statistical indicators representing social and economic conditions were used as independent variables, and the annual sales data of KOSPI- and KOSDAQ-listed companies were used as the dependent variable. The Korea Listed Companies Association collects and manages sales data for KOSPI- and KOSDAQ-listed companies once a year, so only annual sales data exist, and monthly and quarterly sales data are not managed. Therefore, the 58 statistical indicators that are reported annually, out of 743 national statistical indicators, were used to derive the predictive variables that affected the sales of listed companies. However, there is a limit to generalizing the research results because the predictive variables were derived from only 58 statistical indicators. If the Korea Listed Companies Association shortens its collection cycle for the sales data of KOSPI- and KOSDAQ-listed companies from once a year to once a quarter, quarterly statistical indicators can be added to the 58 annual indicators, which would expand the set of independent variables and, in turn, the predictive variables found to affect the sales of KOSPI- and KOSDAQ-listed companies.
By expanding and examining the variables and research subjects, which are the limitations of this study, companies can be provided with a considerable amount of data that can be used to develop management strategies.
This study is the first to empirically analyze the national statistical indicators a company should use to establish management strategies, and it is expected to provide a good foundation for follow-up studies using the sales growth of companies and the national statistical indices.