Equity Research Report-Driven Investment Strategy in Korea Using Binary Classification on Stock Price Direction

This research proposes and examines an investment strategy that combines natural language processing of the equity research reports published in the Korean financial market with machine learning algorithms for binary classification. First, we extract the part-of-speech from each report using KoNLPy and Mecab. Then, we define 33 features as the input variables and perform binary classification on the price direction of the stocks recommended in the reports using various machine learning algorithms. We investigate the model performance in detail by dividing the entire period into three sub-periods: pre-COVID-19 for the sideways market, COVID-19 for the crashing market, and post-COVID-19 for the extremely bullish market. We confirm that the random forest is the best classifier for all periods, so we use the stocks it predicts as positive in the test set as the investment universe for monthly re-balancing and buy-and-hold investment. The proposed strategy shows a significantly higher return on investment than the benchmarks during the pre-COVID-19 and COVID-19 periods, whereas its return is comparable to the benchmarks during the post-COVID-19 period.


I. INTRODUCTION
Financial companies periodically issue research reports for investors. The reports cover companies, financial institutions, diplomatic issues between countries, and politics. Among them, this study focuses on the equity research report, which recommends one specific stock at a time. Usually, analysts present their perspective on a stock that they expect to show high returns in the future, based on various quantitative and qualitative analyses. However, the realized return varies across reports. One reason is that the author of a report may lack sufficient analytical skill, which results in low-quality reports. In this study, we assume that the composition of equity research reports, quantified through natural language processing (NLP), can distinguish the reliability of the stock recommendations.
In the 2010s, the volume of digital online content exploded, including market analysis reports, news articles, journal texts, online blogs, and social media. Accordingly, research on analyzing public sentiment, especially opinion mining in social media, has become essential. As market prediction using NLP algorithms has been studied in the financial field, a research field called natural language-based financial forecasting has gradually been established [1]-[4]. In particular, the stock market has received great attention in academia due to its sensitivity to market participants' sentiment. That is, investors' sentiment can change the overall trend of individual stocks and even the market. Many previous studies have analyzed investors' opinions and market sentiment from social media posts regarding financial markets. Some studies extract and analyze the mood of text from social media such as Twitter [5]-[9], news [10], and bulletin boards [11], [12], and utilize it for market prediction. The main objective of such research is a binary classification of positive or negative moods from the text, in order to discover its correlation with the movement of the market or of a company's stock price. The methods that have been used for analysis include the Naive Bayes approach [12]-[16], support vector machine [17]-[20], and decision tree [21].
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Hao Chen.
In addition to these machine learning algorithms, many studies have also utilized deep learning algorithms such as the Artificial Neural Network [22]-[24] and the Recurrent Neural Network [25]-[27]. Depending on how the features of the text are extracted, they may not reflect the movement of the stock price well. Hence, the extraction of features representing the characteristics of natural language is an important topic in NLP. Feature engineering methods include linguistic features [28], [29], keyword extraction [30], data reduction with a generative probabilistic model [31], and word embedding with n-grams [28], [32], TF-IDF [33], ensemble models [34], and deep learning [35].
Furthermore, there have been many studies regarding the derivation of investment strategies using NLP. Several studies have analyzed market sentiment information using a machine learning algorithm to construct a portfolio. Such studies either design a neural network with an ensemble of evolving clustering and LSTM [36], or propose a new follow-the-loser portfolio strategy from the posts of stock micro-blogs using a semi-supervised learning method [37], or establish a trading strategy from news sentiment data using learning-to-rank algorithms [38]. Also, recently, a portfolio investment strategy that considers shareholders' confidence index by combining the existing random forest and sentiment analysis [39] and an investment strategy that encodes external information from financial news using reinforcement learning [40] have been proposed.
However, there have been limited efforts to establish an investment strategy based on the NLP of equity research reports published in the Korean financial market. Therefore, in this paper, we analyze the reports through NLP and investigate whether the extracted information can be utilized for an investment strategy. First, the NLP elements are derived by quantifying the structure of each report in the form of part-of-speech (POS). Then, using the NLP elements as input features, we construct a binary classification model that predicts whether the stock recommended in a report produces a positive or negative return. Several machine learning algorithms are applied, and the model with the best classification performance is selected for the experiment. Finally, we propose an investment strategy that buys the stocks predicted by the selected classifier to yield a positive future return. To show the superiority of the proposed investment strategy, we compare its investment returns with two benchmarks: the strategy of investing in all the stocks recommended by the reports, and the market index. In addition, to investigate whether the proposed investment strategy performs consistently under various market conditions, the investment returns of different periods are analyzed separately.

A. EQUITY RESEARCH REPORT
This study utilizes 34,780 equity research reports on stocks traded in Korean financial markets, published from 2019-01-01 to 2020-06-12. Securities firm analysts write the reports and provide them in Portable Document Format (PDF) on the firm's website. Each report recommends one stock at a time, providing information on the target price, the current status and direction of the underlying company's business, and the degree of recommendation on the stock. The number of reports varies from month to month, as summarized in Table 1. A total of 1,118 stocks were recommended within the 34,780 research reports, and the reports are concentrated on a limited number of stocks, as presented in Figure 1. Specifically, the top 400 stocks account for 90% of all reports. One reason for such concentration could be investors' preference for stocks with high market capitalization or for thematic investing, which can increase the readership and click rates of reports in favor of analysts. Because of this lack of diversity among the reports, many investors doubt the reports' utility in predicting the future returns of the underlying stock. However, it is also true that most of the reports are carefully written through sufficient research and consist of the analyst's sentimental but reasonable opinions on the company. In this context, we assume that valuable reports with rich information are written based on clear facts, and that their composition and even sentence structure could differ from those of unhelpful reports with limited value. We define a report as valuable if it recommends a stock that yields a positive return in the near future, and as unhelpful if the recommended stock yields a negative return in the near future. Then, we conduct a binary classification based on NLP and various machine learning algorithms to distinguish the composition of valuable reports.

B. FEATURE ENGINEERING WITH NLP
We utilize NLP to define the features for binary classification. First, the contents of each report written in Korean are divided into morpheme units. In English, units of meaning can largely be separated by spacing, whereas a single unspaced Korean word can be divided into two or more morphemes that each carry meaning. To analyze the Korean language, we employ KoNLPy [41], a Python package for NLP of the Korean language, and Mecab [42], a POS tagger that labels each morpheme with one of 43 detailed POS tags. However, 43 POS tags divide a sentence in too much detail and produce many features, which can cause overfitting. Therefore, in this study, the top 10 most used NLP elements are integrated and selected as summarized in Table 2.
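As a rough illustration of how tagged morphemes can be turned into POS-frequency features, the sketch below counts coarse POS groups from (morpheme, tag) pairs. The pairs are assumed to come from a tagger such as KoNLPy's Mecab wrapper; the prefix-based grouping, the group names, and the helper names `coarse_pos` and `pos_frequencies` are hypothetical and do not reproduce the grouping of Table 2.

```python
from collections import Counter

# Hypothetical coarse groups keyed by tag prefix; NOT the grouping of Table 2.
COARSE_GROUPS = {
    "NN": "noun",      # e.g. NNG, NNP, NNB
    "VV": "verb",
    "J": "particle",   # e.g. JKS, JKO, JX
    "E": "ending",     # e.g. EP, EF, EC
    "SL": "english",   # foreign-language token
    "SN": "number",
}


def coarse_pos(tag):
    """Map a detailed tag to a coarse group, or 'other' if no prefix matches."""
    for prefix, group in COARSE_GROUPS.items():
        if tag.startswith(prefix):
            return group
    return "other"


def pos_frequencies(tagged):
    """Count coarse POS groups in a list of (morpheme, tag) pairs."""
    return Counter(coarse_pos(tag) for _, tag in tagged)


# Hand-written example pairs; in practice they might come from
# konlpy.tag.Mecab().pos(text), which requires a local Mecab installation.
tagged = [("매출", "NNG"), ("이", "JKS"), ("증가", "NNG"),
          ("하", "XSV"), ("였", "EP"), ("다", "EF"), (".", "SF")]
freq = pos_frequencies(tagged)
```

Dividing each count by the total number of morphemes would give ratio-style features of the kind discussed later in the feature importance analysis.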
Based on the ten selected POS, we utilize each POS frequency as a feature for the binary classification. Then, we create eight additional features that represent the characteristics of an equity research report: number of morphemes (subwords), average number of morphemes per sentence (mean_subwords_per_sentence), standard deviation of morphemes per sentence (std_subwords_per_sentence), number of sentences ending with da (da), number of sentences (sentence), number of paragraphs (paragraph), number of pages (page), and number of pages with words (page_with_word). We use optical character recognition (OCR) to count the number of pages with words, since some pages contain only tables or pictures. In Korean, a complete sentence ends with da; otherwise, the sentence has omitted elements. A sentence that does not end with da typically conveys only some financial terms and provides limited implications for investment. Therefore, we assume that da can be a feature representing an equity research report's characteristics. Figure 2a shows the distributions of the ten selected POS and the additional features, all of which are skewed. Therefore, we apply a log-transformation to all variables, as illustrated in Figure 2b. As a result, we obtain variables whose distributions are close to the normal distribution, which we then use in the machine learning models for binary classification.
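The sentence-level features and the log-transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions: each sentence is already a list of morphemes with punctuation removed, the population standard deviation is used, and `log1p` (log of 1 + x) stands in for the log-transformation so that zero counts stay finite. The text does not specify these details, and the function names are hypothetical.

```python
import math
import statistics


def sentence_features(sentences):
    """Compute count features from sentences given as lists of morphemes."""
    lengths = [len(s) for s in sentences]
    return {
        "subwords": sum(lengths),
        "sentence": len(sentences),
        "mean_subwords_per_sentence": statistics.mean(lengths),
        # Population standard deviation (an assumption; the text does not say).
        "std_subwords_per_sentence": statistics.pstdev(lengths),
        # Sentences whose last morpheme is the ending "다" (da).
        "da": sum(1 for s in sentences if s and s[-1] == "다"),
    }


def log_transform(features):
    """Apply log(1 + x) to every feature to reduce right skew."""
    return {name: math.log1p(value) for name, value in features.items()}


# Two hand-written example sentences (hypothetical data).
sentences = [["매출", "이", "증가", "하", "였", "다"], ["목표", "주가", "상향"]]
raw = sentence_features(sentences)
transformed = log_transform(raw)
```

Here the second sentence does not end with da, so only one of the two sentences contributes to the `da` count.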

C. BINARY CLASSIFICATION & INVESTMENT STRATEGY
We propose a binary classification based on the pre-processed NLP-driven features, which predicts whether the stock suggested in the equity research report will show a positive or negative return in the future. Specifically, we utilize five well-known models. First, we employ the k-Nearest Neighbors (k-NN) classifier. The k-NN algorithm hinges on the assumption that similar data points are located close to each other [43]. Therefore, it calculates the distance between the test data and the input, which, taking the Euclidean distance, can be obtained as follows:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
where $p$ and $q$ refer to the data points that have coordinates of $(p_1, p_2, \ldots, p_n)$ and $(q_1, q_2, \ldots, q_n)$ in $n$ dimensions, respectively.
46366 VOLUME 9, 2021
Secondly, we utilize the logistic regression using the sigmoid function as follows [44]:
$$H(x) = \frac{1}{1 + e^{-(Wx + b)}}$$
where $H(x)$, $W$, and $b$ correspond to the sigmoid function, weight, and bias, respectively. As $H(x)$ approaches 1 or 0, the value of the cost function decreases or increases, respectively.
Thirdly, we utilize the decision tree, which analyzes and represents patterns between data as a combination of possible rules and is built top-down from the root node [45]. To build a decision tree, we use the entropy, which for an area to which $m$ data points belong can be calculated as follows:
$$\mathrm{Entropy} = -\sum_{k} p_k \log_2 p_k$$
where $p_k$ refers to the proportion of the data points belonging to category $k$. The tree is trained to increase the homogeneity of each area and to reduce the impurity or uncertainty as much as possible; this reduction is called information gain.
Fourthly, we utilize the random forest. Since the decision tree is prone to overfitting, we employ an ensemble model that generates multiple decision trees and votes on each tree's classification result. It is obtained through bagging, which builds each decision tree from data sampled with replacement from the entire training data [46].
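The distance, sigmoid, and entropy quantities described above can be sketched in a few lines. This is only an illustration under assumptions the text does not confirm: the distance is taken to be Euclidean, the weight in the sigmoid is a scalar, and the entropy uses a base-2 logarithm. All function names are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Distance between two points with coordinates p and q in n dimensions."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def sigmoid(x, w, b):
    """Sigmoid of the linear score w * x + b (scalar weight assumed)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def entropy(proportions):
    """Entropy of an area given the share p_k of each category k (base 2)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


distance = euclidean_distance((0.0, 0.0), (3.0, 4.0))
midpoint = sigmoid(0.0, w=1.0, b=0.0)
mixed = entropy([0.5, 0.5])   # evenly mixed area: maximal impurity
pure = entropy([1.0])         # single-category area: no impurity
```

A split that moves an area from the mixed case toward the pure case lowers the entropy, which is the information gain that the decision tree seeks.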
Lastly, we utilize gradient boosting, an ensemble model that produces a robust classifier by combining weak classifiers, typically decision trees [47]. It uses gradient descent: the loss function L is differentiated with respect to the current model output to obtain the gradient, and the model is updated in the direction of the negative gradient so that the loss decreases.
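For reference, a common textbook form of the negative gradient and the update step in gradient boosting is sketched below. The symbols $y_i$ (label of sample $i$), $F_{m-1}$ (ensemble after $m-1$ rounds), $h_m$ (weak classifier fitted to the negative gradients), and $\nu$ (learning rate) are assumptions introduced here for illustration; the specific loss function used in the experiments is not stated in the text.

```latex
r_{i} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
\qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
```

Under this form, each round fits a new weak classifier to the values $r_i$, so the ensemble moves step by step toward a lower loss $L$.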
For the experiment, we divide the data into train (70%) and test (30%) sets. We ensure that the partitioned data carry equivalent distributional characteristics in terms of the number of equity research reports per month as well as the number of reports per stock. Although many prediction problems in financial time series split the data into in-sample and out-of-sample sets by time, our model can utilize random sampling since its explanatory variables do not depend on time. For 50 different random seeds, we compare the classification performances of the five models at different times after the report's release. Based on the model with the best performance, we backtest two investment strategies, monthly re-balancing and simple buy-and-hold over different investment horizons, using the positively predicted stocks in the test set. Then, we compare the investment performance with the benchmarks. A step-by-step scenario of the proposed investment strategy is illustrated in Figure 3.
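One way to picture the repeated 70/30 partition is the sketch below, which shuffles within each stratum so that every stratum keeps roughly the same test share. It is a simplified illustration: it stratifies on the publication month only, whereas the text balances both the number of reports per month and per stock, and the report tuples and the name `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, key, test_ratio=0.3, seed=0):
    """Split items so that each stratum keeps roughly the same test share."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical reports: (report_id, publication month), ten per month.
reports = [(i, "2019-%02d" % (i % 12 + 1)) for i in range(120)]

# One split per random seed, mirroring the 50 seeds used in the experiment.
splits = [stratified_split(reports, key=lambda r: r[1], seed=s)
          for s in range(50)]
train, test = splits[0]
```

Averaging a metric over the 50 splits reduces the dependence of the reported performance on any single random partition.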

III. EMPIRICAL RESULTS AND DISCUSSIONS
A. BINARY CLASSIFICATION PERFORMANCE
As previously stated, we utilize five machine learning models to predict the price direction of the stock recommended in each equity research report, with the report's NLP elements as the features. Table 3 summarizes the hyper-parameters for each model. We compare the binary classification performance of each model at different times in the future in terms of prediction accuracy and area under the receiver operating characteristic curve (AUC). Note that this research's main objective is to examine whether the equity research report's NLP elements can be used to construct an investment strategy. Therefore, based on these two simple measures, we select the model with the highest classification performance, analyze its classification results in detail using the precision, recall, and F1-score, and utilize it to establish an investment strategy.
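The two selection measures can be illustrated with a small pure-Python sketch. The AUC here is computed through its pairwise interpretation, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is an implementation choice made for brevity; the labels, scores, and the 0.5 threshold are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true direction."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive return) and predicted probabilities.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
area = auc(y_true, scores)
```

An AUC of 0.5 corresponds to a ranking no better than chance, which is the reference point used when the market conditions are compared later.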
The models predict the direction of the stock price at 30, 60, 90, 120, 150, and 180 trading days after the report's release. We refer to this horizon as the prediction time. We consider the equity research reports published from 2019-01-01 to 2020-06-12, during which the Korean financial market experienced a sideways period with low volatility (2019-01-01 to 2020-01-20), a collapsing period due to the outbreak of COVID-19 (2020-01-21 to 2020-03-29), and a soaring period with an extremely bullish market (2020-03-30 to 2020-06-30). Specifically, we divide the periods based on the highest and lowest points of KOSPI200, the representative financial market index of Korea, within the entire period. In this way, the classification performance can be evaluated under different market conditions. First, the average classification performance of each model over 50 different random seeds is summarized in Table 4. According to the results, the accuracy and AUC tend to increase as the prediction time increases for all models. This implies that a higher return can be expected when an investment strategy is based on the prediction results for a long investment horizon. Finally, we choose the random forest as the primary classification model since it shows the highest accuracy and AUC for all prediction times.
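The assignment of a report to a market condition follows directly from the date ranges given above and can be sketched as a simple lookup. The period boundaries are taken from the text; the names `PERIODS` and `market_period` are hypothetical.

```python
from datetime import date

# Period boundaries as stated in the text (inclusive on both ends).
PERIODS = [
    ("sideways", date(2019, 1, 1), date(2020, 1, 20)),
    ("collapsing", date(2020, 1, 21), date(2020, 3, 29)),
    ("soaring", date(2020, 3, 30), date(2020, 6, 30)),
]


def market_period(published):
    """Return the market condition for a publication date, or None."""
    for name, start, end in PERIODS:
        if start <= published <= end:
            return name
    return None


example = market_period(date(2020, 2, 15))
```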
The detailed classification performance of the random forest is summarized in Table 5. Compared with the accuracy and AUC, the F1-score is low and nearly invariant across prediction times, which would appear to reduce the utility of the prediction model. Specifically, the low F1-score is caused by the relatively low recall, whereas the precision shares the same pattern as the accuracy and AUC. However, this result does not affect the random forest's utility here, since the proposed investment strategy only utilizes the positively predicted stocks, whose future return is expected to be positive. In a binary classification, a model with high precision but low recall produces relatively fewer false positives than false negatives. In this context, the stocks predicted to be positive are likely to move in the positive direction, although the model cannot detect all stocks with a positive direction. Therefore, we can infer that an investment strategy based on the stocks predicted to yield positive returns can produce a high profit.
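The relationship between precision, recall, and F1-score discussed above can be made concrete with a small sketch. The labels and predictions are hypothetical and are chosen to mimic a cautious classifier that flags few stocks as positive.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1-score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical case: four stocks rise, but only one is flagged as positive.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy case every flagged stock actually rises (precision 1.0) even though most rising stocks are missed (recall 0.25), which pulls the F1-score down without harming a strategy that buys only the flagged stocks.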
We further investigate how the random forest classification varies under different market conditions, as summarized in Table 6. For the reports published during the sideways period, the accuracy increases as the prediction time increases, but the AUC remains around 0.5. Hence, the corresponding investment strategy is expected to produce only a small advantage over investing in all reports. During the collapsing period, both the AUC and F1-score increase as the prediction time increases. Hence, the corresponding investment strategy is expected to produce a high profit relative to investing in all reports. During the soaring period, we observe a high accuracy and F1-score but relatively low AUC values. The high accuracy and F1-score arise because the target variable is biased toward the positive direction during the recovery from the COVID-19 pandemic shock. Therefore, the corresponding investment strategy is not expected to show a significantly different profit compared to investing in all reports.

B. FEATURE IMPORTANCE
Prior to applying the binary classification to the investment strategy, we investigate the feature importance based on the random forest results. The average importance of each NLP element in the random forest is summarized in Figure 4. For the total period in Figure 4a, the most significant feature is the English ratio. Note that a low feature importance indicates no significant influence on predicting the direction of the stock price. Specifically, when the reports are split at the median of the English ratio, the average investment return at the 180-day prediction time is -2.3% for all reports with an English ratio lower than the median, whereas it is 7.8% for all reports with an English ratio higher than the median, which yields a difference of 10.1 percentage points in return. This implies that a report with a relatively high English ratio can be expected to show a higher expected return than a report with a low English ratio. Likewise, the noun ratio, the second most important variable, shows a 7.2% difference in investment return based on the median. In this context, we identify the NLP-elements that positively affect the investment return among the top 15 features showing high importance: English ratio, subwords per page, page word, and subwords. For most other NLP-elements, the lower the value, the higher the investment return. Interestingly, most of the ratios of NLP-elements show higher feature importance than the selected POS and additional features in Figure 3.
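The median-split comparison used above can be sketched as follows. The feature values and returns are hypothetical placeholders, not data from the study, and reports whose value equals the median are simply excluded, which is an assumption the text does not address.

```python
import statistics


def median_split_return_gap(feature_values, returns):
    """Mean return above the feature median minus mean return below it."""
    median = statistics.median(feature_values)
    high = [r for v, r in zip(feature_values, returns) if v > median]
    low = [r for v, r in zip(feature_values, returns) if v < median]
    return statistics.mean(high) - statistics.mean(low)


# Hypothetical English ratios and 180-day returns for four reports.
english_ratio = [0.1, 0.2, 0.3, 0.4]
returns_180d = [-0.03, -0.01, 0.05, 0.09]
gap = median_split_return_gap(english_ratio, returns_180d)

# The figures reported in the text combine the same way: 7.8 - (-2.3) = 10.1.
reported_gap = 7.8 - (-2.3)
```

A positive gap indicates that reports above the median of the feature earned more on average, which is the sense in which a feature is said to positively affect the investment return.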
Furthermore, we examine the feature importance of NLP elements under different market conditions in Figures 4b, 4c, and 4d. Analogous to the total period, the ratios of NLP-elements show high feature importance in all periods. Therefore, we can conclude that, except for the determiner ratio and ending ratio, the ratios of NLP-elements play a more important role than the basic NLP elements regardless of market conditions. Also, subwords per page and number ratio show high feature importance regardless of market conditions. Note that the higher the subwords per page, the higher the investment return, while the lower the number ratio, the higher the investment return.

C. INVESTMENT PERFORMANCE
Finally, we backtest two investment strategies that use the positively predicted stocks in the test set from the 50 random seeds as the investment universe. The first strategy is monthly re-balancing. First, we take a long position on the positively predicted stocks in the test set with equal weight. Then, after a month, we sell all the purchased stocks and repeat the process of taking a long position. Figure 5 shows the monthly average cumulative rate of return. The proposed strategy is shown as a blue line, and the monthly cumulative returns of KOSPI200 and of all stocks recommended in the equity research reports in the test set are provided as benchmarks with sky blue and gray lines, respectively. Note that the vertical lines indicate three standard deviations of the cumulative returns for each month. The result shows that the proposed strategy outperforms both benchmarks. In addition, the strategy of buying all the stocks recommended by the reports slightly exceeds the KOSPI index, which supports some degree of reliability of the equity research reports in recommending stocks.
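The monthly re-balancing procedure can be sketched as a short loop. This is a simplified illustration under assumptions not stated in the text: returns compound from month to month, transaction costs are ignored, and a month with no positively predicted stock earns zero. The input returns and the function name are hypothetical.

```python
def monthly_rebalanced_cumulative(monthly_stock_returns):
    """Cumulative return of an equal-weight portfolio rebuilt every month.

    monthly_stock_returns: one list per month holding the simple one-month
    returns of the stocks bought at the start of that month.
    """
    cumulative = []
    wealth = 1.0
    for returns in monthly_stock_returns:
        # Equal weight: the portfolio return is the plain average.
        portfolio_return = sum(returns) / len(returns) if returns else 0.0
        wealth *= 1.0 + portfolio_return
        cumulative.append(wealth - 1.0)
    return cumulative


# Hypothetical one-month returns of the positively predicted stocks.
monthly = [[0.10, -0.02], [0.04], [-0.05, 0.01, 0.01]]
curve = monthly_rebalanced_cumulative(monthly)
```

Running the same function on all recommended stocks, or on the index return alone, would give benchmark curves comparable to the proposed strategy.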
In order to compensate for the limitation of the cumulative return, the average return on investment under different market conditions based on a buy-and-hold strategy is summarized in Table 7 for investment horizons from 30 days to 180 days. For the total period, the proposed investment strategy yields significantly higher returns than the benchmark that invests in all stocks recommended by the reports, for all investment horizons. Also, the difference in returns between the two investment strategies increases as the investment period increases. During the sideways period, the proposed investment strategy shows slightly better returns than the benchmark. However, for the equity research reports published in the sideways period, the long-term investment horizon includes the collapsing period. Despite the sharp decline in the market, the proposed strategy does not record negative returns except for the investment horizon of 120 trading days, which is very encouraging. During the collapsing period, it yields significantly higher returns than the benchmark for all investment horizons. In particular, since the long-term investment horizon includes the soaring period, the proposed investment strategy can be considered able to detect stocks whose prices will rise rapidly during the recovery of a financial market after a market crash. Finally, during the soaring period, the presented model shows an investment return similar to that of the benchmark.
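The buy-and-hold comparison over several horizons can be sketched as below. The price paths are hypothetical, the horizons follow the 30 to 180 trading days named in the text, and measuring each return from the first price of a path (taken here as the report's release date) is an assumption.

```python
def buy_and_hold_return(prices, horizon):
    """Simple return from the first price to the price `horizon` days later."""
    if horizon >= len(prices):
        return None  # not enough price history for this horizon
    return prices[horizon] / prices[0] - 1.0


def average_horizon_returns(price_paths, horizons=(30, 60, 90, 120, 150, 180)):
    """Average buy-and-hold return across stocks for each horizon."""
    result = {}
    for h in horizons:
        rets = [buy_and_hold_return(p, h) for p in price_paths]
        rets = [r for r in rets if r is not None]
        result[h] = sum(rets) / len(rets) if rets else None
    return result


# Two hypothetical daily price paths starting at the report's release date.
rising = [100.0 + i for i in range(181)]
flat = [50.0] * 181
averages = average_horizon_returns([rising, flat])
```

Computing the same averages for the positively predicted stocks and for all recommended stocks gives the kind of horizon-by-horizon comparison summarized in Table 7.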

IV. CONCLUSION
Throughout this research, we explore the possibility of developing an investment framework using a binary classification based on the NLP-elements of the equity research report. To the best of our knowledge, this is the first attempt to utilize the NLP-elements of equity research reports in Korea to establish investment strategies. Therefore, this research's novelty lies in demonstrating a possible integration of the NLP-elements of the equity research report into stock investment. In the experiments, the random forest shows the best classification performance, with an AUC higher than 0.5 during the sideways period and the collapsing period. Therefore, we select the random forest as the binary classification algorithm. Then, we perform the backtesting based on the classification results for monthly re-balancing and for buy-and-hold over different investment horizons. As a result, we confirm that the proposed investment strategy generates higher returns than the benchmark during the sideways period and the collapsing period. In an extreme bull market, selecting stocks with a high expected return does not make much of a difference, since any stock an investor chooses will yield a high return. However, an investment strategy that helps select stocks with a high future return during sideways or bearish markets has a significant implication for real-world investment practice. Therefore, for further research, we plan to utilize various portfolio theories to construct more efficient investment strategies than simple buy-and-hold, using the positively predicted stocks from the binary classification.