Word-Embedding-Based Traffic Document Classification Model for Detecting Emerging Risks Using Sentiment Similarity Weight

With the increase in traffic accident rates, traffic risk detection is becoming increasingly important. Moreover, it is necessary to provide appropriate traffic information considering user locations and routes and to design an analysis method accordingly. This paper proposes a word-embedding-based traffic document classification model for detecting emerging risks using a quantity termed sentiment similarity weight (SSW). The proposed method detects emerging risks by considering and classifying the importance and polarity of keywords in traffic documents. Conventional sentiment analysis methods fail to utilize semantically significant keywords unless they are included in a sentiment dictionary. In this study, through word imputation using an established similarity dictionary and by widening the limited utilization range, the proposed method overcomes this disadvantage of sentiment dictionaries. The proposed method is evaluated through three tests. In the first test, the similarity between keywords is measured, and model accuracy is evaluated accordingly. In the second test, three classifiers for emerging risk classification are compared. In the last test, emerging risk detection is assessed with and without the proposed SSW, and its effectiveness is thereby verified. The evaluation results demonstrate that the proposed traffic-related document classification model using the SSW has an f-measure of 0.907, indicating satisfactory performance. Therefore, the proposed SSW can be effectively used as a parameter in traffic-related document classification and enables the detection of emerging risks.


I. INTRODUCTION
The development of transportation means positively influences everyday life in several respects, such as shortening travel time and overcoming the limitations of long-distance travel. However, as the number of people using transportation increases, traffic volume also increases, and thus traffic congestion and accidents occur more frequently. Accordingly, the fatality rate of traffic accidents rises, and therefore the social cost of handling such accidents is increasing [1]. Traffic accidents occur unexpectedly and are difficult to analyze accurately because they are affected by environmental factors. Therefore, it is necessary to conduct long-term risk management through traffic data analysis. In addition, with the development of information and communication technology, massive amounts of unstructured data are being generated in real time through mass media and social networking services (SNS). In this circumstance, unstructured data analysis based on artificial intelligence for supporting intelligent transportation systems (ITSs) has proved quite valuable [2]. Unstructured data may be found in various forms, such as text, images, and multimedia. Different mining techniques must be applied according to the characteristics of a dataset. Among them, text and opinion mining for the analysis of unstructured text data have attracted considerable attention [3]. Opinion mining is a technique whereby useful information is extracted by analyzing people's opinions. It uses sentiment information to convert the sentiment of a text into quantifiable and objective information that can be analyzed. In addition, opinions are classified as positive, negative, or neutral, and this can be applied to decision making. Text mining is a technique for extracting new and meaningful information from preprocessed text data by using association rules, cluster analysis, and classification [4], [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Mu Zhou.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Currently, text-mining-based analysis is under investigation for extracting traffic information from real-time stream text data [6]. Such data consist of a variety of text information, including words and sentences, for traffic risk assessment [7]. However, owing to the data explosion, it is difficult to extract only traffic data from massive text data. Ali et al. [8] conducted an ontology-based transportation sentiment analysis using unstructured text data from social network platforms to extract meaningful information. The disadvantage, however, is that sophisticated data preprocessing is essential because of the characteristics of social network content. This implies that it is difficult to assess emerging risks for ITS from simple and unprofessional traffic-related information. This paper proposes a word-embedding-based traffic-related document classification model for detecting emerging risks using a quantity termed sentiment similarity weight (SSW). The proposed model classifies traffic-related documents from news data on the basis of word embedding and detects emerging risks from the classified documents using the SSW. It collects unstructured traffic data through crawling. Traffic-related main keywords are extracted from the collected data, and the importance of keywords in a document is determined by the term frequency-inverse document frequency (TF-IDF) weight. By performing sentiment analysis on keywords, the polarity value of a word is determined. In this study, the SSW is proposed to resolve the issue of limited word range in sentiment dictionaries. Specifically, words are weighted considering the similarity and polarity of the main keywords. Thereby, traffic-related documents are classified, and emerging risks are detected. Accordingly, traffic-related document classification using the proposed SSW allows the detection of emerging risks considering user routes and thus provides significant information on traffic risks.
This enables safe driving and walking for drivers and pedestrians, respectively. The main contributions of this study are as follows.
1. We propose a new framework that detects and classifies keywords related to emerging traffic risks by considering the polarity and importance of words using the proposed SSW.
2. We propose a method to detect emerging traffic risks by extracting only traffic-related documents from unstructured text data of various categories.
3. We overcome the limitation of sentiment dictionaries by measuring the similarity between keywords and using the proposed word imputation method.
4. We propose a graphical user interface that can detect emerging risks using the traffic-related document classification model.
The remainder of the paper is organized as follows: Section II describes sentiment word classification based on sentiment analysis, and Word2vec-based word embedding. Section III describes the proposed word-embedding-based traffic-related document classification model for detecting emerging risks using the SSW. Section IV describes the experiments and the results of the performance evaluation of the proposed model. Section V concludes the paper.

II. RELATED STUDIES
A. SENTIMENT WORD CLASSIFICATION BASED ON SENTIMENT ANALYSIS
Sentiment analysis is a technique for measuring the polarity value (the level of positiveness and negativeness) of a piece of text and determining the related sentimental state [9]. To this end, a sentiment dictionary with the polarity values of sentiment vocabularies is established. This is expressed by quantifying word sentiment after preprocessing text data. Sentiment analysis can be performed through sentiment-dictionary-based methods and machine-learning methods. The former analyze word sentiment by numerically representing word polarity. However, if a word is not present in a sentiment dictionary, it is difficult to analyze its sentiment. Accordingly, such dictionaries should be carefully established. To this end, supervised learning based on machine and deep learning is often applied. Supervised learning is the process of learning through user labels [10]. By labeling the level of positiveness or negativeness of a document, a classifier learns to make inferences. Sentiment-dictionary-based modules include TextBlob [11], the valence-aware dictionary and sentiment reasoner (VADER) [12], and SentiWordNet [13]. Madhu [14] proposed sentiment-analysis-based document clustering, which determines the polarity and subjectivity values of Twitter documents using the TextBlob and AFINN dictionaries. Documents are clustered by sentiment analysis, and thus their relationship is discovered. Denecke [15] proposed a method for automatically determining the polarity of multilingual documents, whereby such a document is translated into English through standard translation software, and then the SentiWordNet sentiment dictionary is used to determine its polarity. That is, a single sentiment dictionary is sufficient in a multilingual framework.
General sentiment dictionary construction consists of a text data collection step and a morphological and sentiment analysis step. The former comprises the collection and preprocessing of text data. In preprocessing, data are collected and documented, stop words are removed, and words are tokenized. In the latter, morphological analysis and sentiment analysis are conducted on the collected text data. The documented data are analyzed morphologically and tagged, and then their sentiment is analyzed. In addition, words are analyzed in terms of sentiment level, classified into positive, neutral, or negative types, and quantified. The process of establishing a sentiment dictionary is shown in Fig. 1. Conventional sentiment analysis methods use the polarity values of vocabularies in a sentiment dictionary.
The degree of positiveness and negativeness can be determined only for words that appear in a sentiment dictionary. Therefore, to expand a conventional sentiment dictionary, it is necessary to design a method of calculating the polarity value of missing words.

B. WORD2VEC-BASED WORD EMBEDDING
Word embedding is a technique for analyzing the context of words in a sentence and converting them into vector values. Word embedding methods include GloVe [16], Fasttext [17], and Word2vec [18]. Word2vec does not consider co-occurrence frequencies over the entire text because learning is performed only within a user-specified window. By contrast, GloVe (global vectors) determines the semantic structure of words by using the co-occurrence probability of all words. However, as the number of words increases, the size (hence, the computational complexity) of the word co-occurrence matrix increases [16]. Fasttext is a word-embedding technique proposed by Facebook. It assumes that a word is composed of several subword units: to obtain the word embedding vector, Fasttext splits a word into character n-grams, and the embedding vector is then the sum of the vectors of these n-grams. For example, 'kill' → (ki, il, ll). This makes it possible to generate vectors for words that are not found in the dictionary; moreover, the training process is fast [17]. Word2vec was proposed by Google as an improvement of the neural network language model [18]. It infers the meaning of words based on the distributional hypothesis that words appearing in similar contexts have similar meanings. In addition, a vector is assigned to a word by representing the word in high-dimensional coordinates [19]. As operations between word vectors are possible, the similarity between words can be calculated. The continuous bag of words (CBOW) and skip-gram are representative Word2vec methods [20]. The former uses surrounding words as input to predict a target word and has the advantage of fast training [21]. Skip-gram takes a target word as input to predict the surrounding words. Its training process is slow because the loss must be calculated for each context word [22].
Nevertheless, CBOW is inferior to skip-gram in terms of learning efficiency because it updates the vector of a target word only once per window, whereas in skip-gram, the number of context words for each target word is twice the window size. Therefore, for the same window size, the amount of learning can be several times larger in skip-gram. Accordingly, skip-gram, with its high learning efficiency, is generally applied [23]. Seyed Mahdi Rezaeinia [24] proposed an improved word embedding method based on a pre-trained sentiment dictionary by applying sentiment analysis. It extracts vectors from a text corpus according to Word2vec/GloVe, a word position algorithm, a vocabulary-based approach, and morphological analysis. In combination with the extracted vectors, improved word vectors are constructed. Thereby, accurate classification can be achieved; however, the proposed method was evaluated only on sentiment datasets. B. Naderalvojoud [17] developed a sentiment-recognition-based word-embedding deep learning model. It considers both the meaning and polarity of words and overcomes the weakness that words with different sentiments but similar contexts may have similar vector values. However, this method does not consider the importance of words in a document. Conventional sentiment analysis techniques may be unable to use meaningful words because of the limited range of sentiment dictionaries. Therefore, it is necessary to develop a method whereby such words can be used in sentiment analysis even though they may not appear in a sentiment dictionary.
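The difference between CBOW and skip-gram training examples discussed above can be illustrated with a minimal sketch. The function name `training_examples` and the sample sentence are hypothetical and are not part of the proposed method; the sketch only enumerates the examples that each scheme would derive from a token list for a given window size.

```python
def training_examples(tokens, window):
    """Enumerate CBOW and skip-gram training examples for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: one example per position (all context words -> target word).
        cbow.append((context, target))
        # Skip-gram: one example per context word (target word -> context word),
        # i.e., up to 2 * window examples per position.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


# Hypothetical sample sentence, already tokenized.
tokens = ["heavy", "rain", "caused", "a", "crash", "on", "the", "highway"]
cbow, skipgram = training_examples(tokens, window=2)
print(len(cbow), len(skipgram))
```

For positions away from the sentence boundaries, skip-gram yields 2 × window examples per target word, whereas CBOW yields one, which is the source of the larger learning amount mentioned above.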

III. WORD-EMBEDDING-BASED TRAFFIC-RELATED DOCUMENT CLASSIFICATION MODEL FOR DETECTING EMERGING RISKS USING SENTIMENT SIMILARITY WEIGHT
Even though it is possible to obtain information from text data, it is difficult to recognize and predict risks based on these data. This study proposes a technique for collecting traffic-related documents from a variety of texts and detecting emerging risks. By applying the SSW, the proposed technique makes use of words that may not appear in a sentiment dictionary. Figure 2 shows the word-embedding-based traffic-related document classification model for detecting emerging risks using the SSW.
The proposed method consists of four steps. In the first step, unstructured data are collected through crawling and preprocessed, and a matrix of TF-IDF weights is then formed. In this step, the stop words of the collected documents are removed, morpheme analysis is performed, and important keywords with high TF-IDF weights are extracted. In the second step, document labeling is performed based on the main keywords extracted from each document. Through this, the documents are classified into traffic-related data and non-traffic data. In the third step, Word2vec is applied to vectorize words, calculate similarity values, and generate a similarity dictionary. In the last step, the SSW is extracted using the TF-IDF, similarity, and polarity values of the keywords. Thereby, words not present in a sentiment dictionary are replaced by highly similar words. In addition, a classification model using the SSW is applied to a user interface system.

A. DATA COLLECTION AND KEYWORD EXTRACTION USING TF-IDF
To detect emerging risks in a traffic-related document, breaking news data are crawled. The data provided by the Traffic Broadcasting Network (TBN) [25] are collected and documented according to topic (e.g., politics, economy, society, information technology and science, and traffic). Subsequently, only the necessary data are retained, such as date and time, category, title, main body, and links. In the crawled news text data, traffic-related documents also appear in categories other than ''traffic.'' Therefore, it is necessary to determine the category of the collected documents again. To this end, keywords are extracted. Based on the extracted keywords, binary labeling is performed to determine whether a document is traffic-related. Morphological analysis is conducted to extract representative keywords from the crawled documents. Morphological analysis is used to grasp the structure of a corpus in terms of its minimum semantic units [26]. Words in each document are tokenized through morphological analysis, and meaningless tokens, that is, conjunctions, postpositions, numbers, and special characters, are removed. To extract the main keywords from the preprocessed data according to word importance, a matrix of TF-IDF weights is constructed. The TF-IDF weight multiplies the term frequency by the inverse document frequency, thereby reducing the weight of words that occur in many documents. This allows us to extract meaningful keywords. The morphemes used for keyword extraction are nouns. However, it is difficult to conduct sentiment analysis using one part of speech, as sentiment vocabularies contain various parts of speech. Therefore, for accurate and effective sentiment analysis, verbs and adjectives that are included in sentiment vocabularies are also extracted. The extracted keywords are listed, and TF-IDF weight matrices for nouns, verbs, and adjectives are established. Table 1 shows the TF-IDF weight matrix for nouns.
The top keywords are presented according to the TF-IDF weights for noun words in each document.
In Table 1, the first row indicates the pronunciation of Korean words, and their translation into English (in the parentheses). For example, the TF-IDF weight of ''sa-go'' (accident) in Doc No. 54 is 0, and its TF-IDF weight in Doc No. 128 is 0.741. That is, ''sa-go'' is meaningless in Doc No. 54, but meaningful in Doc No. 128. This is because each document has different word importance. Therefore, by constructing a matrix of TF-IDF weights, it is possible to find the main keywords in a document and calculate their importance.
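As an illustration of how such a TF-IDF weight matrix can be built, the following is a minimal sketch using only the Python standard library. It assumes one common TF-IDF variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the exact variant used in this study is not specified, and the function name `tfidf_matrix` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return one {term: weight} row per document.

    Assumes each document is a non-empty list of tokens and uses
    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        })
    return rows


# Hypothetical, already tokenized noun keywords of three documents.
docs = [
    ["accident", "road", "accident"],
    ["economy", "market"],
    ["road", "construction"],
]
rows = tfidf_matrix(docs)
top_keyword = max(rows[0], key=rows[0].get)
print(top_keyword, rows[1]["accident"])
```

As in Table 1, a keyword that does not occur in a document receives a weight of 0 there, while the same keyword can have a high weight in another document.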
After the topic of a document is identified by extracting the main keywords, the label '1' is assigned to the document if it is related to traffic; otherwise, the label '0' is assigned. That is, binary labeling is applied. Document labeling is based on the traffic keywords in the land, infrastructure, and transport terminology dictionary issued by the Ministry of Land, Infrastructure, and Transport [27]. Label 1 represents a document containing traffic-related words, such as ''open highway,'' ''traffic safety law revision,'' ''traffic accident,'' and ''road congestion.'' Label 0 represents a document without traffic-related words. If Label 1 is assigned to a document (i.e., the document is related to traffic), its location is collected. A document with Label 0 is not collected but deleted. Keyword labeling is based on the actual content of a document, rather than on document category. Therefore, this labeling allows more accurate classification. Table 2 shows the extracted keywords from a document and the result of traffic labeling.
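The binary labeling described above can be sketched as follows. The term set `TRAFFIC_TERMS` is a small hypothetical stand-in for the terminology dictionary [27], and the function names are illustrative only.

```python
# Hypothetical stand-in for the traffic terminology dictionary [27].
TRAFFIC_TERMS = {"open highway", "traffic accident", "road congestion"}


def label_document(keywords, traffic_terms=TRAFFIC_TERMS):
    """Return 1 if any extracted keyword is a traffic term, otherwise 0."""
    return 1 if any(k in traffic_terms for k in keywords) else 0


def collect_traffic_documents(documents):
    """Keep only the documents labeled 1; documents labeled 0 are discarded."""
    return [doc for doc in documents if label_document(doc["keywords"]) == 1]


# Hypothetical documents: the category is ignored, only keywords are used.
documents = [
    {"category": "society", "keywords": ["traffic accident", "rain"]},
    {"category": "economy", "keywords": ["market", "interest rate"]},
]
kept = collect_traffic_documents(documents)
print([doc["category"] for doc in kept])
```

Because the label depends on the keywords rather than on the news category, the first document is retained even though its category is not ''traffic.''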
It can be seen that the crawled document is preprocessed so that it has semi-structured contents: time, location, category, and label. The contents represent the words and morphemes used in the document. ''NNG'' indicates a general noun, ''NNP'' a proper noun, and ''VV'' a verb. Time refers to the date and time of collection. Location represents the occurrence position and range of a traffic event found in a document. Categories refer to politics, society, economy, information technology and science, and traffic. Category-based collection has the disadvantage that documents are not classified as traffic documents if their category is not ''traffic,'' even if their content is traffic-related. In contrast, the keyword-based method can correctly classify such documents. In addition, it is possible to collect traffic documents that are in different categories. Therefore, the proposed method can overcome the limitation of document collection in a restricted category and improve accuracy.

B. COSINE-SIMILARITY-BASED KEYWORD SIMILARITY USING WORD2VEC
A word may have multiple synonyms with different morphemes but the same meaning. If a word cannot be used directly, it can be replaced by a synonym identified through semantic similarity. To calculate this similarity, Word2vec is applied to vectorize the extracted keywords, and a dictionary is established using the similarity results. To this end, the crawled document data are used. To vectorize the keywords, Word2vec predicts a word through the context of a sentence. Therefore, stop words are not removed when establishing the similarity dictionary. In addition, to ensure diversity, not only traffic-related documents but also documents related to various fields, such as economy, politics, and society, are used. A total of 35,729 word vectors are generated. Figure 3 shows the results of word embedding based on Word2vec.
It visualizes the result of applying dimensionality reduction to the high-dimensional vector space to obtain a two-dimensional space. The x- and y-axes represent the vector coordinates of a keyword.
In Fig. 3, Set1, Set2, and Set3 are the clustering sets of semantically similar keywords, representing the sets damage, crash, and vehicle, respectively. The clustered keywords are located closer to each other as the similarity increases.
To discover the similarity between words, cosine similarity is used in a vector space. It is a similarity measure between vectors and is calculated using the cosine of the angle between them. It allows the comparison of vectors in a multidimensional space [28]. Cosine similarity is a real number between −1 and 1. If the cosine similarity between two words is close to −1, then the words tend to have opposite meanings; if it is close to 1, they tend to have nearly the same meaning [29]. Equation (1) shows the formula for cosine similarity, where W and X are word embedding vectors, S(W, X) denotes the cosine similarity between them, W_k and X_k are their k-th components, and n is the number of components.
$$
S(W, X) = \frac{\sum_{k=1}^{n} W_k X_k}{\sqrt{\sum_{k=1}^{n} W_k^{2}}\,\sqrt{\sum_{k=1}^{n} X_k^{2}}} \tag{1}
$$
Table 3 shows the cosine similarity between words by Word2vec. Tn is the similarity rank. The entries of the table are arranged in descending order of similarity.
The word that has the highest cosine similarity to ''un-jeon'' (driving) is ''un-haeng'' (race), with a similarity value of 0.876, and the word with the second highest is ''joo-haeng'' (run), with a similarity value of 0.749. Therefore, by establishing a similarity dictionary, it is possible to obtain words that are similar to main keywords not present in a document. Moreover, word imputation is performed. That is, a word is replaced on the basis of its similarity if it is not present in a document. Nevertheless, a worst case may occur in which T1 (the word with the highest cosine similarity) has a remarkably low similarity value. In that case, the original keyword is not semantically similar to T1. For example, if T1 for the word ''bi'' (rain) is the word ''ja-jeon-geo'' (bike), the similarity between the words ''bi'' and ''ja-jeon-geo'' is low (0.217). This implies that these words are semantically irrelevant. Accordingly, word imputation is applied only to words that have a similarity value higher than a threshold. To set the imputation threshold, each of the 35,729 words is compared with its T1 in terms of similarity. Whether the correlation between each word and T1 is significant is examined while decreasing the similarity threshold in steps of 0.1. Table 4 shows the semantic similarity probability of a word and T1 according to the imputation threshold values. In Table 4, the first row represents a threshold similarity value between a word and T1. The second row indicates the probability that a word and T1 are semantically related for each threshold value. It can be seen that when the threshold value is 0.5, a cutoff occurs. Therefore, two words are not considered semantically consistent if their similarity is 0.5 or less. Accordingly, the imputation threshold value is set to 0.5. Thus, through word-similarity-based imputation, it is possible to obtain words with significant correlation.
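The similarity dictionary and the threshold-based imputation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors are hypothetical toy values rather than trained Word2vec embeddings, and the function names are illustrative. Only the cosine similarity, the descending ranking (T1, T2, ...), and the threshold of 0.5 are taken from the text.

```python
import math


def cosine_similarity(w, x):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(w, x))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in x))
    return dot / norm if norm else 0.0


def build_similarity_dictionary(vectors):
    """Map each word to its neighbours in descending order of similarity."""
    dictionary = {}
    for word, vec in vectors.items():
        ranked = [(other, cosine_similarity(vec, other_vec))
                  for other, other_vec in vectors.items() if other != word]
        dictionary[word] = sorted(ranked, key=lambda p: p[1], reverse=True)
    return dictionary


def impute(word, similarity_dictionary, threshold=0.5):
    """Return (T1, similarity) if the similarity exceeds the threshold, else None."""
    neighbours = similarity_dictionary.get(word, [])
    if neighbours and neighbours[0][1] > threshold:
        return neighbours[0]
    return None


# Hypothetical toy vectors; real embeddings would come from Word2vec.
vectors = {
    "un-jeon": [0.9, 0.1, 0.2],
    "un-haeng": [0.8, 0.2, 0.3],
    "bi": [0.0, 1.0, 0.0],
    "ja-jeon-geo": [0.2, 0.3, 0.9],
}
similarity_dictionary = build_similarity_dictionary(vectors)
print(impute("un-jeon", similarity_dictionary))
print(impute("bi", similarity_dictionary))
```

With these toy values, ''un-jeon'' is imputed by its nearest neighbour, whereas ''bi'' is left without a replacement because even its T1 falls below the threshold.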

C. SENTIMENT SIMILARITY WEIGHT FOR DETECTING EMERGING RISKS
To establish the proposed SSW for detecting emerging risks, the Korean Sentiment Analysis Corpus (KOSAC) [30] sentiment dictionary is applied. It consists of the polarity values (positiveness, neutrality, and negativeness) of 16,000 n-gram morphemes. The polarity value ranges from −1 to +1. Figure 4 shows the extraction process of the SSW. The weight values are generated using the TF-IDF, similarity, and polarity values of the main keywords in a document. The final SSW is obtained by aggregating the word-level weights of the keywords found in the document.
Regarding polarity, if a word does not appear in a sentiment dictionary, it is impossible to use it. For this reason, word imputation is applied. It considers similarity and polarity according to whether a word exists in a sentiment dictionary. Nevertheless, if the polarity of the word with the highest similarity is simply adopted in the imputation process while the degree of similarity and the semantics are ignored, the weight value may be distorted. Therefore, the polarity value is multiplied by Similarity so that the imputed polarity reflects only the degree of similarity between the words. For instance, assuming that the word ''chung-dol'' (crashing) is replaced by the word ''chung-dol-ha-da'' (crash), as shown in Table 5, if similarity is not considered, the polarity value of the word ''chung-dol'' becomes −1, which is the polarity of the word ''chung-dol-ha-da''. If similarity is considered, the polarity value of the word ''chung-dol'' becomes −0.869, which is obtained by multiplying the similarity value (0.869) by the polarity value of the word ''chung-dol-ha-da'' (i.e., −1). When the similarity between two words is high, they are considered semantically similar, and the polarity value of the replacement word is reflected strongly. If the similarity is low, the polarity value of the replacement word is reflected only slightly. Thus, word imputation is performed by considering the similarity between words.
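The SSW for a word w (WSSW_w), where w may be a noun, adjective, or verb in a document, is calculated using Equation (2), where there are two cases depending on whether the word appears in a sentiment dictionary: If a keyword appears in a sentiment dictionary, only the TF-IDF and polarity values are used; otherwise, word imputation is applied using a similarity dictionary.
$$
WSSW_w =
\begin{cases}
\text{TF-IDF}_w \times \text{Polarity}_w, & \text{if } w \text{ appears in the sentiment dictionary} \\
\text{TF-IDF}_w \times \text{Similarity}_{w,w'} \times \text{Polarity}_{w'}, & \text{otherwise}
\end{cases} \tag{2}
$$
Here, w' is the replacement word, Similarity refers to the similarity value between the word to be replaced and its replacement, and Polarity represents the polarity value of a word.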
For example, if the word ''cha-ryang'' (vehicle) extracted from a document is not found in a sentiment dictionary, a conventional method cannot use this word. The proposed method finds a word with high similarity to ''cha-ryang'', and the polarity value of the new word replaces that of the initial word. For instance, the polarity value of ''cha-ryang'' is replaced by the polarity value of ''ja-dong-cha'' (car), which is the most similar word. As a result, the TF-IDF value of ''cha-ryang'' is 0.364, and its polarity value is set to 0.4, which is that of the word ''ja-dong-cha''. We note that the similarity between ''cha-ryang'' and ''ja-dong-cha'' is 0.913, and thus the WSSW of ''cha-ryang'' is calculated as 0.364 × 0.913 × 0.4 ≈ 0.133 using Equation (2). In Equation (3), the SSW is calculated using the WSSW values as follows:
$$
SSW = \frac{1}{N} \sum_{w=1}^{N} WSSW_w \tag{3}
$$
That is, the SSW is calculated by adding all the WSSW values, which are the weights of the noun, verb, and adjective keywords extracted from a document, and then by dividing the sum by the number of extracted words. In this equation, N is the number of keywords extracted from the document. Table 5 shows the extraction process of the SSW. Regarding the parts of speech, NNG, NNP, and VV mean general noun, proper noun, and verb, respectively. The generated SSW ranges from −1 to +1. In Table 5, the keyword ''chung-dol'' (crashing) does not appear in a sentiment dictionary; thus, it is replaced by the word ''chung-dol-ha-da'' (crash), which has the highest similarity value in the similarity dictionary. In the SSW extraction process, the SSW of Doc No. 15 is calculated as −0.441.
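The calculation described above (TF-IDF multiplied by polarity, scaled by similarity when a word is imputed, and then averaged over the keywords of a document) can be sketched as follows. The function names and data layout are hypothetical, and one assumption goes beyond the text: a keyword that has neither a dictionary entry nor a replacement above the imputation threshold is given a weight of 0. The values for ''cha-ryang'' are those of the worked example; the entry for ''sa-go'' is a hypothetical value added only to show the averaging.

```python
def word_ssw(word, tfidf, sentiment_dict, similarity_dict, threshold=0.5):
    """Word-level weight (WSSW) of a keyword.

    sentiment_dict maps word -> polarity in [-1, 1].
    similarity_dict maps word -> (most similar word, similarity).
    """
    if word in sentiment_dict:
        return tfidf * sentiment_dict[word]
    match = similarity_dict.get(word)
    if match is None:
        return 0.0  # assumption: no replacement available, no contribution
    replacement, similarity = match
    if similarity <= threshold or replacement not in sentiment_dict:
        return 0.0  # assumption: replacement unusable, no contribution
    return tfidf * similarity * sentiment_dict[replacement]


def document_ssw(keyword_tfidf, sentiment_dict, similarity_dict):
    """Document-level SSW: mean of the WSSW values of the extracted keywords."""
    weights = [word_ssw(w, t, sentiment_dict, similarity_dict)
               for w, t in keyword_tfidf.items()]
    return sum(weights) / len(weights) if weights else 0.0


sentiment_dict = {"ja-dong-cha": 0.4, "sa-go": -0.8}  # "sa-go" value is hypothetical
similarity_dict = {"cha-ryang": ("ja-dong-cha", 0.913)}

wssw = word_ssw("cha-ryang", 0.364, sentiment_dict, similarity_dict)
print(round(wssw, 3))  # 0.133, as in the worked example

ssw = document_ssw({"cha-ryang": 0.364, "sa-go": 0.5},
                   sentiment_dict, similarity_dict)
print(ssw)
```

Because TF-IDF, similarity, and polarity are each bounded by 1 in absolute value in this setting, the resulting weights remain within the range from −1 to +1.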

D. TRAFFIC-RELATED DOCUMENT CLASSIFICATION MODEL USING SENTIMENT SIMILARITY WEIGHT
In this study, to detect emerging traffic risks, the proposed SSW is applied to a support vector machine (SVM) classifier [31]. Considering the polarity and importance of words in a document, the keywords related to emerging traffic risks are detected and classified. In the classification, there are two classes: emerging risk and non-emerging risk. The class labels are assigned by comparing keyword similarities between documents. To this end, traffic documents closely related to traffic safety, such as those on traffic accidents and traffic jams, are collected, and the similarity between the collected documents and the documents of the training and test datasets is compared. This similarity is the cosine similarity computed from the TF-IDF weight matrix of each document. The cosine similarity between documents identified as having similar keywords is 0.71 on average. Therefore, training and test documents whose cosine similarity to the collected traffic documents is 0.71 or higher are labeled as emerging risk, C_0. The non-emerging risk class, C_1, is labeled in the same manner through comparison with documents on topics such as highway openings and traffic safety law revisions. In the learning process of the classification model, the main keywords of a traffic-related document are extracted. Based on the TF-IDF, similarity, and polarity values of the extracted keywords, WSSWs are calculated and are subsequently averaged to obtain the SSW of the document. A matrix is built from the keyword frequency and SSW of each document, and classification is performed by applying this matrix to the SVM. An SVM-based binary classifier is used to judge whether a particular document carries an emerging risk. Figure 5 shows the document classification process considering the SSW.
All the WSSWs of the keywords are calculated from the TF-IDF, Similarity, and Polarity values, and the SSW is calculated as the mean of the WSSWs. In addition, the calculated SSW is combined with the document-term matrix (DTM) to form the training data, on which the SVM binary classifier is trained. In the learning process of the proposed classification model, the SSW thus acts as a measure for classifying traffic-safety-related documents and detecting emerging risks.
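The combination of the DTM with the SSW can be sketched as follows. This is only one plausible reading of the description, assuming that the SSW is appended to each row of keyword counts as an additional feature column; the function name `build_feature_matrix`, the toy keywords, and the SSW values are hypothetical.

```python
from collections import Counter


def build_feature_matrix(docs_keywords, ssw_values):
    """Append each document's SSW to its row of keyword counts (DTM row)."""
    vocab = sorted({k for keywords in docs_keywords for k in keywords})
    rows = []
    for keywords, ssw in zip(docs_keywords, ssw_values):
        counts = Counter(keywords)
        rows.append([counts[term] for term in vocab] + [ssw])
    return vocab, rows


# Hypothetical keyword lists and SSW values for three documents.
docs_keywords = [
    ["accident", "crash", "accident"],
    ["highway", "opening"],
    ["crash", "congestion"],
]
ssw_values = [-0.441, 0.120, -0.305]

vocab, features = build_feature_matrix(docs_keywords, ssw_values)
print(vocab)
print(features[0])
```

The resulting rows, together with the labels C_0 and C_1, could then be passed to any SVM implementation as training data, for example a linear-kernel classifier from a machine learning library.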

IV. RESULTS AND PERFORMANCE EVALUATION
A. TRAFFIC-RELATED DOCUMENT CLASSIFICATION BASED ON EMERGING RISK DETECTION SYSTEM
An emerging risk detection system applies the proposed SSW-based classification model to a real traffic situation. It is established based on a user route. The system detects the emerging risks on this route and provides related information. We use a computer with an Intel(R) Core(TM) i5-3570 at 3.40 GHz and 16 GB of RAM, running Windows 10 and Python 3.6.0. Figure 6 shows the user-route-based emerging risk detection system to which the SSW-based classification model is applied.
It can be seen that the user interface of the emerging risk detection system consists of the following parts: crawling, keyword extraction, location, and emerging risk detection. In the crawling part, the URL, date, and page are set, and crawling is then started. In addition, documents are selected, and stop words are removed. Morphological analysis is performed on the documents after stop-word removal. After the TF-IDF values are obtained from the preprocessed documents, classification is performed using an SSW-based SVM. In the location part, the system receives the user departure and destination points and displays on a map the number of emerging risk documents related to the route. In addition, based on the emerging risk keywords extracted from the documents, it is possible to provide the user with simple information on each route section. Figure 6 shows the emerging risk detection mechanism applied on a route from the Gyeong-gi Provincial Government to Gwang-myeong City Hall. On the map, 265 emerging risk documents are detected in four sections. Among them, 132 documents are detected in the Suwon-Gwangmyeong Highway section, which is considered to carry the highest risk. The keywords detected from the extracted documents are displayed, and therefore it is possible to determine the risk factors in any road section. By providing information regarding overall and emerging risks on a route, the system allows users to prepare accordingly.

B. PERFORMANCE EVALUATION
To evaluate the performance of the proposed system, TBN news data were crawled. A total of 1,400 documents were collected and divided into training data (70%) and test data (30%). The performance evaluation has three objectives. The first objective is to determine the best method for measuring the similarity between keywords. To this end, the accuracy of the models obtained with different word similarity measures is compared. The second objective is to select the most appropriate classification model for detecting an emerging risk when the proposed SSW is applied. To this end, the proposed SSW is applied to the SVM, the KNN [32], and the naive-Bayes classification model [33], and the performance of these models in classifying emerging risk documents is compared. The third objective is to demonstrate the effectiveness of emerging risk document detection based on the proposed SSW. That is, using an SVM binary classifier, the model with the proposed SSW is compared with a conventional model without the proposed SSW.
In this study, word similarity is measured by the cosine similarity in a vector space. The first evaluation demonstrates that the cosine-similarity-based method is the best of the similarity measurement techniques considered. To evaluate document classification accuracy, the cosine similarity measure [34] is compared with the Manhattan-distance-based similarity measure [35] and the Euclidean-distance-based similarity measure [36]. Figure 7 shows the comparison results. In this figure, the X-axis shows the similarity measurement method, and the Y-axis represents accuracy.
It is seen that cosine similarity is more accurate than Manhattan-distance-based and Euclidean-distance-based similarity by 0.059 and 0.033, respectively. Cosine similarity is applied in a multi-dimensional space and considers vector directions. Therefore, it is possible to measure similarity by considering the overall context of a keyword in a document.
By contrast, distance-based similarity measures, such as the Manhattan and Euclidean methods, have considerable limitations and are not suitable for natural language processing. By simply considering the frequency of words, they may infer that documents in the same context are different. For this reason, these distance-based similarity measures have lower accuracy than the cosine similarity measure. Given that word embedding is performed through the contextual meanings of words, the cosine similarity technique, which considers the direction of each word vector, is more appropriate and can be used to infer the semantic similarity between keywords.
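The difference between direction-based and distance-based measures can be illustrated with a short sketch. The conversion of a distance d into a similarity as 1 / (1 + d) is an assumption made for illustration only, as are the function names and the toy vectors; the paper does not state how the distance-based similarities were computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (depends on direction only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def manhattan_similarity(u, v):
    """Similarity derived from the L1 distance (assumed form: 1 / (1 + d))."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(u, v)))


def euclidean_similarity(u, v):
    """Similarity derived from the L2 distance (assumed form: 1 / (1 + d))."""
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))))


# Two hypothetical vectors with the same direction but different magnitudes,
# e.g., the same word usage pattern observed at different frequencies.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]

cos_sim = cosine_similarity(u, v)
man_sim = manhattan_similarity(u, v)
euc_sim = euclidean_similarity(u, v)
```

Here the cosine similarity treats the two vectors as essentially identical because they point in the same direction, whereas both distance-based measures report a low similarity because the magnitudes differ.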
In the second performance evaluation, the proposed method is applied to different classification models to select the best model. As classification models, we use SVM, KNN, and naive-Bayes. These three classifiers are compared in terms of accuracy, precision, recall, and F-measure. Precision is the proportion of actual emerging risks among the predicted emerging risks [37], [38]. Recall is the proportion of emerging risk documents that are accurately classified. If the data are imbalanced, accuracy is an unstable scale. For this reason, the F-measure based on precision and recall, along with accuracy, is adopted as the scale for performance evaluation [37]. The F-measure is the harmonic mean of precision and recall, and it is used to determine whether the classification is correct. It is given by Equation (4). Figure 8 shows the comparison results.

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
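As a minimal sketch of these measures, the following computes precision, recall, and the F-measure (as their harmonic mean) from true and predicted labels. The function name, the label encoding (1 for an emerging risk document, 0 otherwise), and the toy labels are assumptions for illustration and are not taken from the experiments.

```python
def precision_recall_f(y_true, y_pred, positive=1):
    """Return (precision, recall, F-measure) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical labels: 1 = emerging risk document, 0 = other document.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f_measure = precision_recall_f(y_true, y_pred)
```

In this toy example, three of the four predicted emerging risk documents are correct and three of the four actual emerging risk documents are found, so precision, recall, and the F-measure are all 0.75.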
As shown in Fig. 8, when the proposed method is applied, the SVM classification model achieves the best performance in terms of accuracy, precision, recall, and F-measure. This model performs consistently well even with a small dataset and is appropriate for high-dimensional data. It is therefore suitable here, because the training data have high-dimensional attributes comprising multiple keywords and are drawn from a small dataset of approximately 1,400 documents. The naive-Bayes classification model does not consider the importance of keywords but views them as independent factors; thus, its performance is poor. Although KNN is rather accurate, it must calculate the similarity values of all keywords and therefore has a high computational cost. Consequently, the SVM model using the proposed SSW is the best classifier for mining text that includes high-dimensional keywords.
In the third performance evaluation, the SVM classifier using the proposed SSW is compared with the SVM classifier without SSW in terms of accuracy, precision, recall, and F-measure. This comparison demonstrates the effectiveness of the proposed SSW for detecting emerging risks. Table 6 shows the results of the comparison between the SVM model based on SSW and a conventional SVM model without SSW. According to the overall performance evaluation, the accuracy of the classification model based on the proposed SSW is higher than that of the conventional classification model by 0.118, and its F-measure is higher by 0.085. The proposed technique considers the importance of words and, by using the SSW, replaces a word on the basis of its similarity even if the word does not appear in a sentiment dictionary. For this reason, it exhibits superior performance. By contrast, a conventional SVM classification model without SSW excludes all the words not found in a sentiment dictionary. Such a model does not consider polarity, and therefore it exhibits poor performance. Given these results, the proposed SSW enables the effective assignment of a weight to a word regardless of whether the word appears in a sentiment dictionary. To visualize the evaluation results, a receiver operating characteristic (ROC) curve graph is used. In the ROC curve graph, the X-axis indicates the false positive rate (FPR), and the Y-axis represents the true positive rate (TPR). The area under the curve (AUC) is the area under the ROC curve and ranges from 0 to 1. AUC values close to 1 indicate better model performance. Equations (5) and (6) define TPR and FPR, respectively, using a confusion matrix.

$$TPR = \frac{TP}{TP + FN} \tag{5}$$

$$FPR = \frac{FP}{FP + TN} \tag{6}$$
In Equation (5), TP is the number of emerging risk documents that the classifier has determined as emerging risks, and FN is the number of emerging risk documents that have been determined as non-emerging risks. In Equation (6), FP is the number of non-emerging risk documents that have been determined as emerging risks, and TN is the number of non-emerging risk documents that have been determined as non-emerging risks. The ROC curves show that the SVM model with SSW and the conventional SVM model without SSW have AUCs of 0.94 and 0.83, respectively. Therefore, the model using SSW has better classification performance and can effectively detect emerging traffic risks.
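The construction of an ROC curve from TPR and FPR, and the computation of the AUC, can be sketched as follows. The threshold sweep over classifier scores and the trapezoidal rule for the area are standard choices assumed here for illustration; the function names, labels, and scores are hypothetical and do not reproduce the reported AUC values.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    A document is predicted as an emerging risk when its score >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        # TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = emerging risk) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
```

In this toy example, eight of the nine (emerging risk, non-emerging risk) document pairs are ranked correctly by the scores, which corresponds to an AUC of 8/9.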

V. CONCLUSION
In systems using traffic data, the utilization of unstructured data is low, and the analysis and processing of such data are therefore becoming more important. Thus, in this paper, we proposed a word-embedding-based traffic document classification model to detect emerging risks using the SSW. The model uses unstructured text data for analysis and processing. Specifically, it collects unstructured traffic data from breaking news. The collected data are labeled, and only traffic-related documents are used. After these documents are analyzed morphologically, the main keywords are extracted using the TF-IDF weight. Based on the extracted keywords, a similarity dictionary is established by Word2vec, and the similarity between words is measured. To obtain the proposed SSW based on a sentiment dictionary, the model uses the polarity, TF-IDF, and similarity values of words. If a word is not found in the sentiment dictionary, it is replaced by the word with the highest similarity in the similarity dictionary. This method overcomes the limitation of the conventional technique, in which the sentiment of a word is not extracted if the word is not found in a sentiment dictionary. In addition, the proposed model is designed to apply word weights correctly by replacing only words whose similarity value is higher than a certain threshold. The performance of the proposed method was evaluated in three ways. First, the accuracy of the model was evaluated according to the similarity measurement method. Then, the proposed SSW was applied to SVM, KNN, and naive-Bayes classifiers, and their performance was compared. Finally, an SVM classifier with the proposed SSW was compared with a conventional SVM classifier without SSW. The results demonstrated that the F-measure of the traffic document classification model using the proposed SSW was 0.907, indicating good performance. Therefore, the proposed model can effectively classify traffic documents, and the SSW serves as a significant parameter.
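The word replacement summarized above can be sketched as follows. This is only an illustration under stated assumptions: the weight is taken to be the product of polarity, TF-IDF, and similarity; candidate replacement words are restricted to those present in the sentiment dictionary; and the threshold value of 0.7, the function name, and all dictionary entries are hypothetical.

```python
def sentiment_similarity_weight(word, tfidf, sentiment_dict, similarity_dict,
                                threshold=0.7):
    """Return a weight for `word`, imputing polarity from a similar word.

    Assumed form of the weight: polarity * TF-IDF * similarity.
    """
    if word in sentiment_dict:
        # The word is covered by the sentiment dictionary (similarity = 1).
        return sentiment_dict[word] * tfidf
    # Otherwise, find the most similar word that has a known polarity.
    candidates = [
        (similarity, other)
        for other, similarity in similarity_dict.get(word, {}).items()
        if other in sentiment_dict
    ]
    if not candidates:
        return 0.0
    similarity, best = max(candidates)
    # Replace only when the similarity is higher than the threshold.
    if similarity <= threshold:
        return 0.0
    return sentiment_dict[best] * tfidf * similarity


# Hypothetical sentiment dictionary (polarity) and similarity dictionary.
sentiment_dict = {"accident": -1.0, "smooth": 1.0}
similarity_dict = {
    "collision": {"accident": 0.86, "smooth": 0.12},
    "bus": {"accident": 0.31},
}

w_collision = sentiment_similarity_weight("collision", 0.5, sentiment_dict, similarity_dict)
w_bus = sentiment_similarity_weight("bus", 0.5, sentiment_dict, similarity_dict)
w_accident = sentiment_similarity_weight("accident", 0.4, sentiment_dict, similarity_dict)
```

In this sketch, "collision" is absent from the sentiment dictionary but inherits the negative polarity of "accident", scaled by their similarity, whereas "bus" receives no weight because its most similar dictionary word falls below the threshold.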
In addition, by using the proposed model, an emerging risk detection system based on the user route was constructed. It visualizes the emerging risks detected from news data on a user route, thereby enabling the user to select a safe road. Using the proposed technique, it is possible to extract highly relevant traffic-related documents from unstructured data in various situations, classify the extracted documents, detect emerging risks, and notify users of potential traffic risks.