Classification of Proactive Personality: Text Mining Based on Weibo Text and Short-Answer Questions Text

This study focused on predicting “proactive personality”. With 901 participants selected by cluster sampling, text from targeted short-answer questions and from participants’ social media posts (Weibo) was obtained, while participants’ proactive personality labels were evaluated by experts. For classification, five machine learning algorithms were deployed: Support Vector Machine (SVM), XGBoost, K-Nearest-Neighbors (KNN), Naive Bayes (NB) and Logistic Regression (LR). Seven indicators, namely Accuracy (ACC), F1-score (F1), Sensitivity (SEN), Specificity (SPE), Positive Predictive Value (PPV), Negative Predictive Value (NPV) and Area Under Curve (AUC), combined with stratified cross-validation, were used to evaluate the models comprehensively. With participants’ Weibo text and short-answer questions text, we proposed a new approach to classifying individuals’ proactive personality based on text mining technology. The results showed that the short-answer questions + Weibo text datasets performed best, followed by the short-answer questions text datasets, while the Weibo text datasets performed worst. However, it is noteworthy that Weibo text had the highest average score on SPE, which indicated that Weibo text played an important role in identifying individuals with low proactive personality. Adding Weibo text also improved SEN compared with using short-answer questions text alone. In addition, across all three datasets, SPE was always higher than SEN, indicating that this text classification approach was more competent at identifying college students with low proactive personality. As for algorithms, Support Vector Machine and Logistic Regression showed steadier performance than the other algorithms.


I. INTRODUCTION
“How can we better achieve success in our careers?” This question has prompted extensive thought and discussion, and its answer is highly sought after. To answer it, psychologists and management scientists have studied the topic from different perspectives, producing many conclusions and suggestions. For instance, Judge and Bretz [1] believed it was necessary to analyze the relationship between individual success and personal intrinsic traits. Consistent with this suggestion, psychologists paid attention to the role of personality factors in individual success, especially how proactive personality affects individual success. Bateman and Crant [2] introduced the concept of proactive personality in the study of organizational behavior development, and analyzed the effect of proactive personality from the perspective of organizational development [3]. By definition, proactive personality describes individuals who tend to define their own roles and take the initiative to change their surrounding environment for better adaptation [4], [5]. Thus, individuals with proactive personality are more adaptable to changes in the external environment. Specifically, they usually actively identify opportunities, take advantage of them, and then work hard towards goals that bring meaningful changes to their careers [4], [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Kemal Polat.

II. RELATED WORK
A. PROACTIVE PERSONALITY
In research on proactive personality, Rita and Hans [7] reported a significant positive correlation between proactive personality and self-reported employment behaviors among college students. Compared with individuals with low proactive personality, those with high proactive personality usually have advantages in innovation capability. For instance, they are more willing to think, share and use their abilities to improve their knowledge [8]. Besides that, proactive personality has a positive effect on internal motivation [9], [10], which urges individuals to actively make efforts to change their surrounding environment [11]. In this way, individuals can obtain better opportunities and outcomes in the workplace, and more innovative behavior is generated. Additionally, in the field of human resources, proactive personality has predictive power for organizational behaviors. Thompson's study [12] showed that employees with high proactive personality engaged in more self-examination in their daily work and could establish good relationships with managers by actively changing their surrounding environment, which improved their work performance [13]. Moreover, employees with high proactive personality have an advantage in teamwork [14], which means they are more likely to get attention from their superiors, and this has an important positive effect on their future career development [15].

B. TEXT MINING AND TEXT CLASSIFICATION
Text mining technology can discover, retrieve and extract information from a text corpus that is usually too complicated for manual work [16]. To be more specific, text mining combines technologies such as natural language processing, artificial intelligence, information retrieval, and data mining to help understand complex written analytical processing systems [17], [18]. Initially, text mining was used to help government intelligence and security agencies detect terrorist activities and other security threats. To improve its performance, text mining adopted text analysis components and technologies from related disciplines such as computer science [19], [20], management science, machine learning, and statistics [21]. These improvements are very important for carrying out practical research, and these techniques have since been widely used in related fields [22]. Nowadays, the accuracy of text mining and its ability to handle complex problems have steadily improved.
Researchers in the field of psychology have also deployed text mining as a tool for analyzing psychological factors. In the field of emotional psychology, Neubaum et al. [23] confirmed the phenomenon of emotional contagion in online environments by analyzing the dynamic information of Facebook users. In the field of health psychology, Merchant et al. [24] used open word analysis technology to study language and personality, and achieved accurate prediction of mental health status for Internet users. In the field of personality psychology, Schwartz et al. [25] used a similar method to achieve accurate prediction of the personal characteristics of network users; Kosinski et al. [26] used digital behavior records with dimension reduction and linear regression to predict users' sexual orientation; Chittaranjan et al. [27] studied the association between behavioral traits automatically extracted from smartphone usage and self-reported “big five” personality traits. In addition, some other studies have used big data text analysis for data gathering [28], [29]. These practical results show that text mining analysis improved the outcomes effectively. Zhu et al. [30] summarized the basic research idea of using big data for personality prediction: analyze the user's network behavior data, and then apply machine learning to build a personality feature prediction model. They argued that the psychological characteristics of individuals could be reflected by social network behaviors. Following this idea, Ren et al. [31] used Weibo texts to determine information about individual psychological characteristics, personality types, and social attitudes.
In addition, text classification is an activity that labels natural language text with predefined categories [32]. It requires cross-disciplinary knowledge to build models and to improve the accuracy of prediction [33]. Machine learning algorithms such as BP neural networks [34] and Bayesian theory [32], [35] have been widely used in text classification tasks. Text classification generally involves text representation, selection of classifiers and evaluation of classification results. To be more specific, text representation can be divided into text preprocessing, statistics, feature extraction and other steps [16], [36]. Text preprocessing is an important step that can reduce the interference of noise and improve the accuracy of classification. Across the whole process, the two most influential procedures are feature extraction and model training, which directly affect the accuracy of classification [37].

C. MOTIVATION AND HYPOTHESIS
Firstly, studies of individual personality in the field of psychology often analyze personality traits from either measurement results or text content such as individuals' diaries and interview content. Although text content has potential for the study of personality, gathering and qualitatively analyzing it requires a lot of manpower. In this study, text mining was therefore adopted to collect and process the relevant data, with the aim of improving efficiency while maintaining the accuracy of the assessment.
In addition, most previous studies have measured proactive personality with traditional psychological questionnaires [2], [6], [38], [39]. However, paper-and-pencil questionnaires have shortcomings such as strong social desirability effects. Thus, this study explores the possibility of predicting individuals' proactive personality with text from social media and text from targeted short-answer questions. With participants' Weibo text and short-answer questions text, we propose a new approach to classifying individuals' proactive personality based on text mining technology.
We therefore hypothesized that the short-answer questions + Weibo text datasets would perform better in predicting individuals' proactive personality.
VOLUME 8, 2020

III. METHOD
This study conformed with the code of ethics of the World Medical Association (Declaration of Helsinki) for experiments involving humans and was approved by the Ethics Committee of Shandong Normal University. Additionally, our research obtained written informed consent from the participants.

A. PARTICIPANTS
Cluster sampling was adopted to recruit college students as participants. As a result, 1671 students participated in a survey that contained 4 short-answer questions and a request for their Weibo ID. Among them, 901 participants completed all 4 questions and provided a valid Weibo ID. The sample included 100 males and 801 females, comprising 226 juniors, 347 sophomores, and 328 freshmen. A total of 307 of them came from one-child families. With the provided Weibo IDs, the Weibo posts of all 901 participants were obtained with a web crawler on January 27, 2020. A total of 13,511 Weibo posts were collected. After removing non-original posts (e.g., reposts), 4,955 Weibo posts were kept, an average of 5.5 posts per participant.

B. RESEARCH TOOLS
Four short-answer questions were set to reflect proactive personality. Since proactive personality is usually discussed in relation to occupational behavior, we designed the short-answer questions for college students from the original definition of proactive personality and relevant expressions in different fields, in order to make this research more meaningful for participants. The translated short-answer questions are as follows: (1) “In your daily life, what would you do if your talents were constrained by the environment? Please explain your reason.” (2) “In your life, which do you prefer, accepting existing methods or seeking new approaches to solve problems? Please explain your reason.” (3) “If leaving the pace of regular life and changing the surrounding environment will increase the probability of making mistakes, will you still choose to make changes? Please explain your reason.” (4) “What do you think about your study and life in the next few years, and will they be connected with your future career?” After reading these questions, participants were asked to answer all of them according to their actual thoughts and situations, with about 60 words for each question.

C. PROCEDURE
1) TEXT PREPROCESSING
In text mining studies, data preprocessing is necessary for better classification performance. In this study, we applied the following preprocessing procedure. Firstly, the short-answer questions text was transcribed by 12 psychology undergraduate and graduate students, who had been instructed to transcribe the text exactly as written, including punctuation, and to make no changes during transcription. For the Weibo text, we excluded non-original posts, including advertising posts and reposts. Since all text was written in Chinese, the Jieba package in Python [40] was used to perform word segmentation. Segmented words were compared with the Harbin Institute of Technology stop word list [41] in order to remove pronouns, uninformative auxiliary words and punctuation. As a result, the features in the text become more prominent.
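The stop-word filtering step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the text has already been segmented into tokens (in the study this was done with Jieba, for example via `jieba.lcut`), and the tiny `STOP_WORDS` set and the sample tokens are hypothetical stand-ins for the much larger Harbin Institute of Technology list.

```python
import string

# Hypothetical miniature stop-word set; the study used the Harbin Institute
# of Technology stop word list, which is far larger.
STOP_WORDS = {"我", "的", "了", "是", "会"}
# ASCII punctuation plus a few common Chinese punctuation marks.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：“”（）")


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stop words, punctuation and whitespace-only tokens."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if not tok or tok in stop_words or tok in PUNCTUATION:
            continue
        kept.append(tok)
    return kept


# Tokens as a segmenter might return them for "我会主动寻找新的方法。"
segmented = ["我", "会", "主动", "寻找", "新", "的", "方法", "。"]
print(filter_tokens(segmented))
```

Only the content-bearing tokens survive, which is what makes the later TF-IDF weights concentrate on informative words.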

2) TEXT FEATURE EXTRACTION
Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used technique for text feature extraction in text mining practice. TF-IDF is a statistical method that evaluates the importance of a word to a document in a document set or corpus. If a term occurs frequently in one document but rarely appears in other documents, it is considered to have good class discrimination ability. The formula of TF-IDF, in its standard form, is as follows:
$$w(t,d) = tf(t,d) \times \log\frac{N}{n_t}$$
Here, tf(t,d) is the frequency of the feature term t in document d, N is the number of all training documents, and n_t is the number of documents in the training set in which the feature term t appears. We used TF-IDF to weight the features to facilitate subsequent classifier training. Some studies delete the weights of words that occur less often than a certain frequency, but we retained the TF-IDF weights of all words, because the F-test was used for feature selection later on. With two categories of samples, even if low-frequency words appear, the corresponding features will be filtered out when the differences between the categories are not significant enough.
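As a hedged illustration of this weighting, the sketch below computes tf(t,d) × log(N / n_t) for a toy corpus. Two choices are assumptions rather than details from the study: tf is taken as the relative frequency of the term within the document, and the natural logarithm is used. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf(t,d) * log(N / n_t)."""
    n_docs = len(documents)
    # n_t: number of documents that contain term t at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Assumption: tf is the relative frequency of the term in the document
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["proactive", "change", "environment"],
    ["accept", "environment"],
    ["proactive", "try", "try"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that appears in every document receives weight 0 under this formula, while a term repeated within a single document (here "try") receives the largest weight.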

3) FEATURE SELECTION
In traditional statistics, the F-test is widely used to test the difference between two or more levels of a variable. In this study, we applied the F-test for feature selection, since the principle of feature selection is to separate the features that represent the meaning of the text well from those that do not. To be more specific, features whose values differ greatly between the high and low proactive personality categories can be identified with the F-test, and the p value is used as the selection threshold. The larger the differences between the high score group and the low score group, the more accurate the classification can be. In this study, we adopted 3 different p value thresholds for comparison:
a) Retain all features with a significance level less than or equal to 0.05, since p ≤ 0.05 is a commonly used threshold for statistical inference. Generally, events with a probability of 5% or less are considered small-probability events. Thus, when the error rate is not higher than 5%, the difference between the high score group and the low score group is considered significant, and the feature can be used for classification.
b) Retain all features with a significance level less than or equal to 0.1, since p ≤ 0.1 is the threshold for statistical inference in some educational measurement studies [42]. When the error rate is not higher than 10%, the difference between the high score group and the low score group is considered significant, and the feature can be used for classification.
c) Determine the p value threshold by exhaustive search. An exhaustive search over threshold p values from 0.05 to 0.1 was conducted, and the best-performing value was used as the feature selection standard.
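The F statistic behind this selection can be sketched for a single feature as below. This is an illustrative one-way ANOVA for two groups under stated assumptions, not the authors' code; the function name `f_statistic` and the sample values are hypothetical. Converting F into a p value requires the F distribution with (1, n − 2) degrees of freedom, for example via `scipy.stats.f.sf`, which is omitted here to keep the block free of third-party dependencies.

```python
from statistics import mean


def f_statistic(high, low):
    """One-way ANOVA F statistic for one feature split into two groups."""
    groups = [high, low]
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(high + low)
    # Between-group sum of squares (1 degree of freedom for two groups)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (n_total - 2 degrees of freedom)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / 1) / (ss_within / (n_total - 2))


# Hypothetical TF-IDF weights of two features in the high and low groups
discriminative = f_statistic([0.90, 0.80, 0.70, 0.85], [0.10, 0.20, 0.15, 0.30])
noisy = f_statistic([0.50, 0.10, 0.90, 0.30], [0.40, 0.20, 0.80, 0.35])
print(discriminative, noisy)
```

Because the p value decreases monotonically as F grows for fixed degrees of freedom, keeping features with p below a threshold is equivalent to keeping features whose F exceeds the matching critical value.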

4) MODEL TRAINING
In the process of model training, grid search with cross-validation (CV) was used to determine the parameters. Grid search constructs a grid from all candidate values of the parameters within a certain range and evaluates every combination in order to find the one with the best performance [43].
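Grid search with cross-validation can be sketched as follows. This is a minimal, library-free illustration: the stratified fold assignment, the one-parameter threshold "model" and all function names (`stratified_folds`, `cross_val_accuracy`, `grid_search`, `fit_threshold`, `predict_threshold`) are hypothetical and stand in for the classifiers and parameter grids actually tuned in the study.

```python
import itertools
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_val_accuracy(fit, predict, X, y, params, k=5):
    """Mean validation accuracy of one parameter setting over k stratified folds."""
    scores = []
    for fold in stratified_folds(y, k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k


def grid_search(fit, predict, X, y, grid, k=5):
    """Try every combination in the grid and return (best_params, best_score)."""
    names = sorted(grid)
    best = (None, -1.0)
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, params, k)
        if score > best[1]:
            best = (params, score)
    return best


# Toy model (hypothetical): predict class 1 when the feature exceeds a threshold.
def fit_threshold(X, y, threshold):
    return threshold


def predict_threshold(model, x):
    return int(x > model)


X = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(grid_search(fit_threshold, predict_threshold, X, y, {"threshold": [0.2, 0.5, 0.8]}))
```

The same loop structure applies to real classifiers: only `fit`, `predict` and the grid of candidate values change.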
After the values of the parameters were obtained, we trained the models and evaluated their performance with stratified K-fold cross-validation [44], [45]. In stratified K-fold cross-validation, the data set is divided into K equal subsets, and the proportion of each sample category in each fold is the same as its proportion in the whole data set. In this study, based on previous experience and trial and error, K was set to 5. The whole process of this research is illustrated in Figure 1 and described as follows:
a) Training set and validation set: Since five-fold cross-validation was applied, the original data was divided into five parts without repeated sampling. Each time, one part was selected as the validation set and the remaining text was used as the training set. Performance was calculated as the average over the 5 trials.
b) Determination of “classification labels”: After collecting the relevant literature on proactive personality, we invited 16 experts in related fields to assist in this research. They were asked to score the relevant factors from 0 to 9. In this way, the criterion for evaluating proactive personality was obtained. Then the weighted average score of each participant was calculated and sorted in descending order. The top 50% of participants were labeled as the high proactive personality group (category), while the remaining participants were labeled as the low proactive personality group.
c) Classifier selection: Based on previous experience, we selected 5 algorithms that are widely used in text mining studies: Support Vector Machine (SVM), XGBoost, K-Nearest-Neighbors (KNN), Naive Bayes (NB) and Logistic Regression (LR).

1) SUPPORT VECTOR MACHINE (SVM)
SVM was first proposed by Vapnik [46]; it aims to find the hyperplane with the largest margin as the classification boundary. It is based on the structural risk minimization (SRM) principle in statistical learning theory, and it has outstanding generalization performance [47]-[49]. Kernel functions such as the linear kernel, polynomial kernel, radial basis function (RBF) kernel, Fourier kernel, and spline kernel have been introduced into SVM to handle the inner product operation in high-dimensional space, so that nonlinear classification tasks can be dealt with well.
Non-stationary kernels such as the polynomial kernel are well suited for problems where all the training data is normalized. However, considering the time cost, we did not choose a non-stationary kernel. In this study, the linear kernel, the sigmoid kernel and the radial basis function (RBF) kernel of SVM were used. The linear kernel is the simplest kernel function. It is given by the inner product <x, y> plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts. The sigmoid kernel comes from the neural networks field, where the bipolar sigmoid function is often used as an activation function for artificial neurons. It is interesting to note that an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. This kernel has been quite popular for support vector machines because of its origin in neural network theory. Also, despite being only conditionally positive definite, it has been found to perform well in practice. In addition, the radial basis function is a scalar function that is symmetric along the radial direction; it is usually defined as a monotonic function of the Euclidean distance between any point x in space and a certain center x_c. Because it has the same form as the Gaussian distribution, it is also called the Gaussian kernel function, and it can map the original features to an infinite-dimensional space.
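The three kernels used here can be written out in a few lines. The sketch below uses their common textbook forms; the parameter names `c`, `alpha` and `gamma` and their default values are assumptions for illustration and are not the settings tuned in the study.

```python
import math


def dot(x, y):
    """Inner product <x, y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y, c=0.0):
    """Inner product <x, y> plus an optional constant c."""
    return dot(x, y) + c


def sigmoid_kernel(x, y, alpha=1.0, c=0.0):
    """Bipolar sigmoid (hyperbolic tangent) of a scaled inner product."""
    return math.tanh(alpha * dot(x, y) + c)


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel: exp(-gamma * ||x - y||^2), a function of distance only."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), sigmoid_kernel(u, v, alpha=0.1), rbf_kernel(u, v, gamma=0.1))
```

The RBF kernel depends only on the distance between the two points, which is why it equals 1 for identical inputs and decays towards 0 as they move apart.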

2) XGBOOST
XGBoost is a scalable end-to-end tree boosting system characterized by fast computation and good performance; it has been widely used by data scientists to achieve state-of-the-art results in many machine learning competitions [50], [51]. The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.
The idea of the algorithm is to keep adding trees, growing each tree by repeatedly splitting on features. Each time a tree is added, the model is in effect learning a new function to fit the residuals of the previous prediction. Once k trees have been obtained from training, the score of a sample can be predicted: according to its features, the sample falls into one leaf node in each tree, and each leaf node corresponds to a score. The scores from all trees are added up to give the predicted value of the sample. As a non-parametric model for supervised learning, the selection of XGBoost parameters depends on the training data used in the model [52]. The objective function is
$$Obj = \sum_{i} l(\hat{y}_i - y_i) + \sum_{k} \Omega(f_k)$$
where i represents the i-th sample, l(ŷ_i − y_i) represents the prediction error of the i-th sample, and Ω(f_k) represents the complexity of the k-th tree. The smaller the complexity, the stronger the generalization ability. In the standard XGBoost formulation, the expression is
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$$
where T is the number of leaf nodes, w is the vector of leaf scores, and γ and λ are regularization coefficients.
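The idea of adding trees that fit the residuals of the previous prediction can be illustrated with a deliberately small sketch. This is not XGBoost itself: it is plain gradient boosting with depth-1 trees (stumps) on one feature, squared-error loss, and no complexity penalty, and the names `fit_stump`, `predict`, `boost` and the `learning_rate` value are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the split on a 1-D feature that best fits the residuals (squared error)."""
    best = None
    for split in sorted(set(xs))[:-1]:  # the largest value would leave the right leaf empty
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_score = sum(left) / len(left)
        right_score = sum(right) / len(right)
        error = (sum((r - left_score) ** 2 for r in left)
                 + sum((r - right_score) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, split, left_score, right_score)
    return best[1:]


def predict(trees, x, learning_rate):
    """Sum the (shrunken) leaf scores the sample receives from every tree."""
    return sum(learning_rate * (left if x <= split else right)
               for split, left, right in trees)


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add trees one at a time, each fitted to the current residuals."""
    trees = []
    for _ in range(n_trees):
        residuals = [y - predict(trees, x, learning_rate) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return trees


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
trees = boost(xs, ys)
print([round(predict(trees, x, 0.5), 3) for x in xs])
```

Each sample lands in exactly one leaf per tree, and the prediction is the sum of those leaf scores, mirroring the description above; XGBoost additionally penalizes tree complexity and uses second-order gradient information.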

3) K NEAREST NEIGHBORS (KNN)
The KNN classifier was proposed by Cover and Hart in 1967, and its performance has been shown to be excellent with large sample sizes [53]. The error rate of KNN can approach the Bayes-optimal error rate under very mild conditions [54]. KNN is an instance-based algorithm, which means that no explicit model is trained; classification is based on comparing the case to be classified with the instances in the training data. This makes KNN very sensitive to the number of features, and irrelevant features can greatly reduce the accuracy of prediction [55]. Thus, the feature extraction process is very important for the KNN algorithm.
The process of the KNN algorithm can be described as follows:
a) Calculate the distance (e.g., Euclidean distance, Manhattan distance, cosine distance) between the current case and each known case.
b) Sort the cases by distance in increasing order.
c) Select the k points that have the smallest distance from the current case.
d) Determine how frequently each category occurs among the k points.
e) Output the most frequent category among the k points.
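Steps a) to e) can be sketched directly in code. The sketch below is a minimal illustration using Euclidean distance; the function names, the toy two-dimensional points and the labels are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    # a) distance from the query case to every known case
    distances = [(distance(x, query), label) for x, label in zip(train_X, train_y)]
    # b) sort by increasing distance
    distances.sort(key=lambda pair: pair[0])
    # c) keep the k nearest cases
    nearest = distances[:k]
    # d) count how often each category occurs among them
    votes = Counter(label for _, label in nearest)
    # e) the most frequent category is the output
    return votes.most_common(1)[0][0]


train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

Because every feature contributes to the distance, irrelevant features dilute the neighborhoods, which is the sensitivity noted above.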

4) NAIVE BAYES
Naive Bayes is a simple but widely used classifier based on statistics. In text mining, decisions are made based on the presence or absence of certain features [56]. That is, for each feature, the probability of belonging to a certain class is estimated from the training data. After all probabilities have been calculated, a decision can be made based on the presence of features in the testing set. The term “naive” means that all features are treated independently. In other words, the frequency of features in the testing set is not taken into account, and all features are assumed to occur independently of each other. Given a training set D = {d1, d2, ..., dn} with feature attributes X = {x1, x2, ..., xd} and class variable Y = {y1, y2, ..., ym}, the prior probability of Y is P_prior = P(Y) and the posterior probability of Y is P_post = P(Y|X). By Bayes' theorem, P_post = P(Y|X) can be calculated as
$$P(Y|X) = \frac{P(Y)\,P(X|Y)}{P(X)}$$
Since the features are independent of each other, we have
$$P(X|Y) = \prod_{i=1}^{d} P(x_i|Y)$$
Thus, the posterior probability of Y can be calculated as
$$P(Y|X) = \frac{P(Y)\prod_{i=1}^{d} P(x_i|Y)}{P(X)}$$
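A presence/absence Naive Bayes classifier of this kind can be sketched as below. This is an illustrative sketch, not the study's implementation: the add-one (Laplace) smoothing, the use of log probabilities, the function names `train_nb` and `predict_nb`, and the toy documents are all assumptions. The denominator P(X) is omitted because it is the same for every class and does not change which class scores highest.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate P(Y) and P(x_i present | Y) with add-one smoothing."""
    vocab = {t for doc in docs for t in doc}
    class_counts = Counter(labels)
    present = defaultdict(Counter)  # present[y][t]: documents of class y containing t
    for doc, y in zip(docs, labels):
        present[y].update(set(doc))
    priors = {y: n / len(labels) for y, n in class_counts.items()}
    likelihood = {
        y: {t: (present[y][t] + 1) / (class_counts[y] + 2) for t in vocab}
        for y in class_counts
    }
    return priors, likelihood


def predict_nb(model, doc):
    """Return the class maximizing log P(Y) + sum_i log P(x_i | Y)."""
    priors, likelihood = model
    doc_terms = set(doc)
    scores = {}
    for y, prior in priors.items():
        score = math.log(prior)
        for t, p in likelihood[y].items():
            # Each vocabulary term contributes independently: present or absent
            score += math.log(p if t in doc_terms else 1 - p)
        scores[y] = score
    return max(scores, key=scores.get)


docs = [["proactive", "change"], ["proactive", "try"], ["accept", "wait"], ["wait", "habit"]]
labels = ["high", "high", "low", "low"]
model = train_nb(docs, labels)
print(predict_nb(model, ["proactive"]), predict_nb(model, ["wait"]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when the vocabulary is large.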

5) LOGISTIC REGRESSION
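Logistic regression is a generalized linear regression model, which can be used for classification or prediction by constructing a regression function [57]. The logistic regression model is a classifier that focuses on binary classification problems [58], and it can also handle multi-class problems. Logistic regression first obtains a predicted value from a linear regression and then passes this value through the sigmoid function, which maps any input to the interval (0, 1); the predicted value is the x-axis variable and the y-axis value is interpreted as a probability. The logistic regression function is
$$g(z) = \frac{1}{1 + e^{-z}}$$
and the prediction function is
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
The value of h_θ(x) has a special meaning: it represents the probability that the result is 1. Therefore, the probabilities that the classification result of the input x is category 1 and category 0 are
$$P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$
which can be written together as
$$P(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1-y} \tag{11}$$
The likelihood function is
$$L(\theta) = \prod_{i=1}^{n} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}} \tag{12}$$
and the log-likelihood function is
$$l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
where n is the number of training samples. Gradient ascent is then used to find the θ that maximizes l(θ).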
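Maximizing the log-likelihood by gradient ascent can be sketched as follows. This is a minimal batch implementation under stated assumptions: the learning rate, the number of steps, the function names and the toy data (with a leading 1 in each input acting as the intercept term) are hypothetical and are not the settings used in the study.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x), the estimated probability that y = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_likelihood(theta, X, y):
    """l(theta) = sum_i [y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))]."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total


def fit(X, y, learning_rate=0.1, n_steps=500):
    """Batch gradient ascent on the log-likelihood."""
    theta = [0.0] * len(X[0])
    for _ in range(n_steps):
        # Gradient of l(theta) with respect to theta_j: sum_i (y_i - h(x_i)) * x_ij
        errors = [yi - predict_proba(theta, xi) for xi, yi in zip(X, y)]
        for j in range(len(theta)):
            theta[j] += learning_rate * sum(e * xi[j] for e, xi in zip(errors, X))
    return theta


# Each input is [1, feature]; the leading 1 is the intercept term.
X = [[1, 0.5], [1, 1.0], [1, 1.5], [1, 3.0], [1, 3.5], [1, 4.0]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y)
print(theta, predict_proba(theta, [1, 4.0]), predict_proba(theta, [1, 0.5]))
```

The update moves θ in the direction of the gradient because the goal is to maximize l(θ); minimizing the negative log-likelihood by gradient descent is the same procedure with the sign flipped.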

E. INDICATORS OF CLASSIFICATION
To evaluate the performance of prediction, we deployed several indicators derived from the confusion matrix (Table 1) plus the area under curve (AUC), in order to evaluate our models comprehensively. Among these indicators, accuracy (ACC) (formula 15) indicates the percentage of individuals that were classified correctly. Meanwhile, sensitivity (SEN) (formula 16) suggests the percentage of [...]

[...] from 0.27 to 0.35. This showed that the feature extraction process can improve the accuracy of prediction. Among the 3 text datasets, the short-answer questions text had the smallest number of features, since the information in the answers was limited by the frame of the questions; it was followed by the Weibo text, which has more abundant information than the short-answer questions text; and, unsurprisingly, the short-answer questions + Weibo text datasets had the most features. After we performed an exhaustive search in the range from 0.05 to 0.1, a p value of 0.0762 performed best for the short-answer questions text, while 0.0866 was best for the Weibo text and 0.0868 for the short-answer questions + Weibo text.
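The confusion-matrix indicators used to report these results (ACC, SEN, SPE, PPV, NPV and F1) follow standard definitions, which can be sketched as below. The sketch assumes that the high proactive personality group is the positive class, coded as 1, and that every denominator is non-zero; the function names and the toy labels are hypothetical. AUC is omitted because it requires continuous scores rather than hard predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def indicators(y_true, y_pred, positive=1):
    """Compute the confusion-matrix indicators; assumes non-zero denominators."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    sen = tp / (tp + fn)  # share of actual positives that were found
    spe = tn / (tn + fp)  # share of actual negatives that were found
    ppv = tp / (tp + fp)  # share of predicted positives that are correct
    npv = tn / (tn + fn)  # share of predicted negatives that are correct
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": sen,
        "SPE": spe,
        "PPV": ppv,
        "NPV": npv,
        "F1": 2 * ppv * sen / (ppv + sen),
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(indicators(y_true, y_pred))
```

Under this coding, SPE higher than SEN means the classifier recovers the low proactive personality group more reliably than the high group, which is how the comparison between the two indicators is read in this paper.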

3) RESULT OF CLASSIFICATION UNDER P VALUE = BEST THRESHOLD
As illustrated by Tables 8 to 10, after the exhaustive search for the best threshold, the short-answer questions + Weibo text had the best performance on ACC, F1, NPV, AUC and SEN: 0.840, 0.808 and 0.922 with the SVM algorithm, 0.745 with the NB algorithm, and 0.807 with the LR algorithm, respectively. The short-answer questions text had the best performance on SPE and PPV with the NB algorithm, at 0.969 and 0.925. The Weibo text did not show a significant advantage over the other two datasets. From the perspective of algorithms, SVM, Naïve Bayes and Logistic Regression performed better. Comparing across the 3 datasets (short-answer questions text, Weibo text, and short-answer questions + Weibo text), the average ACC over the 5 algorithms was 0.736, 0.698 and 0.774 respectively, which indicated that all 3 datasets gave acceptable classification results. Besides that, the short-answer questions + Weibo text datasets showed a clear advantage over the other two datasets on classification, which indicated that combining the two text sources can help to make more accurate predictions. When we compared the short-answer questions text datasets with the Weibo text datasets, the short-answer questions text datasets were superior on all indicators except SEN and NPV under the 0.05 p value, which means the short-answer questions text had a better predictive effect. For the short-answer questions + Weibo text datasets, PPV and NPV ranged from 0.850 to 0.904 and from 0.757 to 0.810 respectively, which means both reflected high confidence in making correct classifications.
At the same time, the features with high weights are listed in Table 11; these words were crucial in classifying individuals by proactive personality. Figure 5 illustrates the average results of the 3 datasets after extracting features under the best threshold p value; Figure 6 illustrates the average performance of the 5 classifiers after extracting features under the best threshold p value. After calculating the average performance, we found that the short-answer questions + Weibo text datasets had the best performance among the 3 datasets, followed by the short-answer questions text datasets; the Weibo text datasets had the worst predictive accuracy. However, it is worth mentioning that the Weibo datasets had the best performance on the SPE indicator. From the perspective of indicators, all three datasets had steady performance on PPV and SPE, while the combination of datasets improved SEN considerably.

4) COMPARISON BETWEEN CLASSIFIERS UNDER BEST THRESHOLD P VALUE
From the perspective of classifiers, the SVM, LR and NB algorithms each had their own advantages. Among them, SVM and LR showed steady performance, while NB had the best performance on AUC, SPE and PPV. Yet the performance of NB on the other indicators was less desirable. The performance of the KNN and XGBoost algorithms was unsatisfactory, which means they might not be suitable for text classification in this study. From the perspective of indicators, SPE was the highest among all indicators, followed by AUC, while SEN was the lowest.

V. DISCUSSION
In the field of psychology, previous in-depth studies of personality often analyzed individual personality traits from individuals' freely written diaries and interview content [59], [60]. However, as the sample size increases, qualitative analysis of all subjects requires a lot of manpower and material resources. In addition, most studies measuring proactive personality chose to use traditional scales [2], [6], [38], [39]. As a result, the shortcomings of self-reported scales, such as the social desirability effect, are inevitable. The evaluation process of this study adds a new approach to the field of personality measurement.
As we mentioned, an individual's proactive behavior has a very significant effect on both the individual and organizations [6], [8], [9], [10], [13], [14], [61]. As Frese [62] mentioned in his research, there are two ways to measure an individual's proactive personality: behavioral interviews and self-report. In abstract terms, they represent the objective and the subjective approach to personality evaluation. Francesca [63] mentioned in educational research that the concept of variable analysis could be used as a reference for demographic variables, which plays a very important role in improving the accuracy of data. He [37] reported that self-report text and a PTSD symptom scale can be combined to analyze the initiative of personnel, and that good measurement results were obtained when Bayesian theory was integrated. This study analyzed subjective material in an objective way, and this kind of combination shows great potential because of its accurate prediction and convenience. In addition, by using text from social media, we greatly improved the ecological validity of proactive personality measurement.
In the process of data mining, our study had three major innovations. Firstly, this study analyzed 3 datasets, whereas most previous research used only social media text. Secondly, features were extracted based on significance level. Thirdly, 5 classifiers were deployed with 5-fold cross-validation and compared on 7 comprehensive indicators. As mentioned by He [37], an accuracy of 0.700 was reached when using NB to identify individuals with PTSD. The accuracy across the 3 datasets in this study was higher than in He's study, possibly for the following reasons.
(1) In the process of collecting text data, the number of words was kept within 150, while the sample size was reasonable according to Liliya et al. [64]. (2) Classifiers like SVM are more compatible with medium sample sizes in text classification tasks [49]. (3) Our datasets contained text from 4 short-answer questions and an average of 5.5 Weibo posts from each participant, which is more abundant than in previous research.
In the comparison among the three datasets, the combined dataset showed the best performance. This result is consistent with our expectation: increasing both the quantity and the variety of information helps in classifying proactive individuals. Short-answer questions text was probably superior to Weibo text because the short-answer questions were designed specifically to measure proactive personality, whereas Weibo text is open-ended. The noise in Weibo text can therefore disturb classification [65]. However, it is worth noting that Weibo text can enhance the predictive effect of short-answer questions text, which means that the daily life status reflected in Weibo posts can help in predicting personality. Given how widely available Weibo text is, we suggest that future text mining research consider combining targeted short-answer questions text with social media text to improve classification accuracy. In addition, SPE was higher than SEN in all three datasets, which means that text mining is more competent in identifying individuals with low proactive personality.
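To make the comparison between SEN and SPE concrete, the following minimal sketch computes the six threshold-based indicators from a binary confusion matrix. It assumes that the positive class is "high proactive personality"; the function name `binary_indicators` and the counts in the example are hypothetical and are not taken from the study. AUC is omitted because it requires ranked scores rather than counts.

```python
def binary_indicators(tp, fp, tn, fn):
    """Compute ACC, F1, SEN, SPE, PPV and NPV from confusion-matrix counts.

    Assumption: the positive class is "high proactive personality".
    """
    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "ACC": ratio(tp + tn, tp + fp + tn + fn),  # overall accuracy
        "F1": ratio(2 * tp, 2 * tp + fp + fn),     # harmonic mean of PPV and SEN
        "SEN": ratio(tp, tp + fn),                 # true positive rate
        "SPE": ratio(tn, tn + fp),                 # true negative rate
        "PPV": ratio(tp, tp + fp),                 # precision
        "NPV": ratio(tn, tn + fn),                 # negative predictive value
    }


# Hypothetical counts: 100 high-proactive and 100 low-proactive participants.
example = binary_indicators(tp=60, fp=10, tn=90, fn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical case SPE exceeds SEN, which is the pattern described above: the classifier recognizes low-proactive individuals more reliably than high-proactive ones.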
In the comparison among algorithms, as mentioned above, SVM and LR performed well on all indicators, while NB had advantages on the AUC, SPE and PPV indicators. Consistent with previous studies, this result suggests that linear algorithms such as SVM and LR are suitable for binary problems in the text classification domain. For example, in the study of Liu et al. [66], which used Weibo text as a predictor, SVM had the best performance in identifying individuals with suicide risk; in Sun's [67] comparative study, the standard SVM algorithm learned the best decision surface in most test cases. They also suggested that the threshold value is crucial for the performance of the SVM algorithm. Given that the results of this study are acceptable, using the p value should be a good approach for setting this threshold. In contrast, non-linear classifiers such as KNN and XGBoost did not perform well in this study, and overfitting might be one of the causes. The unstable performance of NB might be due to a characteristic of the algorithm itself, which assumes that attributes are independent [68]. This conditional independence assumption makes NB very sensitive to the form of the input. We therefore suggest that future studies consider linear classifiers for binary personality classification problems, especially when the sample size is limited.
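The idea of using a p value as the feature selection threshold can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this study: it assumes a Pearson chi-square test on word presence versus class label and a cutoff of 0.05, and the helper names `chi2_p_value` and `select_features` as well as the toy tokens are hypothetical.

```python
import math


def chi2_p_value(a, b, c, d):
    """p value of Pearson's chi-square test (1 degree of freedom) on a 2x2 table.

    a: high-proactive texts containing the word, b: high-proactive texts without it,
    c: low-proactive texts containing the word,  d: low-proactive texts without it.
    """
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        # The word appears in every text or in none: it carries no information.
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / den
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def select_features(docs, labels, alpha=0.05):
    """Keep words whose association with the label has p < alpha.

    docs: list of token lists; labels: 1 = high proactive, 0 = low proactive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    vocab = sorted({word for doc in docs for word in doc})
    kept = []
    for word in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y == 1 and word in doc)
        c = sum(1 for doc, y in zip(docs, labels) if y == 0 and word in doc)
        if chi2_p_value(a, n_pos - a, c, n_neg - c) < alpha:
            kept.append(word)
    return kept


# Hypothetical toy corpus: 10 high-proactive and 10 low-proactive texts.
docs = [["the", "plan"]] * 10 + [["the", "wait"]] * 10
labels = [1] * 10 + [0] * 10
selected = select_features(docs, labels)
print(selected)
```

Words that occur equally often in both groups, such as "the" in the toy corpus, are discarded, so a linear classifier such as SVM or LR is trained only on words that discriminate between the two groups.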

VI. CONCLUSION
In conclusion, text mining technology showed great value in predicting individuals' proactive personality, especially in identifying individuals with low proactive personality, and this can be very valuable for career education practice in high school and college. The best accuracy and specificity reached 0.842 and 0.969, respectively. The form of data used in this study is innovative: few previous studies have combined short-answer questions text and social media text for text mining purposes. The supplementary effect of social media text in predicting individuals' personality is noteworthy, and it is reasonable to assume that this effect not only works in predicting proactive personality but can also be valuable for predicting other traits. Additionally, feature extraction based on the p value proved to be an effective approach to data preprocessing when dealing with text material. Last but not least, support vector machine and logistic regression showed steady performance in this text mining study, and we recommend that researchers give priority to these two algorithms in similar studies.

APPENDIX
The high-weight words for the three datasets, in the Chinese version, are shown in Table 12.

He is also working on integrating career fields with big data. He has published various articles and chapters on these subjects. He has been involved in data mining research with Tsinghua University, as an Advanced Visiting Scholar. His research focuses on big data psychology, modern measurement theory, and career planning. He was named a Young Talent of the Dongyue Scholars at Shandong Normal University, from 2018 to 2023.
YUN YAN is currently pursuing the bachelor's degree with the School of Psychology, Shandong Normal University. She is also working with Prof. Peng Wang's Group on big data research. Her main research focus is big data psychology. She believes big data can not only provide more diversified and heterogeneous samples for psychology research, but also free researchers from the limitations of time and space, and avoid social expectation effects as much as possible. It also avoids the complex and unrelated interference to which research subjects are exposed during testing.
YINGDONG SI is currently pursuing the master's degree with Shandong Normal University. His advisor is Prof. Peng Wang, and he mainly follows Prof. Wang's career and network research. Since enrolling in 2017, he has been studying statistical methods and is skilled in using structural equation models to solve psychological problems. In addition, he has gained experience in research on cyberbullying and Internet addiction. He has participated in publishing articles on Internet addiction in important foreign journals.
GANCHENG ZHU received the bachelor's degree from Shandong Normal University, China, in 2019. He is currently pursuing the master's degree with Jilin University. His interests include Item Response Theory (IRT), a psychometric theory, and big data in applied psychology. Under the guidance of Prof. Wang, he published his first article on IRT in a CSSCI journal and, while he was a senior, a chapter on big data in psychology in the book Looking at the world with Big Data: Middle School and Big Data Culture (Wang, 2018). He is also quite dedicated to Bioinformatics and Computational Analysis.
XIANGPING ZHAN received the bachelor's degree from Southwest University, China, in 2018. She is currently pursuing the master's degree with Shandong Normal University. Her research interests include big data psychology and text mining. She is interested in using big data methods as a research tool, with sources such as Weibo and WeChat in China, to analyze people's online psychology and behavior through social media. She follows her advisor Prof. Peng Wang in carrying out big data research.
JUN WANG received the bachelor's degree from Qingdao University, China, in 2016. He is currently pursuing the master's degree with Shandong Normal University. His research focus is big data psychology, which mainly uses big data methods to collect, process and analyze data, so as to increase the understanding of social phenomena and public psychology. He has followed his advisor Prof. Peng Wang in participating in a number of national and provincial projects.
RUNSHENG PAN is currently pursuing the master's degree in applied psychology with Shandong Normal University, under the guidance of Prof. Peng Wang. With passion towards psychology, his current research interests are in pathological internet use and adaptation of college students.