Extracting Deep Personae Social Relations in Microblog Posts

Numerous studies have been conducted to extract relationships from different kinds of documents. However, extracting relationships from microblog posts is rarely studied. In this paper, we improve a novel kernel-based learning algorithm to mine personae social relationships from microblog posts by combining the syntactic and semantic meanings of the dependency trigram kernel (DTK). To deeply extract the personae social relationships in microblog posts, we define relation feature words, provide seven rules for extracting these feature words, and propose a rule-based approach that mines them from microblog posts. Because microblog posts lack prominent relation features, we construct relation feature word dictionaries for different relation types. We propose an algorithm that classifies relation feature words by considering two of their features, namely, the syntactic and semantic similarities between relation feature words in microblog posts, and by using the relation feature word dictionaries. Experimental results show that our proposed approach outperforms the original DTK in average recall, precision, and F-measure for sentence selection, personae social relation extraction, and personae social relation classification. Finally, the relation graphs of five topics illustrate that our proposed approach is effective for extracting personae social relations from microblog posts.


I. INTRODUCTION
The web is an important platform for searching for useful information. At present, an increasing number of people are using social media, such as Twitter, Facebook, and microblogs, which generate a large quantity of microtext information (such as microposts and videos) every day. One or several microblog posts usually cannot provide useful and valuable information. However, a large number of microblog posts on microblog platforms can reveal public opinions and important events to the general public and the government. Microblog posts become increasingly complex with time, making it difficult for researchers to obtain useful information from historical microblog texts. Therefore, knowledge graph systems are necessary for solving this problem on microblog platforms [1]. Several knowledge graphs, such as Google Knowledge Graph [2], Microsoft Satori [3], DBpedia [4], and Freebase [5], have been developed and used extensively by search engines to enhance their semantic search functions. The basic components of knowledge graphs [6] include entities (persons, things, places, events, and topics) and the relations among these entities extracted from the web. Information extraction technology, which is very important for these systems, mainly includes three tasks [7], [8]: entity recognition, entity relation extraction, and event detection. Many relation extraction approaches have been successfully applied to long texts, but few studies have investigated how to extract entities and relations from microtexts [9], [10]. Two important international conferences, the Message Understanding Conference (MUC) [13] and Automatic Content Extraction (ACE) [11], guide the relation extraction technologies for documents. The MUC has defined many relation templates to mine different types of relations from documents, such as employee_of, product_of, and location_of among organizations. Many methods have been proposed for relation extraction on the ACE2008 corpus, which defines six relation types: Art (artifact), Gen-Aff (General-affiliation), Org-Aff (Org-affiliation), Part-Whole (Part-to-Whole), Per-Soc (Person-Social), and Phys (Physical).
The associate editor coordinating the review of this manuscript and approving it for publication was Shirui Pan.
We classify the relation extraction approaches into three categories: rule-based, feature-based, and kernel-based approaches. They focus on mining relations from documents such as web pages and news reports. A large number of research works [14] show that these methods are very successful in dealing with entity and relation extraction in long texts because the sentences in long texts often have clear semantic meanings, the semantics of words in the sentences are almost unambiguous, and the vectors of long texts are less sparse than those of short texts. However, these approaches are ineffective when sentence and word semantics are ambiguous or the data are sparse. Recently, an increasing number of people are using social media. The Sina microblog platform releases hundreds of millions of microposts every day, generating 50 GB of microtext data. Facebook [12] handles 350 million photos, 4.5 billion "compliments", and 10 billion messages a day from around the world. These microblog posts, photos, and messages include rich entities (people names, place names, etc.) and relations (friends, adjacencies, etc.) among them. Additionally, the knowledge graph is a very important tool for developing application systems based on microblogs. The knowledge graph saves and organizes entities and their relations extracted from microblog posts. Therefore, extracting relations from these microblog posts is an urgent problem in the information retrieval of microblogs. However, the sentences in these texts are incomplete and short, and their semantics are ambiguous. The words in these sentences usually have multiple meanings. Existing relation extraction approaches do not function well in microblog applications because microposts with short texts easily cause data sparsity problems. Thus, we focus on personae social relation extraction from microposts in this paper.
Our main contributions are as follows:
• We improve a novel kernel-based learning algorithm (denoted as NDTK, based on the dependency trigram kernel) to mine personae social relations. This algorithm does not rely on entity information to train on microblog posts. We divide the sentences in these microblog posts into dependency trigrams and combine the syntactic and semantic meanings of the DTK to compute the similarity of two sentences.
• We define relation feature words (FWs), which indicate the type of personae social relation between person entities. We propose a rule-based approach to mine these relation FWs for deeply extracting the personae social relations in microblog posts. We identify seven rules for extracting relation FWs. These rules are based on the entity positions and the word semantic roles obtained from the language technology platform cloud (LTP).
• Inspired by the system in [30], we design learning algorithms to classify the relation FWs. In the algorithm, we consider the relation types among persons and build a relation FW dictionary for every relation type. We then determine the types of the relation FWs from both syntactic and semantic viewpoints.

The differences between our proposed NDTK and the rule-based, feature-based, and kernel-based approaches are listed in Table 1. The rest of this paper is organized as follows. In Section II, we introduce the DTK for relation extraction and word similarity approaches. In Section III, we propose our NDTK approach to mine personae social relationships from microblog posts. The experimental results are presented and analyzed in Section IV. We conclude with future work in Section V.

II. RELATED WORKS

A. RELATION EXTRACTION APPROACHES
There are many research works on rule-based, feature-based, and kernel-based approaches. We describe them as follows:
• Rule-based approaches. These approaches first extract relation rule models by considering the words, phrases, morphologies, and semantic meanings in the document corpus. Then, the relations are retrieved by matching the rule models. Here, we give several typical representatives of these approaches from different periods. Brin [15] proposed the dual iterative pattern relation expansion (DIPRE) to extract the relations among authors and documents by marking the document corpus and constructing relation rule models. Matsuo et al. [16] developed the Polyphonet system. To mine the relations, the system extracts the co-occurrence information among words appearing in web pages using Google search technologies. Then, they proposed relation class models and classified the co-occurrence information into different relations. Nie et al. [17] mined relations in specific domains by considering the semantic similarity and dense clusters of the relation rules. The precision of their method improved by 4% compared with DIPRE. Xu et al. [18] and Zhang et al. [19] adopted machine learning to train the relation rule models and proposed trigger words of relations to discover relations on the web.
• Feature-based approaches. These approaches first retrieve the word and phrase features to form feature vectors for each relation category. They then build classifiers to discover new relations in new documents. Kambhatla [20] proposed a maximum entropy model, which combines several text features, such as lexical, syntactic, and semantic features, to extract 24 relation subtypes in the ACE 2003 corpus. Che et al. [21] used a support vector machine (SVM) to train on a dataset for relation extraction. Their learning algorithms needed to select features from the ACE 2004 corpus. Xia and Lehong [22] combined sequence, appearance, punctuation, and context features to extract the relations of terms. Finally, they classified these relations using Bayes classifiers. Liu et al. [23] proposed an extreme learning machine based on the neural network algorithm to extract entity relations. They built a concept model to retrieve the efficient space features, which included the sentence features and the relations among the sentences. Huang et al. [24] observed that the space feature vectors of documents have high dimensionality, leading to sparse data vectors. They used document frequency, information quantity, mutual information quantity, and the chi-square test to reduce the dimensionality. Then, they used an SVM to mine the personae relations.
VOLUME 8, 2020
• Kernel-based approaches. These approaches compute the similarities of two feature vectors in a high-dimensional space. The similarities are important parameters for constructing the classifiers of the relations. These approaches usually express the features of documents, sentences, words, phrases, and semantic meanings using nonlinear structures, such as trees. Moreover, the feature vectors contain considerable hidden information about the entity relations. Zelenko et al. [25] designed a kernel function method to extract entity relations from unstructured texts. They proposed kernel functions to compute the similarities of two texts and fed the similarities to SVM classifiers to mine person-affiliation and organization-location relations. Yu et al. [26] proposed a convolution tree kernel-based method to extract Chinese semantic relations. They utilized entity types, subtypes, and mention types to construct unified syntactic and entity semantic trees and evaluated the experimental results on the ACE 2005 Chinese corpus. Zhou et al. [27] proposed a phrase kernel based on sensitive context information. The method can automatically retrieve the information of the sensitive context trees of sentences. Then, they proposed the convolution tree kernel of the sensitive context information to classify the entity relations. Zhou et al. [28] proposed a novel tree kernel-based method. First, they constructed rich semantic relation trees and then proposed a context-sensitive convolution tree kernel for extracting entity relations. The results show that this method outperformed other state-of-the-art methods on the ACE Relation Detection and Characterization (RDC) corpora. Chun et al. [29] proposed a mixed kernel function to compute the similarities of two relations. The kernel function considered the phrase structures in the convolution kernel and the predicates in decision models. Li et al. [30] designed a distributed system to extract Chinese entity relations. They constructed six distributed base learners by combining Zhou's convolution tree kernels and entity feature kernels. Then, three communication rules among these learners were proposed to extract the entity relations. The experiments were performed on the ACE RDC2005 Chinese corpus.

In summary, these studies target specialized corpora that contain abundant entity information and features.

B. DTK ALGORITHM
To extract personae social relations from the ACE corpus and Korean news, Choi and Kim [31] proposed the DTK algorithm based on the SVM. They divided the relation extraction process into two phases. In the first phase, sentences that contain relations are selected. In the second phase, the relation names are identified. The DTK algorithm transforms a sentence into a set of dependency trigrams. Given a sentence S, the dependency tree of S is shown in Fig. 1. Given three words $w_i$, $w_j$, and $w_k$ in the sentence, $w_i \rightarrow w_k$ indicates that word $w_i$ has a dependency relation with $w_k$, and $w_j \rightarrow w_k$ indicates that word $w_j$ has a dependency relation with $w_k$. Here, $w_k$ is the common word of the two dependency relations. We denote $w_i \rightarrow w_k \leftarrow w_j$ as a dependency trigram. The dependency trigram set $S_T$ of a sentence is defined in Eq. (1). Choi and Kim considered the literal meaning, syntax, and part of speech of words to design the similarity function $s(\cdot)$ in Eq. (2), where
• θ is the number of features (such as word literal meaning, syntax, and part of speech) of words in sentences.
• the qth feature is assigned its own weight factor.

According to the dependency trees of sentences, they designed kernel functions to select sentences that contain social relations and to identify the names of social relations. The number of dependency trigrams in sentence A is assumed to be less than that in B. The kernel function in Eq. (3) is used to select sentences, where
• A and B represent two sentences. $A_T^i$ and $B_T^j$ are the dependency trigrams in sentences A and B, respectively.
• n and m are the numbers of dependency trigrams in sentences A and B.

K(A, B) is a similarity measure function between two dependency trees based on their dependency trigrams. The dependency trigrams are the core components used to calculate the similarity. These components contain various features of sentences, such as word literal meaning, syntax, and part of speech. K(A, B) is used to extract the relations in sentence B, with the relations in the case sentence A serving as templates.
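The construction of dependency trigrams described above can be illustrated with a minimal sketch. It assumes that the dependency tree is available as a flat list of (dependent, head) arcs and that each arrow points from a dependent to its head; both the input format and the function name `dependency_trigrams` are hypothetical choices made for illustration, not part of the DTK algorithm in [31].

```python
from collections import defaultdict
from itertools import combinations


def dependency_trigrams(arcs):
    """Return trigrams (w_i, w_k, w_j) such that w_i -> w_k and w_j -> w_k.

    `arcs` is an iterable of (dependent, head) pairs. This flat arc format is
    an assumption of the sketch, not the output format of a specific parser.
    """
    dependents = defaultdict(list)
    for dependent, head in arcs:
        dependents[head].append(dependent)

    trigrams = []
    for head, deps in dependents.items():
        # Every unordered pair of dependents sharing a head forms one trigram.
        for left, right in combinations(deps, 2):
            trigrams.append((left, head, right))
    return trigrams


# Toy arcs for "President Xi meets Kerry today" (hand-written, not parser output).
arcs = [("Xi", "meets"), ("Kerry", "meets"), ("today", "meets"), ("President", "Xi")]
trigrams = dependency_trigrams(arcs)
```

In this toy example the word "meets" has three dependents and therefore yields three trigrams, whereas "Xi" has a single dependent and yields none.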
The kernel function in Eq. (3) considers all dependency trigrams in two sentences to determine whether a new sentence contains relations. The dependency trigrams of a sentence can be expressed by the three kinds of relations in Eq. (4), in each of which $w_c$ and $w_p$ are the child and parent nodes of the entity word $w_k$, respectively. These three kinds of relations indicate that the keywords describing the names of relations usually appear around the entity words. The DTK algorithm finds the relation name using the kernel function in Eq. (5) and the dependency trigrams of two different sentences.

C. WORD SIMILARITY
The TF-IDF approaches [32], which are based on large-scale text corpora, are widely used to determine the statistical similarity between documents. However, these approaches are considerably limited when used to calculate the similarity of microblog posts, in which each message contains up to 140 characters. With the help of a knowledge base, several approaches [33]-[35] for processing microblog posts expand the semantic meanings of words to reduce these limits. HowNet [36] is a detailed semantic knowledge base in which each word is represented by a number of component sememes, giving a multidimensional description of the word. For example, Keyboard is composed of three sememes: Component, Computer, and MusicTool; Relationships is composed of three sememes: attribute, relatedness, and human. The sememes at different description levels are not of equal importance. Complex relations exist among the sememes, and a special language is needed to describe these relations.
With the sememes of words, we can calculate the distance or the similarity between two words. The range of the distance between two words is [0, ∞). The smaller the similarity is, the greater the distance will be. The distance and the similarity between two words satisfy the following relations:
• When the distance between two words is 0, the similarity is 1.
• When the distance between two words is ∞, the similarity is 0.
• The greater the similarity between two words is, the smaller the distance will be, and vice versa.
• Given two words $W_1$ and $W_2$, their similarity is denoted as $Sim(W_1, W_2)$, and the distance between them is denoted as $Dis(W_1, W_2)$ [37], [38]. The relation between the distance and the similarity can be represented by Eq. (6):

$$Sim(W_1, W_2) = \frac{\alpha}{Dis(W_1, W_2) + \alpha} \tag{6}$$

where α is an adjustable parameter equal to the distance between $W_1$ and $W_2$ at which their similarity is 0.5.
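The conversion between distance and similarity can be sketched in a few lines. The sketch assumes the commonly used form Sim = α / (Dis + α), which satisfies all the properties listed above (similarity 1 at distance 0, similarity 0 at infinite distance, and similarity 0.5 at distance α). The function names are hypothetical, and no default value for α is assumed because the text does not state one.

```python
import math


def similarity_from_distance(distance, alpha):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    Assumes Sim = alpha / (Dis + alpha); alpha is the distance at which the
    similarity equals 0.5.
    """
    if distance < 0 or alpha <= 0:
        raise ValueError("distance must be >= 0 and alpha must be > 0")
    if math.isinf(distance):
        return 0.0
    return alpha / (distance + alpha)


def distance_from_similarity(similarity, alpha):
    """Inverse mapping: Dis = alpha * (1 - Sim) / Sim."""
    if not 0.0 <= similarity <= 1.0 or alpha <= 0:
        raise ValueError("similarity must be in [0, 1] and alpha must be > 0")
    if similarity == 0.0:
        return math.inf
    return alpha * (1.0 - similarity) / similarity
```

Because the mapping is strictly decreasing in the distance, ranking word pairs by similarity or by distance gives the same order, so either quantity can be stored.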

III. PERSONAE SOCIAL RELATION EXTRACTION
In this section, we introduce our approach for mining personae social relations from microblog posts. Our proposed approach is divided into three parts: mining the personae social relations from the microblog posts, extracting the relation feature words, and classifying the relation FWs.

A. NDTK APPROACH
The personae social relations in microblog posts are difficult to find by directly using the original DTK algorithm [31]. This limitation is caused by two factors. To make DTK suitable for handling microblog posts, we improve the similarity function among dependency trigram sets and propose a new function to measure word semantics and syntax weight factors. First, we utilize HowNet to calculate the word semantic similarities in dependency trigrams. Second, we propose (POS, GR) pairs to represent the syntactic features of words. The semantic similarity of two dependency trigrams is given by Eq. (7), where
• $Sim(w_1, w_2)$ is the word semantic similarity, which is taken from HowNet.
• α and β (Eq. (8)) are the weights of the similarity of the left and right words and of the center words in dependency trigrams, respectively.

We consider that the center words are the verbs of the relations, while the left and right words are the entity nouns for these relations. The weight β is larger than α because the verb of a relation plays a key role in computing the similarity. The syntax similarity of two dependency trigrams is given by Eq. (9), which is built on the POS and GR features of Eq. (2). The POS may be an adjective, verb, or noun. GR indicates whether a word is an object, subject, or predicate. The POS and GR of a word are constant in a sentence, so we consider the POS and GR as a whole and represent them as (POS, GR) pairs. We consider that the contribution of (POS, GR) to the sentence similarity depends on the frequencies of the left, center, and right words of the dependency trigrams. In Eq. (9),
• $Syn(X, Y)$ indicates the probability that X appears in Y.
• α and β have the same values as in Eq. (7).

Eqs. (7) and (9) are the semantic similarity and the syntax similarity, respectively, of two dependency trigrams. We balance their contributions to the sentence similarity using the information entropy and the mutual information entropy. The information entropy of a word indicates its information capacity. The higher the information capacity of a word is, the higher the similarity contribution of its semantic meaning will be. The mutual information entropy of the (POS, GR) pairs of words indicates their closeness. The higher the mutual information entropy of (POS, GR) is, the closer the (POS, GR) pairs of the words will be, and the larger the similarity contribution of the syntactic features of the words will be. Hence, we integrate the semantic and syntactic features into a novel similarity of dependency trigrams using Eq. (10).
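The position-weighted comparison of two dependency trigrams can be sketched as follows. This is only an illustration under stated assumptions: the similarity is taken as a weighted sum of the word similarities at the left, center, and right positions, with weight α for the left and right words and β for the center word, β > α, and 2α + β = 1. The exact forms of Eqs. (7) and (8) may differ, and the function `exact_match` is a toy stand-in for the HowNet word similarity.

```python
def trigram_semantic_similarity(tri_a, tri_b, word_sim, alpha=0.25, beta=0.5):
    """Weighted similarity of two trigrams given as (left, center, right).

    alpha weights the left and right (entity) words, beta the center (verb)
    word. The weighted-sum form and the default weights are assumptions.
    """
    if not beta > alpha > 0:
        raise ValueError("expected beta > alpha > 0")
    left_a, center_a, right_a = tri_a
    left_b, center_b, right_b = tri_b
    return (alpha * word_sim(left_a, left_b)
            + beta * word_sim(center_a, center_b)
            + alpha * word_sim(right_a, right_b))


def exact_match(word_a, word_b):
    """Toy word similarity: 1.0 for identical words, 0.0 otherwise."""
    return 1.0 if word_a == word_b else 0.0


same_verb = trigram_semantic_similarity(
    ("Xi", "meets", "Kerry"), ("Obama", "meets", "Merkel"), exact_match)
```

With these toy weights, two trigrams that share only the center verb already reach a similarity of 0.5, which reflects the larger weight given to the verb of the relation.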
In its detailed implementation, the NDTK approach includes two parts: extracting personae social relations together with the relation feature words, and classifying the relation FWs.

B. EXTRACT PERSONAE SOCIAL RELATION AND THE RELATION FEATURE WORDS
In microblog posts on the microblogging platform, numerous relationships among persons exist. However, the types of these relations are limited. Using Li's concept [30], we design a basic learner for each type of relation. For convenience of discussion, we consider only four types of personae relations in microblogs, namely, Work, Family, Friend, and Enemy.
In traditional methods, the FWs are extracted by analyzing the syntactic structure of words in a sentence. However, the sentence structure is complex and fuzzy, so traditional methods are complicated and cannot accurately determine all correct FWs. Therefore, we utilize NDTK to extract the relation FWs between two entities. These kernel words can represent relation FWs for further classification. For example, for the sentence 'President Jinping Xi meets with U.S. Secretary of State Kerry today', we use NDTK to extract the relation FW 'meet' between the two person entities 'Jinping Xi' and 'Kerry'. We call the word 'meet' an NDTK relation FW. Then, we design four learners using these FWs for the four relation classifications.
Relation FWs are candidates for depicting personae relations. Emotion analysis and classification [39], [40] utilize an emotion dictionary to construct a model of emotion detection or classification. Inspired by these approaches, we first manually constructed the initial relation dictionary of relation-describing words and then used the relation dictionary to classify them. Finally, we expanded the relation dictionary using the chi-square test, mutual information (MI), and HowNet. To construct the dictionary, we selected the standard relation words from the HowNet dictionary, and each relation type contained approximately 300 words. Given a sentence S, the dependence tree is extracted by utilizing a public platform named LTP [42]. The LTP dependence tree of the sample sentence 'US President Obama spoke to German Minister Merkel on June 28, local time, to discuss the Greek debt crisis' is shown in Fig. 2. In this dependency tree, 'place name' and 'person name' indicate entity types. We denote A0 and A1 as semantic roles. A0 represents the agent of the actions, and A1 represents the receiver of the actions. According to the order of appearance in a sentence, all agents of actions in a sentence are denoted as $\{A0_1, \cdots, A0_i\}$, and all receivers of actions as $\{A1_1, \cdots, A1_j\}$. $E_1$ and $E_2$ are two personae entities, and $w_1, \cdots, w_n$ are, in order, the terms other than entities. The positions of the entities that most likely hold a relation fall into three situations, and the FWs describing the relations can be summarized in seven rules (Table 2). In Table 2, FW is the set that contains all words describing relations obtained by using these rules. IR represents the person interaction relation word that is extracted by utilizing the NDTK algorithm. A0 and A1 are semantic roles in the LTP dependency tree. The function minDis(X, Y) returns the words in set Y that are nearest to X in a sentence.
• Rule1: If $E_1$ is an agent of actions and $E_2$ is a receiver of actions, then all words between $E_1$ and $E_2$ and all relation words are relation feature words.
• Rule2: If words exist between $E_1$ and $E_2$, and $E_1$ and $E_2$ are both agents of actions, then all agent-of-action words and all relation words, except $E_1$ and $E_2$, are relation feature words.
• Rule3: If words exist between $E_1$ and $E_2$, and $E_1$ and $E_2$ are both receivers of actions, then all receiver-of-action words and all relation words, except $E_1$ and $E_2$, are relation feature words.
• Rule4: If words exist between $E_1$ and $E_2$, and $E_1$ and $E_2$ are neither agents nor receivers of actions, then all relation words and the receiver-of-action words that are at a minimum distance from the relation words are relation feature words.
• Rule5: If no words exist between $E_1$ and $E_2$, and all relation words lie on the left side of the agent of actions $E_1$, then all relation words and the receiver-of-action words that are at a minimum distance from $E_1$ are relation feature words.
• Rule6: If no words exist between $E_1$ and $E_2$, and all relation words lie on the right side of the receiver of actions $E_2$, then all relation words and the receiver-of-action words that are at a minimum distance from $E_2$ are relation feature words.
• Rule7: If the relation word of a sentence is 'is', then the relation feature words include all nouns of the agent of actions and the receiver of actions that are at a minimum distance from 'is'.

For example, consider the sentence 'US President Obama spoke to German Minister Merkel on June 28, local time, to discuss the Greek debt crisis' in Fig. 2. According to its structure, we use the first rule to obtain FW. The IR is 'spoke', and it belongs to A1, so that FW = {spoke, discuss, Greek, debt, crisis}.

The detailed steps of extracting personae social relations and the relation FWs are as follows:
• Step 1. Expanding the relation dictionary by using the chi-square test, mutual information (MI), and HowNet [41].
• Step 2. Dividing the microblog posts into sentences and the sentences into words, and constructing the dependence trees of the sentences using the LTP tool [42].
• Step 3. Extracting the dependency trigrams from the different sentences using DTK approaches.
• Step 5. Extracting the relation FWs between two entities by using these seven rules in Table 2.
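Two building blocks of the rules above, the helper minDis and Rule1, can be sketched over token positions. The sketch assumes that a sentence is a list of tokens, that entities and candidate words are identified by their indices, and that the caller has already verified the semantic-role condition of Rule1 ($E_1$ is an agent and $E_2$ is a receiver). The names `min_dis` and `rule1_feature_words` are hypothetical, and distance is measured in token positions, which is an assumption.

```python
def min_dis(anchor_index, candidate_indices):
    """Return the candidate positions closest to anchor_index (ties are kept)."""
    if not candidate_indices:
        return []
    best = min(abs(i - anchor_index) for i in candidate_indices)
    return [i for i in candidate_indices if abs(i - anchor_index) == best]


def rule1_feature_words(tokens, e1_index, e2_index, relation_words):
    """Rule1 sketch: words strictly between E1 and E2 plus all relation words.

    The semantic-role check (E1 is A0, E2 is A1) is assumed to be done by the
    caller; relation_words stands for the IR words found by NDTK.
    """
    lo, hi = sorted((e1_index, e2_index))
    feature_words = list(tokens[lo + 1:hi])
    for word in relation_words:
        if word not in feature_words:
            feature_words.append(word)
    return feature_words


tokens = ["Xi", "meets", "Kerry", "today"]
fw = rule1_feature_words(tokens, 0, 2, ["meets"])
```

Keeping ties in `min_dis` mirrors the description of minDis as returning a word set rather than a single word.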

C. CLASSIFYING RELATION FWs
In the real world, many relations exist among persons. We manually construct dictionaries with many FWs to describe these relationships. We use $D_j = \{d^j_1, \cdots, d^j_k, \cdots, d^j_m\}$ to represent the jth relation FW dictionary, where $d^j_k$ is the kth relation FW in the jth dictionary. $FW_i = \{f^i_1, \cdots, f^i_q, \cdots, f^i_n\}$ is the FW set that describes the relations and is extracted from the ith microblog post by using the above rules, $f^i_q$ is the qth word describing a relation in the ith microblog post, and n is the number of relation FWs in the ith microblog post. We can construct a similarity matrix between $FW_i$ and $D_j$ as follows:

$$M^{ij}_{m \times n} = \begin{bmatrix} Sim(f^i_1, d^j_1) & \cdots & Sim(f^i_n, d^j_1) \\ \vdots & \ddots & \vdots \\ Sim(f^i_1, d^j_m) & \cdots & Sim(f^i_n, d^j_m) \end{bmatrix}$$

Each element $Sim(f^i_q, d^j_k)$ of the matrix represents the semantic similarity [37] between the word $d^j_k$ and the word $f^i_q$. The main steps of the RFWCA are described as follows:
• In steps 9∼15 of the RFWCA, we first compute $|FW_i \cap D_j|$. This value is the number of relation FWs of microblog post i that are included in the relation FW dictionary $D_j$. Then, we compute the maximum of this number over the relation FW dictionaries.
• In steps 16∼20 of the RFWCA, if $|FW_i \cap D_j|$ attains the maximum for only one dictionary j, then $C(FW_i)$ is returned, and the algorithm stops. Otherwise, several dictionaries share the maximum, and the semantic and syntactic similarities between $FW_i$ and $D_j$ are calculated.
• In step 23 of the RFWCA, we extract the syntactic features of $FW_i = \{f^i_1, f^i_2, \cdots, f^i_n\}$, such as POS, GR, semantic role, child nodes, and parent nodes.
• In steps 24∼27 of the RFWCA, we obtain the greatest value vectors, replace the corresponding words in $FW_i = \{f^i_1, f^i_2, \cdots, f^i_n\}$ and the words in $D^j_{max} = \{d^j_1, \cdots, d^j_n\}$ with syntactic feature words, and then reconstruct a new dependency tree for extracting syntax features.
• Step 28 of the RFWCA compares $FW_i$ with $D^j_{max}$ using Eqs. (13) and (14), where k represents the number of features and $d^j_{qp}$ represents the pth feature of $d^j_q$, such as POS, GR, children, or parent.
• According to the dictionary similarity matrix $M^{ij}_{m \times n}$, step 29 of the RFWCA uses the greatest element of the kth row of the similarity matrix, which corresponds to $d^j_k$, to calculate the semantic similarity $Sem(FW_i, D_j)$.
• Steps 30∼34 of the RFWCA select the maximum similarity score between the words in the relation dictionaries and the relation feature words as the classification result, where the score of dictionary $D_j$ is denoted as $Score_j$.
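The two-stage decision of the RFWCA can be sketched in simplified form. The first stage counts the overlap between the feature words and each dictionary and returns immediately when one dictionary is the unique maximum. The second stage, used only to break ties, is reduced here to a semantic score computed as the mean of the row maxima of the similarity matrix. That aggregation, the omission of the syntactic similarity, and all function names are assumptions of this sketch and not the RFWCA as specified in the paper.

```python
def row_max_semantic_score(feature_words, dictionary_words, word_sim):
    """Mean over dictionary words of their best similarity to any feature word."""
    row_maxima = [
        max((word_sim(f, d) for f in feature_words), default=0.0)
        for d in dictionary_words
    ]
    return sum(row_maxima) / len(row_maxima) if row_maxima else 0.0


def classify_feature_words(feature_words, dictionaries, word_sim):
    """Return the relation type whose dictionary best matches feature_words.

    dictionaries maps a relation type (e.g. "Work") to a collection of words.
    """
    if not dictionaries:
        raise ValueError("at least one relation dictionary is required")
    fw = set(feature_words)
    overlaps = {label: len(fw & set(words)) for label, words in dictionaries.items()}
    best = max(overlaps.values())
    tied = [label for label, count in overlaps.items() if count == best]
    if len(tied) == 1:
        return tied[0]  # unique maximum overlap: no similarity needed
    # Several dictionaries share the maximum: fall back to the semantic score.
    return max(tied, key=lambda label: row_max_semantic_score(
        fw, dictionaries[label], word_sim))


def toy_sim(a, b):
    """Hand-written toy similarity standing in for HowNet."""
    if a == b:
        return 1.0
    return 0.9 if {a, b} == {"colleague", "coworker"} else 0.0


dictionaries = {"Work": {"meet", "discuss", "coworker"}, "Family": {"marry", "son"}}
label = classify_feature_words(["meet", "discuss", "crisis"], dictionaries, toy_sim)
```

In the toy data, the feature words share two entries with the Work dictionary and none with the Family dictionary, so the first stage decides without computing any similarity.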

IV. EXPERIMENTS AND RESULTS
NDTK is developed on the basis of DTK. The rule-based, feature-based, and kernel-based approaches designed for long texts are not directly comparable with NDTK. Therefore, we choose only DTK as the baseline for our proposed NDTK approach. In our experiment, our algorithms run on a cluster of four computers. Every computer has an Intel(R) Core(TM) i5-3230M CPU @ 2.60 GHz, 4.00 GB of memory, a 1 TB hard disk, the Windows 7 OS, and the Hadoop distributed system. The four initial dictionaries (friend, work, family, and enemy) have approximately 1,000 relation feature words. We describe our experimental flow in Fig. 3. It includes the following parts: crawling microblog posts, constructing the initial dictionary, constructing dependence trees, extracting the dependency trigrams, choosing the dependency trigrams, extracting the relation FWs, classifying the relation FWs, and constructing the knowledge graph.

A. DATASETS
To evaluate our proposed NDTK approach for mining deep personae social relations and our proposed RFWCA for classifying relation FWs, we crawled real data about several person-related topics from the TenCent and Sina microblog platforms. A total of 110,000 original microblog posts (including 100,000 normal microblogs and 13,000 topic microblog posts) were downloaded. We selected 6,968 topic microblog posts and 11,088 normal microblogs as our experimental dataset. We refer to the microblog posts that do not belong to the five topics as normal microblog posts. The topics of the microblog posts were related to people, so they contained more person entities. Table 3 shows the six categories (the five topics and the normal posts) and the numbers of microblog posts in the crawled dataset.
In Table 3, we divided these microblog posts into two parts, namely, topic and normal posts. Topic posts, which mostly described person relations, covered five topics: microblog news, talks, cooperations, politicians, and famous stars. The normal posts without topics were people's individual thoughts in microblogs. The topic posts clearly contained more personae relations, and these relations were easier to extract than those in normal posts.

B. EVALUATION CRITERION
In this paper, we divided the relation extraction process into two phases. In the first phase, personae social relations were extracted using our proposed NDTK algorithm. In the second phase, the relation feature word sets (FWs) were extracted and classified into different relation types using the RFWCA.

1) NDTK APPROACH
Three indices, P (precision), R (recall), and F (F-measure) [43], were adopted to measure the performance of our proposed NDTK and the original DTK approach:

$$P = \frac{NCE}{NE}, \quad R = \frac{NCE}{NCT}, \quad F = \frac{2 \times P \times R}{P + R}$$

where NCE is the number of correct personae social relations extracted from the microblog posts in the test datasets, NE is the number of personae social relations extracted from the microblog posts in the test datasets, and NCT is the number of correct personae social relations in the test datasets.
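The three indices can be computed directly from the counts NCE, NE, and NCT. The following is a minimal sketch assuming the standard definitions of precision, recall, and the balanced F-measure. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for the sketch.

```python
def precision_recall_f(nce, ne, nct):
    """Return (P, R, F) from the three counts.

    nce: number of correctly extracted relations
    ne:  number of extracted relations
    nct: number of correct relations in the test data
    """
    precision = nce / ne if ne else 0.0
    recall = nce / nct if nct else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy counts: 3 correct out of 4 extracted, with 6 correct relations in total.
p, r, f = precision_recall_f(3, 4, 6)
```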

2) RFWCA
In this subsection, we provide several measurement criteria for our proposed rules for extracting the relation FWs and for the classification of FWs into different types by the RFWCA. We adopt the correct rate FWC of FW to measure the performance of the NDTK algorithm.
In Eq. (18), CKWords represents the correct word set describing the relations. Eq. (18) indicates that if half of the words in $FW_i$ are correct, then the whole relation FW set is regarded as correct.
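The half-correct criterion can be sketched as follows. Because the exact form of Eq. (18) is not reproduced here, the sketch makes two assumptions: a feature word set counts as correct when at least half of its distinct words belong to CKWords, and FWC is the fraction of extracted sets judged correct. The function names are hypothetical.

```python
def is_fw_set_correct(feature_words, correct_words):
    """True if at least half of the distinct feature words are in correct_words."""
    fw = set(feature_words)
    if not fw:
        return False
    return len(fw & set(correct_words)) / len(fw) >= 0.5


def fw_correct_rate(extracted_sets, correct_sets):
    """Fraction of posts whose extracted FW set is judged correct."""
    pairs = list(zip(extracted_sets, correct_sets))
    if not pairs:
        return 0.0
    return sum(is_fw_set_correct(fw, ck) for fw, ck in pairs) / len(pairs)


rate = fw_correct_rate(
    [["meet", "today"], ["a", "b", "c"]],   # extracted FW sets
    [["meet"], ["a"]],                      # CKWords for the same posts
)
```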
We adopt the weighted average precision $P_{Avg}$, recall $R_{Avg}$, and F-value $F_{Avg}$ to evaluate the performance of the RFWCA, where $P_j$ is the precision of relation type j, $R_j$ is the recall rate of relation type j, $F_j$ is the F-value of relation type j, and $NumC_j$ is the number of relation instances classified in type j.
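A weighted average of per-type scores can be sketched as below. The sketch assumes that each relation type j is weighted by $NumC_j$, so that, for example, $P_{Avg} = \sum_j NumC_j P_j / \sum_j NumC_j$; the same function applies to recall and F-value. The weighting scheme is inferred from the definitions above and the function name is hypothetical.

```python
def weighted_average(values, counts):
    """Average of per-type values weighted by the instance count of each type."""
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(v * c for v, c in zip(values, counts)) / total


# Toy per-type precisions with the numbers of instances classified in each type.
p_avg = weighted_average([0.8, 0.6], [30, 10])
```

Weighting by instance counts prevents a relation type with few classified instances from dominating the average.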

C. RESULT ANALYSIS

1) EVALUATION OF NDTK APPROACH
In this subsection, we compare our improved NDTK approach with the original DTK approach on two aspects: relation sentence selection and personae social relation extraction. We report the P, R, and F values of the NDTK and DTK approaches for microblog news, talks, cooperations, politicians, famous stars, and normal posts.
Figs. 4-6 show the sentence selection performance of the NDTK and original DTK approaches. In Fig. 4, the precision of NDTK's sentence selection is higher than that of the original DTK approach for microblog news, talks, famous stars, and normal posts but lower for politicians and cooperations. The lower precision may be caused by the smaller number of samples in these two topics. The average precisions of sentence selection for the NDTK and original DTK approaches are 77.40% and 76.00%, respectively, which indicates that NDTK selects relation sentences slightly more precisely on average. In Fig. 5, the recall of NDTK's sentence selection is generally higher than that of the original DTK approach, except for famous stars. The average recalls of sentence selection for the NDTK and original DTK approaches are 76.36% and 74.56%, respectively. The F-measure shows the same tendency as the precision. In Fig. 6, the average Fs of sentence selection for the NDTK and original DTK approaches are 0.78 and 0.75, respectively.
In Fig. 7, the precision of NDTK's personae social relation extraction is higher than that of the original DTK approach in all six categories: microblog news, talks, famous stars, politicians, cooperations, and normal posts. The average precisions of personae social relation extraction for the NDTK and original DTK approaches are 77.23% and 66.90%, respectively, an improvement of approximately 10 percentage points. This result indicates that the NDTK approach outperforms the original DTK approach in personae social relation extraction. In Fig. 8, the recall of NDTK's personae social relation extraction is generally higher than that of the original DTK approach. The average recalls for the NDTK and DTK approaches are 73.40% and 64.72%, respectively, an improvement of approximately 9 percentage points. The F-measure in Fig. 9 shows the same tendency as the precision and recall. The average Fs of personae social relation extraction for the NDTK and original DTK approaches are 0.75 and 0.64, respectively.
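The reported improvements are differences between averages expressed in percent, so it helps to separate the absolute gain (in percentage points) from the relative gain. The short sketch below recomputes both from the average precision and recall values quoted in this subsection; the helper name `gains` is hypothetical.

```python
def gains(new_pct: float, old_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - old_pct
    relative = 100.0 * absolute / old_pct
    return absolute, relative


# Average precision and recall (in %) as reported in the text: (NDTK, DTK).
reported = {
    "sentence selection P": (77.40, 76.00),
    "sentence selection R": (76.36, 74.56),
    "relation extraction P": (77.23, 66.90),
    "relation extraction R": (73.40, 64.72),
}
for name, (ndtk, dtk) in reported.items():
    absolute, relative = gains(ndtk, dtk)
    print(f"{name}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

For relation extraction, the absolute gains are about 10.3 points in precision and 8.7 points in recall, which matches the rounded figures of approximately 10 and 9 quoted above.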

Eq. (18) determines whether the words are correct or incorrect. Fig. 10 shows the FW correct rates in the different topics. After retrieving all relation FWs from the topic microblog posts, we classify these relations into four types: friend, enemy, work, and family. To measure the performance of RFWCA, we build two datasets. The first (denoted WRDW) is the dataset processed with relation feature words. The second (denoted NRDW) is the dataset processed without relation feature words. WRDW and NRDW are each input directly into RFWCA, and we then compute $P_{Avg}$, $R_{Avg}$, and $F_{Avg}$. All relation words retrieved from all relation

D. CLASSIFICATION OF RELATION GRAPHS
Using our improved NDTK and proposed RFWCA algorithms, we construct a knowledge graph of personae social relations for five topics (microblog news, cooperations, talks, politicians, and famous stars) of Chinese microblog posts. We build the visual interactive interface for the knowledge graph with NodeXL [44], [45]. Fig. 14 is a fragment of the social relation graph without RFWCA. It includes the social relations of approximately 600 personae because the personae social relations cover a wide range of topics. Two person entities are often connected by many relations, some of which are repetitive. Hence, the structure of the knowledge graph is complex, the different relation types are difficult to distinguish, and we cannot mark the relation types in Fig. 14.
We provide a simplified knowledge graph in Fig. 15 by using our improved NDTK and our proposed RFWCA. Fig. 15 includes the same 600 personae but removes redundant relationships, which makes the structure of the knowledge graph clear. The personae social relations between two entities can be distinguished relatively easily. In Fig. 15, we assign different colors to the relation types as follows: work, blue; family, brown; friend, green; and enemy, red. Fig. 15 shows the entities and the relations between them more clearly than Fig. 14.
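The simplification from Fig. 14 to Fig. 15 can be illustrated with a small sketch that collapses repeated relations between the same two persons into one typed, colored edge. The color mapping is the one stated above. Everything else is an assumption made for illustration: the function name, the example names, and in particular the majority vote over relation types, which merely stands in for the classification that RFWCA performs and is not the method of this paper.

```python
from collections import Counter

# Color assignment as stated in the text for Fig. 15.
RELATION_COLORS = {"work": "blue", "family": "brown", "friend": "green", "enemy": "red"}


def simplify_relations(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str, str]]:
    """Collapse repeated (person, person, type) triples into one edge per pair.

    Assumption: the edge type is the most frequent type observed for the pair
    (a stand-in for RFWCA). Pairs are treated as undirected.
    """
    votes: dict[tuple[str, str], Counter] = {}
    for a, b, rel in triples:
        pair = tuple(sorted((a, b)))
        votes.setdefault(pair, Counter())[rel] += 1
    edges = []
    for pair, counter in sorted(votes.items()):
        rel, _ = counter.most_common(1)[0]
        edges.append((pair[0], pair[1], rel, RELATION_COLORS[rel]))
    return edges


# Hypothetical extracted relations with repetitions.
triples = [
    ("Li", "Wang", "work"),
    ("Wang", "Li", "work"),
    ("Li", "Wang", "friend"),
    ("Zhao", "Li", "family"),
]
for edge in simplify_relations(triples):
    print(edge)
```

The resulting edge list, with one color attribute per edge, is the kind of input that a graph visualization tool can render directly.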

V. CONCLUSION AND FUTURE WORK
In this paper, we take microblog posts as an example to tentatively study relation extraction from short text. Our conclusions are as follows:
• Building on the original DTK approach, we propose the NDTK algorithm and seven novel rules for extracting the relation FWs.
• We propose an FW classification algorithm that classifies FWs into different relation types, such as work, family, friend, and enemy.
• We experimentally evaluate the rules, our improved NDTK algorithm, and our proposed RFWCA. The experimental results demonstrate good performance for our approaches; for example, NDTK raises the average precision of personae social relation extraction from 66.90% to 77.23% relative to the original DTK.
In the future, we will extract personae social relations from microblog posts on more topics and construct their knowledge graph.
YAJUN DU received the Ph.D. degree in traffic information engineering and control from the School of Computer and Communicate, Southwest Jiaotong University, in 2005. He is currently a Professor of computer science with Xihua University. He has published several articles on information retrieval, search engines, focused crawlers, and knowledge graphs. He serves on committees of Chinese information and as a PC member for several leading international conferences, including WMSE, ICIC, CCIR, CCKS, and SMP. His experience and research work focus on information retrieval, software engineering, search engines, web mining, and computer networks.
FANGHONG SU received the bachelor's degree in computer science and technology from the School of Computer and Software Engineering, Xihua University, in 2002. He is currently working with Sichuan Lewei Technology Company, Ltd. His experience and research work focus on knowledge graphs.
ANZHENG YANG received the M.S. degree in computer science and technology from the School of Computer and Software Engineering, Xihua University, in 2016. He has published two articles on information retrieval, search engines, focused crawlers, and knowledge graphs. His experience and research work focus on social networks and software engineering.
XIANYONG LI received the D.S. degree in computer science from Chongqing University, in 2014. He is currently a Lecturer of computer science with Xihua University. He has published more than ten academic articles in journals. His research interests include information retrieval, web mining, fault-tolerant computing, interconnection networks, and graph theory.
YONGQUAN FAN received the D.S. degree in traffic information engineering and control from SWJTU, in 2010. He is currently an Assistant Professor of software engineering with Xihua University. He has published several articles. His experience and research work focus on information retrieval, information filtering, search engines, and web mining.
VOLUME 8, 2020