Research on Knowledge Representation and Automatic Recognition of Dynamic Words for Chinese Automatic Syntactic Analysis

Chinese sentences contain many temporarily constructed dynamic words. Dynamic words are sentence building units that are not included in the general lexicon and are not suitable for further syntactic analysis. Recognizing and analyzing them automatically plays an important role in improving the efficiency and accuracy of Chinese automatic syntactic analysis. Existing research on dynamic words focuses mainly on the qualitative description of concepts and categories; there has been no overall algorithm design or experimental exploration of their automatic recognition. In the practice of automatic syntactic analysis, dynamic words are generally segmented and their components analyzed syntactically, while the recognition and analysis of the dynamic word as a whole is ignored. In this study, the dynamic word is separated from syntactic analysis, treated as part of lexical analysis, and recognized and analyzed as a whole. Using the method of knowledge engineering, this paper studies dynamic words for Chinese automatic syntactic analysis based on sentence pattern structure. It first designs a knowledge representation method for dynamic words, then constructs a knowledge base of dynamic word structural modes by annotating the dynamic words in a corpus of international Chinese textbooks, and finally explores automatic recognition methods based on regular expressions, semantic category combinations and machine learning classification algorithms. The experimental results show that the three algorithms can cover the recognition of all types of dynamic words and achieve relatively satisfactory accuracy and recall.


I. INTRODUCTION
Chinese sentences contain many temporarily constructed words and intermediate-state combinations between words and phrases [1], [2]. In linguistics and Chinese information processing, theoretical research and application practice generally focus on the definite words included in normative lexica; less attention is paid to dynamic words not included in the lexicon, and systematic research on them is rarer still. Dynamic words are sentence building units that are not included in the general lexicon and are not suitable for further syntactic analysis [3]- [5]. The typical types of dynamic words include temporary nouns ('' [wu mi er](five meters two)''), etc [6]- [9]. The dynamic word is an objective language unit and an important object to be defined and handled in Chinese lexical or syntactic analysis. A comprehensive and in-depth study of dynamic words is of great significance not only to the theoretical research of Chinese morphology and Chinese teaching, but also to the research and practice of Chinese information processing.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Dong [10] divides vocabulary knowledge into two parts, the lexicon and morphology, and strictly distinguishes the two concepts. Based on the modern Chinese word segmentation vocabulary for information processing, the research focuses on the lexical patterns with high productivity and the strong structural types and main semantic patterns of compound words. Under the guidance of Hierarchical Network of Concepts, Tang [11] combines corpus extraction and deduction in material acquisition, and classifies and describes dynamic words according to the actual situation. He focuses on the rules of dynamic word combination patterns and on the role that a single-character knowledge base for information processing plays in dynamic word recognition. Song [12] pointed out that the boundary between words and phrases in Chinese is fuzzy, and that in many cases it is desirable for computers to expand the granularity of language units from words to phrases when processing Chinese. He further proposed the concept of simple phrases: from a practical point of view, combination types that are easily recognized automatically, share the same grammatical function and occur with high frequency are defined as simple phrases. Several types of simple phrases are recognized by human-computer combination (rough extraction rules composed of parts of speech or word forms are designed first, then the candidates are checked manually one by one to eliminate errors and correct the rules). These studies mainly describe the concept and category of dynamic words qualitatively.
None of the above studies fully addresses the automatic recognition of dynamic words in real text: they either do not involve it, describe only a knowledge base construction scheme for dynamic word recognition, or use human-computer combination to identify a few combination types by example. There is no overall algorithm design or experimental exploration of automatic recognition of dynamic words. Automatic syntactic analysis is one of the key links and difficulties in the basic research of Chinese information processing, and dynamic word recognition has a very important impact on its efficiency and accuracy. In the current mainstream syntactic analysis of Chinese information processing (phrase structure syntactic parsing and dependency syntactic parsing), word segmentation is generally based on the lexicon, and the result of word segmentation is used directly as the leaf nodes of the syntactic tree. The recognition and lexical analysis of dynamic words outside the lexicon are ignored, and problems that should be solved at the lexical level are pushed to the syntactic level. Because dynamic words are segmented, the syntactic tree becomes more complex, which increases the burden of syntactic analysis and seriously affects its efficiency and accuracy. This paper focuses on the knowledge representation and automatic recognition of dynamic words in Chinese text for automatic syntactic analysis based on sentence pattern structure [13]. The main contents are as follows: first, a knowledge representation method for dynamic words is designed; second, a knowledge base of dynamic word structural modes is constructed by annotating the dynamic words in a corpus of international Chinese textbooks; finally, on the basis of this knowledge base, automatic recognition methods for dynamic words are explored.
The innovative work of this paper is as follows: (1) designing a symbolic knowledge representation scheme of dynamic words; (2) constructing an effective and systematic knowledge base of dynamic word structural modes for Chinese information processing; (3) designing and realizing three automatic recognition algorithms of dynamic words based on regular expressions, semantic category combinations and machine learning classification algorithms.
The organizational structure of this paper is as follows: Chapter 1: introduction, which introduces the research background and significance, research status, research content and innovation, as well as the organizational structure of the article; Chapter 2: sentence pattern structure syntactic parsing, which introduces the syntactic analysis in Chinese information processing and the formalization, syntax and morphology of sentence pattern structure; Chapter 3: knowledge representation of dynamic words, which introduces the knowledge representation method of the dynamic word structural mode, corpus annotation and construction of the structural mode knowledge base; Chapter 4: automatic recognition of dynamic words, which introduces the automatic recognition algorithms and related experiments based on regular expressions, semantic category combinations and machine learning classification algorithms; Chapter 5: conclusion, which summarizes the research and looks forward to the future work.

II. SENTENCE PATTERN STRUCTURE SYNTACTIC PARSING
Sentence level analysis technology in natural language processing can be roughly divided into three levels: lexical analysis, syntactic analysis and semantic analysis. Syntactic analysis is the process of analyzing the input text sentence to get the syntactic structure of the sentence. The analysis of syntactic structure, on the one hand, is the need of language understanding. On the other hand, it also provides support for other natural language processing tasks. For example, syntax driven statistical machine translation requires syntactic analysis of the source language or the target language.
Dependency syntactic parsing recognizes the interdependence between words in sentences [17]. As shown in Fig. 1, the phrase structure emphasizes the phrase level: it builds the syntactic tree mainly through ''np'', ''vp'' and other functional nodes, and takes the structural relationship as an attribute of the node. The dependency structure dispenses with phrase nodes and directly describes the structural relationship between words (the dependency arc). Whether it produces a phrase structure or a dependency structure, the parser essentially analyzes the phrase-level hierarchy and syntactic relationships of the sentence.
Compared with Sentence Component Analysis, the common feature of phrase structure syntactic parsing and dependency syntactic parsing is the relativization and dualization of structure. Specifically, sentence components, as the basis of structure, are hidden and presented in the form of binary syntactic relations. The difference only lies in that the relationship of phrasal structure is built between phrases, while the dependency structure takes the center word as the representative of phrases. After the sentence components are hidden, the structural framework or sentence pattern features of the sentence are not as clear as those of Sentence Component Analysis. At the same time, a large number of sentence pattern rules and language knowledge which occupy an important position in the research and teaching of Chinese grammar cannot play their due role in the syntactic analysis of Chinese information processing. Therefore, it is necessary to explore the application of Sentence Component Analysis in automatic syntactic analysis of Chinese information processing [18]. Traditional grammar and structuralism grammar can complement each other on the basis of their respective advantages, and jointly promote the research of automatic syntactic analysis, so as to improve the level of computer understanding of Chinese.
From the perspective of sentence framework or sentence pattern, the method of Sentence Component Analysis analyzes the function of sentence components. Sentence components and the various structural relationships and structural centers formed by their combination belong to the specific sentence pattern structure. Therefore, the syntactic analysis method of Chinese information processing based on Sentence Component Analysis is also called sentence pattern structure syntactic parsing.

B. FORMALIZATION OF SENTENCE PATTERN STRUCTURE
In order to make the sentence pattern structure syntactic parsing accepted by computers, it is necessary to formalize the sentence pattern structure, that is, to design a set of standard symbol system and data structure, so as to establish a sentence pattern structure grammar system for Chinese information processing. In the history of Chinese grammar teaching and research, Sentence Component Analysis has a corresponding formal description scheme, that is, various types of ''diagrams''. The most famous is Li Jinxi's diagrammatic method [19], followed by the widely used ''adding line marking method''. The adding line method is a simple diagrammatic method, which cannot describe the complex structure level. Therefore, the formalization of sentence pattern structure is designed systematically based on Li Jinxi's diagrammatic method [20]- [24].
The design of the diagrammatic method includes two aspects: the diagrammatic style and the corresponding data structure. Considering the hierarchy and extensibility of syntactic structure, XML (Extensible Markup Language) is used as the data storage format. The diagrammatic style and the XML structure can be converted into each other without loss of information: the diagrammatic style can be encoded and saved as a hierarchical XML structure, and conversely the XML structure can be decoded back into the original diagrammatic style.

C. SYNTAX OF SENTENCE PATTERN STRUCTURE
There are seven types of sentence components: subject, predicate, object, attribute, adverbial, complement and independent. Function words do not directly act as sentence components, but occupy a specific ''function word position''. In XML structure, the sentence structure continuously embeds component nodes and function word position nodes according to certain rules until the part of speech node as the leaf node. The XML element tag set is shown in Table 1.
The most important idea of syntactic analysis reflected in the diagrammatic method is to separate the main components and additional components of a sentence by a long horizontal line, the so-called ''grasping the main components''. Among the sentence components, the subject, predicate and object are defined as the main components and are placed above the main horizontal line in the diagram; the attribute, adverbial and complement are defined as the additional components and are placed below it. The basic pattern of the Chinese sentence trunk is the ''subject-predicate-object'' structure. A sentence structure that does not break the basic pattern in the trunk and does not contain any additional components is called a ''basic sentence pattern''; its diagrammatic style and XML structure are shown in Fig. 2. A sentence structure that does not break the basic pattern in the trunk, but has additional components or structural factors such as double objects, juxtaposition, apposition and independent components, is called an ''extended sentence pattern''; its diagrammatic style and XML structure are shown in Fig. 3. A sentence structure that breaks the basic pattern in the trunk is called a ''complex sentence pattern''. Within the scope of a single sentence, there are five categories: compound predicate sentence, parataxis-structure predicate sentence, subject-predicate predicate sentence, pivotal sentence and serial verb sentence. Their diagrammatic style and XML structure are shown in Fig. 4. The above are the normal sentence patterns of Chinese. In addition, there are three kinds of special sentence patterns: one-word sentence, no-subject sentence and inverted sentence. Their diagrammatic style and XML structure are shown in Fig. 5.

D. MORPHOLOGY OF SENTENCE PATTERN STRUCTURE
The final state of syntactic analysis is to analyze the sentence layer by layer to ''word'' units. In the practice of Chinese information processing, syntactic analysis usually uses the lexicon to realize lexical analysis (word segmentation and part of speech tagging). In the construction of Chinese Treebank based on sentence pattern structure, Modern Chinese Dictionary (6th Edition) [25] is selected as the basic lexicon, and the part of speech and sense code of the word are marked in the diagrammatic style and XML structure, as shown in Fig. 6.
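As a rough illustration of how a part of speech leaf node might carry a sense code, the sketch below builds a small XML fragment with Python's standard library. The component tag names (`sent`, `sbj`, `prd`), the part of speech tags and the sense code values are placeholders invented for the example; the actual tag set is the one given in Table 1, and only the attribute name `sen` is taken from the caption of Fig. 6.

```python
import xml.etree.ElementTree as ET

def pos_leaf(parent, pos, word, sense_code):
    """Append a part of speech leaf whose `sen` attribute holds the sense code."""
    leaf = ET.SubElement(parent, pos, {"sen": sense_code})
    leaf.text = word
    return leaf

# Hypothetical component tags; the real tag set is defined in Table 1.
sentence = ET.Element("sent")
subject = ET.SubElement(sentence, "sbj")
predicate = ET.SubElement(sentence, "prd")

# Sense codes are three-digit strings; the values here are made up.
pos_leaf(subject, "r", "我", "001")
pos_leaf(predicate, "v", "来", "001")

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

Keeping the sense code as an attribute rather than a child node leaves the word itself as the only text content of the leaf, which keeps decoding back to a diagram simple.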
FIGURE 6. The diagrammatic style and XML structure marked with parts of speech and sense codes of words: the sense code corresponds to the sense of the word in Modern Chinese Dictionary, which is uniquely marked by three digits. In XML structure, the sense code is stored as the attribute ''sen'' of the part of speech node.
In the process of parsing sentences into words, it is not easy to master the granularity of analysis, that is, to decide what can be regarded as the ''word'' [26]- [28]. At present, mainstream syntactic parsing basically accepts the word segmentation lexicon as the standard of word granularity, taking the result of word segmentation as the leaf nodes of the syntactic tree. Theoretically speaking, the internal structure of words should not be reflected in the syntactic tree, and the leaf node of the syntactic tree should be the word. However, a large number of dynamic words in Chinese sentences are segmented during word segmentation and end up being analyzed syntactically. Dong [10] pointed out that the lexicon and morphology are two parts of vocabulary knowledge, with both differences and connections. Words cannot be defined from the perspective of the lexicon alone: all forms that conform to a lexical pattern should be considered words, even if they are not in the lexicon. If dynamic words outside the lexicon are syntactically segmented, the syntactic tree structure will expand exponentially. This increases the burden of syntactic analysis, makes automatic syntactic analysis more complex to realize, and affects the accuracy of the analysis results. In addition, neglecting dynamic words makes the leaf nodes of the syntactic tree too fragmentary, which further affects the semantic understanding of sentences.
Based on the reality of Chinese information processing, the sentence pattern structure syntactic parsing considers dynamic words besides the basic lexicon in terms of word granularity division. The dynamic word is separated from syntactic analysis as the content of lexical analysis and recognized and analyzed as a whole.

III. KNOWLEDGE REPRESENTATION OF DYNAMIC WORDS
The automatic analysis and recognition of dynamic words is the key link and basic work of Chinese automatic syntactic analysis based on sentence pattern structure. The formal knowledge representation of dynamic words is the first important work to be completed. This paper designs the dynamic word structural mode to realize the knowledge representation of dynamic words.

A. THE DYNAMIC WORD STRUCTURAL MODE
Dynamic words are characterized by cohesive meaning, suitable syllables and relatively stable structure, which cannot be expanded freely. Four kinds of information, including the whole part of speech of the dynamic word, the part of speech of the internal component, the syllable number of the internal component and the structural relationship between internal components, can better reflect the lexical characteristics of dynamic words and distinguish various types of dynamic words. Therefore, they are selected to describe the dynamic word structural mode. The structural relationship between internal components of dynamic words is shown in Table 2.
The specific design scheme of the dynamic word structural mode is as follows:
• < the part of speech of the internal component >:
• < structural relationship symbol >::
Examples of the dynamic word structural mode are shown in Table 3. The structural mode of '' [yuelan shi](reading room)'' is ''n: v2 n''. In ''n: v2 n'', the ''n'' before the colon indicates that the whole part of speech is a noun; ''v2'' indicates that '' [yuelan](read)'' is a verb with 2 syllables; the final ''n'' indicates that '' [shi](room)'' is a noun with 1 syllable (the syllable number takes the default value 1 when it is omitted); the symbol between ''v2'' and ''n'' represents the attribute-headword relationship.
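To make the notation concrete, the following is a minimal sketch of how a structural mode string such as ''n: v2 n'' could be split into its parts. It assumes that part of speech tags consist of Latin letters, that an optional trailing number is the syllable count, and that whatever appears between two components is the structural relationship symbol. The names `parse_mode`, `Component` and `StructuralMode` are hypothetical and not part of the paper.

```python
import re
from typing import List, NamedTuple

class Component(NamedTuple):
    pos: str        # part of speech of the internal component
    syllables: int  # syllable number (1 when omitted)

class StructuralMode(NamedTuple):
    whole_pos: str
    components: List[Component]
    relations: List[str]  # raw text found between adjacent components

# A component is a part of speech tag (letters) plus an optional syllable count.
_COMPONENT = re.compile(r"([A-Za-z]+)(\d*)")

def parse_mode(mode: str) -> StructuralMode:
    whole_pos, _, body = mode.partition(":")
    body = body.strip()
    components, relations = [], []
    last_end = None
    for m in _COMPONENT.finditer(body):
        if last_end is not None:
            # Whatever sits between two components is kept as the relation symbol.
            relations.append(body[last_end:m.start()])
        components.append(Component(m.group(1), int(m.group(2) or 1)))
        last_end = m.end()
    return StructuralMode(whole_pos.strip(), components, relations)

reading_room = parse_mode("n: v2 n")
resultative = parse_mode("v: v-d-a2")
```

Here `reading_room.components` holds a two-syllable verb followed by a one-syllable noun, matching the explanation of ''n: v2 n'' above.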

B. ANNOTATING OF CORPUS
In order to obtain the various sentence pattern structures that actually occur in Chinese text (which provide recognition rules for automatic sentence pattern structure syntactic parsing) together with the dynamic word structural modes, undergraduate or postgraduate students with a background in linguistics and language teaching were organized to annotate the syntactic information and all the dynamic words in sentences. The corpus consists mainly of international Chinese textbooks, including New Practical Chinese Textbook, Happy Chinese, Great Wall Chinese, Chinese with Me, Mandarin Teaching Toolbox, Contemporary Chinese and Chinese Paradise. Modern Chinese Dictionary (6th Edition), as the basic lexicon, is a systematic, comprehensive and relatively stable word system. The internal components of dynamic words are annotated with the words and their corresponding part of speech [qiumi](ball game fan)'' is obviously not correct, and this combination structure is not consistent with the meaning of the word.
The annotating work is carried out on the semi-automatic corpus annotating platform which integrates the corpus of international Chinese textbooks, Modern Chinese Dictionary and visual annotating tools [29], [30]. The dynamic word annotating content includes the whole part of speech, the corresponding part of speech of each internal component (the syllable number of the internal component is automatically obtained by the semi-automatic corpus annotating platform), the structural relationship between internal components and other structural mode information, and the specific annotation results are shown in Table 4.
In order to ensure the accuracy and consistency of the annotating results, the same corpus text was annotated by two different annotators, and the annotating results were reviewed by professional auditors. Data whose annotations were consistent and approved were regarded as valid. Where the annotations were inconsistent or not approved, the final result was determined through discussion and analysis between the annotator and the auditor.

C. CONSTRUCTION OF THE STRUCTURAL MODE KNOWLEDGE BASE
In this paper, corpus data of 29465 sentences (498965 words) with dynamic word structural mode information are collected. Regular expressions are used to match and extract the dynamic words and their structural mode information in the annotated corpus. A regular expression is a formula that matches a class of strings with a certain pattern, and is composed of common characters and special characters (metacharacters). Common characters include uppercase and lowercase letters, numbers, Chinese characters, punctuation marks and other characters. Metacharacters are characters with special meanings. For example, ''\w'' matches a letter, digit or underscore; ''.'' matches any single character except ''\n'' and ''\r''; ''+'' matches the preceding subexpression one or more times; ''?'' placed after ''+'' makes the match non-greedy, so that it consumes as few characters of the searched string as possible. The layout of dynamic words and their structural mode information in the annotated corpus is regular, so all the information to be extracted can be matched accurately by the regular expression ''<\w mod=".+?">.+?</\w></\w>''.
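The extraction step can be sketched in Python as follows. The annotated fragment is a hypothetical example: the actual annotation layout is not reproduced here, so the nesting of the tags and the content of the `mod` attribute are assumptions. The helper patterns `MODE` and `TAG` and the function `extract` are likewise illustrative additions.

```python
import re

# The pattern from the text: a one-letter tag carrying a `mod` attribute,
# non-greedy content, then two consecutive closing tags.
DYNAMIC_WORD = re.compile(r'<\w mod=".+?">.+?</\w></\w>')
MODE = re.compile(r'mod="(.+?)"')   # pulls out the structural mode
TAG = re.compile(r"<[^>]+>")        # strips markup to recover the surface form

# Hypothetical annotated fragment; the real annotation layout may differ.
annotated = ('<v>去</v><n mod="n: v2 n"><v>阅览</v><n>室</n></n>'
             '<v>看</v><n>书</n>')

def extract(text):
    """Return (structural mode, surface form) for every annotated dynamic word."""
    pairs = []
    for fragment in DYNAMIC_WORD.findall(text):
        mode = MODE.search(fragment).group(1)
        surface = TAG.sub("", fragment)
        pairs.append((mode, surface))
    return pairs

pairs = extract(annotated)
```

The non-greedy quantifiers matter here: with a greedy ''.+'' a single match could run from the first dynamic word on a line to the last closing tags on that line.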
After statistical analysis of the extracted information, a knowledge base with 672 kinds of dynamic word structural modes is preliminarily established. Some of the dynamic word structural modes are shown in Table 5.
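The statistical step can be pictured as grouping the extracted pairs by structural mode. The sketch below is one possible way to do this, not the authors' implementation; the field names `frequency`, `type` and `detail` follow the caption of Table 7, while the function name and the toy input are invented for illustration.

```python
from collections import Counter, defaultdict

def build_knowledge_base(pairs):
    """Group (structural mode, dynamic word) pairs into per-mode entries."""
    words_by_mode = defaultdict(Counter)
    for mode, word in pairs:
        words_by_mode[mode][word] += 1
    knowledge_base = {}
    for mode, words in words_by_mode.items():
        knowledge_base[mode] = {
            "frequency": sum(words.values()),  # total occurrences of the mode
            "type": len(words),                # number of distinct dynamic words
            "detail": dict(words),             # each word with its own frequency
        }
    return knowledge_base

# Toy input, made up for the example; it does not come from the annotated corpus.
toy_pairs = [
    ("n: v2 n", "阅览室"),
    ("n: v2 n", "阅览室"),
    ("n: v2 n", "休息室"),
    ("v: v2←a2", "打扫干净"),
]
kb = build_knowledge_base(toy_pairs)
```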
The structure of the structural mode knowledge base is shown in Table 6. Table 7 shows the specific contents of the structural mode ''n: v| Ng n'' in the knowledge base. The ''detail'' field in Table 7

IV. AUTOMATIC RECOGNITION OF DYNAMIC WORDS
The realization of automatic recognition of dynamic words is the preparation of automatic syntactic analysis based on sentence pattern structure. On the basis of the previous work, this chapter explores automatic recognition methods of dynamic words in the raw corpus of international Chinese textbooks.

A. RECOGNITION BASED ON REGULAR EXPRESSIONS
The composition of reduplication and quantitative structures is very regular, the formal features are obvious, and the internal components are clear or relatively closed. It is suitable to construct regular expression rules reflecting their distinct features, and use regular expressions to identify the corresponding kinds of dynamic words. In this section, the regular expression method is used to explore the automatic recognition of verb reduplication in corpus.
TABLE 7. The structural mode ''n: v|Ng n'' in the knowledge base. The fields ''syllable'', ''frequency'' and ''type'' respectively indicate the syllable number, frequency and the number of the types of the dynamic words corresponding to the structural mode in the annotated corpus. The ''detail'' field lists all the dynamic words in the annotated corpus corresponding to the structural mode and their frequency of occurrence, as well as the sense code of each internal component.
Before using regular expressions to recognize verb reduplication, it is necessary to summarize and determine reasonable regular expressions for verb reduplication according to its own language characteristics and the structural mode knowledge base. The regular expressions corresponding to various structural modes of verb reduplication are shown in Table 8. The regular expression defines the information of syllable number, the position of the reduplication characters and the fixed component of verb reduplication corresponding to the specific structural mode. A combination can generally be judged to be verb reduplication if it satisfies the regular expression rule and part of speech information, and the specific type of its structural mode can also be determined.

1) RECOGNITION ALGORITHM
The automatic recognition algorithm of the verb reduplication in the raw corpus is as follows. In the process of automatic recognition, the sequence of verb reduplication in different structural modes is different.   v·v3'' is used to match the combinations in the raw corpus, and obtain the verb reduplication candidates. Then use Modern Chinese Dictionary to filter the candidates and delete the vocabularies that have been included

2) RECOGNITION EXPERIMENT
In order to objectively reflect the automatic recognition effect of verb reduplication, 3 groups of raw corpus of international Chinese textbooks (hundreds of thousands of Chinese characters in each group) were selected randomly for the experiment. According to the above recognition algorithm, the automatic recognition experiment is carried out on 3 groups of randomly selected raw corpus, and the result is shown in Table 9. According to the experimental result, it can be concluded that the regular expressions of structural modes of verb reduplication have a significant effect on automatic recognition of verb reduplication. Both the recognition accuracy rate and recall rate of verb reduplication are high, and the recognition results are relatively satisfactory.

3) EXPERIMENTAL ANALYSIS
Analysis of the experimental data shows that the accuracy rate and recall rate of automatic recognition are reduced mainly for the following reasons:
1) For multi-category words that are both verbs and adjectives, determining the part of speech only from the relative frequency of the two parts of speech is not rigorous enough. The recognition effect needs to be improved by studying the reduplication of this kind of word.
2) Some ambiguities are difficult to avoid in the process of recognition. For example, '' [haohao shuo shuo](tell earnestly)'' from the sentence '' (You talk to him about this matter earnestly)'' is easily identified as verb reduplication of the structural mode ''v: v·v. . . v·v'', because the verb '' [haoshuo](no problem)'' exists in Modern Chinese Dictionary. Such ambiguities need to be further eliminated by improving the recognition algorithm.
3) The current regular expressions reflect only part of the external characteristics of verb reduplication. More effective and quantifiable feature information needs to be identified through in-depth study of verb reduplication.
4) The restrictions and screening conditions in the recognition algorithm are limited. Research resources and achievements in linguistics, language teaching and Chinese information processing are needed to further enrich the recognition conditions of verb reduplication.

B. RECOGNITION BASED ON SEMANTIC CATEGORY COMBINATIONS
The formal features of temporary nouns and resultative/directional structures are not as strong as those of reduplication and quantitative structures, and their internal components are also rich and relatively open, so they are not suitable for the recognition algorithm based on regular expressions. However, the combination of their internal components is obviously restricted by semantics, so the combination of semantic categories corresponding to the internal components can be explored to identify the corresponding type of dynamic words. This section attempts to identify the corresponding dynamic words by using the semantic category combination features of the internal components of dynamic words, taking the recognition of resultative/directional structures as an example. This paper uses Modern Chinese Semantic Dictionary [31] to annotate the semantic categories corresponding to the internal components of resultative/directional structures in the structural mode knowledge base, and obtains the semantic category combinations corresponding to various structural modes. The semantic category combinations corresponding to the mode ''v: v2←a2'' are shown in Table 10.

1) RECOGNITION ALGORITHM
The automatic recognition of resultative structures and directional structures in the raw corpus is carried out separately. The automatic recognition process is as follows.
1) Using the automatic word segmentation tool (with Modern Chinese Dictionary as the basic lexicon) [32], the original corpus is automatically segmented and tagged with parts of speech to obtain the processed corpus which has been segmented and carries part of speech information. In order to avoid the negative effect of segmentation and part of speech tagging errors on the automatic recognition of resultative structures (or directional structures), the results of segmentation and part of speech tagging are manually proofread.
2) Sort resultative structure (or directional structure) structural modes according to the number of internal components from more to less, the number of syllables of internal components from large to small, and the frequency of corresponding resultative structures (or directional structures) from high to low. For example, there are at most three internal components in the resultative structure structural modes, so the structural mode ''v: v-d-a2'' with three internal components, four syllables (the highest) and the highest frequency of corresponding resultative structures ranks first in the queue of structural modes.
3) Recognize resultative structures (or directional structures) corresponding to the current resultative structure (or directional structure) structural mode in order. The specific method is to extract the combinations consistent with parts of speech and syllables of the internal components of the current mode from the processed corpus in step 1 as the candidate set of the structure to be identified. Then, through Modern Chinese Semantic Dictionary, semantic category information is automatically added to all the internal components of each combination in the candidate set. The semantic category combination of each candidate is compared with the ''combination'' field of the current mode in the queue of structural modes. If it exists in the ''combination'' field, the candidate is qualified and is included in the resultative structure (or directional structure) set; if it does not exist in the ''combination'' field, the candidate is not qualified. Once identified as resultative structures (or directional structures), the contents are no longer involved in the following recognition operations.
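The three steps can be sketched as follows, starting from an already segmented and tagged token list. This is a simplified reading of the procedure rather than the authors' code: it assumes that each Chinese character counts as one syllable, that the syllable criterion in step 2 refers to the total syllable count of a mode, and that a word may have several semantic categories, any combination of which may qualify. The semantic category labels, frequencies and dictionary entries are invented placeholders and are not taken from Modern Chinese Semantic Dictionary.

```python
from itertools import product

def sort_modes(modes):
    """Step 2: more components, then more syllables, then higher frequency."""
    return sorted(
        modes,
        key=lambda m: (
            len(m["components"]),
            sum(syl for _, syl in m["components"]),
            m["frequency"],
        ),
        reverse=True,
    )

def recognize(tokens, modes, semantic_dict):
    """Step 3. tokens: list of (word, pos). Returns (index, words, mode) triples."""
    used = [False] * len(tokens)
    results = []
    for mode in sort_modes(modes):
        n = len(mode["components"])
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if any(used[i:i + n]):
                continue  # already part of a recognized structure
            # Candidate test: parts of speech and syllable numbers must agree.
            if any(pos != c_pos or len(word) != c_syl
                   for (word, pos), (c_pos, c_syl) in zip(window, mode["components"])):
                continue
            # Qualification test: some combination of the components' semantic
            # categories must be listed in the mode's "combination" field.
            categories = [semantic_dict.get(word, ()) for word, _ in window]
            if any(combo in mode["combination"] for combo in product(*categories)):
                results.append((i, [w for w, _ in window], mode["name"]))
                used[i:i + n] = [True] * n
    return sorted(results)

# Toy data; categories and frequencies are placeholders.
modes = [
    {"name": "v: v2←a2", "components": [("v", 2), ("a", 2)],
     "frequency": 10, "combination": {("activity", "property")}},
    {"name": "v: v-d-a2", "components": [("v", 1), ("d", 1), ("a", 2)],
     "frequency": 5, "combination": set()},
]
semantic_dict = {"打扫": ["activity"], "干净": ["property"],
                 "学习": ["mental"], "努力": ["property"]}
tokens = [("他", "r"), ("打扫", "v"), ("干净", "a"), ("了", "u"),
          ("房间", "n"), ("学习", "v"), ("努力", "a")]
recognized = recognize(tokens, modes, semantic_dict)
```

In the toy run both verb-adjective pairs pass the part of speech and syllable test, but only the first has a semantic category combination listed for the mode, which is the filtering effect the procedure relies on.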

2) RECOGNITION EXPERIMENT
To objectively assess the automatic recognition of resultative/directional structures, 3 groups of raw corpus from international Chinese textbooks (tens of thousands to hundreds of thousands of Chinese characters per group) were randomly selected. The above recognition algorithm was applied to each group, and the recognition results for resultative structures and directional structures are shown in Table 11 and Table 12 respectively. The results indicate that the semantic category combinations corresponding to the resultative/directional structure structural modes in the knowledge base contribute substantially to the automatic recognition of resultative structures and directional structures. The accuracy rate and recall rate of automatic recognition both exceed 80%, and in some cases exceed 90%. The directional complements of directional structures are clearly delimited: they are monosyllabic or multisyllabic directional verbs, which form a small, fixed set in Chinese. As a result, the accuracy rate and recall rate for directional structures are higher than those for resultative structures.

3) CONTRAST EXPERIMENT
In the contrast experiment, the resultative structures and directional structures in the corpus are identified using only the parts of speech of the internal components, the number of syllables of the internal components, and the restriction to monosyllabic or multisyllabic directional verbs. The experimental results are shown in Table 13. It can be seen from Table 13 that the accuracy rate and F-measure for resultative structures are relatively low, whereas the accuracy rate for directional structures is close to that of the above recognition experiment. The main reason is that the restriction to directional verbs is itself a strong constraint in recognizing directional structures, which keeps the recognition valid.
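For readers who want to recompute the measures reported in the tables, the following sketch shows one common way to obtain them. It assumes that the ''accuracy rate'' here corresponds to precision over the recognized items and that the F-measure is the balanced F1 score; the function name and the counts are hypothetical and are not taken from the tables.

```python
def precision_recall_f(num_correct, num_recognized, num_gold):
    """num_correct: recognized items that are truly target structures.
    num_recognized: all items output by the recognizer.
    num_gold: all target structures present in the corpus."""
    precision = num_correct / num_recognized if num_recognized else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts for illustration only.
p, r, f = precision_recall_f(num_correct=90, num_recognized=100, num_gold=120)
```

Under these definitions, a recognizer that matches only on parts of speech and syllable counts tends to keep recall high while lowering precision, because it accepts many surface-identical combinations that are not target structures.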

4) EXPERIMENTAL ANALYSIS
Comparing the above recognition experiment results with the contrast experiment results shows that the recognition algorithm based on semantic category combinations significantly improves the automatic recognition of resultative/directional structures. Analysis of the recognition experiment data also shows that the algorithm has some shortcomings and can be improved in the following aspects:
1) Limited by the scale of the annotated corpus, the semantic category combinations obtained cannot cover all the semantic category combinations corresponding to the resultative/directional structures that actually appear in texts, which limits the recall rate of resultative/directional structure recognition. In later work, the annotated corpus needs to be expanded so that the semantic category combinations corresponding to the various structural modes can be further collected and refined.
2) Some ambiguities are difficult to avoid during recognition and lead to recognition errors. For example, in the sentence '' , (Mr. Green not only speaks Chinese well, but also studies Chinese culture very well)'', the '' [hao](well)'' in '' [shuo de hao](speaks well)'' should be analyzed as the modal complement of '' [shuo](speak)''. However, because its formal structure is identical to the possible form '' [shuo de hao](can come to an agreement)'' of the resultative structure '' [shuo hao](come to an agreement)'', it is easily misclassified into the resultative structure set of the structural mode ''v: vu-a''. Such ambiguities need to be eliminated by refining the recognition conditions and process.
3) The structural mode information and the semantic category combinations reflect the structural and semantic features of resultative/directional structures only to a certain extent and cannot fully capture their characteristics. More effective and quantifiable feature information needs to be identified through further in-depth study of resultative/directional structures.

C. RECOGNITION BASED ON MACHINE LEARNING CLASSIFICATION ALGORITHMS
The automatic recognition of dynamic words is essentially a binary classification problem. This section explores the performance of various machine learning classification algorithms in the automatic recognition of dynamic words, taking a typical nounal dynamic word, the trisyllabic nounal dynamic word, as an example.

1) RECOGNITION ALGORITHM
The automatic recognition algorithm of trisyllabic nounal dynamic words is as follows.
1) Sort the dynamic word structural modes by the frequency of their corresponding dynamic words, from high to low, and recognize the dynamic words of each structural mode in that order. For example, the structural mode ''n: n2 n'' has the highest frequency of corresponding trisyllabic nounal dynamic words, so it ranks first in the queue of structural modes.
2) Extract from the annotated corpus the combinations whose parts of speech and syllable counts match the internal components of the current structural mode, and label each combination according to whether it is a dynamic word (the corpus already carries this information, with ''yes'' for dynamic words and ''no'' for non-dynamic words).
3) Automatically look up the semantic categories of the internal components of each combination in Modern Chinese Semantic Dictionary, which constrains the internal combination of trisyllabic nounal dynamic words from the semantic perspective. If an internal component does not exist in Modern Chinese Semantic Dictionary, the component itself is temporarily used as its semantic category.
4) Build the dataset for machine learning. Each record contains the semantic categories of the internal components (for trisyllabic nounal dynamic words with the same structural mode, the number of internal components is fixed at 2 or 3) and a field indicating whether the combination is a dynamic word (''yes'' or ''no''). Distinct semantic categories are mapped to distinct consecutive integers, and the dataset stores these integers in place of the semantic categories.
5) Evaluate different machine learning classification algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Tree (CART), Naive Bayes Classifier (NB) and Support Vector Machine (SVM). 10-fold cross validation is used to split the data, and the same random allocation is used for every algorithm so that all algorithms are evaluated on the same data. Each classification algorithm uses the default parameters provided by scikit-learn.
6) Select the classification algorithm with the highest accuracy to build the classification model for predicting trisyllabic nounal dynamic words.
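Steps 3 and 4 of the algorithm (semantic category lookup with fallback, then integer encoding) can be sketched as below. This is an illustrative sketch only: the function name, the dictionary contents, the category labels and the ''yes''/''no'' labels of the toy records are invented and are not taken from the annotated corpus or from Modern Chinese Semantic Dictionary.

```python
def build_dataset(combinations, semantic_dict):
    """combinations: list of (components, label) pairs, label being "yes" or "no".
    Returns (rows, category_ids); each row holds one integer per internal
    component followed by 1 for "yes" or 0 for "no"."""
    category_ids = {}
    rows = []
    for components, label in combinations:
        features = []
        for word in components:
            # Step 3: fall back to the component itself when the dictionary has no entry.
            category = semantic_dict.get(word, word)
            # Step 4: give each newly seen category the next unused integer.
            features.append(category_ids.setdefault(category, len(category_ids)))
        rows.append(features + [1 if label == "yes" else 0])
    return rows, category_ids

# Toy records with invented labels, for illustration only.
semantic_dict = {"汽车": "vehicle", "站": "place", "学生": "person", "人": "person"}
combinations = [
    (("汽车", "站"), "yes"),
    (("学生", "证"), "yes"),  # "证" is missing from the toy dictionary
    (("汽车", "人"), "no"),
]
rows, category_ids = build_dataset(combinations, semantic_dict)
```

Note that the integers produced this way are arbitrary identifiers: their numeric order carries no meaning, which is worth keeping in mind when interpreting classifiers that treat the features as ordered numbers.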

2) RECOGNITION EXPERIMENT
In this experiment, the trisyllabic nounal dynamic words corresponding to the two typical structural modes ''n: n2 n'' and ''n: n2 Ng'' are recognized. The dataset is prepared as follows: (1) Dynamic words with the structural modes ''n: n2 n'' and ''n: n2 Ng'' are extracted from the annotated corpus by regular expressions and marked as ''yes''; then, with the extracted dynamic words removed, combinations of ''the disyllabic noun + the monosyllabic noun or nominal morpheme'' are extracted from the remaining corpus and marked as ''no''. (2) The dynamic words and non-dynamic words extracted above all contain two internal components. (3) In order to train the machine learning model, the semantic categories corresponding to the internal components of the combinations are mapped to consecutive integers. For example, '' [jiaotonggongju](vehicle)'' and '' [kongjian](space)'' are mapped to ''23'' and ''18'' respectively. (4) The semantic category of each internal component and the label indicating whether the combination is a dynamic word are selected to form the dataset (three attributes per record). In this experiment, a total of 2226 data records were extracted, of which 1872 were labeled ''yes'' and 354 were labeled ''no''.
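Because the two classes are unevenly represented, it can help to keep the class distribution in mind when reading accuracy figures for this dataset. The short calculation below uses only the record counts stated above; the notion of a majority-class baseline (always answering ''yes'') is added here as a reading aid and is not part of the original experiment.

```python
yes_count, no_count = 1872, 354          # counts stated in the text
total = yes_count + no_count             # 2226 records
majority_baseline = max(yes_count, no_count) / total
print(f"{total} records, majority-class baseline accuracy = {majority_baseline:.2%}")
# prints "2226 records, majority-class baseline accuracy = 84.10%"
```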
The classification performance of the different classification algorithms is shown in Fig. 7. The box plot shows the accuracy of each algorithm and the distribution of its results over the 10-fold cross validation. It can be seen from Fig. 7 that the CART algorithm performs best. The average accuracy and standard deviation of each classification algorithm are shown in Table 14. The average accuracy of the CART algorithm reaches 86.51%.
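A comparison of this kind can be set up in scikit-learn roughly as follows. This is a hedged sketch rather than the authors' script: the function name, the seed value and the synthetic stand-in data are assumptions, and `GaussianNB` is used only because the text does not state which Naive Bayes variant was applied. All estimators are created with scikit-learn's default parameters, and each one is scored on the same 10 folds.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, seed=7):
    """Return {name: (mean accuracy, std)} using identical 10 folds for every model."""
    models = {
        "LR": LogisticRegression(),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "CART": DecisionTreeClassifier(),
        "NB": GaussianNB(),  # assumption: the NB variant is not named in the text
        "SVM": SVC(),
    }
    results = {}
    for name, model in models.items():
        # Re-creating KFold with the same seed gives every model the same splits.
        folds = KFold(n_splits=10, shuffle=True, random_state=seed)
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results

# Synthetic stand-in data: two integer-coded category features and a 0/1 label.
rng = np.random.default_rng(0)
X = rng.integers(0, 30, size=(300, 2))
y = (X[:, 0] + X[:, 1] < 30).astype(int)
results = compare_classifiers(X, y)
for name, (mean, std) in results.items():
    print(f"{name}: {mean:.4f} ({std:.4f})")
```

The per-fold `scores` arrays are what a box plot such as Fig. 7 would display, while the mean and standard deviation correspond to the kind of summary given in Table 14.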

3) CONTRAST EXPERIMENT
The experimental corpus data is divided into 3 parts. The trisyllabic nounal dynamic words in the corpus are identified using only the parts of speech or morpheme categories of the internal components and the number of syllables of the internal components. The experimental results are shown in Table 15. Although the contrast experiment guarantees the recall rate of automatic recognition, its accuracy rate is not ideal compared with the above recognition experiment, lying only between 70% and 80%.
The above recognition algorithm based on semantic category combinations is also used to recognize the trisyllabic nounal dynamic words in the three groups of data, and the recognition results are shown in Table 16. It can be seen from Table 16 that the recognition algorithm based on semantic category combinations also achieves good results, with an accuracy rate higher than that of most of the above-mentioned machine learning classification algorithms. This shows that the recognition algorithm based on machine learning classification algorithms still has considerable room for improvement.

4) EXPERIMENTAL ANALYSIS
Comparing the above recognition experiment results with the contrast experiment results shows that the recognition algorithm proposed in this section is effective. The results also reflect the soundness, validity and completeness of the structural mode knowledge base. In the recognition algorithm for trisyllabic nounal dynamic words, the automatic acquisition of semantic categories depends heavily on the completeness, accuracy and practicability of Modern Chinese Semantic Dictionary, and the machine learning classification algorithms use few types of features, which limits the accuracy of automatic recognition. In future work, the automatic recognition of trisyllabic nounal dynamic words needs to be improved by improving Modern Chinese Semantic Dictionary and extracting more effective features.
The main positive impacts of the proposed algorithms in this study are as follows: (1) They provide a systematic solution for the problem of automatic recognition of dynamic words in Chinese text. The above three algorithms can cover the recognition of all types of dynamic words. (2) They have achieved good results in recognition accuracy and recall rate. Although the recognition accuracy rates and recall rates of different types of dynamic words are different, the recognition accuracy rates are basically above 80%, most of the recall rates are above 90%, and the recognition accuracy rates of some types of dynamic words can even reach more than 90%. (3) They provide solving ideas or reference value for other natural language processing tasks. For a specific natural language processing task, this study extracts relevant knowledge and rules by annotating large-scale corpus, constructs a task-oriented knowledge base, and then designs algorithms based on the constructed knowledge base to complete the specific task.

V. CONCLUSION
In this paper, based on the dynamic word structural mode knowledge base for Chinese information processing, we propose dynamic word recognition algorithms based on regular expressions, semantic category combinations and machine learning classification algorithms respectively. This study provides a systematic solution for the task of automatic recognition of dynamic words in Chinese text. The proposed three recognition algorithms can cover the recognition of all types of dynamic words, and achieve good recognition accuracy and recall rate. The recognition algorithms use the structural form or semantic pattern of dynamic words to realize the integration of theory research and automatic recognition practice of dynamic words, which makes them verify each other and develop together. In addition, the recognition algorithms can be applied to the process of Chinese automatic syntactic analysis based on sentence pattern structure, which can effectively improve the efficiency and accuracy of syntactic analysis. However, the information of language rules of dynamic words is not fully mined in this study, especially the context information of dynamic words in sentences, which affects the accuracy and recall rate of automatic recognition to a certain extent. In future work, we hope to further analyze the theoretical research results of dynamic words and large-scale annotated corpus to mine the feature information of dynamic words, so as to continue to improve the effect of the recognition algorithms, and improve the recognition algorithm based on machine learning classification algorithms by adding feature attributes, integrating various classification algorithms and adjusting parameters.

WEIMING PENG received the M.Sc. and Ph.D. degrees from Beijing Normal University, China, in 2009 and 2012, respectively. He is currently a Lecturer with Beijing Normal University. His research interests include natural language processing, computational linguistics, and Chinese syntactic parsing.

JIHUA SONG received the M.Sc. and Ph.D. degrees from Beijing Normal University, China, in 1995 and 2000, respectively. He is currently a Professor with Beijing Normal University. His research interests include language information processing and computer applications in education.

VOLUME 8, 2020