Diagnosis Method of Thyroid Disease Combining Knowledge Graph and Deep Learning

The scale of medical data is growing rapidly. These data come from many different sources, are produced quickly and in large volumes, and vary in format. Case data are especially important because they contain a great deal of medical knowledge about diseases, drugs, and treatments, and can therefore provide important support for the development of smart medicine. A knowledge graph is a graph-based data structure that can represent the real-world relationships among these medical data and form a semantic network. This research uses knowledge graph technology to connect the fragmented and scattered knowledge held in various medical information systems to assist in disease diagnosis. Taking thyroid disease as an example, it constructs a medical knowledge graph and applies it to intelligent medical diagnosis. First, the relationships between biomedical entities are extracted to construct a biomedical knowledge graph. Then, the entities and relationships in the knowledge graph are transformed into low-dimensional continuous vectors through a knowledge graph embedding method. Finally, known pathology-disease relationship data are used to train a disease diagnosis model based on a bidirectional long short-term memory network (BLSTM). Experiments show that the thyroid disease diagnosis method that combines knowledge graphs and deep learning achieves better diagnostic performance than the compared methods. This suggests that smart medical care based on knowledge graphs can help alleviate the shortage of high-quality domestic medical resources.


I. INTRODUCTION
Intelligent medical care [1], [2] refers to the use of technologies such as the Internet of Things and artificial intelligence to realize interaction between patients and medical staff, medical institutions, and medical equipment, and thereby gradually achieve informatization. Information technology is used to improve disease prevention, diagnosis, and research, so as to achieve scientific management of population health and ultimately benefit all components of the medical ecosystem. The core of intelligent medical care is intelligent diagnosis and treatment, that is, turning computers into a brain with medical knowledge that can support doctors' decisions in diagnosis and treatment. Such computers are capable of diagnosis and treatment. They can independently provide medication assistance, triage guidance, health consultation, and other services, and they can also assist medical practitioners in completing certain tasks with high quality.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
The electronic medical record (EMR) records the patient's age, gender, and other personal information, as well as treatment information such as the diagnosis results of each visit, the length of stay in the hospital, and the medication status. Each category of information can be regarded as an entity, and these entities are connected to one another by different relationships. Using a knowledge graph to describe the medical knowledge in the EMR can improve its utilization rate, promote the development of intelligent medical care, and play an important auxiliary role in providing decision support for doctors [3]. Because the expressive ability of a knowledge graph is superior to that of a general relational database, it plays an increasingly important role in processing these massive medical data. The knowledge graph can integrate isolated data. As a structured graph model, it can describe the relationships between real-world entities well.
The knowledge graph [4], [5] is a relational network that connects different pieces of information. At present, the more mature large-scale general knowledge graphs include YAGO [6], DBpedia [7], NELL [8], and Freebase [9]. Knowledge graphs have been widely used in education [10], [11], medical treatment [12], [13], and other fields. The knowledge graph is an important technology in the field of artificial intelligence and the technical foundation for building a computer medical knowledge brain. It has therefore become one of the key technologies of smart medicine. After Ledley et al. [14] first applied the data model to clinical medicine, various forms of medical expert assistance systems appeared. The main workflow of these systems is to structure the clinical experience and knowledge of medical experts into a medical knowledge base, have experts formulate inference rules, and then, in practical applications, perform diagnosis and reasoning on the medical examination data entered by the user. However, the mechanical nature of such systems and their overly simple rule-based reasoning limit their ability to construct knowledge bases and perform diagnostic reasoning over diverse medical data. With the development of computer technology, machine learning, and artificial intelligence [15]-[23], more and more scholars have begun to use machine learning and artificial intelligence to build knowledge bases and disease-aided diagnosis systems. Since the 1970s, researchers abroad have invested considerable manpower and material resources in this area of research and development, and have achieved fruitful results [24]-[31].
Most of the above knowledge graph applications in the medical field use traditional machine learning [32]-[40], and their recognition rate needs to be improved further. Therefore, this research introduces a deep learning algorithm, BLSTM [41], for the diagnosis of thyroid diseases. First, a knowledge graph of thyroid diseases is constructed. Second, the thyroid-related entities and relationships in the knowledge graph are extracted and converted into low-dimensional continuous vectors by knowledge graph embedding. Finally, the BLSTM diagnostic model is trained. The characteristic word vectors of the thyroid gland and the relevant knowledge entity vectors are fed into the trained model to obtain the decision result. The main work of this paper is summarized as follows.
(1) A knowledge graph for the thyroid is constructed. Useful information is extracted from the thyroid patient information database, thyroid examination information database, thyroid drug use information database, and thyroid index information database. According to the connections between entities, the breadth-first search (BFS) algorithm [42] is used to fill the created concept tree and obtain the thyroid knowledge graph.
(2) The thyroid diagnostic model BLSTM is trained using the constructed knowledge graph. Experimental comparison shows that thyroid disease diagnosis based on the fusion of deep learning and the knowledge graph is more efficient than manual diagnosis and more accurate than diagnosis based on traditional machine learning methods. Classification based on BLSTM also gives better results than the other deep learning algorithms compared.

II. RELATED WORK
A. KNOWLEDGE GRAPH
The knowledge graph is a semantic network that describes the entities and concepts of the real world and the relationships between them. It makes full use of visualization technology: it can describe knowledge resources and their carriers, and it can also analyze and describe knowledge and the connections between pieces of knowledge. In the past, most research on medical knowledge graphs was based on literature. Most of the knowledge came from public medical literature, and electronic medical record data were rarely used. However, electronic medical record data cover the whole process of diagnosis and treatment of patients across the departments of a hospital and contain a wealth of medical knowledge. The information in electronic medical records is less orderly than the information in the subject literature, and most of it is unstructured text. Semantic analysis is therefore performed on the information in the electronic medical record to extract the knowledge units used to draw the knowledge graph. The connections between the knowledge units are then identified, and knowledge graph technology is used to connect the scattered and fragmented knowledge in the electronic medical records. This can provide patients and doctors with comprehensive services. In addition, medical knowledge graphs are widely used in medical information search engines, medical question answering systems, and medical decision support systems.
Building a knowledge graph involves three main steps: constructing knowledge units, constructing the relationships between units, and displaying the knowledge graph in structured form. Extracting knowledge units corresponds to named entity recognition, and recognizing the relationships between knowledge units corresponds to entity relation recognition. The main steps of the construction process are shown in FIGURE 1.

B. DISEASE DIAGNOSIS BASED ON KNOWLEDGE GRAPH
The main entities in the medical knowledge graph are diseases, symptoms, body parts, departments, and drugs, together with the relationships between these entities. The principle of disease diagnosis based on the knowledge graph is shown in FIGURE 2.
The user first selects a body part based on gender, and then selects symptoms based on that body part. Combining the age, gender, and other information selected by the user, the reasoning model diagnoses seven diseases that the user may have and shows detailed information about each disease, such as its overview, etiology, symptoms, complications, and treatment. Because the input consists of features that users select according to their own situation, some features may be missed or selected wrongly, especially symptom features. In the disease diagnosis model, the symptom features have the greatest impact on the diagnosis results, and they are also where users are most likely to make selection mistakes. Merely describing the relationships between entities in the existing medical knowledge graph is not enough to solve this problem, so the relationship between diseases and symptoms needs to be redefined. The weights between diseases and symptoms are quantified to distinguish how strongly different symptoms affect different diseases. At the same time, hidden relationships between diseases and symptoms are mined from the knowledge graph in order to reduce the error caused by the uncertainty of the user's feature selection.
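The idea of weighting disease-symptom relationships can be illustrated with a minimal sketch. The weights, the normalized-sum scoring rule, and the function name `rank_diseases` are all hypothetical and are not taken from the paper; the sketch only shows how weights let a partially correct symptom selection still rank plausible diseases.

```python
from typing import Dict, List, Tuple

# Hypothetical weights: how strongly each symptom points to each disease.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "hyperthyroidism": {"weight loss": 0.5, "palpitations": 0.3, "tremor": 0.2},
    "hypothyroidism": {"fatigue": 0.4, "weight gain": 0.4, "cold intolerance": 0.2},
}


def rank_diseases(selected: List[str],
                  weights: Dict[str, Dict[str, float]] = WEIGHTS,
                  top_k: int = 7) -> List[Tuple[str, float]]:
    """Score each disease by the share of its symptom weight the user selected."""
    chosen = set(selected)
    scores = []
    for disease, symptom_weights in weights.items():
        total = sum(symptom_weights.values())
        matched = sum(w for s, w in symptom_weights.items() if s in chosen)
        scores.append((disease, matched / total if total else 0.0))
    # Highest score first; keep at most top_k candidates.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# The user missed "palpitations" but the disease still ranks first.
ranking = rank_diseases(["weight loss", "tremor"])
```

Because each symptom contributes in proportion to its weight, a missed low-weight symptom lowers the score only slightly, which is the kind of tolerance to selection errors the paragraph above calls for.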

III. THYROID DISEASE DIAGNOSIS PROCESS
A. THYROID DIAGNOSTIC FRAMEWORK
This article determines whether an input sample indicates thyroid disease by drawing on the medical knowledge graph. The diagnostic model uses BLSTM. The framework of the proposed thyroid disease diagnosis method is shown in FIGURE 3.
First, a SemRep-based knowledge graph (SemKG) is constructed from the entity relationships in biomedical abstracts. Then, knowledge graph embedding is used to convert the entities and relationships in SemKG into low-dimensional continuous vectors. Next, the "pathology-disease" data are used to train a bidirectional long short-term memory network (BLSTM). Finally, the trained deep learning model is combined with the knowledge graph to diagnose thyroid diseases.

B. CONSTRUCTION OF THYROID KNOWLEDGE GRAPH
1) BUILD PROCESS
The overall process of this design is as follows. First, the conceptual layer of the thyroid knowledge graph is extracted from the data in the different tables of the database, combined with the initial thyroid nodule medical records. A conceptual classification tree is constructed and the relationships between data are extracted. Then, the data in the tables, that is, the entities, are filled into the conceptual layer. Expressed as triples of the form <entity, relationship, entity>, this yields a complete thyroid knowledge graph.

2) CONCEPTUAL DESIGN
The thyroid knowledge graph is constructed from the thyroid disease information stored in the database of a third-class hospital. This information includes the patient information entity, drug use information entity, diagnosis data entity, and so on, and there are many connections between these entities. Since the data in the database are tidy, the relationships between entities can be standardized and analyzed to form the entire knowledge graph. Because the resulting knowledge graph provides semantic relationships, users can directly observe the connections between entities and their data. The data in the thyroid database are classified, and the following medical entity definition is obtained.
Definition 1: Thyroid medical entity. This includes the thyroid patient entity, basic information entity, thyroid diagnosis result entity, thyroid medication entity, etc. (see TABLE 1).
After defining the thyroid medical entity and the thyroid factual relationship entity, the concept classification tree of the thyroid knowledge graph is constructed as shown in FIGURE 4.

3) PHYSICAL LAYER FILLING
Through entity mapping, the concepts in the conceptual layer are mapped to the entities in the database. This paper uses the breadth-first search (BFS) algorithm to fill the created concept tree and obtain the knowledge graph. The inputs are the concept classification tree T, the concept collection C in the concept layer, and the defined entity collection E. The output is the thyroid knowledge graph G, which is guaranteed to be constructed in the form of triples. The entity filling process is as follows. First, a mapping table is created. Based on the principle that each entity belongs to a concept in the concept tree, the mapping table between concepts and entities is constructed, as shown in FIGURE 5.
Second, the entities are filled in according to the mapping table. The entities in the mapping table are stored in the corresponding child nodes through breadth-first traversal, so that each entity has its own attributes and attribute values. Finally, entities and relationships are extracted and the triples are determined, forming the final thyroid knowledge graph. The knowledge graph is expressed as $G_{KG} = (E \cup \varphi(E), R \cup \varphi(R))$, where $E = \{e_1, e_2, \dots, e_N\}$ represents the $N$ entities in the knowledge graph, $R = \{r_1, r_2, \dots, r_M\}$ represents the relationships between entities, and $T = \{t_1, t_2, \dots, t_k\}$ represents the semantic types. Each entity $e$ or relationship $r$ is mapped to its semantic type by the mapping function $\varphi$. FIGURE 6 is a schematic diagram of SemKG.
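The filling procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dictionary layout of the concept tree, the relation names `instance_of` and `subclass_of`, and the sample concepts and entities are all assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def fill_concept_tree(tree: Dict[str, List[str]], root: str,
                      mapping: Dict[str, List[str]]) -> List[Triple]:
    """Visit concepts breadth-first and emit <entity, relationship, entity> triples.

    tree maps each concept to its child concepts (assumed acyclic);
    mapping is the concept-to-entity mapping table.
    """
    triples: List[Triple] = []
    queue = deque([root])
    while queue:
        concept = queue.popleft()
        # Attach every entity that the mapping table assigns to this concept.
        for entity in mapping.get(concept, []):
            triples.append((entity, "instance_of", concept))
        # Record the concept hierarchy and enqueue the children.
        for child in tree.get(concept, []):
            triples.append((child, "subclass_of", concept))
            queue.append(child)
    return triples


# Hypothetical concept tree and mapping table.
tree = {"thyroid": ["patient", "medication"], "patient": [], "medication": []}
mapping = {"patient": ["patient_001"], "medication": ["levothyroxine"]}
graph = fill_concept_tree(tree, "thyroid", mapping)
```

A queue gives the level-by-level order that BFS requires, so every concept is filled before any of its descendants are visited.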

C. DIAGNOSIS MODEL OF THYROID DISEASE
Consider a path $\pi_i^l = e_0 r_0 e_1 r_1 \dots e_{l-1} r_{l-1}$, where $e_0$ denotes a certain pathology and $e_l$ indicates whether the case is diagnosed as thyroid disease. The goal of the disease diagnosis model is to estimate the probability that a given pathology leads to thyroid disease. This probability is modeled as $D(g(\pi_i^l); \theta)$, where $D(\cdot)$ represents any discriminant model with parameter $\theta$ and $g(\cdot)$ represents the feature extraction function. The input layer of the disease diagnosis model takes an arbitrary path $\pi_i^l$, in which each $e$ is an entity and each $r$ is the relationship between two adjacent entities. In the knowledge graph embedding layer, each element $x_i$ in $\pi_i^l$ is transformed into a vector representation. The conversion process is shown in FIGURE 7.
First, the knowledge graph $G_{KG} = (E \cup \varphi(E), R \cup \varphi(R))$ is split into a Semantic Graph and a Type Graph. The Semantic Graph $G_{SG} = (E, R)$ contains only entities and the relationships between them. The Type Graph $G_{TG} = (\varphi(E), \varphi(R))$ contains only the semantic types corresponding to the entities and relationships. Translating Embedding (TransE) is used to embed $G_{SG}$ and $G_{TG}$ separately, so that each element $x_i$ in $\pi_i^l$ is transformed into the vector $\mathbf{x}_i = g(x_i) \oplus g(\varphi(x_i))$, where $g(\cdot)$ represents the knowledge graph embedding method and $\oplus$ represents vector concatenation. The embedded representation of the knowledge graph is obtained by minimizing a loss function. Taking the semantic graph as an example, this research uses Stochastic Gradient Descent (SGD) to obtain the final graph embedding representation; the embedding of the type graph is learned in the same way. The knowledge graph embedding layer converts each element in $\pi_i^l$ into a vector of length $L_{SG} + L_{TG}$. Finally, $\pi_i^l$ is transformed into a matrix $X$ of size $(L_{SG} + L_{TG}) \times l$.
VOLUME 8, 2020
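The paper names TransE and SGD but this copy does not preserve the loss function. The sketch below therefore uses the margin-based ranking loss commonly associated with TransE, with a squared L2 distance, as an assumption. The entity names, margin, learning rate, and embedding length are hypothetical, and the norm constraint on entity vectors used in the original TransE algorithm is omitted for brevity.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]  # (head, relation, tail)


def distance(h: Vector, r: Vector, t: Vector) -> float:
    """Squared L2 distance between h + r and t; small for plausible triples."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def sgd_step(emb: Dict[str, Vector], pos: Triple, neg: Triple,
             margin: float = 1.0, lr: float = 0.01) -> float:
    """One SGD update on the margin loss max(0, margin + d(pos) - d(neg))."""
    loss = (margin + distance(*(emb[k] for k in pos))
            - distance(*(emb[k] for k in neg)))
    if loss <= 0:
        return 0.0  # margin already satisfied, nothing to update
    # Compute both gradients before touching any embedding.
    grads = []
    for (h, r, t), sign in ((pos, 1.0), (neg, -1.0)):
        g = [sign * 2 * (a + b - c) for a, b, c in zip(emb[h], emb[r], emb[t])]
        grads.append(((h, r, t), g))
    for (h, r, t), g in grads:
        for j, gj in enumerate(g):
            emb[h][j] -= lr * gj
            emb[r][j] -= lr * gj
            emb[t][j] += lr * gj  # the tail enters the distance with a minus sign
    return loss


random.seed(0)
names = ["nodule", "thyroid_disease", "diabetes", "indicates"]  # hypothetical
emb = {n: [random.uniform(-0.5, 0.5) for _ in range(8)] for n in names}
pos = ("nodule", "indicates", "thyroid_disease")
neg = ("nodule", "indicates", "diabetes")  # corrupted tail as negative sample
for _ in range(500):
    sgd_step(emb, pos, neg)
```

After training, the observed triple lies closer to the translation $h + r \approx t$ than the corrupted one. The same routine would be run once on the semantic graph and once on the type graph, and the two resulting vectors concatenated per path element.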
This study uses a bidirectional LSTM (BLSTM) to predict the relationship between pathology and disease. The underlying LSTM cell, with input layer $x_i$, hidden layer $h_i$, and output layer $y_i$, is defined by its gate equations.
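The gate equations themselves are not preserved in this copy. For reference, a widely used LSTM formulation that is consistent with the symbols defined in the text is given below. It is an assumption, not a transcription of the paper's equations: in particular, it omits peephole connections, and it uses the subscript $t$ for the position in the sequence to avoid a clash with the input gate $i$.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here the forget gate $f_t$ scales the previous cell state, the input gate $i_t$ scales the new candidate, and the output gate $o_t$ controls how much of the cell state reaches the hidden layer.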
$\sigma$ is the logistic sigmoid function. $i$ represents the input gate vector, which has the same dimension as the hidden layer vector $h$; $f$ represents the forget gate vector; $o$ represents the output gate vector; and $c$ represents the cell activation vector. $W_*$ and $b_*$ are trainable parameters, and $\odot$ denotes element-wise multiplication. $c_0$ is the vector representation of $e_0$ in the input $\pi_i^l$, where $e_0$ represents a certain pathology. The disadvantage of a traditional LSTM is that it uses information from only one direction of the sequence, whereas the order of entities in $\pi_i^l$ affects the relationship between pathology and disease. Therefore, this study uses BLSTM for thyroid diagnosis. As shown in FIGURE 3, BLSTM uses two hidden layers, $h^f$ for the forward layer and $h^b$ for the backward layer, to process the data in opposite directions. The output disease diagnosis probability is computed from $W_{h^f z} h^f + W_{h^b z} h^b + b_z$, where $W_{h^f z}$, $W_{h^b z}$, and $b_z$ are the parameters to be trained. To prevent overfitting, dropout is added to the non-recurrent part of the BLSTM. The cross-entropy loss function $L_\theta$ is optimized by backpropagation through time (BPTT).
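The forward pass of such a model can be sketched in plain Python. Several details are assumptions made only for illustration: the cell has no peephole connections, the initial states are zero (the paper instead initializes $c_0$ from the vector of $e_0$), the output uses a sigmoid, dropout and training are omitted, and all weights are random.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


class LSTMCell:
    """Minimal LSTM cell without peephole connections (an assumed variant)."""

    def __init__(self, n_in: int, n_hid: int, rng: random.Random):
        self.n_hid = n_hid
        # One weight row per hidden unit and gate, acting on [x, h_prev, 1].
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid + 1)]
                      for _ in range(n_hid)] for g in "ifoc"}

    def step(self, x: Vector, h: Vector, c: Vector) -> Tuple[Vector, Vector]:
        z = x + h + [1.0]  # concatenated input; the trailing 1.0 carries the bias
        pre = {g: [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w[g]]
               for g in "ifoc"}
        i = [sigmoid(v) for v in pre["i"]]  # input gate
        f = [sigmoid(v) for v in pre["f"]]  # forget gate
        o = [sigmoid(v) for v in pre["o"]]  # output gate
        c_new = [fk * ck + ik * math.tanh(gk)
                 for fk, ck, ik, gk in zip(f, c, i, pre["c"])]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run(cell: LSTMCell, seq: List[Vector]) -> Vector:
    """Return the final hidden state after reading seq from first to last."""
    h, c = [0.0] * cell.n_hid, [0.0] * cell.n_hid
    for x in seq:
        h, c = cell.step(x, h, c)
    return h


def blstm_probability(seq: List[Vector], fwd: LSTMCell, bwd: LSTMCell,
                      w_f: Vector, w_b: Vector, b_z: float) -> float:
    h_f = run(fwd, seq)                  # forward layer reads the path in order
    h_b = run(bwd, list(reversed(seq)))  # backward layer reads it reversed
    z = (sum(a * b for a, b in zip(w_f, h_f))
         + sum(a * b for a, b in zip(w_b, h_b)) + b_z)
    return sigmoid(z)  # assumed sigmoid output for the binary diagnosis


rng = random.Random(0)
path = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]  # 4 embedded elements
fwd, bwd = LSTMCell(6, 5, rng), LSTMCell(6, 5, rng)
w_f = [rng.uniform(-0.1, 0.1) for _ in range(5)]
w_b = [rng.uniform(-0.1, 0.1) for _ in range(5)]
p = blstm_probability(path, fwd, bwd, w_f, w_b, 0.0)
```

The two cells share no weights, so the prediction depends on both the prefix and the suffix context of every path element, which is the property the text attributes to BLSTM.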

IV. EXPERIMENT AND ANALYSIS
A. EXPERIMENTAL DATA AND SETTINGS
The experimental data were collected by the authors from the database of a hospital and cover four diseases: thyroid disease, diabetes, hypertension, and coronary heart disease. For each disease, 1200 cases were selected. For the diagnosis of thyroid disease, the cases of the other three diseases all serve as negative samples. The details of the data set are summarized in tabular form. To verify the feasibility and effectiveness of the proposed thyroid disease diagnosis model, it is compared with the diagnosis models SVM [43], BPNN [44], RNN [45], and LSTM [46]. The evaluation indicators are Accuracy, Precision, Recall, and F1. The calculation formula of each index is shown in TABLE 3. For binary classification, the classification accuracy alone is not sufficient, and the four evaluation indicators in Table 4 should be considered together.
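TABLE 3 is not reproduced in this copy, so the sketch below uses the standard confusion-matrix definitions of the four indicators, which is an assumption about the table's contents. The function name and the toy labels are illustrative only.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = thyroid disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 3 positive and 5 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

With three negative classes for every positive one, as in this data set, a classifier that always predicts "negative" would already reach 75% accuracy, which is why precision, recall, and F1 are reported alongside it.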

B. EXPERIMENTAL RESULTS AND ANALYSIS
1) COMPARISON OF DIAGNOSIS RESULTS BASED ON DIFFERENT FEATURE EXTRACTION METHODS
To explore the impact of different feature extraction methods on the accuracy of disease diagnosis, three comparison experiments were set up. The first group uses LDA to generate structured knowledge in the form of topics. The number of topics for the thyroid gland is set equal to the number of structured features we extracted for the disease. The second group uses the vector space model (VSM) to represent the text. The third group uses the medical knowledge graph method proposed in this article. The classifier used is the classic SVM. Five-fold cross-validation was used to verify the performance of each method. The experimental results are shown in TABLE 4 and FIGURE 8.
The results show that classification based on the structured features extracted from the knowledge graph is better than classification based on the vector space model, and that classification based on LDA feature extraction is the worst. This indicates that the proposed structured features obtained from the knowledge graph have advantages in disease recognition tasks. To ensure the robustness of the experiment, the same experiment was repeated several times across multiple diseases. The experimental results of our method are stable, and the recognition rate of thyroid diseases is above 80%.
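Five-fold cross-validation can be sketched as follows. The paper does not state how the folds were drawn, so the shuffled, unstratified split and the function name `k_fold_indices` are assumptions; the sample count of 4800 follows from the four diseases with 1200 cases each.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(4800, k=5))
```

Each method is trained on four folds and evaluated on the held-out fold, and the reported score is the average over the five rounds.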

2) COMPARISON OF DIAGNOSIS RESULTS BASED ON DIFFERENT CLASSIFIERS
The experimental results in the previous section show that classification based on the knowledge graph features performs best. To explore the impact of different classifiers on the disease diagnosis results, this study introduces several comparison classifiers, all using the knowledge graph feature extraction method. Five-fold cross-validation was used to verify the performance of each method. The experimental results are shown in TABLE 5 and FIGURE 9. The results show that SVM and BPNN score lower on all four indicators than RNN, LSTM, and our method. This indicates that, for this classification task, deep learning algorithms outperform traditional machine learning algorithms. Among the three deep learning algorithms, our method leads on all four indicators, followed by LSTM, with RNN the worst. This is because the BLSTM model can better capture bidirectional semantics. In terms of network structure, it is more robust because it considers both a forward LSTM and a backward LSTM.

V. CONCLUSION
As people pay more attention to health, many Internet users consult about diseases online, which has produced a large amount of disease description text. In addition, hospital information platforms store a large amount of disease and other information. How to use this information effectively for the intelligent diagnosis of diseases is the ultimate goal of this research. Since most illnesses are recorded as text, and given the diversity of these records, this research proposes a diagnosis method for thyroid diseases based on knowledge graphs and deep learning. The method first extracts the relationships between entities in the biomedical literature to construct a biomedical knowledge graph. Then, the entities and relationships in the knowledge graph are transformed into low-dimensional continuous vectors through the knowledge graph embedding method. Finally, known pathology-disease relationship data are used to train the disease diagnosis model based on BLSTM. Experiments verify the effectiveness of our method in the diagnosis of thyroid diseases. There are two innovations in this research. The first is the use of knowledge graphs to extract structured features and obtain a structured representation of pathological text. The structured representation based on the knowledge graph is a new structured knowledge extraction method that can be applied to the description of disease conditions. The second is the use of classifiers based on deep learning algorithms, which improve the accuracy of disease diagnosis compared with traditional machine learning algorithms.