DL-PER: Deep Learning Model for Chinese Prehospital Emergency Record Classification

Prehospital emergency records contain a wealth of information about prehospital emergency patients. Extracting important patient information from large numbers of records has become a focus for prehospital emergency personnel, and the key to doing so is automatic classification of the records. This study proposes a deep learning-based prehospital emergency record classification model (DL-PER) that classifies prehospital emergency records with a weighted text convolutional neural network. First, we train a bidirectional encoder representations from transformers (BERT) model on prehospital emergency records to obtain word vectors. Then, we use a bidirectional long short-term memory (BiLSTM) model to obtain text features from a global perspective, while a weighted text convolutional neural network (WTextCNN) improves the model's local text feature extraction capability. We use the Meta-ACON activation function instead of the ReLU activation function to improve the learning ability of the model. We conducted experiments on prehospital emergency records provided by the Handan Emergency Center. The results show that, compared with the BiLSTM model, the DL-PER model improves F1 scores by 5.7%, 6.8%, 5.7%, and 4.9% on the four datasets, respectively.


I. INTRODUCTION
The emergency medical service system is an essential part of the health service system in China, tasked with rescuing patients with serious diseases [1]. Prehospital emergency care plays a vital role in this system. Most patients treated in prehospital emergencies are critically ill with short survival times; for example, stroke [2] and cardiac arrest patients [3] have short resuscitation windows. According to previous studies, only 21.5% of ischemic stroke patients in China reach the emergency room within three hours [4]. Emergency networks have been established in large and medium-sized cities, and their professional level continues to rise [5]. The prehospital emergency record contains all the information about the patient before arrival at the hospital, and this information is important for improving the success rate of emergency patient care. The challenge faced by prehospital emergency personnel is finding useful information quickly and efficiently in the massive amount of data to improve the efficiency of emergency care. The key to solving this problem is implementing automatic classification of prehospital emergency records.
Text classification techniques can obtain the text features of each label from a large amount of text to classify the dataset. Researchers have used deep learning to achieve text classification in different medical fields, such as clinical records [6], electronic medical records [7], Chinese medical questions [8], and nursing texts [9], but research on prehospital emergency records is less common.
The prehospital emergency record is a brief account of the patient's onset, identity information, past medical history, and other related content, recorded by the operator in a short period. The part of the prehospital emergency text describing the condition is similar to a general short text, but compared with general Chinese short texts, it has the following characteristics.
(1) More medical terminology. The prehospital emergency text is recorded by professional medical personnel according to the information patients provide about their illnesses. The text contains a large amount of medical terminology and lacks the deformed and irregular words found in texts from other fields.
(2) High text similarity. Texts are generally described around a theme. For example, short news texts record sports, finance, and other information, so there are significant differences in wording between the various types of short texts, and text similarity is low. In contrast, prehospital emergency texts mainly describe patients' illnesses, and the words used are repetitive, such as pain and discomfort. The presence of similar words makes the descriptions in prehospital emergency texts highly similar.
(3) Fixed text format. The prehospital emergency text consists of several parts, such as gender, age, current medical history, and past medical history, each of which describes certain information. In contrast, texts in other fields take various forms, and the information described in each part varies. Such standardized text is well suited to text classification but has been little studied.
Therefore, the DL-PER text classification model was developed based on the characteristics of prehospital emergency records. The main contributions of this work are summarized as follows.
(1) To address the problem of many medical terms in prehospital emergency records, we use a pre-trained model, BERT, to obtain word vectors. The prehospital emergency texts train the BERT model to obtain the semantic information of medical terms. Moreover, according to the dataset, we adjust the pre-trained language model task to improve the training efficiency of the model.
(2) To address the problem of high text similarity, we obtain text information from both global and local aspects by fusing the BiLSTM and TextCNN models and using meta activate or not (Meta-ACON) instead of ReLU to further improve the feature extraction ability of the model.
(3) We improve the model's local text feature extraction ability with a weighted text convolutional neural network, motivated by the fixed text format of the records.
The rest of this paper is structured as follows. We review the development of deep learning-based text classification models in section 2. Then, we develop the DL-PER model in section 3 and conduct analytical experiments on text classification in section 4. Finally, concluding remarks are presented in section 5.

II. RELATED WORK

A. RELATED WORKS
Research on medical text classification mainly focuses on extracting disease information. Researchers have used fusion, modification, and other methods to improve classification performance based on the features of medical text. For example, Li et al. [10] established a three-stage hybrid method combining a gated attention-based bidirectional long short-term memory network with a regular expression-based classifier for medical text classification tasks. To improve the classification accuracy of electronic medical records, Mu et al. [11] proposed a neural network-based association classification algorithm. Classification performance has also been improved by combining external medical knowledge sources [12]. To improve the classification efficiency of biomedical texts, Abdollahi et al. [13] used two new approaches to augment medical data and enrich the training data. Prabhakar et al. [14] used two novel deep learning architectures for medical text classification to reduce human effort. Moradi et al. [15] designed a new method called Biomedical Confidence Item Set Interpretation (BioCIE) to address the problem of post hoc interpretation of black-box biomedical text classification models. To address the ambiguous vocabulary problem in Chinese medical diagnosis, Liang et al. [16] improved a dual-channel mechanism as an essential enhancement of long short-term memory (LSTM). Ibrahim et al. [17] proposed a hybrid multi-label classification method based on generalized deep learning that can classify different types of biomedical texts. Menon et al. [18] used an efficient classification algorithm with attribute reduction based on PCA transformation.

B. RELATED METHODS
Text classification is an essential component in the field of natural language processing. Research scholars have studied the word embedding model, feature extraction model, and activation function to improve the model's classification accuracy.
In the field of word embedding models, word vectors generated by neural network language models can compute the similarity between words more accurately [19]. The improved word2vec word vector model obtains relationships between words through a bag-of-words formulation [20,21]. The word2vec-based unsupervised learning model doc2vec enhances the ability to learn texts of different lengths by adding paragraph vectors [22]. Bidirectional Encoder Representations from Transformers (BERT) [23] introduced the encoder of the Transformer [24] to encode text; through pre-training on a large corpus with dynamic, context-dependent assignment, the feature representation capability of the word vector model is greatly improved, effectively solving the problem of words with multiple meanings.
In the field of feature extraction models, researchers have focused on recurrent neural networks (RNN), convolutional neural networks (CNN), and model fusion. RNNs are commonly used to solve text classification problems because they can obtain global structural information about the text. Long short-term memory (LSTM) networks solve the vanishing-gradient problem of RNNs [25]. BiLSTM can obtain bidirectional semantic information and improve the feature extraction capability of the model [26].
RNN networks cannot extract local semantic information. CNNs can ignore the global context to extract locally important information. Kim [27] applied convolutional neural networks to text classification. The dynamic convolutional neural network algorithm solves the problem of classifying texts of different lengths [28]. Character-level CNN algorithms input characters rather than words into the model to obtain more generality [29]. Conneau et al. [30] proposed a deep CNN algorithm to improve classification performance. Although CNNs continue to improve, they still cannot escape the limitation of convolutional kernel size when obtaining global text information.
To make up for the shortcomings of RNN and CNN, researchers have made significant achievements by fusing the two models. Jin et al. [31] fused multi-scale CNN and LSTM. The model improved the accuracy of multi-task multi-scale sentiment classification. A deep learning framework combining BiLSTM and CNN was used to identify sentiment labels in psychiatric social texts to improve the model's classification performance [32]. Jin et al. [33] used BiLSTM with an attention mechanism to obtain contextual semantic features of the text and CNN to obtain local semantic features.
The activation function affects the classification performance of the model. The ReLU function [34] was proposed to solve the vanishing-gradient problem of Sigmoid and Tanh, but it suffers from the problem of "neuron death". To address this, PReLU [35], Maxout [36], CReLU [37], Mish [38], and ELU [39] were proposed. However, these functions cannot adaptively control the activation state of each neuron according to the data characteristics. Meta-ACON, proposed by Ma et al. [40], can flexibly control the activation state of neurons through learning, highlighting key information, and has achieved higher classification results in image processing.

III. THE DL-PER MODEL

A. BERT WORD VECTOR
BERT can encode words flexibly according to the contextual semantics, making the word vector more consistent with the full-text information. Training the BERT model with domain-specific text can obtain specialized lexical information and word associations in the domain text [41,42].
The BERT word vector composition is shown in Figure 1. Each sequence begins with the special classification marker [CLS], and the special marker [SEP] is placed later in the sequence to separate different sentences. The BERT word vector is the superposition of a word vector, a sentence vector, and a position vector: the word vector converts each word into a fixed-dimension vector; the sentence vector distinguishes different sentences with zeros and ones; the position vector labels the sequential position of the input text.
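As an illustration, the superposition of the three vectors can be sketched with toy embedding tables. The sizes and random tables below are stand-ins, not trained BERT weights, and the token ids are hypothetical:

```python
import numpy as np

# Toy sizes: vocabulary 100, 2 segments, max length 50, hidden dim 768.
rng = np.random.default_rng(0)
VOCAB, SEGMENTS, MAX_LEN, DIM = 100, 2, 50, 768
token_table = rng.normal(size=(VOCAB, DIM))
segment_table = rng.normal(size=(SEGMENTS, DIM))
position_table = rng.normal(size=(MAX_LEN, DIM))

def bert_input_embedding(token_ids, segment_ids):
    """Superimpose the word, sentence, and position vectors."""
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

# [CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]; segment ids 0 / 1.
token_ids = np.array([1, 7, 8, 2, 9, 10, 2])    # hypothetical ids
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
emb = bert_input_embedding(token_ids, segment_ids)
print(emb.shape)  # (7, 768)
```

Each input position thus carries its word identity, its sentence membership, and its order in the sequence in a single vector.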
BERT is pre-trained with two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM automatically masks some words in the input utterance and predicts them from the surrounding context. The masking mechanism selects each token position for prediction with a 15% probability in each training sequence. If the i-th token is selected, it is replaced in one of three ways.
1. With 80% probability, the word is replaced with [Mask]: "The patient inadvertently fell after drinking more than 10 minutes ago." becomes "The patient inadvertently [Mask] after drinking more than 10 minutes ago."
2. With 10% probability, the word is replaced with another word: "The patient inadvertently fell after drinking more than 10 minutes ago." becomes "The patient inadvertently had a headache after drinking more than 10 minutes ago."
3. With 10% probability, the word is left unchanged: "The patient inadvertently fell after drinking more than 10 minutes ago." stays the same.
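The 80/10/10 masking rule can be sketched as follows. The helper function, vocabulary, and word-level tokens are illustrative stand-ins, not the BERT implementation itself:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=None):
    """BERT-style masking: select ~15% of positions for prediction;
    of those, 80% become [MASK], 10% become a random vocabulary word,
    and 10% stay unchanged. Returns the corrupted sequence and the
    list of positions the model must predict."""
    rng = random.Random(seed)
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return out, targets

vocab = ["患者", "饮酒", "摔伤", "头痛", "昏迷"]
tokens = ["患者", "10", "多", "分钟", "前", "饮酒", "后", "不慎", "摔伤"]
masked, targets = mask_tokens(tokens, vocab, seed=3)
```

Positions not selected are passed through untouched; only the selected positions contribute to the MLM loss.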
NSP determines whether sentence B follows sentence A. The input is a pair of sentences A and B; in 50% of the training pairs, sentence B actually follows sentence A. NSP enables BERT to better capture logical and causal relationships between contexts.
The BERT model was fine-tuned to improve the ability of the model to extract features from prehospital emergency records.
(1) Increase the percentage of NSP tasks. The content of prehospital emergency text consists of multiple parts, and the content connection between the parts is weak. Therefore, a weighted approach is used to increase the proportion of NSP tasks to make the pre-trained model more suitable for prehospital emergency text data.
(2) Increase the Transformer depth. A pre-trained language model with shallow feature representation has difficulty extracting the rich content of prehospital emergency text, which affects the final classification result.

B. BiLSTM MODEL
BiLSTM can obtain the local semantic information of each part from a global perspective and thus derive more accurate global feature information from the prehospital emergency text. BiLSTM contains a forward LSTM and a backward LSTM whose outputs are connected in the same layer, solving the problem that a single LSTM cannot capture long-range contextual semantic information in both directions. The process of encoding the sentence [x1, x2, x3] is shown in Figure 2. The forward LSTM reads x1, x2, x3 in turn and produces the hidden state vectors h1l, h2l, h3l; the backward LSTM reads x3, x2, x1 in turn and produces the hidden state vectors h1r, h2r, h3r. Finally, the vectors from the two directions are spliced to obtain h1, h2, h3.
BiLSTM contains both forward and backward LSTMs, in which three gates influence the neuron states through activation functions. Assuming the input sequence is $(x_1, x_2, \ldots, x_T)$, where $x_t$ denotes the $t$-th word, the sequence is computed through the LSTM layers as (1)-(6):

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (2)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (3)
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (4)
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)

where $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, $c_t$ is the cell state, and $h_t$ is the hidden state. The forward and backward LSTMs apply these equations in opposite directions, and their hidden states are spliced at each step.
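The bidirectional encoding of [x1, x2, x3] can be sketched as follows. For brevity, a plain tanh recurrent cell stands in for the full gated LSTM cell, and the weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, HID = 8, 4                     # input dim, hidden dim per direction
Wx = rng.normal(scale=0.1, size=(DIM, HID))
Wh = rng.normal(scale=0.1, size=(HID, HID))

def run_direction(xs):
    """Run a simple recurrent cell over the sequence in order."""
    h = np.zeros(HID)
    states = []
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh)
        states.append(h)
    return states

def bilstm_like(xs):
    """Forward pass, backward pass, then splice the two states per step."""
    forward = run_direction(xs)                # h1l, h2l, h3l
    backward = run_direction(xs[::-1])[::-1]   # h1r, h2r, h3r
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

xs = [rng.normal(size=DIM) for _ in range(3)]  # x1, x2, x3
hs = bilstm_like(xs)                            # h1, h2, h3
print(len(hs), hs[0].shape)  # 3 (8,)
```

Each spliced vector is twice the per-direction hidden size, carrying context from both ends of the sequence.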

C. WTextCNN MODEL
TextCNN can extract local text features but has certain limitations, so this paper proposes a weighted TextCNN to improve the model's text feature extraction ability. The WTextCNN is shown in Figure 3. Its structure consists of two convolutional layers and a pooling layer. The WTextCNN is built from n-gram convolutional filters whose shared weights are applied across the whole input, giving global learning capability and translation invariance. The first convolutional layer extracts n-gram features at local positions; each convolution kernel learns an n-gram pattern of the text during training. The convolution result of the first layer is then used as a weight for the convolution kernels of the second layer. Kernels matching important n-gram features take larger values during the second convolution, while kernels matching irrelevant n-grams take smaller values, effectively capturing the important local features appearing in the text. Pooling then yields more accurate semantic features after combining with the global semantic information highlighted in the activation stage.
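The two-stage weighted convolution can be sketched as follows. The pooling and softmax normalization used to turn first-layer responses into kernel weights are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
SEQ, DIM, M, N = 10, 16, 4, 3   # sequence length, vector dim, kernels, n-gram size

def conv1d(X, kernels):
    """Slide each kernel over n-gram windows; returns (M, SEQ-N+1)."""
    windows = np.stack([X[i:i + N].ravel() for i in range(len(X) - N + 1)])
    return kernels.reshape(M, -1) @ windows.T

X = rng.normal(size=(SEQ, DIM))          # word vectors of one text
k1 = rng.normal(size=(M, N, DIM))        # first-layer kernels
k2 = rng.normal(size=(M, N, DIM))        # second-layer kernels

feat1 = np.maximum(conv1d(X, k1), 0)     # first layer scores each n-gram
w = feat1.max(axis=1)                     # pooled response per kernel
w = np.exp(w) / np.exp(w).sum()           # normalize into weights
feat2 = conv1d(X, k2 * w[:, None, None])  # second layer with weighted kernels
pooled = feat2.max(axis=1)                # max-over-time pooling
print(pooled.shape)  # (4,)
```

Kernels whose first-layer n-gram responses are strong receive larger weights, so the second convolution emphasizes the locally important features.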
Let $C = [c_1, c_2, \ldots, c_m]$ be the $m$ convolution kernels, and use $C$ to convolve the text $X$ as in (7) and (8):

$s_i^{(j)} = f(c_i \otimes X_{j:j+n-1} + b_i)$ (7)
$s_i = \mathrm{cat}(s_i^{(1)}, s_i^{(2)}, \ldots)$ (8)

where $\otimes$ is the convolution operation, $f$ is the activation function, cat is the splicing operation, and $n$ is the size of the convolution kernel. Each convolution kernel $c_i$ learns some n-gram pattern, and the learned responses are used to perform a weighted operation on the second-layer convolution kernels, computed as (9) and (10):

$w_i = \mathrm{pool}(s_i)$ (9)
$c_i' = w_i \cdot c_i$ (10)

Figure 4 illustrates the activation states: the yellow parts are activated neurons, and the white parts marked with x are inactivated neurons. The Meta-ACON function is an activation function that improves on ReLU and can learn whether or not to activate each neuron, processing the input vector to highlight key information. Its formula is given in (11):

$f(x) = (p_1 - p_2)\, x \cdot \sigma\big(\beta (p_1 - p_2)\, x\big) + p_2\, x$ (11)

where $p_1, p_2$ are learnable parameters that adaptively adjust the upper and lower bounds of the function, and $\beta$ is the activation factor that controls whether the neuron is activated. By learning, $p_1, p_2$ control the upper and lower bounds to obtain a more suitable smooth curvature. The image of the activation function is shown in Figure 5. The activation factor is computed as (12):

$\beta_c = \sigma\Big(W_1 W_2 \sum_{h=1}^{H} \sum_{w=1}^{W} x_{c,h,w}\Big)$ (12)

where $\sigma$ is the sigmoid function, $W_1, W_2$ are two convolution operations, and $c$, $H$, and $W$ are the channel, height, and width dimensions of the image, respectively.
In natural language processing, c is the number of input texts, H is the number of word vectors of the input texts, and W is the dimensionality of the word vectors.
The convolution operation is first performed on each word vector, and then the result is obtained by the sigmoid function.
The Meta-ACON activation function replaces the ReLU function to further improve the feature extraction capability of the model.
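A minimal sketch of the ACON-C form underlying Meta-ACON follows, using a scalar activation factor for clarity; in Meta-ACON the factor beta would itself be produced by the small learned network of (12):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def acon_c(x, p1, p2, beta):
    """ACON-C: f(x) = (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.
    p1 and p2 are learnable bounds; beta controls whether the neuron
    is activated (large beta -> switch-like, beta -> 0 -> linear)."""
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x

x = np.linspace(-3, 3, 7)
y = acon_c(x, p1=1.0, p2=0.0, beta=50.0)  # with these values, close to ReLU
```

With p1 = 1, p2 = 0, and a large beta the function approximates ReLU, while smaller beta values smoothly interpolate toward a linear map, which is what lets the network learn per-neuron activation behaviour.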

D. STRUCTURE OF THE DL-PER MODEL
This section introduces the framework of the DL-PER model, as shown in Figure 6. The DL-PER model works as follows: the original text is preprocessed into segmented text for the input layer, and word vectors are obtained with the fine-tuned BERT model. Then, global semantic information is obtained using BiLSTM, and local semantic information is extracted using WTextCNN. Finally, the prediction results are output by the classifier. The detailed procedure of the proposed DL-PER model is given in algorithm form in Table 1.

IV. EXPERIMENTS
In this section, we evaluate the performance of the DL-PER model on a real dataset provided by the prehospital emergency center in Handan, Hebei Province. DL-PER is built using TensorFlow 2.2. For text processing, jieba is used for word segmentation; it is a Chinese word segmentation tool that finds the maximum-probability path based on word frequency. Data preprocessing techniques are used to address the particular problems of prehospital emergency data. After a series of preprocessing operations, all datasets were randomly divided into training and test texts. The model parameter settings were then considered during model building. Finally, our model was evaluated against the baseline models using standard evaluation metrics.

A. DATASET

1) THE SOURCE AND CONTENT OF PREHOSPITAL EMERGENCY RECORDS
In this study, prehospital emergency records were retrieved from the Emergency Medical Rescue Command Centre of Handan, a municipal city in the central plain of China with a population of over ten million. The dataset includes all prehospital emergency records from 2018 to 2019. A senior physician labeled each prehospital diagnosis record according to the 10th revision of the International Classification of Diseases (ICD-10), which classifies diseases by rules based on specific disease characteristics and represents them using codes.
The prehospital emergency text consists of multiple parts of information. The layperson provides preliminary information about the patient, including age, gender, chief complaint, current medical history, and past medical history. The chief complaint and current medical history are complex, containing the patient's current status and etiology; some patients have no etiology in their chief complaint and current medical history. The professional makes a preliminary diagnosis of the patient based on the information provided. Examples of prehospital emergency records are shown in Table 2.

Table 2. Examples of prehospital emergency records.

Record 1 (age 65, female):
  Chief complaint: Coma; the patient appeared to be in respiratory and cardiac arrest more than 10 minutes ago.
  Current medical history: The patient was seen lying flat on the bed with loss of consciousness, no response to calls, loss of carotid artery pulsation, no respiration, and no heartbeat.
  Past medical history: Previous history of cerebral infarction, hypertension, and coronary artery disease.
  Diagnosis: Respiratory and cardiac arrest.

Record 2:
  Chief complaint: Swelling and pain in the left lower limb, found by the family half a day ago; a fracture sustained 10 days earlier.
  Current medical history: She called 120 and was retrieved by our hospital's ambulance; after retrieval, arteriovenous ultrasound of the left lower extremity showed venous thrombosis of the left lower extremity.
  Past medical history: Surgical treatment of a lower extremity fracture 9 days ago.
  Diagnosis: Lower extremity fracture.

2) PREHOSPITAL EMERGENCY RECORDS PREPROCESSING
The prehospital emergency records need to be preprocessed before being input into the model, as shown in Figure 7. First, medical experts re-label the records to ensure correct classification. Second, the information in each part of the record is stitched together into a single sentence. Then, because Chinese words are constructed differently from English words, a word segmentation tool is needed to segment the entire sentence. Finally, semantically irrelevant words, such as "ah", are deleted.
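The pipeline of Figure 7 can be sketched as follows. The record fields, stopword list, and character-level segmenter are illustrative stand-ins (in practice a word splitter such as jieba would supply the `segment` function):

```python
def preprocess_record(record, stopwords, segment):
    """Stitch the record's parts into one sentence, segment it into
    words, and drop semantically irrelevant words."""
    sentence = "".join([record["gender"], record["age"],
                        record["chief_complaint"],
                        record["current_history"],
                        record["past_history"]])
    words = segment(sentence)
    return [w for w in words if w not in stopwords]

# Toy record; character-level split (list) stands in for jieba.cut.
record = {"gender": "女", "age": "65", "chief_complaint": "昏迷啊",
          "current_history": "呼吸心跳骤停", "past_history": "高血压"}
tokens = preprocess_record(record, stopwords={"啊"}, segment=list)
```

The re-labeling step is a manual expert task and therefore sits outside this sketch; only the stitching, segmentation, and stopword removal are automated.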

3) GROUPING OF PREHOSPITAL EMERGENCY RECORDS
Prehospital emergency records are of various types and are unevenly distributed. They were divided into four groups according to the distribution of the number of types, as shown in Table 3.
Each group contained a different number and mix of disease types, including both trauma and cerebrovascular diseases. Group 1 has the largest amount of data and the most disease types, while group 4 has the least data.

B. MODEL EVALUATION INDICATORS
In this paper, the F1 score was used as the evaluation criterion for the classification performance of the model. The precision statistic measures how many of the results predicted as class A are correctly classified; the recall statistic measures how many of the samples labeled as class A are correctly classified. Precision and recall consider the result from two different perspectives and are often in tension.
The precision can be expressed as (16):

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (16)

The recall can be expressed as (17):

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (17)

The F1 score can be expressed as (18):

$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (18)

In the confusion matrix, TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false negative cases, respectively. The F1 score ranges between 0 and 1, and the larger the F1 score, the better the algorithm's performance.

[Displaced fragment of Table 3 — one group (label missing): heart disease, hypertensive cerebral hemorrhage, gastrointestinal hemorrhage, vomiting, eye injury, localized brain injury, chest injury, injury to a person in a vehicle accident, other and unspecified harmful effects affecting the cardiovascular system; 500 records, 9 classes. Group 4: hypoglycemia, epilepsy, acute myocardial infarction, acute heart failure, lumbar fractures, poisoning by drugs and pharmaceutical biologics, carbon monoxide poisoning, slips, trips and falls on the same plane, harmful effects of other and unspecified drugs that primarily affect the autonomic nervous system; 300 records, 9 classes.]
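For concreteness, formulas (16)-(18) can be computed directly from the confusion-matrix counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = 2*P*R/(P+R), following formulas (16)-(18)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With these counts, precision, recall, and F1 all come out to 0.8; in general the F1 score is the harmonic mean of the two and is pulled toward whichever is lower.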

C. THE SETTING OF MODEL PARAMETERS
This subsection discusses the parameter settings of four components (BERT vector, BiLSTM, WTextCNN, and optimizer) in building the DL-PER model. Taking group 1 as an example, we explain the detailed parameter configuration of each part below. BERT vector part: considering that the median length of the prehospital emergency text is 50, the text length is unified to 50, and the dimension of each word vector is 768.
BiLSTM part: Peters et al. [43] proposed a two-layer BiLSTM model that achieved a good representation of text features. The number of BiLSTM layers affects the training time and accuracy of the model. The number of layers was set to 1, 2, 3, and 4, respectively, and group 1 data were used for testing; the results are shown in Figure 8. As the number of layers increases, the F1 score gradually rises, but each training run also takes longer. A two-layer BiLSTM, which combines the highest F1 score with a fast training time, was therefore selected. WTextCNN part: the convolution kernel size was tested over [1,2,3,4,5,6,7,8,9,10], and the data from group 1 were used for training and testing; the results are shown in Figure 9. Li et al. [44] proposed using the kernel size with the highest accuracy together with sizes of comparable classification performance, so 3, 4, and 5 were chosen as the convolution kernel sizes. Dropout part: the test results are shown in Figure 10. The BiLSTM model achieves its highest value at a dropout rate of 0.5, and the WTextCNN model at 0.2; therefore, 0.5 was chosen as the dropout value for BiLSTM and 0.2 for WTextCNN. Optimizer part: the model uses Nadam, a variant of the Adam optimization algorithm with high computational efficiency and fast convergence. The learning rate, an essential hyperparameter in supervised and deep learning, determines whether and when the objective function converges to a local minimum; an appropriate learning rate allows convergence in a suitable time. The learning rate was set to [10^-1, 10^-2, 10^-3, 10^-4, 10^-5], and the model was trained and tested on group 1 data; the results are shown in Figure 11. A learning rate of 10^-4 was chosen.
According to the experimental test results, the hyperparameters of BERT, BiLSTM, WTextCNN, and optimizer are set as shown in Table 4.
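For reference, the selected hyperparameters can be collected in a single configuration; the grouping and key names below are illustrative, while the values are those chosen in this section:

```python
# Hyperparameters of the DL-PER components as selected above (Table 4).
DL_PER_CONFIG = {
    "bert": {"max_seq_length": 50, "word_vector_dim": 768},
    "bilstm": {"num_layers": 2, "dropout": 0.5},
    "wtextcnn": {"kernel_sizes": [3, 4, 5], "dropout": 0.2,
                 "activation": "Meta-ACON"},
    "optimizer": {"name": "Nadam", "learning_rate": 1e-4},
}
```

Keeping the configuration in one place makes it straightforward to rerun the per-group experiments with a single changed entry.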

D. RESULTS & MODEL COMPARISON
To validate the performance of the DL-PER model for prehospital emergency record classification, we built the model environment and trained the model on the training text. The classification effectiveness of the model on the test text is then evaluated with the metrics described above. The DL-PER model is compared with the baseline models LSTM, BiLSTM, TextCNN, BiLSTM-Attention, TextCNN-Attention, and BiLSTM-TextCNN. The word2vec word vector model is compared with BERT, and the Meta-ACON activation function is compared with the ReLU and PReLU functions.
In this paper, the DL-PER model is compared with models such as BiLSTM, TextCNN, and BiLSTM-Attention. The test results are shown in Table 5. DL-PER achieves better classification results on all four datasets. Compared with the BiLSTM model, which has the lowest F1 score, the DL-PER model improves the F1 scores by 5.7%, 6.8%, 5.7%, and 4.9% in the four groups. This shows that the DL-PER model acquires text features more accurately and improves text classification performance.
The experimental results show that the same model has different classification effects on four data groups. Group 4 had the highest F1 score, and group 2 had the lowest F1 score. The data of group 2 contained many identical words, such as "dizziness" and "pain" appearing in each label of group 2. The occurrence of identical words increases classification difficulty and demands the model to obtain deeper text features.
The DL-PER model achieves higher F1 scores than other comparison classification models in the four sets of test results. The model can effectively extract local semantic information from prehospital emergency text. The difference between the F1 scores of BiLSTM and TextCNN is minor. It may be because BiLSTM has difficulty extracting local text features from deep levels. In contrast, TextCNN has limited feature extraction ability due to the limitation of convolutional kernel size, so these two models have difficulty achieving high F1 scores. The F1 scores of BiLSTM-Attention, TextCNN-Attention, and BiLSTM-TextCNN are higher than BiLSTM and TextCNN models. It indicates that the fusion model can accurately acquire text features and improve the model's classification performance.
The DL-PER model further enhances the ability to acquire local information in the text using WTextCNN after acquiring long-range text information. Compared with a TextCNN layer, the WTextCNN layer is more capable of acquiring hidden local text information, thus improving the text classification performance of the model. This paper uses the word2vec (WV) and BERT models to obtain text word vectors. The word vectors of word2vec and BERT are input to the BiLSTM, TextCNN, and BiLSTM-Attention models, respectively, for training. The final test results are shown in Figure 12. The experimental results show that the F1 score of the model using BERT word vectors is significantly higher than that using word2vec. This shows that BERT word vectors can fully exploit the semantic information between texts and improve the model's classification performance compared with word2vec word vectors.
The activation function of WTextCNN in the DL-PER model was tested with ReLU, PReLU, and Meta-ACON, respectively. The experimental results are shown in Table 6. They show that the F1 score of the model with Meta-ACON is higher than with ReLU and PReLU. The Meta-ACON function determines the activation state of each neuron by learning, highlighting key semantic information better than ReLU and PReLU. The accuracy variation during model training is shown in Figure 13. In Figure 13 (a), (b), and (c), the model with the Meta-ACON activation function converges faster and reaches higher accuracy than with ReLU and PReLU, indicating that Meta-ACON can speed up training and improve the model's classification performance. However, Figure 13(d) shows that the model converges at the same rate for each activation function on group 4. This may be because the amount of data in group 4 is too small to highlight the advantage of Meta-ACON.

V. CONCLUSION
Prehospital emergency texts have characteristics such as many medical terms, short length, and a standardized format. In this paper, we propose the DL-PER text classification model to solve the problem of prehospital emergency text classification. Through the experiments, we have made the following findings.
(1) The BERT model is better than the word2vec model at obtaining semantic information in specialized domains. Training it on domain text improves the model's feature extraction ability and addresses the abundance of specialized terms in the professional domain.
(2) The weighted convolution operation effectively improves the local feature extraction ability of the model. After combining with the global text features extracted by BiLSTM, the feature extraction ability of the model is significantly improved.
(3) The Meta-ACON activation function improves both the training efficiency and the feature extraction ability of the model. However, its advantage is less evident when the amount of data is small.
The DL-PER model obtains the contextual semantic information of long-range text through BiLSTM, and its local information extraction capability is improved by WTextCNN. In this way, the model can obtain more accurate semantic information about the key semantics of the text. We use the Meta-ACON activation function instead of the ReLU function to improve the model's performance. The effectiveness of DL-PER is demonstrated by experiments on real prehospital emergency records.