Regular Expression Based Medical Text Classification Using Constructive Heuristic Approach

Medical text classification assigns medical-related text to categories such as topics or disease types. Machine learning based techniques have been widely used to perform such tasks despite the obvious drawback of such a “black box” approach, which leaves no easy way to fine-tune the resultant model for better performance. We propose a novel constructive heuristic approach to generate a set of regular expressions that can be used as effective text classifiers. The main innovation of our approach is a regular expression based text classifier with both satisfactory classification performance and excellent interpretability. We evaluate our framework on real-world medical data provided by our collaborator, one of the largest online healthcare providers in the market, and observe the high performance and consistency of this approach. Experimental results show that the machine-generated regular expressions can be effectively used in conjunction with machine learning techniques to perform medical text classification tasks. The proposed methodology improves the performance of baseline methods (Naive Bayes and Support Vector Machines) by 9% in precision and 4.5% in recall. We also evaluate the performance of regular expressions modified by human experts and demonstrate the potential for practical applications of the proposed method.


I. INTRODUCTION
Despite the popularity of Electronic Medical Record systems, there is still a large amount of unstructured text data in the medical domain. Classifying such data into useful categories such as topics or disease types by computer can significantly reduce human effort and provide useful information for hospitals and medical services, especially the popular online healthcare services. However, classification in the medical domain is usually challenging because a great amount of domain knowledge is required to solve even a seemingly simple problem [1]. In the past decades, many techniques and algorithms have been developed for medical data mining and classification tasks. The prevailing approaches are machine learning algorithms such as Support Vector Machines (SVM) [2] and Latent Dirichlet Allocation (LDA) [3]. The application of Artificial Neural Networks (ANN) to medical fields has rapidly gained popularity [4] after their first appearance in the early 1990s [5]. While producing promising results, the models or solutions from these techniques are usually not interpretable by humans. Human experts cannot directly fine-tune the models when the solution does not satisfy the high precision requirement of medical decision making.
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Afzal.
Unlike dealing with other types of texts, medical text processing is unique in that reliability verification by domain experts is often required. Health experts are inclined to verify the evidence that supports decision making and do not trust systems that act as black boxes. A regular expression based approach can be used to tackle this problem because of its great interpretability. The regular expression, also known as regexp or regex, is a classical means for string pattern matching. Regular expressions have long been regarded as efficient tools in a wide range of application domains such as information extraction and text mining. While manually constructing regular expressions is obviously time-consuming, error-prone, and experience-dependent, there has been little work indicating that automatically generated regular expressions can give performance comparable to manual work. One of the main challenges in learning regular expressions by computer is the huge search space due to the large number of candidate words and their combinations through different operators. In addition, most of the previous work is designed without considering the capability of the model to be fine-tuned by human experts.
In online medical guidance, the narrative clinical texts written by patients often contain typos, misspellings, abbreviations, non-standard jargon, as well as incomplete sentences [6]. The oral expression of medical terms is therefore difficult to process with natural language processing (NLP) tools developed for ordinary text [7]. To address these issues, we investigate an automated regular expression generation method to classify medical texts in order to provide informative and comprehensive human-like medical guidance. Our framework is based on a constructive heuristic procedure carefully tailored to meet the specific demands of generating regular expressions in a real-world application. The model has proved capable of addressing the medical text classification task on a realistic data set.
In our opinion, medical text classification approaches should aim to achieve better performance (in terms of precision and recall, for example) and at the same time allow human experts to modify the solutions for even better results. In this research, we call a solution interpretable if and only if the solution can be comprehended and further enhanced by humans. Our regular expression based system is transparent and interpretable, so domain experts can make further modifications, whereas a system that uses sophisticated and not easy-to-understand machine learning techniques may require additional effort to achieve this goal.
The main contributions of this work are summarized as follows:
1) A regular expression based text classifier that alleviates the “black box” problem which prevails in machine learning algorithms. The introduced regular expression structure makes it easy for human experts to understand and modify the solution for better results.
2) A novel constructive heuristic method that considers both the classification performance and the interpretability of regular expressions. The automated construction of regular expressions achieves classification performance comparable to the manual approach and reduces the labor and time costs. The proposed method can be used in conjunction with advanced machine learning methods for better overall performance.
3) A fully working system whose performance has been evaluated with massive amounts of real-world medical data. This work is an innovative attempt to produce interpretable medical decision support with much practical value.
The remainder of this paper is organized as follows. Section II discusses the related work with regard to clinical text classification and regular expression implementations. The problem scenario is described in Section III. The proposed constructive heuristic method to generate regular expressions is presented in Section IV. In Section V, experimental results are provided to demonstrate the performance of the proposed approach. The paper is concluded in Section VI.

II. RELATED WORK
Text classification has been studied in recent years with the efforts focusing on learning based methods. Problems such as classifying cancer [8], patient record notes [9], intensive care unit (ICU) procedures and diagnosis [10], and other text documents [11] are solved by prevailing machine learning approaches with satisfactory results. With the development of deep learning techniques, many neural network models are designed for text classification tasks. Convolutional Neural Networks (CNNs) that learn high-level features have shown competitive results in sentence modeling [12]-[14], sentiment [15] and semantic [16] classification, email classification [17], online medical guidance [18], and other domains [19]-[21]. Recurrent Neural Networks (RNNs) also achieve state-of-the-art performance on a number of clinical data mining tasks [22]-[25]. Learning based methods often produce very promising results, but their drawback is also obvious: their solutions are not interpretable, leaving no easy way for humans to fine-tune the solutions with their own domain knowledge.
Despite the promising performance that learning based methods show, regular expressions are often used in situations where interpretable results are needed. Regular expressions are applied to perform several NLP tasks such as text classification, information extraction, and automatic summarization. To reduce human effort, a wealth of research effort has been devoted to automatically synthesizing regular expressions for interpretable solutions [26], [27]. However, these studies often assume that the target regular expression is small and compact, thereby allowing the learning algorithm to exploit the information efficiently. In addition, most of these works consider theoretical problems that are not inspired by any real-world applications [28], and the applicability of the corresponding methods is still largely unexplored. Attempts at learning regular expressions over real text were later introduced to detect Hyper Text Markup Language (HTML) lines [29] and spam emails [30], [31]. In the medical domain, many regular expression based approaches have been adopted in various tasks including symptom classification [32] and extraction of medical information such as blood pressure [33], ejection fraction [34], and bodyweight values [35] from clinical notes. Although producing promising results, these applications aim to deal with very specific and often simpler problems, and hence have trouble coping with a much longer sequence of symbols from a much larger alphabet. In contrast, our study aims to present a more generalizable approach for automated regular expression learning to tackle text classification problems. The proposed framework, inspired by real-world practice, is not restricted by the aforementioned limitations of previous attempts. The sheer amount of data collected from online medical services dramatically increases the alphabet size, and we exploit syntactic constructs (such as synonym recognition and correlation analysis) accordingly to enable this scale-up.
Combining machine learning with explicit rules has been studied for decades and rule-based solutions have demonstrated competitive performance [36], [37]. In general, two types of hybrid systems are developed to combine rule-based algorithms with machine learning techniques. One approach utilizes rules to verify the machine learning output [38], [39], while a more prevalent approach leverages rule-based algorithms to precisely identify the desired features to feed machine learning models. In the medical domain, such methods have been explored to solve NLP tasks such as clinical text classification [40], [41], entity extraction [42], [43], and relation detection [44]-[47]. While the performance of machine learning, especially deep learning models, is bound by the quantity and quality of annotated data, rule-based feature extraction effectively exploits the available training data and gives a clear boost to machine learning models. In the recent literature, Zhang et al. [48] utilized simple regular expressions collected online and composed manually to produce weak labels for the entity mentions over a large document corpus, and a neural network was trained based on the regex-generated weak labels. Instead of massive labeling, human experts only need to label a small set of documents to fine-tune the neural network. Luo et al. [49] incorporated knowledge of regular expressions into the training of neural networks to solve typical spoken language understanding (SLU) tasks. Experiments demonstrate that the learning performance can be significantly improved by the implicit knowledge encoded within regular expressions.
In clinical text classification, Wang et al. [40] presented a classification paradigm using weak supervision and deep representation to reduce human effort. Weak supervision is achieved by a rule-based NLP algorithm that automatically generates labels, and then the pre-trained word embeddings are used as deep representation features for training machine learning models. Similarly, Yao et al. [41] proposed a clinical text classification method that combines rule-based features to identify trigger phrases and a knowledge-guided CNN for disease classification. Experimental evaluations have validated the possibility of combining rule-based algorithms and machine learning techniques to achieve impressive accuracy while requiring modest human effort. The rules applied by these works are, however, relatively simple, restricted to specific domains, and ineffective for complex multi-class classification tasks [40]. Deep learning techniques are used to leverage imperfect rules for higher accuracy. Our work remedies this limitation by developing compact and easily interpretable regular expressions which can be utilized as independent text classifiers with satisfactory performance.
The contribution of our work is that we focus on automating the construction of a complete, informative, and well-generalized rule-based algorithm. The regular expression based classifier not only achieves promising results on its own, but also serves as a valid secondary verification to correct the instances misclassified by traditional machine learning and deep learning approaches.

III. SCENARIO
In this paper, we are concerned with generating regular expression based classifiers to solve text classification tasks with sample data, i.e., strings annotated with their desired classes. The problem statement, along with the notation used hereafter, is defined in this section.

A. PROBLEM DESCRIPTION
Formally, the problem can be defined as follows: given a set of predefined classes $C$ and a set of text inquiries $Q$, the task is to classify each inquiry $q \in Q$ to one of the classes $c \in C$. The set of text inquiries that belong to the same class $c$ is denoted by $Q_c$. This is a typical text classification problem often solved by supervised machine learning approaches.
Each solution to the problem (a regular expression based classifier for one class) is encoded as a vector of concatenated regular expressions
$$[R_1, R_2, \ldots, R_t]. \tag{1}$$
To check whether a text inquiry belongs to a particular class, the regular expressions in the vector are matched sequentially, in the order of the vector, against the text inquiry under consideration. The text inquiry is classified to the particular class if it is matched by any of the regular expressions in the classifier. Therefore, this task is treated as a binary classification.
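The matching rule above (a class is assigned as soon as any regular expression in the vector matches) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `classify` and the example patterns are hypothetical and do not come from the system described in this paper.

```python
import re

def classify(inquiry, patterns):
    """Return True if any regular expression in the ordered vector matches."""
    for pattern in patterns:  # matched sequentially, in vector order
        if re.search(pattern, inquiry):
            return True
    return False

# Hypothetical classifier for one class; the patterns are made up for illustration.
diarrhea_classifier = [r"diarrhea|loose stool", r"stomach.{0,20}watery"]

is_positive = classify("my child has loose stool since monday", diarrhea_classifier)
is_negative = classify("pain in my left knee", diarrhea_classifier)
```

Because each class has its own vector, a multi-class decision can be obtained by running one such binary check per class.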

B. REGULAR EXPRESSION DEFINITION
Each regular expression $R_i$ is derived via a combination of functions and terminals defined in Table 1 and follows a global structure of two parts $P_i$ and $N_i$ concatenated by the NOT function, denoted as #_#. That is, each regular expression $R_i$ has the following format:
$$R_i = P_i \;\#\_\#\; N_i \tag{2}$$
where $P_i$ tries to match all positive text inquiries and $N_i$ is used to filter out the text inquiries wrongly matched by $P_i$. Note that when the positive part $P_i$ alone is precise enough for correct classification (i.e., the precision of $P_i$ exceeds a predefined threshold), $N_i$ is not needed. In this case, the structure of $R_i$ is simplified to:
$$R_i = P_i \tag{3}$$
Let us define the expression $e_i$ as a collection of words combined with the OR function, which can be expressed as:
$$e_i = w_1 \,|\, w_2 \,|\, \cdots \,|\, w_n \tag{4}$$
where $n$ is the number of words in the expression. The positive part $P_i$ is either a single expression $e^k_i$ consisting of keywords $w^k_i$, or a concatenation of $e^k_i$ and $e^r_i$ (a collection of related words $w^r_i$) with the distance function. The negative part $N_i$ is a single expression $e^n_i$ consisting of negative words $w^n_i$. We will describe how to select $e^k_i$, $e^r_i$, and $e^n_i$ in detail in Section IV-C. Formally, the two parts can be expressed as follows:
$$P_i = e^k_i \ \text{ or } \ e^k_i\,\{a, b\}\,e^r_i, \qquad N_i = e^n_i \tag{5}$$
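To make the two-part structure concrete, the following Python sketch evaluates a rule written as a positive part and a negative part joined by the #_# separator. It is a minimal illustration under stated assumptions: the helper names `or_expression` and `rule_matches` are hypothetical, the word lists are invented, and the gap `.{0,20}` counts characters rather than tokens purely to keep the example short.

```python
import re

NOT = "#_#"  # separator between the positive part P and the negative part N

def or_expression(words):
    """Combine words with the OR function, e.g. ['a', 'b'] -> '(?:a|b)'."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

def rule_matches(rule, text):
    """A rule 'P#_#N' matches when P matches and N, if present, does not."""
    positive, _, negative = rule.partition(NOT)
    if re.search(positive, text) is None:
        return False
    return not negative or re.search(negative, text) is None

keywords = or_expression(["cough", "coughing"])   # plays the role of e^k
related = or_expression(["night", "sleep"])       # plays the role of e^r
negative = or_expression(["blood"])               # plays the role of e^n
rule = keywords + ".{0,20}" + related + NOT + negative
```

A rule without a negative part, such as the plain string `"fever"`, is handled by the same function, which mirrors the simplified form in which only the positive part is used.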

C. GENERALITY OF A REGULAR EXPRESSION
Overfitting is one of the common issues in classification problems, where the classifier performs well with the training data but poorly with the testing data. In order to address this issue, we measure the distinctiveness of the words used in the expression compared with words present in negative samples. We define the average word frequency $f^c_w$ as the number of times a given word $w \in W$ is present in all text inquiries in $Q_c$, divided by the total number of sentences of all text inquiries in $Q_c$. That is,
$$f^c_w = \frac{\sum_{q \in Q_c} n_q(w)}{\sum_{q \in Q_c} s_q}$$
where $n_q(w)$ is the number of occurrences of $w$ in inquiry $q$ and $s_q$ is the number of sentences in $q$.
In a larger sense, a generalizable classifier should be able to learn features directly from data, independent of seed patterns, as explained in former parts. Therefore, our model is designed by employing a bottom-up constructive approach which starts from learning similarities and finding patterns rather than modifying seed patterns that are supplied by domain experts.

D. INTERPRETABILITY
In addition to the fitness measurement, we introduce the term interpretability of the solution to the problem in this paper. In our setting, the interpretability of a solution is highly domain specific or medical related. Specifically, the solution should contain keywords and demonstrate the relationship and interaction between keywords. With the help of domain experts, we develop the co-occurrence matrix and apply special operators to generate regular expressions $R_i$ for better interpretability.

1) CO-OCCURRENCE MATRIX
A regular expression with good interpretability should be able to identify the hidden pattern of texts in a given class, and the words in the expression should be related to each other. To achieve this, we build a co-occurrence matrix to suggest word correlations and determine the distance between phrases. Co-occurrence here refers to the frequency of two words occurring together in a certain order in every text inquiry of the input data. Apart from the frequency count, word distance information is also kept in the matrix. Let $p$ be the size of the vocabulary in the whole corpus. A matrix of size $p \times p \times 2$ will be produced. Specifically, we define the co-occurrence matrix $M$ whose elements are calculated as follows:
$$M_{i,j,1} = \big|\{\, q \in Q : \mathrm{pos}_q(i) < \mathrm{pos}_q(j) \,\}\big|, \qquad M_{i,j,2} = \mathrm{dist}(i, j)$$
where $i$ and $j$ indicate the $i$-th and $j$-th words $w_i$ and $w_j$ respectively, $0 \le i, j \le p$, and $\mathrm{pos}_q(*)$ represents the index (position) of a given word in text inquiry $q$. Word distance is calculated by averaging the index difference, i.e., the value of $\mathrm{pos}_q(j) - \mathrm{pos}_q(i)$, over the inquiries in which $\mathrm{pos}_q(i) < \mathrm{pos}_q(j)$. Explicitly, let $Q'$ be the set of inquiries that contain both the $i$-th and $j$-th words with $\mathrm{pos}_q(i) < \mathrm{pos}_q(j)$, that is,
$$Q' = \{\, q \in Q \mid w_i \in q,\ w_j \in q,\ \mathrm{pos}_q(i) < \mathrm{pos}_q(j) \,\}.$$
The distance between the $i$-th and $j$-th words can then be computed as:
$$\mathrm{dist}(i, j) = \frac{1}{|Q'|} \sum_{q \in Q'} \big(\mathrm{pos}_q(j) - \mathrm{pos}_q(i)\big)$$
where $|Q'|$ is the number of inquiries in $Q'$. The co-occurrence matrix $M$ will be used in later stages to generate regular expressions.
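The ordered co-occurrence statistics described above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: it stores the result in a dictionary keyed by word pairs rather than a dense $p \times p \times 2$ array, it assumes each inquiry is already tokenized, and it takes the position of a word to be the index of its first occurrence (an assumption, since repeated words are not discussed in the text).

```python
from collections import defaultdict

def cooccurrence(inquiries):
    """Map (w_i, w_j) -> (frequency, average distance) for w_i occurring before w_j.

    Each inquiry is a list of tokens. The position of a word is taken as the
    index of its first occurrence in the inquiry (an assumption of this sketch).
    """
    count = defaultdict(int)
    total_distance = defaultdict(int)
    for tokens in inquiries:
        first_pos = {}
        for index, word in enumerate(tokens):
            first_pos.setdefault(word, index)
        for wi, pi in first_pos.items():
            for wj, pj in first_pos.items():
                if pi < pj:  # only count pairs in this order
                    count[(wi, wj)] += 1
                    total_distance[(wi, wj)] += pj - pi
    return {pair: (n, total_distance[pair] / n) for pair, n in count.items()}

# Toy tokenized inquiries, invented for illustration.
inquiries = [
    ["baby", "fever", "cough"],
    ["fever", "night", "cough"],
    ["cough", "fever"],
]
matrix = cooccurrence(inquiries)
fever_then_cough = matrix[("fever", "cough")]  # two inquiries, distances 1 and 2
```

The pair ("cough", "fever") is stored separately from ("fever", "cough"), which reflects that co-occurrence is counted in a certain order.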

2) SPECIAL OPERATORS
The use of some special regular expression operators during the generation process can lead to a shorter solution, hence increasing the interpretability. In particular, the special operators "?!" (zero-width negative look-ahead assertion) and "?<!" (zero-width negative look-behind assertion) are used for negation apart from $N_i$. Negative look-ahead and look-behind assertions are indispensable when we want to match something not followed or preceded by something else. For example, if we intend to match a c not followed by a b, negative look-ahead provides the solution: c(?!b). Similarly, (?<!b)c matches a c that is not preceded by a b using negative look-behind. These special negation operators are often used to eliminate undesired words or short phrases instead of long patterns, making the regular expression easier to read, especially when it comes to identifying complex patterns.
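The two assertions can be checked directly with Python's standard `re` module. The short sketch below uses the same toy patterns as the text; the input strings are invented for illustration.

```python
import re

# c(?!b): positions of every "c" that is NOT followed by "b"
not_followed = [m.start() for m in re.finditer(r"c(?!b)", "cb ca c")]

# (?<!b)c: positions of every "c" that is NOT preceded by "b"
not_preceded = [m.start() for m in re.finditer(r"(?<!b)c", "bc ac c")]
```

In the first string the "c" at index 0 is rejected because a "b" follows it; in the second string the "c" at index 1 is rejected because a "b" precedes it. Both assertions are zero-width, so the rejected or accepted neighbor is never part of the match itself.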

IV. REGULAR EXPRESSIONS GENERATION
In this section, we present the overall framework of the automatic generation of regular expressions using a constructive heuristic approach. Different from local search heuristics that improve a complete solution locally, a constructive heuristic method starts from an empty solution and then iteratively expands the current solution until a complete solution is constructed. Our method constructs a set of regular expressions directly from input data based on an iterative process and aims to find solutions with the desired precision and recall as well as good interpretability.
In our setting, the input training data to our framework is a set of medical text inquiries labeled with $|C|$ classes. The solution, a set of regular expressions, is constructed for every class, transforming the multi-class classification task into $|C|$ binary classification tasks. It is acknowledged that a key challenge in learning regular expressions is the huge search space of candidates, since factors such as semantic similarity, word order, and distance between phrases should all be taken into consideration. To reduce the search space and build valid expressions, two methods have been introduced: the co-occurrence matrix is built to capture word correlations both grammatically and semantically, and the parallel labels $C$ are clustered by their similarities to produce hierarchical labels.
For a given class $c$, the training data is divided into positive and negative sets based on labels. We then calculate the comparative frequency of each word in the two sets and select feature words accordingly. Regular expressions are generated based on the calculation of similarities and co-occurrence between words by predefined filtering mechanisms. The iterative process stops when predefined evaluation criteria are satisfied, i.e., when the fitness score is above a certain threshold or no more additions to the existing solutions can be found. The proposed framework is illustrated in Fig. 1.

A. PRE-PROCESSING
During the preprocessing step, we apply the state-of-the-art Chinese word segmentation method jieba to every inquiry text in the data set. Duplicated strings are merged to prevent redundant processing. Parallel labels $C$ are clustered into several subclasses according to their similarities. The set $C$ can be expressed as a collection of subsets:
$$C = \{C_1, C_2, \ldots, C_m\}, \qquad C_i = \{c^1_i, c^2_i, \ldots\}$$
Given a class $c = c^j_i$, all text inquiries $q \in Q_c$ are treated as positive samples while the rest of the inquiries $q \in Q_{\bar{c}}$ are treated as negative samples, where $\bar{c}$ is defined as the complement of $c^j_i$ given the complete set $C_i$. Stop words, symbols, punctuation, etc. are removed from the tokenized data to obtain the positive word set $W_c$ and the negative word set $W_{\bar{c}}$.

B. FEATURE SELECTION
We term feature words as a set of words that are related to the class topic and contain domain knowledge. The selection of feature words is based on the relevance of a given word to the selected class. Empirically, the more often a word occurs in a class, the more relevant it is to the class; the more often a word occurs throughout all inquiries, the more poorly it discriminates between classes. Therefore, the average word frequencies $f^c_w$ and $f^{\bar{c}}_w$ are calculated for every word $w$ in $W_c$. If the ratio $f^c_w / f^{\bar{c}}_w$ exceeds $\lambda_f$, where $\lambda_f$ is a predefined threshold, $w$ is regarded as a feature word and kept in $W_c$; otherwise it is removed from $W_c$. Word set $W_{\bar{c}}$ is filtered similarly. Keywords $w^k$ and related words $w^r$ are all chosen from $W_c$, while negative words $w^n$ are chosen from $W_{\bar{c}}$.
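A minimal Python sketch of frequency-based feature selection is given below. It assumes that a word is kept when the ratio of its average frequency in the positive set to its average frequency in the negative set exceeds a threshold. Several details are assumptions of the sketch rather than statements about the actual system: every inquiry is treated as a single sentence, a small smoothing constant avoids division by zero for words absent from the negative set, and the function names are hypothetical.

```python
from collections import Counter

def average_word_frequency(inquiries):
    """Occurrences of each word divided by the number of inquiries.

    Each inquiry is treated as one sentence (an assumption of this sketch).
    """
    counts = Counter(word for tokens in inquiries for word in tokens)
    return {word: n / len(inquiries) for word, n in counts.items()}

def select_feature_words(positive, negative, threshold, smoothing=1e-6):
    """Keep words whose positive/negative frequency ratio exceeds the threshold."""
    f_pos = average_word_frequency(positive)
    f_neg = average_word_frequency(negative)
    return {
        word for word, freq in f_pos.items()
        if freq / (f_neg.get(word, 0.0) + smoothing) > threshold
    }

# Toy tokenized data, invented for illustration.
positive = [["diarrhea", "child"], ["diarrhea", "water"]]
negative = [["child", "fever"], ["child", "cough"],
            ["water", "fever"], ["fever", "diarrhea"]]
features = select_feature_words(positive, negative, threshold=3)
```

Here "diarrhea" has a ratio of about 4 and is kept, while "child" (ratio about 1) and "water" (ratio about 2) are discarded.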
Grouping synonyms together in the regular expression is a good indicator of interpretability. We measure the correlation among words by their semantic similarity. Word embeddings assign a low-dimensional vector representation to each word such that semantically similar words are close to each other in the vector space [50]. The semantic correlation between two words is therefore quantified by the cosine similarity between their corresponding vector representations. Word2vec can be trained efficiently over a large-scale unannotated corpus and encodes meaningful linguistic relationships between words into the learned word embeddings. We train our word2vec model on tens of millions of text records in the medical domain to produce effective word embeddings. We search the embedding dimension over {50, 100, 200, 300, 400} and find that 100-dimensional word embeddings give good performance while requiring modest computation time.
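For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The three-dimensional vectors are toy values invented for illustration; they are not taken from the trained word2vec model, whose embeddings have 100 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (hypothetical values).
embeddings = {
    "diarrhea": [0.9, 0.1, 0.0],
    "loose_stool": [0.8, 0.2, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(embeddings["diarrhea"], embeddings["loose_stool"])
dissimilar = cosine_similarity(embeddings["diarrhea"], embeddings["fracture"])
```

With these toy vectors the first pair scores close to 1 and the second pair close to 0, which is the behavior a similarity threshold relies on.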
To improve the interpretability of regular expressions, we cluster the feature words in every word set $W$ into several groups $G$ according to their similarity, where every group $G$ contains synonyms or similar words. Similarity scores between words are measured using the word2vec distributed representation. Word pairs with similarity above a given threshold $\lambda_s$ and with common words are clustered into one group $G$. Therefore, word sets $W_c$ and $W_{\bar{c}}$ can be expressed as:
$$W_c = \{G^c_1, G^c_2, \ldots, G^c_{k_1}\}, \qquad W_{\bar{c}} = \{G^{\bar{c}}_1, G^{\bar{c}}_2, \ldots, G^{\bar{c}}_{k_2}\}$$
where $k_1$ and $k_2$ are the total numbers of groups in $W_c$ and $W_{\bar{c}}$ respectively, and every word group $G$ is expressed in the format defined in (4), which can be regarded as a valid regular expression. The maximum number of word groups in both $W_c$ and $W_{\bar{c}}$ is restricted to 50 (i.e., $k_1, k_2 \le 50$) for more efficient calculation. Word groups that contain the fewest words are eliminated.
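One way to read the grouping rule (similar pairs that share a word end up in the same group) is as finding connected components over the "similar" relation. The Python sketch below follows that reading using a small union-find structure. It is an interpretation offered for illustration, not a description of the actual clustering code; the function `group_words` and the toy similarity table are hypothetical.

```python
def group_words(words, similarity, threshold):
    """Merge words into groups whenever a pair's similarity exceeds the threshold.

    Pairs that share a word are merged transitively (connected components).
    Each group is returned as an OR expression such as 'a|b|c'.
    """
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if similarity(a, b) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return ["|".join(members) for members in groups.values()]

# Toy similarity scores (hypothetical); unlisted pairs default to 0.1.
scores = {
    frozenset(["diarrhea", "loose stool"]): 0.80,
    frozenset(["loose stool", "watery stool"]): 0.75,
}

def toy_similarity(a, b):
    return scores.get(frozenset([a, b]), 0.1)

words = ["diarrhea", "loose stool", "watery stool", "fracture"]
groups = group_words(words, toy_similarity, threshold=0.7)
```

Note that "diarrhea" and "watery stool" are grouped together even though their direct similarity is low, because both are similar to "loose stool".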

C. CONSTRUCTIVE HEURISTIC METHOD
Positive part $P_i$ (line 2 to line 16 in Algorithm 1). As mentioned in (5), the positive part $P_i$ is expressed in either of two formats: $e^k_i$ or $e^k_i\{a, b\}e^r_i$. The word group in $W_c$ with the highest recall is selected as $e^k_i$ to form the positive part $P_i$. Two parameters $\lambda_{p1}$ and $\lambda_{p2}$ are set to evaluate the precision of $e^k_i$, where $\lambda_{p1} > \lambda_{p2}$. The choice between the two formats depends on the precision of $e^k_i$ relative to the threshold $\lambda_{p2}$. For clearer formulation, we define the function $f_p(e)$ as the precision of expression $e$ and $f_r(e)$ as its recall.
• If the precision of $e^k_i$ is higher than $\lambda_{p2}$, the expression $e^k_i$ alone is considered to be specific enough to define topic-related knowledge for the given class, thus $P_i$ is expressed as $e^k_i$.
• If the precision of $e^k_i$ is not higher than $\lambda_{p2}$, another expression $e^r_i$ is needed to further specify the pattern, thus $P_i$ is expressed as $e^k_i\{a, b\}e^r_i$. The expression $e^r_i$ is selected based on the co-occurrence matrix $M$.
Distance control is realized by the distance operator, which is denoted by $\{a, b\}$ in a regular expression. It restricts the number of tokens between phrases, where $a$ is the minimum token length required and $b$ is the maximum token length allowed. Compared with concatenating phrases with the and operator, which imposes no distance restriction between phrases, regular expressions with distance control are less likely to match false positive snippets. In our implementation, $a$ is set to zero, and the value of $b$ is the average word distance calculated in the co-occurrence matrix $M$ introduced in Section III-D.
[Algorithm 1: generation of regular expression $R_i$. Recoverable fragment: for each $G^c_j$ in $W_c$ do; 5: if $f_p(G^c_j) \ge \lambda_{p2}$ then; ...; return $R_i$; 30: end procedure.]
Negative part $N_i$ (line 17 to line 28 in Algorithm 1). The negative part $N_i$ is needed when $P_i$ matches too many text inquiries $q \notin Q_c$, in other words, when the precision of the positive part $P_i$ is lower than the threshold $\lambda_{p1}$. The negative expression $e^n_i$ is constructed from word groups in $W_{\bar{c}}$. We traverse all word groups in $W_{\bar{c}}$ and output the regular expression $R_i$ with the highest fitness score.
Regular expression $R_i$ and iteration (Algorithm 2). Regular expression $R_i$ is generated in an iterative fashion. Whenever a new regular expression is added to the classifier, the positive set is updated by eliminating the instances matched by the regular expression, so that the next regular expression is generated more quickly and in a more targeted way. The above-mentioned process, starting from word selection, is iterated to construct new regular expressions that match positive instances not matched by previous ones until the termination criteria are met, i.e., the recall of the current solution has exceeded the threshold $\lambda_r$, or the number of iterations reaches $n_{stop}$.
[Algorithm 2: iterative construction of the complete solution. Recoverable fragment: update $Q_c$ by deleting $q \in Q_c$ matched by $R_i$; 6: update $W_c$ according to the new positive set $Q_c$.]
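The distance operator can be emulated with a standard regular expression when tokens are separated by whitespace. The Python sketch below builds a pattern that allows between $a$ and $b$ intervening tokens between a keyword expression and a related-word expression. Whitespace-delimited tokens, the helper names, and the example words are all assumptions made for illustration; the system described here works on segmented Chinese text, and its internal encoding of the operator is not shown in this section.

```python
import re

def or_expression(words):
    """Combine words with the OR function, e.g. ['a', 'b'] -> '(?:a|b)'."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

def with_distance(keywords, related, a, b):
    """Keyword expression, then a..b intervening tokens, then related expression."""
    gap = r"\s+(?:\S+\s+){%d,%d}" % (a, b)
    return r"\b" + or_expression(keywords) + gap + or_expression(related) + r"\b"

# At most two tokens may separate "fever" from "cough"/"coughing".
pattern = with_distance(["fever"], ["cough", "coughing"], 0, 2)

adjacent = re.search(pattern, "fever cough") is not None
one_token_apart = re.search(pattern, "fever and cough") is not None
too_far = re.search(pattern, "fever for two days cough") is not None
```

Setting `a` to zero, as in the text, lets the two expressions appear next to each other, while `b` caps how far apart they may drift before the match is rejected.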
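The iterative covering loop can be sketched in Python as below. This is a deliberately simplified stand-in, not Algorithm 2 itself: instead of generating a new regular expression from updated word sets at every step, it picks the best pattern from a fixed, hypothetical candidate pool, ranking by recall on the still-uncovered positives and then by precision. The stopping rule (recall target or iteration limit) follows the description above; the parameter names `recall_target` and `n_stop` are chosen to echo $\lambda_r$ and $n_{stop}$.

```python
import re

def precision_recall(pattern, positives, negatives):
    """Precision and recall of a single pattern on the given samples."""
    tp = sum(bool(re.search(pattern, q)) for q in positives)
    fp = sum(bool(re.search(pattern, q)) for q in negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(positives) if positives else 0.0
    return precision, recall

def build_classifier(candidates, positives, negatives, recall_target=0.9, n_stop=10):
    remaining = list(positives)  # positives not yet matched by the classifier
    classifier = []
    for _ in range(n_stop):
        covered = len(positives) - len(remaining)
        if not remaining or covered / len(positives) >= recall_target:
            break
        # Stand-in for rule generation: score unused candidates on uncovered positives.
        scored = []
        for candidate in candidates:
            if candidate in classifier:
                continue
            precision, recall = precision_recall(candidate, remaining, negatives)
            if recall > 0:
                scored.append(((recall, precision), candidate))
        if not scored:
            break
        _, best = max(scored, key=lambda item: item[0])
        classifier.append(best)
        # Update the positive set by deleting the inquiries matched by the new rule.
        remaining = [q for q in remaining if not re.search(best, q)]
    return classifier

# Toy data, invented for illustration.
positives = ["diarrhea since monday", "loose stool today",
             "watery stool and cramps", "diarrhea after milk"]
negatives = ["knee pain", "stool test result"]
candidates = ["diarrhea", "loose|watery", "stool"]
classifier = build_classifier(candidates, positives, negatives)
```

On this toy data the loop first takes "diarrhea", then prefers "loose|watery" over "stool" because the latter also matches a negative inquiry.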
Algorithm 1 presents the pseudo code of the generation of the regular expression $R_i$ and Algorithm 2 demonstrates the iterative process for constructing the complete solution. A complexity analysis is provided to evaluate the efficiency of our algorithm. The time complexity of the algorithm to generate regular expression $R_i$ is $O(m + n^2)$, where $n$ is the vocabulary size of the dictionary and $m$ is the total number of inquiries. Specifically, the calculation of precision, recall, and fitness of a single regular expression has a complexity of $O(m)$. The selection of the related part $e^r_i$ in $P_i$ and the calculation of distance between word clusters have a complexity of $O(n^2)$. In general, our algorithm is relatively efficient with large inputs, because the vocabulary size grows much more slowly than the number of input text inquiries.

V. EXPERIMENTS
We carried out a comprehensive experimental evaluation to test the performance of the proposed regular expression based text classifier. Specifically, we address the interpretability of our proposed approach and the effectiveness of our method in improving the performance of existing machine learning algorithms. Additional experiments on a publicly available data set are conducted to evaluate the generalizability of our proposed method. We report accuracy, precision, recall, and macro- and micro-$F_{0.5}$ (as we weight precision more heavily than recall) as our evaluation metrics.
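For readers unfamiliar with the metric, the standard $F_\beta$ score with $\beta = 0.5$ weights precision more heavily than recall. The Python sketch below computes it, together with macro and micro averages, from per-class counts of true positives, false positives, and false negatives. The counts are toy values invented for illustration and are unrelated to the experimental results reported in this paper.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy per-class counts as (tp, fp, fn); hypothetical values.
per_class = [(90, 10, 60), (40, 10, 10)]

# Macro average: compute F0.5 per class, then take the mean.
macro_f = sum(f_beta(*precision_recall(*c)) for c in per_class) / len(per_class)

# Micro average: pool the counts over classes, then compute F0.5 once.
tp, fp, fn = (sum(column) for column in zip(*per_class))
micro_f = f_beta(*precision_recall(tp, fp, fn))
```

The macro average treats every class equally, whereas the micro average is dominated by the classes with the most instances, which matters when category sizes are highly skewed.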

A. DATA AND PARAMETER SETTINGS
We use online consultation data provided by our collaborator, a major online healthcare provider in the Chinese market, for both training and testing of our system. A collection of patient inquiries labeled by medical categories is treated as our input to perform the text classification task. The categories are manually labeled by a team of medical experts in our collaborating company. With the help of medical experts, we cluster these medical categories and give them more general labels (we call them clinical departments) to produce hierarchical labels. For example, common clinical departments may include the pediatrics department, orthopedics department, gynecology department, etc. Within the gynecology department, medical categories such as vaginitis, menstrual disorder, and uterine myomas can be found. One can consider medical categories as logical groupings of similar text inquiries, which may not correspond to the sets of similar diseases defined in the International Classification of Diseases, Ninth Revision (ICD-9). Table 2 gives an intuition of what the data set looks like. Carefully going through the last three examples, we find that the three inquiries, despite their limited distinctions, belong to totally different classes. Similar sentence structures and shared feature words would greatly obscure the boundary and confuse traditional classification models, whereas our regular expression based classifiers become comparatively more effective because of their ability to precisely identify and exclude undesired patterns in the negative part of the regex solution.
The goal of our experiment is to classify each inquiry $q$ to a class (or medical category in the context of our application). For a class $c$, positive instances used for constructing the positive part of a regex are inquiries that are labeled as $c$, while negative instances used for constructing the negative part are inquiries that belong to the same clinical department but different medical categories.
In our setting, 13 clinical departments with 776 medical categories are predefined by experts. A total of 4,634,742 effective records are collected from a 2-week online operational stream. The cumulative percentage of inquiries against the number of categories is shown in Fig. 2. Statistics reveal that the top 100 most common categories take up 80% of the total inquiries. We therefore choose to train regular expression based classifiers only for the top 100 categories, which fall within 7 major clinical departments in our experiment, because the quality of the regular expressions cannot be guaranteed with too little training data. To evaluate the performance of our method under different sizes of training data, we utilize three training sets with different sizes. A validation set consisting of 100,000 instances is utilized to set the hyper-parameters, and a testing set of size 500,000 is exclusively prepared. The distribution of records over categories in the 2,000,000-instance training set is illustrated in Fig. 3. We tune the values of the parameters $\lambda_f$, $\lambda_s$, $\lambda_{p1}$, $\lambda_{p2}$, $\lambda_r$, and $n_{stop}$ on the validation set after exploratory experimentation. The results were quite robust to the choice of hyper-parameters within specific ranges. According to our experiments, we set $\lambda_f = 5$, $\lambda_s = 0.7$, $\lambda_{p1} = 0.8$, $\lambda_{p2} = 0.6$, $\lambda_r = 0.9$, and $n_{stop} = 10$ to get the best results.

B. REGULAR EXPRESSION BASED TEXT CLASSIFIER
Given a set of regular expressions, if the input text contains the pattern defined by any one of the regular expressions, the text is classified as a positive instance; otherwise, the input text is regarded as negative. The classification is performed on a class-by-class basis. On average, our method generates 7 unique regular expressions for each class. With 2,000,000 training samples, the solution recognizes 57% of the text inquiries on average (recall) and yields 89% precision. We observe that patterns with higher fitness scores on the training data tend to provide correct classifications on the testing data as well. Therefore, our proposed approach effectively avoids the overfitting problem. We compare the performance of our method using different training sizes (see Table 3). The results demonstrate that, as expected, more training data leads to better overall performance. Note that the performance of the generated solution varies from category to category, because categories differ in how difficult it is to find feature words and synonyms and to discover relations between phrases. For example, while the algorithm achieves 92% precision and 73% recall for the category diarrhea, it achieves only 74% precision and 28% recall for the category orthopedic pain. A careful analysis of this seemingly unstable performance shows that the symptoms of orthopedic pain can be very diverse, so feature words are hard to extract based on the word frequency ratio. For instance, pain in body parts such as the arm, leg, thigh, chest, finger, or toe is generally covered by this category. As a result, text inquiries belonging to orthopedic pain share fewer common features. The detailed distribution of precision and recall over the selected 100 categories is illustrated in Fig. 4.
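The decision rule at the start of this subsection, together with the precision and recall used to evaluate it, can be sketched as follows. This is an illustrative sketch only: each rule is assumed to be a pair of a positive pattern and an optional negative pattern that vetoes a match, and the English patterns in the usage note are invented for illustration (the actual data and expressions are Chinese).

```python
import re
from typing import List, Optional, Sequence, Tuple

# Assumed rule layout: (positive pattern, optional negative pattern).
Rule = Tuple[str, Optional[str]]


def matches_rule(text: str, rule: Rule) -> bool:
    """A rule fires if its positive part matches and its negative part does not."""
    positive, negative = rule
    if re.search(positive, text) is None:
        return False
    return negative is None or re.search(negative, text) is None


def classify(text: str, rules: List[Rule]) -> bool:
    """Positive if any one rule of the class fires, negative otherwise."""
    return any(matches_rule(text, rule) for rule in rules)


def precision_recall(predicted: Sequence[bool],
                     gold: Sequence[bool]) -> Tuple[float, float]:
    """Per-class precision and recall from boolean predictions and labels."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, with the hypothetical rules `[(r"diarrhea|watery stool", None), (r"stomach.*(ache|pain)", r"period|menstrua")]`, an inquiry mentioning stomach pain during a period is rejected by the negative part of the second rule, while an inquiry about stomach ache after dinner is accepted.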
The current practice of our collaborator is to have human experts manually compose regular expressions for the classification task. This process is labor intensive and does not scale as the online data accumulate. Our method automates the generation of regular expressions without human intervention, even for large data sets. Fig. 4 shows that the precision of every solution generated by our approach lies in the range [0.7, 1], whereas four out of one hundred manually composed classifiers yield a precision lower than 50%. This suggests that our method is more robust and less prone to overfitting than the manual approach. Despite these promising results, we stress that the goal of our system is neither to compete with machine learning methods or manual authorship of regular expressions, nor to completely remove human effort from the loop. In fact, the interpretability of our solution makes it a useful initial solution for human experts to modify. As a result, the overall labor and time required of human experts is significantly reduced.

C. MACHINE LEARNING + REGEX BASED CLASSIFIER
Note that our regular expression based text classification model is not proposed to compete with other classification approaches such as machine learning based approaches. In fact, we design our method so that both types of approaches can be used together for better performance. The regular expression based classifier is used to perform secondary verification on the predictions given by machine learning models. Experimental results demonstrate that we can achieve better performance by combining our method with baseline methods. Naive Bayes (NB), Support Vector Machines (SVM), and prevailing deep learning techniques for text classification such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are used as baseline classifiers. In [18], a CNN model with named entity features achieved state-of-the-art performance in online medical guidance. We reproduced this model as one of our baseline methods. The baseline classifiers are trained on the same data set as the regular expression based classifiers. Due to the computational cost of the machine learning models, we perform the evaluation with the training data set of 500,000 samples. Note that the absolute performance of the baseline models is less important here, since we are interested in the improvement obtained by combining the existing text classification models with our approach.
Regular expression based classifiers are combined with the baselines by performing secondary verification on the baseline model predictions. If the classification confidence is less than 0.6, we apply our regular expression based classifiers sequentially to the top 5 predictions with the highest confidence scores. For example, for the inquiry “My son is 7 years old. He has big trouble concentrating in class, and his hands and feet sometimes tremble while sleeping.”, the SVM model gives the top 5 predictions hand tremble, Attention Deficit Hyperactivity Disorder (ADHD), pediatric convulsions, sleep problems, and insomnia with confidence scores 0.56, 0.47, 0.20, 0.05, and 0.02, respectively. The regular expression classifiers of these 5 predictions are executed successively to verify the result until the inquiry is matched by a particular regular expression. In this example, the inquiry text is matched by a regular expression of the ADHD category. The predicted label of this instance is thus ADHD instead of hand tremble, which corrects the uncertain prediction originally given by the SVM.
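The secondary verification step described above can be sketched as follows. This is a minimal sketch, not the deployed implementation: the function name `verify_with_regex`, the dictionary `regex_by_label`, and the example patterns are hypothetical. The text does not state what happens when none of the top 5 classifiers matches, so falling back to the baseline's top prediction is an assumption made here.

```python
import re
from typing import Dict, List, Optional, Tuple


def verify_with_regex(text: str,
                      ranked: List[Tuple[str, float]],
                      regex_by_label: Dict[str, List[str]],
                      threshold: float = 0.6,
                      top_k: int = 5) -> Optional[str]:
    """Secondary verification of a baseline prediction.

    `ranked` holds (label, confidence) pairs sorted by descending confidence.
    """
    if not ranked:
        return None
    best_label, best_conf = ranked[0]
    # Confident predictions are accepted without verification.
    if best_conf >= threshold:
        return best_label
    # Otherwise try the regex classifiers of the top-k labels in order
    # and stop at the first label whose expressions match the inquiry.
    for label, _ in ranked[:top_k]:
        for pattern in regex_by_label.get(label, []):
            if re.search(pattern, text):
                return label
    # Assumption: keep the baseline's top prediction if nothing matches.
    return best_label
```

With the inquiry and confidence scores from the example above, and a hypothetical ADHD expression such as `r"trouble concentrating.*class"`, the function returns ADHD because the top confidence 0.56 is below the 0.6 threshold and the ADHD expression is the first to match.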
Furthermore, one prominent innovation of our proposal is that the regular expression based text classifiers are fully interpretable by humans and highly flexible to modify. Compared with manual composition of classifiers, which is usually labor intensive and time consuming, our approach extracts key features and identifies text patterns in only a small amount of time. Human experts can then enrich the solution with their prior experience, domain knowledge, and empirical facts to achieve human-machine collaborative intelligence. A list of examples is provided in Table 4 to illustrate how human modification is carried out on the regular expressions generated by our approach. The examples reveal that feature words and sentence structures can be precisely collected by our method. On this basis, medical experts may add uncommon synonyms, remove meaningless phrases, cluster similar expressions, and narrow overbroad solutions. Four medical experts collaboratively review and modify the regular expressions, which are generated on the same training set used to train the four baseline models.
Experimental results of the baseline models, the combination of baselines and regular expressions, and the combination of baselines and human modified regular expressions are shown in Table 5. The machine generated regular expressions improve the precision of the Naive Bayes and SVM results by 9% and the recall by 4.5% on average. Macro and micro F-scores also reflect the multi-class classification performance. The regex-based classifier narrows the gap between the macro and micro F_0.5 scores of the NB and SVM models, indicating that the regular expressions improve performance on the classes with fewer samples, on which machine learning models generally do not perform well.
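To make the distinction between the two averages concrete, the sketch below computes macro and micro F_0.5 from per-class counts of true positives, false positives, and false negatives. It uses the standard F_β definition, F_β = (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP); whether the reported scores were computed in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def macro_f_beta(counts: Dict[str, Counts], beta: float = 0.5) -> float:
    """Unweighted mean of per-class scores: every class counts equally."""
    if not counts:
        return 0.0
    return sum(f_beta(*c, beta=beta) for c in counts.values()) / len(counts)


def micro_f_beta(counts: Dict[str, Counts], beta: float = 0.5) -> float:
    """Score on pooled counts: large classes dominate the result."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_beta(tp, fp, fn, beta=beta)
```

Because the macro average gives a small class the same weight as a large one, poor results on rare categories pull the macro score below the micro score, which is why a narrowing gap points to better performance on the classes with fewer samples.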
Since deep learning techniques outperform traditional machine learning models by a large margin, directly combining the machine generated regex classifier with CNNs and RNNs does not markedly enhance the system performance. Nevertheless, the regex solution can be further enriched because of its full interpretability, and we introduce a human in the loop to amend the solution. The effectiveness of combining human modified regular expressions with DNNs is validated by the experimental results shown in Table 5. A hybrid system that uses deep learning techniques as the foundation and regular expressions as secondary verification reaches a classification precision above 90% with relatively high recall as well. Traditional machine learning methods benefit even more from the modified regular expressions. Therefore, we conclude that human-machine collaborative intelligence can be achieved with the help of interpretable regular expressions to tackle the “black box” issue in machine learning, especially in deep learning models.

D. GENERALIZATION CAPABILITY
To demonstrate the generalizability of our method, we conduct additional experiments on a publicly available data set released in [51], which was originally used to develop Chinese word segmentation tools. The data is collected from Good Doctor Online, a Chinese online forum for medical consultation. The numbers of instances in the training, validation, and test sets are 4863, 1412, and 1474, respectively.
The annotation was done by the same team of medical experts, following the same annotation rules as in our previous experiment. Because the data volume is limited and introducing an excessive number of classes would cause training bias and inaccuracies, we perform the classification task at the level of the clinical department, a more general label than the medical category, as mentioned in Section V-A. The number of instances in each class is provided in Fig. 5.
Four baseline classification models, namely NB, SVM, RNN, and CNN, are trained and combined with regex-based classifiers. The experimental results in Fig. 6 show that our hybrid system obtains a 6.5% improvement in precision and a 5.6% improvement in recall on average. Further modifications by human experts are not conducted on this data set due to time and practical constraints, but we are confident that the performance of the system can be further enhanced by introducing a human in the loop to fine-tune the interpretable regular expressions with specialized domain knowledge.

VI. CONCLUSIONS AND EXTENSIONS
Regular expressions have long been used for text processing because of their expressiveness and flexibility. However, the generation of fully interpretable regular expressions is not trivial and often requires a significant investment of manual work by qualified personnel. We have proposed a novel constructive heuristic algorithm for constructing regular expression based classifiers for medical text classification tasks. The approach only requires a set of labeled examples and places no limitation on the size of the alphabet. Experimental results on real-world online medical inquiries demonstrate the high performance and consistency of this approach. Although many models have been developed for text or sentence classification, most suffer from the “black box” problem and cannot easily be modified afterwards. When combined with conventional machine learning methods, the regular expression based classifiers not only improve their performance by correctly recognizing many of the misclassified instances, but also avoid the black box and produce solutions that are fully interpretable by humans. We believe that using regular expressions to tap into the sequential relationships among salient words is a promising approach to improve text classification performance and support accurate decision-making.
In the future, we will extend our framework to perform entity extraction tasks by inducing general patterns of medical entities in healthcare. Information extraction tasks may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern that can be described by regular expressions [52]. With the extraction of high quality information and medical concepts as a basis, an automated process of knowledge graph construction can then be developed to learn high quality knowledge bases linking diseases and symptoms directly from electronic medical records.

FIGURE 1. Diagrammatic demonstration of the regular expression generation process.

Algorithm 1 Generation of a Single Regular Expression R_i
1: procedure REGEX(W_c, W_c)
2: calculate f_r(G_j^c) for every word group G_j^c in W_c
3: sort W_c by f_r(G_j^c) in descending order
4: j in W_c with the highest co-occurrence rate with G_j^c
10: compute average word distance b by co-occurrence matrix M
11:

Algorithm 2 Iterative Process for the Complete Solution
Require: positive inquiry set Q_c, positive word list W_c, negative word list W_c, recall threshold λ_r, and the maximum number of iterations n_stop
Ensure: a complete regex classifier C
1: set i = 0
2: repeat
3: call procedure REGEX(W_c, W_c)
4: add R_i to the classifier C
5:

FIGURE 2. Number of categories and inquiries in cumulative percentage.

FIGURE 3. Data set categories and inquiries distribution.

FIGURE 4. Precision and recall distribution of regex classifiers.

FIGURE 5. Data distribution of the forum data set.

FIGURE 6. Performances of baseline and baseline + regex methods.

TABLE 1. Functions and terminals used in the model.

TABLE 2. Data set demonstration.

TABLE 3. Classification performance on different sample sizes.

TABLE 4. Human modification of regular expressions for medical category “Diarrhea”.

TABLE 5. Performances of Baseline, Baseline + Regex, and Baseline + Human Modified Regex methods.