Reference-Based Sequence Classification

Sequence classification is an important data mining task in many real world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.


INTRODUCTION
In many practical applications, we have to conduct data analysis on data sets that are composed of discrete sequences. Each sequence is an ordered list of elements. For instance, such a sequence can be a protein sequence, where each element corresponds to an amino acid. Due to the existence of a large number of discrete sequences in a wide range of applications, sequential data analysis has become an important issue in machine learning and data mining. Compared to non-sequential data mining, sequential data analysis is confronted with new challenges because of the ordering relationship between different elements in the sequences. Similar to the analysis of non-sequential data, there are different sequential data mining problems such as clustering, classification and pattern discovery. In this paper, we focus on the sequence classification problem.
The task of classification is to determine which predefined target class one unknown object should be assigned to. As a specific case of the general classification problem, sequence classification is to assign class labels to new sequences based on the classifier constructed in the training phase. In many real-world applications, we can formulate the data analysis task as a sequence classification problem. For instance, the essential task in numerous bioinformatics applications is to classify biological sequences into existing categories [1].
To tackle the sequence classification problem, many effective methods have been proposed from different aspects. Roughly, existing sequence classification methods can be divided into three categories [2]: feature-based methods, distance-based methods and model-based methods. Feature-based methods first transform sequences into feature vectors, and then apply existing vectorial data classification methods. Distance-based methods apply classifiers such as KNN (k Nearest Neighbors) to solve the sequence classification problem, in which the key issue is to specify a proper distance function to measure the distance between two sequences. Model-based methods generally assume that sequences from different classes are generated from different probability distributions, in which the key issue is to estimate the model parameters from the set of training sequences.
In this paper, we focus on the feature-based method since it has several advantages. First of all, various effective classifiers have been developed for vectorial data classification [3]. After transforming sequences into feature vectors, we can choose any one of these existing classification methods to fulfill the sequence classification task. Second, in some popular feature-based methods such as pattern-based methods, each feature has a good interpretability. Last but not least, the extraction of features from sequences has been extensively studied across different fields, making it feasible to generate sequence features in an effective manner.
In this paper, we present a reference-based sequence classification framework, which can be considered as a nontrivial generalization of the pattern-based methods. This framework has several key steps: candidate set construction, reference point selection and feature value construction. In the first step, one set of sequences that serve as the candidate reference points is constructed. Then, some sequences from the candidate set are selected as the reference points according to certain criteria. The number of features in the transformed vectorial data will be equal to the number of selected reference points. In other words, each reference point will correspond to a transformed feature. Finally, one similarity function is used to calculate the similarity between each sequence in the data and every reference point. The similarity to each reference point will be used as the corresponding feature value.
The reference-based sequence classification framework is quite general and flexible since the selection of both reference points and similarity function is arbitrary. Existing feature-based methods can be regarded as special variants under our framework by (1) using (frequent or discriminative) sequential patterns (subsequences) as reference points and (2) utilizing a boolean function (output 1 if the reference point is contained in a given sequence and output 0 otherwise) as the similarity function. Besides unifying existing pattern-based methods under the same umbrella, the reference-based sequence classification framework can be used as a general platform for developing new feature-based sequence classification methods. As a proof of concept, we develop a new feature-based method by using a subset of training sequences as the reference points and the Jaccard coefficient as the similarity function. In particular, we present two instance selection methods in order to select a good set of reference points.
To demonstrate the feasibility and advantages of this new framework, we conduct a series of comprehensive performance studies on real sequential data sets. In the experiments, we compare several variants under our framework with some existing sequence classification methods in terms of the classification accuracy. Experimental results show that new methods developed under the proposed framework are capable of achieving better classification accuracy than traditional sequence classification methods. This indicates that such a reference-based sequence classification framework is promising from a practical point of view.
The main contributions of this paper can be summarized as follows:
• We present a general reference-based framework for feature-based sequence classification. It offers a unified view for understanding and explaining many existing feature-based sequence classification methods in which different types of sequential patterns are used as features.
• The reference-based framework can be used as a general platform for developing new feature-based sequence classification algorithms. To verify this point, we design new feature-based sequence classification algorithms under this framework and demonstrate its advantages through extensive experimental results on real sequential data sets.
• Some preliminary (theoretical) analysis is provided to reveal why such a reference-based method is effective for the sequence classification task. This will serve as the foundation for future development in this direction.
The rest of the paper is structured as follows. Section 2 gives a discussion on the related work. In Section 3, we introduce the reference-based sequence classification framework in detail. In Section 4, we show that many existing feature-based sequence classification algorithms can be reformulated within the reference-based framework. In Section 5, we present new feature-based sequence classification algorithms under this framework, which are effective and quite different from available solutions. We experimentally evaluate the proposed reference-based framework through a series of experiments on real-life data sets in Section 6. Finally, we summarise our research and give a discussion on the future work in Section 7.

RELATED WORK
In this section, we discuss previous research efforts that are closely related to our method. In Section 2.1, we provide a categorization of existing feature-based sequence classification methods. In Section 2.2, we discuss several instance-based feature generation methods in the literature of time series classification. In Section 2.3, we present a concise discussion of reference-based sequence clustering algorithms. In Section 2.4, we provide a short summary of dimension reduction and embedding methods based on landmark points.

Explicit Subsequence Representation without Selection
The naive approach in dealing with discrete sequences is to treat each element as a feature. However, the order information between different elements will be lost and the sequential nature cannot be captured in the classification. Short sequence segments of k consecutive elements called k-grams can be used as features to solve this problem. Given a set of k-grams, a sequence can be represented as a vector of the presence or absence of the k-grams or the frequencies of the k-grams. In this feature representation method, all k-grams (for a specified k value) are explicitly used as the features without feature selection.
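As an illustration of the k-gram representation described above, the following minimal sketch builds both the presence/absence vector and the frequency vector of a sequence over the full set of k-grams. The function names `kgrams` and `kgram_vector` and the toy alphabet are assumptions made for this example and do not come from any of the cited methods.

```python
from collections import Counter
from itertools import product


def kgrams(seq, k):
    """Return all contiguous k-grams of seq as tuples, in order of occurrence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def kgram_vector(seq, vocabulary, k, binary=False):
    """Represent seq over a fixed k-gram vocabulary.

    binary=False gives k-gram frequencies; binary=True gives presence/absence.
    """
    counts = Counter(kgrams(seq, k))
    if binary:
        return [1 if g in counts else 0 for g in vocabulary]
    return [counts[g] for g in vocabulary]


# Toy alphabet (an assumption for this example); all 2-grams are used as
# features, i.e. no feature selection is applied.
alphabet = ["A", "B"]
vocabulary = list(product(alphabet, repeat=2))  # AA, AB, BA, BB

freq = kgram_vector("ABAB", vocabulary, k=2)               # [0, 2, 1, 0]
pres = kgram_vector("ABAB", vocabulary, k=2, binary=True)  # [0, 1, 1, 0]
```

Since the vocabulary contains every k-gram over the alphabet, its size grows as m^k for an alphabet of m items, which is why the methods discussed next apply some form of feature selection.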

Explicit Subsequence Representation with Selection (Classifier-Dependent)
The above pattern-based methods are universal and classifier-independent. However, some patterns that are critical to the classifier may be filtered out during the selection process. Thus, several methods which can select pattern features from the entire pattern space for a specific classifier have been proposed [30], [31], [32].
In [30], a coordinate-wise gradient ascent technique is presented for learning the logistic regression function in the space of all (word or character) k-grams. The method exploits the inherent structure of the k-gram feature space in order to automatically provide a compact set of highly discriminative k-gram features. In [31], a framework is presented in which linear classifiers such as logistic regression and support vector machine can work directly in the explicit high-dimensional space of all subsequences. The key idea is a gradient-bounded coordinate-descent strategy to quickly retrieve features without explicitly enumerating all potential subsequences. In [32], a novel document classification method using all substrings as features is proposed, in which the L1 regularization is applied to a multi-class logistic regression model to fulfill the feature selection task automatically and efficiently.

Implicit Subsequence Representation
In contrast to explicit subsequence representation, kernel-based methods employ an implicit subsequence representation strategy. A kernel function is the key ingredient for learning with support vector machines (SVMs) and it implicitly defines a high-dimensional feature space. Some kernel functions K(x, y) have been presented for measuring the similarity between two sequences x and y (e.g. [33]).
There are a variety of string kernels which are widely used for sequence classification (e.g. [34], [35], [36], [37]). A sequence is transformed into a feature space and the kernel function is the inner product of two transformed feature vectors.
Leslie et al. [34] propose a k-spectrum kernel for protein classification. Given a number k ≥ 1, the k-spectrum of an input sequence is the set of all its k-length (contiguous) subsequences.
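To make the idea of a kernel as an inner product of transformed feature vectors concrete, the following sketch computes a k-spectrum style kernel value by counting contiguous k-length subsequences. It assumes that the feature map uses occurrence counts; the function names are hypothetical, and the sketch is not meant to reproduce the efficient implementation of [34].

```python
from collections import Counter


def spectrum(seq, k):
    """Count every contiguous k-length subsequence of seq."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-spectrum count vectors of x and y.

    Only k-grams that occur in x can contribute, so the full feature space
    is never materialized.
    """
    sx, sy = spectrum(x, k), spectrum(y, k)
    return sum(count * sy[gram] for gram, count in sx.items())


k_xy = spectrum_kernel("ABAB", "BABA", k=2)  # AB: 2*1, BA: 1*2 -> 4
k_xx = spectrum_kernel("ABAB", "ABAB", k=2)  # AB: 2*2, BA: 1*1 -> 5
```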
Lodhi et al. [35] present a string kernel based on gapped k-length subsequences for text classification. The subsequences are weighted by an exponentially decaying factor of their full length in the text.
In [36], a mismatch string kernel is proposed, in which a certain number of mismatches are allowed in counting the occurrence of a subsequence. Several string kernels related to the mismatch kernel are presented in [37]: restricted gappy kernels, substitution kernels and wildcard kernels.

Sequence Embedding
All the methods mentioned above use subsequences as features. Alternatively, the sequence embedding method generates a vector representation in which each feature does not have a clear interpretation. Most existing approaches for sequence embedding are proposed for texts in natural language processing, where word and document embeddings are used as an efficient way to encode the text (e.g. [38], [39]). The basic assumption in these methods is that words that appear in similar contexts have similar meanings.
The word2vec model [38] uses a two-layer neural network to learn a vector representation for each word. The sequence (text) embedding vector can be further generated by combining the feature vectors for words. The doc2vec model [39] extends word2vec by directly learning feature vectors for entire sentences, paragraphs, or documents.
Nguyen et al. [40] propose an unsupervised method (named Sqn2Vec) for learning sequence embedding by predicting its belonging singleton symbols and sequential patterns (SPs). The main objective of Sqn2Vec is to address the limitations of two existing approaches: pattern-based methods often produce sparse and high-dimensional feature vectors while sequence embedding methods in natural language processing may fail on data sets with a small vocabulary.

Instance-Based Methods
There are several instance-based feature generation methods for time series classification which are closely related to our method (e.g. [41], [42]).
Iosifidis et al. [41] propose a time series classification method based on a novel vector representation. The vector representation for each time series is generated by calculating its similarities to a subset of training instances. To find a good subset of representative instances, one clustering procedure is further presented. In [42], each time series is represented as a feature vector, where each feature value is its dynamic time warping similarity to one of the training instances. Note that all training instances are used for feature generation.

Reference-Based Sequence Clustering
In the literature of sequence clustering, the idea of using reference/landmark points to accelerate the cluster analysis process has been widely studied (e.g. [43], [44]). In this type of sequence clustering algorithm, a reference point selection method is first employed to obtain a small set of landmark points and then the clustering process is conducted based on the similarities between input sequences and selected reference points. Here we would like to highlight the following differences between our method and existing research efforts in this field: (1) The objective is different. We focus on the classification issue while these methods aim at the cluster analysis problem. In addition, their main concern is to improve the running efficiency of the sequence clustering procedure; (2) The method is different. We present two reference point selection methods: one unsupervised method and one supervised method (see Section 5 for the details). In existing reference-based sequence clustering methods, only the unsupervised reference point selection method is applicable since no class label information is available.

Reference-Based Dimension Reduction
A number of research papers have presented the idea of using the distances to a set of reference points to fulfill the dimension reduction task (e.g. [45], [46]). Our method shares some similarities with these methods since the final objective is the same. However, most of these methods are not developed for the task of sequence classification. As a result, our method is quite different from these methods with respect to both the reference point selection and the similarity computation.

REFERENCE-BASED SEQUENCE CLASSIFICATION FRAMEWORK
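Let $I = \{i_1, i_2, \ldots, i_m\}$ be a finite set of $m$ distinct items, which is generally called the alphabet in the literature. A sequence $s$ over $I$ is an ordered list $s = \langle s_1, s_2, \ldots, s_l \rangle$, where $s_i \in I$ and $l$ is the length of the sequence $s$. A sequence $t = \langle t_1, t_2, \ldots, t_r \rangle$ is said to be a subsequence of $s$, denoted by $t \subseteq s$, if there exist integers $1 \le i_1 < i_2 < \cdots < i_r \le l$ such that $t_1 = s_{i_1}, t_2 = s_{i_2}, \ldots, t_r = s_{i_r}$. We use maxsize to denote the allowed maximum length of subsequences.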
Let $C = \{c_1, c_2, \ldots, c_j\}$ be a finite set of $j$ distinct classes. A labeled sequential data set $D$ over $I$ is a set of instances, where each instance $d$ is denoted by $(s, c_k)$, $s$ is a sequence and $c_k \in C$ is a class label; $|D|$ is the number of sequences in $D$. The set $D_{c_i} \subseteq D$ contains all sequences that have the same class label $c_i$, i.e., $D_{c_i} = \{(s, c_k) \in D \mid c_k = c_i\}$. The sequences in $D$ ($D_{c_i}$) are divided into a training set $\mathit{TrainD}$ ($\mathit{TrainD}_{c_i}$) and a testing set $\mathit{TestD}$ ($\mathit{TestD}_{c_i}$). The set of all subsequences of $\mathit{TrainD}$ is denoted by $\mathit{SubTrainD} = \{t \mid t \subseteq s,\ s \in \mathit{TrainD}\}$.
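The notions of subsequence and of the set of all subsequences of the training set can be sketched as follows. This is a brute-force illustration under stated assumptions: the helper names `is_subsequence`, `subsequences` and `sub_train` are hypothetical, sequences are represented as Python strings or lists of items, and class labels are left out for brevity.

```python
from itertools import combinations


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting items (order preserved)."""
    remaining = iter(s)
    # 'item in remaining' advances the iterator, so matches must occur in order.
    return all(item in remaining for item in t)


def subsequences(s, maxsize):
    """All distinct non-empty subsequences of s with at most maxsize items."""
    found = set()
    for r in range(1, min(maxsize, len(s)) + 1):
        for positions in combinations(range(len(s)), r):
            found.add(tuple(s[i] for i in positions))
    return found


def sub_train(train_sequences, maxsize):
    """Union of the bounded subsequence sets of all training sequences."""
    result = set()
    for s in train_sequences:
        result |= subsequences(s, maxsize)
    return result


candidates = sub_train(["ab", "bc"], maxsize=2)
# {('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')}
```

Even with the maxsize bound, the number of subsequences grows combinatorially with the sequence length, so this enumeration is only practical for very short sequences.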
As shown in Figure 1, we present a reference-based sequence classification framework. It is composed of three major phases: reference point selection, feature value generation, and model construction and prediction. In the following, we will elaborate on each phase in detail.

Reference Point Selection
In the first stage of the presented framework, a reference point selection procedure is performed to generate a set of pivot sequences. As shown in Figure 2, this procedure can be further divided into three steps: alphabet extraction, candidate set generation and pivot sequence selection.
In the first step, we scan the training set TrainD to extract the alphabet I that is composed of distinct items. Note that there can be some items that only appear in the testing set TestD. In the forthcoming paragraphs, we will see that this extreme case has no effect on our subsequent steps.
In the second step, we generate the set of candidate reference sequences CR from the alphabet I. Note that any sequence over I can be a member of CR. In other words, CR can be an infinite set. In practice, some constraints will be imposed on the potential members of CR. For instance, pattern-based methods only consider subsequences of TrainD as the members of CR under our framework, which will be further discussed in Section 4. Furthermore, the use of different construction methods for building the candidate set CR will lead to the generation of many new feature-based sequence classification methods.
In the third step, we select a subset of sequences R from CR as the landmark sequences for generating features. That is, each reference sequence will correspond to a transformed feature. The critical issue in this step is how to design an effective pivot sequence selection method. To date, existing pattern-based methods typically utilize some simple criteria to conduct the reference sequence selection task. For example, those methods based on frequent subsequences use the minimal support constraint as the criterion for reference sequence selection. Apparently, many new and interesting pivot sequence selection methods remain unexplored under our framework. In the subsequent paragraphs of this subsection, we will list some commonly used criteria for selecting reference sequences from the set of candidate pivot sequences.

Constraint 1. (Gap constraint [10]). Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, if $t$ is a subsequence of $s$ such that $t_1 = s_{i_1}, t_2 = s_{i_2}, \ldots, t_r = s_{i_r}$, the gap between $i_k$ and $i_{k+1}$ is defined as $Gap(s, i_k, i_{k+1}) = i_{k+1} - i_k - 1$. Given two thresholds mingap and maxgap, if $mingap \le Gap(s, i_k, i_{k+1}) \le maxgap$ for every $1 \le k < r$, then the occurrence of $t$ in $s$ fulfills the gap constraint.

Constraint 2. (Minsup constraint [11]). Given a set of sequences $D_{c_i}$ with the class label $c_i$ and a sequence $t$, $count_{D_{c_i}}(t)$ is used to denote the number of sequences in $D_{c_i}$ that contain $t$ as a subsequence. The support of $t$ in $D_{c_i}$ is $sup_{D_{c_i}}(t) = count_{D_{c_i}}(t) / |D_{c_i}|$. Given a threshold minsup, if $sup_{D_{c_i}}(t) \ge minsup$, then $t$ satisfies the minsup constraint and $t$ is a frequent sequential pattern in $D_{c_i}$.

Constraint 3. (Mindisc constraint [47]). Given two class labels $c_1$ and $c_2$, a sequence $t$ is said to be a discriminative pattern if it is over-expressed on $D_{c_1}$ against $D_{c_2}$ (or vice versa). To evaluate the discriminative power, many measures/functions have been proposed in the literature [47]. If the discriminative function value of $t$ can pass certain constraints, then it satisfies the mindisc constraint. Here we just list some measures that have been used for selecting discriminative patterns in sequence classification.
• Discriminative Function (DF) 2 [10]: defined in terms of $occount_{D_{c_1}}(t)$, the number of non-overlapping occurrences of $t$ in $D_{c_1}$, and a given threshold mincount.
• Discriminative Function (DF) 3 [11]: given by Equation (3.3).
• Discriminative Function (DF) 4 [10]: defined in terms of $Occ_{within}$.
• Discriminative Function (DF) 5 [29]: defined in terms of the growth rate and the conditional redundancy of $t$. Here $GR(t, c_1, c_2) = \frac{sup_{c_1}(t)}{sup_{c_2}(t)}$ is the GrowthRate of $t$ and minGR is a given GrowthRate threshold. $Sig_{con}(t, c_1, c_2) = \min_{q \in Q} \frac{GR(t, c_1, c_2)}{GR(q, c_1, c_2)}$ is used to describe the conditional redundancy, where $Q$ is the set of discriminative sub-patterns of $t$ and minSig is a given threshold.
The chi-squared test is also used as the discriminative function to check if the candidate sequence is correlated with at least one class that it is frequent in.

Constraint 4. (Uniqueness constraint [10]). One sequence is said to satisfy the uniqueness constraint if all its items are unique.

Constraint 5. (Closeness constraint [18]). One sequence $t$ is said to satisfy the closeness constraint if no sequences that contain $t$ as a subsequence have the same support as $t$.

Constraint 6. (Redundancy constraint [25]). One sequence $t$ is said to satisfy the redundancy constraint if it meets the non-redundancy condition defined in [25].

Constraint 7. (Interestingness constraint [4]). Given a set of sequences $D_{c_i}$ with class label $c_i$ and two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$ such that $t$ is a subsequence of $s$ with $t_1 = s_{i_1}, t_2 = s_{i_2}, \ldots, t_r = s_{i_r}$, the interestingness $I_{c_i}(t)$ of $t$ in $D_{c_i}$ is defined as in [4]. Given two thresholds minsup and minint, if $sup_{D_{c_i}}(t) \ge minsup$ and $I_{c_i}(t) \ge minint$, then $t$ satisfies the interestingness constraint.

Constraint 8. (Level constraint [16]). Given a sequence $t$ and a set of sequences $D$ with $j$ classes, a sequential classification rule $\pi$ is denoted as $\pi: t \rightarrow count_{D_{c_1}}(t), count_{D_{c_2}}(t), \ldots, count_{D_{c_j}}(t)$, where $t$ is the body of the rule. From a Bayesian point of view, choosing the best rule is equivalent to maximizing $p(\pi \mid D) = \frac{p(\pi, D)}{p(D)} = \frac{p(\pi) \times p(D \mid \pi)}{p(D)}$, where $p(D)$ is a constant. Accordingly, $cost(\pi) = -\log(p(\pi) \times p(D \mid \pi))$ is used as the evaluation criterion, and the normalized criterion $level(\pi)$ is defined by comparing $cost(\pi)$ with the cost of the null model, in which the sequence body is empty. If $0 < level(\pi) \le 1$, then $t$ satisfies the level constraint.
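Two of the constraints listed above, the gap constraint and the minsup constraint, can be illustrated with the following minimal sketch. It rests on several assumptions: the upper gap threshold is called `maxgap` here, the support is taken to be the fraction of sequences that contain the candidate, and all function names are hypothetical. The search is a plain backtracking procedure rather than an optimized pattern mining algorithm.

```python
def occurs_with_gap(t, s, mingap, maxgap):
    """True if t occurs in s so that every gap i_{k+1} - i_k - 1 between
    consecutive matched positions lies in [mingap, maxgap]."""
    def extend(k, prev):
        if k == len(t):
            return True
        if prev is None:
            positions = range(len(s))  # the first item may match anywhere
        else:
            lo = prev + 1 + mingap
            hi = min(len(s), prev + 1 + maxgap + 1)  # exclusive upper bound
            positions = range(lo, hi)
        return any(s[i] == t[k] and extend(k + 1, i) for i in positions)

    return extend(0, None)


def contains(t, s):
    """Plain subsequence test: any non-negative gap is allowed."""
    return occurs_with_gap(t, s, 0, len(s))


def support(t, sequences):
    """Fraction of sequences that contain t as a subsequence (assumed form)."""
    if not sequences:
        return 0.0
    return sum(contains(t, s) for s in sequences) / len(sequences)


def frequent(candidates, sequences, minsup):
    """Keep the candidates whose support reaches minsup."""
    return [t for t in candidates if support(t, sequences) >= minsup]


D_c1 = ["abc", "acb", "bca"]
kept = frequent(["ab", "ca", "cc"], D_c1, minsup=0.5)  # ['ab']
```

In this toy data set, "ab" is contained in two of the three sequences, whereas "ca" is contained in only one and "cc" in none, so only "ab" passes a minsup threshold of 0.5.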

Feature Value Generation
In the second stage of the presented framework, one similarity function is used to generate vectorial representations for all sequences in both training data and testing data. As shown in the left part of Figure 3, this procedure can be further divided into two steps: (1) calculating the similarities between training instances and reference points; (2) calculating the similarities between testing instances and reference points.
In the first step, we utilize one similarity function to transform TrainD into a vectorial training set TrainD′ by calculating the similarity between each sequence in TrainD and every reference point in R. Each similarity value will be used as the corresponding feature value. The critical issue in this step is how to choose a suitable similarity function. Note that the selection of the similarity function is arbitrary. In other words, any feasible similarity function can be used in this step. In fact, many existing feature-based methods utilize a boolean function as the similarity function, which outputs 1 as the feature value if the reference point is a subsequence of the target sequence and 0 otherwise.
In the second step, we use the same similarity function to transform TestD into a vectorial testing set TestD′. Note that the number of features in the transformed vectorial data set is |R|, which is the number of reference points.
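The two steps above can be sketched as a single transformation that is applied to the training and testing sequences with the same reference points and the same similarity function. The sketch below uses the boolean similarity function mentioned in the text; the function names and the toy data are assumptions made for illustration.

```python
def contains(t, s):
    """True if t is a subsequence of s."""
    remaining = iter(s)
    return all(item in remaining for item in t)


def boolean_similarity(s, r):
    """1 if the reference point r is a subsequence of the sequence s, else 0."""
    return 1 if contains(r, s) else 0


def to_feature_vectors(sequences, reference_points, similarity):
    """One row per sequence, one column (feature) per reference point."""
    return [[similarity(s, r) for r in reference_points] for s in sequences]


R = ["ab", "ca"]            # selected reference points (toy example)
train_sequences = ["abc", "cab"]
test_sequences = ["bca"]

train_vectors = to_feature_vectors(train_sequences, R, boolean_similarity)
test_vectors = to_feature_vectors(test_sequences, R, boolean_similarity)
# train_vectors == [[1, 0], [1, 1]], test_vectors == [[0, 1]]
```

Because the similarity function is passed in as an argument, replacing the boolean function with any other similarity measure changes the feature values but not the number of features, which stays |R|.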
The similarity function plays an important role in generating feature values and therefore has a great impact on the prediction result. To summarize existing research efforts under our framework with respect to the similarity function, we list some similarity functions between two sequences $s$ and $t$ that have been deployed in the literature.
• Similarity Function (SF) 1: a boolean function that outputs 1 if $t$ is a subsequence of $s$ and 0 otherwise.
• Similarity Function (SF) 2: given by Equation (3.7). In Equation (3.7), similar means $ed(\alpha, t) \le \gamma \times |t|$ (with $|s| \ge |t|$), where $ed(\alpha, t)$ is the edit distance between $\alpha$ and $t$ (the minimum number of operations needed to transform $\alpha$ into $t$, where an operation can be the insertion, deletion, or substitution of a single item) and $\gamma$ is a given maximum difference threshold. Here $\alpha$ is a contiguous subsequence of $s$ with $|t|$ items, which is extracted by using a sliding window of length $|t|$ that starts from the first element of $s$. If $\alpha$ and $t$ are not similar, then the sliding window is repeatedly shifted one position to the right until $|s| - |t| + 1$ subsequences have been checked or one new subsequence $\alpha$ similar to $t$ is encountered.
• Similarity Function (SF) 3 [4]: defined in terms of $C(t, s)$, the cohesion of $t$ in the sequence $s$.
• Similarity Function (SF) 4 [17]: defined in terms of occnum, the number of occurrences of $t$ in $s$.
• Similarity Function (SF) 5 [10]: defined in terms of $occount_s(t)$, the number of non-overlapping occurrences of $t$ in $s$.
• Similarity Function (SF) 6 [18]: defined in terms of $|LCS(s, t)|$, the length of the longest common subsequence of $s$ and $t$, and of $|s|$ and $|t|$, the lengths of $s$ and $t$ respectively.
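Two building blocks that appear in the similarity functions above, the sliding-window check based on edit distance and the length of the longest common subsequence, can be sketched with standard dynamic programming. The function names are hypothetical, and the sketch only computes these ingredients; it does not reproduce the exact similarity formulas of the cited works.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def has_similar_window(s, t, gamma):
    """True if some window alpha of |t| contiguous items of s satisfies
    ed(alpha, t) <= gamma * |t|; windows are scanned from left to right."""
    w = len(t)
    if len(s) < w:
        return False
    return any(edit_distance(s[i:i + w], t) <= gamma * w
               for i in range(len(s) - w + 1))


def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For example, `has_similar_window("abcdef", "bxd", 0.34)` holds because the window "bcd" differs from "bxd" by one substitution and 1 ≤ 0.34 × 3, whereas the same call with a threshold of 0.3 fails.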

Model Construction and Prediction
In the third stage of the presented framework, we construct a prediction model to make predictions. As shown in the right part of Figure 3, this procedure can be further divided into three steps: model construction, prediction and classification result generation.
In the first step, one existing vectorial data classification method is used to construct a prediction model from the vectorial training set TrainD′, since we have transformed training sequences into feature vectors in the second stage. Numerous classification methods have been designed for classifying feature vectors (e.g. support vector machines and decision trees) [3], [48]. After training a classifier with TrainD′, the prediction model is ready for classifying unknown samples.
In the second step, we input the vectorial testing set TestD′ to the classifier to make predictions. In the third step, we output the prediction result and compute the classification accuracy by comparing the predicted class labels with the ground-truth labels.
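The three steps of this stage can be sketched with any vectorial classifier. To keep the example self-contained, the sketch below uses a plain 1-nearest-neighbor rule with squared Euclidean distance; this choice, the function names and the toy vectors are assumptions made for illustration, and classifiers such as support vector machines or decision trees can be plugged in at the same place.

```python
def nearest_neighbor_predict(train_vectors, train_labels, x):
    """Predict the label of x as the label of its closest training vector."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    best = min(range(len(train_vectors)),
               key=lambda i: squared_distance(train_vectors[i], x))
    return train_labels[best]


def accuracy(predicted, truth):
    """Fraction of predicted labels that agree with the ground-truth labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Vectorial training and testing sets produced by the second stage (toy values).
train_vectors = [[1, 0], [1, 1], [0, 1]]
train_labels = ["c1", "c1", "c2"]
test_vectors = [[0.9, 0.1], [0.0, 0.8]]
test_labels = ["c1", "c2"]

predictions = [nearest_neighbor_predict(train_vectors, train_labels, x)
               for x in test_vectors]
acc = accuracy(predictions, test_labels)  # 1.0 on this toy data
```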

GENERAL FRAMEWORK FOR FEATURE-BASED CLASSIFICATION
In this section, we show that many existing feature-based sequence classification algorithms can be reformulated within the presented reference-based framework. The differences between these algorithms mainly lie in the selection of reference points and similarity functions. As summarized in Table 1, several observations can be made.

[Figure 3. Feature value generation, model construction and prediction: ① calculating the similarities between training instances and reference points; ② calculating the similarities between testing instances and reference points; ③ model construction; ④ prediction; ⑤ output.]

First of all, any sequence over the alphabet can be a potential member of the candidate set of reference points CR. However, all feature-based sequence classification algorithms in Table 1 use SubTrainD to construct CR since the idea of using subsequences as features is quite natural with a good interpretability. Although SubTrainD is a finite set, its size is still very large and most sequences in SubTrainD are useless and redundant for classification. Therefore, it is necessary to explore alternative methods for constructing the set of candidate reference points. For instance, we may use all original sequences in TrainD to construct CR, so that the size of CR will be greatly reduced and the corresponding features may be more representative.
Second, many sequence selection criteria have been proposed to select R from CR, such as minsup and mindisc. The main objective of applying these criteria is to select a subset of sequences that can generate good features for building the classifier. However, it is not an easy task to set suitable thresholds for these constraints to produce a set of reference sequences with moderate size. More importantly, most of these constraints are proposed from the literature of sequential pattern mining, which may be only applicable to the selection of reference sequences from SubTrainD. In other words, more general reference point selection strategies should be developed.
Last, the most widely used similarity function in Table 1 is SF 1, which is a boolean function based on whether the reference point is a subsequence of the sequence in TrainD. Although some non-boolean functions have been used, the potential of utilizing more elaborate similarity functions between two sequences still needs further investigation.
Overall, our reference-based sequence classification framework is quite generic, in which many existing pattern-based sequence classification methods can be reformulated as its special variants. Meanwhile, there are still many limitations in current research efforts under this framework. Hence, new and effective sequence classification methods should be developed in this direction.

NEW VARIANTS UNDER THE FRAMEWORK
In addition to encompassing existing pattern-based methods, this framework can also be used as a general platform to design new feature-based sequence classification methods.
As discussed in Section 4, there are three key ingredients in our framework: the construction of the candidate reference point set, the selection of reference points and the selection of the similarity function. Obviously, we will generate a "new" sequence classification algorithm based on an unexplored combination of these three components. Since the number of possible combinations is quite large, it is infeasible to enumerate all these variants. Instead, we will only present two variants that are quite different from existing algorithms to demonstrate the advantage of this framework.

The Use of Training Set as the Candidate Set
Within our framework, all previous pattern-based sequence classification methods utilize the set SubTrainD as the candidate reference point set CR in the first step. One limitation of this strategy is that the actual size of CR will be very large. As a result, it poses great challenges for the reference point selection task in the subsequent step. To alleviate these issues, we propose to use all original sequences in TrainD to construct the set of candidate reference points. The rationale for this candidate set construction method is based on the following observations. Firstly, all information given for building the classifier is contained in the original training set. In other words, we will not lose any relevant information for the classification task if TrainD is used as the candidate set of reference sequences. In fact, the widely used candidate set SubTrainD is derived from TrainD.
Secondly, even if we use all the training sequences in TrainD as the reference points, the transformed vectorial data will be a |TrainD| × |TrainD| table. That is, the number of features is still no larger than the number of samples. Therefore, we do not need to analyze an HDLSS (high-dimension, low-sample-size) data set during the classification stage. In contrast, the number of features may be much larger than the number of samples in the vectorial data obtained from SubTrainD if the parameters are not properly specified during the reference point selection procedure. In fact, we have tested the performance when all training sequences are used as the reference points. The experimental results show that this quite simple idea is able to achieve comparable performance in terms of classification accuracy.
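The second observation can be illustrated by a short sketch in which every training sequence serves as a reference point, so that the transformed training data form a square table. The Jaccard coefficient is computed here over the sets of distinct items of the two sequences; this particular definition, like the function name and the toy data, is an assumption made for the example.

```python
def jaccard(s, t):
    """Jaccard coefficient between the item sets of two sequences
    (assumed definition: |A ∩ B| / |A ∪ B| over distinct items)."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


train_sequences = ["abc", "abd", "xyz"]
reference_points = train_sequences  # CR = R = TrainD

features = [[jaccard(s, r) for r in reference_points] for s in train_sequences]
# A 3 x 3 table: one row per training sequence, one column per reference point.
# features[0] == [1.0, 0.5, 0.0]
```

A testing sequence is transformed in the same way, which yields one row with |TrainD| feature values.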
Finally, the same idea has been employed in the literature of time series classification [41], [42]. Its success motivates us to investigate the feasibility and advantage in the context of discrete sequence classification.

Two Reference Point Selection Methods
To select reference sequences from TrainD, those existing constraints proposed in the context of sequential pattern mining are not applicable. Therefore, we have to develop new algorithms to choose a subset of representative reference sequences from TrainD. To this end, two different reference sequence selection methods are presented. The first one is an unsupervised method, which selects reference sequences based on cluster analysis without considering the class label information. The second one is a supervised method, which evaluates each candidate sequence according to its discriminative ability across different classes. In the following two sub-sections, we will present the details of these two reference point selection algorithms.

Unsupervised Reference Point Selection
As we have discussed in Section 5.1, we may choose all sequences in the training set as reference points. However, the number of features in the transformed vectorial data can still be very large if the number of training instances is large. The selection of a small subset of representative training sequences as reference points will greatly reduce the computational burden in the subsequent stage. One natural idea is to divide the training sequences in CR into different clusters using a clustering algorithm [49]. Then, we can select one representative sequence from each cluster as the reference point.
To date, many algorithms have been presented for clustering discrete sequences (e.g. [50]). We can simply adopt one existing sequence clustering algorithm in our pipeline. Here we choose the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm [51] to fulfill the sequence clustering task. This algorithm is used because it can often generate a high-quality clustering result and can handle any form of similarity measure.
The reference point selection method based on GAHC is shown in Algorithm 1. In the following, we will describe the details of this algorithm.
In the first stage (steps 1-7), each sequence S_i in CR will form a cluster c_i. That is, clusterset is initially composed of |CR| clusters.
In the second stage (steps 8-12), one similarity function is used to calculate the similarity between each pair of clusters to produce a similarity matrix Sim, where Sim[i, j] is the similarity between the two clusters c_i and c_j. Many similarity measures have been presented for sequential data (e.g. [52]). Here we choose the Jaccard coefficient. More specific details on the similarity function will be discussed in Section 5.3.
In the third stage (steps 13-33), we first search the similarity matrix Sim to identify the maximum value maxSim, which corresponds to the most similar pair of clusters c_k and c_l. Then, these two clusters are merged to form a new cluster c_k and the number of clusters in clusterset is decreased by 1. Meanwhile, the entries related to c_l in Sim are set to be 0 and Sim is updated by recalculating the similarity between c_k and each of the remaining clusters. The similarity between the newly generated cluster and each of the remaining clusters is calculated as the average similarity between all members in the two clusters since we use the group-average method. We repeat the third stage until the number of clusters is equal to the number of reference points we want to select.
In the last stage (steps 34-37), we select one representative sequence from each cluster in clusterset. For each cluster, any sequence in this cluster can be used as a representative. To provide a consistent and deterministic output, we use the sequence with the minimum subscript in the cluster as the reference point.
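The four stages above can be sketched compactly as follows. This is an illustrative simplification rather than a transcription of Algorithm 1: the function name select_references_gahc is hypothetical, the group-average similarity is recomputed from the pairwise matrix at every merge instead of being maintained in place, and ties are resolved in favour of the first maximal pair found, which may differ from the tie-breaking of the original pseudocode.

```python
from typing import List, Optional, Tuple


def select_references_gahc(sim: List[List[float]], num_points: int) -> List[int]:
    """Return indices of representative sequences, one per final cluster.

    sim is a symmetric matrix of pairwise sequence similarities.
    """
    # Stage 1: every sequence starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(sim))]

    def group_average(a: List[int], b: List[int]) -> float:
        # Average similarity over all cross-cluster member pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    # Stages 2-3: repeatedly merge the most similar pair of clusters.
    while len(clusters) > max(1, num_points):
        best: Optional[Tuple[float, int, int]] = None
        for x in range(len(clusters) - 1):
            for y in range(x + 1, len(clusters)):
                value = group_average(clusters[x], clusters[y])
                if best is None or value > best[0]:
                    best = (value, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Stage 4: the member with the minimum index represents each cluster.
    return sorted(min(c) for c in clusters)


# Toy matrix: sequences 0 and 1 are similar, as are 2 and 3.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
references = select_references_gahc(toy_sim, num_points=2)
```

Recomputing the averages keeps the sketch short at the price of a higher running time; an implementation intended for large training sets would update the similarity matrix incrementally, as described in the third stage.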

Supervised Reference Point Selection
To choose a subset of representative reference sequences from TrainD, we can also employ a supervised method in which the class label information is utilized. As we have discussed in Section 4, different mindisc constraints have been widely used to evaluate the discriminative power of sequential patterns. Unfortunately, these constraints are only applicable to the selection of reference points from SubTrainD. In addition, it is not an easy task to set suitable thresholds to control the number of selected reference points. In order to overcome these limitations, we present a reference point selection method based on hypothesis testing, in which the statistical significance in terms of p-value is used to assess the discriminative power of each candidate sequence.

Algorithm 1 Reference Point Selection Based on GAHC
...
6:     clusterset ← clusterset ∪ c_i;
7: end for
8: for i ← 0 to |CR| − 2 do
9:     for j ← i + 1 to |CR| − 1 do
10:        calculate Sim[i, j];
11:    end for
12: end for
13: repeat
14:    k ← 0;
15:    l ← 0;
16:    maxSim ← 0;
17:    for i ← 0 to |CR| − 2 do
18:        for j ← i + 1 to |CR| − 1 do
19:            if Sim[i, j] ≥ maxSim then
20:                k ← i;
21:                l ← j;
22:                maxSim ← Sim[i, j];
23:            end if
24:        end for
25:    end for
26:    c_k ← c_k ∪ c_l;
27:    |clusterset| ← |clusterset| − 1;
...

Hypothesis testing is a commonly used method in statistical inference. The usual line of reasoning is as follows: first, formulate the null hypothesis and the alternative hypothesis; second, select an appropriate test statistic; third, set a significance level threshold; finally, reject the null hypothesis if and only if the p-value is less than the significance level threshold, where the p-value is the probability of getting a value of the test statistic that is at least as extreme as what is actually observed on condition that the null hypothesis is true.
In order to assess the discriminative power of each candidate sequence in terms of p-value, we can use the null hypothesis that this sequence does not belong to any class and all sequences from different classes are drawn from the same population. If the above null hypothesis is true, then the similarities between the candidate sequence and training sequences are drawn from the same population. Therefore, we can formulate the corresponding hypothesis testing problem as a two-sample testing problem [53], where one sample is the set of similarities between the candidate sequence and the training sequences from one target class and another sample is the set of similarities between the candidate sequence and the training sequences from the remaining classes.
Since we test all candidate sequences in CR at the same time, it is actually a multiple hypothesis testing problem. If no multiple testing correction is conducted, then the number of false positives among reported reference sequences may be very high. To tackle this problem, we adopt the BH procedure to control the FDR (False Discovery Rate) [54], which is the expected proportion of false positives among all reported sequences.
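As an illustration of the BH procedure, the following minimal sketch takes a list of p-values and returns the indices of the hypotheses that are rejected at a given FDR level. The function name benjamini_hochberg and the toy p-values are hypothetical; the step-up rule itself (find the largest rank k with p_(k) ≤ α · k / m and reject all hypotheses ranked up to k) is the standard one.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], alpha: float = 0.05) -> List[int]:
    """Indices of hypotheses rejected by the BH step-up procedure."""
    m = len(pvalues)
    # Rank hypotheses by ascending p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its own threshold.
        if pvalues[idx] <= alpha * rank / m:
            max_rank = rank
    # Everything ranked at or before max_rank is reported.
    return sorted(order[:max_rank])


# Toy p-values: the second smallest misses its threshold (0.03 > 0.025),
# but the third passes (0.035 <= 0.0375), so the first three are all kept.
toy_pvalues = [0.001, 0.03, 0.035, 0.9]
selected = benjamini_hochberg(toy_pvalues, alpha=0.05)
```

The toy input shows why the procedure is a step-up rule: a hypothesis that fails its own threshold is still reported when a hypothesis with a larger p-value passes.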
The reference point selection method based on MHT (Multiple Hypothesis Testing) is shown in Algorithm 2. In the following, we will describe this algorithm in detail.

Algorithm 2 Reference Point Selection Based on MHT
...
5:     for each sequence S_k in D+ do
6:         Sim+ ← ∅;
...
14:            Sim− ← Sim− ∪ {Sim[k, j]};
15:        end for
16:        S_k.pvalue ← Utest(Sim+, Sim−);
17:    end for
18:    sort D+;
19:    maxindex ← 0;
20:    for each sequence S_k in D+ do
21:        if S_k.pvalue ≤ α · k / |D+| then
22:            maxindex ← k;
23:        end if
24:    end for
25:    for k ← maxindex + 1 to |D+| − 1 do
26:        ...
27:    end for
28:    R ← R ∪ D+;
29: end for
30: return R;

In the first stage (steps 1-4), we select a set of sequences D_ci with the class label c_i from CR, then we regard D_ci as the positive data set D+ and use the set of all remaining sequences in CR as the negative data set D−.
In the second stage (steps 5-17), for each sequence S_k in D+, one similarity function is used to calculate the similarity between S_k and each sequence in D+ and D−, where the similarity function is the same as that used in Section 5.2.1 and Sim[k, j] is the similarity between the two sequences S_k and S_j. Then, the Mann-Whitney U test [55] is used to calculate the p-value based on the two similarity sets Sim+ and Sim−.
In the last stage (steps 28-30), we select all sequences from D+ as reference points. The whole process will be terminated after each set of sequences from every class has been regarded as D+.
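The p-value computation of the second stage can be sketched as follows. This is a simplified illustration under stated assumptions: the Mann-Whitney U test is implemented in pure Python with a two-sided normal approximation, tie correction and no continuity correction, which may differ from the exact test used in our experiments; the function names are hypothetical; and the similarity of a candidate to itself is excluded from Sim+, which is a choice made for this sketch.

```python
import math
from typing import Dict, List, Sequence


def mann_whitney_pvalue(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Mann-Whitney U test p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        size = j - i
        tie_term += size ** 3 - size
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0  # all values tied: no evidence against the null hypothesis
    z = (u1 - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))


def candidate_pvalues(sim: List[List[float]], labels: List[str],
                      target: str) -> Dict[int, float]:
    """p-value of every candidate whose class label equals target."""
    pvalues = {}
    for k, label in enumerate(labels):
        if label != target:
            continue
        # Similarities to the positive class (self excluded) and to the rest.
        sim_pos = [sim[k][j] for j, lab in enumerate(labels) if lab == target and j != k]
        sim_neg = [sim[k][j] for j, lab in enumerate(labels) if lab != target]
        pvalues[k] = mann_whitney_pvalue(sim_pos, sim_neg)
    return pvalues


# Two clearly separated samples of similarities give a small p-value.
p_separated = mann_whitney_pvalue([0.9, 0.8, 0.85, 0.95, 0.7],
                                  [0.1, 0.2, 0.15, 0.3, 0.25])
```

The resulting p-values of one class would then be passed through the BH correction before the surviving sequences are added to R.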

Similarity Function
In order to measure the similarity between two sequences, we choose the Jaccard coefficient as the similarity function in our method. The larger the Jaccard coefficient between two sequences is, the more similar they are.
Given two sequences s = ⟨s_1, s_2, ..., s_l⟩ and t = ⟨t_1, t_2, ..., t_r⟩, the Jaccard coefficient is defined as:

$$J(s, t) = \frac{|s \cap t|}{|s \cup t|}$$

where |s ∩ t| is the number of items in the intersection of s and t, and |s ∪ t| is the number of items in the union of s and t. However, this definition treats the sequences as sets of items and may therefore lose the order information of sequences.
To alleviate this issue, we use the LCS (Longest Common Subsequence) between s and t in place of s ∩ t. Then, the Jaccard coefficient is redefined as:

$$J(s, t) = \frac{|LCS(s, t)|}{|s| + |t| - |LCS(s, t)|}$$

Example 1. Given two sequences s = ⟨a, b, c, d, e⟩ and t = ⟨e, c, d, c⟩, LCS(s, t) is ⟨c, d⟩, so the modified Jaccard coefficient is J(s, t) = 2 / (5 + 4 − 2) = 2/7 ≈ 0.286.

Note that we can also use other similarity functions in the literature, such as those methods summarized and reviewed in [52]. The choice of a more appropriate similarity function may yield better performance than the modified Jaccard coefficient. In order to check the effect of the similarity function on the classification performance, we also consider the following two alternative similarity functions.
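Before turning to these alternatives, the modified Jaccard coefficient can be made concrete with a short sketch applied to the two sequences of Example 1. The function names are hypothetical, and the denominator is assumed to count the union as |s| + |t| − |LCS(s, t)|, which is the natural counterpart of |s ∪ t| once the intersection is replaced by the LCS.

```python
from typing import Sequence


def lcs_length(s: Sequence, t: Sequence) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one item
        prev = curr
    return prev[-1]


def modified_jaccard(s: Sequence, t: Sequence) -> float:
    """LCS-based Jaccard coefficient (assumed form, see the lead-in)."""
    common = lcs_length(s, t)
    union = len(s) + len(t) - common
    return common / union if union else 0.0


s = ["a", "b", "c", "d", "e"]
t = ["e", "c", "d", "c"]
value = modified_jaccard(s, t)  # LCS is <c, d>, so 2 / (5 + 4 - 2) = 2/7
```

Unlike the set-based coefficient, this variant is sensitive to order: reversing one of two identical sequences with distinct items lowers the value below 1.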
The first one is the String Subsequence Kernel (SSK) [35]. The main idea of SSK is to compare two sequences by means of the subsequences they contain in common. That is, the more subsequences in common, the more similar they are.
Given two sequences s = ⟨s_1, s_2, ..., s_l⟩ and t = ⟨t_1, t_2, ..., t_r⟩ and a parameter n, the SSK is defined as:

$$K_n(s, t) = \sum_{u \in I^n} \phi_u(s)\, \phi_u(t), \qquad \phi_u(s) = \sum_{\mathbf{i}:\, u = s[\mathbf{i}]} \lambda^{l_s(u)}$$

where φ_u(s) is the feature mapping for the sequence s and each u ∈ I^n, I is a finite alphabet, I^n is the set of all subsequences of length n, u is a subsequence of s such that u_1 = s_{i_1}, u_2 = s_{i_2}, ..., u_n = s_{i_n}, l_s(u) = i_n − i_1 + 1 is the length of u in s, and λ ∈ (0, 1) is a decay factor which is used to penalize the gap. The calculation steps are as follows: enumerate all subsequences of length n, compute the feature vectors for the given two sequences, and then compute the similarity. The normalized kernel value is given by

$$\hat{K}_n(s, t) = \frac{K_n(s, t)}{\sqrt{K_n(s, s)\, K_n(t, t)}}$$

For the sequences s = ⟨a, b, c, d, e⟩ and t = ⟨e, c, d, c⟩ with n = 1, the corresponding feature vectors over the items a, b, c, d, e can be denoted as φ_1(s) = ⟨λ, λ, λ, λ, λ⟩ and φ_1(t) = ⟨0, 0, 2λ, λ, λ⟩, then the normalized kernel value is

$$\hat{K}_1(abcde, ecdc) = \frac{K_1(abcde, ecdc)}{\sqrt{K_1(abcde, abcde)\, K_1(ecdc, ecdc)}} = \frac{4\lambda^2}{\sqrt{5\lambda^2 \cdot 6\lambda^2}} = \frac{4}{\sqrt{30}} \approx 0.730$$

When this function is employed in our method, n = 1 is used as the default parameter setting. Although the setting of n = 1 may lose the order information, it will greatly reduce the computational cost and can provide satisfactory results in practice.
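For the default setting n = 1, the kernel reduces to a comparison of item counts, since every occurrence of an item contributes λ to its feature. The following sketch covers only this special case; the function names ssk1 and normalized_ssk1 and the value λ = 0.5 are illustrative assumptions, and the general case n > 1 would require the usual dynamic programming formulation of the kernel.

```python
import math
from collections import Counter
from typing import Sequence


def ssk1(s: Sequence, t: Sequence, lam: float = 0.5) -> float:
    """String subsequence kernel restricted to subsequences of length 1."""
    count_s, count_t = Counter(s), Counter(t)
    # The feature of item u is (number of occurrences of u) * lam.
    return sum((count_s[u] * lam) * (count_t[u] * lam)
               for u in count_s if u in count_t)


def normalized_ssk1(s: Sequence, t: Sequence, lam: float = 0.5) -> float:
    """Kernel value normalized by the self-similarities of s and t."""
    denom = math.sqrt(ssk1(s, s, lam) * ssk1(t, t, lam))
    return ssk1(s, t, lam) / denom if denom else 0.0


# Sequences of the worked example: K1(s, t) = 4*lam^2, K1(s, s) = 5*lam^2,
# K1(t, t) = 6*lam^2, so the normalized value is 4 / sqrt(30).
k_hat = normalized_ssk1("abcde", "ecdc")
```

For n = 1 the decay factor cancels in the normalization, so the normalized value does not depend on the choice of λ.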
Another alternative similarity function is the normalized LCS (Longest Common Subsequence). The larger the normalized LCS between two sequences is, the more similar they are.
Given two sequences s = ⟨s_1, s_2, ..., s_l⟩ and t = ⟨t_1, t_2, ..., t_r⟩, the normalized LCS is obtained by normalizing |LCS(s, t)| with respect to the lengths of the two sequences, where |LCS(s, t)| is the length of the longest common subsequence, |s| is the length of s, and |t| is the length of t.

EXPERIMENTS
To demonstrate the feasibility and advantages of this new framework, we conducted experiments on fourteen real sequential data sets. We compared our two algorithms derived under the reference-based framework with other sequence classification algorithms in terms of classification accuracy. All experiments were conducted on a PC with Intel(R) Xeon(R) CPU 2.40GHz and 12G Memory. All the reported accuracies in the experiments were the average accuracies obtained by repeating the 5-fold cross-validation 5 times except SCIP (accuracies in SCIP were obtained using 10-fold cross-validation).

Data Sets
We choose fourteen benchmark data sets which are widely used for evaluating sequence classification algorithms: Activity [56], Aslbu [13], Auslan2 [13], Context [57], Epitope [11], Gene [58], News [4], Pioneer [13], Question [59], Reuters [4], Robot [4], Skating [13], Unix [4], Webkb [4]. The main characteristics of these data sets are summarized in Table 2, where |D| represents the number of sequences in the data set, #items denotes the number of distinct elements, minl, maxl and avgl are used to denote the minimum length, maximum length and average length of the sequences respectively, and #classes represents the number of distinct classes in the data set.

Parameter Settings
Our two algorithms are denoted by R-MHT (Reference Point Selection Based on MHT) and R-GAHC (Reference Point Selection Based on GAHC), respectively. In addition, the method that uses all sequences in TrainD as reference points is denoted as R-A, which is also included in the performance comparison. We compare our algorithms with five existing sequence classification algorithms: MiSeRe [16], Sqn2Vec [40], SCIP [4], FSP (the algorithm based on frequent sequential patterns) and DSP (the algorithm based on discriminative sequential patterns). In MiSeRe, num of rules is specified to be 1024 and execution time is set to be 5 minutes for all data sets.
Sqn2Vec is an unsupervised method for learning sequence embeddings from both singleton symbols and sequential patterns. It has two variants: Sqn2VecSEP and Sqn2VecSIM, where Sqn2VecSEP (Sqn2VecSIM) generates sequence representations from singleton symbols and sequential patterns separately (simultaneously). In these two variants, minsup = 0.05, maxgap = 4 and the embedding dimension d is set to be 128 for all data sets.
SCIP is a sequence classification method based on interesting patterns, which has four different variants: SCII_HAR, SCII_MA, SCIS_HAR and SCIS_MA. In the experiments, the following parameter setting is used in all data sets: minsup = 0.05, minint = 0.02, maxsize = 3, conf = 0.5 and topk = 11. Frequent sequential patterns have been widely used as features in sequence classification. To include the algorithm based on frequent sequential patterns in the comparison (denoted by FSP), we employ the PrefixSpan algorithm [60] as the frequent sequential pattern mining algorithm. The parameters are specified as follows: maxsize = 3 and minsup = 0.3 for all data sets except Context (the minsup in Context is set to be 0.9 in order to avoid the generation of too many patterns).
Similarly, discriminative sequential patterns are widely used as features in many sequence classification algorithms and applications as well. To include the algorithm based on discriminative sequential patterns in the comparison (denoted by DSP), we first use the PrefixSpan algorithm to mine a set of frequent sequential patterns and then detect discriminative patterns from the frequent pattern set. The parameters for PrefixSpan are identical to those used in FSP and minGR = 3 is used as the threshold for filtering discriminative sequential patterns.

Results
In Table 3, the performance comparison results in terms of classification accuracies are presented. Note that the result of DSP on the Skating data set is N/A because we cannot find any discriminative patterns from this data set based on the given parameter setting. In the experiments, α = 0.05 is used for R-MHT and pointnum is specified to be 1/10 of the size of TrainD for R-GAHC. After transforming sequences into feature vectors, we chose NB (Naive Bayes), DT (Decision Tree), SVM (Support Vector Machine) and KNN (k Nearest Neighbors) as the classifiers. The implementation of each classifier was obtained from WEKA [61] except for Sqn2Vec. In Sqn2Vec, all classifiers were obtained from scikit-learn [62] since its source code is written in Python.
In order to have a global picture of the overall performance of different algorithms, we calculate the average accuracy over all data sets for each classifier. The corresponding average accuracies for different methods are recorded in Table 4. The results show that among our two methods, R-MHT can achieve better performance than R-GAHC when NB, DT and SVM are used as the classifier. However, R-MHT performs poorly when KNN is used as the classifier. In addition, the R-A method outperforms R-MHT and R-GAHC since we will not lose any relevant information for the classification task when all training sequences are used as reference points. However, the feature dimension will be very high in R-A, which will incur high computational cost in practice.
Compared with other classification methods, our methods are able to achieve comparable performance. In particular, R-A and MiSeRe [16] can achieve the highest average classification accuracy among all competitors. This is notable since R-A is a very simple algorithm derived from our framework. This indicates that the proposed reference-based sequence classification framework is quite useful in practice. It can be expected that more accurate feature-based sequence classification methods will be developed under this framework in the future. From Table 3 and Table 4, it can also be observed that none of the algorithms in the comparison can always achieve the best performance across all data sets. Therefore, more research efforts should still be devoted to the development of effective sequence classification algorithms.
The use of different similarity functions may affect the performance of our algorithms. To investigate this issue, we use two additional similarity functions in the experiments for comparison: SSK and the normalized LCS, whose details have been introduced in Section 5.3. Table 5 presents the average classification accuracies of different similarity functions over all data sets. Jaccard coefficient, SSK and normalized LCS are denoted as J, S and N, respectively. In Table 5, R-A-J means that the Jaccard coefficient is used as the similarity function in R-A. Other notations in this table can be interpreted in a similar manner. The results show that the use of different similarity functions can affect the performance of our algorithms. Among these three similarity functions, the use of Jaccard coefficient as the similarity function can achieve better performance in most cases. However, R-MHT-J has unsatisfactory performance when KNN is used as the classifier. It can be also observed that none of the similarity functions is always the best performer. Therefore, more suitable similarity functions should be developed.

CONCLUSION
In this paper, we present a reference-based sequence classification framework by generalizing the pattern-based methods. This framework is quite general and flexible, and can be used as a general platform to develop new algorithms for sequence classification. To verify this point, we present several new feature-based sequence classification algorithms under this new framework. A series of comprehensive experiments on real data sets show that our methods are capable of achieving classification accuracy comparable to that of existing sequence classification algorithms. Thus, the reference-based sequence classification framework is quite promising and useful in practice.
In future work, we intend to explore more appropriate reference sequence selection methods and similarity functions to improve the performance and reduce the computational cost. As a result, more accurate feature-based sequence classification methods could be developed under this framework.

Table 5: The average classification accuracies of different similarity functions over all data sets used in the experiment.