Identify Significant Phenomenon-Specific Variables for Multivariate Time Series

Multivariate time series (MTS) are collected for different variables in studying scientific phenomena or monitoring system health where each time series records the values of one variable for a time period. Among the different variables, it is common that only a few variables contribute significantly to a specific phenomenon. Furthermore, the variables contributing significantly to different phenomena are often different. We denote the different variables that contribute to the occurrences of different phenomena as <italic>Phenomenon-specific Variables (PVs)</italic>. In this paper, we formulate a <italic>novel problem</italic> of identifying significant PVs from MTS datasets. To analyze MTS data, feature extraction techniques have been extensively studied. However, most of them identify important <italic>global</italic> features for one dataset and do not utilize the temporal order of time series. To solve the newly introduced problem, we propose a solution framework, <italic>CNN<inline-formula><tex-math notation="LaTeX">$_{mts}$</tex-math><alternatives><mml:math><mml:msub><mml:mrow/><mml:mrow><mml:mi>m</mml:mi><mml:mi>t</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math><inline-graphic xlink:href="hao-ieq1-2934464.gif"/></alternatives></inline-formula>-X</italic>, which is a new variant of the Convolutional Neural Networks (<italic>CNN</italic>) and can embed other feature extraction techniques (as <italic>X</italic>). 
Furthermore, we design a <italic>CNN<inline-formula><tex-math notation="LaTeX">$_{mts}$</tex-math><alternatives><mml:math><mml:msub><mml:mrow/><mml:mrow><mml:mi>m</mml:mi><mml:mi>t</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math><inline-graphic xlink:href="hao-ieq2-2934464.gif"/></alternatives></inline-formula>-LR</italic> method that implements a new feature identification approach (<italic>LR</italic>) as <italic>X</italic> in the <italic>CNN<inline-formula><tex-math notation="LaTeX">$_{mts}$</tex-math><alternatives><mml:math><mml:msub><mml:mrow/><mml:mrow><mml:mi>m</mml:mi><mml:mi>t</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math><inline-graphic xlink:href="hao-ieq3-2934464.gif"/></alternatives></inline-formula>-X</italic> framework. The <italic>LR</italic> method leverages both Linear Discriminant Analysis (<italic>LDA</italic>) and Random Forest (<italic>RF</italic>). Our extensive experiments on five real datasets show that the <italic>CNN<inline-formula><tex-math notation="LaTeX">$_{mts}$</tex-math><alternatives><mml:math><mml:msub><mml:mrow/><mml:mrow><mml:mi>m</mml:mi><mml:mi>t</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math><inline-graphic xlink:href="hao-ieq4-2934464.gif"/></alternatives></inline-formula>-LR</italic> method exhibits much better performance than several other baseline methods. Using 30 percent of the PVs discovered by <italic>CNN<inline-formula><tex-math notation="LaTeX">$_{mts}$</tex-math><alternatives><mml:math><mml:msub><mml:mrow/><mml:mrow><mml:mi>m</mml:mi><mml:mi>t</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math><inline-graphic xlink:href="hao-ieq5-2934464.gif"/></alternatives></inline-formula>-LR</italic>, classification can achieve similar or better performance than using all the variables.


INTRODUCTION
Many applications collect multivariate time series (MTS) for different variables, where one variable's time series records the values of this variable over a time period. For example, in tracking human-body movement, multiple sensors (which are treated as variables) are attached to different parts of a body to collect their location information; in environmental sciences, different sensors are used to track environmental information such as temperature and soil moisture. MTS data are typically associated with corresponding phenomena labeled as classes (e.g., walking, sitting, budding). Utilizing both the MTS data and their corresponding class labels, scientists can conduct predictions or classifications. Very often, it is desired to make predictions or classifications as accurate as possible.
However, generating highly accurate predictions is not sufficient. In many situations, it is even more important to understand which variables are most critical for phenomenon interpretation or decision making. We observe that, among all the variables, it is common that only a few variables contribute significantly to a specific phenomenon. Furthermore, the variables contributing significantly to different phenomena are different. For example, in tracking human body movement, we observe that sensors attached to the lower legs can help identify walking activities better than sensors attached to the upper arms. Thus, it is more useful to monitor different sets of sensors when a person is conducting different activities (sitting or walking). Our observation of different variables contributing to different phenomena is also utilized in clustering analysis, where projected clustering (PC) [1], [2] obtains groups of points that are close in different subsets of dimensions. However, typical PC does not work well for variable selection on MTS data. PC treats all the values in a time series as independent dimensions (i.e., each time point is a dimension); thus, the clusters are time-point specific instead of variable specific.
We denote the different variables that contribute significantly to different phenomena as Phenomenon-specific Variables (PVs). PVs carry the most critical information for a specific phenomenon. We formulate a novel problem of identifying significant PVs from multivariate time series. Note that the solution to this problem is not finding different features for better predictions. Instead, we are interested in finding variables that make critical contributions to the explanation of specific phenomena (or events).
The proposed problem is different from existing efforts that analyze time series data. Most existing techniques identify global features for one dataset (e.g., [3], [4], [5], [6], [7], [8]). Such global features are used together to analyze the different events in one dataset. PVs are different from global features because they are specific to different phenomena. Due to such differences, most existing techniques cannot be directly utilized to solve our proposed problem.
Two major challenges need to be addressed to solve the proposed problem. The first challenge comes from the large amount of computation over a huge search space. Assume that the MTS datasets are collected for A variables and the time series instances correspond to E different event types; then the possible number of variable subsets is E × (2^A − 1), which is the search space of the PVs. The second challenge comes from the nature of time series, whose values are recorded in a temporal order. Treating these values with or without the temporal order may generate very different results. A successful example of utilizing the temporal order of the values is the Shapelets approach [7]. Shapelets approaches are orthogonal to our methods because they identify important subsequences (for multiple or all variables) in MTS data, while our work detects the important variables among all the variables. In the calculation of PVs, we want to consider the temporal order of values in each variable's time series.
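The size of this search space can be made concrete with a short sketch (the function name pv_search_space is ours, for illustration only):

```python
def pv_search_space(num_variables: int, num_event_types: int) -> int:
    """Number of candidate (event type, variable subset) pairs:
    E * (2^A - 1), i.e., every non-empty subset of the A variables,
    considered separately for each event type."""
    return num_event_types * (2 ** num_variables - 1)

print(pv_search_space(3, 2))  # 2 * (2**3 - 1) = 14
```

Even modest settings grow quickly: with A = 45 variables and E = 19 event types the count already exceeds 10^14, which is why exhaustive search over variable subsets is impractical.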
This paper proposes a new solution framework, CNN_mts-X, to solve the problem. This framework designs a variant of Convolutional Neural Networks (CNN), denoted as CNN_mts, and allows flexible utilization of other feature extraction techniques as X. We also present a new PV identification algorithm, LR, which takes advantage of both Linear Discriminant Analysis (LDA) [9] and Random Forest (RF) [10]. CNN_mts can capture the temporal order of values in a time series, and LR identifies the PV sets while reducing the search space. The contributions of this paper are as follows.
• We formulate the novel problem of discovering significant PVs from MTS data.
• We evaluated the effectiveness and efficiency of our proposed techniques using five real datasets of different sizes. The experiments show that CNN_mts-LR outperforms the other methods.
The paper is organized as follows. Section 2 formally defines the problem and related terminology. Section 3 presents our proposed CNN_mts-X framework and the new LR method. Section 4 experimentally demonstrates the effectiveness and efficiency of our proposed approaches using real datasets. Section 5 discusses the related literature. Finally, Section 6 concludes our work.

PROBLEM FORMULATION AND TERMINOLOGY
This section introduces the terminology used to formally formulate the problem that we are going to solve.

Definition 1. A variable for a multivariate time series is a factor in the time series. If a multivariate time series consists of observations for A variables, these variables are denoted as a_1, a_2, · · ·, a_A.
In different applications that collect multivariate time series data, variables represent different meanings. For example, in human body movement, a variable can be a sensor that is attached to a specific part of a human body. For each variable, values at different times can be recorded. Such values form a sequence (or time series). Formally,

Definition 2. An m-sequence S is in the form of (v_1, t_1), (v_2, t_2), · · ·, (v_m, t_m) where t_i < t_j for 1 ≤ i < j ≤ m, v_i is either a categorical or a numerical value recorded for one variable at time point t_i, and m is the length (or the number of temporal points) of the variable sequence.
When the time intervals between consecutive t_i's are fixed, this sequence can be simplified to (v_1, v_2, · · ·, v_m). Each sequence is for one variable.

Definition 3.
An event type, denoted as et, is the phenomenon that a study is interested in. Let E denote the total number of event types. One event type can have many corresponding instances. An event instance is represented as et_i.
In the study of human body movement, there can be 10-20 different event types for people's activities (e.g., sitting, running). For each specific event type (e.g., sitting), there can be hundreds or thousands of instances. Event types and event instances in our problem are analogous to class labels and instances in classification problems.

Definition 4.
A multivariate time series (MTS) contains A m-sequences. Formally, one MTS can be represented as (S_1, S_2, · · ·, S_A). Each MTS corresponds to an event type (e.g., a person is running) and records the values for all the variables that contribute to the occurrence of one event.
To study which variables contribute more to an event, all the variables for which an MTS is collected need to be investigated. However, as discussed before, among all the variables, different variables may contribute significantly to different phenomena.

Definition 5. Phenomenon-specific variables (PVs) for an event type are the variables that contribute significantly to the occurrence of that event type.

Definition 6.
The problem of identifying phenomenon-specific variables from MTS data takes as input (i) a set of MTS associated with event types, and (ii) a number σ ∈ (0, 1], and finds the top σ × A variables {a_i,1, · · ·, a_i,σ×A} for each event type et_i such that the chosen variables contribute the most to characterizing the given event type.

CONVOLUTIONAL NEURAL NETWORKS BASED APPROACH
This section presents a new framework, CNN_mts-X, to identify PVs from multivariate time series. Convolutional Neural Networks (CNN) are a special type of neural networks (NN). A CNN has special hidden layers, the convolutional layers. Different from the hidden layers in a regular NN, the nodes in convolutional layers are only connected to a small region of the previous layer, which is called the receptive field. The receptive fields are spatially connected to capture the local spatial connectivity when a CNN is utilized in image classification. This idea can be utilized to capture the local temporal connectivity of time series in MTS analysis.
The CNN model is adopted in our proposed framework to address the major challenges discussed in Section 1 for two major reasons. First, in the analysis of MTS, it is necessary to capture the local temporal connectivity in a time series [8], [11], [12], [13], [14], [15], i.e., to let a subsequence contribute to one node in the next layer. Convolutional layers with properly designed kernels can help us achieve this. Second, CNN has shown good performance in classifying large amounts of data in very high-dimensional spaces [16], [17]; thus, adopting CNN can help reduce the computational complexity.
The CNN approach is capable of automatically extracting features from the training datasets and utilizing such features to recognize different phenomena. Note that these features are combinations of different variables in the original MTS. This work, however, does not aim purely at recognizing the different phenomena utilizing the combined features. The purpose of this work, as discussed in Section 1, is to identify the variables (not combined features) that contribute the most to specific phenomena. Thus, the original CNN method cannot directly solve this PV identification problem.
The CNN_mts-X framework works in two steps: (i) the first step (Section 3.1) is to construct and train a CNN_mts model, and (ii) the second step (Section 3.2) is to design a PV Identification (PVI) algorithm to extract significant PVs from the intermediate results of the CNN_mts models. To verify the effect of PVs, classifications can be utilized. Section 3.3 introduces the classification algorithm using the PVs identified by CNN_mts-X.

Proposed CNN_mts model
The first step of the CNN_mts-X framework is to train a variant of the traditional CNN model (CNN_mts) for MTS data. To explain the concepts and the algorithms, we use a running example with the toy dataset in Example 1. The CNN_mts model is based on and improves the model in [12]. Given an MTS training instance (Def. 4), Fig. 1 shows the structure of our CNN_mts model. This model contains L convolutional layers, L−1 pooling layers, and one fully connected layer.

Example 1 (MTS toy data).

In the first convolutional layer, we apply F_1 filters, whose kernels have weights x_1, · · ·, x_k, to the subsequences obtained by sliding a window (whose length is also k) over an MTS instance. In particular, a node h_i,j in the first convolutional layer H_1 is calculated as h_i,j = Σ_{l=j}^{j+k−1} v_i,l · x_{l−j+1}. The different kernels differ in their initial values and are utilized to remove the randomness caused by the kernel initialization. The first convolutional layer has F_1 × A × (m−k+1) nodes because each time series in an MTS instance has length m and the number of subsequences obtained by sliding a length-k window over each variable's series is m−k+1.
Our CNN_mts model applies downsampling to get pooling layers after each convolutional layer. The first pooling layer is obtained by applying F_1 max pooling filters with size 1 × r to the first convolutional layer. In particular, a node p_i,j in the pooling layer P_1 is the maximum value of the r corresponding consecutive nodes in the immediately previous convolutional layer H_1, i.e., p_i,j = max_{j ≤ l ≤ j+r−1} h_i,l. The number of nodes in the first pooling layer is F_1 × A × (m−k−r+2). Our CNN_mts model is different from the model in [12] in that we use sliding windows to get the pooling layers, while the model in [12] utilizes non-overlapping windows. We take the sliding window strategy because we observe that CNN models using sliding windows can achieve more stable performance in each iteration.
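The two node computations above can be sketched directly in NumPy, assuming the per-variable convolution h_i,j = Σ_{l=j}^{j+k−1} v_i,l · x_{l−j+1} and the stride-1 sliding-window pooling described here (function names and the toy input are illustrative, not from the paper):

```python
import numpy as np

def conv_layer(V: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Per-variable 1-D convolution: each variable's length-m series
    is reduced to m-k+1 nodes; variables are never mixed.
    V: (A, m) MTS instance, x: length-k kernel weights."""
    A, m = V.shape
    k = len(x)
    # h[i, j] = sum_{l=j}^{j+k-1} v[i, l] * x[l-j+1]  (0-based here)
    return np.array([[V[i, j:j + k] @ x for j in range(m - k + 1)]
                     for i in range(A)])

def max_pool(H: np.ndarray, r: int) -> np.ndarray:
    """Sliding-window max pooling (stride 1), as in CNN_mts; the
    output has (m-k+1) - r + 1 = m-k-r+2 nodes per variable."""
    A, w = H.shape
    return np.array([[H[i, j:j + r].max() for j in range(w - r + 1)]
                     for i in range(A)])

V = np.arange(12, dtype=float).reshape(2, 6)   # A=2 variables, m=6
H = conv_layer(V, np.array([1.0, 1.0, 1.0]))   # k=3 -> shape (2, 4)
P = max_pool(H, r=2)                           # -> shape (2, 3)
```

The shapes match the node counts in the text: 2 × (6−3+1) convolution nodes per filter, and 2 × (6−3−2+2) pooling nodes.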
Other convolutional and pooling layers are constructed in a similar manner, although the number of convolutional kernels, the kernel sizes for different convolutional layers, and the sizes of pooling filters can be different. The kernel size of the last convolutional layer is set to be the same as the length of the time series output from the previous pooling layer. The last convolutional layer is not followed by any pooling layer. This is because both the convolutional kernels and the pooling filters do not mix values from different variables; thus, the time series of each variable has been abstracted to exactly one corresponding node in the last convolutional layer. Suppose that the last convolutional layer is calculated using F_L kernels; then each MTS training instance is abstracted as F_L × A nodes. For n instances, this layer has n × F_L × A nodes. The last convolutional layer connects to a fully connected layer which generates the output. The bottom of Fig. 1 shows the sizes of the matrices at the different layers of this CNN_mts structure. A CNN_mts model is trained with multiple epochs [18], and its training terminates when it meets certain criteria, such as the model accuracy being good enough. Each epoch consists of n/B iterations (or steps), where B is the number of instances used in one iteration. In each iteration, the sampled instances are fed to the model to adjust the model parameters. The B instances used in one iteration are called a batch. The batches of each epoch are typically generated in a random manner: each batch contains B (out of n) randomly selected instances. This random-batch generation strategy generally works well when the data have balanced event types.
Random batch generation with adjusted coefficients. When the data are imbalanced, one major issue with the default batch generation is that the sampled instances in one batch are imbalanced. A widely utilized strategy to alleviate this issue is to give different coefficients to different event types. Instances of rare event types are given higher coefficients so that they can contribute more in deciding the output. For example, if a batch contains 10 and 1000 instances from two event types et_1 and et_2 respectively, then the instance coefficients for et_1 and et_2 can be set to 100 and 1 respectively.
Batch generation with oversampling. We observe that the strategy of adjusting coefficients may still not work well when a batch has extremely unbalanced data. At the same time, we observe that one batch may not utilize all the necessary instances from rare event types because one batch only consists of a subset of instances. Given these two observations, we propose an oversampling strategy, which has been utilized in processing imbalanced data [19]. This oversampling strategy works as follows. After getting the B instances for each batch, we calculate the ratio of instances in different event types. If the ratio is low (e.g., less than 1/3 for a dataset with two event types), we sample more instances from the rare event types into this batch to make the instances of different event types close to balanced. Then, using the actual numbers of instances of different event types in a batch, we adjust the coefficients of the event types. The sizes of batches generated by this strategy are larger than B, and some instances are utilized several times in different batches of one epoch.
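A minimal sketch of this oversampling strategy, under our own simplifying assumptions (the balanced_batch name, the 1/3 default threshold taken from the example above, and top-up sampling with replacement are all illustrative choices):

```python
import random
from collections import Counter

def balanced_batch(instances, labels, B, ratio_threshold=1/3, seed=0):
    """Draw B instances at random, then top up event types whose
    share falls below ratio_threshold until the batch is close to
    balanced; the resulting batch may exceed B instances.
    Also returns per-type coefficients from the actual batch counts."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(instances)), B)
    counts = Counter(labels[i] for i in idx)
    target = max(counts.values())
    for et, c in counts.items():
        if c / target < ratio_threshold:
            pool = [i for i, y in enumerate(labels) if y == et]
            # oversample the rare event type with replacement
            idx += [rng.choice(pool) for _ in range(target - c)]
    # coefficients are set from the *actual* batch composition
    final = Counter(labels[i] for i in idx)
    coeff = {et: max(final.values()) / c for et, c in final.items()}
    return idx, coeff
```

Rare types that get topped up end with coefficients near 1, while any residual imbalance is still compensated through the returned coefficients, mirroring the two-stage strategy in the text.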

Extract PVs from intermediate results of the CNN_mts model
The second step of the CNN_mts-X framework extracts significant PVs from ℒ, the output of the last convolutional layers, which has E × (n × F_L × A) nodes. We propose Algorithm PVI (representing PV Identification, shown in Fig. 2) for this step. This algorithm can use different feature extraction techniques in Step 2(a)iii. Algorithm PVI calculates an importance score that each variable contributes to every event type by aggregating the variable importance from all the n instances and F_L kernels.
Specifically, PVI works as follows.

Input: (1) ℒ: an E × n × F_L × A array from CNN_mts; (2) Y: the event-type vector for the n instances; (3) σ and A: see the problem definition.
Output: PV_set: {PV_1, PV_2, · · ·, PV_E} where PV_et consists of σ · A PVs for the event type et.
1) Initialize an E × F_L × A array ω with score zero;
2) For each event type et:
   a) For each kernel f:
      i) Get from ℒ the node values for et and f, which form an n × A matrix M_et,f;
      ii) Normalize the values of each variable in M_et,f;
      iii) ω[et, f, 1...A] = aggregateInstance(M_et,f, Y, et); /* for a fixed event type et and a kernel f, aggregate the importance of each variable from all instances */
3) Γ = aggregateKernel(ω, σ, E, A, F_L); /* rank the importance of each variable by combining the effect of the F_L kernels */
4) For each event type et:
   a) Extract the σ · A variables with the top aggregated ranks in Γ as PV_et.

Fig. 2: The framework of PV identification. Fig. 3: Flow chart of PVI (ω keeps the importance of each variable for all kernels and event types; Γ keeps the importance of each variable for all the phenomena).

PVI first adds up the importance scores of each variable from the n instances and saves the scores to the E × F_L × A array ω, obtaining, for each kernel f, the importance of the i-th variable a_i to the event type et by considering all the instances (Step 2). Then, it combines the effect of the F_L kernels (Step 3). Next, it extracts the PVs for each event type from the combined ranks Γ (Step 4).
PVI calculates the importance scores ω (Step 2) using three sub-steps. First, for each distinct event type and each of the F_L kernels, it gets the node values from ℒ, which form an n × A matrix M_et,f (Step 2(a)i). This matrix is for all the n instances and A variables. Then, from the matrix M_et,f, it calculates the importance of each variable to et by aggregating scores over all the instances (Step 2(a)iii). Before this step, we conduct column-wise normalization of all the values in M_et,f using the L∞-norm so that all the values for one variable (in one column) are comparable (Step 2(a)ii).
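The column-wise L∞ normalization of M_et,f can be sketched as follows (normalize_columns is an illustrative name; the guard for all-zero columns is our addition):

```python
import numpy as np

def normalize_columns(M: np.ndarray) -> np.ndarray:
    """Column-wise L-infinity normalization of an n x A matrix:
    each variable's column is divided by its maximum absolute value,
    making scores across variables comparable."""
    norms = np.abs(M).max(axis=0)
    norms[norms == 0] = 1.0          # guard against all-zero columns
    return M / norms

M = np.array([[2.0, -10.0],
              [1.0,   5.0]])
print(normalize_columns(M))          # column |max| values are 2 and 10
```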
Example 3. Given the data in Example 1, the size of ℒ (the input of Algorithm PVI) is 3 × (4 × 30 × 2). Step 2 aggregates the features learned from ℒ using F_3 (which is 30) kernels. The size of M_et,f is 4 × 2. aggregateInstance returns the variable importance vector ω[et, f, 1...A] (A = 2) in Line 2(a)iii, and aggregateKernel combines the importance scores from each kernel. The final PV set contains σ · A = 1 variable per event type if σ is set to 50% (E is 3 and A is 2).
To further illustrate the procedure of the algorithm and show how different data structures are changed, Fig. 3 shows a high-level data flow of this algorithm.
Function: aggregateInstanceLR. Input: (1) M_et,f: an n × A matrix, (2) Y: the event vector for the n instances, (3) et: a fixed event type. Output: ω_et: a length-A score vector. 1) Initialize a length-A vector ω_et for the returned scores; Fig. 4: Calculate variable importance using LR.

A new algorithm LR to calculate variable importance
In the CNN_mts-X framework, X can be any feature extraction technique. We propose a new approach that leverages both Linear Discriminant Analysis (LDA) [9] and Random Forest (RF) [10]. LDA identifies linear combinations of variables as features. Such features can explicitly model the difference between different classes [20]. However, LDA cannot directly return the variable importance. We use the weight values to estimate variable importance, since a variable with a higher weight contributes more to the combined feature. RF is another widely used technique to rank the importance of variables in a regression or classification problem [21]. RF can directly return the variable importance, but RF focuses on each individual variable instead of variable combinations.
To make use of the good characteristics of both LDA and RF, we propose a new approach, LR, to learn a combined variable importance. Fig. 4 shows this approach as Function aggregateInstanceLR. This function calculates the importance of all the A variables for a given event type et and keeps the scores in a length-A vector ω_et (defined at Step 1 in Fig. 4).
More specifically, it creates a new vector Y whose element values are either zero or one, denoting two distinct event types. Here, only two distinct event types are used because PVs are used to distinguish one event type from all the other event types. The value is one when the corresponding actual event type is et and is zero otherwise. LDA is conducted using M_et,f and the new event vector Y (shown from Line 4 in Fig. 4). This procedure can be formally represented as

ω_et^LDA = ( c_et,1, · · ·, c_et,A ; c_¬et,1, · · ·, c_¬et,A ).

Note that the values of the first row of ω_et^LDA are the same as those of the second row. This is because the first row consists of coefficients that differentiate et from all the other event types (only ¬et), and the second row has coefficients that differentiate ¬et from all the other types (only et).
Line 5 in Fig. 4 utilizes RF to evaluate the variable importance. As shown in Lines 4 and 5, the training accuracies from LDA and RF are both returned. The training accuracy of each approach is used to weight the importance scores of the variables. The final variable importance is the weighted summation of the variable weights returned from LDA and RF, where the weights are the training accuracies of LDA and RF (Line 6).

Example 4. Let us follow the previous example to explain the function in Fig. 4. The input M_et,f is of size 4 × 2 (n × A). Line 4 and Line 5 evaluate the variable importance using LDA and RF respectively. Line 6 combines the two importance scores. The final output is a vector of size 1 × 2 with the variable importance scores.
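A rough sketch of this LR combination using scikit-learn stand-ins for LDA and RF (the function name, the one-vs-rest label encoding, and the use of absolute LDA coefficients as weights follow the description above; the hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

def aggregate_instance_lr(M, Y, et, seed=0):
    """Combined LR score for one event type/kernel: weight the LDA
    coefficient magnitudes and the RF feature importances by each
    model's training accuracy. M: n x A matrix, Y: length-n labels."""
    y = (np.asarray(Y) == et).astype(int)     # et vs. not-et labels
    lda = LinearDiscriminantAnalysis().fit(M, y)
    w_lda = np.abs(lda.coef_).ravel()         # |coefficients|, length A
    acc_lda = lda.score(M, y)                 # LDA training accuracy
    rf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(M, y)
    w_rf = rf.feature_importances_            # RF importances, length A
    acc_rf = rf.score(M, y)                   # RF training accuracy
    return acc_lda * w_lda + acc_rf * w_rf    # length-A combined score
```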

Ensemble variable importance
The last step of Algorithm PVI (Fig. 2) is to ensemble the variable importance over all the kernels based on the calculated importance scores ω from all the n instances and F_L different kernels. Fig. 5 shows the details of this step.
Function: aggregateKernel(ω, σ, E, A, F_L). Output: an E × A matrix Γ denoting the importance rank of every variable for all event types. 1) Initialize Γ to be an E × A matrix; 2) For each distinct event type et: a) Initialize an F_L × A rank matrix γ with value zero; b) For each kernel f and each variable a_i, rank the importance of a_i and keep the rank in γ; c) Aggregate the ranks in γ over all kernels into Γ.

This function ensembles the importance scores for each event type. For an event type et, it first ranks the importance of all the variables for each kernel (Step 2b). The ranking results are kept in an F_L × A rank matrix γ. Then, it calculates the overall importance of each variable a_i for a fixed event type by aggregating the importance ranks from all the kernels (Step 2c). The importance ranks of all the variables for the different event types, Γ, are returned to PVI to extract PVs. Note that we do not directly utilize the importance scores in ω to extract the significant PVs. Instead, we utilize the importance ranks. This strategy removes the effect of unbalanced importance scores.
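The rank-then-sum ensemble for one event type can be sketched as follows (our aggregate_kernel takes the F_L × A slice of ω for a single event type; rank 1 is most important, so a lower aggregated rank means a more important variable):

```python
import numpy as np

def aggregate_kernel(omega_et: np.ndarray) -> np.ndarray:
    """omega_et: F_L x A score matrix for one event type. Convert each
    kernel's scores to ranks (1 = most important) and sum the ranks
    across kernels, neutralizing unbalanced score magnitudes."""
    order = np.argsort(-omega_et, axis=1)   # per-kernel descending order
    ranks = np.empty_like(order)
    FL, A = omega_et.shape
    for f in range(FL):
        ranks[f, order[f]] = np.arange(1, A + 1)   # invert the ordering
    return ranks.sum(axis=0)                # length-A aggregated ranks

scores = np.array([[0.9, 0.1, 0.5],
                   [0.8, 0.2, 0.3]])
print(aggregate_kernel(scores))  # variable 0 ranks first in both kernels
```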
Time: The running time of the CNN_mts-X PVI approach consists of two stages: learning CNN_mts and conducting X. Given a dataset with E event types, in the worst case we need to learn E CNN_mts models. We also note that in many real cases, the number of CNN_mts models that we need to train depends on the number of phenomena that people are interested in. We may not need to get the important variables for all the E phenomena. For example, if a scientist is only interested in two phenomena (among hundreds), then our method just needs to train two CNN_mts models (instead of hundreds of models) to identify those variables. The exact time complexity of learning the CNN model is beyond our control. Thus, we empirically measure the running time of the PVI algorithms. The results and analysis can be found in Section 4.4.

Classification using PVs
PVs are identified to differentiate different phenomena. PVs' differentiating effect cannot be tested by directly applying existing classification algorithms because PVs are specific to different phenomena. In order to examine the effect of PVs, we design a PV-based classification algorithm.
The PV classification (PVC) algorithm runs classifications based on PVs. Fig. 6 shows the detailed procedure of this algorithm. The algorithm returns a vector of F1 values for all the E classes (or event types) and the overall testing Accuracy. In particular, it tests the PVs' effect for each et (Line 3). For a given et, it truncates the training and testing data to contain only the time series related to this event type's PVs (Lines 3a-3b). Also, it updates the training and testing labels to contain only et (denoted as one) and ¬et (denoted as zero) (Lines 3c-3e). Then, it trains a classification model to get the classification F1 value and the prediction probability (Lines 3f-3h) over classes et and ¬et. The final prediction is the class label with the highest probability (Line 4). The overall Accuracy is calculated from the final predictions (Line 5).
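The one-vs-rest voting idea behind PVC can be sketched on a flat feature matrix (this simplification ignores the time-series structure and uses a random forest as the per-type binary classifier purely for illustration; pv_classify and pv_sets are our names):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pv_classify(X_train, y_train, X_test, pv_sets, seed=0):
    """One binary (et vs. not-et) model per event type, trained only
    on that type's PV columns; the predicted class of a test instance
    is the event type whose model gives the highest et-probability."""
    probs = {}
    for et, pvs in pv_sets.items():
        y_bin = (np.asarray(y_train) == et).astype(int)
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X_train[:, pvs], y_bin)
        # column 1 of predict_proba is the probability of class 1 (= et)
        probs[et] = clf.predict_proba(X_test[:, pvs])[:, 1]
    types = list(probs)
    stacked = np.column_stack([probs[et] for et in types])
    return [types[i] for i in stacked.argmax(axis=1)]
```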

EXPERIMENTS
All the methods are implemented in Python 2.7 and tested on a server with i7-2600 CPU cores @ 3.40 GHz and 256 GB RAM. TensorFlow (www.tensorflow.org) is used to build our neural network framework. To better understand the advantages/disadvantages of different PV identification methods, we compare the effect of the PVs selected by the proposed method and by several baseline methods. All the methods are listed in Table 3.

Methods to compare
Our proposed method is denoted as CNN_mts-LR. We also adopt LDA and RF alone in the CNN_mts-X framework and get two baseline methods, CNN_mts-LDA and CNN_mts-RF. In particular, CNN_mts-LDA and CNN_mts-RF return ω_et^LDA and ω_et^RF respectively in Fig. 4. Furthermore, since Principal Component Analysis (PCA) [22], [23] is another well-recognized classical feature extraction technique, we adopt PCA in the CNN_mts-X framework and get CNN_mts-PCA. Another approach based on Common PCA (CPCA) [3] can identify important global variables; we adopt CPCA in our framework and get CNN_mts-CPCA. CNN_mts-PCA (or CNN_mts-CPCA) calls PCA (or CPCA) only on the instances from class et instead of on all the instances. The details of CNN_mts-PCA can be found in Fig. 7. For CNN_mts-CPCA, CPCA is used at Line 4. We also compare our proposed method with other techniques that do not employ our proposed CNN_mts-X framework. Corresponding to the five methods that utilize the CNN_mts-X framework, the five baseline approaches are LR, LDA, RF, PCA, and CPCA. These five baseline methods learn importance scores of each variable for different event types and select the variables with the top σ · A absolute importance scores as PVs. For the LR method, the importance scores of all the variables for an event type et are the weighted summation of the variable scores returned from LDA and RF. The variable scores from LDA are calculated based on the coefficients (c_et,1, · · ·, c_et,A), while the variable scores from RF directly come from the trained RF model. For the PCA (or CPCA) method, the importance scores are learned as follows. For each event type et, we first conduct PCA (or CPCA) on all the training instances with event type et. Then, we calculate each variable's importance by adding the absolute weight values of the PCs for this variable. Higher weight values carry more importance.
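The PCA baseline's scoring rule, summing absolute PC loadings per variable, might look like this with scikit-learn (the function name and the default number of components are our choices):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_variable_importance(X_et: np.ndarray, n_components=2) -> np.ndarray:
    """Fit PCA on the instances of one event type (X_et: n x A) and
    score each variable by the sum of its absolute PC loadings."""
    pca = PCA(n_components=n_components).fit(X_et)
    # components_ has shape (n_components, A); sum |loadings| per column
    return np.abs(pca.components_).sum(axis=0)
```

A variable that dominates the principal directions of variation for that event type thus receives a high score, matching the description above.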
The effect of the proposed PVs is also compared with the effect of all the variables (denoted as All-variables) and of top global variables (denoted as CNN_mts-LR-GV). The All-variables method directly feeds all the values v_i,j in an MTS to E CNN_mts classifiers for the E event types. CNN_mts-LR-GV is designed based on the CNN_mts-LR method. It utilizes the intermediate results Γ (Step 3 in the PVI algorithm in Fig. 2): each column's values (the importance ranks of one variable for the different event types) are added to get the overall importance rank of that variable. The σ · A variables with the top overall ranks are chosen as significant global variables.

Experimental settings
(1) Datasets: We use five real datasets to test the performance of our approaches. The first dataset is the Daily and Sports Activities data (denoted as DSA) [24]. The second dataset is extracted from the ideal-placement scenario of the REALDISP Activity Recognition data (denoted as RAR) [25]. The third dataset is the Activity Recognition Challenge data from opportunistic activity recognition systems for subject 1 (denoted as ARC) [26]. The fourth dataset also comes from the ARC data, but it uses the fixed training and testing portions used in [12]; this dataset is denoted as ARC_fixed and is utilized for comparison with [12]. The last dataset is for Australian Sign Language (ASL) [27].
The detailed statistics for the datasets are shown in Table 4.
For the DSA, RAR, and ARC datasets, we run ten-fold cross-validation to get stable results. For ASL, we run three-fold cross-validation. We did not use ten folds because the number of instances in each class is not as large as in the other datasets.
(2) Evaluation measurements: We utilize two ways to evaluate the effectiveness of the selected PVs: (a) conducting classification using the selected PVs, significant global variables, and all the variables (Sections 4.3.1-4.3.3) and (b) manually examining the meaning of the extracted PVs (Section 4.3.4). Section 4.3.5 compares the selected PVs with existing works. For classification, we feed the training instances with only the selected variables to train E classifiers. Given a testing instance, the prediction from one classifier (for event type et) is the probability that the testing instance is predicted as et. The final event-type prediction of this instance is the type with the highest probability. We report two classification measurements, F 1 and Accuracy, to show the performance. To eliminate the bias of classification techniques, we utilize four widely adopted classification methods: convolutional neural network (CNN) [28], k-nearest neighbors (KNN) [29], support vector machine (SVM) [30], and random forest (RF) [10]. TensorFlow is used to build the CNN classifier, and the Python library [31] is used to run the KNN and RF classifiers. For the SVM classifier, we use the code provided by the Python library LibSVM [32] since it achieves similar accuracy with less running time than the typical SVM from [31]. Note that the traditional F 1 measures the performance of binary classifiers, while each of our datasets has more than two event types. We therefore calculate F 1 for each event type by treating all the instances belonging to this type as positive and all the other instances as negative.
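The one-vs-rest F 1 computation and the argmax prediction step described above can be sketched as follows (function names are ours, for illustration):

```python
def per_event_f1(y_true, y_pred, event):
    """F1 for one event type: instances of `event` are positive, all others negative."""
    tp = sum(t == event and p == event for t, p in zip(y_true, y_pred))
    fp = sum(t != event and p == event for t, p in zip(y_true, y_pred))
    fn = sum(t == event and p != event for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def predict_event(probs):
    """Final prediction: the event type whose binary classifier reports the
    highest probability for this testing instance."""
    return max(probs, key=probs.get)
```

For example, with true labels `['A','A','B','C']` and predictions `['A','B','B','C']`, the F 1 for event type `'A'` is 2/3 (precision 1, recall 0.5).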
(3) Parameter setting: The parameters used to train the CNN mts models for both the PV selection and the classification task are the same. The numbers of convolutional layers and pooling layers are set to 3 and 2 respectively. For the convolutional layers, the kernel sizes k are 50, 30, and 20. For the pooling layers, the filter size r is 2. The maximum number of epochs is 5, and the batch size B is 100. For the classifiers, KNN sets the parameter K to 1. LibSVM uses balanced class weights and the Radial Basis Function (RBF) kernel. We are aware that setting parameter values to achieve good classification performance is still an open problem, and it is not the focus of this paper. Meanwhile, we also run experiments with different parameter values to justify our parameter setting (Section 4.3.7).

Effectiveness analysis
This section shows the classification performance for all the datasets. In particular, we examine the F 1 for all the event types and the overall Accuracy. The calculation details for the F 1 and the overall Accuracy are shown in the PVC algorithm in Fig. 6.

Compare the effect of PVs using different PV selection approaches
This section compares the effect of PVs selected by the ten PV selection approaches listed in Table 3. The results in Table 5 show that the CNN mts -LR approach outperforms the other nine PV selection approaches in most cases. CNN mts -LR provides the best F 1 with the CNN mts classifier on all datasets, and it has comparably good results with the other classifiers (top 3 F 1 values with KNN and LibSVM, and top 2 F 1 values with RF). The accuracy results (Appendix A) support similar conclusions. For most data mining and machine learning methods, there is generally no absolute winner in all cases. To show the superiority of the proposed method, we calculate the averaged F 1 and averaged Accuracy over all five datasets and show the results in Table 6 and Table 7. These tables show that the proposed approach achieves the best averaged F 1 and averaged Accuracy. We also plot the individual F 1 values for all phenomena in different datasets (Figure 8, Appendix B), which are consistent with the ranks of the overall F 1 values.

We use the Fully Convolutional Network (FCN) model [11] to replace the CNN mts model in our proposed CNN mts -LR method and get an FCN-LR method. Table 9 shows the results of comparing FCN-LR and CNN mts -LR. It can be observed that the features learned from CNN mts -LR outperform the features from FCN-LR in almost all cases. The overall Accuracy results (Table 20, Appendix A) are similar to the results on F 1 . Therefore, we use CNN mts instead of FCN as the CNN classifier in the PVC algorithm.

Compare the effect of PVs, selected global variables, and all the variables
This section evaluates the performance of the PVs found using the proposed CNN mts -LR method, the global variables discovered using CNN mts -LR-GV, as well as all the variables. Section 4.3.1 demonstrates that classifications using PVs from CNN mts -LR return the best performance. Given this, the global variables discovered in this section are based on CNN mts -LR (denoted as CNN mts -LR-GV). For this set of experiments, the All-variables approach uses all the variables to run classification, while CNN mts -LR and CNN mts -LR-GV select approximately 30% of all the variables. Different classification algorithms are applied in order to get unbiased results. The F 1 values using different classifiers are shown in Table 10. The results show that classification using the top 30% of PVs from CNN mts -LR achieves similar or even better F 1 values compared with classification using all the variables. Note that our method does not always perform better than the method using all the variables, mainly because of the characteristics of the data. When a dataset has noisy variables, our PV selection approach is able to identify the important non-noisy variables and utilize them for classification, and such classification generally performs better than the method using all the variables. On the other hand, when all the variables in a dataset are useful (i.e., there are no noisy variables), the PV selection approach misses some variable information and gets slightly worse performance than the method using all the variables.
These results indicate that the 30% PVs identified from CNN mts -LR are able to capture the significant variables and discard the noisy ones. The results also demonstrate that classifications using PVs generate much better F 1 values than classifications using GVs from CNN mts -LR-GV. This is consistent with our expectation and intuition, since GVs are variables important to all the class labels, while PVs are variables important to individual class labels. We also report the overall Accuracy values (Table 21, Appendix A), which show similar results as F 1 , and the individual F 1 values for all phenomena in different datasets (Figure 9, Appendix B), which are consistent with the rank of the overall F 1 values.

TABLE 11(a): Top 6 PVs selected for two phenomena in the DSA dataset

Phenomena   Playing Basketball              Rowing machine
Top 1 PV    y gyroscopes (left arm)         x magnetometers (left leg)
Top 2 PV    x gyroscopes (left arm)         x magnetometers (right leg)
Top 3 PV    y gyroscopes (right arm)        x magnetometers (torso)
Top 4 PV    x gyroscopes (right arm)        x accelerometers (left arm)
Top 5 PV    y accelerometers (left arm)     x accelerometers (right arm)
Top 6 PV    y accelerometers (right arm)    x accelerometers (left leg)

In this section, we manually verify the usefulness of the extracted features. We show a few of the PVs learned from our CNN mts -LR model for different classes in the DSA and ASL datasets, where DSA represents the human activity domain and ASL represents the sign language domain. In the DSA dataset, there are three groups of variables: accelerometers, gyroscopes, and magnetometers. An accelerometer records the tilt relative to the earth's surface, a magnetometer keeps the heading direction if a person holds the sensor parallel to the ground, and a gyroscope keeps the rotational velocity without any absolute reference. These sensors are placed on the torso, left arm, right arm, left leg, and right leg. Table 11(a) shows the top 6 PVs learned on DSA. The first phenomenon is "playing basketball", which is an activity of bouncing the basketball repeatedly using two arms. We can see that all the top 6 PVs for this activity are related to the left arm and the right arm. In addition, the x and y gyroscopes for both arms are the top 4 attributes because the arms are rotating while bouncing a ball. The 5th and 6th attributes are the y accelerometers for the two arms; they capture the up-down movement while playing. These PVs are reasonable for identifying the "playing basketball" activity. Consider another activity in the DSA dataset, "rowing machine", which requires the whole body to move; we show its top six variables in the third column of Table 11(a). These 6 PVs represent sensors from the torso, arms, and legs. Magnetometers and accelerometers are more helpful to
identify this activity because rowing is a forward-backward activity. Table 11(b) presents the analysis of two phenomena from the ASL dataset. The first phenomenon is "Please" [33], which bends all four fingers but not the thumb of the right hand; meanwhile, the right hand moves upward and downward. The corresponding top 6 PVs are all from the right hand. They contain the roll, y position, and yaw sensors (which are gyroscopic devices) of the right hand and the bend sensors of three fingers (not the thumb). The second phenomenon is "Stubborn" [34], for which both hands bend four fingers but not the middle finger, and the two hands move both vertically and horizontally. The selected top 6 PVs are bend sensors and the x accelerometers from both hands, which is consistent with the sign.

Compare the effect of PVs and existing work
This set of experiments compares the classification performance using the PVs selected by CNN mts -LR with two other state-of-the-art approaches: the CNN model in [12] and the multivariate shapelet in [4]. The PVs from CNN mts -LR cannot be directly applied to the CNN model in [12] since the PVs found in our work are phenomenon specific. However, we still report the F 1 and overall accuracy for reference. We directly take the F 1 and accuracy from [12] (without smoothing) on ARC fixed (RF is not used in [12]). Table 12(a) and (b) report the F 1 and the overall accuracy respectively. The results show that classification using the PVs gets better F 1 and overall accuracy. This is due to the new variant of the CNN model, CNN mts , and the PV-based classification algorithm, PVC.
Next, we compare the classification performance using shapelets. Note that the focus of shapelet extraction is different from PV identification: shapelets are important subsequences in the sequences of multiple variables, while PVs are important variables. That is, the two are orthogonal and complement each other. Given these differences, we compare the classification performance using shapelets generated from the overall MTS and from the sequences of PVs. We have implemented two versions of shapelet generation. The first version directly extracts shapelets from the overall MTS (denoted as Shapelet all ). The second version extracts shapelets from the sequences whose corresponding variables are identified as PVs (denoted as Shapelet PV ). Table 13 presents the classification results using the shapelet features of the two versions for the DSA dataset. Shapelet generation is known to be time-consuming [36]; therefore, the DSA dataset is used because it has fewer instances than the RAR and ARC datasets and a much smaller number of classes than ASL. The results show that the shapelets generated using PVs achieve similar accuracy to the shapelets identified from all the variables, while Shapelet PV uses only ∼20% of the time used by Shapelet all .
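For reference, the core matching step of shapelet-based classification, the minimum distance between a candidate shapelet and any same-length subsequence of a series, can be sketched as follows. This is a generic sketch of that step, not the implementation from [4] or [36]:

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between `shapelet` and every same-length
    subsequence of the univariate `series` (the standard matching step)."""
    L = len(shapelet)
    return min(
        float(np.linalg.norm(series[i:i + L] - shapelet))
        for i in range(len(series) - L + 1)
    )

# Restricting extraction to PV sequences means running this scan (and the
# candidate enumeration around it) only over the PV channels of each MTS
# instance, which is where the time saving of Shapelet_PV comes from.
series = np.array([0.0, 1.0, 2.0, 3.0])
print(shapelet_distance(series, np.array([1.0, 2.0])))  # exact match -> 0.0
```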
Table 14 compares our proposed approach with another recent approach, MASK [35]. MASK identifies shapelets from time-series sequences and returns a mask to evaluate the importance of different variables. We note that MASK is very time-consuming and performs poorly on imbalanced data. For the smallest dataset (ASL), MASK runs for around 4 hours for one class even when the parameter values are set to be small (for larger parameter values, the algorithm runs much longer). The setting details and running time are shown in Table 14(a). Furthermore, Table 14(b) shows that CNN mts achieves ∼20% better F 1 scores than MASK.

Compare batch processing strategies of CNN mts for imbalanced data
This set of experiments tests the effect of batch processing strategies on the RAR dataset as it contains imbalanced data. CNN mts -LR is used to select the top significant 30% PVs, and CNN mts classifiers are used to conduct ten-fold cross-validation. Table 15 shows the results. The first two rows show that, when the PVs are fixed, the classifier with oversampling improves the F 1 and Accuracy by about 9% and 4% respectively. Comparing the last two rows, we can see that, when the classifier is fixed, PV selection with oversampling improves the F 1 and Accuracy by about 4% and 6%. All these results show that CNN mts with the oversampling batch processing strategy works better than the default CNN models.
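One way such an oversampling batch strategy can be realized is to draw each training batch with an equal number of instances per class, sampling minority classes with replacement. The sketch below illustrates this idea; the function name and the equal-per-class split are our assumptions, not the paper's exact procedure:

```python
import random

def oversampled_batch(instances_by_class, batch_size, rng=None):
    """Draw a batch with an equal number of instances per class; minority
    classes are sampled with replacement (i.e., oversampled)."""
    rng = rng or random.Random(0)
    classes = sorted(instances_by_class)
    per_class = batch_size // len(classes)
    batch = []
    for c in classes:
        # choices() samples with replacement, so a small class can still
        # contribute per_class instances to the batch
        batch.extend(rng.choices(instances_by_class[c], k=per_class))
    rng.shuffle(batch)
    return batch
```

With this strategy, a rare event type contributes as many instances to each batch as a frequent one, so the gradient updates are not dominated by the majority class.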

Effect of parameters
We first show how the number of PVs affects the classification performance using the RAR dataset. Table 16 reports the F 1 and Accuracy from the ten-fold cross-validation for all the event types. It can be observed that the performance improves as σ increases. However, when σ is more than 30%, the performance does not grow much further. More parameter analysis can be found in Appendix D.

Efficiency Analysis
This section shows (a) the running time of different PV identification methods, (b) the time to train different classifiers using the sequences of PVs, and (c) the time to predict the event type of one testing instance for the different datasets. Table 17 presents the averaged running time for building the CNN mts -X framework using one fold of four of the datasets. The running time of CNN mts -X on ARC fixed is not included in Table 17 due to the table width limitation. This time consists of the running time for constructing the CNN mts model using all the attributes (Section 3.1) and for the PVI algorithm (Section 3.2). The results show that the CNN mts model construction takes the majority of the time due to the known long training time of CNNs. The LR method's running time is approximately the sum of the running times of LDA and RF. LDA and RF use more time than PCA because LDA and RF need to be conducted on all the instances E times, while the PCA methods are applied only to a subset of the instances E times.

RELATED WORKS
Identifying significant variables is highly related to feature extraction, which has been extensively investigated in the past several decades. For example, Principal Component Analysis (PCA) [22], [37] and Linear Discriminant Analysis [9] are among the commonly used classical feature extraction techniques. However, neither method can be directly utilized to identify significant PVs because they cannot treat one time series as a single variable.
More recent techniques for identifying features from sequence data (e.g., [4], [5], [6], [7]) generally convert the sequences to a set of features and analyze the data in the feature space. Most of the identified features cannot preserve the temporal continuity that is explicit in the original sequence data. Among the works extracting features from sequence data, the Shapelet feature, introduced in [7], can preserve the temporal order of points in a time series. Shapelet discovery has attracted intense interest from independent research groups (e.g., [4], [7], [35], [36], [38], [39], [40], [41]) for analyzing time series data. The methods that extract Shapelet features cannot be directly used to solve our problem either, because the purpose of Shapelet extraction is to find global Shapelet features that help achieve high classification accuracy, while our problem is to find the variable subsets that contribute the most to specific event types. Furthermore, extracting shapelets from multiple sequences dramatically complicates the Shapelet extraction algorithms, which are already very complex even on single-sequence instances.
Techniques that classify multi-class datasets (e.g., [42]) typically focus on improving classification accuracy and do not study the importance of different variables for different classes.
Subspace clustering, such as projected clustering [1], has been studied based on a rationale similar to PV identification. It identifies clusters from a dataset such that the points in one cluster are close with regard to a subset of the dimensions, and the dimension subsets are generally different for different clusters. Despite the similar intention, the results of projected clustering do not keep the temporal order of the selected dimensions and thus cannot be used to identify PVs.
Recent works (e.g., [12], [13], [14], [15], [43]) utilize convolutional neural networks (CNNs) in the analysis of MTS data. Most of these methods focus on improving classification accuracy or learning the CNN structure; thus, they cannot be directly utilized to solve our problem.

CONCLUSIONS
In this paper, we introduced a new problem of identifying significant Phenomenon-specific Variables (PVs) from MTS data. This problem selects significant variables that are important to different event types of the data. To solve this problem, we proposed a novel CNN mts -X framework. In this framework, a new variant of convolutional neural networks, CNN mts , is designed to convert each variable's corresponding sequence to independent features. The X in this framework can be another feature extraction technique. We also designed a new LR approach to be used in the CNN mts -X framework for the identification of important PVs. The results from extensive experiments on five real datasets comparing CNN mts -LR with nine baseline methods show that (i) our CNN mts -LR method can identify more useful PVs than the other methods, (ii) 30% of the PVs found from CNN mts -LR are able to carry almost all the important information of all the variables, and (iii) CNN mts with the new batch processing strategy outperforms typical CNN models when classifying imbalanced multi-class MTS data.

APPENDIX B ADDITIONAL TABLES AND FIGURES FOR SECTION 4.3.1
Table 19 shows the accuracy results for evaluating the effect of the PVs selected by the ten PV selection approaches in Section 4.3.1. Figure 8 to Figure 12 show the detailed F 1 values for all event types by the top 4 PVI methods.

APPENDIX C ADDITIONAL TABLES AND FIGURES FOR SECTION 4.3.3
Table 21 shows the accuracy results for evaluating the performance of the PVs found using the proposed CNN mts -LR method, CNN mts -LR-GV, and All-variables. Figure 13 to Figure 17 show the detailed F 1 values for all event types using all the variables, the top 30% of PVs, and the top 30% of GVs.

APPENDIX D MORE PARAMETER ANALYSIS FOR SECTION 4.3.7
We evaluate the performance of the KNN classifier with varying K values (Fig. 18). The results show that KNN with K = 1 returns the best averaged F 1 for all four datasets and that the averaged F 1 decreases as K increases. We therefore set K = 1 for our KNN classifier.

Fig. 5: Calculate variable importance by combining results from different kernels

Fig. 7: Calculate variable importance using PCA

TABLE 11(b): Top 6 PVs selected for two phenomena in the ASL dataset

Phenomena   Please                            Stubborn
Top 1 PV    roll (right hand)                 Middle finger bend (left hand)
Top 2 PV    y position (right hand)           Middle finger bend (right hand)
Top 3 PV    Forefinger bend (right hand)      Little finger bend (left hand)
Top 4 PV    Middle finger bend (right hand)   Forefinger bend (right hand)
Top 5 PV    Little finger bend (right hand)   Forefinger bend (left hand)
Top 6 PV    yaw (right hand)                  left leg x accelerometers

Fig. 16: F 1 for different event types on ARC fixed (Classification using all the variables, top 30% of PVs, and top 30% of GVs)

Table 1 shows a toy dataset with three real phenomena: playing basketball, rowing machine, and Elevator UP. Assume that there are two variables representing the height of the sensors attached to the left arm (LA) and the left leg (LL).

TABLE 1: Toy dataset: LA represents the y-coordinate of the left arm sensor and LL is the y-coordinate of the left leg sensor

3.1.1 Structure of CNN mts

Table 2 summarizes the meaning of the major parameters in CNN mts .

TABLE 2: Major parameters in CNN mts

Steps of the LR method (Fig. 4, partially recovered):
4) ω LDA et , ACC LDA = Run LDA using M et,f and Y as input; output one coefficient vector ω LDA et (length A) and the training accuracy ACC LDA
5) ω RF et , ACC RF = Run RF using M et,f and Y as input; output one coefficient vector ω RF et (length A) and the training accuracy ACC RF
6) ω LR et = ω LDA et •

Steps of the PVC algorithm (Fig. 6, partially recovered):
3) For each event type et (et = 1 · · · E):
   a) Generate training sub-matrix X tr : ntr × |PVet| × m where each instance only contains the time series from the PVet variables
   b) Generate testing sub-matrix X te : nte × |PVet| × m where each instance only contains the time series from the PVet variables
   c) Create a vector Y tr with length ntr and a vector Y te with length nte to hold the binary class labels for the training and testing data respectively
   d) For the j-th instance in Y tr
   f) Train the model PVMet using X tr and Y tr
   g) Apply PVMet to the testing X te , get the prediction Y pred , and assign the prediction probability to Prob[0 . . . nte, et]
   h) F1[et] = F1 score calculated from the prediction Y pred and Y te
4) Y pred = argmax(Prob)
5) Accuracy is calculated based on Y te and Y pred
6) Return the F1 vector and Accuracy

Method       PVI
CNNmts-LR    LR is used to identify PVs in CNNmts-X
CNNmts-LDA   LDA is used to identify PVs in CNNmts-X
CNNmts-RF    RF is used to identify PVs in CNNmts-X
CNNmts-PCA   PCA is used to identify PVs in CNNmts-X
CNNmts-CPCA  CPCA [3] is used to identify PVs in CNNmts-X
LR           LR is used to identify PVs without CNNmts-X
LDA          LDA is used to identify PVs without CNNmts-X
RF           RF is used to identify PVs without CNNmts-X
PCA          PCA is used to identify PVs without CNNmts-X
CPCA         CPCA is used to identify PVs without CNNmts-X

TABLE 3 :
PV selection methods to compare

TABLE 4: Dataset statistics

TABLE 5: F 1 for different variable selection methods (Top 30% of PVs are selected). The values in [] denote the ranks of the classifier in a row to classify the dataset in a column.

Compare the proposed CNN model with others
This set of experiments compares the proposed CNN mts model with another CNN baseline, Fully Convolutional Networks (FCN) [11]. FCN was proposed as a strong baseline for image classification. FCN has only a global pooling layer before the final output layer (instead of a pooling layer after every convolutional layer). FCN may not be a good choice for MTS feature selection because the pooling layer after each convolutional layer helps identify similar features in a time range. For example, consider two people conducting the same activity, a hand up-down movement, with different speed: the first person moves his/her hand slowly, 20 cm up/down per second, and the second person moves his/her hand faster, 40 cm up/down per second. Table 8(a) shows the location of the y sensor on one hand for five time stamps. Table 8(b) shows the results after a max pooling with size 1 × 2. It is clear that this pooling layer amplifies the similarity between the two time sequences, and the next convolutional layer can utilize the amplified similarity. In FCN (without a pooling layer after each convolutional layer), however, the next convolutional layer cannot utilize any amplified similarity.

TABLE 6: Averaged F 1 over all five datasets

TABLE 7: Averaged Accuracy over all five datasets

TABLE 8: Example: location of the y sensor on a hand

TABLE 9 :
F 1 comparison using the top 30% PVs from CNN mts -LR and FCN-LR.

TABLE 10 :
F 1 (Classification using all the variables, top 30% of PVs, and top 30% of GVs)

TABLE 11 :
Top 6 PVs selected for two phenomena

Case studies of the extracted features

TABLE 13 :
Performance comparison of shapelets extracted from the overall MTS and from PV sequences (DSA)

TABLE 14: Performance comparison of CNN mts -LR and MASK on ASL

TABLE 18: Running time of the PVC algorithm

Table 18 reports the training time of PVC using one fold of the dataset and the testing time for one instance. Note that KNN does not use any training time. The prediction/testing time per instance (Table 18(b)) is almost negligible compared to the PV identification time. As expected, the KNN methods use more testing time than the other methods. Note that the CNN mts time is the running time for all the phenomena (events), and this training typically happens offline.

Table 20 shows the accuracy results for comparing the effect of the proposed approach CNN mts -LR with the FCN-LR approach.

TABLE 19 :
Overall Accuracy for different variable selection methods (Top 30% of PVs are selected).The values in [] denote the ranks of the classifier in a row to classify the dataset in a column.
Fig. 10: F 1 for all event types on ARC (Top 30% of the PVs are selected by the top 4 PVI approaches)
Fig. 11: F 1 for all event types on ARC fixed (Top 30% of the PVs are selected by the top 4 PVI methods)
Fig. 12: F 1 for all event types on ASL (Top 30% of the PVs are selected by the top 4 PVI methods)

TABLE 20 :
Accuracy comparison using the top 30% PVs from CNN mts -LR and FCN-LR.

TABLE 21: Accuracy (Classification using all the variables, top 30% of PVs, and top 30% of GVs)

Fig. 13: F 1 for different event types on DSA (Classification using all the variables, top 30% of PVs, and top 30% of GVs)
Fig. 14: F 1 for different event types on RAR (Classification using all the variables, top 30% of PVs, and top 30% of GVs)
Fig. 15: F 1 for different event types on ARC (Classification using all the variables, top 30% of PVs, and top 30% of GVs)