Deep Multiview Learning From Sequentially Unaligned Data

Multiview learning is concerned with machine learning problems, where data are represented by distinct feature sets or views. Recently, this definition has been extended to accommodate sequential data, i.e., each view of the data is in the form of a sequence. Multiview sequential data pose major challenges for representation learning, including i) absence of sample correspondence information between the views, ii) complex relations among samples within each view, and iii) high complexity for handling multiple sequences. In this article, we first introduce a generalized deep learning model that can simultaneously discover sample correspondence and capture the cross-view relations among the data sequences. The model parameters can be optimized using a gradient descent-based algorithm. The complexity for computing the gradient is at most quadratic with respect to sequence lengths in terms of both computational time and space. Based on this model, we propose a second model by integrating the objective with reconstruction losses of autoencoders. This allows the second model to provide a better trade-off between view-specific and cross-view relations in the data. Finally, to handle multiple (more than two) data sequences, we develop a third model along with a convergence-guaranteed optimization algorithm. Extensive experiments on public datasets demonstrate the superior performances of our models over competing methods.


I. INTRODUCTION
In many real-world applications, data are often collected from various perspectives, each of which presents a view of the same data and has its own representation space and relation characteristics. Multiview learning methods aim to exploit consistency and complementary information between these views to learn new representations for the data. Therefore, these methods often have better generalization ability. Recently, the definition of multiview learning has been extended to accommodate sequential data, i.e., each view of the data is in the form of a sequence. For instance, human actions can be presented by several video sequences with different features, such as binary, Euclidean distance transform, and Poisson equation solutions [55] (see Figure 8).
Multiview sequential learning has posed major challenges that are difficult for conventional methods to accommodate. First, most multiview learning methods essentially rely on an assumption that all views of the data are equal in size and The associate editor coordinating the review of this manuscript and approving it for publication was Jonghoon Kim . sample-wise matching. Here, we take canonical correlation analysis (CCA) [1] and its variants [2]- [4] as representatives. These methods project data samples from two different views into a shared subspace and then minimize the squared difference between the projections subject to whitening constraints. Thus, the two-view training data must have the following form: X = [x 1 , . . . , x n ] ∈ R d x ×n and Y = [y 1 , . . . , y n ] ∈ R d y ×n , where (x i , y i ) is a matching pair (1 ≤ i ≤ n). However, these requirements are likely to be violated in sequential settings. For example, sample deletion and/or insertion often occurs when collecting data sequences because of the temporal failures of devices and other man-made reasons. In addition, the asynchronization of data collection devices, e.g., sensors have dissimilar sampling frequencies, also induces misalignment among the collected sequences. A widely used alignment algorithm, dynamic time warping (DTW) [5], can be used to match samples in correspondence as a preprocessing step before performing conventional multiview learning methods. Unfortunately, DTW fails when the dimensions of the two sequences vary (d x = d y ). Second, in addition to the ambiguous cross-view relations mentioned, multiview sequential data also involve complex view-specific relations that span through the length of the sequences. CCA-based methods capture these relations through linear [1] and nonlinear [2]- [4] projection functions. However, they ignore the sequential order that naturally exists among samples within each view. Finally, in practice, the input data often comprise more than two sequences. Handling multiple sequential views is a difficult task that certainly involves high resource requirements. In addition, the discriminative properties of the learned representation might be degenerated because of the absence of label information and the presence of irrelevant information from multiple views.
In this article, we first propose generalized sequential correlation analysis (GSCA)-a novel deep neural network (DNN)-based method-to tackle the aforementioned challenges. Our method parameterizes the projection functions that map data sequences into the shared subspace by DNNs. Various types of DNNs can be selected regarding the relations among samples within each view. In this work, we use feed-forward neural networks and recurrent neural networks (RNNs) for implementation. In the shared subspace, our method minimizes the generalized smooth DTW distance between projections of the two views subject to soft whitening constraints. This allows GSCA to discover the sample correspondences and capture the relations between the views simultaneously. Because the generalized smooth DTW is a differentiable approximation of the original DTW, parameters of our model can be optimized in a unified manner using gradient descent-based algorithms. Computing the gradient generally takes a quadratic time and requires a quadratic memory space with respect to (w.r.t.) the sequence lengths. We can further increase the computation speed and reduce the space requirement by selecting squared 2 norm for regularization, which induces sparsity in the gradient of the generalized smooth DTW. Second, to provide a better balance between view-specific and cross-view relations, we combine our objective function with the reconstruction losses of autoencoders [6]. This forms generalized sequentially correlated autoencoders (GSCAE), which are a new variant of the proposed model. Finally, we further develop the third model called generalized multiple sequences analysis (GMSA) to handle multiple data sequences. Slightly differing from the two first proposed methods, GMSA uses DNNs to map input data sequences directly into the label space. Thereby, we expect that the learned representation can have cluster interpretability and better discriminability. Because no supervised information is given, we introduce a consensus label sequence that is then aligned with projections of all the input sequences. An efficient algorithm with a convergence guarantee is also provided to optimize both the consensus and the DNNs' parameters.
This article includes materials from [7] with significant expansion. First, GSCA and GSCAE are expanded versions of deep sequential correlation analysis (DSCA) and deep sequentially correlated autoencoders (DSCAE), which are proposed in the preliminary work, respectively. Efficiencies of GSCA and GSCAE are much better than those of DSCA and DSCAE. By taking advantage of the generalized smooth DTW distance, complexities in both terms of time and space for computing the gradients when training GSCA and GSCAE are reduced significantly. 1 Second, while [7] introduced the two deep models based only on feedforward neural networks, in this article, we further extend the model concept to RNNs, i.e., long short-term memory (LSTM) [8], to better capture the sequential relation within each view. Finally, we propose the GMSA model to handle multiple data sequences simultaneously, which was not considered in [7], and learn a more interpretable representation. Experiments using both two-and multiple-view datasets were designed carefully to provide a fair comparison with existing competitors. In summary, the contributions of this article are as follows: • Introduces a novel deep multiview model that can discover sample correspondence implicitly while learning the representation from sequential data.
• Extends the proposed model based on the reconstruction loss regularization of autoencoders, which allows a trade-off between information within each view and information in the correlation across the views.
• Derives a third model that can handle multiple (more than two) data sequences and learn more interpretable and discriminative representation.
• Extensive experiments on various public datasets demonstrate the superior performances of the proposed models over existing methods. The remainder of this article is organized as follows: Section II briefly presents some background for the methods proposed in this article. The GSCA model and its autoencoder-based variant are introduced in Sections III and IV, respectively. The third model for handling multiple data sequences along with a convergence-guaranteed optimization algorithm is described in Section V. After reviewing the related works in Section VI, we present the experimental results in Section VII. Section VIII concludes this article.
Notations. Throughout this article, scalars, vectors, and matrices are denoted by lower-case, bold lower-case, and bold uppercase letters, respectively. An element at position (i, j) of a matrix A is denoted by a i,j or [A] i,j . We denote the Frobenius inner product between A and B as A, B := i,j a i,j b i,j . 0 d is a vector of dimension d whose all elements are zeros. The expression x ∈ R d + indicates that vector x has d elements, each of which is greater than or equal to zero. The norm p of a vector x, where p ∈ {1, 2} in this article,

II. BACKGROUND
A. DYNAMIC TIME WARPING The DTW [5] algorithm measures the similarity between two sequences whose lengths are possibly different and whose sample correspondences are probably unknown. Given two FIGURE 1. A toy example of DTW and its smooth approximation: (a) two example sequences, (b) the distance matrix, (c) the cumulative sum matrix, (d) the cumulative sum matrix of DTW =entropy and (e) the cumulative sum matrix of DTW =squared 2 . The red line depicts the optimal warping path, which encodes the sample correspondences between the two sequences.
The DTW algorithm constructs a cumulative sum matrix S(X, Y ) using the following recursive formulas: The DTW distance between the two sequences is then defined as DTW(X, Y ) := s n,m . By backtracking from the last element s n,m to the start element s 1,1 , an optimal warping path that satisfies: i) boundary condition: (i 1 , j 1 ) = (1, 1) and (i p , j p ) = (n, m); ii) continuous condition: (i r+1 − i r , j r+1 − j r ) ∈ {(0, 1), (1, 0), (1, 1)}, where 1 ≤ r ≤ p − 1; and iii) monotonic condition: if 1 ≤ r ≤ t ≤ p, then i r ≤ i t and j r ≤ j t is formed. This path has the smallest cumulative sum s n,m = d i * 1 ,j * 1 + · · · + d i * p ,j * p and encodes the sample correspondences between the two sequences. A toy example of DTW is shown in Figure 1.

B. GENERALIZED SMOOTH DTW
The optimal warping path can be discovered by minimizing DTW; however, original DTW is not differentiable because of the nonsmoothness of min operator in equation (2), which makes it difficult to minimize using gradient-based methods. To alleviate this issue, [9], [10] studied a smooth min operator that serves as an essential basis to develop the differentiable approximations of DTW.
Let η = [η 1 , . . . , η k ] ∈ R k , the smooth min operator is defined as follows: where k := {γ ∈ R k + : ||γ || 1 = 1} is a (k − 1) unit simplex, ., . denotes an inner product, is a strictly convex function on k , and β is a nonnegative regularization parameter. Because (4) is strictly convex, its minimum is unique and equal to the gradient (based on Danskin's theorem [11]): The equation shows that the smooth min operator also depends on the selection of the regularization function (γ ). Shannon entropy ( k i=1 γ i ln γ i ) or squared 2 norm ( 1 2 k i=1 γ 2 i ) are often chosen. While the former induces closed-form solutions for both smooth min and its gradient, the latter forces the gradient to be sparse. More details are given in Appendix A.
As the definition of the smooth min operator is already given, we can arrive at the following recursive formulation: where the generalized smooth approximation of DTW is defined by DTW (X, Y ) := s n,m . Note that we can have different versions of DTW , e.g., DTW =entropy or DTW =squared 2 , depending on selection of the regularization (γ ). Figure 1 (d) and (e) show the cumulative sum matrices of DTW =entropy and DTW =squared 2 , respectively. The generalized smooth DTW distance is different from the original DTW because it is differentiable. Furthermore, by minimizing DTW , the optimal warping path is discovered implicitly instead of specified directly, as in the original DTW.

III. GENERALIZED SEQUENTIAL CORRELATION ANALYSIS
In this section, we propose Generalized sequential correlation analysis (GSCA) for learning representation from twoview sequential data. We first present the model and its objective function. We then describe the optimization algorithm and compare the proposed method with DSCA, which was proposed in our preliminary work [7]. Given two data sequences X = [x 1 , . . . , x n ] ∈ R d x ×n and Y = [y 1 , . . . , y m ] ∈ R d y ×m from different representation spaces (d x = d y ), our method maps them into a shared subspace: where f x (·, ·) and f y (·, ·) are projection functions and θ x and θ y denotes their parameters. The projection functions are parameterized by deep feedforward neural networks or RNNs. If the former is selected, each sample x i of the sequence X is passed through several fully connected feed-forward layers to compute the output z x i . These outputs are then assembled into columns of the matrix Z x following the increasing order of the index i. The second view is processed in the same manner. Note that we use batch normalization (BN) [12] as the final layer. Thus, the output features empirically have zero mean and unit variance. For the latter, we stack several LSTM units to form two deep LSTM networks. Each network is also equipped with a BN layer to perform the normalization. The representation sequences Z x and Z y are then computed by feeding the input sequence X and Y , respectively, to the networks. Note that the parameters θ x and θ y are now equivalent to collections of all the weights matrices of the corresponding DNNs. Figure 2 shows the diagrams of the proposed method.

A. OBJECTIVE
Because the input sequences are unaligned, the sample-wise correspondence information between their representations Z x and Z y is also absent. Our method aims at minimizing the generalized smooth DTW distance between Z x and Z y . This allows the method to implicitly discover the optimal warping path, which encodes the sample correspondences as mentioned in Section II. In addition, the squared distances between the corresponding representation samples from the two views are also reduced simultaneously, pulling them closer in the shared subspace. The objective function of our method is as follows: (7) where the two regularization terms are of the following form: where v ∈ {x, y} and c v i,j is the element at the (i, j) position of These regularization functions are smooth approximations of the whitening constraints in CCAbased methods. More specifically, the whitening constraints enforce the features of the representations to be pairwise uncorrelated (C v = I). They are used to prevent trivial solutions, e.g., all the data samples are mapped into a single point in the shared subspace. In our method, because the representation sequences are normalized by BN layers, we further use the l 1 −norm to encourage sparsity in the off-diagonal elements of C v . λ 1 > 0 and λ 2 > 0 are regularization parameters that control the trade-off between whitening and warping the two representation sequences.

B. OPTIMIZATION
The parameters θ x and θ y can be trained using the gradientbased method. To compute the gradient of L GSCA w.r.t. all VOLUME 8, 2020 the parameters θ x and θ y , we compute its gradients w.r.t. the outputs Z x and Z y and then use backpropagation [13] in the case of feed-forward neural networks or backpropagation through time (BTT) [14] if the RNNs are used. We have The gradient of the generalized smooth DTW w.r.t. Z x can be computed as where In equation (12), we abused the notations defined in Section II, where s n,m := DTW (Z x , Z y ) and The derivative e i,j = ∂s n,m ∂d i,j can be computed efficiently using a forward-backward algorithm. The details of the algorithm and its complexity are given in Appendix C. The gradient of L x w.r.t. Z x can be computed as where H x ∈ R d×d , whose elements are defined as The gradient ∂L GSCA ∂Z y can be computed in a similar manner. Our model can be trained using a full-batch algorithm (L-BFGS) [15], as in [2]. For large datasets, however, this algorithm is both time and memory inefficient. An alternative is based on stochastic gradient descent (SGD) [16], [17] where the gradient is estimated based on a much smaller number of training samples (a minibatch). The details are shown in Algorithm 1. Note that we use a stochastic estimate of the covariance matrix for each view because at each iteration, t, the algorithm can access only a small number of samples instead of the whole training set.

C. COMPARISON WITH DSCA
GSCA is expanded on DSCA, which was proposed in our preliminary work [7]. Because of the exploitation of the generalized smooth DTW distance, GSCA is more generalized and efficient than DSCA. More specifically, depending on the selection of the regularization function (γ ) in DTW , we have different versions of the proposed method. We denote the version with the Shannon entropy as GSCA-e and the other where the squared 2 norm is selected as GSCA-s. In fact, the objective of GSCA-e is equivalent Algorithm 1 Stochastic Algorithm for GSCA Input: Batch size ratio α ∈ [0, 1], time constant ρ ∈ [0, 1], momentum µ ∈ [0, 1), and learning rate .
; 7: compute gradient ∇ θ using backpropagation; 8: θ (t) = θ (t−1) + θ (t) ; 10: end for to that of DSCA because DTW =entropy is equal to DTW β , i.e., a smooth approximation of DTW employed in the preliminary work. The proof is given in Appendix B. Despite that, computing the gradient of DTW w.r.t. the representation sequences is more efficient because its complexity is only O(nm) in terms of both time and space. In contrast, the gradients of DTW β w.r.t. Z x and Z y associate with a summation over all feasible alignment matrices 2 requires O(n 2 m 2 ) in both computational time and memory storage. Note that by selecting squared 2 norm as the regularization, the complexity of GSCA can be reduced further because of the sparsity of the gradient. However, this advantage possibly comes at a cost of lower alignment accuracy because DTW =squared 2 is a nonexact approximation of the original DTW [10]. DTW =entropy , instead, can exactly approximate DTW. It converges to the original warping distance as β → ∞.

IV. GENERALIZED SEQUENTIALLY CORRELATED AUTOENCODERS
In this section, we develop GSCAE as a variant of the proposed method. The objective of GSCAE is formed by integrating reconstruction losses of autoencoders with the objective of GSCA. Let g x (Z x , x ) and g y (Z y , y ) denote the functions that map the representation sequences Z x and Z y back to the original spaces, where x and y are their corresponding parameters. Then, the objective of GSCAE is as follows where λ > 0 is a trade-off parameter. Similar to projection functions in GSCA, g x (·, ·) and g y (·, ·) can also be parameterized by deep feed-forward neural networks or RNNs. Diagrams of GSCAE are illustrated in Figure 3. By minimizing L GSCA , the correspondences of samples between the views are discovered implicitly and their corresponding squared distances are also reduced. This amounts to maximizing mutual information, which presents the relation between the views. The view-specific relation, on the other hand, is expressed via minimizing the reconstruction errors. This is equivalent to maximizing a bound on the mutual information between the input and output of each view. Thus, GSCAE provide us a better trade-off between information within each view and cross-view information.
The advantages of GSCAE over GSCA come at some costs. For a specific application with a particular dataset, we need to carefully tune the trade-off parameter λ for GSCAE to achieve optimal performance. In addition, training GSCAE certainly requires more computational resources than those for GSCA. Specifically, when training GSCAE using a stochastic-based algorithm, we need to compute the gradients w.r.t. θ x and θ y , which are associated with both L GSCA and autoencoder parts. Furthermore, we also need to compute the gradients w.r.t. x and y , which are only dependent on the reconstruction losses. We note that similar to GSCA, GSCAE have different versions depending on the selection of the regularization function (γ ) in DTW . We denote them as GSCAE-e if (γ ) is Shannon entropy and GSCAE-s when squared 2 norm is selected.

V. GENERALIZED MULTIPLE SEQUENCES ANALYSIS
In this section, we propose generalized multiple sequences analysis (GMSA), which is an extended variant of GSCA for learning representation from multiple data sequences. Slightly differing from the previously proposed method, GMSA directly projects all the data sequences into the label subspace to learn more interpretable and discriminative representations. The projection functions are parameterized by DNNs, as in GSCA. To accommodate sequential mismatching, we introduce a consensus label sequence that is then aligned to all the output sequences of the DNNs. An alternating optimization algorithm is finally derived to solve the objective function w.r.t. parameters of the DNNs and the consensus label sequence.

A. OBJECTIVE
Given v data sequences X (k) ∈ R d x (k) ×n (k) for k = 1, . . . , v, we assume that each sample of these sequences belongs to one of c disjoint classes. The cluster assignment is often denoted by a matrix in the sequence X (k) . As in [18]- [20], for each view, we define a scaled cluster indicator matrix It turns out that In GMSA, each input data sequence is passed through a DNN to compute its output sequence Z (k) = f k (X (k) , θ (k) ) ∈ R c×n (k) , where θ (k) is a collection of all parameters of the k th network. The dimension of the new subspace is exactly the number of the classes, indicating that the input sequences are mapped into the same space with the labels. Because i belongs to the j th class and zero otherwise. VOLUME 8, 2020 the representation sequences in the new space are possibly unequal in length and sample-wise mismatched, we introduce a consensus label sequence Z ∈ R c×n of a prespecified length n and minimize its DTW distances with all sequences Z (k) . The optimization problem of GMSA is as follows: subject to Z ≥ 0 and ZZ = I, (18) where λ k is the weighted parameter for soft whitening regularization of the k th view, which is similar to GSCA. The nonnegative and orthogonal constraints are derived from (17), enforcing Z to satisfy the cluster indicator conditions. By introducing the consensus label sequence, we can avoid minimizing the sum of all pairwise DTW distances between output sequences, which is computationally demanding and prone to errors. We note that deep discriminant analysis with time warping in [21] also utilizes the idea of mapping the input data sequences into the label space. However, the authors assumed that the supervised information was already available. In our model, the sequential labels are not given in advance. Therefore, GMSA is completely unsupervised and its optimization problem is more difficult to solve.

B. OPTIMIZATION
In this section, we propose an alternating algorithm to solve the optimization problem of GMSA. More specifically, we update all the parameters of the DNNs iteratively when Z is fixed and then optimize the consensus label sequence after recomputing all output sequences of the DNNs. Let L GMSA denote the objective function in (18). When fixing the consensus label sequence, we can compute gradients of L GMSA w.r.t. Z (k) for k = 1, . . . , v as follows: Similar to optimization for GSCA, these gradients are then backpropagated to compute the gradients of L GMSA w.r.t. the parameters θ (k) for k = 1, . . . , v. When all parameters θ (k) are fixed, the output sequences Z (k) are recomputed and the optimization problem in (18) reduces to subject to Z ≥ 0 and ZZ = I. (20) By adding an extra penalty term ξ ||ZZ −I|| 2 F to the objective of problem (20), we can remove the orthogonal constraint.
; then, the update rule for Z is as follows: Algorithm 2 Alternating Optimization Algorithm for GMSA Input: Input data sequences X (k) for k = 1, . . . , v. Output: The optimal DNNs' parameters θ * and consensus sequence Z * . 1: for k = 1, . . . , v do 2: Run k-means on X (k) to generate cluster indicator matrix F (k) ; 3: Initialize Update the consensus Z using equation (21); 8: Update the DNNs' parameters θ using a stochastic algorithm; 9: Recompute To guarantee the orthogonality of Z, we set ξ to a relatively large value, ξ = 10 6 , in our experiments. The derivation of equation (21) is given in Appendix D. Because we use the Karush-Kuhn-Tucker (KKT) condition to update Z , the objective value of GMSA can be ensured to decrease monotonically. However, note that the objective function is not jointly convex w.r.t. all the variables and that the alternating optimization algorithm is not guaranteed to converge to the global optimum. Therefore, a good initial guess can help the algorithm to achieve a better optimal solution and converge faster. In this work, we separately run k-means algorithm on the input sequences X (k) to generate their cluster indicator matrices F (k) for k = 1, . . . , v. The initial value of Z (k) 's output of the DNNs are then computed using equation (16). Afterward, Algorithm 2 summarizes the alternating optimization procedure of GMSA.

VI. RELATED WORKS
To capture the relations between data samples among the views, conventional methods such as CCA and its variants implicitly assume that data of the views are equal in size and sample-wise matching. However, these assumptions are likely to be violated in sequential settings because data sequences often have different lengths and the sample correspondence information is also absent. One major approach to solve the aforementioned problem is directly combining CCA with DTW [22], [24]- [27]. These methods find a subspace such that projections of the two sequences are aligned and the learned representations of the two views are maximally correlated. Recently, [21] has proposed to combine deep CCA [2] with DTW. More complex and nonlinear embeddings can be obtained based on DNNs. For more details about the advantages of deep methods over the shallow ones, we refer the readers to a recent overview [28]. However, this direct combination has several serious drawbacks. Because the DTW problem is discrete and its objective is not differentiable, the alignment and representations are not optimized in a unified manner. Specifically, the new alignment is updated while fixing the representations and vice versa. Without a good initialization, this update scheme is prone to suboptimal solutions. In addition, when DNNs are used to map the two views into the new subspace, their expensive training procedures need to be performed multiple times. This makes the approach inefficient and unsuitable for extending to larger deep models.
Another approach is based on manifold alignment. These methods project data from two different but correlated manifolds to a subspace, simultaneously preserving the local structures and ensuring their closeness. A subgroup of this technique includes semisupervised methods [31]- [34] that utilize several sample-wise correspondences known in advance between the manifolds while learning a new subspace. In contrast, the second subgroup, which we focus on in this article, contains unsupervised manifold alignment methods that do not require correspondences to be predetermined. Because there is no prior information on the sample-wise pairing, [35] created a connection between the two views by comparing their local geometry, which is characterized by the k-nearest neighbors (k-NNs). [36] has recently proposed a variant of this method where the local geometry information is measured in the fuzzy granule space instead of the original space. [37] built a k-NN graph for each view and extracted a series of graph-based descriptors for each data sample. The cross-view similarity and dissimilarity matrices are then computed in the descriptors space. [38] constructed the crossview similarity matrix based on probability prediction results, which are obtained by performing classification tasks on both views. [39] took a different approach to the correspondence problem. Specifically, they encoded the cross-view samplewise matching into a binary matrix, which is jointly optimized with the projection matrices of the two views. [40] further extended the correspondence matrix with an extra row and column. They aimed to better handle outliers that had no corresponding sample from the other set. Nevertheless, these methods cannot take advantage of sequential order in the data to discover more accurate correspondence. In addition, they are also limited to shallow models and sensitive to noise that corrupts the adjacency and geometric information of the data.
[41], [42] proposed hybrid methods that combine unsupervised manifold alignment with DTW. Thus, they also inherit drawbacks from the two presented approaches. Our methods (GSCA and GSCAE) differ from the first approach because they discover the sample correspondence implicitly while learning the representations instead of alternately updating the projections of the views and aligning them. In addition, the proposed objective functions allow us to design an efficient stochastic algorithm for training the models, making them applicable to other deep models. Our methods also differ from the second approach because the smooth DTW can better utilize the sequential order of the data. Furthermore, we parameterize the projection functions by deep feed-forward neural networks and/or RNNs. Therefore, our models can better capture view-specific relations as they expand through the sequences and learn much more robust and richer nonlinear embedding functions. We note that [43] recently approached the problem of misalignment in multiview sequential learning using memory-based neural networks. Instead of recovering the sample correspondence between the views, this approach stores view-specific information in memory and makes it accessible to the neural network of the other view. Although having promising results in practice, we exclude this approach from our experiments because it is a supervised method, which is not the interest of this article.
In this work, we also consider a more challenging case where the input data comprise more than two sequences. To handle this problem, existing methods such as [44], [45] combine multiset CCA [46] and an approximation of DTW where the warping path is approximated by a linear combination of monotonic basic functions. In comparison with GMSA, these methods are less favorable to data with a complex latent structure because they can only learn a simple linear projection for each view. In addition, how to select a proper set of monotonic basics for a particular dataset remains unclear. Another closely related work of GMSA is deep discriminant analysis with time warping [21]. This method simultaneously projects and aligns the input data sequences to a given label sequence. In contrast, in our case, supervised information is unavailable. Therefore, GMSA solves a more complex problem.

A. COMPARED METHODS
We compare GSCA and its variant GSCAE with DSCA, which was proposed in our primary work [7], and the following two-view baselines: alignment-based method that encodes cross-view sample-wise correspondence into a binary matrix that is jointly optimized with the projections; • Manifold alignment time warping (MATW) [42]-a hybrid method where sample alignment is performed by DTW and where projection matrices are optimized to preserve the underlying structures of the two views; and compare GMSA with the following multiview method: • Generalized canonical time warping (GCTW) [44]-Multiset CCA is used to project data sequence into a shared subspace with an approximation of the DTW algorithm to align the projected sequences. We note that DTW in the objective of our methods has different versions depending on the selection of the regularization functions (γ ). Therefore, we add suffixes -e and -s to our methods, indicating that (γ ) is Shannon entropy and squared 2 norm, respectively.

B. EVALUATION MEASUREMENTS
All datasets in our experiments are divided into training, tuning, and test sets. We evaluate these methods by measuring class separation in the learned embedding spaces on the test set. First, we perform clustering tasks on the projections of the first view and evaluate how well the clusters agree with the ground-truth labels. 4 We follow the same procedure in [47], where spectral clustering [48] is used to handle possibly nonconvex cluster shapes. We set the number of clusters to the number of ground-truth classes available in each dataset. Clustering accuracy (ACC) and normalized mutual information (NMI) [49] are used as measurements for assessing the clustering performance. Second, we test the accuracy of a simple linear classifier on the learned embeddings. We train one-versus-one linear support vector machines (SVMs) [50] on the projected training set of the first view (label information is used). The trained model is then used to classify projections of the test set, and the percentage of errors is reported as a measurement of classification performance.

C. PARAMETER TUNING
We select the optimal parameters that return the best evaluation measurement results on the tuning set for each method.
Two-View Methods: Dimension d of the new subspace is selected from {5, 10, 20, 30, 50, 70}. For manifold alignment-based methods, we select the parameter that balances between sample matching and geometry preserving from {0.1, 0.2, 0.3, 0.4, 0.5}. The number of neighbors for building the k-NN graph that encodes the local geometry is selected from {1, 3, 5, 10, 15, 30}. We found that these methods return the best average results at k = 5. The trade-off parameter between the autoencoder regularization term and the alignment objective in AECTW and GSCAE is selected using a grid search. For the soft whitening constraints in DSCA, GSCA, and GSCAE, we set λ x = λ y , and their values are also selected using grid search. Another important 4 For GMSA, we use the projections of the views as their cluster indicators. parameter is β, which controls the regularization in the smooth min operator. We set β = 1, as suggested in [7], [51]. For the DNNs used in the compared methods, their topology and configurations are data-dependent and are specified in the following subsections.
Multiview Methods: For GCTW, similar to the two-view methods, we select a dimension of the new subspace from {5, 10, 20, 30, 50, 70}. To approximate the optimal warping paths, we use five hyperbolic tangent and five polynomial functions as the monotonic basics. The other parameters are set according to the original article. In contrast to GCTW, our method GMSA projects the input sequences into the label space. Therefore, we choose the dimension d of the new space for GMSA such that it is identical to the number of classes available in the datasets. We use the same soft whitening regularization parameters for all the views: λ 1 = λ 2 = . . . = λ v , and the value is chosen using grid search. Let n a , n s , and n l denote the average, shortest, and longest lengths, respectively, of the input data sequences. Then, the length of the consensus sequence Z is selected from a rounded set {n s , max(n s , 0.5n a ), max(n s , 0.75n a ), n a , min(n l , 1.25n a ), min(n l , 1.5n a ), n l }. The alternating optimization algorithm of GMSA is determined to be converged if the relative reduction of the objective is smaller than a tolerance = 10 −5 . In practice, we also terminate the algorithm if the number of iterations exceeds a prespecified value iter_max = 50.

D. TWO-VIEW DATA I: NOISY MNIST DIGITS
In this experiment, we utilize the MNIST dataset [17], which consists of 28 × 28 grayscale digit [0,9] images divided into 60K /10K for training/testing. Following the procedure in [47], we generate two-view data as follows. For the first view, we rescale the pixel to [0, 1] and randomly rotate the images at angles uniformly sampled from [− π 4 , π 4 ]. For each image of the first view, we randomly select an image of the same identity from the original dataset, add noise uniformly sampled from [0, 1], and truncate the pixel value to [0, 1]. This image is further resized to 24 × 24 and used for the second view. 10K from 60K image pairs of the original training set are set aside for tuning. set, we generate two misaligned sequences using a profile hidden Markov model (pHMM) [52]. Specifically, we generate two state sequences, consisting of MATCHING state M i that emits the i th matching sample and INSERT state I i intended for emitting sample replication. The transition probability is chosen such that from any state, the next state is MATCHING with probability 0.6 and INSERT with probability 0.4. We terminate the state sequences after reaching the (5 × 10 4 ) th MATCHING. The state M i of the first sequence corresponds to sample x i . For state I i , we replicate x i by randomly selecting a sample x l=l i from its class (having the same identity). (Similarly for the second sequence). Figure 4 shows a toy example of how to generate misaligned sequences using pHMM for a given small set of 10 training image pairs.
In this experiment, we used feed-forward neural networks for all DNN-based methods, including DCTW, DSCA, GSCA, and GSCAE. To parameterize the projection functions that map data from original spaces to the new subspace, we use two fully connected networks whose numbers of sigmoid units at hidden layers are 1200 − 1200 − 1200 (for the first view) and 1000 − 1000 − 1000 (for the second view). Note that each network includes a BN layer of d units on the top as an output layer. The view reconstruction in GSCAE is performed by the symmetric DNNs. Figure 5 visualizes the first view in the original space and its projections in the subspaces learned by different methods. The class separation results are shown in Table 1.
The results show that among manifold alignment-based methods, FGMA had better scores than LUMA and GUMA. FGMA evaluates the local geometry of the data after converting them into a fuzzy granule space. Thus, FGMA can discover more complex local structure information. MATW is a hybrid method. Differing from LUMA, FGMA, and GUMA, MATW discovers sample correspondences by DTW. Thus, it can take advantage of sequential order in the data to find the cross-view correspondence. Nevertheless, these methods returned poor results on noisy MNIST digits dataset because the noise corrupted the geometric information. In contrast, the deep learning-based methods returned much higher class separation results, even in noisy conditions. These methods mapped samples of the same class to similar locations while suppressing noise and rotational variation in the data. We also observe that methods with a smooth approximation of DTW, including DSCA, GSCA, and GSCAE, worked much better than CTW, AECTW, and DCTW, which directly combine the original DTW with variants of CCA. By minimizing the differentiable version of DTW, alignment and projection can be optimized in a unified manner. CSTW also implicitly optimizes DTW because it considers the warping path as a probabilistic variable. However, its EM algorithm still updates the alignment and projection matrices alternatively, which is prone to suboptimal solutions.
The experimental results also show that by carefully tuning the trade-off parameters and combining CTW or GSCA with autoencoders (forming the variants AECTW or GSCAE) can improve their performances. We note that each method TABLE 1. Clustering (ACC, NMI) and classifying (Error) results on the noisy MNIST digits dataset. The data sequences are generated randomly five times using the pHMM-based procedure. Each method is performed on these data to learn the new embeddings and the average results along with variances on projections of the test set are reported. proposed in this article has two versions depending on the selection of the regularization (η). Although GSCA-e and DSCA have similar scores as their objectives are equivalent, computing the gradient of GSCA-e is much more efficient. Figure 6 shows the average times for computing the stochastic gradients of GSCA-e, GSCA-s, and DSCA over different dimensions d. Because of the sparsity of the gradient induced by squared 2 norm, training GSCA-s and GSCAE-s are generally faster than GSCA-e and GSCAE-e, respectively. However, this advantage comes at a cost of slightly lower class separation scores.

E. TWO-VIEW DATA II: ACOUSTIC AND ARTICULATORY RECORDINGS
We next evaluate the performances of the two-view methods on the Wisconsin X-ray microbeam (XRMB) corpus [54], which consists of 2537 utterances recorded from 47 American English speakers. Lengths of the utterances vary from 63 to 2941 frames. Each frame is basically described by 39D acoustic features (13-dimensional mel-frequency cepstral coefficients [MFCCs] along with their first and second derivatives) and 16D articulatory features (horizontal/vertical displacement of 8 pellets attached to the tongue, lips, and jaw). To incorporate context information and generate two sequential views of different frame rates, we slide windows of 7 and 9 frame sizes over each utterance with one frame step size. The frames within the windows are concatenated, resulting in 273D acoustic and 144D articulatory input samples. Because each original frame belongs to one of 41 phone classes, we consider the labels of the central frames as those of the newly generated inputs.
The utterances are presently characterized by two sequences whose lengths are different and the sample correspondences are also missing. We randomly divide them into 1415/471/471 for training/tuning/testing. We use RNNs for GSCA and GSCAE to better capture the sequential nature of the data in this experiment. Specifically, for each view, we stack three LSTM units with the same numbers of memory cells (1500 for acoustic view and 1200 for articulatory view) along with a fully connected BN layer of d units at the output to parameterize the projection function. The view reconstruction in GSCAE is performed by the symmetric deep LSTM networks. Note that while training these models using Algorithm 1, at each iteration, we randomly sample acoustic and articulatory input sequences that correspond to one of the 1415 training utterances to compute the stochastic gradient. Thereby, we can take better advantage of the sequential nature in the data for training the RNNs. For DCTW and DSCA, we concatenate two views of the training utterances into two long sequences and feed them separately to two fully connected networks. These networks consist of three hidden layers whose activation functions are ReLU, and the numbers of the units are 1500 − 1500 − 1500 for the acoustic view and 1200 − 1200 − 1200 for the articulatory view. Table 2 shows the phone class separation results on representations obtained by different methods. Similar to the results on the noisy MNIST digits dataset, the DNNbased methods outperformed CTW, CSTW, AECTW, and the manifold alignment-based methods. The DNNs enable those methods to approximate projection functions nonlinearly, improving the quality of the learned embeddings. In this experiment, the designed RNNs have shown positive effects on our methods. They allow the models to better capture the sequential relations among data samples. As a consequence, GSCA and GSCAE achieved the highest scores among the compared methods. We also see that the results of AECTW and GSCAE surpass those of CTW and GSCA, respectively. This again validates the benefits of coupling autoencoderbased regularizations with the objective functions for providing a better trade-off between view-specific and cross-view information.
We then investigate average times for computing stochastic gradients of GSCA-e, GSCA-s, and DSCA. Because of the differences between their network architectures (RNN in GSCA and feed-forward network in DSCA), we exclude the computational time of the backpropagation 5 (Line 7 in Algorithm 1). Figure 7 shows that the computing gradients 5 Note that in GSCA, we use BTT for computing gradients for its RNNs  of GSCA are much faster than that of our primary model DSCA. This efficiency originates from the use of the new generalized smooth DTW. We can further reduce the training time by setting (η) in DTW to be squared 2 norm. However, as shown in Table 2, the class separation results were slightly decreased. Because DTW =squared 2 is a nonexact approximation of DTW, GSCA-s and GSCAE-s included some certain bias in comparison with the Shannon entropybased versions. In this experiment, we evaluate the performances of multiview methods, including GMSA and GCTW, on the Weizmann dataset [55], which consists of 90 videos of nine subjects, each performing ten actions: wave-one-hand (wave 1), wave-two-hand (wave 2), side, jump-in-place (pjump), jump-forward (jump), jack, skip, bend, walk, and run. Similar to [56], we concatenate videos of the same subject into a long video sequence following the presented order of the actions. Background is subtracted from each frame of these video sequences and the frames are then rescaled to the VOLUME 8, 2020  size 80 × 40. There are three types of features that can be computed to characterize the frames, including type 1: binary, type 2: Euclidean distance transform [57], and type 3: solution of the Poisson equation [58]. We generate threeview sequential data for training by selecting video sequences of the first three subjects, each of which is represented by one of the three feature types without repetition. As a result, each view of the data has different features, and each frame of the views belongs to one of the ten classes (see Figure 8 for more details). To reduce the dimensions of the feature space (3200), the top 123 principal components that preserve 99% of the total energy are selected. Videos of the next three subjects are used for tuning, and the remaining subjects' videos are utilized for testing.
For GMSA, we use RNNs to parameterize the projection functions. We use a similar network configuration for all the views because three views of the dataset have the same input dimensions. Specifically, the data of each view are passed through a deep network with three stacked LSTM units, each  Table 3 shows the class separation results on the representations learned by GCTW and GMSA.
The results show that the performances of clustering and classification on the new embeddings returned by GCTW and GMSA are much better than those on the original space. These results indicate that integrating complementary information from different views helps the two methods to improve the quality of the new representations. In addition, the results also show that the improvement of GMSA is more considerable because it can learn more richer nonlinear embeddings. In contrast, GCTW limits itself to a shallow model where only linear projection matrices can be obtained. Another unfavorable property of GCTW is that the accuracy of its alignment procedure heavily depends on the selection of the monotonic basic functions. However, how to choose a suitable collection of basics for a particular dataset remains unclear. Therefore, the inappropriate settings of monotonic bases might degenerate the results of GCTW.
We then empirically explore the convergence of the optimization algorithm for GMSA. Each outer iteration of the algorithm includes two steps: finding the optimal consensus label sequence and updating the parameters for all DNN branches of the models. Figure 9 shows the convergence cures (objective function value against the number of iterations) with and without the proposed initialization (see Algorithm 2) on the Weizmann dataset. The results show that the algorithm still converges even with a random start. However, the proposed initialization improved the performance of the optimization procedure significantly. Not only does this help the algorithm to converge faster, but a good initial guess also allows better solutions to be obtained. These results again elucidate the efficiency of the GMSA model.

G. MULTIVIEW DATA II: MMI FACIAL ACTION UNITS
We next exploit the MMI facial expression dataset [59], which contains more than 2900 videos of 75 different subjects, each performing a particular combination of an action unit (AU). In our work, we focus on videos of AU12, which corresponds to a smile. These videos consist of different number of frames, and each belongs to one of three classes: neutral (when facial muscle is inactive), apex (when facial muscle intensity is strongest), and onset (when facial muscle VOLUME 8, 2020 TABLE 5. Ablation analysis of GMSA-e. The views are removed one by one, ablating one corresponding branch of DNN from the model. The best class separation scores of the ablated GMSA-e along with their differences to the results of the original one (full views) are reported. starts to activate) or offset (when facial muscle begins to relax). We first preprocess each frame by performing face cropping and face alignment using dlib-ml [60]. The results are depicted in Figure 10. We then convert them to grayscale and reduce their dimension. Specifically, we utilize whitening PCA to pick the top 400 components, preserving 99% of the total energy. Finally, we generate sequential data with five views using videos S002-005, S003-023, S006-026, S014-009, and S017-004. Tuning and testing are performed on videos of the same subjects but in different trails (S002-006, S003-024, S006-025, S014-010, and S017-006).
In this experiment, RNNs are used to parameterize the projection functions for GMSA. We stack two LSTM units, which each has 800 memory cells, and a BN layer with d = 3 units as the output layer to form a deep network for each view. The projections of the views are used to predict the cluster labels and perform the classification task. Table 4 shows the results of GCTW and GMSA.
The results show the same pattern as on the Weizmann dataset: i.e., the representations learned by GCTW and GMSA significantly improve the performances of clustering and classification tasks in comparison with those on the original input features. Because GMSA-s and GMSA-e are nonlinear methods, their results are much better than those of GCTW. We also note that the multiple alignments in GMSA are simpler than that in GCTW because of the introduction of the consensus sequences. GCTW discovers the sample correspondences by instead performing pairwise alignment between every two views, while there are up to five views in this dataset. Therefore, more errors potentially occurred and propagated through update iterations in GCTW.
Finally, we investigated the convergence of Algorithm 2 on the MMI facial expression dataset. Its convergence curve, which shows the objective value against the number of iterations, is depicted in Figure 11. The figure shows that the algorithm always converges, regardless of the initial conditions. Because the updated equation for the consensus label sequence satisfies the KKT conditions and the optimization for DNNs' parameters is based on the gradient descent method, the objective value is guaranteed to not increase after each iteration. In addition, as on the Weizmann dataset, we also observe that the proposed initialization significantly improves the performance of the algorithm. With a better initial value, the algorithm could converge with a much lower objective value, hence obtaining a superior optimal solution.

H. ABLATION ANALYSIS OF GMSA
In this section, we conducted ablation experiments to investigate the multiview effect in the GMSA-e model. For a v-views dataset, 6 the GMSA-e model consists of v branches of DNNs, and each of which maps an input data sequence from one view into the shared label space. Following the same procedure in [61] and [62], we remove the branches of DNNs one by one and report the results in Table 5. The results show that some views are more important than others. For example, the absence of the Euclidean distance transform view in the Weizmann dataset or view 4 in the MMI dataset produces the most significant reduction in the results of the model. However, all of the views contribute more or less to the improvement of the model's performance. GMSA-e with full views has better separation scores than it does with view absence. This result again verifies the advantages of GMSA, which can handle multiple sequential views instantly.

VIII. CONCLUSION
Multiview sequential data pose many challenges to representation learning. In particular, the data sequences of different views are often unequal in size and sample-wise mismatching. Therefore, in this article, we introduced GSCA, a DNN-based method that can discover sample correspondence implicitly while learning representations. By using a generalized smooth DTW distance, which is a differentiable approximation of the original DTW, our model can be trained using a gradient descent-based algorithm, where the gradient can be computed efficiently in terms of both time and space. Our model can be easily extended to improve its learning performance. For instance, we added two DNNs to the GSCA model for view reconstructions, forming a new variant GSCAE. The second model allows a trade-off between view-specific and cross-view relations when learning the representations. Given more than two data sequences, it is obvious that both GSCA and GSCAE are inapplicable. Hence, we further develop the third model called GMSA that can simultaneously handle multiple data sequences and learn more interpretable representations. Through extensive experimentation on different publicly available datasets, our methods were compared with various baselines. The results show that the performances of our methods surpass those of the competitors of all the datasets.

APPENDIX A SMOOTH MIN OPERATOR
The smooth min operator is defined as: where the regularization term (γ ) must be a strictly convex function [9]. Two widely used functions are Shannon entropy and squared 2 norm. Shannon Entropy: If (γ ) = k i=1 γ i ln γ i , we obtain min (η) = min Because the objective is strictly convex, we can take its Lagrangian: With KKT conditions ∂L ∂γ i = 0 and slackness λ 2 γ i = 0, we have γ i = e βλ 1 −βη i −1 ∀i = 1, . . . , k.
Combining with the simplex constraint: Plugging this back into equation (25), we arrive at the minimum of (23) In summary, when using Shannon entropy as regularization, we have closed-form solutions of the smooth min operator and its gradient Squared 2 Norm: When (γ ) = 1 2 k i=1 γ 2 i , the smooth min becomes min (η) = min It can be easily shown that the minimum γ * (i.e. ∇min (η)) of (30) is the projection of −βη onto the simplex k γ * = argmin which is likely to be sparse. The solution of (31) can be efficiently obtained using the algorithm proposed in [63]- [65] with a complexity of O(k ln k).
Proof: Let i,j ⊂ be the set of all warping paths from (1, 1) to (i, j) and denote Note that when the regularization is the Shannon entropy smooth min has a closed-form expression, as shown in equation (28), we also have r i,j = min {s(π 0 )|π 0 ∈ i,j } .

APPENDIX C FORWARD-BACKWARD ALGORITHM
To compute e i,j in equation (12), we use the forwardbackward algorithm, which is originally introduced in [10]. The details are shown in Algorithm 3. The algorithm indeed computes the gradient matrix E, where e i,j is the element at position (i, j), of the generalized smooth DTW w.r.t. the distance matrix D. It includes a forward step and a backward step. Both of them perform constant-time operations in nm times. Therefore, the computational complexity of the algorithm is O(nm). In addition, during the computation, the algorithm stores several matrices whose largest size is 3nm. Thus, its space complexity is also O(nm). Note that when the squared 2 norm is used as regularization in DTW , q i,j become sparse because of equation (31). This then induces the sparsity in E, further reducing the complexity of the algorithm in terms of both time and space.