Kernel Clustering with Sigmoid Regularization for Efficient Segmentation of Sequential Data

The segmentation of sequential data can be formulated as a clustering problem, in which the data samples are grouped into non-overlapping clusters under the constraint that all members of each cluster are in successive order. A popular algorithm for optimally solving this problem is dynamic programming (DP), which has quadratic computation and memory requirements. Since sequences encountered in practice are often very long, this algorithm is not practical. Although many heuristic algorithms have been proposed to approximate the optimal segmentation, they offer no guarantee on the quality of their solutions. In this paper, we take a differentiable approach to alleviate the aforementioned issues. First, we introduce a novel sigmoid-based regularization that smoothly approximates the constraints. Combining it with the objective of balanced kernel clustering, we formulate a differentiable model termed Kernel clustering with sigmoid-based regularization (KCSR), in which a gradient-based algorithm can be exploited to obtain the optimal segmentation. Second, we develop a stochastic variant of the proposed model. By using the stochastic gradient descent algorithm for optimization, which has much lower time and space complexities, the second model can perform segmentation on overlong data sequences. Finally, for simultaneously segmenting multiple data sequences, we slightly modify the sigmoid-based regularization to further introduce an extended variant of the proposed model. Through extensive experiments on various types of data sequences, the performances of our models are evaluated and compared with those of existing methods. The experimental results validate the advantages of the proposed models. Our Matlab source code is available on github.


I. INTRODUCTION
Recently, there has been increasing interest in developing machine learning and data mining methods for sequential data. This is due to the exponential growth in the number of data sequences collected from applications in a wide range of fields, including computer vision [1]-[3], speech processing [4]-[6], finance [7]-[9], bio-informatics [10], [11], climatology [12]-[15] and traffic monitoring [16]-[18]. The main problem associated with the analysis of these sequences is that they consist of a huge number of data samples. Therefore, it is desirable to summarize the whole sequence by a much smaller number of data representatives, alleviating the burden on subsequent tasks.
Such a compressed and concise summarization can be obtained via sequence segmentation. More specifically, segmentation aims at partitioning a data sequence into several non-overlapping and homogeneous segments of variable durations, within which some characteristics remain approximately constant. It is widely recognized in the literature that the segmentation of sequential data can be considered a clustering problem. The difference is that all data samples of each cluster, which represents a segment, are constrained to be in successive order. Thus, in this paper, we focus on clustering-based methods for the segmentation of data sequences.
In practice, sequential data are often composed of nonlinear and complex segments. Therefore, kernel methods are often applied to map data samples into a new feature space before segmenting. Due to the constraint imposed on the data samples in each cluster, traditional clustering algorithms are inapplicable to the segmentation problem. [19] proposed an optimal algorithm based on dynamic programming (DP) for segmenting a data sequence in the feature space associated with a pre-specified kernel and mapping function. In general, DP has quadratic time and memory complexities. In practice, it even induces a running time of order O(n^4) (including the time for computing the cost matrix in the feature space [20]), where n is the length of the sequence. Therefore, it is intractable to perform segmentation on long data sequences using DP-based algorithms. To alleviate this issue, many attempts have been made to approximate the optimal algorithm. Although a considerable amount of the computational cost is reduced, critical drawbacks remain in the approximation algorithms. Take pruned DP [20] and the greedy algorithm [21] as representatives. These methods sequentially partition the data sequence, returning one segment boundary (a.k.a. change point) at each iteration. This strategy reduces the computational time. However, the expense is that errors may occur at the earlier steps and then influence the subsequent iterations, inducing a large bias in the final results. Massive memory complexity is also a vital drawback of almost all kernel-based methods: they need to store the kernel matrix, which requires O(n^2) space. Therefore, they are inherently prohibited from handling extensively long data sequences.
In this paper, we take a different approach to alleviate the aforementioned issues. More precisely, we introduce a novel sigmoid-based regularization, which smoothly approximates the constraints of the segmentation problem. It is then integrated with balanced kernel clustering to perform segmentation on sequential data. Our method has several preferable characteristics. First, because the objective of the proposed model is differentiable w.r.t. unconstrained and continuous variables, we can easily optimize it using the gradient descent (GD) algorithm. Different from the existing methods, which are merely heuristic approximations of the optimal segmentation algorithm, our model has a guarantee on the quality of its solutions, as the convergence of the GD algorithm has been theoretically proved [22]. Second, the proposed model admits a more efficient optimization algorithm based on the stochastic gradient, i.e., the gradient estimated from a subsequence (mini-batch) that is randomly sampled from the original data sequence at each iteration. Therefore, the stochastic variant of our model has much lower time and space complexities, making segmentation of extensively long data sequences possible. Finally, the proposed model is flexible. We can easily modify the sigmoid-based regularization to form a new extended variant that can simultaneously segment multiple data sequences. Through extensive experiments on various types of sequential data, our models are evaluated and compared with baseline methods. The results validate the advantages of the proposed models. In summary, the contributions of this paper are as follows:
• Introduction of a sigmoid-based regularization that enables kernel clustering to partition sequential data. The objective of the proposed method, called Kernel clustering with sigmoid regularization (KCSR), is smooth and can be effectively minimized using a gradient-based algorithm.
• Development of a stochastic variant of KCSR to reduce the memory complexity that is prominent in almost all kernel-based methods and prohibits them from handling large-scale datasets.
• Extension of KCSR for simultaneous segmentation of multiple data sequences.
• Extensive empirical evaluation of the proposed methods on widely used public datasets, showing their superiority over the existing methods.
The rest of this paper is organized as follows. In Section II, we review related works that perform segmentation based on clustering methods. Next, we briefly present some background for our proposed models in Section III. Section IV introduces the proposed model KCSR and its stochastic version. This section also describes how to modify the sigmoid-based regularization to form an extension of KCSR that can simultaneously segment multiple data sequences. After illustrating and discussing the experimental results in Section V, we conclude the paper in Section VI.

II. RELATED WORKS
In this paper, we focus on clustering-based methods for nonlinear segmentation of sequential data. Thus, we review related works in the literature of kernel segmentation, which is sometimes referred to as offline kernel change point detection (CPD) [23]. Here, the change points indicate the boundaries between the segments. In addition, we also review temporal clustering methods. They have recently gained more and more popularity in the computer vision field, where clustering-based algorithms are employed to segment videos of human motions.
Offline kernel change point detection. According to [23], almost all offline kernel CPD methods attempt to optimize the objective function defined in (2). This is also the objective of kernel k-means clustering. Based on the search scheme for the segment boundaries, existing methods can be divided into a local group, which uses sliding windows, and a global group, which is based on dynamic programming.
The local methods [24]-[28] slide a window with a sufficiently large width over the data sequence. They then detect, within the window, a single change point at which the difference between the preceding and succeeding samples is maximal. Although having low computational cost, these methods are sub-optimal, as the whole sequence is not considered when detecting the changes. Our approach is more similar to the global methods, which take all data samples into account for change detection. [19], [29] employed the dynamic programming (DP) algorithm to optimally obtain the segment boundaries. However, because DP has a time complexity of order O(n^4) (including the computational time of the cost matrix [20] in the feature space), it is impractical for handling long data sequences. To reduce the time complexity, [21] proposed a greedy algorithm that sequentially detects change points, one per iteration. [20] further reduced the space requirement by introducing pruned DP, which combines a low-rank approximation of the kernel matrix with the binary segmentation algorithm. Our approach is different from these two methods, as it searches for all the segment boundaries simultaneously. In addition, the quality of its solutions is guaranteed, as the convergence to an optimum of the gradient descent algorithm employed in our model is theoretically proved [22]. Both pruned DP and the greedy algorithm are heuristic approximations of the original DP. Since they sequentially detect the changes, errors at the early iterations are propagated and cannot be corrected at the subsequent iterations.
Temporal clustering refers to the factorization of data sequences into a set of non-overlapping segments, each of which belongs to one of k clusters. Maximum margin temporal clustering (MMTC) [30] and Aligned cluster analysis (ACA) [31] divide data sequences into a set of non-overlapping short segments. These subsequences are then partitioned into k classes using an unsupervised support vector machine [30] or kernel k-means clustering [31]. Recently, a branch of methods based on subspace clustering has been proposed. These methods often include two steps. First, given a data sequence X = [x_1, ..., x_n], they learn a new representation (coding matrix) Z = [z_1, ..., z_n] that characterizes the underlying subspace structures and sequential (a.k.a. temporal) information of the original data. Second, the normalized cut algorithm (Ncut) [32] is then utilized for segmentation of Z.
To preserve the sequential information in the new representation, [33], [34] proposed a linear regularization of the form ||ZR||_{1,2}, where R ∈ R^{n×(n−1)} is a lower triangular matrix with −1 on the main diagonal and 1 on the second diagonal. By minimizing this regularization jointly with the subspace learning objective, the new representations z_j and z_{j+1} of two consecutive samples x_j and x_{j+1}, respectively, are forced to be similar. [1] further integrated a weight matrix into the linear regularization to avoid constraining every pair of consecutive samples equally. Nevertheless, since the regularization is linear, it is ineffective for handling complex data structures. To alleviate this issue, [35], [36] proposed manifold-based regularizations that preserve the sequential information of the local neighborhood data samples. This type of regularization is preferable [2], as it often outperforms the linear one in most tests [37]. Our approach also employs regularization to model the sequential characteristics of the data. However, the sequential information is both globally and locally preserved in the proposed methods, thanks to the smoothness of the sigmoid functions. In addition, since the temporal regularization makes the representations of consecutive samples similar, the boundaries of the segments become difficult to identify. Our methods, in contrast, approximate the boundaries by the midpoints in a summation of sigmoid functions with high steepness. Therefore, our models are expected to obtain better segmentation accuracy.
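As a concrete illustration of the linear temporal regularization above, the following sketch (our own, for illustration only, not the cited authors' code) builds the difference matrix R and evaluates ||ZR||_{1,2}, i.e., the sum of the column 2-norms of ZR:

```python
import numpy as np

def difference_matrix(n):
    """R in R^{n x (n-1)} with -1 on the diagonal and +1 just below it,
    so that (Z @ R)[:, j] = z_{j+1} - z_j."""
    R = np.zeros((n, n - 1))
    idx = np.arange(n - 1)
    R[idx, idx] = -1.0      # -1 on the main diagonal
    R[idx + 1, idx] = 1.0   # +1 on the second (sub)diagonal
    return R

Z = np.array([[1.0, 1.0, 4.0, 4.0]])  # 1-D codes of a 4-sample sequence
R = difference_matrix(4)
# ||ZR||_{1,2}: sum of column norms, penalizing z_{j+1} != z_j
penalty = np.sum(np.linalg.norm(Z @ R, axis=0))
```

Minimizing this penalty drives consecutive codes toward each other, which is exactly why segment boundaries become hard to locate, as discussed above.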
Both temporal clustering and offline kernel CPD approaches have to store an affinity graph matrix and/or a kernel matrix, which require memory of order O(n^2). This is also a vital reason that inhibits them from handling long data sequences. The stochastic variant of our method has a significantly lower space requirement. At each iteration, it approximates the gradient based on a partial kernel matrix, which corresponds to the data samples in the current minibatch. Therefore, the memory complexity of Stochastic KCSR is only O(b^2), where b ≪ n is the minibatch size. Among the existing methods, only pruned DP in [20] is capable of handling large-scale data, because it employs a low-rank approximation of the kernel matrix, which only requires space of order O(r^2), where r ≪ n is the rank of the approximation. A comparison between the performances of Stochastic KCSR and this algorithm on large datasets is given in Section V.

III. BACKGROUND

A. NOTATIONS
Throughout this paper, we denote vectors and matrices by bold lower-case and bold upper-case letters, respectively. For a particular matrix A, its i-th column is denoted as a_i and its element at position (j, i) is expressed by a_{j,i} or A_{j,i}. The transpose of A is denoted by A^⊤. If A is a square matrix of size n, then its trace is expressed as Tr(A) = Σ_{i=1}^{n} A_{i,i}. If A ∈ {0, 1}^{k×n}, then any given element satisfies A_{j,i} = 0 or A_{j,i} = 1 (A is a binary matrix). By a ≪ b, we mean that a is very small in comparison with b.

B. KERNEL SEGMENTATION
The goal of the segmentation task is to partition a data sequence into several non-overlapping and homogeneous segments of variable durations. Let X = [x_1, ..., x_n] ∈ R^{d×n} denote the given sequence of length n and dimension d. For a number of segments k specified in advance, a valid solution of the k-segmentation problem can be represented by a sample-to-segment indicator matrix G ∈ {0, 1}^{k×n}, whose elements are

G_{j,i} = 1 if x_i belongs to the j-th segment, and G_{j,i} = 0 otherwise.   (1)

G must satisfy two constraints: i) Boundary: G_{1,1} = 1 and G_{k,n} = 1; and ii) Monotonicity: if G_{i,j} = 1, then in the next column either G_{i,j+1} = 1 or G_{i+1,j+1} = 1. An example of the indicator matrix is given in Figure 1.
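The boundary and monotonicity constraints can be verified mechanically. Below is a small illustrative sketch (our own, not from the paper's Matlab code) that validates a candidate indicator matrix:

```python
import numpy as np

def is_valid_segmentation(G):
    """Check that a binary k x n sample-to-segment indicator matrix
    satisfies the boundary and monotonicity constraints."""
    k, n = G.shape
    if not (G.sum(axis=0) == 1).all():        # exactly one segment per sample
        return False
    if G[0, 0] != 1 or G[k - 1, n - 1] != 1:  # boundary constraint
        return False
    rows = G.argmax(axis=0)                   # segment index of each sample
    steps = np.diff(rows)
    # monotonicity: segment index stays or moves to the next segment
    return bool(((steps == 0) | (steps == 1)).all())

G = np.array([[1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1]])
ok = is_valid_segmentation(G)
```

A matrix whose clusters are interleaved (e.g., samples of segment 1 reappearing after segment 2) fails the monotonicity check, which is precisely what distinguishes segmentation from plain clustering.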
To discover segments with complex and nonlinear structures, kernelization is often applied. More specifically, the data sequence X is mapped onto some high-dimensional space (a.k.a. feature space) associated with a pre-specified kernel function κ(•, •) : R^d × R^d → R. The mapping function φ(•) is implicitly defined by φ(x_i) = κ(x_i, •), resulting in the inner product φ(x_i)^⊤φ(x_j) = κ(x_i, x_j). A common objective for segmentation is to minimize the total summation of the intra-segment variances [23]. Thus, the optimization problem is often formulated as follows:

min_G Σ_{j=1}^{k} Σ_{i: G_{j,i}=1} ||φ(x_i) − μ_j||^2,   (2)

where the minimization is over the set of all valid sample-to-segment indicator matrices and μ_j is the mean of the j-th segment in the feature space. We can observe that the objective of this problem is similar to that of kernel k-means, and it is difficult to minimize because G is a discrete variable with combinatorial constraints.

C. BALANCED KERNEL K−MEANS
As mentioned above, kernel segmentation is closely related to kernel k-means due to the similarity between their objectives. In fact, this objective can be rewritten in matrix form. More specifically, we can compute the corresponding kernel matrix K ∈ R^{n×n}, where each element K_{i,j} = φ(x_i)^⊤φ(x_j) = κ(x_i, x_j) represents how likely the two samples are to be assigned to the same class. Let G ∈ {0, 1}^{k×n} denote the associated (unknown) sample-to-class indicator matrix of X, where G_{i,j} = 1 if x_j is assigned to the i-th class and zero otherwise. Here, different from the segmentation task, there is no constraint on the indicator matrix G. Then the objective function of kernel k-means [38]-[40] can be expressed as follows:

min_G Tr(K) − Tr((GG^⊤)^{−1} G K G^⊤),   (3)

where GG^⊤ is the diagonal matrix whose entries are the cluster sizes. Kernel k-means is a strong approach for identifying clusters that are non-linearly separable in the original space. However, similar to its linear counterpart, kernel k-means is sensitive to outliers. More specifically, it often outputs unbalanced results that consist of too-big and/or too-small clusters in the presence of anomalous data samples [41]. To alleviate this issue, [42] recently proposed a simple regularization on the indicator matrix of the form Tr(G11^⊤G^⊤), where 1 is a vector whose elements all equal one. By minimizing this regularization jointly with the clustering objective, we can prevent too small or too great a number of data samples from being partitioned into a cluster. We can now combine (3) and the regularization to form the new objective of balanced kernel k-means:

min_G Tr(K) − Tr((GG^⊤)^{−1} G K G^⊤) + λ Tr(G11^⊤G^⊤),   (4)

where λ is a positive parameter that controls the balanced regularization.
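For concreteness, the balanced kernel k-means objective, combining the clustering term with λTr(G11^⊤G^⊤), can be evaluated directly from K and G. A minimal sketch, assuming a standard trace form of the kernel k-means term (the paper's exact matrix form may differ slightly):

```python
import numpy as np

def balanced_kkm_objective(K, G, lam):
    """Tr(K) - Tr((G G^T)^{-1} G K G^T) + lam * Tr(G 1 1^T G^T).
    Assumed matrix form of balanced kernel k-means, for illustration."""
    GG = G @ G.T                                  # diagonal matrix of cluster sizes
    cluster_term = np.trace(np.linalg.inv(GG) @ G @ K @ G.T)
    sizes = G.sum(axis=1)
    balance = np.sum(sizes ** 2)                  # Tr(G 1 1^T G^T) = sum of squared sizes
    return np.trace(K) - cluster_term + lam * balance

K_demo = np.eye(4)                                # toy kernel matrix
G_demo = np.array([[1, 1, 0, 0],
                   [0, 0, 1, 1]], dtype=float)    # two balanced clusters
val = balanced_kkm_objective(K_demo, G_demo, 0.0)
```

Since the total number of samples is fixed, the sum of squared cluster sizes is minimized when the clusters are equally sized, which is how the regularization enforces balance.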

IV. THE PROPOSED METHOD

A. KERNEL CLUSTERING WITH SIGMOID-BASED REGULARIZATION (KCSR)
Our intuitive idea is to reuse the robust objective of balanced kernel k-means (4) for the segmentation of a data sequence X = [x_1, ..., x_n] ∈ R^{d×n}. However, the challenge is that the sample-to-segment indicator matrix must satisfy two constraints, boundary and monotonicity, while the indicator matrix for clustering does not. This difference is illustrated in Figure 2. To close this gap and enable the clustering approach to segment data sequences, we introduce a novel regularization that smoothly approximates the two constraints above. The new regularization changes the variables from a discrete to a continuous domain. Therefore, our problem can be solved using the gradient descent (GD) algorithm. Since the convergence of GD has already been proved [22], the quality of the proposed models' solutions is guaranteed.

The proposed regularization is based on the sigmoid function. A basic sigmoid function is defined as

σ(j) = 1 / (1 + e^{−α(j−β)}),   (5)

where β specifies the midpoint and α controls the steepness of the function curve at the midpoint. Figure 3 depicts a sigmoid function, where the midpoint β is fixed at 11.5 and the parameter α varies from 0.1 to 10. We can observe that the higher α is, the steeper the function curve at the midpoint becomes. In addition, the sigmoid function is monotonic and almost piecewise constant. Therefore, it allows us to roughly partition a sequence into two segments, where the parameter β approximates the segment boundary. If we denote τ_j ∈ [1, 2] (continuously valued) as the segment label of sample x_j, then

τ_j = 1 + 1 / (1 + e^{−α(j−β)}).   (6)

For instance, if α = 10 and β = 11.5, then τ_j ≈ 1 for j < 11.5 and τ_j ≈ 2 otherwise. To generalize to cases where the number of segments k > 2, we propose to use a summation of k − 1 sigmoid functions with different parameters:

τ_j = 1 + Σ_{i=1}^{k−1} 1 / (1 + e^{−α(j−β_i)}).   (7)

Figure 4 illustrates an example of a summation of sigmoid functions defined in (7). Here, the steepness parameter α is shared among the sigmoid functions within the summation. The k − 1 midpoint parameters β = [β_1, ..., β_{k−1}] approximate the segment boundaries between the k segments. Note that the midpoints must satisfy 1 ≤ β_1 < ... < β_{k−1} ≤ n to guarantee that the summation of sigmoid functions is monotonically increasing. Thus, we regularize β by further introducing k parameters γ_1, ..., γ_k such that

β_i = n · (Σ_{l=1}^{i} e^{γ_l}) / (Σ_{l=1}^{k} e^{γ_l}).   (8)

In equation (8), the ratio lies in the interval (0, 1). In addition, the ratio becomes larger as i increases. This guarantees that 1 ≤ β_1 < ... < β_{k−1} ≤ n. It is notable that the summation of sigmoid functions in Figure 4 smoothly approximates the indicator matrix G of the segmentation example in Figure 2(b). To make this observation clearer, we introduce the following approximation to each element of G:

G_{i,j} ≈ max(0, 1 − |τ_j − i|).   (9)

This equation maps the segment label τ_j from the range [1, k] to the range [0, 1] for approximating the sample-to-segment indicator matrix.

We now can formulate an optimization problem that combines the objective of balanced kernel clustering with the sigmoid-based regularization for segmentation. Let K ∈ R^{n×n} be the kernel matrix of the data sequence X; then our kernel-based segmentation optimization problem is

argmin_{γ_1,...,γ_k} J(γ) = Tr(K) − Tr((GG^⊤)^{−1} G K G^⊤) + λ Tr(G11^⊤G^⊤),   (10)

where G is computed from γ through (8), (7) and (9). Since γ = [γ_1, ..., γ_k] are unconstrained and continuous parameters, we can optimize the objective function in (10) using the gradient descent algorithm. Let J(γ) denote the objective function in (10); then the gradient w.r.t. the parameters γ can be computed using the chain rule:

∂J/∂γ = (∂J/∂G)(∂G/∂τ)(∂τ/∂β)(∂β/∂γ),   (11)

where τ = [τ_1, ..., τ_n]. More details on the derivation of the gradient w.r.t. γ are given in Appendix A. We call the proposed model Kernel clustering with sigmoid regularization (KCSR) and its optimization algorithm is given in Algorithm 1.

Algorithm 1: Gradient descent algorithm for KCSR
Input: Kernel matrix K, number of segments k, steepness parameter α, tolerance ε.
1: repeat
2:    compute the gradient ∇γ = ∂J/∂γ;
3:    compute the step size η using Armijo-Goldstein line search [22], [43];
4:    update γ ← γ + ∆γ, where ∆γ = −η∇γ;
5: until the change of the objective J is smaller than ε
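The sigmoid-based parameterization can be sketched in a few lines: unconstrained parameters γ yield ordered midpoints β through normalized cumulative sums, and the summation of sigmoids produces the continuous labels τ_j. This is an illustrative re-implementation; the names and the exact normalization are our assumptions:

```python
import numpy as np

def soft_labels(n, gamma, alpha=10.0):
    """Continuous segment labels tau_j in [1, k] from unconstrained gamma.
    Sketch of the sigmoid-based regularization described in the text."""
    w = np.exp(gamma)
    # ordered midpoints: cumulative ratios of positive weights, scaled by n
    beta = n * np.cumsum(w)[:-1] / w.sum()
    j = np.arange(1, n + 1)
    # summation of k-1 sigmoids shared steepness alpha
    tau = 1.0 + sum(1.0 / (1.0 + np.exp(-alpha * (j - b))) for b in beta)
    return tau, beta

tau, beta = soft_labels(20, np.zeros(2))  # k = 2 segments, n = 20 samples
```

Because the ratios of cumulative sums are strictly increasing and bounded by 1, the midpoints come out ordered automatically, regardless of the values of γ.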

B. STOCHASTIC KCSR
Kernel segmentation allows us to capture nonlinear structure in the data. However, this advantage is achieved at the expense of much higher complexities in terms of both computational time and storage requirements. More specifically, given a sequence of n data samples, existing kernel-based methods compute the kernel matrix K, whose time and memory complexities are both of order O(n^2). Note that this is also true for temporal clustering methods, where an affinity graph matrix of size O(n^2) is computed and stored while performing the Ncut algorithm. When n is large, these methods become computationally difficult. For example, the average length of the acceleration data for activity recognition in the experimental section is about 125K. The corresponding kernel matrix K would require up to approximately 116.4 GB of storage, which is definitely out of memory for a regular computer.
Our method is also based on the kernel matrix. In particular, at each iteration it computes the gradient using the full kernel matrix, which makes it very slow and, due to the large memory requirement, even infeasible for long data sequences. Fortunately, since the objective function of KCSR is differentiable, we can reduce the complexities by using stochastic gradient descent (SGD) [44]-[46]. SGD estimates the gradient from a randomly sampled subsequence (a mini-batch), which consists of a much smaller number of samples, of the original sequence. (By sub-sequence, we mean that the order and indexes of the samples in the original sequence are preserved in the randomly sampled mini-batch.) Let b ≪ n denote the length of the randomly sampled subsequence X^(t), where t is the iteration index. Then, the stochastic gradient is estimated as follows:

∂J/∂γ ≈ (∂J(γ)/∂G^(t))(∂G^(t)/∂τ^(t))(∂τ^(t)/∂β)(∂β/∂γ).   (12)

In equation (12), ∂J(γ)/∂G^(t) is only associated with a partial kernel matrix K^(t) ∈ R^{b×b}, which corresponds to the samples in X^(t). Therefore, it is much more efficient in terms of both running time and memory consumption than computing the full-batch gradient as in equation (11). Details of the algorithm are given in Algorithm 2, and a complexity comparison between the proposed methods and several baselines is given in Table 1.

Algorithm 2: Stochastic gradient descent algorithm for SKCSR
Input: Data sequence X, number of segments k, steepness parameter α, minibatch size b, number of iterations T.
1: for t = 1, ..., T do
2:    randomly sample a subsequence X^(t) of length b;
3:    compute the partial kernel matrix K^(t);
4:    compute the stochastic gradient ∇γ = ∂J/∂γ based on K^(t) and the original indexes of the samples in X^(t);
5:    update γ^(t) ← γ^(t−1) + ∆γ^(t);
6: end for

We note that the convergence of both gradient descent with step size found by Armijo-Goldstein line search [22], [43] and stochastic gradient descent with vanishing step size is theoretically proven. In fact, it is well known [47], [48] that gradient descent (GD) after T iterations can find a solution with error O(T^−1) and stochastic gradient descent (SGD) after T iterations can find a solution with error O(T^−0.5). Thus, both KCSR and SKCSR can obtain good solutions to the problem defined in (10) given enough iterations.
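The mini-batch sampling that underlies the stochastic gradient is simple to sketch: draw b indices without replacement and sort them, so the mini-batch remains a sub-sequence of the original data. An illustrative snippet (not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subsequence(n, b):
    """Randomly pick b sample indices and sort them, so the mini-batch
    is a sub-sequence preserving the original ordering of the data."""
    return np.sort(rng.choice(n, size=b, replace=False))

idx = sample_subsequence(1000, 64)
# the partial kernel matrix K_t = K[np.ix_(idx, idx)] is only b x b,
# so memory per iteration is O(b^2) instead of O(n^2)
```

Keeping the indices sorted matters: the sigmoid labels τ_j depend on the original positions j, so each sampled index must be evaluated at its position in the full sequence.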

C. MULTIPLE KCSR
In practice, in some circumstances we need to perform segmentation on multiple data sequences. If these sequences are unrelated, the problem is effortless, since segmentation algorithms can be applied to each sequence independently. However, when the sequences are related to each other, performing multiple segmentations without considering the relations among the sequences would induce inferior results. We take the sequential segmentation and matching (SSM) problem as a case study. Given m ≥ 2 data sequences, SSM aims at partitioning each sequence into several homogeneous segments and then establishing the correspondences between these segments across the different sequences. A popular application of SSM is human action analysis. Specifically, human action videos are segmented into primitive actions and the resulting sequences of action segments are then aligned [49]-[51].
To solve the SSM problem, in this work we introduce an extension of the proposed model termed Multiple kernel clustering with sigmoid-based regularization (MKCSR). MKCSR jointly partitions each data sequence into k segments such that the c-th segments of all the m sequences are matched. Let X_p ∈ R^{d×n_p} for 1 ≤ p ≤ m denote the p-th data sequence and G_p ∈ R^{k×n_p} be its corresponding sample-to-segment indicator matrix. MKCSR first concatenates all the sequences to form a single long sequence X = [X_1, ..., X_m] ∈ R^{d×n}, where n = Σ_{p=1}^{m} n_p, and G = [G_1, ..., G_m] is the corresponding indicator matrix of X. Similar to the original KCSR, each element of G is defined as in (9). However, in MKCSR, the segment label τ_j is computed as follows:

τ_j = 1 + Σ_{i=1}^{m(k−1)} 1 / (1 + e^{−α(j−β_i)}) − (k − 1) Σ_{p=1}^{m−1} 1 / (1 + e^{−α(j−n̄_p)}),   (13)

where n̄_p = Σ_{q=1}^{p} n_q are the junction points. The function (13), which we call the cut-off summation of sigmoid functions, consists of two components. The first component is the summation of sigmoid functions; it plays a similar role as (7) in KCSR. The second component represents the cut-off points (a.k.a. junction points), at which two of the m original data sequences are connected. It resets the segment label from k to 1 after passing the final sample of one sequence and reaching the first sample of the next sequence. The cut-off summation of sigmoid functions and its components are illustrated in Figure 5.
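A sketch of the cut-off summation of sigmoids may help: an ordinary summation of sigmoids, minus (k − 1) times a sigmoid placed at each junction between consecutive sequences, so that the label falls back to 1 there. This is our reconstruction for illustration; the paper's exact form of the cut-off term may differ:

```python
import numpy as np

def cutoff_tau(lengths, beta, alpha=10.0, k=2):
    """Cut-off summation of sigmoids over m concatenated sequences.
    `beta` holds the midpoints; junctions sit at cumulative lengths."""
    n = sum(lengths)
    j = np.arange(1, n + 1)
    # first component: ordinary summation of sigmoids
    tau = 1.0 + sum(1.0 / (1.0 + np.exp(-alpha * (j - b))) for b in beta)
    # second component: subtract (k-1) at every junction to reset the label
    for cut in np.cumsum(lengths)[:-1]:
        tau -= (k - 1) / (1.0 + np.exp(-alpha * (j - (cut + 0.5))))
    return tau

# two sequences of 10 samples each, k = 2 segments per sequence
tau = cutoff_tau([10, 10], beta=[5.5, 15.5])
```

In this toy run the label climbs from 1 to 2 inside the first sequence, drops back to about 1 right after the junction at position 10, and climbs to 2 again inside the second sequence.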

V. EXPERIMENTS

A. BASELINES
We compare KCSR and its stochastic variant SKCSR with the following baselines:
• AKS [20] - a heuristic approximation of the optimal kernel segmentation, where the solution is obtained by a pruned DP algorithm that combines a low-rank approximation of the kernel matrix with the binary segmentation algorithm.
• Greedy kernel segmentation (GKS) [21] - another heuristic approximation of the optimal kernel segmentation, which detects the segment boundaries sequentially using a greedy algorithm.

B. DATASETS
To evaluate the performances of the above methods, we use a synthetic dataset and five real-world, widely used public datasets.
Synthetic data. We first generate 2D data samples that form four circles of different diameters. They are illustrated in Figure 8(a). The number of data samples of each circle is randomly selected in the range [500, 1500] and is also constrained to be different. For instance, in our case, the numbers of data samples of the circles from low to high diameters are 832, 1018, 1174 and 843, respectively. We then arrange the generated data samples in contiguous order, i.e., data samples of one circle do not mix with those of the other circles. By doing so, each circle in the original 2D space corresponds to a segment in the new sequential data. See Figure 8(b) for an illustration.
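The synthetic sequence can be reproduced along these lines (the radii and the random draw are our own illustrative choices; the per-circle counts below copy the ones quoted above):

```python
import numpy as np

rng = np.random.default_rng(1)

def circle(n, radius):
    """n points on a circle of the given radius."""
    t = rng.uniform(0, 2 * np.pi, n)
    return np.stack([radius * np.cos(t), radius * np.sin(t)])

# per-circle counts as quoted in the text; radii are illustrative
counts, radii = [832, 1018, 1174, 843], [1.0, 2.0, 3.0, 4.0]
X = np.concatenate([circle(c, r) for c, r in zip(counts, radii)], axis=1)
# samples of each circle stay contiguous, so circle -> segment
labels = np.repeat(np.arange(4), counts)
```

Because the points of each circle are kept contiguous, the ground-truth segmentation is simply the cumulative sums of the counts.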
Weizmann data. The Weizmann dataset [53] consists of 90 videos of nine subjects, each performing ten actions: bend, run, jump-in-place (pjump), walk, jack, wave-one-hand (wave1), side, jump-forward (jump), wave-two-hand (wave2), and skip. Similar to [54], the videos of the same subject are concatenated into a long video sequence following the presented order of the actions. We then subtract the background from each frame of these video sequences and rescale them to the size 70×35. For each 70-by-35 rescaled frame, we compute the binary feature as shown in Figure 6(a). To reduce the dimension of the feature space (2450), the top 123 principal components, which preserve 99% of the total energy, are kept for the experiments.
MMI Facial action units. We exploit the MMI Facial Expression dataset [55], which contains more than 2900 videos of 75 different subjects, each performing a particular combination of Action Units (AUs). In this paper, we focus on videos of AU12, which corresponds to a smile. Although these videos consist of different numbers of frames, they are composed of exactly five segments in the following order: neutral, onset, apex, offset, neutral, where neutral is when the facial muscle is inactive, apex is when the facial muscle intensity is strongest, onset is when the facial muscle starts to activate, and offset is when the facial muscle begins to relax. Following the same pre-processing procedure as in [56], we cropped and aligned the faces using dlib-ml [57]. The results are depicted in Figure 7. We then convert them to grayscale and reduce their dimension to 400 using whitening PCA. We finally selected the videos of five subjects (2, 3, 6, 14 and 17) for the experiments. Their ground-truth frame-to-segment labels are already given in the original dataset.
Google spoken digits. Google's Speech Commands (GSC) [58], [59] is a large audio dataset that consists of more than 30 categories of spoken terms. For each category that relates to the digits from "one" to "nine", we randomly select a clean recording. These recordings are then concatenated, forming a long audio sequence with 19 segments (9 segments of active voice and 10 silent segments) (see Fig. 10). We further add white noise, which is also provided in the GSC dataset, to make the segmentation problem more challenging. Finally, a sequence of acoustic features, namely 13-dimensional mel-frequency cepstral coefficients (MFCCs) [60] computed for every 10 ms of a 25 ms window, is extracted from the noisy audio sequence. The annotation is manually obtained based on the log filter-bank energies of the clean audio.
Ordered MNIST data. The MNIST dataset [61] consists of 28×28 grayscale digit (0-9) images divided into 60K/10K for training/testing. Since all the compared methods are unsupervised and require no training phase, we use all 70K images to perform segmentation. Note that the original data is not exactly suited to the sequential assumption. Following the same setting as [2], we rearrange the order of the images such that those of the same digit form a contiguous segment and the ten segments are concatenated into a very long image sequence (see Figure 6(b)). Different from [2], where only 2K images were selected for the experiments, our ordered MNIST data consists of the whole 70K images. To handle this large-scale data, temporal clustering and kernel CPD methods require up to 36.5 GB to store the kernel and/or affinity graph matrices, which is impractical on a single personal PC. Among the compared methods, only SKCSR and AKS, with their low memory complexities, can perform segmentation on this dataset.
Acceleration data. The acceleration data [62] are acquired from a triaxial accelerometer mounted on the chests of 15 subjects, each performing a sequence of activities such as working at a computer, standing, walking, going up/down stairs and talking. The aim of our experiments is to partition the data sequences into segments that correspond to the activities. Thus, we first pre-process the data. For each subject, we add the squares of the signals from the three axes. An example is depicted in Figure 11. We then transform the obtained summation signal using the wavelet transform with scale factor 64 and the Morlet wavelet as the mother wavelet function. The resulting 2D wavelet coefficient matrix C is of size 64-by-n_acc, where n_acc is the length of the original acceleration signal. Note that the wavelet coefficients C are complex numbers; thus we take their modulus as input for the methods in our experiments. Similar to the ordered MNIST data, this dataset consists of long acceleration sequences. The average n_acc is 125K. Therefore, the methods with memory complexities of order O(n^2) would require up to approximately 116.4 GB of storage, which is unaffordable in our case. In our experiments, only SKCSR and AKS can handle this dataset.
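The pre-processing described above (summing the squared axes, applying a Morlet continuous wavelet transform over 64 scales, and taking the modulus) can be sketched as follows. The minimal Morlet CWT here is our own simplification; the paper's exact wavelet parameters are not specified:

```python
import numpy as np

def morlet_cwt(x, scales, w0=5.0):
    """Minimal complex Morlet CWT sketch (an assumption: the paper's
    exact wavelet settings may differ). Returns the modulus of the
    coefficients, shape (len(scales), len(x))."""
    out = np.empty((len(scales), len(x)))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1) / s
        psi = np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2) / np.sqrt(s)
        c = np.convolve(x, np.conj(psi)[::-1], mode="same")
        out[i] = np.abs(c)          # modulus of the complex coefficients
    return out

rng = np.random.default_rng(2)
acc = rng.standard_normal((3, 1000))            # stand-in tri-axial signal
energy = np.sum(acc ** 2, axis=0)               # sum of squared axes
C = morlet_cwt(energy, scales=np.arange(1, 65)) # 64-by-n modulus matrix
```

In practice a dedicated wavelet library would be used instead; the point is only that the output is a real 64-by-n_acc matrix suitable as input for the segmentation methods.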

C. EVALUATION MEASURES
Given a specific value k, KCSR, SKCSR, AKS and GKS return exactly k non-overlapping segments, whereas temporal clustering-based methods partition the samples of the data sequence into k clusters that may be dispersed over discontiguous segments. Since all the compared methods are based on a clustering scheme, we use accuracy and normalized mutual information [63] as evaluation metrics to assess the segmentation results.
Let \hat{L} = [\hat{l}_1, \ldots, \hat{l}_n] and L = [l_1, \ldots, l_n] be the obtained labels and the ground-truth labels of a given data sequence X = [x_1, \ldots, x_n], where \hat{l}_j = i (similarly for l_j), 1 ≤ i ≤ k, indicates that x_j belongs to cluster (segment) \hat{c}_i. The accuracy (ACC) is defined as follows:

ACC = \frac{1}{n} \sum_{j=1}^{n} \delta\big(l_j, \mathrm{map}(\hat{l}_j)\big),

where \delta(a, b) is the delta function that equals one if a = b and zero otherwise, and \mathrm{map}(\hat{l}_j) is the permutation mapping function that maps label \hat{l}_j to the equivalent ground-truth label. In this work, we use the Kuhn-Munkres algorithm [64] to find this mapping. Let \hat{C} = [\hat{c}_1, \ldots, \hat{c}_k] and C = [c_1, \ldots, c_k] be the obtained clusters and the ground-truth clusters. Their mutual information (MI) is

MI(C, \hat{C}) = \sum_{i=1}^{k} \sum_{i'=1}^{k} p(c_i, \hat{c}_{i'}) \log \frac{p(c_i, \hat{c}_{i'})}{p(c_i)\, p(\hat{c}_{i'})},

where p(c_i) and p(\hat{c}_{i'}) are the probabilities that a data sample arbitrarily selected from the sequence belongs to clusters c_i and \hat{c}_{i'}, respectively, and p(c_i, \hat{c}_{i'}) is the joint probability that the selected sample belongs to both c_i and \hat{c}_{i'}. This metric is normalized to the range [0, 1] as follows:

NMI(C, \hat{C}) = \frac{MI(C, \hat{C})}{\max\big(H(C), H(\hat{C})\big)},

where H(C) and H(\hat{C}) are the entropies of C and \hat{C}, respectively. In Table 2, the mean score of each method over five random runs, along with its variance, is reported; the symbol "-" means that there is no result due to the shortage of memory resources.
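The two metrics can be sketched as follows. For illustration, the label mapping is found by brute-force permutation search rather than the Kuhn-Munkres algorithm the paper uses (fine for small k), and the NMI normalization here divides by the larger entropy; reference [63] may define a different denominator (e.g. the geometric mean of the entropies).

```python
import math
from collections import Counter
from itertools import permutations

def accuracy(pred, truth):
    """Clustering accuracy: best hit rate over all label permutations.
    Brute force for illustration; Kuhn-Munkres scales to larger k."""
    best = 0
    for perm in permutations(sorted(set(truth))):
        mapping = dict(zip(sorted(set(pred)), perm))
        hits = sum(mapping.get(p) == t for p, t in zip(pred, truth))
        best = max(best, hits)
    return best / len(truth)

def nmi(pred, truth):
    """Mutual information normalized by max(H(C), H(C_hat)) into [0, 1]."""
    n = len(truth)
    joint = Counter(zip(pred, truth))
    p_pred, p_truth = Counter(pred), Counter(truth)
    mi = sum(c / n * math.log((c / n) / ((p_pred[a] / n) * (p_truth[b] / n)))
             for (a, b), c in joint.items())
    h = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    denom = max(h(p_pred), h(p_truth))
    return mi / denom if denom > 0 else 1.0

pred  = [0, 0, 0, 1, 1, 2, 2, 2]
truth = [1, 1, 1, 2, 2, 0, 0, 0]
print(accuracy(pred, truth))        # 1.0 (labels match up to permutation)
print(round(nmi(pred, truth), 4))   # 1.0
```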

D. PARAMETER SETTINGS
We select the optimal parameters for each method to achieve its best performance. The number of clusters k of all the compared methods is set to the number of segments available in the datasets. For ACA, the parameters n_Ma and n_Mi, which specify the maximum and minimum lengths of each divided subsequence, respectively, are data-dependent. Let n be the sequence length; we select n_Ma from a rounded set {0.01n, 0.02n, 0.04n, 0.06n, 0.08n, 0.1n} and set n_Mi = n_Ma / 2. For the temporal subspace clustering methods, including SSC and TSC, the most important parameter is the weight that controls the sequential regularization for the new representation Z. We select this parameter from the set {1, 5, 10, 15, 20, 25}; the other parameters are set according to the original papers. For the proposed methods, we fix the parameter that controls the steepness of the summation of sigmoid functions at the midpoints to α = 10. The tolerance for convergence verification in KCSR is fixed at 10^{-6}. For all the datasets, we use the radial basis function (RBF) kernel with a proper width σ for AKS, GKS and the proposed methods. The mini-batch size b of SKCSR and the rank r of the kernel-matrix approximation in AKS are kept equal; their values are selected from the set {64, 128, 256, 512, 1024, 2048}. Note that SKCSR terminates after processing T mini-batches. We set T such that T × b ≥ 50n (passing through the data sequence at least 50 times).
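To illustrate the role of the steepness parameter α, the summation of sigmoid functions that produces the continuous segment labels can be sketched as below. The midpoints β and the sequence length are hypothetical, chosen only for the example:

```python
import math

def sigmoid(x):
    # Numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def tau(j, beta, alpha=10.0):
    """Continuous segment label at position j: a sum of k-1 sigmoids,
    each rising from 0 to 1 at its midpoint beta_i. With alpha = 10 (the
    value fixed in the experiments) the transitions are sharp, so tau is
    close to the integer segment index away from the boundaries."""
    return sum(sigmoid(alpha * (j - b)) for b in beta)

beta = [25, 55, 80]                       # hypothetical midpoints: k = 4 segments, n = 100
labels = [tau(j, beta) for j in range(101)]
print(round(labels[0], 6), round(labels[40], 6), round(labels[100], 6))  # 0.0 1.0 3.0
```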
We can observe that each segment of the generated data sequence has a circular structure. Therefore, the nonlinear regularization in the sequential representation learning of SSC is ineffective on the synthetic data. TSC performed significantly better: its manifold-based regularization allows it to capture the nonlinear structure in the data. Our methods also perform segmentation based on regularization. However, different from SSC and TSC, where the regularization is only local, the summation of sigmoid functions of KCSR and SKCSR regularizes the whole data sequence globally, and locality is ensured by its smooth nature. Therefore, the proposed methods obtained the best performance on the synthetic dataset.

E. RESULTS DISCUSSION 1) Evaluation of KCSR and SKCSR
On the real-world data, including the Weizmann action videos, the MMI Facial smiling videos and the Google spoken digits audio, the proposed models also outperformed the baselines. Evaluation scores of the corresponding segmentation results are shown in the second, third and fourth rows of Table 2. We can observe that ACA also performed well on these datasets. Although ACA performs segmentation based on clustering as our methods do, it cannot guarantee to find exactly k non-overlapping segments. Therefore, its evaluation scores are slightly lower than those of the proposed models. In comparison with the heuristic approximations AKS and GKS, our models also performed better. Similar to AKS and GKS, our models search for segment boundaries: they approximate the boundaries by the midpoints β of the summation of sigmoid functions. However, different from these heuristic approximations, which search for the segment boundaries sequentially, the proposed models obtain all the β simultaneously via a gradient-based algorithm. As convergence of this optimization algorithm is theoretically proven, the obtained solutions come with a guarantee that the heuristic approximations lack. To qualitatively assess the performances of the compared methods, we also visualize the segmentation results on the Weizmann video and Google audio datasets in Figure 9 and Figure 10, respectively. These visualizations further validate the superior performance of our methods over the baselines. On these datasets, we also observe that the evaluation scores of SKCSR are greater than those of KCSR. Thus, we further investigate the convergence curves of these models. Figure 13 depicts those of SKCSR and KCSR on the Weizmann action videos and Google spoken digits audio, respectively. It is clear that the superior performance of SKCSR arises from the exploitation of the stochastic gradient descent (SGD) algorithm. SGD allows SKCSR to update its parameter γ more frequently thanks to fast estimation of the stochastic gradient. In addition, SGD takes the randomness of the data into account and enjoys a theoretical guarantee on convergence in an expectation sense [45]. Therefore, SKCSR is more robust to noise in the data and able to achieve better solutions than KCSR.
SKCSR also showed its superior efficiency over the original KCSR and most of the other baselines on the ordered MNIST and Acceleration data. Recall that the ordered MNIST data consists of 70K samples, and the Acceleration data contains even longer sequences, with an average length of 125K. This makes implementation of the memory-demanding methods impossible on regular personal PCs. Among the baselines, only AKS, with a memory complexity of order O(r²), where r ≪ n is the rank of the kernel-matrix approximation, can handle the ordered MNIST and Acceleration data. However, since AKS employs binary segmentation to detect the segment boundaries sequentially, its solutions have no optimality guarantee. The visualization of the segmentation results on the Acceleration data in Fig. 11 and the evaluation scores in the fifth and sixth rows of Table 2 validate the advantages of SKCSR.
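The storage figures quoted for the ordered MNIST and Acceleration data follow directly from the size of a dense double-precision kernel (or affinity) matrix:

```python
# An n-by-n kernel matrix of 8-byte doubles needs n * n * 8 bytes.
def kernel_matrix_gib(n, bytes_per_entry=8):
    return n * n * bytes_per_entry / 2 ** 30  # size in GiB

print(round(kernel_matrix_gib(70_000), 1))   # 36.5  (ordered MNIST, 70K samples)
print(round(kernel_matrix_gib(125_000), 1))  # 116.4 (average acceleration sequence, 125K)
```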

2) Evaluation of MKCSR
We next evaluate the performance of MKCSR, an extension of KCSR for handling multiple data sequences. We utilize concatenated Weizmann action videos and MMI Facial AU videos in this experiment. For the Weizmann data, the first, second and third subjects are selected and their corresponding action videos are concatenated to form a long sequence that consists of 30 segments, each of which belongs to one of the ten action categories. For the MMI Facial AU data, the videos of all the subjects are concatenated; the new video sequence consists of 491 frames and 15 segments. We compare MKCSR with the temporal clustering methods SSC, TSC and ACA. For all the compared methods, we set the number of clusters to k = 10 and k = 15 for the Weizmann and MMI Facial AU data, respectively, and select the other parameters following the same scheme as in subsection V-D. Fig. 12 visualizes the segmentation results on multiple video sequences from the Weizmann data, and Table 3 shows the evaluation scores on both the Weizmann and MMI Facial AU datasets. Simultaneous segmentation of multiple data sequences is a challenging task: compared with the segmentation results on a single sequence (the second and third rows of Table 2), the evaluation scores of SSC, TSC and ACA on multiple data sequences are significantly reduced. MKCSR, however, compared to its original method KCSR, preserves a great amount of segmentation accuracy, obtaining up to 0.8509 ACC and 0.8732 NMI on the Weizmann data.
For the MMI Facial AU data, MKCSR also achieved 0.9351 ACC and 0.9221 NMI. These results validate that MKCSR inherits the advantageous properties of SKCSR and performs efficiently and effectively on multiple data sequences.

VI. CONCLUSION
Approximating optimal segmentation with fast computation time and low memory requirements is increasingly important as more and more large sequential datasets become available. Previous approaches to approximating the optimal segmentation algorithm are either ineffective or inefficient because they still involve optimization over discrete variables. In this paper, we proposed KCSR to alleviate this issue. Our model combines a novel regularization based on the sigmoid function with the objective of balanced kernel k-means to approximate sequence segmentation. Its objective is differentiable almost everywhere; therefore, we can use a gradient-based algorithm to achieve the optimal segmentation. Note that our model updates all the parameters of interest in a unified manner. This is in contrast to existing approximation methods that update the segment boundaries sequentially, which provides no guarantee on the quality of the solutions. To further reduce the time and memory complexities, we introduced SKCSR, a stochastic variant of KCSR. SKCSR employs stochastic gradient descent, where the gradient is estimated from a randomly sampled subsequence, for updating the parameters of the model. Thus, it avoids storing a large affinity and/or kernel matrix, which is the critical issue that prevents existing methods from segmenting long data sequences. Finally, we modified the sigmoid-based regularization to develop MKCSR, an extended variant of KCSR for simultaneous segmentation of multiple data sequences. Through extensive experiments on various types of sequential data, the performances of all the proposed models were evaluated and compared with those of existing methods. The experimental results validate the claimed advantages of the proposed models.

APPENDIX A DERIVATION OF THE GRADIENT
In this section, we provide the derivation of the gradient w.r.t. γ.
Recall our objective function J. The gradient ∇_γ J = ∂J/∂γ is computed using the chain rule. We first compute the gradient of J w.r.t. G. Since each entry in the j-th column of G is a function of the continuous segment label τ_j, we next compute ∂G/∂τ_j. The segment label τ_j is in turn computed via a mixture of k − 1 sigmoid functions, each parameterized by β_i; thus we then compute ∂τ_j/∂β_i. Finally, the gradient of J w.r.t. β = [β_1, . . ., β_{k−1}] yields the gradient of J w.r.t. γ = [γ_1, . . ., γ_k].
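As a sanity check on the middle factor of this chain, the derivative of the continuous segment label τ_j with respect to the midpoints β can be verified against finite differences. The α, β and j values below are arbitrary illustrations, and the other factors of the chain (∂J/∂G and ∂β/∂γ) are omitted:

```python
import math

def sigmoid(x):
    # Numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def tau(j, beta, alpha):
    # Continuous segment label: sum of k-1 sigmoids with midpoints beta
    return sum(sigmoid(alpha * (j - b)) for b in beta)

def dtau_dbeta(j, beta, alpha):
    # d sigmoid(alpha*(j - b)) / d b = -alpha * s * (1 - s), with s the sigmoid value
    grads = []
    for b in beta:
        s = sigmoid(alpha * (j - b))
        grads.append(-alpha * s * (1.0 - s))
    return grads

# Central finite-difference check at one position
j, beta, alpha, eps = 4.3, [2.0, 4.5, 7.0], 2.0, 1e-6
analytic = dtau_dbeta(j, beta, alpha)
numeric = []
for i in range(len(beta)):
    bp, bm = list(beta), list(beta)
    bp[i] += eps
    bm[i] -= eps
    numeric.append((tau(j, bp, alpha) - tau(j, bm, alpha)) / (2 * eps))
```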

FIGURE 1. An example of sequence segmentation: (top) an example sequence of length 23 and (bottom) the corresponding indicator matrix with the number of segments k = 7.

FIGURE 2. Toy examples of (a) a clustering task and (b) a segmentation task, where the given data and the corresponding indicator matrices are depicted. Data samples from the same cluster or segment have identical symbol and color. Segmentation is different from clustering in that data samples of the same segment must be in a successive order.

FIGURE 3. Sigmoid function with different values of the parameter α.
TABLE 1. Time and space complexities of different segmentation methods. Here, n denotes the length of the data sequence and k is the number of segments. n_max denotes the maximum length of the divided subsequences in ACA, d is the dimension of the new representation Z in OSC and TSC, and t denotes the total number of iterations. The rank of the kernel-matrix approximation in AKS is denoted by r, and b is the mini-batch size in SKCSR. Note that b ≪ n.

FIGURE 5. Illustration of the cut-off summation of sigmoid functions. (a) A toy example of a concatenation of two sequences (m = 2, n_1 = 23, n_2 = 30) and its corresponding indicator matrix (k = 7). (b) The cut-off summation of sigmoid functions, whose two components are depicted in the first two subfigures, can smoothly approximate the indicator matrix in the toy example.

Data samples of the c-th segment from different sequences belong to the c-th class for 1 ≤ c ≤ k.

FIGURE 6. (a) Concatenated action videos of subject 1 in the Weizmann dataset and (b) the rearranged digit-image sequence from the MNIST dataset. Each data sequence consists of 10 non-overlapping segments; only one representative frame of each segment is depicted.

FIGURE 7. Videos of five subjects (S002, S003, S006, S014 and S017) performing action unit 12, which corresponds to a smile, taken from the MMI Facial action units dataset. The representative facial images of the segments are depicted. The bottom of each video shows the duration of the corresponding ground-truth frame-to-segment labels along with the total number of frames.

FIGURE 8. Synthetic experiment: (a) data generated in 2D space and (b) the data after contiguous rearrangement, with visualization of the segmentation results returned by all the compared methods. Different colors represent different clusters.

FIGURE 9.
Figure 8 visualizes the segmentation results on the synthetic data, and the evaluation scores are given in the first rows of Table 2.

FIGURE 10. From top to bottom: clean audio of the spoken digits "one" to "nine", the audio contaminated by white noise, log filter-bank energies of the clean audio used for manual annotation (blue lines depict ground-truth segment boundaries) and the Mel-frequency cepstrum of the noisy audio (vertical lines show the midpoints β of the summation of sigmoid functions returned by SKCSR).

FIGURE 11. From top to bottom: the acceleration signal of the first subject, the corresponding ground-truth segment labels and the segmentation results returned by AKS and SKCSR, respectively, on the Acceleration dataset.

FIGURE 12. Visualization of segmentation results of SSC, TSC, ACA and MKCSR on three concatenated action video sequences from Weizmann dataset.

FIGURE 13. Convergence curves of SKCSR (with stochastic gradients estimated from mini-batches of size b = 256) and KCSR (with gradients computed from the full batch, i.e., the whole data sequence) on (a) the Weizmann and (b) the Google spoken digits datasets.
TUNG DOAN received the B.S. degree in Computer Engineering from Hanoi University of Science and Technology in 2014. In 2021, he completed the Ph.D. course at the National Institute of Informatics, Japan. He is now a staff lecturer at the Department of Computer Engineering, School of Information and Communication Technology, Hanoi University of Science and Technology. His current research interests include deep learning, multi-view learning, generative models and sequential data.

ATSUHIRO TAKASU received his B.E., M.E., and Dr. Eng. degrees in 1984, 1986, and 1989, respectively, from the University of Tokyo, Japan. He is a professor at the National Institute of Informatics, Japan. His research interests are data engineering and data mining. He is a member of the ACM, IEEE, IEICE, IPSJ, and JSAI.

TABLE 2. Segmentation results on six datasets, including synthetic data, Weizmann action sequences, MMI Facial smiling video, noisy Google spoken digits, ordered MNIST data and Acceleration sequences, returned by different methods.