Deep Tensor Spectral Clustering Network via Ensemble of Multiple Affinity Tensors

Tensor spectral clustering (TSC) is an emerging approach that explores multi-wise similarities to boost learning. However, two key challenges have yet to be well addressed in the existing TSC methods: (1) the construction and storage of high-order affinity tensors to encode the multi-wise similarities are memory-intensive and hamper their applicability, and (2) they mostly employ a two-stage approach that integrates multiple affinity tensors of different orders to learn a consensus tensor spectral embedding, which often leads to a suboptimal clustering result. To this end, this paper proposes a tensor spectral clustering network (TSC-Net) to achieve one-stage learning of a consensus tensor spectral embedding while reducing the memory cost. TSC-Net employs a deep neural network that learns to map the input samples to the consensus tensor spectral embedding, guided by a TSC objective with multiple affinity tensors. It uses stochastic optimization to calculate only a small part of the affinity tensors at a time, thereby avoiding loading the whole affinity tensors for computation and significantly reducing the memory cost. Through the ensemble of multiple affinity tensors, TSC-Net dramatically improves clustering performance. Empirical studies on benchmark datasets demonstrate that TSC-Net outperforms recent baseline methods.

I. INTRODUCTION

Among various clustering techniques, spectral clustering (SC) [3], [4] is a common one due to its simplicity and graph-theoretic interpretation. Nevertheless, data in computer vision or bioinformatics are complex [5], [6]: they are often high-dimensional and contaminated by noise [7]. SC relies on an affinity matrix to characterize pairwise similarities, which falls short of providing satisfactory clustering performance for such complex data [8], [9]. Converging evidence [10], [11] suggests that dealing with high-dimensional and noisy data requires characterizing more complex similarities; tensor spectral clustering (TSC) [12] has therefore been developed recently as a promising solution. TSC employs high-order affinity tensors to characterize multi-wise similarities among samples instead of merely pairwise similarities as in previous methods. It has been shown that such high-order affinity tensors are robust against noise and capable of characterizing comprehensive spatial structure for high-dimensional data, thereby achieving better performance [13].
The seminal TSC method was proposed in [11], [14]; it constructs a third order affinity tensor to encode ternary similarities and utilizes the multilinear singular value decomposition (SVD) of the constructed affinity tensor to yield a tensor spectral embedding, which is an approximation of the cluster indicator matrix. Recently, another TSC method, IPS2 [13], has shown that multiple affinity tensors of different orders carry complementary information for clustering, and that integrating them can improve the accuracy and robustness of clustering. Specifically, IPS2 proposes an integrative method that combines a fourth order affinity tensor with a second order affinity tensor, i.e., an affinity matrix, to learn a consensus tensor spectral embedding, which has been shown to boost the clustering performance on several benchmark datasets. Despite this progress, two key challenges that hamper the applicability of TSC methods remain underexplored.
One crucial challenge is that TSC methods need to construct and store the whole high-order affinity tensor, which is memory-intensive. If one constructs a Kth order affinity tensor for m samples, the general memory cost is $O(m^K)$. Taking K = 3 and m = 1000 as an example, one can check that the memory cost is nearly 7.451 GB if the resulting affinity tensor is stored in double-precision floating points. Such a prohibitively high memory cost becomes an insurmountable roadblock and severely inhibits the applicability of TSC. To alleviate this issue, most TSC methods have adopted sampling techniques such as column sampling [10], Nyström approximation [11], or iterative sampling [12] to construct a sparse affinity tensor in which most of the elements are zero. However, sampling techniques inevitably incur information loss in the affinity tensor and may result in performance degradation.
The other challenge is that the existing TSC methods mostly employ a two-stage approach to integrate multiple affinity tensors into a consensus tensor spectral embedding, which often leads to a suboptimal clustering result. For example, the representative IPS2 [13] adopts a two-stage approach, which first solves for the tensor spectral embeddings of a fourth order affinity tensor and a second order affinity tensor independently and then yields a consensus one by simply averaging them. Though demonstrating promising clustering performance on several benchmark datasets, this two-stage approach disconnects the multiple affinity tensors in the process of reaching the consensus tensor spectral embedding and may lead to unsatisfactory performance.
Recently, deep neural networks (DNNs) have become a popular technique to learn underlying nonlinear mappings [15] in machine learning and computer vision. Since TSC methods can be regarded as a nonlinear mapping from an original sample space to an embedding space, a modern DNN with nonlinear activation functions is able to approximate such a nonlinear mapping [15], [16], [17]. As such, a DNN can be applied to learn the mapping from the input samples to the tensor spectral embedding. This paper therefore aims to develop a DNN-based TSC method that reduces the memory cost while providing a joint framework to integrate multiple affinity tensors of different orders.
Accordingly, a tensor spectral clustering network, abbreviated as TSC-Net, is proposed. TSC-Net is designed to learn a consensus tensor spectral embedding that integrates multiple affinity tensors within a stacked neural network. We instantiate the integrated method with the second, third, and fourth order affinity tensors. Stochastic gradient descent is applied to optimize TSC-Net. The stochastic optimization allows us to calculate only a small part of the affinity tensors at each step while circumventing storage of the whole affinity tensors, thus decreasing the memory cost by a significant amount. For example, if the mini-batch size of the stochastic gradient descent is $m_b$, the memory cost to calculate a subset of the Kth order affinity tensor for a mini-batch is $O(m_b^K)$. Considering the same case above with K = 3 and choosing mini-batch size $m_b = 128$, the memory cost is about 0.016 GB if the affinity tensor of the mini-batch is stored in double-precision floating points, which is only 0.214% of the memory cost of storing the whole third order affinity tensor.
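As a rough sanity check on these figures, the following minimal Python sketch (illustrative only, not part of the method) reproduces the dense-storage estimates above, assuming eight bytes per double-precision element.

```python
def affinity_tensor_bytes(num_samples: int, order: int, bytes_per_element: int = 8) -> int:
    """Memory needed to store a dense order-K affinity tensor over `num_samples` points."""
    return bytes_per_element * num_samples ** order

full = affinity_tensor_bytes(1000, 3)   # whole third order tensor, m = 1000
batch = affinity_tensor_bytes(128, 3)   # mini-batch tensor, m_b = 128
print(f"full tensor : {full / 1024**3:.3f} GB")   # ~7.451 GB
print(f"mini-batch  : {batch / 1024**3:.3f} GB")  # ~0.016 GB
```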
In training, TSC-Net, consisting of an embedding network and an orthogonalization layer, learns the mapping from the input samples to the tensor spectral embedding via the standard TSC objective proposed in [10], [11], [12]. In testing, the input samples are propagated through TSC-Net to obtain the tensor spectral embedding, followed by the k-means algorithm to yield the cluster labels. The network structure is illustrated in Fig. 1. Empirical evaluations on benchmark datasets demonstrate the effectiveness of the proposed method in comparison with recent baselines.
The main contributions of this paper are as follows.
1) A tensor spectral clustering network, TSC-Net, is proposed to learn to map the input samples to the tensor spectral embedding via stochastic optimization, thereby reducing the memory cost without sacrificing the clustering performance.
2) The proposed TSC-Net can seamlessly integrate multiple affinity tensors in a joint framework. We instantiate TSC-Net to integrate the second, third, and fourth order affinity tensors and demonstrate improved clustering performance over nine recent baselines.
3) Extensive experiments on benchmark datasets have been conducted to verify the effectiveness of the proposed method against current competitors in terms of clustering performance and memory cost.
The remainder of this article is structured as follows. We first give an overview of the related work in Section II. Next, Section III details the proposed TSC-Net. In Section IV, we conduct experiments to demonstrate the effectiveness of the proposed method. Finally, we draw a conclusion in Section V.

II. RELATED WORK

A. Tensor Spectral Clustering
Suppose we have m samples with n features, denoted by a matrix $X = [x_1; x_2; \ldots; x_m] \in \mathbb{R}^{m \times n}$. The aim of clustering is to assign these samples to c disjoint clusters.
Spectral Clustering: The classic spectral clustering [3] starts by computing an affinity matrix, or a second order affinity tensor, $T^{(2)} \in \mathbb{R}^{m \times m}$, where every element $T^{(2)}_{ij}$ represents the pairwise similarity between the corresponding two samples $x_i$ and $x_j$. Subsequently, one can obtain the embedding $Y \in \mathbb{R}^{m \times c}$ by the trace minimization problem
$\min_{Y^\top Y = I_c} \operatorname{tr}\big(Y^\top (I_m - \bar{T}^{(2)}) Y\big)$,
where $\bar{T}^{(2)} = D^{-1/2} T^{(2)} D^{-1/2}$ is the normalized affinity matrix and $D$, with $D_{ii} = \sum_{j} T^{(2)}_{ij}$, is the corresponding diagonal degree matrix.
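For concreteness, this trace problem is solved by the top-$c$ eigenvectors of the normalized affinity matrix. The following NumPy sketch (an illustration of classic SC, not the paper's released code) computes this embedding; a subsequent k-means on the rows of the returned matrix yields the cluster labels.

```python
import numpy as np

def spectral_embedding(T2: np.ndarray, c: int) -> np.ndarray:
    """Classic SC embedding: top-c eigenvectors of D^{-1/2} T2 D^{-1/2},
    which solve the trace minimization above with Y^T Y = I_c."""
    d = T2.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard against isolated points
    T2_norm = T2 * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(T2_norm)                # eigenvalues in ascending order
    return eigvecs[:, -c:]                              # columns for the c largest eigenvalues
```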
The performance of spectral clustering hinges on the pairwise similarities in the affinity matrix. However, pairwise similarities are easily corrupted by noise contamination [18] or the concentration effect [19] in high-dimensional data. To address this issue, recent works on tensor spectral clustering [13] attempt to use high-order affinities among more than two samples to compensate for the deficiency of pairwise similarities.
Tensor Spectral Clustering: The basic goal of tensor spectral clustering is to determine to which clusters the samples belong based on the multi-wise similarities encoded in an affinity tensor. Characterizing complex multi-wise similarities in an affinity tensor has recently been shown to achieve better performance than pairwise-similarity-based approaches [20], [21], [22], [23], [24], [25], [26], [27]. Specifically, several works applied a combination of Euclidean distances among samples to construct the multi-wise similarities encoded in the affinity tensor and then employed high-order SVD or tensor trace norm maximization to derive the tensor spectral embedding matrix for clustering. For instance, Ghoshdastidar et al. [11] proposed a multilinear SVD method to decompose the affinity tensor and showed that this decomposition amounts to clustering samples by maximizing the squared associativity of the partition. Ghoshdastidar et al. [10] applied trace optimization on the affinity tensor and developed a tensor sampling strategy [12] to save the computational cost.
Specifically, the K-wise similarities among m samples can be characterized by a Kth order affinity tensor $T^{(K)}$, in which each element represents the similarity among the K corresponding samples. For instance, one can use a scaled Gaussian distance to compute the second order affinity tensor, i.e., the affinity matrix, defined by $T^{(2)}_{ij} = \exp\big(-d_{ij}^2 / (2\sigma^2)\big)$. For the third order tensor affinity $T^{(3)}$, one can use an anchor-based distance.
Here, the metric $d_{ij}$ denotes the distance between samples $x_i$ and $x_j$, and the scale σ is set as the median distance from a point to its third neighbor [15]. The fourth order affinity tensor $T^{(4)}$ is defined as in [13], which uses ratio-based pair-to-pair similarities: the Fisher-ratio-like fourth order tensor affinity among four samples is defined in terms of ratios of the pairwise distances $d_{ij}$.
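To make the second order construction concrete, the sketch below builds a Gaussian affinity matrix for a mini-batch, assuming a Euclidean metric and the median distance to the third neighbor as the scale σ; the third and fourth order affinities follow [13] and are omitted here. This is an illustrative sketch, not the released implementation.

```python
import numpy as np

def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """Euclidean distances d_ij = ||x_i - x_j|| for a mini-batch X of shape (m_b, n)."""
    sq = np.sum(X**2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * X @ X.T, 0.0)
    return np.sqrt(d2)

def gaussian_affinity(X: np.ndarray) -> np.ndarray:
    """Second order affinity T^(2)_ij = exp(-d_ij^2 / (2 sigma^2)),
    with sigma the median distance from a point to its third neighbor."""
    d = pairwise_distances(X)
    # column 0 of the sorted rows is the point itself, so column 3 is the third neighbor
    third_nn = np.sort(d, axis=1)[:, 3]
    sigma = np.median(third_nn)
    return np.exp(-d**2 / (2.0 * sigma**2))

# usage on a random mini-batch of 128 samples with 64 features
T2 = gaussian_affinity(np.random.randn(128, 64))
```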
The tensor spectral clustering methods seek a consensus low-dimensional embedding $Y$ through minimizing a total associativity objective [10] defined between the embedding $Y$ and the ensemble of the second, third, and fourth order affinity tensors, where α and β are two hyperparameters balancing the affinity tensors of different orders and c stands for the number of clusters.
The operator $\times_k$ between a tensor and a matrix is called the mode-k product and is defined as follows.

Definition 2.1. Let $\mathcal{A} \in \mathbb{R}^{m_1 \times m_2 \times \cdots \times m_K}$ be a Kth order tensor and $U \in \mathbb{R}^{q \times m_k}$ be a matrix. The mode-k product of $\mathcal{A}$ and $U$ is a Kth order tensor $\mathcal{A} \times_k U \in \mathbb{R}^{m_1 \times \cdots \times m_{k-1} \times q \times m_{k+1} \times \cdots \times m_K}$ whose entries are $(\mathcal{A} \times_k U)_{i_1 \cdots i_{k-1}\, j\, i_{k+1} \cdots i_K} = \sum_{i_k=1}^{m_k} \mathcal{A}_{i_1 i_2 \cdots i_K} U_{j i_k}$.

Similar to the conventional spectral embedding, each row of the consensus embedding $Y$ represents a sample in the embedding space, and the final cluster labels can be derived by performing a subsequent k-means algorithm on $Y$.
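For reference, the mode-k product in Definition 2.1 is a tensor contraction along mode k; the following NumPy sketch (illustrative only) implements it directly.

```python
import numpy as np

def mode_k_product(A: np.ndarray, U: np.ndarray, k: int) -> np.ndarray:
    """Mode-k product A x_k U: contracts mode k of A (size m_k) with the columns of U (q x m_k).
    `k` is 1-based, matching Definition 2.1."""
    A_k = np.moveaxis(A, k - 1, 0)                 # bring mode k to the front: (m_k, ...)
    out = np.tensordot(U, A_k, axes=([1], [0]))    # contract: (q, ...)
    return np.moveaxis(out, 0, k - 1)              # move the new mode back to position k

# example: contract a third order affinity tensor with an embedding matrix along mode 1
A = np.random.rand(5, 5, 5)
U = np.random.rand(3, 5)
print(mode_k_product(A, U, k=1).shape)  # (3, 5, 5)
```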
Tensor spectral clustering methods have achieved superior performance when dealing with high-dimension, low-sample-size (HDLSS) data [28], [29], [30]. However, they share a common limitation: they need to construct and store the whole affinity tensor before deriving the tensor spectral embedding. The expensive memory cost of storing the whole affinity tensor makes the existing TSC methods less applicable to real-world tasks.

B. Deep Clustering
Deep clustering [31], [32] aims to integrate clustering and deep neural networks into a unified framework. Most deep clustering methods incorporate a deep auto-encoder to extract features from complicated high-dimensional data to facilitate clustering, where clustering and deep feature representation mutually boost each other. For example, DEC [33] attempts to learn cluster centers and deep discriminative features simultaneously, supervised by a clustering objective. In [34], deep self-evolution clustering (DSEC) has been proposed to train the network alternately with chosen pairs of samples. In [35], DCCM has been proposed to use pseudo-labels generated by a self-supervision scheme to guide clustering and to leverage mutual information to learn discriminative representations. Partition confidence maximization (PICA) [36] has been proposed to minimize a cluster partition uncertainty index, thereby learning the most confident clustering assignment. Recently, a group of deep clustering methods introduced self-expression to learn an affinity matrix with deep auto-encoders. One representative work, the deep subspace clustering network (DSC-Net) [37], incorporates a self-expression module into a deep auto-encoder. The above deep clustering methods seek to mine discriminative features by exploiting deep neural networks. Alternatively, several recent works leverage deep neural networks to reduce the intensive computational cost of traditional clustering methods such as spectral clustering. For instance, SpectralNet [15] learns a nonlinear mapping that embeds input data points into the spectral embedding and demonstrates an effective memory cost reduction. SCANDLE [31] utilizes an adaptive-neighbors technique to achieve spectral embedding with adaptively estimated affinities.

III. METHOD
This section details the proposed tensor spectral clustering network (TSC-Net). For ease of presentation, the main notations used in this paper are summarized in Table I. The network structure and objective function of TSC-Net are detailed in Section III-A, and an alternating training algorithm to optimize TSC-Net is presented in Section III-B. Next, the advantage of TSC-Net over the existing TSC methods in terms of memory cost is discussed in Section III-C.

A. Tensor Spectral Clustering Network
The major problem of existing TSC methods is that they need to construct and store the whole Kth order affinity tensor before conducting tensor spectral embedding, which is memory-intensive and thus limits their applicability. To address this memory-intensive issue, we propose a tensor spectral clustering network (TSC-Net) that maps the input samples $X$ to the tensor spectral embedding $Y$ and is trained with stochastic optimization. The intuition behind TSC-Net is that the tensor spectral embedding $Y$ can be regarded as a nonlinear mapping from the original sample space to the embedding space, and a modern neural network with nonlinear activation functions is able to approximate such a nonlinear mapping [15], [16], [17]. Accordingly, the proposed TSC-Net can approximate the tensor spectral embedding and can be trained in a stochastic manner to avoid storing the whole Kth order affinity tensor. As a result, TSC-Net enables a significant reduction of memory cost.
Formally, TSC-Net is defined as a neural network $F: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times c}$ that maps the m input samples to their corresponding tensor spectral embedding, with the learnable network parameters introduced next. More specifically, the neural network maps the data matrix $X$ to the tensor spectral embedding as $Y = F(X)$, where the ith row of $Y$ denotes the tensor spectral embedding of the sample $x_i$ in the data matrix.
Network Architecture: To achieve such a mapping while respecting the orthogonal constraint on $Y$, the neural network $F$ is designed as follows. As shown in Fig. 1, $F$ has the collection of learnable parameters $\{\{W_q\}_{q=1}^{Q}, W_{\mathrm{ort}}\}$ and can be separated into two major parts: 1) an embedding network that maps the data matrix $X$ to the intermediate representation $\tilde{Y}$; more specifically, the embedding network with $Q$ layers performs the layer-wise mapping $H_q = g(H_{q-1} W_q)$ for $q \in [Q]$, where $H_0$ corresponds to the input data matrix $X$, $H_Q$ corresponds to the intermediate representation $\tilde{Y}$, $W_q$ denotes the weight matrix of the qth fully connected layer, and $g$ denotes the ReLU activation function; and 2) an orthogonalization layer with weight matrix $W_{\mathrm{ort}} \in \mathbb{R}^{c \times c}$ that maps the intermediate representation $\tilde{Y}$ to $Y = \tilde{Y} W_{\mathrm{ort}}$ so as to satisfy the orthogonal constraint $Y^\top Y = I_c$. The elaboration of the orthogonalization layer is presented in Section III-B. TSC-Net achieves the tensor spectral embedding by minimizing the objective in (4). Different from the original TSC model, which is solved by memory-intensive tensor decomposition [10], [11], [12], the proposed TSC-Net can be trained with stochastic optimization under (4) with a much smaller memory requirement, as concretely illustrated in Section III-B.
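As a concrete, purely illustrative sketch of the embedding network's forward pass, the code below uses the layer widths reported later in the implementation details (Section IV-A: 1024, 1024, 512, and c units) with ReLU hidden activations and a Tanh output; biases are omitted to match the layer-wise mapping $H_q = g(H_{q-1} W_q)$, and the orthogonalization layer of Section III-B is applied afterwards.

```python
import numpy as np

def init_embedding_net(n_features: int, c: int, widths=(1024, 1024, 512), seed: int = 0):
    """Initialize the fully connected embedding network weights {W_q}."""
    rng = np.random.default_rng(seed)
    dims = (n_features, *widths, c)
    return [rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def embedding_forward(X: np.ndarray, weights) -> np.ndarray:
    """Map a mini-batch X to the intermediate representation (before orthogonalization)."""
    H = X
    for W in weights[:-1]:
        H = np.maximum(H @ W, 0.0)      # ReLU hidden layers
    return np.tanh(H @ weights[-1])     # c-dimensional output with Tanh

# usage: forward a random mini-batch; the orthogonalization layer follows in Section III-B
weights = init_embedding_net(n_features=64, c=10)
Y_tilde = embedding_forward(np.random.randn(128, 64), weights)
print(Y_tilde.shape)  # (128, 10)
```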

B. Alternating Stochastic Optimization Algorithm
One major advantage of the proposed TSC-Net is that it allows us to optimize the tensor spectral embedding in a stochastic manner and merely needs to construct a small part of the affinity tensors, thus reducing the memory cost. To optimize TSC-Net under (4), we propose an alternating stochastic optimization algorithm, which is an iterative scheme alternating between an embedding stage and an orthogonalization stage. More concretely, suppose the algorithm runs Ω epochs. In each epoch, we run $T = m/m_b$ iterations to ensure walking through the entire data matrix, where m and $m_b$ denote the whole data size and the mini-batch size, respectively, and T is assumed to be an integer for simplicity. In each iteration, $m_b$ samples are randomly sampled. In what follows, suppose we are in an arbitrary epoch ω ∈ [Ω], and we omit the notation for ω for simplicity. For an iteration t ∈ [T], with the index set $I_t \subseteq \{1, 2, \ldots, m\}$ of the randomly sampled samples of size $|I_t| = m_b$ and the mini-batch $X_{I_t}$, the algorithm alternates between the following stages.

Orthogonalization stage: 1) Forward pass to obtain the intermediate representation $\tilde{Y}^{(t)}_{I_t}$; 2) update the orthogonalization layer as $W^{(t)}_{\mathrm{ort}} = \big((L^{(t)})^{-1}\big)^{\top}$ (5), where $L^{(t)} \in \mathbb{R}^{c \times c}$ is the lower triangular matrix derived by the Cholesky decomposition $\tilde{Y}^{(t)\top}_{I_t} \tilde{Y}^{(t)}_{I_t} = L^{(t)} L^{(t)\top}$. As such, the mini-batch output of TSC-Net, $Y^{(t)}_{\mathrm{ort}}$, meets the orthogonal constraint, as verified by Theorem 3.1.
In practice, the full rankness of $\tilde{Y}^\top \tilde{Y}$ can be ensured by adding a sufficiently small number, e.g., $10^{-5}$, to its diagonal elements. It is noteworthy that the orthogonal constraint is enforced on each mini-batch of samples to prevent trivial solutions; an orthogonal constraint across mini-batches during stochastic optimization is not required (a minimal numerical sketch of this orthogonalization step is given after Algorithm 1).

3) Continue the forward pass to obtain the mini-batch output $Y^{(t)}_{\mathrm{ort}} = \tilde{Y}^{(t)}_{I_t} W^{(t)}_{\mathrm{ort}}$.

Embedding stage: By fixing $W^{(t)}_{\mathrm{ort}}$, one then uses stochastic gradient descent to update the embedding network parameters $\{W_q\}_{q=1}^{Q}$ via the gradient update in (6), where η denotes the learning rate; the detailed derivation is provided in Appendix A, available online (https://github.com/Huyu2Jason/Publication_Appendix/blob/main/TPAMI-2023-01-0162-appendix.pdf). One can check that only $m_b$ samples are required in such an update, with $O(m_b^2 + m_b^3 + m_b^4)$ memory cost for the affinity tensors. When finishing an epoch, we reshuffle the data matrix before continuing to the next epoch. The convergence criterion is that either the maximum epoch Ω = 300 is reached or the relative change of the objective (4) between successive epochs falls below a predefined threshold of $10^{-3}$. Once TSC-Net is trained, all the parameters $\{\{W_q\}_{q=1}^{Q}, W_{\mathrm{ort}}\}$ are frozen. In testing, all the samples are propagated through TSC-Net to obtain the tensor spectral embedding $Y$, and the k-means algorithm is employed on it to obtain the cluster labels. These algorithmic procedures are summarized in Algorithm 1.

Algorithm 1: Training and Testing of TSC-Net.
Input: Data matrix $X = [x_1; x_2; \ldots; x_m]$; mini-batch size $m_b$; hyperparameters α and β.
Output: Trained TSC-Net $F$; clustering labels.
Training:
1: Randomly initialize the collection of network parameters $\{\{W_q\}_{q=1}^{Q}, W_{\mathrm{ort}}\}$;
2: while not reaching the convergence criterion do
3:   Randomly sample a mini-batch of $m_b$ samples;
4:   Construct the affinity tensors $T^{(2)}$, $T^{(3)}$, and $T^{(4)}$ for the $m_b$ samples;
     Orthogonalization stage:
5:   Update $W_{\mathrm{ort}}$ by (5);
     Embedding stage:
6:   Update $\{W_q\}_{q=1}^{Q}$ by (6);
7: end while
Testing:
8: Forward pass all the samples through $F$ to obtain the tensor spectral embedding $Y$;
9: Perform k-means on $Y$ to obtain the clustering labels.
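For illustration, the orthogonalization update in step 5 (i.e., Theorem 3.1 with the small diagonal ridge mentioned above) can be sketched in NumPy as follows; this is a minimal numerical check, not the TensorFlow implementation used in the paper.

```python
import numpy as np

def orthogonalization_weights(Y_tilde: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Return W_ort = (L^{-1})^T, where Y_tilde^T Y_tilde = L L^T (Cholesky),
    so that (Y_tilde @ W_ort)^T (Y_tilde @ W_ort) is approximately I_c."""
    c = Y_tilde.shape[1]
    gram = Y_tilde.T @ Y_tilde + eps * np.eye(c)   # small ridge ensures full rank
    L = np.linalg.cholesky(gram)                   # lower triangular factor
    return np.linalg.inv(L).T

# quick check on a random mini-batch representation
Y_tilde = np.random.randn(128, 10)
Y = Y_tilde @ orthogonalization_weights(Y_tilde)
print(np.allclose(Y.T @ Y, np.eye(10), atol=1e-6))  # True (up to the small ridge)
```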

C. Memory Cost Comparison of TSC-Net and Existing Methods
This subsection centers on how much the proposed TSC-Net can reduce the memory cost in comparison with the existing TSC methods. The stochastic optimization introduced in the previous section allows TSC-Net to calculate only a small part of the affinity tensors at each step and avoid constructing the whole affinity tensors, whereas most of the existing TSC methods use tensor decomposition, e.g., [10], [11], [12], and require loading the whole affinity tensor. Hence, in contrast to conventional TSC, the proposed TSC-Net enables a reduction of memory cost.
It is difficult to conduct an exact memory cost comparison between TSC-Net and the existing TSC methods since TSC-Net involves not only constructing the affinity tensors but also training the neural network parameters. Nevertheless, it is informative to compare IPS2 [13] and TSC-Net in terms of the memory cost of the affinity tensors on m samples. IPS2 needs to store a second order affinity tensor and a fourth order affinity tensor with $O(m^2 + m^4)$ memory cost. Assuming the batch size is $m_b$, TSC-Net only needs $O(m_b^2 + m_b^3 + m_b^4)$ memory for the affinity tensors at each step owing to stochastic gradient descent. Taking m = 1000 and $m_b = 128$ as an example, the affinity tensor costs of IPS2 and TSC-Net are 931.320 GB and 0.252 GB, respectively, if the corresponding affinity tensors are uniformly stored in double-precision floating points. Additionally, according to the recent literature [38], the stochastic optimization of the neural network parameters usually costs less than 1 GB of memory, so the total memory cost of TSC-Net is less than 0.252 GB (affinity tensors) + 1 GB (training the neural network parameters) = 1.252 GB. In a nutshell, TSC-Net is much more memory-efficient than IPS2 as the sample size increases. In Section IV, we experimentally demonstrate the memory cost of TSC-Net on several benchmark datasets with sample sizes ranging from 50 to 70,000.

IV. EXPERIMENTS AND RESULTS
In this section, the proposed TSC-Net is evaluated on benchmark datasets in comparison with several baseline methods. Specifically, TSC-Net is assessed in terms of memory cost and performance improvement, especially against state-of-the-art TSC methods. Afterward, ablation studies are conducted to verify the effectiveness of the ensemble of multiple affinity tensors. Finally, the hyperparameter analysis and convergence behavior analysis are presented. Due to space limitations, the noise-robustness evaluation of TSC-Net-234 and its competitors is provided in Appendix B, available online.

A. Experiment Setup
Software Environment: All experiments were performed on a desktop computer with a 3.70 GHz Intel(R) Core(TM) i7-8700K CPU, 64.0 GB of RAM, and a GTX 1080 Ti GPU (10 GB). The code is based on Python 3.6 and TensorFlow 1.15.
Evaluation Metrics: Following the convention in the clustering literature [3], [11], we employed three metrics for performance comparison throughout all experiments: accuracy (ACC), normalized mutual information (NMI), and Purity. For all three metrics, higher values indicate better clustering performance.
Benchmark Datasets: Six benchmark datasets were adopted to evaluate the performance of the proposed and compared methods, and the corresponding statistics are shown in Table II. In particular, Synthetic-Data, adopted from [13], is composed of 50 samples from three clusters, drawn from i.i.d. normal distributions with an equal standard deviation of 0.5 and different means of 1, 2.5, and 5, with each sample having 500 features. Lung [39] and GLI-85 [40] are two typical high-dimensional bioinformatics datasets. MNIST [41] and USPS are two typical image datasets, while Reuters [42] is a popular document dataset.
Implementation Details: Regarding the architecture of TSC-Net, fully connected layers similar to [15] were employed consistently. Specifically, the embedding network consists of two fully connected layers with 1024 neurons each followed by the ReLU activation function, a third fully connected layer with 512 neurons followed by the ReLU activation function, and a last layer with c neurons followed by the Tanh activation function. By contrast, the orthogonalization layer is directly derived via Theorem 3.1. TSC-Net is trained with the Adam optimizer with a learning rate of 0.001 and a maximum of Ω = 300 epochs, where the mini-batch size is set to 128 if the sample size m is larger than 128 and to m/5 otherwise. A grid search was used to find the task-related hyperparameters α and β, both from the set {0.001, 0.01, 0.1, 1, 10, 100}. The sensitivity analysis is shown in Section IV-E. Regarding the compared methods, we adopted their default settings as stated in the original articles.
In addition, the above datasets were all pre-processed by an auto-encoder, in line with the experimental settings in [15], [31]. Specifically, the auto-encoder from [43] was employed to extract a deep feature representation of each dataset. Subsequently, all experiments were conducted on the deep feature representations instead of the original features. For a fair comparison, we ran each method 50 times on each dataset and report the mean values of the corresponding metrics.

B. Clustering Performance Comparison on Benchmark Datasets

1) Comparison With Existing TSC Methods: Table III presents the comparison of the proposed TSC-Net with state-of-the-art TSC methods, including IPS2. TSC-Net consistently outperforms all the TSC methods on all metrics on the benchmark datasets on which those methods are able to run. Specifically, TSC-Net is superior to IPS2 by 3.7%, 13.4%, and 16.0% in terms of ACC on Synthetic-Data, Lung, and GLI-85, respectively. IPS2 is closely related to TSC-Net since it also leverages the second and fourth order affinity tensors to obtain a consensus tensor spectral embedding. However, it performs the affinity ensemble in a two-stage manner. By contrast, TSC-Net integrates multiple affinity tensors in a one-stage manner, which enhances the quality of the learned embedding and thus increases the clustering performance. Fig. 3(a)-(c) provides a visual comparison of the t-SNE results on the learned spectral embedding of each method. In those figures, TSC-Net obtains more discriminative cluster boundaries than IPS2, which is consistent with its better clustering performance.
2) Comparison With Deep Clustering Methods: Table III also presents the comparison of the proposed TSC-Net with recent deep clustering methods, including SpectralNet, SCANDLE, and DEC. TSC-Net consistently outperforms all these deep clustering methods in all metrics on the benchmark datasets. In particular, TSC-Net improves over SpectralNet by 19.4%, 12.2%, 10.7%, 1.4%, 2.2%, and 3.9% in terms of ACC on Synthetic-Data, Lung, GLI-85, MNIST, Reuters, and USPS, respectively. SpectralNet is the most relevant to TSC-Net since it leverages the second order affinity tensor (i.e., the affinity matrix) to obtain the spectral embedding with a deep neural network. By contrast, TSC-Net leverages the ensemble of the second, third, and fourth order affinity tensors to achieve better clustering performance. The performance gap between TSC-Net and SpectralNet illustrates that different affinity tensors can complement each other in improving clustering performance.

C. Memory Cost Comparison on Benchmark Datasets
In this subsection, the memory cost comparison between TSC-Net and the existing TSC methods is demonstrated on the benchmark datasets. The memory costs of the existing TSC methods, SC, and k-means were evaluated with Matlab 2019a, since their public codes are all based on Matlab. In contrast, the memory costs of TSC-Net and the deep clustering methods were evaluated with the nvidia-smi tool, as they run on an Nvidia GPU (GTX 1080 Ti).
Apart from the clustering performance shown in Table III, Fig. 2(a)-(f) demonstrates the clustering performance along with the memory cost of each method. Combining these results, one can notice that TSC-Net achieves better clustering performance while its memory cost is relatively low in comparison with other TSC methods such as IPS2. Specifically, on Lung, whose sample size is 100, the memory cost of IPS2 is 2.2578 GB, whereas that of TSC-Net is 0.1352 GB. Such a gap stems from the fact that TSC-Net adopts batch-wise stochastic gradient optimization without having to load the whole affinity tensors, which saves memory significantly. The results are in accordance with the memory cost analysis in Section III-C, indicating that our method achieves better clustering performance while reducing the memory cost.
Notably, IPS2, TMM, and SC-MSVD fail to run on large-scale datasets like MNIST (70,000 samples), Reuters (10,000 samples), and USPS (9,298 samples) because the memory cost of constructing the affinity tensors is prohibitively high. In comparison, TSC-Net adopts batch-wise stochastic gradients and only needs to construct and store a small part of the affinity tensors determined by the batch size, which translates into a relatively low memory cost on large-scale datasets.

D. Ablation Study on Ensemble of Multiple Affinity Tensors
Table IV presents the ablation study on the ensemble of multiple affinity tensors to verify the contribution of each affinity tensor to clustering. Specifically, different combinations of affinity tensors are considered, where √ means incorporated and × means removed.
From Table IV, one can notice that all the incorporated affinity tensors, i.e., the second, third, and fourth order ones, contribute to the clustering improvement. In particular, the ensemble of all three affinity tensors achieves the best clustering performance compared with the other combinations that incorporate only a subset of the three. The performance improvements of TSC-Net over the second-best competitor are 1.4%, 2.5%, 2.7%, 0.9%, 0.7%, and 2.2% in terms of ACC on Synthetic-Data, Lung, GLI-85, MNIST, Reuters, and USPS, respectively. Also, one can find that the clustering performance of different affinity tensors is task-dependent. For instance, solely using the fourth order affinity tensor achieves better clustering performance than solely using the third or second order affinity tensor on Synthetic-Data, Lung, and GLI-85, while it achieves worse performance than solely using the third order affinity tensor on MNIST and Reuters. This phenomenon indicates that different affinity tensors contribute unequally to clustering performance, depending on the specific task.

E. Hyperparameter Analysis
TSC-Net has two hyperparameters, α and β, whose numerical values determine the weight contributions of the corresponding affinity tensors. The previous subsection illustrates that the combination of affinity tensors is task-dependent, and thus we use a grid search to find the task-optimal hyperparameters α and β, both from the set {0.001, 0.01, 0.1, 1, 10, 100}. The hyperparameter sensitivity in terms of ACC on Synthetic-Data, Lung, GLI-85, MNIST, Reuters, and USPS is shown in Fig. 4. From the figures, we find that TSC-Net delivers relatively stable ACC performance under a wide range of hyperparameters across all six datasets.

F. Convergence and Running Time of TSC-Net
In this subsection, the convergence and running time of TSC-Net are investigated. We employed TSC-Net on the benchmark datasets, including Synthetic-Data, Lung, GLI-85, MNIST, Reuters, and USPS. We report the objective function value of (2) with respect to the number of epochs in Fig. 5 and the running time in seconds in Table V. Overall, the objective function values under the proposed alternating stochastic optimization monotonically decrease with the epochs on all the benchmark datasets: the values drop remarkably during the first several epochs and then keep decreasing gradually until roughly the 250th epoch. In summary, TSC-Net converges quickly.

V. CONCLUSION
In this paper, we have proposed a tensor spectral clustering network that considerably reduces the memory cost and integrates multiple affinity tensors in a memory-efficient manner. Unlike existing methods, which uniformly need to load the whole affinity tensors, our method maps the input samples into the tensor spectral embedding with a neural network and allows for batch-wise affinity tensor construction, which enables a reduction of the memory cost. More critically, compared with the previous method using a two-stage integration, our proposed method seamlessly ensembles multiple affinity tensors in a one-stage manner to improve the clustering performance while keeping a low memory cost. Experimental results have demonstrated that our method achieves considerable performance improvement while enjoying a lower memory cost on benchmark datasets.
Overall, the proposed method has demonstrated its effectiveness for high-dimensional data clustering. There are two potential directions for improvement. 1. Redundancy in Multiple Affinity Tensors: While the proposed method is capable of jointly ensembling multiple affinity tensors, one potential limitation lies in the redundancy that might arise in this ensemble process. The inclusion of multiple affinity tensors for similarity estimation may result in computational redundancy and increased computational cost. Specifically, the question of whether the additional complexity of incorporating multiple affinity tensors consistently improves the clustering performance needs to be addressed. Future research could explore strategies to efficiently select or weight the most informative affinity tensors to mitigate redundancy while preserving the benefits of ensemble learning.
2. Absence of Mutual Improvement between Clustering and Feature Learning: Another important aspect is the reliance solely on the given affinity tensors for tensor spectral embedding. This implies that the tensor spectral clustering network does not actively update or adapt the affinity tensors during the learning process. The absence of mutual improvement between clustering and deep feature learning is a noteworthy limitation. Future research could explore mechanisms for dynamically updating the affinity tensors as part of the learning process, which would allow the network to adapt and refine the affinity information, potentially leading to improved clustering results.

Fig. 1. Workflow of the proposed tensor spectral clustering network (TSC-Net). The notations used in this figure are defined in Table I. TSC-Net maps the input data to the tensor spectral embedding under the corresponding objective, where the orthogonalization layer is at the top to ensure that the output meets the orthogonal constraint.

Fig. 2. ACC performance along with memory cost comparison of TSC-Net and its competitors on (a) Synthetic-Data, (b) Lung, (c) GLI-85, (d) MNIST, (e) Reuters, and (f) USPS. Note that SC-MSVD, TMM, and IPS2 are not able to run on MNIST, Reuters, and USPS due to out-of-memory issues.

Theorem 3.1. Given a matrix $\tilde{Y} \in \mathbb{R}^{m \times c}$ such that $\tilde{Y}^\top \tilde{Y}$ is full rank, let the lower triangular matrix $L \in \mathbb{R}^{c \times c}$ be derived by the Cholesky decomposition of $\tilde{Y}^\top \tilde{Y}$ as $\tilde{Y}^\top \tilde{Y} = LL^\top$. Then $Y = \tilde{Y}(L^{-1})^\top$ satisfies the orthogonal constraint.
Proof. If $\tilde{Y}^\top \tilde{Y}$ is full rank with the Cholesky decomposition $\tilde{Y}^\top \tilde{Y} = LL^\top$ and $Y = \tilde{Y}(L^{-1})^\top$, then $Y^\top Y = L^{-1} \tilde{Y}^\top \tilde{Y} (L^{-1})^\top = L^{-1} L L^\top (L^\top)^{-1} = I_c$, i.e., $Y$ satisfies the orthogonal constraint.


Fig. 3. From top to bottom, the t-SNE visualization of (a) SC, (b) IPS2, (c) DEC, and (d) TSC-Net on Synthetic-Data, Lung, and GLI-85, demonstrating the quality of the embedding learned by each clustering method. As can be seen, TSC-Net learns a more separable embedding with clearer class boundaries.

Fig. 5. Objective function values of TSC-Net versus the number of epochs on the six benchmark datasets.

TABLE III
PERFORMANCE COMPARISON ON SYNTHETIC-DATA, LUNG, AND GLI-85 (MEAN AND STANDARD DEVIATION %)

TABLE V
RUNNING TIME (IN SECONDS) OF TSC-NET ON THE SIX BENCHMARK DATASETS