Spectrum Pursuit With Residual Descent for Column Subset Selection Problem: Theoretical Guarantees and Applications in Deep Learning

We propose a novel technique for dataset summarization by selecting representatives from a large, unsupervised dataset. The approach is based on the concept of <italic>self-rank</italic>, defined as the minimum number of samples needed to express all dataset samples with an accuracy proportional to the rank-<inline-formula> <tex-math notation="LaTeX">$K$ </tex-math></inline-formula> approximation. As the exact computation of self-rank requires a computationally expensive combinatorial search, we propose an efficient algorithm that jointly estimates self-rank and selects the most informative samples with complexity that is linear w.r.t. the data size. We derive a new upper bound for the approximation ratio (AR), the ratio of the projection error obtained using the selected samples to the best rank-<inline-formula> <tex-math notation="LaTeX">$K$ </tex-math></inline-formula> approximation error. The best previously known AR for self-representative low-rank approximation was presented in ICML 2017, which was further improved by the bound <inline-formula> <tex-math notation="LaTeX">$\sqrt {1+K}$ </tex-math></inline-formula> reported in NeurIPS 2019. Both of these bounds are obtained by brute-force search, which is not practical, and they depend solely on <inline-formula> <tex-math notation="LaTeX">$K$ </tex-math></inline-formula>, the number of selected samples. In contrast, we describe an adaptive AR that takes into consideration the spectral properties and spikiness measure of the original dataset, <inline-formula> <tex-math notation="LaTeX">${\boldsymbol {A}\in \mathbb {R}^{N\times M}}$ </tex-math></inline-formula>. In particular, our performance bound is proportional to the condition number <inline-formula> <tex-math notation="LaTeX">$\kappa (\boldsymbol {A})$ </tex-math></inline-formula>. 
Our derived AR is expressed as <inline-formula> <tex-math notation="LaTeX">$1+(\kappa (\boldsymbol {A})^{2}-1)/(N-K)$ </tex-math></inline-formula>, which approaches 1 and is optimal in two extreme instances of spectral distribution. In the worst case, the AR is shown not to exceed 1.25 for the proposed method. Our proposed algorithm enjoys linear complexity w.r.t. the size of the original dataset, filling a long-standing gap between practical and theoretical methods for finding representatives. In addition to evaluating the proposed algorithm on a synthetic dataset, we show that it can be utilized in real-world applications such as graph node sampling for optimizing the shortest path criterion, learning a classifier with representative data, and open-set identification.


I. INTRODUCTION
Low-rank approximation is one of the fundamental workhorses of machine learning and optimization. It is a building block for data compression algorithms such as principal component analysis (PCA), where the data are transferred into an eigenspace representation in order to learn components that represent the whole dataset. The representations are either in a lower-dimensional space or expressed in terms of pseudo-data that are not part of the original dataset. Thus, these methods are not applicable to data sampling. However, it was shown recently that spectral methods are useful tools to devise a sampling framework [3]. In spectral-based sampling, the goal is to keep the spectrum (spectral decomposition) of the representatives as close as possible to the spectrum of the original data. Shannon sampling [4], the theoretically strongest sampling scheme, also follows the strategy of keeping the spectra of the original and the sampled function equal to each other. A reconstruction guarantee is then established for signal sampling in an infinite-dimensional space based on spectral properties. In addition to sampling, feature selection, data selection (as in active learning for annotating unlabeled samples), and data summarization (such as video summarization) are among other important applications where the representatives must be part of the original samples [5].
Datasets encountered in practice are usually composed of a large number of vectors in a finite-dimensional space. Assume M samples in an N-dimensional space organized in a matrix A ∈ ℝ^{N×M}. Let S be a set of integers between 1 and M which, when used to index the columns of A, defines a collection of samples. We define a matrix of the form V = [v_mk] which projects back from the sampled to the original data. We consider the following optimization problem for finding the optimal samples and back-projection:

{S*, V*} = argmin_{S, V} Σ_{m=1}^{M} ‖a_m − Σ_{k∈S} v_mk a_k‖₂² + λ|S|,   (1)

where λ is a parameter that regularizes the sampling rate. We refer to the cardinality of the set S* as the self-rank of matrix A in this paper. Substituting a_k in the inner summation with an arbitrary vector u_k ∈ 𝕊^{N−1} simplifies this problem to the truncated singular value decomposition (SVD). In this case, |S| reduces to the conventional definition of the rank of matrix A. However, conventional spectral decomposition using the SVD is not suitable for data sampling, since singular vectors are arbitrary vectors rather than actual samples. Enforcing the principal bases to come from the dataset itself in Eq. 1 results in a self-representative low-rank approximation. This problem is related to the column subset selection problem (CSSP) [6], a problem that is known to be NP-hard [7], [8] and the subject of ongoing research [2], [9], [10]. Our research group proposed a novel optimization technique to solve the CSSP, called iterative projection and matching (IPM) [3]. Recently, the IPM algorithm was extended to a more accurate and non-greedy version referred to as spectrum pursuit (SP) [5]. Despite the superior performance of SP compared to the state of the art (SOTA) in a wide range of applications, it is not provably convergent. Moreover, the performance bound of the SP algorithm has not been investigated. In the present work, we show theoretically why the IPM and SP family of algorithms perform so well. 
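To make the objective in (1) concrete, here is a minimal NumPy sketch that evaluates it for a candidate index set S. It assumes, as the problem implies, that for a fixed S the optimal back-projection V is the least-squares solution; the function name `selection_objective` is ours, not the paper's.

```python
import numpy as np

def selection_objective(A, S, lam):
    """Evaluate the objective of Problem (1) for a candidate index set S.

    For a fixed S, the optimal back-projection V is the least-squares
    solution, i.e., every column of A is projected onto span(A[:, S]).
    """
    A_S = A[:, list(S)]                  # selected columns
    V_T = np.linalg.pinv(A_S) @ A        # least-squares coefficients
    residual = A - A_S @ V_T             # reconstruction error for all samples
    return np.linalg.norm(residual, 'fro') ** 2 + lam * len(S)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 20))
err_small = selection_objective(A, [0, 1], lam=0.0)
err_large = selection_objective(A, [0, 1, 2, 3, 4], lam=0.0)
# With lam = 0, enlarging S can only reduce the projection error.
```

The λ term then trades this projection error against the number of selected samples, which is what makes |S*| act as a self-rank.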
In addition to a convergence guarantee, this paper answers another important question: ''how many samples are enough to describe a dataset?'' The latter question is addressed by developing the concept of self-rank. In this paper, we propose a versatile sampling framework to determine the self-rank and to identify the corresponding samples. Our approach is based on spectral decomposition; we compute the minimum number of samples that cover the K most significant spectral components of the dataset. The main contributions of this paper can be summarized as follows: • We introduce Spectrum Pursuit with Residual Descent (SP-RD), an enhanced version of SP that is employed for estimating the self-rank and summarizing the dataset by selecting the most informative samples.
• We prove an upper bound on the performance of the class of spectrum pursuit algorithms, and show that in two special cases the proposed bound for the AR matches the tightest possible bound, corresponding to the best rank-K approximation.
• We show that the SP-RD algorithm leads to improved performance compared to the SOTA in a number of applications, including graph node sampling with a shortest-path criterion, data summarization for training a classifier without losing performance, and open-set identification.

II. RELATED WORK AND MOTIVATION
There is a large line of work on data summarization in machine learning, mainly studied under the context of the CSSP and volume sampling (VS). See [1], [2], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] for a non-exhaustive list. Motivated by the fact that datasets continue to grow larger over time, a popular approach relies on the modification of the dataset itself, such that its size shrinks while preserving its original statistical properties. Based on this observation, [11] studied the reduction in size of a large dataset using random linear projection. On a similar note, [12], [13], [18] constructed a weighted subset of a large dataset, the Bayesian coreset, for a wider class of Bayesian models. Many other efforts have been devoted to data summarization and coreset construction algorithms. Examples include pivoted QR factorization, non-negative matrix factorization, and matrix subset selection, which fall under the umbrella of the CSSP [2], [9], [10], [23], [24], [25], [26] and can be considered forms of unsupervised feature selection or prototype selection. We briefly discuss the key models relevant to our work. One line of work focuses on randomized matrix algorithms, which have been developed for fast least squares [19], sketching [20], and low-rank approximation [21]. The authors of [22] develop a randomized QR with column pivoting (RQRCP), where random sampling (RNDS) is used for column selection. [27], [28] improved the running time of the CSSP with different sampling schemes while preserving similar approximation bounds. The CSSP method and its variants have also been applied and studied in various settings. A comprehensive summary of these algorithms and their theoretical guarantees is available in Table 1 of [6]. Most of these algorithms can be roughly categorized into two branches. 
One branch of algorithms is based on rank-revealing QR (RRQR) decomposition [29], which is related to reduction in the E-optimal sense [30]. It has been proved in [6] that RRQR is nearly optimal in terms of residue norm. On the other hand, sampling-based methods [31] select columns by sampling from certain distributions over all columns of an input matrix. The extension of sampling-based methods to general low-rank matrix approximation problems was also investigated in [32]. Let us denote the best rank-K approximation of matrix A by A_K. The data matrix A is approximated in polynomial time in [33], and a lower bound on the required number of samples of O(K log K + K/ε), where ε is the coefficient of the relative error, has been obtained in [33], [34]. In other words, a subset of data is selected such that the projection error of all samples on it is less than (1 + ε)‖A − A_K‖_F. For a smaller value of ε, more selected samples are needed to achieve the desired projection error. A tighter bound for the number of required samples is introduced in [35]: O(K/ε) columns contain a subspace which approximates A with a relative error coefficient of (1 + ε). Here, we introduce our approach based on recent advances in data sampling. Our proposed bound is the first in the literature that targets rank-K approximation using only K samples, while ε can approach 0 asymptotically.
The AR of existing low-rank approximation methods is only a function of K. We refer the reader to Table 1 for a brief survey of upper bounds on various low-rank models for sampling. The methods in Table 1 are either impractical or shown to work only in very limited practical settings in their experiments. This motivates two main questions in our work: How many samples suffice to represent a dataset? And how can we select such data representatives considering the intrinsic structure of the dataset? Our proposed method is based on the spectrum pursuit (SP) algorithm, which has been shown to work efficiently in a wide range of applications [5]. In the present paper we show theoretically why SP outperforms SOTA methods in sampling. Moreover, an extension of the SP algorithm is presented which is more accurate and provably convergent. As reflected in Table 1, there is no sampling algorithm whose AR is controlled by structural or spectral properties (such as the condition number κ(A)) of the matrix A. In this paper, a theoretical upper bound is established which depends on the spectral properties of A in addition to the data dimensions and the target rank K. The sampling literature includes a vast set of practical methods that function appropriately only in limited settings and come up short on theoretical guarantees. On the other hand, one can find sampling methods in the literature supported by established theoretical guarantees but with no practical applications. In the present work, for the first time, we fill this long-standing gap between practical and theoretical sampling.

III. JOINT SELF-RANK ESTIMATION AND SAMPLING
We introduce an equivalent problem to (1) that can be solved efficiently:

S* = argmin_S |S|  subject to  min_V Σ_{m=1}^{M} ‖a_m − Σ_{k∈S} v_mk a_k‖₂² ≤ E_K ‖A‖_F²,   (2)

where E_K ∈ [0, 1] is the projection error corresponding to the best rank-K approximation of the data, normalized to ‖A‖_F². Parameter K is a user-specified target rank. As K increases, more samples are needed in the set S in order to obtain the desired error. The cardinality of the set S* for a specified K is denoted by S_K(A) and is called the self-rank of dataset A with parameter K. For example, setting E_K = 0.25 results in a set of samples whose combination is able to reconstruct all data with 25% error. It is straightforward to find a target rank that provides the desired error by using the truncated SVD. Fig. 1 (left) shows the low-rank approximation error of the SVD vs. an assumed target rank for a subset of the Multi-PIE face dataset with 52,000 images. For instance, 9 spectral components are needed to span all data with a normalized error of less than 0.25. Since the spectral components of the SVD are not among the data samples, we need to solve Problem (2) in order to find the minimum number of actual samples such that their span is as accurate as that of the first 9 spectral components. Fig. 1 (right) shows the computed self-rank versus the target rank. As an example, there is a subset of the dataset with 15 samples whose span is as accurate as the span of the first 9 spectral components, as shown in Fig. 5. In other words, there are 15 samples in the dataset which are able to reconstruct the whole dataset with an error of less than 25%. The normalized error, E_K, is a user-specified parameter, which can be set to any value between 0 and 1. The target rank of matrix A corresponding to a desired projection error E_K is defined as the minimum integer K that satisfies (Σ_{n=K+1}^{N} σ_n²)/‖A‖_F² ≤ E_K, where σ_n denotes the n-th singular value of A. Table 1 compares the upper bounds on the error of some well-known low-rank approximation algorithms for the CSSP. 
In all methods, K is the target rank and approaching A_K is the goal of sampling. Our achieved AR is compared with brute force [2], improved CSSP [6], volume sampling [33], CUR decomposition [31], and near-optimal [35].
The self-rank of matrix A with parameter K is equal to the minimum number of columns of A such that their span approximates A with an error less than the error of the best rank-K approximation, i.e., the smallest size of a set S for which ‖A − A_S A_S† A‖_F ≤ ‖A − A_K‖_F holds. Matrix A_S refers to the sampled columns of A, and A_S† is the Moore-Penrose inverse of A_S.
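The two quantities above, the target rank obtained from the SVD tail and the self-rank feasibility condition ‖A − A_S A_S† A‖_F ≤ ‖A − A_K‖_F, can be sketched as follows. The helper names are illustrative, not from the paper.

```python
import numpy as np

def target_rank(A, E_K):
    """Smallest K with (sum_{n>K} sigma_n^2) / ||A||_F^2 <= E_K."""
    s2 = np.linalg.svd(A, compute_uv=False) ** 2
    total, tail = s2.sum(), s2.sum()
    for K, val in enumerate(s2, start=1):
        tail -= val
        if tail / total <= E_K:
            return K
    return len(s2)

def normalized_projection_error(A, S):
    """||A - A_S A_S^+ A||_F^2 / ||A||_F^2 for the columns indexed by S."""
    A_S = A[:, list(S)]
    E = A - A_S @ np.linalg.pinv(A_S) @ A
    return np.linalg.norm(E, 'fro') ** 2 / np.linalg.norm(A, 'fro') ** 2

def satisfies_self_rank_condition(A, S, K):
    """True if span(A_S) beats the best rank-K SVD error."""
    s2 = np.linalg.svd(A, compute_uv=False) ** 2
    return normalized_projection_error(A, S) <= s2[K:].sum() / s2.sum()
```

For example, for a matrix with singular values (2, 1, 1), a normalized error budget of 0.4 is met by a single spectral component, while a budget of 0.1 requires all three.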

IV. SPECTRUM PURSUIT WITH RESIDUAL DESCENT (SP-RD)
Solving Problem (2) implies a combinatorial search that is not feasible for a massive dataset and a large target rank. However, computing the target rank is tractable via the SVD. Since the self-rank is lower-bounded by the target rank, a practical algorithm should start searching at self-rank values no smaller than the target rank. Thus, we start by solving the following problem for |S| = K, and check whether the constraint in Problem (2) is satisfied for the desired error threshold:

S* = argmin_{S: |S|=K} min_V Σ_{m=1}^{M} ‖a_m − Σ_{k∈S} v_mk a_k‖₂².   (3)
The solution for S does not guarantee that the error in the constraint of Problem (2) is less than the desired E_K. The minimum cardinality of S that satisfies the constraint in (2) is taken as the self-rank of dataset A with parameter K. If the error is higher than the threshold E_K, the cardinality of the set S is increased by 1, and Problem (3) is solved with the new cardinality.
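The adaptive search just described can be sketched as follows; `solve_subset` is a placeholder for any solver of Problem (3) (SP-RD in the paper), and all names are our own.

```python
import numpy as np

def adaptive_self_rank(A, E_K, solve_subset, max_size=None):
    """Grow |S| from the SVD target rank until the constraint of (2) holds.

    `solve_subset(A, k)` is any solver for Problem (3) returning k column
    indices (SP-RD in the paper; here it is a pluggable stand-in).
    """
    s2 = np.linalg.svd(A, compute_uv=False) ** 2
    # Self-rank is lower-bounded by the target rank, so start the search there.
    k = int(np.sum(np.cumsum(s2) / s2.sum() < 1.0 - E_K - 1e-12)) + 1
    max_size = max_size or A.shape[1]
    S = []
    while k <= max_size:
        S = solve_subset(A, k)
        A_S = A[:, list(S)]
        err = np.linalg.norm(A - A_S @ np.linalg.pinv(A_S) @ A, 'fro') ** 2
        if err / np.linalg.norm(A, 'fro') ** 2 <= E_K:
            return k, S            # k is the (estimated) self-rank
        k += 1                     # constraint violated: enlarge the set
    return k, S

# Sanity demo with a trivial stand-in solver that picks the first k columns.
k_hat, S_hat = adaptive_self_rank(np.eye(4), 1e-9, lambda A, k: list(range(k)))
```

For the identity matrix, no proper subset of columns can reach near-zero error, so the search terminates only at the full dimension.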
The spectrum pursuit (SP) algorithm was proposed in [5] for solving (3). Inspired by SP, a more accurate solver for (3) is proposed in this paper, called Spectrum Pursuit with Residual Descent (SP-RD).
Projection of all the data onto the subspace spanned by the K columns of A indexed by S, i.e., π_S(A), can be expressed by a rank-K factorization UV^T, in which U ∈ ℝ^{N×K}, V^T ∈ ℝ^{K×M}, and U consists of the K columns of A indexed by S, normalized to unit length. Therefore, (3) can be restated as

{U*, V*} = argmin_{U, V} ‖Ã − UV^T‖_F²  subject to  u_k ∈ Ã for k = 1, ..., K,   (4)

where Ã = {ã_1, ã_2, ..., ã_M}, ã_m = a_m/‖a_m‖₂, and u_k is the k-th column of U. It should be noted that U is restricted to be a collection of K normalized columns of A, while there is no constraint on V. Since Problem (4) involves a combinatorial search and is not easy to tackle, we split (4) into two consecutive sub-problems. The first sub-problem relaxes the constraint u_k ∈ Ã in (4) to the milder constraint ‖u‖ = 1, and the second sub-problem re-imposes the underlying constraint.
These sub-problems are formulated as

(u*, v*) = argmin_{u, v: ‖u‖=1} ‖E_k − u v^T‖_F², where E_k ≜ Ã − U_k V_k^T,   (5a)
S_k = argmax_m |ã_m^T u*|.   (5b)

Here, S_k is a singleton containing the index of the k-th selected data point. Matrix U_k is obtained by removing the k-th column of U; matrix V_k is defined in a similar manner. Sub-problem (5a) is equivalent to finding the first left singular vector (LSV) of E_k. The constraint ‖u‖ = 1 keeps u on the unit sphere to remove the scale ambiguity between u and v. Moreover, the unit sphere is a superset of Ã and keeps the modified problem close to the recast problem (4). After solving for u (which is not necessarily one of our data points), we find the data point that matches u the most (makes the smallest angle with u) in (5b).
The best minimizer of (5a) w.r.t. u is the first LSV of E_k. However, the restriction to data samples does not imply that the sample most correlated with the first LSV is necessarily the best minimizer of (5a), although in most cases it is. In order to find the best sample that minimizes (5a) at each iteration, we propose to collect a few samples that are correlated with the first LSV and compute the residual error for each of them. Let Π denote the set of the P samples most correlated with the first LSV. The modified version of the above problem can be written as

S_k = argmin_{m ∈ Π} min_V ‖Ã − U^{(m)} V^T‖_F²,   (6)

where U^{(m)} is obtained from U by replacing its k-th column with ã_m. To make the error monotonically decreasing, as a sufficient condition for convergence, we include the index of the previously selected sample in Π when updating the k-th selected sample. This step is performed in Line 8 of the SP-RD algorithm. When selecting K samples, the lower bound on the selection cost function is ‖A − A_K‖, in which A_K is the best rank-K approximation of A. Since the cost function is monotonically decreasing and lower bounded, the modified algorithm is convergent. Alg. 1 and Fig. 3 show the steps of the SP-RD algorithm. SP-RD is an extension of the IPM algorithm, which is an incremental algorithm [3].
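A compact sketch of one SP-RD update, under our reading of (5a), (5b), and the residual-descent step: take the first LSV of the residual, shortlist the P most correlated samples, keep the previous index to guarantee descent, and pick the candidate with the smallest residual. All names are illustrative, and `np.linalg.svd`/`pinv` stand in for the faster routines a production implementation would use.

```python
import numpy as np

def sp_rd_update(A_t, U, k, P, prev):
    """One SP-RD-style update of the k-th selected column (a sketch).

    A_t : N x M matrix of unit-norm columns; U : N x K current selection.
    Returns (chosen column index, resulting residual error).
    """
    U_k = np.delete(U, k, axis=1)                       # drop the k-th column
    E_k = A_t - U_k @ (np.linalg.pinv(U_k) @ A_t)       # residual matrix
    u_first = np.linalg.svd(E_k)[0][:, 0]               # first LSV of E_k
    corr = np.abs(A_t.T @ u_first)
    cand = list(np.argsort(corr)[::-1][:P])             # P most correlated
    if prev not in cand:
        cand.append(prev)                               # ensures descent
    best, best_err = None, np.inf
    for m in cand:                                      # residual descent
        U_try = U.copy()
        U_try[:, k] = A_t[:, m]
        err = np.linalg.norm(A_t - U_try @ np.linalg.pinv(U_try) @ A_t, 'fro')
        if err < best_err:
            best, best_err = int(m), err
    return best, best_err

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 12))
A_t = A / np.linalg.norm(A, axis=0)
U = A_t[:, [0, 1]].copy()
base = np.linalg.norm(A_t - U @ np.linalg.pinv(U) @ A_t, 'fro')
idx, err = sp_rd_update(A_t, U, k=0, P=3, prev=0)
```

Because the previous index stays in the candidate set, the returned error can never exceed the error before the update, which is exactly the descent property used in the convergence argument.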

V. CONVERGENCE ANALYSIS
At each iteration of SP-RD, a new sample is selected only if the resulting residual error decreases (Alg. 1 (SP-RD), Line 9). This way the error is non-increasing. The error is also lower bounded by ‖A − A_K‖_F², since this is the least error that K vectors can achieve as the projection error. These two conditions guarantee that the algorithm converges and that the quality of the selected subset improves or remains the same over iterations. As can be seen in Fig. 4, the cost function of the SP algorithm is not necessarily decreasing over iterations. In contrast, the cost function of our proposed SP-RD algorithm is decreasing, since convergence is guaranteed. SP-RD simplifies to IPM if the set S is initialized with the empty set and P is set to 1. IPM performs only K iterations, and E_1 = A at the first iteration. As will be shown in our experiments, SP-RD is a more efficient solver for the CSSP than the SOTA; however, the theoretical analysis of IPM is more straightforward. In other words, SP-RD is a more robust implementation of IPM and is guaranteed to converge thanks to the descent property. (The global solution is not necessarily obtained; for sampling-based methods, which are of a combinatorial nature, the global solution can only be guaranteed via brute-force search.) The joint problem of self-rank estimation and data sampling is summarized in Alg. 2, at the heart of which lies the SP-RD algorithm. In the experiments, we refer to Alg. 2 as adaptive SP-RD. In Line 1 of the algorithm, K can easily be initialized using the truncated SVD. Moreover, π_S(A) is the projection operator onto the subspace spanned by the columns of A indexed by the set S; mathematically, it equals A_S A_S† A. It is noteworthy that the computational complexity of the SP-RD algorithm is O(MN + K²N + MNP) per iteration, which depends on the computational-budget parameter P. Note that we only need the first singular vector, which can be computed using fast methods such as the power iteration method. 
For the most relevant cases, a large dataset with K < N < M, the complexity is dominated by the O(MNP) term per iteration, and O(K) iterations are needed. This is linear complexity w.r.t. M. Fig. 5 compares the run time of our proposed method with other baselines.
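Since only the first left singular vector is needed at each iteration, a simple power iteration on E Eᵀ suffices, as sketched below (a generic power-iteration routine, not the authors' code):

```python
import numpy as np

def first_left_singular_vector(E, iters=200, seed=0):
    """Power iteration on E E^T: O(MN) per sweep, no full SVD needed."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(E.shape[0])
    for _ in range(iters):
        u = E @ (E.T @ u)          # multiply by E E^T without forming it
        u /= np.linalg.norm(u)
    return u

rng = np.random.default_rng(3)
E = rng.standard_normal((10, 40))
u_pi = first_left_singular_vector(E)
u_svd = np.linalg.svd(E)[0][:, 0]
align = abs(u_pi @ u_svd)          # sign-invariant agreement with full SVD
```

Each sweep costs two matrix-vector products, which is where the O(MN) term in the per-iteration complexity comes from.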
[Algorithm 1 (Spectrum Pursuit with Residual Descent, SP-RD). Require: A, P, and K.]

[Algorithm 2 (Joint self-rank estimation and data sampling via adaptive SP-RD). Require: dataset A, P, and the desired sampling error e ∈ [0, 1]; output: set S.]

The SOTA upper bound on the minimum required number of samples is O(K/ε) [35], which guarantees that the projection error of sampling is no worse than (1 + ε) times the projection error of the rank-K approximation. In general, the purpose in the CSSP is to find an upper bound on the number of sampled data in set S that ensures ‖A − π_S(A)‖_F ≤ (1 + ε)‖A − A_K‖_F. In the following theorem, we prove that there exists an upper bound on ε when the SP-RD algorithm is used to select one sample. Please note that when K = 1, U_k in Line 3 is empty and E_k = A in Line 5. This configuration simplifies the SP-RD algorithm to our previous
work, i.e., IPM. Our theoretical contribution is based on the IPM algorithm; the later works, i.e., SP and SP-RD, are more sophisticated implementations of the same approach. From here on, we assume that the matrix of interest is either full rank or that the zero singular values are truncated in the corresponding economy SVD representation, so that the condition number is bounded rather than infinitely large as a result of division by vanishing singular values.
Definition 1: The spikiness measure of a rank-R matrix A with singular values σ_1, σ_2, ..., σ_R is defined as ρ(A) = σ_1/(σ_1² + σ_2² + ... + σ_R²)^{1/2}.

Theorem 2: Let ‖A − A_1‖_F² denote the error achieved by the best rank-1 approximation of the full-rank data matrix A. Further, let ρ(A) denote the spikiness measure of A defined in [3], and κ(A) denote the condition number of matrix A. Assume that N is less than M for the given large dataset A. Also, let {s_1} denote the singleton containing the best sample selected by the SP-RD algorithm with P = 1. Then,

‖A − π_{{s_1}}(A)‖_F² ≤ (1 + ε)‖A − A_1‖_F², with ε = min{ρ(A)²(1 − ρ(A)²), (κ(A)² − 1)/(N − 1)}.

Proof of Theorem 2: Given the SVD of the full-rank data matrix A = UΣV^T, the column of A most correlated with u_1, denoted by s_1, can be written in normalized form as s̃_1 = (u_1 + Σ_{i>1} α_i u_i)/√L, where u_i is the i-th column of U and L = 1 + Σ_{i>1} α_i². The projection error of the first normalized selected sample s̃_1 can be cast as

‖A − s̃_1 s̃_1^T A‖_F² = Σ_i σ_i² − (σ_1² + Σ_{i>1} α_i² σ_i²)/L,

which is obtained using the orthogonality of the cross terms. In order to obtain the desired bound on ε in the format presented in Theorem 2, it suffices to show that this error is upper bounded by (1 + ε)E_1, where E_1 = Σ_{i>1} σ_i². Rewriting and cancelling identical terms from both sides leads to inequality (10). To establish inequality (10) for some value of ε, we relax it to a more tractable inequality: we make it tighter by decreasing its right-hand side and increasing its left-hand side, since any ε working for the tighter version generalizes to the main inequality as well. First, the non-negative term Σ_i α_i²/L² is removed from the right-hand side. Next, the left-hand side is enlarged by ignoring the power 2 on the factor (1 − 1/L) < 1 in inequality (10). The resulting inequality to be satisfied by ε is (11). We use (11) to establish two upper bounds for the AR. The first bound is achieved by substituting all σ_i for i > 1 with σ_N; hence, (11) reduces to an inequality in κ(A)². Knowing that Σ_{i>1} σ_i² ≥ (N − 1)σ_N², (12) can be rewritten accordingly. To continue and simplify the bound, we mention the following lemma. 
Lemma 3 [3]: Let A denote an N × M matrix. Let σ_1 and u denote the first singular value and the corresponding left singular vector of A, respectively. Then, there exists at least one column in A such that the absolute value of its inner product with u is greater than or equal to σ_1/√M. The following proposition states a lower bound on the maximum absolute correlation between the columns of a matrix and u when the data are normalized onto the unit sphere.
Proposition 4 [3]: Assume the columns of A are normalized to lie on the unit sphere. Then there exists at least one data point, a_i, such that the correlation coefficient between a_i and the first left singular vector of A is greater than or equal to ρ(A).
According to Lemma 3 and Proposition 4, we observe that 1 − 1/L ≤ 1 − ρ(A)². Replacing this lifts the left-hand side, and in order to satisfy the original upper bound it suffices to find an ε working for the resulting inequality. The other bound is obtained by setting the α_i to 0 in (10). Following the same procedure, and using the result of Lemma 3 to upper bound 1 − 1/L, the inequality of interest holds if a tighter inequality is satisfied. Therefore, the two values for which the two obtained bounds hold tightly can serve our purpose, and we set

ε = min{ρ(A)²(1 − ρ(A)²), (κ(A)² − 1)/(N − 1)}.

For a small condition number, the second bound tends to zero. Also, for large ρ (close to 1), the overall bound is small. The worst case for the overall upper bound is the maximum of the first term, which is 0.25. It is worth noting that for rank-deficient data, N can be interchangeably replaced with N − N_0 for some N_0 > 0. The minimum of the two terms is the suggested bound; thus, the smallest ε for which the upper bound holds can be set to the value found in the above equation. Before proceeding to establish an upper bound for K selected samples, we briefly discuss the established bound for one sample in the following remarks, clarifying its behavior in two specific instances. Remark 1: When κ(A) = 1, the data matrix is isometric, meaning that all samples are of equal importance. In this case, the projection error onto the span of any sample equals the projection error onto any spectral component. Thus, ‖A − π_{s_1}(A)‖_F = ‖A − A_1‖_F; in other words, the upper bound holds with correction factor ε = 0. Remark 2: ρ(A) = 1 means that all the energy of the samples is accumulated in the direction of the first eigenvector. In other words, the first sample is oriented towards the best rank-1 approximation of the data matrix, i.e., u_1. Therefore, ε = 0, and the upper bound holds tightly. 
This is trivial, as the first sample turns out to be the best rank-1 approximation itself. Fig. 6 shows the geometrical interpretation of the two extreme cases in Remark 1 and Remark 2. As can be seen, for an isometric dataset there is no preference among samples to be selected; in this scenario, the projection error of selecting one sample equals the projection error of the best rank-1 approximation, so the tightest bound is reached and ε = 0. Similarly, when the dataset is rank-1, the same holds, since the span of any sample equals the span of the best rank-1 approximation; thus ε = 0 and the tightest upper bound is attained. After observing the optimality of the presented method in these two extreme cases of the dataset's spectral distribution, we proceed to present a general bound for all dataset distributions in the following theorem. Please note that the condition number is the ratio of the largest singular value to the smallest non-zero singular value.
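The quantities appearing in Theorem 2 are easy to compute. The sketch below evaluates the spikiness, the condition number, and the correction factor ε for a random matrix, with ε taken as the minimum of the two derived terms per our reading of the proof:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 40))
s = np.linalg.svd(A, compute_uv=False)
N = A.shape[0]

rho = s[0] / np.linalg.norm(s)         # spikiness: sigma_1 / ||A||_F
kappa = s[0] / s[-1]                   # condition number
eps = min(rho ** 2 * (1 - rho ** 2), (kappa ** 2 - 1) / (N - 1))
# The quadratic term rho^2 (1 - rho^2) peaks at 1/4 (at rho^2 = 1/2),
# so the single-sample correction factor never exceeds 0.25.
```

The two extremes of the remarks are visible here: κ(A) → 1 drives the second term to zero, while ρ(A) → 1 drives the first term to zero.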
Theorem 5: Let ‖A − A_K‖_F² denote the minimum error achieved by the best rank-K approximation of the data matrix A. Further, let S_K = {s_1, ..., s_K} be the set containing the K samples selected by the IPM algorithm, which is the simplified version of Alg. 1 with P = 1. Then,

‖A − π_{S_K}(A)‖_F² ≤ (1 + ε_K)‖A − A_K‖_F², with ε_K = min{(κ(A)² − 1)/(N − K), 1/4}.

Proof of Theorem 5:
The subspace of interest, E_k, in Alg. 1, Line 5 (also defined in Eq. (5)) is obtained by projection onto the null space of the previously selected samples. Therefore, owing to the null-space projection at each iteration, every vector in the resulting subspaces is orthogonal to the previously selected samples. In other words, if s ∈ Range(E_k), then s ⊥ span{s_1, ..., s_{k−1}}. The residual-based progression of Alg. 1 selects a novel sample to capture the most information in the projection of the dataset onto the null space of the previously selected samples. Thus, the representation errors of subsequent iterations fall in orthogonal subspaces, and this fruitful linear-algebraic property of Alg. 1 permits us to sum over the error terms in Line 9 of Alg. 1 at each iteration. The upper bound for such error terms is established in Theorem 2. Owing to the orthogonality property, the squared Frobenius norm of the representation error of the selected subset can be written as the sum of the per-iteration error terms:

‖A − π_{S_K}(A)‖_F² = Σ_{k=1}^{K} ‖Z_k‖_F²,

where ‖Z_k‖_F² is the minimum error value achieved in Line 9 of Alg. 1. Let Ē_K denote the best rank-1 representation error for E_k in Eq. (4); then ‖Z_k‖_F² can be upper bounded by (1 + ε_k)Ē_K, following the same approach as in Theorem 2. To guard against zero singular values appearing in subsequent iterations and making the matrix E_k singular, we truncate to the non-zero singular values so that the condition number remains finite and meaningful. This requires a reduction in the number of columns of the matrices used to select sample indices, U_k and E_k, which can be done by deleting the index of a previously selected sample that leads to singularity after the null-space projection.
Also, the denominator appearing in the bound for ε_k is N − k (equal to the new reduced dimension), which can be derived in a manner similar to Theorem 2 for the k-th iteration.
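The orthogonality argument above is essentially the Pythagorean identity for orthonormal directions. A quick numerical sanity check, with generic Gram-Schmidt in place of the algorithm's null-space projections:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 30))

# Orthogonalize two selected columns (Gram-Schmidt), mimicking the
# null-space projections used in Alg. 1.
q1 = A[:, 0] / np.linalg.norm(A[:, 0])
r2 = A[:, 1] - q1 * (q1 @ A[:, 1])     # residual of sample 2 w.r.t. q1
q2 = r2 / np.linalg.norm(r2)

# Error of projecting all data onto span{q1, q2} ...
Q = np.stack([q1, q2], axis=1)
err_joint = np.linalg.norm(A - Q @ (Q.T @ A), 'fro') ** 2
# ... equals the total energy minus the per-direction captured energies,
# because the residual directions are orthogonal (Pythagoras).
err_sum = (np.linalg.norm(A, 'fro') ** 2
           - np.linalg.norm(q1 @ A) ** 2
           - np.linalg.norm(q2 @ A) ** 2)
```

This decomposition is exactly what allows the total representation error to be written as a sum of per-iteration terms.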
Lemma 6: Define κ* = max_k κ(E_k). Then κ* is upper bounded as κ* ≤ κ(A).

Proof: It is known that when columns of a data matrix are removed, the largest singular value decreases and the smallest singular value increases; as a result, the condition number of the matrix is reduced [36]. In addition, throughout the iterations of Alg. 1, the data pass through projections onto the null space of the selected samples. Projection does not affect the non-zero singular values but may turn some singular values into zero; as a result, null-space projections are non-increasing operators on the condition number. Therefore, a series of column deletions and null-space projections leads to a non-increasing condition number, and thus κ* ≤ κ(A).

The second term in the bound is the upper bound on the quadratic term ρ²(1 − ρ²) and is therefore 0.25. Inserting κ* into (19), one immediately finds the smallest possible coefficient ε_K, as stated in Theorem 5. A toy example illustrated in Fig. 7 validates the efficacy of the extracted bound on synthetic data. Using (20), Lemma 6, and (18), the proof is completed. Again, we reiterate that for rank-deficient matrices, N can be replaced with some N − N_0 without loss of generality.

FIGURE 6. Assume three points in a two-dimensional space represented by a 2 × 3 matrix A = [a_1, a_2, a_3]. Two extreme cases of eigenvector orientation are shown, in which the self-representative approximation error tightly matches the best rank-1 approximation error according to Theorem 2 (ε = 0). In both cases, there is no preference among the samples to be selected, and any random selection provides the same projection error.

The proposed SP-RD is the extended version of SP [5] and of the iterative projection and matching (IPM) algorithm [3]. Theorem 5 reveals the theoretical reason behind the exhibited success of the SP family of algorithms in a very wide range of
applications due to their simplicity and accuracy. SP and IPM algorithms has no parameter to be fine tuned. Parameter P in SP-RD does not need fine tuning, and can be set according to our accessible computational power. These favorable properties make the class of SP algorithms problem-independent and also dataset-independent.
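The iterative structure referenced in the proof above (score the residual, select a sample, then project the residual onto the null-space of the selection) can be sketched numerically. This is a simplified, illustrative rendition of an SP-style greedy loop, not the exact SP-RD implementation; the function name and the rank-one scoring rule are our simplifications.

```python
import numpy as np

def sp_select(A, K):
    """Simplified SP-style greedy column selection (illustrative sketch).

    Each iteration scores the columns of A against the top left singular
    vector of the current residual, picks the best unselected column, and
    projects the residual onto the null-space of that column.
    """
    E = A.astype(float).copy()
    selected = []
    for _ in range(K):
        # top left singular vector of the current residual
        u = np.linalg.svd(E, full_matrices=False)[0][:, 0]
        # normalized correlation of each (unselected) column with u
        scores = np.abs(u @ A) / (np.linalg.norm(A, axis=0) + 1e-12)
        scores[selected] = -np.inf
        j = int(np.argmax(scores))
        selected.append(j)
        # project the residual onto the null-space of the selected column
        s = A[:, j] / np.linalg.norm(A[:, j])
        E -= np.outer(s, s @ E)
    return selected
```

Each null-space projection zeroes the component of the residual along the selected sample, which is the mechanism the lemma exploits.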

VI. EXPERIMENTAL RESULTS
In this section, we apply the preceding theoretical results to learning tasks that employ samples selected from datasets according to their complexity, as reflected in their self-rank. Our aim is to reduce the training size of massive datasets while maintaining the accuracy of the learning tasks. First, a synthetic experiment is designed in order to estimate an oracle self-rank. Then, four new real-world applications are studied.
Datasets. i) CMU Multi-PIE Face Dataset [37]. We use a cropped version of this dataset, which contains 249 subjects with 13 poses, 20 illuminations, and 2 expressions; thus, there are 520 images per subject. ii) Omniglot [38]. The Omniglot dataset is designed for developing more human-like learning algorithms. It contains 1,623 different handwritten characters from 50 different alphabets, each drawn online via Amazon's Mechanical Turk by 20 different people. Each image is paired with stroke data, a sequence of [x, y, t] coordinates with time t in milliseconds. iii) MNIST [39]. The MNIST dataset of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. iv) Cora [40]. This dataset contains sparse bag-of-words feature vectors for each document and a list of citation links between documents. We treat the citation links as (undirected) edges and construct a binary, symmetric adjacency matrix A with 2,708 nodes and 5,429 edges.

A. SELF-RANK ESTIMATION ON SYNTHETIC DATA
Since the computation of self-rank is a combinatorial problem, we synthesize a dataset with a known self-rank in order to evaluate Alg. 1 and Alg. 2 for sampling. The known self-rank serves as a lower bound on the self-rank estimated by Alg. 2.
A synthetic full-rank dataset is generated with M samples in a 200-dimensional space. First, we generate 20 linearly independent samples whose target rank is 15 with parameter E_K = 0.1. We call these 20 samples the base samples, as shown in Fig. 8. Then, M − 20 samples are generated as linear combinations of the base samples. Finally, all M − 20 generated samples (all except the base samples) are contaminated with noise in the null-space of the base samples, yielding a full-rank dataset. Since the base samples can be approximated by a rank-15 subspace, the whole dataset can be approximated by a rank-15 decomposition; however, those 20 base samples determine the self-rank of the dataset. We test this dataset with different selection algorithms; an efficient algorithm should summarize the dataset down to the base samples. Fig. 9 shows the probability of selection from the underlying base samples for different algorithms, where each algorithm is allowed to sample 20 data points. As can be seen, the SP-RD algorithm successfully finds all base samples for up to 1,000 generated samples. Increasing P (the number of correlated samples) in the SP-RD algorithm yields a more accurate solution to Problem 3; SP-RD is performed for 200 iterations. Note that the SP algorithm is equivalent to SP-RD with P = 1. Table 2 shows the estimated self-rank of the generated synthetic dataset: setting P ≥ 8 recovers the oracle value of the underlying self-rank, with the desired E_K set to 0.1. It is worthwhile to mention that random selection needs, on average, 67 samples to match the projection error of the 20 underlying base samples, while SP-RD with P = 8 needs only 20 samples, i.e., the base samples are selected intelligently.
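The generation procedure above can be sketched as follows. The specific sizes, noise scales, and variable names are illustrative choices of ours, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, n_base, target_rank = 200, 500, 20, 15

# 20 base samples lying close to a rank-15 subspace
U = rng.standard_normal((N, target_rank))
base = U @ rng.standard_normal((target_rank, n_base))
base += 0.01 * rng.standard_normal((N, n_base))  # small off-subspace part

# the remaining M - 20 samples are linear combinations of the base samples
A = np.hstack([base, base @ rng.standard_normal((n_base, M - n_base))])

# contaminate the non-base samples with noise confined to the
# null-space of the base samples, so the dataset becomes full rank
Q, _ = np.linalg.qr(base)            # orthonormal basis of span(base)
noise = rng.standard_normal((N, M - n_base))
noise -= Q @ (Q.T @ noise)           # project noise onto the null-space
A[:, n_base:] += 0.05 * noise
```

By construction, projecting any column of A onto the span of the 20 base samples removes everything except the injected null-space noise, which is why an efficient selector should recover exactly those 20 columns.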

TABLE 2.
Estimated self-rank for the described synthetic data using the SP-RD algorithm with different values of parameter P. The expected self-rank estimated by random selection is 67; that is, on average, 67 samples are needed to span the dataset as accurately as the 20 samples found by the SP-RD algorithm.

B. REPRESENTATIVE SELECTION FROM CMU MULTI-PIE DATASET
Here, the CMU Multi-PIE Face Database is considered for representative selection [37]. Fig. 10 compares the performance of different state-of-the-art selection algorithms in terms of the CSSP normalized projection error, defined as the CSSP cost function in (3) for a given selection method, normalized by ‖A − A_K‖. The normalized error is averaged over all 249 subjects. Parameter P is set to 10 for the SP-RD algorithm, and 5K iterations are performed in order to select K samples. As can be seen, the obtained approximation ratios are much better than the tightest theoretical bound in the literature, which is √(1 + K). In practice, for the Multi-PIE face dataset, the AR achieved by the different algorithms is better than 1.42 for K ≤ 15. Moreover, the proposed SP-RD reaches an AR of around 1.15, which is the closest to the best possible AR, i.e., 1.
FIGURE 10. The averaged CSSP cost function for selecting K samples from each class of the Multi-PIE face dataset with 130,000 images. The projection error is normalized by the projection error of the best rank-K approximation. The SP-RD algorithm is compared with state-of-the-art methods including SP [5], DS3 [41], FFS [42], SMRS [43], S5C [44], K-medoids [45], and volume sampling [33].
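The normalized projection error reported in Fig. 10 can be computed as sketched below. This assumes the Frobenius norm; the function name is ours. When exactly K columns are selected, the returned value is the approximation ratio and is never below 1.

```python
import numpy as np

def normalized_projection_error(A, selected, K):
    """CSSP cost for a selected column subset, normalized by the best
    rank-K approximation error. With |selected| = K this is the
    approximation ratio (AR); 1 is optimal."""
    S = A[:, selected]
    # orthogonal projection of A onto the span of the selected columns
    Q, _ = np.linalg.qr(S)
    residual = A - Q @ (Q.T @ A)
    # best rank-K approximation error, from the tail singular values
    sv = np.linalg.svd(A, compute_uv=False)
    best_err = np.sqrt(np.sum(sv[K:] ** 2))
    return np.linalg.norm(residual, "fro") / best_err
```

Averaging this quantity over the 249 subjects reproduces the kind of curve shown in Fig. 10.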

C. ADAPTIVE SAMPLING FROM MNIST DATASET FOR CLASSIFICATION
To evaluate the effectiveness of our algorithm, we apply it to the MNIST classification task. In order to apply adaptive sampling, MNIST data points are first transferred to a feature space. A simple backbone architecture is utilized for both the feature-extraction and classification tasks to prevent architecture-specific bias. We train our feature net on the Omniglot classification task [38] and take the features of the last convolutional layer; every image in the MNIST training set is transformed into a vector of length 256 with the trained feature net. Our classifier has the same architecture as the feature net, except that its last layer consists of only 10 neurons, the number of classes in MNIST. Figure 11 shows the results of these experiments. In the adaptive sampling scenario using the SP-RD algorithm, we account for the complexity of each class in the MNIST training data, as determined by the introduced self-rank of each class. We compute the self-rank of each class using Alg. 2, along with the optimal samples; the number of selected samples is therefore class-dependent. This forms a non-uniform sampling scenario in which data are sampled for learning from each class in an adaptive fashion. We compare this approach to random selection and to SP-RD with a fixed number of selected samples per class; comparisons to the K-medoids, SMRS, and dual volume sampling (DVS) [46] algorithms are also provided. Splitting the sampling budget adaptively according to the introduced self-rank improves classification performance. This experiment is designed with a generic architecture on a typical dataset; nevertheless, the other selection algorithms do not provide a gain over random selection.
FIGURE 11. Comparison of the accuracy of the MNIST classification task when a portion of each class is sampled.
In this scenario, K-medoids performs even worse than random selection, and the performance of SMRS and dual volume sampling is close to that of random selection. The vertical lines show a 95% confidence interval.
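One way to realize the adaptive budget split is sketched below. The true per-class self-rank comes from Alg. 2, which is not reproduced here; the SVD-truncation proxy, the function name `class_budgets`, and the threshold `eps` are our illustrative assumptions.

```python
import numpy as np

def class_budgets(class_mats, eps=0.1):
    """Split the sampling budget adaptively across classes.

    Proxy for the self-rank of each class: the smallest rank k whose
    SVD truncation error, relative to the class energy, falls below eps.
    """
    budgets = []
    for X in class_mats:
        sv = np.linalg.svd(X, compute_uv=False)
        # tail[k] = Frobenius error of the best rank-k approximation
        tail = np.sqrt(np.concatenate([np.cumsum((sv ** 2)[::-1])[::-1], [0.0]]))
        rel = tail / tail[0]
        # first k with relative truncation error below eps (rel[-1] = 0)
        budgets.append(max(int(np.argmax(rel < eps)), 1))
    return budgets
```

A class whose samples concentrate in a low-dimensional subspace thus receives a small budget, leaving more of the total budget for structurally complex classes.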

D. OPEN-SET IDENTIFICATION BASED ON SELF-RANK
In this experiment, we show how adaptive SP-RD gains an advantage over random sampling and fixed-rate sampling for open-set identification. Due to the closed-set nature of the training procedure for deep neural networks, such models tend to label every sample as belonging to one of the classes, even when a sample does not belong to any of the observed classes. The capability of rejecting such samples and classifying them as coming from the open set is known as open-set identification [47], [48], [49], [50]. Alg. 3 describes the self-rank-based open-set identification procedure. The MNIST dataset is used as the closed set for training, and the open-set samples are taken from the Omniglot dataset [38]. The ratio of Omniglot to MNIST test samples is 1:1 (10,000 from each), the same as in the simulation scenario of [49]. A high-precision backbone CNN classifier [51] is trained on MNIST, as required by Step 1 of Alg. 3. The error values resulting from the projection (regression) of open- and closed-set samples onto the selected representatives of each class in Step 4 of Alg. 3 differ significantly, in contrast to projecting onto the entire dataset (due to the effect of out-of-distribution samples). Results of the macro-averaged F1-score [52] for the proposed method with different selection strategies are listed in Table 3. We observe that adaptive sampling based on self-rank outperforms random sampling and SP-RD with a fixed number of selected samples (i.e., without considering the structure of each MNIST training class). In order to carry out a fair comparison, the number of samples selected from each class in the fixed scheme is set to the average number of samples selected by the adaptive scheme based on the class-wise self-rank. It has been shown in [5] that the selection-based open-set identification scheme (SOSIS) outperforms the state of the art in terms of the projection error of testing samples.
In this experiment, we have made SOSIS adaptive in terms of the number of samples to be selected, depending on the self-rank and self-expressiveness of each MNIST class.
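The projection (regression) test at the heart of the procedure can be sketched as follows: a test sample is regressed onto the selected representatives of its predicted class, and a large relative residual flags it as open-set. The function names and the threshold value are hypothetical illustrations, not the paper's tuned settings.

```python
import numpy as np

def openset_score(y, reps):
    """Relative residual of regressing a test sample y onto the class
    representatives (columns of reps). A large residual suggests the
    sample comes from the open set."""
    coef, *_ = np.linalg.lstsq(reps, y, rcond=None)
    return np.linalg.norm(y - reps @ coef) / np.linalg.norm(y)

def is_open_set(y, reps, threshold=0.5):
    # threshold is a hypothetical value; in practice it is tuned per task
    return openset_score(y, reps) > threshold
```

A closed-set sample is well expressed by its class representatives (score near 0), while an out-of-distribution sample is not (score near 1).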

Algorithm 3 Self-Rank Based Open-Set Identification
Require: A_X (closed-set training data) and A_Y = {a_Y^p}, p = 1, …, P (test set)
1: Train a classifier on A_X with H classes
2: Evaluate the self-rank K_h for class #h in A_X
3: S_h ← set of K_h selected samples of class #h in A_X
4: ℓ(p) ← label a_Y^p using the classifier trained in Step 1 (∀p)

E. GRAPH NODE SAMPLING FOR OPTIMIZING THE SHORTEST-PATH CRITERION
A graph network is characterized by structural features such as the degree distribution, the average shortest-path length (ASPL), and the clustering coefficient. Calculating the ASPL of a large graph is memory- and computation-intensive. Hence, we propose an efficient alternative for computing the ASPL with a reasonable error: instead of the shortest paths between all pairs of vertices, we compute the ASPL between all vertices and a set of selected vertices. Given a graph G, we first select a subset of the vertices and then use the ASPL measure to evaluate the accuracy. Let G be the graph of the real-world citation dataset Cora, with vertex set V, and let dist(v1, v2) denote the shortest distance between v1 and v2 (v1, v2 ∈ V). The error is then defined over the set S of samples selected by the algorithms specified in Table 4. The more vertices chosen as representatives, the smaller the obtained error; more importantly, the fewer samples an algorithm requires, the faster the ASPL is obtained. For a fair comparison between our algorithm and the state of the art, we consider a predefined threshold (err_th = 0.15) on the approximation error, based on the topology of the graph, and we observe how many samples each algorithm needs in order to satisfy that precision on the shortest path. The results of these experiments are shown in Table 4.
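The landmark-based ASPL computation described above can be sketched with breadth-first search on an unweighted graph. This is a sketch of ours for illustration; function names are hypothetical, and the graph is given as an adjacency-list dictionary.

```python
from collections import deque

def bfs_dists(adj, src):
    """Single-source shortest-path lengths on an unweighted graph
    given as an adjacency-list dict {vertex: [neighbors]}."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def aspl_estimate(adj, landmarks):
    """Average shortest-path length over pairs (v, s) with s drawn
    from the selected landmark set, instead of over all vertex pairs."""
    total, count = 0, 0
    for s in landmarks:
        for v, d in bfs_dists(adj, s).items():
            if v != s:
                total += d
                count += 1
    return total / count
```

With |S| landmarks, only |S| BFS traversals are needed instead of |V|, which is the source of the speedup on large graphs such as Cora.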

VII. RELATION TO THE CONVENTIONAL SPECTRAL-BASED SAMPLING
Fifty years after Shannon's work, researchers in the 1990s revisited Shannon theory for signals with special spectral patterns [4]. A popular spectral structure that was investigated is the union of low-rank subspaces [56], which led to compressed sensing, a low-rate sensing technique [57]. In compressed sensing, one measures (senses) combinations of samples, which is not appropriate for selecting a small number of actual samples. On the other hand, over the past 20 years, information systems have been extensively developed based on recent advances in machine learning and big datasets. The present paper links the conventional spectral sampling approaches to sampling from big datasets represented in a finite space for a machine learning task.
To build constructive intuition about the Shannon sampling theorem, we cast optimal Shannon sampling as an optimization problem in which the sampling period is the average of |t_k − t_{k−1}| over all k. This problem seeks sampling epochs t_k such that a function f is able to recover the original signal accurately. The Shannon theorem solves this problem using an assumption on the bandwidth of the signal x(t): if the decomposition of x(t) in terms of the bases {exp(jwt)} is non-zero only for |w| ≤ B, then the cost of the first term in (21) is 0 using t_k = k/(2B) as the sampled points and f(t, t_k) = sinc(2B(t − t_k)) as the mixer function. The achieved sampling period is thus 1/(2B). Problem (21) can be recast for finite datasets, where λ is a parameter that regularizes the rate of sampling and matrix V projects the selected samples back to the original dataset, similar to the function f in Eq. (21). Similar to the spectral conditions for perfect signal reconstruction, we establish an upper bound on the performance of data sampling according to the spectral properties of matrix A. In continuous signal reconstruction, we need 2B samples per unit of time, whereas self-rank indicates the minimum number of samples required for a dataset in a finite-dimensional space. The Shannon sampling theorem suggests a minimum number of samples as a function of the spectral properties of the original function; as shown in this paper, we pursue the same approach based on the eigendecomposition.
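The sinc-reconstruction argument above can be checked numerically: sampling a bandlimited signal at t_k = k/(2B) and mixing with f(t, t_k) = sinc(2B(t − t_k)) recovers the signal up to series-truncation error. The specific bandwidth, test frequencies, and window below are illustrative choices of ours.

```python
import numpy as np

B = 2.0                      # bandwidth of the test signal (Hz)
T = 1.0 / (2 * B)            # Nyquist period: t_k = k / (2B)
t_k = np.arange(-200, 201) * T

def x(t):
    # bandlimited test signal: components at 0.5 Hz and 1.5 Hz, both below B
    return np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.cos(2 * np.pi * 1.5 * t)

# mixer f(t, t_k) = sinc(2B (t - t_k)); np.sinc(u) = sin(pi u) / (pi u),
# so the argument passed to np.sinc is 2B (t - t_k)
t = np.linspace(-5.0, 5.0, 101)
recon = (x(t_k)[None, :] * np.sinc(2 * B * (t[:, None] - t_k[None, :]))).sum(axis=1)
max_err = np.max(np.abs(recon - x(t)))   # small; limited only by truncation
```

At the sampling epochs themselves the sinc kernels form a delta train, so the reconstruction is exact there; in between, the error shrinks as more terms of the series are kept.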

VIII. CONCLUDING REMARKS AND DISCUSSION
In this paper, we study the problem of subset selection. The general goal is to select a subset from a large set of data such that a linear combination of the subset attains the target rank-K approximation. In spite of a large body of work on subset-selection algorithms and the low-rank approximation problem, many fundamental questions remain unresolved. Probably one of the most important open questions is: ''How can the selection algorithm determine the number of needed representative samples?'', or equivalently, ''How many samples are enough to describe a matrix of data (a dataset)?'' Despite the superior performance of prior state-of-the-art methods such as IPM [3] and SP [5], these algorithms require the number of selected samples to be given, they are not provably convergent, and no performance bound has been provided for them.
In the present work, leveraging SP, we have proposed a novel adaptive sampling technique that depends on the spectral properties of the original dataset, A ∈ R^{N×M}. In our approach, the achieved bound is proportional to the condition number κ(A). In contrast, the bounds for state-of-the-art self-representative low-rank approximation methods are either obtained by brute-force search, which is not practical, or depend solely upon K, the number of selected samples. We proved performance guarantees by making use of the self-rank concept, which showed that the proposed adaptive sampling improves upon conventional sampling. We also presented experiments on synthetic and real-world datasets to demonstrate significant performance superiority over other sampling methods in different learning tasks.
We expect the outcome of our proposed algorithms and theoretical results to have a significant impact on a wide range of applications. This is due to an ever-increasing generation and collection of data in almost all areas, such as computer vision, machine learning, health care systems, surveillance systems, sensor networks, vehicular networks, smart grids, and IoT, to just name a few. In all these applications, it will be crucial to reduce an enormous amount of data to a much smaller subset of representatives. It will be also very important to efficiently learn critical features of data. Our proposed approach has direct impact on active learning, few-shot learning, self-supervised learning and data compression which are not explored here.
The introduced concept of self-rank is a useful tool which can be employed in other areas of data science in order to establish tighter theoretical guarantees. Specifically, spectral properties such as self-rank and condition number can be investigated in more details to achieve better performance bounds and constructive algorithms based on theoretical supports. The underlying model of our work is a union of linear subspaces. In the future work, more sophisticated models can be considered especially, latent spaces of neural networks which is a union of non-linear manifolds.
A potential limitation of our work is that, even though the assumptions made in this paper are realistic and compatible with the prior art, the value of N could be much less than the number of data points, and K could be much smaller than N.