Exploiting Linear Manifold Features With Parts-Based Representation in Various Scenes

Image recognition in complex scenes is a major challenge in computer vision. Manifold learning has become one of the most popular tools for dimensionality reduction and image recognition because it efficiently retrieves the intrinsic geometric features of image data. In this paper, we propose a new manifold feature extraction model based on nonnegative matrix factorization (NMF) for image clustering in various scenes. In this model, the Pearson distance with multiple manifold regularization constraints is adopted as the objective function to derive NMF-based learning algorithms for capturing the features of high-dimensional data. With a variable neighborhood size in the learning, the proposed model can learn the linear features and, at the same time, the local similarity of images in multi-scale neighborhoods of a graph space. For different settings of the learning parameters $\lambda_{lx}$ and $\lambda_{sx}$, tests show that the proposed algorithms can efficiently retrieve low-dimensional structures of images. Test results on four different image datasets demonstrate that the algorithms achieve state-of-the-art performance on the clustering of images in different types of scenes.


I. INTRODUCTION
Neural network-based feature extraction for images has been widely applied in pattern recognition. However, due to the complexity of the different scenes in sample data, earlier networks often yield poor solutions [1], [2]. The performance of algorithms was improved by constructing multi-layer or graph neural network-based learning models [3]-[5]. Dimension reduction of data is an essential step for feature extraction. By exploiting different models and their combined parameter settings, a wide range of algorithms have been proposed to obtain low-dimensional data features. Nonnegative matrix factorization (NMF) is one of the most popular dimensionality reduction methods for parts-based feature representation [6], [7]. From NMF, many interesting algorithms have been developed for image clustering and classification [8]-[11], including graph regularized NMF algorithms [26], [27] and deep neural network-based NMF algorithms, which focus on the extraction of low-dimensional features with intrinsic geometric structures in sample data [29]-[32]. For these algorithms, investigations have shown that unsupervised multi-layer and graph regularization techniques can be utilized for image recognition in various complex scenes, where the introduction of manifold approaches significantly improved the efficiency of the algorithms. In manifold learning, traditional algorithms include the Laplacian Eigenmaps algorithm (LE) [17], Locally Linear Embedding (LLE) [18], and Isometric Feature Mapping (ISOMAP) [19], from which many important learning methods, such as Hessian-based locally linear embedding (HLLE) [20], the log Riemannian exponential map expressed in tangent space algorithm (LOGMAP) [21], and other manifold regularized algorithms [22]-[25], have been derived.

The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

All these algorithms were motivated by the idea of similarity embedding of graph nodes in a neighborhood. Meanwhile, by constructing a similarity graph or similarity matrix for different views of sample data, multi-view learning has recently been proposed to improve existing manifold learning algorithms, and it is becoming an increasingly important class of methods for recognizing images of complex scenes [40]-[43]. In general, these models show that incorporating manifold regularization with different constraints can yield a more efficient representation of sample data. However, test results show that for LLE, the size of the neighborhood determines the topological structures of sample data. If a neighborhood is too small or too big, the manifold features of data cannot be described by the locally linear embedding. The manifold structures of data in a small neighborhood can only be described by the similarity or invariance of nearest neighbors; that is, in this neighborhood any two nodes will be similar, but these nodes may not be related by a linear combination. On the other hand, when manifold learning is used for classification or clustering, most current graph regularized NMF algorithms focus on learning the local invariance in a small neighborhood, which cannot completely capture the intrinsic structures of objects needed to handle the complexity of images in different scenes. In this paper, we propose a multiscale local manifold constrained NMF (LMNMF) algorithm, which can learn both the locally linear representation and the local invariance of images at different neighborhood scales to capture the low-dimensional geometric architectures of sample data. The main contributions of this paper are summarized as follows:

1) A novel model called multi-scale local manifold regularized NMF is proposed. In this model, NMF-based manifold learning algorithms for both locally linear representation and local invariance of image data with different scales of neighborhoods are developed to extract the low-dimensional geometric architectures of images in different scenes.
2) The convergence properties of the proposed algorithms are studied, which show that the best image clustering result is obtained only if, during learning, the objective function is non-increasing and the corresponding learning algorithms converge stably. By adjusting the parameter settings, the convergence of the objective function and the learning algorithms can be controlled efficiently.
3) Experimental results on four image datasets are presented to show the efficiency of the proposed algorithms. They demonstrate that the proposed method obtains the best or close to the best performance compared with several other feature extraction algorithms in terms of accuracy (ACC) and normalized mutual information (NMI).

Fig. 1 shows the basic structure of the proposed model. In this model, NMF learning is performed on the whole domain to obtain the parts-based representation, whereas locally linear embedding and local invariance feature learning are performed in neighborhoods of different sizes to obtain their corresponding feature representations. Feature representations are approximated by linear combinations of different basis images. Thus, a multiscale manifold learning model is constructed.
The rest of this paper is organized as follows. In Section II, the related algorithms are presented. In Section III, the Pearson distance-based objective function is introduced. In Section IV, the framework of learning algorithms is developed from the proposed objective function. In Section V, the complexity of the proposed algorithms is analyzed. In Section VI, experimental results are presented. Finally, in Section VII, the conclusions are provided.

II. RELATED ALGORITHMS
NMF decomposes an $M \times N$ data matrix $Y$ into two non-negative matrices $A$ and $X$ such that their product approximates the original data matrix ($Y \approx AX$), where $A \in \mathbb{R}^{M \times K}$ is called the basis matrix and $X \in \mathbb{R}^{K \times N}$ is called the encoding or representation matrix of the original data. Assume that $a_{ij}$ is an element of matrix $A$, $a_i$ is a column vector of $A$, $x_{jk}$ is an element of matrix $X$, and $x_j$ is a column vector of $X$. The Frobenius norm is one of the most popular and relatively simple objective functions for measuring the decomposition error:

$$\|Y - AX\|_F^2 = \sum_{i,k} \left(y_{ik} - [AX]_{ik}\right)^2. \tag{1}$$

From the objective function in Eq. (1), Lee and Seung developed the following learning rules to decompose the data matrix into the factors $A$ and $X$ [6]:

$$a_{ij} \leftarrow a_{ij}\,\frac{[YX^T]_{ij}}{[AXX^T]_{ij}}, \qquad x_{jk} \leftarrow x_{jk}\,\frac{[A^TY]_{jk}}{[A^TAX]_{jk}}, \tag{2}$$
where $a_{ij}$ ($a_{ij} \geq 0$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, K$) are the elements of matrix $A$ and $x_{jk}$ ($x_{jk} \geq 0$, $j = 1, 2, \ldots, K$, $k = 1, 2, \ldots, N$) are the elements of matrix $X$. By adding an extra term to the Frobenius norm in Eq. (1), graph regularized NMF was developed by Cai et al. [30]:

$$\|Y - AX\|_F^2 + \lambda\,\mathrm{Tr}(XLX^T), \tag{3}$$

where $\lambda \geq 0$ is the regularization parameter, $L$ is the graph Laplacian matrix, $\mathrm{Tr}(XLX^T) = \frac{1}{2}\sum_{j,k}\|x_j - x_k\|^2\,\tilde{w}_{jk}$, and $\tilde{w}_{jk}$ is from the definition in Eq. (16). Graph regularized NMF (GNMF) derived from Eq. (3) aims to learn the local invariance of sample data in a neighborhood based on graph theory, which can enhance the efficiency of algorithms by learning the intrinsic geometric features of data. Graph regularized NMF has since been extended and extensively applied to data representation, including dual embedding regularized NMF, Lp smooth NMF, graph-based discriminative NMF, and so on [12]-[16]. This type of algorithm suffers from redundant solutions and the scale transfer problem [33], because graph regularization terms are incorporated into the objective functions to constrain local invariance. Based on GNMF, the original data $Y$ can be separated into different views $Y^{(1)}, Y^{(2)}, \ldots, Y^{(m_v)}$, from which graph regularized multi-view learning algorithms can be developed. For the original data $Y$, its representation $X$ in the feature subspace has the same separation $X^{(1)}, X^{(2)}, \ldots, X^{(m_v)}$. Then, for any view $X^{(v)}$, the graph regularization term is defined with the Laplacian matrix of the $v$-th view [41], [42]. Since the original dataset is separated into different views according to some of the features of images, this type of algorithm can obtain better performance on image recognition. However, the cost is that labelling data into different views may take a lot of time.
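The multiplicative learning rules of Lee and Seung [6] for the Frobenius objective can be sketched in a few lines of NumPy. This is a minimal illustration only: the function name `nmf_multiplicative`, the random initialization, the iteration count, and the small constant `eps` that guards against division by zero are all assumptions made for the example, not part of the paper.

```python
import numpy as np


def nmf_multiplicative(Y, K, n_iter=200, eps=1e-12, seed=0):
    """Factor a nonnegative M x N matrix Y as A @ X, with A (M x K) and X (K x N)."""
    rng = np.random.default_rng(seed)
    M, N = Y.shape
    A = rng.random((M, K)) + eps   # assumed random positive initialization
    X = rng.random((K, N)) + eps
    history = []
    for _ in range(n_iter):
        # x_jk <- x_jk * [A^T Y]_jk / [A^T A X]_jk
        X *= (A.T @ Y) / (A.T @ A @ X + eps)
        # a_ij <- a_ij * [Y X^T]_ij / [A X X^T]_ij
        A *= (Y @ X.T) / (A @ X @ X.T + eps)
        # Squared Frobenius reconstruction error after this iteration.
        history.append(np.linalg.norm(Y - A @ X, "fro") ** 2)
    return A, X, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Y = rng.random((30, 5)) @ rng.random((5, 40))   # synthetic rank-5 nonnegative data
    A, X, history = nmf_multiplicative(Y, K=5)
    print(f"squared error: first iteration {history[0]:.4f}, last {history[-1]:.6f}")
```

Because each update multiplies by a nonnegative ratio, the factors stay nonnegative without any explicit projection, which is the property the parts-based representation relies on.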
On the other hand, the graph regularized deep neural network is a significant extension of the original NMF algorithms [27]; it combines traditional deep auto-encoders (DAEs) with the local invariance constraint to form a graph regularized deep neural network that extracts geometric structures of sample data for image clustering. The deep learning model shows the promising performance of graph preserving algorithms [26], [28]. However, current graph regularized algorithms only consider the local invariance of sample data, and their effective test results are on datasets such as COIL20, YaleB, MNIST, PIE, and ORL, in which all images have simple backgrounds. To extract features from various real-world data with robustness to complex image backgrounds, a general divergence measure defined by Amari (called Amari's $\alpha$-divergence) [7] can be introduced, as given in Eq. (4), where $\beta = (1+\alpha)/2$. The $\alpha$-divergence was defined in [48]. To simplify the expression, $\beta$ is introduced to replace $\alpha$, which gives the result in Eq. (4). The advantage of this divergence is that both the ratio $y_{ik}/[AX]_{ik}$ and the difference $y_{ik} - [AX]_{ik}$ are measured. By adjusting the setting of $\beta$, the derived learning algorithms can obtain a trade-off between robustness and accuracy.
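To make the role of $\beta$ concrete, the sketch below evaluates a $\beta$-parameterized $\alpha$-divergence and checks that $\beta = 2$ gives the Pearson distance. It assumes the commonly used form $\frac{1}{\beta(\beta-1)}\sum_{ik}\left(y_{ik}^{\beta}[AX]_{ik}^{1-\beta} - \beta y_{ik} + (\beta-1)[AX]_{ik}\right)$, which may differ from the paper's Eq. (4) in normalization; the function names are hypothetical.

```python
import numpy as np


def alpha_divergence(Y, Q, beta):
    """Assumed form: sum(y^b * q^(1-b) - b*y + (b-1)*q) / (b*(b-1)), for b not in {0, 1}."""
    if beta in (0, 1):
        raise ValueError("beta = 0 and beta = 1 are limiting cases not handled in this sketch")
    terms = Y ** beta * Q ** (1 - beta) - beta * Y + (beta - 1) * Q
    return terms.sum() / (beta * (beta - 1))


def pearson_distance(Y, Q):
    """Pearson chi-square distance: 0.5 * sum((y - q)^2 / q)."""
    return 0.5 * ((Y - Q) ** 2 / Q).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.random((4, 6)) + 0.1    # strictly positive data
    Q = rng.random((4, 6)) + 0.1    # strictly positive approximation, standing in for AX
    print("beta = 2 :", alpha_divergence(Y, Q, 2.0))
    print("Pearson  :", pearson_distance(Y, Q))
    print("identical inputs:", alpha_divergence(Y, Y, 0.5))
```

Expanding the $\beta = 2$ case gives $\frac{1}{2}\sum (y^2/q - 2y + q) = \frac{1}{2}\sum (y-q)^2/q$, so the term combines the ratio $y/q$ with the difference $y - q$, as described in the text.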

III. THE PEARSON DISTANCE WITH MULTIPLE LOCAL MANIFOLD CONSTRAINTS
For Amari's $\alpha$-divergence $\mathrm{Div}_{\alpha}(Y, AX)$ in Eq. (4), studies have shown that the value of the parameter $\beta$ determines the robustness and efficiency of algorithms for retrieving features from different types of sample data. In the case $\beta = 2$, the $\alpha$-divergence is also called the Pearson distance. From this divergence, an interesting objective function can be defined for the development of learning algorithms [7]. We can combine the Pearson distance with specific constraint terms, such as first- or second-order local similarity, to extend the original objective function, from which improved learning algorithms can be derived to obtain further sparsity of components. In this paper, we consider the following optimization problem: minimize a specifically extended $\alpha$-divergence, namely the Pearson distance with $\beta = 2$ controlled by local invariance and linear embedding, which is defined in Eq. (5).

In the objective function $\mathrm{Div}_{\alpha M}(Y, AX)$ of Eq. (5), the parameters $\lambda_X$ and $\lambda_A$ determine the effect of regularization for different scenes or noise in sample data. For the extra terms in Eq. (5), we constrain $\mathrm{Div}_{\alpha}(x_j, x_k)$ and $\mathrm{Div}_{\alpha}(a_i, a_j)$ to enforce the local features of the decomposed factors. The detailed definitions of these terms are given in Section IV. Applying the gradient descent approach to Eq. (5) gives the updates in Eqs. (6) and (7), where $\eta_{kj}$ and $\delta_{ij}$ are called the learning rates or step size parameters. The partial derivatives of the elements in (6) and (7) are given in Eqs. (8) and (9).
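As a rough illustration of the gradient descent step, the sketch below differentiates only the Pearson data term $\frac{1}{2}\sum_{ik}(y_{ik} - [AX]_{ik})^2/[AX]_{ik}$ and takes one projected step. It is not the paper's Eqs. (6)-(9): the regularization terms weighted by $\lambda_X$ and $\lambda_A$ are omitted, the element-wise learning rates $\eta_{kj}$ and $\delta_{ij}$ are replaced by two scalars, and the positivity floor is an added assumption.

```python
import numpy as np


def pearson_distance(Y, A, X):
    """Pearson data term 0.5 * sum((y - q)^2 / q) with q = [AX]."""
    Q = A @ X
    return 0.5 * ((Y - Q) ** 2 / Q).sum()


def pearson_gradients(Y, A, X):
    """Return (dD/dA, dD/dX) for the Pearson data term."""
    Q = A @ X
    G = 0.5 * (1.0 - (Y / Q) ** 2)   # element-wise derivative dD/dq
    return G @ X.T, A.T @ G


def projected_gradient_step(Y, A, X, eta=1e-3, delta=1e-3, floor=1e-9):
    """One gradient step on A and X, clipped to stay strictly positive (assumed floor)."""
    grad_A, grad_X = pearson_gradients(Y, A, X)
    A_new = np.maximum(A - delta * grad_A, floor)
    X_new = np.maximum(X - eta * grad_X, floor)
    return A_new, X_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.random((6, 7)) + 0.5
    A = rng.random((6, 3)) + 0.5
    X = rng.random((3, 7)) + 0.5
    before = pearson_distance(Y, A, X)
    A, X = projected_gradient_step(Y, A, X)
    print("objective before / after one step:", before, pearson_distance(Y, A, X))
```

The derivative with respect to $[AX]_{ik}$ is $\frac{1}{2}\left(1 - y_{ik}^2/[AX]_{ik}^2\right)$, and the chain rule through $AX$ gives the two matrix products returned by `pearson_gradients`.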

IV. THE PROPOSED ALGORITHMS
In Eq. (5), how to define the extra terms $\mathrm{Div}_{\alpha}(x_j, x_k)$ and $\mathrm{Div}_{\alpha}(a_i, a_j)$ for more efficient manifold learning is an important research topic. LLE is one of the most interesting manifold learning algorithms for dimensionality reduction of linear data, and it has been extensively applied to image classification and clustering, text recognition, and multi-dimensional data visualization. LLE is built on a simple geometric intuition: in an $s$-vertex graph, nodes (or data points) are sampled from some underlying manifold. Assume that each node and its neighbors lie on or close to a locally linear patch of the manifold [18]; then the learned representations of these nodes in the feature subspace are also locally linearly related, with the same coefficients as in the sample data space. Manifold learning focuses on retrieving the geometric structures of images for feature recognition.
To apply the learning rules developed from Eq. (8) and Eq. (9) to manifold learning, we can define the regularization functions $\mathrm{Div}_{\alpha}(x_j, x_k)$ and $\mathrm{Div}_{\alpha}(a_i, a_j)$ with manifold constraints and impose them on the objective function in Eq. (5). From the manifold assumption in [18], [20], if the column vectors $y_j$ ($j = 1, 2, \ldots, N$) of matrix $Y$ are locally linearly related in some neighborhood, then their corresponding low-dimensional representation vectors $x_j = [x_{j1}, \ldots, x_{jn}]^T$ will also lie in some neighborhood and be linearly related to each other with the same coefficients $w_{jk}$ as in the high-dimensional space. The cost function of LLE is defined in a neighborhood with $s$ nearest neighbors, which assumes that the node $x_j$ is linearly related to all the nodes in the neighborhood. If $W$ is the weight matrix and $w_{jk}$ is the $jk$-th element of this matrix, then the linear reconstruction error in Eq. (10) was proposed as the objective function for LLE [18]. The objective function (10), subject to condition (11), has the optimal solution given in Eq. (12). To reduce the redundant solutions [33], [34], we assume that node connections exist only within their corresponding neighborhoods and that the sample space is separated into $l_1$ neighborhoods. Then, according to Eqs. (10) and (11), the locally linear regularizing term $LE(W) = d_{lm}(x_j, x_k)$ in the low-dimensional feature space can be defined as in Eq. (13), where $l_1$ is the number of neighborhoods and $s_1$ is the number of elements in a neighborhood. If we only have two elements in a neighborhood, then since $\sum_{k=1}^{n} w_{jk} = 1$, the linear relationship of elements in the neighborhood reduces to $x_j = x_k$. Thus, learning the local invariance of elements in a neighborhood is a special case of learning the locally linear representation of elements in a neighborhood. However, as mentioned above, if the size of a neighborhood is too small or too big, the locally linear embedding of elements may not exist in this neighborhood. Therefore, we cannot define the similarity measurement and the locally linear relation in the same neighborhood. The locally linear embedding and the local invariance should be considered at two different neighborhood scales.
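The reconstruction weights of LLE can be computed neighborhood by neighborhood. The sketch below assumes the standard formulation from [18]: minimize $\|y_j - \sum_k w_{jk} y_k\|^2$ over the $s$ nearest neighbors subject to $\sum_k w_{jk} = 1$, which reduces to solving a small linear system with the local Gram matrix. The function name `lle_weights` and the regularization constant `reg` are assumptions of this example, not taken from the paper.

```python
import numpy as np


def lle_weights(Y, s, reg=1e-3):
    """Return an N x N matrix W; row j reconstructs column y_j from its s nearest neighbors."""
    N = Y.shape[1]
    W = np.zeros((N, N))
    # Pairwise squared Euclidean distances between the columns of Y.
    sq = (Y ** 2).sum(axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)
    for j in range(N):
        order = np.argsort(dist[j])
        nbrs = order[order != j][:s]             # s nearest neighbors, excluding y_j itself
        Z = Y[:, nbrs] - Y[:, [j]]               # neighbors shifted so that y_j is the origin
        C = Z.T @ Z                              # local Gram matrix (s x s)
        C += reg * (np.trace(C) + 1e-12) * np.eye(s)   # assumed regularization for stability
        w = np.linalg.solve(C, np.ones(s))
        W[j, nbrs] = w / w.sum()                 # enforce the sum-to-one constraint
    return W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.random((5, 30))
    W = lle_weights(Y, s=4)
    print("row sums close to 1:", np.allclose(W.sum(axis=1), 1.0))
    # With a single neighbor the only admissible weight is 1, i.e. x_j = x_k.
    W1 = lle_weights(Y, s=1)
    print("weights with one neighbor:", np.unique(W1))
```

The single-neighbor call illustrates the special case discussed above: the sum-to-one constraint leaves no freedom, so the locally linear relation collapses to local invariance.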
From the divergence in Eq. (13), a robust locally linear regularization term can be defined as in Eq. (15) to develop the new learning algorithms.
Spectral graph theory and manifold learning theory show that nearest neighbor graph modelling can be applied to extract local geometric structures from a high-dimensional sample dataset. For a graph with $s$ vertices, we assume that each vertex represents a data point. For each data point $y_j$, we define its $p$ nearest neighbors and connect them with edges; $\tilde{w}_{jk}$ is defined to be the weight of the $jk$-th edge. For any two points $y_j$ and $y_k$ on a nearest neighbor graph, three different choices can be used to define the weight matrix $\tilde{W}$ on the graph [30], including 0-1 weighting, heat kernel weighting, and dot-product weighting.

On the other hand, the local invariance in the manifold space assumes that if two data points $y_j$ and $y_k$ in the sample data space are close in the intrinsic geometry of the data distribution, then the representations $x_j$, $x_k$ of these two points in the feature representation subspace are also close to each other. These points are defined to be in a neighborhood and are connected to each other with edges [30]. To build the general local feature extraction model with different scales, we define a smaller neighborhood in the linear representation subspace to capture the local similarity of nearest neighbors in the neighborhood. The similarity measurement $d_{sm}$ in the low-dimensional feature space is defined over these neighborhoods, where $l_2$ is the number of neighborhoods in the subspace and $\tilde{w}_{jk}$ is the weight of the edge linking two nodes in the graph, which is defined with $s_2$ nearest neighbors on the graph. The simplest approach to define the matrix $\tilde{W}$ is 0-1 weighting. Similar to the expression in Eq. (15), the robust similarity measurement can be defined as in Eq. (17) to develop our learning algorithms.
Combining Eq. (15) and Eq. (17) with Eq. (5), the multiscale representation objective function can be defined as in Eq. (18), where the definition of the divergence $d_{\alpha\_sm}(a_i, a_j)$ is similar to that of $d_{\alpha\_sm}(x_j, x_k)$. In Eq. (18), the locally linear embedding and local invariance measurements are defined in neighborhoods of different scales. Thus, the local feature extraction algorithms for multi-scale representations can be derived as in Eqs. (19) and (20). By varying the neighborhood sizes $p$ and $s$ and the learning parameters $\lambda_{lx}$ and $\lambda_{sx}$, the algorithms in (19) and (20) learn manifold features at different scales, which makes it possible to capture more complex geometric structures of images. In the following sections, we show the application of the proposed algorithms to dimensionality reduction and clustering of image data with different scenes.
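The nearest neighbor graph with 0-1 weighting used for the local invariance term can be sketched as follows. The example builds a symmetric weight matrix from the $p$ nearest neighbors of each column of $Y$ and checks numerically that $\mathrm{Tr}(XLX^T)$ equals $\frac{1}{2}\sum_{j,k}\|x_j - x_k\|^2\,\tilde{w}_{jk}$ for the Laplacian $L = D - \tilde{W}$. The helper names and the choice to symmetrize the graph are assumptions of this sketch; the robust divergence-based terms of the proposed model are not reproduced here.

```python
import numpy as np


def knn_01_weights(Y, p):
    """Symmetric 0-1 weights: w_jk = 1 if y_k is among the p nearest neighbors of y_j, or vice versa."""
    N = Y.shape[1]
    sq = (Y ** 2).sum(axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :p]         # indices of the p nearest neighbors per point
    W = np.zeros((N, N))
    W[np.repeat(np.arange(N), p), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)                      # symmetrize the graph


def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W


def smoothness(X, W):
    """0.5 * sum_jk ||x_j - x_k||^2 * w_jk, computed directly from the definition."""
    diff = X[:, :, None] - X[:, None, :]           # shape K x N x N
    return 0.5 * ((diff ** 2).sum(axis=0) * W).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.random((4, 12))                        # 12 data points in 4 dimensions
    X = rng.random((3, 12))                        # their 3-dimensional representations
    W = knn_01_weights(Y, p=3)
    print("trace form:", np.trace(X @ laplacian(W) @ X.T))
    print("pairwise form:", smoothness(X, W))
```

Calling `knn_01_weights` twice with two different values of `p` gives the two graphs at different scales that a multi-scale model needs, one for the locally linear term and a smaller one for local invariance.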

V. COMPLEXITY ANALYSIS
For the proposed algorithms, to simplify the complexity computation, we separate the measurements of local invariance and locally linear embedding into two parts: their difference and their ratio. In general, $O((K+1)(p+q)N^2)$ operations are needed to construct the $k$-nearest neighbor graphs, where $p$ is the neighborhood size of local invariance and $q$ is the neighborhood size of locally linear embedding. In the learning, the complexity of non-negative matrix factorization is $O(MNK^2)$, since computing $[AX]_{jk}$ requires $K$ additional operations. Assuming that $t$ is the number of iterations, the overall complexity of the algorithms is $O(t(K+1)(p+q)N^2 + tMNK^2)$. As $p$ and $q$ are very small numbers, the proposed algorithm has almost the same cost as the GNMF algorithm.
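To get a feel for the two terms in the complexity expression, the small helper below evaluates them for given sizes. The function name and the example values are illustrative assumptions: $M = 1024$ corresponds to a 32 × 32 image, $N = 1440$ to 20 objects with 72 images each, while $K$, $p$, $q$, and $t$ are chosen arbitrarily. The counts ignore the constants hidden by the $O(\cdot)$ notation.

```python
def operation_count(M, N, K, p, q, t):
    """Return the two terms of t(K+1)(p+q)N^2 + tMNK^2 as (graph_term, nmf_term)."""
    graph_term = t * (K + 1) * (p + q) * N ** 2
    nmf_term = t * M * N * K ** 2
    return graph_term, nmf_term


if __name__ == "__main__":
    # Assumed sizes for illustration only.
    graph_term, nmf_term = operation_count(M=1024, N=1440, K=20, p=2, q=5, t=1)
    print("graph term per iteration:", graph_term)
    print("NMF term per iteration:", nmf_term)
    print("ratio graph / NMF:", graph_term / nmf_term)
```

Both terms scale linearly with $t$, so their ratio depends only on the problem sizes and the neighborhood sizes.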

VI. EXPERIMENTAL RESULTS

A. COMPARED ALGORITHMS AND DATASETS
To show the efficiency of the proposed algorithms, in the experiments we compare our algorithms in learning rules (19) and (20) with the following algorithms:
Some early developed algorithms, including Normalized Cut (NCut) [35] and Lee and Seung's NMF [6].
Some recently developed methods, including GNMF, a graph regularized nonnegative matrix factorization method; LGNMF [36], an algorithm that employs a local centroid structured constraint to achieve a sparse representation $X$; RSNMF [37], a semi-supervised NMF introduced to obtain a robust discriminative representation; and MPMNMF [43], a multi-view clustering-based NMF algorithm that aims to seek the manifold measurements in the decomposed factors.
Deep WSF [26], a multi-layer algorithm that learns a hierarchy of hidden representations so that the final lower-dimensional representation of the data can be extracted with higher quality. The drawback of this model is that the datasets must come with mixed attribute knowledge, such as pose, expression, and identity.
GR-DNN [27], a deep neural network with traditional auto-encoding that gains the ability to retrieve the local geometric structure of images. In this model, only the local invariance feature of images is learnt.

The images that we have selected for the tests include the following four datasets, and each image is resized to 32 × 32 gray scale for the neural network training and clustering.
COIL20 database, which contains 20 different sample objects. In this set, each object has 72 images, which were taken 5 degrees apart with the object rotating on a turntable [44]. Each image in this dataset contains only the target object.
Labeled Faces in the Wild (LFW), a database of face photographs designed for studying unconstrained face recognition. The dataset contains more than 13,000 images of faces collected from the web; 1680 of the people pictured have two or more distinct photos in the dataset [45]. We select this dataset for image clustering to show the effectiveness of the proposed algorithms in a complex scene, since the facial images in this dataset have different occlusions in the foreground or different persons and scenes in the background.
Unconstrained Facial Images (UFI) large database. We only select the training dataset for the tests. The total number of subjects in this dataset is 530, and the average number of training images per person is 8.2. The original size of the images is 384 × 384 pixels. The images in this set have complex backgrounds, the face size differs significantly, and the faces are not localized [46]. This dataset is selected to show the effectiveness of the proposed algorithms in a more complex scene.
The cropped UFI images are selected from the UFI large dataset. This dataset contains images of 605 people with an average of 7.1 images per person in the training set. The images are cropped to an original size of 128 × 128 pixels containing only the face of the selected persons. Similar to the images in the COIL20 dataset, each image in this dataset contains only the target face, but the pose of the facial images may vary arbitrarily.
Caltech 101 dataset, which contains pictures of objects belonging to 101 categories, with about 40 to 800 images per category. Most categories have about 50 images. The original size of each image is roughly 300 × 200 pixels [47].
We select these four datasets to test the efficiency of our algorithms since they cover various scenes. Fig. 2, Fig. 3, and Fig. 4 show original images selected from the LFW, Caltech101, and UFI large datasets respectively, which clearly show the complexity of the image data in different scenes. In Fig. 2, Heidi's hair style varies, and there are different people or objects in the background of Carlos. In Fig. 3, we can see that the images have complex backgrounds and that the sizes of the target objects vary. Some target images are so blurred that even the human eye cannot identify them clearly. In Fig. 4, each person's facial images are taken in different scenes. The size and the pose of the target images are also quite different.
In the tests, the model is trained at three scales. The NMF model is used to learn the global features of sample data for parts-based representation. The locally linear feature and the local invariant feature are learned by setting different neighborhood sizes. Test results show that the proposed algorithms achieve their best performance under some specific settings.

B. CONVERGENCE STUDY
The convergence of the objective function and the learning rules largely determines the accuracy of the proposed algorithms. Thus, it is necessary to study the convergence properties of the new learning algorithms when they are used to cluster data. Since $a_{ij}$ is normalized in the learning, we always have $0 \leq a_{ij} \leq 1$, so element $a_{ij}$ does not diverge. We therefore only discuss the convergence properties of the objective function and of the components $x_{jk}$ ($j = 1, 2, \ldots, K$, $k = 1, 2, \ldots, N$) in the learning. Fig. 5, Fig. 6, and Fig. 7 show the variations of the objective function in the clustering of the UFI large dataset. From Fig. 5, we can see that in the cases $\lambda_{sx} = 15$, 10, 0.1, 0.001, and 0.00001 (from left to right), the objective function $\mathrm{Div}_{\alpha M}(Y, AX)$ changes from divergent to convergent, where $\lambda_{lx}$ is fixed at $\lambda_{lx} = 95$. These figures show that the proposed objective function diverges in the cases $\lambda_{lx} = 95$, $\lambda_{sx} = 15, 10$. Evidently, when $\lambda_{sx}$ is too big, the objective function diverges. Thus, the component $x_{jk}$ converges only for smaller $\lambda_{sx}$. Fig. 6 shows that in the cases $\lambda_{lx} = 0.00001$, $\lambda_{sx} = 10$, 1, 0.01, 0.001, and 0.0001, all the curves of $\mathrm{Div}_{\alpha M}(Y, AX)$ converge. Meanwhile, the convergence becomes faster for smaller $\lambda$ values. From Fig. 7, we can see that when $\lambda_{sx} = 0.01$ and $\lambda_{lx} = 10, 0.1$, the objective function is divergent. As $\lambda_{lx}$ decreases, the objective function begins to converge. When $\lambda_{sx}$ is fixed, a smaller $\lambda_{lx}$ leads to better convergence of the objective function. Test results in Fig. 7 show that when $\lambda_{lx} \geq 0.01$, the objective function may diverge; decreasing $\lambda_{sx}$ then makes the learning convergent. In general, the convergence of the objective function and learning updates does not guarantee that the learning obtains the best clustering result, but if the objective function and/or learning updates diverge, the clustering certainly cannot obtain the best result. Fig. 10 shows that the best clustering result on the UFI set is obtained only when $\lambda_{lx} = 95$, $\lambda_{sx} = 0.00001$. In this case, both the objective function and the learning algorithm are convergent. Thus, the setting of parameters is important for the clustering.

Fig. 8 and Fig. 9 show the variations of different elements $x_{jk}$ in the clustering of image data, which indicate that the proposed algorithms can diverge, since the curves in the first sub-figure of Fig. 8 keep rising as the iterations increase. To study the convergence properties of this algorithm, the elements $x_{jk}$ shown are selected arbitrarily from $X$. This figure also shows that as $\lambda_{sx}$ decreases, the learning gradually becomes convergent. Fig. 9 shows the non-divergence of the algorithms for relatively small $\lambda_{lx}$ and $\lambda_{sx}$ settings. The left sub-figure in Fig. 9 shows that the learning converges at about iteration 100, but in the two right sub-figures, although the learning of $x_{jk}$ does not diverge, the curves do not converge to fixed points; they oscillate during the learning. Test results show that when the elements oscillate or diverge, the algorithm cannot obtain high clustering accuracy. On the other hand, comparing the results in Fig. 6 and Fig. 9 for the case $\lambda_{lx} = \lambda_{sx} = 0.00001$, the objective function converges but the elements $x_{jk}$ oscillate, which indicates that the convergence of the objective function does not guarantee the convergence of the learning algorithms.
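The three behaviors described in this study (settling to a fixed point, rising persistently, and oscillating) can be told apart from a recorded sequence of objective values or of a single element $x_{jk}$. The helper below is a hypothetical heuristic written for illustration; the window length, the tolerance, and the labels are assumptions, and it is not the procedure used to produce the figures.

```python
def classify_trajectory(values, window=20, tol=1e-6):
    """Label the last `window` steps as 'converged', 'increasing', 'decreasing', or 'oscillating'."""
    if len(values) < window + 1:
        raise ValueError("need more than `window` values")
    tail = values[-(window + 1):]
    steps = [b - a for a, b in zip(tail, tail[1:])]
    scale = max(1.0, max(abs(v) for v in tail))    # relative tolerance for large values
    if all(abs(s) <= tol * scale for s in steps):
        return "converged"
    if all(s > 0 for s in steps):
        return "increasing"                        # persistent growth, read as divergence in the text
    if all(s < 0 for s in steps):
        return "decreasing"                        # still descending, not yet settled
    return "oscillating"


if __name__ == "__main__":
    settled = [1 + 0.5 ** n for n in range(60)]
    growing = [1.05 ** n for n in range(60)]
    bouncing = [1 + 0.1 * (-1) ** n for n in range(60)]
    for name, seq in [("settled", settled), ("growing", growing), ("bouncing", bouncing)]:
        print(name, "->", classify_trajectory(seq))
```

Applying such a check to both the objective values and a sample of elements $x_{jk}$ would separate the case noted above, in which the objective settles while individual elements keep oscillating.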

C. CLUSTERING RESULTS
The experiments were run on the Windows 10 operating system with an Intel(R) Core(TM) i5-7200 CPU with 2.

In Fig. 11, sub-figures (a) and (b) show the clustering results of images (we only select 100 clustered images to show the test results, 10 images for each object), and sub-figure (c) shows the variation curves of $x_{jk}$ during feature extraction on the COIL20 dataset when $\lambda_{lx} = \lambda_{sx} = 0.00001$. The results indicate that all the images are clustered correctly. The right sub-figure shows that all the selected $x_{jk}$ converge after 30 learning iterations. This figure is provided to show the relationship between the convergence and the clustering results. By contrast, the convergence curves in Fig. 9 show that when $\lambda_{lx} = \lambda_{sx} = 0.00001$, the learning does not diverge but the curves oscillate. Test results show that in this case, the clustering result on the UFI large dataset is not the best one. The best clustering results on this dataset are obtained when $\lambda_{lx} = 95$, $\lambda_{sx} = 0.00001$. The last sub-figure in Fig. 8 shows that in this case, all $x_{jk}$ converge gradually. Thus, the convergence of learning is significantly related to the clustering accuracy.

Fig. 12 shows the clustering results on the UFI large data when the feature data are mapped into a two-dimensional space, from which we can see that, in general, most images are assigned to the correct clusters, but a few of them are in wrong clusters. Some images are not in any cluster; these are the outliers of the clustering. Since it is time-consuming to label and reduce all the image data for low-dimensional mapping and visualization, we only select about 600 images in 20 classes to show the test results. In fact, the results in Fig. 10 are also a visualization of clustered images on the UFI large dataset. The only difference is that the images are lined up cluster by cluster in a one-dimensional space rather than in a two-dimensional data space.

From left to right, Fig. 13 shows the relationship of clustering accuracy with the parameter $\lambda_{sx}$ and the neighborhood size $r_1$ (locally linear embedding) on the UFI large, LFW, and Caltech datasets respectively (here the neighborhood size indicates the radius of a neighborhood). In this figure, we fix $\lambda_{lx} = 95$ and the neighborhood size $r_2 = 2$ (local invariance). The sub-figures in the first row show that when the similarity measurement parameter is fixed, the proposed algorithm obtains the best performance at $\lambda_{sx} = 0.00001$. Increasing or decreasing this parameter degrades the accuracy of the algorithm. The sub-figures in the second row show that for the similarity neighborhood size $r_2 = 2$, the best performance is obtained at the linear relation neighborhood size $r_1 = 5$. Since each person in the UFI large dataset has about 8.2 facial images, treating all the images of each person as lying in one locally linear patch may give the best clustering result. Thus, a neighborhood of radius $r_1 = 5$ may contain all facial images of one person. For the other two sets, the number of images for each object is a multiple of ten, so they can also obtain the best performance in this case.

Table 1 and Table 2 show the average clustering accuracy (ACC) and normalized mutual information (NMI) over ten runs of each algorithm on each dataset. From the two tables, it is clear that, compared with the state-of-the-art algorithms, the proposed algorithms obtain the best performance in terms of ACC and NMI in complex scenes, although the MPMNMF and GR-DNN algorithms achieve results close to or better than those of the proposed algorithms on the COIL20 and cropped UFI datasets. Since these two sets do not have complex backgrounds, our algorithms evidently have the advantage of robustness in dealing with various scenes.
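The two evaluation measures, ACC and NMI, can be computed from the ground-truth labels and the cluster assignments as sketched below. The paper does not state which variants it uses, so two assumptions are made: ACC takes the best one-to-one mapping from clusters to classes, found here by brute force over permutations (practical only for a handful of clusters; an assignment solver such as the Hungarian algorithm is the usual choice for larger problems), and NMI is normalized by the geometric mean of the two entropies.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best fraction of samples matched over all one-to-one maps from clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    counts = Counter(zip(cluster_ids, true_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(counts[(c, y)] for c, y in zip(clusters, assigned))
        best = max(best, hits)
    return best / len(true_labels)


def normalized_mutual_information(true_labels, cluster_ids):
    """NMI = I(U; V) / sqrt(H(U) * H(V)), using natural logarithms."""
    n = len(true_labels)
    pu = Counter(true_labels)
    pv = Counter(cluster_ids)
    joint = Counter(zip(true_labels, cluster_ids))
    mi = sum(c / n * math.log(c * n / (pu[u] * pv[v])) for (u, v), c in joint.items())
    hu = -sum(c / n * math.log(c / n) for c in pu.values())
    hv = -sum(c / n * math.log(c / n) for c in pv.values())
    if hu == 0.0 or hv == 0.0:
        return 1.0 if hu == hv else 0.0            # degenerate single-class or single-cluster case
    return mi / math.sqrt(hu * hv)


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 0]
    found = [0, 0, 0, 1, 1, 1]
    print("ACC:", clustering_accuracy(truth, found))
    print("NMI:", normalized_mutual_information(truth, found))
```

Both measures are invariant to how the clusters are numbered, which is why a relabeled but otherwise perfect clustering still scores 1 on each.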

VII. CONCLUSION
In this paper, multi-scale manifold constrained NMF algorithms are proposed to exploit the intrinsic geometric features of images in different scenes, and they achieve state-of-the-art performance in image clustering. The experimental results confirm the efficiency of the proposed model. The analysis and the test results also show that the convergence of the objective function does not guarantee the convergence of its corresponding learning algorithms. The convergence of the objective function and of the derived learning algorithms is significantly related to the accuracy of image clustering in various scenes. In the future, it will be necessary to explore the deep structures of manifold learning for different applications. In general, the new design of manifold learning with flexible neighborhood scales provides an efficient approach for extracting intrinsic geometric features of images in complex scenes.