Decentralized Principal Component Analysis by Integrating Lagrange Programming Neural Networks With Alternating Direction Method of Multipliers

Conventional centralized methods use the entire data set to estimate the projection matrix of a dimensionality reduction problem, which is not suitable for network environments where sensitive or private data are stored or where there is no fusion center. In this paper, we develop a decentralized principal component analysis (DPCA) method that deals with the distributed data without sharing or collecting them together. The main contributions of this paper are as follows: i) the proposed DPCA method only requires communication of projection vector information among neighboring nodes rather than communication of the distributed data; ii) the decentralized projection vector determination problem is replaced by a set of subproblems with consensus constraints, and the excellent processing capability of the alternating direction method of multipliers (ADMM) is used to obtain consistent projection vectors; iii) in particular, Lagrange programming neural networks (LPNN) are integrated to solve the projection vector determination problems with their unitary and orthogonal constraints; and iv) a convergence analysis of the proposed optimization problem is provided to ensure that the projection vectors obtained by the distributed method converge to those of the centralized one. Simulations and experiments show that the proposed algorithm is a viable decentralized principal component analysis approach and is suitable for the network environment.


I. INTRODUCTION
Networks [1]-[3] are desired in diverse applications, such as security and surveillance, disaster response, and environmental modeling [4]. The conventional centralized network scenario stores data in a fusion center. However, the storage and computation load of the fusion center grows with the amount of data, possibly exceeding the available system resources. When the data contain sensitive or private information, it is impossible to collect or share them from other nodes. Furthermore, the entire network may collapse with the breakdown of the fusion center, which indicates that the centralized scenario lacks robustness. Therefore, a decentralized scenario, where each network node can extract useful information by implementing local computation, communication, and storage operations, is often considered in many applications [5]-[8].
Principal component analysis (PCA) [9], which extracts low-dimensional subspaces from high-dimensional data, is one of the most widely used methods for dimensionality reduction in data science. Over the past decades, PCA has been used in many applications, e.g., dimensionality reduction for large-scale data, subspace learning for machine learning applications, and robust low-rank matrix recovery from missing samples or grossly corrupted data [10]-[13]. Moreover, many dimensionality reduction methods using tensor data have been proposed to preserve the inherent structure and correlation of the original data, e.g., multilinear discriminant analysis (MDA) [14], concurrent subspace analysis (CSA) [15], discriminant locally linear embedding (DLLE) [16], and marginal Fisher analysis (MFA) [17]. However, these advances have been mostly limited to processing all data at the same time in a centralized mode and offer little support for the large-scale, multi-modal, and multi-relational data sets encountered in a variety of applications, including chemometrics [18], hyperspectral imaging [19], high-resolution videos [20], biometrics [21], [22], and social network analysis [23], [24].
To solve the problem of dimensionality reduction in the distributed environment, a directed distributed PCA (DDPCA) [25] exploits a directed acyclic graphical model perspective of the network of measurements. However, the DDPCA algorithm assumes a known sparsity structure in the Cholesky factor of the concentration matrix and relaxes the original problem in an approximate iterative manner, so it is not suitable for general situations and its accuracy cannot be guaranteed. Additionally, a distributed PCA algorithm [26] relies on in-network processing and does not require evaluation of the whole covariance matrix at a central location, or across all sensors. However, this distributed PCA ignores the unitary and orthogonal properties of the projection vectors, and the dimensionality-reduced data also need to be shared with the neighboring nodes. Recently, a decentralized algorithm [27], based on the integration of a dimensionality reduction step into an asynchronous Gossip consensus estimation protocol, uses all nodes of a network to compute PCA over the complete data. Although this decentralized algorithm reduces local computations and communication cost, all nodes still need to repeatedly send their current sum and weight to randomly selected peers, which may leak sensitive information.
On the other hand, optimization methods such as the alternating direction method of multipliers (ADMM) [28], [29] and the Lagrange programming neural network (LPNN) [30], [31] have received much attention. The former decomposes a constrained convex optimization problem into multiple smaller subproblems and provides superior convergence properties, and has been widely used in image reconstruction [32], distributed learning [6], sensor network localization [33], distributed dictionary learning [34], and decentralized dimensionality reduction [35]. The latter is an analog neural computational technique for solving nonlinear constrained optimization problems according to the Lagrange multiplier theory [36]. The network contains two types of neurons, namely variable neurons and Lagrangian neurons: the former seek a minimum point of the cost function, while the latter lead the dynamic trajectory into the feasible region, so that the network settles at an equilibrium point corresponding to the solution.
In this paper, we develop a decentralized principal component analysis (DPCA) method for data distributed across the nodes of a network. It is worthwhile to highlight several aspects of the proposed algorithm here: i) the presented DPCA method can deal with the distributed data without sharing or collecting the entire data together, which suits network environments where sensitive or private data are stored or there is no fusion center; ii) we convert the decentralized projection vector determination problem into a set of subproblems with consensus constraints and then use only local computations and information communications among neighboring nodes. It is worth noting that the information communicated among neighboring nodes consists of the projection vectors, which differs from other distributed methods that exchange information about the data themselves; iii) to obtain consistent projection vectors, we introduce new auxiliary variables and determine the optimization variables alternately and iteratively by ADMM; iv) additionally, LPNN is used to solve the nonlinear programming problems with their numerous unitary and orthogonal constraints; and v) we also prove that the optimization problem is a global and convex one, which ensures that the projection vectors obtained by the distributed method converge to those of the centralized one.
The rest of this paper is organized as follows. The problem is formulated in Section II. The decentralized principal component analysis approach is developed in Section III. Experimental results are presented in Section IV. Conclusions are drawn in Section V.
Notation: Vectors and matrices are denoted by lowercase and uppercase letters, respectively. $\|\cdot\|_F$ denotes the Frobenius norm of a vector or matrix, $(\cdot)^T$ denotes the transpose of a matrix or vector, and $0_{m\times n}$ and $I_n$ denote the $m \times n$ zero matrix and the $n \times n$ identity matrix, respectively. $\mathrm{trace}\{\cdot\}$ denotes the trace of a matrix. Other mathematical symbols are defined after their first appearance.

II. PROBLEM FORMULATION
Assume that there are $K$ submatrices $\{Y_1, Y_2, \ldots, Y_K\}$ distributed across the $K$ nodes of a network, where $Y_k$ has dimensions $n_r \times n_c(k)$, with $n_r$ and $n_c(k)$ denoting the dimension and the number of samples in $Y_k$, respectively. It is worth noting that $n_c(k)$ may differ from node to node, since the number of samples is not required to be the same at each node; each node computes independently with its locally stored data and exchanges local information with its neighbors [1]-[3]. When all the submatrices can be collected by a fusion center, centralized schemes [9]-[13] can be applied to perform principal component analysis on them.
However, in some particular cases where sensitive or private data are stored or there is no fusion center, as mentioned in Section I, it is difficult to gather these distributed submatrices and implement PCA on them [5]-[8]. Therefore, this paper proposes a decentralized counterpart that deals with the distributed submatrix data through local computations and communications among neighboring nodes. In this way, all nodes can attain approximately, or even exactly, the same performance as the centralized counterpart (if it existed).
To meet the distributed processing requirements of such networks, this section develops a decentralized principal component analysis algorithm that uses the submatrices $\{Y_1, Y_2, \ldots, Y_K\}$ distributed across the $K$ nodes only by means of local computations and communications among neighboring nodes, rather than collecting the private information $\{Y_1, Y_2, \ldots, Y_K\}$ together.
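As a concrete illustration of this data layout, the following minimal sketch (node count, dimensions, and variable names are purely illustrative, not from the paper) partitions a sample matrix column-wise across $K$ nodes and checks that the local products $Y_k Y_k^T$ sum to the global $YY^T$ that centralized PCA would use; this identity is what lets the decentralized formulation below work only with locally stored blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_r = 5, 16                       # illustrative: 5 nodes, 16-dimensional samples
n_c = [100, 80, 120, 90, 110]        # each node may hold a different number of samples

# Column-wise partition: node k privately stores Y_k of size n_r x n_c(k).
Y_blocks = [rng.standard_normal((n_r, nk)) for nk in n_c]

# The local second-moment matrices Y_k Y_k^T never leave their nodes,
# yet they sum to the global matrix Y Y^T used by centralized PCA.
local_moments = [Yk @ Yk.T for Yk in Y_blocks]
Y = np.hstack(Y_blocks)
assert np.allclose(sum(local_moments), Y @ Y.T)
```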

A. PRINCIPAL COMPONENT ANALYSIS DETERMINATION
Principal component analysis (PCA) is a typical method for dimensionality reduction. The collected data $Y = [Y_1, Y_2, \ldots, Y_K] \in \mathbb{R}^{n_r \times n_c}$ can be represented as
$$Y \approx U U^T Y, \quad (1)$$
where $n_c$ represents the total number of samples, $U = [u_1, u_2, \cdots, u_L] \in \mathbb{R}^{n_r \times L}$ is the orthogonal projection matrix with unitary (unit-norm) columns, and $L < n_r$ denotes the number of projection vectors.
Then (1) can be rewritten as the following optimization problem:
$$\min_{U} \ \|Y - U U^T Y\|_F^2 \quad \text{s.t.}\ U^T U = I_L. \quad (2)$$
Therefore, we obtain the variables $\{u_i\}$ in an iterative manner for $i = 1, \cdots, L$.
For the first projection vector $u_1$ of $U$, it can be determined by solving the following optimization problem:
$$\min_{u} \ -u^T Y Y^T u \quad \text{s.t.}\ u^T u = 1, \quad (3)$$
where the corresponding proof can be found in Appendix A. Additionally, we can determine the $n$-th projection vector by
$$\min_{u} \ -u^T Y Y^T u \quad \text{s.t.}\ u^T u = 1,\ u^T u_j = 0,\ j = 1, \cdots, n-1, \quad (4)$$
where the proof is shown in Appendix B.
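For reference, the following sketch realizes the sequential determination of projection vectors described by (3) and (4) in the centralized setting, using a standard eigensolver together with the deflation idea of Appendix B; it is not the LPNN-based procedure developed later, and the data and dimensions are illustrative.

```python
import numpy as np

def pca_projection_vectors(Y, L):
    """Sequentially determine L projection vectors in the spirit of (3)-(4):
    each vector maximizes u^T (Y Y^T) u subject to unit norm and orthogonality
    to the previously found vectors, realized here by eigen-decomposition
    with deflation (the Appendix B argument)."""
    R = Y @ Y.T                               # n_r x n_r second-moment matrix
    U = np.zeros((Y.shape[0], L))
    for n in range(L):
        vals, vecs = np.linalg.eigh(R)        # eigenvalues in ascending order
        u = vecs[:, -1]                       # eigenvector of the largest eigenvalue
        U[:, n] = u
        R = R - vals[-1] * np.outer(u, u)     # deflate the found component
    return U

# quick illustrative check: the recovered vectors are orthonormal
Y = np.random.default_rng(1).standard_normal((8, 50))
U = pca_projection_vectors(Y, 3)
print(np.round(U.T @ U, 6))                   # approximately the 3 x 3 identity
```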

B. DECENTRALIZED PRINCIPAL COMPONENT ANALYSIS DETERMINATION
It seems that we can determine the components of $U$ sequentially according to the aforementioned strategy. However, the data $Y = [Y_1\ Y_2\ \cdots\ Y_K]$ are distributed across the nodes of the network and cannot be gathered together, because sensitive or private data are stored locally or there is no fusion center. Therefore, the projection vector $u$ must be determined from the local data at each node rather than from the entire collected data.
Since (3) is just a special case of (4), we consider the distributed solution of (4) in the rest of this paper. We rewrite (4) in a separated matrix form as
$$\min_{u} \ -\sum_{k=1}^{K} u^T Y_k Y_k^T u \quad \text{s.t.}\ u^T u = 1,\ u^T u_j = 0,\ j = 1, \cdots, n-1. \quad (5)$$
To operate on all the distributed $\{Y_k\}_{k=1}^{K}$ directly, we replace $u$ in (5) with $K$ copies, denoted as local variables $\{u^1, u^2, \ldots, u^K\}$ distributed across the $K$ nodes, yielding the following equivalent form (here we omit the constraints in (5) for representation convenience):
$$\min_{\{u^k\}} \ -\sum_{k=1}^{K} (u^k)^T Y_k Y_k^T u^k \quad \text{s.t.}\ u^k = u^{k'},\ k' \in Ne(k),\ k = 1, \cdots, K, \quad (6)$$
where $Ne(k)$ stands for the neighbors of the $k$th node for $k = 1, \cdots, K$, and the equality constraints are the so-called consensus constraints [6].
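The sketch below illustrates the consensus reformulation (6): each node keeps a local copy $u^k$, evaluates only its own term of the separable objective, and the consensus constraints are measured as disagreement between neighboring copies. The ring topology and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
K, n_r = 4, 6
Y_blocks = [rng.standard_normal((n_r, 30)) for _ in range(K)]

# One local copy u^k per node; consensus requires u^k = u^{k'} for neighboring nodes.
u_local = [rng.standard_normal(n_r) for _ in range(K)]
u_local = [u / np.linalg.norm(u) for u in u_local]

# Illustrative ring topology: node k communicates with k-1 and k+1 (mod K).
Ne = {k: [(k - 1) % K, (k + 1) % K] for k in range(K)}

# Separable objective of (6): each node evaluates only its own term ...
local_obj = [-(u_local[k] @ (Y_blocks[k] @ Y_blocks[k].T) @ u_local[k]) for k in range(K)]
# ... and the consensus constraints are the disagreements between neighboring copies.
consensus_gap = max(np.linalg.norm(u_local[k] - u_local[kp])
                    for k in range(K) for kp in Ne[k])
print(sum(local_obj), consensus_gap)          # the gap vanishes once the copies agree
```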

III. ALGORITHM DEVELOPMENT
In this section, we explore a solution to the problem of reaching consensus on the local variables $\{u^1, u^2, \ldots, u^K\}$, $k = 1, \cdots, K$.

A. PROPOSED SOLUTION
To update $u^k$ and $u^{k'}$ separately, we introduce new vectors $\{w^k\}$ together with constraints equivalent to those of (6):
$$\min_{\{u^k\},\{w^k\}} \ -\sum_{k=1}^{K} (u^k)^T Y_k Y_k^T u^k \quad \text{s.t.}\ u^k = w^{k'},\ u^{k'} = w^{k},\ k' \in Ne(k). \quad (7)$$
Note that (7) is an optimization problem with equality constraints. Therefore, the augmented Lagrangian of (7) can be written in an unconstrained form according to ADMM [28], [29]:
$$L_\rho = -\sum_{k=1}^{K} (u^k)^T Y_k Y_k^T u^k + \sum_{k=1}^{K} \sum_{k' \in Ne(k)} \Big[ (\lambda^{k,k'})^T (u^k - w^{k'}) + (\lambda^{k',k})^T (u^{k'} - w^{k}) + \frac{\rho}{2} \|u^k - w^{k'}\|_F^2 + \frac{\rho}{2} \|u^{k'} - w^{k}\|_F^2 \Big], \quad (8)$$
where $\lambda^{k,k'}$ and $\lambda^{k',k}$ are the Lagrange multipliers corresponding to the constraints $u^k = w^{k'}$ and $u^{k'} = w^{k}$ for $k = 1, \ldots, K$ and $k' \in Ne(k)$, and $\rho > 0$ is the augmented Lagrangian parameter. Then $\{u^k, w^k, \lambda^{k,k'}, \lambda^{k',k}\}$ can be determined by an iterative method.
With given $\{w^k(t), \lambda^{k,k'}(t), \lambda^{k',k}(t)\}$, we solve for $\{u^k(t+1)\}$ from the following optimization problem:
$$\{u^k(t+1)\} = \arg\min_{\{u^k\}} \ L_\rho\big(\{u^k\}, \{w^k(t)\}, \{\lambda^{k,k'}(t)\}, \{\lambda^{k',k}(t)\}\big), \quad (9)$$
where $t$ is the iteration number.
Then $u^k(t+1)$ can be obtained in a separate and parallel form [28], [29], for $k = 1, \ldots, K$. Therefore, taking the unitary and orthogonality constraints into account, we divide (9) into the $K$ subproblems in (10), in which the Lagrange multipliers $\lambda^{k,k'}(t)$ and the neighbors' auxiliary variables $w^{k'}(t)$ are held fixed. Similarly, with given $\{u^k(t+1)\}$, the auxiliary variables are obtained from (11), which can also be solved in a separate and parallel form, yielding (12). Additionally, the symmetry of the undirected graph is used in (12). Therefore, $\{u^k, w^k\}$ can be determined only by processing the local data $\{Y_k\}$ and exchanging $\{w^k(t), u^k(t)\}$ with the neighbors, which is a competitive communication pattern compared with the distributed PCA methods [25]-[27].
With given $\{u^k(t+1), w^k(t+1), \lambda^{k,k'}(t), \lambda^{k',k}(t)\}$ and considering the relationships among the multipliers $\{\lambda^{k,k'}, \lambda^{k',k}\}$, the combined multipliers $\lambda^k$ and $\bar\lambda^k$ are updated according to (13) and (14), where $\mathrm{Card}(k)$ denotes the number of neighbors of the $k$th node, $k = 1, \ldots, K$, $k' \in Ne(k)$. However, (10) and (12) are complex programming problems. In particular, (10) can be represented as follows ((12) has a similar form and is thus omitted here):
$$\min_{u^k} \ (u^k)^T A^k u^k + (b^k)^T u^k \quad \text{s.t.}\ (u^k)^T u^k = 1,\ (u_j^k)^T u^k = 0,\ j = 1, \cdots, n-1, \quad (15)$$
where $A^k$ and $b^k$ are defined by (16) and (17), respectively. For the problem in (15), we can construct the following Lagrangian function according to the LPNN approaches [30], [31]:
$$L(u^k, \xi_u, \xi_o) = (u^k)^T A^k u^k + (b^k)^T u^k + \xi_u \big((u^k)^T u^k - 1\big) + \sum_{j=1}^{n-1} \xi_o(j)\,(u_j^k)^T u^k, \quad (18)$$
where $u^k$ holds the state variables, while $\xi_u$ and $\{\xi_o(j)\}_{j=1}^{n-1}$ are the Lagrange variables. That is, there are $n_r$ state-variable neurons to hold the state variables $u^k$ and $n$ Lagrangian neurons to hold the Lagrangian variables.
With (18), the dynamics of the network are given by
$$\frac{du^k}{dt_1} = -\frac{\partial L}{\partial u^k}, \quad (19)$$
$$\frac{d\xi_u}{dt_1} = \frac{\partial L}{\partial \xi_u}, \quad (20)$$
$$\frac{d\xi_o(j)}{dt_1} = \frac{\partial L}{\partial \xi_o(j)}, \quad j = 1, \cdots, n-1, \quad (21)$$
where $t_1$ is the time variable (a numerical sketch of these dynamics is given after the lemma below). Then the decentralized principal component analysis (DPCA) method can be described as follows:
-----------------------
Decentralized Principal Component Analysis (DPCA) Method
Initialization: set the same initial normalized $u$ and $\rho$ for all nodes;
for $n = 1, \cdots, L$
  for $t = 1, \cdots, T_{\max}$
    At the $k$th node, solve for the local eigenvector $u_n^k(t+1)$ of (10) using the LPNN steps shown in (18)-(21), for $k = 1, \cdots, K$;
    Broadcast $u_n^k(t+1)$ to the related neighbors;
    At the $k$th node, solve for the local eigenvector $w_n^k(t+1)$ of (12) using similar LPNN steps to (18)-(21), for $k = 1, \cdots, K$;
    Broadcast $w_n^k(t+1)$ to the related neighbors;
    Update $\lambda^k(t+1)$ using (13);
    Update $\bar\lambda^k(t+1)$ using (14);
  end for $t$ (or stop once the prespecified criterion is satisfied)
end for $n$
-----------------------
Note that LPNN is an analog neural computational technique for solving nonlinear constrained optimization problems, and some basic properties are required when the LPNN method is used to solve (15). Then, by analyzing (15)-(21), we obtain the following lemma:
Lemma: If the gradient vectors of the constraints at an equilibrium point are linearly independent, then the requirement for the local stability of LPNN is met [30], [31].
Proof: For our problem, the gradient vector of the constraint $(u^k)^T u^k = 1$ at an equilibrium point $(u^k)^*$ is $2(u^k)^*$, and the gradient vector of each remaining constraint $(u_j^k)^T u^k = 0$ at this point is $u_j^k$, $j = 1, \cdots, n-1$. Since the equilibrium point satisfies $(u_j^k)^T (u^k)^* = 0$ and the previously determined vectors $u_j^k$ are orthonormal, the vectors $2(u^k)^*$ and $u_j^k$, $j = 1, \cdots, n-1$, are orthogonal to one another and hence linearly independent. The proof is complete.
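A minimal numerical sketch of the LPNN dynamics (18)-(21) for a subproblem of the form (15) is given below. It assumes the standard LPNN rule $du/dt = -\nabla_u L$, $d\xi/dt = +\nabla_\xi L$ with a simple forward-Euler discretization; the step size, iteration count, and the placeholder inputs A, b, and U_prev are illustrative assumptions, the stabilizing augmentation terms often added to LPNN are omitted for brevity, and whether the trajectory settles depends on the problem conditioning.

```python
import numpy as np

def lpnn_solve(A, b, U_prev, n_steps=20000, dt=1e-3, seed=3):
    """Sketch of LPNN dynamics for a subproblem of the form (15):
        min_u  u^T A u + b^T u   s.t.  u^T u = 1,  u_j^T u = 0,
    integrated with forward Euler.  Variable neurons follow du/dt = -grad_u L,
    Lagrangian neurons follow dxi/dt = +grad_xi L for the Lagrangian (18)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(A.shape[0])
    u /= np.linalg.norm(u)
    xi_u = 0.0                                 # multiplier of the unit-norm constraint
    xi_o = np.zeros(U_prev.shape[1])           # multipliers of the orthogonality constraints
    for _ in range(n_steps):
        grad_u = 2.0 * A @ u + b + 2.0 * xi_u * u + U_prev @ xi_o
        u = u - dt * grad_u                    # variable neurons
        xi_u = xi_u + dt * (u @ u - 1.0)       # Lagrangian neuron, constraint u^T u = 1
        xi_o = xi_o + dt * (U_prev.T @ u)      # Lagrangian neurons, constraints u_j^T u = 0
    return u, xi_u, xi_o

# illustrative usage with placeholder inputs (A, b, U_prev are hypothetical)
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6)); A = M @ M.T
b = rng.standard_normal(6)
U_prev = np.linalg.qr(rng.standard_normal((6, 2)))[0]   # previously found orthonormal vectors
u_eq, _, _ = lpnn_solve(A, b, U_prev)
print(np.round(u_eq @ u_eq, 3), np.round(U_prev.T @ u_eq, 3))  # constraint residuals at the end
```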
Additionally, the neural dynamics linearized around an equilibrium point are characterized in (22)-(24) by a matrix $G$ involving $2A^k + 2\xi_u I_{n_r}$ and the constraint-gradient matrix $B$, with $\xi = [\xi_u\ \xi_o(1) \cdots \xi_o(n-1)]^T$; we prove that $G$ is positive definite in Appendix C. Moreover, we evaluate the required computational complexity of the proposed DPCA method. The update of $u^k$ dominates each iteration of the proposed method; it requires $O(2n_r^2)$ flops, and each node broadcasts $\mathrm{Card}(k)$ vectors to its neighbors, as illustrated in the sketch below.
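The sketch below outlines the overall DPCA iteration for the first projection vector on a small network. To keep it self-contained, the norm-constrained subproblems are handled by a few projected-gradient steps and a simple neighborhood averaging, standing in for the LPNN solver and for the exact updates (10)-(14); the topology, parameters, and names are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def dpca_first_vector(Y_blocks, Ne, rho=1.0, T_max=300, inner=50, lr=1e-3, seed=0):
    """Hedged sketch of the outer DPCA iteration for the first projection vector:
    alternating local u^k updates, neighborhood w^k updates, and dual updates.
    Projected-gradient steps stand in for the LPNN solver, and a scaled-dual
    ADMM-style update stands in for (13)-(14); all details are illustrative."""
    K = len(Y_blocks)
    n_r = Y_blocks[0].shape[0]
    S = [Yk @ Yk.T for Yk in Y_blocks]          # local second-moment matrices
    rng = np.random.default_rng(seed)
    u0 = rng.standard_normal(n_r); u0 /= np.linalg.norm(u0)
    u = [u0.copy() for _ in range(K)]           # local copies u^k
    w = [u0.copy() for _ in range(K)]           # auxiliary consensus variables w^k
    lam = [np.zeros(n_r) for _ in range(K)]     # scaled dual variables

    for _ in range(T_max):
        for k in range(K):                      # local u^k update (stand-in for (10))
            target = np.mean([w[j] for j in Ne[k]], axis=0)
            for _ in range(inner):
                grad = -2.0 * S[k] @ u[k] + rho * (u[k] - target + lam[k])
                u[k] = u[k] - lr * grad
                u[k] /= np.linalg.norm(u[k])    # keep the unit-norm constraint
        # w^k update (stand-in for (12)): average the freshly broadcast neighbor vectors.
        w = [np.mean([u[j] + lam[j] for j in Ne[k]], axis=0) for k in range(K)]
        # dual update (stand-in for (13)-(14)): ascent on the consensus residual.
        lam = [lam[k] + (u[k] - w[k]) for k in range(K)]
    return u

# illustrative usage on a small ring network (each node also listed as its own neighbor)
rng = np.random.default_rng(1)
Y_blocks = [rng.standard_normal((8, 40)) for _ in range(4)]
Ne = {k: [(k - 1) % 4, k, (k + 1) % 4] for k in range(4)}
u_nodes = dpca_first_vector(Y_blocks, Ne)
print(max(np.linalg.norm(u_nodes[0] - uk) for uk in u_nodes))  # disagreement between copies
```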

C. CONVERGENCE ANALYSIS
In this subsection, we prove that the optimization problem in (15) is a global and convex one in a hidden form and that it satisfies Lemma 4.1 in [8], which ensures that the projection vectors obtained by the distributed method converge to those of the centralized one.
First, we prove that (15) admits a global optimum and is convex in a hidden form; in other words, the original optimization problems are actually global and convex ones in a hidden form.
That is to say, we should first prove that the problem (15) is equivalent to the relaxed problem (25), in which the unit-norm constraint is replaced by the inequality constraint $(u^k)^T u^k \le 1$. Under this inequality constraint, suppose that the minimal value of $(u^k)^T A^k u^k + (b^k)^T u^k$ with respect to $u^k$ is attained at $(u^k)^T u^k = \beta$ with $0 < \beta < 1$. Thus, (25) is transformed into the form of (26). We define a Lagrangian function like (18), as in (27), from which we define the related dual function $\bar F(\xi_u)$ in (28). It is easy to notice that $\bar F(\xi_u)$ is a monotonically decreasing function of $\xi_u$, with $\lim_{\xi_u \to \infty} \bar F(\xi_u) = -4\beta$ and $\bar F(\xi_u) \to +\infty$ as $\xi_u$ decreases toward its lower limit. Therefore, if $\beta_1 > \beta_2$ with $\beta_1, \beta_2 \in (0, 1]$, the root of $\bar F(\xi_u) = 0$ associated with $\beta_1$ is certainly not larger than that associated with $\beta_2$. In other words, the larger the scalar $\beta \in (0, 1]$ is, the smaller the root $\bar\xi_u$ of $\bar F(\xi_u) = 0$ is, and thus the smaller the primal objective value is. Since 1 is the largest value available for $\beta$, the objective function in (15) is equivalent to the relaxed one and achieves its minimum value on the boundary $(u^k)^T u^k = 1$ with $(u_j^k)^T u^k = 0$. Note that the relaxed feasible set $C = \{u^k : (u^k)^T u^k \le 1,\ (u_j^k)^T u^k = 0,\ j = 1, \cdots, n-1\}$ is convex, which can be shown as follows.
Assume that $u_1 \in C$, $u_2 \in C$, and $\theta \in [0, 1]$. Then $(\theta u_1 + (1-\theta) u_2)^T (\theta u_1 + (1-\theta) u_2) \le \big(\theta \|u_1\| + (1-\theta)\|u_2\|\big)^2 \le 1$ (33) and $(u_j^k)^T (\theta u_1 + (1-\theta) u_2) = 0$ (34), so $\theta u_1 + (1-\theta) u_2 \in C$ and $C$ is a convex set. Moreover, the objective function, $(u^k)^T u^k$, and $(u_j^k)^T u^k$ in (15) are all convex functions. Besides, according to (61), the matrix $A^k + \xi_u^* I$ is positive definite. Therefore, according to Theorem III as well as the $\varepsilon$-subdifferentials of convex functions and $\varepsilon$-normal directions in [8], the conditions above ensure that (15) attains a global optimum and is convex in a hidden form. In other words, the original optimization problem is a global and convex one in a hidden form.
Second, we should prove that a lemma similar to Lemma 4.1 in [8] holds for the hidden convex functions above, which ensures that the projection vectors obtained by the distributed method converge to those of the centralized one. Similar to the part between Eqs. (4.86) and (4.87) of Lemma 4.1 in [8], we define (35), where $J_1(u) = (u^k)^T A^k u^k$ and $J_2(u) = (b^k)^T u^k$ are continuously differentiable. Then (35) is equivalent to (36). Therefore, according to the Lagrange multiplier method, the solution of (36) is given by (37), where $\xi_u^*$ satisfies (38). According to Eq. (4.85) in [8], (39) follows, which shows that Lemma 4.1 in [8] holds for (15). Therefore, it also holds for the hidden convex objective functions shown in (5). Based on the derivation above, we let $G_1(a)$ denote the hidden convex optimization problem shown in (15) and let $G_2(a) = 0$, as in [8]. Then the constraints $(u^k)^T u^k = 1$ and $(u_j^k)^T u^k = 0$ keep $\{u^k\}$ bounded, and every limit point of $\{u^k\}$ is an optimal solution to the original problem (15). Since $G_1(a)$ in (15) satisfies the conditions of Lemma 4.1 in [8], the projection vectors obtained by the proposed distributed method converge to those of the centralized counterpart.

IV. SIMULATION AND EXPERIMENTAL RESULTS
In this section, simulation data, the open handwritten digit database [37], and a face database [38] are used to evaluate the performance of the proposed DPCA method on data stored distributively across networks.

A. DPCA ON SIMULATION DATA
In the first experiment, we consider a network with K = 20 nodes. Each node is able to process the simulation data to extract relevant information by collaborating with neighboring nodes.
In this experiment, we assign 500,000 simulation samples to the 20 nodes at random. Additionally, a randomly generated normalized column vector is used to initialize the projection vectors $\{u_1^k\}_{k=1}^{20}$, the step size is $\rho = 0.01$, and the maximum iteration number is $T_{\max} = 2000$.
To evaluate the consistency and the convergence performance of the proposed DPCA algorithm, we define the maximum absolute coefficient difference error between the projection matrices obtained by the centralized (gathering all data together) and distributed manners as
$$\xi(k, t, n) = \mathrm{MAX}\{|u_n^k(t) - u_n|\}, \quad k = 1, \cdots, K,\ t = 1, \cdots, T_{\max},\ n = 1, \cdots, L, \quad (40)$$
where $\mathrm{MAX}\{\cdot\}$ denotes the operator choosing the maximum value of its argument and $\xi(k, t, n)$ corresponds to the $t$th iteration and $n$th projection vector at the $k$th node.
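The following helper is a direct coding of the error measure (40) for a single node, iteration, and projection index; the optional sign alignment is an added convenience (eigenvectors are determined only up to sign) and is not part of the paper's definition.

```python
import numpy as np

def max_abs_coeff_diff(u_node, u_central, align_sign=True):
    """Sketch of the error measure (40): the largest absolute entry-wise difference
    between a node's projection vector and the centralized one."""
    if align_sign and u_node @ u_central < 0:
        u_node = -u_node                       # resolve the eigenvector sign ambiguity
    return np.max(np.abs(u_node - u_central))
```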
Then we compute $\xi(k, t, n)$ using the projection matrices obtained by the proposed DPCA algorithm and plot only the results for the special case $n = 1$ in Fig. 1. It can be seen that the maximum absolute coefficient difference error of each node decreases as the iteration number increases, and all of the maximum absolute coefficient difference errors between the distributed and centralized solutions are less than $10^{-9.8442}$ after 900 iterations. Therefore, the proposed algorithm has satisfactory convergence performance and obtains a consistent projection vector (i.e., for $n = 1$) with negligible differences across all nodes. Additionally, the results for $n = 2, \cdots, L$ are similar to those for $n = 1$. Therefore, the proposed DPCA is capable of obtaining projection matrices consistent with those of the centralized case.

B. DPCA ON HANDWRITTEN DIGIT DATABASE
In the second experiment, we consider a network with K = 5 intelligent post offices (IPOs). Each IPO is able to obtain grayscale digit images of the local postal mail, and can also process the image data to extract relevant information by collaborating with neighboring IPOs.
In this experiment, the two sets available from the USPS database [37] are used as the training and testing samples: 2500 images from the set of 7291 16 × 16 digit images shown in Fig. 2 are used for training, and the set of 2007 images is used for testing. The related distributions are given in Table 1. Similarly, we compute $\xi(k, t, n)$ to evaluate the consistency and convergence performance of the proposed DPCA algorithm and plot it in Fig. 3. It shows that all of the maximum absolute coefficient difference errors between the distributed and centralized solutions are less than $10^{-14.7257}$ after 100 iterations. Therefore, the convergence performance and consistent projection vectors are guaranteed by the proposed DPCA method.
Due to the efficient consistency property of the proposed DPCA method at each node, we only use the projection matrices of the 1st node to assess the DPCA algorithm on the recognition task. When the feature number varies from 1 to 16, the related recognition rates of the proposed DPCA method are given in Fig. 4. For comparison purposes, the results of the centralized PCA [9], MDA [14], CSA [15], DLLE [16], MFA [17], and the decentralized DDR [35] algorithms are also plotted in Fig. 4. It can be seen from Fig. 4 that: i) the recognition rates of all seven methods rise as the feature number increases (from 1 to 16); ii) the best handwritten digit recognition rates achieved by PCA, MDA, CSA, DLLE, MFA, DDR, and DPCA are 0.9162, 0.9063, 0.9243, 0.9143, 0.9243, 0.9243, and 0.9168, respectively; iii) the DPCA method, with its distributed computation manner, has recognition properties close to those of the centralized dimensionality reduction methods; and iv) compared to the decentralized DDR method, the DPCA method obtains similar recognition rates but with lower complexity.
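For completeness, the recognition protocol used here can be sketched as follows: project the raw samples onto the learned projection matrix and assign each test sample the label of its nearest training sample in the reduced space. The function below is such a sketch with illustrative names and shapes (labels are assumed to be 1-D numpy arrays); it is not tied to the specific USPS preprocessing.

```python
import numpy as np

def nearest_neighbor_rate(U, X_train, y_train, X_test, y_test):
    """Sketch of the recognition protocol: columns of X_* are raw samples, U is the
    learned projection matrix; each test sample gets the label of its nearest
    training sample in the reduced feature space (1-NN, Euclidean distance)."""
    Z_tr = U.T @ X_train                       # L x n_train features
    Z_te = U.T @ X_test                        # L x n_test features
    d2 = (np.sum(Z_te ** 2, axis=0)[:, None]
          + np.sum(Z_tr ** 2, axis=0)[None, :]
          - 2.0 * Z_te.T @ Z_tr)               # pairwise squared distances
    y_pred = y_train[np.argmin(d2, axis=1)]
    return float(np.mean(y_pred == y_test))    # recognition rate
```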

C. DPCA ON OLIVETTI & ORACLE RESEARCH LABORATORY (ORL) FACE DATABASE
The ORL database [38] is used in the third experiment to simulate the data of a vision network, where the faces of a subject are captured by cameras with different views in the network, and each observation represents a facial image captured at a different time and with different variations, including expression (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses). The ORL database [38] contains 400 grayscale images of 40 individuals, normalized to a resolution of 112 × 92 pixels, with a tolerance for tilting and rotation of the face of up to 20° in each image. Fig. 5 shows a snapshot of a person captured by the intelligent cameras with the corresponding viewing angles.
To simulate the network, we use 240 images as the training samples and the rest of the database as the testing samples. The training images are assigned to a network with K = 5 nodes. The same step size and iteration number as in the first experiment are used. Furthermore, we also compute $\xi(k, t, n)$ and plot it in Fig. 6. From Fig. 6, we can find that all errors are no larger than $10^{-8.0664}$ after 100 iterations, which implies that the proposed algorithm has satisfactory convergence and consistency performance.
Additionally, to classify the testing samples, we also use the nearest neighbor criterion and show the recognition rates of the seven algorithms in Fig. 7. From Fig. 7, we can see that the best recognition rates of PCA, MDA, CSA, DLLE, MFA, DDR, and DPCA are 0.9750, 0.9625, 0.9687, 0.9750, 0.9812, 0.9750, and 0.9750, respectively, which indicates that the recognition performance of the proposed DPCA method is comparable to those of PCA, MDA, CSA, DLLE, MFA, and DDR. Although the MFA method obtains the highest recognition rate when the dimensions increase to 1110, the DDR and DPCA methods can handle sensitive or private data in a distributed manner and still obtain satisfactory results.

V. CONCLUSION
In this paper, we have developed an efficient DPCA method to process data distributed across the nodes of a network. To solve the decentralized projection vector determination problem in the network environment using only local computations and information communications among neighboring nodes, we update the optimization variables alternately and iteratively, and determine the projection vectors under the unitary and orthogonality constraints by exploiting the remarkable capability of Lagrange programming neural networks in solving general nonlinear programming problems. Numerical examples have shown that the proposed DPCA method achieves performance comparable to that of state-of-the-art centralized techniques while processing distributed data across the networks.
APPENDIX A
Since $\bar U$ is unitary and $\bar U \bar U^T = I_{n_r}$, we have (43). Inserting (43) into (42) yields (44). Therefore, if $u$ is the eigenvector of $YY^T$ corresponding to the largest eigenvalue $\sigma_1$, then $u^T YY^T u$ attains its maximum value $\sigma_1$.
Furthermore, for $u \neq 0$, we have (45) and (46), which show that (3) holds. The proof is complete.

APPENDIX B
We begin with the $n = 2$ case, where the optimized $u$ should satisfy $u^T u_1 = 0$, and thus we have (47), where the component $\sigma_1 u_1 u_1^T$ is dropped owing to $u^T u_1 = 0$. Define $R_1 = \sum_{i=2}^{n_r} \sigma_i u_i u_i^T = YY^T - \sigma_1 u_1 u_1^T$; then the problem in (4) reduces to
$$\min_u \ -u^T R_1 u \quad \text{s.t.}\ u^T u = 1. \quad (48)$$
Note that (48) is very similar to (3); thus the solution to (48) is the eigenvector corresponding to the largest eigenvalue of $R_1$.
Similar proofs hold for n = 3, · · · , L. The proof is complete.
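As a quick numerical check of the deflation argument above, the snippet below (with illustrative random data) verifies that the leading eigenvector of the deflated matrix $R_1 = YY^T - \sigma_1 u_1 u_1^T$ coincides, up to sign, with the second eigenvector of $YY^T$.

```python
import numpy as np

# Illustrative check of the deflation step: with R_1 = Y Y^T - sigma_1 u_1 u_1^T,
# the eigenvector of R_1 with the largest eigenvalue is the second eigenvector of Y Y^T.
Y = np.random.default_rng(4).standard_normal((10, 200))
M = Y @ Y.T
vals, vecs = np.linalg.eigh(M)                  # ascending eigenvalues
u1, u2 = vecs[:, -1], vecs[:, -2]
R1 = M - vals[-1] * np.outer(u1, u1)
top_R1 = np.linalg.eigh(R1)[1][:, -1]
print(np.isclose(abs(top_R1 @ u2), 1.0))        # True (up to sign ambiguity)
```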

APPENDIX C
We show that $G$ is strictly positive definite provided that $2A^k + 2\xi_u I_{n_r}$ is positive definite. Let $\alpha$ be an eigenvalue of $G$, and let $[z^T\ s^T]^T$, assumed not to be a zero vector, be the corresponding eigenvector, where $z$ and $s$ are vectors of dimension $n_r$ and $n$, respectively. We have (49). Inserting the definition of $G$ into (49) yields (50), where $z^T B s - s^T B z = 0$ is applied. Therefore, combining (49) and (50), we obtain $z^T (2A^k + 2\xi_u I_{n_r}) z = \alpha(\|z\|^2 + \|s\|^2)$.
If there is a unique optimal solution to the problem, then the variable $[(u^k)^T\ \xi_u\ \xi_o(1) \cdots \xi_o(n-1)]^T$ is unique, which implies that the equations have a unique solution.
Therefore, the matrix $\begin{bmatrix} 2A^k + 2\xi_u I_{n_r} & B \\ B^T & 0_{n \times n} \end{bmatrix}$ is invertible.
Furthermore, we prove that the matrix $2A^k + 2\xi_u I_{n_r}$ is invertible.