A Consensus-Based Diffusion Levenberg-Marquardt Method for Collaborative Localization With Extension to Distributed Optimization

Non-linear least squares problems arising from data fitting have recently received considerable attention, particularly for estimating model parameters over networked systems. Although the diffusion Gauss-Newton method offers many advantages for solving the non-linear least squares problem of estimating a target position in wireless sensor networks, several key challenges arise when applying it in practice: singularity of the Gauss-Newton Hessian, the selection of constant step sizes, and steady-state oscillation. These issues lead to noticeable performance degradation, including high computational cost, vulnerability to step-size changes, and instability of the estimates. In this paper, to eliminate the singularity, we develop a diffusion Levenberg-Marquardt method that also resolves the constant step-size problem. Then, to reach agreement on the estimated vector, a consensus implementation is further proposed, eliminating the oscillation at steady state. Consequently, the proposed consensus-based diffusion Levenberg-Marquardt method provides a general solution for non-linear least squares problems whose objective takes the form of a sum of squared residual terms. Applied to collaborative localization and to distributed optimization problems arising in large-scale machine learning, simulation results confirm the effectiveness and wide applicability of the proposed method in terms of convergence rate, accuracy, and consistency of estimates.


I. INTRODUCTION
Target localization and tracking have found wide applications with location awareness in wireless sensor networks (WSNs) [1], where accurate position information plays a critical role, such as in position-based routing protocols [2], environmental monitoring [3], anomaly detection such as fire or poisonous gas [4], disaster relief [5], industrial detection [6], and so on. In general, positioning techniques can be divided into two main types: range-free and range-based. Range-free localization techniques are easy to implement with low hardware cost; they mainly depend on coarse position estimation obtained by exchanging information such as connectivity and hop counts among neighboring nodes. Typical range-free methods include Approximate Point in Triangle (APIT) [7], DV-Hop [8], and Centroid [9]. Their common weakness is high estimation error, while their advantage is lower energy consumption, since they avoid frequent wireless communication and need less hardware to obtain range values. However, many real applications require high accuracy of the target position, regardless of the computation/communication cost or inconvenience in deployment. In this situation, range-based localization has received more and more attention because of its good estimation performance.
With the rapid development of Micro-Electro-Mechanical Systems (MEMS) technology, distances between unknown targets and anchor nodes that know their own positions have become easily available, and range-based localization methods are consequently becoming popular among researchers. Typical range measuring techniques include Received Signal Strength Indicator (RSSI), Time of Arrival (ToA), Time Difference of Arrival (TDoA), and Angle of Arrival (AoA) [1]. These ranging methods have different accuracy depending on the surrounding environment. Obviously, higher distance measuring accuracy yields better node localization. In the best case, only three anchor nodes are needed when the distance measurements are exact, using the trilateration localization algorithm. However, it is well known that environmental noise arising from various situations, including Non-Line-Of-Sight (NLOS) propagation and multipath propagation, is ubiquitous and inevitable. Thus, a main objective of range-based localization is to eliminate the negative effect of noise as much as possible.
Some recent studies from different perspectives have also shown that cooperation among more anchor nodes can improve localization accuracy and minimize the effect of noise. In [10], a lightweight distributed node localization scheme for WSNs is proposed to solve the drifting problem while accounting for limited computational capacity. In [11], a multi-anchor node collaborative localization (MANCL) algorithm is proposed, in which a communication mechanism and a voting mechanism are used to determine the temporary coordinates of unknown nodes. For a deterministic but unknown node coordinate, Maximum-Likelihood (ML) estimation is popular due to its asymptotic optimality and consistency. The ML solution [12] is obtained by solving a weighted Non-Linear Least Squares (NLLS) problem, which has no closed-form solution due to its nonconvexity. Although the unconstrained minimization problem can be solved by classical gradient and Newton methods as well as their variants, including the storage-efficient Broyden-Fletcher-Goldfarb-Shanno (BFGS) and other Quasi-Newton methods [13], these ignore the special structure of the Hessian matrix in the NLLS model that allows a reduction in computational cost, and they require strong assumptions on both the objective function and the initial iterate, which can lead to failure. The well-known Gauss-Newton (GN) method can be used to obtain an iterative solution with excellent accuracy. The GN method is a modification of Newton's method and has a deeper learning capacity than first-order methods such as gradient descent. More importantly, the GN method is a Hessian-free second-order-type method, which incurs a lower computational cost than the standard second-order Newton's method requiring the inverse of the exact Hessian. For the GN method to be effective, one precondition is that the initial estimate be good enough.
In other words, the initial guessed position of the unknown target must be in a close neighborhood of the true position. Several methods have been developed to provide an initial point for the GN method, including second-order cone programming (SOCP) [14] and semidefinite programming (SDP) [15], which can provide a sufficiently accurate guess in polynomial time.
The GN method for solving NLLS problems in a networked system can be implemented either in a centralized or a distributed manner. Owing to well-known advantages such as robustness and load balancing, distributed solutions are preferable to centralized ones. Recently, we proposed adaptive range-based target localization using a diffusion GN method in noisy industrial environments [16]. Firstly, node localization is formulated as a NLLS problem, where the anchor nodes cooperatively estimate the coordinate of the unknown target based on their available noisy range measurements. Secondly, a distributed GN method based on diffusion learning is proposed to solve this NLLS localization problem. The diffusion GN method effectively incorporates the temporal-spatial diversity of data across the network into local iterative updates, thus obtaining a significant improvement in both convergence rate and steady-state accuracy in a fully distributed way.
Despite the excellent performance of the diffusion GN algorithm, two major and unavoidable challenges need to be addressed to obtain a robust solution. Firstly, for the diffusion GN method to succeed, the approximated Hessian matrix is assumed to have full column rank at every step, since its inverse must be computed. Once this assumption fails, the diffusion GN method becomes intractable. In practice, however, it is quite likely that the GN Hessian is ill conditioned due to data diversity. Although the generalized inverse can give a good approximate solution for matrix inversion via the singular value decomposition [17], the resulting computational cost is very high: computing a generalized inverse costs about three to four times as much as solving an invertible system of the same dimension [18].
Secondly, the diffusion GN method uses fixed step sizes, identical for all nodes throughout the iteration process. It is known that the step size should be large at the early stage of the algorithm and decrease as the algorithm progresses. Thus, a diffusion GN method with diminishing step-size control (also known as damped GN) urgently needs to be developed in place of a static learning step size.
In this paper, a modified Levenberg-Marquardt (LM) method is used to address the above issues. LM is a typical damped iteration method that adds a regularization parameter to the approximated Hessian matrix, which ensures positive definiteness of the matrix. The main difficulties in applying the LM method to the localization problem are its distributed implementation and the maintenance of consensus. To this end, a consensus-based LM algorithm with a diffusion strategy is developed to achieve network-wide consistency of steady-state estimation performance in a distributed way. Simulation results show that the proposed method outperforms the diffusion GN method under a variety of step sizes and also exhibits good agreement over the network. An extension to distributed optimization further confirms the effectiveness and wide applicability of our method. In summary, the present paper makes a twofold contribution. (1) We redevelop a general optimization method that improves parameter estimation performance for NLLS problems with a sum-of-squared-residual objective over networked systems. (2) Different from previous work in the literature, consensus is explicitly built into the derivation of the proposed distributed LM method, which effectively reduces the steady-state oscillation across the network.
This paper is organized as follows. In Section II, the node localization problem is introduced by formulating a NLLS model over networked systems, and the diffusion GN solution is given. In Sections III and IV, a modified diffusion LM method and its consensus version are proposed, respectively. Simulation results for target localization and distributed optimization problems are provided in Section V. The paper is concluded in Section VI.
Notation: The operator $(\cdot)^T$ denotes the transpose of a matrix or vector, and $(\cdot)^{-1}$ denotes the inverse of a non-singular matrix. Bold capital letters and bold small letters denote matrices and vectors, respectively, while scalars are denoted in normal font. The 2-norm of a vector $x$ is denoted by $\|x\|$. $I_N$ denotes the $N \times N$ identity matrix. We use subscripts $k$ and $l$ to denote nodes, and superscript $i$ to denote time.
The noisy range measurement of node $k$ at time $i$ follows the model
$$r_k^i = R_k^i + \mathrm{noise}_k^i, \qquad (1)$$
where $R_k^i = \|x - y_k\|$ is the true distance between the target position $x$ and the anchor position $y_k$, and $\mathrm{noise}_k^i$ is the environmental noise, which follows a Gaussian distribution with finite mean $\theta$ and variance $\sigma$. The model (1) has been widely used to develop node localization techniques [16]. The distance error between node $k$ and the target can be written as
$$f_k(x) = r_k^i - \|x - y_k\|. \qquad (2)$$
Thus, we aim to minimize the sum of squared distance errors over all nodes:
$$\min_x F(x) = \frac{1}{2}\|f(x)\|^2 = \frac{1}{2}\sum_{k=1}^{N} f_k^2(x), \qquad (3)$$
where we define $f(x) \triangleq [f_1(x), \ldots, f_N(x)]^T$ as the global residual vector of the cost function.
Originating from the localization application, the minimization problem (3) provides a general networked NLLS model and can be solved by many optimization methods, such as first-order gradient or second-order Newton-type methods and their variants. A more efficient method, Gauss-Newton, can achieve a convergence rate comparable to the quadratic rate of Newton's method without computing second derivatives, which is better than the linear convergence achieved by general gradient descent methods. Before introducing the iterative methods for the solution of NLLS problems, the following assumptions need to be satisfied. Assumption 1.
(1) A local minimizer $x^*$ of (3) exists and satisfies the first-order necessary condition $\nabla F(x^*) = 0$.
(2) $f(x)$ is continuously differentiable near $x^*$, and a given initial value $x^0$ is close enough to $x^*$.
Assumption 1(1) is necessary to guarantee local optimality, while Assumption 1(2) ensures a sufficiently good Hessian approximation when the GN method is adopted.
In the traditional second-order Newton method, the optimization problem is solved through successive iterations based on Newton's update rule
$$x^{i+1} = x^i + \alpha^i d_N^i, \qquad (4)$$
where $x^i$ is the estimate of $x$ at iteration $i$ and $\alpha^i \in (0, 1]$ is the step-size parameter. Moreover, $d_N^i$ is the Newton descent step, obtained by
$$d_N^i = -\left[\nabla^2 F(x^i)\right]^{-1} \nabla F(x^i). \qquad (5)$$
Computation of the Hessian $\nabla^2 F(x^i)$ requires the $N$ second-order terms and is given by
$$\nabla^2 F(x^i) = J^T(x^i)J(x^i) + \sum_{k=1}^{N} f_k(x^i)\nabla^2 f_k(x^i), \qquad (6)$$
where $J(x)$ denotes the Jacobian of $f(x)$ with entries $(J(x))_{k,m} = \partial f_k/\partial x_m$. Obviously, it is too costly to compute the second-order terms of the Hessian for practical usage.
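To make this cost structure concrete, the following sketch (a toy two-residual problem of our own, not from the paper, with cost $F(x) = \frac{1}{2}\|f(x)\|^2$) verifies numerically that the full Hessian is $J^TJ$ plus the second-order terms, and that the gap between it and the cheap GN approximation is driven entirely by the residual values:

```python
import numpy as np

# Toy two-residual NLLS in R^2 (our own example): f1 = x0^2 + x1 - 1,
# f2 = x0 - x1^2, with cost F(x) = (1/2)||f(x)||^2.
def f(x):
    return np.array([x[0]**2 + x[1] - 1.0, x[0] - x[1]**2])

def J(x):
    # Jacobian of f
    return np.array([[2.0 * x[0], 1.0],
                     [1.0, -2.0 * x[1]]])

def full_hessian(x):
    # Full Hessian of F: J^T J plus the second-order terms sum_k f_k * Hess(f_k)
    H_f1 = np.array([[2.0, 0.0], [0.0, 0.0]])
    H_f2 = np.array([[0.0, 0.0], [0.0, -2.0]])
    fx, Jx = f(x), J(x)
    return Jx.T @ Jx + fx[0] * H_f1 + fx[1] * H_f2

x = np.array([0.3, 0.7])
gn_hessian = J(x).T @ J(x)                 # cheap Gauss-Newton approximation
gap = np.linalg.norm(full_hessian(x) - gn_hessian)
# The gap is exactly the residual-weighted sum of second-order terms: it
# vanishes as the residuals f_k(x) go to zero near a small-residual minimizer.
print(gap)
```

This is why the GN approximation works well for small-residual problems such as localization with low measurement noise.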
Instead of computing the true Hessian, the GN method exploits the least-squares structure of the objective to form an approximate Hessian, using the iterative update rule
$$x^{i+1} = x^i + \alpha^i d^i, \qquad (7)$$
where $d^i$ denotes the GN step, written as
$$d^i = -\left[J^T(x^i)J(x^i)\right]^{-1} \nabla F(x^i). \qquad (8)$$
Note that $J^T(x^i)J(x^i)$ approximates the Hessian $\nabla^2 F(x^i)$ in (6) when the estimate $x^i$ is close to the true target position $x^*$, while
$$\nabla F(x^i) = J^T(x^i) f(x^i) \qquad (9)$$
and
$$f(x^i) = [f_1(x^i), \ldots, f_N(x^i)]^T. \qquad (10)$$
Consequently, the GN step (8) can be written as
$$d^i = -\left[J^T(x^i)J(x^i)\right]^{-1} J^T(x^i) f(x^i). \qquad (11)$$
Substituting (11) into (7) gives a centralized GN solution, where a fusion centre is assumed to implement the GN update and is responsible for gathering the information of interest from all nodes. In most cases, sensor nodes prefer peer-to-peer communication with neighboring nodes due to limited communication capacity. Thus, we restrict each node to exchange information only with its 1-hop neighbors. Accordingly, a local GN method for localization is obtained as
$$x_k^{i+1} = x_k^i + \alpha_k^i d_k^i, \qquad (12)$$
where $d_k^i = -[J_k^T(x_k^i)J_k(x_k^i)]^{-1} J_k^T(x_k^i) f_k(x_k^i)$ is the GN step of node $k$ at time $i$, and $J_k(x_k^i) \in \mathbb{R}^{n_k \times M}$ has the entries $(J_k(x_k^i))_{l,m} = \partial f_l/\partial x_m(x_k^i)$ for $l \in N_k$, $1 \le m \le M$, where $n_k$ is the number of immediate neighbors of node $k$ including itself and $N_k$ denotes its neighborhood. Similarly, $f_k(x_k^i) \in \mathbb{R}^{n_k \times 1}$ has the scalar entries $f_l(x_k^i)$ for $l \in N_k$, and the corresponding local cost function can be written as $F_k(x) = \frac{1}{2}\|f_k(x)\|^2$. The update step (12) uses only the limited information from the local neighborhood, thus leading to poor performance due to the lack of global knowledge. To address this problem, the diffusion GN method exploits the diffusion property of a connected network to perform deeper learning as follows:
$$z_k^i = \sum_{l \in N_k} w_{k,l}\, x_l^i, \qquad (13)$$
$$x_k^{i+1} = z_k^i + \alpha_k^i d_k^i(z_k^i), \qquad (14)$$
where the GN step is now evaluated at the aggregated estimate $z_k^i$, and $w_{k,l} \in [0, 1]$ is the weight assigned by node $k$ to node $l$, satisfying the condition
$$\sum_{l \in N_k} w_{k,l} = 1, \qquad w_{k,l} = 0 \text{ for } l \notin N_k. \qquad (15)$$
The resulting global coefficient matrix $W \in \mathbb{R}^{N \times N}$ with entries $w_{k,l}$ is right-stochastic and is also known as a consensus matrix, which has been proven to reach a consistent estimation for all nodes over the network [19].
The diffusion GN method aims to enhance the interaction among neighboring nodes by aggregating the real-time estimates from the neighborhood as in (13) and integrating them into the local update step as in (14), so that network-wide sensing data is fused by each individual node as long as the network is connected. The analysis has shown that the diffusion GN method converges asymptotically to the minimizer when the step size is selected reasonably.
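The two diffusion GN steps, aggregating neighbor estimates as in (13) and then taking a local GN step from the aggregated point as in (14), can be sketched as follows. This is our own minimal toy setup (a ring network of N = 8 range-measuring nodes in 2D with uniform weights), not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 2                                  # number of nodes, target dimension
x_true = np.array([25.0, 25.0])
anchors = rng.uniform(0.0, 50.0, (N, M))     # anchor positions y_k
# Noisy range measurements r_k (one per anchor, Gaussian noise).
r = np.linalg.norm(anchors - x_true, axis=1) + 0.1 * rng.standard_normal(N)

# Ring topology: each node cooperates with itself and its two neighbours,
# with uniform right-stochastic weights satisfying (15).
neigh = [{k, (k - 1) % N, (k + 1) % N} for k in range(N)]
W = np.zeros((N, N))
for k in range(N):
    for l in neigh[k]:
        W[k, l] = 1.0 / len(neigh[k])

def local_gn_step(z, k):
    # Local GN step of node k at the aggregated point z:
    # d = -(Jk^T Jk)^{-1} Jk^T fk over the neighbourhood's measurements.
    idx = sorted(neigh[k])
    diff = z - anchors[idx]
    dist = np.linalg.norm(diff, axis=1)
    fk = dist - r[idx]                       # residuals f_l(z)
    Jk = diff / dist[:, None]                # rows: d||z - y_l|| / dz
    return -np.linalg.solve(Jk.T @ Jk, Jk.T @ fk)

alpha = 0.5
x = np.zeros((N, M))                         # same initial point x0 = 0
for i in range(60):
    z = W @ x                                # aggregation step, cf. (13)
    x = np.array([z[k] + alpha * local_gn_step(z[k], k) for k in range(N)])

print(np.linalg.norm(x.mean(axis=0) - x_true))   # network-average error
```

The final network-average error settles near the measurement-noise level, illustrating how the aggregation step spreads information beyond each node's three-anchor neighborhood.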

III. THE MODIFIED DIFFUSION LEVENBERG-MARQUARDT METHOD
As mentioned above, two practical difficulties affect the diffusion GN method: the inversion of matrices that may not have full rank, and the constant step size shared by all nodes at all steps. Historically, Levenberg (1944) and later Marquardt (1963) suggested the use of a damped GN method. However, the traditional LM method is implemented in a centralized way. In this section, a modified diffusion LM method is proposed to obtain an adaptive and distributed solution, which clearly outperforms the diffusion GN method.
To begin, note that the standard LM method adds a regularization parameter $\mu > 0$ to $J^T(x^i)J(x^i)$ instead of using a step size $\alpha$. In the same way, we can write the diffusion LM method as the following modification of (13) and (14):
$$z_k^i = \sum_{l \in N_k} w_{k,l}\, x_l^i, \qquad (16)$$
$$x_k^{i+1} = z_k^i + d_{k,LM}^i, \qquad (17)$$
where we define the diffusion LM step as
$$d_{k,LM}^i = -\left[J_k^T(z_k^i)J_k(z_k^i) + \mu_k^i I\right]^{-1} J_k^T(z_k^i) f_k(z_k^i), \qquad (18)$$
and $\mu_k^i$ is the LM parameter of node $k$ at time $i$. The difference between (17) and (14) reflects the change from the step size in the GN method to the LM parameter in the LM method.
Here, we define the approximated Hessian and the gradient at the aggregated estimate as
$$G_k^i = J_k^T(z_k^i)J_k(z_k^i) \qquad (19)$$
and
$$g_k^i = J_k^T(z_k^i) f_k(z_k^i), \qquad (20)$$
which motivates a distributed diffusion implementation: each node $k$ receives the estimates $x_l^i$ from its neighborhood $N_k$ excluding itself, and computes the local quantities $\{\nabla f_k(z_k^i), f_k(z_k^i)\}$. The LM parameter $\mu_k^i$ is fully capable of eliminating the two stringent assumptions, for the following reasons. 1) The matrix $J_k^T(z)J_k(z) + \mu_k^i I$ is positive definite for all $\mu_k^i > 0$, thus ensuring its invertibility and the descent property of $d_{k,LM}^i$; see Appendix A for a proof. 2) By adjusting $\mu_k^i$, the step size $\alpha_k^i$ of the GN method can be eliminated. For large values of $\mu_k^i$,
$$d_{k,LM}^i \approx -\frac{1}{\mu_k^i}\, g_k^i,$$
which is a steepest descent step with multiplication factor $1/\mu_k^i$. In conclusion, the LM method overcomes the singularity of the Hessian matrix and achieves a smooth transition between the GN step and the steepest descent step, where the parameter $\mu_k^i$ determines whether to switch and the switching speed. Because the cost function tends to be large at the initial stage of the algorithm, the GN direction with its fast quadratic convergence should be favored by using a small $\mu_k^i$. As the iterations progress, the estimate moves towards the minimizer of $F(x)$; it is then better to follow a short step (i.e., gradient descent) with slow linear convergence to avoid overshooting the minimizer with large search steps. This is also confirmed by the diffusion GN method, where a large step size $\alpha$ results in fast convergence and large error, while a small step size $\alpha$ yields slow convergence and high accuracy.
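A minimal sketch of this LM step (our own illustration; the helper `lm_step` is hypothetical, not from the paper) demonstrates both claims: the regularized solve succeeds even for a rank-deficient Jacobian, and for large µ the step reduces to a scaled steepest descent step:

```python
import numpy as np

def lm_step(J, f, mu):
    # Diffusion LM step: d = -(J^T J + mu*I)^{-1} J^T f.
    # J^T J + mu*I is positive definite for any mu > 0, so the solve
    # succeeds even when J is rank-deficient and the plain GN step fails.
    return -np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), J.T @ f)

# A rank-1 Jacobian: J^T J is singular, so the GN step is not computable.
J = np.array([[1.0, 2.0],
              [2.0, 4.0]])
f = np.array([1.0, 1.0])
g = J.T @ f                                  # gradient g = J^T f

d_small = lm_step(J, f, mu=1e-3)             # near-GN behaviour, still solvable
d_large = lm_step(J, f, mu=1e6)              # dominated by mu
# For large mu the step collapses to steepest descent scaled by 1/mu.
print(np.allclose(d_large, -g / 1e6, rtol=1e-4))
```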
It remains to establish an update rule for $\mu_k^i$ that works in real time and in a distributed way. In the traditional damped GN method, a widely used strategy is exact line search with a backtracking scheme, which finds a step size that decreases the value of the objective function between two successive iterations. This strategy is a heuristic that generates a series of step sizes, from which a value satisfying the descent condition is selected. However, it is very inefficient because it introduces two more parameters that need to be tuned carefully, with a considerable risk of even increasing the objective function. Furthermore, much computational resource is consumed while achieving poor performance. More importantly, due to the non-convexity of the objective function, it is also impractical to find a descent direction when a negative curvature direction is encountered.
An alternative way of determining $\mu_k^i$ is the trust region strategy, whose prototype adjusts the LM parameter dynamically by judging whether the estimate produced between two successive iterations is sufficiently accurate inside a ball of specified radius centered at the current iterate. Moreover, trust region methods remain useful even when the GN Hessian has negative curvature [13], [20]. Here, we modify the diffusion GN method into a diffusion LM version with an elaborate trust region strategy.
For each node $k$ in the WSN, the local cost function in the diffusion LM step (18) can be defined as
$$F_k(z) = \frac{1}{2}\|f_k(z)\|^2, \qquad (21)$$
whose gradient equals $g_k^i$ given in (20). The Taylor expansion of $f_k(z)$ along $d_{k,LM}^i$ can be written as
$$f_k(z + d_{k,LM}^i) = f_k(z) + J_k(z)\, d_{k,LM}^i + O(\|d_{k,LM}^i\|^2), \qquad (22)$$
which can be approximated as
$$f_k(z + d_{k,LM}^i) \approx \ell(d_{k,LM}^i) \triangleq f_k(z) + J_k(z)\, d_{k,LM}^i \qquad (23)$$
for small $d_{k,LM}^i$; that is, (23) is a good approximation when the step $d_{k,LM}^i$ is sufficiently small. In that case the current $d_{k,LM}^i$ is acceptable and can be increased at the next iteration, providing a larger descent step and improving the convergence speed. Otherwise, for a large $d_{k,LM}^i$, one should reduce it to avoid the failure of the LM descent step. By applying this approximation, the closeness of the approximated and actual reductions of the objective cost between two successive iterations can be evaluated.
To do this, substituting (23) into $F_k(z + d_{k,LM}^i)$ based on the definition (21), one obtains
$$F_k(z + d_{k,LM}^i) \approx L(d_{k,LM}^i) \triangleq F_k(z) + (d_{k,LM}^i)^T J_k^T(z) f_k(z) + \frac{1}{2}(d_{k,LM}^i)^T J_k^T(z) J_k(z)\, d_{k,LM}^i, \qquad (24)$$
which is the predicted cost based on the Taylor expansion.
On the second line of (24), the approximation (23) is used. We define the gain ratio
$$\rho_k^i = \frac{F_k(z_k^i) - F_k(z_k^i + d_{k,LM}^i)}{L(0) - L(d_{k,LM}^i)}, \qquad (25)$$
which is motivated by the original trust region algorithm [13], [20]. The numerator of (25) is the actual reduction of the local cost of node $k$ at time $i$, while the denominator is the predicted cost reduction from the Taylor expansion, which equals
$$L(0) - L(d_{k,LM}^i) = -(d_{k,LM}^i)^T g_k^i - \frac{1}{2}(d_{k,LM}^i)^T J_k^T J_k\, d_{k,LM}^i. \qquad (26)$$
Based on (18), it is known that
$$J_k^T J_k\, d_{k,LM}^i = -\left(\mu_k^i d_{k,LM}^i + g_k^i\right). \qquad (27)$$
Substituting (27) into (26), the predicted gain (i.e., the denominator of (25)) can be written as
$$L(0) - L(d_{k,LM}^i) = \frac{1}{2}(d_{k,LM}^i)^T\left(\mu_k^i d_{k,LM}^i - g_k^i\right), \qquad (28)$$
where $g_k^i$ is the gradient of the local cost function $F_k$ with respect to the aggregated estimate, given in (20).
Finally, the gain ratio is obtained as
$$\rho_k^i = \frac{F_k(z_k^i) - F_k(z_k^i + d_{k,LM}^i)}{\frac{1}{2}(d_{k,LM}^i)^T\left(\mu_k^i d_{k,LM}^i - g_k^i\right)}. \qquad (29)$$
From (29), it can be seen that the role of $\rho_k^i$ is to monitor the deviation between the actual and predicted decrease. It is important to introduce two control parameters $\rho_{low} < \rho_{high}$. If $\rho_k^i > \rho_{high}$, the predicted reduction is particularly close to the actual one for the computed $d_{k,LM}^i$, so the trend of GN descent is kept. In this case, the LM parameter $\mu_k^i$ should be decreased, following the earlier discussion of the role of $\mu_k^i$; this is easily achieved by multiplying $\mu_k^i$ by a constant $0 < p_1 < 1$. On the other hand, the case $\rho_k^i < \rho_{low}$ indicates that the current trial step is unacceptable, so we turn towards the steepest descent direction by multiplying $\mu_k^i$ by a constant $p_2 > 1$. The above process is summarized as
$$\mu_k^{i+1} = \begin{cases} p_1 \mu_k^i, & \rho_k^i > \rho_{high}, \\ p_2 \mu_k^i, & \rho_k^i < \rho_{low}, \\ \mu_k^i, & \text{otherwise}. \end{cases} \qquad (30)$$
Note that the thresholds $\rho_{high}$, $\rho_{low}$ and the multipliers $p_1$ and $p_2$ are identical for every node at every iteration, mainly because the trust region method is insensitive to their minor changes in practice. Typically, $\rho_{high}$, $\rho_{low}$, $p_1$ and $p_2$ are set to 0.75, 0.25, 0.5 and 2, respectively. Moreover, to match the magnitude of $J_k^T J_k$, an initial LM parameter $\mu_k^0$ can be chosen as
$$\mu_k^0 = \xi \cdot \max\left\{\mathrm{diag}\{J_k^T(x_k^0) J_k(x_k^0)\}\right\},$$
where $\xi$ is a one-time user-specified constant and the $\mathrm{diag}\{A\}$ notation denotes the diagonal elements of a matrix $A$.
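The gain-ratio test (29) and the threshold update of µ described above can be sketched as follows, on a toy scalar residual of our own, using the typical constants quoted in the text:

```python
import numpy as np

RHO_HIGH, RHO_LOW = 0.75, 0.25               # thresholds from the text
P1, P2 = 0.5, 2.0                            # typical shrink/grow multipliers

def update_mu(mu, rho):
    # Shrink mu toward the GN step when the model fits well,
    # grow it toward steepest descent when the trial step is rejected.
    if rho > RHO_HIGH:
        return P1 * mu
    if rho < RHO_LOW:
        return P2 * mu
    return mu

# Toy scalar residual f(z) = z^2 - 2, cost F(z) = (1/2) f(z)^2 (our own example).
F = lambda z: 0.5 * (z[0] ** 2 - 2.0) ** 2
z = np.array([2.0])
Jk = np.array([[2.0 * z[0]]])                # Jacobian df/dz = 2z
fk = np.array([z[0] ** 2 - 2.0])
g = Jk.T @ fk                                # gradient
mu = 1.0
d = -np.linalg.solve(Jk.T @ Jk + mu * np.eye(1), g)   # LM step

# Gain ratio: actual reduction over the predicted reduction (28).
actual = F(z) - F(z + d)
predicted = 0.5 * d @ (mu * d - g)
rho = actual / predicted
mu = update_mu(mu, rho)
print(rho > RHO_HIGH, mu)
```

Here the quadratic model predicts the reduction well, so the trial step is accepted and µ is halved, pushing the next step toward the GN direction.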

IV. CONSENSUS IMPLEMENTATION
For a decentralized WSN, there is no fusion center to report the final accepted estimate to all nodes. Thus, all sensor nodes are required to reach consensus, or agreement, on the final estimates in a distributed way. In the collaborative localization application, it can be seen from (2) that the objective function is a perturbation of a smooth function due to the noise of the distance measurements; this is called a small residual problem when the perturbation is small. Thus, some oscillation around the minimizer is inevitable when the iterations reach steady state. Since each sensor node suffers from different instantaneous noise, the lack of consensus in our proposed diffusion LM method would be further enlarged.
To ensure convergence towards a consensus state over the network, we introduce consensus averaging strategies, which have been used in deep learning [21] and in distributed control of multi-agent systems [22]. Recall from the diffusion GN method that step (13) is called a consensus step, while step (14) is the GN update with a network-wide diffusion scheme. The consensus strategy (13) aggregates the estimates from the neighborhood by a linear weighted combination and converges to the average of the desired estimates as the iterations progress.
Assume that the objective of the diffusion LM parameter update is to achieve the network-wide average
$$\bar{\mu} = \frac{1}{N}\sum_{k=1}^{N} \mu_k^*, \qquad (31)$$
where $\mu_k^*$ is the value of $\mu_k^i$ in the diffusion LM algorithm as $i \to \infty$. Similar to (13), each node implements the consensus recursion below by interacting with its neighbors:
$$\mu_k^{i+1} = \sum_{l \in N_k} w_{k,l}\, \mu_l^i, \qquad (32)$$
where $w_{k,l}$ has the same definition as in (15). The following corollary from classical consensus strategies [23] provides sufficient conditions under which the iterates $\mu_k^i$ converge to $\bar{\mu}$ as $i \to \infty$. Corollary 1: The consensus iterates $\mu_k^i$ generated by (32) converge to the desired average $\bar{\mu}$ given in (31) when the following three conditions are satisfied: 1) The network is connected, i.e., there exists a path connecting any two arbitrary nodes in the network.
2) The weight matrix $W$ is doubly stochastic, i.e.,
$$\sum_{k=1}^{N} w_{k,l} = 1 \ \text{for all } l, \qquad \sum_{l=1}^{N} w_{k,l} = 1 \ \text{for all } k.$$
3) $w_{k,l} = 0$ if $l \notin N_k$ and $w_{k,l} > 0$ for $l \in N_k$. Condition (1) is mandatory, while conditions (2) and (3) can be combined and relaxed to (15) for the diffusion GN method [24]. Many elegant methods provide alternative aggregation rules that generate doubly-stochastic or right-stochastic matrices, such as the Laplacian rule, the Metropolis rule, and the relative-degree rule [23]. Here, we use the same rule for (16) and (32), since our algorithm is insensitive to this selection.
Next, we analyze how the consensus estimate is achieved by the iteration (32). As the algorithm converges, $\rho_k^i$ becomes smaller and possibly even negative, so $\mu_k^i$ is repeatedly multiplied by $p_2 > 1$ and the LM update step $d_{k,LM}^i$ is gradually dominated by $\mu_k^i$ (or, equivalently, by $\bar{\mu}$) as $i \to \infty$. Consequently, $d_{k,LM}^\infty \approx d_{l,LM}^\infty$ for all nodes $k$ and $l$. Based on the diffusion LM algorithm (16) and (17), the estimates of all nodes finally reach their average, and the oscillations caused by distance measurements with Gaussian noise are eliminated.
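The consensus recursion (32) can be illustrated on a small ring network. This is our own toy example, with Metropolis-style weights chosen so that W is doubly stochastic on a connected graph, as Corollary 1 requires:

```python
import numpy as np

# Ring of N nodes with Metropolis weights: W is symmetric and doubly
# stochastic, satisfying the conditions of Corollary 1 on a connected graph.
N = 6
W = np.zeros((N, N))
for k in range(N):
    for l in ((k - 1) % N, (k + 1) % N):
        W[k, l] = 1.0 / 3.0                  # 1/max(n_k, n_l) with n_k = 3
    W[k, k] = 1.0 - W[k].sum()

mu = np.array([0.5, 4.0, 1.0, 2.5, 0.1, 3.9])    # heterogeneous local mu_k
target = mu.mean()                                # desired average, cf. (31)
for i in range(200):
    mu = W @ mu                                   # consensus recursion (32)

print(np.allclose(mu, target, atol=1e-6))
```

Because W is doubly stochastic, repeated averaging preserves the network mean while shrinking the disagreement between nodes geometrically.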
As a result, the consensus-based diffusion LM algorithm is described in Algorithm 1. It should be noted that only $n_k$ scalar communications are added in comparison to the diffusion GN algorithm: receiving $\{\mu_l^i\}$ from the $n_k - 1$ neighbors and broadcasting the scalar quantity $\mu_k^i$. In the next section, one will see the apparent performance improvement at the expense of this acceptable communication cost.

Algorithm 1: Consensus-Based Diffusion LM Algorithm
1: Given the same initial point $x^0$ for all nodes, weights $w_{k,l}$ satisfying (15), and auxiliary parameters $\xi$, $p_1$, $p_2$, $\rho_{high}$, $\rho_{low}$
2: for every iteration $i \ge 1$ do
3:   for node $k = 1$ to $N$ do

V. PERFORMANCE EVALUATION
In this section, we present numerical results to confirm the performance of the proposed consensus-based diffusion LM method for collaborative localization in WSNs. Owing to its applicability to the general NLLS model (3) over networked systems, we also investigate the performance of our method on small residual NLLS problems, such as the well-known logistic regression and least squares problems, which are typical examples arising frequently in large-scale machine learning scenarios, including distributed classification [25] and convolutional neural networks [26].

A. SOLVING THE LOCALIZATION PROBLEM
A connected network composed of $N = 20$ sensor nodes with the same communication range of 30 m is deployed in a $50 \times 50 \times 50$ m 3D space. Such a deployment conforms to practical situations. Fig. (1) illustrates a random network deployment mapped to the 2D plane for clearer presentation. The target is located in the centre of the area. The well-known Metropolis rule is used to generate the weight matrix, i.e.,
$$w_{k,l} = \frac{1}{\max\{n_k, n_l\}}, \quad l \in N_k,\ l \neq k, \qquad w_{k,k} = 1 - \sum_{l \in N_k,\, l \neq k} w_{k,l}.$$
The iterations start from $x_k^0 = 0$ for all $k$. The other auxiliary parameters are set as follows: $\xi = 0.0001$, $p_1 = 0.2$, $p_2 = 2$, $\rho_{high} = 0.75$ and $\rho_{low} = 0.25$. Moreover, the environmental noise follows a zero-mean normal distribution with standard deviation 0.1. The localization errors are averaged over all nodes.
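The weight construction used here can be sketched as follows. This is our own minimal reproduction of the stated setup (the node positions and random seed are ours, and connectivity of the sampled graph is not checked):

```python
import numpy as np

rng = np.random.default_rng(1)
N, radius = 20, 30.0
pos = rng.uniform(0.0, 50.0, (N, 3))         # nodes in the 50 x 50 x 50 m cube

# Neighbourhood N_k: all nodes within communication range, plus k itself.
adj = np.linalg.norm(pos[:, None] - pos[None, :], axis=2) <= radius
n = adj.sum(axis=1)                          # n_k = |N_k|, self included

W = np.zeros((N, N))
for k in range(N):
    for l in range(N):
        if l != k and adj[k, l]:
            W[k, l] = 1.0 / max(n[k], n[l])  # Metropolis rule
    W[k, k] = 1.0 - W[k].sum()               # put the remaining mass on self

# Metropolis weights are symmetric, hence W is doubly stochastic.
print(np.allclose(W.sum(axis=1), 1.0), np.allclose(W.sum(axis=0), 1.0))
```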
Firstly, we compare the diffusion GN method under different step sizes $\alpha$ with the proposed consensus-based LM method. The step sizes are carefully chosen as two typical large and small values, so that the test reflects representative convergence behavior. Fig. (2) depicts the convergence of the localization error and the steady-state performance. Not surprisingly, a smaller $\alpha$ causes the diffusion GN algorithm to converge more slowly and achieve a lower steady-state error, while a larger $\alpha$ causes faster convergence and higher error. Thus, a tradeoff must be considered when fixed step sizes are used for the diffusion GN algorithm. It can be seen from Fig. (2) that the proposed method resolves this tradeoff: both fast convergence and low steady-state error are achieved, each close to the best case of the diffusion GN method under the two different step sizes. We can conclude that our method addresses the dilemma of the fixed step size in diffusion GN.
More importantly, it can be seen clearly from Fig. (2) that the non-consensus LM method and the diffusion GN method with a large step size produce high steady-state oscillation. As described previously, such oscillation results from environmental noise and large step sizes. Our consensus LM method eliminates the problem entirely by using the proposed consensus strategy (32). For a better view, Fig. (3) shows the convergence behavior of each individual node: every node locates the true target after only about 20 time steps, although their initial errors are very different. Fig. (4) shows the evolution of the LM parameters of two arbitrary sensor nodes (6 and 13) during the initial 25 iterations. It is observed that the first iterates take a GN-type step by reducing the LM parameters. After that, jagged curves are observed, indicating the transition between GN and gradient descent. Overall, the iterations tend towards the steepest descent direction (i.e., larger $\mu_k$) as the algorithm runs.
It is known that a statistical test over many experiments will weaken, or average out, the oscillation effects among the compared methods. Nevertheless, to avoid the deviation caused by the randomness of an arbitrary network, Fig. (5) shows the convergence curves averaged over 100 randomly generated topologies. Except for the weakened oscillation due to averaging, the main results match the observations in Fig. (2), thus confirming the advantage of our method.
Finally, it should be noted that our method is not very sensitive to the choice of the auxiliary parameters, although they need to be set in advance. For example, the previous analysis suggests that $\xi$ can be set to a small value (e.g., $10^{-4}$) when the initial point $x^0$ is close to the target; otherwise, a larger value (e.g., 0.1) should be considered. Moreover, the thresholds $\rho_{high} = 0.75$, $\rho_{low} = 0.25$ and the factors $p_1 = 0.2$, $p_2 = 2$ are fully applicable to most collaborative localization problems, as well as to other NLLS problems such as the following machine learning problem.

B. EXTENSION TO DISTRIBUTED OPTIMIZATION PROBLEMS
In addition to distributed localization, many potential applications of our method for solving networked NLLS problems can be found in machine learning and distributed optimization. For example, in computer vision based on a distributed camera sensor network [27], problems such as image alignment, face alignment, or camera calibration can be posed as such nonlinear optimization problems, where sensor nodes cooperate to minimize the overall residual between the source and target intensity images. Some decentralized network optimization tasks, for example power system state estimation in smart grids [28], [29] and fusion estimation in multi-agent systems [30], are also typically formulated as NLLS problems across a network.
In this section, we extend our method to well-known distributed optimization problems, where the objective is to minimize a global objective function over a multi-agent networked system. In addition, we assume that the objective function has the NLLS form (3).

1) LOGISTIC REGRESSION PROBLEM
Logistic regression is a very successful classification model in which the outcome variable is binary or dichotomous. For example, it is known that there is a clear relationship between risk factors (e.g., age or smoking) and the presence or absence of certain significant diseases (e.g., COVID-19 [31]). Much research on finding such relationships has been done over the years [32]. For reasons of privacy and security, a single data center has limited capacity to access all the data required to perform the classification computation. Our method can effectively address this limitation through distributed computation, as shown below.
Consider a minimization problem of the same form as (3), where the local cost function $f_k(x)$ is defined as in (34). In (34), each agent $k$ holds the learning data samples $\{h_{k,j}, \gamma_{k,j}\}_{j=1}^{m_k}$, including the feature vector $h_{k,j} \in \mathbb{R}^M$ and the binary label/output $\gamma_{k,j} \in \{-1, 1\}$, and $\tau$ is a positive regularization parameter that avoids overfitting. To generate the label samples, we define $\gamma_{k,j} = \mathrm{sign}((x^*)^T h_{k,j} + \omega_{k,j})$, where $x^* \in \mathbb{R}^M$ is the global optimal value. The entries of $x^*$ and $h_{k,j}$ are generated from the Gaussian distribution $\mathcal{N}(0, 1)$, and the offset parameter $\omega_{k,j}$ is generated from $\mathcal{N}(0, 0.1)$. The other parameters are as follows: $\xi = 1$, $M = 4$, $m_k = 20$ for all $k$, and $\tau = 0.1$. To emulate large-scale machine learning, we generate a random network of $N = 30$ agents, each with 8 neighboring agents chosen uniformly at random. We evaluate convergence in terms of the relative error with $x_k^0 = 0$.

Fig. 6 depicts the convergence behaviors of the different algorithms. It is observed that the non-consensus LM method does not converge; the diffusion GN with a large step size exhibits an obvious steady-state oscillation in spite of fast convergence; and the DGD method converges slowly under either a large or a small step size. The proposed consensus LM method has a distinct advantage in both convergence speed and accuracy compared with the diffusion GN and the well-known distributed gradient descent (DGD), which is the most popular method for solving the above logistic regression problem. As described previously, the main reason is that our method achieves a smooth transition between the GN direction and gradient descent. Moreover, the rugged behavior of the diffusion GN with a large step size ($\alpha = 0.005$) is eliminated, while faster convergence is achieved. In the absence of the consensus implementation, the non-consensus LM method fails to converge and diverges seriously from the optimal value, as depicted by the top curve in Fig. 6.
Fig. 7 shows the evolution of the damping parameter $\mu$, which tends to stability after the consistency adjustment. Combining Fig. 6 and Fig. 7, it can be seen that for this example the proposed method follows the GN direction (i.e., small $\mu$) at the initial stage and the first-order gradient direction (i.e., large $\mu$) later, with better steady-state performance. The effect of consensus convergence is shown in Fig. 8, where all 30 agents converge to a neighborhood of $x^*$ after the initial auto-tuning.
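The synthetic classification data described above can be generated as in the following sketch. The exact NLLS residual form of the local cost is the one given in (34); the `local_cost` shown here uses the standard regularized logistic loss only as an illustrative stand-in, and the seed and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, m_k, tau = 4, 20, 0.1

# Ground-truth parameter and one agent's local samples.
x_star = rng.standard_normal(M)
H = rng.standard_normal((m_k, M))            # feature vectors h_{k,j}
omega = rng.normal(0.0, np.sqrt(0.1), m_k)   # offset noise, variance 0.1
gamma = np.sign(H @ x_star + omega)          # binary labels in {-1, +1}

def local_cost(x):
    # Standard regularized logistic loss for agent k; (34) recasts this
    # in squared-residual form so the diffusion LM method applies.
    return np.sum(np.log1p(np.exp(-gamma * (H @ x)))) + 0.5 * tau * x @ x
```

At $x = 0$ every sample contributes $\ln 2$ to the unregularized loss, which gives a convenient sanity check on the implementation.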

2) LEAST SQUARES PROBLEM
We also report numerical results illustrating the effectiveness of Algorithm 1 when applied to the decentralized least squares problem. The least squares model plays an important role in many real applications, including image recognition [33], spectrum sensing [34] and atmospheric density model calibration [35]. Typically, least squares can be formulated as a global minimization problem over a multi-agent system such as (33), with the following local cost function for each agent. The network is of size $N = 30$ with an average of 8 neighboring agents. Each entry of $A_k \in \mathbb{R}^{M \times M}$ is generated from the standard Gaussian distribution $\mathcal{N}(0, 1)$, the entries of $x^*$ are i.i.d. samples from $\mathcal{N}(0, 1)$, and $v_k$ is unknown noise that follows $\mathcal{N}(0, 0.1)$. As an example, in spectrum sensing based on a cognitive radio network [36], which aims at detecting available spectrum bands, $x$ can be interpreted as the signal strengths of the spectrum channels, $b_k$ denotes the time-domain measurement with noise $v_k$, and $A_k$ is the channel gain. To estimate $x$, a set of geographically nearby cognitive radios collaboratively solve the optimization problem (36). More details on cooperative spectrum sensing can be found in [36].
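The local data for this experiment can be generated as sketched below. The centralized stacked solution is included only as the reference that the network approximates cooperatively; the seed and the error threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 4            # number of agents and problem dimension
x_star = rng.standard_normal(M)

# Each agent k holds a Gaussian matrix A_k and noisy measurements b_k.
A = [rng.standard_normal((M, M)) for _ in range(N)]
b = [A_k @ x_star + rng.normal(0.0, np.sqrt(0.1), M) for A_k in A]

# Centralized reference: the minimizer of sum_k ||A_k x - b_k||^2,
# obtained by stacking all local systems. The distributed algorithm
# approximates this solution without sharing raw data.
A_stack, b_stack = np.vstack(A), np.concatenate(b)
x_hat, *_ = np.linalg.lstsq(A_stack, b_stack, rcond=None)
```

With $N \times M = 120$ equations and noise variance 0.1, the stacked estimate lands close to $x^*$, which is the accuracy target against which the decentralized iterates are judged.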
The convergence behavior of the proposed consensus LM method is shown in Fig. 9. Note that for the compared DGD and diffusion GN methods there is no optimal choice of step size; therefore, in this experiment a large and a small step size are used for each of them. Not surprisingly, the consensus LM method achieves excellent overall performance compared with the non-consensus version as well as with DGD and diffusion GN under either step size. Fig. 10 shows the change of $\mu_k$ on an arbitrary agent $k$ over time. In this plot, it can be seen clearly that the proposed method switches back and forth between the gradient direction and the GN direction. Moreover, the effect of consensus convergence is confirmed in Fig. 11 as expected, where the trajectories of the estimates tend towards consistency.

C. COMPARISON ON COMPUTATIONAL COMPLEXITY
The computational cost of our method, as well as that of the compared methods, depends almost entirely on two operations: the gradient of a nonlinear objective function and the inverse of a matrix. Firstly, the computational cost of the gradient is problem dependent, so no general big-O expression can be given; however, for comparison's sake, we can assume a constant cost for computing the gradient with respect to the vector $x$ once the objective function is given. Secondly, the studies [18], [37] have indicated that the computational cost of a generalized inverse is about three to four times that of solving an invertible system of the same dimension. Accordingly, we summarize the computational complexity for each node per iteration in Table 1.
From Table 1, one can see that all compared methods have the same complexity for the gradient computation, since all of them are based on the first-order derivative. For the inversion operation, we assume a complexity of $O(M^3)$ for an $M \times M$ dense matrix using Gauss-Jordan elimination [38]. Our proposed method has moderate complexity: higher than the DGD method but lower than the diffusion GN method, whose complexity is the highest due to the computation of the generalized inverse. In conclusion, given its smaller number of iterations, our method has acceptable total cost.
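The structural reason the LM update is cheaper than the GN update can be shown on a small example: when the GN Hessian $J^T J$ is rank deficient, the diffusion GN method must resort to the Moore-Penrose generalized inverse, whereas the LM regularization $+\mu I$ makes the system invertible so a single linear solve suffices. The dimensions and seed below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, mu = 4, 0.1
J = rng.standard_normal((2, M))   # fewer residuals than unknowns
H_gn = J.T @ J                    # Gauss-Newton Hessian: singular (rank 2)
g = rng.standard_normal(M)        # stand-in for the local gradient

# Diffusion GN must fall back on the costlier generalized inverse:
d_gn = -np.linalg.pinv(H_gn) @ g

# The LM damping makes the system positive definite, so one solve suffices:
d_lm = np.linalg.solve(H_gn + mu * np.eye(M), -g)
```

Both steps cost $O(M^3)$ asymptotically, but per [18], [37] the generalized inverse carries roughly a three- to four-fold larger constant than the plain solve.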

VI. CONCLUSION
In this paper, we proposed a consensus-based diffusion LM method, which is motivated by collaborative localization in WSNs and can also be used to solve general small-residual NLLS problems over a multi-agent networked system. The proposed method has obvious advantages over the recent diffusion GN method in terms of convergence speed, accuracy and computational cost. Most importantly, the problems of non-consensus estimates across the network and of singularity of the GN Hessian matrix are addressed; thus, each node in the network can obtain a consistent, highly accurate result while avoiding the computationally expensive generalized inverse. Finally, we extended the results to distributed optimization by solving two types of machine learning problems.

APPENDIX A PROOF THAT THE DIFFUSION LM STEP IS A DESCENT DIRECTION
Given any nonzero vector $x \in \mathbb{R}^M$, we have
$$x^T \left( J_k^T(z) J_k(z) + \mu_k^i I \right) x = \|J_k(z)x\|^2 + \mu_k^i \|x\|^2 > 0.$$
The above inequality follows from the facts that $\|J_k(z)x\|^2 \geq 0$ and $\mu_k^i > 0$. Thus, the matrix $J_k^T(z) J_k(z) + \mu_k^i I$ is positive definite.
For any node $k$, for small $\|d_{k,LM}^i\|_2$, we write the first-order Taylor expansion of the local cost function as
$$F_k(z + d_{k,LM}^i) \approx F_k(z) + \nabla F_k^T(z)\, d_{k,LM}^i. \tag{38}$$
One can say that $d_{k,LM}^i$ is a descent direction for node $k$ if the new cost $F_k(z + d_{k,LM}^i)$ is less than the previous cost $F_k(z)$. Using the approximation (38), we obtain the sufficient condition for decreasing the cost: $\nabla F_k^T(z)\, d_{k,LM}^i < 0$. From the definition (18), $d_{k,LM}^i = -\left( J_k^T(z) J_k(z) + \mu_k^i I \right)^{-1} \nabla F_k(z)$, so that
$$\nabla F_k^T(z)\, d_{k,LM}^i = -\nabla F_k^T(z) \left( J_k^T(z) J_k(z) + \mu_k^i I \right)^{-1} \nabla F_k(z) < 0,$$
since the matrix $J_k^T(z) J_k(z) + \mu_k^i I$, and hence its inverse, is positive definite. Therefore, $d_{k,LM}^i$ is a descent direction.
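The descent property can be checked numerically on a toy NLLS instance. The residual function, evaluation point, and damping value below are illustrative assumptions; the check confirms that the LM step gives $\nabla F^T d < 0$ and actually decreases the cost here.

```python
import numpy as np

# Toy NLLS: F(x) = 0.5 * ||r(x)||^2 with a two-dimensional residual.
def r(x):
    return np.array([x[0]**2 - 1.0, x[0] * x[1] - 2.0])

def jac(x):
    # Jacobian of r at x.
    return np.array([[2.0 * x[0], 0.0],
                     [x[1],       x[0]]])

x, mu = np.array([2.0, 3.0]), 0.5
J = jac(x)
g = J.T @ r(x)                    # gradient of F at x

# LM step: the damped system is positive definite, so a plain solve works.
d_lm = np.linalg.solve(J.T @ J + mu * np.eye(2), -g)

descent = g @ d_lm                # negative iff d_lm is a descent direction
```

Because $J^T J + \mu I$ is positive definite, `descent` is guaranteed to be negative for any nonzero gradient, mirroring the proof above.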