Distributed Sampling Rate Offset Estimation Over Acoustic Sensor Networks Based on Asynchronous Network Newton Optimization

Sampling rate synchronization is an inevitable issue in distributed acoustic sensor networks. In this paper, an analytical sampling rate offset (SRO) estimation approach is first proposed, and it is then extended to a distributed method that is suitable for acoustic sensor networks with arbitrary communication graphs. Specifically, a linear-phase drift model in the short-time Fourier transform domain is used to approximate the SRO between each pair of microphone nodes. Next, after unwrapping the temporally averaged phase information, SROs are recovered analytically via a new weighted-sum criterion. Based on this, a distributed cost function is established at each node to obtain the SROs of all nodes simultaneously in a distributed manner. Finally, a state-of-the-art distributed algorithm named asynchronous network Newton optimization is adopted to carry out the distributed SRO estimation. The proposed method can effectively estimate the SROs among acoustic sensor nodes in noisy and reverberant environments. Compared with existing approaches, it does not require an external central processor, and only local communications among nodes are needed. Experimental results confirm the validity of the proposed method.


I. INTRODUCTION
In recent years, there has been an increasing number of intelligent devices around us that contain one or more microphones, such as smartphones, laptops, smart screens, and so forth. They can constitute an ad-hoc acoustic sensor network, which vastly increases the amount of spatial information compared with a traditional single microphone (or microphone array). Therefore, it boosts the performance of many audio processing tasks, e.g., speech enhancement [1], source separation [2], [3], and speaker localization and tracking [4], [5], [6]. In such a network, each node possesses its own independent clock system, resulting in asynchronous sampling rates among nodes. To deal with this problem, several sampling rate offset (SRO) estimation methods [7], [8], [9], [10] have been proposed, but most of them only work in a centralized way. Instead, this paper addresses the sampling rate mismatch problem in a distributed manner. Despite setting a nominal sampling rate over all nodes, the actual sampling rates often deviate from the nominal value, due to the manufacturing tolerances (or age-related changes) of oscillators and crystals in different node clock systems. As a result, the SROs among nodes generally range from a few ppm (parts per million) to many hundreds of ppm [11], leading to an additional time-varying phase shift between different node outputs. The introduced phase shift is approximately linear in time, so its impact is negligible at the beginning of sampling. However, as the sampling time increases, the natural phase relationships among node output signals are severely destroyed, which in turn affects the spatial information acquired by microphone networks. Consequently, this significantly degrades the performance of many audio processing tasks [12], [13], [14], [15], [16]. Therefore, sampling rate synchronization is crucial for distributed acoustic sensor networks.
Currently, several effective SRO estimation methods have been proposed, which can be divided into two classes: time-stamp-based and correlation-coefficient-based methods.
Time stamps can be exchanged between two nodes through network communication protocols, which forms a well-established model for SRO estimation. Two simple but robust two-way exchange algorithms were proposed in [17], [18], where the SROs are implicit in the permutation of the transmitted time stamps. In these methods, nevertheless, the obtained SROs are affected by the time-stamp transmission times, which are related to certain wireless links. To reduce this effect, Schmalenstroeer et al. [19] adopted the Kalman filter to fuse the SROs calculated from multiple time stamps, where a wireless 802.15.4 link is employed and its corresponding additive frequency offset errors are proved to obey a Gaussian distribution. In [20], they further modified the method of [19] by using a gossiping algorithm, which can estimate all the SROs among nodes in a distributed manner. In general, the SRO accuracies of these time-stamp-based methods are proportional to the number of message exchanges; thus, they require a large amount of communication among nodes.
The second category of methods is purely acoustic: they synchronize the node sampling rates by processing the audio signals received by the nodes. Most of these methods estimate SROs based on the linear-phase drift model in the discrete short-time Fourier transform (STFT) domain [10]. Wang et al. [21] proved that the correlation coefficient between two microphone signals tends to reach its highest value when their SRO is compensated. Based on this finding, they recovered the SRO by searching for the highest correlation coefficient, which achieves high estimation accuracy. Similarly, Cherkassky et al. [22] formulated a cost function via the covariance of two microphone signals, from which the SRO can be deduced by using the gradient ascent algorithm. Cherkassky and Gannot [23] estimated the clock skews based on a wideband correlation processor, where the Newton-Raphson method is applied to solve for the SRO iteratively. In [24], different from the methods mentioned above, the analytical solution for the SRO was derived from the instantaneous correlation coefficients (ICCs) averaged across frames. This remarkably reduces the computational complexity, but its accuracy is relatively low. To overcome this issue, Schmalenstroeer et al. [25] proposed a multi-stage coherence-drift-based sampling rate synchronization method, which can not only deduce SROs analytically but also maintain high accuracy. Recently, Wang et al. [26] presented an active sampling rate calibration method, which has a strong tolerance for deviations among the microphone frequency responses. Still, all these methods can only be applied in distributed microphone networks in a centralized way, which requires an external central processor. Besides, it is unclear how these approaches can be scaled to large acoustic sensor networks with many nodes.
To the best of our knowledge, there is no distributed implementation for correlation-coefficient based SRO estimation methods. In this paper, such a gap will be filled by a novel distributed SRO estimation method. An analytical SRO estimation (ASROE) scheme is first presented for a pair of neighboring nodes, by unwrapping the weighted average coherence drift (WACD) phase information. Then, a local cost function is established at each microphone node, all of which constitute a distributed cost function. Finally, the distributed optimization problem is solved by the asynchronous network Newton method [27]. Compared with the typical methods [24], [25], the proposed method takes advantage of all frequency information in SRO estimation, by eliminating the phase ambiguity that appears in higher frequencies. Moreover, it requires only local communications among neighboring nodes and does not need an external central processor, which is more suitable for distributed acoustic sensor networks.
The contributions of this paper are as follows: 1) a novel SRO estimation approach named the ASROE method is proposed by employing unwrapping operators and deriving a new weighted-sum criterion; 2) the ASROE method is extended to a distributed method by constructing a distributed cost function; 3) a state-of-the-art distributed algorithm [27] is used to solve the formulated distributed optimization problem.
The paper is structured as follows. The sampling rate mismatch problem is described in Section II. An analytical SRO estimation algorithm is derived in Section III, and its distributed implementation is given in Section IV. Numerical experiments are carried out in Section V, followed by conclusions in Section VI.
Notation: In this paper, we use lowercase letters for scalars, and denote vectors and matrices by bold lowercase and bold uppercase letters, respectively. The superscript T denotes the matrix transpose, and u_I denotes an I × 1 vector with all elements equal to one. The notations (·) and [·] represent continuous-time and discrete-time functions, respectively.

II. SAMPLING RATE MISMATCH PROBLEM
A. Sampling Rate Offset Model
We consider the setup where a single acoustic source propagates in space and is captured by an acoustic sensor network with I nodes. For the sake of simplicity and without loss of generality, let us suppose that each node in the network consists of a single microphone. Accordingly, the continuous-time signal captured by the ith node is given by

x_i(t) = h_i(t) * s(t) + v_i(t),  (1)

where t represents the index of continuous time, s(t) is the acoustic signal emitted by the sound source, h_i(t) is the room impulse response (RIR) between the source and node i, '*' is the linear convolution operator, and v_i(t) is the ambient noise at node i. In such a sensor network, each node is equipped with an independent ADC and clock source. Due to manufacturing tolerances and age-related changes, the actual sampling rates always deviate from their nominal values within a certain tolerance, so the nodes capture the acoustic signals asynchronously. Specifically, x_i(t) is sampled as the discrete-time signal x_i[n] by node i with the sampling rate f_i = f_s + ε_i, where f_s and ε_i indicate the nominal sampling frequency and the SRO of node i, respectively. The relationship between x_i[n] and x_i(t) can be written as

x_i[n] = x_i(n/f_i) = x_i(n/(f_s + ε_i)).  (2)

As demonstrated in the literature [21], [24], [25], SROs introduce approximately linear-phase drifts among x_i[n] (i = 1, 2, . . ., I) in the short-time Fourier transform (STFT) domain. This signal model is the fundamental theory behind the proposed method; thus, a detailed derivation is given here. After applying an N-point window ω[n], the STFT of x_i[n] with shift N_s can be expressed as

X_i[k, l] = Σ_{n=0}^{N−1} x_i[n + l N_s] ω[n] e^{−j2πkn/N},  (3)

where j = √−1 is the imaginary unit, and k and l denote the frequency and frame indices, respectively. Generally, the SRO ε_i is much smaller than f_s, i.e., ε_i ≪ f_s.
In this case, X_i[k, l] can be further approximated as

X_i[k, l] ≈ Σ_{n=0}^{N−1} x_i((n + l N_s)/f_s − l N_s ε_i/f_s²) ω[n] e^{−j2πkn/N}.  (4)

Let X̄_i[k, l] denote the STFT of the signal x_i((n + l N_s)/f_s), which is sampled at the nominal frequency f_s; then we have

X_i[k, l] ≈ X̄_i[k, l] e^{−j2πklN_s ε_i/(N f_s)}.  (5)

It can be found from (5) that ε_i leads to a linear-phase drift for X_i[k, l], growing with the frame index l. It also indicates that the phase errors caused by SROs are small enough when sampling begins. However, the more signal frames are sampled, the larger the phase errors introduced in the received microphone signals. This will dramatically degrade the estimation accuracy of TDoAs or ToAs, especially in scenarios with a long sampling time (e.g., audio/video conferences).
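As a minimal numeric sketch of the drift term in (5), the following snippet evaluates 2πklN_s ε_i/(N f_s) for a hypothetical SRO; the parameter values (SRO, window sizes) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

fs = 16000.0       # nominal sampling rate f_s in Hz
eps_i = 2.0        # assumed SRO of node i in Hz, i.e. f_i = f_s + eps_i
N, Ns = 1024, 256  # STFT window length and frame shift

def drift_phase(k, l):
    """Phase drift 2*pi*k*l*Ns*eps_i/(N*fs) accumulated at bin k, frame l."""
    return 2.0 * np.pi * k * l * Ns * eps_i / (N * fs)

# The drift grows linearly with the frame index l: negligible for the
# first frames, but many radians after roughly 30 s of audio.
frames_30s = int(30 * fs / Ns)    # number of frame shifts in 30 s
print(drift_phase(100, 1))        # a tiny fraction of a radian
print(drift_phase(100, frames_30s))
```

This makes concrete why short recordings are barely affected while long ones are severely desynchronized.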

B. Impact of Sampling Rate Mismatch on Source Localization
To intuitively illustrate the adverse effect of SROs, their impact on source localization is discussed below. Five microphone nodes are employed to localize a fixed sound source in a room of size 8 m × 6 m, as depicted in Fig. 1, where the sampling frequencies of the five nodes are set to 16 kHz, 16.001 kHz, 16.002 kHz, 16.003 kHz, and 16.004 kHz, respectively. A typical source localization method named the linear-correction least-squares (LCLS) approach [28] is applied in a noise-free and anechoic environment. In this way, the sound source is localized at different sampling time instants, i.e., t = 10 s, t = 20 s, and t = 30 s. Fig. 1 shows that the source localization error continuously increases over the sampling time. Specifically, the mean source localization error is about 0.18 m when t = 10 s, which increases to about 0.89 m when t = 30 s. Beyond this, the performance of source localization degrades even more severely when noise and reverberation are considered. In practice, apart from source localization, most audio processing tasks also suffer from the impact of sampling rate mismatch. Consequently, SRO estimation methods are indispensable in distributed microphone networks.

C. Existing Sampling Rate Calibration Approaches
In this subsection, three typical SRO estimation approaches, which are closely related to the proposed method, are briefly discussed.

1) Correlation Maximization-Based Approach [21]: The correlation coefficient between the signals of nodes i and m is defined in (6), where Σ_l denotes Σ_{l=1}^{L} and L is the total frame number. According to (5), (6) is equivalent to (7), where the time-independent term usually depends on the sampling start points and the TDoA from the source to nodes i and m, while the time-variant phase term e^{j2πklN_s(ε_m − ε_i)/(f_s N)} depends on the SRO between nodes i and m. In other words, the above two terms are independent of each other. For this reason, the time-domain averaging of the numerator in (7) satisfies (9), where ε̂_im is the estimate of ε_im. Obviously, (9) is a non-convex non-linear optimization problem, and its solution is difficult to express in an analytical form. Because there is only one variable ε_im to be optimized, an exhaustive search scheme is employed to solve for ε_im in [21]. In this way, an accurate SRO can be obtained by setting a small search step. Note, however, that a smaller search step also brings a large amount of computation.
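The exhaustive search can be illustrated with a toy sketch: two synthetic STFT-domain signals whose cross terms carry the linear-phase drift of (5), and a grid search for the SRO that maximizes the magnitude of the averaged cross-spectrum. All sizes, the grid, and the true SRO value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, N, Ns = 16000.0, 1024, 256
K, L = 64, 200
true_sro = 1.5  # Hz (illustrative)

Xi = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
k = np.arange(K)[:, None]
l = np.arange(L)[None, :]
# Node m's STFT equals node i's up to the SRO-induced phase drift.
Xm = Xi * np.exp(1j * 2 * np.pi * k * l * Ns * true_sro / (N * fs))

def objective(eps):
    """|sum of cross-spectra| after compensating a candidate SRO eps."""
    comp = np.exp(1j * 2 * np.pi * k * l * Ns * eps / (N * fs))
    return np.abs(np.sum(Xi * np.conj(Xm) * comp))

grid = np.arange(-4.0, 4.0, 0.01)  # exhaustive search, 0.01 Hz steps
est = grid[int(np.argmax([objective(e) for e in grid]))]
print(est)  # close to the true 1.5 Hz
```

Note how the cost of the search scales with the grid resolution: halving the step doubles the number of objective evaluations, which is the computational drawback discussed above.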

2) Average Coherence Drift (ACD) and Weighted ACD Approaches [24], [25]: Instead of using all frames of the signals, the ICC between nodes i and m is defined via temporally averaged signals in the STFT domain, i.e., (10), where L_w is the number of frames employed in the ICC calculation. Then, the temporally averaged phase information corresponding to the ACD approach and the weighted ACD (WACD) approach is defined in (11) and (12), respectively, where l_0 is the frame shift of the correlation in (12). The SRO ε_im between nodes i and m can either be estimated by the ACD approach [24] with (13) or by the WACD method [25] with (14), where ∠{·} denotes the phase operator, ε_max is the maximum SRO that occurs between nodes i and m, and K_max = ⌊N f_i/(2 l_0 N_s ε_max)⌋ is the maximum frequency bin index without phase ambiguity. Without loss of generality, f_i in (13) and (14) is set to f_i = f_s in [24], [25]. Evidently, both the ACD and the WACD methods can obtain the SRO in analytical form, which saves computation effectively. Generally, nevertheless, the maximum SRO between each pair of nodes is unknown in the real world, which also leads to an unknown K_max. Similar to references [21], [24], [25], most of the existing methods only consider SRO estimation for a pair of nodes. In these ways, it is hard to obtain the SROs of all nodes simultaneously, or to estimate SROs in a distributed manner.
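The coherence-drift idea can be sketched on the same kind of toy data: the product of cross-spectra taken l_0 frames apart has phase 2πkl_0N_s ε/(N f_s), so ε can be read off analytically per bin. Sizes and the true SRO below are illustrative assumptions, and no ambiguity occurs because all bins stay below K_max.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, N, Ns, l0 = 16000.0, 1024, 256, 10
K, L = 64, 200
true_sro = 1.5  # Hz (illustrative)

Xi = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
k = np.arange(1, K + 1)[:, None]   # skip bin 0, which carries no phase info
l = np.arange(L)[None, :]
Xm = Xi * np.exp(1j * 2 * np.pi * k * l * Ns * true_sro / (N * fs))

P = Xm * np.conj(Xi)                                  # cross power spectrum
z = np.sum(P[:, l0:] * np.conj(P[:, :-l0]), axis=1)   # drift over l0 frames
# Analytic per-bin estimates, valid while |angle| < pi (no ambiguity here).
eps_hat = np.angle(z) * N * fs / (2 * np.pi * k[:, 0] * l0 * Ns)
print(np.mean(eps_hat))  # close to 1.5
```

In contrast to the grid search, this costs a single pass over the data, which is exactly the computational advantage of the analytic ACD/WACD family.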

III. ANALYTICAL SRO ESTIMATION FOR A PAIR OF NODES
In this section, we only consider the blind SRO estimation for a pair of nodes (i.e., nodes i and m). Namely, the task is to blindly estimate ε_im (ε_im = ε_m − ε_i) from the microphone signals x_i[n] and x_m[n]. Without loss of generality, the sampling frequency f_i is set to f_i = f_s.

A. Analytical SRO Estimation
As demonstrated in (13) and (14), a constraint (i.e., k ≤ K_max) is imposed on the frequency bin k to avoid phase ambiguity. Meanwhile, however, the phase information corresponding to frequency bins k > K_max is discarded. If this discarded information is exploited properly, it can further improve the accuracy of the SRO estimates. To this end, the constraint k ≤ K_max is relaxed in this section.
Specifically, inspired by [25], the WACD information in (15) is utilized here, whose angle information can be expressed as (16). Such a direct operation may cause phase ambiguity when k exceeds the unambiguous range; to avoid this circumstance, [25] requires k < N f_i/(2 l_0 N_s ε_max). Alternatively, a non-linear unwrapping algorithm is adopted here to eliminate this constraint. Define φ_im as the vector form of (16), i.e., (17), and its median filtering values are given by (18), in which k_0 is the filter order. The median filtering in (18) can effectively remove pulse values and provide more accurate phase information. Then, the unwrapped angles are given by (20), where ϑ(·) is the unwrapping operator, which proceeds along k = 0, 1, . . . , K by changing absolute jumps greater than π to their 2π complements. Afterward, the SRO can be estimated as (22), where Ψ_im[k] is the weight of the exponential form of the SRO estimate at frequency bin k, which is given by (23), and ς_im represents the unknown maximum SRO between nodes i and m, which is defined by (24), where ς_0 is a constant greater than 0 and Υ_im is the sum of Ψ_im[k], i.e., (25). Compared with (14), (22) not only eliminates the phase ambiguity, but also rewrites the weight ψ^WACD_im[k] as Ψ_im[k], which enlarges the weight difference between useful and useless frequency bins. The reason is that ψ^WACD_im[k] is computed from the normalized ICC in (10), while Ψ_im[k] is calculated from the non-normalized cross power spectrum. Besides, the role of ς_im is to avoid another phase ambiguity problem introduced as ε_im increases.
Based on (22), the SRO can be obtained without imposing any constraint on the frequency bins. Moreover, prior information about the maximum SROs is also not required. Furthermore, by using the conjugate symmetry in the STFT domain, we set K = N/2 to eliminate redundancy.

Remark 1: Although the phase ambiguity has been suppressed by the unwrapping operators in (21), it may still appear in (22) when the real sampling rate offset is too large. In other words, without the parameter ς_im, the phase in (22) may fall outside the range [−π, π]. Thus, we use ς_im to normalize the phase in (22).

Remark 2: If the parameter ς_0 is omitted in (24), ς_im can be seen as a rough estimate of the sampling rate offset between nodes i and m. Considering that this rough estimate is not very accurate, the parameter ς_0 is employed to eliminate the potential phase ambiguity. The value of ς_0 can be selected empirically, after measuring the accuracy of the rough estimate in some experiments.
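The unwrap-and-average step can be sketched as follows: for a large SRO the per-bin phases wrap beyond ±π, but median filtering plus unwrapping recovers the full linear phase, so all bins can be used. The SRO value, filter order, and the uniform weights (standing in for Ψ_im[k] in (22)) are illustrative assumptions.

```python
import numpy as np

fs, N, Ns, l0 = 16000.0, 1024, 256, 10
K = 256
true_sro = 40.0  # large enough that high bins wrap (illustrative)

k = np.arange(1, K + 1)
slope = 2 * np.pi * l0 * Ns * true_sro / (N * fs)   # phase per bin
phi = np.angle(np.exp(1j * slope * k))              # wrapped phases

def median_filter(x, order=3):
    """Odd-order median filter; edge bins are passed through unchanged."""
    h, y = order // 2, x.copy()
    for n in range(h, len(x) - h):
        y[n] = np.median(x[n - h:n + h + 1])
    return y

phi_u = np.unwrap(median_filter(phi))                 # remove 2*pi jumps
per_bin = phi_u * N * fs / (2 * np.pi * k * l0 * Ns)  # per-bin SRO estimates
est = float(np.mean(per_bin))                         # uniform-weight average
print(est)  # close to 40
```

Without the unwrapping step, only bins below K_max would be usable here, which for this SRO is a small fraction of the spectrum.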

B. Resampling
After obtaining the SRO ε̂_im between nodes i and m, the discrete signal x_m[n] from the mth node can be resampled via the sinc-interpolation algorithm [29] with (26), where 2L_0 + 1 is the window size. Remark 3: Any other resampling algorithm (e.g., [30]) can also be used in the proposed method. Since these approaches are not the focus of this paper, the classical sinc-interpolation algorithm [29] is employed to compensate for the SRO between nodes i and m.
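A windowed-sinc resampling sketch in the spirit of (26) evaluates the asynchronous signal at the fractional positions implied by the SRO estimate. The Hann-style taper, the choice L_0 = 32, and the test tone are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sinc_resample(x, sro_hat, fs=16000.0, L0=32):
    """Map samples taken at rate fs + sro_hat back onto the nominal grid."""
    n = np.arange(len(x))
    # Sample n of the nominal-rate grid lies at fractional index
    # n*(fs + sro_hat)/fs of the asynchronous sequence x.
    t = n * (fs + sro_hat) / fs
    y = np.zeros(len(x))
    for i, ti in enumerate(t):
        c = int(round(ti))
        m = np.arange(max(0, c - L0), min(len(x), c + L0 + 1))
        taper = 0.5 * (1 + np.cos(np.pi * (m - ti) / (L0 + 1)))  # Hann taper
        y[i] = np.sum(x[m] * np.sinc(ti - m) * taper)
    return y

# Usage: a 440 Hz tone recorded with a 4 Hz SRO, corrected back to 16 kHz.
fs, sro = 16000.0, 4.0
n = np.arange(2000)
x_async = np.sin(2 * np.pi * 440.0 * n / (fs + sro))  # asynchronous samples
y = sinc_resample(x_async, sro)
ref = np.sin(2 * np.pi * 440.0 * n / fs)
err = np.max(np.abs(y[100:1800] - ref[100:1800]))
print(err)  # small interpolation error away from the edges
```

The truncated sinc kernel trades accuracy for computation through L_0, which is why (26) keeps the window size 2L_0 + 1 as a free parameter.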

C. Time Synchronization and Multi-Stage SRO Compensation
In the proposed method, as well as in [21], [24], [25], the assumption that x_i[n] and x_m[n] have similar content in the lth frame is required. However, it does not hold if the TDoA τ^t_im from the source to nodes i and m is large. To mitigate this problem, τ^t_im can be estimated beforehand based on the GCC-PHAT method [31] with (27). Due to the effects of reverberation, noise, and SROs, τ̂^t_im may not be very accurate, but through (28), the microphone signals from different nodes can be synchronized roughly. However, the similarity of x_i[n] and x_m[n] in the same frame will also deteriorate for large SROs or long signals. To overcome this problem, the multi-stage SRO compensation scheme presented in [25] is adopted here to further improve the precision of the SRO estimates.
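A textbook GCC-PHAT sketch for the rough TDoA estimate is given below; this is the standard form, not necessarily the paper's exact implementation of (27)/(28), and the signal lengths are illustrative.

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Delay of y relative to x (seconds), via the phase transform."""
    n = len(x) + len(y)                    # zero-pad to avoid circular wrap
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = np.conj(X) * Y
    R /= np.maximum(np.abs(R), 1e-12)      # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(R, n)
    lags = np.concatenate((cc[-(len(x) - 1):], cc[:len(y)]))  # center lags
    return (int(np.argmax(lags)) - (len(x) - 1)) / fs

# Usage: y is x delayed by 5 samples, so the estimate is 5/fs seconds.
rng = np.random.default_rng(2)
x = rng.standard_normal(1024)
y = np.concatenate((np.zeros(5), x))[:1024]
print(gcc_phat(x, y, 16000.0) * 16000.0)  # about 5 samples
```

The PHAT weighting whitens the cross-spectrum, which sharpens the correlation peak and gives the robustness to reverberation mentioned above.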

IV. DISTRIBUTED SRO ESTIMATION OVER MICROPHONE NETWORK
In this section, the SRO estimation proposed in Section III is extended to a distributed method, which can obtain the SROs of all nodes simultaneously in a distributed manner.

A. Distributed Cost Function
The SROs ε̂_im (m ∈ N_i) between the ith node and its neighboring nodes are first estimated by the blind estimation method presented in Section III. At node i, according to the definition of ε_im, we have (29). Based on this, all the differences between the two sides of (29) are used to construct a total centralized cost function as (30), where ε = [ε_1, ε_2, . . . , ε_I]^T. It is obvious that (30) constitutes a convex optimization problem, whose closed-form solution can be derived easily. However, it requires the observation data (ε̂_im, i ∈ {1, 2, . . . , I}, m ∈ N_i) from all nodes, which can hardly be obtained at a single node when a non-fully connected distributed microphone network is considered. That is, the cost function (30) is only suitable for centralized computation. Instead, another cost function is formulated below for distributed solving. For convenience of distributed processing, let ε^i_m represent the local form of ε_m at node i; then a local cost function is constructed at node i as (31), where ε^i denotes a column vector that consists of all local variables at node i, i.e., ε^i = [ε^i_1, ε^i_2, . . . , ε^i_I]^T. Evidently, the local cost function at each node requires only the data from its neighboring nodes. By adding equality constraints to the sum of all local cost functions, a distributed cost function can be established as (32), where ε̂^i is the estimate of ε^i. The objective of (32) is to minimize all local cost functions, while keeping their variables equal to the corresponding common variables of their neighbors. Since the network is connected, the constraints collapse the feasible space of (32) to a hyperplane on which all local variables are the same. In this case, the sum of the local cost functions is equivalent to the centralized cost function (30). Therefore, the optimization problems (30) and (32) theoretically have the same optima. The difference is that (32) can be solved in a distributed manner without a central processor.
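As a small sanity sketch of the centralized cost (30): with noise-free pairwise observations ε̂_im = ε_m − ε_i on a connected graph and the anchor ε_1 = 0 (discussed in Section IV-D), least squares recovers every node's SRO. The graph and SRO values below are illustrative assumptions.

```python
import numpy as np

true_eps = np.array([0.0, 1.2, -0.7, 2.5])   # node SROs, with eps_1 = 0
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]     # a connected 4-node ring

rows = [np.zeros(4)]
rows[0][0] = 1.0                             # anchor row: eps_1 = 0
obs = [0.0]
for i, m in edges:
    r = np.zeros(4)
    r[m], r[i] = 1.0, -1.0                   # row encodes eps_m - eps_i
    rows.append(r)
    obs.append(true_eps[m] - true_eps[i])    # noise-free observations

A, b = np.array(rows), np.array(obs)
est = np.linalg.lstsq(A, b, rcond=None)[0]
print(est)  # recovers [0.0, 1.2, -0.7, 2.5]
```

The distributed formulation (32) reaches the same minimizer, but without ever assembling A and b at a single node.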

B. Asynchronous Network Newton Optimization
Problem (32) can be solved by distributed gradient descent methods [32], [33]. However, these methods converge slowly, since only first-order information is utilized. To accelerate convergence, a network Newton distributed optimization (NNDO) method that incorporates second-order information was proposed in [34]. Recently, Mansoori et al. [27] built upon the NNDO algorithm and developed an asynchronous network Newton optimization (ANNO) method, which works in an asynchronous setting without a central coordinator. Therefore, the ANNO method is adopted herein to solve (32).
The ANNO method reduces the synchronization wait in distributed iterations, and allows some nodes to compute faster and execute more iterations. As shown in Fig. 2, for instance, node i has a faster computation speed than node m (where m ∈ N_i) and broadcasts updated information to its neighbors once it finishes an iteration, while the information received from its own neighbors is stored in a local buffer at node i. At the beginning of each iteration, the newest neighbor information is extracted from the buffer and used to implement the current iteration. Specifically, letting q_i be a discrete-time index of node i for the distributed iteration, the update formula can be written as (33), which involves an iteration stepsize, a coefficient p_i related to the computation speed of node i, and the distributed Newton direction d^i_κ(q_i), computed by the recursion (34) and (35) after κ iterations. The matrices B_im in this recursion are given by (36), which involve the local Hessian matrix, the weight α of the local gradient and Hessian matrix, and the (i, m)th element w_im of a weight matrix W that satisfies (37), where Υ(·) denotes the spectral radius of a matrix [35]. In general, w_im = 0 if m ∉ N_i. Besides, d^m_κ(q_i) and ε^m(q_i) appearing in (34) and (35) indicate the latest local Newton direction and the latest local variable received from node m, respectively, up to time q_i.
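To make the Newton-direction recursion concrete, the following synchronous toy run applies the same D⁻¹(B d − g) splitting to a simple penalized consensus problem, min Σ_i (y_i − a_i)² + y^T(I − W)y/(2α), on a 4-node ring. This is a simplified synchronous stand-in for illustration only (the actual ANNO method iterates asynchronously with per-node buffers on the SRO cost), and W, α, κ, and the targets a are illustrative assumptions.

```python
import numpy as np

I4 = np.eye(4)
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])     # ring topology
W = (I4 + A) / 3.0                   # Metropolis weights (all degrees are 2)
a = np.array([1.0, 2.0, 3.0, 4.0])   # local targets
alpha = 0.01                         # penalty weight

Wd = np.diag(np.diag(W))
D = 2 * I4 + (I4 - Wd) / alpha       # diagonal part of the Hessian
B = (W - Wd) / alpha                 # off-diagonal part: H = D - B

y = np.zeros(4)
for _ in range(300):
    g = 2 * (y - a) + (I4 - W) @ y / alpha   # gradient of the penalized cost
    d = -np.linalg.solve(D, g)               # d^(0)
    for _ in range(4):                       # kappa = 4 recursion steps
        d = np.linalg.solve(D, B @ d - g)    # d^(p+1) = D^-1 (B d^(p) - g)
    y = y + d                                # unit stepsize for the sketch
print(y)  # all four local values close to the consensus mean 2.5
```

The recursion truncates the Neumann-series approximation of H⁻¹ at depth κ, so each direction only needs information from direct neighbors, which is what makes the scheme distributable.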

C. Distributed SRO Estimation
Based on the above discussions, the details of the proposed distributed SRO estimation method for microphone networks are summarized in Algorithm 1. At the initialization step of the proposed algorithm, each node determines the parameters p_i, W, α, κ, the iteration stepsize, and the initial variables ε^i(0). Then, each node collects the microphone signals from its neighbors and computes the corresponding STFT signals. Afterward, the power spectral densities (PSDs) between the signals from each node and its neighbors are computed with (15), and the unwrapped angle information is obtained by (20). Next, the SRO differences between each node and its neighbors are estimated in closed form through (22), according to which a distributed cost function is constructed with (32). Finally, the ANNO method is adopted to solve the above distributed optimization problem.

Algorithm 2: Multi-Stage Distributed SRO Estimation
1: Roughly synchronize the node signals with (27) and (28).
2: for z = 1, 2, . . . , Z do
3:   Execute Algorithm 1.
4:   Convert the sampling rate of the nodes with (26).
5: end for
As discussed in Section III, the similarity of x_i[n] and x_m[n] in the same frame will also deteriorate for large TDoAs, large SROs, or long signals, which in turn degrades the SRO estimation accuracy. To overcome this issue, the multi-stage strategy is also employed in the proposed distributed SRO estimation method, and the details are given in Algorithm 2.

D. Analyses of Identifiability Conditions and Transmission Bandwidth Requirements
As mentioned before, the distributed cost function (32) is equivalent to the centralized cost function (30). By observing (30), we can find that there are I unknown variables ε_i (i = 1, 2, . . . , I), while there are 2N_E observations ε̂_im (i = 1, 2, . . . , I, m ∈ N_i), where N_E is the number of edges in the network. Since the network is connected, we have N_E ≥ I − 1. However, most observations are dependent on each other. To be specific, ε̂_im = −ε̂_mi and, theoretically, ε̂_im = ε̂_ir + ε̂_rm. After removing these redundancies, only I − 1 independent observations are left. To ensure that the number of observations is greater than or equal to the number of unknowns, we set ε_1 = 0 in the SRO estimation process.
The main communication burden in microphone networks is the transmission of acoustic signals. In centralized methods, the received signals must be transmitted to the central processor to estimate the SRO between each pair of nodes, which consumes large bandwidth resources. Specifically, the corresponding transmission data is about I t_0 f_s b_0 bits, where t_0 and b_0 denote the time duration of the calibration signals and the number of bits per sample, respectively. Evidently, the overall transmission data increases dramatically with the number of nodes I. Comparatively, in the distributed manner, SROs can be estimated from the acoustic signals transmitted among neighboring nodes. Thus, the overall transmission data is about Σ_{i=1}^{I} |N_i| t_0 f_s b_0 bits, where the traffic for node i is only about |N_i| t_0 f_s b_0 bits. Although additional communications are needed to perform the distributed optimization (i.e., steps 8 and 10 in Algorithm 1), the traffic for each node rises slowly as the number of nodes increases. More importantly, the fusion center is unnecessary in the distributed way.
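A back-of-the-envelope sketch of the traffic figures above; the values of I, t_0, f_s, b_0, and the neighborhood sizes are illustrative assumptions. The relevant quantity for the distributed scheme is the per-node traffic |N_i| t_0 f_s b_0, which depends only on the local neighborhood size, not on I.

```python
# nodes, calibration duration (s), sampling rate (Hz), bits per sample
I, t0, fs, b0 = 6, 16, 16000, 16

centralized = I * t0 * fs * b0            # all signals sent to a fusion center
sizes = [2, 3, 2, 3, 2, 2]                # |N_i| for an example topology
per_node = [s * t0 * fs * b0 for s in sizes]
distributed_total = sum(per_node)
print(centralized, per_node[0], distributed_total)
```

Adding more nodes grows the centralized total linearly in I, while each node's own traffic stays bounded by its degree, which is the scalability point made above.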

V. EXPERIMENTS AND RESULT DISCUSSIONS
To evaluate the performance of the proposed method, some typical simulation experiments are carried out.

A. Simulation Setup
The simulated environment is a room of size 10 m × 6 m × 3 m with I = 6 nodes constituting a distributed acoustic sensor network, where each node consists of a single microphone. The RIRs between the source and the nodes are simulated with the well-known image method [36], and the sound speed is set to c_0 = 343 m/s. The calibration signals are convolved with the above RIRs and then corrupted by different levels of Gaussian white noise, yielding the final noisy and reverberant audio signals. The frame length is N = 1024 with 75% overlap (N_s = 256), and the parameter ς_0 in (24) is set to 1. The nominal sampling rate is set to f_s = 16 kHz, and the actual sampling rates of the nodes are selected from frequencies around f_s. In the ANNO iterations, the parameters are given as follows: the iteration stepsize is 0.5, κ = 4, p_i = 2/11 (i ≤ 5) and p_i = 1/11 (i = 6); the weight matrix W is defined by a Metropolis scheme [37]; and the coefficient α is dynamically selected from the interval [10^−4, 10^2] to accelerate the convergence speed while maintaining high convergence accuracy, where the selection strategy is that the larger the number of iterations, the larger the α that is chosen. Besides, the SROs are initialized to zero and the iteration number is fixed to Q_i = 300.

B. Performance Assessment Standards
The SRO estimation performance is evaluated by the root mean square error (RMSE), which is given by

RMSE = √( (1/I) Σ_{i=1}^{I} (ε̂_i − ε_i)² ),

where ε_i and ε̂_i are the true and estimated SRO of node i, respectively. When using the distributed SRO estimation method, ε̂_i is computed as ε̂_i = (1/I) Σ_{m=1}^{I} ε̂^m_i. All the experimental results are averaged over 50 Monte Carlo trials with different noise realizations.

C. Simulation Results
In the first experiment, we compare the computational times of the MATLAB implementations of all considered algorithms on an Intel i7-11700F CPU @ 2.50 GHz with 16 GB RAM. Then, in the second experiment, the SRO estimation performance is studied for different durations of the calibration signals. Next, the impact of inter-node distances is evaluated in the third experiment. In the fourth experiment, the synchronization performance is studied for a set of stage numbers and sampling frequency differences. Finally, the effects of background noise and reverberation are discussed in the fifth and sixth experiments, respectively.

1) Comparison of Computation Times:
In this experiment, a 16 s long white noise is used as the calibration sound source in an environment with SNR = 20 dB and RT60 = 0.2 s. The microphone signals are resampled with SROs randomly chosen from the interval [−4, 4] Hz, and the block size is set to l_0 = 20. From Table III, we can see that the CM method [21] achieves the smallest RMSE when the exhaustive search stepsize is set to 0.001 Hz. But this comes at a cost: the SRO estimation for a pair of nodes requires about 357 s. While the ACD method [24] runs fast, taking only 1.38 s to estimate the SROs among I = 6 nodes, its accuracy is not very high. The double-cross-correlation processor (DXCP) based method [38] outperforms the ACD method, but its computation time is relatively long, because it computes a double cross-correlation in the time domain. The 2-stage WACD approach, hereafter referred to as WACD-2, significantly improves the SRO accuracy with little added computational time compared with the ACD method. Since more frequency resources are exploited in this paper, the proposed 2-stage ASROE (ASROE-2) method further improves the SRO estimation precision. Besides, the proposed distributed SRO estimation (DSROE) method takes 19.2 s on a single computing device when a fully connected microphone network is considered. If I = 6 parallel computing units are used, the computation time of each node is about 3.2 s, which is slower than the proposed ASROE method. The reason is that the ASROE method obtains SROs in closed form, whereas the DSROE method computes SROs by distributed optimization iterations. The RMSE of DSROE is about 0.0097 Hz, which is also larger than that of ASROE, because some additional convergence errors are introduced in the ANNO iterations. Still, these costs are worthwhile, since the distributed calibration does not need a central processor and is more suitable for distributed networks.
2) Effect of the Length of Calibration Signals: The previous experiment settings are kept, except that SNR = 15 dB and a set of different calibration signal lengths, i.e., [8, 16, 24, 32, 40, 48] s, is considered. Besides, two types of calibration signals (white noise and male speech) are used, as shown in Fig. 3. Obviously, all the methods perform better in the white noise scenario, since abundant frequency components can be utilized compared with speech signals. Because the constraint k ≤ K_max is eliminated in the proposed ASROE-2 and DSROE-2 methods, their advantages are more evident when the white noise is employed. The RMSEs of ASROE-2 are always smaller than 0.005 Hz and 0.05 Hz in the white noise and male speech scenarios, respectively. The proposed DSROE-2 method and the WACD method have comparable performance when the male speech is used, while the former's RMSEs are lower than those of the latter when the white noise is adopted. Moreover, we can see that the ASROE-2 and DSROE methods are not very sensitive to the available signal length.
3) Effect of Inter-Node Distances: In this experiment, the influence of inter-node distances on the SRO estimation performance is evaluated, in which I = 6 nodes constitute a uniform linear array, the distance between two adjacent nodes is set to 2 cm, 10 cm, 20 cm, and 50 cm, respectively, and SNR = 15 dB. The other experimental parameters coincide with those of the first experiment. As given in Table II, the CM method (with a search step of 0.01 Hz) can achieve the smallest RMSEs if the inter-node distances are small enough. In contrast, the WACD method and the proposed ASROE and DSROE methods are not very sensitive to the inter-node distances, and they outperform the CM method as the inter-node distance increases. Specifically, ASROE-2 provides the smallest RMSE (about 0.006 Hz) when the inter-node distance is equal to 50 cm. This illustrates that the rough time synchronization before the SRO estimation is a necessary and effective strategy.

Fig. 4. Effects of stage number and initial SRO differences.

4) Effects of Stage Numbers and Sampling Frequency Differences:
In this experiment, the SRO estimation performance is evaluated for a set of different stage numbers and sampling frequency differences under the condition of SNR = 15 dB, RT60 = 0.2 s, and l0 = 10. Here, we define ε̄ = (1/(I − 1)) Σ_{i=2}^{I} |f_i − f_1| as the mean absolute SRO difference between the first node and the other nodes before sampling frequency synchronization. Besides, an 8 s long white noise acts as the calibration source. To be fair, a fully connected microphone network is considered for the DSROE method. As depicted in Fig. 4(a), the RMSEs of all methods rise evidently with the increase of ε̄, because the linear-phase shift model used in these methods degrades for large SROs. It can be observed from Fig. 4(b), (c) and (d) that all the multi-stage methods perform better than the single-stage methods. To be specific, the RMSEs of WACD are relatively larger than those of ASROE and DSROE, especially when ε̄ ≥ 8 Hz. The reason is that the maximum frequency bin index Kmax without phase ambiguity becomes smaller as ε̄ increases, resulting in relatively large RMSEs. In contrast, the proposed multi-stage ASROE and DSROE approaches are not very sensitive to the increasing ε̄, since the constraint k ≤ Kmax is relaxed. Moreover, if ε̄ ≤ 8 Hz, DSROE achieves comparable performance with ASROE in the 3rd and 4th implementations.
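The quantity ε̄ defined above can be computed directly from the actual sampling rates; a minimal sketch with hypothetical rates drawn from the [−4, 4] Hz offset interval used in the experiments:

```python
# Mean absolute SRO difference between node 1 and the remaining nodes,
# following the definition eps_bar = (1/(I-1)) * sum_{i=2}^{I} |f_i - f_1|.
def mean_abs_sro_diff(f):
    """f: list of actual sampling rates [f_1, ..., f_I] in Hz."""
    f1 = f[0]
    return sum(abs(fi - f1) for fi in f[1:]) / (len(f) - 1)

# Hypothetical example: I = 6 nodes, nominal 16 kHz, offsets in [-4, 4] Hz
rates = [16000.0, 16002.0, 15998.5, 16003.0, 15996.0, 16001.0]
print(mean_abs_sro_diff(rates))  # mean of |f_i - f_1| = [2.0, 1.5, 3.0, 4.0, 1.0]
```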

5) Effect of SNR Conditions:
The impact of ambient noise on the SRO estimation performance is evaluated for a set of different SNRs, i.e., SNR = [0, 5, 10, 15, 20, 25] dB, when RT60 = 0.2 s, ε̄ = 4 Hz, and l0 = 20. The calibration source is a 16 s long white noise. In addition to a fully connected (FC) microphone network, a non-fully connected (NFC) network is also considered. As shown in Fig. 5, all the methods perform better with increasing SNR. Obviously, the proposed ASROE method provides the smallest RMSE for all SNR conditions, indicating that it is more robust against ambient noise than the comparison methods. Besides, the RMSE of DSROE in the NFC network is smaller than that of WACD, since the former utilizes more frequency information in the SRO estimation process. Moreover, DSROE (FC) outperforms DSROE (NFC), due to the fact that more neighboring observations are adopted in the fully connected network. Note that the WACD and ASROE methods can only be applied to distributed microphone networks in a centralized manner, while the DSROE method can run either in a centralized or a distributed way.

6) Effects of Reverberation Time RT60s:
The previous experiment settings are kept, except that a set of different RT60s is considered and the SNR is fixed to 20 dB. Fig. 6 shows that all the RMSEs grow steadily as RT60 increases, which indicates that reverberation is harmful to all the SRO estimation approaches. Besides, it can be found that the proposed ASROE method consistently provides the smallest RMSEs; if RT60 ≤ 0.4 s, its RMSEs are smaller than 0.01 Hz. The RMSEs of the WACD method are slightly larger, which can be explained by its underutilization of the phase information of the observed frequency bins. The RMSE performance of DSROE (FC) is comparable to that of ASROE when RT60 ≥ 0.2 s, and it outperforms DSROE (NFC), because more observations can be collected in the local cost function (31) in the fully connected acoustic sensor network. As discussed in the fifth experiment, the WACD and ASROE methods are implemented in a centralized manner, while the DSROE method is executed in a distributed manner, which is more suitable for distributed networks since it does not require an external centralized processor.

D. Real-World Experiment
To further validate the proposed methods, a real-world experiment is carried out.
The real-world environment is a conference room of size 6 m × 5 m × 3 m, where I = 4 mobile devices constitute a distributed acoustic sensor network. As depicted in Fig. 7, three smartphones (Models: Xiaomi Mix 2, iPhone 11, and iPhone 12 Pro Max) and one tablet (Model: iPad 7) are randomly placed on the conference table and labeled as nodes 1-4, respectively, and a loudspeaker (Model: Edifier R1200II) acts as the calibration source. The calibration signal is a white noise signal with a length of 24 s. The nominal sampling rate is 16 kHz, but these devices have individual oscillators and central processors, resulting in slight differences among their actual sampling rates. To eliminate these SROs, the proposed ASROE and DSROE methods are adopted here, where the DSROE method is carried out with a fully connected network.
The actual sampling rate of node i is measured as follows: 1) a sinusoidal signal with frequency 1000 Hz is emitted and received by node i; 2) the first N_p peaks of the received sinusoidal signal are computed, whose time indices are labeled as p_k(1), p_k(2), . . . , p_k(N_p). The measured sampling rates are listed in Table III.
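The measurement procedure implies a closing computation that the text truncates; a minimal sketch of one plausible final step (an assumption, not the paper's exact formula): since consecutive peaks of a 1000 Hz tone are spaced f_s/1000 samples apart, the actual rate follows from the span of the N_p peak indices.

```python
# Assumed final step of the measurement procedure (steps 1-2 in the text):
# with a 1000 Hz tone, successive peaks lie f_s/1000 samples apart, so the
# actual sampling rate follows from the first and last peak indices.
def estimate_fs(peak_indices, f_tone=1000.0):
    """peak_indices: sample indices p(1), ..., p(N_p) of the tone peaks."""
    n_periods = len(peak_indices) - 1  # periods spanned by the peaks
    return f_tone * (peak_indices[-1] - peak_indices[0]) / n_periods

# Synthetic check: a device at 16002 Hz places 1 kHz peaks ~16.002 samples apart
peaks = [round(k * 16002.0 / 1000.0) for k in range(2001)]  # ~2 s of peaks
print(estimate_fs(peaks))  # → 16002.0
```

Averaging over many periods in this way suppresses the ±0.5-sample quantization error of each individual peak index.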
The estimated SROs are also given in Table III. It can be observed that the proposed methods can effectively acquire the SROs among nodes, and they outperform the existing ACD and WACD methods. Specifically, the ASROE method achieves a better estimation accuracy, since it computes the SROs in closed form and does not introduce iteration errors. The performance of DSROE is also competitive, and it is more suitable for distributed acoustic sensor networks.

VI. CONCLUSION
In this paper, an analytical SRO estimation (ASROE) approach is proposed for a pair of nodes, and its distributed implementation is further developed. After approximating the SRO with a linear-phase drift model in the STFT domain, the proposed analytical algorithm estimates the SRO between each pair of neighboring nodes by unwrapping the temporally averaged phase information. To perform distributed estimation, a distributed cost function is formulated via the SROs obtained by the ASROE approach. Finally, all the SROs among the nodes are retrieved simultaneously in a distributed manner based on the asynchronous network Newton optimization. The proposed method can successfully estimate the SROs among nodes in noisy and reverberant environments. Moreover, it does not require an external central processor and is scalable for distributed acoustic sensor networks.
Feilong Bao received the Ph.D. degree in computer application technology from Inner Mongolia University, Hohhot, China, in 2013. He is currently a Professor with the Department of Computer Science, Inner Mongolia University. His research interests include speech signal processing, natural language processing, speech synthesis, speech recognition, and neural machine translation.
Rui Wang received the B.E. degree in automation from the Shenyang University of Technology, Shenyang, China, in 2012, the M.E. degree in instrument science and technology from Northeast Electric Power University, Jilin, China, in 2015, and the Ph.D. degree in information and communication engineering from the Dalian University of Technology, Dalian, China, in 2021. Since 2021, she has been a Lecturer with Changzhou University, Changzhou, China. Her research interests include speech processing and microphone array signal processing.