Optimized Real-Time MUSIC Algorithm With CPU-GPU Architecture

Direction-of-arrival (DOA) estimation algorithm for uniform planar arrays has been applied in many fields. The multiple signal classification (MUSIC) algorithm has obvious advantage in high-resolution signal source estimation scenarios. However, the MUSIC algorithm has high computational costs, therefore it is hard to be used in real-time scenes. Many studies are dedicated to accelerating MUSIC algorithm by parallel hardware, especially by Graphics Processing Units (GPU). MUSIC algorithm based on Central Processing Unit (CPU) -GPU architecture acceleration is rarely investigated in previous literatures, and how well MUSIC Algorithm with CPU-GPU architecture could perform remains unknown. In this paper, we present and evaluate a model of search parallel MUSIC algorithm with CPU-GPU architecture. In the proposed model, the steering vector of each candidate incident signal and the corresponding value of 2D spatial pseudo-spectrum (SPS) function are sequentially calculated in a single core of the GPU, and the subsequent calculation of each elevation or azimuth is parallel in batches. Furthermore, in order to improve the peak search speed, we propose a new Coarse and Fine Traversal (CFT) peak search algorithm via CPU and a new parallel peak search algorithm based on GPU acceleration. Across strategy comparison, utilizing CPU-GPU architecture for processing, a 150-160x performance gain is achieved compared to using CPU only. Besides, the resolution of uniform planar arrays is also analyzed.


I. INTRODUCTION
Direction-of-arrival (DOA) estimation has become a popular branch of array signal processing during the past four decades and has extensive applications in various fields, such as radar, sonar, mobile communications, and electrocardiograms [1]- [4]. The DOA estimation algorithms have in-depth study on accuracy and efficiency and have developed into many types of methods, such as beamforming methods [5], subspace decomposition methods, subspace fitting (SF) methods [6]- [8], and sparse representation (SR) methods [9], [10]. The beamforming algorithm reduces the effect of the co-channel on DOA estimation by using the phase-shifted reference signal that is obtained after interference suppression, accompanied by mediocre performance in high-resolution scenes. Maximum likelihood (ML) algorithm and SF algorithm are concise in theory and excellent in performance, yet they require complex and computationally The associate editor coordinating the review of this manuscript and approving it for publication was Hasan S. Mir. intensive multi-dimensional nonlinear optimization [11]. The division of grid determines the estimation accuracy and computational complexity of SR algorithm. Subspace decomposition methods have been divided into two types in terms of processing methods. One is the noise subspace algorithm represented by multiple signal classification (MUSIC). And the other is the signal subspace algorithm represented by estimation of signal parameter via rotational invariance technique (ESPRIT). Algorithms represented by ESPRIT mainly include Toeplitz approximation approach (TAM), least-squares ESPRIT (LS-ESPRIT) and total least squares ESPRIT (TLS-ESPRIT) [12]- [15]. ESPRIT estimates the DOA by employing the rotation invariance among the signal subspace of each sub-array. The performance of these ESPRIT algorithms is substantially resembling, yet the performance of TLS-ESPRIT is somewhat better than other ESPRIT algorithms in the case of low signal-to-noise ratios (SNRs). The algorithms represented by MUSIC include MUSIC algorithm, Root-MUSIC algorithm, and minimumnorm methods (MNM) [16]- [18]. Root-MUSIC realizes VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ DOA estimation by solving the polynomial constructed using the orthogonality of steering vector and noise subspace. ESPRIT algorithm has a more diminutive computational complexity than the MUSIC algorithm and does not require SPS computation. However, the performance of ESPRIT is poor than MUSIC. The MUSIC algorithm has distinguished advantages in high-resolution applications. However, the 2D SPS computation of MUSIC algorithm makes it difficult to exploit in real-time scenes. Reduced-dimensional MUSIC algorithm was proposed for far-field signal source localization with sphere microphone arrays [19]. One-dimensional MUSIC-type algorithm (ODMUSIC) constructs a mapping matrix between the spherical harmonic function and Fourier series only depending on the array configuration. Two root polynomials in ODMUSIC are established to estimate the elevation and azimuth iteratively by exploiting the Vandermonde of Fourier series. Moreover, GPU computation has developed extremely rapidly during the past two decades. GPU is a typical multicore co-processor where 32 threads of a warp are controlled by a controller and process one instruction at the same time. In the single instruction multiple thread (SIMT) paradigm, threads are automatically grouped into 32-wide bundles called warps. Warps are the base unit used to schedule both computation on Arithmetic and Logic Units (ALUs) and memory accesses. Threads within the same warp follow the single instruction multiple data (SIMD) pattern, i.e., they are supposed to execute the same operation at a given clock cycle [20]. From the original image processing engine to the more powerful high-performance computing processor, it has been maturely applied to various fields [21], [22]. The wideband DOA estimation for uniform linear arrays (ULA) of 16 and 4 sensors has been parallelized, partitioned, mapped, and scheduled on Multi-Core, GPU, and IBM Cell BE processor [23]. Broadband underwater sound source estimation of MUSIC algorithm has been implemented based on GPU [24]. Four noise subspace DOA algorithms including Pisarenko Harmonic Decomposition, Eigen Vector, MUSIC, and MNM have been analyzed on CPU and GPU with uniform circle array (UCA) [25]. These aforementioned GPU-accelerated MUSIC algorithms have executed parallel of SPS computation.
However, existing GPU-based parallel MUSIC algorithms pay insufficient attention to the parallel calculation of candidate steering vector calculation, optimization of CPU-GPU memory transmission, and parallel optimization of peak search. The research on optimized computing, parallel computing and CPU-GPU acceleration has penetrated into many fields. In order to solve the problem of high computational cost and storage cost of large-scale 3D models, differential evolution and whale optimization algorithm are combined to calculate the simplified mesh model more effectively [26]. [27] proposes a GPU-based parallel tabu search algorithm that uses a single GPU kernel of compacting neighborhood and a kernel fusion strategy to reduce the amount of GPU global memory accesses. [28] proposes a model of vector parallel ant colony optimization for multi-core SIMD CPU architecture and accelerates the tour construction of each ant by vector instructions. GPU-based parallel ant colony optimization has been analyzed [29]. Blockwise Weighted Least Square Active Noise Control ANC algorithm for CPU-GPU Architecture has designed a real-time execution framework using GPU asynchronous parallel computing [30], [31]. [32] proposed a hybrid parallel algorithm for beamforming using the parallel mode of just-in-time (JIT) on multi-CPU and multi-GPU platform.
Inspired by optimized computing, parallel computing, CPU-GPU acceleration, real-time block diagram, and finegrained parallelism [26]- [32], this paper presents an idea to accelerate a fully developed parallel MUSIC model with CPU-GPU architecture. In our algorithm, steering vector calculation, 2D SPS calculation and peak search are all parallelized and optimized. To the best of our knowledge, this is the first parallel MUSIC algorithm exploiting maximum parallelization with CPU-GPU architecture. In the 2D SPS calculation stage, we design an algorithm to calculate the steering vector and spatial pseudo-spectrum in parallel, and each GPU core is responsible for a task corresponding to the elevation and azimuth. Furthermore, in the peak search stage, we propose a new CFT peak search approach based on CPU and a new parallel peak search approach based on GPU acceleration. More importantly, we compare our algorithm with previous MUSIC algorithm based on CPU, The results indicate the strong potential of parallel MUSIC model with CPU-GPU architecture. Finally, it is theoretically analyzed why the resolution of 2D SPS is insufficient in specific angle ranges.
The remainder of this paper is organized as follows. In Section II, a data model of MUSIC algorithm for uniform planar arrays is given. A real-time CPU-GPU architecture for the MUSIC algorithm is described in Section III. Finally, simulation results are employed to illustrate real-time DOA estimation capability in Section IV. Conclusions are drawn in Section V.

II. DATA MODEL
As shown in Fig. 1, there is a uniform planar arrays with MN mutually independent and omnidirectional sensors, assuming that the sensors are not mutually coupled. We set the reference sensor is at the origin of coordinate system. Supposed that D signals impinge on the uniform planar arrays from the farfield. Let d = (θ d , φ d ), where θ d and φ d denote the elevation and azimuth of the d th incident plane wave, respectively. The discrete signal output of the uniform planar arrays can be described in a matrix form as follows [33] where A( ) = [a( 1 ), a( d ), . . . , a( D )] ∈ C MN ×D , it means that A( ) is a MN ×D steering matrix related to the shape of the array and the direction of the signal source. A( ) represents the transfer function from the source to the array and holds the DOA information. In general practical Gaussian noise (AWGN) vector, whose elements are usually assumed to be Gaussian random variables with zero means and variance σ 2 n as follows [33] where E{·} represents mathematical expectation, (·) H represents Hermitian transpose operation, 0 is used throughout the paper to mean that all elements of matrices or vectors are zeros and I represents the identity matrix. The wave path difference between the mnth sensor and the reference sensor can be described as where λ represents source signal wavelength and (x mn , y mn ) represents the coordinate of the mnth sensor. The uniform planar arrays is generally in the x-y plane, so z mn is generally 0. The steering vector of M sensors on the x-axis a x ( d ) can be defined as follows where ζ is the distance between adjacent sensors. Since the array is uniform, the distance between adjacent sensors of x-axis is the same as y-axis. Then the steering vector of N sensors on the y-axis a y ( d )can be defined as follows Assuming that the steering vector of sub-array 1 is a x ( d ), and the steering vector of sub-array 2 must consider the offset along the y-axis. The wave path difference of each sensor relative to the reference sensor must be added 2π ζ sin φ sin θ/λ to the wave path difference of sub-array 1, so the steering vector a( d ) can be described as where ⊗ is Kronecker product. The W ×W uniform planar arrays covariance matrix can be written as where W = MN , R s = E{s(t)s H (t)} and R n = E{n(t)n H (t)}. R s is the source signal covariance matrix, and R n is the additive white Gaussian noise covariance matrix. According to (2), (7) can be reconstructed as where R is positive definite Hermitain matrix. If the unitary transformation is used to achieve diagonalization, the similar diagonal matrix is composed of W different positive real numbers, and the corresponding W feature vectors are linearly independent. The eigenvalue decomposition (EVD) of the estimation covariance matrix R can be described as where = diag(η 1 , η 2 , . . . , η W ), and it can be proved that its eigenvalue obeys the order First D eigenvalues are related to the source signal, and their values are greater than σ 2 n . The eigenvectors corresponding to these D larger eigenvalues η 1 , η 2 , . . . , η D are denoted as U 1 , U 2 , . . . , U D , and they form the signal subspace U s . Let s be diagonal matrix composed of D larger eigenvalues. The W − D eigenvalues completely depend on the noise, their values are all equal to σ 2 n . The eigenvectors corresponding to η D+1 , η D+2 , . . . , η W constitute the noise subspace U n , and n is a diagonal matrix composed of W −D smaller eigenvalues. Therefore, the EVD of the exact covariance matrix R is given by However, the covariance matrix R is not available in practice, so the theoretical uniform planar arrays covariance matrix R given in (7) is usually replaced byR where X = [X(1), X(t), . . . , X(T )] is the uniform planar arrays output discrete signal matrix. The EVD of the exact covariance matrixR in (10) can be described aŝ Among the eigenvalue-decomposition-like DOA estimation algorithms for the sensor array covariance matrix, the MUSIC algorithm has universal applicability. As long as the layout of the sensor array is known, whether it is a linear array or a circular array, and whether the array senors are equally spaced, high-resolution estimation results can be VOLUME 9, 2021 obtained. The uniform planar arrays covariance matrixR can be divided into two parts, we can get according to (13), we get The matrix R s is a full-rank and not singular matrix, so there is an inverse. Therefore the above (14) can be changed to A H ( )Û n = 0, and it shows that each column vector in matrix A is orthogonal to the noise subspace. We can getÛ According to the orthogonal relationship between the noise eigenvector and the source signal steering vector, the following 2D SPS function is obtained It is easy to find that (16) can be regarded as an optimization problem, so (16) can be reconstructed as where · F denotes the Frobenius norm. We can obtain elevation and azimuth of 1, 2, . . . , D source signal as follows The estimation of the elevation and azimuth can be accomplished by exercising the 2D peak search.

III. REAL-TIME CPU-GPU ARCHITECTURE FOR MUSIC ALGORITHM
In this section, we will introduce a CPU-GPU architecture that can reduce the computation time cost of MUSIC algorithm to a very low level to fit real-time requirements. This architecture can be used for real-time sound source location, real-time signal source search, and medical imaging systems. The working flow of the proposed architecture is described.
As mentioned above, the MUSIC algorithm is a powerful and high-performance algorithm for determining the DOA of common wideband or narrowband signal sources, especially when the number of arrays is appropriate and the scan resolution is moderate. However, in the current high-precision positioning applications, more attention is paid to whether it can satisfy real-time requirements while ensuring accuracy. In the case of only one-dimensional scan, the MUSIC algorithm can almost be an excellent one of the same type of DOA algorithms, once the elevation and azimuth are searched simultaneously, the entire computation time will become unbearable. We proposed a CPU-GPU architecture for the MUSIC algorithm. This architecture fully utilizes the high computational power coming from the parallel processing of GPU and overcomes the mentioned limited computational efficiency.
represent the input and output of this CPU-GPU framework, respectively. The CPU executes some modules with complex functions and complex logic operations, which mainly include the EVD and separation of noise subspace, while the GPU mainly executes the remaining 2D spectrum generation and peak search with high repeatability and low interdependence. First, the proposed architecture takes the sampled signal X as the input. Secondly, CPU is the calculation platform of the previous stage, including covariance calculation (11), eigenvalue decomposition (9), and noise subspace extraction (12). Then the inner product of the noise subspace (16) is transmitted to the GPU, and the candidate steering vector (4), (5), (6) and the 2D SPS (16) are calculated in parallel. Finally, the 2D spectral peak search (18) is optimized in parallel and the DOA information {θ d ,φ d } D d=1 is transmitted back to the CPU. More details of the calculation scheme and memory transfer between CPU and GPU are as follows.

1) CPU COMPUTATION PART
Generally, the complex calculations in the solving process are executed on the CPU. The multi-core CPU is mainly responsible for complex logic operations on the basis of providing vector processing. To minimize the memory transmission between the CPU and GPU, steering vectors of candidate angles are calculated via the GPU. Since covariance calculation, eigenvalue decomposition, and noise subspace separation are complicated and not repetitive, they are calculated via the CPU.

2) GPU COMPUTATION PART
The design goal of GPU is its ultra-high floating-point computing power. The GPU avoids or weakens complex functions that are not related to floating-point calculations, such as branch processing and logic control, and focuses on floatingpoint calculations [34], [35]. For the estimation of source signal elevation and azimuth based on the MUSIC algorithm in the same time block, a vast quantity of parallel computing branches are worth using GPU to optimize the computation task of the 2D spectrum. Since the noise subspace required for the elevation and azimuth search of the MUSIC algorithm in the same time block is constant, the noise subspace is placed in the on-chip shared memory of the GPU to facilitate accessing the memory unit. Moreover, the peak search task is handed over to the GPU for execution.
As shown in Fig. 3, the inner product of noise subspace and array distribution are transferred to the GPU global memory through the CPU memory and then to the shared memory. Some of the memory types are on-chip memory whose access speed is faster than off-chip memory. Therefore, utilization of shared memory to store the inner product of noise subspace and array distribution enhances data access rate. It is worth noting that although local memory is unique to each thread, this part of memory still belongs to off-chip memory, which belongs to the same physical memory as global memory.

3) MEMORY TRANSMISSION BETWEEN CPU AND GPU
Memory transmission in the CPU-GPU architecture is always an unavoidable issue that demands to be considered. Due to the distinct physical memory addresses of CPU and GPU, as well as the bandwidth and transmission speed limitations of PCIe bus, it is necessary to transfer less data between CPU and GPU. If the calculation of the candidate steering vector is established into the CPU and a unique steering vector is required for each DOA information update, then the one-byone transmission of this portion of data will consume a lot of time. Therefore, the proposed CPU-GPU architecture for the MUSIC algorithm arranges the calculation of the steering vector on the GPU and only requires to transmit the array distribution vector once during the DOA estimation of the first time block. Similarly, if the GPU transmits the calculated 2D spectrum back to the CPU each time, the time of each transmission delays the entire calculation, so the task of peak search is likewise scheduled for the GPU calculation. More optimization details will be discussed next.

4) CONDITION FOR REAL-TIME EXECUTION
Since the DOA information of a time block must be updated before the next time block. The condition of the proposed CPU-GPU architecture for MUSIC algorithm to be executed in real-time is as follows where T CPU is the preprocessing time of CPU calculation includes covariance calculation, eigendecomposition, and inner product of noise subspace. T GPU is the computation time of all kernels on the GPU that includes 2D spectrum computation and peak search. T Transfer is the time of data transmitted between CPU and GPU. T Transfer does not include the transmission time of array distribution and T Block represents the sampling period of the signal block.

B. PARALLEL OPTIMIZATION STRATEGIES
The computation complexity of MUSIC algorithm is listed in Table 1. Inner Product of Noise Subspace representsÛ nÛ H n , 2D Spectrum is the remainder of the (16). L θ , L φ represents scan range of elevation and azimuth, respectively. The number of snapshots T only affects the computation complexity of the covariance matrix linearly. The 2D spectrum is related to the number of sensors W and the scan range of the elevation and azimuth. Through comparative analysis, we can perceive that candidate steering vector, 2D spectrum, and peak search have fine parallelism.

1) PARALLEL OPTIMIZATION OF CANDIDATE STEERING VECTOR AND 2D SPS COMPUTATION
Due to the computation of the spectrum of each candidate elevation and azimuth are not related to each other, the highlevel parallel structure is determined to estimate the 2D spectrum. The DOA estimation of each sampling block does not depend on the previous sampling block. Such excellent independent computing features are very suitable for parallelism.
In Algorithm 1, after the shared memory allocation is completed, only one thread synchronization is performed within VOLUME 9, 2021 the block, and the rest proceed downward smoothly. The outermost loop is a manifestation of SIMT, where multiple threads are bound into a warp to execute the same instruction stream. Only one main branch is set up in the kernel, which is not disturbed by other branches. While a certain instruction is scheduled, the cores in SIMT is executed according to SIMD model, that is, each core in the warp or arithmetic and logic unit (ALU) executes the same operation on multiple data operands concurrently. It is worth mentioning that all intermediate variables in Algorithm 1 are stored in registers to ensure the access speed. Therefore, it is critical to ensure data size of variables in the block does not exceed the total capacity of registers available in each block. Furthermore, it is necessary to clear the status of accumulators promptly.

2) PARALLEL OPTIMIZATION OF PEAK SEARCH
In Algorithm 2, we set the number of surrounding points as C n , and the entire comparison times are approximately L θ ×L φ ×C n . Each point must be compared with all surrounding points. The Priority_queue(D) is used to maintain the D max peaks. The number of surrounding points of four vertices, four edges, and the rest of the spectrum are 3, 5, and 8 respectively. About half of the comparisons are repeated. Therefore, Algorithm 2 is modified to use a peak label spectrum to record the result of the current comparison. As shown in Algorithm 3, if the adjacent point is judged not to be the peak point, then this point is ignored. In this way, the redundant comparison is well removed, and comparison times is lower than 1/2. Moreover, comparison times can be reduced to a lower level by changing the step size.
As shown in Fig. 4, the first four subgraphs show that the first rough traversal, which can filter out more than half of the non-peak points. The last four subgraphs show that the second fine traversal can skip non-peak points and filter some unpassed points. Ultimately, two peak points are obtained. Nevertheless, the disadvantage of this algorithm is that it needs to transfer the 2D spectrum from the GPU back to the CPU, so the calculation efficiency is only increased by about twice. If the DOA information can be calculated in the GPU, the transfer time from GPU to CPU can be saved. Firstly, Algorithm 4 utilizes the high-efficiency access characteristic of shared memory in GPU to search for peaks within blocks and assigns them to global memory. Secondly, after all peaks within blocks are searched, the peaks of each block are scanned one by one and placed in Priority_queue(D). Finally, DOA information is transmitted back to the CPU to indicate that the DOA estimation within a time block is completed.

IV. SIMULATIONS AND DISCUSSION
In this section, we carried out comparative experiments in time domain to verify the real-time feasibility of the proposed calculation models and strategies. for j := 0 to N −1 do 24: a(i×N +j) = a x (j) × a y (i) 25: end for 26: end for 27: for i := 0 to W −1 do 28: for j := 0 to W −1 do 29: temp1(i)+ = a(j) × u(i×W +j) 30: end for 31: end for 32: for j := 0 to W −1 do 33: Conj(a(i)) 34: end for 35: Clear the accumulators temp1[W ] and temp2 37: t+ = t n 38: end for of CPU, GPU, and system parameters used in the algorithm are listed in Table 2. The parameters of uniform planar arrays are M = 4 and N = 4.

B. INFLUENCE OF SNAPSHOTS AND MULTIPLE SOURCES ON COMPUTATION TIME AND ACCURACY
We simulate in the narrowband far-field environment. The sensor interval is set to 0.01 m to satisfy ζ ≤ λ 2 . The center frequency of the narrowband signal is 2 kHz. As shown for j := 0 to L φ do 3: if p(i, j) == non-peak label then 4:

5:
end if 6: if sps(i, j) > sps(i ± 1, j ± 1) then 7: p(i ± 1, j ± 1) ← non-peak label 8: if sps(i, j) > all surrounding points then 9: 10: else 11: p(i, j) ← non-peak label 12: end if 13: end if 14: end for 15: end for 16:  in Table 3 if shared_sps(t) all surrounding points then 8: peak( t t n , blockIdx) ← shared_sps(t), t 9: end if 10: t+ = t n 11: end for 12: Synchronize All Threads 13: gridSize = L θ ×L φ t n 14: for i := 0 to gridSize × gridDim do 15: if peak(i) = NULL then 16: Priority_queue(D) ← peak(i) 17: end if 18: end for 19:  The CPU overhead is about a quarter for the proposed algorithm to execute in the worst environmental conditions. Besides, the RMSE gradually decreases with the increase of the number of snapshots, yet the increase of accuracy is not linear but logarithmic. We cannot rely on increasing the number of snapshots to improve the estimation accuracy, after all the calculation time also increases linearly. Across the comprehensive comparison, 500 snapshots are used as the calibration parameter in subsequent simulations.

C. MAXIMUM NUMBER OF SIGNALS
The estimated range of UPA elevation is 0 • -90 • , and azimuth is 0 • -360 • . The maximum number of signal sources can be estimated in this simulation test. Since we use the W × W Noise Subspace U n , theoretically W-1 signal source estimation can be achieved. We set SNR = 15 dB and test 15 uncorrelated random signals. As shown in Fig. 5, to facilitate the observation of the 2D SPS, 8 signals , respectively. We observe that 7 prominent crests are corresponding to 7 signal sources, and a small bump is on the waist of the widest wave, which corresponds to the signal source of (56 • , 132 • ).

D. RESOLUTION ANALYSIS
As shown in Fig. 5, as the incident elevation of the signal exceeds 60 • , the waveform of the 2D SPS becomes less sharp. The resolution of signal source in a certain direction is directly related to the rate of change of steering vector near that direction. Near the direction where the steering vector changes rapidly, the difference of array wave path changes greatly with the change of the source angle, and the corresponding resolution is also high [36]. Define the resolution (θ) as follows where τ represents delay among sensors in the array. The spectrum in the θ direction becomes sharper as (θ ) increases. For the uniform linear array, τ is as follows where τ l is the delay of the lth sensor relative to the reference sensor, c is signal propagation speed, and x l is the x-axis coordinate of the lth sensor. Then (θ) ∝ cos(θ).
It shows that the resolution of signal in the location of 0 • is the highest, and the resolution in the position of 60 • direction is reduced by half.  For the uniform planar arrays placed horizontally, τ mn can be written as follows According to (20), we can get (θ ) of the uniform planar arrays as follows Similarly, we can get (φ) as follows The above analysis verifies that the resolution is insufficient if elevation exceeds 60 • in Fig. 5. In the light of (24), (θ) keeps getting smaller as θ increases where φ is a parameter. As shown in Fig. 6, the spectrum has the highest resolution where the elevation is 0 • , and the resolution is reduced to half at 60 • , until reduced to 0. As shown in Fig. 7, (φ) fluctuates sinusoidally in an interval as φ increases where θ is parameter. However, as θ gets closer to 0, the fluctuation range of (φ) gets smaller until it finally reaches 0.

E. STRATEGY COMPARISON
To verify the rationality in real-time application, the proposed algorithm is compared on the CPU and GPU. For notational convenience, we abbreviate five calculation strategies Algorithm2, Algorithm3, Algorithm1-Algorithm2, Algorithm1-Algorithm3 and Algorithm1-Algorithm4 to CPU-Config1, CPU-Config2, GPU-Config3, GPU-Config4, and GPU-Config5, respectively. The specifications of the hardware required for the experiment are shown in Table 2  and 90 • × 360 • , respectively. All the simulation results are obtained by averaging 500 independent experiments.
The simulation results are shown in Table 4, the preprocessing via CPU calculation includes covariance calculation, eigendecomposition, and inner product of noise subspace. H2D is memory transfer from CPU to GPU, and D2H is memory transfer from GPU to CPU. From GPU back to CPU, the transmission time of 2D SPS is much longer than that of DOA information. The highest speed-up ratio for computing the 2D spectrum is 162.25x, the highest speedup ratio for peak search is 733.11x, and the speed-up ratio for the whole calculation is 160.17x. By using 90 • × 90 • scan range, the GPU outputs DOA information every 0.25 s. The scan range of 90 • × 180 • and 90 • × 360 • also be simulated. According to Table 3, increasing the sampling block of the signal hardly affect the T CPU . Therefore, the real-time condition is still contented and the output frequency of DOA information is less than 4 Hz.
In practical conditions, the signal sources may propagate from adjacent directions. Here, we simulate two sets of adjacent signal sources with SNR = 0 dB and 2 dB whose elevation and azimuth are (20 • ,40 • ), (28 • ,40 • ) and (40 • ,40 • ), (40 • ,48 • ), respectively, i.e., the elevation and azimuth only differ 8 • between the adjacent source. Fig. 8 plots 2D SPS of the proposed method in minimum SNR scenario. Under this challenging scenario, two peaks can still be detected. In the same SNR scenario, if the angle difference between adjacent signals continues to decrease, the two peaks cannot be clearly distinguished. In this set of simulations, we use GPU-Config5 as the calculation strategy. Through statistics, the time spent in each part is T CPU = 4.3 ms, T GPU = 180.1 ms, T Transfer = 0.137 ms, and T Block = 250 ms. According to the above statistics and (19), we observe that this set of simulations still meet the real-time execution condition.

V. CONCLUSION
This paper presents an optimized real-time MUSIC algorithm using CPU-GPU architecture for uniform planar arrays. Joint parallel of candidate steering vector and 2D spectrum make full use of shared memory, SIMT, and SIMD, reducing transmission time and delay waiting for calculation. Besides, the parallel optimization of peak search via the GPU saves the transmission time of SPS and accelerates the entire process. Importantly, the GPU-Config5 strategy was shown to have performance benefits of 150x to 160x in comparison to the CPU-Config1 strategy. Moreover, theoretical analysis shows why the resolution of the 2D SPS continues to decrease with the increasing elevation. In the future, we plan to further improve the estimation effectiveness on the basis of ODMUSIC by remodeling the CPU-GPU architecture.