Acoustic Enhanced Camera Tracking System Based on Small-Aperture MEMS Microphone Array

Camera tracking systems based on visual image processing are completely ineffective in their blind zones. To address this problem, this article reports the design of an acoustic enhanced tracking system that combines visual and auditory target tracking methods. The system performs sound direction estimation and target tracking in real time. Estimating the direction of arrival of the sound that accompanies the target allows the camera to turn towards a target outside its field of view. This sound-triggered mode of operation is a significant supplement to the working state of conventional cameras. Because cost and size constraints make an embedded implementation necessary in practical applications, we designed a small-aperture array with 7 digital omnidirectional MEMS microphones and built the overall system on an FPGA and an ARM processor. The experiments were carried out in a normal indoor environment, and the results confirmed that the system can perform auditory and visual tracking in real time.


I. INTRODUCTION
Real-time target tracking, a core technology in the field of computer perception, is widely employed in applications such as intelligent robots [1], security monitoring [2], and drone detection [3]. Among these applications, the detection and tracking of moving targets have attracted great research interest. In the past few decades, researchers have made great progress in tracking moving targets from video sequences [4]-[9]. However, advances in visual tracking algorithms have not solved a problem inherent to conventional camera monitoring: these cameras are completely ineffective for visual tracking while the object is outside their field of view (FOV). As shown in FIGURE 1, apart from vision, hearing accounts for the largest share of human perception, about 13% [10]. A person who hears something happening out of sight will turn their head to look at it. Imitating this joint action from hearing to vision, we developed a tracking system with auditory and visual methods to address the blind angle problem. The functions of the ears, eyes, neck and brain are imitated by microphones, a camera, a pan-tilt actuator and a CPU, respectively. However, both the integration of such an embedded system and the implementation of effective algorithms pose great challenges for this design in practical applications.
The associate editor coordinating the review of this manuscript and approving it for publication was Yasar Amin.
In the field of acoustic source localization, methods based on a microphone array (MA), which estimate the sound direction by processing spatial information, are very popular [11]-[21]. These methods can generally be divided into three categories: beamforming based methods [11], [12], [16], subspace based methods [17]-[19], and parametric methods [20], [21]. Parametric methods have a high computational cost, so they are not suitable for real-time processing. Subspace based methods are characterized by high resolution, but they provide only the direction result, without a synthesized output signal. In terms of functionality, this makes them a weaker choice than beamforming. Additionally, beamforming based methods have a low computational cost, which gives them an advantage in real-time processing. The direction information is mapped to the phase delay of the signals received by each array element. Detailed explanations of phased array beamforming technologies are well documented in [22]-[24]. Adaptive beamforming is a significant task in array signal processing, of which the Capon beamformer is a representative example [25]-[27]. The signal of interest (SOI) is allowed to pass through without distortion while the interference signals and noise are suppressed as much as possible. However, it has been found that the Capon beamformer is sensitive to modelling mismatches [28]-[33], especially in small-aperture arrays. To address this problem, a class of robust adaptive beamformers (RABF) has been designed to offer acceptable array output performance [34]-[41]. It was found that the influence of array element mismatch on the locating accuracy of small-aperture arrays can be reduced by adjusting the weighting vector of the beamformer. Thus, RABF algorithms provide a solution for the acoustic localization of this design.
In practical applications the system must be integrated. However, an array of traditional electret condenser microphones (ECM) is usually larger and consumes more power. Furthermore, additional amplification circuits and AD acquisition modules complicate the system. All these factors make the integration of the system very difficult. Fortunately, the emergence of the Micro-Electro-Mechanical Systems (MEMS) microphone makes it possible to miniaturize the acoustic sensor array [42], [43]. MEMS microphones have an acoustic transducer, an amplifier, and even an analog-to-digital converter (ADC) integrated in the chip [44], which contributes to the small aperture of the MA. They can be connected directly to the FPGA through the I²S interface, without an audio decoder, which further reduces system complexity. Additionally, compared to ECMs, MEMS microphones are less sensitive to temperature, vibrations and mechanical shocks [45], [46]. These advantages, including high quality and a small package, make a MEMS MA more portable and suitable for our design, and provide a practical route to a small-aperture MA.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The combination of visual and auditory information processing has attracted the interest of many researchers [47]-[53]. A perception sensor network [47] capable of detecting emergency situations was presented for school safety, using a Kinect with four microphones to acquire audio signals. D'Arca et al. [48] used distributed directional microphones to recognize speakers, combined with video information captured by the camera. Viciana-Abad et al. [49] proposed an audio-visual perception system to direct the behavioral responses of a robot with two microphones and two cameras attached to its head. Wilson et al. [50] combined a video camera array and a MA to locate the speaker in a conference room, with 32 omnidirectional microphones spread across the ceiling and 2 cameras on adjacent walls. Despite the great effort devoted to audio-visual information processing, the use of such a small-aperture MEMS MA to solve the camera blind spot problem has rarely been reported. As described above, in practical applications an embedded implementation is necessary to meet the cost and size constraints.
In this article, we report an acoustic enhanced camera tracking system based on a small-aperture MEMS MA, which extends the detection angle of the camera tracking system to all directions by imitating the hearing-vision interactions of humans. To address the integration problem, a circular small-aperture MEMS MA is designed with a 4.5 cm radius. The embedded platform is built on an FPGA for parallel data acquisition and an ARM processor for system control. The experimental results confirmed that the whole system can perform all the localization and tracking functions reliably in real time.
This article is organized as follows: Section II describes the related algorithms implemented. Section III shows the system architecture. The experiments and discussions to evaluate the performance of the system are explained in Section IV.

II. METHODS
In this section, the algorithms used in the system are introduced, including the voice activity detection (VAD), the sound source localization, visual detection and target tracking.

A. VOICE ACTIVITY DETECTION AND SOUND SOURCE LOCALIZATION

1) VOICE ACTIVITY DETECTION
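To reduce the computing burden and to provide trigger signals for the tracking system, VAD processing is implemented. The VAD is based on estimates of the short-term energy and the short-term zero-crossing rate (ZCR). The short-term energy of the n-th frame of the audio signal is given by:
$$E_n = \sum_{m=n-N+1}^{n} \left[ x(m)\,\omega(n-m) \right]^2$$
where N is the window length, x(n) is the audio signal, and ω(m) is the window function given by the following equation:
$$\omega(m) = \begin{cases} 1, & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases}$$
The ZCR is expressed as:
$$Z_n = \frac{1}{2} \sum_{m=n-N+1}^{n} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| \omega(n-m)$$
where
$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$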
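As an illustration, the two frame-level features can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in this work: it assumes a rectangular window, non-overlapping frames, and a simple decision rule (energy above one threshold and ZCR below another). The function names, the frame length and both thresholds are hypothetical values chosen only for the example.

```python
import numpy as np

def short_term_energy(frame):
    """Sum of squared samples in one frame (rectangular window assumed)."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    """Half the summed |sgn[x(m)] - sgn[x(m-1)]|, i.e. the number of sign changes."""
    signs = np.where(np.asarray(frame, dtype=float) >= 0.0, 1.0, -1.0)
    return 0.5 * float(np.sum(np.abs(np.diff(signs))))

def vad(signal, frame_len=256, energy_thresh=1.0, zcr_thresh=100.0):
    """Flag each non-overlapping frame as active (True) or inactive (False).

    The decision rule and both thresholds are assumptions for illustration;
    in practice they would be tuned to the microphone and the room.
    """
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        active = (short_term_energy(frame) > energy_thresh
                  and zero_crossing_rate(frame) < zcr_thresh)
        flags.append(active)
    return flags
```

Only frames flagged as active would then be passed on to the beamforming stage, which is how the VAD reduces the computing burden.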

2) SOUND SOURCE LOCALIZATION
In this work, a beamforming algorithm is used to estimate the direction of arrival (DOA) of the sound. The method is based on the time difference of arrival of the signal at the elements of the microphone array. A time difference in the time domain appears as a phase shift in the frequency domain. The accuracy of a delay applied in the time domain is limited by the sampling rate, whereas a phase shift in the frequency domain can be applied with finer resolution. In addition, most natural sounds are broadband signals, which need to be decomposed into subbands before analysis. Therefore, fast Fourier transform (FFT) processing is necessary. If the wavelength of the signal is known, the phase has a one-to-one relationship with the angle of arrival. The output power of the SOI is maximized when the compensation phase that makes the signals of all channels coherent is found. This establishes the relationship between the output power and the direction of arrival.
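Suppose that $\mathbf{a}(\theta, \phi)$ is the unit direction vector of a plane wave propagating in space:
$$\mathbf{a}(\theta,\phi) = -\begin{bmatrix} \cos\phi\cos\theta & \cos\phi\sin\theta & \sin\phi \end{bmatrix}^T$$
where θ and ϕ are the azimuth and elevation angles in spherical coordinates, respectively. In this design, only the azimuth angle θ is of interest, so ϕ is held constant and $\mathbf{a}(\theta, \phi)$ is written as $\mathbf{a}(\theta)$ for brevity.
Suppose that the number of array elements is M and that the position of the m-th element is $\mathbf{P}_m$ (m = 0, 1, ..., M − 1). The time difference between the m-th array element and the reference point is then given by
$$\tau_m = \frac{\mathbf{a}^T \mathbf{P}_m}{c}$$
where c is the propagation velocity of the plane wave and $\mathbf{a}^T$ is the transpose of the vector $\mathbf{a}$. The signal received by the m-th array element is given by
$$x_m(t) = s(t - \tau_m)$$
where s(t) is the signal received at the reference point. Its frequency spectrum is
$$X_m(\omega) = \int_{-\infty}^{\infty} s(t - \tau_m)\, e^{-j\omega t}\, dt = e^{-j\omega\tau_m} S(\omega)$$
where j and ω are the imaginary unit and the angular frequency, respectively. Define the wave number vector $\mathbf{k} = \omega\mathbf{a}/c$, so that $\omega\tau_m = \mathbf{k}^T \mathbf{P}_m$. The received signal vector can then be expressed as
$$\mathbf{X}(\omega) = \begin{bmatrix} e^{-j\mathbf{k}^T\mathbf{P}_0} \\ e^{-j\mathbf{k}^T\mathbf{P}_1} \\ \vdots \\ e^{-j\mathbf{k}^T\mathbf{P}_{M-1}} \end{bmatrix} S(\omega) = \mathbf{v}(\mathbf{k})\, S(\omega)$$
where $\mathbf{v}$ is the array steering vector, a function of $\mathbf{k}$ that describes the response of the array to the signal in the spatial domain.
$\mathbf{X}(\omega)$ is the output of the array. The conventional beamformer $\mathbf{H}^T(\omega)$ is used to compensate for the phase differences. It is given by
$$\mathbf{H}^T(\omega) = \frac{1}{M}\,\mathbf{v}^H(\mathbf{k}_s)$$
where $\mathbf{k}_s$ is the wave number vector of the plane wave of interest. The output is
$$Y(\omega) = \mathbf{H}^T(\omega)\,\mathbf{X}(\omega) = \frac{1}{M}\,\mathbf{v}^H(\mathbf{k}_s)\,\mathbf{v}(\mathbf{k})\, S(\omega) \tag{10}$$
Supposing that the signal has unit power, i.e., S(ω) = 1, Eq. (10) defines the beam pattern, which describes the response of the array to a unit-power plane wave in space.
In narrowband snapshot processing at $\omega_c$, the output of the beamformer is given by
$$y = \mathbf{w}^H \mathbf{x} \tag{11}$$
where $\mathbf{x}$ is the array snapshot, the complex weight vector satisfies $\mathbf{w}^H = \mathbf{H}^T(\omega_c)$, and $\omega_c$ is the central frequency of the narrow band.
The beam output power can be calculated as follows:
$$P = E\left[|y|^2\right] = \mathbf{w}^H \mathbf{R}_x \mathbf{w} \tag{12}$$
where $\mathbf{R}_x$ is the covariance matrix of the input signal, which can be expressed as:
$$\mathbf{R}_x = E\left[\mathbf{x}\mathbf{x}^H\right] \tag{13}$$
The weight vector $\mathbf{w}$ is a function of θ and ϕ. In this application only the azimuth angle θ is of concern, so Eq. (12) becomes
$$P(\theta) = \mathbf{w}^H(\theta)\, \mathbf{R}_x\, \mathbf{w}(\theta) \tag{14}$$
The angular distribution of power can be represented as a power azimuth spectrum (PAS) by scanning the beam in azimuth. The azimuth of maximum power is the direction of the sound.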
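The beam-scanning step can be illustrated with a short NumPy sketch that evaluates $P(\theta) = \mathbf{w}^H \mathbf{R}_x \mathbf{w}$ over a grid of azimuths. Several choices here are assumptions rather than details given in the text: the array is modelled as one centre microphone plus six on a ring of radius 0.045 m (which is consistent with 7 elements and a 0.045 m spacing, but the exact layout is not stated), the speed of sound is taken as 340 m/s, and the function names are hypothetical.

```python
import numpy as np

C = 340.0       # assumed speed of sound in m/s
RADIUS = 0.045  # array radius from the text, in m

def array_positions(n_ring=6, radius=RADIUS):
    """Assumed layout: one centre microphone plus n_ring microphones on a circle."""
    ang = 2 * np.pi * np.arange(n_ring) / n_ring
    ring = np.stack([radius * np.cos(ang), radius * np.sin(ang)], axis=1)
    return np.vstack([np.zeros((1, 2)), ring])  # shape (n_ring + 1, 2)

def steering_vector(theta_deg, freq, pos, c=C):
    """v(k) with elements exp(-j k^T P_m) for a plane wave arriving from azimuth theta."""
    th = np.deg2rad(theta_deg)
    a = -np.array([np.cos(th), np.sin(th)])  # propagation direction in the array plane
    k = 2 * np.pi * freq / c * a             # wave number vector k = omega * a / c
    return np.exp(-1j * pos @ k)

def cbf_spectrum(R, freq, pos, angles=np.arange(360)):
    """Power azimuth spectrum P(theta) = w^H R w with w = v(theta) / M."""
    M = pos.shape[0]
    P = np.empty(len(angles))
    for i, th in enumerate(angles):
        w = steering_vector(th, freq, pos) / M
        P[i] = np.real(w.conj() @ R @ w)
    return P
```

Given a covariance matrix `R` estimated from snapshots at one frequency, `np.argmax(cbf_spectrum(R, freq, array_positions()))` returns the scanned azimuth with the largest output power.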
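The standard Capon beamforming (SCB), or minimum variance distortionless response (MVDR) beamforming, algorithm allows the SOI to pass through without distortion while suppressing the interference signals and noise as much as possible. It is given by the following constrained problem:
$$\min_{\mathbf{w}}\ \mathbf{w}^H \mathbf{R}_x \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^H \mathbf{v} = 1 \tag{15}$$
The Lagrange multiplier method is used to solve this problem, and the weight vector of MVDR is obtained as follows:
$$\mathbf{w}_{\mathrm{MVDR}} = \frac{\mathbf{R}_x^{-1}\mathbf{v}}{\mathbf{v}^H \mathbf{R}_x^{-1} \mathbf{v}} \tag{16}$$
which is substituted into Eq. (14) to get the power estimate:
$$P_{\mathrm{MVDR}} = \frac{1}{\mathbf{v}^H \mathbf{R}_x^{-1} \mathbf{v}} \tag{17}$$
However, the actual array steering vector often deviates from the theoretical steering vector $\bar{\mathbf{v}}$, which causes a power loss of the SOI. The RABF algorithm estimates the true steering vector $\mathbf{v}$ and replaces it with the estimate $\hat{\mathbf{v}}$. The smaller the deviation between $\hat{\mathbf{v}}$ and $\mathbf{v}$, the larger the output power of the beamformer. Therefore, the RABF algorithm based on array steering vector estimation can be converted to the following quadratic optimization:
$$\min_{\mathbf{v}}\ \mathbf{v}^H \mathbf{R}_x^{-1} \mathbf{v} \quad \text{subject to} \quad \|\mathbf{v} - \bar{\mathbf{v}}\|^2 \le \varepsilon \tag{18}$$
where ε is the upper bound on the squared norm of the steering vector error, which depends on the error between the theoretical and the actual steering vector. Similarly, this problem can be solved by the Lagrange multiplier method, with the Lagrangian given by
$$f(\mathbf{v}, \lambda) = \mathbf{v}^H \mathbf{R}_x^{-1} \mathbf{v} + \lambda\left(\|\mathbf{v} - \bar{\mathbf{v}}\|^2 - \varepsilon\right) \tag{19}$$
in which λ ≥ 0 is the Lagrange multiplier. Taking the partial derivative of the above equation with respect to $\mathbf{v}$ and setting it to 0 gives the best steering vector estimate:
$$\hat{\mathbf{v}} = \left(\frac{\mathbf{R}_x^{-1}}{\lambda} + \mathbf{I}\right)^{-1} \bar{\mathbf{v}} = \bar{\mathbf{v}} - \left(\mathbf{I} + \lambda \mathbf{R}_x\right)^{-1} \bar{\mathbf{v}} \tag{20}$$
Let
$$\mathbf{R}_x = \mathbf{U} \boldsymbol{\Gamma} \mathbf{U}^H$$
where $\mathbf{U}$ consists of the eigenvectors of $\mathbf{R}_x$, and the eigenvalues of $\mathbf{R}_x$ constitute the diagonal elements of the diagonal matrix $\boldsymbol{\Gamma}$, in which $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_M$. Let $\mathbf{z} = \mathbf{U}^H \bar{\mathbf{v}}$. Substituting Eq. (20) into the constraint of Eq. (18), taken with equality, gives
$$g(\lambda) = \left\| \left(\mathbf{I} + \lambda \mathbf{R}_x\right)^{-1} \bar{\mathbf{v}} \right\|^2 = \sum_{m=1}^{M} \frac{|z_m|^2}{(1 + \lambda\gamma_m)^2} = \varepsilon \tag{21}$$
where $z_m$ denotes the m-th element of $\mathbf{z}$. The solution λ to Eq. (21) is unique, and it can be shown that λ belongs to the following interval:
$$\frac{\|\bar{\mathbf{v}}\| - \sqrt{\varepsilon}}{\gamma_1 \sqrt{\varepsilon}} \le \lambda \le \frac{\|\bar{\mathbf{v}}\| - \sqrt{\varepsilon}}{\gamma_M \sqrt{\varepsilon}}$$
Once λ is determined, the best steering vector $\hat{\mathbf{v}}$ can be calculated by Eq. (20). Substituting the estimate of the steering vector into Eq. (16) gives the weight vector:
$$\mathbf{w}_{\mathrm{RABF}} = \frac{\mathbf{R}_x^{-1}\hat{\mathbf{v}}}{\hat{\mathbf{v}}^H \mathbf{R}_x^{-1} \hat{\mathbf{v}}}$$
Finally, the power estimate can be calculated by Eq. (17), as follows:
$$P_{\mathrm{RABF}} = \frac{1}{\hat{\mathbf{v}}^H \mathbf{R}_x^{-1} \hat{\mathbf{v}}}$$
The overall broadband beamforming algorithm is illustrated in FIGURE 2.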
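The steering vector estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of this work: the multiplier λ is found by bisection on $g(\lambda) = \varepsilon$ inside the interval quoted in the text (any one-dimensional root finder would do), the covariance matrix is assumed to be Hermitian positive definite, and ε is assumed to be smaller than $\|\bar{\mathbf{v}}\|^2$. The function names are hypothetical.

```python
import numpy as np

def mvdr_power(R, v):
    """Capon power estimate 1 / (v^H R^-1 v) for steering vector v."""
    return 1.0 / np.real(v.conj() @ np.linalg.solve(R, v))

def rcb_steering(R, v_bar, eps, iters=100):
    """Estimate the steering vector subject to ||v - v_bar||^2 <= eps.

    Returns (v_hat, lam), where lam solves
    sum_m |z_m|^2 / (1 + lam * gamma_m)^2 = eps.
    """
    gam, U = np.linalg.eigh(R)            # eigenvalues (ascending) and eigenvectors
    z = U.conj().T @ v_bar                # z = U^H v_bar
    norm_v = np.linalg.norm(v_bar)

    def g(lam):
        return np.sum(np.abs(z) ** 2 / (1.0 + lam * gam) ** 2) - eps

    # Interval that brackets the unique root.
    lo = (norm_v - np.sqrt(eps)) / (gam.max() * np.sqrt(eps))
    hi = (norm_v - np.sqrt(eps)) / (gam.min() * np.sqrt(eps))
    for _ in range(iters):                # bisection: g decreases monotonically in lam
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    # v_hat = v_bar - (I + lam R)^-1 v_bar, evaluated in the eigenbasis of R.
    v_hat = v_bar - U @ (z / (1.0 + lam * gam))
    return v_hat, lam
```

The power estimate of the robust beamformer is then `mvdr_power(R, v_hat)`. Because the theoretical vector itself lies inside the constraint set, this value is never smaller than `mvdr_power(R, v_bar)`.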
The VAD processing distinguishes whether the sound signal is valuable, so it runs on all 7 channels before beamforming. The received signal of each array element is preprocessed, and the data are decomposed into K subbands by the discrete Fourier transform (DFT). The signal in each subband can then be treated as a narrowband signal for the beamforming operation. Finally, the output signal in the time domain is obtained by the inverse discrete Fourier transform (IDFT).
A K-point DFT is performed on each of the M sampling channels to obtain K frequency bins. The results of all channels in the same frequency bin are collected into one vector, so that the K vectors can be expressed as:
$$\mathbf{X}(k) = \begin{bmatrix} X_0(k) & X_1(k) & \cdots & X_{M-1}(k) \end{bmatrix}^T, \quad k = 0, 1, \cdots, K-1$$
where $X_m(k)$ is the frequency-domain signal of the m-th (m = 0, 1, ..., M − 1) array element.
After the M signals are weighted and summed, the results for the K frequency bins are obtained as follows:
$$Y(k) = \mathbf{w}^H(k)\, \mathbf{X}(k) = \sum_{m=0}^{M-1} w_m^*(k)\, X_m(k)$$
where $\mathbf{w}(k)$ is the weight vector in each subband, which can be expressed as:
$$\mathbf{w}(k) = \begin{bmatrix} w_0(k) & w_1(k) & \cdots & w_{M-1}(k) \end{bmatrix}^T$$
A K-point inverse Fourier transform of Y(k) is performed to obtain the beam output y(n) in the time domain. The beamforming output power is then calculated for each scanning direction, and the power-azimuth spectrum can be plotted.
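The subband procedure can be sketched in NumPy as below. The sketch makes several assumptions that are not taken from the text: conventional (delay-and-sum) weights are used in every bin, the array layout is one centre microphone plus six on a 0.045 m ring, the speed of sound is 340 m/s, and only bins inside a hypothetical 300 to 3700 Hz band are used. Instead of transforming Y(k) back to the time domain, it sums $|Y(k)|^2$ over the selected bins, which is the output power restricted to that band up to a constant factor. All function names are hypothetical.

```python
import numpy as np

C = 340.0  # assumed speed of sound in m/s

def ring_positions(radius=0.045, n_ring=6):
    """Assumed layout: one centre microphone plus n_ring microphones on a circle."""
    ang = 2 * np.pi * np.arange(n_ring) / n_ring
    ring = np.stack([radius * np.cos(ang), radius * np.sin(ang)], axis=1)
    return np.vstack([np.zeros((1, 2)), ring])

def delays(theta_deg, pos, c=C):
    """tau_m = a^T P_m / c for a plane wave arriving from azimuth theta."""
    th = np.deg2rad(theta_deg)
    a = -np.array([np.cos(th), np.sin(th)])
    return pos @ a / c

def broadband_pas(frames, fs, pos, angles=np.arange(0, 360, 2),
                  band=(300.0, 3700.0)):
    """Power azimuth spectrum from one frame per channel.

    frames: real array of shape (M, K), one K-sample frame per microphone.
    """
    M, K = frames.shape
    X = np.fft.rfft(frames, axis=1)                 # X_m(k) for every channel
    freqs = np.fft.rfftfreq(K, d=1.0 / fs)
    use = (freqs >= band[0]) & (freqs <= band[1])   # subbands kept for scanning
    pas = np.zeros(len(angles))
    for i, th in enumerate(angles):
        tau = delays(th, pos)
        # w_m(k): steering phase for each channel and bin, scaled by 1/M
        W = np.exp(-2j * np.pi * np.outer(tau, freqs[use])) / M
        Y = np.sum(W.conj() * X[:, use], axis=0)    # Y(k) = w^H(k) X(k)
        pas[i] = np.sum(np.abs(Y) ** 2)
    return pas
```

Replacing the delay-and-sum weights with per-bin MVDR or RABF weights would follow the same loop structure, with one covariance matrix estimated per frequency bin.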

B. VISUAL TARGET DETECTION AND TRACKING
According to the result of acoustic localization, the pan-tilt camera turns towards the target outside the camera's FOV. Then the frame-difference method is used to detect the moving object and the mean-shift tracking algorithm is applied to track the detected target.

1) MOVING TARGET DETECTION BY FRAME-DIFFERENCE
The principle of the frame-difference method is to subtract two adjacent frames of a video sequence in order to detect the moving target. The difference of two adjacent frames is expressed as
$$\Delta I(x, y) = \left| I(x, y, t) - I(x, y, t-1) \right|$$
where I(x, y, t) represents the gray value at the point (x, y) at time t, and ΔI(x, y) is the grayscale difference between the two adjacent frames. The difference image is then binarized so that the target can be detected and its position located.
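A minimal NumPy sketch of this detection step is given below. The binarization threshold of 25 gray levels and the use of a single bounding box around all changed pixels are assumptions made for the example; the function name is hypothetical.

```python
import numpy as np

def detect_motion(prev_gray, curr_gray, thresh=25):
    """Frame-difference detection on two grayscale frames of equal size.

    Returns (mask, bbox): mask is the binarized difference image and bbox is
    (x_min, y_min, x_max, y_max) around the changed pixels, or None if no
    pixel changed by more than thresh.
    """
    # Cast to a signed type first so that the subtraction cannot wrap around.
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    mask = diff > thresh
    if not mask.any():
        return mask, None
    ys, xs = np.nonzero(mask)
    return mask, (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
```

The returned bounding box can serve as the initial target region for the tracking stage.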

2) MEAN-SHIFT TRACKING
After the target position is located by frame-difference, the mean-shift algorithm is used to track it. The algorithm tracks the target motion using gradient information and has a linear convergence rate, which keeps the number of iterative calculations small and makes it easy to apply in real time.
The mean-shift vector of the target model points towards the candidate model with the largest similarity to the target, and is therefore the vector of the target movement. Because the mean-shift algorithm converges quickly, iteratively computing the mean-shift vector eventually converges to the true position of the target. The mean-shift algorithm takes the moving target obtained by the frame-difference method as the tracking target, and the RGB feature is used to model the probability density estimate. Suppose that there are n pixels in the target area, whose positions are $x_i$ (i = 1, ..., n). The color space of the target is divided into m intervals. The probability density of the target model is expressed as $q = [q_1\ q_2\ \cdots\ q_m]^T$. Suppose that x is the position of the center of the target region. In each interval u the density is given by
$$q_u = C_1 \sum_{i=1}^{n} k\left( \left\| \frac{x_i - x}{h} \right\|^2 \right) \delta(i, u) \tag{31}$$
In Eq. (31), h is the window width of the kernel function. δ(i, u) is the Kronecker function, which determines whether $x_i$ belongs to the interval u and returns 1 or 0. k is the profile function of the Epanechnikov kernel function $K_E(z)$. $C_1$ is a normalization coefficient expressed as:
$$C_1 = \frac{1}{\sum_{i=1}^{n} k\left( \left\| \frac{x_i - x}{h} \right\|^2 \right)} \tag{32}$$
The kernel and its profile are related by $K_E(z) = k(\|z\|^2)$. Suppose the volume of the d-dimensional unit ball is $c_d$; the kernel can then be described as:
$$K_E(z) = \begin{cases} \frac{1}{2}\, c_d^{-1} (d + 2)\left(1 - \|z\|^2\right), & \|z\| < 1 \\ 0, & \text{otherwise} \end{cases} \tag{33}$$
The pixels in the target candidate region are denoted $x_i$ (i = 1, 2, ..., n), and the probability density estimate of the candidate is expressed as $p = [p_1\ p_2\ \cdots\ p_m]^T$. In each interval u it is given by
$$p_u(y) = C_2 \sum_{i=1}^{n} k\left( \left\| \frac{x_i - y}{h} \right\|^2 \right) \delta(i, u) \tag{34}$$
where y is the position of the center of the candidate. $C_2$ is a normalization coefficient expressed as:
$$C_2 = \frac{1}{\sum_{i=1}^{n} k\left( \left\| \frac{x_i - y}{h} \right\|^2 \right)} \tag{35}$$
The Bhattacharyya coefficient is used to measure the similarity between the target and the candidate model, as follows:
$$\rho[p(y), q] = \sum_{u=1}^{m} \sqrt{p_u(y)\, q_u} \tag{36}$$
The larger the value of ρ[p(y), q], the higher the similarity. Let $y_s$ be the initial position of the target. A first-order Taylor expansion of ρ[p(y), q] at $y_s$, after rearrangement, gives Eq. (37):
$$\rho[p(y), q] \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{p_u(y_s)\, q_u} + \frac{C_2}{2} \sum_{i=1}^{n} \omega_i\, k\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \tag{37}$$
in which
$$\omega_i = \sum_{u=1}^{m} \sqrt{\frac{q_u}{p_u(y_s)}}\; \delta(i, u) \tag{38}$$
In Eq. (37), the first term on the right side does not contain y. The second term represents a kernel probability density estimate with profile function k and weights $\omega_i$. Let T(y) represent the second term. Calculating its gradient with respect to y and rearranging gives Eq. (39):
$$\nabla T(y) = \frac{C_2}{h^2} \left[ -\sum_{i=1}^{n} \omega_i\, k'\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \right] \left[ \frac{\sum_{i=1}^{n} x_i\, \omega_i\, k'\left( \left\| \frac{y - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} \omega_i\, k'\left( \left\| \frac{y - x_i}{h} \right\|^2 \right)} - y \right] \tag{39}$$
where k′ represents the derivative of k. The term in the first bracket can be regarded as a kernel probability density estimate with profile function −k′, and the term in the second bracket is the mean-shift vector. The maximization of T(y) can be completed by the following mean-shift iterative process:
$$y_{s+1} = \frac{\sum_{i=1}^{n} x_i\, \omega_i\, k'\left( \left\| \frac{y_s - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} \omega_i\, k'\left( \left\| \frac{y_s - x_i}{h} \right\|^2 \right)} \tag{40}$$
When the condition $\|y_{s+1} - y_s\| \le \varepsilon$ is satisfied or the number of iterations reaches N, the iteration ends and the target location is updated. In the algorithm, ε = 0.5 pixels and N = 20.
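The iteration can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions that are not taken from the text: the RGB space is quantized into 8 intervals per channel (m = 512), the window is a disc of radius `half` pixels that stays inside the image, and the model is built from a single reference frame. Because the Epanechnikov profile has a constant derivative on its support, the update of Eq. (40) reduces to a weighted mean of the pixel positions inside the window. All function names are hypothetical.

```python
import numpy as np

def _bin_index(patch, bins=8):
    """Map each RGB pixel to one of bins**3 colour intervals."""
    q = (patch.astype(np.int32) * bins) // 256
    return q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]

def _window(img, center, half):
    """Coordinates, colour bins and kernel weights of the pixels around center."""
    h, w = img.shape[:2]
    cx, cy = center
    x0, x1 = max(int(np.floor(cx - half)), 0), min(int(np.ceil(cx + half)) + 1, w)
    y0, y1 = max(int(np.floor(cy - half)), 0), min(int(np.ceil(cy + half)) + 1, h)
    ys, xs = np.mgrid[y0:y1, x0:x1]
    r2 = ((xs - cx) ** 2 + (ys - cy) ** 2) / float(half ** 2)  # ||(x_i - y)/h||^2
    k = np.clip(1.0 - r2, 0.0, None)  # Epanechnikov profile up to a constant factor
    return xs, ys, _bin_index(img[y0:y1, x0:x1]), k

def color_model(img, center, half, bins=8):
    """Kernel-weighted colour histogram (q_u or p_u), normalized to sum to 1."""
    _, _, idx, k = _window(img, center, half)
    hist = np.bincount(idx.ravel(), weights=k.ravel(), minlength=bins ** 3)
    return hist / hist.sum()

def mean_shift(img, q, start, half, eps=0.5, max_iter=20, bins=8):
    """Iterate the mean-shift update from start (x, y) until the step is <= eps."""
    y = np.array(start, dtype=float)
    for _ in range(max_iter):
        xs, ys, idx, k = _window(img, y, half)
        p = np.bincount(idx.ravel(), weights=k.ravel(), minlength=bins ** 3)
        p = p / p.sum()
        wgt = np.sqrt(q[idx] / np.maximum(p[idx], 1e-12))  # omega_i
        wgt = wgt * (k > 0)            # k' is constant inside the window, zero outside
        total = wgt.sum()
        if total <= 0:
            break
        y_new = np.array([np.sum(wgt * xs), np.sum(wgt * ys)]) / total
        if np.linalg.norm(y_new - y) <= eps:
            return y_new
        y = y_new
    return y
```

A typical use is to build `q = color_model(first_frame, center, half)` once from the detected target and then call `mean_shift(next_frame, q, previous_center, half)` for every new frame. The defaults `eps=0.5` and `max_iter=20` mirror the stopping values quoted above.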

III. SYSTEM ARCHITECTURE
This section describes the system architecture depicted in FIGURE 3. The rotation range of the camera is 0 to 360 degrees in the two-dimensional plane. The array shape should not only match this detection range but also minimize the system volume. The resolution of a circular array is the same in all directions, so it matches the rotation range of the camera well. As shown in FIGURE 4, the overall system is integrated on a circular substrate with a diameter of 10 cm. It is connected to the host computer through an Ethernet or USB port. The platform is based on FPGA data acquisition and ARM system control. The auditory module is divided into three main parts: the acoustic sensor, FPGA data acquisition and the ARM control module.
1) The acoustic sensing module in the system is a centrally symmetrical circular array structure consisting of 7 acoustic digital MEMS sensors ADMP441.
2) FPGA synchronously acquires digital audio signals from 7 MEMS acoustic sensors via I²S bus.
3) ARM aggregates and processes the data received by the FPGA through the FMC bus.
The ADMP441 (ADI, Massachusetts, USA) microphone consists of a MEMS sensor, an ADC circuit, power management and an industry-standard 24-bit I²S interface. It has a flat response curve from 60 Hz to 15 kHz and can be connected directly to the FPGA using the I²S interface. Therefore, no audio encoder or decoder is required in the system.
The main function of the FPGA is to provide a synchronous multi-channel I²S interface. The low-power, low-cost chip EP2C5T144C8 (Altera, California, USA) of the Cyclone II family is selected as the FPGA controller. To collect data in parallel, it implements the timing of four I²S buses, a buffer area and a data transmission interface. A 48 MHz active crystal oscillator is used as the clock input of the FPGA. The serial configuration device EPCS16, with a 16 Mbit storage capacity, is used for program storage and chip configuration, and a USB Blaster is used for debugging and downloading programs. Two AMS117 linear voltage regulator chips provide the voltages of 1.2 V and 1.8 V, respectively, for the FPGA. To keep the microphone signals synchronized during acquisition, the same SCK and WS lines are shared between the FPGA and all microphones.
The main function of the ARM chip in this system is to process the data collected by the FPGA through the FMC bus. Communication with the FPGA requires a high-speed communication bus. Meanwhile, the underlying hardware and drivers supporting USB or Ethernet are needed to exchange data with the host computer. An ARM Cortex-M7 series microprocessor is used for real-time control: the STM32F746ZGT6 (STMicroelectronics, Geneva, Switzerland) has a 216 MHz main frequency, 320 KB of RAM and 1 MB of flash storage. Most importantly, it supports the FMC function, a USB interface and Ethernet communication. In this design, the STM32 mainly writes the data in SRAM into the buffer through the FMC bus, and then transmits the packaged data to the host computer through the USB bus to realize the function of a USB microphone array. The main work is the implementation of the communication interface between the STM32 and the FPGA, as well as the adaptation to the USB audio class specification. The STM32 and the FPGA exchange data through a 16-bit data bus, with the data lines and address lines time-division multiplexed.

IV. RESULT AND DISCUSSION
The experimental setup, shown in FIGURE 5 (a), consists of the MEMS microphone array described in Section III, a PTZ camera (DS-2DC7120IW-A, Hikvision, Hangzhou, China), and a computer. The system was tested in a normal office room with a size of 12 m × 7.5 m × 4 m. FIGURE 5 (b) illustrates the experimental scenario. The door opening sound activated the auditory module of the system, and the visual module was then driven to detect and track the target person according to the azimuth information provided by the auditory module. The auditory module was also tested separately because of its significance to the overall system.

A. AUDITORY LOCALIZATION
In the auditory module test, seven channels of sound signals are received by the MA. The phase differences among these seven signals are used to calculate the source direction. A snapshot of the seven signals acquired by the MA in the time domain is shown in FIGURE 6. FIGURE 7 shows a snapshot of the sound signal and the VAD result based on short-term energy and ZCR. The part between the two red lines is judged to be a valid voice. The power-azimuth spectrum estimates of the conventional beamformer (CBF), MVDR and RABF are plotted for a theoretical R and a sample R, respectively, as shown in FIGURE 8. Because the design uses a circular MA with omnidirectional microphones, the SOI power does not depend on the azimuth of the source. Therefore, the SOI azimuth in the simulation is set to 0°, and the azimuth angle of the experimental sound source relative to the MA is taken to be 0°. The simulation uses the same parameters as the physical MA: the number of array elements is M = 7, the radius is r = 0.045 m, and the array spacing is d = 0.045 m. In time-domain signal processing, under-sampling the signal causes aliasing, as stated by the Nyquist sampling theorem. Similarly, spatial under-sampling generates grating lobes, which blur the peak response. To satisfy the spatial sampling theorem and eliminate lobe aliasing, the condition d/λ ≤ 1/2 must hold, that is, the signal frequency must be lower than 3777 Hz. The frequency of the simulated signal is therefore set to 3600 Hz, and the simulated signal source is composed of a 3600 Hz sine wave and Gaussian white noise. Comparing the SOI power estimates of the three algorithms shows that RABF obtains a higher SOI power than the other two.
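The 3777 Hz limit can be checked with a short calculation. The speed of sound is not stated in the text; the value c = 340 m/s is assumed here because it reproduces the quoted figure:

$$\frac{d}{\lambda} \le \frac{1}{2}, \quad \lambda = \frac{c}{f} \quad \Longrightarrow \quad f \le \frac{c}{2d} = \frac{340\ \text{m/s}}{2 \times 0.045\ \text{m}} \approx 3777.8\ \text{Hz}$$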
Furthermore, to study the robustness of the algorithms, we also calculate the root-mean-square error (RMSE) of the three algorithms from 500 Monte-Carlo runs each. In the simulation, for the sake of generality, the SOI azimuth is drawn randomly from 0° to 359°. The SNR is set to 0 dB while the number of snapshots is varied, and the number of snapshots is set to 16 while the SNR is varied. FIGURE 9 (a) shows that RABF and CBF are superior to MVDR when the number of snapshots is less than 16, and that RABF and CBF have similar performance. FIGURE 9 (b) shows that CBF and RABF are slightly better than MVDR when the SNR is lower than 0 dB, and again that RABF and CBF have similar performance. CBF is known for its robustness, and FIGURE 9 demonstrates that RABF performs almost the same as CBF both with few snapshots and at low SNR. These experiments demonstrate that RABF possesses superior robustness while achieving a higher SOI power. The RMSE is less than 3° at 0 dB SNR with enough sampling points.
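When the true azimuth is drawn from the full circle, an estimate of 359° for a source at 0° is only 1° off. A helper that computes the RMSE with this wrap-around might look like the sketch below. Whether the reported RMSE was computed with wrap-around is not stated in the text, so this is an assumption, and the function name is hypothetical.

```python
import numpy as np

def angular_rmse(est_deg, true_deg):
    """RMSE in degrees, with each error wrapped to the range [-180, 180)."""
    err = (np.asarray(est_deg, dtype=float)
           - np.asarray(true_deg, dtype=float) + 180.0) % 360.0 - 180.0
    return float(np.sqrt(np.mean(err ** 2)))
```

In a Monte-Carlo study, `est_deg` would hold the estimated azimuths of all runs and `true_deg` the corresponding randomly drawn source azimuths.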
Finally, the localization accuracy of the sound module was tested in the conference room environment. The microphone array was placed in the center of the room, and the sound source was placed at 0°, 90°, 180°, and 270° in turn, at a distance of 3 m from the microphone array. The algorithm was tested 30 times in each direction, and the estimation results are listed in TABLE 1. The experimental results show that the error of the averaged angle is less than 3°, which is accurate enough to help the visual module detect and track the target.

B. AUDITORY-VISUAL TRACKING TEST
The experiment of auditory-visual joint tracking was performed in accordance with the scenario of FIGURE 5 (b). The system was modeled on the human perception system. At the beginning, the camera was in its normal state and the target was out of sight. When the system received a valid audio input, namely the door opening sound and footsteps, the auditory localization algorithm was used to estimate the target angle and the PTZ camera rotated to bring the target into the FOV of the camera. The visual module could then detect and track the target. The pan-tilt rotation was controlled according to the target position in each image frame. The audio module supplements the working state of the camera when necessary. The detection rate of the vision module is 25 FPS. In the experiment, the target moved at a speed of about 1 m/s, and the distance between the target's straight path and the system was about 2.5 m. In this design, the sound source module is always in the working state to provide azimuth information for the system. The priority of visual tracking is higher than that of auditory tracking, because the visual module provides more information than the auditory module. When the video tracking loses its target, the audio module takes over the tracking again. If neither module provides an enable signal, the system works like a normal camera.
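The priority rule described above can be summarized in a few lines of Python. This is only a sketch of the arbitration logic under stated assumptions: the names `Mode`, `select_mode` and `pan_target` are hypothetical, and the conversion from a pixel offset to a pan angle assumes a linear mapping with a hypothetical 60° horizontal FOV and a 1280-pixel frame width, neither of which is given in the text.

```python
from enum import Enum

class Mode(Enum):
    IDLE = "idle"          # no enable signal: behaves like a normal camera
    AUDITORY = "auditory"  # pan-tilt steered by the estimated sound azimuth
    VISUAL = "visual"      # pan-tilt steered by the visual tracking result

def select_mode(visual_enable: bool, audio_enable: bool) -> Mode:
    """Visual tracking has priority; audio takes over when vision has no target."""
    if visual_enable:
        return Mode.VISUAL
    if audio_enable:
        return Mode.AUDITORY
    return Mode.IDLE

def pan_target(mode, current_pan, audio_azimuth=None, target_x=None,
               frame_width=1280, hfov_deg=60.0):
    """Return the pan angle (degrees) the actuator should move to."""
    if mode is Mode.AUDITORY and audio_azimuth is not None:
        return audio_azimuth % 360.0
    if mode is Mode.VISUAL and target_x is not None:
        # Assumed linear mapping from horizontal pixel offset to angle.
        offset = (target_x - frame_width / 2.0) / frame_width * hfov_deg
        return (current_pan + offset) % 360.0
    return current_pan
```

Calling `select_mode` once per video frame and passing the result to `pan_target` reproduces the hand-over behaviour described above: vision drives the actuator whenever it has a target, sound drives it otherwise, and the camera holds its position when neither is enabled.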
For the visual tracking, referring to Eq. (33), the probability weight distribution of each pixel of the target region is shown in FIGURE 10. In the target area of 310 × 136 pixels, the weight of the center portion is greater than that of the edge portion. Referring to the Eq. [...] TABLE 2. The VAD, FFT and beamforming all run on 7 channels. N is the length of audio data per frame or the number of pixels of image data per frame. The computational cost represents the number of real multiplications the system performs. The beamforming and mean-shift algorithms account for most of the computation, with time costs of 9.205 ms and 25.583 ms, respectively. The overall time cost meets the real-time requirements.
The process of the overall test is shown in FIGURE 12, where the frames are numbered in time order. From frame 1 to 3, the video module was initially in the sleep or normal state and was triggered by the door opening sound signal. The pan-tilt rotation was controlled by the auditory azimuth signal, based on the auditory localization data, until the target was detected in the video sequence in frame 4. Once the target was detected, the visual enable signal took over the control of the pan-tilt rotation according to the mean-shift tracking data, as in frames 5 to 16. When the target was lost by the visual module, control was taken over by the auditory module again. If neither module provided an enable signal, the tracking ended and the visual module re-entered the sleep or normal state until it was triggered by an audio signal. The experiment was repeated 20 times, and the targets were accurately detected and tracked in all groups.

V. CONCLUSION
In this article, to solve the problem that conventional cameras fail to monitor objects in their blind zones, we proposed an acoustic enhanced camera tracking system for real-time monitoring. It is based on a small-aperture MEMS microphone array with a 4.5 cm radius, which makes the system miniaturized and integrated. The design is inspired by the joint action of human beings from hearing to vision. Broadband beamforming and the mean-shift algorithm were the main algorithms implemented for sound and image data processing, respectively. The acoustic localization results show that the error of the averaged angle is less than 3°, demonstrating that the auditory module helps the visual tracking module locate a target outside the FOV of the camera. In the joint tracking experiments, targets were accurately detected and tracked in all groups. The system extends the detection angle of the camera tracking system to all directions and performs well in terms of real-time operation and robustness. In practical applications, worse scenarios are often encountered, such as lower SNR, occlusion and multiple sources. The applied algorithms need to be optimized to improve the performance of the system in such environments. An improved tracking strategy is also indispensable to meet the demands of more complex situations.
LEI LI received the B.S. degree in electronic engineering from Zhengzhou University, Zhengzhou, China, in 2004, and the Ph.D. degree in electronic engineering from the Institute of Acoustics, Chinese Academy of Sciences, in 2009. He is currently an Associate Professor with the School of Physics and Engineering, Zhengzhou University. His current research interests include array signal processing, machine learning, and computer vision.
KECHAO LIAN received the B.S. degree in electrical engineering from Zhengzhou University, Zhengzhou, China, in 2018, where he is currently pursuing the master's degree in instrumentation engineering. His current research interests include array signal processing and computer vision.
JINTAO FU is currently pursuing the bachelor's degree in electrical engineering with Zhengzhou University, Zhengzhou, China. His current research interests include digital signal processing, digital image processing, and sensor principle and application.
PENGFEI ZHU received the B.S. and M.S. degrees in electrical engineering from Zhengzhou University, Zhengzhou, China, in 2014 and 2018, respectively. He is currently an Algorithmic Engineer with the 22nd Research Institute of China Electronics Technology Group Corporation. His research interests cover signal processing, satellite navigation, and remote sensing.
ZHIYONG HU received the B.S. degree in electrical engineering from Zhengzhou University, Zhengzhou, China, in 2018, where he is currently pursuing the master's degree in instrumentation engineering. His current research interests include ultrasonic gas flow meters, signal processing, and machine learning.
CE GUO received the B.S. degree in electrical engineering from the Zhengzhou University of Light Industry, Zhengzhou, China, in 2016. He is currently pursuing the master's degree in instrumentation engineering with Zhengzhou University, Zhengzhou, China. His current research interests include array signal processing, machine learning, and computational vision.