Audio-Processing-Based Human Detection at Disaster Sites With Unmanned Aerial Vehicle

This paper describes a human search system that uses an unmanned aerial vehicle (UAV). The use of robots to search for people is expected to become an auxiliary tool for saving lives during a disaster. In particular, because UAVs can collect information from the air, there has been much research into human search using UAVs equipped with cameras. However, the disadvantage of cameras is that they struggle to detect people who are hidden in shadows. To solve this problem, we mounted an array microphone on a UAV to detect the human voice as a means of finding people that cameras cannot. We also propose a search method that combines voice-based and camera-based human detection to compensate for their respective shortcomings. The rate and accuracy of human detection by the proposed method are assessed experimentally.


I. INTRODUCTION
In the event of a large-scale disaster such as an earthquake, it is expected that many people will go missing. In such situations, the survival rate is related directly to how long it takes to rescue the victims. However, because the number of people who can engage in rescue operations is limited, it is important to have an efficient means of obtaining information about victims. Against this background, human search systems using robots have been actively investigated in recent years in order to improve the efficiency of rescue activities [1]-[3]. In particular, unmanned aerial vehicles (UAVs) that can search for people from the air have been developed for situations in which rescuers cannot access damaged locations directly [4]-[10]. Such detection helps rescuers to understand the situation at the disaster site, thereby facilitating rescue operations. Although such robots now make it possible to detect people visually, a drawback of this search method is that it remains difficult to detect people who are either in camera blind spots or hidden in shadows.
Therefore, to detect people more reliably, we have addressed these problems by using a UAV equipped with not only a camera for visual information but also a microphone for audio information. Fig. 1 shows the system schematically. A UAV equipped with a loudspeaker and a microphone hovers over the disaster site and broadcasts an audio request for a response from anyone below. The microphone detects the voice of anyone who responds, thereby determining whether there is anyone there who requires rescuing.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenming Cao.
However, a hovering UAV and a microphone are highly incompatible. First, the microphone picks up the sound of the nearby UAV propellers, which obscures the person's voice. Second, the farther away the person, the fainter their voice. In this paper, to solve these two problems, we extract only the human voice by applying sound-source separation processing to the recorded mixture of propeller sound and human voices. Sound-source separation with an array microphone involves sound-source localization, which estimates the direction of a human voice. We use this localization result to point the UAV-mounted camera toward the sound source to photograph the person whose voice was detected. Fig. 2 shows the relationship between the camera and the microphone. This is an operational image of a PlayStation Eye, in which the array microphone and camera used in this study are integrated. By simultaneously performing human detection using a microphone and a camera on a single UAV, we constructed a system with high detection accuracy that takes advantage of both voice recognition and image processing, as shown in Fig. 3. In the proposed system, a person is judged to be present even when only their voice is acquired. In addition, the on-board camera can be used to detect people in the same way as the UAV-mounted camera-based human detection systems in the literature.

II. RELATED WORK
In this section, we describe previous research into the two main human search technologies used in the present research, namely, voice processing and image processing.
For human search by voice processing, we previously studied sound recording using a unidirectional microphone mounted on a UAV and detecting people from voice information [11]. We tried using digital-filter voice processing to remove only noise from the mixed sound of voices and propellers, but it was difficult to remove only the propeller sound without losing the voice sound. Another disadvantage was that the detection accuracy dropped sharply with distance from the microphone.
In recent years, products such as smart speakers, cars, and robots that use voice recognition technology have been developed as systems for detecting human voices using microphones. These products also have the problem of ambient sound being picked up by the microphone and interfering with speech recognition. Therefore, such products use sound-source separation technology in the form of an array microphone equipped with multiple microphones. Sound-source separation technology estimates the direction of the sound source based on the sound pressure and time difference of the sound picked up by each microphone, and then separates the mixed sound, including the surrounding noise, picked up by the array microphone. Nakadai et al. [12] researched and developed the HARK system, which realizes sound-source localization, sound-source separation, and voice recognition [13].
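The role of the time difference in direction estimation can be illustrated with a minimal sketch for a single pair of microphones under a far-field (plane-wave) assumption. This is not the method used by HARK or by any specific product; the microphone spacing, the speed of sound, and the function name are assumptions chosen only for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def direction_from_time_difference(delay_s, mic_spacing_m):
    """Far-field arrival angle in degrees, measured from the array broadside.

    delay_s is the arrival-time difference between the two microphones and
    mic_spacing_m is their separation (both hypothetical inputs).
    """
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against noise and rounding
    return math.degrees(math.asin(ratio))


# Hypothetical 6 cm spacing; the delay is synthesized for a source at 30 degrees.
spacing = 0.06
delay = spacing * math.sin(math.radians(30.0)) / SPEED_OF_SOUND
angle = direction_from_time_difference(delay, spacing)
print(round(angle, 3))
```

With more than two microphones, several such pairwise cues can be combined, which is one reason an array gives a more robust direction estimate than a single pair.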
Techniques for using image processing to search for people or objects in captured images have been actively studied. In particular, object detection methods based on deep-learning networks such as Faster R-CNN [14] and SSD [15] have been rapidly developed in recent years. One such object detection method is YOLO v3, developed by Redmon and Farhadi [16], [17]. A problem with object detection based on a deep-learning network has been that the recognition accuracy for small objects in an image is poor. YOLO v3 uses a concept similar to feature pyramid networks [18] to extract features at three different scales and predict objects, thereby improving the recognition accuracy for small objects. This property suits UAVs well because it extends the distance at which targets can be detected, and YOLO v3 has been used in research on detecting objects from UAV-mounted cameras [19].
Based on these previous studies, we have built a UAV equipped with an array microphone and a human detection system based on sound-source separation, and we have integrated camera-based human detection as a search aid to build a system with a high detection rate. However, in this application, the distance from the UAV to the search target must be considered. The shorter the distance, the easier it is to pick up sound but the narrower the range that the camera can capture. Therefore, we use the direction information obtained at the time of sound-source localization so that when a person is detected by their voice, we can aim the camera toward them to achieve compatibility between the two systems.
In summary, human search can be performed with UAV-mounted image processing and voice processing as individual systems. Both approaches are weak at finding people when the UAV flies at higher altitudes: the target person appears smaller in images from the UAV-mounted camera, and the person's voice can no longer be captured by the UAV-mounted microphone. These factors sharply reduce the detection accuracy of both approaches. By contrast, both approaches show a comparatively better detection rate at lower UAV altitudes such as 3 m. Even then, image-processing-based systems can only find people who appear in the images from the UAV-mounted camera. As a solution to this issue, we propose a UAV-mounted voice-processing-based human search system. Finally, by combining the two types of human search system, we expect to improve the human detection performance.

III. SYSTEM OVERVIEW
We have been developing different types of UAV systems for applications in indoor mapping [20] and autonomous flight [21]-[28]. Fig. 4 shows the UAV produced in this study. It is equipped with a loudspeaker to request responses from people, an array microphone to acquire sound, a servomotor to move the camera, and a small lightweight computer to control the aforementioned items. The UAV is also equipped with a Raspberry Pi single-board computer, but the voice and image processing, which is the core of this research, was performed on a separate host computer because the Raspberry Pi had insufficient processing capability. The PlayStation Eye shown in Fig. 5 served as both the array microphone and camera mounted on the UAV. A PlayStation Eye is an integrated camera and array microphone, the latter having four microphones arranged horizontally to collect sound in four channels. Meanwhile, the camera has a maximum resolution of 640 × 480 pixels and an angle of view of 56°. Because the camera must point toward the sound source, the camera is attached to a servomotor as shown in Fig. 5.
Fig. 6 shows an overview of the constructed human detection system. First, sound data recorded by the array microphone are sent to the host computer, and sound-source separation processing is performed through sound-source localization. Next, voice recognition is applied to the separated human-voice data. In this way, any words spoken by a person are detected, thereby detecting the presence or absence of a person. The results of the aforementioned sound-source localization, voice recognition, and presence-absence detection are saved, and the sound-source direction information obtained by the sound-source localization is sent to the Raspberry Pi on the UAV. Using the sound-source direction information, the Raspberry Pi rotates the camera via the servomotor to point in the sound-source direction.
The resulting photographic data are also sent to the host computer, which looks for a person in the image. The above sequence constitutes the human detection process. As for the system execution time, the host computer receives an audio file recorded for 10 s and takes 20 s to process it; as such, the detection system completes one cycle every 30 s.
Fig. 7 shows a network diagram of the audio processing flow in this experiment. Sound-source separation is a technique for performing separation based on the input direction of a target sound source. Therefore, it is necessary to localize the direction of the target sound source as preprocessing. In Fig. 7, the sound-source localization node is named LocalizeMUSIC and uses the MUSIC (multiple signal classification) method, in which a transfer function from a sound source to each microphone is measured in advance and is used as prior information. If $h_m(\theta, \omega)$ is the transfer function in the frequency domain from a sound source in the $\theta$ direction to the $m$-th of the $M$ microphones, then the transfer function vector $H(\theta, \omega)$ can be expressed as
$$H(\theta, \omega) = [h_1(\theta, \omega), \cdots, h_M(\theta, \omega)]^T$$

IV. AUDIO PROCESSING BASED HUMAN DETECTION
A. SOUND-SOURCE SEPARATION AND LOCALIZATION BY HARK
Because this transfer function changes greatly depending on the environment, it is necessary to measure it for each experimental environment. In this study, we determined the transfer function experimentally, following the procedure for sound-source separation introduced by HARK [12]. HARK separates an input sound mixture into a set of separated sounds, and this process needs a set of transfer functions to estimate a separation matrix. These transfer functions are determined by real-world measurements between a microphone array and sound sources. Generally, when real measurements are used, recordings of time-stretched pulse (TSP) responses or impulse responses for each sound-source position are used to determine the transfer functions.
In this work, we use TSP responses. They can be recorded in two different ways, namely synchronized recording and unsynchronized recording. Unsynchronized recording can be conducted with most microphone array devices, so it is the approach used in this study. TSP responses are recorded by moving a sound source around a circle at 5° intervals while keeping the microphone array at a fixed point. The radius of the circle is set to 3 m.
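As a rough illustration of the measurement layout, the following sketch enumerates the source positions implied by a 5° step on a circle of radius 3 m. The coordinate convention (array at the origin, azimuth counted from the x-axis, full 360° coverage) and the function name are assumptions, since the text does not specify them.

```python
import math


def tsp_source_positions(radius_m=3.0, step_deg=5):
    """Return (azimuth_deg, x_m, y_m) for each TSP recording position.

    The array is assumed to sit at the origin; the azimuth origin and the
    full-circle coverage are illustrative assumptions.
    """
    positions = []
    for azimuth in range(0, 360, step_deg):
        rad = math.radians(azimuth)
        positions.append((azimuth, radius_m * math.cos(rad), radius_m * math.sin(rad)))
    return positions


positions = tsp_source_positions()
print(len(positions))  # number of directions for which a transfer function is measured
print(positions[1])
```

Under these assumptions a full circle yields 72 recording positions, one transfer function vector per direction.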
After collecting the TSP responses, the transfer functions are determined from the TSP response wav files with HARK-TOOL5 [12]. Next, an inter-channel correlation matrix of the input signal is calculated. First, an M-channel input signal is subjected to a fast Fourier transform (FFT) at the MultiFFT node in Fig. 7 to obtain a frequency-domain signal vector $X(\omega, f)$ as
$$X(\omega, f) = [X_1(\omega, f), \cdots, X_M(\omega, f)]^T$$
where $\omega$ is a frequency and $f$ is a frame. The inter-channel correlation matrix $R$ of the input signal $X$ is
$$R(\omega, f) = X(\omega, f) X^*(\omega, f)$$
where $X^*$ is the complex-conjugate transpose of $X$. However, in this system, to obtain a stable correlation matrix, an average of the correlation matrix in the time direction is used. Next, by performing eigenvalue decomposition on $R$, the M-dimensional space is decomposed into a signal subspace and other subspaces. In this paper, SEVD (Standard EigenValue Decomposition) is specified as the algorithm of MUSIC, so eigenvalue decomposition is performed as
$$R(\omega, f) = E(\omega, f) \Lambda(\omega, f) E^{-1}(\omega, f)$$
Here, the matrix $E = [e_1, e_2, \cdots, e_M]$ consists of mutually orthogonal eigenvectors, and $\Lambda$ is a diagonal matrix whose diagonal components are the eigenvalues corresponding to each eigenvector. The eigenvalues obtained by the eigenvalue decomposition are correlated with the power of the sound sources. Therefore, by taking the eigenvectors corresponding to the largest eigenvalues, only the subspace of the target sound with high power is selected. That is, if the number of sound sources to be considered is $N$, then $[e_1, \cdots, e_N]$ are the eigenvectors corresponding to the sound sources and $[e_{N+1}, \cdots, e_M]$ are the eigenvectors corresponding to noise. Based on the above, the MUSIC spectrum for sound-source localization is calculated as
$$P(\theta, \omega, f) = \frac{|H^*(\theta, \omega) H(\theta, \omega)|}{\sum_{i=N+1}^{M} |H^*(\theta, \omega) e_i(\omega, f)|}$$
where the denominator on the right-hand side is the inner product of the transfer function and the eigenvectors of the noise component in the input signal.
If the transfer function is a vector corresponding to the direction of the target sound source, then the denominator is zero because it is orthogonal to all the eigenvectors corresponding to noise in the input signal. Therefore, in theory, $P(\theta, \omega, f)$ becomes infinite in the direction of the sound source. In practice, it remains finite because of the influence of noise and the like, but the sound-source direction can nevertheless be obtained because a peak is observed.
The sound-source separation process can be formulated as follows. If the transfer function matrix between the sound sources and the microphones is $H(\omega)$ and the spectrum vector for multiple sound sources is $s(\omega)$, then the microphone input $x(\omega)$ is represented as
$$x(\omega) = H(\omega) s(\omega)$$
Using the separation matrix $W(\omega)$, the sound-source separation result $y(\omega)$ is expressed as
$$y(\omega) = W(\omega) x(\omega)$$
Here, the $W(\omega)$ for which $y(\omega) = s(\omega)$ is the ideal separation matrix. Because the environment in this study is assumed to include noise with high directivity, such as the UAV propeller sound, we used the GHDSS (Geometric High-order Decorrelation-based Source Separation) algorithm to obtain $W(\omega)$ [31]. This algorithm performs decorrelation among sound-source signals, forms directivity in the direction of the sound source, and is effective at suppressing a noise source that has high directivity. The GHDSS algorithm receives the multi-channel complex spectrum output from the MultiFFT node and the sound-source direction output from the LocalizeMUSIC node and then separates the mixed sounds for each input direction to the microphone.
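The localization steps described above (time-averaged correlation matrix, eigenvalue decomposition, and the ratio of the transfer-function norm to its projection onto the noise eigenvectors) can be sketched in NumPy for a single frequency bin. This is an illustrative toy and not the HARK implementation: the measured transfer functions are replaced by free-field steering vectors, and the microphone geometry, frequency, noise level, and all function names are assumptions.

```python
import numpy as np

C = 343.0                     # assumed speed of sound in m/s
MIC_X = np.arange(4) * 0.04   # hypothetical linear 4-microphone geometry in m


def steering_vector(theta_deg, freq_hz):
    """Free-field stand-in for the measured transfer function vector H(theta, omega)."""
    delays = MIC_X * np.sin(np.radians(theta_deg)) / C
    return np.exp(-2j * np.pi * freq_hz * delays)


def music_spectrum(frames, freq_hz, candidate_deg, n_sources=1):
    """MUSIC spectrum for one frequency bin; frames has shape (M, F)."""
    n_mics, n_frames = frames.shape
    # Correlation matrix X X^*, averaged over frames for stability.
    R = frames @ frames.conj().T / n_frames
    # eigh returns eigenvalues in ascending order, so the first
    # M - N eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(R)
    noise_vecs = eigvecs[:, : n_mics - n_sources]
    spectrum = []
    for theta in candidate_deg:
        h = steering_vector(theta, freq_hz)
        numerator = np.abs(h.conj() @ h)
        denominator = np.sum(np.abs(h.conj() @ noise_vecs))
        spectrum.append(numerator / denominator)
    return np.array(spectrum)


# Synthetic test signal: one source at 20 degrees plus weak sensor noise.
rng = np.random.default_rng(0)
freq = 2000.0
n_frames = 200
source = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)
sensor_noise = 0.1 * (rng.standard_normal((4, n_frames))
                      + 1j * rng.standard_normal((4, n_frames)))
frames = np.outer(steering_vector(20.0, freq), source) + sensor_noise

candidates = np.arange(-90, 91, 5)   # 5-degree search grid
spectrum = music_spectrum(frames, freq, candidates)
estimate = int(candidates[np.argmax(spectrum)])
print(estimate)
```

In a real system the spectrum would be accumulated over many frequency bins and the steering vectors replaced by the transfer functions measured from the TSP recordings; the sketch only shows why a peak appears in the source direction.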

B. HUMAN VOICE RECOGNITION
Next, the speech obtained after sound-source separation is sent to the speech-recognition node. In this research, we constructed a system that uses speech-recognition technology to distinguish between the human voice and other sounds such as noise [29]. Specifically, speech recognition is performed on the separated speech, and the result is output. In addition, a word list is created in advance, and the speech-recognition result is matched against it; a person is judged to be present when a match is found. This step is needed because the propeller sound is input even when no one is responding, so the propeller sound is processed as a human voice and an incorrect speech-recognition result is output. Because such an erroneous result is grammatically incorrect, matching against the word list excludes it from the recognition results. In addition, speech recognition generally has the problem that the first part of an utterance section cannot be detected well, and thus a recognition trigger called a magic word is required. However, the purpose of this research is to search for people, and the search target cannot be expected to utter a magic word. Therefore, we set up a system that can recognize speech without a magic word by notifying the search target that the utterance section has started. Specifically, the loudspeaker mounted on the UAV informs the person below that the recording interval has started. After that, recording is performed, and voice recognition is performed on the recorded voice data. Although this is a simple method, it reduces the rate of false recognition without impairing the usability of the dialogue between the person and the UAV.
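The word-list check can be pictured with the following minimal sketch. The words in the list, the tokenization, and the exact matching rule are assumptions made for illustration; the text states only that the recognition result is matched against a word list prepared in advance.

```python
# Hypothetical word list; the actual list used in the experiments is not given.
WORD_LIST = {"help", "here", "hello"}


def person_detected(recognized_text, word_list=WORD_LIST):
    """Return True when any recognized word appears in the prepared word list."""
    tokens = recognized_text.lower().split()
    return any(token.strip(".,!?") in word_list for token in tokens)


print(person_detected("Help me!"))   # a plausible spoken response
print(person_detected("zzz brr"))    # stand-in for a spurious result from propeller noise
```

Requiring a match against expected words is what lets the system discard spurious recognition output produced from propeller noise alone.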

C. MERGING WITH CAMERA BASED HUMAN DETECTION
This paper aims at detecting a person on the ground from a UAV in flight, and it is assumed that the UAV hovers at an altitude of at least 3 m above the ground. Because the scale of the target person in the captured image decreases with distance, it is necessary to consider the detection of small objects. Therefore, we adopted object detection using YOLO v3 as the method of human detection. As described above, YOLO v3 is an object detection method that can handle objects of different scales, which is generally difficult in object recognition.
YOLO v3 involves object detection based on a deep-learning network, and its detection accuracy depends on the quality of the input image and the training dataset. Therefore, we chose to use the COCO dataset [30], which is a large-scale object detection, segmentation, and captioning dataset that features rich annotations. In summary, in this study we detected a small photographed person by performing object detection using YOLO v3 trained on the COCO dataset.
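To give a feel for how quickly a person shrinks in the image with distance, the following sketch applies a simple pinhole-camera approximation using the camera figures stated earlier (640 pixels wide, 56° angle of view). Treating the 56° as the horizontal angle of view, the 0.5 m object width, and the function names are assumptions for illustration only.

```python
import math

IMAGE_WIDTH_PX = 640        # maximum horizontal resolution stated in the text
ANGLE_OF_VIEW_DEG = 56.0    # stated angle of view, assumed here to be horizontal


def ground_width_m(distance_m):
    """Width of the scene covered by the image at a given distance."""
    return 2.0 * distance_m * math.tan(math.radians(ANGLE_OF_VIEW_DEG / 2.0))


def apparent_width_px(object_width_m, distance_m):
    """Approximate width in pixels of an object facing the camera."""
    return IMAGE_WIDTH_PX * object_width_m / ground_width_m(distance_m)


# Hypothetical 0.5 m wide target at two distances from the camera.
for distance in (3.0, 5.0):
    print(distance,
          round(ground_width_m(distance), 2),
          round(apparent_width_px(0.5, distance)))
```

The covered width grows linearly with distance while the target's pixel width falls as its inverse, which is why a detector that copes with small objects matters at higher altitudes.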

V. EXPERIMENTAL EVALUATION
A. EXPERIMENTAL ENVIRONMENT
We performed three experiments on the proposed system and evaluated its performance. In the first experiment, we mounted the system on a UAV and measured the rate of human detection when a person spoke toward the array microphone. The measurement was performed with and without the UAV propellers turning, and the detection rates in the two cases were compared. In addition, the same experiment was performed with a single microphone mounted as the voice input device [11], and the results when voice recognition was performed without sound-source separation processing were also measured. For the measurement, we attempted speech recognition 20 times at each distance and checked whether it was possible to detect people. Note that the sound pressure of the voice is not constant and changes because of sound diffusion and reverberation. Therefore, when the distance is large, the sound pressure input to the array microphone becomes small, and localization may fail. When this occurred, we responded by lowering the sound-pressure threshold required for localization. In the second experiment, we used the sound-source localization results to calculate the accuracy of the sound-source direction. Finally, in the third experiment, we analyzed the performance of human detection when the proposed voice-based human detection method is combined with human detection by a UAV-mounted camera. Fig. 8 shows the measured data for the rate of human detection. For comparison with a conventional method [11], we also measured the results of human search using sound data recorded with a one-channel microphone with neither sound-source localization nor sound-source separation.

B. HUMAN DETECTION EVALUATION
In Fig. 8, the orange line and the blue line indicate the human detection performance of the proposed method and the conventional method, respectively, when there is no propeller sound in the environment. The yellow line and the grey line indicate the human detection results of the proposed method and the conventional method, respectively, when both the human voice and propeller sound are present. The latter experiments, with UAV sound present, were conducted with the real UAV in both indoor and outdoor environments. For these experiments, we used our developed UAV shown in Fig. 4. Here, 20 experiments were conducted in each of the indoor and outdoor environments while changing the distance between the microphone and the sound source (human voice) between approximately 1 m and 5 m, as shown in Fig. 8. During the outdoor experiments, the environment was noisy to some extent because of natural sounds such as wind. During the indoor experiments, by contrast, the environment did not include such natural sounds and contained only the UAV sound and the human voice.
We performed further analysis to confirm the effectiveness of the sound-source separation part and the voice recognition part of our proposal. For this analysis, we used experimental data acquired both indoors and outdoors. According to the analysis, the sound-source separation performance varies sharply with the distance between the microphone and the human voice, as shown in Fig. 9. By contrast, the average voice recognition rate after sound-source separation was 92.8%, and this value was almost constant and did not vary much with the distance between the microphone and the human voice.
The results of the conventional method show that when the sound is collected using only one channel, a distance of around 3 m is the limit of speech recognition even without propeller sound. This is because the sound diffuses with distance and becomes difficult to acquire. By contrast, although the accuracy decreased with distance when the array microphone was used, human detection was confirmed to be possible even in a noisy environment. Fig. 10 shows an image taken after rotating the camera based on the direction information obtained by sound-source localization. When source localization was performed for a voice uttered by a person 5 m away from the UAV's position, the direction of the input sound source was determined to be 20.023989° with respect to the array microphone. However, when the camera was moved to that angle, the person was not in the center of the image, as shown in Fig. 10. There was a deviation of 1.5 m from the center of the image, corresponding to an angular deviation of approximately 17°. This was attributed to the resolution of the array microphone. In this study, we used an array microphone comprising four microphones arranged horizontally. Because the accuracy of sound-source localization is correlated with the number of microphones mounted, we expect improved accuracy by mounting more microphones. However, because the purpose of this study was to capture the person in the image, the localization accuracy was confirmed to be sufficient. In addition, on a host computer with a Core i5-7500 processor and 8 GB of memory, the average detection time for one image file using YOLO v3 was around 6.
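The quoted angular deviation can be checked with simple trigonometry. The sketch below assumes that the 5 m is measured along the camera axis and that the 1.5 m offset is perpendicular to it; it then compares the result with half of the 56° angle of view stated for the camera. The function name is hypothetical.

```python
import math


def angular_offset_deg(lateral_offset_m, distance_m):
    """Angle between the camera axis and a target displaced sideways by lateral_offset_m."""
    return math.degrees(math.atan2(lateral_offset_m, distance_m))


offset = angular_offset_deg(1.5, 5.0)
half_view = 56.0 / 2.0   # half of the camera's stated angle of view
print(round(offset, 1), offset < half_view)
```

Under these assumptions the offset is about 16.7°, consistent with the approximately 17° reported, and it is smaller than the 28° half angle of view, which matches the observation that the person still appears in the image.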

D. EVALUATION OF COMBINED VOICE BASED AND IMAGE BASED HUMAN DETECTION
In the previous section, we discussed sound-source localization in conjunction with human detection by a UAV-mounted camera. In this section, we discuss human detection that combines the voice-based method (the proposed method) with an image-based method using the UAV-mounted camera. In these experiments, the UAV altitude was approximately 2 m to 5 m. We used YOLO v3 to detect humans in the images from the UAV-mounted camera [17]. According to the results, the human detection rate of YOLO v3 alone on the UAV-mounted camera images was 84.5%. Most missed detections occurred because the person did not appear in the camera images even though they were present in the environment. By contrast, the human detection rate when the two systems were combined was 93.3%. Thus, the combination gave a clear improvement in human detection. Furthermore, human detection using images from a UAV-mounted camera is difficult at higher altitudes; this problem could be mitigated to some extent by using a wide-angle, high-resolution camera.
VOLUME 8, 2020
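One simple way to picture the combination is a logical OR of the two detectors, as in the sketch below. The fusion rule is an assumption, since the text does not state exactly how the two results are merged, and the trial outcomes are toy values for illustration, not the experimental data.

```python
def combined_detection(voice_hit, camera_hit):
    """Assumed fusion rule: a person is reported if either detector fires."""
    return voice_hit or camera_hit


def detection_rate(outcomes):
    """Fraction of trials in which a person was detected."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy (voice_hit, camera_hit) outcomes, purely illustrative.
trials = [(True, True), (True, False), (False, True), (False, False)]

camera_rate = detection_rate(camera for _, camera in trials)
combined_rate = detection_rate(combined_detection(v, c) for v, c in trials)
print(camera_rate, combined_rate)
```

With an OR rule the combined rate can never fall below that of either detector alone, which is consistent with the direction of the improvement reported above, although false detections from either source would also accumulate.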

VI. CONCLUSION
In this paper, we proposed a human search system using a UAV with an array microphone. Specifically, we have constructed a system that detects the human voice by applying sound-source separation processing to the mixed sound of voice and propellers collected by the array microphone on the UAV. In addition to voice-based human detection, we used YOLO v3, a deep-learning-based object detection method, for human detection from images. In doing so, we obtained more information by integrating the two detection methods, thereby realizing a feasible human detection system.
In experiments to assess the performance of the proposed system, we focused on the accuracy of human detection and sound-source localization and the processing speed. We measured the accuracy of human detection with and without sound-source separation processing and obtained higher accuracy with separation. The localized position of the sound source deviated from the actual position in some cases, but the accuracy required for human detection using images was maintained.