Acoustic scene analysis systems have shown promise in monitoring and understanding the surrounding environment. For example, real-time acoustic scene analysis systems have been employed in security applications and environmental monitoring projects. In nearly all acoustic scene analysis applications, the human voice tends to be the most revealing cue, and speech analysis methods exist for exploiting gender, age, emotional state, speaker tracking, and content. Usually, an independent sensing approach is employed in which a single central sensor collects and interprets the incoming speech signals. With advances in hardware embedding technology, low-power sensors have been successfully used in distributed Wireless Sensor Networks (WSNs). For example, applications such as vehicle classification and shooter localization can benefit from distributed WSNs. However, implementing WSNs presents many challenges, including real-time operating constraints. The sensor nodes are powered by batteries and are subject to a number of resource constraints such as limited battery life, narrow bandwidth, small memory, drifting sampling rates, and insufficient throughput. In particular, the bandwidth and the volume of data memory can be problematic for applications involving wideband, time-varying signals because of the higher sampling rates required. A collaborative networking infrastructure can therefore be valuable, in which feature parameters are extracted at the node level and combined later to improve system performance. Hence, efficient acoustic scene analysis depends on several system aspects: i) the ability to embed several algorithms efficiently for real-time operation, ii) the ability to extract relevant information using resource-efficient sensing schemes, and iii) the ability to communicate in a resource-efficient manner.
In this paper, we consider a resource-efficient sensing scheme that primarily characterizes voice-related scenes for use in a real-time WSN. A sensing scheme is developed in which individual sensors first collect and analyze input signals locally and then work collaboratively with other sensors to improve estimates. A high-level overview of the proposed voice scene characterization is shown in Fig. 1. Here, an input signal is first classified into two categories, i.e., signal of interest and background noise. Algorithms for signal detection and noise level estimation were described in our previous work. If a signal is detected, select sensing tasks are executed depending upon network conditions or user control. Each sensing task at the sensor is capable of extracting parametric information that is then transmitted to the base station. At the base station, information collected from the various sensors is centrally processed to characterize voice scenes. For instance, once a signal that contains voice is detected, the number of speakers can be estimated in an area of interest. After estimating the number of speakers, gender and emotional-state estimation tasks are activated. Some of these tasks are carried out at the sensor level. Moreover, when a scene is declared a situation of interest, the base station can instruct individual sensors to transmit the parameters required for low-rate speech synthesis. This allows a human operator to listen to the acquired speech and then take appropriate action. The system can further instruct the individual sensors to perform only select sensing tasks, which involve extracting specific low-rate acoustic features that can be transmitted on demand. Therefore, the proposed collaborative sensing system operates at a lower data rate and helps relieve the network of data collisions.
Fig. 1. A high-level overview of the acoustic sensing algorithm for voice scene characterization.
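The detection-then-tasking flow of Fig. 1 can be sketched as below. The function names, the energy-threshold detection rule, and the task ordering are illustrative assumptions for exposition, not the paper's exact implementation.

```python
def detect_signal(frame_energy, noise_floor, margin=2.0):
    """Sensor-side stage: flag a frame as a signal of interest when its
    energy exceeds the estimated noise floor by a margin (assumed rule)."""
    return frame_energy > margin * noise_floor

def characterize_scene(frame_energies, noise_floor):
    """Base-station logic: activate the higher-level sensing tasks only
    after a signal of interest has been detected."""
    report = {"signal": False, "tasks": []}
    if any(detect_signal(e, noise_floor) for e in frame_energies):
        report["signal"] = True
        # Tasks activated in the order described in the text.
        report["tasks"] = ["num_speakers", "gender", "emotion", "voice_monitoring"]
    return report

print(characterize_scene([0.1, 0.2, 5.0], noise_floor=0.5))
```

A frame whose energy stays below the threshold leaves the network idle, which is what allows the system to operate at a low data rate.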
In all, the proposed system involves the following five sensing tasks: 1) speech discrimination, 2) number of speakers, 3) gender, 4) emotional state, and 5) voice monitoring. The parameters extracted for voice monitoring are associated with low-complexity vocoders such as the LPC-10 and the full-rate GSM. For several sensing tasks, we extract relevant acoustic features and employ an SVM-based classification algorithm. In our experiments, we develop and evaluate a series of acoustic sensing algorithms on real-time platforms such as the Crossbow sensor motes and the TI DSP board. These real-time implementations are deployed in security and surveillance simulations and evaluated over several sensing tasks.
The rest of the paper is organized as follows. Section II provides a description of the sensing tasks and their performance. In Section III, we specify the hardware implementation of the algorithm and its complexity and Section IV contains concluding remarks.
In this section, several sensing tasks are performed using the WSN. Sensor boards analyze speech signals, extract pertinent information, and transmit it to the base station. At the base station, a decision for each task is made using a classifier. We consider two classification algorithms: Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs). While the GMM characterizes the statistics of each class in a generative manner, the SVM constructs nonlinear decision boundaries that discriminate between the classes. The radial basis function kernel is chosen for the SVM. Their performance is analyzed for each sensing task in the following.
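For reference, the decision function of an RBF-kernel SVM has the form f(x) = Σᵢ αᵢyᵢ k(sᵢ, x) + b over the support vectors sᵢ. The sketch below evaluates this form directly; the support vectors, dual coefficients, and γ value are toy numbers, not trained parameters from the paper.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Decision value of a trained RBF-SVM; the sign selects the class."""
    return sum(a * rbf_kernel(s, x, gamma)
               for a, s in zip(dual_coefs, support_vectors)) + bias

# Toy example with hand-set support vectors (illustrative values only).
sv = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
coef = [1.0, -1.0]  # alpha_i * y_i for each support vector
print(svm_decision(np.array([0.1, 0.0]), sv, coef, bias=0.0) > 0)
```

In practice the kernel parameters and dual coefficients would come from training on the extracted acoustic features at the base station.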
A. Speech Discrimination
The speech discrimination algorithm is based on time-domain and frequency-domain acoustic features. These features include frame energy, normalized energy, band energy ratio, and tonality. The set of features is transmitted to the base station, where speech discrimination is performed using the SVM. Table I lists the confusion matrix for this simulation and for the hierarchical thresholds reported in our previous work. The confusion matrix entries are averages over 1000 trials in which voice frames and noise frames of 1-second duration were randomly chosen from the TIMIT and NOISEX-92 databases, respectively. The SVM outperformed the hierarchical-threshold approach, achieving 97.5% classification accuracy.
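Illustrative versions of the four discrimination features can be computed as follows. The exact definitions, analysis band, and the tonality measure (here a spectral-flatness-based proxy) are assumptions; the paper's feature definitions may differ.

```python
import numpy as np

def frame_features(frame, fs=8000, band=(300.0, 3400.0)):
    """Compute assumed forms of: frame energy, normalized energy,
    band energy ratio, and tonality for one analysis frame."""
    energy = np.sum(frame ** 2)                 # frame energy
    norm_energy = energy / len(frame)           # per-sample (normalized) energy
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_ratio = spec[in_band].sum() / (spec.sum() + 1e-12)  # band energy ratio
    # Tonality proxy: 1 - spectral flatness (geometric/arithmetic mean ratio);
    # near 1 for tonal (voiced) frames, lower for noise-like frames.
    flatness = np.exp(np.mean(np.log(spec + 1e-12))) / (np.mean(spec) + 1e-12)
    tonality = 1.0 - flatness
    return energy, norm_energy, band_ratio, tonality

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)                          # voice-like tonal frame
noise = np.random.default_rng(0).standard_normal(fs)        # noise frame
```

A tonal frame scores markedly higher on the tonality feature than a noise frame, which is the separation the SVM exploits.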
TABLE I Confusion Matrix for the Speech Discrimination
B. Estimating the Number of Speakers
The modulation spectrum is calculated by analyzing the intensity envelope of speech in the frequency domain. The input speech is sampled at 16 kHz and band-pass filtered (500 Hz to 2 kHz). The intensity envelope is extracted using the Hilbert transform and down-sampled to 80 Hz. For the modulation spectral analysis, 17 octave band-pass filter banks centered from 0.5 Hz to 8.5 Hz are applied and their outputs are averaged. Simulations were carried out using clean speech from the TIMIT database with additive white Gaussian noise. A data set of 420 speakers and 9 sentences was randomly chosen for 1- and 11-second intervals. Since the modulation pattern relies heavily on the temporal dynamics of speech, a longer-duration signal provides better performance. However, we chose a 1-second duration due to the limited memory of the hardware. Accordingly, the input signals are sampled at 2 kHz and band-pass filtered from 30 Hz to 1 kHz. As the number of speakers increases, the patterns become less prominent due to the lack of synchronization between the phases of the individual speech records. To better distinguish the number of speakers, we compute the following statistics from the modulation patterns: the cumulative sum, the maximum peak, and the average of the modulation pattern. These three features together provide some robustness when one of them incorrectly classifies the number of speakers. To characterize these parameters, we simulated the modulation patterns for 1-second intervals. Fig. 2 exhibits the fidelity of the parameterization in classifying the number of speakers. Under varying SNRs, the parameters are not well separated when the number of speakers exceeds five, especially for the maximum-peak parameter. Although the parameters are not highly robust for low SNRs and short durations, estimating the number of speakers works well for up to four speakers.
Fig. 2. The parameters of the modulation patterns for a 1-second interval at varying SNRs.
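The envelope-then-modulation-spectrum computation can be sketched as below, using the FFT form of the Hilbert transform. The octave filter-bank stage is simplified to a single modulation spectrum here, and the three summary statistics are computed directly from it; this is an assumed simplification of the paper's pipeline.

```python
import numpy as np

def analytic_envelope(x):
    """Intensity envelope via the discrete Hilbert transform (FFT method)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:n // 2] = 2.0
    h[n // 2] = 1.0 if n % 2 == 0 else 2.0
    return np.abs(np.fft.ifft(spec * h))

def modulation_features(x, fs, env_rate=80):
    """Assumed form of the three statistics: cumulative sum, maximum
    peak, and average of the modulation spectrum of the envelope."""
    env = analytic_envelope(x)
    step = fs // env_rate                 # decimate the envelope to ~80 Hz
    env = env[::step] - env[::step].mean()
    mod = np.abs(np.fft.rfft(env))        # modulation spectrum
    return mod.sum(), mod.max(), mod.mean()

# Amplitude-modulated test tone: carrier 200 Hz, 4 Hz syllabic-rate envelope.
fs = 800
t = np.arange(fs) / fs
x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 200 * t)
print(modulation_features(x, fs))
```

For a single speaker the modulation energy concentrates near the syllabic rate (a few Hz), giving a prominent peak; with overlapping speakers the envelopes add incoherently and the peak flattens, which is why the features degrade beyond a few speakers.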
The number of speakers was estimated over 100 trials in which speakers and speech phrases were randomly chosen from the TIMIT database. A GMM (3 mixtures) and an SVM were trained to classify between one and four speakers. The classification results are listed in Table II. The classification accuracy rates were 46.4% for the GMM and 52.1% for the SVM. Although both algorithms performed above 60% for a single speaker, the patterns obtained were not as useful for two and three speakers. To assess implementation aspects of this process, we developed and tested the software on the TI DSP board for real-time operation. Speech signals of 10 speakers were acquired in an office environment. In this real-time experiment, the classification accuracy rates were 44.4% for the GMM and 53.8% for the SVM. As illustrated in Table II, the approach has the potential to estimate the number of speakers without any prior information, although the results can be further improved for the two- and three-speaker scenarios.
TABLE II Confusion Matrix for the Estimated Number of Speakers
C. Gender and Emotional States
The gender and emotional-state analyses were performed using the following acoustic features: pitch and RASTA-PLP coefficients. These features are extracted at the sensor and transmitted to the base station, where the GMM and SVM were pre-trained for gender classification. The performance of both algorithms was evaluated using speech excerpts from the TIMIT database. The pitch and 14th-order RASTA-PLP parameters were obtained at SNRs ranging from 40 dB down to 20 dB. Table III lists the accuracy for the gender classification task. It can be observed that the SVM classifier shows a marginal improvement in performance over the GMM classifier.
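The pitch feature can be estimated at the sensor with a simple autocorrelation search, sketched below. This is a generic stand-in for whatever pitch tracker the implementation actually uses; the search range (60-400 Hz) is an assumption covering typical male and female voices.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch as the autocorrelation peak within a plausible
    voice range; a simplified stand-in for the pitch feature used
    alongside the RASTA-PLP coefficients."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag search bounds
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

# 0.1 s frame of a 200 Hz tone at 8 kHz; the estimate recovers 200 Hz.
fs = 8000
t = np.arange(800) / fs
frame = np.sin(2 * np.pi * 200 * t)
print(autocorr_pitch(frame, fs))  # → 200.0
```

Pitch separates genders reasonably well on its own (typical male and female ranges differ), while the RASTA-PLP coefficients carry the spectral-envelope cues needed for the emotion classes.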
TABLE III Confusion Matrix for Gender Classification on the TIMIT Database
A 24-mixture GMM and an SVM were trained using speech records from the German emotional speech database. The following four emotional states were considered for classification: angry, happy, sad, and neutral. Table IV presents the confusion matrix of emotional states; average classification accuracies of 43.8% and 53.9% were obtained using the GMM and SVM classifiers, respectively. However, we note that the classifiers often confuse the angry and neutral states.
TABLE IV Confusion Matrix of Emotional States
D. Voice Monitoring
A bimodal (voiced/unvoiced) source/filter speech synthesis model used in speech compression algorithms is employed for voice monitoring. For voiced frames, the source signal is modeled by an impulse train whose period is the pitch period; for unvoiced frames, the source signal is modeled as white noise. For both frame types, the filter is modeled using an auto-regressive (AR) model. Although a number of compression algorithms are used in voice communications, we chose the following two low-complexity vocoders: the 2.4 kbps LPC-10 and the 13 kbps full-rate GSM. The GSM coder has better voice quality (MOS: 3.5) than the LPC-10 (MOS: 2.5) and has reasonably low complexity (5 MIPS). Although the parameters obtained at the sensors can be combined to synthesize the speech segment, they can also be used individually for other purposes. For example, a speaker identification algorithm was developed using the LPC-10 vocoder. The voice activity detector ensures that only speech frames are considered for feature extraction. The following set of features is obtained from the LPC-10 vocoder: pitch and filter coefficients (including reflection and cepstrum parameters). Four speakers were each asked to speak the phrase 'hello' with identical loudness 10 feet away from the sensor in an office environment. An SVM was then trained using the set of features obtained from each of the four speakers. Next, a single sensor was used to identify the target speaker, and an identification accuracy of 83.2% was obtained. In addition, whenever a target speaker was identified, the corresponding speech was synthesized at the base station. From this real-time experiment, we were able to both monitor speech dialogues and identify the speaker.
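The bimodal source/filter synthesis can be sketched in a few lines. The AR coefficients below are toy values chosen to give a stable resonance, not LPC-10 parameters from real speech; the pitch period and frame length are likewise illustrative.

```python
import numpy as np

def synthesize(lpc, n_samples, voiced, pitch_period=64, seed=0):
    """Bimodal source/filter synthesis: an impulse train at the pitch
    period for voiced frames, white noise for unvoiced frames, passed
    through the all-pole filter 1 / (1 + a1*z^-1 + ... + ap*z^-p)."""
    if voiced:
        exc = np.zeros(n_samples)
        exc[::pitch_period] = 1.0            # impulse-train source
    else:
        exc = np.random.default_rng(seed).standard_normal(n_samples)
    out = np.zeros(n_samples)
    p = len(lpc)
    for n in range(n_samples):               # direct-form AR filtering
        out[n] = exc[n] - sum(lpc[k] * out[n - 1 - k]
                              for k in range(min(p, n)))
    return out

# Toy 2nd-order AR filter with a resonance (coefficients are illustrative).
voiced = synthesize([-1.3, 0.9], 256, voiced=True, pitch_period=50)
unvoiced = synthesize([-1.3, 0.9], 256, voiced=False)
```

Only the pitch, voicing decision, and filter coefficients need to cross the network per frame, which is what makes low-rate monitoring at 2.4 kbps feasible.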
As an initial prototype, we used the Crossbow MICAz™ and the TMS320C6713 DSK board, interfaced through the RS232 port. The MICAz™ has a transceiver operating in the 2.4 GHz band with a data rate of 250 kbps, along with external flash memory. The DSK board has a 32-bit stereo codec, and the sampling rate was set to 8 kHz. The constructed packet is passed from the DSP to the MICAz and finally transmitted to the base station. The incoming packets are received through the MIB600 board (10 Mbps Ethernet) and sent to the base-station PC, where real-time decisions are made. In Table V we list the number of clock cycles, execution time, and data rate for each sensing task. Note that the tasks requiring the largest number of cycles are gender and emotion classification, which use the pitch and RASTA-PLP coefficients. The sensing task that requires the fewest cycles is voice monitoring. It takes 385.64 milliseconds to execute all the sensing tasks on the DSP board.
TABLE V Number of Clock Cycles (in Thousands), Execution Time (msec), and Data Rate (Bytes) for Each Part of the Sensing Algorithm
Analysis and real-time implementation of acoustic sensing algorithms for voice scene analysis were presented. Resource-efficient sensing tasks were designed for use in a WSN, and the sensing system was implemented using the Crossbow™ motes and a TI DSP board. The acoustic feature parameters were extracted at the node level and transmitted to the base station, where the final decisions are made. The gender and emotional states were determined using the same infrastructure, and the classification performance was reported. Two low-bit-rate vocoders were implemented in real time for speech monitoring, and partial parameters from the LPC-10 were used for speaker identification.
Portions of this project have been funded by the ASU SenSIP center.