
A Sensor Network for Real-Time Acoustic Scene Analysis

Acoustic scene analysis can be used to extract relevant information in applications such as homeland security, surveillance, and environmental monitoring. Wireless sensor networks have been of particular interest for monitoring acoustic scenes. Sensors embedded in such a network typically operate under several constraints, such as low power and limited bandwidth. In this paper, we consider resource-efficient acoustic sensing tasks that extract and transmit relevant information to a central station where information assessment can be conducted. We propose a series of acoustic scene analysis tasks that are performed in a hierarchical manner. The hierarchical tasks include sound and speech discrimination, estimation of the number of speakers in the acquired sound, gender and emotional state classification, and ultimately voice monitoring and keyword spotting. We apply support vector machine and Gaussian mixture model algorithms to the extracted sound features. A real-time implementation is accomplished using Crossbow motes interfaced with a TI DSP board. A series of experiments is presented to characterize the performance of the algorithms under different conditions.



Introduction

ACOUSTIC scene analysis systems [1], [2] have shown promise in monitoring and understanding the surrounding environment. For example, real-time acoustic scene analysis systems have been employed in security applications [3] and environmental monitoring projects [4]. In nearly all acoustic scene analysis applications, the human voice tends to be the most revealing cue. Moreover, speech analysis methods exist for extracting gender, age, emotional state, speaker identity, and content. Usually, an independent sensing approach is employed, in which a single central sensor collects and interprets the incoming speech signals. With advances in hardware embedding technology, low-power sensors have been successfully used in distributed Wireless Sensor Networks (WSNs) [5]. For example, applications such as vehicle classification and gun shooter localization [6], [7] can benefit from distributed WSNs. However, implementation aspects of WSNs present many challenges, including real-time operating constraints. For example, the sensor nodes are powered by batteries and are subject to a number of resource constraints such as limited battery life, narrow bandwidth, small memory, drifting sampling rates, and insufficient throughput [5]. In particular, the bandwidth and the volume of data memory can be problematic for applications involving wideband, time-varying signals because of the higher sampling rates required. A collaborative networking infrastructure, in which feature parameters are extracted at the node level and combined later, can therefore be valuable for improving system performance. Hence, efficient acoustic scene analysis depends on several system aspects: i) the ability to embed several algorithms efficiently for real-time operation, ii) the ability to extract relevant information using resource-efficient sensing schemes, and iii) the ability to communicate in a resource-efficient manner.

In this paper, we consider a resource-efficient sensing scheme that primarily characterizes voice-related scenes for use in a real-time WSN. A sensing scheme is developed in which individual sensors first collect and analyze input signals locally and then work collaboratively with other sensors to improve estimates. A high-level overview of the proposed voice scene characterization is shown in Fig. 1. Here, an input signal is first classified into two categories, i.e., signal of interest and background noise. Algorithms for signal detection and noise level estimation were described in our previous work [8]. If a signal is detected, select sensing tasks are executed depending upon network conditions or user control. Each sensing task at the sensor is capable of extracting parametric information that is then transmitted to the base station. At the base station, information collected from various sensors is centrally processed to characterize voice scenes. For instance, once a signal that contains voice is detected, the number of speakers in an area of interest can be estimated. After estimating the number of speakers, gender and emotional state estimation tasks are activated. Some of these tasks are carried out at the sensor level. Moreover, when a scene is declared a situation of interest, the base station can instruct individual sensors to transmit the parameters required for low-rate speech synthesis. This allows a human operator to listen to the acquired speech and take appropriate action. The system can further instruct the individual sensors to perform only select sensing tasks, which involve extracting specific low-rate acoustic features that can be transmitted on demand. Therefore, the proposed collaborative sensing system operates at a lower data rate and helps relieve the network from data collisions.

Figure 1
Fig. 1. A high level overview of acoustic sensing algorithm for voice scene characterization.
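The hierarchical flow of Fig. 1 can be sketched as follows. This is a minimal illustration, not the deployed code: the energy-based detector and the per-task estimates are placeholders standing in for the algorithms described in the paper.

```python
# Hypothetical sketch of the hierarchical sensing flow in Fig. 1.
# The detector and task outputs below are placeholders, not the paper's code.

def characterize_scene(frame, energy_threshold=0.01):
    """Run sensing tasks hierarchically; stop early for background noise."""
    report = {"signal": False}
    # Stage 1: signal detection vs. background noise (placeholder energy test).
    if sum(x * x for x in frame) / len(frame) < energy_threshold:
        return report  # background noise: transmit nothing
    report["signal"] = True
    # Stage 2: speech discrimination (placeholder: assume speech here).
    report["speech"] = True
    if report["speech"]:
        # Stage 3+: tasks activated only once speech is confirmed.
        report["num_speakers"] = 1      # placeholder estimate
        report["gender"] = "unknown"    # placeholder estimate
        report["emotion"] = "neutral"   # placeholder estimate
    return report
```

The early return for background noise mirrors the design goal of transmitting nothing when no signal of interest is present.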

In all, the proposed system involves the following five sensing tasks: 1) speech discrimination, 2) number of speakers, 3) gender, 4) emotional state, and 5) voice monitoring. The parameters extracted for voice monitoring are associated with low-complexity vocoders such as the LPC-10 and the full-rate GSM. For several sensing tasks we extract relevant acoustic features and employ a support vector machine (SVM) based classification algorithm. In our experiments, we develop and evaluate a series of acoustic sensing algorithms on real-time platforms, namely the Crossbow sensor motes [9] and the TI DSP board [10]. These real-time implementations are deployed in security and surveillance simulations and evaluated over several sensing tasks.

The rest of the paper is organized as follows. Section II provides a description of the sensing tasks and their performance. In Section III, we specify the hardware implementation of the algorithm and its complexity and Section IV contains concluding remarks.


Sensing Tasks

In this section, several sensing tasks are performed using the WSN. Sensor boards analyze speech signals, extract pertinent information, and transmit it to the base station. At the base station, a decision for each task is made using a classifier. We consider two classification algorithms: Gaussian Mixture Models (GMM) [11] and Support Vector Machines (SVM) [12]. While the GMM characterizes the statistical information of each class in a generative manner, the SVM constructs nonlinear decision boundaries that discriminate between classes. The radial basis kernel function is chosen for the SVM. The performance of both classifiers is analyzed for each sensing task in the following.
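As a rough illustration of the two classifier families, the sketch below trains an RBF-kernel SVM and per-class GMMs on synthetic two-dimensional features. scikit-learn stands in for the paper's own implementation, and the synthetic data are not the paper's acoustic features.

```python
# Discriminative (SVM, RBF kernel) vs. generative (per-class GMM) classification
# on synthetic 2-D features; illustrative only, not the paper's implementation.
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, scale=0.5, size=(200, 2))  # class-0 "features"
X1 = rng.normal(loc=+1.0, scale=0.5, size=(200, 2))  # class-1 "features"
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# SVM with the radial basis kernel, as chosen in the paper.
svm = SVC(kernel="rbf").fit(X, y)

# One GMM per class; classify by maximum per-class log-likelihood.
gmms = [GaussianMixture(n_components=3, random_state=0).fit(X[y == c])
        for c in (0, 1)]

def gmm_predict(x):
    return int(np.argmax([g.score_samples(x) for g in gmms], axis=0)[0])

print(svm.predict([[1.0, 1.0]])[0], gmm_predict(np.array([[1.0, 1.0]])))
```

The GMM route models each class density separately, while the SVM only learns the boundary between them, which matches the contrast drawn above.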

A. Speech Discrimination

The speech discrimination algorithm is based on time-domain and frequency-domain acoustic features. These features include: frame energy, normalized energy, band energy ratio, and tonality [13]. The set of features is transmitted to the base station, where the speech discrimination is performed using the SVM. Table I lists the confusion matrix for this simulation and for the hierarchical thresholds reported in our previous work [13]. The confusion matrix entries are averages over 1000 trials in which voice frames and noise frames of 1 second duration were randomly chosen from the TIMIT [14] and NOISEX-92 [15] databases, respectively. The SVM showed better performance for speech discrimination than the hierarchical thresholds [13], exhibiting 97.5% classification accuracy.

Table 1
TABLE I Confusion Matrix for the Speech Discrimination
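A minimal sketch of the kinds of features listed above is given below. The exact feature definitions, the band edges, and the use of spectral flatness as a tonality measure are assumptions for illustration, not the paper's formulas.

```python
# Illustrative frame features: energy, normalized energy, band energy ratio,
# and a spectral-flatness proxy for tonality (assumed definitions).
import numpy as np

def frame_features(frame, fs=8000, band=(300.0, 3400.0)):
    frame = np.asarray(frame, dtype=float)
    energy = float(np.sum(frame ** 2))
    norm_energy = energy / len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    band_ratio = float(spec[in_band].sum() / (spec.sum() + 1e-12))
    # Spectral flatness: near 1 for noise-like frames, near 0 for tonal frames.
    flatness = float(np.exp(np.mean(np.log(spec + 1e-12))) / (np.mean(spec) + 1e-12))
    return energy, norm_energy, band_ratio, flatness

t = np.arange(8000) / 8000.0
tone = np.sin(2 * np.pi * 1000 * t)                      # in-band sinusoid
noise = np.random.default_rng(0).standard_normal(8000)   # noise-like frame
```

A tonal frame concentrates its energy in the speech band and has low flatness, while broadband noise spreads energy across the spectrum, which is the separation these features exploit.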

B. Estimating the Number of Speakers

The modulation spectrum is calculated by analyzing the intensity envelope of speech in the frequency domain [16]. The input speech is sampled at 16 kHz and band-pass filtered (500 Hz to 2 kHz). The intensity envelope is extracted using the Hilbert transform and down-sampled to 80 Hz. For the modulation spectral analysis, 17 octave band-pass filters with center frequencies from 0.5 Hz to 8.5 Hz are applied and their outputs are averaged. Simulations were carried out using clean speech from the TIMIT database with additive white Gaussian noise. A data set consisting of 420 speakers and 9 sentences was randomly chosen for 1- and 11-second intervals. Since the modulation pattern relies heavily on the temporal dynamics of speech, a longer duration signal provides better performance. However, we choose a 1 second duration due to the limited volume of memory in the hardware. Accordingly, the input signals are sampled at 2 kHz and band-pass filtered from 30 Hz to 1 kHz. As the number of speakers increases, the patterns become less prominent due to the lack of synchronization between the phases of the individual speech records [16]. To obtain better distinction in classifying the number of speakers, we compute the following statistics from the modulation patterns: the cumulative sum, the max peak, and the average of the modulation pattern. These three features together provide some robustness when one of them incorrectly classifies the number of speakers. In order to characterize these parameters, we simulated the modulation patterns for 1 second intervals. Fig. 2 exhibits the fidelity of the parameterization in classifying the number of speakers. Under varying SNRs, the parameters are not well separated when the number of speakers exceeds five, especially in the case of the max peak parameter. Although the parameters are not highly robust for low SNRs and short durations, estimating the number of speakers works well for up to 4 speakers.

Figure 2
Fig. 2. The parameters of the modulation patterns for 1 second interval at varying SNRs.
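The modulation-feature chain described above (Hilbert intensity envelope, down-sampling to 80 Hz, modulation spectrum, and the three summary statistics) can be sketched as follows. The 17-filter octave bank is simplified here to a single 0.5-8.5 Hz band, so this is illustrative only.

```python
# Sketch of the modulation features: cumulative sum, max peak, and average
# of the modulation pattern (octave filter bank simplified to one band).
import numpy as np
from scipy.signal import hilbert, decimate

def modulation_features(x, env_fs=80):
    """x: 1-second signal sampled at 2 kHz. Returns (cum. sum, max peak, average)."""
    envelope = np.abs(hilbert(x))              # intensity envelope
    env = decimate(decimate(envelope, 5), 5)   # 2 kHz -> 80 Hz (two stages)
    env = env - env.mean()                     # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
    band = spec[(freqs >= 0.5) & (freqs <= 8.5)]   # modulation band of interest
    return band.sum(), band.max(), band.mean()

fs = 2000
t = np.arange(fs) / fs
carrier = np.random.default_rng(0).standard_normal(fs)
modulated = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * carrier  # 4 Hz envelope
```

A signal with a strong syllabic-rate (here 4 Hz) envelope produces a larger max peak than unmodulated noise, which is the distinction the classifier relies on.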

The number of speakers was estimated using 100 trials in which speakers and speech phrases were randomly chosen from the TIMIT database. The GMM (3 mixtures) and the SVM were trained to classify from 1 to 4 speakers. The classification results are listed in Table II. The classification accuracy rates were 46.4% for the GMM and 52.1% for the SVM. Although both algorithms performed above 60% for a single speaker, the patterns obtained were not as useful for two and three speakers. To assess implementation aspects of this process, we developed and tested software on the TI DSP board for real-time operation. Speech signals of 10 speakers were acquired in an office environment. In the real-time experiment, the classification accuracy rates were 44.4% for the GMM and 53.8% for the SVM. As illustrated in Table II, the approach has the potential of estimating the number of speakers without any prior information, even though the results can be further improved for the two- and three-speaker scenarios.

Table 2
TABLE II Confusion Matrix for the Estimated Number of Speakers

C. Gender and Emotional States

The gender and emotional state analyses were performed using the following acoustic features: pitch and RASTA-PLP [13]. These features are extracted at the sensor and transmitted to the base station, where the GMM and SVM were pre-trained for gender classification. The performance of both algorithms was evaluated using speech excerpts from the TIMIT database. The pitch and 14th order RASTA-PLP parameters were obtained at different SNRs ranging from 40 dB down to 20 dB. In Table III, we list the accuracy for the gender classification task. It can be observed that the SVM classifier shows a marginal improvement in performance over the GMM classifier.

Table 3
TABLE III Confusion Matrix of the Gender for TIMIT DB
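As a hedged illustration of the pitch feature used for gender classification, the sketch below estimates pitch with a simple autocorrelation peak search. The paper's pitch tracker and RASTA-PLP front end are more elaborate; the search range here is an assumption.

```python
# Simple autocorrelation pitch estimator (illustrative, not the paper's tracker).
import numpy as np

def estimate_pitch(frame, fs=8000, fmin=60.0, fmax=400.0):
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)      # lag range for 60-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi + 1]))     # strongest peak in the range
    return fs / lag

t = np.arange(8000) / 8000.0
print(estimate_pitch(np.sin(2 * np.pi * 120 * t)))  # close to 120 Hz
```

Because average pitch differs markedly between male and female speakers, even this crude estimate carries useful information for the gender task.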

A 24-mixture GMM and an SVM were trained using speech records obtained from the German emotional speech database [17]. The following four emotional states were considered for classification: angry, happy, sad, and neutral. Table IV presents the confusion matrix of emotional states; average classification accuracies of 43.8% and 53.9% were obtained using the GMM and SVM classifiers, respectively. However, we note that the classifiers often confuse the angry and neutral states.

Table 4
TABLE IV Confusion Matrix of Emotion States

D. Voice Monitoring

A bimodal (voiced/unvoiced) source/filter speech synthesis model used in speech compression algorithms [18] is employed for voice monitoring. For voiced frames, the source signal is modeled by an impulse train whose period equals the pitch period; for unvoiced frames, the source signal is modeled as white noise. For both frame types, the filter is modeled using an auto-regressive (AR) model. Although there are a number of compression algorithms used in voice communications, we chose the following two low-complexity vocoders: the 2.4 kbps LPC-10 and the 13 kbps full-rate GSM [18]. The GSM coder has better voice quality (MOS: 3.5) than the LPC-10 (MOS: 2.5) and has reasonably low complexity (5 MIPS). Although the parameters obtained at the sensors can be combined to synthesize the speech segment, they can also be used individually for other purposes. For example, a speaker identification algorithm was developed using the LPC-10 vocoder. The voice activity detector ensures that only speech frames are considered for feature extraction. The following set of features is obtained from the LPC-10 vocoder: pitch and filter coefficients (including reflection and cepstrum parameters). Four speakers were each asked to speak the phrase 'hello' with identical loudness 10 feet away from the sensor in an office environment. An SVM was then trained using the set of features obtained from each of the four speakers. Next, a single sensor was used to identify the target speaker, and an identification accuracy of 83.2% was obtained. In addition, whenever a target speaker was identified, the corresponding speech was synthesized at the base station. From this real-time experiment, we were able both to monitor speech dialogues and to identify the speaker.
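The bimodal source/filter model above can be sketched as follows: an impulse train (voiced) or white noise (unvoiced) excites an all-pole AR synthesis filter. The AR coefficients, frame length, and pitch period below are illustrative stand-ins, not LPC-10 output.

```python
# Minimal bimodal source/filter synthesis: impulse-train or noise excitation
# through an all-pole (AR) filter. Coefficients are illustrative only.
import numpy as np
from scipy.signal import lfilter

def synthesize(voiced, ar_coeffs, n=800, pitch_period=80, gain=1.0):
    if voiced:
        source = np.zeros(n)
        source[::pitch_period] = 1.0   # impulse train at the pitch period
    else:
        source = np.random.default_rng(0).standard_normal(n)  # white noise
    # All-pole synthesis filter: y[k] = gain*x[k] - sum_i a[i] * y[k-i].
    return lfilter([gain], np.concatenate(([1.0], ar_coeffs)), source)

a = [-0.9]                             # single-pole example "vocal tract"
voiced_frame = synthesize(True, a)
unvoiced_frame = synthesize(False, a)
```

Switching only the excitation while keeping the same filter is what lets the vocoder represent both frame types with one compact parameter set.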



Hardware Implementation

As an initial prototype, we used the Crossbow MICAz™ and the TMS320C6713 DSK board, interfaced through the RS232 port. The MICAz™ has a transceiver that operates in the 2.4 GHz band with a data rate of 250 kbps, and external flash memory. The DSK board has a 32-bit stereo codec, and the sampling rate was set at 8 kHz. The constructed packet is passed from the DSP to the MICAz and finally transmitted to the base station. The incoming packets are received through the MIB600 board (10 Mbps Ethernet) and sent to the base station PC, where real-time decisions are made. In Table V we list the number of clock cycles, execution time, and data rate for each sensing task performed. Note that the tasks requiring the largest number of cycles are gender and emotion classification, which use the pitch and RASTA-PLP coefficients. The sensing task that requires the fewest cycles is voice monitoring. It takes 385.64 milliseconds to execute all the sensing tasks on the DSP board.

Table 5
TABLE V Number of Clock Cycles (in Thousands, msec) and Data Rate (bytes) for Each Part of the Sensing Algorithm


Conclusions

Analysis and real-time implementation of acoustic sensing algorithms for voice scene analysis were presented. Resource-efficient sensing tasks were designed for use in a WSN. The sensing system was implemented using the Crossbow™ motes and a TI DSP board. The acoustic feature parameters were extracted at the node level and transmitted to the base station, where the final decisions were made. The gender and emotional states were determined using the same infrastructure, and the classification performance was reported.

Two low-bit-rate vocoders were implemented in real time for speech monitoring, and partial parameter sets from the LPC-10 were used for speaker identification.


Acknowledgment

Portions of this project have been funded by the ASU SenSIP center.


Homin Kwon, Harish Krishnamoorthi, Visar Berisha, and Andreas Spanias are with the Department of Electrical Engineering, SenSIP Center, Arizona State University, AZ 85287-5706, USA.


References

1. P. Georgiou, "Multimodal cognitive spaces: Implicit user monitoring," 1st Cyprus Workshop on Signal Processing and Informatics, Nicosia, Cyprus, Jul. 2008.

2. V. Berisha, H. Kwon, and A. Spanias, "Real-time collaborative monitoring in wireless sensor networks," Proc. ICASSP, 2006, pp. 1120–1123.

3. N. Priyantha, A. Chakraborty, and H. Balakrishnan, "The cricket location support system," Proc. MobiCom 2000, ACM, Aug. 2000.

4. A. Mainwaring et al., "Wireless sensor networks for habitat monitoring," WSNA, Atlanta, GA, Sep. 2002.

5. J. Polastre, "The mote revolution: Low power wireless sensor networks," Proc. 16th Symposium on High Performance Chips (HotChips), 2004.

6. M. F. Duarte and Y. H. Hu, "Vehicle classification in distributed sensor networks," J. Parallel Distrib. Comput., vol. 64, pp. 826–838, 2004.

7. M. Maroti, G. Simon, A. Ledeczi, and J. Sztipanovits, "Shooter localization in urban terrain," IEEE Computer, vol. 37, no. 8, pp. 60–61, 2004.

8. V. Berisha, H. Kwon, and A. Spanias, "Real-time implementation of a distributed voice activity detector," Fourth IEEE Workshop on Sensor Array and Multichannel Processing, 2006, pp. 659–662.

9. Crossbow Technology Inc.

10. Texas Instruments Inc., TMS320C6713 DSP Starter Kit (DSK).

11. J. Bilmes, "Tutorial on EM algorithm and its application to estimation of Gaussian mixtures and HMM," UC Berkeley, ICSI-TR-97-021, 1997.

12. C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121–167, 1998.

13. H. Kwon, V. Berisha, and A. Spanias, "Real-time sensing and acoustic scene characterization for security applications," 3rd International Symposium on Wireless Pervasive Computing (ISWPC 2008), 2008, pp. 755–758.

14. J. S. Garofolo et al., The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.

16. T. Arai, "Estimating number of speakers by the modulation characteristics of speech," Proc. ICASSP, 2003, vol. 2, pp. 197–200.

17. F. Burkhardt et al., "A database of German emotional speech," Interspeech, 2005, pp. 1517–1520.

18. A. Spanias, "Speech coding: A tutorial review," Proceedings of the IEEE, vol. 82, pp. 1541–1582, 1994.

