C-AVDI: Compressive Measurement-Based Acoustic Vehicle Detection and Identification

As society grows ever more interconnected, the need for sophisticated signal processing and data analysis techniques becomes increasingly apparent. This is particularly true in the field of intelligent transportation systems (ITSs), where various sensing applications generate data at an exponential rate. In this paper, we present C-AVDI, a compressive measurement-based acoustic vehicle detection and identification architecture capable of extracting information from vehicle audio signals while sampling at sub-Nyquist rates. In addition, we further reduce the overall complexity by performing any necessary signal filtering during the acquisition process, removing the need for a separate filtering stage in the system’s front-end. Our results obtained from data collected under a range of weather conditions present an accuracy of 80% with a back-end analog-to-digital converter (ADC) sample rate of 3 kHz, with initial results from a microcontroller (MCU) implementation of our proposed system presenting an accuracy of 72%.


I. INTRODUCTION
In recent years, we have seen a rise in the development and adoption of intelligent transportation systems (ITS) technology. Key applications such as traffic flow control, navigation systems, and road safety management are growing ever more sophisticated and gaining increasingly widespread use. These improvements, however, come at a cost: the increasing amount of data created, processed, and used requires expensive, power-intensive hardware to be stored, accessed, and leveraged effectively. Particularly important in ITSs is the vehicle detection and identification (VDI) process, a key functionality underpinning a significant number of applications such as traffic flow and congestion management, electronic toll collection (ETC), and transportation infrastructure monitoring. Lowering the cost and computational requirements associated with the VDI process will have a direct effect on the overall cost and complexity of many ITS applications.
Various low-cost, low-complexity VDI techniques have been proposed, with acoustic vehicle detection and identification (AVDI) in particular being the subject of wide-ranging research [1]–[15]. The low price and simple installation process of AVDI systems make them an attractive alternative to other more expensive and difficult-to-install VDI systems, such as video camera-, radar-, or induction loop coil-based systems. While the installation costs associated with AVDI systems are generally low, the subsequent analysis and leveraging of data is often relatively costly in terms of computational requirements, making the use of such systems in power-critical applications difficult. In current AVDI systems, the computational costs are incurred due to processes occurring in either the acquisition and preprocessing stage, as in [2] and [3] (successive discrete Fourier transforms (DFTs) or discrete wavelet transforms (DWTs)), or the classification stage, as in [4] and [5] (use of a multilayer perceptron (MLP) and an artificial neural network (ANN), respectively).
In a bid to reduce the computational costs associated with AVDI systems, we proposed in our previous work [1] an initial attempt at creating a compressive measurement-based acoustic vehicle sensing architecture, capable of detecting and identifying vehicles while sampling at sub-Nyquist rates. By leveraging certain key properties of compressive sensing (CS), a technique first presented in [16] and [17], the system was able to detect and identify vehicles with an accuracy of 83% for a back-end sample rate of 3 kHz.
The goal of this paper is to present a low-cost, low-complexity alternative to existing AVDI sensing systems that can be implemented on a microcontroller (MCU) and whose performance is comparable to that of currently available systems. This is achieved first and foremost by reducing the sample rate at which the vehicle sounds are acquired. Indeed, the biggest limitation of MCUs is the amount of available volatile and nonvolatile memory, and while this is not a problem when working with short signals such as in [18], it makes it impossible to use MCUs in AVDI applications where the signal length is typically on the order of a few seconds. By reducing the number of samples, we lower the memory requirements for both the implementation of the classifier (nonvolatile memory) and the feature extraction process (volatile memory).
We seek to expand and build upon the results obtained by our proof-of-concept system in [1] in a number of ways.
First, we remove the front-end filtering stage and instead filter the input signal directly during the acquisition process using a spectrally tailored bipolar pseudorandom sequence. This reduces the number of components in the system and allows the filtering parameters to be adjusted by modifying the spectrum of the sequence rather than the hardware. Second, we test the system under adverse weather conditions, and observe their effects on the detection and identification performance. Third, we implement the feature extraction and classification processes on an MCU, demonstrating the reduction in both cost and computational complexity presented by our newly proposed system.
Our main contributions are as follows:
• We propose a compressive measurement-based AVDI (C-AVDI) architecture capable of operating at sub-Nyquist rates. Leveraging the inherent structure of vehicle sound signals enables us to reduce the number of samples required to detect and identify passing vehicles by a factor of 16 while maintaining an accuracy comparable to those of existing systems.
• We develop a method to simultaneously sample and filter incoming signals using spectrally shaped bipolar sequences generated from a pair of Markov chains. The shape of these sequences' spectra is controlled by varying the Markov chain length and transition probability. In our paper, we use the term filtering to refer to both the attenuation and amplification of frequencies present in a signal.
• We demonstrate the viability of C-AVDI as a low-cost, low-complexity sensing architecture by benchmarking our proposed system against our previous work and prototyping an initial MCU-based implementation of the system's feature extraction and classification processes.
Finally, while the purpose of our paper is to design a C-AVDI system for ITS applications, the underlying concepts can be applied to a wide range of sensing and machine learning (ML) applications. As highlighted in [19] and [20], if we are to ensure the continued sustainability and accessibility of ITS, as well as that of ML and artificial intelligence (AI) in general, then it is necessary to reduce the financial and environmental impact of these technologies. We believe that the techniques outlined in this paper can help contribute towards this goal.
The remainder of this paper is structured as follows: we examine the existing literature in Section II, describe the theory and properties of CS in Section III, and present our proposed system in Section IV. The system evaluation results are shown in Section V and discussed in Section VI.

II. RELATED WORK
There is a wide variety of existing research that uses supervised learning techniques to detect and identify vehicles using features extracted from their acoustic signatures.
The frequency-domain features extracted from vehicle audio signals are used with a support vector machine (SVM) classifier to identify and classify vehicles and their associated parameters in [6], and used to detect vehicles for the purpose of collision avoidance in non-line-of-sight situations in [7].
Mel-frequency cepstral coefficients (MFCCs) are used in conjunction with ML or deep learning (DL) as features in a number of existing AVDI systems: in [8] they are used with a modified MLP, in [5] they are extracted from a specific high energy audio region and used with an ANN and k-nearest neighbors (KNN) classifier, and in [9] they are used in a feature set containing the pitch class profile (PCP) and short-term energy (STE) of vehicle audio signals in a hybrid convolutional neural network (CNN) containing a long short-term memory (LSTM) layer.
In [4], Göksu presents a system capable of analyzing the acoustic signatures of vehicles independently of any changes in engine sound. Using wavelet packet decomposition (WPD) and an MLP classifier, the system obtains engine speed-independent features from the acoustic signals of passing vehicles. On the other hand, the system in [10] identifies different vehicles based on the sound of their engines using modulated per-channel energy normalization (Mod-PCEN) features in tandem with a Siamese neural network (SNN).
All of the aforementioned AVDI systems share a common, or very similar, goal and basic approach, but differ in their implementation, applications, and performance. While these systems generally present good performance metrics, their use of computationally intensive input signal processing or of a complex supervised learning method offsets the improvements in system cost, complexity, efficiency, and flexibility.
We have proposed several acoustic vehicle detection systems in our previous work.
In [11], we present a sequential acoustic vehicle detector (SAVeD) that operates by fitting S-curves generated using generalized cross-correlation (GCC) to points drawn on a sound map using an estimation method based on random sample consensus (RANSAC). This system presents an F-measure of 83%. An initial attempt at expanding this method to include multilane detection was proposed in [12].
In [2], we design a stereo microphone-based AVDI system (SMBAS) capable of identifying passing vehicles through the use of features extracted from the frequency-domain representation of the vehicles' sound signatures using successive short-time Fourier transforms (STFTs). The signals obtained by a pair of microphones are time-aligned and combined, creating an emphasized sound signal from which features are drawn and used in an SVM classifier. The time alignment and combination process improves the system estimation accuracy, particularly when faced with simultaneously and successively passing vehicles, resulting in a system accuracy of 95%.
Both of these systems present good performance metrics, but again the computational cost associated with the two of them makes low-power, embedded applications difficult.
In [13], we propose an ultra-low power vehicle detector (ULP-VD) capable of detecting passing vehicles with minimal computational cost using logistic regression (LR). This system, however, is only able to detect the presence of passing vehicles and requires an additional stage to identify them.
Most recently, in [1], we proposed an initial attempt at creating a CS-based AVDI system capable of detecting and identifying vehicles while sampling at sub-Nyquist rates. By combining a more traditional sub-Nyquist sampling architecture with a tailored analog front-end filtering section, the system is able to identify and detect vehicles with an accuracy of 83%, and a back-end sampling rate of 3 kHz using a random forest (RF) classifier. The front-end filtering section is an integral part of the system; however, its implementation as a separate stage requires the use of additional components, increasing deployment and implementation costs.
The main characteristics of these AVDI systems are summarized in Table 1. To the best of our knowledge, there is no current AVDI system capable of performing both detection and identification on an MCU.
Sub-Nyquist signal processing using the measurements obtained during the CS process has been explored as an alternative to traditional digital signal processing in a range of existing work.
TABLE 1. Main characteristics of existing AVDI systems.

| Reference | Signal processing / features | Detection / classification method |
|---|---|---|
| [1] | CS | RF |
| [2] | STFT | SVM |
| [4] | WPD | MLP |
| [5] | MFCC | KNN/ANN |
| [6] | DFT | SVM |
| [7] | STFT | SVM |
| [8] | MFCC | MLP |
| [9] | MFCC/PCP/STE | CNN/LSTM |
| [10] | Mod-PCEN | SNN |
| [11] | GCC | RANSAC |
| [12] | GCC | RANSAC |
| [13] | DWT | LR |

The authors of [21] present a methodology in which signal processing is performed directly on the compressive measurements obtained during the CS process. The paper demonstrates that filtering, detection and classification can be performed directly on the compressive measurements without recovering the signal beforehand. Similarly, the authors of [22] propose a multiband compressive signal processing architecture in which information acquired at sub-Nyquist rates from different frequency bands is used in a range of applications without prior signal reconstruction. There is also existing research that focuses on the creation of sampling strategies used to optimize the performance of CS-based systems.
Both [23] and [24] propose a method for shaping the spectra of the pseudorandom bipolar sequences used in various CS architectures to improve signal reconstruction performance. The first proposes a Markov chain-based method inspired by run-length limited (RLL) sequences to both shape the spectrum of the bipolar sequence and limit the rate at which it switches polarity. This paper lays the groundwork for subsequent research, but does not fully explore the filtering effects of such a sequence in the context of CS-based sensing, instead focusing on optimizing signal reconstruction. The second proposes a method based on convex relaxation to create a binary sequence whose spectrum resembles that of a notch filter. While this method is able to produce sequences with well-defined spectra, the complexity associated with this method makes it unsuitable for low-power applications.

A. BASICS OF COMPRESSIVE SENSING
First presented in [16] and [17], CS is a method to efficiently sample sparse or compressible signals (a signal is called sparse if it contains only a few nonzero components compared to its total length, and is called compressible if it contains a large number of nonzero components, but only a few of significant magnitudes), enabling us to directly acquire the information of interest in a given signal at sub-Nyquist rates rather than sample the signal at the Nyquist rate and subsequently discard unwanted information.
Let us begin by defining a sparse signal $x \in \mathbb{R}^N$ of length $T_s$ whose highest frequency component is $W/2$ Hz and with $N = W T_s$. This signal can be expressed as a combination of discrete coefficients $\alpha \in \mathbb{C}^N$ and vectors $\psi_n$ that form the columns of an orthonormal basis matrix $\Psi \in \mathbb{C}^{N \times N}$ for a given time window:

$$x = \sum_{n=0}^{N-1} \alpha_n \psi_n = \Psi\alpha, \tag{1}$$

with the coefficients computed as $\alpha_n = \langle x, \psi_n \rangle$. More often than not, $x$ itself is not sparse and instead has a sparse representation in $\Psi$. Given prior knowledge of $\Psi$, we need only obtain the information contained in the sparse coefficient vector $\alpha$ to be able to reconstruct the original signal. This information is obtained by drawing a set of compressive samples $y \in \mathbb{R}^M$ from the original signal, where $M \ll N$. The CS acquisition process can be described mathematically as:

$$y = \Phi x, \tag{2}$$

where $\Phi \in \mathbb{R}^{M \times N}$ represents the sequence of sampling operations performed on $x$.
Additionally, we define the reconstruction matrix as:

$$\Theta = \Phi\Psi \in \mathbb{C}^{M \times N}. \tag{3}$$

From (1) and (2):

$$y = \Phi\Psi\alpha = \Theta\alpha. \tag{4}$$

The sparse coefficient vector, and thus the original signal, can be recovered from the compressive measurements by solving the $\ell_1$ minimization problem:

$$\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad y = \Theta\alpha, \tag{5}$$

where $\hat{\alpha}$ is the estimated coefficient vector and $\|\alpha\|_1$ is the $\ell_1$ norm (sum of the absolute vector values) of $\alpha$.
For complete recovery of $x$, it is necessary to design $\Theta$ in such a way that it satisfies the incoherence and RIP (restricted isometry property) conditions outlined in [25], thereby ensuring that all the relevant information present in $x$ is preserved in the measurements $y$. In [26], it is stated that a matrix such as $\Theta$ in (4) satisfies these conditions with overwhelming probability if $\Psi$ is an orthonormal basis, $\Phi$ is drawn randomly from a suitable distribution such as the Gaussian distribution, and if the number of measurements $M$ is higher than a lower bound defined as:

$$M \geq S K \log\left(\frac{N}{K}\right), \tag{6}$$

where $K$ is the sparsity level of the input signal and $S$ is a positive constant [27].
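The acquisition and recovery steps described above can be illustrated with a small numerical sketch. It is not the solver used in this paper: it assumes a real-valued signal that is sparse in the identity basis (so $\Theta = \Phi$), a Gaussian measurement matrix, and SciPy's `linprog` as a generic way of solving the $\ell_1$ minimization problem. The helper name `basis_pursuit` and the sizes `N`, `M`, `K` are illustrative choices only.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(theta, y):
    """Solve min ||alpha||_1 subject to theta @ alpha = y as a linear program."""
    n = theta.shape[1]
    # Write alpha = pos - neg with pos, neg >= 0, so that ||alpha||_1 = sum(pos + neg).
    cost = np.ones(2 * n)
    a_eq = np.hstack([theta, -theta])
    result = linprog(cost, A_eq=a_eq, b_eq=y, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n] - result.x[n:]


rng = np.random.default_rng(0)
N, M, K = 64, 32, 4  # illustrative sizes, not the values used in the paper

# K-sparse coefficient vector with magnitudes bounded away from zero.
alpha = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
alpha[support] = rng.choice([-1.0, 1.0], size=K) * rng.uniform(1.0, 2.0, size=K)

# Gaussian measurement matrix (assumed here; Psi is taken as the identity).
theta = rng.normal(size=(M, N)) / np.sqrt(M)
y = theta @ alpha  # M compressive measurements, M < N

alpha_hat = basis_pursuit(theta, y)
relative_error = np.linalg.norm(alpha - alpha_hat) / np.linalg.norm(alpha)
print(f"relative reconstruction error: {relative_error:.2e}")
```

With these sizes the number of measurements sits comfortably above the sparsity level, which is the regime in which exact recovery is expected.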

B. COMPRESSIVE MEASUREMENTS AS DIMENSIONALITY REDUCTION
In our proposed system, we are not looking to reconstruct the original input signal $x$. The computational complexity associated with the CS procedure occurs predominantly during the reconstruction process, and bypassing this procedure by detecting and identifying passing vehicles based on information extracted directly from the compressive measurements $y$ enables us to significantly reduce the computational complexity of our system compared to more traditional CS-based systems. We thus consider the procedure $\Phi: \mathbb{R}^N \to \mathbb{R}^M$ purely as a dimensionality reduction operation as it produces a lower dimension representation of our input signal.
For us to be able to consider compressive measurements as low-dimensionality representations of an input signal, we must ensure that certain conditions are met, most importantly that for any two distinct signals $x_1$ and $x_2$, $\Phi x_1 \neq \Phi x_2$, and thus $\Theta\alpha_1 \neq \Theta\alpha_2$. This is guaranteed if $\Phi$ and $\Theta$ follow the criteria outlined in [25]. Furthermore, as stated in Section III-A, ensuring that these two matrices follow the aforementioned criteria guarantees that the information contained in $\alpha$ (and thus, $x$) is present in the measurements $y$.
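The requirement that distinct signals remain distinct after projection can be checked numerically. The sketch below assumes a Gaussian $\Phi$ and signals that are sparse in the identity basis, both of which are illustrative choices rather than the setup used in this paper. It compares the distance between two sparse signals before and after projection.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 1024, 256, 10  # illustrative sizes


def sparse_signal(rng, n, k):
    """Return a length-n signal with k nonzero Gaussian entries."""
    x = np.zeros(n)
    x[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
    return x


# Gaussian projection scaled so that norms are preserved on average.
phi = rng.normal(size=(M, N)) / np.sqrt(M)

x1 = sparse_signal(rng, N, K)
x2 = sparse_signal(rng, N, K)

# Ratio of distances in measurement space and signal space.
ratio = np.linalg.norm(phi @ x1 - phi @ x2) / np.linalg.norm(x1 - x2)
print(f"distance ratio after projection: {ratio:.3f}")
```

A ratio close to 1 indicates that the two signals are about as far apart in the $M$-dimensional measurement space as they were in the original $N$-dimensional space, which is what allows features to be extracted from $y$ directly.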
The term compressive signal processing (CSP) was first coined by the authors of [21] and refers to the process of performing detection, classification, and filtering directly on the compressive measurements obtained during the CS process without prior reconstruction. This concept has been further explored and analyzed in [22,28,29]; and while the goals and approaches differ, the fundamental idea of bypassing the reconstruction of x and instead directly leveraging y remains the same. In a similar manner, in our paper, we perform AVDI by extracting features directly from compressive measurements.

IV. PROPOSED SYSTEM

A. SYSTEM OVERVIEW
Our proposed system aims to obtain information from a specific frequency band within the audio signals of passing vehicles while sampling at sub-Nyquist rates. Features are extracted from the samples obtained in this manner and are used to detect and identify the vehicles.
An overview of our proposed system is shown in Figure 1 and is made up of three sections: Random Demodulator, Spectral Shaper, and Feature Extraction and Classification.
Input signals are simultaneously filtered and sampled at a sub-Nyquist rate in the Random Demodulator section using a spectrally shaped bipolar pseudorandom sequence $PRS(t)$, which is produced from a Markov chain by the Spectral Shaper section. Features are then extracted from the acquired samples and used to detect and identify vehicles in the Feature Extraction and Classification section.
In our application, the specific frequency band of interest can be determined by examining the average frequency-domain plots of the vehicle classes under consideration. Figure 2 shows the average sound signals of passing cars, scooters, and periods without a passing vehicle, sampled for a duration $T_s = 2$ s at a rate $W = 48$ kHz. We can see that the signals show the clearest separation for the highest signal power over a narrow band of interest located just under 3 kHz.
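As a rough worked example of the sample reduction implied by these parameters (a sketch that assumes the number of compressive samples is simply $M = R T_s$, with $R = 3$ kHz being the back-end ADC sample rate quoted in the abstract):

$$N = W T_s = 48\,000 \times 2 = 96\,000, \qquad M = R T_s = 3\,000 \times 2 = 6\,000, \qquad \frac{N}{M} = \frac{W}{R} = 16.$$

Under this assumption, the ratio agrees with the factor-of-16 reduction in the number of samples stated in our contributions.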

As highlighted in Figure 1, in our proposed system, the processes that constitute the Random Demodulator and Spectral Shaper sections are performed on continuous-time analog signals, and the processes that constitute the Feature Extraction and Classification section are performed on discrete-time digital signals. These stages will be examined in more detail in the rest of Section IV.

B. RANDOM DEMODULATOR
At the heart of our proposed system is an architecture called the Random Demodulator (RD), which is composed of the following blocks: a signal generator, a mixer, a low-pass filter (LPF), and an analog-to-digital converter (ADC). In our proposed C-AVDI system, we include an additional Markov chain block that is used in tandem with the signal generator to spectrally shape the PRS.
First presented in [30] and further expanded upon in [31] and most notably in [32], the RD enables us to perform CS on sparse or compressible continuous-time signals rather than the exclusively discrete signals described in initial theoretical work. Compared to more sophisticated architectures such as the modulated wideband converter (MWC) in [22] and quadrature analog-to-information converter (QAIC) in [33], the RD is more straightforward and cheaper to implement as it is single channel and only requires a single ADC. Furthermore, the RD is particularly suitable for our application as it is designed to acquire sparse single band multitone signals, unlike the MWC for instance, which is a multichannel architecture designed to acquire sparse multiband signals. More details on different CS acquisition strategies can be found in [27], and [34] provides an in-depth comparison of the RD and MWC systems in particular.
Intuitively, we can describe the RD's operation as follows: rather than acquiring a signal through traditional Nyquist sampling, the RD first demodulates the signal by multiplying it with a white noise-like pseudorandom sequence, spreading the signal's frequency content across the entirety of the spectrum. The resulting signal is then low-pass filtered before being sampled at a sub-Nyquist rate. If required, the original signal can be recovered from the sub-Nyquist samples through $\ell_1$ minimization as in Section III-A.
Let us describe the operation of the RD more formally. An analog signal as described in (1) is combined with a pseudorandom bipolar sequence of unitary amplitude defined as:

$$PRS(t) = \epsilon_n, \quad t \in \left[\frac{n}{W}, \frac{n+1}{W}\right), \quad n = 0, 1, \ldots, N-1,$$

where $\epsilon_n$ is a Rademacher sequence that switches between values $\{-1, 1\}$ at a rate $C = W$.
The combined signal $x(t)PRS(t)$ is passed through an anti-aliasing LPF $h(t)$ and sampled at a rate $R < W$ to obtain linear compressive samples $y[m]$. This procedure can be expressed as a multiplication followed by a convolution in the time domain:

$$y[m] = \int_{-\infty}^{\infty} x(\tau)\, PRS(\tau)\, h(t - \tau)\, d\tau \,\Big|_{t = m/R},$$

from which we obtain an expression for $\Theta$, whose entries are defined as $\theta_{m,n}$ for row $m$ and column $n$:

$$\theta_{m,n} = \int_{-\infty}^{\infty} \psi_n(\tau)\, PRS(\tau)\, h\!\left(\frac{m}{R} - \tau\right) d\tau,$$

where $\Theta$ is a combination of the matrix $\Phi$, which represents the sequence of operations mapping the input signal $x$ to the compressive measurements $y$, and of the orthonormal basis matrix $\Psi$. (In the traditional presentation of the RD architecture, the filtering is described as being performed by an integrator. In practice, however, this integrator is often considered as, or replaced by, an LPF [23], [32], [35]–[37]. In our implementation of the RD, the filtering is performed using an LPF, and will be referred to and represented as such in the remainder of the paper.)
It is shown in [32] that $\Theta$ satisfies the previously outlined RIP conditions, and more importantly the dimensionality reduction properties outlined in Section III-B, as long as the number of measurements $M$ matches or exceeds the lower bound established in that work.
Finally, it is necessary to ensure that in our proposed application, the input signals to the RD are sparse or compressible in the domain defined by $\Psi$. A key feature of our proposed system is the simultaneous sampling and filtering performed during the signal acquisition process, which is achieved by matching the spectra of an input signal and bipolar chipping sequence. As a result, the signals in this paper are considered to be sparse or compressible in the frequency domain ($\Psi$ = DFT matrix). Signals are only very rarely perfectly sparse and are much more likely to be compressible, that is, the magnitudes of the nonzero coefficients present in the signal decay following a power law distribution. This is defined in [38] as:

$$|\alpha|_{(n)} \leq P\, n^{-q},$$

where $|\alpha|_{(n)}$ denotes the $n$-th largest coefficient magnitude and $P$ and $q$ are constants. Figure 3 shows the DFT coefficient distributions of the signals shown in Figure 2.
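The RD acquisition chain can also be written as a matrix acting on Nyquist-rate samples. The sketch below is a discrete-time idealization under stated assumptions: the LPF is replaced by an ideal accumulate-and-dump integrator (the traditional presentation of the RD), the chipping sequence is an unshaped Rademacher sequence, and the function name `random_demodulator_matrix` and the sizes are hypothetical.

```python
import numpy as np


def random_demodulator_matrix(chips, m):
    """Build Phi = H @ D for an idealized random demodulator.

    D multiplies each Nyquist-rate sample by its chip (+1 or -1), and H sums
    consecutive blocks of n // m chipped samples (accumulate-and-dump).
    """
    n = chips.size
    if n % m != 0:
        raise ValueError("number of chips must be a multiple of m")
    d = np.diag(chips)
    h = np.kron(np.eye(m), np.ones((1, n // m)))
    return h @ d


rng = np.random.default_rng(2)
N, M = 96, 6  # illustrative sizes with N / M = 16

chips = rng.choice([-1.0, 1.0], size=N)  # Rademacher chipping sequence
phi = random_demodulator_matrix(chips, M)

x = np.cos(2 * np.pi * 5 * np.arange(N) / N)  # single-tone test input
y = phi @ x  # M compressive samples

# Same result obtained by chipping, then summing each block directly.
y_direct = (x * chips).reshape(M, N // M).sum(axis=1)
print(np.allclose(y, y_direct))
```

Replacing the integrator with a general LPF changes the rows of `h` from rectangular windows to shifted copies of the filter's impulse response, but the structure of $\Phi$ as a filtering matrix applied after a diagonal chipping matrix stays the same.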
We can see that the signals are compressible as their spectra are clearly dominated by a relatively small number K of high magnitude coefficients.

C. SPECTRAL SHAPER
The Spectral Shaper consists of a signal generator and Markov chain block. It is used to produce bipolar sequences with specifically tailored, rather than random, frequency distributions. Indeed, in more traditional RD architectures, the particular $PRS(t)$ used to demodulate the input signal switches polarity with equal probability; as a result, its frequency-domain representation resembles that of white noise. This ensures equal spreading of all $K$ components contained within $x(t)$, which is optimal if we do not have prior knowledge of the distribution of the frequency information within the signal's bandlimit.
If, however, we do have prior knowledge of the locations of interest, then [23] shows that signal reconstruction accuracy can be improved by using bipolar sequences whose frequency-domain representation matches that of the input signal. The authors of [39] demonstrate that the bipolar sequence can be generated using a Markov chain with each state corresponding to an output symbol of ±1. Thus, the polarity of $PRS(t)$ at a given time is determined by the corresponding Markov chain's state transition probability and chain length. The authors also establish that the $\Phi$ matrix obtained as a result of using a Markov chain-generated $PRS(t)$ satisfies the required RIP conditions outlined previously with very high probability. We base the design of our Spectral Shaper on these results and expand upon the single Markov chain sequence generation method proposed in [39] by designing a dual Markov chain sequence generation method, capable of creating bipolar sequences with more complex spectra. Figure 4 and Figure 5 show a selection of spectrally shaped bipolar sequence spectra and the diagrams of the corresponding Markov chains used to create them. The transition probability matrices corresponding to the 2-state and 4-state Markov chains are denoted $P_1$ and $P_2$, respectively. We sweep the value of the transition probability $p$ over the range $0 < p < 1$. The higher the value of $p$ is, the more likely the 2-state chain is to stay in its current state, and the more likely the 4-state chain is to transition to a state with the same output. In the case of a 2-state chain, this results in more of the generated signal's energy being located towards the lower end of its bandlimit; and in the case of a 4-state chain this results in the generated signal's energy tending towards a narrow peak in the middle of its bandlimit.
Conversely, a lower value of $p$ means that both chains are more likely to transition to a state with a different output, leading to the energy of both of the generated signals being located towards the higher end of their respective bandlimits. It is important to note that this relationship is due to the way $P_1$ and $P_2$ were designed: permuting the rows and columns of these matrices would change our state diagram and reverse the relationship between frequency distribution and transition probability.
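The relationship between $p$ and the spectrum of a 2-state chain-generated sequence can be reproduced with a short simulation. The sketch below assumes a symmetric 2-state chain that stays in its current state with probability $p$ and switches with probability $1 - p$, which is consistent with the description above but is not taken from the paper's definition of $P_1$. The 4-state chain is not sketched because its transition matrix is not reproduced here. The helper names are hypothetical.

```python
import numpy as np


def two_state_sequence(p_stay, length, rng):
    """Bipolar sequence from a symmetric 2-state Markov chain.

    The chain keeps its current output (+1 or -1) with probability p_stay
    and flips it with probability 1 - p_stay (assumed form of the chain).
    """
    stay = rng.random(length - 1) < p_stay
    steps = np.where(stay, 1.0, -1.0)  # +1 keeps polarity, -1 flips it
    start = rng.choice([-1.0, 1.0])
    return start * np.concatenate(([1.0], np.cumprod(steps)))


def low_band_energy_fraction(seq):
    """Fraction of spectral energy in the lower half of the band."""
    power = np.abs(np.fft.rfft(seq)) ** 2
    half = power.size // 2
    return power[:half].sum() / power.sum()


rng = np.random.default_rng(3)
length = 2 ** 14

fractions = {}
for p in (0.1, 0.5, 0.9):
    seq = two_state_sequence(p, length, rng)
    fractions[p] = low_band_energy_fraction(seq)
    print(f"p = {p}: {fractions[p]:.2f} of the energy lies in the lower half-band")
```

A high $p$ concentrates the energy at low frequencies, a low $p$ pushes it towards high frequencies, and $p = 0.5$ gives an approximately flat, white noise-like spectrum.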

D. INPUT SIGNAL RECONSTRUCTION WITH SPECTRAL SHAPER
As stated previously, our final C-AVDI system does not reconstruct the original input signal at any point in its operation. During the system design process, however, it is important to quantify and visualize the effects of matching $PRS(t)$ and $x(t)$, which can be done most effectively by reconstructing test signals using different bipolar sequence generation parameters and examining the results. This is done by performing CS using the RD on a set of four test signals whose spectra are shown in Figure 6, varying the Markov chain's p-value over each run for both a 2-state and a 4-state chain. The test signals each have the same bandlimit but different frequency compositions: a low-frequency $x_{LF}$ signal, mid-frequency $x_{MF}$ signal, high-frequency $x_{HF}$ signal, and broadband $x_{BB}$ signal. We denote the corresponding reconstructed versions of these signals as $\hat{x}_{LF}$, $\hat{x}_{MF}$, $\hat{x}_{HF}$, and $\hat{x}_{BB}$ (reconstruction is performed via basis pursuit using the SPGL1 toolbox available from [40]).
The test signal reconstruction parameters are shown in Table 2. The optimal value of $M$ is determined by first establishing a theoretical lower bound value using (11), from which an optimal value of $M$ is obtained experimentally by repeatedly performing the CS process on our input signals while incrementally increasing $M$ and noting the resulting relative error (RE). Repeating this process enables us to find the values of $M$ after which the RE levels off for each of the four test signals. We use the largest of the four values, $M = 12000$. Figure 7 plots the resulting RE of the original and reconstructed signals, and the values of $p$ for the optimal reconstruction of each signal are summarized in Table 3. These results confirm that for a constant value of $M$, matching $PRS(t)$ and $x(t)$ by changing the value of $p$ improves reconstruction performance.

E. INPUT SIGNAL FILTERING WITH SPECTRAL SHAPER
In addition to improving reconstruction accuracy, a tailored $PRS(t)$ can act as a filter to the input signal, attenuating or amplifying specific frequency content. Thus, we can equate the mixing of $x(t)$ and $PRS(t)$ to a simultaneous demodulating and filtering operation in which the spectrum of the bipolar sequence is likened to the frequency response of a filter. This property is a key feature of our proposed system, and again is most effectively quantified and visualized during the system design process by reconstructing the test signals in Figure 6.

1) Single Chain-Generated Sequences
We visualize the filtering effects of a spectrally tailored $PRS(t)$ on an input signal by plotting the spectra of test signals $\hat{x}_{LF}$, $\hat{x}_{MF}$, $\hat{x}_{HF}$, and $\hat{x}_{BB}$ reconstructed using different $PRS(t)$ with Markov chain transition probability values p = 0.01, p = 0.555, and p = 0.999. Figure 8 and Figure 9 show the test signal spectra reconstructed using a 2-state chain and a 4-state chain respectively. We can see that the filtering effect a bipolar sequence has on a test signal depends on the length and p-value of the Markov chain used to generate it. This is best illustrated by $x_{BB}$: for a 2-state chain-generated $PRS(t)$, a value of p = 0.001 suppresses the low frequency content of $\hat{x}_{BB}$, and a value of p = 0.999 suppresses the high frequency content; and for a 4-state chain-generated $PRS(t)$, a value p = 0.001 again suppresses the low frequency content of $\hat{x}_{BB}$, and a value of p = 0.999 suppresses the high and low frequency content. The filtering effect is consistent with the shape of the $PRS(t)$ spectra shown in Figure 4 and Figure 5.

2) Dual Chain-Generated Sequences
In order to create $PRS(t)$ with more complex spectra, we design combined dual Markov chain-generated sequences $PRS_C(t)$ by mixing two different single chain sequences $PRS_1(t)$ and $PRS_2(t)$, where $PRS_C(t) = PRS_1(t)PRS_2(t)$.
To fully assess the effects of $PRS_C(t)$ on a given input signal, we perform three reconstruction tests using the $x_{LF}$, $x_{MF}$, $x_{HF}$, and $x_{BB}$ signals shown in Figure 6, while varying the p-values of $PRS_1(t)$ and $PRS_2(t)$ ($p_1$ and $p_2$, respectively) and the chain lengths. The purpose of these three reconstruction tests is to visualize different properties of our proposed dual chain-generated $PRS_C(t)$ approach:
• The purpose of the first reconstruction test is to gauge the improvements in signal reconstruction when using a dual chain-generated bipolar sequence. This is achieved by determining the optimal $PRS_C(t)$ that minimizes the reconstruction RE for each of the four test signals and comparing their REs to those of their single chain-generated counterparts shown in Table 3. The spectra of the optimal dual chain-generated sequences and of the reconstructed test signals are shown in Figure 10, and the optimal dual chain reconstruction parameters are summarized in Table 4. These results show that using dual chain-generated sequences improves reconstruction performance compared to single chain-generated sequences, even if only marginally.
• The purpose of the second reconstruction test is to visualize the effects of any mismatches between the input signal and dual chain-generated sequence. This is achieved by reconstructing the four test signals using $PRS_C(t)$ designed to maximize reconstruction RE and comparing the spectra of the original and reconstructed signals. The spectra of the reconstructed signals and of their respective sequences are shown in Figure 11. These results show the impact of a mismatched $PRS_C(t)$ on signal reconstruction, underlining the importance of using an appropriately designed $PRS_C(t)$.
• The purpose of the third reconstruction test is to visualize the filtering properties of the $PRS_C(t)$. We reconstruct the broadband signal, $x_{BB}$, using four $PRS_C(t)$ sequences: the spectra of the first three can be likened to a low-pass, band-pass, and high-pass filter, respectively, and the spectrum of the fourth resembles white noise. The spectra of the bipolar sequences and of the reconstructed test signals are shown in Figure 12. These results show that in addition to having a direct impact on reconstruction performance, a tailored $PRS_C(t)$ can attenuate or amplify the frequency content of the input signal.

Figure 8. Reconstructed test signals $\hat{x}_{LF}$, $\hat{x}_{MF}$, $\hat{x}_{HF}$, and $\hat{x}_{BB}$ using 2-state Markov chain-generated $PRS(t)$ with transition probability $p$. Each row represents one of the four test signals presented in Figure 6, and each column represents a different value of $p$.

Figure 9. Reconstructed test signals $\hat{x}_{LF}$, $\hat{x}_{MF}$, $\hat{x}_{HF}$, and $\hat{x}_{BB}$ using 4-state Markov chain-generated $PRS(t)$ with transition probability $p$. Each row represents one of the four test signals presented in Figure 6, and each column represents a different value of $p$.

F. FEATURE EXTRACTION AND CLASSIFICATION
The Feature Extraction and Classification section is composed of a feature extraction block and a classifier block. In the feature extraction block, we extract a set of 5 features from the y[m] measurements produced by the RD. We select the 5 most important features out of the 9 used in our preliminary work [1] to use in our C-AVDI system:
In the classifier block, these extracted features are used as inputs to a multiclass classifier to detect and identify passing vehicles. Classification is performed using an RF classifier, chosen for its robustness in the presence of outliers, inherent suitability for multiclass classification, and minimal preprocessing requirements (no input data rescaling required). The RF is implemented using the scikit-learn library [41].
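As a partial illustration of the feature extraction block, the sketch below computes only the two features that are named explicitly in Section V-C, the standard deviation and the interquartile range. The remaining three features are not listed here and are therefore omitted; the function name and the use of the population standard deviation are assumptions.

```python
import numpy as np

def partial_features(y):
    """Two of the five features for one measurement vector y[m].

    Assumption: population standard deviation (ddof=0) and linearly
    interpolated quartiles; the paper does not state either convention.
    """
    y = np.asarray(y, dtype=float)
    q75, q25 = np.percentile(y, [75, 25])
    return {"std": float(np.std(y)), "iqr": float(q75 - q25)}

feats = partial_features([0.0, 1.0, 2.0, 3.0, 4.0])
```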

V. EVALUATION
System evaluation is performed using a software-based implementation of the proposed system with audio data collected from vehicles traversing a university campus.

A. DATA ACQUISITION
The data acquisition setup is shown in Figure 13. A pair of Azden SGM-990 microphones is installed parallel to a one-lane two-way road and connected to a Sony HDR-MV1 video camera. The microphones are 1 m from the ground, the inter-microphone distance is 50 cm, the distance between the microphones and the center of the front lane is 3 m, and the distance between the microphones and the center of the back lane is 6 m. The microphones' pickup pattern is set to cardioid, and they record the sound of passing vehicles for a 30-minute duration at a sample rate of 48 kHz and a bit depth of 16 bits. We average the signals obtained from the two microphones to create a mono signal used in subsequent analysis. We obtain 14 different datasets by recording vehicle sounds at the same location with the same setup on 14 different days; at different times of day; on different days of the week; and under clear, windy, and rainy weather conditions. The average vehicle signal plots using data obtained under all weather conditions are shown in Figure 2, and the individual average vehicle signal plots for each weather condition are shown in Figure 14.
The three classes considered for classification are cars, scooters/motorbikes, and no vehicles. We refer to these as Car, Scooter, and NoVeh, respectively. The time at which a given vehicle passes in front of the middle of the microphone pair is defined as $t_p$, and the time window for the vehicle as $T_r = \left[t_p - \frac{T_s}{2};\ t_p + \frac{T_s}{2}\right]$, where $T_s = 2$ s. As we are not seeking to detect successively or simultaneously passing vehicles in this evaluation, we retain only the vehicle signals whose $T_r$ does not overlap with those of the preceding or following signals. The total number of vehicle sounds obtained for each class on each day can be seen in Table 5.
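The window-overlap criterion can be sketched as follows. The function name and the example pass times are hypothetical, and windows that merely touch (centers exactly $T_s$ apart) are assumed to count as non-overlapping.

```python
def isolated_passes(pass_times, window_s=2.0):
    """Keep pass times t_p whose window [t_p - T_s/2, t_p + T_s/2]
    does not overlap the window of the previous or next pass."""
    times = sorted(pass_times)
    kept = []
    for i, t in enumerate(times):
        # Two windows of equal length T_s overlap when their centers
        # are closer than T_s.
        clear_before = i == 0 or t - times[i - 1] >= window_s
        clear_after = i == len(times) - 1 or times[i + 1] - t >= window_s
        if clear_before and clear_after:
            kept.append(t)
    return kept

# The passes at 5.0 s and 6.5 s are only 1.5 s apart, so both are discarded.
kept = isolated_passes([1.0, 5.0, 6.5, 12.0])
```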

B. SYSTEM SIMULATION
In this section, we evaluate the system described in Section IV-A by simulating its operation in software. Figure 15 shows an overview of the software implementation of the system proposed in Figure 1. As previously mentioned, our proposed system operates by taking an audio signal emitted from a passing vehicle as input, sampling it at a sub-Nyquist rate, and outputting a predicted vehicle type. System performance is evaluated using the accuracy metric.
As seen in Section IV-B, the switching rate C of a bipolar sequence is defined as twice the highest frequency contained in x(t). Typically, C would be set to 48 kHz to match the sample rate of the microphone's ADC, ensuring that the entirety of the input signal's frequency content has at least a partial tone signature representation at baseband. We, however, are interested in the information contained in a narrow band whose upper limit is 3 kHz, so we set the $PRS(t)$ switching rate as C = 6 kHz accordingly. This limits the amount of high frequency content spread across the frequency spectrum during the demodulation process: in our case, we can expect all the frequency components below 3 kHz in the input signal to have at least a partial tone signature representation at baseband, but not the components located above 3 kHz. We are looking to design a $PRS(t)$ that, when used to demodulate an input signal, will strongly emphasize the frequency content present in the band of interest outlined in Figure 2 and Figure 14. This same sequence will be used when acquiring data under all three weather conditions, as the location of the frequency band of interest is the same in all three cases.
We create a $PRS_C[n] \in \mathbb{R}^N$ with switching rate $C$ from $PRS_1[n] \in \mathbb{R}^N$, a 4-state chain with $p = 0.001$, and $PRS_2[n] \in \mathbb{R}^N$, a 2-state chain with $p = 0.999$. The spectrum of our $PRS_C[n]$ is shown in Figure 16.
Our input signal is represented as a discrete version of the analog input signal $x(t)$, defined as $x[n] \in \mathbb{R}^N$ sampled at rate $W = 48$ kHz during the data acquisition process. We approximate the sparsity level of $x(t)$ as $K = 400$ by counting the DFT coefficients shown in Figure 3 whose magnitudes satisfy $|\alpha_n| \geq 10^{-1}$, and rounding the count to a single significant figure.
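The thresholding step can be sketched as below. The helper names and the toy signal are hypothetical, and the DFT coefficients are assumed to be normalized by the signal length N, since the normalization behind Figure 3 is not stated in this section.

```python
import numpy as np
from math import floor, log10

def round_sig(value, sig=1):
    """Round a non-negative integer to `sig` significant figures."""
    if value == 0:
        return 0
    digits = sig - 1 - floor(log10(abs(value)))
    return round(value, digits)

def estimate_sparsity(x, threshold=1e-1):
    """Count DFT coefficients with magnitude >= threshold, then round.

    Assumption: coefficients are normalized by the signal length N.
    """
    alpha = np.fft.fft(x) / len(x)
    k_raw = int(np.sum(np.abs(alpha) >= threshold))
    return k_raw, int(round_sig(k_raw, 1))

# Toy signal: two strong tones and one weak tone below the threshold.
n = np.arange(1000)
x = (np.cos(2 * np.pi * 50 * n / 1000)
     + np.cos(2 * np.pi * 120 * n / 1000)
     + 0.01 * np.cos(2 * np.pi * 300 * n / 1000))
k_raw, k_rounded = estimate_sparsity(x)
```

Each strong real tone contributes two conjugate coefficients of magnitude 0.5, so the toy signal yields a raw count of 4, while the weak tone falls below the threshold.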
The anti-aliasing LPF preceding the ADC is a 2nd-order Butterworth filter, and the measurements $y[m] \in \mathbb{R}^M$ are obtained by sampling and quantizing the combined $x[n] \odot PRS_C[n]$ signal (where $\odot$ signifies elementwise multiplication) at rate $R = 3$ kHz and bit depth $B = 12$ bits. This corresponds to a reduction in sample rate by a factor of $\left(\frac{48\ \text{kHz}}{6\ \text{kHz}}\right)\left(\frac{6\ \text{kHz}}{3\ \text{kHz}}\right) = 16$. Similarly to Section IV-D, the optimal value of $M$ for our system is determined by first establishing a theoretical lower bound value that is incrementally increased until we obtain the optimal balance between the smallest number of $y[m]$ samples and maximum system prediction accuracy. We find this value to be $M = 6000$ and list it, along with the rest of the parameters used in our system simulation, in Table 6.
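A minimal simulation sketch of this acquisition chain is given below. The rates and bit depth are taken from the text; the filter cutoff (R/2), the input scaling to [-1, 1], the uniform quantizer, the test tone, and the random placeholder chips are all assumptions, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfilt

W, C, R, B = 48_000, 6_000, 3_000, 12      # rates in Hz and bit depth (from the text)

def acquire(x, chips, cutoff_hz=R / 2):
    """Simulate demodulation, low-pass filtering, sampling and quantization.

    Assumptions: cutoff at R/2, input scaled to [-1, 1], uniform quantizer.
    """
    prs = np.repeat(chips, W // C)[: len(x)]       # hold each chip for W/C samples
    mixed = x * prs                                # element-wise multiplication
    sos = butter(2, cutoff_hz, btype="low", fs=W, output="sos")
    filtered = sosfilt(sos, mixed)                 # 2nd-order Butterworth LPF
    y = filtered[:: W // R]                        # keep one sample in 16 (rate R)
    levels = 2 ** (B - 1)
    return np.clip(np.round(y * levels), -levels, levels - 1) / levels

rng = np.random.default_rng(0)
n = 2 * W                                          # one 2 s window at 48 kHz
x = 0.5 * np.sin(2 * np.pi * 1_000 * np.arange(n) / W)
chips = rng.choice([-1.0, 1.0], size=n * C // W)   # placeholder chips, not the designed sequence
y = acquire(x, chips)
```

With these settings, a 2 s window of 96 000 input samples is reduced by the factor of 16 to 6000 output samples.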
The discrete-time and frequency-domain representations of the average y[m] measurements for each class obtained by our simulated system are shown in Figure 17.

C. CLASSIFICATION
Classification is performed on 14 feature sets, which are obtained by extracting the 5 features outlined in Section IV-F from the y[m] measurements of each of the 14 datasets described in Section V-A.
We measure the performance of our system using leave-one-day-out cross-validation: we set each one of the 14 feature sets (where each set corresponds to a different day) as the testing set, and the combined 13 remaining feature sets act as the training set. This process is performed 14 times in total, with each of the feature sets acting as the testing set in turn. We set the number of trees and the minimum number of samples per leaf parameters of the RF classifier as n_estimators = 1000 and min_samples_leaf = 3 respectively, with the other parameters left as default. Both the training and testing sets are balanced using random undersampling prior to classification, and the results obtained from each of the 14 runs are averaged to obtain the system accuracy. To account for any potential discrepancies caused by inherent randomness in the classification process, we perform the full classification process 10 times and average the obtained results, resulting in our final confusion matrix shown in Figure 18 and the system accuracy scores by weather condition shown in Table 7.
TABLE 5. Number of vehicle sounds obtained for each class on each day (the headers of the three per-class columns were not preserved).
Day  Weather   Per-class counts   Total
 1   Clear      60     8    104    172
 2   Clear      21    10     45     76
 3   Wind       21    40     64    125
 4   Clear      40    57    115    212
 5   Clear      52    50    112    214
 6   Rain       43    26     77    146
 7   Rain       77    30    152    259
 8   Clear     177    95    305    577
 9   Wind      174   103    267    544
10   Clear     143    99    239    481
11   Wind      129    84    234    447
12   Clear     130    59    197    386
13   Wind      119    51    186    356
14   Rain      228    65    317    610
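The evaluation protocol can be sketched with scikit-learn as follows. The RF parameters are those given in the text; the use of LeaveOneGroupOut for the day-wise split, the NumPy-based undersampling helper, the function names, and the synthetic data are assumptions made for illustration and are not confirmed as the tooling used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def undersample(X, y, rng):
    """Randomly drop samples so every class has as many as the rarest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

def leave_one_day_out(X, y, day, n_estimators=1000, seed=0):
    """Mean accuracy over folds, each fold holding out one recording day."""
    rng = np.random.default_rng(seed)
    scores = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=day):
        X_tr, y_tr = undersample(X[train], y[train], rng)
        X_te, y_te = undersample(X[test], y[test], rng)
        clf = RandomForestClassifier(
            n_estimators=n_estimators, min_samples_leaf=3, random_state=seed
        )
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores))

# Synthetic stand-in data: 3 classes, 3 "days", 5 well-separated toy features.
rng = np.random.default_rng(1)
y = np.tile(np.repeat([0, 1, 2], [30, 20, 40]), 3)
day = np.repeat([1, 2, 3], 90)
X = rng.normal(size=(270, 5)) + 5.0 * y[:, None]
acc = leave_one_day_out(X, y, day, n_estimators=20)   # fewer trees to keep the demo fast
```

Repeating the call with different `seed` values and averaging the returned accuracies would mirror the 10 repetitions described above.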
The confusion matrix in Figure 18 shows that the most prominent misclassification is that of Car as Scooter. This can be explained by the similar variance and amplitude of their y[m] samples, which translates to similar standard deviation and interquartile range features in particular.
This information, along with the results presented in Table 7, gives us insight into how to modify the system to improve performance in future evaluations. Most notably, the breakdown of metrics by weather condition sheds light on one of the principal causes of misclassification: adverse weather conditions. The design of a system capable of more effectively mitigating the effects of wind and rain noise on classification accuracy will be the focus of future work.

D. COMPUTATIONAL EVALUATION
We can gauge the computational performance of our proposed system by benchmarking it against our previous work. Benchmarking is performed by running the systems under test on the same computer under identical conditions and comparing their runtimes. We use a computer equipped with an Intel i9-9900K CPU (8 cores, 16 threads) @ 3.60 GHz with 64 GB of memory running Ubuntu 18.04. As in the previous section, the full process is run 10 times, from which we calculate the respective average runtimes. The systems under test are the following:
1) The C-AVDI method presented in this paper.
2) Mod-SMBAS, a simplified version of the SMBAS method presented in [2]. For benchmarking purposes, we changed the classifier from SVM to RF, and removed the signal emphasis process.
The benchmarking results in Table 8 show a trade-off between computation time and accuracy: Mod-SMBAS shows the best accuracy but is more computationally intensive than C-AVDI by a factor of ≈ 2.8.

E. MICROCONTROLLER IMPLEMENTATION
We also seek to create an MCU implementation of the Feature Extraction and Classification section of the simulated system shown in Figure 15. Table 9 shows the specifications of some commonly used MCUs. We choose the Teensy 4.1 MCU board for our MCU implementation as it has the most total available memory. The compressive measurements y[m] are generated in the same manner as in our simulated system in Section V-B before being saved on an SD card from which they are sequentially loaded into the MCU for feature extraction. The previously trained classifiers are ported from their software implementations onto the MCU using the emlearn library [42]. Figure 19 shows the MCU implementation of the proposed system: the Random Demodulator and Spectral Shaper sections are implemented in software, and the Feature Extraction and Classification section is implemented on the MCU. Figure 20 shows the resulting confusion matrix. We obtain an accuracy of 72%, which is lower than the accuracy obtained by the simulated system in Section V-B. Upon inspection, we find that the features extracted from y[m] in the feature extraction block in both the software and MCU implementations of our system are identical. From this, we can infer that the difference in accuracy between both implementations is due to the difference in size and complexity between the ported and original versions of the classifier models. Indeed, the limited memory of the MCU restricts the number of trees in the RF classifiers to 100, compared to the 1000 used in the software implementation, while also truncating the size of the various coefficients, weights and parameters used in the MCU implementation of the RF classifier, leading to a drop in classification accuracy.

VI. DISCUSSION
We can assess the performance and viability of our C-AVDI system by comparing it to SMBAS, our previous system presented in [2]. While SMBAS presents an accuracy of 95 %, which is higher than the 80 % obtained by our C-AVDI system, there are two key differences to consider when comparing these results. The first is the weather conditions in which the systems were tested: SMBAS used only vehicle sounds obtained under clear conditions, and was therefore not tested for robustness to adverse conditions, whereas the C-AVDI system has been tested under different weather conditions as outlined in Table 5 and Table 7. The second is the computational cost: as discussed in Section V-D, our C-AVDI system runs approximately 2.8 times faster than Mod-SMBAS, the simplified version of our previous system.
Taking these factors into consideration makes the difference in performance between these two systems much less marked, as in clear conditions the C-AVDI system obtains an accuracy of 85 % while running approximately 2.8 times faster than Mod-SMBAS, which presents an only marginally higher accuracy of 87 %. The similar accuracy score combined with a faster computation time when compared to a system of similar complexity (mono input signal, limited or no post-sampling processing, use of RF classifier) serves to demonstrate the viability of our proposed C-AVDI system, particularly in applications where low-cost, low-complexity sensing is required. On the other hand, the full implementation of SMBAS is better suited to applications where a high-cost, high-complexity system is required due to its superior accuracy performance.
There are two notable limitations in our proposed system that need to be addressed in any future work. The first limitation is the performance of our proposed C-AVDI system in adverse weather conditions (windy conditions in particular) [...] [15], by implementing CSP-based interference removal techniques as presented in [21], or by designing more complex bipolar sequences for more precise filtering during signal acquisition. The second limitation is the performance of the MCU implementation of our C-AVDI system, which, as discussed in Section V-E, is limited by the available memory of the Teensy device. The design and use of a custom board with a suitable amount of memory could theoretically bring the accuracy of the MCU implementation of the system in line with that of the simulated system. The creation of an updated hardware implementation of our C-AVDI system containing all the changes outlined above is also a subject of future work.
The obtained M values listed in Table 2 and Table 6 would suggest that in our particular C-AVDI application, in which we perform classification without reconstruction, the requirements on the lower bound value of M can be relaxed. Indeed, we can observe an inconsistency between the two values of M with regard to the rest of their respective system parameters. According to (11), we would expect the value of M listed in Table 2 to be smaller than the value listed in Table 6; however, in our case, the opposite is true. Given that the purpose of the system described by the parameters in Table 2 is to minimize the RE, and that the purpose of the system described by the parameters in Table 6 is to maximize the classification accuracy, their respective values of M would seem to indicate that a smaller number of y[m] measurements is required for optimal classification than for optimal reconstruction. A more rigorous investigation of this phenomenon is left as a topic for future research.
Finally, the concepts underpinning C-AVDI presented in this paper can be used to extend the benefits of sub-Nyquist sampling to a wide variety of different acoustic (smart device wake-up-word detection) and non-acoustic (human activity recognition using devices such as smartwatches) sensing applications, further contributing to the democratization of ITS technologies.

VII. CONCLUSION
This paper presented C-AVDI, a method to detect and identify vehicles from their audio signatures at sub-Nyquist rates using features extracted from the compressive measurements obtained by a modified random demodulator. Our system uses Markov chain-generated spectrally shaped bipolar sequences to target a specific frequency band in the input signal during the sampling process itself.
The experimental evaluation of a simulated version of our system under a range of weather conditions produced a classification accuracy of 80 % for a back-end ADC sample rate of 3 kHz, with a runtime approximately 2.8 times quicker than the frequency-domain feature-based method proposed in our previous paper [2]. Evaluation of the system's MCU implementation produced an accuracy of 72 %.
Future work includes improving system performance by reducing the misclassification errors caused by adverse weather conditions, as well as improving the performance of the system's MCU implementation by increasing both the number and complexity of the classifiers used in the detection and identification process, and working towards a full hardware implementation of the proposed C-AVDI system.