Radial Velocity Estimation for Multiple Objects Using In-Air Sonar With MIMO Virtual Arrays

As autonomous platforms become more advanced, it is useful to obtain more information about the environment. Classic vision sensors such as RGB-D cameras, LiDAR, or acoustical imaging cameras are capable of accurately displaying the location of an object in 3D space and are widely used for this purpose. However, a problem arises when the measurement environment becomes more complex and measurement data might begin to suffer in terms of accuracy. Imaging sonar sensors are developed specifically for these environments but are limited in terms of frame rate due to the inherent slow nature of sound. This paper will propose a method to increase the amount of useful information obtained from the available data from a single measurement in post-processing. Utilizing a signal in combination with a Doppler velocity-tuned matched-filter bank it is possible to estimate the radial velocity of an object with respect to the sensor’s location. This allows the robot to make better decisions when it comes to path planning and collision avoidance, as a rapidly approaching object requires a different action than a stationary object or one moving away from the sensor. Another advantage of this system is that a single seemingly-large object might be identified as two close lying objects with different speeds. This paper serves as a proof-of-concept with results from a realistic simulation environment using an imaging sonar sensor and shows that the proposed method is perfectly suitable for making accurate estimations of an object’s radial velocity.


I. INTRODUCTION
As society becomes more and more used to and reliant on robots and automated vehicles, it's not uncommon for them to appear close to humans. From humanoid robots at airports to small automated vacuuming robots and Automated Ground Vehicles (AGV) maneuvering through warehouses, they are becoming ubiquitous. It is therefore extremely important for these robots/vehicles to have a clear understanding of the environment, as these now range from quiet autonomous warehouses, to hap-hazardous airports during rush hour. When situations are suitable, common perception sensors such as cameras and LiDAR can be used [1]. Because of their popularity there has already been extensive research on topics such as object and scene recognition, collision avoidance, mapping, etc. The resulting algorithms allow for safe operation in crowded environments where external factors hindering these cameras or LiDARs are minimal. However, there are The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Huo . many scenarios where measurement conditions are far less than ideal due to mud, dust, water spray, . . . (e.g. perimeter sensing in automated mining, road following in construction, heavy construction equipment, . . . ). The ubiquity of these smart platforms in day-to-day situations drives the research and demand for similar possibilities in heavy industries. The CoSys-Lab imaging sonar sensor [2] has been developed with these heavy industry applications in mind but, due to the inherent slow nature of sound (343 m/s) the frame rate is limited to approx. 12 Hz, which can be considered slow in comparison to its non-acoustical alternatives. The only way to receive more information about the environment has been to increase the number of recorded data points by either upgrading the hardware to get an increase in processing power and in turn, frame rate [3], [4] or to simply increase the number of sensors used [5], [6]. Both of these ways require inherent changes and increased cost to the sensor hardware that is being used, which limits the practicality of these improvements in terms of development. As there is a large amount of recorded data available with every measurement that is done, it might be useful to look at what is already available within the sensor and use this to enhance the performance.
In most settings, an active sonar sensor is developed as a system that emits an ultrasonic signal and records the reflections along with noise that is present in the environment [1]. The recorded reflection of the measurement sequence could then be separated from background noise and other sounds by the means of a matched filter. A matched filter will look for a certain signal and its output will peak when the signal ''matches''. It is common practice to design the measuring sequence and matched filter side-by-side, so that the signal can be detected in an optimal fashion [7]- [9]. Due to the nature of applications at the moment of writing, it is important to accurately measure the distance and direction of a certain object within view of the sensor. This led to the co-design of sequence and filter combinations that are robust against any transformations that the measuring sequence might undergo during both the reflection and its travel through the medium [10], [11]. That also includes making the systems robust against the Doppler effect that causes compression or elongation of the signal when reflecting from an object that is moving at a speed that is different from that of the sensor [8], [12], [13]. Because of this, at time of writing, there are not many published algorithms that allow for reliable extraction of the radial velocity of different objects, using active sonar sensing. This paper will take a different approach, exploiting both the Doppler effect and the presence of multiple microphones and emitters that will employ waveform orthogonality to obtain reliable radial velocity estimates for every object that is registered by the imaging sonar sensor [2] and in turn ''adding a dimension'' to the recorded data.
Knowing the velocity for every object in the scene will allow the moving vehicle to make more sensible decisions in terms of path planning an collision avoidance, as it is evident that a fast-approaching object requires more immediate attention than an object that is slowly moving away. The ability to register different speeds from an object that would be otherwise registered as one big reflector will also make it possible to more accurately separate and track different objects.
The paper is structured as follows: The next section will discuss the different concepts that are being used by the system. Section III will explain the signal processing flow that starts at data acquisition and results in an image of the reflectors labeled with their corresponding radial velocity information. The simulation environment along with the obtained results are discussed in section IV followed by the conclusion and future plans in section V.

A. MATCHED FILTERING IN ACTIVE SENSING
Classical ranging systems rely on the emission of a particular sequence and calculating the time it takes for the signal to reflect back to the receiver subsystem. When the emitted sequence reaches the receiver, it contains noise from various sources and can be represented as: where s(t) is the original sequence x(t) with added delay t and noise, n(t) [14]. The latter can be ambient noise, signal reverberations, system noise, etc. To accurately locate the emitted sequence in the receiver recordings, a receiver filter is used [8]. A receiver filter, or matched filter (MF), is used to maximize the Signal-to-Noise (SNR) ratio at a timedelay that corresponds to the target's range with a known offset [15], [16].
Eq. 2 [4], [17] shows how the matched filter is implemented by taking the Inverse Discrete Fourier Transform (IDFT) (F −1 ) of the Discrete Fourier Transform (DFT) of the recorded signal S r (jω), multiplied by DFT of the complex conjugate of the known base sequence S * b .(jω). The output of the matched filter, y(t), will serve as a range indicator as the location of the reflection in the recorded signal is now indicated. For a signal to be suitable for range estimation it needs to have an excellent auto-correlation property so that the matched filter output approaches the largest possible output when both signals match perfectly, and zero at every other instance. Huge amounts of research go into the design of signals that approach this response [8], using various encoding methods aimed primarily towards narrowband signals. For this paper three waveforms (Pseudo Random Additive White Gaussian Noise (PR-AWGN), multisine sequences and a Frequency Modulated Hyperbolic chirp) were taken into consideration. These sequences all have a frequency content that ranges from 80 kHz to 20 kHz, sampled at 450 kHz and are 4000 samples in length (approx. 8 ms). Figure 1 shows the aperiodic auto-correlation function for each signal and shows that, if nothing else has to be taken into account, PR-AWGN has the best performance out of these three signals.
The system described in this paper builds on the eRTIS sensor [2], [4], which is capable of scanning the entire frontal hemisphere. Using conventional delay-and-sum beamforming, the array is steered in different directions and for each of these sampling points the recorded signal will be passed through a matched filter, looking for a reflection of the emitted sequence and indicating the presence of an object. The resulting images, termed energyscapes in earlier publications, are explained in detail in [4], [18].

B. THE DOPPLER EFFECT IN IN-AIR SONAR
The targets that reflect the measuring sequence are seldom completely stationary with respect to the sensor, this will lead to a change in the frequency content of the reflected signal due to the Doppler effect. The Doppler effect will cause a signal to scale in the time-domain depending on the movement of source and the reflector. The frequency dependent FIGURE 1. a) The aperiodic auto-correlation for different measuring sequences previously used in the research of CoSys-Lab regarding in-air sonar sensing. b) shows in detail that, for accurate ranging purposes with a single emitter, a PR-AWGN sequence has an excellent response due to the rapidly decaying slope when there is a slight mismatch between the signals.
Here, f represents the change in frequency between the incident and the reflected wave, v the difference in radial velocity between the sensor and the reflecting object, c the propagation speed of sound (343 m/s) and f 0 represents the frequency of the incident wave. The change in frequency that occurs can be used to determine the radial velocity of the reflecting object with respect to the movement of the sensor. This is done by using carefully tuned sets of matched filters that each filter out a specific Doppler-velocity compensated version of the emitted sequence. This collection of matched filters is called the matched filter bank and its size depends on the range and resolution of velocities that is required for a specific application. An important tool to check for whether a measuring sequence is suited for accurate velocity estimation is the ambiguity function, which will show the output of the matched filter for a set of frequency shifted versions of the original signal. When keeping in mind these frequency shifts the formula for the ambiguity function can be derived from the matched filter formula (eq. 2) by adding the appropriate frequency shift in the frequency domain. This results in eq. 4: where y(t, v) is a matrix containing the matched filter outputs for different Doppler-scaled versions of the original base signal, S r j ω· v c ], being compared to the complex conjugate of the original base signal itself, S * b [jω]. A lot of research in the field of communications [20] and remote sensing [8] revolves around the generation of signals that are robust in a way that Doppler-shifts will have a minimal impact on the behavior of the matched filter when compared to the non-shifted signal. This will ensure a correct detection of the signal when it is recorded at the sensor, even when the environment is moving and/or unpredictable. Another example of this is the call used by the big brown bat to scan its environment, on which the measuring sequence of the original eRTIS is based [4], [18]. Figure 2c shows the ambiguity function for such a signal. The reason these chirp calls are so Doppler tolerant is that the frequency-shift roughly corresponds with the hyperbolic shape of the frequency sweep in the signal, so the shift has little influence on the frequency content of the signal. Figure 2 shows the ambiguity functions of three waveforms (PR-AWGN (a), multisine (b) and the Frequency Modulated Hyperbolic chirp (c)) that our research group has used in the past. For this paper a Pseudo-Random Additive White Gaussian Noise (PR-AWGN) sequence is used because, as shown in figure 2a, it has great potential for this application. Figure 2d shows that, aside from an excellent auto-correlation in time domain (figure 1), PR-AWGN is also very sensitive to frequency shifts showing a steep roll-off with two small sidelobes. This characteristic is also important as it defines the Doppler resolution of the system, as defined by the Rayleigh criterion [15]. If multiple objects are located closely together so that they can not be individually distinguished by the angular resolution of the sensor (i.e. they do not meet the Rayleigh criterion), the received reflection will be a superposition of both reflections [15]. When the Doppler resolution is adequate, techniques such as Moving Target Indication (MTI) or other Doppler processing techniques can be used to separate these close lying targets [15].
In this case, the Rayleigh limit for the velocity domain is entirely dependent on the auto-correlation characteristic of the measuring sequence used. Other research relies on the use of genetic algorithms to create signals where these characteristics are as ideal as possible [21]- [23], or have employed encoding schemes on specific sets of signals [8]. However the performance of the PR-AWGN signals was deemed suitable for what is to be demonstrated in this paper.

C. EXTRACTING VELOCITY INFORMATION
Designing a system that relies only on the previously mentioned mechanisms of active sensing and a frequency-tuned matched filter bank, has proven to be feasible. For a given reflector, determining both range and radial velocity can be achieved by finding the arguments that result in the matched filter bank (eq. 4) reaching its largest output value for that measurement (eq. 5).
However for this system, correct velocity estimations, v, are hit-and-miss due to the small value of v (eq. 3). While the average error does remains relatively small (around 0.02 m/s), it is prone to increase when noise is present. Seeing as noise is inevitable, it is important to increase stability if the system is to be used in a practical context. The concepts in the This comparison shows that a PR-AWGN sequence (a) has the best auto-correlation performance in that the peak rapidly declines and haslittle sidelobes. The multisine (b), that is otherwise perfectly suitable for MIMO applications, proves to be less suitable as it has larger sidelobes. As comparison the frequency modulated chirp sequence (c) used in the eRTIS sensor is shown to be highly resistant to the signal's Doppler scaling. All sequences have a frequency content between 20 kHz and 80 kHz at a sampling frequency of 450 kHz and are 4000 samples in length (lasting approx. 8 ms). (d) A comparison of the velocity sensitivities of the different waveforms.
following sections are all aimed towards improving both the accuracy and the stability of these estimations.
The proposed signal processing flow is similar to that of the normal eRTIS [4] but is extended in several ways. Starting with a set of matched filter banks, as illustrated in figure 4c, consistently linking the recorded sequence to its pre-calculated Doppler-shifted counterpart is crucial for the correct operation of the algorithm. Therefore it is necessary for the matched filters to behave in a way that is consistent across all instances. This gives the normal generalized correlation method an advantage over algorithms that would otherwise obtain a better performance such as the Phase Transform (PHAT) algorithm [17], but due to the normalization that occurs within PHAT it is not suitable for this application as it is important to retain linearity across the different matched filter outputs and the resulting energyscape (ES) [4].

D. MIMO VIRTUAL ARRAY SONAR
To increase the accuracy of the results, RADAR systems calculate the Doppler-shift using pulse trains, where one measurement is made up out of a string of pulses [15]. This method can not be applied when using in-air sonar due to the slow speed of sound that is already serving as a bottleneck in terms of frame rate. Instead, the proposed system enables the use of multiple pulses by employing multiple orthogonal waveforms that are being transmitted simultaneously in combination with a MIMO array, allowing the use of a MIMO virtual array sonar system. Using multiple emitters, along with the eRTIS microphone array (forming a MIMO system), will also result in more energy being emitted at each measurement. This leads to a greater range, and also a better detection of (small) objects, as they are now able to reflect more energy back to the microphone array. Having multiple emitters also allows the system to measure using multiple different sequences. These can later on be treated as individual measurements after separation by matched filtering at the receiving end of the system. Having these additional measurements available can improve both the Signalto-Noise ratio (SNR) and the Point Spread Function (PSF), which are important factors in the next phase of the system where the objects are to be located. The usage of multiple measuring sequences also increases the number of filters in the matched filter bank by factor N, as every individual measuring sequence requires a personal set of matched filters to handle Doppler shifts. Therefore it is not necessarily beneficial for the system's overall performance to have a large number of emitters, regarding the impact on computational needs.
The MIMO system used in this paper is similar to the one that is discussed in [7]. The system has an emitter array consisting of N (in this case, eight) emitters that are placed at position p b n = [x b n , y b n , z b n ] T using a genetic algorithm designed to optimize sensors' Point Spread Function [7]. The measuring sequences, s b n (t) (where n = 1 . . . N ) are bandpass filtered PR-AWGN. The receiving end of the MIMO setup is the microphone array of the eRTIS sensor, having J receivers (in this case 32) placed pseudo-randomly using Poisson-disc sampling [4], [24] at positions p m shows the reflected signal as it is received by microphones of the array.
where r k,n and r k,j represent the range between emitter n and reflector k, and the range between reflector k and receiver j, calculated using geometry from p. Equation 7 calculates the round trip time it takes for a certain measuring sequence s b n (t) to go from emitter to receiver. Now, if a non-stationary object causes the reflection, s b n (t) will be transformed by the Doppler-effect. The system has a fixed matched filter bank set to look for a certain set of radial velocities, e.g. v d = −1 . . . 0 . . . 1 m/s. To achieve this, a Doppler-shifted set of signals needs to be pre-generated. Doppler operator D is introduced in eq. 8.
s b,d n (t) being the Doppler shifted version of the original signal (with the frequency shift from eq. 3) for a given radial velocity. To see which objects are traveling at which radial velocity, an energyscape [4] is generated for each velocity in v d . Starting with the matched filtering step in eq. 10: Having N × J matched filter outputs, a MIMO virtual array can be synthesized that will increase the quality of the resulting energyscapes [16]. First, applying conventional Delayand-Sum beamforming on the emitter array: With ψ being steering direction (with a total of Z directions) and yielding t. Next, applying the same beamforming technique at the receiver array: With t being the result of sampling direction ψ and the locations in p. Taking the envelope of s BF,ψ ψ results in s EN ,ψ ψ which can be interpreted as a range-energy profile showing the amount of reflected energy for a given range [4]. Doing this for every sampling direction will create a Dopplerspecific energyscape (DES v d ) for the system for the currently selected value of v d : With r representing the number of range samples. Repeating this process for every radial velocity in v d will result in a datacube as represented in figure 3. In a theoretical setting, the datacube will be a collection of velocity-specific energyscapes where an object is only imaged in the energyscape of which the velocity setting corresponds to the radial velocity at which the reflector is moving. Due to the nonideal response of the ambiguity function and noise collected throughout the system, the practical datacube will be less clear as objects will not be located at a single sampling direction or range bin. This inaccuracy, but more importantly the computational load required by the system to calculate a high resolution energyscape using the MIMO virtual array method for every radial velocity present in v d , steers this research towards a less straight-forward approach to make the system feasible.

III. PROPOSED SIGNAL PROCESSING FLOW A. CLUSTERING SPARSE SONAR IMAGES
In earlier work the locations of the reflectors found were deduced by simply applying a minimum threshold and scanning the ES for maximum values [5]. This can sometimes lead to inaccurate estimations as the Point Spread Function (PSF) of the imaging sensor used is not a Dirac Delta function, and the accuracy decreases when an object approaches ±90 degrees. This phenomenon can also be observed in figure 5a. Using a function that scans the ES for local maxima can result in hundreds of detected points for each single reflector. In applications where only imaging or situational awareness is needed, this should pose no issue since every detected point in the main lobe should always be avoided. However, for this velocity-estimation algorithm it is important to get only the absolute center of every detected object as the Doppler-ACF curves located in the sidelobes or sides of the main lobe will show erratic behavior and have no clear peak for any of the signals in the matched filter bank. This will lead to significant velocity estimation errors, as illustrated in figure 7. Along with a decrease in estimation performance comes the significant increase in computational need as these hundreds of erratic Doppler-ACF curves will also be processed.
Several algorithms designed to accurately detect groups of close lying data points were investigated. The available data (being the ES generated by the eRTIS imaging sensor) is, while accurate, sparse and does not have the same amount of data points available as its short-wave alternatives such as LiDAR. It is therefore important that we use an algorithm that is capable of handling this sparsity. Another important factor to keep in mind is that sensors like this are to be used in various scenarios and that there is no prior knowledge available about the environment. The latter constraint prohibits the use of popular algorithms like k-means clustering, as k-means will attempt to form a specific (k) number of clusters. When looking at popular clustering algorithms that are flexible and do not require pre-knowledge about the environment there are two main candidates: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Agglomerative Hierarchical Clustering. DBSCAN is a popular algorithm that has been used in numerous applications. Using only two input parameters, DBSCAN can identify any number of complexshaped clusters [25]. As the name suggests, DBSCAN will determine clusters based on several parameters linked to the varying density in the provided data. These parameters are the minimum number of points for a cluster (MinPts) and the presumed radius of a cluster (Eps) [25]. Because our sonar data is sparse, it is possible for a cluster to contain a very small number of points, requiring MinPts to be of small value. Choosing values as low as one or two points in a cluster will lead to DBSCAN yielding similar results to other popular algorithms such as Single Link (SLINK) hierarchical clustering [25]. Reference [25] also suggests that if the parameters can not be estimated accurately, it may be more useful to use Agglomerative Hierarchical clustering. Agglomerate Hierarchical clustering does not require inputs except for the data, but rather relies on a stopping condition for distinguishing clusters.
Using an approach that repeatedly combines closely positioned data points and remembers the distances between these points (leafs) or clusters (branches), the user can define a value at which the algorithm will stop clustering and give the information about the formed clusters as an output. There are several different approaches when it comes to hierarchical clustering [26]. These algorithms start of with an association matrix, in this case a matrix containing the distances between all points found, and will start linking the different elements in that matrix. The most popular and intuitive linking method is the aforementioned SLINK. Single-Link hierarchical clustering (also referred to as nearest-neighbour) will define the shortest distance between the points in a data set and combine the two nearest. If a set of clustered points C is used in the comparison with point P, the smallest distance between P and the different points in cluster C will be used to determine the distance between the two. A downside of this simple method is its sensitivity to noise, as it will greedily add any point to the cluster that is within minimum range. This, theoretically, makes SLINK less suitable for this application. The opposite is complete linkage, where the distance to new point P will be determined with the point in cluster C that is the furthest removed from P. This makes complete linkage clustering more resistant to noise, but less suitable for handling oddshaped reflections. In-between methods using the average distances between pairs (i.e., Paired Group Methods (PGM)) offer several advantages such as the ability to assign weights to points, and working with centroids instead of individual points. Unweighted Centroid Clustering (UPGMC) was deemed ideal for this application along with another method that uses a minimum variance approach, Ward's method. Ward's method also utilizes the PGM approach but instead of defining distances based on the means between different points, Ward's method will try to minimize the objective function (eq. 14 [27]). Starting in a way similar to the ANalysis Of VAriance (ANOVA), the squared error is seen as the sum of the squared euclidean distances between all points in a cluster and its centroid ( • m) [27].
Where E 2 hi is the change in the sum of the squared errors between the two clusters h and i that are compared. The two clusters that result in the smallest change will be combined if they do not exceed the stopping criterion. n h and n i are the number of points in each cluster. p is the number of descriptors, which in this case is the total number of available distances between all centroids,  An important part of hierarchical clustering is the stopcondition. The most intuitive way is to stop clustering once all points that lie within range are processed and no more branches can be combined without going beyond the specified distance-limit, this distance-limit could be represented by drawing a horizontal line on the dendrogram (figure 5d) that stops the process once the linking-tree grows above it. However, a common issue with imaging using planar arrays is the increasing diameter of the Point Spread Function as a reflector moves to the sides, decreasing the sensor aperture as the incident angle strays further from being perpendicular [28]. Simply using pairwise euclidean distance between points will cause a single object lying at ±90 degrees, having VOLUME 11, 2023 FIGURE 4. The signal processing flow of the proposed velocity estimation algorithm. a) Using a MIMO array (containing eight emitters and 32 receivers) allows us to emit more energy, increasing the range and probability of detecting smaller objects. b) A pre-calculated set of time-scaled (frequency-shifted) versions of the original N measuring sequences will make it possible to match a certain reflection to its appropriate speed. v d being the row matrix containing all velocities that are to be estimated. c) A collection of matched filter banks, one for each N x v d pair. d) The MIMO virtual array step. For a predetermined velocity in v d , the environment will be imaged using the N different measuring sequences. The MIMO virtual arrays allow for the generation of a ''high resolution'' image, which is necessary for the clustering in (e). e) All detected reflections are first thresholded and then clustered. The quality of these clusters is important as they are used to find the centroid • m. f) For each of the N measuring sequences a series of lower quality energyscapes is calculated, one for each value in v d . The ''lower quality'' images are the result of the system that is being reduced to Multiple-Input Single-Output (MISO), increasing the speed of the algorithm. Knowing the locations of all reflector centroids, • m, a Doppler-ACF curve can be extracted. This curve will, ideally, have its peak value at the v d (υ) value corresponding to the reflectors' radial velocity. In practice this proves to be unpredictable. The N orthogonal measuring sequences allow us to extract N Doppler-ACF curves for each reflector. Taking the product of these N curves (g) reduces the estimation error and produces more reliable results. Having the estimates for every reflection, these can be combined with the location of the centroids, a ''smeared out'' Point Spread Function when compared to that of the perpendicular objects (also visible in figure 5a, e), to be cut up into pieces that do satisfy the distance requirement. For this research, a different approach is used where the stop-condition is an inconsistency-value where the variation of the different branch-lengths within a cluster is used. Once the variance of a specified number of sub-branches (in this case three sub-branches were taken into account) exceeds a threshold value it is presumed to have combined all the points of a certain cluster, while still allowing for more flexibility in the clustering process. Figure 5 gives an overview of the steps taken to efficiently localize the centroids of the measured reflectors. Starting from a reference energyscape generated by the MIMO virtual array system (figure 5a), this example shows three reflectors positioned in the environment. The first step is to apply a certain threshold for the amplitude (Tr) over the entire ES (ES ψ ) (figure 5b), getting rid of a large amount of noise and sidelobes that are present. Now a simple peak finding function will further extract all values that are likely to be of interest (figure 5a, marked red). Having a large set of data points, the clustering algorithm will need an association matrix. Since the association metric for this application is the shortest distance between all points, a euclidean distance matrix is used (figure 5c). The linking algorithm (SLINK, Ward's, UPGMC, . . . ) will use this matrix to group data points, shifting the location of the centroid along the process (figure 5d). Once the stopping criterion is met, the cluster centroids are available for further processing (figure 5e).

B. FROM CENTROID TO DOPPLER-ACF CURVE
The step happening at the last part of figure 4f brings a challenge of its own. The most important part of the algorithm is the Doppler-ACF curve. It is crucial that it produces a peak at the correct velocity value. Apart from limiting the number of peaks per reflector to a single centroid , v d being a row vector containing the radial velocities in the process), will provide inaccurate estimations (as illustrated in figure 7), as the true peak might lie in a slightly offset location. Repeating the process in figure 5 for each individual DES is cumbersome as a large amount of values in v d will reduce the speed of the algorithm and increase its computational load. To counter this effect, the Doppler-ACF curve is calculated using a window around • m ψ 16706 VOLUME 11, 2023 FIGURE 5. The clustering process. a) shows the raw MIMO energyscape (ES) which is used as input for the algorithm. b) A threshold (Tr ) is applied to the ES, mitigating a large part of the sidelobes and noise in the input data. A simple peak finding algorithm will scan the thresholded ES and locate the peaks, this will still lead to a large number of close lying data points (displayed as red markers over subplot a. c) will determine the euclidean distance between all points found, allowing the clustering algorithm in d) to merge peaks and locate the centroids of a cluster. Subplot e) shows the same ES with only the centroids • m marked, going from >100 data points to three centroids. Drastically reducing the computational load for the rest of the algorithm and improving the quality of the generated Doppler-ACF curves. and taking the maximum value for each window (eq. 15).
where υ • m is a row vector representing the Doppler-ACF curve and wi the window used. The size of the window is dependent on the accuracy of the overall system, although it should not be larger than a few data points to not hinder the resolution and estimation performance of the system.
As mentioned earlier, a single Doppler-ACF curve is sensitive to noise and will sometimes fail to produce a clear peak at the correct value. Taking the product of the N individual Doppler-ACF curves (figure 4f) will increase the peakedness at the correct value of v d , along with the overall accuracy of the velocity estimations by rectifying any misplaced peaks caused by poor signal detection in the matched filter. Poor signal detection can be caused by either low SNR of the reflected pulse, or sub-optimal orthogonality of the reflected signal due to practical causes. This is illustrated in figure 6a. Figure 6b shows a snapshot for such a scenario containing a poorly detected, stationary, object during simulation. The individual Doppler-ACF curves are not peaked and some also FIGURE 6. To enhance the accuracy of the velocity estimation, it is possible to exploit the orthogonality of the N measuring sequences by the product of all Doppler-ACF curves generated within the system. The top plot shows that, in theory, this leads to a single Doppler-ACF curve that has both a much stronger peak at the estimated value and reduced sidelobes. The bottom plot shows a snapshot taken out of a simulation run where, for a single object, there is a lot of deviation between the estimated velocities as both the measurement and the system add noise that is not present in the theoretic case. If the system is to estimate the velocity of the object using the peaks of these indivual lines (indicated by black diamond shapes), there is no clear winner and there is a high probability of a wrong velocity estimation. Multiplying all N Doppler-ACF curves, the error is reduced and results in a more reliable estimation that is within spec of the system. contain these misplaced peaks. Taking the product eventually rectified the situation, with a correct velocity estimation as a result.
Another issue that might surface comes with the presence of strong reflections. If an object reflects too much energy, it can happen that its corresponding sidelobes are not removed in the previous thresholding steps. This can lead to false positives when looking for reflectors in the sensor's environment. These false positives are not easy to omit, as they would require complex thresholding along with the risk of omitting an actual closely positioned reflector. This is another issue that can be tackled using the Doppler-ACF (υ • m ) curves. As figure 8 demonstrates, these sidelobes are still picked up by the matched filter bank but, because of the unpredictable frequency content they contain, will not produce a decent peak. Instead these false-positive Doppler-ACFs will have a maximum value at a random velocity index. This might confuse the platform on which the sensor is mounted as a sidelobe might appear as a seemingly fast approaching object (again stressing the importance of obtaining a correctly positioned centroid, preventing the large velocity estimation errors illustrated in figure 7). During testing, it did become apparent that the variance (Var(υ • m )) of these false-positive reflections is significantly smaller than that of the real reflections. VOLUME 11, 2023 FIGURE 7. A representation of the velocity estimation error that can be expected when the used centroid location is not positioned correctly. Notice that a certain range the velocity estimation error stays similar, but if the centroid is wrongly positioned at a range slightly different from that of the true location, there is a significant increase in error. a) shows this effect for a MISO system where only one measuring sequence is used. The range offset that is required for the system to obtain a large error is small when compared to a MIMO system, shown in (b). The system shown in figure 4 uses the MIMO image to calculate the reflector centroid positions, and the MISO images to calculate the Doppler-ACF curves. (a) also indicates the benefit of using the windowing method as it is now clear that the small differences in centroid-shapes caused by mismatched filters can indeed have a significant impact on the estimated speed.

FIGURE 8.
An issue with strong reflections are sidelobes and strong artifacts. When their amplitude is high enough to surpass thresholding, the clustering algorithm will see these as separate reflectors. This is illustrated in b). a) shows the different υ • m that are obtained when using the centroids, • m, to determine the speed corresponding to the current reflector. Because the signal content in the sidelobes and/or artifacts is not identical to that of the true centroid • m (and so the wanted signal), the output of the matched filter bank will not show a strong peak at any velocity index. Knowing this, the outputs of the matched filter bank can be thresholded to a certain expected variance. Doing this will make the system more robust against falsely identified cluster centroids.
Thresholding the variance of the Doppler-ACF curves allows the system to filter out false-positives and have a more stable and trustworthy knowledge of its surrounding obstacles.

IV. SIMULATION RESULTS
To validate the proposed methodology a realistic 2D simulation environment has been developed. The system is designed for scanning the complete 3D frontal hemisphere, but a 2D environment is more fit for demonstration purposes. The simulation environment consists out of a robot (a round moving object), that navigates through an environment filled with obstacles. All reflectors appear as cylindrical reflectors of varying sizes. During the drive the simulated robot generates a 2D ES ranging from −90 degrees to +90 degrees azimuth and is fixed at 0 degrees in elevation, displaying the 2D horizontal plane in front of the robot. Because of the use of a simulation environment, there is a reliable ground truth to which the obtained measurements can be compared. For the results in figure 9, a matched filter bank designed to detect velocities ranging from −7 m/s to 7 m/s, with an accuracy of 0.05 m/s is used. The signal used during these simulations is the PR-AWGN sequence shown in figure 2. Figure 9a shows a simulation environment where the robot is stationary, along with four stationary objects and two objects that are moving in a certain direction, their paths are indicated with the dotted lines. In figure 9b, the frame shows all objects being detected with their centroids marked. The estimated speeds are placed next to each centroid with a red font to indicated an approaching object (higher priority) and a green font to indicate a stationary or fleeing object. The true radial velocities of the objects are shown in the simulated world and the estimations are shown in the bottom plot. The estimated radial velocities for each moving object are accurate while the stationary objects contain a small error, the largest being 0.15 m/s. Seeing as the velocity resolution for this simulation run is 0.05 m/s, this is only three Doppler-ACF indexes removed from the actual velocity.
To test the accuracy of the proposed algorithm further, a few objects are placed along the path of the robot with varying distances. The matched filter bank is set to detect velocities ranging from −0.4 to 0.4 m/s with a resolution of 0.02 m/s. The simulated robot will drive along a straight path with a constant velocity of 0.3 m/s. The varying distances will result in different radial velocities with respect to the sensor, positioned on the mobile robot. The measurement is run several times. Because the spawn location of the robot is semi-random (within boundaries) the trajectory will vary slightly, prohibiting the measurements to be identical between runs and inducing slight variance. Figure 10 shows the error obtained during these measurements using both SLINK and Ward's clustering for comparison. Figure 10a shows the performance of the algorithm against the angle at which the reflector is located, as the angle with respect to the sensor surface deviates from 0 degrees the aperture gets smaller and the PSF will become larger. This graph shows that the clustering algorithms do not struggle depending on the size of the PSF (the main lobe) and are able to find the cluster centroid, • m, without any issue. Figure 10b shows the effect of distance on the quality of the estimation. Again, both algorithms perform similarly, except for SLINK that will start to struggle when the reflectors come close to the sensor (below 1 meter). Ward's estimation error seems to be the most consistent, so further research will employ Ward's minimum variance method to define the clusters centroids. It should be noted that Ward's estimation error never surpasses 5 cm/s. This makes the overall system accurate enough to make distinctions between objects of which the speed only FIGURE 9. An overview of the simulated world in (a), with the view of the sensor enhanced with every objects radial velocity in (b). The simulated world contains both stationary and moving objects labeled with their simulated radial velocity in (a). The moving objects are also labeled with their velocity vectors (radial (red), tangential (blue) and actual velocity (green)) and their traveled path (gray dotted line) for illustration purposes. This image serves as a ground truth for the result shown in (b) where every object is again labeled. The correct position of the reflector is marked by the cluster centroid • m and the estimated velocities are the output of the system shown in figure 4. Red labels indicate an approaching object, as these are of higher importance. Objects that are stationary or fleeing are indicated with their velocities in green. The estimated velocities obtained by the system are accurate, with three objects having a an error of max. 0.15 m/s. varies slightly, and at the same time enhance its path planning and collision avoidance capabilities. To reduce the remaining error that is still present in the system it is useful to look at different encoded emission signals of which the ambiguity function comes closer to the Kronecker delta, as this can be done in pre-processing and does not slow down the system. Using different beamforming strategies will improve image quality but might also increase the computational cost of the system, making it a less favorable approach.
Lastly, the proposed method is compared to the initial method described in section II D, where a high resolution acoustic image is calculated for every radial velocity using the MIMO virtual array method. The radial velocity of an object is then defined by the largest value across the velocity vector taken at the location of the centroid (with no additional windowing to handle the mismatched curves). A stationary object located in front of the sensor is found and both methods' Doppler-ACF curves are displayed in figure 11. The figure shows that, while both methods produce a strong peak around FIGURE 10. The average estimation error of three measurements using both SLINK as Ward's linkage method are shown. a) Shows that there is no significant difference between the two methods. SLINK has an average measurement error of 0.024 m/s, while Ward's method is slightly better at 0.021. b) Shows a larger difference between both, and shows that the SLINK method struggles more when objects are close to the sensor (and thus having stronger reflections/sidelobes). When reflections are further away from the sensor, the results are again similar. The larger error for close reflectors results in SLINK having an average estimation error of 0.032 m/s. Ward's, with 0.019 m/s, achieves a similar result as in the angle-dependent measurements and thus can be considered more consistent for the application presented here.

FIGURE 11.
Comparing the Doppler-ACF curves for both techniques discussed. The initially proposed theoretical method, calculating a high resolution acoustic image for each velocity using a MIMO virtual array and extracting the Doppler-ACF curve at the location of the centroid with no additional windowing method, has a smaller Peak-to-Sidelobe ratio than the method proposed in this paper.
the center value of the velocity vector (going from −5 m/s to 5 m/s with a resolution of 0.05), the Peak-to-Sidelobe ratio of the proposed method is significantly larger.

V. CONCLUSION AND FUTURE WORK
In this paper the methodology to add radial velocity information to an existing 3D imaging sonar sensor is explained and the operation is validated in a simulation environment. The different aspects of the signal processing flow are discussed in detail; the considerations that should be made when selecting waveforms, the importance and caveats of using a matched filter bank, a discussion regarding the options and importance when it comes to accurately locating reflectors without any pre-knowledge of a sparse data set, the importance of orthogonal waveforms that are not only used for MIMO virtual array sonar imaging, but can also be used to identify false-positive reflections and increase velocity estimation performance, and the simulation environment. The end result shows that it is perfectly possible to outfit an existing imaging sonar sensor with extra information without changing the hardware so that it is also capable of registering the radial velocity of every object in view. The proposed algorithm is capable of estimating the velocity of an object with a maximum margin of error of only three steps in velocity resolution, making it perfectly suitable for accurately identifying the velocity of multiple moving objects in a single frame and serving as a base for enhanced features such as object tracking. This result encourages further exploration of this topic.