Optimized Continuous Wavelet Transform Algorithm Architecture and Implementation on FPGA for Motion Artifact Rejection in Radar-Based Vital Signs Monitoring

The continuous wavelet transform (CWT) has been used in radar-based vital signs detection to identify and remove motion artifacts from the received radar signals. Because the CWT algorithm is computationally heavy, it typically results in long processing times and complex hardware implementations. In its standard form, the algorithm typically relies on software processing tools and cannot support high-performance data processing. The aim of this research is to design an optimized CWT algorithm architecture and implement it on a Field Programmable Gate Array (FPGA) in order to identify the unwanted movement introduced in the retrieved vital signs signals. The optimization approaches in the new implementation structure are based on utilizing frequency domain processing, reducing the required number of operations, and processing independent operations in parallel. Our design achieves significant improvements in processing speed and logic utilization. Processing the algorithm using the proposed hardware architecture is found to be 48 times faster than processing it using MATLAB. It also achieves an improvement of 58% in speed performance compared to alternative solutions reported in the literature. Moreover, efficient resource utilization is achieved and reported. This performance is due to deliberately combining multiple optimization techniques, which results in improvements along several dimensions. As a result, the design is suitable for high-performance data processing applications.

researchers in effectively implementing indoor radars [1], [2], [3], [4]. Moreover, more research has recently been directed towards real-time detection applications, as such a feature is highly desirable. Real-time detection requires powerful processing capability, which is not always affordable for everyday use [5], [6], [7], [8], [9]. This opens opportunities for novel solutions capable of running complex random body movement detection algorithms while meeting real-time requirements [10].
Different approaches have been used in the literature to overcome the challenge of random body movements in vital signs detection. Researchers in [11] tackled this issue by using the continuous wavelet transform (CWT) algorithm to identify the contribution of the motion artifacts in the phase signal and then smooth it by applying a moving average filter. Meanwhile, in [12] the features of the frequency spectrum of vital signs under random body motion are analyzed. This work utilized the motion modulation effect and extracted the direction of the body motion from the new position of the respiration peaks. Body movements introduce frequency shifts in the spectrum, and the direction and amount of each shift depend on the direction and speed of the body motion. This feature was therefore used to account for the body motions in the spectrum and to detect the breathing rate accordingly. The work in [13] effectively reduced the effect of random movement using two methods, the complex signal demodulation (CSD) method and the arctangent demodulation (AD) method, which were implemented for Doppler radar detection of vital signs. It targeted sleep monitoring and baby monitoring, aiming to eliminate false alarms caused by random movements. The CSD proved more immune to the effects of the DC offset, whereas the AD reduces the effect of harmonics and inter-modulation interference at high carrier frequencies. Finally, an adaptive phase compensation method was used for random body movement cancellation in [14]. To measure the random body movements of a subject, a camera was integrated into the radar system, and the camera measurement was fed back into the system as phase information. Large body movements may saturate the receiver; the use of phase compensation avoids this problem. Simple video processing was also performed to extract the random body movement information without using any markers.
The algorithm reported in [11] considers targets whose random body motion affects the detection of vital signs. It uses CWT to identify the locations of the artifacts and then applies a moving average filter to smooth them. It also uses the discrete wavelet transform (DWT) to separate the heartbeat signal from the respiration signal, which results in accurate detection. However, this algorithm is computationally complex and thus does not meet implementation requirements such as high speed, low resource utilization, and low power consumption for high-performance and potential real-time processing scenarios. To overcome these challenges, the algorithm needs to be mapped onto new architectures on the processing platforms that deliver high-performance processing metrics.
As will be outlined in the next section, the CWT algorithm has been used in the literature for different applications, in different structures, and on different processing platforms; see [15], [16], [17], [18], [19], and [20]. The use of CWT for the specific application of detecting unwanted body movement was outlined in [11]. In that article, a standard desktop computer was used as the main processing unit for the algorithm. That work focused on proving the viability of the CWT algorithm when applied to a more practical scenario using radar for detection and monitoring of heart rate (HR) and respiration rate (RR), so accuracy was one of the most important parameters to be reported. However, to assess the viability of the algorithm for high-performance data processing, other parameters (processing speed, resource utilization) are yet to be investigated. Since the algorithm was implemented in software executed on the central processing unit (CPU) of a desktop, it is executed sequentially. In this case, execution speed, processing time, and the potential to apply such an algorithm in real-time applications are questionable. In summary, the main gap in the literature is that the CWT algorithm is very complex, so processing speed and required resources must be considered, especially when very large numbers of data samples are involved. Consequently, there is a need for a CWT design and implementation that can be used for vital sign detection scenarios with high processing speed.
Since the CWT in [11] has been validated as successful in identifying unwanted movements, our work investigates its use for high-performance detection of unwanted movement. This is achieved by designing an FPGA implementation architecture with high speed, low processing time, and optimized hardware resources. To the best of our knowledge, the FPGA implementation of the CWT algorithm proposed in [11] has not been explored yet. The proposed solution provides a processing structure that adopts several optimization techniques. The design utilizes the reconfigurability and parallelism of the FPGA to implement the optimization objectives.

A. OUR CONTRIBUTION
• A new architecture implementation of CWT on FPGA for unwanted body movement detection: we developed a new CWT architecture implementation on FPGA to overcome the gaps described earlier in the introduction. This was done through several optimizations implemented in the design to improve processing speed and logic utilization:
-CWT processing speed optimization: we implemented several speed optimizations on (i) the pre-processing and modification of the algorithm input data and (ii) the FPGA architecture design. Examples of such optimizations are wavelet function optimization, wavelet scales optimization, Fast Fourier Transform (FFT) output optimization and multiplication optimization. The CWT processing speed was improved by adopting and implementing these optimizations.
-CWT resource utilization optimization: we implemented several resource optimization methods on (i) the pre-processing and modification of the algorithm input data and (ii) the FPGA architecture design. Examples of such optimizations are wavelet function optimization, wavelet scales optimization, FFT output optimization, multiplication optimization and random access memory (RAM) requirements optimization. The CWT logic utilization was improved by adopting and implementing these optimizations.
• Applicability to high-performance data processing for unwanted body movement detection application: we developed the design so that it can be applied for high-performance detection applications. This is due to the significant improvement achieved in the processing speed.
The hardware design optimization done in this work does not refer to optimizing the design of primitive functional blocks such as the multiplier or the RAM block. Instead, the work addresses the following:
• How the multipliers and RAMs are used.
• The connection and control of how and when each block is connected to other blocks to maximize speed and minimize resources.
• The flow of data from one block to another and how that is controlled to maximize speed and minimize resources.
• The input signal and wavelet samples.
In summary, the above points allowed the use of the FPGA primitive blocks in the most efficient way in the design.
The rest of this paper is organized as follows: Section II presents related works in the literature, Section III provides a brief background on CWT, Section IV provides details on the CWT processor design features and implementation, Section V outlines the results and a comparison with state-of-the-art works, and Section VI provides the conclusion.

II. RELATED WORKS
The CWT has attracted researchers' interest over the last decades, and multiple implementation approaches of the algorithm are therefore presented in the literature. In [21], a CWT-based approach was proposed to measure the voltage flicker that fast load variations cause in power systems. It uses a Gaussian modulated wavelet function as the basis wavelet in the algorithm. Based on the resulting CWT coefficients, the flicker frequency response and the amount of system frequency deviation can be evaluated. This approach calculates the CWT coefficients in time domain, and the algorithm was implemented using LabView software. It might be accurate in detecting flicker voltages in comparison to typical FFT approaches. However, performing it in time domain involves very complex computational steps within the convolution, which increases the processing time. Besides that, [22] proposed a unified architectural framework for DWT and CWT based on a reconfigurable lifting scheme to be used in image processing applications. The proposed architecture supports the use of different wavelets based on the reconfigurability of the lifting scheme. The unified scheme consists of a reconfigurable lifting scheme array (RLSA), a reconfigurable address generator (RAG), two dual-port static random access memories (SRAM) and the main controller unit. The design was implemented using a very small number of scales and decomposition levels (3 levels), limiting the access to and identification of various frequencies of interest. The processing of a 512 × 512 image using the three-level decomposition was reported to be completed in 12.6 ms. The combination of DWT and CWT in one scheme might result in better resource utilization, especially when considering a large number of scales and decomposition levels. Nonetheless, the parameters of every new wavelet function to be used in this scheme need to be defined and then embedded in the design before the wavelets can be calculated. This step can be replaced by precalculating the wavelet coefficients themselves, storing them in a memory and then reading them when needed.
Another algorithm that combines the use of DWT and CWT is presented in [23]. In this work, a hybrid method utilizing DWT and CWT was implemented on FPGA for underwater target motion estimation in sonar systems. The design consists of the DWT filter bank for signal de-noising and the CWT convolver for target motion estimation. In the proposed structure of the CWT, only one multiplier and one adder were used to map the convolution process. This work was then improved by the same authors and presented in [24]. A scale optimization block was added to this design to estimate the sets of filter coefficients to be used by the CWT by determining the optimal scales. Although the design focuses on saving area, this comes at the cost of computation speed. In this version of the design, the CWT is implemented using 5 multipliers and 4 adders following the same principle as in the previous design flow. Despite the improved computation time, the convolution in time domain is itself more complex than performing only multiplication in frequency domain.
Next, the researchers in [25] developed an algorithm to detect and classify six types of electrocardiogram (ECG) signal beats using a neural network classifier. Before the signal samples were passed to the classifier, they were pre-processed using CWT and the principal component analysis (PCA) algorithm. CWT was mainly used to extract the ECG signal features, while PCA was used to reduce the size of the data before it was fed to the classifier. The combination of CWT and PCA provided more effective input data to the neural network, leading to better classification results. The Haar mother wavelet was used in the CWT estimation with ten scales in MATLAB. The accuracy of the classifier in this method is very high. The speed of this method was not investigated, and no results were reported. However, it should be apparent that the use of CWT increases the complexity and time consumption of the processing.
The works presented in [26], [27], and [28] are successive improvements built on each other by the same researchers. They presented the design and implementation of the CWT algorithm on FPGA to detect and extract the event related potential (ERP) part of electroencephalogram (EEG) signals. The basic idea of the implemented algorithm is to conduct the CWT in frequency domain rather than time domain to speed up the computation. The algorithm processing steps are designed and implemented on FPGA. In addition, optimization techniques were used to further reduce the processing time requirements as well as the logic utilization. In this implementation, the zeroes in the Morlet wavelet were removed from the calculations. In addition, the scales used were reduced to only those supporting the ERP feature targeted for extraction. The initial design computes the CWT in 1 ms, which was improved by the optimization to around 0.57 ms. The achieved run time is good in this work partially due to the moderate number of samples in the signal (1024 points). Further investigations should be carried out for longer signals to determine how the length affects the run time and logic utilization.
The work in [29] proposed a fringe pattern recognition method using CWT in Fourier space. This work attempted to reduce the algorithm execution time by designing and implementing it on FPGA. Its design consists of the wavelet core, input and output buffer memories sized at 1 KB each, and a Nios II processor. The design was tested on a 512 × 512 fringe pattern image, which was loaded into the SRAM external to the FPGA. The FPGA requires around 100 ms to process the image, whereas an implementation in C language needs around 1 s, and around 650 ms if a high-end station with higher processing capability is used. The resource utilization when using an Altera Cyclone IV was 61% of logic elements, 49% of on-chip memory and 100% of embedded multipliers.
Another fringe pattern recognition and fringe phase extraction application using CWT was presented in [30]. The CWT algorithm is implemented on an FPGA using the multiplication between the two-dimensional pattern spectrum and the two-dimensional wavelet kernel spectrum to avoid the complex convolution operations and reduce processing time. The Morlet function was used as the mother wavelet. The design consists mainly of a digital data acquisition module, a data buffer module, a configuration module, a CWT operation module and an output module. The heart of the design is the CWT operation. When the input signal matrix is 256 × 256 samples in size, the FPGA execution time is less than 10 ms (using a 200 MHz clock), whereas in MATLAB it was 1 s.
The work proposed in [31] aimed to design a hardware description language (VHDL) module that detects the R wave and interval in an ECG signal to obtain the HR in real time using the CWT with splines. The CWT was designed to operate in time domain, with the wavelet function selected as the first derivative of a second-order spline function. Among all scales of the wavelet function, only one scale, scale 8, was used in the CWT calculation; it was selected because it contains the frequencies of interest in the passband. The processing time requirement in this design was 20 ms with an accuracy of 90%. The designed module was then implemented in an FPGA prototype presented in [32]. In this prototype, four modules were implemented: the data receiver module, the module for obtaining the HR, a module to manage the storage of data on a micro-SD card and a module to manage the visualization of the processed data. The prototype was tested with 8 files of input data with a reported accuracy of 99.5%. The prototype design has a bandwidth (BW) of 200 Hz and a power consumption of 625 mW/h.

III. BACKGROUND ON CWT
CWT is a windowing technique with variable-sized regions that provides different frequency and time resolutions depending on what is needed. For example, a high time resolution is usually necessary at high frequencies, but not at low frequencies, where it causes redundancy. With CWT, a time-frequency representation of the signal is obtained, and therefore the artifacts, which typically also have higher frequencies than the normal vital sign signals, can be clearly seen and located in time. Using CWT, the vital signs phase signal disturbed by the artifacts and its corresponding CWT can be extracted. The artifacts can then be identified in the time domain by exploiting the frequency information [33], [34], [35], [36].
The CWT of an input signal x(t) for a selected mother wavelet function ψ(t) at a given time b and scale a is defined as [37], [38], [39], [40], and [41]:
$$W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt$$
The definition of the convolution between the input signal x(t) and a linear time-invariant filter h(t) is:
$$(x * h)(b) = \int_{-\infty}^{\infty} x(t)\, h(b - t)\, dt$$
If h(t) is defined in terms of the mother wavelet function as:
$$h(t) = \frac{1}{\sqrt{a}}\, \psi^{*}\!\left(-\frac{t}{a}\right)$$
then the CWT can be seen as the convolution between the input signal and the mother wavelet in time domain. This convolution produces the CWT coefficients. In a simple compact form, this convolution equation is [28]:
$$W_x(a, b) = x(b) * \frac{1}{\sqrt{a}}\, \psi^{*}\!\left(-\frac{b}{a}\right)$$
where the superscript * represents the complex conjugate. If the total number of convolutions is small (which is true for signals with few samples), this time domain method can be used in practical situations. Alternatively, the CWT algorithm is calculated in frequency domain, in which the convolution operation is replaced by a multiplication operation as follows:
$$g_1(t) * g_2(t) \;\longleftrightarrow\; G_1(\omega)\, G_2(\omega)$$
where the lower-case g denotes the input signal and the wavelet signal in time domain, while the upper-case G denotes them in frequency domain. Here, the sign * between two functions represents convolution. By definition, processing using CWT requires the selection and use of a wavelet function in either time domain or frequency domain. If the process is based on time domain, then the most computationally complex step is the convolution between the wavelet function and the input signal. If the transformation is based on frequency domain, then the process takes the form of a multiplication between the wavelet function and the input signal.
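The equivalence between time domain convolution and frequency domain multiplication can be checked numerically. The following is a minimal sketch, assuming discrete signals and circular convolution (which is what the discrete Fourier transform implements); the function names and the random test vectors are illustrative and not part of the original design.

```python
import numpy as np

def circular_convolution_direct(x, h):
    """Time-domain circular convolution, O(N^2) multiply-accumulates."""
    n = len(x)
    y = np.zeros(n, dtype=complex)
    for b in range(n):
        for t in range(n):
            y[b] += x[t] * h[(b - t) % n]
    return y

def circular_convolution_fft(x, h):
    """Same result through the convolution theorem, O(N log N)."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h))

# Hypothetical test vectors standing in for the signal and the wavelet.
rng = np.random.default_rng(0)
x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
h = rng.standard_normal(64) + 1j * rng.standard_normal(64)

y_time = circular_convolution_direct(x, h)
y_freq = circular_convolution_fft(x, h)
max_diff = np.max(np.abs(y_time - y_freq))
print(max_diff)  # differences are at floating-point rounding level
```

The direct version needs N multiplications per output sample, while the FFT version needs three transforms and one point-by-point product, which is why the frequency domain form becomes attractive as N grows.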
In every CWT, a mother wavelet needs to be selected. There are certain criteria for the selection of the mother wavelet, such as finite energy and the admissibility factor, which are elaborated in [31]. The most commonly selected mother wavelets in the CWT algorithm are the analytic Morlet, the Mexican hat and the Paul wavelet functions, whose formulas are shown in Table 1 [28]. The table summarizes the time domain equations of the three functions and their corresponding frequency domain equations.
One of the most widely used mother wavelets in biomedical applications is the Morlet function. The Morlet wavelet consists of complex sinusoidal waves modulated by a Gaussian envelope. The Morlet function is typically selected for biomedical applications such as vital sign detection due to its simplicity and suitability for spectral analysis.
The Morlet wavelet used by MATLAB is defined by: where U(ω) is the unit step in frequency domain. If the unit step is ignored for the moment, the time domain Morlet used in MATLAB is:

IV. CWT PROCESSOR DESIGN FEATURES AND IMPLEMENTATION
Different methods and approaches are planned to optimize an architecture for the CWT algorithm on FPGA to be used for vital sign detection. The following approaches are among the ones used to achieve the optimized design:
• Capitalizing on the parallelism features of FPGA.
• Reducing unnecessary calculation steps in the algorithm.
• Optimal selection of the scales fit for the application.
• Using the available IP cores.

A. CWT PROCESSOR DESIGN METHODOLOGY
The overall block diagram of the proposed CWT processor design methodology is shown in Fig. 1. It contains the three main parts of this work, which are:
• Input data preparation and preprocessing;
• CWT-based data processing;
• Results validation.
The first and third parts are the software parts and are processed using MATLAB, while the second is the hardware part, which includes the FPGA.
Detailed description of all components of Fig. 1 is presented below.

1) EXPERIMENTAL DATA UNIT
This unit represents the experimental measurement data matrix resulting from the experiment conducted in [11]. Each element in the matrix R is a complex element with a real part and an imaginary part (I + Q) of the base-band digitized signal samples. R contains 45,572 rows representing the slow time samples and 50 columns representing the fast time samples (range). The important parameters of the experiment that are utilized in this work are the following:
• Sampling time in slow time (across rows) t_s0 = 3.072 ms.
• Sampling frequency in slow time (across rows) f_s = 325.5208 Hz.
• Range resolution ΔR = 0.2 m.
The above parameter values of the sampling time, sampling frequency, and range resolution are based on the system designed in [11].

2) INPUT DATA PREPARATION AND PRE-PROCESSING UNIT
Internally, this unit consists of three sequential steps, as shown in Fig. 2. Its purpose is to correct the data for the offset caused by the cables in the experiment and to focus on the time and range of interest. This unit is also important for feeding the correct data to the CWT processor, namely the data that contains the target information, where oscillations appear due to the RR and HR of the volunteers.

a: DATA CORRECTION
To properly process the data matrix R, some correction steps are applied to extract the part of the matrix on which the subsequent algorithms in the FPGA are applied. These steps are:
• Starting the data processing of R at the 21st second of the measurement, as the first 20 seconds were preparation for the actual measurement.
• Starting the data processing of R at the 7th sample in the columns to account for an offset of around 1.4 m due to the cables used in the experiment. This reduces the range of interest to less than 10 m.
• The resulting matrix is G, with 39,062 rows representing slow time samples and 44 columns representing fast time samples (range).
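The trimming described above can be sketched in a few lines. This is only an illustration under stated assumptions: the matrix is filled with zeros as a stand-in for the measured data, the number of skipped rows is taken as the integer part of 20 s divided by the slow-time sampling period, and the function name is hypothetical.

```python
import numpy as np

T_SLOW = 3.072e-3        # slow-time sampling period in seconds (from the text)
SKIP_SECONDS = 20        # preparation time discarded at the start
FIRST_RANGE_SAMPLE = 7   # 1-based index of the first fast-time sample kept

def correct_data(R):
    """Drop the first 20 s of rows and the first six range columns."""
    first_row = int(SKIP_SECONDS / T_SLOW)   # 6510 rows cover the first 20 s
    first_col = FIRST_RANGE_SAMPLE - 1       # convert to 0-based indexing
    return R[first_row:, first_col:]

# Placeholder with the dimensions reported for R; real data would come from [11].
R = np.zeros((45572, 50), dtype=np.complex64)
G = correct_data(R)
print(G.shape)
```

With these assumptions the resulting shape matches the 39,062 rows and 44 columns reported for G.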

b: GENERATING TARGET RANGE PROFILE
• Generating slow time axis (x axis) vector using the sampling time data in slow time.
• Generating range axis (y axis) vector using the range resolution data.
• Finding magnitude of each element in the complex matrix G.
• Mapping magnitudes of G data to the mesh grid having slow time as x axis and ranges as y axis, with a color map indicating the data values.
The target range profile is shown in Fig. 3.
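The steps for building the range profile can be sketched as follows. The function name, the toy matrix and the toy axis parameters are illustrative only, and placing bin k at k times the range resolution is an assumption; the plotting step (mesh grid with a color map) is left out.

```python
import numpy as np

def range_profile(G, t_slow, range_res):
    """Return the slow-time axis, the range axis and |G| for plotting."""
    n_slow, n_range = G.shape
    slow_time = np.arange(n_slow) * t_slow           # x axis, in seconds
    ranges = np.arange(1, n_range + 1) * range_res   # y axis, assumed k * resolution
    magnitude = np.abs(G)                            # magnitude of each complex element
    return slow_time, ranges, magnitude

# Toy 2 x 2 example with toy parameters, not the experimental values.
G_toy = np.array([[3 + 4j, 0], [0, 1j]])
slow_time, ranges, magnitude = range_profile(G_toy, t_slow=0.5, range_res=2.0)
print(slow_time, ranges)
print(magnitude)
```

The three returned arrays are what a mesh-grid plot needs: two axis vectors and a matrix of values for the color map.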

c: DOPPLER SIGNAL EXTRACTION
• From the range profile, identifying the range bin with oscillation, which in this case is range bin = 13, indicating target presence at 2.6 m from the radar.
• Creating the Doppler signal by keeping all rows of G and only the one column corresponding to range bin = 13. The result is the Doppler signal, a complex vector of size 39,062. The magnitude of this Doppler signal is shown in Fig. 4.
The Doppler signal is used as the input of the CWT processing unit. The length of the complex test signal x(t) selected for the processing unit is N = 4096 samples.
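A minimal sketch of the column selection is given below. It assumes that the reported range bin is 1-based and that the test signal is the first N samples of the Doppler vector; the text does not state which segment was used, so the slice is only a placeholder choice. The synthetic phasor is a stand-in for measured data.

```python
import numpy as np

RANGE_BIN = 13   # bin with oscillation reported in the text (assumed 1-based)
N = 4096         # length of the test signal fed to the CWT processor

def extract_doppler(G, range_bin=RANGE_BIN):
    """Keep all slow-time rows and the single column of the target bin."""
    return G[:, range_bin - 1]

# Synthetic stand-in for G with a rotating phasor in the target bin.
G = np.zeros((39062, 44), dtype=complex)
G[:, RANGE_BIN - 1] = np.exp(1j * 0.1 * np.arange(39062))

doppler = extract_doppler(G)
x = doppler[:N]   # assumption: first N samples are used as the test signal
print(doppler.shape, x.shape)
```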

3) CWT BASED DATA PROCESSING UNIT
Before starting the FPGA design, all steps of CWT in the context of vital sign detection and using the experimental data collected are simulated in MATLAB to ensure correctness of the steps and to establish a comparison point later. The algorithm steps are configured on the FPGA again to ensure correctness of these steps and to establish another point of reference.
There are two main basic CWT algorithm structures found in the literature: the time domain and the frequency domain. Because the first one involves a complex convolution process, the second one is adopted in this work. The most commonly used methods to move from time domain to frequency domain are the short-time Fourier transform (STFT) and the FFT. Because the latter involves lower computational complexity, it is adopted in this work. The structure of the FFT-based CWT algorithm is shown in Fig. 5. It shows the flow chart of the CWT algorithm, which contains the following critical functions/steps: the FFT, the multiplication, and the IFFT. Those are the most complex and time-consuming operations of the CWT algorithm.
To determine the size and the number of scales of the mother wavelet, the sampling frequency and the length of the input signal need to be known. Once the scales are determined, the wavelet coefficient matrix ψ(t) is generated in time domain; it contains all dilated and translated versions of the mother wavelet based on the scales. To shift from time domain to frequency space, the FFT is applied both to the input signal x(t) and to the wavelet coefficients ψ(t) to obtain X(ω) and Ψ(ω), respectively. After that, the frequency domain signal is multiplied by the wavelet coefficients in frequency domain at each scale. The resulting products at each scale are converted back to time domain using the IFFT to produce the CWT coefficients in time domain. The algorithm is applied for the proposed design as follows:
• The sampling frequency of the input signal is f_s = 325.5 Hz and its length is N = 4096.
• The mother wavelet function is selected (Morlet wavelet) with N samples in time domain.
• The total number of scales is S = 89. This is determined from the input signal considering two factors: 1) the sampling frequency of the input signal and 2) the number of samples of the input signal. These two factors are used as input to MATLAB to determine the number of scales.
• Generating the wavelet coefficient matrix in time domain ψ(t) of size N × S, where each column represents the wavelet function in time domain at one scale.
• Applying the FFT on ψ(t) at each scale to get the frequency domain wavelet coefficients Ψ(ω) of size N × S, where FFT(ψ(t)) = Ψ(ω). This requires S N log N operations.
• Applying the FFT on x(t) to get the frequency domain signal X(ω), which is a complex vector. This FFT requires N log N operations.
• Applying point-by-point multiplication between each point in the column vector X(ω) and the corresponding point in the first column of the already calculated wavelet coefficient matrix Ψ(ω), then repeating the multiplication between the points of X(ω) and the corresponding points of the second column of Ψ(ω), then the third column, and so on until the last column, number S. The output of this step is a matrix denoted M_x(ω) = X(ω) × Ψ(ω). The number of multiplication operations here is 2 × N × S (since each sample of X(ω) has a real part and an imaginary part).
• Applying the IFFT on M_x(ω) column-wise. The result of this operation is the matrix W_x(t) containing the CWT coefficients, which has N rows and S columns.
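The steps listed above can be sketched with NumPy as follows. This is not the authors' implementation: the Morlet expression, its centre frequency of 6, the three scales and the test tone are illustrative assumptions (the design itself uses N = 4096 and S = 89 scales determined in MATLAB). The sketch follows the text in multiplying the two spectra directly.

```python
import numpy as np

def morlet_time(n, scale, w0=6.0):
    """Illustrative complex Morlet sampled on n points (assumed form)."""
    t = (np.arange(n) - n // 2) / scale
    return np.pi ** -0.25 * np.exp(1j * w0 * t - t ** 2 / 2) / np.sqrt(scale)

def cwt_fft(x, scales):
    """FFT-based CWT: FFT of wavelets and signal, product, column-wise IFFT."""
    n = len(x)
    psi_t = np.stack([morlet_time(n, s) for s in scales], axis=1)  # N x S, time domain
    psi_w = np.fft.fft(psi_t, axis=0)    # S FFTs, one per scale
    x_w = np.fft.fft(x)                  # one FFT of the input signal
    m_w = x_w[:, None] * psi_w           # point-by-point product, column by column
    return np.fft.ifft(m_w, axis=0)      # N x S coefficient matrix

# Small hypothetical example: a single tone and three scales.
n = 256
x = np.cos(2 * np.pi * 8 * np.arange(n) / n)
scales = [2.0, 4.0, 8.0]
W = cwt_fft(x, scales)
print(W.shape)
```

Because the sampled wavelet is centred in the window, the coefficients in this sketch are circularly shifted by n // 2 samples relative to the input; a real design would account for that alignment when mapping coefficients back to time.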
To better optimize the algorithm, several observations on the CWT algorithm steps are made as follows:
• From the input signal sampling frequency and the number of samples, the number of scales of the selected mother wavelet is determined prior to algorithm computation.
• The algorithm implementation is optimized by directly calculating the wavelet coefficient matrix in the frequency domain, Ψ(ω). This is done instead of calculating it in time domain and then applying the FFT to obtain the frequency domain coefficients. This optimization removes the computation time of applying the FFT on ψ(t), which has a complexity of around S N log N.
• The algorithm implementation is optimized by precalculating the wavelet coefficients and storing them instead of synthesizing the wavelet equation in frequency domain. This approach reduces resources when implementing the wavelet equation in FPGA as well as the computational time. This optimization approach makes the algorithm design and FPGA architecture implementation more modular and reusable for other wavelets without drastically modifying the design.
• The IFFT can be optimized by having it start once enough input is available from the output of the multiplication. It does not have to wait until full completion of the multiplication. This makes these steps overlapping in time, which improves the speed of the calculation.
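To get a rough feel for the first saving, the operation counts quoted in the algorithm steps can be evaluated for N = 4096 and S = 89. The sketch below assumes base-2 logarithms and assumes that the column-wise IFFT costs the same as S forward FFTs; neither assumption is stated in the text, so the totals are indicative only.

```python
import math

N = 4096   # samples per signal (from the text)
S = 89     # number of scales (from the text)

log_n = int(math.log2(N))          # 12, assuming a base-2 logarithm
fft_wavelets = S * N * log_n       # FFT of the wavelet at every scale
fft_signal = N * log_n             # FFT of the input signal
multiplications = 2 * N * S        # real and imaginary parts of each product
ifft_columns = S * N * log_n       # assumption: same cost as S forward FFTs

baseline = fft_wavelets + fft_signal + multiplications + ifft_columns
optimized = baseline - fft_wavelets   # wavelet spectrum precomputed and stored
saving = fft_wavelets / baseline

print(fft_wavelets, baseline, optimized, round(saving, 3))
```

Under these assumptions, skipping the wavelet FFT removes a little under half of the counted operations, before any gain from overlapping the multiplication and IFFT stages.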

4) RESULTS VALIDATION UNIT
The purpose of this unit is to validate the resulting CWT coefficients to ensure that the design accurately identifies the locations of the artifacts, which are then suppressed. The block diagram of this step is shown in Fig. 6.

a: SCALES SELECTION (APPLICATION DEPENDENT)
The resulting CWT coefficient matrix contains a profile of the components of the signal on which CWT was applied. VOLUME 10, 2022 The lower scale numbers such as 1, 2, 3, etc. contain the highest frequency components of the signal, while the highest scales such as S, S-1, S-2, etc. contain the lowest frequency components of the signal. In the vital signs signal, the maximum mechanical displacements of the lungs and heart have typical amplitudes of 1 mm and 0.1 mm, respectively. Depending on the subject's activity and health condition, the vital signs frequencies range between 0.1 Hz and 3 Hz. Besides that, there exist certain frequency ranges in the target signal that need investigation to identify the unwanted movement corrupting the vital signal. These artifacts typically have higher frequencies and amplitudes than the typical vital signs. The range of frequencies of 4 Hz to 20 Hz is selected, which corresponds to scales between 26 and 50. Thus, these scales are used to represent the range of frequencies of interest on which binary masking is applied.

b: BINARY MASKING
• The binary masking is applied on the CWT coefficients from scale 26 up to scale 50; it is therefore applied on Ŝ = 25 scales only and not on all the S scales.
• This is done by creating new vectors containing the maximum magnitudes of the CWT coefficients at the frequency range of interest (to locate artifacts of unwanted movements). To be specific, it is achieved by finding the maximum magnitude of the columns from W x(t) at the Ŝ scales, resulting in s1 of Ŝ samples size.
• Then, a threshold TRS 1 is set up for binary masking. All values below TRS 1 are set to zero and all values above TRS 1 are set to one, so that: if s1 ≤ TRS 1 then s1 = 0, else s1 = 1.
• The entries of the resulting vector that equal 1 mark the artifact locations where the moving average is applied.
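The masking steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `binary_mask`, the (scales × samples) array layout, and the choice to take the maximum across the selected scales for every time sample are assumptions made here, and the toy values are invented for the example.

```python
import numpy as np

def binary_mask(coeffs, threshold):
    """Return a 0/1 mask marking time samples whose peak CWT magnitude exceeds threshold.

    coeffs: complex array of shape (num_scales, num_samples) holding the CWT
            coefficients at the selected scales only (assumed layout).
    """
    # Peak magnitude over the selected scales for every time sample.
    s1 = np.abs(coeffs).max(axis=0)
    # Values at or below the threshold become 0, values above it become 1.
    return (s1 > threshold).astype(np.uint8)

# Toy example (invented values): 3 scales, 6 time samples, a burst at samples 2 and 3.
coeffs = np.array([
    [0.1, 0.2, 2.0 + 1.0j, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.5, 3.0j, 0.2, 0.1],
    [0.2, 0.1, 0.4, 0.6, 0.1, 0.2],
])
mask = binary_mask(coeffs, threshold=1.0)
```

In this toy case only the two burst samples exceed the threshold, so they are the only entries flagged for the moving average.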

c: MOVING AVERAGE
• This is applied on the Doppler signal at the artifact locations.
• This is done by first setting the number of moving average points: pts = 101. This value is selected because other values (e.g. 21 and 51 points) produced an unsatisfactory level of artifact reduction, whereas 101 points performed well in attenuating the artifacts.
• Find the indices ind 1 of the artifacts in the binary mask, where the moving average is to be applied.
• Apply moving average on x(t) at the indices where the artifacts are found, as: . where x(ind 1 (i)) is the value of x at index ind 1 of (i); (i) is any integer value starting from 1 until length(ind 1 ) and v is the sample value at certain location.
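A minimal Python sketch of this step is given below. It is not the expression used in the paper: the centered window, the clipping of the window at the signal edges, and the averaging of the original (unsmoothed) samples are assumptions made for illustration, as is the function name `smooth_at_artifacts`.

```python
import numpy as np

def smooth_at_artifacts(x, mask, pts=101):
    """Replace x at flagged indices by a centered moving average of the original x."""
    x = np.asarray(x, dtype=float)
    half = pts // 2
    y = x.copy()
    ind1 = np.flatnonzero(mask)          # indices where the mask equals 1
    for i in ind1:
        lo = max(0, i - half)            # window is clipped at the signal edges
        hi = min(len(x), i + half + 1)
        y[i] = x[lo:hi].mean()           # average over the original samples
    return y

# Toy example with a 3-point window: a single spike flagged at index 4.
x = np.zeros(9)
x[4] = 9.0
mask = np.zeros(9, dtype=int)
mask[4] = 1
y = smooth_at_artifacts(x, mask, pts=3)
```

Samples outside the mask are left untouched, which matches the idea of filtering only where an artifact has been located.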

d: COMPARISON
After the moving average is applied at the artifact's locations, the resulting signal is compared with the original x(t) to observe the level of improvement.

B. CWT PROCESSOR ARCHITECTURE AND FPGA IMPLEMENTATION
1) BASIC IMPLEMENTATION ARCHITECTURE
The input signal x(t) is separated into two parts, one containing the real part I (t) and the other containing the imaginary part Q(t), with each part consisting of N samples. I (t) is used as the input test signal in the FPGA design. The CWT steps identified are mapped from the hardware point of view, with Fig. 7 presenting the basic block diagram of the operations and architecture of the FFT-based CWT algorithm. It shows the hardware implementation of the FFT-based CWT algorithm of Fig. 5 and therefore represents the second part of Fig. 1 (i.e., CWT-based data processing). The main functions/steps, such as the FFT, multiplication and IFFT, are designed and implemented. In addition, the "control module" is used to control the data flow, to synchronize the processes, and to define the relation between all the design blocks and modules. Moreover, since some data needs to be stored during the process, different RAMs are introduced to the design. In this work, VHDL was used for FPGA design coding and implementation. Quartus Prime 15.1 software was used in all the stages of the FPGA design. The processing steps of this design are as follows:
• Storing I (t) in RAM 1 at one sample per clock cycle. The memory has N locations, each 20 bits wide.
• Reading I (t), which is 20 bits wide, from RAM 1 into the N-point FFT. This FFT uses a fixed-point representation of the data.
• Storing the real part of the FFT output in RAM 2 and the imaginary part in RAM 3 . These two memories have N locations each, 20 bits wide.
• Calculating the Morlet wavelet values in MATLAB using the input signal length N and the sampling period t s0 . This results in a large matrix of size N × S containing all the frequency domain values of the wavelet function for S scales. Each value is represented with 8 bits.
• Writing the Morlet wavelet coefficients from the wavelet input interface to RAM 0 , which has N × S locations, each 8 bits wide.
• Synchronized reading from RAM 0 and RAM 2 to the inputs of the multiplier module at a rate of one sample per clock cycle. The reading from RAM 2 runs from 0 to N − 1 and is then repeated S times, while the reading from RAM 0 runs from 0 to N − 1, then continues from N to 2N − 1, and so on until all the values of RAM 0 have been covered.
• Using the same multiplier, multiplying the RAM 3 and RAM 0 samples at a latency of one clock cycle and storing the output in RAM 5 , which has N × S locations, each 28 bits wide.
• Reading from RAM 4 and RAM 5 into the N-point IFFT. This IFFT uses a fixed-point representation of the data.
• Obtaining the IFFT results, which represent the W I values.
• Using the control module as the brain of the design, sending control signals to and receiving them from the different modules. It generates all memory addresses and the read enable and write enable signals, in addition to controlling and synchronizing the operation of the FFT, the multiplier and the IFFT.
The green outlined boxes in Fig. 7 are the input and output modules of the system, while the numbers inside the arrows represent the sequence of the process. In addition, the red lines connecting the control module to all other modules in the system represent the control interfaces and signals.
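The data flow of the basic architecture (one FFT, one element-wise multiplication per scale, one IFFT per scale) can be sketched in floating-point Python as follows. This is a software illustration under stated assumptions rather than a model of the fixed-point hardware: `fft_cwt` and `gaussian_bandpass_bank` are hypothetical names, and the Gaussian band-pass rows merely stand in for the precomputed Morlet table, whose exact values are not reproduced here.

```python
import numpy as np

def fft_cwt(signal, wavelet_freq):
    """FFT-based CWT: one FFT, one element-wise product per scale, one IFFT per scale.

    signal:       real array of N samples (the I(t) channel).
    wavelet_freq: array of shape (S, N) holding each scale's wavelet
                  already expressed in the frequency domain.
    Returns a complex (S, N) coefficient matrix.
    """
    spectrum = np.fft.fft(signal)                 # N-point FFT, computed once
    products = wavelet_freq * spectrum[None, :]   # repeated for each of the S scales
    return np.fft.ifft(products, axis=1)          # N-point IFFT per scale

def gaussian_bandpass_bank(n, centers, width):
    """Hypothetical stand-in for the wavelet table: one Gaussian band-pass row per scale."""
    k = np.arange(n)
    return np.exp(-0.5 * ((k[None, :] - np.asarray(centers)[:, None]) / width) ** 2)

n = 256
t = np.arange(n)
signal = np.sin(2 * np.pi * 8 * t / n)            # tone located in FFT bin 8
bank = gaussian_bandpass_bank(n, centers=[8, 40], width=2.0)
coeffs = fft_cwt(signal, bank)
```

The scale whose band-pass is centered on the tone responds strongly, while the other scale stays near zero, which is the behavior the later masking step relies on.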

2) OPTIMIZED IMPLEMENTATION ARCHITECTURE
Several approaches are proposed in this part of the paper to advance the performance of the algorithm. Based on the previously presented block diagram in Fig. 7, optimizations in the following areas are proposed:
• Eliminating unneeded samples: in addition, another optimization level on scales is done by removing unneeded samples. A close look at the structure of each wavelet scale shows that the BPF has many leading and trailing zeros, with the non-zero values concentrated in the middle of each scale. Moreover, the wavelet scale shape gets narrower as we move from scale 1 to scale S, which increases the number of zero values in each scale's samples. The idea here is to keep track of the indices of the non-zero values rather than storing, reading, and multiplying the whole scale's samples (zeros and non-zeros). This is because storing the zero values is not useful and their multiplication result is already known.
-RAM needed to store the multiplication outputs.
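The idea of tracking only the non-zero span of a scale can be illustrated with the short Python sketch below. The helper names `nonzero_span` and `multiply_span_only` and the toy 16-bin scale are assumptions introduced for this example; the actual index ranges of the design are those listed in Table 2.

```python
import numpy as np

def nonzero_span(row, tol=0.0):
    """Return (first, last) indices of the entries whose magnitude exceeds tol."""
    idx = np.flatnonzero(np.abs(row) > tol)
    return int(idx[0]), int(idx[-1])

def multiply_span_only(spectrum, row, span):
    """Multiply only inside the stored span; everything outside is known to be zero."""
    first, last = span
    return spectrum[first:last + 1] * row[first:last + 1]

# Toy scale: 16 frequency bins with non-zero values only in bins 5..9.
row = np.zeros(16)
row[5:10] = [0.2, 0.7, 1.0, 0.7, 0.2]
spectrum = np.arange(16, dtype=float) + 1.0

span = nonzero_span(row)
partial = multiply_span_only(spectrum, row, span)
```

Only five multiplications and five stored products are needed here instead of sixteen, and the savings grow as the scales become narrower.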

c: FFT OUTPUT OPTIMIZATION
This FFT module is adopted from OpenCores in Quartus Prime. The input signal to the system, I (t), is read by the FFT as N samples of 20 bits each, and the output of the FFT is again N samples of 20 bits each. This means a RAM of N × 20 = 81,920 bits is needed to store the real part of the FFT output (RAM 1 ). A similar RAM is also needed to store the imaginary part of the FFT output (RAM 2 ). However, the RAM 1 and RAM 2 locations are reduced based on the following:
• Since zero reduction is applied to the wavelet scales, the FFT output samples at the indices where the wavelet values are zero are of no use, even when those samples are themselves non-zero.
• As a result, these corresponding values in the FFT output do not need to be stored in RAM 1 and RAM 2 , hence a further memory reduction as well as a reduction in the number of multiplication cycles is achieved.
• According to Table 2, the lowest index used from the wavelet values is index 40 while the highest index is 709; therefore, only 670 samples are needed from the FFT output rather than N samples.
The reduction obtained in RAM 1 and RAM 2 by considering the FFT output optimization is:
• Total size of RAM 1 and RAM 2 prior to considering the FFT output optimization: 2 × N × 20 = 163,840 bits.
• Reduction in the size of RAM 1 and RAM 2 : 137,040 bits.
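The figures above can be cross-checked with a short worked calculation. The value N = 4096 is inferred here from N × 20 = 81,920 bits, and the post-optimization size of 26,800 bits is not stated in the text; it is derived under the assumption that the 20-bit sample width is unchanged.

```latex
\begin{align*}
N &= \frac{81{,}920}{20} = 4096, \qquad 709 - 40 + 1 = 670 \text{ samples},\\
\text{size before} &= 2 \times N \times 20 = 163{,}840 \text{ bits},\\
\text{size after} &= 2 \times 670 \times 20 = 26{,}800 \text{ bits},\\
\text{reduction} &= 163{,}840 - 26{,}800 = 137{,}040 \text{ bits}.
\end{align*}
```

The derived reduction agrees with the 137,040 bits reported, which corresponds to roughly 84% of the original FFT output storage.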

d: MULTIPLICATION OPTIMIZATION
The multiplication module is adopted from OpenCores in Quartus Prime. The optimization in the multiplication process is conducted as follows:
• Starting Time of the Multiplication Process: typically, the multiplication process commences only once the FFT completes its operation and has written all its values to RAM 1 and RAM 2 . However, a close examination of the design shows that the multiplication for scale 26 first needs sample number 204 from the FFT output, and the index is then incremented until it reaches the last value (709) corresponding to scale 26, as shown in Table 2. After that, the multiplication of the values from scale 27 onwards starts with index 191, and so on. As shown in Table 2, the starting indices decrease from one scale to the next. This is because each scale moves across the frequency axis towards lower center frequencies. As a result, parallel reading and writing operations are constructed, i.e. the writing operation is activated at output sample number 40 from the FFT, while the reading is activated once sample number 204 is produced from the FFT to RAM 1 and RAM 2 , for 6144 cycles. There is no need to wait until all the 670 samples from the FFT are produced.
• Parallel Multipliers: this is considered since there are real and imaginary output samples from the FFT. Since the multiplication is done at one sample per cycle, exploiting parallelism by synthesizing two parallel multipliers (one each for the real and the imaginary samples) reduces the number of cycles by 50%. Increasing the number of multipliers further has a positive impact on the required multiplication cycles. However, increasing the number of multipliers above 2 does not reduce the required number of multiplication cycles in a linear manner. The optimal impact occurs when using 2 parallel multipliers (which is adopted in our design). Nonetheless, adopting a higher number of parallel multipliers is possible at the expense of extra logic utilization, without improving the speed significantly.
The optimization in the RAM usage is performed as follows:
• Storing Multiplication Outputs: the size needed to store the multiplication results is reduced for several reasons. Firstly, the optimization reduced the number of scales. Secondly, zero values from the selected scales are now eliminated. Finally, the unnecessary FFT outputs are now eliminated. The reduction obtained in RAM 3 and RAM 4 to store the multiplication outputs is:
-Total bits size of RAM 3
• Storing the Wavelet Scales Values: the need for a dedicated memory initialized with the wavelet values (RAM 0 ) is completely eliminated. Instead, RAM 4 is initialized with the wavelet values and is also used to store the output of the multiplication. This is possible as RAM 4 is utilized only at the beginning of the multiplication process. Therefore, a write-during-read operation is introduced at RAM 4 : the wavelet values initialized at RAM 4 are read to the input of the multiplier and, at the same time, the output of the multiplier is saved back to RAM 4 at addresses whose values have already been read. A careful control process is designed to separate the read and write operations at the same address by at least 3 cycles. In addition, the addresses and the read enable and write enable signals are generated with care.
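The memory reuse described above can be mimicked with a small cycle-by-cycle Python model. It is a behavioral sketch only: the function name `multiply_in_place`, the list used as a pipeline, and the assumption that addresses are visited in increasing order are introduced here for illustration, while the separation of 3 cycles between reading and writing an address follows the text.

```python
import numpy as np

def multiply_in_place(ram, samples, latency=3):
    """Model one RAM that first holds the wavelet values and ends up holding the products.

    ram:     1-D float array initialised with the wavelet values (overwritten in place).
    samples: FFT output samples, one per RAM address.
    latency: cycles between reading an address and writing its product back.
    """
    n = len(ram)
    pipeline = []                                   # (address, product) pairs in flight
    for cycle in range(n + latency):
        if cycle >= latency:                        # write back the oldest product
            addr, product = pipeline.pop(0)
            ram[addr] = product
        if cycle < n:                               # read the wavelet value at this address
            pipeline.append((cycle, ram[cycle] * samples[cycle]))
    return ram

wavelet = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
samples = np.full(5, 10.0)
expected = wavelet * samples                        # reference computed with separate memories
result = multiply_in_place(wavelet.copy(), samples)
```

Because the write address always trails the read address by the pipeline latency, no wavelet value is overwritten before it has been read, so the shared memory yields the same products as two separate memories would.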

f: IFFT OPTIMIZATION
The IFFT module is adopted from OpenCores in Quartus Prime. Given all the previous optimizations of the FFT process, the multiplication process, the zeros and the scales, inserting the correct values into the IFFT module is vital to ensure a correct IFFT process. A precise control module is designed to insert the leading and trailing zeros into the IFFT input signal at the correct locations. This requires careful design of the reading operation from RAM 3 and RAM 4 , where the multiplication outputs are stored. The block diagram in Fig. 8 depicts the optimized algorithm implementation on the FPGA. In this optimized design, the input I (t) is fed to the FFT directly upon reading from an external input interface. In addition, the wavelet coefficients are precalculated directly in the frequency domain and initialized in RAM 4 , which is also used to store the multiplication outputs. In this design, only the non-zero values from scale 26 to scale 50 are saved and used. Only 670 samples of the FFT outputs are needed and stored in RAM 1 and RAM 2 . Furthermore, Fig. 8 shows that two parallel multipliers are introduced (Mult 1 and Mult 2 ). The outputs of the two multipliers are stored in parallel in the optimized RAM 3 and RAM 4 , which are then read into the IFFT.
The green outlined boxes in Fig. 8 are the input and output modules of the system while the numbers inside the arrows represent the sequence of the process. Besides that, the red lines connecting the control module to all other modules in the system represent all the control interface and signals.
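The re-insertion of the leading and trailing zeros ahead of the IFFT can be checked in software with the sketch below. The helper name `rebuild_ifft_input`, the 64-point length and the toy scale are assumptions for illustration; the point is only that padding the stored products back to full length reproduces the unoptimized result.

```python
import numpy as np

def rebuild_ifft_input(stored, first, n):
    """Place the stored products starting at bin `first` and fill the rest with zeros."""
    full = np.zeros(n, dtype=complex)
    full[first:first + len(stored)] = stored
    return full

n = 64
rng = np.random.default_rng(0)
spectrum = np.fft.fft(rng.standard_normal(n))
row = np.zeros(n)
row[10:15] = [0.2, 0.7, 1.0, 0.7, 0.2]            # toy scale, non-zero in bins 10..14

stored = spectrum[10:15] * row[10:15]              # what the reduced RAM would hold
reference = np.fft.ifft(spectrum * row)            # full-length computation
optimized = np.fft.ifft(rebuild_ifft_input(stored, 10, n))
```

Both paths give the same coefficients, so the storage saving does not alter the transform as long as the zeros are restored at the correct bin positions.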
The timeline of the different operations of the proposed optimized FFT-based CWT processor of Fig. 8 is shown in Fig. 9. This timeline shows the three major operations in the process, each with a different color: 1) the FFT operation is shown in cream color, 2) the multiplication operation is shown in green color, and 3) the IFFT operation is shown in pale-blue color. It also shows the steps related to each major operation in the same color. It can be observed that several steps run in parallel at the same time. Moreover, the IFFT operation is launched once the complete multiplication steps are performed.

3) CONTROL MODULE
The control module functions as the brain of the proposed optimized design, generating and synchronizing all control signals from one central module, as follows:
• FFT operation: Generates and controls the timing of the sink valid, sink start, sink end and reset signals.
• Control of the clock cycle counter: Starts from the first activation of the FFT, representing the point at which the design starts actual processing. The counter values are used at different places in the design to trigger specific signals and deactivate others. They are also used at a later stage to calculate the number of cycles needed from the start of processing until the end.
• Multiplication: Generates and manages the control signals needed for the multiplication. This includes the multiplication clock enable signal, and the selection and synchronization of the two input signals to the multiplier. This is critical in terms of timing, so synchronization needs to be performed correctly.
• IFFT: Controls the operation of the IFFT by generating and controlling the timing of the sink valid, sink start, sink end and reset signals of the IFFT.
• Memory operations: Generates and controls the timing and address counting signals. These are used to generate and control the timing of the read-write operations to RAM 1 , RAM 2 , RAM 3 and RAM 4 . These operations are also controlled in conjunction with the read enable and write enable signals of these RAMs.

4) TIMING AND LATENCY ANALYSIS
The block diagram of Fig. 10 shows the time analysis and clock cycle requirements of the design illustrated in Fig. 7. The blue filled boxes represent the number of cycles needed for the specific operation. The letters (L) inside the arrows represent a data loading process, whereas the red dashed arrows represent a wait/no operation until the specified location at the other end of the arrow is reached. It is apparent that the most time-consuming steps in the design are the multiplication stage and the IFFT. This is because these stages need to be repeated for the different scales of the wavelet. The reduction in the number of scales required and the reduction of the samples in each scale are therefore expected to strongly improve the calculation speed and to reduce the logic utilization.
In addition to that, the block diagram of Fig. 11 shows the time analysis and clock cycles requirements of the design in Fig. 8. Similar to the previous figure, the blue filled boxes represent the number of cycles required. The arrows filled with (L) represent loading process, whereas the red dashed arrows represent a wait/no operation until the specified location of the other arrow end is reached. It is also apparent that the most time-consuming steps in the design are the multiplication stage and the IFFT. This is because these stages need to be repeated for the different scales of the wavelet. The following equation shows how the total number of cycles needed to complete the algorithm processing is calculated: where TC is the total required cycles.

5) DATA SOURCES AND DATA COLLECTION TECHNIQUES
The main sources of data are obtained from recently conducted radar-based human vital sign detection experiments by a group of researchers from IMEC -Netherlands, Maastricht University -Netherlands, and IMEC -Belgium [11]. This data is comprised of reflected signals from target objects in practical experimental setups. The experiment was conducted in a 'brainstorming' office area that mimics a typical room setting. These settings contain furniture, metal shelves and objects, metal walls, personal computers (PCs), instruments, tables, sofas, a big screen, and chairs. Wi-Fi repeater stations were also active in the environments where the measurements were performed. The two radar antennas, which were horizontally separated by 10 cm, were placed at 1.25 m height above a reinforced concrete floor [11]. The experimental radar set-up includes a radar module, a digital signal processor/field-programmable gate array (DSP/FPGA) board, an analog-to-digital converter (ADC) and a laptop, see Fig. 12. The radar block consists of a waveform generator based on a programmable phase-locked loop (PLL), a power divider, a low noise amplifier (LNA), a gain block, a radiofrequency (RF) mixer, a base-band filter and an amplifier. The radar waveform is generated by a PLL that is configured by the DSP/FPGA board. This signal feeds a power divider that splits it into two branches. The first output is connected to the transmitter antenna. The signal reflected from the target is received, amplified, and then mixed with a copy of the transmitted signal. On the receiving path, the signal is amplified by the LNA and gain block and then fed into the RF input of the mixer. The local oscillator (LO) input of the mixer is connected to the second output of the power divider. The baseband signal produced by the mixer is amplified, filtered, and digitized by the ADC. The DSP/FPGA manages both waveform generation and acquisition. 
The radar sensor is based on a linear frequency-modulated continuous-wave (FMCW) architecture. It transmits a series of chirps, separated from each other by an off interval, where no signal is transmitted. The radar sensor was designed using commercial off-the-shelf components [11].
Using the described set-up, eight experiments of two minutes each were conducted. In each measurement, a volunteer is invited to breathe normally and avoid any other movements when seated on an 'acoustic sofa', hidden behind the high sound-absorbing back panels. Two different absolute distances of 2.6 m and 5.4 m were evaluated. These measurements were then repeated at the same distance, but now with the addition of moderate random body movements. The volunteers were instructed to perform four or five moderate random body movements (moderate limb movements, crossing the legs, and so on) per measurement at 11, 50, 69 and 98 s [11].
The process of taking the input data from MATLAB and storing it in the RAMs, and of taking the output data from the FPGA back to MATLAB, is performed via co-simulation between MATLAB and the FPGA. The data transfer synchronization is controlled using a process built specifically for this purpose in our control module.

V. RESULTS AND COMPARISON
A. CWT MATLAB SIMULATION
An artificially generated test signal with imposed noise is used as the input signal to observe the effectiveness of the algorithm in detecting the unwanted movements (noise) in the signal. This signal was generated with 700 samples and a sampling period of 0.1 s. As shown in Fig. 13, the signal contains four artifacts at different times. The Morlet function was used as the mother wavelet in this simulation. The filter banks were calculated using a signal length of 700 points with a sampling period of 0.1 s. The Morlet wavelet coefficients were then calculated, and 66 scales were identified as the needed number of scales. Fig. 14 shows the Morlet wavelets used, in the frequency domain at different scales, with the amplitude normalized to 2. As observed from the graph, a lower scale number represents a wavelet at a higher center frequency. In addition, the shape of the wavelet becomes narrower when moving from a lower scale to a higher scale.
Utilizing the Morlet wavelet coefficients, the CWT algorithm was applied on the input signal. The CWT complex coefficient matrix was generated for all the 66 scales. Using the CWT coefficients at selected scales, the locations of the unwanted artifacts in the input signal can be identified. In Fig. 15, the maximum magnitudes of the coefficients are plotted over time. These coefficients clearly contain information related to the noise in the signal. The figure shows the four distinct locations of the artifacts in the input signal. To locate them accurately, a binary mask was generated, as shown in Fig. 16. The moving average filter was then applied at these identified locations, where the input signal is corrupted by unwanted artifacts. Fig. 17 shows the input signal before and after the moving average was applied at the artifact locations identified using the CWT. A clear improvement in the input signal can be seen after removing the high-frequency artifact components from the signal. This simulation case shows that the CWT can identify the locations of random movements in the signal, and hence applying the moving average at these locations rejects these artifacts from the signal.
FIGURE 12. Radar components set up in [11].
After successfully using the CWT algorithm on the artificially generated signal (with noise), it is then applied on experimental radar data to further verify the algorithm results using MATLAB. The graph in Fig. 18 shows the Doppler signal of a person with unwanted random movement introduced in the signal at four locations. The experimental input signal spans 120 seconds with 39062 samples. This signal was divided into individual segments with a time span of around 11 seconds each. The part of the Doppler signal with the unwanted random movement utilized in this work is shown in Fig. 19. This segmentation is performed as a quick test of smaller parts of the long original signal. Fig. 20 shows the Morlet wavelet at different scales in the frequency domain. These wavelets are used for implementing the CWT on the experimental test signals. As can be seen, each wavelet scale is a BPF with a different center frequency. The first scales are very wide, but as the wavelet scale number increases, the BPF moves to lower frequencies and becomes narrower.
After applying the CWT algorithm, the complex coefficient matrix of the CWT was generated. The maximum magnitudes of these coefficients at scales between 26 and 50 were calculated. These values reveal the location of the unwanted random movement in the signal, as shown in Fig. 21. It also shows the maximum magnitudes of the coefficients of the selected scales, with some areas having much higher values than the rest. A binary mask is then generated to identify the unwanted random movements in the Doppler signal, as seen in Fig. 22. The moving average filter was then applied on the signal at the identified locations, resulting in the signal presented in Fig. 23. The unwanted random movements with high-frequency components were successfully removed from the identified artifact locations.
To further validate these results, the CWT was applied on another segment of the experimental signal where unwanted artifacts were not present. The new input signal is shown in Fig. 24 with no presence of artifacts. The algorithm was successful in building an all-zero binary mask based on the CWT output coefficients. As a result, unwanted random movement was not detected, as shown in Fig. 25, and hence the moving average filter was not applied on any part of the input signal. This demonstrates that the algorithm is capable of detecting the presence of the artifacts while not generating erroneous behavior in their absence.
Furthermore, another segment of the Doppler test signal with unwanted movement at a different location was injected in the system to test its validity. This test input signal is shown in Fig. 26. The system successfully identified the location of these artifacts and applied the moving average to improve it. Fig. 27 shows very clear identification of the artifacts in the signal through the binary mask. It also shows the improvement in the signal after applying the moving average at the identified location.
The execution time of the MATLAB algorithm is around 82 ms on a computer running Windows 10 with an Intel(R) Core(TM) i7 CPU @ 1.80 GHz 1.99 GHz and 16 GB of RAM.

B. CWT FPGA IMPLEMENTATION RESULTS
From the previous MATLAB simulation results, it is confirmed that the proposed algorithm is capable of locating the unwanted movements and artifacts in the Doppler signal used in the system. In this section, the results from the algorithm implementation on the FPGA are examined and validated. Fig. 28 shows the Doppler signal used in the FPGA as the input signal containing unwanted artifacts. Note that the Morlet wavelet function used in the FPGA is the one used in the MATLAB simulation from Fig. 20.
Utilizing the Morlet wavelet to conduct the CWT process on the input signal, the output of the FPGA is expected to be the CWT complex coefficients matrix of size Ŝ × N. Since our design used the scales from 26 to 50, the related CWT coefficients are presented in Fig. 29. It can be seen that these scale samples do contain information about the locations of the artifacts in the input signal. However, the information on the overall unwanted movement is scattered across the scales; the maximum magnitudes over the selected scales gather this information in one graph, as shown in Fig. 30. In order to extract the exact artifact locations in the signal, the binary mask shown in Fig. 31 was generated. It clearly locates the places where unwanted random movement is present in the signal. Utilizing this binary mask to apply the moving average filter at the identified locations results in the improved signal at the unwanted random movement locations in Fig. 32.
It is clear that the output from the FPGA is capable of identifying the artifacts in the signal, indicating a successful implementation of the proposed optimized algorithm. Table 3 outlines the different attributes of the proposed CWT processor FPGA architecture pre and post optimization. One of the main attributes is the number of scales, which was reduced from S to Ŝ. Moreover, the memory bit requirement was reduced from 23,576,576 bits to 3,708,646 bits, indicating major speed and resources improvement. For example, the required multiplication cycles were reduced from 729,088 to N and the total processing cycles needed were reduced from 1,478,824 to 125,307.
Results of the optimized FPGA implementation are tabulated in Table 4. It shows that the maximum frequency the design can achieve based on the critical path is 74.1 MHz, and therefore a processing time of 1.69 ms is achieved. In addition, the proposed design utilizes 42% of the FPGA logic elements and 72% of the RAM bit locations on an Altera Max10 FPGA, specifically the 10M50DAF484C7G device. The power consumption of the optimized FPGA design is estimated using the power analyzer tool in Quartus Prime and is found to be 363 mW.
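The processing time quoted above follows directly from the total cycle count and the maximum clock frequency. The worked calculation below uses TC for the total required cycles, takes TC = 125,307 and the MATLAB execution time of around 82 ms from the text, and assumes the design runs at its maximum frequency; the symbols T_FPGA, f_max and T_MATLAB are introduced here for the illustration.

```latex
\begin{align*}
T_{\text{FPGA}} &= \frac{TC}{f_{\max}} = \frac{125{,}307}{74.1 \times 10^{6}\ \text{Hz}} \approx 1.69\ \text{ms},\\
\frac{T_{\text{MATLAB}}}{T_{\text{FPGA}}} &\approx \frac{82\ \text{ms}}{1.69\ \text{ms}} \approx 48.5 .
\end{align*}
```

This ratio is consistent with the hardware implementation being about 48 times faster than the MATLAB execution.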
The details of the processing cycles related to each operation in the design pre optimization are presented in Table 5, whereas Table 6 summarizes the processing cycles related to each operation in the optimized design. From Tables 3, 4, 5 and 6, it is clear that there is a significant (91.53%) reduction in the number of clock cycles required for processing the design. This is due to the collective implementation of the multiple optimization approaches outlined in this paper. The significant reduction in the memory resources in the optimized design is due to the same reason. A total of 98.4% reduction in the memory bits requirements is achieved when comparing the basic architecture to the optimized one.
Table 7 compares the proposed optimized work in this article with other state-of-the-art works found in the literature. It is very difficult to establish a fair comparison, as each design uses a different processing platform, CWT algorithm structure, signal length and number of scales in the wavelet function. Nonetheless, the performance of state-of-the-art designs is compared with that of our proposed work considering these different factors. Firstly, the work in [32] implements the CWT algorithm on FPGA to detect the R signal from the ECG signal. The focus of that work was to prove the concept of utilizing the CWT for such an application and to achieve higher accuracy. Therefore, the processing time and speed, and the resources utilized to implement the algorithm on FPGA, were not explicitly reported. Next, the work in [28] is the design most similar to the proposed work. It was developed for feature extraction of ERP from EEG signals. When comparing the number of cycles required for [28] with the proposed design, we find that it needs 76,012 cycles while our design needs 125,307 cycles.
However, it is important to note that the application presented in [28] is different from this work, and therefore a different signal length and different scales are required. The input signal length used in [28] is N/4, whereas our input signal length is N, which is four times longer. This indicates that if a signal of length N were used in [28], it would require roughly four times the number of cycles it requires for a signal of length N/4. Yet, the proposed design requires less than double the cycles needed in [28]. To be more specific, if the proposed design used a signal length of N/4, it would require around 31,326 cycles, which is 58.8% fewer than the 76,012 cycles of [28]. As a result, the proposed design would have a processing time of 0.42 ms compared to the 0.57 ms achieved by [28]. This is largely due to the various optimizations implemented in the design to eliminate unnecessary calculation cycles, to conduct parallel processing and to focus on optimizing the most computationally intensive parts of the algorithm. In addition, the design in [30] implemented a 2D FFT-based CWT to analyze fringe patterns. It achieved a very good processing time of 8 ms relative to the length of the signal. This is due to the use of only one scale in the wavelet, compared to Ŝ scales in the proposed design. As mentioned earlier, the selection of the scales to be used is application dependent. In our case, 25 scales were required to cover a wider range of expected unwanted frequencies in the Doppler signal. However, when the application requires a narrow range of frequencies, it is suitable to use the few scales that cover the specific frequency range.
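The percentages quoted in this comparison can be reproduced from the cycle counts given in the text. The short Python check below assumes, as the argument above does, that the cycle count scales linearly with the signal length and that the scaled design would still run at 74.1 MHz; the variable names are illustrative.

```python
# Cycle counts and clock frequency taken from the text.
cycles_basic = 1_478_824      # basic architecture
cycles_optimized = 125_307    # optimized architecture, signal length N
cycles_ref28 = 76_012         # design in [28], signal length N/4
f_max_hz = 74.1e6             # maximum clock frequency of the proposed design

# Reduction achieved by the optimizations on the same signal length.
cycle_reduction = 1 - cycles_optimized / cycles_basic

# Scaling to the signal length of [28]; assumes cycles grow linearly with length.
scaled_cycles = cycles_optimized / 4
fewer_cycles = 1 - scaled_cycles / cycles_ref28
scaled_time_ms = scaled_cycles / f_max_hz * 1e3

print(f"cycle reduction vs. basic design: {cycle_reduction:.2%}")
print(f"scaled cycle count at N/4:        {scaled_cycles:.0f}")
print(f"fewer cycles than [28]:           {fewer_cycles:.1%}")
print(f"scaled processing time:           {scaled_time_ms:.2f} ms")
```

The linear-scaling assumption is a simplification, since FFT and IFFT cycle counts do not necessarily grow exactly in proportion to the signal length, so the scaled figures are best read as estimates.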
It is not easy to compare the utilization of resources in each FPGA design unless the same FPGA is used. However, this work performed advanced optimizations on the resources to ensure that it can be implemented on a low-end, cost-effective FPGA such as the Max 10 from Altera. Among these optimizations is eliminating the need for the dedicated memory to store the wavelet values. This was done by re-utilizing RAM 4 for that purpose and also for storing the multiplication results. In addition, resource optimization was performed by reducing the memory required to store the outputs of the FFT from N to only 670 locations. Reducing the memory requirements for storing the multiplication results was also part of the resource optimization. It is believed that these techniques position the proposed work to be competitive with other works from the perspective of resource utilization.

VI. CONCLUSION
This work realizes a CWT processor on an FPGA architecture, implemented with several optimization approaches to make it suitable for high-performance data processing applications. The results demonstrate that the proposed FPGA design achieves high calculation speed compared to state-of-the-art designs as well as reduced resource utilization. In addition, there is a significant improvement of the hardware execution time over the MATLAB execution time (48 times faster). This signifies the importance and benefit of utilizing optimized hardware processing implementations on platforms such as FPGA to process complex algorithms.