Frame Boundary Detection and Deep Learning-based Doppler Shift Estimation for FBMC/OQAM Communication System in Underwater Acoustic Channels

Underwater acoustic (UWA) communication is the only option for underwater long-distance wireless communication. However, its data rate is very low, in tens of kbps, and the time-varying nature of the UWA channel exhibits a Doppler effect. The data rate can be increased using a multicarrier technique, such as Filter Bank Multicarrier (FBMC). The FBMC-OQAM system does not require a cyclic prefix (CP) to eliminate intersymbol interference (ISI). Therefore, compared to other multicarrier techniques, the FBMC-OQAM system achieves the maximum spectral efficiency. Despite the well-designed filter, synchronisation errors cause performance degradation in the FBMC-OQAM system. In this paper, we propose frame boundary and Doppler frequency estimation methods for the FBMC-OQAM system under UWA channels. In this study, we propose a preamble-based frame boundary. The conjugate symmetry property is used in the frame boundary estimation. The performance of the proposed technique was studied under different UWA channels with different Doppler scale. The proposed time synchronisation technique yielded multiple peaks under UWA channels. A method for identifying the peak that represents the frame boundary is proposed. We compared the performance of proposed timing metric with Schmidl & Cox timing metric and Reddy.s timing metric. The proposed timing metric exhibited superior performance over existing techniques. In addition, we propose a convolutional neural network (CNN)-based Doppler scale estimation method. The proposed technique was shown to estimate a wide range of Doppler scale factors. The performance of the proposed technique was studied by considering different UWA channels at different SNRs. The proposed CNN architecture is analysed by changing the CNN parameters. Stacked autoencoders were adopted for Doppler scale estimation and the performance of the proposed CNN-based Doppler scale estimation method was compared with the stacked autoencoders based Doppler scale estimation method. Simulation results show that the proposed CNN-based Doppler Scale estimator exhibits good performance in UWA channels at different SNRs and Doppler scale factors.

The speed of sound changes with the depth of the UWA channel, but it remains constant over the shallow water depth. The impulse response of a shallow UWA channel depends on the reflection and refraction properties of the channel, as well as the geometry of the channel. Fading occurs due to reflections from the seafloor, sea surface, and underwater obstacles. Transmitted acoustic waves experience scattering when they encounter small objects, such as water bubbles and fish shoaling [2]. Therefore, as a consequence of multipath propagation, multiple copies of acoustic signals reach the receiver at different delays with different path losses and cause signal smearing due to intersymbol interference (ISI). The channel dispersion length changes with the geometry of the channel, and the velocity of the underwater acoustic signal is low (~ 1500 m/s). This nature of the shallow water environment turns the UWA channel into doubly selective channels, that is, the transmitted signal while passing through the UWA channel, undergoes dispersion in both time and frequency domains [3]. UWA channel characteristics include ambient noise, frequency-dependent attenuation, limited acoustic bandwidth, and a non-negligible Doppler effect. Similar to free-space channels, Doppler frequency is generated in the UWA channel owing to the relative motion of the underwater transmitting and receiving vehicles [4]. Apart from vehicular movement, the UWA channel experiences the Doppler effect owing to two more factors, such as drifting (unintentional motion of transmitter, receiver, or both) and surface wave motion [4].

A. Related Works
In this study, we consider a shallow water environment in the frame boundary and Doppler shift estimation. Underwater long-distance wireless communication is possible only with acoustic waves. As acoustic waves are at low frequencies, the available bandwidth is low. To improve the data rate, multicarrier techniques, such as "Orthogonal Frequency Division Multiplexing (OFDM)" and "Filter Bank Multicarrier (FBMC)", are employed. To achieve high spectral efficiency, the offset QAM (OQAM) technique was adopted in the FBMC system. The OQAMbased FBMC (FBMC-OQAM) system uses well-localized band-limited pulse shaping filters [5][6][7] which overlap in the time domain and require the satisfaction of the orthogonal condition of the filter in the real domain [8]. Therefore, in FBMC-OQAM, in-phase and quadrature-phase modulation symbols are transmitted separately with a half-symbol duration offset between them [9,10]. The real orthogonal condition requirement in FBMC-OQAM leads to intrinsic imaginary interference in the dispersive channels [11]. The OQAM technique solves the intrinsic interference problem through a phase offset between adjacent real symbols. The FBMC-OQAM system does not require a cyclic prefix (CP) to eliminate intersymbol interference (ISI). Therefore, compared to other multicarrier techniques, the FBMC-OQAM system achieves the maximum spectral efficiency [12]. The FBMC-OQAM system generates a very low outof-band (OOB) energy level compared to other filter bank models [13]. FBMC-OQAM was proposed for applications such as cognitive radio and uplink of multiuser multicarrier systems [9], and for future mobile communications, an alternative waveform to OFDM in 5G scenarios such as device-to-device (D2D)-enabled machine-type communications [14][15][16]. In addition, it was considered as a potential candidate for physical layer data communication in future CR networks [17,9].
The impact of time and frequency synchronisation errors on the performance of filter bank multicarrier transmission was investigated in [18][19][20][21][22][23][24][25]. Therefore, accurate time and frequency synchronisation techniques must be designed to handle synchronisation errors. Davide Mattera [26] analysed the performance of an FBMC-PAM system in the presence of a CFO. By comparing the SIRs of OFDM and FBMC-OQAM systems, they proved that the FBMC-OQAM system outperforms the OFDM system when the same number of subcarriers is considered for both multicarrier systems.
Synchronisation techniques can be broadly divided into two categories. These are blind synchronisation techniques and data-aided synchronisation techniques. Mattera and Tanda proposed a blind synchronisation technique for symbol boundary estimation in [27] and blind symbol timing and CFO estimation in [28]. Both techniques exploit the approximate conjugate symmetry property of the FBMC-OQAM signal and require multiple symbols. The second-order cyclostationarity property of the OFDM-OQAM signal was used for blind CFO estimation [29]. As blind synchronisation methods converge slowly, they are not suitable for practical implementation. Therefore, dataaided synchronisation techniques are adopted.
Data-aided synchronisation techniques are further classified into correlation-based and conjugate-symmetrybased techniques. Several correlation-based synchronisation techniques have been investigated. Fusco [30] considered a training sequence made up of identical blocks, and Zeng [31] designed a preamble with periodic properties for the proposed joint time and frequency synchronisation method.
As the training sequence consists of several identical parts, a strong correlation between adjacent blocks still exists in the presence of a time offset (TO). Therefore, in the presence of TO, these techniques did not yield sharp peaks at the symbol boundary.
Seo [32] proposed time synchronisation based on discretely inserting synchronisation symbols into FBMC data symbols. This method provides interference cancellation, but its complexity is high. A pilot-based fine synchronisation method was proposed in [33]. This is useful only after a coarse initial synchronisation is accomplished. Chung [34,35] designed a constant-amplitude zeroautocorrelation (CAZAC) sequence-based preamble for timing synchronisation. It has a distortion-compensating property and achieves higher synchronisation accuracy. In [36], the authors provided an overview of the existing time and frequency synchronisation algorithms for the FBMC-OQAM system. Stitz [37] proposed carrier frequency offset (CFO) and fractional time offset (FTO) estimation methods using scattered pilots. FTO is estimated using the phase shift between two successive subcarriers of the same symbol, whereas CFO is estimated using the phase shift between two subcarriers at the same locations in two adjacent OFDM symbols [36].
Cho and Ma [38] proposed a joint time and frequency synchronisation technique. They modified the conjugatesymmetric (CS) property by considering null subcarriers and suggested a method for proper peak detection for time offset estimation. In addition, they proposed a ratio-based (non-correlation method) CFO estimator. In the frequency domain frame detection method [39], the authors proposed an autocorrelation metric for repeated pilots of the FBMC-OQAM symbol. The training symbol is generated by loading data onto even-numbered subcarriers only, and weighted correlation is applied to the resultant identical sub-blocks for joint estimation of time and frequency offsets [40]. A long training sequence consisting of nine FBMC symbols which are generated using a modified Zadoff -Chu sequence is used for joint time and frequency synchronisation [41].
The conjugate symmetry property has been applied for timing synchronisation, that is, the real data are transmitted on subcarriers of the preamble symbol in [42]. Li [43] generated a conjugate symmetric sequence using a pilot symbol and two auxiliary data symbols and proposed a symbol timing algorithm based on the conjugate symmetry property.
In [44][45][46], the authors proposed timing synchronisation techniques by using specially designed synchronisation sequences with a phase-weighted conjugate symmetric structure. Repeated conjugate-symmetric property in two consecutive FBMC-OQAM symbols has been modified by considering null subcarriers and an algorithm for proper peak detection for joint estimation of TO and CFO [47].
Doppler frequency and Doppler rate are estimated for the FBMC-OQAM system using the correlation between identical parts of a specially designed preamble [48]. The Doppler scaling factor and channel parameters were estimated using high-resolution harmonic estimation methods [49]. UWA signal processing techniques have been introduced in [50,51]. Zhou et al. considered zero-padded OFDM to minimise the transmission power and proposed a null subcarrier-based Doppler compensation technique [52], who proposed non-uniform Doppler compensation via resampling and uniform compensation on the residual Doppler [53]. The authors [54] used Preamble with HFM signals to estimate the frame boundary of a signal in a UWA channel and proved that it outperforms the existing LFMbased approach with high frame boundary detection ability. A brief overview of signal processing methods related to a single carrier and OFDM is presented in [55]. A two-step procedure for decision-feedback equalisation was proposed for filtered multitone modulation (FMT), and its performance was analysed [56]. Amini et al. [57,58] showed that FBMC outperforms OFDM for underwater acoustic communications.
The cross-section of the cyclic spectral correlation function of the received signal was used for the CFO estimation [59]. FrFT is applied to linear frequency modulated (LFM) signals in the preamble to determine the Doppler shift [60]. FrFT is applied to the LFM signal and hyperbolic frequency modulation (HFM) signal for the determination of joint time and frequency synchronisation [61]. A time and frequency synchronisation algorithm based on FrFT on the preamble with a symmetrical triangular linear frequency modulation (STLFM) signal is proposed for OFDM systems in UWA channels [62,63]. Instead of a cyclic prefix (CP), zero padding (ZP) is used with OFDM and proposed for channel equalisation for the UWA channel [64]. The preamble with a hyperbolic frequency modulation (HFM) signal is used for timing position and Doppler factor estimation [65]. By applying a fractional Fourier transform to the preamble containing a chirp signal, the authors [66] proposed that the time and frequency synchronisation for the signal is the UWA channel. A method of matched filter banks along with weighted summation and adjacent peaks algorithms is used for Doppler scale estimation of OFDM signals in UWA channels [67]. The authors [68] studied the maximum likelihood approach [69] based Doppler scaling coefficient of the OFDM signal in the UWA channel. The accuracy of the algorithm depends on the proper selection of the training sequence length and window length. The training symbol is generated by using two chirp symbols, in which the first symbol consists of an up-chirp and the second symbol consists of a down-chirp. At the receiver, the time-difference-of-arrival (TDoA) between two received chirp symbols is measured, and the Doppler shift is determined from this time offset [70].
Neural networks, which are trained using the Levenberg-Marquardt rule, are used in Doppler frequency estimation and show that they are faster than backpropagation methods [71]. With powerful feature extraction capabilities, convolutional neural networks (CNNs) have become unique and superior models in applications related to computer vision, such as autonomous driving and human-computer interaction. In recent years, CNNs have been applied in signal processing. Yann LeCun and Léon Bottou described the CNN in detail [72]. Deep learning (DL) has been used for channel estimation and equalisation in FBMC systems [73]. Ninkovic [74] presented a performance and complexity analysis of packet detection and CFO estimation using both conventional and DL-based approaches. Kojima [75] developed spectrogram images by concatenating PSD values over multiple OFDM symbols and applied spectrogram images as input to the CNN for joint classification of SNR and Doppler shift.

B. Paper Contributions
Our main goal is to estimate the frame boundary and Doppler frequency for the FBMC-OQAM system for communication through UWA channels with a low minimum error probability. Deep learning-based techniques have a proven track record of extracting hidden features from the input image for image classification and are more appropriate for attaining accurate Doppler frequency estimation results. Therefore, we considered CNN for Doppler feature extraction. We considered a specially designed preamble [76] suitable for the FBMC-OQAM system and developed images for input to the CNN. Our major contributions in this study are as follows: 1) To present a suitable method to estimate the frame boundary in the UWA environment, we considered the periodic conjugate-symmetry present in the preamble. A procedure for selecting an appropriate peak that falls on the frame boundary is also suggested. 2) To present Doppler frequency estimation, we considered a CNN-based approach, and images were developed using the periodic conjugate symmetry property in the preamble.
3) The superior performance of the proposed methods is presented through simulation results.
The remainder of this paper is organised as follows. Section II introduces the FBMC-OQAM system model. Section III presents the frame boundary estimation and DL-Doppler shift estimation methods. Section IV describes the proposed CNN architecture. Section V presents the simulation results of the performance of the proposed methods for the FBMC system in the presence of the UWA channel, and Section VI presents the conclusions.

II. SYSTEM MODEL OF FBMC
In FBMC, initially randomly generated bits are sent to symbol mapper i.e. OQAM symbol mapper to map random bits to OQAM symbols. The OQAM symbol is generated by providing an offset of half of the symbol duration (T) to the QAM symbol so that the real and imaginary parts are transmitted separately with an offset of T/2. These OQAM symbols are parallel to the serial converter, and then modulation is applied using IFFT which acts as a multicarrier modulator. The IFFT produces time-domain samples, and these samples are given to the synthesis filter bank. A block diagram of the FBMC-OQAM system is shown in Fig. 2. The synthesis filter bank used in this block was a PHYDYAS prototype filter. The prototype filter is designed using certain frequency domain coefficients, and the shifted versions of this prototype filter form a filter bank to remove ICI and maintain orthogonality between subchannels.
The output of the synthesis filter bank was passed through an underwater acoustic channel. At the channel, some noise is added owing to the Doppler spread and multipath effects. This signal is received by an MMSE equaliser which is used to compensate for the channel effects. The equalized signal is passed through an analysis filter bank i.e. PHYDYAS prototype filter bank which detects the original samples. These samples are passed through a serial-to-parallel converter and then passed through FFT to recover the data symbols because FFT acts as a multicarrier demodulator. These recovered data symbols are passed through the OQAM de-mapper to obtain the original bits.
Baseband equivalent FBMC signal is given by where N is the number of subcarriers, d (m,k) is the realvalued offset QAM (OQAM) symbol transmitted over the k th subcarrier, and the m th time slot, g (m,k) (t) represents the time-frequency translated version of the symmetrical realvalued prototype filter g(t) which is given by This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Here, T 0 /2 is the time interval between real-valued symbols, F 0 is the subcarrier spacing, and T 0 = NT s , where T s is the sampling interval. Also, F 0 = 1/T 0 . The additional phase shift connected with the OQAM method is represented by  (m,k) , and it depends on the time and frequency slots as We considered the PHYDYAS prototype filter with an overlapping factor of K = 4. The PHYDYAS prototype filter is defined as: where P 0 = 1, P 1 = 0.97196, P 2 = 0.707, The peak of the PHYDYAS prototype filter occurs at t = KT 0 /2, whereas for the filter with overlapping factor K = 4, the peak occurs at t = 2T 0 . We considered the UWA channel with channel impulse response where h l ,  l , and l d f represent the complex amplitude, path delay, and Doppler frequency along the l th channel path, l = 0,1,2,…,L-1, and L is the total number of channel paths in the UWA channel. Now the FBMC signal is passed through the UWA channel. The channel output becomes We assume equal Doppler frequency shift along all channel paths. That is By substituting (7) and F k = kF 0 in (6) and adding additive white Gaussian noise w(t), the received signal y(t) can be expressed as From this channel distorted and Doppler-affected noisy FBMC signal, we have to determine the frame boundary and Doppler shift.

III. PROPOSED METHOD
We propose a preamble-based frame boundary estimation method and a deep learning-based carrier frequency offset. The preamble data are known to the receiver. It is used for time and frequency synchronisation, and channel estimation. The carrier frequency offset is caused by (i) the mismatch between the transmitted carrier frequency and local oscillator frequency and (ii) the Doppler frequency shift due to the mobility of either the transmitter, receiver, or both. In this study, we consider the Doppler frequency shift estimation.
The frame structure of the proposed FBMC system is shown in Fig. 3. The frame consists of a preamble, followed by a data payload with FBMC symbols. The first sample index, n = 0, represents the frame boundary. The length of the FBMC symbol is represented by L p .

A. Preamble Generation
We considered the preamble sequence structure proposed by Ma [76]. By using a Zadoff-Chu (ZC) sequence, an FBMC signal with an overlapping factor K = 4 and signal length L p = 4N, where N represents the number of subcarriers, is generated. Then, the FBMC signal is copied four times and all five copies of FBMC signals are superimposed after shifting each successive FBMC signal by N samples, as shown in Fig. 4 [76], to generate a preamble sequence x[n].  The FBMC symbol was generated using the simulation parameters listed in Table I. FBMC symbol length is L p = KN = 4N = 512. Five copies of the FBMC symbol were superimposed, as shown in Fig. 2. Mathematically, we can represent this as follows: Let the four equal parts of the FBMC symbols be A, B, C, and D, and be expressed as A, B, C, and D are expressed as The preamble is now expressed as , , , , , As shown in Fig. 4, the length of the preamble sequence x[n] is 2L p = 8N and it consists of two identical parts. The identical parts of the preamble sequence are mathematically expressed as A plot of the amplitude of the preamble sequence is shown in Fig. 5.

B. Proposed Frame Boundary Estimation Method
The two identical parts x 1 [n] and x 2 [n] exhibit conjugate symmetry properties, as shown below.
Let us divide each identical part of the preamble sequence into two equal-length sequences as 11 (29) We considered the conjugate symmetry property stated in (26) for the timing metric, and we propose the timing metric for frame boundary estimation as The normalized timing metric is given by represents the average power of the sequence and it is expressed as yields multiple peaks at regular intervals of N/2.
To understand the performance of the proposed timing metric, we generated a frame consisting of a preamble followed by a data payload with 10 FBMC symbols using the simulation parameters listed in Table 1. The length of the FBMC symbol is 512, and that of the preamble is 1024. Then, we applied the proposed timing metric (31) to the frame, and the resultant magnitude plot of the proposed metric is shown in Fig. 6. In the preamble part of Fig. 3, we obtain several significant peaks with amplitude 1, and the data part also consists of significant peaks at regular intervals of 448 samples and each peak with adjacent peaks at 97 and 64 samples from it.
To understand the peaks in the preamble part in detail, we considered the enlarged version of the preamble part of Fig. 3, as shown in Fig.7. This plot consisted of seven significant peaks at regular intervals of N/2 = 64. The first six peaks have equal amplitudes and are equal to 1, whereas the 7 th peak amplitude is approximately 0.6. The first peak appeared at n = 0 which represents the frame boundary. Therefore, the proposed timing metric indicates the time index of the first peak as the frame boundary.

C. Doppler Scale Estimation
Shuo Ma [76] proposed a fractional Doppler frequency shift algorithm by considering the correlation between two identical parts of the preamble sequence.
Now the Doppler shift factor is determined as

D. CRLB for Estimation of Doppler Scale
The Cramer-Rao Lower Bound (CRLB) sets a lower bound on the variance of any unbiased estimator [77]. Assuming perfect timing synchronization, we have estimated the Doppler scale using (33) and (34).
As presented in (15) and (16) This algorithm works excellently in the AWGN channel and yields erroneous results in the presence of the UWA channel. Therefore, we propose a deep learning (DL)-based Doppler shift estimation algorithm.

E. Proposed DL-based Doppler Shift Estimation Method
As discussed in the timing metric and Doppler shift estimation, the preamble exhibits both correlation and conjugate properties. These properties are considered to be features of deep learning.
From Conjugate Symmetry properties, we consider four features as shown below.
The correlation feature grid is given by We considered combining conjugate symmetry and correlation feature grids for Doppler shift estimation using deep learning. 1 { , } YZ  (44) The features of the preamble are shown in Fig. 8. The amplitude of the conjugate symmetry-related features is shown in Fig. 8(a) to Fig. 8(d), and the amplitudes of the correlation-related features are shown in Fig. 8(e) Fig. 8(h). A convolutional neural network (CNN) was used in deep learning.
F. Pre-processing CNN works only on real data, whereas our feature grid consists of complex values. Therefore, it is necessary to convert complex values to real data. During pre-processing, we divided each complex feature vector into two real feature vectors representing the real and imaginary parts of the complex vector separately. After pre-processing, the Doppler feature grid suitable for the CNN is given by: where Y R and Y I represent the real and imaginary parts of feature grid Y. Similarly, Z R and Z I represent the real and imaginary parts of the feature grid Z.

IV. DESCRIPTION OF CNN MODEL
Initially, a CNN was developed for computer vision, where a two-dimensional image database was analysed for classification-related applications [79]. A CNN is a special form of a multi-layer neural network. The CNN architecture consists of several convolutional layers, ReLU layers, pooled layers, a fully connected layer, a softmax layer, and a classification layer stacked together [80].
We propose two versions of CNN architectures. CNN architecture -1 is as shown in Fig. 9. The basis of the CNN is the convolutional layer, which performs the convolution operation on the input image. In this study, we consider a feature grid of size N/4 × 8×1 as an input image to the CNN model.
The proposed CNN architecture -1 is presented in five sections, where each section represents a set of layers. The first section consists of a convolutional layer, ReLU layer, and average pooling layer. The convolutional layer passes this image through a set of convolutional filters to create feature maps from the features hidden in the image. The convolutional layer consists of eight filters, each with a typical size of 3×3. This is followed by an activation function, known as a rectified linear unit (ReLU), which maps all negative sample values to zero so that features with only positive sample values are carried forward to the next layer. Then, the average pooling layer computes the average pixel values in the rectangular regions. We considered a pool size of 2, that is, the height and width of the rectangular regions were both 2. In addition, the stride i, the step size of 2, is considered so that the pooling regions do not overlap.
The output of the first section is applied to the second section which consists of four layers: convolutional layer, batch normalisation layer, ReLU layer, and average pooling layer. A convolutional layer with 16 filters, each with a filter size of 3×3. The batch normalisation layer performs normalisation along the mini-batches. It speeds up the training involved in the convolutional layer and makes learning easier. The batch normalisation layer is followed by the second ReLU layer and the second average pooling layer. The average pooling layer output is applied to the third section which consists of three layers: convolutional layer, batch normalisation layer, and ReLU layer. The convolutional layer consists of 32 filters, each with a filter size of 3×3, followed by the batch normalisation layer and ReLU layer. The three layers of the third section are repeated to form the fourth section. The first layer of the final section is the dropout layer. We considered a dropout layer with a dropout probability of 0.2. This means that the dropout layer randomly selects 20% of the pixels of the image and nullifies them. A fully connected layer utilises the features extracted from the previous stages and predicts the class of the image, whereas the softmax layer performs a multi-classification task. It produces classification results which include the probability of each classification class. The final layer of the proposed CNN is the classification layer. It computes the cross-entropy loss for each classification class obtained from the softmax layer and performs weighted classification on mutually exclusive classes.
We propose CNN architecture -2 as shown in Fig. 10. It is the simplified version of CNN architecture -1. It consists of one convolutional layer with 32 filters, each with a filter size of 3×3. It is followed by the batch normalisation layer, ReLU layer and Dropout Layer with 20% dropout. Dropout layer is followed by fully connected layer, softmax layer and finally classification layer.

V. UWA CHANNEL SIMULATION
Underwater and acoustic channels present a huge challenge for high-speed and reliable communication because they are not only vulnerable to interference and noise but are also highly affected by impediments, Doppler shift (moving environment), and changing communication environment dynamics. The transmitted radio signals are likely to be diffracted, scattered, reflected, and attenuated during their transmission. Therefore, when the acoustic signals reach the receiver, they may have different transmission paths and random phases, and their power is attenuated. The UWA channel exhibits three main characteristics: low speed of sound that varies with medium This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3148410, IEEE Access conditions, time-and frequency-dependent attenuation, and time-varying multipath propagation that depends on boundary conditions [81]. In an underwater acoustic channel, the Doppler frequency shift is caused by (i) the drifting of the transmitter/receiver due to water movement, (ii) the reflections caused by the surface wave motion, and (iii) if either the transmitter, receiver, or both are mobile and move at a particular speed. Among the three types of Doppler scaling factors, the Doppler scaling factor due to the mobility of the vehicle dominates the remaining two. However, Doppler cannot be neglected due to drifting and surface waves.
We simulated the UWA channel using the Stojanovic channel model [4]. This model assumes flat bathymetry, surface shape, and sound speed profile, and provides good results for flat bathymetry. To obtain large-scale simulations, we used Bellhop [82]. The UWA channel geometry considered in the channel simulation is shown in Fig.11. We considered a bathymetry of 100 m depth, one transmitter, and one receiver located at 80 m and 40 m from the bottom of the channel (seafloor), as shown in Fig. 10. The transmitter and re.ceiver were separated by a range of 1000 m.
The varying sound speed profile (SSP) is considered with sound speed changing in the range of 1539 m/s to 1534 m/s at a UWA channel depth of 100 m. Additionally, the surface of the UWA channel is considered non-flat.
With the ray-tracing option, Bellhop produces an Eigen ray file containing a fan of rays emanating from the source. Bellhop computes the arrival time and amplitude data. Doppler scaling factors due to drifting and surface wave motion are included in the channel simulation, and the Doppler scaling factor due to vehicular motion is applied externally. The instantaneous channel gain of the simulated channel model is shown in Fig. 12.
Simulations were conducted in MATLAB over 1000 iterations with 360 different UWA channel impulse responses. Six paths were considered in the UWA channel simulations. As shown in Fig. 13, the simulated channel consists of six paths with approximate delays of 0 ms, 0.2 ms, 1.1ms, 2, 4, and 8 ms. These delays change slowly with time owing to the Doppler effect. Fig. 13 shows the normalised channel impulse responses for four different iterations. In Fig. 13(a), the 3 rd path is the strongest path, the first three paths are strong in Fig. 13(b), 2 nd and 3 rd paths are strong in Fig. 13(c) and 2nd path is the strongest, as shown in Fig. 13(d). The path delay by the longest significant path is called the total multipath delay spread. In the simulations shown in Fig. 13, the total multipath delay spread was 8 ms.
We repeated the experiment for different SNR values. A total of 1000 different channel and noise realisations were conducted at each SNR value with SNR ranging from 0 dB to 40 dB in steps of 5 dB.
For the impulse response of the channel along each path, the normalised channel impulse responses of Fig. 13 are shown for a 2 ms delay period in Fig. 14. The channel impulse response along each path changed in each iteration. As shown in Fig. 14, the delay spread of individual paths may be different but less than the total multipath delay spread.

A. Performance of Timing Metric
With the simulation parameters in Table 1, a frame with the preamble followed by a data payload with 10 FBMC symbols was generated. The performance of the timing metric in the ideal case is shown in Fig. 6. The time offset of 10 samples was added to the frame, then it was passed through the UWA channel and a Doppler shift was applied at the channel output. Then, additive white Gaussian noise at SNR = 0 dB was added. Hence, the received signal is time-shifted and distorted owing to the channel, and the Doppler affects the transmitted noisy signal. The timing metric was applied to the received signal, and the resultant magnitude plot is shown in Fig. 15.
The enlarged version of the preamble part of Fig. 15 is shown in Fig. 16. Fig. 16 shows several significant peaks. Significant peaks at sample indices {26, 90, 154, 218, 282} exhibit periodicity of N/2 = 64 samples with different amplitudes due to channel distortion. The first peak occurred at time index n = 26. That means the proposed metric produces a significant peak along the first channel path. Therefore we used threshold level V T for peak identification.
A threshold level V T is set in such a way that the miss detection probability P(M) and false detection probability P(Fa) are very low as compared to correct detection probability P(C). With the proposed timing metric amplitude more than threshold level V T , if the first peek of the timing metric falls at the frame boundary i.e., at the time index d = d opt , it is called the correct detection, else it is called a false alarm. If the proposed timing metric amplitude is less than threshold level V T , it is called miss detection. Mathematically, probability of correct detection is defined as The probability of false alarm is defined as Similarly, the probability of miss detection is defined as We noted the number of correct detections, number of false detections, and missed detections of frame boundaries by varying the SNR in steps of 5 dB from 0 dB to 40 dB. The probability of correct detection is calculated as the ratio of the number of correct detections to the total number of detections. The probability of false alarm and probability of missed detection were calculated similarly.

B. Comparison of Proposed Timing Metric with Existing Timing Metrics
We compared the performance of the proposed timing metric with three reference methods. The first method is Schmidl & Cox [83] method. Cox"s timing metric estimates the frame based on the correlation between two repeated sections of the preamble. FBMC symbol does not contain CP but, as shown in Fig. 4, the preamble contains three partially identical sections on either side of two identical sections. Therefore, the Schmidl & Cox method on two identical halves of the preamble produces a plateau and causes ambiguity in frame boundary detection. For the frame boundary detection in the presence of plateau, averaging method Schmidl and Cox proposed an averaging method in which 90% of points on right and left of the maximum point are found and an average of them is considered as frame boundary. The second reference method is Chung"s [40] method and. it is the improved version of Smidl & Cox"s method, and it is based on the weighted correlation between two repeated sections of the preamble. Our proposed method is based on conjugate symmetry present in the preamble sequence and identification of the first significant peak. The third reference method is Liming"s timing metric [84] and it is based on weighted conjugate symmetric sequence. We compared the performances of the four methods under consideration through MATLAB simulations. We added complex Gaussian noise to the received signal. The mean of the noise is zero while its variance depends on SNR. We conducted the simulation experiments 1000 times each time with different noise realizations. We considered the root-mean-square error (RMSE) as a metric for checking the accuracy of the timing metrics.

RMSE is defined as
where λ FB represents the actual frame boundary and i  repsents i th estimated frame boundary and N represents the total number of observations. We varied SNR in steps of 3 dB from 0 dB to 30 dB and determined the RMSE of the timing metric at each SNR value. With the AWGN channel, the proposed method and Liming methods yielded exact frame boundary i.e., zero RMSE for all SNR values from 8 dB onwards and small RMSE for SNR  8 dB as shown in Fig. 18. With AWGN channel, RMSE of Cox"s method varies in the range 66 to 41.5 for SNR changes from 0 dB to 30 dB. Chung"s method exhibited superior performance as compared to Cox"s method and RMSE of Chung"s method varies from 26 to 8.7 over SNR = 0 to 30 dB.
We repeated the simulation experiments 1000 times each time with different channel and noise realizations for two different Doppler scale values  = 0 and  = 0.006. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3148410, IEEE Access As shown in Fig. 19, Cox"s method exhibits the highest RMSE as compared to the other three methods, and its RMSE increases as the Doppler scale increases. Chung"s method exhibits better performance than the Cox"s method. Whereas the proposed method and the Liming method exhibited nearly similar performance irrespective of the Doppler scale values. RMSE of all the methods increased with UWA channels.
The computational complexity of Schmidl & Cox"s method and the proposed method is the same because both require N complex multiplications and additions per sample. The computational complexities of the Chung"s method and the Liming"s method are 2N complex multiplications and N additions each.

C. Generation of Dataset
An A-frame with a preamble followed by a data payload was generated. The data payload consists of ten FBMC-OQAM symbols. A time-shifted frame was passed through the UWA channel to a frame time offset of 10 samples. A Doppler frequency offset is added to the channel output. At the receiver, we first estimated the frame boundary using the proposed timing metric, and then a feature grid of size 31×16, where 31 is the number of channels and 16 is the number of samples, was generated by applying the correlation and conjugate symmetry properties on the preamble. Its Doppler scale is considered as the label.
We considered the maximum underwater vehicle speed to be 150 km/h, which results in a Doppler scale equal to 0.027. The Doppler scale was varied from 0 to 0.027 in steps of 0.001. For each Doppler scale, 1000 feature grids were generated by considering 1000 different UWA channels. For 28 Doppler scales, we generated 28000 feature grids and 28000 corresponding labels. All 28000 feature grids were preprocessed and applied to train the CNN. With this process, we prepared a dataset consisting of 28000 pairs of feature grids and their corresponding labels.

D. Performance of CNN-based Doppler Shift Estimation
Out of 28000 feature grids of the dataset, 256 feature grids were randomly selected for validation, 256 feature grids were randomly selected for testing, and the remaining were used for training the CNN.
The proposed CNN consists of 18 layers, including an input image layer, four convolution layers, four ReLU layers, three batch normalisation layers, two average pooling layers, one dropout layer, one fully connected layer, one softmax layer, and a classification layer, as shown in Fig. 3. Although the basic architecture of our proposed CNN is similar to the well-known computer vision-related networks, we empirically determined the exact configuration based on pilot studies on Doppler signal input data.
Zero-padding was used in the Doppler feature grid before each convolution. The number of filters in the convolution layers is 8, 16, 32, and 32, respectively, and all filters have a filter size of 3×3. Average pooling was performed on 2×2 regions with a stride of two. Batch normalisation was used to normalise the mini-batches across each channel. The ReLU layer eliminates negative values in the feature maps produced by the convolution layer. The dropout layer is used with a probability of dropout chances of 0.2 and this layer nullifies 20% of randomly selected data in the feature maps. The network was trained using a stochastic gradient descent with momentum (SGDM) algorithm with an initial learning rate of 0.0003. The network was validated using a validation frequency of 21 iterations. The layers, size of kernels, and number of kernels for each layer used in the proposed CNN architecture are listed in Table II.
The stochastic gradient descent with momentum (SGDM) optimiser with the following parameters was used:  the number of epochs = 5,  Learning rate = 0.0003 and  Mini batch size = 256 (number of feature grids and corresponding labels  Table II The layers, size of kernels, number of kernels for each layer used in the proposed CNN architecture 14 iterations and a 0.0003 learning rate yielded 99.22% validation accuracy by the time it completed the 5 th epoch. The data loss validation curve (in Fig. 18) does not have a plateau at the beginning; hence, the weight initialisation is good. The initial data loss is 3 which is a low value and exponentially reduces to 0.5, by the 60 th iteration (within the first epoch) and then gradually reduces to zero by the time it completes the 5 th epoch, that is, 360 iterations. This indicates that the selected learning rate is sufficient for the loss validation curve to reach a local minimum within a few epochs.
The confusion matrix is illustrated in Fig. 19. Randomly selected 256 feature grids containing different Doppler scales. For example, nine feature grids relate to zero Doppler scale, seven feature grids correspond to Doppler scale equal to 0.001, and so on. The applied Doppler scales are called true classes. After the CNN training progresses, the estimated Doppler scale is called the predicted class. Our proposed method predicted 253 true classes, and three classes were predicted as their adjacent class; that is, the proposed CNN-based Doppler scale estimation method achieved 98.83% testing accuracy.
The performance of the proposed CNN architecture-1 and CNN architecture-2 based classifiers were tested using different parameter values. Test accuracy depends on several parameters. The test accuracy is determined by varying the number of epochs, learning rate, batch size, and SNR. The parameters are listed in Table III. Fig. 19(a) shows the test accuracy of the classifiers with respect to the learning rate. Test accuracy of CNN classifier-1 and CNN classifier-2 is 72% and 48% respectively at a learning rate of 0.0001. The test accuracy increased up to 0.0005 and then decreased. The number of epochs was varied from 1 to 15 in steps of 1, and test accuracy was measured for each epoch value. Fig. 19(b) shows the performance of the proposed CNN classifiers with respect to the number of epochs. The test accuracy increased up to the 7 th epoch and then decreased. Test accuracy of CNN classifier-1 is higher than that of CNN classifier-2 at all epoch values from 1 to 15. For the default value of the number of epochs, that is, ten epochs, the test accuracy of CNN classifier-1 was 82%, whereas the test accuracy of CNN classifier-2 was 77% for the number of seven epochs.
The batch size was selected from the set of values {32, 64, 128, 256, 512, 1024}. As the batch size decreased, the number of iterations per epoch increased and took more time. At the lower batch sizes (32 or 64), CNN yields high validation accuracy, but the training accuracy shows large variation as compared to the validation accuracy at each iteration. The training progress curve closely followed the Fig. 18. Training Progress at SNR = 5 dB   Fig. 19(d), the accuracies of CNN classifier-1 and CNN classifier-2 at 0 dB SNR are 83.5% and 76.66% respectively and accuracies of both classifiers are 100% from SNR = 10 dB onwards.

E. Stacked Autoencoders based Doppler Shift Estimation
Mingqian Liu et al [85] proposed stacked autoencoders based wireless signal classification. Choi-Williams distribution (CWD) was used to generate two-dimensional time-frequency images corresponding to wireless signals. This feature classification method needs further processing like conversion of time-frequency images to binary images, application of cropping algorithm for searching presence of signals, and adjustment of cropped time-frequency images to the required image size by a bicubic interpolation algorithm. The entire process is very complex and timeconsuming.
In this paper, we consider stacked autoencoders for Doppler frequency estimation. Instead of applying CWD and processing algorithms on the input signal, we consider feature grids as input images to the stacked autoencoder network for comparing its performance with the performance of CNN classifier. Feature grids are preprocessed to acquire grey image properties i.e., pixels are positive integers in the range 0 to 255.
Complex feature grid is converted to two real grids containing real and imaginary parts of the complex signal grid. The magnitude of each grid is taken to have all positive values. Grids contain fractional values; they are multiplied with 10 4 and rounded off to the next integers.
We considered two types of stackednets. Architecture of Stackednet -1 consists of two sparse autoencoders joined with softmax layer as shown in Fig. 20. Input feature grid size is 31×16 i.e, input array to the stackednet is 496 samples. The hidden size of autoencoder-1 is 100 and hidden size of autoencoder-2 is 50.
Each sparse autoencoder of stackednet-1 was trained separately in an unsupervised manner and softmax layer was trained. Then all the sections were cascaded to form stacked autoencoder network. Finally stacked network was trained in a supervised manner.
The number of neurons in the hidden layer of autoencoder 1 was set to 100. The coefficient for L 2 weight regularizer in the loss function with weight regularization factor 0.004 was used for the weights of the network. The sparsity regularizer controls the sparsity of the output from the hidden layer. Sparsity Proportion and sparsity regularization are the parameters of sparsity regularizer and they are set to 0.15 and 4 respectively. Autoencoder 1 was trained with the maximum number of epochs equal to 100. As shown in Fig. 20, the autoencoder consists of an encoder followed by a decoder. The encoder maps an input of size 496 samples to 100 samples using 100 hidden neurons. Decoder follows the reverse mapping i.e., reconstructs the input size of 496 samples.
The autoencoder-1 was trained with the training dataset consisting of 8000 received feature vectors each with 496 samples and it produces the same size data at the output. The decoded data of size 496×8000 from the autoencoder-1 was further encoded to 100×8000 size for getting new feature vectors.
The block diagram of autoencoder-2 is as shown in Fig.  21. Similar to the autoencoder-1, autoencoder-2 was designed with the following parameters in Table IV. Input dataset to the autoencoder-2 is the encoded output of autoencoder-1 which is of the size 100×8000. After training the autoencoder-2, we get the encoded output of the autoencoder-2 with a size 50×8000.
In dataset was generated as described in subsection B of Simulation Results in section VI by considering 10 different Doppler scales from zero in steps of 0.001. The softmax layer follows the Autoencoder-2 and is the final layer in the proposed architecture of the Stackednet-1. The softmax layer activation function returns the probability of each of ten classes of Doppler shift. The block diagram of the   Fig. 22. Input to the softmax layer consists of 50-dimensional feature vectors. With the help of labels of training data, The softmax layer was trained in a supervised manner. It converts the input dataset of size 50×8000 to 10 classes with corresponding probabilities.
. The complete architecture of stackednet-1 is shown in Fig. 23. 1024 feature vectors from 10 different classes were applied to stackednet-1 and tested its performance.
A Confusion matrix is used to describe the performance of the stackednet-1. Fig. 24 shows the confusion matrix of the stackednet-1 classifier at SNR = 5 dB. It shows overall test accuracy of 76%.
We propose an architecture of the stackednet-2 as shown in Fig. 25. It consists of one sparse autoencoder followed by the softmax layer. The hidden size of the autoencoder is 100.
The stackednet-2 was trained for the same dataset and tested its performance for the same set of feature vectors that were used for the stackednet-1. It exhibited overall test accuracy of 90.3% at SNR= 5 dB which is very high as compared to the test accuracy of 76% of the stackednet-1.
Performance of proposed CNN classifiers and stackednets was compared in terms of test accuracy with respect to SNR. SNR was varied 0 dB to 30 dB in terms of 5 dB and test accuracy of each of the four classifiers was estimated. Fig. 27 shows the plot of the test accuracy with respect to SNR for the CNN classifier-1, CNN classifier-2, stackednet-1 and stackednet-2. At SNR = 0 dB, CNN classifier-1 exhibits superior performance as compared to other classifiers. Its test accuracy was 83% which is 5% more than the test accuracy of CNN classifier-2.
Test accuracies of stackednet-1, and stackednet-2 are 31.6% and 40.2% respectively. Test accuracy of stackednet-2 is s 90.3% which is the same as that of CNN classifier-2. CNN classifier-2 yields 97% test accuracy at SNR = 5 dB, and it is the highest accuracy among all the four classifiers. Except stackednet-1, the remaining three classifiers exhibit 100% test accuracy from SNR = 10 dB onwards.
We considered the off-line training time in the computational complexity analysis. The off-line training  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3148410, IEEE Access time of CNN classifier-1 is 428.71 seconds, the off-line training time of CNN classifier-2 is 197.97 seconds, the offline training time of the stackednet-1 is 220.29 and the offline training time of the stackednet-2 is 217.32 seconds. From this computational complexity analysis, it can be seen that the computational complexity of the proposed CNN classifier-2 is the lowest among all the four classifiers.

VII. CONCLUSION
This paper proposes frame boundary and Doppler scale estimation algorithms for the FBMC-OQAM underwater communication system. The proposed algorithms and computer simulation results were as follows. 1) A data-aided frame boundary estimation technique suitable for the FBMC-OQAM system in the UWA channel is presented. Specifically, we developed a technique based on the periodic conjugate symmetry present in the preamble. Owing to its periodic nature, the timing metric produces multiple peaks. In particular, the proposed timing metric yielded six significant peaks during the preamble, and the first peak indicated the frame boundary. The performance of the proposed frame boundary estimation method was compared with timing metrics related to Schmidl & Cox, Chung, and Liming. The accuracy of the timing metrics was checked with the RMSE parameter. For the demonstration of the performance of timing metrics, simulations were conducted for AWGN and UWA channels. From the simulation results, it was found that the RMSE of the proposed metric is very close to the RMSE of Liming"s method and also, the computational complexity of the suggested method is lesser than that of Liming"s method.
2) The FBMC-OQAM system is prone to a severe Doppler effect in the UWA channel. We recommend two types of CNN classifiers for Doppler scale estimation. The frame boundary indicates the first sample of the preamble. We applied the correlation and conjugate-symmetry properties to the preamble and generated feature grids. After pre-processing, feature grids were applied to the CNN, and the Doppler scale was estimated using a CNN-based classification approach.
3) The PHYDYAS shaping filter was considered in the FBMC-OQAM symbol generation. The proposed time and frequency algorithms exhibit superior performance in the AWGN and UWA channels. The proposed timing metric performs well for an SNR ≥ 5 dB. At SNR = 5 dB, it exhibits a 67% correct probability of accurate frame boundary detection, 33% false frame boundary detection, and zero miss detection. The false detection-related peaks fall along the strongest path of the channel. The probability of correct frame detection was 100% for an SNR ≥ 10 dB. The proposed CNN classifiers-based Doppler scale estimation algorithm yielded 100% testing accuracy for SNR ≥ 10 dB. 4) Stacked autoencoders were considered for Doppler scale estimation. Feature grids were preprocessed to form a feature vector dataset. We propose two types of Stackednets. They were trained with the dataset and tested. At 10 dB SNR, The stackednet-1 yielded 94% testing accuracy whereas stackednet-2 yielded 100% testing accuracy. 5) CNN classifier-1 exhibits superior performance, but its computational complexity is very high. Computational complexity of the stackednet-1 is low, but its testing accuracy is low. CNN classifier-2 and stackednet-2 exhibit the same testing accuracy for SNR ≥ 5 dB and their computational complexities are also around 200 seconds.