On the Low Complexity Implementation of the DFT-Based BFSK Demodulator for Ultra-Narrowband Communications

The DFT-based demodulator for BFSK has been introduced for applications where the received signal experiences a carrier frequency offset (CFO) much larger than the symbol rate. The Ultra-Narrowband (UNB) communication techniques have been introduced for implementing the emerging Low Power Wide Area Networks (LPWAN). Since UNB communication is prone to CFO, a DFT-based BFSK demodulator is an interesting option for this type of communication. However, for proper operation in a large frequency offset, the DFT-based demodulator requires a complex window synchronization which is not desirable for low power nodes. The main source of complexity, is calculating the DFT of a window which slides over the preamble. In this work, the complexity is alleviated by considering the window synchronization algorithm and its implementation together. First, a new window synchronization algorithm is proposed which is designed such that an efficient class of implementations of the sliding DFT (SDFT), called Single Bin SDFT (SB-SDFT) in this work, can be used. Moreover, a new stable implementation of SB-SDFT is designed to enable zero-padding which is required by the demodulator. The complexity of the proposed algorithm implemented using the SB-SDFT, scales more efficiently compared to the conventional algorithm when the range of tolerable CFO increases. Using the proposed method, for a CFO tolerance in the order of 14.5 times the symbol rate (±14.5 kHz for a symbol rate equal to 100 Hz), the number of complex operations is reduced by more than 85% (and memory by 90%) compared to the conventional method.


I. INTRODUCTION
The concept of Low Power Wide Area Networks (LPWANs) has been recently introduced as an attractive technology to address a variety of IoT applications [1], [2]. LPWANs focus on low data rate applications which need a long range and favor a battery life as long as ten years or more. One of the proposed communication techniques for implementation of LPWAN is Ultra-Narrowband (UNB) communication [3], [4]. UNB offers very low data rate but its wide coverage, low cost devices, unlicensed band and robustness against interference makes it an attractive technology for LPWAN implementation [3], [5]. One of the challenges in a UNB communication system is Carrier Frequency Offset (CFO).
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Maaz Rehan .
It is particularly challenging in downlink communication as a low power node needs to receive a signal with frequency offset [6]. Considering the ultra-narrow bandwidth of the signal (as narrow as 100 Hz), using a low cost crystal without thermal compensation can lead to a CFO which is several times the symbol rate at the receiver [7].
CFO is a well-known problem in communication systems. It is either a consequence of mismatch between oscillators in the communication nodes or the Doppler shift resulting from their relative movement. Offset tolerant demodulators have been proposed in the last decades as a solution to this problem in low data rate satellite communications where the CFO is larger than the data rate [8], [9]. Such demodulators can also be used in UNB communications [6], [10], [11]. The ability of the demodulator to tolerate CFO relaxes (or even eliminates) the requirement for time and power consuming carrier recovery as well as costly precise crystals with power hungry thermal compensation. Thus, a low power communication node for wireless sensor networks and IoT can be designed.
A DFT-based demodulator for BFSK is an example of aforementioned offset tolerant demodulators [8], [10]. To properly select the samples of a symbol for which the DFT is calculated, a window synchronization is required. To enable large offset tolerance for the demodulator, a window synchronization algorithm based on DFT is proposed by Hara et al. in [8] (also used in [10]). For tolerating large frequency offset using DFT-based demodulator, it must be implemented together with this window synchronization. Thus, the synchronization is discussed as a phase in the demodulator operation in this work. The main challenge of implementing this demodulator is the complexity of the synchronization algorithm which also increases when the tolerable frequency offset increases. In our previous work [12], a low complexity window synchronization algorithm was proposed and a Single Bin Sliding DFT (SB-SDFT) structure was introduced to efficiently implement the proposed algorithm.
This article elaborates on our previous design, mathematical formulation and related literature while, additionally, contributes to extending and analyzing it. The stability of the SB-SDFT proposed in [12] is considered and a stabilized version (using a damping factor) is proposed. The complexity is analyzed when the new stabilized SB-SDFT is utilized for implementing the proposed algorithm. In addition to the AWGN channel, which was considered in [12], in this work the BER performance in fading channel is also presented. Compared to our previous work, a more efficient implementation of the conventional synchronization algorithm is derived and used as the benchmark for complexity. This makes the complexity comparison between the proposed method and the conventional algorithm more realistic compared to our previous work. To increase the range of the tolerable frequency offset, the sampling frequency increases (a larger number of samples per symbol, N ). In our previous work, BER performance and complexity results were presented only for N = 8; however, here, the BER performance and complexity are demonstrated for different values of N . This helps us to illustrate how the demodulator scales for larger values of N i.e. a larger tolerable frequency offset. Moreover, the design parameters, including the damping factor, are determined using simulations.
The paper is outlined as follows. The DFT-based demodulator and the conventional window synchronization algorithm (Hara synchronization) are briefly explained in the next section to properly define the problem and show how we can interpret the calculations of the algorithm as a sliding DFT. Section III looks into related work on the efficient implementation of a sliding DFT while jointly motivating the design of the proposed synchronization algorithm and the proposed SB-SDFT. Subsequently, the proposed synchronization algorithm is presented in Section IV. Next, section V elaborates on the proposed SB-SDFT and derives its stabilized version. The complexity analysis follows in Section VI to obtain expressions for the number of operations and memory use. Numerical results and discussions are included in section VII. Finally, section VIII concludes the paper.

II. BACKGROUND AND PROBLEM DEFINITION
In this section, first, the principle of the DFT-based demodulator in [10] is explicated to familiarize the reader with the base of this work. Then, the window synchronization algorithm introduced in [8] (and also used in [10]) is described to clarify the algorithm which is used as a benchmark with respect to performance and complexity. In the last subsection, a primary overview of the complexity of the demodulator is provided. It is shown that the synchronization complexity is dominant and should be alleviated.

A. THE PRINCIPLE OF THE DEMODULATOR
The block diagram of the DFT-based demodulator in [10] is shown in Fig. 1. Signal samples pass through a low pass filter prior to the demodulator. The set of complex samples belonging to a symbol are selected by a rectangular window which needs to be synchronized (aligned with symbols). When these samples are selected by a correctly synchronized window, they are padded with zeros and sent to the DFT calculation block. The zero-padding factor is defined as I which means N samples of the signal are padded by N (I −1) zeros to achieve a zero-padded sequence with length NI as the input of the DFT. Finally, the decision for each BFSK symbol is made based on the magnitudes of the DFT for k 0 and k 1 which are the DFT bins corresponding to the frequencies of symbol zero (f 0 ) and one (f 1 ) of BFSK modulation, respectively. Zero-padding is used to improve bin resolution of the DFT as explained in [8]. In [8] and [10], I = 8 is chosen because increasing it further adds complexity but does not improve the BER performance. The frequency separation of the BFSK modulation is assumed to be equal to the symbol rate (f 1 = f c + 1/2T and f 0 = f c − 1/2T where T is the symbol period and f c is carrier frequency for passband signal and CFO for the baseband signal). In this work we also consider I = 8 and a frequency separation equal to the symbol rate.
To tolerate a large frequency offset, the lowpass filter in Fig. 1 might be much wider than the signal bandwidth. Moreover, complying with the Nyquist criterion (with respect to the filter bandwidth) necessitates a higher sampling frequency. The sampling frequency is assumed to be N times the symbol rate R Sym (N complex samples per symbol). To increase the offset tolerance, the sampling frequency and the number of samples per symbol, N , must be increased. The relation between the filter bandwidth, the sampling frequency, the signal bandwidth and the tolerable frequency offset are shown in Fig. 2. B t is the transition band of the filter and B F is the filter bandwidth including the transition band. The sampling frequency (for complex samples) is set to F s = 2B F to avoid aliasing and noise folding. The bandwidth of the baseband signal is B S /2 (where B S is the bandwidth of the passband signal) and the maximum tolerable frequency offset is shown ±f O,max .
Two phases can be distinguished in the demodulator operation; synchronization and detection. In the synchronization phase, the window must be aligned to the received symbols; otherwise, frequency components of the adjacent symbols introduce inter-symbol interference. Furthermore, the DFT frequency bins corresponding to the frequency of symbol one (k 1 ) and zero (k 0 ) of the BFSK modulation are determined in this phase. In the detection phase, the DFT is calculated for samples of each symbol. For each symbol the decision is made by comparing the magnitude of the DFT bins k 1 and k 0 .

B. THE WINDOW SYNCHRONIZATION ALGORITHM
The window synchronization algorithm described here has been introduced by Hara et al. [8] and, in the rest of this work, is referred to as Hara synchronization or conventional synchronization algorithm. This algorithm uses a preamble of alternating ones and zeros (1, 0, 1, 0, . . .). The number of symbols in the preamble is denoted by L. For each symbol, windows with different delay values, between 0 and N − 1, are considered. Note that, because of oversampling, each symbol consists of N samples. Only one of these windows is fully aligned with a symbol and others include samples of two consecutive symbols. Fig. 3 illustrates the first three symbols of the preamble and the windows in case of four samples per symbol (each square is a sample). The doubleheaded horizontal arrows denote windows. Each row of arrows corresponds to a certain delay value as shown in the figure. Solid (dotted) arrows show the windows for an Odd (Even) symbol with different delay values. The symbols of the preamble are called Even and Odd depending on their index m = 1, . . . L. Fig. 3 also shows how the magnitude of the DFT changes for different delays. When the DFT for each window is calculated, for each delay value, all DFT magnitudes corresponding to Odd symbols (Even symbols) in the preamble are added together to achieve the following.
where X i k,2n (X i k,2n−1 ) is the DFT value for the k th bin and symbol 2n (2n−1) with a window delay i. Considering Fig. 3, (1) and (2) actually calculate the sum of the DFT magnitudes for solid windows within one row of arrows and the sum of those for dotted windows, respectively.
To synchronize the window, the Hara algorithm finds the delay value for which R i in the following equation is maximized [8], [10].
where k E i and k O i are the bins with maximum magnitude in S E i and S O i , respectively. Finding the maximum of (3) is simply finding the delay value which maximizes the difference shown by D in Fig. 3 which in the specific case of Fig. 3 is for Delay = 0. When the delay value is found, the k 1 and k 0 can be achieved from k O i and k E i , respectively.

C. AN OVERVIEW OF COMPLEXITY
In the synchronization phase, all bins of an NI -point DFT are calculated over all symbols of the preamble and N different windows for each symbol in the preamble. In the detection phase, only two bins of the DFT are required for each symbol. Calculating only two bins of the NI -point DFT using the definition of the DFT series requires 2(N − 1) complex multiplications (notice that there are only N non-zero samples in the zero-padded set of samples). The synchronization is done once in a packet; however, the packet length is short in target applications (in the order of 200 symbols) [4], [6], [7] which makes the complexity of the synchronization a significant overhead. Moreover, the operations required for the synchronization phase are executed in a shorter period of time compared to the detection phase which leads to high instantaneous power consumption. Thus, it might be unsuitable for applications where the maximum instantaneous power is limited (such as energy scavenging applications). Thus, the complexity of the synchronization algorithm is dominant and needs to be alleviated. In our previous work and the current paper, the DFT calculations required for the window synchronization are interpreted as calculating DFT for a window which slides over a sequence of samples (see the arrows in Fig. 3). DFT for a window sliding over a sequence can be implemented efficiently using sliding DFT algorithms. Thus, this interpretation enables us to achieve an efficient implementation. The next section elaborates on implementing the sliding DFT reviewing related literature and clarifies the motivation for the proposed window synchronization algorithm and the proposed SB-SDFT.

III. RELATED WORK ON SLIDING DFT
Calculating the DFT for a window which is sliding over a sequence is called Sliding DFT (SDFT). As a computationally intensive block, many researchers have investigated the efficient implementation of the SDFT. In case of a sliding window, only one sample (or a few samples) is (are) different between the current set of input samples of the DFT and the previous set. Exploiting this property, various methods have been proposed for efficient implementation of the DFT with a sliding window [13]- [29]. By reusing calculations done for each window, the DFT of the next window can be calculated in a more efficient way. For a better understanding, we categorize the methods into two general groups. The first group, Complete SDFT (C-SDFT), includes algorithms that calculate all the DFT bins for each window in a single structure; whereas, the second category, Single-Bin SDFT (SB-SDFT), covers algorithms which derive only a single bin of the DFT. In the next two subsections these categories are explained. Although the final design in this work belongs to the SB-SDFT category, review of the C-SDFT methods is necessary. It helps us to derive an efficient implementation of the Hara synchronization algorithm (the algorithm explained in the previous section) which is used as a benchmark for complexity comparison. In the last subsection, the conclusions that can be drawn from literature are pointed out. This subsection also describes how this insight provokes the design of the proposed algorithm and the proposed SB-SDFT.

A. COMPLETE SDFT (C-SDFT)
One of the initial examples of an SDFT has been introduced in [13]- [15]. Based on the conventional Decimation-in-Time FFT structure and storing calculated intermediate values, this method decreases the complexity of FFT calculation for the new coming sample from O(N log 2 N ) in the FFT to O(N ) [15]. However, it increases the memory as it needs to store all the intermediate values. Interpreting the FFT structure as a prism, Wang et al. [16] proposed a method to calculate the SDFT. Although its complexity is more than that of the SFFT in [15], it can be implemented in parallel for faster calculation [17] which is not necessary in our target application as it involves very low data rates. Rewriting the sliding DFT definition and using time shift properties, Montoya et al. have proposed a sliding DFT method with almost 50% reduction in the number of complex multiplications compared to the SFFT [18]. This idea is also extended to a Radix-4 decomposition in [19] which achieves even more savings at the cost of limiting the DFT points to a power of four. Despite the complexity reduction achieved in [18], [19], these cannot be used when zero-padding is involved which is the case in our target application. A guaranteed stable recursive algorithm for calculation of all bins is proposed in [20] which is based on a single bin recursive structure from the same author in [21]. This method is only applicable in a hopping scenario where each window hops N /4 samples (N is the DFT size). Another recursive algorithm for C-SDFT is introduced in [22] utilizing the observer theory in the control systems. The algorithm calculates the C-SDFT while solving the stability problem associated with recursive SDFTs with less memory than what is required by the SFFT. However, it cannot work when zero-padding is used in the input sequence. Among the C-SDFT methods, the technique in [15] (the first mentioned method in this section) is the best one for implementing the Hara synchronization algorithm due to its simplicity and possibility of using zero-padding. This method is called SFFT in the sequel.

B. SINGLE BIN SDFT (SB-SDFT)
The second group of SDFT algorithms are those which focus on calculating a single bin of the DFT (SB-SDFT). The core idea of such systems is based on the shifting property of the Fourier transform [23]. To clarify, consider the N -point DFT for X n = {x(n − N + 1), . . . , x(n)} as follows where X k (n) is the k th DFT bin of X n and W N = exp(j2π/N ). X n k can be written in terms of X k (n − 1), which is the DFT for X n−1 = {x(n − N ), . . . , x(n − 1)}, as follows.
Noticing that W Nk N = 1, (5) can be rewritten as: The block diagram of a filter which realizes (6) is depicted in Fig. 4. It calculates the k th bin of an N -point DFT by implementing the DFT series for each bin as an IIR filter.
With each incoming sample (each filter iteration), the value of the k th bin of an N -point DFT is updated for the last N samples. As a result, when the window hops one sample, only one complex multiplication and two complex additions are required to obtain the k th bin for the new set of samples.
As can be seen, in the second stage there is a feedback loop with a pole on the unity circle of the complex plane. This pole makes the system conditionally stable. In real applications, the limited precision of the twiddle factors (W k N ) can push the pole out of the unit circle and cause instability. In [23] a damping factor, r, is used to force the pole inside the unit circle while compromising the precision of the DFT calculation and causing errors. Nonetheless, if the damping factor is chosen close to one, the error can be kept small enough for many applications. Due to using the damping factor r, the method in [23] is also called rSDFT. To overcome the stability problem without compromising precision, a modulated SB-SDFT algorithm (modulated-SDFT) has been introduced in [25]. In this method the time-domain modulation duality of the Fourier Transform is used to shift any desired frequency component to the zero frequency. In this way, the pole in the feedback loop is equal to one and the filter will always be stable. Although both stability and precision are addressed in this method, the complexity increases considerably since the input samples need to be modulated with a proper twiddle factor.

C. CONCLUDING THE LITERATURE REVIEW
Considering previous work on the sliding DFT, one may conclude that C-SDFT methods are more efficient if all bins of the DFT are required; however, if a subset of the bins are needed, the complexity of an SB-SDFT is lower [29] (less than half of all bins when SB-SDFT is compared to SFFT). The Hara synchronization algorithm needs all bins of an NI -point DFT. Now, if a new algorithm is designed which only uses a subset of the bins belonging to an NIpoint DFT, the complexity can be decreased using an SB-SDFT implementation. This is the incentive for designing the proposed window synchronization algorithm.
On the other hand, as mentioned in section II, zero-padding is necessary for correct detection in the demodulator. None of the mentioned SB-SDFT algorithms can be used together with zero-padding. Hence, for efficient implementation of the proposed algorithm a modified SB-SDFT is needed that incorporates zero-padding. In the next section, a synchronization algorithm is proposed which only needs a subset of the NI -point DFT bins. In section V, a stable SB-SDFT structure is derived for efficient implementation of the proposed algorithm.

IV. THE PROPOSED WINDOW SYNCHRONIZATION ALGORITHM
So far, it has been concluded that using only a subset of the bins of an NI -point DFT for window synchronization reduces complexity compared to the Hara synchronization method. Before introducing the proposed synchronization algorithm, it is needed to check whether it is feasible to only rely on a subset of bins for synchronization (without losing signal information). This is done in the first subsection where a set called Bins of Interest is defined and the general concept of the proposed synchronization algorithm is introduced. Afterwards, the algorithm is described in detail.

A. BINS OF INTEREST AND THE PROPOSED SYNCHRONIZATION CONCEPT
In the presence of large frequency offset (multiple times the symbol rate) a filter wider than (multiple times) the signal bandwidth is needed before the demodulator to capture the signal. It means that the sampling frequency should be higher than the actual information bearing bandwidth of the signal. On the other hand, using zero-padding increases the number of bins that must be calculated including those which are out of the signal bandwidth. As a result of this oversampling and zero-padding, only a small part of the spectrum calculated by an NI -point DFT includes signal information. These bins of the spectrum are called Bins of Interest hereafter and BoI for each DFT is defined as the set which includes these bins. Since the signal information resides in the BoI, solely the BoI can already provide a correct synchronization without loss of information. The BoI is in the vicinity of the signal center frequency, f c = (f 0 + f 1 )/2 + CFO (f c = CFO in the baseband signal). This is shown in figure   Since the frequency offset is assumed to be unknown, there is no prior knowledge about the whereabouts of the BoI in the spectrum. The proposed method aims first at finding the BoI to limit the number of the DFT bins required for the synchronization later on. Using a subset of the DFT bins, not only SB-SDFT algorithms can be used to decrease complexity but also fewer bins need to be stored during synchronization which considerably decreases memory requirements.
In the proposed synchronization concept an extra process is added to the synchronization, splitting it into two stages; the zoom stage and the window alignment stage. The concept of the proposed synchronization method is shown in Fig. 6. Input samples pass through a sliding rectangular window providing sequences of length N . First, in the zoom stage (explained in the next subsection) the BoI of an NI -point DFT called BoI Final , is detected. Secondly, BoI Final is sent to the window alignment stage where a proper window delay is obtained only calculating the DFT bins in BoI Final and using the algorithm illustrated in Fig. 3. To compensate for the delay introduced by the zoom stage, an equivalent delay (z −2N ( +1) ) has been inserted before the Window Alignment block. It enables the algorithm to reuse the same samples used in the zoom stage for the window alignment. It avoids an increased number of preamble symbols for the proposed synchronization. In the next subsection, the zoom stage is explained in detail.

B. THE ZOOM STAGE
The target of this stage is to find the BoI of an NI -point DFT (BoI Final ) which covers the DFT bins corresponding to the BFSK modulation frequencies k 1 and k 0 . For this purpose, a step-by-step zooming approach is utilized which searches for the center bin in an NI -point DFT, k c,Final . The center bin is the bin closest to the center frequency (f c ) of the signal. The BoI Final is calculated using the k c,Final . Before elaborating on the step-by-step zooming algorithm, parameters are defined. T is the symbol period and N is the number of samples per symbol. I = 2 is the zero-padding factor. As mentioned earlier, the zero-padding factor for the DFT-based demodulator is selected to be I = 8, yet, for the proposed algorithm it can be any power of two, 2 . Moreover, the frequency separation of the BFSK modulation is equal to the symbol rate (f 1 = f c + 1/2T and f 0 = f c − 1/2T ) similar to [8], [10]. The zoom stage includes + 1 steps while the zero-padding factor at step γ is equal to 2 γ (γ = 0, . . . , ). A visual illustration of the step-by-step zooming is shown in Fig. 7. In this figure N = 8 and I = 8 which is equivalent to = 3. At each step γ , the bin k γ c of an N 2 γ -point DFT which is closest to f c is detected based on calculating only the bins within BoI γ . Then, an estimate of the next step center bin,k γ +1 c , is calculated using k γ c . Following on that,k γ +1 c is exploited to determine the BoI for step γ + 1, BoI γ +1 . Subsequently, in step γ + 1, the actual center bin k γ +1 c is detected by calculating only a subset of the bins of an N 2 γ +1point DFT in BoI γ +1 . Continuing this procedure, the center frequency of an NI -point DFT is obtained which is used to find BoI Final .
The zoom stage starts with the first step, γ = 0, which uses an N -point DFT i.e. with a zero-padding factor of one (or no zeros). The BoI for the first step (γ = 0) includes all the bins of the N -point DFT. All points of the N -point DFT are required in the very first step because the center frequency can be anywhere in the spectrum that can be covered by the sampling frequency. Thus, we need to look at all bins at the first step and then we can limit ourselves to the BoI γ calculated for further steps. Methods for calculating k γ c , calculatinĝ k γ +1 c from k γ c and determining BoI are explicated as follows.
At step γ , a window of length N slides for 2N hops (in shifts of one sample) over the the preamble (this is equal to the number of samples for two symbols) and the DFT is calculated for each window. For each one-sample shift of the window, only the bins of an N 2 γ -point DFT inside BoI γ are calculated. So there will be 2N DFTs for which the bin with maximum magnitude is changing from f 1 to f 0 . An example of the windows for N = 4 and how the spectrum changes can be seen in Fig. 3.
The magnitudes of all these DFTs for each bin are added. Then, the bin for which this sum is maximum is closest to the center frequency k γ c . Thus, k γ c is obtained as follows.
VOLUME 8, 2020 where: k,i is the DFT for the k th bin and delay i = 0, . . . , 2N − 1 in step γ . The first window can start from any part of either a ''one symbol'' or a ''zero symbol'' in the preamble. Proving why the bin achieved using this method is actually the center bin is dealt with in Appendix.
The zero-padding factor is doubled between two consecutive steps. It means that zeros are added so that the length of the sequence over which DFT is calculated (including zeros) becomes twice of its value in the previous stage. This actually adds a new ''interpolated'' bin between each two bins of an N 2 γ -point DFT to achieve an N 2 γ +1 -point DFT. Therefore, the center bin in step γ + 1 can be estimated ask Notice that f c might not be matched to a bin due to the arbitrary CFO. In this case, two adjacent bins close to the f c have the largest magnitudes among all (the magnitudes are exactly the same if f c is exactly in the middle of the two bins and there is no noise). In such cases, a noisy received signal and leakage causes the calculatedk γ +1 c to deviate from the actual center bin in step γ + 1. That is why the center frequency should be detected step-by-step so that the final center frequency bin in the NI -point DFT is selected correctly.k γ +1 c is used to determine BoI γ +1 as follows.

3) DETERMINING BoI γ +1 AND BoI Final
To account for any erroneous detection due to the spectral leakage and noise, I bins (equal to the zero-padding factor) in the vicinity of the estimated center bin are considered as follows. where: In (10) and (11), mod is modulo operator. It is used to map values that are outside the range of bin numbers of an N 2 γ +1 -point DFT in step γ + 1 to the valid set i.e. k ∈ {0, . . . , N 2 γ +1 − 1}. In the final step, when the k Final c is calculated, the final BoI used for the window alignment stage (BoI Final ) is determined based on that. BoI Final should include the frequency bins corresponding to symbols 1 and 0 of the BFSK modulated signal (k 1 and k 0 , respectively). Since the frequency separation of the BFSK modulation is equal to the symbol rate (R Sym ), when the zero-padding factor is I and the sampling frequency is NR Sym , there are I − 1 bins between k 1 and k 0 . To reduce the effect of noise, more than I bins are considered. If the number of bins in the BoI Final (its cardinality) is denoted by |BoI Final |, the BoI Final is determined as follows. where: The size of BoI Final is optimized to achieve the best BER performance using simulations which are presented in section VII. A detailed block diagram of the proposed synchronization algorithm is also shown in Fig. 8. Both the zoom stage and the window alignment stage receive a set of N samples from a sliding window as shown in Fig. 6. The sets are the result of a rectangular window which slides over the preamble by onesample shifts. The zoom stage is composed of three blocks. The first block calculates the DFT, the second calculates k γ c and the third calculatesk γ +1 c and BoI γ +1 . The BoI Final is the output of the zoom stage which is sent to the window alignment stage. The window alignment stage is similar to the algorithm explained in section II. Getting samples of windows with different delays over the preamble, the DFT calculation block calculates the DFT magnitudes for bins in BoI Final . These values are passed to the next block which calculates R i (where 0 ≤ i ≤ N − 1 are the window delays) similar to (3) but only for the bins in BoI Final . Finally, the third block finds the i value for which R i is maximum and finds k 0 and k 1 in the BoI final . This will be the proper window delay (the timing information) and, together with k 0 and k 1 , it is used during the detection phase. For the DFT calculation blocks in the zoom and the window alignment stages, an SB-SDFT calculator can be used which is introduced in the next section.

V. THE PROPOSED SB-SDFT WITH ZERO-PADDING
In the previous section, a synchronization algorithm was presented. The second issue mentioned in the conclusion of the literature review for efficient implementation of the SDFT is an SB-SDFT scheme in the presence of zero-padding. Aforementioned implementations for an SB-SDFT (see section III) do not take the zero-padding into account. The main derivation of these algorithms is based on the periodicity of the twiddle factors. When each set of the samples is padded with zeros before the DFT calculation, this periodicity is violated due to zeros appended to the samples sequence. This leads to incorrect calculation by SB-SDFT schemes. For the proposed synchronization algorithm, a new SB-SDFT algorithm is needed to incorporate the zero-padding. A procedure similar to the conventional SB-SDFT derivation in [23], [25] and [20] is followed.
The k th bin of an M -point DFT over a set of N samples, X n = {x(n−N +1), . . . , x(n)}, which are padded with M −N zeros, is as follows.
where W M = e j 2π M and the last M − N terms of sum are ignored as they are zero. The k th DFT bin in (15) can be obtained using the k th DFT bin for sample set X n−1 = {x(n − N ), . . . , x(n − 1)} as follows.
where X k (n−1) is the k th DFT bin for X n−1 . When there is no zero-padding, N = M and W −(N −1)k M in the last term of (16) can be simplified to W k M leading to the known SB-SDFT equation (see (6)). If zero-padding is used, this simplification is not valid anymore. This is the violation of the periodicity mentioned above and necessitates a modified version of the SB-SDFT. In case of zero-padding, N = M and (16) can be written as follows. (17) can be seen as a filter taking samples of x and generating the k th DFT bin while sliding the window by one sample. The transfer function of the filter is: The block diagram of such a filter is also depicted in Fig. 9. The SB-SDFT with zero-padding has an extra multiplication in the first loop compared to the SB-SDFT in Fig. 4. Similar to the conventional SB-SDFT in [23], it has a pole on the unity circle. The stability of this modified version of SB-SDFT can be guaranteed using a damping factor r (similar to the idea proposed in [23]). Another method is to extend this derivation to achieve a modified version of the modulated-SDFT introduced in [25]. The former compromises precision while the latter increases complexity almost twice. Using a damping factor causes an error in the calculated DFT values. As shown in [30], this error is in the order of 1% when r is close enough to one. In our proposed method the exact value of the DFT is not the target but the relative value of different bins is important. As a result, to avoid the complexity of the modulated-SDFT, the modified SB-SDFT method is stabilized by adding a damping factor. To change the pole from W k M to rW k M , the z in (18) is replaced with z/r. This leads to the following transfer function.
The block diagram of the stable filter is shown in Fig. 10. For each loop, one multiplication between a complex number and a real number is added which is composed of two real multiplications. The extra multiplications in the second loop can be integrated into the twiddle factor. To do so the nominator and the denominator of the transfer function are multiplied with r −1 to achieve: The final block diagram is shown in Fig. 11. Interestingly, the one extra multiplication compared to (18) which is in the first loop is independent of the bin number and can be shared between filters which are calculating different DFT bins. The effect of the damping factor on the demodulator performance and the proper value for it are investigated using simulations presented in section VII.

VI. COMPLEXITY ANALYSIS
In the first subsection the number of complex multiplications/additions (CM/CA) required for DFT calculation are obtained. Next, the memory usage is calculated including the VOLUME 8, 2020 memory needed to compute the Sliding-DFT and the memory required to store S O i /S E i in (1)/(2) and the twiddle factors. In the final design, the zero-padding factor I is 8; however, in the following, parameter I is used for clarity. The proposed SB-SDFT is shown again for easier reference in Fig. 12. The dotted circles in Fig. 12 demonstrate different operations and/or memory required for the calculation of the SB-SDFT. Notice that the calculations shown in Fig. 12, are executed for each bin k of the DFT; however, when several bins are calculated the result of these operations/memory might be reused for multiple bins i.e. that part of the block can be shared between a few bins. This is further explained in the following analysis wherever applicable.

A. COMPLEX OPERATIONS
The complexity of the proposed method is separately calculated for the zoom and the window alignment stage while explaining the operations shown by the dotted circles of Fig. 12. Each time that a new value for the DFT bin is calculated is called an iteration. Each iteration updates the DFT value based on the input sample and the last N − 1 samples of the input sequence (in total N samples). In the following, multiplication and addition refer to complex operations unless stated otherwise.

1) ZOOM STAGE
A: The twiddle factor in A for the last step γ = is r −1 W −kN NI for bin k which is written as r −1 W −k I . Since W I = exp(j2π/I ) is always a point on the unit circle within the complex plane, its powers have only I different values regardless of the number of calculated bins. Since r −1 W −(k+I /2) I = −r −1 W −k I , only half of the twiddle factor multiplications, 0.5I , need to be calculated and the rest are simply sign conversions (multiplications with +/− j are still considered as a complete complex multiplication). The last step of the zoom stage has the largest zero-padding factor and consequently, the largest number of different twiddle factors. So the number of multiplications within operation A is the largest in the last step. To simplify the complexity expressions we consider 0.5I multiplications for the other steps of the zoom stage as well. So A leads to (0.5I )( + 1) multiplications per iteration for all bins and all steps together.
B: The operation in B is a real-by-complex multiplication and can be approximated by 0.5 complex multiplication.
It can be shared between all bins at each step. For all steps together, it leads to 0.5( + 1) multiplications for each iteration and all bins together.
C: Based on above discussion, the output of A may have at most I different values in each step for all bins and B can be shared with all bins. As a result, C, leads to I additions (at most) per iteration in each step for all bins together. So C leads to I ( + 1) additions per iteration for all bins and all steps together.
D/E: These parts together include one multiplication and one addition per bin and per iteration. In the zoom stage, at the first step, N bins are calculated and for the other steps I bins are calculated (N + I in total). Thus, for the zoom stage, per iteration and for all bins together D and E lead to N + I multiplications and additions, respectively.
Iterations: Remember that for each step of the zoom stage the window shifts 2N samples (i.e. 2N iterations). Starting from the initial state of zero, N iterations are required to generate the DFT value for the first window of N samples. Then, each iteration gives the SB-SDFT for the next window. As a result, 3N iterations for each bin at each step of the zoom stage. Now, considering the above discussion and the number of iterations, the total number of multiplications and additions for the zoom stage (CM Zoom and CA Zoom , respectively) are obtained as follows.
2) WINDOW ALIGNMENT STAGE A: In the window alignment stage, the zero-padding factor is I . Following the same reasoning as the zoom stage, A needs 0.5I multiplications per iteration for all bins together.
B: In total, this part needs two real multiplications or 0.5 complex multiplication per iteration for all bins together.
C: Similar to what was mentioned for the zoom stage, C leads to I complex additions per iteration for all bins together.
D/E: In the window alignment stage, |BoI Final | bins are calculated. Thus, per iteration and for all bins together D and E lead to |BoI Final | multiplications and additions, respectively.
Iterations: The window alignment stage needs NL iterations where L is the length of the preamble (N delays for each symbol of the preamble). Following the same reasoning for the zoom stage, N extra iterations are required when starting from the initial state of zero so the alignment stage needs N (L + 1) iterations for each bin. The number of multiplications and additions for the window alignment stage (CM Align and CA Align , respectively) are: Since the zoom stage and the window alignment stage are not executed in parallel, the registers (memory elements) used in one can be used in the other as well. Here, we only focus on the window alignment stage as it needs more memory and therefore, dictates the total required memory for calculating SDFT. For the Sliding DFT each filter needs a memory of size N for samples and a delay of one for the previous DFT value (See B and F in Fig. 11). The memory within B (z −N with length N ) can be shared between all bins and the second is unique for each bin. Additionally, the delay block in Fig. 6 requires storing 2N ( + 1) samples. All these values have two parts (real and imaginary), so the required memory for calculating SDFT (M Calc ) is as follows.
where |BoI Final | is the cardinality of the set of BoI used in the window alignment stage. The proposed synchronization algorithm needs to store the results of the accumulation of the DFT magnitudes for |BoI Final | bins and N delay values. This memory is denoted by M Acc and is derived as follows.
The factor two comes from the two sets of magnitudes that must be accumulated for odd and even symbols.
For an NI -point DFT, only half of these values should be stored from which two values are equal to 1 and j for k = 0 and k = NI /4, respectively. As a result, NI /2 − 2 complex twiddle factors should be stored so the required memory is NI − 4.

C. COMPARISON
As mentioned in section III, the SFFT method in [15] provides an efficient implementation of the Hara synchronization algorithm. The complexity of the SFFT for each iteration can be found in [15]. Due to zero-padding, in a few first stages of the SFFT some operations have zero inputs and are not executed which leads to a pruned structure with reduced complexity. The number of iterations for the SFFT in the Hara synchronization algorithm can be derived based on a reasoning similar to what was used above for the window alignment stage. Taking these into account, the complexity of the Hara synchronization can be calculated. For the sake of brevity, the detailed calculations are not included in this article and only the final results are presented for comparison.
The number of operations and memory required for the Hara synchronization and the proposed synchronization algorithms are presented in TABLE 1. For the total number of operations only the term with the largest power of N is included in the table. Although the number of CM/CA increases with N 2 for both methods, in practical cases, the factor of N 2 in the Hara method is much larger than that of the proposed method. Considering L = 16 and I = 8 similar to both [8] and [10], the factor of N 2 in the number of complex multiplications required for the Hara synchronization (implemented using SFFT) is about 45 times the proposed synchronization. Furthermore, the total memory of the Hara method (with SFFT) increases with N 2 while the total memory of the proposed method increases with N . These differences between the complexity stem from the fact that in the proposed synchronization, in contrast with the Hara synchronization, the number of the calculated DFT bins for the window alignment does not change with N and is constant (BoI Final ). The numerical results and comparison of complexity for different values of N are presented in the next section.

A. DESIGN PARAMETERS
According to our discussions in the previous sections, two parameters need yet to be determined. Those are the number of bins in the BoI Final and the damping factor r for the SB-SDFT. The most important performance metric for our system is the overall BER of the demodulator. Therefore, BER values obtained by simulations are utilized to determine the appropriate design parameters. As stated in IV, in all simulations of this section I = 8 and L = 16 similar to [8], [10]. Fig. 13 shows how the BER of the demodulator changes relative to |BoI Final | for different values of E b /N 0 . The damping factor is considered to be r = 1 in these simulations.
As can be seen in Fig. 13, the bit error rate value hardly changes when the number of bins in the BoI Final is increased more than 14. For a safety margin the number of bins in the BoI Final is considered to be 16.
The value of the damping factor determines the error at the output of the SB-SDFT filter. As mentioned earlier, the damping factor introduces errors to the calculated values for the DFT which may degrade the performance of the demodulator. This can be solved by choosing r very close to one.   Additionally, the number of samples which pass through the SB-SDFT filter plays an important role in the value of r. Due to the recursive behavior of the filter, the error accumulation increases when an SB-SDFT is applied to a long sequence of samples and, to avoid that, r values closer to one might be required. Thus, for a constant preamble length, the appropriate value for r also depends on the number of samples per symbol. As explained in section II, N is proportional to the bandwidth of the filter before the demodulator so it determines the tolerable frequency offset. Fig. 14 illustrates the BER at E b /N 0 = 11 dB (corresponding to BER ≈ 0.1%) for different values of N and r when L = 16 and I = 8. According to Fig. 14, r = 0.999 guarantees a consistent performance up to a sampling frequency more than 100 times the symbol rate.

B. BER PERFORMANCE
To evaluate the BER performance and compare it with the conventional algorithm (Hara synchronization), the proposed algorithm is applied to a DFT-based demodulator with parameters similar to [10]. The sampling frequency is 8R Sym where R Sym is the symbol rate while I = 8. The total system is simulated for an AWGN channel. It is assumed that the samples are the output of a brick-wall lowpass filter so the noise samples are uncorrelated. The BER performance of a DFT-based demodulator using the proposed synchronization and the Hara synchronization ( [10]) algorithms is depicted in Fig. 15. As can be seen, the performance of the proposed method is only slightly worse at high BER values; however, for practical BER values in the order of 10 −3 or smaller, it performs the same as the Hara method. The small increase in the BER at low SNR is due to a slight increase of error caused by using a small set of bins. Fig. 16 illustrates the BER for a range of frequency offsets values up to twice the symbol rate and different E b /N 0 . It can be seen that the demodulator with the proposed synchronization can tolerate frequency offset as expected. Fig. 17 shows the BER performance of the proposed method for different values of N and E b /N 0 . For these simulations I = 8, L = 16 and r = 0.999. It is seen that the increase of N does not affect the BER performance of the demodulator using the proposed synchronization algorithm and the proposed SB-SDFT. Notice that increasing N means that the bandwidth of the filter shown in Fig. 2 increases. This means that the noise bandwidth increases in our simulation.  As can be seen in the results, this extra noise bandwidth does not degrades the BER performance. This can be justified by looking at the DFT as a bank of narrowband filters. What determines the detection performance is the bandwidth of these narrow filers and not the bandwidth of the wide filter before the demodulator [8].
Finally, Fig. 18 illustrates the performance of the demodulator with the proposed synchronization in Rayleigh and Rician fading (K = 4) channels. It can be seen that a demodulator using the proposed synchronization performs the same as a demodulator with Hara synchronization algorithm.  Notice that the values of N are selected to be a power of two as required by the SFFT implementation of the Hara method; however, this is not necessary for the efficient implementation of the proposed algorithm. Both axes in Fig. 19 are on logarithmic scale. Fig. 19 and Fig. 20 show that the difference between the complexity of the proposed method (with SB-SDFT) and the Hara algorithm (with SFFT) increases when N increases. One of the reasons is that only the number of DFT bins calculated in the first step of the zoom stage depends on N . For the next steps of the zoom stage and the alignment stage, the number of the calculated bins is constant for all N (I and 2I , respectively). Although the number of samples involved in the DFT calculation is still related to N , the complexity grows slower as the number of bins required to be calculated VOLUME 8, 2020  remains constant. That is why the difference between the complexity of the proposed synchronization and the Hara synchronization increases when N increases. For a sampling frequency which is 32 times of the symbol rate, almost 85% saving can be achieved in arithmetic operations (and around 90% in the memory). The null-to-null bandwidth of a BFSK modulated signal (B S in Fig. 2) for a frequency separation of f sep = R Sym is 3R Sym [31]. If a very steep filter is considered (B t = 0 in Fig. 2) such a sampling frequency means that a frequency offset around ±14.5R Sym can be tolerated assuming that the main lobe should remain inside the filter (see Fig. 2). As shown in Fig. 15 and Fig. 18, the DFT-based demodulator which uses the proposed synchronization implemented using SB-SDFT has a BER performance similar to the DFT-based demodulator with the Hara synchronization; thus, the proposed synchronization and its efficient implementation reduce the complexity without any sacrifice in the performance.

VIII. CONCLUSION
The offset tolerant DFT-based demodulator is an interesting solution for low data rate applications including emerging ultra-narrowband communications for LPWANs. However, the existing synchronization algorithm for this demodulator is complex as it involves calculating the DFT of a sliding window. In this work, a new synchronization algorithm is proposed to decrease complexity. Using a step-by-step zooming technique, the proposed algorithm only requires a subset of the DFT bins. Consequently, efficient Single Bin Sliding DFT (SB-SDFT) implementations can be used. Besides, a stable SB-SDFT was introduced to incorporate zero-padding. The proposed algorithm and its implementation using the proposed SB-SDFT for an oversampling factor of e.g. N = 8, obtains 48%, 68%, 61% saving in the number of complex multiplications, complex additions and memory usage compared to the conventional window synchronization algorithm (Hara synchronization) while achieving the same BER performance. The relative reduction in the complexity further increases when larger N is required. The higher the sampling frequency (the larger value of N ), the larger frequency offset can be tolerated. For a frequency offset tolerance about ±14.5 times the symbol rate which requires an oversampling factor N = 32, the number of complex operations is reduced by more than 85% (and memory by 90%) compared to the conventional method (Hara synchronization implemented using SFFT).
To implement a low power receiver, the proposed algorithm should be combined with efficient digital design which requires research on the circuit level techniques. The effect of limited wordlength on the BER performance should also be studied. Additionally, it was shown that stabilizing the proposed SB-SDFT using a damping factor decreases the precision of the calculated DFT values. Further research is needed on designing a stable SB-SDFT with zero-padding for applications where high precision is required.

APPENDIX MATHEMATICAL PROOF FOR k c CALCULATION ALGORITHM
According to section IV, to achieve k γ c at each step, the magnitudes of DFTs for a sliding window with 2N one-sample shifts are added. Then, k γ c is the bin for which this sum has a maximum value (see (7) and (8)). In this appendix, it is mathematically shown that this sum over DFT magnitudes has a maximum at the center frequency. This is correct as long as all the 2N windows are in the preamble. The first window can start in any part of either a one symbol or a zero symbol in the preamble.
As the zero-padding factor changes during the zoom stage, (7) and (8)   function in the time domain, shown by x[n], is denoted by X ( ) where = 2πf /F s (−π < < π) is the frequency normalized to the sampling frequency, F s . The k th bin of the DFT calculated for N samples with zero-padding factor I , actually is the value X ( ) of DTFT for effectively N samples, sampled at f = (k/NI )F s ( = 2πk/NI ) [32]. Using the concept of the DTFT, (7) and (8) can be written as follows. where: X i ( ) is the DTFT for the i th window and c = 2πf c /F s = ω c T s . Because (27) and (28) are generalized versions of (7) and (8), which are true for all zero-padding factors, the superscript γ from (8) is removed in (28). If it is proved that X ( ) has a maximum at c , then, the sum in (8) has a maximum at the bin closest to f c (called k c ) for each zero-padding factor. This can be concluded considering that the DFT values for each k and window are actually samples of the DTFT.

A. CALCULATING THE COMPONENTS OF X ( )
The DTFT of a signal x[n] over N samples is as follows.
Fig. 21 depicts a part of the preamble which is used for mathematical derivation. Black squares are the samples of an Odd symbol (f 1 ) and gray ones belong to an Even symbol (f 0 ). Each double-headed arrow stands for a window (set of samples) of length N which is used for calculating the DTFT. In general, during the zoom stage, the window is not synchronized yet. Thus, the set of sliding windows in (28) does not necessarily start from the beginning of a symbol. To account for this, it is assumed that the first window from the set of 2N windows starts from a window with delay d from the beginning of a symbol. This is shown with i = 0 in Fig. 21. The first sample of that symbol is shown by x[q]. The following procedure is executed considering windows i = 0 to i = 2N − 1 and calculating the sum of the magnitudes of the DTFTs for these windows. The window i = 0 starts in an Odd symbol; nevertheless, the same mathematical formulation can be achieved if the black and gray samples are interchanged in Fig. 21 i.e. if i = 0 starts in an Even symbol. This generalization is possible because of the specific frequency separation for BFSK which is equal to the symbol rate. Now, let us consider a pair of windows in the summation of (28) with delay d of the odd symbol (f 1 ) and the same delay for the next even symbol (f 0 ) (i.e. i = 0 and i = N in (28)). In Fig. 21 these are shown by X 0 ( ) and X N ( ). If G 0 ( ) is defined as follows: Then, X ( ) can be derived in terms of N similar pairs and different G m as: where G m ( ) = |X m ( )| 2 + |X m+N ( )| 2 . First, G 0 ( ) is analyzed. Considering BFSK modulation, the samples for the i = 0 and i = N window are as follows.
where ω 1 and ω 0 are 2πf 1 and 2πf 0 , respectively. In (32) and (33), θ 1 = ω 1 NT s and θ 0 = ω 0 NT s are actually accumulated phase at the end of an Odd symbol and Even symbol, respectively. Without loss of generality, it is assumed that the initial phase before x[q] (in Fig. 21) is zero. Using x 0 [n], x N [n] and the definition of the DTFT in (29), X 0 ( ) and X N ( ) are calculated.
Notice that in (32) and (33), θ 1 − ω 0 NT s and θ 0 − ω 1 NT s are equal to +(ω 1 − ω 0 )NT s and −(ω 1 − ω 0 )NT s , respectively. Remember from section II that the frequency separation of BFSK is equal to the symbol rate R Sym = 1/T and T s /T = 1/N . So ±(ω 1 − ω 0 )NT s are removed from the phase of the signals since they equal ±2π. Now that the primary expressions for X 0 ( ) and X N ( ) are obtained, a new variable is introduced to simplify the rest of the proof.

B. DEFINITION OF α
The main target is to prove that X ( ) has a maximum at c = ω c T s . Since the frequency separation of BFSK is equal to the symbol rate (1/T = 1/NT s ), ω 1 T s and ω 0 T s can be written in terms of center frequency ω c as follows.
ω 0 T s = ω c T s − π/N , ω 1 T s = ω c T s + π/N , (36) Now that ω 0 T s and ω 1 T s are written in the form of deviations from ω c T s , let us define α as the deviation of from In the following, the variable is changed to ω c T s + α and (30), (31), (34) and (35) are formulated as functions of α. These new functions are denoted by a tilde over their name (for instance X ( ) changes toX (α)). When all functions are formulated in terms of α, instead of proving that there is a maximum at = c for X ( ), it must be proved thatX (α) has a maximum at α = 0. This simplifies the calculations, particularly, it enables us to eliminate the terms related to ω c T s which is seen in the expressions for ω 1 T s and ω 0 T s (36) as well.
By change of variable to α, (31) can be written as: For now, we focus onG 0 (α) = |X 0 (α)| 2 + |X N (α)| 2 . Substituting ω 1 T s and ω 0 T s from (36) to (34)  where T = NT s and the terms which are independent of n are taken out of summation. As a result of the change of variable, π appears in the phase of the exponential functions in (35) which is converted to a negative sign before the summations in (40). For a complex function f (x), |f (x)| 2 = f (x)f (x) * where f (x) * is the complex conjugate of f (x). Thus, the magnitudes ofX 0 (α) andX N (α) are as follows. where (n, p) = π N (2(d + q) + n + p) and the following equality is used: The first summation of (42) and the second summation of (41) can be rewritten as follows.
To obtain (44) and (45), first the negative sign before j is dissolved to (n − p) to make (p − n); then, using a simple change of indexes in summation (see (43)), (p − n) is converted to (n − p). Also, using cos(x) = cos(−x), the third term of |X N (α)| 2 can be written as follows.
This simplified notation of |X 0 (α)| 2 and |X N (α)| 2 shapes the basis of the final stage of our proof.
For α = 0 the right side of (49) is zero. Notice that all derivations so far are based on a general case of d, and, as mentioned earlier, the same procedure can be followed if the i = 0 window starts from an Even symbol. The main reason that allows this is the frequency separation which is equal to the symbol rate. This means the result of (49) can be generalized to anyG m (α). Considering the definition ofX (α) from (38), statement I is proved as follows: Statement I shows that α = 0 is an extremum (maximum or minimum) forX (α). To further prove that α = 0 is a maximum (and not a minimum), it must be shown that its second derivative is negative. Taking another derivative from both sides of (49) according to the chain rule, the d 2 dα 2G0 (α) is derived as follows.
From statements I and II, it is concluded that the sum in (28) has a maximum at α = 0 or = ω c T s .