Real-Valued Spreading Sequences for PSSS Based High-Speed Wireless Systems

In past years, Parallel Sequence Spread Spectrum (PSSS) has attracted significant attention as a modulation technique for wireless communication systems targeting data rates of 100 Gb/s and beyond. PSSS allows designing high-speed baseband processors, which can be partially implemented in the analog domain. It uses multiple analog-to-digital converters (ADCs) to sample the received baseband signal in parallel, significantly relaxing the sampling rate and ADC complexity. However, due to the sidelobe effects of bipolar m-sequences, PSSS shows lower performance than standard digital modulation schemes. This paper proposes real-valued PSSS spreading sequences with attenuated autocorrelation sidelobes. Such sequences show excellent bit error rate (BER) performance. Moreover, our sequences are not restricted to lengths of 2^m − 1, as m-sequences are, and they reduce the chip area required to implement a PSSS transceiver. The proposed sequences also reduce the peak-to-average power ratio (PAPR) of PSSS.


I. INTRODUCTION AND MOTIVATION
The increase of wireless data rates in the last decade drives the need for new baseband processing solutions for the next generation of ultra-high-speed wireless communications targeting 100 Gbps and beyond. Due to the high bandwidth available above 200 GHz, as depicted in Fig. 1, it is possible to provide channels exceeding 50 GHz of continuous bandwidth, which allow for very high data rates even with simple modulation schemes [1]-[3]. One of the novel modulation techniques that could be considered for such wide radio frequency (RF) channels is the so-called Parallel Sequence Spread Spectrum (PSSS) [4]-[6], which considerably simplifies transmitter and receiver baseband design due to its inherent parallelism and analog-friendly architecture [7]. The main advantage of PSSS is a lightweight baseband implementation that allows sampling and processing the baseband signal in parallel ADC structures and baseband threads (Fig. 2) [7]. This feature is especially important for future high-speed wireless channels located in the THz band, where the ADCs and DACs need to process a large bandwidth.
The associate editor coordinating the review of this manuscript and approving it for publication was Yan Huo.
Instead of employing a pair of I-Q ADCs as in conventional systems with QAM modulation, the received signal can be sampled by N ADCs, where each ADC instance operates at an N-times lower clock frequency. Moreover, a large portion of PSSS processing can be performed directly in the analog domain due to the analog-friendly architecture of PSSS. The initial PSSS processing on the receiver side consists of correlation and integration, which can be easily done in analog hardware. Thus, the PSSS baseband is a mixed-signal implementation.
Wireless communication with PSSS modulation was demonstrated in practice some years ago. In [8] and [9], the authors implemented a hardware node for industrial wireless communication based on a 5.8 GHz PHY with a 20 MHz channel. In [10], a similar module for IEEE 802.15.4-2006 is presented. The peak data rate is 250 kbps, with the system operating at 2.4 GHz. A significantly higher data rate is reported in [11], where the authors perform a hardware-in-the-loop experiment achieving 80 Gbps with analog frontends at 230 GHz. The Digital Signal Processing (DSP) pre- and post-processing are performed in Matlab and Simulink. PSSS is also deployed in orthogonal frequency-division multiplexing (OFDM) systems [12], is a part of [13], and was targeted for universal serial bus (USB) communication [14].
The main drawback of the PSSS variant used in this work is the correlation DC offset caused by the sidelobes of the bipolar spreading sequences, which significantly reduces the BER performance. For the standard m-sequences employed in PSSS, this DC offset leads to a ∼2.5 dB EbN0 penalty [15]. In the case of PSSS, the orthogonal Walsh-Hadamard sequences cannot be used to transfer the data streams. PSSS needs sequences with good cyclic-autocorrelation properties due to its cyclic nature and orthogonalization methods proposed in [16] and [17].
This paper proposes yet another method for mitigating sidelobes in PSSS. Using a genetic algorithm, we construct new real-valued spreading sequences, defined in the R domain, with attenuated autocorrelation sidelobes. Because the sidelobes are removed, the improved PSSS scheme shows excellent BER performance and approaches QPSK and QAM modulations in AWGN. To the best of our knowledge, such a method has never been employed for generating PSSS sequences [5]-[7], [16]-[19]. Our algorithm has five main advantages when compared to the state-of-the-art PSSS schemes:
• In contrast to the solutions shown in [4], [15], and [18], our method does not introduce any EbN0 penalty for BER performance. The resulting sequences achieve the highest BER performance and approach standard modulation schemes.
• Our idea is simpler to implement than solutions based on matrix factorization proposed in [16].
• The proposed genetic algorithm generates sequences with adjustable lengths ranging from 3 to 30 chips, which has not yet been demonstrated for PSSS in any paper. This allows designing PSSS transceivers with a more flexible hardware architecture. This is especially interesting for analog implementations considered in [7] and [20]-[22], where the m-sequences are problematic due to the integrator complexity. Using our sequences, the designer can specify the PSSS sequence length with a precision of 1 chip (bit). The competing m-sequences, by contrast, are restricted to lengths of 2^m − 1.
• PSSS systems based on the resulting real-valued sequences require up to 40% less ASIC area than PSSS combined with two external matched filters. External matched filters are one of the possible high-speed hardware realizations of PSSS orthogonalization proposed in [16] (more in sections II.D and II.E).
• The proposed real-valued sequences reduce the resulting PAPR of the PSSS transmission. In our case, we can reduce the PAPR by up to 38% without any penalty in BER and EbN0.
FIGURE 2. PSSS receiver with 1 b/s/Hz. Figure adopted from [7].
The organization of the paper is as follows. Section II introduces PSSS and gives a quick overview of the PSSS literature. Section III presents our algorithm for generating PSSS sequences. The achieved results and their analysis, together with a comparison with the state of the art, are provided in section IV. The final conclusions are drawn in section V.

II. PSSS PHYSICAL LAYER
In this section, we introduce PSSS modulation, discuss the problem of autocorrelation sidelobes, propose our solution to mitigate these sidelobes, and compare it to the solutions from the literature. Figs. 2 and 3 depict a wireless PSSS receiver and transmitter in the most straightforward, high-speed architecture with 1 bit/s/Hz spectral efficiency [7], [19]. PSSS can be defined over different spreading sequence lengths, and usually, 15 ≤ N ≤ 255 [7], [9], where N denotes the spreading code length. We target the shortest sequences (3 ≤ N ≤ 30), which ease the hardware implementation, mainly of the analog receiver's integrator. In [7] and [20]-[22], the N parameter is set to 15 and corresponds to m-sequences with 15 chips. Due to the hardware limitations explained in [7] and [20]-[22], we investigate spreading sequences of similar lengths to ensure that analog PSSS realizations are possible.

A. PSSS BASICS
In our scheme, the input user bits, denoted by x in (1), are multiplied with cyclically shifted real-valued spreading sequences, represented by s in (2), and then added in the time domain, resulting in the transmitter baseband signal t in (3), where M (the spreading matrix) is defined in (4). Accordingly, the multi-level elements of t in (3) are given by (5). At the receiver side, the received baseband sequence becomes r, defined by (6), where the vector n denotes channel noise originating from an AWGN or Rayleigh channel. The receiver correlates the signal r with the spreading sequences stored in the spreading matrix M by performing the multiplication in (7). The received bit values x̂ are estimated by checking the sign of the elements of the correlation vector c in (7) (hard-decision detection):
x̂ = sign(c). (8)
The equivalent receiving circuit can be realized as shown in Fig. 2. The analog correlators and integrators in Fig. 2 are equivalent to the vector-by-matrix multiplication in (7). Digital implementations of similar PSSS systems in FPGA are demonstrated in [5] and [9]. A high-speed hard-decision detection can also be realized by replacing the ADCs and LLR estimators (Fig. 2) with binary comparators. PSSS allows processing the baseband signal in an analog circuit and sampling the soft bit values in a parallel ADC array (Fig. 2). In our case, we use up to 30 ADCs, and therefore the conversion speed can be reduced by a factor of up to 30 (N ≤ 30). This reduction is the main reason for selecting PSSS for high-speed wireless communication at 100 Gbps.
The PSSS algorithm investigated in this paper is slightly different from the original method published in 2004 [5], [6]. In our case, we use real-valued sequences on the transmitter and receiver sides in (3) and (7), while the standard PSSS method is derived for a combination of unipolar and bipolar m-sequences defined in the binary domain. In the standard PSSS, the transmitter uses the unipolar {0, +1} m-sequence for encoding, while the receiver correlates using the same m-sequence in the bipolar {−1, +1} representation. For example, sequence '1001110' is used in the transmitter, and the receiver uses '+1−1−1+1+1+1−1'. On both the transmitter and receiver sides, the data is encoded as {−1, +1}. In our variant, we use dedicated PSSS sequences defined in the R domain with values in the range [−1, +1] for the transmitter and receiver (identical sequences are used on both sides). These sequences are the critical element of our scheme and are specially generated for the PSSS system defined in (1)-(8).
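The signal flow of (1)-(8) can be illustrated with a minimal Python sketch. It is not the Matlab implementation; it assumes that the rows of M are cyclic right-shifts of s, that the transmitter computes t = xM, and that the receiver computes c = rM^T. The function names and the length-4 toy sequence [1, 1, 1, −1] (a binary sequence with zero cyclic sidelobes) are chosen only for this example.

```python
import random

def spreading_matrix(s):
    """Circulant matrix M; row j is s cyclically shifted right by j (assumed layout)."""
    n = len(s)
    return [[s[(i - j) % n] for i in range(n)] for j in range(n)]

def psss_transmit(x, M):
    """t = x M: sum of the shifted sequences, each weighted by one data bit."""
    n = len(x)
    return [sum(x[j] * M[j][i] for j in range(n)) for i in range(n)]

def psss_receive(r, M):
    """c = r M^T (one correlator per row of M), followed by a hard decision."""
    n = len(r)
    c = [sum(r[i] * M[k][i] for i in range(n)) for k in range(n)]
    x_hat = [1 if v >= 0 else -1 for v in c]
    return c, x_hat

s = [1, 1, 1, -1]                 # toy sequence with zero cyclic sidelobes
M = spreading_matrix(s)
x = [1, -1, -1, 1]                # user bits encoded as {-1, +1}
t = psss_transmit(x, M)           # multi-level transmit symbol

rng = random.Random(0)
r = [v + rng.gauss(0.0, 0.1) for v in t]   # AWGN with sigma = 0.1 (illustrative)
c, x_hat = psss_receive(r, M)
print(t, x_hat)
```

Because the rows of this toy M are mutually orthogonal, the noise-free correlator outputs are simply the data bits scaled by the mainlobe amplitude.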

B. SYSTEM MODEL WITH 2 b/s/Hz SPECTRAL EFFICIENCY
This subsection shows the investigated PSSS model with an IQ-upconverter (Fig. 4). The multiply-and-add functionality shown on the left side of Fig. 4 is equivalent to the vector-by-matrix multiplication defined in (3). The right side of Fig. 4 is a standard IQ-upconverter, which is identical to QPSK and QAM upconverters. In this paper, we target a carrier frequency of 240 GHz [23]. Fig. 5 depicts the BER performance of the presented system with 2 b/s/Hz spectral efficiency. All curves are generated with the typically used m-sequences [5]-[7], [16]-[19]. PSSS, as defined in (1)-(8) with bipolar m-sequences used as encoding and decoding spreading codes (blue curve in Fig. 5), shows poor BER performance compared to QPSK and the standard PSSS method [5], [6] (red-dashed curve in Fig. 5). In the next subsection, we show that autocorrelation sidelobes cause the poor BER performance. To the best of our knowledge, only three publications address and solve this issue [15], [16], [18]. The authors in [18] propose an iterative symbol detection algorithm in the receiver. This solution causes bit detection dependencies, leading to hardware complexity, especially when analog correlators and integrators at ≥100 Gbps are targeted [20], [22]. In [15], a method based on DC-level correction for binary Barker codes is presented. The authors in [16] propose a matched filter technique based on matrix factorization to eliminate the sidelobes of bipolar m-sequences. Currently, this technique shows the highest BER performance and is the most promising solution for PSSS.

C. AUTOCORRELATION SIDELOBES IN BASIC PSSS SCHEME
The m-sequence of length 15 (denoted as m-sequence-15) has sidelobes of value −1 for each circular shift (Fig. 6). These sidelobes strongly influence the DC levels at the outputs of the correlators ('C'-elements in Fig. 2) by adding an individual DC offset for each PSSS symbol in (8). Assuming that the data bits in (1) follow a uniform distribution and contain a similar count of −1's and +1's, the sidelobe contributions with positive and negative amplitude largely cancel when accumulated. In such a case, the PSSS decoder with bipolar codes can operate as described in (1)-(8). If the counts of positive and negative bits in the user data are not balanced, the signal-to-noise ratio (SNR) is reduced. When the PSSS symbol data in (1) is dominated by −1's or +1's, the decoder cannot reliably extract the data bits from the symbol in (8): the sidelobes interfere constructively and cause a strong DC offset in the correlators, and the demodulator fails. To demonstrate this effect, we show an IQ-constellation in an AWGN channel at SNR = 30 dB, i.e., the signal power is 1000 times higher than the noise power (Fig. 7). The distance between the constellation points is defined by the PSSS correlation values (elements of c in (7)). Although the constellation is similar to QPSK, the QPSK mapper and demapper are not used. Instead, PSSS with 2 b/s/Hz, as shown in Fig. 4 and (1)-(8), is employed for modulation. The noise marked by the red circles in Fig. 7 is the main reason for the poor BER performance of bipolar PSSS. A different DC offset influences each PSSS symbol in (8), and therefore, we observe four "clouds" in the constellation. To overcome this problem, we use real-valued spreading sequences with attenuated autocorrelation sidelobes. To generate these sequences, a genetic algorithm is used.
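The sidelobe effect can be reproduced numerically with a short Python sketch. It assumes an m-sequence-15 generated by the primitive polynomial x^4 + x + 1 (the specific sequence used in the paper is not stated) and the same bipolar sequence on both the transmitter and receiver sides; the helper names are hypothetical. Under these assumptions, each noise-free correlator output equals 16·x_k minus the sum of all data bits in the symbol, so the DC offset grows with the imbalance of the data.

```python
def cyclic_autocorrelation(s):
    """Cyclic autocorrelation of s for all shifts k = 0 .. N-1."""
    n = len(s)
    return [sum(s[i] * s[(i + k) % n] for i in range(n)) for k in range(n)]

# m-sequence of length 15 from the recurrence a[n+4] = a[n+1] XOR a[n]
# (primitive polynomial x^4 + x + 1, assumed for this example).
bits = [1, 0, 0, 0]
while len(bits) != 15:
    bits.append(bits[-3] ^ bits[-4])
m15 = [1 if b else -1 for b in bits]

acf = cyclic_autocorrelation(m15)   # mainlobe 15, every sidelobe -1

def correlator_outputs(x, s):
    """Noise-free correlator outputs when s is used for spreading and despreading."""
    n = len(s)
    a = cyclic_autocorrelation(s)
    return [sum(x[j] * a[(j - k) % n] for j in range(n)) for k in range(n)]

balanced = [1, -1] * 7 + [1]        # bit sum = +1: small DC offset
skewed = [1] * 14 + [-1]            # bit sum = +13: strong DC offset

print(acf)
print(correlator_outputs(balanced, m15))   # values +15 and -17
print(correlator_outputs(skewed, m15))     # +1 bits are left with a margin of only 3
```

For the skewed symbol, the decision margin of the +1 bits shrinks from 15 to 3, which illustrates why unbalanced data degrades the SNR at the correlator outputs.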

D. SIDELOBE MITIGATION ALGORITHMS
Before the work in [15]-[18] was published, PSSS could not achieve the BER performance of standard modulation systems due to the sidelobes and the resulting DC offsets. The authors in [18] propose an iterative detection algorithm for PSSS, and this is the first published solution that closes the performance gap to the standard BPSK, QPSK, and QAM systems. Unfortunately, the iterative detection algorithm shows a residual error floor for some sequences. Its implementation also increases the PSSS receiver complexity, especially for analog receiver realizations. The authors in [16] and [17] propose a more robust algorithm to orthogonalize PSSS. This solution shows excellent performance when applied to m-sequences, but requires matrix decomposition methods. The key idea is to factorize a circulant matrix initialized with cyclically shifted m-sequences (since the matrix is circulant, it is diagonalizable by the DFT matrix).
The authors in [15] show a simple method to improve the BER performance of PSSS combined with Barker spreading codes. This is the simplest and most straightforward solution, but it shows a constant ∼0.2 dB EbN0 penalty compared to the matrix factorization proposed in [16] and [17]. The main advantage of this method is its low complexity and straightforward hardware realization. Some of the previously mentioned methods are explained and compared in [24].
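The remark that a circulant matrix is diagonalized by the DFT can be checked with a small Python sketch. It only verifies this general property and does not reproduce the factorization of [16]; the row layout of M (row j is s shifted right by j) and all function names are assumptions made for this illustration. The product Mv is computed once directly and once through the DFT, where the DFT of the first column of M plays the role of the eigenvalues.

```python
import cmath

def dft(v, inverse=False):
    """Plain O(N^2) discrete Fourier transform (inverse includes the 1/N factor)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [z / n for z in out] if inverse else out

def circulant(s):
    """Row j is s cyclically shifted right by j (assumed layout of M)."""
    n = len(s)
    return [[s[(i - j) % n] for i in range(n)] for j in range(n)]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def circulant_matvec_dft(s, v):
    """M v computed in the DFT domain: element-wise product with the eigenvalues."""
    n = len(s)
    first_column = [s[(-j) % n] for j in range(n)]
    eigenvalues = dft(first_column)
    spectrum = [lam * z for lam, z in zip(eigenvalues, dft(v))]
    return [z.real for z in dft(spectrum, inverse=True)]

s = [1, -1, -1, 1, 1, 1, -1]            # bipolar form of '1001110'
v = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0]
print(matvec(circulant(s), v))
print(circulant_matvec_dft(s, v))       # same values up to rounding error
```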

E. SIMILARITIES BETWEEN MATCHED FILTERS AND REAL-VALUED SEQUENCES
The authors in [16] orthogonalize the circulant matrix M in (4), which is initialized with binary m-sequences, to remove the constant cyclic-autocorrelation sidelobes caused by the sequences. In this paper, we employ fully orthogonal real-valued sequences for M in (4). Our sequences do not have sidelobes and are generated specifically for PSSS. In some cases, both approaches might lead to the same values in M, but the two ideas were developed entirely independently. In both solutions, the values of M are manipulated so that all N rows in M are orthogonal to each other. The methods have different advantages but achieve identical BER performance, which is ∼2.5 dB better than for the standard PSSS [4] with binary m-sequences. The matched-filter orthogonalization focuses on the factorization of the matrix M. Our method instead follows the idea of finding spreading sequences by a heuristic procedure proposed in 1974 in [25], with the difference that we employ more powerful genetic algorithms, which became widespread some years later. We also search for the codes in an application-oriented way, because we focus on cyclic-autocorrelation and PAPR properties, not on the standard linear autocorrelation.
In general, the authors in [16] propose three orthogonalization methods for m-sequences. In this paper, we selected the solution based on external matched filters due to its hardware-friendly CMOS implementation. The PSSS variant based on matched filters consists of a PSSS transmitter, a transmitter filter, a receiver filter, and a PSSS receiver. Although the authors do not mention it in their paper, this architecture has one crucial practical advantage. The DSP of their scheme comprises four modules. The PSSS transmitter and receiver work in the binary domain, and their architectures are simple and achieve very high throughput. The effort to reject sidelobes is divided between the two filters. Due to the matrix-by-vector implementation in these filters, the DSP critical path is split into short sections of similar delay. This pipelined structure allows high clock frequencies and throughput in digital CMOS realizations. Moreover, the filter logic uses only two coefficient values, due to the repetitive nature of the K_OPSSS matrix in [16]. Therefore, its implementation can be realized with minimal energy overhead. In the case of m-sequence-15, 87% of the filter logic does not update its values and remains idle due to the repetitive multiplication coefficients. Although the authors of [16] do not discuss these advantages, the architectural benefits of their matched filter scheme become visible after performing VHDL optimizations at the RTL level. We implement our real-valued sequences and the matched filters in FPGA and ASIC technology and compare them in section IV.E.

III. WORK DETAILS

A. GENETIC ALGORITHM-BASED PSSS SEQUENCES
The best BER performance is achieved when a PSSS sequence with cyclic-autocorrelation sidelobes ≈0 is selected. We search for sequences with the lowest possible sidelobes and the highest possible autocorrelation peaks using a genetic algorithm prototyped in Matlab (Fig. 8). This algorithm is a six-stage procedure that starts with randomly generated sequences. These sequences are called the "initial population." In the selection stage (Fig. 8), the sequences with favorable cyclic-autocorrelation properties are selected for later processing. All sequences with weak cyclic-autocorrelation properties are deleted and replaced with new sequences called the "offspring generation." The offspring sequences are derived as modifications of the parent sequences with the best cyclic-autocorrelation properties in the "rounding," "crossover," and "mutation" stages (Fig. 8). These recombination schemes are performed with the expectation that one of the newly created sequences may have better cyclic-autocorrelation properties than the parent sequences. After that, the procedure is repeated until at least one sequence achieves the targeted correlation properties. The termination criterion, in our case, is that the sidelobe amplitudes are lower than the predefined threshold t = 0.001 (more in sections III.C and III.D). The loop is repeated for at least 10^4 iterations, and the whole algorithm has to be executed only once. When the desired sequences are found, the coefficients are saved, and there is no need to repeat the procedure. Therefore, a precise optimization of the algorithm is unnecessary, and only a coarse adjustment is required to reach the termination criterion in a deterministic time.

B. POPULATION GENERATION AND CHROMOSOME CODING
The initial population is generated randomly from uniformly distributed binary random numbers in {−1, +1}. Each chromosome belonging to the population consists of N real numbers (genes), with 3 ≤ N ≤ 30. The number of genes equals the targeted spreading sequence length (N), and the genes directly represent the elements s_1, ..., s_N of s in (2). The algorithm has been prototyped with an arbitrarily selected population size of 30 individuals, which is typical for similar implementations [26]. After that, the fitness function, mutation scheme, and recombination strategies have been investigated and optimized for the initial population attributes (size, encoding, starting values). Thus, changing the starting population parameters requires modifications in other parts of the algorithm. The algorithm shows good convergence when the population size is in the range of 25-35 individuals, and it is recommended to stay within this range. Fig. 9 depicts the relation between the population size and the execution time for finding sequences with N = 10.
The initially generated population should consist of {−1, +1} binary numbers only. If the algorithm starts with random real values in the range [−1, +1], the average execution time increases by 29%. Moreover, the genes should be represented as double-precision floating-point numbers. A single-precision data type increases the average execution time by 18%. In general, the algorithm is sensitive to changes, and it is recommended to keep all the given boundaries. Otherwise, the recombination and mutation methods may not work efficiently. Most algorithm parameters were determined experimentally and were prototyped in a Matlab implementation.
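The population setup described above can be sketched in a few lines of Python. This is an illustrative sketch and not the Matlab prototype; the function name and the seed are assumptions. Python floats are double-precision, which matches the recommended gene data type.

```python
import random

def initial_population(n_chips, pop_size=30, rng=None):
    """Return pop_size chromosomes, each with n_chips genes drawn from {-1.0, +1.0}."""
    rng = rng or random.Random()
    return [[float(rng.choice((-1, 1))) for _ in range(n_chips)]
            for _ in range(pop_size)]

population = initial_population(10, rng=random.Random(42))   # N = 10, 30 individuals
print(len(population), population[0])
```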

C. FITNESS FUNCTION
We use two values to assess each individual s in (2) in the population. Firstly, we compute the maximum of each individual's cyclic-autocorrelation function in (9).
This value is denoted by m_1 in Fig. 10 and represents the amplitude of the cyclic-autocorrelation mainlobe. The value m_1 is obtained by finding the maximal element of the sM multiplication in (9), which is equivalent to evaluating the cyclic-autocorrelation function for s. Secondly, we search for the second maximum, denoted as the max_2 function in (10), to get the highest sidelobe amplitude. This value is denoted by m_2 in Fig. 10 and (10)-(11). The difference between m_1 and m_2 gives the fitness value f in (11) and is computed for each individual in the population at each iteration. The parameters α and p in (11) are needed to reduce the algorithm's execution time, and g_PAPR is used to reduce the resulting PAPR; all these additional variables are explained later. For now, assume that g_PAPR = 0, i.e., this parameter is not used. We introduce it in section IV.C. The fitness value f indicates which sequences are favorable in terms of cyclic autocorrelation. If the difference between m_1 and m_2 is high, the f value is also high, and the investigated sequence has good cyclic-autocorrelation properties and should be used at a later algorithm stage. All sequences with low f are removed from the population. Thus, the algorithm searches for sequences that give the highest f values. This means that the mainlobe m_1 is maximized and the sidelobes m_2 are minimized (m_1 = max f(s); m_2 ≈ 0). As a result, sequences such as those shown in Fig. 11 are obtained.
For sequences with N = 30, at least 10^7 iterations are needed. The algorithm's operation as a function of time for N = 10 is shown in Fig. 12.
Eq. (11) additionally uses α and p to improve the algorithm's convergence. The maximal sidelobe amplitude m_2 is scaled by the amplification factor α in (11). The value of α is proportional to the code length and is defined in (12). The higher α is, the more aggressively the algorithm optimizes the sidelobe amplitudes, at the cost of a reduced mainlobe amplitude. A higher α speeds up convergence and reduces the execution time, but its value should not exceed 0.35N. Otherwise, the optimal sequence cannot be found, and a sequence with a reduced main-peak amplitude is returned instead.
α depends on N because longer sequences have higher autocorrelation peaks. To keep a quasi-constant relation between the mainlobe and sidelobe amplitudes, and thus quasi-constant convergence, m_2 is scaled by α. This is required because the algorithm can be executed for various N: short sequences with N = 3 have a significantly lower mainlobe amplitude than sequences with N = 30. Therefore, we adjust the slope of f in (11) by employing α, which depends on N. The value of α has been set experimentally to 0.3N in our scripts. The influence of α on the execution time for N = 10 is depicted in Fig. 13. Additionally, we use the penalty value p in (11), explained in the next subsection. In general, the genetic algorithm has to find the argument s that maximizes f(s) in (11), which corresponds to the sequence with the highest cyclic-autocorrelation peak (m_1) and sidelobe amplitudes approaching zero (m_2 ≈ 0). In the implementation, f(s) is handled as a simple floating-point value f, calculated for each s and stored in a one-dimensional array of 30 elements (one f value for each individual in the population; the population size is 30). For this reason, we refer in this paper to f values rather than to the function f(s), since this is closer to how the fitness is implemented in a program routine.
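The fitness evaluation can be sketched in Python as follows. Since (9)-(11) are not reproduced here, the sketch rests on stated assumptions: f = m_1 − α·m_2 − p with g_PAPR = 0, α = 0.3N, and m_2 taken as the largest absolute cyclic-autocorrelation value over all non-zero shifts. The function names are hypothetical, and the penalty p is passed in as a plain number.

```python
def cyclic_autocorrelation(s):
    """Cyclic autocorrelation of s for all shifts k = 0 .. N-1."""
    n = len(s)
    return [sum(s[i] * s[(i + k) % n] for i in range(n)) for k in range(n)]

def fitness(s, p=0.0):
    """Return (f, m1, m2) for one individual s under the assumptions stated above."""
    n = len(s)
    acf = cyclic_autocorrelation(s)
    m1 = max(acf)                          # mainlobe amplitude (zero shift)
    m2 = max(abs(v) for v in acf[1:])      # highest sidelobe amplitude (assumed: absolute value)
    alpha = 0.3 * n                        # experimentally chosen scaling from the text
    return m1 - alpha * m2 - p, m1, m2

# Bipolar form of the m-sequence '1001110': mainlobe 7, sidelobes of amplitude 1.
f_m7, m1_m7, m2_m7 = fitness([1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0])

# One f value per individual, stored in a one-dimensional list.
population = [[1.0, 1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
f_values = [fitness(s)[0] for s in population]
print(f_m7, f_values)
```

In this toy population, the first individual has zero sidelobes and therefore receives the higher f value, so it would survive the selection stage.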
Genetic algorithms allow us to investigate complicated problems in a simple semi-Monte-Carlo fashion, and in our case, we evaluate the properties of f(s) without analytical approaches. The domain of f(s) is real-valued and comprises up to 30 variables when a sequence with N = 30 is searched. We limit the applicability of our algorithm to N ≤ 30. For longer sequences, the execution time becomes long, and the algorithm is not optimized for solving such complex problems (long spreading sequences in PSSS cause high PAPR and should be avoided anyway). Moreover, for practical applications targeting 100 Gbps wireless communication with analog PSSS, N is limited to ∼15 due to the correlator complexity [7], [20], [22]. For digital realizations, the correlator and integrator sizes also limit the clock frequency (more in section IV). Our solution supports the targeted length of N ≤ 15 well. The genetic algorithm is executed only once, and the generated s sequences required for M in (4) are stored in a file. Therefore, the execution time of the genetic algorithm is not critical.

D. PENALTY VALUE (P)
In genetic algorithms, constraints are handled using penalty functions, which penalize infeasible solutions by reducing their fitness values (f in (11)). We use p to exclude all codes with a sidelobe amplitude m_2 ≥ t. In such a case, p is set by a penalty function of m_2 (Fig. 15) with slope β = 6, a saturation level of 4, and t = 0.001. The threshold t = 0.001 for m_2 is an arbitrarily selected value for implementation purposes in our Matlab scripts. In fact, t should not be higher than 0.01, as indicated in Fig. 14, which depicts the BER performance of PSSS, compared to analytical QPSK, when spreading sequences with different sidelobe amplitudes are used. This means that setting t to 0.001 allows us to generate s sequences with ∼10 times higher precision than required for our floating-point simulation model. In section IV, we also investigate the influence of quantization on the spreading sequences. Setting 6-bit precision for the sequence coefficients typically leads to sidelobes with an amplitude of ≈0.05. Thus, setting t below 0.05 does not bring any visible benefit in the BER performance of our CMOS implementation, because the sidelobes increase to ≈0.05 after the 6-bit quantization. The prototyped genetic algorithm therefore provides sequences with ≈50 times higher precision than required for our CMOS-related investigations shown in section IV.
β and the saturation level are also selected arbitrarily and have a relatively low impact on the execution time. The reason is that the algorithm reduces the sidelobes quickly, so p dominates the evaluation of f only at the beginning of the execution. After the initial sidelobe reduction, the main peak amplitude is dominant in f (Fig. 12), and the sidelobes are low enough that p ≪ f. When the algorithm starts, the relation is the opposite (p ≫ f). The saturation level limits p and thereby additionally reduces its importance at the beginning of the program execution. β represents the slope of p (Fig. 15). These values have been selected experimentally in our case, and any value in the range [4, 6] is acceptable for both parameters. In some cases, selecting parameters outside this range might require adapting α. Different parameter values lead to different penalty function results, but the ordering of the individuals is preserved in most cases, which still allows distinguishing better from worse individuals in the selection phase. Individuals are thus ranked only as having lower or higher sidelobes; the absolute difference between the sidelobe amplitudes is ignored. In general, p is required to exclude all m-sequences, Barker, APAS [27], Arasu [28], and similar binary sequences that the algorithm may find. In some cases, the algorithm gets stuck in a local maximum of f(s) caused by one of these sequences, and p is required to prevent this. Therefore, the absolute values of β and the saturation level matter little, as long as these binary sequences are successfully dropped from the population. How strongly these sequences are penalized, i.e., the absolute value of p, is unimportant as long as the algorithm does not get stuck in local maxima. To control the algorithm's convergence, we use the α parameter instead. Therefore, β and the saturation level are constant in our implementation.
Note, however, that all three parameters interact through the penalty function evaluated in (11). Changing all of them at the same time is not advised. The value of p as a function of m_2 is shown in Fig. 15. The p function may also be composed differently, and the algorithm will still work and converge to correct results.
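Because the text notes that p may be composed in different ways, the following Python sketch shows one hypothetical composition and is not the penalty function used in the Matlab scripts. It assumes that p is zero below the threshold t and otherwise grows linearly with slope β until it saturates; the function name, the parameter names, and the piecewise-linear shape are assumptions.

```python
def penalty(m2, t=0.001, beta=6.0, saturation=4.0):
    """Hypothetical penalty: 0 below the threshold, else a saturating linear ramp."""
    if m2 >= t:
        return min(beta * m2, saturation)
    return 0.0

# Sidelobes below t are not penalized; larger sidelobes are penalized up to the cap.
for m2 in (0.0005, 0.5, 1.0):
    print(m2, penalty(m2))
```

With this shape, a binary sequence with sidelobe amplitude 1 receives the full penalty, while a sequence that already satisfies the termination criterion is not penalized at all.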
The importance of p can be explained by comparing two exemplary spreading sequences, sequence-A and sequence-B. When f is evaluated without p, sequence-B has a higher fitness value and is preferred by the algorithm: sequence-B will be used in the successive iterations, and sequence-A will be deleted. Fig. 17 depicts the BER performance of PSSS based on both sequences, where the resulting power of the modulated signals is normalized to 1 W in both cases. Sequence-A achieves good performance and underperforms the optimal solution by only ∼0.5 dB. Sequence-B does not allow reliable data transfer. Thus, the genetic algorithm has selected the wrong sequence and cannot reach the optimal solution; the fitness function f without p is poorly composed. If both fitness values are recalculated including the penalty values p, then f_A ≈ 2.91 and f_B = 0.9. Thus, sequence-B is removed, and sequence-A is used in a later processing stage. Note that the absolute values of p and f are unimportant. Only the ordering of the f values of both sequences is considered. Thus, the parameters α, β, and the saturation level used in the genetic algorithm are relatively flexible and do not have to be estimated precisely. The main reason for the weak performance of sequence-B is its high sidelobes, which cause strong interference between the parallel data streams (the lack of orthogonality, Fig. 16). Thus, p is needed to amplify m_2 selectively when f is evaluated.

E. SELECTION
In the selection phase, we select the 18 fittest individuals and use their genes for reproduction. The offspring generation replaces the 12 sequences with the lowest f in the population. The combination of 18 + 12 yields the fastest convergence of the algorithm and is explained in detail in the following subsections.
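The 18 + 12 split can be sketched in Python as a simple ranking by fitness. The function name and return values are assumptions made for this illustration, and the demonstration values stand in for real chromosomes.

```python
def select(population, f_values, n_parents=18):
    """Rank individuals by f; return the parents and the indices to be replaced."""
    order = sorted(range(len(population)), key=lambda i: f_values[i], reverse=True)
    parents = [population[i] for i in order[:n_parents]]
    discarded = order[n_parents:]          # the 12 individuals with the lowest f
    return parents, discarded

# Demonstration with 30 placeholder individuals whose fitness equals their index.
demo_population = [[float(i)] for i in range(30)]
demo_f = [float(i) for i in range(30)]
parents, discarded = select(demo_population, demo_f)
print(len(parents), sorted(discarded))
```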

F. GENERATION OF THE OFFSPRING POPULATION
To generate the offspring population, we use two recombination schemes described in the following two paragraphs.

1) ROUNDING
Two offspring individuals are formed from a randomly selected parent pair, in which a single gene has been rounded (one gene in each parent). This strategy allows the sequence coefficients to converge to {−1, +1}, or to any other number with limited floating-point precision. Without this rounding, the genetic algorithm has great difficulty approaching {−1, +1} and numbers with limited floating-point precision. For example, a coefficient in s may be equal to +0.89(9)… and can reach neither +1.0 nor +0.9, although doing so would improve the value of f. In our case, we round the code coefficient to retain 3 significant digits, and we provide exactly two rounded genes per iteration. Otherwise, the execution time increases. Fig. 18 depicts the impact of the rounding precision on the average execution time of the algorithm for N = 10, while Fig. 19 depicts the impact of the number of rounded genes per iteration on the execution time.
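A minimal sketch of the rounding recombination, assuming that each of the two parents contributes one child with exactly one randomly chosen gene rounded to 3 significant digits. The helper names `round_sig` and `rounding_offspring` and the example coefficients are hypothetical.

```python
import math
import random

def round_sig(x, digits=3):
    """Round x to the given number of significant digits."""
    if x == 0.0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)

def rounding_offspring(parent_a, parent_b, rng, digits=3):
    """Copy each parent and round one randomly chosen gene in each copy."""
    children = []
    for parent in (parent_a, parent_b):
        child = list(parent)
        k = rng.randrange(len(child))
        child[k] = round_sig(child[k], digits)
        children.append(child)
    return children

rng = random.Random(1)
a = [0.8999999, -0.3141592, 0.5000001]   # hypothetical parent coefficients
b = [-0.9999871, 0.1234567, 0.7071068]
children = rounding_offspring(a, b, rng)
print(children)
```

With this rounding, a coefficient such as 0.8999999 snaps to 0.9 and −0.9999871 snaps to −1.0, which is the behaviour the paragraph above describes.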

2) CROSSOVER
Ten new individuals are generated as the arithmetic average of two randomly selected parents and added to the offspring population. Fig. 20 depicts the impact of the number of crossovers on the average execution time of the algorithm for N = 10. The fastest execution time is achieved when the crossover is executed ten times and generates ten individuals. Thus, the offspring population consists of 12 individuals in total: the crossover function generates ten of them, and the rounding strategy generates the remaining two.
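The arithmetic-average crossover can be sketched as below. The function name `crossover` and the random parent pool are illustrative assumptions; only the averaging of two randomly selected parents and the count of ten children are taken from the text.

```python
import random

def crossover(parents, rng, n_children=10):
    """Create children as the element-wise average of two distinct random parents."""
    children = []
    for _ in range(n_children):
        a, b = rng.sample(parents, 2)
        children.append([(x + y) / 2 for x, y in zip(a, b)])
    return children

rng = random.Random(2)
# Assumed parent pool: 18 random sequences of length N = 10.
parents = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(18)]
offspring = crossover(parents, rng)
print(len(offspring))
```

Together with the two children from the rounding strategy, this yields the 12 offspring individuals mentioned above.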

G. MUTATION
Mutation is required to avoid getting stuck in local maxima of f(s), and it significantly affects the execution time of the algorithm. After the recombination phase, we insert exactly 15 genes at random positions in random individuals. If the number of mutated genes is <15, the algorithm works slower but is still able to find the approximate maximum of f(s) (Fig. 21). If the number of mutations is >15, the method behaves like a Monte-Carlo method, and for long codes (N > 20) it does not work (it is unable to solve (11) with more than 20 variables in ℝ). Therefore, we limit the number of mutations to 15. After the mutation process, the procedure repeats, and the newly generated offspring population is assessed by (11). The stopping criterion is the maximal sidelobe value m_2 < t, and in our case, t = 0.001.
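A minimal sketch of the mutation step. The text fixes only the number of mutated genes (15). The value range [−1, 1] for new genes, the choice to mutate across the whole population, and the function name `mutate` are assumptions made for illustration.

```python
import copy
import random

def mutate(population, rng, n_mutations=15, low=-1.0, high=1.0):
    """Overwrite n_mutations randomly chosen genes with new random values (in place)."""
    for _ in range(n_mutations):
        i = rng.randrange(len(population))      # random individual
        k = rng.randrange(len(population[i]))   # random gene position
        population[i][k] = rng.uniform(low, high)
    return population

rng = random.Random(3)
population = [[0.0] * 10 for _ in range(30)]    # assumed size and length
before = copy.deepcopy(population)
mutate(population, rng)
changed = sum(x != y for p, q in zip(population, before) for x, y in zip(p, q))
print(changed)
```

Because positions are drawn independently, the same gene can be hit twice, so the number of distinct changed genes is at most 15.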

IV. RESULTS
This section discusses the BER performance of the proposed PSSS sequences in AWGN and Rayleigh channels. Later, we compare our solution with other methods from the literature. In the end, hardware implementation in 28 nm CMOS is investigated.

A. THEORETICAL ANALYSIS OF BER PERFORMANCE IN AWGN
The theoretical BER for 1 b/s/Hz PSSS systems is derived in [18] and is given, in a simplified form, in (14). The EbN0 penalty factor γ is given in (15) [18], where ε_i represents the cyclic correlation result for sequence i. The probability Pr(N, q) follows a binomial distribution [18]. The variable ε represents the amplitude of the main autocorrelation peak in the receiver correlators. Due to the sidelobes of m-sequences (Fig. 6), this peak usually differs from N. Its value is influenced by the ±1 sidelobe amplitudes, which depend on the transmitted data. Thus, the above-mentioned equations depend on q and u, which allow estimating the height and the sign of the accumulated sidelobe values. In our case, we use quasi-orthogonal spreading sequences generated by the genetic algorithm. The main optimization criterion of the algorithm is to minimize the sidelobe amplitudes, which is equivalent to improving the orthogonality between the PSSS sequences. If the sequences become orthogonal, the sidelobes will be 0. Thus, the ε variable in our case is approximately equal to ε_i(N, q) ≈ d · N (18), where d ∈ {−1, +1}.
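The effect of the sidelobes on the correlator peak can be illustrated numerically. The sketch below is not the genetic algorithm of this paper: as a stand-in, it obtains a real-valued sequence with zero cyclic-autocorrelation sidelobes by equalizing the DFT magnitudes of a length-7 m-sequence (an assumption made only for this illustration). All function names and the data vector are hypothetical.

```python
import cmath

def flatten_spectrum(seq):
    """Stand-in construction: force |DFT| = sqrt(N) so all cyclic sidelobes vanish."""
    n = len(seq)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    spec = [sum(seq[t] * w[(k * t) % n] for t in range(n)) for k in range(n)]
    flat = [n ** 0.5 * s / abs(s) for s in spec]
    # Inverse DFT; the result is real because the spectrum stays Hermitian.
    return [(sum(flat[k] * w[(-k * t) % n] for k in range(n)) / n).real for t in range(n)]

def psss_symbol(seq, data):
    """Sum of cyclically shifted copies of seq, each weighted by one data bit."""
    n = len(seq)
    return [sum(data[i] * seq[(t - i) % n] for i in range(n)) for t in range(n)]

def correlate(seq, symbol):
    """Cyclic correlation of the received symbol with every shift of seq."""
    n = len(seq)
    return [sum(symbol[t] * seq[(t - i) % n] for t in range(n)) for i in range(n)]

m_seq = [1, 1, 1, -1, 1, -1, -1]        # bipolar m-sequence, N = 7
real_seq = flatten_spectrum(m_seq)
data = [1, -1, 1, 1, -1, -1, 1]         # hypothetical data bits d
eps_m = correlate(m_seq, psss_symbol(m_seq, data))
eps_real = correlate(real_seq, psss_symbol(real_seq, data))
print(eps_m)
print([round(e, 6) for e in eps_real])
```

For the m-sequence, the correlator outputs deviate from ±N by the accumulated sidelobes, whereas for the sidelobe-free sequence they are d · N up to rounding error.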
Let us now estimate the value of γ(N, q) in (15) for our system. Equation (18) shows that either ε_i(N, q) ≅ N or ε_i(N, q) ≅ −N. Moreover, the sum of the probabilities Pr(N, q) equals 1. Thus, the value of γ(N, q) approaches 1, and the BER equation in (14) reduces to (19). The sum of probabilities in (19) equals 1, so (19) can be further reduced to (20), which is approximately equal to the BPSK BER curve. Thus, the BER performance of our variant of PSSS approximately equals that of a BPSK system. The same can be proven for higher-order modulation schemes.

B. SIMULATION OF BER PERFORMANCE IN AWGN AND RAYLEIGH
… QPSK and QAM-16 in AWGN. Due to strongly reduced sidelobes, PSSS with our sequences approaches QPSK and QAM for each sequence length N, and does not introduce any BER penalty. This allows us to benefit from the parallel ADC architecture and parallel baseband processing without negatively impacting the BER performance. The parallel implementation architecture is the main advantage of the proposed PSSS scheme and is strongly desired for high-speed systems targeting 100 Gbps and more. PSSS based on short spreading sequences does not improve the BER performance, to the best of our knowledge. Fig. 23 compares BER performance for PSSS with 2 b/s/Hz over a multipath Rayleigh channel defined in ITU-R M.1225 [29], reflecting a 3G channel for internet access (outdoor base station to indoor modem communication at 1.9 GHz and a sampling frequency of 3.84 MHz). We use least squares zero-forcing equalization in the time domain with an FIR filter with 30 taps. For such a channel, PSSS cannot approach the performance of QPSK. Fig. 24 compares the performance of the proposed PSSS sequences in a Rayleigh channel defined in [30], where T_c denotes the chip time period. In such a case, PSSS shows better EbN0 performance in the low EbN0 region than the chaos-modulation schemes proposed in [30]. However, our methods show a constant error floor that needs to be compensated by FEC. Nevertheless, our sequences show significantly better BER performance in both Rayleigh channels than the standard PSSS scheme based on m-sequences.
FIGURE 24. Comparison of the simulated PSSS results for an example Rayleigh channel taken from [30].
Although the selected 3G Rayleigh channel at 1.9 GHz allows estimating realistic multipath-fading parameters, our system targets the THz band at 240 GHz. The targeted THz frontend is shown in [23]. A PCV antenna lens with very high directivity is used, and all lateral reflections are strongly attenuated at the receiver. Moreover, the THz-band attenuation is high, and all reflections from objects have relatively low power and quickly disappear. Thus, the resulting multipath propagation is marginal, and the performance in the AWGN channel is close to the actual operating conditions at 240 GHz for point-to-point indoor communication [31], [32]. The presented Rayleigh channels can be considered for PSSS at lower frequencies.

C. PEAK TO AVERAGE POWER RATIO (PAPR) AND TRANSMITTED SIGNAL SPECTRUM
Figure 25 depicts the resulting PAPR for PSSS composed of m-sequences, Barker codes, real-valued sequences, and real-valued sequences optimized for PAPR. For all real-valued sequences listed in Table 1, generated by our genetic algorithm, the PAPR is better than for Barker codes and m-sequences. It is also possible to adjust the genetic algorithm proposed in Section III to search for sequences with reduced PAPR. Thus, it is possible to optimize the searched sequences according to additional properties. In our case, we added the requirement to reduce the PAPR by adding the g_PAPR parameter to (11); it is defined in (21). In short, (21) estimates the PAPR reduction of real-valued sequences in PSSS compared to m-sequences, Barker codes, and some other binary spreading sequences that could be applied in PSSS. To speed up the PAPR computations, we use the approximation in (22). The computation effort of (22) can be reduced further when only one specific PSSS symbol is investigated. Then, the denominator in (22) can be computed as shown in (23).
Note that the simplified PAPR estimation in (22) and (23) can be applied only for our real-valued sequences in PSSS systems, because the average energy of such composed PSSS symbols is flat (Fig. 26).
For m-sequences, it leads to erroneous results due to the imprecisely estimated average power (Fig. 26). For PSSS with m-sequences, the average power needs to be estimated over a significantly longer period than one PSSS symbol. In our algorithm, all binary sequences are dropped by the penalty function in (11). Therefore, this simplified approximation of PAPR can be applied. It is also possible to estimate the average power by statistical methods (we leave this as future work and do not consider it in this paper). Figures 26-28 compare PSSS based on m-sequences with PSSS based on the proposed real-valued sequences with respect to average power, peak power, and PAPR computed within single PSSS symbols. The main advantages of the proposed sequences are constant power across transmitted symbols, reduced peak power variation, and reduced PAPR. Moreover, the real-valued spreading sequences yield a flat spectrum of the PSSS signal (Fig. 29). Some resulting spreading sequences with reduced PAPR are presented in Table 2. Figure 30 compares the BER performance for PSSS with and without PAPR optimizations. Although the PAPR-optimized sequences show a reduced main peak amplitude, no BER performance reduction is observed (the orthogonality condition for M in (3) is preserved).
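The difference in per-symbol average power can be checked with a short script. The sketch below uses the textbook PAPR definition (peak chip power divided by mean chip power within one symbol), not the expressions (21)-(23) of this paper. The sidelobe-free sequence is again produced by equalizing the DFT magnitudes of a length-7 m-sequence, which is an illustrative stand-in for the genetic algorithm. All names are hypothetical.

```python
import cmath
import itertools

def flatten_spectrum(seq):
    """Stand-in construction: force |DFT| = sqrt(N) so all cyclic sidelobes vanish."""
    n = len(seq)
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    spec = [sum(seq[t] * w[(k * t) % n] for t in range(n)) for k in range(n)]
    flat = [n ** 0.5 * s / abs(s) for s in spec]
    return [(sum(flat[k] * w[(-k * t) % n] for k in range(n)) / n).real for t in range(n)]

def psss_symbol(seq, data):
    """Sum of cyclically shifted copies of seq, each weighted by one data bit."""
    n = len(seq)
    return [sum(data[i] * seq[(t - i) % n] for i in range(n)) for t in range(n)]

def symbol_powers(seq):
    """Average power and PAPR of every possible PSSS symbol (all 2^N data words)."""
    n = len(seq)
    avg, papr = [], []
    for data in itertools.product((-1, 1), repeat=n):
        power = [v * v for v in psss_symbol(seq, data)]
        mean = sum(power) / n
        avg.append(mean)
        papr.append(max(power) / mean)
    return avg, papr

m_seq = [1, 1, 1, -1, 1, -1, -1]
avg_m, papr_m = symbol_powers(m_seq)
avg_real, papr_real = symbol_powers(flatten_spectrum(m_seq))
print("m-sequence avg power range:", min(avg_m), max(avg_m))
print("real-valued avg power range:", min(avg_real), max(avg_real))
```

For the sidelobe-free sequence, every symbol has the same average power, which is why a single-symbol estimate of the denominator is sufficient. For the m-sequence, the average power depends on the transmitted data.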

D. COMPLEXITY COMPARISON TO OTHER SIDELOBES MITIGATION TECHNIQUES
The problem of PSSS sidelobes originating from the non-orthogonal codes has been investigated several times in the literature. This section compares our solution to other methods with respect to BER performance and hardware complexity. Firstly, the BER performance of external matched filters inspired by [16], iterative bit detection [18], DC-correction for Barker codes [18], median detection for Barker codes [18], and standard PSSS [4]- [6] is compared with our newly proposed real-valued sequences in Fig. 31.
The proposed method, together with the matched-filter solution, achieves the best results and approaches the performance of QPSK. The other methods underperform by at least 0.25 dB. Moreover, the DC-correction method works only with the Barker code of length 13. The same holds for the median detection technique for Barker-13. The method based on iterative bit decoding shows an error floor for sequences with N = 15 and should not be used for such short sequences. This method should be applied only to sequences with a length of at least 31 chips [16], [17].
The standard PSSS decoder loses ∼2.5 dB and has the lowest performance among the tested algorithms, but its architecture is also the simplest. At this point, the matched-filter technique and the proposed real-valued sequences seem to be the most promising approaches, at the cost of additional processing.
It is essential to mention that the matched filters designed in [16] are needed to remove the sidelobes of the m-sequences and are not used for pulse-shaping purposes. The filtering logic performs a dedicated vector-by-matrix multiplication and works cyclically within a PSSS symbol boundary. The filters cannot be implemented as standard FIR filters.
In general, the limit of N ≤ 15 targeted in this paper is caused by the integrator complexity when a BiCMOS implementation of PSSS is considered [7], [20], [22]. Also, longer sequences show higher PAPR, and their applicability is limited in practical systems. Table 3 compares the complexities of the tested solutions, assuming 2 b/s/Hz, N = 15 for m-sequences, and N = 13 for Barker sequences. The hardware overhead to support our sequences is lower than that of adding two external matched filters. On the one hand, we require multipliers in the receiver, which increases the receiver complexity. On the other hand, all methods that do not use multiplications in the receiver have lower BER performance and underperform standard modulation techniques (e.g., QPSK, QAM-16). Thus, the applicability of these methods in practical applications is questionable, especially since in modern ASIC technologies, the cost of fixed-point multiplications, or of Gilbert cells in the analog domain, is no longer a challenge. Also, low-cost Kintex FPGAs are equipped with thousands of DSP blocks that support multiplication and addition operations in prefabricated hardware blocks. Thus, we see no justification for sacrificing receiver sensitivity in order to reduce the number of multiplications. In the data link layer, forward error correction (FEC) is performed, and FEC usually requires thousands of transistors [33] to gain as much as possible from the received soft bits. FEC tries to correct all bit errors at the cost of very high hardware utilization and power consumption. Thus, reducing the sensitivity of the physical layer to save a few multiplications is a poor design practice. In the remaining part of this paper, we discard all the methods that underperform the QPSK BER performance, and we concentrate only on the real-valued sequences and the matched filters inspired by [16].
Note that our technique supports short spreading codes with any arbitrarily selected length N ∈ [3,30], while all other solutions work with m-sequences with the length restriction of 2^m − 1, or support a Barker code of length 13 only. Thus, our method is significantly more hardware-friendly. Instead of being fixed to m-sequence-7, m-sequence-15, or Barker-13, we can use any sequence in the range of 3 to 30 chips, and we pick the sequence that gives the best performance and lowest overhead in the targeted technology. This is especially important for analog realizations [7], [20], [22].

E. HARDWARE IMPLEMENTATION RESULTS
Although Table 3 gives an overview of hardware complexity for all investigated PSSS methods, our solution and the matched filters inspired by [16] need to be compared further. Firstly, both solutions achieve the highest BER performance and can be considered for practical realizations. Secondly, our sequences require a different hardware architecture than the matched filters. Thus, the question arises which of these two methods is more hardware-friendly. For this reason, we implement both solutions in FPGA and 28 nm CMOS technology and perform a detailed comparison.
Fig. 32 depicts the impact of binary quantization of the real-valued sequences. A 6-bit resolution for storing the spreading sequences is required to approach the QPSK floating-point performance. Fig. 33 shows that DACs and ADCs should support 7-bit to 9-bit sampling precision. The same parameters are investigated for matched filters. The filtering arithmetic requires 5-bit multiplication precision, which means that the multiplication operands can be reduced by 1 bit compared to real-valued sequences. ADC and DAC sampling precision has to support at least 7 bits, identical to the case of real-valued sequences. Fig. 34 depicts BER performance for both solutions when the above-mentioned quantization is applied. As a result, both algorithms have identical BER performance and, in this form, are implemented in VHDL.
Table 4 compares hardware utilization in a Virtex Ultrascale+ FPGA for N = 15 and 2 b/s/Hz. A hardware module that contains a transmitter and a receiver is considered, so that transceiver functionality is implemented. Our solution requires ∼33% fewer flip-flops (FFs) than the matched filters, but the maximal clock frequency is reduced by ∼10%. On the one hand, we do not implement the logic for filters, and we save resources. On the other hand, the receiver's correlator for real-valued sequences is significantly more complicated and reduces the overall clock frequency. It is also the most resource-consuming element in the implementation and consumes ∼5 times more look-up tables (LUTs) than in the case of matched filters. It is worth mentioning that matched filters and the proposed real-valued sequences require approximately 3-4 times more resources than the standard PSSS based on m-sequences (Table 4), but the standard PSSS also shows the lowest BER performance and underperforms the mentioned solutions by ∼2.5 dB.
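The 6-bit storage of the spreading-sequence coefficients discussed above can be illustrated with a simple uniform quantizer. This is a sketch under stated assumptions: the signed fixed-point format with full scale equal to the largest coefficient magnitude is assumed, the example coefficients are hypothetical and not taken from Table 1, and the function name `quantize` is illustrative.

```python
def quantize(seq, bits=6):
    """Uniform signed quantization; full scale is the largest |coefficient| (assumed)."""
    levels = 2 ** (bits - 1) - 1            # 31 positive levels for 6 bits
    scale = max(abs(c) for c in seq)
    codes = [round(c / scale * levels) for c in seq]     # integer codes stored in hardware
    values = [q * scale / levels for q in codes]         # reconstructed coefficients
    return codes, values

# Hypothetical real-valued coefficients, for illustration only.
seq = [1.0, 0.62, -0.38, 0.91, -0.77, 0.15, -0.54]
codes6, values6 = quantize(seq, bits=6)
codes4, values4 = quantize(seq, bits=4)
err6 = max(abs(a - b) for a, b in zip(seq, values6))
err4 = max(abs(a - b) for a, b in zip(seq, values4))
print(codes6)
print(err6, err4)
```

The worst-case coefficient error is bounded by half a quantization step, so each additional bit halves the bound. The BER impact of a given resolution still has to be evaluated by simulation, as done in Fig. 32.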
The presented FPGA results are difficult to interpret. On the one hand, our solution requires significantly fewer FFs than the matched filters. On the other hand, we achieve a lower clock rate and require ∼3% more LUT logic. Thus, a CMOS implementation in 28 nm is additionally investigated, where the LUTs and FFs are reflected in a common silicon area. This makes it easier to compare both solutions and to show normalized results. The 28 nm floorplans are made with Synopsys and Cadence software with default settings. All power-related parameters are obtained using vector-based profiling. All area values include only the core area, and the space required for power rings and pads is excluded. Table 5 depicts the results of this investigation. The consumed power and energy per bit remain at a very similar level for both solutions. Again, our implementation suffers from the complex receiver integrator and achieves a ∼8% lower clock frequency, with the benefit of a ∼35% smaller chip area. As a result, the PSSS transceiver based on real-valued sequences achieves a ∼40% higher data rate per 1 mm² of silicon (0.92/0.65 ≈ 1.42, if the data rate scales with the clock frequency). This comparison shows the main advantage of our architecture, which achieves the same BER performance as the matched filters but requires significantly less CMOS area. The matched-filter solution requires a significantly larger chip area but shows the advantages of well-distributed DSP. The power is dissipated equally between the receiver and the transmitter. The same applies to the consumed area. In this respect, our real-valued sequences have a very unbalanced architecture. The receiver is several times more complicated than the transmitter. Thus, our solution will perform better in applications where a stationary receiver is used and the transmitters are small battery-powered devices. Optimally, the number of transmitters should be larger than the number of receivers. In contrast to our solution, the matched filters perfectly fit networks where the transmitting and receiving effort is equally distributed among communication nodes.
Due to the well-pipelined architecture of matched filters, this solution perfectly fits digital CMOS realizations. The filtering logic is difficult to implement in the analog domain due to its cyclical nature (which is different from FIR logic).
Our solution is significantly friendlier for analog implementations, where we can realize the multiplication as Gilbert cells and represent the sequence coefficients as current sources. Moreover, we can adapt the analog correlation effort with a precision of one chip (bit), because our sequences can be generated with a step of one chip (bit). This is impossible for the matched filters because they are based on m-sequences and have to follow the 2^m − 1 length restriction. Fig. 35 and Fig. 36 compare floorplans for both systems. Our PSSS transmitter and receiver have larger areas due to the employed real-valued sequences and fixed-point arithmetic, but do not require the filter logic. The overhead required for implementing fixed-point arithmetic is lower than the area required for filters. Thus, the implemented chip area for the proposed method is smaller. On the other hand, we see that the logic in matched filters is better balanced between the transmitter and receiver. However, the effort spent on filtering is significantly higher than that of performing PSSS transmission and bit detection.

V. CONCLUSION
This paper addresses Parallel Sequence Spread Spectrum (PSSS) with bipolar spreading codes. Sidelobes of the cyclic-autocorrelation function cause intense DC noise at the output of the receiver's correlators (Walsh-Hadamard sequences cannot be used for PSSS due to its cyclic nature). Thus, the BER performance of PSSS receivers is reduced [5]- [7], [16]- [19]. We show real-valued spreading codes (defined in ℝ), which have strongly suppressed cyclic-autocorrelation sidelobes. Therefore, the noise originating from the sidelobes is avoided, and excellent results are achieved. To the best of our knowledge, no such method to construct PSSS spreading sequences has been proposed before [5]- [7], [16]- [19].
The presented genetic algorithm and PSSS architecture with real-valued sequences have six significant advantages.
• Firstly, the genetic implementation is uncomplicated and can be used without advanced mathematics. Only a structural code of ∼80 lines in Matlab is required to generate the PSSS spreading sequences, and all methods based on matrix factorization are avoided. Therefore, it is simpler to implement than the solution based on matched filters which uses matrix factorization [16]. The general structure of PSSS is unaffected, and only the 6-bit fixed-point arithmetic needs to be added for spreading and correlation.
• PSSS transceivers based on the proposed sequences require ∼40% less CMOS area, providing the same performance as matched filters. Moreover, our algorithm has higher BER performance than the standard PSSS [4], median detection [15], DC-correction [15], and iterative bit detection [18] methods.
• The genetic algorithm generates PSSS sequences of any chosen length in the range of [3,30], which is not shown in any other paper. We start with a random circulant matrix and improve its characteristics in each iteration. Therefore, we can generate sequences of any arbitrarily selected length. The matched-filter solution in [16] uses a circulant matrix initialized with m-sequences, which are factorized in a later stage. Therefore, it generates sequences of length 2^m − 1 only. In our scheme, we do not need to stick to 2^m − 1 in BiCMOS realizations as in [7], [20], and [22], but can choose the length that is most suitable for the targeted technology.
• The proposed algorithm reduces the resulting PAPR in PSSS systems up to 38% without BER performance degradation.
• The same method can be easily adapted by modifying (9)- (12) to search for sequences with other predefined properties, e.g., Kasami- and Gold-like sequences in ℝ and ℂ.
• This is the first paper that addresses a practical ASIC implementation of the PSSS algorithm that does not introduce any BER penalty in AWGN channels.
The genetic algorithm described in this paper is a heuristic method to solve the optimization task of reducing the sidelobes and PAPR of spreading sequences on the sequence-coefficient level, which are difficult to mitigate by analytical means. Also, the algorithm does not need precisely estimated configuration parameters. The algorithm accepts a wide range of values for almost every parameter. This is the advantage of genetic, heuristic, and Monte-Carlo methods. We solve the optimization task without estimating specific parameter sets and starting points. A rough guess is enough to converge to the targeted results. The method is quick and easy, and it avoids more complicated solutions.