Learning-Based Secret Key Generation in Relay Channels Under Adversarial Attacks

Wireless secret key generation (WSKG) facilitates efficient key agreement protocols for securing sixth generation (6G) wireless networks thanks to its inherently lightweight functionality. Nevertheless, in the presence of adversarial attacks or internal impairments, WSKG can be degraded during randomness distillation, where the legitimate parties measure a source of common randomness. In this article, we propose a learning-aided approach for cooperative WSKG under a man-in-the-middle (MitM) adversarial attack, while the legitimate nodes suffer from hardware impairments (HIs). The key idea is to process the PHY-attribute data on the application layer by deploying a deep neural network (DNN) to enhance the randomness distillation. This way, we realize a learning-based software-centric security solution. More specifically, we take into account the sequence-type nature of the observed data and propose a DNN composed of gated recurrent units (GRUs) to learn the sequence of observations at the legitimate endpoints, while the MitM attack is also mitigated. Our numerical results verify the performance gain of the proposed learning-based approach compared with the state of the art. Moreover, the time and computation complexity of different learning-based models are studied to address the complexity-performance trade-off. Our tests highlight a performance gain of about 43% in terms of mean-square error (MSE) in comparison with a conventional PHY-only scheme.


I. INTRODUCTION
Network security mechanisms traditionally rely upon cryptography-based keys to provide confidentiality and authentication. Nevertheless, the modern era of sixth generation (6G) wireless networks, with a substantially large number of peer-to-peer communications performed on the fly, challenges the performance of conventional solutions [2]. As a promising framework to realize a paradigm shift from conventional complexity-based security solutions towards lightweight techniques, wireless secret key generation (WSKG) has been envisioned to be leveraged in future 6G networks [3], either as a standalone security mechanism or as a complement to the existing ones [4]. Notably, WSKG has gained much interest from both academic [4], [5], [6], [7] and industrial researchers [8], [9].

A. MOTIVATION
WSKG has remarkable merits for wireless networks. Specifically, its protocol does not require any additional infrastructure, and the secret key is obtained without the need for a third-party contributor. This substantially reduces the time required for key agreement and the potential information leakage. Moreover, continuous update of the secret key can be realized owing to the dynamic variations of wireless channels [6]. The WSKG framework employs lightweight mechanisms with minimal required changes at the control plane (in terms of time scheduling, synchronization, and radio resource management), while offering information-theoretic security guarantees [10]. Hence, this approach is envisioned as a promising solution for applications in beyond-the-fifth-generation (B5G) systems, such as the Internet-of-Things (IoT) [5], [7], [10] and latency-sensitive communications [3], [11]. The information-theoretic security guarantees of WSKG make this approach resilient against quantum computers, which can help the development of lightweight post-quantum security solutions [3]. In addition, artificial intelligence and machine learning (AI/ML) techniques can be integrated into the WSKG scheme to improve its performance [3].
6G is envisioned to bring device-level intelligence by implementing contemporary deep learning (DL) algorithms [12]. Owing to the capability of DL methods to capture and learn from the feature statistics of data sequences, they can be incorporated into WSKG frameworks to realize intelligent security solutions [3], [13]. In addition, wireless networks are facing a new trend of transferring functionalities from PHY to higher layers by employing software-centric solutions. This key idea can be applied to WSKG as well. That is, after obtaining the raw PHY-attribute data, it can be further exploited by DL algorithms implemented on higher layers of the protocol stack to improve the system's performance [14].
The procedure of WSKG is based on exploiting the wireless medium as the shared source of randomness among legitimate entities. However, this phase of randomness distillation is affected by inevitable practical deficiencies such as random noise, imperfectly reciprocal channel state information (CSI), and hardware impairments (HIs). In addition, a substantial growth of adversarial attacks on the wireless edge, such as spoofing and man-in-the-middle (MitM) attacks, which are easily implemented using low-cost software-defined radios, has been widely witnessed in recent years [14]. If not properly dealt with, such adversarial attacks have been shown to be able to disrupt the WSKG process [14], [15]. To elaborate, an adversary can try to "find" the agreed secret key between legitimate entities by injecting poisoned signals, such that he "deceives" the legitimate parties about the source of common randomness [15].

B. RELATED WORKS
To provide a "proof of concept" (PoC) for WSKG, the authors in [7] implemented a WSKG scheme for long-range wireless communications in low power wide area networks (LPWANs). However, the effect of active or passive adversaries on the performance of their proposed scheme was not addressed. In [10], WSKG in the presence of both a totally passive eavesdropper and a hostile jammer is analyzed, and a closed-form expression for the probability of successful handshaking is derived. The WSKG scheme was further extended to the case of cooperative communications in [16] and [17].
It is also of great importance to establish secure connections via wireless key agreement schemes, even when the endpoints lack a direct link to communicate. In such a scenario, an intermediate node relays communication signals to the endpoints. In this regard, the authors in [17] investigated the key generation scheme in the presence of an intermediate relay node and an active jammer. The generated secret key was exploited to determine a secure frequency hopping pattern for IoT nodes. Cooperative secret key generation in static environments was investigated in [18] in the presence of a passive eavesdropper. To deal with poor sources of "natural randomness" in static environments, the authors proposed to induce "artificial randomness" in the network.
We remark that although the legitimate parties in [16], [17], [18] suffered from channel estimation errors, their transceivers were assumed to operate perfectly. That is, HIs were neither modeled nor examined in their proposed schemes. In addition, the active adversaries considered in [10], [16], [17] were quite simple, i.e., they blindly inject noise-like signals to degrade the channel exploration capability of legitimate nodes. A MitM adversary who tries to control the WSKG process by spoofing the communication was investigated in [19]. The optimal strategy for the MitM in a direct communication was derived using game-theoretic approaches. However, the nodes were assumed to perform ideally (without any hardware mismatch) during transmission and reception. The authors in [20] then considered the effect of mismatched radio-frequency (RF) front-ends in a point-to-point (P2P) communication, which sheds light on further investigations of the HI effects in cooperative WSKG scenarios.
Remarkably, none of the aforementioned works has leveraged the potential capabilities of ML/DL techniques to come up with an intelligent security solution for WSKG. In fact, to the best of our knowledge, there are only a few papers addressing learning-aided schemes for wireless key agreement protocols [21], [22], [23], [24]. For instance, [21] and [22] utilized a fully-connected (FC) neural network for a simple P2P communication in the presence of a simple passive eavesdropper. Considering a similar topology, convolutional neural networks were proposed in [23]. However, the sequence-type nature of observations during the WSKG scheme was not addressed in [21], [22], [23]. In other words, they did not take into account the potential capabilities of state-aware neural networks, which can capture the relevant information within the chain of PHY data sequences. To the best of our knowledge, the scenario of learning-aided cooperative key generation is only studied in [24]; however, the performance of that scheme under active adversarial attacks was not addressed. Moreover, we perform extensive evaluations on learning-based WSKG models and comprehensively compare our scheme with various benchmarks in terms of different secrecy and complexity metrics. Such extensive studies are addressed neither in [24] nor in other related works in the context of cooperative key generation. Details of our novelties and contributions are explained in the following subsection.

C. OUR CONTRIBUTIONS
In this article, we propose a learning-aided solution for the cooperative WSKG scheme under a MitM adversarial attack and the practical assumption of HIs. The MitM adversary in our system spoofs the randomness distillation phase via fake injection. The goal of a MitM is fundamentally different from that of an external hostile jammer (HJ) studied in [16] and [17]. An HJ does not care about the procedure of key agreement, and its simple goal is to inject noise-like signals to impose mismatches within the observations of legitimate nodes. In contrast, a MitM performs his attack in a smart way, such that his injected adversarial data imitates a source of common randomness (SoCR), and (if not secured) the legitimate nodes can be misled about the SoCR. The corresponding mathematical details are given in Section III. To counteract the MitM, the exchanged packets at PHY are randomized, which is shown to prevent the MitM from taking control of the WSKG process. We further show in our article that our learning-based scheme is secure against the MitM adversarial attack in terms of having zero information leakage.
The main idea we pursue in this article is that the data packets exchanged at PHY are subsequently processed on the application layer by a deep neural network (DNN). This way, we come up with a data-driven and software-centric intelligent security solution for the cooperative WSKG scheme. Notably, none of the reviewed literature considers the sequence-type context of the data exchanged during the WSKG protocol for capturing the relevant information within the chain of observation sequences. In contrast, we propose to employ state-aware neural networks. To elaborate, we leverage the concept of recurrent networks and propose a DNN composed of gated recurrent units (GRUs) to learn the sequence of observations at the legitimate endpoints. We also compare our proposed learning-based scheme with different state-of-the-art neural networks to emphasize the performance of our scheme in terms of the resulting mean-square error (MSE) mismatch. Insightful comparisons from different aspects, including time complexity (training time and inference time), computation complexity, and the required memory storage size, are also provided in this article.
This article is an extension of our conference paper [1]. We extend the scheme of learning-based key generation to the case of learning-based cooperative communication, while in [1] we considered a simple scenario of Alice and Bob talking to each other directly. In this article, extensive simulations are also conducted to compare different learning algorithms in terms of various metrics, including MSE, training time, inference time, computation complexity, and storage usage, which were not addressed in [1] and [24]. We further emphasize that in [15], the performance analysis of a P2P WSKG scheme under HIs was provided (without proposing any learning algorithm to enhance the performance) to show that a fundamental limit occurs in terms of the achievable secret key rate. However, the goal of this article is entirely different from that of [15]: we focus on proposing a learning-based scheme for cooperative WSKG under a MitM attack, where we extensively study different learning algorithms and various benchmarks.
To summarize, our contributions and novelties are as follows:
1) We propose a learning-aided approach for cooperative WSKG under a MitM adversarial attack, considering a realistic scenario in which the transceivers of the legitimate endpoints and the intermediate node suffer from HIs. Outdated CSIs are also taken into account for the communication links.
2) Inspired by the concept of gated recurrent neural networks, we propose a DNN implementation to process the PHY-attribute data on the application layer. This way, the randomness distillation is enhanced and an intelligent learning-based security solution is realized. Studying the intersection of information theory and DL, we also show that implementing the DNN does not leak any information about the key generation process.
3) The adversarial MitM attack against the WSKG protocol is mitigated by employing randomized pilots (RPs). By leveraging an information-theoretic approach, we prove that the MitM attack does not affect our WSKG scheme in terms of information leakage.
4) Our numerical studies shed more light on the effect of HIs and adversarial attacks on the DL-based WSKG. Moreover, insightful comparisons of our proposed approach with different benchmarks and other learning-based methods are provided in terms of MSE, secret key rate (SKR), key difference rate (KDR), number of sessions, computation time (training and inference), computation complexity, and memory storage size.
In Table 1, we provide a summary and explicitly contrast our contributions with the literature.

D. ORGANIZATION AND NOTATIONS
The rest of our article is organized as follows. We introduce a detailed description of our proposed system model in Section II. Our communication protocol and the attacking strategy of the MitM adversary are addressed in Section III. Our implemented DNN is proposed in Section IV, together with the technical details of its different layers. Useful information-theoretic analysis and remarks are provided in Section V to address the secrecy of our proposed scheme. Some information-theoretic aspects of utilizing a DNN for the WSKG framework are also addressed. Section VI provides readers with different tests and experiments on our proposed learning-aided scheme, and Section VII concludes the article.
Notations: We denote the transpose, conjugate, and $\ell_2$ norm of a vector by $(\cdot)^T$, $(\cdot)^\dagger$, and $\|\cdot\|$, respectively. Moreover, $|\cdot|$ represents the absolute value of a variable. The kernel (null) space is denoted by $\mathrm{null}(\cdot)$. Vectors are represented by bold lowercase letters, while matrices are written as bold uppercase symbols. The zero and the identity matrices are shown by $\mathbf{0}$ and $\mathbf{I}$, respectively. The real part of a variable $x$ is denoted by $\mathrm{Re}(x)$. $\mathcal{CN}(\mu, \sigma^2)$ represents a complex Gaussian random variable (RV) with mean $\mu$ and variance $\sigma^2$. Moreover, the distribution of jointly Gaussian RVs $X_1$ and $X_2$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{C} \succeq 0$ is denoted by $(X_1, X_2) \sim \mathcal{CN}(\boldsymbol{\mu}, \mathbf{C})$. The expected value and the probability density function (pdf) of an RV $X$ are denoted by $\mathbb{E}[X]$ and $f_X(x)$, respectively. The mutual information of RVs $X$ and $Y$ is denoted by $I(X; Y)$. The Hadamard (element-wise) product is denoted by $\odot$, while $\mathrm{sigm}(\cdot)$ and $\tanh(\cdot)$ stand for the sigmoid and hyperbolic tangent functions, respectively.

II. SYSTEM MODEL
Our proposed system model consists of three mutually authenticated users, i.e., two legitimate endpoints, namely Alice (A) and Bob (B), and an intermediate relay (R) node, as depicted in Fig. 1. Alice and Bob aim to agree on a common secret key sequence by exploiting the wireless medium, i.e., the communication links A-to-R and B-to-R. Notably, there is either no direct link between Alice and Bob due to heavy shadowing, or the direct link is too weak, such that A and B choose the option of relay-aided cooperative communication. This is a realistic assumption when the end nodes are placed far apart [17], [25]. Accordingly, an amplify-and-forward relay that is compliant with the networking protocols is employed, which amplifies and forwards its received signals without tampering with the contents. In our system model, there also exists a MitM adversary, named Matt (M), who has impersonated other nodes and convinced them to establish unauthenticated links with him. This can be realized by circumventing mutual authentication mechanisms [28]. Investigation of authentication protocols and their vulnerabilities against MitM adversaries is out of the scope of this article.
A DNN is implemented at A with the aim of compensating for observation mismatches among the legitimate parties. This is done by learning Bob's observation sequence over time. To elaborate, the signals observed during packet exchange at PHY are processed on the application layer by a DNN that learns the sequence of observations. Then, the distilled randomness can be utilized as the source for key extraction. The model training and task inference procedures are implemented at one of the legitimate sides, and the intermediate node does not participate in the learning phase. Without loss of generality, it is assumed in this article that the DNN is employed by A.
In addition, we will show in Section V that if M is provided with the training data and similarly trains his own DNN, he cannot obtain any useful information. Technical details on the structure and hyperparameters of the proposed DNN, together with the training strategy, are addressed in Section IV.
The physical RF transceivers of A, B, and R suffer from HIs. The levels of impairment at the transmitter and receiver hardware are denoted by $\kappa^t_n$ and $\kappa^r_n$, respectively, for $n \in \{A, B, R\}$. These factors reflect the error vector magnitudes (EVMs), a measurable metric for the quality of RF transceivers [25]. While Alice and Bob exchange properly designed signals for randomness distillation, Matt tries to deceive the legitimate entities by injecting fake signals. Details of Matt's attack model are elaborated in Section III. As a pessimistic assumption, Matt is considered to possess an ideal RF transmitter to realize a powerful MitM attack, and he is equipped with $n_T > 1$ transmit antennas. We also assume that Alice, Bob, and the relay node are equipped with single-antenna transceivers. This is in line with the scenario of low-cost devices employed in B5G IoT-enabled networks and gives an advantage to the attacker [10], [17].¹
Communication links are assumed to follow a discrete-time quasi-static block-fading model [10], [16], [17], [25]. Accordingly, the wireless link between Alice (Bob) and the relay is denoted by the circularly symmetric complex Gaussian RV $h_{AR(BR)} \sim \mathcal{CN}(0, \delta^2_{AR(BR)})$, where $\delta^2_{AR(BR)}$ represents the large-scale fading effect of the legitimate channels. Similarly, the link of R-to-A(B) is denoted by $h_{RA(B)} \sim \mathcal{CN}(0, \delta^2_{A(B)R})$. As a practical assumption, we assume that the link of R-to-A(B) experiences imperfect reciprocity with respect to the link of A(B)-to-R [9]. That is,

$$h_{RN} = \rho\, h_{NR} + \sqrt{1-\rho^2}\; u_{NR}, \quad N \in \{A, B\}, \tag{1}$$

where $h_{RN}$ and $h_{NR}$ have a correlation $0 < \rho \leq 1$, with $\rho = 1$ corresponding to the special case of perfect reciprocity. Moreover, $u_{NR}$ characterizes the uncertain part of $h_{RN}$, which is modeled as $u_{NR} \sim \mathcal{CN}(0, \delta^2_{NR})$, independent of $h_{NR}$. As a worst-case assumption from the security perspective [10], [17], [19], we consider that Matt has perfect knowledge of his channel vectors to Alice (Bob) and the relay, denoted by $\mathbf{h}_{MA(B)}$ and $\mathbf{h}_{MR}$, respectively. Similar to [10], [17], [19], we assume that the channel coefficients of the adversarial links are pairwise statistically independent with $\mathbf{h}_{Mn} \sim \mathcal{CN}(\mathbf{0}_{n_T}, \delta^2_M \mathbf{I}_{n_T})$ for $n \in \{A, B, R\}$, where $\delta^2_M$ denotes the large-scale effect. This is a plausible assumption in block-fading scenarios, as long as the channel's coherence time is respected [19]. Additive noises are assumed to be pairwise statistically independent with variance $\sigma^2_n$.
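To make the imperfect-reciprocity assumption concrete, the following minimal sketch draws pairs of forward and reverse channel coefficients and checks their empirical correlation. It assumes the common first-order form $h_{RN} = \rho h_{NR} + \sqrt{1-\rho^2}\,u_{NR}$ with $u_{NR}$ independent of $h_{NR}$; the function names and the chosen values of $\rho$ and $\delta^2$ are illustrative and not taken from the article.

```python
import math
import random


def complex_gaussian(rng, variance):
    """Draw one circularly symmetric complex Gaussian sample CN(0, variance)."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def reciprocal_pair(rng, rho, delta2):
    """Return (h_NR, h_RN) under the assumed model
    h_RN = rho * h_NR + sqrt(1 - rho^2) * u_NR, with u_NR independent of h_NR."""
    h_nr = complex_gaussian(rng, delta2)
    u_nr = complex_gaussian(rng, delta2)
    h_rn = rho * h_nr + math.sqrt(1.0 - rho ** 2) * u_nr
    return h_nr, h_rn


def empirical_correlation(pairs):
    """Magnitude of the sample correlation coefficient between the two links."""
    num = sum(a * b.conjugate() for a, b in pairs)
    pow_a = sum(abs(a) ** 2 for a, _ in pairs)
    pow_b = sum(abs(b) ** 2 for _, b in pairs)
    return abs(num) / math.sqrt(pow_a * pow_b)


rng = random.Random(0)  # fixed seed so the experiment is repeatable
pairs = [reciprocal_pair(rng, rho=0.9, delta2=1.0) for _ in range(20000)]
print("empirical correlation:", round(empirical_correlation(pairs), 3))
```

With many samples the printed value should lie close to the chosen $\rho$, and setting `rho=1.0` reproduces the perfectly reciprocal case in which both links coincide.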

III. COMMUNICATION DESIGN
With the aim of distilling a common source of randomness from PHY, the legitimate nodes should first take turns conducting channel exploration, during which A and B exchange pilot signals with the help of the relay. In practical scenarios, physical RF transceivers suffer from impairments in real testbeds [26]. Hence, the general communication model from node $i$ to node $j$ over any given flat-fading channel is well captured by the following equation [25], [26], [27]:

$$y = h\,(s + \eta_{ij}) + n, \tag{2}$$

where $s \in \mathbb{C}$ with power $\mathbb{E}[|s|^2] = P_i$ is the signal sent over a wireless channel with fading coefficient $h \in \mathbb{C}$ and additive noise $n \in \mathbb{C}$, and $y$ denotes the received signal. This is an experimentally validated model for HIs, which is widely adopted in the wireless communication literature [25], [26].² The independent distortion noise $\eta_{ij} \sim \mathcal{CN}(0, \kappa^2_{ij} P_i)$ (varying from one block to another) models the HIs on the communication link $i$-to-$j$, where $\kappa_{ij} = \sqrt{(\kappa^t_i)^2 + (\kappa^r_j)^2}$ reflects the aggregate level of impairments. Here, $\kappa^t_i, \kappa^r_j \geq 0$ are the design parameters characterizing the level of impairments in the transmitter and receiver hardware, respectively [25], [26], [27].
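The hardware-impaired link can be simulated in a few lines. The sketch below assumes the widely used generalized form $y = h(s + \eta_{ij}) + n$ with $\eta_{ij} \sim \mathcal{CN}(0, \kappa_{ij}^2 P_i)$ and $\kappa_{ij} = \sqrt{(\kappa_i^t)^2 + (\kappa_j^r)^2}$; the helper names and numeric values are hypothetical and chosen only for illustration.

```python
import math
import random


def cn(rng, variance):
    """Draw one circularly symmetric complex Gaussian sample CN(0, variance)."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def aggregate_kappa(kappa_t, kappa_r):
    """Aggregate impairment level of a link from its transmitter and receiver levels."""
    return math.sqrt(kappa_t ** 2 + kappa_r ** 2)


def impaired_link(rng, s, h, kappa_t, kappa_r, noise_var):
    """Received sample y = h * (s + eta) + n for a constant-modulus pilot s.

    Assumption: |s|^2 equals the transmit power P_i, so the distortion
    variance is kappa^2 * |s|^2.
    """
    p_i = abs(s) ** 2
    kappa = aggregate_kappa(kappa_t, kappa_r)
    eta = cn(rng, kappa ** 2 * p_i)  # distortion noise caused by the HIs
    n = cn(rng, noise_var)           # additive receiver noise
    return h * (s + eta) + n


rng = random.Random(0)
# Noise-free unit channel, so that y - s isolates the distortion term.
samples = [impaired_link(rng, 1 + 0j, 1 + 0j, 0.1, 0.1, 0.0) for _ in range(20000)]
distortion_power = sum(abs(y - 1) ** 2 for y in samples) / len(samples)
print("measured distortion power:", distortion_power)
```

In this noise-free setting the measured distortion power should approach $\kappa_{ij}^2 P_i$, which illustrates that the distortion scales with the transmit power and therefore cannot be removed by simply transmitting at a higher power.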
In the following, we propose our communication protocol for distilling a shared source of randomness for Alice and Bob using the characteristics of the PHY layer. A general block diagram of the scheme is provided in Fig. 2.

A. SIGNALING PROTOCOL
As shown in Fig. 1, the legitimate nodes perform a three-step protocol to realize randomness distillation from PHY, while Matt tries to deceive them by sending spoofing signals. The details of the signaling protocol are as follows.³

² A detailed description of HIs and their compensation algorithms can be found in [27]. According to [25], the combined influence of different types of HIs at a given flat-fading block is well modeled by the generalized channel model given in (2).

³ Without loss of generality, the same protocol can be applied to the case of block-fading channels with $N_{sc}$ parallel blocks, known as sub-channels [17], [18], [19]. In the following, we consider communication over a single carrier for the sake of notational brevity and tractability.

1) ESTIMATION PHASE
In the first step, channel estimation is performed, during which the relay sends channel probing signals (with power $P_R$) to help Alice and Bob estimate their links to R. These estimates are further utilized by A and B to cancel out self-interference signals from their observations. The estimates of Alice and Bob of their communication links to R, denoted by $\hat{h}_{NR}$, $N \in \{A, B\}$, are given in (3) [17], [18], where $e_N$ denotes the estimation uncertainty, i.e., the channel estimation error. According to (3), HIs at the legitimate nodes can affect the quality of channel estimation. During the estimation phase, Matt also exploits the transmitted pilots and obtains his link $\mathbf{h}_{MR}$ to the relay. Notably, Matt's strategy is to perform his adversarial role in a "wait-then-attack" manner. He first listens to the exchanged packets to obtain an accurate estimate of his links to the other nodes. Once the CSI of his links to the legitimate nodes is estimated, he can design and transmit poisoned packets to inject a fake SoCR at Alice and Bob. Otherwise, if he blindly emits a noise-like signal from the very first step of the packet exchange, he might be detected while being unable to inject fake randomness.

[...] $= P_M$ towards R to degrade the receiving performance of R. Hence, the received signal $y_R^{(2)}$ at R is given in (4).

3) RELAYING STEP
In the third step, an amplified version of $y_R^{(2)}$ is relayed to Alice and Bob. The relaying gain $G$ can be computed as

$$G = \sqrt{\frac{P_R}{\mathbb{E}\big[|y_R^{(2)}|^2\big]}},$$

such that the mean transmit power of the relay becomes $P_R$ [29]. The value of $G$ is determined based on the relay's received power, and it is considered a publicly known parameter [16], [17], [18].
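The power normalization performed by the amplify-and-forward relay can be illustrated as follows. This is a minimal sketch in which the expectation of the received power is replaced by an empirical average over a block of samples; that substitution, the function name, and the sample values are assumptions made for illustration.

```python
import math


def relay_gain(received, p_relay):
    """Gain G such that the forwarded samples G * y have mean power p_relay.

    Assumption: the average received power is estimated from the samples
    themselves instead of being computed in closed form.
    """
    mean_rx_power = sum(abs(y) ** 2 for y in received) / len(received)
    return math.sqrt(p_relay / mean_rx_power)


received = [1 + 1j, 2 + 0j, -1j]  # toy block of samples received at the relay
gain = relay_gain(received, p_relay=2.0)
forwarded = [gain * y for y in received]
forwarded_power = sum(abs(y) ** 2 for y in forwarded) / len(forwarded)
print("relay gain:", gain, "forwarded mean power:", forwarded_power)
```

Because the gain multiplies the whole received sample, it scales the desired pilots, the distortion terms, and the noise alike.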
Details of the Adversarial Attack: Now is the time for Matt to play his adversarial role. Informally speaking, the strategy of Matt is to "steal" the randomness distillation. That is, Matt aims to intelligently inject "poisoned" data so that the same fake signal is "observed" at the legitimate endpoints, making them "believe" that the source of shared randomness is what he has sent. Mathematically speaking, Matt injects an adversarially precoded signal, denoted by $\mathbf{w}_M$, such that his poisoned packets are observed identically by Alice and Bob. Hence, Matt wants to satisfy

$$\mathbf{h}_{MA}^T \mathbf{w}_M = \mathbf{h}_{MB}^T \mathbf{w}_M \triangleq z_M, \tag{5}$$

with $z_M$ denoting the adversarial term observed by Alice and Bob. Inspired by (5), Matt designs his adversarial data $\mathbf{w}_M$ such that $(\mathbf{h}_{MA} - \mathbf{h}_{MB})^T \mathbf{w}_M = 0$. Hence, define the kernel (a.k.a. null-space) matrix $\mathbf{V} \in \mathbb{C}^{n_T \times (n_T - 1)}$ associated with the vector $\mathbf{h}_{MA} - \mathbf{h}_{MB}$ as

$$\mathbf{V} = \mathrm{null}\big((\mathbf{h}_{MA} - \mathbf{h}_{MB})^T\big). \tag{6}$$

Then, invoking (5) and (6), $\mathbf{w}_M$ can be calculated as

$$\mathbf{w}_M = \sum_{l=1}^{n_T - 1} \boldsymbol{\nu}_l\, x_{M_l}, \tag{7}$$

where $\boldsymbol{\nu}_l$ denotes the $l$-th column of $\mathbf{V}$ and $x_{M_l}$ is the adversarial signal of the $l$-th stream (before precoding). Moreover, we have $\mathbb{E}[\|\mathbf{w}_M\|^2] \leq P_M$, with $P_M$ denoting Matt's transmit power budget. Remarkably, our proposed multi-stream MitM attack in (7) exploits the entire kernel space of the attacking links, while setting $l = 1$ in (7) reduces it to the special case of single-stream injection proposed in [19]. We also remark that incorporating other types of learning-based adversarial attacks, such as adversarial machine learning, into the WSKG process will be studied in our future work.
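The null-space injection described above can be checked numerically. The sketch below builds a (non-orthonormal) basis of the kernel of $(\mathbf{h}_{MA} - \mathbf{h}_{MB})^T$ by hand, combines the basis vectors with Matt's streams, and confirms that Alice and Bob observe the same adversarial term. The explicit basis construction, the per-realization power scaling, and all names are illustrative assumptions; an orthonormal basis obtained from a singular value decomposition would normally be used instead.

```python
import math
import random


def cn(rng, variance):
    """Draw one circularly symmetric complex Gaussian sample CN(0, variance)."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def kernel_basis(h_ma, h_mb):
    """Return n_T - 1 vectors v with (h_MA - h_MB)^T v = 0.

    Assumption: the first entry of d = h_MA - h_MB is nonzero, which holds
    with probability one for continuous fading.
    """
    d = [a - b for a, b in zip(h_ma, h_mb)]
    basis = []
    for l in range(1, len(d)):
        v = [0j] * len(d)
        v[0] = d[l]    # contributes  d[0] * d[l]
        v[l] = -d[0]   # contributes -d[l] * d[0], cancelling the first term
        basis.append(v)
    return basis


def precode(basis, streams, p_m):
    """Combine the kernel vectors with Matt's streams and scale to power p_m."""
    n_t = len(basis[0])
    w = [sum(v[k] * x for v, x in zip(basis, streams)) for k in range(n_t)]
    norm = math.sqrt(sum(abs(c) ** 2 for c in w))
    return [math.sqrt(p_m) * c / norm for c in w]


def observed(h, w):
    """Signal h^T w seen over a multi-antenna-to-single-antenna link."""
    return sum(hk * wk for hk, wk in zip(h, w))


rng = random.Random(7)
n_t = 4
h_ma = [cn(rng, 1.0) for _ in range(n_t)]
h_mb = [cn(rng, 1.0) for _ in range(n_t)]
w_m = precode(kernel_basis(h_ma, h_mb), streams=[1.0] * (n_t - 1), p_m=2.0)
z_at_alice, z_at_bob = observed(h_ma, w_m), observed(h_mb, w_m)
print("gap between the two observations:", abs(z_at_alice - z_at_bob))
```

Using a single stream (one basis vector) corresponds to the single-stream special case mentioned in the text, while using all $n_T - 1$ vectors spans the whole kernel.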
Based on the aforementioned discussions and by utilizing (4)-(7), the raw observations of Alice and Bob, denoted by $\tilde{y}_N$ for $N \in \{A, B\}$, are given in (8). Invoking (8), we can see that there exists a common (but adversarial) term $z_M$ in the raw observations of Alice and Bob, which mimics the SoCR. However, it is the adversarial data sent by Matt, and hence it is known to him. This can lead to security flaws during the WSKG process, since both A and B hold a common term that is known to Matt. Therefore, if we directly perform WSKG by exploiting the raw observations in (8), it results in information leakage to the MitM. Mathematically, the information leakage rate $L$ of such a naive system is upper bounded by $L \leq I(\tilde{y}_A, \tilde{y}_B; z_M)$.

Remark 1: Based on the aforementioned discussions, the optimal strategy for Matt (in the sense of maximizing the information leakage) is to choose his adversarial signal $\mathbf{w}_M$ in such a way that the observed signal at the legitimate endpoints, $z_M$, is a Gaussian-distributed RV. This fact relates to the capacity-achieving input of Gaussian channels [17], [19], [42]. Therefore, by invoking the expression of $z_M$ in (5) and (7), a good choice for Matt is to set $x_{M_l}$ to a constant value. This results in $z_M \sim \mathcal{CN}(0, n_T P_M \delta^2_M)$, where the proof can be obtained in a similar way to that discussed in [19].
As a countermeasure to MitM attacks, the utilization of randomized pilots (RPs) has been shown to be an effective strategy [19], [30]. Hence, inspired by [19], we propose that in the pilot packet exchange, A and B exchange RPs of the form $\{\sqrt{P_n}\, e^{j\phi_n}\}$, $n \in \{A, B\}$, where the phases $\phi_n$ are independent and identically distributed (i.i.d.) and drawn from a zero-mean discrete uniform distribution. We will show in Section V that this choice of employing RPs results in zero information leakage to the MitM.
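As a small illustration of randomized pilots, the sketch below draws constant-modulus pilots whose phases come from a discrete uniform alphabet that is symmetric around zero. The alphabet size `k` and the particular phase grid are hypothetical choices, since the text only specifies a zero-mean discrete uniform distribution.

```python
import cmath
import math
import random


def phase_alphabet(k):
    """k equally spaced phases, symmetric around zero so that their mean is zero."""
    return [2.0 * math.pi * (m - (k - 1) / 2.0) / k for m in range(k)]


def randomized_pilot(rng, power, k=8):
    """One pilot sqrt(P) * exp(j * phi) with phi drawn uniformly from the alphabet."""
    phi = rng.choice(phase_alphabet(k))
    return math.sqrt(power) * cmath.exp(1j * phi)


def alphabet_pilot_mean(power, k=8):
    """Average pilot value over the whole alphabet (the pilot's expected value)."""
    points = [math.sqrt(power) * cmath.exp(1j * phi) for phi in phase_alphabet(k)]
    return sum(points) / k


rng = random.Random(1)
pilot = randomized_pilot(rng, power=2.0)
print("pilot magnitude:", abs(pilot))
print("magnitude of the expected pilot:", abs(alphabet_pilot_mean(2.0)))
```

Every pilot has the same magnitude $\sqrt{P_n}$, so the transmit power is unaffected, while the random phase is known only to the node that generated it.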

4) LOCAL PROCESSING
Alice and Bob cancel their self-interference signals from their local observations in (8). By invoking (1), (4), and (8), [...] by their RPs to finally retain the source of shared randomness. This results in (9) and (10), where $\tau_A$ and $\tau_B$ represent the residual HIs, channel estimation uncertainties, and random noises, given in (11). Invoking (11), we can deduce that the relay has unintentionally amplified components such as HIs, which increases the level of mismatch between the signals of A and B. This highlights the importance of proposing proper solutions for hardware-impaired cooperative key generation schemes, as studied in this article. We also remark that if we set $\rho = 1$ and $\kappa^{t,r}_{A,B} = 0$, our network reduces to the special case of ideally reciprocal channels and perfect hardware [19]. Moreover, setting $P_M = 0$ results in the special case of relay-aided WSKG without an adversarial attack [18]. In the following section, we propose our learning-based approach to enhance the hardware-impaired cooperative WSKG.

IV. NEURAL NETWORK IMPLEMENTATION
In the previous section, a general sketch of the shared randomness was obtained by performing a proper packet exchange at PHY. Here, with the aim of improving the randomness distillation, the PHY data is passed to the application layer to be further processed and to compensate for the underlying discrepancies. Hence, a software-centric security solution is proposed by utilizing DNNs. Recall that the discrepancies exist due to the injected signals of the MitM and the unbalanced imperfections at the legitimate transceivers. This can be inferred from (9) and (10). The main idea in this section is to make predictions about the data sequence held by the endpoints. By doing so, we wish to obtain a sequence similar to the original data held by either side and hence compensate for potential discrepancies between A and B.
We leverage the concept of recurrent neural networks (RNNs) and capture the relevant information that lies within the chain of observation sequences. Remarkably, the chain-like nature of RNNs makes them suitable for sequence data types [32], [33]. RNNs allow information to persist, i.e., they do not begin to learn from scratch every time. Instead, at every time step they learn from their previous understanding. Feedback loops are implemented in the recurrent layers of RNNs to help them update their current state according to the previous states. Thanks to these feedback loops, the recurrent layers can memorize the historical information obtained from data sequences; hence, they are able to establish meaningful connections between every single data point and the corresponding contextual information hidden in the data sequence [32], [33], [34].

A. OUR PROPOSED DNN
Inspired by the concept of RNNs, we propose a DNN for our WSKG scheme as depicted in Fig. 3. Our DNN is composed of two GRU layers followed by two dense layers. According to Fig. 3, Alice runs a GRU-based DNN to learn the observation sequences of Bob from her own observations, $\mathbf{y}_A$. Technically speaking, our proposed neural network realizes a DL-based sequence-to-sequence (Seq2Seq) regression on Alice's observations to make them resemble Bob's.⁴ Accordingly, the dense layers in our implemented DNN characterize the regression process on the underlying information of Alice's data, which is extracted beforehand by the GRU layers. Remarkably, GRU layers [35], as a well-established type of RNN, have become increasingly popular in DL algorithms [36], [37]. GRUs involve fewer tensor operations; hence, they typically perform faster than long short-term memory (LSTM) networks during training and inference.⁵ Before going through the details of the learning and prediction process, we briefly examine how GRU layers help networks extract state-aware information from given data sequences.
As illustrated in Fig. 3, a typical GRU layer consists of several units, called hidden units. The idea is to regulate the flow of information in a state-wise manner, i.e., the units maintain hidden states that act as the memory of the neural network, holding information on the data the network has seen before. Hence, the GRU layer gradually learns which data in a sequence is important to keep or to throw away. Then, by passing the relevant information down the chain of the data sequence, it can perform predictions [36], [37]. After an input data sequence is fed to a GRU layer, the layer processes the elements of the sequence one by one. During the processing of each element, the GRU layer passes the previous hidden state to the next state. To see how GRU layers calculate the hidden states, an arbitrary unit of a GRU layer is sketched in Fig. 4, showing the $n$-th hidden state. It is composed of two main parts, i.e., a reset gate and an update gate. By defining the update and reset gate vectors as $\mathbf{z}[n]$ and $\mathbf{r}[n]$, respectively, and the output state vector as $\mathbf{h}[n]$, the controlling equations for a GRU are given in (12), where $\mathbf{x}[n]$ stands for the input vector to this unit (as shown in [...]).

The training process should also take the information leakage into account. This can be captured by the mutual information between the adversarial signals held by Matt and the data sequences at the legitimate parties, i.e., $F_{\mathbf{W},\mathbf{B}}(\mathbf{y}_A)$ and $\mathbf{y}_B$. Hence, the overall loss function for the training process can be formulated as in (13), where $\ell(\cdot,\cdot)_i$ is any desired error measure between the input and output sequences of the DNN corresponding to the $i$-th training sequence. In this article, we employ the mean-squared error (MSE) $\|F_{\mathbf{W},\mathbf{B}}(\mathbf{y}_A) - \mathbf{y}_B\|^2$ as a widely adopted error measure. We note that in the next section, we show that the leakage term for the proposed scheme is zero. Hence, the final loss function only considers the error measure between the input and output sequences. Invoking (9), (10), and (13), one can infer that the optimization problem formulated in (13) is complicated due to the existing non-linearities and fake signals. Hence, traditional optimization methods incur a considerable computational complexity. In contrast, finding the output of our DNN simply requires evaluating the learning blocks by moving from the input layer to the output layer of the trained DNN [13]. The minimization of (13) can be handled by off-the-shelf gradient-descent-based methods specifically developed for training DNNs [38]; a review of these methods is beyond the scope of this article.
We have chosen the widely adopted adaptive moment estimation (Adam) optimizer. More details regarding the hyperparameters of our DNN, together with the experiments conducted on our proposed network, are provided in the subsequent section.
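As a rough illustration of how an MSE loss can be minimized with Adam, the following minimal sketch fits a toy linear predictor of Bob's samples from Alice's samples. It is not the DNN of Fig. 3: the linear model, the synthetic data, and the hyperparameters (`lr`, `beta1`, `beta2`, `eps`, and the number of steps) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for Alice's and Bob's observations (assumption, not (9)-(10)).
y_a = rng.standard_normal(256)
y_b = 0.8 * y_a + 0.1 + 0.01 * rng.standard_normal(256)


def mse_and_grad(theta):
    """MSE of the linear predictor w * y_a + b and its gradient w.r.t. (w, b)."""
    w, b = theta
    err = w * y_a + b - y_b
    loss = np.mean(err ** 2)
    grad = np.array([2.0 * np.mean(err * y_a), 2.0 * np.mean(err)])
    return loss, grad


def adam(theta, steps=500, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
    """Plain Adam with bias-corrected first and second moment estimates."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    losses = []
    for t in range(1, steps + 1):
        loss, g = mse_and_grad(theta)
        losses.append(loss)
        m = beta1 * m + (1.0 - beta1) * g          # first moment (mean of gradients)
        v = beta2 * v + (1.0 - beta2) * g ** 2     # second moment (mean of squares)
        m_hat = m / (1.0 - beta1 ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, losses


theta, losses = adam(np.zeros(2))
print(f"w = {theta[0]:.3f}, b = {theta[1]:.3f}, final MSE = {losses[-1]:.2e}")
```

In a full implementation, the same update would be applied to all weights and biases of the GRU and dense layers, with gradients obtained by backpropagation through the sequence.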
To prepare the training dataset $T$, in addition to gathering $N_T$ observation sequences $\{y_A\}_i$, Alice should be provided with Bob's sequences $\{y_B\}_i$. This can be done by employing secure data transmission schemes for cooperative networks, e.g., the data transmission protocol proposed in [17]. It should be noted that the training set is sent to Alice for the purpose of training, not for quantization and key extraction. Hence, it does not compromise the secrecy. This is because the wireless channels change over time, and the observations that will be exploited to generate keys are independent of the ones used for training. In addition, we show in the next section that if Matt is provided with the training data $T$ and implements the same DNN, he cannot obtain any useful information.
After the training process is completed, i.e., the minimization problem of (13) converges to a relatively low MSE, all weights and biases of our DNN are configured and the DNN reaches an acceptable state to perform Seq2Seq prediction. Once Alice and Bob perform the packet exchange to distill PHY randomness, Alice passes her new data sequences $y_A$ to the application layer to conduct DL-based prediction of Bob's data in real time. When the trained DNN is utilized for predicting new sequences, it only needs to perform a forward propagation, i.e., moving forward through the DNN from the input layer to the output layer and performing the computations of (12). We note that the complexity of our DNN compared with related benchmarks is investigated in Section VI by examining the computational complexity, computation time, and memory size. Moreover, we show in Section VI that, when considering different configurations and generating samples with distributions different from that of the training set, our DL-based approach performs well without the need to update the DNN.

V. SECURITY ANALYSIS AND DISCUSSIONS
In this section, we provide the information leakage analysis to address the security of our proposed learning-based scheme. Specifically, we show that the poisoned data of Matt (generated based on (5)-(7)) does not help him take control of the WSKG process, and that the information leakage rate of our scheme is zero. We also address the intersection of information theory and deep learning, and show that utilizing the proposed DNN does not affect the information leakage rate. More precisely, if Matt is provided with the training data $T$ and implements the same DNN, he cannot obtain any information corresponding to the WSKG process. To this end, we first show that the two endpoints experience independent versions of fake randomness in (9) and (10). The following corollary, which is obtained with an approach similar to that proposed in [1], formulates this claim.
Corollary 1: The fake randomness terms that lie within the observations of the legitimate endpoints are pairwise independent, with the distribution given in (14).

The corollary indicates that the adversarial components lying within the observations of the legitimate parties do not have any mutual information with each other. In other words, $z_M x_A^p$ and $z_M x_B^p$ do not contain any common information. Hence, no leakage is imposed on the network through Matt.⁶ According to the aforementioned discussions, by utilizing Corollary 1 and invoking (9) and (10), one can deduce that the information leakage $L$ of the proposed scheme is zero due to the statistical independence of the fake randomness injected at the legitimate parties [1], [19]. Mathematically speaking, by invoking (9), (10), and (14), the leakage can be rewritten as in (15). From the information-theoretic perspective, (15) shows that there will be no leakage when utilizing $y_A$ and $y_B$ for the process of secret key agreement, although Matt holds the adversarial signal $z_M$. Moreover, (15) ensures that the data sequences of Matt are decorrelated from the signals at Alice and Bob.
One might argue that according to the proposed scheme, Matt might obtain more information than $z_M$ during the protocol, if equipped with a full-duplex radio, for example. To answer this, we provide the following remark.
Remark 2: Considering both an untrusted relay and an external pure eavesdropper, it is shown in [18] that the probability of an eavesdropping attack can be made arbitrarily small. The results can be applied to our scenario when Matt is equipped with a full-duplex radio and wishes to simultaneously inject malicious data and wiretap the packet exchange. Therefore, inspecting the packets exchanged by Alice and Bob, e.g., by pretending to be the relay, does not help Matt obtain useful information. Similarly, listening to the packets transmitted by R does not help him in terms of the information leakage [18]. Also note that deploying full-duplex hardware and decoding all of the exchanged packets require consuming a relatively large amount of the available energy, which is costly for an adversary. Hence, in this article we consider a MitM who aims to wisely deceive the endpoints regarding the SoCR by adding his own data to their observations.

⁶ One can also easily verify that $z_M$ and $v_M^{(2)}$ are independent of the common randomness term, i.e., $\rho h_{AR} h_{BR} G x_A^p x_B^p$ in (9) and (10).

Remark 3: Based on the above discussions, the achievable secret key rate (SKR) of the proposed scheme can be formulated as $R_{\mathrm{key}} = I(y_A; y_B)$. However, obtaining a closed-form expression for the SKR in this case is intractable. This is because the common source of randomness in (9) and (10) corresponds to the product of two complex Gaussian RVs, i.e., $h_{AR} x_A^p$ and $h_{BR} x_B^p$.⁷ This RV follows the complex double Gaussian distribution, a.k.a. the Gaussian-product distribution, whose pdf is provided in [31].
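Since a closed-form SKR is not available, the mutual information in Remark 3 has to be estimated numerically. The snippet below is a minimal sketch of one common way to do this, a plug-in estimate from a two-dimensional histogram. The estimator choice, the number of bins, the noise level, and the real-valued toy signals (real part of a Gaussian product plus independent noise) are all assumptions for illustration; they do not reproduce the observation model of (9) and (10).

```python
import numpy as np


def mutual_information_hist(a, b, bins=32):
    """Plug-in estimate of I(a; b) in bits from a 2-D histogram of the samples."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()                 # empirical joint pmf
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of b, shape (1, bins)
    mask = p_ab > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))


def complex_gaussian(rng, size):
    """Samples of a CN(0, 1) random variable."""
    return (rng.standard_normal(size) + 1j * rng.standard_normal(size)) / np.sqrt(2.0)


rng = np.random.default_rng(0)
n = 100_000

# Toy common randomness: real part of the product of two complex Gaussians.
common = np.real(complex_gaussian(rng, n) * complex_gaussian(rng, n))
y_a = common + 0.3 * rng.standard_normal(n)    # Alice's noisy view (toy)
y_b = common + 0.3 * rng.standard_normal(n)    # Bob's noisy view (toy)

mi_corr = mutual_information_hist(y_a, y_b)
mi_indep = mutual_information_hist(y_a, rng.standard_normal(n))
print(f"correlated pair: {mi_corr:.3f} bits, independent pair: {mi_indep:.3f} bits")
```

Histogram estimators are biased for finite sample sizes, so the value obtained for the independent pair is small but not exactly zero; the number of bins trades this bias against resolution.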
The next question in terms of studying the secrecy of the proposed scheme is whether implementing the same DNN by Matt could help him infer any useful information.This is addressed in the following proposition.
Proposition: Leakage Analysis of Utilizing DNN: Intersection of Information Theory and DL. If Matt is provided with the training data $T$ and implements the same DNN, denoted by $F_{W,B}(\cdot)$, he cannot obtain any useful information.
Proof: Based on the notations mentioned above, the inferred sequence at Matt, when utilizing the DNN, can be denoted by $F_{W,B}(z_M)$, which is obtained by feeding the adversarial samples $z_M$ to the DNN. Accordingly, the leakage rate in this case, $L_{\mathrm{DNN}}$, is bounded by the mutual information between the inferred sequence at Matt and the data sequences at the output of the legitimate parties, i.e., $F_{W,B}(y_A)$ and $y_B$. Mathematically speaking, we can write the chain of inequalities in (16), where (a) follows from the data processing inequality (DPI) [42] for the corresponding Markov chain and, finally, (c) directly follows from (15). Since the mutual information metric is non-negative, the leakage rate should be zero, and the proof is completed. We further note that, invoking (16), the last inequality also indicates that utilizing the proposed DNN does not affect the information leakage rate. According to the above proposition, one can deduce that $I(F_{W,B}(y_A), y_B; z_M) = 0$. Therefore, invoking (13) and (16), we can rewrite the training process as the minimization of the error measure alone, since the leakage term vanishes.

Before providing the results of our numerical experiments, we mention that the full procedure of secret key agreement is realized by running the following blocks: 1) a mapping, e.g., quantization, from the data held by A and B to a discrete subspace, followed by 2) the reconciliation phase, and 3) a hash function [18]. In this article, however, our focus is on the randomness distillation phase as the fundamental part of any WSKG scheme. Interested readers are referred to [4] for more details on the other blocks of PHY-based key agreement. In our future work, we will study the integration of DL algorithms into the other blocks of WSKG.
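Returning to the proof of the Proposition, the bounding argument can be summarized by the chain below. This is a sketch reconstructed from the surrounding description, not a verbatim copy of (16). In particular, the justification of step (b) is an assumption: both (a) and (b) are taken to be applications of the DPI, to Matt's processing $z_M \to F_{W,B}(z_M)$ and to Alice's processing $y_A \to F_{W,B}(y_A)$, respectively, while (c) uses the zero-leakage result of (15).

```latex
\begin{align}
L_{\mathrm{DNN}}
  &\le I\big(F_{W,B}(z_M);\, F_{W,B}(y_A),\, y_B\big) \\
  &\overset{(a)}{\le} I\big(z_M;\, F_{W,B}(y_A),\, y_B\big) \\
  &\overset{(b)}{\le} I\big(z_M;\, y_A,\, y_B\big) \\
  &\overset{(c)}{=} 0 .
\end{align}
```

Because mutual information is non-negative, every term in the chain is squeezed to zero, which is consistent with the statement $I(F_{W,B}(y_A), y_B; z_M) = 0$ above.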

VI. NUMERICAL RESULTS
In this section, we present different numerical examples to investigate our proposed DL-based scheme for relay-aided WSKG. We also compare our scheme with different state-of-the-art benchmarks to demonstrate its performance. The codes are run on an Intel(R) Xeon(R) Silver 4114 CPU running at 2.20 GHz. For the following tests, a typical wireless channel $h$ between two arbitrary nodes at distance $d$ is modeled as $h = G d^{-\alpha/2} h_0$, with $G = \frac{c}{4\pi f_c}$ denoting the constant parameter of the path loss with exponent $\alpha = 4$, $c = 3 \times 10^8$ m/s, and $f_c = 2.4$ GHz [1], [17], [18], [29], [40]. Moreover, $h_0 \sim \mathcal{CN}(0, 1)$ models the typical small-scale Rayleigh fading. Unless otherwise stated, Alice, Bob, and the relay are placed at $[-10, 0]$ m, $[10, 0]$ m, and $[0, 5]$ m, respectively [18], [29], while Matt (equipped with $n_T = 4$ antennas) is located at $[0, -5]$ m [11], [17], [19]. This can be considered a typical scenario of indoor WiFi networks. Training parameters are provided in Table 2. Moreover, the number of hidden units in the GRU layers and the number of neurons employed at the dense layers are denoted on top of their corresponding blocks in Fig. 3. In addition, dropout regularization with probabilities 0.8, 0.6, and 0.6 is implemented on each layer. The length of the input and output sequences of our DNN is set to $L = 20$, which is obtained by hyper-parameter tuning.⁸ During the training of the DNN, the transmit power of the legitimate pilots and the MitM transmit power are set to $P_{A \Leftrightarrow B \Leftrightarrow R} = 10$ dBm and $P_M = 20$ dBm, respectively. The training set is created according to (9) and (10), based on the configurations mentioned above, using the Monte Carlo method. For the test scenario, however, we vary different configurations, such as the transmit power, impairment levels, and nodes' locations, and generate samples with distributions different from that of the training set, in order to verify the generalization property of the implemented DNN.

Fig. 5 illustrates the achievable SKR $R_{\mathrm{key}}$ versus the transmit power of pilot packets for different HI levels. The mutual information for the SKR is calculated numerically, using the empirical distributions of $y_A$ and $y_B$ over $10^5$ realizations of Monte Carlo simulation [43], [44]. The figure demonstrates a fundamental limit of realistic WSKG schemes when HIs are taken into account. In practical wireless networks, we face the ceiling phenomenon, i.e., the SKR saturates as the power increases. This ceiling effect can also be inferred from (2) and (9)-(11), where an increase in transmit power not only improves the quality of the shared randomness, but also increases the variance of the residual HI-related terms. The figure demonstrates that HIs are very influential in the high-SNR regime, since the differences between the SKR values at different HI levels are greater at high transmit powers. In addition, the figure indicates that the less HI the network faces, the higher the achievable SKR, which is in line with intuition. In this figure, we also examine some benchmarks. The SKR of the WSKG scheme is plotted when the received signal strength indicator (RSSI) of $y_A$ and $y_B$ is considered as the source of randomness [4]. We observe that in this case, the achievable SKR is much lower than that of our scheme. This is because the RSSI-based scheme only utilizes the amplitudes of the observations instead of the complex-valued observations $y_A$ and $y_B$. We also compare the hardware-impaired results with the special case of perfect hardware. We can infer from the figure (the line with triangle markers) that the imperfect reciprocity of wireless links also plays an important role in the ceiling effect. Moreover, when the HIs are neglected and the channel is assumed to be perfectly reciprocal, which is not realizable in a realistic deployment [9], a large gap occurs between the SKR of the ideal and the realistic scenarios. To conclude the discussion of Fig. 5, it is pivotal for network designers to carefully take the hardware and channel imperfections into account to have an accurate understanding of the wireless system.

Fig. 6 illustrates the observation mismatch, measured via the (normalized) MSE metric, between the (absolute value of the) sequences observed at Bob and the sequences predicted at Alice. The MSE metric is plotted for different transmit powers $P_{A,B,R} = P$. This figure provides useful insight into choosing an appropriate transmit power for pilot packets. To elaborate, increasing $P$ does not necessarily lead to lower MSEs. In other words, if the signal level of the common randomness gets close to the received signal level of the fake data, the mismatch between the legitimate parties increases according to (9) and (10). Fig. 6 also verifies that our proposed DNN is robust against different ranges of pilot power. In other words, our proposed DNN shows a substantial reduction in observation mismatch for a wide range of transmit powers, despite being trained by pilot packets with a fixed power $P_{A,B,R} = 10$ dBm. Thus, the proposed DNN is data efficient and does not need to be retrained when the pilot powers change.
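The simulation setup described in this section, i.e., the path-loss channel model with the stated node placement and the normalized MSE metric of Fig. 6, can be sketched as follows. The function names are hypothetical, and the normalization used in `nmse_abs` (dividing by the mean energy of the target sequence) is an assumption, since the exact normalization is not spelled out in the text. Hardware impairments and the MitM signal of (9) and (10) are deliberately left out.

```python
import numpy as np

C = 3e8            # speed of light [m/s]
F_C = 2.4e9        # carrier frequency [Hz]
ALPHA = 4.0        # path-loss exponent
G = C / (4.0 * np.pi * F_C)   # constant path-loss parameter

# Node coordinates in metres, as listed in the text.
POS = {
    "A": np.array([-10.0, 0.0]),
    "B": np.array([10.0, 0.0]),
    "R": np.array([0.0, 5.0]),
    "M": np.array([0.0, -5.0]),
}


def distance(u, v):
    """Euclidean distance between two named nodes."""
    return float(np.linalg.norm(POS[u] - POS[v]))


def sample_channel(u, v, rng, size=1):
    """Draw h = G * d^(-alpha/2) * h0 with Rayleigh fading h0 ~ CN(0, 1)."""
    h0 = (rng.standard_normal(size) + 1j * rng.standard_normal(size)) / np.sqrt(2.0)
    return G * distance(u, v) ** (-ALPHA / 2.0) * h0


def nmse_abs(pred, target):
    """MSE between magnitudes, normalized by the target's mean energy (assumed)."""
    err = (np.abs(pred) - np.abs(target)) ** 2
    return float(np.mean(err) / np.mean(np.abs(target) ** 2))


rng = np.random.default_rng(0)
h_ar = sample_channel("A", "R", rng, size=100_000)
h_br = sample_channel("B", "R", rng, size=100_000)
print(f"d_AR = {distance('A', 'R'):.3f} m, E|h_AR|^2 = {np.mean(np.abs(h_ar) ** 2):.3e}")
print(f"NMSE between two independent links: {nmse_abs(h_ar, h_br):.3f}")
```

With the default placement, the relay is equidistant from Alice and Bob, so the two relaying links have the same average gain; moving the relay breaks this symmetry.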
In this experiment, we also investigate the performance of our proposed DNN compared with different state-of-the-art schemes. We also show the generalization capability of our DL-based approach in different communication scenarios. Notably, the following DL-based and non-DL-based benchmarks are considered:

1) PHY-ONLY WSKG SCHEME
Fig. 6 shows that more than 40% improvement, in terms of distillation mismatch between Alice and Bob, is achieved by implementing our proposed DNN compared with a PHY-only WSKG scheme [16], [18], which only relies on PHY-extracted observations rather than employing a neural network.

2) ECHO-BASED NEURAL NETWORK
An echo state network (ESN) is implemented as a benchmark for predicting the observation sequences [41]. ESNs perform prediction using a relatively large reservoir of sparsely connected neurons, each of which has a short-term memory of the previously seen states. The main idea of ESNs is that the sparse random connections in the reservoir pool let previous states "echo" even after they have passed. After data echoes in the pool, it flows towards the output layer. The recurrent connections in the reservoir pool, together with the connection weights in the input layer, are randomly generated, while the output layer (which connects the reservoir to the output neurons) is trained during the training process. A general sketch of a typical ESN is illustrated in Fig. 7. For this benchmark, we implemented an ESN with a pool of $N_r = 50$ neurons. We also set the spectral radius of the reservoir weights to 0.5 with a connection density of $s_p = 0.5$. In addition, the weights of all untrained connections were chosen uniformly between $-1$ and $1$. As can be seen from Fig. 6, our proposed GRU-based scheme outperforms the ESN benchmark by about 15%. This performance is achieved thanks to the reset and update mechanism of the GRU layers in (12), while the typical update equation of the neuron reservoir is a simple echo-inspired update (see [41] and [24] for the detailed mechanism of ESNs).
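To make the ESN mechanism above concrete, the following is a minimal sketch of a generic echo state network with the stated reservoir settings ($N_r = 50$, spectral radius 0.5, connection density 0.5, untrained weights drawn uniformly from $[-1, 1]$). The `tanh` state update, the ridge-regression readout, the rescaling step used to impose the spectral radius, and the toy sine-wave prediction task are assumptions, not details taken from the benchmark implementation.

```python
import numpy as np


def build_reservoir(n_r=50, density=0.5, spectral_radius=0.5, rng=None):
    """Random sparse reservoir matrix, rescaled to the requested spectral radius."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = rng.uniform(-1.0, 1.0, size=(n_r, n_r))
    w *= rng.random((n_r, n_r)) < density          # keep roughly `density` of the links
    rho = np.max(np.abs(np.linalg.eigvals(w)))     # current spectral radius
    return w * (spectral_radius / rho)


def run_reservoir(u, w_in, w_res):
    """Collect the states x[n] = tanh(w_in * u[n] + w_res @ x[n-1]) for a scalar input."""
    x = np.zeros(w_res.shape[0])
    states = []
    for u_n in u:
        x = np.tanh(w_in * u_n + w_res @ x)
        states.append(x)
    return np.stack(states)


def train_readout(states, targets, ridge=1e-6):
    """Only the output weights are trained, here by ridge regression."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ targets)


rng = np.random.default_rng(1)
w_res = build_reservoir(rng=rng)
w_in = rng.uniform(-1.0, 1.0, size=50)     # untrained input weights in [-1, 1]
u = np.sin(0.2 * np.arange(400))           # toy scalar input sequence
states = run_reservoir(u[:-1], w_in, w_res)
targets = u[1:]                            # toy task: one-step-ahead prediction
w_out = train_readout(states, targets)
train_mse = float(np.mean((states @ w_out - targets) ** 2))
print(f"training MSE of the readout: {train_mse:.2e}")
```

The sketch makes the training advantage visible: fitting the readout is a single linear solve, whereas the recurrent and input weights are never updated.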

3) FULLY-CONNECTED (FC) NEURAL NETWORK
The FC network is implemented as another learning-based benchmark [21], [22]. For this benchmark, we implemented two dense layers with 8 neurons at the hidden layer and 5 neurons at the output layer. Notably, our proposed DNN is comprised of both GRU layers and dense layers. Therefore, in addition to having the learning capabilities of an FC network, our DNN is also capable of capturing the relevant information that lies within the sequence of observations. Hence, a performance gain of about 20% is achieved compared with a simple FC network.
Comparing the general structure of our DNN, which is comprised of recurrent and dense layers (Figs. 3 and 4), the ESN benchmark, which is an aggregated version of recurrent neurons with a sparsely connected network (Fig. 7), and a general FC network, one can intuitively infer that the prediction performance of an ESN lies between that of an FC network and that of a GRU-based network; our results in Fig. 6 validate this claim.

4) FURTHER COMPARISONS BETWEEN GRU, ESN, AND FC
To gain a more comprehensive insight into the performance comparison between the GRU-based neural network, the ESN, and the FC network, we further examined the required training time (with a fixed training data size), the inference time for predicting new sequences (over a fixed number of 256 timestamps), and the memory storage required for saving each of the corresponding neural networks after they are trained. These experimental results are summarized in Table 3. According to the table, the implemented ESN has a small training time compared with the FC and GRU-based networks. This reflects the typical advantage of the echo-based approach, i.e., its very simple training process, as the output layer is the only layer that gets trained, while the other weights are randomly assigned just once. To address the performance-complexity trade-off, we mention that the computation time of the ESN is less than that of the GRU-based approach and the FC network, thanks to its relatively simple recurrent structure with sparse connections. However, our proposed DNN achieves much lower MSEs than the ESN and FC networks, as shown in Fig. 6. This can be interpreted as a trade-off between computation time/complexity and the resultant MSE. Notably, the memory storage required to save the trained neural network is very large for the ESN. This is due to the large number of internal states in the reservoir pool that need to be stored for inference on new data.
Based on the results of Table 3, one might argue that the training time for our GRU-based network is too long. Although this seems to contradict the purpose of utilizing our DNN, this is not the case for the following reasons. Training is performed offline before establishing the real-time configuration settings. Hence, a much higher computational time can be afforded, with significantly fewer constraints than in a real-time computation [13]. Once the offline training is finished, the DNN can be used for online prediction of new data sequences in real time, where our results show that better performance than the ESN and FC networks can be achieved, with much less time for online computations than for the offline training.
We further study the computational complexity of our scheme and the state-of-the-art benchmarks. The computational complexity is evaluated in terms of the number of floating-point operations (FLOPs) [45], as given in Table 4. In this table, $l_i$ and $l_o$ denote the lengths of the input and output vectors of the corresponding neural networks (which are set to 20 in our numerical experiments). For the ESN benchmark, $N_r$ and $s_p$ denote the number of internal neurons within the reservoir pool and the sparsity parameter, respectively. Finally, $n_i$ and $n_i^h$ ($1 \le i \le H$) respectively stand for the number of neurons in the dense layers of the FC benchmark and the number of hidden states in the recurrent layers of our GRU-based scheme, with $H$ denoting the number of neural layers according to Fig. 3. Inspecting the computational complexity orders in Table 4, one can infer that the computational complexities of the studied benchmarks are more or less the same, with the same polynomial order $O(n^2)$ with respect to the number of employed neurons. This is also in line with the inference computation time results in Table 3. Nevertheless, we emphasize that the inference computation time in Table 3 comprises not only the tensor-based multiplications, but also other operations and processes, including additions, concatenations, activation functions, and reading from and writing to the memory, which differ among neural network architectures; investigating their corresponding mechanisms is beyond the scope of this article.
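The quadratic growth mentioned above can be illustrated with a back-of-the-envelope FLOP count. The expressions below follow a common counting convention (one multiplication and one addition per weight, element-wise gate operations and activations ignored) and are assumptions for illustration; they are not the expressions of Table 4, and the layer sizes used are hypothetical.

```python
def dense_flops(n_in, n_out):
    """FLOPs of one dense layer: an (n_out x n_in) matrix-vector product plus bias."""
    return 2 * n_in * n_out


def gru_flops_per_step(n_in, n_hidden):
    """FLOPs of one GRU time step, counting only the matrix-vector products.

    Update gate, reset gate, and candidate state each use one input matrix
    (n_hidden x n_in) and one recurrent matrix (n_hidden x n_hidden).
    """
    return 3 * (2 * n_hidden * n_in + 2 * n_hidden * n_hidden)


# Doubling the number of hidden units roughly quadruples the per-step cost.
for n_hidden in (8, 16, 32):
    print(n_hidden, gru_flops_per_step(n_in=1, n_hidden=n_hidden))
```

For a sequence of length $L$, the per-step GRU count is multiplied by $L$, which leaves the polynomial order in the number of neurons unchanged.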
We finally emphasize that moving from software level computations towards hardware level implementations, other computational complexities could be taken into account, such as number of bit operations (BOP), number of additions and bit shifts (NABs) in fixed-point computations, and number of hardware logic gates [45], which are not the focus of this article, as we proposed a software-centric security solution by employing a DNN at the application layer of the network protocol stack.

5) SINGLE-STREAM ATTACK
Based on the adversarial attack elaborated in (7), the MitM launches a multi-stream injection attack in our system. A special case of single-stream attack can also be considered as a benchmark by choosing an arbitrary column of (6) [19]. Although the training procedure has been performed under the multi-stream attack, the results of our tests in Fig. 6 show that our DNN is also robust in the single-stream scenario.

6) MULTI-ANTENNA WSKG
To highlight the generic capability of our proposed scheme, a MIMO WSKG scheme is considered in this benchmark, where Alice and Bob are equipped with 3 antennas. Notably, it can be seen from Fig. 6 that our DL-based approach can also be applied to the case of multi-antenna legitimate nodes, achieving a 40% performance gain compared with a conventional PHY-only MIMO WSKG scheme [29].

Remark 4: Trade-off between MSE and SKR upper bound. We emphasize that implementing DNNs cannot increase the "achievable" SKR, due to the DPI. Mathematically, we have $I(F_{W,B}(y_A); y_B) \le I(y_A; y_B)$. However, as we can see from Fig. 6, the MSE of the observations is decreased by using the proposed DNN. This can facilitate lighter information reconciliation algorithms for error correction, resulting in less information leakage and communication overhead during the reconciliation phase. Investigating the trade-off between utilizing DNNs and employing reconciliation algorithms is an interesting direction for future work.
Fig. 8 shows the SKR versus the MitM adversarial power $P_M$ when R is placed at different locations. Pilot packets with a power of 0 dBm are considered for this test. The figure highlights the fact that when the relay moves towards one of the legitimate parties, the achievable SKR decreases. This is because in a non-symmetric placement of the legitimate nodes, higher levels of discrepancy between $y_A$ and $y_B$ occur, since the aggregate levels of HIs in the relaying links $h_{AR}$ and $h_{BR}$ are different. This can also be inferred from the residual terms in (11). In this figure, the SKR of a conventional scheme with unmodified pilot signals is also depicted, which shows that the MitM can override the key generation process if RPs are not exploited. This can be inferred from (8), in which a common adversarial data $z_M$ (designed, controlled, and injected by Matt) lies within the observations of Alice and Bob. In other words, the information leakage in this case can become arbitrarily large as $P_M$ increases. However, thanks to the exploitation of RPs, Matt cannot disrupt our proposed scheme by increasing his power. Instead, the adversary should choose $P_M$ such that the received signal level of his adversarial data $z_M$ gets closer to the signal level of the shared randomness data. By doing so, the mismatch between the legitimate endpoints increases and the achievable SKR decreases. A similar trend was seen in Fig. 6 as well.

In order to generate raw key sequences at Alice and Bob, denoted by $K_A$ and $K_B$, respectively, a one-bit quantization block is employed at Alice and Bob, where the quantizer input at Alice is the predicted data $F_{W,B}(y_A)$ and the input at Bob is the observation data $y_B$ at each time-stamp. In addition, $\mu_B$ and $\sigma_B$ denote the mean and standard deviation of Bob's observations, which are assumed to be publicly known among the legitimate parties. Moreover, the quantization guard band is set to 0.3 [21].

Fig. 9 illustrates the key difference rate (KDR) versus the level of HIs, $\kappa_A^r = \kappa_B^r = \kappa^r$, for the two cases of Matt being located at $[0, -5]$ m and $[10, -5]$ m. We mention that according to Remark 3, it is not tractable to derive a closed-form expression for the KDR. For this figure, the HI of the intermediate relay is set to $[\kappa_R^t, \kappa_R^r] = [0.1, 0.1]$. We have considered that $\kappa_{A(B)}^t = \kappa_{\mathrm{tot}} - \kappa_{A(B)}^r$, with $\kappa_{\mathrm{tot}} = 0.3$ [25]. The figure indicates that the level of HIs at the receiver ends plays a more important role than the transmit HIs. This can be inferred from the overall protocol proposed in Section III, where the Rx hardware of Alice and Bob contributes to the first and the third steps of the packet exchange, while their Tx hardware is active in the second step only. Inspecting the residual terms in (11) also verifies this fact. From Fig. 9, one can also infer that when Matt is near one of the legitimate parties, higher levels of mismatch are imposed, leading to higher KDRs. This is because Matt can cause greater discrepancies due to the unbalanced levels of HIs in the adversarial links $h_{MA}$ and $h_{MB}$. Moreover, Fig. 9 shows that our proposed DNN is data efficient in terms of the values of HIs, i.e., our DNN is able to provide lower KDRs for a wide range of HIs.

Fig. 10 depicts the average number of randomness distillation sessions, i.e., the sessions of packet exchange, required to be performed by A and B to agree on a secret key of length $|K| = 256$ bits [11]. Notably, a key of 256 secure bits can be utilized for encrypting up to gigabytes of data [11]. In this test, the average number of required sessions is calculated based on the general formula of [21], i.e., $\frac{|K|}{N_{sc}(1 - \mathrm{KDR})}$. Moreover, we consider the WSKG scheme over $N_{sc} = 12$ parallel blocks to show the generalization capability of our proposed scheme [18]. The results of Fig. 10 imply that increasing the transmission power of pilot packets can decrease the required number of sessions. This is because increasing the pilot power can lower the KDR. For instance, in the ideal case of perfect hardware, with a transmit power of more than 25 dBm the number of sessions tends to its minimum value of $\lceil |K| / N_{sc} \rceil = 22$, which corresponds to the case of one-bit quantization with zero KDR. In addition, the figure shows the negative impact of HIs, which can increase the required number of sessions. For instance, HIs at a level of $\kappa_{\mathrm{tot}} = 0.1$ can impose about a 9% increase in the number of required sessions. Thus, it is important to carefully take the hardware and channel imperfections into account to reflect the realistic behavior of wireless systems.
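The quantization, KDR, and session-count steps discussed above can be sketched as follows. The thresholds $\mu_B \pm 0.3\,\sigma_B$, the rule of discarding samples that fall inside the guard band at either party, and rounding the session count up to an integer are assumptions in the spirit of guard-band quantizers; the function names and the toy Gaussian data are hypothetical.

```python
import math

import numpy as np


def one_bit_quantize(x, mu, sigma, guard=0.3):
    """Map samples above mu + guard*sigma to 1 and below mu - guard*sigma to 0.

    Returns the bits and a mask of samples that fall outside the guard band.
    """
    x = np.asarray(x, dtype=float)
    upper = mu + guard * sigma
    lower = mu - guard * sigma
    keep = (x > upper) | (x < lower)
    bits = (x > upper).astype(int)
    return bits, keep


def key_difference_rate(k_a, k_b):
    """Fraction of positions where the two raw keys disagree."""
    return float(np.mean(np.asarray(k_a) != np.asarray(k_b)))


def required_sessions(key_len=256, n_sc=12, kdr=0.0):
    """Number of sessions |K| / (N_sc * (1 - KDR)), rounded up."""
    return math.ceil(key_len / (n_sc * (1.0 - kdr)))


rng = np.random.default_rng(0)
bob = rng.standard_normal(1000)                    # toy stand-in for y_B
alice = bob + 0.2 * rng.standard_normal(1000)      # toy stand-in for F(y_A)
mu_b, sigma_b = bob.mean(), bob.std()

bits_a, keep_a = one_bit_quantize(alice, mu_b, sigma_b)
bits_b, keep_b = one_bit_quantize(bob, mu_b, sigma_b)
both = keep_a & keep_b                             # positions kept by both parties
kdr = key_difference_rate(bits_a[both], bits_b[both])
print(f"KDR = {kdr:.4f}, sessions = {required_sessions(kdr=kdr)}")
```

With zero KDR, `required_sessions()` evaluates to 22 for $|K| = 256$ and $N_{sc} = 12$, which matches the minimum value quoted in the text.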

VII. CONCLUSION
In this article, we studied a DL-based approach for relay-aided WSKG in wireless networks under MitM adversarial attacks. We took into account the practical assumptions of HIs and imperfect channel reciprocity to gain a realistic understanding of a practical system. To prevent the MitM from spoofing the randomness distillation, RPs were deployed at the PHY layer. We also implemented a DNN, comprised of GRUs, to further improve the WSKG process. The impacts of HIs and MitM adversarial attacks on the system's performance were examined, while numerous experiments were conducted to highlight the performance gain of our DL-based approach compared with the state of the art. Proposing a mathematical framework to analytically study the trade-off between the computation overhead of the learning block and the information leakage of the reconciliation phase will be considered in our future work. Moreover, we will incorporate other types of learning-based attacks, such as adversarial machine learning (AML), into the WSKG process in our future work.
Another important direction that is left for our future work is to study the applications of WSKG scheme at the intersection of 6G networks and emerging technologies such as metaverse and digital twins [46], [47], [48].

FIGURE 1. Proposed learning-assisted system model for hardware-impaired relay-aided WSKG under MitM attack.

FIGURE 2. Block diagram of the proposed signaling and learning protocol.
where the superscript (2) indicates the second step of pilot packet exchange. Moreover, $p_M^{(2)} = \frac{h_{MR}^{\dagger}}{\|h_{MR}\|}$ denotes the precoder of Matt. Details on how to choose the pilot signals $x_A^p$, $x_B^p$ are elaborated later.
), the self-interference signals at Alice and Bob can be formulated by $G\rho h_{AR}^2 x_A^p$ and $G\rho h_{BR}^2 x_B^p$, respectively. Since A and B have estimated versions $\hat{h}_{AR}$, $\hat{h}_{BR}$, it results in $\hat{y}_N^{(3)} = \tilde{y}_N^{(3)} - G\rho \hat{h}_{AN}^2 x_N^p$, for $N \in \{A, B\}$. After local interference cancellation, A and B locally multiply their raw observations $\hat{y}_A^{(3)}$ and $\hat{y}^{(3)}$

FIGURE 3. Proposed deep neural network implemented for the WSKG.

For the GRU unit of Fig. 4, $\{W_z, W_r, U_z, U_r\}$ and $\{b_z, b_r\}$ in (12) are the weight matrices and bias vectors, respectively, and $\tilde{h}[n]$ denotes the intermediate memory unit (a.k.a. candidate state). According to (12), the update gate determines how useful past information is for the current state. The sigmoid function exploited in (12) leads to update values between 0 and 1. By invoking (12), the closer $z[n]$ is to 1, the more we incorporate past information, and vice versa. The reset gate helps the network ignore past information that might be irrelevant in future steps. Finally, the new candidate value $\tilde{h}[n]$ is scaled by the GRU state update, and $h[n]$ is calculated as the output. In the following, we study the utilization of GRU-based DNNs for the application of WSKG in our scheme.

B. TRAINING PROCEDURE OF THE PROPOSED DNN
During the training phase, the weights $W = \{W_z, W_r, U_z, U_r\}$ and biases $B = \{b_z, b_r\}$ in (12) need to be properly adjusted. This adjustment is done by training our DNN with a training set denoted by $T = \{(y_A, y_B)_i\}$, $i = 1, \ldots, N_T$, with $N_T$ denoting the number of training samples. Moreover, $(y_A, y_B)$ denotes the vectors of observations at Alice and Bob, each of which is a sequence of length $L$, where each element is given in (9) and (10), respectively. Using the examples provided in $T$, our DNN gradually learns to predict the sequence of Bob. Mathematically speaking, the training process aims to adjust the weights and biases of the DNN with the goal of minimizing the loss between the desired output vector $y_B$ and the actual output sequence $\hat{y}_B = F_{W,B}(y_A)$.
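The gating mechanism of a GRU unit described above can be sketched in a few lines of NumPy. This is a textbook GRU step written to match the stated convention (the closer $z[n]$ is to 1, the more past information is kept); it is not claimed to reproduce (12) term by term. The candidate-state parameters `W_h`, `U_h`, `b_h`, the two-feature input (e.g., real and imaginary parts), and the hidden size of 8 are assumptions introduced for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_gru(n_in, n_hidden, rng):
    """Small random parameters; the candidate-state set (W_h, U_h, b_h) is assumed."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = 0.1 * rng.standard_normal((n_hidden, n_in))
        params["U_" + gate] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        params["b_" + gate] = np.zeros(n_hidden)
    return params


def gru_step(x, h_prev, p):
    """One GRU update for input x[n] and previous hidden state h[n-1]."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])    # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])    # reset gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand                       # z near 1 keeps the past


def gru_layer(xs, p):
    """Run the unit over a whole sequence and return all hidden states."""
    h = np.zeros(p["b_z"].shape[0])
    outputs = []
    for x in xs:
        h = gru_step(x, h, p)
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
params = init_gru(n_in=2, n_hidden=8, rng=rng)
seq = rng.standard_normal((20, 2))      # one toy sequence of length L = 20
hidden = gru_layer(seq, params)
print(hidden.shape)
```

In the proposed DNN, two such layers would be stacked and followed by two dense layers that perform the regression towards Bob's sequence.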

FIGURE 7. General sketch of the ESN benchmark.

FIGURE 9. Key difference rate vs. the level of receiver impairment for ρ = 0.95, P M = 10 dBm, and P A,B,R = −10 dBm.

TABLE 2. Parameters for Training the Proposed DNN