Innovative Variational AutoEncoder for an End-to-End Communication System

Powered by deep learning (DL), autoencoder (AE)-based end-to-end (E2E) communication systems have been developed to merge all physical layer blocks of traditional communication systems and have achieved great success. In this paper, a new probabilistic model based on the variational autoencoder (VAE) is proposed for short-packet wireless communication systems. In this approach, the information messages are represented by so-called packet hot vectors (PHVs), which are inferred by the VAE latent random variables (LRVs), so that only the LRVs' parameters need to be transmitted through the physical wireless channel. This results in a significant improvement in spectral efficiency compared with the pure AE approach, where longer hot vectors have to be transmitted. Specific VAE models have been developed for both binary phase shift keying (BPSK) and quadrature phase shift keying (QPSK) systems. Simulation and numerical results demonstrate the performance of the proposed method in different realistic scenarios, including Rayleigh and Rician fading channels with shadowing and Doppler effects. The results show that the proposed VAE with a DL classifier provides a better symbol error rate (SER) than both the baseline AE and the classical Hamming code with hard-decision decoding. Furthermore, in terms of spectral efficiency, the proposed VAE using two channels outperforms the baseline AE using seven channels.


I. INTRODUCTION
Wireless networks and related services are becoming more intelligent with innovative advances and unprecedented levels of computing capability. The advent of numerous new services, such as factories, self-driving cars, smart cities, telemedicine, and remote diagnostics, presents a challenge to classical communication in terms of latency, flexibility, reliability, energy efficiency, and connection density. All of these technologies require new architectures, approaches, and algorithms in almost all layers of communication systems. An advanced artificial intelligence (AI)-based approach can significantly improve the design and management of communication components. AI, represented by machine learning (ML) and deep learning (DL), has attracted tremendous attention as it has successfully transformed the manner in which humans work and communicate, as addressed in [1] and [2]. Some of these techniques have been applied in the communication literature, have triggered extensive research, and have greatly impacted the solutions to some communication problems. Various emerging trends in DL methods also build on information theory, probability, statistics, and solid mathematical modelling. The primary function of a communication system is to transmit a message, such as a bit stream, from the source to the destination over a channel through the use of a transmitter and a receiver. To achieve this, the transmitter and receiver are conventionally segmented into chains of multiple independent blocks, each of which is responsible for a particular sub-task.
The associate editor coordinating the review of this manuscript and approving it for publication was Dave Cavalcanti.
Many DL approaches have been demonstrated in applications such as modulation recognition [3], signal detection [4], channel coding [5], channel decoding [6], [7], [8], [9], channel estimation and detection [10], [11], [12], [13], [14], and the replacement of the entire communication system with a novel architecture based on an autoencoder (AE). In [3] and [15], the authors show a significant gain by introducing an AE as a communication system, in which the modulation and coding are jointly designed as one end-to-end (E2E) DL model. The work in [3] showed how the use of block structures typically enables individual optimization, analysis, and control of each block, without the need for any domain-specific information. The E2E AE can achieve a performance similar to that of the conventional method in additive white Gaussian noise (AWGN) channels. However, the block-based approach is sub-optimal in certain cases [3]. For DL-based communication system design, the optimization of the E2E system as a single black-box block is proposed in [3] and [16]. These works show that E2E learning in communication systems has received widespread attention in the wireless communications community [17], [18]. In this paper, we use generative models known as variational autoencoders (VAEs) [20], [46], as they have been extensively used for unsupervised and semi-supervised DL. Moreover, since most current mobile systems generate unlabelled or semi-labelled data, the VAE is well suited to learning in wireless environments.

A. RELATED WORKS
As DL advances, the research paradigm can shift away from designing schemes using mathematical models towards autonomously constructing E2E DL schemes based on observations of large quantities of data. For example, when DL is employed for image classification, feature detectors that are far more accurate than conventional detectors can be derived from a large set of image inputs using DNN structures. Therefore, in the age of DL, the design process starts with preparing, selecting, and pre-processing the data to be used in the DNN structure, followed by determining the appropriate DNN structure. Lastly, interpreting the output of the DNN becomes more important than developing analytic schemes from mathematical models, which typically contain assumptions necessary to enable analysis.
Recently, DL has been applied to many areas of wireless communications research. Besides improving conventional communication modules, DL-based E2E communication systems have been developed, in which DNNs represent both the transmitter and the receiver. A framework with block structures under AWGN channels was proposed in [3] and performs similarly to traditional approaches. There are also E2E frameworks for the orthogonal frequency division multiplexing (OFDM) system [21] and the singular value decomposition (SVD) precoding-based MIMO system [15], which view the channels as a group of independent sub-channels.
Moreover, recent research has examined how to learn an E2E communication system without prior knowledge of the channel model. A reinforcement learning (RL) approach was developed in [22] to optimize the transmitter DNN without regard to the channel transfer function or channel state information (CSI). The stochastic perturbation approach was used in [23] to design a model-free E2E communication framework. In [24], a conditional generative adversarial network (GAN) approach was developed for building E2E communications, where the channel effects are modelled by a conditional GAN.
In contrast to other ML techniques that do not require communication resources, federated learning (FL) utilizes communication between a central server and distributed local clients in order to train and optimize the model. FL allows model training to be distributed among multiple clients, each with a certain amount of training data and coordinated through a central server. Therefore, the computation can be offloaded from the central server to the clients. In brief, in FL, the local clients communicate with the central server using only locally learned model parameters rather than raw data, which preserves privacy and reduces communication overhead [25], [26], [27].
ML is a part of the artificial intelligence field and includes algorithms for classification, clustering, and dimensionality reduction (DR). Over the last decade, various classification algorithms have been developed, including deep convolutional neural networks (DCNNs) [28] and variational autoencoders (VAEs) [46]. The VAE inherits the traditional AE architecture, meaning that it is composed of two neural networks (NNs), an encoder and a decoder. The encoder reduces the dimensionality of the inputs by mapping them into a latent space, while the decoder learns to reconstruct the inputs from the latent space. Thus, VAEs can be used for classification [29], [30], [31] and generation [32], [33], [34]. Moreover, a VAE can learn a data-generating distribution from which random samples can be drawn in the latent space. After decoding the random samples using the decoder network, it generates unique images with features similar to those on which the network was trained. Using the Bayes rule [35], the VAE can learn the joint probability of input data and labels simultaneously. Bayesian inference is a method of statistical inference that provides a powerful framework for reasoning and prediction under uncertainty. However, the posterior can be computed for only a few parametric distributions, which makes wider application of Bayesian inference difficult [36]. Recently, particle-based variational inference (ParVI) methods have been proposed [37], [38], [39]; they approximate the posterior by representing the variational distribution with a set of particles that are updated through a deterministic optimization process. Although the ParVI method can achieve computational efficiency and asymptotic accuracy, it is restricted to a fixed number of particles and lacks the ability to draw new samples beyond the initial set of particles [37]. Generally, variational inference and Markov chain Monte Carlo (MCMC) methods have been used to provide tractable approximate inference, but these approaches bring their own set of challenges when the dimensionality of the space is particularly high. Bayesian neural networks (BNNs) are a recent example of interest. They apply Bayesian inference to deep neural network training to provide a principled mechanism for analyzing model uncertainty. Developing efficient computational strategies to estimate this intractable posterior with exceptionally high dimensionality, however, remains challenging.
VOLUME 11, 2023
On the basis of the above and the development of DL, semantic communication is again being considered a key technology and has received great attention. As the 5G system has approached the Shannon limit, semantic communication aims at the successful transmission of the semantic information of the source rather than the accurate reception of each bit or symbol regardless of its meaning. Semantic communication corresponds to the second level of communication as defined by Shannon and Weaver [40], which aims to accurately convey the semantic information of the transmitted symbols rather than to accurately recover the transmitted information.
Recently, several semantic communication concepts have been developed based on NNs to replace conventional communication blocks. In [41], a conditional generative adversarial net (GAN) was designed to represent channel effects, while in [42], a complete point-to-point communication system in the physical layer was developed using NNs. The authors of [43] show that the network can learn a projection function from the feature space to a semantic embedding space in zero-shot learning (ZSL) models. The work in [44] developed a DL-based semantic communication system (DeepSC) for text transmission, with the aim of maximizing the capacity of the system and minimizing semantic errors, as it recovers the meaning of sentences rather than minimizing bit or symbol errors. Moreover, the authors in [45] proposed a semantic communication approach based on an AE for the wireless relay channel (AESC) to extract and compress semantic information and reconstruct its semantic features. However, there are some key differences between semantic communication systems and conventional communication systems, which can be defined as follows [44]:
• The design and optimization of the information transmission module in conventional systems are contained in the transceiver, unlike the semantic system, where the whole information processing block is jointly designed from the source information to the sink.
• Recovering the exact data is the focus of conventional communication systems; however, semantic communication systems are intended for transmission decisions.
• Conventional communication systems compress data in the entropy domain, while semantic communication systems process data in the semantic domain.

B. MAIN CONTRIBUTIONS
In this paper, a new approach is proposed and investigated that uses a variational autoencoder (VAE) as a probabilistic model to reconstruct the transmitted symbol without sending the data bits out of the transmitter. Our main contributions are summarized as follows:
• We propose an E2E communication system that represents the symbol as a PHV and operates over BPSK modulation in AWGN channels, where modulation and demodulation are performed by a deep neural network (DNN) based on a VAE architecture.
• We extend our experiments to investigate QPSK modulation, Rayleigh and Rician fading channels, shadowing, and the Doppler effect for a limited range of Doppler frequency shifts and phase offsets.
• While the baseline AE in [3] uses 4 and 7 channels to achieve its results, our work uses only two channels and achieves better performance than the AE baseline.
• Our work considers a VAE with two LRVs and a simple classifier that can reconstruct the transmitted message when only the LRVs' parameters are sent, and it is evaluated using the message error rate (MER). The results show that the performance of the proposed system is better than that of the existing classical scheme.

C. PAPER STRUCTURE AND NOTATIONS
The rest of this paper is organized as follows. Section II describes the system model, starting from the anatomy of the VAE and then formulating the wireless system model and the VAE model. Section III outlines the experiment setup, the classifier training algorithm, and the VAE training algorithm. Section IV evaluates the performance of the proposed VAE and compares it with several benchmarks. Finally, Section V draws conclusions. A list of the important symbols used throughout this paper is given in Table 1.

II. SYSTEM MODEL
In this paper, the wireless communication system model has a simple setup so that the reader can follow the proposed idea. Our goal is to design a probabilistic model that can reconstruct the transmitted information without sending the exact bits, or a deterministic transformation of the bits of the exact symbol (e.g., channel coding using Hamming codes), out of the transmitter. Instead, the statistical parameters of the LRVs are transmitted through the physical layer.

A. VARIATIONAL AUTOENCODER (VAE)
A brief description of the basic VAE, on which this work builds, is required to clearly grasp what follows. The VAE is a popular generative model that allows us to solve problems in the framework of probabilistic graphical models with latent variables [46], [47]. VAEs can be considered as two independently parameterized models: the recognition model, known as the encoder, and the generative model, or decoder. The encoder delivers an approximation of its posterior over the latent random variables to the decoder, which the decoder requires to update its parameters inside an iteration of expectation-maximization learning. Conversely, the decoder is a scaffolding of sorts for the encoder to learn meaningful representations of the data besides class labels. In other words, the VAE helps the encoder infer the distribution of the original data rather than the original data itself. By employing a properly designed objective function, the distribution of the original data can be encoded into certain low-dimensional distributions. Similarly, training allows the decoder to transform these distributions into an approximation of the original data distribution, from which a new sample can be obtained that represents the reconstruction of the original one.
Moreover, as probabilistic models, VAEs also contain data and unknowns. Therefore we need to assume some level of uncertainty around this aspect of the model. This uncertainty can be specified in terms of a conditional probability distribution, where the model can contain both discrete and continuous variable values. In addition, between these variables, this probabilistic model is able to specify all correlations and higher-order dependencies in the form of a joint probability distribution.
As shown in Fig. 1, VAEs learn stochastic mappings between the observed $x$-space, which has the distribution $q_D(x)$, and the latent $z$-space. The generative model learns the joint distribution $p_\theta(x, z)$, which is factorized as $p_\theta(x, z) = p_\theta(z)\, p_\theta(x \mid z)$ with a prior distribution over the latent space $p_\theta(z)$ and a stochastic decoder $p_\theta(x \mid z)$. The inference model, or stochastic encoder, $q_\phi(z \mid x)$ approximates the true but intractable posterior $p_\theta(z \mid x)$ of the generative model [47].
Specifically, we use the vector $x$ to represent the set of all observed variables whose joint distribution we want to model. We assume that the observed variable $x$ is a random sample from an unknown underlying process with an unknown probability distribution $p^*(x)$. To approximate this underlying process, we use a chosen model $p_\theta(x)$ with parameters $\theta$, which can be written as:

$$x \sim p_\theta(x). \tag{1}$$

To find the value of the parameter $\theta$, we use the learning process (footnote 1), which is the most commonly used search process. The probability distribution given by the model $p_\theta(x)$ should approximate the true distribution of the data, denoted by $p^*(x)$; therefore, for any observed $x$:

$$p_\theta(x) \approx p^*(x). \tag{2}$$

Often, in the case of classification or regression problems, we are interested in learning a conditional model $p_\theta(y \mid x)$ that approximates the underlying conditional distribution $p^*(y \mid x)$, where the distribution over the values of the variable $y$ is conditioned on the value of the observed variable $x$. In this case, $x$ is the input of the model. As above, the model $p_\theta(y \mid x)$ is chosen and optimized to be close to the unknown underlying distribution for any $x$ and $y$:

$$p_\theta(y \mid x) \approx p^*(y \mid x). \tag{3}$$

One of the most common examples of conditional modelling is image classification, where $x$ is an image and $y$ is the class of the image that we want to predict.

Footnote 1 (learning): In terms of ML, the concept of learning can be formulated as Tom Mitchell defines it, as a ''problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples'' [48].
We can extend the models discussed above into directed models with latent variables, where latent variables are variables that are part of the model but not part of the data set, and which we therefore do not observe. We use $z$ to denote the latent variables. In the case of unconditional modelling of the observed variable $x$, the directed graphical model is represented by a joint distribution $p_\theta(x, z)$ over the observed variable $x$ and the latent variables $z$. The marginal distribution over the observed variables $p_\theta(x)$ can be written as:

$$p_\theta(x) = \int p_\theta(x, z)\, dz. \tag{4}$$

The model $p_\theta(x, z)$ can also be conditioned on some context, such as $p_\theta(x, z \mid y)$. We use the term ''deep latent variable model'' (DLVM) for the case in which the distributions are parameterized by NNs. The advantage of the DLVM is that even when each factor in the directed model, whether a prior or a conditional distribution, is relatively simple, the marginal distribution $p_\theta(x)$ can be very complex. This expressivity makes the DLVM attractive for approximating complicated underlying distributions $p^*(x)$. One of the simplest and most common DLVMs is specified by the following factorization:

$$p_\theta(x, z) = p_\theta(z)\, p_\theta(x \mid z). \tag{5}$$

More details about the VAE can be found in [19], [46], and [47].

B. WIRELESS SYSTEM MODELS
We build an end-to-end communication system that consists of a transmitter sending the desired signal to a receiver, as shown in Fig. 3. We assume that the wireless channel is an AWGN channel. The received signal vector $s_r$ is formulated as:

$$s_r = s_d + n_o, \tag{6}$$

where $s_d$ is the desired signal vector propagated from the transmitter to the receiver and $n_o$ is the AWGN noise vector, with $N_o = \sigma^2$ being the noise power (variance) that contaminates the transmitted signal, as shown in Fig. 3. By definition, the signal-to-noise ratio (SNR) is:

$$\mathrm{SNR} = \frac{S_r}{N_o}, \tag{7}$$

where $S_r$ is the power of the desired received signal and $N_o$ is the AWGN power. For $t$ bits per symbol in $E_b/N_o$, (6) and (7) can be rewritten in terms of a channel coefficient $h_i \in \mathbb{C}^{1 \times 1}$, which is sampled from a Rayleigh distribution, a Rician distribution, or a log-normal shadowing distribution for the Rayleigh, Rician, and shadowing models, respectively [49]. In the Doppler model, we use the theoretical flat Doppler spectrum $S(f) = \frac{1}{2 f_d}$ and the phase shift $\phi_d$ [50].
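As an illustration of the channel models described above, the following is a minimal sketch, not the simulation code used in this work. It assumes real-valued unit-energy BPSK symbols, the noise variance $\xi^2 = (E_b/N_o)^{-1}$, a Rayleigh coefficient with unit average power, and a receiver that knows the fading phase. All function names are hypothetical.

```python
import math
import random

def ebno_db_to_noise_var(ebno_db):
    """Noise variance xi^2 = (Eb/No)^-1 for unit-energy bits (assumption)."""
    return 1.0 / (10.0 ** (ebno_db / 10.0))

def awgn_channel(symbols, ebno_db, rng):
    """s_r = s_d + n_o, with n_o ~ N(0, xi^2)."""
    sigma = math.sqrt(ebno_db_to_noise_var(ebno_db))
    return [s + rng.gauss(0.0, sigma) for s in symbols]

def rayleigh_channel(symbols, ebno_db, rng):
    """s_r = |h| * s_d + n_o, with h complex Gaussian and E[|h|^2] = 1.

    Using |h| assumes the receiver has already removed the fading phase.
    """
    sigma = math.sqrt(ebno_db_to_noise_var(ebno_db))
    out = []
    for s in symbols:
        h = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
        out.append(abs(h) * s + rng.gauss(0.0, sigma))
    return out

def bpsk_ber(channel, ebno_db, n_bits=20000, seed=0):
    """Monte Carlo bit error rate of hard-decision BPSK over the given channel."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    tx = [1.0 if b else -1.0 for b in bits]      # BPSK mapping: 0 -> -1, 1 -> +1
    rx = channel(tx, ebno_db, rng)
    errors = sum((r > 0) != bool(b) for r, b in zip(rx, bits))
    return errors / n_bits
```

Calling `bpsk_ber(awgn_channel, 10.0)` and `bpsk_ber(rayleigh_channel, 10.0)` gives a quick comparison of the two channel models at the same $E_b/N_o$; under these assumptions the fading channel yields the higher error rate.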

C. THE VAE AS A WIRELESS SYSTEM MODEL
The proposed VAE model learns the features of the noise, multipath, line-of-sight, and non-line-of-sight effects using a directed probabilistic graph model (DPGM), as in Fig. 5, where $z$ represents the LRVs that are used to infer the signal features from the packet hot vector (PHV). Using this method, the relation between the transmitted and received signal patterns can be represented by the inferred LRVs. Inspired by semantic-level communication and the VAE, our work uses variational inference for generative modelling; however, we reinterpret variational inference from a new perspective. Generative modelling refers to the process of generating valid samples from $p(x)$. Fig. 5 shows our generative model. In this work, the samples of $x$ are generated from a latent variable $z$, and $\theta$ represents the associated parameters, while the solid lines denote the generative model $p_\theta(z)\, p_\theta(x \mid z)$. For example, to generate valid samples of $x$, we first sample $z$ and then use $z$ and $\theta$ to generate $x$. The dashed lines represent the inference procedure with a variational approximation of the intractable posterior $p_\theta(z \mid x)$. Moreover, we apply DL with a stochastic optimization-based technique to approximate the inference $p(z \mid x)$, with an appropriate prior $p(z)$, using an encoder network $q_\phi(z \mid x)$. The decoder network $p_\theta(x \mid z)$ then computes the reconstruction $\hat{x}$ of the message $x$, which is learned during the training phase. Given a neural network model with sufficient learning capability and a good prior distribution $p(z)$, this high-capacity model will approximate the posterior such that $q_\phi(z \mid x) \approx p_\theta(z \mid x)$.
Since this model is structured as an encoder-decoder, the technique is known as autoencoding variational Bayes (AVB), where the marginal log-likelihood $\log p_\theta(x)$ of the datapoint $x \in X$ under an encoding function $q_\phi(\cdot)$ can be computed as in [51]:

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big) + \mathcal{L}_{\theta,\phi}(x). \tag{9}$$

The first term in (9) is the Kullback-Leibler (KL) divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x)$.
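The second term in (9) is called the evidence lower bound (ELBO):

$$\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big], \tag{10}$$

and

$$\mathcal{L}_{\theta,\phi}(x) = \log p_\theta(x) - D_{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big) \le \log p_\theta(x). \tag{11}$$

We have to maximize $\mathcal{L}_{\theta,\phi}(x)$, which minimizes $D_{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$, in order to maximize the penalized likelihood of the reconstruction of $x$ from $z$ using:

$$\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big). \tag{12}$$

Moreover, since backpropagation through a random operation is not possible in the training stage, we use the reparameterization trick to move the random sampling operation to an auxiliary variable $\epsilon$ that is shifted by the mean $\mu_i$ and scaled by the standard deviation $\sigma_i$, which represent the distribution that the network is trying to learn, as in Fig. 6. This allows backpropagation through the deterministic nodes $f$ and $z$. The idea here is that sampling from $\mathcal{N}(\mu_i, \sigma_i^2)$ is the same as computing $\mu_i + \epsilon\,\sigma_i$, where $\epsilon \sim \mathcal{N}(0, 1)$.

Next, we describe the architecture of the VAE in the proposed E2E wireless communication system shown in Fig. 4 and compare this transformation with the simple wireless system shown in Fig. 3.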
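The reparameterization trick and the KL penalty of the ELBO can be sketched as follows. This is an illustrative sketch only: it assumes a diagonal Gaussian encoder output and a standard normal prior $p(z) = \mathcal{N}(0, I)$, for which the KL term has a closed form. The function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z_i = mu_i + eps_i * sigma_i with eps_i ~ N(0, 1).

    The randomness lives only in eps, so mu and sigma remain
    deterministic inputs that gradients could flow through.
    """
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), assuming sigma_i > 0."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def negative_elbo(reconstruction_nll, mu, sigma):
    """Loss = reconstruction negative log-likelihood + KL penalty."""
    return reconstruction_nll + kl_to_standard_normal(mu, sigma)
```

For two LRVs, `reparameterize([0.2, -1.0], [0.5, 0.1], random.Random(0))` returns one sample of $z$, and `kl_to_standard_normal` is zero exactly when the encoder output matches the prior.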

1) VAE INPUT
The hot vector in [3] can be replaced with a new concept known as the PHV, in the same way that [19] represents the constellation of a symbol. However, in this work, we present the symbol as a packet of ones and zeroes, where the inputs $s_0$ and $s_1$ to the transmitter are encoded as a one-PHV $\mathbf{1}_s \in \mathbb{R}^M$. The sent binary phase shift keying (BPSK) message $s_0$ is represented by a packet of $B$ bits. This packet consists of $K$ sub-packets, where each sub-packet $k_i$, $i \in \{1, \dots, K\}$, contains $b$ bits, so that the total length of the PHV is $1 \times bK$. Let the space of possible messages be $M = 2^{bK}$, where $bK$ is the number of bits needed to represent each message $m$. The transmitted input message is then $s_t \in \{1, \dots, m, \dots, M\}$, where $M$ is the size of the space of possible messages, as in Fig. 7.
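To make the dimensions concrete, the sketch below maps a message index to a packet of $K$ sub-packets with $b$ bits each. The exact bit layout of the PHV is not specified in this section, so plain binary coding of $m - 1$ is assumed purely for illustration; the function name is hypothetical.

```python
def message_to_packet(m, b, K):
    """Map message index m in {1, ..., M}, M = 2^(bK), to K sub-packets of b bits.

    Assumption: the packet is the plain binary representation of m - 1,
    most significant bit first. The actual PHV layout may differ.
    """
    n_bits = b * K
    M = 2 ** n_bits
    if not 1 <= m <= M:
        raise ValueError("message index out of range")
    bits = [((m - 1) >> i) & 1 for i in reversed(range(n_bits))]
    return [bits[k * b:(k + 1) * b] for k in range(K)]
```

With $b = 2$ and $K = 2$, there are $M = 16$ possible messages and each packet has $bK = 4$ bits.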

2) VAE ENCODER
Each PHV $x$ fed into the input layer is transformed by $f: \mathbb{R}^{1 \times bK} \to \mathbb{R}^{1 \times c}$, where $c$ is the dimension of the last layer in the encoder. As shown in Fig. 4, the encoder includes two-dimensional convolution (2DConv) layers, each of which is configured with several filters of size $h \times h$ (height and width). The features output by the two layers are mapped to $\nu_1$ and $\nu_2$ filters, respectively. The filter shifts by a stride of $\varsigma$ at each convolution step, while the padding size $\wp$ is chosen so that the output size equals the input size $k$, using $\text{size(2DConv)} = \frac{k - h + 2\wp}{\varsigma} + 1$. A rectified linear unit (ReLU) layer is used after each 2DConv layer to eliminate any negative output values. A final fully connected (FC) layer with dimension $1 \times 2c$ is added to the encoder. The output of the FC layer is divided into two sets, $\mu_z = [\mu_1, \dots, \mu_c]$ and $\sigma_z = [\sigma_{1+c}, \dots, \sigma_{2c}]$, which represent the parameters of the latent variables' distributions (the expectation and the variance, respectively).
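The output-size relation above can be checked numerically. The following sketch assumes square inputs and filters and integer division in the size formula; the helper names are hypothetical.

```python
def conv_output_size(k, h, pad, stride):
    """Output size of a convolution: (k - h + 2*pad) / stride + 1."""
    return (k - h + 2 * pad) // stride + 1

def same_padding(k, h, stride=1):
    """Padding that keeps the output size equal to the input size k.

    Solves (k - h + 2*pad) / stride + 1 = k for pad and raises an
    error if no symmetric integer padding exists.
    """
    numerator = (k - 1) * stride - k + h
    if numerator < 0 or numerator % 2 != 0:
        raise ValueError("no symmetric integer padding for these sizes")
    return numerator // 2
```

For example, a $3 \times 3$ filter with a stride of 1 needs a padding of 1 to preserve an input of size 8.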
The transformation can be formulated using the DNN parameters $\theta_T$:

$$y_n = f(x_n; \theta_T), \tag{13}$$

where $x_n \in X$ is the input data point and $y_n$ is the output of the FC layer, which is in decimal format. After this, the decimal FC output values are passed to the physical decimal-to-binary converter (DCB) component, which starts sending the LRVs' distribution parameters over the physical layer.

3) PHYSICAL MEDIUM
In this paper, our approach is to address the practical aspects of implementing an E2E system, including the realization of the physical wireless transmitting and receiving components, such as the digitization of the $\mu$ and $\sigma$ values for each LRV, the modulator, the demodulator, and the AWGN channel:
• Decimal-coded binary (DCB) and binary-coded decimal (BCD) converters: In the DCB component, the integer part of the received decimal value is represented by $b_y$ bits, and the same number of bits is used for the fractional part. In addition, an extra sign bit is added as the most significant bit (MSB), so that $2 b_y + 1$ is the final length of the bit code that the modulator receives. After signal demodulation, the BCD converts the decoded bits back into the integer and fractional parts and combines them using a fixed-point radix to retrieve the decimal value. This method eliminates any digitization error in the $y$ values when the length $b_y$ satisfies the required number of significant figures for the precision $sf$.
• BPSK modulation and demodulation components: The BPSK is used to modulate the output of the DCB using a standard modulation, whereas the demodulated output is used to feed the BCD input.
• AWGN noise channel: the physical AWGN noise is distributed as $\mathcal{CN}(0, \xi)$, where $\xi$ is the fixed standard deviation value that contaminates the amplitude and phase of the received signal. The purpose of this is to represent the posterior of every parameter in all weight tensors from each layer of the deep networks.
The physical wireless medium has the dimension $\mathbb{R}^{1 \times c}$, where $c$ is the number of channels that the proposed communication system uses to send one message out of the $2^{bK}$ messages. The E2E rate of this communication is $r_{E2E} = bK/c$ [bits/channel use]. However, over the physical wireless medium, the rate of the physical transmission is $r_{PH} = 2 b_y + 1$ [bits/channel use]. This leads to the compression rate (CR), which relates $r_{E2E}$ to $r_{PH}$. The channel noise is AWGN due to the assumption that the main source of the noise is on the receiver side [3]. The channel uses a fixed variance $\xi^2 = (E_b/N_o)^{-1}$ and is characterized by the distribution $\mathcal{N}(0, \xi^2 I)$, where $E_b/N_o$ is the ratio of the energy per bit $E_b$ to the noise power spectral density $N_o$ that contaminates the desired signal at the receiver after the values are converted from binary to decimal by the BCD.
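The DCB and BCD converters and the two rates can be sketched as follows. This is a hedged illustration, not the converter used in the experiments: it assumes a binary fixed-point code with one sign bit, $b_y$ integer bits, and $b_y$ fractional bits, with saturation on overflow. The function names are hypothetical.

```python
def dcb_encode(y, b_y):
    """Encode a decimal value as [sign] + b_y integer bits + b_y fractional bits."""
    sign = 1 if y < 0 else 0
    scaled = round(abs(y) * (1 << b_y))          # shift the fraction into integers
    scaled = min(scaled, (1 << (2 * b_y)) - 1)   # saturate instead of overflowing
    bits = [(scaled >> i) & 1 for i in reversed(range(2 * b_y))]
    return [sign] + bits

def bcd_decode(bits, b_y):
    """Invert dcb_encode: rebuild the fixed-point value and apply the sign."""
    scaled = 0
    for bit in bits[1:]:
        scaled = (scaled << 1) | bit
    value = scaled / (1 << b_y)
    return -value if bits[0] else value

def e2e_rate(b, K, c):
    """r_E2E = bK / c in bits per channel use."""
    return b * K / c

def physical_rate(b_y):
    """r_PH = 2*b_y + 1 bits per channel use (sign + integer + fraction)."""
    return 2 * b_y + 1
```

With $b_y = 4$, the value $-2.75$ is sent as nine bits and recovered exactly, while a value such as $1.3$ is recovered to within half of the fixed-point step $2^{-b_y}$.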

4) VAE DECODER
The BCD output of the physical medium represents the contaminated decimal values of the LRVs' expectation and variance parameter vectors, $\tilde{\mu}_z$ and $\tilde{\sigma}_z$, respectively. In this paper, we propose to use the sampling layer inside the receiver to realize a practical architecture of the E2E wireless system. The dimension of the sampling input layer is equal to that of the last encoder output layer, $1 \times 2c$. For the following layers, the reparameterization trick is necessary to allow the VAE to perform backpropagation in the training phase and to sample $z$ as shown in Fig. 6. It is formulated using $\epsilon \sim [\mathcal{N}_1(0, 1), \dots, \mathcal{N}_c(0, 1)]$ as:

$$\tilde{z} = \tilde{\mu}_z + \epsilon \odot \tilde{\sigma}_z. \tag{15}$$
At high $E_b/N_o$ values, $\tilde{z} = z$, because any contamination of the $z$ values due to the AWGN channel is eliminated:

$$\lim_{E_b/N_o \to \infty} \tilde{z} = z. \tag{16}$$

This is the input of the decoder, which is transformed back by $f^{-1}: \mathbb{R}^{1 \times c} \to \mathbb{R}^{bK}$ to reconstruct the input symbol $s$ as $\hat{s}$. The transformation can be formulated using the DNN parameters $\theta_R$:

$$\hat{x}_n = f^{-1}(\tilde{z}_n; \theta_R). \tag{17}$$

The DNN consists of one input layer and three transposed 2DConv layers, each followed by a ReLU layer that eliminates the negative values at its output. Lastly, a 2DConv layer is used to reconstruct the transmitted image.

D. SIMPLE DNN IMAGE CLASSIFIER
To classify the final reconstructed symbol $\hat{x} \to \hat{s}_d \in \{1, \dots, m, \dots, M\}$, a simple DNN classifier is used. Fig. 4 shows the architecture of the classifier block, which uses convolution, batch normalization, ReLU, and max-pooling layers to extract the features of $\hat{x}$. The classifier output layer learns the final message $\hat{s}_d$ from the output of the preceding fully connected and softmax layers, whose output size is the number of possible messages $M$.

E. PROPOSED NUMERICAL PERFORMANCE MEASUREMENT METHODS FOR THE NEW E2E WIRELESS SYSTEM
To measure the performance of the proposed E2E VAE wireless system, we suggest the following methods:
• BER_E2E definition: This is the ratio of bit errors in the PHVs transmitted from the transmitter to the receiver, computed over the $N$ transmitted PHVs (symbols), where $x_i$ and $\hat{x}_i$, $i \in \{1, \dots, bK\}$, are the bits produced by converting $x$ and $\hat{x}$ from decimal to binary, respectively.
• BER_PH definition: This is the ratio of bit errors in the transmitted LRV values between the DCB and BCD components.
• MER definition: This is the ratio of the wrongly classified messages at the receiver to the transmitted ones.
It is important to mention that the SER is the analogue of the proposed MER measurement in classical wireless communication. More discussion regarding this point can be found in Section IV. The most important of the three methods is the MER, because it directly reflects how many of the transmitted messages are finally received correctly, which is the ultimate goal of the proposed system.
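The three measurements can be computed along the following lines. This is a minimal sketch that assumes the bit error ratio is normalized by the total number of compared bits; the same helper can serve for BER_E2E (PHV bits) and BER_PH (DCB and BCD bits). The function names are hypothetical.

```python
def bit_error_rate(sent, received):
    """Fraction of differing bits between two lists of equal-length bit vectors."""
    total = sum(len(s) for s in sent)
    errors = sum(a != b
                 for s, r in zip(sent, received)
                 for a, b in zip(s, r))
    return errors / total

def message_error_rate(sent_msgs, decoded_msgs):
    """Fraction of messages that the receiver classified incorrectly."""
    wrong = sum(s != d for s, d in zip(sent_msgs, decoded_msgs))
    return wrong / len(sent_msgs)
```

For two 4-bit PHVs with a single flipped bit, `bit_error_rate` returns 0.125, while one wrong message out of four gives an MER of 0.25.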

III. EXPERIMENT SETUP, E2E WIRELESS SYSTEM TRAINING AND SIMULATION

A. EXPERIMENT SETUP
The main parameters for the VAE, the classifier, and the physical wireless component layers are summarized in Table 2.

B. CLASSIFIER TRAINING
The classifier is trained with stochastic gradient descent with momentum (SGDM) on PHVs contaminated by AWGN at $E_b/N_o = 0$ dB, so that it produces the final retrieved message. Plain stochastic gradient descent can oscillate along the path of steepest descent towards the optimum. Adding a momentum term with contribution factor ϒ to the parameter update is one way to reduce this oscillation, as in (15). Algorithm 1 describes the classifier training process [52].
In this update, $\theta^c_l$ is the vector of weight and bias parameters of the DNN classifier at iteration $l$, $\eta^c$ is the learning rate, $L(\theta^c_l)$ is the loss function, and $\nabla L(\theta^c_l)$ is the gradient of the loss function evaluated on the training data.
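As an illustration of the momentum update described above, the following minimal sketch assumes the widely used form $\theta_{l+1} = \theta_l - \eta \nabla L(\theta_l) + ϒ(\theta_l - \theta_{l-1})$. The paper's own update equation is not reproduced here, and the quadratic loss, the learning rate of 0.1 and the momentum factor of 0.9 are illustrative assumptions.

```python
import numpy as np

def sgdm_step(theta, theta_prev, grad, lr, momentum):
    """One update: theta - lr * grad + momentum * (theta - theta_prev)."""
    return theta - lr * grad + momentum * (theta - theta_prev)

def minimise_quadratic(lr=0.1, momentum=0.9, steps=300):
    """Minimise the toy loss L(theta) = 0.5 * theta^T A theta (an assumption)."""
    A = np.diag([1.0, 10.0])          # poorly conditioned on purpose
    theta_prev = np.array([1.0, 1.0])
    theta = theta_prev.copy()
    for _ in range(steps):
        grad = A @ theta              # exact gradient of the toy loss
        theta, theta_prev = sgdm_step(theta, theta_prev, grad, lr, momentum), theta
    return theta

theta_final = minimise_quadratic()
```

The toy loss has very different curvature along its two axes, which is the situation where the momentum term is expected to damp oscillation along the steep direction.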

C. THE VAE TRAINING
VAE training aims to reconstruct the sent PHV from a meaningful continuous space spanned by the ranges of the LRVs $z$, using the ELBO as the training objective.

Unlike the existing references, the LRVs' inferred parameters are contaminated at the level of the transmitted binary bits (not decimal values) while they propagate through the wireless channel, to imitate the practical aspects of the experiment. In addition, the sampling layer has been moved to the receiver side to produce the contaminated LRVs' $z$ values from the received contaminated inferred parameters. The optimization algorithm used for the DL network weights is adaptive moment estimation (Adam) with an added momentum term. It keeps an element-wise moving average of both the parameter gradients and their squared values [53]. The VAE training was done at a fixed $E_b/N_o = 7$ dB with a learning rate of 0.001 and a batch size of 64. More details about the training set are given in Section IV. Algorithm 2 describes the proposed VAE training process.

Algorithm 1: Classifier training
3: for each Itr do
4:   The input layer passes the PHV values to the 2DConv layer.
5:   The 2DConv layer produces the first feature map.
6:   The output of the 2DConv layer passes through batch normalization to speed up training and reduce the sensitivity to network initialization, and then through the ReLU layer to remove any negative values.
7:   To reduce the spatial size of the feature map and remove redundant spatial information, the ReLU output is down-sampled by the max-pooling layer.
8:   Repeat steps 5 to 7 to fine-tune the detection of the important features in the message. (The gradient threshold = $+\infty$)
9:   Apply the SGDM algorithm to optimize $\theta^c$ as in (21), using the initial parameters $\eta^c$ (learning rate) and ϒ (momentum contribution factor), to get the gradient $g_{Itr}$: $g_{Itr} \leftarrow \nabla_{\theta^c} L(x, \hat{x}, \theta^c)$
10:  Use $g_{Itr}$ to update $\theta^c$ according to [52].

Algorithm 2: VAE training
    $N_o$: noise sample $\sim \mathcal{N}(0, \xi^2)$
    $\theta^V$: DNN weights and biases matrix for $\theta^T$ and $\theta^R$
    $n$: number of re-sampled PHVs at the receiver
2: for each Epo do
3:   for each Itr do
4:     Use $x$ as the input layer to produce $y = f(x, \theta^T)$.
5:     Use $y$ in the physical wireless layer to produce $w = \tau(y)$.
6:     Use $w$ as the sampling layer input to produce the LRVs' $z$ values using (15).
7:     Use the sampling layer output to reconstruct the PHV.
8:     Apply (25) to find the ELBO.
9:     Apply the Adam optimization algorithm to optimize $\theta^V$, using the initial parameters $\eta^V$ (learning rate), $\lambda_1$ and $\lambda_2$ (exponential decay rates for the 1st and 2nd moment estimates, respectively) and $\varepsilon^V$ (a small constant for numerical stability), to get the gradient $g_{Itr}$: $g_{Itr} \leftarrow \nabla_{\theta^V} L(x, \hat{x}, \theta^V)$
10:    Use $g_{Itr}$ to update $\theta^V_{Itr}$ according to [53].
11:  end for
12: end for
13: Output: Return the up-to-date $\theta^V$ and save the DNN "VAE-Wireless" model.
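The sampling and ELBO steps of the VAE training procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the standard reparameterisation $z = \mu_z + \sigma_z \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, a standard normal prior, and a squared-error reconstruction term. The paper's own sampling and ELBO expressions are not reproduced here, and all function names are hypothetical.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterised draw: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma), axis=-1)

def negative_elbo(x, x_hat, mu, sigma):
    """Squared-error reconstruction term plus the KL term (one common choice)."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return recon + kl_to_standard_normal(mu, sigma)

# Two LRVs, as in the proposed system; the parameter values are assumed.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.2])
z = sample_latent(np.tile(mu, (100_000, 1)), np.tile(sigma, (100_000, 1)), rng)
```

Because the draw is a deterministic function of the received parameters and an independent noise term, it can be placed at the receiver, after the channel, without changing how gradients flow to the parameters.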

D. E2E WIRELESS SYSTEM SIMULATION REALIZATION
Once both the Classifier and VAE-Wireless models have been trained, the two models are cascaded as in Fig. 4, and the real data transmission starts. In this experiment, $10^6$ PHVs are sent from the transmitter through the VAE encoder, the physical wireless component layer and the VAE decoder, and finally pass through the classifier, for each $E_b/N_o$ value under observation. The proposed system uses a novel method that re-samples the retrieved message $N$ times, using parallel computing techniques and hardware such as graphics processing units (GPUs), and then takes the mode (the value with the highest count) of the $N$ re-sampled messages: $\hat{s}_d = \mathrm{mode}\,(\hat{s}^n_d)_{n=1}^{N}$. Algorithm 3 describes the realization process.
Algorithm 3: E2E wireless system realization
3:   for each Exp do
4:     Use $x$ as the input layer to produce $y = f(x, \theta^T)$, where $\theta^T \in$ "VAE-Wireless".
5:     Use $y$ in the physical wireless layer to produce $w = \tau(y)$.
6:     for each $n$ do
7:       Use $w$ as the sampling layer input to produce the LRVs' $z$ values using (15).
8:       Use the sampling layer output to reconstruct the PHV.
9:       Use $\hat{x}$ as input for the "Classifier-PHV" model.
10:      Provide the final class of the received message $\hat{x} \Rightarrow \hat{s}_d \in \{1, \ldots, m, \ldots, M\}$.
11:    end for
12:    Find $\hat{s}_d = \mathrm{mode}\,(\hat{s}^n_d)_{n=1}^{N}$ for the $N$ re-sampled messages.
13:    Compare $\hat{s}_d$ to $s_d$.
14:  end for
15:  Calculate the MER at the specific $E_b/N_o$.
16: end for
17: Output: Return the MER for all $E_b/N_o$.
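The mode-based decision over $N$ re-samples can be sketched in a few lines. The toy error model below is an assumption made only to exercise the decision rule: each re-sample is taken to be wrong independently with a fixed probability, and class labels run from 0 to $M-1$ rather than 1 to $M$. It does not model the trained VAE or the wireless channel.

```python
import numpy as np

def majority_decision(resampled):
    """Most frequent label among the re-sampled decisions (ties: smallest label)."""
    labels, counts = np.unique(np.asarray(resampled), return_counts=True)
    return int(labels[np.argmax(counts)])

def simulate_mer(true_msg, per_sample_error, n_resamples, n_trials, n_classes, rng):
    """Estimate the MER of the mode decision under the assumed toy error model."""
    errors = 0
    for _ in range(n_trials):
        # Each re-sample is wrong with probability per_sample_error (assumption).
        wrong = rng.random(n_resamples) < per_sample_error
        # A wrong re-sample lands on a uniformly chosen *other* label.
        offsets = rng.integers(1, n_classes, size=n_resamples)
        decisions = np.where(wrong, (true_msg + offsets) % n_classes, true_msg)
        errors += majority_decision(decisions) != true_msg
    return errors / n_trials

rng = np.random.default_rng(1)
mer_single = simulate_mer(3, 0.3, 1, 2000, 16, rng)   # no re-sampling
mer_voted = simulate_mer(3, 0.3, 15, 2000, 16, rng)   # mode over 15 re-samples
```

Under this independence assumption the voted decision errs far less often than a single draw. How much of that gain carries over to the real system depends on how correlated the re-samples are, which the sketch does not address.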

IV. NUMERICAL RESULTS
In this section, a series of experiments is carried out to evaluate the performance of the proposed approach under various scenarios and to compare it with several benchmarks. In particular, we consider QPSK modulation in AWGN, and BPSK modulation under the effects of AWGN, fading, shadowing and Doppler. We compare our results with the conventional, expert-designed QPSK and BPSK modulation schemes [49].
We start this section with the training process, using BPSK modulation in AWGN with the parameter settings recommended for Adam [53]. To begin with, we fix the learning rate at 0.001 and increase the batch size from 32 to 128. The simulation results in Fig. 8 show that all the curves follow a similar trend, but the curve for batch size = 64 is smoother and more stable than the others. This is due to the effect of underfitting and overfitting the data while calculating the loss function at the training stage [54]. As a result, we choose a batch size of 64 in our training process.
Next, it is important to choose appropriate learning parameters. The parameters are adjusted by observing the SER values shown in Fig. 8 and Fig. 9. In this case, we fix the batch size at 64 and increase the learning rate from 0.0001 to 0.01. The lowest SER is obtained with a learning rate of 0.001, which is the value used in our training procedure. The results for a learning rate of 0.0001 show a deterioration in SER because the search for the optimal solution requires more iterations than were used (in this work, 300 iterations/epoch × 50 epochs = 15,000 iterations).
On the other hand, a learning rate of 0.01 produces results that fall between the other choices: it makes good use of the available iterations but resolves the loss function less finely [55]. Similarly, with a fixed learning rate and different batch sizes, we observe a similar trend in SER. As $E_b/N_o$ increases, the SER consistently decreases.
Having established the feasible learning parameters, we simulate the performance of the proposed algorithm as follows:

A. BPSK CHANNEL
The numerically computed SER values versus $E_b/N_o$ up to 9 dB with BPSK modulation in AWGN are depicted in Fig. 10. The proposed VAE with two LRVs is capable of reconstructing the transmitted message by sending only the LRVs' parameters ($\mu_z$, $\sigma_z$), and the MER (in our work, MER = SER) decreases as $E_b/N_o$ increases, as the green curve shows. Because state-of-the-art AE and VAE articles assume that the encoder outputs decimal values only, we convert the encoder output to binary and use a Hamming code to add protection and correction to the transmitted binary values, adding two bits for each bit transmitted over the physical layer. When the numerical SER of the VAE is compared with the theoretical Hamming (3,1) code decoded by the hard-decision method, the proposed VAE outperforms Hamming (3,1), as shown in Fig. 10. Furthermore, although Hamming (3,1) decoded by the soft-decision method performs better up to $E_b/N_o = 2$ dB, the VAE outperforms this scheme for $E_b/N_o > 2$ dB. From this result, we observe that at low $E_b/N_o$ the VAE cannot outperform the optimal soft-decision scheme because it does not learn the distribution of the LRVs properly, which is one of our research findings. Moreover, the proposed VAE outperforms the hard-decision-decoded Golay scheme with a nearly constant (parallel) gap averaging 0.5 dB.

Comparing the SER of the VAE with that of the classical AE [3], the dashed curves in Fig. 11 show that, for the same number of channels used to transmit the encoder outputs (AE (1,4), in brown), the proposed VAE outperforms the AE scheme. However, as the number of AE channels increases, the performance gap decreases, as seen by comparing the blue dashed curve (AE (7,4)) with the amber curve (VAE with 2 LRVs). This means that the VAE uses fewer channels than the classical AE to achieve the same numerical SER performance.

Fig. 12 shows a similar comparison between the classical AE [3] and the proposed VAE for a higher-order modulation scheme, specifically quadrature phase shift keying (QPSK), under the AWGN channel. This result shows that the proposed VAE achieves better performance than the classical AE with both modulations (BPSK and QPSK). Note that even the QPSK VAE still performs better than the AE (7,4) of [3] at low $E_b/N_o$.
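For orientation, the theoretical hard-decision baseline mentioned in this subsection can be sketched numerically. The sketch assumes that Hamming (3,1) is treated as the rate-1/3 triple-repetition code with majority-vote decoding, that coherent BPSK is used over AWGN, and that the energy per information bit is split evenly across the three coded bits. Whether the paper normalises energy in this way is not stated, so the numbers are indicative only.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bpsk_awgn_ber(ebno_db):
    """Uncoded coherent BPSK over AWGN: Q(sqrt(2 * Eb/N0))."""
    ebno = 10.0 ** (ebno_db / 10.0)
    return q_function(math.sqrt(2.0 * ebno))

def repetition31_hard_ber(ebno_db):
    """(3,1) repetition code with hard decisions and majority voting.

    Each coded bit carries Eb/3 (assumed normalisation); decoding fails
    when at least two of the three coded bits are flipped.
    """
    ebno = 10.0 ** (ebno_db / 10.0)
    p = q_function(math.sqrt(2.0 * ebno / 3.0))
    return 3.0 * p**2 * (1.0 - p) + p**3

# Tabulate both curves over an assumed 0 to 9 dB range.
curve = [(snr, bpsk_awgn_ber(snr), repetition31_hard_ber(snr)) for snr in range(0, 10)]
```

Under this normalisation the hard-decision repetition code does not beat uncoded BPSK at equal $E_b/N_o$, which is a useful sanity check when reading comparisons against such a baseline.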

C. RAYLEIGH FADING CHANNEL
The numerically computed SER values versus $E_b/N_o \in [0, 20]$ dB with Rayleigh fading are depicted in Fig. 13. The proposed VAE with two LRVs is capable of reconstructing the transmitted message by sending only the LRVs' parameters ($\mu_z$, $\sigma_z$), and the SER decreases as $E_b/N_o$ increases. As in the BPSK case, when the numerical SER of the VAE is compared with the theoretical Rayleigh curve [49], the proposed VAE under Rayleigh fading outperforms the theoretical one.

The proposed VAE under Rician fading performs better as the value of $k$ increases, until it approaches the AWGN performance.

Fig. 15 compares the SER of the proposed BPSK VAE with that of the VAE under shadowing, as a function of the σ of the lognormal fading, for different values of $E_b/N_o$. The figure shows that, in the presence of shadowing, increasing $E_b/N_o$ decreases the SER.

Fig. 16 presents the proposed VAE for varying Doppler shift values in the non-stationary case. The simulated results show that the SER increases as the Doppler shift increases, assuming that the transmitter and receiver both move along the same axis with phase offsets of 5° and 45°. This demonstrates that increasing mobility raises the SER compared with the stationary scenario.
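The theoretical Rayleigh reference used as a benchmark in this subsection is commonly taken to be the closed-form average error rate of coherent BPSK over flat Rayleigh fading, $P_b = \frac{1}{2}\left(1 - \sqrt{\bar{\gamma}/(1+\bar{\gamma})}\right)$, where $\bar{\gamma}$ is the average $E_b/N_o$. The sketch below evaluates that textbook expression next to the AWGN curve. It assumes this is the form intended by [49], which the text does not state explicitly.

```python
import math

def bpsk_rayleigh_ber(ebno_db):
    """Average BER of coherent BPSK over flat Rayleigh fading."""
    g = 10.0 ** (ebno_db / 10.0)           # average Eb/N0, linear scale
    return 0.5 * (1.0 - math.sqrt(g / (1.0 + g)))

def bpsk_awgn_ber(ebno_db):
    """BER of coherent BPSK over AWGN: 0.5 * erfc(sqrt(Eb/N0))."""
    g = 10.0 ** (ebno_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(g))

# Evaluate both references over the 0 to 20 dB range used in this subsection.
reference = {snr: (bpsk_rayleigh_ber(snr), bpsk_awgn_ber(snr)) for snr in range(0, 21, 5)}
```

At high $E_b/N_o$ the Rayleigh expression decays only as roughly $1/(4\bar{\gamma})$, whereas the AWGN curve falls off exponentially. This is why fading curves are plotted over a wider $E_b/N_o$ range than the AWGN ones.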

G. FURTHER RESULTS DISCUSSION
We extended our experiment to investigate shadowing and Rayleigh and Rician fading channels in addition to AWGN. Moreover, QPSK modulation under AWGN was used to investigate the possibility of applying higher-order modulation schemes, which provides promising insight. However, because this work focuses on proving the proposed concept, namely that short packets can be transmitted through a wireless E2E VAE-based system, the SER performance for 64PSK and 128PSK is left for further work. Furthermore, the Doppler effect was added to the experiment to show that the proposed design has potential for non-stationary scenarios. A study of more varying channel parameters is required to overcome the limitation of this experiment with respect to transmitter-receiver mobility.

V. CONCLUSION
This paper introduced a novel approach that uses the VAE as a probabilistic model to reconstruct the transmitted symbol by transmitting the statistical parameters of the LRVs through the physical layer, instead of sending the data bits of the original symbol from the transmitter. We showed significantly improved PHV or SER performance compared with the baseline Hamming code with hard-decision decoding and with the classical AE E2E system, where increasing $E_b/N_o$ improves the SER of the proposed system relative to the baseline schemes. In addition, the proposed VAE shows promising channel utilization efficiency in comparison with the classical AE: the results show that the VAE with two channels (BPSK and QPSK) under AWGN outperforms the classical AE schemes with 4 and 7 channels. The performance of the proposed approach in the presence of fading (Rayleigh, Rician and shadowing) is also promising, since the results show the performance approaching the BPSK VAE SER. Furthermore, other cases such as the Doppler effect have been simulated and discussed, showing that the proposed model can be generalized to the case in which the LRVs' parameters are transmitted rather than the original bits. Our findings illustrate the importance of using the VAE approach and may inspire other researchers to use a similar approach for future communication systems. Nevertheless, because we concentrated on the proof of the proposed concept, our work has some limitations, and further work can be conducted to find the SER performance for 64PSK and 128PSK. In addition, this paper probes the application of the proposed design to a non-stationary case. Further investigation is required for both high mobility and higher-order modulation schemes to find how such limitations can be overcome.