Symbol Detection and Channel Estimation for Space Optical Communications Using Neural Network and Autoencoder

Optical wireless communications in space are degraded by atmospheric turbulence, light attenuation, and detector noise. In this paper, we develop a neural network (NN) channel estimator that is optimized across a wide range of signal-to-noise ratio levels during the training stage. In addition, we propose a novel autoencoder (AE) model to build a complete physical layer communication system for space optical communications (SOC). The AE is designed to work with both perfect and imperfect channel state information (CSI), providing a flexible and versatile solution for SOC. Batch normalization and multiple decoders are incorporated into the proposed AE, which improves receiver learning capabilities by allowing more than one path to update encoder and decoder weights. This novel approach can reduce the detection error relative to state-of-the-art models. Using the System Tool Kit simulator, we examine our system's performance in a downlink SOC channel that connects a geostationary satellite to a ground station over a Log-normal fading channel. Furthermore, we evaluate the performance of our system in a downlink channel that connects a Low Earth Orbit satellite to a ground station over a Gamma-Gamma fading channel. The numerical results show that the proposed NN channel estimator is superior to state-of-the-art learning-based frameworks and achieves the same level of performance as the minimum mean square error estimator. Additionally, with no fading and for both perfect and imperfect CSI with different code rates and fading channels, the proposed AE-based detection outperforms both benchmark learning frameworks and the most popular convolutional codes.


I. INTRODUCTION
A. BACKGROUND
Wireless communication has turned out to be a necessity for our day-to-day activities. When transmitting data, most current communication strategies rely on radio frequency (RF) technologies. Bandwidth scarcity is a serious concern due to the restricted RF spectrum and the ever-increasing demand for wireless data. Accordingly, it is essential to also take into consideration higher frequency spectrums, such as the optical spectrum, for wireless communication. Optical wireless communications (OWC) and space optical communications (SOC) offer several benefits over their RF counterparts, including lower transmission power, license-free spectrum, higher throughput, and cost-effective installation [1].
Unlike typical OWC, the signal in SOC transmission travels over very long distances. Large information bandwidth, low transmitted power, improved directionality, and immunity to jamming are the obvious benefits of SOC. SOC has been widely considered by many space agencies worldwide in a variety of practical applications [1], [2], [3]. One of the most common application scenarios for SOC is communication between a geostationary Earth orbit (GEO) satellite and ground stations [2]. The Mars laser communication demonstration establishes laser connectivity between Earth and Mars at a rate of 10 Mbps [3]. Two-way optical communication between a high-altitude aircraft and a GEO satellite was demonstrated for the first time using an airborne laser optical link [4]. The Laser Communications Relay Demonstration, conducted by NASA, served as a practical example of laser satellite missions and demonstrated the feasibility of using optical relay services for communication missions in near-Earth and deep-space environments [2].
Both SOC and OWC utilize lasers as optical transmitters. Specifically for SOC, the receiving telescope plays a vital role. It incorporates a photo-detector (PD) for direct detection, facilitating precise focusing and collection of the light signal before it is directed to the PD, as observed in applications such as SOC and astronomical observations [5], [6]. Unlike OWC, SOC signals must travel long distances, requiring innovative laser transmitters to facilitate long-range OWC connections. These laser transmitters must exhibit high photon efficiency and peak power capability to achieve adequate bit error rate (BER) performance for the downlink SOC channel [2], [5]. Additionally, narrow line-width, high beam quality, and low modulation rates are essential for SOC's downlink lasers.
Intensity modulation direct detection (IM/DD) is considered an appropriate modulation technique for its ease of use and its ability to eliminate the need for high-order modulation schemes [2]. The intensity modulation is attained by a laser diode that uses the data to control the strength of its light intensity. As a result, the transmitted signal is proportional to the light intensity and follows a non-negativity constraint. When a photo-detector absorbs the light, it outputs a signal whose strength is proportional to the amount of light received and is further degraded by noise and the atmospheric fading channel [7].
On the other hand, laser uplink channels pose specific challenges that are distinct from those of downlink channels. Due to the atmosphere's spatial and temporal fluctuations of its refractive index, a laser uplink from the ground to a satellite is particularly prone to distortion and pointing instability. During satellite-to-ground downlink transmissions, however, the optical beam spreads geometrically due to beam divergence loss, and only a small amount of the spread is caused by variations in beam steering [2]. Additionally, the effect of atmospheric turbulence is generally very small on the downlink propagation, as the beam travels through a non-atmospheric path until it reaches about 30 km from the Earth's surface [2].

B. RELATED STATE-OF-THE-ART
There exists a wide body of work related to OWC in general and SOC in particular. This work can be categorized mainly into the following areas: channel modeling; modulation and coding; channel estimation; and learning-based design leveraging artificial intelligence (AI) methods such as autoencoders (AEs) and/or deep neural networks (DNNs) [8], [9]. Since the current paper contributes to all of these areas, we briefly overview the most notable related state-of-the-art next.

1) CHANNEL MODELING
In [10], the authors integrated a hybrid RF/FSO lunar communications system that employed micro satellites in a Low Earth Orbit (LEO) constellation. In this implementation, the channel modeling for the entire system is performed in the Analytical Graphics System Tool Kit (STK) simulator. Moreover, the STK program provides access to propagation delay, transmission loss, and signal-to-noise ratio (SNR) measurements. Furthermore, the STK program is utilized to configure two ground stations and two satellites for point-to-point communications in order to create an SOC system [11]. The authors in [12] consider utilizing a Log-normal distribution for OWC to accurately represent the atmospheric modeling in the weak turbulence regime. On the other hand, the Gamma-Gamma distribution is more suitable for the strong turbulence regime [6]. The authors in [13] proved that the double Generalized Gamma distribution is an appropriate statistical model to represent the irradiance fluctuations in strong and weak turbulence regimes for OWC. In contrast, laser beam pointing errors arise when the transmitter and receiver are in motion; hence, an accurate acquisition, tracking, and pointing (APT) system is necessary for proper reception of the signal in inter-satellite communication [14]. In the downlink SOC channel, the pointing error can be easily mitigated due to the capability and stability of the ground station [2].

2) MODULATION AND CHANNEL CODING
Coherent communication techniques involving modulation and detection of the amplitude and phase of the optical carrier can be used for SOC. However, incoherent modulation such as IM/DD is preferred due to its simplicity, cost-effectiveness, and ease of implementation [15]. It has been shown that the modulation scheme generated by the AE-based OWC in [16] and [17] has a similar output constellation to IM/DD. On the other hand, to increase the number of accessible modes in limited optical communication systems, the authors in [18] propose fractional modulation of laser spatial modes. To accomplish high-resolution identification of fractional modes, a convolutional NN decoder is specifically used. Narrowing down to channel coding schemes in SOC, convolutional codes have been shown to outperform the Hamming and Bose-Chaudhuri-Hocquenghem (BCH) linear block codes for various code rates while maintaining the same order of complexity [19]. The authors in [16], [17] applied channel coding schemes via deep learning (DL) AEs and achieved performance similar to Hamming codes in OWC. Instead of adding redundant bits as conventional coding schemes do, these works utilize AEs by applying compression at the encoder and expansion at the decoder.

3) CHANNEL ESTIMATION
Attention-based models have emerged as a transformative paradigm in deep learning, making notable inroads into various domains. Particularly in the realm of channel estimation, attention mechanisms have shown the potential to address some challenges in communication systems [20], [21], [22]. The authors in [20] proposed the Channelformer, a neural framework tailored for enhanced orthogonal frequency-division multiplexing (OFDM) channel estimation in downlink scenarios. This model capitalizes on self-attention for input precoding and seamlessly integrates multi-head attention with residual convolution. Alongside this, they incorporated a novel weight pruning technique, driving the architecture towards a leaner, high-performance, low-latency solution. In addition, the authors in [22] put forth a non-local attention methodology explicitly for OFDM channel estimation in a multiple-input multiple-output (MIMO) system. This neural network (NN) centric approach utilizes specific frequency data, paving the way for optimized pilot design and more accurate channel estimation.
Communications systems that rely on least square (LS) channel estimators tend to perform poorly in the low SNR regime [23]. This poor performance is due to the fact that the LS estimation process does not suppress the effect of noise. Compared to LS, the minimum mean square error (MMSE) channel estimator mitigates the noise effect and achieves the optimal performance in terms of mean square error (MSE) [24]. However, MMSE channel estimation requires computing the cross covariance matrix between the received signal and the time-domain channel, thus inducing increased complexity [24]. To address this issue, the authors in [25] proposed a DL-enabled image denoising network to acquire knowledge from a huge set of training data and to compute an estimate of the massive MIMO visible light communication (VLC) channel. Furthermore, it was shown in [26] that a NN with one hidden layer and sigmoid activation functions can be trained to obtain accurate channel state information (CSI) estimates in Log-normal fading. However, the system therein is not practical, as it needs a NN for every training SNR. In [23], the authors propose employing only one NN to rectify the LS estimation error. The results in [23] show that their NN design outperforms the LS estimator while being simpler to implement than [26]. Despite their accurate CSI prediction results, the authors in [23] relied on the unrealistic assumption that all the input samples are already known in advance for the testing phase. This assumption leads to significant delay in the processing of the signal in the wireless communication system. The design of a channel estimator NN should have adequate performance on every code word to fulfill the real-time requirements of 5/6G.
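The LS-versus-MMSE gap described above can be illustrated with a short Monte-Carlo sketch. The snippet below contrasts a pilot-only LS inversion with a linear-MMSE (Wiener) shrinkage estimate in a low-SNR setting; all parameter values are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
x_p = 1.0                    # known pilot symbol (illustrative)
sigma_w = 1.0                # strong noise, i.e., low-SNR regime
h = rng.lognormal(mean=0.0, sigma=0.25, size=N)    # fading gains
y = h * x_p + sigma_w * rng.standard_normal(N)     # received pilots

# LS estimate: inverts the pilot only, so the noise passes through unattenuated.
h_ls = y / x_p

# Linear-MMSE (Wiener) estimate: shrinks toward the prior mean, suppressing noise.
mu_h, var_h = h.mean(), h.var()
g = var_h * x_p / (var_h * x_p**2 + sigma_w**2)
h_lmmse = mu_h + g * (y - mu_h * x_p)

mse_ls = np.mean((h_ls - h) ** 2)        # ~ sigma_w^2 / x_p^2, independent of the prior
mse_lmmse = np.mean((h_lmmse - h) ** 2)  # strictly smaller at low SNR
```

The LS error variance stays near σ_w²/x_p² regardless of the fading statistics, whereas the shrinkage estimate trades a small bias for a large variance reduction.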

4) END-TO-END COMMUNICATION SYSTEMS
AEs are considered as DL NNs where the input and predicted output are identical. The input is transformed into a compressed code referred to as the latent space, using the end-to-end learning concept, which can then be used to reconstruct the input data [27]. In [8], the AE has shown adequate performance compared to uncoded modulations employing a maximum likelihood detector. Their approach considered single and multi-user communications over a fading RF channel.
In [9], the authors have shown that it is feasible to create a point-to-point communications system in which NNs handle all of the physical layer computation. Training a system as an AE is a good approach for any stochastic channel model; nevertheless, substantial effort is needed before the system can be employed for transmission over the air [9]. On the other hand, the Turbo AE is a fully end-to-end cooperatively trained neural encoder and decoder, and its performance under canonical channels is close to that of convolutional codes when using small block lengths [28]. The authors in [27] proposed a Turbo AE with average power constraints instead of the peak intensity constraints required by OWC in general and SOC in particular. In OWC systems, AEs have shown performance comparable to Hamming codes in point-to-point communications [16]. It should be noted that the study in [16] only assumed the presence of an additive white Gaussian noise (AWGN) channel and did not investigate the performance of the AE in fading channels. The authors in [17] expanded the work in [16] and incorporated turbulence channels, resulting in a performance comparable to that of Hamming codes using the MMSE estimator for both perfect and imperfect CSI. The MMSE estimator can be used with AEs, although this strategy would increase system computation complexity [24]. On the other hand, the DL models created in [16], [17], [24] perform worse in terms of BER than convolutional codes.

C. CHALLENGE AND CONTRIBUTION
First, we address the challenges and contributions of the proposed channel estimator NN in SOC, followed by an examination of the aspects associated with the proposed AE in symbol detection. Finally, we discuss the contribution related to merging both the proposed AE and the proposed channel estimator in a single unit. Achieving an adequate MSE while maintaining a low-complexity model in SOC is a challenging task. While several research studies based on learning frameworks achieved good MSE results in OWC, their designs involved high-complexity schemes, since it is necessary to create a NN for each SNR value [26]. In [23], the channel estimator's design complexity was simplified to a single NN for all SNR values. However, the resulting performance was found to be inferior to that of the optimal MMSE estimator. Accordingly, the results in [23], [26] inspired us to develop a channel estimator that involves a single NN with a non-uniform strategy, is robust along a wide range of SNRs, achieves performance equivalent to the MMSE channel estimator, and outperforms the LS channel estimator. In addition, our implementation does not require knowledge of all input samples in advance for the testing phase, as required in [23].
In the related state of the art [8], [16], [17], several issues arise with end-to-end learning schemes, notably their low BER performance when compared with convolutional codes and their high-complexity structure. The results in [16], [17], [28] motivated us to apply significant changes to the design of the standard and Turbo AE to improve the symbol detection capabilities. The proposed AE is constructed on multiple decoders and a new layered framework based on batch normalization (BN) for designing both encoders and decoders. Multi-decoding functions as a form of ensemble learning, employing multiple decoders to interpret encoded data from varied perspectives. This interpretation enhances system robustness and lowers the error cost function by aggregating results from multiple models. Through the multi-decoder approach, more than one path can be employed to update encoder and decoder weights during training, resulting in a more robust model than would be possible with a single-decoder architecture. BN has the ability to stabilize NN training. By ensuring each layer's inputs have a steady mean and variance, BN counters the problem where input distributions change between layers. This speeds up training and allows for more independent and efficient learning across layers.
Additionally, our design exhibits reduced complexity in both the proposed AE and the NN estimator when compared with existing learning frameworks. While the majority of studies utilizing DNNs for symbol detection depend on external channel estimators [16], [17], we have adopted another approach. We have not only designed a standalone NN channel estimator but also seamlessly integrated it into our proposed AE for combined training and testing in a unified system. This is crucial for scalability and faster implementation. Furthermore, when utilizing model-based methods, our channel estimator NN is also available as an individual design. By combining the outcomes of the proposed channel estimator NN along with the proposed AE, we provide a holistic end-to-end system based on NNs that includes both symbol detection and channel estimation in SOC.
This work is a substantial extension of [1]. In [1], we assumed perfect CSI in symbol detection and excluded any channel estimation calculations for simplicity. However, the assumption that the receiver knows the fading coefficients perfectly may not be viable in practical scenarios. Consequently, we develop a NN channel estimator that is as effective as the MMSE estimator with low complexity. We evaluate our channel estimator NN against both state-of-the-art learning estimators and the MMSE estimator in terms of the MSE metric. In addition, the proposed AE architecture is significantly modified to provide adequate performance in symbol detection with both perfect and imperfect CSI. We also evaluate our AE against state-of-the-art learning frameworks and convolutional codes at different code rates and with perfect and imperfect CSI in different fading channels. The main contributions can be summarized as follows:
• Instead of creating/training an individual NN for each training SNR value, a two-input channel estimator is developed that is optimized across a wide range of training SNRs utilizing a non-uniform strategy. This approach demonstrates performance equivalent to the MMSE estimator in terms of MSE, outperforming the existing learning-based frameworks and the LS channel estimator in different fading channels. Additionally, we provide a detailed comparison highlighting the decreased computational complexity relative to learning-based frameworks. Moreover, the mathematical expression for the MMSE estimator is derived in the Log-normal fading channel, which can be employed in both estimation and detection analysis.
• An AE model is proposed to construct an end-to-end physical layer communication system for SOC in the presence of AWGN, a Log-normal fading channel, and a Gamma-Gamma (GG) fading channel. A new layered structure employing BN for both encoders and decoders, as well as a multi-decoder approach, forms the basis of the proposed AE. In light of this structure, we found that, compared to the state-of-the-art models, the proposed AE can significantly reduce the error loss function. This observation is supported by the significantly improved bit error rate (BER) performance.
While achieving an adequate BER performance, the computational complexity is further reduced in comparison to the standard AE model.
• The proposed AE model is compared to the existing learning-based frameworks in [16], [17], [28] as well as to the so-called capacity-approaching convolutional codes [29]. Our findings show superior performance in the presence of both perfect and imperfect CSI at code rates of 1/2 and 1/3 compared to model-based convolutional codes and learning-based frameworks in Log-normal fading channels. Furthermore, we have conducted validation experiments in the presence of a GG fading channel, focusing on a code rate of 1/2 for scenarios with both perfect and imperfect CSI. To the best of our knowledge, this is the first instance where an AE employing DNNs outperforms capacity-approaching convolutional codes in SOC.
• We show that the proposed AE-based detection parameters are adjusted to utilize the estimated channel gains resulting from the proposed channel estimator NN. Subsequently, the proposed channel estimator NN and the MMSE channel estimator perform equally well in BER detection. In addition, we have successfully integrated both the proposed channel estimator NN and the proposed AE into a unified system, aiming for an end-to-end solution that enables one DL model for both symbol detection and channel estimation simultaneously.

D. OUTLINE
The rest of the paper is organized as follows. Section II focuses on the STK simulator-based SOC channel model. The overall system model is briefly discussed in Section III. Section IV describes the novel design of the channel estimator NN. The structure of the DL AE is presented in Section V. In Section VI, we compare the results of the channel estimator NN with benchmarking schemes and evaluate the proposed AE-based detection in comparison to model-based and state-of-the-art learning-based frameworks in SOC. Finally, the conclusion of this article is presented in Section VII.

II. SPACE OPTICAL CHANNEL MODEL
We define the point-to-point downlink channel between a GEO satellite and a ground station. Following this, we describe a separate setup for a downlink channel between a LEO satellite and a ground station. The STK simulator facilitates precise channel modeling for the point-to-point SOC channels [11], [30], [31]. In the system, the ground station holds the receiver antenna gimbal and an avalanche photo-detector. Additionally, the GEO satellite holds the laser transmitter and the gimbal for the transmitter antenna. The gimbal system can be used to support and stabilize transmitters and receivers.
The laser transmitter is modeled as a Gaussian beam. The laser utilizes IM/DD, where the light intensity is modulated as an information-carrying signal, with data recovery accomplished by detection of the incoming light intensity.
In addition, the generated modulating signal (current) is real and positive as a result of this procedure. This is a significant difference from RF coherent communications, where the modulated signal is complex-valued [15]. Furthermore, the modulated signal in IM/DD is peak-constrained for reasons of operation, safety, and illumination [15]. The Log-normal distribution is typically used to describe the weak atmospheric turbulence regime and is the best distribution fit that STK recommends for the GEO-to-ground SOC channel. Changes in atmospheric temperature and pressure at various points along the signal's propagation are the cause of atmospheric turbulence [32]. The probability density function (PDF) for the Log-normal distribution of the channel gain is given by [12]

f_h(h) = \frac{1}{h \sigma_l \sqrt{2\pi}} \exp\!\left( -\frac{(\ln h - \mu)^2}{2\sigma_l^2} \right), \quad h > 0, \qquad (1)

where h represents the positive channel gain, µ represents the mean, and σ_l denotes the standard deviation of ln h. Next, we outline the downlink configuration from a LEO satellite to a ground station. Within this context, the presence of atmospheric turbulence leads to the scintillation effect, causing variations in the received signal power. Under conditions of strong turbulence, the GG distribution emerges as a suitable model to represent the channel in such scenarios [33], [34].
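Before turning to the strong-turbulence model in detail, the Log-normal law in (1) can be sanity-checked numerically; the values of µ and σ_l below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma_l = -0.02, 0.2
h = rng.lognormal(mean=mu, sigma=sigma_l, size=400_000)  # positive gains, ln h ~ N(mu, sigma_l^2)

def lognormal_pdf(v):
    # f(h) = exp(-(ln h - mu)^2 / (2 sigma_l^2)) / (h sigma_l sqrt(2 pi))
    return np.exp(-(np.log(v) - mu) ** 2 / (2 * sigma_l ** 2)) / (
        v * sigma_l * np.sqrt(2.0 * np.pi))

# The PDF should integrate to one over h > 0 (fine uniform grid, rectangle rule).
grid = np.linspace(1e-4, 10.0, 200_000)
area = (lognormal_pdf(grid) * (grid[1] - grid[0])).sum()
```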
The GG model arises when we assume that the turbulence-induced log-intensity fluctuations can be described by the product of two statistically independent Gamma-distributed processes, typically associated with strong turbulence effects.
The probability density function (PDF) of the GG distribution is described as [33]

f_h(h) = \frac{2(\alpha\beta)^{(\alpha+\beta)/2}}{\Gamma(\alpha)\,\Gamma(\beta)} \, h^{\frac{\alpha+\beta}{2}-1} \, K_{\alpha-\beta}\!\left( 2\sqrt{\alpha\beta h} \right), \quad h > 0,

where the parameters α and β represent the shape factors of the distribution, stemming from the individual shape parameters of the two Gamma distributions associated with turbulence effects. The term K_{α−β} is the modified Bessel function of the second kind with order α−β, while Γ(·) denotes the Gamma function. Furthermore, the received sequence y^u is described as

y^u = h x^u + w^u, \qquad (2)

where w^u ∼ N(0, σ_w^2 I_u) is the Gaussian noise and σ_w^2 is the noise variance. The vectors y^u, x^u, and w^u have dimension R^u, where u represents the length of the sequence of symbols. In our model, we consider both perfect and imperfect CSI for the Log-normal fading channel.
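The product construction described above can be sketched directly: multiplying two independent unit-mean Gamma variates yields a Gamma-Gamma gain whose first two moments follow in closed form. The shape parameters and channel values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 4.0, 1.9            # illustrative shape factors (strong turbulence)
N = 1_000_000
large_scale = rng.gamma(shape=alpha, scale=1.0 / alpha, size=N)  # unit-mean Gamma
small_scale = rng.gamma(shape=beta, scale=1.0 / beta, size=N)    # unit-mean Gamma
h = large_scale * small_scale     # Gamma-Gamma distributed gain, E[h] = 1

# Moments implied by the product construction:
#   E[h] = 1,  Var[h] = 1/alpha + 1/beta + 1/(alpha*beta)
var_theory = 1.0 / alpha + 1.0 / beta + 1.0 / (alpha * beta)

# One received sequence through the fading channel: y = h x + w.
sigma_w, A, u = 0.1, 1.0, 8
x = A * rng.integers(0, 2, size=u).astype(float)   # OOK-like symbols in [0, A]
y = h[0] * x + sigma_w * rng.standard_normal(u)
```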
The average energy per bit to noise power spectral density ratio E_b/N_0 for on-off keying (OOK) is given by [35]

\frac{E_b}{N_0} = \frac{u A^2}{2 k \sigma_w^2},

where A is the peak intensity, k is the number of message bits, and u is the length of the coded symbol sequence.

III. PROPOSED END-TO-END LEARNING-BASED DESIGN
As depicted in Fig. 1, we consider an SOC system in which a transmitter located on the GEO satellite sends the message b ∈ B, B = {1, 2, . . . , B}, to a certain receiver over a Log-normal fading channel. To model the channel, we use the STK simulator, with the encoder on a GEO satellite and the receiver at a ground station. The message b is first fed into the DL encoder NN, producing x^u. The elements of x^u are represented as x(i), 1 ≤ i ≤ u, and meet both the peak and the non-negativity constraints required by the optical channel's physical characteristics, i.e., 0 ≤ x(i) ≤ A. The data rate is defined as k/u bits/channel use, where k = log_2(B) bits are sent through u coded symbols. Additionally, the encoded vector x^u is transmitted through a SOC channel as described in Section II. The resulting sequence is denoted as y^u ∈ R^u. The received sequence can be obtained in accordance with the probabilistic law

P(y^u \mid x^u, h) = \prod_{i=1}^{u} P\big(y(i) \mid x(i), h\big), \qquad (4)

where h ∈ R^+ denotes the optical fading channel gain produced by STK, which is considered to remain constant throughout the transmission of the sequence x^u. P(y^u | x^u, h) is the conditional probability that a particular sequence y^u = [y(1), . . . , y(u)] is received given the transmitted input sequence x^u = [x(1), . . . , x(u)] and the channel fading coefficient h [36]. In this paper, we argue that the proposed channel estimator NN can be trained to acquire knowledge of the transition probability law for an input-output model that could be governed by (2), or could be more general, as in (4), without an explicit law. The channel estimator NN is based on two inputs with a single NN whose parameters are tuned across a wide range of training SNRs. Furthermore, we take into account a pilot-based channel estimation approach, wherein the pilot symbol x_p is used for channel estimation and is communicated as the first symbol x(1) of the transmitted sequence, i.e., x_p ≜ x(1). For symbol detection, we propose the AE structure and consider three cases: AWGN (no fading), fading with perfect CSI at the receiver, and fading with imperfect CSI at the receiver. The proposed AE is developed with multiple decoders along with a layered structure of encoders and decoders that employs BN layers. Next, design details regarding the proposed NN-based estimation and the AE-based detection are discussed.
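The pilot-aided transmission described in this section can be sketched as follows; the sequence length, peak intensity, noise level, and fading parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
u, A, sigma_w = 8, 2.0, 0.1
h = rng.lognormal(mean=0.0, sigma=0.2)             # one fading realization,
                                                   # constant over the sequence

x = A * rng.integers(0, 2, size=u).astype(float)   # coded symbols, 0 <= x(i) <= A
x[0] = A                                           # known pilot symbol x_p = x(1)

y = h * x + sigma_w * rng.standard_normal(u)       # channel: y = h x + w
y_p = y[0]                                         # pilot observation fed to the estimator

h_ls = y_p / A                                     # crude pilot-only (LS) estimate
```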

IV. PROPOSED NN DESIGN FOR CHANNEL ESTIMATION
In this section, we present the proposed channel estimator NN, which outputs the estimate ĥ. Additionally, we derive the mathematical expression for the MMSE estimator in the Log-normal fading channel and apply it in both estimation and detection as a benchmark. Although the MMSE estimator provides the optimal performance in terms of MSE, it has a considerable level of computational complexity and requires an explicit input-output model like the one in (2). On the contrary, the proposed channel estimator NN is capable of predicting the CSI and obtaining performance equal to the model-based MMSE estimator with far less complexity and without the need for an explicit input-output model. In addition, the proposed channel estimator NN relies on two inputs, and we train a single NN whose parameters are adjusted across a wide range of training SNRs, as opposed to generating a separate NN for each possible training SNR.
The proposed NN architecture: The proposed NN estimator is installed at the GEO satellite. It is composed of two fully connected (FC) hidden layers, a rectified linear unit (ReLU) activation function at each hidden layer, and a linear activation function at the output layer. As shown in Fig. 2, the NN has two inputs: the received signal y_p ≜ y(1) and the peak intensity A.
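A minimal NumPy sketch of this architecture shows the two-input forward pass; the layer widths and the Gaussian initialization scale are illustrative assumptions, and the network here is untrained.

```python
import numpy as np

rng = np.random.default_rng(4)

def init_layer(n_in, n_out):
    # Gaussian weight initialization, zero biases (illustrative scale 0.1).
    return rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out)

W1, b1 = init_layer(2, 16)    # hidden layer 1: inputs (y_p, A)
W2, b2 = init_layer(16, 16)   # hidden layer 2
W3, b3 = init_layer(16, 1)    # linear output layer -> scalar h_hat

def estimate(y_p, A):
    a0 = np.array([y_p, A])
    a1 = np.maximum(0.0, W1 @ a0 + b1)   # FC + ReLU
    a2 = np.maximum(0.0, W2 @ a1 + b2)   # FC + ReLU
    return (W3 @ a2 + b3)[0]             # linear activation: estimated gain

h_hat = estimate(y_p=0.8, A=1.0)
```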

Training methodology:
The following steps generate the training data used in channel estimation:
• We first generate the true channel coefficients h^(n), 1 ≤ n ≤ N_s, based on the Log-normal fading channel from (1), where N_s is the number of training samples.
• We randomly assign peak intensities to the N_s samples, generating peak intensity constraints A ∈ [A_min, A_max] to cover a wide range of SNR values. Following the non-uniform strategy, samples exhibiting high peak intensity values occur with a higher probability in the training set, while samples with low peak intensity occur with a lower probability.
• The NN has two inputs: y_p and A. To generate the received pilot element y_p for the n-th training sample, we substitute the corresponding peak intensity A^(n) and the true channel coefficient h^(n) in (2).
• Each training data tuple is formed as ((y_p^(n), A^(n)), h^(n)), where (y_p^(n), A^(n)) is the input tuple to the NN and h^(n) is the target value for the n-th training sample.
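The steps above can be sketched as follows, with the non-uniform peak-intensity strategy emulated by probability weights that grow with A; all sizes, ranges, and noise values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N_s = 10_000
mu, sigma_l, sigma_w = 0.0, 0.2, 0.1
A_levels = np.linspace(0.5, 4.0, 8)       # candidate peaks in [A_min, A_max]
probs = A_levels / A_levels.sum()         # higher A -> higher sampling probability

h = rng.lognormal(mean=mu, sigma=sigma_l, size=N_s)   # true gains, from (1)
A = rng.choice(A_levels, size=N_s, p=probs)           # non-uniform peak intensities
y_p = h * A + sigma_w * rng.standard_normal(N_s)      # received pilot, via (2)

# Each training tuple: inputs (y_p, A), target h.
dataset = np.stack([y_p, A, h], axis=1)
```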

A. LEARNING ALGORITHM
The proposed channel estimator NN only makes use of two inputs, and we train a single NN whose parameters are adapted across a wide range of training SNRs, as opposed to creating a new NN for each possible training SNR. There are two phases to the NN's learning process: training and testing. The network model must be trained in three steps before effective channel parameter estimation can be implemented. The first step is to select the data samples to utilize. Second, the gradient descent algorithm is used to calculate the partial derivative of the cost function by minimizing the difference between the output value and the target value. Specifically, each parameter should be adjusted in the direction of the fastest descent of the error function, i.e., the direction of the negative gradient. Third, when the training data for an epoch is finished, the validation data is used to determine the best model across all training iterations. In Fig. 2, θ_ij^[l] corresponds to the weight of the link between the j-th neuron in the (l − 1)-th layer and the i-th neuron in the l-th layer. The l-th layer pre-activation is represented by

z_i^{[l]} = \sum_j \theta_{ij}^{[l]} a_j^{[l-1]} + b_i^{[l]},

where b_i^[l] represents the bias of the i-th neuron in the l-th layer and a_j^[l−1] is the activation of the j-th neuron in the (l − 1)-th layer. Employing the rectified linear unit (ReLU) activation function, the neuron output activation can be rewritten as

a_i^{[l]} = \max\big(0, z_i^{[l]}\big).

At the start of the training, the initial weights are selected as random numbers drawn from a Gaussian distribution. Then, the state vector z^[l] can be obtained through each layer using the forward propagation formula

z^{[l]} = \Theta^{[l]} a^{[l-1]} + b^{[l]},

where Θ^[l] denotes the weight matrix with i rows and j columns, a^[l−1] is the activation vector of dimension j in the (l − 1)-th layer, and the bias vector of dimension i in the l-th layer is denoted as b^[l]. Afterwards, z^[l] is fed into a ReLU activation function, resulting in the output vector a^[l] at layer l:

a^{[l]} = f_a\big(z^{[l]}\big) = \max\big(0, z^{[l]}\big).

Each hidden layer applies the nonlinear ReLU function f_a(x) = max(0, x) after each neuron to enable the learning of complex, nonlinear relationships between the inputs and output. By employing the network's hidden layers, features of the training inputs are extracted and then used to generate estimation results. The NN-estimated channel gain at the final output layer L can be described as

\hat{h} = \Theta^{[L]} a^{[L-1]} + b^{[L]},

where Θ^[L] describes the connection weight matrix of the output layer, b^[L] represents the bias of the final output layer, and ĥ denotes the estimated channel gain generated by the output of the entire NN. Then, the loss calculations follow the feed-forward computations. The utilized loss function L(ĥ, h) is the normalized MSE, which is the most suitable function in regression problems, defined as

L(\hat{h}, h) = \frac{1}{N_s} \sum_{n=1}^{N_s} \left( \frac{\hat{h}^{(n)} - h^{(n)}}{h^{(n)}} \right)^2,

where h^(n) is the true output of the n-th sample and ĥ^(n) is the actual output provided by the NN for the n-th sample. Then, the objective of the proposed channel estimator NN during the training stage is to minimize the training loss over the network parameters θ, i.e., min_θ L(ĥ, h). The detailed steps for the backpropagation process that minimizes the training loss are provided in Appendix A. The learning strategy of the proposed NN estimator is summarized in Algorithm 1.
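The training loop can be sketched end to end with a single hidden layer for brevity: forward pass, normalized-MSE loss, manual backpropagation, and a gradient-descent update. The layer size, learning rate, and data parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
N, H, lr = 256, 16, 0.01

h = rng.lognormal(0.0, 0.2, size=(N, 1))                 # targets
A = rng.choice([1.0, 2.0, 3.0], size=(N, 1))             # peak intensities
y_p = h * A + 0.05 * rng.standard_normal((N, 1))         # received pilots
X = np.hstack([y_p, A])                                  # NN inputs (y_p, A)

W1, b1 = rng.normal(0, 0.3, (H, 2)), 0.1 * np.ones(H)    # Gaussian init
W2, b2 = rng.normal(0, 0.3, (1, H)), np.zeros(1)

losses = []
for _ in range(500):
    Z1 = X @ W1.T + b1                   # pre-activation z = Theta a + b
    A1 = np.maximum(0.0, Z1)             # ReLU
    y_hat = A1 @ W2.T + b2               # linear output layer
    losses.append(np.mean(((y_hat - h) / h) ** 2))   # normalized MSE

    G = 2.0 * (y_hat - h) / (h ** 2) / N             # dL/dy_hat
    dW2, db2 = G.T @ A1, G.sum(0)                    # output-layer gradients
    dZ1 = (G @ W2) * (Z1 > 0)                        # backprop through ReLU
    dW1, db1 = dZ1.T @ X, dZ1.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1                   # negative-gradient step
    W2 -= lr * dW2; b2 -= lr * db2
```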
Testing stage: The NN-based estimator utilizes the received signal y_p ≜ y(1) to obtain an estimate of the channel gain ĥ. The same verification procedures are applied in the GG fading channel. To demonstrate how the proposed channel estimator NN compares to the MMSE estimator, we derive the MMSE estimator in a Log-normal fading channel. The MMSE objective can be described as

ĥ_MMSE = argmin_ĥ E[(h − ĥ)²] = E[h | y_p].

The MMSE algorithm is noise resistant and takes into account the influence of Gaussian noise on the estimation performance, but it has a high computational complexity. The estimated channel gain for the MMSE estimator in the Log-normal fading channel can be described as the conditional mean

ĥ = ∫₀^∞ h f(y_p | h) f_h(h) dh / ∫₀^∞ f(y_p | h) f_h(h) dh,

where f_h(h) is the Log-normal PDF (see Appendix B). Following the same steps outlined in Appendix B with f_h(h) replaced by the GG PDF, the estimated channel gain for the MMSE estimator in the GG fading channel takes the same form.

Algorithm 1: Training the proposed channel estimator NN
repeat
   Draw a minibatch of m training samples.
   for i ← 1 to m do
      Generate the received pilot y_p^(i) ← h^(i) A^(i) + w^(i).
   end for
   Forward propagate (y_p^(i), A^(i)) to obtain ĥ^(i).
   Calculate the minibatch loss L(ĥ, h).
   Calculate the gradients: ∇_θ L ← ∂L/∂θ.
   Update the parameters: θ ← θ − η ∇_θ L.
until convergence.

In addition to compressing data, the AE learns how to recreate the original data from the compressed form. The AE system can be expressed by the pair (k, u), where k and u are the number of message bits and the codeword length, respectively. The channel code rate is R = k/u. The proposed AE(k, u) is illustrated in Fig. 3 for an SOC system with code rate 1/3, without loss of generality. The receiver is based at a ground station, whereas the encoder is on a GEO satellite. The channel coding rate is 1/3, with k = 7 and u = 21. The system is composed of three components: the transmitter, the SOC channel, and the receiver. First, the transmitter selects one out of M possible messages b ∈ M, M = {1, . . ., M}, represented as a one-hot vector 1_b of dimension 2^k. The transmitter then uses the mapping function f : M → R^u to transform the input one-hot vector 1_b into the encoded vector x_u. The benefit of one-hot encoding is that the output is binary rather than ordinal; the one-hot vector has all-zero entries except a single one indexing the message b ∈ M. The symbol vector x_u generated by the normalization stage of the transmitter satisfies the positivity and peak requirements of SOC. It is then transmitted through the SOC channel provided by STK, as discussed in Section II. The SOC channel is constructed from both Log-normal fading and an AWGN channel with zero mean and unit variance. Subsequently, the receiver generates the estimated one-hot vector 1̂_b, using a multi-decoder approach to recover the message b from the corrupted vector y_u.
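The one-hot mapping described above is straightforward to state in code (a small sketch under the paper's parameters k = 7 and M = 2^k = 128):

```python
import numpy as np

K = 7          # number of message bits
M = 2 ** K     # number of possible messages (128)

def one_hot(b, M=M):
    # All-zero vector except a single 1 indexing message b in {0, ..., M-1}
    v = np.zeros(M)
    v[b] = 1.0
    return v

v5 = one_hot(5)   # message index 5 -> 128-dimensional one-hot vector
```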
Moreover, the transmitter model is based on FC layers, with BN layers occurring after each FC layer and a Randomized Leaky Rectified Linear Unit (RReLU) activation function in between. In order to generate more accurate models, the AE can make use of the RReLU activation, a non-saturating function that produces simultaneous activations associated with regression and classification [37]. The RReLU activation outperforms the Sigmoid and Tanh activations in terms of both training time and generalization capability [37].
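A minimal sketch of the RReLU activation follows (the slope bounds [1/8, 1/3] are a common default and an illustrative assumption here, not a value taken from the paper):

```python
import numpy as np

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=None):
    # Randomized Leaky ReLU: negative inputs are scaled by a slope drawn
    # uniformly from [lower, upper] during training; at test time the
    # fixed slope (lower + upper) / 2 is used instead.
    if training:
        rng = np.random.default_rng() if rng is None else rng
        slope = rng.uniform(lower, upper, size=np.shape(x))
    else:
        slope = (lower + upper) / 2.0
    return np.where(x >= 0, x, slope * x)
```

Because the negative slope is random during training, RReLU also acts as a mild regularizer while avoiding the saturation of Sigmoid and Tanh.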
In addition, for both the encoder and each decoder, we utilize BN on all of the hidden units in the same layer. BN is a technique that normalizes the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy [38]. BN also offers a solution to the challenge of statistical estimation when dealing with a limited batch size. For a channel c with activations x_c over a training batch of size B, the BN normalization is described as [38]

BN(x_c) = γ (x_c − μ(c)) / σ(c) + β,

where γ and β are learnable scale and shift parameters. Here, the average value over the entire batch is given by

μ(c) = (1/B) Σ_{b=1}^{B} x_c(b),

while the standard deviation over the entire batch is represented by

σ(c) = sqrt( (1/B) Σ_{b=1}^{B} (x_c(b) − μ(c))² + ε ).

Throughout the training process, the values of μ(c) and σ(c) are computed from the training batch. In contrast, during the inference or testing stage, they are based on the optimal values determined during training. While most existing learning-based frameworks only employ a single decoder at the receiver [16], [17], we employ a multi-decoder scheme. Using a set of several decoders and a BN-based layered structure for both encoders and decoders, we found that gradient descent can significantly improve the BER performance over the existing state-of-the-art models by minimizing the error loss function. This may be considered a form of ensemble learning in which multiple neural networks operate concurrently to address a problem. Ensemble methods often lead to better generalization because they combine the strengths of multiple models and mitigate individual model weaknesses [39]. In situations where one branch might fail or produce suboptimal results, having multiple branches can help reduce the error cost function: if one branch encounters difficulties or noise in the data, the other branches can still contribute to the final decision. Ensemble learning [39], [40] is the idea that inspired us to apply the parallel structure, which reduces the error cost function by employing multiple decoders to interpret the encoded data from varied perspectives. This structure enhances the training stability and
lowers the error cost function by aggregating results from multiple models. If N is the number of parallel branches and y_i(x) is the output of the i-th branch for an input x, the ensemble's average output is

ȳ(x) = (1/N) Σ_{i=1}^{N} y_i(x).

Typically, ensemble methods reduce the variance component of the error, which can lead to better generalization. For a generalization error E_i associated with the i-th branch, and under the idealized assumption of uncorrelated, zero-mean branch errors, the ensemble learning error satisfies [39]

E_ensemble = (1/N) · (1/N) Σ_{i=1}^{N} E_i.

The effectiveness of the parallel structure becomes apparent under parallel computing conditions, as seen when using GPUs. The ensemble time T_ensemble can be defined as a function of T_i, the time taken by the i-th branch [39]:

T_ensemble = max_i T_i.

The second reason is the addition of BN layers before fully connected layers. BN helps the network overcome the internal covariate shift problem, where the distribution of activations in intermediate layers of a NN can change during training. This can make it challenging for the network to converge and learn effectively, as the weights need to adapt to the constantly shifting activation distributions in addition to minimizing the training loss [28]. BN mitigates this problem by normalizing the inputs to each layer, ensuring that they have a consistent mean and variance during the training process.
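The BN computation reduces to a few array operations (a minimal sketch; the values of γ, β, and ε shown are illustrative defaults):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each channel of a batch (rows = samples) to zero mean and
    # unit variance, then apply the learnable scale gamma and shift beta.
    mu = x.mean(axis=0)
    sigma = np.sqrt(x.var(axis=0) + eps)
    return gamma * (x - mu) / sigma + beta

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=(1000, 4))   # batch of 1000, 4 hidden units
y = batch_norm(x)
```

At inference time, the batch statistics μ and σ would be replaced by running averages fixed during training, as noted above.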
The BER improves when the error loss function decreases; when this occurs, our AE model provides predictions that are close to the actual data. Through the multi-decoder approach, more than one path can be employed to update the encoder and decoder weights during training, resulting in a more robust model than would be possible with a single-decoder architecture. Furthermore, during the training phase, the encoder and decoder operate as a unified NN. This means that the backpropagation method can compute error gradients for both components simultaneously in every training iteration. This concurrent computation facilitates the joint training of the encoder and decoder. The feedback from backpropagation guides each layer on how to adjust its parameters to reduce the cross-entropy loss. Employing optimization strategies such as stochastic gradient descent, the parameters of both the encoder and decoder are refined. This iterative process continues until the error reaches the lowest possible value. Furthermore, the hyperparameters are optimized by experimenting with various parameter values until the best possible validation loss is achieved.
Figure 3 shows that the input to the first decoder is r_1 of length u_1 = 7. Similarly, r_2 and r_3 correspond to the second and third decoder inputs. Every decoder makes an independent prediction of the estimated input hot vector of dimension 2^k. Each decoder, as shown in Fig. 4, is built from a sequence of dense layers based on FC, RReLU, and LN layers, similar to the construction of the encoder. To estimate the input hot vector, each decoder maps the input vector r_j to the corresponding output vector o_j of length M = 2^k, where j ∈ {1, 2, 3}. The estimated vectors from each decoder are then multiplied by a corresponding learnable weight w_j and summed to obtain the vector v of dimension 2^k. Afterwards, vector v is fed into a BN layer to produce a vector d of dimension 2^k. The softmax activation function is applied to the resultant vector d to obtain a probability vector p of length M = 2^k over all possible messages. The decoded message b̂ is the index of the highest probability. The softmax function is defined as

p_i = exp(d_i) / Σ_{m=1}^{M} exp(d_m),

where i ∈ {1, 2, . . ., M}. Cross-entropy loss is a significant cost function for improving classification model precision.
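The weighted combination of decoder outputs and the softmax decision can be sketched as follows (an illustration only; the logit values and weights are made-up numbers, and the intermediate BN layer is omitted for brevity):

```python
import numpy as np

def softmax(d):
    # Numerically stable softmax over the M candidate messages
    e = np.exp(d - d.max())
    return e / e.sum()

def combine_and_decode(outputs, weights):
    # v = sum_j w_j * o_j, then a probability vector p = softmax(v);
    # the decoded message is the index with the highest probability.
    v = sum(w * o for w, o in zip(weights, outputs))
    p = softmax(v)
    return p, int(np.argmax(p))

o1 = np.array([0.1, 2.0, 0.3])   # toy decoder outputs, M = 3 for brevity
o2 = np.array([0.2, 1.5, 0.1])
o3 = np.array([0.0, 1.8, 0.2])
p, b_hat = combine_and_decode([o1, o2, o3], [0.4, 0.3, 0.3])
```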
The cross-entropy loss function can be described as

L(1_b, p) = − Σ_{i=1}^{M} (1_b)_i log p_i = − log p_b.

In addition, the benefit of choosing three identical decoders, each of input length 7, is their adaptability when transitioning the AE to higher or lower code rates: one can easily add or omit one of the uniform decoders. For instance, to adapt the AE for a code rate of 1/2, we can simply bypass one of the three identical decoders without altering the overall structure. This modularity presents a significant advantage, allowing us to easily switch our AE to a code rate of 1/2. Furthermore, if there is a need to train at a code rate of 1/4, introducing another identical decoder with an input length of 7 becomes straightforward. On the other hand, employing two decoders, one with an input length of 10 and the other with an input length of 11, requires a comprehensive redesign to accommodate a code rate of 1/2. A similar challenge arises with a configuration of four decoders having input lengths of 5, 5, 5, and 6. Adjusting such designs for a code rate of 1/2 with a block length of k = 7 introduces structural challenges, necessitating alterations to the decoder configurations, making them less flexible, and posing deployment concerns. Another benefit of employing three decoders over two is the enhanced speed during training and testing under parallel processing. This parallel architecture proves especially efficient in parallel computing environments, such as when leveraging GPUs: the time is determined by the longest duration of a single branch, rather than the cumulative time of all branches. Our NN is trained at a fixed peak intensity A, or a corresponding SNR according to (3). To determine which training peak intensity A yields the lowest cross-entropy loss, we investigate a wide variety of values throughout the AE training stage. The best value of the training A for AWGN, perfect CSI, and imperfect CSI at a particular code rate will be demonstrated in the numerical results. In addition, training with a peak intensity A higher than necessary is not promising, because the network will only update its weights for the high SNR regimes, which might produce good results during training but poor results during testing. During the testing phase, we assess our model's performance not only at the trained SNR but across a broad range of SNRs. In the context of DL in wireless communications, channel estimation is primarily a regression problem, whereas the AE generally addresses a classification problem. However, we develop an innovative approach for integrating the channel estimator NN into the AE model, as illustrated in Algorithm 2. In this scenario, we train the unified system once. This design prioritizes minimizing the cross-entropy loss over estimating the channel h with the lowest MSE. Reducing the cross-entropy loss directly improves the BER performance. Accordingly, the main objective is to estimate the channel h that reduces the BER to the least possible value.

VI. SIMULATION RESULTS
In this section, the proposed channel estimator NN is compared to the MMSE and LS estimators and to different state-of-the-art learning-based estimators. Then, in the presence of AWGN, Log-normal, and GG fading channels, we compare the BER performance of AE-based SOC systems with the learning-based frameworks and convolutional codes at code rates 1/2 and 1/3 for perfect and imperfect CSI. Additionally, we train a single NN whose parameters are adjusted across a wide range of training peak intensities.

Algorithm 2: Training the proposed AE with the integrated channel estimator NN
repeat
   Draw m minibatch messages (b^(1), b^(2), . . ., b^(m)).
   for i ← 1 to m do
      Encode the one-hot vector 1_{b^(i)} into the codeword x_u^(i).
      Transmit x_u^(i) over the SOC channel to obtain y^(i).
      Estimate the channel gain ĥ^(i) from the received pilot.
      (r_1^(i), r_2^(i), r_3^(i)) ← Split y^(i) into three segments of equal length.
      Decode each segment and combine the weighted decoder outputs.
   end for
   Calculate the minibatch loss, calculate the gradients, and update the parameters.
until convergence.

Following the procedures outlined in Section III, the input tuple to the NN is based on two inputs (y_p^(n), A^(n)), where y_p^(n) and A^(n) are the pilot received sequence and the corresponding peak intensity of the n-th sample, respectively. The distribution of the peak intensity A among the training samples is uniform except at the high peak intensities. Figure 5a depicts the distribution of peak intensities in the case of the Log-normal fading channel. Similarly, Fig. 5b shows the non-uniform strategy for the training peak intensity A in the GG fading channel. The batch size is 1000, and the numbers of training, validation, and testing samples are 40, 5, and 10 million, respectively. The output of the channel NN estimator is a single neuron representing the estimated channel gain ĥ.
The NN estimator in [26] is designed for a specific peak intensity level, leading to poor performance when tested with different intensities or corresponding SNRs. This complicates adaptive systems and requires frequent retraining for varying intensities. Our first contribution is overcoming this limitation by enabling real-time processing without requiring extensive knowledge of the sample statistics. The second contribution is a modified NN design: we introduce an additional input, the peak intensity A, enriching the model's information. In addition, instead of uniform training across intensity levels, we employ a non-uniform training strategy, as previously illustrated in Fig. 5. This approach enhances the flexibility and practicality of our model for real-world
applications where immediate processing is essential. Moreover, the authors in [23] need to build three NNs in order to achieve the MMSE estimator performance: one trained at A = 3 yields the best estimation from 0 to 7 dB, another trained at A = 7 covers the range from 7 to 14 dB, and a third trained at A = 20 covers the range from 14 to 20 dB. Their approach appears to yield good results with lower complexity compared to the MMSE estimator and the channel estimator NN in [26]. However, it relies on the impractical assumption that the statistics of the testing samples are known in advance. The proposed single channel estimator NN outperforms [23] in MSE without requiring any prior knowledge of the statistics of the testing samples and without using multiple NNs. As shown in Fig. 6a, in the presence of a Log-normal fading channel, the proposed channel estimator NN achieves a 15% MSE improvement at an SNR of 6 dB compared to the model in [23] trained at peak intensity A = 20. Moreover, when compared to [23] trained at peak intensity A = 3, the proposed channel estimator NN yields a 37% MSE improvement at an SNR of 12 dB. Furthermore, at an SNR of 8 dB, the proposed channel estimator NN outperforms the LS estimator by a significant 57%, and it exhibits a 13% enhancement at an SNR of 20 dB compared to the LS estimator. Similarly, as shown in Fig. 6b under the GG distribution, our proposed NN estimator performs equivalently to the MMSE estimator across various SNR levels. However, when we apply the uniform strategy in our proposed NN estimator, it performs 39% worse at an SNR of 15 dB compared to the non-uniform approach. This highlights the importance of the non-uniform strategy, especially at higher SNR levels. Also, the proposed NN estimator outperforms the LS estimator by 28% at an SNR of 12 dB.
Table 1 provides a comprehensive overview of the proposed AE's structure and number of parameters, comparing it with the standard AE [8], [16], [17]. The proposed and the standard AE are compared at code rate 1/3. The encoder module in the proposed AE has approximately 19% fewer parameters than the encoder in the standard AE.
On the decoder side, a single decoder in the proposed AE has 35% of the number of parameters of the standard AE's decoder, which means that the three-decoder structure in the proposed AE has only 5% more parameters than the single decoder in the standard AE. To ensure a fair comparison, all the normalization schemes used in the proposed AE were included in the computational complexity calculations. Overall, the proposed AE(7,21) has 8% fewer learnable parameters than the standard AE. On the other hand, in Table 2, we observe a 15% reduction in complexity for the proposed channel estimator NN in comparison to [26]. Notably, the latter requires training a separate NN for every SNR. In Tables 3 and 4, we introduce the baseline systems for symbol detection and channel estimation, respectively. Table 3 elaborates on the encoder, the decoder, and the channel conditions for each code rate, while Table 4 offers a summary of the baseline channel estimators, highlighting their structure, relevant statistical details, and key information.
Next, we demonstrate the BER performance of the proposed AE-based SOC at 1/2 and 1/3 coding rates. In addition, we compare the proposed AE model to both state-of-the-art learning-based approaches and model-based coding schemes. Figures 3 and 4 illustrate the simulation layout for the proposed AE. A total of 20,000,000 samples were used for training and 10,000,000 for testing. We achieve both training stability and effective weight learning by employing the Adam optimizer with a learning rate of 0.0001 over 100 training epochs. Convolutional codes using IM/DD at code rates of 1/2 and 1/3, as well as uncoded IM/DD, are implemented and compared with the proposed AE in terms of BER. In addition, we evaluate our results against the benchmark AE models described in [16], [17], [28]. Although [16] demonstrates the viability of the standard AE in OWC channels under the assumption of an AWGN channel, it does not explore the performance of the AE in fading channels. By extending the work of [16] to turbulence channels, the authors of [17] adapted the standard AE to both perfect and imperfect CSI. In addition, the Turbo AE [28] performance in SOC was not satisfactory, even after optimizing the training SNR and switching to the positive normalization suitable for SOC. The proposed AE outperforms the learning-based frameworks presented in [16], [17], [28] at code rates of 1/2 and 1/3. This improvement can be attributed to the new layered structure that incorporates BN for both encoders and decoders, along with the multi-decoder approach. The convolutional codes with a code rate of 1/3 depicted in Fig. 7 have generator polynomials G_0 = 133_8, G_1 = 171_8, and G_2 = 165_8, corresponding to a constraint length of 7 and 6 memory registers [41].
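A feed-forward encoder with these generators can be sketched as follows (a minimal illustration; tap-ordering and termination conventions vary between implementations, so this is not necessarily the exact encoder benchmarked in the paper):

```python
# Rate-1/3 convolutional encoder, constraint length K = 7 (6 memory
# registers), octal generators 133, 171, 165 as given in the text.
GENS = (0o133, 0o171, 0o165)
K = 7

def conv_encode(bits, gens=GENS, K=K):
    state = 0
    out = []
    for b in bits:
        # shift the new bit in; keep only the K most recent bits
        state = ((state << 1) | b) & ((1 << K) - 1)
        for g in gens:
            out.append(bin(state & g).count("1") % 2)  # parity of tapped bits
    return out

coded = conv_encode([1, 0, 1, 1])   # 4 input bits -> 12 coded bits
```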
As can be seen in Fig. 7, the AE(7,21)'s BER performance is 0.6 dB better than the convolutional codes at a BER of 10^-6 in the AWGN channel. At a BER of 10^-4, the AE outperforms the Turbo AE and the standard AE by 2.1 dB and 1.4 dB, respectively. Furthermore, at a BER of 10^-4 for code rate 1/3, the proposed AE is superior to the uncoded SOC system employing IM/DD and a maximum likelihood decoder (MLD) by 2.9 dB. The proposed AE(7,21) is developed in an AWGN channel with a training peak intensity A = 3. In Fig. 8, we observe that the proposed AE(7,21) achieves 0.3 dB better performance than the convolutional codes at a BER of 10^-4 and 0.1 dB better performance at a BER of 10^-6 when using a Log-normal fading channel with perfect CSI at the receiver. At a BER of 10^-5, it exceeds the performance of the learning-based frameworks of the standard AE and the Turbo AE by 1.1 dB and 2.1 dB, respectively. The training peak intensity A employed with the Log-normal fading channel is set to 4. For computations involving 10 million samples, the simulation time required by the convolutional codes employing the MLD is 8 times longer than the testing time of the proposed AE model; for a fair comparison, both methods were executed on the same CPU. MLD decoders identify the most likely transmitted signal for each received signal. This identification relies on the probability of every potential transmitted signal, factoring in the observed signal and the established channel statistics.
In Fig. 9, the proposed AE(7,21)-based detection utilizing the MMSE estimator exhibits the same performance as when utilizing the proposed channel estimator NN. Despite its superior estimation performance, the MMSE estimator involves high implementation complexity [24]. The same BER performance is also obtained when utilizing the channel estimator NN of [26], which designs a separate NN for each training SNR. In contrast to [26], we only need to develop a single NN to achieve the same results. In both the low and high SNR regimes, the convolutional codes (7,21) exhibit the same BER performance when using the MMSE estimator, the estimator of [26], and the proposed channel estimator NN.
As can be seen in Fig. 10a, the proposed AE(7,21) outperforms the convolutional codes for a Log-normal fading channel with imperfect CSI at the receiver by 0.9 dB at a BER of 10^-4 and by 0.6 dB at a BER of 10^-6, provided that both the convolutional codes and the AE use the proposed channel estimator NN. In addition, the AE(7,21) demonstrates better BER performance than the convolutional codes across the low and high SNR regimes. When both employ the LS estimator, the proposed AE has a gain of 1.1 dB at a BER of 10^-4 and a 0.4 dB gain at 10^-6. Moreover, the proposed AE utilizing the proposed NN estimator outperforms the proposed AE employing the LS estimator by 2.1 dB at a BER of 10^-4, which highlights the significant improvement of the proposed NN estimator over the LS estimator. Additionally, as depicted in Fig. 10b, the proposed AE using the LS estimator outperforms the standard AE employing the LS estimator by 2.3 dB at a BER of 10^-6. At a BER of 10^-4, the proposed AE is 1.4 dB better than the standard AE when both utilize the MMSE estimator. The proposed AE(7,21) employing the channel estimator NN is inferior by only 0.8 dB compared with the perfect CSI case at a BER of 10^-6.

VOLUME 2, 2024
Moreover, it has the same performance when utilizing [26], which trains a separate NN for each training SNR. The training peak intensity is set to A = 4 in the imperfect CSI scenario.
Figure 11 demonstrates that the proposed AE yields a significant improvement of 1.6 dB over the standard AE at a BER of 10^-4. We also find that at a BER of 10^-6, the AE's performance is 0.25 dB better than that of the convolutional codes in the AWGN channel. Furthermore, the uncoded SOC system employing IM/DD is inferior by 2.3 dB at a BER of 10^-6 compared to the proposed AE(7,14). For a convolutional code with a code rate of 1/2, having 6 memory registers and a constraint length of 7, we use the generator polynomials G_0 = 133_8 and G_1 = 171_8. At a BER of 10^-4, the AE outperforms the convolutional code (7,14) by 1 dB. As illustrated in Fig. 12, at a BER of 10^-4, the proposed AE(7,14) surpasses the standard AE by 1.6 dB in the presence of fading channels. Moreover, when compared to the convolutional code (7,14), the proposed AE(7,14) offers a 0.8 dB improvement at a BER of 10^-4 and a 0.3 dB improvement at a BER of 10^-6.
In Fig. 13, the BER performance of the convolutional code (7,14) using the MMSE estimator is identical to that of the convolutional code (7,14) using the proposed channel estimator NN. Again, we observe behavior similar to that in Fig. 9, where code rate 1/3 was used. When utilizing the estimator presented in [26], which designs a separate NN for each training SNR, the convolutional code (7,14) achieves the same BER performance as when utilizing the proposed channel estimator NN. Like the convolutional codes, the proposed AE demonstrates a consistent BER across a wide range of SNR values, whether the proposed channel estimator NN or the MMSE estimator is used.
As illustrated in Fig. 14a, the proposed AE(7,14) with the proposed channel estimator NN deviates from the perfect CSI case by only 0.8 dB at a BER of 10^-6. Narrowing down to imperfect CSI, the proposed AE outperforms the convolutional codes by 0.4 dB at a BER of 10^-6. Moreover, the proposed AE(7,14) outperforms the convolutional codes for a Log-normal fading channel by 0.7 dB at a BER of 10^-6, provided that both the convolutional codes and the AE use the LS channel estimator. In Fig. 14b, we further investigate this behavior for the standard AE and find that the AE in [17] utilizing the proposed channel estimator NN differs from the perfect CSI case by only 1.1 dB at a BER of 10^-4. At a BER of 10^-6, the proposed AE(7,14) achieves 1.6 dB better performance than the standard AE when both utilize the proposed channel estimator NN. Additionally, the proposed AE using the LS estimator outperforms the standard AE employing the same scheme by 2.1 dB at a BER of 10^-6. In contrast to the perfect CSI case, where the training peak intensity is A = 5, the training peak intensity is increased to A = 6 in the imperfect scenario.
Next, we evaluate the performance of the proposed AE(7,14) in the presence of a GG fading channel, as illustrated in Fig. 15. The proposed AE outperforms the convolutional codes and the standard AE by 1.5 dB and 3 dB, respectively, at a BER of 10^-6 with perfect CSI. It is worth noting that the training peak intensity is set to A = 7 for perfect CSI in the GG fading channel. Furthermore, under imperfect CSI, as illustrated in Fig. 16a, our proposed AE integrated with the proposed channel estimator NN following Algorithm 2 outperforms the convolutional codes employing our channel estimator NN by 1 dB at a BER of 10^-6. The proposed AE with imperfect CSI suffers only a marginal 0.9 dB performance degradation compared to the perfect CSI scenario. Also, the proposed AE(7,14) demonstrates superior performance compared to the convolutional codes across all SNR regimes in a GG fading channel when both employ the LS estimator, achieving a 0.8 dB improvement at a BER of 10^-6. Furthermore, in Fig. 16b, the proposed AE employing the LS estimator outperforms the standard AE utilizing the LS estimator by 2.1 dB at a BER of 10^-4. In addition, the proposed AE employing the proposed channel estimator NN has a 2.3 dB gain compared with the proposed AE utilizing the LS estimator at a BER of 10^-4.
The training peak intensity is increased to A = 8 in the imperfect CSI scenario for the GG fading channel.
As illustrated in Fig. 17, the proposed AE(7,14) has roughly learned an IM scheme with constellation points located at 0 and A = 4 for AWGN as well as perfect and imperfect CSI. Figure 17 is trained and tested at A = 4. The results presented in this section demonstrate that the proposed channel estimator NN outperforms the learning-based frameworks and the LS estimator in terms of MSE while performing as well as the MMSE estimator. The proposed AE at both 1/2 and 1/3 code rates has learned encoding and decoding functions that outperform convolutional codes with IM/DD and learning-based frameworks in terms of BER for AWGN as well as perfect and imperfect CSI.

VII. CONCLUDING REMARKS
This work presents a novel channel estimator NN that is optimized over a wide range of SNR levels in the training stage. The numerical results demonstrate that the proposed channel estimator NN outperforms learning-based frameworks and performs on par with the optimal MMSE estimator. Further, we propose AE-based detection for creating an end-to-end communication system for SOC over AWGN and fading channels with perfect and imperfect CSI at the receiver. The proposed AE employs multiple decoders and a stacked, BN-based structure for building the encoders and decoders. Compared to the state-of-the-art models, the proposed method facilitates training and reduces the computational complexity. To the best of our knowledge, this is the first time that AE-based detection has been shown to be superior to state-of-the-art convolutional codes in SOC. This study shows that the proposed AE holds considerable potential for future SOC systems that will benefit from more efficient coding, modulation, and decoding strategies. Future research will focus on evaluating the efficacy of the AE in a variety of contexts, including multiple access, broadcast, and relay-assisted SOC communications. Additionally, for effective training, it is vital to examine parallelizable AE structures that can take advantage of current parallel computing capabilities.

APPENDIX A
The error term d^[L] at the single-neuron output layer is defined as

d^[L] = ∂L/∂z^[L] = 2 (h − ĥ).

The vector d^[l] in the l-th layer is given as

d^[l] = ((Θ^[l+1])^T d^[l+1]) ⊙ ReLU′(z^[l]).
The gradient descent algorithm is employed in conjunction with backpropagation to solve the optimization problem in (12), reducing the loss function by updating the weights of the hidden and output layers.
Moreover, the proposed channel estimator NN makes use of the Adaptive Moment Estimation (Adam) optimizer. Adam is a technique for computing adaptive learning rates for each weight parameter. It keeps exponentially decaying averages of both past gradients and past squared gradients,

m_t = β_1 m_{t−1} + (1 − β_1) g_t,
v_t = β_2 v_{t−1} + (1 − β_2) g_t²,

where the first and second moment estimates are denoted by m_t and v_t, respectively, g_t is the current gradient, and the decay rates for the first and second moments are β_1 and β_2. With the bias-corrected estimates m̂_t = m_t/(1 − β_1^t) and v̂_t = v_t/(1 − β_2^t), the weight parameters are updated according to

θ_{t+1} = θ_t − η m̂_t / (√(v̂_t) + ε).

Finally, the weight updates stop whenever the difference in error between the two most recent epochs is negligible or the allocated number of epochs has been reached.
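The Adam update can be sketched directly (a minimal single-parameter illustration; β_1 = 0.9, β_2 = 0.999, and ε = 10^-8 are the common defaults, assumed here, with the paper's learning rate η = 10^-4):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying averages of past gradients (m) and past
    # squared gradients (v), bias-corrected, then a scaled update.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
```

Note that on the first step the bias correction makes the update magnitude approximately η, independent of the gradient scale.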

APPENDIX B A. PROOF OF THEOREM
The received element y can be given by

y = hA + w,

where h is the true channel coefficient based on the Log-normal fading channel and w ∼ N(0, 1) is AWGN. The criterion of the MMSE estimator is based on

ĥ = E[h | y] = ∫₀^∞ h f(h | y) dh,

where the posterior density f(h | y) is defined as

f(h | y) = f(y | h) f_h(h) / f(y),

and f_h(h) is the PDF of the Log-normal distribution. Furthermore, the PDF of the received element y can be denoted as

f(y) = ∫₀^∞ f(y | h) f_h(h) dh,

where f(y | h) follows a Gaussian distribution with mean μ = hA and unit variance,

f(y | h) = (1/√(2π)) exp(−(y − hA)²/2).

The Log-normal PDF f_h(h) is given by

f_h(h) = (1/(h σ √(2π))) exp(−(ln h − μ_x)²/(2σ²)), for h > 0.

Substituting these densities into the conditional expectation yields

ĥ = ∫₀^∞ h f(y | h) f_h(h) dh / ∫₀^∞ f(y | h) f_h(h) dh.
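The ratio of integrals above has no simple closed form, but it is easy to evaluate numerically on a grid (an illustrative sketch; the prior parameters σ = 0.25 and μ_x = −σ²/2, and the operating point A = 10, y = 10, are assumptions, not values from the paper):

```python
import numpy as np

def mmse_lognormal(y, A, sigma=0.25, n=20001):
    # h_hat = E[h|y] = ∫ h f(y|h) f_h(h) dh / ∫ f(y|h) f_h(h) dh,
    # with f(y|h) Gaussian (mean h*A, unit variance) and f_h(h) a
    # Log-normal prior; mu_x = -sigma^2/2 normalizes E[h] to 1.
    mu_x = -sigma ** 2 / 2
    h = np.linspace(1e-4, 5.0, n)
    prior = np.exp(-(np.log(h) - mu_x) ** 2 / (2 * sigma ** 2)) \
            / (h * sigma * np.sqrt(2 * np.pi))
    lik = np.exp(-(y - h * A) ** 2 / 2.0)
    post = lik * prior                      # unnormalized posterior
    return float(np.sum(h * post) / np.sum(post))  # grid spacing cancels

h_hat = mmse_lognormal(y=10.0, A=10.0)      # observation consistent with h ≈ 1
```

This kind of numerical conditional mean also illustrates why the MMSE estimator is computationally expensive compared with the single forward pass of the proposed NN.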

FIGURE 1 .
FIGURE 1. An overview of the system implementation for symbol detection and channel estimation over the SOC channel. The transmitter at the GEO satellite employs an encoder to convert a stream of k bits b into a codeword x_u of u coded symbols. The encoded vector x_u satisfies the positivity and peak-criterion conditions. The first symbol x_u(1) = x_p is assumed to be a pilot, which passes over a Log-normal fading channel verified from STK. At the receiver side (the ground station), the proposed channel estimator NN utilizes the first element y_u(1) = y_p of the received sequence y_u ∈ R^u in order to retrieve an estimated version of the channel gain ĥ. Afterwards, the multi-decoder AE makes use of ĥ and the received sequence y_u to derive an estimate of the transmitted symbols x̂_u and hence the recovered message b̂.

FIGURE 2 .
FIGURE 2. The implementation of the proposed NN used for channel estimation, located at the ground station receiver. The inputs are the received pilot y_p and the peak intensity A. The NN is composed of two FC hidden layers, with each neuron followed by a ReLU activation function. The output ĥ is an estimated version of the channel gain.

FIGURE 3. The proposed AE(k, u) architecture has a code rate of R = k/u, where k = 7 is the number of bits in the input message and u = 21 is the length of the encoded message. The encoder is located on a GEO satellite, while the receiver is based at a ground station. The message b is represented by the one-hot vector 1_b of length 2^k = 128. The input one-hot vector 1_b is passed through a sequence of dense layers in order to construct the encoded vector x_u of length u = 21. The normalization layer, the last layer of the transmitter, uses a weighted sigmoid A × sigmoid(·) to ensure that x_u lies inside the interval [0, A]. The input to the receiver is the corrupted vector y_u produced when the encoded vector x_u is transmitted across the SOC channel. The receiver is composed of three decoders, each of which independently estimates the one-hot vector of dimension 2^k. The first decoder's input vector r_1 of length u_1 = 7 is fed into multiple dense layers, and its output vector is denoted as o_1. Similarly, the second and third decoders map the vectors r_2 and r_3, of length 7 each, into the output vectors o_2 and o_3, respectively. The length of o_1, o_2, and o_3 is M = 2^k = 128. Each output vector o_1, o_2, and o_3 is multiplied by a learnable weight w_1, w_2, and w_3, respectively, then summed to produce the vector v. Finally, v is fed into a BN layer, and the estimated one-hot vector 1̂_b of dimension 2^k is output from the softmax activation layer.
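A minimal NumPy sketch of the described encoder/multi-decoder forward pass is given below. The single dense layer per stage, the random weights, and the per-vector normalization standing in for batch normalization are simplifying assumptions; the sketch only mirrors the dimensions and the weighted three-decoder combination.

```python
import numpy as np

rng = np.random.default_rng(1)
k, u, M, A = 7, 21, 2 ** 7, 4.0   # message bits, code length, alphabet size, peak intensity

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical weights; the shapes only mirror the described architecture.
We = rng.normal(size=(M, u)) * 0.1                            # encoder: one-hot (128) -> u symbols
Wd = [rng.normal(size=(u // 3, M)) * 0.1 for _ in range(3)]   # three decoders, 7 -> 128 each
wj = np.array([0.4, 0.3, 0.3])                                # learnable combining weights w1..w3

def encode(b):
    one_hot = np.zeros(M)
    one_hot[b] = 1.0
    return A * sigmoid(one_hot @ We)          # weighted sigmoid keeps x_u in [0, A]

def decode(y_u):
    o = [y_u[i * 7:(i + 1) * 7] @ Wd[i] for i in range(3)]    # decoder outputs o1, o2, o3
    v = sum(w * oi for w, oi in zip(wj, o))                   # weighted sum -> v
    v = (v - v.mean()) / (v.std() + 1e-8)                     # normalisation in place of BN
    return int(np.argmax(softmax(v)))                         # index of estimated one-hot

x_u = encode(5)                           # encode message index 5
y_u = x_u + 0.01 * rng.normal(size=u)     # channel (no fading here, light AWGN)
b_hat = decode(y_u)
```

With random (untrained) weights the decoded index b_hat is arbitrary; end-to-end training would drive it toward the transmitted message.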

FIGURE 4. Proposed decoder architecture at the receiver in the ground station.

FIGURE 5. Constellation points of training peak intensity A versus probability of occurrence: (a) Log-normal fading channel and (b) Gamma-Gamma fading channel.

FIGURE 6. The NMSE versus E_b/N_o of the proposed channel estimator NN compared with the MMSE and LS channel estimators and learning-based frameworks: (a) Log-normal fading channel and (b) Gamma-Gamma fading channel.

FIGURE 8. BER versus SNR for the proposed AE(7,21) compared to the convolutional codes using IM/DD and benchmark learning frameworks for code rate 1/3 in a SOC channel with σ = 0.3 for a Log-normal fading channel with perfect CSI.

FIGURE 9. BER versus SNR for the proposed AE(7,21) compared to the convolutional codes using IM/DD and benchmark learning frameworks for code rate 1/3 in a SOC channel with σ = 0.3 for a Log-normal fading channel with imperfect CSI.

FIGURE 10. The BER versus SNR of the AE(7,21)-based detection in the presence of imperfect CSI against: (a) convolutional codes employing IM/DD and (b) benchmark learning frameworks for a SOC channel at a code rate of 1/3.

FIGURE 11. BER versus SNR for the proposed AE(7,14) compared to the convolutional codes using IM/DD and benchmark learning frameworks for code rate 1/2 over an AWGN SOC channel.

FIGURE 12. BER versus SNR for the proposed AE(7,14) compared to the convolutional codes using IM/DD and benchmark learning frameworks for code rate 1/2 in a SOC channel with σ = 0.3 for perfect CSI.

FIGURE 13. BER versus SNR in the presence of imperfect CSI for the proposed AE(7,14) compared with the convolutional codes employing IM/DD for code rate 1/2 in a SOC channel.

FIGURE 14. The BER versus SNR of the AE(7,14)-based detection in the presence of imperfect CSI against: (a) convolutional codes employing IM/DD and (b) benchmark learning frameworks for a SOC channel at a code rate of 1/2.

FIGURE 15. BER versus SNR for the proposed AE(7,14) compared to the convolutional codes using IM/DD and benchmark learning frameworks for code rate 1/2 in a SOC channel with Gamma-Gamma fading for perfect CSI.

FIGURE 16. The BER versus SNR of the AE(7,14)-based detection in the presence of imperfect CSI over a Gamma-Gamma fading channel against: (a) convolutional codes employing IM/DD and (b) benchmark learning frameworks for a SOC channel at a code rate of 1/2.

FIGURE 17. Constellation points against relative frequency developed by the proposed AE(7,14) with AWGN, perfect and imperfect CSI for peak intensity A = 4.