Blind Equalization and Channel Estimation in Coherent Optical Communications Using Variational Autoencoders

We investigate the potential of adaptive blind equalizers based on variational inference for carrier recovery in optical communications. These equalizers are based on a low-complexity approximation of maximum likelihood channel estimation. We generalize the concept of variational autoencoder (VAE) equalizers to higher-order modulation formats encompassing probabilistic constellation shaping (PCS), ubiquitous in optical communications, oversampling at the receiver, and dual-polarization transmission. Besides black-box equalizers based on convolutional neural networks, we propose a model-based equalizer based on a linear butterfly filter and train the filter coefficients using the variational inference paradigm. As a byproduct, the VAE also provides a reliable channel estimation. We analyze the VAE in terms of performance and flexibility over a classical additive white Gaussian noise (AWGN) channel with inter-symbol interference (ISI) and over a dispersive linear optical dual-polarization channel. We show that it can extend the application range of blind adaptive equalizers by outperforming the state-of-the-art constant-modulus algorithm (CMA) for PCS, for both fixed and time-varying channels. The evaluation is accompanied by a hyperparameter analysis.


I. INTRODUCTION
The digital transformation, along with the modern lifestyle and the advent of video streaming platforms, has brought up a strong demand for high-speed and highly flexible communication systems. Precisely, the required data rates can only be provided by coherent optical communication systems along with high-order modulation formats and probabilistic constellation shaping (PCS) [2]. Due to its properties, e.g., easy rate adaption [3], a decreased gap to the additive white Gaussian noise (AWGN) channel capacity, increased energy efficiency [4], and a larger tolerance against fiber nonlinearities, PCS has become an essential ingredient of modern coherent optical communication systems [5], [6]. However, the use of PCS entails a more challenging carrier recovery than conventional square quadrature amplitude modulation (QAM) formats. Often, data-aided or pilot-based algorithms are the only option nowadays, since an open issue in communications is the lack of optimum (but practical) blind adaptive channel equalizers. However, pilot symbols cannot transport information, so they reduce the data rate and limit the achievable net bit rate significantly. Hence, there is a strong need for blind channel equalizers which can adapt to time-varying channels and transmission parameters. The saved data rate can be used to either increase the throughput or the forward error correction (FEC) overhead.

(This work was carried out in the framework of the CELTIC-NEXT project AI-NET-ANTILLAS (C2019/3-3) and was funded by the German Federal Ministry of Education and Research (BMBF) under grant agreement 16KIS1316. Parts of this paper have been presented at the Advanced Photonics Congress, Signal Processing in Photonic Communications (SPPCom), 2021 [1].)
In coherent optical communications, the standard algorithm for blind adaptive equalization of linear channels and symmetric complex modulation formats is the constant-modulus algorithm (CMA) [7]. It tries to reach a constant signal amplitude (radius) by adaptively equalizing the signal with trainable finite impulse response (FIR) filters. Thus, it is optimal for constant-amplitude formats such as M-ary phase shift keying (PSK), but it also converges for multi-amplitude formats such as M-ary QAM [8], where its criterion is sub-optimal. The multi-modulus algorithm (MMA) [9] is an extension for multi-amplitude formats; however, it suffers from its high implementation complexity and low convergence rate. Based on the same criterion, a non-linear blind neural network (NN) based equalization scheme was proposed in [10]. Since the CMA's criterion is independent of the signal phase, detection is only possible in combination with a carrier-phase estimation (CPE) block. The commonly used algorithms are the blind phase search [11] or the Viterbi-Viterbi algorithm [12], which both face performance degradation for PCS [13], [14]. Similarly, the CMA suffers from convergence issues for PCS as well [15], [16].
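To illustrate the CMA's working principle described above, the following minimal sketch adapts a single complex FIR equalizer toward a unit target modulus on a toy ISI channel. It is not the dual-polarization butterfly implementation used as reference later; all names, the channel, and the step size are our own choices.

```python
import numpy as np

def cma_step(w, x, mu, R2=1.0):
    """One symbol-wise CMA update of a single complex FIR equalizer.
    w: filter taps, x: current input window (same length as w),
    R2: target squared modulus (dispersion constant)."""
    y = np.dot(w, x)                        # equalizer output
    e = y * (np.abs(y) ** 2 - R2)           # constant-modulus error term
    return w - mu * e * np.conj(x), y       # stochastic-gradient tap update

rng = np.random.default_rng(0)
# unit-modulus QPSK symbols through a mild 2-tap ISI channel
s = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=20000) / np.sqrt(2)
r = s + 0.25 * np.roll(s, 1)

w = np.zeros(5, dtype=complex)
w[2] = 1.0                                  # identity (Dirac) initialization
errs = []
for n in range(4, len(r)):
    w, y = cma_step(w, r[n - 4:n + 1][::-1], mu=5e-3)
    errs.append(abs(np.abs(y) ** 2 - 1.0))
```

After adaptation, the average deviation of the output modulus from the target radius shrinks, which is exactly the CMA criterion; note that the output phase remains ambiguous, motivating the CPE stage mentioned above.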
Optimally, we want to use the maximum likelihood (ML) criterion, which has been considered for blind equalization, e.g., in [17], [18], [19]. However, we are not aware of any blind ML-based channel equalizer which has been seriously considered in real coherent optical communication systems. A promising approach is to approximate ML by variational inference via a variational autoencoder (VAE) [20], [21]. Variational inference is used for unsupervised and semi-supervised learning as well as generative models; however, there are not many applications in communications, with notable exceptions being [22], [23], [24]. While [22] trains end-to-end transmission systems without inter-symbol interference (ISI) in a supervised manner, we, in contrast, focus on blind VAE-based equalization, where we use unsupervised learning at the receiver to adjust the equalizer weights. Such an unsupervised equalizer implementation has initially been presented in [23] and further extended in [24] towards unsupervised low-density parity-check (LDPC) decoding. However, only a simple quadrature PSK (QPSK) implementation was used. In this work, we generalize the approach of [23], [24] and propose essential extensions, including the application to oversampled dual-polarization (DP) signals and multi-amplitude PCS formats. We show that the generalized approach is independent of the equalizer architecture and train both a convolutional neural network (CNN) based equalizer as in [23], [24]-the VAE-NN-and a novel linear model-based equalizer with butterfly FIR filters-the VAE-LE. We evaluate the performance of the proposed equalizers on different linear channels and propose an extension for slowly time-varying channels.
(arXiv:2204.11776v2 [eess.SP], 15 Sep 2022)
This paper is structured as follows: in Sec. II, we introduce our system model, in Sec. III, we motivate variational inference for equalization, derive the VAE-based equalizer in a general form and explain how the proposed extensions can be incorporated. In Sec. IV, we discuss the implementation of the equalizers and propose an appropriate parameter update scheme, before we introduce our simulation environment and our results in Sec. V. We conclude the paper in Sec. VI.

II. SYSTEM MODEL
We start with the demonstration of the basic concept on a simple AWGN channel with ISI, where the transmit vector is convolved with the simulated channel impulse response (IR) and a white noise vector is added. Then, we focus on a dispersive linear optical dual-polarization transmission to prove the concept in a practical environment (coherent optical transmission), using the more natural description by a linear channel matrix in the frequency domain. Per se, the optical channel is nonlinear, but the linear distortions are dominant in practical systems and have to be compensated, whereas nonlinearity compensation is computationally demanding and usually provides a signal-to-noise ratio (SNR) gain of less than 1 dB [25, Sec. 6.9.3]. Hence, we focus in this paper on linear impairments and assume potential nonlinearities to be either negligible or compensated by a separate digital signal processing (DSP) block, e.g., based on digital backpropagation, which can be switched on if required.

Further details on the simulation model including the specific parameters are provided later in Sec. V.

III. VARIATIONAL INFERENCE FOR EQUALIZATION
The goal of communications is to transmit data to a receiver, which has to fully recover the information without knowledge of the actually transmitted data. This can be interpreted as an inference problem, where the received samples y ∈ C^N' are observed variables while the transmitted symbols x ∈ C^N are unobservable latent variables. The optimum decision is based on the maximum of the a posteriori probability distribution [26, Ch. 4.1]

P(x|y) = p(y|x) P(x) / p(y),

where p(y|x) is the likelihood of y given x, P(x) is the prior probability, and p(y) is the observations' marginal density (also called the evidence). Throughout the paper, we denote probability mass functions (pmfs) by a capital P(·) and continuous densities by a lower-case p(·). While the prior and the likelihood can usually be assumed as known or modeled well, the evidence is commonly intractable to compute, since the marginalization's complexity grows exponentially with the length and symbol order of x, i.e., p(y) = Σ_x P(x) p(y|x).
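The intractability of the evidence can be made concrete with a toy brute-force marginalization; the example below (symbol names and values are our own choices) sums over all |A|^N BPSK hypotheses for a short block over a 2-tap ISI channel, which already requires 2^N likelihood evaluations.

```python
import itertools
import numpy as np

def evidence(y, h, sigma2, alphabet, N):
    """Brute-force evidence p(y) = sum_x P(x) p(y|x) with a uniform prior.
    Purely illustrative: the sum runs over |alphabet|**N hypotheses, so its
    cost grows exponentially with the block length N."""
    total = 0.0
    for x in itertools.product(alphabet, repeat=N):
        mean = np.convolve(np.array(x), h)[:N]  # ISI channel applied to x
        # circularly-symmetric complex Gaussian likelihood p(y|x)
        total += np.prod(np.exp(-np.abs(y - mean) ** 2 / sigma2) / (np.pi * sigma2))
    return total / len(alphabet) ** N           # uniform prior weight

h = np.array([1.0, 0.3])
sigma2, N = 0.1, 6
rng = np.random.default_rng(1)
x_true = rng.choice([-1.0, 1.0], size=N)
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = np.convolve(x_true, h)[:N] + noise
p_y = evidence(y, h, sigma2, [-1.0, 1.0], N)
```

For N = 6 this is 64 terms; for a realistic block of hundreds of 64-QAM symbols the sum is hopeless, which is precisely what motivates the variational approximation.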
In statistics, this is a common problem which can be solved by variational inference. It is also used in machine learning when the conditional has to be approximated efficiently and reliably [21]. In particular, this is the case in our problem, where we usually require fast convergence to cope with time-dependent distortions. The main idea is to cast inference into an optimization problem, where the goal is to find an approximation q(x|y) ∈ Q of the true a posteriori pmf P(x|y) from a family of approximate pmfs Q over the latent variables, parameterized by free variational parameters [21].

A. The Evidence Lower Bound (ELBO)
A suitable objective function is the relative entropy D_KL(q ‖ P), also called the Kullback-Leibler (KL) divergence, which is an information-theoretical measure of proximity. It is asymmetric, non-negative, and convex with its minimum at q = P [27, Ch. 2]. Then, the optimization's goal is to find the best approximation to the true a posteriori probability for the observed variables by

q̂(x|y) = argmin_{q ∈ Q} D_KL( q(x|y) ‖ P(x|y) ).

With E_q{·} = E_{q(x|y)}{·} being the expectation regarding the variational approximation, the KL divergence can be expressed as

D_KL( q(x|y) ‖ P(x|y) ) = E_q{ln q(x|y)} − E_q{ln p(y|x)} − E_q{ln P(x)} + ln p(y) = −ELBO(q) + ln p(y).   (1)

Since the KL divergence depends on ln p(y), it is not easily computable and thus not suitable as objective function. However, the evidence is independent of q(x|y), so the last term in (1) is only an additive constant regarding the optimization. Hence, maximizing the evidence lower bound

ELBO(q) = E_q{ln p(y|x)} − D_KL( q(x|y) ‖ P(x) ) = B − A   (2)

is equivalent to minimizing (1). This mirrors the usual balance between likelihood and prior, since the ELBO's first term B, the expected likelihood, favors densities which explain the observed data, while the second term A encourages the densities to be close to the prior [21]. The complexity of this optimization is defined by the complexity of the family Q.
By rearranging (1) and due to the KL divergence's non-negativity, we can show that the ELBO lower-bounds the (log-)evidence, i.e.,

ln p(y) = ELBO(q) + D_KL( q(x|y) ‖ P(x|y) ) ≥ ELBO(q).

Since ln p(y) is a fixed upper bound, we sandwich D_KL( q(x|y) ‖ P(x|y) ) by maximizing the ELBO, so we eventually minimize the KL divergence and find a good approximation q(x|y) for the true a posteriori probability.
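The sandwich relation above can be verified numerically on a two-state toy model (our own construction: a binary latent with 1-D Gaussian likelihoods): the ELBO of any distribution q lower-bounds ln p(y), with equality exactly at the true posterior.

```python
import numpy as np

# Toy model: binary latent x with prior P(x) and 1-D Gaussian likelihoods p(y|x)
P = np.array([0.7, 0.3])                       # prior P(x)
means = np.array([-1.0, 1.0])
var = 0.5
y = 0.4
lik = np.exp(-(y - means) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
evidence = np.sum(P * lik)                     # p(y) = sum_x P(x) p(y|x)
post = P * lik / evidence                      # true posterior P(x|y)

def elbo(q):
    # ELBO(q) = E_q[ln p(y|x)] - D_KL(q || P), cf. (2)
    return np.sum(q * np.log(lik)) - np.sum(q * np.log(q / P))

q_arbitrary = np.array([0.5, 0.5])
```

Evaluating `elbo(q_arbitrary)` gives a value strictly below ln p(y), while `elbo(post)` matches it, mirroring the equality case D_KL = 0.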
The concept can also be interpreted from a communication theory perspective, with the likelihood p(y|x) as a probabilistic encoder and the a posteriori distribution-or its variational approximation q(x|y), respectively-as the corresponding decoder. Precisely, during transmission, the data (latent variables) is encoded into the observable receive samples y, while the receiver tries to decode the transmitted data again, estimating x from y. Assuming that the densities come from families of parametric distributions, the concept can be implemented as a variational autoencoder (VAE) using machine learning techniques, where typically both encoder and decoder are implemented as NNs [20], [24]. However, if we have a suitable model, e.g., of the encoder p(y|x), we do not have to apply an NN to learn it but can use the model directly, as done in the following subsection.

B. The Variational Autoencoder (VAE)-based Equalizer
We assume a general transmission system through a (parameterized) channel, as depicted in Fig. 1. Then, the evidence p_θ(y) as well as the likelihood p_θ(y|x) are parameterized by the channel parameters θ, while the variational approximation q_φ(x|y) can be parameterized by a set of learnable parameters φ, as denoted by the corresponding subscripts.
In other words, the channel (respectively the encoder) distorts the transmitted signal, while the decoder tries to find the mapping from the distorted received samples back to the transmitted data. Thus, finding the optimum variational approximation corresponds to finding the optimized equalizer for this channel. Furthermore, q_φ(x|y) gives a soft decision on the received symbols, so the VAE-based equalizer also approximates an ML receiver [24].
Note that the exact values of θ are unknown, so the proposed equalization concept is blind and the channel model parameters are learned simultaneously with the decoder. In fact, the evidence p_θ(y) = p(y|θ) can also be interpreted as the likelihood regarding the channel parameters, so variational inference also approximates maximum likelihood channel estimation. This byproduct can be used, e.g., for joint communication and sensing. See, e.g., [28] for an example of capturing acoustic signals based on the channel IR.
In the following, we derive the VAE-based equalizer in a generalized form compared to [24], where it is only derived for a toy model with QPSK. We try to keep repetitions as short as possible, but it is unavoidable at some points to highlight the generalizations we did. We start by assuming transmission over an AWGN channel parameterized by θ = (h, σ_w²) with finite IR h and noise variance σ_w². We can model the likelihood as

p_θ(y|x) = (1/(π σ_w²))^N' · exp( −‖y − h ∗ x‖² / σ_w² ),   (3)

i.e., a circularly-symmetric complex Gaussian with mean h ∗ x. Further, we consider transmission of independently modulated square-M-QAM symbols x_i, so x = (x_1, . . . , x_N) = x_I + j x_Q is a vector of complex-valued symbols. Assuming further that x_I and x_Q have been modulated independently, then x_{i,I}, x_{i,Q} ∈ A = {a_1, . . . , a_√M} are conditionally independent given y. Consequently, we can model q_φ(x|y) componentwise and define a vector of approximate symbol probabilities, which only depends on y and φ. Although we have a similar decoder model as [24], we consider multi-level signals and, thus, cannot simplify further by exploiting the normalization of probabilities.
Then, similarly to [24], we create a minimization problem by defining the loss function L(φ, θ, y) := −ELBO(q) = A − B (see (2)), which depends on both parameter spaces, φ and θ, as well as the received samples y.
The first term A can be easily computed by the standard formula of the KL divergence, i.e.,

A = D_KL( q_φ(x|y) ‖ P(x) ) = E_q{ ln( q_φ(x|y) / P(x) ) }.   (5)

Since the term A does not break down to the entropy as in [24] (due to the assumption of a uniform prior pmf there), an important feature is the inclusion of the prior density P(x) into the loss function, which implies the adaption to, e.g., PCS [2], [3]. Since state-of-the-art blind equalizers struggle with non-uniform priors [15], [16], [29], this simple inclusion is one of the major benefits of this concept. The second term B can be re-written, similarly to [24], as

B = E_q{ ln p_θ(y|x) } = −N' ln(π σ_w²) − (1/σ_w²) · C,  with  C := E_q{ ‖y − h ∗ x‖² },

so B tries to find the best channel estimate regarding the least squares of the observation y to the prediction h ∗ x̂, also referred to as autoencoder distortion [24]. Further, with the operator Re{·} returning the real part of a complex number, (·)^H denoting a vector's conjugate transpose, (·)^T the transpose, and assuming that y, x, and h are column vectors, C becomes

C = ‖y‖² − 2 Re{ y^H (h ∗ E_q{x}) } + E_q{ ‖h ∗ x‖² },

where, in contrast to [24], we keep the vector notation, which helps to identify an efficient implementation, and we also have to compute the expectations for any M-QAM symbol from the VAE-based equalizer's output by (D ∈ {I, Q})

E_q{ x_{i,D} } = Σ_{a ∈ A} a · q_φ(x_{i,D} = a | y).   (7)

Similarly to [24], we find an analytical solution for σ_w² by partially differentiating the loss function and equating it to zero. In fact, A (see (5)) does not depend on σ_w², and setting ∂B/∂σ_w² = 0 yields

σ̂_w² = C / N'.   (8)

Hence, inserting (8) into the loss function and omitting all additive constants yields a loss proportional to A + N' ln C. Here, we emphasize that the equalizer can also be designed for any integer oversampling factor N_os. If the equalizer incorporates downsampling, e.g., by convolution with stride N_os, all vectors can be defined accordingly. However, the size of the expectation vectors then does not match the size of the observations anymore, so the term C would not be computable.

[Fig. 2: Complex-valued 2 × 2 multiple-input multiple-output (MIMO) system]
Since the loss is summed over all samples, we can simply match the vectors by inserting (N_os − 1) zeros between consecutive samples of the expectation vectors E_q{x}.
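This zero-insertion is a one-line operation; the sketch below (function name is our own) upsamples an expectation vector so it aligns with an observation vector at N_os samples per symbol.

```python
import numpy as np

def upsample_expectations(e, n_os):
    """Insert (n_os - 1) zeros between consecutive entries of e so the
    expectation vector matches the length of the oversampled observation."""
    out = np.zeros(len(e) * n_os, dtype=e.dtype)
    out[::n_os] = e   # place the symbol-rate expectations on every n_os-th slot
    return out

e = np.array([1.0, 2.0, 3.0])
u = upsample_expectations(e, 2)
```

Here `u` equals `[1, 0, 2, 0, 3, 0]`, i.e., one expectation per symbol followed by (N_os − 1) zeros.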

C. Extension Towards Coherent Optical Communication Systems
Light always travels as a combination of two orthogonal polarizations, which can be modulated independently but rotate into each other during propagation through a standard fiber. In combination with other effects like polarization mode dispersion (PMD) and chromatic dispersion (CD), the receiver observes a superposition of both polarizations, similar to a classical multiple-input multiple-output (MIMO) channel with cross-talk. Although we focus on DP systems in this work, the proposed VAE-based equalization scheme can be extended towards any kind of MIMO system accordingly.
In the DP case, the transmitted and received sequences consist of two components, one per polarization. Considering the orthogonality of the polarizations, we can model the likelihood and derive the loss function similarly to the AWGN channel case, which yields one loss contribution per polarization. In principle, the losses per polarization are calculated similarly to the proposed case for a single channel. However, in order to incorporate cross-talk between the polarizations, the mean of the likelihood's circularly-symmetric normal distribution is no longer h ∗ x̂ (see (3)), but depends on the superposition of both polarizations. In this work, we implement it as a complex-valued 2 × 2 MIMO system as depicted in Fig. 2, which is based on a physical model [30], [31]. Alternatively, a real-valued 4 × 4 system can be implemented, which has additional degrees of freedom to compensate transceiver impairments.

IV. REALIZATION OF THE EQUALIZER
Commonly in the machine learning community, the VAE's encoder p_θ(y|x) and decoder q_φ(x|y) are implemented as NNs with the parameters θ and φ, but this is not a requirement.
In the application of the VAE concept to communications, the transmission channel forms the encoder, while the decoder can be either an NN, as in [24], or an FIR filter system with a soft-demapping block, as proposed in this work.
In comparison to the FIR filter system, the NN
• carries out a classification task and, hence, combines equalizer and demapper,
• is potentially capable of compensating non-linearities,
• requires more learnable parameters since its dimensionality depends on the modulation order,
• comprises more hyperparameters which have to be tuned,
• provides no access to the output constellation (since it only outputs the approximate probabilities q_φ(x|y)), which may prevent the inclusion into state-of-the-art DSP chains.
In Fig. 3, we show the block diagrams of the investigated adaptive equalizers. The VAE-NN employs a CNN with two one-dimensional convolutional layers as in [24]. Adaptions are necessary to the final layer, namely exchanging the sigmoid with a softmax to transform the multi-level output into probabilities, and we apply an exponential linear unit (ELU) instead of a softsign activation to the first layer, which provides better results. If applicable, a stride in the final layer downsamples the output. We found that the second layer's kernel size can be fixed to a small value (3 to 5), while the first layer's kernel size remains a hyperparameter. Precisely, the first layer's kernel size and the length of the estimated channel impulse response can be any odd integer (to ensure symmetry around a major central tap), which we optimized during our simulations. Typical values for both have been around 29 (for N_os = 2) and 11 (for N_os = 1). All equalizers and the simulation environment are implemented in Python with the PyTorch library.
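The described two-layer CNN with ELU, softmax output, and downsampling stride can be sketched in PyTorch as follows; the class name, hidden width, and exact tensor layout are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VAENN(nn.Module):
    """Sketch of a two-layer 1-D CNN equalizer.

    Input : (batch, 2, n_os * N) real/imag parts of the received samples.
    Output: (batch, 2, levels, N) probabilities over the sqrt(M) I- and
    Q-levels of each of the N symbols; stride n_os downsamples to symbol rate.
    """
    def __init__(self, levels=8, k1=29, k2=3, hidden=32, n_os=2):
        super().__init__()
        self.levels = levels
        self.conv1 = nn.Conv1d(2, hidden, k1, padding=k1 // 2)
        self.act = nn.ELU()                      # ELU instead of softsign
        self.conv2 = nn.Conv1d(hidden, 2 * levels, k2, stride=n_os, padding=k2 // 2)

    def forward(self, y):
        h = self.act(self.conv1(y))
        logits = self.conv2(h)                   # (batch, 2*levels, N)
        b, _, n = logits.shape
        logits = logits.view(b, 2, self.levels, n)
        return torch.softmax(logits, dim=2)      # softmax: probabilities per level

net = VAENN()
q = net(torch.randn(1, 2, 2 * 100))              # 100 symbols at 2 sps
```

The softmax over the level dimension yields valid per-symbol pmfs for I and Q, as required by the loss terms A and (7).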

A. The Proposed VAE-LE Scheme
We further propose the VAE-LE scheme, which is based on a classical 2 × 2 butterfly equalizer system with complex-valued FIR filters as depicted in Fig. 2. It uses the same filter system with N_tap taps per filter as the reference CMA, which allows the integration into state-of-the-art DSP chains [31]. Note, however, that the computation of the cost function may not be as simple as for the CMA and needs to be adapted, as it requires a soft-demapper output and backpropagation through the latter as well as possibly some further DSP blocks. Future work may take into account backpropagation through consecutive, possibly non-differentiable DSP blocks (see, e.g., [32] for an example) or the complexity reduction of the update algorithm. Precisely, the VAE-LE uses one filter system for the equalization, whose weights correspond to the variational parameters φ in the derivation above, and a second, similar filter system as channel model for the estimation. Both parameter sets are independent, but they are trained simultaneously. We also initialize both similarly, i.e., we only initialize the real part of the filters with a 1 at the center tap, while all other taps (including the imaginary parts) are zero (Dirac initialization). Similarly to the CMA, if used along with soft-decision FEC, the VAE-LE also requires a soft-demapping block to transform the constellation output into the variational approximations. The implemented structure of the VAE-LE for a DP system is displayed in Fig. 4, which is representative for the implementation of the other equalizers.
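The butterfly filter structure and the Dirac initialization can be sketched as below (a minimal NumPy forward pass; tap count, filter naming `xx/xy/yx/yy`, and the use of `np.convolve` are our own simplifications of the trainable PyTorch filters).

```python
import numpy as np

def dirac_init(n_tap):
    """Dirac initialization: real part 1 at the center tap, all else zero."""
    w = np.zeros(n_tap, dtype=complex)
    w[n_tap // 2] = 1.0
    return w

def butterfly(y_x, y_y, w):
    """2x2 complex butterfly equalizer: each output polarization is the sum
    of two FIR-filtered inputs (w holds the four tap vectors)."""
    z_x = np.convolve(y_x, w["xx"], "same") + np.convolve(y_y, w["xy"], "same")
    z_y = np.convolve(y_x, w["yx"], "same") + np.convolve(y_y, w["yy"], "same")
    return z_x, z_y

w = {"xx": dirac_init(5), "yy": dirac_init(5),
     "xy": np.zeros(5, dtype=complex), "yx": np.zeros(5, dtype=complex)}
rng = np.random.default_rng(0)
y_x = rng.standard_normal(50) + 1j * rng.standard_normal(50)
y_y = rng.standard_normal(50) + 1j * rng.standard_normal(50)
z_x, z_y = butterfly(y_x, y_y, w)
```

With the Dirac initialization, the equalizer starts as an identity (no mixing, no filtering), which is a common safe starting point before gradient-based adaptation.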
We assume the general case of square-M-QAM transmission, where the modulation symbols' prior pmf follows a Maxwell-Boltzmann distribution with normalization constant Z and shaping parameter λ ≥ 0, i.e., P(x_i = a) = (1/Z) exp(−λ|a|²), which is capacity-achieving for a Gaussian likelihood [4]. We translate this prior to the soft-demapping operation and adapt it to the M-QAM case.
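For concreteness, the Maxwell-Boltzmann prior over a 64-QAM alphabet can be computed as follows (an illustrative sketch; the shaping parameter value and unnormalized integer grid are our own choices).

```python
import numpy as np

def maxwell_boltzmann(const, lam):
    """P(a) proportional to exp(-lam * |a|^2) over the alphabet; lam = 0
    recovers the uniform distribution."""
    p = np.exp(-lam * np.abs(const) ** 2)
    return p / p.sum()   # division by the sum is the normalization constant Z

# 64-QAM constellation on the unnormalized integer grid {-7, -5, ..., 7}^2
levels = np.arange(-7, 8, 2)
const = np.array([i + 1j * q for i in levels for q in levels])
p = maxwell_boltzmann(const, 0.02)
```

Larger λ concentrates probability on the low-energy inner symbols, which is exactly the shaping that makes the constellation approach a Gaussian input distribution.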

B. Parameter Update Schemes
Next, we show how to adapt the VAE's training procedure to resemble the online update of classical gradient-descent-based equalizers (such as the CMA) and to enable tracking of time-varying channels. Instead of separating the dataset into training, test, and validation sets as in classical supervised machine learning systems, we can directly train on the same data which we evaluate. Therefore, we continuously buffer the received data stream, slice it into consecutive mini-batches (of length N_B · N_os samples) and feed them to the equalizer. Then, an Adam optimizer [33] constantly updates the weights after each mini-batch. For the VAE-LE, the equalizer outputs all N_B equalized symbols and the soft-demapper translates them into the corresponding variational approximations, while the VAE-NN's CNN directly outputs the N_B approximated probabilities.
For both, we start the slicing of the next batch at the end of the former, so there are consecutive slices without any gap or overlap in between. However, there is no requirement for having no overlap, so we can also start the next slice only N_flex · N_os (instead of N_B · N_os) samples after the start of the former slice and, thus, reduce the equalizer output from N_B to N_flex symbols. This results in an overlap of (N_B − N_flex) · N_os samples. Hence, each sample is considered for ⌊N_B/N_flex⌋ consecutive update steps and, if N_B/N_flex is not an integer, some samples are also considered for a further update step. This boosts convergence speed at the cost of computational complexity due to more frequent updates. We call this generalized implementation VAEflex and introduce it for the evaluation of the time-varying channel. Throughout this work, we focus on adaptive channel equalization assuming an infinitely long random data sequence. Hence, we do not have to worry about overfitting during training for a specific channel and data sequence, although time-varying effects require a continuous re-adaptation of the filters.
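The overlapping-slice arithmetic above can be made explicit with a small helper (function and parameter names are our own): consecutive batches of N_B·N_os samples whose start indices advance by only N_flex·N_os samples.

```python
def flex_slices(n_samples, n_b, n_flex, n_os=1):
    """Start indices of consecutive mini-batches of n_b*n_os samples that
    advance by only n_flex*n_os samples, so batches overlap by
    (n_b - n_flex)*n_os samples and each sample is reused ~n_b/n_flex times."""
    step = n_flex * n_os
    width = n_b * n_os
    return list(range(0, n_samples - width + 1, step))

starts = flex_slices(n_samples=100, n_b=20, n_flex=5, n_os=1)
overlap = (20 - 5) * 1   # samples shared by two consecutive batches
```

With N_B = 20 and N_flex = 5, each sample enters 4 consecutive update steps, quadrupling the update frequency relative to non-overlapping slicing.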
The batch-wise update incorporates an implicit averaging of the loss over N_B · N_os samples, while the CMA updates the filter taps after each processed symbol by gradient descent as in its standard implementation [7], [8]. We observe that the VAE-LE with batch-wise training, which only updates every N_B symbols, had a significantly shorter computation time on a standard laptop's CPU than the CMA with its symbol-wise update.

[Fig. 5: Channel IR (from [23], [24]) and estimates by the VAE-LE at 20 dB for N_os = 1 and N_os = 2 sps (without pulse shaping).]
The constant modulus criterion is not phase sensitive, so the CMA can only equalize the amplitude and requires a consecutive CPE stage, which we implement using the Viterbi-Viterbi algorithm [12] with averaging over 501 symbols. This CPE stage has, to the best of our knowledge, not been considered in [23], [24] (or was insufficient), which could explain the severe observed symbol error rate (SER) penalties in their QPSK simulations. In fact, we would expect significantly better results for a PSK-transmission over an AWGN channel, for which the CMA is well suited, especially if sufficiently long data sequences are used.

V. RESULTS
We evaluated the proposed equalizers both for simulations of a simple AWGN channel with ISI and an optical DP transmission system as well as a time-varying channel as introduced in Sec. II. The source code is available online [34].

A. Simulation Environment
Unless stated differently, our transmitter model consists of a source, which outputs a random sequence of M-QAM symbols, and a root-raised cosine (RRC) pulse-shaping filter with roll-off factor 0.1. We use an oversampling factor of N_os = 2 sps throughout our simulations and incorporate the downsampling into the equalizer. Since we omit matched filtering (see [26, Ch. 9]) but expect the equalizers to learn it, the receiver faces ISI from the RRC pulse shaping in addition to the simulated channel.
In the AWGN channel, we convolve the transmitted sequence with the complex-valued channel IR already used in [23], [24] (also shown in Fig. 5), which we oversample by inserting (N_os − 1) zeros between consecutive samples and interpolate by convolving it with the RRC pulse. We add real-valued AWGN with a variance of σ_w²/2 on both the real and imaginary part and take the oversampling into account for the SNR calculation. Monte-Carlo simulations of an AWGN channel without ISI provide a baseline SER per SNR and modulation format, which we denote by "No ISI" in the result plots.
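A minimal version of this AWGN-ISI channel can be sketched as follows; the SNR-to-variance convention (scaling the per-sample noise variance by N_os under unit symbol energy) is our assumption of how the oversampling enters the SNR calculation.

```python
import numpy as np

def awgn_isi_channel(x, h, snr_db, n_os=1, rng=None):
    """Apply the channel IR h and add complex AWGN. Minimal sketch assuming
    unit average signal power; with n_os samples per symbol, the per-sample
    noise variance is scaled by n_os so the SNR refers to the symbol energy."""
    if rng is None:
        rng = np.random.default_rng()
    r = np.convolve(x, h, mode="same")
    sigma2 = n_os * 10.0 ** (-snr_db / 10.0)
    # variance sigma2/2 on both real and imaginary part, as in the paper
    w = np.sqrt(sigma2 / 2) * (rng.standard_normal(len(r))
                               + 1j * rng.standard_normal(len(r)))
    return r + w

rng = np.random.default_rng(0)
x = np.ones(100000, dtype=complex)              # unit-power dummy signal
r = awgn_isi_channel(x, np.array([1.0]), snr_db=10, n_os=1, rng=rng)
```

With a trivial single-tap IR and 10 dB SNR, the empirical noise power of `r - x` comes out near 0.1, confirming the variance convention.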
Our optical DP transmission model follows [35] and [36] and includes a static (input and output) IQ-phase shift φ_IQ and a static rotation of the reference polarization to the fiber's principal states of polarization (PSPs), called HV-phase shift φ_hv. Additionally, we simulate both first-order PMD, caused by the differential group delay Δτ_pmd = D_pmd √L_pmd between the PSPs over a fiber length L_pmd, and residual CD, which is defined by the fiber's group velocity dispersion (GVD) parameter [37, Ch. 2.3] times the uncompensated fiber length L_cd. We display all parameter values in Table I.
By assuming, similarly to [35], that all non-linear effects and phase noise are either negligible or compensated, and that we transmit a complex-valued DP signal, the static rotation can be described by the 2 × 2 matrix

M = e^{−jφ_IQ} [ cos φ_hv, sin φ_hv ; −sin φ_hv, cos φ_hv ].

Again, we add AWGN on both polarizations and I/Q components with a variance of σ_w²/2 each. To simulate a time-varying channel, we change the HV-phase shift after each frame of N_frame = 10,000 symbols. Precisely, we extend φ̃_hv = φ_hv + Δφ_hv · T_frame · k with the deviation rate of the HV shift Δφ_hv, the frame duration T_frame = N_frame/R_S, the symbol rate R_S, and the frame index k as an integer indexing the discrete time (in multiples of the frame duration). It should be highlighted that we apply deviations frame-wise, i.e., we neglect deviations within each frame of N_frame symbols.
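The static rotation and its frame-wise drift can be sketched numerically (our reconstruction; the symbol names follow the text above). A useful sanity check is that the matrix is unitary, i.e., it rotates the polarizations without attenuating them.

```python
import numpy as np

def jones_matrix(phi_iq, phi_hv):
    """Static IQ-phase shift times a rotation by the HV-phase shift
    (sketch of the 2x2 polarization channel matrix)."""
    rot = np.array([[np.cos(phi_hv), np.sin(phi_hv)],
                    [-np.sin(phi_hv), np.cos(phi_hv)]])
    return np.exp(-1j * phi_iq) * rot

def drifted_hv(phi_hv, delta_hv, t_frame, k):
    """Frame-wise drift of the HV-phase shift for the time-varying channel:
    phi_hv + delta_hv * t_frame * k at frame index k."""
    return phi_hv + delta_hv * t_frame * k

M = jones_matrix(0.3, drifted_hv(0.2, delta_hv=0.01, t_frame=1e-4, k=5))
```

Because the rotation part is orthogonal and the IQ-phase factor has unit modulus, M^H M = I holds for any parameter values.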
For evaluation, we chose the averaging scheme sketched in Fig. 6, which might not seem to be straight-forward at first glance but fits to our needs, i.e., it returns a reliable SER estimate to compare different hyperparameter settings but also provides insights into the convergence behavior. In particular, the analysis of the average convergence behavior requires a certain temporal averaging, but the final analysis requires also an averaging over multiple execution runs, since the behavior could vary between different runs.
To get the desired insights, we first have to slice the equalizer output x̂_pol (per simulation run and per polarization pol ∈ {TE, TM}) into N_ind = 170 consecutive frames x̃_pol,k ∈ C^{N_frame} to allow the evaluation at different time steps, respectively frame indices k = 1 . . . N_ind.

[Fig. 6: x̃_pol,k ∈ C^{N_frame} are the slices (frames) of the vector containing the corresponding equalizer output with pol ∈ {TE, TM}, MA is a moving-average filter with length N_ma = 10, so N'_ma = N_ind − N_ma + 1, and min is an operator which returns a vector's element with the minimum value.]

Then, we estimate each frame's scalar SER_pol,k after hard decision, taking into account the
prior distribution of the constellation symbols [4]. Since we perform blind equalization, we also need to compensate for possible time shifts, I/Q flips, and phase rotations (in multiples of π/2). Furthermore, we discard symbols that may be incorrect due to boundary effects (but keep at least 8,000 symbols per frame for evaluation; hence, we can evaluate the potential performance of each algorithm). Eventually, we perform a moving average with filter length N_ma = 10 and get a sequence of N'_ma = N_ind − N_ma + 1 estimates SER_pol,k, each being evaluated on approximately 80,000 to 100,000 symbols per frame index (per polarization and per run). This allows the analysis of the equalizers' convergence for each polarization and run independently.
For further evaluations, we need one reliable scalar SER estimate for each algorithm and hyperparameter setting. Thus, we carry out N_run = 10 (unless stated otherwise) independent simulation runs and average the results over all successful runs, i.e., runs where the SER drops below a pre-defined threshold of 0.3, by taking the element-wise mean of the vectors SER_pol ∈ R^{N'_ma}, which contain the estimates per run and polarization after the moving average. If unsuccessful runs occurred, we display their number by a small number next to the corresponding data point in the result figures. Finally, we get a reliable scalar SER estimate by taking the element with the minimum value from the vector containing the averaged estimates for all remaining N'_ma frame indices. We chose the minimum over the mean to display the best (averaged) performance of the equalizers, since we already have sufficient averaging (the mean of N_pol · N_run estimates, each averaged over at least 80,000 symbols), but can prevent distortions from potential outliers. In fact, we did not observe any outliers in all our simulations, so we conjecture that the estimates' variance is small and the difference between taking the minimum and the mean (after convergence) is insignificant. Furthermore, we use a learning rate scheduler for the optical DP channel which halves the learning rate lr of each equalizer after every frame index that is an integer multiple of 20, so the given lr represents the initial value. Figure 7 depicts the proposed equalizers' performance for both uniform and PCS-64-QAM at N_os = 2 sps. We tuned the lengths of the filters N_tap, the batch size N_B, the kernels, and the learning rate. Later, for the DP simulations, we will show how the hyperparameters influence the equalizers' performance. Besides the baseline "No ISI" curve and the CMA as reference, we also evaluated two non-blind equalizers at N_os = 1 sps without pulse shaping, namely the non-linear decision feedback equalizer (DFE) [26, Ch. 9.5] (with 10 taps each for both feed-forward and feedback filters) and the linear minimum mean squared error (MMSE) equalizer [26, Ch. 9.4] (with 20 taps).

B. Simple AWGN Channel with ISI
For uniform 64-QAM, the CMA converges with a significant penalty to the MMSE, while both VAE-based equalizers operate close to the MMSE. For a linear channel, the VAE-LE outperforms the VAE-NN, which might originate from a closer bounding of the family Q around the optimum q̂(x|y) ∈ Q. We found qualitatively similar results for other ISI channels as well, e.g., in Fig. 8 for channel 2 from [24]. To limit computational complexity, we restrict ourselves in the following to the VAE-LE.
As already forecasted in the early 1990s [15], [16], the CMA fails to converge for PCS formats which approximate a Gaussian prior; however, the VAE-LE even reaches the non-blind MMSE's performance for higher SNRs, although we used the same soft-demapper as for the uniform QAM in these AWGN channel simulations (instead of the optimal PCS-adapted version as introduced in Sec. IV). Furthermore, Fig. 5 demonstrates the VAE-LE's capability of estimating the channel IR. We depict the estimate without averaging while processing uniform 64-QAM at SNR = 20 dB for 1 and 2 sps without pulse-shaping.

C. Optical Dual-polarization Transmission
The results for the application of the VAE-LE are depicted in Fig. 9. The right axis shows the estimated SNR (averaged similarly to the SER) while evaluating DP-64-QAM at various SNRs. The VAE-LE always underestimates the SNR, but only by a fraction of a decibel (dB). Together with the results of Fig. 5, this demonstrates the potential for joint communications and sensing, since the estimation of the channel parameters is a valuable byproduct for tracking and evaluating the channel IR and SNR without interfering with the communication. The left axis in Fig. 9 depicts the corresponding SERs. Similar to the results in Fig. 7, the CMA incurs a significant penalty without the learning rate scheduler. Since the scheduler halves the learning rate at every frame-index that is an integer multiple of 20, the CMA's performance-convergence trade-off is avoided and it reaches marginally lower SERs than the VAE-LE. For high symbol rates and high SNRs, the VAE-LE deviates from the ideal "No ISI"-curve. In all other cases, both equalizers stay within 1 dB of the "No ISI"-curve and converge to it for low SNRs.
The equalizers' convergence behavior differs significantly, as depicted in Fig. 10, where we display the moving-averaged SER estimates of one simulation run. It should be noted that we do not consider a fixed amount of training data but conduct online learning on the (in theory, infinitely long) received data sequence, and that the frame-index corresponds to discrete time steps during training, i.e., one frame spans a fixed number of symbols and thus a corresponding number of samples (the frame length times the oversampling factor) processed by the equalizer. The CMA converges gradually with a decreasing slope; the distinct "plateaus" are caused by the learning rate scheduler. The VAE-LE shows a "waterfall-like" curve but starts at a significantly higher SER. Although the initial learning rate and batch size strongly influence the VAE-LE's convergence time, the curve's waterfall-like shape remains. Interestingly, the CMA seems to optimize both polarizations equally, while the VAE-LE first focuses on one polarization before trailing the second one.
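The frame-wise online-learning setup with the halving scheduler can be sketched as follows; the `equalizer` object and its `step` method are hypothetical placeholders for any of the evaluated algorithms:

```python
def online_equalize(equalizer, rx_samples, samples_per_frame, lr0, n_frames):
    """Frame-wise online learning on a (conceptually infinite) sample stream.

    `equalizer` is a hypothetical object exposing `step(frame, lr)` which
    processes one frame and returns its SER estimate. The scheduler halves
    the learning rate at every frame-index that is an integer multiple of 20.
    """
    lr = lr0
    ser_per_frame = []
    for frame_idx in range(n_frames):
        if frame_idx > 0 and frame_idx % 20 == 0:
            lr *= 0.5                      # learning rate scheduler
        start = frame_idx * samples_per_frame
        frame = rx_samples[start:start + samples_per_frame]
        ser_per_frame.append(equalizer.step(frame, lr))
    return ser_per_frame
```

Because training is online, the frame-index doubles as a discrete time axis, which is how the convergence curves in Fig. 10 should be read.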
The influence of the main hyperparameters is analyzed in Fig. 11. Interestingly, the symbol rate and the filter length influence the CMA significantly less than the VAE-LE, which suffers from performance penalties for short filters. The reason might be that the VAE-LE has to compensate for both amplitude and phase offset, while an extra CPE compensates the latter for the CMA. Since a high symbol rate mainly increases the ISI, it is reasonable that it only influences the VAE-LE at high SNR.
The VAE-LE suffers from convergence issues for long filters and large batch sizes (or high learning rates), but is very stable for short filters and small batch sizes even at high symbol rates. While changes in the initial learning rate affect the CMA strongly, the VAE-LE incurs only a negligible penalty over a relatively large range.
We show the results for PCS in a DP optical channel in Fig. 12. With the adapted soft-demapping and the learning rate scheduler, the VAE-LE is able to follow the "No ISI"-curve within a 1 dB penalty even for high symbol rates and low SNR. Shorter filter lengths and smaller batch sizes as well as higher learning rates are necessary to ensure a high probability of convergence, especially for high symbol rates. The VAE-LE is potentially capable of converging for symbol rates above 100 GBd and strong shaping in our simulation model, but the probability of non-convergence is relatively high for typical working points. Still, the VAE-LE significantly outperforms the CMA for PCS, where the latter does not converge at all.

D. Time-varying Channel
We analyze the influence of time-dependent channel deviations on the equalizers by a frame-wise linearly increasing HV-shift within the optical DP model, i.e., the shift grows by a fixed increment with every frame-index. Figure 13 depicts the SER for different slopes of this increment. Since we did not employ the learning rate scheduler in this evaluation, the hyperparameter values differ from the ones used in the time-invariant case. Due to its rather slow convergence, the CMA has to operate at rather high learning rates and incurs a severe penalty. The wide range of feasible learning rates allows the VAE-LE to tune its working point either towards low SERs for small slopes or towards a high tolerance against deviations at the cost of moderate penalties. The VAEflex with a batch size of 100 and a flexible step length of 10 accelerates training significantly, which makes it tolerant towards deviations while still reaching very low SERs. Although we did not optimize the VAEflex as thoroughly as the other algorithms, it converges by an early frame-index at the cost of a higher computational complexity. Hence, an option would be to switch between the VAEflex and the batch-wise VAE during operation by varying the flexible step length between 1 and the batch size.
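Our reading of the flexible update rule is a sliding-window schedule, which can be sketched as follows; function and parameter names are illustrative, and setting the flexible step length equal to the batch size recovers the plain batch-wise scheme:

```python
def flex_update_indices(n_symbols, batch_size, n_flex):
    """Sliding-window schedule of the flexible update rule: an update is
    performed every `n_flex` new symbols, each time using the most recent
    `batch_size` symbols. Returns (start, end) index pairs per update."""
    windows = []
    for end in range(batch_size, n_symbols + 1, n_flex):
        windows.append((end - batch_size, end))
    return windows
```

A smaller step length yields more frequent parameter updates on overlapping windows, which is where the faster convergence comes from, at the cost of proportionally more computation.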
For comparison, we also implemented a CMA with a batch-wise updating scheme as proposed in [38], which we denote CMAbatch. Additionally, we extended this scheme with a flexible update rule akin to the VAEflex; we denote the resulting equalizer CMAflex. Although Fig. 13 shows that both the CMAbatch and the CMAflex perform better than the symbol-wise CMA for this time-varying channel without the learning rate scheduler, the gain is relatively small and both are outperformed by the VAE-based equalizers. In particular, the CMAflex performs very similarly to the CMAbatch and, in contrast to the VAEflex, the flexible update rule is not able to accelerate training.
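A batch-wise CMA step averages the per-sample gradient of the constant-modulus cost over the batch. The sketch below is a simplified single-filter version (not the full butterfly MIMO structure used in our simulations or in [38]); names are illustrative:

```python
import numpy as np

def cma_batch_update(w, x_batch, lr, radius):
    """One batch-wise CMA update for a complex FIR tap vector `w`.

    x_batch: one received sample vector per row; `radius` is the modulus
    the CMA drives |y|^2 towards. Minimizes E[(|y|^2 - radius)^2] with
    y = w^T x via the averaged Wirtinger gradient.
    """
    y = x_batch @ w                              # equalizer outputs
    e = (np.abs(y) ** 2 - radius) * y            # CMA error per sample
    grad = (e[:, None] * np.conj(x_batch)).mean(axis=0)
    return w - lr * grad
```

Averaging over the batch smooths the stochastic gradient; the flexible variant simply applies this update more often on overlapping batches.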

VI. CONCLUSION
In this paper, we proposed the new VAE-LE, a model-based approach with linear butterfly FIR filters whose coefficients are trained using the variational inference paradigm. For AWGN channels with ISI, the blind VAE-based equalizers can approach the performance of the non-blind MMSE equalizer for both uniform and PCS formats. The proposed VAE-LE outperforms the previously introduced VAE-NN for this linear channel. The VAE-LE also converges in a dispersive optical DP system, with only a negligible penalty compared to the CMA for uniform formats. For PCS formats, where the CMA fails to converge without modifications, the VAE-LE still approaches the ideal reference within 1 dB.
The VAE-LE's rapid convergence behavior is advantageous for time-varying channels, where the gradually converging CMA performs significantly worse. Our proposed VAEflex update scheme with flexible step length is a powerful alternative if convergence speed is a key factor. Additionally, we have shown that the VAE-LE is able to estimate both the communication channel taps and the noise variance very well, which can be an enabler for joint communications and sensing.
While we focused on linear channels in this work, the extension towards nonlinear impairments might be a possible direction of future research.