End-to-end Optimization of Constellation Shaping for Wiener Phase Noise Channels with a Differentiable Blind Phase Search

As the demand for higher data throughput in coherent optical communication systems increases, we need to find ways to increase capacity in existing and future optical communication links. To address the demand for higher spectral efficiencies, we apply end-to-end optimization for joint geometric and probabilistic constellation shaping in the presence of Wiener phase noise and carrier phase estimation. Our approach follows state-of-the-art bitwise auto-encoders, which require a differentiable implementation of all operations between transmitter and receiver, including the DSP algorithms. In this work, we show how to modify the ubiquitous blind phase search (BPS) algorithm, a popular carrier phase estimation algorithm, to make it differentiable and include it in the end-to-end constellation shaping. By leveraging joint geometric and probabilistic constellation shaping, we are able to obtain a robust and pilot-free modulation scheme improving the performance of 64-ary communication systems by at least 0.1bit/symbol compared to square QAM constellations with neural demappers and by 0.05 bit/symbol compared to previously presented approaches applying only geometric constellation shaping.


I. INTRODUCTION
I N recent years, more and more innovations, e.g., internet of things, 6G, and video streaming, continue to increase the demand for higher data rates. Optical fiber communication systems, in particular, have to bear the bulk of the traffic to interconnect geographical regions to provide connectivity to said innovations. Therefore, to keep up with the growing demand, increasing the network capacity is necessary. Ideally, the data rate of the physical layer employed in existing communication links shall be increased.
One way to increase the data rate is to increase the spectral efficiency of optical communication systems by applying constellation shaping. Both probabilistic and geometric shaping achieve a shaping gain over classical square quadrature amplitude modulation (QAM) constellations. Probabilistic shaping changes the probability of occurrence of constellation symbols arranged in the classical square QAM, while geometric shaping changes the position of constellation points. For the Parts of this work have been presented at the Optical Fiber Communications Conference (OFC) 2022 in paper [1] and at the European Conference on Optical Communication (ECOC) 2022 in paper [2].
A. Rode  additive white Gaussian noise (AWGN) channel, there exist good solutions for probabilistic [3], [4] and geometric [5]- [7] constellation shaping. For channels with memory or nonlinearities, closed-form analytical solutions for neither probabilistic nor geometric shaping are currently available and we need to resort to numerical optimization techniques. Especially for optical fiber communication systems, a popular approach to optimize constellation shaping is to apply machine learning (ML) [8]- [11]. An end-to-end (E2E) approach allows us to optimize constellations that maximize the achievable information rate (AIR) [12] in ubiquitous bit-interleaved coded modulation (BICM)-like systems with bit-metric decoders (BMDs). Classical approaches for constellation shaping usually lack efficient ways to optimize the bit labeling for such BICM system. In more recent works [13]- [15], joint geometric and probabilistic shaping optimized with the help of machine learning and in particular neural networks (NNs) was successfully investigated.
Another way to increase the effective net data rate is to reduce the number of transmitted pilot symbols. A portion of pilots are inserted to aid carrier phase estimation (CPE) [16]; the application of blind CPE reduces the need for pilot symbols. Reducing or fully eliminating the need for pilot symbols for CPE will be a benefit for future systems. A popular and widely used algorithm for blind CPE in optical communication systems is the blind phase search (BPS) algorithm. Its popularity stems from the fact that it can be implemented in a parallel and pipelined fashion [17], enabling CPE for high symbol rates. This is a big advantage over decision-directed CPE algorithms with feedback, which are popular in wireless communication systems because of their lower computational complexity. Due to the required feedback connection, a feedforward, pipelined implementation is not straightforward.
Combining constellation shaping with auto-encoders and blind CPE is challenging, since the E2E optimization of constellation shaping with auto-encoders requires channel models and digital signal processing (DSP) algorithms that are differentiable. Differentiability is required to enable gradient descent based optimization using the back-propagation algorithm. Other approaches try to avoid the requirement of differentiable channels by introducing surrogate channel models [18] or by using techniques like reinforcement learning [19], genetic algorithms [20] or cubature Kalman filters [21]. These approaches come with a cost, as complexity of the training increases significantly and the rate of convergence is lower. arXiv:2212.03839v2 [eess.SP] 21 Apr 2023 Direct E2E optimization through back-propagation and gradient descent is the most robust approach, since the effects of all elements between the sender and receiver, including DSP algorithms, are included during the training through the loss function. In order to have an E2E differentiable autoencoder channel, all operations between encoder and decoder NNs have to be differentiable w.r.t the trainable parameters in the NNs. The regular BPS algorithm includes a nondifferentiable arg min operation and hence cannot be used directly with gradient descent. Thus constellation shaping with surrogate channels or without requiring a differentiable channel have been studied for this particular use-case [18], [21], [22]. We however want to leverage the robustness and convergence speed of gradient descent and hence in [1] proposed a differentiable BPS algorithm. In this work, we propose to use the differentiable BPS algorithm from [1], to optimize constellations for the Wiener phase noise channel in an E2E manner. Besides geometric constellation shaping, we also investigate the potential of joint probabilistic and geometric constellation shaping.
The remainder of this work is organized as follows. In Sec. II, we explain how the non-differentiable arg min operation can be replaced with the differentiable softmin with temperature operation to implement the differentiable BPS algorithm. We introduce the system model for geometric constellation shaping (GCS) in Sec. III. The extension of the GCS system model to joint geometric and probabilistic constellation shaping (GeoPCS) is highlighted in Sec. IV. In Sec. V, we discuss the chosen simulation parameters for the E2E optimization and discuss results we obtain for constellation shaping with the differentiable BPS for the Wiener phase noise channel. We compare and discuss the results of our optimizations of GCS and GeoPCS in Sec. VI. In particular, we show a novel approach to jointly optimize geometric and probabilistic shaping for the Wiener phase noise channel including the BPS algorithm. We highlight differences to previous approaches optimizing the geometric shaping. We conclude the paper in Sec. VII.

II. DIFFERENTIABLE BLIND PHASE SEARCH
In a first step, we summarize the BPS algorithm as it was presented in [17] and introduce our modification to make the BPS differentiable in a second step. The BPS algorithm is described in Alg. 1 for a set of M = 2 m constellation symbols M := {c 1 , . . . , c M }, c i ∈ C and a sequence of complex received symbols z = (z 1 , . . . , z k , . . .). The BPS is parametrized by the length of the averaging window 2N + 1 and the number of test phases L, which define the granularity with which the BPS estimates and corrects phase errors.
The first step in the BPS algorithm when estimating the phase errorφ k of received symbol z k consists of finding the minimum distance d k, between all known transmit constellation symbols c i and the symbol z k rotated by all test phases ϕ with ∈ {1, . . . , L}. In an equivalent, practical, implementation, the rotated constellations can be saved and the minimum distance for all test phases can be computed in parallel. After computing the distances d k, , an averaging step is performed by calculating the cumulative sum D k, per test phase ϕ using N previous and N following distances. This step increases the robustness against AWGN while decreasing the resilience to higher laser linewidths. The cumulative sum introduces fringe effects at the start and the end of a BPS block, which we will need to remove during optimization to resemble the behavior of a system operating in a steady state. After computing the cumulative sum D k, the following operations between differentiable BPS and regular BPS differ. With regular BPS, the test phase indexˆ k that minimizes the cumulative sum D k, is found using the arg min operation and the corresponding test phaseφ k = ϕˆ k is used as a carrier phase estimate. We apply phase unwrapping toφ and obtaiñ φ, which we use to correct the received symbols z. In the differentiable BPS, we replace the arg min operation with a differentiable approximation softmin t , which we call softmin with temperature. In the following section, we introduce softmin t in more detail. To obtain a phase estimateφ k , the dot product between the result of softmin t and the vector of test phases ϕ has to be computed. The degree to which softmin t approximates the arg min operation is controlled with the temperature parameter t.

A. Softmin with Temperature
We derive the softmin t from the softmin operation. For a vector x = (x 1 , . . . , x n ) T , the i-th element of softmin t (x) is given by To improve the approximation of the true minimum, we scale the input vector to the softmin operation with 1/t: Now, for 0 < t < 1, the minimum and maximum are further separated in value than for the unscaled input. Therefore the following softmin operation returns a higher weight for the minimum value. In Fig. 1, we show the change in the approximation of arg min with softmin t for a varying parameter t. Decreasing t improves the approximation: For small values of t, most elements of the softmin t output vector are zero; only a single non-zero element at a single test phase which leads to the smallest distance in the cumulative sum vector D k persists. This corresponds closely to the arg min operation. The approximated minimum valueD k = D T k softmin t (D k ) is plotted in Fig. 1 only for illustrative purposes, as we are only interested in the phase estimateφ k which is given bŷ φ k = ϕ T softmin t (D k ). For t = 1, softmin t is equivalent to the softmin operation and we can observe a significant offset from the true arg min. In [23], the authors show how to smoothly approximate arg max with a similar approach applied to the softmax function. In our use case, the softmin operation does not approximate the arg min sufficiently and consequently the returned phase estimate has a significant offset from the true phase unless the extra parameter t is used.

III. GEOMETRIC CONSTELLATION SHAPING SYSTEM MODEL
Auto-encoders have been successfully used for unsupervised learning of efficient latent representations in the wider machine learning community and are quite naturally applicable to the design of communication systems. Namely, in an auto-encoder, information is first processed by an encoder NN to get a representation in a latent space, which usually has a smaller dimension than the input. A decoder NN is then used to reconstruct the original information from the information in the latent space. No labeling in the latent space is required, instead, representations of the information can be extracted from the latent space. We can cast this auto-encoder concept to a communication problem: the encoder is a mapper which maps bit vectors to symbols on the complex plane and the decoder is a demapper trying to recover the embedded information from complex, noisy received symbols. In order to learn representations that are efficient in the presence of channel impairments, the symbols-generated by the encoder NN (denoted in the following by "Tx-NN")-are impaired by a communication channel, e.g., by AWGN, dispersion, or multi-path propagation. These impairments are partially removed with classical DSP algorithms and the recovered complex symbols are processed by a decoder NN (denoted in the following by "Rx-NN") to return a log likelikood ratio (LLR) vector. By optimizing encoder and decoder NNs to minimize the binary cross entropy (BCE) loss [12], an efficient mapping can be found by back-propagation and gradient descent with, e.g., the Adam algorithm [24]. The learned encoder and decoder do not necessarily have to be used in a practical implementation to reap the benefits of this optimization. The encoder can be replaced by a lookup table and the bitwise decision regions can be approximated by simpler structure than a deep neural network. In [25], we present a way to learn a decoder with comparable complexity to a Gaussian demapper.
Our proposed system model in Fig. 2 is built according to a bitwise auto-encoder. At time step k, a bit vector b k = (b 1,k , . . . , b m,k ) T is first mapped to a corresponding one-hot vectorb k of length 2 m that is only non-zero at the index that corresponds to the binary-to-integer conversion of the bit vector. The Tx-NN is used to generate a constellation M of size M = 2 m . In the case of a non-parametrizable mapper, the mapper neural network (Tx-NN) reduces to a real-valued weight matrix W m of size 2 × 2 m and we can generate a vector of constellation symbols c = (c 1 , . . . , c M ) T ∈ C M containing all M modulation symbols of the constellation The dot product between a one-hot vectorb k and c is then used to select one constellation symbol x k for transmission: AWGN and Wiener phase noise are applied to x k to simulate a communication channel which is dominated by Gaussian noise and impairments from a non-zero laser linewidth. The standard deviations σ n = √ N 0 = Es SNR and σ φ = 2π ∆f R are selected so that a channel with SNR = Es N0 and linewidth ∆f at symbol rate R is simulated. The transmit symbols x k are affected by complex AWGN n k ∼ CN 0, σ 2 n and Wiener The impaired symbols z k are then further sent through a CPE algorithm to recover and correct the carrier phase. In our system model in Fig. 2, we use either our differentiable BPS or the regular BPS. After CPE, we obtain the phase compensated symbolsx k . The demapper neural network (Rx-NN) takes the complex symbolsx k and performs demapping to obtain m LLRsL k .
To evaluate the performance of a BICM system we calculate the bitwise mutual information (BMI) 1 as the metric. For a sequence of P bit vectors b k (k ∈ {1, . . . , P }) and their corresponding LLRsL k , the BMI is approximated as where H(M) is the entropy in bits of the modulation symbols, calculated from the discrete probability of occurence p(c i ) for each modulation symbol c i : In case of a uniform distribution of the modulation symbols, (5) further simplifies to Subsequently, instead of a custom implementation, the BCE loss function common to many machine learning frameworks can be used directly as a loss function to optimize the BMI (see [12] for details).
We initialize the Wiener phase noise process with a starting phase φ start ∼ U(−π, π), where φ k = φ start + k−1 k =0 ∆φ k . The random initialization of the starting phase helps to improve the robustness of the constellation when BPS is used for CPE without prior compensation using pilots. This allows us to learn a constellation for a pilot-less system which is robust to AWGN and Wiener phase noise.

A. Parameterizable GCS
Our proposed system model in Fig. 2 contains additional inputs σ n and σ φ at both Tx-NN and Rx-NN to allow for the optimization of GCS over a range of channel parameters. This parameterization serves multiple purposes: it allows the investigation of changes in the constellation when channel parameters are varied. Additionally, we obtain a mapper and demapper optimized for each set of channel parameters within the range of training parameters. This parameterization is roughly equivalent to training a separate set of mapper and demapper NNs for each individual set of channel parameters. Such individualized training would be significantly more computationally expensive and the bit labeling may be different for different channel parameters preventing us from analyzing the isolated impact of the variation in the parameters on the shaping. With this parametrized geometric constellation shaping (pGCS), each symbol-which corresponds to a particular bit vector-is moved only in a small region defined by the set of channel parameters. The pGCS retains its general structure and changes can be observed with respect to the change in the channel parameters.

B. Trainable Differentiable BPS
Having implemented a differentiable BPS, a natural question arises: Can the differentiable implementation replace the regular BPS not only for the training and optimization, but also for evaluation and implementation? To investigate this possibility, we additionally make the temperature t in the differentiable BPS trainable. To implement a regularization of the temperature parameter t, we choose an unbounded trainable parameter t * , which we use to calculate t = σ(t * ).
Here, σ(x) = (1 + e −x ) −1 is the sigmoid function and limits t to the range (0, 1). We then use the differentiable BPS and the optimized parameter t * in the implementation instead of the regular BPS. Since the phase estimate of the differentiable BPS is not limited to the granularity of the test phases ϕ, the remaining residual phase noise may decrease.

IV. JOINT GEOMETRIC AND PROBABILISTIC CONSTELLATION SHAPING SYSTEM MODEL
To incorporate the idea of learning a joint geometric and probabilistic shaping of the constellation, we extend our system model in Fig. 2 with a distribution sampler at the transmitter for the training in Fig. 3, as proposed in [13]. This extension changes the operation of the training in terms of how bits and symbols are generated and sent through the channel. With GeoPCS, an NN (denoted by "p-NN") first generates a vector of logits p to represent the probability of occurrence for each constellation point in the log domain. Then, a softmax operation is applied on p to get a vector of probabilities p.
For each training batch, the symbol probabilities p and the batch size S are passed to a distribution sampler. To find the quantized number of symbols according to the given distribution of symbol probabilities p, we apply Algorithm 2 from [26]. We convert the symbol indices to their binary representations to get S bit vectors b k , which can be passed through the neural mapper to get S complex symbols x. To account for the non-uniform symbol probability, we change the energy normalization operation in the neural mapper to apply p to scale the contribution of each constellation point according to the symbol probability p(c i ). The normalized constellation is obtained as The remaining operations to obtain the S LLRs vectors are the same as in the case for only geometrical shaping (GS) in Fig. 2. For optimization of this GeoPCS system model, we use a modified loss function based on the BMI in (5). This is motivated by the fact that using the BCE does not correctly incorporate the reduction of the entropy resulting from probabilistic shaping in (5). The entropy is calculated directly from p and optimization is performed with automatic differentiation and gradient descent. We also extend the the approach presented in [13] to use a neural network with one hidden layer with inputs for σ φ and σ n to allow for parameterization of p . In the case parameterization is not required, the neural network for p simplifies to a vector of trainable weights.

A. Probability Distributions with Symmetry
In order to learn a logits vector p which can be implemented jointly with forward error correction (FEC) in a BICM system akin to the probabilistic amplitude shaping (PAS) scheme [4], [27], we extend the probabilistic shaping to be parameterizable with a symmetry parameter s. This parameter s controls the size of the output dimension of the p-NN to be 2 m−s . This limits the parameter s to a range of [0, m − 1] to have any probabilistic shaping. To obtain a probability of occurence for each modulation symbol, the logits vector p,s is repeated s times to form the final logits vector p of length 2 m . This introduces an s-fold symmetry in p and in turn this results in s bits of the bit vector forming the modulation symbols that are 0 or 1 with equal probability. This property can be inferred from the symbol-to-bit mapping and bit-to-symbol mapping, which convert the symbol index to the binary representation. This property can then be used to combine m − s information bits from a distribution matcher (in the final system) with s parity check bits from an FEC encoder to select modulation symbols corresponding to the probability distribution. For regular and fully flexible probabilistic shaping, the parameter s is set to zero. To perform probabilistic constellation shaping which contains one bit with equal probability distribution the symmetry parameter has to be set to s = 1. As highlighted in the system models in [4], [27], a fixed number of bits with an equal probability distribution can be used to assign the approximately uniform distributed parity check bits from FEC.

V. SIMULATION SETUP
We perform GCS and GeoPCS with the system model shown in Fig. 2 and Fig. 3 with the set of hyperparameters given in Table I. The trainable parameters of the NNs are initialized randomly with the Glorot initializer. Validation is performed with the regular BPS algorithm unless stated otherwise. Across epochs, the channel parameters such as SNR and linewidth are fixed. In each batch of every epoch, the input bits are randomly permuted and a new noise realization is used. While a symbol rate of 32 GBaud is quite low for modern single-carrier transmission systems, it is a realistic value for multi-carrier transmission systems [28]- [30]. Additionally, for low cost systems where coherent transmission is relevant, the symbol rate may be higher, but the resulting phase noise is still comparable to our parameterization due to low cost and high linewidth lasers [31]. We implemented our system in the PyTorch machine learning framework [32] and our source code is accessible in [33]. Selection of an appropriate range of temperature values for the softmin t operation during the optimization is crucial for the stability of the training. Starting with a temperature t = 1 and then decreasing it to t = 0.001 as the training of the NNs progresses yielded stable optimization in our setup. We attribute this behavior to the fact that for low temperature values, only a few of the distance measures around the minimum contribute to the phase estimate. Therefore, most of the other test phases are scaled with a value close to zero; subsequently, also the contribution to the gradients is small. Therefore, a rather wide inclusion of values at the beginning of the training helps to boost the training, to avoid getting stuck   in a local minimum and finally to reach a good distribution of constellation points that are optimized in later training epochs.
Training and validation for the shaped constellations has been performed with random phase initialization of the Wiener phase noise process. In the case of QAM, the Wiener phase noise has been initialized with zero and BPS only operates on the first quadrant, since otherwise the phase offset cannot be estimated correctly due to the rotational ambiguities in the QAM constellation.

VI. RESULTS
We performed optimization of the constellation shape with both GCS and the GeoPCS system models. Both approaches have been trained with a fixed channel parameterization to get a general idea about the shape and bit labeling of the learned constellation mapper and demapper. Additionally, we also trained with varying channel parameters as additional inputs to the NNs. This allows us to conduct a further investigation of the effects of CPE on learned communication systems and we obtain an estimate of the expected performance gain for optimized GCS and GeoPCS.

A. Fixed Channel Parametrization GCS and GeoPCS
In this approach, the training is performed with fixed channel parameters, which are not passed as inputs to Tx-NN and Rx-NN. The optimization of the NNs is performed at a fixed operating point, and we expect the performance on different channel parameterizations to be worse. In Fig. 4a), we show a GS constellation for an SNR = 17 dB and laser linewidth ∆f = 100 kHz. A pronounced feature is the introduced asymmetry in the constellation in the lower left corner. Multiple constellation points are placed further separated from the other constellation points than their counterparts in other corners. This positioning increases the robustness and supports the operation of the BPS algorithm by resolving possible ambiguities of constellation points. This asymmetry comes at the expense of more closely packed constellation points at the center of the constellation and thus offers lower robustness to AWGN compared to a constellation optimized for AWGN only. In Fig. 4b), we show a GeoPCS constellation with the same fixed channel parameters as the GCS constellation. For the human eye, the induced asymmetry is hard to spot, but the constellation contains less rotational symmetry compared to a square QAM.
To investigate the performance of our E2E system optimized with GCS, we take a look at the learned neural demapper and the decision regions for individual bits as depicted in Fig. 5. For better visualization, the demapper decision regions are shown for LLRs between [−5, 5].
With the chosen neural demapper architecture, our system is able to accurately partition the constellation into two sets of the same size for each bit. The decision regions shown in Fig. 5 are an extension of the 1D LLR plots shown in [34] to the full 2D region. A 1D graph would not be enough to characterize the decision regions of the neural demapper, since L k,i depends on both Re{x k } and Im{x k }.
Finally, we compare the attainable performance of the different approaches in Fig. 6. For the systems trained on a fixed set of channel parameters, the GeoPCS system has the best performance, outperforming our previously presented work on GCS [1] by 0.05 bit/symbol at the channel param- eters both systems were trained on and by even more in the region of higher laser linewidths. The GeoPCS constellation with introduced symmetry also outperforms the GCS constellation. All shaped constellations outperform a square QAM constellation paired with a neural demapper, which was trained on the same channel parameters as the shaped constellation systems. For a fair comparison, the validation of the square QAM was not initialized with a random start phase, since QAM is rotationally symmetric and the BPS algorithm wouldn't be able to disambiguate carrier phases if the phase difference equals a multiple of π/2; this means that phase slips (sometimes referred to as cycle slips) do not play a role in our evaluation. To investigate the performance of probabilistically shaped QAM, we perform an E2E optimization to learn the parameter λ of the Maxwell-Boltzmann distribution for the fixed channel parameters. We apply a similar approach as for GeoPCS, but instead of the probability of occurence for each symbol, the NN returns the parameter λ, which is then used with the Maxwell-Boltzmann distribution and known constellation symbols to obtain the probability distribution. In Fig. 6, we report the results of this validation as "QAM PCS". For the fixed channel parameterization we see a significant gain compared to square QAM and a similar performance as our proposed schemes. This difference in performance is most likely caused by Gray labeling in the square QAM and "QAM PCS" which is not achieved by the trained geometric constellations, but rather an almost Gray labeling is learned.

B. GCS with Varying Channel Parametrization
To optimize GCS for a range of varying channel parameterizations, we train the pGCS system on channel parameterizations drawn from U(σ n,min , σ n,max ) and U(σ φ,min , σ φ,max ). The parameters are also provided to the NNs, such that the system is optimized for each set of provided channel parameters and learns how to change the constellation to obtain a better constellation and demapper for the given condition. In Fig. 7, we show the obtained transmit constellation for training the system on a channel with SNR between 14 dB and 24 dB and the laser linewidth ∆f between 50 kHz and 600 kHz. The constellation is depicted for a fixed SNR = 18 dB and varying laser linewidth in a color change in Fig. 7a). For a high laser linewidth, the constellation exhibits a few points at the outside of the constellation at the top right and 325 600 Laser linewidth (kHz) 14 19 24 SNR (dB) bottom right corners, which move outwards compared to the constellation optimized for a lower laser linewidth. The change in the transmit constellation diagram is small and in stark contrast to the change in the transmit constellation diagram for varying SNR depicted in Fig. 7b). In this diagram, the laser linewidth ∆f = 100 kHz is fixed and the color indicates the SNR. For low SNR, the same constellation points as for varying laser linewidth are moved outwards compared to the transmit constellation for high SNR. Additionally, for low SNR, more points at the center of the constellation are moved closer together to allow for a separation of other constellation points at the outer edge of the constellation. This "sacrifices" the information contained in at least one bit carried by the constellation and results in a more Gaussian-like shaping of the complete transmit constellation diagram.
The results of training the parameterized GeoPCS (pGeoPCS) are depicted in Fig. 8 and Fig. 9. The transmit constellation is shown in Fig. 8a) for varying laser linewidth and shown in Fig. 8b) for varying SNR. The change in the   Fig. 9a) for varying laser linewidth and for varying SNR in Fig. 9b). Contrary to the pGCS system, a varying laser linewidth does not change any features of the constellation, neither the geometric constellation shape in Fig. 8a) nor the probabilistic constellation shape in Fig. 9a). This result is somewhat surprising, as this means that the learned transmit constellation is only changed for varying SNR and is independent of the laser linewidth. For a change in SNR, a behaviour similar to that of the AWGN channel is observed [13]. The probabilistic shaping approaches the shape of the well-known Maxwell-Boltzmann distribution.
In Fig. 10, we depict the value λ for the normalized Maxwell-Boltzmann probability mass function which has the closest numerical fit to the learned symbol distribution. The fit to the Maxwell-Boltzmann distribution is very accurate with a maximum observed Kullback-Leibler divergence of 0.0002. In addition, we also display the learned parameter λ of the parameterized "QAM PCS" constellation. In our training, we limited the λ to the range of [0, 1] by applying a sigmoid function at the output of the neural network. In the resulting plot, a similar shape of the learned λ parameter over a changing SNR can be observed. The λ parameter learned by the probabilistically shaped QAM differs from the fitted λ parameter from the pGeoPCS constellation especially for low SNR. For a higher λ, a stronger probabilistic shaping is performed. This difference may indicate a reason for the performance difference seen in Fig. 11.
In Fig. 11, we compare the performance of our different approaches to constellation shaping with varying channel parameters. The performance is shown in BMI over a varying laser linewidth. Colors indicate SNR and different lines and markers indicate the system. The NNs of all depicted systems used σ n and σ φ as input to optimize the mapper and demapper for the channel parameters used for validation. In this comparison, applying GeoPCS provides a consistent performance gain over GCS. But it is also clearly visible that for higher SNR, this shaping gain disappears. In comparison, the results of "QAM PCS", which is a parameterized version of the probabilistically shaped square QAM (using a Maxwell-Boltzmann distribution), show stability issues for low SNR. For the lowest value of 14 dB, the results are not shown since the performance was impacted by many cycle slips. In the higher SNR regime, a similar picture as to the fixed training point can be seen: for low laser linewidths, "QAM PCS" outperforms the pGeoPCS scheme, where for higher laser linewidths pGeoPCS is more robust and shows a better performance. We conclude from these results that the probabilistically shaped square QAM has advantages in the low laser linewidth regime if the shaping parameter λ is chosen correctly. For high laser linewidths, our proposed schemes employing geometric shaping show better robustness. We have to emphasize that the square QAM constellations have a better starting condition, their phase noise realizations are initialized with zero phase offset, therefore mandating the use of regular pilot symbols to retain this phase reference.

C. Trainable Differentiable BPS
For the trainable BPS, our assumption is that a lower number of test phases could be used for the final system since the differentiable BPS can interpolate between the test phases. Therefore, we train and evaluate GCS systems with and without a trainable differentiable BPS for a number of test phases L = 30 and L = 60. All systems are trained on a fixed SNR = 17 dB and a fixed laser linewidth ∆f = 100 kHz. The performance comparison between the system with trainable differentiable BPS and the system with regular BPS is depicted in Fig. 12. For L = 60, the regular BPS has a consistently better performance compared to the trainable BPS. We attribute this to the phase discontinuity at 0 and 2π. If the actual phase error φ k is close to the discontinuity, we obtain similar values for the cumulative sum D k, for the test phases ϕ 1 = 0 and ϕ L = L−1 L 2π in the BPS algorithm. In that case, the softmin t will return high values for both these test phases even when the temperature t is very low. Subsequent multiplication and summation with the test phase vector ϕ returns a phase error estimateφ k around π which is significantly different from the actual phase error leading to incorrect phase correction and consequently large errors in the LLRs. This issue does not occur with an arg min, since the issue of phase error estimates of consecutive symbols oscillating between 0 and 2π is corrected by the phase unwrapping step after obtaining the phase error estimates.
The incorrect phase corrections in some cases leads to higher performance variation of the trainable BPS between runs, which is indicated by the error bars. For L = 30 the trainable BPS has a slightly better performance. In this case, the ability to perform interpolation with the softmin t operation between test phases might reduce residual phase noise compared to the regular BPS, but high performance variations between runs are still observed. We therefore suggest further investigation into increasing the stability and robustness of the differentiable BPS to avoid problems at phase errors close to the discontinuity.

D. Validation with Dual Polarization Split-Step Fourier Method Simulation
In order to verify our results reported in Fig. 11 on a channel impaired by chromatic dispersion, we additionally perform validation with an implementation of a dual polarization split-step Fourier method (DP-SSFM). This method simulates the effects described by the coupled non-linear Schrödinger equation. We do not include other channel variations, like polarization mode dispersion (PMD) or random polarization rotation. This would entail investigating joint optimization of constellation shaping and blind equalization methods [35]. Similarly to a geometrically shaped constellation with prominent asymmetric features, a constellation which is shaped geometrically to assist a blind equalizer may exhibit other asymmetric features. The study of interaction between the optimization of geometric constellation shaping and equalization is an interesting topic, but is outside the scope of this work. Applying the DP-SSFM introduces some inter-symbol interference and the results of this validation may give an indication if increased tolerance for phase noise come at a cost of higher sensitivity to inter-symbol interference.
First, we apply a root-raised-cosine (RRC) pulse shaping and upsampling to the simulation rate, to the transmit symbols x k . Then, AWGN and Wiener phase noise are added to the transmit samples to simulate transmitter impairments. The same Wiener phase noise realization is used on both polarizations. The impaired transmit samples are then sent through the DP-SSFM and a different Wiener phase noise is applied on the receiver side. Afterwards we downsample the received signal and apply digital chromatic dispersion compensation. The matched filter follows the dispersion compensation and we apply the BPS algorithm separately for both polarizations. In this validation we use the regular BPS. In the case of square QAM, we apply a genie-aided phase-slip compensation on every symbol to remove any phase ambiguity by amultiple of π/2. We use the trained neural receivers to obtain LLRs and compute the performance in terms of BMI.
We selected system parameters reported in Table II. To find the best launch power we performed a sweep across the launch power for each constellation shaping method. For this validation, we use the models trained in a parameterizable fashion for the channel with AWGN, Wiener phase noise and the BPS algorithm as presented in Sec. IV and Sec. III, and use them in the validation without any re-training. One challenge is correctly selecting the input parameters σ φ and σ n for the NNs. Since we apply i.i.d. Wiener phase noise at the transmitter and receiver, we assume the resulting phase noise coming from the Wiener phase processes can be described with σ φ,tot = √ 2σ φ and we use σ φ,tot as input parameter for the NNs. For AWGN, we have multiple sources to account for, first we apply AWGN at the transmitter, but also the ASE noise contribution of the EDFAs is modeled as AWGN. Additionally, the noise contribution of the non-linearity and interaction between non-linearity and chromatic dispersion is not accounted for in the parameterization of the NNs. Therefore we perform a sweep of the σ n parameter, which is only used as input to the NNs, between 17 dB and 24 dB to find the best operation point for the parameterizable mappers and demappers.
In Fig. 13, we show the performance of the different constellations in terms of their BMI over a varying laser linewidth. The launch power has been optimized between 0 dBm and 3 dBm in steps of 1 dBm and for all constellations, a launch power of 2 dBm has shown the best performance. As for the σ n input parameter for the NNs, the best performance for all constellations has been observed for a parameter corresponding to an SNR ≈ 18 dB. Validation with the DP-SSFM model confirms our observations made on the linear channel model in Fig. 6 and in Fig. 11. The parametrizable GeoPCS constellation outperforms both pGCS and QAM constellations. We computed the E2E SNR in our simulations to be between 19 dB and 20 dB. Comparing the results in Fig. 13 with results obtained on the linear channel, we observe a significant impact by additional non-linear impairments and residual inter-symbol interference.

VII. CONCLUSION
We presented and analyzed bitwise auto-encoders for joint optimization of geometric and probabilistic shaping for Wiener phase noise channels with carrier phase estimation. We used a differentiable carrier phase estimation (CPE) algorithm which allows for efficient end-to-end optimization using gradientdescent. With the proposed system, optimization of GCS leads to a constellation robust to phase noise and supports the operation of the CPE. A further extension to joint geometric and probabilistic constellation shaping provides additional shaping gain compared to GCS. In this work, we showed that joint geometric and probabilistic shaping can be applied to communication systems impaired by AWGN and Wiener phase noise with BPS as the carrier phase estimation algorithm. With a parameterizable mapper, demapper and probabilistic shaper, we are able to outperform square QAM by more than 0.1 bit/symbol for low SNR values. Comparison with gradient-free methods and a study of their computational complexity and convergence speed is planned future work.