A Parametric Network for the Global Compensation of Physical Layer Linear Impairments in Coherent Optical Communications

This paper proposes a parametric network for the joint compensation of multiple linear impairments in coherent optical communication systems. The considered linear impairments include both static and time-variant effects such as in-phase/quadrature (IQ) imbalance, laser phase noise (PN), chromatic dispersion (CD), polarization mode dispersion (PMD), and carrier frequency offset (CFO). To jointly compensate for these impairments, the proposed network is composed of parametric layers that exploit the particular signal model of each impairment. The layers' parameters are jointly learned during a training stage. This stage uses a supervised step that exploits the knowledge of some transmitted data (preamble and/or pilots) and a self-labeling step that uses the knowledge of the symbol constellation. In addition, a new validation technique that does not require a different dataset is developed to avoid overfitting. The parametric network performance is compared to classical digital signal processing (DSP) and Deep Learning (DL) approaches using simulated data. Simulation results show that the proposed network outperforms the competing approaches in terms of Bit Error Rate (BER) while maintaining a relatively reduced computational complexity. In the scenarios considered, compared to the parametric network, the DSP approach introduces an OSNR penalty between 0.2 dB and 1.7 dB at a BER of ${4\times 10^{-3}}$. Furthermore, simulation results demonstrate that the proposed network is more flexible than other approaches since it can easily be adapted to a different scenario and coupled with other techniques.


I. INTRODUCTION
COHERENT optical communication is the leading technology used for long-haul transmissions as it can meet the stringent requirements of high data rates [1], [2]. At these data rates, the impact of the impairments on communication performance can be very severe. Among the most important imperfections of the optical chain are laser phase noise (PN) [3], in-phase/quadrature (IQ) imbalance [4], chromatic dispersion (CD) [5], polarization mode dispersion (PMD) [6], carrier frequency offset (CFO) [7], and fiber nonlinearities [8]. Multiple local digital signal processing (DSP) techniques that benefit from the model knowledge have been developed to mitigate these imperfections. A local DSP approach compensates for one or a few imperfections of the optical chain in particular scenarios. Local compensation algorithms are designed to operate at the transmitter side (pre-compensation, pre-distortion), at the receiver side (post-compensation), or using a hybrid strategy. While pre-compensation or pre-distortion algorithms, such as the techniques in [9], [10], [11], relax the stringent computational demands on the receiver side, they are mainly designed to compensate for transmitter impairments only, since the pre-compensation of receiver impairments requires communicating information between the receiver and transmitter sides. The local algorithms are commonly deployed at the receiver side to avoid this limitation. Post-compensation algorithms are designed to operate in a blind, data-aided, or hybrid manner to compensate for laser PN [12], [13], [14], [15], IQ imbalance [16], [17], [18], CD [19], [20], [21], PMD [22], [23], [24], CFO [25], [26], [27], and fiber nonlinearities [28], [29], [30]. Although these techniques have proved effective, using local algorithms for imperfection compensation can be problematic, as their performance may be impacted in a complex fashion by the presence of other impairments.
In addition, most algorithms are developed for particular scenarios and/or modulation formats, and their applicability is limited to these cases. In [31], a global DSP approach was proposed to jointly compensate for multiple imperfections of the optical chain. Despite this advantage, the approach lacks flexibility because it may be difficult to extend it to more complex scenarios, including complementary imperfections and/or alternative modeling of the considered imperfections. Recently, different Machine Learning/Deep Learning (ML/DL) techniques were developed to mitigate multiple impairments of the optical chain, such as fiber nonlinearities, laser PN, and others [32], [33], [34]. These techniques are able to improve the performance of the system. However, they require a large signal dataset and have difficulty tracking time-variant imperfections. Moreover, it is difficult for these approaches to benefit from the model knowledge of the system, and their computational complexity may currently be prohibitive [35].
To benefit from the advantages of both model-based and ML/DL techniques, different approaches that aim to insert the system knowledge into an ML/DL network have been proposed [36], [37], [38], [39]. For coherent optical communications, most of the model-based ML/DL techniques focus on the compensation of nonlinear impairments related to fiber propagation [40], [41], [42], [43], [44], [45]. Compared to the classical digital backpropagation (DBP) [46], [47], [48], it was demonstrated that a model-based ML/DL approach improves the compensation of the nonlinear imperfections or reduces the computational complexity. Model-based ML/DL approaches are usually combined with local DSP algorithms to mitigate the nonlinear and linear impairments. Unfortunately, integrating DL approaches into the classical DSP chain is a complicated task, and advanced training strategies are generally developed because of the different time-variant linear imperfections that impact the batches differently [41], [42], [44], [49]. Even though the integration of DL networks into the classical DSP chain has attracted much interest recently, integrating the parametric model of linear impairments into a compensation network has not been investigated.
This paper proposes a model-based network that globally compensates for multiple linear imperfections of the dual-polarization (DP) optical chain. This parametric network aims to benefit from the model knowledge of the system to reduce the size of the training dataset and allow the tracking of time-varying impairments. The main contributions of this paper are twofold.
• First, we introduce a multi-layer parametric network based on the impairments model of the coherent optical chain. This technique can be used with different modulation formats. The network is easily adaptable since new impairments can be easily compensated for, and their positions can be freely arranged in the network architecture. Concerning the statistical performance, the proposed network outperforms some classical DSP algorithms and DL techniques;
• Secondly, we propose a hybrid technique based on the knowledge of the symbol constellation for network training and validation. Compared to the conventional approach based on a large dataset, the proposed technique allows training of the network with a small number of pilot or preamble data. Furthermore, this technique avoids the need for a separate validation database.
Compared to our previous global compensation study [31], the main difference between the two approaches is in the use of a multi-layer parametric network. While our previous study focuses on a parametric model designed for single-polarization (SP) systems, the proposed study inherits the flexibility of multi-layer networks and can be deployed in a broader range of scenarios, such as DP systems. Furthermore, while our previous approach is difficult to combine with other ML/DL techniques (for example, for nonlinearity mitigation), using a multi-layer network allows us to easily integrate the proposed approach with other networks.
The remainder of the paper is organized as follows. In Section II, the model of the system is developed. Then, in Section III, the proposed network architecture is presented. In Section IV, the results are analyzed. Finally, some ideas are discussed in Section V, and the paper is concluded in Section VI.

II. SIGNAL MODEL
In this section, the signal model is developed. First, a generic communication model is introduced in Section II-A. Then, in Section II-B, the specific DP coherent optical system model is detailed.

A. GENERIC COMMUNICATION CHAIN
A generic 2 × 2 multiple-input multiple-output (MIMO) system is considered. Let us denote a data vector of length N containing symbols corresponding to a single input signal as $x_p = [x_p[0], \ldots, x_p[N-1]]^T$, where p represents the input index, $x_p[n] \in S$, and S is a finite alphabet composed of |S| complex elements (e.g., PSK, QAM). Therefore, the resulting 2 × 2 MIMO signal is defined by vertically stacking the two inputs (p = 1 and p = 2) as $x = [x_1^T, x_2^T]^T$. This signal undergoes several distortions mainly caused by hardware imperfections and channel effects. A generic communication chain composed of L impairments is illustrated in Figure 1, where x is the transmitted signal, and y is the received signal. The impairments modeling is extended from [38] for the case of a 2 × 2 MIMO communication. By assuming only linear imperfections, each impairment can be mathematically modeled by:
$$\tilde{x}_{i+1} = F_i(\alpha_i)\,\tilde{x}_i \qquad (1)$$
where the tilde denotes the real augmented versions of the original complex data, $\tilde{x}_{i+1}$ is the output signal, $\tilde{x}_i$ is the input signal, and $F_i(\alpha_i)$ is the real-valued transfer matrix depending on some unknown parameters $\alpha_i$ of the i-th impairment. By also including a noise component, the received signal can be expressed as follows:
$$\tilde{y} = F(\alpha)\,\tilde{x} + \tilde{b} \qquad (2)$$
where $F(\alpha)$ is a real-valued 4N × 4N accumulated transfer matrix that can be expressed as:
$$F(\alpha) = F_L(\alpha_L) \cdots F_1(\alpha_1) \qquad (3)$$
$\alpha = [\alpha_1^T, \ldots, \alpha_L^T]^T$ is a column vector containing the real-valued system's parameters, and $\tilde{b}$ corresponds to the noise contribution. This step is denoted as the forward propagation as the signal goes from transmitter to receiver. It is important to note that the matrix expressions in (3) and (1) are generic mathematical representations. In practice, the physical structure of some impairments allows expressing (1) in a scalar form, which significantly reduces the computational complexity.
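To illustrate the real augmented representation used in (1), the following minimal sketch builds the real-valued matrix of a widely linear map $y = Ax + Bx^*$ and compares it with the complex computation. The stacking order (real parts first, then imaginary parts) and the helper names `augment` and `augmented_matrix` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def augment(x):
    """Stack real and imaginary parts: C^n -> R^(2n) (assumed ordering)."""
    return np.concatenate([x.real, x.imag])

def augmented_matrix(A, B):
    """Real 2n x 2n matrix F with augment(A @ x + B @ conj(x)) == F @ augment(x)."""
    top = np.hstack([(A + B).real, -(A - B).imag])
    bottom = np.hstack([(A + B).imag, (A - B).real])
    return np.vstack([top, bottom])

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
x = rng.normal(size=n) + 1j * rng.normal(size=n)

y_complex = A @ x + B @ np.conj(x)            # complex-domain computation
y_real = augmented_matrix(A, B) @ augment(x)  # same result in the real domain
```

A purely linear impairment corresponds to B = 0, while an impairment that mixes the signal with its conjugate (such as IQ imbalance) needs both terms, which is why the real augmented form is convenient.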

B. DP COHERENT OPTICAL SYSTEM WITH MULTIPLE IMPAIRMENTS
A DP coherent optical communication can be seen as a particular 2×2 MIMO system, where each input signal corresponds to a polarization (p ∈ {X, Y}). This section details the mathematical model of such a system impaired by multiple imperfections. For simplicity, except for PMD, we express the impact of the impairments for the SP case. Consequently, we use the generic x p,i as input, and x p,i+1 as output.
The complete model of a typical DP coherent optical system is presented in Figure 2. On the transmitter side, the signals are modulated, upsampled, and filtered using a Root-Raised-Cosine (RRC) filter. Then, the resulting signals are sent to the receiver over the communication channel. During all these processes, multiple imperfections occur. The impairment parameters are marked in red in the figure.

1) LASER PHASE NOISE
The spontaneous emission that appears in semiconductor lasers broadens the laser linewidth and introduces a PN [50], which is one of the most critical impairments that impact coherent optical systems [12]. It is a time-variant effect induced by both transmitter and receiver lasers. The laser PN is modeled as a Wiener process [51], [52]: where the terms $f[l]$ are independent and identically distributed Gaussian random variables with zero mean and variance $\sigma_f^2 = 2\pi\,\delta f\,T_s$, with $T_s$ being the symbol period and $\delta f$ the laser linewidth. Under this assumption, the impact of PN on an SP signal can be expressed as follows:
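The Wiener phase-noise model can be simulated with a few lines of NumPy. The sketch below assumes that the phase is the running sum of the increments $f[l]$ and that it acts as a multiplicative rotation on the signal; the function name `wiener_phase_noise` and the 32 GBd symbol rate are hypothetical choices made for illustration, while the 100 kHz linewidth is the value used later in the simulations.

```python
import numpy as np

def wiener_phase_noise(n_symbols, linewidth_hz, symbol_period_s, rng):
    """Running sum of i.i.d. zero-mean Gaussian increments f[l]."""
    sigma2 = 2 * np.pi * linewidth_hz * symbol_period_s   # variance of f[l]
    increments = rng.normal(0.0, np.sqrt(sigma2), size=n_symbols)
    return np.cumsum(increments)

rng = np.random.default_rng(1)
symbols = np.exp(1j * np.pi / 4) * np.ones(1000)   # constant unit-modulus symbol
# 32 GBd is an assumed symbol rate, not a value from the paper.
phase = wiener_phase_noise(symbols.size, 100e3, 1 / 32e9, rng)
received = symbols * np.exp(1j * phase)            # phase rotation, amplitude unchanged
```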

2) IQ AMPLITUDE AND PHASE IMBALANCE
Multiple imperfections related to the amplifiers gain differences, unequal split and/or combining ratios of the couplers, phase control shift, or diverse manufacturing problems of the optical modulators and/or 90° hybrids can introduce an IQ imbalance [53]. This static linear impairment can impact coherent optical systems on the transmitter and receiver sides. Amplitude imbalance, denoted as g, refers to the amplitude difference between the two IQ branches. Phase imbalance ϑ refers to the phase deviation from the ideal 90° between the two IQ signals [4], [17]. The IQ imbalance can be modeled as in [54], [55]: where $(\mu, \nu) \in \mathbb{C}^2$. The impact of the IQ imbalance on the input signal can be modeled as follows [56], [57]: where $(\cdot)^*$ denotes the complex conjugate operation.

3) CHROMATIC DISPERSION
The CD is a static linear impairment that appears because the group velocity of the propagating signals in fiber optics is frequency-dependent. This leads to a spread in time of the optical pulses and can limit the transmission distance or data rate [21]. The CD can be modeled as an all-pass filter whose frequency transfer function is given by [58], [59]: where Dz is the accumulated CD coefficient represented by the product between the fiber dispersion coefficient D and the fiber length z, λ the wavelength, c the speed of light, and ω the angular frequency with respect to the sampling frequency $f_s$. The impact of the CD on the output signal can be modeled as: where $\mathcal{F}$ and $\mathcal{F}^{-1}$ represent the discrete Fourier transform and the inverse discrete Fourier transform operations, respectively.

4) POLARIZATION MODE DISPERSION
Due to imperfections in manufacturing and/or mechanical stress, fibers experience some amount of birefringence. This optical birefringence and the random variation of the birefringent axes' orientation along the fiber length lead to the appearance of the time-variant PMD [6]. In single-mode fiber communications, it introduces a differential group delay (DGD) between the two principal states of polarization (PSP) [6]. The frequency response of PMD can be expressed as in [22]: where θ is the angle between the PSP of the fiber and the reference polarizations, and τ is the DGD between PSPs. The impact of the PMD on the input signals can be expressed as follows:

5) CARRIER FREQUENCY OFFSET
The CFO is a static linear impairment that appears because of the frequency difference between the transmitter and receiver lasers [60]. The impact of the CFO on the output signal can be modeled as follows [61], [62]: where F is the frequency difference between the two lasers.
In DP systems, some imperfections are assumed to have a similar impact on both polarizations, while others impact them differently. Moreover, some imperfections are assumed to be quasi-constant over multiple data blocks, so they are considered static, while others may have a relatively fast evolution. The classification of impairments regarding their time evolution and polarization impact is presented in Table 1. In this work, we consider that laser PN, CD, and CFO similarly impact the two polarization signals, while the IQ imbalance can affect the two polarization signals differently. The laser PN and PMD are time-variant, while the IQ imbalance, CD, and CFO are assumed to be static over the data blocks.

III. PROPOSED NETWORK ARCHITECTURE
On the receiver side, the objective is to recover the transmitted signals x from the received signals y. This can be done using a Zero-Forcing (ZF) equalizer that inverts the global transfer matrix [63]. Using this approach, the augmented transmitted signal can be estimated by
$$\hat{\tilde{x}} = B\,\tilde{y} \qquad (14)$$
where $B = F^{-1}(\alpha)$. Instead of inverting the matrix $F(\alpha)$ explicitly, a simple alternative is to estimate the matrix B directly using a training database. This approach can be implemented using a Least Squares (LS) estimator [64]. However, this approach has some major drawbacks:
• the time variability of some impairments makes the training ineffective, as the transfer matrix varies between the training and testing operations;
• as B is a 4N × 4N matrix, $16N^2$ parameters must be estimated, which requires a large training database.
To overcome these limitations, this section describes a parametric-based network for the global impairments compensation. This section is divided into four parts: first, the structure of the network is presented in Section III-A, then its application to the compensation of linear impairments in optical DP systems is described in Section III-B, and finally, the network training and validation stages are detailed in Sections III-C and III-D.
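The cost of learning a dense equalizer can be made concrete with a toy example. The sketch below assumes a noiseless chain and a randomly generated, well-conditioned transfer matrix (both are illustrative assumptions); it applies the ZF equalizer and counts the entries that an unstructured estimator of B would have to learn.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8                                   # symbols per input (toy size)
dim = 4 * N                             # real augmented dimension for 2 inputs

# Hypothetical accumulated transfer matrix: identity plus a small perturbation.
F = np.eye(dim) + 0.05 * rng.normal(size=(dim, dim))
x_aug = rng.choice([-1.0, 1.0], size=dim)
y_aug = F @ x_aug                       # noiseless forward propagation

B = np.linalg.inv(F)                    # zero-forcing equalizer
x_hat = B @ y_aug

n_unknowns_dense = B.size               # 16 * N**2 entries without a model
```

Even for N = 8 the dense matrix has 1024 entries, and for a block of N = 600 symbols it would have 5.76 million, which motivates the parametric layers described next.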

A. PARAMETRIC-BASED COMPENSATION NETWORK
To compensate for the imperfections, we assume that any impairment parameterized by a particular set of parameters α i can be compensated by applying the same impairment with a compensation set of parameters β i . Using this assumption, the global compensation can be obtained by using a multi-layer parametric network composed of L layers.
A comparison between a conventional Multi-layer Perceptron (MLP) network and the proposed multi-layer parametric network is depicted in Figure 3. For the proposed parametric network, each layer describes a parametric linear operation that compensates for the effect of a particular imperfection. Mathematically, the compensated signal at the output of the i-th layer can be described by $\tilde{y}_{i+1} = F_i(\beta_i)\,\tilde{y}_i$, where $\tilde{y}_{i+1}$ is the compensated signal, $\tilde{y}_i$ is the signal before the corresponding layer compensation, $F_i(\beta_i)$ is the real-valued transfer matrix of the layer, and $\beta_i$ is the compensation vector parameter. Compared to the compensation in (14) and the MLP, the proposed network drastically reduces the total number of unknown parameters β. Finally, the output of the parametric network can be expressed as $B(\beta)\,\tilde{y}$, where the accumulated transfer matrix $B(\beta)$ is the product of the layer matrices $F_i(\beta_i)$. Based on (16), the network layers are ordered reversely compared to the imperfections. Consequently, we denote this step as backward propagation. By using the proposed parametric network, new impairments may be considered just by inserting new layers into the network. Also, their order can be freely arranged in the network architecture based on the knowledge of the system. The network could be coupled with model-based DL networks (such as the ones used for nonlinearity compensation) by adding or changing layers in the network architecture. In the same way, some layers could be removed from the network, and the related impairments could be compensated for by using digital pre-distortion or pre-compensation techniques. Other scenarios may include using the classical DSP algorithms for a coarse impairment estimation and then employing the proposed network to improve the performance, or passing the compensated data at the network's output to a network designed for detection purposes.
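The role of the reversed layer order can be seen on a toy chain of two real-valued impairments that do not commute. In the sketch below, the rotation and gain matrices are hypothetical stand-ins for the impairment matrices $F_i(\alpha_i)$; each one is undone by the same matrix family evaluated at a compensation parameter, as assumed in Section III-A.

```python
import numpy as np

def rotation(angle):
    """2 x 2 rotation matrix; rotation(-angle) is its inverse."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def gain(g):
    """Diagonal gain on the second component; gain(1 / g) is its inverse."""
    return np.diag([1.0, g])

x = np.array([1.0, -1.0])
alpha = (0.3, 1.5)                       # impairment parameters (illustrative)

# Forward propagation: rotation first, then gain.
y = gain(alpha[1]) @ rotation(alpha[0]) @ x

# Backward propagation: undo the gain first, then the rotation.
beta = (-alpha[0], 1.0 / alpha[1])       # compensation parameters
x_hat = rotation(beta[0]) @ gain(beta[1]) @ y

# Same layers applied in the forward order: the signal is not recovered.
x_wrong = gain(beta[1]) @ rotation(beta[0]) @ y
```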

B. PARAMETRIC NETWORK FOR COHERENT OPTICAL SYSTEMS
The complete block diagram of the system of interest with the compensation network architecture in the lower part can be seen in Figure 4. In the following, the design of the compensation layers is detailed. In addition to impairments compensation, the proposed network implements other non-learnable static operations like matched filtering and re-sampling.

1) PHASE NOISE COMPENSATION LAYER
In high-baud-rate optical communication systems, it can be assumed that the laser phase changes slowly compared to the signal phase [65]. Consequently, we can assume that the laser phase is constant for a block of data of length K [66]. This assumption leads to a coarser description of the laser phase evolution, but it reduces the number of parameters to estimate. The parameter that compensates for the laser phase noise in (5) can be denoted as $\beta_1$. In practice, to reduce the number of layer parameters, we propose to set K > 1. The compensated complex signal at the output of this layer can be expressed as follows:
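A block-constant phase layer can be sketched as follows. The layer is assumed here to multiply each block of K samples by $e^{j\beta_1[b]}$, with one entry of $\beta_1$ per block; the function name `phase_layer` and the sign convention are assumptions made for this illustration.

```python
import numpy as np

def phase_layer(y, beta1, K):
    """Rotate each block of K consecutive samples by its own phase beta1[b]."""
    phases = np.repeat(beta1, K)[: y.size]   # expand one phase per block to per sample
    return y * np.exp(1j * phases)

K = 25
y = np.exp(1j * 0.3) * np.ones(100, dtype=complex)   # block with a 0.3 rad phase error
beta1 = np.full(y.size // K, -0.3)                    # 4 parameters instead of 100
x_hat = phase_layer(y, beta1, K)
```

With K = 25, a block of 100 samples needs only 4 phase parameters, which is the reduction the text refers to.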

2) IQ IMBALANCE COMPENSATION LAYER
The IQ imbalance compensation layer can be expressed as follows: where the complex numbers $\beta_{2,p}$ and $\beta_{3,p}$ correspond to the layer parameters. It can be checked that setting allows compensating for the IQ imbalance imperfection in (8).
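As a hedged illustration, the sketch below assumes the widely linear impairment model $y = \mu x + \nu x^*$ and a compensation layer of the same form, $\beta_2 y + \beta_3 y^*$. Under these assumptions, solving the pair of equations for $y$ and $y^*$ gives $\beta_2 = \mu^*/(|\mu|^2 - |\nu|^2)$ and $\beta_3 = -\nu/(|\mu|^2 - |\nu|^2)$. The numerical values of μ and ν are hypothetical.

```python
import numpy as np

def iq_layer(y, beta2, beta3):
    """Widely linear layer: beta2 * y + beta3 * conj(y)."""
    return beta2 * y + beta3 * np.conj(y)

mu, nu = 0.98 + 0.03j, 0.08 - 0.05j      # hypothetical imbalance coefficients
rng = np.random.default_rng(4)
levels = np.array([-3, -1, 1, 3])
x = rng.choice(levels, 32) + 1j * rng.choice(levels, 32)   # 16-QAM symbols

y = mu * x + nu * np.conj(x)             # assumed impairment model

det = abs(mu) ** 2 - abs(nu) ** 2
beta2, beta3 = np.conj(mu) / det, -nu / det
x_hat = iq_layer(y, beta2, beta3)
```

Setting $\beta_2 = 1$ and $\beta_3 = 0$ leaves the signal unchanged, which matches the initialization used in the simulations.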

3) CARRIER FREQUENCY OFFSET COMPENSATION LAYER
The complex signal at the output of the CFO compensation layer can be expressed as follows: It can be verified that setting $\beta_4 = -F$ perfectly compensates for the CFO impairment in (13).
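The CFO layer can be sketched as a linearly increasing phase rotation. The form $e^{2j\pi \beta_4 n T}$ and the sample period T = 1/(32 GHz) are assumptions made for illustration; the 200 MHz offset is the value used in the simulations.

```python
import numpy as np

def cfo_layer(y, beta4, sample_period):
    """Rotate sample n by 2*pi*beta4*n*sample_period radians."""
    n = np.arange(y.size)
    return y * np.exp(2j * np.pi * beta4 * n * sample_period)

T = 1 / 32e9                     # assumed sample period
cfo_hz = 200e6                   # frequency offset between the two lasers
x = np.exp(1j * np.pi / 4) * np.ones(256)

y = cfo_layer(x, cfo_hz, T)      # impairment: same layer with the true offset
x_hat = cfo_layer(y, -cfo_hz, T) # compensation: same layer with the negated offset
```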

4) POLARIZATION MODE DISPERSION COMPENSATION LAYER
PMD is an impairment that depends on the parameters θ and τ. The complex signal at the output of the compensation layer can be expressed as follows: It can be checked that setting $\beta_5 = \theta$ and $\beta_6 = -\tau$ perfectly compensates for the PMD impairment in (12).
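The following sketch illustrates why keeping the angle and negating the delay inverts a first-order PMD element. It assumes the common Jones-matrix form $H(\omega) = R(\theta)\,\mathrm{diag}(e^{-j\omega\tau/2}, e^{j\omega\tau/2})\,R(\theta)^T$, which may differ from the exact expression of the paper; the function names and the 64 GHz sampling frequency are hypothetical.

```python
import numpy as np

def pmd_matrix(theta, tau, omega):
    """Assumed Jones matrices R(theta) D(tau) R(theta)^T, one per frequency."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    H = np.zeros((omega.size, 2, 2), dtype=complex)
    for k, w in enumerate(omega):
        D = np.diag([np.exp(-1j * w * tau / 2), np.exp(1j * w * tau / 2)])
        H[k] = R @ D @ R.T
    return H

def pmd_layer(x, theta, tau, fs):
    """Apply the PMD element to a (2, N) dual-polarization signal."""
    omega = 2 * np.pi * np.fft.fftfreq(x.shape[1], d=1 / fs)
    X = np.fft.fft(x, axis=1)
    Y = np.einsum("kij,jk->ik", pmd_matrix(theta, tau, omega), X)
    return np.fft.ifft(Y, axis=1)

rng = np.random.default_rng(5)
x = rng.normal(size=(2, 128)) + 1j * rng.normal(size=(2, 128))
theta, tau, fs = 0.4, 10e-12, 64e9       # fs is an assumed sampling frequency

y = pmd_layer(x, theta, tau, fs)         # impairment
x_hat = pmd_layer(y, theta, -tau, fs)    # compensation: same angle, negated DGD
```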

5) CHROMATIC DISPERSION COMPENSATION LAYER
CD is an impairment that depends on the parameter Dz, representing the accumulated chromatic dispersion. The complex signal at the output of this compensation layer can be expressed as follows: It can be checked that setting $\beta_7 = -Dz$ perfectly compensates for the chromatic dispersion in (10).
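A frequency-domain CD layer can be sketched as below. The transfer function $H(\omega) = \exp(-j\,Dz\,\lambda^2 \omega^2 / (4\pi c))$ is a commonly used all-pass form and is an assumption here (the sign convention of the paper may differ); the 1550 nm wavelength and 64 GHz sampling frequency are also illustrative. The accumulated dispersion of 17000 ps/nm equals 17 s/m in SI units.

```python
import numpy as np

C_LIGHT = 299_792_458.0   # speed of light in m/s

def cd_layer(x, Dz, wavelength, fs):
    """All-pass quadratic-phase filter applied in the frequency domain."""
    omega = 2 * np.pi * np.fft.fftfreq(x.size, d=1 / fs)
    H = np.exp(-1j * Dz * wavelength ** 2 * omega ** 2 / (4 * np.pi * C_LIGHT))
    return np.fft.ifft(H * np.fft.fft(x))

rng = np.random.default_rng(6)
x = rng.normal(size=256) + 1j * rng.normal(size=256)

Dz = 17e-6 * 1000e3                      # 17 ps/(nm km) over 1000 km, in s/m
wavelength, fs = 1550e-9, 64e9           # assumed values

y = cd_layer(x, Dz, wavelength, fs)      # impairment
x_hat = cd_layer(y, -Dz, wavelength, fs) # compensation with the negated parameter
```

Because the filter is all-pass, it changes the pulse shape but not the signal energy, so the inverse is simply the same filter with the opposite accumulated dispersion.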

C. NETWORK TRAINING
During training, the network estimates the unknown vector parameter $\beta = \{\beta_1, \ldots, \beta_L\}$ corresponding to the compensation parameters. The training can be performed by minimizing the following cost function: where $\tilde{x}_{\text{target}}$ and $\tilde{x}_{\text{out}}$ represent the target (desired output of the network) and the actual output of the network, respectively.
Regarding the target symbols $\tilde{x}_{\text{target}}$, we consider the particular frame structure presented in Figure 5. The frame structure is composed of a preamble of $N_0$ symbols, $\tilde{x}_0$, and P pilot symbols, $\tilde{x}_1$, inserted periodically in each data block of length $N_b$. The preamble and the pilots are random symbols drawn from the constellation employed. We propose to divide the training stage into three steps. First, a global estimation is performed by training the network using the preamble as target data. Secondly, pilot-based training is employed to re-estimate the time-variant parameters over the data blocks. Finally, self-labeling-based training is performed for a finer re-estimation of time-variant parameters.

1) PREAMBLE-BASED TRAINING
During the preamble training, the network aims to globally estimate the system parameters β. All the gray-colored blocks from Figure 4 are trained in a supervised manner during this step. The target is represented by the transmitted preamble symbols, $\tilde{x}_{\text{target}} = \tilde{x}_0$, and the output of the network by the compensated received preamble, $\tilde{x}_{\text{out}} = B(\beta)\tilde{y}_0$, where $\tilde{y}_0$ corresponds to the received preamble symbols.

2) PILOT-BASED TRAINING
This training is performed in a supervised manner. It uses the pilot symbols $\tilde{x}_1$ as a target to track the time-variant impairments that impact the communication chain. During this step, the blocks with a red contour in Figure 4 update their parameters, while the other blocks use the parameters estimated during the preamble-based training. The target is represented by the transmitted pilot symbols, $\tilde{x}_{\text{target}} = \tilde{x}_1$, and the output of the network by the compensated received pilots, $\tilde{x}_{\text{out}} = PB(\beta)\tilde{y}$, where P is a matrix selecting the compensated pilot symbols from the network output.

3) SELF-LABELING TRAINING
The constellation alphabet is exploited during this step to improve the network generalization. This technique could be seen as a decision-directed least mean squares (DD-LMS) algorithm [67] performed after a data-aided approach. In this case, however, the parameters are updated by blocks, not by sample as for DD-LMS. In our work, we use the ML-specific term self-labeling to denote this approach. The network output corresponds to the compensated signal $\tilde{x}_{\text{out}} = B(\beta)\tilde{y}$. We denote by $\tilde{x}_2 = S(\tilde{x}_{\text{out}})$ the $N_b$ detected symbols, where $S(x) = \arg\min_{s \in S} \|x - s\|^2$ is the orthogonal projector onto the constellation set. The self-labeling training step updates the time-variant parameters of the red contour blocks in Figure 4 by using the detected symbols as target, $\tilde{x}_{\text{target}} = \tilde{x}_2$. This last step is designed to perform a finer estimation of the time-variant parameters.
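The projection onto the constellation is a nearest-symbol decision, which can be sketched in a few lines. The unnormalized 16-QAM alphabet and the function name `project` are assumptions made for this example.

```python
import numpy as np

def project(x_out, constellation):
    """Map every sample to the closest constellation point (hard decision)."""
    dist = np.abs(x_out[:, None] - constellation[None, :]) ** 2
    return constellation[np.argmin(dist, axis=1)]

levels = np.array([-3, -1, 1, 3])
qam16 = (levels[:, None] + 1j * levels[None, :]).ravel()   # 16 complex points

x_out = np.array([0.9 + 1.2j, -2.7 - 3.4j, 3.3 - 0.8j])    # compensated samples
x_target = project(x_out, qam16)                            # self-generated labels
```

The detected symbols `x_target` then play the role of the target in the cost function, in place of known pilots.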

D. VALIDATION
During the pilot-based training, the number of pilot symbols is reduced and is relatively close to the number of unknown parameters. In this case, the proposed network architecture may suffer from overfitting. Generally, a validation dataset is used to avoid this problem. In this study, we propose an alternative approach for the validation stage that does not require any additional dataset. This approach exploits the symbol constellation. The unknown received data, which represents the testing data, is detected at a specific interval. Then, the cost function from (23) is computed by considering $\tilde{x}_{\text{target}} = \tilde{x}_2$ and $\tilde{x}_{\text{out}} = B(\beta)\tilde{y}$. If the value of the cost function stops decreasing with the number of iterations, the training is stopped.
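The validation rule can be sketched as an early-stopping test on the projection error. The mean squared distance to the nearest constellation point is used here as the validation cost, and the `patience` argument is a hypothetical addition (the text only states that training stops when the cost stops decreasing).

```python
import numpy as np

def projection_error(x_out, constellation):
    """Mean squared distance between the outputs and their nearest symbols."""
    dist = np.abs(x_out[:, None] - constellation[None, :]) ** 2
    return float(np.mean(np.min(dist, axis=1)))

def should_stop(errors, patience=1):
    """True when the last `patience` checks did not improve on the earlier best."""
    if len(errors) <= patience:
        return False
    return min(errors[-patience:]) >= min(errors[:-patience])

qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
history = [projection_error(np.array([1.3 + 0.8j, -0.7 - 1.1j]), qpsk),
           projection_error(np.array([1.1 + 0.9j, -0.9 - 1.0j]), qpsk),
           projection_error(np.array([1.2 + 0.9j, -0.9 - 1.2j]), qpsk)]
stop_now = should_stop(history)   # the third check is worse than the second
```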
The training and validation operations are detailed in Algorithm 1.

IV. SIMULATION RESULTS
In this section, we present the proposed parametric network's statistical performance and computational complexity by using simulated results over the testing data. The performance is evaluated using the Mean Squared Error (MSE), Bit Error Rate (BER), and Error Vector Magnitude (EVM) metrics. Also, we provide the performance and complexity comparisons to some classical DSP algorithms and a DL-based approach.

A. PERFORMANCE IN A DP COHERENT SYSTEM

1) IMPLEMENTATION DETAILS
The signal generation and the impact of the imperfections are simulated using Python scientific libraries such as NumPy [68] and SciPy [69]. On the receiver side, the compensation network is implemented using the PyTorch framework [70], and the training is performed using the ADAM optimization algorithm [71] with learning rates (LRs) of $10^{-4}$ and $10^{-2}$ for preamble training and tracking training, respectively. The system parameters can be seen in Table 2.

Algorithm 1 Network Training and Validation
Initialize parameters: β ← β_0
Preamble-based training
  Set stop_condition
  x̃_target ← x̃_0
  while stop_condition = false do
    x̃_out ← B(β)ỹ_0
    ...
    if i mod p = 0 then
      x̃_out ← B(β)ỹ
      x̃_target ← x̃_2 = S(x̃_out)
      ...
      if projection_error stops decreasing then
        stop_condition = true
      end if
    end if
  end while
Self-labeling training
  Set stop_condition
  while stop_condition = false do
    x̃_out ← B(β)ỹ
    x̃_target ← x̃_2 = S(x̃_out)
    Compute the loss: x̃_target − x̃_out
    ...
A total of 600600 symbols were considered for each independent simulation. Of these, 600 were used as a preamble and 20000 as pilots (inserted periodically, every 30th symbol is a pilot), resulting in an overhead of 3.4%. The remaining 580000 symbols were used as the payload. Each data block contains 600 symbols. 4-QAM and 16-QAM are employed in this work, but more advanced modulations, like 64-QAM and 256-QAM, could be employed, depending on the scenario considered. Regarding the impairments parameters, during all the simulations, we consider a laser linewidth δf of 100 kHz and a CFO of 200 MHz. The dispersion coefficient has a value of D = 17 ps/(nm·km) and the fiber length has a value of z = 1000 km, resulting in an accumulated CD of 17000 ps/nm. The PMD is assumed to be constant for a block of $N_b$ consecutive symbols, and its parameters are randomly chosen from the intervals θ ∈ [−π/2, π/2) and τ ∈ [5 ps, 20 ps], respectively. If not stated otherwise, the IQ imbalance parameters are 1 dB and 10°, both on transmitter and receiver.
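The frame budget stated above can be checked with a short calculation. The variable names are chosen for this example only; all numerical values come from the text.

```python
n_total = 600_600        # symbols per simulation
n_preamble = 600
pilot_period = 30        # every 30th symbol after the preamble is a pilot
block_length = 600

n_pilots = (n_total - n_preamble) // pilot_period        # 20000
n_payload = n_total - n_preamble - n_pilots               # 580000
overhead = (n_preamble + n_pilots) / n_total              # about 0.034

n_blocks = (n_total - n_preamble) // block_length         # 1000 data blocks
pilots_per_block = block_length // pilot_period           # 20 pilots per block
```

Each 600-symbol block therefore carries only 20 pilots, which explains why the number of tracked parameters per block has to stay small during the pilot-based training.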
As the ADAM optimizer is a local optimization algorithm, the network parameters must be initialized. In each simulation, the parameters are initialized as follows:
• IQ imbalance: $\beta_2^{X/Y,\mathrm{TX/RX}} = 1$ and $\beta_3^{X/Y,\mathrm{TX/RX}} = 0$, the case for which the signal is not impacted by this imperfection;
• CFO: $\beta_4$ is randomly chosen from an interval corresponding to the original impairment F ∈ [187.5 MHz, 212.5 MHz];
• CD: $\beta_7$ is randomly chosen from an interval related to a fiber length between 995 km and 1005 km. The interval could be increased, but prior knowledge about the fiber length and dispersion coefficient is generally available; therefore, we reduced the interval to minimize the computational cost;
• PMD: $\beta_5 = 0$ and $\beta_6 = 0$ ps, the case for which the signal is not impacted by this imperfection;
• Laser PN: during the preamble-based estimation, all the laser phase values $\beta_1^{\mathrm{TX/RX}}$ are initialized with 0. During the pilot-based training, they are re-initialized with the last estimated phase of the previous block.
In addition, the validation error is computed every 20th iteration to obtain a good compromise between computational cost and statistical performance. Table 3 summarizes the network and optimizer parameters.

2) INFLUENCE OF THE NUMBER OF CONSTANT PHASES K
An important parameter we need to consider is the number of symbols K for which we assume the laser phase to be constant. This parameter can have multiple implications as follows:
• A small value of K reflects the dynamics of the laser better and could improve the statistical performance;
• An increased value of K leads to reduced computational complexity because fewer parameters need to be estimated;
• During the pilot-based tracking, a reduced value of K could lead to overfitting as the number of data symbols and the number of parameters to be estimated have relatively close values.
During the preamble-based training, the number of parameters to be estimated is much lower than the number of known symbols. Also, the preamble-based training is performed only once. Therefore, the computational complexity and the overfitting are not critical problems. Consequently, we propose using a reduced value for K to increase the statistical performance. We have chosen a value of K = 25 for the preamble-based training in this work. During the stage devoted to tracking the time-variant imperfections, the number of pilot symbols is limited, and a reduced computational complexity is a stringent requirement. To focus only on the impact of the value of K, we first consider a communication chain that is not impacted by PMD. The simulation was performed for a 16-QAM communication at a 20 dB OSNR (Optical Signal-to-Noise Ratio). The evolution of MSE with respect to the number of iterations during the pilot-based training stage for different values of K is shown in Figure 6. The training error is represented by the MSE between the transmitted pilot symbols and the pilots at the output of the compensation network, the validation error by the MSE between the detected symbols and the data at the output of the network, and the testing error by the MSE between the unknown transmitted symbols and the output of the network. The training error is lowest for K = 25, whereas the validation and testing errors are highest. It can be concluded that the method suffers from overfitting for K = 25. By increasing the value of K, it can be seen that the training error increases, but the validation and testing errors decrease.

3) BER PERFORMANCE
BER is the final metric of interest in most communication systems and is computed as follows:
$$\text{BER} = \frac{\text{Number of erroneous bits}}{\text{Total number of bits}}.$$
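As a minimal sketch, the ratio can be computed by counting the positions where the transmitted and detected bit sequences differ. The function name `bit_error_rate` and the example bit sequences are illustrative.

```python
import numpy as np

def bit_error_rate(bits_tx, bits_rx):
    """Fraction of bit positions where the two sequences differ."""
    bits_tx = np.asarray(bits_tx)
    bits_rx = np.asarray(bits_rx)
    return np.count_nonzero(bits_tx != bits_rx) / bits_tx.size

bits_tx = [0, 1, 1, 0, 1, 0, 0, 1]
bits_rx = [0, 1, 0, 0, 1, 0, 1, 1]     # two bits flipped
ber = bit_error_rate(bits_tx, bits_rx)
```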
The BER evolution with respect to OSNR is presented in Figure 7 for two scenarios where K = 50 and K = 100, respectively. In this case, the impact of PMD is considered. The BER threshold in Figure 7 is considered to obtain a post-FEC BER of $7 \times 10^{-14}$, as indicated in Appendix I.9 of the ITU-T G.975.1 recommendation [72]. The simulated BER values of M-QAM for a Gaussian channel and the BER threshold are considered in all the following simulations that use BER as a performance metric. It can be seen that the results have a similar evolution for the supervised and self-labeling tracking cases. For all modulations, the performance is better for the case where K = 100 for values of OSNR below 25 dB for 7(a) and 20 dB for 7(b), respectively. This is because overfitting is more likely to occur when the noise component is significant, and the scenario with K = 100 is more resilient to noise. Above these OSNR values, the performance of the scenario where K = 50 is better, as the noise level is low, overfitting is less probable, and the model describes the laser phase dynamics better. In addition, it can be seen that in both cases, the self-labeling stage improves the BER performance. From this point on, we consider a value of K = 100 for 4-QAM and 16-QAM during the tracking stage of the training, as it is a good compromise between statistical performance and computational complexity. At the same time, K = 50 is preferred for 64-QAM and 256-QAM because it improves BER performance at high OSNRs, allowing the method to achieve the desired quality of service for 64-QAM and to approach it in the case of 256-QAM.

4) EVM PERFORMANCE
EVM is another widely used metric to assess the quality of communication [73]. It measures the difference between the ideal transmitted symbols $x$ and the estimated received symbols $\hat{x}$. The EVM is generally expressed in percentages and can be computed as in [74]. As in [75], in this work, we consider the EVM threshold specified by the 3rd Generation Partnership Project (3GPP) [76]. The EVM should be below 17.5% for 4-QAM, 12.5% for 16-QAM, 8% for 64-QAM, and 3.5% for 256-QAM. The EVM performance with respect to OSNR is depicted in Figure 8.
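A minimal sketch of an EVM check is given below. It assumes the common RMS-normalized definition, in which the error power is divided by the average power of the reference symbols; the exact expression of [74] is not reproduced here, so this normalization is an assumption. The limits are the values quoted above, and the function names are hypothetical.

```python
import numpy as np

# EVM limits in percent for each QAM order, as quoted in the text.
EVM_LIMITS_PERCENT = {4: 17.5, 16: 12.5, 64: 8.0, 256: 3.5}

def evm_percent(x, x_hat):
    """RMS EVM in percent (assumed definition: error power / reference power)."""
    x = np.asarray(x)
    x_hat = np.asarray(x_hat)
    error_power = np.sum(np.abs(x - x_hat) ** 2)
    reference_power = np.sum(np.abs(x) ** 2)
    return 100.0 * np.sqrt(error_power / reference_power)

def meets_evm_limit(x, x_hat, qam_order):
    """True when the EVM is at or below the limit for the given QAM order."""
    return evm_percent(x, x_hat) <= EVM_LIMITS_PERCENT[qam_order]

# Unit-power 4-QAM symbols with a constant offset of 0.1 on every estimate.
x = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)
x_hat = x + 0.1
evm = evm_percent(x, x_hat)
```

In this toy case the error power per symbol is 0.01 against a unit reference power, which gives an EVM of 10%: within the limits for 4-QAM and 16-QAM, but above those for 64-QAM and 256-QAM.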
The red curves are used as performance bounds and denote the EVM values obtained for simulated AWGN channels. It can be seen that these curves have an identical evolution whatever the modulation employed. They fall below the EVM threshold for OSNRs above approximately 15 dB for 4-QAM, 18 dB for 16-QAM, 22 dB for 64-QAM, and 29 dB for 256-QAM. Figure 8(a) shows the EVM values after the supervised pilots tracking, while Figure 8(b) shows the EVM values after the self-labeling tracking. In Figure 8(a), it can be seen that the parametric network after the supervised tracking can reach the desired performance for OSNRs above approximately 18 dB for 4-QAM, 22 dB for 16-QAM, and 28 dB for 64-QAM. Similarly, in Figure 8(b), it can be seen that the parametric network after the supervised tracking can reach the

5) ANALYSIS OF DETECTION PERFORMANCE
Another way to measure the performance of the proposed approach is to analyze the accuracy of the supervised training and validation steps, and the confusion matrix [77] corresponding to the transmitted symbols and the detected ones. Even if the proposed network model is a regression-based model, we can analyze the method's accuracy after the detection, which can be seen as a classification problem. The accuracy refers to the number of correct detections out of the total number of detections and is expressed in percentages as follows:
$$\mathrm{Accuracy} = \frac{\text{Number of correct detections}}{\text{Total number of detections}} \times 100.$$
We consider a separate dataset containing 120000 symbols to compute the accuracy. From these symbols, 60000 were used for preamble-based training, 2000 for pilot-based training, and 59800 for validation. The same symbols used for pilot-based training and validation are also used for self-labeling training, totaling 60000 symbols. Figure 9 shows the accuracy and confusion matrix for a 16-QAM communication. Figure 9(a) displays the preamble-based training, pilot-based training, and validation accuracy. We denote by "1st validation" the accuracy corresponding to the unknown detected data after the pilot-based training, and by "2nd validation" the accuracy corresponding to the unknown detected data after the self-labeling training. For the preamble-based training, the accuracy increases from approximately 20% at 0 dB OSNR to 99% at 18 dB OSNR. For the pilot-based training, the accuracy is slightly improved compared to the preamble training. The data after the pilot-based training is detected and used as a target for the self-labeling training. It can be observed that the 2nd validation has better accuracy than the 1st validation. The training and validation steps have a similar evolution, with a relatively small generalization error. This also indicates that the validation technique proposed in Section III-D helps avoid overfitting. Figure 9(b) depicts the confusion matrix for a 16-QAM Gray-coded communication at an OSNR of 10 dB.
The computation is performed on an unseen testing dataset. In this case, the overall accuracy is 66%. The points positioned in the center square of the constellation (5, 7, 13, 15) have a better accuracy (69%), while the points on the corners of the constellation (0, 2, 8, 10) have the lowest accuracy (62%). The errors generally appear between neighboring constellation points. Complementary simulations were performed for OSNR values of 15 dB and 20 dB. For the first case, the overall accuracy is 94%, while for the second it is 99%. Also, the errors have a similar distribution in the confusion matrix.
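The accuracy and confusion-matrix analysis above can be sketched as follows, treating detection as a classification over constellation indices. The function names and the toy index sequences are hypothetical; the overall accuracy is the trace of the confusion matrix divided by the total number of detections, and the per-symbol accuracy is the diagonal divided by the row sums.

```python
import numpy as np

def confusion_matrix(tx_idx, rx_idx, n_classes):
    """Rows: transmitted symbol index. Columns: detected symbol index."""
    tx_idx = np.asarray(tx_idx)
    rx_idx = np.asarray(rx_idx)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (tx_idx, rx_idx), 1)  # accumulate one count per (tx, rx) pair
    return cm

def accuracy_percent(cm):
    """Overall accuracy: correct detections over all detections, in percent."""
    return 100.0 * np.trace(cm) / cm.sum()

def per_symbol_accuracy_percent(cm):
    """Accuracy per transmitted symbol (every symbol must appear at least once)."""
    return 100.0 * np.diag(cm) / cm.sum(axis=1)

# Toy example with 4 classes and 8 detections, 2 of which are wrong.
tx_idx = [0, 0, 1, 1, 2, 2, 3, 3]
rx_idx = [0, 1, 1, 1, 2, 2, 3, 0]
cm = confusion_matrix(tx_idx, rx_idx, n_classes=4)
overall = accuracy_percent(cm)
per_symbol = per_symbol_accuracy_percent(cm)
```

For a 16-QAM analysis, `n_classes` would be 16 and the indices would follow whatever labeling is used for the constellation points.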

6) INFLUENCE OF IQ IMBALANCE
In Figure 10, we display the performance of the proposed network in the presence of different configurations for the IQ imbalance on the transmitter and receiver sides. When we consider multiple values for one type of IQ imbalance (TX/RX), the other one (RX/TX) is fixed at 1 dB and 10°. The impairments are considered equal for both polarizations. It can be seen that the BER increases with the values of IQ imbalance. For the considered communication chain, the transmitter IQ imbalance has a bigger impact on the statistical performance than the receiver IQ imbalance. The system cannot reach the desired quality of service when the transmitter IQ imbalance is 1.5 dB and 20° or above. In contrast, for the receiver IQ imbalance case, the proposed approach can reach the desired quality of service even in the presence of an IQ imbalance of 1.5 dB and 20°. In addition, we performed tests where we varied the OSNR between the preamble stage of training and the tracking and testing stages. The results showed that the proposed network performance is not significantly impacted in this context, indicating a good tolerance to noise variations.

7) COMPUTATIONAL COMPLEXITY
DP coherent optical communications generally benefit from huge bandwidths. Consequently, computational complexity is a fundamental characteristic to be considered. In this work, we evaluate the computational complexity by approximating the number of floating-point operations (FLOPs) [78]. The following results separately report the number of FLOPs for the training steps. The training requires multiple iterations, while a single iteration is performed for the testing step. In addition, a complexity versus performance analysis is carried out.
The FLOPs required for the proposed approach can be seen in Table 4. As expected, the most computationally demanding step is the preamble training. The pilot-based and self-labeling training steps require a relatively similar number of FLOPs, lower than that of the preamble training. The testing is the least computationally demanding operation. Complementary simulations show that the total number of FLOPs per iteration for the proposed approach follows approximately a quadratic evolution with respect to the number of symbols considered.
The number of iterations performed during the training strongly depends on the optimization algorithm. The preamble training is performed only once, while the pilot-based and self-labeling training is employed for every data block. Consequently, these last two training steps are critical from a computational point of view. Table 5 presents a performance versus complexity analysis that considers the evolution of the BER with respect to the number of iterations during the tracking steps for a 16-QAM communication at an OSNR of 20 dB. For 25 iterations, the performance is insufficient, with a BER exceeding the considered threshold of $4 \times 10^{-3}$. In this case, the best compromise could be obtained using only the pilot-based training with 75 iterations. This scenario achieves a BER of $2.2 \times 10^{-3}$ while requiring a total number of FLOPs approximately equal to $1.8 \times 10^{7}$. Note that better compromises could be achieved using more advanced optimization algorithms requiring fewer iterations.

B. COMPARISON WITH CONVENTIONAL TECHNIQUES
In the following, we compare our proposed approach to two different techniques: a DL technique and a classical DSP technique.

1) DEEP LEARNING TECHNIQUE
A well-known general architecture used for compensation and detection is the static MLP network [35], [79]. In the following, we compare our proposed approach to the MLP. The system diagram with imperfections can be seen in Figure 11. We consider SP communication impaired by IQ imbalance, both on the transmitter and receiver side, residual CD, and CFO. All these imperfections can be considered static. In this case, we have a 4-QAM communication with a baud rate of 20 GBaud and a sampling frequency of 20 GHz. The training and testing are performed using blocks of length $N_{0,b} = 30$. The IQ imbalance parameters, both on the transmitter and receiver side, correspond to a 1 dB amplitude difference between the two branches and a 10° deviation from the ideal 90°. The signal is impaired by a residual CD of 17 ps/nm and a residual CFO of 12.5 MHz.
The MLP comprises 6 layers (an input layer, 4 hidden layers, and an output layer). Each hidden layer is composed of $20N_b$ neurons. The activation function used is the Rectified Linear Unit. The MLP was trained for 1000000 iterations with a batch size of 1000 samples. The test database contained 1500000 signals. Our parametric network is composed of the compensation layers arranged in reverse order. Regarding the initialization of the parameters, the CFO compensation parameter is randomly chosen from an interval corresponding to F ∈ [0, 25 MHz], and the CD compensation parameter is initialized with 0. The network is trained using only the 30-symbol preamble, and 10000 signals were used for testing.
In Figure 12, the comparison results are shown in terms of BER with respect to OSNR for static and time-variant channels. Moreover, in the case of the static channel, we simulated the performance of the Clairvoyant equalizer [64], which has prior knowledge of the channel matrix. In Figure 12(a), it can be seen that the MLP and the Clairvoyant equalizer have similar performance, which is slightly better than that of the proposed approach. Compared to the MLP, the parametric network introduces a penalty of approximately 0.5 dB at the BER threshold. In Figure 12(b), we evaluate the performance of the MLP and the proposed parametric network in the presence of a time-variant laser PN on the receiver side related to δf = 100 kHz. The performance of the MLP is poor, as it does not meet the desired quality of service imposed by the BER threshold, as shown by the green curve. Indeed, the time-variant imperfections impact each data block differently, and the MLP cannot track their evolution. On the other hand, our approach is able to track time-variant impairments, as seen from the orange curve in Figure 12(b).
Regarding the computational complexity, the considered MLP architecture needs to estimate 1551120 parameters. For comparison, in the case of the parametric network, the number of real-valued parameters to be estimated is limited to 10 for the static channel and 11 for the time-variant channel. The computational complexity is then drastically reduced. In terms of operations, the MLP requires $15.4 \times 10^{12}$ FLOPs, while the parametric network requires $8.2 \times 10^{2}$ FLOPs for the testing in the case of a static channel. In the case of the time-variant channel, the MLP requires the same number of FLOPs, while the parametric network needs $5.1 \times 10^{3}$ FLOPs for a single iteration of the pilots and self-labeling tracking and $10^{3}$ FLOPs for the testing.

2) DSP TECHNIQUES
Another competing technique considered is the classical DSP approach. Using this approach, the compensation is performed by using several cascaded compensation blocks. In Figure 13, we present two system diagrams for the communication chains. First, in Figure 13(a), we consider a single polarization communication impaired by CFO, IQ imbalance, and laser PN on the receiver side. Secondly, in Figure 13(b), we also consider an IQ imbalance on the transmitter side. The symbol rate is 20 GBaud, and the sampling frequency is 20 GHz for both cases. A total number of 300300 symbols was considered. Of these, 300 are designated for the preamble, and 10000 for pilots, resulting in a total overhead of 3.4%. The processing blocks (preamble and multiple data blocks with pilots) are of length $N_{0,b} = 300$, and for the time-variant imperfections tracking, we use pilots at an interval of 30. The IQ imbalance parameters, both on transmitter and receiver, correspond to a 1 dB amplitude difference between the two branches and a 10° deviation from the ideal 90°, and the CFO between the two lasers has a value of 200 MHz. The initialization of the CFO-related parameter for the proposed parametric network is randomly done in an interval corresponding to F ∈ [187.5 MHz, 212.5 MHz].
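The quoted overhead follows directly from the symbol counts. The pilot count is also consistent with one pilot every 30 symbols if the 300000 symbols that follow the preamble are taken to be the data blocks, which is an interpretation rather than something stated explicitly:

```latex
\frac{300 + 10000}{300300} \approx 3.43\,\%,
\qquad
\frac{300300 - 300}{30} = 10000 .
```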
For the classical DSP approach, we used the following techniques:
• A blind compensation of the RX IQ imbalance is performed using the Gram-Schmidt orthogonalization procedure (GSOP) algorithm as in [80]. GSOP compensation is performed using the time representation of the signal by transforming a set of nonorthogonal samples into a set of orthogonal samples;
• The CFO is compensated by using the preamble data. The compensation is also performed using the time representation of the signal. First, the modulation phase is removed by multiplying the complex conjugate of the transmitted preamble symbols with the received preamble symbols as in [81]. Secondly, the CFO is estimated by performing a least-squares linear fit;
• The laser phase noise is estimated by using pilots inserted periodically. The estimation is performed by correlating the transmitted and received pilots as in [15]. Then an averaging filter of length 4 is applied to reduce the noise impact as in [82]. Moreover, to improve the algorithm performance, after the pilots-based estimation, a maximum likelihood phase estimation is performed in a decision-directed (DD) manner [83]. The averaging filter is again applied after this step;
• Transmitter IQ imbalance is compensated for using a butterfly Finite Impulse Response (FIR) filter with a single tap as in [84]. This pilot-based technique uses the least mean square algorithm for the update of the filter coefficients. In addition, this technique requires Quadrature Phase Shift Keying (QPSK) pilots for the laser PN compensation.
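The preamble-based CFO step listed above can be sketched as follows. This is a minimal illustration, not the implementation of [81]: it assumes one sample per symbol, fits a straight line to the unwrapped phase of the modulation-free preamble, and uses synthetic QPSK data with a hypothetical noise level and phase offset. The symbol rate, CFO, and preamble length are the values of the scenario described in the text.

```python
import numpy as np

def estimate_cfo(rx_preamble, tx_preamble, symbol_rate):
    """Estimate the CFO in Hz from a known preamble (one sample per symbol)."""
    # Remove the modulation phase: what remains is the CFO phase ramp,
    # a constant phase offset, and noise.
    z = rx_preamble * np.conj(tx_preamble)
    phase = np.unwrap(np.angle(z))
    n = np.arange(z.size)
    # Least-squares straight-line fit: phase[n] ~ slope * n + offset.
    slope, offset = np.polyfit(n, phase, 1)
    return slope * symbol_rate / (2.0 * np.pi), offset

def compensate_cfo(rx, cfo_hz, symbol_rate):
    """Derotate the received samples by the estimated CFO."""
    n = np.arange(rx.size)
    return rx * np.exp(-2j * np.pi * cfo_hz * n / symbol_rate)

symbol_rate = 20e9   # 20 GBaud
true_cfo = 200e6     # 200 MHz
n_preamble = 300

rng = np.random.default_rng(1)
qpsk = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)
tx_preamble = rng.choice(qpsk, size=n_preamble)
# Hypothetical noise level and constant phase offset (0.3 rad) for the demo.
noise = 0.05 * (rng.standard_normal(n_preamble) + 1j * rng.standard_normal(n_preamble))
n = np.arange(n_preamble)
rx_preamble = tx_preamble * np.exp(2j * np.pi * true_cfo * n / symbol_rate + 0.3j) + noise

cfo_hat, phase_hat = estimate_cfo(rx_preamble, tx_preamble, symbol_rate)
rx_corrected = compensate_cfo(rx_preamble, cfo_hat, symbol_rate)
```

Phase unwrapping only works when the phase increment per symbol stays well below π; here the increment is about 0.063 rad, so the fit is unambiguous. Larger offsets or heavier noise would call for a different estimator.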
In the case of the proposed parametric network, the compensation is performed using the corresponding compensation layers in reverse order. Figure 14 compares the compensation performance of the proposed parametric network and the classical DSP approach. Figure 14(a) reports the performance when the TX IQ imbalance does not impact the communication. In this case, the receiver IQ imbalance is first compensated for, then the CFO, and finally, the laser PN. This compensation technique is denoted by "DSP v1". This figure shows that both approaches have good performance, with a slight improvement for the parametric network. At the BER threshold of $4 \times 10^{-3}$, "DSP v1" has an OSNR penalty of approximately 0.4 dB for 4-QAM and 0.6 dB for 16-QAM. Figure 14(b) depicts the performance of the considered approaches when the TX IQ imbalance also impacts the communication chain. In this case, we considered three compensation techniques:
• First, we use the "DSP v1" technique;
• In the second case, after the "DSP v1" technique, we use the butterfly filter for TX IQ compensation. Then, as suggested in [84], another block of laser PN compensation is used. This technique is denoted by "DSP v2";
• In the third case, we first compensate for RX IQ imbalance using GSOP, then for CFO by using the preamble-based method. After that, a constant phase rotation of the constellation is performed by using the pilot symbols. The TX IQ imbalance is compensated for by using the FIR filter. Finally, the laser PN is compensated for by using the pilots and the DD technique with the averaging filters. This technique is denoted by "DSP v3".
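The GSOP stage that opens each of the DSP variants above can be sketched as follows. This is a minimal textbook-style version, not the implementation of [80]: it assumes zero-mean I and Q components and scales each output component to unit power. The imbalance model used to exercise it is hypothetical and only one of several conventions found in the literature.

```python
import numpy as np

def gsop(r):
    """Gram-Schmidt orthogonalization of the I and Q components of r.

    Assumes zero-mean components; each output component has unit power.
    """
    r_i, r_q = r.real, r.imag
    p_i = np.mean(r_i ** 2)
    rho = np.mean(r_i * r_q)          # cross-correlation between I and Q
    i_out = r_i / np.sqrt(p_i)
    q_orth = r_q - rho * r_i / p_i    # remove the part of Q aligned with I
    q_out = q_orth / np.sqrt(np.mean(q_orth ** 2))
    return i_out + 1j * q_out

def apply_rx_iq_imbalance(s, gain_db, phase_deg):
    """Hypothetical receiver imbalance model, used only to exercise gsop()."""
    g = 10.0 ** (gain_db / 20.0)      # amplitude mismatch on the Q branch
    phi = np.deg2rad(phase_deg)       # deviation from the ideal 90 degrees
    return s.real + 1j * g * (s.imag * np.cos(phi) + s.real * np.sin(phi))

rng = np.random.default_rng(2)
qpsk = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)
s = rng.choice(qpsk, size=10000)
r = apply_rx_iq_imbalance(s, gain_db=1.0, phase_deg=10.0)
y = gsop(r)

corr_before = np.mean(r.real * r.imag)   # nonzero because of the phase error
corr_after = np.mean(y.real * y.imag)    # zero up to rounding
```

GSOP restores orthogonality and equal power between the two branches of the signal it observes, which is why it suits receiver-side imbalance. When a transmitter-side imbalance is mixed in before the CFO and phase noise, the same statistics no longer identify it, in line with the poor "DSP v1" results reported for the second scenario.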
It can be observed that the performance of the "DSP v1" approach is poor. This is primarily because of the TX IQ imbalance, which is not optimally compensated for by GSOP and also limits the performance of the laser PN compensation technique. An improvement can be seen using the "DSP v2" technique. However, the performance of this DSP approach is still poor compared to that of the parametric network. This occurs because the TX IQ imbalance degrades the PN compensation algorithm. Finally, a substantial improvement can be observed for the "DSP v3" technique, but its BER performance is still worse than that of the parametric network. Compared to the parametric network, at the BER threshold, the "DSP v3" technique introduces a penalty of approximately 0.2 dB for 4-QAM and 1.7 dB for 16-QAM.
In the following, the computational complexity is reported in terms of FLOPs. For the first scenario, with IQ imbalance only on the receiver side, the "DSP v1" technique requires approximately $6 \times 10^{3}$ FLOPs. The parametric network requires $3.2 \times 10^{4}$ FLOPs for a single training iteration with pilots and self-labeling steps and $6.9 \times 10^{3}$ FLOPs for the testing. In the second scenario, with IQ imbalance on both the transmitter and receiver sides, the "DSP v3" technique requires approximately $8.9 \times 10^{3}$ FLOPs. A single training iteration for pilots and self-labeling steps requires $4 \times 10^{4}$ FLOPs, while the testing operation of the parametric network requires $7.8 \times 10^{3}$ FLOPs.

V. DISCUSSION
This paper emphasized the advantages of using model knowledge in a specific parametric network to compensate for multiple linear imperfections. Our simulations have shown that the parametric network outperforms other competing approaches. However, essential aspects must be considered and investigated in future works. First, we generally do not know about the presence of all imperfections and their order in the communication chain. This knowledge may be acquired by considering a sparse approach. Secondly, computational complexity is a critical requirement that needs to be improved. One way to meet this requirement could be by employing more advanced optimization algorithms. Finally, an important research topic in optical communications is represented by the mitigation of the nonlinear effects of the fiber. Multiple approaches based on DBP that use DL have been recently proposed and demonstrated their effectiveness. However, these approaches' performance may be reduced by linear imperfections, as the classical DSP approaches used for compensation may not be adapted to this scenario. We are convinced that the proposed parametric network may overcome these limitations and could be jointly used with the DL approaches to compensate for both nonlinear and linear imperfections of the optical chain.

VI. CONCLUSION
The paper proposed a parametric network that jointly compensates for multiple linear impairments in coherent optical communication systems. The network architecture uses compensation layers arranged in the reverse order of the imperfections. A custom training stage for network parameter estimation that can exploit the knowledge of a preamble, pilots, and the symbol constellation was also presented.
The proposed parametric network was compared to an MLP network and some classical DSP compensation algorithms. In the scenario considered, the parametric network outperforms the two competing techniques in terms of statistical performance. The main advantage of the proposed parametric approach lies in its flexibility, as it can easily adapt and compensate for new impairments. The parametric network only needs to incorporate a new imperfection model to be able to compensate for it. Future work will focus on reducing the computational complexity with more advanced optimization algorithms.