## Abstract

We present an optimized dual-stage carrier phase recovery (CPR) method based on pilot symbols and a blind phase search (BPS) algorithm for *M*-QAM coherent optical transmission systems, targeting an efficient hardware implementation. A comprehensive optimization of the key CPR parameters is performed, including pilot-rate, number of test phases, angle interval and moving average filter size. The optimization process is validated through numerical assessment of the optical communication system performance using 16-QAM, 64-QAM, and 256-QAM modulation formats at 64 GBaud. The optimized dual-stage CPR is demonstrated to operate with a very low number of test phases in the second-stage BPS algorithm. In addition, it supports more than 10× higher laser linewidth than the standard CPR based on pilot symbols. Moreover, a reduction of more than 90% in hardware complexity is achieved with respect to the standalone application of the BPS algorithm. The optimized dual-stage CPR is also validated through hardware implementation based on VHDL, and its gate-level complexity is assessed for a commercial off-the-shelf Xilinx Virtex-7 FPGA.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Random fluctuations in the phase of the optical carrier and local oscillator field severely impact the performance of high-order modulation coherent optical transmission systems [1]. Due to the tight requirements in terms of power dissipation and chip area, it is highly important to design and optimize digital signal processing (DSP) algorithms for carrier phase recovery such that an efficient hardware implementation is achieved while still operating with high performance.

Carrier phase recovery (CPR) based on the Viterbi-Viterbi (VV) method is widely employed for M-phase-shift keying (PSK) modulation formats, since it provides an optimum solution for blind phase estimation of optical signals with constant amplitude [2]. However, for high-order M-quadrature amplitude modulation (QAM) signals, the VV algorithm becomes sub-optimal [2]. Therefore, modified VV algorithms based on the quadrature phase-shift keying (QPSK) partitioning approach have been proposed [2,3]. Despite the significant performance improvement enabled by these modified VV approaches, a non-negligible implementation penalty is still observed, mainly when transmitting at low signal-to-noise ratio (SNR) with high cardinality constellations [2]. Another prominent approach for fully-blind CPR in coherent optical systems is the blind phase search (BPS) algorithm, which tends to provide a better performance, with the advantage of being agnostic to the modulation format [4]. However, the main drawback of BPS is related with its implementation complexity [5]. Several research works have been conducted to achieve reduced complexity BPS-based carrier recovery techniques [5–7]. Many of the proposed low complexity algorithms are based on the concept of a two-stage implementation, where the first stage provides a coarse recovery and the second stage performs a finer phase noise compensation [5]. Different stage combinations have been explored, namely BPS$+$BPS [5], BPS$+$VV [7] and BPS$+$ML (maximum likelihood) [6] algorithms.

As opposed to blind CPR, a popular alternative for data-aided phase noise estimation and compensation is based on the use of pilot symbols [8]. This approach shows two major advantages: i) since it uses a data-aided absolute phase reference, it avoids the use of differential encoding to overcome the effect of cycle slips [8]; ii) since it is based on simple phase comparisons between the received signal and the pilot symbols, it is straightforward to implement in hardware. However, these advantages are obtained at the cost of reduced transmission spectral efficiency due to the pilot data overhead [8]. In this regard, two-stage CPR configurations using pilot symbols in the first stage followed by a second stage based on BPS [9] and ML [10] algorithms have also been proposed. In [11,12], the implementation of these algorithms in an application specific integrated circuit (ASIC) has been discussed. In [11], single-stage CPR based on BPS and pilot symbols has been explored, while in [12] non-data-aided two-stage CPR based on BPS, VV and principal component-based phase estimation (PCPE) are investigated showing that the input number of bits, the number of test phases and the averaging window size can highly impact the performance and power dissipation of CPR algorithms.

The design and optimization of CPR has been typically performed for long-haul and ultra-long-haul fiber links, where the performance is essentially limited by the interplay between laser phase noise and chromatic dispersion, originating the detrimental effect of equalization-enhanced phase noise (EEPN) [13,14]. This has led to the development of CPR algorithms tailored for long-haul optical fiber systems, namely resorting to the concept of digital subcarrier multiplexing [15]. However, owing to the ever-increasing bandwidth demand on access networks, the research on coherent optical communications has been quickly shifting towards shorter reach applications, led by the notable example of 400ZR [16] and upcoming 800ZR standards. These novel short-reach applications are foreseen to represent the vast majority of market demand for coherent ports in the short term, fostering the urge to develop high-capacity and low-power transceiver pluggables.

With the rise of coherent transceivers in access networks, novel DSP platforms are currently being addressed as feasible alternatives to the generalized use of ASIC chips in long-haul transceivers [17,18]. Despite its cost efficiency at high manufacturing volumes, the design and validation of an ASIC chip still represents a very high initial investment (typically several million dollars), thereby hindering the entrance of new industrial players in the market. In addition, the use of ASIC platforms inherently imposes a strong limitation on the flexibility of the implemented DSP, preventing the transceiver from being upgraded for upcoming applications and/or standards. Indeed, several recent works have demonstrated the feasibility of real-time DSP processing for coherent transceivers using flexible off-the-shelf hardware platforms, such as graphical processing units (GPU) [19,20] and field-programmable gate arrays (FPGA) [21,22]. Considering these new application scenarios for coherent transceivers, together with the use of novel hardware platforms, it becomes apparent that the optimization of the CPR subsystem requires a renewed analysis, prioritizing the reduction of computational effort and the simplification of hardware circuitry in flexible DSP platforms, such as FPGAs.

Building upon the state-of-the-art of carrier-phase estimation for coherent optical transceivers, this paper aims at a comprehensive joint optimization of performance and complexity of a two-stage CPR based on pilot symbols and BPS, providing a high performance and low-complexity CPR approach for optical transceivers impaired by laser phase noise. The main novel contributions provided by this work are on i) the development and numerical validation of theoretical expressions for the dual-stage CPR; and ii) the evaluation of circuit hardware implementation of a dual-stage CPR method supported by an FPGA platform, thereby enabling a comprehensive complexity analysis in terms of main hardware logic circuits. The impact of parallel processing and bit-resolution requirements are also thoroughly analyzed in this work. In order to reduce the hardware implementation complexity, a look-up table (LUT)-based processing scheme is introduced and validated, enabling a significant reduction on the number of operations while keeping a high performance CPR. The optimized two-stage technique can support more than 10$\times$ higher laser linewidth than the standard CPR based on pilot symbols. A reduction of more than 90$\%$ in terms of hardware complexity is achieved with respect to the standalone application of the BPS algorithm. These optimizations have been assessed for different modulation formats: 16-QAM, 64-QAM, and 256-QAM and with a field programmable gate array (FPGA) hardware implementation. To facilitate the reproducibility of the presented results, all FPGA-ready CPR algorithms and routines implemented in VHDL and utilized in this work are made available in an open-source repository [23].

The rest of this paper is organized as follows. In section 2, the concepts behind the optimization of carrier phase recovery method are presented. In section 3, their performance is numerically assessed and discussed. In section 4, the computational effort is evaluated for high-order *M*-QAM signals. In section 5, we present the evaluation of FPGA-based hardware implementation of the optimized method. Finally, in section 6, the main conclusions are presented.

## 2. Impact of laser phase noise on the performance of coherent optical systems

Ideally, the output of a single longitudinal mode laser should be perfectly monochromatic with the whole power concentrated at the central frequency. However, the spontaneously emitted photons in the laser cavity impose fluctuations on the phase of the laser output and the resulting optical field with random phase from the spontaneous emission adds to the coherent optical field originated from the stimulated emission, then imposing a perturbation on the phase and amplitude of the laser output optical field. This effect is generally characterized in terms of laser linewidth, which represents the FWHM of the power spectral density of the laser output field.

Laser phase noise can be modeled as a Wiener process defined as [4],

$$\theta _{pn}(k) = \theta _{pn}(k-1) + \Delta \theta _{pn}(k), \tag{1}$$

where $\theta _{pn}$ is the time-varying phase noise and $\Delta \theta _{pn}(k)$ represents the phase difference between the instants $k-1$ and $k$. The corresponding variance of $\Delta \theta _{pn}(k)$ can be given as,

$$\sigma ^2_{\Delta \theta _{pn}} = 2\pi \Delta f \Delta t, \tag{2}$$

where $\Delta f$ corresponds to the laser linewidth, while $\Delta t$ is the time difference between the instants $(k-1)\Delta t$ and $k\Delta t$.

The performance of a typical coherent optical transmission system employing homodyne receivers is highly sensitive to phase noise, and performance degradation occurs when the phases of the transmitter laser and local oscillator fluctuate. Since the phase noise processes originating from the transmitter laser and the local oscillator are considered independent, the overall linewidth of the complete system is simply the sum of the two individual linewidths. In order to evaluate the impact of laser phase noise in digital coherent optical systems, we have assumed the simplified transmission system shown in Fig. 3 of section 4, where phase noise and additive white Gaussian noise (AWGN) are the only considered system impairments.
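The random-walk model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulation code used in this work: the function names, the seed handling and the two 50 kHz linewidths are hypothetical, and the increment variance is assumed to take the standard Wiener form $2\pi \Delta f \Delta t$ with $\Delta t$ equal to the symbol period.

```python
import cmath
import math
import random


def wiener_phase_noise(num_symbols, linewidth_hz, symbol_rate_baud, seed=0):
    """Return accumulated phase-noise samples theta_pn(k) in radians."""
    rng = random.Random(seed)
    # Assumed increment variance: 2*pi*linewidth*symbol_period.
    sigma = math.sqrt(2.0 * math.pi * linewidth_hz / symbol_rate_baud)
    theta = 0.0
    samples = []
    for _ in range(num_symbols):
        theta += rng.gauss(0.0, sigma)  # independent Gaussian increment
        samples.append(theta)
    return samples


def apply_phase_noise(symbols, phases):
    """Rotate each complex symbol by the corresponding phase-noise sample."""
    return [s * cmath.exp(1j * p) for s, p in zip(symbols, phases)]


# Independent transmitter and local-oscillator processes: linewidths add.
tx_linewidth_hz = 50e3  # illustrative value, not taken from the paper
lo_linewidth_hz = 50e3  # illustrative value, not taken from the paper
combined_hz = tx_linewidth_hz + lo_linewidth_hz
phases = wiener_phase_noise(1000, combined_hz, 64e9)
```

Because the increments accumulate, the phase drifts without bound over long sequences, which is why an uncompensated constellation eventually smears into rings.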

Figures 1(a) and 1(b) illustrate the effect of laser phase noise on the received constellation for 16-QAM coherent optical transmission systems. It can be noticed that phase noise imposes a rotation of the constellation symbols over a constant average amplitude range, which means it fundamentally affects the phase of the transmitted symbols. It can also be observed that the impact of phase noise is critical, given that the rotated symbols can overlap with each other up to a point where it is impossible to distinguish the constellation symbols, leading to symbol detection errors. From expression (2), we can note that the effect of phase noise increases with the increase of laser linewidth and with the increase of the transmission symbol period. This is understood by noting that a higher linewidth imposes faster symbol rotations, while a decrease of symbol rate implies a larger observed time interval for the symbol rotations. On the other hand, AWGN imposes Gaussian-distributed and time-uncorrelated fluctuations on both amplitude and phase, which add up to the Wiener laser phase noise, thereby making it harder to track and compensate. This effect has a particularly high impact on higher-order modulation formats, where the constellation symbols are closer to each other and therefore more susceptible to errors.

The impact of laser phase noise on coherent transmission systems is illustrated in Fig. 1(c), where the evolution of bit error rate (BER) is shown as a function of signal-to-noise ratio (SNR) for different values of the product of laser linewidth and symbol period, $\Delta f T_\mathrm {sym}$ (a common metric used to evaluate the effect of phase noise on coherent optical communication systems) [4]. The results correspond to a transmission system with uncompensated phase noise, based on a 16-QAM constellation with standard Gray bit mapping. First of all, we can see that in the absence of laser phase noise, $\Delta f T_\mathrm {sym}=0$, the results perfectly match the theoretical BER performance in additive white Gaussian noise (AWGN) channels [24]. Nevertheless, for a non-zero laser linewidth the performance rapidly degrades, and for $\Delta f T_\mathrm {sym}$ in the order of $1\times 10^{-6}$ an effective receiver operation becomes impractical. It is important to stress that the 16-QAM constellation and the corresponding BER results presented in Fig. 1 have been obtained after simulation with a fixed number of 131072 symbols. Naturally, if no carrier phase estimation is implemented at all, even the slightest laser phase noise will, in the long term, result in a multi-ring-shaped constellation, i.e. the accumulated phase noise will tend to span the full angular range of the constellation as time goes by, rendering it impossible to properly decode the transmitted information. Therefore, it becomes apparent that the compensation of laser phase noise plays a key role in coherent optical transmission systems.

## 3. Optimized two-stage CPR method

Carrier phase recovery methods based on BPS and pilot symbols have been widely exploited in digital coherent optical receivers using advanced modulation formats. Besides, the two-stage CPR based on pilot symbols and BPS has been also considered in order to provide higher performance and robustness against laser phase noise.

#### 3.1 Blind phase search algorithm

In the BPS algorithm, the received symbol is rotated by multiple test phases over a limited phase range using fixed or variable phase increments. The test phases can be defined as [4],

$$\varphi _b = \frac {b}{B}\vartheta , \quad b \in \{0, 1, \ldots , B-1\}, \tag{3}$$

where $B$ is the number of test phases and $\vartheta$ corresponds to the angle of symmetric constellation rotation. Note that for the case of square QAM constellations $\vartheta = \frac {\pi }{2}$ [4]. The rotated symbol is then fed into a decision circuit block, which evaluates the squared distance of the rotated symbols with respect to the ideal constellation symbols. Then, a moving average filter is applied over $N$ neighbouring squared distances to reduce the impact of AWGN.

After filtering, the minimum average squared distance is determined and the corresponding test phase is selected. The estimated phase is then unwrapped to remove the phase discontinuity and used to de-rotate the corresponding input symbol. Phase ambiguity generally occurs in coherent optical transmission systems using square QAM modulation formats. These ambiguities arise from phase rotations that are multiples of $\frac {\pi }{2}$, which cause uncertainty about the absolute quadrant positions. In order to overcome this limitation, the so-called differential quadrant encoding and decoding techniques can be applied [4]. The drawback of differential quadrant encoding is the loss of the Gray encoding property when the bit mapping is arranged with rotation symmetry, leading to performance degradation [4].
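The processing chain described in this subsection (test-phase rotation, decision, squared distance, moving average, minimum search and unwrapping) can be sketched as follows. This is a simplified floating-point illustration under stated assumptions, not the FPGA-oriented implementation of this work: the names `bps_estimate`, `nearest_point` and `unwrap` are hypothetical, the test phases are assumed to be uniformly spaced over $[0, \vartheta)$, symbols are rotated by $e^{-j\varphi _b}$ so that the selected test phase directly estimates the phase noise, and the 15-tap window is realized as 7 neighbours on each side, truncated at the sequence edges.

```python
import cmath
import math

# Unnormalized square 16-QAM constellation.
QAM16 = [complex(i, q) for i in (-3, -1, 1, 3) for q in (-3, -1, 1, 3)]


def nearest_point(z, constellation=QAM16):
    """Decision circuit: closest ideal constellation point."""
    return min(constellation, key=lambda c: abs(z - c))


def bps_estimate(received, num_test_phases=32, angle_interval=math.pi / 2,
                 half_window=7, constellation=QAM16):
    """Return one selected test phase (radians) per received symbol."""
    test_phases = [b / num_test_phases * angle_interval
                   for b in range(num_test_phases)]
    rotators = [cmath.exp(-1j * phi) for phi in test_phases]
    # Squared distance to the decided symbol, per symbol and per test phase.
    dist = []
    for r in received:
        row = []
        for rot in rotators:
            z = r * rot
            row.append(abs(z - nearest_point(z, constellation)) ** 2)
        dist.append(row)
    estimates = []
    for k in range(len(received)):
        lo = max(0, k - half_window)
        hi = min(len(received), k + half_window + 1)
        # Moving sum over neighbouring symbols (same argmin as the average).
        sums = [sum(dist[n][b] for n in range(lo, hi))
                for b in range(num_test_phases)]
        best = min(range(num_test_phases), key=sums.__getitem__)
        estimates.append(test_phases[best])
    return estimates


def unwrap(phases, period=math.pi / 2):
    """Remove jumps that are multiples of the constellation symmetry angle."""
    if not phases:
        return []
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= period * round(step / period)
        out.append(out[-1] + step)
    return out
```

The nested loops make the cost structure visible: every symbol requires one rotation and one distance computation per test phase, which is why the number of test phases dominates the complexity.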

#### 3.2 Pilot-based carrier phase estimation

Pilot-based CPR (also referred to as pilot-only CPR) is generally implemented through the time multiplexing of pilot symbols with payload symbols, such that the phase associated with the pilot symbols is estimated with high accuracy at the receiver side [8]. This is the approach considered in this work. The received noisy symbols at the predefined pilot positions are multiplied by the complex conjugate of the noiseless pilot symbol, which is known a priori, and the resulting output is filtered by a moving average filter to reduce the impact of AWGN. Once the phases of two consecutive pilot symbols have been evaluated using this process, a simple interpolation procedure can be applied to estimate the phase of the payload symbols located in between the pilots. First-order (linear) interpolation is often used, but even simpler approaches, such as zero-order hold, can be considered to further reduce the computational complexity. Finally, the estimated phases are used to remove the phase noise from the input symbol sequence.
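The steps above (conjugate multiplication, moving average, interpolation and de-rotation) can be sketched as follows. The function names are hypothetical, a single known pilot symbol is assumed for all pilot positions, and the phase is assumed to change by much less than $\pi$ between consecutive pilots so that no unwrapping of the pilot phases is needed.

```python
import bisect
import cmath


def pilot_phase_estimates(received, pilot_positions, pilot_symbol, half_window=0):
    """Phase at each pilot from received * conj(pilot), with a moving average."""
    products = [received[p] * pilot_symbol.conjugate() for p in pilot_positions]
    phases = []
    for i in range(len(products)):
        lo = max(0, i - half_window)
        hi = min(len(products), i + half_window + 1)
        # Average in the complex domain, then take the angle.
        phases.append(cmath.phase(sum(products[lo:hi])))
    return phases


def interpolate_phases(num_symbols, pilot_positions, pilot_phases):
    """First-order interpolation between pilots; hold outside the pilot range."""
    out = []
    for k in range(num_symbols):
        j = bisect.bisect_right(pilot_positions, k)
        if j == 0:
            out.append(pilot_phases[0])
        elif j == len(pilot_positions):
            out.append(pilot_phases[-1])
        else:
            k0, k1 = pilot_positions[j - 1], pilot_positions[j]
            w = (k - k0) / (k1 - k0)
            out.append((1 - w) * pilot_phases[j - 1] + w * pilot_phases[j])
    return out


def derotate(received, phases):
    """Remove the estimated phase from every symbol."""
    return [r * cmath.exp(-1j * p) for r, p in zip(received, phases)]
```

Replacing `interpolate_phases` with a version that simply repeats the most recent pilot phase would give the zero-order-hold variant mentioned above.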

One of the main advantages of phase recovery techniques based on pilot symbols is their robustness against cycle slips. Nevertheless, this is achieved at the expense of additional transmission overhead, which implies a given SNR penalty depending on the rate of pilot insertion [8]. The main parameter associated with this technique is the pilot-rate, $R_\mathrm {pil}$, which describes the rate at which pilot symbols are inserted among the transmitted symbols. A pilot-rate of $\frac {M-1}{M}$ represents the periodic insertion of one pilot symbol after every $M-1$ payload symbols.
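As a small worked example of the pilot-rate definition, the sketch below computes the pilot overhead $1 - R_\mathrm {pil}$ and the pilot indices within a frame. The helper names are hypothetical, and placing the pilot at the end of each block of $M$ symbols is an assumption made for illustration.

```python
from fractions import Fraction


def pilot_overhead(pilot_rate):
    """Fraction of transmitted symbols that are pilots, i.e. 1 - R_pil."""
    return 1 - Fraction(pilot_rate)


def pilot_positions(num_symbols, block_length):
    """Pilot indices when one pilot follows every block_length - 1 payload symbols."""
    return list(range(block_length - 1, num_symbols, block_length))


print(float(pilot_overhead(Fraction(7, 8))))  # 0.125
print(pilot_positions(16, 8))                 # [7, 15]
```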

#### 3.3 Optimized two-stage Pilot-BPS algorithm

The pilot-based algorithm is applied in the first stage to obtain a coarse estimate of the phase of the received constellation. The output of the first stage is then fed to the BPS algorithm, employed in the second stage, such that an improved performance is achieved. The corresponding block diagram of the two-stage Pilot-BPS is shown in Fig. 2. It is interesting to note that the pilot-based algorithm in the first stage not only performs the coarse phase noise compensation, but also provides the absolute phase reference to the BPS algorithm. This enables the application of the BPS algorithm without the need for differential quadrant encoding to avoid the phase ambiguity issue.

As detailed in section 3, pilot insertion implies an inherent performance penalty that depends on the insertion rate: it increases as the pilot-rate decreases and scales with the modulation order. On the other hand, the pilot-based CPR becomes more effective with decreasing pilot-rate. In this context, a two-stage CPR algorithm can be an attractive solution to further improve the overall system performance and enable a higher tolerance to laser phase noise. It should also be observed that using BPS as a second stage allows the pilot-based first stage to operate with a higher pilot-rate, which implies a lower performance penalty due to pilot insertion.

The main issue of the two-stage CPR is then its hardware implementation complexity, due to the inherently high complexity of the BPS algorithm employed in the second stage. Note that, although the BPS algorithm can be employed with reduced complexity in the two-stage configuration, its highly parallel implementation architecture still requires further optimization, such that the requirements of transceiver power dissipation are fulfilled. To further reduce the complexity of the second-stage BPS algorithm, we first identify the number of test phases, the angle interval and the moving average filter size as the main parameters to be optimized.

Within the optimization process of the pilot$+$BPS CPR algorithm, we have conducted the following main procedures:

- optimization of pilot-rate for the pilot-based stage;
- optimization of the number of delay taps within the moving average filters of both BPS and pilot-based stages;
- optimization of the number of test phases, $B$, and the corresponding angle interval, $\vartheta$, for the BPS algorithm in a dual-stage configuration.

Note that all optimizations are performed by pondering both the complexity and performance of the pilot$+$BPS algorithm. To that end, we will consider reference values for maximum tolerable penalty, which will be introduced as the optimization procedure is developed, and will serve as complexity minimization targets. Finally, the optimized configurations emerging from this comprehensive study were implemented on an FPGA platform for realistic complexity assessment.

## 4. Numerical simulation

A comprehensive evaluation of the phase noise compensation in digital coherent transmission systems has been performed based on the simulation setup presented in Fig. 3, where the impact of laser phase noise is evaluated under the effect of an AWGN channel. It should be noted that in this scenario the phase noise compensation module, CPR, is performed using both BPS and pilot-based algorithms. The simulation setup is configured according to the employed CPR algorithms. For the standalone evaluation of the BPS algorithm, we consider the use of differential quadrant encoding and decoding at the transmitter and receiver, respectively, in order to overcome the associated phase ambiguity issue. In contrast, standard (non-differential) coding and decoding are implemented whenever pilot-based CPR is considered. The performance evaluation is conducted for different *M*-QAM modulation formats and considering a transmission symbol rate of 64 GBaud. The effect of laser phase noise is evaluated in terms of BER and SNR considering different values of the product between laser linewidth and transmission symbol period.

#### 4.1 Performance of BPS algorithm

For simulation simplicity, the results are obtained by considering a fixed BPS filter length of 15 taps, while the number of test phases is optimized such that its impact on the performance is negligible. In this regard, we have evaluated the BER as a function of SNR for different values of the product between laser linewidth and symbol period, $\Delta f T_\mathrm {sym}$, which is swept by varying the laser linewidth. Figures 4(a) and 4(b) show the results for the 16-QAM and 64-QAM constellations, respectively, which are compared with the theoretical BER for non-differential Gray-coded transmission over AWGN channels. First of all, it should be observed that a performance penalty with respect to the non-differentially-encoded theoretical result [24] occurs even when the laser linewidth is set to zero, due to the differential encoding and the loss of the Gray encoding property when the quadrant rotation symmetry is imposed. Then, we can note that the system performance is very sensitive to the product of laser linewidth and symbol period. Considering 16-QAM modulation, Fig. 4(a), a performance penalty of $\sim 1$ dB is imposed by shifting from $\Delta f T_\mathrm {sym}= 3\times 10^{-4}$ to $\Delta f T_\mathrm {sym}= 7\times 10^{-4}$, operating at a BER of $2.4\times 10^{-2}$, whereas a significant increase in required SNR is observed by shifting to $\Delta f T_\mathrm {sym}= 2\times 10^{-3}$. This analysis is also valid for 64-QAM constellations, which reveal a higher sensitivity to $\Delta f T_\mathrm {sym}$, Fig. 4(b). It is then evident that, as we increase the modulation order, the requirements on laser linewidth become tighter and the coherent optical transmission system becomes more sensitive to laser phase noise. To provide further evidence of this aspect, Fig. 4(c) presents the required SNR to achieve the BER threshold of $2.4\times 10^{-3}$ as a function of linewidth times symbol duration. 
Considering an exemplary 64 GBaud transmission system, typical for state-of-the-art optical transceivers, these values correspond to a combined laser linewidth of 6.4 MHz and 640 kHz for the 16-QAM and 64-QAM cases, respectively.
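As a quick check of this conversion, the combined linewidth follows from multiplying the linewidth-symbol-period product by the symbol rate $R_s = 1/T_\mathrm {sym}$. The products of $10^{-4}$ and $10^{-5}$ used below are inferred from the quoted linewidths at 64 GBaud and should be read as an assumption:

$$\Delta f = \left (\Delta f T_\mathrm {sym}\right ) R_s: \quad 10^{-4}\times 64\times 10^{9}\,\mathrm {Hz} = 6.4\,\mathrm {MHz}, \qquad 10^{-5}\times 64\times 10^{9}\,\mathrm {Hz} = 640\,\mathrm {kHz}.$$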

#### 4.2 Performance of pilot-based algorithm

In order to adequately characterize the performance of pilot-based carrier phase recovery algorithm, we initiate the study by analysing the impact of pilot insertion on the SNR performance in the absence of phase noise. The pattern of pilot insertion is generally based on the use of QPSK-like pilots, which can be inserted at the inner, intermediate or outer constellation points as shown in Fig. 5(a). In this study we have considered the pattern of pilot insertion based on the QPSK-like pilots inserted at the outer constellation points. This is beneficial in terms of phase noise tracking, given that it provides high SNR pilot symbols for the estimation of the phase noise, which enables a higher estimation effectiveness. The main disadvantage of QPSK-pilot insertion is that it causes an increase of average power transmission, which then imposes an SNR penalty. Besides, the pilot insertion implies a given SNR penalty depending on the insertion rate. Note that, as more QPSK pilots are inserted into the signal, i.e. as $R_\mathrm {pil}$ decreases, the actual transmission symbol-rate, $R_s$, must be proportionally increased in order to keep the same net bit-rate, which leads to broadening of the spectrum, and consequently to a higher required launched power for the same operating SNR. The SNR penalty can be given as,

To analyse these considerations through simulation, we have assessed the system performance by evaluating the SNR penalty, at a BER of 2.4$\times$10$^{-2}$, as a function of the pilot-rate for different modulation formats in the absence of laser phase noise and considering pilot insertion in the outer QPSK-like constellation points. From Fig. 5(b), we can note that the SNR penalty increases with the decrease of pilot-rate, which is in accordance with the first term of expression (4). On the other hand, we can observe that the increase in SNR penalty scales with the constellation order and becomes more evident with decreasing pilot-rate. This is also in accordance with the second term of expression (4). These results can be considered as the inherent penalty associated with pilot insertion. Therefore, it is crucial that the enhanced phase noise estimation performance provided by the added pilot symbols compensates for the corresponding baseline SNR penalty due to pilot insertion, such that the overall performance is indeed improved.
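One way to reason about the two contributions discussed above is to separate a rate term, caused by raising the symbol rate by $1/R_\mathrm {pil}$ at a fixed net bit-rate, from a power term, caused by the higher average power when pilots sit on the outer constellation points. The sketch below is an assumed decomposition made for illustration; it is not claimed to coincide with expression (4), and all function names are hypothetical. Constellations are taken as square QAM with odd-integer levels $\pm 1, \pm 3, \ldots$ on each axis.

```python
import math


def rate_penalty_db(pilot_rate):
    """Assumed penalty from increasing the symbol rate by 1/pilot_rate."""
    return 10.0 * math.log10(1.0 / pilot_rate)


def outer_pilot_power_ratio(levels_per_axis):
    """Power of an outer-corner pilot relative to the mean payload power.

    For levels +-1, +-3, ..., +-(L-1): mean power = 2*(L^2 - 1)/3 and
    corner power = 2*(L - 1)^2, so the ratio is 3*(L - 1)/(L + 1).
    """
    L = levels_per_axis
    return 3.0 * (L - 1) / (L + 1)


def power_penalty_db(pilot_rate, pilot_to_payload_power_ratio):
    """Assumed penalty from the increase in average transmitted power."""
    mean_power = pilot_rate + (1.0 - pilot_rate) * pilot_to_payload_power_ratio
    return 10.0 * math.log10(mean_power)


def total_penalty_db(pilot_rate, levels_per_axis):
    """Sum of the two assumed contributions, in dB."""
    ratio = outer_pilot_power_ratio(levels_per_axis)
    return rate_penalty_db(pilot_rate) + power_penalty_db(pilot_rate, ratio)
```

Under these assumptions the rate term alone is about 0.58 dB for a pilot-rate of 7/8, and the power term grows with the constellation order because the outer-corner-to-mean power ratio rises from 1.8 for 16-QAM towards 3 for very large constellations.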

After identifying the baseline SNR penalty due to pilot insertion, we now proceed with the assessment of the impact of phase noise on the performance of pilot-based CPR considering different values of pilot-rate and laser linewidth. The performance is provided in terms of the required SNR to achieve a BER of 2.4$\times$10$^{-2}$ as a function of laser linewidth. For the sake of simplicity, we show the performance as a function of laser linewidth instead of the product of laser linewidth and symbol duration, since each pilot-rate imposes a different system baud rate. In this regard, Fig. 6 depicts the dependence of required SNR on laser linewidth for different QAM-based transmission systems and different values of pilot-rate, along with the corresponding theoretical results. First of all, the baseline penalty caused by pilot insertion (for a laser linewidth tending to zero) is evident, and it increases with decreasing pilot-rate and increasing constellation order. For all considered constellations, we can observe that the required SNR increases with the increase of laser linewidth. Besides, it is notable that as the pilot-rate approaches unity (i.e. the pilot overhead approaches zero), the required SNR increases more rapidly with increasing laser linewidth. Furthermore, it can be seen that the rate of increase of the required SNR scales with the constellation order, revealing that higher order QAM constellations are less tolerant to the effect of phase noise. For example, if we assume a maximum allowed performance penalty of 0.5 dB in required SNR, the maximum tolerable laser linewidths are 1 MHz, 200 kHz and 60 kHz for 16-QAM, 64-QAM and 256-QAM, respectively. These results are obtained for a pilot-rate of 63/64; however, it should be noted that, depending on the allowed performance penalty, the maximum tolerable laser linewidth may be achieved for different pilot-rates. 
For instance, considering a hypothetical scenario in which the combined laser linewidth is sufficiently low, the contribution of the inherent SNR penalty due to pilot insertion becomes dominant and consequently the employment of high values of pilot-rate tends to be more advantageous. In contrast, a lower pilot-rate is required to achieve optimum performance for scenarios with large combined laser linewidth. This shows that the enhanced phase noise estimation provided by a larger pilot overhead does actually compensate for its additional baseline SNR penalty, resulting in an overall improvement of the system performance. This issue becomes even more pronounced for higher constellation orders. Despite this evidence, it should also be noted that, depending on the maximum performance penalty that the system can tolerate, an excessive pilot overhead might be ineffective for typical values of laser linewidth. For instance, note that if the maximum penalty is set to 0.5 dB, then the pilot-rate of 7/8 (12.5$\%$ overhead) becomes inadequate for the whole range of considered laser linewidths.

#### 4.3 Optimized two-stage Pilot-BPS algorithm

From the previous results we can identify the following issues associated with BPS and pilot-based CPR: i) BPS is an effective technique for the estimation and compensation of laser phase noise in coherent transmission systems using QAM constellations, but it requires differential encoding to avoid the QAM phase ambiguity issue, which in turn imposes a significant performance penalty; ii) on the other hand, the pilot-based CPR becomes more effective at compensating laser phase noise as the pilot-rate decreases. However, the inherent performance penalty due to pilot insertion increases with decreasing pilot-rate, which can significantly limit the maximum achievable system performance. In this regard, we will now consider a two-stage CPR algorithm to further improve the overall system performance and enable a higher tolerance to laser phase noise.

For the numerical assessment of the two-stage pilot-BPS algorithm, the second stage based on BPS is implemented without differential encoding. Figure 7 illustrates the performance of the two-stage pilot-BPS in terms of required SNR as a function of laser linewidth for different QAM-based transmission systems and considering different values of pilot-rate. The filter length, number of test phases and angle interval have been optimized such that the maximum performance is guaranteed [4]. The number of test phases has been set to 32 for 16-QAM, and 64 for 64-QAM and 256-QAM, respectively, while the angle interval is set to $\pi /2$. We can observe that the use of BPS as a second stage is ineffective when low pilot-rate is considered and this ineffectiveness becomes more visible for larger constellation order. The inefficiency of BPS in this case can be due to the high effectiveness of pilot-based algorithm at low value of pilot-rate, which provides reduced margin for further phase noise effect compensation.

When high values of pilot-rate are used, the employment of BPS as a second stage becomes highly effective, providing a significant gain in comparison with the pilot-based CPR. For all considered QAM constellations, the achieved gain scales with increasing pilot-rate and increasing laser linewidth. It should be noted that when a high pilot-rate is used, the first stage performs a coarser compensation, leaving a larger margin for the second stage to compensate the residual phase noise. The combination of a small inherent SNR penalty for a lower rate of pilot insertion and the high effectiveness of the BPS algorithm in compensating laser phase noise is the main reason for the high gain of the two-stage algorithm at high pilot-rate. In order to obtain a clear understanding of these issues, Fig. 8 shows the combined laser linewidth tolerance as a function of pilot-rate for the pilot-based and two-stage pilot-BPS algorithms, considering different modulation formats. These results are obtained for a 0.5 dB SNR penalty with respect to the maximum theoretical performance. First of all, the tendency to reach a global maximum of $\Delta f$ at a given pilot-rate is notable, and it is clearly observable for the 16-QAM constellation. It should also be noted that the pilot-rate corresponding to the maximum $\Delta f$ increases with increasing modulation order, although for 64-QAM and 256-QAM the maximum is not observable within the range of considered pilot-rates. The results indicate that a maximum $\Delta f$ of 3.8 MHz, 636 kHz and 151 kHz can be achieved for the two-stage pilot-BPS algorithm, while the pilot-only algorithm achieves a maximum $\Delta f$ of 932 kHz, 244 kHz and 62.2 kHz, for 16-QAM, 64-QAM and 256-QAM constellations, respectively. 
Another interesting observation for higher constellation signals is that while the robustness of the pilot-only algorithm decreases with increasing pilot-rate, the robustness of the two-stage algorithm tends to increase with increasing pilot-rate. Therefore, we can see that the two-stage pilot-BPS algorithm can be an alternative solution to pilot only CPR when higher laser linewidth tolerance and low performance penalty are required.

## 5. Computational effort

In this section, we assess the computational effort associated with the BPS, pilot-based and two-stage pilot-BPS algorithms in terms of the total number of real multiplications (RMs), also taking into account the impact of parallel processing. For the sake of simplicity, and because the inherent structure of each algorithm can be implemented in very different manners, the complexity evaluation is restricted to counting RMs. In this analysis, each complex multiplication is considered to require 3 RMs [25]. Other operations, such as the angle-to-complex and complex-to-angle conversions and the decision circuits, are neglected, since they can be directly implemented through hardware units with lower complexity, such as LUTs and multiplexers. Based on these considerations, the complexity of the algorithms is presented in Table 1, where $B$ is the number of test phases used in the BPS algorithm and $N_p$ is the degree of parallelism.
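As an illustration of the 3-RM assumption, the sketch below shows one well-known rearrangement that computes a complex product with three real multiplications and five real additions, instead of four multiplications and two additions. The function name `complex_mult_3rm` is hypothetical, and the exact arrangement used in [25] may differ from this one.

```python
def complex_mult_3rm(a: complex, b: complex) -> complex:
    """Multiply two complex numbers using only 3 real multiplications."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    k1 = br * (ar + ai)  # real multiplication 1
    k2 = ar * (bi - br)  # real multiplication 2
    k3 = ai * (br + bi)  # real multiplication 3
    # Real part: ar*br - ai*bi, imaginary part: ar*bi + ai*br
    return complex(k1 - k3, k1 + k2)


if __name__ == "__main__":
    a, b = complex(1.5, -2.0), complex(0.25, 3.0)
    print(complex_mult_3rm(a, b), a * b)  # both values should agree
```

In hardware, this trades one multiplier for three extra adders, which is usually favourable because multipliers dominate area and power.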

#### 5.1 Parameter optimization of two-stage pilot-BPS

Since our first goal in the previous analysis was to identify the maximum transmission performance of the two-stage pilot-BPS algorithm, CPR parameters such as the angle interval and number of test phases were set according to the optimization carried out in the previous section, so that maximum performance is guaranteed. Nevertheless, it is well known that the number of test phases is the main source of complexity in a BPS-based CPR. In the internal operation flow of the BPS algorithm, each test phase requires dedicated circuitry for the calculation of a distance in the complex plane, which is a computationally intensive task. This is even more critical in a hardware implementation, where several input samples are processed in parallel. Reducing the number of BPS test phases is therefore key to minimizing the overall complexity of the two-stage pilot-BPS CPR. To carry out this study, we first identify the minimum angle interval that guarantees a negligible performance penalty. The number of test phases is chosen so that the angle resolution is kept the same, which avoids any performance gain due to a change in resolution. In addition, we assume a maximum performance penalty of 0.5 dB, and the optimization is performed at the maximum $\Delta f$ that still guarantees a penalty below 0.5 dB. In this context, Fig. 9 presents the optimization of the angle interval in terms of SNR, considering various pilot-rates and QAM constellations. In general, the angle interval required for a negligible performance penalty increases with the pilot-rate. For 16-QAM, a pilot-rate of $511/512$ requires an angle interval of $\pi /4$, while a pilot-rate of $127/128$ requires an angle interval of $\pi /8$. In addition, the required angle interval decreases as the modulation order increases: for 64-QAM and 256-QAM, the required angle interval at a pilot-rate of $511/512$ is $\pi /8$ and $\pi /18$, respectively. This is explained by the fact that, for higher constellation orders, where the symbols are closer to each other, a smaller search angle interval is allowed.
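To make the per-test-phase cost concrete, the following is a minimal sketch of a block-wise BPS estimator, not the implementation used in this work. It assumes an unnormalized 16-QAM grid, test phases evenly spaced and centred on zero within the angle interval, and a single block in place of the moving average filter. The names `QAM16` and `bps_estimate` are hypothetical.

```python
import cmath
import math

# Unnormalized 16-QAM grid (levels -3, -1, 1, 3); an assumption for illustration.
QAM16 = [complex(i, q) for i in (-3, -1, 1, 3) for q in (-3, -1, 1, 3)]


def bps_estimate(block, num_test_phases, angle_interval, constellation=QAM16):
    """Return the test phase (rad) with the smallest summed squared distance."""
    best_phase, best_cost = 0.0, float("inf")
    for b in range(num_test_phases):
        # Assumed layout: test phases evenly spaced over
        # [-angle_interval / 2, angle_interval / 2).
        phi = (b / num_test_phases - 0.5) * angle_interval
        rot = cmath.exp(-1j * phi)
        cost = 0.0
        for r in block:
            z = r * rot  # one complex multiplication per sample and test phase
            # Squared distance to the closest ideal constellation point
            cost += min(abs(z - s) ** 2 for s in constellation)
        if cost < best_cost:
            best_phase, best_cost = phi, cost
    return best_phase


if __name__ == "__main__":
    true_phase = 0.1
    rx = [s * cmath.exp(1j * true_phase) for s in QAM16]
    print(bps_estimate(rx, 32, math.pi / 2))  # close to 0.1, within the grid step
```

The two nested loops show why the cost grows with the product of the number of test phases and the number of samples processed in parallel: every sample is rotated and compared against the constellation once per test phase.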

After optimizing the angle interval, we evaluate the minimum number of test phases that still provides maximum system performance. The study is performed for different modulation formats and pilot-rates, and the performance is estimated in terms of SNR, as shown in Fig. 10. For each pilot-rate and constellation order, the corresponding minimum angle interval is used, together with the number of test phases that keeps the angle resolution unchanged. This number of test phases is the reference point from which the number of test phases is then reduced. For instance, for 64-QAM, after optimizing the angle interval, the number of test phases for pilot-rates of $255/256$ and $511/512$ is 8 and 16, respectively. We observe that, for all considered QAM constellations, the number of test phases can be greatly reduced from the reference point without incurring any performance penalty. For pilot-rates of $127/128$ and $255/256$, the minimum number of test phases needed to achieve maximum performance is (3,4), (3,5) and (3,6) for 16-QAM, 64-QAM and 256-QAM, respectively. This result shows that the two-stage pilot-BPS CPR can be an attractive solution for high-performance, low-complexity phase noise compensation in coherent optical transmission systems. The BPS-based second stage enables a significant performance enhancement while using a very small number of test phases, which are the main source of complexity.
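The reference number of test phases follows from keeping the angle resolution (angle interval divided by number of test phases) constant when the interval is narrowed. The short sketch below reproduces this arithmetic for the 64-QAM case quoted above, starting from 64 test phases over $\pi /2$. The helper name `reference_test_phases` is hypothetical, and angle intervals are expressed as fractions of $\pi$.

```python
from fractions import Fraction


def reference_test_phases(b_ref: int, interval_ref: Fraction,
                          interval_new: Fraction) -> int:
    """Number of test phases that keeps interval / B unchanged.

    Intervals are given as fractions of pi, e.g. Fraction(1, 2) for pi/2.
    """
    b_new = b_ref * interval_new / interval_ref
    if b_new.denominator != 1:
        raise ValueError("resolution cannot be kept with an integer B")
    return int(b_new)


if __name__ == "__main__":
    # 64-QAM: 64 test phases over pi/2, interval reduced to pi/8 (pilot-rate 511/512)
    print(reference_test_phases(64, Fraction(1, 2), Fraction(1, 8)))  # -> 16
```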

#### 5.2 Computational effort assessment

The optimized parameters for all considered algorithms (BPS, pilot-only and two-stage pilot-BPS) are listed in Table 2, in terms of the number of test phases, $B$, angle interval, $\vartheta$, and number of filter taps, $N_{\mathrm {taps}}$. The parameters are given for each CPR algorithm, modulation format and pilot-rate. Note that the parameters $B$ and $\vartheta$ are not defined for the pilot-only CPR, while for the two-stage pilot-BPS the parameter $N_{\mathrm {taps}}$ is given as $N_{\mathrm {1}}+N_{\mathrm {2}}$, meaning that the first and second stages apply filters with $N_{\mathrm {1}}$ and $N_{\mathrm {2}}$ taps, respectively.

Following the expressions presented in Table 1 and the parameters indicated in Table 2, we have assessed the complexity of all considered algorithms, which is reported in Table 3. Among these parameters, the algorithm complexity in terms of number of RMs depends only on the number of test phases, $B$. From a hardware implementation point of view, however, the complexity is impacted by all parameters: $B$, $\vartheta$ and $N_{\mathrm {taps}}$. In addition to the complexity assessment, Table 3 also shows the maximum $\Delta f$ supported by each algorithm and the corresponding SNR penalty at a BER of 2.4$\times$10$^{-2}$. In accordance with the previous analysis, an SNR penalty of 0.5 dB has been defined. For the BPS-DE algorithm, however, we consider the minimum penalty that guarantees a BER of 2.4$\times$10$^{-2}$, since the baseline penalty imposed by differential encoding is higher than the considered 0.5 dB. These optimizations have been conducted for a symbol rate of 64 GBaud and three modulation formats, 16-QAM, 64-QAM and 256-QAM, which correspond to system data rates of 400 Gb/s, 600 Gb/s and 800 Gb/s, respectively.

### 5.2.1 Comparison between two-stage CPR and BPS-DE

Comparing first the complexity of the BPS-DE and two-stage pilot-BPS algorithms, we note that the two-stage algorithm is more computationally efficient, providing a reduction of more than 93$\%$ in RMs for 16-QAM. The same analysis indicates a complexity reduction of approximately 96$\%$ and 94$\%$ for 64-QAM and 256-QAM, respectively. It is important to note that the higher laser phase noise tolerance of BPS-DE arises because its $\Delta f$ is measured at a higher performance penalty; at the same performance penalty, the two-stage algorithm tends to present a higher tolerance to laser phase noise. The complexity efficiency of the two-stage algorithm follows from two facts: the second stage of the pilot-BPS algorithm requires a much lower number of test phases than the BPS-DE algorithm, and the additional complexity imposed by the first stage is small compared with that of the second stage. This issue is detailed in the remainder of this section. In addition, since the complexity of BPS-based algorithms scales with $B\times N_p$, a very high complexity reduction can be achieved in hardware implementations with a high degree of parallelism. In general, the results of Table 3 reveal that the complexity reduction tends to increase with decreasing pilot-rate, owing to the reduction in the number of test phases, $B$.

### 5.2.2 Comparison between two-stage CPR and pilot-only CPR

In the following, we compare the two-stage CPR with the pilot-only CPR. The results show that phase noise compensation based on pilot symbols only tends to be more efficient than the two-stage algorithm in terms of RMs. For instance, at a pilot-rate of 63/64, which corresponds to the lowest computational effort of the two-stage algorithm, the two-stage pilot-BPS CPR requires 71.9$\%$ more RMs than the pilot-only CPR (16-QAM, 64-QAM, 256-QAM). The increased complexity of the two-stage CPR is dictated by the BPS-based second stage, whose complexity scales quickly with the number of test phases. This highlights the strong dependence of BPS-based algorithms on the number of test phases. Nevertheless, the two-stage pilot-BPS algorithm proves to be much more robust against laser phase noise than the pilot-only algorithm: over 10$\times$ higher $\Delta f$ can be supported for 16-QAM, while for 64-QAM and 256-QAM the two-stage algorithm can operate with up to 3$\times$ higher laser linewidth than the pilot-only algorithm.

## 6. FPGA-based hardware implementation

The hardware implementation of the aforementioned carrier phase noise compensation techniques is described in this section. The algorithms are implemented in VHSIC Hardware Description Language (VHDL), where parallel processing is employed to adapt the high-speed incoming data stream to the relatively low clock speed of an FPGA platform. Pipeline stages are also included to reduce the critical path of the design, at the cost of increased implementation latency. The design is based on fixed-point representation, which enables the optimization of the number of bits employed along the chain of entity blocks. The number of bits at different stages of the implementation is also indicated, written as $X\, b$, meaning that $X$ bits are used for the fixed-point representation.
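The effect of choosing $X$ bits can be explored in software before synthesis. The sketch below is a generic signed fixed-point quantizer, given only as an illustration: the split into integer and fractional bits, the saturation behaviour and the function names `to_fixed_point` and `from_fixed_point` are assumptions, not details taken from the VHDL design.

```python
def to_fixed_point(x: float, total_bits: int, frac_bits: int) -> int:
    """Quantize x to a signed two's-complement integer, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))  # scale by 2**frac_bits and round
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def from_fixed_point(q: int, frac_bits: int) -> float:
    """Convert the stored integer back to its real value."""
    return q / (1 << frac_bits)


if __name__ == "__main__":
    q = to_fixed_point(0.3, total_bits=8, frac_bits=6)
    print(q, from_fixed_point(q, 6))  # 19 0.296875
```

Sweeping `total_bits` in a model of this kind and measuring the resulting SNR penalty is one way to pick the word length at each stage.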

All VHDL files that have been developed during the hardware implementation of the dual-stage CPR presented in this paper can be publicly consulted and downloaded from an open-access repository [23].

#### 6.1 BPS entity block

Figure 11(a) illustrates the top level diagram of a VHDL implementation of the BPS algorithm using parallel processing, where at each clock cycle $M$ input samples are simultaneously processed and $M$ samples are obtained at the output. Each of $N$ parallel Euclidean $\texttt {Distance Calculation}$ blocks is fed by $M$ parallel input samples and one corresponding output of $\texttt {LUT-0}$ block, which models the exponential operator required for phase rotation. In this case, $N$ corresponds to the number of test phases and $M$ corresponds to the degree of parallelism.

The parallel input samples are first delayed by one clock cycle, using delay blocks $D_1$, to be synchronized with the $\texttt {LUT-0}$ output, which takes one clock cycle. A delay block $D_k$ represents a delay of $k$ clock cycles. Then, the distances to the closest ideal constellation points are calculated. The outputs of the Euclidean $\texttt {Distance Calculation}$ blocks form $M$ arrays, each holding $N$ calculated squared distances, and are followed by $M$ $\texttt {Min array}$ blocks that find the minimum distance. The output of each $\texttt {Min array}$ block holds the index of the minimum distance, which is used as the input of the next block, $\texttt {LUT-1}$, to select the corresponding optimum test phase. The output is then: i) multiplied by 4 using a left shift; ii) unwrapped; iii) divided by 4 using a right shift; iv) converted from angle to complex using a LUT-based implementation ($\texttt {LUT-2}$); and v) finally, the compensation is performed by multiplying the estimated phase by the delayed version of the input samples.
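The multiply, unwrap and divide steps can be sketched in floating point as follows. The sketch assumes test phases confined to an interval of width $\pi /2$, so that scaling by 4 maps the ambiguity onto a full $2\pi$ period where a standard unwrap applies. The function names are hypothetical, and the fixed-point shifts of the hardware are replaced by ordinary multiplications and divisions.

```python
import math


def unwrap(phases, period=2 * math.pi):
    """Remove jumps larger than half a period between consecutive samples."""
    out = [phases[0]]
    offset = 0.0
    for prev, cur in zip(phases, phases[1:]):
        step = cur - prev
        if step > period / 2:
            offset -= period
        elif step < -period / 2:
            offset += period
        out.append(cur + offset)
    return out


def unwrap_bps(phases):
    """Unwrap phase estimates that are only known modulo pi/2."""
    scaled = [4.0 * p for p in phases]        # step i): multiply by 4
    return [p / 4.0 for p in unwrap(scaled)]  # steps ii) and iii)


if __name__ == "__main__":
    # A phase drifting upwards past pi/4 wraps back by pi/2 at the BPS output
    wrapped = [0.6, 0.7, 0.8 - math.pi / 2]
    print(unwrap_bps(wrapped))  # last value is restored to about 0.8
```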

#### 6.2 Pilot-based entity block

The simplified hardware implementation diagram of the pilot-based CPR is shown in Fig. 11(b). The diagram is divided into a phase estimation block, which estimates the phase noise associated with the input samples, and a phase compensation block, which removes the phase noise from the input samples. The implementation details of the phase estimation and compensation blocks are indicated by the dashed lines in Fig. 11(b).

In the phase estimation block, two LUTs are used, $\texttt {LUT-CONJ}$ and $\texttt {LUT-PIL}$, which provide the complex conjugate of the input sample and its corresponding pilot symbols, respectively. The block $\texttt {COUNT-EN0}$ provides the enable signal that controls when the other blocks operate. The enable signal is asserted every $L_\mathrm {p}$ clock cycles, so that the phase estimation is performed only for pilot symbols, where $L_\mathrm {p}$ is the spacing, in symbols, between two consecutive pilot symbols. The block $\texttt {COUNT-EN1}$ provides the read address to the $\texttt {LUT-PIL}$ block every $L_\mathrm {p}$ clock cycles to access the pilot symbols. The multiplication by $\pi /2$ is required to adjust the complex-to-angle conversion to the correct value. Two buffers, $\texttt {Buffer0}$ and $\texttt {Buffer1}$, are used to synchronize the inputs to the $\texttt {Average}$ and $\texttt {Interp1}$ blocks, respectively. The block $\texttt {Interp1}$ performs linear interpolation between each pair of consecutive estimated pilot phases and is controlled by the enable signal provided by the block $\texttt {COUNT-EN2}$. The enable signal from the block $\texttt {COUNT-EN0}$ is distributed to the other blocks through the delay blocks $\texttt {D}$, which ensures correct synchronization between the different blocks. The block $\texttt {LUT-A2C}$ converts the estimated angles to their complex representation, and the input samples are then multiplied by the corresponding estimated phase to complete the phase compensation.
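The data flow of the pilot-based stage can be summarized by the floating-point sketch below. It is a simplified model under stated assumptions: pilots are located at indices 0, $L_\mathrm {p}$, 2$L_\mathrm {p}$, and so on, the moving average ($\texttt {Average}$ block) and any phase unwrapping are omitted, and the last pilot phase is held for trailing samples. All function names are hypothetical.

```python
import cmath


def pilot_phase_estimates(rx, pilots, spacing):
    """Phase of rx[k] * conj(pilot) at each pilot position k = 0, spacing, ..."""
    positions = range(0, len(rx), spacing)
    return [cmath.phase(rx[k] * pilots[i].conjugate())
            for i, k in enumerate(positions)]


def interpolate_phases(pilot_phases, spacing, n_samples):
    """Linear interpolation between consecutive pilot phases."""
    out = []
    for n in range(n_samples):
        i, frac = divmod(n, spacing)
        if i + 1 < len(pilot_phases):
            slope = (pilot_phases[i + 1] - pilot_phases[i]) / spacing
            out.append(pilot_phases[i] + slope * frac)
        else:
            out.append(pilot_phases[-1])  # hold the last estimate at the tail
    return out


def compensate(rx, phases):
    """Rotate each sample back by its estimated phase."""
    return [r * cmath.exp(-1j * p) for r, p in zip(rx, phases)]


if __name__ == "__main__":
    spacing, sym = 4, complex(1, 1)
    rx = [sym * cmath.exp(1j * 0.01 * n) for n in range(9)]  # slow phase ramp
    est = pilot_phase_estimates(rx, [sym] * 3, spacing)
    out = compensate(rx, interpolate_phases(est, spacing, len(rx)))
    print(max(abs(z - sym) for z in out))  # residual error, close to zero
```

Because only one sample in every $L_\mathrm {p}$ needs a complex multiplication for the estimate, the cost of this stage is far lower than that of a per-sample search over test phases.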

#### 6.3 FPGA implementation results

The hardware implementation of the aforementioned carrier phase recovery techniques is first validated by evaluating their performance under fixed-point operation. Fixed-point operations may cause a performance penalty, increasing the SNR required to achieve a given BER. For the sake of simplicity, the hardware simulations are performed at a given operating point defined by the laser linewidth, $\Delta f$, the SNR and the pilot-rate, $R_\mathrm {pil}$. Specifically, we consider the maximum value of $\Delta f$ and the minimum value of SNR supported by each algorithm at a BER of 2.4$\times$10$^{-2}$, according to Table 3. It is worth mentioning that, before the hardware complexity evaluation, the performance of the algorithms has been evaluated and the penalty due to fixed-point operation has been compensated by adjusting the number of bits along the algorithm chain, as depicted in Figs. 11(a) and 11(b). Based on these considerations, the signal constellations obtained for 16-QAM and 64-QAM after two-stage pilot-BPS CPR are illustrated in Figs. 12(a) and 12(b).

To estimate the hardware area occupation of all considered algorithms, we assess the implementation complexity through the Synthesis Report generated by the Xilinx software tool. For the sake of simplicity, the complexity is quantified in terms of the number of occupied slice LUTs, slice registers and DSP slices (DSP48E1s), which are the main chip area estimation metrics provided by the Synthesis Report. DSP slice blocks provide efficient signal processing functions, such as multiplication and division, while a slice is a basic element of FPGA resources formed by LUTs, registers, carry chain and multiplexers, which can be programmed to form different logic circuits. Taking these considerations into account, and for an implementation targeting the commercial Virtex-7 XC7VX330T FPGA, Fig. 12 shows the estimated hardware occupation for the different algorithms and various pilot-rates. Because the software is limited in synthesizing designs with large area requirements, as is the case for the BPS algorithm with 64-QAM, we have limited the degree of parallelism to 8 samples; the extension to a higher degree of parallelism is straightforward. The results confirm the high hardware efficiency of pilot-based algorithms, for which a very low hardware occupation is achieved. On the other hand, the hardware occupation of the standalone BPS algorithm is very high, surpassing the total available resources of the considered FPGA model: for 64-QAM, the design occupies 100$\%$ of the available slice LUTs, which hinders its real-time implementation on the considered FPGA platform. In accordance with the computational effort analysis provided in Table 3, the two-stage pilot-BPS algorithm presents an increase in hardware complexity relative to the pilot-based algorithm. Nevertheless, a large reduction in hardware complexity is confirmed when compared to the BPS algorithm. These results corroborate the theoretical and simulation results presented in the previous section.

## 7. Conclusions

We have provided a comprehensive assessment and optimization of a two-stage CPR for coherent optical transceivers, based on pilot symbols followed by a BPS algorithm, and compared it against single-stage CPR based on the individual application of pilot symbols or BPS. The performance and hardware complexity have been assessed for a transmission system operating at data rates of 400 Gb/s, 600 Gb/s and 800 Gb/s. The optimized two-stage pilot-BPS algorithm has shown higher robustness against laser phase noise, supporting over 10$\times$ higher laser linewidth, at the expense of additional complexity with respect to the pilot-only algorithm. When compared to the standalone BPS algorithm (assisted by differential encoding), a reduction of more than 93$\%$ in hardware complexity, measured in terms of real multiplications, has been achieved, together with higher performance. In this regard, the two-stage pilot-BPS algorithm can be an attractive solution for high performance and high robustness against laser phase noise, together with moderate computational effort. A hardware implementation based on VHDL, taking parallel processing and fixed-point representation into account, has also been carried out for the considered carrier phase recovery algorithms, and the FPGA-based hardware complexity analysis has confirmed the aforementioned results.

## Funding

European Regional Development Fund through the Competitiveness and Internationalization Operational Programme (COMPETE 2020) of the Portugal 2020 framework; Projeto DSPMetroNet: DSP Functions for Simplified Coherent Transceivers in Optical Metropolitan Networks (POCI-01-0145-FEDER-029405); “la Caixa” Foundation (LCF/BQ/PR20/11770015); Fundação para a Ciência e a Tecnologia (PD/BD/113817/2015).

## Acknowledgments

This work is supported by the European Regional Development Fund (FEDER), through the Competitiveness and Internationalization Operational Programme (COMPETE 2020) of the Portugal 2020 framework, Projeto DSPMetroNet: DSP Functions for Simplified Coherent Transceivers in Optical Metropolitan Networks, POCI-01-0145-FEDER-029405. Fernando P. Guiomar acknowledges a fellowship from “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/PR20/11770015. Celestino Martins acknowledges the financial support provided by FCT through the Ph.D. Grant PD/BD/113817/2015.

## Disclosures

The authors declare no conflicts of interest.

## Data availability

Data underlying the results presented in this paper are available in Ref. [23].

## References

**1.** M. S. Faruk and S. J. Savory, “Digital signal processing for coherent transceivers employing multilevel formats,” J. Lightwave Technol. **35**(5), 1125–1141 (2017).

**2.** I. Fatadin and S. J. Savory, “DSP techniques for 16-QAM coherent optical systems,” in IEEE Photonics Society Summer Topicals 2010, (IEEE, 2010).

**3.** S. M. Bilal, G. Bosco, J. Cheng, A. P. T. Lau, and C. Lu, “Carrier phase estimation through the rotation algorithm for 64-QAM optical systems,” J. Lightwave Technol. **33**(9), 1766–1773 (2015).

**4.** T. Pfau, S. Hoffmann, and R. Noé, “Hardware-efficient coherent digital receiver concept with feedforward carrier recovery for *m*-QAM constellations,” J. Lightwave Technol. **27**(8), 989–999 (2009).

**5.** T. Pfau and R. Noé, “Phase-noise-tolerant two-stage carrier recovery concept for higher order QAM formats,” IEEE J. Sel. Top. Quantum Electron. **16**(5), 1210–1216 (2010).

**6.** X. Zhou, “An improved feed-forward carrier recovery algorithm for coherent receivers with *m*-QAM modulation format,” IEEE Photonics Technol. Lett. **22**(14), 1051–1053 (2010).

**7.** J. R. Navarro, A. Kakkar, R. Schatz, X. Pang, O. Ozolins, F. Nordwall, H. Louchet, S. Popov, and G. Jacobsen, “High performance and low complexity carrier phase recovery schemes for 64-QAM coherent optical systems,” in Optical Fiber Communication Conference, (OSA, 2017).

**8.** M. Mazur, J. Schröder, A. Lorences-Riesgo, T. Yoshida, M. Karlsson, and P. A. Andrekson, “12 b/s/Hz spectral efficiency over the C-band based on comb-based superchannels,” J. Lightwave Technol. **37**(2), 411–417 (2019).

**9.** X. Zhou, L. E. Nelson, P. Magill, R. Isaac, B. Zhu, D. W. Peckham, P. I. Borel, and K. Carlson, “High spectral efficiency 400 Gb/s transmission using PDM time-domain hybrid 32–64 QAM and training-assisted carrier recovery,” J. Lightwave Technol. **31**(7), 999–1005 (2013).

**10.** M. Magarini, L. Barletta, A. Spalvieri, F. Vacondio, T. Pfau, M. Pepe, M. Bertolini, and G. Gavioli, “Pilot-symbols-aided carrier-phase recovery for 100-G PM-QPSK digital coherent receivers,” IEEE Photonics Technol. Lett. **24**(9), 739–741 (2012).

**11.** E. Börjeson, C. Fougstedt, and P. Larsson-Edefors, “VLSI implementations of carrier phase recovery algorithms for *m*-QAM fiber-optic systems,” J. Lightwave Technol. **38**(14), 3616–3623 (2020).

**12.** E. Börjeson and P. Larsson-Edefors, “Energy-efficient implementation of carrier phase recovery for higher-order modulation formats,” J. Lightwave Technol. **39**(2), 505–510 (2021).

**13.** W. Shieh and K.-P. Ho, “Equalization-enhanced phase noise for coherent-detection systems using electronic digital signal processing,” Opt. Express **16**(20), 15718 (2008).

**14.** A. Kakkar, J. R. Navarro, R. Schatz, X. Pang, O. Ozolins, A. Udalcovs, H. Louchet, S. Popov, and G. Jacobsen, “Laser frequency noise in coherent optical systems: Spectral regimes and impairments,” Sci. Rep. **7**(1), 844 (2017).

**15.** M. S. Neves, P. P. Monteiro, and F. P. Guiomar, “Enhanced phase estimation for long-haul multi-carrier systems using a dual-reference subcarrier approach,” J. Lightwave Technol. **39**(9), 2714–2724 (2021).

**16.** Optical Internetworking Forum, “OIF-400ZR implementation agreement,” (https://www.oiforum.com/technical-work/implementation-agreements-ias/).

**17.** A. Leven, N. Kaneda, and S. Corteselli, “Real-time implementation of digital signal processing for coherent optical digital communication systems,” IEEE J. Sel. Top. Quantum Electron. **16**(5), 1227–1234 (2010).

**18.** T. Pfau, H. Zhang, J. Geyer, and C. Rasmussen, “High performance coherent ASIC,” in 2018 European Conference on Optical Communication (ECOC), (IEEE, 2018).

**19.** T. Suzuki, S.-Y. Kim, J.-I. Kani, and J. Terada, “Real-time implementation of coherent receiver DSP adopting stream split assignment on GPU for flexible optical access systems,” J. Lightwave Technol. **38**(3), 668–675 (2020).

**20.** S. van der Heide, R. S. Luis, B. J. Puttnam, G. Rademacher, T. Koonen, S. Shinada, Y. Awaji, H. Furukawa, and C. Okonkwo, “Field trial of a flexible real-time software-defined GPU-based optical receiver,” J. Lightwave Technol. **39**(8), 2358–2367 (2021).

**21.** R. M. Ferreira, A. Shahpari, J. D. Reis, and A. L. Teixeira, “Coherent UDWDM-PON with dual-polarization transceivers in real-time,” IEEE Photonics Technol. Lett. **29**(11), 909–912 (2017).

**22.** B. Baeuerle, A. Josten, M. Eppenberger, D. Hillerkuss, and J. Leuthold, “Low-complexity real-time receiver for coherent Nyquist-FDM signals,” J. Lightwave Technol. **36**(24), 5728–5737 (2018).

**23.** C. S. Martins, F. P. Guiomar, and A. N. Pinto, “VHDL implementation of dual-stage CPR: Pilot-BPS,” (Zenodo, 2020, https://zenodo.org/record/4308781#.YQ7tV4hKjn1).

**24.** R. A. Shafik, M. S. Rahman, and A. R. Islam, “On the extended relationships among EVM, BER and SNR as performance metrics,” in 2006 International Conference on Electrical and Computer Engineering, (IEEE, 2006).

**25.** Y. Mahdy, S. Ali, and K. Shaaban, “Algorithm and two efficient implementations for complex multiplier,” in Proceedings of ICECS ’99, 6th IEEE International Conference on Electronics, Circuits and Systems, vol. 2 (1999), pp. 949–952.