Optica Publishing Group

Data-driven fiber model based on the deep neural network with multi-head attention mechanism

Open Access

Abstract

In this paper, we propose a data-driven fiber model based on a deep neural network with a multi-head attention mechanism. The model, which predicts signal evolution through fiber transmission in optical fiber telecommunications, offers advantages in computation time without losing much accuracy compared with the conventional split-step Fourier method (SSFM). In contrast with other neural-network-based models, it achieves a relatively good balance between prediction accuracy and distance generalization, especially when higher bit rates and more complicated modulation formats are adopted. Through numerical demonstration, we show that the model can predict signals up to 16-QAM at 160 Gbps over any transmission distance from 0 to 100 km, both with and without noise.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Calculation of optical signal evolution in fiber transmission is a crucial task because it provides valuable information for research on fiber properties and telecommunications. To perform it, one must numerically solve the nonlinear Schrödinger equation (NLSE), which describes signal transmission in the fiber. The split-step Fourier method (SSFM) is one of the most widely used algorithms for solving the NLSE [1]. By first dividing the whole fiber into sections chosen appropriately according to dispersion and nonlinear effects, this method calculates the signal by applying dispersion and nonlinear effects successively in each section. The result is obtained once all sections have been calculated one by one. Though this method provides reliable and accurate results, two main disadvantages follow from its calculation procedure. First, the number of fiber divisions must be determined appropriately before calculation. An improper choice causes a mismatch of the fiber length in each section, which has a high probability of producing inaccurate or even wrong results. This operation, which requires experience or a prior estimate of fiber dispersion and non-linearity, is similar to setting grids in the finite-difference time-domain (FDTD) method [2] or the finite element method (FEM) [3] for numerically solving Maxwell's equations. Second, the computation time varies drastically with dispersion and non-linearity. For large dispersion or non-linearity, a larger number of divisions is adopted, resulting in a smaller step size in each fiber section so that both dispersion and nonlinear effects are fully taken into account. Consequently, more computation time may be consumed under these circumstances.
In fiber telecommunications, the computation time grows further in cases with higher bit rates, more complicated modulation formats, longer transmission distances, larger launch powers or special fibers.
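The stepwise procedure described above can be sketched as a minimal split-step Fourier solver for the scalar NLSE. This is an illustrative sketch, not the implementation used in this paper; the sign convention assumed here is the common form ∂A/∂z = −i(β₂/2)∂²A/∂T² + iγ|A|²A, and all parameter names are illustrative.

```python
import numpy as np

def ssfm(a0, dt, dz, n_steps, beta2, gamma, alpha=0.0):
    """Minimal split-step Fourier solver for the scalar NLSE (sketch).

    a0      : complex field-envelope samples
    dt      : time step between samples [s]
    dz      : length of each fiber section [m]
    n_steps : number of sections (total length = n_steps * dz)
    beta2   : group-velocity dispersion [s^2/m]
    gamma   : nonlinear coefficient [1/(W*m)]
    alpha   : attenuation [1/m]
    """
    n = a0.size
    w = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)            # angular-frequency grid
    # linear (dispersion + loss) operator for one section, applied in frequency domain
    lin = np.exp((1j * beta2 / 2.0 * w**2 - alpha / 2.0) * dz)
    a = a0.astype(complex)
    for _ in range(n_steps):
        a = np.fft.ifft(np.fft.fft(a) * lin)             # dispersion step
        a = a * np.exp(1j * gamma * np.abs(a)**2 * dz)   # nonlinear phase step
    return a
```

Note how both disadvantages discussed above appear directly in this sketch: `n_steps` and `dz` must be chosen in advance from an estimate of dispersion and non-linearity, and the loop cost grows linearly with the number of sections.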

Artificial intelligence, which has been developing at an unprecedented speed, is revolutionizing the way research is conducted. Multiple data-driven models and networks such as fully-connected networks [4], convolutional neural networks (CNNs) [5,6] and recurrent neural networks (RNNs) [7] have been established in recent years. By utilizing data obtained beforehand and appropriately training the models with a supervised learning strategy, they can extract both the features and the rules inside the data and show good performance even on data they have never seen before. RNNs are the most widely used models in both natural language processing (NLP) [8] and time-series processing [9]. Unlike CNNs, which are widely adopted in image processing and classification, they process the data serially. Later, the long short-term memory network (LSTM) was put forward [10]. By introducing a memory module in each time step of the RNN, this model can extract more features and capture notable temporal properties of time series, so it obtains better performance in most cases. However, though the LSTM can memorize features across time steps, it only processes the time series in the forward direction. To solve this problem, the bidirectional LSTM (BiLSTM) was proposed [11]: information in the time series is processed in both the forward and backward directions. The BiLSTM was therefore probably the best model for time-series processing before the Transformer was proposed. Though the BiLSTM effectively increases the directions of time-series processing, its serial calculation can severely limit computing efficiency. To solve this problem, the multi-head attention mechanism was proposed to extract features from time series in a parallel way. This mechanism has allowed the Transformer to become one of the most powerful models in natural language processing today [12].
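The parallel feature extraction of multi-head attention mentioned above can be sketched in a few lines of NumPy. This is a generic sketch of scaled dot-product attention, not the paper's trained model; the weight matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    """Scaled dot-product multi-head attention (sketch).

    x              : (seq_len, d_model) input sequence
    wq, wk, wv, wo : (d_model, d_model) projection matrices (illustrative)
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(m):
        # project, then split features into heads: (n_heads, seq_len, d_head)
        return (x @ m).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(wq), split(wk), split(wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over keys
    heads = attn @ v                                      # per-head features
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ wo                                    # merge heads
```

All heads are computed with batched matrix products rather than a loop over time steps, which is the parallelism that distinguishes this mechanism from the serial LSTM/BiLSTM computation.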

Calculating signals in fiber transmission can be viewed as time-series processing. Here, the original analog signals propagating through the fiber are discretized at a specific sampling rate to become time series. Since data-driven models can be used in time-series processing, they can also be adopted to predict signal transmission through fiber; indeed, several research groups have combined conventional time-series models with fiber optics, including BiLSTM-based models, generative adversarial network (GAN) based models and so on [13–21]. Compared with the conventional SSFM, our data-driven model works as a data-driven NLSE solver. Its core neural-network structure, which adopts the multi-head attention mechanism, learns from the training data how to predict signal transmission and can correctly predict the evolution of unknown signals through fiber transmission. Once appropriately trained, the model predicts the transmitted signal as a whole instead of calculating fiber sections step by step according to dispersion and nonlinear effects. Therefore, the proposed data-driven signal prediction model has the same computation time regardless of fiber dispersion and non-linearity, which saves considerable computation time especially when dispersion and non-linearity are notable. Besides, there is no longer any need to select an appropriate number of fiber divisions, which greatly eliminates the risk of wrong results caused by an inappropriate estimate of fiber dispersion and non-linearity. In contrast to previously proposed neural-network-based prediction models such as the BiLSTM-based model, our model shows relatively greater regression ability, extending the modulation formats from PAM to QAM, the bit rate from 40 Gbps to 160 Gbps and the maximum transmission distance from 80 km to 100 km. In addition, our model has better distance generalization over transmission distances ranging from 0 to 100 km. Future work on cascaded models and noise will further extend the model's distance generalization and its prediction accuracy on noise-containing signals.

In this paper, we apply the multi-head attention mechanism to construct the data-driven signal prediction model. Since this mechanism can extract multiple features regarding transmission distortions, both accuracy and distance generalization are improved without losing the benefits of low time consumption and operational simplicity. Through numerical demonstration, we report that this model can predict up to 160 Gbps QAM signals over 100 km, with the potential of predicting signals over longer transmission distances and with more complicated modulation formats by adopting cascaded or other extended methods. The paper is divided into five parts. Background information on fiber signal prediction and data-driven models is first given in the introduction. In the second part, the structure of the data-driven model is briefly introduced, and the data collection and preparation before training, together with the training and testing procedures, are described. All results and the related analysis are shown in the third part, and the corresponding discussions and comparisons are provided in the fourth part. Conclusions and future focus are given in the last part.

2. Principles and simulation setups

2.1 Structure of data-driven fiber transmission model

After appropriate training, the data-driven model can function as the fiber transmission link, as shown in Fig. 1. Therefore, the inputs and outputs of the model should be the signals before and after fiber-link transmission, respectively. To accomplish relatively complicated signal regression and prediction tasks, the structure of the data-driven model should be well designed.


Fig. 1. Structure of data-driven model containing multi-head attention mechanism


The gray box in Fig. 1 depicts the model structure in detail. This data-driven model consists mainly of a Transformer-encoder component and a deep fully-connected structure. The Transformer encoder, as the first component, is adopted to extract multiple features from the input data via its multi-head attention mechanism. These features reflect not only the distribution and characteristics of the signals before transmission but also the transmission rules arising from fiber dispersion, nonlinear effects and so on. Residual branches and structures, shown as pink arrows and blue boxes, are adopted to avoid potential gradient vanishing during training. The deep fully-connected structure, as the second component, maps the extracted feature information into a higher-dimensional space so that the model can learn more detailed information through training.

The input data, which originates from the signals before transmission, is first replicated into two copies. One copy, carrying the main information, flows into the multi-head attention mechanism, where multiple features are extracted. The multi-head attention layer, shown in the red box in Fig. 1, has several key, query and value matrices that extract multiple features. During training, these three kinds of matrices progressively converge to the most appropriate state, mapping the input data into various feature spaces where the data's distribution and the fiber transmission rules are best reflected. The extracted features are then merged and stretched into vectors before entering the addition and normalization layer. To extract as many features as possible, the number of heads in the multi-head attention mechanism of our data-driven model is set to 17. The other input replica skips the multi-head attention mechanism and flows directly into the addition and normalization layer, where it is first summed with the merged features. This operation and structure, called the residual structure, was first proposed and widely adopted in ResNet to prevent gradient vanishing during the training of deep neural networks. Layer normalization is then conducted to compress the summation results into a standard range, which most benefits the neurons during learning. The normalized outputs of this layer are again replicated into two copies, as was done with the input data before the multi-head attention layer. Similarly, the second replica serves as the information flow shown by the pink arrow in Fig. 1 and enters the second addition and normalization layer directly, while the main replica flows into the feed-forward layer for further feature extraction before being integrated with the second replica.
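The data flow just described, two residual (add and normalize) sub-layers around the attention and feed-forward stages, can be summarized in a short sketch. The attention and feed-forward stages are passed in as callables here purely for illustration; this is the generic encoder-block pattern, not the paper's exact code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, attention, feed_forward):
    """One Transformer-encoder block (sketch): the input replica bypasses the
    attention sub-layer and is added back (residual branch), then normalized;
    the same pattern repeats around the feed-forward sub-layer."""
    h = layer_norm(x + attention(x))        # first add & norm
    return layer_norm(h + feed_forward(h))  # second add & norm
```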

All outputs from the second addition and normalization layer then enter the deep fully-connected component, which contains three fully-connected layers and two activation layers, to further match the target data. The three fully-connected layers work as a high-dimensional space mapper, projecting the pre-extracted and processed feature data into a high-dimensional space to further improve the regression ability of the data-driven model. Here the scales of the first, second and third fully-connected layers are set to 512:1024, 1024:1024 and 1024:512, respectively. The nonlinear function in both activation layers is ReLU, since it effectively avoids the potential gradient vanishing problem, especially when neurons are faced with either relatively large or relatively small weights. Since the data-driven model deals with signal regression tasks, no activation layer is needed before the output layer.
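The regression head described above, with the stated 512:1024, 1024:1024 and 1024:512 layer scales, ReLU between layers and a linear output, can be sketched as follows (weights and biases here are illustrative stand-ins for trained parameters):

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(x, 0.0)

def fc_head(x, w1, b1, w2, b2, w3, b3):
    """Three-layer fully-connected regression head (512 -> 1024 -> 1024 -> 512).

    ReLU follows the first two layers; the output layer is linear,
    since this is a regression task."""
    h = relu(x @ w1 + b1)   # 512 -> 1024
    h = relu(h @ w2 + b2)   # 1024 -> 1024
    return h @ w3 + b3      # 1024 -> 512, no activation
```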

2.2 Data preparations and pre-processing

The data used for training and testing the model should satisfy the requirements of both fiber transmission and the model's characteristics. Since a supervised learning strategy is adopted to train the model, the dataset must contain both inputs and targets. Here, the inputs originate from the signals before transmission, while the targets originate from the signals after transmission. To verify the distance generalization of the model, signals transmitted through different lengths of fiber are collected. Therefore, the inputs should contain not only the sampling points of the signals before transmission but also the transmission distances. For the targets, reference signals after fiber transmission are calculated through the SSFM.

Figure 2 depicts the data formats and collection configurations. In total, signals with three symbol rates (10 GBaud, 20 GBaud and 40 GBaud) and four modulation formats (OOK, PAM, QPSK and QAM) are collected. For each modulation format and symbol rate, the system's sampling rate is eight times the symbol rate, which means that 8 points are sampled within each symbol period. The launch power of the system is set to 0 dBm. The dataset contains both training and testing data. For the training data, signals transmitted over distances from 1 km to 100 km at 1 km intervals are collected. For the testing data, signals transmitted over distances from 0.5 km to 95.5 km at 5 km intervals are collected. For each combination of symbol rate, modulation format and transmission distance, 31744 symbols are mapped into waveforms and transmitted through the fiber. After the signals are collected and discretized at the sampling rate, both the real and imaginary parts of the sampled data are obtained from their original complex values. All sampling points undergo power normalization before being grouped and stacked into the data format suitable for the data-driven model.

As shown in Fig. 2, the formats of the inputs and targets differ. The input data as a whole is a 3D tensor. The basic unit of the 3D tensor is the real or imaginary part of the complex value of each sampling point, shown by the small cubes in deeper or lighter color, respectively. Since the sampling rate is eight times the symbol rate, 16 values in total, 8 for the real part p_r and 8 for the imaginary part p_i, constitute the data for one symbol s. Besides the sampling values, the normalized distance d, depicted by the gray small cube, is stacked at the end so that the data-driven model can learn transmission rules with distance generalization during training. To reflect the inter-symbol interference (ISI) caused by fiber dispersion during transmission, 32 symbols, each with 16 sampling values and 1 distance value, form a group constituting one layer L of the input data. One layer of the input data shown in Fig. 2 also represents one data sample. Since 31744 symbols are contained for each distance in our dataset, there are 992 data samples per distance in total. Different from the inputs, the targets form a 2D matrix, with each row representing one data sample and each column one sampling value. In the dataset of our data-driven model, each row contains 512 sampling values originating from 32 symbols.
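The packing of one input sample described above (32 symbols, each contributing 8 real and 8 imaginary sampled values plus the normalized distance, giving one 32 × 17 layer of the input tensor) can be sketched as follows; the function name and argument layout are illustrative, not the paper's actual code.

```python
import numpy as np

def pack_sample(symbols_complex, distance_norm):
    """Pack one input-data layer as described above (sketch).

    symbols_complex : (32, 8) complex array, 8 samples per symbol
    distance_norm   : scalar normalized transmission distance
    Returns a (32, 17) real array: 8 real + 8 imaginary values per symbol,
    with the distance value stacked at the end of each row."""
    real = symbols_complex.real              # (32, 8)
    imag = symbols_complex.imag              # (32, 8)
    dist = np.full((32, 1), distance_norm)   # distance broadcast to every symbol
    return np.concatenate([real, imag, dist], axis=1)
```

The matching target row simply flattens the 32 × 16 sampled values of the transmitted signal into the 512-value row of the target matrix.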


Fig. 2. Data formats and collection configurations


2.3 Training and testing settings of the model

Training and testing settings should also be considered seriously in the design of the data-driven model. Through training, the model should establish effective relations between the inputs and the targets so as to predict well for signals with different distances and bit patterns. Since this data-driven model accomplishes signal regression tasks, the normalized mean square error (NMSE) is adopted to evaluate the difference between the model outputs and the targets. Multiple optimizers, which update the trainable parameters based on modified stochastic gradient descent (SGD) algorithms, are available for training. In this paper, the Adam optimizer is chosen, since its adaptive learning-rate adjustment and momentum mechanism effectively avoid gradient explosion and local-minimum traps. Under these conditions, the model converges to the ideal state in a more stable and progressive way. The batch size, another hyper-parameter, is also crucial during training. Oversized or undersized batches have a high probability of either exceeding the memory limit or causing the convergence curve to oscillate intensively. Here, the batch sizes for both the training and testing data are set to 128 after taking the overall scale of the training and testing sets into consideration. Batch shuffling is necessary and should be performed before training to better balance the distribution differences between data samples.
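The NMSE loss and the shuffled mini-batching described above can be sketched as follows (an illustrative sketch of the stated settings: NMSE criterion, batch size 128, shuffling before each epoch; the actual training loop and optimizer state are omitted):

```python
import numpy as np

def nmse(pred, target):
    """Normalized mean-square error between model outputs and SSFM targets."""
    return np.sum(np.abs(pred - target) ** 2) / np.sum(np.abs(target) ** 2)

def batches(inputs, targets, batch_size=128, rng=None):
    """Yield shuffled mini-batches; shuffling balances sample distributions."""
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.permutation(len(inputs))
    for start in range(0, len(idx), batch_size):
        sel = idx[start:start + batch_size]
        yield inputs[sel], targets[sel]
```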

To show how the data-driven model progressively learns to capture the rules of fiber transmission, both the regression performance and the convergence curve during training on 10 GBaud PAM signals are depicted in Fig. 3 without loss of generality. Signal fragments containing 800 sampling points are chosen for plotting. To describe the convergence curve clearly, logarithmic values of the loss function are depicted in Fig. 3(a). As can be seen from this figure, the loss decreases progressively through the training process. Figures 3(b)–3(g) show how the model learns to predict the transmitted signals progressively; the model outputs and targets at epochs 1, 50, 100, 300, 500 and 1000 are shown. Yellow curves depict the model outputs, while blue curves depict the targets calculated by the SSFM. As can be seen from Fig. 3(b) and (c), in the first few training epochs, even though all parameters are randomly initialized, the model already captures crucial features of the inputs and learns the rule relating the inputs to the targets at a relatively high convergence speed. However, at this stage the model performs roughly on several details such as the bottoms or peaks of the signal; this can also be found in Fig. 3(b)-(e), where the model's outputs show some inconsistency with the targets. As can be seen from Fig. 3(d)-(g), the model fits the details better and better as training proceeds. After 300 epochs, it is hard to distinguish the fitting differences by comparing Fig. 3(f) and (g). Only the logarithmic convergence curve in Fig. 3(a) indicates that the loss continues to decrease at this stage, but at a much lower speed.
Note that cases with different bit rates and modulation formats may differ in convergence speed, but the overall trend of convergence remains the same.


Fig. 3. Convergence performance of the model through training. (a) Iteration curve. (b) Epoch 1. (c) Epoch 50. (d) Epoch 100. (e) Epoch 300. (f) Epoch 500. (g) Epoch 1000. In (b)-(g), blue curves represent targets from the SSFM, while yellow curves mark the model outputs.


3. Results and analysis

3.1 Model’s performances with noise-free signals

The testing dataset, which consists of signals with different symbol rates (10 GBaud, 20 GBaud and 40 GBaud) and different modulation formats (OOK, PAM, QPSK and QAM) transmitted over distances from 0.5 km to 95.5 km at 5 km intervals, is used to test the model's signal prediction and distance generalization.

Regression results for the time-domain OOK and PAM signals with different symbol rates are shown in Fig. 4, since their time-domain signals are intuitive. In each subfigure, the convergence curve is depicted on top; the input power and phase waveforms are shown in the middle on the left and right, respectively; and the output power and phase waveforms are shown at the bottom on the left and right, respectively. To compare the model outputs with the SSFM outputs as targets, blue curves represent the targets while yellow curves show the model outputs. Without loss of generality, signal fragments with 2000 sampling points are chosen for plotting.


Fig. 4. Performance of model testing on signals with different symbol rates and modulation formats. In the output-waveform figures, blue curves represent targets from the SSFM, while yellow curves mark the model outputs. (a) 10GBaud-OOK. (b) 20GBaud-OOK. (c) 40GBaud-OOK. (d) 10GBaud-PAM. (e) 20GBaud-PAM. (f) 40GBaud-PAM.


Regarding the model's performance in predicting OOK and PAM signals, the convergence curves in Fig. 4(a)-(f) suggest that our data-driven model has successfully established the relations between the signals after transmission and those before transmission. The convergence curves all descend relatively quickly, though some fluctuations remain, especially near the end of training; these phenomena, clearly visible in Fig. 4, imply the complicated shapes of the high-dimensional loss functions. Comparing cases with the same modulation format but different symbol rates in Fig. 4(a)-(c), the differences in the final logarithmic loss values suggest that the model's performance decreases slightly as the symbol rate increases. This is because a higher symbol rate not only brings relatively more severe ISI but also causes larger differences between signals transmitted over different distances, which further increases the prediction difficulty for a model with the same number of learnable parameters. The relatively good prediction precision in all cases is also demonstrated by the output power and phase results in Fig. 4, where the yellow curves representing the model outputs match the blue target curves relatively well. In addition, the output-phase results in Fig. 4 show that the phase signals change during transmission even for intensity modulations.

Apart from OOK and PAM signals, relatively more complicated modulation formats such as QPSK and QAM are also tested. To depict high-order modulation format signals more vividly, constellation diagrams of the equivalent back-to-back (B2B) QPSK and QAM signals with different symbol rates and transmission distances are shown in Fig. 5 and Fig. 6, respectively. These B2B signals are obtained by compensating the transmitted signals for dispersion and non-linearity; filter effects are also eliminated to show the model's prediction performance more clearly. Each pair of rows of six diagrams, e.g. Fig. 6(a1)-(a6), can be viewed as one subfigure group, since it shows constellation diagrams of the equivalent B2B signals from the model and the SSFM at one specific symbol rate. For subfigure group Fig. 5(a1)-(a6), constellation diagrams of the equivalent B2B signals originating from the SSFM (targets) are depicted in the first row, and the constellation diagrams predicted by the data-driven model are depicted in the second row. In each subfigure group, constellation diagrams of the equivalent B2B signals over three different transmission distances are shown. In the constellation diagrams of the signals before transmission, different colors mark different modulation symbols: 4 colors are used for QPSK signals and 16 for QAM signals, representing sampling points from the 4 or 16 different symbols, respectively.


Fig. 5. Constellation diagrams of the equivalent B2B QPSK signals. (a1)-(a6): 10GBaud. (b1)-(b6): 20GBaud. (c1)-(c6): 40GBaud. Each pair of rows of six diagrams, e.g. (a1)-(a6), forms one subfigure group. In each subfigure group, the diagrams in the first row depict the results from the model, while the diagrams in the second row depict the targets.



Fig. 6. Constellation diagrams of the equivalent B2B QAM signals. (a1)-(a6): 10GBaud. (b1)-(b6): 20GBaud. (c1)-(c6): 40GBaud. Each pair of rows of six diagrams, e.g. (a1)-(a6), forms one subfigure group. In each subfigure group, the diagrams in the first row depict the results from the model, while the diagrams in the second row depict the targets.


Comparing the first and second rows in each subfigure group in Fig. 5 and Fig. 6 indicates the regression precision between the model's outputs and the targets. Overall, our data-driven model shows relatively good regression of signals with high-order modulation formats, since the constellation patterns of the model outputs and the targets are almost identical; the slight differences in the outer-region distributions, which can barely be observed directly, indicate a few regression imperfections.

Comparing the constellation changes over distance within each subfigure group of Fig. 5 and Fig. 6 shows that, as the transmission distance increases, the difference between the constellation distributions of the model and the target increases due to the higher accumulated fiber dispersion. For example, in Fig. 6(a1)-(a6), the constellation points of the model's prediction for 10 GBaud QAM signals transmitted over 90.5 km concentrate onto the sixteen dots shown in the constellation diagrams from the SSFM calculation, while in Fig. 6(c1)-(c6) the constellation points scatter relatively farther from the theoretical sixteen dots.

In addition, analysis of the model's predictions for the four modulation formats shows that the modulation format is another significant factor affecting the regression ability of the data-driven model: the regression difficulty increases for higher-order modulation formats. As can be seen in Figs. 4–6, the overall regression performance for OOK and PAM signals is relatively better than for QPSK and QAM signals. For OOK and PAM signals, the information is modulated onto the intensity; in contrast, for QPSK and QAM signals, the information is modulated at least onto the phase. Even for the relatively simpler QPSK, where the information is modulated only onto the phase, the ISI effects are more sophisticated than for intensity modulation, because phase modulation causes both the intensity and phase signals to change more intensively. This leads to greater difficulty for a model with the same number of learnable parameters in performing the regression tasks on QPSK and QAM signals. Therefore, the model's predictions for OOK and PAM signals are better than those for QPSK and QAM signals, especially at higher symbol rates over longer transmission distances.

An overall evaluation of the model's prediction performance on signals with different modulation formats and symbol rates is given in Fig. 7, where the logarithmic NMSE of the power predictions for the modulation formats OOK, PAM, QPSK and QAM is shown in Fig. 7(a)-(d), respectively. In each subfigure, blue, orange and yellow represent 10 GBaud, 20 GBaud and 40 GBaud, respectively. Each bar represents the error at one of the 20 distances contained in the test set. In this figure, the influence of both modulation format and symbol rate can be seen vividly. Comparing the subfigures, we conclude that the model predicts OOK and PAM more precisely than QPSK and QAM, consistent with the above analysis. Comparing the overall heights of the differently colored bars in each subfigure, the average prediction error over all distances increases with the symbol rate. The common reason behind these trends is the difference in the extent of signal distortion and the sensitivity to fiber physical effects. Distance generalization can be observed by comparing the heights of same-colored bars at different distances in Fig. 7. The overall prediction differences are acceptable, though relatively notable differences across distances appear when either the modulation complexity or the symbol rate increases. Though the model operates like a 'black box', an interesting behavior can be found in Fig. 7: relatively larger prediction errors tend to occur when the model predicts long-distance signals, and this trend becomes more notable at higher symbol rates.


Fig. 7. Logarithmic NMSE of the power predictions. (a) OOK. (b) PAM. (c) QPSK. (d) QAM.


3.2 Model’s performances with signals containing noise

Noise is one of the most crucial factors degrading signal quality in fiber-optic communications. Therefore, it is necessary to test whether our proposed model can deal with noisy signals. To conduct this verification, we add noise to the signals before transmission. All data collection, training and testing procedures follow the same routine as in the noise-free cases. Convergence curves and time-domain prediction results for noise-containing OOK and PAM signals are shown in Fig. 8. Constellation diagrams of both the model predictions and the targets for noise-containing QPSK and QAM signals are depicted in Fig. 9 and Fig. 10, respectively.
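Adding noise to the signals before transmission can be sketched as complex additive white Gaussian noise at a chosen signal-to-noise ratio. This is an illustrative sketch only: the paper does not state the exact noise model or level used, so `snr_db` and the AWGN assumption are hypothetical here.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add complex AWGN to a signal at a given SNR in dB (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    p_sig = np.mean(np.abs(signal) ** 2)                 # average signal power
    p_noise = p_sig / (10.0 ** (snr_db / 10.0))          # target noise power
    noise = np.sqrt(p_noise / 2.0) * (rng.standard_normal(signal.shape)
                                      + 1j * rng.standard_normal(signal.shape))
    return signal + noise
```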


Fig. 8. Performance of model testing on noise-containing signals with different symbol rates and modulation formats. In the output-waveform figures, blue curves represent targets from the SSFM, while yellow curves mark the model outputs. (a) 10GBaud-OOK. (b) 20GBaud-OOK. (c) 40GBaud-OOK. (d) 10GBaud-PAM. (e) 20GBaud-PAM. (f) 40GBaud-PAM.



Fig. 9. Constellation diagrams of the equivalent B2B QPSK signals containing noise. (a1)-(a6): 10GBaud. (b1)-(b6): 20GBaud. (c1)-(c6): 40GBaud. Each pair of rows of six diagrams, e.g. (a1)-(a6), forms one subfigure group. In each subfigure group, the diagrams in the first row depict the results from the model, while the diagrams in the second row depict the targets.


Observing the input waveforms of all cases in Fig. 8, it is evident that noise has been added onto the originally noise-free signals depicted in Fig. 4. In general, the convergence behavior remains the same as in the noise-free cases: both the convergence speed and the final precision of the model decrease when either the modulation order or the symbol rate increases. Comparing the model on the same signal cases with and without noise shows that it predicts noise-free signals somewhat better. This is because adding noise increases the uncertainty of the waveform characteristics throughout fiber propagation, which in turn makes it more difficult for a model with the same number of learnable parameters to capture the transmission rules.

For the higher-order modulation formats, whose results are shown in Fig. 9 and Fig. 10, the convergence trends remain unchanged from the OOK and PAM signals. Comparing the constellation diagrams of the equivalent B2B signals before fiber transmission in Fig. 9 or Fig. 10 with those in Fig. 5 and Fig. 6, the more widely scattered constellation points reveal the noise contained in the QPSK and QAM signals. The noise also affects the signal evolution throughout fiber transmission, so it produces more dispersed constellation points after transmission as well. Even when dealing with these more complicated noise-containing QPSK and QAM signals, the model maintains good precision: the inconsistency between the constellation diagrams from the model and those from the targets can barely be observed directly.


Fig. 10. Constellation diagrams of the equivalent B2B QAM signals containing noise. (a1)-(a6): 10GBaud. (b1)-(b6): 20GBaud. (c1)-(c6): 40GBaud. The six diagrams in every two rows, e.g. (a1)-(a6), form one subfigure group; in each group, the first row depicts the results from the model and the second row depicts the targets.


To evaluate the model’s prediction precision quantitatively, bar diagrams of the NMSE of the power predictions for noise-containing signals transmitted over different distances are depicted in Fig. 11. The general trends of the prediction precision are consistent with the analysis above. Comparing Fig. 11 with Fig. 7 clarifies the influence of noise on the prediction precision under different transmission circumstances. For noise-containing signals, the difference between the prediction errors of the 20GBaud and 40GBaud signals increases. In Fig. 7(d), the prediction errors for the 20GBaud QAM signals are the same as, or even lower than, those of the 40GBaud QAM signals for transmission distances ranging from 30.5 km to 90.5 km; when noise exists, however, the prediction errors of the 20GBaud QAM signals are larger than those of the 40GBaud QAM signals over the same range, as can be seen in Fig. 11(d). Apart from the higher average prediction errors for noisy signals, the noise also worsens the model’s distance generalization. As can be seen in Fig. 7(b), the prediction errors for the noise-free 20GBaud PAM signals are almost the same across distances, while for the noise-containing signals the differences between distances become noticeable.


Fig. 11. Logarithm of the NMSE of the power predictions for signals containing noise. (a) OOK. (b) PAM. (c) QPSK. (d) QAM.


4. Discussions and comparisons

4.1 Model’s multi-head attention mechanism visualization

Though the whole model tends to operate like a black box that hides its operating and computing principles, it is still meaningful to visualize the model, or at least its multi-head attention layer. Since the model deals with several different communication circumstances and adopts 17 heads in the multi-head attention layer, as illustrated in Fig. 2 and section 2.1, it is impractical to depict the attention weights of every head. Without loss of generality, attention maps of 2 heads per communication case are shown in Fig. 12 and Fig. 13. Figures 12(a), 12(c), 13(a) and 13(c) show attention maps of noise-free 10GBaud OOK, 40GBaud PAM, 10GBaud QPSK and 20GBaud QAM signals respectively, while Figs. 12(b), 12(d), 13(b) and 13(d) show the corresponding attention maps for the noise-containing signals.


Fig. 12. Attention layer visualization for models on OOK and PAM transmission cases. (a1)-(a2): 10GBaud-OOK signals without noise. (b1)-(b2): 10GBaud-OOK signals with noise. (c1)-(c2): 40GBaud-PAM signals without noise. (d1)-(d2): 40GBaud-PAM signals with noise.


In each subfigure of Fig. 12 and Fig. 13, the X-axis represents the source symbol sequence, the Y-axis the target symbol sequence, and the Z-axis the connection weights between the source and target symbol sequences. As can be seen, the different patterns of the heads indicate that each head captures different features and connection weights between the waveforms of the source and target sequences. Comparing the attention maps of signals with and without noise, the weight distributions are similar; this can be seen clearly by comparing Fig. 12(a1) with Fig. 12(b1). However, for higher-order modulation formats or higher symbol rates, the similarity appears to decrease: for the 20GBaud QAM signals, the attention maps show relatively less similarity between the noise-free and noise-containing cases.


Fig. 13. Attention layer visualization for models on QPSK and QAM transmission cases. (a1)-(a2): 10GBaud-QPSK signals without noise. (b1)-(b2): 10GBaud-QPSK signals with noise. (c1)-(c2): 20GBaud-QAM signals without noise. (d1)-(d2): 20GBaud-QAM signals with noise.


4.2 Computational complexity analysis

With regard to computational complexity and the related run time, the cost of SSFM is closely tied to the maximum transmission distance and the fiber non-linearity, and grows rapidly when either of the two factors increases according to the analysis in [12], while the cost of the model depends mainly on the scale of its learnable parameters.

For the quantitative analysis, the calculation complexity of the model is presented first. According to the theory of the Transformer, the time complexity of each block of the model comes mainly from three structures: the multi-head attention layer, the feed-forward structure and the fully-connected layer.

For the multi-head attention layer, three operations are conducted: input linear mapping, attention calculation and output linear mapping. To state this precisely, let n denote the sequence length per sample in each block and d the input dimension. The computational complexities of the three operations are O(nd²), O(n²d) and O(nd²) respectively, according to the article “Attention is all you need” [12]. Therefore, the complexity of the multi-head attention layer is O(2nd² + n²d).
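To make these terms concrete, a minimal single-head scaled dot-product attention is sketched below in NumPy. This is an illustrative simplification, not the paper's actual layer (which uses 17 heads and learned projections): the Q/K/V projections are (n,d)@(d,d) matmuls, giving the O(nd²) terms, and the score matrix is an (n,d)@(d,n) matmul, giving the O(n²d) term.

```python
import numpy as np

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention on an (n, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # linear mappings: O(n d^2) each
    scores = q @ k.T / np.sqrt(x.shape[1])       # score matrix: O(n^2 d)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # weighted sum: O(n^2 d)

n, d = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w = [rng.standard_normal((d, d)) for _ in range(3)]
out = attention(x, *w)
print(out.shape)  # (8, 4)
```

The multi-head version simply runs several such heads on split dimensions and concatenates the results, which leaves the leading-order counts unchanged.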

The feed-forward structure is a fully connected network applied to each of the n sequence positions. For each position, it first maps the input of size d into a high-dimensional representation space of dimension R1, then maps this representation back to the output dimension d. Therefore, the total complexity of this structure is O(2ndR1).

The last fully-connected layer is also a fully connected network, with layer sizes n(d−1), R2, R3 and n(d−1). Therefore, the total complexity of this layer is O(n(d−1)R2 + R2R3 + n(d−1)R3).

Therefore, the complexity of each block is O(2nd² + n²d + 2ndR1 + n(d−1)R2 + R2R3 + n(d−1)R3). This clearly indicates that the model’s computational complexity depends only on the network structure itself, not on fiber or signal parameters such as non-linearity, transmission length or launch power. Consequently, the complexity of the model remains the same even for longer transmission lengths or larger launch powers.
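The per-block operation count can be written directly as a function of the structural parameters alone, which makes the distance independence explicit. This sketch simply encodes the terms derived above (R1, R2, R3 follow the paper's notation):

```python
def block_complexity(n, d, r1, r2, r3):
    """Leading-order operation count per model block.

    Terms follow the derivation above: attention (2nd^2 + n^2 d),
    feed-forward (2 n d r1), and the final fully-connected layer
    (n(d-1)r2 + r2 r3 + n(d-1)r3). No fiber parameter appears.
    """
    attention_ops = 2 * n * d ** 2 + n ** 2 * d
    feed_forward_ops = 2 * n * d * r1
    fully_connected_ops = n * (d - 1) * r2 + r2 * r3 + n * (d - 1) * r3
    return attention_ops + feed_forward_ops + fully_connected_ops
```

Whatever the transmission distance or launch power, the returned count is fixed once the network configuration is fixed.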

The computational complexity of the SSFM can be derived following [23]. SSFM divides the whole fiber into several computing units. In each computing unit, there are two FFT operations together with the dispersion and non-linearity operations. Suppose SSFM divides an L km fiber equally into computing units of length dl; since the computational complexity of each FFT operation is O(2n(d−1)log2(n(d−1))), the total computational complexity of SSFM is O(L/dl × (4n(d−1)log2(n(d−1)) + Ndispersion + Nnonlinearity)), where Ndispersion and Nnonlinearity represent the computational complexity of the dispersion and non-linearity operations respectively.
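As a sketch, this SSFM operation count (ignoring the dispersion and non-linearity terms, as done for the plotted comparison) can be coded directly, with n and d following the notation above:

```python
import numpy as np

def ssfm_complexity(l_km, dl_km, n, d):
    """SSFM operation count, ignoring N_dispersion and N_nonlinearity.

    Each computing unit of length dl performs two FFTs on sequences of
    n(d-1) samples, each costing 2*m*log2(m) operations for m = n(d-1).
    """
    m = n * (d - 1)
    per_unit = 4 * m * np.log2(m)     # two FFTs of 2*m*log2(m) each
    return (l_km / dl_km) * per_unit
```

Unlike the model's count, this grows linearly with the fiber length L and inversely with the computing-unit length dl.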

Inserting the model’s configuration, setting dl equal to 0.05 km, 0.02 km, 0.01 km and 0.005 km, and ignoring both Ndispersion and Nnonlinearity, the computational complexities of the model and of SSFM are plotted in Fig. 14. In the figure, black, red, green, blue and purple lines depict the computational complexity of the model and of SSFM with the different computing-unit lengths.


Fig. 14. Computational complexity of the model and SSFM


Several rules of computational complexity can be seen from Fig. 14. Firstly, the model’s computational complexity does not vary with the transmission distance: once effectively trained, the model can predict signals transmitted over any distance below 100 km at the same cost. Secondly, the computational complexity of SSFM varies with both the transmission distance and the computing-unit length, increasing more steeply as the transmission distance grows or the computing-unit length shrinks. These two rules produce the crossing points between the curves of the model and SSFM. When the computing-unit length equals 0.05 km, the computational complexities of the two methods become equal at around 11.8 km; this value decreases to 1.2 km when the computing-unit length equals 0.005 km. On average, taking the computing-unit length as 0.005 km, the computational complexity of our model is around 2.37% of that of SSFM.
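The crossing behavior follows directly from a flat model cost versus a linearly growing SSFM cost. The sketch below uses hypothetical numbers (the paper's exact model cost and sequence configuration are not restated here) purely to show how the crossover distance scales with the computing-unit length:

```python
import numpy as np

C_MODEL = 1e9  # hypothetical fixed model cost in operations

def crossover_km(dl_km, m=64 * 32):
    """Distance at which SSFM's FFT cost first exceeds the model's.

    m is the number of samples per sequence (hypothetical value);
    SSFM costs 4*m*log2(m) operations per computing unit of length dl.
    """
    ssfm_ops_per_km = (1.0 / dl_km) * 4 * m * np.log2(m)
    return C_MODEL / ssfm_ops_per_km

# smaller computing units push the crossover toward shorter distances
print(crossover_km(0.05), crossover_km(0.005))
```

With these assumptions the crossover distance scales in direct proportion to dl, which is consistent with the trend of the crossing points in Fig. 14.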

4.3 Comparison with other models

With the development of neural networks, different neural-network-based fiber prediction models have been proposed, such as the BiLSTM-based model [13], the GAN-based model [24], the principle-driven model [23] and other data-driven models [25]. These models address different circumstances of fiber-optic communications. Classified by transmission distance, the model proposed in this paper, like the BiLSTM-based model, aims at short-distance fiber transmission such as the near-end fiber links in real communications. Since the parameters of near-end fiber links may vary with customer requests and building conditions, the corresponding models often emphasize distance generalization. In contrast, for medium- or long-distance transmission such as long-haul links, fiber links usually consist of several transmission spans of fixed length, so the corresponding prediction models [23-25] may place more emphasis on prediction accuracy.

Prediction accuracy and computational complexity vary with the neural network each model adopts. For models dealing with short fiber transmission, a single module forms the whole model regardless of the transmission distance, so their computational complexity is determined only by their own structure; this characteristic is clearly visible in the black line in Fig. 14. For the BiLSTM-based model [13], the computational complexity is governed by the BiLSTM structure and the subsequent fully-connected structure. Since the BiLSTM complexity depends only on the sequence length (33 in each time step), and the fully-connected layers that follow are not very large (<100 neurons), its computational complexity is smaller than that of our proposed model. However, according to Fig. 7 and Fig. 11, our model possesses better prediction precision and distance generalization than the BiLSTM model. For models dealing with medium- and long-distance transmission, a comparison with our model is of less significance since the applications differ. The computational complexity of such models often has to take the fiber length or span number into consideration, since cascaded modules form the whole model, especially for the multi-span long-haul model [22]. For GAN-based models, the computational complexity is closely related to the structures inside the generative and adversarial parts, as analyzed in [23,24].

Table 1 compares our model with other highly recognized fiber models. The 2nd and 3rd columns are the most important, since these two models both deal with short-distance transmission. Other models that deal with long-distance transmission and adopt different training and testing schemes are listed as representatives in the 4th and 5th columns, and the traditional SSFM is listed in the last column. As the table shows, different models cover different scopes of fiber transmission; for example, the cascaded fiber model focuses on multi-span long-haul transmission, while single-structured models such as the BiLSTM-based model and ours focus on short-distance transmission. The models also differ in learning strategy: data-driven models such as the cascaded model, the BiLSTM-based model and ours use large amounts of data to train their networks, while the principle-driven model uses the NLSE and the corresponding initial conditions as its loss function to constrain the network to converge.


Table 1. Comparison with other highly recognized fiber models

Comparing our model with the most relevant and highly recognized BiLSTM-based model, which likewise deals with short-distance fiber transmission and is trained on data, three improvements can be identified. Firstly, beyond the 10GBaud OOK and PAM signal predictions, our model can also handle up to 40GBaud QAM signals with relatively high precision, as can be seen in Fig. 7. Secondly, the distance generalization is extended from 80 km to 100 km, which means that our model can predict any signal’s transmission within 100 km once appropriately trained. Thirdly, our model also demonstrates the ability to predict signals with noise, which makes it more suitable for practical use.

5. Conclusions

In this paper, we propose a novel data-driven model featuring a multi-head attention mechanism that predicts signal transmission in fiber. Owing to its multi-head attention mechanism, residual structures and deep fully-connected structures, the model can not only progressively capture and process multiple important features from the inputs but also establish the relations between the signals before and after transmission through training. Tested with waveforms of previously unseen transmission distances and bit patterns, the model shows relatively strong regression and generalization over signals with different code patterns and distances. So far, the model can predict 160Gbps QAM signals over 100 km with acceptable accuracy. Apart from that, it can predict noise-containing signals without losing much precision.

In total, three main advantages follow from adopting the model. Firstly, since it is data-driven, it saves computation time once well trained compared with the conventional SSFM: there is no need to evaluate the fiber dispersion and non-linearity to choose the correct number of sections before calculation, so the model avoids the risk of obtaining wrong results due to a lack of experience or prerequisite knowledge. Secondly, its relatively strong regression and generalization ability guarantees that, once trained appropriately, it can predict signals with different code patterns and transmission distances with relatively high accuracy, which would otherwise have to be recalculated from scratch with SSFM. Thirdly, the model achieves a good balance between prediction accuracy and distance generalization thanks to the multi-head attention mechanism and deep network structures it features.

Compared with other neural-network-based models such as the BiLSTM model, our model’s stronger regression ability allows it to predict more sophisticated signals with higher-order modulation formats or higher symbol rates. Thus, our model extends the maximum predicted symbol rate from 10GBaud to 40GBaud, the maximum transmission distance from 80 km to 100 km and the modulation format from PAM to QAM. In contrast with the GAN-based prediction model, our model’s distance generalization allows it to predict signals with transmission distances ranging from 0 to 100 km instead of a fixed 50 km. Hence, it has more application potential in the field of optical fiber communications. Judging by its regression and generalization ability, the model can serve either as a transmission-signal generator to build signal databases for further model training or as a transmission-signal provider for fault detection and so on.

Future work will mainly focus on further improving the regression and generalization performance of the model. Non-ideal effects and parameters beyond noise, such as laser linewidth and PMD, should be factored in and predicted by the model so as to approach realistic optical communication scenarios; structures may then be adjusted or added to the model to better extract more features and the more complicated rules of signal evolution during fiber propagation. Apart from that, more effort should be dedicated to further extending the maximum transmission distance. Cascading, one feasible plan, adopts several pre-trained prediction models with the same maximum prediction distance and concatenates them as a chain to predict signals over longer distances. Under these circumstances, the model can perform even better in time consumption and computational complexity, and retain its advantages over the conventional SSFM and other data-driven models.

Funding

National Key Research and Development Program of China (2019YFB1803501); National Natural Science Foundation of China (62135009).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. G. P. Agrawal, Nonlinear Fiber Optics, authorized Chinese version of 5th ed. (Publishing House of Electronics Industry, Beijing, 2013), pp. 17–57.

2. S. D. Gedney, “Introduction to the finite-difference time-domain (FDTD) method for electromagnetics,” Synthesis Lectures on Computational Electromagnetics 6(1), 1–250 (2011). [CrossRef]  

3. G. Meunier, “The finite element method for electromagnetic modeling,” (2010).

4. H. Wang and B. Raj, “On the origin of deep learning,” arXiv, arXiv:1702.07800 (2017). [CrossRef]  

5. Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” Shape, contour and grouping in computer vision. Springer, Berlin, Heidelberg 1681, 319–345 (1999). [CrossRef]  

6. D. K. Jha, D. Kumar, and J. K. Mishra, “Transfer learning approach toward joint monitoring of bit rate and modulation format.,” Appl. Opt. 61(13), 3695–3701 (2022). [CrossRef]  

7. Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv, arXiv:1506.00019 (2015). [CrossRef]  

8. K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv, arXiv:1406.1078 (2014). [CrossRef]  

9. J. Zhang and K. F. Man, “Time series prediction using RNN in multi-dimension embedding phase space,” SMC'98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No. 98CH36218). 2. IEEE (1998).

10. F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Computation 12(10), 2451–2471 (2000). [CrossRef]  

11. Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF models for sequence tagging,” arXiv, arXiv:1508.01991 (2015). [CrossRef]  

12. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv, arXiv:1706.03762 (2017). [CrossRef]  

13. D. Wang, Y. Song, J. Li, J. Qin, T. Yang, M. Zhang, X. Chen, and A. C. Boucouvalas, “Data-driven optical fiber channel modeling: a deep learning approach,” J. Lightwave Technol. 38(17), 4730–4743 (2020). [CrossRef]  

14. F. N. Khan, Q. Fan, C. Lu, and A. P. T. Lau, “An optical communication's perspective on machine learning and its applications,” J. Lightwave Technol. 37(2), 493–516 (2019). [CrossRef]  

15. B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end deep learning of optical fiber communications,” J. Lightwave Technol. 36(20), 4843–4855 (2018). [CrossRef]  

16. S. Yan, F. N. Khan, A. Mavromatis, D. Gkounis, Q. Fan, F. Ntavou, K. Nikolovgenis, F. Meng, E. Hugues Salas, C. Guo, C. Lu, A. P. T. L., R. Nejabati, and D. Simeonidou, “Field trial of machine-learning-assisted and SDN-based optical network planning with network-scale monitoring database,” 2017 European Conference on Optical Communication (ECOC). IEEE, (2017).

17. B. Rahmani, D. Loterie, G. Konstantinou, D. Psaltis, and C. Moser, “Multimode optical fiber transmission with a deep learning network,” Light: Sci. Appl. 7(1), 69 (2018). [CrossRef]  

18. K. Godal, B. Jalali, C. Lei, G. Situ, and P. Westbrook, “AI boosts photonics and vice versa,” APL Photonics 5(7), 070401 (2020). [CrossRef]  

19. A.-P. B.- Dionne and O. J. F. Martin, “Teaching optics to a machine learning network,” Opt. Lett. 45(10), 2922–2925 (2020). [CrossRef]  

20. S. Kumar, T. Bu, H. Zhang, I. Huang, and Y. Huang, “Robust and efficient single-pixel image classification with nonlinear optics,” Opt. Lett. 46(8), 1848–1851 (2021). [CrossRef]  

21. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021). [CrossRef]  

22. Y. Zang, Z. Yu, K. Xu, M. Chen, S. Yang, and H. Chen, “Multi-span long-haul fiber transmission model based on cascaded neural networks with multi-head attention mechanism,” J. Lightwave Technol. 40(19), 1–8 (2022). [CrossRef]  

23. Y. Zang, X. Lan, Z. Yu, K. Xu, M. Chen, S. Yang, and H. Chen, “Principle-Driven Fiber Transmission Model Based on PINN Neural Network,” J. Lightwave Technol. 40(2), 404–414 (2022). [CrossRef]  

24. H. Yang, Z. Niu, S. Xiao, J. Fang, Z. Liu, D. Fainsin, and L. Yi, “Fast and accurate optical fiber channel modeling using generative adversarial network,” J. Lightwave Technol. 39(5), 1322–1333 (2021). [CrossRef]  

25. H. Yang, Z. Niu, H. Zhao, S. Xiao, W. Hu, and L. Yi, “Fast and accurate waveform modeling of long-haul multi-channel optical fiber transmission using a hybrid model-data driven scheme,” J. Lightwave Technol. 40(14), 4571–4580 (2022). [CrossRef]  

