
OptiDistillNet: Learning nonlinear pulse propagation using the student-teacher model

Open Access

Abstract

We present a unique approach for learning the pulse evolution in a nonlinear fiber using a deep convolutional neural network (CNN) by solving the nonlinear Schrödinger equation (NLSE). Deep network model compression has become widespread for deploying such models in real-world applications. A knowledge distillation (KD) based framework for compressing a CNN is presented here. The student network, termed OptiDistillNet, generalises better, converges faster, runs faster, and uses fewer trainable parameters. This work represents the first effort, to the best of our knowledge, that successfully applies a KD-based technique to a nonlinear optics application. Our tests show that even when the model size is reduced by up to 91.2%, we can still achieve a mean square error (MSE) very close to the MSE of $1.04\times10^{-5}$ achieved by the teacher model. The advantages of the suggested model include a simple architecture, fast optimization, and improved accuracy, opening up applications in optical coherent communication systems.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Recent advances in the field of machine learning have made it possible to solve many kinds of problems in different domains, such as natural language processing, object detection, computer vision and optical communications [1–5]. Despite the use of machine learning for various applications, it struggles to overcome the challenge of data efficiency. Most deep learning models require large datasets for good performance. Moreover, collecting and annotating a large dataset is expensive. Therefore, this data-hungry nature of machine learning tools limits their applications [6].

Deep neural networks (DNNs) have a large number of parameters and consequently demand a great deal of computation during training and testing. Ordinary computing hardware is incapable of handling such deep networks. This problem impedes the adoption of deep learning models in real-time systems. Since DNNs are being widely employed in resource-constrained environments such as mobile phones, which typically require real-time operation, numerous studies have been conducted to effectively architect and train them. To address this issue, there have been numerous efforts towards compressing deep networks [7]. A potent answer to this challenge can be to use pre-trained large data models to improve the learning of a relatively simpler architecture. Deep models trained on large datasets have been shown to be capable of transferring knowledge to shallow networks. One way of doing this is knowledge distillation (KD), which has recently become an increasingly popular method for model compression [8].

Knowledge distillation (KD) has emerged as an extremely beneficial approach for a variety of applications such as computer vision and natural language processing [6,9]. While bigger networks improve overall performance, they are difficult to train and have a tendency to overfit the data [1]. Recently, several efforts have been made to deploy neural networks on hardware in order to mitigate the computational complexity introduced by software implementations. Other factors such as security and latency have also led to the rise of neural networks (NNs) being implemented on hardware rather than being processed offline in the cloud [10]. These ML-based architectures are widely utilised in optical communication systems for nonlinear compensation to improve channel efficiency and reduce latency [11–14]. There is a need for even greater use of these architectures in hardware systems [10,15].

Nonlinear compensation is critical to meet the ever-increasing need for bandwidth in optical communications [16–18]. Real-time nonlinear pulse propagation optimization involves comprehensive numerical simulations of the NLSE. This impedes the use of these numerical algorithms to reconstruct the pulse profile in real time. Digital back propagation (DBP) is the most widely used nonlinear compensation technique. It operates by numerically modelling the fiber channel and backpropagating the received signal using inverted fiber parameters. However, this method is computationally intensive and hence requires significant digital signal processing at the receiver. It is incapable of representing the channel adequately due to the presence of random parameters arising from the interaction of amplified spontaneous emission with nonlinear effects [11]. Unlike conventional signal processing approaches, ML techniques process all impairments simultaneously and thereby incorporate random interactions. In addition, training the NN is a one-time expense. Once trained, it operates significantly faster than the split-step Fourier method (SSFM) [16]. Machine learning approaches are advantageous in these situations, as exemplified by [2,12,17–21], and a comprehensive comparison of ML-based architectures for learning the forward and reverse mapping of the NLSE was carried out in [13,22]. Advances in the area of field-programmable gate-arrays (FPGAs) and integrated circuits (ICs) have made it possible to process high bandwidth signals. The authors in [10] exhibited the first FPGA-based nonlinearity compensation (NLC) using K-means clustering. This emphasises the significance of making machine learning technology as computationally efficient as possible, and KD can potentially prove to be an effective strategy to achieve this goal.

While groundbreaking work has been accomplished in the field of KD [6,7,9], little emphasis has been given to regression tasks. The soft teacher output labels (also known as the dark knowledge of the teacher network) provide more information than one-hot encoding, which is a common way of pre-processing categorical features for ML models [1]. By using these labels, the student network can learn from the teacher. In regression problems, the outputs are continuous values without any dark knowledge, which makes it unclear how the teacher could be used to train the student [23]. Very few attempts have been made to implement the student–teacher model for regression problems. Teacher loss has previously been used as an upper bound to train the student [23]. The notion of KD in the teacher–student paradigm, based on the teacher's softened outputs, was first introduced in [8]. While KD training improves accuracy across multiple datasets, it has limitations, including difficulty in optimising very deep networks. The authors in [24] developed a hint-based training technique that utilises the pretrained teacher's hint layer and the student's guided layer to improve the performance of KD training for deeper networks.

To address the aforementioned issues, we propose employing a recently developed KD strategy to quickly train an accurate student network for learning the inverse transfer function of the fiber. We train a shallow network called OptiDistillNet from a teacher network that has learnt the inverse NLSE for a wide range of pulse and fibre parameters. OptiDistillNet is computationally efficient and faster while maintaining the same level of accuracy. Our paper makes the following contributions:

  • To the best of our knowledge, KD has been used for the first time for a nonlinear optics application, whereby we use KD to compress a deep CNN trained to learn the inverse mapping of the nonlinear Schrödinger equation.
  • The performance of a shallow network has been greatly enhanced by using the proposed knowledge distillation approach.
  • The model size is reduced by $91.2\%$ while maintaining the same level of accuracy as the teacher model, thereby making the system faster and less complex.

2. Method

The nonlinear Schrödinger equation (NLSE) is the universal equation for expressing the wave dynamics inside an optical fiber and is given by [16]:

$$i\frac{\partial \psi}{\partial z}+ i\frac{\alpha}{2}\psi+ \frac{\beta_{2}}{2}\frac{\partial^{2} \psi}{\partial t^{2}}-\frac{\beta_{3}}{6}\frac{\partial^{3} \psi}{\partial t^{3}}+\gamma |\psi|^{2} \psi = 0$$
where, $\psi$ denotes the slowly varying amplitude of the pulse envelope, $z$ denotes distance, $\alpha$ denotes attenuation, $t$ denotes the time coordinate in the pulse’s moving reference frame, $\beta _{2}$ denotes group velocity dispersion (GVD), $\beta _{3}$ denotes third order dispersion, and $\gamma$ denotes the nonlinear parameter. Normally, the transmitted pulse envelope $\psi$ is selected to be hyperbolic secant, parabolic, Gaussian, or super Gaussian. However, to demonstrate proof of concept we have concentrated on Gaussian pulses with the envelope defined as follows:
$$\psi(t) = \sqrt{P_{o}}\exp\left[\frac{-(1+iC_{o})}{2}\left(\frac{t}{T_{o}}\right)^{2}\right]$$

The shape of the pulse at the fiber’s output is primarily determined by eight parameters, three of which, initial pulse width $T_{o}$, chirp $C_{o}$, and peak power of the pulse $P_{o}$, capture the input pulse characteristics, and the remaining five parameters, length of the fibre $L$, $\alpha$, $\beta _{2}$, $\beta _{3}$, and $\gamma$, specify the optical fibre communication channel [16].

To enhance the capacity of fibre optical communication systems, it is necessary to quantify the linear and nonlinear impairments that act as bottlenecks for high-speed data transfer [25]. As a result, research is being conducted to undo these effects and retrieve the original transmitted signal. In Section 3, we compare three different NN architectures, namely a fully connected NN, a CNN, and an LSTM, and choose the CNN due to its superior performance. Convolutional NNs are prevalent in the area of computer vision [19,26] and have previously been used in the optical domain to learn the inverse transfer function of an optical fiber [27]. The filters present in CNNs extract useful information by sliding across the data. Because of their narrow receptive field, CNNs are particularly good at modelling neighbourhood dependencies. While fully connected NNs are widely employed in many fields of research, they tend to lose spatial information. When compared to CNNs, which use pooling techniques to lower the number of parameters, fully connected NNs have a higher number of learnable parameters. For the reasons stated above, we have used a CNN-based regressor to model the relationship between the input and output pulses. However, the primary issue with a NN is its unexplained behaviour: when a NN produces a solution, it gives no insight into why or how the mapping has been established. Additionally, NN models need to be retrained when the underlying data changes drastically.

The computing time of convolutional layers grows in proportion to the size of the input. As a result of this constraint, CNNs eventually exceed the power consumption limit, limiting the accuracy and rate at which the system can operate. These restrictions frequently prevent CNNs from being deployed in real time. To address this, we employed a KD-based technique to simplify the model architecture of a CNN trained to learn the NLSE. Knowledge distillation is a two-step process in which we first train deep models on big datasets in order to learn how to perform a complex task. The knowledge gained from this deep model is subsequently transferred to a shallow model with fewer parameters and a faster convergence rate.

2.1 Training data

We trained a CNN-based regressor for the teacher model, with the ultrashort pulses [28] received at the fiber’s output serving as the input and the transmitted pulses serving as target vectors, as shown in Fig. 1. The training dataset includes waveforms corresponding to a wide variety of parameters obtained using NLSE simulations. A total of 256 sample points (both temporal and spectral) from one half of the input and output pulses were taken during training. We employ the temporal and spectral intensity points of a pulse for training since we presume a setting in which these quantities can be measured. To prepare the training dataset for these trials, we use a wide variety of values for pulse width, chirp, second order dispersion, nonlinear effects, fiber length, and transmitted peak power. Ultrashort pulses were generated at a wavelength of 1550 nm using a mode-locked laser [29] and coupled into the highly nonlinear fiber using a lens. We assume an ideal laser and do not consider laser phase noise. One part of the beam is directed to an autocorrelator to produce the temporal intensity profile. The other part of the beam is incident on an optical spectrum analyser (OSA) for frequency-dependent magnitude measurements [22]. The dispersion was varied from 0 to 17 $ps^{2}/km$ and the nonlinear parameter from 10 to 30 $W^{-1}/km$. The three pulse characteristics, pulse width, peak power, and chirp, were varied over 0.5–1 ps, 80–100 W, and −2 to +2, respectively. The tests were also carried out on fibers with lengths of 5 to 15 metres. Using this arrangement, we obtained a total of $7\times10^{5}$ waveforms [22] by randomly varying these parameters in the defined ranges. Because all of these parameters were varied simultaneously, the wave dynamics at the output were highly complex, making the learning problem correspondingly harder. As a result, extensive simulations were done to select the teacher model with the best performance.
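To make the data-generation procedure concrete, the sketch below shows how a single input/output waveform pair of this kind might be produced with a symmetric split-step Fourier simulation of Eq. (1) for the chirped Gaussian pulse of Eq. (2). It is an illustrative example only and not the authors' code: the time window, grid size, step count, FFT sign convention, and omission of loss and third-order dispersion are assumptions, and the exact 256-point sampling of the pulse halves is not reproduced.

import numpy as np

def ssfm_propagate(psi0, t, length, alpha, beta2, beta3, gamma, n_steps=2000):
    # Symmetric split-step Fourier propagation over a fiber of length `length` (m).
    # Units: t in ps, beta2 in ps^2/m, beta3 in ps^3/m, gamma in 1/(W*m), alpha in 1/m.
    dz = length / n_steps
    dt = t[1] - t[0]
    w = 2 * np.pi * np.fft.fftfreq(len(t), d=dt)             # angular frequency grid (rad/ps)
    # Linear (dispersion + loss) operator for half a step; the sign convention is an
    # assumption and may need flipping to match Eq. (1) exactly.
    half_lin = np.exp((1j * beta2 / 2 * w**2 + 1j * beta3 / 6 * w**3 - alpha / 2) * dz / 2)
    psi = psi0.astype(complex)
    for _ in range(n_steps):
        psi = np.fft.ifft(half_lin * np.fft.fft(psi))        # half linear step
        psi = psi * np.exp(1j * gamma * np.abs(psi)**2 * dz) # full nonlinear step
        psi = np.fft.ifft(half_lin * np.fft.fft(psi))        # half linear step
    return psi

rng = np.random.default_rng(0)
t = np.linspace(-10.0, 10.0, 2048)                           # time grid (ps); window size assumed

# Draw one parameter set from the ranges quoted above
T0    = rng.uniform(0.5, 1.0)                                # pulse width (ps)
C0    = rng.uniform(-2.0, 2.0)                               # chirp
P0    = rng.uniform(80.0, 100.0)                             # peak power (W)
L_fib = rng.uniform(5.0, 15.0)                               # fiber length (m)
beta2 = rng.uniform(0.0, 17.0) * 1e-3                        # ps^2/km -> ps^2/m
gamma = rng.uniform(10.0, 30.0) * 1e-3                       # W^-1/km -> W^-1/m

psi_in  = np.sqrt(P0) * np.exp(-(1 + 1j * C0) / 2 * (t / T0) ** 2)   # Eq. (2)
psi_out = ssfm_propagate(psi_in, t, L_fib, alpha=0.0, beta2=beta2, beta3=0.0, gamma=gamma)

# Quantities the network sees: temporal and spectral intensities of the output pulse.
temporal_intensity = np.abs(psi_out) ** 2
spectral_intensity = np.abs(np.fft.fftshift(np.fft.fft(psi_out))) ** 2

Repeating such a draw for many random parameter sets yields a dataset of the same structure as the $7\times10^{5}$-waveform training set described above.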


Fig. 1. Details of network architecture and loss function for teacher (top) and student (bottom) network with 91.23% distillation rate. The values inside rectangular boxes indicate the spatial dimension of the convolutional feature.


The data used to train the student model was also generated using NLSE simulations with parameters in the same range as those used to train the teacher network. The number of training data samples needed to train the student model has been drastically reduced. We also experimented with different sizes of training datasets. The accuracy was calculated by comparing the NN-predicted temporal and spectral characteristics to the NLSE-simulated pulses.

2.2 Student training loss functions

In this section, we formulate knowledge distillation for regression problems. Different ways to blend the teacher loss with the student loss in order to take advantage of the teacher’s knowledge have been described in [6]. The simplest way to leverage the teacher model is to assume that the teacher has a high level of training and testing accuracy. This indicates that the teacher’s prediction is quite accurate with respect to the target values. Our assumption is valid since the teacher is trained on a large dataset and has reached a stage during training where it can perform the task with high accuracy. In this method, an additional loss term is added to the student network’s loss function, as shown in Fig. 1. This term aims to minimise the difference between the student’s and the teacher’s predictions, as shown in Eq. (3). Given the high accuracy of the teacher’s prediction, it makes little difference whether we minimise relative to the ground truth or to the teacher prediction. Additionally, including this term in the loss serves as a regularizer, preventing overfitting. Here, $\alpha$ is a scale factor used to balance the two loss terms and $o_{s}, o_{gt}, o_{t}$ are the student, ground truth and teacher outputs, respectively.

$$L_{OptiDistillNet} = \frac{1}{n} \sum_{i=1}^{n} \alpha \left\|{o_{s} - o_{gt}}\right\|^{2} +(1-\alpha) \left\|{o_{s} - o_{t}}\right\|^{2}$$

The loss function described by Eq. (3) assumes that the teacher’s prediction is very accurate and has excellent generalization capabilities. However, in practice, the teacher model may give erroneous guidance to the student model for some combinations of parameters where the wave dynamics are very complex. As a result, we need to include a mechanism for de-emphasizing the teacher model’s prediction when it is incorrect. We use the empirical error of the teacher’s output with respect to the target output to down-weight the teacher’s prediction when it is unreliable. The new objective function is given by [6]:

$$L_{OptiDistillNet} = \frac{1}{n} \sum_{i=1}^{n} \alpha \left\|{o_{s} - o_{gt}}\right\|^{2} +(1-\alpha) \psi_{i} \left\|{o_{s} - o_{t}}\right\|^{2}$$
$$\psi_{i} = 1 - \frac{ \left\|{o_{t} - o_{gt}}\right\|_{i}^{2}}{\beta}$$
where $\beta$ is the normalization parameter retrieved by taking the difference between the maximum and minimum of the absolute square value of the distance between the teacher’s prediction and the ground truth value [6].
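As a concrete illustration of Eqs. (3)–(5), a minimal PyTorch sketch of the distillation loss is given below. It is not the authors' implementation; in particular, computing the normalization $\beta$ over each mini-batch rather than over the full training set is an assumption, as are the argument names.

import torch

def kd_loss(o_s, o_gt, o_t, alpha=0.5, weight_teacher=True):
    # o_s: student prediction, o_gt: NLSE ground truth, o_t: teacher prediction,
    # all of shape (batch, output_dim). `alpha` balances the two loss terms.
    student_term = ((o_s - o_gt) ** 2).sum(dim=1)            # ||o_s - o_gt||^2 per sample
    teacher_term = ((o_s - o_t.detach()) ** 2).sum(dim=1)    # ||o_s - o_t||^2, teacher kept frozen
    if weight_teacher:
        # Eq. (5): down-weight the teacher term where the teacher itself is inaccurate.
        err  = ((o_t.detach() - o_gt) ** 2).sum(dim=1)
        beta = (err.max() - err.min()).clamp(min=1e-12)      # normalization (assumed per mini-batch)
        teacher_term = (1.0 - err / beta) * teacher_term
    return (alpha * student_term + (1.0 - alpha) * teacher_term).mean()

Calling kd_loss with weight_teacher=False recovers the simple combined loss of Eq. (3), while weight_teacher=True applies the reliability weight $\psi_{i}$ of Eqs. (4)–(5).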

The teacher and student networks consist of 6 and 8 convolutional layers, respectively, with dimensions described in Fig. 1. The teacher’s prediction is assumed to be accurate, and so the student’s prediction is also minimised with respect to the teacher’s prediction. The KD objective function is formed by adding this term to the MSE between the student’s prediction and the target data (as illustrated in Fig. 1). The findings of our study with both loss functions are provided in the following section.

2.3 Network architecture

The CNN model being trained is a multi-task network since it has the ability to reconstruct both the time- and the frequency-domain representations of the transmitted signal. The teacher network architecture is shown in Fig. 1. As shown in the figure, the dimension of the input to the first CNN layer is $256\times32$, where the intensity and spectrum values of one half of the output pulse have been concatenated together to form 256 values and 32 is the number of input channels in the conv 1D layer. Similarly, 64 and 256 are the numbers of input channels in the subsequent CNN layers. Batch normalization and ReLU follow the 1D convolutional layers. For choosing the teacher network, we experimented with different architectures, which are further discussed in Section 3. The initial weights between the nodes are initialised randomly and then updated during the training process. In terms of network and training parameters, we used a batch size of 512 training samples propagated through the network at a time, a total epoch (one complete pass of the training data) count of 5000, and a learning rate of 0.01, which controls how much the model is changed in response to the estimated error. The Adam optimizer was used during training to adjust the weights of the network, as it requires less memory and is efficient [1]. It should be noted that all of the networks were trained on an Intel Xeon Silver 4116 CPU @ 2.10 GHz with an Nvidia 2080 Ti graphics processing unit (GPU) using PyTorch.
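A simplified PyTorch sketch of such a 1D-CNN regressor and one training step is shown below to make the training setup concrete. The channel widths, kernel size, pooling, and output head are illustrative guesses rather than the exact CNN5 or CNN1 architectures of Fig. 1; only the layer pattern (Conv1d, batch normalization, ReLU), the Adam optimizer, the batch size of 512, and the learning rate of 0.01 follow the text.

import torch
import torch.nn as nn

class Conv1DRegressor(nn.Module):
    def __init__(self, channels=(32, 64, 256), in_len=256):
        super().__init__()
        layers, c_in = [], 1
        for c_out in channels:
            layers += [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm1d(c_out),
                       nn.ReLU(inplace=True),
                       nn.MaxPool1d(2)]                      # pooling keeps the output head small
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(c_in * (in_len // 2 ** len(channels)), in_len)

    def forward(self, x):                                    # x: (batch, 256) measured intensities
        h = self.features(x.unsqueeze(1))                    # add channel dim -> (batch, 1, 256)
        return self.head(h.flatten(1))                       # predict 256 transmitted-pulse samples

model = Conv1DRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)    # Adam with learning rate 0.01
criterion = nn.MSELoss()

# one illustrative training step on random stand-in data with batch size 512
x, y = torch.randn(512, 256), torch.randn(512, 256)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()

For student training, the plain MSE criterion above would be replaced by the distillation loss of Eqs. (3)–(5), with the frozen teacher providing $o_{t}$ for each batch.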

3. Results and discussions

We experimented with three different NN architectures for this problem: fully connected NNs, long short-term memory (LSTM) networks, and CNNs. In Table 1, we compare these architectures in terms of mean square error and the number of trainable parameters. The term trainable parameters refers to the weights and biases learnt during the training phase. The size of the training data used for training the three networks is $7\times10^{5}$. The size of the test data is approximately 10% of the training dataset and contains pulses not included in the training phase. Although a fully connected NN with two hidden layers and 100 neurons in each layer performed well, it was not chosen due to its large number of trainable parameters in comparison to OptiDistillNet. Since Gaussian pulses can be treated as sequences, we also tested an LSTM on this problem. The performance of an LSTM with 2 LSTM layers and 100 memory units in each layer was not comparable to that of the CNN and the fully connected NN. We can deduce that CNNs learn the inverse NLSE with the lowest mean square error and thus have been chosen as the optimal architecture for this dataset.


Table 1. Comparison of different NN architectures. The second column describes the architecture of the network in the following order: [Size of input layer, hidden layer neurons, size of output layer]

When the individual models are huge neural networks and the dataset is very large, the amount of computation required during the training is prohibitively large, despite the fact that parallelization is trivial. We began by training the teacher CNN using the standard procedure. In Table 2, the network topologies trained for use as the baseline model are listed. The model’s accuracy improves as the number of output channels grows. This also leads to an increase in the number of trainable parameters. The reason that a deeper network improves performance is that a more complicated non-linear function may be learned. Deep models learn to represent the world as a layered hierarchy of concepts, with each concept defined in relation to simpler concepts and more abstract representations computed in terms of less abstract ones. This gives them a lot of power and flexibility. Based on the results from our simulations, we chose CNN5 as the teacher network since it gives the lowest MSE out of all the other networks.


Table 2. Performance comparison of different CNN models. The second column represents the size of the input channels in each convolution layer.

With the learning technique outlined in Sec. 2.2, CNN5 was utilised to train the student CNNs. We established several student networks by utilising a single teacher network, as shown in Table 3. The architecture of CNN1 and CNN2 in Table 3 is the same as that defined in Table 2. The student networks have a faster convergence rate when the additional loss term from the trained baseline network is used. For the student network, we experimented with two distinct CNN architectures. As the amount of data required for training is a function of the complexity of the problem and the learning technique, we experimented with several data sizes for training. Table 3 contains the results of the student networks. The selected teacher model was used to train student networks with varying architectures: CNN1 and CNN2. The table indicates that when CNN1 is trained using the teacher network, its performance greatly improves and the MSE decreases from $2\times10^{-4}$ to $1.38\times10^{-5}$. Figure 2(a) shows the MSE versus the number of trainable parameters for all the teacher and student networks. CNN5 achieves the lowest MSE among the teacher networks and has the largest number of trainable parameters. Teacher networks CNN1 and CNN2 have considerably different MSE levels; this is a result of their different architectures and CNN1’s inability to learn the mapping with relatively few parameters. The student networks CNN1 and CNN2 achieve the same level of accuracy as CNN5 with far fewer parameters. Figure 2(b) shows the training curves for the teacher network, the student network trained independently, and the proposed KD-based OptiDistillNet. We can infer that OptiDistillNet converges faster than the teacher model CNN5 and achieves an MSE comparable to that of the teacher network. Figure 2 shows that the student networks have as few trainable parameters as CNN1 and CNN2 while achieving the accuracy of the teacher network.


Fig. 2. (a) Comparison of teacher and student networks in terms of the trainable parameters and MSE, and (b) Training curves of the teacher, student trained independently and proposed OptiDistillNet



Table 3. Performance comparison of student networks trained using CNN5 as the teacher

The performance of CNN2 was also enhanced compared to when it was trained independently. As we increase the amount of training data used to train the student network, the MSE further goes down. After comparing the performances of CNN1 and CNN2, we chose CNN1 as the student network since it produces almost identical MSE values while dramatically lowering the number of trainable parameters. We also experimented with two distinct loss functions (Eqs. (3) and (4)). When the second loss function (in Eq. (4)) was employed, the results were better since it reduced the weight of the teacher network’s prediction when it was not reliable. However, there was a very small improvement from $1.40\times10^{-5}$ to $1.38\times10^{-5}$ as seen from Table 3. This could be due to the teacher network’s high prediction accuracy.

Additionally, we also trained the student networks fully independently (as shown in Table 2). The absolute test accuracy for student models is shown in Table 3. Our distillation strategy extracts more meaningful information from the training set than simply training a single model with the hard labels.

The MSE between NLSE simulated pulses and NN predicted pulses was employed as a measure of accuracy in these tests. The results of the MSE and parameter comparison between the baseline, OptiDistillNet and student network are shown in Table 2. The shallow student network was trained independently with the same training data of $7\times10^{5}$ pulses as that of the teacher network but could not learn the complex mapping. However, by introducing soft labels from the teacher model in the form of a loss function, the student model’s performance was enhanced by an order of magnitude. Along with the increased accuracy, the number of training data samples required is significantly reduced. The teacher network CNN5 and OptiDistillNet contain 267265 and 23425 trainable parameters, respectively, which results in OptiDistillNet having $91.2\%$ fewer trainable parameters than the teacher model. This results in a rapid optimization strategy and faster convergence while maintaining the teacher network’s performance.
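As a quick check of the quoted compression figure, the reduction follows directly from the parameter counts: $(267265 - 23425)/267265 \approx 0.912$, i.e. about 91.2%. The short sketch below shows the usual PyTorch idiom for obtaining such trainable-parameter counts together with this arithmetic; it is an illustration, not the authors' script.

def count_trainable(model):
    # total number of weights and biases updated during training (expects a torch.nn.Module)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

reduction = (267265 - 23425) / 267265
print(f"parameter reduction: {reduction:.1%}")   # -> 91.2%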

The trained networks were evaluated for various pulse and fibre parameters, and the teacher- and student-generated input pulses were compared to the NLSE-simulated input pulses to determine their accuracy. The temporal intensity and spectrum plots are shown in Fig. 3. The teacher network and OptiDistillNet exhibit excellent agreement with the NLSE-simulated input pulse. However, the student network trained independently was unable to reliably recover the input pulse. For clarity, we have shown the independent student prediction only on the left half of the pulse and the OptiDistillNet prediction on the right half.


Fig. 3. Comparison between the teacher network, student network (trained independently) and OptiDistillNet for the reconstruction of temporal (a,c) and spectral (b,d) domain curves of the pulses propagating in a nonlinear fiber for different conditions


The results of this study reveal that when the student model is trained using the teacher’s prediction, it performs well. It achieves an MSE that is extremely close to that of the teacher network while greatly simplifying the design. In addition, the amount of training data needed to train the student network was also cut by more than half, from $7\times10^{5}$ to $3\times10^{5}$.

4. Conclusions and future work

In this study, we have proposed a novel knowledge distillation based regressor network, "OptiDistillNet", which learns the universal equation for predicting wave dynamics inside an optical fiber using a pre-trained teacher model. We have shown that the teacher loss can be efficiently used to transfer knowledge to a shallow network. The results from our simulations show that the teacher network can be used in a transfer-learning fashion to train another network that works with a smaller amount of data and a faster convergence rate. The proposed method optimizes the CNN regressor and reduces the number of parameters and the amount of training data by up to 91.2% and 42%, respectively, for approximately the same level of accuracy. Additionally, the network can be taught to incorporate various pulse shapes, higher-order nonlinearity, laser phase noise and Raman gain. This problem can also be extended to include cross-phase modulation and several high-data-rate pulses in order to simulate an optical fiber coherent communication system and to reduce the complexity of the digital signal processing block, thus enabling faster demodulation and reduced power consumption at the receiver.

Funding

Ministry of Electronics and Information technology (RP04156G).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) http://www.deeplearningbook.org.

2. F. N. Khan, Q. Fan, J. Lu, G. Zhou, C. Lu, and P. T. Lau, “Applications of machine learning in optical communications and networks,” in Optical Fiber Communication Conference, (Optical Society of America, 2020), pp. M1G–5.

3. M. Närhi, L. Salmela, J. Toivonen, C. Billet, J. M. Dudley, and G. Genty, “Machine learning analysis of extreme events in optical fibre modulation instability,” Nat. Commun. 9, 1–11 (2018). [CrossRef]  

4. G. Genty, L. Salmela, J. M. Dudley, D. Brunner, A. Kokhanovskiy, S. Kobtsev, and S. K. Turitsyn, “Machine learning and applications in ultrafast photonics,” Nat. Photonics 15(2), 1–11 (2020). [CrossRef]  

5. H. Yang, Z. Niu, S. Xiao, J. Fang, Z. Liu, D. Fainsin, and L. Yi, “Fast and accurate optical fiber channel modeling using generative adversarial network,” J. Lightwave Technol. 39(5), 1322–1333 (2020). [CrossRef]  

6. M. R. U. Saputra, P. P. De Gusmao, Y. Almalioglu, A. Markham, and N. Trigoni, “Distilling knowledge from a deep pose regressor network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019), pp. 263–272.

7. J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization network minimization and transfer learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), pp. 4133–4141.

8. G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531 2 (2015).

9. J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” Int. J. Comput. Vis. 129(6), 1789–1819 (2021). [CrossRef]  

10. E. Giacoumidis, Y. Lin, M. Blott, and L. P. Barry, “Real-time machine learning based fiber-induced nonlinearity compensation in energy-efficient coherent optical networks,” APL Photonics 5(4), 041301 (2020). [CrossRef]  

11. S. Sygletos, A. Redyuk, and O. Sidelnikov, “Nonlinearity compensation techniques using machine learning,” in Signal Processing in Photonic Communications, (Optical Society of America, 2019), pp. SpT2E–2.

12. S. Zhang, F. Yaman, K. Nakamura, T. Inoue, V. Kamalov, L. Jovanovski, V. Vusirikala, E. Mateo, Y. Inada, and T. Wang, “Field and lab experimental demonstration of nonlinear impairment compensation using neural networks,” Nat. Commun. 10, 1–8 (2019). [CrossRef]  

13. S. Boscolo and C. Finot, “Artificial neural networks for nonlinear pulse shaping in optical fibers,” Opt. Laser Technol. 131, 106439 (2020). [CrossRef]  

14. T. Zahavy, A. Dikopoltsev, D. Moss, G. I. Haham, O. Cohen, S. Mannor, and M. Segev, “Deep learning reconstruction of ultrashort pulses,” Optica 5(5), 666–673 (2018). [CrossRef]  

15. F. A. Aoudia and J. Hoydis, “Towards hardware implementation of neural network-based communication algorithms,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), (IEEE, 2019), pp. 1–5.

16. G. P. Agrawal, “Nonlinear fiber optics,” in Nonlinear Science at the Dawn of the 21st Century, (Springer, 2000), pp. 195–211.

17. S. Boscolo, J. M. Dudley, and C. Finot, “Modelling self-similar parabolic pulses in optical fibres with a neural network,” Results Opt. 3, 100066 (2021). [CrossRef]  

18. C. Finot, I. Gukov, K. Hammani, and S. Boscolo, “Nonlinear sculpturing of optical pulses with normally dispersive fiber-based devices,” Opt. Fiber Technol. 45, 306–312 (2018). [CrossRef]  

19. R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, “Convolutional neural networks: an overview and application in radiology,” Insights into imaging 9(4), 611–629 (2018). [CrossRef]  

20. N. Gautam, A. Choudhary, and B. Lall, “Neural networks for modelling nonlinear pulse propagation,” in Applications of Machine Learning 2021, vol. 11843 (International Society for Optics and Photonics, 2021), p. 118430Q.

21. A. Kokhanovskiy, A. Bednyakova, E. Kuprikov, A. Ivanenko, M. Dyatlov, D. Lotkov, S. Kobtsev, and S. Turitsyn, “Machine learning-based pulse characterization in figure-eight mode-locked lasers,” Opt. Lett. 44(13), 3410–3413 (2019). [CrossRef]  

22. N. Gautam, A. Choudhary, and B. Lall, “Comparative study of neural network architectures for modelling nonlinear optical pulse propagation,” Opt. Fiber Technol. 64, 102540 (2021). [CrossRef]  

23. G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” Adv. neural information processing systems 30 (2017).

24. A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550 (2014).

25. E. Giacoumidis, Y. Lin, J. Wei, I. Aldaya, A. Tsokanos, and L. P. Barry, “Harnessing machine learning for fiber-induced nonlinearity mitigation in long-haul coherent optical ofdm,” Futur. internet 11(1), 2 (2019). [CrossRef]  

26. V. Kaushik and B. Lall, “Undispnet: Unsupervised learning for multi-stage monocular depth prediction,” in 2019 International Conference on 3D Vision (3DV), (IEEE, 2019), pp. 633–642.

27. J. Noble, C. Zhou, W. Murray, and Z. Liu, “Convolutional neural network reconstruction of ultrashort optical pulses,” in Ultrafast Nonlinear Imaging and Spectroscopy VIII, vol. 11497 (International Society for Optics and Photonics, 2020), p. 114970L.

28. D. P. Shepherd, A. Choudhary, A. A. Lagatsky, P. Kannan, S. J. Beecher, R. W. Eason, J. I. Mackenzie, X. Feng, W. Sibbett, and C. T. A. Brown, “Ultrafast high-repetition-rate waveguide lasers,” IEEE J. Sel. Top. Quantum Electron. 22(2), 16–24 (2015). [CrossRef]  

29. A. Choudhary, A. Lagatsky, Z. Zhang, K. Zhou, Q. Wang, R. Hogg, K. Pradeesh, E. Rafailov, W. Sibbett, and C. Brown, “A diode-pumped 1.5 µm waveguide laser mode-locked at 6.8 ghz by a quantum dot sesam,” Laser Phys. Lett. 10(10), 105803 (2013). [CrossRef]  
