Artificial neural networks for photonic applications—from algorithms to implementation: tutorial

Pedro Freire; Egor Manuylovich; Jaroslaw E. Prilepsky; Sergei K. Turitsyn

doi:10.1364/AOP.484119

1. Introduction

Machine learning has a tremendous number of definitions, which often reflect the specific interests of the researchers who formulate them. Here, we use the definition of machine learning as a set of algorithms that “$\cdots$ allows computer programs to automatically improve through experience and that automatically infer some general laws from specific data,” taken from Tom Mitchell’s classical monograph [1]. In this tutorial–review, we discuss blending machine learning with various photonics technologies and applications. The mixture of these two complementary disciplines enables the development of new scientific and engineering techniques that benefit both from the speed and parallelism inherent to optical systems and the ability of machine learning to infer from data and automatically improve system performance. Nonlinear photonics often features complex light dynamics and deals with systems that cannot be easily comprehended or controlled. Therefore, another attractive feature of machine learning in photonics applications is its capability to deal with complex nonlinear systems, whilst staying flexible and re-adaptable. In addition, photonic devices and systems operating at high speed can quickly generate a vast amount of data. This makes them well-suited to the application of various data-based machine-learning algorithms whose performance improves as the amount of available data grows. Therefore, photonics and machine learning look like a perfect fit for each other, and their combination can naturally bring forth new ideas, theories, and devices, as well as novel concepts for understanding and describing light-related phenomena.

Artificial neural networks, which we will henceforth simply call neural networks (NNs), are computational machine-learning frameworks that attempt to mimic some brain operations. The attractive features of biological NNs, which we would like to keep when using their artificial analogs, are: robustness and fault tolerance; flexibility and ease of re-adaptation to changing conditions; ability to deal with a variety of data, meaning that the network can make do with information that is fuzzy, probabilistic, noisy, and even inconsistent; and collective computation, i.e., the network can process the data in parallel and in a distributed manner [2]. Whilst NNs are frequently associated with supervised learning thanks to numerous widely known successful examples in, e.g., image recognition [3,4], they are also applicable to unsupervised learning [5], semi-supervised learning [6,7], and reinforcement learning (RL) [8–11], to mention the most noticeable directions. Of course, in this tutorial–review, we cannot address each specific item from the list above. Instead, we will focus on some particular examples of using NNs in photonics, trying to explain why the particular combination of a machine-learning method with a photonics application has turned out to be successful.

Here, it is pertinent to note that ultrafast photonic applications can bring about conditions and requirements (in terms of accuracy, speed, and complexity) that differ from those in more “traditional” use cases of NNs. For example, in optical communications, the typical bit error rate (BER; the probability of error occurrence in the dataset, speaking in “machine-learning” language) before forward error correction is of the order $10^{-2}$, which is, for instance, much lower than we have in typical image recognition tasks [12]. Therefore, the solutions developed in deep learning applications for image recognition and language processing often require adaptation and/or substantial modifications when we deal with, e.g., an equalization task in optical communications. We specifically notice that the real-time operation of NNs in ultrafast photonics inevitably sets a limit on the acceptable level of a NN’s complexity and processing latency (inference). Thus, in this review, we pay special attention to NNs with reduced complexity; this, in turn, translates into a reduction of the energy consumed for signal processing, a sought-after feature in almost every application nowadays.

There are numerous recently emerged and still developing areas at the interface of machine learning and photonics: general neuromorphic photonics, unconventional optical computing, photonic NNs, optical communications, imaging, and sensing, to mention a few important examples where the cross-fertilization of the fields has already proven to be fruitful. Typically, the NNs’ application in photonics is related to the processing of large data sets, which is the case in optical communications, ultrafast photonics, optical imaging and sensing, lasers, optical metrology, design of new photonic materials, and so on. However, we would like to stress that this tutorial–review is not aimed to be a comprehensive overview of all applications of NNs (or, in more general terms, of the machine-learning methods) in photonics, as this goal would be too large and general to fit into any review paper or even in a monograph. More information, details, methods, and examples of merging the photonics and artificial intelligence solutions can be found in other recent review papers covering different aspects of the subject and presenting various viewpoints [13–29]. How then is this tutorial–review different from numerous other review papers in the field? In this paper, we aim to improve some photonic techniques and technologies by using NNs for signal or data processing, providing analysis of the complexity and hardware implementation. We do not provide a comprehensive survey of optical reservoir computing or photonic NNs, which form a huge, rapidly expanding, and utterly fascinating area; we refer the reader to recent works and reviews on the subject [29–45], including critical opinions [46]. In particular, a good exposition of the known and potential benefits of using neuromorphic devices in place of their “von Neumann” counterparts, including estimates of energy consumption, is given in [47].

Now, we emphasize that signal-processing (inference) speed and energy efficiency are the two factors that quite often (virtually always) emerge when we talk about the practical implementation of a particular model or method. Both of these factors relate to the complexity of the NNs. Therefore, the tutorial part of our work is focused on a rather specific challenge: how to pick a correct NN structure fitting the task in hand and how to manipulate (typically reduce) the complexity of the NN to make them practically implementable and power-/cost-efficient, while not losing much in the efficiency/functionality of the initially developed unrestricted (typically complex) NN solutions. In the following, we try to follow the whole path, from the NN algorithms explanations and development stage down to notes on the existing approaches for the hardware implementation of NNs. Thus, the tutorial part systematically describes the tools that can be used when we already have some NN model performing the desired task “well enough,” and when the next step refers to how to match the model with the constraints (imposed by, say, the limited available resources) for the practical implementation. Evidently, there is some trade-off between performance and complexity. The performance of the compressed and/or quantized models, in general, degrades, and the important goal of complexity optimization is to identify the acceptable balance between complexity reduction and performance degradation. We hope that this tutorial–review can provide the necessary assistance along the “thorny way” of modifying your model toward a much simpler, but still efficient structure that would not flabbergast hardware designers.

The plan of our tutorial–review is as follows. First, in Section 2, we briefly remind the optical community of the basics of NNs and discuss their key photonic applications, trying to stay as close to the layman level as possible while providing a viewpoint from a traditional digital signal-processing (DSP) perspective. In Section 3, we describe how to select the NN architecture appropriate for the task at hand. In the review part, Section 4, we present an overview of different applications of NNs in various branches of photonics, discuss the open problems and challenges, and outline future research directions in the field. In the tutorial part, Section 5, we describe versatile directions and methods that can be used to reduce the complexity of NNs (i.e., the model compression techniques) in photonics applications, also paying attention to the different metrics that we can employ to quantify the complexity. We also supplement our work with a considerable number of references that can add more particular details to the questions considered.

2. Basics of Artificial Neural Networks for Photonics Community

In an artificial NN, several neurons are connected together in a nonlinear manner. The network learns by adjusting the weights and biases according to the feedback (typically provided by the so-called back-propagation technique) based on the evaluation of the accuracy of the NN’s prediction, which is quantified by a cost (loss) function. The number of neurons in the input layer corresponds to the number of input features, whereas the number of output neurons is linked to the number of classes of interest for classification; or it can be just a single neuron when we deal with a single-output regression. In deep NN structures, the layers between the input and output layers are referred to as hidden layers. The number of neurons per layer is arbitrary, and the choice of the NN’s hyperparameters (the number of neurons in the hidden layers and the number of hidden layers) requires the designer’s expertise in adjusting the NN structure to the task at hand. The choice of hyperparameters also depends on the complexity of the system to be modeled, as these parameters ultimately define the representation capability of a NN. For convenience of presentation, in this section, we briefly revisit some basic types and features of artificial NNs that are discussed throughout the paper.

However, we note that in spite of the (deceptive) simplicity of the short description of NNs given above, there are a plethora of unresolved puzzles and problems in the theory of deep NN, which typically relate to the three fundamental deep NN challenges: expressibility, optimizability, and generalizability. At the moment, we do not seem to have a good universal theory that would give us persuasive answers to all the problems itemized above, while the works shedding light on some of the NNs’ properties, features, and peculiarities emerge continuously.

2.1 Dense Layer

We start from the basic feed-forward NN, the so-called multi-layer perceptron (MLP). The simplest variant of the perceptron idea was first developed in 1943 by McCulloch and Pitts [48], but this concept attracted substantial attention from the scientific community only after Frank Rosenblatt implemented it as a machine built in 1958 [49]. While Rosenblatt used just a single layer of neurons for binary prediction, nowadays, the perceptron’s original idea has been largely generalized, such that it evolved into a (deep) feed-forward densely connected multi-layer structure that we call the MLP. A dense layer, also known as a fully connected layer, is a layer in which each neuron (labeled as $i$) is connected with all the neurons (labeled as $j$) from the previous layer with a specific weight $w_{{i}{j}}$. The input vector is mapped to the output vector in a nonlinear manner by the dense layer, due to the presence of a nonlinear activation function. Dense layers can be combined to form an MLP, which is a class of feed-forward deep NNs. Figure 1 illustrates the operation of a single neuron in such a dense layer.

Figure 1. Schematics of a McCulloch–Pitts neuron.

The output vector $y$ of a dense layer, given $x$ as an input vector, is written as

(1)$$y =\phi (Wx+ b ),$$

where $y$ is the output vector, $\phi$ is a nonlinear activation function, $W$ is the weight matrix, and $b$ is the bias vector.
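As a minimal sketch of Eq. (1), the dense layer can be written in a few lines of NumPy. The layer sizes, the random weights, and the choice of $\tanh$ for $\phi$ are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def dense_layer(x, W, b, phi=np.tanh):
    """Eq. (1): y = phi(W x + b). phi = tanh is an assumed example activation."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2          # hypothetical layer sizes

W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)

x = rng.normal(size=n_in)                # input vector
y = dense_layer(x, W1, b1)               # output of one dense layer
y_mlp = dense_layer(y, W2, b2)           # stacking two dense layers gives a small MLP
```

Each output element is one neuron: a weighted sum of all inputs (one row of $W$), shifted by its bias and passed through the activation.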

Now, let us turn to the hardware implementation aspect of this most prolific NN structure, where we first mention the electronic implementation. The traditional matrix multiplier-and-accumulator (MAC) is used for the implementation of such layers in the digital domain [50]. More recently, the electrical analog implementation of a dense layer was demonstrated using a CMOS with transistors and resistors [51,52], or using an operational transconductance amplifier [53]. As a drawback, the analog NNs’ implementation typically renders a lower accuracy and is more sensitive to noise compared with their digital counterparts [54].

Now, we mention that there are two different elements of the NN processing that are addressed in the photonic feed-forward NN implementation: the matrix–vector multiplications (MVMs) and the activation function. First, we address the differences in the activation function. The first widely adopted approach for the activation of photonic NNs, which can be called a “fully analog” implementation, entails utilizing silicon photonic meshes comprising the networks of Mach–Zehnder interferometers (MZIs) and programmable phase shifters (electro-optic activations). However, lately, a novel approach for the activations coined “hybrid” photonic programmable NNs has emerged, demonstrating remarkable features in terms of low latency and energy efficiency for inference. These hybrid photonic NNs combine programmable photonic linear optical elements, such as meshes, with digital nonlinear activation functions [40,56,57]. In comparison with existing fully analog photonic NNs that employ electro-optic nonlinear activation functions, hybrid designs can overcome the significant challenge of photonic loss and provide improved flexibility in performing logical operations between layers as compared with the fully analog counterparts. Quite importantly, the hybrid design has been shown to be able to learn online [40,58], which gives immense opportunities for the prompt reconfiguration of photonic NNs and, so, for their usage for real-life problems, see also the explanatory note [59].

Let us consider probably the most resource-consuming NN part: MVM. There are three main ways to implement the MVM in the optical domain. The first kind of optical MVM [plane light conversion (PLC)] is based on the diffraction of light in free space. Figure 2(a) shows a typical MVM configuration. First, the incident vector $X$, distributed along the $x$ direction, is expanded and replicated along the $y$ direction through a cylindrical lens or other optical elements. Then, the spatial diffraction plane is used to adjust each element independently, and its transmission matrix is $W$. Finally, the $x$-direction beams are combined and summed similarly, and the final output vector $Y$ along the $y$-direction is the product of the matrix $W$ and the vector $X$, that is, $Y =WX$. The second MVM exploits an MZI network. Figure 2(b) depicts the configuration diagram, which is based mainly on rotation submatrix decomposition and singular value decomposition. The calibration of the transmission matrix is more difficult since every matrix element is affected by multiple dependent parameters. For a simple $2\times 2$ MZI multiplier, considering the inputs $x_1$ and $x_2$, the matrix multiplication with a $2\times 2$ weight $W$ results in an output that follows the formula [60]:

(2)$$Y = U^2(\theta, \alpha,\beta) X,$$

(3)$$W = U^2(\theta, \alpha,\beta) = \begin{bmatrix} e^{{-}j\alpha}(e^{{-}j\theta}-1) & je^{{-}j\alpha}(1+e^{{-}j\theta}) \\ je^{{-}j\beta}(e^{{-}j\theta}+1) & e^{{-}j\beta}(1-e^{{-}j\theta}) \end{bmatrix},$$

where, to set the weight values to the desired ones, the phase shifts $\theta$, $\alpha$, and $\beta$ need to be properly adjusted.
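The following sketch evaluates the matrix of Eq. (3) numerically for an arbitrary choice of phases. The phase values and the input vector are illustrative assumptions. One observation that follows directly from the printed matrix elements (it is not stated in the text): the rows are orthogonal and each has norm 2, so the matrix is unitary only up to a factor of 1/2.

```python
import numpy as np

def mzi_matrix(theta, alpha, beta):
    """2x2 MZI transfer matrix with the elements exactly as printed in Eq. (3)."""
    e_t = np.exp(-1j * theta)
    e_a = np.exp(-1j * alpha)
    e_b = np.exp(-1j * beta)
    return np.array([
        [e_a * (e_t - 1),      1j * e_a * (1 + e_t)],
        [1j * e_b * (e_t + 1), e_b * (1 - e_t)],
    ])

# Hypothetical phase settings and input amplitudes
U = mzi_matrix(theta=0.7, alpha=0.2, beta=-1.1)
X = np.array([1.0 + 0.0j, 0.5 - 0.5j])
Y = U @ X                                  # Eq. (2)

gram = U @ U.conj().T                      # equals 4 * identity for any phases
U_cross = mzi_matrix(0.0, 0.0, 0.0)        # theta = 0: all light crosses over
U_bar = mzi_matrix(np.pi, 0.0, 0.0)        # theta = pi: light stays in its arm
```

Sweeping $\theta$ between $0$ and $\pi$ moves the device continuously between the "cross" and "bar" states, which is how a single MZI sets a tunable splitting ratio.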

Figure 2. Optical implementations of vector–matrix multipliers. (b),(c) Adapted from [55].

Figure 3. Schematics of 1D and 2D convolutional filters. A series of arrays of such filters constitute a convolutional NN.

The third MVM is an incoherent matrix computation method based on wavelength division multiplexing (WDM) technology. Figure 2(c) shows a typical diagram based on microring resonators (MRRs). The input vector of $X$ is loaded onto beams with different wavelengths, which pass through the microrings with one-on-one adjustment of the transmission coefficients of $W$. Then, the total output power vector is given by $Y=WX$.

In [61], an optical neural chip was designed in which matrix multiplications were performed using the MZI network, and a simple nonlinear activation function was based on intensity detection $f(x) =||x||$. A good survey and comparison of the different MVM realizations and photonic chip architectures are given in recent reviews [62,63].

2.2 Convolutional Neural Networks

In a convolutional NN (CNN), we apply the convolutions with different filters to extract the features and convert them into a lower-dimensional feature set, as can be seen in Fig. 3. The CNNs can be used in one-dimensional (1D), two-dimensional (2D), or three-dimensional (3D) network arrangements depending on the applications. Here we focus on 1D-CNNs, which are applicable to, e.g., processing sequential data [64]. The 1D-CNN processing with padding equal to 0, dilation equal to 1, and stride equal to 1, can be summarized as the following transformation:

(4)$$y^{f}_{i} = \phi \left(\sum_{n=1}^{n_i}\sum_{j=1}^{n_k}x^{in}_{i+j-1,n} \cdot k^{f}_{j,n} + b^{f} \right),$$

where $y^{f}_{i}$ denotes the output, known as a feature map, of a convolutional layer built by the filter $f$ at the $i$th input element, $n_k$ is the kernel size, $n_i$ is the size of the input vector (the number of input features per time step, indexed by $n$), $x^{in}$ represents the raw input data, $k^{f}_{j,n}$ denotes the $j$th trainable convolution kernel element of the filter $f$ for the input feature $n$, and $b^{f}$ is the bias of the filter $f$.
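A direct, loop-based transcription of Eq. (4) is sketched below (zero padding, unit stride and dilation). The array layout, the sizes, and the use of $\tanh$ as $\phi$ are assumptions made for the example; indices are 0-based in the code.

```python
import numpy as np

def conv1d_layer(x, k, b, phi=np.tanh):
    """Eq. (4) with padding 0, dilation 1, stride 1.

    x: (n_s, n_i) input sequence, n_s time steps with n_i features each
    k: (n_f, n_k, n_i) kernels, one per filter f
    b: (n_f,) biases
    returns: (n_s - n_k + 1, n_f) feature maps
    """
    n_s, n_i = x.shape
    n_f, n_k, _ = k.shape
    n_out = n_s - n_k + 1
    y = np.empty((n_out, n_f))
    for f in range(n_f):
        for i in range(n_out):
            window = x[i:i + n_k, :]               # x_{i+j-1, n} for j = 1..n_k
            y[i, f] = phi(np.sum(window * k[f]) + b[f])
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))       # hypothetical: 10 time steps, 2 features
k = rng.normal(size=(3, 4, 2))     # hypothetical: 3 filters, kernel size 4
b = rng.normal(size=3)
y = conv1d_layer(x, k, b)
```

The double sum in Eq. (4) is the `np.sum(window * k[f])` term: a sliding dot product over both the kernel positions and the input features.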

In the general case, the additional parameters, such as padding, dilation, and stride, also affect the output size of the CNN. The padding adds information (often zeros) to the empty points around the edges of an input signal so that its size stays the same after the convolution operation. The dilation and stride affect how the kernel operation will behave in the convolution. The dilation “inflates” the kernel by adding holes between the kernel elements, and the stride controls how the filter convolves the input signal by setting the number of shifting units at a time that the kernel will do in the convolution. The generalized output shape of the 1D-CNN can be formalized as

(5)$$Output Size = \left \lfloor \frac{n_s +2 \, padding - dilation (n_k-1) - 1 }{stride} + 1 \right\rfloor ,$$

where $\left \lfloor \cdots \right \rfloor$ is the floor operation (rounding down to the nearest integer), $n_s$ is the input time sequence size and $n_k$ is, again, the respective kernel size.
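The output-size rule of Eq. (5) can be checked with a short helper. This sketch rounds down (floor), which is the convention assumed here because a kernel position that does not fit entirely inside the padded input produces no output; the function name and the test sizes are hypothetical.

```python
def conv1d_output_size(n_s, n_k, padding=0, dilation=1, stride=1):
    """Eq. (5), evaluated with integer floor division (assumed rounding convention)."""
    return (n_s + 2 * padding - dilation * (n_k - 1) - 1) // stride + 1

size_plain = conv1d_output_size(10, 3)                 # no padding: n_s - n_k + 1
size_same = conv1d_output_size(10, 3, padding=1)       # padding keeps the length
size_strided = conv1d_output_size(10, 3, stride=2)     # stride subsamples the output
size_dilated = conv1d_output_size(10, 3, dilation=2)   # dilation widens the kernel span
```

With a dilation of 2, the three-tap kernel spans five input samples, which is why the output shrinks as if the kernel size were 5.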

To understand the relation of the CNN to the ordinary DSP filtering, recall that the output of the 1D finite impulse response (FIR) filter can be presented as follows (see, e.g., [65, p. 58]):

(6)$$y^{\text{FIR}}_i = \sum_{m=0}^{n_{\text{FIR}}-1} x_{i-m} \cdot \kappa_m,$$

where $\kappa _m$ is the set of coefficients (the impulse response) that generate the required filter response (e.g., low-pass, high-pass, and baseband); $n_{\text {FIR}}$ is the number of filter taps, i.e., the FIR filter order plus one. Comparing (6) with (4) for a single input feature, we can put $m=n_k-j$, set $n_{\text {FIR}} = n_k$ and $\kappa _{n_k-j} = k_j$ (so that the kernel is the time-reversed impulse response), and shift the output index by $n_k-1$, to obtain $y^{\text{FIR}}_{i+n_k-1} = \sum _{j=1}^{n_k} x_{i+j-1} \cdot k_j$. We can readily see that the action of a CNN layer before the activation is tantamount to summing the outputs of several FIR filters (one per input feature), and the whole CNN layer adds the nonlinearity to this sum of FIR filters via the activation function; if, otherwise, $\phi$ in Eq. (4) is a linear function, $\phi (x)=x$, the CNN layer reduces to a plain sum of FIR filters plus a bias.
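The CNN/FIR correspondence can be verified numerically for a single input feature. The sketch below uses one consistent way of aligning the indices, which is an assumption of this example: the FIR taps are taken as the time-reversed CNN kernel, and only the fully overlapping ("valid") outputs are compared. The signal and kernel values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=12)          # hypothetical input sequence
k = rng.normal(size=4)           # hypothetical CNN kernel, n_k = 4
n_k = k.size

# CNN-style sliding sum before activation: y_i = sum_j x_{i+j-1} k_j
y_cnn = np.array([np.dot(x[i:i + n_k], k) for i in range(x.size - n_k + 1)])

# FIR filter of Eq. (6): y_i = sum_m x_{i-m} kappa_m, with kappa the reversed kernel
kappa = k[::-1]
y_fir = np.convolve(x, kappa, mode="valid")
```

Both arrays hold the same numbers; the only difference between the two views is the direction in which the coefficients are indexed and a constant delay of $n_k - 1$ samples.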

Two-dimensional convolutional filters used in the vast majority of modern image-processing CNNs are similar to their 1D counterparts (see Fig. 3), but with the increased dimensionality the additional summation over the second axis is added:

(7)$$y^{f}_{i,j} = \phi \left(\sum_{m=-\left\lfloor n_m/2\right\rfloor}^{\left\lfloor n_m/2 \right\rfloor}\sum_{l=-\left\lfloor n_l/2\right\rfloor}^{\left\lfloor n_l/2 \right\rfloor}x_{i+m,j+l} \cdot K^{f}_{m,l} + b^{f} \right).$$

These 2D convolutions are very similar to the optical concept of a point spread function (PSF), which describes the response of a focused optical imaging system to a point source or point object. In free-space optical (FSO) systems, the image behind a scattering medium can be described as a convolution of the original image with a PSF [66]:

(8)$$I_{out}(x,y) = I_{in}(x,y)\ast PSF(x,y),$$

where $\ast$ denotes a 2D convolution. Thus, a 2D convolution can be implemented in free-space optics with a diffraction mask in the Fourier plane of a $4f$ imaging system, as shown in Fig. 4.

Figure 4. Optical 2D convolution using scattering matrix in a Fourier plane of a $4f$ imaging system.
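The reason a mask in the Fourier plane performs a convolution is the convolution theorem: a product in the Fourier domain corresponds to a convolution in the image domain. The sketch below is a discrete, idealized analogue of Eq. (8), assuming periodic boundaries, real-valued intensities, and no noise or aperture effects; the image and PSF are random placeholders.

```python
import numpy as np

def conv2d_fourier(image, psf):
    """Circular 2D convolution via a product in the Fourier plane."""
    mask = np.fft.fft2(psf, s=image.shape)      # transfer function of the mask
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))

def conv2d_direct(image, psf):
    """Reference: the same circular convolution as an explicit shifted sum."""
    out = np.zeros(image.shape, dtype=float)
    for m in range(psf.shape[0]):
        for l in range(psf.shape[1]):
            # np.roll shifts the image so that out[i, j] picks up image[i-m, j-l]
            out += psf[m, l] * np.roll(image, shift=(m, l), axis=(0, 1))
    return out

rng = np.random.default_rng(2)
I_in = rng.random((16, 16))      # hypothetical input intensity
PSF = rng.random((3, 3))         # hypothetical point spread function

I_out_fourier = conv2d_fourier(I_in, PSF)
I_out_direct = conv2d_direct(I_in, PSF)
```

In the optical system the two Fourier transforms are performed by the lenses, so the only element that has to be programmed is the mask.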

2.3 Vanilla Recurrent Neural Networks

A vanilla recurrent neural network (RNN) is different from MLP and CNN in terms of its ability to handle memory, which is quite beneficial for time series analysis and prediction. Here, we note that the feed-forward models (e.g., those described previously) can be reckoned, according to Elman [67], as an “$\cdots$ attempt to ‘parallelize time’ by giving it a spatial representation $\cdots$ However, there are problems with this approach, and it is ultimately not a good solution. A better approach would be to represent time implicitly rather than explicitly.” The recurrent structures described in the following subsections implement such an implicit representation (Fig. 5): RNNs take into account the current input and the output that the network has learned from the prior input. The propagation step for the vanilla RNN at the time step $t$ can be described as follows:

(9)$$h_{t} = \phi(W{x}_{t} + Uh_{t-1} + b),$$

where $\phi$ is again the nonlinear activation function, $x_{t}\in \mathbb {R}^{n_i}$ is the $n_i$-dimensional input vector at time $t$, $h_{t} \in \mathbb {R}^{n_h}$ is a hidden layer vector of the current state with size $n_h$, $W \in \mathbb {R}^{n_h \times n_i}$ and $U\in \mathbb {R}^{n_h \times n_h}$ represent the trainable weight matrices, and $b$ is the bias vector. For more explanations on the vanilla RNN operation, see, e.g., [68]. Even though the RNNs were tailored for efficient memory handling, they still suffer from the inability to capture the long-term dependencies because of the infamous vanishing gradient issue [69].
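A minimal sketch of the recurrence in Eq. (9), unrolled over a short sequence, is given below. The dimensions, the random weights, and the $\tanh$ activation are illustrative assumptions; only the forward pass is shown, not training.

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, h0=None, phi=np.tanh):
    """Apply Eq. (9), h_t = phi(W x_t + U h_{t-1} + b), for every time step."""
    h = np.zeros(U.shape[0]) if h0 is None else h0
    states = []
    for x_t in x_seq:                    # the same W, U, b are reused at every step
        h = phi(W @ x_t + U @ h + b)
        states.append(h)
    return np.stack(states)              # shape (T, n_h)

rng = np.random.default_rng(3)
n_i, n_h, T = 2, 3, 6                    # hypothetical sizes
W = rng.normal(size=(n_h, n_i))
U = rng.normal(size=(n_h, n_h))
b = rng.normal(size=n_h)
x_seq = rng.normal(size=(T, n_i))

h_all = rnn_forward(x_seq, W, U, b)
```

The hidden state `h` is the only channel through which earlier inputs influence later outputs, which is the memory effect discussed in the text.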

Figure 5. Schematics of a RNN. Hidden layer neurons with closed-loop connections underlie the memory effect.

In addition to the mathematical description of such a layer, when designing sequence modeling algorithms (i.e., the algorithms involving recurrent layers), it is crucial to consider whether the training architecture is stateless or stateful [70–72]. Figure 6 schematically illustrates how both architecture types work. The primary difference between these two architectures is how the first state ($h_0$) of the model (corresponding to each batch) is initialized as the training advances from one batch to the following one. Considering that the input shape of these sequential data is $input=[batch_{size}, time_{length}=M, input_{features}]$, in both architectures, for each batch of data, we utilize $M$ recurrent cells in forward propagation. However, in the stateless architecture, every batch initializes the first state as $h_0 = 0$. This causes the model to forget the states learned from the prior batch. This design is utilized when the independent and identically distributed (i.i.d.) assumption for the data distribution is true [73]. This means that when building the training batches, there is no interdependence between the batches, and each batch is independent. This is not to be confused with the parameters/weights, which are carried over through the entire training process; updating them is the goal of training.

Figure 6. Stateful and stateless RNN training architectures.

Nevertheless, not all sequential data, such as time series, contain i.i.d. samples; hence, it is not reasonable to always presume that the divided batches are completely independent. Therefore, it is natural to propagate the learned states across successive batches in such a way that the model not only reflects the temporal dependence inside each sample sequence but also does this across batches. The second type, called stateful architecture, is the solution to this problem. In the stateful architecture, the cell and the hidden states of the recurrent cell for each batch are initialized using the states learned from the previous batch, allowing the model to learn the dependence between batches for each sample in the batch. As indicated in Fig. 6, the states are reset to zero only at the start of each epoch.

Here, we highlight that the default implementation of recurrent cells in the most popular machine-learning libraries (TensorFlow and PyTorch) uses the stateless setup. To switch to a stateful architecture, in addition to connecting the hidden states across batches, it must be ensured that the batches are not shuffled (shuffling is the default step in the stateless case, but in the stateful case it would break the learning process) [74].
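The difference between the two setups can be illustrated with a forward pass only, written in plain NumPy so that no library-specific API is assumed. A long sequence is split into two consecutive chunks of length $M$: the stateless run restarts from $h_0 = 0$ on the second chunk, whereas the stateful run carries the final state of the first chunk over. All sizes and weights are hypothetical.

```python
import numpy as np

def run_rnn(x_seq, W, U, b, h0):
    """Vanilla RNN forward pass, Eq. (9) with tanh; returns the last hidden state."""
    h = h0
    for x_t in x_seq:
        h = np.tanh(W @ x_t + U @ h + b)
    return h

rng = np.random.default_rng(4)
n_i, n_h, M = 2, 3, 4
W = rng.normal(size=(n_h, n_i))
U = 0.9 * rng.normal(size=(n_h, n_h))
b = rng.normal(size=n_h)

x_long = rng.normal(size=(2 * M, n_i))        # one long, ordered sequence
chunk_1, chunk_2 = x_long[:M], x_long[M:]     # two consecutive, unshuffled chunks
h_zero = np.zeros(n_h)

# Stateless: the second chunk starts from h_0 = 0 and ignores chunk_1
h_stateless = run_rnn(chunk_2, W, U, b, h_zero)

# Stateful: the second chunk starts from the last state of the first chunk
h_after_1 = run_rnn(chunk_1, W, U, b, h_zero)
h_stateful = run_rnn(chunk_2, W, U, b, h_after_1)

# Reference: processing the whole sequence in one go
h_full = run_rnn(x_long, W, U, b, h_zero)
```

The stateful result reproduces the full-sequence state, which only works because the chunks are processed in their original order; this is why shuffling must be disabled in that setup.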

In the stateless configuration, the linear part that describes the vanilla RNN resembles the well-known infinite impulse response (IIR) filter. The IIR filter is characterized by its theoretically infinite impulse response as [75,76]:

(10)$$y(t) = \sum_{k=0}^{\infty}r(k)x(t-k),$$

where $r(k)$ is the linear time-invariant filter’s impulse response.

Practically speaking, it is not possible to compute the output of the IIR using this equation. Therefore, the equation may be rewritten in terms of a finite number of poles $p$ and zeros $q$ of the IIR filter, as defined by the linear constant coefficient difference equation:

(11)$$y(t) = \sum_{k=0}^{q}a_k x(t-k) + \sum_{k=1}^{p}b_k y(t-k),$$

where $a_k$ and $b_k$ are the filter’s numerator and denominator polynomial coefficients, the roots of which are equal to the filter’s zeros and poles, respectively. In this sense, if $p=q=1$ (with $a_1=0$ and a zero bias) and if we consider that the number of hidden units in the recurrent cell is equal to 1, Eq. (11) and the linear part of (9) become the same.
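The equivalence can be checked numerically in the simplest case. The sketch below implements the difference equation (11) and compares it with a one-unit RNN whose activation is the identity and whose bias is zero; it assumes a single feed-forward tap $a_0$ (no $a_1 x(t-1)$ term) and one feedback tap $b_1$, with arbitrary example values.

```python
import numpy as np

def iir_filter(x, a, b):
    """Eq. (11): y(t) = sum_{k=0}^{q} a[k] x(t-k) + sum_{k=1}^{p} b[k] y(t-k).

    b[0] is a placeholder and is never used; samples before t = 0 are taken as zero.
    """
    y = np.zeros(len(x))
    for t in range(len(x)):
        feed_forward = sum(a[k] * x[t - k] for k in range(len(a)) if t - k >= 0)
        feedback = sum(b[k] * y[t - k] for k in range(1, len(b)) if t - k >= 0)
        y[t] = feed_forward + feedback
    return y

a0, b1 = 0.5, 0.8                      # hypothetical coefficients
x = np.zeros(6)
x[0] = 1.0                             # unit impulse

y_iir = iir_filter(x, a=[a0], b=[0.0, b1])

# Linear one-unit RNN: h_t = W x_t + U h_{t-1}, with W = a0 and U = b1
h, y_rnn = 0.0, []
for x_t in x:
    h = a0 * x_t + b1 * h
    y_rnn.append(h)
y_rnn = np.array(y_rnn)
```

The impulse response decays geometrically as $a_0 b_1^{k}$ and never becomes exactly zero, which is the "infinite" memory that the feedback term provides.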

Now, it is interesting to analyze the interrelation of the RNN models with Kalman filter theory [77,78]. To do this, we first consider the Elman variant of RNN [67,79], a relatively simple three-layer recurrent structure, where the “hop” of the variable $h$ from $t-1$ to $t$ is given by Eq. (9), and the output $y_t$ (the prediction associated with the input $x_t$) is defined by

$$y_{t} = \phi_y ( H h_t + d),$$

where $H$ is the matrix of parameters to be optimized to obtain the best prediction, and $d$ is the bias vector; $\phi _y$ is the activation function, whose subindex $y$ signifies that it can be different from the activation function in (9). Index $t$ can be understood as the number of the pair $\{x_t,\tilde {y}_t\}$ in the overall dataset used for the training, where $\tilde {y}_t$ is the true value, while $y_t$ marks the prediction given by the RNN; $y_t$, produced by the RNN run number $t$, can also be understood as a “measurement” rendered by our RNN model at the $t$th step, whereas $\tilde {y}_t$ can be reckoned as the true result of the “measurement.”

Applied to the regression task, the goal of the RNN training is to identify the optimal NN structure, namely the particular values of the parameters (matrices and vectors) $W$, $b$, $U$, $H$, and $d$, which are further used in the inference stage. Optimization is performed by minimizing the loss function, i.e., typically some characteristic function of the difference of $y_t$ (predicted by the RNN) and the “true value,” $\tilde {y}_t$. Quite often, the regression loss function to minimize is the mean-squared error (MSE), such that we minimize $\mathrm {MSE} (y_t - \tilde {y}_t) = |y_t - \tilde {y}_t|^2$ (where $| \ldots |$ means the norm); the goal is to minimize this function across all allowed values of $W$, $b$, $U$, $H$, and $d$ [80].

Now, turning to the ordinary two-step Kalman filter, it deals with the estimates (predictions) attributed to the linear systems of discrete equations [78]:

(12)$$\begin{aligned} \eta_t & = W^K \varkappa_t + U^K \eta_{t-1} + \beta_t, \\ \psi_t & = H^K \eta_t + \delta_t, \end{aligned}$$

where $\eta$ variables represent the “hidden” system state to which we do not have direct access (cf. the internal $h$-variables in the RNN), the matrix $U^K$ defines the state’s discrete transition model which is applied to the previous state at step $t-1$, $\eta _{t-1}$ in our case; $W^K$ is the control-input model which is applied to the control vector $\varkappa _t$; $\beta _t$ is the process noise, which is, for the correctness of the Kalman filter theory, assumed to be drawn from a zero mean multivariate normal distribution with known covariance independent of any other variables [81]. In the lower equation from Kalman filter set (12), $H^K$ is the observation model, which maps the true state space $\eta$ into the observations space $\psi$, and $\delta _t$ is the observation noise, which is, again, assumed to be drawn from a zero mean multivariate normal distribution with known covariance independent of any other variables.

The Kalman filter algorithm can be conceptualized in two steps (that we do not detail here mathematically): (i) a prediction step and (ii) an update step. Initially, we assume that we have the a priori estimate of $\eta _{t-1}$ (the prior), say $\hat {\eta }_{t-1}^-$, obtained at the previous step of the algorithm, and we know the (diagonal) error covariance matrix associated with $\eta _{t-1}^-$; the latter is also computed at the previous step. For the prediction at step $t$, we now use the value $\psi _t$ and calculate the optimal Kalman gain matrix. The Kalman gain allows us to update the prior and calculate a posteriori (the posterior) value for the hidden variable estimate, $\hat {\eta }_{t-1}$. For the optimal Kalman gain, the value of the expectation for $\mathrm {MSE}(\eta _{t-1} - \hat {\eta }_{t-1})$ (the variance of the posterior) is minimal, and it is used to calculate the prior of the covariance matrix associated with $\eta _t$, see [77,78] for detailed equations and explanations.
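For orientation, one predict/update cycle can be sketched in the standard textbook form, written with the symbols of Eq. (12). The time-indexing convention (predict to step $t$, then correct with $\psi_t$), the noise covariances `Q` and `R`, and all numerical values are assumptions of this sketch rather than details taken from the text.

```python
import numpy as np

def kalman_step(eta_post, P_post, kappa_t, psi_t, W, U, H, Q, R):
    """One Kalman cycle for the linear system of Eq. (12).

    eta_post, P_post: posterior state estimate and covariance from step t-1
    kappa_t, psi_t:   control vector and observation at step t
    Q, R:             process and observation noise covariances (assumed known)
    """
    # (i) Prediction step: propagate the estimate and its covariance (the prior)
    eta_prior = U @ eta_post + W @ kappa_t
    P_prior = U @ P_post @ U.T + Q

    # (ii) Update step: weigh the innovation (pre-fit residual) by the Kalman gain
    innovation = psi_t - H @ eta_prior
    S = H @ P_prior @ H.T + R
    K = P_prior @ H.T @ np.linalg.inv(S)
    eta_new = eta_prior + K @ innovation
    P_new = (np.eye(len(eta_post)) - K @ H) @ P_prior
    return eta_new, P_new, K

# Scalar toy example: static state, direct observation, equal prior and noise variance
one, zero = np.array([[1.0]]), np.array([[0.0]])
eta_hat, P_hat, gain = kalman_step(
    eta_post=np.array([0.0]), P_post=one,
    kappa_t=np.array([0.0]), psi_t=np.array([2.0]),
    W=zero, U=one, H=one, Q=zero, R=one,
)
```

In this toy case the prior and the observation are equally uncertain, so the gain is 0.5: the estimate moves halfway towards the observation and the variance is halved.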

We can notice the difference in the outputs for the RNN and for the Kalman algorithm: while the former attempts to minimize the MSE for the difference of the RNN result $y_t$ with the observed value, $\tilde {y}_t$, the Kalman algorithm ensures the minimization of the errors’ posterior estimates for the hidden states $\eta$, which are the “analogs” of the hidden RNN variables $h$. Thus, we ought to understand how the Kalman filter handles the estimation of $\psi$ errors (the so-called innovations), the measurement, or the “observer,” i.e., to deal with the so-called pre-fit and post-fit residuals. However, it is known that the observer for the optimal Kalman gain is also optimal in the MSE sense [82], and, so, the Kalman filter also minimizes the observation error; therefore, the tasks for the regression Elman RNN and Kalman filter are, indeed, similar, and it is possible to compare the results of the two approaches. As the answer for the “ideal” Kalman system is obvious, and it has been rigorously proven that the Kalman filter for such a system is an optimal linear estimator, the two approaches are often compared for systems and conditions different from (12): it was found that the RNN can give good results in conditions where the classical Kalman filter fails, e.g., when the system to estimate is nonlinear [83].

Now, let us turn to the distinctions between the two approaches. First, obviously, the RNN contains nonlinear activation functions, while the Kalman system is linear. Second, the Kalman system contains random variables, namely the process and measurement noises, and the optimal Kalman gain is expressed through the (known) variances of the two noises. In contrast, the “ordinary” RNN’s parameters are deterministic [84,85]. However, the important difference is that the Kalman filter in its original positioning cannot learn: it just gives the estimate based on the known system’s parameters, and this estimate is optimal in the MSE sense if the specific conditions (the system’s linearity and the white additive Gaussian character of the participating noises) are fulfilled. We can, of course, state the problem differently: find some (or all) of the parameters of the system using the given input (control vector) and measurement pairs; the latter are supposed to be associated with the Kalman system [86]. This problem statement is already closer to the learning phase of the RNN [87]. More details on the comparison of different Kalman filtering-based techniques and RNNs, as well as the interpenetration of these two techniques, can be found in [88–91]. Finally, we also mention that the Kalman filter theory and its extensions can be efficient in the training of NNs [92], which is yet another important application relating the two concepts.

Concerning the photonic implementation of recurrent structures, we note that these are rarer compared with the feed-forward counterparts. First, we note [93], where the authors proposed a photonic architecture enabling all-to-all continuous-time RNN. We also mention [94], where a free-space network of up to 2025 diffractively coupled photonic nodes, forming a large-scale RNN, was demonstrated, and [55], where the experimental realization of diffractive RNN was also evaluated. Some further analysis of recurrent topology implementation was given in the review [95]. Another RNN (coupled with CNN) realization was considered in [96]. An interesting generalized look at the realization of RNN in hardware was presented in [97].

2.4 Long Short-Term Memory Neural Networks

Long short-term memory (LSTM) is an advanced type of RNN. While RNNs suffer from short-term memory issues, the LSTM network has the ability to learn long-term dependencies between time steps ($t$), insofar as it was specifically designed to address the gradient problems encountered in RNNs [98,99]. LSTM networks are made up of LSTM cells, which are units that contain a series of gates that can control the flow of information into and out of the cell, as shown in Fig. 7. The gates can learn to keep relevant information and discard irrelevant information, allowing the LSTM cell to remember important information for long periods of time. More specifically, there are three types of gates in an LSTM cell: an input gate ($i_t$), a forget gate ($f_t$), and an output gate ($o_t$). In addition, the cell state vector ($C_t$) serves as a long-term memory that aggregates relevant information throughout the time steps.

Figure 7. Schematics of an LSTM cell that constitutes the backbone of LSTM NNs. Weight matrices are omitted.

The following equations describe the computations involved in a single time step of an LSTM model:

(13)$$\begin{gathered} i_{t} = \sigma(W^{i}{x}_{t} + U^{i}{h}_{t-1} + b^{i} ), \\ f_{t} = \sigma(W^{f}{x}_{t} + U^{f}{h}_{t-1} + b^{f}), \\ o_{t} = \sigma(W^{o}{x}_{t} + U^{o}{h}_{t-1} + b^{o}),\\ C_{t} = f_{t}\odot C_{t-1} + i_{t}\odot \phi(W^{c}{x}_{t} + U^{c}{h}_{t-1}+ b^{c}), \\ h_{t} = o_{t} \odot \phi(C_{t}), \end{gathered}$$

with $\odot$ being the element-wise (Hadamard) multiplication; $\phi$ is usually the “tanh” activation function and $\sigma$ is usually the sigmoid activation function. The sizes of the variables are $x_{t}\in \mathbb {R}^{n_i}$, $f_{t}, i_{t}, o_{t}\in (0,1)^{n_h}$, $C_{t}\in \mathbb {R}^{n_h}$, and $h_{t}\in (-1,1)^{n_h}$. The input at time step $t$, $x_t\in \mathbb {R}^{n_i}$, is processed by the LSTM model to produce an output at time step $t$, $h_t\in (-1,1)^{n_h}$. The subscript $t$ denotes the current time step, while $t-1$ denotes the previous time step.

To explain further, the LSTM equations above can be divided into five stages. First, the input gate controls the flow of information into the memory cell. It takes the input $x_t$ and the previous hidden state $h_{t-1}$ as inputs, and produces an output $i_t\in (0,1)^{n_h}$ that represents the degree to which the input should be written to the memory cell. Second, the forget gate controls how much of the stored information is kept in the memory cell. It takes the input $x_t$ and the previous hidden state $h_{t-1}$ as inputs, and produces an output $f_t\in (0,1)^{n_h}$ that represents the degree to which the previous cell state $C_{t-1}$ should be retained. Next, the output gate controls the flow of information out of the memory cell. It takes the input $x_t$ and the previous hidden state $h_{t-1}$ as inputs, and produces an output $o_t\in (0,1)^{n_h}$ that represents the degree to which the current cell state $C_{t}$ should be outputted. Then, the memory cell $C_t$ is responsible for storing and updating information over time. It takes the input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$ as inputs, and produces a new cell state $C_t\in \mathbb {R}^{n_h}$ that integrates the current input and the previous memory. Finally, the hidden state $h_t$ is the output of the LSTM model at time step $t$. It takes the current cell state $C_t$ and the output gate $o_t$ as inputs, and produces an output $h_t\in (-1,1)^{n_h}$ that represents the current hidden state of the LSTM model.
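
The five stages of Eq. (13) can be sketched in a few lines of NumPy. This is a minimal illustration rather than an optimized implementation: the sizes, the random weights, and the names `lstm_step` and `params` are assumptions introduced only for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of Eq. (13); params maps a gate name to (W, U, b)."""
    pre = {k: W @ x_t + U @ h_prev + b for k, (W, U, b) in params.items()}
    i_t = sigmoid(pre["i"])                        # input gate
    f_t = sigmoid(pre["f"])                        # forget gate
    o_t = sigmoid(pre["o"])                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(pre["c"])   # cell state update
    h_t = o_t * np.tanh(c_t)                       # hidden state (cell output)
    return h_t, c_t


rng = np.random.default_rng(0)
n_i, n_h, T = 3, 4, 10                             # illustrative sizes
params = {k: (rng.normal(size=(n_h, n_i)),         # W^k
              rng.normal(size=(n_h, n_h)),         # U^k
              np.zeros(n_h))                       # b^k
          for k in "ifoc"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(T, n_i)):              # random input sequence
    h, c = lstm_step(x_t, h, c, params)
```

The cell state `c` is only rescaled by the forget gate and incremented by the gated candidate, which is what lets information persist over many time steps.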

Note that LSTM networks are trained using back-propagation through time, where the error is propagated back through the network over multiple time steps. This allows the LSTM network to learn how to use information from earlier time steps to make predictions at later time steps.

Here, it is also important to mention the existence of another structure called bidirectional LSTM (BiLSTM). The BiLSTM is a type of LSTM that processes the input sequence in both forward and backward directions and concatenates the outputs of both directions at each time step. This gives the model access to both the past and future contexts of the input sequence, which is important for analyzing temporal patterns in optical signals. In particular, in optical fiber communications, BiLSTMs have been shown to be effective in analyzing the temporal patterns of optical signals for detecting and mitigating various impairments, such as polarization mode dispersion and chromatic dispersion. Similarly, in optical sensing, BiLSTMs can be more effective than regular LSTMs in capturing the temporal patterns of optical signals, leading to more accurate detection and measurement of physical parameters.

The equations for the forward and backward LSTM layers are similar to those for the regular LSTM, except that they are computed in opposite directions. The forward LSTM layer processes the input sequence from the first time step to the last, while the backward LSTM layer processes it from the last time step to the first. The output of the forward LSTM layer at time step $t$ is denoted by $h_{t}^{f}$, and the output of the backward LSTM layer at the same time step is denoted by $h_{t}^{b}$.
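
A minimal NumPy sketch of this bidirectional processing is given here. The stacked-weight layout (the `i`, `f`, `o`, `c` blocks stored in one matrix), the sizes, and the names `lstm_layer` and `bilstm` are assumptions made for the illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_layer(xs, W, U, b):
    """Run an LSTM over xs (T x n_i); W, U, b stack the i, f, o, c blocks."""
    n_h = U.shape[1]
    h, c, outputs = np.zeros(n_h), np.zeros(n_h), []
    for x_t in xs:
        i, f, o, g = np.split(W @ x_t + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)                     # shape (T, n_h)


def bilstm(xs, fwd, bwd):
    h_f = lstm_layer(xs, *fwd)                   # first -> last time step
    h_b = lstm_layer(xs[::-1], *bwd)[::-1]       # last -> first, re-aligned in time
    return np.concatenate([h_f, h_b], axis=1)    # [h_t^f, h_t^b] at each t


def random_weights(rng, n_i, n_h):
    return (rng.normal(size=(4 * n_h, n_i)),
            rng.normal(size=(4 * n_h, n_h)),
            np.zeros(4 * n_h))


rng = np.random.default_rng(1)
T, n_i, n_h = 6, 2, 3                            # illustrative sizes
fwd, bwd = random_weights(rng, n_i, n_h), random_weights(rng, n_i, n_h)
xs = rng.normal(size=(T, n_i))
H = bilstm(xs, fwd, bwd)                         # shape (T, 2 * n_h)
```

Each row of `H` holds $h_t^{f}$ in its first `n_h` entries, which depend only on inputs up to time $t$, and $h_t^{b}$ in the remaining entries, which depend on inputs from time $t$ onward.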

2.5 Gated Recurrent Units

Introduced in 2014 [100], the gated recurrent units (GRU) network, similar to the LSTM, was designed to overcome the short-term memory issues of RNNs. However, the GRU is less complex than the LSTM [101,102], as it has only two types of gates: the reset ($r_t$) and update ($z_t$) gates, as shown in Fig. 8. The reset gate is used to handle short-term memory, whereas the update gate is responsible for long-term memory [103]. In addition, the candidate hidden state ($h'_{t}$) is also introduced to measure how relevant the previous hidden state is to the candidate state. The GRU for a time step $t$ can be formalized as

(14)$$\begin{gathered} z_{t} = \sigma(W^{z}{x}_{t} + U^{z}{h}_{t-1} + b^{z}), \\ r_{t} = \sigma(W^{r}{x}_{t} + U^{r}{h}_{t-1} + b^{r}), \\ h'_{t} = \phi(W^{h}{x}_{t} + r_{t} \odot U^{h}{h}_{t-1} + b^{h}), \\ h_{t} = z_{t} \odot {h}_{t-1} + (1 - z_{t}) \odot h'_{t}, \end{gathered}$$

where $\phi$ is typically the “tanh” activation function and the rest of the designations are the same as in Eq. (13).
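
For illustration, a single GRU step of Eq. (14) can be written in NumPy as follows. The sizes, the random weights, and the names `gru_step` and `p` are assumptions of this sketch; the reset gate is applied element-wise to the product $U^{h}h_{t-1}$, which is how we read the third line of Eq. (14).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, p):
    """One time step of Eq. (14); p holds the W, U, b arrays of the z, r, h blocks."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])    # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])    # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + r_t * (p["Uh"] @ h_prev) + p["bh"])
    return z_t * h_prev + (1.0 - z_t) * h_cand                   # new hidden state


rng = np.random.default_rng(0)
n_i, n_h = 3, 4                                   # illustrative sizes
p = {}
for gate in "zrh":
    p["W" + gate] = rng.normal(size=(n_h, n_i))
    p["U" + gate] = rng.normal(size=(n_h, n_h))
    p["b" + gate] = np.zeros(n_h)
h = np.zeros(n_h)
for x_t in rng.normal(size=(10, n_i)):            # random input sequence
    h = gru_step(x_t, h, p)
```

Equation (14) involves three sets of $(W, U, b)$ parameters, against four in Eq. (13), which is the origin of the lower complexity of the GRU for the same hidden size.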

Figure 8. Schematics of a GRU cell that is a less computationally complex alternative to the LSTM cell. Weight matrices are omitted.

In addition to (14), which defines the so-called fully gated unit, simpler GRU architecture variants called minimal gated units are also sometimes used [104]: in these types, the reset and update gates are merged. Some other GRU variants are described and compared in [103].

2.6 Echo State Networks

Echo state networks (ESNs) belong to the class of recurrent structures, more specifically, to the reservoir computing category [105]. The ESN was proposed to simplify the training process while staying efficient and simple to implement. The ESN comprises three layers: an input layer, a recurrent layer, known as a reservoir, and an output layer, which is the only layer that is trainable. The reservoir with random weights assignment is used to replace back-propagation in traditional NNs to reduce the computational complexity of training [106]. We note that the reservoir of the ESNs can be implemented in two domains: digital and optical [107]. With the optical implementation of the reservoir, the computational complexity dramatically falls; however, the degradation of the performance due to the change of domain can be non-negligible [108]. In this work, we only examine the digital domain implementation. Moreover, we focus on the leaky ESN, as it is believed to often outperform the “standard” ESNs and is more flexible due to time-scale phenomena [109,110]. The equations of the leaky ESN for a certain time step $t$ are given as

(15)$$a_t = \phi \left( W^{r} s_{t-1} + W^{\text{in}} x_t + W^{\text{back}} y_{t-1} \right),$$

(16)$$s_t = (1- \mu) s_{t-1} + \mu a_t,$$

(17)$$y_t = W^{o}s_{t} + b^{o},$$

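where $s_t$ represents the state of the reservoir at time $t$, $\phi$ is the activation function, $W^r$ denotes the weight matrix of the reservoir with the sparsity parameter $s_p$, $W^{\text{in}}$ is the weight matrix that describes the connection between the input layer and the hidden layer, $W^{\text{back}}$ is the feedback weight matrix that connects the previous output $y_{t-1}$ to the reservoir, $\mu$ is the leaky rate, $W^{o}$ and $b^{o}$ denote the trained output weight matrix and bias, and $y_t$ is the output vector.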

The schematics of an ESN are shown in Fig. 9. The crucial point in the ESN or reservoir computing concept is that despite the complex structure of these networks, only the weights of the output (readout) layer are trainable. One can see that the multiple interconnections described by the matrices $W^{\text{in}}$, $W^{r}$, and $W^{\text{back}}$ constitute a complex recurrent structure with rich internal dynamics. Training of a classical MLP or RNN with a comparable number of neurons would be time-consuming. However, the concept of ESN speeds up the training process drastically and reduces it to linear regression on the output layer. The important feature of this type of NN is that it can be easily implemented in the physical domain. Many dynamical systems with a large internal phase space and nonlinear properties can be employed as a reservoir. There are various experimental implementations of ESNs, including fiber-cavity-based schemes [111]. Figure 10 shows fiber-optic implementations of this concept: the fiber cavity with a circulating modulated signal serves as an optical reservoir.
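
To illustrate how training reduces to linear regression on the readout, here is a minimal digital leaky-ESN sketch based on Eqs. (15)–(17). Several choices are assumptions made for the example and are not prescribed by the text: the feedback term is switched off ($W^{\text{back}}=0$), the reservoir is rescaled to a spectral radius of 0.9 (a common heuristic), the readout is fitted by ridge regression, and the toy task is the one-step-ahead prediction of a sine wave.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, mu, sparsity, rho = 100, 0.3, 0.1, 0.9    # illustrative hyperparameters

# Fixed random sparse reservoir, rescaled to spectral radius rho; never trained.
W_r = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < sparsity)
W_r *= rho / np.max(np.abs(np.linalg.eigvals(W_r)))
W_in = rng.uniform(-0.5, 0.5, size=(n_res, 1))


def run_reservoir(x):
    """Eqs. (15)-(16) with W_back = 0; returns the states s_t, one row per step."""
    s, states = np.zeros(n_res), []
    for x_t in x:
        a = np.tanh(W_r @ s + W_in @ np.atleast_1d(x_t))
        s = (1.0 - mu) * s + mu * a
        states.append(s)
    return np.array(states)


# Toy task: predict the next sample of a sine wave.
u = np.sin(0.2 * np.arange(1201))
S = run_reservoir(u[:-1])[200:]                  # discard the initial transient
target = u[1:][200:]

# Readout of Eq. (17): ridge regression on [s_t, 1]; only W_o and b_o are fitted.
X = np.hstack([S, np.ones((len(S), 1))])
theta = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res + 1), X.T @ target)
train_mse = np.mean((X @ theta - target) ** 2)
```

The reservoir and input weights are drawn once and kept fixed; the whole training stage is the single linear solve for `theta`.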

Figure 9. Schematics of an ESN that is an RNN with only output weights trainable.

Figure 10. Examples of optical reservoir computers or ESNs.

Finally, we would like to highlight some potential drawbacks of using ESNs, which include: (i) difficulty in training, i.e., ESNs can be difficult to train, as they require careful tuning of the network’s hyperparameters in order to achieve a good performance; (ii) limited ability to model long-term dependencies, i.e., ESNs are not able to effectively model long-term dependencies in the data, as they have a fixed-size reservoir and do not allow information to flow through the network over many time steps.

2.7 Attention Layers

Attention is an NN mechanism that observes a whole collection of data and selectively focuses on a subset of the collection. In other words, attention mechanisms allow a model to focus on specific parts of its input when processing it, rather than using the entire input equally. The attention unit is schematically represented in Fig. 11. It was first applied to sequence-to-sequence learning in [112] and was used mostly to further exploit the importance of each subset among the input data. Thus, attention is an add-on component of a network’s architecture, in charge of managing and quantifying the interdependence between the data of interest. General attention investigates the interdependence between input and output elements, whilst self-attention deals with finding correlations among input elements [113–115].

Figure 11. Schematics of an attention unit that constitutes the main element of any attention NN. Trainable weights are omitted.

Let us turn to the case of general attention to account for the interdependence between the final predicted symbol and both the input symbols and the output hidden states. By adding such an attention mechanism, we expect to find the contribution of the input symbols and their hidden representations to the final received symbol prediction. Therefore, we can identify the essential part of the input sequence for training that could lower the computational complexity.

The attention is generally a single- or multi-layer feed-forward NN with trainable weights and biases, which are applied to the output hidden states of the RNN layer.

In the original attention mechanism [112], an input sequence $\{x_{1},\ldots,x_{T_x}\}$ targets an output sequence $\{y_{1},\ldots,y_{T_y}\}$. The conditional probability for a certain target output $y_i$ is defined as

(18)$$p(y_i|y_1,\ldots, y_{i-1},\mathbf{x}) = g(y_{i-1},s_i,c_i),$$

where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_i$; $s_i$ is an RNN’s hidden state for time $i$ computed through $s_i = f(s_{i-1},y_{i-1},c_i)$; $c_i$ is a context vector conditioned for each target $y_i$, i.e., a vector generated from the sequence of the hidden states for predicting the current target output $y_i$; it is computed as a weighted sum of the hidden states $\{h_{1},\;\cdots,\;h_{T_x}\}$:

(19)$$c_i = \sum_{j=1}^{T_{x}} \alpha_{i,j}h_{j},$$

where the weight $\alpha _{i,j}$ of each $h_j$ is computed by

(20)$$\alpha_{i,j} =\frac{\exp{e_{ij}}}{\sum_{k=1}^{T_x} \exp{e_{ik}}},$$

where $e_{ij}=a(s_{i-1}, h_j)$ is an alignment model which scores how well the inputs around position $j$ and the output at position $i$ match.

Instead of predicting the conditional probability of each target $y_i$ from a sequence of targets, we focus only on the received symbol $y_i$:

(21)$$y_i = g(c), \quad \text{where} \ c = \alpha \ast h = [\alpha_{1}h_{1}, \; \cdots\;\alpha_{T_x}h_{T_x}].$$

The weight $\alpha _{i}$ of each $h_i$ is calculated by

(22)$$\alpha_{i} =\frac{\exp{e_{i}}}{\sum_{j=1}^{2k+1} \exp{e_{j}}},$$

where $e_{i}=a(h_i)$ is the adapted alignment model and indicates the matching score between the output symbol $y_i$ and the hidden representations $\mathbf {h}$ of the input sequence $\mathbf {x}$. According to [112], we can define the activation function $f$ of the RNN and the alignment model $a$ by choice. A single-layer perceptron (SLP) is selected as our alignment model. Matrix multiplication is first performed between the hidden input states and a trainable weight matrix $W_a \in \mathbb {R}^{1\times n_h}$ with bias $b_a \in \mathbb {R}^{1\times n_s}$, where $n_h$ is the number of hidden units, and $n_s$ is the input sequence length, after which a $\tanh$ function is applied as the activation function of the SLP:

(23)$$a(h_j) = \tanh(W_ah_j + b_{a_j}).$$

The softmax activation function is then applied to the alignment model to compute a probability, i.e., the attention score of the hidden states with respect to the final output symbol. The context vector $c$ is then obtained by an element-wise matrix multiplication between the attention score $\alpha$ and the hidden states. The attention score specifies the amount of attention given to each element of the hidden state sequence that corresponds to that of the input symbol sequence.
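
The adapted attention of Eqs. (21)–(23) can be sketched as follows. The hidden states are random placeholders standing in for the output of an RNN layer, and the sizes and the name `attention_pool` are assumptions of this example.

```python
import numpy as np


def attention_pool(H, W_a, b_a):
    """H: (n_s, n_h) hidden states; W_a: (1, n_h) weights; b_a: (n_s,) biases."""
    e = np.tanh(H @ W_a.ravel() + b_a)      # alignment scores, Eq. (23)
    alpha = np.exp(e - e.max())             # numerically stable softmax, Eq. (22)
    alpha /= alpha.sum()
    c = alpha[:, None] * H                  # context [alpha_j h_j], Eq. (21)
    return c, alpha


rng = np.random.default_rng(0)
n_s, n_h = 7, 4                             # illustrative sequence length and width
H = rng.normal(size=(n_s, n_h))             # placeholder for the RNN hidden states
W_a = rng.normal(size=(1, n_h))
b_a = np.zeros(n_s)
c, alpha = attention_pool(H, W_a, b_a)
```

The vector `alpha` sums to one over the sequence positions, so each entry can be read as the share of attention given to the corresponding hidden state.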

Finally, we conclude by highlighting some potential drawbacks and benefits of using attention mechanisms in machine-learning models.

• Increased complexity: attention mechanisms can add additional complexity to a model, which can make the model more difficult to understand and debug.
• Increased training time: attention mechanisms can also require more computation to train, which can increase the training time for a model.
• Improved performance: attention mechanisms can allow a model to focus on the most relevant parts of the input, which can improve the model’s performance on a variety of tasks.
• Better handling of long input sequences: attention mechanisms can be particularly useful for tasks that involve long input sequences, such as machine translation, as they allow the model to focus on the most relevant parts of the input rather than processing the entire sequence equally.
• Improved generalization: attention mechanisms can also improve the generalization of a model, as they allow the model to adapt to different input patterns and focus on the most important features.

2.8 Transformers

The vanilla transformer is a deep learning architecture that was introduced in [116]. Its architecture is shown in Fig. 12. The transformer is a sequence-to-sequence model that operates on sequences of vectors, where the goal is to learn a mapping from one sequence to another. The key innovation of the transformer is the use of the previously mentioned self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when generating the output sequence. In a nutshell, the transformer consists of an encoder and a decoder. The encoder takes the input sequence and produces a sequence of hidden representations, which are then used by the decoder to generate the output sequence. The self-attention mechanism is used in both the encoder and the decoder, allowing the model to attend to different parts of the input sequence when generating each element of the output sequence. The vanilla transformer can be expressed mathematically as follows.

Figure 12. Transformer architecture reproduced from the original paper [116].

Let $X = \{x_1, x_2,\ldots, x_n\}$ be the input sequence, where $x_i$ is a vector of dimension $d_{model}$, so the shape of $X$ is [$n \times d_{model}$]. Similarly, let $Y = \{y_1, y_2,\ldots, y_m\}$ be the output sequence, where $y_j$ is a vector of dimension $d_{model}$, so the shape of $Y$ is [$m \times d_{model}$].

The encoder consists of $N$ identical layers, where each layer has three sublayers: a multi-head self-attention mechanism, an Add&Norm layer, and a position-wise fully connected feed-forward network. The output of the $i$th layer of the encoder is denoted as $H_i = \{h_{i,1}, h_{i,2},\ldots, h_{i,n}\}$, where $h_{i,j}$ is a vector of dimension $d_{model}$, so the shape of $H_i$ is [$n \times d_{model}$].

The multi-head self-attention mechanism can be expressed as

$$\begin{aligned} \mathrm{MultiHead}(Q,K,V) &= \mathrm{Concat}(head_1, head_2,\ldots, head_h)W^O,\\ \mathrm{where}\;\;\;\;\;\;\;\;\; head_i &= \mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V),\\ \mathrm{Attention}(Q,K,V) &= \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V. \end{aligned}$$

Here, $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, with dimensions $n\times d_{model}$. The matrices $W_i^Q$, $W_i^K$, and $W_i^V$ [117] are learned projection matrices with dimensions $d_{model} \times d_k$, $d_{model} \times d_k$, and $d_{model} \times d_v$, respectively. In this case, $d_k$ is the dimension of the embedding space used for keys and queries and $d_v$ is the dimension of the embedding space used for values [118]. In addition, note that because the input data as well as the linear layer weights are uniformly partitioned across the attention heads, the $d_k$ dimension is usually equal to $d_{model}/h$, where $h$ is the number of attention heads. Here $W^O$ is a learned projection matrix applied to the concatenated outputs of all the attention heads.

In the context of optical communications for denoising, one can interpret that the “key” represents the information that the model uses to look up relevant parts of the input signal (i.e., representations of the noisy signal at different positions/times), the “query” represents the information that the model is trying to find or pay attention to denoise the signal (i.e., representations of the noisy signal after some initial processing or encoding), and the “value” represents the actual content or information at each position in the signal sequence (i.e., these could be the noisy signal representations or even the same as the “key” in some cases).

In addition, it is important to note that masked multi-head attention can also be present in the transformer structure. In such a case, the inputs of the softmax function are masked out by adding the matrix $M$, which contains zeros and $-\infty$. The $-\infty$ entries correspond to invalid connections. The equation is then modified to

(24)$$\mathrm{Attention_{Masked}}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V.$$
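
The attention formulas above, including the mask of Eq. (24), can be sketched in NumPy as follows. The example is a self-attention case ($Q=K=V=X$) with random weights; the sizes, the causal (upper-triangular) form of $M$, and all function names are assumptions made for the illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V, M=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot products
    if M is not None:
        scores = scores + M                      # -inf entries remove connections
    return softmax(scores) @ V


def multi_head(X, WQ, WK, WV, WO, M=None):
    """Self-attention (Q = K = V = X); WQ, WK, WV are lists of per-head matrices."""
    heads = [attention(X @ wq, X @ wk, X @ wv, M)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO


rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2                          # illustrative sizes
d_k = d_v = d_model // h
X = rng.normal(size=(n, d_model))
WQ = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
WO = rng.normal(size=(h * d_v, d_model))
M = np.triu(np.full((n, n), -np.inf), k=1)       # zeros on and below the diagonal
out = multi_head(X, WQ, WK, WV, WO, M)           # shape (n, d_model)
```

With this particular mask, position $i$ can attend only to positions $j \le i$, so the first output row depends on the first input vector alone.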

Next, the position-wise fully connected feed-forward network can be expressed as

$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2.$$

Here, $x$ is a vector of dimension $d_{model}$, $W_1$ and $W_2$ are learned weight matrices, and $b_1$ and $b_2$ are learned bias vectors. The decoder also consists of $N$ identical layers, where each layer has three sublayers: a multi-head self-attention mechanism, a multi-head attention mechanism between the encoder output and the decoder input, and a position-wise fully connected feed-forward network.

In addition to the multi-head attention and feed-forward layers, we also have the Add&Norm layer. This Add&Norm layer involves a residual connection around each of the two sublayers [119] followed by layer normalization [120]. The output of this Add&Norm layer is

(25)$$y_\text{{Add{\&}Norm}} = \mathrm{LayerNorm}(x+\text{Sublayer}(x)),$$

where $\text {Sublayer}(x)$ refers to the function implemented by the sublayer itself, for instance, the FFN or the multi-head attention. The LayerNorm is then defined as

(26)$$\mathrm{LayerNorm}(z) = \frac{g(z-\mu_z)}{\sigma_z}+b,$$

where $g$ and $b$ denote the gain and bias, respectively, and $\mu_z$ and $\sigma_z$ are the mean and standard deviation of the summed inputs within each layer, respectively.
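
Equations (25) and (26), together with the position-wise FFN, can be illustrated with the following NumPy sketch. The sizes, the random weights, and the small constant `eps` added to the denominator to avoid division by zero are assumptions of this example.

```python
import numpy as np


def layer_norm(z, g, b, eps=1e-6):
    """Eq. (26), applied independently to each row (position) of z."""
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return g * (z - mu) / (sigma + eps) + b


def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network with a ReLU nonlinearity."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def add_and_norm(x, sublayer, g, b):
    """Eq. (25): residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x), g, b)


rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 16                      # illustrative sizes
x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = add_and_norm(x, lambda t: ffn(t, W1, b1, W2, b2),
                 g=np.ones(d_model), b=np.zeros(d_model))
```

With unit gain and zero bias, every row of `y` has approximately zero mean and unit standard deviation, regardless of the scale of the sublayer output.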

Finally, the transformer output can be computed as

$$\mathrm{Transformer}(X,Y) = \mathrm{softmax}(Z_N W^V).$$

Here, $Z_N$ is the output of the last layer of the decoder, with dimensions $m \times d_{model}$, and $W^V$ is a learned projection matrix with dimensions $d_{model} \times |V|$, where $|V|$ is the size of the output vocabulary.

In optical communications, the transformer can potentially be used to perform various tasks, such as equalization, modulation classification, and channel estimation. In particular, the self-attention mechanism of the transformer can be used to model the complex interactions between the different components of the optical communication system, such as the transmitter, the channel, and the receiver. In addition, the parallel processing nature of transformers, coupled with the direct interactions among symbols in an input sequence, allows them to capture memory effects more efficiently than LSTM models, where memory is handled sequentially. This feature makes transformers well-suited for hardware developments in ultrahigh-speed optical transmissions [121].

2.9 Residual Neural Networks

An artificial NN becomes a residual neural network (ResNet) [119] if the input of a specific layer is also passed (or skipped) to another deeper layer in the network; this connection is called a residual connection. The utilization of skip connections or shortcuts, visually illustrated in Fig. 13, is a distinctive feature of ResNets. These connections facilitate the bypassing of specific layers, thereby addressing challenges such as vanishing gradients and promoting more efficient training within deep architectures.

Figure 13. Schematics of a ResNet with double layer skips.

Another famous architecture that uses residual connections is the HighwayNet [122]. The HighwayNet preserves the shortcuts introduced in the ResNet, but augments them with a learnable parameter to determine to what extent each layer should be a skip connection or a nonlinear connection. It is noteworthy that HighwayNets possess the capacity to autonomously learn the skip weights through an additional weight matrix governing their gates. In contrast, ResNet models are conventionally characterized by double or triple-layer skips, incorporating nonlinear activation functions such as rectified linear unit (ReLU) and batch normalization, which enhance the expressiveness and convergence capabilities of the models.

In addition, DenseNets [123] serve as a relevant descriptive reference for models incorporating multiple parallel skip connections, underscoring the adaptability and versatility of residual connections in contemporary NN designs.

Let us now define the feed-forward equations for such a type of NN layer. Given the weight matrix $W^{\ell -1, \ell }$ for the connection weights from layer $\ell -1$ to $\ell$, and the weight matrix $W^{\ell -2, \ell }$ for the connection weights from layer $\ell -2$ to $\ell$, the forward propagation through the activation function with an explicit skip weight matrix (a.k.a. HighwayNets) would be

$$\begin{aligned} a^{\ell} & :=\mathbf{g}\left(W^{\ell-1, \ell} \cdot a^{\ell-1}+b^{\ell}+W^{\ell-2, \ell} \cdot a^{\ell-2}\right) \\ & :=\mathbf{g}\left(Z^{\ell}+W^{\ell-2, \ell} \cdot a^{\ell-2}\right), \end{aligned}$$

where $a^{\ell }$ is the activations (outputs) of neurons in layer $\ell$, $\mathbf {g}$ is the activation function for layer $\ell$, $W^{\ell -1, \ell }$ is the weight matrix for neurons between layer $\ell -1$ and $\ell$, and $Z^{\ell }=W^{\ell -1, \ell } \cdot a^{\ell -1}+b^{\ell }$. Absent an explicit matrix $W^{\ell -2, \ell }$ (a.k.a. ResNets), forward propagation through the activation function simplifies to

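$$a^{\ell}:=\mathbf{g}\left(Z^{\ell}+a^{\ell-2}\right),$$

i.e., the activations from layer $\ell -2$ are passed to layer $\ell$ without weighting. In the general case of several skip paths with weight matrices $W^{\ell -k, \ell }$, $k=2,\ldots,K$ (a.k.a. DenseNets), the forward propagation becomes

$$a^{\ell}:=\mathbf{g}\left(Z^{\ell}+\sum_{k=2}^{K} W^{\ell-k, \ell} \cdot a^{\ell-k}\right).$$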
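
A minimal NumPy sketch of a double-layer skip is given here, covering both the unweighted case and the case with an explicit skip matrix. The choice of ReLU for $\mathbf{g}$, the layer sizes, and the name `forward_skip` are assumptions made for the illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def forward_skip(a_prev2, W1, b1, W2, b2, W_skip=None):
    """Compute a^l from a^{l-2} through layer l-1, plus a skip from layer l-2."""
    a_prev1 = relu(W1 @ a_prev2 + b1)            # activations of layer l-1
    Z = W2 @ a_prev1 + b2                        # Z^l
    # Without W_skip the activations are passed unweighted (ResNet-type skip).
    skip = a_prev2 if W_skip is None else W_skip @ a_prev2
    return relu(Z + skip)


a0 = np.array([1.0, -2.0, 3.0])                  # activations of layer l-2
n = a0.size
zero_W, zero_b = np.zeros((n, n)), np.zeros(n)

# With all block weights at zero, only the skip path carries the signal.
out_identity = forward_skip(a0, zero_W, zero_b, zero_W, zero_b)
out_weighted = forward_skip(a0, zero_W, zero_b, zero_W, zero_b,
                            W_skip=2.0 * np.eye(n))
```

In this toy setting the block only has to learn a correction $Z^{\ell}$ on top of the skipped activations, which is the usual motivation for residual connections.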

The all-optical realization of the ResNet structure can be implemented using the scheme shown in Fig. 14. In diffractive optical NNs, the skip layer can be easily realized by shortcutting the part of diffractive layers by using semi-transparent mirrors or beam splitters. Both the shortcutted beam and the signal propagated through additional diffractive layers can be spatially combined by using another beam splitter, as shown in the right-hand side of Fig. 14.

Figure 14. (a) Schematics of a ResNet and (b) corresponding optical implementation. Adapted from [124].

It is pertinent to comment here on why the residual structures are needed: a very good and comprehensive example on the subject is given in [125]. The authors of [125] consider the seemingly elementary problem of representing the identity function, $f(x)=x$, via a small seven-parameter NN with a one-node input layer, a two-node hidden layer with a ReLU activation, and a one-node linear output layer (see Fig. 2 of [125]). When training this NN to approximate the identity function, but taking the data from the $[-1,1]$ region, the authors observed that the NN representation of the identity function diverged outside the training domain. Further (see Fig. 3 of [125]), if the NN is trained repeatedly with different random initialization of the parameters, different results were observed: in some cases, the training loss plateaued and the NN failed to accurately fit the training data at all. This observation was attributed to the tendency of deep and narrow ReLU networks to collapse to the mean value of the function. The following conclusions were drawn in [125]: (i) it is yet another manifestation of the fact that the NNs cannot be relied upon to extrapolate outside the training domain; (ii) even though it is possible to represent the identity function with this nonlinear NN, it is non-trivial for the training algorithm to fit the data. Therefore, it is recommended that we use the residual NNs to avoid the aforementioned problem.
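
To make the statement that the identity function is representable by such a network concrete, the sketch here evaluates a 1-2-1 ReLU network with hand-picked parameters, using $x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$. The parameter values, including the second set in which both hidden nodes are inactive on $[-1,1]$, are our own illustrative choices and are not taken from [125].

```python
def relu(z):
    return max(0.0, z)


def tiny_net(x, w, b, v, c):
    """1-2-1 network: 2 hidden weights, 2 hidden biases, 2 output weights, 1 bias."""
    return v[0] * relu(w[0] * x + b[0]) + v[1] * relu(w[1] * x + b[1]) + c


xs = [-3.0, -1.0, -0.25, 0.0, 0.5, 1.0, 3.0]

# Exact identity: x = ReLU(x) - ReLU(-x), valid inside and outside [-1, 1].
exact = [tiny_net(x, w=(1.0, -1.0), b=(0.0, 0.0), v=(1.0, -1.0), c=0.0)
         for x in xs]

# Both hidden nodes inactive on [-1, 1]: the output collapses to the constant c,
# and the gradient with respect to the hidden parameters vanishes there.
dead = [tiny_net(x, w=(1.0, -1.0), b=(-2.0, -2.0), v=(1.0, -1.0), c=0.0)
        for x in xs[1:-1]]
```

The first parameter set shows that an exact representation exists, while the second illustrates one way in which gradient-based training of such a small ReLU network can get stuck at a constant output.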

2.10 Radial Basis Function Neural Network

A radial basis function (RBF) network is an artificial NN that uses the RBFs as activation functions. Its schematics and a comparison of the RBF and the sigmoid function are given in Fig. 15. The network output is a linear combination of the RBFs of the inputs, weighted by the neuron parameters. The concept itself was introduced by Broomhead and Lowe in 1988 [126]. There are numerous applications for RBF networks, including function approximation, time series prediction, classification, and system control. Even though the RBF concept is considerably old and familiar, and, often, the other NN types are preferred nowadays, it still attracts the attention of data scientists [127].

Figure 15. Schematics of a radial basis NN and RBF neuron.

The RBF networks typically have three layers: an input layer, a hidden layer with a nonlinear RBF activation function, and a linear output layer [128]. The input can be modeled as a vector of real numbers $\mathbf {x} \in \mathbb {R}^{n}$. The output of the network is then a scalar function of the input vector, $\varphi : \mathbb {R}^{n} \rightarrow \mathbb {R}$, and is given by

$$\varphi(\mathbf{x})=\sum_{i=1}^{N} a_{i} \rho\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right),$$

where $N$ is the number of neurons in the hidden layer, $\mathbf {c}_{i}$ is the center vector for neuron $i$, and $a_{i}$ is the weight of neuron $i$ in the linear output neuron. Functions that depend only on the distance from a center vector are radially symmetric about that vector, hence the name RBF. In the basic form, all inputs are connected to each hidden neuron. The norm is typically taken to be the Euclidean distance (although the Mahalanobis distance [129] appears to perform better with pattern recognition) and the RBF is commonly taken to be a Gaussian function:

$$\rho\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right)=\exp \left[-\beta_{i}\left\|\mathbf{x}-\mathbf{c}_{i}\right\|^{2}\right].$$

The Gaussian basis functions are local to the center vector in the sense that

$$\lim _{\|\mathbf{x}\| \rightarrow \infty} \rho\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right)=0,$$

i.e., changing the parameters of one neuron has only a small effect on input values that are far away from the center of that neuron.

The RBF networks are universal approximators on a compact subset of $\mathbb {R}^n$ under certain modest restrictions regarding the activation function shape. This implies that an RBF network with sufficient hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary accuracy [130].

In addition to the unnormalized architecture mentioned, the RBF networks can be normalized. In this case, the mapping is

$$\varphi(\mathbf{x}) \stackrel{\text{def}}{=} \frac{\sum_{i=1}^{N} a_{i} \rho\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right)}{\sum_{i=1}^{N} \rho\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right)}=\sum_{i=1}^{N} a_{i} u\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right),$$

where

$$u\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right) \stackrel{\text{def}}{=} \frac{\rho\left(\left\|\mathbf{x}-\mathbf{c}_{i}\right\|\right)}{\sum_{j=1}^{N} \rho\left(\left\|\mathbf{x}-\mathbf{c}_{j}\right\|\right)}$$

is known as a normalized RBF.
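As a minimal sketch of the normalized architecture (again with hypothetical names and toy parameter values), the code below evaluates the normalized RBFs $u$ for Gaussian $\rho$. Subtracting the largest exponent before exponentiating is an implementation detail we add for numerical stability; it cancels in the ratio.

```python
import numpy as np

def normalized_rbf(x, centers, betas):
    """u_i(x) = rho_i(x) / sum_j rho_j(x) for Gaussian rho."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    log_rho = -betas * sq_dist
    log_rho = log_rho - log_rho.max()   # common factor cancels in the ratio
    rho = np.exp(log_rho)
    return rho / rho.sum()

def normalized_rbf_forward(x, centers, betas, weights):
    """phi(x) = sum_i a_i * u_i(x)."""
    return float(weights @ normalized_rbf(x, centers, betas))

# Hypothetical toy network: N = 3 hidden neurons, n = 2 inputs.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
betas = np.array([1.0, 0.5, 2.0])
weights = np.array([0.7, -1.2, 0.4])

u = normalized_rbf(np.array([0.3, 0.4]), centers, betas)
y_far = normalized_rbf_forward(np.array([50.0, 50.0]), centers, betas, weights)
```

Because the $u$ values are non-negative and sum to one, the output is a weighted average of the $a_i$. In this toy setup, far from all centers the output does not decay to zero but tends to the weight of the neuron whose Gaussian decays most slowly (here the second one).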

Here, we outline the primary reasons why RBF networks failed to gain traction. RBFs are fundamentally flawed because they (a) are too nonlinear, (b) do not perform dimension reduction, and (c) were typically trained using $k$-means as opposed to gradient descent. In contrast, deep NNs have their nonlinearity under control, are able to reduce their dimensionality proportionately, and learn by means of gradient descent. Although one might make the RBF’s covariance matrix adaptive and hence achieve dimensionality reduction, this makes the RBF networks even more challenging to train.

2.11 Autoencoders

Autoencoders are NN architectures that are trained to reconstruct their inputs. A basic architecture of an autoencoder is shown in Fig. 16. More formally, the data $X$ are mapped by an encoder NN $g_{\phi }(.)$, with parameters $\phi$, to a latent representation $Z$, which is passed through a “bottleneck.” The bottleneck output is then decoded by a decoder NN $f_{\theta }(.)$, terminating with an output layer that has the same dimensionality as the encoder’s input layer and yields $\hat {X}$, a reconstruction of $X$. It is important to highlight that without the bottleneck, the encoder and decoder could simply copy their input to the output, whereas with a bottleneck the encoder has to compress the data to a latent representation that is more robust. The bottleneck appears in many autoencoder variations [131–133]. Regarding the training, both encoder and decoder NNs are trained simultaneously by minimizing a reconstruction loss function. The loss function depends on the nature of $X$ and the task at hand. If $X$ has $n$ features with continuous values, the problem can be understood as a regression problem, and one of the possible loss functions is the MSE:

(27)$$L_{MSE}(x,\hat{x}) = \frac{1}{n}\sum_{i=1}^{n}\left[(x_i - f_{\theta}(g_{\phi}(x_i)))^2\right].$$

However, if the data $X$ are discrete, two possible tasks can be performed: multi-class classification or multi-label classification. In the first case, $X$ is categorical in nature and is described by a one-hot vector with $n$ possible classes, in which only one feature is equal to one at a time. For this case, the decoder’s output needs to have a softmax activation function and the loss function can be the categorical cross-entropy loss:

(28)$$L_{CCEL}(x,\hat{x}) ={-}\frac{1}{n}\sum_{i=1}^{n}\left[x_i\log(f_{\theta}(g_{\phi}(x_i)))\right].$$

Figure 16. Schematic of autoencoders.


In the second case, $X$ also has $n$ features, but this time several features can be equal to one at the same time. This case is known as multi-label classification, and the decoder needs to have a sigmoid activation function; the loss function can be the binary cross-entropy loss:

(29)$$L_{BCEL}(x,\hat{x}) ={-}\frac{1}{n}\sum_{i=1}^{n}\left[x_i\log(f_{\theta}(g_{\phi}(x_i))) + (1-x_i)\log(1-f_{\theta}(g_{\phi}(x_i))) \right].$$
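The three reconstruction losses of Eqs. (27)–(29) can be written compactly as follows. This is a sketch under our own assumptions: `x_hat` stands for the decoder output $f_{\theta}(g_{\phi}(x))$, which is simply produced here by applying softmax or sigmoid to hypothetical logits, and the small constant `EPS` is added only to guard against $\log(0)$.

```python
import numpy as np

EPS = 1e-12  # numerical guard against log(0); not part of Eqs. (27)-(29)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(x, x_hat):
    """Eq. (27): continuous-valued features."""
    return float(np.mean((x - x_hat) ** 2))

def categorical_cross_entropy(x, x_hat):
    """Eq. (28): one-hot x, softmax decoder output."""
    return float(-np.mean(x * np.log(x_hat + EPS)))

def binary_cross_entropy(x, x_hat):
    """Eq. (29): multi-label x, sigmoid decoder output."""
    return float(-np.mean(x * np.log(x_hat + EPS)
                          + (1.0 - x) * np.log(1.0 - x_hat + EPS)))

# Hypothetical decoder outputs for n = 3 features.
mse = mse_loss(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
cce = categorical_cross_entropy(np.array([0.0, 1.0, 0.0]), softmax(np.zeros(3)))
bce = binary_cross_entropy(np.array([1.0, 0.0, 1.0]), sigmoid(np.zeros(3)))
```

With uninformative (all-zero) logits, the decoder assigns probability 1/3 to each class in the softmax case and 1/2 to each label in the sigmoid case, which gives the reference values that a trained autoencoder should improve upon.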

Moreover, the classical applications of autoencoders are dimensionality reduction (acting in the same way as a principal component analysis), data denoising, data generation, anomaly detection, and clustering. In the next section, we show a few more examples of how this architecture is used in photonics.

Variational autoencoders (VAEs) are a type of autoencoder that adds a probabilistic spin to the model [134]. Unlike traditional autoencoders that learn deterministic functions for encoding and decoding, VAEs learn probability distributions for both. They encode the input data into a mean and variance of a probability distribution, from which a sample is drawn and then decoded to generate the output. VAEs employ a unique training strategy called the “reparameterization trick” [135], which allows for the optimization of the model using standard back-propagation. This additional complexity of modeling the input data as probability distributions, however, tends to increase the computational complexity of training, relative to standard autoencoders. VAE has been applied for predictive control and hidden parameters’ retrieval [136]. There is also a photonic realization of VAE for high-throughput and low-latency image transmission [137].
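The reparameterization trick mentioned above can be illustrated with a short sketch. We assume, as is common but not stated in the text, that the encoder outputs the mean and the logarithm of the variance of a diagonal Gaussian; the function name and the toy values are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness is isolated in eps, so z is a deterministic, differentiable
    function of mu and log_var, which is what allows back-propagation.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs: mean 2.0 and standard deviation 0.5.
rng = np.random.default_rng(0)
mu = np.full(100_000, 2.0)
log_var = np.full(100_000, np.log(0.25))
z = reparameterize(mu, log_var, rng)
```

The samples `z` follow the requested distribution, while the gradient with respect to `mu` and `log_var` does not have to pass through the random number generator.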

Adversarial autoencoders (AAEs) are another variant of autoencoders, and they leverage the power of generative adversarial networks (GANs) for training [138]. They consist of an autoencoder paired with a discriminator network that is trained to differentiate between the encoded representations of real and generated data. The autoencoder is trained to fool the discriminator, thereby encouraging it to produce encoded representations that resemble the distribution of real data. The training of AAEs is more complex than that of standard autoencoders and VAEs, as it involves a min–max game between the autoencoder and the discriminator, which makes it computationally more intensive. AAEs are used for machine-learning-assisted metasurface design [139], global optimization of photonic devices [140], and hyperspectral anomaly detection [141,142].

Finally, we note that a serious challenge when dealing with autoencoders is to understand which variables are relevant to the problem under investigation. In this sense, it is important to highlight that the decoder part is never perfect and is highly dependent on the bottleneck used, since we can miss out on important dimensions of the problem.

2.12 Generative Adversarial Network

The idea behind the GANs was based on the concept of zero-sum game theory [143].

As shown in Fig. 17, the framework of GAN consists of two NN models: a generative model called a generator that captures the data distribution, and a discriminative model that distinguishes whether a sample came from the real dataset or from a generated (“fake”) one. In a nutshell, the generator aims to learn the distribution of real data, while the discriminator aims to correctly determine whether the input data are from the real data or from the generator. In order to win the game, the two participants need to continuously optimize themselves to improve the generation ability and the discrimination ability, respectively. Therefore, during the training procedure, the two models compete with each other. The generator is designed to generate data as realistically as possible so that it is difficult to distinguish them from the truth, while the discriminator as a binary classifier aims to identify real and fake data as accurately as possible. The generator and discriminator are optimized alternately until the augmented data are indistinguishable from the actual data [144–146]. In other words, drawing upon the principles of game theory, GANs utilize a minimax strategy, where the generator and discriminator networks strive to optimize their respective objectives, resulting in a dynamic equilibrium. This competitive nature of GAN training resembles a zero-sum game, in which the gains made by one network come at the expense of the other. However, it is important to note that GANs, in their formal definition, may not strictly adhere to the conditions of a zero-sum game. In a zero-sum game, the total utility or payoff remains constant, and any gain by one player directly corresponds to a loss for the other player. In contrast, the training process of GANs does not necessarily maintain a fixed overall utility.

Figure 17. Typical schematic of a GAN structure.


In a more mathematical formulation, to learn the generator distribution $p_g$ over data $x$ with distribution $p_x(x)$, a prior on the input noise variables is defined as $p_z(z)$, where $z$ is the noise variable. Then, the generator represents a mapping from noise space to data space as $G(z,\theta _g)$, where $G$ is a differentiable function represented by a NN with parameters $\theta _g$. The other NN, $D(x,\theta _d)$, is defined with parameters $\theta _d$, and its output $D(x)$ is a single scalar. Here $D(x)$ denotes the probability that $x$ comes from the data rather than from the generator $G$. The discriminator $D$ is trained to maximize the probability of assigning a correct label to both real training data and fake examples generated by the generator $G$. Simultaneously, $G$ is trained to minimize $\log (1- D(G(z)))$. Therefore, the optimization of a GAN can be formulated as a minimax problem:

(30)$$\min_G \max_D \{E_{x}[\log D(x)]+E_{z}[\log(1- D(G(z)))]\},$$

where $E[\ldots ]$ represents the expectation value.

In practice, the training of such structures is inherently unstable, so an alternating training procedure is used. In a nutshell, each round consists of two stages: (i) freeze the $\theta _g$ parameters and optimize $\theta _d$ to maximize the discrimination accuracy of $D$; (ii) freeze the $\theta _d$ parameters and optimize $\theta _g$ to minimize the discrimination accuracy of $D$. The two stages alternate, and the global optimum of Eq. (30) is achieved if and only if $p_g=p_x$.
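The optimality condition can be checked numerically on a toy example. The sketch below evaluates the objective of Eq. (30) for discrete distributions, using the well-known closed form of the best discriminator for a fixed generator, $D^{*}(x)=p_x(x)/[p_x(x)+p_g(x)]$, which is a standard result for this objective and is not derived in the text. The probability vectors are hypothetical.

```python
import numpy as np

def optimal_discriminator(p_x, p_g):
    """Best response of D to a fixed generator (standard closed-form result)."""
    return p_x / (p_x + p_g)

def gan_value(p_x, p_g, d):
    """E_x[log D(x)] + E_{x~p_g}[log(1 - D(x))] for discrete distributions."""
    return float(np.sum(p_x * np.log(d)) + np.sum(p_g * np.log(1.0 - d)))

# Hypothetical distributions over three outcomes (strictly positive entries).
p_x = np.array([0.5, 0.3, 0.2])
p_g_bad = np.array([0.2, 0.2, 0.6])

v_match = gan_value(p_x, p_x, optimal_discriminator(p_x, p_x))
v_mismatch = gan_value(p_x, p_g_bad, optimal_discriminator(p_x, p_g_bad))
```

When the generator matches the data distribution, the best discriminator outputs 1/2 everywhere and the value equals $-\log 4$; any mismatch lets the discriminator achieve a larger value, which the generator then tries to reduce.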

Finally, it is important to highlight some of the best practices when using this type of structure [147].

• Scale properly the real data $x$ and the generator output $G(z,\theta _g)$. A problem at this step can cause sample oscillation and model instability. Thus, it is recommended to avoid applying batchnorm to the generator output layer and the discriminator input layer.
• The batches fed to the discriminator can either mix real and fake data (from the generator), or be purely real or purely fake. The latter is the better approach, since keeping the fake and real data separated improves the GAN performance.
• It is recommended to use the leaky ReLU activation unit in all layers of the GAN except the output of the generator, where we should use tanh.
• In [147], the authors initialized all weights using a zero-centered Gaussian distribution with a standard deviation of 0.02.
• Use techniques to stabilize training: there are several techniques that can be used to stabilize the training of GANs, such as using batch normalization, using a history of generated samples in the discriminator, or using a two-time scale update rule for the generator and discriminator.
• Use a stable optimizer: GANs can be sensitive to the choice of an optimizer. Using a stable optimizer such as Adam can help to improve the training process.
• Monitor the training process carefully: it is important to monitor the training process carefully and track metrics such as the generator and discriminator loss. This can help to identify issues such as mode collapse, where the generator generates only a few types of samples, or the discriminator becomes too strong, and the generator is unable to improve.
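Some of the recommendations above translate directly into code. The following sketch shows the weight initialization with standard deviation 0.02, leaky ReLU hidden activations, and a tanh generator output. The layer sizes, the leaky-ReLU slope of 0.2, and the function names are our own illustrative assumptions.

```python
import numpy as np

def init_weights(shape, rng, std=0.02):
    """Zero-centered Gaussian initialization with standard deviation 0.02."""
    return rng.normal(loc=0.0, scale=std, size=shape)

def leaky_relu(z, slope=0.2):
    """Leaky ReLU; the slope value is an assumption, not taken from the text."""
    return np.where(z > 0, z, slope * z)

def generator_forward(z, w1, w2):
    """Two-layer generator: leaky ReLU in the hidden layer, tanh at the output."""
    h = leaky_relu(z @ w1)
    return np.tanh(h @ w2)

rng = np.random.default_rng(0)
w1 = init_weights((16, 64), rng)     # hypothetical sizes: 16-dim noise, 64 hidden units
w2 = init_weights((64, 8), rng)      # 8-dim generated sample
z = rng.standard_normal((4, 16))     # batch of 4 noise vectors
x_fake = generator_forward(z, w1, w2)
```

Since the generator output is bounded to $[-1, 1]$ by tanh, the real data should be scaled to the same range before being shown to the discriminator, in line with the first recommendation.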

As an example, in optical applications, GANs have been used for the end-to-end model for geometric constellation shaping applicable for any nonlinearity-limited optical communication channel [148].

3. How to Choose Your NN Architecture: The Hyperparameter Search

One of the most important steps in working with NNs is the design of the NN architecture. Indeed, the hyperparameters of the NN model (e.g., number of layers, number of neurons, type of activation function, and learning rate) affect the speed and accuracy of the learning process and ultimately define how the model functions. However, due to the lack of analytical approaches to calculating such hyperparameters, only a limited number of options (e.g., exhaustive and random search) have typically been used. In this section, we describe one efficient technique, known as Bayesian optimization (BO), explaining how it can help us to design our NN architecture with the aim of maximizing the NN’s performance [149–151]. We also briefly introduce the concept of RL for hyperparameter search, as it is gaining popularity in academic and industrial circles as a replacement for BO and other search techniques.

3.1 The Problem of Hyperparameter Tuning

Given a certain NN model that solves a problem under investigation, in which a given arbitrary input $x$ yields a response $y$, the model accuracy can be evaluated through an objective function $f$. A hyperparameter set $\theta$ fully represents the architecture of the NN, such that the objective function is described as $f = f(\theta,s,r)$, which for simplicity can be written as $f = f(\theta)$. In order to estimate the optimal model accuracy, $f$ must be subject to an optimization process with respect to $\theta$. However, in most cases, this optimization of $f$ is bounded by two important restrictions as follows [152].

(1) Computational complexity: the number of evaluations performed on $f$ is limited, typically in the range of a few hundred. This condition frequently arises because each evaluation takes a substantial amount of time.
(2) Non-differentiability: first- and second-order derivatives of $f$ with respect to $\theta$ are not easy to obtain, thus, preventing the application of methods such as gradient descent, Newton, or quasi-Newton methods.

There are a few possible search methods that cope with some of the aforementioned restrictions: grid search, random search, genetic algorithm, particle swarm optimization, and BO [153–156]. However, from our experience, BO is the most promising among them because it needs relatively few evaluations of $f$, it is a derivative-free method, and it is fairly robust to noisy objective function evaluations.

In summary, all the search methods for NN hyperparameter tuning share the same core, which we illustrate schematically in Fig. 18. First, we define which hyperparameters we want to optimize (e.g., the number of filters), their initial values (the seed), the search method we will use (e.g., BO), and the search space for each hyperparameter (e.g., the number of filters ranging from 5 to 350). Next, using this hyperparameter set, we proceed to the training and validation phase. To estimate the accuracy of the NN model efficiently, we can use the cross-validation method [157], which divides the dataset into $k$ sections, trains with $k-1$ sections, and tests with the remaining one to get the model accuracy. This process is repeated until all sections have been used for testing, and the average accuracy is calculated. This average accuracy is assigned to the set of hyperparameters and fed back to the search method, which uses it to suggest the next set of hyperparameters (or to stop the search). The search cycle ends when the whole space has been searched in the case of grid search, when a certain number of iterations has been performed in the case of random search, or when the method has converged in the case of the genetic algorithm, particle swarm optimization, or BO. Finally, when the cycle is finished, the hyperparameters with the best average accuracy are used to design the NN model.
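The search cycle described above can be sketched as follows, using random search as the search method. To keep the example self-contained and fast, we replace the NN training by ridge regression with a single hyperparameter (the penalty strength); this stand-in, the synthetic data, and all names are our own assumptions.

```python
import numpy as np

def k_fold_score(theta, X, y, k=5):
    """Average validation score (negative MSE) over k folds.

    Ridge regression with penalty theta is a cheap stand-in for training an NN.
    """
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr, ytr = X[train_idx], y[train_idx]
        w = np.linalg.solve(Xtr.T @ Xtr + theta * np.eye(X.shape[1]), Xtr.T @ ytr)
        scores.append(-np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(scores))

def random_search(X, y, n_iter=20, low=1e-4, high=1e2, seed=0):
    """Propose hyperparameters at random and keep the best average score."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(n_iter):
        theta = 10 ** rng.uniform(np.log10(low), np.log10(high))  # proposal
        history.append((theta, k_fold_score(theta, X, y)))         # feedback
    best = max(history, key=lambda item: item[1])
    return best, history

# Synthetic regression data (hypothetical).
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * data_rng.standard_normal(60)

best, history = random_search(X, y)
```

Replacing the random proposal line by a model-based suggestion that uses `history` turns the same loop into BO or another guided search; the cross-validation feedback remains unchanged.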

Figure 18. Hyperparameter tuning routine using optimization techniques.


Next, we will detail further how the BO algorithm functions, also pointing out its drawbacks.

3.2 BO Algorithm

The BO algorithm is based on two core principles. First, it builds a basic surrogate function $f^{*}$ to “fit” the objective $f$ and estimate its response to unknown entries $\theta$. Second, it bypasses the impossibility of using gradient descent methods on $f$ by introducing an acquisition function, i.e., a statistical operator that orients the optimum search.

Regarding the surrogate function, it can be understood as a function $f^{*} = p\left( f \middle| \mathcal{D} \right)$ that estimates the value of the objective function $f$ for arbitrary $\theta$, i.e., $f(\theta)$, conditioned on a limited set of $n$ observed data points, $\mathcal{D} = \{f(\theta_{1}), f(\theta_{2}), \ldots, f(\theta_{n})\}$. To build $f^{*}$, the BO algorithm models $p(f|\mathcal{D})$ as a Gaussian process (GP), which permits the representation of the posterior distribution $p\left( f \middle| \mathcal{D} \right)$ at each $\theta$ by the normal distribution $\mathcal{N}(\mu,\sigma^{2})$, with the mean value $\mu$ and variance $\sigma^{2}$.
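A minimal sketch of such a GP surrogate for a single scalar hyperparameter is given below. We assume a zero-mean prior with a squared-exponential kernel of unit variance and a small jitter term for numerical stability; these choices, the function names, and the observed values are illustrative assumptions only.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of scalar hyperparameters."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(theta_obs, f_obs, theta_query, length_scale=1.0, jitter=1e-8):
    """Posterior mean mu and standard deviation sigma of the surrogate f*."""
    K = sq_exp_kernel(theta_obs, theta_obs, length_scale) + jitter * np.eye(len(theta_obs))
    K_s = sq_exp_kernel(theta_obs, theta_query, length_scale)
    mu = K_s.T @ np.linalg.solve(K, f_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)   # prior variance is 1
    return mu, np.sqrt(np.clip(var, 0.0, None))

# Hypothetical observations D of the objective at three hyperparameter values.
theta_obs = np.array([0.0, 1.0, 2.5])
f_obs = np.array([0.2, 0.8, 0.5])
theta_query = np.array([0.0, 1.0, 2.5, 50.0])
mu, sigma = gp_posterior(theta_obs, f_obs, theta_query)
```

At the observed points the surrogate reproduces the data with almost no uncertainty, whereas far from any observation it falls back to the prior (zero mean, unit standard deviation). This uncertainty is what the acquisition function exploits.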

Acquisition functions are crucial to the BO scheme: they are used to choose the next vector of hyperparameters as the one that has the highest probability of improvement over the current state. In a nutshell, the acquisition function $a$ can be evaluated for any arbitrary hyperparameter input $\theta$, and it quantifies how promising a candidate sampling decision $\theta _{n + 1}$ is for locating the global optimum. By maximizing the acquisition function, i.e., $\theta _{n + 1} = \arg\max_{\theta} a(\theta )$, to select the next numerical evaluation $f(\theta _{n + 1})$, we merely substitute our initial optimization problem with another one, but now with a function that is much cheaper to evaluate. A common choice for the acquisition function is the expected improvement (EI), computed as [152]

(31)$$a(\theta) = [\mu(\theta) -f^+]\Phi(Z)+ \sigma(\theta)\phi(Z),$$

where $f^+ = \max(\mathcal{D})$ and $Z = \frac{\mu(\theta) - f^+}{\sigma(\theta)}$ if $\sigma(\theta) > 0$ or $Z = 0$ if $\sigma(\theta) = 0$. The functions $\Phi$ and $\phi$ correspond to the cumulative and probability density functions of the standard normal distribution $\mathcal{N}(0,1)$, respectively. Since $a(\theta)$ can be analytically expressed as a function of $\mu(\theta)$, $\sigma(\theta)$, and $f^+$, which are directly obtained from the surrogate function $f^{*}$, the sampling point $\theta _{n + 1}$ is easily found by numerically evaluating $a(\theta)$ for all $\theta$ in the search space.
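The EI can be evaluated with the standard library alone, as in the sketch below, where $\Phi$ and $\phi$ are the standard normal cumulative and probability density functions defined above. The function name is ours, and the case $\sigma = 0$ follows the convention $Z = 0$ given in the text.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """a(theta) = (mu - f+) * Phi(Z) + sigma * phi(Z), with Z = (mu - f+) / sigma."""
    z = (mu - f_best) / sigma if sigma > 0 else 0.0
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))            # Phi(Z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)     # phi(Z)
    return (mu - f_best) * cdf + sigma * pdf

# Same predicted mean, different uncertainty (hypothetical values).
ei_certain = expected_improvement(mu=0.5, sigma=1.0, f_best=0.5)
ei_uncertain = expected_improvement(mu=0.5, sigma=2.0, f_best=0.5)
```

For equal predicted means, the point with the larger uncertainty receives the larger EI, which is how the acquisition function balances exploiting good regions against exploring unknown ones.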

To summarize, the BO algorithm can be defined by the scheme in Fig. 19. First, the set $\mathcal{D}$ is initialized by sampling $f$ with an initial hyperparameter set $\theta _1$. It should be noted that this sampling can be performed either randomly, when no previous information is known about $f$, or deterministically, when there is some indication about the optimum of $f$. Here $f$ is defined by the same training and evaluation phases described previously. Then, the BO is programmed to run until a maximum number of iterations is reached. In the $i$th iteration, the surrogate function $f^{*}$ is computed, i.e., $\mu$ and $\sigma ^{2}$ are calculated, and these are used to maximize an acquisition function $a$, which provides a new sampling decision $\theta _{n + i}$. Finally, the sampling decision is evaluated as $f\left ( \theta _{n + i} \right )$ and incorporated into $\mathcal {D}$ before a new cycle starts. When this iterative process ends, the hyperparameter $\theta$ that yields the maximum $f(\theta )$ in $\mathcal {D}$ is selected as the optimal solution $\theta _{\text {opt}}$.
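The complete loop can be sketched for a single scalar hyperparameter as follows. Everything specific in this example is an assumption made for illustration: the toy objective standing in for "train and validate the NN", the squared-exponential GP with a constant prior mean equal to the average of the observations, the jitter value, and the grid over which the acquisition function is maximized.

```python
import math
import numpy as np

def objective(theta):
    # Hypothetical stand-in for training the NN with theta and returning its accuracy.
    return -(theta - 2.0) ** 2

def kernel(a, b, ell=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def surrogate(theta_obs, f_obs, grid, jitter=1e-6):
    """GP posterior mean and standard deviation on the grid."""
    K = kernel(theta_obs, theta_obs) + jitter * np.eye(len(theta_obs))
    K_s = kernel(theta_obs, grid)
    m = f_obs.mean()                                   # constant prior mean (assumption)
    mu = m + K_s.T @ np.linalg.solve(K, f_obs - m)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

_erf = np.vectorize(math.erf)

def expected_improvement(mu, sigma, f_best):
    safe = np.where(sigma > 0, sigma, 1.0)
    z = np.where(sigma > 0, (mu - f_best) / safe, 0.0)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - f_best) * cdf + sigma * pdf

def bayesian_optimization(grid, theta_init, n_iter=15):
    theta_obs = np.array(theta_init, dtype=float)      # initialize D
    f_obs = objective(theta_obs)
    for _ in range(n_iter):
        mu, sigma = surrogate(theta_obs, f_obs, grid)          # compute f*
        a = expected_improvement(mu, sigma, f_obs.max())       # acquisition function
        theta_next = grid[np.argmax(a)]                        # new sampling decision
        theta_obs = np.append(theta_obs, theta_next)           # evaluate and add to D
        f_obs = np.append(f_obs, objective(theta_next))
    best = int(np.argmax(f_obs))
    return theta_obs[best], f_obs[best], theta_obs

grid = np.linspace(0.0, 5.0, 501)
theta_opt, f_opt, visited = bayesian_optimization(grid, theta_init=[0.5, 4.5])
```

Because the returned solution is the best element of $\mathcal{D}$, it can never be worse than the initial samples; how quickly it approaches the true optimum depends on the kernel and on the number of iterations.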

Figure 19. BO flow chart.


Finally, it is important to highlight some drawbacks of BO. BO is restricted to problems of moderate dimension. The difficulty is fundamental: to ensure that a global optimum is found, we require good coverage of the search space of $\theta$, but as the dimensionality increases, the number of evaluations needed to cover this space increases exponentially [158]. We therefore recommend that the number of hyperparameters be less than 20, even though other works in the literature show that, depending on the problem, BO still offers some advantages when tuning up to 76 parameters [159]. Moreover, unless the cost function evaluation is rather costly and the dimensionality of the problem is somewhat small, BO will tend to produce the same performance as the random search [160].
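The exponential growth mentioned above is easy to quantify. As a simple illustration, assuming a uniform grid with the same number of points along every hyperparameter (the function name is ours), the number of evaluations needed for full coverage is:

```python
def grid_evaluations(points_per_dim, n_hyperparameters):
    """Number of evaluations needed to cover a full grid of the search space."""
    return points_per_dim ** n_hyperparameters

two_dims = grid_evaluations(10, 2)
twenty_dims = grid_evaluations(10, 20)
```

With only 10 candidate values per hyperparameter, two hyperparameters require 100 evaluations, whereas 20 hyperparameters would require $10^{20}$, far beyond the few hundred evaluations that are typically affordable.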

3.3 Reinforcement Learning

RL has emerged as a promising technique for optimizing the hyperparameters of NN structures in various domains. RL leverages an agent–environment interaction paradigm to learn an optimal policy that maximizes a cumulative reward signal. The utilization of RL in finding the best hyperparameters of a NN structure involves the formulation of the problem as a Markov decision process (MDP). In this setting, the NN structure is considered the agent, while the selection of hyperparameters constitutes the action space. The environment provides feedback to the agent in the form of rewards, typically based on performance metrics such as accuracy, loss, or other domain-specific objectives. One popular approach is using deep $Q$-networks (DQNs), which combine deep learning with $Q$-learning. In this domain, [161] is a seminal paper, which presents a meta-modeling approach utilizing RL to generate customized CNN designs for various image classification tasks. In this approach, a common set of hyperparameters is employed to train all network topologies during the $Q$-learning phase. Subsequently, the hyperparameters are fine-tuned for the top models selected by the meta Q-agent. Another example can be found in [162], where they have built upon the aforementioned work by employing $Q$-learning to define learning agents per layer.

This approach partitions the design space into independent, smaller design subspaces, wherein each agent fine-tunes the hyperparameters of the assigned layer based on a global reward. This methodology aims to expedite the design space search while maintaining accuracy. Finally, moving to the optics field, Xu et al. [163] proposed a similar application of RL. However, instead of optimizing the hyperparameters of a specific NN architecture, the study employed RL to design an optimum Volterra nonlinear equalizer. The schematic of the architecture used is shown in Fig. 20. This approach utilized deep deterministic policy gradient (DDPG) agents to interact with the environment (the equalizer) and learn an effective search policy. The DDPG agent’s output actions represent the structural parameters of the Volterra equalizer, including memory length for each order, feedback memory length, and pruning rate. The reward from the environment is defined as a function of the BER after equalization and the complexity of the equalizer.
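To make the idea of RL-based architecture search more tangible, the sketch below uses plain tabular $Q$-learning (not a DQN or DDPG agent, and not the method of [161–163]) to pick the number of neurons in each of three layers. The state is the sequence of choices made so far, the action is the width of the next layer, and the reward, given only once the architecture is complete, is a hypothetical stand-in for validation accuracy.

```python
import math
from collections import defaultdict

import numpy as np

WIDTHS = [8, 16, 32, 64]      # hypothetical action set: neurons per layer
N_LAYERS = 3

def reward(arch):
    # Hypothetical stand-in for validation accuracy; best value 0 at (32, 16, 8).
    target = (32, 16, 8)
    return -sum(abs(math.log2(w) - math.log2(t)) for w, t in zip(arch, target))

def q_learning_search(episodes=2000, alpha=0.1, gamma=1.0, eps=0.2, seed=0):
    rng = np.random.default_rng(seed)
    q = defaultdict(lambda: np.zeros(len(WIDTHS)))   # Q[state][action]
    for _ in range(episodes):
        state, path = (), []
        for _ in range(N_LAYERS):                    # epsilon-greedy rollout
            if rng.random() < eps:
                a = int(rng.integers(len(WIDTHS)))
            else:
                a = int(np.argmax(q[state]))
            path.append((state, a))
            state = state + (a,)
        r = reward([WIDTHS[a] for a in state])       # reward only at the end
        for s, a in path:                            # Q-learning updates
            nxt = s + (a,)
            target = r if len(nxt) == N_LAYERS else gamma * np.max(q[nxt])
            q[s][a] += alpha * (target - q[s][a])
    state = ()
    for _ in range(N_LAYERS):                        # greedy read-out of the policy
        state = state + (int(np.argmax(q[state])),)
    return [WIDTHS[a] for a in state], q

best_arch, q_table = q_learning_search()
```

In a realistic setting, each call to `reward` would involve training and validating a network, which is why the works cited above share hyperparameters across candidate topologies or split the design space among several agents.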

Figure 20. RL framework for Volterra nonlinear equalizer. Schematics reproduced from [163].


4. Applications of Neural Networks in Different Photonics Areas

In this section, we discuss some photonic applications of NN structures introduced previously. We would like to reiterate that we do not aim here to present a comprehensive overview of all numerous important applications of artificial NNs in photonics, as illustrated by Fig. 21 (many of the missing areas can be found in recent review papers [13–26]). Instead, we often use several specific examples to illustrate how the NNs are used in these fields. Where appropriate, we try to identify when the complexity of NNs can be an issue in these particular applications and stress the key point of our work: a reduction in the complexity of NNs used in photonics. Though we will use optical communications as an example for the illustration of the complexity reduction methods, we point out that the majority of our results can be applied in various other areas.

Figure 21. Graphical depiction of the NN applications discussed in this tutorial.


4.1 Optical Communications: Channel Modeling

First, we address one of the NN applications in optical communications, which is acquiring more and more popularity today: the use of different NN structures for the simulations of signal propagation down the fiber-optic channel. Of course, most of the simulations related to the fiber systems analysis are still carried out using the well-elaborated, efficient, and accurate split-step Fourier method (SSFM) [164]. A more advanced version of this algorithm, with a memory filter within the nonlinear step, was proposed in [165,166], see also [167]. Nonetheless, sometimes the simulation of the optical fiber system’s functioning is a bottleneck, consuming too much time. We also note that, at present, the new (say NN-based) techniques for signal propagation modeling in fiber-optic systems are typically compared with an “ordinary” SSFM without memory filters. Thus, perhaps, it is too early to state that the NN methods are superior to the “traditional” approach, and more work in this direction is still required to make a fair comparison.

When dealing with the NN-based optical channel modeling, we naturally aim at obtaining a high-quality result (the output signal that has passed through the communication system and experienced the respective distortions) at a lower “complexity” cost, i.e., the NN can simply render the desired result faster. The latter becomes a significant bottleneck in, e.g., the simulations of a wideband signal propagation [168,169], or at high powers, when the spatial step of the SSFM has to be very small to guarantee a satisfactory modeling accuracy [170,171]. However, the existing NN-based modeling propositions do not address the ultra-wideband systems, mostly because of the novelty of the subject itself and, perhaps, because the data collection for wideband/extra-high powers is a truly time-consuming process. Thus, the advanced NNs application for the wideband simulations can be rated as an interesting and practically important open problem.

The first and somewhat obvious replacement of the channel function for our modeling task is, as mentioned previously, to recast SSFM as a learnable NN-type framework [172]. The linear SSFM step, i.e., the Fourier transform, is a vector–matrix convolution, while the nonlinear step amounts to the use of the operator $N$, see Fig. 22, applied to the result of the preceding convolution. Now, if we allow the elements of the matrices (the linear step) to be optimized by some training procedure, we arrive at the NN-type structure, where the network parameters can be optimized with some standard procedure used in NN training. The SSFM sequence of operations is virtually identical to the NN functioning, but with the important differences: (i) for the NN-type implementation, the weights are now considered as the parameters to optimize through training, and (ii) the nonlinear activation functions are now rendered by the mathematics behind the approximation approach, but not taken from some “standard” deep NN set (say, ReLU). We note that (ii) can potentially be a source of problems along the NN training, as the “non-standard” activation functions often result in the exploding/vanishing gradients’ problem, such that for employing this method, we typically need to initialize the trainable weights using the (assumed known) fiber propagation parameters participating in the SSFM. However, the neural approximation of the SSFM has so far been used for the so-called learned digital back-propagation concept [173,174], i.e., for the channel equalization purpose, and has not been tested specifically for the channel modeling. This approach can be ascribed to the so-called model- or physics-driven approaches, meaning that we utilize the mathematical model of the channel (in the form of its SSFM approximation) and recast it as a NN structure with the trainable weights.
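The correspondence between one SSFM step and one NN layer can be sketched as follows for a single-polarization signal. The sign convention of the dispersion operator, the normalized parameter values, and the function names are assumptions made for illustration; in the learned version, the frequency response `h` of each step would become a trainable quantity initialized from the fiber parameters.

```python
import numpy as np

def ssfm_step(a, omega, beta2, gamma, dz):
    """One split step: linear operator D followed by nonlinear operator N."""
    h = np.exp(0.5j * beta2 * omega ** 2 * dz)        # dispersion response ("weights")
    a = np.fft.ifft(h * np.fft.fft(a))                # linear step, applied in frequency
    return a * np.exp(1j * gamma * np.abs(a) ** 2 * dz)   # Kerr phase ("activation")

def propagate(a, omega, beta2, gamma, dz, n_steps):
    """Cascade of identical steps, i.e., a deep network with n_steps layers."""
    for _ in range(n_steps):
        a = ssfm_step(a, omega, beta2, gamma, dz)
    return a

# Normalized, hypothetical parameters and a Gaussian input pulse.
n, dt = 256, 0.1
t = (np.arange(n) - n / 2) * dt
omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)
a_in = np.exp(-t ** 2 / 2.0).astype(complex)
a_out = propagate(a_in, omega, beta2=-1.0, gamma=1.0, dz=0.01, n_steps=50)
```

Both operators only rotate phases (in the frequency and time domains, respectively), so the signal power is conserved from layer to layer in this lossless sketch.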

Figure 22. Correspondence between linear dispersion $D$ and nonlinearity $N$ operators, used within the SSFM to approximate the light evolution down the fiber, and the linear (vector–matrix convolution) and nonlinear (activation) transformation steps in a feed-forward NN structure.


Another group of methods that can be used for modeling the signal propagation down the fiber refers to the so-called physics-informed neural network (PINN) concept [175,176]; see also the comprehensive review of the method’s applications with some analysis in [177]. PINNs are NN structures that encode the governing equation of the problem, i.e., the Manakov equation for fiber-optic transmission systems, as a part of the NN; in other words, the scheme adopts the physical laws of the true channel model to parameterize its solution via the NN. PINNs approximate the equation solutions by training a NN to minimize a specific loss function that includes terms corresponding to the initial/boundary conditions and the equation’s residual at selected points in the space–time domain (called collocation points). After training, a PINN takes a point in the integration domain as input and returns the estimated solution of the differential equation at that point. The PINN concept’s application to channel modeling was studied by two groups [178,179]. Both studies agree that the PINN can be used to accurately model pulse evolution down the fibers with less complexity as compared with the SSFM-based modeling, also underlining the universality of the approach. An important observation is that while the SSFM step has to be reduced when we simulate high-power signals, the PINN method is insensitive to that and, thus, has the same complexity as for low-power signals. A tutorial on the application of PINNs in optical communications can be found in [180].
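To make the structure of the PINN loss explicit, consider a generic governing equation written as $\mathcal{F}[u](z,t)=0$ with the initial condition $u(0,t)=u_{0}(t)$ and boundary conditions $\mathcal{B}[u]=0$, and let $u_{w}(z,t)$ denote the NN approximation with trainable weights $w$. All of these symbols are introduced here only for illustration. A typical loss of the kind described above then reads

$$L(w)=\frac{1}{N_{0}}\sum_{j=1}^{N_{0}}\left|u_{w}(0,t_{j})-u_{0}(t_{j})\right|^{2}+\frac{1}{N_{b}}\sum_{k=1}^{N_{b}}\left|\mathcal{B}[u_{w}](z_{k},t_{k})\right|^{2}+\frac{1}{N_{r}}\sum_{m=1}^{N_{r}}\left|\mathcal{F}[u_{w}](z_{m},t_{m})\right|^{2},$$

where the three sums run over the points sampled on the initial line, on the boundary, and at the collocation points inside the domain, respectively. In practice, the three terms are often given different weights, and the derivatives entering $\mathcal{F}$ are obtained by automatic differentiation of the NN with respect to its inputs.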

An interesting approach for the NN-based channel transfer function modeling, introduced recently, uses GANs [181]. Yang et al. [181] claimed that the GAN-based method can indeed learn the accurate transfer function of the fiber channel well, and the approach can be extended to model the signal propagation to arbitrary distances. Importantly, the GANs show noticeable generalization capabilities, such that we can model the propagation with different optical launch powers, signal modulation formats, and input signal distributions. Comparing the complexity of GAN-based method to the SSFM modeling, the total multiplication number for the GAN modeling was found to be around 2% compared with that of a “standard” SSFM, which means a considerable reduction in the simulations’ complexity.

One more recently introduced approach for channel modeling is based on the concept of Fourier neural operator (FNO) [182]. The latter method belongs to the supervised operator learning methods family [183], a machine-learning framework proven to be efficient in modeling the evolution of spatiotemporal dynamical systems and approximating general black-box relationships between functional data [184]. The feature of the FSO, which makes it attractive for channel modeling, is that the FSO is mesh-independent, which is similar to the PINN but different from the standard deep learning methods such as CNN-type SSMF mentioned previously. Thus, the FNO network can be trained on one mesh and evaluated on another: by parameterizing the model in function space, it learns the continuous transfer function instead of discretized vectors, which is, of course, a highly desirable ingredient for the optical channel modeling, where we operate with randomly modulated signals with different characteristics. The test regarding the FSO utilization for the channel modeling was carried out in [185]. It was shown that the effective signal-to-noise ratio (SNR) differences between the proposed FNO and SSFM are all within 1 dB at 1200 km of the range of launch powers, and, importantly, the results rendered by the FSO modeling are also close to that obtained in the experiment. At the same time, the authors report the improvement in the complexity against standard SSFM simulations. It is interesting to note that the FNO method implies the increase in the dimensionality for the internal NN representation of the signal’s evolution, such that the whole structure for the NN channel looks like a so-called over-complete autoencoder. 
Notably, it is opposite to another interesting method used for the nonlinear Schrödinger equation modeling [125], which is quite similar to the optical channel modeling task: in the latter method, the parsimony principle is used, such that the channel model becomes a traditional autoencoder. This demonstrates that the universal recommendation on which particular modeling method we should use cannot be made at the moment. Perhaps, the only feature that the authors of the aforementioned works require from their NN optical channel analogs is that the resulting structure’s inference takes fewer operations than the modeling with SSFM. It would be, then, interesting to compare the existing approaches in different conditions and describe each method’s benefits and shortcomings.
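To make the mesh-independence property of the FNO more tangible, we give a minimal NumPy sketch of the spectral-multiplier part of a single Fourier layer. It is an illustration under our own assumptions, not the architecture of [182] or [185]: the function names, the number of retained modes, and the random "trained" multipliers are hypothetical, and a full FNO additionally contains channel mixing, a pointwise linear path, and nonlinear activations. The point is that the same set of multipliers can be applied to a signal sampled on a coarse or on a fine grid.

```python
import numpy as np

def spectral_layer(x, weights):
    """Multiply the lowest len(weights) Fourier modes of a periodic signal x.

    norm="forward" scales the forward FFT by 1/N, so the retained
    coefficients approximate Fourier-series coefficients that do not
    depend on the number of samples N.
    """
    n = x.shape[-1]
    k = len(weights)
    if k > n // 2:
        raise ValueError("number of retained modes must not exceed n // 2")
    spectrum = np.fft.rfft(x, norm="forward")
    out = np.zeros_like(spectrum)
    out[:k] = weights * spectrum[:k]  # all higher modes are truncated
    return np.fft.irfft(out, n=n, norm="forward")

def sample_signal(n):
    """Band-limited test signal sampled on a uniform grid of n points."""
    t = np.arange(n) / n
    return np.sin(2 * np.pi * t) + 0.5 * np.cos(2 * np.pi * 3 * t)

rng = np.random.default_rng(0)
# Hypothetical multipliers standing in for trained parameters.
weights = rng.normal(size=8) + 1j * rng.normal(size=8)

coarse = spectral_layer(sample_signal(64), weights)   # 64-point mesh
fine = spectral_layer(sample_signal(128), weights)    # 128-point mesh
```

For this band-limited input, every second sample of `fine` coincides with `coarse` up to floating-point error, i.e., the layer represents the same continuous function on both meshes.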

Yet another method for channel modeling refers to the use of transformers [186]: the authors specifically addressed the case of orthogonal frequency division multiplexing (OFDM) transmission. It was noted that the model-driven approaches could suffer in balancing accuracy and efficiency, especially for complex and long-haul transmission. The authors proposed a simplified transformer, combining it with a feature-decoupled distributed scheme for fast and accurate fiber channel modeling. The decoder part of the transformer was removed, and the self-attention was dropped out, as the latter contributes significantly to the inference complexity. The modeling performance was investigated, taking into account the generalization ability, and the method demonstrated high precision and robustness. Furthermore, the modeling was studied for different transmission rates and was proven reliable over a wide bandwidth. Compared with the BiLSTM, the transformer performed better in accuracy and had lower computational and memory costs. For models under the same conditions, the required running time of the transformer was about 60% of that of the BiLSTM, and less than 1% of that of the SSFM in the same scenario.

Finally, we turn to a very recent modeling approach, where the complexity reduction problem is posed in general but does not refer to the multiplications’ reduction compared with the SSFM technique: OptiDistillNet [187]. The approach uses a deep CNN to solve the nonlinear Schrödinger equation. Then, the so-called knowledge distillation (KD)-based framework for compressing a CNN is considered, which involves the original complex model as a teacher, the knowledge of which is used to train the reduced model called a student. By using the latter, we gain faster modeling, whereas the quality of modeling is very close to that of the original model.
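The teacher-student idea can be sketched with a generic regression-type distillation loss. We stress that this is an assumption made for illustration: the exact loss of [187] is not reproduced here, and the function name, the weighting parameter `alpha`, and the toy numbers are hypothetical. The student is penalized both for deviating from the ground truth and for deviating from the teacher's prediction.

```python
import numpy as np

def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Weighted sum of the MSE to the ground truth and the MSE to the teacher output.

    alpha = 1 ignores the teacher; alpha = 0 trains on the teacher output only.
    """
    student_out = np.asarray(student_out, dtype=float)
    hard = np.mean((student_out - np.asarray(target, dtype=float)) ** 2)
    soft = np.mean((student_out - np.asarray(teacher_out, dtype=float)) ** 2)
    return alpha * hard + (1.0 - alpha) * soft

# Toy values: ground truth, prediction of the large (teacher) model,
# and prediction of the compressed (student) model.
target = [1.0, -1.0]
teacher = [0.9, -1.1]
student = [0.8, -0.8]
loss = distillation_loss(student, teacher, target, alpha=0.5)
```

Here the ground-truth term equals 0.04 and the teacher term equals 0.05, so the combined loss is 0.045.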

Overall, as we see, while the modeling of the optical channel with the use of NNs is gaining increased attention, and we already have a plethora of different methods, the systematization of approaches and their head-to-head comparison are yet to be done; at the moment, it is difficult to single out the most promising direction in the simulations of signal propagation down the fiber. We also note that it would be interesting to investigate the existing approaches for the sake of their incorporation into the end-to-end impairments’ mitigation framework, Section 4.2c.

4.2 Optical Communications: Signal Processing for Impairments Equalization

It is widely accepted that we are approaching the capacity limits of the fiber-optic communications channels largely imposed by the nonlinearity-induced impairments, or, rather, by the interplay of nonlinearity with dispersion and noise [164,188]. Thus, the search for efficient nonlinearity-mitigation solutions (i.e., the channel equalization tools) in optical transmission lines continues to be one of the primary research topics in the optical communication community. Up to now, numerous DSP algorithms have been proposed and studied for the optical fiber channel equalization problem [189]. However, over the past few years, the “conventional” equalizers/soft-demappers have started to evolve toward designs incorporating machine-learning techniques. In general, various machine-learning-based approaches and, more specifically, the NN structures are rapidly finding their way into the telecommunication sector, both because of their ability to efficiently mitigate transmission- and device-induced impairments and because the considerable speed of optical transmission provides, in a short time, datasets large enough to train our models [14,15,17,20,172,190–197].

4.2a Post-Equalizers

Perhaps, the simplest and most straightforward NN-based concept to mitigate the signal distortions in optical fiber systems relies on the use of post-equalizers: at the receiver (Rx) side, we add the neural structure that has to revert the channel function and recover the transmitted information [12]. The post-equalizer means that in Fig. 23 we use the NN only after the fiber channel at the Rx side.

Figure 23. Flowchart depicting the variants of using the NNs in optical communications at a physical layer, including post-equalization, full NN-based DSP, pre-distortion, symbol-to-symbol, and bit-to-bit end-to-end systems. Note that the latter two require the surrogate optical channel to pass the gradients from the receiver back to the transmitter.

Even though the use of NNs for wireless transmission was considered already in 2003 [198], the implementation of NN-based equalizers in application to the optical transmission was first presented some 10 years later [190]. Already in that seemingly first publication on the subject, it was stated that the MLP rendered a better performance compared to the Volterra-series equalization, while the MLP itself was rated as a low-complexity method. Since then, this direction has flourished (the “early years” of the subject’s development are reviewed in [14,194]) and, at present, we have over a hundred papers addressing different aspects of the subject. First, we note that the channel equalization for the intensity-modulation direct-detection (IM-DD) systems [199–202] represents a simpler task compared with the coherent optical systems [12,193,203], as the dimensionality of output objects in the latter case is higher (i.e., the real numbers versus the complex ones). In particular, the reservoir computing-type approaches are relatively efficient in the IM-DD short-reach systems [204–207], while in the long-haul coherent communications, the capacity of the reservoir computing has been found insufficient [208], even though this subject could benefit from some further investigations. At the same time, it was demonstrated in [208] that the usage of the ESN for the post-equalization is tantamount to the MLP in terms of complexity: the interplay between complexity (for the neural equalizers’ structures obtained with the BO) and performance is given in Fig. 24.

Figure 24. Comparison of performance ($Q$-factor gain over chromatic dispersion compensation) for different NN post-equalization topologies as a function of the number of real multiplication per symbol (a single symbol output NN). The legend identifies the line types for MLP, BiLSTM, ESN, and combined architectures, CNN+biLSTM and CNN+MLP. The system: dual polarization, 34 GBd, 16 QAM, single-channel TrueWave Classic fiber, 9$\times$50 km propagation distance. Simulation results (the evaluation of the experimental transmission showed a very similar trend). Adapted from [208].

The next question to address when designing a NN equalizer is: which type of predictive modeling should we use to get the most from the equalizer? Freire et al. [209] analyzed this question, comparing the benefits and deficiencies of each predictive modeling type in the context of coherent optical channel equalization and soft symbol demapping. The issue here is that the datasets usually used for training the NN in transmission problems contain very few errors (especially when we use the NN after the Rx DSP chain). Therefore, when the NN is trained as a classifier, we may simply have too few data points that actually drive the training. The latter results in the infamous exploding/vanishing gradient problem, and the NN cannot train well enough. In contrast, when we use the regression task, each dataset point contributes to the training as we have a small continuous deviation of each constellation point’s location compared with its initial value, and the difficulties are alleviated. The features of using the NNs described in this paragraph are a specific peculiarity of the transmission post-equalization for almost every transmission system, and these should be accounted for when designing equalization techniques. Some special loss functions that work better specifically for the optical transmission tasks have also been proposed [210]. Together with this, we have to remember that the ultimate characteristic that we ought to improve when equalizing an optical system is the BER. However, the BER as a function of NN parameters is not differentiable, and, therefore, we make do with the MSE (or some other function of the difference between the true and predicted symbols) as a measure of prediction accuracy. 
Whence, we arrive at the mismatch between the actual “goal” of the equalizer and the result of the NN prediction, such that we need to check our results in terms of the achieved BER (or some other metrics, say the $Q$-factor, that are expressed through the BER), but the use of the MSE-type metrics, say, the effective SNR, can bring about misleading conclusions and wrongly working designs.
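The mismatch between the MSE and the BER can be illustrated with a toy BPSK example in plain Python. Everything in the sketch (function names, the hard-decision rule, the sample values) is our own illustrative assumption rather than the setup of [209]; the BER-to-$Q$ conversion uses the commonly adopted relation $Q = 20\log_{10}\big(\sqrt{2}\,\mathrm{erfc}^{-1}(2\,\mathrm{BER})\big)$, written here through the inverse Gaussian CDF.

```python
import math
from statistics import NormalDist

def mse(tx, rx):
    """Mean squared error between transmitted and equalized samples."""
    return sum((a - b) ** 2 for a, b in zip(tx, rx)) / len(tx)

def bpsk_decide(samples):
    """Hard decision for BPSK: the sign of the sample."""
    return [1 if s >= 0 else -1 for s in samples]

def bit_error_rate(tx_bits, rx_bits):
    """Fraction of positions where the decided bit differs from the sent one."""
    if len(tx_bits) != len(rx_bits) or not tx_bits:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(tx_bits, rx_bits)) / len(tx_bits)

def q_factor_db(ber):
    """Q-factor in dB; sqrt(2)*erfcinv(2*BER) equals -Phi^{-1}(BER)."""
    if not 0.0 < ber < 0.5:
        raise ValueError("BER must lie strictly between 0 and 0.5")
    return 20.0 * math.log10(-NormalDist().inv_cdf(ber))

tx = [1, -1, 1, -1]
out_a = [1.0, -1.0, 1.0, 0.1]    # three perfect samples, one wrong decision
out_b = [0.4, -0.4, 0.4, -0.4]   # all samples shrunk, no wrong decision

mse_a, mse_b = mse(tx, out_a), mse(tx, out_b)
ber_a = bit_error_rate(tx, bpsk_decide(out_a))
ber_b = bit_error_rate(tx, bpsk_decide(out_b))
```

Output A has the lower MSE (0.3025 against 0.36) but the higher BER (0.25 against 0), which is exactly the situation in which an MSE-type metric would favor the worse design. For reference, `q_factor_db(1e-3)` is approximately 9.8 dB.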

Now, we need to pay attention to how to structure the output of our NN. We note that the initial equalizers’ designs operated with the single-symbol recovery [190,203,208,211,212], such that the NN returned the predicted value of the real and imaginary symbol parts for the coherent transmission (or just one real number for the IM-DD setups). However, the newest trend now is to use the multi-symbol equalization: it was used in [213] to reduce the complexity of the overall post-processing and further assessed in [214] for the coherent systems and in [201,215] for the IM-DD short-reach systems. In particular, in [201,215], it was found that the multi-symbol output works well for both feed-forward and recurrent NN topologies. By increasing the number of NN output symbols, the number of sliding windows in equalization can be sharply reduced, so the complexity is also reduced. With this, more information is brought to the multi-symbol NNs during back-propagation, resulting in a better learning capability and, therefore, better overall performance of the structure. In particular, in [215], it was shown that the multi-symbol NN equalization in the short-reach IM-DD systems outperforms the single-symbol equalization, even though in the latter case, the task looks simpler. In contrast, in [216], no performance benefits were found (but even a slight degradation) when using a multi-symbol output for the long-haul coherent optical system. However, the direct detailed comparison of multi-symbol versus single-symbol equalization for the coherent long-haul system is yet to be carried out, so it can be an interesting subject for further research.
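The reduction in the number of sliding windows can be seen from a short plain-Python sketch. The windowing convention below (a block of `n_out` target symbols plus `memory` neighbors on each side, with zero padding at the edges) and all names are assumptions made for illustration; the cited works may use different conventions.

```python
def sliding_windows(symbols, memory, n_out):
    """Yield (input_window, target_block) pairs.

    Each window covers n_out target symbols plus `memory` neighbouring
    symbols on each side; the sequence edges are zero-padded.
    """
    n = len(symbols)
    pad = [0] * memory
    padded = pad + list(symbols) + pad
    for start in range(0, n, n_out):
        stop = min(start + n_out, n)
        yield padded[start:stop + 2 * memory], list(symbols[start:stop])

symbols = list(range(1, 13))  # 12 received symbols (toy integer labels)

single = list(sliding_windows(symbols, memory=2, n_out=1))
multi = list(sliding_windows(symbols, memory=2, n_out=4))
```

With these toy settings, single-symbol recovery needs 12 NN forward passes on windows of 5 symbols, whereas the 4-symbol output needs only 3 passes on windows of 8 symbols. Whether this translates into a net saving depends on how the cost of one pass grows with the input and output size.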

Next, when designing an equalizer, it is pertinent to think whether we wish to use some predefined structure (e.g., some black-box solution) and further optimize it [208,211,213,217], or we incorporate the elements of some known equalization techniques and/or recast that technique as a trainable/learnable approach. An example of the latter is the learned DBP [172,173,218,219], where we perform the back-propagation with the use of the SSFM-type architecture but allow the weights in the matrices (which used to be a Fourier convolution) to become trainable (a good analysis of the method was given in [220]). Another popular approach is to use the perturbation theory, but we now allow the perturbation parameters to be trainable [193,221–227]. One more interesting direction is to base the learnable approach on the Volterra series technique [228,229]. Finally, [203] proposed a special NN architecture combining a NN-based nonlinear step and NN additions, which also shows promising performance. No general recommendation on which of the two paths (a black-box approach or a trainable version of some existing method) to follow can be given, as both directions have their positive and negative features.
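The structure of a learned-DBP-type equalizer can be sketched as a cascade of SSFM-like steps in which the quantities that are fixed in conventional DBP become free parameters. The NumPy sketch below shows only the forward pass and rests on our own assumptions: the time-domain FIR form of the linear step, the sign and scaling of the nonlinear phase, and all names are illustrative and are not taken from [172,173,218,219]. In an actual implementation, `taps` and `gamma_eff` of every step would be optimized by gradient descent in an automatic-differentiation framework.

```python
import numpy as np

def dbp_step(field, taps, gamma_eff):
    """One back-propagation step: FIR linear filter, then a power-dependent phase rotation.

    taps      : complex FIR coefficients of the linear step (trainable in learned DBP)
    gamma_eff : effective nonlinear coefficient of the step (trainable in learned DBP)
    """
    filtered = np.convolve(field, taps, mode="same")
    return filtered * np.exp(-1j * gamma_eff * np.abs(filtered) ** 2)

def learned_dbp(field, steps):
    """Apply a cascade of (taps, gamma_eff) steps to the received field."""
    for taps, gamma_eff in steps:
        field = dbp_step(field, taps, gamma_eff)
    return field

rng = np.random.default_rng(1)
field = rng.normal(size=32) + 1j * rng.normal(size=32)  # toy received samples

# Sanity check: single-tap unit filters and zero nonlinearity leave the field unchanged.
identity_steps = [(np.array([1.0 + 0j]), 0.0)] * 3
unchanged = learned_dbp(field, identity_steps)

# The nonlinear step alone rotates the phase but preserves the sample magnitudes.
rotated = dbp_step(field, np.array([1.0 + 0j]), 0.3)
```

The two sanity checks illustrate why such a structure is convenient to initialize: starting from the taps and coefficients of a conventional DBP (or from an identity mapping), training only has to refine a physically meaningful solution.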

The following question addresses the particular NN architecture: whether we work with real numbers or adopt complex-valued NNs. While the latter path is more complicated, there have been considerable advances in the development of the complex-valued NN framework [230], and we can benefit from applying complex-valued NNs to optical channel equalization, see [203,231–234].

Now, let us turn to the selection of the particular NN type/topology. In general, many versatile NN-type structures can be used for equalization. The earlier studies incorporated the MLP [190,211,212], as it is the most studied structure. At present, a great many other structures have been assessed in the channel equalization context. First, as a natural extension of the feed-forward MLP, the CNN-type-based structures have been considered for both coherent and IM-DD systems [219,235]. However, we note that optical transmission setups typically feature essential memory effects. Here is the right place to recall that the RNN-type topology and its advanced modifications, such as GRU and LSTM, are specifically tailored to handle the memory. Therefore, recent studies have begun shifting increasingly toward the equalizers incorporating various recurrent structures [208,213,217,236], including advanced models with attention mechanism [237,238], and some combinations of recurrent and feed-forward NN parts [208,221,239], where we can expect to have benefits rendered by both topologies. Figure 24 shows the performance of different BO-optimized post-equalizing NN structures versus the complexity, i.e., the number of multiplications required to process one symbol, for a long-haul coherent system. The complexity was upper-bounded by setting the upper limits for the BO process. It can be seen that the LSTM-based recurrent structures typically outperform their feed-forward counterparts when we allow the complexity to be high. Interestingly, when we restrict the allowed complexity to lower values, a simple MLP can emerge as the most efficient solution. Thus, in general, we recommend testing several structures/topologies before deciding upon an ultimate design for the equalizer and then applying the complexity reduction procedures described in the next section.
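To give a feeling for the complexity axis used in such comparisons, the sketch below counts the real multiplications of one forward pass through a dense, real-valued MLP, using the fact that a fully connected layer with $n_{\rm in}$ inputs and $n_{\rm out}$ outputs requires $n_{\rm in} n_{\rm out}$ multiplications. The function names and the layer sizes are hypothetical and are not the configurations of [208]; biases and activation functions are not counted.

```python
def mlp_real_multiplications(layer_sizes):
    """Real multiplications of one forward pass through a dense real-valued MLP.

    layer_sizes lists the number of units per layer, input and output included.
    Bias additions and activation functions are not counted.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def multiplications_per_symbol(layer_sizes, symbols_per_pass=1):
    """Complexity normalized by the number of symbols recovered in one pass."""
    return mlp_real_multiplications(layer_sizes) / symbols_per_pass

# Hypothetical equalizer: a window of 21 symbols with 4 real features each
# (I/Q of two polarizations), two hidden layers, and a 2-value output
# (real and imaginary parts of one recovered symbol).
example_layers = [21 * 4, 100, 50, 2]
example_cost = mlp_real_multiplications(example_layers)
```

For this hypothetical configuration, the count is $84\times100 + 100\times50 + 50\times2 = 13{,}500$ real multiplications per recovered symbol.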

Finally, we note that positioning our NN after the “traditional” Rx DSP chain is not the only option. It seems more efficient to replace/impute the whole DSP chain at the Rx side with the NN elements. The latter design was assessed in [240]: the authors underline the interpretability of the resulting design, a truly important feature when we want to understand why we have some specific behavior/functioning problems of the NN setup.

To end this section, we mention that various difficulties emerging in the design and training of NN-based post-equalizers in coherent optical systems are amply described in [12]; moreover, some pitfalls depicted in that paper are generally pertinent to the NN usage in communications, not only to the NN equalizers.

4.2b Pre-Distortion

Pre-distortion, i.e., the pre-compensation of the signal’s distortion at the transmitter (Tx) via special digital pre-processing of the symbol sequence, is another popular way of combating channel nonlinearities: in optical communications, the pre-distortion is typically based on the aforementioned Volterra series approach [241]. The application of learning approaches to the digital signal pre-distortion is, actually, not a new subject in general [242,243]. Over the recent years, a number of digital pre-distortion techniques based on the NNs’ utilization have been proposed for wireless systems [244,245]. However, the pre-distortion techniques’ demonstrations for coherent optical systems are still relatively few and far between. There are methods efficient for memoryless digital pre-distortion of a Mach–Zehnder modulator [246] and low-resolution digital-to-analog converter [247]. A Wiener–Hammerstein model-based approach was proposed in [248]. One of the most remarkable results demonstrating the efficiency of the NN-based pre-distortion in coherent optical links was given in [249]. The implementation of the pre-distortion based on the deep MLP-like NN led to the record transmission rates for single-channel [249,250], and multi-channel dense-wavelength-division-multiplexed (DWDM) transmission [251] over 80 km single-mode fiber systems. The recent results regarding the usage of the NN-based pre-distortion approach for IM-DD systems can be found in [252], and those for coherent systems using RNNs in [253].

4.2c End-to-End Equalization of Optical Systems

The methods in the previous two subsections, incorporating the NN structures for the channel equalization, referred to the Rx (post-equalization) or Tx (pre-distortion) transmission system parts only. Meanwhile, a more “omnidirectional” approach incorporating the able-to-learn elements into a communication system can be based on the so-called end-to-end (E2E) learning concept [254]. We notice that the E2E learning concept, though fitting well the communications-related problems, is a very general multipurpose method [255], very efficient, e.g., for such a famous but seemingly irrelevant problem as autonomous car driving [256]. The E2E learning can be formulated as the method involving training a (often very complex multi-component) learning system represented by a single model (typically a NN) that represents the complete target system, where each NN part (that can be just a layer or a complex ensemble of layers) can specialize in performing intermediate tasks. Thus, returning to optical communications, we need the whole optical communication link to be modeled as a NN. In this respect, we can span the modeling from initial bits entering into our system down to received identified bits [195,257,258], or, potentially, assume that we model our system from symbols to symbols [259,260], see Fig. 23. To recast our system as a NN, we now need some differentiable model emulating the signal propagation down the fiber (it can be, e.g., some NN structure described in Section 4.1), and the DSP elements at both Tx and Rx ends can now be represented as NNs with trainable parameters. Importantly, the parameters of both Tx and Rx NN blocks are now optimized simultaneously using standard NN training, using the fact that we can efficiently pass the gradients through the optical link model from the receiver back to the transmitter; alternatively, some advanced gradient-free methods can be used [261], but this direction has not so far been adopted and elaborated in optical transmission. 
Noticeably, the benefits of using E2E setups are evident for the systems, where the optimal DSP solutions are not known. We also note that the concept of E2E systems is conceptually similar to the contractive autoencoder architecture described previously.
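The joint optimization can be written compactly as follows. The notation is introduced here for illustration only and is not taken from the cited works: $f_T(\cdot;\theta_T)$ and $f_R(\cdot;\theta_R)$ denote the Tx and Rx NNs with trainable parameters $\theta_T$ and $\theta_R$, $C(\cdot)$ is the differentiable (surrogate) channel model, $m$ is the transmitted message, $x = f_T(m;\theta_T)$, $y = C(x)$, $\hat{m} = f_R(y;\theta_R)$, and $\mathcal{L}$ is the chosen loss.

```latex
\min_{\theta_T,\,\theta_R}\;
\mathbb{E}_{m}\!\left[\,\mathcal{L}\big(m,\; f_R\big(C(f_T(m;\theta_T));\,\theta_R\big)\big)\right],
\qquad
\frac{\partial \mathcal{L}}{\partial \theta_T}
= \frac{\partial \mathcal{L}}{\partial \hat{m}}\,
  \frac{\partial f_R}{\partial y}\,
  \frac{\partial C}{\partial x}\,
  \frac{\partial f_T}{\partial \theta_T}.
```

The factor $\partial C/\partial x$ in the chain rule is the reason why the channel emulator has to be differentiable: without it, the loss computed at the receiver cannot be used to update the transmitter parameters by gradient descent, and one has to resort to gradient-free alternatives.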

The initial works related to the E2E application referred to the short-haul IM-DD systems, where the fiber nonlinearity was non-essential, so the fiber propagation model was linear and simple. In [195,257,262,263], the E2E learning of geometric constellation shaping (GS), i.e., of the optimal symbol locations, was studied for IM-DD optical communication systems: it was shown that the E2E methodology resulted in an essential performance gain. Developing the approach further, in [264], the E2E learning of waveforms was addressed for a very special nonlinear frequency division multiplexing (NFDM) optical communication system, see Section 4.2e. In [265–270], the E2E learning of single-symbol GS was considered for a more complicated coherent communication system. In particular, in [265–269], only the optical channel-related distortions were taken into account, while in [270] a more realistic link model, which also included the local oscillator laser noise, was studied. In [241,249–251,271–273], the E2E learning of GS, signal waveform, and nonlinear pre-distortion resistant to transmitter distortions was considered. However, the distortions emanating from the signal’s propagation down the fiber were either neglected completely in these works or modeled via a simplified Gaussian noise model. Reference [274] addressed the joint E2E learning of GS and a linear pre-distorter mitigating the fiber channel distortion. However, the learned linear pre-distorter’s contribution to the nonlinearity mitigation is questionable. Finally, in [260], the E2E learning of the constellation shaping for a single-channel dual-polarized 64 GBd transmission over 170 km standard single-mode fiber link, which takes into account the nonlinearities and optical channel memory, was proposed and studied. 
With the new method, it became possible to jointly optimize symbol locations in the constellation diagram, the symbol probabilities, and the nonlinear pre-distortion: the learned transmitted signal distribution chooses the transmitted symbol based not only on the message sent in the corresponding time slot as in the conventional constellation shaping but also on the messages sent in the neighboring time slots. The distinguishing feature of the approach in [260] is that a relatively accurate auxiliary (surrogate) channel model based on perturbation theory was used. With this, the training procedure for the simultaneous learning of symbol probabilities [275] was implemented.

Overall, the E2E learning application, especially in coherent optical systems with the account of all distortion types, is truly a nascent subject. Some initial results obtained to date suggest that this direction can be really fruitful. At the same time, we ought to recognize the problems associated with the E2E system’s development. First, it is a method that is very difficult to implement in experimental conditions: we need to understand which changes the loss/cost metrics alterations induce in our experimental transmission setup. The latter can be quite demanding, as we typically do not know a priori the number of runs that we have to spend to achieve some desired performance values, and we are also unaware of the “level” (or type) of the system’s alterations that the training would result in; some of the latter can be technically unfeasible altogether. Together with this, we have to be accurate in designing the multi-modular E2E systems, as the high complexity of the NN structure may imply that a considerable amount of data and training runs are needed to have an acceptable result, accompanied by numerous problems characteristic of very deep NN training. In addition, the generalizability of E2E optical communication systems has to be investigated in more detail, even though we can attain some flexibility by a specially designed training [276]. Thus, we can think of the E2E systems as a high-complexity, (potentially) high-reward direction, where the machine-learning-related problems may intertwine with those pertinent to optical transmission, thus requiring a researcher to apply different mitigation methods and where a lot of difficulties are yet to be alleviated.

4.2d Free Space Optical Systems

One of the directions that have recently started to attract more attention, including the studies of NN-based equalization, is FSO communications. FSO systems can provide a considerable unlicensed bandwidth for data transmission at more than 100 Gb/s [277], reach extremely long distances, being secure and robust to electromagnetic interference [278] and atmospheric turbulence [279].

As for the NN applications in FSO systems, a good systematization of different results is given in [280]. This reference classifies the NN methods used to compensate the FSO transmission impairments into three categories: “classical machine-learning-based methods,” which include either non-NN- or shallow NN-based approaches [281,282], the approaches involving the CNNs [283–286], and deep NN-based methods (though CNN structures can also be classified as “deep”) [287]. Overall, whereas the application of the NNs to mitigate the detrimental impact in FSO is gaining momentum, the studies related to the complexity reduction of the processing are still relatively scarce, aside, perhaps, from the aforementioned [280], where the authors compare the complexity of their method against some existing signal-processing solutions.

4.2e Nonlinear Fourier-Transform-Based Fiber Systems

The nonlinear Fourier-transform (NFT)-based optical signal-processing and modulation techniques, and, in particular, the NFDM as the most efficient method among the NFT-based optical transmission systems, have been intensively studied over the last years [288–291]. Within the NFDM systems, the data modulation and transmission take place inside the special nonlinear Fourier (NF) domain, where the nonlinear intermodal cross talk (arising due to the Kerr effect) between the effective “nonlinear modes” is virtually absent [289]. Even though, theoretically, the signal’s propagation in NFDM systems is unaffected by the fiber nonlinearity provided that the signal’s evolution is well approximated by an integrable evolutionary equation, in real systems, the deviation of the channel model from the idealized integrable equation results in modes’ coupling. Therefore, we arrive at the mismatch between the NFT-based processing and the channel. However, the system’s performance can be improved by using highly adaptive NNs instead of “deterministic” NFT operations.

The first direction in employing the NNs for improving the functioning of NFDM systems consists in applying the additional NN-based processing unit at Rx to compensate the emerging line impairments and deviations from the ideal model [292–299]: it can be deemed as the extension of the post-equalization concept, Section 4.2a. However, despite the ensuing improvement in the transmission quality, this type of NN usage brings about the additional complexity of the receiver. At the same time, the complexity reduction for the NFT operations has been a subject of active research, and it is undesirable to raise it further by adding more processing units. In the more viable alternative approach, the NFT operation at the receiver is entirely replaced by the NN element [300]. It has been shown that this approach, indeed, results in a noticeable improvement of the NFT-based transmission system functioning [300–302]. At the initial stage of research, the NNs emulating the NFT operation were used in the NFDM systems operating with solitons only (the study of such systems is also actively developing [303–305]). However, we note that the most efficient NFDM systems developed so far operate with the continuous nonlinear Fourier spectrum [306–312]. In the first work related to the communication system based on the continuous nonlinear Fourier spectrum [313], a standard “imageInputLayer” NN (developed originally for handwritten digits’ recognition) from MATLAB 2019a deep learning toolbox was adapted to process and demap the data. However, such an approach utilizes the classification task, which can bring about difficulties [209]. The problem of the NN-based nonlinear Fourier spectrum recovery using the regression task was considered by Sedov et al. [314]: a special CNN-type structure coined NFT-Net was proposed there. Further, it was shown that such a structure is reversible, i.e., it can be used for the inverse NFT computation [315]. This direction was extended further in [316,317]. 
In [316] two CNN-type structures were analyzed for directly decoding NFDM data: a small serial network scheme was designed for small user applications, and a parallel network scheme with high speed was designed for high data rates. Importantly, the questions regarding the complexity of NN signal processing were addressed. In [317], a diffractive fiber-based NN was proposed to discern the NFDM symbols. That NN was composed of multiple cascaded dispersive elements and phase modulators. An all-optical back-propagation algorithm was used to optimize the phase. The fiber-based time domain NN structure acts as a powerful tool for signal conversion and recognition, and such a structure can be used to recognize the symbols all optically, which can allow us to replace the NFT processing with much simpler and even system-agnostic operations.

Finally, we mention [264], where the end-to-end optimization was used for the NFDM system based on solitons. A very efficient recent NFDM system was presented in [318]: using the NN equalization, the authors experimentally demonstrated a 25-channel NFDM system with polarization division multiplexing 16-QAM modulation, transmitting over 10 Tb/s for the 800 km distance.

4.3 Optical Communications: Network Layer

The idea of adding intelligence to optical networks to make network operations easier and boost network performance is rapidly gaining traction in both research and industry. The optical network is a key point in the worldwide infrastructure for communications since it bridges the gap between higher-level services and the underlying physical infrastructure by allocating resources such as links, wavelengths, spectrum slots, fiber cores, and time slots. Owing to this complexity, optical networks are more difficult to operate and maintain than other types of communication networks. As optical networks grow more complex, manual optimization can take too long and lead to suboptimal results. Therefore, machine learning, and more specifically NNs, are used more and more nowadays, allowing for better, faster optimization decisions to be made.

In Fig. 25, we have summarized the three main areas in which the NNs have been successfully applied in optical networks. Those areas are network planning, optical monitoring, and failure management [17,319,320].

Figure 25. Three major directions of using the NNs in optical networks.

When dealing with network planning, the NNs have been deployed for two main purposes: traffic prediction and solving the routing, modulation level, and spectrum assignment (RMLSA) problem. Predicting the bandwidth requirement in the next time step based on real-time measurement of traffic characteristics is one of the key challenges in improving the efficiency of network operation. The purpose of using the NN models is to predict future traffic rate variations as accurately as possible based on historical data. In this case, the NN input is the history of requests per node of the optical network (past traffic data), so the NN can forecast future traffic demands. RNNs, such as GRU and LSTM, are among the NN structures that have been efficiently used for traffic prediction because of their ability to adaptively capture the dependencies on different time scales. In [321,322], the GRU is used for making the traffic matrix forecasts for both a fixed-grid WDM network and a backbone elastic optical network (EON). In addition, [323,324] studied traffic prediction in passive optical networks and core networks using LSTM models. For the RMLSA problem, the NN has lately emerged as an alternative to standard methods such as integer linear programming, heuristics such as simulated annealing, $k$-shortest path routing, first fit, and genetic algorithms. Generally speaking, the NN is capable of efficiently learning the network and physical layer aspects by having information on the network properties, fiber properties, and user requests. Then, it can provide the optimal routes, launch powers, symbol rates, modulation formats, and spectrum assignments per request to minimize network blockage and maximize spectrum utilization.
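The way the per-node traffic history is turned into a supervised forecasting task can be sketched in a few lines of plain Python. The function name, the history length, and the toy traffic values are hypothetical and are not taken from [321–324]; the resulting pairs are what a recurrent model such as a GRU or an LSTM would be trained on.

```python
def make_forecast_pairs(traffic, history):
    """Build (past window, next step) training pairs from a traffic record.

    traffic : list of time steps, each a list with one traffic value per node
    history : number of past time steps fed to the NN
    """
    if history < 1 or len(traffic) <= history:
        raise ValueError("need more time steps than the history length")
    steps = range(history, len(traffic))
    inputs = [traffic[t - history:t] for t in steps]
    targets = [traffic[t] for t in steps]
    return inputs, targets

# Toy record: 5 time steps of requested traffic for a 2-node network.
traffic = [[10, 4], [12, 5], [11, 7], [15, 6], [14, 8]]
inputs, targets = make_forecast_pairs(traffic, history=3)
```

With a history of three time steps, the toy record yields two training pairs: the first three steps predict `[15, 6]`, and steps two to four predict `[14, 8]`.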

Next, we address the optical performance monitoring (OPM) category. OPM is crucial to guaranteeing a stable network, as even a momentary disruption in service due to faulty fiber or equipment can cause widespread packet loss. Parameters such as optical signal-to-noise ratio (OSNR), chromatic dispersion (CD), polarization mode dispersion (PMD), polarization-dependent loss (PDL), optical power (OP), and fiber nonlinearity (FN) are of primary interest to OPM. In this sense, the NN is used to estimate such parameters when only the received optical signal is available in the form of eye diagrams, sampled signals, I/Q components, and other network and light-path features. Here we highlight that in [325–331], the artificial NN was used to estimate fiber properties of an optical link. Further, in [332–336] the NNs were used to estimate BER and/or OSNR. We specifically highlight recent work [332], where the authors experimentally demonstrated that an NN-based quality of transmission (QoT) estimation, where the NN was trained on synthetic QoT data, could successfully estimate the SNR on a live optical network.

The last important aspect of the NN application at a network layer is failure management in optical networks. Failure management’s goals are to find and fix any network problems that arise, keep everything running well, and live up to the service level agreement with customers. However, the standard approach to failure management still necessitates laborious and lengthy human involvement. Machine-learning approaches have been extensively applied to the aforementioned problems in an effort to push failure management in the direction of intelligence and efficiency. First, in failure prediction, deep CNN structures [337] were used to estimate the bend location of remote fiber by using the information of the constellation data from the receiver. An important direction concerns the prediction of failure in optical transport network (OTN) boards. Using the historical data of the operating state parameters from OTN equipment, in [338] a biGRU model and in [339] an attention mechanism-driven LSTM model were proposed for temporal data-driven failure prediction and prognostics. For other optical equipment such as lasers, Abdelli et al. [340] used a deep NN to predict the mean time to failure of a laser by having as the input just the laser monitored parameters. Finally, in failure localization, the NN receives the alarm log of the system, and an LSTM [341], attention/transformers [342], or other types of NN [343] can produce an alarm root cause analysis-enabled failure location.

4.4 Optical Sensing

There is a great variety of optical sensors, but they all use light to detect, measure, and convert magnitudes from any domain to an optical signal, first, and then to an electrical one. These domains include temperature, pressure, stress, displacement, strain, liquid level, vibration, rotation, velocity, acceleration, electric, magnetic and acoustic fields, force, pH value, chemical species, humidity, and many others. As a result, optical sensors are used in a vast range of applications, from structural health monitoring and seismic measurement to the medicine and food industry, the oil and gas industry, power line monitoring, smart city applications, and many others. NNs are used in optical sensing for classification tasks and for improving both the accuracy and speed of raw data processing in applications varying from distributed strain sensing to biochemical optical sensing, see [133,344–353] and references therein. The field of using the NNs in optical sensing is too large to discuss in one section, so we do not aim here to cover all optical sensing applications, but rather give several typical examples of using NNs in this field and provide references for further reading.

The NN-based approach to analyzing optical fiber sensors’ signals and the applications of NNs for fiber sensor signal interpretation are reviewed in [344]. The applications of NNs for pH monitoring using fiber optic sensors are discussed in [348,351]. Deep NN for the predictions of the resonance spectra of plasmonic sensors is considered in [349]. Non-invasive glucose monitoring using optical sensors and machine-learning techniques for diabetes applications are discussed in [350], where light sources with multiple wavelengths were used to enhance the sensitivity and selectivity of glucose detection in an aqueous solution. Machine-learning techniques are employed in optical sensors to increase accuracy and noise resilience.

Although there are a huge number of examples of using NNs in optical sensing, not all these applications employ high-complexity NNs. In many cases (e.g., as in [344,348,350,351]) the NNs used are simple feed-forward structures. These are often applied as a simple replacement for signal-processing techniques and sensing data interpretation, as shown in Fig. 26. These low-complexity NNs are basically used as a black box trained to perform denoising/approximation and consist of just a few hidden layers. As a particular example of this approach, in [351] the operation of an optical fiber pH sensor measuring the reflectance spectra of the immobilized bromophenol blue was enhanced using a relatively simple feed-forward NN. The input layer consists of six neurons, corresponding to the reflectance intensities measured at six different wavelengths from each spectrum, the output layer is a single neuron corresponding to variable pH values, and 11 neurons in the hidden layer were enough to improve the dynamic response of the pH sensor from (pH 2.00–5.00) to (pH 2.00–12.00). In [352], an MLP trained via supervised learning with the Levenberg–Marquardt algorithm was applied to reduce the errors obtained by the matrix method in an optical fiber Bragg grating based on the simultaneous measurements of strain and temperature.
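As an illustration of how small such a network is, the sketch below implements the forward pass of a feed-forward NN with six inputs, 11 hidden neurons, and one output, matching the layer sizes quoted above for the pH sensor of [351]. The `tanh` activation, the random initial weights, and the input intensities are assumptions made only for this example; the network is untrained, so its output has no physical meaning.

```python
import math
import random


def init_mlp(n_in=6, n_hidden=11, n_out=1, seed=0):
    """Random (untrained) weights for a one-hidden-layer MLP; sizes follow the text."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [0.0] * n_out
    return w1, b1, w2, b2


def forward(params, x):
    """Hidden layer with tanh (an assumed activation), then a linear output layer."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


params = init_mlp()
# Hypothetical reflectance intensities at six wavelengths (illustrative values).
reflectances = [0.42, 0.38, 0.35, 0.31, 0.27, 0.22]
ph_estimate = forward(params, reflectances)[0]
```

After supervised training on spectra with known pH values, `ph_estimate` would play the role of the single output neuron described above.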

Figure 26. Typical low-complexity MLP for signal processing in sensing applications. A series of optical spectra are processed to obtain a time series of the measured parameter.


More complex NNs are also actively used in a number of optical sensing applications. For instance, deep NNs are used for denoising [354] and event recognition [355]. Deep NNs have also been employed to simulate complex sensor behavior and avoid time-consuming precise modeling of the response of the plasmonic sensor using the finite-difference time-domain (FDTD) method or the finite element method (FEM) [349]. In distributed fiber sensors, static or dynamic measurements can be done over hundreds of kilometers with meter-scale spatial resolution by processing the data for Rayleigh, Brillouin, or Raman scattering. Due to the high speed of the operation of optical sensors, the analysis of the enormous amount of collected data requires advanced methods of processing, which is especially challenging for real-time raw data processing. In this situation, reducing the computational complexity of the NNs and improving their power efficiency is the key to the development of efficient hardware for wide NN deployment in optical sensing.

4.5 Ultrafast Light Measurements and Characterization

Ultrafast photonics deals with high-speed optical measurements, generation, characterization, and usage of extremely short pulses with picosecond to attosecond scale duration, the topics which are important for a range of applications, from medical lasers and nonlinear imaging and microscopy to materials processing and 3D laser printing. One of the challenges in ultrafast photonics is that the dynamics of ultrashort pulses in many applications are highly nonlinear. Therefore, the design optimization of pulse evolution in the nonlinear medium requires time-consuming numerical modeling. NNs can provide new design tools and enhance the performance of measurement and characterization techniques for ultrafast light [19,356–362]. For instance, the conventional approach to optimize nonlinear fiber-optic dynamics is based on numerical modeling using the generalized nonlinear Schrödinger equation. NNs can be efficiently exploited to emulate nonlinear pulse propagation and reduce computational time and memory. In [356], a recurrent NN has been applied to model and predict complex nonlinear propagation in optical fiber, using data solely from the input pulse intensity profile. The NN prediction agreed well with the experimental results for pulse compression and ultra-broadband supercontinuum generation. Nonlinear instabilities in fiber optics, with their inherently complex light dynamics, are a good test-bed for the theoretical and computational tools required for the design and optimization of fiber devices. NN-based analysis of instabilities has been introduced in [363,364]. In [19], a supervised NN has been trained to correlate the spectral and temporal properties of modulation instability using simulations and then applied to analyze high dynamic range experimental spectra to yield the probability distribution for the highest temporal peaks in the optical field. It was also shown that unsupervised learning can be used to classify noisy modulation instability spectra into subsets associated with distinct temporal dynamic structures.

The characterization of ultrashort laser pulses with femtosecond to attosecond pulse duration is yet another area where the NNs can offer new perspectives [361,362]. The characterization of the amplitude and phase of such ultrashort pulses is of critical importance for, e.g., chemical reactions and electronic phase transitions. Employing the NN for the reconstruction of ultrashort pulses enables diagnostics of low-power pulses and/or characterization without a priori knowledge of the relations between the pulses and the measured signals [361]. In [362], a method for the phase reconstruction of an ultrashort laser pulse based on the deep learning of the nonlinear spectral changes induced by self-phase modulation has been presented. The NNs have been trained on the simulated pulses with random initial phases and spectra and validated on experimental data produced from an ultrafast laser system, where near real-time phase reconstructions were performed. This method can be used in systems with high-energy, large-aperture beams, for instance, in petawatt laser systems [362].

The NNs can assist in retrieving the amplitude and phase of the complex electric field from the interferogram. In [365], a spectral interferometry system that utilizes an NN to infer the magnitude and phase of femtosecond interferograms directly from the measured single-shot interference patterns has been demonstrated. A five-layer fully connected NN was used to perform the regression, inferring the amplitude and phase spectra from the measured spectral interferogram. The input size (the size of a single input frame) of the network is defined by dividing the sampling rate of the real-time analog-to-digital converter (ADC) by the repetition rate of the laser [365]. The NN directly outputs a vector that is the concatenated magnitude and phase spectra imposed by the spectral modulator. Importantly, this method does not require a priori knowledge of the shear frequency.
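The sizing rule for the input frame and the layout of the output vector can be written down in a few lines. The sketch below is only an illustration of that arithmetic; the function names and the numerical values (a 50 GSa/s ADC and a 100 MHz laser) are hypothetical and are not the parameters used in [365].

```python
def input_frame_size(adc_sampling_rate_hz: float, laser_repetition_rate_hz: float) -> int:
    """Number of ADC samples per laser pulse, i.e., per single-shot interferogram."""
    if adc_sampling_rate_hz <= 0 or laser_repetition_rate_hz <= 0:
        raise ValueError("rates must be positive")
    return int(adc_sampling_rate_hz // laser_repetition_rate_hz)


def concatenate_outputs(magnitude, phase):
    """Output layout described in the text: magnitude spectrum followed by phase spectrum."""
    return list(magnitude) + list(phase)


# Hypothetical values chosen only to illustrate the division.
frame = input_frame_size(50e9, 100e6)
target = concatenate_outputs([0.1, 0.9, 0.2], [0.0, 1.5, -0.7])
```

Here `frame` is the number of input neurons and `target` has twice the number of spectral bins, one half for the magnitude and one half for the phase.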

4.6 Laser Systems

The NNs are well suited for dealing with nonlinear problems, which makes them very useful in laser science and technology because nonlinearity plays an important role in the operation of many classes of modern lasers [366–372]. The challenges in optimizing and controlling lasers result from a large number of effective degrees of freedom (or control parameters) that need to be balanced to achieve stable operation or to achieve a specific targeted lasing regime. Moreover, there is an increasing demand for autonomous laser operation and active self-tuning in the presence of changing environmental perturbations. In this section, we address the applications of NNs to a specific class of lasers: fiber lasers [368,369].

The efficient modeling of fiber lasers can potentially lead to a breakthrough in their performance. However, laser modeling requires the accurate characterization of all elements (not always available) and efficient data analysis [368,369]. Even with the known mathematical models, the complexity of light dynamics in the laser cavity makes the design and optimization of such lasers a challenging task, namely: (i) direct numerical modeling requires the knowledge of system parameters that are not always available; (ii) even though the impact of optical noise on the evolution of radiation can be accounted for, this requires massive, time-consuming Monte Carlo simulations of the stochastic partial differential equations underlying laser operation; (iii) comprehensive design optimization requires the analysis of large amounts of data. The simultaneous presence of the three aforementioned factors is exactly why machine-learning-based methods can transform the future of laser science and technology. It has been demonstrated recently that NN-based techniques such as deep learning can be applied to solve nonlinear stochastic partial differential equations [175,373,374]. Therefore, machine-learning methods, and in particular the NNs, have great potential to improve the performance of (fiber and other) lasers; such techniques can be used in the development of a new generation of “smart” laser systems. Lasers can quickly generate large data sets required to reach a good accuracy of NNs. Moreover, the availability of mathematical models, even without exact knowledge of all parameters, can be utilized to speed up the operation of NNs. This can be implemented by embedding available a priori information into the architectures and loss functions of the NNs or by using simulated data to train NNs as shown in Fig. 28.

In the context of laser science and technology, machine learning can be used for:

• design optimization and predictions of lasing regimes;
• characterization of laser radiation;
• improving our understanding of the physical mechanisms underlying the operation of complex laser systems;
• field control and self-tuning of the lasing regime.

Machine-learning-based techniques combined with feedback loops have the potential to revolutionize the ways to design, control, and select desirable regimes in lasers, leading to self-starting, self-optimizing lasers that are robust against environmental perturbations [136,375–379].

The complexity of NNs used in laser dynamics varies from low-complexity classical MLPs for predicting performance parameters of a mode-locked laser [380] to sophisticated deep NN systems for hidden parameters’ retrieval [136] and self-optimization of the lasing cavity [371]. Figure 28 shows a circuit for predictive control and hidden parameters’ retrieval. Even more complex NNs have been proposed for deep reinforcement learning (DRL) [381–384], which include both fully connected and LSTM parts for finding various mode-locked states without any prior knowledge of the system. Dropout layers and pruning are both routinely used to reduce the computational complexity of these NNs. In [380], the NN was used first to identify stable mode-locking regimes by considering the evolution of a small noise pulse. Then, the NN was trained to predict the pulse shape quickly and accurately.

Figure 27. Schematics of an artificial NN used for laser cavity parameter optimization.


Figure 28. Schematics of deep learning model predictive control circuit for hidden parameter retrieval and auto-mode-locking of a nonlinear polarization evolution (NPE) fiber laser. Adapted from [136].


CNNs are also widely used for laser spatial distribution field analysis [385–387], for example, for measuring mode structure inside a multimode fiber (MMF) or full beam characterization. These methods have become widespread, even though low-complexity algorithms [388] are also available in this field. In the context of the MMF lasers with complex spatiotemporal nonlinear dynamics [389–391], the NNs are even more important and likely will play an increasing role in the design and optimization. The application of NNs for the design and optimization of an auto-setting mode-locked fiber laser cavity was discussed in [371].

For more detailed reading, among numerous examples of the applications of NNs in laser science, we would like to mention the following most characteristic recent works.

• Nonlinear polarization evolution (NPE)-based lasers. NNs have been used for polarization control in NPE laser cavities that employ nonlinear polarization rotation as a mode-locking mechanism. This type of laser cavity is an excellent example of a complex nonlinear dynamical system governed by equations with “hidden” or non-measurable stochastic parameters such as birefringence. The pioneering works [375,392] demonstrated how optimization algorithms could be used to self-tune this type of laser and make it “self-starting.” A more recent paper [136] introduced a method to extract the hidden stochastic parameter of birefringence and exploit it using deep learning and model predictive control. Electronic initiation and optimization of NPE mode-locking in a fiber laser was studied in [378]. An intelligent programmable NPE-based mode-locked fiber laser with a human-like algorithm was proposed and demonstrated in [393].
• Double-gain, nonlinear amplifying loop mirror-based lasers. In [394], machine-learning-based control was applied to a mode-locked fiber laser with a double-gain, nonlinear amplifying loop mirror, a modification of the figure-of-eight lasers. This system offers the possibility of combining simple electronic control of the nonlinearity and, consequently, the generated pulse properties with machine-learning algorithms.
• Reduction of the number of control measurement devices. An important new feature that data-based approaches can bring to laser technology is the possibility of reducing the number of control measurement devices. By combining NN-based data analysis and the dispersive Fourier transform, Kokhanovskiy et al. [395] demonstrated the possibility of determining the temporal duration of picosecond-scale laser pulses using a nanosecond photodetector. The trained NN makes it possible to predict the pulse duration with an average agreement of 95%. The technique introduced in [395] paves the way to creating compact and low-cost feedback for complex laser systems.
• DRL in lasers. Application of NN-based methods generally requires arduous efforts and tuning of numerous hyperparameters. It is not obvious that the training in the laboratory environment can be smoothly transferred to the field applications or to the systems that differ from the specific laser used to develop the algorithm by design or environmental parameters. In [382,384], it was demonstrated that a DRL approach, based on trials and errors and sequential decisions, can be successfully used to control the generation of dissipative solitons in a mode-locked fiber laser system. It has been shown that deep $Q$-learning algorithms can be successfully applied to generalize knowledge about the laser system, assisting the search for conditions of stable pulse generation. The region of stable generation was transformed by changing the pumping power of the laser cavity, while a tunable spectral filter was used as a control tool. The deep $Q$-learning algorithm is capable of learning the trajectory of adjusting spectral filter parameters to a stable pulsed regime, relying on the state of output radiation. Application of the DRL for mode-locked lasers was also studied in [381] as well as in [383] with spectrum series learning control.
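The trial-and-error logic behind $Q$-learning can be sketched on a toy problem. In the example below, a hypothetical spectral filter has seven discrete positions, one of which (index 5) is assumed to give a stable pulsed regime; the agent moves the filter left or right and is rewarded when the stable position is reached. This is a deliberate simplification: the deep $Q$-network used in [382,384] is replaced by a plain lookup table, and the environment, rewards, and hyperparameters are assumptions made for illustration, not a model of a real laser.

```python
import random


def train_q_table(n_positions=7, stable=5, episodes=2000, alpha=0.5,
                  gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy 1-D 'filter position' chain (assumed environment)."""
    rng = random.Random(seed)
    moves = (-1, +1)                       # action 0: step left, action 1: step right
    q = [[0.0, 0.0] for _ in range(n_positions)]
    for _ in range(episodes):
        state = rng.randrange(n_positions)
        for _ in range(50):                # cap on the episode length
            if state == stable:
                break
            if rng.random() < epsilon:     # explore
                action = rng.randrange(2)
            else:                          # exploit the current estimate
                action = 0 if q[state][0] > q[state][1] else 1
            nxt = min(max(state + moves[action], 0), n_positions - 1)
            if nxt == stable:
                target = 1.0               # terminal reward for a stable regime
            else:
                target = -0.01 + gamma * max(q[nxt])   # small cost per adjustment
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_path(q, start, stable, max_steps=20):
    """Follow the learned greedy policy from `start` until the stable position."""
    path, state = [start], start
    while state != stable and len(path) <= max_steps:
        action = 0 if q[state][0] > q[state][1] else 1
        state = min(max(state + (-1, +1)[action], 0), len(q) - 1)
        path.append(state)
    return path


q_table = train_q_table()
path = greedy_path(q_table, start=0, stable=5)
```

The list `path` is the learned sequence of filter adjustments from the initial position toward the stable one, which is the toy analog of the trajectory of spectral filter parameters mentioned above.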

Further examples and details can be found in the recent overview of machine-learning methods in ultrafast photonics [16], which provides an insight into what are the control elements of different types of “smart” lasers, which quality measures are used to provide the feedback, and what type of algorithms can be used to control and tune such complex nonlinear dynamical systems as ultrafast lasers.

We anticipate that the next stage of merging data science techniques with laser technology will lead to the development of laser digital twins that will take full advantage of the available large data sets to create self-starting, self-tuned, robust laser systems. Laser digital twins will consist of a set of adaptive models that emulate the complex light dynamics in a laser system, using real-time data to update their operation. The laser digital twin will replicate the laser system to predict the characteristics of the output radiation and opportunities for its variation. It will also provide real-time recommendations for optimizing performance and mitigating unexpected events and situations.

4.7 Imaging and Remote Sensing

Application of NNs in digital image processing, enhancing techniques from statistical pattern recognition, has a relatively long history (e.g., see [396–398] and references therein). Deep learning algorithms, in particular deep CNNs, have effectively become the methodology of choice for analyzing a stream of images and the leading machine-learning tool for image classification and processing [399–404]. CNNs are used for denoising [405,406], enhancing image quality [407,408], and separation of spectral channels [409,410]. Deep CNNs are capable of finding patterns and learning abstractions from raw data, in this case, images. Photonic technologies, on the other hand, are capable of generating vast numbers of images in a short period of time. Therefore, deep learning methods are naturally well suited for merging with optical imaging and sensing techniques. Fast, automatic, and robust analysis of optical images is important for a variety of scientific and engineering applications, including medical imaging, quality control, hyperspectral analysis, nondestructive testing, image recognition for security, environmental science, biophotonics, remote sensing, microscopy, agritech, urban land use, and many others. Digital image processing has completely transformed optical metrology, making it possible to convert the results of measurements (typically displayed in the form of deformed fringe/speckle patterns) into the object characteristics, such as geometric coordinates, displacements, strain, refractive index, and so on. NN applications in image processing are such a huge and well-developed area that we limit the discussion here to several references for more detailed reading (e.g., [399,400,411–413]) and a few particular examples.

Deep learning algorithms are actively used in the remote sensing imagery [414,415]. Deep learning has been applied for remote sensing image analysis tasks including image fusion, image registration, scene classification, object detection, land use and land cover classification, segmentation, and object-based image analysis [414]. In particular, urban land-use mapping is one of the important though challenging problems in the field of remote sensing [415]. Deep CNNs can substantially improve the accuracy and efficiency of the classification methods used for land-use information in urban areas. However, the conventional CNNs uniformly decompose large images into small ones for processing, which is not ideal for land-use analysis applications. In [415], a semi-transfer deep CNN technique was introduced to take advantage of three-channel multispectral remote sensing images, maintaining the integrity of the land-use patterns. It is likely that remote sensing and imaging will be instrumental in developing digital twins for urban applications.

MMF has great potential to revolutionize the field of medical endoscopy and to become an optical tool of choice for endoscopic imaging due to the possibility of direct image transmission using multiple spatial modes. However, the propagating optical field is distorted by the fiber mode dispersion. Therefore, the usage of MMF as an imaging optical element, similar to a lens, requires calibration and the signal’s post-processing. The NNs offer a natural approach to image reconstruction in MMF-based imaging, enhancing performance [416–419]. Optical techniques allow images to be transformed to make processing more efficient. For instance, high-speed all-fiber imaging that combines the transformation of 2D spatial information into 1D temporal pulsed streams, by leveraging high intermodal dispersion in an MMF, with image reconstruction using deep learning has been demonstrated in [420]. The fiber probe detected micrometer-scale objects with a high frame rate (15.4 Mfps) and large frame depth (10,000). The scheme proposed in [420] combines high speed with high mechanical flexibility and integration, making it attractive for various in vivo applications.

Imaging in the infrared spectral interval is of high interest for numerous applications. In recent work on the hyperspectral imaging of narrow absorption lines [421], a machine-learning technique was utilized to reconstruct images obtained with a massively parallelized infrared detection. The gathering and subsequent processing of optical images make it possible to characterize the processes occurring in systems without introducing measuring elements directly into the system. For example, remote sensing using NNs for aerial/satellite images scene classification is used for forest fire monitoring [422,423]. It has become increasingly used in agricultural technologies and for monitoring ecosystems. Kattenborn et al. [424] provided a review of CNN-based approaches to the characterization of both spatial and temporal vegetation patterns. It is shown that CNNs are very effective in extracting a wide array of vegetation properties from remote sensing imagery. Another deep learning approach [425] has been applied to obtain more information about global climate processes occurring on Earth, as well as to improve the accuracy of predicting seasonal fluctuations in observed atmospheric parameters and to capture long-range relationships of climate processes on various time scales. Overall, optical imaging and remote sensing is the area where merging of photonic techniques and NN processing is already happening, leading to the emerging field of digital optical imaging.

4.8 Neural Networks for the Design of New Photonic Materials

The design of new materials and structures (photonic crystals, meta-materials, and meta-surfaces, to mention a few) is one of the most important progress drivers in photonics. The direct problem of material design, in general, is extremely computationally intensive as it requires electromagnetic modeling based on numerical full-wave simulation methods (such as the finite-difference time-domain method, the finite-element method, and others, see [426–428]) and then lots of trials together with global optimization techniques to find a good design. There exists a plethora of different advanced optimization techniques. However, combined with the direct electromagnetic solvers with high computational costs, these numerical frameworks are too time-consuming and inefficient in complex design tasks. Moreover, the performance of the conventional optimization techniques degrades with the increase of the number of additional constraints, limiting the practical applicability further. The inverse design [429,430], dealing with finding the appropriate optical material structure that can provide for the desired properties, is an even harder task insofar as it requires the search within a much larger design space.

NNs have brought computational efficiency to this field [24,431–433] since the NNs can be much less computationally complex than traditional approaches. The NNs have found numerous applications in material design, from predicting new materials and revealing hidden relations to the construction and optimization of nanophotonic and metamaterial structures to manipulate electromagnetic fields at the subwavelength scales. However, a typical NN used in material design is still quite complex, consisting of millions of trainable weights, and complexity reduction in this field is in high demand.

In [140], the authors utilized generative and discriminative NNs for the efficient and fast optimization of plasmonic metamaterials. Figure 29 shows the schematics of the NN-based optimization routine employed in [140].

Figure 29. Generative and estimating NNs for global optimization of plasmonic nanoantennas. Adapted from [140].


Several reviews [24,434] have provided a summary of machine-learning techniques in material design and shown how deep NNs, configured as discriminative networks, can learn from training sets and operate as high-speed surrogate electromagnetic solvers. These are often used to speed up the numerical simulations of the proposed layouts, which usually require costly simulation techniques when Maxwell’s and/or material equations are solved using mesh methods such as finite difference time domain (FDTD) or finite-element modeling (FEM).

For example, a physics-inspired loss function can be used to train a deep NN [435] that is capable of solving partial differential equations without employing computationally costly mesh methods. Deep learning is often utilized for inverse design in nanophotonics where the accurate mathematical description of many problems is still challenging. A recent review [436] of machine learning and, in particular, deep learning methods used for inverse design in nano-optics provides a good insight into how NNs have changed the field of material design and have gone beyond that with solving partial differential equations, interpreting physical properties and photonics experiments. Deep NNs and generative models are also employed for the design and optimization of photonic splitters [437]. More examples of how deep NNs are utilized in the design of photonic materials, polymers, and other materials [438] or designing new photonic structures [21] show the broad interest in applying NNs in this field.
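To indicate what a physics-inspired loss function looks like in practice, the sketch below evaluates such a loss for a tiny network $u(x)=\sum_j a_j \tanh(w_j x + b_j)$ and the 1D Helmholtz equation $u'' + k^2 u = 0$. The loss is the mean squared equation residual at randomly chosen collocation points (so no mesh is needed) plus a penalty on the boundary values. The choice of equation, network form, boundary values, and all names are assumptions made for illustration; this is not the method of [435], and only the loss evaluation is shown, not the training loop.

```python
import math
import random


def u_and_uxx(params, x):
    """u(x) = sum_j a_j * tanh(w_j * x + b_j) and its analytic second derivative."""
    u, uxx = 0.0, 0.0
    for a, w, b in params:
        t = math.tanh(w * x + b)
        u += a * t
        # d^2/dz^2 tanh(z) = -2 * tanh(z) * (1 - tanh(z)^2), with z = w * x + b
        uxx += a * w * w * (-2.0 * t * (1.0 - t * t))
    return u, uxx


def physics_loss(params, k, collocation, boundary):
    """Mean squared residual of u'' + k^2 u = 0 plus squared boundary mismatch."""
    residual = 0.0
    for x in collocation:
        u, uxx = u_and_uxx(params, x)
        residual += (uxx + k * k * u) ** 2
    residual /= len(collocation)
    mismatch = sum((u_and_uxx(params, x)[0] - value) ** 2 for x, value in boundary)
    return residual + mismatch


rng = random.Random(0)
# Eight hidden units with random (untrained) parameters (a_j, w_j, b_j).
params = [(rng.uniform(-1, 1), rng.uniform(-2, 2), rng.uniform(-1, 1)) for _ in range(8)]
k = 2.0
points = [rng.random() for _ in range(32)]       # random collocation points in [0, 1]
bcs = [(0.0, 0.0), (1.0, math.sin(k))]           # assumed boundary values
loss = physics_loss(params, k, points, bcs)
```

Minimizing `loss` with respect to `params` with any gradient-based optimizer would drive the network toward a function that satisfies both the equation and the boundary conditions.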

Jiang and Fan [439] have shown how a physics-inspired conditional generative NN can be applied for global optimization of the topology of metasurfaces. Two NNs were trained on forward and adjoint electromagnetic simulations. Jiang and Fan showed that the topologies obtained by the developed approach have advantages over those obtained by the standard global optimization-based approach while being more energy efficient.

In a paper on a machine-learning framework for quantum sampling of highly constrained, continuous optimization problems [440], the authors developed a framework that maps inverse design optimization problems into surrogate quadratic unconstrained binary optimization problems by employing a binary VAE and a factorization machine. Then, using the D-Wave Advantage hybrid sampler and simulated annealing, the authors demonstrated how diffractive meta-gratings can be developed for highly efficient beam steering.
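For readers unfamiliar with the quadratic unconstrained binary optimization (QUBO) form, the sketch below defines the QUBO energy $x^{T} Q x$ of a binary vector $x$ and minimizes it with a basic simulated annealing loop, checking the result against exhaustive search. The $3\times 3$ matrix, the cooling schedule, and all names are toy assumptions; the sketch does not reproduce the binary VAE, the factorization machine, or the hybrid sampler of [440].

```python
import itertools
import math
import random


def qubo_energy(Q, x):
    """Energy x^T Q x of a binary vector x for an upper-triangular QUBO matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def brute_force(Q):
    """Exhaustive search, feasible only for very small problems."""
    return min(itertools.product((0, 1), repeat=len(Q)),
               key=lambda x: qubo_energy(Q, x))


def simulated_annealing(Q, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Single-bit-flip simulated annealing with a geometric cooling schedule."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    energy = qubo_energy(Q, x)
    best, best_energy = list(x), energy
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        x[i] ^= 1                                   # propose flipping one bit
        new_energy = qubo_energy(Q, x)
        accept = (new_energy <= energy or
                  rng.random() < math.exp((energy - new_energy) / temperature))
        if accept:
            energy = new_energy
            if energy < best_energy:
                best, best_energy = list(x), energy
        else:
            x[i] ^= 1                               # reject: undo the flip
    return tuple(best), best_energy


# Toy matrix: each bit is rewarded, but neighboring bits set together are penalized.
Q = [[-1.0, 2.0, 0.0],
     [0.0, -1.0, 2.0],
     [0.0, 0.0, -1.0]]
exact = brute_force(Q)
sa_solution, sa_energy = simulated_annealing(Q)
```

In an inverse-design setting, the entries of `Q` would be supplied by the trained surrogate model rather than written by hand.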

The perspectives of deep NNs in photonics and photonic materials are summarized in [18], which describes the future opportunities for machine-learning methods in the domains of photonics, nanophotonics, plasmonics, and photonic materials’ discovery, including metamaterials.

5. Reducing the Complexity of Neural Networks

In the machine-learning model development stage, the primary challenge is often the time consumption of the NN model. From a time complexity perspective, the focus is on minimizing the computational resources required during training and inference. This entails reducing the number of operations and memory requirements, thus enabling faster and more efficient model execution. Time-critical applications such as real-time image or speech recognition often prioritize time complexity optimization to ensure timely and responsive predictions. In this sense, complex models with high parallelization can reduce the time complexity, and the use of GPUs and batch processing can further enhance efficiency. However, it is important to balance model complexity with training efficiency, as highly complex models can require a large amount of data and time to train.

In contrast, signal-processing problems pose specific challenges when it comes to the implementation complexity of NN models. In these scenarios, it is crucial to carefully consider how the NN models are implemented and deployed. The inference complexity of NNs, which encompasses the computational requirements during the prediction phase, assumes a critical role in signal-processing applications. One key aspect to consider in signal processing is the impact of inference complexity on power consumption. As the complexity of the model increases, the computational demands also rise, leading to higher power consumption. This is a significant concern, particularly in resource-constrained environments or energy-efficient systems. By reducing the complexity of the NN model, such as minimizing the number of layers or parameters, it becomes possible to achieve a more efficient signal-processing system that consumes less power without compromising performance. Another vital consideration is the effect of inference complexity on time delay.

In real-time signal-processing applications, prompt and timely responses are imperative. Complex NN models often require longer inference times, which can introduce undesirable delays. By optimizing the model’s complexity, such as utilizing efficient algorithms or reducing redundant computations, the inference time can be minimized, ensuring timely predictions and responses to signal-processing tasks. In addition, the ability to adapt rapidly is of utmost importance in signal-processing applications. The characteristics of the transmitted signals or the environment in which the processing takes place may vary over time. Therefore, the NN model must be able to adapt quickly to these changes to maintain optimal performance. This adaptability can be achieved through techniques such as online learning, where the model learns and updates itself in real time based on incoming data. The relative importance of time complexity and implementation complexity depends on the requirements and constraints of the application under consideration. Where promptness is of the utmost importance, it may be preferable to sacrifice implementation complexity for greater temporal efficiency. Where power consumption and hardware dimensions are critical, time complexity and implementation complexity must be balanced. Consequently, we will now review the techniques used to reduce complexity in both the training and inference stages, as well as the metrics that can be used to evaluate their impact on NN model design.

5.1 Computational Complexity Metrics for Training and Inference Stages

In the domain of training NNs, the evaluation of training complexity traditionally revolves around two key metrics (Fig. 30): the number of trainable parameters within the NN and the time required for training to attain a specific performance level. Although both metrics offer some insight into training complexity, two NNs with an equal number of trainable parameters can exhibit distinct training complexities, as exemplified in [12]. Moreover, training time is a difficult metric to compare because it depends on the hardware resources used for training and on the size of the training dataset. To address these limitations, two additional metrics have been developed. The first metric, known as NENB (number of epochs $\times$ number of batches), combines the number of training epochs and the number of data subsets (batches) used in each epoch. In general, an increased number of training epochs implies a more intricate and computationally demanding model. The number of batches refers to the subsets of data employed during each epoch; it grows with the dataset size and shrinks as the batch size increases. Consequently, NENB can serve as a suitable metric for comprehensively assessing training complexity. Evaluating the number of epochs or batches individually would fail to provide a holistic evaluation. For instance, one model may require more epochs but fewer batches, while another model might need fewer epochs but a larger number of batches.
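As a minimal illustration of how NENB can be computed, consider the Python sketch below. The function name `nenb` and the numerical values are hypothetical, and we assume that a final, partially filled batch is still counted as one batch.

```python
import math


def nenb(num_epochs: int, dataset_size: int, batch_size: int) -> int:
    """Return NENB = number of epochs x number of batches per epoch."""
    # Assumption: a final, partially filled batch still counts as one batch.
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    return num_epochs * batches_per_epoch


# Two hypothetical training runs on the same 2**18-sample dataset.
many_epochs_few_batches = nenb(num_epochs=100, dataset_size=2**18, batch_size=4096)
few_epochs_many_batches = nenb(num_epochs=20, dataset_size=2**18, batch_size=256)
print(many_epochs_few_batches, few_epochs_many_batches)  # 6400 20480
```

In this toy comparison, the run with five times fewer epochs still has the larger NENB, which is why neither factor is informative on its own.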

Figure 30. Training complexity metrics and their dependencies.

The second metric, in its objective to gauge the versatility and generalization capabilities of the model, seeks to quantify the count of operational ranges in which the NN equalizer can effectively function with an acceptable level of gain. This metric provides valuable insights into the NN’s ability to adapt and perform across various scenarios and contexts. When the NN is constrained to a specific task or domain, it becomes necessary to frequently retrain the model to ensure its proficiency in handling new or evolving situations. This need for frequent retraining adds to the overall complexity of the system, as it demands additional computational resources, time, and effort to maintain the desired performance levels. By considering this aspect of the NN’s operational scope, the second metric offers a valuable perspective on the adaptability and complexity of the model within its intended application domain.

Moving to the computational complexity of the inference, from a computer science perspective, computational complexity analysis is almost always attributed to the Big-$O$ notation of the algorithm [441–443]. In general, the Big-$O$ notation is used to express an algorithm’s complexity while assessing its efficiency, which means that we are interested in how effectively the algorithm scales with the size of the dataset in terms of running time [444–446]. However, from the engineering standpoint, the Big-$O$ is often an oversimplified measure that cannot be immediately translated into the hardware resources required to realize the algorithm (NNs) in a hardware platform [447].

Owing to the absence of some “universal” complexity measure, various works have presented complexity in terms of MAC [447–450], Kolmogorov complexity [451], the number of bit-operations (BOP) [452,453], the number of real multiplications (RM) [208,211,213], the number of shift and add operations [214], and the number of hardware logic gates [454,455]. In this paper, we summarize a sequence of useful complexity metrics going from the software level to the hardware level, which is also depicted in Fig. 31.

The first, most software-oriented, level of estimation traditionally deals only with counting the RM number of the algorithm [456,457] (quite often defined per one processed element, say a sample or a symbol). When comparing computational complexity, the purpose of this high-level metric is to consider only the multipliers required, ignoring additions, because the implementation of the latter in hardware or software is initially considered cheap, while the multiplier is generally the slowest element in the system and consumes the largest chip area [456,458]. This ignoring of the additions can also be easily understood by looking at the Big-$O$ analysis of multiplier versus adder. When multiplying two integers with $n$ digits, the computational complexity of the multiplication instance is $O(n^2)$, whereas the addition of the same two numbers has a computational complexity of $\Theta (n)$ [459,460]. As a result, if we deal with float values with 16 decimal digits, multiplication is by far the most time-consuming part of the implementation procedure. Therefore, when comparing solutions that use floating-point arithmetic with the same bitwidth precision, the RM metric provides an acceptable comparative estimate to qualitatively assess the complexity against some existing benchmarks (e.g., against the DSP operations for optical channel equalization tasks [457]).
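To make the RM count concrete, the sketch below counts the real multiplications of a bias-free multilayer perceptron, assuming real-valued inputs and weights and one multiplication per weight. The function names and the layer sizes are hypothetical; additions and activation functions are deliberately ignored, as in the RM metric.

```python
from typing import Sequence


def dense_rm(n_inputs: int, n_neurons: int) -> int:
    """Real multiplications of one dense layer: one per weight."""
    return n_inputs * n_neurons


def mlp_rm(layer_sizes: Sequence[int]) -> int:
    """Sum the RM of consecutive dense layers; additions and activations are ignored."""
    return sum(dense_rm(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical equalizer: 40 real inputs, two hidden layers of 100 neurons, 2 outputs.
rm_per_symbol = mlp_rm([40, 100, 100, 2])
print(rm_per_symbol)  # 14200
```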

Figure 31. Different levels of inference computational complexity metrics: from software notions down to hardware logic elements.

When moving to fixed-point arithmetic, the second metric, known as the number of bit-operations (BOP), must be adopted to understand the impact of changing the bitwidth precision on the complexity. The BOP metric provides a good insight into mixed-precision arithmetic performance since we can forecast the BOP needed for fundamental arithmetic operations such as addition and multiplication, given the bitwidth of two operands. In a nutshell, the BOP metric aims to generalize floating-point operations (FLOPs) to heterogeneously quantized NNs, since the FLOPs cannot be efficiently used to evaluate integer arithmetic operations [453,461]. For the BOP metric, we have to include the complexity contribution of both multiplications and additions, since now we evaluate the complexity in terms of the most common operations in NNs: the MAC operations [453,461,462]. Unlike a plain MAC count, however, the BOP accounts for the scaling of the number of multipliers with the bitwidth of two operands and the scaling of the number of adders with the accumulator bitwidth. Note that since most real DSP implementations use dedicated logic macros [e.g., DSP slice in field-programmable gate arrays (FPGA) or MAC in application-specific integrated circuit (ASIC)], the BOP metric is a good complexity estimate inasmuch as it also assesses the MAC, taking into account the particular bitwidth of two operands.
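The scaling described above can be sketched for a bias-free dense layer as follows. This is an illustration under assumptions that are ours, not necessarily the expressions of Table 1: the multiplier cost is taken as the product of the weight and input bitwidths, and the accumulator width as $b_w + b_i + \lceil \log_2 n_i \rceil$, a convention commonly used in the BOP literature. All names and sizes are hypothetical.

```python
import math


def accumulator_bits(n_inputs: int, b_w: int, b_i: int) -> int:
    """Assumed accumulator width needed to sum n_inputs products without overflow."""
    return b_w + b_i + math.ceil(math.log2(n_inputs))


def dense_bop(n_inputs: int, n_neurons: int, b_w: int, b_i: int) -> int:
    """BOP estimate of a bias-free dense layer under the stated assumptions."""
    multiplier_cost = b_w * b_i  # scales with both operand bitwidths
    adder_cost = accumulator_bits(n_inputs, b_w, b_i)  # scales with accumulator width
    return n_inputs * n_neurons * (multiplier_cost + adder_cost)


bop_8bit = dense_bop(n_inputs=64, n_neurons=32, b_w=8, b_i=8)
bop_4bit = dense_bop(n_inputs=64, n_neurons=32, b_w=4, b_i=8)
print(bop_8bit, bop_4bit)  # 176128 102400
```

Both configurations have the same RM count (2048), so only a bitwidth-aware metric such as the BOP distinguishes them.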

The progress in developing new advanced NN quantization techniques [463–466] allowed the implementation of the fixed point multiplications participating in NNs efficiently, namely with the use of a few bit-shifters and adders [467–469]. Since the BOP cannot properly assess the effect of different quantization strategies on the complexity, a new, more sophisticated metric is required. We can introduce the third complexity metric that counts the number of total equivalent additions to represent the multiplication operation, called the number of additions and bit shifts (NABS) [214]. The number of shift operations can be neglected when calculating the computational complexity because, in the hardware, the shift can be performed without extra costs in constant time with the $O(1)$ complexity. Even though the cost of bit shifts can be ignored due to the aforementioned reasons, and only the total number of adders has to be accounted for to measure the computational complexity, we prefer to keep the full NABS name to highlight that the multiplication is now represented as shifts and adders.
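The idea behind the NABS metric can be illustrated with a toy integer multiplier that uses only shifts and additions. The sketch assumes a non-negative integer weight that is already decomposed into a sum of powers of two; the function name and the operands are hypothetical.

```python
from typing import Sequence, Tuple


def shift_add_multiply(x: int, exponents: Sequence[int]) -> Tuple[int, int]:
    """Multiply x by sum(2**e for e in exponents) using shifts and additions only.

    Returns the product and the number of additions that were needed.
    """
    product = 0
    additions = 0
    for index, exponent in enumerate(exponents):
        term = x << exponent  # bit shift: treated as free in hardware
        if index == 0:
            product = term
        else:
            product += term
            additions += 1
    return product, additions


print(shift_add_multiply(7, [3]))     # weight 8 = 2**3         -> (56, 0)
print(shift_add_multiply(7, [3, 1]))  # weight 10 = 2**3 + 2**1 -> (70, 1)
```

A weight that is a single power of two costs no additions at all, whereas each extra power-of-two term costs one adder, which is the quantity the NABS metric counts.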

Finally, the metric closest to the hardware level is the number of logic gates (NLG) required by the hardware (e.g., ASIC or FPGA) implementation of the method under evaluation. It is different from the NABS metric, as it now reflects the true cost of implementation in particular hardware. In this case, in contrast to the other complexity metrics, the cost of activation functions is also taken into account because, to achieve better complexity, they are frequently implemented using look-up tables (LUTs) rather than adders and multipliers. In addition, other metrics such as the number of flip-flops (FFs) or registers, the number of logic blocks used for general logic and memory blocks, or other special functional macros used in the design are also relevant. As is clear from this explanation, we cannot present a straightforward equation to convert the NABS to NLG, as the latter depends on the circuit design adopted by the developer: special tools such as Synopsys Synthesis [470] for ASIC implementation can provide this information. However, concerning the FPGA design, it is harder to get a correct estimate of the gate count from the report of the FPGA tools [471].

From an analytical standpoint, Table 1 provides, for the primary layers employed in machine learning, expressions for the following complexity metrics: RM, BOP, and NABS. The equations provided in the table are determined based on the dimensions of the layer’s input and output, the assumed bitwidth for operations, and the design hyperparameters applied to these NN layers. For a comprehensive understanding of the calculation methodologies employed to derive these metrics, please refer to [214].

Table 1. Summary of the Three Computational Complexity Metrics per Layer^a for a Zoo of NN Layers as a Function of Their Designing Hyperparameters^b

5.2 Reducing the Complexity of Training

In many applications, when designing a NN structure with some particular purpose, we, first and foremost, pay attention to the performance of the respective model. Typically, we expect that this performance is better than some established benchmark: for instance, the performance of post-equalizers is gauged against the digital backpropagation method with some number of steps. However, when considering the implementation aspects, the ultimate cost of the processing chain has to be taken into account, i.e., we need to assess the computational complexity of our NN. When talking about NN-based devices, we can distinguish two important factors related to the NN complexity: the complexity of training, which is often omitted as the training is assumed to be made off-line, and the complexity of inference, i.e., the complexity associated with the on-line optical signal processing for the subject considered. In this section, we devote attention to both directions insofar as the training complexity can be associated with the reconfigurability of a NN device, indicating how the device can readjust itself for some changes in the usage environment. The overview of the strategies for complexity reduction in the training and inference is shown in Fig. 32.

Figure 32. Overview of strategies for complexity reduction in the training and inference phases when deploying NN-based solutions in photonics.

5.2a Data Augmentation

Data augmentation is the technique of producing additional data points from the current data obtained to artificially increase the amount of available data. Data can be augmented in a number of ways, e.g., by making small modifications or by employing machine-learning models to generate new data points in the latent space of original data. Having a large dataset is essential for the effectiveness of machine learning and deep learning models. In a nutshell, data augmentation is a technique used in machine learning to increase the size of a dataset by generating additional data points based on the existing data, reducing the complexity of the learning process by providing the model with more examples to learn from, which can improve the model’s generalization and reduce the risk of overfitting.

Data augmentation is often used in image classification tasks, where the model is trained to recognize objects in images. By generating additional images that are slightly modified versions of the original images (e.g., by rotating, cropping, or adding noise), the model is able to learn more about the features that are relevant to the task, as well as how to be robust to small variations in the data.

In optical communications, data augmentation has recently been considered in network scenarios for predicting failures [472,473] and traffic peculiarities [349,474]. These network applications generated new objects either with GANs [349,473,474] or with heuristics [472]. It is worth noting that training supervised learning algorithms for every particular task requires a unique dataset structure and, hence, a unique data augmentation procedure. Therefore, the aforementioned data augmentation techniques from the networking layer are not applicable to signal distortion mitigation at the physical layer of optical communications. Only in [475] was data augmentation successfully used to improve the training of supervised learning algorithms for the compensation of nonlinear distortions in fiber-optic communication systems. In this case, it was shown, both numerically and experimentally, that data transformations that account for the symmetries of the underlying propagation equation (e.g., the Manakov equation) can be used to synthetically expand the training dataset.
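A symmetry-based augmentation of this kind can be sketched as follows. The example uses one symmetry of the Manakov equation, its invariance under a constant phase rotation of the field, and applies the same rotation to the received window and to its label. This is an illustration under our own assumptions and not necessarily the set of transformations used in [475]; the data layout, function names, and sample values are hypothetical.

```python
import cmath
import math
import random
from typing import List, Tuple

Example = Tuple[List[complex], complex]  # (received window, transmitted symbol)


def rotate_example(example: Example, phase: float) -> Example:
    """Apply the same constant phase rotation to the input window and its label."""
    received, target = example
    rotation = cmath.exp(1j * phase)
    return [rotation * sample for sample in received], rotation * target


def augment(dataset: List[Example], copies: int, rng: random.Random) -> List[Example]:
    """Return the original examples plus `copies` rotated versions of each."""
    # Multiples of pi/2 map a square QAM constellation onto itself.
    phases = [math.pi / 2, math.pi, 3 * math.pi / 2]
    augmented = list(dataset)
    for example in dataset:
        for _ in range(copies):
            augmented.append(rotate_example(example, rng.choice(phases)))
    return augmented


toy_dataset = [([1 + 1j, 0.9 - 1.1j, -1 + 1j], 1 - 1j)]  # hypothetical values
bigger = augment(toy_dataset, copies=3, rng=random.Random(0))
print(len(bigger))  # 4
```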

5.2b Transfer Learning

Transfer learning is a machine-learning framework that uses a pre-trained model as a starting point to solve a new task rather than training a model from scratch. It can reduce the complexity of the learning process by leveraging the knowledge learned by the pre-trained model and adapting it to the new task.

Transfer learning is often used when only a limited amount of labeled data is available for the new task, as the pre-trained model can provide a good starting point for the new task even with a small amount of labeled data. It can also be used to reduce the number of computations required to train a new model, as the pre-trained model has already learned many of the general patterns and features that are relevant to the new task. This procedure is more likely to succeed if the features are universal or applicable to both the base and target tasks.
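The saving in training effort can be illustrated with a deliberately simple toy model: a one-dimensional linear regressor trained by full-batch gradient descent, first from scratch and then starting from weights pre-trained on a related source task. The two "domains", the learning rate, and the stopping tolerance are hypothetical choices made only for this sketch.

```python
from typing import List, Tuple

X = [i / 10 for i in range(-10, 11)]  # shared input grid


def make_data(slope: float, offset: float) -> List[Tuple[float, float]]:
    return [(x, slope * x + offset) for x in X]


def train(data, w: float, b: float, lr: float = 0.1, tol: float = 1e-4,
          max_epochs: int = 10_000) -> Tuple[float, float, int]:
    """Full-batch gradient descent on the MSE; returns (w, b, epochs used)."""
    n = len(data)
    for epoch in range(max_epochs):
        errors = [(w * x + b - y, x) for x, y in data]
        if sum(e * e for e, _ in errors) / n < tol:
            return w, b, epoch
        w -= lr * 2 * sum(e * x for e, x in errors) / n
        b -= lr * 2 * sum(e for e, _ in errors) / n
    return w, b, max_epochs


source = make_data(slope=2.0, offset=1.0)  # hypothetical source domain
target = make_data(slope=2.2, offset=0.8)  # hypothetical, related target domain

w_src, b_src, _ = train(source, w=0.0, b=0.0, tol=1e-8)  # pre-training
_, _, epochs_scratch = train(target, w=0.0, b=0.0)       # training from scratch
_, _, epochs_transfer = train(target, w=w_src, b=b_src)  # fine-tuning
print(epochs_scratch, epochs_transfer)
```

Because the pre-trained weights already lie close to the target solution, fine-tuning reaches the same loss tolerance in fewer epochs than training from scratch.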

Transfer learning in optical networks has been mainly used for OSNR monitoring. In [476], this application was introduced using an artificial NN-based transfer learning approach to accurately predict the QoT of different optical networks without re-training NN models from scratch. In that paper, the source domain was a 4$\times$80 km (4 spans) large effective area fiber (LEAF) link using QPSK modulation. The target domain was the same system but with a different number of spans (propagation distance) and different modulation formats (4$\times$80 km LEAF with 16-QAM; 2$\times$80 km LEAF with 16-QAM; and 3$\times$80 km dispersion-shifted fiber with QPSK). The results showed that when using transfer learning, just 2% of the original training dataset size was enough to calibrate the NN for the new target domain. More recently, in [477], an experimental demonstration of transfer learning for joint OSNR monitoring and modulation format identification from 64-QAM signals was presented. It was shown that by transferring the learning from simulation to experiment, the number of training samples and epochs needed for the same prediction quality was reduced by 24.5% and 44.4%, respectively. Another recent application of transfer learning was in the spectrum optimization problem for resource reservation [478]. To predict a spectrum defragmentation time, the pre-trained NN model for a source domain (having a 6-node topology) was transferred and trained again using the data from the target domain (the NSFNet with 14 nodes). It was shown that by using this technique, the proportion of affected services was reduced, the overall likelihood of resource reservation failure was diminished, and the spectrum resource utilization was improved.

Only a few works have addressed transfer learning for nonlinearity mitigation, and these mostly focus on short-haul IM-DD systems. In [479], the successful transfer of the knowledge for the links with different bit rates and fiber lengths was demonstrated. Both feed-forward and recurrent NNs were tested for the transfer learning application: about 90% (feed-forward) and 87.5% (recurrent) reduction in epochs were achieved, and 62.5% (feed-forward) and 53.8% (recurrent) reduction in training symbols were demonstrated. Another work in direct detection [480] applied the transfer from 5 dBm launch power to other powers (ranging from $-$7 dBm to 9 dBm) and from one transmission distance (640 km) to others (from 80 km up to 800 km). The experimental results showed that the iterations with transfer learning constitute approximately a quarter of the full NN training iterations. In addition, the transfer learning did not result in a performance penalty in a five-channel transmission when transferring the learned features from training just the middle channel to the four other channels.

Finally, transfer learning in coherent optical systems was first investigated in [481]. In that work, the authors applied transfer learning for different launch powers but provided only a very brief explanation of the technique. A comprehensive description of how transfer learning can be efficiently used to realize flexible NN equalizers that adapt to changes in launch power, modulation format, symbol rate, and fiber setup was first presented in [482].

5.2c Domain Randomization

Domain randomization is a technique used in machine learning to train models that are robust to changes in data distribution. It involves training a model on a wide range of simulated data that are randomly generated within a certain domain, rather than training on real-world data [483]. By training on a wide range of simulated data, the model is able to learn the general patterns and features that are common across the entire domain, rather than being specifically adapted to the data distribution of a particular dataset. This can make the model more robust to changes in the data distribution and less prone to overfitting [484].

The usage of domain randomization can reduce the complexity of the training process by decreasing the amount of real-world data required to train a model. It can also reduce the need for extensive data preprocessing, as the simulated data are generated randomly and do not need to be cleaned or normalized [485].

In particular, when talking about the optical channel equalization task, by using domain randomization, we can train the model in such a way that it can successfully work for different baud rates, powers, etc. [486]. Quite often, the randomization is coupled with domain adaptation techniques, e.g., with transfer learning.
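In practice, domain randomization amounts to drawing a new system configuration for every simulated training example. The sketch below shows this sampling step only; the parameter names, their ranges, and the `simulate` callback (standing in for a real link simulator) are hypothetical placeholders.

```python
import random
from typing import Callable, Dict, List


def sample_scenario(rng: random.Random) -> Dict[str, float]:
    """Draw one random link configuration (all ranges are hypothetical)."""
    return {
        "launch_power_dbm": rng.uniform(-2.0, 4.0),
        "symbol_rate_gbd": rng.choice([28.0, 34.0, 64.0]),
        "num_spans": float(rng.randint(5, 20)),
    }


def randomized_dataset(simulate: Callable[[Dict[str, float]], object],
                       n_examples: int, seed: int = 0) -> List[object]:
    """Generate every training example from a freshly sampled scenario."""
    rng = random.Random(seed)
    return [simulate(sample_scenario(rng)) for _ in range(n_examples)]


# Placeholder simulator: a real one would propagate a signal through the link.
scenarios = randomized_dataset(simulate=lambda scenario: scenario, n_examples=1000)
print(len(scenarios))  # 1000
```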

5.2d Other Approaches

To minimize the time and effort needed to train a NN, we can use some other approaches. The first option is meta-learning [487]. Meta-learning is a machine-learning technique that involves learning how to learn, i.e., learning the process of improving a learning system. It aims to reduce the complexity of the learning process by learning common patterns across various tasks and using this knowledge to improve the performance of learning algorithms. One way that meta-learning can be used to reduce training complexity is by pre-training a model on a large dataset and then fine-tuning it on a specific task, rather than training a model from scratch for each task. This can reduce the amount of data and computation required to train a model for a specific task, as the model has already learned many of the general patterns that are common across the tasks. Another way that meta-learning can be used to reduce the training complexity is by learning to adapt the model’s architecture or hyperparameters based on the specific task at hand. This can allow the model to automatically adjust its complexity to the needs of the task, rather than requiring manual tuning of the model’s architecture or hyperparameters.

The second possibility is to use semi-supervised learning techniques [488]. Semi-supervised learning is a machine-learning method that uses both labeled and unlabeled data to train a model. It can be used to reduce the complexity of the learning process by reducing the amount of labeled data required to train a model. In supervised learning, the model is trained using a large dataset of labeled examples, where each example has a known correct output. However, collecting and labeling a large dataset can be a time-consuming and expensive process. In semi-supervised learning, a smaller amount of labeled data is used in conjunction with a larger amount of unlabeled data. The model is trained to make predictions on the labeled data, and the predictions are then used to label the unlabeled data. This process is repeated until the model acquires the ability to make accurate predictions on the entire dataset.

Using semi-supervised learning can reduce the amount of labeled data required to train a model, as the model is able to learn from both the labeled and unlabeled data. This can be particularly useful in situations where it is difficult or expensive to obtain a large amount of labeled data.
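The pseudo-labeling loop described above can be sketched with a toy one-dimensional nearest-centroid classifier. The confidence rule (a minimum gap between the distances to the two closest class centroids), the function names, and the data are hypothetical choices made for this illustration.

```python
from typing import Dict, List, Tuple


def centroids(labeled: List[Tuple[float, int]]) -> Dict[int, float]:
    """Mean feature value per class (a nearest-centroid classifier)."""
    classes = {y for _, y in labeled}
    return {c: sum(x for x, y in labeled if y == c) / sum(1 for _, y in labeled if y == c)
            for c in classes}


def self_train(labeled, unlabeled, margin: float = 1.0):
    """Repeatedly pseudo-label the unlabeled points the model is confident about."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        cents = centroids(labeled)
        confident, remaining = [], []
        for x in unlabeled:
            ranked = sorted(cents, key=lambda c: abs(x - cents[c]))
            gap = abs(x - cents[ranked[1]]) - abs(x - cents[ranked[0]])
            if gap >= margin:
                confident.append((x, ranked[0]))  # accept the pseudo-label
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new can be labeled with enough confidence
        labeled += confident
        unlabeled = remaining
    return centroids(labeled), unlabeled


labeled_points = [(0.0, 0), (0.5, 0), (4.0, 1), (4.5, 1)]
unlabeled_points = [0.2, 1.0, 1.8, 2.9, 3.2, 3.9]
final_centroids, still_unlabeled = self_train(labeled_points, unlabeled_points)
print(still_unlabeled)  # [1.8]
```

Only four labels are provided, yet five of the six unlabeled points end up contributing to the class centroids; the ambiguous point near the class boundary is left unlabeled rather than risking a wrong pseudo-label.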

5.3 Inference Complexity Reduction

5.3a Pruning Neural Networks

Pruning is a technique used in machine learning to reduce the complexity of a model by removing unnecessary (low-importance) parameters or connections. It reduces the computational complexity of a model by reducing the number of parameters that the model needs to store and the number of operations required to process the input. In summary, pruning is the process of reducing the size of a preexisting NN by eliminating nonessential elements. The goal of this procedure is to maintain the network’s accuracy while increasing its efficiency. This can also reduce the CPU time required for the NN to function.

The area of NN pruning is wide and encompasses several subcategories: (a) static or dynamic; (b) one-shot or iterative; (c) structured or unstructured; (d) magnitude-based or information-based; (e) global or layer-wise. Detailed information on the different types of pruning can be found in, e.g., [489–494]. The four (most promising from our viewpoint) strategies for the iterative-pruning retraining process are fine-tuning, weight rewinding, learning rate rewinding, and BO-assisted.
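As a minimal example of one point in this taxonomy, the sketch below performs one-shot, unstructured, global, magnitude-based pruning of a weight matrix. The retraining step that follows pruning in the iterative strategies is not shown, and the function name and weight values are hypothetical.

```python
from typing import List


def magnitude_prune(weights: List[List[float]], sparsity: float) -> List[List[float]]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(sparsity * len(magnitudes))
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    # Ties at the threshold are pruned too, so the sparsity can slightly exceed the target.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


layer = [[0.8, -0.05, 0.3], [0.01, -0.6, 0.1]]  # hypothetical dense-layer weights
print(magnitude_prune(layer, sparsity=0.5))  # [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
```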

Fine-tuning pruning is considered the most classic approach, not only in the machine-learning field but also in the field of equalizers for the compensation of optical channel nonlinearities. Such a pruning technique can be used in a simpler way, e.g., to eliminate coefficients of Volterra equalizers [495–497] or to trim unimportant triplets in perturbation-method-based approaches [193,223,498], making the triplet feature vector more sparse, as well as in the NN-based equalizers (Section 4.2.1). For such complex NN structures, several papers investigated the use of fine-tuning, mainly in short-reach transmission (IM-DD) [499–504], to reduce the complexity of the model. So far, pruning analysis in optical channel equalization has been restricted to the cases where such NN models used only feed-forward layers.

In the context of channel equalization, seemingly the first paper that applied the weight rewinding approach was by Koike-Akino et al. [464], where such a technique was tested in the feed-forward model called ResMLP, which could give a sparsity of 99% when compared with an initial over-parametrized solution with 6 layers and more than $10^6$ parameters.

5.3b Weight Sharing

The weight-sharing compression approach is another method that can be explored to reduce the NN model’s complexity by reducing the number of effective weights used by the model. This approach takes into account that several connections may share the same weight value, and then fine-tunes those shared weights. One common use of weight sharing is in CNNs, where the same filter weights are applied at every position of the input. This allows the model to learn general features that apply to multiple parts of the input, rather than learning separate features for each part. In addition, weight sharing can significantly reduce the number of parameters in a model, reducing the amount of memory required to store the model and the amount of computation required to train it. It can also make the model more efficient at inference, as it requires fewer operations to process the input.

In the case of feed-forward structures, this strategy has already been successfully employed to minimize the complexity of NN models [462,494,505,506]. Following the selection of a centroid initialization technique, the minimal distance from each weight to the centroids is used to determine the shared weights for each layer of a trained network, so that all weights in the same cluster share the same value [494]. The weights are not shared between the layers, both to prevent further performance loss and because sharing weights between sequential layers does not lower the computational complexity. The weight-sharing approach has the advantage of reducing the number of distinct multipliers in matrix multiplication to at most the number of clusters per input element. The results of the multipliers are then sent to the different adders.
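The multiplier saving can be sketched for a single dense layer as follows. Each input element is multiplied once per cluster centroid, and every neuron then simply picks the product that corresponds to its (shared) weight and adds it. The centroids are given rather than fitted here, and all names and values are hypothetical.

```python
from typing import List, Tuple


def assign_clusters(weights: List[List[float]], centroids: List[float]) -> List[List[int]]:
    """Replace every weight by the index of its nearest centroid."""
    return [[min(range(len(centroids)), key=lambda c: abs(w - centroids[c])) for w in row]
            for row in weights]


def shared_dense(inputs: List[float], assignment: List[List[int]],
                 centroids: List[float]) -> Tuple[List[float], int]:
    """Dense layer with shared weights; assignment[j][i] links input i to neuron j."""
    outputs = [0.0] * len(assignment)
    multiplications = 0
    for i, x in enumerate(inputs):
        products = [x * c for c in centroids]  # one multiplication per cluster
        multiplications += len(centroids)
        for j, row in enumerate(assignment):
            outputs[j] += products[row[i]]  # neurons only route and add
    return outputs, multiplications


trained = [[0.45, -0.52, 0.6], [-0.4, 0.55, -0.48],
           [0.5, 0.5, -0.6], [-0.55, -0.45, 0.47]]  # 4 neurons x 3 inputs
shared_values = [-0.5, 0.5]                          # hypothetical centroids
clusters = assign_clusters(trained, shared_values)
print(shared_dense([1.0, 2.0, -1.0], clusters, shared_values))
# ([-1.0, 1.0, 2.0, -2.0], 6)
```

With two clusters, the layer needs 3 x 2 = 6 multiplications instead of the 3 x 4 = 12 of the unshared layer, and the saving grows with the number of neurons.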

5.3c Quantization Techniques

Quantization is used to lower the bitwidth of the numbers participating in arithmetic operations along the signal processing, which typically helps to significantly reduce the computational complexity of the processing. This means that a quantized model can use, for example, integers instead of floating-point numbers for some or all operations. Therefore, quantization allows representing the model using less memory and performing high-performance vectorized operations on a variety of hardware platforms [507].

Quantization has demonstrated excellent and consistent results when used during the training and inference in different NN models [490,507–509]. It is particularly effective during inference because it saves computing resources without significantly decreasing the accuracy. NNs benefit from quantization because they are remarkably robust to aggressive quantization and extreme discretization. This robustness emerges from the large number of parameters involved in the NN, meaning that the NNs are typically over-parameterized. In this subsection, we present the categories of quantization in terms of their mode (post-training quantization [510] or quantization-aware training [511]) and quantization approach (homogeneous [512] or heterogeneous [513]).
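For concreteness, a homogeneous post-training quantization of a set of trained weights can be sketched as below. We assume a symmetric uniform quantizer with a single scale factor for all weights; the function names and the weight values are hypothetical.

```python
from typing import List, Tuple


def quantize_uniform(weights: List[float], n_bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric uniform quantization of trained weights to signed integers."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = max(abs(w) for w in weights) / q_max  # assumes at least one nonzero weight
    quantized = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to real values to inspect the quantization error."""
    return [q * scale for q in quantized]


trained = [0.5, -1.27, 0.003, 1.0]  # hypothetical full-precision weights
int8_weights, step = quantize_uniform(trained, n_bits=8)
print(int8_weights)  # [50, -127, 0, 100]
```

The reconstruction error of every weight is bounded by half the quantization step, so lowering `n_bits` trades memory and arithmetic cost against a larger worst-case error.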

Many quantization strategies have been investigated for equalizing the optical channel. Regarding post-training quantization, Kaneda et al. [514] implemented an MLP-based equalizer with two hidden layers in an FPGA (XCZU9EG FFVC900) using post-training quantization with the traditional uniform int8 quantization, and tested it on an experimental setup of a 50 Gb/s passive optical network (PON) with a 30 km SSMF link. Next, this time using an RNN-based equalizer, Huang et al. [229] tested the equalizer on a PAM4-based 100-Gbps PON signal over a 20 km SSMF testbed and applied post-training quantization, changing the weights’ bitwidth from 8 to 2 bits to study the BER degradation due to the quantization noise. In addition, Huang et al. [229] realized such an equalizer in an FPGA using the Xilinx Vivado toolset for high-level synthesis. For coherent transmission, He et al. [234] introduced a complex-valued dimension-reduced triplet input NN and experimentally tested it with a 16-QAM 80 Gbps single-polarization transmission at 1800 km, with 100 km SSMF in the loop. In this study, to validate the robustness of such a NN equalizer to quantization errors, the authors managed to reduce the bit precision of the weights down to 2 bits with an acceptable decrease in performance.

Moving on to the quantization-aware training (QAT) strategy, an important discussion on the quantization of NN weights was held in [515], emphasizing that the equalizer’s inference should be performed by a fixed-point system rather than a floating-point system. In this paper, an MLP-based equalizer was used, and its weights were quantized with a powers-of-two (PoT) quantization strategy. The authors incorporated the quantization error in the training of the equalizer by using the Learning-Compression (LC) algorithm, which characterizes a QAT strategy. Then, considering a theoretical dispersive channel with additive white Gaussian noise and inter-symbol interference, Xu et al. [516] used a deep CNN equalizer to demonstrate their proposed quantization strategy, which combines QAT and post-equalization to find the most appropriate number of bits in the uniform quantization. The CNN equalizer performs comparably to the full-precision model using just 5-bit weights. More recently, Koike-Akino et al. [464] showed that the additive powers-of-two (APoT) strategy degrades the performance much less than PoT does. In this work, a ResMLP equalizer was tested in simulation for dual-polarization 64/256-QAM, 34 GBd 11CH-WDM transmission over 22 spans of 80 km SSMF, and QAT for APoT-quantized weights was used to assess the performance limits of this quantization strategy. Freire et al. [216] then reported a comprehensive description and comparison of various quantization approaches applied to feed-forward and recurrent NN designs in the context of optical channel equalization. Finally, in [517], an MLP-based coherent optical channel equalizer was experimentally implemented on a Raspberry Pi and a Jetson Nano using pruning and quantization.
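The projection step of a PoT quantizer can be sketched as follows. We assume rounding to the nearest power of two in the logarithmic domain and a hypothetical exponent range; the quantization-aware training loop itself (e.g., the LC algorithm of [515]) is not shown.

```python
import math


def quantize_pot(w: float, min_exp: int = -6, max_exp: int = 0) -> float:
    """Project a weight onto the nearest signed power of two (nearest in the log domain)."""
    if w == 0.0:
        return 0.0
    exponent = round(math.log2(abs(w)))
    exponent = max(min_exp, min(max_exp, exponent))  # clip to the allowed exponents
    return math.copysign(2.0 ** exponent, w)


print([quantize_pot(w) for w in (0.3, -0.7, 0.9, 0.001)])
# [0.25, -0.5, 1.0, 0.015625]
```

Since every quantized weight is a signed power of two, multiplying by it reduces to a bit shift. An APoT weight is instead a sum of a few such terms, which gives a finer grid at the cost of extra additions.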

5.3d Knowledge Distillation

KD [518] describes the process of condensing the information contained in a large, complex model or set of models and passing it to a more manageable, standalone model suitable for deployment in real-world applications. In other words, KD reduces model complexity by transferring the knowledge learned by a larger, pre-trained model (the teacher) to a smaller model (the student), allowing the student to achieve performance similar to the teacher’s with fewer parameters and less computation. This is possible because the student learns the behavior of the teacher by attempting to mimic its outputs at each level, not just for the final loss metric. The different forms of KD are response-based, feature-based, and relation-based KD [519]. Response-based KD focuses on the final output layer of the teacher model; the hypothesis is that the student model will learn to mimic the teacher’s predictions. Feature-based KD focuses on the features that the intermediate layers learn to discriminate; this knowledge is then used to train the student model. Finally, relation-based KD focuses on capturing the relationships between feature maps (e.g., graphs, similarity matrices, feature embeddings, or probabilistic distributions) to train the student model.
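The response-based form of KD can be sketched as a loss that combines a soft term, which pulls the student’s output distribution toward the teacher’s, with the usual hard-label term. The example below assumes a classification setting with temperature-softened softmax outputs; the function names, the temperature, and the weighting factor `alpha` are illustrative assumptions. For a regression-type NN, the soft term could instead be, e.g., the MSE between the student and teacher outputs.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)            # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Return (total, soft, hard) for a response-based distillation loss.

    soft: KL(teacher || student) on temperature-softened outputs
    hard: cross-entropy of the student with respect to the true labels
    """
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1
    ).mean()
    p_hard = softmax(student_logits)
    hard = -np.log(p_hard[np.arange(len(labels)), labels] + eps).mean()
    total = alpha * temperature ** 2 * soft + (1.0 - alpha) * hard
    return total, soft, hard

rng = np.random.default_rng(2)
teacher_logits = rng.normal(size=(8, 4))     # 8 samples, 4 classes (hypothetical)
student_logits = rng.normal(size=(8, 4))
labels = rng.integers(0, 4, size=8)

total, soft, hard = kd_loss(student_logits, teacher_logits, labels)
_, soft_same, _ = kd_loss(teacher_logits, teacher_logits, labels)
print(f"total = {total:.4f}, soft = {soft:.4f}, hard = {hard:.4f}")
```

The soft term vanishes when the student reproduces the teacher’s logits exactly, so minimizing it drives the student to mimic the teacher’s predictions.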

KD has been considered for optical applications only very recently. In [187], the authors presented the technique for low-complexity channel modeling. First, an efficient teacher deep complex CNN model was trained. Then the student NN, termed OptiDistillNet, was developed; it was shown to generalize and converge better, run faster, and use fewer trainable parameters. At the same time, the loss function (MSE) value stayed very close to that of the teacher model when the student model size was only 91.2% of that for the teacher model. In [520], the authors used KD to remove the CNNs’ nonlinearity, which was claimed to be a major challenge in the spectral approach to CNNs and in their optical implementation. A linear counterpart of the CNN architecture was designed using KD, and its optical implementation was considered. Finally, in [521], KD was used not only to reduce the complexity of a recurrent NN-based equalizer but also to change its topology from recurrent to feed-forward. It was demonstrated that the performance of the student’s parallelizable NN structure (CNN) is very close to that of the original BiLSTM (teacher) model.

5.3e Parallelization Aspect of NN Implementation

Feed-forward NNs are inherently parallelizable: the computations within each layer are independent of each other, so they can be distributed across multiple processors or cores. In contrast, the recurrent structure of RNNs makes parallelization more challenging. Specifically, the computations at each time step of an RNN depend on the result of the previous time step, which makes it infeasible to parallelize computations across time steps. Consequently, hardware parallelization of RNNs is limited. To address this challenge, specialized hardware such as graphics processing units (GPUs) has been developed to support the parallelization of RNNs. A common technique for parallelizing RNNs on GPUs is to unroll the RNN over a fixed number of time steps [522], creating a feed-forward network with shared weights. This enables parallel computation across the unrolled time steps, thereby allowing effective use of the parallelization capabilities of GPUs. However, the number of time steps over which an RNN is unrolled affects its performance, with longer sequences requiring more memory and potentially leading to overfitting. Therefore, the optimal number of unrolled time steps is often a matter of research and experimentation. Another approach is to use specialized hardware designed to handle the recurrent nature of RNNs. An example is the neural processing unit [523], a hardware accelerator for NNs that is optimized for both feed-forward and recurrent computation.
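A minimal sketch of unrolling is given below for a vanilla RNN cell with a tanh activation. The network is unrolled over a fixed number of steps, `n_steps`, with the same weights reused at every step, and many input windows are pushed through the unrolled structure at once. All names, sizes, and the sliding-window construction are assumptions made for illustration. In this sketch the loop over time remains sequential, and the parallelism comes from processing all windows simultaneously.

```python
import numpy as np

def rnn_step(h, x_t, W_x, W_h, b):
    """One recurrent update; the same weights are shared by all time steps."""
    return np.tanh(x_t @ W_x + h @ W_h + b)

def rnn_unrolled(windows, W_x, W_h, b):
    """windows: (n_windows, n_steps, n_in). Returns the final hidden states."""
    n_windows, n_steps, _ = windows.shape
    h = np.zeros((n_windows, W_h.shape[0]))
    for t in range(n_steps):                       # step t needs step t - 1
        h = rnn_step(h, windows[:, t, :], W_x, W_h, b)   # all windows at once
    return h

rng = np.random.default_rng(3)
n_in, n_hidden, n_steps = 2, 5, 8                  # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(n_in, n_hidden))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

# Cut a long input sequence into overlapping windows of n_steps samples.
signal = rng.normal(size=(100, n_in))
windows = np.stack(
    [signal[i:i + n_steps] for i in range(len(signal) - n_steps + 1)]
)

h_parallel = rnn_unrolled(windows, W_x, W_h, b)

# Reference: process the windows one by one, sample by sample.
h_reference = np.zeros_like(h_parallel)
for i, window in enumerate(windows):
    h = np.zeros(n_hidden)
    for x_t in window:
        h = rnn_step(h, x_t, W_x, W_h, b)
    h_reference[i] = h

print(h_parallel.shape, np.allclose(h_parallel, h_reference))
```

The memory needed to hold `windows` grows linearly with `n_steps`, which is one practical reason why the unrolling length has to be chosen with care.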

Finally, it is worth mentioning that [524] presented an important study evaluating architectural variations with very different degrees of parallelism, which produce trade-offs between area, speed, and reliability. With the increasing complexity and size of state-of-the-art NN topologies, it becomes impractical to deploy all the required processing elements (PEs) on FPGA devices because of the limited logic resources available. Consequently, as shown in [524], the trade-offs between area utilization, performance, and reliability at different levels of parallelism are crucial considerations in optimizing NN accelerators. Achieving maximum parallelism by incorporating a large number of PEs might yield high computational efficiency, but it comes at the cost of increased logic resource utilization and may exceed the FPGA device’s capacity. It is therefore advisable to evaluate these trade-offs to determine the level of parallelism that best balances resource utilization, performance, and dependability. Such an analysis contributes to the development of efficient and robust NN accelerator designs that take into account hardware limitations (such as those of FPGA devices [525]).
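The area versus speed part of this trade-off can be illustrated with a toy cost model in which every PE performs one multiply-accumulate (MAC) operation per clock cycle. Everything in the sketch below is a hypothetical assumption chosen for illustration (the MAC count, the logic cost per PE, the logic budget, and the clock frequency); none of the numbers is taken from [524] or [525], and reliability is not modeled at all.

```python
import math

def evaluate_design(n_macs, n_pes, luts_per_pe, lut_budget, clock_hz):
    """Toy model: latency and logic usage of an accelerator with n_pes PEs."""
    cycles = math.ceil(n_macs / n_pes)     # MACs are shared evenly among PEs
    luts = n_pes * luts_per_pe
    return {
        "n_pes": n_pes,
        "latency_s": cycles / clock_hz,
        "luts": luts,
        "fits": luts <= lut_budget,
    }

# Hypothetical workload and device parameters.
n_macs = 64 * 32 + 32 * 1      # MACs per inference of a small two-layer MLP
luts_per_pe = 600
lut_budget = 50_000
clock_hz = 200e6

designs = [
    evaluate_design(n_macs, 2 ** k, luts_per_pe, lut_budget, clock_hz)
    for k in range(9)          # 1, 2, 4, ..., 256 PEs
]
feasible = [d for d in designs if d["fits"]]
best = min(feasible, key=lambda d: d["latency_s"])

for d in designs:
    print(d)
print("fastest design within the logic budget:", best["n_pes"], "PEs")
```

Under these assumptions, doubling the number of PEs roughly halves the latency until the logic budget is exhausted; a real design study would also account for memory bandwidth, routing, and the reliability metrics discussed above.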

6. Conclusions and Perspectives

Optical systems are capable of generating large datasets in a short time, making various data-driven methods and techniques especially attractive and efficient in this field. NNs, in particular, can fundamentally transform the design approaches in material science and optical engineering; the methodology of optical measurements and characterization; and the architecture, operation, and control of photonic devices and systems. Data-based techniques are revolutionizing optical sensing and imaging, improving resolution, accuracy, speed, and power consumption. Machine learning will increasingly contribute to the data-driven discovery (via sparsity-promoting techniques) of the physical models and master equations underlying the operation of complex photonic systems. We anticipate that, in the future, laser systems of growing complexity will evolve into digital laser twins that will allow acquired data to be converted into efficient control of such systems. In optical communications, the channel model is typically represented by complicated nonlinear equations, and the NN-based approach has already proven to be highly efficient in modeling signal propagation through the system, inverting optical channels (for equalization), and controlling transmission systems. In addition, in optical applications we almost always have to deal with a non-negligible noise impact, a situation in which machine-learning methods can really flourish.

We want to reiterate that there is a plethora of other important research areas at the interface of photonics and machine learning that have not been discussed in this review–tutorial, for instance, the all-optical implementation of NNs, which we expect to grow substantially in the near future.

Though many existing (and future) problems in photonics can be solved using already available and well-established machine-learning approaches, we anticipate that emerging data-driven photonics will require the development of new efficient algorithms specifically designed for optical applications, having a feedback impact on data science and leading to new synergetic concepts at the interface of photonics and machine learning. Overall, one of our goals in this tutorial is to stimulate the exchange of methods and ideas between data scientists and optical researchers/engineers. We strongly believe that the mutual penetration and cross-fertilization of these two disciplines will soon lead to unexpected innovations and fundamental breakthroughs in both fields.

Funding

Horizon 2020 Framework Programme (813144, MSCA EID REAL-NET); Leverhulme Trust (RPG-2018-063); Engineering and Physical Sciences Research Council United Kingdom (EP/R035342/1, EP/W002868/1).

Acknowledgments

This paper was supported by the EU Horizon 2020 program under the Marie Sklodowska-Curie grant agreement 813144 (REAL-NET). EM acknowledges the support of the EPSRC project EP/W002868/1. JEP is supported by Leverhulme Trust, Grant No. RPG-2018-063. SKT acknowledges the support of the EPSRC project TRANSNET.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

The data used for the results presented in this work are available upon request from the authors.

References and Notes

1. T. Mitchell, Machine Learning (McGraw-Hill, 1997).

2. B. Yegnanarayana, Artificial Neural Networks (PHI Learning Pvt. Ltd., 2009).

3. M. Pak and S. Kim, “A review of deep learning in image recognition,” in 2017 4th International Conference on Computer Applications and Information Processing Technology (CAIPT) (2017), pp. 1–3.

4. J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, and T. Chen, “Recent advances in convolutional neural networks,” Pattern Recognit. 77, 354–377 (2018).

5. J. Karhunen, T. Raiko, and K. Cho, “Unsupervised deep learning: a short review,” Adv. Ind. Compon. Anal. Learning Mach. 1, 125–142 (2015).

6. J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Mach. Learn. 109, 373–440 (2020).

7. A. Oliver, A. Odena, C. A. Raffel, E. D. Cubuk, and I. Goodfellow, “Realistic evaluation of deep semi-supervised learning algorithms,” in Advances in Neural Information Processing Systems, Vol. 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds. (Curran Associates, Inc., 2018).

8. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature 518, 529–533 (2015).

9. D. Silver, A. Huang, C. J. Maddison, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature 529, 484–489 (2016).

10. Y. Li, “Deep reinforcement learning: an overview,” arXiv, arXiv:1701.07274 (2017).

11. P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32 (2018).

12. P. J. Freire, A. Napoli, B. Spinnler, N. Costa, S. K. Turitsyn, and J. E. Prilepsky, “Neural networks-based equalizers for coherent optical transmission: caveats and pitfalls,” IEEE J. Sel. Top. Quantum Electron. 28, 1–23 (2022).

13. D. Zibar, H. Wymeersch, and I. Lyubomirsky, “Machine learning under the spotlight,” Nat. Photonics 11, 749–751 (2017).

14. D. Zibar, M. Piels, R. Jones, and C. G. Schäeffer, “Machine learning techniques in optical communication,” J. Lightwave Technol. 34, 1442–1452 (2016).

15. F. Musumeci, C. Rottondi, A. Nag, I. Macaluso, D. Zibar, M. Ruffini, and M. Tornatore, “An overview on application of machine learning techniques in optical networks,” IEEE Commun. Surv. Tutorials 21, 1383–1408 (2019).

16. G. Genty, L. Salmela, J. M. Dudley, D. Brunner, A. Kokhanovskiy, S. M. Kobtsev, and S. K. Turitsyn, “Machine learning and applications in ultrafast photonics,” Nat. Photonics 15, 91–101 (2021).

17. J. W. Nevin, S. Nallaperuma, N. A. Shevchenko, X. Li, M. S. Faruk, and S. J. Savory, “Machine learning for optical fiber communication systems: an introduction and overview,” APL Photonics 6, 121101 (2021).

18. D. Piccinotti, K. F. MacDonald, S. A. Gregory, I. Youngs, and N. I. Zheludev, “Artificial intelligence for photonics and photonic materials,” Rep. Prog. Phys. 84, 012401 (2021).

19. M. Närhi, L. Salmela, J. Toivonen, C. Billet, J. M. Dudley, and G. Genty, “Machine learning analysis of extreme events in optical fibre modulation instability,” Nat. Commun. 9, 4923 (2018).

20. F. N. Khan, C. Lu, and A. P. T. Lau, “Machine learning methods for optical communication systems,” in Advanced Photonics 2017 (IPR, NOMA, Sensors, Networks, SPPCom, PS) (Optica Publishing Group, 2017), p. SpW2F.3.

21. W. Ma, Z. Liu, Z. Kudyshev, A. Boltasseva, W. Cai, and Y. Liu, “Deep learning for the design of photonic structures,” Nat. Photonics 15, 77–90 (2021).

22. L. Pilozzi, F. A. Farrelly, G. Marcucci, and C. Conti, “Machine learning inverse problem for topological photonics,” Commun. Phys. 1, 57 (2018).

23. L. Pilozzi, F. A. Farrelly, G. Marcucci, and C. Conti, “Topological nanophotonics and artificial neural networks,” Nanotechnology 32, 142001 (2021).

24. J. Jiang, M. Chen, and J. A. Fan, “Deep neural networks for the evaluation and design of photonic devices,” Nat. Rev. Mater. 6, 679–700 (2020).

25. Y. Xu, X. Zhang, Y. Fu, and Y. Liu, “Interfacing photonics with artificial intelligence: an innovative design strategy for photonic structures and devices based on artificial neural networks,” Photonics Res. 9, B135–B152 (2021).

26. F. Vernuccio, A. Bresci, V. Cimini, A. Giuseppi, G. Cerullo, D. Polli, and C. M. Valensise, “Artificial intelligence in classical and quantum photonics,” Laser Photonics Rev. 16, 2100399 (2022).

27. Y. P. Raykov and D. Saad, “Principled machine learning,” IEEE J. Sel. Top. Quantum Electron. 28, 1–19 (2022).

28. T. Wettlin, S. Pachnicke, T. Rahman, J. Wei, S. Calabro, and N. Stojanovic, “Complexity reduction of Volterra nonlinear equalization for optical short-reach IM/DD systems,” in Photonic Networks; 21th ITG-Symposium (VDE, 2020), pp. 1–6.

29. J. Wei, L. Yi, E. Giacoumidis, Q. Cheng, and A. Lau, “Special issue on ‘Optics for AI and AI for Optics’,” Appl. Sci. 10, 3262 (2020).

30. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11, 441–446 (2017).

31. S. Sunada and A. Uchida, “Photonic neural field on a silicon chip: large-scale, high-speed neuro-inspired computing and sensing,” Optica 8, 1388–1396 (2021).

32. C. Huang, S. Fujisawa, T. F. de Lima, A. N. Tait, E. C. Blow, Y. Tian, S. Bilodeau, A. Jha, F. Yaman, H.-T. Peng, H. G. Batshon, B. J. Shastri, Y. Inada, T. Wang, and P. R. Prucnal, “A silicon photonic–electronic neural network for fibre nonlinearity compensation,” Nat. Electron. 4, 837–844 (2021).

33. B. J. Shastri, C. Huang, A. N. Tait, T. F. de Lima, and P. R. Prucnal, “Silicon photonic neural network applications and prospects,” in AI and Optical Data Sciences III, Vol. 12019 (SPIE, 2022), pp. 135–144.

34. C. Huang, V. J. Sorger, M. Miscuglio, M. Al-Qadasi, A. Mukherjee, L. Lampe, M. Nichols, A. N. Tait, T. Ferreira de Lima, B. A. Marquez, P. R. Prucnal, and B. J. Shastri, “Prospects and applications of photonic neural networks,” Adv. Phys.: X 7, 1981155 (2022).

35. T. F. de Lima, H.-T. Peng, A. N. Tait, M. A. Nahmias, H. B. Miller, B. J. Shastri, and P. R. Prucnal, “Machine learning with neuromorphic photonics,” J. Lightwave Technol. 37, 1515–1534 (2019).

36. T. F. de Lima, B. J. Shastri, A. N. Tait, M. A. Nahmias, and P. R. Prucnal, “Progress in neuromorphic photonics,” Nanophotonics 6, 577–599 (2017).

37. K. Berggren, Q. Xia, K. K. Likharev, et al., “Roadmap on emerging hardware and technology for machine learning,” Nanotechnology 32, 012002 (2021).

38. G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljacic, C. Denz, D. A. B. Miller, and D. Psaltis, “Inference in artificial intelligence with deep optics and photonics,” Nature 588, 39–47 (2020).

39. D. Brunner, M. C. Soriano, and S. Fan, “Neural network learning with photonics and for photonic circuit design,” Nanophotonics 12, 773–775 (2023).

40. S. Pai, Z. Sun, T. W. Hughes, T. Park, B. Bartlett, I. A. Williamson, M. Minkov, M. Milanizadeh, N. Abebe, F. Morichetti, A. Melloni, S. Fan, O. Solgaard, and D. A. B. Miller, “Experimentally realized in situ backpropagation for deep learning in photonic neural networks,” Science 380, 398–404 (2023).

41. M. Miscuglio, Z. Hu, S. Li, J. K. George, R. Capanna, H. Dalir, P. M. Bardet, P. Gupta, and V. J. Sorger, “Massively parallel amplitude-only Fourier neural network,” Optica 7, 1812–1819 (2020).

42. M. Miscuglio and V. Sorger, “Photonic tensor cores for machine learning,” Appl. Phys. Rev. 7, 031404 (2020).

43. M. Miscuglio, A. Mehrabian, Z. Hu, S. I. Azzam, J. George, A. V. Kildishev, M. Pelton, and V. J. Sorger, “All-optical nonlinear activation function for photonic neural networks,” Opt. Mater. Express 8, 3851–3863 (2018).

44. N. Peserico, B. J. Shastri, and V. J. Sorger, “Integrated photonic tensor processing unit for a matrix multiply: a review,” J. Lightwave Technol. 41, 3704–3716 (2023).

45. D. V. Christensen, R. Dittmann, B. Linares-Barranco, et al., “2022 roadmap on neuromorphic computing and engineering,” Neuromorph. Comput. Eng. 2, 022501 (2022).

46. N. Peserico, T. F. de Lima, P. Prucnal, and V. J. Sorger, “Emerging devices and packaging strategies for electronic–photonic AI accelerators: opinion,” Opt. Mater. Express 12, 1347–1351 (2022).

47. A. Mehonic and A. J. Kenyon, “Brain-inspired computing needs a master plan,” Nature 604, 255–260 (2022).

48. W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bull. Math. Biophys. 5, 115–133 (1943).

49. F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychol. Rev. 65, 386–408 (1958).

50. S. Hong, H. Kang, J. Kim, and K. Cho, “Low voltage time-based matrix multiplier-and-accumulator for neural computing system,” Electronics 9, 2138 (2020).

51. M. Heidari and H. Shamsi, “Analog programmable neuron and case study on VLSI implementation of multi-layer perceptron (MLP),” Microelectron. J. 84, 36–47 (2019).

52. C. Geng, Q. Sun, and S. Nakatake, “An analog CMOS implementation for multi-layer perceptron with ReLU activation,” in 2020 9th International Conference on Modern Circuits and Systems Technologies (MOCAST) (IEEE, 2020), pp. 1–6.

53. S. Abden and E. Azab, “Multilayer perceptron analog hardware implementation using low power operational transconductance amplifier,” in 2020 32nd International Conference on Microelectronics (ICM) (IEEE, 2020), pp. 1–4.

54. R. Sarpeshkar, “Analog versus digital: extrapolating from electronics to neurobiology,” Neural Comput. 10, 1601–1638 (1998).

55. H. Zhou, J. Dong, J. Cheng, W. Dong, C. Huang, Y. Shen, Q. Zhang, M. Gu, C. Qian, H. Chen, Z. Ruan, and X. Zhang, “Photonic matrix multiplication lights up photonic accelerator and beyond,” Light: Sci. Appl. 11, 30 (2022).

56. N. C. Harris, J. Carolan, D. Bunandar, M. Prabhu, M. Hochberg, T. Baehr-Jones, M. L. Fanto, A. M. Smith, C. C. Tison, P. M. Alsing, and D. Englund, “Linear programmable nanophotonic processors,” Optica 5, 1623–1631 (2018).

57. W. Bogaerts, D. Pérez, J. Capmany, D. A. Miller, J. Poon, D. Englund, F. Morichetti, and A. Melloni, “Programmable photonic circuits,” Nature 586, 207–216 (2020).

58. T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5, 864–871 (2018).

59. C. Roques-Carmes, “Learning photons go backward,” Science 380, 341–342 (2023).

60. J. Cheng, H. Zhou, and J. Dong, “Photonic matrix computing: from fundamentals to applications,” Nanomaterials 11, 1683 (2021).

61. H. Zhang, M. Gu, X. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, M. Yung, Y. Z. Shi, F. K. Muhammad, G. Q. Lo, X. S. Luo, B. Dong, D. L. Kwong, L. C. Kwek, and A. Q. Liu, “An optical neural chip for implementing complex-valued neural network,” Nat. Commun. 12, 1–11 (2021).

62. N. Peserico, X. Ma, B. J. Shastri, and V. J. Sorger, “Photonic tensor core for machine learning: a review,” Emerg. Top. Artif. Intell. (ETAI) 2022 12204, 15–60 (2022).

63. M. Thomaschewski, Z. Hu, B. M. Nouri, Y. Gui, H. Wang, S. Altaleb, H. Dalir, and V. J. Sorger, “High-performance optoelectronics for integrated photonic neural networks,” in AI and Optical Data Sciences IV, Vol. 12438 (SPIE, 2023), pp. 262–271.

64. S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, “1D convolutional neural networks and applications: a survey,” Mech. Syst. Signal Process. 151, 107398 (2021).

65. R. Woods, J. McAllister, G. Lightbody, and Y. Yi, FPGA-Based Implementation of Signal Processing Systems (Wiley Publishing, 2017).

66. J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8, 12324 (2018).

67. J. L. Elman, “Finding structure in time,” Cogn. Sci. 14, 179–211 (1990).

68. Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv, arXiv:1506.00019 (2015).

69. Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw. 5, 157–166 (1994).

70. A. Sanchez-Caballero, D. Fuentes-Jimenez, and C. Losada-Gutiérrez, “Exploiting the convLSTM: human action recognition using raw depth video-based recurrent neural networks,” arXiv, arXiv:2006.07744 (2020).

71. S. Saha, N. Majumder, D. Sangani, and A. Das Bhattacharjee, “Comprehensive forecasting-based analysis using hybrid and stacked stateful/stateless models,” in Advances in Distributed Computing and Machine Learning (Springer, 2022), pp. 567–579.

72. T.-T. Pham, M. Pister, and P. Couvée, “Recurrent neural network for classifying of HPC applications,” in 2019 Spring Simulation Conference (SpringSim) (IEEE, 2019), pp. 1–12.

73. This is similar to other supervised learning methods, where we assume that the batches of the dataset passed to the network are i.i.d. with respect to each other.

74. Most applications in practice use the stateless RNN, because if we use the stateful RNN, then in production, the network is forced to deal with infinitely long sequences, and this property can be quite difficult to handle.

75. M. E. Van Valkenburg, Reference Data for Engineers: Radio, Electronics, Computers and Communications (Newnes, 2001).

76. R. Storn, “Differential evolution design of an IIR-filter,” in Proceedings of IEEE International Conference on Evolutionary Computation (IEEE, 1996), pp. 268–273.

77. R. G. Brown and P. Y. C. Hwang, Introduction to Random Signals and Applied Kalman Filtering: with Matlab Exercises, 4th ed. (John Wiley & Sons, Inc., 2012).

78. E. Brookner, Tracking and Kalman Filtering Made Easy (Wiley, 1998).

79. H. Cruse, Neural Networks as Cybernetic Systems (Brains, Minds & Media, 2006).

80. The word “allowed” in this statement means that we can impose some specific borders on the range of each parameter’s change, based on our experience, a priori information, desired solution properties, etc.

81. Sometimes, in the literature, the variables κ_t and β_t in (12) are marked with index t − 1, but not t; of course, this change of notation does not affect the physical meaning of the result. The matrices W^K, U^K, and H^K can also be altered with the index t, but we omit this dependence here for simplicity.

82. J.-N. Juang, C.-W. Chen, and M. Phan, “Estimation of Kalman filter gain from output residuals,” J. Guid. Control. Dyn. 16, 903–908 (1993).

83. J. DeCruyenaere and H. Hafez, “A comparison between Kalman filters and recurrent neural networks,” in [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, Vol. 4 (IEEE, 1992), pp. 247–251.

84. We do not consider here the case of stochastic NNs [85].

85. L. V. Jospin, H. Laga, F. Boussaid, W. Buntine, and M. Bennamoun, “Hands-on Bayesian neural networks—a tutorial for deep learning users,” IEEE Comput. Intell. Mag. 17, 29–48 (2022).

86. C.-W. Chen, “Integrated system identification and adaptive state estimation for control of flexible space structures,” Ph.D. thesis (Old Dominion University, 1991).

87. As noted in [82], when the true system deviates from the ideal Kalman case, the resulting filter identified through the input–measurement pairs is not the Kalman filter. In such a case, the identified filter is simply an observer that is computed from input–output data so as to minimize the filter residual in an MSE sense.

88. S. K. Chenna, Y. K. Jain, H. Kapoor, R. S. Bapi, N. Yadaiah, A. Negi, V. S. Rao, and B. L. Deekshatulu, “State estimation and tracking problems: a comparison between Kalman filter and recurrent neural networks,” in Neural Information Processing, N. R. Pal, N. Kasabov, R. K. Mudi, S. Pal, and S. K. Parui, eds. (Springer, 2004), pp. 275–281.

89. A. Parlos, S. Menon, and A. Atiya, “An algorithmic approach to adaptive state filtering using recurrent neural networks,” IEEE Trans. Neural Netw. 12, 1411–1432 (2001).

90. S. S. Haykin, ed., Kalman Filtering and Neural Networks, Vol. 284 (Wiley Online Library, 2001).

91. D. P. Mandic and V. S. L. Goh, Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models (John Wiley & Sons, 2009).

92. Y. Shao, F. M. Dietrich, C. Nettelblad, and C. Zhang, “Training algorithm matters for the performance of neural network potential: a case study of Adam and the Kalman filter optimizers,” J. Chem. Phys. 155, 204108 (2021).

93. A. N. Tait, T. F. De Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep. 7, 7430 (2017).

94. J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, and D. Brunner, “Reinforcement learning in a large-scale photonic recurrent neural network,” Optica 5, 756–760 (2018).

95. B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, “Photonics for artificial intelligence and neuromorphic computing,” Nat. Photonics 15, 102–114 (2021).

96. H.-T. Peng, J. C. Lederman, L. Xu, T. F. de Lima, C. Huang, B. J. Shastri, D. Rosenbluth, and P. R. Prucnal, “A photonics-inspired compact network: toward real-time AI processing in communication systems,” IEEE J. Sel. Top. Quantum Electron. 28, 1–17 (2022).

97. T. W. Hughes, I. A. D. Williamson, M. Minkov, and S. Fan, “Wave physics as an analog recurrent neural network,” Sci. Adv. 5, eaay6946 (2019).

98. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput. 9, 1735–1780 (1997).

99. F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: continual prediction with LSTM,” Neural Comput. 12, 2451–2471 (2000).

100. K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” arXiv, arXiv:1406.1078 (2014).

101. The difference in the LSTM and GRU functioning is studied in detail in [102].

102. J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” in NIPS 2014 Workshop on Deep Learning (2014).

103. R. Dey and F. M. Salem, “Gate-variants of gated recurrent unit (GRU) neural networks,” in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS) (IEEE, 2017), pp. 1597–1600.

104. J. C. Heck and F. M. Salem, “Simplified minimal gated unit variations for recurrent neural networks,” in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS) (IEEE, 2017), pp. 1593–1596.

105. H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” Science 304, 78–80 (2004).

106. Q. Wu, E. Fokoue, and D. Kudithipudi, “On the statistical challenges of echo state networks and some potential remedies,” arXiv, arXiv:1802.07369 (2018).

107. M. Sorokina, S. Sergeyev, and S. Turitsyn, “Fiber echo state network analogue for high-bandwidth dual-quadrature signal processing,” Opt. Express 27, 2387–2395 (2019).

108. S. S. Mosleh, L. Liu, C. Sahin, Y. R. Zheng, and Y. Yi, “Brain-inspired wireless communications: where reservoir computing meets MIMO-OFDM,” IEEE Trans. Neural Netw. Learning Syst. 29, 4694–4708 (2018).

109. C. Sun, M. Song, S. Hong, and H. Li, “A review of designs and applications of echo state networks,” arXiv, arXiv:2012.02974 (2020).

110. H. Jaeger, M. Lukoševičius, D. Popovici, and U. Siewert, “Optimization and applications of echo state networks with leaky-integrator neurons,” Neural Networks 20, 335–352 (2007).

111. G. Van der Sande, D. Brunner, and M. C. Soriano, “Advances in photonic reservoir computing,” Nanophotonics 6, 561–576 (2017).

112. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv, arXiv:1409.0473 (2014).

113. J. Quinn, J. McEachen, M. Fullan, M. Gardner, and M. Drummy, Dive into Deep Learning: Tools for Engagement (Corwin Press, 2019).

114. M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv, arXiv:1508.04025 (2015).

115. Y. Kim, C. Denton, L. Hoang, and A. M. Rush, “Structured attention networks,” arXiv, arXiv:1702.00887 (2017).

116. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Adv. Neural Inf. Process. Syst. 30, 1 (2017).

117. An important aspect of this setup is that each attention head has its own W_V, W_Q, and W_K transforms. That means that each head can zoom in and expand the parts of the embedded space that it wants to focus on, and it can be different from what each of the other heads is focusing on.

118. Usually, d_v is considered to be equal to d_k, but in reality they do not have to be.

119. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.

120. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv, arXiv:1607.06450 (2016).

121. B. B. Hamgini, H. Najafi, A. Bakhshali, and Z. Zhang, “Application of transformers for nonlinear channel compensation in optical systems,” arXiv, arXiv:2304.13119 (2023).

122. R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv, arXiv:1505.00387 (2015).

123. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4700–4708.

124. H. Dou, Y. Deng, T. Yan, H. Wu, X. Lin, and Q. Dai, “Residual D²NN: training diffractive deep neural networks via learnable light shortcuts,” Opt. Lett. 45, 2688–2691 (2020). [CrossRef]

125. C. Gin, B. Lusch, S. L. Brunton, and J. N. Kutz, “Deep learning models for global coordinate transformations that linearise PDEs,” Eur. J. Appl. Math 32, 515–539 (2021). [CrossRef]

126. D. S. Broomhead and D. Lowe, “Radial basis functions, multi-variable functional interpolation and adaptive networks,” Tech. rep., Royal Signals and Radar Establishment Malvern (United Kingdom) (1988).

127. Q. Que and M. Belkin, “Back to the future: radial basis function networks revisited,” in Artificial Intelligence and Statistics (PMLR, 2016), pp. 1375–1383.

128. L. Beheim, A. Zitouni, F. Belloir, and C. d. M. de la Housse, “New RBF neural network classifier with optimized hidden neurons number,” WSEAS Trans. Syst. 2, 467–472 (2004).

129. R. De Maesschalck, D. Jouan-Rimbaud, and D. L. Massart, “The Mahalanobis distance,” Chemom. Intell. Lab. Syst. 50, 1–18 (2000). [CrossRef]

130. J. Park and I. W. Sandberg, “Universal approximation using radial-basis-function networks,” Neural Comput. 3, 246–257 (1991). [CrossRef]

131. G. Böcherer, Lecture Notes on Machine Learning for Communications (2021).

132. D. Bank, N. Koenigstein, and R. Giryes, “Autoencoders,” arXiv, arXiv:2003.05991 (2020). [CrossRef]

133. A. Venketeswaran, N. Lalam, J. Wuenschell, P. R. Ohodnicki Jr, M. Badar, K. P. Chen, P. Lu, Y. Duan, B. Chorpening, and M. Buric, “Recent advances in machine learning for fiber optic sensor applications,” Adv. Intell. Syst. 4, 2100067 (2022). [CrossRef]

134. C. Doersch, “Tutorial on variational autoencoders,” arXiv, arXiv:1606.05908 (2016). [CrossRef]

135. D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv, arXiv:1312.6114 (2013). [CrossRef]

136. T. Baumeister, S. L. Brunton, and J. N. Kutz, “Deep learning and model predictive control for self-tuning mode-locked lasers,” J. Opt. Soc. Am. B 35, 617–626 (2018). [CrossRef]

137. Y. Chen, T. Zhou, J. Wu, H. Qiao, X. Lin, L. Fang, and Q. Dai, “Photonic unsupervised learning variational autoencoder for high-throughput and low-latency image transmission,” Sci. Adv. 9, eadf8437 (2023). [CrossRef]

138. A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv, arXiv:1511.05644 (2015). [CrossRef]

139. Z. A. Kudyshev, A. V. Kildishev, V. M. Shalaev, and A. Boltasseva, “Machine-learning-assisted metasurface design for high-efficiency thermal emitter optimization,” Appl. Phys. Rev. 7, 021407 (2020). [CrossRef]

140. Z. A. Kudyshev, A. V. Kildishev, V. M. Shalaev, and A. Boltasseva, “Machine learning–assisted global optimization of photonic devices,” Nanophotonics 10, 371–383 (2020). [CrossRef]

141. A. Creswell and A. A. Bharath, “Denoising adversarial autoencoders,” IEEE Trans. Neural Netw. Learning Syst. 30, 968–984 (2019). [CrossRef]

142. W. Xie, B. Liu, Y. Li, J. Lei, and Q. Du, “Autoencoder and adversarial-learning-based semisupervised background estimation for hyperspectral anomaly detection,” IEEE Trans. Geosci. Remote Sensing 58, 5416–5427 (2020). [CrossRef]

143. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Commun. ACM 63, 139–144 (2020). [CrossRef]

144. K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative adversarial networks: introduction and outlook,” IEEE/CAA J. Autom. Sinica 4, 588–598 (2017). [CrossRef]

145. J. Gui, Z. Sun, Y. Wen, D. Tao, and J. Ye, “A review on generative adversarial networks: algorithms, theory, and applications,” IEEE Trans. Knowl. Data Eng. 35, 3313–3332 (2023). [CrossRef]

146. D. Wang and M. Zhang, “Artificial intelligence in optical communications: from machine learning to deep learning,” Front. Comms. Net. 2, 656786 (2021). [CrossRef]

147. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv, arXiv:1511.06434 (2016). [CrossRef]

148. A. Cohen and S. Derevyanko, “Generative adversarial network and end-to-end learning for optical fiber communication systems limited by the nonlinear phase noise,” in 2021 IEEE International Conference on Microwaves, Antennas, Communications and Electronic Systems (COMCAS) (IEEE, 2021), pp. 241–246.

149. V. Nguyen, “Bayesian optimization for accelerating hyper-parameter tuning,” in 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE) (IEEE, 2019), pp. 302–305.

150. H. Cho, Y. Kim, E. Lee, D. Choi, Y. Lee, and W. Rhee, “Basic enhancement strategies when using Bayesian optimization for hyperparameter tuning of deep neural networks,” IEEE Access 8, 52588–52608 (2020). [CrossRef]

151. J. Wu, X.-Y. Chen, H. Zhang, L.-D. Xiong, H. Lei, and S.-H. Deng, “Hyperparameter optimization for machine learning models based on Bayesian optimization,” J. Electron. Sci. Technol. 17, 26–40 (2019). [CrossRef]

152. M. Sena, M. S. Erkilinc, T. Dippon, B. Shariati, R. Emmerich, J. K. Fischer, and R. Freund, “Bayesian optimization for nonlinear system identification and pre-distortion in cognitive transmitters,” J. Lightwave Technol. 39, 5008–5020 (2021). [CrossRef]

153. This list is not exhaustive; new alternatives and variants of the existing methods emerge constantly [155–157].

154. S. C. Smithson, G. Yang, W. J. Gross, and B. H. Meyer, “Neural networks designing neural networks: multi-objective hyper-parameter optimization,” in 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (IEEE, 2016), pp. 1–8.

155. E.-G. Talbi, “Automated design of deep neural networks: a survey and unified taxonomy,” ACM Comput. Surv. 54, 1–37 (2022). [CrossRef]

156. M. Pinos, V. Mrazek, and L. Sekanina, “Evolutionary approximation and neural architecture search,” Genet. Program. Evolvable Mach. 23, 351–374 (2022). [CrossRef]

157. If enough data are available, we can use independent datasets for training, validation, and testing instead of cross-validation.

158. B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, “Taking the human out of the loop: a review of Bayesian optimization,” Proc. IEEE 104, 148–175 (2016). [CrossRef]

159. F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in International Conference on Learning and Intelligent Optimization (Springer, 2011), pp. 507–523.

160. T. Joyce and J. M. Herrmann, “A review of no free lunch theorems, and their implications for metaheuristic optimisation,” in Nature-Inspired Algorithms and Applied Optimization, Vol. 744, pp. 27–51 (2018). [CrossRef]

161. B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” arXiv, arXiv:1611.02167 (2016). [CrossRef]

162. A. Iranfar, M. Zapater, and D. Atienza, “Multiagent reinforcement learning for hyperparameter optimization of convolutional neural networks,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 41, 1034–1047 (2022). [CrossRef]

163. Y. Xu, L. Huang, W. Jiang, L. Xue, W. Hu, and L. Yi, “Automatic optimization of Volterra equalizer with deep reinforcement learning for intensity-modulated direct-detection optical communications,” J. Lightwave Technol. 40, 5395–5406 (2022). [CrossRef]

164. G. P. Agrawal, Fiber-Optic Communication Systems, 5th ed. (Wiley, 2021).

165. D. Rafique, M. Mussolin, M. Forzati, J. Mårtensson, M. N. Chugtai, and A. D. Ellis, “Compensation of intra-channel nonlinear fibre impairments using simplified digital back-propagation algorithm,” Opt. Express 19, 9453–9460 (2011). [CrossRef]

166. The filtered version typically provides better performance at low spatial resolution, with an almost negligible increase in computational complexity.

167. A. Napoli, Z. Maalej, V. A. Sleiffer, M. Kuschnerov, D. Rafique, E. Timmers, B. Spinnler, T. Rahman, L. D. Coelho, and N. Hanik, “Reduced complexity digital back-propagation methods for optical communication systems,” J. Lightwave Technol. 32, 1351–1362 (2014). [CrossRef]

168. S. Musetti, P. Serena, and A. Bononi, “On the accuracy of split-step Fourier simulations for wideband nonlinear optical communications,” J. Lightwave Technol. 36, 5669–5677 (2018). [CrossRef]

169. P. Serena, C. Lasagni, S. Musetti, and A. Bononi, “On numerical simulations of ultra-wideband long-haul optical communication systems,” J. Lightwave Technol. 38, 1019–1031 (2020). [CrossRef]

170. M. Jaworski, “Step-size distribution strategies in SSFM simulation of DWDM links,” in 2008 2nd ICTON Mediterranean Winter (IEEE, 2008), pp. 1–6.

171. B. Schmauss, R. Asif, and C.-Y. Lin, “Recent advances in digital backward propagation algorithm for coherent transmission systems with higher order modulation formats,” Next-Generation Opt. Commun. Components, Sub-Systems, Syst. 8284, 151–165 (2012). [CrossRef]

172. C. Häger and H. D. Pfister, “Nonlinear interference mitigation via deep neural networks,” in 2018 Optical Fiber Communications Conference and Exposition (OFC) (IEEE, 2018), pp. 1–3.

173. C. Häger and H. D. Pfister, “Physics-based deep learning for fiber-optic communication systems,” IEEE J. Select. Areas Commun. 39, 280–294 (2021). [CrossRef]

174. S. Zhang and C. Häger, “Chapter two - machine learning for long-haul optical systems,” in Machine Learning for Future Fiber-Optic Communication Systems, A. P. T. Lau and F. N. Khan, eds. (Academic Press, 2022), pp. 43–64.

175. M. Raissi, P. Perdikaris, and G. Karniadakis, “Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,” J. Comput. Phys. 378, 686–707 (2019). [CrossRef]

176. G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, “Physics-informed machine learning,” Nat. Rev. Phys. 3, 422–440 (2021). [CrossRef]

177. S. Cuomo, V. S. Di Cola, F. Giampaolo, G. Rozza, M. Raissi, and F. Piccialli, “Scientific machine learning through physics-informed neural networks: where we are and what’s next,” arXiv, arXiv:2201.05624 (2022). [CrossRef]

178. X. Jiang, D. Wang, Q. Fan, M. Zhang, C. Lu, and A. P. T. Lau, “Solving the nonlinear Schrödinger equation in optical fibers using physics-informed neural network,” in Optical Fiber Communication Conference (Optica Publishing Group, 2021), pp. M3H–8.

179. Y. Zang, Z. Yu, K. Xu, X. Lan, M. Chen, S. Yang, and H. Chen, “Principle-driven fiber transmission model based on PINN neural network,” J. Lightwave Technol. 40, 404–414 (2022). [CrossRef]

180. D. Wang, X. Jiang, Y. Song, M. Fu, Z. Zhang, X. Chen, and M. Zhang, “Applications of physics-informed neural network for optical fiber communications,” IEEE Commun. Mag. 60, 32–37 (2022). [CrossRef]

181. H. Yang, Z. Niu, S. Xiao, J. Fang, Z. Liu, D. Fainsin, and L. Yi, “Fast and accurate optical fiber channel modeling using generative adversarial network,” J. Lightwave Technol. 39, 1322–1333 (2021). [CrossRef]

182. Z. Li, N. B. Kovachki, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, “Fourier neural operator for parametric partial differential equations,” in International Conference on Learning Representations (2020).

183. L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis, “Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators,” Nat. Mach. Intell. 3, 218–229 (2021). [CrossRef]

184. S. Wang, H. Wang, and P. Perdikaris, “Learning the solution operator of parametric partial differential equations with physics-informed DeepONets,” Sci. Adv. 7, eabi8605 (2021). [CrossRef]

185. X. He, L. Yan, L. Jiang, A. Yi, Z. Pu, Y. Yu, H. Chen, W. Pan, and B. Luo, “Fourier neural operator for accurate optical fiber modeling with low complexity,” J. Lightwave Technol. 41, 2301 (2023). [CrossRef]

186. N. Zhang, H. Yang, Z. Niu, L. Zheng, C. Chen, S. Xiao, and L. Yi, “Transformer-based long distance fiber channel modeling for optical OFDM systems,” J. Lightwave Technol. 40, 7779–7789 (2022). [CrossRef]

187. N. Gautam, V. Kaushik, A. Choudhary, and B. Lall, “OptiDistillNet: learning nonlinear pulse propagation using the student–teacher model,” Opt. Express 30, 42430–42439 (2022). [CrossRef]

188. P. J. Winzer, D. T. Neilson, and A. R. Chraplyvy, “Fiber-optic transmission and networking: the previous 20 and the next 20 years,” Opt. Express 26, 24190–24239 (2018). [CrossRef]

189. J. C. Cartledge, F. P. Guiomar, F. R. Kschischang, G. Liga, and M. P. Yankov, “Digital signal processing for fiber nonlinearities,” Opt. Express 25, 1916–1936 (2017). [CrossRef]

190. M. A. Jarajreh, E. Giacoumidis, I. Aldaya, S. T. Le, A. Tsokanos, Z. Ghassemlooy, and N. J. Doran, “Artificial neural network nonlinear equalizer for coherent optical OFDM,” IEEE Photonics Technol. Lett. 27, 387–390 (2015). [CrossRef]

191. S. Hunt, Y. Sun, A. Shafarenko, R. Adams, N. Davey, B. Slater, R. Bhamber, S. Boscolo, and S. K. Turitsyn, “Adaptive electrical signal post-processing with varying representations in optical communication systems,” in Engineering Applications of Neural Networks (Springer, 2009), pp. 235–245.

192. T. A. Eriksson, H. Bülow, and A. Leven, “Applying neural networks in optical communication systems: possible pitfalls,” IEEE Photonics Technol. Lett. 29, 2091–2094 (2017). [CrossRef]

193. S. Zhang, F. Yaman, K. Nakamura, T. Inoue, V. Kamalov, L. Jovanovski, V. Vusirikala, E. Mateo, Y. Inada, and T. Wang, “Field and lab experimental demonstration of nonlinear impairment compensation using neural networks,” Nat. Commun. 10, 3033 (2019). [CrossRef]

194. F. N. Khan, C. Lu, and A. P. T. Lau, “Machine learning methods for optical communication systems,” in Signal Processing in Photonic Communications (2017), pp. SpW2F–3.

195. B. Karanov, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end optimized transmission over dispersive intensity-modulated channels using bidirectional recurrent neural networks,” Opt. Express 27, 19650–19663 (2019). [CrossRef]

196. F. N. Khan, Q. Fan, C. Lu, and A. P. T. Lau, “An optical communication’s perspective on machine learning and its applications,” J. Lightwave Technol. 37, 493–516 (2019). [CrossRef]

197. E. Giacoumidis, Y. Lin, J. Wei, I. Aldaya, A. Tsokanos, and L. P. Barry, “Harnessing machine learning for fiber-induced nonlinearity mitigation in long-haul coherent optical OFDM,” Futur. Internet 11, 2 (2018). [CrossRef]

198. G. Charalabopoulos, P. Stavroulakis, and A. H. Aghvami, “A frequency-domain neural network equalizer for OFDM,” in GLOBECOM’03. IEEE Global Telecommunications Conference (IEEE Cat. No. 03CH37489) Vol. 2 (IEEE, 2003), pp. 571–575.

199. J. Estaran, R. Rios-Müller, M. Mestre, F. Jorge, H. Mardoyan, A. Konczykowska, J.-Y. Dupuy, and S. Bigo, “Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems,” in ECOC 2016; 42nd European Conference on Optical Communication (VDE, 2016), pp. 1–3.

200. C. Ye, D. Zhang, X. Huang, H. Feng, and K. Zhang, “Demonstration of 50Gbps IM/DD PAM4 PON over 10GHz class optics using neural network based nonlinear equalization,” in 2017 European Conference on Optical Communication (ECOC) (IEEE, 2017), pp. 1–3.

201. B. Sang, J. Zhang, C. Wang, M. Kong, Y. Tan, L. Zhao, W. Zhou, D. Shang, Y. Zhu, H. Yi, and J. Yu, “Multi-symbol output long short-term memory neural network equalizer for 200+ Gbps IM/DD system,” in 2021 European Conference on Optical Communication (ECOC) (IEEE, 2021), pp. 1–4.

202. E. Arnold, G. Böcherer, E. Müller, P. Spilger, J. Schemmel, S. Calabrò, and M. Kuschnerov, “Spiking neural network equalization for IM/DD optical communication,” arXiv, arXiv:2205.04263 (2022). [CrossRef]

203. P. J. Freire, V. Neskornuik, A. Napoli, B. Spinnler, N. Costa, G. Khanna, E. Riccardi, J. E. Prilepsky, and S. K. Turitsyn, “Complex-valued neural network design for mitigation of signal distortions in optical links,” J. Lightwave Technol. 39, 1696–1705 (2021). [CrossRef]

204. F. Da Ros, S. M. Ranzini, H. Bülow, and D. Zibar, “Reservoir-computing based equalization with optical pre-processing for short-reach optical transmission,” IEEE J. Sel. Top. Quantum Electron. 26, 1–12 (2020). [CrossRef]

205. F. Da Ros, S. M. Ranzini, R. Dischler, A. Cem, V. Aref, H. Bülow, and D. Zibar, “Machine-learning-based equalization for short-reach transmission: neural networks and reservoir computing,” in Metro and Data Center Optical Networks and Short-Reach Links IV, Vol. 11712 (SPIE, 2021), p. 1171205.

206. S. Wang, N. Fang, and L. Wang, “Signal recovery based on optoelectronic reservoir computing for high speed optical fiber communication system,” Opt. Commun. 495, 127082 (2021). [CrossRef]

207. F. Da Ros, S. M. Ranzini, Y. Osadchuk, A. Cem, B. J. G. Castro, and D. Zibar, “Reservoir-computing and neural-network-based equalization for short reach communication,” in Signal Processing in Photonic Communications (Optica Publishing Group, 2022), pp. SpTu1J–1.

208. P. J. Freire, Y. Osadchuk, B. Spinnler, A. Napoli, W. Schairer, N. Costa, J. E. Prilepsky, and S. K. Turitsyn, “Performance versus complexity study of neural network equalizers in coherent optical systems,” J. Lightwave Technol. 39, 6085–6096 (2021). [CrossRef]

209. P. J. Freire, J. E. Prilepsky, Y. Osadchuk, S. K. Turitsyn, and V. Aref, “Deep neural network-aided soft-demapping in coherent optical systems: regression versus classification,” IEEE Trans. Commun. 70, 7973–7988 (2022). [CrossRef]

210. F. Diedolo, G. Böcherer, M. Schädler, and S. Calabrò, “Nonlinear equalization for optical communications based on entropy-regularized mean square error,” arXiv, arXiv:2206.01004 (2022). [CrossRef]

211. O. Sidelnikov, A. Redyuk, and S. Sygletos, “Equalization performance and complexity analysis of dynamic deep neural networks in long haul transmission systems,” Opt. Express 26, 32765–32776 (2018). [CrossRef]

212. E. Giacoumidis, S. T. Le, I. Aldaya, J. Wei, M. McCarthy, N. Doran, and B. J. Eggleton, “Experimental comparison of artificial neural network and Volterra based nonlinear equalization for CO-OFDM,” in Optical Fiber Communication Conference (Optical Society of America, 2016), pp. W3A–4.

213. S. Deligiannidis, C. Mesaritakis, and A. Bogris, “Performance and complexity analysis of bi-directional recurrent neural network models versus Volterra nonlinear equalizers in digital coherent systems,” J. Lightwave Technol. 39, 5791–5798 (2021). [CrossRef]

214. P. J. Freire, S. Srivallapanondh, A. Napoli, J. E. Prilepsky, and S. K. Turitsyn, “Computational complexity evaluation of neural network applications in signal processing,” arXiv, arXiv:2206.12191 (2022). [CrossRef]

215. B. Sang, W. Zhou, Y. Tan, M. Kong, C. Wang, M. Wang, L. Zhao, J. Zhang, and J. Yu, “Low complexity neural network equalization based on multi-symbol output technique for 200+ Gbps IM/DD short reach optical system,” J. Lightwave Technol. 40, 2890–2900 (2022). [CrossRef]

216. P. J. Freire, A. Napoli, D. A. Ron, B. Spinnler, M. Anderson, W. Schairer, T. Bex, N. Costa, S. K. Turitsyn, and J. E. Prilepsky, “Reducing computational complexity of neural networks in optical channel equalization: from concepts to implementation,” arXiv, arXiv:2208.12866 (2022). [CrossRef]

217. S. Deligiannidis, A. Bogris, C. Mesaritakis, and Y. Kopsinis, “Compensation of fiber nonlinearities in digital coherent systems leveraging long short-term memory neural networks,” J. Lightwave Technol. 38, 5991–5999 (2020). [CrossRef]

218. B. I. Bitachon, A. Ghazisaeidi, M. Eppenberger, B. Baeuerle, M. Ayata, and J. Leuthold, “Deep learning based digital backpropagation demonstrating SNR gain at low complexity in a 1200 km transmission link,” Opt. Express 28, 29318–29334 (2020). [CrossRef]

219. O. Sidelnikov, A. Redyuk, S. Sygletos, M. Fedoruk, and S. Turitsyn, “Advanced convolutional neural networks for nonlinearity mitigation in long-haul WDM transmission systems,” J. Lightwave Technol. 39, 2397–2406 (2021). [CrossRef]

220. Q. Fan, G. Zhou, T. Gui, C. Lu, and A. P. T. Lau, “Advancing theoretical understanding and practical performance of signal processing for nonlinear optical communications through machine learning,” Nat. Commun. 11, 3694 (2020). [CrossRef]

221. X. Luo, C. Bai, X. Chi, H. Xu, Y. Fan, L. Yang, P. Qin, Z. Wang, and X. Lv, “Nonlinear impairment compensation using transfer learning-assisted convolutional bidirectional long short-term memory neural network for coherent optical communication systems,” Photonics 9, 919 (2022). [CrossRef]

222. A. Barreiro, G. Liga, and A. Alvarado, “Data-driven enhancement of the time-domain first-order regular perturbation model,” arXiv, arXiv:2210.05340 (2022). [CrossRef]

223. M. M. Melek and D. Yevick, “Nonlinearity mitigation with a perturbation based neural network receiver,” Opt. Quantum Electron. 52, 450 (2020). [CrossRef]

224. M. M. Melek and D. Yevick, “Fiber nonlinearity mitigation with a perturbation based Siamese neural network receiver,” Opt. Fiber Technol. 66, 102641 (2021). [CrossRef]

225. C. Li, Y. Wang, J. Wang, H. Yao, X. Liu, R. Gao, L. Yang, H. Xu, Q. Zhang, P. Ma, and X. Xin, “Convolutional neural network-aided DP-64 QAM coherent optical communication systems,” J. Lightwave Technol. 40, 2880–2889 (2022). [CrossRef]

226. A. Redyuk, E. Averyanov, O. Sidelnikov, M. Fedoruk, and S. Turitsyn, “Compensation of nonlinear impairments using inverse perturbation theory with reduced complexity,” J. Lightwave Technol. 38, 1250–1257 (2020). [CrossRef]

227. H. Dzieciol, T. Koike-Akino, Y. Wang, and K. Parsons, “Inverse regular perturbation with ML-assisted phasor correction for fiber nonlinearity compensation,” Opt. Lett. 47, 3471–3474 (2022). [CrossRef]

228. N. Castro and S. Sygletos, “A novel learned Volterra-based scheme for time-domain nonlinear equalization,” in CLEO: Science and Innovations (Optica Publishing Group, 2022), pp. SF3M–1.

229. X. Huang, D. Zhang, X. Hu, C. Ye, and K. Zhang, “Low-complexity recurrent neural network based equalizer with embedded parallelization for 100-Gbit/s/λ PON,” J. Lightwave Technol. 40, 1353–1359 (2022). [CrossRef]

230. A. A. Cruz, K. S. Mayer, and D. S. Arantes, “RosenPy: an open source Python framework for complex-valued neural networks,” Social Science Research Network, https://dx.doi.org/10.2139/ssrn.4252610 (2022). Accessed: 2022-12-01. [CrossRef]

231. S. Liu, M. Xu, J. Wang, F. Lu, W. Zhang, H. Tian, and G.-K. Chang, “A multilevel artificial neural network nonlinear equalizer for millimeter-wave mobile fronthaul systems,” J. Lightwave Technol. 35, 4406–4417 (2017). [CrossRef]

232. L. Wang, M. Gao, Y. Zhang, F. Cao, and H. Huang, “Optical phase conjugation with complex-valued deep neural network for WDM 64-QAM coherent optical systems,” IEEE Photonics J. 13, 1–8 (2021). [CrossRef]

233. S. A. Bogdanov, O. S. Sidelnikov, and A. A. Redyuk, “Application of complex fully connected neural networks to compensate for nonlinearity in fibre-optic communication lines with polarisation division multiplexing,” Quantum Electron. 51, 1076–1080 (2021). [CrossRef]

234. P. He, F. Wu, M. Yang, A. Yang, P. Guo, Y. Qiao, and X. Xin, “A fiber nonlinearity compensation scheme with complex-valued dimension-reduced neural network,” IEEE Photonics J. 13, 1–7 (2021). [CrossRef]

235. H. Yang, X. Zhang, A. Yi, R. Wang, B. Lin, H. Xing, and B. Sha, “A modified convolutional neural network-based signal demodulation method for direct detection OFDM/OQAM-PON,” Opt. Commun. 489, 126843 (2021). [CrossRef]

236. H. Ming, X. Chen, X. Fang, L. Zhang, C. Li, and F. Zhang, “Ultralow complexity long short-term memory network for fiber nonlinearity mitigation in coherent optical communication systems,” J. Lightwave Technol. 40, 2427–2434 (2022). [CrossRef]

237. Y. Liu, V. Sanchez, P. J. Freire, J. E. Prilepsky, M. J. Koshkouei, and M. D. Higgins, “Attention-aided partial bidirectional RNN-based nonlinear equalizer in coherent optical systems,” Opt. Express 30, 32908–32923 (2022). [CrossRef]

238. A. Shahkarami, M. I. Yousefi, and Y. Jaouën, “Attention-based neural network equalization in fiber-optic communications,” in Asia Communications and Photonics Conference (Optical Society of America, 2021), pp. M5H–3.

239. A. Shahkarami, M. I. Yousefi, and Y. Jaouën, “Efficient deep learning of Kerr nonlinearity in fiber-optic channels using a convolutional recurrent neural network,” in Deep Learning Applications, Vol. 4 (Springer, 2023), pp. 317–338.

240. X. Huang, W. Jiang, X. Yi, J. Zhang, T. Jin, Q. Zhang, B. Xu, and K. Qiu, “Design of fully interpretable neural networks for digital coherent demodulation,” Opt. Express 30, 35526–35538 (2022). [CrossRef]

241. V. Bajaj, M. Chagnon, S. Wahls, and V. Aref, “Efficient training of Volterra series-based pre-distortion filter using neural networks,” in 2022 Optical Fiber Communications Conference and Exhibition (OFC) (2022), pp. 1–3.

242. D. Psaltis, A. Sideris, and A. A. Yamamura, “A multilayered neural network controller,” IEEE Control Syst. Mag. 8, 17–21 (1988). [CrossRef]

243. A. Bernardini, M. Carrarini, and S. De Fina, “The use of a neural net for copeing with nonlinear distortions,” in 1990 20th European Microwave Conference, Vol. 2 (1990), pp. 1718–1723.

244. T. Gotthans, G. Baudoin, and A. Mbaye, “Digital predistortion with advance/delay neural network and comparison with Volterra derived models,” in 2014 IEEE 25th Annual International Symposium on Personal, Indoor, and Mobile Radio Communication (PIMRC) (2014), pp. 811–815.

245. X. Hu, Z. Liu, X. Yu, Y. Zhao, W. Chen, B. Hu, X. Du, X. Li, M. Helaoui, W. Wang, and F. M. Ghannouchi, “Convolutional neural network for behavioral modeling and predistortion of wideband power amplifiers,” IEEE Trans. Neural Netw. Learning Syst. 33, 3923–3937 (2022). [CrossRef]

246. M. Schaedler, M. Kuschnerov, S. Calabrò, F. Pittalà, C. Bluemm, and S. Pachnicke, “AI-based digital predistortion for IQ Mach–Zehnder modulators,” in 2019 Asia Communications and Photonics Conference (ACP) (2019), pp. 1–3.

247. M. Abu-Romoh, S. Sygletos, I. D. Phillips, and W. Forysiak, “Neural-network-based pre-distortion method to compensate for low resolution DAC nonlinearity,” in 45th European Conference on Optical Communication (ECOC 2019) (2019), pp. 1–4.

248. T. Sasai, M. Nakamura, E. Yamazaki, A. Matsushita, S. Okamoto, K. Horikoshi, and Y. Kisaka, “Wiener–Hammerstein model and its learning for nonlinear digital pre-distortion of optical transmitters,” Opt. Express 28, 30952–30963 (2020). [CrossRef]

249. V. Bajaj, F. Buchali, M. Chagnon, S. Wahls, and V. Aref, “Deep neural network-based digital pre-distortion for high baudrate optical coherent transmission,” J. Lightwave Technol. 40, 597–606 (2022). [CrossRef]

250. V. Bajaj, F. Buchali, M. Chagnon, S. Wahls, and V. Aref, “Single-channel 1.61 Tb/s optical coherent transmission enabled by neural network-based digital pre-distortion,” in 2020 European Conference on Optical Communications (ECOC) (2020), pp. 1–4.

251. V. Bajaj, F. Buchali, M. Chagnon, S. Wahls, and V. Aref, “54.5 Tb/s WDM transmission over field deployed fiber enabled by neural network-based digital pre-distortion,” in Optical Fiber Communication Conference (2021), pp. M5F–2.

252. L. Minelli, F. Forghieri, A. Nespola, S. Straullu, and R. Gaudino, “A multi-rate approach for nonlinear pre-distortion using end-to-end deep learning in IM-DD systems,” J. Lightwave Technol. 41, 420–431 (2023). [CrossRef]

253. V. Bajaj, V. Aref, and S. Wahls, “Performance analysis of recurrent neural network-based digital pre-distortion for optical coherent transmission,” in 2022 European Conference on Optical Communication (ECOC) (2022), pp. 1–4.

254. T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw. 3, 563–575 (2017). [CrossRef]

255. T. Glasmachers, “Limits of end-to-end learning,” in Asian Conference on Machine Learning (2017), pp. 17–32.

256. M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” arXiv, arXiv:1604.07316 (2016). [CrossRef]

257. B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end deep learning of optical fiber communications,” J. Lightwave Technol. 36, 4843–4855 (2018). [CrossRef]

258. B. Karanov, M. Chagnon, V. Aref, F. Ferreira, D. Lavery, P. Bayvel, and L. Schmalen, “Experimental investigation of deep learning for digital signal processing in short reach optical fiber communications,” in 2020 IEEE Workshop on Signal Processing Systems (SiPS) (2020), pp. 1–6.

259. V. Neskorniuk, A. Carnio, D. Marsella, S. K. Turitsyn, J. E. Prilepsky, and V. Aref, “Model-based deep learning of joint probabilistic and geometric shaping for optical communication,” in 2022 Conference on Lasers and Electro-Optics (CLEO) (2022), pp. 1–2.

260. V. Neskorniuk, A. Carnio, D. Marsella, S. K. Turitsyn, J. E. Prilepsky, and V. Aref, “Memory-aware end-to-end learning of channel distortions in optical coherent communications,” Opt. Express 31, 1–20 (2023). [CrossRef]

261. O. Jovanovic, M. P. Yankov, F. Da Ros, and D. Zibar, “Gradient-free training of autoencoders for non-differentiable communication channels,” J. Lightwave Technol. 39, 6381–6391 (2021). [CrossRef]

262. B. Karanov, M. Chagnon, V. Aref, D. Lavery, P. Bayvel, and L. Schmalen, “Concept and experimental demonstration of optical IM/DD end-to-end system optimization using a generative model,” in Optical Fiber Communication Conference (2020), pp. Th2A–48.

263. B. Karanov, V. Oliari, M. Chagnon, G. Liga, A. Alvarado, V. Aref, D. Lavery, P. Bayvel, and L. Schmalen, “End-to-end learning in optical fiber communications: experimental demonstration and future trends,” in 2020 European Conference on Optical Communications (ECOC) (2020), pp. 1–4.

264. S. Gaiarin, F. Da Ros, R. T. Jones, and D. Zibar, “End-to-end optimization of coherent optical communications over the split-step Fourier method guided by the nonlinear Fourier transform theory,” J. Lightwave Technol. 39, 418–428 (2021). [CrossRef]

265. R. T. Jones, T. A. Eriksson, M. P. Yankov, and D. Zibar, “Deep learning of geometric constellation shaping including fiber nonlinearities,” in 2018 European Conference on Optical Communication (ECOC) (2018), pp. 1–3.

266. R. T. Jones, M. P. Yankov, and D. Zibar, “End-to-end learning for GMI optimized geometric constellation shape,” in 2019 European Conference on Optical Communication (ECOC) (2019), pp. 1–3.

267. K. Gümüş, A. Alvarado, B. Chen, C. Häger, and E. Agrell, “End-to-end learning of geometrical shaping maximizing generalized mutual information,” in 2020 Optical Fiber Communications Conference and Exhibition (OFC) (2020), pp. 1–3.

268. V. Oliari, B. Karanov, S. Goossens, G. Liga, O. Vassilieva, I. Kim, P. Palacharla, C. Okonkwo, and A. Alvarado, “High-cardinality hybrid shaping for 4D modulation formats in optical communications optimized via end-to-end learning,” arXiv, arXiv:2112.10471 (2021). [CrossRef]

269. Z. Niu, H. Yang, H. Zhao, C. Dai, W. Hu, and L. Yi, “End-to-end deep learning for long-haul fiber transmission using differentiable surrogate channel,” J. Lightwave Technol. 40, 2807–2822 (2022). [CrossRef]

270. O. Jovanovic, M. P. Yankov, F. Da Ros, and D. Zibar, “End-to-end learning of a constellation shape robust to variations in SNR and laser linewidth,” in 2021 European Conference on Optical Communication (ECOC) (2021), pp. 1–4.

271. J. Song, C. Häger, J. Schröder, A. G. i Amat, and H. Wymeersch, “End-to-end autoencoder for superchannel transceivers with hardware impairment,” in Optical Fiber Communication Conference (2021), pp. F4D–6.

272. Z. He, J. Song, C. Häger, A. G. i Amat, H. Wymeersch, P. A. Andrekson, M. Karlsson, and J. Schröder, “Experimental demonstration of learned pulse shaping filter for superchannels,” in Optical Fiber Communication Conference (2022), pp. W2A–33.

273. J. Song, C. Häger, J. Schröder, A. Graell i Amat, and H. Wymeersch, “Model-based end-to-end learning for WDM systems with transceiver hardware impairments,” IEEE J. Sel. Top. Quantum Electron. 28, 1–14 (2022). [CrossRef]

274. T. Uhlemann, S. Cammerer, A. Span, S. Dörner, and S. ten Brink, “Deep-learning autoencoder for coherent and nonlinear optical communication,” in Photonic Networks; 21th ITG-Symposium (2020), pp. 1–8.

275. V. Aref and M. Chagnon, “End-to-end learning of joint geometric and probabilistic constellation shaping,” in 2022 Optical Fiber Communications Conference and Exhibition (OFC) (2022), pp. 1–3.

276. B. Karanov, L. Schmalen, and A. Alvarado, “Distance-agnostic auto-encoders for short reach fiber communications,” in 2021 Optical Fiber Communications Conference and Exhibition (OFC) (2021), pp. 1–3.

277. Y. Ren, Z. Wang, P. Liao, L. Li, G. Xie, H. Huang, Z. Zhao, Y. Yan, N. Ahmed, A. Willner, M. P. J. Lavery, N. Ashrafi, S. Ashrafi, R. Bock, M. Tur, I. B. Djordjevic, M. A. Neifeld, and A. E. Willner, “Experimental characterization of a 400 Gbit/s orbital angular momentum multiplexed free-space optical link over 120 m,” Opt. Lett. 41, 622–625 (2016). [CrossRef]

278. M. A. Khalighi and M. Uysal, “Survey on free space optical communication: a communication theory perspective,” IEEE Commun. Surv. Tutorials 16, 2231–2258 (2014). [CrossRef]

279. Y. Li, Z. Chen, Z. Hu, D. M. Benton, A. A. Ali, M. Patel, M. P. Lavery, and A. D. Ellis, “Enhanced atmospheric turbulence resiliency with successive interference cancellation DSP in mode division multiplexing free-space optical links,” J. Lightwave Technol. 40, 7769–7778 (2022). [CrossRef]

280. M. A. Amirabadi, M. H. Kahaei, and S. A. Nezamalhosseini, “Low complexity deep learning algorithms for compensating atmospheric turbulence in the free space optical communication system,” IET Optoelectron. 16, 93–105 (2022). [CrossRef]

281. C. Zheng, S. Yu, and W. Gu, “A SVM-based processor for free-space optical communication,” in 2015 IEEE 5th International Conference on Electronics Information and Emergency Communication (2015), pp. 30–33.

282. S. Lohani and R. T. Glasser, “Turbulence correction with artificial neural networks,” Opt. Lett. 43, 2611–2614 (2018). [CrossRef]

283. Y. Hao, L. Zhao, T. Huang, Y. Wu, T. Jiang, Z. Wei, D. Deng, A.-P. Luo, and H. Liu, “High-accuracy recognition of orbital angular momentum modes propagated in atmospheric turbulences based on deep learning,” IEEE Access 8, 159542–159551 (2020). [CrossRef]

284. Q. Tian, Z. Li, K. Hu, L. Zhu, X. Pan, Q. Zhang, Y. Wang, F. Tian, X. Yin, and X. Xin, “Turbo-coded 16-ary OAM shift keying FSO communication system combining the CNN-based adaptive demodulator,” Opt. Express 26, 27849–27864 (2018). [CrossRef]

285. J. Li, M. Zhang, D. Wang, S. Wu, and Y. Zhan, “Joint atmospheric turbulence detection and adaptive demodulation technique using the CNN for the OAM-FSO communication,” Opt. Express 26, 10494–10508 (2018). [CrossRef]

286. M. P. Bart, N. J. Savino, P. Regmi, L. Cohen, H. Safavi, H. C. Shaw, S. Lohani, T. A. Searles, B. T. Kirby, H. Lee, and R. T. Glasser, “Deep learning for enhanced free-space optical communications,” arXiv, arXiv:2208.07712 (2022). [CrossRef]

287. Z.-R. Zhu, J. Zhang, R.-H. Chen, and H.-Y. Yu, “Autoencoder-based transceiver design for OWC systems in log-normal fading channel,” IEEE Photonics J. 11, 1–12 (2019). [CrossRef]

288. M. Yousefi and F. Kschischang, “Information transmission using the nonlinear Fourier transform, part I–III,” IEEE Trans. Inf. Theory 60, 4312–4328 (2014). [CrossRef]

289. S. K. Turitsyn, J. E. Prilepsky, S. T. Le, S. Wahls, L. L. Frumin, M. Kamalian, and S. A. Derevyanko, “Nonlinear Fourier transform for optical data processing and transmission: advances and perspectives,” Optica 4, 307 (2017). [CrossRef]

290. L. Xi, J. Wei, and W. Zhang, “Applications of machine learning on nonlinear frequency division multiplexing optic-fiber communication systems,” in 2021 IEEE 9th International Conference on Information, Communication and Networks (ICICN) (2021), pp. 190–194.

291. S. Le, V. Aref, and H. Buelow, “Nonlinear signal multiplexing for communication beyond the Kerr nonlinearity limit,” Nat. Photonics 11, 570–576 (2017). [CrossRef]

292. S. Gaiarin, F. Da Ros, N. De Renzis, E. P. da Silva, and D. Zibar, “Dual-polarization NFDM transmission using distributed Raman amplification and NFT-domain equalization,” IEEE Photonics Technol. Lett. 30, 1983–1986 (2018). [CrossRef]

293. J. Koch, R. Weixer, and S. Pachnicke, “Equalization of soliton transmission based on nonlinear Fourier transform using neural networks,” in 45th European Conference on Optical Communication (ECOC) (2019), pp. 1–3.

294. O. Kotlyar, M. K. Kopae, J. E. Prilepsky, M. Pankratova, and S. K. Turitsyn, “Machine learning for performance improvement of periodic NFT-based communication system,” in 2019 European Conference on Optical Communications (2019), pp. 1–3.

295. O. Kotlyar, M. Pankratova, M. Kamalian-Kopae, A. Vasylchenkova, J. E. Prilepsky, and S. K. Turitsyn, “Combining nonlinear Fourier transform and neural network-based processing in optical communications,” Opt. Lett. 45, 3462–3465 (2020). [CrossRef]

296. O. Kotlyar, M. Kamalian-Kopae, M. Pankratova, A. Vasylchenkova, J. E. Prilepsky, and S. K. Turitsyn, “Convolutional long short-term memory neural network equalizer for nonlinear Fourier transform-based optical transmission systems,” Opt. Express 29, 11254–11267 (2021). [CrossRef]

297. M. Kamalian-Kopae, A. Vasylchenkova, O. Kotlyar, M. Pankratova, J. Prilepsky, and S. Turitsyn, “Artificial neural network-based equaliser in the nonlinear Fourier domain for fibre-optic communication applications,” in 2019 Conference on Lasers and Electro-Optics Europe & European Quantum Electronics Conference (CLEO/Europe-EQEC) (2019).

298. X. Chen, H. Ming, C. Li, G. He, and F. Zhang, “Two-stage artificial neural network-based burst-subcarrier joint equalization in nonlinear frequency division multiplexing systems,” Opt. Lett. 46, 1700–1703 (2021). [CrossRef]

299. X. Lv, C. Bai, Q. Qi, H. Xu, X. Luo, X. Chi, L. Yang, and L. Xi, “Noise equalization scheme based on complex-valued ANN for multiple-eigenvalue modulated nonlinear frequency division multiplexing systems,” Appl. Opt. 61, 10755–10765 (2022). [CrossRef]

300. R. T. Jones, S. Gaiarin, M. P. Yankov, and D. Zibar, “Time-domain neural network receiver for nonlinear frequency division multiplexed systems,” IEEE Photonics Technol. Lett. 30, 1079–1082 (2018). [CrossRef]

301. S. Yamamoto, K. Mishina, and A. Maruta, “Demodulation of optical eigenvalue modulated signal using neural network,” IEICE ComEX 8, 507–512 (2019). [CrossRef]

302. Y. Wu, L. Xi, X. Zhang, Z. Zheng, J. Wei, S. Du, W. Zhang, and X. Zhang, “Robust neural network receiver for multiple-eigenvalue modulated nonlinear frequency division multiplexing system,” Opt. Express 28, 18304–18316 (2020). [CrossRef]

303. K. Mishina, S. Sato, Y. Yoshida, D. Hisano, and A. Maruta, “Eigenvalue-domain neural network demodulator for eigenvalue-modulated signal,” J. Lightwave Technol. 39, 4307–4317 (2021). [CrossRef]

304. K. Mishina, T. Maeda, D. Hisano, Y. Yoshida, and A. Maruta, “Combining IST-based CFO compensation and neural network-based demodulation for eigenvalue-modulated signal,” J. Lightwave Technol. 39, 7370–7382 (2021). [CrossRef]

305. H. Takeuchi, K. Mishina, Y. Terashi, D. Hisano, Y. Yoshida, and A. Maruta, “Eigenvalue-domain neural network receiver for 4096-ary eigenvalue-modulated signal,” in 2022 Optical Fiber Communications Conference and Exhibition (OFC) (2022), pp. 1–3.

306. S. Le, J. E. Prilepsky, and S. K. Turitsyn, “Nonlinear inverse synthesis for high spectral efficiency transmission in optical fibers,” Opt. Express 22, 26720–26741 (2014). [CrossRef]

307. X. Yangzhang, V. Aref, S. T. Le, H. Buelow, D. Lavery, and P. Bayvel, “Dual-polarization non-linear frequency-division multiplexed transmission with b-modulation,” J. Lightwave Technol. 37, 1570–1578 (2019). [CrossRef]

308. X. Yangzhang, S. T. Le, V. Aref, H. Buelow, D. Lavery, and P. Bayvel, “Experimental demonstration of dual-polarization NFDM transmission with b-modulation,” IEEE Photonics Technol. Lett. 31, 1–4 (2019). [CrossRef]

309. T. Gui, G. Zhou, C. Lu, A. P. T. Lau, and S. Wahls, “Nonlinear frequency division multiplexing with b-modulation: shifting the energy barrier,” Opt. Express 26, 27978–27990 (2018). [CrossRef]

310. S. Derevyanko, M. Balogun, O. Aluf, D. Shepelsky, and J. E. Prilepsky, “Channel model and the achievable information rates of the optical nonlinear frequency division-multiplexed systems employing continuous b-modulation,” Opt. Express 29, 6384–6406 (2021). [CrossRef]

311. Q. Zhang and F. R. Kschischang, “Correlation-aided nonlinear spectrum detection,” J. Lightwave Technol. 39, 4923–4931 (2021). [CrossRef]

312. M. Balogun and S. Derevyanko, “Enhancing the spectral efficiency of nonlinear frequency division multiplexing systems via Hermite-Gaussian subcarriers,” J. Lightwave Technol. 40, 6071–6077 (2022). [CrossRef]

313. W. Q. Zhang, T. H. Chan, and S. Afshar, “Direct decoding of nonlinear OFDM-QAM signals using convolutional neural network,” Opt. Express 29, 11591–11604 (2021). [CrossRef]

314. E. V. Sedov, P. J. Freire, V. V. Seredin, V. A. Kolbasin, M. Kamalian-Kopae, I. S. Chekhovskoy, S. K. Turitsyn, and J. E. Prilepsky, “Neural networks for computing and denoising the continuous nonlinear Fourier spectrum in focusing nonlinear Schrödinger equation,” Sci. Rep. 11, 22857 (2021). [CrossRef]

315. E. V. Sedov, I. S. Chekhovskoy, and J. E. Prilepsky, “Neural network for calculating direct and inverse nonlinear Fourier transform,” Quantum Electron. 51, 1118–1121 (2021). [CrossRef]

316. W. Q. Zhang, T. H. Chan, and S. A. Vahid, “Serial and parallel convolutional neural network schemes for NFDM signals,” Sci. Rep. 12, 1–12 (2022). [CrossRef]

317. J. Zhou, Q. Hu, and H. Pu, “Nonlinear Fourier transform receiver based on a time domain diffractive deep neural network,” Opt. Express 30, 38576–38586 (2022). [CrossRef]

318. X. Chen, X. Fang, F. Yang, and F. Zhang, “10.83 Tb/s over 800 km nonlinear frequency division multiplexing WDM transmission,” J. Lightwave Technol. 40, 5385–5394 (2022). [CrossRef]

319. R. Gu, Z. Yang, and Y. Ji, “Machine learning for intelligent optical networks: a comprehensive survey,” J. Netw. Comput. Appl. 157, 102576 (2020). [CrossRef]

320. D. Wang, C. Zhang, W. Chen, H. Yang, M. Zhang, and A. P. T. Lau, “A review of machine learning-based failure management in optical networks,” Sci. China Inf. Sci. 65, 211302 (2022). [CrossRef]

321. S. Troia, R. Alvizu, Y. Zhou, G. Maier, and A. Pattavina, “Deep learning-based traffic prediction for network optimization,” in 2018 20th International Conference on Transparent Optical Networks (ICTON) (2018), pp. 1–4.

322. D. Aloraifan, I. Ahmad, and E. Alrashed, “Deep learning based network traffic matrix prediction,” Int. J. Intell. Networks 2, 46–56 (2021). [CrossRef]

323. J. A. Hatem, A. R. Dhaini, and S. Elbassuoni, “Deep learning-based dynamic bandwidth allocation for future optical access networks,” IEEE Access 7, 97307–97318 (2019). [CrossRef]

324. X. Zhu, O. Xu, and G. Li, “Prediction accuracy improvement of passive optical network traffic by a LSTM model with a new activation function,” in 2020 7th International Conference on Information, Cybernetics, and Computational Social Systems (ICCSS) (2020), pp. 662–666.

325. F. J. Vaquero-Caballero, D. J. Ives, and S. J. Savory, “Perturbation-based frequency domain linear and nonlinear noise estimation,” J. Lightwave Technol. 40, 6055–6063 (2022). [CrossRef]

326. D. Wang, M. Zhang, Z. Li, J. Li, C. Song, J. Li, and M. Wang, “Convolutional neural network-based deep learning for intelligent OSNR estimation on eye diagrams,” in 2017 European Conference on Optical Communication (ECOC) (2017), pp. 1–3.

327. T. Tanimura, T. Hoshida, T. Kato, S. Watanabe, and H. Morikawa, “Convolutional neural network-based optical performance monitoring for optical transport networks,” J. Opt. Commun. Netw. 11, A52–A59 (2019). [CrossRef]

328. T. Tanimura, S. Yoshida, K. Tajima, S. Oda, and T. Hoshida, “Concept and implementation study of advanced DSP-based fiber-longitudinal optical power profile monitoring toward optical network tomography,” J. Opt. Commun. Netw. 13, E132–E141 (2021). [CrossRef]

329. D. Wang, Y. Xu, J. Li, M. Zhang, J. Li, J. Qin, C. Ju, Z. Zhang, and X. Chen, “Comprehensive eye diagram analysis: a transfer learning approach,” IEEE Photonics J. 11, 1–19 (2019). [CrossRef]

330. S. Lohani, E. M. Knutson, W. Zhang, and R. T. Glasser, “Dispersion characterization and pulse prediction with machine learning,” OSA Continuum 2, 3438–3445 (2019). [CrossRef]

331. W. Du, D. Côté, C. Barber, and Y. Liu, “Forecasting loss of signal in optical networks with machine learning,” J. Opt. Commun. Netw. 13, E109–E121 (2021). [CrossRef]

332. J. Müller, T. Fehenberger, S. K. Patri, K. Kaeval, H. Griesser, M. Tikas, and J.-P. Elbers, “Estimating quality of transmission in a live production network using machine learning,” in Optical Fiber Communication Conference (2021), p. Tu1G.2.

333. A. D’Amico, S. Straullu, A. Nespola, I. Khan, E. London, E. Virgillito, S. Piciaccia, A. Tanzi, G. Galimberti, and V. Curri, “Using machine learning in an open optical line system controller,” J. Opt. Commun. Netw. 12, C1–C11 (2020). [CrossRef]

334. A. S. Kashi, J. C. Cartledge, and W.-Y. Chan, “Neural network training framework for nonlinear signal-to-noise ratio estimation in heterogeneous optical networks,” in 2021 Optical Fiber Communications Conference and Exhibition (OFC) (2021), pp. 1–3.

335. M. Lonardi, J. Pesic, T. Zami, and N. Rossi, “The perks of using machine learning for QoT estimation with uncertain network parameters,” in Photonic Networks and Devices (2020), p. NeM3B.2.

336. H. Lv, X. Zhou, J. Huo, and J. Yuan, “Joint OSNR monitoring and modulation format identification on signal amplitude histograms using convolutional neural network,” Opt. Fiber Technol. 61, 102455 (2021). [CrossRef]

337. F. Inuzuka, T. Oda, T. Tanaka, K. Kitamura, S. Kuwabara, A. Hirano, and M. Tomizawa, “Demonstration of a novel framework for proactive maintenance using failure prediction and bit lossless protection with autonomous network diagnosis system,” J. Lightwave Technol. 38, 2695–2702 (2020). [CrossRef]

338. C. Zhang, D. Wang, L. Wang, J. Song, S. Liu, J. Li, L. Guan, Z. Liu, and M. Zhang, “Temporal data-driven failure prognostics using BiGRU for optical networks,” J. Opt. Commun. Netw. 12, 277–287 (2020). [CrossRef]

339. C. Zhang, D. Wang, J. Jia, L. Wang, S. Liu, L. Guan, and M. Zhang, “Attention mechanism-driven potential fault cause identification in optical networks,” in 2021 Optical Fiber Communications Conference and Exhibition (OFC) (2021), pp. 1–3.

340. K. Abdelli, D. Rafique, H. Grießer, and S. Pachnicke, “Lifetime prediction of 1550 nm DFB laser using machine learning techniques,” in Optical Fiber Communication Conference (2020), p. Th2A.3.

341. T. Liu, H. Mei, Q. Sun, and H. Zhou, “Application of neural network in fault location of optical transport network,” China Commun. 16, 214–225 (2019). [CrossRef]

342. J. Jia, D. Wang, C. Zhang, H. Yang, L. Guan, X. Chen, and M. Zhang, “Transformer-based alarm context-vectorization representation for reliable alarm root cause identification in optical networks,” in 2021 European Conference on Optical Communication (ECOC) (2021), pp. 1–4.

343. X. Zhao, H. Yang, H. Guo, T. Peng, and J. Zhang, “Accurate fault location based on deep neural evolution network in optical networks for 5G and beyond,” in Optical Fiber Communication Conference (2019), p. M3J.5.

344. E. Lewis, C. Sheridan, M. O’Farrell, D. King, C. Flanagan, W. Lyons, and C. Fitzpatrick, “Principal component analysis and artificial neural network based approach to analysing optical fibre sensors signals,” Sens. Actuators, A 136, 28–38 (2007). [CrossRef]

345. S. Kowarik, M.-T. Hussels, S. Chruscicki, S. Münzenberger, A. Lämmerhirt, P. Pohl, and M. Schubert, “Fiber optic train monitoring with distributed acoustic sensing: conventional and neural network data analysis,” Sensors 20, 450 (2020). [CrossRef]

346. S. Liehr, L. A. Jäger, C. Karapanagiotis, S. Münzenberger, and S. Kowarik, “Real-time dynamic strain sensing in optical fibers using artificial neural networks,” Opt. Express 27, 7405–7425 (2019). [CrossRef]

347. S. Liehr, “Artificial neural networks for distributed optical fiber sensing (invited),” in Optical Fiber Communication Conference (OFC) 2021 (2021), p. Th4F.2.

348. F. B. M. Suah, M. Ahmad, and M. N. Taib, “Applications of artificial neural network on signal processing of optical fibre pH sensor based on bromophenol blue doped with sol–gel film,” Sens. Actuators, B 90, 182–188 (2003). [CrossRef]

349. X. Li, J. Shu, W. Gu, and L. Gao, “Deep neural network for plasmonic sensor modeling,” Opt. Mater. Express 9, 3857–3862 (2019). [CrossRef]

350. M. Shokrekhodaei, D. P. Cistola, R. C. Roberts, and S. Quinones, “Non-invasive glucose monitoring using optical sensor and machine learning techniques for diabetes applications,” IEEE Access 9, 73029–73045 (2021). [CrossRef]

351. F. B. M. Suah, M. Ahmad, and M. N. Taib, “Optimisation of the range of an optical fibre pH sensor using feed-forward artificial neural network,” Sens. Actuators, B 90, 175–181 (2003). [CrossRef]

352. I. Dias, R. Oliveira, and O. Frazão, “Intelligent optical sensors using artificial neural network approach,” in Innovation in Manufacturing Networks, A. Azevedo, ed. (Springer, 2008), pp. 289–294.

353. L. Zhao, J. Wang, and X. Chen, “BP neural network with regularization and sensor array for prediction of component concentration of mixed gas,” in Advances in Neural Networks – ISNN 2018, T. Huang, J. Lv, C. Sun, and A. V. Tuzikov, eds. (Springer International Publishing, 2018), pp. 541–548.

354. Y. C. Manie, J.-W. Li, P.-C. Peng, R.-K. Shiu, Y.-Y. Chen, and Y.-T. Hsu, “Using a machine learning algorithm integrated with data de-noising techniques to optimize the multipoint sensor network,” Sensors 20, 1070 (2020). [CrossRef]

355. Y. Shi, Y. Wang, L. Zhao, and Z. Fan, “An event recognition method for ϕ-OTDR sensing system based on deep learning,” Sensors 19, 3421 (2019). [CrossRef]

356. L. Salmela, N. Tsipinakis, A. Foi, C. Billet, J. Dudley, and G. Genty, “Predicting ultrafast nonlinear dynamics in fibre optics with a recurrent neural network,” Nat. Mach. Intell. 3, 344–354 (2021). [CrossRef]

357. L. Salmela, C. Lapre, J. M. Dudley, and G. Genty, “Machine learning analysis of rogue solitons in supercontinuum generation,” Sci. Rep. 10, 9596 (2020). [CrossRef]

358. A. Ermolaev, A. Sheveleva, G. Genty, C. Finot, and J. Dudley, “Data-driven model discovery of ideal four-wave mixing in nonlinear fibre optics,” Sci. Rep. 12, 12711 (2022). [CrossRef]

359. X. Jiang, D. Wang, Q. Fan, M. Zhang, C. Lu, and A. P. T. Lau, “Physics-informed neural network for nonlinear dynamics in fiber optics,” Laser Photonics Rev. 16, 2100483 (2022). [CrossRef]

360. M. Soltani, F. Da Ros, A. Carena, and D. Zibar, “Spectral and spatial power evolution design with machine learning-enabled Raman amplification,” J. Lightwave Technol. 40, 3546–3556 (2022). [CrossRef]

361. T. Zahavy, A. Dikopoltsev, D. Moss, G. I. Haham, O. Cohen, S. Mannor, and M. Segev, “Deep learning reconstruction of ultrashort pulses,” Optica 5, 666–673 (2018). [CrossRef]

362. M. Stanfield, J. Ott, C. Gardner, N. F. Beier, D. M. Farinella, C. A. Mancuso, P. Baldi, and F. Dollar, “Real-time reconstruction of high energy, ultrafast laser pulses using deep learning,” Sci. Rep. 12, 5299 (2022). [CrossRef]

363. M. Mabed, F. Meng, L. Salmela, C. Finot, G. Genty, and J. M. Dudley, “Machine learning analysis of instabilities in noise-like pulse lasers,” Opt. Express 30, 15060–15072 (2022). [CrossRef]

364. L. Salmela, M. Hary, M. Mabed, A. Foi, J. M. Dudley, and G. Genty, “Feed-forward neural network as nonlinear dynamics integrator for supercontinuum generation: erratum,” Opt. Lett. 47, 1741 (2022). [CrossRef]

365. G. Pu and B. Jalali, “Neural network enabled time stretch spectral regression,” Opt. Express 29, 20786–20794 (2021). [CrossRef]

366. A. E. Siegman, Lasers (University Science Books, 1986).

367. W. Fu, L. G. Wright, P. Sidorenko, S. Backus, and F. W. Wise, “Several new directions for ultrafast fiber lasers,” Opt. Express 26, 9432 (2018). [CrossRef]

368. O. G. Okhotnikov, ed., Fiber Lasers (John Wiley & Sons, Ltd, 2012).

369. S. K. Turitsyn, B. G. Bale, and M. P. Fedoruk, “Dispersion-managed solitons in fiber systems and lasers,” Phys. Rep. 521, 135–203 (2012). [CrossRef]

370. S. K. Turitsyn, S. A. Babin, D. V. Churkin, I. D. Vatnik, M. Nikulin, and E. V. Podivilov, “Random distributed feedback fibre lasers,” Phys. Rep. 542, 133–193 (2014). [CrossRef]

371. U. Andral, J. Buguet, R. S. Fodil, F. Amrani, F. Billard, E. Hertz, and P. Grelu, “Toward an autosetting mode-locked fiber laser cavity,” J. Opt. Soc. Am. B 33, 825–833 (2016). [CrossRef]

372. J. N. Kutz, “Mode-locked soliton lasers,” SIAM Rev. 48, 629–678 (2006). [CrossRef]

373. J. N. Kutz, “Deep learning in fluid dynamics,” J. Fluid Mech. 814, 1–4 (2017). [CrossRef]

374. M. Raissi, “Deep hidden physics models: deep learning of nonlinear partial differential equations,” J. Mach. Learn. Res. 19, 932–955 (2018). [CrossRef]

375. S. L. Brunton, X. Fu, and J. N. Kutz, “Self-tuning fiber lasers,” IEEE J. Sel. Top. Quantum Electron. 20, 464–471 (2014). [CrossRef]

376. J. N. Kutz and S. L. Brunton, “Intelligent systems for stabilizing mode-locked lasers and frequency combs: machine learning and equation-free control paradigms for self-tuning optics,” Nanophotonics 4, 459–471 (2015). [CrossRef]

377. R. Woodward and E. J. Kelleher, “Towards ‘smart lasers’: self-optimisation of an ultrafast pulse source using a genetic algorithm,” Sci. Rep. 6, 37616 (2016). [CrossRef]

378. D. G. Winters, M. S. Kirchner, S. J. Backus, and H. C. Kapteyn, “Electronic initiation and optimization of nonlinear polarization evolution mode-locking in a fiber laser,” Opt. Express 25, 33216–33225 (2017). [CrossRef]

379. R. Woodward and E. Kelleher, “Genetic algorithm-based control of birefringent filtering for self-tuning, self-pulsing fiber lasers,” Opt. Lett. 42, 2952–2955 (2017). [CrossRef]

380. X. Ma, J. Lin, C. Dai, J. Lv, P. Yao, L. Xu, and C. Gu, “Machine learning method for calculating mode-locking performance of linear cavity fiber lasers,” Opt. Laser Technol. 149, 107883 (2022). [CrossRef]

381. C. Sun, E. Kaiser, S. L. Brunton, and J. N. Kutz, “Deep reinforcement learning for optical systems: a case study of mode-locked lasers,” Mach. Learn.: Sci. Technol. 1, 045013 (2020). [CrossRef]

382. E. Kuprikov, A. Kokhanovskiy, K. Serebrennikov, and S. Turitsyn, “Deep reinforcement learning for self-tuning laser source of dissipative solitons,” Sci. Rep. 12, 7185 (2022). [CrossRef]

383. Z. Li, S. Yang, Q. Xiao, T. Zhang, Y. Li, L. Han, D. Liu, X. Ouyang, and J. Zhu, “Deep reinforcement with spectrum series learning control for a mode-locked fiber laser,” Photonics Res. 10, 1491–1500 (2022). [CrossRef]

384. A. Kokhanovskiy, A. Shevelev, K. Serebrennikov, E. Kuprikov, and S. Turitsyn, “A deep reinforcement learning algorithm for smart control of hysteresis phenomena in a mode-locked fiber laser,” Photonics 9, 921 (2022). [CrossRef]

385. A. Liu, T. Lin, H. Han, X. Zhang, Z. Chen, F. Gan, H. Lv, and X. Liu, “Analyzing modal power in multi-mode waveguide via machine learning,” Opt. Express 26, 22100–22109 (2018). [CrossRef]

386. Y. An, L. Huang, J. Li, J. Leng, L. Yang, and P. Zhou, “Deep learning-based real-time mode decomposition for multimode fibers,” IEEE J. Sel. Top. Quantum Electron. 26, 1–6 (2020). [CrossRef]

387. Y. An, L. Huang, J. Li, J. Leng, L. Yang, and P. Zhou, “Learning to decompose the modes in few-mode fibers with deep convolutional neural network,” Opt. Express 27, 10127–10137 (2019). [CrossRef]

388. E. Manuylovich, A. Donodin, and S. Turitsyn, “Intensity-only-measurement mode decomposition in few-mode fibers,” Opt. Express 29, 36769–36783 (2021). [CrossRef]

389. L. G. Wright, W. H. Renninger, D. N. Christodoulides, and F. W. Wise, “Nonlinear multimode photonics: nonlinear optics with many degrees of freedom,” Optica 9, 824–841 (2022). [CrossRef]

390. L. Wright, P. Sidorenko, H. Pourbeyram, Z. H. Ziegler, A. Isichenko, B. A. Malomed, C. R. Menyuk, D. N. Christodoulides, and F. W. Wise, “Mechanisms of spatiotemporal mode-locking,” Nat. Phys. 16, 565–570 (2020). [CrossRef]

391. H. Haig, P. Sidorenko, A. Dhar, N. Choudhury, R. Sen, D. Christodoulides, and F. Wise, “Multimode Mamyshev oscillator,” Opt. Lett. 47, 46–49 (2022). [CrossRef]

392. S. L. Brunton, X. Fu, and J. N. Kutz, “Extremum-seeking control of a mode-locked laser,” IEEE J. Quantum Electron. 49, 852–861 (2013). [CrossRef]

393. G. Pu, L. Yi, L. Zhang, and W. Hu, “Intelligent programmable mode-locked fiber laser with a human-like algorithm,” Optica 6, 362–369 (2019). [CrossRef]

394. A. Kokhanovskiy, A. Ivanenko, S. Kobtsev, S. Smirnov, and S. Turitsyn, “Machine learning methods for control of fibre lasers with double gain nonlinear loop mirror,” Sci. Rep. 9, 2916 (2019). [CrossRef]

395. A. Kokhanovskiy, E. Kuprikov, A. Bednyakova, I. Popkov, S. Smirnov, and S. Turitsyn, “Inverse design of mode-locked fiber laser by particle swarm optimization algorithm,” Sci. Rep. 11, 13555 (2021). [CrossRef]

396. B. K. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell. 17, 185–203 (1981). [CrossRef]

397. J. A. Marshall, “Self-organizing neural networks for perception of visual motion,” Neural Netw. 3, 45–74 (1990). [CrossRef]

398. M. Egmont-Petersen, D. de Ridder, and H. Handels, “Image processing with neural networks—a review,” Pattern Recogn. 35, 2279–2301 (2002). [CrossRef]

399. C. Zuo, J. Qian, S. Feng, W. Yin, Y. Li, P. Fan, J. Han, K. Qian, and Q. Chen, “Deep learning in optical metrology: a review,” Light: Sci. Appl. 11, 39 (2022). [CrossRef]

400. H. Greenspan, B. van Ginneken, and R. M. Summers, “Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique,” IEEE Trans. Med. Imaging 35, 1153–1159 (2016). [CrossRef]

401. M. T. McCann, K. H. Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging: a review,” IEEE Signal Process. Mag. 34, 85–95 (2017). [CrossRef]

402. U. S. Kamilov, I. N. Papadopoulos, M. H. Shoreh, A. Goy, C. Vonesch, M. Unser, and D. Psaltis, “Learning approach to optical tomography,” Optica 2, 517–522 (2015). [CrossRef]

403. Z. Zhang, J. Liu, D. Yang, U. S. Kamilov, and G. D. Hugo, “Deep learning-based motion compensation for four-dimensional cone-beam computed tomography (4D-CBCT) reconstruction,” Med. Phys. 50, 808 (2023). [CrossRef]

404. Z. Wu, Y. Sun, A. Matlock, J. Liu, L. Tian, and U. S. Kamilov, “SIMBA: scalable inversion in optical tomography using deep denoising priors,” IEEE J. Sel. Top. Signal Process. 14, 1163–1175 (2020). [CrossRef]

405. K. Yan, Y. Yu, C. Huang, L. Sui, K. Qian, and A. Asundi, “Fringe pattern denoising based on deep learning,” Opt. Commun. 437, 148–152 (2019). [CrossRef]

406. B. Lin, S. Fu, C. Zhang, F. Wang, and Y. Li, “Optical fringe patterns filtering based on multi-stage convolution neural network,” Opt. Lasers Eng. 126, 105853 (2020). [CrossRef]

407. J. Shi, X. Zhu, H. Wang, L. Song, and Q. Guo, “Label enhanced and patch based deep learning for phase retrieval from single frame fringe pattern in fringe projection 3D measurement,” Opt. Express 27, 28929–28943 (2019). [CrossRef]

408. H. Yu, D. Zheng, J. Fu, Y. Zhang, C. Zuo, and J. Han, “Deep learning-based fringe modulation-enhancing method for accurate fringe projection profilometry,” Opt. Express 28, 21692–21703 (2020). [CrossRef]

409. Z. Zhang, D. P. Towers, and C. E. Towers, “Snapshot color fringe projection for absolute three-dimensional metrology of video sequences,” Appl. Opt. 49, 5947–5953 (2010). [CrossRef]

410. J. Qian, S. Feng, Y. Li, T. Tao, J. Han, Q. Chen, and C. Zuo, “Single-shot absolute 3D shape measurement with deep-learning-based color fringe projection profilometry,” Opt. Lett. 45, 1842–1845 (2020). [CrossRef]

411. L. Lu, Y. Zheng, G. Carneiro, and L. Yang, eds., Deep Learning and Convolutional Neural Networks for Medical Image Computing, Advances in Computer Vision and Pattern Recognition (Springer, Cham, 2017).

412. G. Wang, Y. Zhang, X. Ye, and X. Mou, Machine Learning for Tomographic Imaging (IOP Publishing, 2019).

413. D. Hemanth and V. Estrela, Deep Learning for Image Processing Applications, Advances in Parallel Computing (IOS Press, 2017).

414. L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, “Deep learning in remote sensing applications: a meta-analysis and review,” ISPRS J. Photogramm. Remote Sens. 152, 166–177 (2019). [CrossRef]

415. B. Huang, B. Zhao, and Y. Song, “Urban land-use mapping using a deep convolutional neural network with high spatial resolution multispectral remote sensing imagery,” Remote Sens. Environ. 214, 73–86 (2018). [CrossRef]

416. P. Caramazza, O. Moran, R. Murray-Smith, and D. Faccio, “Transmission of natural scene images through a multimode fibre,” Nat. Commun. 10, 2029 (2019). [CrossRef]

417. S. Aisawa, K. Noguchi, and T. Matsumoto, “Remote image classification through multimode optical fiber using a neural network,” Opt. Lett. 16, 645–647 (1991). [CrossRef]

418. N. Shabairou, E. Cohen, O. Wagner, D. Malka, and Z. Zalevsky, “Color image identification and reconstruction using artificial neural networks on multimode fiber images: towards an all-optical design,” Opt. Lett. 43, 5603–5606 (2018). [CrossRef]

419. B. Rahmani, I. Oguz, U. Tegin, J.-L. Hsieh, D. Psaltis, and C. Moser, “Learning to image and compute with multimode optical fibers,” Nanophotonics 11, 1071–1082 (2022). [CrossRef]

420. Z. Liu, L. Wang, Y. Meng, T. He, S. He, Y. Yang, L. Wang, J. Tian, D. Li, P. Yan, M. Gong, Q. Liu, and Q. Xiao, “All-fiber high-speed image detection enabled by deep learning,” Nat. Commun. 13, 1433 (2022). [CrossRef]

421. T. Voumard, T. Wildi, V. Brasch, R. G. Alvarez, G. V. Ogando, and T. Herr, “Dual-frequency comb hyperspectral imaging by massively parallelized infrared detection and machine learning,” in Optical Sensors and Sensing Congress (Optica Publishing Group, 2020), p. EM1C.1.

422. L. B. Lentile, Z. A. Holden, A. M. S. Smith, M. J. Falkowski, A. T. Hudak, P. Morgan, S. A. Lewis, P. E. Gessler, and N. C. Benson, “Remote sensing techniques to assess active fire characteristics and post-fire effects,” Int. J. Wildland Fire 15, 319–345 (2006). [CrossRef]

423. G. A. Daldegan, D. A. Roberts, and F. de Figueiredo Ribeiro, “Spectral mixture analysis in Google Earth Engine to model and delineate fire scars over a large extent and a long time-series in a rainforest–savanna transition zone,” Remote Sens. Environ. 232, 111340 (2019). [CrossRef]

424. T. Kattenborn, J. Leitloff, F. Schiefer, and S. Hinz, “Review on convolutional neural networks (CNN) in vegetation remote sensing,” ISPRS J. Photogramm. Remote Sens. 173, 24–49 (2021). [CrossRef]

425. M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat, “Deep learning and process understanding for data-driven Earth system science,” Nature 566, 195–204 (2019). [CrossRef]

426. W. Chew, E. Michielssen, J. M. Song, and J. M. Jin, Fast and Efficient Algorithms in Computational Electromagnetics (Artech House, Inc., 2001).

427. M. N. O. Sadiku, Numerical Techniques in Electromagnetics (CRC Press, 1992).

428. B. Gallinet, J. Butet, and O. J. F. Martin, “Numerical methods for nanophotonics: standard problems and future challenges,” Laser Photonics Rev. 9, 577–603 (2015). [CrossRef]

429. S. Molesky, Z. Lin, A. Y. Piggott, W. Jin, J. Vucković, and A. W. Rodriguez, “Inverse design in nanophotonics,” Nat. Photonics 12, 659–670 (2018). [CrossRef]

430. S. D. Campbell, D. Sell, R. P. Jenkins, E. B. Whiting, J. A. Fan, and D. H. Werner, “Review of numerical optimization techniques for meta-device design,” Opt. Mater. Express 9, 1842–1863 (2019). [CrossRef]

431. M. Zandehshahvar, M. H. Javani, M. Chen, T. Brown, Y. Kiarashi, and A. Adibi, “Machine learning for efficient inverse design of nanophotonics structures,” in Photonic and Phononic Properties of Engineered Nanostructures XII (SPIE, 2022), p. PC120100W.

432. R. S. Hegde, “Deep learning: a new tool for photonic nanostructure design,” Nanoscale Adv. 2, 1007–1023 (2020). [CrossRef]

433. W. Ma, F. Cheng, and Y. Liu, “Deep-learning-enabled on-demand design of chiral metamaterials,” ACS Nano 12, 6326–6334 (2018). [CrossRef]

434. Z. Liu, D. Zhu, L. Raju, and W. Cai, “Tackling photonic inverse design with machine learning,” Adv. Sci. 8, 2002923 (2021). [CrossRef]

435. J. Lim and D. Psaltis, “MaxwellNet: physics-driven deep neural network training based on Maxwell’s equations,” APL Photonics 7, 011301 (2022). [CrossRef]

436. P. R. Wiecha, A. Arbouet, C. Girard, and O. L. Muskens, “Deep learning in nano-photonics: inverse design and beyond,” Photonics Res. 9, B182–B200 (2021). [CrossRef]

437. M. H. Tahersima, K. Kojima, T. Koike-Akino, D. Jha, B. Wang, C. Lin, and K. Parsons, “Deep neural network inverse design of integrated photonic power splitters,” Sci. Rep. 9, 1368 (2019). [CrossRef]

438. J. Wang, Y. Wang, and Y. Chen, “Inverse design of materials by machine learning,” Materials 15, 1 (2022). [CrossRef]

439. J. Jiang and J. A. Fan, “Global optimization of dielectric metasurfaces using a physics-driven neural network,” Nano Lett. 19, 5366–5372 (2019). [CrossRef]

440. B. A. Wilson, Z. A. Kudyshev, A. V. Kildishev, S. Kais, V. M. Shalaev, and A. Boltasseva, “Machine learning framework for quantum sampling of highly constrained, continuous optimization problems,” Appl. Phys. Rev. 8, 041418 (2021). [CrossRef]

441. W. Maass, “On the computational complexity of networks of spiking neurons,” Adv. Neur. Inf. Proc. Syst. 7, 1 (1994).

442. R. Alizadeh, J. K. Allen, and F. Mistree, “Managing computational complexity using surrogate models: a critical review,” Res. Eng. Design 31, 275–298 (2020). [CrossRef]

443. S. Wiedemann, K.-R. Müller, and W. Samek, “Compact and computationally efficient representation of deep neural networks,” IEEE Trans. Neural Netw. Learning Syst. 31, 772–785 (2020). [CrossRef]

444. A. M. Amin, R. R. Mahmood, and A. I. Khan, “Analysis of pattern recognition algorithms using associative memory approach: a comparative study between the Hopfield network and distributed hierarchical graph neuron (DHGN),” in 2008 IEEE 8th International Conference on Computer and Information Technology Workshops (IEEE, 2008), pp. 153–158.

445. S. N. Kerr, “A Big-O experiment: which function is it?” in Proceedings of the 43rd Annual Southeast Regional Conference-Volume 1 (2005), pp. 317–318.

446. V. D. Blondel and J. N. Tsitsiklis, “A survey of computational complexity results in systems and control,” Automatica 36, 1249–1274 (2000). [CrossRef]

447. P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented approximation of convolutional neural networks,” arXiv, arXiv:1604.03168 (2016). [CrossRef]

448. V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: a tutorial and survey,” Proc. IEEE 105, 2295–2329 (2017). [CrossRef]

449. B. Li and T. N. Sainath, “Reducing the computational complexity of two-dimensional LSTMs,” in INTERSPEECH (2017), pp. 964–968.

450. T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 5687–5695.

451. J. L. Balcázar, R. Gavalda, and H. T. Siegelmann, “Computational power of neural networks: a characterization in terms of Kolmogorov complexity,” IEEE Trans. Inf. Theory 43, 1175–1183 (1997). [CrossRef]

452. M. Van Baalen, C. Louizos, M. Nagel, R. A. Amjad, Y. Wang, T. Blankevoort, and M. Welling, “Bayesian bits: unifying quantization and pruning,” Adv. Neur. Inf. Proc. Syst. 33, 5741–5752 (2020). [CrossRef]

453. C. Baskin, “Uniq: uniform noise injection for non-uniform quantization of neural networks,” ACM Trans. Comput. Syst. 37, 1–15 (2019). [CrossRef]

454. S. Sahin, Y. Becerikli, and S. Yazici, “Neural network implementation in hardware using FPGAs,” in International Conference on Neural Information Processing (Springer, 2006), pp. 1105–1112.

455. A. Dinu, M. N. Cirstea, and S. E. Cirstea, “Direct neural-network hardware-implementation algorithm,” IEEE Trans. Ind. Electron. 57, 1845–1848 (2010). [CrossRef]

456. E. Jacobsen and P. Kootsookos, “Fast, accurate frequency estimators [DSP tips & tricks],” IEEE Signal Process. Mag. 24, 123–125 (2007). [CrossRef]

457. B. Spinnler, “Equalizer design and complexity for digital coherent receivers,” IEEE J. Sel. Top. Quantum Electron. 16, 1180–1192 (2010). [CrossRef]

458. S. Mirzaei, A. Hosangadi, and R. Kastner, “FPGA implementation of high speed FIR filters using add and shift method,” in 2006 International Conference on Computer Design (IEEE, 2006), pp. 308–313.

459. S. Jahani, “ZOT-MK: a New Algorithm for Big Integer Multiplication,” MSc Thesis (Department of Computer Science, Universiti Sains Malaysia, 2009).

460. The Big-O notation represents the worst case, or the upper bound, of the time required to perform the operation. Big Omega (Ω) shows the best case, or the lower bound, whereas the Big Theta (Θ) notation defines the tight bound of the amount of time required; in other words, f(n) is Θ(g(n)) if f(n) is both O(g(n)) and Ω(g(n)).

461. B. Hawks, J. Duarte, N. J. Fraser, A. Pappalardo, N. Tran, and Y. Umuroglu, “PS and QS: quantization-aware pruning for efficient low latency neural network inference,” arXiv, arXiv:2102.11289 (2021). [CrossRef]

462. J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, and Y. Lin, “Deep k-means: re-training and parameter sharing with harder cluster assignments for compressing deep convolutions,” in International Conference on Machine Learning (PMLR, 2018), pp. 5363–5372.

463. Y. Li, X. Dong, and W. Wang, “Additive powers-of-two quantization: an efficient non-uniform discretization for neural networks,” arXiv, arXiv:1909.13144 (2019). [CrossRef]

464. T. Koike-Akino, Y. Wang, K. Kojima, K. Parsons, and T. Yoshida, “Zero-multiplier sparse DNN equalization for fiber-optic QAM systems with probabilistic amplitude shaping,” in 2021 European Conference on Optical Communications (ECOC) (IEEE, 2021), pp. 1–4.

465. M. Elhoushi, Z. Chen, F. Shafiq, Y. H. Tian, and J. Y. Li, “Deepshift: towards multiplication-less neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 2359–2368.

466. H. You, X. Chen, Y. Zhang, C. Li, S. Li, Z. Liu, Z. Wang, and Y. Lin, “Shiftaddnet: a hardware-inspired deep network,” arXiv, arXiv:2010.12785 (2020). [CrossRef]

467. P. Gentili, F. Piazza, and A. Uncini, “Efficient genetic algorithm design for power-of-two FIR filters,” in 1995 International Conference on Acoustics, Speech, and Signal Processing Vol. 2 (IEEE, 1995), pp. 1268–1271.

468. J. B. Evans, “Efficient FIR filter architectures suitable for FPGA implementation,” IEEE Trans. Circuits Syst. II 41, 490–493 (1994). [CrossRef]

469. W. R. Lee, V. Rehbock, K. L. Teo, and L. Caccetta, “Frequency-response masking based FIR filter design with power-of-two coefficients and suboptimum PWR,” J. Circuits, Syst. Comput. 12, 591–599 (2003). [CrossRef]

470. P. Kurup and T. Abbasi, Logic Synthesis Using Synopsys (Springer Science & Business Media, 2012).

471. H. Li and W. Ye, “Efficient implementation of FPGA based on Vivado high level synthesis,” in 2016 2nd IEEE International Conference on Computer and Communications (ICCC) (2016), pp. 2810–2813.

472. L. Cui, Y. Zhao, B. Yan, D. Liu, and J. Zhang, “Deep-learning-based failure prediction with data augmentation in optical transport networks,” in 17th International Conference on Optical Communications and Networks (ICOCN 2018) Vol. 11048 (International Society for Optics and Photonics, 2019), p. 110482I.

473. H. Zhuang, Y. Zhao, X. Yu, Y. Li, Y. Wang, and J. Zhang, “Machine-learning-based alarm prediction with GANs-based self-optimizing data augmentation in large-scale optical transport networks,” in 2020 International Conference on Computing, Networking and Communications (ICNC) (IEEE, 2020), pp. 294–298.

474. S. Li, J. Li, M. Zhang, D. Wang, C. Song, and X. Zhen, “Adaptive traffic data augmentation using generative adversarial networks for optical networks,” in 2019 Optical Fiber Communications Conference and Exhibition (OFC) (IEEE, 2019), pp. 1–3.

475. V. Neskorniuk, P. J. Freire, A. Napoli, B. Spinnler, W. Schairer, J. E. Prilepsky, N. Costa, and S. K. Turitsyn, “Simplifying the supervised learning of Kerr nonlinearity compensation algorithms by data augmentation,” in 2020 European Conference on Optical Communications (ECOC) (2020), pp. 1–4.

476. W. Mo, Y.-K. Huang, S. Zhang, E. Ip, D. C. Kilper, Y. Aono, and T. Tajima, “ANN-based transfer learning for QOT prediction in real-time mixed line-rate systems,” in 2018 Optical Fiber Communications Conference and Exposition (OFC) (IEEE, 2018), pp. 1–3.

477. Y. Cheng, W. Zhang, S. Fu, M. Tang, and D. Liu, “Transfer learning simplified multi-task deep neural network for PDM-64QAM optical performance monitoring,” Opt. Express 28, 7607–7617 (2020). [CrossRef]

478. Q. Yao, H. Yang, A. Yu, and J. Zhang, “Transductive transfer learning-based spectrum optimization for resource reservation in seven-core elastic optical networks,” J. Lightwave Technol. 37, 4164–4172 (2019). [CrossRef]

479. Z. Xu, C. Sun, T. Ji, J. H. Manton, and W. Shieh, “Feedforward and recurrent neural network-based transfer learning for nonlinear equalization in short-reach optical links,” J. Lightwave Technol. 39, 475–480 (2021). [CrossRef]

480. J. Zhang, L. Xia, M. Zhu, S. Hu, B. Xu, and K. Qiu, “Fast remodeling for nonlinear distortion mitigation based on transfer learning,” Opt. Lett. 44, 4243–4246 (2019). [CrossRef]

481. W. Zhang, T. Jin, T. Xu, J. Zhang, and K. Qiu, “Nonlinear mitigation with TL-NN-NLC in coherent optical fiber communications,” in Asia Communications and Photonics Conference (2020), pp. M4A–321.

482. P. J. Freire, D. Abode, J. E. Prilepsky, N. Costa, B. Spinnler, A. Napoli, and S. K. Turitsyn, “Transfer learning for neural networks-based equalizers in coherent optical systems,” J. Lightwave Technol. 39, 6733–6745 (2021). [CrossRef]

483. J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017), pp. 23–30.

484. X. Chen, J. Hu, C. Jin, L. Li, and L. Wang, “Understanding domain randomization for sim-to-real transfer,” arXiv, arXiv:2110.03239 (2021). [CrossRef]

485. F. Muratore, F. Ramos, G. Turk, W. Yu, M. Gienger, and J. Peters, “Robot learning from randomized simulations: a review,” Front. Robot. AI 9, 799893 (2022). [CrossRef]

486. P. J. Freire, B. Spinnler, D. Abode, J. E. Prilepsky, A. Ali, N. Costa, W. Schairer, A. Napoli, A. D. Ellis, and S. K. Turitsyn, “Domain adaptation: the key enabler of neural network equalizers in coherent optical systems,” in 2022 Optical Fiber Communications Conference and Exhibition (OFC) (2022), pp. 1–3.

487. O. Simeone, S. Park, and J. Kang, “From learning to meta-learning: reduced training overhead and complexity for communication systems,” in 2020 2nd 6G Wireless Summit (6G SUMMIT) (2020), pp. 1–5.

488. Y. Ouali, C. Hudelot, and M. Tami, “An overview of deep semi-supervised learning,” arXiv, arXiv:2006.05278 (2020). [CrossRef]

489. D. Blalock, J. J. G. Ortiz, J. Frankle, and J. Guttag, “What is the state of neural network pruning?” arXiv, arXiv:2003.03033 (2020). [CrossRef]

490. T. Liang, J. Glossner, L. Wang, S. Shi, and X. Zhang, “Pruning and quantization for deep neural network acceleration: a survey,” Neurocomputing 461, 370–403 (2021). [CrossRef]

491. Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” arXiv, arXiv:1810.05270 (2018). [CrossRef]

492. M. Augasta and T. Kathirvalavakumar, “Pruning algorithms of neural networks—a comparative study,” Open Comput. Sci. 3, 105–115 (2013). [CrossRef]

493. S. Vadera and S. Ameen, “Methods for pruning deep neural networks,” arXiv, arXiv:2011.00241 (2020). [CrossRef]

494. S. Han, H. Mao, and W. J. Dally, “Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv, arXiv:1510.00149 (2015). [CrossRef]

495. C.-Y. Chuang, W.-F. Chang, C.-C. Wei, C.-J. Ho, C.-Y. Huang, J.-W. Shi, L. Henrickson, Y.-K. Chen, and J. Chen, “Sparse Volterra nonlinear equalizer by employing pruning algorithm for high-speed PAM-4 850-nm VCSEL optical interconnect,” in Optical Fiber Communication Conference (2019), pp. M1F–2.

496. W.-J. Huang, W.-F. Chang, C.-C. Wei, J.-J. Liu, Y.-C. Chen, K.-L. Chi, C.-L. Wang, J.-W. Shi, and J. Chen, “93% complexity reduction of Volterra nonlinear equalizer by L1-regularization for 112-Gbps PAM-4 850-nm VCSEL optical interconnect,” in 2018 Optical Fiber Communications Conference and Exposition (OFC) (2018), pp. 1–3.

497. F. P. Guiomar, S. B. Amado, N. J. Muga, J. D. Reis, A. L. Teixeira, and A. N. Pinto, “Simplified Volterra series nonlinear equalizer by intra-channel cross-phase modulation oriented pruning,” in 39th European Conference and Exhibition on Optical Communication (ECOC 2013) (2013), pp. 1–3.

498. O. S. Kumar, L. Lampe, S. Luo, M. Jana, J. Mitra, and C. Li, “Deep neural network assisted second-order perturbation-based nonlinearity compensation,” in Signal Processing in Photonic Communications (2021), pp. SpF2E–2.

499. M. Li, W. Zhang, Q. Chen, and Z. He, “High-throughput hardware deployment of pruned neural network based nonlinear equalization for 100-Gbps short-reach optical interconnect,” Opt. Lett. 46, 4980–4983 (2021). [CrossRef]

500. Z. Wan, J. Li, L. Shu, M. Luo, X. Li, S. Fu, and K. Xu, “Nonlinear equalization based on pruned artificial neural networks for 112-Gb/s SSB-PAM4 transmission over 80-km SSMF,” Opt. Express 26, 10631–10642 (2018). [CrossRef]

501. W. Zhang, L. Ge, Y. Zhang, C. Liang, and Z. He, “Compressed nonlinear equalizers for 112-Gbps optical interconnects: efficiency and stability,” Sensors 20, 4680 (2020). [CrossRef]

502. L. Wang, X. Zeng, J. Wang, D. Gao, and M. Bai, “Low-complexity nonlinear equalizer based on artificial neural network for 112 Gbit/s PAM-4 transmission using DML,” Opt. Fiber Technol. 67, 102724 (2021). [CrossRef]

503. L. Ge, W. Zhang, C. Liang, and Z. He, “Compressed neural network equalization based on iterative pruning algorithm for 112-Gbps VCSEL-enabled optical interconnects,” J. Lightwave Technol. 38, 1323–1329 (2020). [CrossRef]

504. A. G. Reza and J.-K. K. Rhee, “Nonlinear equalizer based on neural networks for PAM-4 signal transmission using DML,” IEEE Photonics Technol. Lett. 30, 1416–1419 (2018). [CrossRef]

505. L.-N. Wang, W. Liu, X. Liu, G. Zhong, P. P. Roy, J. Dong, and K. Huang, “Compressing deep networks by neuron agglomerative clustering,” Sensors 20, 6033 (2020). [CrossRef]

506. S. Son, S. Nah, and K. M. Lee, “Clustering convolutional kernels to compress deep neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 216–232.

507. A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv, arXiv:2103.13630 (2021). [CrossRef]

508. Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv, arXiv:1710.09282 (2017). [CrossRef]

509. O. Weng, “Neural network quantization for efficient inference: a survey,” arXiv, arXiv:2112.06126 (2021). [CrossRef]

510. H. Bai, L. Hou, L. Shang, X. Jiang, I. King, and M. R. Lyu, “Towards efficient post-training quantization of pre-trained language models,” arXiv, arXiv:2109.15082 (2021). [CrossRef]

511. R. Alvarez, R. Prabhavalkar, and A. Bakhtin, “On the efficient representation and execution of deep acoustic models,” arXiv, arXiv:1607.04683 (2016). [CrossRef]

512. J. Duarte, S. Han, P. Harris, S. Jindariani, E. Kreinar, B. Kreis, J. Ngadiuba, M. Pierini, R. Rivera, N. Tran, and Z. Wu, “Fast inference of deep neural networks in FPGAs for particle physics,” J. Instrum. 13, P07027 (2018). [CrossRef]

513. C. N. Coelho, A. Kuusela, S. Li, H. Zhuang, J. Ngadiuba, T. K. Aarrestad, V. Loncar, M. Pierini, A. A. Pol, and S. Summers, “Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors,” in Nature Machine Intelligence (2021), pp. 1–12.

514. N. Kaneda, Z. Zhu, C.-Y. Chuang, A. Mahadevan, B. Farah, K. Bergman, D. Van Veen, and V. Houtsma, “FPGA implementation of deep neural network based equalizers for high-speed PON,” in Optical Fiber Communication Conference (2020), pp. T4D–2.

515. F. A. Aoudia and J. Hoydis, “Towards hardware implementation of neural network-based communication algorithms,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) (2019), pp. 1–5.

516. W. Xu, X. Tan, Y. Lin, X. You, C. Zhang, and Y. Be’ery, “On the efficient design of neural networks in communication systems,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers (2019), pp. 522–526.

517. D. A. Ron, P. J. Freire, J. E. Prilepsky, M. Kamalian-Kopae, A. Napoli, and S. K. Turitsyn, “Experimental implementation of a neural network optical channel equalizer in restricted hardware using pruning and quantization,” Sci. Rep. 12, 8713 (2022). [CrossRef]

518. G. Hinton, O. Vinyals, J. Dean, et al., “Distilling the knowledge in a neural network,” arXiv, arXiv:1503.02531 (2015). [CrossRef]

519. J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: a survey,” Int. J. Comput. Vis. 129, 1789–1819 (2021). [CrossRef]

520. J. Xiang, S. Colburn, A. Majumdar, and E. Shlizerman, “Knowledge distillation circumvents nonlinearity for optical convolutional neural networks,” Appl. Opt. 61, 2173–2183 (2022). [CrossRef]

521. S. Srivallapanondh, P. J. Freire, B. Spinnler, N. Costa, A. Napoli, S. K. Turitsyn, and J. E. Prilepsky, “Knowledge distillation applied to optical channel equalization: solving the parallelization problem of recurrent connection,” arXiv, arXiv:2212.04569 (2022). [CrossRef]

522. A. X. M. Chang and E. Culurciello, “Hardware accelerators for recurrent neural networks on FPGA,” in 2017 IEEE International Symposium on Circuits and Systems (ISCAS) (2017), pp. 1–4.

523. T. Willi, J. Masci, J. Schmidhuber, and C. Osendorfer, “Recurrent neural processes,” arXiv, arXiv:1906.05915 (2019). [CrossRef]

524. F. Libano, P. Rech, B. Neuman, J. Leavitt, M. Wirthlin, and J. Brunhaver, “How reduced data precision and degree of parallelism impact the reliability of convolutional neural networks on FPGAs,” IEEE Trans. Nucl. Sci. 68, 865–872 (2021). [CrossRef]

525. Note that Refs. [526–532] provide further insight into the parallelization of NN structures when designing them in hardware.

526. C. Wang, L. Gong, X. Li, and X. Zhou, “A ubiquitous machine learning accelerator with automatic parallelization on FPGA,” IEEE Trans. Parallel Distrib. Syst. 31, 2346–2359 (2020). [CrossRef]

527. G. Zhong, A. Prakash, S. Wang, Y. Liang, T. Mitra, and S. Niar, “Design space exploration of FPGA-based accelerators with multi-level parallelism,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 (2017), pp. 1141–1146.

528. C. Luo, M.-K. Sit, H. Fan, S. Liu, W. Luk, and C. Guo, “Towards efficient deep neural network training by FPGA-based batch-level parallelism,” J. Semicond. 41, 022403 (2020). [CrossRef]

529. S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, “FPGA acceleration of recurrent neural network based language model,” in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (2015), pp. 111–118.

530. D. Danopoulos, I. Stamoulias, G. Lentaris, D. Masouros, I. Kanaropoulos, A. K. Kakolyris, and D. Soudris, “LSTM acceleration with FPGA and GPU devices for edge computing applications in B5G MEC,” in Embedded Computer Systems: Architectures, Modeling, and Simulation: 22nd International Conference, SAMOS 2022, Samos, Greece, July 3–7, 2022, Proceedings (2022), pp. 406–419.

531. J. Du, X. Zhu, M. Shen, Y. Du, Y. Lu, N. Xiao, and X. Liao, “Model parallelism optimization for distributed inference via decoupled CNN structure,” IEEE Trans. Parallel Distrib. Syst. 32, 1 (2020). [CrossRef]

532. J. Wang, W. Tong, and X. Zhi, “Model parallelism optimization for CNN FPGA accelerator,” Algorithms 16, 110 (2023). [CrossRef]

Pedro J. Freire received his Bachelor’s and Master’s degrees in Electronic Engineering from the Federal University of Pernambuco, with a one-and-a-half-year period at the State University of New York and the State University of San Francisco. He received his PhD degree in Optical Communication and Photonics with a Marie-Curie (MSCA) doctoral fellowship at the Aston Institute of Photonic Technologies, United Kingdom. During his PhD, he also received the prestigious 2022 IEEE Photonics Society Graduate Student Scholarship Award for his contribution to the application of machine learning in optical communications (nonlinear channel equalization). His interests focus on advanced digital signal processing and coding, network monitoring and planning, artificial intelligence, machine learning, hardware DSP, and analog and photonic machine-learning development for signal-processing applications.

Egor S. Manuylovich graduated from the Moscow Institute of Physics and Technology in 2010 with a BSc degree in Physics and Applied Mathematics. He furthered his education at the same institution, earning a Master’s degree in Physics and Applied Mathematics in 2012 and a PhD in Laser Physics in 2016. He began his career as a Development Engineer of optical quantum systems at IPG IRE-Polus, where he contributed to the development of pulsed fiber-laser-based devices. In 2018, he was appointed as a Research Scientist at the N.M. Emanuel Institute of Biochemical Physics RAS (IBCP RAS), followed by a similar role at the Kotelnikov Institute of Radio-Engineering and Electronics of the Russian Academy of Sciences. In 2019, he earned a Marie-Curie (MULTIPLY) fellowship and joined the Aston Institute of Photonic Technologies (AiPT), Aston University. His research interests are diverse, encompassing areas such as laser physics, optical communications, nonlinear fiber optics, and the application of machine learning methods to fiber optics.

Sergei K. Turitsyn graduated from the Department of Physics of Novosibirsk University in 1982 and received his PhD degree in Theoretical and Mathematical Physics from the Budker Institute of Nuclear Physics, Novosibirsk, Russia, in 1986. In 1992 he moved to Germany, first as a Humboldt Fellow and then working on collaborative projects with Deutsche Telekom. Since 2012 he has served as director of the Aston Institute of Photonic Technologies, a world-renowned photonics research center with a strong track record of academic achievements, a range of developed technologies, and industrial collaborations. He is the originator of several key concepts in the fields of nonlinear science, optical fiber communications, and fiber lasers. He has been a principal investigator in 68 national and international research and industrial projects. He serves as a member of the editorial board of the Journal of the European Optical Society-Rapid Publications and as an Associate Editor of JLT. He was the recipient of a Royal Society Wolfson Research Merit Award in 2005. He received the Lebedev medal from the Rozhdestvensky Optical Society in 2014 and the Aston Chancellor’s Medal in 2018. He is a Fellow of the Royal Academy of Engineering, Optica, and the Institute of Physics.

Jaroslaw E. Prilepsky received an ME degree (Hons.) in theoretical physics from the National University of Kharkiv, Kharkiv, Ukraine, in 1999 and a PhD in theoretical physics from the B. Verkin Institute for Low Temperature Physics and Engineering, Kharkiv, Ukraine, in 2003, focusing on nonlinear excitation in low-dimensional systems. From 2003 to 2010, he was a Research Fellow with the B. Verkin Institute for Low Temperature Physics and Engineering. From 2010 to 2012, he was a Research Associate with the Nonlinearity and Complexity Research Group, Aston University, Birmingham, UK. Since 2012, he has been a Research Fellow and Senior Research Fellow with the Aston Institute of Photonic Technologies, Aston University. He has authored two invited book chapters and more than 90 journal papers and conference contributions in nonlinear physics and mathematics, optical transmission, signal processing, and machine learning. His current research interests include optical transmission systems, nonlinearity mitigation methods, neural networks and machine learning, and optical signal processing.

Network Type	RM (real multiplications)	BOP (bit operations)	NABS (additions and bit shifts)
MLP	$n_{n} n_{i}$	$n_{n} n_{i} [b_{w} b_{i} + \mathrm{Acc}(n_{i}, b_{w}, b_{i})]$	$n_{n} n_{i} (X_{w} + 1) \mathrm{Acc}(n_{i}, b_{w}, b_{i})$
1D-CNN	$n_{f} n_{i} n_{k} \cdot \mathrm{OutputSize}$	$\begin{matrix} \mathrm{OutputSize} \cdot n_{f} \mathrm{Mult}(n_{i} n_{k}, b_{w}, b_{i}) \\ + n_{f} \mathrm{Acc}(n_{i} n_{k}, b_{w}, b_{i}) \end{matrix}$	$\begin{matrix} \mathrm{OutputSize} \cdot n_{f} [n_{i} n_{k} (X_{w} + 1) - 1] \\ \cdot \mathrm{Acc}(n_{i} n_{k}, b_{w}, b_{i}) \\ + n_{f} \mathrm{Acc}(n_{i} n_{k}, b_{w}, b_{i}) \end{matrix}$
Vanilla RNN	$n_{s} n_{h} (n_{i} + n_{h})$	$\begin{matrix} n_{s} n_{h} \mathrm{Mult}(n_{i}, b_{w}, b_{i}) \\ + n_{s} n_{h} \mathrm{Mult}(n_{h}, b_{w}, b_{a}) \\ + 2 n_{s} n_{h} \mathrm{Acc}(n_{h}, b_{w}, b_{a}) \end{matrix}$	$\begin{matrix} n_{s} n_{h} [n_{i} (X_{w} + 1) - 1] \mathrm{Acc}(n_{i}, b_{w}, b_{i}) \\ + n_{s} n_{h} [n_{h} (X_{w} + 1) + 1] \mathrm{Acc}(n_{h}, b_{w}, b_{a}) \end{matrix}$
LSTM	$n_{s} n_{h} (4 n_{i} + 4 n_{h} + 3)$	$\begin{matrix} 4 n_{s} n_{h} \mathrm{Mult}(n_{i}, b_{w}, b_{i}) \\ + 4 n_{s} n_{h} \mathrm{Mult}(n_{h}, b_{w}, b_{a}) \\ + 3 n_{s} n_{h} b_{a}^{2} \\ + 9 n_{s} n_{h} \mathrm{Acc}(n_{h}, b_{w}, b_{a}) \end{matrix}$	$\begin{matrix} 4 n_{s} n_{h} [n_{i} (X_{w} + 1) - 1] \mathrm{Acc}(n_{i}, b_{w}, b_{i}) \\ + 4 n_{s} n_{h} [n_{h} (X_{w} + 1) + 1] \mathrm{Acc}(n_{h}, b_{w}, b_{a}) \\ + 6 n_{s} n_{h} b_{a} \end{matrix}$
GRU	$n_{s} n_{h} (3 n_{i} + 3 n_{h} + 3)$	$\begin{matrix} 3 n_{s} n_{h} \mathrm{Mult}(n_{i}, b_{w}, b_{i}) \\ + 3 n_{s} n_{h} \mathrm{Mult}(n_{h}, b_{w}, b_{a}) \\ + 3 n_{s} n_{h} b_{a}^{2} \\ + 8 n_{s} n_{h} \mathrm{Acc}(n_{h}, b_{w}, b_{a}) \end{matrix}$	$\begin{matrix} 3 n_{s} n_{h} [n_{i} (X_{w} + 1) - 1] \mathrm{Acc}(n_{i}, b_{w}, b_{i}) \\ + n_{s} n_{h} [3 n_{h} (X_{w} + 1) + 5] \mathrm{Acc}(n_{h}, b_{w}, b_{a}) \\ + 6 n_{s} n_{h} b_{a} \end{matrix}$
ESN	$n_{s} N_{r} (n_{i} + N_{r} s_{p} + 2 + n_{o})$	$\begin{matrix} n_{s} N_{r} \mathrm{Mult}(n_{i}, b_{w}, b_{i}) \\ + n_{s} N_{r} s_{p} \mathrm{Mult}(N_{r}, b_{w}, b_{a}) \\ + n_{s} N_{r} \mathrm{Mult}(n_{o}, b_{w}, b_{a}) \\ + 2 n_{s} N_{r} b_{a}^{2} \\ + 4 n_{s} N_{r} \mathrm{Acc}(N_{r}, b_{w}, b_{a}) \end{matrix}$	$\begin{matrix} n_{s} N_{r} [n_{i} (X_{w} + 1) - 1] \mathrm{Acc}(n_{i}, b_{w}, b_{i}) \\ + n_{s} N_{r} [s_{p} (N_{r} X_{w} + N_{r} - 1) + 4] \mathrm{Acc}(N_{r}, b_{w}, b_{a}) \\ + n_{s} N_{r} [n_{o} (X_{w} + 1) - 1] \mathrm{Acc}(n_{o}, b_{w}, b_{a}) \\ + 4 n_{s} N_{r} b_{a} \end{matrix}$
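To make the RM column of the table above easier to use, the following is a minimal Python sketch that evaluates the tabulated expressions (RM is read here as the number of real multiplications per layer). The function names are hypothetical, and the symbol meanings are assumptions inferred from common usage, since the table itself does not define them: $n_{n}$ neurons, $n_{i}$ input features, $n_{f}$ filters, $n_{k}$ kernel size, $n_{s}$ time steps, $n_{h}$ hidden units, $N_{r}$ reservoir neurons, $s_{p}$ reservoir sparsity, and $n_{o}$ outputs.

```python
# Sketch only: evaluates the RM expressions exactly as tabulated.
# Symbol meanings in the comments are assumptions, not taken from the table.

def rm_mlp(n_n, n_i):
    """Dense layer: n_n neurons (assumed), n_i inputs (assumed)."""
    return n_n * n_i

def rm_cnn1d(n_f, n_i, n_k, output_size):
    """1D convolution: n_f filters, n_i input channels, n_k kernel size (assumed)."""
    return n_f * n_i * n_k * output_size

def rm_vanilla_rnn(n_s, n_h, n_i):
    """Vanilla RNN: n_s time steps, n_h hidden units (assumed)."""
    return n_s * n_h * (n_i + n_h)

def rm_lstm(n_s, n_h, n_i):
    """LSTM: same symbols as the vanilla RNN."""
    return n_s * n_h * (4 * n_i + 4 * n_h + 3)

def rm_gru(n_s, n_h, n_i):
    """GRU: same symbols as the vanilla RNN."""
    return n_s * n_h * (3 * n_i + 3 * n_h + 3)

def rm_esn(n_s, n_r, n_i, s_p, n_o):
    """ESN: n_r reservoir neurons, s_p sparsity, n_o outputs (assumed)."""
    return n_s * n_r * (n_i + n_r * s_p + 2 + n_o)

if __name__ == "__main__":
    # Illustrative sizes only; they do not come from the paper.
    n_s, n_h, n_i = 10, 16, 4
    print("Vanilla RNN:", rm_vanilla_rnn(n_s, n_h, n_i))
    print("LSTM:       ", rm_lstm(n_s, n_h, n_i))
    print("GRU:        ", rm_gru(n_s, n_h, n_i))
```

For the illustrative sizes $n_{s} = 10$, $n_{h} = 16$, $n_{i} = 4$, the expressions give 3200 (vanilla RNN), 13280 (LSTM), and 10080 (GRU) multiplications, which shows how the gate count of each recurrent cell scales the cost for the same sequence length and hidden size.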

Advances in Optics and Photonics