
Forward–forward training of an optical neural network

Open Access

Abstract

Neural networks (NNs) have demonstrated remarkable capabilities in various tasks, but their computation-intensive nature demands faster and more energy-efficient hardware implementations. Optics-based platforms, using technologies such as silicon photonics and spatial light modulators, offer promising avenues for achieving this goal. However, training multiple programmable layers together with these physical systems poses challenges, as they are difficult to fully characterize and describe with differentiable functions, hindering the use of the error backpropagation algorithm. The recently introduced forward–forward algorithm (FFA) eliminates the need for perfect characterization of the physical learning system and shows promise for efficient training with large numbers of programmable parameters. The FFA does not require backpropagating an error signal to update the weights; instead, the weights are updated by sending information in only one direction. The local loss function defined for each set of trainable weights enables low-power analog hardware implementations without resorting to metaheuristic algorithms or reinforcement learning. In this paper, we present an experiment utilizing multimode nonlinear wave propagation in an optical fiber, demonstrating the feasibility of the FFA approach with an optical system. The results show that incorporating optical transforms in multilayer NN architectures trained with the FFA can lead to performance improvements, even with a relatively small number of trainable weights. The proposed method offers a new path to the challenge of training optical NNs and provides insights into leveraging physical transformations to enhance NN performance.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

Neural networks (NNs) are among the most powerful algorithms today. By learning from immense databases, these computational architectures can accomplish a wide variety of sophisticated tasks [1]. These tasks include understanding and translating languages [2], creating realistic images from verbal prompts [3], and estimating protein structures from genetic codes [4]. Given the significant potential impact of NNs on various areas, the computation-intensive nature of this algorithm necessitates faster and more energy-efficient hardware implementations for NNs to become ubiquitous.

With its intrinsic parallelism, high number of degrees of freedom, and low-loss information transfer capability, optics offers different approaches for the realization of a new generation of NN hardware. Silicon-photonics-based modulator meshes have been demonstrated to be capable of performing operations that form the building blocks of NNs, such as linear matrix operations [5] and pointwise nonlinear functions [6]. Two-dimensional (2D) spatial light modulators can fully exploit the 3D scalability of optics, as free-space propagation provides connectivity between each location on the modulator [7,8]. The reservoir computing approach capitalizes on the complex interactions of various optical phenomena to make inferences by training a single readout layer to map the state of the physical system [9–11]. However, achieving state-of-the-art performance on sophisticated tasks generally requires training multiple trainable layers.

Within the framework of NNs, training the parameters that precede a physical layer constitutes a challenge because complex physical systems are difficult to characterize or describe analytically. Without a fully known and differentiable function to represent the optical system, the error backpropagation (EBP) algorithm, which trains most conventional NNs, cannot be used. One solution to this problem uses a separate NN in the digital domain (a digital twin) to model the optical system; the error gradients of the layers preceding the physical system are then approximated with the digital twin [12]. However, this method requires a separate experimental characterization phase before training the main NN, which introduces a computational overhead that may be substantial depending on the complexity of the physical system. Another approach resorts to metaheuristic methods, observing only the dependency of the training performance on the values of the programmable weights, without modeling the input–output relation of the physical system [13]. The computational complexity of the training is much smaller in this case, but the method is better suited to training a small number of programmable parameters.

The forward–forward algorithm (FFA) defines a local loss function for each set of trainable weights, thereby eliminating the need for EBP and for perfect characterization of the learning system, while scaling efficiently to large numbers of programmable parameters [14]. With this approach, the error at the output of the network does not need to be backpropagated to every layer. The local loss function is defined as the goodness metric $L_{\mathrm{goodness}}(\mathbf{y}) = \sigma\left(\sum_j y_j^2 - \theta\right)$, where $\sigma(x)$ is the sigmoid nonlinearity, $y_j$ is the activation of the $j$th neuron for a given sample, and $\theta$ is the threshold level of the metric. The goal for each trainable layer is to increase $L_{\mathrm{goodness}}$ for positive samples and decrease it for negative samples. For a multi-class classification problem, such as the MNIST-digit dataset, a positive sample is created by marking the designated area corresponding to the true label of a given image. Similarly, a negative sample is created by marking a region corresponding to an incorrect label. The difference between the squared sums of activations for positive and negative samples is balanced between successive trainable layers with a normalization step to ensure that each layer learns representations distinct from those of the other layers. The local FFA training scheme enables the use of low-power analog hardware for NNs [14,15] because, unlike EBP, the FFA does not require direct access to or modeling of the weights of each layer in the NN. In our study, we explore the potential of the FFA and experimentally demonstrate that a complex nonlinear optical transform, such as nonlinear propagation in a multimode fiber (MMF), can be incorporated into an NN to improve its performance.
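
The goodness metric and the local objective can be summarized in a few lines of code. The following is a minimal NumPy sketch under stated assumptions: the label-embedding scheme (marking the first pixels of the top row) and the function names are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def embed_label(image_32x32, label, num_classes=10):
    """Create a positive (true label) or negative (incorrect label) sample by
    marking a designated area; here, illustratively, the first pixels of the top row."""
    x = image_32x32.astype(float).copy()
    x[0, :num_classes] = 0.0
    x[0, label] = 1.0            # mark the region corresponding to the chosen label
    return x

def goodness(activations, theta=2.0):
    """L_goodness(y) = sigmoid(sum_j y_j^2 - theta)."""
    return sigmoid(np.sum(activations ** 2) - theta)

def local_loss(act_pos, act_neg, theta=2.0):
    """Each layer increases goodness for positive samples and decreases it for
    negative samples; written here as a simple cross-entropy-style objective."""
    g_pos = goodness(act_pos, theta)
    g_neg = goodness(act_neg, theta)
    return -np.log(g_pos + 1e-9) - np.log(1.0 - g_neg + 1e-9)
```

Because this loss depends only on the activations of the layer being trained, each layer can be optimized without propagating gradients through any subsequent (digital or physical) stage.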

The optical apparatus used to implement the network that we trained with the FFA is shown in Fig. 1. We use multimode nonlinear wave propagation of spatially modulated laser pulses in an MMF. Even though the proposed training method is suitable for virtually any system capable of high-dimensional nonlinear interactions, this experiment was selected as the demonstration setup because of its remarkable ability to provide these effects with very low power consumption (6.3 mW average power, 50 nJ per pulse) [16]. The propagation of 10 ps mode-locked laser (Amplitude Laser, Satsuma) pulses with a 1030 nm wavelength in a confined area (50 µm diameter) over a long distance (5 m) provides nonlinear interactions between the 240 spatial eigenchannels of the MMF (OFS, bend-insensitive OM2, 0.20 NA) with a pulse energy of only 50 nJ.


Fig. 1. Schematic of the experimental setup used for obtaining the nonlinear optical information transform.


Before the light pulses are coupled to the MMF, their spatial phase is modulated with the input data by a phase-only 2D spatial light modulator (SLM, Meadowlark HSP1920). Owing to the dimensionality of the learning task and of the MMF, the inputs to the optical system in our implementation have 32 × 32 resolution with 8-bit depth. After this pattern is upsampled to 520 × 520 to cover the whole beam area on the SLM, the digital values of each pixel in the range of 0–255 are mapped linearly to phase modulations between 0 and 2π. The modulated beam is coupled to the MMF with a plano–convex lens. The output of the MMF is collimated with a lens, and its diffraction off a diffraction grating (Thorlabs GR25-0610) is recorded with a camera (FLIR BFS-U3-31S4M-C). Because the diffraction angle depends on the wavelength, the grating enables the camera to capture information about the spectral changes in addition to the spatial changes caused by nonlinearities inside the MMF.
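
The data-to-phase mapping is a simple resampling followed by a linear scaling. The sketch below shows one way to implement it, assuming NumPy and nearest-neighbor upsampling; the function name and the choice of interpolation are assumptions, not details reported in the paper.

```python
import numpy as np

def to_slm_phase(data_32x32, slm_size=520):
    """Upsample a 32x32, 8-bit pattern to slm_size x slm_size (nearest neighbor)
    and map pixel values 0-255 linearly to 0-2*pi of phase modulation."""
    n = data_32x32.shape[0]
    idx = np.arange(slm_size) * n // slm_size          # nearest-neighbor index map
    upsampled = data_32x32[np.ix_(idx, idx)].astype(float)
    return upsampled / 255.0 * 2.0 * np.pi             # phase pattern in radians
```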

The linear and nonlinear optical interactions in the MMF can be described in simplified form by the multimode nonlinear Schrödinger equation, written in terms of the coefficients of the propagation modes $(A_p)$ of the MMF:

$$\begin{aligned} \frac{{\partial {A_p}}}{{\partial z}} &= \underbrace{{i\delta \beta _0^p{A_p} - \delta \beta _1^p\frac{{\partial {A_p}}}{{\partial t}} - i\frac{{\beta _2^p}}{2}\frac{{{\partial ^2}{A_p}}}{{\partial {t^2}}}}}_{{\textrm{Dispersion}}} + \underbrace{{i\,\,\sum\limits_n {{C_{p,n}}{A_n}} }}_{{\textrm{Linear mode coupling}}}\\ &+ \underbrace{{i\frac{{{n_2}{\omega _0}}}{A}\sum\limits_{l,m,n} {{\eta _{p,l,m,n}}} {A_l}{A_m}A_n^\ast }}_{{\textrm{Nonlinear mode coupling}}}\; , \end{aligned}$$
where $\beta_n^p$ is the $n$th-order propagation constant of mode $p$, $C$ is the linear coupling matrix, $n_2$ is the nonlinearity coefficient of the core material, $\omega_0$ is the center angular frequency, $A$ is the core area, and $\eta$ is the nonlinear coupling tensor. This equation delineates the nature of the interactions obtained with the proposed experiment. In addition to the linear coupling, nonlinear coupling arises from the product of three different mode coefficients, demonstrating the high-dimensional complexity of the optical interactions.
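
To make the structure of the coupled-mode equation concrete, the following NumPy sketch evaluates its right-hand side for a vector of mode amplitudes, under the simplifying assumption of time-independent amplitudes (so the time-derivative dispersion terms reduce to the $i\,\delta\beta_0^p$ phase term); all array shapes and argument names are illustrative.

```python
import numpy as np

def dA_dz(A, delta_beta0, C, eta, n2, omega0, core_area):
    """Right-hand side of the coupled-mode equation for P modes.
    A: complex amplitudes (P,), delta_beta0: (P,), C: linear coupling (P, P),
    eta: nonlinear coupling tensor (P, P, P, P)."""
    dispersion = 1j * delta_beta0 * A                   # per-mode phase evolution
    linear = 1j * (C @ A)                               # linear mode coupling
    # nonlinear mode coupling: sum over l, m, n of eta[p,l,m,n] * A_l * A_m * conj(A_n)
    nonlinear = 1j * (n2 * omega0 / core_area) * np.einsum(
        'plmn,l,m,n->p', eta, A, A, np.conj(A))
    return dispersion + linear + nonlinear
```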

We evaluated the effectiveness of the proposed approach by constructing networks for the MNIST handwritten-digit classification task [17]. Due to speed and memory limitations, we randomly selected 4000 samples from the dataset for training, while the validation and test sets were allocated 1000 samples each. The architectures of our neural networks are shown in Fig. 2. A fully digital implementation of a multilayer network trained with EBP is depicted in Fig. 2(a), Fig. 2(b) shows a fully digital implementation trained with the FFA, and Fig. 2(c) shows the network that also includes optical layers and is trained with the FFA. In all three cases, each layer has a similar number of trainable parameters and is trained on the same subset of samples with 32 × 32 resolution, and the data keep this resolution throughout the NNs, including at the input and output of the optical transform, until they are flattened and processed by the output layer. All three NNs start with convolutional layers (two or three) followed by a fully connected (FC) output layer of 10 neurons. The NNs in Figs. 2(b) and 2(c) use the Ridge classifier algorithm from the scikit-learn library in Python as the output layer, since this algorithm allows for faster training with a single step of singular value decomposition.
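
A minimal sketch of such an output layer is shown below, using scikit-learn's RidgeClassifier on flattened activations; the random placeholder arrays stand in for the activations produced by the preceding (digital or optical) layers and are not real data.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
# placeholders for flattened activations after the final transform (32*32 features)
X_train, y_train = rng.normal(size=(4000, 32 * 32)), rng.integers(0, 10, 4000)
X_test, y_test = rng.normal(size=(1000, 32 * 32)), rng.integers(0, 10, 1000)

clf = RidgeClassifier(alpha=1.0)   # alpha is the regularization strength swept in Fig. 3
clf.fit(X_train, y_train)          # single closed-form fit, no iterative epochs
print("test accuracy:", clf.score(X_test, y_test))
```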


Fig. 2. Different NN architectures compared in the study. (a) Conventional NN trained with error backpropagation. Blue arrows show the information flow in the forward (inference) mode, and green lines indicate the training. (b) Diagram of a fully digital NN trained with the FF algorithm. Layers are trained locally with the goodness function. Activations of trainable layers except the first one are used by a separate output layer. (c) Our proposed method also includes optical information transformations between each trainable block. These activations reach the output layer after optical transformations.


The first two trainable layers of the NN trained by EBP [Fig. 2(a)] and all trainable layers of the FFA-trained NNs [Figs. 2(b) and 2(c)], except their output layers, are convolutional layers, each with one trainable 5 × 5 kernel with a dilation of 4 pixels, followed by a ReLU nonlinearity. Dilated kernels capture large features in images with a small number of parameters and are suitable for the current implementation because speckles span multiple pixels. Layer normalization operations are used, as part of the FFA, to scale the activations of each sample so that the vector of activations has an L2 norm equal to 1. In the optical transform steps in Fig. 2(c), these vectors of activations individually modulate the phase distribution of the beam as 2D arrays, and the corresponding beam output patterns are recorded by the camera and transferred to the next trainable layer.
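
One such trainable block can be sketched as follows. PyTorch is an assumption here (the paper does not name its deep-learning framework), and the class name is illustrative; the block implements a single 5 × 5 kernel with dilation 4, a ReLU, and per-sample L2 normalization of the activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # one trainable 5x5 kernel with dilation 4; effective size 17, so
        # padding 8 keeps the 32x32 resolution
        self.conv = nn.Conv2d(1, 1, kernel_size=5, dilation=4, padding=8)

    def forward(self, x):                       # x: (batch, 1, 32, 32)
        y = F.relu(self.conv(x))
        # scale each sample so its activation vector has unit L2 norm
        norms = y.flatten(1).norm(dim=1, keepdim=True).view(-1, 1, 1, 1)
        return y / (norms + 1e-8)
```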

The performances obtained on the test set are shown in Table 1; they confirm the finding that the performance of an NN trained with the FFA decreases compared to that obtained with the EBP-trained network. For the given dataset, the baseline of applying the output layer directly to the dataset yields an 84.4% test accuracy, and the EBP-trained NN [Fig. 2(a)], consisting of 2 convolutional layers and 1 FC layer, reaches 91.8%, while its FFA-trained counterpart with 3 convolutional layers reaches 90.8%. The addition of the high-dimensional nonlinear optical mapping improves the performance of the 2-convolutional-layer NN to 94.4% without any increase in computational operations. The fully digital FFA-trained NN includes 3 convolutional layers so that its output layer has the same size as that of the optical NN. Also, the 4% decrease in LeNet-5's accuracy due to the current smaller training set (95.0% vs. 99.0% with the full dataset [17]) indicates that with larger datasets the proposed approach may reach even higher accuracies.


Table 1. Comparison of Accuracies between Different Neural Networks

The improvement in the performance of the NN with the addition of the optical transforms is shown in more detail in Fig. 3. The obtained accuracy is plotted as a function of the strength of the regularization term used in the training of the output layer. When the optical nonlinear connectivity is combined with the relatively small number of trainable weights in the convolutional layers, both the training and test accuracies on the subset of the MNIST digits improve [Figs. 3(a) and 3(c)]. This improvement is also observed in the class-wise accuracies of the confusion matrix: the ratio of correct inferences increases for nearly all classes with the optical transform.


Fig. 3. Comparison between classification performances of NNs with and without optical transform on MNIST-digit dataset. (a) and (c) Dependence of training and test accuracies on the Ridge classifier regularization. (b) and (d) Confusion matrix of the test set when the optimum regularization parameter in (a) and (c) is used, respectively.


Even though the FFA simplifies NN training and decreases memory usage by decoupling the weight updates of different layers, benchmarks show that the task performance tends to decrease compared to training the same architecture with EBP. We demonstrate that, by adding non-trainable nonlinear mappings to the architecture, this decrease can be reversed and an increase in performance can be obtained.

To utilize physical transforms with most gradient-based or gradient-free training algorithms, the transform must be applied to each sample multiple times throughout the epochs of the algorithm. The FFA makes physical systems more accessible to NNs by removing this requirement. Instead of repeating the experiment over many epochs, the presented approach applies the optical transform to the data representation only once after each layer is trained, and the next layer is trained with the transformed representation. In addition to faster training, this model-free method has a much smaller digital footprint, fully exploiting the energy-efficiency potential of physical computing. Analog systems often need recalibration when the system characteristics change, such as the fiber conformation in this study; our approach would restore the optimal performance with a single set of physical experiments, without rigorous remodeling. In summary, this study demonstrates an NN architecture with high accuracy, fast training, and a small digital footprint by combining the benefits of the FFA with the complexity of optics. This advancement could address one of the biggest bottlenecks in training optical NNs, considering the limited modulation speed of electro-optic conversion devices.
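
The layer-by-layer flow described above can be summarized in a short sketch. It assumes the FFBlock and local FFA loss sketched earlier, a user-supplied train_block routine, and a placeholder optical_transform() that sends a batch of activations through the SLM/MMF/camera chain once per layer; these names are illustrative, not the authors' API.

```python
def train_network(blocks, pos_data, neg_data, optical_transform, train_block):
    """Train each block locally with the FFA, then pass the representations
    through the physical transform a single time before training the next block."""
    for block in blocks:
        train_block(block, pos_data, neg_data)            # local, fully digital FFA training
        pos_data = optical_transform(block(pos_data))     # one experimental pass per layer
        neg_data = optical_transform(block(neg_data))
    return pos_data, neg_data        # final representations feed the Ridge output layer
```

Because the optical system is queried only once per layer rather than once per epoch, a change in fiber conformation can be absorbed by rerunning this single set of physical measurements.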

Funding

Google (901381).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper may be obtained from the authors upon reasonable request.

REFERENCES

1. D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, Science 362, 1140 (2018). [CrossRef]  

2. J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” arXiv, arXiv:2206.07682 (2022). [CrossRef]  

3. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, Louisiana, June 18–24, 2022, pp. 10684–10695.

4. J. Jumper, R. Evans, A. Pritzel, et al., Nature 596, 583 (2021). [CrossRef]  

5. N. C. Harris, J. Carolan, D. Bunandar, M. Prabhu, M. Hochberg, T. Baehr-Jones, M. L. Fanto, A. M. Smith, C. C. Tison, P. M. Alsing, and D. Englund, Optica 5, 1623 (2018). [CrossRef]  

6. I. A. D. Williamson, T. W. Hughes, M. Minkov, B. Bartlett, S. Pai, and S. Fan, IEEE J. Sel. Top. Quantum Electron. 26, 7700412 (2020). [CrossRef]  

7. X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, Science 361, 1004 (2018). [CrossRef]  

8. T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, and Q. Dai, Nat. Photonics 15, 367 (2021). [CrossRef]  

9. M. Rafayelyan, J. Dong, Y. Tan, F. Krzakala, and S. Gigan, Phys. Rev. X 10, 041037 (2020). [CrossRef]  

10. S. Sunada and A. Uchida, Sci. Rep. 9, 19078 (2019). [CrossRef]  

11. M. Yildirim, I. Oguz, F. Kaufmann, M. R. Escale, R. Grange, D. Psaltis, and C. Moser, “Nonlinear optical data transformer for machine learning,” arXiv, arXiv:2208.09398 (2022). [CrossRef]  

12. L. G. Wright, T. Onodera, M. M. Stein, T. Wang, D. T. Schachter, Z. Hu, and P. L. McMahon, Nature 601, 549 (2022). [CrossRef]  

13. I. Oguz, J.-L. Hsieh, N. U. Dinc, U. Teğin, M. Yildirim, C. Gigli, C. Moser, and D. Psaltis, “Programming nonlinear propagation for efficient optical learning machines,” arXiv, arXiv:2208.04951 (2022). [CrossRef]  

14. G. Hinton, “The Forward-Forward Algorithm: some preliminary investigations,” arXiv, arXiv:2212.13345 (2022). [CrossRef]  

15. A. Momeni, B. Rahmani, M. Mallejac, P. Del Hougne, and R. Fleury, “Backpropagation-free training of deep physical neural networks,” arXiv, arXiv:2304.11042 (2023). [CrossRef]  

16. U. Teğin, M. Yıldırım, İ. Oğuz, C. Moser, and D. Psaltis, Nat. Comput. Sci. 1, 542 (2021). [CrossRef]  

17. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, Neural Comput. 1, 541 (1989). [CrossRef]  
