
Classification accuracy improvement of the optical diffractive deep neural network by employing a knowledge distillation and stochastic gradient descent β-Lasso joint training framework


Abstract

The optical diffractive deep neural network (OD2NN) is an optical machine learning framework that utilizes diffraction through cascaded diffractive surfaces to perform an arbitrary function. Compared with deep neural networks (DNNs) implemented in the electronic domain, proof-of-principle demonstrations of OD2NNs show promising advantages in terms of speed and power efficiency. However, the classification accuracy of demonstrated OD2NNs has been limited by the absence of optical nonlinear operations, even in hybrid OD2NNs integrated with electronic neural networks. Here, we propose a novel training framework to improve the classification accuracy of OD2NNs without employing any nonlinear physical elements. In this framework, a hybrid OD2NN with an integrated fully connected electronic layer is adopted and trained jointly with knowledge distillation (KD) and stochastic gradient descent β-Lasso (SGD-β-Lasso). Blind testing classification accuracies of 70.19% and 85.17% have been obtained for the Cifar-10 and Cats vs. Dogs datasets, respectively, which are the highest accuracies achieved by hybrid OD2NNs so far. In addition, the proposed framework significantly reduces the complexity of hardware fabrication and layer alignment, since the hybrid OD2NN consists of only five diffractive layers. This work takes a significant step toward the application of OD2NNs in realistic scenarios.

© 2021 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

As an emerging machine learning technology, deep neural networks (DNNs) have attracted extensive research attention and industrial investment around the world [1–4]. However, the unsustainable cost of energy and time has become a major concern in implementing DNNs [5,6]. To mitigate this issue, implementing neural networks in the optical domain has received increasing attention because of its inherent advantages in power efficiency and operation speed [7–9]. More recently, a number of optical neural networks (ONNs) that harness nano-photonic technology have been experimentally demonstrated [10–14]. The optical diffractive deep neural network (OD$^2$NN), as an emerging ONN, has been intensively studied [15–27]. To date, OD$^2$NNs follow two architectural paradigms: the standard architecture (all-optical, Fig. 1(a)) and the hybrid architecture (optic-electric, Fig. 1(b)). In the hybrid architecture, a fully connected electronic layer is integrated after the cascaded diffractive layers for class assignment at the output plane. The OD$^2$NN has shown classification accuracy comparable to DNNs on specific image classification tasks [15,17]. For OD$^2$NNs, the inherent parallelism of light greatly improves the capability, capacity, and bandwidth of data processing. In addition, inference in an OD$^2$NN is nearly instantaneous, limited only by the bandwidth of the input modulator and the photodetector array; an image with millions of pixels can typically be classified within tens of microseconds at a power of a few watts.

Nevertheless, OD$^2$NNs have not yet achieved accuracy comparable to fully-trained DNNs on complex tasks. To tackle this problem, several new OD$^2$NNs have been proposed. Li et al. proposed a jointly optimized standard OD$^2$NN via class-specific differential detection and achieved 50.82% blind testing accuracy on the Cifar-10 dataset [18]. On the same dataset, the standard OD$^2$NN introduced by Rahman et al. numerically achieved 62.13% accuracy by using feature engineering and ensemble learning [20]. However, ensemble learning results in a dramatic growth in the number of diffractive layers and opto-electronic detectors, which greatly increases the complexity of hardware fabrication and layer alignment for the OD$^2$NN setup. The hybrid OD$^2$NN has also been proposed and demonstrated to provide slightly better classification accuracy than the standard OD$^2$NN [17].


Fig. 1. (a) Architecture of the standard OD$^2$NN. (b) Architecture of the hybrid OD$^2$NN.


Although the aforementioned works improve classification accuracy, a nontrivial accuracy gap remains between OD$^2$NNs and DNNs. Unlike DNNs, OD$^2$NNs have no nonlinear activation between diffractive layers. Nonlinearity is difficult to introduce in the optical domain because exciting optical nonlinearity requires high input optical power and exotic materials. Alternatively, nonlinearity can be introduced in the electronic domain, but the overhead of repeated opto-electronic conversions between diffractive layers severely impairs the benefits of optical operation, such as the speedup factor and energy efficiency [28].

To compensate for the absence of nonlinearity and further improve the classification accuracy of the hybrid OD$^2$NN, we propose a novel training framework employing knowledge distillation (KD) and stochastic gradient descent $\beta$-Lasso (SGD-$\beta$-Lasso). KD was originally introduced for model compression of DNNs: Refs. [29] and [30] utilized KD to transfer knowledge from large teacher DNNs to less complex student DNNs. In this work, KD is used to transfer the distilled knowledge of the teacher DNN to the hybrid OD$^2$NN and thereby remove the need to implement optical or electronic nonlinearity in the layer-wise diffractive process. In essence, the distilled knowledge here is a feature representation rather than a nonlinear mapping, which can be learned with the help of the nonlinear module to facilitate the classification task. Generally, for better knowledge transfer, KD requires the architectures of the teacher and student networks to be homogeneous. However, it has been demonstrated in [15] that the OD$^2$NN is similar to a multilayer perceptron (MLP), whose architecture differs from that of the teacher DNN [31]. Thus, additional measures are required to guarantee the performance of KD. In [32], $\beta$-Lasso was proposed to endow MLPs with local connectivity similar to CNNs, achieving 85.19% blind testing accuracy on Cifar-10. Therefore, SGD-$\beta$-Lasso is proposed here to compensate for the performance deterioration caused by transferring knowledge from the DNN to the hybrid OD$^2$NN.

In this paper, a novel training framework consisting of KD and SGD-$\beta$-Lasso is proposed and demonstrated on a hybrid OD$^2$NN that comprises an optical frontend (five diffractive layers) and a single fully connected electronic layer. Blind testing classification accuracies of 70.19% and 85.17% are achieved numerically for the Cifar-10 and Cats vs. Dogs datasets, respectively, which are the highest accuracies reported for hybrid OD$^2$NNs on these datasets.

2. Proposed method

2.1 Architecture of the hybrid OD$^2$NN

This section details the architecture of the hybrid OD$^2$NN used in this paper. As shown in Fig. 1, for both paradigms, the image modulates coherent light at the input plane. The modulated electromagnetic radiation then propagates layer by layer in free space while being complex-modulated by the trainable transmission coefficients of the diffractive layers. Following the Huygens-Fresnel principle [33], each small element of a diffractive layer acts as a source of a secondary wave when reached by the incident disturbance and represents the computing unit of the OD$^2$NN. In the standard OD$^2$NN, as shown in Fig. 1(a), the output plane is implemented by a photo-detector array, and each photo-detector represents a specific inference class. The standard OD$^2$NN performs inference by identifying onto which detector the output light is focused. The proposed hybrid OD$^2$NN consists of an optical frontend with cascaded diffractive layers and a single fully connected electronic layer, as shown in Fig. 1(b). It has been demonstrated that the hybrid OD$^2$NN achieves slightly better classification accuracy than the standard OD$^2$NN on intricate tasks [17]. To further improve the classification accuracy of the hybrid OD$^2$NN without employing nonlinear layers and/or increasing the number of diffractive layers, a novel training framework that uses KD and SGD-$\beta$-Lasso to transfer knowledge to the hybrid OD$^2$NN is proposed. The details are described in the following section.

2.2 Novel training framework

In this section, we describe the proposed training framework in detail. First, the mathematical model of the hybrid OD$^2$NN is introduced. Second, the training process of the hybrid OD$^2$NN with KD is presented. Third, the SGD-$\beta$-Lasso optimizer applied during the training phase is described.

Figure 2 shows the schematic flow of the proposed training framework. As shown in Fig. 2, the training framework for the hybrid OD$^2$NN consists of a teacher and a student network. First, the nonlinear teacher network (an all-convolutional neural network, AllConvNet) is fully trained so that it is capable of performing the given task with high classification accuracy. The student network (the hybrid OD$^2$NN) is then trained in the conventional manner by minimizing the loss. The loss has two parts: the ‘hybrid OD$^2$NN loss’ between the predictions and the true labels of the training data, and the ‘temperature loss’ given by the Kullback–Leibler (KL) divergence between the hybrid OD$^2$NN and the AllConvNet. The loss is minimized through the back-propagation (BP) algorithm using the SGD-$\beta$-Lasso optimizer.


Fig. 2. Schematic diagram of the proposed training framework applied on the hybrid OD$^2$NN.


AllConvNet [34] proposed by Springenberg et al. is used here as a teacher network because of its effectiveness for KD. The architecture of AllConvNet is described in Table 1. The size of the input image for the modified AllConvNet is $100\times 100\times 1$. $5\times 5$ convolutions with rectified linear units (ReLUs) as activation functions are used in the AllConvNet.


Table 1. Architecture of the teacher network (AllConvNet)
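For concreteness, a PyTorch sketch of an all-convolutional teacher in the spirit of [34] is given below. The $100\times 100\times 1$ input size, $5\times 5$ kernels, and ReLU activations follow the description above; the channel widths, the number of blocks, and the use of strided convolutions with global average pooling are illustrative assumptions, since the exact configuration is listed in Table 1.

```python
import torch.nn as nn

# Illustrative sketch of an all-convolutional teacher (channel widths and depth are
# assumptions; see Table 1 for the configuration actually used in the paper).
class AllConvNetSketch(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(96, 96, kernel_size=5, stride=2, padding=2), nn.ReLU(),   # strided conv instead of pooling
            nn.Conv2d(96, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(192, 192, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(192, num_classes, kernel_size=1),                         # 1x1 conv maps to class channels
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling to per-class logits

    def forward(self, x):                      # x: (B, 1, 100, 100) grayscale images
        return self.pool(self.features(x)).flatten(1)
```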

2.2.1 Mathematics of the hybrid OD$^2$NN network

In this section, the hybrid OD$^2$NN is described briefly. The trainable parameters of the hybrid OD$^2$NN are complex-valued, formed by the transmission coefficients of the diffractive layers. The transmission coefficient $t_i^l$ of the $i_{th}$ neuron in layer $l$ is given by:

$$t_i^l = a_i^le^{j\Phi_i^l},$$
where $\Phi _i^l$ is the adjustable phase parameter of the $i_{th}$ neuron in the $l_{th}$ layer and $a_i^l$ is the amplitude term. Each diffractive layer therefore contains $N^2$ trainable parameters, where $N$ is the number of neurons along one dimension of the diffractive surface. Neglecting amplitude losses, $a_i^l$ can be set to 1 without affecting the classification accuracy of the network. Unlike in DNNs, the diffractive layers of the hybrid OD$^2$NN are connected by electromagnetic radiation (mathematically described by a complex field). If the complex field emerging from layer $l$ is encoded in the $N\times N$ matrix $X_l$ with elements $x_i^l$ ($i=1,2,\ldots,N\times N$), the field after the next layer, $Y_{l+1}$, can be computed with the Fourier transform as [35]:
$$Y_{l+1} = {\mathscr{F}}^{{-}1}({\mathscr{F}}(X_{l}) \cdot e^{jk_zd})\circ T_{l+1},$$
$$k_z = \dfrac{2\pi}{\lambda} \sqrt{1-\alpha^2-\beta^2}.$$

In Eq. (2) and Eq. (3), $d$ is the distance between the two considered layers, $\alpha$ and $\beta$ denote the direction cosines, and $\circ$ denotes the Hadamard product. $T_{l+1}$ is the matrix of the transmission coefficients of the next layer $l+1$; its $i_{th}$ element is $t_i^{l+1} = e^{j\Phi _i^{l+1}}$ with the trainable parameter $\Phi _i^{l+1}$. The matrix $Y_{l+1}$ obtained from Eq. (2) represents the field after layer $l+1$ and serves as the input to the subsequent layer. By iterating this operation, the electromagnetic radiation is propagated to the output plane.
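To make the propagation rule of Eqs. (2)–(3) concrete, a minimal PyTorch sketch of one propagation-plus-modulation step is given below; the function name, argument layout, and the clamping of evanescent components are assumptions of this sketch rather than details taken from the paper's code.

```python
import torch

def propagate(x_l, phi_next, d, wavelength, pixel_size):
    """Angular-spectrum propagation of Eqs. (2)-(3), followed by phase modulation.

    x_l        : complex field after layer l, shape (..., N, N)
    phi_next   : trainable phases Phi of layer l+1, shape (N, N)
    d, wavelength, pixel_size : layer spacing, wavelength, neuron pitch (same units)
    """
    n = x_l.shape[-1]
    fx = torch.fft.fftfreq(n, d=pixel_size)                     # spatial frequencies along x
    fy = torch.fft.fftfreq(n, d=pixel_size)                     # and along y
    alpha = wavelength * fx[None, :]                            # direction cosines of Eq. (3)
    beta = wavelength * fy[:, None]
    kz = 2.0 * torch.pi / wavelength * torch.sqrt(
        torch.clamp(1.0 - alpha**2 - beta**2, min=0.0))         # evanescent components suppressed
    h = torch.polar(torch.ones_like(kz), kz * d)                # free-space transfer function e^{j k_z d}
    diffracted = torch.fft.ifft2(torch.fft.fft2(x_l) * h)       # F^{-1}( F(X_l) e^{j k_z d} )
    t_next = torch.polar(torch.ones_like(phi_next), phi_next)   # phase-only t_i = e^{j Phi_i}, a_i = 1
    return diffracted * t_next                                  # Hadamard product with T_{l+1}
```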

In the hybrid OD$^2$NN, the output of the cascaded diffractive layers is coupled to a single fully connected electronic layer. The complex-valued matrix $X_{ll}$ ($ll$: last layer) after the last diffractive layer contains $N^2$ elements whose intensities are measured by the photo-detectors placed before the fully connected layer. Consequently, the result vector $Y_{ol}$ ($ol$: output of the fully connected layer) can be described as:

$$Y_{ol} = W_{fc}\cdot|X_{ll}|^2 + b_{fc},$$
where $W_{fc}$ and $b_{fc}$ represent the weight matrix and the bias of the fully connected layer, respectively. Therefore, the probability $P^j$ assigned to each class $j$ by the hybrid OD$^2$NN model is written as:
$$P^j = \exp(Y_{ol}^j)\Big/\sum_j \exp(Y_{ol}^j),$$
where $Y_{ol}^j$ is the $j_{th}$ element of $Y_{ol}$. The transmission coefficients of the diffractive layers in the hybrid OD$^2$NN are optimized by minimizing the cross-entropy loss $L_\Phi$ over the training data:
$${\rm Minimize} \hspace{0.2cm} L_\Phi ={-}\sum_{n=1}^M\sum_{j=1}^C y_n^j\log(P_n^j),$$
where $M$ is the number of training samples, $C$ is the number of classes, $y_n^j$ is the one-hot label of the $n_{th}$ training sample for the $j_{th}$ class, and $P_n^j$ is the corresponding predicted probability.
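Putting Eqs. (1)–(6) together, the whole hybrid OD$^2$NN can be sketched as a small PyTorch module that reuses the propagate() helper above; the bilinear resampling and amplitude encoding of the input image onto the $100\times 100$ plane are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the hybrid OD2NN of Eqs. (1)-(6). The layer count, N and the
# physical constants follow Section 3; the input-encoding step is an assumption.
class HybridOD2NN(nn.Module):
    def __init__(self, n=100, num_layers=5, num_classes=10,
                 wavelength=532e-9, pixel=0.5 * 532e-9, dist=40 * 532e-9):
        super().__init__()
        # One trainable phase map Phi per diffractive layer, Eq. (1) with a_i^l = 1.
        self.phases = nn.ParameterList(
            [nn.Parameter(2 * torch.pi * torch.rand(n, n)) for _ in range(num_layers)]
        )
        self.fc = nn.Linear(n * n, num_classes)   # the single fully connected electronic layer
        self.n, self.wavelength, self.pixel, self.dist = n, wavelength, pixel, dist

    def forward(self, img):                        # img: (B, 1, H, W) grayscale images
        field = F.interpolate(img, size=(self.n, self.n), mode="bilinear", align_corners=False)
        field = field.squeeze(1).to(torch.complex64)          # amplitude-encode the image
        for phi in self.phases:
            field = propagate(field, phi, self.dist, self.wavelength, self.pixel)
        intensity = field.abs() ** 2                           # |X_ll|^2 measured by the detectors
        return self.fc(intensity.flatten(1))                   # Y_ol of Eq. (4)

# Training with the standard objective of Eq. (6) then amounts to
#   loss = nn.CrossEntropyLoss()(model(images), labels)
# since CrossEntropyLoss applies the softmax of Eq. (5) internally.
```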

2.2.2 Hybrid OD$^2$NN training with knowledge distillation

Thanks to the ground-breaking work described in [30], knowledge distillation has been widely used for model compression and transfer learning. In general, making predictions with a whole ensemble of models is cumbersome and computationally expensive; distilling the knowledge of the ensemble into a single model makes deployment much easier. More specifically, the features representing the ‘knowledge’ are distilled and transferred to the student network without loss of validity. It has been shown that student networks trained with the KD approach can achieve a significant improvement in accuracy [36]. In the proposed training framework shown in Fig. 2, KD is used to transfer the distilled knowledge of the teacher AllConvNet to the hybrid OD$^2$NN and thereby remove the need to implement optical or electronic nonlinearity in the layer-wise diffraction process. Accordingly, the probability distribution over classes defined by the hybrid OD$^2$NN in Eq. (5) can be rewritten as:

$$P^{j,t} = \exp(Y_{ol}^j/t)\Big/\sum_j \exp(Y_{ol}^j/t),$$
where $t$ is a hyper-parameter called the ‘temperature’. A higher ‘temperature’ generates a smoother probability distribution over the output classes. The objective of KD training is to minimize the total loss by updating the transmission coefficients $\Phi$ of the hybrid OD$^2$NN. The total loss is a linear combination of the ‘temperature loss’ and the ‘hybrid OD$^2$NN loss’. The ‘temperature loss’ is the Kullback–Leibler (KL) divergence between $P_{teacher}^{j,t}$ of the pre-trained AllConvNet and $P_{student}^{j,t}$ of the hybrid OD$^2$NN at the same ‘temperature’ $t$. The ‘hybrid OD$^2$NN loss’ is the standard cross-entropy loss between the true labels of the training samples and the probability distribution $P^j$ of the hybrid OD$^2$NN. Thus, Eq. (6) can be rewritten as:
$${\rm Minimize} \hspace{0.2cm} L_{\Phi,t} =\alpha\sum_{j=1}^C\xi_{ce}(y^j,P^j)+(1-\alpha)\sum_{j=1}^C\xi_{kl}(P_{teacher}^{j,t},P_{student}^{j,t}),$$
where $\alpha$ is the linear combination hyper-parameter, $\xi _{ce}$ is the cross-entropy loss function and $\xi _{kl}$ is the KL divergence loss function. The transmission coefficient $\Phi$ is updated according to the back-propagation method.
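A minimal PyTorch sketch of the joint objective in Eq. (8) is shown below; the temperature $t$ and weight $\alpha$ values are placeholders rather than the hyper-parameters used in the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, t=4.0, alpha=0.5):
    """Joint objective of Eq. (8): a minimal sketch.

    The first term is the 'hybrid OD2NN loss' (cross-entropy with the true labels);
    the second is the 'temperature loss', the KL divergence between the temperature-
    softened teacher and student distributions of Eq. (7).
    """
    ce = F.cross_entropy(student_logits, labels)                    # xi_ce(y, P)
    log_p_student = F.log_softmax(student_logits / t, dim=1)        # log P_student^{j,t}
    p_teacher = F.softmax(teacher_logits / t, dim=1)                # P_teacher^{j,t}
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")  # xi_kl
    return alpha * ce + (1.0 - alpha) * kl
```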

2.2.3 SGD-$\beta$-Lasso optimizer for hybrid OD$^2$NN training

Today, convolutional neural networks (CNNs) are the first choice for image processing tasks because they extract image features precisely. Therefore, the choice of the nonlinear teacher network focuses mainly on CNN models. In practice, the performance of KD training depends on the architecture of the teacher network: distilling knowledge from the teacher and transferring it to the student is particularly effective when the architectural differences between the two networks are minimal. Herein, the AllConvNet is utilized as the teacher network for distilling the knowledge. The AllConvNet architecture consists only of convolutional layers, without pooling layers, residual blocks, or other intricate structures that are difficult to realize in an OD$^2$NN. In addition, the operation of a diffractive layer in the hybrid OD$^2$NN is a superset of mathematical convolution. In short, the hybrid OD$^2$NN and the AllConvNet are relatively similar in architecture.

Convolution is a “hand-designed” bias that expresses local connectivity and weight sharing [32]. If the diffractive layers can learn these convolutional biases from scratch, the performance of the hybrid OD$^2$NN can be further improved under KD training because of the increased architectural similarity. Empirically, locally-connected networks perform better than fully-connected ones. In [32], it was shown that the generalization gap of an architecture is governed by its number of non-zero weights. Therefore, the Lasso algorithm (Least Absolute Shrinkage and Selection Operator) [37] can be incorporated as a simple strategy for encouraging sparse connections, yielding a locally-connected hybrid OD$^2$NN with better performance. By employing the Lasso algorithm, sparsity and feature selection in the hybrid OD$^2$NN are achieved by driving some transmission coefficients to 0. The update equation for the transmission coefficients $\Phi$ is then:

$$\Phi^{(k+1)}= \Phi^k-\eta^k(\frac{\partial}{\partial\Phi}(L_{\Phi,t})+\lambda sign(\Phi^k)),$$
where $\eta$ is the learning rate and $\lambda$ is the meta-parameter that controls the degree of regularization. Herein, $\beta$-Lasso is introduced as a simple algorithm that is very similar to Lasso except for an extra parameter $\beta$, which allows a more aggressive soft threshold. The update process of $\beta$-Lasso is described by Eq. (9) together with Eq. (10) below:
$$\Phi^{k+1} = \Phi^{k+1}(|\Phi^{k+1}|\geq\beta\lambda)_+,$$
where $\beta$ is the meta-parameter of threshold coefficient.

To further improve the training efficiency of the hybrid OD$^2$NN, the Stochastic Gradient Descent (SGD) learning framework is combined with $\beta$-Lasso, abbreviated ‘SGD-$\beta$-Lasso’, to update the transmission coefficients. SGD uses approximate gradients estimated from subsets of the training data and updates the weights much more frequently than batch training frameworks. This learning framework is attractive because it often requires much less training time, especially when the training data are large and redundant. The pseudo-code of SGD-$\beta$-Lasso is shown in Fig. 3.
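For reference, a minimal PyTorch sketch of one SGD-$\beta$-Lasso update following Eqs. (9) and (10) is given below; it would be applied after each minibatch back-propagation (i.e., once the gradients of $L_{\Phi,t}$ have been computed), and any combination with momentum or weight decay is not addressed here.

```python
import torch

@torch.no_grad()
def sgd_beta_lasso_step(params, lr, lam, beta):
    """One SGD-beta-Lasso update, a minimal sketch of Eqs. (9) and (10).

    Eq. (9): gradient step plus the L1 (Lasso) subgradient lam * sign(Phi);
    Eq. (10): hard threshold zeroing coefficients with |Phi| < beta * lam.
    """
    for phi in params:
        if phi.grad is None:
            continue
        phi -= lr * (phi.grad + lam * torch.sign(phi))      # Eq. (9)
        phi *= (phi.abs() >= beta * lam).to(phi.dtype)      # Eq. (10)
```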


Fig. 3. Algorithm flow of SGD-$\beta$-Lasso


3. Results and discussion

The classification accuracy of the proposed hybrid OD$^2$NN on the Cifar-10 (gray) and Cats vs. Dogs (gray) datasets has been systematically evaluated (see Table 2). The Cifar-10 dataset consists of 60000 $32\times 32$ color images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. The Cats vs. Dogs dataset consists of 25,000 labeled training images evenly split between the two classes (cats and dogs), plus 12,500 test images without labels. Since the images of Cifar-10 and Cats vs. Dogs contain three color channels (red, green and blue), they are converted to grayscale to comply with the monochromatic illumination used in OD$^2$NNs, which may deteriorate the classification accuracy. The simulation of the OD$^2$NN is implemented with the PyTorch framework.


Table 2. Classification accuracy of the OD$^2$NNs

In this paper, the optical frontend of the hybrid OD$^2$NN consists of 5 successive diffractive layers. The number of neurons $N$ along one dimension of each layer is 100. The wavelength of the coherent light is 532 nm. The size of each neuron and the axial distance between two successive diffractive layers are set to $0.5\lambda$ and $40\lambda$, respectively. Phase-only modulation is used in the simulation, with phase coefficients ranging from 0 to $2\pi$ during training. The training parameters are configured as follows: the optimizers include Adaptive Moment Estimation (Adam) and the proposed SGD-$\beta$-Lasso; the number of training epochs for each model is 100; the batch sizes of the training and test sets are 64 and 1000, respectively. All training tasks are run on a server configured as follows: Intel Xeon Platinum 8176 central processing unit (CPU, Intel Inc.); Tesla V100 graphical processing unit (GPU, Nvidia Inc.); Ubuntu 18.04 operating system.
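The components sketched above (the hybrid model, the KD loss, and the SGD-$\beta$-Lasso step) can be combined into a training loop roughly as follows; the learning rate and the $\lambda$/$\beta$ values are placeholders rather than the hyper-parameters used in the paper, and the data loader is assumed to yield grayscale image tensors with integer labels.

```python
import torch

# Illustrative end-to-end loop under the configuration of this section
# (100 epochs, training batch size 64); lr, lam and beta are placeholders.
def train(student, teacher, loader, epochs=100, lr=1e-2, lam=1e-6, beta=50.0):
    teacher.eval()                                    # teacher is fully pre-trained and frozen
    for _ in range(epochs):
        for images, labels in loader:                 # grayscale images, batch size 64
            with torch.no_grad():
                teacher_logits = teacher(images)      # soft targets for the 'temperature loss'
            student_logits = student(images)
            loss = kd_loss(student_logits, teacher_logits, labels)
            student.zero_grad()
            loss.backward()
            sgd_beta_lasso_step(student.parameters(), lr, lam, beta)
```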

Table 2 presents the classification accuracy of the hybrid OD$^2$NN trained with different methods. In particular, three approaches have been evaluated: Hybrid OD$^2$NN, Hybrid OD$^2$NN + KD, and Hybrid OD$^2$NN + KD + SGD-$\beta$-Lasso. The results show that the proposed joint training framework with KD and SGD-$\beta$-Lasso improves the classification accuracy significantly. On the Cifar-10 dataset, the hybrid OD$^2$NN achieves 70.19% accuracy, which is the highest accuracy achieved by an OD$^2$NN on this dataset. In addition, it achieves 85.17% accuracy on the Cats vs. Dogs (gray) dataset, which is close to the accuracy of the AllConvNet, whose benchmark is around 91% [34]. Considering that the input images are grayscale, it is promising to achieve better classification accuracy by utilizing more channels in the hybrid OD$^2$NN. Compared with the hybrid OD$^2$NN trained with the regular strategy, the hybrid OD$^2$NN with KD and SGD-$\beta$-Lasso joint training improves the accuracy from 52.38% to 70.19% on Cifar-10 (gray) and from 69.64% to 85.17% on Cats vs. Dogs (gray).

Using the KD approach in the training phase, the hybrid OD$^2$NN can closely approach the performance of the fully trained network with nonlinearity. This indicates that distilling knowledge from the nonlinear network and transferring it to the linear hybrid OD$^2$NN is feasible. In addition, the use of SGD-$\beta$-Lasso does enable the hybrid OD$^2$NN to adopt the inductive bias of the convolution operation, improving accuracy. This observation indicates that the KD approach is most effective when the network architecture of the student model is as close to that of the teacher as possible. However, there remains a classification accuracy gap between the hybrid OD$^2$NN trained with the KD and SGD-$\beta$-Lasso joint framework and the AllConvNet (18.24% below on Cifar-10 (gray); 5.52% below on Cats vs. Dogs (gray)). It is noteworthy that these two gaps differ by 12.72 percentage points, which is relatively large. The reason may be that the feature extraction capability of the optical frontend (5 diffractive layers) is insufficient for datasets like Cifar-10 with many categories and large inter-class similarity. Increasing the number of frequency channels of the input light and the number of diffractive layers are potential ways to further improve the classification accuracy.

The convergence plots of the training process of the hybrid OD$^2$NN for the Cifar-10 (gray) and Cats vs. Dogs (gray) datasets are shown in Fig. 4 and Fig. 5. In order to demonstrate the capability of SGD-$\beta$-Lasso in learning local connectivity, the performance of the hybrid OD$^2$NN trained with SGD-$\beta$-Lasso and with the conventional method (Adam) has been investigated independently. The phase value of each neuron is wrapped to [0, $2\pi$]. The right panels show the trained multi-layer phase masks and the corresponding phase histograms for the different datasets. Clearly, the solution found by SGD-$\beta$-Lasso has fewer nonzero parameters than that found with the Adam optimizer.


Fig. 4. (a) Sample pictures of Cifar-10 (gray). (b) Classification accuracy of the OD$^2$NN under different training frameworks on Cifar-10. (c) Phase distributions of the diffractive layers trained without the SGD-$\beta$-Lasso optimizer. (d) Statistical histogram of the phases in (c). (e) Phase distributions of the diffractive layers trained with the SGD-$\beta$-Lasso optimizer. (f) Statistical histogram of the phases in (e).



Fig. 5. (a) Sample pictures of the Cats vs. Dogs (gray) dataset. (b) Classification accuracy of the OD$^2$NN under different training frameworks on the Cats vs. Dogs (gray) dataset. (c) Phase distributions of the diffractive layers trained without the SGD-$\beta$-Lasso optimizer. (d) Statistical histogram of the phases in (c). (e) Phase distributions of the diffractive layers trained with the SGD-$\beta$-Lasso optimizer. (f) Statistical histogram of the phases in (e).


4. Conclusion

In summary, in order to compensate for the absence of nonlinearity and to improve the classification accuracy of the hybrid OD$^2$NN, a novel training framework employing knowledge distillation (KD) and SGD-$\beta$-Lasso is proposed. KD is introduced in the training phase to transfer the knowledge of the nonlinear teacher model (AllConvNet) to the hybrid OD$^2$NN without introducing nonlinear layers, while the SGD-$\beta$-Lasso approach allows the diffraction-based hybrid OD$^2$NN to adopt the inductive bias of the convolution operation, which ensures the architectural similarity of the teacher and student networks. The performance of the proposed hybrid OD$^2$NN on the Cifar-10 (gray) and Cats vs. Dogs (gray) datasets is evaluated. On Cifar-10 (gray), a blind testing accuracy of 70.19% is achieved, which is, to the best of our knowledge, the highest classification accuracy reported for hybrid OD$^2$NNs. On Cats vs. Dogs (gray), a blind testing accuracy of 85.17% is achieved, which is close to the performance of the fully trained deep neural network.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

2. A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, “Speech Recognition Using Deep Neural Networks: A Systematic Review,” IEEE Access 7, 19143–19165 (2019). [CrossRef]  

3. J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey,” IEEE Signal Process. Mag. 35(1), 84–100 (2018). [CrossRef]  

4. L. Deng and Y. Liu, Deep Learning in Natural Language Processing (Springer, 2018).

5. D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, “Carbon Emissions and Large Neural Network Training,” arXiv:2104.10350 [cs] (2021).

6. D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “PipeDream: generalized pipeline parallelism for DNN training,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19 (Association for Computing Machinery, 2019), pp. 1–15.

7. B. Javidi, J. Li, and Q. Tang, “Optical implementation of neural networks for face recognition by the use of nonlinear joint transform correlators,” Appl. Opt. 34(20), 3950–3962 (1995). [CrossRef]  

8. N. H. Farhat, D. Psaltis, A. Prata, and E. Paek, “Optical implementation of the Hopfield model,” Appl. Opt. 24(10), 1469–1475 (1985). [CrossRef]  

9. D. Psaltis and N. Farhat, “Optical information processing based on an associative-memory model of neural nets with thresholding and feedback,” Opt. Lett. 10(2), 98–100 (1985). [CrossRef]  

10. D. Rosenbluth, K. Kravtsov, M. P. Fok, and P. R. Prucnal, “A high performance photonic pulse processing device,” Opt. Express 17(25), 22767–22772 (2009). [CrossRef]  

11. A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and Weight: An Integrated Network For Scalable Photonic Spike Processing,” J. Lightwave Technol. 32(21), 4029–4041 (2014). [CrossRef]  

12. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

13. D. Psaltis, D. Brady, X.-G. Gu, and S. Lin, “Holography in artificial neural networks,” in Landmark Papers on Photorefractive Nonlinear Optics (WORLD SCIENTIFIC, 1995), pp. 541–546.

14. A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep. 7(1), 7430 (2017). [CrossRef]  

15. X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

16. Y. Chen and J. Zhu, “An optical diffractive deep neural network with multiple frequency-channels,” arXiv:1912.10730 [physics, stat] (2019).

17. D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, “Analysis of Diffractive Optical Neural Networks and Their Integration With Electronic Neural Networks,” IEEE J. Sel. Top. Quantum Electron. 26(1), 1–14 (2020). [CrossRef]  

18. J. Li, D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, “Class-specific differential detection in diffractive optical neural networks improves inference accuracy,” AP 1, 154–183 (2019). [CrossRef]  

19. Y. Luo, D. Mengu, N. T. Yardimci, Y. Rivenson, M. Veli, M. Jarrahi, and A. Ozcan, “Design of task-specific optical systems using broadband diffractive neural networks,” Light: Sci. Appl. 8(1), 112 (2019). [CrossRef]  

20. M. S. S. Rahman, J. Li, D. Mengu, Y. Rivenson, and A. Ozcan, “Ensemble learning of diffractive optical networks,” arXiv:2009.06869 [physics] (2020).

21. Y. Chen, “Express Wavenet – a low parameter optical neural network with random shift wavelet pattern,” arXiv:2001.01458 [cs, eess, stat] (2020).

22. T. Yan, J. Wu, T. Zhou, H. Xie, F. Xu, J. Fan, L. Fang, X. Lin, and Q. Dai, “Fourier-space Diffractive Deep Neural Network,” Phys. Rev. Lett. 123(2), 023901 (2019). [CrossRef]  

23. T. Zhou, L. Fang, T. Yan, J. Wu, Y. Li, J. Fan, H. Wu, X. Lin, and Q. Dai, “In situ optical backpropagation training of diffractive optical neural networks,” Photonics Res. 8(6), 940–953 (2020). [CrossRef]  

24. D. Mengu, Y. Zhao, N. T. Yardimci, Y. Rivenson, M. Jarrahi, and A. Ozcan, “Misalignment Resilient Diffractive Optical Networks,” Nanophotonics 9(13), 4207–4219 (2020). [CrossRef]  

25. J. Su, Y. Yuan, C. Liu, and J. Li, “Multitask Learning by Multiwave Optical Diffractive Network,” https://www.hindawi.com/journals/mpe/2020/9748380/.

26. H. Dou, Y. Deng, T. Yan, H. Wu, X. Lin, and Q. Dai, “Residual D2NN: training diffractive deep neural networks via learnable light shortcuts,” Opt. Lett. 45(10), 2688–2691 (2020). [CrossRef]  

27. D. Mengu, Y. Rivenson, and A. Ozcan, “Scale-, shift- and rotation-invariant diffractive optical networks,” arXiv:2010.12747 [physics] (2020).

28. S. Colburn, Y. Chu, E. Shilzerman, and A. Majumdar, “Optical frontend for a convolutional neural network,” Appl. Opt. 58(12), 3179–3186 (2019). [CrossRef]  

29. A. Mishra and D. Marr, “Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy,” arXiv:1711.05852 [cs] (2017).

30. G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531 [cs, stat] (2015).

31. M.-H. Guo, Z.-N. Liu, T.-J. Mu, D. Liang, R. R. Martin, and S.-M. Hu, “Can Attention Enable MLPs To Catch Up With CNNs?” arXiv:2105.15078 [cs] (2021).

32. B. Neyshabur, “Towards Learning Convolutions from Scratch,” arXiv:2007.13657 [cs, stat] (2020).

33. E. Huggins, “Introduction to Fourier Optics,” Phys. Teach. 45(6), 364–368 (2007). [CrossRef]  

34. J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for Simplicity: The All Convolutional Net,” arXiv:1412.6806 [cs] (2015).

35. E. Racca, “Neural networks in optical domain,” Politecnico di Torino (2019).

36. M. Phuong and C. Lampert, “Towards Understanding Knowledge Distillation,” in Proceedings of the 36th International Conference on Machine Learning (PMLR, 2019), pp. 5142–5151.

37. R. Tibshirani, “Regression Shrinkage and Selection Via the Lasso,” J. R. Stat. Soc., Series B Stat. Methodol. 58, 267–288 (1996). [CrossRef]  
