Optronic convolutional neural networks of multi-layers with different functions executed in optics for image classification

Abstract

Although deeper convolutional neural networks (CNNs) generally obtain better performance on classification tasks, they incur higher computation costs. To address this problem, this study proposes the optronic convolutional neural network (OPCNN), in which all computation operations are executed in optics while data transmission and control are executed in electronics. In the OPCNN, we implement convolutional layers with multiple input images by a lenslet 4f system, down-sampling layers by optical strided convolution, nonlinear activation by adjusting the camera’s response curve, and fully connected layers by optical dot product. The OPCNN demonstrates good classification performance in simulations and experiments and, owing to its more complex architecture, outperforms other current optical convolutional neural networks. The scalability of the OPCNN facilitates building deeper networks for more complicated datasets.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

With the rapid advancements in artificial intelligence, neural networks play a crucial role in solving object classification and pattern recognition problems [1–3]. Making neural networks deeper is an effective and necessary approach to obtaining better classification accuracy [4–7]. However, deeper layers entail not only a larger number of parameters but also massive computational requirements for convolutional operations [8–10].

In recent years, optical technologies applied to neural networks have developed rapidly. Their ultrafast processing speed and inherent parallelism provide important support for big-data computing [11–14]. Researchers have successfully demonstrated several optical neural networks (ONNs), and their classification performance on the Modified National Institute of Standards and Technology (MNIST) dataset [15] and the Fashion-MNIST dataset [16] has validated their feasibility. Neural networks can be divided into deep neural networks (DNNs) and convolutional neural networks (CNNs). Two main optical frameworks are available for implementing DNNs. One is to utilize spatial light diffraction between points on transmissive (or reflective) layers to simulate information transmission between neurons. Lin et al. first proposed a diffractive DNN [17–19]; Fourier-space diffractive DNNs and all-optical neural networks were subsequently proposed by Yan and Zuo [20,21]. The other framework is constructed from optical interference units and optical nonlinearity units. Based on this idea, Shen and Hughes et al. built integrated nanophotonic chips, such as coherent nanophotonic circuits and photonic neural networks [22,23]. For more complex classification problems, the performance of conventional DNNs is worse than that of CNNs; it is therefore important to discuss the optical realization of CNNs. For realizing CNNs, an optical $4f$ system is a better approach to implementing convolution than building an optical convolution unit architecture with acousto-optical modulator arrays [24]. It is well known that a lens can perform a two-dimensional Fourier transform at the speed of light [25]. Therefore, an optical $4f$ system can be used instead of conventional processors to perform complicated convolutional operations, thereby reducing the computational cost. Chang et al. built a hybrid optical-electronic neural network with good classification performance by training a single optical convolutional layer to replace multiple convolutional layers of an electronic CNN [26]. Colburn et al. further proposed an optical frontend for a CNN to implement AlexNet in optics [27]. However, these hybrid optical-electronic neural networks perform only the convolutional operations in optics, whereas the other operations are still executed electronically. More importantly, the output of the optical layer still contains a large number of parameters to be processed digitally. Therefore, optical techniques cannot be used to reduce computation costs to the maximum extent.

Therefore, this study proposes an optronic convolutional neural network (OPCNN) that exhibits good performance in image classification tasks [28,29]. This network executes all computation operations in optics, and data transmission and control in electronics. The OPCNN consists of four modules with different functions. We first utilize a lenslet-array $4f$ system to implement the convolutional layers. Then, we use an optical strided convolution operation, which replaces conventional pooling methods, to realize the down-sampling layers [30]. Furthermore, nonlinear activation layers are implemented by utilizing the camera’s response curve; this step is executed in electronics but incurs no additional computation cost. Finally, fully connected layers are implemented in optics, based on the properties of Fourier optics and the physical meaning of the Airy disk.

To demonstrate the performance of the OPCNN framework, we numerically tested it on the MNIST and Fashion-MNIST datasets. The results show that the OPCNN obtains higher classification accuracy and better performance than previous optical CNNs. Because the MNIST and Fashion-MNIST datasets are not sufficiently complex, we only tested the performance of a small-scale optical convolutional neural network for image classification. However, modern electronic DNNs contain tens of layers and more than a hundred channels in their convolutional layers. Therefore, we also discuss the potential and limitations of the OPCNN with increasing numbers of layers, because our architecture is scalable. The principle and performance of the OPCNN are illustrated in the next section.

2. Architecture and implementation of OPCNN

The OPCNN architecture (Fig. 1) consists of four independent modules. Each module corresponds to one convolutional layer, one down-sampling layer, one nonlinear activation layer, or one fully connected layer in electronic CNNs. We used spatial light modulators (SLMs) and Fourier lenslet arrays to implement the optical convolutional layer. We then achieved spatial dimensionality reduction through a strided convolution operation in the optical down-sampling layer, which consists of SLM3 and a demagnification lens system. A conventional optical $4f$ system can only implement convolutional operations with a single stride. We therefore exploit the fixed-interval sampling property of the demagnification system: placing this system behind SLM3 samples the output of the conventional optical $4f$ system, thereby achieving a strided convolution operation. The nonlinear activation function, such as the rectified linear unit (ReLU), is a crucial component of electronic CNNs. Because the intensity of light must be nonnegative, we utilized the sCMOS camera’s response curve to customize a suitable shifted-ReLU function in the nonlinear activation layers. Finally, we analyzed the physical meaning of the Airy disk in the optical frequency spectrum and found the relationship between the light intensity of the Airy disk and the output of a digital fully connected layer. The number of Airy disks read out equals the number of classes in the dataset, and the predicted class corresponds to the Airy disk with the maximum light intensity, thereby implementing the optical fully connected layer. The detailed implementation of each layer is discussed next.

Fig. 1. Theoretical system of the proposed OPCNN architecture, primarily comprising four layers. A collimated light beam illuminates the surface of SLM1 to load the information of the input images. The reflected light undergoes Fourier transforms in an optical $4f$ system made up of SLM1, SLM2, and a lenslet array between them. The convolutional result is then down-sampled by SLM3, another lenslet array, and a demagnification lens system containing lenses L1 and L2. The result after nonlinear activation by the sCMOS camera C1’s response curve can be reloaded on SLM1 to implement the next convolution, or on SLM5 to propagate to the fully connected layer. By comparing the different Airy disks recorded by C2, which is positioned at the back focal length of lens L3, the predicted class of the input image is the one corresponding to the highest Airy disk intensity.

2.1 Convolutional layer

As shown in Fig. 1, the convolutional layers are implemented by an optical lenslet-array $4f$ system. Phase information plays a substantial role in the Fourier representation of a signal; thus, we set the phase information of the frequency spectra as the trainable kernels [31,32]. In each channel, we loaded the input image and the corresponding kernel on two SLMs and used each lenslet of the array to perform a two-dimensional Fourier transform.

In neural networks, the input images and the corresponding convolutional kernels are three-dimensional in most cases. Each kernel extracts different feature information from the input images through convolution, and the final result is the summation of these features. Previous research used a single-lens $4f$ system to implement a convolutional layer with a single input image. When dealing with multiple input images, the frequency spectra produced by a single lens mix together and the imaging quality degrades. We therefore use a lenslet array to separate the channels, so that each channel performs the convolution between its input image and the corresponding kernel without aliasing; moreover, every input image remains under the paraxial condition of its own channel even when the number of images is large. We tiled the input images ${{I}_{in}}({{x}_{k}},{{y}_{k}})$ and kernels $Kernel (\tfrac {{{x}_{k}}}{\lambda f},\tfrac {{{y}_{k}}}{\lambda f})$ on the SLMs. Each convolution operation is easily implemented through an optical $4f$ system. To obtain the summation of the convolutional outputs directly in optics, additional phase shifts $\Delta {{x}_{k}}$ and $\Delta {{y}_{k}}$ are modulated onto each kernel in advance. Based on the shift theorem of Fourier optics, each output image shifts by the corresponding pixels $\Delta {{x}_{k}}$ and $\Delta {{y}_{k}}$ in the spatial domain after the convolution between the input image and the modulated kernel $Kerne{{l}_{in}}(\tfrac {{{x}_{k}}}{\lambda f},\tfrac {{{y}_{k}}}{\lambda f})$ (see Fig. 2(a)). All output images are superimposed when they shift to the same position, and the summation operation is performed as follows.

$$\begin{aligned}Kerne{{l}_{in}}(\tfrac{{{x}_{k}}}{\lambda f},\tfrac{{{y}_{k}}}{\lambda f}) & =Kernel(\tfrac{{{x}_{k}}}{\lambda f},\tfrac{{{y}_{k}}}{\lambda f})\cdot {{\textrm{e}}^{j\varphi (\Delta {{x}_{k}},\Delta {{y}_{k}})}} \nonumber \\ {{I}_{out}}({{x}_{k}},{{y}_{k}})& =\begin{matrix} {{\mathcal{F}}^{{-}1}}\{\mathcal{F}[{{I}_{in}}({{x}_{k}},{{y}_{k}})]\cdot Kernel(\tfrac{{{x}_{k}}}{\lambda f},\tfrac{{{y}_{k}}}{\lambda f})\}& k=1,2,\ldots,N \\ \end{matrix} \nonumber \\ {{I}_{out}}^{\prime }({{x}_{k}},{{y}_{k}})& ={{\mathcal{F}}^{{-}1}}\{\mathcal{F}[{{I}_{in}}({{x}_{k}},{{y}_{k}})]\cdot Kerne{{l}_{in}}(\tfrac{{{x}_{k}}}{\lambda f},\tfrac{{{y}_{k}}}{\lambda f})\} \nonumber \\ & ={{\mathcal{F}}^{{-}1}}\{\mathcal{F}[{{I}_{in}}({{x}_{k}},{{y}_{k}})]\cdot Kernel(\tfrac{{{x}_{k}}}{\lambda f},\tfrac{{{y}_{k}}}{\lambda f})\cdot {{\textrm{e}}^{j\varphi (\Delta {{x}_{k}},\Delta {{y}_{k}})}}\} \nonumber \\ & ={{\mathcal{F}}^{{-}1}}\{\mathcal{F}[{{I}_{out}}({{x}_{k}},{{y}_{k}})]\cdot {{\textrm{e}}^{j\varphi (\Delta {{x}_{k}},\Delta {{y}_{k}})}}\} \nonumber \\ & =\begin{matrix} {{I}_{out}}({{x}_{k}}-\Delta {{x}_{k}},{{y}_{k}}-\Delta {{y}_{k}}) & k=1,2,\ldots,N \\ \end{matrix} \nonumber \\ {{I}_{out}}^{\prime }(x,y)& =\sum_{k=1}^{N}{{{I}_{out}}^{\prime }({{x}_{k}},{{y}_{k}})}\textrm{ ,} \end{aligned}$$
where $(\cdot )$ and $(\otimes )$ signify a two-dimensional element-wise product and convolution, respectively; $\lambda$ denotes the wavelength of light; $f$ is the focal length; ${{x}_{k}}$ and ${{y}_{k}}$ are the coordinates of the ${{k}^{th}}$ input image with respect to the origin of the optical axis in the spatial domain; and $\mathcal {F}$ denotes a two-dimensional Fourier transform. ${{I}_{out}}({{x}_{k}},{{y}_{k}})$ is the original convolutional output of the ${{k}^{th}}$ channel, and ${{I}_{out}}^{\prime }({{x}_{k}},{{y}_{k}})$ is the modulated convolutional output with the position shift. ${{I}_{out}}^{\prime }(x,y)$ is the summation of the modulated convolutional outputs over all channels. After wrapping the modulated kernel phases to the range 0 to $2\pi$, the kernels can be loaded directly on the phase-only SLM2, as shown in Fig. 1.
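For intuition, the channel-wise convolution and summation of Eq. (1) can be reproduced numerically with a few lines of Python. The following is a minimal sketch under idealized assumptions (square, equally sized channels; perfect, aberration-free Fourier transforms; unit sampling); the function and variable names are illustrative and do not correspond to the actual training code.

```python
import numpy as np

def optical_conv_layer(inputs, kernel_phases, shifts):
    """Idealized model of the lenslet-array 4f convolutional layer (Eq. 1).

    inputs        : (N, H, W) real array, one tile per input channel (SLM1)
    kernel_phases : (N, H, W) trainable phase masks loaded on SLM2
    shifts        : list of N (dx, dy) pixel shifts that steer every channel's
                    output to a common position so the outputs superimpose
    """
    N, H, W = inputs.shape
    fy = np.fft.fftfreq(H)[:, None]   # spatial frequencies along y
    fx = np.fft.fftfreq(W)[None, :]   # spatial frequencies along x
    total = np.zeros((H, W), dtype=complex)
    for k in range(N):
        spectrum = np.fft.fft2(inputs[k])              # first lenslet: Fourier transform
        kernel = np.exp(1j * kernel_phases[k])         # phase-only kernel on SLM2
        dx, dy = shifts[k]
        shift_phase = np.exp(-2j * np.pi * (fx * dx + fy * dy))  # shift theorem
        total += np.fft.ifft2(spectrum * kernel * shift_phase)   # second lenslet
    return np.abs(total) ** 2                          # the camera records intensity
```

In this picture, training the layer amounts to optimizing `kernel_phases`, just as the phase masks on SLM2 are optimized in the OPCNN.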

Fig. 2. Experimental implementation of the proposed OPCNN architecture. (a) Example of a convolutional layer. By adding phase shifts to the kernels, each convolutional result shifts by the corresponding pixels to the same position, and the summation operation is achieved in this way. (b) Architecture of the demagnification system. (c) Realization of different nonlinear activation functions by changing the response curve of the sCMOS camera. In addition to the (2) square function and (3) sinusoidal function, the proposed (4) shifted-ReLU function can also implement nonlinear activation, compared with the (1) linear function output. The software interface of the sCMOS camera is shown in (5).

2.2 Down-sampling layer

All electronic CNN structures contain down-sampling layers, which reduce the spatial dimension by replacing the output of the net at a certain location with summary statistics of the nearby outputs. This dimensional reduction improves computational efficiency and reduces the memory required to store the parameters. On the other hand, such operations (e.g., max-pooling) may discard useful information from the input images and are difficult to perform in an optical system. In the OPCNN, we implemented an optical strided convolution operation to achieve spatial dimension reduction without discarding information from the input images. Compared with max-pooling or average-pooling, strided convolution increases the parameter count and the computational burden in digital processing. In optical processing, however, these problems do not arise in theory, because all convolution operations are executed in parallel at the speed of light with low power consumption.

The lenslet-array $4f$ system can only perform a convolution operation with a single stride. Therefore, the optical strided convolution must be performed in two steps: (1) perform a standard single-stride convolution, and (2) sample the result at fixed intervals through a demagnification lens system, as shown in Fig. 2(b). We thus use the lenslet array and SLM3 to implement the single-stride convolution first; the convolution output is then sampled as it propagates through the demagnification lens system. The schematic diagram is shown in Fig. 3.

This demagnification lens system samples images located on the input plane at fixed intervals, and the interval depends on the focal lengths of the two lenses. Assume an image $t(p,q)$ on the input plane; Eq. (2) derives the output $s(u,v)$ of the demagnification system:

$$\begin{aligned}\textrm{ }s(u,v)\textrm{=}& \frac{A}{j\lambda {{f}_{2}}}\iint\limits_{\infty}{\left[ \frac{A}{j\lambda {{f}_{1}}}\iint\limits_{\infty}{t(p,q){{e}^{{-}j\frac{2\pi }{\lambda {{f}_{1}}}(px+qy)}}dpdq} \right]{{e}^{{-}j\frac{2\pi }{\lambda {{f}_{2}}}(xu+yv)}}dxdy}, \nonumber \\ \textrm{=}& \frac{{{A}^{2}}}{-{{\lambda }^{2}}{{f}_{1}}{{f}_{2}}}\iint\limits_{\infty }{t(p,q)\left[ \iint\limits_{\infty }{{{e}^{j\frac{2\pi }{\lambda }(-\frac{u}{{{f}_{2}}}-\frac{p}{{{f}_{1}}})x}}}{{e}^{j\frac{2\pi }{\lambda }(-\frac{v}{{{f}_{2}}}-\frac{q}{{{f}_{1}}})y}}dxdy \right]}dpdq \nonumber \\ \textrm{=}& \frac{{{A}^{2}}}{-{{\lambda }^{2}}{{f}_{1}}{{f}_{2}}}\iint\limits_{\infty }{t(p,q)\left[ \int\limits_{\infty }{{{e}^{j\frac{2\pi }{\lambda }(-\frac{u}{{{f}_{2}}}-\frac{p}{{{f}_{1}}})x}}}dx\int\limits_{\infty }{{{e}^{j\frac{2\pi }{\lambda }(-\frac{v}{{{f}_{2}}}-\frac{q}{{{f}_{1}}})y}}}dy \right]}dpdq \nonumber \\ \textrm{=}& \frac{{{A}^{2}}}{-{{\lambda }^{2}}{{f}_{1}}{{f}_{2}}}\iint\limits_{\infty}{t(p,q)\left[ \delta (\frac{u}{\lambda {{f}_{2}}}+\frac{p}{\lambda {{f}_{1}}})\delta (\frac{v}{\lambda {{f}_{2}}}+\frac{q}{\lambda {{f}_{1}}}) \right]dpdq} \nonumber \\ \textrm{=}& \frac{{{A}^{2}}}{-{{\lambda }^{2}}{{f}_{1}}{{f}_{2}}}t(-\frac{{{f}_{1}}}{{{f}_{2}}}u,-\frac{{{f}_{1}}}{{{f}_{2}}}v)\textrm{ ,} \end{aligned}$$
where ${{f}_{1}}$ and ${{f}_{2}}$ denote the focal lengths of the two lenses. Ignoring the leading coefficient, the output plane displays a flipped input image sampled at a sampling interval $\frac {{{f}_{2}}}{{{f}_{1}}}$. Because the stride of the convolutional kernels is fixed in advance, we can choose lenses with appropriate focal lengths.
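As a consistency check of this two-step construction, the optically realized strided convolution can be emulated digitally as an ordinary convolution followed by fixed-interval sampling. The sketch below assumes a stride of 2 for illustration and omits the image flip introduced by the demagnification system; the names and defaults are ours, not the paper's code.

```python
import numpy as np
from scipy.signal import fftconvolve

def optical_strided_conv(image, kernel, stride=2):
    """Down-sampling layer as realized optically:
    (1) single-stride convolution by the lenslet 4f system (SLM3 + lenslet array),
    (2) sampling at fixed intervals by the two-lens demagnification system,
        whose interval is set by the ratio of the focal lengths f1 and f2."""
    full = fftconvolve(image, kernel, mode="same")   # step (1): ordinary convolution
    return full[::stride, ::stride]                  # step (2): fixed-interval sampling
```

With the experimental lenses (${{f}_{1}}=250$ mm, ${{f}_{2}}=125$ mm), the two focal lengths differ by a factor of 2, which corresponds to the stride of 2 used in this sketch.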

Fig. 3. Schematic diagram of the down-sampling layer.

2.3 Nonlinear activation function and fully connected layers

The nonlinear activation function is a crucial component of electronic CNNs. Here, we propose to utilize the sCMOS camera’s response curve to build the nonlinear layer (Fig. 2(c)). An sCMOS sensor is a semiconductor device made up of many individual photosensitive elements. Light is converted into a digital signal when it reaches the sCMOS sensor, and the signal is then transmitted to an image processor to form the image. The response curve controls the relationship between the intensity of incident light and the readout image; it is linear by default and is adjustable. From the software interface, we can choose different function curves, such as the square or sinusoidal functions shown in Fig. 2(c), to change this relationship. ReLU is an excellent default choice for a nonlinear activation function. Considering the non-negativity of light intensity, we adjusted the turning point so that the resulting shifted-ReLU function outputs zero over half of its domain. The expression of the shifted ReLU is given by Eq. (3). Implementing the nonlinear function via the response curve is an electronic operation; however, it is applied directly as the camera reads out the output images, so it adds no computation cost to the network.

$$\begin{aligned}Intensit{{y}_{image}}=\left\{ \begin{matrix} 0 & \textrm{ }0\le Intensit{{y}_{light}}\le 127 \\ 2\cdot Intensit{{y}_{light}}-255 & 128\le Intensit{{y}_{light}}\le 255 \\ \end{matrix} \right.\textrm{ ,} \end{aligned}$$
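For reference, Eq. (3) applied to an 8-bit camera readout can be written directly in Python (the function name is ours):

```python
import numpy as np

def shifted_relu(intensity_light):
    """Shifted ReLU realized by the sCMOS camera's response curve (Eq. 3).
    Inputs and outputs are 8-bit intensities in [0, 255]; the turning point at
    127 makes the output zero over half of the input range."""
    intensity_light = np.asarray(intensity_light, dtype=np.float64)
    return np.where(intensity_light <= 127, 0.0, 2.0 * intensity_light - 255.0)
```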

Images after nonlinear processing can be loaded on SLM1 to implement the next convolutional layer, or on SLM5 to implement the optical fully connected layer. In electronic CNNs, the fully connected layer performs a dot product of the input images and kernels. In optics, we implement this process in two steps. First, we load the input images and kernels on SLM5 and SLM4, respectively; the element-wise product between them is performed by the reflection of light between the two SLMs. Then, we sum all matrix elements via an optical Fourier transform as the reflected light propagates through lens L3. A comparison of the FC layer in a digital CNN and in the OPCNN is shown in Fig. 4.

Fig. 4. Implementation of the fully connected (FC) layer in a digital CNN and in the OPCNN.

In Fourier optics, the light intensity of the Airy disk, which can be detected at the zero-order frequency spectrum, is proportional to the summation of the direct component of the light in the spatial domain. In other words, these direct components correspond to the individual elements of the matrix:

$$\begin{aligned}& F(x,y)=\iint\limits_{m,n}{t(m,n){{e}^{{-}j\frac{2\pi }{\lambda {{f}}}(mx+ny)}}dmdn} \nonumber \\ & Intensit{{y}_{Airy}}\propto F(0,0)=\iint\limits_{m,n}{t(m,n)dmdn}\textrm{ ,} \end{aligned}$$

Based on Eq. (4), the light intensity of the Airy disk detected by the sCMOS camera, which is placed at the back focal length of the lens, can be defined as the final output of the fully connected layer. The number of Airy disks read out equals the number of categories in the dataset. In digital CNNs, the output of the fully connected layer is mapped to the range 0 to 1 by a softmax, and the predicted category corresponds to the maximum softmax output. Because the softmax increases monotonically with each fully connected output, the location of the maximum is preserved. Therefore, we can determine the category directly from the output of the optical fully connected layer: the predicted category corresponds to the maximum Airy disk intensity.
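Numerically, this optical fully connected layer reduces to an element-wise product followed by reading the zero-frequency value of the Fourier transform, which by Eq. (4) is proportional to the sum of all elements. A minimal sketch under that assumption (array names and the per-class kernel layout are illustrative) is:

```python
import numpy as np

def optical_fc_layer(feature_map, fc_kernels):
    """Optical fully connected layer.
    feature_map : (H, W) output of the last unit, loaded on SLM5
    fc_kernels  : (C, H, W) one kernel per class, loaded on SLM4
    Each class score is the Airy-disk (zero-frequency) readout of the
    element-wise product, i.e. a dot product between map and kernel."""
    scores = []
    for kernel in fc_kernels:
        product = feature_map * kernel                      # element-wise product (SLM4 x SLM5)
        scores.append(np.abs(np.fft.fft2(product)[0, 0]))   # zero-order spectrum ~ sum of elements
    scores = np.array(scores)
    return scores, int(np.argmax(scores))                   # predicted class = brightest Airy disk
```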

However, this method is inaccurate when applied in a practical optical system owing to the interference of stray light. Compared with the size of the SLM, the matrices of the fully connected layer occupy only a small area. When the light reflected from the SLM propagates directly to the Fourier lens, a large amount of irrelevant light mixes into the Airy disk, causing inaccuracy and saturation of the sCMOS camera readout. To address this problem, we placed a digital micromirror device (DMD) in the optical system to restrict the reflective area in front of the lens. A DMD is a type of optical switch whose micromirrors can be individually tilted to pass or block light. Before the Fourier transform by the lens, we set the DMD pattern to match the area occupied by the fully connected layer matrices, so that only this portion of the light is reflected. In this way, the effect of stray light is greatly reduced.
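In this arrangement the DMD simply acts as a binary aperture applied before the Fourier lens; numerically it is a mask that passes the fully connected matrices and blocks everything else (the region coordinates below are illustrative):

```python
import numpy as np

def apply_dmd_mask(field, rows, cols):
    """Binary DMD aperture: only the region occupied by the FC-layer matrices
    (rows, cols are slice objects) is reflected toward lens L3; the rest of
    the field, i.e. the stray light, is blocked."""
    mask = np.zeros_like(field)
    mask[rows, cols] = 1.0
    return field * mask
```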

3. Experimental results

To evaluate the performance of the OPCNN architecture, we performed image classification experiments on two datasets, MNIST and Fashion-MNIST. We define one convolutional layer, one down-sampling layer, and one nonlinear activation layer as a unit, because this unit is reusable in our OPCNN and these three layers are the fundamental components of digital CNNs. For OPCNNs with several units, the output read out by C1 from the last unit is used as the input to the next unit and reloaded on SLM1. For example, the OPCNN-4 architecture in the experiments reuses this unit four times. To prove the superiority of the method, we compared the classification accuracy with other optical convolutional neural networks: the hybrid optical-electronic convolutional neural network (HOCNN) proposed by Chang et al., which consists of a single-channel convolutional layer, and the optical frontend-based network (OPCNN-L1) proposed by Colburn et al., which implements only the first convolutional layer in optics. The original OPCNN-L1 contains five convolutional layers; we reduced it to four for a convenient comparison. Furthermore, the performance of a digital convolutional neural network (DCNN), which has an architecture similar to OPCNN-4 but without optical implementation, is also included in the comparison. All five architectures are shown in Table 1.

Table 1. Five network models for classification on MNIST and Fashion-MNIST datasets

In training our networks, we accounted for the wavelength, the focal lengths of the lenses, and the pixel pitch of the SLMs in our training code. The fast Fourier transform algorithm and angular spectrum propagation were used to simulate the optical part under ideal conditions. We designed 14 channels in each convolutional layer, because the physical size of the SLMs is $15.36\,\textrm{mm}\times 8.64\,\textrm{mm}$ and the diameter of each lenslet is $3\,\textrm{mm}$. The size of each channel is $384\times 360$ pixels in the training code, whereas the size of the input images is $256\times 256$ pixels; therefore, before being loaded on the SLMs, the input images were resized to $384\times 360$ pixels by zero-padding. Note that the OPCNN-1 and OPCNN-4 architectures contained one softmax layer in our code for the convenience of training. When testing images in the optical system, this softmax layer was not implemented, and we used the output of the fully connected layer for prediction. All networks were trained using Python 3.5.0 and the TensorFlow framework on a desktop computer (GPU: NVIDIA TitanX) for 5 h. Gradient backpropagation and weight updating were implemented with the Adam optimizer for 20 epochs at a learning rate of 0.001.
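To illustrate how such an optical layer can be made differentiable for training, a compact sketch of a trainable phase-mask layer in TensorFlow is given below. It is a simplification of the setup described above: it uses an ideal Fourier-transform model, omits the angular spectrum propagation, physical pixel pitch, and channel tiling, and the class name and initializer are ours.

```python
import tensorflow as tf

class OpticalConv2D(tf.keras.layers.Layer):
    """Simplified differentiable model of one optical 4f convolutional channel.
    The trainable weight is the phase mask loaded on the phase-only SLM."""

    def build(self, input_shape):
        h, w = int(input_shape[-2]), int(input_shape[-1])
        self.phase = self.add_weight(
            name="kernel_phase", shape=(h, w),
            initializer="random_uniform", trainable=True)

    def call(self, x):
        field = tf.cast(x, tf.complex64)                       # input amplitude on SLM1
        kernel = tf.exp(tf.complex(tf.zeros_like(self.phase), self.phase))  # phase-only kernel
        out = tf.signal.ifft2d(tf.signal.fft2d(field) * kernel)             # 4f propagation
        return tf.abs(out)                                     # camera detects magnitude
```

A model built from such layers can then be trained end to end with the Adam optimizer, as in the training procedure described above.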

The images and kernels were loaded on the amplitude SLMs (Hes6001, Holoeye) and phase-only SLMs (Pluto-2, Holoeye), illuminated by a coherent laser light of wavelength 532 nm. The reflected light was relayed through a $4f$ system comprising a lenslet array ($f=38mm$, $\phi =3mm$, Edmund Optics) and a demagnification lens system comprising two convex lenses (${{f}_{1}}=250mm$, ${{f}_{2}}=125mm$, $\phi =25.4mm$, Thorlabs), to implement the convolutional and down-sampling operations. Finally, the propagating light was selectively reflected by a DMD (Texas Instruments DLP LightCrafter) and focused by another convex lens (${{f}_{3}}=250mm$) in the fully connected layer. All images were captured by an ORCA-Flash 4.0 V3 sCMOS camera, C1 (Hamamatsu, C13440-20CU), and a charge-coupled device camera, C2 (Basler, avA1900-60km).

The optical outputs of OPCNN-1 captured by the sCMOS camera (left column) and the simulation results used as ground truth (right column) for MNIST and Fashion-MNIST classification are shown in Fig. 5. The outputs of the FC layer are shown in the last row.

Fig. 5. Examples of OPCNN-1 outputs on MNIST dataset and Fashion-MNIST dataset classification. For each convolutional and down-sampling layer of OPCNN-1, we chose two sample kernels out of 14 to compare imaging quality with ground truth (GT), and the results are depicted in the left and right columns, respectively. The line charts in the last row show the outputs of the FC layer.

To evaluate the imaging quality of the optical outputs relative to the digital outputs, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were used in this study [33,34]. These two indexes are widely used as objective standards for evaluating image quality. The PSNR between two images of the same size is defined as

$$\begin{aligned}PSNR& =10\cdot {{\log }_{10}}(\frac{MA{{X}_{I}}^{2}}{MSE})\textrm{ }, \nonumber \\ MSE& =\frac{1}{{{p}_{x}}{{p}_{y}}}\sum_{i=0}^{{{p}_{x}}-1}{\sum_{j=0}^{{{p}_{y}}-1}{{{[x(i,j)-y(i,j)]}^{2}}}\textrm{ ,}} \end{aligned}$$
where $MA{{X}_{I}}$ denotes the maximum possible pixel value of the images, and ${{p}_{x}}$ and ${{p}_{y}}$ represent the number of pixels in the images in the $x\textrm {-axis}$ and $y\textrm {-axis}$, respectively.

The SSIM between two images $x$ and $y$ is defined as

$$ SSIM(x,y)=\frac{(2{{\mu }_{x}}{{\mu }_{y}}+{{c}_{1}})(2{{\sigma }_{xy}}+{{c}_{2}})}{(\mu _{x}^{2}+\mu _{y}^{2}+{{c}_{1}})(\sigma _{x}^{2}+\sigma _{y}^{2}+{{c}_{2}})}\textrm{ ,} $$
where ${{\mu }_{x}}$ and ${{\mu }_{y}}$ are the mean values of image $x$ and image $y$, respectively; ${{\sigma }_{x}}$ and ${{\sigma }_{y}}$ are the variances of image $x$ and image $y$, respectively; ${{\sigma }_{xy}}$ is the covariance of image $x$ and image $y$; and ${{c}_{1}}$ and ${{c}_{2}}$ are small constants used to avoid a null denominator. The SSIM is a non-negative value in the range of 0 to 1. If the optical output image is identical to the ground truth, the SSIM value reaches 1.
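For completeness, Eqs. (5) and (6) can be evaluated directly in Python; the sketch below uses a single global window for the SSIM and the common stabilizing constants ${{c}_{1}}=(0.01L)^2$ and ${{c}_{2}}=(0.03L)^2$ with $L$ the dynamic range, which the paper does not specify.

```python
import numpy as np

def psnr(x, y, max_i=255.0):
    """Peak signal-to-noise ratio between two same-sized images, Eq. (5)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_i ** 2 / mse)

def ssim_global(x, y, max_i=255.0):
    """Single-window SSIM, Eq. (6); c1 and c2 avoid a null denominator."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * max_i) ** 2, (0.03 * max_i) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```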

Table 2 shows that the optical outputs in Fig. 5 achieve high PSNR and SSIM values. Figure 6(a) presents the experimental and simulated outputs of all channels in the convolutional layer on the two datasets. Averaging the SSIM and PSNR over all outputs yields index values of 0.6510, 0.6851, 19.64 dB, and 18.12 dB, respectively. The same calculation applied to the down-sampling layer yields 0.6246, 0.6635, 18.25 dB, and 17.83 dB, respectively. These index values indicate that the proposed OPCNN architecture enables optical convolution and down-sampling operations with good imaging quality.

Fig. 6. (a) Comparison of experimental output and simulated output of MNIST dataset (left) and Fashion-MNIST dataset (right). Each row represents a class, and each column represents a channel. (b) Confusion matrices for experimental results on MNIST dataset (left) and Fashion-MNIST dataset (right). We used 500 samples from each dataset and 50 samples for each class.

Table 2. Performance of optical output on OPCNN-1, PSNR and SSIM

We tested 500 images (50 per class) from each dataset in the experiment to verify the performance of the OPCNN-1 architecture. The confusion matrices of the statistical results are shown in Fig. 6(b). OPCNN-1 achieved a classification accuracy of 87.6$\%$ on the MNIST dataset and 82.4$\%$ on Fashion-MNIST. These statistical results show that the proposed optical system is capable of handling image classification problems with acceptable accuracy. The classification accuracy obtained in experiments was lower than that in simulation, mainly because of the inaccurate calibration of our system. In addition, with the extra phase shifts modulated on SLM2, the reflected light propagates at oblique angles, and crosstalk between lenslets occurs during the inverse Fourier transform. Although feature extraction is not affected by these issues, the resulting aberrations in the output reduce imaging quality and also contribute to the accuracy gap between experiments and simulations. Therefore, our future work will focus on improving the experimental classification performance to the maximum extent possible and bringing it closer to the performance obtained in simulations.

Table 3 shows the classification accuracy of the five network models listed in Table 1. With the trained OPCNN-4 network, we achieve accuracies of 98.23$\%$ and 88.52$\%$ on the two datasets, close to the accuracy of the DCNN without optical implementation. Considering the reduced computation cost of the optical implementation, the small accuracy difference between these two networks is negligible. Compared with the HOCNN, our OPCNNs achieve better performance on account of their deeper architecture and layers with more complex functions. The comparison between the DCNN and OPCNN-4 shows that implementing the convolutional layers in optics rather than digitally has little effect on classification accuracy. Therefore, the classification performance difference between OPCNN-L1 and OPCNN-4 comes from the implementation of the down-sampling layer. On more complex datasets such as Fashion-MNIST, the max-pooling used in OPCNN-L1 causes information loss and misses image details, so its performance is poorer than that of OPCNN-4.

Table 3. Classification accuracy on MNIST and Fashion-MNIST datasets

Because the MNIST and Fashion-MNIST datasets are not sufficiently complex, we only implemented a small-scale optical convolutional neural network for image classification. To address more complex datasets, the number of convolutional layers must be increased. Multiple convolutional layers can be implemented, in theory, by increasing the number of times the unit is reused. Nevertheless, this entails increased time consumption for data transmission between units, which is a limitation of our architecture. The main factor in the increased time consumption is the low speed of data transmission from camera C1 to SLM1. This problem could be addressed by replacing the sCMOS camera with a nonlinear material to implement the nonlinear activation layer; the data, after nonlinear activation, could then propagate directly to the next layer in optics. Our future work will focus on realizing this method. In addition to increasing the number of layers, more channels are also required in modern deep networks. As mentioned above, one SLM can contain at most 14 channels in our optical system. The number of channels in each convolutional layer can be increased by splicing SLMs: all convolutional channels are divided among several SLMs, and a beam splitter is then used to splice them together. This method makes system calibration difficult; therefore, sophisticated control instruments, such as a PI Hexapod, must be used to control the movement of optical elements for precise motion and positioning when setting up the system.

4. Conclusion and discussion

The OPCNN architecture proposed in this study achieves not only convolutional layers but also down-sampling layers, nonlinear activation functions, and fully connected layers in optics, with high imaging quality and high classification accuracy on the MNIST and Fashion-MNIST datasets. Convolutional layers are implemented using an optical $4f$ system. The strided convolution operation is used for dimensionality reduction as a down-sampling layer. The nonlinear activation and fully connected layers are achieved by the sCMOS camera’s response curve and the properties of Fourier optics, respectively. In our OPCNNs, all computation operations are processed in optics, and digital processing is used only for data transmission. With the simulation outputs as ground truth, our optical output images captured by the camera achieve high PSNR and SSIM values on both datasets. Simulations and experiments demonstrate the good performance of the OPCNNs, and comparisons with other optical convolutional neural networks show their superiority in solving classification problems. Beyond better classification performance, our architecture implements more layers with different functions in optics. The classification performance of previous hybrid optical-electronic networks is limited without additional digital computing layers, and the advantages of optical computation are not fully exploited when only the convolutional layers are optical. More importantly, our architecture reduces the electronic computation cost to the maximum extent. The architecture is also scalable and reusable: when facing complex datasets, deeper networks can be achieved by reusing the computing unit described above.

However, once the optical system is set up, the order of layers in the OPCNN is fixed: a convolutional layer must be followed by a down-sampling layer and a nonlinear activation layer, which limits the flexibility of the system. Moreover, replacing the max-pooling layer makes our network all-convolutional, which sacrifices shift invariance to a certain extent [35,36]. Small input shifts or translations can cause drastic changes in the classification accuracy [37]. Recent studies have proposed several methods in digital processing to make networks shift-invariant again; however, applying them in optics is difficult. Future studies must focus on building a more stable and robust OPCNN structure that regains high classification accuracy when faced with imperceptible but adversarial perturbations of the input images.

Disclosures

The authors declare no conflicts of interest.

References

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

2. J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks 61, 85–117 (2014). [CrossRef]  

3. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (2016). Available at http://www.deeplearningbook.org.

4. A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (NIPS 2012), 1097–1105 (2012).

5. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2014).

6. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385 (2015).

7. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842 (2015).

8. G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science 313(5786), 504–507 (2006). [CrossRef]  

9. M. Mathieu, M. Henaff, and Y. Lecun, “Fast training of convolutional networks through ffts,” arXiv:1312.5851 (2013).

10. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15, 1929–1958 (2014).

11. N. Farhat, D. Psaltis, A. Prata, and E. Paek, “Optical implementation of the hopfield model,” Appl. Opt. 24(10), 1469–1475 (1985). [CrossRef]  

12. T. Lu, S. Wu, X. Xu, and F. Yu, “Two-dimensional programmable optical neural networks,” Appl. Opt. 28(22), 4908–4913 (1989). [CrossRef]  

13. I. Saxena, “Adaptive multilayer optical neural network with optical thresholding,” Opt. Eng. 34(8), 2435–2440 (1995). [CrossRef]  

14. A. Willner, S. Khaleghi, M. Chitgarha, and O. Yilmaz, “All-optical signal processing,” J. Lightwave Technol. 32(4), 660–680 (2014). [CrossRef]  

15. Y. Lecun, C. Cortes, and C. J. Burges, “The MNIST database of handwritten digits,” Available at http://yann.lecun.com/exdb/mnist/.

16. H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747 (2017).

17. X. Lin, Y. Rivenson, N. Yardimci, M. Veli, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

18. D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, “Analysis of diffractive optical neural networks and their integration with electronic neural networks,” IEEE J. Sel. Top. Quantum Electron. 26(1), 1–14 (2019). [CrossRef]  

19. J. Li, D. Mengu, Y. Luo, Y. Rivenson, and A. Ozcan, “Class-specific differential detection in diffractive optical neural networks improves inference accuracy,” Adv. Photonics 1(06), 1–13 (2019). [CrossRef]  

20. T. Yan, J. Wu, T. Zhou, H. Xie, F. Xu, J. Fan, L. Fang, X. Lin, and Q. Dai, “Fourier-space diffractive deep neural network,” Phys. Rev. Lett. 123(2), 023901 (2019). [CrossRef]  

21. Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y.-C. Chen, P. Chen, G.-B. Jo, J. Liu, and S. Du, “All-optical neural network with nonlinear activation functions,” Optica 6(9), 1132–1137 (2019). [CrossRef]  

22. Y. Shen, N. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljacic, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

23. T. Hughes, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5(7), 864–871 (2018). [CrossRef]  

24. S. Xu, J. Wang, R. Wang, J. Chen, and W. Zou, “High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays,” Opt. Express 27(14), 19778–19787 (2019). [CrossRef]  

25. J. Goodman, Introduction to Fourier Optics, 2nd. ed. (Roberts & Company Publishers, 1996).

26. J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324 (2018). [CrossRef]  

27. S. Colburn, Y. Chu, E. Shilzerman, and A. Majumdar, “Optical frontend for a convolutional neural network,” Appl. Opt. 58(12), 3179–3186 (2019). [CrossRef]  

28. L. Liu, Y. Gao, F. Wang, and X. Liu, “Real-time optronic beamformer on receive in phased array radar,” IEEE Geosci. Remote Sensing Lett. 16(3), 387–391 (2018). [CrossRef]  

29. Y. Gao, C. Lin, R. Guo, K. Wang, and X. Liu, “Optronic high-resolution sar processing with the capability of adaptive phase error compensation,” IEEE Geosci. Remote Sensing Lett. 13, 1–5 (2016). [CrossRef]  

30. J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv:1412.6806 (2015).

31. A. V. Oppenheim and J. S. Lim, “The importance of phase in signals,” Proc. IEEE 69(5), 529–541 (1981). [CrossRef]  

32. M. Miscuglio, Z. Hu, S. Li, J. George, R. Capanna, P. M. Bardet, P. Gupta, and V. J. Sorger, “Massively parallel amplitude-only fourier neural network,” arXiv:2008.05853 (2020).

33. H. Chen, Y. Gao, X. Liu, and Z. Zhou, “Imaging through scattering media using speckle pattern classification based support vector regression,” Opt. Express 26(20), 26663–26678 (2018). [CrossRef]  

34. Z. Wang, E. Simoncelli, and A. Bovik, “Multi-scale structural similarity for image quality assessment,” Proc. IEEE Asilomar Conf. Signals, Syst. Comput. pp. 1398–1402 (2004).

35. D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for object recognition,” Artificial Neural Networks 6354, 92–101 (2010). [CrossRef]  

36. L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry, “Exploring the landscape of spatial robustness,” arXiv:1712.02779 (2017).

37. R. Zhang, “Making convolutional networks shift-invariant again,” arXiv:1904.11486 (2019).
