Quantitative comparison of the computational complexity of optical, digital and hybrid neural network architectures for image classification tasks

Mengxiang Chen; Mengxiang Chen; Steffen Schoenhardt; Min Gu; Elena Goi

doi:10.1364/OE.505341

1. Introduction

By imitating the mechanisms of the human brain to process visual information, machine learning (ML) models, especially artificial neural networks (ANNs) and deep neural networks (DNNs) [1], are making it faster and easier to extract information from images allowing for the processing of visual data beyond human performance in terms of speed and accuracy, thus enabling the automation of ever more complex image recognition tasks. For this reason, these artificial intelligence (AI) techniques are becoming crucial in an increasing number of applications, from computer vision to biometric security. However, despite the improvements that have been made in the field of specialized hardware for execution of ANN models, like GPUs and TPUs, in terms of latency, energy consumption and throughput, several applications – e.g. in astrophysics [2], autonomous vehicles [3], image-activated flow cytometry [4,5] – present severe challenges for current ML hardware. These challenges include the requirements for quasi real-time processing of information with ML models, a limited space or energy budget for digital hardware resulting in constraints in terms of computational resources available.

Such constraints have fueled the efforts to develop new, specialized hardware platforms for implementations of ANNs that can process visual information fast and efficiently. Naturally, optics is the first-class candidate for building analog processors and accelerator schemes for space- and time-varying optical information, since it enables passive processing of optical signals in their native domain, without analog-to-digital conversion. Moreover, photonics hardware offers broad bandwidth, small latency, low power consumption and natural parallelism, which is coupled with recent advances in photonic integration technology, allowing for compact devices with low losses. In addition to that, mathematical operations on large matrices – which are the building blocks of modern visual computing algorithms applied in AI – are particularly easy implemented in the optical domain, for example, linear optical elements can calculate convolutions, Fourier transforms, random projections, and many other operations passively. For these reasons, free space three-dimensional (3D) optical neural networks (ONNs) were among the first architectures to be investigated for image classification tasks [6–10]. And while the processing of optical information directly in the native domain was shown to have the benefit of making the whole realm of optical information, like for example the incoming phase, accessible for analysis with standard CMOS sensors [11], ONNs did not achieve widespread adoption yet. Due to the lack of flexibility (non-reconfigurable elements in 3D-printed ONNs [12,13]), need for bulky optical elements (Spatial Light Modulators in reconfigurable ONNs [14]), difficulties in implementing optical nonlinearities [15] and lower performance compared with their digital counterpart, digital neural networks are still the standard method for processing image information.

To overcome the performance limitations of 3D ONNs, while exploiting the advantages that come with processing signals directly in the native optical domain, hybrid optical-digital networks have been thoroughly considered [16], and analyzed in several works [17–19]. It was shown that by combining optical, optoelectronic and digital layers, it is possible to build hybrid convolutional neural networks [20] or deep neural networks with lower computational complexity and higher frame rate compared to traditional digital ML algorithms [11,21], also enabling researchers to explore entirely new network schemes [22] or hardware implementations like artificial neural network functionality directly implemented in imaging sensors [23]. However, previous works implemented hybrid networks with the aim to maximize performance without restrictions on the computational complexity of the digital part of the hybrid network. Thus, to the best of our knowledge, a quantitative analysis of the potential for reduction of computational complexity in a hybrid architecture to achieve a certain performance is thus far missing.

In this work we investigate the use of optical or hybrid optical-digital neural networks for the execution of classification tasks with performances comparable with common digital neural networks and how the use of a hybrid optical-digital neural network reduces computational complexity and energy consumption in the electronic domain (see Fig. 1). To achieve that, we numerically analyze and compare the performance and the computational requirements of digital supervised ML algorithms with those of a stand-alone optical diffractive neural network (DN₂) [12] – the optical implementation of a fully-connected feed forward neural network – and hybrid optical-digital neural networks, for the classification of handwritten digits from the MNIST dataset [24], EMNIST handwritten letters recognition [25], and image classification on the grayscale CIFAR-10 dataset [26], representing benchmark tasks of different complexity. We show that by using co-designed hybrid architectures, it is possible to perform classification tasks with accuracies comparable to common digital networks while reducing the number of operations required in the digital domain by a factor >10. This shows that the integration and co-design of passive DN₂ with digital networks creates some unique opportunities to achieve pervasive and low-power deep optical systems that can be realized using simple and compact imagers.

Fig. 1. Concept of how computational resources and time saved by hybrid integration of optical and digital neural networks.

Download Full Size | PDF

2. Definitions

2.1 Accuracy

When implementing a classification task, accuracy is the most common metric to measure the models classification performance. In this work, the classification accuracy is analyzed by means of confusion matrix and determined using the formula:

(1)$$Accuracy{\; }({\%} )= {\; }\frac{m}{n}{\; } \cdot 100$$

Where m is the number of correct predictions and n is the total number of input samples.

2.2 Trainable parameters

In ML, a model is a function with learnable parameters that maps an input to an output. The parameters are optimized by training the model on data, until it can provide an accurate mapping from the input to the desired output. Knowing and monitoring the number of trainable parameters is important during network design because it is related to the computational time and resources needed to train a network and can give insight on complexity and performance issues of the model, such as overfitting and underfitting. For a digital neural network, the number of trainable parameters can be extracted from the TensorFlow model. In a DN₂ trained to perform phase modulation only, each neuron represents a phase bias that is optimized during the training process. Therefore, for the optical part of the networks, we calculate the number of trainable parameters by counting the number of diffractive neurons.

2.3 FLOPs

The Floating-Point Operations (FLOPs) is the number of operations required to run a single instance of a given ML model and a measure of computational complexity. For the digital neural networks the number of FLOPs can be extracted from the TensorFlow model. The FLOPs for the optical DN₂s can be calculated as follows: Assuming the diffractive network has x layers separated by a distance d, each layer contains N × N neurons and the network is fully connected, than, the number of floating-point operations [27] of the diffractive system would be:

(2)$$FLOPs = 2x \cdot {({N \times N} )^2}$$

3. Methods

Neural networks are computational models for ML inspired by the structure of biological neural systems. Neural networks are trained from examples rather than being explicitly programmed and, in this way, they can generalize and successfully deal with unseen inputs. The simplest networks proposed were single-layer perceptrons, but now neural networks evolved into more complex architectures that include multiple layers, convolution layers and even recurrent connections, which have expanded neural networks applications and the problems they can address.

3.1 Network designs

In what follows we provide the details on the design and training of the three different digital neural network architectures, the DN₂ and three different hybrid optical-digital neural networks composed by a DN₂ pre-detection layer and an in-silico post-detection layer, considered in this work.

3.1.1 Multilayer perceptron (MLP)

MLPs are fully connected feed forward neural networks consisting of an input layer that receives the data, hidden layers that perform a series of non-linear transformations, and an output layer that provides the results. The architecture of the digital MLP considered in this work, MLP-1, has two densely-connected layers and is reported in Fig. S1a.

3.1.2 Convolutional neural networks (CNNs)

CNNs are deep learning models primarily composed of convolutional layers, pooling layers, and fully connected layers, which have achieved superior performance with image, speech, or audio signal inputs, compared with other neural networks [28], however, they can be computationally demanding. To gain a clearer understanding of the impact of the number of convolutional layers on the classification accuracy and the computational resources, we considered two digitals CNN architectures, CNN-1 and CNN-2, described in Fig. S1a, with two and three convolutional layers, respectively.

3.1.3 Diffractive neural networks (DN₂)

The DN₂ analyzed in this work is composed by four planar diffractive elements with an axial distance of 40 × λ consisting of 64 × 64 resolvable pixels (0.533 × λ) that act as diffractive neurons able of scattering and re-focusing a multitude of images received as input and of mapping them into a specific output (Fig. 2(a)). The diffractive neurons of each layer are linked to the diffractive neurons of the neighboring layers through Rayleigh-Sommerfeld diffraction. Each diffractive neuron performs phase-only modulation and adds a bias in the form of a phase delay to the transmitted signal. The phase delay can be considered a learnable parameter to be iteratively adjusted during the computer-based training through back propagation. After the training, the DN₂ receives as input a light field with handwritten digits encoded in the amplitude and can scatter and modulate each of a multitude of images, mapping them into a specific output field. The intensity distribution of the output field indicates the value of the digit received as input. Through this optical inference process, the DN₂ can perform passively the task of a multilayer perceptron realizing the functionality of all-optical digit classification [11,12,29]. See Supplement 1 for the details on the DN₂ design.

Fig. 2. a) Final design of a phase-only four-layer diffractive neural network (DN₂) trained to perform handwritten digit classification. b-d) Final design of hybrid diffractive-digital neural network architectures trained to perform handwritten digit classification. The optical part is constituted by a phase-only four-layer DN₂ co-trained with the digital network. All the DN₂s diffractive elements have 64 × 64 diffractive neurons. The inputs to the digital networks have 64 × 64 pixels. e-g) Optical kernels of the hybrid optical convolution layer-digital architectures trained to perform handwritten digit classification. The optical kernels are a phase-only single kernels co-trained with the digital network. The inputs to the digital networks have 64 × 64 pixels.

Download Full Size | PDF

3.1.4 Hybrid diffractive-digital neural networks

In the hybrid diffractive-digital neural networks architecture design, we combine DN₂ with digital neural networks to form a hybrid structure that we co-trained. The optical component of the hybrid networks is the DN₂ described in the previous section. By taking the intensity – i.e. the absolute value of the squared amplitude of the complex valued output field of the last diffractive layer - as the input for the digital network, we can cascade the optical and digital structures, thereby forming hybrid architectures. DN₂-MLP-2, DN₂-CNN-3 and DN₂-CNN-4 were designed by combining DN₂ with a MLP with a single densely-connected layer, and two different CNNs with a single convolutional layer, but different filter size (Fig. 2(b))-d)). To study whether it is beneficial to match the output layer size of the DN₂ with the input of the digital part of a hybrid network, we also consider the case where the output of the DN₂ was resized from 64 × 64 pixels to 28 × 28 pixels. See Supplement 1 for the details on the hybrid diffractive-digital networks design.

3.1.5 Hybrid optical convolution kernel (OCK)-digital neural networks

We combined a single kernel optical convolutional layer with digital neural networks to create hybrid optical-digital architectures. The optical convolutional layer consists of an optimized phase mask – representing an optical kernel – placed in the Fourier plane of a 4f-imaging system. The convolution of the input and the kernel may be performed optically by pointwise multiplications in the Fourier plane, exploiting the inherent convolution performed by a linear, spatially invariant imaging system [30,31]. As for the hybrid diffractive-digital networks, the intensity of the complex valued output field of the optical convolutional layer were taken as input of the digital networks. OCK-MLP-2, OCK -CNN-3 and OCK-CNN-4 were designed by combining the single-kernel optical convolution layers with digital architectures, the detailed architectures are described in Fig. 2(e-g) and Fig. S3. The input images are 64 × 64 pixel and match the kernel size. We study the two cases where the input to the digital network was resized to 64 × 64 and 28 × 28 pixels, respectively. See Supplement 1 for the details on the hybrid optical convolutional layer-digital networks design.

3.2 Training & testing

All the models mentioned above were trained to perform MNIST handwritten digit recognition, EMNIST handwritten letters recognition and image classification on the grayscale CIFAR-10 dataset using Python version 3.6.13 and TensorFlow framework version 2.6.2 (Google LLC) on a desktop computer Dell PD4JR08 with Intel Xeon W-2133 CPU @ 3.60 GHz processor and 32 GB of RAM, running a Windows 10 Pro operating system, Microsoft. For training, we used 60,000 images from the MNIST dataset, 124,800 images from the EMNIST Letters dataset and 50,000 images from the CIFAR-10 dataset converted into greyscale and resized to match our designs [24]. We used the stochastic gradient descent algorithm, Adam, to back-propagate the errors and to minimize the loss function. All the architectures were trained using cross-entropy loss function for 50 epochs using a learning rate of 10⁻⁴. During model training, for each batch of training data, we performed backpropagation, calculating and updating the models parameters to minimize the value of the loss function. The process of backpropagation was automatically executed by the optimizer during model training, eliminating the need for explicit coding. The details of the hybrid architectures modelling and training are discussed in the Supplement 1. For testing, we used 10,000 images from the MNIST dataset, 20,800 images from the EMNIST Letters dataset and 10,000 images from the CIFAR-10 dataset converted into greyscale and resized to match our designs.

3.3 Determining computational complexity of digital models

The computational complexity of a ML model is reflected both in the number of trainable parameters and the FLOPs required for execution. We determine these two parameters through functions built-in to the TensorFlow framework we employ for definition of the models. We obtain the number of trainable parameters by using the built-in model.summary() function and the FLOPs the built-in function get_flops(model) within the TensorFlow Profiler module (see Supplement 1 for details).

4. Results

In order to combine the advantages of optical neural networks – i.e. low complexity, ability to process visual information passively in the native optical domain – with the flexibility of digital neural networks, we integrated DN₂s and OCKs with digital architectures, as outlined in the methods section. The diffractive elements/convolutional kernels and digital layers are jointly trained to perform classification tasks of different complexity, for which the results are presented in this section. Detailed information on the hybrid neural networks, like training convergence plots and direct comparison of confusion matrices for training and testing can be found in Fig. S1 – Fig. S15.

4.1 Comparison of digital, optical and hybrid ANNs for classification of handwritten digits

In this section we compare the classification accuracy of digital and hybrid optical-digital ANN architectures with different levels of complexity as described in the Methods section for classifying handwritten digits from the MNIST dataset, the results are summarized in Table 1 and Table S1. While the test accuracy for the all-digital neural networks facing this task in our experiment exceeds 98%, the DN₂ falls short in this comparison with a test accuracy of approximately 93%.

Table 1. Train and blind test accuracy, number of trainable parameters and FLOPs for networks performing handwritten digits classification.

View Table | View all tables in this article

The DN₂ does, however, achieve this classification accuracy with a total number of trainable parameters that is one order of magnitude smaller than the trainable parameters in either of the digital neural networks. This also reflects in the number of FLOPs required for execution of an inference task with the digital models, where either of the digital ANNs considered require more than 10⁵ FLOPs in order to achieve the high classification accuracy, whereas the DN₂ performs passively in the optical domain. This highlights the strength of DN₂ in particular for tasks like image classification, where models with low complexity that perform passively in the optical domain can perform tasks with moderate accuracy.

All hybrid ANNs considered in this experiment show a test accuracy for the MNIST handwritten digit classification task of >95%. While some of the hybrid architectures can outperform the multilayer perceptron in test accuracy, none of the hybrid architectures performs better than the all-digital CNN. In particular the hybrid networks that employ an optical convolution kernel consistently show a test accuracy lower than the reference all-digital ANN implementations, as well as the hybrid networks with an optical diffractive layer.

Comparing the computational complexity of the all-digital and hybrid models, the results show that it is possible for a hybrid ANN to outperform its digital counterpart while employing significantly less computational resources. This is evident, for example, for the DN₂+ MLP-2 architecture, which shows increased classification accuracy compared to the digital MLP-1 model, while only requiring approximately 8% of the computational resources.

4.2 Comparison of digital, optical and hybrid ANNs for classification of handwritten letters

In this section we compare the classification accuracy of the digital, optical and hybrid ANNs as described in the Methods section for the task of classification of handwritten letters from the EMNIST dataset. This task is more complex than classification of handwritten digits from the MNIST dataset in the sense that the number of classes to distinguish is increased from 10 to 26, the results are shown in Table 2 and Table S2. While the digital networks show a test accuracy of 88.46% and 90.86% for the MLP-1 and CNN-2 models, respectively, the DN₂ achieves a classification accuracy of 82.2% all optically.

Table 2. Train and blind test accuracy, number of trainable parameters and FLOPs for networks performing handwritten letters classification.

View Table | View all tables in this article

While some hybrid architectures can outperform the MLP-1 in terms of classification accuracy, the classification accuracy of the CNN-2 remains unmatched by the hybrid architectures for this task. However, some of the hybrid architectures achieve comparable test accuracies. For example, the DN₂+ CNN-3 and DN₂+ MLP-2 architectures achieve test accuracies of 88.99% and 89.57%, respectively. This is achieved with a computational complexity reduced to 63.4% for the DN₂-CNN-3 architecture and 47.9% for the DN₂-MLP-2 model when compared to the all-digital CNN-2.

As for the classification task for the handwritten digits of the MNIST dataset, the hybrid networks in this experiment do not seem to benefit from an optical convolution layer, with the test accuracies of these networks considerably below the accuracies of hybrid diffractive ANNs.

4.3 Comparison of digital, optical and hybrid ANNs for classification of grayscale objects with varying backgrounds

In this section we compare the classification accuracy of the digital, optical and hybrid ANNs as described in the Methods section for classification of grayscale objects in context from the CIFAR-10 dataset, the results are shown in Table 3 and Table S3. Since the classification of objects in the CIFAR-10 dataset is significantly more complex than the tasks tackled before, we introduce an additional digital network architecture CNN-5 as reference, capable of classifying objects of this dataset with reasonable accuracy after training for 50 epochs.

Table 3. Train and blind test accuracy, number of trainable parameters and FLOPs for networks performing image classification on the grayscale CIFAR-10 dataset.

View Table | View all tables in this article

While the digital reference architecture CNN-5 shows a test accuracy of 77.01%, the all-optical DN₂ achieves an accuracy of 69.71%. All the hybrid architectures tested on this dataset stay behind the digital reference network as well as the all-optical DN₂. With test accuracies >51% the hybrid networks with optical convolutional kernels perform thereby better than the hybrid networks with optical diffractive layers, which show test accuracies <38%.

Other than the networks for the simple classification tasks of MNIST and EMNIST handwritten digits and letters, the more complex task of classifying objects in context of a background seems to benefit significantly from employing an optical convolutional kernel, in particular in combination with a CNN in the digital domain. Where diffractive optical elements in combination with digital convolutional networks achieve classification accuracies of maximum 38% in our experiments, combining an optical convolutional kernel with a digital CNN yields significantly better results, showing test accuracies of >51% for the architectures tested here.

5. Discussion

The results reported in Tables 1–3 and Fig. 3 show that having a jointly-trained optical-digital hybrid network can improve the inference performance of the overall system, compared with all-optical DN₂, while potentially using up to an order of magnitude less computational resources for training and inference compared to stand-alone digital networks.

Fig. 3. Bar graph illustrating test accuracy in percentage and FLOPs for different architectures performing image classification tasks.

Download Full Size | PDF

The DN₂ architecture can be considered a linear multi-layer holographic perceptron that optically implements matrix multiplication. When combined with a single fully connected layer (DN₂-MLP-2), it created a hybrid architecture that for basic classification of handwritten digits outperforms not only the all-optical DN₂, but also stand-alone digital multilayer perceptrons, like MLP-1, while reducing drastically the trainable parameters and hence the computational requirements (compare Fig. 3). We think that this happens because, through the joint training, the fully connected layer compensates for some of the limitations of the holographic perceptron, such as lack of nonlinearity, faults in the design optimization (diffractive neuron size, distance between layers, partial connectivity) and intrinsic limited modulation capability of the diffractive layers. On the other hand, stand-alone digital CNNs perform always better than hybrid architectures. A possible reason for this can be that the complexity and the superior approximation capabilities of CNNs, when compared with optical or digital MLPs, interferes with the training of the optical DN₂ or OCK that serves as a front-end of the hybrid architecture [17].

For hybrid networks with an optical DN₂ the highest classification accuracies can be observed when the input layer size of the digital network is matched to the number of pixels in the DN₂ (64 × 64 pixels). This comes, however, at the cost of computational complexity of the models. When comparing the computational complexity of hybrid networks with digital input sizes of 64 × 64 and 28 × 28 pixels, the networks with the lower pixel count digital inputs show consistently lower computational complexity, while the performance in terms of test accuracy is only reduced marginally. This suggests that DN₂s can not only be used to pre-process visual information passively, but also compresses it, reducing the size of the 2D input image before digitalization, with minimal effect on the resulting performance of the network.

When employing the DN₂ as optical front-end, which not only performs pre-processing in the optical domain, but also compresses the incoming information in size, the computational complexity of the networks in the digital domain can be reduced further. With approximately a factor of 6, the hybrid network architecture DN₂-CNN-3 trained for classifying MNIST handwritten digits shows the strongest reduction in computational complexity when compressing the input to the digital part of the neural network, while only reducing the classification accuracy by 0.19%. This is due to the computational complexity of networks with convolutional layers increasing significantly with size of the input images [28]. The largest overall reduction in computational complexity between all-digital and hybrid neural network can be observed for the MLP architectures trained for classifying MNIST handwritten digits, where the FLOPs for an inference step reduce by a factor >10, while actually increasing the classification accuracy by 0.09%. For comparison, on current specialized processors to execute neural network models [32,33], the energy requirement for execution of a model is defined as 1 pJ/FLOP, hence the use of a hybrid network would reduce the energy cost for executing the MLP image classification model from 10⁻⁷ J to 7.9·10⁻⁹ J, for the convolutional model from 1.1·10⁻⁷ J to 5.9·10-⁹ J, while largely maintaining their accuracy.

For the simpler classification tasks, i.e. the classification of handwritten digits and letters from the MNIST and EMNIST datasets, the hybrid architectures investigated in this work perform better with an optical DN₂ – independent of the type of digital architecture. For the more complex task of classifying objects with backgrounds from the CIFAR-10 dataset, on the other hand, comparatively high classification accuracies can only be achieved for combinations of optical convolutional kernels with digital CNNs. This suggests that the hybrid architectures have to be chosen carefully depending on the task, in order to present an advantage.

All the ANNs presented in this work were trained in the same conditions and using the same parameters (number of epochs, learning rate). These parameters are not necessarily the best choice for all the architectures and individual optimization of the training parameters can improve the accuracy of each network. In particular, in the case of hybrid architectures, a changing learning rate, training the network in two stages or using virtual optical layers has proven to be beneficial [17,34]. Moreover, the number and the spatial arrangement of the detectors that capture the output intensities of the optical networks are all degrees of freedom that can be further explored and optimized. However, the results presented here already allow for several considerations.

When implementing a hybrid optical-digital neural network as proposed in this work, where the DN₂ is implemented via Spatial Light Modulators (SLMs) or Digital Micromirror Devices (DMDs), a limit on the computational speed is imposed on the system based on the rate of reconfigurability of the optoelectronic device, which is typically in the range of hundreds of Hertz. This limitation would impact the performance in particular during training phase, if the network is trained in-situ. During the inference phase, the DN₂ typically remains static, while only the input changes. In this case, the computational speed would again be limited by a reconfigurable optoelectronic device, if such a device provides the inputs to the network. In other settings, where the input information is continuously provided as information native to the optical domain (e.g., in a remote sensing setup or when analyzing the Point Spread Function of an aberrated wavefront), the computation speed is limited by the detector.

The tasks presented to the ANNs in this work, i.e. the classification of objects with different levels of complexity, are analog computing tasks. This is where we believe the advantages of Optical- and Hybrid ANNs can be exploited to their fullest: in the analysis of information that is native to the optical domain, where the optical network can process all dimensions of the optical signal (e.g., complex field, spectral information, optical angular momentum, etc.). While it is possible to execute tasks that are native to the digital domain in an Optical- or Hybrid ANNs, any benefit this would provide must be carefully evaluated against the energy cost of electro-optical and opto-electronic conversion steps as well as the limits on computation speed as discussed in the previous section.

6. Conclusion

In this work, hybrid diffractive digital neural networks were investigated and their performance for a benchmark image classification task was critically compared with the performance of stand-alone all-optical and all-digital network architectures. It was numerically shown that, in certain cases, hybrid neural network architectures can achieve a classification accuracy that exceeds the performance of both, all-digital and all-optical networks, while reducing the computational complexity of the digital network by factors >10. This reduction in computational complexity, which reduces the energy consumption for the hardware the neural network is executed on, together with optical information directly in the native domain, which gives access to information typically lost during opto-electronic conversion [13], may lead to a new generation of neural network architectures. These new architectures may find application where the size for computational hardware or energy budget are restricted, such as in autonomous vehicles [3] or remote sensing applications.

Funding

Science and Technology Commission of Shanghai Municipality (21DZ1100500); Shanghai Rising-Star Program (21QA1403600); National Natural Science Foundation of China (62206176); Natural Science Foundation of Shanghai (21ZR1443400).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Author contribution statements.

EG and SS conceived the concept, and MG supervised the project. MC and EG performed numerical simulations. All authors participated in discussions and contributed to writing of the manuscript.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]

2. E. A. Huerta, G. Allen, I. Andreoni, et al., “Enabling real-time multi-messenger astrophysics discoveries with deep learning,” Nat. Rev. Phys. 1(10), 600–608 (2019). [CrossRef]

3. J. Kocić, N. Jovičić, and V. Drndarević, “An End-to-End Deep Neural Network for Autonomous Driving Designed for Embedded Automotive Platforms,” Sensors 19(9), 2064 (2019). [CrossRef]

4. A. A. Nawaz, M. Urbanska, M. Herbig, et al., “Intelligent image-based deformation-assisted cell sorting with molecular specificity,” Nat. Methods 17(6), 595–599 (2020). [CrossRef]

5. Y. Li, A. Mahjoubfar, C. L. Chen, et al., “Deep Cytometry: Deep learning with Real-time Inference in Cell Sorting and Flow Cytometry,” Sci. Rep. 9(1), 11088 (2019). [CrossRef]

6. R. T. Weverka, K. Wagner, and M. Saffman, “Fully interconnected, two-dimensional neural arrays using wavelength-multiplexed volume holograms,” Opt. Lett. 16(11), 826–828 (1991). [CrossRef]

7. M. Reck and A. Zeilinger, “Experimental realization of any discrete unitary operators,” Phys. Rev. Lett. 73(1), 58–61 (1994). [CrossRef]

8. J. Duvillier, M. Killinger, K. Heggarty, et al., “All-optical implementation of a self-organizing map: a preliminary approach,” Appl. Opt. 33(2), 258–266 (1994). [CrossRef]

9. K. Wagner and D. Psaltis, “Multilayer optical learning networks,” Appl. Opt. 26(23), 5061–5076 (1987). [CrossRef]

10. D. Psaltis, D. Brady, and K. Wagner, “Adaptive optical networks using photorefractive crystals,” Appl. Opt. 27(9), 1752–1759 (1988). [CrossRef]

11. E. Goi, S. Schoenhardt, and M. Gu, “Direct retrieval of Zernike-based pupil functions using integrated diffractive deep neural networks,” Nat. Commun. 13(1), 7531 (2022). [CrossRef]

12. X. Lin, X. Lin, Y. Rivenson, et al., “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]

13. E. Goi, X. Chen, Q. Zhang, et al., “Nanoprinted high-neuron-density optical linear perceptrons performing near-infrared inference on a CMOS chip,” Light: Sci. Appl. 10(1), 40 (2021). [CrossRef]

14. T. Zhou, X. Lin, J. Wu, et al., “Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit,” Nat. Photonics 15(5), 367–373 (2021). [CrossRef]

15. Y. Zuo, B. Li, Y. Zhao, et al., “All-optical neural network with nonlinear activation functions,” Optica 6(9), 1132 (2019). [CrossRef]

16. G. Wetzstein, A. Ozcan, S. Gigan, et al., “Inference in artificial intelligence with deep optics and photonics,” Nature 588(7836), 39–47 (2020). [CrossRef]

17. D. Mengu, Y. Luo, Y. Rivenson, et al., “Analysis of diffractive optical neural networks and their integration with electronic neural networks,” IEEE J. Sel. Top. Quantum Electron. 26(1), 1 (2020). [CrossRef]

18. J. Liu, Q. Wu, X. Sui, et al., “Research progress in optical neural networks: theory, applications and developments,” PhotoniX 2(1), 5 (2021). [CrossRef]

19. S. Jutamulia and F. T. S. Yu, “Overview of the hybrid optical neural networks,” Opt. Laser Technol. 28(2), 59–72 (1996). [CrossRef]

20. J. Chen, J. Peng, C. Yang, et al., “Hybrid optical-electronic neural network with pseudoinverse learning for classification inference,” Appl. Phys. Lett. 119(11), 114102 (2021). [CrossRef]

21. G. Qu, G. Cai, X. Sha, et al., “All-Dielectric Metasurface Empowered Optical-Electronic Hybrid Neural Networks,” Laser Photonics Rev. 16(10), 2100732 (2022). [CrossRef]

22. D. Pierangeli, G. Marcucci, and C. Conti, “Photonic extreme learning machine by free-space optical propagation,” Photonics Res. 9(8), 1446–1454 (2021). [CrossRef]

23. A. J. Molina-mendoza and T. Mueller, “Ultrafast machine vision with 2D material neural network image sensors,” Nature 579(7797), 62–66 (2020). [CrossRef]

24. Y. Lecun, L. Bottou, Y. Bengio, et al., “Gradient-based learning applied to document recognition,” Proc. IEEE 86(11), 2278–2324 (1998). [CrossRef]

25. G. Cohen, S. Afshar, J. Tapson, et al., “EMNIST: Extending MNIST to handwritten letters,” in 2017 International Joint Conference on Neural Networks (IJCNN) (2017), pp. 2921–2926.

26. 26. A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” in (2009).

27. Y. Shen, N. C. Harris, D. Englund, et al., “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]

28. A. Ajit, K. Acharya, and A. Samanta, “A Review of Convolutional Neural Networks,” in 2020 International Conference on Emerging Trends in Information Technology and Engineering (Ic-ETITE) (2020), pp. 1–5.

29. Y. Luo, D. Mengu, N. T. Yardimci, et al., “Design of task-specific optical systems using broadband diffractive neural networks,” Light: Sci. Appl. 8(1), 112 (2019). [CrossRef]

30. J. Chang, V. Sitzmann, X. Dun, et al., “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324 (2018). [CrossRef]

31. M. Yang, E. Robertson, L. Esguerra, et al., “Optical convolutional neural network with atomic nonlinearity,” Opt. Express 31(10), 16451–16459 (2023). [CrossRef]

32. S. Mach, F. Schuiki, F. Zaruba, et al., “A 0.80pJ/flop, 1.24Tflop/sW 8-to-64 bit Transprecision Floating-Point Unit for a 64 bit RISC-V Processor in 22 nm FD-SOI,” in 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC) (2019), pp. 95–98.

33. . “nvidia titan-rtx,” https://www.nvidia.com/en-us/deep-learning-ai/products/titan-rtx/.

34. O. Kulce, D. Mengu, Y. Rivenson, et al., “All-optical synthesis of an arbitrary linear transformation using diffractive surfaces,” Light: Sci. Appl. 10(1), 196 (2021). [CrossRef]

	MLP-1	CNN-1	DN₂	DN₂ + MLP-2	DN₂ + CNN-3	DN₂ + MLP-2	DN₂ + CNN-3	OC + MLP-2	OC + CNN-3	OC + MLP-2	OC + CNN-3
Detector size	(28 ×28)	(28 ×28)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)
Train Acc. (%)	99.91	99.69	93.37	99.05	98.91	99.29	99.27	95.45	99.06	97.01	99.46
Test Acc. (%)	98.01	99.77	93.05	98.10	97.84	98.17	98.07	94.75	94.75	95.86	97.84
Trainable parameters (× 10⁴)	10.17 (dig.)	12.19 (dig.)	1.64 (opt.)	1.64(opt.)0.79(dig.)	1.64(opt.)5.44(dig.)	1.64(opt.)4.10(dig.)	1.64(opt.)30.79(dig.)	0.41(opt.)0.79(dig.)	0.41(opt.)14.10(dig.)	0.41(opt.)4.01(dig.)	0.41(opt.)79.98(dig.)
FLOPs (digital) (× 10⁴)	10.16	12.17	-	0.79	5.43	4.10	30.78	0.78	14.09	4.09	79.98

	MLP-1	CNN-2	DN₂	DN₂ + MLP-2	DN₂ + CNN-3	DN₂ + MLP-2	DN₂ + CNN-3	OC + MLP-2	OC + CNN-3	OC + MLP-2	OC + CNN-3
Detector size	(28 ×28)	(28 ×28)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)
Train Acc. (%)	95.50	93.06	82.8	90.26	91.79	92.39	93.74	79.31	91.92	84.89	93.85
Test Acc. (%)	88.46	90.86	82.21	88.30	88.99	89.57	90.03	78.6	88.48	82.78	88.58
Trainable parameters (× 10⁴)	10.38 (dig.)	11.25 (dig.)	1.64 (opt.)	1.64 (opt.) 2.04 (dig.)	1.64 (opt.) 14.10 (dig.)	1.64 (opt.) 10.65 (dig.)	1.64 (opt.) 79.99 (dig.)	0.41 (opt.) 2.04 (dig.)	0.41 (opt.) 14.10 (dig.)	0.41 (opt.) 10.65 (dig.)	0.41 (opt.) 79.99 (dig.)
FLOPs (digital) (× 10⁴)	21.71	22.23	-	2.04	14.09	10.65	79.98	2.04	14.09	10.65	79.98

	CNN-5	DN₂	DN₂ +MLP-2	DN₂ +CNN-3	DN₂ +MLP-2	DN₂ +CNN-3	OC +MLP-2	OC +CNN-3	OC +MLP-2	OC +CNN-3
Input size	(32 ×32)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)
Train Acc. (%)	79.1	70.02	39.87	38.62	42.26	43.58	33.66	59.39	35.81	66.61
Test Acc. (%)	77.01	69.71	36.84	34.92	38.43	37.23	32.15	51.51	30.55	51.34
Trainable parameters (× 10⁴)	59.20 (dig.)	1.64 (opt.)	1.64 (opt.) 0.79 (dig.)	1.64 (opt.) 5.44 (dig.)	1.64 (opt.) 4.10 (dig.)	1.64 (opt.) 30.79 (dig.)	0.41 (opt.) 0.79 (dig.)	0.41 (opt.) 5.44 (dig.)	0.41 (opt.) 4.10 (dig.)	0.41 (opt.) 30.79 (dig.)
FLOPs (digital) (× 10⁴)	69.06	-	0.79	5.43	4.10	30.78	0.79	5.43	4.10	30.78

	MLP-1	CNN-1	DN₂	DN₂ + MLP-2	DN₂ + CNN-3	DN₂ + MLP-2	DN₂ + CNN-3	OC + MLP-2	OC + CNN-3	OC + MLP-2	OC + CNN-3
Detector size	(28 ×28)	(28 ×28)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)
Train Acc. (%)	99.91	99.69	93.37	99.05	98.91	99.29	99.27	95.45	99.06	97.01	99.46
Test Acc. (%)	98.01	99.77	93.05	98.10	97.84	98.17	98.07	94.75	94.75	95.86	97.84
Trainable parameters (× 10⁴)	10.17 (dig.)	12.19 (dig.)	1.64 (opt.)	1.64(opt.)0.79(dig.)	1.64(opt.)5.44(dig.)	1.64(opt.)4.10(dig.)	1.64(opt.)30.79(dig.)	0.41(opt.)0.79(dig.)	0.41(opt.)14.10(dig.)	0.41(opt.)4.01(dig.)	0.41(opt.)79.98(dig.)
FLOPs (digital) (× 10⁴)	10.16	12.17	-	0.79	5.43	4.10	30.78	0.78	14.09	4.09	79.98

	MLP-1	CNN-2	DN₂	DN₂ + MLP-2	DN₂ + CNN-3	DN₂ + MLP-2	DN₂ + CNN-3	OC + MLP-2	OC + CNN-3	OC + MLP-2	OC + CNN-3
Detector size	(28 ×28)	(28 ×28)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)	(28 ×28)	(28 ×28)	(64 ×64)	(64 ×64)
Train Acc. (%)	95.50	93.06	82.8	90.26	91.79	92.39	93.74	79.31	91.92	84.89	93.85
Test Acc. (%)	88.46	90.86	82.21	88.30	88.99	89.57	90.03	78.6	88.48	82.78	88.58
Trainable parameters (× 10⁴)	10.38 (dig.)	11.25 (dig.)	1.64 (opt.)	1.64 (opt.) 2.04 (dig.)	1.64 (opt.) 14.10 (dig.)	1.64 (opt.) 10.65 (dig.)	1.64 (opt.) 79.99 (dig.)	0.41 (opt.) 2.04 (dig.)	0.41 (opt.) 14.10 (dig.)	0.41 (opt.) 10.65 (dig.)	0.41 (opt.) 79.99 (dig.)
FLOPs (digital) (× 10⁴)	21.71	22.23	-	2.04	14.09	10.65	79.98	2.04	14.09	10.65	79.98

Quantitative comparison of the computational complexity of optical, digital and hybrid neural network architectures for image classification tasks

Abstract

1. Introduction