
OP-FCNN: an optronic fully convolutional neural network for imaging through scattering media

Open Access

Abstract

Imaging through scattering media is a classical inverse problem in computational imaging. In recent years, deep learning (DL) methods have excelled at speckle reconstruction by extracting the correlation of speckle patterns. However, high-performance DL-based speckle reconstruction also incurs heavy hardware computation and energy consumption. Here, we develop an opto-electronic DL method with low computational complexity for imaging through scattering media. We design an "end-to-end" optronic structure for speckle reconstruction, namely the optronic fully convolutional neural network (OP-FCNN). In OP-FCNN, we utilize lens groups and spatial light modulators to implement the convolution, down/up-sampling, and skip connections in optics, which reduces the computational complexity by two orders of magnitude compared with a digital CNN. Moreover, the reconfigurable and scalable structure allows the OP-FCNN to further improve imaging performance and accommodate object datasets of varying complexity. We utilize the MNIST handwritten digits, EMNIST handwritten letters, fashion MNIST, and MIT-CBCL-face datasets to validate the OP-FCNN imaging performance through random diffusers. Our OP-FCNN strikes a good balance between computational complexity and imaging performance. The average imaging performance over the four datasets reaches 0.84, 0.91, 0.79, and 16.3 dB for JI, PCC, SSIM, and PSNR, respectively. The OP-FCNN paves the way for all-optical systems for imaging through scattering media.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Imaging through scattering media is a classical issue in various fields, ranging from biomedical imaging, which aims to improve imaging performance and depth of field in biological tissue, to remote imaging, such as optical remote sensing and atmospheric imaging [1–5]. Many significant approaches have been proposed to solve this problem. One typical method utilized the transmission matrix to measure the transfer function of the scattering medium between the incident plane and the detector [6–8]. Moreover, wavefront shaping used a reference beam to retrieve the phase information of the scattered light and reconstruct the object by phase conjugation [9–11]. Another method characterized the statistical similarities of the scattering medium through speckle correlation, known as the memory effect (ME) [12–14].

In recent years, with the rapid growth of computational resources, deep learning (DL) has proved to be an efficient and powerful method for imaging through scattering media [15,16]. Based on the correlation of the speckle patterns, DL inverts the scattering imaging process through data training without building an explicit model. Many advanced DL architectures have been proposed to solve the scattering imaging problem. In 2018, Li et al. built a convolutional neural network (CNN) following the classic "U-net" architecture for imaging through scattering media [17]. E. Guo et al. proposed the PDSNet and achieved 40 times the ME range for sparse object reconstruction [18]. Wang et al. proposed the SACNN for feature extraction and extended the model generalization for deep sparse pattern reconstruction [19]. Hu et al. developed an adaptive inverse mapping method that self-corrects the inverse mapping through unsupervised learning, demonstrating great potential for imaging through dynamic scattering media [20].

Deep learning has shown strong capability in imaging through scattering media; however, some issues still challenge its development. On the one hand, current DL methods encounter significant challenges in computation speed and power consumption. As a data-driven method, the reconstruction performance and generalization capability of DL-based imaging systems highly depend on the speckle patterns collected for training. The explosive growth of speckle patterns places high demands on high-performance graphics processing units (GPUs) and power consumption [21]. Although many state-of-the-art lightweight DL methods have been proposed to decrease model complexity, adding deeper layers to the network structure remains an effective way to enhance reconstruction performance [22–24]. Hence, the inadequate speed for handling vast amounts of speckle data, together with the growing demand for high-performance DL-based imaging systems with low computational complexity, motivates the implementation of DL methods in optics. Compared with electronic computation, photons promise ultra-fast speed, inherently parallel computation, and low energy consumption, providing a new DL-based route for imaging through scattering media [25–28]. On the other hand, the optical implementation of a DL method can be regarded as an optical imaging system mapping speckle patterns to objects, and the ultimate goal for imaging through scattering media is an all-optical system. Hence, implementing DL in optics provides a new technical route toward this goal.

The optical implementation of DL techniques can be grouped into two main routes. The diffractive neural network is a great success that merges wave optics with neural networks to design all-optical DL methods [29–31]. However, some of these networks face challenges in reconfigurability and scalability [32]. The optical convolutional neural network is another successful spatial-optics technique to implement convolutional neural network (CNN) structures in optics [33–35]. Gu et al. proposed an optronic CNN for MNIST handwritten digit classification that implemented the convolution, pooling, and fully connected layers in optics [36–38]. Moreover, Huang and Gu et al. proposed a speckle-based optronic CNN to extract the features of speckle patterns and realized object classification behind scattering media [39]. However, the structure of these optronic CNNs still has the following limitations. First, they utilized fully connected layers to transform the two-dimensional features into one-dimensional vectors, which may discard some spatial feature information of the speckle patterns. Moreover, the network structures were designed asymmetrically: they had only down-sampling layers to downsize the feature pattern and lacked up-sampling layers to recover the object size. Besides, these models were not "end-to-end" and were unable to reconstruct the object from speckles. For these reasons, the existing optronic network structures had limited feature extraction capability for various speckle patterns and were infeasible for developing optical systems for imaging through scattering media.

In this work, an optronic fully convolutional neural network named "OP-FCNN" is proposed for imaging through scattering media. Specifically, we design an "end-to-end" optronic network structure and implement fully optical convolutional layers to replace the fully connected ones, preserving more spatial information of the speckle patterns. The U-type symmetric architecture splits the network into an Encoder and a Decoder. We implement the down-sampling and up-sampling layers in optics to downsize the speckle pattern in the Encoder and recover the object size in the Decoder, respectively. Moreover, we utilize skip connections to combine the low-dimensional and high-dimensional information of the Encoder and Decoder to enhance the model's feature extraction capability. In OP-FCNN, all computational operations are performed in optics except for data transmission and nonlinear activation.

Here, we mainly explore a good balance between imaging performance and computational complexity for DL imaging systems. Four object datasets are used for training/testing and validation, including the MNIST handwritten digits, EMNIST handwritten letters, fashion MNIST, and MIT-CBCL-face datasets. We quantitatively evaluate the OP-FCNN reconstruction performance with four scientific indicators, namely the Pearson correlation coefficient (PCC), structural similarity measure (SSIM), Jaccard index (JI), and peak signal-to-noise ratio (PSNR). The OP-FCNN achieves 0.84, 0.91, 0.79, and 16.3 dB for JI, PCC, SSIM, and PSNR, respectively. Compared with a digital CNN using a similar structure, including the same layers and channels, the OP-FCNN achieves approximately 98.5% of its performance with only 0.6% of its computational complexity. Moreover, the architecture of OP-FCNN is reconfigurable and scalable and can be extended to more complex structures to improve imaging performance and accommodate the varying complexity of speckle patterns in complex scenarios. Furthermore, we develop a stronger OP-FCNN by adding deeper convolutional layers and more kernel channels, and then validate its imaging performance. The stronger OP-FCNN validation results for JI, PCC, SSIM, and PSNR over the four datasets reach up to 0.87, 0.92, 0.86, and 17.53 dB, comprehensively surpassing the digital CNN while taking only 1.5% of its computational complexity. The proposed OP-FCNN provides a new opto-electronic DL method with low computational complexity for speckle reconstruction and boosts the development of all-optical systems for imaging through scattering media.

The structure of this paper is as follows: Section 2.1 introduces the optical imaging system through scattering media. Section 2.2 introduces the principle and architecture of OP-FCNN. Section 2.3 details the data acquisition and processing. The result analysis and discussion are presented in Section 3, and the conclusion is drawn in Section 4.

2. Method

2.1 Optical imaging system

We design an optical imaging system for light scattering to acquire the speckle patterns for training/testing and reconstruction performance validation. The experimental setup is shown in Fig. 1. Specifically, we utilize an amplitude-only spatial light modulator (SLM) (Holoeye Pluto, pixel pitch 8 $\mu$m, 1920$\times$1080) as the object plane to load the binary sparse object pattern. We use a ground glass diffuser (Thorlabs DG10-220-MD) to produce the speckle patterns. As shown in Fig. 1, the coherent light from the laser (10 mW, 532 nm) passes through the optical elements ($\lambda/2$ plate, objective lens, pinhole) and is expanded into a collimated beam after lens L1. The light illuminates the central 512$\times$512 pixels of the SLM and changes its path after two beam splitter cubes. A group of lenses L2 (f = 60 mm) and L3 (f = 50 mm) composes a 4f system to match the pixel pitches between the object SLM and the CMOS camera (Hamamatsu C14440, 6.5 $\mu$m pixel pitch, 2048$\times$2048).

2.2 Principle and architecture of OP-FCNN

Here, the light scattering process is characterized by a typical mathematical model mapping the original object plane to the detection plane:

$$O \approx S(I).$$
where $O$ and $I$ denote the speckle pattern on the detection plane and the original image on the object plane, respectively, and $S$ denotes the forward operator mapping the object plane to the detection plane. Since the sparse object pattern is distorted by the scattering medium and can hardly be recognized from the speckle pattern, we develop an opto-electronic method to learn the mapping relation between speckle patterns and objects, and then reconstruct objects by inverting Eq. (1):
$$I \approx F(O).$$
where $F$ denotes the backward operator mapping the detection plane to the object plane. Here, we build the OP-FCNN model to characterize the inverse light scattering process and reconstruct the sparse objects with optical techniques. The OP-FCNN is designed as an "end-to-end" U-type architecture comprising an Encoder and a Decoder with skip connections, promising to realize speckle reconstruction. In the Encoder, we implement optical convolutional layers to extract the low-dimensional features of the speckle pattern. We utilize optical lens 4f systems to realize the down-sampling and up-sampling operations. The Decoder further extracts the speckle features and combines them with the shallow-layer features through skip connections. We implement the nonlinear activation layer and normalization after the optical convolution to enhance the network's fitting capability. The concrete principle and architecture of the OP-FCNN are introduced below.

Fig. 1. Experimental setup of the light scattering optical imaging system.

2.2.1 Optical convolutional layer

In OP-FCNN, we implement the optical convolutional (OP-Conv) layer to extract the speckle features. In a traditional CNN, the input channels and convolutional kernels are typically three-dimensional. Here, we replace the three-dimensional convolution with two-dimensional optical convolutions to realize the dimension reduction. Specifically, we utilize a spatial light modulator (SLM) and an optical lens 4f system to perform the two-dimensional Fourier transform. We visualize the optical convolution in Fig. 2. Since a product in the frequency domain corresponds to a convolution in the spatial domain, we set the phase modulation of the frequency spectrum as trainable kernels at the 2f position. We load each channel of the kernels to extract different features of the input images, and then sum the extracted features as the convolution result. The OP-Conv layer loads each input image channel ($I_i(x_f,y_f)$) and each kernel phase channel ($K_j(f_x,f_y)$) on the amplitude-only SLM and the phase-only SLM, respectively. The $N$ feature images collected by the camera for the $j^{th}$ kernel channel are then summed as the $j^{th}$ convolution result channel $O_j(x_f,y_f)$. The convolution and summation operations are performed as follows:

$$O_j(x_f,y_f) = \sum_{i=1}^{N} \mathcal{F}^{{-}1}\{ \mathcal{F}\lbrack I_i(x_f,y_f)\rbrack \cdot K_j(f_x,f_y) \}.$$
where ($\cdot$) and ($\mathcal {F}$) in Eq. (3) signify the two-dimensional element-wise product and the optical Fourier transform, respectively. $I_i(x_f,y_f)$ denotes the input image in the $i^{th}$ channel, and $K_j(f_x,f_y)$ denotes the phase kernel in the $j^{th}$ channel. $O_j(x_f,y_f)$ denotes the output in the $j^{th}$ channel, which is the summation of all $N$ input channels convolved with kernel $K_j(f_x,f_y)$. $(x_f,y_f)$ denotes the spatial coordinates of the input and output images, and $(f_x,f_y)$ denotes the kernels' spatial frequency coordinates. We set the kernel as a phase modulation varying from 0 to 2$\pi$:
$$\begin{aligned} K(f_x,f_y) & = \exp\lbrack i\,2\pi (f_x,f_y)\rbrack \\ & = \exp\lbrack i\,\frac{2\pi}{\lambda f}(x_f,y_f)\rbrack \end{aligned}.$$
where $\lambda$ denotes the wavelength of the light, and $f$ indicates the focal length of the 4f system. As shown in Fig. 2, the input images ($l$ channels) are loaded on the amplitude-only SLM. L1 focuses the light at its focal plane, where the kernel is loaded on the phase-only SLM. The modulated light is then transformed by L2 and collected by an sCMOS camera.
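To make the OP-Conv operation of Eq. (3) concrete, the following is a minimal digital sketch of one optical convolutional layer in PyTorch (the framework used for training in Section 2.3). It is an illustrative stand-in rather than the authors' simulation code: the function name `op_conv`, the tensor shapes, and the assumption that the camera records per-channel intensities that are then summed electronically are ours.

```python
import torch

def op_conv(inputs, kernel_phase):
    """Digital sketch of one OP-Conv layer (Eq. 3).

    inputs:       (N, H, W) real tensor, the N input channels loaded on the amplitude-only SLM.
    kernel_phase: (M, H, W) real tensor in [0, 2*pi), the M trainable phase kernels on the phase-only SLM.
    Returns an (M, H, W) tensor: each output channel sums the N input channels
    filtered by one phase kernel in the Fourier plane.
    """
    kernels = torch.polar(torch.ones_like(kernel_phase), kernel_phase)   # K_j = exp(i * phase)
    spectra = torch.fft.fft2(inputs.to(torch.complex64))                  # F[I_i], done optically by L1
    # Broadcast every input spectrum against every kernel, then inverse transform (lens L2)
    fields = torch.fft.ifft2(spectra.unsqueeze(0) * kernels.unsqueeze(1))  # (M, N, H, W)
    # Assumption: the sCMOS camera records the intensity of each channel, which is then summed
    return fields.abs().pow(2).sum(dim=1)
```

In the optical system, the two Fourier transforms of this sketch are performed by lenses L1 and L2 at the speed of light; only the channel summation and readout are electronic.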

2.2.2 Down/up-sampling layer and nonlinear activation function

In the Encoder, we utilize the down-sampling layer to reduce the spatial dimension, improving the computational efficiency and the network's fitting capability. In OP-FCNN, we implement an optical lens demagnification system to perform the down-sampling. As shown in Fig. 3(a), a pair of lenses with a focal-length ratio of 2:1 realizes the down-sampling with stride = 2.

In the Decoder, we utilize the up-sampling layer to recover the low-resolution features to the high-resolution image. Likewise, we implement an optical lens magnification system to perform the up-sampling. As shown in Fig. 3(b), a pair of lenses with a focal-length ratio of 1:2 realizes the single-linear interpolation up-sampling. According to Fresnel diffraction and lens imaging, the output of a 4f system with different focal lengths has a coordinate transformation relationship with the input image:

$$S_o(x,y) \approx \frac{A^2}{-\lambda^2f_1f_2}S_i(-\frac{f_1}{f_2}x,-\frac{f_1}{f_2}y).$$
where $S_i(x,y)$ and $S_o(x,y)$ denote the input and output images of the demagnification/magnification lens 4f system. $\frac {A^2}{-\lambda ^2f_1f_2}$ is a coefficient that has little impact on the imaging result, and $f_1$ and $f_2$ denote the focal lengths of the two lenses. Hence, ignoring this coefficient, the output plane exhibits the flipped input image with a coordinate scaling of ${f_2}/{f_1}$. In the down-sampling layer, the lens focal lengths are set as ($f_1 = 100mm$, $f_2 = 50mm$); the coordinate-transformed output shows the flipped input image sampled at two-pixel intervals. In the up-sampling layer, the lens focal lengths are set as ($f_1 = 50mm$, $f_2 = 100mm$).
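As a rough digital counterpart of Eq. (5), the sketch below mimics the OP-Down-/Up-Sampling layers as a flip plus rescaling by $f_2/f_1$, ignoring the constant prefactor. The function name and the use of bilinear interpolation for the rescaling are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def lens_4f_resample(img, f1, f2):
    """Digital sketch of the demagnification/magnification 4f system (Eq. 5).

    img: (B, C, H, W) tensor. The 4f system outputs a flipped copy of the input
    rescaled by f2/f1; the prefactor A^2 / (-lambda^2 f1 f2) is ignored.
    """
    flipped = torch.flip(img, dims=(-2, -1))      # the 4f relay inverts the image
    return F.interpolate(flipped, scale_factor=f2 / f1,
                         mode="bilinear", align_corners=False)

# Example: one Encoder/Decoder round trip on a 16-channel feature map
x = torch.rand(1, 16, 128, 128)
down = lens_4f_resample(x, f1=100, f2=50)         # OP-Down-Sampling: 128x128 -> 64x64
up = lens_4f_resample(down, f1=50, f2=100)        # OP-Up-Sampling:   64x64  -> 128x128
```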

Fig. 2. Implementation of the optical convolutional layer. The amplitude-only SLM loads the input image. The phase-only SLM loads the convolutional kernel phase image. The sCMOS camera captures the convolution result.

Fig. 3. (a) Implementation of the optical demagnification lens 4f system. The lens focal lengths are set as ($f_1 = 100mm$, $f_2 = 50mm$). The OP-Down-Sampling layer is equivalent to the down-sampling in a digital CNN. (b) Implementation of the optical magnification lens 4f system. The lens focal lengths are set as ($f_1 = 50mm$, $f_2 = 100mm$). The OP-Up-Sampling layer is equivalent to the single-linear interpolation up-sampling in a digital CNN.

The nonlinear activation function is a crucial component of a CNN. In OP-FCNN, considering the constant nonnegative light intensity, we utilize the sCMOS camera's response curve to realize the nonlinear activation function. The sCMOS camera is a semiconductor sensor comprising many individual photosensitive elements; the incident light is converted into a digital signal and then assembled into an image by the image processor. The camera's response curve can control the light detection and readout image intensity by applying a linear shift. We implement the ReLU function after the optical convolutional layer and integrate it with normalization as the "BN-ReLU" [40], which is expressed as follows:

$$B_{o} = \left\{ \begin{array}{cc} 0 & B_{i}\le \mu_{B_{i}}\\ \gamma\frac{B_{i}-\mu_{B_{i}}}{\sqrt{\delta_{B_{i}}^2+\varepsilon}} + \beta & B_{i}>\mu_{B_{i}} \end{array} \right.$$
where $B_i$ denotes the detected light intensity (8-bit) of the sCMOS camera, and $B_o$ denotes the light intensity after BN-ReLU. $\mu _{B_{i}}$ and $\delta _{B_{i}}$ denote the mean value and standard deviation of the image. $\varepsilon$ is set to a minimal value ($1\times 10^{-6}$ here) to prevent a zero denominator. $\gamma$ and $\beta$ are learnable parameters updated at the learning rate; their initial values are set as $\gamma =1$, $\beta =0$. Moreover, we implement the Sigmoid function at the last layer to adjust the output image range to 0–1, which is suitable for the loss function computation.
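A minimal per-image sketch of the BN-ReLU of Eq. (6), assuming the statistics are computed over a single detected image; the function name and the scalar `gamma`/`beta` parameters are illustrative.

```python
import torch

def bn_relu(b_in, gamma=1.0, beta=0.0, eps=1e-6):
    """Sketch of the BN-ReLU nonlinearity (Eq. 6).

    b_in: detected light intensity image. Pixels at or below the image mean are
    set to zero; the rest are normalized by the image statistics and scaled by
    the learnable gamma and beta (initialized to 1 and 0).
    """
    mu = b_in.mean()
    std = b_in.std(unbiased=False)
    normalized = gamma * (b_in - mu) / torch.sqrt(std**2 + eps) + beta
    return torch.where(b_in > mu, normalized, torch.zeros_like(b_in))

# Example: apply the activation to a normalized 8-bit speckle intensity image
speckle = torch.randint(0, 256, (128, 128)).float() / 255.0
activated = bn_relu(speckle)
```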

2.2.3 OP-FCNN Encoder-Decoder architecture

The OP-FCNN is designed following the "U-Net" architecture; the U-type structure splits the network into the Encoder and Decoder modules. We demonstrate the OP-FCNN structure in Fig. 4(a). The Encoder extracts features from the speckle patterns, and the Decoder reconstructs the object images. Here, we implement the skip connection to combine the high-dimensional information in the deep layers with the low-dimensional information in the shallow layers, enhancing the OP-FCNN's feature extraction and reconstruction capability. We demonstrate the skip connection of OP-FCNN in Fig. 4(b). The optical convolution result of the Encoder contains $l$ channels of extracted feature information, and the convolutional layer output of the Decoder contains $k$ channels of feature information. We merge the two same-size features into new ($k+l$)-channel feature maps and then input them into the next optical convolutional layer.

Fig. 4. (a) The architecture of the OP-FCNN. OP-Conv denotes the optical convolutional layer. OP-Down-sampling and OP-Up-sampling denote the down-sampling and up-sampling based on optical 4f lens systems. OP-Skip connection denotes the optical skip connection in OP-FCNN. BN-ReLU denotes the nonlinear activation and normalization in OP-FCNN. Avg + Sigmoid denotes the normalization and Sigmoid activation operations. The OP-Down-Sampling layer is equivalent to the down-sampling in a digital CNN. The OP-Up-Sampling layer is equivalent to the single-linear interpolation up-sampling in a digital CNN. Every stage in OP-FCNN comprises one OP-Conv and one BN-ReLU operation. The result of the last optical convolutional layer is averaged and activated by the Sigmoid function before output. (b) The realization of the OP-Skip connection in stage 1 of OP-FCNN. The $l$-channel feature from the Encoder and the $k$-channel feature from the Decoder are combined through padding and loaded on the amplitude-only SLM as the ($l+k$)-channel input of the next optical convolutional layer.

The Encoder comprises four stages. Each stage contains two optical convolutional layers and the BN-ReLU activation function. The outputs of the convolutional layers are set to 16, 32, 64, and 128 channels. We also implement three down-sampling layers to decrease the feature sizes, which are 128 $\times$ 128, 64 $\times$ 64, 32 $\times$ 32, and 16 $\times$ 16 pixels.

Likewise, we implement four stages in the Decoder. Each stage utilizes the skip connection to combine the feature information from the Encoder. The optical convolutional layers in the Decoder transform the 192-, 96-, 48-, and 16-channel features into 64-, 32-, 16-, and 1-channel output images. As detailed in Section 2.2.2, we utilize three up-sampling layers to recover the feature size. The 128 $\times$ 128 pixel result of the last convolutional layer is averaged and activated by the Sigmoid function before output, as summarized in the sketch below.
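The channel layout of the Encoder/Decoder stages and the OP-Skip concatenation can be captured by the following bookkeeping sketch; the constant names and the helper `op_skip_connection` are ours, and the actual convolution, sampling, and activation steps are performed optically as described above.

```python
import torch

# Channel and size bookkeeping for the four Encoder and four Decoder stages.
ENCODER_CHANNELS = [16, 32, 64, 128]   # outputs of Encoder stages 1-4
DECODER_INPUTS   = [192, 96, 48, 16]   # Decoder inputs after the OP-Skip concatenation
DECODER_OUTPUTS  = [64, 32, 16, 1]     # outputs of Decoder stages 1-4
FEATURE_SIZES    = [128, 64, 32, 16]   # feature sizes (pixels) at each stage

def op_skip_connection(encoder_feat, decoder_feat):
    """Merge an l-channel Encoder feature with a same-size k-channel Decoder feature
    into the (l + k)-channel input of the next optical convolutional layer."""
    return torch.cat([encoder_feat, decoder_feat], dim=0)   # channel-first (C, H, W)
```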

2.3 Data acquisition and processing

To produce the speckle patterns for training and validation, we utilize the MNIST handwritten digits, EMNIST handwritten letters, fashion MNIST, and MIT-CBCL-face datasets as the loaded objects [41–44]. We shuffle and split the total of 71,000 images with a ratio of 8:2. We utilize the light scattering imaging system in Section 2.1 to produce speckle patterns from the above datasets. The sparse objects are first resized from the raw data to 512$\times$512 pixels and padded to 1080$\times$1920 to be loaded on the SLM. We preserve the central 512$\times$512 pixel region of the sCMOS camera as the speckle pattern to match the size of the object pattern. To reduce the number of network parameters and the training effort, we down-sample the speckle pattern from 512$\times$512 pixels to 128$\times$128 pixels by bilinear interpolation and up-sample the OP-FCNN reconstruction result from 128$\times$128 pixels to 512$\times$512 pixels by nearest-neighbor interpolation. The 8-bit grayscale speckle pattern is normalized from 0–255 to 0–1 before being input into the network. Moreover, we utilize the averaged cross-entropy as the training loss function, which is given by:

$$Loss = \frac{1}{N^2} \sum_{j=1}^{N}\sum_{i=1}^{N} -(y_{ij}log(x_{ij})+(1-y_{ij})log(1-x_{ij})).$$
where $y_{ij}$ denotes the pixel value of the ground truth label, and $x_{ij}$ denotes the predicted value. The loss is averaged over all $N\times N$ pixels of the image.
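A direct sketch of the loss in Eq. (7); it is equivalent to PyTorch's built-in binary cross-entropy averaged over pixels, and the clamping constant is our addition to avoid log(0).

```python
import torch

def averaged_cross_entropy(pred, target, eps=1e-7):
    """Averaged cross-entropy loss of Eq. (7).

    pred:   reconstruction after the Sigmoid layer, values in (0, 1).
    target: binary ground-truth object pattern, same shape as pred.
    """
    pred = pred.clamp(eps, 1.0 - eps)     # numerical guard against log(0)
    loss = -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))
    return loss.mean()                    # average over all N x N pixels
```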

The OP-FCNN training process is performed on a computer server with two graphics processing units (NVIDIA A6000) using PyTorch with Python 3.8. Our OP-FCNN is trained with a batch size of 5 for 200 epochs, taking up to 10 hours. The adaptive moment estimation (Adam) optimizer with a $2.5\times 10^{-4}$ initial learning rate is utilized for loss minimization and model convergence.
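A hedged outline of the training loop implied by the configuration above (Adam, initial learning rate 2.5e-4, batch size 5, 200 epochs); `model` and `train_loader` are placeholders for the OP-FCNN digital twin and the speckle/object dataset loader, and the loss is the `averaged_cross_entropy` sketch given earlier.

```python
import torch

def train(model, train_loader, epochs=200, lr=2.5e-4, device="cuda"):
    """Minimal training loop sketch for the OP-FCNN digital twin."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for speckle, target in train_loader:          # (B, 1, 128, 128) speckle/object pairs
            speckle, target = speckle.to(device), target.to(device)
            optimizer.zero_grad()
            loss = averaged_cross_entropy(model(speckle), target)
            loss.backward()
            optimizer.step()
```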

3. Results and discussion

Here, the four classic datasets and their speckle patterns are utilized to train the model for speckle reconstruction, validate the performance of the imaging system, and benchmark it: the MNIST handwritten digits, EMNIST handwritten letters, fashion MNIST, and MIT-CBCL-face datasets. The four datasets comprise 56,800 training images and 14,200 validation images. The input speckle patterns and reconstruction results are 128 $\times$ 128 pixels.

To demonstrate the imaging performance and model complexity of OP-FCNN, we use four scientific metrics to quantify the reconstruction performance explicitly. The JI evaluates the similarity and diversity between the reconstruction result and the ground truth. The PCC is essentially the normalized covariance, characterizing the degree of linear correlation between two images and varying from -1 to 1. The SSIM measures reconstruction image similarity in terms of intensity, contrast ratio, and structural similarity; its value varies from 0 to 1, where 1 indicates perfect similarity and 0 represents no similarity. The PSNR is based on the MSE (mean-square error) and quantifies the reconstruction quality on a decibel scale; higher values indicate better image quality. We also use the computational complexity and the number of network parameters to evaluate the time and space complexity of the model. The computational complexity normally refers to the number of floating-point operations, including addition, subtraction, multiplication, and division on floating-point numbers. The network parameters normally refer to the weights and biases of the model, including the learnable and updatable parameters in the training process and some constants in computation.
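For reference, minimal implementations of three of these metrics (JI, PCC, PSNR) are sketched below; the binarization threshold for the JI and the normalization range for the PSNR are our assumptions, and SSIM is omitted since it is usually taken from a standard image-quality library.

```python
import torch

def pcc(x, y):
    """Pearson correlation coefficient between reconstruction x and ground truth y."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / torch.sqrt((xc**2).sum() * (yc**2).sum())

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images normalized to [0, max_val]."""
    mse = ((x - y) ** 2).mean()
    return 10.0 * torch.log10(max_val**2 / mse)

def jaccard(x, y, threshold=0.5):
    """Jaccard index between the binarized reconstruction and the binary ground truth."""
    xb, yb = x > threshold, y > threshold
    return (xb & yb).float().sum() / (xb | yb).float().sum()
```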

We validated the OP-FCNN reconstruction performance with the four scientific metrics on the four datasets introduced previously and counted the model complexity. The average reconstruction performance on the four datasets achieves 0.84, 0.91, 0.79, and 16.3 dB for JI, PCC, SSIM, and PSNR, respectively. As a comparison, we validated the digital CNN reconstruction performance and evaluated the computational complexity and number of parameters of the network. The digital CNN is designed following the classical "U-net" structure, similar to the architecture of OP-FCNN. In addition, we align the network layers and kernel channels of the digital CNN with the OP-FCNN so that the numbers of parameters of both methods are approximately equal. The detailed structure of the digital CNN is given in Supplement 1. The digital CNN is trained on the same speckle data and validated on the same validation speckle data as the OP-FCNN. We present the reconstruction performance and model complexity of the two methods in Table 1. The OP-FCNN's reconstruction approaches that of the digital CNN, reaching approximately 98.5% of the digital CNN's performance on average with only 0.61% of its computational complexity.

Table 1. Reconstruction performance and model complexity evaluation for digital CNN and OP-FCNN on four datasets

The OP-FCNN has a great advantage in computational complexity over the digital CNN owing to its optical computing platform. In other words, layers with high computational complexity in the digital CNN, such as the convolutional blocks, take almost zero computational complexity in OP-FCNN. The high-complexity convolutional computation is performed at the speed of light with low energy consumption on the optical platform, which is the greatest advantage of OP-FCNN. The only operation that contributes to the computational complexity of OP-FCNN is BN-ReLU, which is realized in electronics. On the other hand, the reconstruction performance of OP-FCNN is slightly lower than that of the digital CNN, although it comes close. We believe that the Fourier-transform lens 4f systems might introduce multi-layer diffraction effects, and that the two-dimensional convolutional layer might not perform as well as the three-dimensional convolution in the digital CNN. However, the architecture of OP-FCNN is reconfigurable and scalable, which allows us to improve the reconstruction performance by increasing the layer depth and the number of kernel channels.

Here, we develop a stronger OP-FCNN by adding deeper convolutional layers and more convolutional kernel channels. The stronger OP-FCNN has a similar Encoder/Decoder structure and optical implementation to the OP-FCNN. The detailed structure of the stronger OP-FCNN is given in Supplement 1. The digital CNN, OP-FCNN, and stronger OP-FCNN use the same speckle data to train the models and validate the imaging performance on the MNIST handwritten digits, EMNIST handwritten letters, fashion MNIST, and MIT-CBCL-face datasets. The stronger OP-FCNN's average reconstruction performance on the four datasets achieved 0.87, 0.92, 0.86, and 17.53 dB for JI, PCC, SSIM, and PSNR, respectively. We visualize the imaging performance of the digital CNN, OP-FCNN, and stronger OP-FCNN via their JI scores on the four datasets in Fig. 5. The ground truths and reconstruction results are presented in sequence; the reconstructions are divided into true positive (white), false positive (green), and false negative (purple) regions. Moreover, we compare the reconstruction performance on the four scientific metrics and the model complexity of the digital CNN and the stronger OP-FCNN in Table 2. The stronger OP-FCNN's reconstruction performance comprehensively surpasses the digital CNN, while its computational complexity is only 1.5% of that of the digital CNN. The experimental results demonstrate the scalability of the OP-FCNN, which can adjust the network layers and add convolutional kernel channels to enhance model performance. Meanwhile, owing to the operations on the optical platform, the stronger OP-FCNN's computational complexity increases but remains far smaller than that of the digital CNN. On the other hand, the number of parameters of the stronger OP-FCNN is more than twice that of the OP-FCNN, since the stronger OP-FCNN has deeper Encoder/Decoder stages and more convolutional layers and kernel channels, which enhance the model's reconstruction capability.

Fig. 5. The ground truth and speckle reconstruction results for CNN, OP-FCNN, and stronger OP-FCNN on the MNIST handwritten digits, EMNIST handwritten letters, fashion MNIST, and MIT-CBCL-face datasets. The reconstruction results are overlaid with the true positive (white), false positive (green), and false negative (purple) regions. Credit is hereby given to the Massachusetts Institute of Technology and to the Center for Biological and Computational Learning for providing the database of facial images.

Table 2. Reconstruction performance and model complexity evaluation for digital CNN and stronger OP-FCNN on four datasets

In the comparative experiments above, we listed the reconstruction performance, computational complexity, and network parameters of three methods: the digital CNN, OP-FCNN, and stronger OP-FCNN. For the digital CNN, the "U-net" structure is a classical and typical method that is widely used for speckle reconstruction tasks, and many state-of-the-art methods have been proposed based on it [17,19]. However, the huge computational cost of the convolutional operations places great demands on computational resources, energy consumption, and hardware configuration. Here, using a similar structure and a similar number of network parameters, our proposed OP-FCNN reduces the computational complexity of the network by two orders of magnitude compared with the digital CNN. The significant advantage of OP-FCNN is the good balance between reconstruction performance and model complexity. Moreover, the OP-FCNN also demonstrates great scalability for improving model capability. Our stronger OP-FCNN proves that the reconstruction performance of OP-FCNN can be further improved by increasing the layer depth and kernel channels, even though the baseline performance already approaches that of the digital CNN. Furthermore, the reconfigurable and scalable structure of OP-FCNN can be adapted to realize higher performance and accommodate the varying complexity of datasets in complicated imaging scenarios.

Deep learning methods are widely applied in computational imaging and speckle reconstruction; hence, their large computational complexity and energy consumption are inevitable challenges. Compared with the digital CNN, OP-FCNN has the following advantages. First, the computational complexity is reduced by two orders of magnitude. Second, the reconfigurability and scalability of OP-FCNN allow the model to adjust its structure to accommodate complex datasets and to be extended for higher reconstruction performance; the stronger OP-FCNN demonstrates the feasibility and significant advantage of this performance enhancement. Our proposed OP-FCNN provides a new opto-electronic method for speckle reconstruction and boosts the development of all-optical systems for imaging through scattering media.

4. Conclusion

This paper has proposed an opto-electronic deep learning method for imaging through scattering media. We develop an optronic fully convolutional neural network, "OP-FCNN", for speckle reconstruction. We propose the "end-to-end" Encoder/Decoder optronic structure to extract the features of speckle patterns and reconstruct the sparse objects with opto-electronic techniques. The OP-FCNN implements nearly all computational operations in optics, significantly reducing the model's computational complexity. Moreover, we utilize four datasets to validate the imaging performance of OP-FCNN. The average imaging performance on the four datasets achieves 0.84, 0.91, 0.79, and 16.3 dB for JI, PCC, SSIM, and PSNR, respectively. Compared with the digital CNN, the OP-FCNN achieves comparable performance and reduces the computational complexity by two orders of magnitude. Meanwhile, the structure of OP-FCNN is scalable and reconfigurable, and can be adjusted to accommodate varying datasets and extended for higher performance. The proposed OP-FCNN demonstrates the feasibility and significant advantages of an opto-electronic method for imaging through scattering media. We envision that OP-FCNN could be extended to more complex imaging scenarios and implemented in all-optical systems for imaging through scattering media in future work.

Funding

National Key Research and Development Program of China (2021YFA0715400).

Acknowledgments

Credit is hereby given to the Massachusetts Institute of Technology and to the Center for Biological and Computational Learning for providing the database of facial images.

Disclosures

The authors declare no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. L. V. Wang and H.-i. Wu, Biomedical optics: principles and imaging, (John Wiley & Sons, 2012).

2. M. Gu, X. Gan, and X. Deng, “Microscopic imaging through turbid media,” Springer 5, 201 (2015). [CrossRef]  

3. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6(8), 921–943 (2019). [CrossRef]  

4. G. Yao and L. V. Wang, “Theoretical and experimental studies of ultrasound-modulated optical tomography in biological tissue,” Appl. Opt. 39(4), 659–664 (2000). [CrossRef]  

5. G. Satat, M. Tancik, and R. Raskar, “Towards photography through realistic fog,” in 2018 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2018), pp. 1–10.

6. A. P. Mosk, A. Lagendijk, G. Lerosey, et al., “Controlling waves in space and time for imaging and focusing in complex media,” Nat. Photonics 6(5), 283–292 (2012). [CrossRef]  

7. P. Pai, J. Bosch, M. Kühmayer, et al., “Scattering invariant modes of light in complex media,” Nat. Photonics 15(6), 431–434 (2021). [CrossRef]  

8. I. M. Vellekoop and A. Mosk, “Focusing coherent light through opaque strongly scattering media,” Opt. Lett. 32(16), 2309–2311 (2007). [CrossRef]  

9. X. Wei, Y. Shen, J. C. Jing, et al., “Real-time frequency-encoded spatiotemporal focusing through scattering media using a programmable 2d ultrafine optical frequency comb,” Sci. Adv. 6(8), eaay1192 (2020). [CrossRef]  

10. G. Huang, D. Wu, J. Luo, et al., “Retrieving the optical transmission matrix of a multimode fiber using the extended kalman filter,” Opt. Express 28(7), 9487–9500 (2020). [CrossRef]  

11. Z. Yaqoob, D. Psaltis, M. S. Feld, et al., “Optical phase conjugation for turbidity suppression in biological samples,” Nat. Photonics 2(2), 110–115 (2008). [CrossRef]  

12. J. Bertolotti, E. G. Van Putten, C. Blum, et al., “Non-invasive imaging through opaque scattering layers,” Nature 491(7423), 232–234 (2012). [CrossRef]  

13. G. Osnabrugge, R. Horstmeyer, I. N. Papadopoulos, et al., “Generalized optical memory effect,” Optica 4(8), 886–892 (2017). [CrossRef]  

14. L. Li, Q. Li, S. Sun, et al., “Imaging through scattering layers exceeding memory effect range with spatial-correlation-achieved point-spread-function,” Opt. Lett. 43(8), 1670–1673 (2018). [CrossRef]  

15. M. Lyu, H. Wang, G. Li, et al., “Exploit imaging through opaque wall via deep learning,” arXiv, arXiv.1708.07881 (2017). [CrossRef]  

16. S. Li, M. Deng, J. Lee, et al., “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018). [CrossRef]  

17. Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media,” Optica 5(10), 1181–1190 (2018). [CrossRef]  

18. E. Guo, S. Zhu, Y. Sun, et al., “Learning-based method to reconstruct complex targets through scattering medium beyond the memory effect,” Opt. Express 28(2), 2433–2446 (2020). [CrossRef]  

19. Y. Wang, Z. Lin, H. Wang, et al., “High-generalization deep sparse pattern reconstruction: feature extraction of speckles using self-attention armed convolutional neural networks,” Opt. Express 29(22), 35702–35711 (2021). [CrossRef]  

20. X. Hu, J. Zhao, J. E. Antonio-Lopez, et al., “Adaptive inverse mapping: a model-free semi-supervised learning approach towards robust imaging through dynamic scattering media,” Opt. Express 31(9), 14343–14357 (2023). [CrossRef]  

21. G. Wetzstein, A. Ozcan, S. Gigan, et al., “Inference in artificial intelligence with deep optics and photonics,” Nature 588(7836), 39–47 (2020). [CrossRef]  

22. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, arXiv:1409.1556 (2014). [CrossRef]  

23. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems 25, (2012).

24. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

25. N. H. Farhat, D. Psaltis, A. Prata, et al., “Optical implementation of the hopfield model,” Appl. Opt. 24(10), 1469–1475 (1985). [CrossRef]  

26. T. Lu, S. Wu, X. Xu, et al., “Two-dimensional programmable optical neural network,” Appl. Opt. 28(22), 4908–4913 (1989). [CrossRef]  

27. I. Saxena and E. Fiesler, “Adaptive multilayer optical neural network with optical thresholding,” Opt. Eng. 34(8), 2435–2440 (1995). [CrossRef]  

28. A. E. Willner, S. Khaleghi, M. R. Chitgarha, et al., “All-optical signal processing,” J. Lightwave Technol. 32(4), 660–680 (2013). [CrossRef]  

29. X. Lin, Y. Rivenson, N. T. Yardimci, et al., “All-optical machine learning using diffractive deep neural networks,” Science 361(6406), 1004–1008 (2018). [CrossRef]  

30. T. Yan, J. Wu, T. Zhou, et al., “Fourier-space diffractive deep neural network,” Phys. Rev. Lett. 123(2), 023901 (2019). [CrossRef]  

31. H. Dou, Y. Deng, T. Yan, et al., “Residual D2NN: training diffractive deep neural networks via learnable light shortcuts,” Opt. Lett. 45(10), 2688–2691 (2020). [CrossRef]  

32. C. Xu, X. Sui, J. Liu, et al., “Transformer in optronic neural networks for image classification,” Opt. Laser Technol. 165, 109627 (2023). [CrossRef]  

33. S. Xu, J. Wang, R. Wang, et al., “High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays,” Opt. Express 27(14), 19778–19787 (2019). [CrossRef]  

34. J. Chang, V. Sitzmann, X. Dun, et al., “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324 (2018). [CrossRef]  

35. S. Colburn, Y. Chu, E. Shilzerman, et al., “Optical frontend for a convolutional neural network,” Appl. Opt. 58(12), 3179–3186 (2019). [CrossRef]  

36. Z. Gu, Y. Gao, and X. Liu, “Optronic convolutional neural networks of multi-layers with different functions executed in optics for image classification,” Opt. Express 29(4), 5877–5889 (2021). [CrossRef]  

37. Z. Gu, Y. Gao, and X. Liu, “Position-robust optronic convolutional neural networks dealing with images position variation,” Opt. Commun. 505, 127505 (2022). [CrossRef]  

38. Z. Gu, Z. Huang, Y. Gao, et al., “Training optronic convolutional neural networks on an optical system through backpropagation algorithms,” Opt. Express 30(11), 19416–19440 (2022). [CrossRef]  

39. Z. Huang, Z. Gu, and Y. Gao, “Image classification through scattering media using optronic convolutional neural networks,” in Conference on Infrared, Millimeter, Terahertz Waves and Applications (IMT2022), vol. 12565 (SPIE, 2023), pp. 757–761.

40. X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the fourteenth international conference on artificial intelligence and statistics, (JMLR Workshop and Conference Proceedings, 2011), pp. 315–323.

41. L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Process. Mag. 29(6), 141–142 (2012). [CrossRef]  

42. G. Cohen, S. Afshar, J. Tapson, et al., “Emnist: Extending mnist to handwritten letters,” in 2017 international joint conference on neural networks (IJCNN), (IEEE, 2017), pp. 2921–2926.

43. B. Weyrauch, B. Heisele, J. Huang, et al., “Component-based face recognition with 3d morphable models,” in 2004 Conference on Computer Vision and Pattern Recognition Workshop, (IEEE, 2004), p. 85.

44. H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv, arXiv:1708.07747 (2017). [CrossRef]  

Supplementary Material (1)

Supplement 1: Detailed structure of the OP-FCNN, CNN, and the stronger OP-FCNN



