Deep learning-based image reconstruction for photonic integrated interferometric imaging

Ziran Zhang; Haoying Li; Guomian Lv; Hao Zhou; Huajun Feng; Zhihai Xu; Qi Li; Tingting Jiang; Yueting Chen

doi:10.1364/OE.469582

1. Introduction

Due to diffraction limits, high-resolution imaging of distant targets has always been a challenge. Optical telescopes, such as the Hubble and the James Webb Telescope, improve resolution by increasing the aperture. However, these systems have disadvantages, such as large resource consumption, complicated lens processing, and exact mechanical adjustment. Su et al. [1,2] reported a demonstration experiment of photonic integrated interferometric imaging (PIII) in 2018. Compared to conventional optical systems, the size, weight and power of the PIII system are greatly reduced [3] at the same optical resolution by introducing photonic integrated circuits (PICs). Two-beam interference occurs in PICs, avoiding building a sophisticated interference system. Despite the poor imaging quality, Su’s method [1,2,4] provided a brand-new idea for inexpensive long-distance high-resolution imaging. Low sampling rates and noise are the two leading causes of poor image quality. Several researchers have proposed new structural designs to increase the sampling rate [5–8]. However, the complicated structure will make it challenging to design and build PICs. Furthermore, noise disturbance also makes the complicated structure unable to obtain high-quality images. To improve image quality and reduce cost, better image reconstruction algorithms become crucial.

PIII is an emerging technology, and image reconstruction algorithms are still in the infancy. The latest algorithm, simplified revised entropy method [9], was proposed to suppress ringing and artifacts. In astronomy, there are also some classical methods, such as maximum entropy methods [10] and CLEAN algorithms [11]. Unfortunately, all of these methods rely on traditional optimization algorithms. They are more suitable for working with simple images. It is difficult for them to produce satisfactory results when the image details are rich, and the noise level is high. In recent years, deep learning has made great progress in image reconstruction. For instance, deep learning has been applied to solve the problem of noise, artifcats, and low-sampling rate image reconstruction, such as single-pixel imaging [12–14]. Deep learning also performs well for image quality degradation in optical imaging, such as aberration correction [15,16], lensless imaging [17,18], and hyperspectral imaging [19,20]. For image restoration tasks in computer vision, deep learning methods have maintained state-of-the-art performance for several years [21–24]. For frequency domain signal processing, there is a task called Magnetic Resonance Imaging (MRI) image reconstruction. The images reconstructed from complex frequency values are enhanced by networks learned spatial to spatial [25], frequency to frequency [26], or iteratively in one or two domains [27]. Inspired by these studies, we attempted to conduct experiments based on deep learning. However, existing methods usually consider spatial domain images or the real and imaginary parts of complex frequency values as inputs. For photonic integrated interference imaging, the data acquired by the PIII system are amplitude and phase in the frequency domain. Amplitude and phase are directly affected by noise. Using network learning with real and imaginary parts will destroy the original noise distribution and even accumulate and amplify the noise. Therefore, a suitable network is required to take the amplitude and phase of complex frequency values as input, as current methods cannot meet the demand.

This paper proposes a learning-based method to solve the low sampling rate and noise disturbance problem and a dataset generation method based on rotation sampling to support training and inference. To summarize, our main contributions are of three folds. First, we propose a method for building amplitude and phase datasets based on imaging principles. Although datasets were obtained by simulation, it is the first frequency-domain dataset for image restoration. In data preprocessing, we propose amplitude and phase normalization methods. We use normalized amplitude and phase to train networks for the first time, effectively solving the degradation caused by the low sampling rate and noise. Second, we propose spatial-frequency dual-domain fusion networks (SFDF-Nets). SFDF-Nets reconstruct images by fusing multi-frame data obtained by rotating sampling, which indirectly increases the sampling rate. Experiments show that frequency-domain learning can achieve high image structure similarity (SSIM) [28], while spatial-domain learning can achieve a high image peak-to-noise ratio (PSNR) [28]. Through dual-domain supervised learning and image fusion, both SSIM and PSNR of images are significantly improved, and noise, ringing, and artifacts are effectively suppressed. Third, we propose the inverse Fourier transform loss(IFFT loss). IFFT loss can assist the network training in the frequency domain and improve image quality. Our work opens a new dimension in interferometric imaging and sheds much light on computer vision [22–24,29] and medical imaging [30,31].

2. Photonic integrated interferometric imaging (PIII)

2.1 Imaging principle of the PIII system

PIII is developed from the interferometry technique [32,33]. It uses far-field spatial coherence measurements to extract intensity information from a source to form an image [34]. Multiple baselines are formed simultaneously by a lenslet array. PICs are used to control beam combinations and generate interference signals. The intensity of the interference signal $I_{12}$ can be described by Eq. (1).

(1)$$I_{12}=I_{1}+I_{2}+2\sqrt{I_{1}I_{2}}A\cos[\frac{2\pi}{\lambda}(\vec{L}\cdot\vec{B}+x_{1}-x_{2})] ,$$

where $\vec {L}$ is a unit vector from the interferometer to the light source. $\vec {B}$ is a baseline vector between a pair of lenses. $x_{1}-x_{2}$ is the optical path difference in PIC. $A$ is the amplitude of complex coherence coefficient. When the target is far away, the intensity of the two light sources behind a pair of lenses is approximately equal $(I_{1}\approx I_{2})$, and the visibility of the interference signal is approximately equal to the amplitude of the complex coherence coefficient $(V\approx A)$. The phase $\alpha$ of the visibility is the phase in the cosine intensity function. Visibility $V$ and phase $\alpha$ can be obtained using quadrature modulation detection techniques [9]. The complex coherence coefficient $\gamma _{12}$ is calculated by $\gamma _{12}=Ve^{i\alpha }$. By changing the length and direction of vector $\vec {B}$, we can sample many complex coherence coefficients. After mapping them into a 2D matrix, the complex coherence coefficient matrix ($\gamma$) is obtained. According to van Cittert-Zernike’s theorem [35], the interferometric fringes measure the Fourier transform of the brightness distribution for a distributed source [36]. By measuring the complex visibility of various baselines, we can inverse Fourier transform the complex coherence coefficients to obtain the source brightness distributions [36]. The image reconstruction can be described by Eq. (2).

(2)$$\bar{I}= \mathscr{F}^{{-}1}(\gamma) ,$$

where $\gamma$ represents the complex coherence coefficient matrix. By stretching the contrast of the normalized light source $\bar {I}$ , an image of the far-field light source $I$ can be generated. An ideal imaging model doesn’t care about noise. But when it comes to application, noise can not be ignored. Since we use interference signals to obtain amplitudes and phases of complex coherence coefficients, noise will directly affect the frequency domain of the image. Taking the noise into account, the imaging model can be described by Eq. (3) .

(3)$$\begin{cases} \gamma \propto \mathscr{F}(I) \\ \gamma_{H}=H \gamma \\V=\left | \gamma_{H} \right |+n_{a} \\\alpha=\arg (\gamma_{H})+n_{p} \end{cases},$$

where $H$ represents the sampling matrix and it is sparse. $\gamma \propto \mathscr {F}(I)$ represents the Van Cittert-Zernike theorem. $n_{a}$ represents visibility noise or amplitude noise. $n_{p}$ represents phase noise. Figure 1 describes the schematic of image degradation and image reconstruction for photonic integrated interferometric imaging. The complex coherence coefficient of the distant target can be regarded as the Fourier transform of the distant target. Photonic integrated interferometric imaging system samples the amplitude and phase of complex coherence coefficients. The sampling process is disturbed by noise and the sampling matrix $H$ is sparse. Therefore, we need to remove the degradation caused by frequency domain noise and sparse sampling. Ideally, we can reconstruct the image directly by Eq. (2). But the sparse-view reconstruction is an ill-posed problem. It is difficult to solve Eq. (3) using a matrix inversion technique. Furthermore, the noise is nonlinear. Once the complex coherence coefficient is calculated, it is difficult to separate the noise from the real signal. Fortunately, many studies have revealed that deep learning performs well for the sparse-view reconstruction problem [12–14,24–27,30]. In this context, deep learning-based methods are worth investigating.

Fig. 1. Schematic of image degradation and image reconstruction for PIII.

Download Full Size | PDF

2.2 Factors restraining the imaging quality of PIII systems

2.2.1 Low sample rate

There are two reasons for the low sampling rate. The first is crosstalk in PICs. The second is energy constraints. PIII system samples in the frequency domain. A pair of lenses form a baseline, and a baseline can only sample at one point in the Fourier domain at a wavelength. The sampling rate increases as more wavelengths are used through arrayed waveguide gratings. However, the energy per wavelength decreases. Weak energy makes the detection and processing of interference signals difficult. The lens density can be increased to get a higher sampling rate by reducing the lens size, but the energy collected by a single lens decreases. Therefore, the energy problem determines that the sampling rate cannot be improved unless there is a high-precision detector. In addition, PICs also lead to low sampling rates. When the waveguides are tightly arranged in PICs, crosstalk between waveguides will affect two-beam interference [37]. To reduce the crosstalk, it is not suitable to use dense sampling at present. Overall, a low sampling rate is a major constraint on image quality, and it is difficult to improve the sampling rate by improving hardware.

2.2.2 Noise disturbance

The PIII system suffers from noise disturbance during imaging. Many factors may contribute to noise. The first is detector noise. Detector noise includes thermal noise, 1/f noise, and system noise(the on-chip amplifier, the external hardware, and the gain circuitry) [36]. Small lenses, coupling losses, and chip losses lead to weak interference signals. Weak signals can cause detector noise to become noticeable due to amplification and gain. In addition, other random factors that affect imaging quality include stray light, crosstalk in PICs, unstable coupling efficiency, and coherence length of the source. These factors are more suitable to be solved from the hardware. For example, stray light can be addressed by designing cylindrical apertures. Increase the distance between waveguides in PICs to reduce crosstalk. Design a more stable coupling mode between lens array and PICs. Add the phase shifter in the PICs to ensure that the two beams are within the coherence length. As this paper focuses mainly on detector noise, we did not take into account a lot of other random factors.

3. Deep learning-based image reconstruction

3.1 Dataset establishment

3.1.1 Signal simulation

We propose a deep learning-based approach to deal with the sample rate and noise issues. We first build the dataset. According to the imaging principle, we obtain the signal through simulation. We choose Su’s structure [2] as the imaging device because it is easier to fabricate than other structures. Figure 2(a) shows the lenslet array of Su’s structure. There are 37 interference arms arranged radially and evenly at different angles. There is a PIC behind each interference arm. Each PIC contains 12 pairs of interference baselines with a maximum baseline length of 20.88mm and a minimum baseline length of 0.72mm. The imaging wavelength ranges from 1065 nm to 1550 nm and is divided into 18 spectral channels by the array waveguide grating in PICs. Figure 2(b) is the signal distribution we simulate based on Su’s structure [2]. Sample points of different wavelengths can be distinguished by color, where red represents a sample point of 1550 nm, and blue represents a sample point of 1065 nm. The center of the sampling matrix $H$ is empty, which does not matter. For the normalized complex coherence coefficients, the value of the central zero frequency is the constant one. Therefore, we do not need to sample the central zero frequency additionally.

Fig. 2. Lenslet array and sampling matrix. (a) Lenslet array (b) Sampling mask.

Download Full Size | PDF

We apply a two-dimensional discrete Fourier transform to grayscale remote sensing images based on the Van Sit-Zernike theorem to obtain a complex coherence coefficient matrix $\gamma$. The two-dimensional discrete Fourier transform can be described by Eq. (4), and the two-dimensional discrete inverse Fourier transform is described by Eq. (5).

(4)$$\gamma(u,v)=\sum_{x=0}^{h-1} \sum_{y=0}^{w-1} I(x,y)e^{{-}j2\pi(\frac{u\times x}{h}+\frac{v\times y}{w} ) } ,$$

(5)$$I(x,y)= \frac{1}{h\times w}\sum_{x=0}^{h-1}\sum_{y=0}^{w-1}\gamma(u,v)e^{j2\pi(\frac{u\times x}{h}+\frac{v\times y}{w})} ,$$

where $(x,y)$ denotes the coordinate of an image pixel in the spatial domain. $I(x,y)$ is the pixel value. The image size is $h\times w$. $(u,v)$ represents the coordinate of a spatial frequency on the frequency spectrum. $\gamma (u,v)$ is the complex frequency value. The interference signal provides visibility and phase information, which correspond to the amplitude and phase of the complex coherence coefficient. We compute the amplitude and phase of complex frequency values to simulate the visibility and phase extracted from the interference signal. Concretely, the amplitude $A(\cdot )$ and the phase $P(\cdot )$ of a given frame are calculated by Eq. (6) and Eq. (7).

(6)$$A(\gamma)=\sqrt{\textrm{Re}(\gamma)^2+\textrm{Im}(\gamma)^2} ,$$

(7)$$P(\gamma) = \arctan (\frac {\textrm{Im}(\gamma )}{\textrm{Re}(\gamma )}) .$$

Simulations and calculations are implemented using Python’s NumPy library. In the NumPy library, the value of the phase $P$ is in the range $(-\pi, \pi ]$, which provides a reference for the following data processing.

3.1.2 Multi-frame sampling by rotation

To solve the problem of low sample rate, we adopt the idea of multi-frame sampling by a rotating device commonly used in interference imaging. Rotational sampling involves registration between frames, which is not covered in detail in our study since the signals are simulated. [38,39] provide information on rotation sampling. As shown in Fig. 2(a), Su’s structure contains 37 one-dimensional interferometry, which we call interference arms. The angle between the two interferometric arms is 9.73$^{\circ }$. We divide 360$^{\circ }$ into 111 parts, each of which is approximately 3.24$^{\circ }$. The system rotates around the center point, 3.24$^{\circ }$ each time. We take three consecutive frames as a group, and the sample rate can be increased by 2.78 times after excluding the overlapping part. Figure 3 illustrates the method of generating multi-frame datasets by rotational sampling. We use the sampling matrix $H$ to sample the amplitude and phase of the complex coherence coefficient matrix $\gamma$. The sampling matrix $H$ is shown in Fig. 2(b). The sample result is not operated with fftshift because pre-experiments show better training results without fftshift. We sample three frames consecutively and concatenate them into six channels. Gaussian noise [40] is added to the amplitude and phase during sampling. Finally, we put the dirty data together with the clean data to form a dirty-clean data pair. Plenty of dirty-clean data pairs make up our dataset.

Fig. 3. Multi-frame dataset generation method using rotational sampling. The data is not operated with fftshift.

Download Full Size | PDF

3.1.3 Dataset details

In non-ideal imaging scenarios, both visibility and phase measurements are affected by noise. Amplitude and phase are normalized to $\left [0,1 \right ]$ to keep the data suitable for learning and the noise to the same standard. Gaussian noise [40] $n_{a}$ and $n_{p}$ are added to the normalized amplitude and phase, respectively. Amplitude noise $n_{a}$ is modeled as a standard normal distribution $n_{a}\sim N(0,\sigma _{a}^2)$, where the mean is zero and the standard deviation is $\sigma _{a}$. Similarly, the phase noise $n_{p}$ is represented as $n_{p}\sim N(0,\sigma _{p}^2)$. For normalized amplitude, low-frequency values are much larger than high-frequency values, which can differ 10,000 times in a 256${\times }$256 image. Thus, we take the normalized amplitude to the power of 0.1. As a result, small values can be made larger. This operation is reversible as long as taking the values to the power of 10. The normal method always takes the logarithm. In this task, taking logarithms produces negative numbers and breaks normalization. Therefore, it is appropriate to take the normalized amplitude to the power of 0.1. This way, most of the values will fall within the range of 0.3-0.8, making them more suitable for network training. Equations (8) and (9) represent the normalization methods described above.

(8)$$A_{normal} =\left \{ \frac{A(\gamma )}{\textrm{max}[\textrm{A}(\gamma )]} +n_{a} \right \} ^{0.1} ,$$

(9)$$P_{normal}= \frac{P(\gamma) \bmod 2\pi }{2\pi} + n_{p} .$$

where $A_{normal}$ is the normalized amplitude and $P_{normal}$ is the normalized phase. To verify the property of our proposed method, we generated datasets with different noise levels. Table 1 shows the details of our datasets. $\sigma _{a}$ for the amplitude noise $n_a$ were set to 0, 0.005 and 0.05. Similarly, $\sigma _{p}$ for the phase noise $n_{p}$ were set to 0, 0.005 and 0.05. We combined different levels of amplitude noise and phase noise to obtain nine datasets named A0P0, A0P1, A0P2, A1P0, A1P1, A1P2, A2P0, A2P1, A2P2. The DotaV1 dataset [41] was used for signal acquisition. We converted each image in the DotaV1 dataset to grayscale and cropped nine images of 256${\times }$256 size at equal intervals. The variance of the image was used as a criterion to determine whether or not it is rich in detail. During cropping, images with poor details were discarded. The split ratio of the training set, validation set, and test set was 8:1:1. For each dataset, the training set has 10913 images, the validation set has 1365 images, and the test set has 1365 images. All of the data were saved as NumPy arrays with float32 precision.

Table 1. Datasets with different noise levels.

View Table | View all tables in this article

3.2 Spatial-frequency dual-domain fusion networks (SFDF-Nets)

3.2.1 Network architecture

Figure 4 presents the architecture of our SFDF-Nets. The input is low sample rate frequency domain data with noise. Through dual-domain supervised learning and frequency domain fusion, SFDF-Nets can generate high-quality spatial images. SFDF-Nets mainly include three parts: F-model, S-model, and Fusion-model. In addition, it also contains three IFFT Blocks and one FFT Block. Among them, F-model is used to restore images in the frequency domain. The data is fed into F-model and the output is amplitude and phase. F-model effectively reduces noise through supervised learning from the frequency domain to the frequency domain and fuses the six-channel data into two-channel data. The data is also fed to S-model through IFFT-Block1. As shown in Fig. 5(c), IFFT-Block denormalizes the amplitude and phase, and converts them to the spatial domain by inverse Fourier transform. The image must be contrast stretched after the inverse Fourier transform because the frequency domain data values are normalized. In experiments, we found that the values of the edges of images reconstructed with the learned amplitudes are tremendous after network training. Contrast stretching directly can leave the image with a low dynamic range. For IFFT Block2, it deals with the learned amplitude and phase, so the edges are cropped by three pixels, and contrast stretch is used to get a better image. For IFFT Block1, it deals with unlearned amplitude and phase, so it is directly contrast-stretched to get the image. The output of IFFT-Block1 is three grayscale images. S-model fuses these three images and generates a grayscale image. Moreover, S-model can denoise in the spatial domain and restores details. The images generated by the S-model are converted to normalized amplitude and phase by FFT-Block1. As shown in Fig. 5(b), FFT Block1 Fourier transforms the input grayscale image and normalizes the amplitude and phase. In the next step, the Fusion-model fuses the outputs of S-model and F-model in the frequency domain. IFFT-Block2 converts the output of Fusion-model into one-channel data, which is a grayscale image reconstructed by SFDF-Nets.

Fig. 4. The architecture of SFDF-Nets.

Download Full Size | PDF

Fig. 5. Components of SFDFNet. (a) Unet++L4 (b) FFT Block (c) IFFT Block.

Download Full Size | PDF

S-model, F-model, and Fusion-model use the Unet++L4 architecture [42]. As shown in Fig. 5, Unet++L4 adopts dense connections to form a topological structure. Unet++L4 can capture features at four levels and integrate them through feature stacking. It uses a 3${\times }$3 convolutional layer to combine the output of four different positions at the output part of Unet++L4. These positions have four, six, eight, and ten of convolution layers. The more convolutional layers, the larger the receptive field. Features at different levels have different sensitivities to frequency domain information. Features with large receptive fields can easily recover global information, such as symmetry in the frequency domain. However, local information may be lost after downsampling again and again. In the Fourier domain, local information has a global impact on the spatial image. With dense connections and feature stacking, local information is easier to preserve. We adopt the structure of Unet++. In the following experiments, we will prove that Unet++ can produce good results for both spatial domain image restoration and frequency domain signal restoration.

4. Experiments and results

4.1 Training details

We propose a new type of loss called inverse fast Fourier transform loss (IFFT loss) for frequency domain learning. We use Eq. (10) and Eq. (11) to describe L1 loss [43] and IFFT loss, respectively. Similar to L1 loss, IFFT loss is achieved by taking the absolute difference between the predicted and real values. The difference is that IFFT loss requires a fast inverse Fourier transform for both the predicted and true values. Since the predicted and true values are normalized in the range 0 to 1, the output value of the inverse Fourier transform is very small. Therefore, the final result needs to be multiplied by a scale factor $K^{2}$. We set the value of $K$ to 1000 for training 256${\times }$256 data.

(10)$$L_{1}(y^{gt} ,y^{pred})=\frac{1}{w\times h}\sum_{i=1}^{w\times h}\left | y^{gt}_{i} -y^{pred}_{i}\right | ,$$

(11)$$L_{IFFT}(f^{gt}, f^{pred})=\frac{K^{2}}{w\times h}\sum_{i=1}^{w\times h}\left | IFFT(f^{gt})_{i}-IFFT(f^{pred})_{i}\right | ,$$

where $y^{gt}$, $f^{gt}$, $y^{pred}$, and $f^{pred}$ represent the ground truth in the spatial domain, the ground truth in the frequency domain, the predicted value in the spatial domain, and the predicted value in the frequency domain, respectively. The data size is $h\times w$. $y^{gt}_{i}$, $y^{pred}_{i}$, $IFFT(f^{gt})_{i}$, $IFFT(f^{pred})_{i}$ represent the pixel value.

In SFDF-Nets, S-model, F-model, and Fusion-model were trained separately. The implementation of SFDF-Nets was based on the public platform Pytorch and an NVIDIA RTX A6000 GPU with 48GB memory. Before the training started, we used He methed [44] to initialize the parameters of all the convolution kernels. The Adam optimization algorithm was used to optimize the network with $\beta _{1}=0.5$ and $\beta _{2}=0.999$. We first trained S-model and F-model, and then trained the Fusion-model. Each model was trained for 300 epochs, and the batch size was set to 4. In the training process, we used the step learning rate policy. The initial learning rate was $2^{-4}$ and dropped to $2^{-5}$ after 200 epochs. Different loss functions were used for training different models, as shown in Eq. (12). For S-model, all 300 epochs were trained with L1 loss. For F-model and Fusion-model, the first 200 epochs were trained with L1 loss. After 200 epochs, the learning rate was reduced ,and IFFT loss was introduced, then $L_{F-model}$ loss and $L_{Fusion-model}$ loss were used for training. IFFT loss was prone to crash training at a large learning rate. We needed to wait for the L1 loss training to stabilize, then reduce the learning rate and add IFFT loss to continue training. We use Eq. (12) to represent the loss function used to train the network.

(12)$$\begin{cases} L_{S-model}(y^{gt},y^{pred}) =L_{1}(y^{gt},y^{pred}) \\L_{F-model}(f^{gt}, f^{pred}) =L_{1}(f^{gt},f^{pred})+\lambda_{1} L_{IFFT}(f^{gt}, f^{pred}) \\L_{Fusion-model}(f^{gt}, f^{pred}) = L_{1}(f^{gt},f^{pred})+\lambda_{2} L_{IFFT}(f^{gt}, f^{pred}) \end{cases},$$

where $L_{S-model}$ is used to train S-model, $L_{F-model}$ is used to train F-model, and $L_{Fusion-model}$ is used to train Fusion-model. $\lambda _{1}$ and $\lambda _{2}$ represent the weight of the loss function, and $\lambda _{1} = \lambda _{2} = 0.01$ in this work.

Figure 6 is the PSNR curve and SSIM curve during training. Among them, the orange, blue, cyan, and purple curves are the results of F-model, S-model, Frequency-domain fusion-model, and Spatial-domain fusion-model on the validation set of A1P2, respectively. We compare the results of S-model and F-model. The comparison results show that S-model can get higher PSNR, but F-model can get higher SSIM. S-model can easily find continuity between adjacent pixels and suppress noise in spatial-to-spatial supervised learning. However, the actual noise is in the frequency domain, and S-model causes the image to be over-smoothed. Therefore, the output of S-model has a higher PSNR at the expense of the SSIM. For supervised learning from frequency to frequency, F-model can learn symmetry and suppress noise in the frequency domain. Therefore, the output of F-model has a higher SSIM. The disadvantage is that frequency domain learning cannot recognize the continuity between spatial pixels. To solve this problem, we propose IFFT loss. As shown in Fig. 6, the PSNR and SSIM of the reconstructed image will be significantly improved after adding IFFT loss. This result also illustrates the effectiveness of the IFFT loss. Since frequency and spatial domain learning have unique strengths, we fuse them using a fusion model to obtain higher-quality images. It can also be seen from the evaluation results that the frequency domain fusion results have higher SSIM and PSNR than the spatial domain fusion results. Therefore, we finally chose frequency domain fusion as a part of our method.

Fig. 6. PSNR and SSIM values among different models during training.

Download Full Size | PDF

4.2 Experiments on different models and datasets

Figure 7 shows the reconstructed image generated by SFDFNets. The dirty image is the inverse Fourier transform of the dirty data obtained by fusing three frames into one frame. The step of fusion is realized by increasing the interference arms of the sampling matrix $H$ from 37 to 111. Even if three frames are stacked into one frame, there still will be many empty areas. In Fig. 7, we show the image reconstruction result of the S-model, F-model, and Fusion-model, and their normalized amplitudes and phases in the frequency domain. These models were trained on the training set of A1P2 and tested on the test set of A1P2. Figure 7 is one of the test results. In Fig. 7, the same color boxes denote the parts where the contrast between the two results is more obvious. We can see that our models can recover the amplitude effectively. Compared to F-model, we found that the S-model can recover the low-frequency region well. However, there are many dark areas in the high-frequency region, which means that the value here is small. In the real amplitude, these black areas have larger values. F-model does better in the high-frequency region. The middle part of the F-model is closer to the ground truth and does not have many black areas. It is surprising that the amplitude generated by the Fusion-model is almost the same as the ground. Our method can also recover the empty phase. As seen in Fig. 7, S-model can completely restore the phase. However, the recovered phase is not all the same as the ground truth. F-model recovers well in the low-frequency part, but the high-frequency part is incomplete. The Fusion-model is able to generate images that are closest to the ground truth, and the phase will be more complete and closer to the ground truth.

Fig. 7. Image reconstruction results of SFDF-Nets in spatial and frequency domains.

Download Full Size | PDF

The purpose of image reconstruction is to generate a high-quality spatial image. From the reconstruction results in the spatial domain, the S-model’s results have less noise but lack details. The image appears unrealistic due to being too smooth. The results of the F-model seem to have much noise, but the details are richer than S-model. Fusion-model produces the best results, suppressing noise while preserving detail. To make the conclusion more convincing, we test our SFDF-Nets with more images and use PSNR and SSIM as metrics for quantitative evaluation. The evaluation results are shown in Tab.2. According to Tab.2, S-model achieves higher PSNR than F-model, while F-model achieves higher SSIM than S-model. On the DOTAv1 dataset, the PSNR of S-model is 2.88dB higher than F-model, and 4.81dB higher than dirty image. For SSIM, F-model is 0.02 higher than S-model and 0.12 higher than the dirty image. Furthermore, we also do additional tests with the UC Merced Land Use Dataset [45] under the same noise level as a test set to verify the robustness of our method. Our method also improves PSNR and SSIM by 4.14 dB and 0.12 on the UC Merced Land Use Dataset. The experimental results also give us an important conclusion. Frequency domain learning is easy to recover the details of images, while spatial domain learning is easier to smooth images. In Tab2, it is more notable that Fusion-model achieves the highest PSNR and SSIM. Fusion-model not only combines the advantages of F-model and S-model, but also surpasses them. This result illustrates the complementarity of F-model and S-model, which provides the possibility of improvement for Fusion-model.

Table 2. Image quality evaluation results for different models and different datasets.

View Table | View all tables in this article

The training and inference time of different models are shown in Tab.3. Training and inference were implemented on an Ubuntu system with 128GB RAM, NVIDIA RTX A6000 GPU, AMD Ryzen Threadripper 3960X CPU. Each model was trained for 300 epochs, and Tab.3 lists the time consumption. In total, SFDF-Nets were trained for 71 hours and 11 minutes. The test set has a total of 1365 images. Inferring on the test set took 29.44 seconds for the F-model and 29.57 seconds for the S-model. FFT Block1 and IFFT Block1 were used in S-model inference, while IFFT Block2 was used in F-model inference. The inference of S-model and F-model was implemented in parallel. During SFDF-Net inference, the IFFT Block2 behind the Fusion-model was included, but the IFFT Block2 behind the F-model was not included. The inference time of SFDF-Nets was 48.11 seconds. On average, SFDF-Nets only took 35 milliseconds to reconstruct an image.

Table 3. Training time and inference time for different models.

View Table | View all tables in this article

4.3 Experiments on different noise levels

We also tested the performance of SFDFNets under different noise levels. Nine datasets were tested, including A0P0, A0P1, A0P2, A1P0, A1P1, A1P2, A2P0, A2P1, and A2P2. We trained SFDF-Nets for 300 epochs using the training set of A1P2 with our training strategy. Before testing with each dataset, SFDFNets were fine-tuned for 30 epochs based on the pretrained weights of A1P2. During fine-tuning, the parameters remained the same as the 300th epoch. The only thing that changed was the dataset. The test set of each dataset was used for testing. Figure 8 shows the image reconstruction results under different noise levels. Each sub-image of Fig. 8(a) is reconstructed by inverse Fourier transform. Even if the data is noise-free ($\sigma _a$=0 , $\sigma _p$=0), the result of the inverse Fourier transforms still has ringing and artifacts. Artifacts and ringing are caused by sampling and can also be affected by noise. We can see that the artifacts and ringing become more evident as the phase noise increases. The image degradation by phase noise is relatively mild when $\sigma _p$ is 0-0.05. In contrast, amplitude noise appears very strong. The quality of the image is degraded severely as the amplitude noise increases. When $\sigma _a$=0.05, the images are almost covered by noise, and it is not easy to see any details. Figure 8(b) contains the image reconstruction results by SFDF-Nets. Compared with dirty images, the image reconstruction results of SFDF-Nets have been improved significantly. First, the ringing and artifacts are suppressed effectively. At low noise levels, the reconstruction results of SFDF-Net appear cleaner. Periodic ringing and artifacts are entirely invisible in the blank areas. Second, the noise is well removed. When the $\sigma _{a}$ of the amplitude noise reaches 0.005, the imaging quality has degraded obviously, but the reconstruction results of SFDF-Nets are less affected. Third, the visual quality is improved. When $\sigma _{a}$=0.05, vehicles cannot be identified from dirty images but can be identified easily from the results of SFDF-Nets. Furthermore, the image quality evaluation results of SFDF-Nets are much higher than the inverse Fourier transform at any noise level. We use Tab.4 to show evaluation results. We can see that SFDFNets can improve image quality on all datasets. The average PSNR of all datasets is improved by 5.64dB. The average SSIM of all datasets is improved by 0.20. The evaluation results can illustrate the effectiveness and robustness of our method to frequency-domain noise and low sampling rates.

Fig. 8. Image reconstruction results under different noise levels.(a) Image reconstruction result by inverse Fourier transform.(b) Image reconstruction results by SFDF-Nets

Download Full Size | PDF

Table 4. Image quality evaluation under different noise levels.

View Table | View all tables in this article

5. Conclusion

This paper proposed a deep learning-based image reconstruction method for PIII. We discussed two factors that contribute to the degradation of image quality. One was low sampling rates, and the other was noise disturbances. To address these two issues, we introduced a deep learning-based method. Our method consists of three parts: 1) the approach of dataset establishment; 2) the architecture of SFDF-Nets; 3) the IFFT loss. We built the dataset based on the imaging principle to support the training and inference of SFDF-Nets. Training convolutional neural networks on frequency-domain data have always been a difficult task. We used normalized amplitude and phase to train networks for the first time, effectively solving the problem that convolutional neural networks cannot train complex frequency-domain data. We proposed SFDF-Nets to reconstruct images in spatial and frequency domains and enhance image quality through fusion. For better training in the frequency domain, we proposed IFFT loss. The experimental results showed that our method could effectively solve the degradation caused by the low sampling rate and noise, such as ringing and artifacts. The reconstructed image quality was significantly improved. To the best of our knowledge, we are the first to implement a deep learning model in photonic integrated interferometric imaging. Moreover, our work makes a breakthrough in frequency domain signal reconstruction, which is necessary for developing interferometric imaging.

Funding

Key Research Project of Zhejiang Lab (No.2021MH0AC01); National Natural Science Foundation of China (62275229); Civil Aerospace Pre-Research Project (D040104).

Acknowledgments

We thank Bian meijuan from the facility platform of optical engineering of Zhejiang University for instrument support.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [41,45].

References

1. T. Su, R. P. Scott, C. Ogden, S. T. Thurman, R. L. Kendrick, A. Duncan, R. Yu, and S. J. B. Yoo, “Experimental demonstration of interferometric imaging using photonic integrated circuits,” Opt. Express 25(11), 12653–12665 (2017). [CrossRef]

2. T. Su, G. Liu, K. E. Badham, S. T. Thurman, R. L. Kendrick, A. Duncan, D. Wuchenich, C. Ogden, G. Chriqui, S. Feng, J. Chun, W. Lai, and S. J. B. Yoo, “Interferometric imaging using Si₃N₄ photonic integrated circuits for a spider imager,” Opt. Express 26(10), 12801–12812 (2018). [CrossRef]

3. R. L. Kendrick, A. Duncan, C. Ogden, J. Wilm, D. M. Stubbs, S. T. Thurman, T. Su, R. P. Scott, and S. Yoo, “Flat-panel space-based space surveillance sensor,” in Advanced Maui Optical and Space Surveillance Technologies (AMOS) Conference, (2013).

4. S. T. Thurman, R. L. Kendrick, A. Duncan, D. Wuchenich, and C. Ogden, “System design for a spider imager,” in Frontiers in Optics 2015, (Optica Publishing Group, 2015), p. FM3E.3.

5. W. Gao, X. Wang, L. Ma, Y. Yuan, and D. Guo, “Quantitative analysis of segmented planar imaging quality based on hierarchical multistage sampling lens array,” Opt. Express 27(6), 7955–7967 (2019). [CrossRef]

6. W. Gao, Y. Yuan, X. Wang, L. Ma, Z. Zhao, and H. Yuan, “Quantitative analysis and optimization design of the segmented planar integrated optical imaging system based on an inhomogeneous multistage sampling lens array,” Opt. Express 29(8), 11869–11884 (2021). [CrossRef]

7. C. Ding, X. Zhang, X. Liu, H. Meng, and M. Xu, “Structure design and image reconstruction of hexagonal-array photonics integrated interference imaging system,” IEEE Access 8, 139396–139403 (2020). [CrossRef]

8. Q. Yu, B. Ge, Y. Li, Y. Yue, F. Chen, and S. Sun, “System design for a “checkerboard” imager,” Appl. Opt. 57(35), 10218–10223 (2018). [CrossRef]

9. T. Chen, X. Zeng, Z. Zhang, F. Zhang, Y. Bai, and X. Zhang, “Rem: A simplified revised entropy image reconstruction for photonics integrated interference imaging system,” Opt. Commun. 501, 127341 (2021). [CrossRef]

10. J. Ables, “Maximum entropy spectral analysis,” Astronomy and Astrophysics Supplement Series 15, 383 (1974).

11. J. Högbom, “Aperture synthesis with a non-regular distribution of interferometer baselines,” Astronomy and Astrophysics Supplement Series 15, 417 (1974).

12. B. Xu, H. Jiang, H. Zhao, X. Li, and S. Zhu, “Projector-defocusing rectification for fourier single-pixel imaging,” Opt. Express 26(4), 5005–5017 (2018). [CrossRef]

13. S. Rizvi, J. Cao, K. Zhang, and Q. Hao, “Deringing and denoising in extremely under-sampled fourier single pixel imaging,” Opt. Express 28(5), 7360–7374 (2020). [CrossRef]

14. C. F. Higham, R. Murray-Smith, M. J. Padgett, and M. P. Edgar, “Deep learning for real-time single-pixel video,” Sci. Rep. 8(1), 2369 (2018). [CrossRef]

15. T. Lin, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Non-blind optical degradation correction via frequency self-adaptive and finetune tactics,” Opt. Express 30(13), 23485–23498 (2022). [CrossRef]

16. S. Chen, H. Feng, D. Pan, Z. Xu, Q. Li, and Y. Chen, “Optical aberrations correction in postprocessing using imaging simulation,” ACM Trans. Graph. 40, 1–15 (2021).

17. H. Zhou, H. Feng, Z. Hu, Z. Xu, Q. Li, and Y. Chen, “Lensless cameras using a mask based on almost perfect sequence through deep learning,” Opt. Express 28(20), 30248–30262 (2020). [CrossRef]

18. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]

19. H. Xu, H. Hu, S. Chen, Z. Xu, Q. Li, T. Jiang, and Y. Chen, “Hyperspectral image reconstruction based on the fusion of diffracted rotation blurred and clear images,” Opt. Lasers Eng. 160, 107274 (2023). [CrossRef]

20. L. Huang, R. Luo, X. Liu, and X. Hao, “Spectral imaging with deep learning,” Light: Sci. Appl. 11(1), 61 (2022). [CrossRef]

21. L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” arXiv preprint arXiv:2204.04676 (2022).

22. H. Li, Y. Yang, M. Chang, S. Chen, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Srdiff: Single image super-resolution with diffusion probabilistic models,” Neurocomputing 479, 47–59 (2022). [CrossRef]

23. D. Zhang, F. Huang, S. Liu, X. Wang, and Z. Jin, “SwinFIR: Revisiting the SWINIR with fast Fourier convolution and improved training for image super-resolution,” arXiv preprint arXiv:2208.11247 (2022).

24. K. Zhang, W. Ren, W. Luo, W. Lai, B. Stenger, M. Yang, and H. Li, “Deep image deblurring: a survey,” Int. J. Comput. Vis. 130(9), 2103–2130 (2022). [CrossRef]

25. K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. on Image Process. 26(9), 4509–4522 (2017). [CrossRef]

26. B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature 555(7697), 487–492 (2018). [CrossRef]

27. Y. Yang, J. Sun, H. Li, and Z. Xu, “Admm-net: a deep learning approach for compressive sensing MRI,” arXiv preprint arXiv:1705.06869 (2017).

28. U. Sara, M. Akter, and M. S. Uddin, “Image quality assessment through FSIM, SSIM, MSE and PSNR-A comparative study,” Journal of Computer and Communications 07(03), 8–18 (2019). [CrossRef]

29. L. Jiang, B. Dai, W. Wu, and C. C. Loy, “Focal frequency loss for image reconstruction and synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 13919–13929.

30. W. Wu, D. Hu, C. Niu, H. Yu, V. Vardhanabhuti, and G. Wang, “Drone: dual-domain residual-based optimization network for sparse-view CT reconstruction,” IEEE Trans. Med. Imaging 40(11), 3002–3014 (2021). [CrossRef]

31. Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light: Sci. Appl. 7(2), 17141 (2018). [CrossRef]

32. A. Labeyrie, “Interference fringes obtained on vega with two optical telescopes,” The Astrophys. J. 196, L71–75 (1975). [CrossRef]

33. M. Benisty, J.-P. Berger, L. Jocou, P. Labeye, F. Malbet, K. Perraut, and P. Kern, “An integrated optics beam combiner for the second generation vlti instruments,” Astron. & Astrophys. 498(2), 601–613 (2009). [CrossRef]

34. T. Pearson and A. Readhead, “Image formation by self-calibration in radio astronomy,” Annu. Rev. Astron. Astrophys. 22(1), 97–130 (1984). [CrossRef]

35. J. W. Goodman, Statistical Optics (John Wiley & Sons, 2015).

36. T. Su, Photonic Integrated Circuits for Compact High Resolution Imaging and High Capacity Communication Utility (University of California, Davis, 2017).

37. N. Cvetojevic, B. R. Norris, S. Gross, N. Jovanovic, A. Arriola, S. Lacour, T. Kotani, J. S. Lawrence, M. J. Withford, and P. Tuthill, “Building hybridized 28-baseline pupil-remapping photonic interferometers for future high-resolution imaging,” Appl. Opt. 60(19), D33–D42 (2021). [CrossRef]

38. M. Werth, D. Gerwe, S. Griffin, B. Calef, and P. Idell, “Ground-based optical imaging of geo satellites with a rotating structure in a sparse aperture array,” in 2019 IEEE Aerospace Conference, (2019), pp. 1–11.

39. O. Guyon and F. Roddier, “Aperture rotation synthesis: optimization of the (u, v)-plane coverage for a rotating phased array of telescopes,” Publ. Astron. Soc. Pac. 113(779), 98–104 (2001). [CrossRef]

40. Z. J. DeSantis, Image Reconstruction for Interferometric Imaging of Geosynchronous Satellites (University of Rochester, 2017).

41. G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 3974–3983.

42. Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: a nested U-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, (Springer, 2018), pp. 3–11.

43. H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Comput. Imaging 3(1), 47–57 (2016). [CrossRef]

44. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2015), pp. 1026–1034.

45. Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, (2010), pp. 270–279.

Datasets	DOTAv1		UC Merced Land Use
Metrics	PSNR	SSIM	PSNR	SSIM
Dirty image	16.6906	0.2857	13.9774	0.3147
S-model	21.5049	0.3838	16.5904	0.3625
F-model	18.6244	0.4008	16.5351	0.3916
Fusion-model	22.4465	0.4621	18.1205	0.4462
Difference	5.7559	0.1764	4.1431	0.1315

	F-model	S-model	Fusion-model	SFDF-Nets
Training time (300 epochs)	14h 38min	14h 42min	41h 51min	71h 11min
Inference time (1365 images)	29.44s	29.57s	/	48.11s

	Dirty image		Reconstructed image		Diference
Metrics	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
A0P0	19.6851	0.5794	24.8765	0.6625	5.1914	0.0831
A0P1	19.6796	0.5789	24.9375	0.6619	5.2579	0.0830
A0P2	19.1085	0.5332	23.4979	0.5729	4.3894	0.0397
A1P0	16.7552	0.2993	23.4053	0.5281	6.6501	0.2288
A1P1	16.7728	0.2992	23.4341	0.5273	6.6613	0.2281
A1P2	16.6906	0.2857	22.4465	0.4621	5.7559	0.1764
A2P0	12.5181	0.0670	18.3674	0.4087	5.8493	0.3417
A2P1	12.5124	0.0668	18.3569	0.4068	5.8445	0.3400
A2P2	12.5921	0.0656	17.7802	0.3516	5.1881	0.2860
Average	16.2572	0.3083	21.9003	0.5091	5.6431	0.2008

Datasets	DOTAv1		UC Merced Land Use
Metrics	PSNR	SSIM	PSNR	SSIM
Dirty image	16.6906	0.2857	13.9774	0.3147
S-model	21.5049	0.3838	16.5904	0.3625
F-model	18.6244	0.4008	16.5351	0.3916
Fusion-model	22.4465	0.4621	18.1205	0.4462
Difference	5.7559	0.1764	4.1431	0.1315

	F-model	S-model	Fusion-model	SFDF-Nets
Training time (300 epochs)	14h 38min	14h 42min	41h 51min	71h 11min
Inference time (1365 images)	29.44s	29.57s	/	48.11s

Deep learning-based image reconstruction for photonic integrated interferometric imaging

Abstract

1. Introduction

2. Photonic integrated interferometric imaging (PIII)

2.1 Imaging principle of the PIII system

2.2 Factors restraining the imaging quality of PIII systems

2.2.1 Low sample rate

2.2.2 Noise disturbance

3. Deep learning-based image reconstruction

3.1 Dataset establishment

3.1.1 Signal simulation

3.1.2 Multi-frame sampling by rotation

3.1.3 Dataset details

3.2 Spatial-frequency dual-domain fusion networks (SFDF-Nets)

3.2.1 Network architecture

4. Experiments and results

4.1 Training details

4.2 Experiments on different models and datasets

4.3 Experiments on different noise levels

5. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (8)

Tables (4)

Equations (12)

Optics Express

Dataset	A0P0	A0P1	A0P2	A1P0	A1P1	A1P2	A2P0	A2P1	A2P2
$σ_{a}$	0	0	0	0.05	0.005	0.005	0.05	0.05	0.05
$σ_{p}$	0	0.005	0.05	0	0.005	0.05	0	0.005	0.05