ResNet-based image inpainting method for enhancing the imaging speed of single molecule localization microscopy

Zhiwei Zhou; Weibing Kuang; Zhengxia Wang; Zhen-Li Huang

doi:10.1364/OE.467574

1. Introduction

Single molecule localization microscopy (SMLM), including but not limited to photoactivation localization microscopy (PALM) [1] and stochastic optical reconstruction microscopy (STORM) [2], is one of the most attractive super-resolution fluorescence microscopy techniques. SMLM usually reconstructs a final super-resolution image of 20∼30 nm in spatial resolution from a localization table containing information from millions of precisely localized fluorescence molecules. Because thousands of raw images are required to build such a localization table and the exposure time of each raw image is usually 10∼30 ms, SMLM usually suffers from a slow imaging speed (typically a few minutes). At this imaging speed, the imaging quality is likely to be affected by sample drift [3], the imaging throughput is low [4], and the application field of this technique is restricted.

There are mainly three approaches to improve the imaging speed of SMLM: fast image acquisition, high-density localization, and image inpainting. The first approach reduces the exposure time of each raw image, without changing the number of raw image frames required for reconstructing a super-resolution image. For example, Jones et al finished the collection of all raw images within seconds by increasing the laser power (∼15 kW/cm²) and the camera frame rate (500 frames per second, fps) [5]. Note that this fast image acquisition approach should work with fast blinking fluorescent dyes. Compared to traditional SMLM with 10 ms exposure time and sparsely excited molecules, this approach achieves about a five-fold improvement in the imaging speed [5]. However, as this approach requires a higher excitation power and a shorter exposure time, the localization precision and resolution are compromised, and photobleaching should be carefully managed [6].

The second approach, high-density localization, refers to the acquisition of fewer raw image frames by increasing the molecular density per frame, without reducing the total number of fluorescent molecules required for reconstructing a super-resolution image. For example, Holden et al reconstructed a super-resolution image using ∼2000 raw image frames, and achieved a speed gain of about ten times by using a significantly higher molecule density (up to 10 mol/µm²) than that in conventional SMLM (< 1 mol/µm²), without changing other parameters such as laser power (∼10 kW/cm²) and exposure time (∼10 ms) [7]. The key to the success of this approach is the development of effective multi-emitter fitting algorithms, for example, DAOSTORM [7] and 3B [8]. We note that these traditional multi-emitter fitting algorithms can process raw images at low and medium density (< 7.5 mol/µm², as determined at half-maximum recall rate) [7]. However, when the molecule density is further increased, the recall rate of these multi-emitter fitting algorithms decreases rapidly. In addition, the image processing speed of these traditional multi-emitter fitting algorithms is relatively slow (typically several hours for a field-of-view (FOV) of 50 µm × 50 µm) [7]. In the past several years, to solve the low precision and slow speed issues in these high-density localization approaches, deep learning has been introduced to develop new algorithms, for example, Deep-STORM [9]. Compared with the traditional multi-emitter fitting algorithms, the deep learning-based algorithms can achieve a ∼2x improvement in localization precision for densely overlapping molecules, and can increase the image processing speed by up to 3 orders of magnitude. Interestingly, Huang et al [10] combined the first approach (fast image acquisition) with the second approach (high-density localization) to achieve video-rate SMLM. Of course, this combination inherits the drawbacks of both approaches, and thus requires sophisticated experimental controls.

The third approach, image inpainting, has been reported only in the past several years. This approach is a post-processing technique, where an under-sampled super-resolution image is first generated using a conventional SMLM experiment, and then a proper image inpainting process is applied to repair this super-resolution image. In a paper published in 2017 [11], Wang et al used low-density images to construct a mask image of the observed structure, and obtained an optimized super-resolved image after several rounds of iteration. Note that the sparsity in the cosine transform domain should be guaranteed in the mask image. This idea was also used to develop an image inpainting method for 3D SMLM [12]. It is worth noting that this image inpainting method can achieve a speed gain of 50∼100 times. However, this method is time-consuming (∼2.5 h for an FOV of 50 µm × 50 µm), and is only effective for linear structures. In 2018, Wei et al developed a well-known image inpainting method for SMLM called ANNA-PALM [13], where a convolutional neural network (CNN) is used to map a low-density image to a high-density image. Compared to traditional SMLM with sparse molecule excitation, ANNA-PALM enables a speedup of ∼26 times on simulated data. Interestingly, the acceleration methods based on image inpainting can theoretically be combined with the first two approaches to further increase the imaging speed of SMLM.

Combining image inpainting with deep learning is a very effective strategy for accelerating the imaging speed of SMLM. Traditionally, deep learning-based image inpainting methods have been developed to repair natural images with varying degrees of defects [14]. In 2012, Xie et al introduced CNNs to the field of image inpainting, and provided a new image inpainting scheme for complex patterns (like superimposed text) [15]. In 2016, based on the generative adversarial network (GAN), Pathak et al proposed Context Encoders [16], which introduced an adversarial loss to make the network predictions more realistic. In 2020, Yi et al provided a two-stage network model, and recommended using residual structures in the generator to complete the inpainting of high-resolution images [17]. Note that all of the above methods aimed at more accurate image inpainting for higher-resolution images [14]. In the field of optical imaging, some useful methods based on image inpainting have also been introduced. For example, based on image inpainting, Ma et al demonstrated a precise phase aberration compensation method [18]. However, the image inpainting task in SMLM is slightly different from that in natural images. Specifically, the traditional image inpainting task uses input images with large missing areas, while SMLM deals with under-sampled images that have no large-scale missing content in the input images.

Based on the above considerations, we focus on solving the problem of local missing image content caused by image under-sampling. This kind of missing content is intrinsically related to the problem of single-frame image super-resolution (SISR), because an under-sampled image becomes equivalent to a low-resolution image with a high sampling rate once the resolution is sufficiently reduced. Therefore, deep learning networks suitable for SISR are inherently more suitable for the image inpainting task in SMLM. Currently, a large number of models have been developed to enable SISR, such as VDSR [19] and Pix2Pix [20]. These models have been used to develop two types of image inpainting methods in SMLM. The first type is based on CNN, and a good example is VDSR [19,21]. The architecture of VDSR is simple and easy to train, and the repaired image usually has a high structural similarity index. However, the repaired image often lacks high-frequency details, and perceptually fails to meet the expectation in super-resolution fidelity [22]. The second type is based on the generative adversarial network (GAN), and a good example is ANNA-PALM [13]. Due to the strong image inpainting capability of GAN, ANNA-PALM shows better image inpainting effects in both acceleration capability and high-frequency structure fidelity, as compared with VDSR. Of course, like other GANs, ANNA-PALM also suffers from unstable network training, and is prone to mode collapse. Moreover, it is worth noting that, as reported in the original ANNA-PALM paper [13], there are many image artifacts in the repaired images generated by ANNA-PALM; for example, filaments are incorrectly joined or split. Therefore, it is important to establish an image inpainting method that can significantly reduce image artifacts without compromising the acceleration capability.

We further analyze the possible reasons for the image artifacts in ANNA-PALM. We notice that ANNA-PALM performs the image inpainting task using the pix2pix model [20]. The generator of pix2pix is U-Net, and the original purpose of U-Net is to solve the image segmentation problem in large-scale structures [23]. U-Net increases the network receptive field by multiple down-sampling steps, and thus provides better performance in classifying individual pixels after combining the global information. However, Ghodrati et al [24] and Lee et al [25] reported that U-Net is not good enough in medical image synthesis tasks, probably because of the multiple down-sampling steps that enlarge the receptive field. In this case, local information is lost, although important information can still be passed through the skip connections in U-Net.

In this paper, we used ResNet to replace U-Net and proposed a new image inpainting method, called DI-STORM (a Deep convolutional neural network for image Inpainting in STORM), for enhancing the imaging speed of SMLM. We expected that the ResNet generator would present a better image inpainting quality than the U-Net generator, because ResNet has a strong capability in extracting local semantic features when a small convolution kernel is used [25]. We evaluated the performance of DI-STORM using both simulated and experimental datasets, and found that DI-STORM achieves a speed gain of 55 times (simulation) and 9.05 times (experiment) over conventional SMLM, which are much better than those of ANNA-PALM (17.2 times in simulation and 5.02 times in experiment, as compared to conventional SMLM) and VDSR (13.8 times in simulation and 4.52 times in experiment, as compared to conventional SMLM). More importantly, the super-resolution images from DI-STORM are closer to the ground truth images and exhibit fewer image artifacts than those from ANNA-PALM and VDSR.

2. Methods

2.1 Reconstruction method

Artificial neural network. In conventional SMLM, N raw image frames containing randomly excited molecules are first acquired, and then a proper localization algorithm is used to process each raw image to obtain the corresponding localization image containing the position information of the excited molecules. Finally, a super-resolution image, also called a high-density (HD) image, is reconstructed by stacking a large number of localization images (N frames), as shown by the blue arrows and the blue dashed box in Fig. 1(a). In this paper, an image inpainting method based on deep learning is adopted to learn the mapping from an under-sampled super-resolution image, also called a low-density (LD) image, which is constructed using a small number of localization images (n frames, n<<N), to an HD image. Therefore, after image inpainting, a high-quality super-resolution image can be reconstructed from fewer raw images than in conventional SMLM.

Fig. 1. Schematic comparison of the proposed DI-STORM with the conventional SMLM method. (a) Image reconstruction processes of DI-STORM and the conventional SMLM method. (b) Structure of the generator (G). The generator is constructed based on ResNet, which consists of 1 input block (Conv + BN + ReLU), 2 down blocks, 1 residual network that contains 9 residual blocks (ResBlock), 2 up blocks, and 1 output block (Conv + ReLU). Conv, BN and ReLU stand for convolutional layer, batch normalization and rectified linear unit. (c) The detailed structure of ResBlock in (b). The convolutions of input/output blocks correspond to 7 × 7 filters. All other convolutions correspond to 3 × 3 filters.

At present, ANNA-PALM, which is built on conditional GAN, is considered the best deep learning-based image inpainting method for SMLM. However, the repaired images from ANNA-PALM still have many artifacts. Inspired by the work of Lee et al [25], we introduced a modified residual network to improve the generator model of ANNA-PALM, and proposed the DI-STORM method to reconstruct super-resolution images, as shown in Fig. 1(b). We aimed to reduce image artifacts and improve the imaging speed of SMLM. The modified residual network consists of 3 components: input and output blocks, down and up blocks, and a residual network.

1) Input block and output block. The kernel size of the convolutional filters is 7 × 7. These filters are first used to generate feature maps. Then, batch normalization (BN) in the input block is used to alleviate the internal covariate shift. Finally, element-wise rectified linear units (ReLU) are added for nonlinearity.
2) Down blocks and up blocks. There are no pooling layers in this component. Instead, convolution (Conv) and transposed convolution (ConvT) with stride 2 are used for down-sampling and up-sampling, respectively. Both convolutional layers and transposed convolutional layers use 3 × 3 kernels. Additionally, batch normalization and ReLU are added.
3) Residual network. Residual connections were first used by Lee et al [25] to train very deep convolutional neural networks. In most cases, the residual network becomes the basic network because it is easier to optimize. The body of the residual network contains 9 residual blocks (ResBlock) of Conv + BN + ReLU, each with a shortcut connection. Among the 9 ResBlocks, all convolutional filters have 3 × 3 kernels.
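
The spatial sizes implied by the three components above can be traced with a short Python sketch. It is not the authors' code: the padding values (3 for the 7 × 7 convolutions, 1 for the 3 × 3 convolutions) and the output padding of 1 in the transposed convolutions are assumptions chosen so that a 512 × 512 input returns to 512 × 512, and `trace_generator` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a standard convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_generator(size=512, n_res_blocks=9):
    """Return (stage name, spatial size) pairs for the generator stages."""
    stages = []
    # Input block: 7 x 7 convolution, stride 1 (padding 3 is an assumption).
    size = conv_out(size, kernel=7, stride=1, padding=3)
    stages.append(("input block", size))
    # Two down blocks: 3 x 3 convolutions with stride 2, no pooling.
    for i in range(2):
        size = conv_out(size, kernel=3, stride=2, padding=1)
        stages.append((f"down block {i + 1}", size))
    # Nine residual blocks: 3 x 3 convolutions that keep the spatial size.
    for _ in range(n_res_blocks):
        size = conv_out(size, kernel=3, stride=1, padding=1)
    stages.append(("residual network", size))
    # Two up blocks: 3 x 3 transposed convolutions with stride 2.
    for i in range(2):
        size = conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1)
        stages.append((f"up block {i + 1}", size))
    # Output block: 7 x 7 convolution, stride 1.
    size = conv_out(size, kernel=7, stride=1, padding=3)
    stages.append(("output block", size))
    return stages


if __name__ == "__main__":
    for name, side in trace_generator(512):
        print(f"{name}: {side} x {side}")
```

Under these assumptions the residual blocks operate at one quarter of the input side length, so the network down-samples only twice before the residual body.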

The architecture of the discriminator is directly inherited from PatchGAN.

Training objectives. A series of image pairs (l_i,h_i) are given as training data. Among them, l_i represents the LD image of the i-th frame, and h_i represents the corresponding HD image. DI-STORM is designed to fit conditional distributions from LD images to HD images. The optimization problem is expressed as follows:

$$(\mathcal{G}^\ast, \mathcal{D}^\ast) = \arg \min_G \left( \max_D \mathcal{L}_{cGAN}(G,D) + \lambda \mathcal{L}_{\mathcal{L}1}(G) \right) \tag{1}$$

When the objective function of the generator drops to a minimum, $\mathcal{G}^\ast$ and $\mathcal{D}^\ast$ represent the optimized parameters of the generative and discriminative models, respectively. At this time, the discriminator loss $\mathcal{L}_{cGAN}(G,D)$ reaches the maximum. In this case, the discriminative network (D) has the best capability to distinguish between real and fake images, and the generative network (G) also has the best image inpainting capability.

The objective of the conditional GAN consists of two terms. The first term is $\mathcal{L}_{cGAN}(G,D)$, also called the discriminator loss, which is used to identify the authenticity of the repaired image. $\mathcal{L}_{cGAN}(G,D)$ is:

$$\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{(l,h) \sim P_{data}(l,h)}[\log(D(l,h))] + \mathbb{E}_{l \sim P_{data}(l)}[\log(1 - D(l,G(l)))] \tag{2}$$

where l stands for the LD image and h stands for the HD image. $\mathbb{E}$ is the expectation. $P_{data}(l,h)$ is the joint probability density of LD images (l) and HD images (h). The second term is $\mathcal{L}_{\mathcal{L}1}(G)$, which measures the consistency between the HD image and the repaired image. In Eq. (1), $\mathcal{L}_{\mathcal{L}1}(G)$ is used to ensure data consistency. The hyper-parameter λ controls the weight of $\mathcal{L}_{\mathcal{L}1}(G)$ and is set to 100. $\mathcal{L}_{\mathcal{L}1}(G)$ is defined as:

$$\mathcal{L}_{\mathcal{L}1}(G) = \mathbb{E}_{(l,h) \sim P_{data}(l,h)}[\|h - G(l)\|_1] \tag{3}$$

$\|\cdot\|_1$ is the L1 distance, also called the Manhattan distance.

2.2 Training

The proposed DI-STORM is an update on ANNA-PALM, as shown in Fig. 2. We replace the U-Net generator of ANNA-PALM with a ResNet generator. The overall architecture of DI-STORM contains a generator (G, located in the bottom left corner of Fig. 2) and a loss term (located in the right part of Fig. 2). The generator G is responsible for learning the mapping from LD images to HD images. The loss term contains two parts. The first part is a discriminator loss (see Eq. (2)), called Loss1. The second part is the L1 distance (see Eq. (3)), called Loss2, a loss function commonly used in image super-resolution tasks.

Fig. 2. The architecture of DI-STORM. The low-density image (LD) is reconstructed using a small number of raw image frames (n frames, typically 200-2000 frames). The high-density image (HD) is reconstructed using a large number of raw image frames (N frames, typically 30000 frames). G means the generative network. The input of G is an LD image. A repaired image (R) is inferred as the output of G. The loss function of D is equal to MSE1 + MSE2. MSE1 is the mean square error (MSE) between the output and 1, when realLR is the input of D. MSE2 is the MSE between the output and 0, when fakeLR is the input of D. The loss function of G contains Loss1 and Loss2. Loss1 is a generative adversarial loss obtained by calculating the MSE of the output and 1 (that is, MSE3), when fakeLR is the input of D. Loss2 is the L1 distance between HD and R. RealLR is the concatenation of LD and HD, and fakeLR is the concatenation of LD and R.

During a training round, the discriminator is trained first, and the specific process is shown in Fig. 2. The generator parameters are fixed and the LD image is fed into the generator (G) as an input. The model completes a forward propagation, and outputs the repaired image (R). The repaired image (R) and the LD image are concatenated as fakeLR. Simultaneously, the HD image and the LD image are concatenated as realLR. Taking realLR and fakeLR as the inputs of the discriminator (D), the inference of the discriminator is completed. Then, MSE1 for realLR and 1 and MSE2 for fakeLR and 0 are calculated (see Fig. 2). Finally, all parameters in the discriminator are updated with back propagation.

After the discriminator training is finished, we continue to train the generator (G) by fixing the parameters of the discriminator (D). The adversarial loss term (MSE3, calculated between the discriminator output for fakeLR and 1) and the L1 loss term between the generator output R and the HD image are calculated. The L1 loss is scaled by λ to control its proportion relative to the adversarial loss. We accumulate all losses according to Eq. (1) and then update the parameters of the generator.
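
As a rough illustration of how these loss terms combine, the sketch below follows the definitions in the Fig. 2 caption (MSE1, MSE2, MSE3) and the weighting in Eq. (1). The caption describes MSE-based terms rather than the logarithmic form of Eq. (2), and the sketch follows the caption. It works on plain Python lists instead of PyTorch tensors, averages the L1 distance over pixels (an assumption), and all function names are hypothetical.

```python
def mse(values, target):
    """Mean square error between a list of discriminator outputs and a constant label."""
    return sum((v - target) ** 2 for v in values) / len(values)


def mean_l1(a, b):
    """L1 distance between two flattened images, averaged over pixels (assumption)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def discriminator_loss(d_on_real_lr, d_on_fake_lr):
    """Loss of D: MSE1 (realLR vs. label 1) + MSE2 (fakeLR vs. label 0)."""
    mse1 = mse(d_on_real_lr, 1.0)
    mse2 = mse(d_on_fake_lr, 0.0)
    return mse1 + mse2


def generator_loss(d_on_fake_lr, repaired, hd, lam=100.0):
    """Loss of G: Loss1 (MSE3, fakeLR vs. label 1) + lam * Loss2 (L1 between HD and R)."""
    loss1 = mse(d_on_fake_lr, 1.0)
    loss2 = mean_l1(hd, repaired)
    return loss1 + lam * loss2
```

In an actual training loop the discriminator step and the generator step alternate, with the parameters of the other network held fixed in each step, as described above.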

The network was trained for 60 epochs using the Adam optimizer implemented in PyTorch. The learning rate was set to 2e-4. During the training, the performance of the network on the training and validation sets was measured according to the multi-scale structural similarity (MS-SSIM) [26], which allowed us to select the best performing model based on the validation set. During inference, the generator takes an LD image as input and outputs the repaired image (R). All programs were run on a desktop computer with 16 GB memory, one Intel i5-9400 CPU @ 2.90 GHz, and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB memory.

2.3 Generation of simulated data

We generated simulated data based on real images, which were provided by ANNA-PALM. By adding noise to the real images, we simulated LD images with different sampling levels. The mathematical description of the process is $S(\lambda, I) = P(\lambda I / I_{\max})$, where S represents the simulated LD image, P is a Poisson distribution with mean $\lambda I / I_{\max}$, I represents the real image, λ controls the sampling level of the real image, and $I_{\max}$ is the maximum value of the real image. We simulated LD images with different sampling levels by controlling λ (from 0 to 5), and constructed training image pairs by matching them with HD images. For training convenience, we segmented the paired simulation images into sub-images with 512 × 512 pixels. All generated data were divided into a training set (60%), a validation set (20%), and a test set (20%).
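
The sampling model S(λ, I) = P(λI/I_max) can be sketched in a few lines of standard-library Python. This is an illustration rather than the authors' implementation: images are plain lists of rows, the Poisson sampler uses Knuth's multiplication method (adequate for the small means that arise when λ is at most 5), and the names `poisson_sample` and `simulate_ld_image` are hypothetical.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw one Poisson-distributed integer (Knuth's method, suited to small means)."""
    if mean <= 0.0:
        return 0
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_ld_image(real_image, lam, seed=None):
    """Simulate an LD image: each pixel is drawn from Poisson(lam * I / I_max)."""
    rng = random.Random(seed)
    i_max = max(max(row) for row in real_image)
    if i_max <= 0:
        raise ValueError("real_image must contain at least one positive pixel")
    return [[poisson_sample(lam * value / i_max, rng) for value in row]
            for row in real_image]


if __name__ == "__main__":
    real = [[0.0, 1.0, 2.0], [4.0, 2.0, 0.0]]
    print(simulate_ld_image(real, lam=3.0, seed=1))
```

A smaller λ gives a sparser image, which mimics an LD image built from fewer raw frames; pixels where the real image is zero always stay empty.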

2.4 Generating experimental data

We directly downloaded all experimental data from ShareLoc.XYZ (https://shareloc.xyz/#/). Using the downloaded localization list, we rendered the localized molecules as LD images frame by frame. The pixel size of each frame is 10 nm. Each time the number of accumulated frames reached an integer multiple of 200, we saved an LD image. To visualize an HD image, we performed Gaussian rendering of all molecules in the localization list with a localization precision of 10 nm. The LD images and the matched HD images were combined into a paired experimental training set. For convenience, we segmented the paired experimental images into sub-images with 512 × 512 pixels. All experimental data were divided into a training set (60%), a validation set (20%), and a test set (20%).
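
The frame-by-frame accumulation described above might look like the following sketch. Several details are assumptions made for illustration: the localization list is taken to be tuples of (frame, x in nm, y in nm) with 1-based frame numbers, LD images are rendered as simple 2D count histograms, and `render_ld_series` is a hypothetical name. The Gaussian rendering used for the HD image is not reproduced here.

```python
def render_ld_series(localizations, width, height, pixel_nm=10.0, save_every=200):
    """Accumulate localizations into a count image and save a copy every `save_every` frames.

    localizations: iterable of (frame, x_nm, y_nm), frame numbers starting at 1 (assumed format).
    Returns a dict {number_of_frames: image}, where image is a list of rows of counts.
    """
    image = [[0] * width for _ in range(height)]
    snapshots = {}
    next_save = save_every
    last_frame = 0
    for frame, x_nm, y_nm in sorted(localizations):
        # Save a copy for every multiple of `save_every` completed before this frame.
        while frame > next_save:
            snapshots[next_save] = [line[:] for line in image]
            next_save += save_every
        col_index = int(x_nm // pixel_nm)
        row_index = int(y_nm // pixel_nm)
        # Localizations outside the field of view are ignored.
        if 0 <= row_index < height and 0 <= col_index < width:
            image[row_index][col_index] += 1
        last_frame = frame
    # Save the final image only if the last frame completes a multiple of `save_every`.
    if last_frame >= next_save:
        snapshots[next_save] = [line[:] for line in image]
    return snapshots
```

Each saved image contains all localizations up to that frame count, so the series runs from sparse LD images toward the fully sampled reconstruction.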

2.5 Evaluation criteria

Image quality. Multi-scale structural similarity (MS-SSIM), structural similarity (SSIM), root mean square error (RMSE) and peak signal-to-noise ratio (PSNR) are commonly used evaluation metrics to calculate the visual quality of the reconstructed images. The relevant details are described in Supplement 1, Note 1.
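
For orientation, the two simplest of these metrics can be written out directly. The sketch below uses the standard textbook definitions of RMSE and PSNR and assumes images normalized to a peak value of 1.0; the exact definitions used in the paper are those in Supplement 1, Note 1, which is not reproduced here. SSIM and MS-SSIM are omitted because they require windowed statistics.

```python
import math


def rmse(image_a, image_b):
    """Root mean square error between two images given as lists of rows."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(image_a, image_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return math.sqrt(total / count)


def psnr(image_a, image_b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = rmse(image_a, image_b)
    if error == 0.0:
        return math.inf
    return 20.0 * math.log10(peak / error)
```

A lower RMSE and a higher PSNR both indicate that a repaired image is closer to the reference image.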

Structural complexity. To quantitatively evaluate the density and intricacy of the biological structure, we adopted Qiao’s method for measuring structural complexity [27], which considers the proportion of the structure in the full image. A detailed description can be found in Supplement 1, Note 2.

Nyquist criterion for ending image acquisition. In SMLM, thousands of raw images are required to reconstruct a super-resolution image. However, it is important to determine the time point for ending the raw image acquisition. In our experiments, image acquisition is ended according to the Nyquist criterion [28], see Supplement 1, Note 3 for more details.

3. Results and discussion

3.1 Comparison of the ResNet and U-Net generators in the inpainting of simulated SMLM images

We used simulated microtubule data to compare the image inpainting performance of the ResNet generator (used in our DI-STORM) and the U-Net generator (used in ANNA-PALM) in processing under-sampled (low-density) SMLM images. We varied the number of simulated raw images (50, 100, 500, 1000 frames, containing sparsely distributed fluorescence signals), and evaluated the image inpainting quality of ResNet and U-Net using the well-known MS-SSIM index. The GroundTruth images and the super-resolution images reconstructed from different numbers of raw images are shown in Fig. 3(a) and Fig. 3(b), respectively.

Fig. 3. Comparison of the image inpainting results between U-Net and ResNet. (a) GroundTruth. (b) Reconstructed super-resolution images from the different numbers of raw image frames shown on the left side of (a). (c) Super-resolution images from U-Net. (d) Super-resolution images from ResNet. (e) The dependence of MS-SSIM on the number of raw image frames. The yellow rectangle regions in (a-d) are enlarged and shown on the right side of each figure. Scale bar: 1 µm (normal images), 200 nm (magnified images). MS-SSIM values were measured between a reconstructed image and the GroundTruth. The green arrows in (c) point out typical artifacts.

As shown in Fig. 3(c-d), ResNet presents better image inpainting performance than U-Net, since the MS-SSIM values of the super-resolution images from ResNet are all higher than those of the corresponding super-resolution images from U-Net. Note that a higher MS-SSIM value indicates that the image is closer to the GroundTruth image. Furthermore, we found that the images repaired by ResNet are smooth and contain no obvious image artifacts, while the images repaired by U-Net have obvious graininess and contain image artifacts (that is, structures that are not properly connected), as pointed out by the green arrows in Fig. 3(c). We further evaluated the dependence of the image inpainting quality on the number of simulated raw images, using the MS-SSIM index. As shown in Fig. 3(e), at any given number of raw images, the repaired image from ResNet presents better quality than that from U-Net. More interestingly, when a large number of raw images are used in image inpainting (for example, 5000 images), the repaired image from ResNet is closest to the GroundTruth. Therefore, based on the findings above, we conclude that ResNet performs better than U-Net in image inpainting, and may be more suitable for the image inpainting task in SMLM.

In addition, to verify the findings from the MS-SSIM metric, we applied three additional metrics (SSIM, PSNR and RMSE) to compare the overall image inpainting performance (Supplement 1, Note 1). We found that the evaluation results using SSIM, PSNR and RMSE as the metrics (Supplement 1, Note 4 and Fig. S1) are consistent with those from MS-SSIM. That is to say, it is reasonable and effective to use MS-SSIM as the metric in evaluating the image inpainting performance.

3.2 Comparing the image inpainting performance using simulated data

Based on the results in Section 3.1, we replaced the generative network in ANNA-PALM (that is, U-Net) with ResNet, and developed a new image inpainting method called DI-STORM. To test the image inpainting performance of DI-STORM, we compared the global image quality and acceleration capability among DI-STORM, VDSR and ANNA-PALM using simulated microtubule images. We also quantified the resolution of the images repaired by DI-STORM.

Firstly, we evaluated and compared the acceleration capability of the three methods. We performed Poisson sampling on the GroundTruth image shown in Fig. 4(b), and generated a total of 5000 raw image frames, with 125 localizations in each raw image, and an area (A) of 8.97 µm² (provided by ANNA-PALM). The pixel size of the simulated image is 10 nm/pixel. According to the Nyquist criterion [28], achieving a spatial resolution of 23 nm requires an average molecular density of at least σ = 7561 µm⁻², corresponding to the number of molecules (N_1×Nyq = σ × A ≈ 67800) under 1×Nyquist sampling. To achieve an ideal image with a resolution of 23 nm, the accumulated number of localizations should be greater than N_5×Nyq = 5 × N_1×Nyq ≈ 339000, corresponding to 2750 LD images. As seen in Fig. 4(a), to achieve a similar quality (measured using MS-SSIM) in the super-resolution image as that reconstructed from 2750 LD images, a total of 50 frames (DI-STORM), 160 frames (ANNA-PALM), and 200 frames (VDSR) are needed for these three methods, respectively. This finding means that DI-STORM achieves a better acceleration capability (55 times) than ANNA-PALM (17.2 times) and VDSR (13.8 times).
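
The arithmetic in this paragraph can be checked with a few lines of Python. The sketch assumes the 2D Nyquist relation in which the sampling interval is half the target resolution, so the density is one localization per squared sampling interval; this reproduces the quoted σ = 7561 µm⁻². The frame counts (2750, 50, 160, 200) are taken from the text rather than derived, and the function names are hypothetical.

```python
def nyquist_density_per_um2(resolution_nm):
    """Minimum 2D localization density (per square micrometre) for a target resolution."""
    sampling_interval_nm = resolution_nm / 2.0     # two samples per resolved distance
    return 1.0e6 / sampling_interval_nm ** 2       # 1 um^2 = 1e6 nm^2


def required_localizations(resolution_nm, area_um2, nyquist_factor=1):
    """Number of localizations needed at `nyquist_factor` times Nyquist sampling."""
    return nyquist_factor * nyquist_density_per_um2(resolution_nm) * area_um2


def speed_gain(frames_conventional, frames_method):
    """Ratio of frames needed by conventional SMLM to frames needed by a method."""
    return frames_conventional / frames_method


if __name__ == "__main__":
    print(round(nyquist_density_per_um2(23.0)))            # density for 23 nm resolution
    print(round(required_localizations(23.0, 8.97, 1)))    # 1x Nyquist
    print(round(required_localizations(23.0, 8.97, 5)))    # 5x Nyquist
    for name, frames in (("DI-STORM", 50), ("ANNA-PALM", 160), ("VDSR", 200)):
        print(name, round(speed_gain(2750, frames), 1))
```

The speed gain is simply the ratio of frame counts at matched MS-SSIM, which is why a method that needs fewer input frames scores a proportionally larger gain.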

Fig. 4. Comparison on the global image inpainting performance of different methods using simulation data. (a) The dependence of image inpainting quality on the number of raw image frames. The vertical blue dotted line indicates the minimum number of frames required for SMLM to achieve 5×Nyquist sampling. The vertical black dotted line, light-blue dotted line, and red dotted line indicate the minimum number of frames required for VDSR, ANNA-PALM and DI-STORM, respectively, to achieve 5×Nyquist sampling. All results were averaged over 20 repetitions. (b) GroundTruth. (c) A super-resolution image reconstructed from 50 raw image frames. (d-f) Repaired images from (c) using ANNA-PALM (d), VDSR (e) and DI-STORM (f). The right panels in (b-f) were enlarged from the blue dashed rectangles in the adjacent left panels. The white arrows in (d-e) indicate image artifacts, and the blue arrows in (f) show comparative results. The MS-SSIM values were calculated and shown in the upper-left corner of each image. (g) Intensity profiles along the green lines in (b, d-f). (h) Intensity profiles along the yellow lines in (b, d-f). The FWHM resolution in (h): 29.0 nm (GroundTruth), 32.8 nm (VDSR), 32.4 nm (ANNA-PALM), 29.1 nm (DI-STORM). Scale bars: 1 µm (b-f, left panels), 200 nm (b-f, right panels).

Next, we evaluated the global image quality of DI-STORM. To ensure the same acceleration capability in the three methods, we took LD images (reconstructed using 50 raw images, with an MS-SSIM value of 0.47) as the input images. We found that the MS-SSIM values of the repaired images were 0.9167 (ANNA-PALM), 0.8710 (VDSR), and 0.9507 (DI-STORM), respectively, as shown in Fig. 4(d-f). These results show that the global image quality of DI-STORM is higher than that of ANNA-PALM and VDSR. To further illustrate the image inpainting quality, we examined the details of the repaired images. We measured the intensity distribution of the structures marked by the green lines in Fig. 4(b, d-f), and found that DI-STORM (red line) presents a closer intensity distribution to the GroundTruth (blue line) in Fig. 4(g).

Finally, we quantified the resolution of the repaired images through the intensity profiles along the yellow lines shown in Fig. 4(b, d-f). Using the same number of raw images (50 frames) as input, we found that the full width at half-maximum (FWHM) resolution in the DI-STORM image (29.1 nm) was close to that of the GroundTruth (29.0 nm) and slightly narrower than those of ANNA-PALM (32.4 nm) and VDSR (32.8 nm), as shown in Fig. 4(h).
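
One common way to read an FWHM value from an intensity profile is to locate the two half-maximum crossings by linear interpolation, as sketched below. The text does not state how the FWHM was obtained (a Gaussian fit is another frequent choice), so this is only an illustration; it assumes a single-peaked, background-subtracted profile, and `fwhm` is a hypothetical helper.

```python
def fwhm(positions, intensities):
    """Full width at half maximum of a single-peaked profile, by linear interpolation."""
    if len(positions) != len(intensities) or len(positions) < 3:
        raise ValueError("need at least three matching samples")
    peak_index = max(range(len(intensities)), key=intensities.__getitem__)
    half = intensities[peak_index] / 2.0

    # Walk outwards from the peak while the samples stay at or above half maximum.
    left = peak_index
    while left > 0 and intensities[left - 1] >= half:
        left -= 1
    right = peak_index
    while right < len(intensities) - 1 and intensities[right + 1] >= half:
        right += 1
    if left == 0 or right == len(intensities) - 1:
        raise ValueError("profile does not fall below half maximum on both sides")

    def crossing(index_above, index_below):
        """Interpolate the position where the profile equals the half maximum."""
        x0, y0 = positions[index_above], intensities[index_above]
        x1, y1 = positions[index_below], intensities[index_below]
        return x0 + (y0 - half) / (y0 - y1) * (x1 - x0)

    return crossing(right, right + 1) - crossing(left, left - 1)
```

With a 10 nm pixel size, interpolation between samples is what allows widths such as 29.1 nm to be reported at sub-pixel precision.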

The above results from simulated data revealed that DI-STORM achieves a better image inpainting quality when microtubule samples are used, and that DI-STORM has better acceleration capability without compromising image resolution.

3.3 Comparing the image artifacts using simulated data

We used three image inpainting methods, DI-STORM, VDSR and ANNA-PALM, to repair simulated microtubule images with different structural complexity, and compared the image artifacts in the repaired images. We analyzed the local details in the repaired images from the three inpainting methods. Image inpainting was performed on 10 groups of LD images, and then the repaired images were randomly cropped into patch images with 64×64 pixels. We extracted 100 patches from each repaired image, and thus obtained a total of 1000 patches for further analysis. We calculated MS-SSIM between the patch images and the GroundTruth. For patch images with different mean gradient (MG), which characterizes the structural complexity, we found the following results: 1) When MG is less than 3.01, DI-STORM presents the best image inpainting quality (that is, the highest MS-SSIM) and almost no image artifacts, while VDSR presents the worst inpainting quality and many artifacts (as shown by the blue arrows in Fig. 5(c-d)); 2) When MG is greater than 3.01, VDSR presents the best image inpainting quality (that is, the highest MS-SSIM), but also many image artifacts (as shown by the incorrect connections and separations in Fig. 5(d)); 3) In all studied cases, DI-STORM maintains the best performance in minimizing image artifacts. We further quantified the line profiles of the microtubule structures along the yellow lines in Fig. 5(a-d), and found that the structures repaired by DI-STORM (see the red curves in Fig. 5(e)) are closer to the GroundTruth (see the blue curves in Fig. 5(e)).
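
The patch extraction step could be sketched as follows. The sketch assumes that each repaired image and its GroundTruth are cropped at the same random locations so that the patches can be compared pairwise; images are plain lists of rows, the seed is only there for reproducibility, and `random_patch_pairs` is a hypothetical name.

```python
import random


def random_patch_pairs(repaired, ground_truth, patch=64, count=100, seed=0):
    """Crop `count` random patch pairs at matching locations from two equally sized images."""
    height, width = len(repaired), len(repaired[0])
    if height < patch or width < patch:
        raise ValueError("image is smaller than the patch size")
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        top = rng.randint(0, height - patch)    # randint includes both end points
        left = rng.randint(0, width - patch)
        repaired_patch = [row[left:left + patch] for row in repaired[top:top + patch]]
        truth_patch = [row[left:left + patch] for row in ground_truth[top:top + patch]]
        pairs.append((repaired_patch, truth_patch))
    return pairs
```

Applying this to 10 repaired images with `count=100` gives the 1000 patches mentioned above; each pair can then be scored with MS-SSIM and binned by the mean gradient of the GroundTruth patch.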

Fig. 5. Comparing artifacts in the simulated patch images repaired by different methods. (a) GroundTruth. (b) Repaired images from DI-STORM. (c) Repaired images from ANNA-PALM. (d) Repaired images from VDSR. Patch images with different structural complexity were used. Here structural complexity was characterized by mean gradient and shown by the numbers on the left side of (a). Blue arrows indicate artifacts. Scale bar: 1 µm (a–d). (e) Intensity profiles along the yellow lines drawn in (a). (f) The dependence of MS-SSIM on structural complexity.

These findings confirm that DI-STORM performs robustly in minimizing image artifacts. It should be noted, however, that the image restoration capability of DI-STORM is limited. When the MG is large enough, the structure becomes extremely complex, and the image restored by DI-STORM begins to show incorrect connections, as indicated by the blue arrow in the last row of Fig. 5(b).

Based on the patch images, we further quantified the relationship between image inpainting quality and structural complexity. As shown in Fig. 5(f), DI-STORM has better MS-SSIM values than ANNA-PALM at the same structural complexity. In addition, VDSR has a better MS-SSIM value when the structural complexity is greater than 4. Unfortunately, VDSR is not good at controlling image artifacts, as indicated by the blue arrows in Fig. 5(d).
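A dependence curve of this kind can be obtained by grouping the patches into MG intervals and averaging the quality score within each interval. The sketch below is only an illustration under stated assumptions: the helper `mean_score_per_bin`, the bin edges and the example numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_score_per_bin(mg, score, bin_edges):
    """Average a per-patch score (e.g. MS-SSIM) within each MG interval.

    Intervals are [edge_i, edge_{i+1}); values outside all intervals are
    ignored, and empty intervals yield NaN.
    """
    mg = np.asarray(mg, dtype=float)
    score = np.asarray(score, dtype=float)
    idx = np.digitize(mg, bin_edges) - 1       # interval index of each patch
    n_bins = len(bin_edges) - 1
    out = np.full(n_bins, np.nan)
    for b in range(n_bins):
        selected = idx == b
        if selected.any():
            out[b] = score[selected].mean()
    return out

# Made-up example values, for illustration only.
example_mg = [0.5, 1.5, 1.7, 3.5]
example_score = [0.9, 0.8, 0.6, 0.5]
curve = mean_score_per_bin(example_mg, example_score, [0, 1, 2, 3, 4])
```

Running the same binning for each inpainting method on the same patches would give one curve per method, which can then be compared at equal structural complexity.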

3.4 Comparing the image inpainting performance using experimental data

To verify the image inpainting quality of DI-STORM in real data, we compared the global image quality of DI-STORM, VDSR and ANNA-PALM, using experimental microtubule images. We also quantified the FWHM resolution of the images repaired by DI-STORM.

First, we downloaded a set of experimental microtubule data from ShareLoc.XYZ, and compared the acceleration achieved by the three image inpainting methods. This dataset contains a total of 30000 raw image frames, and the microtubules occupy an area (A) of 12.75 µm². The super-resolution image reconstructed from all 30000 raw images is used here as the GroundTruth, as shown in Fig. 6(b). According to the Nyquist criterion, resolving microtubules at a typical resolution of 30 nm requires an image sampling interval of no more than 15 nm. In this experimental dataset, the image sampling interval is set to 10 nm (which is smaller than 15 nm). That is to say, the average distance between localizations is 10 nm, the molecule density (σ) is 10000 µm⁻², and the corresponding number of localizations under 1×Nyquist sampling is: N_1×Nyq = σ × A ≈ 127500. For 5×Nyquist sampling, the number of localizations to be collected is: N_5×Nyq = 5 × N_1×Nyq ≈ 637500, corresponding to a total of 9050 raw image frames for conventional SMLM (see the blue dotted line in Fig. 6(a)). Conventional SMLM can use these raw images (9050 frames) to reconstruct a super-resolution image with good quality, at the expense of a long image acquisition time. To achieve a quality (quantified by MS-SSIM) similar to that of this super-resolution image, DI-STORM requires only 1000 raw image frames (see the red dotted line in Fig. 6(a)), VDSR requires 1800 frames (see the black dotted line in Fig. 6(a)), and ANNA-PALM requires 2000 frames (see the light-blue dotted line in Fig. 6(a)). That is to say, for this experimental dataset, the imaging speed is accelerated by 9.05 times (DI-STORM), 5.02 times (VDSR), and 4.52 times (ANNA-PALM), respectively. DI-STORM thus provides the largest acceleration.
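The arithmetic in this paragraph can be reproduced with a short script. This is a sketch for checking the numbers only; the function names `nyquist_localizations` and `speed_gain` are hypothetical, and the molecule density is assumed to be one localization per sampling-interval square, as implied by the text.

```python
def nyquist_localizations(interval_nm, area_um2, factor=1):
    """Number of localizations for factor x Nyquist sampling.

    ASSUMPTION: density = one localization per (interval x interval) square.
    """
    density_per_um2 = (1000.0 / interval_nm) ** 2   # 10 nm -> 10000 per um^2
    return factor * density_per_um2 * area_um2

def speed_gain(frames_conventional, frames_method):
    """Ratio of raw frames needed by conventional SMLM to those of a method."""
    return frames_conventional / frames_method

n_1x = nyquist_localizations(10, 12.75)             # 127500.0
n_5x = nyquist_localizations(10, 12.75, factor=5)   # 637500.0

# Frame counts quoted in the text for this dataset.
gains = {
    "DI-STORM": speed_gain(9050, 1000),
    "VDSR": speed_gain(9050, 1800),
    "ANNA-PALM": speed_gain(9050, 2000),
}
```

The resulting ratios agree with the quoted speed gains to within 0.01.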

Fig. 6. Comparison of the global image inpainting results using experimental data. (a) The dependence of image inpainting quality on the number of raw image frames. The vertical blue dotted line indicates the minimum number of frames required for SMLM to achieve 5×Nyquist sampling. The vertical black dotted line, light-blue dotted line, and red dotted line indicate the minimum number of frames required for VDSR, ANNA-PALM and DI-STORM, respectively, to achieve 5×Nyquist sampling. (b) GroundTruth. The image was constructed from 30000 raw image frames. (c) A super-resolution image reconstructed from 1000 raw image frames. (d-f) Repaired images from (c) using ANNA-PALM (d), VDSR (e) and DI-STORM (f). The right panels in (b-f) were enlarged from the blue dashed rectangles in the adjacent left panels. The white arrows in (d-e) indicate image artifacts, and the blue arrows in (f) show comparative results. The MS-SSIM values were calculated and shown in the upper-left corner of each image. (g) Intensity profiles along the yellow lines in (b, d-f). (h) Intensity profiles along the green lines in (b, d-f). The FWHM resolution in (h): 61.0 nm (GroundTruth), 60.2 nm (VDSR), 55.0 nm (ANNA-PALM), 60.5 nm (DI-STORM). Scale bar: 2 µm (b-f, left panels), 200 nm (b-f, right panels).

Next, we evaluated the image inpainting quality of these methods. We used a total of 1000 raw image frames to generate an experimental input image (as shown in Fig. 6(c), the MS-SSIM value of this image is 0.4934), and then applied the three image inpainting methods to this image. We further analyzed the repaired images from these methods, and found that DI-STORM presents a repaired image with the best quality (that is, the highest MS-SSIM value). Specifically, as seen in Fig. 6(d-f), the MS-SSIM values are 0.68 for ANNA-PALM, 0.67 for VDSR, and 0.75 for DI-STORM. We also measured the intensity distribution of two adjacent microtubules (the yellow lines in Fig. 6(b, d-f)), and presented the results in Fig. 6(g). We found that DI-STORM produces the repaired image closest to the GroundTruth image, as seen by the relative intensity ratio of the two peaks in the line profile.

Finally, we quantified the FWHM resolution of the repaired images using line profiles from a single microtubule structure (Fig. 6(b, d-f), green line). As shown in Fig. 6(h), the FWHM resolution values are 61.0 nm for GroundTruth, 60.2 nm for VDSR, 55.0 nm for ANNA-PALM, and 60.5 nm for DI-STORM. That is to say, the FWHM resolution from DI-STORM is comparable to that from GroundTruth, although ANNA-PALM seems to present a better result. Additionally, image artifacts are visible in the repaired images from ANNA-PALM and VDSR (see the white arrows in Fig. 6(d-e)), while these artifacts are not visible in the repaired image from DI-STORM (see the blue arrows in Fig. 6(f)).
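One simple way to estimate an FWHM value from a line profile is to locate the two half-maximum crossings by linear interpolation. The sketch below illustrates this approach under stated assumptions; it is not necessarily the procedure used in this study (a Gaussian fit is another common choice). The function name `fwhm`, the single-peak assumption, the minimum-value baseline and the synthetic test profile are all hypothetical.

```python
import numpy as np

def fwhm(positions_nm, intensity):
    """FWHM of a single-peak profile via linear interpolation at half maximum.

    ASSUMPTIONS: one dominant peak; baseline taken as the minimum intensity.
    """
    x = np.asarray(positions_nm, dtype=float)
    y = np.asarray(intensity, dtype=float)
    y = y - y.min()                        # crude baseline removal
    half = y.max() / 2.0
    above = np.where(y >= half)[0]         # samples at or above half maximum
    left, right = above[0], above[-1]

    def crossing(i0, i1):
        # Position where the segment between samples i0 and i1 reaches `half`.
        return x[i0] + (half - y[i0]) * (x[i1] - x[i0]) / (y[i1] - y[i0])

    x_left = x[left] if left == 0 else crossing(left - 1, left)
    x_right = x[right] if right == len(x) - 1 else crossing(right, right + 1)
    return x_right - x_left

# Synthetic Gaussian profile with a known FWHM of 61 nm, sampled every 5 nm.
true_fwhm = 61.0
sigma = true_fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
x = np.arange(-200.0, 201.0, 5.0)
profile = np.exp(-x ** 2 / (2.0 * sigma ** 2))
estimate = fwhm(x, profile)
```

For noisy experimental profiles, fitting a model peak is usually more robust than interpolating raw samples.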

3.5 Comparing the image artifacts using experimental data

Using experimental microtubule images with different structural complexity, we compared the artifacts in the images repaired by DI-STORM, VDSR and ANNA-PALM. We selected several super-resolution images with different structural complexity, and repaired these images using the three image inpainting methods. The results are shown in Fig. 7. For all of these images (where the mean gradient (MG) ranges from 0.13 to 0.33), DI-STORM presents the best MS-SSIM values among these inpainting methods. The repaired images from ANNA-PALM and VDSR seem to contain more artifacts, such as blurred and/or smaller connections (see the blue arrows in Fig. 7(c-d)). We further analyzed the intensity distribution of the microtubule structures indicated by the yellow dashed lines in Fig. 7(a), and found that the repaired images provided by DI-STORM are closest to the corresponding GroundTruth images, as seen by the relative intensity ratio of the peaks in the line profiles (Fig. 7(e)). Note that, when MG = 0.24, artifacts also appear in the DI-STORM image at the positions shown by the blue arrows in Fig. 7(b). At the same positions (see Fig. 7(c-d)), ANNA-PALM and VDSR also present artifacts. These artifacts are most likely caused by areas containing dense noise points in the low-density images being misidentified as microtubules.

Fig. 7. Comparing artifacts in the experimental images repaired by different methods. (a) GroundTruth. (b) Repaired images from DI-STORM. (c) Repaired images from ANNA-PALM. (d) Repaired images from VDSR. Here we used experimental images with different structural complexity, which was characterized by mean gradient and shown by the numbers on the left side of (a). (e) Intensity profiles along the yellow dashed lines drawn in (a). (f) The dependence of MS-SSIM on structural complexity. Scale bar: 200 nm (a–d).

We also evaluated the image inpainting performance of the three methods using experimental images with various structural complexity. We randomly extracted a total of 100 image patches with a size of 64×64 pixels from Fig. 6(d-f), quantified the image quality of these patches using the MS-SSIM index, and plotted the dependence of image quality on structural complexity. The MG of the cropped patch images ranges from 0 to 0.4 in this dataset, and across this entire range DI-STORM provides better image quality (MS-SSIM) than ANNA-PALM and VDSR, as shown in Fig. 7(f). This result is basically consistent with that from simulated images (Fig. 5(f)). Note that the grayscale changes in the experimental data are softer and slower than those in the simulated data, so the mean gradient of the experimental data spans a smaller range: the maximum mean gradient is 0.4 in our experimental data, compared with 6.63 in our simulated data. In this study, we analyzed the performance of DI-STORM in reducing artifacts in images with different structural complexity; however, we did not aim to determine the maximum structural complexity to which DI-STORM is applicable.

3.6 Applicability of DI-STORM in mitochondria structures

In previous sections, we demonstrated the applicability of DI-STORM in relatively uniform and tube-like structures (microtubules). Here, we quantified the image inpainting performance of DI-STORM in mitochondria images, which show amorphous features and are commonly used in SMLM. We compared the acceleration capability and image inpainting quality of DI-STORM with VDSR and ANNA-PALM.

We downloaded a set of experimental mitochondria data from ShareLoc.XYZ, and evaluated the acceleration capability of DI-STORM. We found that, to match the image quality of a conventional super-resolution image reconstructed from a total of 11000 raw image frames, the image inpainting methods require fewer raw images: 6900 frames for DI-STORM (that is, 1.59× speed gain), 8450 frames for ANNA-PALM (1.30× speed gain), and 9400 frames for VDSR (1.17× speed gain), as shown in Fig. 8(a). DI-STORM thus provides better acceleration capability than the other two methods.

Fig. 8. Comparison of the global image inpainting performance of different methods using experimental mitochondrial data. (a) The dependence of image inpainting quality on the number of raw image frames. The vertical blue dotted line, light-blue dotted line, black dotted line, and red dotted line indicate the minimum number of raw image frames required for conventional SMLM, ANNA-PALM, VDSR and DI-STORM, respectively, to achieve 5×Nyquist sampling. (b) GroundTruth. The image was constructed from 25000 raw image frames. (c) A super-resolution image reconstructed from 6900 raw images, using conventional SMLM. (d-f) Repaired images from (c) using ANNA-PALM (d), VDSR (e) and DI-STORM (f). The white arrows in (d-e) indicate artifacts, and the blue arrows in (f) indicate the better inpainting effect of DI-STORM at the same location. Scale bars: 2 µm (b-f, left panels), 300 nm (b-f, right panels).

We compared the quality of the repaired images. For the same input (6900 raw image frames), we found that DI-STORM provides a better repaired image than ANNA-PALM and VDSR, as can be seen from the MS-SSIM values in Fig. 8(d-f), for both the full images and the enlarged regions. In the enlarged images, DI-STORM presents the details closest to the GroundTruth (see the arrows in Fig. 8(d-f)). The image inpainting performance of DI-STORM is therefore the best among the three methods when experimental images of amorphous structures, such as mitochondria, are used. Note that the acceleration capability of all these methods is lower for amorphous structures (mitochondria) than for tube-like structures (microtubules). A similar finding was reported in the original ANNA-PALM paper [13].

4. Conclusion

We developed an image inpainting method called DI-STORM for accelerating the imaging speed of SMLM. In this method, we used a ResNet generator model, which is based on the residual network and has a strong capability for local feature extraction, to replace the U-Net generator model in ANNA-PALM, a well-known deep-learning-based image inpainting method. With this strategy, we aimed to solve the image artifact problem in ANNA-PALM. Using simulated and experimental data, we compared the image inpainting performance of DI-STORM with ANNA-PALM (the best cGAN-based image inpainting method applied in SMLM) and VDSR (the simplest CNN-based image inpainting method applied in SMLM). We found that DI-STORM achieves a speed gain of 55 times (simulation) and 9.05 times (experiment) as compared to conventional SMLM, which is much better than ANNA-PALM (17.2 times in simulation and 5.02 times in experiment) and VDSR (13.8 times in simulation and 4.52 times in experiment). Moreover, as compared with ANNA-PALM and VDSR, the repaired images from DI-STORM are closer to the GroundTruth images in both simulation and experiment, and present the fewest image artifacts. We also verified the applicability of DI-STORM to another popular biological structure, mitochondria, and showed that the quality of the super-resolution images repaired by DI-STORM is significantly improved. There are still some limitations to the proposed DI-STORM method, because its image inpainting ability is weakened in complex or amorphous structures. Nevertheless, we believe this study points to a new strategy for solving the image artifact problem in deep-learning-based methods for SMLM acceleration. The DI-STORM method is expected to promote the use of deep-learning-based image inpainting methods in SMLM.

Funding

National Natural Science Foundation of China (81827901, 82160345); Key Research and Development Project of Hainan Province (ZDYF2021GXJS017); Key science and technology plan project of Haikou (2021-016); Start-up Fund from Hainan University (KYQD(ZR)20022, KYQD(ZR)20077).

Acknowledgments

We thank the Optical Bioimaging Core Facility of WNLO-HUST for technical support.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. E. Betzig, G. H. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. S. Bonifacino, M. W. Davidson, J. Lippincott-Schwartz, and H. F. Hess, “Imaging intracellular fluorescent proteins at nanometer resolution,” Science 313(5793), 1642–1645 (2006).

2. M. J. Rust, M. Bates, and X. Zhuang, “Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (STORM),” Nat. Methods 3(10), 793–796 (2006).

3. X. Fan, T. Gensch, G. Büldt, Y. Zhang, Z. Musha, W. Zhang, R. Roncarati, and R. Huang, “Three dimensional drift control at nano-scale in single molecule localization microscopy,” Opt. Express 28(22), 32750–32763 (2020).

4. H. Ma and Y. Liu, “Super-resolution localization microscopy: toward high throughput, high quality, and low cost,” APL Photonics 5(6), 060902 (2020).

5. S. A. Jones, S.-H. Shim, J. He, and X. Zhuang, “Fast, three-dimensional super-resolution imaging of live cells,” Nat. Methods 8(6), 499–505 (2011).

6. R. Diekmann, M. Kahnwald, A. Schoenit, J. Deschamps, U. Matti, and J. Ries, “Optimizing imaging speed and excitation intensity for single-molecule localization microscopy,” Nat. Methods 17(9), 909–912 (2020).

7. S. J. Holden, S. Uphoff, and A. N. Kapanidis, “DAOSTORM: an algorithm for high-density super-resolution microscopy,” Nat. Methods 8(4), 279–280 (2011).

8. S. Cox, E. Rosten, J. Monypenny, T. Jovanovic-Talisman, D. T. Burnette, J. Lippincott-Schwartz, G. E. Jones, and R. Heintzmann, “Bayesian localization microscopy reveals nanoscale podosome dynamics,” Nat. Methods 9(2), 195–200 (2012).

9. E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-STORM: super-resolution single-molecule microscopy by deep learning,” Optica 5(4), 458–464 (2018).

10. F. Huang, T. M. P. Hartwich, F. E. Rivera-Molina, Y. Lin, W. C. Duim, J. J. Long, P. D. Uchil, J. R. Myers, M. A. Baird, W. Mothes, M. W. Davidson, D. Toomre, and J. Bewersdorf, “Video-rate nanoscopy using sCMOS camera–specific single-molecule localization algorithms,” Nat. Methods 10(7), 653–658 (2013).

11. Y. Wang, S. Jia, H. F. Zhang, D. Kim, H. Babcock, X. Zhuang, and L. Ying, “Blind sparse inpainting reveals cytoskeletal filaments with sub-Nyquist localization,” Optica 4(10), 1277 (2017).

12. S. K. Gaire, Y. Wang, H. Zhang, D. Liang, and L. Ying, “Accelerating 3D single-molecule localization microscopy using blind sparse inpainting,” J. Biomed. Opt. 26(02), 026501 (2021).

13. W. Ouyang, A. Aristov, M. Lelek, X. Hao, and C. Zimmer, “Deep learning massively accelerates super-resolution localization microscopy,” Nat. Biotechnol. 36(5), 460–468 (2018).

14. J. Jam, C. Kendrick, K. Walker, V. Drouard, J. G.-S. Hsu, and M. H. Yap, “A comprehensive review of past and present image inpainting methods,” Comput. Vis. Image Underst. 203, 103147 (2021).

15. J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, (Curran Associates Inc., 2012), pp. 341–349.

16. D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: feature learning by inpainting,” arXiv:1604.07379 (2016).

17. C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” arXiv:1611.09969 (2016).

18. S. Ma, Q. Liu, Y. Yu, Y. Luo, and S. Wang, “Quantitative phase imaging in digital holographic microscopy based on image inpainting using a two-stage generative adversarial network,” Opt. Express 29(16), 24928–24946 (2021).

19. J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision & Pattern Recognition, (2016).

20. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” arXiv:1611.07004 (2016).

21. S. Kumar Gaire, Y. Zhang, H. Li, R. Yu, H. F. Zhang, and L. Ying, “Accelerating multicolor spectroscopic single-molecule localization microscopy using deep learning,” Biomed. Opt. Express 11(5), 2705–2721 (2020).

22. C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv:1609.04802 (2016).

23. S. Ma, R. Fang, Y. Luo, Q. Liu, S. Wang, and X. Zhou, “Phase-aberration compensation via deep learning in digital holographic microscopy,” Meas. Sci. Technol. 32(10), 105203 (2021).

24. V. Ghodrati, J. Shao, M. Bydder, Z. Zhou, W. Yin, K.-L. Nguyen, Y. Yang, and P. Hu, “MR image reconstruction using deep learning: evaluation of network structure and loss functions,” Quant. Imaging Med. Surg. 9(9), 1516–1527 (2019).

25. H. Lee, J. Jo, and H. Lim, “Study on optimal generative network for synthesizing brain tumor-segmented MR images,” Math. Probl. Eng. 2020, 1–12 (2020).

26. Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, (2003), pp. 1398–1402.

27. C. Qiao, D. Li, Y. Guo, C. Liu, T. Jiang, Q. Dai, and D. Li, “Evaluation and development of deep neural networks for image super-resolution in optical microscopy,” Nat. Methods 18(2), 194–202 (2021).

28. W. R. Legant, L. Shao, J. B. Grimm, T. A. Brown, D. E. Milkie, B. B. Avants, L. D. Lavis, and E. Betzig, “High-density three-dimensional localization microscopy across large volumes,” Nat. Methods 13(4), 359–365 (2016).

Optics Express