Multi-channel residual network model for accurate estimation of spatially-varying and depth-dependent defocus kernels

Yanpeng Cao; Yanpeng Cao; Zhangyu Ye; Zhangyu Ye; Zewei He; Zewei He; Jiangxin Yang; Jiangxin Yang; Yanlong Cao; Yanlong Cao; Christel-Loic Tisse; Michael Ying Yang

doi:10.1364/OE.383127

1. Introduction

In recent years, digital projection systems have been increasingly used to provide pixel-wise controllable light sources for various optical measurement and computer graphics applications such as Fringe Projection Profilometry (FPP) [1–3] and Augmented Reality (AR) [4–6]. However, digital projectors utilize large apertures to maximize their display brightness and thus typically have very limited depth-of-fields [7–9]. Once a projector is not precisely focused, its screen-projected images will contain noticeable blurring effects. A comprehensive analysis of the spatially-varying and depth-dependent defocus properties of projectors provides useful information to achieve more accurate three dimensional (3D) shape acquisition and virtual objects rendering.

When the setup of a digital projector is not properly focused, the light rays from a single projector pixel will be distributed in a small area instead of being converged onto a single point on the display surface. The distribution of light rays is typically depicted through defocus kernels or point-spread functions (PSF) [7,10]. In the thin-lens model, the diameter of the defocus kernel is directly proportional to the aperture size. As a result, projectors with larger apertures will suffer from smaller depth-of-fields and more severe out-of-focus blurring effects. Once the 2D spatially-varying defocus kernels of the projector at different depths are estimated as priors, the appropriate out-of-focus compensation method can be determined [7,8].

Previously, a number of techniques have been presented to estimate the defocus kernels or PSF. For instance, some methods directly acquire PSF using point-like light sources [11,12]. Although these methods can achieve more accurate PSF measurement with a higher peak signal-to-noise ratio (PSNR), they require specifically designed optical instruments and have difficulty in generating multiple point sources to obtain the spatially-varying PSF. Non-blind methods, which are based on specific calibration patterns (e.g., checkerboard targets [13,14] or grid points [9]), are the most commonly used techniques for defocus kernel estimation. It is noted that most non-blind methods employ simplified parametric models to constrain the space of possible PSF solutions. However, these model-based methods impose strong/simplified prior assumptions on the regularity of the defocus kernel and thus cause inaccurate estimation results. In comparison, non-parametric kernels can accurately describe complex blurring effects [15]. However, it is difficult to adopt the high dimensionality representations to reflect the interrelationship between the kernel shape and the optical parameters (e.g., aperture size or projection depth), thus the non-parametric methods are typically scene-specific or depth-fixed [8,13,15].

Recently, deep learning-based models (e.g., Convolutional Neural Networks) have significantly boosted the performance of various machine vision tasks including object detection [16,17], image segmentation [18] and target recognition [19]. Given a number of training samples, Convolutional Neural Networks (CNN) can automatically construct high-level representations by assembling the extracted low-level features. For instance, Simonyan et al. presented a very deep CNN model (VGG), which is commonly utilized as a backbone architecture for various computer vision tasks [17]. He et al. proposed a novel residual architecture to improve the training of very deep CNN models and achieved improved performance by increasing the depth of networks [16]. Moreover, some 3D CNN architectures have been proposed to extend the dimension of input data from 2D to 3D, processing video sequences for action recognition [20,21] or target detection [22]. Although CNN-based models have been successfully applied to solve many challenging image/signal processing tasks, very limited efforts have been made to explore deep learning-based methods for defocus kernel estimation or analysis.

In this paper, we present the first deep learning-based approach for accurate estimation of spatially and depth-varying projection defocus kernels and demonstrate its effectiveness for compensating blurring effects of out-of-focus projectors. An optical imaging/displaying system, which consists of a single-lens reflex camera, a depth sensor, and a portable digital projector, is geometrically calibrated and used to capture projected RGB images at various depths. Moreover, we calculate a 2D image warping transformation, which maximizes the photoconsistency between in-focus and out-of-focus images, to achieve sub-pixel level alignment. Based on the constructed dataset containing well-aligned in-focus, out-of-focus, and depth images, we present a compact yet effective multi-channel CNN model to precisely estimate the spatially-varying and depth-dependent defocus kernels of a digital projector. The proposed model incorporates multi-channel inputs (RGB images, depth maps, and spatial location masks) and learns the complex blurring effects presented in the projected images captured at different spatial locations and depths. To the best of our knowledge, this represents the first research work revealing the complex spatially-varying and depth-dependent blurring effects can be accurately learned from a number of in-focus and out-of-focus image patches instead of being hand-crafted as before. The contributions of this paper are summarized as follows:

(1) We construct a dataset that contains a large number of well-aligned in-focus, out-of-focus, and depth images captured at very different projection distances (between 50cm and 140cm). This new dataset could be utilized to facilitate the training of CNN-based defocus analysis models and to perform quantitative evaluation of various defocus analysis approaches.

(2) We propose a novel CNN-based model, which incorporates multi-channel inputs, including RGB images, depth maps, and spatial location masks, to estimate the spatially-varying and depth-dependent defocus kernels. Experiment results show that the proposed deep learning-based approach significantly outperforms other state-of-art defocus analysis methods and exhibits good generalization properties.

The rest of this paper is organized as follows. Section 2 provides the details of the optical displaying/imaging system and the constructed dataset for defocus analysis. Section 3 presents the details of the proposed multi-channel CNN model. Section 4 provides implementation details of the proposed CNN model and experimental comparison with the state-of-the-art alternatives. Finally, Section 5 concludes the paper.

2. Image acquisition system and dataset

2.1 Image acquisition

We have built an optical system which consists of a Nikon D750 single-lens reflex camera (SLR), a Microsoft Kinect v2 depth sensor, and a PHILIPS PPX4835 portable projector. The spatial resolutions of the SLR camera and the digital light processing (DLP) projector are $6016 \times 4016$ and $1280 \times 720$ pixels, respectively. The Kinect v2 depth sensor is utilized to captures a $512 \times 424$ depth image, and its effective working distance ranges from 0.5m to 2.0m. These optical instruments are rigidly attached to preserve their relative position and orientation. The system moves along a sliding track in the direction approximately perpendicular to the projection screen, displaying/capturing images at different depths with spatially-varying and depth-dependent blurring effects. The system setup is illustrated in Fig. 1.

Fig. 1. The setup of an optical system to simultaneously capture screen-projected images and depth data at different projection distances.

Download Full Size | PDF

We make use of the multimodal displaying/imaging system to capture a number of projected RGB images (using the SLR camera) and depth images (using the depth sensor). In total, we captured in-focus/out-of-focus projected images and depth maps at 13 different projection distances (50cm, 55cm, 60cm, 65cm, 70cm, 75cm, 80cm, 90cm, 100cm, 110cm, 120cm, 130cm, and 140cm positions in the sliding track) . Note the projector is properly focused at 80 cm position. In each position, we projected 200 images ($1280 \times 720$) from the DIV2K dataset [23] (publicly available online for academic research purpose) for capturing the training images and selected another 100 images with large varieties as the testing images to evaluate the generalization performance of our proposed method. The complete data capturing process is illustrated in Fig. 2.

Fig. 2. The data (in-focus, out-of-focus, and depth images) capturing process at different projection positions. We projected hundreds of images ($1280 \times 720$) from the publicly available DIV2K dataset [23] for capturing the training and testing images.

Download Full Size | PDF

2.2 Image alignment

It is important to generate a number of precisely aligned in-focus and out-of-focus image pairs to analyze the characteristics of spatial and depth varying defocus kernels. In each projection position, we establish corner correspondences between a checkerboard pattern input image and its screen-projected version. The transformation between two images is modeled by a polynomial 2D geometric mapping function whose coefficients are estimated by least squares based on the found corner correspondences. In our experiment, we empirically use a 5th order polynomial model. The computed polynomial mapping function is then utilized to rectify the geometrical skew of the projected images (both in-focus and out-of-focus images) to the front-parallel views, as illustrated in Fig. 3.

Fig. 3. Based on the established corner correspondences between a checkerboard pattern and its screen-projected image, a polynomial 2D geometric mapping function is computed to generate the viewpoint rectified images.

Download Full Size | PDF

During the image acquisition process (capturing 200 training and 100 testing in-focus/out-of-focus images in each projection position), it is impractical to keep the SLR camera completely still. Therefore, the calculated polynomial mapping function cannot be used to achieve high-accuracy alignment of in-focus/out-of-focus images, as illustrated in Fig. 4(c). To address the problem, we further present a simple yet effective image warping-based technique to achieve sub-pixel level alignment between in-focus and out-of-focus image pairs. Given an in-focus image $I_{IF}$, we deploy the non-parametric defocus kernel estimation method [15] to predict its defocused version $I_{DF'}$. Then, we calculate a 2D image displacement vector $X^{*}$ which maximizes the photoconsistency between the predicted ($I_{DF'}$) and real-captured ($I_{DF}$) defocus images as

(1)$$X^{*} =arg\ {\min _{X}} \left\{ \sum_{p\in \Omega} \left(I_{DF}(p+X) - I_{DF'}(p)\right)^2 \right\},$$

where $X^{*}$ denotes the estimated sub-pixel level 2D displacement, and $p$ denotes pixel coordinates on the 2D image plane $\Omega$. Note Eq. (1) represents a nonlinear least-squares optimization problem and can be minimized iteratively using the Gauss-Newton method. The calculated 2D displacement $X^{*}$ is utilized to warp input images to achieve sub-pixel level image alignment, as illustrated in Fig. 4(d).

Fig. 4. An illustration of sub-pixel level alignment between in-focus and out-of-focus image pairs. (a) In-focus image; (b) Zoom-in view; (c) Alignment results based on 2D polynomial mapping; (d) Alignment results based on 2D displacement $X^{*}$ image warping. Note the red curves are presented in the same position in all images to highlight misalignments.

Download Full Size | PDF

Finally, we make use of the calibration technique proposed by Moreno et al. [24] to estimate the intrinsic matrices of the depth sensor and the portable projector and the relative pose between them. The estimated six degrees of freedom (6DoF) extrinsic matrix is used to accurately align coordinate systems of two optical devices, transforming the depth images from the perspective of the depth sensor to the one of the projector. In this manner, the captured depth data is associated with the viewpoint rectified in-focus/out-of-focus images. Since the resolution of the depth images is lower than the screen-projected images, we apply bicubic interpolation to increase the size of the viewpoint rectified depth images and fill the missing pixels. Figure 5 shows some sample images ($1280\times 720$) in the constructed dataset for defocus analysis. Note the training and testing images present large varieties to evaluate the generalization performance of our proposed method. These well-aligned in-focus, out-of-focus, and depth images captured at different projection distances will be made publicly available in the future.

Fig. 5. Some well-aligned in-focus, out-of-focus, and depth images captured at different projection distances. We purposely use very different training and testing images to evaluate the generalization performance of our proposed method.

Download Full Size | PDF

3. Deep learning-based defocus kernel estimation

In this section, we present a Multi-Channel Residual Deep Network (MC-RDN) model for accurate defocus kernel estimation. Given the in-focus input image $I_{IF}$, the aim of the proposed network is to accurately predict its defocused versions $I_{DF'}$ at different spatial locations and depths.

3.1 Image patch-based learning

In many previous CNN-based models [17,25], the full-size input images are directly fed to the network, and a reasonably large receptive field is utilized to capture image patterns presented in different spatial locations. However, training a CNN model by feeding the entire images as input has two significant limits. First, this technique requires a very large training dataset (e.g., ImageNet dataset contains over 15 million images for training CNN models for object classification [26]). It is impractical to capture such large-scale datasets for the device-specific defocus analysis task. Second, its computational efficiency drops when processing a large number of high-resolution images (e.g., $1280 \times 720$ pixels) during the training process. To overcome the above-mentioned limits, we propose to divide the full-size RGB/depth images into a number of sub-images which are further integrated with two additional location maps (encoding the $x$ and $y$ coordinates) through the concatenation as illustrated in Fig. 6. As a result, our CNN model is capable of retrieving the spatial location of individual pixels within an image patch of arbitrary size without referring to the full-size images.

Fig. 6. An illustration of multi-channel input for our proposed MC-RDN model. RGB/depth image patches are integrated with two additional location maps to encode the $x$ and $y$ spatial coordinates.

Download Full Size | PDF

Each full-size $1280 \times 720$ image is uniformly cropped into a number of $80\times 80$ image patches. It is noted that many cropped image patches cover homogeneous regions and contain pixels of similar RGB values, as shown in Section A in Fig. 7. It is important to exclude such homogeneous image patches in the training process. Otherwise, the CNN-based model will be tuned to learn the simple mapping relationships between these homogeneous regions instead of estimating the complex spatially-varying and depth-dependent blurring effects. As a simple yet effective solution, we compute the standard variation of pixels within an image patch as an indicator to decide whether this patch is suitable for training. A threshold $\theta$ is set to eliminate patches with low RGB variations. In our experiments, we set the threshold $\theta =0.1$. Only the image patches with abundant textures/structures, as shown in Section B in Fig. 7, are utilized for deep network training.

Fig. 7. A full-size image is uniformly cropped into a number of small image patches. Selection A: image patches contain pixels of similar RGB values. Selection B: image patches contain abundant textures/structures. Only the image patches in Section B are utilized for deep network training.

Download Full Size | PDF

3.2 Network architecture

The architecture of the proposed MC-RDN model is illustrated in Fig. 8. Given RGB image patches and the corresponding depth and location maps as input, our model extracts high-dimensional feature maps and performs non-linear mapping operation to predict the defocused version. Since optical blurring effects are color-channel dependent [8,15,27,28], the MC-RDN model deploys three individual convolutional layers to extract the low-level features in the Red (R), Green (G), and Blue (B) channels of the input images as

(2)$$F_{0}^{R} = Conv_{1\times1}(I^{R}_{IF}),$$

(3)$$F_{0}^{G} = Conv_{1\times1}(I^{G}_{IF}),$$

(4)$$F_{0}^{B} = Conv_{1\times1}(I^{B}_{IF}),$$

where $Conv_{1\times 1}$ denotes the convolution operation using a $1\times 1$ kernel and $F_{0}^{R,G,B}$ are the extracted low-level features in the R, G, B channels. $F_{0}^{R,G,B}$ features are then fed into a number of stacked residual blocks to extract high-level features for defocus kernel estimation. We adopt the residual block used in EDSR [29], which contains two $3\times 3$ convolutional layers and a Rectified Linear Units (ReLU) activation layer. Within each residual block, we add skip connections between deeper and shallower convolutional layers to integrate both global and local contexts for improving the accuracy of image restoration. Moreover, shortcut connections enable gradient signal to back-propagate directly from the higher-level features to lower-level ones, alleviating the gradient vanishing/exploring problem of training deep CNN models. In the MC-RDN model, we empirically set the number of residual blocks $N = 4$ in each channel to achieve a good balance between restoration accuracy and computational efficiency. The informative features extracted by the $N_{th}$ residual blocks are then fed into three $3\times 3$ convolutional layers for predicting the out-of-focus images in the R, G, B channels as

(5)$$I_{DF'}^{R} = Conv_{3\times3}(F_{N}^{R}),$$

(6)$$I_{DF'}^{G} = Conv_{3\times3}(F_{N}^{G}),$$

(7)$$I_{DF'}^{B} = Conv_{3\times3}(F_{N}^{B}),$$

where $F_{N}^{R,G,B}$ are the output of the $N_{th}$ residual blocks in the R, G, B channels, respectively. The predicted results $I_{DF'}^{R, G, B}$ in the R, G, B channels are combined through a concatenation operation to generate the final defocused version $I_{DF'}$.

Fig. 8. The architecture of our proposed MC-RDN model for accurate estimation of spatially-varying and depth-dependent defocus kernels.

Download Full Size | PDF

3.3 Network training

Our objective is to learn the optimal parameters for the MC-RDN model, predicting a blurred image $I_{DF'}$ which is as similar as possible to the real-captured defocused image $I_{DF}$. Accordingly, our loss function is defined as

(8)$$\mathcal{L} = \alpha \sum_{p\in P}||I^{R}_{DF}(p) - I^{R}_{DF'}(p)||_2^2 +\beta \sum_{p\in P}||I^{G}_{DF}(p) - I^{G}_{DF'}(p)||_2^2 + \gamma \sum_{p\in P}||I^{B}_{DF}(p) - I^{B}_{DF'}(p)||_2^2,$$

where $||.||_2$ denotes the $L2$ norm which is the most commonly used loss function for high-accuracy image restoration tasks [30,31], $\alpha$, $\beta$ and $\gamma$ denote the weights of the R, G, B channels (we set $\alpha$= $\beta$ = $\gamma$ =1), and $p$ indicates the index of a pixel in the non-boundary image region $P$. Note the value of a pixel in the out-of-focus image depends on the distribution profile of its neighboring pixels in the corresponding in-focus image, thus we only calculate the differences for non-boundary pixels which can refer to enough neighboring pixels for robust defocus prediction. The loss function calculates the pixel-wise difference between the predicted $I_{DF'}$ and real-captured $I_{DF}$ in R, G, and B channels, which is utilized to update the weights and biases of the MC-RDN model using mini-batch gradient descent based on back-propagation.

4. Experiment results

We implement the MC-RDN model based on the Caffe framework and train this model on NVIDIA GTX 1080Ti with Cuda 8.0 and Cudnn 5.1 for 50 epochs. SGD solver is utilized to optimize the weights by setting $\alpha =0.01$ and $\mu =0.999$. The batch size is set to 32 and the learning rate is fixed to $1e-1$. We adopt the method described in [32] to initialize the weight parameters and set the biases to zeros. The source code of MC-RDN model will be made publicly available in the future.

4.1 Defocus kernel estimation

We compare our proposed MC-RDN model with state-of-the-art defocus kernel estimation methods qualitatively and quantitatively. Firstly, we consider two parametric methods that minimize the Normalized Cross-Correlation (NCC) between predicted and real-captured defocused images using Gaussian kernel (Gauss-NCC [9]) and circular disk (Disk-NCC [14]). Moreover, we consider a non-parametric defocus kernel estimation method (Non-para [15]), which deploys a calibration chart with five circles in each square to capture how step-edges of all orientations are blurred. Non-parametric kernels can accurately describe complex blurs, while their high dimensionality hinders understanding of the relationship between the defocus kernel shape and the setting of optical systems (e.g., spatial locations or projection distances). Therefore, Kee et al. also made use of a 2D Gaussian distribution to reduce the dimensionality and model the complex 2D defocus kernel shape (2D-Gauss [15]). Source codes of these hand-crafted methods are either publicly available or re-implemented according to the original papers.

Firstly, we evaluate the performance of different defocus kernel estimation methods at a number of projection positions where the training/calibration images are available (50cm, 60cm, 70cm, 80cm, 100cm, 120cm, and 140cm positions in the sliding track). We adopt Peak signal-to-noise-ratio (PSNR) and structural similarity index (SSIM) [33] as the evaluation metrics. Table 1 summarizes the quantitative results. It is observed that our MC-RDN model surpasses all of the previous hand-crafted methods in terms of PSNR and SSIM values. This deep learning-based approach constructs a more comprehensive model to accurately depict blurring effects of an out-of-focus projector, achieving significantly higher PSNR and SSIM values compared with the parametric methods (Gauss-NCC [9], Disk-NCC [14], and 2D-Gauss [15]). Our MC-RDN model also performs favorably compared with the non-parametric method based on high dimensionality representations (Non-para [15]). A noticeable drawback of the non-parametric method is that it requires to capture in-focus and out-of-focus calibration images at each projection positions to compute the optimal defocus kernels. In comparison, our proposed deep learning based-method is trained by utilizing image data captured at 7 fixed depths (50cm, 60cm, 70cm, 80cm, 100cm, 120cm, and 140cm positions in the sliding track), then it can adaptively compute defocus kernels at various projection distances (e.g., 55cm, 65cm, 75cm, 90cm, 110cm, and 130cm positions). Some comparative results with state-of-the-art defocus kernel estimation methods are shown in Fig. 9. Our method can more accurately predict blurring effects, providing important prior information for defocus compensation and depth-of-field extension.

Fig. 9. The predicted defocused images in the 50cm position using Gauss-NCC [9], Disk-NCC [14], 2D-Gauss [15], Non-para [15], and our MC-RDN model. Please zoom in to check details highlighted in red bounding box.

Download Full Size | PDF

Table 1. Quantitative evaluation results at a number of projection positions where the training/calibration images are available. $\color{red}{\textrm{Red}}$ and $\color{blue}{\textrm{blue}}$ indicate the best and the second-best performance, respectively.

View Table | View all tables in this article

We also evaluate the performance of defocus kernel estimation without referring to the training/calibration images. The parametric methods (Gauss-NCC [9], Disk-NCC [14], and 2D-Gauss [15]) firstly calculate the defocus model at a number of fixed depths and then interpolate model parameters between measurement points. In comparison, our proposed MC-RDN model implicitly learns the characteristics of defocus kernels at a number of fixed depths and predicts defocus kernels at various projection distances. Note the second-best performing non-parametric method is not applicable in this case since calibration images are not provided in these projection positions. Experimental results in Table 2 demonstrate that our proposed method exhibits better generalization performance compared with these parametric methods, predicting more accurate defocused images at very different projection distances (between 50cm and 140cm).

Table 2. Quantitative evaluation results at a number of projection positions where the training/calibration images are not provided. $\color{red}{\textrm{Red}}$ and $\color{blue}{\textrm{blue}}$ indicate the best and the second-best performance, respectively.

View Table | View all tables in this article

4.2 Out-of-focus blur compensation

We further demonstrate the effectiveness of the proposed deep learning-based defocus kernel estimation method for minimizing the out-of-focus image blurs. We adopt the algorithm presented by Zhang et al. [8] to compute a pre-conditioned image $I^{*}$ which is most closely matches the in-focus image $I_{IF}$ after defocusing. The computation of $I^{*}$ is achieved through a constrained minimization problem as

(9)$$I^{*} =arg\ {\min _{I}} \left\{ \sum_{p\in \Omega} \left(DF(I(p))+\phi - I_{IF}(p)\right)^2 \right\},$$

where $DF(I(p))$ denotes the predicted defocused version of a pre-conditioned image $I$, and $\phi$ is the background radiance which can be omitted in a completely dark environment [8]. Figure 10 shows some comparative results of out-of-focus blurring effect compensation using different defocus estimation methods. It is visually observed that the screen-projected image of $I^{*}$ computed using our deep learning-based method achieves better deblurring results compared with other alternatives. More accurate defocus kernel estimation results lead to restoring sharper and clearer textures and structural edges, suppressing undesirable artifacts, and producing higher PSNR and SSIM values. The experimental results demonstrate that our proposed MC-RDN model provides a promising solution to extend the depth-of-field of a digital projector without modifying its optical system.

Fig. 10. Some comparative results of out-of-focus blurring effect compensation in the 50cm position using Gauss-NCC [9], Disk-NCC [14], 2D-Gauss [15], Non-para [15], and our MC-RDN model. Please zoom in to check details highlighted in red bounding box.

Download Full Size | PDF

5. Conclusion

In this paper, we attempt to solve the challenging defocus kernel estimation problem through a deep learning-based approach. For this purpose, we firstly construct a dataset that contains a large number of well-aligned in-focus, out-of-focus, and depth images. Moreover, we present a multi-channel residual CNN model to estimate the complex blurring effects presented in the screen-projected images captured at different spatial locations and depths. To the best of our knowledge, it is the first research work to construct a dataset for defocus analysis and reveals that the complex out-of-focus blurring effects can be accurately learned from a number of training image pairs instead of being hand-crafted as before. Experiments have verified the effectiveness of the proposed approach. Compared with state-of-the-art defocus kernel estimation methods, it can generate more accurate defocused images, thus lead to better compensation of undesired out-of-focus image blurs.

Funding

National Natural Science Foundation of China (51575486, 51605428).

Disclosures

The authors declare no conflicts of interest.

References

1. Y. Wang, H. Zhao, H. Jiang, and X. Li, “Defocusing parameter selection strategies based on PSF measurement for square-binary defocusing fringe projection profilometry,” Opt. Express 26(16), 20351–20367 (2018). [CrossRef]

2. T. Hoang, B. Pan, D. Nguyen, and Z. Wang, “Generic gamma correction for accuracy enhancement in fringe-projection profilometry,” Opt. Lett. 35(12), 1992–1994 (2010). [CrossRef]

3. Z. Wang, H. Du, and H. Bi, “Out-of-plane shape determination in generalized fringe projection profilometry,” Opt. Express 14(25), 12122–12133 (2006). [CrossRef]

4. A. Doshi, R. T. Smith, B. H. Thomas, and C. Bouras, “Use of projector based augmented reality to improve manual spot-welding precision and accuracy for automotive manufacturing,” The Int. J. Adv. Manuf. Technol. 89(5-8), 1279–1293 (2017). [CrossRef]

5. A. E. Uva, M. Gattullo, V. M. Manghisi, D. Spagnulo, G. L. Cascella, and M. Fiorentino, “Evaluating the effectiveness of spatial augmented reality in smart manufacturing: a solution for manual working stations,” The Int. J. Adv. Manuf. Technol. 94(1-4), 509–521 (2018). [CrossRef]

6. M. Di Donato, M. Fiorentino, A. E. Uva, M. Gattullo, and G. Monno, “Text legibility for projected augmented reality on industrial workbenches,” Comput. Ind. 70(1), 70–78 (2015). [CrossRef]

7. M. S. Brown, P. Song, and T.-J. Cham, “Image pre-conditioning for out-of-focus projector blur," in Proceedings of the IEEE conference on computer vision and pattern recognition (IEEE, 2006), pp. 1956–1963.

8. L. Zhang and S. Nayar, “Projection defocus analysis for scene capture and image display,” ACM Trans. Graph. 25(3), 907–915 (2006). [CrossRef]

9. J. Park and B.-U. Lee, “Defocus and geometric distortion correction for projected images on a curved surface,” Appl. Opt. 55(4), 896–902 (2016). [CrossRef]

10. H. Lin, J. Gao, Q. Mei, Y. He, J. Liu, and X. Wang, “Adaptive digital fringe projection technique for high dynamic range three-dimensional shape measurement,” Opt. Express 24(7), 7703–7718 (2016). [CrossRef]

11. J. Jurij, P. Franjo, L. Boštjan, and B. Miran, “2D sub-pixel point spread function measurement using a virtual point-like source,” Int. J. Comput. Vis. 121(3), 391–402 (2017). [CrossRef]

12. H. Du and K. J. Voss, “Effects of point-spread function on calibration and radiometric accuracy of CCD camera,” Appl. Opt. 43(3), 665–670 (2004). [CrossRef]

13. A. Mosleh, P. Green, E. Onzon, I. Begin, and J. M. P. Langlois, “Camera intrinsic blur kernel estimation: A reliable framework," in Proceedings of the IEEE conference on computer vision and pattern recognition (IEEE, 2015), pp. 4961–4968.

14. Y. Oyamada and H. Saito, “Focal pre-correction of projected image for deblurring screen image," in Proceedings of the IEEE conference on computer vision and pattern recognition (IEEE, 2007), pp. 1–8.

15. E. Kee, S. Paris, S. Chen, and J. Wang, “Modeling and removing spatially-varying optical blur," in Proceedings of the IEEE international conference on computational photography (IEEE, 2011), pp. 1–8.

16. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition (IEEE, 2016), pp. 770–778.

17. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014).

18. E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition (IEEE, 2015), pp. 3431–3440.

19. H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection," in Proceedings of the IEEE conference on computer vision and pattern recognition (IEEE, 2015), pp. 5325–5334.

20. S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013). [CrossRef]

21. P. Molchanov, S. Gupta, K. Kim, and J. Kautz, “Hand gesture recognition with 3D convolutional neural networks," in Proceedings of the IEEE conference on computer vision and pattern recognition workshops (IEEE, 2015), pp. 1–7.

22. Q. Dou, H. Chen, L. Yu, L. Zhao, J. Qin, D. Wang, V. C. Mok, L. Shi, and P. Heng, “Automatic detection of cerebral microbleeds from MR images via 3D convolutional neural networks,” IEEE Trans. Med. Imaging 35(5), 1182–1195 (2016). [CrossRef]

23. E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study," in Proceedings of the IEEE international conference on computer vision workshops (IEEE, 2017), pp. 126–135.

24. D. Moreno and G. Taubin, “Simple, accurate, and robust projector-camera calibration," in Proceedings of the Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission (IEEE, 2012), pp. 464–471.

25. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks," in Proceedings of Advances in neural information processing systems (NIPS, 2015), pp. 91–99.

26. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks," in Proceedings of Advances in neural information processing systems (NIPS, 2012), pp. 1097–1105.

27. M. Trimeche, D. Paliy, M. Vehvilainen, and V. Katkovnic, “Multichannel image deblurring of raw color components," in Proceedings of the Computational Imaging III (International Society for Optics and Photonics, 2005), vol. 5674, pp. 169–178.

28. S. Ladha, K. Smith-Miles, and S. Chandran, “Projection defocus correction using adaptive kernel sampling and geometric correction in dual-planar environments," in Proceedings of the IEEE international conference on computer vision workshops (IEEE, 2011), pp. 9–14.

29. B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE conference on computer vision and pattern recognition workshops (IEEE, 2017), pp. 136–144.

30. H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Trans. Comput. Imaging 3(1), 47–57 (2017). [CrossRef]

31. Z. Yunlun, L. Kunpeng, L. Kai, W. Lichen, Z. Bineng, and F. Yun, “Image super-resolution using very deep residual channel attention networks," in Proceedings of the European conference on computer vision (Springer, 2018), pp. 286–301.

32. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proceedings of the IEEE international conference on computer vision (IEEE, 2015), pp. 1026–1034.

33. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]

	Gauss-NCC [9]	Disk-NCC [14]	2D-Gauss [15]	Non-para [15]	MC-RDN
	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM
50cm	39.7723 / 0.9846	43.6310 / 0.9903	42.5417 / 0.9872	$47.0362$ / $0.9929$	$47.2177$ / $0.9931$
60cm	43.3489 / 0.9895	44.3016 / 0.9905	44.4143 / 0.9904	$46.7617$ / $0.9920$	$47.0257$ / $0.9922$
70cm	43.4119 / 0.9899	43.2468 / 0.9889	43.4350 / 0.9896	$45.3623$ / $0.9913$	$45.6286$ / $0.9918$
100cm	44.1064 / 0.9890	42.7423 / 0.9869	43.6824 / 0.9885	$45.6018$ / $0.9900$	$45.8790$ / $0.9903$
120cm	42.5927 / 0.9881	42.1408 / 0.9877	43.0431 / 0.9886	$46.3118$ / $0.9918$	$46.5004$ / $0.9919$
140cm	43.5585 / 0.9894	41.4292 / 0.9867	43.7682 / 0.9894	$46.1575$ / $0.9918$	$46.3292$ / $0.9919$

	Gauss-NCC [9]	Disk-NCC [14]	2D-Gauss [15]	MC-RDN
	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM
55cm	40.8012 / 0.9865	$43.8516$ / $0.9905$	43.002 / 0.9885	$44.6171$ / $0.9908$
65cm	42.6663 / 0.9881	$43.6490$ / $0.9897$	43.5362 / 0.9894	$44.6703$ / $0.9908$
75cm	42.8719 / 0.9873	42.8486 / 0.9872	$44.2390$ / $0.9905$	$44.7161$ / $0.9913$
90cm	41.6013 / 0.9875	41.4206 / 0.9871	$42.9738$ / $0.9900$	$44.1329$ / $0.9913$
110cm	41.8079 / 0.9854	$42.5770$ / $0.9865$	42.0136 / 0.9850	$44.6218$ / $0.9884$
130cm	$43.6798$ / $0.9901$	42.0037 / 0.9885	43.6169 / 0.9889	$46.5375$ / $0.9926$

	Gauss-NCC [9]	Disk-NCC [14]	2D-Gauss [15]	Non-para [15]	MC-RDN
	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM
50cm	39.7723 / 0.9846	43.6310 / 0.9903	42.5417 / 0.9872	$47.0362$ / $0.9929$	$47.2177$ / $0.9931$
60cm	43.3489 / 0.9895	44.3016 / 0.9905	44.4143 / 0.9904	$46.7617$ / $0.9920$	$47.0257$ / $0.9922$
70cm	43.4119 / 0.9899	43.2468 / 0.9889	43.4350 / 0.9896	$45.3623$ / $0.9913$	$45.6286$ / $0.9918$
100cm	44.1064 / 0.9890	42.7423 / 0.9869	43.6824 / 0.9885	$45.6018$ / $0.9900$	$45.8790$ / $0.9903$
120cm	42.5927 / 0.9881	42.1408 / 0.9877	43.0431 / 0.9886	$46.3118$ / $0.9918$	$46.5004$ / $0.9919$
140cm	43.5585 / 0.9894	41.4292 / 0.9867	43.7682 / 0.9894	$46.1575$ / $0.9918$	$46.3292$ / $0.9919$

	Gauss-NCC [9]	Disk-NCC [14]	2D-Gauss [15]	MC-RDN
	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM	PSNR / SSIM
55cm	40.8012 / 0.9865	$43.8516$ / $0.9905$	43.002 / 0.9885	$44.6171$ / $0.9908$
65cm	42.6663 / 0.9881	$43.6490$ / $0.9897$	43.5362 / 0.9894	$44.6703$ / $0.9908$
75cm	42.8719 / 0.9873	42.8486 / 0.9872	$44.2390$ / $0.9905$	$44.7161$ / $0.9913$
90cm	41.6013 / 0.9875	41.4206 / 0.9871	$42.9738$ / $0.9900$	$44.1329$ / $0.9913$
110cm	41.8079 / 0.9854	$42.5770$ / $0.9865$	42.0136 / 0.9850	$44.6218$ / $0.9884$
130cm	$43.6798$ / $0.9901$	42.0037 / 0.9885	43.6169 / 0.9889	$46.5375$ / $0.9926$

Multi-channel residual network model for accurate estimation of spatially-varying and depth-dependent defocus kernels

Abstract

1. Introduction

2. Image acquisition system and dataset

2.1 Image acquisition

2.2 Image alignment

3. Deep learning-based defocus kernel estimation

3.1 Image patch-based learning

3.2 Network architecture

3.3 Network training

4. Experiment results

4.1 Defocus kernel estimation

4.2 Out-of-focus blur compensation

5. Conclusion

Funding

Disclosures

References

Cited By

Figures (10)

Tables (2)

Equations (9)

Optics Express