Full-resolution image restoration for light field images via a spatial shift-variant degradation network

Conghui Zhu; Yi Jiang; Yi Jiang; Yan Yuan; Lijuan Su; Xiaorui Yin; Deqian Kong

doi:10.1364/OE.506541

1. Introduction

Light field (LF) imaging is a computational imaging technique that is designed to capture four-dimensional (4D) spatial and angular information of the objects [1,2]. As a result, it has been used in many applications by processing the LF data, such as digital refocusing [3,4], depth estimation [5–7], and 3D image reconstruction [8–10].

The most popular mechanism for acquiring LF data is the LF camera (LFC) based on the micro-lens array (MLA) [11,12]. In contrast to the traditional camera, which only records the intensity of objects, a LFC uses the MLA to redistribute the light rays and record the angular information on the sensor, resulting in a decrease in spatial resolution. The poor spatial resolution of the LF images constrains its practical applicability. In order to address this issue, researchers have found effective methods in two ways: image restoration and LF spatial super-resolution (SR).

The two methods share a commonality that the spatial resolution of the target scene is improved, but there are still some differences in the recovery goals. In this paper, we define the target image as a full-resolution (FR) image. Its spatial resolution is the same as that obtained by a traditional camera. As shown in Fig. 1, the FR image passes through a telecentric-based LFC and generates a raw LF image. A spatial object point is represented by a group of S × S pixels (see the blue box), which are collectively covered by a micro-lens and are called macropixel. Thus, the raw LF image is also called a macropixel image (MPI). Each pixel records a unique angular information of the object point. Pixels at the same position under all microlenses can be extracted to form a sub-aperture image (SAI), which is also a single-view image. Then, the characteristics of SAI can be described as view-wise characteristics. The SAI of an LFC has a spatial resolution that is 1/S that of an FR image. Accordingly, the LF image restoration task aims to restore the same spatial resolution as a traditional camera for the target scene. LF spatial SR task aims to improve the spatial resolution of each SAI to obtain the high-resolution (HR) SAI.

Fig. 1. Schematic diagram of the image acquisition and reconstruction of the LF.

Download Full Size | PDF

In terms of image restoration, some degradation principles-based methods have been studied. Bishop et al. modeled the image formation process using geometric optics [13]. Moreover, Shroff et al. described the resolution decline as a degradation process of picture generation based on the scalar diffraction theory [14,15]. They provided the spatial-variant numerical matrix of plenoptic system response for performing deconvolution to recover FR images in simulation, which had the disadvantage of high computational complexity. Junker et al. [16] established a backpropagation model and implemented a reconstruction without object priors. The method that utilized the low-resolution depth information of scenes was also proposed to accomplish image recovery [17]. The LF images can be reshaped as an SAI array (i.e., multi-view images). The sub-pixel shifted information among views [18,19] has been utilized to increase the spatial resolution.

In terms of LF spatial SR, some conventional methods have been proposed. Wanner et al. [20] built a variational framework to increase both spatial and angular sampling rates of the LF images by employing the disparity maps. Alain et al. [21] presented an approach by combining the SR-BM3D filter and the LFBM5D denoising filter. The latter can be interpreted as an LF sparse coding operator and employed to handle an optimization problem to generate HR SAIs.

With the development of deep learning technology, deep convolutional neural networks (CNNs) have achieved notable performance in LF SR. Inspired by the single image SR (SISR) [22], Yoon et al. [23] first applied CNN for LF image processing to increase the number of SAIs and spatial resolution. However, it regarded the spatial SR and angular SR as two independent sub-missions and ignored the spatial correlation between multi-view images. Some works [20, 24–27] then make full use of LF multi-dimensional information to fill this gap. Yuan et al. [28] proposed a joint network that combined an epipolar plane image (EPI) enhancement module with an SISR CNN to preserve the geometric consistency among the super-resolved SAIs. Exploring the unique structure of the LF, Wang et al. [29] proposed a disentangling network to extract the domain-specific feature from different dimensions, which also performed well on spatial SR, angular SR, and disparity estimation. Considering the real-world degradation, Wang et al. [30] proved an LF degradation-adaptive network (LF-DAnet) to handle the Gaussian blur and noise. However, As described in [14,15], the system response of the LFC is spatial shift-variant under each microlens, so the invariant kernel used in the network is not accurate for modeling the real LF imaging system.

Recently, some works sought the transformer to improve the spatial resolution of SAIs. Liang et al. [31] designed a transformer-based network (LFT) for LF spatial SR to capture long-range spatial dependencies and incorporate the information in all views. Yoon [32] et al. proposed a detail-preserving transformer (DPT) structure, which guided the learning of SAI sequences in different directions by leveraging gradient maps of LF. However, these networks only employed the new technology but ignored the imaging degradation of the real LF.

In this paper, motivated by the superior learning ability of networks, we focus on addressing FR image restoration via embedding the spatial shift-variant characteristic of the LF images into the network. The observation model based on the scalar diffraction theory is established in Section 2. According to the analysis of the LF image formation, we construct the spatial shift-variant degradation (SSVD) kernel array of SAIs. These view-wise kernels are embedded into the network as priors to accomplish the FR image restoration. The detailed design is elaborated in Section 3. In Section 4, we compare our network with state-of-the-art methods and perform ablation studies to verify the effectiveness of the main components. Besides, extensive experiments on real-world scenes are also conducted to further demonstrate the ability to restore FR images at any in-focus and defocused distance. Finally, Section 5 concludes this paper.

2. System response model and image formation analysis

2.1 System layout

A telecentric LF imaging system is constructed to analyze the spatial response characteristic. The optical layout with the in-focus object is shown in Fig. 2. The fore optical system consisting of the aperture stop and the main lens is telecentric in the image space. Therefore, the chief ray of off-axis objects incident on the MLA is always parallel to the optical axis. Consequently, this structure eliminates the drift of the sub-image centroid at large field of view.

Fig. 2. The optical layout of a telecentric LFC, and the object is in focus. The blue lines represent that the object is on-axis, and the red lines represent the object is off-axis.

Download Full Size | PDF

The coordinates for the aperture stop, the main lens plane, the MLA plane, and the sensor are denoted as (ζ, η), (x, y), (u, v) and (t, w), respectively. All lenses in the LFC are assumed as ideal thin lenses to simplify the diffraction propagation calculation. The focal length of the main lens is F. The focal length f of each microlens is identical. We define the distance between the object and the aperture stop as z₁. The aperture stop is placed at the front focal plane of the main lens to construct a telecentric structure. The MLA is placed at the primary image plane of the main lens. Here, the object distance, denoted as (z₁+ F), and the image distance, z₂, satisfy the Gaussian formula expressed as 1/(z₁+ F) + 1/z₂= 1/F. The sensor and the MLA are coupled at a distance f. The light wavefront distributions on the aperture stop, main lens, MLA, and sensor are defined as u₁ (ζ, η), u₂ (x, y), u_m (u, v), and u_d (t, w), respectively.

In this paper, to avoid confusion with conventional image terminology, we refer to the response characteristics of the LFC as pupil image function (PIF) [15]. Different from the implicitly integral expression in the [15], we aim at the telecentric optical layout in Fig. 2 and derive a succinct and explicit expression, which is feasible for explaining in-focus and defocused cases.

2.2 System response model

Based on scalar diffraction theory, we first derive the field distribution of an in-focus object during the propagation.

2.2.1 In-focus model

On-axis case. As shown in Fig. 2, a point light source illuminates the aperture stop, on which the wavefront distribution is described as a unit spherical wave u₁ (ζ, η) at wavelength λ. The diameter of the aperture stop is D₁ and the corresponding pupil function is a common circular function P₁ (ζ, η). The light wave is assumed to propagate in the free space. According to the Fresnel diffraction theory [33], the wavefront distribution u₂(x, y) is given by:

(1)$${u_2}({x,y} )= \frac{1}{{j\lambda F}}\int\!\!\!\int {{u_\textrm{1}}({\zeta ,\eta } )\cdot {P_1}({\zeta ,\eta } )\cdot \exp \left\{ {\frac{{jk}}{{2F}}[{{{({x - \zeta } )}^2} + {{({y - \eta } )}^2}} ]} \right\}} d\zeta d\eta. $$

For the on-axis point source S₁ (see the blue lines in Fig. 2), whose coordinates (x₀, y₀) are (0, 0), the input field is given by u₁ (ζ, η) = exp [jk(ζ²+ η²) / (2z₁)]. The main lens collects light and forms a primary image on the front surface of the MLA. Omitting the limited aperture of the main lens and only considering its phase modulation t_main (x,y) = exp[-jk (x²+ y²)/(2F)], we obtain the wavefront distribution u_m(u, v) as Eq. (2).

(2)$${u_m}({u,v} )= \frac{1}{{j\lambda {z_2}}}\int\!\!\!\int {{u_2} \cdot {t_{main}} \cdot \exp \left\{ {\frac{{jk}}{{2{z_2}}}[{{{({x - u} )}^2} + {{({y - v} )}^2}} ]} \right\}dxdy}$$

According to Fourier optics [33], Eq. (2) can finally be simplified to the following equation:

(3)$${u_m}({u,v} )= \frac{1}{{j\lambda F}}{\cal F}\{{{P_1}({\zeta ,\eta } )} \}, $$

where ${\cal F}{\{\cdot\}}$ denotes the two-dimensional Fourier transform operator. The diffraction field of the imaging plane is the Fourier transform of the aperture stop.

Subsequently, the MLA modulates the primary image to the final response image collected by sensor pixels. Here, we first analyze the response covered by the on-axis microlens, which is paraxial imaging and has the pupil function P_ml(u, v) and the phase modulation function t_ml= exp[-jk(u²+ v²)/(2f)]. Accordingly, the wavefront distribution on the sensor modulated by the on-axis microlens is given by Eq. (4).

(4)$${u_d}({t,w} )\textrm{ = }\frac{1}{{j\lambda f}}\exp \left[ {\frac{{jk}}{{2f}}({{t^2} + {w^2}} )} \right] \cdot {\cal F}\{{{u_m} \cdot {P_{ml}}} \}$$

Furthermore, omitting the constant phase exp[jk (t²+ w²)/(2f)] and substituting Eq. (3) into Eq. (4), a convolution formation is given by the following Eq. (5), which can clearly describe the modulation effect of the fore optical system and the MLA.

(5)$${u_d}({t,w} )\textrm{ = }\frac{{ - 1}}{{{\lambda ^2}Ff}}{U_{eq}} \ast H_{MLA}^{0,0}$$

where the symbol ⁎ denotes convolution operation, and U_eq= P₁(-ζ, -η) is regarded as the equivalent input field. We define the inherent factor of the LFC as γ=- F/f, and then U_eq can be written with the coordinates (t, w) as shown in Eq. (6). The H0,0 MLA is the transfer function determined by the limited aperture of the micro-lens, where the superscript means that there is the central microlens marked by (0, 0) in the MLA. The transfer function H0,0 MLA also has an explicit representation.

(6)$$\left\{ {\begin{array}{l} {{U_{eq}}({t,w} )= {P_1}({\gamma t,\gamma w} )}\\ {H_{MLA}^{0,0}({t,w} )= 2\pi {b^\textrm{2}} \cdot jinc\left( {\frac{{2\pi b}}{{\lambda f}}\sqrt {{t^2} + {w^2}} } \right)} \end{array}} \right.$$

where jinc function is defined as jinc(x) = J₁(x)/x, J₁(x) is the first-order Bessel function, and b is the radius of the microlens. However, M × N microlenses are arranged in a square grid on the primary image plane in the LFC. Each microlens marked by (m, n) has the diameter D₂, the pupil function P_ml (u - mD₂, v - nD₂), and phase modulation function t_ml (u - mD₂, v - nD₂). Applying the displacement theorem and phase shift theorem of Fourier transform, the transfer function of an arbitrary microlens can be represented by the following equation.

(7)$$H_{MLA}^{m,n}({t,w} )= H_{MLA}^{0,0}({t - m{D_2},w - n{D_2}} )\cdot \exp \left( { - j2\pi \frac{{t \cdot m{D_2} + w \cdot n{D_2}}}{{\lambda f}}} \right)$$

where m and n are indices for microlenses, and the constant phase has been omitted. Besides, the transfer function of the MLA ${H_{MLA}}$ can be obtained by ${H_{MLA}}(t,w) = \sum\limits_{m,n} {H_{MLA}^{m,n}(t,w)}$.

Off-axis case. For the off-axis point source S₂ (see the red lines in Fig. 2), the input field on the aperture stop is given by u₁(ζ, η) = exp{jk[(ζ-x₀)²+ (η-y₀)²]/(2z₁)}. Omitting the constant phase exp[jk (x₀²+ y₀²)/(2z₁)], this is equivalent to adding a phase shift to the input field of the on-axis point source. And the imaging plane will produce a spatial shift. The wavefront distribution of the primary image plane is given by u_m(u - βx₀, v - βy₀), where β = -z₂/(z₁+ F) = -F/z₁ is defined as the magnification factor of the fore optical system. Then the equivalent input field can be rewritten as:

(8)$${U_{eq}}({t,w;{x_0},{y_0}} )= {P_1}({\gamma t,\gamma w} )\cdot \exp \left[ { - \frac{{jk}}{{{z_1}}}({\gamma t{x_0} + \gamma w{y_0}} )} \right]. $$

The PIF is determined by the square of the wavefront distribution u_d, and the complete response model of a telecentric LFC for any points at the object plane is given by the following equation.

(9)$$\begin{aligned} PIF({t,w;{x_0},{y_0}} )&= {|{{u_d}({t,w;{x_0},{y_0}} )} |^2}\\ &= {\left|{\frac{1}{{{\lambda^2}Ff}}{U_{eq}}({t,w;{x_0},{y_0}} )\ast {H_{MLA}}({t,w} )} \right|^2} \end{aligned}$$

2.2.2 Defocused model

When the object is defocused, the MLA no longer strictly coincides with the primary image plane of the object. Figure 3 illustrates the condition with a positive defocus amount Δz. The point source S₃ is positioned far away from the in-focus object plane, and the new object distance $({{{z^{\prime}}_1} + F} )$ and the image distance ${z^{\prime}_2}$ still satisfy the Gaussian formula expressed as $1/({{{z^{\prime}}_1} + F} )$ + 1/${z^{\prime}_2}$= 1/F.

Fig. 3. The optical layout of a telecentric LFC, and the object has a positive defocus distance Δz from the MLA. The blue lines represent that the object is on-axis.

Download Full Size | PDF

In this condition, the light has been converged before propagating to MLA. It is equivalent to light rays continuing to propagate a distance Δz in free space. Therefore, the corresponding defocus PIF is given as follows:

(10)$$PIF({t,w;{x_0},{y_0},\Delta z} )= {\left|{\frac{1}{{{\lambda^2}Ff}}[{{U_{eq}}({t,w;{x_0},{y_0}} ){H_{\Delta z}}({{f_t},{f_w}} )} ]\ast {H_{MLA}}({t,w} )} \right|^2}, $$

where ${H_{\Delta z}}({{f_t},{f_w}} )= \exp[ - j\pi \lambda \Delta z({{f_t}^2 + {f_w}^2} )]$, ${f_t} = {t / {(\lambda f)}}$, ${f_w} = {w / {(\lambda f)}}$, and $\Delta z = {z_2} - {z^{\prime}_2}$. When the primary image plane is behind the MLA, the distance Δz is negative. From Eq. (9), the LF measured by the sensor under in-focus conditions can be regarded as the convolution of U_eq and ${H_{MLA}}$ with a coefficient determined by wavelength and focal lengths. From Eq. (10), the defocus system response is equivalent to multiplying an additional phase on the equivalent input field U_eq.

2.3 Image formation analysis

2.3.1 Sub-image formation analysis

In the LFC, based on the above system response, we consider the imaging model of each microlens, respectively. The sub-image formation of an arbitrary microlens can be modeled as ${\mathbf{I}_d} = \mathbf{\Phi } \cdot {\mathbf{I}_o}$.

For an arbitrary microlens, we assume that the sampling number in object space is L × L, then the first sampling object I_o(x₀₁, y₀₁) is simply written as I_o_1,1, and the rest are denoted in the same way. All object samples corresponding to the microlens are vectorized to generate the column vector I_o with a size of L²× 1. And, I_d with a size of S²× 1 is the column vectorized representation of the sub-image. The size of the sub-image is S × S, and the response value at first pixel is denoted as I_d(t₁, w₁), which is simply written as I_d_1,1. $\mathbf{\Phi }$ is the imaging matrix, of which each column is a stretch of ${\mathbf{\Theta }_{{x_0},{y_0},\Delta z}} = PIF({t,w;{x_0},{y_0},\Delta z} )$. ${\mathbf{\Theta }_{{x_0},{y_0},\Delta z}}$ represents the system response at a fixed spatial position (x₀, y₀) with a defocus amount $\Delta z$. As a result, the imaging matrix can be expressed as

(11)$$\mathbf{\Phi } = \left[ {\begin{array}{ccc} {vec({{\mathbf{\Theta }_{1,1,\Delta z}}} )}& \ldots &{vec({{\mathbf{\Theta }_{L,L,\Delta z}}} )} \end{array}} \right]. $$

We define the element in ${\mathbf{\Theta }_{{x_0},{y_0},\Delta z}}$ as $\theta _{{x_0},{y_0}}^{t,w}$, where the superscript and the subscript represent the coordinates of the sensor pixels and the object points, respectively. In a similar manner of simplification, the scripts can be replaced by the indexes of object samples and sensor pixels. Hence, the imaging model of an arbitrary microlens can finally be rewritten as:

(12)$$\left[ {\begin{array}{c} {{I_{d1,1}}}\\ {{I_{d1,2}}}\\ \vdots \\ {{I_{d1,S}}}\\ {{I_{d2,1}}}\\ \vdots \\ {{I_{dS,S}}} \end{array}} \right] = \left[ {\begin{array}{ccccccc} {\theta_{1,1}^{1,1}}&{\theta_{1,2}^{1,1}}& \cdots &{\theta_{1,L}^{1,1}}&{\theta_{2,1}^{1,1}}& \cdots &{\theta_{L,L}^{1,1}}\\ {\theta_{1,1}^{1,2}}&{\theta_{1,2}^{1,2}}& \cdots &{\theta_{1,L}^{1,2}}&{\theta_{2,1}^{1,2}}& \cdots &{\theta_{L,L}^{1,2}}\\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ {\theta_{1,1}^{1,S}}&{\theta_{1,2}^{1,S}}& \cdots &{\theta_{1,L}^{1,S}}&{\theta_{2,1}^{1,S}}& \cdots &{\theta_{L,L}^{1,S}}\\ {\theta_{1,1}^{2,1}}&{\theta_{1,2}^{2,1}}& \cdots &{\theta_{1,L}^{2,1}}&{\theta_{2,1}^{2,1}}& \cdots &{\theta_{L,L}^{2,1}}\\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ {\theta_{1,1}^{S,S}}&{\theta_{1,2}^{S,S}}& \cdots &{\theta_{1,L}^{S,S}}&{\theta_{2,1}^{S,S}}& \cdots &{\theta_{L,L}^{S,S}} \end{array}} \right] \cdot \left[ {\begin{array}{c} {{I_{o1,1}}}\\ {{I_{o1,2}}}\\ \vdots \\ {{I_{o1,L}}}\\ {{I_{o2,1}}}\\ \vdots \\ {{I_{oL,L}}} \end{array}} \right]. $$

To analyze the sub-image formation process, we simulate the imaging matrix $\mathbf{\Phi }$ of the on-axis microlens as an example. All objects corresponding to the on-axis sub-image are sampled. The diameter of the aperture stop is 10 mm. The main lens, which is assumed to be thin and aberration-free, has a focal length of 60 mm. Each microlens has the same focal length of 0.54 mm and diameter of 0.09 mm. The simulation size of the sensor pixel is set to 9µm. The initial object distance is 1860 mm, then the image distance is 62 mm.

Figure 4 shows the imaging matrix $\mathbf{\Phi }$ of the in-focus objects. The horizontal axis represents the sampling index of the object plane, and the vertical axis represents the pixel index in the sub-image. The imaging matrix in Fig. 4 is highly symmetrical with minimal differences between columns. The in-focus MPI in Fig. 1 shows that the in-focus case produces less spatial details. The image reconstruction results in [14,15] also pointed out that in-focus recovery suffered from a dominant twin-image artifact and was easily affected by noise.

Fig. 4. The imaging matrix $\mathbf{\Phi }$ of the in-focus objects.

Download Full Size | PDF

The imaging matrices $\mathbf{\Phi }$ with defocus amounts $\Delta z = \textrm{ - }0.5\textrm{mm}$ and $\Delta z = 0.7\textrm{mm}$ are shown in Figs. 5(a) and 5(b), respectively. The object sampling interval is the same as that in Fig. 4, which means that more objects are imaged in different areas of the microlens. The MPI with the defocus amount $\Delta z ={-} 0.5\textrm{mm}$ is shown in Fig. 5(c). The enlargement of one sub-image is shown in the blue box and indicates that the defocused LF images have more details than in-focus images, which is helpful in restoring spatial resolution. As a result, we mainly use the defocused simulated images to restore the FR image in the following sections. However, the image matrix of the sub-image formation is too large to be directly implemented into the network. Therefore, we further analyze the SAI formation and construct the smaller view-wise kernel array.

Fig. 5. (a) and (b) show the imaging matrices $\mathbf{\Phi }$ with defocus amounts (a) Δz = -0.5 mm and (b) Δz = 0.7 mm, respectively. (c) is the MPI with the defocus amount Δz = -0.5 mm. One sub-image is enlarged and shown in the blue box.

Download Full Size | PDF

2.3.2 SAI formation analysis

In Eq. (12), the column of the $\mathbf{\Phi }$ matrix is the stretch of ${\mathbf{\Theta }_{{x_0},{y_0},\Delta z}}$, which represents the response distribution on the sensor at the specific object (x₀, y₀) with a defocus amount $\Delta z$. From another perspective, the row can be regarded as the weight coefficients of all object samples at the specific pixel (t, w). The row vector of $\mathbf{\Phi }$ can be reshaped into a square matrix, denoted as

(13)$${\mathbf{\kappa }_{t,w,\Delta z}} = \left[ {\begin{array}{ccc} {\theta_{1,1}^{t,w}}& \cdots &{\theta_{1,L}^{t,w}}\\ \vdots & \ddots & \vdots \\ {\theta_{L,1}^{t,w}}& \cdots &{\theta_{L,L}^{t,w}} \end{array}} \right]. $$

Its dimension L × L is determined by the defocus level $|{\Delta z} |$. The relationship of L and $|{\Delta z} |$ is given by

(14)$$L = 2\left\lfloor {\frac{{{D_1}({{z^{\prime}}_1} + F)}}{{2{{z^{\prime}}_1}{{z^{\prime}}_2}}} \times |{\Delta z} |\times \frac{1}{p}} \right\rfloor + \frac{{{D_2}}}{p}, $$

where $\lfloor\cdot \rfloor $ represents rounding down, and p is the size of the sensor pixel.

Besides, since the fore optical system is a linear system and the microlenses are arranged in a square, the system response is periodical among the microlenses. It means that the degradation is the same for each pixel within a given SAI. Therefore, each SAI can be obtained as the following equation:

(15)$${\mathbf{I}_{SAI}} = {\mathbf{\kappa }_{t,w,\Delta z}} \ast {\mathbf{I}_{FR}}, $$

where the matrix I_FR represents the complete FR image. The matrix I_SAI represents the corresponding SAI at a specific view (t, w). Its size is M × N, equal to the number of microlenses. ${\mathbf{\kappa }_{t,w,\Delta z}}$ is the degradation kernel for each SAI.

From the imaging matrixes in Fig. 4 and Fig. 5, we extract the row vectors of $\mathbf{\Phi }$ corresponding to the central (S/2) × (S/2) pixels and reshape them into kernel arrays, of which the visual results are illustrated in Figs. 6(a)-(c), respectively. Figures 6(a)-(c) are not drawn to the same scale. According to Eq. (14), the dimension of the kernel varies with the defocus level. Then, the dimension of each kernel in Fig. 6(a) is 10 × 10, while that in Fig. 6(b) is 18 × 18, and in 6(c) is 22 × 22. For any defocused case, the difference in kernel array means spatial shift-variant degradation (SSVD) among views.

Fig. 6. The visual results of kernel arrays. Taking S = 10 as an example, we display the center 5 × 5 kernels. (a) in-focus Δz = 0. (b) defocus amount Δz = -0.5 mm. (c) defocus amount Δz = 0.7 mm.

Download Full Size | PDF

3. Proposed network

In this section, we introduce a spatial shift-variant degradation network (SSVD-Net) based on the imaging characteristics of an LFC. The task of our network is to recover an FR image from the 4D LF data via embedding the model-driven SSVD kernels.

3.1 Overview

The overall architecture of the SSVD-Net is shown in Fig. 7(a), which mainly includes the kernel feature extraction block (KFE-Block), 4 basic groups, the refinement block, and an upsampling block. The KFE-Block is used to obtain the SSVD feature along the angular dimension, process the noise, and output the multiple degradation feature ${{{\cal K}}_{mul}} \in {{\mathbb R}^{a \times a \times h \times w}}$ (h and w denote the spatial resolution of the kernel and a denotes angular resolution). The input LF image is a 4D data ${{{\cal I}}_{LF}} \in {{\mathbb R}^{H \times W \times a \times a}}$ (H and W denote the spatial resolution of SAI), which goes through a 3 × 3 convolution layer to generate the shallow feature ${{{\cal I}}_{in}} \in {{\mathbb R}^{H \times W \times ( a \times a \times c) }}$ (c is the number of channels). The basic group aims to recover the image feature, which can be divided into two main parts: an SSVD-block and 4 disentangling blocks (Distg-blocks) [30]. The SSVD-block is specifically designed to handle the SAI degradation by simultaneously inputting the shallow feature ${{{\cal I}}_{in}}$ and the multiple degradation feature ${{{\cal K}}_{mul}}$. The Distg-block [30] further integrates the multiple-dimensional feature of the LF image. Before feeding into the refinement block, the output SAI feature of 4 basic groups is reshaped to a normal 2D image feature, which can be regarded as the LR version of the FR image. The refinement block is designed to preserve the entire image details, which consists of 4 residual blocks (Res-Blocks) and a 1 × 1 convolution layer. Finally, an upsampling block is used to restore the FR image from the refined feature. In this paper, the given datasets of RGB channels are first converted to YCbCr color space. The image restoration and evaluation are only performed on the Y channel.

Fig. 7. (a) Overall architecture of the proposed SSVD-Net. (b) Spatial shift-variant degradation block (SSVD-Block). (c) Residual block (Res-Block).

Download Full Size | PDF

3.2 Network blocks

3.2.1 Kernel feature extraction (KFE) block

As analyzed in Section 2, the kernel ${\mathbf{\kappa }_{t,w,\Delta z}} \in {{\mathbb R}^{L \times L}}$ reflects the mapping relationship from object space to image space at a certain defocus distance, so it can be used as a prior to recover the FR image. For any defocus case, the kernel array is rearranged into ${{{\cal K}}_{in}} \in {{\mathbb R}^{L \times L \times a \times a}}$, which is then fed into the KFE-Block to obtain the SSVD feature, as shown in Fig. 7(a).

In order to handle variant kernels for different SAIs, we employ a spectral-wise Transformer (ST), which is used for spectral reconstruction [34], to extract the kernel features. In the ST architecture, the angular dimension of kernels takes the place of the spectral dimension of multi-spectral images. The Multi-head Self-Attention (MSA) calculates the self-attention along the angular dimension by treating each angular channel as a token. After that, all view kernel features are rearranged into column vectors to form the output feature ${{{\cal K}}_{temp}} \in {{\mathbb R}^{a \times a \times (L \times L)}}$. It is important to note that different views have different output kernel features. Then, the various kernel features that are concatenated with noise goes through two fully connected layers with a LeakyReLU activation function between them to form multiple degradation feature ${{{\cal K}}_{mul}}$.

3.2.2 Basic group

The basic group is composed of an SSVD-block and 4 cascaded Distg-blocks [30]. As mentioned above, the degradation is the same for each pixel within the SAI, while differs between SAIs. Therefore, the SSVD-block is specifically designed to handle the view-wise degradation. As shown in Fig. 7(b), a group convolution is employed to process every angular feature in the SSVD convolution (SSVD-Conv) layer. The multiple degradation feature ${{{\cal K}}_{mul}}$ is divided into a × a groups, each of which is convolved with the corresponding angular feature to obtain the preliminarily recovered feature ${{{\cal I}}_{temp}} \in {{\mathbb R}^{H \times W \times (a \times a)}}$. This feature is then activated by a LeakyReLU layer. After that, a channel attention layer [35] is used to enhance and recalibrate the channel-wise feature representation. In order to accomplish the cross-channel information interaction, the feature is additionally fed into a 3 × 3 convolution layer and a 1 × 1 convolution layer.

The cascaded Distg-blocks [30] are employed to extract and fuse the multiple dimensional features of the LF images, including spatial feature, angular feature, and EPI feature in two directions. Finally, the shallow feature and the fused feature are summed to generate ${{\cal I}}_{basic}^1 \in {{\mathbb R}^{H \times W \times (a \times a \times c)}}$, where the superscript represents that it is the output of the first basic group.

3.2.3 Refinement block

The refinement block is designed to preserve the entire image information by inputting the image feature ${{{\cal I}}_{img}} \in {{\mathbb R}^{(a \times H) \times (a \times W) \times c}}$, which is the result of reshaping the feature ${{\cal I}}_{basic}^4$. The refinement block contains 4 Res-Blocks and a 1 × 1 convolution layer. As shown in Fig. 7(c), each Res-Block is composed of 2 convolution layers, each of which is followed by a LeakyReLU activation function for learning the nonlinearity.

3.2.4 Upsampling block

The refined feature is fed into an upsampling block to improve the spatial resolution by an upsampling factor s. The upsampling block that is composed of two convolution layers and a pixel shuffling layer finally recovers the FR image ${\hat{{{\cal I}}}_{FR}} \in {{\mathbb R}^{(s \times a \times H) \times (s \times a \times W)}}$.

4. Experiments

4.1 Datasets and implementation details

Datasets. Since there are no FR images corresponding to currently available public LF datasets, we use the HR datasets, which are commonly used for SISR, to generate the simulated LF datasets, according to the sub-image formation model established in Section 2.3. The HR image here is denoted as the FR image ${{{\cal I}}_{FR}}$, which is the ground truth (GT) for our network. They are placed at a defocus amount of 2 mm, and other simulation parameters are the same as those in Section 2.3. Accordingly, the sub-image is composed of 10 × 10 pixels in the LFC. The central 5 × 5 SAI array is input into the network, which combines with ${{{\cal I}}_{FR}}$ to create image pairs for training, validation, and testing.

The training sets include 800 training images in DIV2 K [36] and 2560 training images in Flick [37] and the corresponding LF images. The 100 validation images in dataset DIV2 K are used to evaluate the performance of our network. The datasets B100 [38], Manga109 [39], Set5 [40], Set14 [41], and Urban100 [42] are used as the test sets.

Kernel patterns. It is worth noting that the resolution of the simulated system response has to match the actual pixel size. In the training stage, only central 5 × 5 SAIs are used to restore the FR image. Therefore, the input SSVD kernel array is the corresponding 5 × 5 as shown in Fig. 8, and the dimension of each kernel is 46 × 46.

Fig. 8. The simulated SSVD kernel array with a defocus amount Δz = 2 mm.

Download Full Size | PDF

Training details. The input LF data are regarded as 4D tensors, including 2 spatial dimensions and 2 angular dimensions. In the training stage, each SAI of the LF data is cropped into small 32 × 32 patches. The corresponding FR images are cropped into size 64 × 64. The training datasets are randomly rotated by 90°, 180°, and 270° and flipped horizontally. Following [30], we set the noise level range as [0, 75]. The network is trained using the L₁ loss: ${L_1} = {{{{||{{{{\cal I}}_{FR}} - {{\hat{{{\cal I}}}}_{FR}}} ||}_1}} / N}$,where N is the pixel count of the images. The Adam optimizer [43] is employed with β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. The initial learning rate is set to 2 × 10⁻⁴ and reduced to half every 20 epochs, and the batch size is set to 2. In the validation stage, the noise-free and noise level 10 are used as the prior noise. We implement the network using the Pytorch framework with 2 NVIDIA 1080Ti GPUs. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [44] are used to evaluate the image restoration results. The higher the values, the better the restoration fidelity to the FR image.

4.2 Comparison with state-of-the-art methods

We compare our network with several state-of-the-art LF spatial SR methods, some of them based on the transformer (e.g. LFT [31] and DPT [32] and EPIT [47]), and some considering multiple degradations, (e.g. LF-DAnet [30]). The SISR method, such as MAnet + RRDB [45], and the traditional bicubic interpolation method [46] are also compared.

These LF spatial SR networks were designed to generate HR SAIs. In order to meet the goal of FR image restoration, we modify and retrain them by performing the same reshape operation on the output of these networks before upsampling. The SISR network is retrained by first reshaping the input 4D LF images to MPI to output comparable results. For a fair comparison, we also retrain the networks with the same noise condition as ours.

The datasets are tested in conditions of noise-free and noise level 10. The average PSNR and SSIM are computed for each test dataset with different methods. The quantitative results are shown in Table 1, and the best results are in bold.

Table 1. Average PSNR and SSIM are computed on different test datasets with different methods. The sign * means that the networks are modified and retrained.

View Table | View all tables in this article

In terms of the quantitative results for PSNR and SSIM, our network achieves competitive performance. In Fig. 9, we present the visual results obtained by different methods under the noise-free condition. Figure 10 shows the visual results at noise level 10. From the spatial resolution (equal to the visual performance), the spatial resolution of an SAI is only 1/10 of the FR image. Thus, for ease of display, the SAI is magnified 10 times. In addition, we obtain the 10× visual results by inputting 5 × 5 SAIs with a 2× upsampling factor. Like SISR, the bicubic method obtains the recovered image based on the MPI. From the results in Table 1, the performance of the bicubic method is poor. In comparison, both LF spatial SR methods and SISR methods can obtain acceptable results. The poorer performance of the LFT* [31], DPT* [32] and EPIT* [47] may result from their ignoration of considering of the blur degradation. The LF-DAnet* [30] considers the Gaussian kernel, which cannot accurately represent the degradation of the LFC. The MA + RRDB [45] estimates the kernel from the MPI, however, ignores the structure feature of the LF data. Comparing with them, our method further considers the SSVD kernels of the LFC and the detail preservation of the entire image. As shown in Fig. 9, our method performs better in preserving the line and texture details of the restored FR image. The results of noisy LF images in Fig. 10 have verified that our method still has a certain anti-noise ability no matter at glyph boundaries or flat areas.

Fig. 9. Visual comparison with different networks on the simulation results under noise-free conditions. For the convenience of display, the SAI is magnified 10 times.

Download Full Size | PDF

Fig. 10. Visual comparison with different networks on the simulation results at noise level 10. For the convenience of display, the SAI is magnified 10 times.

Download Full Size | PDF

4.3 Ablation study

Our proposed network, denoted by Model 5, contains three main components, such as the SSVD kernel, the SSVD-Conv, and the refinement block. To verify the effectiveness of these components, we compare the following four variants.

Model 1: we replace the SSVD kernel with the isotropic Gaussian blur to demonstrate the superiority of the accurate system response. The size of the Gaussian kernel is set to L × L, and sigma is determined by the higher weight area (i.e. the sum of weight outward from the center point is larger than 0.95) of the center view kernel. Models 1 and 5 have the same parameters and FLOPs. As shown in Table 2, it is demonstrated that when the kernel does not match the LF system characteristic (Model 1), the PSNR decreases by 0.23 dB compared with Mode 5. Therefore, a prior LF SSVD kernel array is crucial.

Table 2. FLOPs and the number of parameters are calculated for different variants of our proposed network. The average PSNR and SSIM are evaluated on the validation dataset DIV2 K under noise-free conditions.^a

View Table | View all tables in this article

Model 2: In order to show the necessity of the prior kernel, we remove the SSVD kernel and directly input the noise level into the subsequent full connected layer to train Model 2. Comparing Models 2 and 5, the results indicate that the removal of the SSVD kernels causes a 0.3 dB drop in PSNR and a 0.0089 drop in SSIM under a similar computation budget (163.81 G).

Model 3: In order to show the effectiveness of SSVD-Conv based on group convolutions in processing view-wise degradation, we replace SSVD-Conv with a common convolution to train Model 3. The results in Table 2 show that by comparing Models 3 and 5, the SSVD-Conv achieves more accurate restored images with 0.13 dB improvement in PSNR and speeds up the training procedure with a smaller computation budget (163.87 G).

Model 4: We train Model 4 by removing the refinement block to prove its effectiveness. Compared with Model 5, Model 4 has smaller FLOPs and parameters, but at the same time, it performs worse by a 0.33 dB drop in PSNR and 0.009 in SSIM. It can be verified that the refinement block improves the restoration quality by learning the entire image feature.

4.4 Results on real-world scenes

To further verify the practicality of the proposed network, we also conduct experiments on real-world scenes. We use a customized LFC [48] to capture the real LF images. The main lens is a Nikon lens with a focal length of 50 mm. The MLA is made by Advanced Microoptic Systems Gmbh, and the microlens has the focal length of 0.54 mm. The sensor is a Bobcat IGV-4020 (Imperx Inc.), which has 4032 × 2688 pixels, and the pixel size is 9µm.

The LFC is initially set up to image an in-focus limited target. Then, the experimental pictures are printed and placed at the position with a defocus amount Δz = 2.0 mm. According to the actual parameters, we retrain the network to process the real-world scenes. Since there is no ground truth, the noise level from 0 to 15 with a step 5 is set to seek the best visual result. The kernel array is simulated and fed into the network. We label the original experimental picture as ‘Reference’, which is different from the aforementioned ‘GT’. Figure 11 presents the visual performance of two restored scenes. The first picture is from the dataset B100 [38], and the second picture is from the dataset Manga 109 [39]. The results demonstrate that our network has great performance in improving the spatial resolution and restoring the image detail on real-world images.

Fig. 11. Visual performance of different methods on real-world scenes. Noise levels are set to 10 for both LF-DAnet* and Ours.

Download Full Size | PDF

Furthermore, we build a multi-focus scene with objects, which are placed at positions with different working distances. The working distance L_o is the distance from object to the front plane of the LFC. The experimental setup is shown in Fig. 12(a). The scene consists of a ‘Santa Claus’ (Δz = -2.0 mm, L_o = 478 mm), a ‘Snowman’ (in-focus, Δz = 0, L_o = 725 mm), and a target that is part of the ISO12233 chart (Δz = 1.1 mm, L_o = 1029 mm). The SSVD-Nets with different defocus amounts are applied to process the LF image taken by our LFC. The restored results are presented in Fig. 12(b)-(d). Figure 12 (b) shows that the network with a certain defocus amount Δz = -2.0 mm can efficiently restore the target with the same defocus amount as a clear FR image. The results in Figs. 12(c) and 12(d) also verify that our network is suitable for other defocused conditions.

Fig. 12. (a) shows the imaging setup of a multi-focus scene. (b)-(d) are the visual performances of applying our SSVD-Net with different defocus amounts on real-world scenes. The defocus amounts of the networks are (b) Δz = -2.0 mm, (c)Δz = 0, and (d) Δz = 1.1 mm, respectively.

Download Full Size | PDF

5. Conclusion

In this paper, we propose an SSVD-Net framework to restore the FR image from LF data by embedding the model-driven SSVD kernels. An explicit convolution model is first derived for the telecentric LFC based on the scalar diffraction theory, bridging the real degradation of the LF and the CNNs. The simulation analysis of the SAI formation has verified the SAI kernel is spatial shift-variant, and its dimension changes with the defocus level. Then, the SSVD-Net is proposed to handle this view-wise degradation for LF images. Extensive ablation studies have verified the effectiveness of our designs in the network. The SSVD kernel is more accurate than the Gaussian blur for restoring the FR image from the LF images. The SSVD-conv layer speeds up the training process. The refinement block can retrieve the details of an FR image and improve the restoration accuracy. Experiments comparing with the state-of-the-art methods have demonstrated that our network can achieve excellent performance.

Experiments on real-world scenes have demonstrated that our SSVD-Net can effectively restore the FR image with the simulated kernel array, providing the capability for practical applications. Finally, based on the results of the multi-focus restoration experiment, we can deduce that it is possible to recover the FR image at an arbitrary defocus amount. In the future, FR images can be restored without the need for camera parameters by creating an image restoration list in advance using a series of trained models along the focal stack direction. Furthermore, all-in-focus FR images can also be obtained by the fusion algorithms.

Funding

National Natural Science Foundation of China (No. 61635002); Fundamental Research Funds for the Central Universities.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. M. Levoy and P. Hanrahan, “Light field rendering,” in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH (1996), pp. 31–42.

2. S. J. Gortler, R. Grzeszczuk, R. Szeliski, et al., “The Lumigraph,” in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH (1996), pp. 43–54.

3. R. Ng, “Fourier slice photography,” ACM Trans. Graph. 24(3), 735–744 (2005). [CrossRef]

4. S. Ben Dayan, D. Mendlovic, and R. Giryes, “Deep Sparse Light Field Refocusing,” arXiv, arXiv:2009.02582 (2020). [CrossRef]

5. C. Kim, H. Zimmer, Y. Pritch, et al., “Scene reconstruction from high spatio-angular resolution light fields,” ACM Trans. Graph. 32(4), 1–12 (2013). [CrossRef]

6. W. Williem and I. K. Park, “Robust Light Field Depth Estimation for Noisy Scene with Occlusion,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR) (2016), pp. 4396–4404.

7. H. G. Jeon, J. Park, G. Choe, et al., “Depth from a Light Field Image with Learning-Based Matching Costs,” IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 297–310 (2019). [CrossRef]

8. M. Broxton, L. Grosenick, S. Yang, et al., “Wave optics theory and 3-D deconvolution for the light field microscope,” Opt. Express 21(21), 25418–25439 (2013). [CrossRef]

9. G. E. Lott, M. A. Marciniak, and J. H. Burke, “Three-dimensional imaging of trapped cold atoms with a light field microscope,” Appl. Opt. 56(31), 8738–8745 (2017). [CrossRef]

10. M. Feng, S. Z. Gilani, Y. Wang, et al., “3D face reconstruction from light field images: A model-free approach,” in Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 508–526.

11. E. H. Adelson and J. Y. A. Wang, “Single Lens Stereo with a Plenoptic Camera,” IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 99–106 (1992). [CrossRef]

12. R. Ng, M. Levoy, M. Bredif, et al., “Light Field Photography with a Hand-held Plenoptic Camera,” Stanford Tech. Rep. CTSR 2005-02 1–11 (2005).

13. T. E. B. S. Z. P. Favaro, “Light Field Superresolution,” in 2009 IEEE International Conference on Computational Photography (ICCP) (2009), pp. 1–9.

14. S. A. Shroff and K. Berkner, “Defocus analysis for a coherent plenoptic system,” in Frontiers in Optics 2011 (2011), p. FThR6.

15. S. A. Shroff and K. Berkner, “Image formation analysis and high resolution image reconstruction for plenoptic imaging systems,” Appl. Opt. 52(10), D22–D31 (2013). [CrossRef]

16. A. Junker, T. Stenau, and K.-H. Brenner, “Scalar wave-optical reconstruction of plenoptic camera images,” Appl. Opt. 53(25), 5784–5790 (2014). [CrossRef]

17. E. Sahin, V. Katkovnik, and A. Gotchev, “Super-resolution in a defocused plenoptic camera : a wave-optics-based approach,” Opt. Lett. 41(5), 998–1001 (2016). [CrossRef]

18. T. Georgiev and A. Lumsdaine, “Superresolution with Plenoptic 2 . 0 Cameras,” in Signal Recovery and Synthesis 2009 (2009), p. STuA6.

19. S. Zhou, Y. Yuan, L. Su, et al., “Multiframe super resolution reconstruction method based on light field angular images,” Opt. Commun. 404, 189–195 (2017). [CrossRef]

20. S. Wanner and B. Goldluecke, “Variational light field analysis for disparity estimation and super-resolution,” IEEE Trans. Pattern Anal. Mach. Intell. 36(3), 606–619 (2014). [CrossRef]

21. M. Alain and A. Smolic, “Light field super-resolution via LFBM5D sparse coding,” in Proceedings of International Conference on Image Processing (ICIP) (IEEE, 2018), pp. 2501–2505.

22. C. Dong, C. C. Loy, K. He, et al., “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2015). [CrossRef]

23. Y. Yoon, H. G. Jeon, D. Yoo, et al., “Learning a deep convolutional network for light-field image super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision Workshop (ICCVW) (IEEE, 2015), pp. 57–65.

24. G. Wu, Y. Liu, L. Fang, et al., “Light field reconstruction using convolutional network on EPI and extended applications,” IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1681–1694 (2019). [CrossRef]

25. J. Jin, J. Hou, J. Chen, et al., “Light field spatial super-resolution via deep combinatorial geometry embedding and structural consistency regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), pp. 2260–2269.

26. J. Jin, J. Hou, J. Chen, et al., “Deep coarse-to-fine dense light field reconstruction with flexible sampling and geometry-aware fusion,” IEEE Trans. Pattern Anal. Mach. Intell. 44(4), 1819–1836 (2022). [CrossRef]

27. L. Su, Z. Ye, Y. Sui, et al., “Epipolar plane images based light-field angular super-resolution network,” in Seventh Asia Pacific Conference on Optics Manufacture and 2021 International Forum of Young Scientists on Advanced Optical Manufacturing (APCOM and YSAOM 2021) (2022), 12166, p. 121662 M.

28. Y. Yuan, Z. Cao, and L. Su, “Light-Field image superresolution using a combined deep CNN based on EPI,” IEEE Signal Process. Lett. 25(9), 1359–1363 (2018). [CrossRef]

29. Y. Wang, L. Wang, G. Wu, et al., “Disentangling light fields for super-resolution and disparity estimation,” IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 425–443 (2023). [CrossRef]

30. Y. Wang, Z. Liang, L. Wang, et al., “Learning a degradation-adaptive network for light field image super-resolution,” arXiv, arXiv:2206.06214 (2022). [CrossRef]

31. Z. Liang, Y. Wang, L. Wang, et al., “Light field image super-resolution with transformers,” IEEE Signal Process. Lett. 29, 563–567 (2022). [CrossRef]

32. S. Wang, T. Zhou, Y. Lu, et al., “Detail-preserving transformer for light field image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence (2022), 36(3), pp. 2522–2530.

33. Joseph W. Goodman, Introduction to fourier optics, (Roberts Co. Publ., 2005).

34. Y. Cai, J. Lin, Z. Lin, et al., “MST++: multi-stage spectral-wise transformer for efficient spectral reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops(CVPRW) (2022), pp. 744–754.

35. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation_networks,” in Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 7132–7141.

36. E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: dataset and study,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017), pp. 1122–1131.

37. R. Timofte, E. Agustsson, L. Van Gool, et al., “NTIRE 2017 challenge on single image super-resolution: methods and results,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017), pp. 1110–1121.

38. D. Martin, C. Fowlkes, D. Tal, et al., “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2001), 2, pp. 416–423.

39. K. Aizawa, A. Fujimoto, A. Otsubo, et al., “Building a manga dataset “manga109” with annotations for multimedia applications,” IEEE Multimed. 27(2), 8–18 (2020). [CrossRef]

40. M. Bevilacqua, A. Roumy, C. Guillemot, et al., “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in Proceedings of the British Machine Vision Conference 2012 (2012), p. 135.

41. R. Zeyde, M. Protter, and M. Elad, “On single image scale-up using sparse-representations,” in Curves and Surfaces (2010), pp. 711–730.

42. J. Bin Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 5197–5206.

43. D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR) (2015), p. 13.

44. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004). [CrossRef]

45. J. Liang, G. Sun, K. Zhang, et al., “Mutual affine network for spatially variant kernel estimation in blind image super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision (2021), pp. 4076–4085.

46. R. G. Keys, “Cubic convolution interpolation for digital image processing,” IEEE trans.Acoust.,Speech,Signal Process 29(6), 1153–1160 (1981). [CrossRef]

47. Z. Liang, Y. Wang, L. Wang, et al., “Learning non-local spatial-angular correlation for light field image super-resolution,” arXiv, arXiv:2302.08058 (2023). [CrossRef]

48. L. Su, Q. Yan, J. Cao, et al., “Calibrating the orientation between a microlens array and a sensor based on projective geometry,” Opt. Lasers Eng. 82, 22–27 (2016). [CrossRef]

Noise	Method	B100	M109	Set5	Set14	Urban100
0	Bicubic [46]	17.52/0.5305	17.78/0.6142	16.94/0.5713	17.53/0.5490	17.78/0.5241
	LFT* [31]	24.10/0.7786	26.06/0.8783	28.57/0.9176	25.13/0.8089	23.33/0.7675
	DPT* [32]	23.64/0.7619	25.08/0.8532	27.37/0.8988	24.47/0.7901	22.81/0.7440
	EPIT* [47]	23.91/0.7743	26.12/0.8802	28.90/0.9210	25.32/0.8087	23.36/0.7674
	LF-DAnet* [30]	24.35/0.7850	26.15/0.8793	29.18/0.9211	25.50/0.8149	23.48/0.7712
	MA + RRDB [45]	24.07/0.7807	26.15/0.8794	28.48/0.9186	25.17/0.8112	23.29/0.7674
	Ours	24.86/0.8133	28.00/0.9164	29.56/0.9367	26.13/0.8410	24.43/0.8106
10	Bicubic [46]	18.42/0.4940	17.91/0.5612	17.48/0.5273	18.02/0.4994	18.00/0.4802
	LFT* [31]	23.13/0.7349	24.56/0.8417	26.94/0.8917	24.05/0.7683	22.43/0.7288
	DPT* [32]	20.83/0.5939	23.57/0.8034	23.14/0.7375	22.13/0.6833	21.89/0.7010
	EPIT* [47]	23.00/0.7305	24.42/0.8386	27.14/0.8936	24.10/0.7658	22.31/0.7228
	LF-DAnet* [30]	23.24/0.7357	24.50/0.8399	27.27/0.8932	24.18/0.7693	22.41/0.7263
	MA + RRDB [45]	23.44/0.7493	25.10/0.8524	27.39/0.8994	24.43/0.7818	22.64/0.7369
	Ours	23.33/0.7543	25.48/0.8644	28.05/0.9119	24.49/0.7894	22.80/0.7475

Model	SSVD Kernel	SSVD convolutions	Refinement block	FLOPs	#Params	Valid dataset
Model	SSVD Kernel	SSVD convolutions	Refinement block	FLOPs	#Params	PSNR	SSIM
1	×	√	√	163.87G	5.19M	28.45	0.8833
2	-	√	√	163.81G	4.38M	28.38	0.8795
3	√	×	√	166.70G	5.25M	28.55	0.8858
4	√	√	-	133.46G	4.60M	28.35	0.8794
5	√	√	√	163.87G	5.19M	28.68	0.8884

Noise	Method	B100	M109	Set5	Set14	Urban100
0	Bicubic [46]	17.52/0.5305	17.78/0.6142	16.94/0.5713	17.53/0.5490	17.78/0.5241
	LFT* [31]	24.10/0.7786	26.06/0.8783	28.57/0.9176	25.13/0.8089	23.33/0.7675
	DPT* [32]	23.64/0.7619	25.08/0.8532	27.37/0.8988	24.47/0.7901	22.81/0.7440
	EPIT* [47]	23.91/0.7743	26.12/0.8802	28.90/0.9210	25.32/0.8087	23.36/0.7674
	LF-DAnet* [30]	24.35/0.7850	26.15/0.8793	29.18/0.9211	25.50/0.8149	23.48/0.7712
	MA + RRDB [45]	24.07/0.7807	26.15/0.8794	28.48/0.9186	25.17/0.8112	23.29/0.7674
	Ours	24.86/0.8133	28.00/0.9164	29.56/0.9367	26.13/0.8410	24.43/0.8106
10	Bicubic [46]	18.42/0.4940	17.91/0.5612	17.48/0.5273	18.02/0.4994	18.00/0.4802
	LFT* [31]	23.13/0.7349	24.56/0.8417	26.94/0.8917	24.05/0.7683	22.43/0.7288
	DPT* [32]	20.83/0.5939	23.57/0.8034	23.14/0.7375	22.13/0.6833	21.89/0.7010
	EPIT* [47]	23.00/0.7305	24.42/0.8386	27.14/0.8936	24.10/0.7658	22.31/0.7228
	LF-DAnet* [30]	23.24/0.7357	24.50/0.8399	27.27/0.8932	24.18/0.7693	22.41/0.7263
	MA + RRDB [45]	23.44/0.7493	25.10/0.8524	27.39/0.8994	24.43/0.7818	22.64/0.7369
	Ours	23.33/0.7543	25.48/0.8644	28.05/0.9119	24.49/0.7894	22.80/0.7475

Model	SSVD Kernel	SSVD convolutions	Refinement block	FLOPs	#Params	Valid dataset
Model	SSVD Kernel	SSVD convolutions	Refinement block	FLOPs	#Params	PSNR	SSIM
1	×	√	√	163.87G	5.19M	28.45	0.8833
2	-	√	√	163.81G	4.38M	28.38	0.8795
3	√	×	√	166.70G	5.25M	28.55	0.8858
4	√	√	-	133.46G	4.60M	28.35	0.8794
5	√	√	√	163.87G	5.19M	28.68	0.8884

Full-resolution image restoration for light field images via a spatial shift-variant degradation network

Abstract

1. Introduction

2. System response model and image formation analysis

2.1 System layout

2.2 System response model

2.2.1 In-focus model

2.2.2 Defocused model

2.3 Image formation analysis

2.3.1 Sub-image formation analysis

2.3.2 SAI formation analysis

3. Proposed network

3.1 Overview

3.2 Network blocks

3.2.1 Kernel feature extraction (KFE) block

3.2.2 Basic group

3.2.3 Refinement block

3.2.4 Upsampling block

4. Experiments

4.1 Datasets and implementation details

4.2 Comparison with state-of-the-art methods

4.3 Ablation study

4.4 Results on real-world scenes

5. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (12)

Tables (2)

Equations (15)

Optics Express