
Display performance optimization method for light field displays based on a neural network

Open Access

Abstract

Crosstalk between adjacent views, lens aberrations, and low spatial resolution in light field displays limit the quality of 3D images. In the present study, we introduce a display performance optimization method for light field displays based on a neural network. The method pre-corrects the encoded image from a global perspective, which means that the encoded image is pre-corrected according to the light field display results. The display performance optimization network consists of two parts: the encoded image pre-correction network and the display network. The former realizes the pre-correction of the original encoded image (OEI), while the latter completes the modeling of the display unit and realizes the generation from the encoded image to the viewpoint images (VIs). The pre-corrected encoded image (PEI) obtained through the pre-correction network can reconstruct 3D images with higher quality. The VIs are accessible through the display network. Experimental results suggest that the proposed method can reduce the graininess of 3D images significantly without increasing the complexity of the system. It is promising for light field displays since it can provide improved 3D display performance.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

With the rapid development of image processing and display technology, traditional two-dimensional (2D) displays can no longer meet human needs, and three-dimensional (3D) display technology has gained considerable attention [1]. 3D displays enhance the sense of immersion and have great application potential in many fields such as cultural relics exhibition, health care, military defense, and studio entertainment [2,3]. Existing true 3D display technologies can be generally divided into three types: holographic, volumetric, and light field displays. Holographic displays require coherent light to record 3D object information [4]. The large amount of data to be processed, the required computational power, and the transmission rate are challenging for holographic displays. Volumetric displays employ point sources to emit light at specific positions in space [5]. However, they are limited by many factors such as high system complexity and large size. Light field displays achieve a 3D effect by reconstructing the distribution of light rays of the 3D scene in space [6]. Compared to holographic and volumetric displays, light field displays are gaining considerable interest due to their lower system complexity and reduced data processing requirements [7,8].

Traditional light field displays consist of two processes: acquisition and reconstruction. In the acquisition process, the 3D scene information is captured from different angles, and the viewpoint images (VIs) are synthesized into an encoded image. Virtual acquisition with a computer is currently the main approach because of its ease of implementation and better quality compared to optical acquisition. Virtual acquisition can be realized using commercial software such as Maya, Blender, and 3ds MAX, and setting up a virtual camera array (VCA) with a programming language is also feasible [9]. In the reconstruction process, a lens array casts the light emitted from the light source of the display to specified positions, forming different VIs at different positions and thus realizing a 3D effect. Although light field displays are among the most promising 3D display technologies at present, the quality of their 3D images is limited by various factors such as low spatial resolution, crosstalk between adjacent views, and lens aberrations.

The performance optimization methods for light field displays have been studied broadly and can be generally categorized into traditional approaches and neural network-based approaches. In traditional approaches, improved optical components are designed to obtain better performance. A holographic functional screen (HFS) is typically used to enhance spatial resolution [10,11]. However, the HFS is usually placed at a distance from the lens array, which may lead to a thicker device for small light field displays. Similarly, owing to the volumetric optical low-pass characteristic of the transmissive mirror device (TMD), a TMD was used to interpolate the spacing between discrete 3D pixels in [12]. The authors in [13] implemented an integral imaging display system with extended depth of field and resolution using a TMD, a semi-transparent mirror, and two integral imaging display units. The degradation of image quality due to aberrations can usually be compensated by pre-filtering techniques. The researchers in [14] proposed a method to pre-correct the elemental image array (EIA) based on a pre-filtering function array to improve the quality of light field displays. Aspherical lenses with better performance also lead to better 3D displays [15]. By designing and optimizing vertically and horizontally placed compound lens arrays separately, a vertically spliced light field cave display system with extended depth and reduced aberrations was implemented in [16]. Light emitted from the liquid crystal display (LCD) panel that passes through non-corresponding lenses may cause crosstalk, which leads viewers to perceive overlapping 3D images and greatly degrades the visual experience. To prevent this, a collimated backlight source was fabricated to realize a low-crosstalk light field display in [17,18]. In addition, the authors of [18] analyzed the mechanism of crosstalk emergence and achieved further crosstalk suppression via pixel rearrangement and lens optimization.

In contrast, neural network-based approaches mainly aim at optimizing imaging quality by leveraging the powerful learning capability of neural networks. Owing to their ability to approximate arbitrarily complex functions and their excellent performance in fields related to computer vision [19,20], neural networks have also been widely used in computational imaging applications [21–23]. In [24], the researchers employed a specialized reconstruction network to recover images with high fidelity from blurry images captured by an aspheric lens. The authors in [25] proposed a fast differentiable ray tracing (FDRT) model to obtain the point spread function (PSF) of a single lens; they utilized the PSF and a Res-Unet to reconstruct images affected by lens aberrations and achieved an end-to-end single-lens design method. Unsurprisingly, neural network-based approaches have also been applied to light field displays. Convolutional neural networks (CNNs) were employed to correct aberrations in [26,27]. By modeling a diffractive optical element (DOE) with thickness as a variable based on Fourier optics, the joint optimization of the DOE and aberration correction was implemented in [27]. Since the severity of aberrations varies across field regions, the researchers in [28] proposed an aberration pre-correction method based on region selection, using Wiener filtering in regions with relatively slight aberrations and a neural network-based pre-correction method in regions with more severe aberrations. In [29], the authors employed a co-designed CNN to optimize the divergence angle of the light beam to smooth motion parallax. The authors in [30] also leveraged the complementary parallax information between different VIs to enhance the quality of a single view. In [31], DNN-based enhancement of the resolution of each reconstructed view image was proposed for spatial resolution enhancement, which formed additional effective visual pixels without increasing the resolution of the display panel.

In this paper, we present the idea of our method, the architecture of the proposed display performance optimization network, and the corresponding experimental results. Our main contributions can be summarized as follows.

  • 1. A neural network-based modeling method for the light field display is proposed, allowing the viewpoint images output from the network to be close to the actual display results.
  • 2. A neural network-based display performance optimization framework for light field displays is established, which pre-corrects the encoded images according to the display results from a global perspective.
  • 3. Driven by data and learning, display performance optimization is implemented without increasing system complexity.

2. Method

We propose a display performance optimization network consisting of a display network, which is the essential component of the full network, and a pre-correction network, which pre-corrects the original encoded image (OEI) to achieve a better 3D display. The diagram of the proposed method is shown in Fig. 1. The pre-correction network converts the OEI to a pre-corrected encoded image (PEI). The display network takes the PEI as input and generates the display VIs. Both networks utilize the encoder-decoder architecture, and the basic feature extractor is the Nonlinear Activation Free Block (NAFBlock), the block used in [32] for image restoration.

Fig. 1. Diagram of the proposed display performance optimization network for light field displays.

2.1 Encoder-decoder architecture and NAFBlock

The encoder-decoder architecture is a common network architecture in neural network-based approaches. As shown in Fig. 2, it is composed of an extraction path that captures features of the input image at different levels and a symmetric expansion path that reconstructs the output image. In the extraction path, features from low level to high level are gradually extracted as the resolution decreases while the number of channels is doubled. The previously extracted features are utilized in the expansion path to gradually reconstruct the image as the resolution increases while the number of channels is halved. During the recovery process, skip connections make it possible to fuse the features obtained in the feature extraction stage. In general, a neural network is built by stacking blocks such as the RestormerBlock [33], HINBlock [34], and NAFBlock [32]. We take the NAFBlock as the basic feature extractor due to its lower complexity and better performance. Figure 3 shows the structure of the NAFBlock. Layer normalization [35] is used in the NAFBlock for smooth training. In addition, simple channel attention (SCA) and the simple gate (SG) are used to achieve high computing efficiency.

Fig. 2. The encoder-decoder architecture.

Fig. 3. The structure of NAFBlock.

In the NAFBlock, the input feature map ${{\mathbf {X}}_{\mathbf {0}}}\in {{\mathbb {R}}^{\hat {H}\times \hat {W}\times \hat {C}}}$, where $\hat {H}\times \hat {W}$ denotes the spatial size and $\hat {C}$ is the number of channels, is first layer normalized. The NAFBlock then applies a $1 \times 1$ pointwise convolution to expand the feature channels, followed by a $3 \times 3$ depth-wise convolution to extract local features. Next, SG is used to introduce nonlinearity, and SCA computes attention over the channels to implicitly encode global contextual information. Another $1 \times 1$ pointwise convolution reduces the feature channels back to the input dimension, and the feature map $\mathbf {X}_{\mathbf {1}}$ is obtained. $\mathbf {X}_{\mathbf {1}}$ is again layer normalized, a $1 \times 1$ pointwise convolution expands the feature channels, SG implements the feature transformation, and finally another $1 \times 1$ pointwise convolution reduces the feature channels back to the original dimension, yielding the output feature map $\mathbf {X}_{\mathbf {2}}$.
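For reference, a minimal PyTorch sketch of such a block is given below. It follows the structure just described; the channel-expansion factor of the $1 \times 1$ convolutions and the omission of the learnable residual scaling used in the public NAFNet implementation are simplifying assumptions, not the exact configuration of our network.

```python
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """Layer normalization over the channel dimension of an NCHW tensor."""
    def __init__(self, c):
        super().__init__()
        self.norm = nn.LayerNorm(c)

    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class SimpleGate(nn.Module):
    """SG: split the channels in half and multiply the halves element-wise."""
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        return x1 * x2

class NAFBlock(nn.Module):
    """NAFBlock-style feature extractor (channel expansion factor of 2 assumed)."""
    def __init__(self, c, expand=2):
        super().__init__()
        self.norm1 = LayerNorm2d(c)
        self.conv1 = nn.Conv2d(c, c * expand, 1)               # 1x1 pointwise, expand channels
        self.dwconv = nn.Conv2d(c * expand, c * expand, 3,
                                padding=1, groups=c * expand)  # 3x3 depth-wise, local features
        self.sg1 = SimpleGate()                                 # nonlinearity, halves channels
        self.sca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(c * expand // 2, c * expand // 2, 1))  # simple channel attention
        self.conv2 = nn.Conv2d(c * expand // 2, c, 1)           # back to input channels -> X1

        self.norm2 = LayerNorm2d(c)
        self.conv3 = nn.Conv2d(c, c * expand, 1)                # expand again
        self.sg2 = SimpleGate()
        self.conv4 = nn.Conv2d(c * expand // 2, c, 1)           # back to input channels -> X2

    def forward(self, x0):
        x = self.dwconv(self.conv1(self.norm1(x0)))
        x = self.sg1(x)
        x = x * self.sca(x)                                     # channel-wise scaling
        x1 = x0 + self.conv2(x)                                 # first residual branch

        x = self.sg2(self.conv3(self.norm2(x1)))
        x2 = x1 + self.conv4(x)                                 # second residual branch
        return x2
```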

2.2 Display network

In order to pre-correct the encoded image according to the quality of the final displayed viewpoint images, the actual display unit must be modeled with a display network. Figure 4 illustrates our idea. The optical reconstruction process is shown in Fig. 4(a), which can be further described as Fig. 4(b): the encoded image passes through the display unit, and multiple VIs are obtained. We design a neural network to accomplish this task, as shown in Fig. 4(c), which uses the encoded image as input and the VIs as output. The display network is based on the encoder-decoder architecture and the NAFBlock. The encoder extracts multi-level features of the OEI, and the VIs are the outputs of the corresponding decoders. The architecture of the heterogeneous display network is shown in Fig. 5. The proposed display network is trained with actually collected images as the dataset.

Fig. 4. The schematic diagram of the modeling method for the light field display unit: (a) optical reconstruction process, (b) decomposition process, and (c) modeling the light field display unit based on neural network.

Fig. 5. The architecture of the display network: (a) network architecture, and (b) the instruction of each layer or block with different colors and lines.

Given an encoded image ${{\mathbf {I}}_{\mathbf {e}}}\in {{\mathbb {R}}^{H\times W\times 3}}$, where $H\times W$ is the spatial size, the display network first applies a convolutional layer to obtain low-level features ${{\mathbf {F}}_{\mathbf {d0}}}\in {{\mathbb {R}}^{H\times W\times C}}$, where $C$ is the number of channels, which is set to 32. Next, the features ${{\mathbf {F}}_{\mathbf {d0}}}$ pass through a 5-level heterogeneous encoder-decoder and are transformed into features ${{\mathbf {F}}_{\mathbf {d1}}}\in {{\mathbb {R}}^{H\times W\times C}}$. Each stage of the encoder-decoder contains multiple NAFBlocks. Starting from low-level features with high spatial size, the encoder gradually reduces spatial size, while expanding feature channels. The $n$ decoders take low-resolution features ${{\mathbf {F}}_{\mathbf {dl}}}\in {{\mathbb {R}}^{\frac {H}{16}\times \frac {W}{16}\times 16C}}$ as input and construct high-resolution VIs. For the downsampling and upsampling operations, we apply a convolutional layer with stride 2 and a pixel-shuffle operator with upscale factor 2, respectively. In the decoders, the input of each stage is a sum of upsampled feature maps of the output of the previous stage and feature maps with the same dimensions in the encoder. Finally, a convolutional layer is applied to generate VIs ${{\mathbf {I}}_{\mathbf {V1}}}\in {{\mathbb {R}}^{H\times W\times 3}}$ to ${{\mathbf {I}}_{\mathbf {Vn}}}\in {{\mathbb {R}}^{H\times W\times 3}}$.
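The following PyTorch sketch illustrates this single-encoder, $n$-decoder layout for the dimensions given above. The per-stage NAFBlock counts are taken from the configuration reported in Section 3.2.2; the remaining wiring (down-sampling by strided convolution, up-sampling by pixel shuffle, skip connections by element-wise summation) follows the description in this section, and NAFBlock refers to the sketch in Section 2.1.

```python
import torch.nn as nn

class DisplayNet(nn.Module):
    """Encoded image -> n viewpoint images: one shared encoder, n decoders."""
    def __init__(self, c=32, n_views=2, enc_blocks=(2, 2, 4, 8),
                 mid_blocks=12, dec_blocks=(2, 2, 2, 2)):
        super().__init__()
        self.intro = nn.Conv2d(3, c, 3, padding=1)              # low-level features F_d0
        self.encoders, self.downs = nn.ModuleList(), nn.ModuleList()
        ch = c
        for nb in enc_blocks:                                   # four down-sampling stages
            self.encoders.append(nn.Sequential(*[NAFBlock(ch) for _ in range(nb)]))
            self.downs.append(nn.Conv2d(ch, ch * 2, 2, stride=2))  # halve size, double channels
            ch *= 2
        self.middle = nn.Sequential(*[NAFBlock(ch) for _ in range(mid_blocks)])  # F_dl: H/16 x W/16 x 16C

        # n independent decoders that share the encoder features via skip sums
        self.ups, self.decoders, self.outs = (nn.ModuleList() for _ in range(3))
        for _ in range(n_views):
            ups, decs, dch = nn.ModuleList(), nn.ModuleList(), ch
            for nb in dec_blocks:
                ups.append(nn.Sequential(nn.Conv2d(dch, dch * 2, 1),
                                         nn.PixelShuffle(2)))   # 2x upsample, halve channels
                dch //= 2
                decs.append(nn.Sequential(*[NAFBlock(dch) for _ in range(nb)]))
            self.ups.append(ups)
            self.decoders.append(decs)
            self.outs.append(nn.Conv2d(dch, 3, 3, padding=1))   # generate I_Vi

    def forward(self, x):
        feat, skips = self.intro(x), []
        for enc, down in zip(self.encoders, self.downs):
            feat = enc(feat)
            skips.append(feat)                                  # keep features for skip sums
            feat = down(feat)
        feat = self.middle(feat)
        views = []
        for ups, decs, out in zip(self.ups, self.decoders, self.outs):
            f = feat
            for up, dec, skip in zip(ups, decs, reversed(skips)):
                f = dec(up(f) + skip)                           # upsample + same-scale encoder features
            views.append(out(f))
        return views                                            # [I_V1, ..., I_Vn]
```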

2.3 Pre-correction network

In order to optimize display performance for light field displays with image processing techniques, we propose a pre-correction network to process encoded images based on the encoder-decoder architecture and NAFBlock, intending to compensate for the deficiencies of the light field display unit. Figure 6 shows the architecture of the pre-correction network.

Fig. 6. The architecture of the pre-correction network: (a) network architecture, and (b) the instruction of each layer or block with different colors and lines.

Given an OEI ${{\mathbf {I}}_{\mathbf {O}}}\in {{\mathbb {R}}^{H\times W\times 3}}$, a convolutional layer is first utilized to obtain low-level features ${{\mathbf {F}}_{\mathbf {p0}}}\in {{\mathbb {R}}^{H\times W\times C}}$. Then, the features ${{\mathbf {F}}_{\mathbf {p0}}}$ go through a 5-level symmetric encoder-decoder, and features ${{\mathbf {F}}_{\mathbf {p1}}}\in {{\mathbb {R}}^{H\times W\times C}}$ are obtained. The encoder extracts features at different levels of the OEI, and the decoder takes them to generate a PEI ${{\mathbf {I}}_{\mathbf {P}}}\in {{\mathbb {R}}^{H\times W\times 3}}$.
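Structurally, the pre-correction network is the same encoder-decoder skeleton with a single, symmetric decoder branch. A compact sketch, reusing the DisplayNet sketch above with one decoder and one NAFBlock per stage (the block count stated in Section 3.3.2), might look as follows; treating it as a special case of the display network is our simplification, not a statement about the actual implementation.

```python
class PreCorrectionNet(DisplayNet):
    """OEI -> PEI: a 5-level symmetric encoder-decoder with a single output branch."""
    def __init__(self, c=32):
        super().__init__(c=c, n_views=1, enc_blocks=(1, 1, 1, 1),
                         mid_blocks=1, dec_blocks=(1, 1, 1, 1))

    def forward(self, oei):
        return super().forward(oei)[0]   # single output: the pre-corrected encoded image
```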

2.4 Display performance optimization network

To pre-correct the OEI from a global perspective and thus optimize the performance of light field displays, we propose a display performance optimization network built from the above display network and pre-correction network. The PEI generated by the pre-correction network is fed into the display network, and the corresponding VIs are obtained. We first train the display network; then, we use the high-quality VIs to guide the training of the pre-correction network, during which the parameters of the display network are kept fixed so that it reproduces the display results of the PEI.
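A minimal sketch of this two-stage scheme, assuming the DisplayNet and PreCorrectionNet sketches above, a hypothetical checkpoint file name, and a generic per-view loss function (the mixed loss defined in Section 2.5), is shown below.

```python
import torch

# Stage 1: the display network has already been trained on captured data; it is
# now frozen so that only the pre-correction network is optimized in Stage 2.
display_net = DisplayNet(n_views=2)
display_net.load_state_dict(torch.load("display_net.pth"))   # hypothetical checkpoint
display_net.eval()
for p in display_net.parameters():
    p.requires_grad = False                                   # keep the display model fixed

precorrect_net = PreCorrectionNet()
optimizer = torch.optim.AdamW(precorrect_net.parameters(), lr=1e-4)

def training_step(oei, ideal_views, loss_fn):
    """One optimization step of the pre-correction network (Stage 2)."""
    pei = precorrect_net(oei)                  # OEI -> PEI
    outputs = display_net(pei)                 # PEI -> predicted viewpoint images
    loss = sum(loss_fn(o, gt) for o, gt in zip(outputs, ideal_views)) / len(outputs)
    optimizer.zero_grad()
    loss.backward()                            # gradients flow through the frozen display network
    optimizer.step()
    return loss.item()
```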

2.5 Choice of the loss function for the neural network

The structural similarity index (SSIM) is a statistics-based image quality metric that is consistent with the human visual system (HVS). Building on SSIM, the multi-scale structural similarity index (MS-SSIM) evaluates image quality at multiple scales. A mixed loss function of MS-SSIM loss and L1 loss has been shown to achieve superior performance in image restoration [36]. The MS-SSIM loss maintains contrast in high-frequency regions of an image, while the L1 loss helps to correct color and luminance. The MS-SSIM loss and L1 loss are defined as:

$${{L}^{MS-SSIM}}(output,gt)=1-\frac{1}{N}\sum_{p\in P}{l_{M}^{{}}(p)\cdot \prod_{j=1}^{M}{cs_{j}^{{}}(p)}},$$
$${{L}^{{{\ell }_{1}}}}(output,gt)=\frac{1}{N}\sum_{p\in P}{\left| output(p)-gt(p) \right|},$$
where $p$ is the pixel index and $P$ is the image region, $output(p)$ and $gt(p)$ are the pixel values of the output image and the ground truth, respectively, $N$ is the number of pixels in the image, and $l_{M}$ and $cs_{j}$ are the luminance component at scale $M$ and the contrast and structure component at scale $j$, respectively. The mixed loss is as follows:
$${{L}^{mix}}(output,gt)=\alpha \cdot {{L}^{MS-SSIM}}(output,gt)+(1-\alpha )\cdot {{G}_{\sigma _{G}^{M}}}\cdot {{L}^{{{\ell }_{1}}}}(output,gt),$$
where ${{L}^{MS-SSIM}}$ and ${{L}^{{{\ell }_{1}}}}$ are the MS-SSIM loss and the L1 loss, respectively, $\alpha$ is the weight of the MS-SSIM loss, and ${{G}_{\sigma _{G}^{M}}}$ in the second term is the Gaussian coefficient at scale $M$.

The display network produces $n$ VIs, ${output}_{1}$ to ${output}_{n}$, through the mapping:

$${output}_{i}={{f}_{i}}(input),i=1,\ldots,n$$
where $input$ is the input image of the network, and ${f}_{i}$ is the function from the input to corresponding output.

The VI ${output}_{1}$ and the corresponding ground truth ${gt}_{1}$ can be used to compute one mixed loss ${{L}^{mix}}({output}_{1},{gt}_{1})$. Likewise, $n$ mixed losses can be obtained from the $n$ pairs of VIs and ground truths. We use the mean of the $n$ mixed losses as the loss function of the overall network:

$$Loss=\frac{1}{n}\sum_{i=1}^{n} {L}^{mix}\left ({output}_{i},{gt}_{i} \right )$$
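A sketch of this loss in PyTorch is given below. The MS-SSIM term uses the third-party pytorch_msssim package as an assumed stand-in (the implementation used in this work is not specified), the Gaussian weighting of the L1 term is omitted for brevity, and the inputs are assumed to be normalized to [0, 1]. The default weight matches the value of $\alpha$ chosen in Section 3.2.2.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import MS_SSIM   # third-party package, assumed here

class MixedLoss(torch.nn.Module):
    """Mixed loss: alpha * (1 - MS-SSIM) + (1 - alpha) * L1 (Gaussian weighting omitted)."""
    def __init__(self, alpha=0.025):
        super().__init__()
        self.alpha = alpha
        self.ms_ssim = MS_SSIM(data_range=1.0, channel=3)   # images in [0, 1], RGB

    def forward(self, output, gt):
        loss_ms_ssim = 1.0 - self.ms_ssim(output, gt)       # MS-SSIM loss term
        loss_l1 = F.l1_loss(output, gt)                     # L1 loss term
        return self.alpha * loss_ms_ssim + (1.0 - self.alpha) * loss_l1

def total_loss(outputs, gts, criterion):
    """Mean of the n per-viewpoint mixed losses."""
    return sum(criterion(o, g) for o, g in zip(outputs, gts)) / len(outputs)
```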

3. Experiments

In the experiment, a small light field display is modeled. The proposed display performance optimization network is implemented using PyTorch 1.12.0 and trained on a single NVIDIA RTX 4090 GPU (24 GB) with an AMD EPYC 9654 96-core CPU. The proposed method supports the multiple viewpoints of the light field display; however, without loss of generality, we consider two viewpoints in the experiment due to limited computing power.

3.1 Apparatus

Our display network aims to model an actual light field display unit. Hence, we need to use a camera to capture VIs at different positions. To collect these images, we use the professional mode of a smartphone camera to capture the display results from two angles. To make the color of the captured images as close as possible to the original color, the parameters are empirically set to ISO 500, a white balance (WB) of 5700 K, and a shutter speed of 1/100 s. In the experiment, the light field display being modeled is an X-real (X-079), with a 7.9-inch LCD panel, a resolution of $1536 \times 2048$, a brightness of 200 nit, and a viewing angle of 53$^{\circ }$. It can reconstruct 48 viewpoints when an encoded image is loaded. A light field content export plug-in is also provided, which can be used by software such as Blender to convert a 3D scene into an encoded image. The X-real display and the virtual acquisition process are shown in Fig. 7. Figure 7(a) shows the X-real display and Fig. 7(b) illustrates the virtual acquisition process with the light field plug-in. Once the virtual acquisition is completed, the light field plug-in synthesizes multiple viewpoint images into an encoded image, as shown in Fig. 7(c). The viewpoint images obtained from the virtual acquisition are shown in Fig. 7(d).

3.2 Experiment for the display network

3.2.1 Dataset for the display network training

Figure 8 depicts the creation of the dataset for display network training. As shown in Fig. 8(a), we use Blender 3.6 with the light field plug-in to obtain ideal VIs with horizontal parallax and the corresponding OEI. A high dynamic range (HDR) image is used to provide ambient light. As shown in Fig. 8(b), the OEI is then loaded into the display panel of the X-real, and the 3D images are captured from two different viewpoints using the camera of a smartphone. The display area is cropped from the captured images, and these cropped images are used to create the dataset. The resolution of both the OEI and the VIs is $1536\times 1536$. For the convenience of training, each image is cropped into patches of size $512\times 512$ with a stride of 256. Figure 9 shows partial images of the dataset; all images are randomly cropped to a size of $256\times 256$ to enhance the robustness and generalization of the network while augmenting the dataset, as sketched below. Each set of images contains one OEI and two VIs, and the dataset consists of 4500 sets of images, of which 80% are used for training and 20% for validation and testing.
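The patch-extraction step can be sketched as follows. The sliding-window cropping ($512\times 512$ patches with a stride of 256) and the joint random $256\times 256$ crop applied to an OEI patch and its two VI patches follow the description above; the PIL-based implementation and the function names are illustrative assumptions.

```python
import random
from PIL import Image

def crop_patches(img: Image.Image, patch=512, stride=256):
    """Slide a patch x patch window over the image with the given stride."""
    w, h = img.size
    return [img.crop((x, y, x + patch, y + patch))
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]

def random_joint_crop(oei, vi1, vi2, size=256):
    """Apply the same random crop to an OEI patch and its two VI patches."""
    w, h = oei.size
    x, y = random.randint(0, w - size), random.randint(0, h - size)
    box = (x, y, x + size, y + size)
    return oei.crop(box), vi1.crop(box), vi2.crop(box)
```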

Fig. 7. The light field display and the virtual acquisition process: (a) X-real, (b) the virtual acquisition, (c) the OEI, and (d) virtually captured VIs.

Fig. 8. The process of creating the dataset for display network training: (a) the generation of the OEI, and (b) the actual capture of the display results.

Fig. 9. Partial images with random cropping of the dataset for the display network.

3.2.2 Training details for the display network

In the experiment, the numbers of NAFBlocks used in the five stages of the encoder of the display network are 2, 2, 4, 8, and 12, while the numbers in the four stages of each of the two decoders are 2, 2, 2, and 2. More NAFBlocks means that more features can be extracted and better performance can be achieved. However, as the number of NAFBlocks increases, the performance improvement becomes limited while the training time increases considerably. The chosen configuration of NAFBlocks in the display network is therefore a trade-off between better results and faster training.

During training, the initial learning rate is set to $1\times 10^{-3}$ with a weight decay of $1\times 10^{-3}$ to prevent overfitting and improve the generalization of the model. The AdamW optimizer with the parameters ${{\beta }_{1}}=0.9$ and ${{\beta }_{2}}=0.9$ is used. With a CosineAnnealingLR scheduler, the learning rate gradually decreases to $1\times 10^{-7}$ as training continues. The mixed loss of MS-SSIM loss and L1 loss is utilized to measure the difference between the output images of the network and the ground truth. The weight $\alpha$ in the mixed loss is experimentally chosen to be 0.025 so that the two loss terms have comparable magnitudes and better results are obtained. Based on the prepared dataset and the available GPU memory, the batch size is set to 12. The number of epochs run in the experiment is 600, and the training loss curve gradually converges after about 500 epochs. Each epoch takes 86 seconds, and the total training time is 23 hours, 25 minutes, and 15 seconds. As shown in Fig. 10, the loss gradually decreases while the evaluation indices on the test dataset gradually increase during training.
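The corresponding optimizer and scheduler setup can be sketched as follows, reusing the DisplayNet and MixedLoss sketches from Section 2. The T_max value tied to the 600 training epochs and the train_loader are assumptions; eta_min matches the final learning rate quoted above.

```python
import torch

model = DisplayNet(n_views=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.9), weight_decay=1e-3)
# Cosine annealing from 1e-3 down to 1e-7 over the 600 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600, eta_min=1e-7)
criterion = MixedLoss(alpha=0.025)

for epoch in range(600):
    for oei, (gt1, gt2) in train_loader:              # batches of 256x256 patches, batch size 12
        out1, out2 = model(oei)
        loss = 0.5 * (criterion(out1, gt1) + criterion(out2, gt2))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                  # step the learning rate once per epoch
```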

Fig. 10. The curves during the display network training: (a) loss curve, and (b) evaluation indices curve.

3.2.3 Results for the display network

Once trained, the display network takes about 0.08 seconds to output the viewpoint images. The experimental results show that the proposed display network models the light field display unit well. When an encoded image is input to the display network, two VIs are obtained. Figure 11 shows the visualization results. In Fig. 11, columns (a), (b), and (c) present the output images, the local detail images, and the actually captured images for viewpoint 1; columns (d), (e), and (f) show the corresponding images for viewpoint 2. Here, the PSNR and SSIM between the output and the actually captured images are taken as the criteria for quality evaluation and are labeled in each set of images. The mean PSNR and SSIM between the output and the actually captured images are about 35.89 dB and 0.9687, respectively. The data show that the output images of the network are very similar to the corresponding actually captured images.

Fig. 11. Comparison of output images of the display network and the actually captured images: (a) output image for viewpoint1, (b) the details, (c) actually captured image for viewpoint1, (d) output image for viewpoint2, (e) the details, and (f) actually captured image for viewpoint2.

3.3 Experiment for the display performance optimization network

3.3.1 Dataset for the display performance optimization network training

For the viewpoints at which the images were captured in Section 3.2.1, we choose the best-matching ones among the 48 viewpoints of the virtual acquisition process to guide the pre-correction of the OEI in the display performance optimization network. Consistent with the procedure in Section 3.2.1, we crop the images into patches to create the dataset. Figure 12 shows partial images of the dataset.

Fig. 12. Partial images with random cropping of the dataset for the display performance optimization network.

3.3.2 Training details for the display performance optimization network

For the convenience of training, the number of NAFBlocks is set to 1 at each stage of the pre-correction network. The display performance optimization network utilizes the AdamW optimizer with an initial learning rate of $1\times 10^{-4}$. As training goes on, the learning rate gradually decreases to $1\times 10^{-7}$. Based on the available GPU memory, the batch size is set to 10. The number of epochs run in the experiment is 1000, and the loss curve of the network gradually converges after about 900 epochs. Each epoch takes 199 seconds, and the total training time is 73 hours, 58 minutes, and 35 seconds. As shown in Fig. 13, the loss gradually decreases while the evaluation indices on the test dataset gradually increase during training.

Fig. 13. The curves during the display performance optimization network training: (a) loss curve, and (b) evaluation indices curve.

3.3.3 Results for the display performance optimization network

Once trained, the display performance optimization network takes about 0.1 seconds to output the viewpoint images. Experimental results show that the proposed display performance optimization network is effective: when the PEI is used as the input to the display network, the output VIs have better quality. The different output images and the ideal VIs are shown in Fig. 14. In Fig. 14, columns (a), (b), and (c) show comparisons of images for viewpoint 1: column (a) is the case without pre-correction, column (b) shows the ideal VIs, and column (c) is the case with pre-correction. Columns (d), (e), and (f) present the corresponding images for viewpoint 2. The local details of the images are shown below the full images. It is clear that the VIs corresponding to the PEI exhibit almost no graininess compared to the VIs corresponding to the OEI. Here, the PSNR and SSIM are used to evaluate the quality of the images: the ideal VIs are used as reference images, and the output images of the display network with the OEI and the PEI as input are analyzed, respectively. The PSNR and SSIM are also labeled in each set of images. With the OEI as input, the mean PSNR and SSIM between the output images and the ideal VIs are about 23.57 dB and 0.6782, respectively. With the PEI as input, the mean PSNR and SSIM are about 36.31 dB and 0.9822, respectively. Thus, the proposed method can enhance the display quality of light field displays.

Fig. 14. Comparison of the output images with and without pre-correction: (a) output image for viewpoint1 without pre-correction, (b) virtually captured image for viewpoint1, (c) output image for viewpoint1 with pre-correction, (d) output image for viewpoint2 without pre-correction, (e) virtually captured image for viewpoint2, and (f) output image for viewpoint2 with pre-correction.

4. Conclusion

A neural network-based method to optimize the performance of light field displays is demonstrated in the present study. The method consists of two steps. First, a light field display unit is modeled with a display network: an encoded image is input to the display network and two VIs are obtained. Then, we establish a pre-correction network that produces a PEI capable of presenting better 3D images. The pre-correction network and the display network are connected to form the display performance optimization network. The display network is trained first, and the well-trained network with fixed parameters is connected to the end of the pre-correction network. The network parameters are optimized by reducing the mixed MS-SSIM and L1 loss between the output images of the network and the ground truth. The experimental results suggest that the VIs are of higher quality with the PEI as input to the display network. Using our method, it is possible to obtain a 3D display with significantly reduced graininess. This work is a preliminary exploration of the proposed method; in future work, we will utilize more viewpoints and better network architectures.

Funding

National Natural Science Foundation of China (61771220, 62271226).

Acknowledgment

The authors would like to acknowledge funding support from The National Natural Science Foundation of China.

Disclosures

The authors declare no conflicts of interest. This work is original and has not been published elsewhere.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. J. Geng, “Three-dimensional display technologies,” Adv. Opt. Photonics 5(4), 456–535 (2013). [CrossRef]  

2. J. Shi, W. Qiao, J. Hua, et al., “Spatial multiplexing holographic combiner for glasses-free augmented reality,” Nanophotonics 9(9), 3003–3010 (2020). [CrossRef]  

3. G. Li, D. Lee, Y. Jeong, et al., “Holographic display for see-through augmented reality using mirror-lens holographic optical element,” Opt. Lett. 41(11), 2486–2489 (2016). [CrossRef]  

4. Y. W. Zheng, D. Wang, Y. L. Li, et al., “Holographic near-eye display system with large viewing area based on liquid crystal axicon,” Opt. Express 30(19), 34106–34116 (2022). [CrossRef]  

5. K. Kumagai, S. Hasegawa, and Y. Hayasaki, “Volumetric bubble display,” Optica 4(3), 298–302 (2017). [CrossRef]  

6. X. Sang, X. Gao, X. Yu, et al., “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]  

7. X. Yu, H. Dong, X. Gao, et al., “360-degree directional micro prism array for tabletop flat-panel light field displays,” Opt. Express 31(20), 32273–32286 (2023). [CrossRef]  

8. J. Hua, E. Hua, F. Zhou, et al., “Foveated glasses-free 3d display with ultrawide field of view via a large-scale 2d-metagrating complex,” Light: Sci. Appl. 10(1), 213 (2021). [CrossRef]  

9. S. Xing, X. Sang, X. Yu, et al., “High-efficient computer-generated integral imaging based on the backward ray-tracing technique and optical reconstruction,” Opt. Express 25(1), 330–338 (2017). [CrossRef]  

10. C. Yu, J. Yuan, F. C. Fan, et al., “The modulation function and realizing method of holographic functional screen,” Opt. Express 18(26), 27820–27826 (2010). [CrossRef]  

11. J. Wen, X. Yan, X. Jiang, et al., “Integral imaging based light field display with holographic diffusor: principles, potentials and restrictions,” Opt. Express 27(20), 27441–27458 (2019). [CrossRef]  

12. H. L. Zhang, X. L. Ma, X. Y. Lin, et al., “System to eliminate the graininess of an integral imaging 3d display by using a transmissive mirror device,” Opt. Lett. 47(18), 4628–4631 (2022). [CrossRef]  

13. X. L. Ma, H. L. Zhang, R. Y. Yuan, et al., “Depth of field and resolution-enhanced integral imaging display system,” Opt. Express 30(25), 44580–44593 (2022). [CrossRef]  

14. W. Zhang, X. Sang, X. Gao, et al., “Wavefront aberration correction for integral imaging with the pre-filtering function array,” Opt. Express 26(21), 27064–27075 (2018). [CrossRef]  

15. S. Yang, X. Sang, X. Yu, et al., “162-inch 3d light field display based on aspheric lens array and holographic functional screen,” Opt. Express 26(25), 33013–33021 (2018). [CrossRef]  

16. X. Yu, H. Dong, X. Gao, et al., “Vertically spliced tabletop light field cave display with extended depth content and separately optimized compound lens array,” Opt. Express 32(7), 11296–11306 (2024). [CrossRef]  

17. L. Yang, X. Sang, X. Yu, et al., “A crosstalk-suppressed dense multi-view light-field display based on real-time light-field pickup and reconstruction,” Opt. Express 26(26), 34412–34427 (2018). [CrossRef]  

18. B. Liu, X. Sang, X. Yu, et al., “Analysis and removal of crosstalk in a time-multiplexed light-field display,” Opt. Express 29(5), 7435–7452 (2021). [CrossRef]  

19. C. Dong, C. C. Loy, K. He, et al., “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2015). [CrossRef]  

20. J. Chang, V. Sitzmann, X. Dun, et al., “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Sci. Rep. 8(1), 12324 (2018). [CrossRef]  

21. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6(8), 921–943 (2019). [CrossRef]  

22. B. Manifold, E. Thomas, A. T. Francis, et al., “Denoising of stimulated raman scattering microscopy images via deep learning,” Biomed. Opt. Express 10(8), 3860–3874 (2019). [CrossRef]  

23. M. Lyu, H. Wang, G. Li, et al., “Learning-based lensless imaging through optically thick scattering media,” Adv. Photonics 1(03), 1 (2019). [CrossRef]  

24. Y. Liu, C. Zhang, T. Kou, et al., “End-to-end computational optics with a singlet lens for large depth-of-field imaging,” Opt. Express 29(18), 28530–28548 (2021). [CrossRef]  

25. Z. Li, Q. Hou, Z. Wang, et al., “End-to-end learned single lens design using fast differentiable ray tracing,” Opt. Lett. 46(21), 5453–5456 (2021). [CrossRef]  

26. X. Yu, H. Li, X. Sang, et al., “Aberration correction based on a pre-correction convolutional neural network for light-field displays,” Opt. Express 29(7), 11009–11020 (2021). [CrossRef]  

27. X. Pei, X. Yu, X. Gao, et al., “End-to-end optimization of a diffractive optical element and aberration correction for integral imaging,” Chin. Opt. Lett. 20(12), 121101 (2022). [CrossRef]  

28. X. Su, X. Yu, D. Chen, et al., “Regional selection-based pre-correction of lens aberrations for light-field displays,” Opt. Commun. 505, 127510 (2022). [CrossRef]  

29. X. Yu, J. Li, X. Gao, et al., “Smooth motion parallax method for 3d light-field displays with a narrow pitch based on optimizing the light beam divergence angle,” Opt. Express 32(6), 9857–9866 (2024). [CrossRef]  

30. X. Xie, X. Yu, B. Fu, et al., “High-quality reproduction method for three-dimensional light-field displays using parallax-view information synthesis and aberration precorrection,” Opt. Lasers Eng. 173, 107930 (2024). [CrossRef]  

31. L. Yang and J. Shen, “Deep neural network-enabled resolution enhancement for the digital light field display based on holographic functional screen,” Opt. Commun. 550, 130012 (2024). [CrossRef]  

32. L. Chen, X. Chu, X. Zhang, et al., “Simple baselines for image restoration,” in European Conference on Computer Vision (Springer, 2022), pp. 17–33.

33. S. W. Zamir, A. Arora, S. Khan, et al., “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 5728–5739.

34. L. Chen, X. Lu, J. Zhang, et al., “Hinet: Half instance normalization network for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 182–192.

35. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv, arXiv:1607.06450 (2016). [CrossRef]  

36. H. Zhao, O. Gallo, I. Frosio, et al., “Loss functions for image restoration with neural networks,” IEEE Trans. Comput. Imaging 3(1), 47–57 (2016). [CrossRef]  
