High dimensional optical data — varifocal multiview imaging, compression and evaluation

Kejun Wu; Qiong Liu; Qiong Liu; Kim-Hui Yap; You Yang; You Yang

doi:10.1364/OE.504717

1. Introduction

The world is witnessing a dramatic change in the way that people experience it due to the rise of Metaverse. Social life and economic activities are redefined through the blurred boundary of digital and physical worlds [1,2]. The entire Metaverse relies on computational imaging and display technologies; for instance, depth of field (DoF) rendering [3], near-eye displays [4,5], free-viewpoint [6], and virtual reality (VR) to get rid of excessive constraints of the physical world. Rich vision information stored in high dimensional optical data, e.g., multiview and multifocal/varifocal help to mitigate the vergence-accommodation conflict (VAC), which is a long-standing problem that immersive displays have been suffering [7–9]. Google’s telepresence system “Project Starline” achieves a strong sense of presence by dense multiview [10], and Meta’s “Half-Dome” VR headsets adjust varifocal lenses for a comfortable immersive experience. Multiview images are generally captured with fixed focal settings, while multifocal/varifocal images are a stack of images that focused on serial depths in a fixed view [11–13].

Emerging high dimensional optical data, varifocal multiview (VFMV) is bringing new features for immersive experience. VFMV refers to observing 3D scenes from multiple views with variable focal planes to record more depth information shown in Fig. 1. VFMV images can be regarded as the conceptual integration of multiview and focal stack, which makes full use of their advantages to complement each other [14]. VFMV describes scenes in angular, spatial, and focal dimensions [15], whose complex imaging conditions involve dense viewpoints, high spatial resolutions, and variable focal planes. In industry, Amazon and Raytrix adopt camera array or microlens array to capture VFMV [16,17]. The focal information contained in VFMV images has a great prospect of benefiting Metaverse-related applications [1]. In academia, recent research has demonstrated the advantage of VFMV in extending DoF for computational displays, since the focal information contained in VFMV is richer than conventional multiview images [14]. Liu et al. [18] have employed focal stack to reconstruct a full light field. Zhao et al. [19] further designed a robot’s 3D visual sensing prototype by using focal information. However, such prospect of VFMV is at the cost of huge volumes of data. As the coupling of multiple views and varifocal, VFMV is highly redundant due to dense view sampling and irregular focusing changes among views, resulting in challenges of data compression. Moreover, the rise of artificial intelligence and large models make optical data more vulnerable to forgery, e.g., indistinguishable object removal or manipulation. It is reasonable to assume that attackers may manipulate high dimensional optical data at the data source and transmission stages for malicious display purposes in the application end. Thus, it is essential to investigate the forgery protection capability of VFMV data for faithful and secure displays. Challenges arise from a lack of urgent investigation specialized for emerging VFMV.

Fig. 1. Illustration of VFMV. It observes scenes from multiple views with different focal planes to record high-dimensional information.

Download Full Size | PDF

As an emerging data type, preliminary research on VFMV data compression rearranges VFMV in descending orders with excessive displacement [14], resulting in limited inter-view correlations. More researchers are dedicated to the compression of dense multiview, e.g., light field images. Gomes et al. [20] firstly separate the full light field multiview into three divisions and then compressed them independently for high random access capabilities. Similarly, Santos et al. [21] divide the dense multiview into four independent parts. Monteiro et al. [22] develop a scalable coding method, which assigns multiple layers to views of light field. Views in each layer are rearranged as pseudo-sequence video, and are compressed from low to high layers. These methods are not designed for VFMV; thus, VFMV data compression methods with high coding efficiency are urgently needed.

In view of the above-mentioned fundamental challenges, we aim to study the imaging characteristics of VFMV data, design efficient VFMV compression, and perform evaluations in this paper. Specifically, we first parameterize the high dimensional VFMV as a 4D representation model and conduct redundancy analysis by observing its imaging characteristics. We find that the irregular focusing distributions lead to difficulties in exploiting the redundancies of VFMV. Then, we propose a novel VFMV coding scheme based on view mountain-shape rearrangement (VMSR) and all-directional prediction structure (ADPS). The VMSR aims to smooth the focusing status changes among views and constrains excessive view displacements in the rearrangement; so that the inter-view correlations of the rearranged VFMV can be enhanced. The ADPS is to compress the rearranged VFMV by exploiting the enhanced correlations to achieve high coding efficiency. Finally, we conduct extensive experiments for quantitative, qualitative, complexity, and forgery protection evaluations, demonstrating that the proposed scheme outperforms comparison schemes. We preliminarily verify the potential of VFMV as a novel secure imaging format and forgery protector for optical data.

The remainder of this paper is organized as follows. Section 2. introduces the representation model and imaging analysis of VFMV images. The proposed compression scheme is presented in Section 3. Experiments are conducted in Section 4. Finally, we conclude our work in Section 5.

2. Dimension representation and imaging analysis

The VFMV records 3D scenes from multiple perspectives with variable focal planes. Thus, VFMV images describe 3D scenes in spatial, angular and focal dimensions. The schematic diagrams of VFMV formation are shown in Fig. 2. In VFMV capturing, the camera grid is formed by mounting digital cameras on the rails or fixing on an array plate. Camera at each position of the grid is focused on changed depths to cover different objects by varying focal planes. Thus, VFMV images are highly structured in spatial, angular and focal dimensions. In VFMV representation, the color of a view represents the focusing object at a certain depth of the scene. The focusing depths of the captured view change irregularly with distinct DoF, resulting in independent focused and defocused regions among views. Generally, multiview images with high-density arrangements also refer to as 4D light field images, which are commonly represented as the two-plane parameterization $\boldsymbol I(u,v,x,y)$. In the parameterization, $(u,v)$ describes the camera planes, while $(x,y)$ signifies the imaging planes. Whereas VFMV images additionally contain focal information. Accordingly, VFMV images can be parameterized as a special constrained 4D representation $\boldsymbol I(u,v,x_d,y_d)$. The $(u,v)$ indicates the angular position in the camera position grid, regardless of flat plane, hemispherical or spherical arrangements. The $(x_d,y_d)$ denotes the pixel spatial position in a certain view. The depth information $d$ is implicitly included in the 2D image of a view since 2D images are the projection of 3D scenes.

Fig. 2. VFMV formation and dimension representation: VFMV records scenes from multiple views with variable focal planes to cover different objects, thereby containing information in spatial, angular and depth dimensions.

Download Full Size | PDF

The arrangements and imaging characteristics of VFMV images are illustrated in Fig. 3. The dense views of VFMV are regularly arranged in horizontal and vertical directions; but with irregular and random blurriness among views. Specifically, the top left view is focused on the depth of fruits (near distance), the top middle view is focused on the pink packing box (intermediate distance), and the bottom left view is focused on the basketball (far distance). The close-ups in Fig. 3 exemplifies that a certain object appears different focusing status in different views. The focusing status of all views is inconsistent because views of VFMV system are focused at their respective focal planes shown in Fig. 1 and Fig. 2. Different from conventional multiview images, the distinctive features of VFMV are shaped by both parallaxes (view displacements) and irregular focusing changes, which are the main sources of redundancy in VFMV. It makes VFMV images more difficult to compress due to relatively low content similarity.

Fig. 3. VFMV imaging characteristics embody irregular focusing status changes among views in horizontal and vertical directions.

Download Full Size | PDF

As an emerging data type, VFMV images are highly structured and redundant in spatial, angular and focal dimensions. The distinctive redundancies of VFMV are caused by both dense view parallaxes and irregular focusing changes. The downstream vision applications at Metaverse user end require real-time interaction for immersive experience. Thus, VFMV images need to be efficiently compressed for the subsequent data transmission and displays. The efficient compression should take the distinctive redundancy of VFMV into account.

3. Efficient VFMV compression scheme

According to the analysis of VFMV formation and characteristics, the main redundancies of VFMV come from the irregular focusing status changes among dense views. Thus, we propose a specialized compression scheme for VFMV images. In the scheme, the proposed view mountain-shape rearrangement (VMSR) is to rearrange the irregular VFMV. A rearranged VFMV with mountain-shape focusing distributions can enhance the inter-view correlations. Then, the proposed all-directional prediction structure (ADPS) optimizes the prediction dependencies and exploits the enhanced correlations from all directions. Therefore, the proposed specialized scheme can efficiently compress VFMV featured by irregular focusing changes among views.

3.1 View mountain-shape rearrangement

Generally, smooth content changes and slight displacement between adjacent views are advantageous to inter-view correlations. Thus, we rearrange the irregular VFMV by view mountain-shape rearrangement (VMSR). The proposed VMSR measures the focusing status of VFMV by normalized Tenengrad gradient (TENG); and rearranged views in mountain-shape for high correlations. It enhances the correlations by forming a new mountain-shape focusing distribution and suppressing excessive view displacements simultaneously. An example of VMSR is shown in Fig. 4. First, focusing assessment is performed to measure the focusing intensity of each view. The TENG [23] is selected since its simplicity and solid performance [24,25]. The TENG gradient $T_{uv}$ of view $(u,v)$ is defined by:

(1)$$T_{uv} = \frac{1}{W * H} \sum_{x=1}^W \sum_{y=1}^H {\left(\left|\boldsymbol G_x \right|^2 + \left|\boldsymbol G_y\right|^2\right)},$$

where $\boldsymbol G_x$ and $\boldsymbol G_y$ stand for the gradient vectors in the horizontal and vertical directions in the pixel position $(x,y)$, respectively. The gradient vectors are computed by convolving the measured view with the Sobel gradient operator. The $W$ and $H$ indicate the image width and height. We normalize the TENG gradient to measure focusing intensity ${\boldsymbol{F}}(u,v)$ by:

(2)$${\boldsymbol{F}}(u,v)=\frac{T_{uv}-T_{\min}}{T_{\max}-T_{\min}},$$

(3)$$T_{\max} = \max_{u,v \in [1,M]}(T_{uv}), \quad T_{\min} = \min_{u,v \in [1,N]}(T_{uv}),$$

where the $T_{\max }$ and $T_{\min }$ denote the maximum and minimum TENG gradient among all views, respectively. The $M$ and $N$ stand for view numbers in the horizontal and vertical directions. The normalization brings all values of TENG gradient into the range [0,1]. The top right of Fig. 4 visualizes the irregular focusing intensity distributions of VFMV.

Fig. 4. The proposed VMSR: take one row of irregular VFMV with $9\times 9$ views for illustration. The focusing intensity is assessed and sorted in the order of “mountain-shape”. The highest intensity is placed in the central position, while the remaining leftmost 4 views and rightmost 4 views are sorted in ascending and descending orders. It enhances inter-view correlations by smooth/regular focusing distributions and suppressing excessive view displacements simultaneously.

Download Full Size | PDF

Then, we conduct row-wise VMSR for all views. In this paper, we constrain the view displacement to ensure that it does not exceed half the number of horizontal views. The view with the highest focusing intensity in a row is placed in the central position. For the remaining $M-1$ views, the leftmost $(M-1)/2$ views are sorted in ascending order and placed on the left of the central position, while the rightmost $(M-1)/2$ views are in descending order and placed on the right. In this way, a smooth “mountain-shape” (ascending-peak-descending) focusing distribution is generated, and the visualization is shown in the bottom right of Fig. 4. The ingenious sorting restricts the maximum view displacement to $(M-1)/2$, which avoids excessive parallax between the rearranged adjacent views. The global inter-view correlations can be theoretically depicted by focusing status change $\mathcal {S}$ and view displacement $\mathcal {D}$:

(4)$$\mathcal{S} = \frac{1}{M * N} \sum_{u=1}^M \sum_{v=1}^{N} \left(\left|\frac{\partial {\boldsymbol{F}}(u,v)}{\partial u}\right|+\left|\frac{\partial {\boldsymbol{F}}(u,v)}{\partial v}\right|\right),$$

(5)$$\mathcal{D} = \frac{1}{M * N} \sum_{u=1}^M \sum_{v=1}^{N} \left(\left|\hat{I}_{(u,v)} - I_{(u,v)}\right|\right),$$

where ${\boldsymbol{F}}(u,v)$ signifies the focusing intensity distributions of the rearranged VFMV. The focusing status changes $\mathcal {S}$ among views are represented by the partial derivatives of ${\boldsymbol{F}}(u,v)$ with respect to $u$ and $v$. The view displacement $\mathcal {D}$ is described by subtracting the original view index $I_{(u,v)}$ and the rearranged view index $\hat {I}_{(u,v)}$. Due to that any view has more or less correlations with other views, the correlation values should be positive values with physical significance. By adding $\mathcal {S}$ and $\mathcal {D}$, we can directly generate inter-view correlations within reasonable value ranges. Other more complicated calculation methods may be also effective, but video encoder is generally complexity-sensitive. The addition operation has relatively low computational complexity, which is suitable for embedding into hardware or software video encoder products and services. Thus, for the rearranged VFMV, smooth focusing status change and moderate view displacement enable enhancing inter-view correlations.

To sum up, the proposed VMSR first measures the focusing intensity of VFMV images by normalized TENG gradient. Then, views in each row are rearranged in a mountain shape. It enhances the inter-view correlations by smooth focusing distribution and moderate view displacement.

3.2 All-directional prediction structure

We propose an all-directional prediction structure (ADPS) to compress the rearranged VFMV shown in Fig. 5. The term “all direction” represents reference frames come from all directions around the frame to be coded, such that more candidates of reference frames can be selected and higher coding performance can be achieved in inter-view predictions. For each row, the darker color a view is in, the higher focusing intensity the view has. The mountain-shape rearranged VFMV is concatenated into a pseudo-video sequence in raster scan order. The display order refers to the frame index in the video sequence, while the coding order is the actual order to conduct frame-by-frame compression.

Fig. 5. The proposed ADPS: take $9\times 9$ views as examples for illustration. The ADPS compresses the rearranged VFMV by performing row-wise hierarchy divisions and creates prediction dependencies from all directions; thus, it can exploit the enhanced correlations and further optimize reference frame for high coding efficiency.

Download Full Size | PDF

The ADPS conducts row-wise hierarchical divisions and allows all-directional prediction dependencies for coding efficiency. Specifically, the 1st row (hierarchy 1, abbreviated as H1), 5th row (H2), 3rd row (H3), 2nd row (H4), and 4th row (H4) are encoded successively. The typical hierarchy divisions in [26] are view-wise, where views in a row are assigned different hierarchies. By contrast, our ADPS is row-wise for adapting the mountain-shape rearranged VFMV. Views in the same row are in the same hierarchy. Moreover, the proposed ADPS creates all-directional prediction dependencies defined by:

(6)$$\textbf{X} =\{\Delta_{i} | \Delta_{i} = \left|{\boldsymbol{F}}(u_{i},v_{i}) - {\boldsymbol{F}}(u,v)\right|, i\in[1, m] \},$$

(7)$$\textbf{R} = \{\textbf{X}_n | \textbf{X}_{0}=\emptyset, \textbf{X}_{n} = \textbf{X}_{n-1} \cup \{\min (\textbf{X} \backslash \textbf{X}_{n-1}) \}\},$$

(8)$$\textbf{R} \Rightarrow \{i_1,{\ldots}i_n\} \Rightarrow \{V_{i_1},{\ldots}V_{i_n}\},$$

where the $\textbf {X}$ stands for the set of focusing intensity changes around a certain view $(u,v)$. The differences $\Delta _{i}$ between the view and adjacent views $(u_{i},v_{i})$ from all $i$ directions are to quantify the reference view selection cost. The maximum number of directions is $m$ (e.g., 8 directions around the coding order #34). The notation $\textbf {X} \backslash \textbf {X}_{n-1}$ stands for the set-theoretic difference of $\textbf {X}$ and $\textbf {X}_{n-1}$. Equation (7) denotes that the reference view subset $\textbf {R}$ is formed by picking out the smallest $n$ elements from set $\textbf {X}$. The $\textbf {R}$ represents the optimal $n$ candidates with the smoothest focusing intensity changes among adjacent views. From these optimal $n$ candidates, we can derive the reference directions $\{i_1,{\ldots }, i_n\}$. Symbol $\Rightarrow$ signifies this process. Thus, reference views $\{V_{i_1},{\ldots },V_{i_n}\}$ are finally located by the derived directions. In the proposed ADPS, the prediction dependencies depend on hierarchies and focusing intensity changes. For example, views in H4, H3, H2 and H1 have 8, 4, 3 and 2 reference views, respectively.

Partially focused images are generally more robust to compression blur than more focused images [27]. Therefore, in our mountain-shape rearranged VFMV, the side views in each row are more robust to compression blur than the central views. Accordingly, we form quantization parameter (QP) distributions in the shape of an inverted mountain for coding efficiency. Specifically, the most central view is assigned a base QP, and QP increments of 1, 2, 3, and 4 are distributed from central to both sides. The data of rearranged VFMV is compressed by ADPS, while the data of mountain rearrangement is reshaped as a 2D matrix and encoded using JPEG lossless compression. Finally, the multiplexing bitstreams of all data are transmitted and the original VFMV can be recovered at the decoded side.

In summary, the proposed coding scheme rearranges the irregular VFMV by mountain-sorting, generating a regular VFMV with mountain-shape focusing distributions and enhanced inter-view correlations. The proposed ADPS compresses the reordered VFMV by exploiting the correlations. It conducts all-directional prediction according to hierarchies and focusing intensity changes for efficient inter-view prediction.

4. Experiments

4.1 Experiment settings

In the experiments, the commonly used high efficiency video coding (HEVC) encoder is regarded as the benchmark scheme due to its solid performance in video coding [28]. Test model HM-16.20 platform is adopted to implement the HEVC benchmark scheme. VFMV images are with dense view arrangements like light field images. Thus, the winner of Light Field Image Compression Challenge, pseudo-sequence hierarchical (abbreviated to PSH) coding method is selected as the comparison scheme [29]. The preliminary research (abbreviated to OE) adopting monotonically descending reorder and unidirectional coding serves as another comparison scheme [14], which has demonstrated performance than the multiview coding standard MV-HEVC [30]. Our proposed compression scheme is also integrated into the HM 16.20 platform. The base QP values adopt the typical settings 22, 27, 32 and 37.

All schemes are evaluated on our provided VFMV test sequences with $9\times 9$ view arrangements. Specifically, test sequences include 6 Blender rendering scenes (virtually generated by Blender software), 3 digital camera scenes (captured by Fuji X-S10 camera), and 1 light field scene (captured and refocused by Lytro Illum light field camera). Thumbnails of test sequences are shown in Fig. 6, and detailed information is recorded in Table 1. For a fair comparison, VFMV test sequences are scanned and concatenated as pseudo video sequences for easy compression by benchmark and comparison schemes. To evaluate the performance of all schemes, we conduct comprehensive evaluations in terms of quantitative, qualitative, computational complexity, and vision application performance.

Fig. 6. Thumbnails of all test sequences used in the experiments. (a)-(f) 6 scenes synthesized by Blender software. (g)-(i) 3 scenes captured by Fuji X-S10 digital cameras. (j) 1 scene refocused by Lytro light field cameras.

Download Full Size | PDF

Table 1. Detailed information of the test sequences. They are created by different optical imaging ways with diverse resolutions and dense view arrangements.

View Table | View all tables in this article

4.2 Performance evaluations

4.2.1 Quantitative evaluation

The quantitative evaluation measures the coding efficiency of all schemes over the HEVC benchmark scheme. The Bjontegaard-Delta bitrate (BDBR) and Bjontegaard-Delta peak signal-to-noise ratio (BDPSNR) of Y-component are commonly adopted to measure coding efficiency [31]. The BDBR means bitrate savings in percentage and BDPSNR indicates PSNR gains in dB. The experimental results of all schemes over the HEVC benchmark scheme on 10 test sequences are shown in Table 2. The average performance on different types of test sequences is also calculated to facilitate data analysis.

Table 2. Quantitative evaluation: the proposed scheme achieves the highest coding efficiency than all comparison schemes. The evaluation metrics BDPSNR (in dB) and BDBR (in percentage) calculate the increments of all schemes over HEVC benchmark.

View Table | View all tables in this article

It can be observed that all BDPSNR is positive numbers and BDBR is negative numbers, demonstrating that all schemes have performance advantages over the HEVC benchmark scheme. The proposed scheme achieves the most significant performance on all types of test sequences. As high as 3.172 dB PSNR gains and 61.14% bitrate savings can be obtained. In terms of both the average of each type and the average of all types of test sequences, the proposed scheme is still leading all the comparison schemes by significant increments. The PSH scheme obtains lower performance than the proposed scheme but still has higher performance than the OE comparison scheme. This is because the PSH scheme directly compresses the original VFMV, which ignores the effect of irregular focal distributions on coding performance. However, it adopts a hierarchical and bidirectional prediction structure to obtain acceptable coding gains. The OE scheme employs monotonically row-by-row descending sorting to rearrange VFMV, which yields excessive view displacement. The maximum displacement is approximately equal to the number of horizontal views $M-1$. Excessive view displacement may cause less similarity between the to-be-coded frame and reference frames, resulting in limited coding efficiency. By contrast, the mountain shape distribution adopted in the proposed scheme restricts the maximum view displacement to $(M-1)/2$. Therefore, the mountain shape distribution can achieve better performance than previous descending orders. Moreover, the coding performance in Blender test sequences is better than the other types of sequences. This is because the Blender test sequences are generated in perfect imaging conditions, there is no magnification difference between views even though they have different focal settings, making it easier to compress.

We draw the rate-distortion (RD) curves to visualize the coding performance comparison shown in Fig. 7. All 10 test sequences are visualized. For RD curves, the higher a curve is, the higher PSNR a scheme can achieve in the same bitrate consumption. Similarly, the more to the left a curve is, the less bitrate is consumed to obtain the same PSNR increments. From all 10 RD carve visualizations, we can find that the proposed scheme can obtain higher RD performance than the benchmark and comparison schemes. This is because the proposed scheme adopts VMSR to rearrange VFMV with enhanced inter-view correlations; and compress the rearranged VFMV using the efficient ADPS. These experiments demonstrate the proposed scheme outperforms all comparison schemes.

Fig. 7. The visualizations of RD performance comparisons on all VFMV test sequences.

Download Full Size | PDF

4.2.2 Qualitative evaluation

The qualitative evaluation compares the subjective quality of the proposed scheme, comparison schemes and the uncompressed original VFMV views. Test sequences I01 and I08 are selected for the qualitative evaluation. The results are shown in Fig. 8. From left to right columns, it shows the original uncompressed views, the HEVC benchmark scheme, PSH comparison scheme, OE comparison scheme, and the proposed scheme. Close-ups of all columns are used to magnify image details.

Fig. 8. Qualitative evaluation: the proposed scheme obtains higher subjective quality with richer textures and more details than the HEVC scheme and all comparison schemes.

Download Full Size | PDF

We can observe that the uncompressed original views have the highest visual quality with rich textures and without artifacts. The proposed scheme achieves better visual quality than all benchmark and comparison schemes. For the close-ups of “pipes” in I01 and “banana” in I08, the textures and object edges are preserved in the proposed scheme, while textures are too smooth in other schemes due to compression artifacts. The results of the proposed scheme are closer to the uncompressed original views than other schemes. According to Table 1, the proposed scheme can save bitrate resources than other comparison schemes. Therefore, the proposed scheme achieves higher subjective quality while maintaining lower bitrate consumption.

4.2.3 Complexity evaluation

The complexity evaluation aims to compare the computational complexity among all schemes. Specifically, HEVC benchmark scheme, PSH comparison scheme, OE comparison scheme, and the proposed scheme are conducted under the following test conditions: test sequences I01, I04, I07, and I10, and QP setting of 27. The hardware conditions include Windows 10 64-bit operating system, Intel i5-8300H 2.30 GHz CPU, 16 GB RAM. The time consumption (in seconds) of all schemes is shown in Fig. 9.

Fig. 9. The computational complexity comparisons on I01, I04, I07 and I10.

Download Full Size | PDF

The HEVC benchmark scheme has the least time consumption due to no optimization for VFMV. The PSH comparison scheme [29] creates dependencies between long-distance reference views by hierarchical prediction, which is advantageous to coding performance but time-consuming. The OE comparison scheme [14] performs unidirectional and non-hierarchical prediction among adjacent views; thus, the time consumption is low and close to the HEVC benchmark scheme. The time consumption of the proposed scheme is moderate. The time-efficient sorting (generally less than 0.5 seconds) enhances inter-view correlations, which is more suitable for dense view coding. For example, the I07 costs 0.48 seconds for sorting and 733.77 seconds for coding. The coding cost much time because the ADPS in the proposed scheme selects reference frames from multiple directions around the to-be-coded frame, which helps to improve coding performance at the cost of costing time. Generally, the proposed method has moderate complexity and best performance compared with all comparison methods. Moreover, the proposed scheme restricts the maximum view displacement in the sorting. Motion searching can benefit from the limited parallax between the current view and its reference views when performing inter-view prediction. Thus, the proposed coding scheme is relatively more time-efficient. The above quantitative, qualitative, and complexity evaluations demonstrate that the proposed scheme can maintain lower computational complexity while achieving higher subjective and objective quality.

4.2.4 Supplemental evaluation

To validate the advantages of the proposed scheme, supplemental experiments are conducted on more comparison methods and wider range of QPs/bitrates. These comparison methods are abbreviated to “Santos et al.” [21], “Gomes et al.” [20], and “Monteiro et al.” [22], respectively. The “Santos et al.” scheme [21] and “Gomes et al.” scheme [20] separate the full VFMV into four and three divisions and then compressed independently. The “Monteiro et al.” [22] is a scalable scheme, which assigns five layers to views of VFMV. Views in each layer are rearranged as pseudo-sequence video, and are compressed from low to high layers. For evaluating the coding performance at high bitrates, we additionally adopt lower QP of 17 to consume higher bitrates; thus, the QPs settings are 17, 22, 27, 32, 37. Test sequences I01 and I06 are selected as example for illustration. The performance of all coding schemes is shown in Fig. 10. The bitrate has a wider range than that in Fig. 7.

Fig. 10. Validation experiments on more comparison methods with more QPs and a wider range of bitrates.

Download Full Size | PDF

It can be observed that the curves of the proposed scheme are higher than those of all comparison schemes, signifying higher coding performance. The five data points from left to right in each curve represent QPs values from 37 to 17. Generally, the smaller a QP value is, the higher bitrate a coding scheme will consume. Figure 10 also shows the proposed scheme consistently exceeds all other schemes regardless of low or high bitrates. We further assess the performance of the proposed scheme by calculating PSNR gains than HEVC benchmark in low bitrate (QP: 22 27 32 37) and high bitrate (QP: 17 22 27 32). For high bitrate, the test sequences I01 and I06 achieve 3.152 and 1.880 dB PSNR gains, respectively. Compared with low bitrate data in Table 2 (QP: 22 27 32 37), we can find that a high bitrate generally yields lower PSNR gains than a low bitrate. This is because the blur newly introduced by high QP compression (low bitrate) will affect the original out-of-focus blur. The proposed scheme takes the blur and focus changes into account when selecting reference frames, while the benchmark scheme does not. Thus, it can obtain slightly higher PSNR increments than benchmark in high bitrate. The supplemental experiment demonstrates that the proposed scheme outperforms more comparison methods. The effectiveness of the proposed scheme is also validated in low and high bitrates.

4.2.5 Forgery protection evaluation

Recent forgery detection/localization powered by focal stack [32] has shown advantages over single image in detecting manipulation, demonstrating focal stack is a secure imaging format. Inspired by the research [32], we expect VFMV could serve as another novel secure imaging format that can achieve better forgery protection than conventional multiview data.

Due to the rapid rise of Large Models, the forgery of digital image content, e.g., object removal or manipulation has become more difficult to distinguish. It is reasonable to assume that attackers may manipulate content/object of VFMV images for malicious purposes. Thus, we conduct validation experiments to evaluate VFMV forgery protection against cutting-edge large models. Specifically, the large vision model “segment anything model (SAM)” [33] and large language model “LISA” [34] are performed on VFMV data and conventional multiview data. VFMV test sequences I03, I07, and 3D video (3DV) common test sequences “PoznanHall2” and “Newspaper” are selected. Views of VFMV data have varied and irregular focusing status, while multiview data is consistent and unchanged. The experimental results are shown in Fig. 11 and Fig. 12.

Fig. 11. Scene instance segmentation (generally for forgery purposes) by segment anything model (SAM). The (a) and (b) show the defocused regions of VFMV are not consistently segmented (e.g., various tools on the table). The (c) and (d) show that conventional multiview data are overall consistently segmented.

Download Full Size | PDF

Fig. 12. LISA large language model with prompt “Can you segment towel and backpack?” for VFMV (a) and (b), and prompt “Can you segment all the people?” for multiview (c) and (d). Target objects in defocused regions of VFMV are concealed and not well identified by LISA, while objects in multiview data are all identified.

Download Full Size | PDF

The view (5, 5) of VFMV data I03 in Fig. 11(b) can well segment tiny multiple objects (repair tools on the desk) in focused background regions by SAM. However, in Fig. 11(a), the view (5, 1) is focused on foreground, these repair tools in defocused background are not segmented. By contrast, the Fig. 11(c) and (d) exemplify that conventional multiview data are overall consistently segmented. Similarly, VFMV views in Fig. 12(a) and (b) are focused on background and foreground, respectively. When we feed the prompt “Can you segment towel and backpack in this image?” into LISA, towel and backpack can not be identified simultaneously. However, the Fig. 12(c) and (d) show that conventional multiview are overall consistently segmented by the prompt “Can you segment all the people in this image?”.

These experiments demonstrate that multiview data is vulnerable to forgery from large models, they may be used for malicious purposes in imaging and displays. VFMV can produce inconsistent segmentation among different views. The valuable features of inconsistency can protect VFMV from forgery. This is because the focal information contained in VFMV data is not consistently distributed among dense views. Thus, VFMV has resistance to the process of content manipulation since it is not easy to generate the same manipulation results when dense views have different focusing status. Thus, VFMV data has the potential to serve as a novel secure imaging format in the source end, so that unmodified data in acquisition and transmission enable to facilitate authentic displays in application end.

4.3 Ablation study

Finally, we perform ablation study to verify the effectiveness of the view mountain-shape rearrangement (VMSR) and all-directional prediction structure (ADPS) in the proposed scheme. The “W VMSR + W ADPS” scheme stands for the proposed scheme incorporating both VMSR and ADPS. The “W/O VMSR + W ADPS” scheme skips the VFMV sorting process and directly compresses the irregular VFMV sequences by leveraging our ADPS. This scheme is to investigate the effect of lacking VMSR on the coding performance as well as verify the effectiveness of ADPS. By contrast, the “W VMSR + W/O ADPS” scheme first generates a rearranged VFMV with regular focusing distributions by VMSR, but the rearranged VFMV is compressed by HEVC encoder instead of ADPS. This scheme is to measure how much coding performance is reduced by the absence of ADPS, and clarify the contribution of VMSR at the same time. We calculate the PSNR gains (BDPSNR) and bitrate savings (BDBR) of these schemes over the HEVC benchmark scheme. The ablation study is performed on test sequences of I08. All the experimental results are listed in Table 3. The “Byte” and “PSNR” denote the bitrate consumption (in bytes) and compression quality (in dB), respectively.

Table 3. Ablation Study on the effectiveness of the VMSR and ADPS in the proposed scheme.

View Table | View all tables in this article

From Table 3 we can find that both the “W/O VMSR + W ADPS” and “W VMSR + W/O ADPS” schemes suffer a significant performance decrease. Specifically, the “W/O VMSR + W ADPS” scheme can achieve 18.50% bitrate savings and 0.92 dB PSNR gains, which is still much higher than the PSH and OE comparison schemes listed in Table 2. It validates the crucial contribution of ADPS to the proposed scheme. For the “W VMSR + W/O ADPS” scheme, the BDPSNR is positive numbers and BDBR is negative numbers, indicating that the coding performance of this scheme is still better than that of the HEVC benchmark scheme. It signifies the effectiveness of the VMSR in the proposed scheme. The proposed coding scheme adopts VMSR to reorder the irregular VFMV, the enhanced inter-view correlations among views are well exploited by the ADPS. The above ablation study demonstrates that both VMSR and ADPS are effective in the proposed scheme. The absence of them will significantly reduce the coding performance.

5. Conclusion

The emerging high dimensional data VFMV is redundant and difficult to compress due to the irregular focusing changes among views. We propose a specialized coding scheme based on VMSR and ADPS. By rearranging VFMV, VMSR can enhance the inter-view correlations by smoothing focusing distributions and moderating view displacements. The ADPS conducts row-wise hierarchical divisions and allows all-directional prediction dependencies to exploit the correlations. Extensive evaluations in terms of quantitative, qualitative, complexity, and forgery protection demonstrate that the proposed scheme outperforms comparison schemes. Moreover, we find that VFMV data has the potential to be a novel image content forgery protector.

Funding

National Natural Science Foundation of China (61991412); Major Project of Fundamental Research on Frontier Leading Technology of Jiangsu Province (BK20222006); Key Research and Development Program of Hubei Province (2023BAB021); Fundamental Research Supporting Program (2023BR023).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. K. Akşit, W. Lopes, J. Kim, et al., “Near-eye varifocal augmented reality display using see-through screens,” ACM Trans. Graph. 36(6), 1–13 (2017). [CrossRef]

2. C. P. Chen, Y. Cui, Y. Chen, et al., “Near-eye display with a triple-channel waveguide for metaverse,” Opt. Express 30(17), 31256–31266 (2022). [CrossRef]

3. Z. Wu, X. Li, J. Peng, et al., “Dof-nerf: Depth-of-field meets neural radiance fields,” in Proceedings of the 30th ACM International Conference on Multimedia, (Association for Computing Machinery, New York, NY, USA, 2022), MM ’22, p. 1718–1729.

4. S. Lee, S. Lee, D. Kim, et al., “Distortion corrected tomographic near-eye displays using light field optimization,” Opt. Express 29(17), 27573–27586 (2021). [CrossRef]

5. Z. Qin, Y. Zhang, B.-R. Yang, et al., “Interaction between sampled rays’ defocusing and number on accommodative response in integral imaging near-eye light field displays,” Opt. Express 29(5), 7342–7360 (2021). [CrossRef]

6. Z. Yang, X. Sang, B. Yan, et al., “Real-time light-field generation based on the visual hull for the 3d light-field display with free-viewpoint texture mapping,” Opt. Express 31(2), 1125–1140 (2023). [CrossRef]

7. T. Zhan, J. Zou, M. Lu, et al., “Wavelength-multiplexed multi-focal-plane seethrough near-eye displays,” Opt. Express 27(20), 27507–27513 (2019). [CrossRef]

8. K. Wu, Y. Yang, Q. Liu, et al., “Gaussian-wiener representation and hierarchical coding scheme for focal stack images,” IEEE Trans. Circuits Syst. Video Technol. 32(2), 523–537 (2022). [CrossRef]

9. M. Panzirsch, B. Weber, N. Bechtel, et al., “Light-field head-mounted displays reduce the visual effort: A user study,” J. Soc. Inf. Disp. 30(4), 319–334 (2022). [CrossRef]

10. J. Lawrence, D. Goldman, S. Achar, et al., “Project starline: a high-fidelity telepresence system,” ACM Trans. Graph. 40(6), 1–16 (2021). [CrossRef]

11. K. Wu, Y. Yang, Q. Liu, et al., “Focal stack image compression based on basis-quadtree representation,” IEEE Trans. Multimedia 25, 3975–3988 (2023). [CrossRef]

12. L. Ma, X. Zhang, Z. Xu, et al., “Three-dimensional focal stack imaging in scanning transmission x-ray microscopy with an improved reconstruction algorithm,” Opt. Express 27(5), 7787–7802 (2019). [CrossRef]

13. K. Wu, Y. Yang, M. Yu, et al., “Block-wise focal stack image representation for end-to-end applications,” Opt. Express 28(26), 40024–40043 (2020). [CrossRef]

14. K. Wu, Q. Liu, Y. Wang, et al., “End-to-end varifocal multiview images coding framework from data acquisition end to vision application end,” Opt. Express 31(7), 11659–11679 (2023). [CrossRef]

15. K. Wu, Y. Yang, Q. Liu, et al., “Hierarchical independent coding scheme for varifocal multiview images based on angular-focal joint prediction,” IEEE Transactions on Multimedia pp. 1–13 (2023).

16. L. B. Baldwin, “Array of cameras with various focal distances,” (2016). US Patent 9,241,111.

17. U. Perwass and C. Perwass, “Digital imaging system, plenoptic optical device and image data processing method,” (2013). US Patent 8,619,177.

18. C. Liu, J. Qiu, M. Jiang, et al., “Light field reconstruction from projection modeling of focal stack,” Opt. Express 25(10), 11377–11388 (2017). [CrossRef]

19. X. Zhao, C. Liu, L. Dou, et al., “3d visual sensing technique based on focal stack for snake robotic applications,” Results Phys. 12, 1520–1528 (2019). [CrossRef]

20. P. Gomes and L. A. da Silva Cruz, “Pseudo-sequence light field image scalable encoding with improved random access,” in 2019 8th European Workshop on Visual Information Processing (EUVIP), (IEEE, 2019), pp. 16–21.

21. J. M. Santos, L. A. Thomaz, P. A. Assuncao, et al., “Hierarchical lossless coding of light fields with improved random access,” Signal Process. Commun. 105, 116687 (2022). [CrossRef]

22. R. J. Monteiro, N. M. Rodrigues, S. M. Faria, et al., “Light field image coding with flexible viewpoint scalability and random access,” Signal Process. Commun. 94, 116202 (2021). [CrossRef]

23. E. Krotkov and J.-P. Martin, “Range from focus,” in Proc. IEEE Int. Conf. Robot. Autom., vol. 3 (1986), pp. 1093–1098.

24. A. N. Almustofa, Y. Nugraha, A. Sulasikin, et al., “Exploration of image blur detection methods on globally blur images,” in 10th Int. Conf. Inf. Commun. Technol., (2022), pp. 275–280.

25. L. Juočas, V. Raudonis, R. Maskeliūnas, et al., “Multi-focusing algorithm for microscopy imagery in assembly line using low-cost camera,” Int. J. Adv. Manuf. Technol. 102(9-12), 3217–3227 (2019). [CrossRef]

26. L. Li, Z. Li, B. Li, et al., “Pseudo-sequence-based 2-d hierarchical coding structure for light-field image compression,” IEEE J. Sel. Top. Signal Process. 11(7), 1107–1119 (2017). [CrossRef]

27. M. Rizkallah, T. Maugey, C. Yaacoub, et al., “Impact of light field compression on focus stack and extended focus images,” in 2016 24th Eur. Signal Process. Conf., (2016), pp. 898–902.

28. K. McCann, C. Rosewarne, B. Bross, et al., “High efficiency video coding (HEVC) test model 16 (HM 16) encoder description, document JCTVC-R1002,” JCTVC, Sapporo, Japan (2014).

29. D. Liu, L. Wang, L. Li, et al., “Pseudo-sequence-based light field image compression,” in 2016 IEEE Int. Conf. Multimed. Expo workshops., (2016), pp. 1–4.

30. M. M. Hannuksela, Y. Yan, X. Huang, et al., “Overview of the multiview high efficiency video coding (mv-hevc) standard,” in 2015 IEEE International Conference on Image Processing (ICIP), (IEEE, 2015), pp. 2154–2158.

31. G. Bjøntegaard, “Calculation of average psnr differences between rd-curves (vceg-m33),” in VCEG Meeting (ITU-T SG16 Q. 6), (2001), pp. 2–4.

32. Z. Huang, J. A. Fessler, T. B. Norris, et al., “Focal stack based image forgery localization,” Appl. Opt. 61(14), 4030–4039 (2022). [CrossRef]

33. A. Kirillov, E. Mintun, N. Ravi, et al., “Segment anything,” arXiv, arXiv:2304.02643 (2023). [CrossRef]

34. X. Lai, Z. Tian, Y. Chen, et al., “Lisa: Reasoning segmentation via large language model,” arXiv, arXiv:2308.00692 (2023). [CrossRef]

ID	Scene name	Type	Resolution	View #
I01	Pipes	Blender	512 $\times$ 512	9 $\times$ 9
I02	Machine	Blender	1024 $\times$ 768	9 $\times$ 9
I03	Messy_room	Blender	1024 $\times$ 512	9 $\times$ 9
I04	Factory	Blender	1024 $\times$ 512	9 $\times$ 9
I05	Mould	Blender	1024 $\times$ 512	9 $\times$ 9
I06	Production_line	Blender	1024 $\times$ 512	9 $\times$ 9
I07	Dormitory	Fuji X-S10	768 $\times$ 512	9 $\times$ 9
I08	Fruits	Fuji X-S10	768 $\times$ 512	9 $\times$ 9
I09	Basketball	Fuji X-S10	768 $\times$ 512	9 $\times$ 9
I10	PCB	Lytro Illum	624 $\times$ 432	9 $\times$ 9

Scene	PSH		OE		Proposed
Scene	BDPSNR	BDBR	BDPSNR	BDBR	BDPSNR	BDBR
I01	1.263	−31.87%	1.551	−36.71%	3.172	-61.14%
I02	1.215	−29.57%	0.885	−21.87%	2.026	−44.04%
I03	1.089	−31.58%	0.635	−19.69%	1.952	−48.88%
I04	0.855	−23.19%	0.200	−5.93%	1.476	−35.70%
I05	1.294	−33.42%	0.395	−11.47%	1.857	−43.63%
I06	1.169	−25.47%	0.754	−17.14%	2.202	−43.39%
Avg	1.148	-29.18%	0.737	-18.80%	2.114	-46.13%
I07	1.256	−25.90%	0.817	−17.75%	2.033	−39.02%
I08	0.488	−10.34%	0.324	−6.78%	1.275	−24.93%
I09	1.336	−28.31%	1.082	−23.32%	2.173	−41.18%
Avg	1.027	-21.52%	0.741	-15.95%	1.827	-35.05%
I10	1.134	−21.22%	0.289	−5.76%	1.434	−26.24%
AvgAll	1.110	-26.09%	0.693	-16.64%	1.960	-40.82%

W VMSR + W ADPS				W/O VMSR + W ADPS				W VMSR + W/O ADPS
Byte	PSNR	BDBR	BDPSNR	Byte	PSNR	BDBR	BDPSNR	Byte	PSNR	BDBR	BDPSNR
418862	47.47	−24.93	1.28	440728	47.40	−18.50	0.92	613040	48.34	−7.22	0.34
223727	45.05			238481	44.98			334462	45.98
120487	42.34			128066	42.21			183486	43.21
69651	39.59			73660	39.43			102375	40.29

ID	Scene name	Type	Resolution	View #
I01	Pipes	Blender	512 $\times$ 512	9 $\times$ 9
I02	Machine	Blender	1024 $\times$ 768	9 $\times$ 9
I03	Messy_room	Blender	1024 $\times$ 512	9 $\times$ 9
I04	Factory	Blender	1024 $\times$ 512	9 $\times$ 9
I05	Mould	Blender	1024 $\times$ 512	9 $\times$ 9
I06	Production_line	Blender	1024 $\times$ 512	9 $\times$ 9
I07	Dormitory	Fuji X-S10	768 $\times$ 512	9 $\times$ 9
I08	Fruits	Fuji X-S10	768 $\times$ 512	9 $\times$ 9
I09	Basketball	Fuji X-S10	768 $\times$ 512	9 $\times$ 9
I10	PCB	Lytro Illum	624 $\times$ 432	9 $\times$ 9

Scene	PSH		OE		Proposed
Scene	BDPSNR	BDBR	BDPSNR	BDBR	BDPSNR	BDBR
I01	1.263	−31.87%	1.551	−36.71%	3.172	-61.14%
I02	1.215	−29.57%	0.885	−21.87%	2.026	−44.04%
I03	1.089	−31.58%	0.635	−19.69%	1.952	−48.88%
I04	0.855	−23.19%	0.200	−5.93%	1.476	−35.70%
I05	1.294	−33.42%	0.395	−11.47%	1.857	−43.63%
I06	1.169	−25.47%	0.754	−17.14%	2.202	−43.39%
Avg	1.148	-29.18%	0.737	-18.80%	2.114	-46.13%
I07	1.256	−25.90%	0.817	−17.75%	2.033	−39.02%
I08	0.488	−10.34%	0.324	−6.78%	1.275	−24.93%
I09	1.336	−28.31%	1.082	−23.32%	2.173	−41.18%
Avg	1.027	-21.52%	0.741	-15.95%	1.827	-35.05%
I10	1.134	−21.22%	0.289	−5.76%	1.434	−26.24%
AvgAll	1.110	-26.09%	0.693	-16.64%	1.960	-40.82%

High dimensional optical data — varifocal multiview imaging, compression and evaluation

Abstract

1. Introduction

2. Dimension representation and imaging analysis

3. Efficient VFMV compression scheme

3.1 View mountain-shape rearrangement

3.2 All-directional prediction structure

4. Experiments

4.1 Experiment settings

4.2 Performance evaluations

4.2.1 Quantitative evaluation

4.2.2 Qualitative evaluation

4.2.3 Complexity evaluation

4.2.4 Supplemental evaluation

4.2.5 Forgery protection evaluation

4.3 Ablation study

5. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (12)

Tables (3)

Equations (8)

Optics Express