Efficient algorithms for the accurate propagation of extreme-resolution holograms

David Blinder; Tomoyoshi Shimobaba

doi:10.1364/OE.27.029905

1. Introduction

Computer-generated holography (CGH) is concerned with efficient and accurate algorithms for simulating numerical diffraction of coherent light through free space, and its interaction with materials. Because of the nature of waves, every point in 3D space can potentially affect every other point, making brute-force calculations generally too slow for most applications. Several CGH algorithms address this issue by e.g. using sparse approximations of point-spread functions [1–4].

This problem is especially important when the wavefront resolution becomes very large. In this paper, we focus on holograms with extreme resolutions for display purposes [5]. To achieve displays of appreciable sizes with large viewing angles, these extreme resolutions are a necessity [6]. We target the efficient implementation of wavefront propagation with resolutions surpassing $10^{10}$ pixels.

This operation is an instrumental part in many CGH algorithms, such as the Wavefront Recording Plane (WRP) method [7], multiplanar CGH [8–12], polygonal CGH [13–17] and ray-wavefront conversion [18,19].

Because these holograms will not fit into Random Access Memory (RAM), the computer will have to resort to some form of paging, i.e. use secondary storage such as a Hard Disk Drive (HDD) or a Solid-State Drive (SSD) to temporarily store the pieces of the hologram that are being computed. This will form a performance bottleneck, taking up a significant part of the calculation time. It is therefore important to minimize access to secondary storage. Ideally, the algorithm should still be useful when the bottleneck is different (e.g. in a distributed system), or be even more efficient for extreme-resolution wavefronts on machines with very large amounts of RAM.

For large scale hologram propagation, the typical approach is to use rectangular tiling [20]. After subdividing the hologram into equally-sized tiles, every tile in the source plane is numerically diffracted to every tile in the destination plane (cf. Fig. 1). This can be achieved with linearithmic complexity by means of shifted Fresnel diffraction [20], or by the shifted Angular Spectrum Method (ASM) [21,22]. This principle was used for polygonal CGH to calculate a 4 Gigapixel hologram [13].

Fig. 1. Diagram of the rectangular tiling algorithm using shifted Fresnel diffraction. The dark blue example tile in the source plane has to be numerically propagated to every tile in the destination plane. This process has to be repeated for every source plane tile.

Unfortunately, this approach has a significantly larger computational complexity than a conventional propagation because of the many-to-many mapping: every source tile has to be numerically diffracted to every destination tile. This is exacerbated by the many required reads and writes to the disk. The complexity can be found as follows: given an $N\times N$ pixel hologram divided into $k\times k$ tiles with side length $B=\frac {N}{k}$, each of the $k^2$ source tiles has to be propagated to each of the $k^2$ destination tiles, for a total of $k^4$ tile propagations. This results in

$$k^4\,\mathcal{O}\left( B^2\log(B^2) \right) = \mathcal{O}\left(k^2N^2\log B \right) \tag{1}$$

computational complexity, because the Fast Fourier Transform (FFT) has $\mathcal {O}(n\log n)$ complexity for $n$ samples.

If the hologram were to fit in RAM in its entirety, a single propagation of the entire hologram would only have $\mathcal {O}(N^2\log N)$ complexity. As $k$ increases, the tiled approach therefore becomes orders of magnitude slower, indicating that it is sub-optimal.
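As a rough illustration of this gap, the leading-order FFT work of Eq. (1) can be compared with that of a single 2D FFT over the whole hologram. The sketch below is only an operation-count estimate: the function names are hypothetical, constant factors and disk access are ignored, and the values of $N$ and $B$ are borrowed from the experiment in Section 4.

```python
import math

def rectangular_tiling_cost(N: int, k: int) -> float:
    """Leading-order FFT work of Eq. (1): k^4 propagations of B x B tiles."""
    B = N // k
    return k**4 * B**2 * math.log2(B**2)

def single_fft_cost(N: int) -> float:
    """Leading-order work of one 2D FFT over the full N x N hologram."""
    return N**2 * math.log2(N**2)

# Assumed example values: a 128K x 128K hologram split into 8192 x 8192 tiles.
N = 128 * 1024
k = N // 8192  # 16 tiles per side
ratio = rectangular_tiling_cost(N, k) / single_fft_cost(N)
print(f"rectangular tiling / single FFT work ratio: {ratio:.1f}")
```

The ratio grows as $k^2\log B/\log N$, so halving the tile size roughly quadruples the FFT work of rectangular tiling.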

This problem was recently addressed in [19], where the authors proposed to use a ray-wavefront conversion technique to calculate a realistic ${\approx }10$ Gigapixel hologram by computing a set of orthographic image projections of a 3D scene. However, the proposed method still introduces approximations, e.g. discretizing the scene into multiple views, similarly to a holographic stereogram. As a result, the algorithm is not generally applicable to all CGH use cases.

To this end, novel algorithms to both efficiently and accurately calculate diffraction of large holograms are proposed. The paper is structured as follows: Section 2 introduces the new diffraction algorithms and analyzes their computational complexity. Section 3 details the implementation and program structure, and its integration in a multi-WRP CGH algorithm as a use case example. In Section 4, the algorithm time is measured for its various components and numerically reconstructed views are shown for the generated hologram. Finally, we conclude in Section 5.

2. Proposed diffraction algorithms

The goal is to both minimize computational complexity and disk access when calculating multi-Gigapixel wavefronts. Several algorithms require multiple propagations between successive parallel planes, in particular the multi-WRP CGH algorithms [8,9] which we have implemented in this paper. An important factor is the propagation distance $d$: every wavefield point will have a limited region-of-influence with radius $w$ determined by the opening angle $\theta$:

$$w = d\tan(\theta) = d\tan\left(\sin^{-1} \frac{\lambda}{2p}\right) = \frac{d\lambda}{\sqrt{4p^2-\lambda^2}} \tag{2}$$

using the identity $\tan (\sin ^{-1}x) = \frac {x}{\sqrt {1-x^2}}$, given a pixel pitch $p$ and wavelength $\lambda$. This will impact the computational efficiency with which the diffraction operator can be implemented.
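For a sense of scale, Eq. (2) can be evaluated numerically. The following sketch checks that the trigonometric and closed-form expressions agree and converts $w$ to a pixel count. The function names are hypothetical, and the example values ($\lambda = 532$ nm, $p = 0.8$ µm, $d = 1.3$ mm) are taken from the experiment in Section 4, with $d$ assumed to equal the WRP spacing.

```python
import math

def region_of_influence(d: float, wavelength: float, pitch: float) -> float:
    """Radius w from Eq. (2), using the opening angle theta."""
    theta = math.asin(wavelength / (2.0 * pitch))
    return d * math.tan(theta)

def region_of_influence_closed_form(d: float, wavelength: float, pitch: float) -> float:
    """Radius w from the closed-form right-hand side of Eq. (2)."""
    return d * wavelength / math.sqrt(4.0 * pitch**2 - wavelength**2)

# Assumed example values (SI units).
wavelength = 532e-9   # metres
pitch = 0.8e-6        # metres
d = 1.3e-3            # metres

w = region_of_influence(d, wavelength, pitch)
w_closed = region_of_influence_closed_form(d, wavelength, pitch)
w_pixels = math.ceil(w / pitch)  # radius rounded up to whole pixels
print(f"w = {w * 1e6:.1f} um, i.e. {w_pixels} pixels")
```

Under these assumptions the region of influence spans a few hundred pixels, which is small compared with a tile of several thousand pixels per side.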

We therefore consider two use cases: (1) where source tiles only affect the corresponding destination tiles and their immediate neighbors, and (2) when every source plane pixel can affect every destination plane pixel. Both cases are described in the next two subsections.

2.1 Short-distance tiling-based diffraction

In case $d$ is relatively small, where hologram tiles only affect their neighbors, we can construct an algorithm where every tile need only be processed once. In the proposed short-distance algorithm version, the tiles are processed sequentially from left to right, row by row. Calculations within a tile happen on the Graphics Processing Unit (GPU) in parallel, and the algorithm components can be pipelined across successively processed tiles.

Individual tiles can be numerically diffracted using any of the standard techniques. When $d$ is very small, spatial convolution might be preferred, but for larger $d$, FFT-based methods will be faster. The type of transfer function can be freely chosen by the user; it only has to be computed once, and can then be re-used for all blocks (and even all layers in the multi-WRP algorithm). For example, one could use the FFT of an apodized PSF as in [7], set to zero for pixels where the radius is larger than $w$ to avoid aliasing. We instead use the ASM [23] by pre-computing the transfer function for a $(B+2w)\times (B+2w)$ block. Note that the ASM impulse response is not exactly zero for pixels at a distance larger than $w$ from the center, so this may cause small errors at tile edges because FFT-based convolution is circular. This can be addressed by increasing the value used for $w$, i.e. adding a margin to the value computed with (2), or by apodizing the precomputed impulse response in the spatial domain.

For every $B\times B$ tile, the data is transferred from the host to the GPU and centered into a $(B+2w)\times (B+2w)$ block, where the rest is filled with zeros (cf. Fig. 2(a)). That way, the tile is automatically zeropadded for the subsequent diffraction. Then, the tile is transformed using a 2D FFT (on the GPU), multiplied by the precomputed transfer function, and transformed again by the 2D IFFT. However, this tile will still have to exchange data with its neighboring tiles before the final values of the diffracted destination plane are obtained.

Fig. 2. Diagram of the short-distance based diffraction algorithm. (a) Every tile has a region of influence, indicated by the red edge with edge width $w$. (b) The blue grid indicates the tiles that are transferred to the GPU for propagation. After diffraction, only the corresponding block in the overlapping red grid is known, as the remaining information will still be affected by neighboring tiles. The horizontal and vertical buffers, denoted in red and green respectively, represent the temporary data stored on the GPU for signalling data across subsequent tiles and tile rows.

To do so, intermediate GPU buffers are proposed that keep parts of the hologram tiles, encoding wavefield segments that still have to affect other tiles. The vertical buffer ($2w\times B$ pixels) will transfer data between subsequent tiles, and the horizontal buffer ($N\times 2w$ pixels) transfers data between subsequent tile rows.

The red grid delineates what parts of the final wavefield in the destination plane are known after the tile diffraction, and can thus be written back to disk. The algorithm pseudo-code is shown in Algorithm 1. For example, consider the tile illustrated in Fig. 2(b). After diffracting the second tile, the data from the first tile in the vertical buffer is added to the result on the left edge. Then, the right part of the block is copied to the vertical buffer, overwriting the previous values (indicated in red). Since the second block is on the top row, no data needs to be added from the horizontal buffer. The bottom edge will then be copied to the corresponding part of the horizontal buffer (in green). Finally, the dark blue block (delineated by the red grid) is now fully computed, and can be transferred to the disk.
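The tile-by-tile procedure described above is closely related to overlap-add convolution. The NumPy sketch below is a simplified, CPU-only illustration under several assumptions: the function name `tiled_convolve` is hypothetical, the kernel is an arbitrary spatial impulse response with support $(2w+1)\times(2w+1)$ rather than the ASM transfer function, tiles are zeropadded at the corner instead of being centered (which is equivalent up to a shift), and the whole destination plane is accumulated in memory instead of using the $2w\times B$ and $N\times 2w$ edge buffers with disk write-back.

```python
import numpy as np

def tiled_convolve(field: np.ndarray, kernel: np.ndarray, B: int) -> np.ndarray:
    """Convolve an N x N field tile by tile (overlap-add), same-size output.

    Assumes N is a multiple of B and the kernel has odd side length 2w + 1.
    """
    N = field.shape[0]
    w = kernel.shape[0] // 2
    P = B + 2 * w  # padded block size, large enough to avoid wrap-around

    # The transfer function is computed once and reused for every tile.
    H = np.fft.fft2(kernel, s=(P, P))

    # Accumulator for the full linear convolution, of size (N + 2w)^2.
    out = np.zeros((N + 2 * w, N + 2 * w), dtype=complex)
    for r in range(0, N, B):          # tile rows, top to bottom
        for c in range(0, N, B):      # tiles within a row, left to right
            tile = field[r:r + B, c:c + B]
            block = np.fft.ifft2(np.fft.fft2(tile, s=(P, P)) * H)
            # The 2w-wide borders of the block overlap neighbouring tiles.
            out[r:r + P, c:c + P] += block
    return out[w:w + N, w:w + N]

# Small self-check against a single large FFT-based linear convolution.
rng = np.random.default_rng(0)
N, B, w = 32, 8, 3
field = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
kernel = rng.standard_normal((2 * w + 1, 2 * w + 1)) + 1j * rng.standard_normal((2 * w + 1, 2 * w + 1))

tiled = tiled_convolve(field, kernel, B)
M = N + 2 * w
reference = np.fft.ifft2(np.fft.fft2(field, s=(M, M)) * np.fft.fft2(kernel, s=(M, M)))[w:w + N, w:w + N]
print("tiled result matches single-FFT result:", np.allclose(tiled, reference))
```

In this sketch each source tile is read and transformed exactly once. The edge buffers of the proposed algorithm serve the same purpose as the overlapping additions into `out`, but only keep the strips that later tiles still need.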

Every tile (and pixel) will only have to be read and written once, which is the minimum possible amount of disk access. The computational complexity is given by the $k^2$ FFTs, namely

$$k^2\,\mathcal{O}\left( (B+2w)^2\log\left((B+2w)^2\right) \right) = \mathcal{O}\left( (N+2kw)^2\log(B+2w) \right) \approx \mathcal{O}\left(N^2\log B \right) \tag{3}$$

when $w$ is relatively small.

This algorithm even has better complexity than the conventional diffraction operator using a single big FFT, and can thus be used even when the complete $N\times N$ hologram would entirely fit in RAM; it can outperform the standard diffraction operator, provided that $w \ll N$.
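The dependence of this advantage on $w$ can be explored with a small operation-count estimate based on Eq. (3). This is a sketch with hypothetical function names that ignores constant factors, memory transfers and disk access; the values of $N$ and $B$ are borrowed from Section 4 and the values of $w$ are arbitrary examples.

```python
import math

def short_distance_cost(N: int, k: int, w: int) -> float:
    """Leading-order FFT work of Eq. (3): k^2 FFT pairs on (B + 2w)^2 blocks."""
    B = N // k
    P = B + 2 * w
    return k**2 * P**2 * math.log2(P**2)

def single_fft_cost(N: int) -> float:
    """Leading-order work of one 2D FFT over the full N x N hologram."""
    return N**2 * math.log2(N**2)

N, k = 128 * 1024, 16  # assumed: 8192 x 8192 tiles
ratios = {w: short_distance_cost(N, k, w) / single_fft_cost(N) for w in (0, 128, 512, 2048)}
for w, ratio in ratios.items():
    print(f"w = {w:5d} px: tiled / single-FFT work ratio = {ratio:.3f}")
```

For $w = 0$ the ratio reduces to $\log B/\log N$. It increases with $w$ because the padded borders are processed redundantly, so the estimated benefit shrinks as $w$ approaches the tile size.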

2.2 Long-distance strip-based diffraction

When the diffraction distance $d$ is large, the tile-based approach is not recommended, since all source tiles would then affect all destination tiles, requiring many reads and writes. Instead, a strip-based processing of the hologram is proposed, leveraging the separability of the Fourier transform. It is compatible with different versions of the numerical Fresnel and ASM diffraction operators [24], including those with different pixel pitches at source and destination planes.

The implementation will depend on two main factors: whether zero-padding is needed to avoid aliasing (i.e. linearizing the otherwise circular convolution with the FFT), and whether the separable Fresnel diffraction or the non-separable ASM is used. When the transform is separable, only 2 passes are needed, otherwise 3 passes are used (see Fig. 3).

Fig. 3. Diagram of the long-distance strip-based diffraction using ASM, with zeropadding. This version uses two intermediate versions stored on disk. In Phase 1, strips are read and directly transferred to the GPU, where they are zeropadded, FFT-transformed and transposed. Phase 2 is similar to the previous phase, but here the ASM transfer function is applied. The second FFT is undone, and only the relevant data is sent back to disk after transposing the data once more. The final Phase 3 will undo the FFT from Phase 1, crop the data and write the final hologram.

The procedure is detailed in Algorithm 2. The hologram is divided into strips along the dimension which is contiguous in memory, whose heights are chosen so that they comfortably fit in GPU memory. Zeropadding is applied on the GPU to minimize memory transfer. Then, the data is transformed with a 1D (I)FFT, and can be multiplied element-wise with a transfer function depending on the algorithm phase. If applicable, the data is transposed on the GPU before writing it back to disk. This ensures contiguity of the transform along the other dimension and reduces the complexity of the data reads; otherwise, many passes would have to happen over the same files containing the processed strips to read data along the second dimension.

The complexity is again determined by the FFT, since all other algorithm components have linear complexity. For the non-padded versions, we need $4N$ FFTs on $N$ samples. The zeropadded versions require $4N$ FFTs on $2N$ samples for the separable Fresnel transform, and $6N$ FFTs on $2N$ samples for the non-separable ASM. Since every 1D FFT on $\mathcal {O}(N)$ samples costs $\mathcal {O}(N\log N)$, these all have a total computational complexity of $\mathcal {O}\left (N^2\log N\right )$, equivalent to the complexity of standard FFT-based diffraction operators.
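The three-phase zeropadded ASM variant of Fig. 3 can be sketched in NumPy as follows. This is an illustrative CPU-only approximation of the procedure, not the authors' CUDA implementation: the helper names (`asm_transfer_function`, `strip_pass`, `strip_asm`) are hypothetical, in-memory arrays stand in for the intermediate files on disk, the field is zeropadded at the corner, and a plain (not band-limited) ASM transfer function is assumed.

```python
import numpy as np

def asm_transfer_function(M, pitch, wavelength, d):
    """Plain angular spectrum transfer function on an M x M grid, indexed [ky, kx]."""
    f = np.fft.fftfreq(M, d=pitch)
    fy, fx = np.meshgrid(f, f, indexing="ij")
    arg = 1.0 / wavelength**2 - fx**2 - fy**2
    H = np.exp(2j * np.pi * d * np.sqrt(np.maximum(arg, 0.0)))
    return np.where(arg > 0, H, 0.0)  # evanescent components are dropped

def strip_pass(src, out_shape, strip_height, process, transpose):
    """Process `src` strip by strip; optionally store each result transposed."""
    dst = np.empty(out_shape, dtype=complex)  # stands in for a file on disk
    for r in range(0, src.shape[0], strip_height):
        result = process(src[r:r + strip_height], r)
        if transpose:
            dst[:, r:r + result.shape[0]] = result.T
        else:
            dst[r:r + result.shape[0], :] = result
    return dst

def strip_asm(field, H, strip_height):
    """Zeropadded ASM propagation of an N x N field using only 1D (I)FFTs on strips."""
    N, M = field.shape[0], H.shape[0]

    def phase1(strip, r):  # zeropad and FFT along x
        return np.fft.fft(strip, n=M, axis=1)

    def phase2(strip, r):  # rows are kx: zeropad and FFT along y, apply H, undo FFT, crop
        spec = np.fft.fft(strip, n=M, axis=1)
        spec *= H[:, r:r + strip.shape[0]].T
        return np.fft.ifft(spec, axis=1)[:, :N]

    def phase3(strip, r):  # undo the FFT of Phase 1 and crop
        return np.fft.ifft(strip, axis=1)[:, :N]

    a = strip_pass(field, (M, N), strip_height, phase1, transpose=True)  # [kx, y]
    c = strip_pass(a, (N, M), strip_height, phase2, transpose=True)      # [y, kx]
    return strip_pass(c, (N, N), strip_height, phase3, transpose=False)  # [y, x]

# Small self-check against a direct zeropadded 2D FFT implementation.
rng = np.random.default_rng(1)
N = 16
field = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
H = asm_transfer_function(2 * N, pitch=0.8e-6, wavelength=532e-9, d=1e-3)
strip_result = strip_asm(field, H, strip_height=4)
direct_result = np.fft.ifft2(np.fft.fft2(field, s=(2 * N, 2 * N)) * H)[:N, :N]
print("strip-based result matches direct 2D FFT:", np.allclose(strip_result, direct_result))
```

In this sketch Phase 1 performs $N$ FFTs, Phase 2 performs $2N$ FFTs and $2N$ IFFTs, and Phase 3 performs $N$ IFFTs, all on $2N$ samples, which matches the count of $6N$ transforms stated above for the zeropadded ASM.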

3. Implementation

The proposed diffraction algorithms are integrated in a simplified version of the multi-WRP CGH algorithm [9]. The implemented version accounts for diffuse reflections and occlusion, but has no specular reflections or color (we use a single wavelength).

The algorithm is summarized in Fig. 4: the volume spanning the 3D object is partitioned into slices along the z dimension, and a WRP is placed in the middle of every segment. Each WRP will accumulate the light (and occlusion) contributions of every point within its segment (called the WRP zone). This is done using a Look-Up Table (LUT) containing small diffuse surface elements corresponding to a multitude of $k$ quantized depth levels within every WRP. Every point is represented by a small surface element of $\sigma \times \sigma$ pixels, whose amplitude is modulated by a Hamming window and whose phase is assigned randomly for every pixel from the uniform distribution $\mathcal {U}(0,2\pi )$. The element is then propagated to the different relative quantization depths $z_k$ and stored in the LUT. The dimensions are $(w_k+\sigma )\times (w_k+\sigma )$, where $w_k$ will depend on the distance to the WRP $z_k$ as per (2).

Fig. 4. Summary of the multi-WRP algorithm. (a) Diagram of the main steps of the algorithm, depicting how the scene is divided into WRP zones, each containing its own set of 3D points. The LUT consists of entries of varying sizes, depending on their respective distance to the WRP plane. (b) Example of a single point accumulation, showing the real part of the holographic signal. First the occlusion mask is applied on the point location, followed by the LUT entry addition at that location.

The LUT can be reused for every subsequent WRP. 8 different random phase instances are used per quantization level to reduce mutual coherence between same-distance elements [25]. To apply occlusion, a multiplicative attenuating mask is used to block a small amount of light from the scene behind for every 3D point.
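The surface element and the per-point accumulation of Fig. 4(b) can be sketched as follows. Several details are assumptions made for illustration only: the function names are hypothetical, the unpropagated element stands in for a propagated LUT entry, the inverted Hamming mask is taken to be one minus the outer product of two 1D Hamming windows, and points are assumed to lie far enough from the WRP border that no clipping is needed.

```python
import numpy as np

def surface_element(sigma, rng):
    """sigma x sigma element: Hamming-window amplitude, uniform random phase."""
    window = np.hamming(sigma)
    amplitude = np.outer(window, window)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(sigma, sigma))
    return amplitude * np.exp(1j * phase)

def inverted_hamming_mask(size):
    """Attenuating occlusion mask (assumed form: 1 - 2D Hamming window)."""
    window = np.hamming(size)
    return 1.0 - np.outer(window, window)

def centred(row, col, size):
    """Index of a size x size patch whose centre pixel is (row, col)."""
    half = size // 2
    return (slice(row - half, row - half + size),
            slice(col - half, col - half + size))

def accumulate_point(wrp, row, col, mask, lut_entry, amplitude=1.0):
    """Apply the occlusion mask first, then add the LUT entry at the point."""
    wrp[centred(row, col, mask.shape[0])] *= mask
    wrp[centred(row, col, lut_entry.shape[0])] += amplitude * lut_entry

rng = np.random.default_rng(2)
element = surface_element(24, rng)   # 24 x 24 window, as in Section 4
mask = inverted_hamming_mask(64)     # 64 x 64 mask, as in Section 4
wrp = np.ones((128, 128), dtype=complex)  # placeholder for light from behind
accumulate_point(wrp, 64, 64, mask, element)
print("attenuated background at the point:", abs(mask[32, 32]))
```

Applying the mask before the addition means that a point attenuates the wavefield arriving from behind it without attenuating its own contribution.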

The algorithm starts at the furthest WRP #1, and progressively propagates to subsequent WRPs using the short-distance tiling-based diffraction operator. Once the final WRP #$W$ is reached and processed, the long-distance strip-based diffraction operator is used to reach the hologram plane.

The algorithm has been implemented in a C++17 multithreaded environment, with CUDA 10.1 for the calculations done on the GPU. The main program components for the multi-WRP algorithm are illustrated in Fig. 5, and do the following:

• Blockloader: this object will transparently (pre-)load and write wavefield blocks to disk using a buffer in RAM. This enables efficient multiple reads and writes to the same tile, and the buffer size can scale with the available amount of RAM.
• Pointcloud processor: will read pointcloud chunks from disk and preprocess them for the LUT adders. This involves converting the 3D positions to memory addresses and offsets, and assigning points to multiple tiles when the LUT entries overlap with the tile edges.
• LUT adders: multiple threads will alternatingly apply occlusion masks and add LUT entries to several hologram tiles in parallel, using the data provided by the Blockloader.
• Tile Propagators: multiple threads on GPU will compute local ASM propagations (or other convolutions) on different tiles in a pipelined fashion. Mutexes ensure that the accesses to the edge buffers happen in the correct sequence (cf. Algorithm 1).
• Strip Propagators: compute the various 1D FFTs, zeropadding and transposes on hologram strips as detailed in Algorithm 2.

Fig. 5. Schematic of the main program components. The left region contains components running on the CPU, the right region contains those that run on the GPU. The annotated black arrows indicate how the components communicate with each other.

In our implementation, 4 threads are assigned to the LUT adders as well as to the propagators. The long-distance propagation phase is simpler: it only involves the Blockloader on the CPU and multiple strip-based propagators on the GPU.

4. Experiments

We calculated a hologram with a wavelength of $\lambda = 532\, {\textrm{nm}}$ and a pixel pitch of $p= 0.8$ µm. The resolution was $128K \times 128K$ pixels, totalling $17.2$ Gigapixels. The resulting physical dimensions are $10.5 \times 10.5\, {\textrm{cm}}$, with a viewing angle of $38.8$ degrees.
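These figures follow directly from the wavelength, pixel pitch and resolution, as the short check below shows. It assumes that $128K$ denotes $128 \times 1024 = 131072$ pixels and that the viewing angle is $2\sin^{-1}(\lambda/2p)$, i.e. twice the opening angle of Eq. (2). The memory estimate additionally assumes 8 bytes per pixel (single-precision complex values), which is not stated in the text.

```python
import math

wavelength = 532e-9      # metres
pitch = 0.8e-6           # metres
N = 128 * 1024           # assumed meaning of "128K" pixels per side

pixels = N * N
side_cm = N * pitch * 100.0
viewing_angle_deg = math.degrees(2.0 * math.asin(wavelength / (2.0 * pitch)))
size_gib = pixels * 8 / 2**30  # assumption: 8 bytes per complex pixel

print(f"{pixels / 1e9:.1f} Gigapixels, {side_cm:.1f} cm per side, "
      f"{viewing_angle_deg:.1f} degree viewing angle, {size_gib:.0f} GiB")
```

Under the 8-bytes-per-pixel assumption, a single wavefield is several times larger than the 32 GB of RAM of the test machine, which is why secondary storage is required.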

The algorithm ran on a machine with an AMD Ryzen 7 2700 processor, 32 GB of RAM, a GeForce RTX 2070 GPU and a CSSD-M2B02TPG2VN 2TB SSD as a disk, running Windows 10.

The scene was composed of the Bi-plane point cloud, consisting of 1 million points with associated intensities for the point color. The plane was centered laterally to match the hologram center, and displaced to be $20\, {\textrm{cm}}$ from the hologram plane. The dimensions of the plane along the main axes are $63\times 73\times 99\;\textrm{mm}$. The scene setup is shown in Fig. 6.

Fig. 6. Diagram of the scene geometry (not to scale). (a) front view, showing the plane’s lateral dimensions. (b) top view, showing the distance to the hologram plane and the depth of the model.

The algorithm was configured to use 80 WRPs, equidistantly placed across the point cloud volume with mutual spacings of $1.3\;\textrm{mm}$. The LUT consisted of 60 quantization levels, using surface elements with a $24\times 24$ pixel window, with 8 different random phase instances per quantization level. The occlusion mask used is a $64 \times 64$ pixel 2D inverted Hamming window; the optimal mask parameters will depend, among other things, on the point cloud density, wavelength, pixel pitch and hologram dimensions [8].

For reference, we compare the calculation times with the standard rectangular tiling approach using Shifted Fresnel diffraction [20], which was implemented in the CWO library [24] using the GPU. We used the same tile size of $8192 \times 8192$ pixels both for the rectangular tiling as well as for the proposed short-distance tiling-based diffraction technique. The long-distance technique used zeropadding, ASM and strip heights of $256$ pixels each.

For the previously specified hologram dimensions, the calculation time of a single diffraction operator was $3.6$ minutes for the short-distance technique, $35.3$ minutes for the long-distance technique and $30$ hours for the reference rectangular tiling method.

The total running time of the multi-WRP algorithm was 6 hours and 51 minutes. This consisted of the joint LUT additions and 80 short-distance propagations for the WRPs, followed by a long-distance propagation. Compared to the (extrapolated) time of 100 days it would take when using the reference rectangular tiling method, the proposed algorithm is a significant improvement. We show three rendered views from the CGH in Fig. 7; these were computed by taking a vertically centered $8192\times 8192$ square crop at the left side, middle and right side of the hologram like in [9], equivalent to setting the rest of the hologram to 0. Then, they were backpropagated at $z= 27.0\;\textrm{mm}$ using the long-distance ASM algorithm, after which the absolute value was taken and the result downscaled to obtain the resulting images.
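The timings reported in this section can be cross-checked with simple arithmetic. The extrapolation below assumes that the 100-day figure corresponds to 81 propagations (80 between WRPs plus one to the hologram plane), each taking the 30 hours measured for the reference method; this reading is an assumption and is not stated explicitly in the text.

```python
short_distance_min = 3.6    # minutes per short-distance propagation
long_distance_min = 35.3    # minutes per long-distance propagation
reference_hours = 30.0      # hours per rectangular tiling propagation

speedup_short = reference_hours * 60.0 / short_distance_min
speedup_long = reference_hours * 60.0 / long_distance_min

propagations = 80 + 1       # assumed: 80 WRP propagations plus the final one
extrapolated_days = propagations * reference_hours / 24.0

print(f"short-distance speedup: {speedup_short:.0f}x")
print(f"long-distance speedup:  {speedup_long:.0f}x")
print(f"extrapolated reference time: {extrapolated_days:.0f} days")
```

Under this reading the numbers are mutually consistent: the speedups come out at roughly 500 and 50, and the extrapolated reference time at about 100 days.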

Fig. 7. Three rendered views taken from the same hologram, backpropagated at $z= 27.0\;\textrm{mm}$.

5. Conclusion

We propose two novel algorithms for the efficient numerical propagation of wavefields with extremely high resolutions. This is an integral building block of many CGH algorithms, often being the most calculation-heavy component. We report large speed gains over the standard rectangular tiling based method, with a 500-fold speedup for short-distance tiling-based diffraction, and a 50-fold speedup for long-distance strip-based diffraction. This work can have an impact beyond CGH for display purposes, and could even be used to simulate diffraction for applications at non-optical wavelengths where the far-field approximation is not applicable.

Funding

FP7 Ideas: European Research Council (617779); Japan Society for the Promotion of Science (19H01097, 19H04132).

Acknowledgments

The Biplane point cloud model is courtesy of ScanLAB Projects.

References

1. M. Yamaguchi, H. Hoshino, T. Honda, and N. Ohyama, “Phase-added stereogram: calculation of hologram using computer graphics technique,” Proc. SPIE 1914, 25–31 (1993).

2. H. Kang, E. Stoykova, and H. Yoshikawa, “Fast phase-added stereogram algorithm for generation of photorealistic 3D content,” Appl. Opt. 55(3), A135–A143 (2016).

3. T. Shimobaba and T. Ito, “Fast generation of computer-generated holograms using wavelet shrinkage,” Opt. Express 25(1), 77–87 (2017).

4. D. Blinder, “Direct calculation of computer-generated holograms in sparse bases,” Opt. Express 27(16), 23124–23137 (2019).

5. J. Y. Son, H. Lee, B. R. Lee, and K. H. Lee, “Holographic and light-field imaging as future 3-D displays,” Proc. IEEE 105(5), 789–804 (2017).

6. D. Blinder, A. Ahar, S. Bettens, T. Birnbaum, A. Symeonidou, H. Ottevaere, C. Schretter, and P. Schelkens, “Signal processing challenges for digital holographic video display systems,” Signal Process. Image Commun. 70, 114–130 (2019).

7. T. Shimobaba, N. Masuda, and T. Ito, “Simple and fast calculation algorithm for computer-generated hologram with wavefront recording plane,” Opt. Lett. 34(20), 3133–3135 (2009).

8. A. Symeonidou, D. Blinder, A. Munteanu, and P. Schelkens, “Computer-generated holograms by multiple wavefront recording plane method with occlusion culling,” Opt. Express 23(17), 22149–22161 (2015).

9. A. Symeonidou, D. Blinder, and P. Schelkens, “Colour computer-generated holography for point clouds utilizing the Phong illumination model,” Opt. Express 26(8), 10282–10298 (2018).

10. A. Gilles, P. Gioia, R. Cozot, and L. Morin, “Hybrid approach for fast occlusion processing in computer-generated hologram calculation,” Appl. Opt. 55(20), 5459–5470 (2016).

11. A. Gilles and P. Gioia, “Real-time layer-based computer-generated hologram calculation for the Fourier transform optical system,” Appl. Opt. 57(29), 8508–8517 (2018).

12. H. Zhang, L. Cao, and G. Jin, “Three-dimensional computer-generated hologram with Fourier domain segmentation,” Opt. Express 27(8), 11689–11697 (2019).

13. K. Matsushima and S. Nakahara, “Extremely high-definition full-parallax computer-generated hologram created by the polygon-based method,” Appl. Opt. 48(34), H54–H63 (2009).

14. K. Matsushima, M. Nakamura, and S. Nakahara, “Silhouette method for hidden surface removal in computer holography and its acceleration using the switch-back technique,” Opt. Express 22(20), 24450–24465 (2014).

15. H.-J. Yeom and J.-H. Park, “Calculation of reflectance distribution using angular spectrum convolution in mesh-based computer generated hologram,” Opt. Express 24(17), 19801–19813 (2016).

16. M. Askari, S.-B. Kim, K.-S. Shin, S.-B. Ko, S.-H. Kim, D.-Y. Park, Y.-G. Ju, and J.-H. Park, “Occlusion handling using angular spectrum convolution in fully analytical mesh based computer generated hologram,” Opt. Express 25(21), 25867–25878 (2017).

17. J.-P. Liu and H.-K. Liao, “Fast occlusion processing for a polygon-based computer-generated hologram using the slice-by-slice silhouette method,” Appl. Opt. 57(1), A215–A221 (2018).

18. K. Wakunami, H. Yamashita, and M. Yamaguchi, “Occlusion culling for computer generated hologram based on ray-wavefront conversion,” Opt. Express 21(19), 21811–21822 (2013).

19. S. Igarashi, T. Nakamura, K. Matsushima, and M. Yamaguchi, “Efficient tiled calculation of over-10-gigapixel holograms using ray-wavefront conversion,” Opt. Express 26(8), 10773–10786 (2018).

20. R. P. Muffoletto, J. M. Tyler, and J. E. Tohline, “Shifted Fresnel diffraction for computational holography,” Opt. Express 15(9), 5631–5640 (2007).

21. K. Matsushima, “Shifted angular spectrum method for off-axis numerical propagation,” Opt. Express 18(17), 18453–18463 (2010).

22. Y.-H. Kim, C.-W. Byun, H. Oh, J.-E. Pi, J.-H. Choi, G. H. Kim, M.-L. Lee, H. Ryu, and C.-S. Hwang, “Off-axis angular spectrum method with variable sampling interval,” Opt. Commun. 348, 31–37 (2015).

23. J. W. Goodman, Introduction to Fourier Optics (W.H. Freeman, 2017).

24. T. Shimobaba, J. Weng, T. Sakurai, N. Okada, T. Nishitsuji, N. Takada, A. Shiraki, N. Masuda, and T. Ito, “Computational wave optics library for C++: CWO++ library,” Comput. Phys. Commun. 183(5), 1124–1138 (2012).

25. A. Symeonidou, D. Blinder, A. Ahar, C. Schretter, A. Munteanu, and P. Schelkens, “Speckle noise reduction for computer generated holograms of objects with diffuse surfaces,” Proc. SPIE 9896, 98960F (2016).


Optics Express