
Out-of-core GPU 2D-shift-FFT algorithm for ultra-high-resolution hologram generation

Open Access

Abstract

We propose a novel out-of-core GPU algorithm for 2D-Shift-FFT (i.e., 2D-FFT with FFT-shift) to generate ultra-high-resolution holograms. Generating an ultra-high-resolution hologram requires a large complex matrix (e.g., 100K$^{2}$) whose size typically exceeds the GPU memory. To handle such a large-scale hologram plane with limited GPU memory, we employ a 1D-FFT based 2D-FFT computation method. We transpose the column data to obtain a continuous memory layout, which improves the column-wise 1D-FFT stage in both data communication and GPU computation. We also combine the FFT-shift and transposition steps to reduce and hide the workload. To maximize the GPU utilization efficiency, we exploit the concurrent execution ability of recent heterogeneous computing systems. We further optimize our method's performance with a cache-friendly chunk generation algorithm and a pinned-memory buffer approach. We tested our method on three computing systems with different GPUs and various sizes of complex matrices. Compared to the conventional implementation based on the state-of-the-art GPU FFT library (i.e., cuFFT), our method achieved up to 3.24 and 3.06 times higher performance for a large-scale complex matrix in the single- and double-precision cases, respectively. To assess the benefits offered by the proposed approach in an actual application, we applied our method to the layer-based CGH process. As a result, it reduced the time required to generate an ultra-high-resolution hologram (e.g., 100K$^{2}$) by up to 28% compared to the conventional algorithm. These results demonstrate the efficiency and usefulness of our method.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Dennis Gabor [1] first proposed holography in 1947. Because this method can display a three-dimensional image without additional devices such as 3D glasses, it is referred to as the ultimate display technology. A hologram is the medium on which holography is recorded. One of the representative hologram generation methods is the computer-generated hologram (CGH) approach [2]. CGH can generate holograms for various types of 3D information in a virtual environment with only computations. Owing to recent advances in computing power, CGH is gaining more attention as a promising hologram generation method.

Because the angle of view (AOV) of a hologram display increases as the spacing between the pixels decreases, a hologram requires a significantly higher resolution than a 2D display. For example, a $5cm^{2}$ hologram with an AOV of 40$^{\circ }$ requires 100K$^{2}$ pixels at a 500nm pixel pitch. The core modules of CGH include the numerical propagation of the wave-front, also called diffraction, and one of its primary operations is the fast Fourier transform (FFT) with an FFT-shift for the hologram plane (or complex matrix). The FFT-shift moves the zero-frequency component of the resulting spectrum to the origin, and it is performed before and after each FFT operation (FFT-shift $\to$ FFT $\to$ FFT-shift) [3].
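For the stated parameters, the pixel count follows directly from the aperture size and the pixel pitch:

$$\frac{5~\mathrm{cm}}{500~\mathrm{nm}}=\frac{5\times 10^{-2}~\mathrm{m}}{5\times 10^{-7}~\mathrm{m}}=10^{5}=100\mathrm{K}~\text{pixels per side, i.e., }100\mathrm{K}^{2}~\text{pixels in total}.$$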

FFT is widely employed in various applications, including image processing, machine learning, and digital holography. Various acceleration methods have been proposed to improve the performance of FFT [4,5]. Recent works have actively employed Graphics Processing Units (GPUs), which support massive parallelism. cuFFT [5] is a state-of-the-art GPU-based FFT library, and it shows up to hundreds of times higher performance compared to CPU-based algorithms when the entire data set resides in the GPU memory (i.e., device memory). However, the data size for ultra-high-resolution hologram generation (e.g., 149GB for a 100K$^{2}$ single-precision complex matrix) exceeds the device memory space (e.g., 6-24 GB). For the FFT-shift operation, a GPU-based acceleration algorithm has been proposed [3], but it also does not consider out-of-core cases, in which the data size exceeds the device memory size. Therefore, it is difficult to apply these prior methods directly to ultra-high-resolution hologram generation.

This paper introduces a novel out-of-core GPU algorithm for 2D-FFT with FFT-shift (hereafter referred to as “2D-Shift-FFT”) for generating ultra-high-resolution holograms, for which the data size is much larger than the device memory size. We solve the limited memory space issue by representing the 2D-FFT operation as a set of 1D-FFTs (Sec. 4). Then, we improve the performance further by combining two steps in the 2D-FFT process and optimizing data communications between the CPU (host) memory and GPU (device) memory (Sec. 5). We implemented our method on three computing environments and compared its performance with the conventional approach (i.e., $Conv_{cuFFT}$) using cuFFT, the state-of-the-art GPU-based FFT library (Sec. 6). For the 2D-Shift-FFT computation, our method achieved up to 3.24 and 3.06 times higher performance than the conventional algorithm for the single- and double-precision cases, respectively. We also applied our method to the layer-based CGH process, finding that it reduced the hologram generation time by up to 28% compared with $Conv_{cuFFT}$ (Sec. 6.3). These results confirm the benefit and usefulness of our algorithm.

2. Related work

Depending on the element representing the 3D object, we can categorize CGH synthesis technologies into point-cloud [6,7], polygon mesh [8–10], layer [11–13], and light field [14]-based methods. Except for the point-cloud method, all of these methods use 2D-Shift-FFT during their processes. Polygon mesh-based CGH methods are divided into non-analytic [8] and analytic [9,10] methods. The non-analytic method utilizes 2D-Shift-FFT for each triangle to simulate propagation numerically. In principle, 2D-Shift-FFT is not used for the propagation calculations in the analytic method; however, when the occlusion effect is considered, a 2D-Shift-FFT computation is required for each triangle in both methods. For layer-based and light field-based CGH methods, the 2D-Shift-FFT must be performed as many times as the number of depth layers and the number of resolutions, respectively. Like the computation of the complex field of a 3D object, calculating the numerical propagation of the hologram plane is another essential step of the CGH process. The most widely employed approaches include the angular spectrum method [15] and the Fresnel propagation method [16]. For these numerical propagation calculation methods, the 2D-Shift-FFT is also one of the basic operations. Moreover, the required memory space doubles when zero-padding is applied to prevent aliasing. Consequently, in CGH, 2D-Shift-FFT computations are the most memory- and time-consuming operations, and it is important to optimize the 2D-Shift-FFT to accelerate the CGH process.

To accelerate the CGH process, multi-core CPUs have been employed [17,18]. Frigo and Johnson presented FFTW3, which is one of the state-of-the-art CPU-based FFT libraries [19]. They applied data-parallel approaches to FFT computations using SIMD (Single Instruction, Multiple Data) units in CPU cores. Shimobaba et al. [17] proposed the CWO++ library that uses FFTW3 for FFT. Murano et al. [18] extended Shimobaba et al.’s work to utilize the Intel Xeon Phi processor by parallelizing CGH computation with OpenMP [20] while using the Intel Math Kernel Library (MKL) [21] for FFT. It achieved up to eight times higher performance than using two octa-core CPUs.

Recently, GPUs have been actively employed for FFT processing [5,22,23]. NVIDIA presented a GPU-based FFT library (cuFFT) [5] that demonstrates much higher performance compared to CPU-based algorithms [23]. However, the GPU has a relatively small memory space compared to the CPU side, and this limits the data size cuFFT can handle at once for 2D-FFT operations. A 2D-FFT can be converted into a set of 1D-FFTs. Gu et al. [24] utilized this property to process 2D-FFT for a large-scale matrix on a GPU. To handle a large-scale complex matrix with limited device memory, their method sends multiple rows (or columns), instead of the whole matrix, to the device memory. Then, it performs 1D-FFT for these rows (or columns) and repeats these steps until all rows (or columns) are processed. They also adopt multiple GPU streams to reduce the data transfer overhead. Ogata et al. [25] and Chen and Li [26] utilized both the CPU and GPU for 2D-FFT. They formulated an expectation model of the processing time for a given data size while measuring the model’s coefficients based on experimental results on each processing unit. Based on their model, they distributed the 2D-FFT workload to the CPU and GPU, successfully reducing the processing time.

Similar to Gu et al. [24], we process a set of 1D-FFTs to handle 2D-FFT for a large-scale complex matrix with the limited memory of a GPU. However, we propose an efficient algorithm for 2D-FFT with FFT-shift (i.e., 2D-Shift-FFT), not only for 2D-FFT, and our algorithm further improves the GPU utilization efficiency.

Abdellah [3] proposed a GPU-based FFT-shift library, CUFFTSHIFT, to accelerate the FFT-shift process. A shifting equation is the key function of the shift operation, and it determines the element pairs to swap. Because the multi-dimensional array has a flat (or 1D) layout in the device memory space, it is important to find the correct index of pair elements for a given element. They designed three different FFT-shift kernels for one-, two-, and three-dimensional arrays while focusing on finding the correct CUDA thread index. As a result, they achieved higher performance by two orders of magnitude compared to a CPU-based library (i.e., Octave [27]). However, the data size that CUFFTSHIFT can handle is limited depending on the device memory size. In other words, it only works in an in-core manner.

Unlike CUFFTSHIFT, we consider both FFT-shift and 2D-FFT operations, and our algorithm works in an out-of-core manner to handle data exceeding device memory capabilities.

Jackin et al. [28,29] employed a GPU cluster for the fast computation of large-scale Fresnel CGH. They decompose an object and a hologram into $k\times k$ sub-objects and sub-holograms whose sizes are smaller than the GPU memory and modulate the phase factor (i.e., twiddle factor) in the Fourier transform of each sub-hologram. Each node performs FFT $k \times k$ times for each sub-hologram assigned to it to generate the full-size hologram. Then, the final hologram is generated by combining all the holograms computed by the nodes. As a result, they achieved a 1.5 times performance improvement over a conventional method with sixteen nodes, each having a GPU. In this method, each node processes a set of 1D-FFTs with synchronization steps between the column-wise FFT and row-wise FFT stages.

Unlike Jackin et al., who improve the performance of Fresnel CGH computation on a GPU cluster, we propose a 2D-Shift-FFT algorithm for a single node with a GPU, not for a cluster. We also focus on optimizing the communication overhead between CPU memory and GPU memory in a node. We could apply Jackin et al.’s [29] work decomposition approach to an out-of-core GPU algorithm in a node. However, we found that it requires much more data communication between CPU and GPU memories, and it is not efficient for intra-node out-of-core GPU algorithms. We analyze the data transfer overhead theoretically in Sec. 6.

For handling large-scale CGH computation with small system memory, Farhoosh et al. [30] used secondary storage, i.e., a hard disk drive (HDD). They loaded only the sub-samples to process into the system memory while keeping the whole set of samples on the HDD. Shimobaba et al. [31] reported that using an SSD (solid-state drive) shows better performance than using an HDD. Also, Blinder and Shimobaba [32] employed SSDs to handle extreme-resolution holograms whose size is much larger than the system memory. They also proposed efficient algorithms for short-distance and long-distance propagation, respectively, and achieved impressive performance improvements.

Unlike those works, we target the case in which the system memory is large enough but the GPU memory is much smaller than the problem size. Also, we concentrate on improving the data communication efficiency between the CPU and GPU memories rather than between secondary storage and system memory.

3. Overview

For wave-front propagation in CGH, FFT-shift operations are performed before and after the 2D-FFT. The first FFT-shift serves to calculate the accurate result of the Fourier transform, and the second FFT-shift places the DC component at the center of the signal. Figure 1(a) shows the basic workflow. From a computational perspective, the FFT-shift operation for two-dimensional data (e.g., a hologram plane) is realized by exchanging the first and third quadrants, and the second and fourth quadrants. While the data size for generating an ultra-high-resolution hologram ranges from tens to hundreds of gigabytes (e.g., 149 GB for a 100K$^{2}$ hologram), a commodity GPU has a much smaller memory space (e.g., 2-24 GB). Instead of performing the 2D-FFT directly, we can obtain the 2D-FFT result by performing 1D-FFT for the rows after performing 1D-FFT for the columns [24]. Although it is challenging to hold the entire matrix in GPU memory, a single row or multiple rows (or columns) are small enough to load into GPU memory. Therefore, our method employs this property to accelerate the 2D-FFT computation by using a GPU. Consequently, the 2D-Shift-FFT process is decomposed into four steps, as indicated in Fig. 1(b). Before the 1D-FFT step for the columns, we transpose the matrix to improve the efficiency of data transfer and computation on the GPU, as shown in Fig. 1(c) (Sec. 4). Also, we combine the FFT-shift and transposition operations into one step to improve the GPU’s utilization efficiency. Finally, the process of our algorithm includes five steps, as presented in Fig. 1(d) (Sec. 4). We also maximize the GPU’s utilization efficiency by employing multiple streams with the pinned-memory buffer approach (Sec. 5).
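The decomposition into 1D-FFTs relies on the separability of the 2D discrete Fourier transform of an $M\times N$ matrix $f$:

$$F(u,v)=\sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x,y)\,e^{-2\pi i\left(\frac{ux}{M}+\frac{vy}{N}\right)} =\sum_{x=0}^{M-1} e^{-2\pi i\frac{ux}{M}}\left[\sum_{y=0}^{N-1} f(x,y)\,e^{-2\pi i\frac{vy}{N}}\right],$$

where the inner sums are 1D-FFTs along one dimension and the outer sum is a 1D-FFT along the other; the two passes can therefore be computed column-wise first and then row-wise.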


Fig. 1. (a) Conventional process for 2D-Shift-FFT in CGH, (b) 2D-Shift-FFT process based on column- and row-wise 1D-FFTs, (c) Adding data transposition steps for an efficient computation for the column-wise 1D-FFT, and (d) Integrating the first FFT-shift and data transposition steps into one step (Ours)


4. Out-of-core GPU algorithm for 2D-Shift-FFT

To utilize the GPU, it is necessary to send the target rows or columns to the GPU memory first. Unlike a row, a column resides discontinuously in the memory space because the matrix has a row-first layout in memory [33]. Because a memory copy API call between the CPU (host) memory and the GPU (device) memory transfers data placed continuously in memory, sending a column requires multiple memory copy API calls (i.e., as many as the number of elements in a column). Although some GPU programming APIs support 2D memory copy functions that copy a sub-region of a 2D array (e.g., cudaMemcpy2D() of the CUDA runtime API), their data transfer speed is much slower than copying data in a continuous memory region. Therefore, sending a column incurs a higher data transfer overhead than sending a row.

Although we can group multiple columns and reduce the data transfer overhead slightly, such a memory layout also affects the performance of 1D-FFT processing on a GPU. A GPU maintains data in the device memory, which is the global memory of the GPU. Because accessing the device memory is slow, recent GPUs have a cache structure (e.g., L1 and L2 caches) to reduce the memory access latency. Threads of the GPU access the device memory through these caches. The GPU has a SIMT (Single Instruction, Multiple Threads) architecture, in which thirty-two consecutive threads compose a warp and the same instruction controls all threads in a warp at once [33]. If all threads in a warp access data in the same cache line, the access is served with only one cache transaction. However, irregular access patterns such as those of the column-wise 1D-FFT require multiple cache transactions (e.g., up to thirty-two) for a warp [33]. Such memory access stalls become a performance bottleneck during GPU computations. We found that the column-wise 1D-FFT is up to 14.1 times slower than the row-wise 1D-FFT. We can avoid such an irregular access pattern by transposing the target column in the device memory, as in Gu et al. [24]. However, this approach still requires inefficient column-wise data copies from the host memory to the device memory and additional transposition operations on the GPU.

Instead of undertaking a column-wise data transfer and transposing it on the GPU, we transpose the matrix in the host memory first and send the transposed column to the device memory. The column now resides in a continuous memory region, and it is possible to copy it to the device memory with a single memory copy API call. Then, the GPU computes the 1D-FFT for the column with the row-wise 1D-FFT module. The GPU returns the results to the host memory and then gets the next column. Once the 1D-FFT step is finished for all columns, we transpose the matrix back into its original form and perform the 1D-FFT steps for the rows. We found that a row-wise data copy operation with two matrix transposition steps takes about 28% less time than a column-wise data transfer.
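The following CUDA C++ snippet is a minimal sketch (not the authors' code; the matrix size N and column index are illustrative assumptions) that contrasts the two transfer strategies for a single column of a row-major complex matrix: a strided cudaMemcpy2D() copy versus a single contiguous cudaMemcpy() after a host-side transposition.

#include <cuda_runtime.h>
#include <cuComplex.h>
#include <vector>

int main() {
    const size_t N = 4096;                       // matrix is N x N, row-major
    std::vector<cuFloatComplex> mat(N * N);      // host matrix
    std::vector<cuFloatComplex> matT(N * N);     // host matrix, transposed copy
    const size_t c = 123;                        // column index to transfer

    cuFloatComplex* d_col = nullptr;
    cudaMalloc((void**)&d_col, N * sizeof(cuFloatComplex));

    // (a) Strided copy of column c directly from the row-major layout:
    //     one element per matrix row, so N non-contiguous locations are touched.
    cudaMemcpy2D(d_col, sizeof(cuFloatComplex),               // dst, dst pitch
                 mat.data() + c, N * sizeof(cuFloatComplex),  // src, src pitch
                 sizeof(cuFloatComplex), N,                   // width (bytes), height
                 cudaMemcpyHostToDevice);

    // (b) Transpose on the host first; column c of 'mat' becomes row c of 'matT',
    //     a contiguous block that needs only one plain copy call.
    for (size_t y = 0; y < N; ++y)
        for (size_t x = 0; x < N; ++x)
            matT[x * N + y] = mat[y * N + x];
    cudaMemcpy(d_col, matT.data() + c * N,
               N * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);

    cudaFree(d_col);
    return 0;
}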

Figure 1(c) shows the 2D-Shift-FFT process with the matrix transposition and row-wise 1D-FFT modules. In this workflow, the first two steps are I/O-bound tasks. More specifically, the primary operation of the two steps is a set of memory copies that swap pairs of elements. However, a memory copy operation is time-consuming compared to a computational operation. Also, in contrast to computational tasks, which can be accelerated by parallel processing, memory copy operations from multiple threads are usually serialized in the multi-core CPU architecture. To reduce such memory copy overhead, we combine these two steps into one step. Let the center of the hologram plane (or complex matrix) be the origin, with $w$ and $h$ being the matrix’s width and height. The destination index of the FFT-shift operation for the given $x$- and $y$-coordinates can then be expressed by Eq. (1).

$$\begin{aligned}ShiftX(x) &= (x + w/2)~mod~w \\ ShiftY(y) &= (y + h/2)~mod~h. \end{aligned}$$

The transposition function computes the destination coordinate for the given point, as expressed by Eq. (2).

$$Trans(x,y) = (y,x).$$

We can then compute the target destination for the given coordinates $(x,y)$ after the FFT-shift and transposition steps.

$$ShiftTrans(x,y)=(ShiftY(y),ShiftX(x)).$$

Therefore, we can perform the FFT-shift and matrix transposition processes with a combined step, which requires only one element exchange (or memory copy) stage.
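A minimal host-side sketch of the combined mapping in Eqs. (1)–(3) is given below; the function and variable names are our own, and the snippet is illustrative rather than the paper's implementation.

#include <complex>
#include <vector>
#include <cstddef>

// Combined FFT-shift + transposition (Eq. (3)): the element at (x, y) of the
// w x h input matrix is written to coordinate (ShiftY(y), ShiftX(x)) of the
// h x w output matrix, so only one element copy per entry is needed.
static void shiftTrans(const std::vector<std::complex<float>>& in,
                       std::vector<std::complex<float>>& out,
                       std::size_t w, std::size_t h) {
    for (std::size_t y = 0; y < h; ++y) {
        const std::size_t sy = (y + h / 2) % h;          // ShiftY(y), Eq. (1)
        for (std::size_t x = 0; x < w; ++x) {
            const std::size_t sx = (x + w / 2) % w;      // ShiftX(x), Eq. (1)
            out[sx * h + sy] = in[y * w + x];            // row sx, column sy of output
        }
    }
}

int main() {
    const std::size_t w = 4, h = 2;
    std::vector<std::complex<float>> in(w * h), out(w * h);
    for (std::size_t i = 0; i < in.size(); ++i) in[i] = {float(i), 0.0f};
    shiftTrans(in, out, w, h);
    // e.g., element (0,0) of the input lands at row ShiftX(0)=2, column ShiftY(0)=1.
    return 0;
}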

Finally, Fig. 1(d) shows the process of our out-of-core GPU algorithm for 2D-Shift-FFT. We found that our combined shift-and-transposition (ShiftTrans) step takes about 88% of the time required when the two steps are performed separately. The benefit of combining the two steps may appear minor in terms of processing time, and the gap decreases further when we apply our cache-friendly chunk generation algorithm (Sec. 5.1). However, by combining these steps, we can perform the FFT-shift step concurrently with other steps (e.g., data transfer and computation on the GPU) (Sec. 5). As a result, our combined shift-and-transpose approach improves the overall 2D-Shift-FFT performance by approximately 6% in our experiments.

5. Maximize the GPU utilization efficiency

Although the device memory cannot hold all of the complex matrix for an ultra-high-resolution hologram, it can maintain multiple rows or columns. Sending successive lines (rows or transposed columns) can improve the data transfer efficiency. It also helps to utilize the GPU more intensively compared to the processing of a single line, as massive parallelism is the core strategy of GPU-based computations.

Therefore, we send multiple rows (or transposed columns) at a time and let the GPU perform 1D-FFT for the multiple lines. In terms of data transfer efficiency, sending the maximum number of lines that fit in the device memory is the best choice. However, this strategy wastes the computing power of the GPU while it waits for the data. To solve this problem, we exploit the concurrent execution ability of recent heterogeneous computing systems. A system with both a CPU and a GPU is the most common heterogeneous system. It supports asynchronous execution of CPU processing, computations on the GPU, and data transfers between the host and device memories. This is realized using multiple streams. A stream is a kind of channel for issuing work to the GPU, and operations (e.g., data transfers or kernel launches) in different streams can be performed simultaneously [33].

Based on this concurrent execution property, we improve the GPU’s utilization efficiency by overlapping (running in parallel) the $ShiftTrans$ (or transposition) operation on the CPU, data transfers between the host and device, and 1D-FFT computations on the GPU. Figure 2 shows an overview of our strategy. First, we decompose the matrix into chunks that form the basic work unit of our algorithm. Each chunk consists of multiple lines (i.e., rows) in the target matrix. For the first 1D-FFT (the second step in Fig. 1(d)), the target matrix is the $ShiftTrans$ step’s output for the input matrix. For the second 1D-FFT (the fourth step in Fig. 1(d)), the target matrix is a transposed matrix of the results from the first 1D-FFT step. We determine the number of lines in a chunk (i.e., chunk size) depending on the device memory size and the number of streams it will use. The chunk size is the maximum integer value, $x$, that satisfies the following two conditions:

$$x \leq \frac{(device~memory~size)}{k \times (data~size~of~a~line) \times (the~number~of~streams)}.$$
$$\frac{(\#~of~total~lines)}{x} \in 2\mathbb{Z}^{+}.$$

The first condition (Eq. (4)) gives the maximum chunk size the device memory can hold, where $k$ is a constant for reserving workspace. For example, $k$ is 3 when we use cuFFT, as it requires workspace for processing 1D-FFT in addition to the space for storing the input chunk and the output result. The second condition (Eq. (5)) ensures that a chunk does not cross the boundary of a quadrant. We process the chunks in parallel while switching the stream for each chunk. In detail, a chunk is first generated on the CPU, and the subsequent jobs for the chunk (i.e., data transfer and 1D-FFT on the GPU) are enqueued on a stream. Given that API calls for asynchronous data transfers and GPU computations (i.e., kernels) return control to the host (i.e., CPU) immediately, the CPU moves to the next chunk without waiting for the completion of the data transfer and the GPU computation for the current chunk. Consequently, it can overlap the chunk generation on the CPU with the other jobs for prior chunks. It repeats this process until all chunks are handled, and we thus obtain the 1D-FFT result.
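A small host-side sketch of the chunk-size rule in Eqs. (4)–(5); the function name, parameters, and the numeric example in the comment are our own illustrative assumptions.

#include <cstddef>

// Largest x such that k * x * lineBytes * numStreams fits in device memory
// (Eq. (4)) and totalLines / x is an even positive integer (Eq. (5)), so a
// chunk never crosses a quadrant boundary. k = 3 when cuFFT needs workspace
// in addition to the input chunk and the output result.
static std::size_t chunkSize(std::size_t deviceMemBytes, std::size_t lineBytes,
                             std::size_t numStreams, std::size_t totalLines,
                             std::size_t k = 3) {
    const std::size_t maxLines = deviceMemBytes / (k * lineBytes * numStreams);
    for (std::size_t x = maxLines; x >= 1; --x)
        if (totalLines % x == 0 && (totalLines / x) % 2 == 0)
            return x;
    return 1;  // degenerate fallback: one line per chunk
}

// Hypothetical example: 8 GB of device memory, 100K double-precision complex
// elements per line (1.6 MB), 4 streams, k = 3 -> maxLines = 416, and the
// largest admissible x is 400 (100,000 / 400 = 250, an even quotient).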


Fig. 2. An example of the timeline when we use four streams. It performs four processes, including $ShiftTrans$ (or transposition) on the CPU, data copying between the host and the device, and 1D-FFT on the GPU, concurrently. The color represents the stream that handles the chunk, and all processes for a chunk use the same stream.


5.1 Chunk generation

To overlap the chunk generation step on the CPU with data transfers and 1D-FFT on the GPU for other chunks, we need to perform the $ShiftTrans$ (or transposition) operation at the chunk level instead of doing it for the entire matrix at once. For matrix transposition (e.g., before the second 1D-FFT), we can create a chunk by transposing the columns of the original matrix from the first column sequentially. However, for the $ShiftTrans$ operation, the $n$-th column of the input matrix is not directly matched to the $n$-th row of the converted matrix. Figure 3 shows the relationship between a generated chunk and the related elements of the original matrix. When the row number of the converted matrix after the $ShiftTrans$ operation is $r$, the matched column number in the original matrix is $((w/2+r)~mod~w)$. Also, the upper (lower) part of the column maps to the right (left) part of the resulting row of the converted matrix.


Fig. 3. Changes of the element position according to the operation type. The bold boxes denote a block, showing which element in the original matrix maps to the element in the chunk generated by the $ShiftTrans$ operation.


A simple way to implement the $ShiftTrans$-based chunk generation is to treat the columns of the original matrix individually while copying each element to the correct position in the chunk. However, we found that this naive approach is inefficient from the perspective of cache utilization [34], and this step is one of the performance bottlenecks of our algorithm. In recent CPU architectures, the size of a cache line is tens of bytes (e.g., 64 bytes), and the size of the L1/L2 cache is tens or hundreds of KB (e.g., 32KB/256KB), meaning that the L1/L2 cache of the CPU can hold up to hundreds/thousands (e.g., 512/4,096) of cache lines. However, the length of a column we handle here exceeds tens of thousands of elements (e.g., 20K-100K). Because each element in a column resides in a different cache line and a column contains far more elements than the number of cache lines the cache can hold, numerous cache misses occur whenever the elements of a column are accessed sequentially.

To make the chunk generation algorithm cache-friendly, we take a block-level, row-first strategy instead of the column-by-column processing of the naive approach. We can consider a chunk as a set of multiple blocks, as indicated by the bold boxes in Fig. 3. Then, we handle the blocks one by one, performing the $ShiftTrans$ operations for the elements in a block in row-first order, as indicated by the white arrow in Fig. 3. This means that the proposed method processes a block of the original matrix in a row-by-row manner. It achieves high cache utilization efficiency (or a high hit ratio) because nearby elements in a row share the same cache line with high probability. Although this leads to a column-first order of memory accesses when writing the result to the generated chunk, the $chunkSize$ (e.g., 1K) is much smaller than the line length of the chunk (or the width of the matrix). Therefore, the CPU’s L2 cache can hold $chunkSize$ cache lines, achieving a cache utilization efficiency (i.e., hit ratio) for writing the result that is comparable to that of the naive approach. In addition to the L2 cache, the L1 cache can also achieve a high cache-hit ratio because we use multiple CPU cores (e.g., sixteen cores/threads) and distribute disjoint sets of rows in a block to them. Our cache-friendly chunk generation algorithm shows up to 4.53 times higher performance in our experiments compared to the naive approach. We also apply this block-level, row-first strategy to the transposition-based chunk generation before the second 1D-FFT, and a sketch of the $ShiftTrans$-based variant is given below.
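A condensed OpenMP sketch of the block-level, row-first $ShiftTrans$ chunk generation (our own simplification: parallelization is over blocks, and the names are illustrative); the chunk holds rows $[r_0, r_0 + chunkRows)$ of the $ShiftTrans$'ed matrix, and chunk row $r$ reads source column $((r + w/2)~mod~w)$ as in Fig. 3.

#include <complex>
#include <vector>
#include <cstddef>

using cpxf = std::complex<float>;

// Build one chunk of the ShiftTrans'ed matrix from the w x h row-major input.
// Reads sweep along rows of the original matrix (cache-line friendly), while
// writes touch at most B chunk rows at a time. Compile with -fopenmp.
static void makeShiftTransChunk(const std::vector<cpxf>& in,
                                std::vector<cpxf>& chunk,     // chunkRows x h
                                std::size_t w, std::size_t h,
                                std::size_t r0, std::size_t chunkRows,
                                std::size_t B = 8) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (std::size_t by = 0; by < h; by += B) {            // block of source rows
        for (std::size_t br = 0; br < chunkRows; br += B) { // block of chunk rows
            for (std::size_t y = by; y < by + B && y < h; ++y) {   // row-first
                const std::size_t c = (y + h / 2) % h;             // ShiftY(y)
                for (std::size_t r = br; r < br + B && r < chunkRows; ++r) {
                    const std::size_t x = (r0 + r + w / 2) % w;    // source column
                    chunk[r * h + c] = in[y * w + x];  // consecutive x: one source row
                }
            }
        }
    }
}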

5.2 Pinned-memory buffer

It is necessary to transfer data between the host and device memories asynchronously to overlap the transfers with the CPU and GPU computations in other streams. To perform an asynchronous data copy operation, it is also necessary to ensure that the memory region containing the target data stays in physical memory until the data copy is complete. Such a memory region, which is not swapped out to virtual memory, is referred to as page-locked memory or pinned-memory. Operating systems typically allow only a part of the physical memory to be pinned, and it is usually not possible to place the entire matrix in a pinned-memory region.

We resolve this issue by allocating a dedicated buffer as pinned-memory for each stream instead of setting the entire matrix region as pinned-memory. A stream first copies the current chunk to its dedicated pinned-memory buffer before the second step (i.e., $Copy(H \to D)$), after which it transfers the data in the buffer to the device. For $Copy(D \to H)$, it also moves the data into the buffer first and then copies it to the destination that stores the computation result for the current chunk. Another possible solution is to manage the pinned-memory region dynamically, such as by setting the memory region holding the target chunk as pinned-memory and releasing the region after completing the work for the chunk. However, we found that the overhead of dynamically setting/unsetting pinned-memory (e.g., cudaHostRegister() and cudaHostUnregister() API calls) is larger than the overhead of copying the chunk to the buffer. Therefore, the pinned-memory buffer approach improves our 2D-Shift-FFT performance by up to 20% compared with the dynamic pinned-memory strategy (Sec. 6.2). Although we need additional memory space for the pinned-memory buffers, the required space (i.e., $(\#~of~streams) \times ChunkSize$) is much smaller than the matrix size.
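The sketch below condenses one 1D-FFT pass of the pipeline in Fig. 2 with per-stream pinned buffers. It is an illustrative simplification rather than the paper's implementation: the chunk generation ($ShiftTrans$/transposition) and the result write-back are passed in as callbacks (makeChunk, storeChunk — names ours), and the 1D-FFT is executed in place by a batched cuFFT plan.

#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>
#include <cstddef>

void fft1dPass(const cufftComplex* matrix, cufftComplex* result,
               std::size_t lineLen, std::size_t totalLines,
               std::size_t chunkLines, int numStreams,
               void (*makeChunk)(const cufftComplex*, cufftComplex*, std::size_t),
               void (*storeChunk)(cufftComplex*, const cufftComplex*, std::size_t)) {
    const std::size_t chunkBytes = chunkLines * lineLen * sizeof(cufftComplex);

    std::vector<cudaStream_t> stream(numStreams);
    std::vector<cufftHandle> plan(numStreams);
    std::vector<cufftComplex*> hBuf(numStreams), dBuf(numStreams);
    std::vector<long long> owner(numStreams, -1);   // chunk id currently in a stream

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void**)&hBuf[s], chunkBytes);       // pinned staging buffer
        cudaMalloc((void**)&dBuf[s], chunkBytes);
        cufftPlan1d(&plan[s], (int)lineLen, CUFFT_C2C, (int)chunkLines); // batched 1D-FFT
        cufftSetStream(plan[s], stream[s]);
    }

    const std::size_t numChunks = totalLines / chunkLines;
    for (std::size_t c = 0; c < numChunks; ++c) {
        const int s = (int)(c % numStreams);
        if (owner[s] >= 0) {                     // buffer reuse: drain previous chunk
            cudaStreamSynchronize(stream[s]);
            storeChunk(result, hBuf[s], (std::size_t)owner[s]);   // Copy(H_p -> H_m)
        }
        makeChunk(matrix, hBuf[s], c);           // ShiftTrans/Trans., Copy(H_m -> H_p)
        cudaMemcpyAsync(dBuf[s], hBuf[s], chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);       // Copy(H -> D)
        cufftExecC2C(plan[s], dBuf[s], dBuf[s], CUFFT_FORWARD);   // 1D-FFT on the GPU
        cudaMemcpyAsync(hBuf[s], dBuf[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);       // Copy(D -> H)
        owner[s] = (long long)c;
    }
    for (int s = 0; s < numStreams; ++s) {       // drain the last chunk of each stream
        cudaStreamSynchronize(stream[s]);
        if (owner[s] >= 0) storeChunk(result, hBuf[s], (std::size_t)owner[s]);
        cufftDestroy(plan[s]); cudaFree(dBuf[s]); cudaFreeHost(hBuf[s]);
        cudaStreamDestroy(stream[s]);
    }
}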

6. Results and analysis

We implemented our out-of-core GPU 2D-Shift-FFT algorithm on three different computing systems, each with two octa-core CPUs and a GPU (Table 1). We used OpenMP [20] and CUDA 10.2 [35] to implement the parallel processing algorithms for the multi-core CPU and the GPU, respectively. For the transposition and ShiftTrans operations running on the CPU, we used sixteen threads. We generally used four GPU streams to overlap computations and data communication. The block size for the chunk generation process is $8\times 8$. We determined the chunk size based on Eq. (4) and Eq. (5). Also, we used cuFFT’s row-wise 1D-FFT module to perform the 1D-FFT on the GPU, and $k$ in Eq. (4) is 3 because cuFFT requires space for the input chunk, the output result, and workspace when undertaking the FFT task. To check the benefit of the combined shift-and-transposition operation (i.e., ShiftTrans), we implemented two versions of our algorithm: Ours (Fig. 1(c)) and Ours-ShiftTrans (Fig. 1(d)).


Table 1. Three different computing systems used in our experiments.

To compare the performance of our approach with that of prior work, we also implemented an alternative GPU algorithm, $Conv_{cuFFT}$, which is a conventional 2D-Shift-FFT implementation based on cuFFT [5]. It performs the FFT-shift on the multi-core CPUs with sixteen threads. For the 2D-FFT computation, as in Gu et al. [24], it performs column-wise 1D-FFT and then row-wise 1D-FFT using cuFFT (Fig. 1(b)) since the device memory cannot hold the entire matrix (or hologram plane). We applied the same chunk size as used in our method. The row-wise 1D-FFT step follows the same process used by our method while also using the same number of streams. However, it uses a single stream for the column-wise 1D-FFT because the column-wise chunk has a discontinuous memory layout and thus cannot be registered as a pinned-memory region, which is a mandatory condition for using multiple streams. It uses the cudaMemcpy2D() API to transfer the column-wise chunk to the device memory.

Another alternative is to employ the 2D-tile-style work decomposition method used in Jackin et al. [28,29], which was originally designed for a GPU cluster. It decomposes the matrix into $k \times k$ tiles such that a tile (i.e., sub-matrix) is smaller than the device memory. For each tile, it performs FFT $k \times k$ times to generate the full-size matrix. This process requires sending each tile from host to device memory and returning each FFT computation result from device to host memory. As a result, the total data transfer size of Jackin et al.’s approach is $(k^{2}+1) \times (matrix~size)$, where $k$ is determined by the device memory size and the matrix size. On the other hand, the data transfer size of $Conv_{cuFFT}$ and our method is $4 \times (matrix~size)$ regardless of the device memory size. For a large-scale matrix, we found that the data communication overhead of the tile approach by itself is much larger than the total processing time of our method. For example, $k$ on Machines 1, 2, and 3 is 5, 4, and 3, respectively, for a $100K^{2}$ double-precision complex matrix. This means that the data transfer time would be 245.96, 160.83, and 94.61 seconds, respectively, even if the transfers achieved the theoretical maximum bandwidth of PCIe 3.0x16 (15.75GB/s), which is hard to reach in an actual computing environment. In other words, the theoretical minimum data communication time of the tile decomposition method is already much larger than the total processing time of our method. Based on this theoretical analysis, we did not measure the actual processing time of the tile-based decomposition approach.
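For instance, the ratio of the transferred data volumes of the two approaches is

$$\frac{(k^{2}+1)\times(matrix~size)}{4\times(matrix~size)}=\frac{k^{2}+1}{4},$$

which is 6.5, 4.25, and 2.5 for $k=5$, $4$, and $3$, respectively.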

Benchmarks: To compare the performance of the different GPU-based 2D-Shift-FFT algorithms (Sec. 6.1), we generated a set of complex matrices in both single- and double-precision. We randomly generated the complex numbers, and the sizes of the matrices varied from 20K$^{2}$ to 100K$^{2}$. Table 2 presents information about the benchmarks used in the experiments. The last three columns show the chunk sizes used by our method. To check the benefit of our approach in an actual application, we also applied our method to the layer-based CGH process. We measured the performance when generating ultra-high-resolution holograms whose resolutions range from $20K^{2}$ to $100K^{2}$ (Sec. 6.3).


Table 2. The third column shows the data size (gigabytes) for different matrix sizes in the single- and double-precision cases. The last three columns show the chunk size (i.e., the number of lines in a chunk) for our method on each computing system.

6.1 Results

Table 3 and Fig. 4 show the processing times of three different 2D-Shift-FFT algorithms on three different computing systems for five different matrix sizes. The three GPU-based algorithms show much higher performance than FFTW, which is a CPU-based parallel algorithm. For example, cuFFT achieved up to XX, XX, and XX times (XX, XX, and XX times on average) higher performance on Machines 1, 2, and 3, respectively. This is because recent GPUs have much higher computing power than multi-core CPUs, and the FFT calculation is well suited to the GPU’s massively parallel computing architecture. Figure 4 compares the processing times of the three algorithms, and it shows that our methods generally achieve higher performance than $Conv_{cuFFT}$. For the single-precision case, Ours achieved up to 3.24, 2.85, and 2.52 times (2.77, 2.37, and 2.39 times on average) higher performance than $Conv_{cuFFT}$ on Machines 1, 2, and 3, respectively. Also, we found that the performance gap increases as the matrix size increases (e.g., 2.20 times for 20K$^{2}$ and 2.85 times for 100K$^{2}$ on Machine 2). For a small amount of data, the data communication overhead is relatively small compared to the large-scale data case, so the benefit of our out-of-core algorithm can be less significant. Nevertheless, Ours also shows 2.35 times higher performance on average than $Conv_{cuFFT}$ for the small matrix (i.e., 20K$^{2}$). This result confirms that our approach is useful for small complex matrices as well. Nonetheless, it is clear that our method works better for large-scale data, and it is appropriate for ultra-high-resolution hologram generation.


Fig. 4. Performance comparison of three different 2D-Shift-FFT algorithms for various sizes of matrices in the single- and double-precision cases.



Table 3. Processing times (seconds) of three different algorithms on three computing systems for various sizes of complex matrices in the single- and double-precision cases. M1, M2, and M3 in the first column mean Machine 1, 2, and 3, respectively.

For the double-precision case, Ours also achieved up to 3.06, 2.46, and 2.57 times (2.75, 2.42, and 2.49 times on average) higher performance than $Conv_{cuFFT}$ on Machines 1, 2, and 3, respectively. However, in contrast to the single-precision case, the matrix size does not affect the performance gap between $Conv_{cuFFT}$ and Ours, and Ours consistently shows about 2.55 times higher performance on average. This occurs because the data size of a double-precision complex matrix is twice that of the single-precision case; therefore, it is large enough to take advantage of our out-of-core algorithm even for a small matrix (i.e., 20K$^{2}$).

Ours-ShiftTrans further improves the performance of Ours by up to 13% and 12% (6% and 7% on average) for the single- and double-precision cases, respectively. In specific cases, however, Ours-ShiftTrans achieved slightly lower performance than Ours, as in the case of the 100K$^{2}$ double-precision matrix on Machine 1. We found that the benefit of our ShiftTrans strategy is reduced for a small chunk size (e.g., fewer than 500 lines) because it lowers the cache utilization efficiency. We also found that the transposition performance is not consistently proportional to the size of the data, and it shows distinctly high performance in some cases (e.g., 60K$^{2}$). Nonetheless, Ours-ShiftTrans outperformed Ours in most cases, as it makes the FFT-shift process overlap with the other steps, including the data transfers and computations on the GPU.

6.2 Performance analysis

To ascertain how the proposed approach achieves much higher performance than the naive GPU-based algorithm (i.e., $Conv_{cuFFT}$), we measured the processing time of each step of the 2D-Shift-FFT for $Conv_{cuFFT}$ and Ours-ShiftTrans. At a high level, Fig. 1(b) and Fig. 1(d) present the processing steps of $Conv_{cuFFT}$ and Ours-ShiftTrans, respectively. For a more in-depth analysis, we further divided the 1D-FFT steps into the host-to-device and device-to-host data copy steps and the 1D-FFT computation on the GPU (Fig. 2). For this profiling, we used 100K$^{2}$ complex matrices in both the single- and double-precision cases on Machine 3. Moreover, we ran the steps one by one synchronously to measure the workload of each step.

The stacked column charts in Fig. 5 show the total processing time of each step. Because the row-wise 1D-FFT and final FFT-shift stages are identical in both methods, their workloads and processing times are nearly identical as well. For the first FFT-shift & column-wise 1D-FFT stage, the workloads of the individual steps differ between the two methods, as Ours-ShiftTrans performs this stage via row-wise 1D-FFT with a column-to-row transposition. Ours-ShiftTrans also combines the FFT-shift and transposition steps, so the ShiftTrans step embeds the workloads of both. We found that the data transfer overhead for both H$\to$D (i.e., from host memory to device memory) and D$\to$H (i.e., from device memory to host memory) is reduced significantly with the proposed method compared to $Conv_{cuFFT}$. For the data copy time itself (e.g., cudaMemcpy()), our method shows a reduction of 72% over $Conv_{cuFFT}$ because it places the target chunk in a continuous memory space, which leads to an efficient data copy, in contrast to the discontinuous memory layout in $Conv_{cuFFT}$ (Sec. 4). For the 1D-FFT computation on the GPU, our method, which uses the row-wise 1D-FFT module, outperforms $Conv_{cuFFT}$, which uses the column-wise 1D-FFT module, by 14.1 and 3.2 times for the single- and double-precision cases, respectively. Although our method has overhead for the data transposition and for copying to the pinned-memory buffer, the benefits of reducing the workload of the data copy operations and the 1D-FFT computations on the GPU are much greater than this overhead. Consequently, for the first FFT-shift & column-wise 1D-FFT stage, Ours-ShiftTrans has about 60% less workload than $Conv_{cuFFT}$.


Fig. 5. Computation time of each process step of 2D-Shift-FFT for $Conv_{cuFFT}$ and Ours-ShiftTrans, where $Copy(H\to D)$ and $Copy(D\to H)$ denote the data transfers between host memory and device memory, and $Copy(H_m\to H_p)$ and $Copy(H_p\to H_m)$ denote the data copies between the matrix and the pinned-memory buffer. In the column-wise 1D-FFT stage, there is no additional data copy to/from the pinned-memory buffer since the $ShiftTrans$ and $Trans.$ steps include the data copy between the matrix and the pinned memory. For this analysis, we used Machine 3 and a 100K$^{2}$ complex matrix in the (a) single- and (b) double-precision cases.


As shown in Table 3, Ours-ShiftTrans took 28.4 seconds for the single-precision 100K$^{2}$ complex matrix on Machine 3, while the summation of the times for all steps was 50.9 seconds (Fig. 5(a)). This means that about 40% of the entire process is overlapped in Ours-ShiftTrans. On the other hand, $Conv_{cuFFT}$ took nearly the same time as the sum of all steps, as $Conv_{cuFFT}$ cannot employ multiple streams to transfer columns with a discontinuous memory layout to the device memory. For the double-precision case, we find a trend similar to that of the single-precision case.

Our method needs to copy the matrix twice, for H$\to$D and D$\to$H, and this requires about 19 seconds if it fully utilizes the bandwidth of PCIe 3.0x16 (15.75GB/s) for the single-precision 100K$^{2}$ complex matrix. If all steps are fully overlapped, the entire process, except for the final FFT-shift, converges to the data communication time because it is the most time-consuming task in our method. For the single-precision 100K$^{2}$ complex matrix, Ours-ShiftTrans took 25.9 seconds on Machine 3, excluding the last FFT-shift, meaning that our method utilizes 73% of the theoretical maximum bandwidth of PCIe 3.0x16. For the double-precision case, our method shows 82% PCIe bandwidth utilization. These results demonstrate that our approach efficiently utilizes the GPU’s computing power while exploiting the heterogeneous computing system’s concurrent execution ability.
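The quoted utilization follows directly from these two times:

$$\text{PCIe utilization} \approx \frac{t_{full~bandwidth}}{t_{measured}} = \frac{19~\mathrm{s}}{25.9~\mathrm{s}} \approx 0.73,$$

and the 82% figure for the double-precision case is obtained in the same way.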

Chunk size and the number of streams: The number of streams and the chunk size are inversely proportional (Eq. (4)). Although using more streams improves the concurrency, it can lower the GPU’s utilization efficiency because a larger chunk size is advantageous for exploiting the massive parallelism of the GPU. Table 4 shows the processing time of Ours-ShiftTrans for 100K$^{2}$ when using from one to eight streams, while the chunk size also changes depending on the number of streams. Because it serializes all processes, using a single stream shows the worst performance. When we use more than one stream, the process starts to take advantage of the heterogeneous computing system’s concurrent execution property. Using two streams shows up to 17% and 31% higher performance over the single-stream case for the single- and double-precision cases, respectively. When we used four streams, it achieved the best performance, or performance comparable to the best, in most cases. In terms of the execution medium, the tasks in a stream are divided into four types: CPU computations, host-to-device data transfers, GPU computations, and device-to-host data transfers, as shown in Fig. 2. These four categories match the heterogeneous computing system’s asynchronous execution capabilities, and our method can maximally exploit them with four streams. One might expect more concurrency when using more than four streams; however, the concurrency is limited to four ways because steps using the same execution medium (e.g., host-to-device transfers) cannot be performed simultaneously. Consequently, using eight streams generally shows worse performance than four streams because the stream management overhead increases with little benefit in terms of concurrency. These results also match the conclusions of prior works [36,37] showing that four-way concurrency usually yields the best efficiency in heterogeneous systems.


Table 4. Processing times (seconds) of Ours-ShiftTrans for 100K$^{2}$ complex matrix with different configurations for the number of streams and chunk sizes. The bold values indicate the best performance among the different numbers of streams.

Pinned-memory buffer: To check the benefit of our pinned-memory buffer optimization scheme, we also implemented two alternatives: a pageable (or non-pinned) memory method and a dynamic pinned-memory method (Fig. 6). The pageable-memory method, the most naive approach, does not set any space as pinned-memory. Because using pinned memory is an essential condition for overlapping data communication and computation with multiple streams, all processes are serialized in this method. Moreover, the data transfer speed (or bandwidth) for pageable-memory space is much slower than that of pinned-memory. As a result, it makes the total processing time of 2D-Shift-FFT up to 3.89 times longer. The second alternative is the dynamic pinned-memory method, in which only the memory region holding the current chunk is registered as pinned-memory before $Copy(H \to D)$ and unregistered when a stream completes the last step (i.e., $Copy(D \to H)$) for the chunk. Although this method does not require any additional space for the buffer, we found that the pinning/unpinning occurs frequently and incurs a high memory management overhead. As a result, our pinned-memory buffer approach leads to up to 36% (20% on average) higher performance for Ours-ShiftTrans than the dynamic pinned-memory strategy.


Fig. 6. These figures show Ours-ShiftTrans’s processing time on Machine 3 with (w/) and without (w/o) our optimization approaches, including the pinned-memory buffer and cache-friendly chunk generation algorithm.


Cache-friendly chunk generation algorithm: To clarify the benefit of our cache-friendly chunk generation algorithm (Sec. 5.1), we compared the chunk generation time with and without our algorithm (Fig. 6). Without the cache-friendly method, a chunk is processed in a column-by-column manner, in contrast to our algorithm, which handles a chunk at the block level. The cache-friendly algorithm improves the chunk generation performance by 5.88 and 3.18 times on average for the single- and double-precision cases, respectively. For the entire process, Ours-ShiftTrans with our cache-friendly chunk generation approach achieved 2.32 and 1.68 times higher performance on average for the single- and double-precision cases, respectively. We also conducted an in-depth analysis of the effect of the block size in our cache-friendly chunk generation algorithm. We tested various block sizes from $2\times 2$ to $1024\times 1024$ and compared the processing time of Ours-ShiftTrans. When the data size is small (e.g., 20K$^{2}$), a block size of $32\times 32$ or $64\times 64$ achieved the best performance. However, we found that the $8\times 8$ block generally shows the best performance, or performance comparable to the best case, most of the time.

6.3 Application to layer-based hologram generation

To verify the benefit of employing our 2D-Shift-FFT algorithm in a CGH process, we applied our method to layer-based hologram generation [38]. The layer-based CGH process consists of seven steps: reading/resizing the layer images, random phase, the angular spectrum method (ASM), accumulation, applying off-axis, phase encoding, and saving the resulting hologram [38,39]. Among these steps, the ASM and off-axis computations include 2D-Shift-FFT computations. The ASM computation consists of two 2D-Shift-FFTs and an element-wise multiplication (i.e., multiplying the transfer function), and we need to perform it as many times as the number of layers. Therefore, it is one of the most time-consuming steps of the CGH process. The off-axis computation also requires the 2D-Shift-FFT twice to remove the circularly shifted signal. We used our in-house code for layer-based hologram generation and applied Ours-ShiftTrans to the ASM and off-axis computations. We generated a 100K$^{2}$ hologram in single-precision with three layers at depths of 3.25cm, 5.75cm, and 8.25cm, respectively. Figure 7 shows the three layers used in the hologram generation process. The wavelength and pixel pitch are 660nm and 500nm, respectively, and Fig. 8 shows the sizes of the hologram and the objects in the three layers. We also applied an off-axis angle of $20^{\circ }$ in the y-direction. Figure 9 shows the optical reconstruction result. Strictly, zero-padding and band-limitation should be applied to satisfy the Nyquist sampling condition. However, we did not consider this condition in this experiment since the purpose of our experiment is to check the benefit of our method in an application from the perspective of computational performance.
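Schematically (in our own notation, with $\mathcal{T}$ denoting the 2D-Shift-FFT, $\mathcal{T}^{-1}$ its inverse counterpart, and $H$ the transfer function), each layer contributes

$$u_{hologram} \leftarrow u_{hologram} + \mathcal{T}^{-1}\big[\,H(f_x,f_y)\cdot\mathcal{T}[\,u_{layer}\,]\,\big],$$

so every layer costs two large 2D-Shift-FFTs plus one element-wise product, which is why accelerating the 2D-Shift-FFT dominates the achievable speed-up of the ASM step.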


Fig. 7. Images and distance from the hologram of the corresponding layers used for the layer-based hologram generation process



Fig. 8. The sizes of the hologram and objects in the three layers used for optical reconstruction



Fig. 9. Optical reconstruction results of the hologram generated by applying the proposed method.


We used Machine 3 to generate the hologram. To check the benefit of our method, we measured the processing time of each step of the layer-based CGH with $Conv_{cuFFT}$ and Ours-ShiftTrans while varying the resolution from $20K^{2}$ to $100K^{2}$. Table 5 shows the results. The processing times of the other steps, except the ASM and off-axis computations, are almost the same because we used the same code except for the 2D-Shift-FFT computation. Ours-ShiftTrans achieved up to 1.73 times and 1.47 times (1.62 times and 1.42 times on average) higher performance for the ASM and off-axis computations than $Conv_{cuFFT}$, respectively. Overall, Ours-ShiftTrans reduces the total hologram generation time by up to 28% (25% on average) compared with $Conv_{cuFFT}$. It should be noted that if the number of layers increases, the performance gap will become more significant. This result confirms the usefulness of the proposed algorithm in real applications.


Table 5. The processing times (seconds) of each step (from the third to the tenth column) of the hologram generation. The bold font denotes the steps that include 2D-Shift-FFT computations, to which we applied the two different 2D-Shift-FFT algorithms, $Conv_{cuFFT}$ and Ours-ShiftTrans.

7. Conclusion and future work

We presented a novel GPU-based algorithm for 2D-FFT with FFT-shift (i.e., 2D-Shift-FFT) that efficiently handles a large complex matrix (or hologram plane) for ultra-high-resolution hologram generation. To handle large amounts of data that the GPU memory cannot hold at once, we calculate the 2D-FFT result by performing 1D-FFTs in a column-wise and then row-wise manner. For the column-wise 1D-FFT computation, we transpose the target columns into rows to improve the data transfer efficiency and the computation on the GPU. We also combined the first FFT-shift and the transposition steps into one step. This combined shift-and-transpose method reduces the workload of those tasks, and it also enables this step to be processed concurrently with other steps (e.g., data copies and computations on the GPU). Our method exploits the concurrent execution ability of heterogeneous computing systems by using multiple streams. We enhanced the GPU’s utilization efficiency with our cache-friendly chunk generation algorithm and pinned-memory buffer approach. We implemented our method on three computing systems with different GPUs and tested the performance of the 2D-Shift-FFT computation for various matrix sizes up to 100K$^{2}$. We found that our method achieved up to 3.24 times higher performance compared to the conventional implementation ($Conv_{cuFFT}$) based on the state-of-the-art GPU-based FFT library (i.e., cuFFT). We also applied our method to the layer-based CGH process to generate holograms of up to 100K$^{2}$ resolution with three layers, finding that the proposed method reduced the processing time by up to 28% compared with $Conv_{cuFFT}$. These results demonstrate the efficiency of our algorithm and its usefulness for ultra-high-resolution hologram generation.

Future work: In this work, we presented an efficient algorithm that can handle a large-scale 2D-Shift-FFT problem with limited GPU memory. We designed our current method under the assumption that the system memory (i.e., CPU memory) is large enough to hold the problem. However, if the memory space is small and the data size exceeds the system memory size, our method relies on the virtual memory system, and the disk (e.g., SSD) I/O time becomes the performance bottleneck. As future work, we would like to solve this I/O bottleneck that arises when processing large-scale FFT problems on a machine with small system memory. Also, we hope to reduce the data communication overhead between the host and device memories, since it is still the most time-consuming task in our method. We would also like to extend our algorithm to exploit multiple GPUs rather than a single GPU.

Funding

Institute of Information and Communications Technology Planning and Evaluation (2019-0-00001, 2020-0-00594); Korea Government (MSIT); Korea University of Technology and Education (KOREATECH).

Acknowledgments

This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00001, Development of Holo-TV Core Technologies for Hologram Media Services (50%) and No. 2020-0-00594, Morphable Haptic Controller for Manipulating VR·AR Contents (50%)), and by the Education and Research Promotion Program of KOREATECH in 2021.

Disclosures

The authors declare no conflicts of interest.

References

1. D. Gabor, “A space-charge lens for the focusing of ion beams,” Nature 160(4055), 89–90 (1947).

2. C. Slinger, C. Cameron, and M. Stanley, “Computer-generated holography as a generic display technology,” Computer 38(8), 46–53 (2005).

3. M. Abdellah, “cufftShift: high performance CUDA-accelerated FFT-shift library,” in Proceedings of the High Performance Computing Symposium, (2014), pp. 1–8.

4. M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture for the FFT,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), vol. 3 (IEEE, 1998), pp. 1381–1384.

5. NVIDIA, “cuFFT library,” https://docs.nvidia.com/cuda/cufft/index.html.

6. S.-C. Kim and E.-S. Kim, “Effective generation of digital holograms of three-dimensional objects using a novel look-up table method,” Appl. Opt. 47(19), D55–D62 (2008).

7. Y. Lim, K. Hong, H. Kim, H.-E. Kim, E.-Y. Chang, S. Lee, T. Kim, J. Nam, H.-G. Choo, J. Kim, and J. Hahn, “360-degree tabletop electronic holographic display,” Opt. Express 24(22), 24999–25009 (2016).

8. K. Matsushima and S. Nakahara, “Extremely high-definition full-parallax computer-generated hologram created by the polygon-based method,” Appl. Opt. 48(34), H54–H63 (2009).

9. J.-H. Park, S.-B. Kim, H.-J. Yeom, H.-J. Kim, H. Zhang, B. Li, Y.-M. Ji, S.-H. Kim, and S.-B. Ko, “Continuous shading and its fast update in fully analytic triangular-mesh-based computer generated hologram,” Opt. Express 23(26), 33893–33901 (2015).

10. M. Askari, S.-B. Kim, K.-S. Shin, S.-B. Ko, S.-H. Kim, D.-Y. Park, Y.-G. Ju, and J.-H. Park, “Occlusion handling using angular spectrum convolution in fully analytical mesh based computer generated hologram,” Opt. Express 25(21), 25867–25878 (2017).

11. H. Zhang, L. Cao, and G. Jin, “Computer-generated hologram with occlusion effect using layer-based processing,” Appl. Opt. 56(13), F138–F143 (2017).

12. M. Park, B. G. Chae, H.-E. Kim, J. Hahn, H. Kim, C. H. Park, K. Moon, and J. Kim, “Digital Holographic Display System with Large Screen Based on Viewing Window Movement for 3D Video Service,” ETRI J. 36(2), 232–241 (2014).

13. M. Park, H.-E. Kim, H.-G. Choo, J. Kim, and C. H. Park, “Distortion Compensation of Reconstructed Hologram Image in Digital Holographic Display Based on Viewing Window,” ETRI J. 39(4), 480–492 (2017).

14. J.-H. Park, “Efficient calculation scheme for high pixel resolution non-hogel-based computer generated hologram from light field,” Opt. Express 28(5), 6663–6683 (2020).

15. K. Matsushima and T. Shimobaba, “Band-Limited Angular Spectrum Method for Numerical Simulation of Free-Space Propagation in Far and Near Fields,” Opt. Express 17(22), 19662–19673 (2009).

16. R. P. Muffoletto, J. M. Tyler, and J. E. Tohline, “Shifted Fresnel diffraction for computational holography,” Opt. Express 15(9), 5631–5640 (2007).

17. T. Shimobaba, J. Weng, T. Sakurai, N. Okada, T. Nishitsuji, N. Takada, A. Shiraki, N. Masuda, and T. Ito, “Computational wave optics library for C++: CWO++ library,” Comput. Phys. Commun. 183(5), 1124–1138 (2012).

18. K. Murano, T. Shimobaba, A. Sugiyama, N. Takada, T. Kakue, M. Oikawa, and T. Ito, “Fast computation of computer-generated hologram using Xeon Phi coprocessor,” Comput. Phys. Commun. 185(10), 2742–2757 (2014).

19. M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proc. IEEE 93(2), 216–231 (2005).

20. OpenMP, “OpenMP,” https://www.openmp.org/.

21. Intel, “Intel® Math Kernel Library,” https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html.

22. K. Moreland and E. Angel, “The FFT on a GPU,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, (2003), pp. 112–119.

23. Z. Zhao and Y. Zhao, “The optimization of FFT algorithm based with parallel computing on GPU,” in 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), (IEEE, 2018), pp. 2003–2007.

24. L. Gu, J. Siegel, and X. Li, “Using GPUs to compute large out-of-card FFTs,” in Proceedings of the international conference on Supercomputing, (2011), pp. 255–264.

25. Y. Ogata, T. Endo, N. Maruyama, and S. Matsuoka, “An efficient, model-based CPU-GPU heterogeneous FFT library,” in 2008 IEEE International Symposium on Parallel and Distributed Processing, (IEEE, 2008), pp. 1–10.

26. S. Chen and X. Li, “A hybrid GPU/CPU FFT library for large FFT problems,” in 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC), (IEEE, 2013), pp. 1–10.

27. J. W. Eaton, “GNU Octave,” https://www.gnu.org/software/octave/index (2021).

28. B. J. Jackin, H. Miyata, T. Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai, and T. Baba, “Distributed calculation method for large-pixel-number holograms by decomposition of object and hologram planes,” Opt. Lett. 39(24), 6867–6870 (2014).

29. B. J. Jackin, S. Watanabe, K. Ootsu, T. Ohkawa, T. Yokota, Y. Hayasaki, T. Yatagai, and T. Baba, “Decomposition method for fast computation of gigapixel-sized fresnel holograms on a graphics processing unit cluster,” Appl. Opt. 57(12), 3134–3145 (2018).

30. H. Farhoosh, Y. Fainman, and S. H. Lee, “Algorithm For Computation Of Large Size Fast Fourier Transforms In Computer-Generated Holograms By Interlaced Sampling,” Opt. Eng. 28(6), 622–628 (1989).

31. T. Shimobaba, A. Sugiyama, H. Akiyama, S. Hashikawa, T. Kakue, and T. Ito, “Large scale calculation for holography using SSD,” Photonics Lett. Pol. 6(3), 87–89 (2014).

32. D. Blinder and T. Shimobaba, “Efficient algorithms for the accurate propagation of extreme-resolution holograms,” Opt. Express 27(21), 29905–29915 (2019).

33. J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming (John Wiley & Sons, 2014), chap. 2, pp. 23–59.

34. S. Chatterjee and S. Sen, “Cache-efficient matrix transposition,” in Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550), (IEEE Comput. Soc, 2000), pp. 195–205.

35. NVIDIA, “CUDA Runtime API :: CUDA Toolkit Documentation,” https://docs.nvidia.com/cuda/cuda-runtime-api.

36. N. Sunitha, K. Raju, and N. N. Chiplunkar, “Performance improvement of CUDA applications by reducing CPU-GPU data transfer overhead,” in 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT), (IEEE, 2017), pp. 211–215.

37. H. Kang, H. C. Kwon, and D. Kim, “HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs,” Computing 102(12), 2607–2631 (2020).

38. Y. Zhao, L. Cao, H. Zhang, D. Kong, and G. Jin, “Accurate calculation of computer-generated holograms using angular-spectrum layer-oriented method,” Opt. Express 23(20), 25440–25449 (2015).

39. K. Matsushima, Introduction to Computer Holography (Springer International Publishing, 2020), chap. 8, pp. 153–184.

Figures (9)

Fig. 1. (a) Conventional process for 2D-Shift-FFT in CGH, (b) 2D-Shift-FFT process based on column- and row-wise 1D-FFTs, (c) adding data transposition steps for efficient computation of the column-wise 1D-FFT, and (d) integrating the first FFT-shift and data transposition steps into one step (Ours).
Fig. 2. An example of the timeline when we use four streams. It performs four processes, including $ShiftTrans$ (or transposition) on the CPU, data copying between the host and the device, and 1D-FFT on the GPU, concurrently. The color represents the stream that handles the chunk, and all processes for a chunk use the same stream. (A minimal code sketch of this per-chunk pipeline is given after the figure list.)
Fig. 3. Changes of the element position according to the operation type. The bold boxes denote a block, showing which element in the original matrix maps to the element in the chunk generated by the $ShiftTrans$ operation.
Fig. 4. Performance comparison of three different 2D-Shift-FFT algorithms for various sizes of matrices in the single- and double-precision cases.
Fig. 5. Computation time of each process step of 2D-Shift-FFT on $Conv_{cuFFT}$ and Ours-ShiftTrans, where $Copy(H\to D)$ and $Copy(D\to H)$ denote the data transfer between host memory and device memory, and $Copy(H_m\to H_p)$ and $Copy(H_p\to H_m)$ denote the data copy between the matrix and the pinned-memory buffer. In the column-wise 1D-FFT stage, there is no additional data copy to/from the pinned-memory buffer since the $ShiftTrans$ and $Trans.$ steps include the data copy between the matrix and the pinned memory. For this analysis, we used Machine 3 and a 100K$^{2}$ complex matrix in the (a) single- and (b) double-precision cases.
Fig. 6. These figures show Ours-ShiftTrans’s processing time on Machine 3 with (w/) and without (w/o) our optimization approaches, including the pinned-memory buffer and the cache-friendly chunk generation algorithm.
Fig. 7. Images and distances from the hologram of the corresponding layers used for the layer-based hologram generation process.
Fig. 8. The sizes of the hologram and the objects in the three layers used for optical reconstruction.
Fig. 9. Optical reconstruction results of the hologram generated by applying the proposed method.
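
The following is a minimal sketch of the per-chunk, multi-stream pipeline depicted in Fig. 2, written with standard CUDA runtime and cuFFT calls. It is our illustrative reconstruction under stated assumptions, not the authors' implementation: error checking is omitted, the function and variable names (e.g., columnWiseStage, chunkLines) are hypothetical, the matrix width is assumed to be a multiple of the chunk size, and shiftTransChunk is the hypothetical helper sketched after the equation list below.

```
// Minimal sketch (not the authors' code) of the four-process, per-chunk pipeline in
// Fig. 2: ShiftTrans on the CPU, asynchronous H->D copy, batched 1D-FFT on the GPU,
// and asynchronous D->H copy, with each chunk bound to one CUDA stream.
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: fills a contiguous chunk with 'lines' rows, starting at row
// r0, of the FFT-shifted and transposed matrix (sketched after the equations below).
void shiftTransChunk(const cuFloatComplex* src, cuFloatComplex* chunk,
                     size_t w, size_t h, size_t r0, size_t lines);

// Column-wise 1D-FFT stage: reads the h-by-w matrix 'src' and writes the
// transformed, shifted, transposed result into the w-by-h matrix 'dst'.
void columnWiseStage(const cuFloatComplex* src, cuFloatComplex* dst,
                     size_t w, size_t h, size_t chunkLines, int numStreams)
{
    const size_t chunkBytes = chunkLines * h * sizeof(cuFloatComplex);
    const size_t numChunks  = w / chunkLines;   // assumes w is a multiple of chunkLines

    std::vector<cudaStream_t>    streams(numStreams);
    std::vector<cufftHandle>     plans(numStreams);
    std::vector<cuFloatComplex*> hPinned(numStreams), dBuf(numStreams);
    for (int s = 0; s < numStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost(reinterpret_cast<void**>(&hPinned[s]), chunkBytes); // pinned buffer
        cudaMalloc(reinterpret_cast<void**>(&dBuf[s]), chunkBytes);        // device buffer
        cufftPlan1d(&plans[s], static_cast<int>(h), CUFFT_C2C,
                    static_cast<int>(chunkLines));                         // batched 1D-FFT
        cufftSetStream(plans[s], streams[s]);
    }

    for (size_t c = 0; c < numChunks; ++c) {
        const int s = static_cast<int>(c % numStreams);
        // Before reusing this stream's buffers, drain the chunk it processed last.
        cudaStreamSynchronize(streams[s]);
        if (c >= static_cast<size_t>(numStreams))
            std::memcpy(dst + (c - numStreams) * chunkLines * h, hPinned[s], chunkBytes);

        shiftTransChunk(src, hPinned[s], w, h, c * chunkLines, chunkLines);  // CPU
        cudaMemcpyAsync(dBuf[s], hPinned[s], chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);                 // H -> D
        cufftExecC2C(plans[s], dBuf[s], dBuf[s], CUFFT_FORWARD);             // GPU 1D-FFTs
        cudaMemcpyAsync(hPinned[s], dBuf[s], chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);                 // D -> H
    }

    // Drain the tail chunks that are still in flight, then release resources.
    const size_t tail = numChunks > static_cast<size_t>(numStreams)
                        ? numChunks - numStreams : 0;
    for (size_t c = tail; c < numChunks; ++c) {
        const int s = static_cast<int>(c % numStreams);
        cudaStreamSynchronize(streams[s]);
        std::memcpy(dst + c * chunkLines * h, hPinned[s], chunkBytes);
    }
    for (int s = 0; s < numStreams; ++s) {
        cufftDestroy(plans[s]);
        cudaFree(dBuf[s]);
        cudaFreeHost(hPinned[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```

Because each chunk's CPU work, host-device transfers, and FFT execution are issued on its own stream, chunks assigned to different streams can overlap, which is the concurrency the timeline in Fig. 2 illustrates.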

Tables (5)

Table 1. Three different computing systems used in our experiments.

Table 2. The third column shows the data size (gigabytes) for different matrix sizes in the single- and double-precision cases. The last three columns show the chunk size (i.e., the number of lines in a chunk) for our method on each computing system.

Table 3. Processing times (seconds) of three different algorithms on three computing systems for various sizes of complex matrices in the single- and double-precision cases. M1, M2, and M3 in the first column mean Machine 1, 2, and 3, respectively.

Table 4. Processing times (seconds) of Ours-ShiftTrans for a 100K$^{2}$ complex matrix with different configurations for the number of streams and chunk sizes. The bold values indicate the best performance among the different numbers of streams.

Table 5. The processing times (seconds) of each step (from the third to the tenth columns) for the hologram generation. The bold font denotes the steps that include the 2D-Shift-FFT computation, and we applied two different 2D-Shift-FFT algorithms: $Conv_{cuFFT}$ and Ours-ShiftTrans.

Equations (5)

$$ShiftX(x) = (x + w/2) \bmod w, \qquad ShiftY(y) = (y + h/2) \bmod h.$$
$$Trans(x, y) = (y, x).$$
$$ShiftTrans(x, y) = (ShiftY(y), ShiftX(x)).$$
$$x \le \frac{\text{(device memory size)}}{k \times \text{(data size of a line)} \times \text{(the number of streams)}}.$$
$$\frac{\text{(\# of total lines)}}{x} \in 2\mathbb{Z}^{+}.$$
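
To make these mappings concrete, the following is a minimal host-side sketch of the fused FFT-shift-and-transpose ($ShiftTrans$) chunk generation and of the chunk-size bound above. It is our illustration rather than the authors' implementation: the names shiftTransChunk and maxChunkLines are hypothetical, even matrix dimensions and row-major single-precision storage are assumed, and the cache-blocking of the cache-friendly chunk generation algorithm is omitted; k is the constant from the inequality above.

```
// Minimal sketch (not the authors' code) of the ShiftTrans mapping and the
// chunk-size bound. Assumes an h-by-w row-major matrix with even w and h.
#include <cuComplex.h>   // cuFloatComplex (any complex type would do)
#include <cstddef>

// Circular half-shifts along each axis (ShiftX / ShiftY above).
static inline size_t shiftX(size_t x, size_t w) { return (x + w / 2) % w; }
static inline size_t shiftY(size_t y, size_t h) { return (y + h / 2) % h; }

// Fill one contiguous chunk with rows [r0, r0 + lines) of the ShiftTrans'ed plane,
// i.e., the source element at (x, y) lands at (ShiftY(y), ShiftX(x)). Because
// ShiftX and ShiftY are involutions for even w and h, the gather form below is
// equivalent to scattering with the equation above.
void shiftTransChunk(const cuFloatComplex* src, cuFloatComplex* chunk,
                     size_t w, size_t h, size_t r0, size_t lines)
{
    for (size_t r = r0; r < r0 + lines; ++r) {         // rows of the transposed plane
        const size_t x = shiftX(r, w);                 // source column feeding row r
        for (size_t c = 0; c < h; ++c) {               // columns of the transposed plane
            const size_t y = shiftY(c, h);             // source row feeding column c
            chunk[(r - r0) * h + c] = src[y * w + x];  // one contiguous line of length h
        }
    }
}

// Chunk-size bound: the largest number of lines x per chunk such that the k working
// buffers per stream fit in device memory, adjusted so that the number of chunks,
// (total lines)/x, is a positive even integer (last two equations).
size_t maxChunkLines(size_t deviceMemBytes, size_t lineBytes,
                     size_t numStreams, size_t k, size_t totalLines)
{
    size_t x = deviceMemBytes / (k * lineBytes * numStreams);
    while (x > 1 && (totalLines % x != 0 || (totalLines / x) % 2 != 0))
        --x;
    return x;
}
```

The gather form writes each chunk line contiguously, which is what allows the column-wise stage to run plain 1D-FFTs on memory-contiguous data.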