
Fast blood flow visualization of high-resolution laser speckle imaging data using graphics processing unit

Open Access

Abstract

Laser speckle contrast analysis (LASCA) is a non-invasive, full-field optical technique that produces a two-dimensional map of blood flow in biological tissue by analyzing speckle images captured by a CCD camera. Due to the heavy computation required for speckle contrast analysis, video-frame-rate visualization of blood flow, which is essential for medical use, is hardly achievable for high-resolution image data using the CPU (central processing unit) of an ordinary PC (personal computer). In this paper, we introduce the GPU (graphics processing unit) into our data processing framework for laser speckle contrast imaging to achieve fast, high-resolution blood flow visualization on PCs by exploiting the high floating-point processing power of commodity graphics hardware. By using the GPU, a 12-60 fold performance enhancement is obtained in comparison with optimized CPU implementations.

©2008 Optical Society of America

1. Introduction

Laser speckle contrast analysis (LASCA) is a non-scanning, non-invasive technique that produces a 2-D map of blood flow by analyzing speckle images captured by a CCD camera. In most previous work for basic science research, the data processing for speckle contrast analysis is done offline because of the high rate of image recording. In clinical applications, however, such as monitoring the blood flow of brain tissue during neurosurgery, real-time data processing is vital. Due to the heavy computation required for speckle contrast analysis, the CPU processing power of an ordinary PC is insufficient for this task at high resolution. Recently, Moor Instruments Ltd (UK) released the first commercial LASCA system, the moorFLPI, a full-field laser perfusion imaging system. With the moorFLPI, on-line video-frame-rate visualization of blood flow is achieved only for low-spatial-resolution (113×152 pixel array) and low-precision (8-bit) data. Such a small pixel array does not match the spatial resolution provided by most objectives of optical imaging systems, and it may be insufficient for some potential applications of the laser speckle imaging method, such as microsurgery, where one often needs to image an area from several millimeters to several centimeters in diameter. For example, the moorFLPI images areas from 0.5 cm×0.7 cm to 8 cm×12 cm. If only a 113×152 pixel blood flow image can be obtained, the spatial resolution of the blood flow map ranges from about 45 µm/pixel to 700 µm/pixel. This is lower than the resolution of most surgical microscopes and can hardly meet the needs of microsurgery; moreover, the digital video imaging systems commonly used during surgery now provide a pixel resolution of about 1024×1024. A new computation scheme is therefore needed to accelerate the data processing in order to realize video-rate blood flow visualization of high-resolution laser speckle imaging data.

In this study, by introducing the programmable PC graphics processor into the data processing framework of laser speckle contrast imaging, a 12-60 fold performance enhancement is obtained in comparison with optimized CPU implementations. This GPU-assisted approach to accelerating the data processing makes real-time blood flow visualization possible for most available high-resolution CCD cameras. The graphics processing unit (GPU) is specialized hardware dedicated to graphical tasks and can be found in most recent PCs and game consoles. In recent years, the rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains [1]. At the time of this writing, GPUs have been applied to a diverse set of tasks such as image processing [2], general numerical methods [3], physical simulation [4, 5], computational chemistry [6], biomedical imaging [7], and optical measurement systems [8], giving these applications a dramatic speed-up. This use of the GPU in general applications is where the concept of the general-purpose GPU (GPGPU) comes from. Compared with the CPU, the GPU provides much more floating-point processing power. For example, an NVIDIA GeForce 8800 GTX processor delivers 367 GFLOPS (giga floating-point operations per second), more than ten times the processing power of an Intel Core 2 Duo 3.0 GHz processor (32 GFLOPS).

2. Methods

2.1 GPU Stream Programming model

The CPU is optimized for executing sequential code. The memory latency between the CPU and main memory makes it increasingly difficult to use additional transistors to improve computation performance [9]. In contrast to the CPU's serial execution model, the GPU employs a highly parallel execution model, often referred to as the stream programming model in GPU-based general-purpose applications. In this model, memory latency can be hidden by massively parallel execution. Since memory latency will continue to improve more slowly than memory bandwidth [9], optimizing for bandwidth rather than latency is a better design for compute-intensive and data-intensive tasks. As long as the bandwidth grows, the GPU can add more processing units to handle the extra input stream and improve overall performance.

The stream programming model is based on kernels and streams. A kernel is a function or operation that is performed on a stream, and a single stream may contain hundreds or even thousands of data elements of the same type. Usually the kernel operates on all elements in parallel instead of processing one element at a time as most CPUs do. To compensate for this lack of parallelism, recent CPUs provide vector or SIMD (Single Instruction, Multiple Data) instruction sets, such as Intel's SSE (Streaming SIMD Extensions), to allow data-parallel execution, but the parallelism exploited by the CPU is much less than that of the GPU.

2.2 GPU Programming interface

In graphics applications the GPU is accessed through 3-D APIs (application programming interfaces) such as OpenGL and Direct3D. These APIs, however, are designed for graphics tasks, and employing them in non-graphical applications requires thinking in graphics metaphors. For people without extensive graphics programming experience, GPU programming can be rather daunting and burdensome. These limitations make an abstraction level higher than the 3-D APIs necessary. Buck et al. [10] introduced Brook for GPUs, a GPU-targeted version of the Brook programming language based on extensions to C, to make general-purpose GPU programming more accessible. The two major graphics processor vendors (NVIDIA and ATI) have also made similar attempts. In this paper, our work is based on the CUDA (Compute Unified Device Architecture) GPU programming toolkit developed by NVIDIA.


Fig. 1. The overall processing framework of laser speckle contrast analysis based on the GPU. The left part of the figure gives a general description of how the CPU and GPU are related, while the right part illustrates the tasks that the kernel handles as well as the possible implementation choices in the two steps of LASCA processing. In actual work we implemented three different kernels (there are four combinations in total).


2.3 GPU implementation

Figure 1 illustrates the overall implementation of the GPU-based data processing framework for laser speckle contrast analysis. In this work, as in most other GPGPU applications, the GPU operates as a coprocessor to the CPU; in other words, the data-parallel, compute-intensive portions of the application running on the host are off-loaded onto the GPU [11]. The raw data (12-bit in our system) grabbed from the CCD camera (PixelFly VGA, PCO Computer, Germany) are transferred into main memory and then uploaded into the GPU's video memory as a CUDA array (data maintained and managed by the CUDA runtime). The main program, which resides in the CPU domain, controls this process. The array is then bound to a texture to improve performance, so that the hardware functionality for accelerating texture fetches can be utilized. The main program also requests the GPU to allocate another region of video memory to store the calculation result. When the initialization has finished, the GPU kernel is launched and the computation starts.

In the laser speckle contrast analysis (LASCA) method [12, 13], the blurring of the speckles is quantified by the local spatial speckle contrast (C), defined as the ratio of the standard deviation to the mean intensity in a small region of the image:

C = σs/Ī,   (1)

where C is the spatial speckle contrast, Ī is the mean intensity in the calculation window, and σs is the standard deviation. In practice, a spatial sliding window of 5×5 or 7×7 pixels moved through the whole image is often adopted (a 5×5 sliding window is shown in Fig. 2; the current pixel is at the center of the sliding window, shaded orange in Fig. 2).

C² = (exp(−2x) − 1 + 2x)/(2x²),   x = T/τc.   (2)

After obtaining the speckle contrast, T/τc, which is proportional to the mean velocity of the blood flow, can be found by solving Eq. (2) [14] with the Newton iteration numerical method. For faster processing, a simplified approximation (T/τc = 1/C²) is also feasible, as investigated by Cheng et al. [15] and Ramirez-San-Juan et al. [16].

Equations (1) and (2) are applied sequentially to each element of the input data (stream). As shown in Fig. 1, these two operations are illustrated by two square blocks in the kernel, and each block offers two alternative paths. For example, we can apply a 5×5 sliding window (the alternative is a 7×7 window) to calculate the spatial speckle contrast and then apply the approximation of Eq. (2) to acquire the final result. In actual work, only three different kernels have been implemented, though there are four combinations in total (see Table 1 for details of the three kernels).

The most straightforward implementation of this algorithm lets one thread be responsible for calculating one pixel: the thread loads all the data required for analyzing the current pixel and then writes the result into the output image. Owing to the limitations of the traditional GPU memory access pattern, data cannot be shared among different threads. Even when adjacent threads need the same pixel values, each thread has to load the same data from video memory multiple times (with a 5×5 window each thread accesses 24 neighboring pixel values; see the pixels around the orange one in Fig. 2), wasting memory bandwidth and hurting overall performance. Unlike the traditional GPU method, CUDA exposes a hardware feature of GeForce 8 series GPUs called shared memory, a fast register-level internal storage shared within one thread block that exploits data locality. With shared memory, each thread needs to load only one pixel value into shared memory while gaining extremely fast access to the pixel values loaded by other threads. This internal storage dramatically reduces the memory bandwidth requirement and greatly improves performance. For comparison, we implemented two versions of the GPU-based laser speckle contrast analysis program in CUDA, with and without shared memory (see Fig. 3(b) and Table 1).
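The bandwidth saving can be estimated with some simple accounting: the naive kernel reads window² pixels from global memory per output pixel, while the tiled shared-memory kernel reads each tile pixel (plus its apron) only once. The helpers below are illustrative arithmetic only, with tile and window sizes taken from the text.

```c
/* Global-memory loads per output pixel for the naive one-thread-per-
 * pixel kernel: the whole window is fetched for every pixel. */
double loads_per_pixel_naive(int window) {
    return (double)window * window;
}

/* Amortized loads per output pixel for the tiled shared-memory kernel:
 * a (tile + 2*apron)^2 region is loaded once and reused by the whole
 * tile. Illustrative accounting, not a performance model. */
double loads_per_pixel_tiled(int tile, int window)
{
    int apron = window / 2;
    double loaded = (double)(tile + 2 * apron) * (tile + 2 * apron);
    return loaded / ((double)tile * tile);
}
```

For the 16×16 tile and 5×5 window of Fig. 2, this is 25 loads per pixel for the naive kernel versus about 1.56 for the tiled one, a 16-fold reduction in global-memory traffic.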


Fig. 2. Implementation details of the GPU-based laser speckle image processing. The input speckle image is divided into image tiles, and each tile is handled by a thread block of slightly larger size. The extra threads are responsible for loading into shared memory the pixels of neighboring tiles that are required for calculating the boundary pixels of the current tile.


The implementation details of the shared memory version are illustrated in Fig. 2, in which each thread is drawn as a small grid cell. Because both the maximum number of threads and the amount of shared memory in each stream processor are limited (a GeForce 8 GPU chip contains several stream processors), the input image has to be split into smaller pieces (image tiles, drawn as large grid cells in Fig. 2) that are loaded separately to fit into each stream processor. The split of the input image, however, brings a new problem: calculating the spatial contrast of the boundary pixels of a tile requires data that are located in other tiles and have not been loaded into shared memory. Therefore, "apron" threads are introduced (the small cells at the boundary of the thread block, shaded pink in Fig. 2), which are responsible only for loading these pixel data into shared memory. Because of these additional apron threads, threads and pixels no longer map one to one; in Fig. 2 the thread block size is 20×20, which equals the image tile size (16×16) plus the additional apron threads.
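The relationship between tile size, window size, and thread block size described above reduces to a small piece of arithmetic; the helper below (an illustration, not the paper's code) reproduces the 20×20 block for a 16×16 tile and a 5×5 window.

```c
/* Thread block edge length needed to cover one image tile plus an
 * apron of half-window pixels on each side, as in Fig. 2.
 * Illustrative helper only. */
int block_dim_for_tile(int tile_size, int window_size)
{
    int apron = window_size / 2;      /* 2 for 5x5, 3 for 7x7 */
    return tile_size + 2 * apron;     /* e.g. 16 + 2*2 = 20 */
}
```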

3. Results

The following results were obtained on a PC equipped with an Intel Core 2 Duo 6300 (1.80 GHz) processor, 2 GB of main memory, and one GeForce 8600 GT graphics card. All test code was written in C with the CUDA extensions and compiled with the Microsoft Visual Studio 2005 C++ compiler. To make the comparison more persuasive, the compiler's built-in SSE optimization was enabled [see Fig. 3(a)]. Another important detail is that the CPU version uses double precision; since the GPU supports only single precision, an analysis of the validity of the GPU result is also performed.

Video frame rate requires each frame to be processed in less than 30 ms. As shown in Fig. 3(a), the CPU's processing power is sufficient only for low-resolution images, but with the GPU-based approach video frame rate can be reached even at 1920×1440. The introduction of the GPU makes the speed of on-line laser speckle contrast imaging mainly camera limited, since the CCD cameras commonly used in laser speckle imaging cannot achieve such a high frame rate at high resolution.


Fig. 3. (a) Comparison between two CPU codes with different compiler options and the GPU-based version. (b) Performance difference between the shared-memory version of the GPU-based data processing and the non-shared-memory (traditional) one. The video-frame-rate threshold is shown in both figures. Note: all plotted data in both panels are from columns 1, 2, and 4 of Table 1 [except the non-SSE CPU data in Fig. 3(a)].



Table 1. Time consumption comparison between GPU and CPU for processing each frame

Table 1 provides a more detailed comparison of all experimental results. For the first two kernels, which use the 1/C² approximation instead of solving Eq. (2), the speed-up factor increases as the image resolution grows; this trend makes the GPU even more suitable for processing high-resolution laser speckle images. Another interesting observation is that the Newton iteration implementation (the third category in Table 1, labeled T/τc) runs much faster on the GPU than its CPU counterpart. This is mainly due to the GPU's fast exponential instruction, which costs only a few clock cycles (32 on current GPUs) to execute, whereas the same operation on the CPU may take hundreds of cycles.

One of the common limitations that general-purpose GPU applications encounter is computation precision: most GPUs currently on the market do not support double-precision floating-point arithmetic. To validate our GPU implementation for laser speckle contrast imaging, Figs. 4(a) and 4(b) present the blood flow images obtained by the CPU and GPU methods, respectively. The results of the two methods are almost identical to the naked eye. A pixel-wise comparison between the GPU's single-precision result and the CPU's double-precision result was also made. Figure 4(c) shows the relative difference between the two results, defined as Imagedifference = abs(ImageGPU − ImageCPU)/ImageCPU, where abs(X) denotes the absolute value of X and Image<type> denotes the content of the images. From Fig. 4(c), the maximum relative difference over all pixels between the blood flow maps obtained by the GPU and by the CPU is about 2×10-7, which strongly suggests that the computational bias is small enough to be considered zero.


Fig. 4. Comparison between the blood flow image obtained by the CPU implementation (a) and that obtained by the GPU method (b). Both results use the same raw laser speckle image data recorded from rat cortex, processed by the T/τc method with a 5×5 window. (c) The relative difference image of (a) and (b).


As mentioned earlier, the raw laser speckle image data recorded from the CCD camera in our experimental setup are 12-bit, while the moorFLPI uses lower-precision 8-bit input data. Here the impact of the bit depth of the raw data on the quality of the blood flow image and on the processing time is discussed. Since our camera does not generate 8-bit data, we converted the 12-bit data into 8-bit data for the comparison. The conversion can be described by the following pseudo code: unsigned 8-bit integer {[Data12-bit/Max(Data12-bit)] * 255}. Figure 5 compares the blood flow images obtained from 8-bit and 12-bit input data. As can be seen from Figs. 5(a) and 5(b), there is no noticeable difference between the two images by eye. However, the relative difference image (Imagedifference = abs(Image12bit − Image8bit)/Image12bit, Fig. 5(c)) shows a difference of about 3%, bigger than the bias between the GPU (single-precision) and CPU (double-precision) implementations. This difference is probably due to the larger digitization noise in the 8-bit data. It should also be noted that the bit depth of the raw data has essentially no impact on the data processing time for either the GPU or the CPU implementation. Before any computation is applied, the raw 8-bit or 12-bit integer data are first converted to 32-bit single-precision floating-point numbers (or 64-bit double precision on the CPU), and all subsequent operations are 32-bit (or 64-bit) floating-point arithmetic. The processing speed is therefore the same whether the original data are 8-bit or 12-bit integers, although the data transfer time from the CCD sensor to the computer may differ.
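The conversion pseudo code above can be realized, for example, as the following C routine; whether the scaled value is truncated or rounded is not specified in the text, and we truncate here.

```c
#include <stdint.h>

/* Convert n samples of 12-bit data (stored in 16-bit words) to 8-bit
 * by scaling against the frame maximum:
 * out = trunc((in / max(in)) * 255). Illustrative sketch; the paper's
 * pseudo code does not specify truncation vs rounding. */
void convert_12_to_8(const uint16_t *in, uint8_t *out, int n)
{
    uint16_t max_v = 1;               /* guard against division by zero */
    for (int i = 0; i < n; i++)
        if (in[i] > max_v) max_v = in[i];
    for (int i = 0; i < n; i++)
        out[i] = (uint8_t)(((double)in[i] / max_v) * 255.0);
}
```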


Fig. 5. Comparison between the blood flow image obtained from 8-bit raw data (a) and that obtained from 12-bit raw data (b). Both results use the Newton iteration T/τc method with a 5×5 window. (c) The relative difference image of (a) and (b).


Another result worth discussing is the difference between the blood flow images obtained from the same raw data with the approximate 1/C² method and with the T/τc method. The two images are shown together in Figs. 6(a) and 6(b), and their relative difference image is given in Fig. 6(c) (relative difference defined as Imagedifference = abs(ImageT/τc − Image1/C²)/ImageT/τc). As shown in Fig. 6, even though the difference is hard to resolve by eye, a difference of around 1% does exist for our experimental data. This suggests that the 1/C² method can be adopted, for the fastest data processing speed, in applications that do not need high-precision quantitative measurements.


Fig. 6. Comparison between the blood flow images obtained by the Newton iteration T/τc method (a) and by the 1/C² approximation method (b). Both results use the same 12-bit raw laser speckle image data and a 5×5 window. (c) The relative difference image of (a) and (b).


Figure 7 gives a direct comparison of the image quality of blood flow maps that can be obtained by the GPU and CPU methods within a fixed data processing time. Both images in Fig. 7 were obtained with a processing time of less than 30 ms, corresponding to video rate. The high-resolution blood flow map obtained by the GPU method (Fig. 7(a), 640×480 pixels) shows better image quality than the low-resolution map obtained by the CPU method (Fig. 7(b), 97×127 pixels). In fact, the GPU takes only about 8 ms to produce Fig. 7(a), while the CPU takes about 30 ms to produce Fig. 7(b). Since our current CCD camera provides only 640×480 pixels, the GPU method should give even better image quality when a camera with a larger pixel array is used.


Fig. 7. Comparison between the high-resolution blood flow image obtained by the GPU method (a) and the low-resolution blood flow image obtained by the CPU method (b), both with video-rate visualization.


To help others implement this technique, we will soon release a simple GPU-based console program for the data processing of laser speckle contrast imaging on the website of the Britton Chance Center for Biomedical Photonics (http://bmp.hust.edu.cn). The program reads data stored on the hard disk of a personal computer in several camera-specific formats and outputs the calculated blood flow images.

4. Conclusion

This paper presents a GPU-based data processing framework for laser speckle contrast imaging that enables fast blood flow visualization of high-resolution data, which cannot be achieved with the CPU of a PC system. The results also indicate that the single-precision floating-point arithmetic available on the GPU is sufficient for laser speckle contrast analysis. In addition, the shared memory mechanism of the new generation of graphics hardware has been exploited, further improving the performance of our GPU-based laser speckle contrast imaging system.

Acknowledgments

This work is supported by the National High Technology Research and Development Program of China (Grant No. 2007AA02Z303), the National Natural Science Foundation of China (Grant No. 30500115), the NSFC-RFBR International Joint Research Project (Grant No. 30711120171), and the Program for Changjiang Scholars and Innovative Research Team in University. The authors thank Dr. Weihua Luo for providing the raw laser speckle imaging data.

References and links

1. J. Owens, “A Survey of General-Purpose Computation on Graphics Hardware,” Comput. Graph. Forum 26, 80–113 (2007). [CrossRef]  

2. A. Lefohn, J. Kniss, C. Hansen, and R. Whitaker, “A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets,” IEEE T. Vis. Comput. Gr. 10, 422–433 (2004). [CrossRef]  

3. J. Krüger and R. Westermann, “Linear algebra operators for GPU implementation of numerical algorithms,” ACM T. Graphic 22, 908–916 (2003). [CrossRef]  

4. J. Bolz, I. Farmer, E. Grinspun, and P. Schrooder, “Sparse matrix solvers on the GPU: conjugate gradients and multi-grid,” ACM. T. Graphic 22, 917–924 (2003). [CrossRef]  

5. S. F. Portegies Zwart, R. G. Belleman, and P. Geldof, “High-performance direct gravitational N-body simulations on graphics processing units,” New Astron. 12, 641–650 (2007). [CrossRef]  

6. J. Stone, J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, and K. Schulten, “Accelerating molecular modeling applications with graphics processors,” J Comput. Chem. 28, 2618–2640 (2007). [CrossRef]   [PubMed]  

7. J. S. Kole and K. J. Beekman, “Evaluation of accelerated iterative X-ray CT image reconstruction using floating point graphics hardware,” Phys. Med. Biol. 51, 875–889(2006). [CrossRef]   [PubMed]  

8. S. Zhang, D. Royer, and S. Yau, “GPU-assisted high-resolution, real-time 3-D shape measurement,” Opt. Express 14, 9120–9129 (2006). [CrossRef]   [PubMed]  

9. M. Pharr, ed., GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (Addison-Wesley, New York, 2005).

10. I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: Stream computing on graphics hardware,” ACM. T. Graphic. 23, 777–786 (2004). [CrossRef]  

11. NVIDIA CUDA Compute Unified Device Architecture programming guide version 1.1, (NVIDIA corp. 2007).

12. J. Briers and S. Webster, “Laser speckle contrast analysis (LASCA): a non-scanning, full-field technique for monitoring capillary blood flow,” J. Biomed. Opt. 1, 174–179 (1996). [CrossRef]  

13. A. Dunn, H. Bolay, M. Moskowitz, and D. Boas, “Dynamic Imaging of Cerebral Blood Flow Using Laser Speckle,” J. Cereb. Blood Flow Metab. 21, 195–201 (2001). [CrossRef]   [PubMed]  

14. J. W. Goodman, Statistical Optics (Wiley-Interscience, 1985).

15. H. Cheng and T. Duong, “Simplified laser-speckle-imaging analysis method and its application to retinal blood flow imaging”, Opt. Lett. 32, 2188–2190 (2007). [CrossRef]   [PubMed]  

16. J. Ramirez-San-Juan, R. Ramos-Garcia, I. Guizar-Iturbide, G. Martinez-Niconoff, and B. Choi, “Impact of velocity distribution assumption on simplified laser speckle imaging equation,” Opt. Express 16, 3197–3203 (2008). [CrossRef]   [PubMed]  
